Processing RSS Feeds (3.1.4) / Reading Local Files (3.1.5) and more

Let’s resume from chapter 3.1.4 of the whale book.

For RSS feed processing, the Feed Parser library is introduced. However, feedparser.org did not exist when I tried to access it; the only site I could find was this one:

http://code.google.com/p/feedparser/

Although you can download the files from there, the easiest way to install for beginners like me is to use “easy_install”.

$ sudo easy_install feedparser
Searching for feedparser
Reading http://pypi.python.org/simple/feedparser/
Reading https://code.google.com/p/feedparser/
Reading http://code.google.com/p/feedparser/
Best match: feedparser 5.1.3
Downloading http://feedparser.googlecode.com/files/feedparser-5.1.3.zip
Processing feedparser-5.1.3.zip
Running feedparser-5.1.3/setup.py -q bdist_egg --dist-dir /tmp/easy_install-dHOUCF/feedparser-5.1.3/egg-dist-tmp-fb0a2T
zip_safe flag not set; analyzing archive contents...
Adding feedparser 5.1.3 to easy-install.pth file

Installed /Library/Python/2.7/site-packages/feedparser-5.1.3-py2.7.egg
Processing dependencies for feedparser
Finished processing dependencies for feedparser

Then import the necessary tools again.

$ python
Python 2.7.4 (v2.7.4:026ee0057e2d, Apr  6 2013, 11:43:10) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import division
>>> import nltk, re, pprint
>>> 

Let’s continue.

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u'A reprieve for DARE'
>>> content = post.content[0].value
>>> content[:70]
u'<p>A month ago, I posted an "<a href="http://languagelog.ldc.upenn.edu'

The results are different from the sample in the textbook, presumably because the feed contents have changed since the book was written.
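One way to see what the feed currently contains is to print every entry title (a quick sketch reusing the same llog object; output omitted since it changes over time):

>>> for e in llog.entries:
...     print e.title
... 

Anyway, let’s continue…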

>>> nltk.word_tokenize(nltk.html_clean(content))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'html_clean'

What’s wrong? There is a misprint in the textbook (in the Japanese version): the correct function is clean_html(), not html_clean().

>>> nltk.word_tokenize(nltk.clean_html(content))
[u'A', u'month', u'ago', u',', u'I', u'posted', u'an', u'``', u'SOS', u'for', u'DARE', u',', u"''", u'detailing', u'the', u'impending', u'financial', u'threat', u'faced', u'by', u'the', u'Dictionary', u'of',
....
u'Here', u"'s", u'some', u'local', u'coverage', u'of', u'the', u'news', u',', u'from', u'WKOW', u'Madison', u':']
>>> 

Now we get the same result as the book except for the Unicode conversion (the u'' prefixes), which will be explained in chapter 3.3.
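The u'' prefix just means the strings are unicode objects; if needed, they can be encoded back to plain str (a quick sketch, not from the textbook):

>>> llog['feed']['title']
u'Language Log'
>>> llog['feed']['title'].encode('utf-8')
'Language Log'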

Reading Local Files (chapter 3.1.5):

First, I created a text file under /Users/xxx/Documents/workspace/NLTK Learning/text files. Let’s open the file.

>>> f = open('document.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'document.txt'

No worries; I expected this error, and the reason is clear: the file is not in the current working directory.


>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/Library/Python/2.7/site-packages/pip-1.3.1-py2.7.egg', '/Library/Python/2.7/site-packages/ipython-0.13.2-py2.7.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/Library/Python/2.7/site-packages/feedparser-5.1.3-py2.7.egg', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages', '/Library/Python/2.7/site-packages']
>>> 

(Strictly speaking, sys.path is Python’s module search path; what matters for open() is the current working directory.) In any case, specifying the full path of the file should work.

>>> f = open('/Users/ken/Documents/workspace/NLTK Learning/text files/document.txt')
>>> raw = f.read()
>>> print raw
I was born in Tokyo, Japan in mid of 1970's. After graduated university, I have been working as an IT engineer since then. Now my base is in Shanghai, China. Many of colleagues were surprised when they heard my major at university was Law. That's not so rare cases in Japan, as many of Japanese IT companies were hiring fresh graduates not only from computer science or mathematics area  but also social sciences like law or economics. At that time ---- the end of 1990's, many of IT companies in Japan were eager to hire as many as possible even they need to educate them in the companies. 
>>> 

Some troubleshooting tips for opening files are also introduced. This one checks the files in the current directory.

>>> import os
>>> os.listdir('.')
['#.bashrc#', '.android', '.bash_history', '.bash_profile', '.bash_profile.pysave', '.cache', '.CFUserTextEncoding', '.config', '.DS_Store', '.emacs.d', '.idlerc', '.matplotlib', '.Trash', 'Desktop', 'Development', 'Documents', 'Downloads', 'Library', 'Movies', 'Music', 'nltk_data', 'Pictures', 'Public', 'ScipySuperpack', 'src', 'test.png']
>>> 

Yes, the current directory should be /Users/xxx, so this looks correct.
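A more direct check is os.getcwd(), which shows the working directory that a relative name like 'document.txt' is resolved against (not in the textbook, just the standard library):

>>> import os
>>> os.getcwd()        # prints the current working directory, e.g. '/Users/xxx' here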
I created another file with “returns” (line breaks) in it. In the output below, the \n newline characters are not displayed as such; print renders each of them as an actual line break.

>>> f = open('/Users/ken/Documents/workspace/NLTK Learning/text files/document2.txt')
>>> raw = f.read()
>>> print raw
I was born in Tokyo, Japan in mid of 1970's. 
After graduated university, I have been working as an IT engineer since then. Now my base is in Shanghai, China. 
Many of colleagues were surprised when they heard my major at university was Law. That's not so rare cases in Japan, as many of Japanese IT companies were hiring fresh graduates not only from computer science or mathematics area  but also social sciences like law or economics. 
At that time ---- the end of 1990's, many of IT companies in Japan were eager to hire as many as possible even they need to educate them in the companies. 
>>> 

This is an example of printing the file line by line.

>>> f = open('/Users/ken/Documents/workspace/NLTK Learning/text files/document2.txt', 'rU')
>>> for line in f:
...     print line.strip()
... 
I was born in Tokyo, Japan in mid of 1970's.
After graduated university, I have been working as an IT engineer since then. Now my base is in Shanghai, China.
Many of colleagues were surprised when they heard my major at university was Law. That's not so rare cases in Japan, as many of Japanese IT companies were hiring fresh graduates not only from computer science or mathematics area  but also social sciences like law or economics.
At that time ---- the end of 1990's, many of IT companies in Japan were eager to hire as many as possible even they need to educate them in the companies.
>>> 

‘rU’ is given as the second parameter of open(): ‘r’ means open the file for reading, and ‘U’ means universal newlines, which smooths over the differences between line-ending conventions (\n, \r\n, \r).
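A small experiment (my own sketch using a throwaway file in /tmp, not from the textbook) shows what ‘U’ does: Windows-style '\r\n' line endings are translated to plain '\n' when the file is read back.

>>> open('/tmp/newline_test.txt', 'wb').write("line one\r\nline two\r\n")
>>> open('/tmp/newline_test.txt', 'rU').read()
'line one\nline two\n'
>>> open('/tmp/newline_test.txt', 'rb').read()
'line one\r\nline two\r\n'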

A corpus in NLTK can also be read as a plain text file, like this.

>>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(path, 'rU'.read()
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'read'
>>> raw = open(path, 'rU').read()
>>> raw
'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar School)\n\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\nnow.  
....

Lots of “\n” characters can be seen.
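That is just how the interactive prompt displays a bare string: evaluating raw shows its repr(), with newlines escaped as \n, while print renders them as real line breaks (a quick comparison, output omitted):

>>> raw[:200]        # repr form: newlines shown as \n
>>> print raw[:200]  # printed form: newlines rendered as actual line breaks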

Getting user input (3.1.7):

>>> s = raw_input("Enter some text: ")
Enter some text: I love Shanghai very much!!!
>>> print "You typed", len(nltk.word_tokenize(s)), "words."
You typed 8 words.
>>> 

NLP Pipeline (3.1.8):

>>> f = open('/Users/ken/Documents/workspace/NLTK Learning/text files/document.txt')
>>> raw = f.read()
>>> type(raw)
<type 'str'>

The file is read with read(); the type of the result is str (which stands for string).


>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> tokens
['I', 'was', 'born', 'in', 'Tokyo', ',', 'Japan', 'in', 'mid', 'of', "1970's.", 'After', 'graduated', 'university', ',', 'I', 'have', 'been', 'working', 'as', 'an', 'IT', 'engineer', 'since', 'then.', 'Now', 'my', 'base', 'is', 'in', 'Shanghai', ',', 'China.', 'Many', 'of', 'colleagues', 'were', 'surprised', 'when', 'they', 'heard', 'my', 'major', 'at', 'university', 'was', 'Law.', 'That', "'s", 'not', 'so', 'rare', 'cases', 'in', 'Japan', ',', 'as', 'many', 'of', 'Japanese', 'IT', 'companies', 'were', 'hiring', 'fresh', 'graduates', 'not', 'only', 'from', 'computer', 'science', 'or', 'mathematics', 'area', 'but', 'also', 'social', 'sciences', 'like', 'law', 'or', 'economics.', 'At', 'that', 'time', '--', '--', 'the', 'end', 'of', '1990', "'s", ',', 'many', 'of', 'IT', 'companies', 'in', 'Japan', 'were', 'eager', 'to', 'hire', 'as', 'many', 'as', 'possible', 'even', 'they', 'need', 'to', 'educate', 'them', 'in', 'the', 'companies', '.']
>>> 

The text (str) is tokenized, i.e., split into words. tokens is of type list.

>>> words = [w.lower() for w in tokens]
>>> type(words)
<type 'list'>
>>> words
['i', 'was', 'born', 'in', 'tokyo', ',', 'japan', 'in', 'mid', 'of', "1970's.", 'after', 'graduated', 'university', ',', 'i', 'have', 'been', 'working', 'as', 'an', 'it', 'engineer', 'since', 'then.', 'now', 'my', 'base', 'is', 'in', 'shanghai', ',', 'china.', 'many', 'of', 'colleagues', 'were', 'surprised', 'when', 'they', 'heard', 'my', 'major', 'at', 'university', 'was', 'law.', 'that', "'s", 'not', 'so', 'rare', 'cases', 'in', 'japan', ',', 'as', 'many', 'of', 'japanese', 'it', 'companies', 'were', 'hiring', 'fresh', 'graduates', 'not', 'only', 'from', 'computer', 'science', 'or', 'mathematics', 'area', 'but', 'also', 'social', 'sciences', 'like', 'law', 'or', 'economics.', 'at', 'that', 'time', '--', '--', 'the', 'end', 'of', '1990', "'s", ',', 'many', 'of', 'it', 'companies', 'in', 'japan', 'were', 'eager', 'to', 'hire', 'as', 'many', 'as', 'possible', 'even', 'they', 'need', 'to', 'educate', 'them', 'in', 'the', 'companies', '.']
>>> 

Then we convert everything to lowercase; words is still of type list. By the way, in the document I used the word “IT” to mean “information technology”, but after converting to lowercase it can no longer be distinguished from the pronoun “it”.
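If keeping acronyms like “IT” distinguishable mattered, one option (my own sketch, not from the textbook) would be to lowercase only the tokens that are not written entirely in uppercase:

>>> words = [w if w.isupper() else w.lower() for w in tokens]
>>> 'IT' in words, 'tokyo' in words
(True, True)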

>>> vocab = sorted(set(words))
>>> type(vocab)
<type 'list'>
>>> vocab
["'s", ',', '--', '.', "1970's.", '1990', 'after', 'also', 'an', 'area', 'as', 'at', 'base', 'been', 'born', 'but', 'cases', 'china.', 'colleagues', 'companies', 'computer', 'eager', 'economics.', 'educate', 'end', 'engineer', 'even', 'fresh', 'from', 'graduated', 'graduates', 'have', 'heard', 'hire', 'hiring', 'i', 'in', 'is', 'it', 'japan', 'japanese', 'law', 'law.', 'like', 'major', 'many', 'mathematics', 'mid', 'my', 'need', 'not', 'now', 'of', 'only', 'or', 'possible', 'rare', 'science', 'sciences', 'shanghai', 'since', 'so', 'social', 'surprised', 'that', 'the', 'them', 'then.', 'they', 'time', 'to', 'tokyo', 'university', 'was', 'were', 'when', 'working']

This is also easy to understand: vocab is still of type list.

We need to be aware of which type of object we are handling, because some operations only work on specific types.

>>> vocab.append('blog')
>>> raw.append('blog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'append'
>>> query = 'Who knows?'
>>> beatles = ['john', 'paul', 'george', 'ringo']
>>> query + beatles
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'list' objects
>>>
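To actually combine a str and a list, one side has to be converted first, either by tokenizing the string into a list or by joining the list into a string (my own sketch, not from the textbook):

>>> nltk.word_tokenize(query) + beatles
['Who', 'knows', '?', 'john', 'paul', 'george', 'ringo']
>>> query + ' ' + ' '.join(beatles)
'Who knows? john paul george ringo'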