Reuters corpus & Inaugural corpus

O’Reilly’s textbook chapter 2.1.4-2.1.5

The Reuters corpus contains plenty of news documents.

>>> import nltk
>>> import sys
>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', 'test/14844', 'test/14849', 'test/14852', 'test/14854', 'test/14858', 'test/14859', 'test/14860', 
....
'training/9984', 'training/9985', 'training/9988', 'training/9989', 'training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']

Each fileid starts with “test” or “training”. The following code shows that a single fileid can have multiple “categories”.

>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']

>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', 'test/15649', 
...
'training/9058', 'training/9093', 'training/9094', 'training/934', 'training/9470', 'training/9521', 'training/9667', 'training/97', 'training/9865', 'training/9958', 'training/9989']
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

The headline at the beginning of each document is written in upper case.
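Since the headline is just the leading run of all-caps tokens, we can peel it off with itertools.takewhile. This is a quick sketch based on that observation, not something from the textbook:

from itertools import takewhile
from nltk.corpus import reuters

# Collect tokens until the first non-upper-case one; for 'training/9865'
# this should yield the headline seen above.
words = reuters.words('training/9865')
headline = list(takewhile(lambda w: w.isupper(), words))
print ' '.join(headline)  # FRENCH FREE MARKET CEREAL EXPORT BIDS DETAILED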

“Inaugural” was not a common word in my poor English vocabulary, but this is a corpus of the first speech each President of the USA gives on taking office.

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']

Extract the first 4 characters (the year) of each filename.

>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

Then plot, over the years, the counts of words starting with ‘america’ and ‘citizen’.

>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()

[Figure 1: conditional frequency plot of words beginning with ‘america’ and ‘citizen’ across the inaugural addresses]

‘Citizen’ was used most frequently in the speech of 1841. I thought this might be Lincoln, but it was William Henry Harrison (1841-Harrison.txt). Note that this statistic does not take the length of each speech into account.
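Normalizing by speech length is easy to add. Here is a rough, untested sketch that reuses the cfd built above and prints hits per 1,000 words:

# Length of each address, keyed by year, to normalize the raw counts.
lengths = dict((fileid[:4], len(inaugural.words(fileid)))
               for fileid in inaugural.fileids())
for target in ['america', 'citizen']:
    for year in sorted(cfd[target]):
        print target, year, round(1000.0 * cfd[target][year] / lengths[year], 2)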

Various corpora

Coming back to the O’Reilly textbook (Chapter 2.1.2-2.1.3).

Referring to the Webtext corpus:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
... 
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>> 

First get each fileid, then extract the first 65 characters of its raw text. What kind of conversation is happening between the white guy and the Asian girl in overheard.txt???

This one is a corpus of chatroom conversations. Get the 124th post (indexing starts from 0):

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
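Besides the tokenized posts, the underlying XML carries a dialogue-act class for each post, which xml_posts() exposes. A small sketch (the 'class' attribute comes from the NPS Chat XML format):

from nltk.corpus import nps_chat

# Each XML post element has a dialogue-act 'class' attribute plus the raw text.
posts = nps_chat.xml_posts('10-19-20s_706posts.xml')
print posts[123].get('class'), posts[123].text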

By the way, this was not the first time I made a typo when typing the term “fileid”. I know it comes from “file ID” in my head, but written in lower case it looks just like “field”, doesn’t it? In the session below I typed “field” as the loop variable while the body still used “fileid”, which was left over from my earlier Gutenberg session, hence the odd output and error:

>>> for field in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
... 
whitman-leaves.txt
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 73, in raw
    return concat([self.open(f, sourced).read() for f in fileids])
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 187, in open
    stream = self._root.join(file).open(encoding)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 176, in join
    return FileSystemPathPointer(path)
  File "/Library/Python/2.7/site-packages/nltk/data.py", line 154, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Users/ken/nltk_data/corpora/webtext/whitman-leaves.txt'

Brown Corpus:

The entire list of Brown Corpus categories can be found here:
http://icame.uib.no/brown/bcm-los.html

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
>>> 

The fileid “cg22” corresponds to Kenneth Reiner’s “Coping with Runaway Technology”, according to that list.

Display the frequencies of specific words in a category.

>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
... 
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> 
>>> modals = ['what', 'why', 'when', 'which', 'who', 'how']
>>> for m in modals:
...     print m + ':', fdist[m],
... 
what: 95 why: 14 when: 169 which: 245 who: 268 how: 42
>>> 

By using ConditionalFreqDist, we can compare across genres more easily.

>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))

>>> 
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
>>> modals = ['what', 'why', 'when', 'which', 'who', 'how']
>>> cfd.tabulate(conditions=genres, samples=modals)
                what  why when which  who  how
           news   76    9  128  244  268   37
       religion   64   14   53  202  100   23
        hobbies   78   10  119  252  103   40
science_fiction   27    4   21   32   13   12
        romance  121   34  126  104   89   60
          humor   36    9   52   62   48   18
>>> 

Both tables are interesting. As mentioned in the textbook, “will” is most frequent in news; on the other hand, “could” is most frequent in romance. News shows a strong interest in people (“who”), while science_fiction does not. Maybe science fiction is more interested in objects than in people?
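One caveat: the genres differ in size, so raw counts are not directly comparable. A sketch (untested) of the same comparison using relative frequencies per 1,000 words:

import nltk
from nltk.corpus import brown

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for genre in genres:
    words = brown.words(categories=genre)
    fdist = nltk.FreqDist(words)
    # Normalize each count by the genre's total token count.
    print genre, [round(1000.0 * fdist[m] / len(words), 2) for m in modals]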

Replacing words matching regular expressions

It took a long time to move on to this section because of my lack of knowledge.

First I created replacers.py and put it under the root of Python27. However, I got the following error.

>>> from replacer import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named replacer

Teacher Google gave me a solution. Printing sys.path shows where Python searches for modules, and while checking it I noticed the real problem: I had typed the module name without the final “s” (“replacer” instead of “replacers”).

>>> import sys
>>> print sys.path
['', 'C:\\WINDOWS\\system32\\python27.zip', 'C:\\Python27\\DLLs', 'C:\\Python27\
\lib', 'C:\\Python27\\lib\\plat-win', 'C:\\Python27\\lib\\lib-tk', 'C:\\Python27
', 'C:\\Python27\\lib\\site-packages']
>>> from replacers import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "replacers.py", line 19
    def replace(self, repl) in self.patterns:
                             ^
SyntaxError: invalid syntax
>>>

Oops, a syntax error was found. After a couple of rounds of trial and error, I got the following results.

>>> from replacers import *
>>> replacer = RegexpReplacer()
>>> replacer.replace("Can't is a contraction")
'Ca not is a contraction'
>>> replacer.replace("can't is a contraction")
'cannot is a contraction'
>>> replacer.replace("I should've done that thing I didn't do")
'I should have done that thing I did not do'
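For reference, here is what the fixed replacers.py presumably looks like, following the cookbook’s RegexpReplacer recipe (the offending line was “def replace(self, repl) in self.patterns:”). Treat the pattern list as a sketch; yours may differ:

import re

# Contraction patterns: (regular expression, replacement).
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', '\g<1> will'),
    (r'(\w+)n\'t', '\g<1> not'),
    (r'(\w+)\'ve', '\g<1> have'),
    (r'(\w+)\'s', '\g<1> is'),
    (r'(\w+)\'re', '\g<1> are'),
    (r'(\w+)\'d', '\g<1> would'),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl)
                         for (regex, repl) in patterns]

    def replace(self, text):
        # Apply every pattern in turn. Matching is case-sensitive, which is
        # why "Can't" becomes "Ca not" (generic (\w+)n't rule) while "can't"
        # becomes "cannot" (specific rule).
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s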

(I will add some more here when I have time…)

Babelfish Translation from NLTK

As mentioned before, the Babelfish translation service seems to be no longer available.

>>> from nltk.misc import babelfish
>>> babelfish.translate('cookbook', 'english', 'spanish')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\misc\babelfish.py", line 100, in translate
    raise BabelizerIOError("Couldn't talk to server: %s" % what)
nltk.misc.babelfish.BabelizerIOError: Couldn't talk to server: [Errno socket error] [Errno 11004] getaddrinfo failed
>>>
>>> babelfish.available_languages
['Chinese', 'English', 'French', 'German', 'Greek', 'Italian', 'Japanese', 'Korean', 'Portuguese', 'Russian', 'Spanish']

Instead, I found the source code at http://nltk.org/_modules/nltk/misc/babelfish.html:

...
    try:
        response = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', params)
    except IOError as what:
        raise BabelizerIOError("Couldn't talk to server: %s" % what)
...

The URL no longer exists, so the IOError is caught and re-raised as the error message shown above.

Stemming and lemmatization

Stemming is a technique for removing affixes from a word, ending up with the stem.
I didn’t know the meaning of the words “affix” and “stem”, but there is an example in the textbook: the stem of “cooking” is “cook”, and “ing” is the suffix.

The Porter stemming algorithm is one of the most common stemming algorithms.

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'
>>> stemmer.stem('kickers')
'kicker'
>>> stemmer.stem('books')
'book'
>>> stemmer.stem('said')
'said'
>>> stemmer.stem('feet')
'feet'

Some words did not stem as I expected… It seems to handle only simple cases well, like removing ‘ing’ or ‘s’. Note that a stem does not have to be a dictionary word, which is why ‘cookery’ becomes ‘cookeri’.

Another one is the LancasterStemmer.

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('feet')
'feet'
>>> stemmer.stem('books')
'book'
>>> stemmer.stem('brought')
'brought'

Only this one worked differently from the textbook.

>>> stemmer.stem('ingleside')
'inglesid'

The SnowballStemmer supports 13 non-English languages.

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

This example uses Spanish.

>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'

The return value is a Unicode string, as shown by the “u” prefix.

Lemmatizing

Lemmatization is similar to synonym replacement. A lemma is a root word, as opposed to a root stem, so the result is always a real word.

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
>>> lemmatizer.lemmatize('cookbooks')
'cookbook'
>>> lemmatizer.lemmatize('brought', pos='v')
'bring'
>>> lemmatizer.lemmatize('brought')
'brought'

The WordNetLemmatizer refers to the WordNet corpus and uses the morphy() function of the WordNetCorpusReader.
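To see this, morphy() can be called directly; a quick sketch whose outputs should match the lemmatizer results above:

from nltk.corpus import wordnet

# morphy() is the WordNet lookup the lemmatizer delegates to.
print wordnet.morphy('cooking', wordnet.VERB)  # expect 'cook'
print wordnet.morphy('cookbooks')              # expect 'cookbook'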

This is a comparison between stemming and lemmatizing.

>>> stemmer = PorterStemmer()
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
>>> stemmer.stem('buses')
'buse'
>>> lemmatizer.lemmatize('busses')
'bus'
>>> stemmer.stem('bus')
'bu'

I am still not clear in which cases we should prefer stemming at this moment. Since a stem need not be a real word, stemming seems suited to indexing and search, while lemmatization suits cases where the output must be a dictionary word.

Access to text corpora

Now I start reading Chapter 2.1.1.

[code language="python"]
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concorance("surprize")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Text' object has no attribute 'concorance'
[/code]

What’s wrong? Oops, it was just a typo…

[code language="python"]
>>> emma.concordance("surprize")
Building index...
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
>>>

[/code]

Another way to import:

[code language="language="python'"]
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>>
[/code]

Get various information from the texts.

[code language="python"]

>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid
...
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
>>>
[/code]

This needs some explanation. The first column of the output is calculated as:

Number of characters / Number of words

So the value is the average word length. Note that the whitespace following each word is also counted in the number of characters, so we should subtract 1 from the value (the average word length is actually 3). The next one is:

Number of words / Number of sentences

Yes, the average number of words per sentence. The last one:

Number of words / Number of distinct words (vocabulary size)

This shows how many times each word is used in the text on average, a measure of lexical diversity.

raw() accesses the raw contents of the file instead of splitting them into tokens.
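For example (a small sketch; the Gutenberg files begin with a bracketed title line):

[code language="python"]
# raw() returns characters; words() returns tokens.
from nltk.corpus import gutenberg
print gutenberg.raw('austen-emma.txt')[:26]   # '[Emma by Jane Austen 1816]'
print gutenberg.words('austen-emma.txt')[:7]  # expect ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
[/code]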

The following example is to get sentences.

[code language="python"]

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1037]
['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
>>> macbeth_sentences[2219]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 264, in __getitem__
    raise IndexError('index out of range')
IndexError: index out of range
>>> len(macbeth_sentences)
1907
>>> macbeth_sentences[1587]
['who', 'knowes', 'it', ',', 'when', 'none', 'can', 'call', 'our', 'powre', 'to', 'accompt', ':', 'yet', 'who', 'would', 'haue', 'thought', 'the', 'olde', 'man', 'to', 'haue', 'had', 'so', 'much', 'blood', 'in', 'him']
>>>
>>> longest_len = max([len(s) for s in macbeth_sentences])
>>> [s for s in macbeth_sentences if len(s) == longest_len]
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
>>>

[/code]

As shown, each sentence is represented as a list []. The last piece of code finds the longest sentence in the text.

O’Reilly: Chapter 1 Exercise 23-29

23.

>>> for w in [w for w in text6 if w.isupper()]:
...     print w
...
....
ARTHUR
GALAHAD
KNIGHTS
TIM
ROBIN
ARTHUR
ROBIN
GALAHAD
ARTHUR
ROBIN
KNIGHTS
ARTHUR
TIM
I
KNIGHTS
....

24.

At the beginning, I thought the requirement was to extract words that meet all the conditions at once, but I could not find any. So I ran the conditions one by one.
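For the record, the all-at-once check would just chain the conditions with and. It has to return an empty list, because (as the first query below shows) no word ends in ‘ize’ at all:

# Exercise 24's four conditions combined (sketch; must return []).
[w for w in text6 if w.endswith('ize') and 'z' in w and 'pt' in w and w.istitle()]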

>>> [w for w in text6 if w.endswith('ize')]
[]
>>> [w for w in text6 if 'z' in w]
['zone', 'amazes', 'Fetchez', 'Fetchez', 'zoop', 'zoo', 'zhiv', 'frozen', 'zoosh']
>>> [w for w in text6 if 'pt' in w]
['empty', 'aptly', 'Thpppppt', 'Thppt', 'Thppt', 'empty', 'Thppppt', 'temptress', 'temptation', 'ptoo', 'Chapter', 'excepting', 'Thpppt']
>>> set([w for w in text6 if (w.istitle() and len(w) > 1)])
set(['Welcome', 'Winter', 'Lead', 'Uugh', 'Does', 'Saint', 'Until', 'Today', 'Thou', 'Burn', 'Lucky', 'Uhh', 'Not', 'Now', 'Twenty', 'Where', 'Just', 'Course', 'Go', 'Erbert', 'Uther', 'Actually', 'Cherries', 'Thpppt', 'Bloody', 'Aramaic', 'Mmm', 'Put', 'Haw', 'True', 'Pull', 'Fiends', 'Agh', 'Yup', 'We', 'Arthur', 'Zoot', 'English', 'Alright', 'My', 'Silence', 'Clark', 'Bedevere', 'Bors', 'Back', 'Maynard', 'Fetchez', 'Seek', 'Exactly', 'Doctor', 'Rather', 'When', 'Three', 'Providence', 'Book', 'Therefore', 'Huh', 'Stay', 'Umhm', 'Aaaaaaaah', 'Huy', 'Those', 'Dingo', 'Cider', 'Chop', 'Aauuugh', 'So', 'Found', 'Guy', 'Oui',
....
'Camelot', 'Aagh', 'Britain', 'Joseph', 'Badon', 'Sir', 'Hoa', 'Perhaps', 'Hoo', 'Saxons', 'Lake', 'Thursday', 'To', 'Shall', 'May', 'Never', 'Eternal', 'As', 'Cornwall', 'Running', 'Five', 'Gorge', 'Lady', 'Man', 'Great', 'Like', 'Yeaah', 'Remove', 'Swamp', 'Heee', 'Ah', 'Am', 'Yeah', 'An', 'Bravely', 'Allo', 'At', 'Ay', 'Roger', 'Chicken'])
>>>

Although no words were found for the first condition, there are some that end with ‘ise’ instead of ‘ize’.

>>> [w for w in text6 if w.endswith('ise')]
['wise', 'wise', 'apologise', 'surprise', 'surprise', 'surprise', 'noise', 'surprise']

25.

>>> sentx = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']
>>> sentx
['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']
>>> for w in [w for w in sentx if w.startswith('sh')]:
...     print w
...
she
shells
shore
>>> for w in [w for w in sentx if len(w) > 4]:
...     print w
...
sells
shells
shore
>>>

26.

>>> sum([len(w) for w in text1])
999044

This code sums up the length (“len”) of each word in text1. The result is the total number of characters in the tokens of text1 (whitespace not included).

27.

>>> def vocab_size(text):
...     return len(set(text))
...
>>> vocab_size(text1)
19317
>>> len(set(text1))
19317

28.

>>> def percent(word, text):
...     wf = text.count(word)
...     pt = wf * 100.0 / len(text)
...     rt = []
...     rt.append(wf)
...     rt.append(pt)
...     return rt
...
>>> percent('whale', text1)
[906, 0.3473673313677301]

Not beautiful. I know I don’t have a good sense of coding…
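A slightly tidier sketch of the same idea, returning the pair directly instead of building a list:

def percent(word, text):
    # Count the word, then express the count as a percentage of all tokens.
    count = text.count(word)
    return count, 100.0 * count / len(text)

# percent('whale', text1) should give (906, 0.347...) as above.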

29.

>>> set(sent3) < set(text1)
True
>>> set(text3) < set(text1)
False
>>> len(set(text3))
2789
>>> len(set(text1))
19317

Until now, whenever I used set() in a comparison, I combined it with len(). I assumed these expressions just compare lengths, but that cannot explain the second one: set(text3) < set(text1) is False even though set(text3) is much smaller.

According to the Python library reference (http://docs.python.org/2/library/sets.html), sets support set-to-set comparison: s < t tests whether s is a proper subset of t. Based on this, the first expression checks that set(text1) includes every element of set(sent3), and therefore returns True.

>>> set(sent3) < set(text1)
True

The next one checks whether set(text1) includes all elements of set(text3).

>>> set(text3) < set(text1)
False

It returns False because at least one element of set(text3) is not included in set(text1).

This will be very helpful when comparing two sets.
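To see exactly which words cause the False, set difference would list the elements of set(text3) that never occur in set(text1). A sketch (untested):

# Words that appear in text3 but never in text1.
missing = set(text3) - set(text1)
print len(missing)
print sorted(missing)[:10]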