Exercise: Chapter 2 (8-11)

8.

>>> import nltk
>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> cfd = nltk.ConditionalFreqDist(
...     (fileid, name[0])
...     for fileid in names.fileids()
...     for name in names.words(fileid))
>>> cfd.plot()

[Figure: counts of names by first letter, one line each for female.txt and male.txt]

Female names outnumber male names for most initial letters, and by more than double for A, C, J, K, L and M. The only exception I see is W, where male names are the majority. Let’s see which male names start with W.

>>> male_names = names.words('male.txt')
>>> [w for w in male_names if w.startswith('W')]
['Wade', 'Wadsworth', 'Wain', 'Waine', 'Wainwright', 'Wait', 'Waite', 'Waiter', 'Wake', 'Wakefield', 'Wald', 'Waldemar', 'Walden', 'Waldo', 'Waldon', 'Waleed', 'Walker', 'Wallace', 'Wallache', 'Wallas', 'Wallie', 'Wallis', 'Wally', 'Walsh', 'Walt', 'Walter', 'Walther', 'Walton', 'Wang', 'Ward', 'Warde', 'Warden', 'Ware', 'Waring', 'Warner', 'Warren', 'Wash', 'Washington', 'Wat', 'Waverley', 'Waverly', 'Way', 'Waylan', 'Wayland', 'Waylen', 'Waylin', 'Waylon', 'Wayne', 'Web', 'Webb', 'Weber', 'Webster', 'Weidar', 'Weider', 'Welbie', 'Welby', 'Welch', 'Wells', 'Welsh', 'Wendall', 'Wendel', 'Wendell', 'Werner', 'Wes', 'Wesley', 'Weslie', 'West', 'Westbrook', 'Westbrooke', 'Westleigh', 'Westley', 'Weston', 'Weylin', 'Wheeler', 'Whit', 'Whitaker', 'Whitby', 'Whitman', 'Whitney', 'Whittaker', 'Wiatt', 'Wilber', 'Wilbert', 'Wilbur', 'Wilburn', 'Wilburt', 'Wilden', 'Wildon', 'Wilek', 'Wiley', 'Wilfred', 'Wilfrid', 'Wilhelm', 'Will', 'Willard', 'Willdon', 'Willem', 'Willey', 'Willi', 'William', 'Willie', 'Willis', 'Willmott', 'Willy', 'Wilmar', 'Wilmer', 'Wilson', 'Wilt', 'Wilton', 'Win', 'Windham', 'Winfield', 'Winford', 'Winfred', 'Winifield', 'Winn', 'Winnie', 'Winny', 'Winslow', 'Winston', 'Winthrop', 'Winton', 'Wit', 'Witold', 'Wittie', 'Witty', 'Wojciech', 'Wolf', 'Wolfgang', 'Wolfie', 'Wolfram', 'Wolfy', 'Woochang', 'Wood', 'Woodie', 'Woodman', 'Woodrow', 'Woody', 'Worden', 'Worth', 'Worthington', 'Worthy', 'Wright', 'Wyatan', 'Wyatt', 'Wye', 'Wylie', 'Wyn', 'Wyndham', 'Wynn', 'Wynton']
>>> 
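The comparison read off the plot can also be checked numerically. The following is a minimal sketch, not part of the original exercise: the helper names `initial_counts` and `male_majority_initials` are my own, and the sample names are made up for illustration rather than taken from the corpus.

```python
from collections import Counter

def initial_counts(names):
    """Count how many names start with each (upper-cased) letter."""
    return Counter(name[0].upper() for name in names if name)

def male_majority_initials(female_names, male_names):
    """Return the sorted initials for which male names outnumber female names."""
    female = initial_counts(female_names)
    male = initial_counts(male_names)
    # A Counter returns 0 for a missing letter, so no key check is needed.
    return sorted(letter for letter in male if male[letter] > female[letter])

# Toy data (made up for illustration, not taken from the corpus).
female_sample = ['Alice', 'Anna', 'Carol', 'Wendy']
male_sample = ['Adam', 'Walter', 'Wade', 'Will']
print(male_majority_initials(female_sample, male_sample))  # ['W']
```

With the real corpus, the two arguments would be `names.words('female.txt')` and `names.words('male.txt')`.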

9.

Let’s compare text2 (Sense and Sensibility) and text3 (Genesis), both available after running from nltk.book import *.

>>> len(text2)
141576
>>> len(set(text2))
6833
>>> len(text3)
44764
>>> len(set(text3))
2789
>>> from __future__ import division
>>> len(text2) / len(set(text2))
20.719449729255086
>>> len(text3) / len(set(text3))
16.050197203298673
>>> 
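One caveat: text2 is about three times as long as text3, and the tokens-per-type ratio tends to grow with text length, so the two numbers are not directly comparable. A rough sketch of a helper that compares equal-length prefixes follows; the name `tokens_per_type` and its `limit` parameter are my own invention, not something from NLTK.

```python
def tokens_per_type(tokens, limit=None):
    """Average number of occurrences per distinct token.

    If limit is given, only the first `limit` tokens are used, so that
    texts of different lengths can be compared on an equal footing.
    """
    sample = list(tokens)
    if limit is not None:
        sample = sample[:limit]
    if not sample:
        return 0.0
    # float() keeps the division exact on Python 2 as well.
    return len(sample) / float(len(set(sample)))

# Toy token list, made up for illustration.
toy = ['a', 'b', 'a', 'c']
print(tokens_per_type(toy))           # 4 tokens / 3 types
print(tokens_per_type(toy, limit=2))  # 2 tokens / 2 types
```

For the texts above, something like `tokens_per_type(text2, limit=len(text3))` would compare the first 44764 tokens of text2 with all of text3.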

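Next, let’s extract words that may be interesting: words longer than five characters, lowercased, with the 50 most frequent listed in alphabetical order.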

>>> fdist2 = FreqDist([w.lower() for w in text2 if len(w) > 5])
>>> fdist3 = FreqDist([w.lower() for w in text3 if len(w) > 5])

>>> sorted(fdist2.items()[:50])
[('affection', 79), ('almost', 85), ('always', 124), ('another', 79), ('barton', 89), ('before', 201), ('believe', 91), ('better', 78), ('brandon', 144), ('brother', 83), ('cannot', 89), ('colonel', 176), ('dashwood', 252), ('edward', 263), ('elinor', 685), ('enough', 103), ('family', 83), ('feelings', 73), ('ferrars', 130), ('herself', 255), ('himself', 114), ('however', 155), ('immediately', 72), ('indeed', 100), ('jennings', 230), ('letter', 76), ('little', 160), ('marianne', 566), ('middleton', 102), ('moment', 99), ('morning', 87), ('mother', 258), ('myself', 103), ('nothing', 189), ('palmer', 77), ('perhaps', 87), ('present', 84), ('rather', 74), ('really', 86), ('replied', 102), ('seemed', 91), ('should', 236), ('sister', 282), ('something', 75), ('therefore', 93), ('though', 216), ('thought', 116), ('together', 73), ('willoughby', 216), ('without', 174)]
>>> sorted(fdist3.items()[:50])
[('abimelech', 24), ('abraham', 129), ('according', 32), ('against', 23), ('because', 65), ('before', 91), ('behold', 118), ('between', 25), ('blessed', 46), ('brethren', 80), ('brother', 91), ('brought', 60), ('called', 98), ('camels', 22), ('canaan', 43), ('cattle', 50), ('children', 62), ('commanded', 18), ('conceived', 22), ('covenant', 26), ('daughter', 46), ('daughters', 53), ('famine', 24), ('father', 198), ('flocks', 23), ('ground', 24), ('heaven', 29), ('himself', 22), ('hundred', 61), ('israel', 40), ('joseph', 157), ('little', 20), ('master', 29), ('morning', 20), ('mother', 28), ('multiply', 18), ('people', 35), ('pharaoh', 90), ('rachel', 44), ('rebekah', 29), ('saying', 79), ('servant', 41), ('servants', 37), ('should', 22), ('sister', 21), ('surely', 20), ('therefore', 47), ('toward', 19), ('waters', 32), ('wherefore', 19)]
>>> 

Are there any common words? Yes: “before”, “brother”, “himself”, “little”, “morning”, “mother”, “should”, “sister” and “therefore” appear in both lists. Let me pick one of them, “little”.
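The overlap between two top-50 lists can also be computed instead of spotted by eye. This is a sketch using only `collections.Counter`; the function names `top_long_words` and `shared_top_words` are hypothetical. As far as I know, on NLTK 3 `FreqDist` is a `Counter` subclass, so `fdist2.most_common(50)` would be the equivalent of the `fdist2.items()[:50]` call used in the session above.

```python
from collections import Counter

def top_long_words(tokens, n=50, min_len=6):
    """Set of the n most frequent lower-cased tokens with at least min_len characters."""
    counts = Counter(w.lower() for w in tokens if len(w) >= min_len)
    return set(word for word, _ in counts.most_common(n))

def shared_top_words(tokens_a, tokens_b, n=50, min_len=6):
    """Alphabetical list of words that are in the top n of both token lists."""
    return sorted(top_long_words(tokens_a, n, min_len) &
                  top_long_words(tokens_b, n, min_len))

# Toy token lists, made up for illustration.
a = "mother sister little mother garden the".split()
b = "little father mother little and".split()
print(shared_top_words(a, b))  # ['little', 'mother']
```

With the real data the call would be `shared_top_words(text2, text3)`.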

>>> text2.concordance("little")
Displaying 25 of 160 matches:
ld spare so considerable a sum with little inconvenience ."-- He thought of it
 present , of shewing them with how little attention to the comfort of other p
unds from the fortune of their dear little boy would be impoverishing him to t
he to ruin himself , and their poor little Harry , by giving away all his mone
 , it could be restored to our poor little boy --" " Why , to be sure ," said 
ch occasions , do too much than too little . No one , at least , can think I h
comfortable ." His wife hesitated a little , however , in giving her consent t
ommodate her as far as I can . Some little present of furniture too may be acc
oved him , to hear him read with so little sensibility . Mama , the more I kno
ore than it is . In my heart I feel little -- scarcely any doubt of his prefer
et above , will make it a very snug little cottage . I could wish the stairs w
ith her their eldest child , a fine little boy about six years old , by which 
mansion which , by reminding them a little of Norland , interested their imagi
carrying her into the house with so little previous formality , there was a ra
red fellow , and has got the nicest little black bitch of a pointer I ever saw
u , Miss Dashwood ; he has a pretty little estate of his own in Somersetshire 
 . " I remember last Christmas at a little hop at the park , he danced from ei
g him as much as ever ." CHAPTER 11 Little had Mrs . Dashwood or her daughters
 constant visitors as to leave them little leisure for serious employment . Ye
n being more silent . Elinor needed little observation to perceive that her re
enced in sitting at home ;-- and so little did her presence add to the pleasur
 her place , would not have done so little . The whole story would have been s
eiving such a present from a man so little , or at least so lately known to he
warmly , " in supposing I know very little of Willoughby . I have not known hi
om Willoughby . Of John I know very little , though we have lived together for
>>> text3.concordance("little")
Displaying 20 of 20 matches:
 I pray thee , from thy serva Let a little water , I pray you , be fetched , a
 is near to flee unto , and it is a little o Oh , let me escape thither , ( is
t me escape thither , ( is it not a little one ?) and my soul shall live . And
id , Let me , I pray thee , drink a little water of thy pitcher . And she said
 to her , Give me , I pray thee , a little water of thy pitcher to drink ; And
thy cattle was with me . For it was little which thou hadst before I came , an
nd all their wealth , and all their little ones , and their wives took they ca
d from Bethel ; and there was but a little way to come to Ephra and Rachel tra
aid unto them , Go again , buy us a little food . And Judah spake unto him , s
, both we , and thou , and also our little ones . I will be surety for him ; o
nd carry down the man a present , a little balm , and a little honey , spices 
n a present , a little balm , and a little honey , spices , and myrrh , nuts ,
an , and a child of his old age , a little one ; and his brother is dead , and
ther said , Go again , and buy us a little food . And we said , We cannot go d
s out of the land of Egypt for your little ones , and for your wives , and bri
ried Jacob their father , and their little ones , and their wives , in the wag
 households , and for food for your little ones . And they said , Thou hast sa
 the way , when yet there was but a little way to come unto Ephra and I buried
, and his father ' s hou only their little ones , and their flocks , and their
not : I will nourish you , and your little ones . And he comforted them , and 
>>> 

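Is there any difference in usage? In Genesis, “little” appears mostly in a few recurring phrases such as “little ones” and “a little water”, while in Sense and Sensibility it modifies a much wider range of nouns and is also used as an adverb of degree (“so little”, “very little”).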
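A concordance is easy to read but hard to summarize. One way to quantify usage is to count which words directly follow “little”. The sketch below is my own helper (`following_words` is a hypothetical name), and the sample sentence is made up, loosely modeled on the Genesis lines above.

```python
from collections import Counter

def following_words(tokens, target):
    """Count the lower-cased tokens that immediately follow `target`."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

# Made-up sample, not a quotation from either text.
sample = "their little ones and a little water and the little ones".split()
print(following_words(sample, "little").most_common(2))  # [('ones', 2), ('water', 1)]
```

Running `following_words(text2, "little")` and `following_words(text3, "little")` would give one frequency table per text to compare.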

10.

link: http://news.bbc.co.uk/2/hi/uk_news/education/6173441.stm

For comparison, let’s use text5 (chat corpus).

>>> fdist5 = FreqDist([w.lower() for w in text5])
>>> len(text5)
45010
>>> fdist5.plot(100, cumulative=True)

[Figure: cumulative frequency plot of the 100 most frequent tokens in text5]

One-third of the tokens is about 15,000 (45010 / 3 ≈ 15,003). Let’s narrow the range down to the top 30.

>>> fdist5.plot(30, cumulative=True)

[Figure: cumulative frequency plot of the 30 most frequent tokens in text5]

The cumulative count reaches 15,000 around the 27th or 28th word. Note that non-word tokens such as “.” and “!” are also included in the counts. Can we still call it a problem if the top 20 words alone account for one-third of all words?
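Instead of reading the rank off the plot, the same figure can be computed directly. This is a minimal sketch with my own names (`words_needed_for_share`, `skip_punctuation`); the punctuation filter is a simple assumption that drops every token without a letter or digit.

```python
from collections import Counter

def words_needed_for_share(tokens, share=1.0 / 3, skip_punctuation=False):
    """How many of the most frequent tokens are needed to cover `share` of all tokens."""
    lowered = [t.lower() for t in tokens]
    if skip_punctuation:
        # Assumption: a token counts as a word if it has any letter or digit.
        lowered = [t for t in lowered if any(ch.isalnum() for ch in t)]
    counts = Counter(lowered)
    threshold = share * sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= threshold:
            return rank
    return 0  # empty input

# Toy tokens, made up for illustration.
toy = ['.'] * 4 + ['a', 'b', 'c', 'd']
print(words_needed_for_share(toy, share=0.5))                         # 1
print(words_needed_for_share(toy, share=0.5, skip_punctuation=True))  # 2
```

Calling `words_needed_for_share(text5)` with and without `skip_punctuation=True` would show how much the punctuation tokens affect the result.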

11.

Let’s use slightly different examples from the textbook.

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['adventure', 'editorial', 'fiction', 'government', 'mystery']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
            can could  may might must will
 adventure   46  151    5   58   27   50
 editorial  121   56   74   39   53  233
   fiction   37  166    8   44   55   52
government  117   38  153   13  102  244
   mystery   42  141   13   57   30   20
>>> 

One interesting thing, from my point of view, is that “could” and “might” are used more frequently in adventure, fiction and mystery. Do these genres deal with more uncertain things? On the other hand, the present-tense modals (can, may, must) are used heavily in government. Perhaps citizens reading government announcements do not like “uncertain” things.
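Raw counts depend on how large each genre is, so it may help to look at proportions. The sketch below takes the counts from the table above and computes, per genre, the share of “could” plus “might” among the six modals. The name `tentative_share` and the choice of these two words as the “uncertain” group are my own assumptions.

```python
# Counts copied from the cfd.tabulate() output above.
modal_counts = {
    'adventure':  {'can': 46,  'could': 151, 'may': 5,   'might': 58, 'must': 27,  'will': 50},
    'editorial':  {'can': 121, 'could': 56,  'may': 74,  'might': 39, 'must': 53,  'will': 233},
    'fiction':    {'can': 37,  'could': 166, 'may': 8,   'might': 44, 'must': 55,  'will': 52},
    'government': {'can': 117, 'could': 38,  'may': 153, 'might': 13, 'must': 102, 'will': 244},
    'mystery':    {'can': 42,  'could': 141, 'may': 13,  'might': 57, 'must': 30,  'will': 20},
}

def tentative_share(counts, tentative=('could', 'might')):
    """Fraction of the listed modal counts that belongs to the `tentative` words."""
    total = sum(counts.values())
    return sum(counts[m] for m in tentative) / float(total)

for genre in sorted(modal_counts):
    print('%-10s %.2f' % (genre, tentative_share(modal_counts[genre])))
```

Adventure, fiction and mystery come out at roughly 0.6, while editorial and government stay below 0.2, which is consistent with the impression from the raw table.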