8.
>>> import nltk
>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> cfd = nltk.ConditionalFreqDist(
...     (fileid, name[0])
...     for fileid in names.fileids()
...     for name in names.words(fileid))
>>> cfd.plot()
Female names outnumber male names for most initial letters, and by more than double for A, C, J, K, L and M. The only exception is W, where male names are the majority. Let’s see what kind of names are listed.
>>> male_names = names.words('male.txt')
>>> [w for w in male_names if w.startswith('W')]
['Wade', 'Wadsworth', 'Wain', 'Waine', 'Wainwright', 'Wait', 'Waite', 'Waiter', 'Wake', 'Wakefield', 'Wald', 'Waldemar', 'Walden', 'Waldo', 'Waldon', 'Waleed', 'Walker', 'Wallace', 'Wallache', 'Wallas', 'Wallie', 'Wallis', 'Wally', 'Walsh', 'Walt', 'Walter', 'Walther', 'Walton', 'Wang', 'Ward', 'Warde', 'Warden', 'Ware', 'Waring', 'Warner', 'Warren', 'Wash', 'Washington', 'Wat', 'Waverley', 'Waverly', 'Way', 'Waylan', 'Wayland', 'Waylen', 'Waylin', 'Waylon', 'Wayne', 'Web', 'Webb', 'Weber', 'Webster', 'Weidar', 'Weider', 'Welbie', 'Welby', 'Welch', 'Wells', 'Welsh', 'Wendall', 'Wendel', 'Wendell', 'Werner', 'Wes', 'Wesley', 'Weslie', 'West', 'Westbrook', 'Westbrooke', 'Westleigh', 'Westley', 'Weston', 'Weylin', 'Wheeler', 'Whit', 'Whitaker', 'Whitby', 'Whitman', 'Whitney', 'Whittaker', 'Wiatt', 'Wilber', 'Wilbert', 'Wilbur', 'Wilburn', 'Wilburt', 'Wilden', 'Wildon', 'Wilek', 'Wiley', 'Wilfred', 'Wilfrid', 'Wilhelm', 'Will', 'Willard', 'Willdon', 'Willem', 'Willey', 'Willi', 'William', 'Willie', 'Willis', 'Willmott', 'Willy', 'Wilmar', 'Wilmer', 'Wilson', 'Wilt', 'Wilton', 'Win', 'Windham', 'Winfield', 'Winford', 'Winfred', 'Winifield', 'Winn', 'Winnie', 'Winny', 'Winslow', 'Winston', 'Winthrop', 'Winton', 'Wit', 'Witold', 'Wittie', 'Witty', 'Wojciech', 'Wolf', 'Wolfgang', 'Wolfie', 'Wolfram', 'Wolfy', 'Woochang', 'Wood', 'Woodie', 'Woodman', 'Woodrow', 'Woody', 'Worden', 'Worth', 'Worthington', 'Worthy', 'Wright', 'Wyatan', 'Wyatt', 'Wye', 'Wylie', 'Wyn', 'Wyndham', 'Wynn', 'Wynton']
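Reading the exception off the plot works, but the same check can be done numerically. The following is a minimal sketch for Python 3, not part of the original session: the helper names `initial_counts` and `male_majority_initials` are hypothetical, and the two short name lists are made-up samples so the snippet runs without the corpus.

```python
from collections import Counter

def initial_counts(names):
    """Count names by their first letter (upper-cased)."""
    return Counter(name[0].upper() for name in names if name)

def male_majority_initials(female_names, male_names):
    """Return the initials for which male names outnumber female names."""
    female = initial_counts(female_names)
    male = initial_counts(male_names)
    # A Counter returns 0 for letters it has never seen.
    return sorted(letter for letter in male if male[letter] > female[letter])

# Hypothetical sample data, not the real corpus contents.
sample_female = ['Alice', 'Amy', 'Wendy']
sample_male = ['Adam', 'Walter', 'Wayne']
print(male_majority_initials(sample_female, sample_male))
```

With the corpus installed, passing `names.words('female.txt')` and `names.words('male.txt')` instead of the sample lists should give the letters to compare against the plot.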
9.
Try to compare text2 and text3.
>>> len(text2)
141576
>>> len(set(text2))
6833
>>> len(text3)
44764
>>> len(set(text3))
2789
>>> from __future__ import division
>>> len(text2) / len(set(text2))
20.719449729255086
>>> len(text3) / len(set(text3))
16.050197203298673
Next, try to extract words that may be interesting: words longer than five characters, lower-cased.
>>> fdist2 = FreqDist([w.lower() for w in text2 if len(w) > 5])
>>> fdist3 = FreqDist([w.lower() for w in text3 if len(w) > 5])
>>> sorted(fdist2.items()[:50])
[('affection', 79), ('almost', 85), ('always', 124), ('another', 79), ('barton', 89), ('before', 201), ('believe', 91), ('better', 78), ('brandon', 144), ('brother', 83), ('cannot', 89), ('colonel', 176), ('dashwood', 252), ('edward', 263), ('elinor', 685), ('enough', 103), ('family', 83), ('feelings', 73), ('ferrars', 130), ('herself', 255), ('himself', 114), ('however', 155), ('immediately', 72), ('indeed', 100), ('jennings', 230), ('letter', 76), ('little', 160), ('marianne', 566), ('middleton', 102), ('moment', 99), ('morning', 87), ('mother', 258), ('myself', 103), ('nothing', 189), ('palmer', 77), ('perhaps', 87), ('present', 84), ('rather', 74), ('really', 86), ('replied', 102), ('seemed', 91), ('should', 236), ('sister', 282), ('something', 75), ('therefore', 93), ('though', 216), ('thought', 116), ('together', 73), ('willoughby', 216), ('without', 174)]
>>> sorted(fdist3.items()[:50])
[('abimelech', 24), ('abraham', 129), ('according', 32), ('against', 23), ('because', 65), ('before', 91), ('behold', 118), ('between', 25), ('blessed', 46), ('brethren', 80), ('brother', 91), ('brought', 60), ('called', 98), ('camels', 22), ('canaan', 43), ('cattle', 50), ('children', 62), ('commanded', 18), ('conceived', 22), ('covenant', 26), ('daughter', 46), ('daughters', 53), ('famine', 24), ('father', 198), ('flocks', 23), ('ground', 24), ('heaven', 29), ('himself', 22), ('hundred', 61), ('israel', 40), ('joseph', 157), ('little', 20), ('master', 29), ('morning', 20), ('mother', 28), ('multiply', 18), ('people', 35), ('pharaoh', 90), ('rachel', 44), ('rebekah', 29), ('saying', 79), ('servant', 41), ('servants', 37), ('should', 22), ('sister', 21), ('surely', 20), ('therefore', 47), ('toward', 19), ('waters', 32), ('wherefore', 19)]
Are there any words in common? If not, try to find other interesting words. Let me pick one word, “little”.
>>> text2.concordance("little")
Displaying 25 of 160 matches:
ld spare so considerable a sum with little inconvenience ."-- He thought of it
present , of shewing them with how little attention to the comfort of other p
unds from the fortune of their dear little boy would be impoverishing him to t
he to ruin himself , and their poor little Harry , by giving away all his mone
, it could be restored to our poor little boy --" " Why , to be sure ," said
ch occasions , do too much than too little . No one , at least , can think I h
comfortable ." His wife hesitated a little , however , in giving her consent t
ommodate her as far as I can . Some little present of furniture too may be acc
oved him , to hear him read with so little sensibility . Mama , the more I kno
ore than it is . In my heart I feel little -- scarcely any doubt of his prefer
et above , will make it a very snug little cottage . I could wish the stairs w
ith her their eldest child , a fine little boy about six years old , by which
mansion which , by reminding them a little of Norland , interested their imagi
carrying her into the house with so little previous formality , there was a ra
red fellow , and has got the nicest little black bitch of a pointer I ever saw
u , Miss Dashwood ; he has a pretty little estate of his own in Somersetshire
. " I remember last Christmas at a little hop at the park , he danced from ei
g him as much as ever ." CHAPTER 11 Little had Mrs . Dashwood or her daughters
constant visitors as to leave them little leisure for serious employment . Ye
n being more silent . Elinor needed little observation to perceive that her re
enced in sitting at home ;-- and so little did her presence add to the pleasur
her place , would not have done so little . The whole story would have been s
eiving such a present from a man so little , or at least so lately known to he
warmly , " in supposing I know very little of Willoughby . I have not known hi
om Willoughby . Of John I know very little , though we have lived together for
>>> text3.concordance("little")
Displaying 20 of 20 matches:
I pray thee , from thy serva Let a little water , I pray you , be fetched , a
is near to flee unto , and it is a little o Oh , let me escape thither , ( is
t me escape thither , ( is it not a little one ?) and my soul shall live . And
id , Let me , I pray thee , drink a little water of thy pitcher . And she said
to her , Give me , I pray thee , a little water of thy pitcher to drink ; And
thy cattle was with me . For it was little which thou hadst before I came , an
nd all their wealth , and all their little ones , and their wives took they ca
d from Bethel ; and there was but a little way to come to Ephra and Rachel tra
aid unto them , Go again , buy us a little food . And Judah spake unto him , s
, both we , and thou , and also our little ones . I will be surety for him ; o
nd carry down the man a present , a little balm , and a little honey , spices
n a present , a little balm , and a little honey , spices , and myrrh , nuts ,
an , and a child of his old age , a little one ; and his brother is dead , and
ther said , Go again , and buy us a little food . And we said , We cannot go d
s out of the land of Egypt for your little ones , and for your wives , and bri
ried Jacob their father , and their little ones , and their wives , in the wag
households , and for food for your little ones . And they said , Thou hast sa
the way , when yet there was but a little way to come unto Ephra and I buried
, and his father ' s hou only their little ones , and their flocks , and their
not : I will nourish you , and your little ones . And he comforted them , and
Is there any difference in usage?
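The earlier question about words in common can also be answered with a set intersection instead of reading the two listings side by side. This is a minimal sketch for Python 3, where `fdist.items()[:50]` no longer works. It assumes NLTK 3, in which `FreqDist` is a subclass of `collections.Counter` and so provides `most_common()`. The helper name `shared_top_words` is hypothetical, and the two small counters hold only a few counts copied from the listings above, so the result here is illustrative.

```python
from collections import Counter

def shared_top_words(counts_a, counts_b, n=50):
    """Return the words that appear in the top n of both frequency tables."""
    top_a = {word for word, _ in counts_a.most_common(n)}
    top_b = {word for word, _ in counts_b.most_common(n)}
    return sorted(top_a & top_b)

# A few (word, count) pairs copied from the text2 and text3 listings above;
# a real run would pass fdist2 and fdist3 instead.
austen = Counter({'elinor': 685, 'sister': 282, 'mother': 258, 'little': 160})
genesis = Counter({'father': 198, 'joseph': 157, 'mother': 28, 'little': 20})

print(shared_top_words(austen, genesis, n=4))  # ['little', 'mother']
```

Comparing the two printed top-50 lists by eye, the shared words appear to be before, brother, himself, little, morning, mother, should, sister and therefore.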
10.
link: http://news.bbc.co.uk/2/hi/uk_news/education/6173441.stm
For comparison, let’s use text5 (chat corpus).
>>> fdist5 = FreqDist([w.lower() for w in text5])
>>> len(text5)
45010
>>> fdist5.plot(100, cumulative=True)
One-third of the tokens is about 15,000 (45010 / 3 ≈ 15,003). Let’s narrow the range down to the top 30.
>>> fdist5.plot(30, cumulative=True)
The cumulative count reaches 15,000 around the 27th or 28th word. In addition, non-word tokens such as “.” and “!” are included in the counts. Can we still say it is a problem when the top 20 words alone account for one-third of the text?
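The threshold read off the cumulative plot can also be computed directly, with or without punctuation tokens. This is a minimal sketch for Python 3. The helper name `types_to_cover` and its `words_only` flag are hypothetical, "word" is assumed to mean any token containing at least one letter or digit, and the token list is a made-up sample, not the chat corpus.

```python
from collections import Counter

def types_to_cover(tokens, fraction=1/3, words_only=False):
    """Return how many of the most frequent types are needed to reach
    the given fraction of all tokens (0 if there are no tokens)."""
    if words_only:
        # Assumption: a "word" contains at least one letter or digit.
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    counts = Counter(t.lower() for t in tokens)
    target = fraction * sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target:
            return rank
    return 0

# Hypothetical sample: punctuation dominates the raw counts.
sample = ['.'] * 8 + ['a'] * 2 + ['b'] * 2 + ['c'] * 2 + ['d'] * 2
print(types_to_cover(sample, fraction=0.5))                   # 1
print(types_to_cover(sample, fraction=0.5, words_only=True))  # 2
```

Calling `types_to_cover(text5)` and `types_to_cover(text5, words_only=True)` on the real corpus would show how much the punctuation tokens shift the 27th-or-28th-word estimate.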
11.
Let’s use slightly different examples from the textbook.
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['adventure', 'editorial', 'fiction', 'government', 'mystery']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
             can could   may might  must  will
 adventure    46   151     5    58    27    50
 editorial   121    56    74    39    53   233
   fiction    37   166     8    44    55    52
government   117    38   153    13   102   244
   mystery    42   141    13    57    30    20
One interesting thing, from my point of view, is that “could” and “might” are used more frequently in adventure and fiction (mystery shows the same pattern). Do these genres contain more uncertain things? On the other hand, the present-tense forms (can, may, must) are used heavily in government. Perhaps citizens who listen to government announcements do not like “uncertain” things.
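Because the genres differ in size, raw counts can be misleading, so it may help to look at each modal's share within a genre. This is a minimal sketch for Python 3 that uses only the numbers from the table above. The helper name `modal_share` is hypothetical, and the shares are relative to these six modals only, not to the total number of words in each genre.

```python
# Counts copied from the cfd.tabulate() output above.
modal_counts = {
    'adventure':  {'can': 46, 'could': 151, 'may': 5, 'might': 58, 'must': 27, 'will': 50},
    'editorial':  {'can': 121, 'could': 56, 'may': 74, 'might': 39, 'must': 53, 'will': 233},
    'fiction':    {'can': 37, 'could': 166, 'may': 8, 'might': 44, 'must': 55, 'will': 52},
    'government': {'can': 117, 'could': 38, 'may': 153, 'might': 13, 'must': 102, 'will': 244},
    'mystery':    {'can': 42, 'could': 141, 'may': 13, 'might': 57, 'must': 30, 'will': 20},
}

def modal_share(row, modals):
    """Fraction of a genre's six-modal total taken up by the given modals."""
    total = sum(row.values())
    return sum(row[m] for m in modals) / total

for genre, row in modal_counts.items():
    share = modal_share(row, ['could', 'might'])
    print('%-10s could+might: %.3f' % (genre, share))
```

On these figures, “could” and “might” together make up roughly 60% of the six modals in adventure, fiction and mystery, against under 20% in editorial and government.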