About Ken Xu

An IT engineer based in Shanghai and Tokyo, with over 15 years of experience in application implementation and troubleshooting for SAP ERP Financials, including ABAP coding. Also interested in iOS, Android and FileMaker as applications for small businesses, natural language analysis, and other IT topics. I am a native Japanese speaker, use English in daily business, and studied Mandarin Chinese during my stay in China (upper intermediate).

Notice: This blog will be migrated

Since I have to activate a VPN to connect to wordpress.com from my current location in China, I am considering migrating this blog to another service. Updates to this blog will be stopped temporarily and will resume after the migration is done.

The new URL of the blog will be:
http://deutschina.hatenablog.com/category/NLTK

Using context (6.1.5)

This example uses the previous word as a feature, in addition to the suffixes.

>>> def pos_features(sentence, i):
...     features = {"suffix(1)": sentence[i][-1:],
...                 "suffix(2)": sentence[i][-2:],
...                 "suffix(3)": sentence[i][-3:]}
...     if i == 0:
...             features["prev-word"] = "<START>"
...     else:
...             features["prev-word"] = sentence[i-1]
...     return features
... 
>>> pos_features(brown.sents()[0],8)
{'suffix(3)': 'ion', 'prev-word': 'an', 'suffix(2)': 'on', 'suffix(1)': 'n'}
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> featuresets = []
>>> for tagged_sent in tagged_sents:
...     untagged_sent = nltk.tag.untag(tagged_sent)
...     for i, (word, tag) in enumerate(tagged_sent):
...             featuresets.append((pos_features(untagged_sent, i), tag))
... 
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.7891596220785678
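
As a quick check (my own sketch, not from the book), a classifier trained this way can tag an unseen sentence word by word, using the classifier and pos_features() defined above:

# My own sketch: tag each word of a sentence with the suffix + previous-word classifier
def classify_sentence(classifier, sentence):
    return [classifier.classify(pos_features(sentence, i))
            for i in range(len(sentence))]

sent = brown.sents()[2000]   # an arbitrary untagged sentence
print zip(sent, classify_sentence(classifier, sent))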

Tagging POS (6.1.4)

FreqDist.inc() is new to me. According to the help text, it will “Increment this FreqDist’s count for the given sample.” Okay, so start from an empty FreqDist and count the last 1, 2 and 3 characters of each word.

>>> from nltk.corpus import brown
>>> suffix_fdist = nltk.FreqDist()
>>> for word in brown.words():
...     word = word.lower()
...     suffix_fdist.inc(word[-1:])
...     suffix_fdist.inc(word[-2:])
...     suffix_fdist.inc(word[-3:])
... 
>>> common_suffixes = suffix_fdist.keys()[:100]
>>> print common_suffixes
['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']
>>> def pos_features(word):
...     features = {}
...     for suffix in common_suffixes:
...             features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
...     return features
... 
>>> tagged_words = brown.tagged_words(categories='news')
>>> featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.DecisionTreeClassifier.train(train_set)

The last step takes an extremely long time; I killed the session after 10 minutes and had to skip the remaining part of this section.
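
Two side notes from this. First, FreqDist.inc() no longer exists in NLTK 3, where a FreqDist behaves like a collections.Counter, so the counting loop and the top-100 selection would look roughly like the sketch below. Second, DecisionTreeClassifier.train() is simply very slow on the full news-category feature set; training on a small slice at least lets it finish (the slice size below is my own arbitrary choice, not from the book):

# NLTK 3 style counting (FreqDist.inc() was removed; use item assignment instead)
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

# Train the decision tree on a small slice only, so it finishes in reasonable time
small_train_set = train_set[:5000]
classifier = nltk.DecisionTreeClassifier.train(small_train_set)
print nltk.classify.accuracy(classifier, test_set)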

Document classification (6.1.3)

Construct a list of correctly labeled documents.

>>> from nltk.corpus import movie_reviews
>>> decoments = [(list(movie_reviews.words(fileid)), category)
...             for category in movie_reviews.categories()
...             for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(decoments)

Please ignore my small typo here (decoments -> documents). It looks like this builds a list of word lists paired with their categories, doesn't it?

Then check whether each of those common words is used in a given document.
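
One thing to note: document_features() below relies on word_features, which I did not paste here. In the book it is built from the 2,000 most frequent words of the whole corpus, roughly like this (the keys() slicing relies on NLTK 2 frequency ordering; on NLTK 3 you would take the words from most_common(2000) instead):

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]   # the 2,000 most frequent words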

>>> def document_features(document):
...     document_words = set(document)
...     features = {}
...     for word in word_features:
...             features['contains(%s)' % word] = (word in document_words)
...     return features
... 
>>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{'contains(waste)': False, 'contains(lot)': False, 'contains(*)': True, 'contains(black)':
....

'contains(towards)': False, 'contains(smile)': False, 'contains(cross)': False}
>>> 

Now I get it. The purpose is to evaluate whether each word tends to appear in positive or negative reviews.

>>> featuresets = [(document_features(d), c) for (d, c) in decoments]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.82
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.1 : 1.0
         contains(mulan) = True              pos : neg    =      8.5 : 1.0
        contains(seagal) = True              neg : pos    =      7.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.6 : 1.0
         contains(damon) = True              pos : neg    =      6.0 : 1.0
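
Not part of the book's snippet, but it is easy to ask the trained classifier about one concrete review with classify() and prob_classify(); a minimal sketch (the file chosen below is arbitrary):

# Classify a single review and look at the label probabilities
fileid = movie_reviews.fileids(categories='neg')[0]
feats = document_features(movie_reviews.words(fileid))
print classifier.classify(feats)
dist = classifier.prob_classify(feats)
print dist.prob('pos'), dist.prob('neg')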

Choosing the Right Features (6.1.2)

In my understanding, this example tries to illustrate an "overfitting" situation.

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...             features["count(%s)" % letter] = name.lower().count(letter)
...             features["has(%s)" % letter] = (letter in name.lower())
...     return features
... 
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}

Then evaluate the result.

>>> featuresets = [(gender_features2(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.778

In my run, though, the result was actually better than the previous one (0.758 -> 0.778). Anyway, the concept itself was not difficult to understand: if too many features are used on the training data, the classifier may fit that data too closely and not generalize well to the whole sample.

The next step splits the sample data into smaller sets so that we can search for better features.

>>> train_name = names[1500:]
>>> devtest_name = names[500:1500]
>>> test_names = names[:500]
>>> 
>>> train_set = [(gender_features(n), g) for (n,g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.77

The sample data was split into three parts: train_name (from 1500 to the end), devtest_name (500 to 1500) and test_names (the first 500). Training is done with train_name, then evaluated against devtest_set.

Then check the errors.

>>> errors = []
>>> for (name, tag) in devtest_name:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...             errors.append((tag, guess, name))
... 
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
... 
correct=feamale  guess=male     name=Abigael                       
correct=feamale  guess=male     name=Allyson                       
correct=feamale  guess=male     name=Alys                          
correct=feamale  guess=male     name=Angil                         
correct=feamale  guess=male     name=Arlyn                         
correct=feamale  guess=male     name=Aurel                         
correct=feamale  guess=male     name=Avis                          
correct=feamale  guess=male     name=Avril                         
correct=feamale  guess=male     name=Bell                          
correct=feamale  guess=male     name=Bev                           
correct=feamale  guess=male     name=Birgit                        
correct=feamale  guess=male     name=Bliss                         
correct=feamale  guess=male     name=Brandais                      
correct=feamale  guess=male     name=Brett                         
correct=feamale  guess=male     name=Brit                          
correct=feamale  guess=male     name=Brooks                        
correct=feamale  guess=male     name=Calypso                       
correct=feamale  guess=male     name=Carolann                      
correct=feamale  guess=male     name=Caroleen                      
correct=feamale  guess=male     name=Carolyn                       
correct=feamale  guess=male     name=Carrol                        
correct=feamale  guess=male     name=Caryl                         
correct=feamale  guess=male     name=Cat                           
correct=feamale  guess=male     name=Cathryn                       
correct=feamale  guess=male     name=Charis                        
correct=feamale  guess=male     name=Christan                      
correct=feamale  guess=male     name=Christean                     
correct=feamale  guess=male     name=Cindelyn                      
correct=feamale  guess=male     name=Consuelo                      
correct=feamale  guess=male     name=Cyb                           
correct=feamale  guess=male     name=Cybel                         
correct=feamale  guess=male     name=Daniel                        
correct=feamale  guess=male     name=Darb                          
correct=feamale  guess=male     name=Dawn                          
correct=feamale  guess=male     name=Delores                       
correct=feamale  guess=male     name=Devan                         
correct=feamale  guess=male     name=Devin                         
correct=feamale  guess=male     name=Diamond                       
correct=feamale  guess=male     name=Dorcas                        
correct=feamale  guess=male     name=Dot                           
correct=feamale  guess=male     name=Estel                         
correct=feamale  guess=male     name=Evaleen                       
correct=feamale  guess=male     name=Evangelin                     
correct=feamale  guess=male     name=Fanchon                       
correct=feamale  guess=male     name=Farrand                       
correct=feamale  guess=male     name=Fern                          
correct=feamale  guess=male     name=Flor                          
correct=feamale  guess=male     name=Gabriel                       
correct=feamale  guess=male     name=Gabriell                      
correct=feamale  guess=male     name=Gill                          
correct=feamale  guess=male     name=Ginnifer                      
correct=feamale  guess=male     name=Glad                          
correct=feamale  guess=male     name=Hazel                         
correct=feamale  guess=male     name=Ines                          
correct=feamale  guess=male     name=Ingaborg                      
correct=feamale  guess=male     name=Iris                          
correct=feamale  guess=male     name=Jennifer                      
correct=feamale  guess=male     name=Jill                          
correct=feamale  guess=male     name=Jillian                       
correct=feamale  guess=male     name=Jocelin
....
correct=male     guess=feamale  name=Wylie                         
correct=male     guess=feamale  name=Yehudi                        
correct=male     guess=feamale  name=Yule                          
correct=male     guess=feamale  name=Zechariah                     
correct=male     guess=feamale  name=Zeke                          
>>> 

Based on the errors, adjust the features. In this example, look at the last two characters of each name as well as the last one.

>>> def gender_features(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
... 
>>> train_set = [(gender_features(n), g) for (n, g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.786

The result improved by 1.6 points (0.77 -> 0.786).
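
Following the reasoning behind the three-way split, the natural final step is to measure accuracy once on the held-out test_names, which were not touched during the error analysis; a minimal sketch:

# Final, unbiased check on the held-out test set
test_set = [(gender_features(n), g) for (n, g) in test_names]
print nltk.classify.accuracy(classifier, test_set)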

Supervised classifying (6.1-6.1.1)

Moving on to a new chapter, Chapter 6 of the whale book.

We learned in Chapter 2.4 that there is some relationship between the last character of a first name and gender. We are going to use the same sample here.

This function gets the last character of a name.

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
... 
>>> gender_features('Shrek')
{'last_letter': 'k'}

Here we import the Names corpus and shuffle the entries.

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'feamale') for name in names.words('female.txt')])
>>> random.shuffle(names)

Note: there is a typo here; 'feamale' should be 'female'. It carries through all the output below.

After that, extract the last character of each name (using gender_features()) and pair it with the gender. The first 500 records are used for testing and the remaining ones for training.

>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'feamale'

Evaluate the accuracy on test_set.

>>> print nltk.classify.accuracy(classifier, test_set)
0.756
>>> classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
>>> 

This result clearly says that a name ending with 'k' is most likely male, and one ending with 'a' is most likely female. But after trying the following, my impression changed slightly.

>>> classifier.show_most_informative_features(26)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
             last_letter = 'm'              male : feamal =     10.1 : 1.0
             last_letter = 'd'              male : feamal =      9.0 : 1.0
             last_letter = 'o'              male : feamal =      7.7 : 1.0
             last_letter = 'r'              male : feamal =      7.2 : 1.0
             last_letter = 'w'              male : feamal =      5.8 : 1.0
             last_letter = 'g'              male : feamal =      5.3 : 1.0
             last_letter = 't'              male : feamal =      4.3 : 1.0
             last_letter = 's'              male : feamal =      4.1 : 1.0
             last_letter = 'b'              male : feamal =      4.1 : 1.0
             last_letter = 'z'              male : feamal =      4.0 : 1.0
             last_letter = 'j'              male : feamal =      4.0 : 1.0
             last_letter = 'i'            feamal : male   =      3.6 : 1.0
             last_letter = 'u'              male : feamal =      2.2 : 1.0
             last_letter = 'n'              male : feamal =      2.1 : 1.0
             last_letter = 'e'            feamal : male   =      1.9 : 1.0
             last_letter = 'l'              male : feamal =      1.8 : 1.0
             last_letter = 'h'              male : feamal =      1.5 : 1.0
             last_letter = 'x'              male : feamal =      1.4 : 1.0
             last_letter = 'y'              male : feamal =      1.2 : 1.0

Even though there are some exceptions (a, e, i), in most cases male names are the majority. In other words, female names tend to end with a few specific characters such as a, e and i, while male names are spread across the rest. Anyway, this is just my impression.
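
One way to double-check this impression is to go back to the Chapter 2.4 approach and count last letters per gender with a ConditionalFreqDist; a sketch (names_corpus is my own alias, since names was rebound to the shuffled list above):

from nltk.corpus import names as names_corpus   # keep the shuffled list 'names' intact
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1].lower())
    for fileid in names_corpus.fileids()         # 'male.txt', 'female.txt'
    for name in names_corpus.words(fileid))
cfd.tabulate()                                   # last-letter counts per file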

This is just additional information: apply_features can be used to avoid high memory consumption, because it does not build the whole list of feature sets in memory.

>>> from nltk.classify import apply_features
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])
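
The objects returned by apply_features behave like the lists of (features, label) pairs used above but compute each entry on demand, so they can be passed straight to the trainer; a minimal sketch:

# Train and evaluate exactly as before, but with the lazy feature sets
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)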

Brill tagging (5.6)

The concept of Brill tagging is explained in the textbook. This is the demo included in NLTK.

>>> nltk.tag.brill.demo()
Loading tagged data... 
Done loading.
Training unigram tagger:
    [accuracy: 0.832151]
Training bigram tagger:
    [accuracy: 0.837930]
Training Brill tagger on 1600 sentences...
Finding initial useful rules...
    Found 9757 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  11  15   4   0  | WDT -> IN if the tag of words i+1...i+2 is 'DT'
  10  12   2   0  | IN -> RB if the text of the following word is
                  |   'well'
   9   9   0   0  | WDT -> IN if the tag of the preceding word is
                  |   'NN', and the tag of the following word is 'NNP'
   7   9   2   0  | RBR -> JJR if the tag of words i+1...i+2 is 'NNS'
   7  10   3   0  | WDT -> IN if the tag of words i+1...i+2 is 'NNS'
   5   5   0   0  | WDT -> IN if the tag of the preceding word is
                  |   'NN', and the tag of the following word is 'PRP'
   4   4   0   1  | WDT -> IN if the tag of words i+1...i+3 is 'VBG'
   3   3   0   0  | RB -> IN if the tag of the preceding word is 'NN',
                  |   and the tag of the following word is 'DT'
   3   3   0   0  | RBR -> JJR if the tag of the following word is
                  |   'NN'
   3   3   0   0  | VBP -> VB if the tag of words i-3...i-1 is 'MD'
   3   3   0   0  | NNS -> NN if the text of the preceding word is
                  |   'one'
   3   3   0   0  | RP -> RB if the text of words i-3...i-1 is 'were'
   3   3   0   0  | VBP -> VB if the text of words i-2...i-1 is "n't"

Brill accuracy: 0.839156
Done; rules and errors saved to rules.yaml and errors.out.
>>> 

This is the error log. Each line shows the left context, the word with the tag the tagger assigned and the correct tag (word/test->gold), and the right context.

>>> print(open("errors.out").read())
Errors for Brill Tagger 'rules.yaml'

             left context |    word/test->gold     | right context
--------------------------+------------------------+--------------------------
 Chairman/NNP Richard/NNP |    Breeden/NN->NNP     | has/VBZ said/VBD 0/-NONE-
TO consider/VB circuit/NN |    breakers/NN->NNS    | that/WDT *T*-215/-NONE- h
/NN breakers/NNS that/WDT |   *T*-215/NN->-NONE-   | have/VBP preset/JJ trigge
T *T*-215/-NONE- have/VBP |     preset/NN->JJ      | trigger/NN points/NNS ,/,
n/NNP was/VBD so/RB ``/`` |      vague/NN->JJ      | and/CC mushy/JJ ''/'' tha
/RB ``/`` vague/JJ and/CC |      mushy/NN->JJ      | ''/'' that/IN it/PRP was/
....
gton/NNP ,/, D.C./NNP ,/, |       as/IN->RB        | long/RB as/IN they/PRP co
NP ,/, D.C./NNP ,/, as/RB |      long/JJ->RB       | as/IN they/PRP could/MD i
B as/IN they/PRP could/MD |     install/NN->VB     | a/DT crash/JJ barrier/NN 
 could/MD install/VB a/DT |      crash/NN->JJ      | barrier/NN between/IN the
                          |      Tray/NN->NNP      | Bon/NNP ?/.
                 Tray/NNP |      Bon/NN->NNP       | ?/.
      Drink/NN Carrier/NN |    Competes/NN->VBZ    | With/IN Cartons/NNS
r/NN Competes/VBZ With/IN |    Cartons/NN->NNS     | 
                 */-NONE- |    PORTING/NN->VBG     | POTABLES/NNS just/RB got/
     */-NONE- PORTING/VBG |    POTABLES/NN->NNS    | just/RB got/VBD easier/JJ
/, or/CC so/RB claims/VBZ |    Scypher/NN->NNP     | Corp./NNP ,/, the/DT make
/DT maker/NN of/IN the/DT |    Cup-Tote/NN->NNP    | ./.

>>>