POS Tagging (6.1.4)

FreqDist.inc() is new to me. According to the help: “Increment this FreqDist’s count for the given sample.” Okay, then start from an empty FreqDist and count the last 1, 2, and 3 characters of each word.

>>> import nltk
>>> from nltk.corpus import brown
>>> suffix_fdist = nltk.FreqDist()
>>> for word in brown.words():
...     word = word.lower()
...     suffix_fdist.inc(word[-1:])
...     suffix_fdist.inc(word[-2:])
...     suffix_fdist.inc(word[-3:])
... 
>>> common_suffixes = suffix_fdist.keys()[:100]
>>> print common_suffixes
['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']
>>> def pos_features(word):
...     features = {}
...     for suffix in common_suffixes:
...             features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
...     return features
... 
>>> tagged_words = brown.tagged_words(categories='news')
>>> featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.DecisionTreeClassifier.train(train_set)

The last step takes an extremely long time. I killed the session after 10 minutes and had to skip the remaining part of this section.
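A workaround I might try later (my own sketch, untested, and the slice size is arbitrary) is to train the decision tree on a small slice of the data so it finishes in a reasonable time:

>>> small_train = train_set[:5000]   # arbitrary size, small enough to finish
>>> classifier = nltk.DecisionTreeClassifier.train(small_train)
>>> print nltk.classify.accuracy(classifier, test_set)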

Document classification (6.1.3)

First, construct the list of correctly labeled documents.

>>> from nltk.corpus import movie_reviews
>>> import random
>>> decoments = [(list(movie_reviews.words(fileid)), category)
...             for category in movie_reviews.categories()
...             for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(decoments)

Let’s ignore my small typo here (decoments -> documents). But it looks like this generates a list of (word list, category) pairs, doesn’t it?
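One thing to note: document_features() below refers to word_features, which I didn’t paste. In the book it is built from the 2,000 most frequent words of the whole corpus, roughly like this (from memory, using the same old FreqDist API as above):

>>> all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> word_features = all_words.keys()[:2000]   # keys() is frequency-sorted here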

Then check whether each of those words is used in the specified document.

>>> def document_features(document):
...     document_words = set(document)
...     features = {}
...     for word in word_features:
...             features['contains(%s)' % word] = (word in document_words)
...     return features
... 
>>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{'contains(waste)': False, 'contains(lot)': False, 'contains(*)': True, 'contains(black)':
....

'contains(towards)': False, 'contains(smile)': False, 'contains(cross)': False}
>>> 

Now I got it. The purpose is to evaluate whether words are used in positive or negative contexts.

>>> featuresets = [(document_features(d), c) for (d, c) in decoments]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.82
>>> classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.1 : 1.0
         contains(mulan) = True              pos : neg    =      8.5 : 1.0
        contains(seagal) = True              neg : pos    =      7.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.6 : 1.0
         contains(damon) = True              pos : neg    =      6.0 : 1.0
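As a quick check, the trained classifier can be applied to a new document the same way (the review text here is just made up by me):

>>> review = 'an outstanding , wonderfully acted film'.split()
>>> print classifier.classify(document_features(review))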

Choosing the Right Features (6.1.2)

In my understanding, this example tries to explain the “overfitting” situation.

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...             features["count(%s)" % letter] = name.lower().count(letter)
...             features["has(%s)" % letter] = (letter in name.lower())
...     return features
... 
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}

Then evaluate the result.

>>> featuresets = [(gender_features2(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.778

But the result was actually better than the previous one (0.758 -> 0.778). Anyway, the concept itself was not difficult to understand: if too many features are used in the training data, the classifier may fit quirks of the training set and fail to generalize to the whole sample.

This one splits the sample data into smaller sets in order to seek better features.

>>> train_name = names[1500:]
>>> devtest_name = names[500:1500]
>>> test_names = names[:500]
>>> 
>>> train_set = [(gender_features(n), g) for (n,g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.77

The sample data was split into three: train_name (from 1500 to the end), devtest_name (500 to 1500), and test_names (the first 500). Training is done with train_name, then evaluated with devtest_set.

Then check out the errors.

>>> errors = []
>>> for (name, tag) in devtest_name:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...             errors.append((tag, guess, name))
... 
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
... 
correct=feamale  guess=male     name=Abigael                       
correct=feamale  guess=male     name=Allyson                       
correct=feamale  guess=male     name=Alys                          
correct=feamale  guess=male     name=Angil                         
correct=feamale  guess=male     name=Arlyn                         
correct=feamale  guess=male     name=Aurel                         
correct=feamale  guess=male     name=Avis                          
correct=feamale  guess=male     name=Avril                         
correct=feamale  guess=male     name=Bell                          
correct=feamale  guess=male     name=Bev                           
correct=feamale  guess=male     name=Birgit                        
correct=feamale  guess=male     name=Bliss                         
correct=feamale  guess=male     name=Brandais                      
correct=feamale  guess=male     name=Brett                         
correct=feamale  guess=male     name=Brit                          
correct=feamale  guess=male     name=Brooks                        
correct=feamale  guess=male     name=Calypso                       
correct=feamale  guess=male     name=Carolann                      
correct=feamale  guess=male     name=Caroleen                      
correct=feamale  guess=male     name=Carolyn                       
correct=feamale  guess=male     name=Carrol                        
correct=feamale  guess=male     name=Caryl                         
correct=feamale  guess=male     name=Cat                           
correct=feamale  guess=male     name=Cathryn                       
correct=feamale  guess=male     name=Charis                        
correct=feamale  guess=male     name=Christan                      
correct=feamale  guess=male     name=Christean                     
correct=feamale  guess=male     name=Cindelyn                      
correct=feamale  guess=male     name=Consuelo                      
correct=feamale  guess=male     name=Cyb                           
correct=feamale  guess=male     name=Cybel                         
correct=feamale  guess=male     name=Daniel                        
correct=feamale  guess=male     name=Darb                          
correct=feamale  guess=male     name=Dawn                          
correct=feamale  guess=male     name=Delores                       
correct=feamale  guess=male     name=Devan                         
correct=feamale  guess=male     name=Devin                         
correct=feamale  guess=male     name=Diamond                       
correct=feamale  guess=male     name=Dorcas                        
correct=feamale  guess=male     name=Dot                           
correct=feamale  guess=male     name=Estel                         
correct=feamale  guess=male     name=Evaleen                       
correct=feamale  guess=male     name=Evangelin                     
correct=feamale  guess=male     name=Fanchon                       
correct=feamale  guess=male     name=Farrand                       
correct=feamale  guess=male     name=Fern                          
correct=feamale  guess=male     name=Flor                          
correct=feamale  guess=male     name=Gabriel                       
correct=feamale  guess=male     name=Gabriell                      
correct=feamale  guess=male     name=Gill                          
correct=feamale  guess=male     name=Ginnifer                      
correct=feamale  guess=male     name=Glad                          
correct=feamale  guess=male     name=Hazel                         
correct=feamale  guess=male     name=Ines                          
correct=feamale  guess=male     name=Ingaborg                      
correct=feamale  guess=male     name=Iris                          
correct=feamale  guess=male     name=Jennifer                      
correct=feamale  guess=male     name=Jill                          
correct=feamale  guess=male     name=Jillian                       
correct=feamale  guess=male     name=Jocelin
....
correct=male     guess=feamale  name=Wylie                         
correct=male     guess=feamale  name=Yehudi                        
correct=male     guess=feamale  name=Yule                          
correct=male     guess=feamale  name=Zechariah                     
correct=male     guess=feamale  name=Zeke                          
>>> 

Based on these errors, adjust the features. In this example, check the last two characters of each name in addition to the last one.

>>> def gender_features(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
... 
>>> train_set = [(gender_features(n), g) for (n, g) in train_name]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_name]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.786

The result improved by 1.6 points (0.77 -> 0.786).
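For completeness, the book’s last step in this section (which I skipped) evaluates the final model once against the held-out test_names, roughly:

>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> print nltk.classify.accuracy(classifier, test_set)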

Supervised classification (6.1-6.1.1)

Moving into a new chapter, Chapter 6 of the whale book.

We learned in Chapter 2.4 that there is some relationship between the last character of a first name and gender. I am going to use the same sample here.

This function gets the last character of a name.

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
... 
>>> gender_features('Shrek')
{'last_letter': 'k'}

Here, import the Names corpus and shuffle the entries.

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'feamale') for name in names.words('female.txt')])
>>> random.shuffle(names)

Note: there is a typo here; wrong: feamale, correct: female. (The typo propagates into all the classifier output below.)

After that, extract the last character (by using gender_features()) together with the gender. The first 500 records are used for testing, and the rest for training.

>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'feamale'

Evaluate the accuracy on test_set.

>>> print nltk.classify.accuracy(classifier, test_set)
0.756
>>> classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
>>> 

This result clearly says that a name ending with ‘k’ is most likely male, and one ending with ‘a’ female. But after trying the following, my impression changed slightly.

>>> classifier.show_most_informative_features(26)
Most Informative Features
             last_letter = 'k'              male : feamal =     45.8 : 1.0
             last_letter = 'a'            feamal : male   =     41.2 : 1.0
             last_letter = 'f'              male : feamal =     16.6 : 1.0
             last_letter = 'p'              male : feamal =     11.9 : 1.0
             last_letter = 'v'              male : feamal =     10.5 : 1.0
             last_letter = 'm'              male : feamal =     10.1 : 1.0
             last_letter = 'd'              male : feamal =      9.0 : 1.0
             last_letter = 'o'              male : feamal =      7.7 : 1.0
             last_letter = 'r'              male : feamal =      7.2 : 1.0
             last_letter = 'w'              male : feamal =      5.8 : 1.0
             last_letter = 'g'              male : feamal =      5.3 : 1.0
             last_letter = 't'              male : feamal =      4.3 : 1.0
             last_letter = 's'              male : feamal =      4.1 : 1.0
             last_letter = 'b'              male : feamal =      4.1 : 1.0
             last_letter = 'z'              male : feamal =      4.0 : 1.0
             last_letter = 'j'              male : feamal =      4.0 : 1.0
             last_letter = 'i'            feamal : male   =      3.6 : 1.0
             last_letter = 'u'              male : feamal =      2.2 : 1.0
             last_letter = 'n'              male : feamal =      2.1 : 1.0
             last_letter = 'e'            feamal : male   =      1.9 : 1.0
             last_letter = 'l'              male : feamal =      1.8 : 1.0
             last_letter = 'h'              male : feamal =      1.5 : 1.0
             last_letter = 'x'              male : feamal =      1.4 : 1.0
             last_letter = 'y'              male : feamal =      1.2 : 1.0

Even though there are some exceptions (a, e, i), for most last letters male names are the majority. In other words, female names tend to end with a few specific characters like a, e, and i, while male names are spread over the rest. Anyway, this is just my impression.
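To check that impression, the actual distribution can be tabulated with a ConditionalFreqDist (my own sketch, in the spirit of Chapter 2.4):

>>> cfd = nltk.ConditionalFreqDist(
...             (gender, name[-1].lower())
...             for (name, gender) in names)
>>> cfd.tabulate()   # one row per gender, one column per last letter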

This is just additional information: apply_features can be used to avoid high memory consumption. It returns an object that acts like a list but builds each feature set on demand, instead of holding them all in memory at once.

>>> from nltk.classify import apply_features
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])

Brill tagging (5.6)

The Brill tagging concept is explained in the textbook. This is the demo included in NLTK.

>>> nltk.tag.brill.demo()
Loading tagged data... 
Done loading.
Training unigram tagger:
    [accuracy: 0.832151]
Training bigram tagger:
    [accuracy: 0.837930]
Training Brill tagger on 1600 sentences...
Finding initial useful rules...
    Found 9757 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  11  15   4   0  | WDT -> IN if the tag of words i+1...i+2 is 'DT'
  10  12   2   0  | IN -> RB if the text of the following word is
                  |   'well'
   9   9   0   0  | WDT -> IN if the tag of the preceding word is
                  |   'NN', and the tag of the following word is 'NNP'
   7   9   2   0  | RBR -> JJR if the tag of words i+1...i+2 is 'NNS'
   7  10   3   0  | WDT -> IN if the tag of words i+1...i+2 is 'NNS'
   5   5   0   0  | WDT -> IN if the tag of the preceding word is
                  |   'NN', and the tag of the following word is 'PRP'
   4   4   0   1  | WDT -> IN if the tag of words i+1...i+3 is 'VBG'
   3   3   0   0  | RB -> IN if the tag of the preceding word is 'NN',
                  |   and the tag of the following word is 'DT'
   3   3   0   0  | RBR -> JJR if the tag of the following word is
                  |   'NN'
   3   3   0   0  | VBP -> VB if the tag of words i-3...i-1 is 'MD'
   3   3   0   0  | NNS -> NN if the text of the preceding word is
                  |   'one'
   3   3   0   0  | RP -> RB if the text of words i-3...i-1 is 'were'
   3   3   0   0  | VBP -> VB if the text of words i-2...i-1 is "n't"

Brill accuracy: 0.839156
Done; rules and errors saved to rules.yaml and errors.out.
>>> 

This is the error log.

>>> print(open("errors.out").read())
Errors for Brill Tagger 'rules.yaml'

             left context |    word/test->gold     | right context
--------------------------+------------------------+--------------------------
 Chairman/NNP Richard/NNP |    Breeden/NN->NNP     | has/VBZ said/VBD 0/-NONE-
TO consider/VB circuit/NN |    breakers/NN->NNS    | that/WDT *T*-215/-NONE- h
/NN breakers/NNS that/WDT |   *T*-215/NN->-NONE-   | have/VBP preset/JJ trigge
T *T*-215/-NONE- have/VBP |     preset/NN->JJ      | trigger/NN points/NNS ,/,
n/NNP was/VBD so/RB ``/`` |      vague/NN->JJ      | and/CC mushy/JJ ''/'' tha
/RB ``/`` vague/JJ and/CC |      mushy/NN->JJ      | ''/'' that/IN it/PRP was/
....
gton/NNP ,/, D.C./NNP ,/, |       as/IN->RB        | long/RB as/IN they/PRP co
NP ,/, D.C./NNP ,/, as/RB |      long/JJ->RB       | as/IN they/PRP could/MD i
B as/IN they/PRP could/MD |     install/NN->VB     | a/DT crash/JJ barrier/NN 
 could/MD install/VB a/DT |      crash/NN->JJ      | barrier/NN between/IN the
                          |      Tray/NN->NNP      | Bon/NNP ?/.
                 Tray/NNP |      Bon/NN->NNP       | ?/.
      Drink/NN Carrier/NN |    Competes/NN->VBZ    | With/IN Cartons/NNS
r/NN Competes/VBZ With/IN |    Cartons/NN->NNS     | 
                 */-NONE- |    PORTING/NN->VBG     | POTABLES/NNS just/RB got/
     */-NONE- PORTING/VBG |    POTABLES/NN->NNS    | just/RB got/VBD easier/JJ
/, or/CC so/RB claims/VBZ |    Scypher/NN->NNP     | Corp./NNP ,/, the/DT make
/DT maker/NN of/IN the/DT |    Cup-Tote/NN->NNP    | ./.

>>> 

Performance limitations (5.5.7-5.5.8)

>>> cfd = nltk.ConditionalFreqDist(
...             ((x[1], y[1], z[0]), z[1])
...             for sent in brown_tagged_sents
...             for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0
>>> from __future__ import division
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0.049297702068029296

In ambiguous_contexts, more than one tag appears even under the same context, where the context here is the tags of the two preceding words plus the word itself (for example, the same triple might be tagged NN in one sentence and VB in another). The last expression calculates the share of these ambiguous contexts: around 5% (0.04929…).

>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...             for (word, tag) in t2.tag(sent)]
>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
>>> print nltk.ConfusionMatrix(gold_tags, test_tags)
....

This matrix compares the gold-standard tags with the assigned tags. However, the output is too wide, likely more than 800 columns per row. Not easy to display…
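One way to shrink it (my own idea, not from the book) is to collapse the tags to their first two characters before building the matrix:

>>> simplify = lambda t: t[:2]   # e.g. 'NN-TL' -> 'NN'
>>> print nltk.ConfusionMatrix([simplify(t) for t in gold_tags],
...                            [simplify(t) for t in test_tags])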

Combining Taggers (5.5.4-5.5.5)

When a tag cannot be assigned, it is possible to fall back to a more general tagger by using the backoff option.
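For reference, train_sents and test_sents below come from the 90/10 split of the Brown news sentences made earlier in the chapter, roughly:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]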

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.8447124489185687
>>> t3 = nltk.TrigramTagger(train_sents, backoff=t2)
>>> t3.evaluate(test_sents)
0.8423203428685339

Saving Taggers (5.5.6)

>>> from cPickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()
>>> 
>>> from cPickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
>>> 
>>> text = """The board's action shows what free enterprise
...           is up against in our complex maze of regulatory laws . """
>>> tokens = text.split()
>>> tagger.tag(tokens)
[('The', 'AT'), ("board's", 'NN$'), ('action', 'NN'), ('shows', 'NNS'), ('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'), ('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'), ('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]
>>>