Performance limitation (5.5.7-5.5.8)

>>> cfd = nltk.ConditionalFreqDist(
...             ((x[1], y[1], z[0]), z[1])
...             for sent in brown_tagged_sents
...             for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0
>>> from __future__ import division
>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
0.049297702068029296

In ambiguous_contexts, more than one tag occurs even under the same context (here: the tags of the two preceding words plus the current word). The last expression computes the share of tokens that appear in ambiguous contexts: around 5% (0.04929…). Note that the first attempt returned 0 because Python 2 uses integer division by default; importing division from __future__ fixes this.
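To make the calculation concrete, here is the same computation on a tiny hand-made stand-in for the ConditionalFreqDist (the data and numbers are invented for illustration; the real session uses the Brown trigrams above):

```python
from collections import Counter

# Toy stand-in for cfd: key = (tag of word 1, tag of word 2, word 3),
# value = counts of the tags assigned to word 3 in that context.
cfd = {
    ('AT', 'NN', 'run'):  Counter({'VB': 3, 'NN': 2}),   # ambiguous context
    ('AT', 'JJ', 'dog'):  Counter({'NN': 5}),            # unambiguous
    ('IN', 'AT', 'walk'): Counter({'NN': 1, 'VB': 1}),   # ambiguous
}

# A context is ambiguous when more than one tag was observed for it.
ambiguous = [c for c in cfd if len(cfd[c]) > 1]

# Share of tokens that occur in ambiguous contexts: (3+2) + (1+1) = 7 of 12.
total = sum(sum(counts.values()) for counts in cfd.values())
share = sum(sum(cfd[c].values()) for c in ambiguous) / total
print(share)   # 7/12, about 0.583
```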

>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...             for (word, tag) in t2.tag(sent)]
>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
>>> print nltk.ConfusionMatrix(gold_tags, test_tags)
....

This matrix compares the gold-standard tags with the assigned tags. However, the output is too wide, likely more than 800 columns per row, and not easy to display…
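As a workaround (my own idea, not from the book), instead of printing the full matrix one can tabulate just the disagreements. A sketch using collections.Counter on two hypothetical tag lists standing in for gold_tags and test_tags:

```python
from collections import Counter

# Hypothetical stand-ins for the gold_tags and test_tags lists above.
gold_tags = ['NN', 'VBD', 'NN', 'JJ', 'NN', 'IN', 'NN', 'VBD']
test_tags = ['NN', 'NN',  'NN', 'NN', 'NN', 'IN', 'JJ', 'VBD']

# Count only the disagreements: (gold, predicted) pairs.
confusions = Counter((g, t) for g, t in zip(gold_tags, test_tags) if g != t)

# Print the most frequent confusions, far narrower than the full matrix.
for (gold, test), count in confusions.most_common(5):
    print('%-4s tagged as %-4s : %d times' % (gold, test, count))
```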


Combine Taggers (5.5.4-5.5.5)

When a tag cannot be assigned, the tagger can fall back to a more general tagger by using the backoff option.

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.8447124489185687
>>> t3 = nltk.TrigramTagger(train_sents, backoff=t2)
>>> t3.evaluate(test_sents)
0.8423203428685339

Saving Tagger (5.5.6):

>>> from cPickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()
>>> 
>>> from cPickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
>>> 
>>> text = """The board's action shows what free enterprise
...           is up against in our complex maze of regulatory laws . """
>>> tokens = text.split()
>>> tagger.tag(tokens)
[('The', 'AT'), ("board's", 'NN$'), ('action', 'NN'), ('shows', 'NNS'), ('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'), ('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'), ('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]
>>> 
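A side note: in Python 3 the cPickle module is gone and the plain pickle module provides the same API. The same round-trip would look like this (using a dict as a stand-in for the trained t2 tagger):

```python
import os
import pickle
import tempfile

tagger = {'example': 'NN'}   # stand-in for the trained t2 tagger

# Save with the highest available pickle protocol (-1), as in the session above.
path = os.path.join(tempfile.mkdtemp(), 't2.pkl')
with open(path, 'wb') as output:
    pickle.dump(tagger, output, -1)

# Load it back and confirm the round-trip preserved the object.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == tagger)   # True
```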

Generic N-gram tagger (5.3.3)

A unigram tagger assigns each word the tag it most "probably" carries, but it looks at each word in isolation; that is its limitation. An n-gram tagger also checks the tags of the neighboring (preceding) words.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments',
'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace',
'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'),
('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'),
('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'),
('.', '.')]
>>> 

Compared with the unigram result, the tag of ‘so’ has changed to ‘CS’ (the unigram tagger assigned ‘QL’).

>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'),
('Congo', 'NP'), ('is', 'BEZ'), ('13.5', None), ('million', None),
(',', None), ('divided', None), ('into', None), ('at', None),
('least', None), ('seven', None), ('major', None), ('``', None),
('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400',
None), ('separate', None), ('dialects', None), ('.', None)]
>>> test_sents = brown_tagged_sents[size:]
>>> bigram_tagger.evaluate(test_sents)
0.10216286255357321

If a word is not in the training data (here ‘13.5’), the bigram tagger cannot determine a tag and assigns None. Every following word is then also tagged None, because its context now contains a None tag, which never occurred in the training data. That is why the accuracy drops to about 10%.
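The cascade can be illustrated with a toy bigram lookup (hypothetical data and a simplified model, not NLTK's actual implementation, where each key is the pair of previous tag and current word):

```python
# Toy bigram model: (tag of previous word, current word) -> tag.
bigram_model = {
    ('<s>', 'The'): 'AT',          # '<s>' marks the sentence start
    ('AT', 'population'): 'NN',
    ('NN', 'of'): 'IN',
    ('IN', 'the'): 'AT',
    ('AT', 'Congo'): 'NP',
    ('NP', 'is'): 'BEZ',
    # '13.5' never appeared in training, so ('BEZ', '13.5') is missing.
}

def bigram_tag(words):
    prev, result = '<s>', []
    for w in words:
        t = bigram_model.get((prev, w))   # None when the context is unseen
        result.append((w, t))
        prev = t                          # a None tag poisons every later lookup
    return result

print(bigram_tag(['The', 'population', 'of', 'the', 'Congo',
                  'is', '13.5', 'million']))
# Everything from '13.5' onward comes out as None.
```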

Separating Training and Test data (5.5.2)

What a busy week! Today’s topic is also a short one.

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.8110236220472441

UnigramTagger takes already-tagged data as a parameter; this is called ‘training’. Today’s example uses 90% of the data for training, then tags the remaining 10% and evaluates the result.

By the way, do we really need 90% of the data for training? I changed the ratio (to 0.5 and 0.2) and got the following results.

>>> size = int(len(brown_tagged_sents) * 0.5)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.7656880697816371
>>> size = int(len(brown_tagged_sents) * 0.2)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.6882990411506192

Seems not too bad.
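The three experiments above could be wrapped in a small loop. A sketch of just the split logic, on a stand-in corpus (the real session would pass brown_tagged_sents and train nltk.UnigramTagger on each split):

```python
def split_corpus(tagged_sents, ratio):
    """Split tagged sentences into (train, test) at the given ratio."""
    size = int(len(tagged_sents) * ratio)
    return tagged_sents[:size], tagged_sents[size:]

# Stand-in for brown_tagged_sents: 10 dummy one-word "sentences".
corpus = [[('w%d' % i, 'NN')] for i in range(10)]

for ratio in (0.9, 0.5, 0.2):
    train, test = split_corpus(corpus, ratio)
    print('ratio=%.1f  train=%d  test=%d' % (ratio, len(train), len(test)))
```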

Unigram tagger (5.5.1)

Today’s article is short as I am too busy today!

Unigram tagger:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017

Note that this evaluates the tagger on the same data it was trained on, so the score (93%) is optimistic.

Lookup tagger (5.4.3-5.4.4)

Lookup tagger:

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344

Let’s go into detail to understand it.

>>> most_freq_words
['the', ',', '.', 'of', 'and', 'to', 'a', 'in', 'for', 'The', 'that', '``', 'is', 'was', "''", 'on', 'at', 'with', 'be', 'by', 'as', 'he', 'said', 'his', 'will', 'it', 'from', 'are', ';', '--', 'an', 'has', 'had', 'who', 'have', 'not', 'Mrs.', 'were', 'this', 'which', 'would', 'their', 'been', 'they', 'He', 'one', 'I', 'but', 'its', 'or', ')', 'more', 'Mr.', '(', 'up', 'all', 'last', 'out', 'two', ':', 'other', 'new', 'first', 'than', 'year', 'A', 'about', 'there', 'when', 'In', 'after', 'home', 'also', 'It', 'over', 'into', 'But', 'no', 'made', 'her', 'only', 'years', 'three', 'time', 'them', 'some', 'New', 'can', 'him', '?', 'any', 'state', 'President', 'before', 'could', 'week', 'under', 'against', 'we', 'now']

most_freq_words simply selects the 100 most frequent words (in this NLTK version, FreqDist.keys() returns words sorted by frequency). Then likely_tags maps each of those top 100 words to its most frequent tag.

>>> likely_tags
{'all': 'ABN', 'over': 'IN', 'years': 'NNS', 'against': 'IN', 'its': 'PP$', 'before': 'IN', '(': '(', 'had': 'HVD', ',': ',', 'to': 'TO', 'only': 'AP', 'under': 'IN', 'has': 'HVZ', 'New': 'JJ-TL', 'them': 'PPO', 'his': 'PP$', 'Mrs.': 'NP', 'they': 'PPSS', 'not': '*', 'now': 'RB', 'him': 'PPO', 'In': 'IN', '--': '--', 'week': 'NN', 'some': 'DTI', 'are': 'BER', 'year': 'NN', 'home': 'NN', 'out': 'RP', 'said': 'VBD', 'for': 'IN', 'state': 'NN', 'new': 'JJ', ';': '.', '?': '.', 'He': 'PPS', 'be': 'BE', 'we': 'PPSS', 'after': 'IN', 'by': 'IN', 'on': 'IN', 'about': 'IN', 'last': 'AP', 'her': 'PP$', 'of': 'IN', 'could': 'MD', 'Mr.': 'NP', 'or': 'CC', 'first': 'OD', 'into': 'IN', 'one': 'CD', 'But': 'CC', 'from': 'IN', 'would': 'MD', 'there': 'EX', 'three': 'CD', 'been': 'BEN', '.': '.', 'their': 'PP$', ':': ':', 'was': 'BEDZ', 'more': 'AP', '``': '``', 'that': 'CS', 'but': 'CC', 'with': 'IN', 'than': 'IN', 'he': 'PPS', 'made': 'VBN', 'this': 'DT', 'up': 'RP', 'will': 'MD', 'can': 'MD', 'were': 'BED', 'and': 'CC', 'is': 'BEZ', 'it': 'PPS', 'an': 'AT', "''": "''", 'as': 'CS', 'at': 'IN', 'have': 'HV', 'in': 'IN', 'any': 'DTI', 'no': 'AT', ')': ')', 'when': 'WRB', 'also': 'RB', 'other': 'AP', 'which': 'WDT', 'President': 'NN-TL', 'A': 'AT', 'I': 'PPSS', 'who': 'WPS', 'two': 'CD', 'The': 'AT', 'a': 'AT', 'It': 'PPS', 'time': 'NN', 'the': 'AT'}
>>> 

For words outside the top 100, no tag is assigned (None), as shown below.

>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None), ('of', 'IN'), ('such', None), ('reports', None), ('was', 'BEDZ'), ('received', None), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None), ('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None), (',', ','), ('the', 'AT'), ('number', None), ('of', 'IN'), ('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', None), ('of', 'IN'), ('this', 'DT'), ('city', None), ("''", "''"), ('.', '.')]

The backoff option makes it possible to supply a default tag (‘NN’ in this case).

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', 'NN'), ('a', 'AT'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'IN'), ('such', 'NN'), ('reports', 'NN'), ('was', 'BEDZ'), ('received', 'NN'), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', 'NN'), ('the', 'AT'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('the', 'AT'), ('number', 'NN'), ('of', 'IN'), ('voters', 'NN'), ('and', 'CC'), ('the', 'AT'), ('size', 'NN'), ('of', 'IN'), ('this', 'DT'), ('city', 'NN'), ("''", "''"), ('.', '.')]
>>> 

This one explains how large a model size is reasonable.

>>> def performance(cfd, wordlist):
...     lt = dict((word, cfd[word].max()) for word in wordlist)
...     baseline_tagger = nltk.UnigramTagger(model=lt,
...                                          backoff=nltk.DefaultTagger('NN')) 
...     return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
... 
>>> def display():
...     import pylab, nltk
...     words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
...     cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
...     sizes = 2 ** pylab.arange(15)
...     perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
...     pylab.plot(sizes, perfs, '-bo')
...     pylab.title('Lookup Tagger Performance with Varying Model Size')
...     pylab.xlabel('Model Size')
...     pylab.ylabel('Performance')
...     pylab.show()
... 
>>> display()

[Figure 1: Lookup Tagger Performance with Varying Model Size]

With the top 100 words, the evaluation result was around 0.45. While the model size is small, say below 2000 words, performance improves sharply; after that, the rate of improvement slows down.

These statistics indicate that the model does not have to be very large.

Automatic tagging (5.4-5.4.2)

Start preparation:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

Check which tag is most frequently used; it’s ‘NN’.

>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'

Then assign the most popular tag to each word.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

Even though ‘NN’ is the most frequent tag, it accounts for only around 13% of tokens.

>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

Another approach is a regular-expression tagger.

>>> patterns = [
...     (r'.*ing$', 'VBG'),
...     (r'.*ed$', 'VBD'),
...     (r'.*es$', 'VBZ'),
...     (r'.*ould$', 'MD'),
...     (r'.*\'s$', 'NN$'),
...     (r'.*s$', 'NNS'),
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
...     (r'.*', 'NN')
... ]

>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245

This tagger assigns tags based on the spelling of words. Patterns are tried in order, so the first match wins; if nothing else matches, the catch-all pattern assigns the default tag (‘NN’). As a result, the evaluation score improves to about 20%.
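The first-match-wins behaviour can be reproduced with the plain re module (a sketch of the idea, not NLTK's actual RegexpTagger implementation; note the escaped dot in the number pattern):

```python
import re

# Same patterns as above; order matters because the first match wins.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),                      # catch-all default
]

def regexp_tag(word):
    for pattern, tag in patterns:
        if re.match(pattern, word):     # re.match anchors at the start
            return tag

print(regexp_tag('running'))   # 'VBG' -- .*ing$ fires before the catch-all
print(regexp_tag('reports'))   # 'NNS'
print(regexp_tag('13.5'))      # 'CD'
print(regexp_tag('jury'))      # 'NN'  -- only the catch-all matches
```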