Using context (6.1.5)

This example uses the previous word as a feature, in addition to the suffixes.

>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.DecisionTreeClassifier.train(train_set)
>>> 
>>> def pos_features(sentence, i):
...     features = {"suffix(1)": sentence[i][-1:],
...                 "suffix(2)": sentence[i][-2:],
...                 "suffix(3)": sentence[i][-3:]}
...     if i == 0:
...             features["prev-word"] = "<START>"
...     else:
...             features["prev-word"] = sentence[i-1]
...     return features
... 
>>> pos_features(brown.sents()[0],8)
{'suffix(3)': 'ion', 'prev-word': 'an', 'suffix(2)': 'on', 'suffix(1)': 'n'}
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> featuresets = []
>>> for tagged_sent in tagged_sents:
...     untagged_sent = nltk.tag.untag(tagged_sent)
...     for i, (word, tag) in enumerate(tagged_sent):
...             featuresets.append((pos_features(untagged_sent, i), tag))
... 
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.7891596220785678
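
To see what the context feature adds without loading the Brown corpus, the same feature extractor can be tried on a hand-written sentence. This is only a sketch: the sample sentence is made up for illustration, and no classifier is trained here.

```python
def pos_features(sentence, i):
    """Suffix features for word i, plus the previous word as context."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        # No previous word at the start of a sentence
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
    return features

# Hypothetical sample sentence (not taken from the Brown corpus)
sample = ["The", "jury", "said", "an", "investigation", "took", "place"]

first = pos_features(sample, 0)
fifth = pos_features(sample, 4)
print(first["prev-word"])
print(fifth)
```

The word at index 0 gets the "<START>" marker, and every other word gets its left neighbour, so the classifier can learn, for instance, that a word following "an" is likely to be a noun.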

Exercise: Chapter 4 (1-2)

1.

Just showing the help documents.

>>> help(str)
>>> help(list)
>>> help(tuple)

2.

Just compare the two help documents.

Tuple:
Help on class tuple in module __builtin__:

class tuple(object)
| tuple() -> empty tuple
| tuple(iterable) -> tuple initialized from iterable’s items
|
| If the argument is a tuple, the return value is the same object.
|
| Methods defined here:
|
| __add__(…)
| x.__add__(y) x+y
|
| __contains__(…)
| x.__contains__(y) y in x
|
| __eq__(…)
| x.__eq__(y) x==y
|
| __ge__(…)
| x.__ge__(y) x>=y
|
| __getattribute__(…)
| x.__getattribute__('name') x.name
|
| __getitem__(…)
| x.__getitem__(y) x[y]
|
| __getnewargs__(…)
|
| __getslice__(…)
| x.__getslice__(i, j) x[i:j]
|
| Use of negative indices is not supported.
|
| __gt__(…)
| x.__gt__(y) x>y
|
| __hash__(…)
| x.__hash__() hash(x)
|
| __iter__(…)
| x.__iter__() iter(x)
|
| __le__(…)
| x.__le__(y) x<=y
|
| __len__(…)
| x.__len__() len(x)
|
| __lt__(…)
| x.__lt__(y) x<y
|
| __mul__(…)
| x.__mul__(n) x*n
|
| __ne__(…)
| x.__ne__(y) x!=y
|
| __repr__(…)
| x.__repr__() repr(x)
|
| __rmul__(…)
| x.__rmul__(n) n*x
|
| __sizeof__(…)
| T.__sizeof__() — size of T in memory, in bytes
|
| count(…)
| T.count(value) -> integer -- return number of occurrences of value
|
| index(…)
| T.index(value, [start, [stop]]) -> integer — return first index of value.
| Raises ValueError if the value is not present.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __new__ =
| T.__new__(S, …) -> a new object with type S, a subtype of T

list:

Help on class list in module __builtin__:

class list(object)
| list() -> new empty list
| list(iterable) -> new list initialized from iterable’s items
|
| Methods defined here:
|
| __add__(…)
| x.__add__(y) x+y
|
| __contains__(…)
| x.__contains__(y) y in x
|
| __delitem__(…)
| x.__delitem__(y) del x[y]
|
| __delslice__(…)
| x.__delslice__(i, j) del x[i:j]
|
| Use of negative indices is not supported.
|
| __eq__(…)
| x.__eq__(y) x==y
|
| __ge__(…)
| x.__ge__(y) x>=y
|
| __getattribute__(…)
| x.__getattribute__('name') x.name
|
| __getitem__(…)
| x.__getitem__(y) x[y]
|
| __getslice__(…)
| x.__getslice__(i, j) x[i:j]
|
| Use of negative indices is not supported.
|
| __gt__(…)
| x.__gt__(y) x>y
|
| __iadd__(…)
| x.__iadd__(y) x+=y
|
| __imul__(…)
| x.__imul__(y) x*=y
|
| __init__(…)
| x.__init__(…) initializes x; see help(type(x)) for signature
|
| __iter__(…)
| x.__iter__() iter(x)
|
| __le__(…)
| x.__le__(y) x<=y
|
| __len__(…)
| x.__len__() len(x)
|
| __lt__(…)
| x.__lt__(y) x<y
|
| __mul__(…)
| x.__mul__(n) x*n
|
| __ne__(…)
| x.__ne__(y) x!=y
|
| __repr__(…)
| x.__repr__() repr(x)
|
| __reversed__(…)
| L.__reversed__() — return a reverse iterator over the list
|
| __rmul__(…)
| x.__rmul__(n) n*x
|
| __setitem__(…)
| x.__setitem__(i, y) x[i]=y
|
| __setslice__(…)
| x.__setslice__(i, j, y) x[i:j]=y
|
| Use of negative indices is not supported.
|
| __sizeof__(…)
| L.__sizeof__() — size of L in memory, in bytes
|
| append(…)
| L.append(object) — append object to end
|
| count(…)
| L.count(value) -> integer — return number of occurrences of value
|
| extend(…)
| L.extend(iterable) — extend list by appending elements from the iterable
|
| index(…)
| L.index(value, [start, [stop]]) -> integer — return first index of value.
| Raises ValueError if the value is not present.
|
| insert(…)
| L.insert(index, object) — insert object before index
|
| pop(…)
| L.pop([index]) -> item — remove and return item at index (default last).
| Raises IndexError if list is empty or index is out of range.
|
| remove(…)
| L.remove(value) — remove first occurrence of value.
| Raises ValueError if the value is not present.
|
| reverse(…)
| L.reverse() — reverse *IN PLACE*
|
| sort(…)
| L.sort(cmp=None, key=None, reverse=False) — stable sort *IN PLACE*;
| cmp(x, y) -> -1, 0, 1
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __hash__ = None
|
| __new__ =
| T.__new__(S, …) -> a new object with type S, a subtype of T

The biggest difference between a tuple and a list is whether the object is mutable. Therefore, most of what only a list can do is editing-related operations. For example:

__delitem__
__delslice__
__iadd__
__imul__
__reversed__
__setitem__
__setslice__
extend()
insert()
pop()
….

What can be done only with a tuple?

hash() (a list sets __hash__ = None, so it cannot be hashed)
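
The difference can be checked directly. The following is a small sketch with made-up values; it only shows that a tuple can be hashed (and so used as a dictionary key) while a list cannot, and that editing methods exist only on the list.

```python
point = (1, 2)      # tuple: immutable
coords = [1, 2]     # list: mutable

# A tuple is hashable, so it can serve as a dictionary key.
lookup = {point: "tuple key"}

# A list is unhashable; hash() raises TypeError.
try:
    hash(coords)
    list_hashable = True
except TypeError:
    list_hashable = False

# Editing operations such as append() exist only on the list.
coords.append(3)
tuple_has_append = hasattr(point, "append")

print(lookup[(1, 2)], list_hashable, tuple_has_append)
```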

Analysing Chinese words 2

I left out ConditionalFreqDist in the last article.

>>> ccfd = nltk.ConditionalFreqDist((c,v) for (c, v, tone) in ping_elements)
>>> ccfd.conditions()
['', 'b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh', 't', 'w', 'x', 'y', 'z', 'zh']
>>> ccfd['b']['a']
27
>>> ccfd['b']
<FreqDist with 16 samples and 613 outcomes>
>>> ccfd['b']['i']
61
>>> ccfd['b']['u']
134
>>> ccfd['b']['e']
0
>>> ccfd['b']['o']
33
>>> ccfd.tabulate()
      a   ai   an  ang   ao    e   ei   en  eng   er    i   ia  ian iang  iao   ie   in  ing iong   iu    o  ong   ou    u   ua  uai  uan uang   ue   ui   un   uo    v   ve
      2   29   29    1    7   13    0    1    0   48    0    0    0    0    0    0    0    0    0    0    1    0    7    0    0    0    0    0    0    0    0    0    0    0
....
>>> tcfd = nltk.ConditionalFreqDist((ping,tone) for (ping, tone) in ping_tone)
>>> tcfd.tabulate()
          0    1    2    3    4
     a    1    1    0    0    0
    ai    0    4    1    2   22
....
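
For readers without NLTK at hand, the idea behind a conditional frequency distribution can be sketched with the standard library: one counter per condition. This is a stand-in for nltk.ConditionalFreqDist, not its implementation, and the triples below are hypothetical samples rather than the real ping_elements data.

```python
from collections import Counter, defaultdict

# Hypothetical (consonant, vowel, tone) triples standing in for ping_elements
ping_elements = [("b", "a", "1"), ("b", "u", "4"), ("b", "u", "2"),
                 ("zh", "ong", "1"), ("", "ai", "4")]

# One Counter of vowels per consonant (the "condition")
ccfd = defaultdict(Counter)
for consonant, vowel, tone in ping_elements:
    ccfd[consonant][vowel] += 1

conditions = sorted(ccfd)
print(conditions)
print(ccfd["b"]["u"])   # vowel "u" seen twice after "b"
print(ccfd["b"]["e"])   # an unseen sample counts as zero
```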

After that I revised the code of pingyin_spliter().

import re

def pingyin_spliter(pingyin):
    # Lists of consonants / vowels for the final check
    consonants = ['b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k',
                  'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh', 't',
                  'w', 'x', 'y', 'z', 'zh', 'ng']
    vowels = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er',
              'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong',
              'iu', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan',
              'uang', 'ue', 'ui', 'un', 'uo', 'v', 've']

    # Split into: tone digit, "er", vowels (with optional n / ng ending), or consonants
    s_ping = re.findall(r"[0-9]|er|[aeiouv]+(?:ng|n)?|[^aeiouv0-9]+", pingyin.lower())

    # Check split results
    if len(s_ping) == 0 or len(s_ping) > 3:    # Invalid Pingyin
        raise Exception('Invalid Pingyin entered: %s' % str(s_ping))
    elif len(s_ping) == 1:
        if s_ping[0].isdigit():
            raise Exception('Invalid Pingyin entered: %s' % str(s_ping))
        else:
            s_ping.append('')
            s_ping.append('0')
    elif len(s_ping) == 2:
        if s_ping[-1].isdigit():
            # Only one sound before the tone: move the tone to the third slot
            s_ping.append('')
            s_ping[2] = s_ping[1]
            s_ping[1] = ''
        else:
            s_ping.append('0')    # Qingsheng (neutral tone)

    # Every entry should have 3 elements in s_ping at this point
    if not s_ping[-1].isdigit():
        raise Exception('Invalid Pingyin entered: %s' % str(s_ping))
    elif s_ping[0] in vowels and s_ping[1] == '':
        # Vowel-only syllable: shift it into the vowel slot
        s_ping[1] = s_ping[0]
        s_ping[0] = ''
    elif s_ping[0] in consonants and s_ping[1] in vowels:
        pass
    elif s_ping[0] == 'ng':
        s_ping[1] = ''
    else:
        raise Exception('Invalid Pingyin entered: %s' % str(s_ping))

    return s_ping

def split_multiple(m_ping):
    m_ping = m_ping.lower()
    r_ping = m_ping.split()
    return r_ping

def split_tone(pingyin):
    s_tone = re.findall(r"[0-9]$|[a-z]+", pingyin.lower())

    # No tone digit given: treat as neutral tone
    if len(s_tone) == 1 and not s_tone[0].isdigit():
        s_tone.append('0')
    # Exactly one alphabetic part followed by one tone digit is expected
    if len(s_tone) != 2:
        raise Exception(s_tone)
    if not (s_tone[0].isalpha() and s_tone[1].isdigit()):
        raise Exception(s_tone)

    return s_tone
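
The first step of pingyin_spliter(), the regular-expression split, can be tried on its own. The sketch below uses a variant of the pattern in which the nasal ending is written as an explicit optional group, (?:ng|n)?; the helper name raw_split and the sample syllables are only for illustration, and none of the later validation is performed.

```python
import re

# Alternatives, tried left to right at each position:
# tone digit | "er" | vowels with optional n/ng ending | run of consonants
PATTERN = r"[0-9]|er|[aeiouv]+(?:ng|n)?|[^aeiouv0-9]+"

def raw_split(pingyin):
    """Return the raw pieces of one syllable, before any validation."""
    return re.findall(PATTERN, pingyin.lower())

for syllable in ["zhong1", "xian4", "er2", "nv3", "ma", "a"]:
    print(syllable, raw_split(syllable))
```

A syllable without a tone digit, such as "ma", comes back with only two pieces, which is why the function pads the result to three elements afterwards.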

Structure of Python module (4.6)

When I ran into strange behavior, it took me some time to find the source code on my laptop because I didn't know about this attribute.

>>> nltk.metrics.distance.__file__
'/Library/Python/2.7/site-packages/nltk/metrics/distance.pyc'
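
The same lookup works for any imported module. As a sketch that does not need NLTK, the standard json package is used below; inspect.getsourcefile() is an alternative that returns the .py source path rather than a compiled .pyc file.

```python
import inspect
import json

# Where the module was loaded from
loaded_from = json.__file__
print(loaded_from)

# Path of the Python source file for the module
source_path = inspect.getsourcefile(json)
print(source_path)
```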

To get help on the module, use help(), which I already learned before.

>>> help(nltk.metrics.distance)

I got the following document:

Help on module nltk.metrics.distance in nltk.metrics:

NAME
nltk.metrics.distance – Distance Metrics.

FILE
/Library/Python/2.7/site-packages/nltk/metrics/distance.py

DESCRIPTION
Compute the distance between two items (usually strings).
As metrics, they must satisfy the following three requirements:

1. d(a, a) = 0
2. d(a, b) >= 0
3. d(a, c) <= d(a, b) + d(b, c)

>>> from nltk.metrics import binary_distance

>>> binary_distance(1,1)
0.0

>>> binary_distance(1,3)
1.0

custom_distance(file)

demo()

edit_distance(s1, s2)
Calculate the Levenshtein edit-distance between two strings.
The edit distance is the number of characters that need to be
substituted, inserted, or deleted, to transform s1 into s2. For
example, transforming “rain” to “shine” requires three steps,
consisting of two substitutions and one insertion:
“rain” -> “sain” -> “shin” -> “shine”. These operations could have
been done in other orders, but at least three steps are needed.

:param s1, s2: The strings to be analysed
:type s1: str
:type s2: str
:rtype int
….
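
To make the description concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance. It is not NLTK's implementation; the function name levenshtein is chosen only for this example. It reproduces the "rain" to "shine" figure quoted in the help text.

```python
def levenshtein(s1, s2):
    """Number of single-character substitutions, insertions, or deletions
    needed to turn s1 into s2."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(levenshtein("rain", "shine"))
print(levenshtein("rain", "rain"))
```

The three metric requirements listed in the description can be spot-checked with it: the distance from a string to itself is 0, no distance is negative, and going through an intermediate string ("rain" to "shin" to "shine") is never shorter than going directly.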

Note:
If the name of a function or variable starts with an underscore (_), it is not imported by import *.

from module import *
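
This can be checked without creating files. The sketch below builds a throwaway module in memory; the module name demo_mod and its two functions are hypothetical, and the module defines no __all__, which would otherwise control what import * brings in.

```python
import sys
import types

# Build a throwaway module in memory (demo_mod is a made-up name).
demo = types.ModuleType("demo_mod")
source = (
    "def public():\n"
    "    return 'public'\n"
    "\n"
    "def _private():\n"
    "    return 'private'\n"
)
exec(source, demo.__dict__)
sys.modules["demo_mod"] = demo

# Run the star import inside a separate namespace and see what arrives.
namespace = {}
exec("from demo_mod import *", namespace)

print("public" in namespace)
print("_private" in namespace)
```

The underscore-prefixed function is still reachable explicitly, for example as demo_mod._private after a plain import demo_mod.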

The function find_words() has 3 parameters. Because the third parameter, result, has a default value (=[]), it can be omitted when calling the function.

>>> def find_words(text, wordlength, result=[]):
...     for word in text:
...             if len(word) == wordlength:
...                     result.append(word)
...     return result
... 
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3)
['omg', 'teh', 'teh', 'mat']
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],2,['ur'])
['ur', 'on']
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3)
['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat']
>>> 

In the first call, "result" is omitted and an empty list is created. In the second call, ['ur'] is provided, so the entry is added to that existing list. In the last call, "result" is omitted again, but a new empty list is NOT created: the default value is evaluated only once, when the function is defined, so the list from the first call is reused. As a result, the entries are duplicated.
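
A common Python idiom for avoiding the shared default (not the approach taken in these notes) is to default to None and create the list inside the function. A minimal sketch:

```python
def find_words(text, wordlength, result=None):
    # Create a fresh list on every call in which result is omitted
    if result is None:
        result = []
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

words = ['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat']
first = find_words(words, 3)
second = find_words(words, 2, ['ur'])
third = find_words(words, 3)
print(first)
print(second)
print(third)
```

With this version the third call returns the same four words as the first call, because nothing is carried over between calls.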

In a real program that calls the same function multiple times, I would not omit the parameter but do it like this.

>>> result_l = []
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3, result_l)
['omg', 'teh', 'teh', 'mat']
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],2,result_l)
['omg', 'teh', 'teh', 'mat', 'on']
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3, result_l)
['omg', 'teh', 'teh', 'mat', 'on', 'omg', 'teh', 'teh', 'mat']
>>> 

Or, if the results need to be kept separate, do it like this:

>>> result_l = []
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3, result_l)
['omg', 'teh', 'teh', 'mat']
>>> result_m = ['ur']
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],2, result_m)
['ur', 'on']
>>> result_n = []
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'],3, result_n)
['omg', 'teh', 'teh', 'mat']

How to debug:

Use pdb.

>>> import pdb
>>> find_words(['cat'],3)
['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']
>>> pdb.run("find_words(['dog'],3)")
> <string>(1)<module>()
(Pdb) step
--Call--
> <stdin>(1)find_words()
(Pdb) args
text = ['dog']
wordlength = 3
result = ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']
(Pdb) next
> <stdin>(2)find_words()
(Pdb) continue
>>>
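
Besides stepping through with pdb, the accumulated default can be inspected directly, since a function keeps its default values in the __defaults__ attribute. A small sketch using the same find_words() definition:

```python
def find_words(text, wordlength, result=[]):
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

before = find_words.__defaults__[0][:]   # copy of the default list before any call
find_words(['cat'], 3)
find_words(['dog'], 3)
after = find_words.__defaults__[0]       # the same list object, now modified

print(before)
print(after)
```

The default list starts empty and grows with every call that omits "result", which matches what the args command shows inside pdb.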