Writing Structured Programs (4.1)

As I could not see the end of the Chapter 3 exercises, I decided to move on to Chapter 4 of the whale book.

Now, back to basics (Chapter 4.1).

>>> foo = "Monty"
>>> bar = foo
>>> foo = "Python"
>>> bar
'Monty'
>>> foo
'Python'

I have seen a similar example in some other books.

>>> foo = ['Monty', 'Python']
>>> bar = foo
>>> foo[1] = 'Bodkin'
>>> bar
['Monty', 'Bodkin']
>>> 

How should I understand this?

>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('Python')
>>> nested
[['Python'], ['Python'], ['Python']]
>>> empty
['Python']

nested[1] refers to the same list object as empty. Appending through nested[1] therefore updates empty itself, and the change is visible through all three elements.
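A quick sketch of the difference between aliasing and copying (my own experiment, not from the book): assignment creates another name for the same object, while a slice creates an independent shallow copy.

```python
foo = ['Monty', 'Python']
bar = foo     # alias: both names point at the same list object
baz = foo[:]  # slice: a new, independent shallow copy

foo[1] = 'Bodkin'
# bar sees the change, baz does not:
# bar == ['Monty', 'Bodkin'], baz == ['Monty', 'Python']
```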

>>> nested = [[]] * 3
>>> nested[1].append('Shanghai')
>>> nested
[['Shanghai'], ['Shanghai'], ['Shanghai']]
>>> id(nested[0])
4347110320
>>> id(nested[1])
4347110320
>>> id(nested[2])
4347110320

This means they are all the same list object. However, I saw different behavior in this example.

>>> nested = [[]] * 3
>>> nested
[[], [], []]
>>> empty
['Python']
>>> nested[1] = 'Shanghai'
>>> nested
[[], 'Shanghai', []]

In this case I simply made a mistake and assigned the value directly instead of using append(). Assignment rebinds nested[1] to a new object, so the other elements were not affected. The textbook shows exactly the same situation.

>>> nested = [[]] * 3
>>> nested[1].append('Shanghai')
>>> nested[1] = ['Hong Kong']
>>> nested
[['Shanghai'], ['Hong Kong'], ['Shanghai']]
>>> 
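If the goal is three independent empty lists, a list comprehension avoids the sharing that `[[]] * 3` causes (my own note, not from the textbook):

```python
# each iteration of the comprehension builds a fresh list object
nested = [[] for _ in range(3)]
nested[1].append('Shanghai')
# only the middle list changes: [[], ['Shanghai'], []]
```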

Usage of ‘==’ and ‘is’: what is the difference?

>>> size = 5
>>> python = ['python']
>>> snake_nest = [python] * size
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
True
>>>

No difference?

>>> import random
>>> position = random.choice(range(size))
>>> snake_nest[position] = ['python']
>>> snake_nest
[['python'], ['python'], ['python'], ['python'], ['python']]
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
False
>>> position
3
>>> [id(snake) for snake in snake_nest]
[4347112192, 4347112192, 4347112192, 4347110176, 4347112192]
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[4] 
True
>>>  

According to the results above, ‘==’ compares values, while ‘is’ compares object identity (the same thing id() reports).
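A minimal check of that distinction (my own experiment): two separately built lists are equal but not identical, while a second name for the same list is both.

```python
a = ['python']
b = ['python']  # equal value, but a separately built object
c = a           # just another name for the same object

assert a == b        # values compare equal
assert a is not b    # two distinct objects
assert a is c        # identical: same id
assert id(a) == id(c)
```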

For conditionals, Python's format is if/elif/else. In an if/elif chain, only the first branch whose condition is true runs.

>>> animals = ['cat', 'dog']
>>> if 'cat' in animals:
...     print 1
... elif 'dog' in animals:
...     print 2
... 
1
>>> if 'cat' in animals:
...     print 1
... 
1
>>> if 'dog' in animals:
...     print 2
... 
2

all() and any() are new to me.

>>> sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']
>>> all(len(w) > 4 for w in sent)
False
>>> any(len(w) > 4 for w in sent)
True
>>> 
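The way I understand all()/any(): they consume any iterable of truth values and stop as soon as the answer is decided. A sketch of all() spelled out as a loop (my own experiment):

```python
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

def my_all(iterable):
    """Loop equivalent of all(): fail fast on the first false value."""
    for x in iterable:
        if not x:
            return False
    return True

# 'No' already fails len > 4, so the scan can stop immediately
assert not my_all(len(w) > 4 for w in sent)
# 'anywhere' passes, so any() is True
assert any(len(w) > 4 for w in sent)
```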

Exercise: Chapter 3 (18-21)

18.

>>> import nltk, re
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
>>> words = nltk.word_tokenize(text)
>>> wh_words = sorted(set(w for w in words if re.search(r'^wh', w.lower())))
>>> for word in wh_words:
...     print word
... 
WHALE
WHALE-FISHERY.
WHALE-SHIP
WHALE.
WHALEBONE
WHALEMAN
WHALES
WHALESHIPS
WHALING
WHALING.
WHARTON
WHAT
WHEN
WHERE
WHICH
WHIFF
WHITE
WHOEL
Whale
Whale's
Whale's.
Whale-Bones
Whale-balls
Whale-bone
Whale-ship
Whale-ships
Whale-teeth
Whale.
Whalebone
Whaleman
Whalemen
Whaler
Whales
Whales.
Whaling
Whaling.
What
What's
Whatever
Wheelbarrow.
Whelped
When
Whence
Whenever
Where
Where-away
Whereas
Wherefore
Wherein
Whereupon
Whether
Whew
Which
While
Whilst
Whirlpooles
Whisper
White
Whitehall
Whiteness
Whitsuntide
Who
Who's
Who-e
Whole
Whom
Whose
Whosoever
Why
whale
whale's
whale-boat
whale-boat.
whale-boats
whale-bone
whale-books.
whale-craft
whale-cruisers
whale-cry
whale-e
whale-fastener
whale-fish
whale-fishers
whale-fishery
whale-fleet.
whale-ground
whale-hater
whale-hunt
whale-hunter
whale-hunters
whale-hunters.
whale-hunting
whale-jets
whale-killer
whale-lance
whale-lance.
whale-line
whale-line.
whale-lines
whale-lines.
whale-naturalists
whale-pike
whale-pole
whale-ports
whale-ship
whale-ship.
whale-ships
whale-smitten
whale-spades
whale-spout
whale-steak
whale-surgeon
whale-trover
whale-wise
whale.
whale.*
whaleboat
whaleboats
whalebone
whalebone.
whaleboning
whaled
whaleman
whaleman's
whaleman.
whalemen
whalemen's
whalemen.
whaler
whaler.
whalers
whalers.
whales
whales.
whaleship
whaleships
whalesmen
whaling
whaling-craft
whaling-fleet
whaling-pike
whaling-scenes
whaling-ships
whaling-spade
whaling-spades
whaling-vessels
whaling-voyage
whaling.
whang
wharf
wharf.
wharves
wharves.
what
what's
what.
whatever
whatsoever
whatsoever.
wheat
wheat.
wheel
wheel-spokes
wheel.
wheelbarrow
wheeled
wheeling
wheels
wheezing
whelm
whelmed
whelmed.
whelmings
when
whence
whencesoe'er
whenever
where
where'er
where.
whereas
whereat
whereby
wherefore
wherein
whereof
whereon
wheresoe'er
whereto
whereupon
wherever
wherewith
whether
whets
whetstone
whetstones
whew
which
whichever
whiff
whiffs
while
while.
whim
whimsicalities
whimsiness
whip
whipped
whipping
whips
whirl
whirl.
whirled
whirling
whirlpool
whirls
whirlwinds
whisker
whiskers
whiskey
whisper
whispered
whispering
whisperingly
whispers
whispers.
whist-tables
whistle
whistled
whistling
whistlingly
whit
white
white-ash
white-bearded
white-bone
white-elephant
white-fire
white-headed
white-horse
white-lead
white-shrouded
white-turbaned
whitened
whiteness
whiteness.
whitenesses
whites
whitest
whitewashed
whither
whitish
whitish.
whittled
whittling
whittling.
whizzings
who
who-ee
whoever
whole
whole.
wholesome
wholly
whom
whooping
whose
whosoever
why
why.
>>> 

Some words appear more than once because of differences in case or a trailing dot (.) after the word.
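Those near-duplicates could be collapsed by normalizing each token first: lowercase it and strip trailing punctuation. A sketch on a few of the words above (the normalization step is my own addition, not part of the exercise):

```python
import re

words = ['WHALE', 'Whale', 'whale', 'whale.', 'Whaling', 'whaling.']
# lowercase, then strip trailing non-word characters such as '.'
normalized = sorted(set(re.sub(r'\W+$', '', w.lower()) for w in words))
# ['whale', 'whaling']
```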

19.

>>> nlist = open('word_number.txt').readlines()
>>> nlist
['fuzzy 53\n', 'funny 44\n', 'future 65\n', 'fun 12\n', 'gun 48\n', 'music 33\n', 'punk 21\n', 'quick 9\n', 'run 71\n', 'sun 42\n', 'tunnel 18\n']
>>> slist = re.split(r' ', nlist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 167, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or buffer

re.split() cannot be applied to a list, only to a string. So use a for loop instead.

>>> nlist2 = []
>>> for element in nlist:
...     nlist2.append(re.split(r' ', element))
... 
>>> nlist2
[['fuzzy', '53\n'], ['funny', '44\n'], ['future', '65\n'], ['fun', '12\n'], ['gun', '48\n'], ['music', '33\n'], ['punk', '21\n'], ['quick', '9\n'], ['run', '71\n'], ['sun', '42\n'], ['tunnel', '18\n']]
>>> for element in nlist2:
...     element[1] = int(element[1])
... 
>>> nlist2
[['fuzzy', 53], ['funny', 44], ['future', 65], ['fun', 12], ['gun', 48], ['music', 33], ['punk', 21], ['quick', 9], ['run', 71], ['sun', 42], ['tunnel', 18]]
>>> 
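str.split() with no argument would do both steps at once: it splits on any whitespace and drops the trailing '\n', so the int() conversion needs no cleanup. The pairs can then be sorted by the number. A sketch using a few made-up lines standing in for word_number.txt:

```python
# sample data in the same 'word number\n' format as the file
lines = ['fuzzy 53\n', 'funny 44\n', 'quick 9\n']

pairs = []
for line in lines:
    word, number = line.split()      # split() also strips the '\n'
    pairs.append([word, int(number)])

pairs.sort(key=lambda p: p[1])       # sort by the integer field
# [['quick', 9], ['funny', 44], ['fuzzy', 53]]
```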

20.

Use this URL: http://weather.yahoo.com/china/shanghai/shanghai-2151849/

>>> from urllib import urlopen
>>> url = 'http://weather.yahoo.com/china/shanghai/shanghai-2151849/'
>>> html = urlopen(url).read()

Then remove the HTML tags.

>>> raw = nltk.clean_html(html)
>>> raw.index('Today')
3308
>>> raw[3308:3408]
'Today Mostly Cloudy High 83° High 28° Low 70° Low 21°  Tomorrow Scattered Thunde'
>>> 

Find the index of the word ‘Today’, then take 100 characters starting from that index.

Today is mostly cloudy and the high will be 83°F/28°C. We are still in May, aren’t we? It is too hot!!!
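A note for anyone trying this later: nltk.clean_html() was removed in NLTK 3 (it now tells you to use BeautifulSoup instead). As a rough offline stand-in, a regex can strip tags from a small snippet — a crude sketch, not a real HTML parser, and the sample string is made up:

```python
import re

def strip_tags(html):
    """Crudely drop anything that looks like a tag, then collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()

sample = '<p>Today <b>Mostly Cloudy</b> High 83</p>'
strip_tags(sample)  # 'Today Mostly Cloudy High 83'
```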

21.

I used this web page as a sample:
Yuan gains strength as PBOC sets record rate

Just doing some small tests before writing functions:

>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> html = urlopen(url).read()
>>> raw = nltk.clean_html(html)
>>> raw
"Yuan gains strength as PBOC sets record rate -- Shanghai Daily | \xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5 -- English Window to China New \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n \r\n \r\n \r\n\r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t Yuan gains strength as PBOC sets record rate http://www.shanghaidaily.com/article/?id=531420&type=Business \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t \r\n\r\n   \r\n\t\t Mobile Version | \r\n\t\t\tSaturday, 25 May, 2013 | Last updated 18 minutes ago\r\n\t\t \r\n\t\t\r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t\t\r\n\t\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t\t Metro \r\n\t\t \r\n\t\t \r\n\t\t\t Business \r\n\t\t \r\n\t\t \r\n\t\t\t National \r\n\t\t \r\n\t\t \r\n\t\t\t World \r\n\t\t \r\n\t\t \r\n\t\t\t Sports \r\n\t\t \r\n\t\t \r\n\t\t\t Feature \r\n\t\t \r\n\t\t \r\n\t\t\t Opinion \r\n\t\t \r\n\t\t \r\n\t\t\t V IBE \r\n\t\t \r\n\t\t \r\n\t\t\t i DEAL \r\n\t\t \r\n\t\t\t \r\n\t\t\t PDF \r\n\t\t \r\n\t\t \r\n\t\t\t Gallery \r\n\t\t \r\n\t\t \r\n\t\t\t  \r\n\t\t \r\n\t\t  \r\n\t\t\r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\r\n\t \r\n\t\t \r\n\t\t RSS | MMS Newspaper | Newsletter \r\n\t\t \r\n\t \r\n \r\n\r\n\t \r\n\t \r\n\t\r\n\t\t\r\n\t\t \r\n\t\t\t Business | Economy \r\n\t\t\t Yuan gains strength as PBOC sets record rate \r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\tBy Feng Jianmin | \r\n\t\t\t\t2013-5-25 | \r\n\t\t\t\t\r\n\t\t\t\t NEWSPAPER EDITION\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\t \r\n\t\t \r\n\t\tThe story appears on \r\n\t\t Page A7 \r\n\t\t \r\n\t\tMay 25, 2013\r\n\t\t \r\n\t\tFree for subscribers\r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t Shopping Cart \r\n\t\t \r\n\t \r\n\t \r\n\t\r\n Reading Tools \r\n \r\n  Email Story \r\n  Printable View \r\n  Blog Story \r\n  Copy Headline/URL \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\r\n\r\n \r\n \r\n \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t 
Keywords \r\n\t\t\t\t \r\n\t\t\t\t Financial crisis \r\n\t\t\t\t \r\n\t\t\t\t 3G network \r\n\t\t\t\t \r\n\t\t\t\t Shanghai stock market \r\n\t\t\t\t \r\n\t\t\t\t Housing price \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t Related Stories \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t Huiyuan to buy unit from its chairman \r\n\t\t\t\t 2013-5-25 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan extends rising streak \r\n\t\t\t\t 2013-5-4 0:07:42 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Yuan band widening in the works \r\n\t\t\t\t 2013-4-20 0:42:09 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng's popularity pushes demand for home br... \r\n\t\t\t\t 2013-3-30 1:18:33 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t Peng Liyuan steals hearts on first trip \r\n\t\t\t\t 2013-3-23 1:45:11 \r\n\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t Read More \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t\t\r\n\t\t\t\t\r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t\r\n\r\n\t\t\t\t\r\n\t\t\t\r\n\t\r\n      \r\n\t\t\t\r\n\t\t\t\r\n\r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\r\n \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\tTHE Chinese yuan yesterday strengthened above the 6.13 level against the US dollar on the spot market for the first time in 19 years after the central bank set a record reference rate for the currency and Premier Li Keqiang reiterated the country was making progress in opening up its capital account. The yuan closed at 6.1316 per dollar in Shanghai yesterday, 0.04 percent stronger than Thursday, according to the China Foreign Exchange Trade System. The yuan touched an intraday high of 6.1279, the strongest since the government unified the official and market rates at the end of 1993. The People's Bank of China raised the central parity rate by 0.13 percent to 6.1867 per US dollar yesterday before the market opened. 
It was the third time that the PBOC had raised the daily fixing to a record in a week, guiding the market rate up 0.17 percent from May 17. The nation is steadily pushing forward market-oriented reforms on interest rates and capital-account convertibility, Premier Li said in a signed article in Neue Zuricher Zeitung, a German-language Swiss newspaper, on Thursday. \r\n\t\t\t\t \r\n\t\t\t\t  \r\n\t\t\t\t\r\n\t\t\t\t\t\t\t \r\n\t\t\t\t\r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\t \r\n\t\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Email Story \r\n\t\t\t\t  Printable View \r\n\t\t\t\t  Blog Story \r\n\t\t\t\t  Copy Headline/URL \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t      \r\n \r\n\t\t\t\r\n\t\t\t\r\n\t\t\t \r\n\r\n\t\t \r\n\t\t \r\n\t\t\r\n\t\r\n\t \r\n\t\r\n\t \r\n\t\t\r\n\r\n \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t News text \r\n\t\t\t News title \r\n\t\t\t Photo captions \r\n\t\t\t Live in Shanghai \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t Advanced Search \r\n \r\n \r\n \r\n \r\n \r\n  Our Products \r\n \r\n  \r\n\t\t \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t \r\n\t\t \r\n\t\t\t Home \r\n\t\t\t\tDelivery \r\n\t\t\t Online \r\n\t\t\t\tAccount \r\n\t\t\t Amazon \r\n\t\t\t\tKindle \r\n\t\t\t iPhone \r\n\t\t\t\tApp \r\n\t\t\t iPad \r\n\t\t\t\tApp \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t\t \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t\t  \r\n\t\t \r\n\t\t \r\n\t\t\t Blackberry Phone App \r\n\t\t\t PlayBook \r\n\t\t\t\tApp \r\n\t\t\t Android \r\n\t\t\t\tApp \r\n\t\t\t Windows Phone App \r\n\t\t\t MMS \r\n\t\t\t\t \xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5 \r\n\t\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n  \r\n\r\n\r\n \r\n\r\n\t\r\n\t \r\n\t\t \r\n\t \r\n\r\n\t \r\n\t \r\n \r\n \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t \r\n\t   \r\n\t \r\n \r\n\t \r\n\t\t \r\n\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t Metro \r\n\t\t\t\t \r\n\t\t\t\t\t 
Aging , Crime and public security , Education , Health , Traffic , Urban construction , Weather ... \r\n\t\t\t\t \r\n\t\t\t\t Business \r\n\t\t\t\t \r\n\t\t\t\t\t Banking , Energy , Foreign investment , Insurance , Macro-economy and policy , Real estate , Securities ... \r\n\t\t\t\t \r\n\t\t\t\t National \r\n\t\t\t\t \r\n\t\t\t\t World \r\n\t\t\t\t \r\n\t\t\t\t Odd \r\n\t\t\t\t \r\n\t\t\t\t Districts \r\n\t\t\t\t \r\n\t\t\t\t\t Changning , Hongkou , Huangpu , Jing'an , Luwan , Minhang , Pudong , Putuo , Xuhui , Yangpu , Zhabei ... \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Sport \r\n\t\t\t\t \r\n\t\t\t\t\t Basketball , Boxing , Cricket , Golf , Gymnastics , Ice hockey , Olympics , Rugby union , Soccer , Tennis ... \r\n\t\t\t\t \r\n\t\t\t\t Feature \r\n\t\t\t\t \r\n\t\t\t\t\t Art , City Style , Culture and history , Expat Tales , Fashion , Home Deco , Literature , Music , Stage , Travel ... \r\n\t\t\t\t \r\n\t\t\t\t Opinion \r\n\t\t\t\t \r\n\t\t\t\t\t Chinese perspectives , Foreign perspectives , Shanghai Daily columnists \r\n\t\t\t\t \r\n\t\t\t\t Sunday \r\n\t\t\t\t \r\n\t\t\t\t\t Animal Planet , Book , City Scene , Cover , Film , Food , Home and Deco , Now and Then , People , Style ... 
\r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t\t\t Supplement \r\n\t\t\t\t \r\n\t\t\t\t Downloads \r\n\t\t\t\t \r\n\t\t\t\t\t PDF , eMagazine \r\n\t\t\t\t \r\n\t\t\t\t Gallery \r\n\t\t\t\t \r\n\t\t\t\t\t Photos , Cartoons , HD Photo Album \r\n\t\t\t\t \r\n\t\t\t\t Blogs \r\n\t\t\t\t \r\n\t\t\t\t\t Buzzword and Shanghai Talk , Word on Street , Team Blog \r\n\t\t\t\t \r\n\t\t\t\t Services \r\n\t\t\t\t \r\n\t\t\t\t\t Subscribe, Advertising Info , Contact Us , RSS Center \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t FEATURED SITES \r\n\t\t\t\t \r\n\t\t\t\t Campus \r\n\t\t\t\t \r\n\t\t\t\t\t Learning , Careers , Students' Club , Prize English , Sense & Simplicity \r\n\t\t\t\t \r\n\t\t\t\t Mini sites \r\n\t\t\t\t \r\n\t\t\t\t\t Undiscovered Zhoushan , Minhang today www.maicaipian.com , Science Podcasting , Elegant Rhythms from the East \r\n\t\t\t\t \r\n\t\t\t \r\n\t\t\t \r\n\t\t \r\n\t \r\n \r\n\t \r\n\t\t\r\n\t \r\n\t \r\n\t\t @ CONTACT US |  BACK TO TOP \r\n\t \r\n\t \r\n \r\n\r\n\r\n\t \r\n\t \r\n\t\t Metro \r\n\t\t World \r\n\t\t National \r\n\t\t Business \r\n\t\t Sports \r\n\t\t Feature \r\n\t\t Opinion \r\n\t \r\n\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t\t \r\n\t \r\n\t \r\n\t \r\n\t\t \r\n\t\t\t About Shanghai Daily | \r\n\t\t\t About US 5.0 New | \r\n\t\t\t Advertising | \r\n\t\t\t Term of Use | \r\n\t\t\t RSS | \r\n\t\t\t Privacy Policy | \r\n\t\t\t Contact US | \r\n\t\t\t Shanghai World Expo \r\n\t\t \r\n\t\t \xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403 | \xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346 | \xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7 | \xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012 \r\n\t\t Copyright \xc2\xa9 
2001- Shanghai Daily Publishing House. All rights reserved."
>>> 

Splitting?

>>> words = re.split(r'\s', raw)
>>> words
['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '--', 'Shanghai', 'Daily', '|', '\xe4\xb8\x8a\xe6\xb5\xb7\xe6\x97\xa5\xe6\x8a\xa5', '--', 'English', 'Window', 'to', 'China', 'New', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http://www.shanghaidaily.com/article/?id=531420&type=Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mobile', 'Version', '|', '', '', '', '', '', 'Saturday,', '25', 'May,', '2013', '|', 'Last', 'updated', '18', 'minutes', 'ago', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sports', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'V', 'IBE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'i', 'DEAL', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', 'PDF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'RSS', '|', 'MMS', 'Newspaper', '|', 'Newsletter', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '|', 'Economy', '', '', '', '', '', '', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'By', 'Feng', 'Jianmin', '|', '', '', '', '', '', '', '2013-5-25', '|', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NEWSPAPER', 'EDITION', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'The', 'story', 'appears', 'on', '', '', '', '', '', 'Page', 'A7', '', '', '', '', '', '', '', '', '', 'May', '25,', '2013', '', '', '', '', '', '', '', '', 'Free', 'for', 'subscribers', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shopping', 'Cart', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Reading', 'Tools', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', 'Printable', 'View', '', '', '', '', 'Blog', 'Story', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Keywords', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Financial', 'crisis', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '3G', 'network', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Shanghai', 'stock', 'market', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Housing', 'price', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Related', 'Stories', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '', '', '', '', '', '', '', '2013-5-25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'extends', 'rising', 'streak', '', '', '', '', '', '', '', '2013-5-4', '0:07:42', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '', '', '', '', '', '', '', '2013-4-20', '0:42:09', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', "Peng's", 'popularity', 'pushes', 'demand', 'for', 'home', 'br...', '', '', '', '', '', '', '', '2013-3-30', '1:18:33', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '', '', '', '', '', '', '', '2013-3-23', '1:45:11', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Read', 'More', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6.13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 'capital', 'account.', 'The', 'yuan', 'closed', 'at', '6.1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday,', '0.04', 'percent', 'stronger', 'than', 'Thursday,', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System.', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6.1279,', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993.', 'The', "People's", 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0.13', 'percent', 'to', '6.1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened.', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week,', 'guiding', 'the', 'market', 'rate', 'up', '0.17', 'percent', 'from', 'May', '17.', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market-oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital-account', 'convertibility,', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung,', 'a', 'German-language', 'Swiss', 'newspaper,', 'on', 'Thursday.', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Email', 'Story', '', '', '', '', '', '', '', '', 'Printable', 'View', '', '', '', '', '', '', '', '', 'Blog', 'Story', '', '', '', '', '', '', '', '', 'Copy', 'Headline/URL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'News', 'text', '', '', '', '', '', '', 'News', 'title', '', '', '', '', '', '', 'Photo', 'captions', '', '', '', '', '', '', 'Live', 'in', 'Shanghai', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Advanced', 'Search', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Our', 'Products', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Home', '', '', '', '', '', '', 'Delivery', '', '', '', '', '', '', 'Online', '', '', '', '', '', '', 'Account', '', '', '', '', '', '', 'Amazon', '', '', '', '', '', '', 'Kindle', '', '', '', '', '', '', 'iPhone', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'iPad', '', '', '', '', '', '', 'App', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blackberry', 'Phone', 'App', '', '', '', '', '', '', 'PlayBook', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Android', '', '', '', '', '', '', 'App', '', '', '', '', '', '', 'Windows', 'Phone', 'App', '', '', '', '', '', '', 'MMS', '', '', '', '', '', '', '', '\xe6\x89\x8b\xe6\x9c\xba\xe6\x8a\xa5', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Aging', ',', 'Crime', 'and', 'public', 'security', ',', 'Education', ',', 'Health', ',', 'Traffic', ',', 'Urban', 'construction', ',', 'Weather', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Business', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Banking', ',', 'Energy', ',', 'Foreign', 'investment', ',', 'Insurance', ',', 'Macro-economy', 'and', 'policy', ',', 'Real', 'estate', ',', 'Securities', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'National', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'World', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Odd', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Districts', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Changning', ',', 'Hongkou', ',', 'Huangpu', ',', "Jing'an", ',', 'Luwan', ',', 
'Minhang', ',', 'Pudong', ',', 'Putuo', ',', 'Xuhui', ',', 'Yangpu', ',', 'Zhabei', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sport', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Basketball', ',', 'Boxing', ',', 'Cricket', ',', 'Golf', ',', 'Gymnastics', ',', 'Ice', 'hockey', ',', 'Olympics', ',', 'Rugby', 'union', ',', 'Soccer', ',', 'Tennis', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Feature', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Art', ',', 'City', 'Style', ',', 'Culture', 'and', 'history', ',', 'Expat', 'Tales', ',', 'Fashion', ',', 'Home', 'Deco', ',', 'Literature', ',', 'Music', ',', 'Stage', ',', 'Travel', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Chinese', 'perspectives', ',', 'Foreign', 'perspectives', ',', 'Shanghai', 'Daily', 'columnists', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Sunday', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Animal', 'Planet', ',', 'Book', ',', 'City', 'Scene', ',', 'Cover', ',', 'Film', ',', 'Food', ',', 'Home', 'and', 'Deco', ',', 'Now', 'and', 'Then', ',', 'People', ',', 'Style', '...', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Supplement', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Downloads', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'PDF', ',', 'eMagazine', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Gallery', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Photos', ',', 'Cartoons', ',', 'HD', 'Photo', 'Album', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Blogs', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Buzzword', 'and', 'Shanghai', 'Talk', ',', 'Word', 'on', 
'Street', ',', 'Team', 'Blog', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Services', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Subscribe,', 'Advertising', 'Info', ',', 'Contact', 'Us', ',', 'RSS', 'Center', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FEATURED', 'SITES', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Campus', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Learning', ',', 'Careers', ',', "Students'", 'Club', ',', 'Prize', 'English', ',', 'Sense', '&', 'Simplicity', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Mini', 'sites', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Undiscovered', 'Zhoushan', ',', 'Minhang', 'today', 'www.maicaipian.com', ',', 'Science', 'Podcasting', ',', 'Elegant', 'Rhythms', 'from', 'the', 'East', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '@', 'CONTACT', 'US', '|', '', 'BACK', 'TO', 'TOP', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Metro', '', '', '', '', '', 'World', '', '', '', '', '', 'National', '', '', '', '', '', 'Business', '', '', '', '', '', 'Sports', '', '', '', '', '', 'Feature', '', '', '', '', '', 'Opinion', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'About', 'Shanghai', 'Daily', '|', '', '', '', '', '', '', 'About', 'US', '5.0', 'New', '|', '', '', '', '', '', '', 'Advertising', '|', '', '', '', '', '', '', 'Term', 'of', 'Use', '|', '', '', '', '', '', '', 'RSS', '|', '', '', '', '', '', '', 'Privacy', 'Policy', '|', '', '', '', '', '', '', 'Contact', 'US', '|', '', '', '', '', '', '', 'Shanghai', 'World', 
'Expo', '', '', '', '', '', '', '', '', '', '', '\xe6\xb2\xaaICP\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaICP\xe5\xa4\x8705050403', '|', '\xe7\xbd\x91\xe7\xbb\x9c\xe8\xa7\x86\xe5\x90\xac\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a0909346', '|', '\xe5\xb9\xbf\xe6\x92\xad\xe7\x94\xb5\xe8\xa7\x86\xe8\x8a\x82\xe7\x9b\xae\xe5\x88\xb6\xe4\xbd\x9c\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaa\xe5\xad\x97\xe7\xac\xac354\xe5\x8f\xb7', '|', '\xe5\xa2\x9e\xe5\x80\xbc\xe7\x94\xb5\xe4\xbf\xa1\xe4\xb8\x9a\xe5\x8a\xa1\xe7\xbb\x8f\xe8\x90\xa5\xe8\xae\xb8\xe5\x8f\xaf\xe8\xaf\x81\xef\xbc\x9a\xe6\xb2\xaaB2-20120012', '', '', '', '', '', 'Copyright', '\xc2\xa9', '2001-', 'Shanghai', 'Daily', 'Publishing', 'House.', 'All', 'rights', 'reserved.']
>>> 

Or should I use findall() directly?

>>> words2 = re.findall(r'\w+', raw)
>>> words2
['Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'Shanghai', 'Daily', 'English', 'Window', 'to', 'China', 'New', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'http', 'www', 'shanghaidaily', 'com', 'article', 'id', '531420', 'type', 'Business', 'Mobile', 'Version', 'Saturday', '25', 'May', '2013', 'Last', 'updated', '18', 'minutes', 'ago', 'Metro', 'Business', 'National', 'World', 'Sports', 'Feature', 'Opinion', 'V', 'IBE', 'i', 'DEAL', 'PDF', 'Gallery', 'RSS', 'MMS', 'Newspaper', 'Newsletter', 'Business', 'Economy', 'Yuan', 'gains', 'strength', 'as', 'PBOC', 'sets', 'record', 'rate', 'By', 'Feng', 'Jianmin', '2013', '5', '25', 'NEWSPAPER', 'EDITION', 'The', 'story', 'appears', 'on', 'Page', 'A7', 'May', '25', '2013', 'Free', 'for', 'subscribers', 'Shopping', 'Cart', 'Reading', 'Tools', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'Keywords', 'Financial', 'crisis', '3G', 'network', 'Shanghai', 'stock', 'market', 'Housing', 'price', 'Related', 'Stories', 'Huiyuan', 'to', 'buy', 'unit', 'from', 'its', 'chairman', '2013', '5', '25', 'Yuan', 'extends', 'rising', 'streak', '2013', '5', '4', '0', '07', '42', 'Yuan', 'band', 'widening', 'in', 'the', 'works', '2013', '4', '20', '0', '42', '09', 'Peng', 's', 'popularity', 'pushes', 'demand', 'for', 'home', 'br', '2013', '3', '30', '1', '18', '33', 'Peng', 'Liyuan', 'steals', 'hearts', 'on', 'first', 'trip', '2013', '3', '23', '1', '45', '11', 'Read', 'More', 'THE', 'Chinese', 'yuan', 'yesterday', 'strengthened', 'above', 'the', '6', '13', 'level', 'against', 'the', 'US', 'dollar', 'on', 'the', 'spot', 'market', 'for', 'the', 'first', 'time', 'in', '19', 'years', 'after', 'the', 'central', 'bank', 'set', 'a', 'record', 'reference', 'rate', 'for', 'the', 'currency', 'and', 'Premier', 'Li', 'Keqiang', 'reiterated', 'the', 'country', 'was', 'making', 'progress', 'in', 'opening', 'up', 'its', 'capital', 'account', 'The', 'yuan', 'closed', 'at', 
'6', '1316', 'per', 'dollar', 'in', 'Shanghai', 'yesterday', '0', '04', 'percent', 'stronger', 'than', 'Thursday', 'according', 'to', 'the', 'China', 'Foreign', 'Exchange', 'Trade', 'System', 'The', 'yuan', 'touched', 'an', 'intraday', 'high', 'of', '6', '1279', 'the', 'strongest', 'since', 'the', 'government', 'unified', 'the', 'official', 'and', 'market', 'rates', 'at', 'the', 'end', 'of', '1993', 'The', 'People', 's', 'Bank', 'of', 'China', 'raised', 'the', 'central', 'parity', 'rate', 'by', '0', '13', 'percent', 'to', '6', '1867', 'per', 'US', 'dollar', 'yesterday', 'before', 'the', 'market', 'opened', 'It', 'was', 'the', 'third', 'time', 'that', 'the', 'PBOC', 'had', 'raised', 'the', 'daily', 'fixing', 'to', 'a', 'record', 'in', 'a', 'week', 'guiding', 'the', 'market', 'rate', 'up', '0', '17', 'percent', 'from', 'May', '17', 'The', 'nation', 'is', 'steadily', 'pushing', 'forward', 'market', 'oriented', 'reforms', 'on', 'interest', 'rates', 'and', 'capital', 'account', 'convertibility', 'Premier', 'Li', 'said', 'in', 'a', 'signed', 'article', 'in', 'Neue', 'Zuricher', 'Zeitung', 'a', 'German', 'language', 'Swiss', 'newspaper', 'on', 'Thursday', 'Email', 'Story', 'Printable', 'View', 'Blog', 'Story', 'Copy', 'Headline', 'URL', 'News', 'text', 'News', 'title', 'Photo', 'captions', 'Live', 'in', 'Shanghai', 'Advanced', 'Search', 'Our', 'Products', 'Home', 'Delivery', 'Online', 'Account', 'Amazon', 'Kindle', 'iPhone', 'App', 'iPad', 'App', 'Blackberry', 'Phone', 'App', 'PlayBook', 'App', 'Android', 'App', 'Windows', 'Phone', 'App', 'MMS', 'Metro', 'Aging', 'Crime', 'and', 'public', 'security', 'Education', 'Health', 'Traffic', 'Urban', 'construction', 'Weather', 'Business', 'Banking', 'Energy', 'Foreign', 'investment', 'Insurance', 'Macro', 'economy', 'and', 'policy', 'Real', 'estate', 'Securities', 'National', 'World', 'Odd', 'Districts', 'Changning', 'Hongkou', 'Huangpu', 'Jing', 'an', 'Luwan', 'Minhang', 'Pudong', 'Putuo', 'Xuhui', 'Yangpu', 'Zhabei', 'Sport', 
'Basketball', 'Boxing', 'Cricket', 'Golf', 'Gymnastics', 'Ice', 'hockey', 'Olympics', 'Rugby', 'union', 'Soccer', 'Tennis', 'Feature', 'Art', 'City', 'Style', 'Culture', 'and', 'history', 'Expat', 'Tales', 'Fashion', 'Home', 'Deco', 'Literature', 'Music', 'Stage', 'Travel', 'Opinion', 'Chinese', 'perspectives', 'Foreign', 'perspectives', 'Shanghai', 'Daily', 'columnists', 'Sunday', 'Animal', 'Planet', 'Book', 'City', 'Scene', 'Cover', 'Film', 'Food', 'Home', 'and', 'Deco', 'Now', 'and', 'Then', 'People', 'Style', 'Supplement', 'Downloads', 'PDF', 'eMagazine', 'Gallery', 'Photos', 'Cartoons', 'HD', 'Photo', 'Album', 'Blogs', 'Buzzword', 'and', 'Shanghai', 'Talk', 'Word', 'on', 'Street', 'Team', 'Blog', 'Services', 'Subscribe', 'Advertising', 'Info', 'Contact', 'Us', 'RSS', 'Center', 'FEATURED', 'SITES', 'Campus', 'Learning', 'Careers', 'Students', 'Club', 'Prize', 'English', 'Sense', 'Simplicity', 'Mini', 'sites', 'Undiscovered', 'Zhoushan', 'Minhang', 'today', 'www', 'maicaipian', 'com', 'Science', 'Podcasting', 'Elegant', 'Rhythms', 'from', 'the', 'East', 'CONTACT', 'US', 'BACK', 'TO', 'TOP', 'Metro', 'World', 'National', 'Business', 'Sports', 'Feature', 'Opinion', 'About', 'Shanghai', 'Daily', 'About', 'US', '5', '0', 'New', 'Advertising', 'Term', 'of', 'Use', 'RSS', 'Privacy', 'Policy', 'Contact', 'US', 'Shanghai', 'World', 'Expo', 'ICP', 'ICP', '05050403', '0909346', '354', 'B2', '20120012', 'Copyright', '2001', 'Shanghai', 'Daily', 'Publishing', 'House', 'All', 'rights', 'reserved']
>>> 

This is a better result than split(). Next, let's get a sorted list.

>>> words_list = sorted(set([w.lower() for w in words2 if w.isalpha()]))
>>> words_list
['a', 'about', 'above', 'according', 'account', 'advanced', 'advertising', 'after', 'against', 'aging', 'ago', 'album', 'all', 'amazon', 'an', 'and', 'android', 'animal', 'app', 'appears', 'art', 'article', 'as', 'at', 'back', 'band', 'bank', 'banking', 'basketball', 'before', 'blackberry', 'blog', 'blogs', 'book', 'boxing', 'br', 'business', 'buy', 'buzzword', 'by', 'campus', 'capital', 'captions', 'careers', 'cart', 'cartoons', 'center', 'central', 'chairman', 'changning', 'china', 'chinese', 'city', 'closed', 'club', 'columnists', 'com', 'construction', 'contact', 'convertibility', 'copy', 'copyright', 'country', 'cover', 'cricket', 'crime', 'crisis', 'culture', 'currency', 'daily', 'deal', 'deco', 'delivery', 'demand', 'districts', 'dollar', 'downloads', 'east', 'economy', 'edition', 'education', 'elegant', 'emagazine', 'email', 'end', 'energy', 'english', 'estate', 'exchange', 'expat', 'expo', 'extends', 'fashion', 'feature', 'featured', 'feng', 'film', 'financial', 'first', 'fixing', 'food', 'for', 'foreign', 'forward', 'free', 'from', 'gains', 'gallery', 'german', 'golf', 'government', 'guiding', 'gymnastics', 'had', 'hd', 'headline', 'health', 'hearts', 'high', 'history', 'hockey', 'home', 'hongkou', 'house', 'housing', 'http', 'huangpu', 'huiyuan', 'i', 'ibe', 'ice', 'icp', 'id', 'in', 'info', 'insurance', 'interest', 'intraday', 'investment', 'ipad', 'iphone', 'is', 'it', 'its', 'jianmin', 'jing', 'keqiang', 'keywords', 'kindle', 'language', 'last', 'learning', 'level', 'li', 'literature', 'live', 'liyuan', 'luwan', 'macro', 'maicaipian', 'making', 'market', 'may', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'mobile', 'more', 'music', 'nation', 'national', 'network', 'neue', 'new', 'news', 'newsletter', 'newspaper', 'now', 'odd', 'of', 'official', 'olympics', 'on', 'online', 'opened', 'opening', 'opinion', 'oriented', 'our', 'page', 'parity', 'pboc', 'pdf', 'peng', 'people', 'per', 'percent', 'perspectives', 'phone', 'photo', 'photos', 'planet', 
'playbook', 'podcasting', 'policy', 'popularity', 'premier', 'price', 'printable', 'privacy', 'prize', 'products', 'progress', 'public', 'publishing', 'pudong', 'pushes', 'pushing', 'putuo', 'raised', 'rate', 'rates', 'read', 'reading', 'real', 'record', 'reference', 'reforms', 'reiterated', 'related', 'reserved', 'rhythms', 'rights', 'rising', 'rss', 'rugby', 's', 'said', 'saturday', 'scene', 'science', 'search', 'securities', 'security', 'sense', 'services', 'set', 'sets', 'shanghai', 'shanghaidaily', 'shopping', 'signed', 'simplicity', 'since', 'sites', 'soccer', 'sport', 'sports', 'spot', 'stage', 'steadily', 'steals', 'stock', 'stories', 'story', 'streak', 'street', 'strength', 'strengthened', 'stronger', 'strongest', 'students', 'style', 'subscribe', 'subscribers', 'sunday', 'supplement', 'swiss', 'system', 'tales', 'talk', 'team', 'tennis', 'term', 'text', 'than', 'that', 'the', 'then', 'third', 'thursday', 'time', 'title', 'to', 'today', 'tools', 'top', 'touched', 'trade', 'traffic', 'travel', 'trip', 'type', 'undiscovered', 'unified', 'union', 'unit', 'up', 'updated', 'urban', 'url', 'us', 'use', 'v', 'version', 'view', 'was', 'weather', 'week', 'widening', 'window', 'windows', 'word', 'works', 'world', 'www', 'xuhui', 'yangpu', 'years', 'yesterday', 'yuan', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

Then remove the words that are found in nltk.corpus.words.words(), leaving only the unknown ones.

>>> [w for w in words_list if not w in nltk.corpus.words.words()]
['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

Now let's write a function based on the steps above.

>>> def unknown(url):
...     raw = nltk.clean_html(urlopen(url).read())
...     words = re.findall(r'\w+', raw)
...     words_list = sorted(set([w.lower() for w in words if w.isalpha()]))
...     unknown_w = [w for w in words_list if w not in nltk.corpus.words.words()]
...     print unknown_w
... 

Let’s test it.

>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> unknown(url)
['amazon', 'app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'chinese', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'english', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'rugby', 'saturday', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'sunday', 'thursday', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

Let’s analyze the results…

Plurals: blogs, captions, careers, cartoons, columnists…
Words with -ed/-ing suffixes: opened, oriented, publishing…

With good stemming, these would match entries in the vocabulary (nltk.corpus.words.words()).
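
As a rough illustration of the idea (this is a naive suffix stripper I made up for demonstration, not the Porter algorithm; a real stemmer such as NLTK's PorterStemmer handles far more cases):

```python
def simple_stem(word):
    # Naive suffix stripping -- a hypothetical stand-in for a real stemmer.
    # Try the longest suffixes first, and keep a minimal stem length.
    for suffix in ('ings', 'ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ['blogs', 'captions', 'opened', 'publishing']:
    print(simple_stem(w))
```

After stripping, blogs/captions/opened/publishing reduce to blog/caption/open/publish, all of which are ordinary vocabulary words.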

Proper nouns: english, chinese, saturday, thursday…

I suspect these words appear here as a side effect of converting everything to lower case. If their first letters were capitalized again, they would match the vocabulary list.

Let me try this:


>>> url = "http://www.shanghaidaily.com/nsp/Business/2013/05/25/Yuan%2Bgains%2Bstrength%2Bas%2BPBOC%2Bsets%2Brecord%2Brate/"
>>> raw = nltk.clean_html(urlopen(url).read())
>>> words = re.findall(r'\w+', raw)
>>> words_list = sorted(set([w.lower() for w in words if w.isalpha()]))
>>> vocab_word = [w.lower() for w in nltk.corpus.words.words()] 
>>> [w for w in words_list if not w in vocab_word]
['app', 'appears', 'blog', 'blogs', 'br', 'buzzword', 'captions', 'careers', 'cartoons', 'changning', 'columnists', 'com', 'deco', 'districts', 'downloads', 'emagazine', 'email', 'expat', 'expo', 'extends', 'feng', 'guiding', 'hd', 'hongkou', 'http', 'huangpu', 'huiyuan', 'ibe', 'icp', 'info', 'intraday', 'ipad', 'iphone', 'jianmin', 'keqiang', 'keywords', 'liyuan', 'luwan', 'maicaipian', 'metro', 'minhang', 'mini', 'minutes', 'mms', 'neue', 'olympics', 'online', 'opened', 'oriented', 'pboc', 'pdf', 'peng', 'perspectives', 'photos', 'podcasting', 'products', 'publishing', 'pudong', 'pushes', 'putuo', 'rates', 'reforms', 'rhythms', 'rights', 'rss', 'securities', 'services', 'sets', 'shanghaidaily', 'signed', 'sites', 'steals', 'stories', 'strengthened', 'stronger', 'strongest', 'students', 'subscribers', 'tools', 'updated', 'url', 'widening', 'windows', 'www', 'xuhui', 'yangpu', 'years', 'zeitung', 'zhabei', 'zhoushan', 'zuricher']
>>> 

My assumption seems correct: the vocabulary list should also be lowercased before the comparison.
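
Reduced to a toy example (the vocabulary below is made up, standing in for nltk.corpus.words.words()), the effect of lowercasing both sides looks like this:

```python
# Toy vocabulary -- a hypothetical stand-in for nltk.corpus.words.words()
vocab = ['English', 'Chinese', 'Saturday', 'rate', 'yuan']
words_list = ['english', 'chinese', 'saturday', 'pboc', 'rate']

# Without lowercasing the vocabulary, the capitalized entries never match,
# so the proper nouns are wrongly flagged as unknown
missed = [w for w in words_list if w not in vocab]
print(missed)

# Lowercase the vocabulary as well, and only truly unknown words remain
vocab_lower = set(w.lower() for w in vocab)
unknown = [w for w in words_list if w not in vocab_lower]
print(unknown)
```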

Chinese-specific words: huangpu, huiyuan, jianmin, hongkou…

These are personal or place names in China, so it is natural that they are not in the vocabulary list.

New or internet-specific words: app, blog, buzzword, emagazine, email, online, expo…

One possibility is that the vocabulary list is out of date, or simply not suited to this kind of internet news article.

Exercise: Chapter 3 (14-17)

14.

Using words.sort():

>>> words = ["banana", "pineapple", "peach", "apple", "orange", "mango", "maron", "nuts"]
>>> words
['banana', 'pineapple', 'peach', 'apple', 'orange', 'mango', 'maron', 'nuts']
>>> words.sort()
>>> words
['apple', 'banana', 'mango', 'maron', 'nuts', 'orange', 'peach', 'pineapple']

words.sort() sorted the list in place, permanently changing its order.

Using sorted(words):

>>> words = ["banana", "pineapple", "peach", "apple", "orange", "mango", "maron", "nuts"]
>>> sorted(words)
['apple', 'banana', 'mango', 'maron', 'nuts', 'orange', 'peach', 'pineapple']
>>> words
['banana', 'pineapple', 'peach', 'apple', 'orange', 'mango', 'maron', 'nuts']

sorted() returns a new sorted list and leaves the original unchanged.
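
The two behaviours can be checked side by side: sorted() returns a fresh list, while list.sort() mutates in place and returns None (a common source of bugs when the return value is assigned).

```python
words = ['banana', 'pineapple', 'apple']

new_list = sorted(words)            # returns a new sorted list
assert new_list == ['apple', 'banana', 'pineapple']
assert words == ['banana', 'pineapple', 'apple']   # original untouched

result = words.sort()               # sorts in place and returns None
assert result is None
assert words == ['apple', 'banana', 'pineapple']
```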

15.

>>> "3" * 7
'3333333'
>>> 3 * 7
21
>>> int("3") * 7
21
>>> str(3) * 7
'3333333'
>>>
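
The rule behind these results: * on a string performs sequence repetition, while * on two ints multiplies; + refuses to mix the two types until one side is converted. A quick check (the converted name is just for illustration):

```python
assert "3" * 7 == "3333333"     # sequence repetition
assert 3 * 7 == 21              # arithmetic
assert int("3") * 7 == 21
assert str(3) * 7 == "3333333"

# + cannot mix str and int directly
try:
    "3" + 7
except TypeError:
    converted = int("3") + 7    # convert first, then add
assert converted == 10
```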

16.

>>> monty
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'monty' is not defined
>>> from test import msg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name msg
>>> from test import *
>>> monty
'Monty Python'

17.

In general, a positive width right-aligns the string; a negative width left-aligns it.

>>> string = 'abc'
>>> print '%6s' % string
   abc
>>> print '%-6s' % string
abc
>>> string = 'abcdefghijk'
>>> print '%6s' % string
abcdefghijk
>>> print '%-6s' % string
abcdefghijk
>>>

When the string is longer than 6 characters, the width makes no difference: it is a minimum field width, not a maximum.
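
The same alignment behaviour is available through the str.rjust/ljust methods, and both confirm that the width never truncates:

```python
s = 'abc'
assert '%6s' % s == '   abc'
assert '%-6s' % s == 'abc   '
assert s.rjust(6) == '   abc'      # equivalent string methods
assert s.ljust(6) == 'abc   '

long_s = 'abcdefghijk'
assert '%6s' % long_s == long_s    # width is a minimum, never a maximum
assert long_s.rjust(6) == long_s
```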

Exercise: Chapter 3 (10-13)

10.

The original one is:

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
...
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

>>>

Convert to a list comprehension:

>>> result = [(word, len(word)) for word in sent]
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

>>>

11.

>>> raw = "I don't like sports because I could not be an hero when I was playing childhood."

re.split() can be used for this purpose.

>>> re.split(r's', raw)
["I don't like ", 'port', ' becau', 'e I could not be an hero when I wa', ' play
ing childhood.']
>>> re.split(r'[ns]', raw)
['I do', "'t like ", 'port', ' becau', 'e I could ', 'ot be a', ' hero whe', ' I
 wa', ' playi', 'g childhood.']
>>>
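
A related trick worth knowing: re.split() normally discards the delimiters, but if the pattern contains a capturing group, the separators are kept in the result (toy data below):

```python
import re

raw = "one, two; three"
assert re.split(r'[,;]\s*', raw) == ['one', 'two', 'three']

# With parentheses around the delimiter class, the separators are kept
assert re.split(r'([,;])\s*', raw) == ['one', ',', 'two', ';', 'three']
```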

12.

>>> raw
"I don't like sports because I could not be an hero when I was playing childhood."

Let’s use the same raw data.

>>> for char in raw:
...     print char
...
I

d
o
n
'
t

l
i
k
e

s
p
o
r
t
s

b
e
c
a
u
s
e

I

c
o
u
l
d

n
o
t

b
e

a
n

h
e
r
o

w
h
e
n

I

w
a
s

p
l
a
y
i
n
g

c
h
i
l
d
h
o
o
d
.
>>>

13.

>>> raw.split()
['I', "don't", 'like', 'sports', 'because', 'I', 'could', 'not', 'be', 'an', 'hero', 'when', 'I', 'was', 'playing', 'childhood.']
>>> raw.split(' ')
['I', "don't", 'like', 'sports', 'because', 'I', 'could', 'not', 'be', 'an', 'hero', 'when', 'I', 'was', 'playing', 'childhood.']

No difference for this raw data. Let's change it a little (raw2): put four spaces after ‘like’, a single tab between ‘because’ and ‘I’, and two tabs between ‘hero’ and ‘when’.

>>> raw2 = "I don't like    sports because \tI could  not be an hero \t\twhen I was playing childhood."
>>> raw2.split()
['I', "don't", 'like', 'sports', 'because', 'I', 'could', 'not', 'be', 'an', 'hero', 'when', 'I', 'was', 'playing', 'childhood.']
>>> raw2.split(' ')
['I', "don't", 'like', '', '', '', 'sports', 'because', '\tI', 'could', '', 'not', 'be', 'an', 'hero', '\t\twhen', 'I', 'was', 'playing', 'childhood.']
>>>

According to this result, raw2.split() treats any run of spaces and tabs as a single separator. With split(' '), only single spaces count: tabs stay inside the tokens as "\t", and each extra space produces an empty string '' in the list.
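
The difference can be pinned down with a minimal string:

```python
raw = "a  b\tc"
assert raw.split() == ['a', 'b', 'c']        # any whitespace run is one separator
assert raw.split(' ') == ['a', '', 'b\tc']   # only single spaces split; the tab stays
```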

Exercise: Chapter 3 (7-9)

7.

>>> nltk.re_show(r'\b(a|an|the)\b', 'brian a then an the man')
brian {a} then {an} {the} man

Usage of ‘\b’ is the key point, I think.
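
The same behaviour can be reproduced with plain re.findall(). Without \b, the alternation also matches inside words; and since Python tries alternatives left to right, the shorter 'a' wins over 'an' unless the trailing \b forces backtracking to the longer alternative:

```python
import re

s = 'brian a then an the man'
# Word boundaries: only whole words match (non-capturing group for findall)
assert re.findall(r'\b(?:a|an|the)\b', s) == ['a', 'an', 'the']

# Without boundaries: matches appear inside 'brian', 'then', 'man',
# and 'an' is never matched because the alternative 'a' tries first
assert re.findall(r'(?:a|an|the)', s) == ['a', 'a', 'the', 'a', 'the', 'a']
```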

8.

>>> import urllib
>>> def cleantags(url):
...     raw_contents = urllib.urlopen(url).read()
...     return nltk.clean_html(raw_contents)
... 
>>> cleantags('http://www.nltk.org')
'Natural Language Toolkit &mdash; NLTK 2.0 documentation \n \n  \n 
\n \n \n \n \n \n  \n  \n \n \n  \n  \n   NLTK 2.0 documentation \n
   \n   next |\n   modules |\n   index \n   \n  \n  \n\n  \n  \n   
\n   \n  \n   \n   \n   \n \n Natural Language Toolkit \xc2\xb6 \n 
NLTK is a leading platform for building Python programs to work 
with human language data.\nIt provides easy-to-use interfaces to
 over 50 corpora and lexical resources such as WordNet,\nalong with
 a suite of text processing libraries for classification, 
tokenization, stemming, tagging, parsing, and semantic reasoning. 
\n Thanks to a hands-on guide introducing programming fundamentals
 alongside topics in computational linguistics,\nNLTK is suitable 
for linguists, engineers, students, educators, researchers, and 
industry users alike.\nNLTK is available for Windows, Mac OS X, 
and Linux. Best of all, NLTK is a free, open source, community-driven 
project. \n NLTK has been called “a wonderful tool for teaching,
 and working in, computational linguistics using Python,”\n
and “an amazing library to play with natural language.”
 \n Natural Language Processing with Python provides a practical\n
introduction to programming for language processing.\nWritten by 
the creators of NLTK, it guides the reader through the fundamentals\n
of writing Python programs, working with corpora, categorizing 
text, analyzing linguistic structure,\nand more. \n \n Some simple 
things you can do with NLTK \xc2\xb6 \n Tokenize and tag some text:
 \n &gt;&gt;&gt; import nltk \n &gt;&gt;&gt; sentence = &quot;
&quot;&quot;At eight o'clock on Thursday morning \n ... 
Arthur didn't feel very good.&quot;&quot;&quot; \n &gt;&gt;&gt;
 tokens = nltk . word_tokenize ( sentence ) \n &gt;&gt;&gt; tokens 
\n ['At', 'eight', &quot;o'clock&quot;, 
'on', 'Thursday', 'morning', \n '
Arthur', 'did', &quot;n't&quot;, 'feel',
 'very', 'good', '.'] \n &gt;&gt;&gt; 
tagged = nltk . pos_tag ( tokens ) \n &gt;&gt;&gt; tagged [ 0 : 6 ]
 \n [('At', 'IN'), ('eight', 'CD'),
 (&quot;o'clock&quot;, 'JJ'), ('on', 'IN
'), \n ('Thursday', 'NNP'), ('morning'
, 'NN')] \n \n \n Identify named entities: \n &gt;&gt;&gt; 
entities = nltk . chunk . ne_chunk ( tagged ) \n &gt;&gt;&gt; 
entities \n Tree('S', [('At', 'IN'), 
('eight', 'CD'), (&quot;o'clock&quot;, 'JJ
'), \n   ('on', 'IN'), ('Thursday', 
'NNP'), ('morning', 'NN'), \n  Tree('
PERSON', [('Arthur', 'NNP')]), \n   ('did', 'VBD'), (&quot;n't&quot;, 'RB'), ('feel', 'VB'), \n   ('very', 'RB'), ('good', 'JJ'), ('.', '.')]) \n \n \n Display a parse
 tree: \n \n NB. If you publish work that uses NLTK, please cite the
 NLTK book as follows:\nBird, Steven, Edward Loper and Ewan Klein (2009).
\nNatural Language Processing with Python. O’Reilly Media Inc.
 \n \n \n Links \xc2\xb6 \n \n NLTK mailing list - release announcements
 only, very low volume \n NLTK-Users mailing list - user discussions \n
 NLTK-Dev mailing list - developers only \n NLTK-Translation mailing 
list - discussions about translating the NLTK book \n NLTK’s 
previous website \n NLTK development at GitHub \n Publications about 
NLTK \n \n \n \n \n Contents \xc2\xb6 \n \n \n NLTK News \n Installing
 NLTK \n Installing NLTK Data \n nltk Package \n Team NLTK \n \n \n 
\n Index \n Module Index \n Search Page \n \n \n\n\n   \n   \n  \n   
\n   \n   Table Of Contents \n   \n NLTK News \n Installing NLTK \n 
Installing NLTK Data \n nltk Package \n Team NLTK \n \n\n   Search 
\n   \n    \n    \n    \n    \n   \n   \n   Enter search terms or a 
module, class or function name.\n   \n   \n   \n  \n  \n\n  \n  \n
   \n   next |\n   modules |\n   index \n    \n    Show Source \n   
\n\n   \n   \n  \n  &copy; Copyright 2012, NLTK Project.\n  Created 
using Sphinx 1.1.3.'
>>> 

Note: I inserted line breaks to display the entire result.

9-a.

>>> pnct_pattern = r'''(?x)     #set flag to allow verbose regexps
...     \.      #full stop
...     |,      #comma
...     |:      #colon
...     |;      #semicolon
...     |-      #dash
...     |/      #slash
...     |\?     #question
...     |\.\.\  #ellipsis
...     |[()]   #brackets
... '''

This contains some mistakes. The full stop (\.) should not come first; it must come after the ellipsis (\.\.\.). I also missed the third dot in the ellipsis pattern.

Define load(f).

>>> def load(f):
...     ofile = open(f)
...     raw = ofile.read()
...     ofile.close()
...     return raw

Try nltk.regexp_tokenize().

>>> load('corpus.txt')
"Hello! My name is Ken Xu. Where are you from? I was born in (eastern part of) Japan. Every morning I go to Gym and ride on aero-bike, running machine or so on... It's my shame my writing english is not so good as yours. "
>>> text = load('corpus.txt')
>>> nltk.regexp_tokenize(text, pnct_pattern)
['.', '?', '(', ')', '.', '-', ',', '.', '.', '.', '.']

Revise the mistakes, then try again.

>>> pnct_pattern = r'''(?x)     #set flag to allow verbose regexps
...     ,       #comma
...     |[()]   #brackets
...     |\.\.\. #ellipsis
...     |\.     #full stop
...     |\?     #question
...     |/      #slash
...     |-      #dash
...     |;      #semicolon
...     |:      #colon
... '''
>>> nltk.regexp_tokenize(text, pnct_pattern)
['.', '?', '(', ')', '.', '-', ',', '...', '.']
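
The fix works because regex alternation is ordered: the first alternative that matches wins, so the longer token (the ellipsis) must come before its own prefix (the full stop). A minimal demonstration with plain re.findall():

```python
import re

text = 'so on... done.'
# '.' listed first: the ellipsis is eaten one dot at a time
assert re.findall(r'\.|\.\.\.', text) == ['.', '.', '.', '.']
# Longest alternative first: the ellipsis is matched as one token
assert re.findall(r'\.\.\.|\.', text) == ['...', '.']
```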

9-b.

I need to revise corpus.txt first.

>>> text
"Hello! My name is Ken Xu. Where are you from? I was born in (eastern part of) Japan. Every morning I go to Gym and ride on aero-bike, running machine or so on... It's my shame my writing english is not so good as yours. \n\nOne of interesting thing is date format is different by regions. For example, we usually use this format in my country.\n\n2013/05/24\n\nBut I found the format was like 24.05.2013 in the ABC SYSTEMS. I heard this format is very popular in European countries. We can see similar situation in amount format. \n\nUSD12,345.67\nEUR12.345,67\nUSD123.45\nJPY123\n\n"

Amount:

>>> amnt_pattern = r'''(?x)
...     USD[\d,]+(\.\d\d)?              #USD12,345.67
...     |EUR[\d.]+(,\d\d)?              #EUR12.345,67
...     |JPY\d+                         #JPY123
... '''
>>> text = load('corpus.txt')
>>> nltk.regexp_tokenize(text, amnt_pattern)
['USD12,345.67', 'EUR12.345,67', 'USD123.45', 'JPY123']

Date:

>>> date_pattern = r'''(?x)
...     \d{2}\.\d{2}\.\d{4}             #DD.MM.YYYY
...     |\d{4}/\d{2}/\d{2}              #YYYY/MM/DD
... '''
>>> nltk.regexp_tokenize(text, date_pattern)
['2013/05/24', '24.05.2013']

Name:

>>> name_pattern = r'''(?x)
...     [A-Z][a-z]+\s[A-Z][a-z]+
... '''
>>> nltk.regexp_tokenize(text, name_pattern)
['Ken Xu']

Organization:

As a working definition, I treat a name written in all upper case as an organization name.

>>> orgz_pattern = r'''(?x)
...     [A-Z]+(\s[A-Z])?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['H', 'M', 'K', 'X', 'W', 'I', 'J', 'E', 'I', 'G', 'I', 'O', 'F', 'B', 'I', 'ABC S', 'YSTEMS', 'I', 'E', 'W', 'USD', 'EUR', 'USD', 'JPY']
>>> orgz_pattern = r'''(?x)
...     [A-Z]+\b(\s[A-Z]+)?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['I', 'I', 'I', 'ABC SYSTEMS', 'I']
>>> orgz_pattern = r'''(?x)
...     [A-Z]{2,}\b(\s[A-Z]+)?
... '''
>>> nltk.regexp_tokenize(text, orgz_pattern)
['ABC SYSTEMS']

Exercise: Chapter 3 (1 – 6)

1.

>>> s = 'colorless'
>>> print s[:4] + 'u' + s[4:]
colourless

2.

>>> 'dogs'[:-1]
'dog'
>>> 'dishes'[:-2]
'dish'
>>> 'running'[:-4]
'run'
>>> 'nationality'[:-5]
'nation'
>>> 'undo'[:-2]
'un'
>>> 'undo'[2:]
'do'
>>> 'preheat'[3:]
'heat'

3.

>>> strg = 'abcdefghijklmnopqrstuvwxyz'
>>> strg[26]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> strg[-2]
'y'

I am not sure I understand this question correctly. If we use an index smaller than the starting point [0] (going too far to the left), the value becomes negative, and a negative index means counting from the end. In my example, the second character from the end (‘y’) was selected.
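
To make the boundary explicit: -1 is the last character, -len(s) is the first, and anything beyond that raises IndexError, just like going past the right edge:

```python
s = 'abcdefghijklmnopqrstuvwxyz'
assert s[-1] == 'z'
assert s[-2] == 'y'
assert s[-26] == 'a'          # -len(s) reaches the first character

try:
    s[-27]                    # one step too far to the left
    in_range = True
except IndexError:
    in_range = False
assert not in_range
```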

4.

>>> monty = 'Monty Python'
>>> monty[6:11:2]
'Pto'
>>> monty[2:10:3]
'n t'
>>> monty[::4]
'Myt'

5.

>>> monty[::-1]
'nohtyP ytnoM'

It was reversed!

6.

a. [a-zA-Z]+
One or more alphabetic characters.

>>> nltk.re_show(r'[a-zA-Z]+', 'a abc aBcd ABcd ABCD a1234 12A34 aB1234')
{a} {abc} {aBcd} {ABcd} {ABCD} {a}1234 12{A}34 {aB}1234

b. [A-Z][a-z]*
An upper-case letter followed by zero or more lower-case letters (the lower-case part can be empty because of *).

>>> nltk.re_show(r'[A-Z][a-z]*', 'a abc aBcd ABcd ABCD a1234 12A34 aB1234')
a abc a{Bcd} {A}{Bcd} {A}{B}{C}{D} a1234 12{A}34 a{B}1234

c. p[aeiou]{,2}t
‘p’ followed by zero to two vowels (aeiou), then ‘t’.

>>> nltk.re_show(r'p[aeiou]{,2}t', 'pit pet peat pool good puuut pt')
{pit} {pet} {peat} pool good puuut {pt}

d. \d+(\.\d+)?
‘\d’ matches a digit, so this is an integer with an optional decimal part.

>>> nltk.re_show(r'\d+(\.\d+)?', '0 0.2 12.0003 -4 -5.2 5,000')
{0} {0.2} {12.0003} -{4} -{5.2} {5},{000}
>>> nltk.re_show(r'-?\d+(\.\d+)?', '0 0.2 12.0003 -4 -5.2 5,000')
{0} {0.2} {12.0003} {-4} {-5.2} {5},{000}

The second example just adds an optional negative sign (-).

e. ([^aeiou][aeiou][^aeiou])*
Zero or more repetitions of non-vowel + vowel + non-vowel. Because of the trailing *, zero repetitions (the empty string) also match, which is why {} appears everywhere below.

>>> nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', 'appeal push pool shose neck 1234 gogleb')
{}a{}p{}p{}e{}a{}l{} {pus}h{} {}p{}o{}o{}l{} {}s{hos}e{} {nec}k{} {}1{}2{}3{}4{} {gogleb}
>>> nltk.re_show(r'([^aeiou][aeiou][^aeiou])+', 'appeal push pool shose neck 1234 gogleb')
appeal {pus}h pool s{hos}e {nec}k 1234 {gogleb}

In my view, using ‘+’ as in the second example is more natural.

f. \w+|[^\w\s]+
A run of word characters (\w+), or a run of characters that are neither word characters nor whitespace ([^\w\s]+), i.e. punctuation.

>>> nltk.re_show(r'\w+|[^\w\s]+', '123 abc 1_2_3_A_B_C .....        ')
{123} {abc} {1_2_3_A_B_C} {.....}
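
In other words, this pattern is a simple tokenizer: words and punctuation runs become tokens, and whitespace is skipped. On a small example:

```python
import re

pattern = r'\w+|[^\w\s]+'
# "It's" splits into word / apostrophe / word; the ellipsis stays one token
assert re.findall(pattern, "It's done...") == ['It', "'", 's', 'done', '...']
```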

From list to string (3.9)

Now in chapter 3.9 in the whale book.

How to use join().

>>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
>>> ' '.join(silly)
'We called him Tortoise because he taught us .'
>>> ';'.join(silly)
'We;called;him;Tortoise;because;he;taught;us;.'
>>> ''.join(silly)
'WecalledhimTortoisebecausehetaughtus.'
>>> 

Nothing new to me.

>>> word = 'cat'
>>> sentence = """hello
... world"""
>>> print word
cat
>>> print sentence
hello
world
>>> word
'cat'
>>> sentence
'hello\nworld'
>>> 

I already learned the difference the print statement makes in an earlier chapter.

>>> fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
>>> for word in fdist:
...     print word, '->', fdist[word], ';',
... 
dog -> 4 ; cat -> 3 ; snake -> 1 ;
>>> for word in fdist:
...     print '%s->%d;' % (word, fdist[word]),
... 
dog->4; cat->3; snake->1;
>>> 

One of the reasons I like this book is that it lets me try something first, even if I don’t understand it yet, and then gives the detailed explanation later.

Actually, it was hard to understand how to use ‘%’ in print statements when I first saw it in an earlier section of the textbook. All I could do at the time was guess.

Now I can check whether my assumption was correct.

>>> '%s->%d;' % ('cat', 3)
'cat->3;'
>>> '%s->%d;' % 'cat'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> '%s->' % 'cat'
'cat->'
>>> '%d' % 3
'3'
>>> 'I want a %s right now' % 'coffee'
'I want a coffee right now'
>>> "%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
'Lee wants a sandwich for lunch'
>>> template = 'Lee wants a %s right now'
>>> menu = ['sandwich', 'spam fritter', 'pancake']
>>> for snack in menu:
...     print template % snack
... 
Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now
>>> 

To specify the field width, we can do the following.

>>> '%6s' % 'dog'
'   dog'
>>> '%-6s' % 'dog'
'dog   '
>>> width = 6
>>> '%-*s' % (width, 'dog')
'dog   '
>>> '%*s' % (width, 'dog')
'   dog'

By default, field values are right-aligned; use ‘-’ for left alignment.
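
The same variable-width formatting is also available with str.format() (Python 2.6+), where the width itself can be a nested replacement field:

```python
width = 6
assert '%*s' % (width, 'dog') == '   dog'
assert '%-*s' % (width, 'dog') == 'dog   '

# str.format equivalents: the width comes from the keyword argument w
assert '{:>{w}}'.format('dog', w=width) == '   dog'
assert '{:<{w}}'.format('dog', w=width) == 'dog   '
```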

Now I understand more clearly how the output of tabulate is laid out.

>>> def tabulate(cfdist, words, categories):
...     print '%-16s' % 'Category',
...     for word in words:
...             print '%6s' % word,
...     print
...     for category in categories:
...             print '%-16s' % category,
...             for word in words:
...                     print '%6d' % cfdist[category][word],
...             print
... 
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> tabulate(cfd, modals, genres)
Category            can  could    may  might   must   will
news                 93     86     66     38     50    389
religion             82     59     78     12     54     71
hobbies             268     58    131     22     83    264
science_fiction      16     49      4     12      8     16
romance              74    193     11     51     45     43
humor                16     30      8      8      9     13
>>> 

Export the result to a file.

>>> output_file = open('output.txt', 'w')
>>> words = set(nltk.corpus.genesis.words('english-kjv.txt'))
>>> for word in sorted(words):
...     output_file.write(word + "\n")
... 
>>> output_file.close()

What happens if I omit "\n"?

>>> output_file = open('output2.txt', 'w')
>>> for word in sorted(words):
...     output_file.write(word)
... 

Without the newline, all the words ran together on a single line in the output file. Needless to explain…

When writing non-text data, we need to convert it into a string first.

>>> len(words)
2789
>>> str(len(words))
'2789'
>>> output_file.write(str(len(words)) + "\n")
>>> output_file.close()

The count was appended at the end of the file.

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
...     'more', 'is', 'said', 'than', 'done', '.']
>>> for word in saying:
...     print word, '(' + str(len(word)) + ')',
...
After (5) all (3) is (2) said (4) and (3) done (4) , (1) more (4) is (2) said (4) than (4) done (4) . (1)
>>>

Wrapping text example.

>>> from textwrap import fill
>>> format = '%s (%d),'
>>> pieces = [format % (word, len(word)) for word in saying]
>>> output = ' '.join(pieces)
>>> wrapped = fill(output)
>>> print wrapped
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),
>>>
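
fill() wraps at 70 columns by default; the width parameter controls the limit, and every emitted line respects it while the words themselves are preserved:

```python
from textwrap import fill

text = ' '.join('word%d' % i for i in range(10))
wrapped = fill(text, width=20)

assert all(len(line) <= 20 for line in wrapped.split('\n'))
assert wrapped.split() == text.split()   # wrapping only moves whitespace
```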

Chapter 3 is almost over. Let’s move on to the exercises!