Notice: This blog will be migrated

Because I have to turn on a VPN to connect to wordpress.com from my location, China, I am considering migrating this blog to another service. Updates to this blog will stop temporarily and resume once the migration is done.

The new URL of the blog will be:
http://deutschina.hatenablog.com/category/NLTK


Stopwords

Stopwords are common words that generally do not contribute to the meaning of sentences. As usual, import nltk.book.

>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Then import stopwords and run a simple test.

>>> from nltk.corpus import stopwords
>>> english_stops = set(stopwords.words('english'))
>>> words = ["Can't", 'is', 'a', 'contraction']
>>> [word for word in words if word not in english_stops]
["Can't", 'contraction']

Judging from the results, ‘is’ and ‘a’ must be included in the stopwords.
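I could also check this directly instead of guessing from the filtered output. The sketch below uses a tiny hand-written set, sample_stops, as a stand-in so that it runs without the NLTK data. It is not the real NLTK list; with NLTK, english_stops from above would take its place.

```python
# sample_stops is a hypothetical stand-in for set(stopwords.words('english')).
sample_stops = {'is', 'a', 'the', 'of', 'and'}

words = ["Can't", 'is', 'a', 'contraction']

# Keep only the words that ARE in the stopword set,
# i.e. the ones the filter above would throw away.
removed = [word for word in words if word in sample_stops]
print(removed)  # ['is', 'a']
```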

Next, I want to try with a bigger sample.

>>> text1sw = [word for word in text1 if word not in english_stops]

This command excludes the stopwords from text1. Now I should be able to get more useful results from the plot. Let’s try…

>>> fdist2 = FreqDist(text1sw)
>>> fdist2.plot(50, cumulative=True)

[Image: cumulative frequency plot of the 50 most frequent tokens in text1 after stopword removal]

Although a lot of punctuation is still included, I can see some interesting words in the top 50, like whale, ship, Ahab and Stubb. Now I can guess something about the story. For example, it is probably set at or near the sea, and Ahab and Stubb are characters in it.
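The punctuation could probably be dropped with one more condition in the list comprehension. This is only a minimal sketch, assuming that keeping purely alphabetic tokens (str.isalpha()) is acceptable. sample_tokens and sample_stops are made-up stand-ins for text1 and english_stops, and collections.Counter stands in for FreqDist so that the example runs without NLTK.

```python
from collections import Counter

# Made-up stand-ins for text1 and english_stops (assumptions, not NLTK data).
sample_tokens = ['the', 'whale', ',', 'the', 'ship', '.',
                 'Ahab', 'and', 'the', 'whale', '!']
sample_stops = {'the', 'and'}

# Keep a token only if it is purely alphabetic and not a stopword.
content_words = [t for t in sample_tokens
                 if t.isalpha() and t not in sample_stops]

# Counter plays the role of FreqDist here.
counts = Counter(content_words)
print(content_words)    # ['whale', 'ship', 'Ahab', 'whale']
print(counts['whale'])  # 2
```

Note that isalpha() also throws away tokens such as "Can't" or hyphenated words, so it is a rather blunt filter.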

It seems “whale” and “Whale” are counted separately. Converting every word to lowercase before calling FreqDist should merge them. The following should work.


>>> text1sw = [word.lower() for word in text1 if word not in english_stops]
>>> fdist2 = FreqDist(text1sw)
>>> fdist2.plot(50, cumulative=True)
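One thing to watch in the comprehension above: the stopword test runs on the original word, and the NLTK English stopwords are, as far as I know, all lowercase. A capitalised 'The' would therefore pass the test and only then be lowercased to 'the'. Below is a sketch of lowercasing before the test, again with made-up sample data in place of text1 and english_stops.

```python
# Made-up stand-ins for text1 and english_stops (assumptions, not NLTK data).
sample_tokens = ['The', 'Whale', 'and', 'the', 'whale']
sample_stops = {'the', 'and'}

# Filter on the original word, then lowercase: 'The' slips through.
filter_then_lower = [t.lower() for t in sample_tokens
                     if t not in sample_stops]

# Filter on the lowercased word instead.
lower_then_filter = [t.lower() for t in sample_tokens
                     if t.lower() not in sample_stops]

print(filter_then_lower)  # ['the', 'whale', 'whale']
print(lower_then_filter)  # ['whale', 'whale']
```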

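Note: When I forgot the “()” after lower, I got a very strange result. Each word appeared to be converted into something like a memory address, for example <built-in method lower of str object at 0x0000000004EAF238>. Without the parentheses, word.lower refers to the method itself instead of calling it, so the list ends up holding method objects rather than lowercase strings.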
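The difference is easy to reproduce with a single string. A small sketch in plain Python, with no NLTK needed (the variable names are only for illustration):

```python
word = 'Whale'

method_object = word.lower  # no parentheses: the method itself, not its result
result = word.lower()       # with parentheses: the method is actually called

print(callable(method_object))  # True
print(method_object())          # whale
print(result)                   # whale
```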