Processing HTML (3.1.2) / Search engine (3.1.3)

Continuing from section 3.1.2 of the whale book.

>>> import nltk
>>> from urllib import urlopen
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

We can display the HTML source.

>>> print html
<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
...
<br>
<link type="text/css" rel="stylesheet" href="/nol/shared/stylesheets/uki_globalstylesheet.css">

</body>
</html>

>>> 

Next, extract the text and tokenize it. nltk.clean_html() helps to remove the HTML tags.

>>> raw = nltk.clean_html(html)
>>> takens = nltk.word_tokenize(raw)
>>> takens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', 'years', "'", 'NEWS', 'SPORT', 'WEATHER', 'WORLD', 'SERVICE', 'A-Z', 'INDEX', 'SEARCH', 'You', 'are', 'in', ':', 'Health', 
....
 ';', '|', 'To', 'BBC', 'World', 'Service', '&', 'gt', ';', '&', 'gt', ';', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '--', '&', 'copy', ';', 'MMIII', '|', 'News', 'Sources', '|', 'Privacy']
>>> 
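
For reference, here is a rough idea of what the tag-stripping step does. NLTK 3.x no longer implements clean_html() and points to Beautiful Soup's get_text() instead, so this sketch uses only the Python 3 standard library (unlike the Python 2 session above). The names TextExtractor and strip_tags and the sample HTML are my own for illustration, and the plain split() at the end is only a crude stand-in for nltk.word_tokenize().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample, not the real BBC page.
sample = (
    "<html><head><title>BBC NEWS | Health</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Blondes &amp; genes</p></body></html>"
)
text = strip_tags(sample)
tokens = text.split()  # whitespace split, not a real tokenizer
print(tokens)
```

A real page like the BBC one has far more navigation text, which is why the token list is sliced afterwards to keep only the article body.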

Please ignore my typo (takens -> tokens).

>>> tokens = takens
>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
Building index...
Displaying 4 of 4 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance. They do n't disappear 
des would disappear is if having the gene was a disadvantage and I do not thin
>>> 
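
To see what a concordance is doing, here is a minimal keyword-in-context sketch in plain Python 3. It is not how nltk.Text.concordance() is implemented: NLTK cuts the context by character width, while this hypothetical concordance() helper simply takes a fixed number of tokens on each side. The sample tokens are taken from the first match above.

```python
def concordance(tokens, word, width=3):
    """Return (left context, hit, right context) for each occurrence of `word`.

    `width` is the number of tokens kept on each side of the hit.
    Matching is case-insensitive.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "they say too few people now carry the gene for blondes to last".split()
for left, hit, right in concordance(tokens, "gene"):
    # Right-align the left context so the hits line up in a column.
    print(f"{left:>20} [{hit}] {right}")
```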

Let's check what is inside text.

>>> text
<Text: Help EDITIONS Change to UK Friday , 27...>
>>> len(text)
303
>>> text[:-1]
['Help', 'EDITIONS', 'Change', 'to', 'UK', 'Friday', ',', '27', 'September', ',', '2002', ',', '11:51', 'GMT', '12:51', 'UK', 'Blondes', "'to", 'die', 'out', 'in', '200', 'years', "'", 
....

'that', 'is', 'the', 'case.', "''", 'The', 'frequency', 'of', 'blondes', 'may', 'drop', 'but', 'they', 'wo', "n't", 'disappear.', "''", 'See', 'also', ':', '28', 'Mar', '01']
>>> 

“Beautiful Soup” would help us process this kind of thing. The following is quoted from its web site.

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
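
As a small sketch of how this could replace the clean_html() step, the code below pulls the visible text out of an HTML string with Beautiful Soup 4 (the third-party bs4 package, Python 3). The soup_to_text name and the sample HTML are my own assumptions for illustration, and the import guard only keeps the sketch from crashing when bs4 is not installed.

```python
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None


def soup_to_text(html):
    """Return the visible text of an HTML string, or None if bs4 is missing."""
    if BeautifulSoup is None:
        return None
    # "html.parser" is the standard-library backend; lxml or html5lib
    # could be passed here instead if they are installed.
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents are not treated as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample, not the real BBC page.
sample = (
    "<html><head><title>Blondes 'to die out'</title></head>"
    "<body><p>Too few people carry the <b>gene</b>.</p></body></html>"
)
print(soup_to_text(sample))
```

The resulting string can then be handed to nltk.word_tokenize() in the same way as the output of clean_html().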

Regarding the search engine (section 3.1.3 of the whale book), the collocation stats for “absolutely” and “definitely” (in combination with adore, love, like, prefer) are absolutely interesting for a non-native English speaker like me. Choosing between such words was a real headache when I was learning English in my school days.

BTW, should I use “definitely” in combination with “interesting”? I believe both are OK with “interesting”. 🙂

There is an additional question in the textbook. Searching for “the of” in a search engine like Google returns lots of results. Can we say that “the of” is a highly frequent collocation in English?

In my understanding, the answer should be NO. Most of the Google results actually contain “of the”, or have some other word(s) inserted between “the” and “of”. Therefore I cannot say that “the of” is a common collocation in English.
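
One way to check this kind of claim on a corpus instead of a search engine is to count adjacent word pairs (bigrams), so that word order and adjacency both matter. The following is only a sketch in Python 3: bigram_counts is a hypothetical helper, and the sample sentence is a toy I made up, not real frequency data.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent, ordered, lowercased word pairs in a token list."""
    lowered = [t.lower() for t in tokens]
    return Counter(zip(lowered, lowered[1:]))


# Toy sentence invented for illustration only.
sample = (
    "The colour of the hair depends on the gene of the parents "
    "and the rest of the family"
).split()
counts = bigram_counts(sample)

print(counts[("of", "the")])  # adjacent "of the"
print(counts[("the", "of")])  # adjacent "the of"
```

In this toy sentence “of the” occurs three times and “the of” never does, even though both words appear several times. A search engine that only checks whether both words are somewhere on the page cannot tell these two cases apart.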