The first alpha release of NLTK 3.0, i.e. NLTK for Python 3, has just been released. Downloads and further information can be found here:
Although not quite ready for prime time, this is a major step towards full Python 3 support in the NLTK library.
Boilerpipe is a useful library for extracting body content from web pages and discarding the ‘boilerplate’ (menus, footers, advertising, etc.). It is a Java library, so it requires a bridge (e.g. JPype for Python) if you wish to use it in a non-Java environment. Luckily for C# users, Arif Ogan has ported Boilerpipe to C#/Mono. The port is called NBoilerpipe and can be downloaded from GitHub.
Continue reading Extracting Body content from a Web Page using .NET
I recently encountered the problem of having to extract the main body content from a series of web pages, and to discard all of the ‘boilerplate’, i.e. header, menus, footer, and advertising. The application was performing statistical comparisons between web pages, and although it was producing the correct answers for my test data, identical body text could produce wildly different statistical scores according to the amount of ‘junk’ boilerplate that was present. This article introduces the Boilerpipe library, which can be used to perform this task.
Continue reading Extracting Body Content from a Web Page
If you haven’t heard of it yet, the Raspberry Pi is a $25/$35 barebones computer intended to excite kids with programming and hardware projects. It is very much modeled on the British experience of home computing in the early 1980s and even has a “Model A” and a “Model B” in homage to the BBC Micro. It is about the size of an Altoids tin, uses an ARM CPU, boots from an SD flash card, and runs Linux. Further information can be found on the official Raspberry Pi website. Yes, it can also run NLTK!
Continue reading NLTK on the Raspberry PI
Previously, I showed you how to segment words and sentences whilst also taking into account full stops (periods) and abbreviations. The problem with that implementation is that it is easily confused by contiguous punctuation characters. For example, “).” is not recognized as the end of a sentence. This article shows you how to correct this.
Continue reading Sentence Segmentation: Handling multiple punctuation characters
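As a rough illustration of the contiguous-punctuation problem described above, the sketch below treats any unbroken run of punctuation as a single token and ends the sentence if that run contains a terminator. This is not the implementation from the article: the names `TOKEN_RE`, `TERMINATORS`, and `split_sentences` are hypothetical, and abbreviations and decimal numbers are deliberately ignored to keep the example short.

```python
import re

# Assumption: '.', '!' and '?' are the only sentence terminators.
TERMINATORS = set(".!?")

# A token is either a run of word characters or a run of contiguous
# non-space punctuation, so ")." is kept together as one token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def split_sentences(text):
    """Return a list of sentences, each a list of tokens.

    Simplified sketch: abbreviations ("e.g.") and decimals ("3.14")
    are not handled and would end a sentence early.
    """
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        is_punctuation = not token[0].isalnum()
        # End the sentence if the punctuation run contains a terminator,
        # wherever in the run it appears (")." as well as ".)").
        if is_punctuation and TERMINATORS & set(token):
            sentences.append(current)
            current = []
    if current:  # trailing text without a terminator
        sentences.append(current)
    return sentences


result = split_sentences("He left (at last). She stayed.")
```

Here `result` holds two sentences, and the first one ends with the single token `).` rather than being split at the bracket.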
Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English-language Wikipedia and Gutenberg corpora, memory constraints limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.
Continue reading Using BerkeleyDB to Create a Large N-gram Table
NLTK 2.0 has officially been released as “v2.0.1”, and can be downloaded here:
NLTK 2.0 was previously released as a “Release Candidate” – this is the first official release.
“Foundations of Statistical Natural Language Processing” by Christopher D. Manning and Hinrich Schütze has a relatively old publication date of 1999, but do not let this deter you from reading this useful book. It continues to be an important foundation text in a fast-moving field.
Continue reading Book Review: Foundations of Statistical Natural Language Processing
The Word Frequency Table scripts can be easily expanded to calculate N-Gram frequency tables. This post explains how.
Continue reading Calculating N-Gram Frequency Tables
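To give a flavour of what extending a word frequency table to N-grams involves, here is a minimal in-memory sketch using only the Python standard library. It is not the script from the post: the function name `ngram_counts` and the sample sentence are assumptions made for illustration, and tokenization is reduced to a whitespace split.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens.

    With n=1 this is an ordinary word frequency table; larger n
    slides a window of n tokens across the sequence.
    """
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# Print the most frequent bigrams first.
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Keeping each N-gram as a tuple of tokens makes it usable directly as a dictionary key, which is why the same function serves for any value of `n`.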
As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.
Continue reading Calculating Word and N-Gram Statistics from a Wikipedia Corpora