Calculating Word and N-Gram Statistics from a Wikipedia Corpus

As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.

The English language download of Wikipedia (download pages-articles.xml.bz2 from http://en.wikipedia.org/wiki/Wikipedia:Database_download ) contains a large amount of markup. Use the wikipedia2text scripts (see Generating a Plain Text corpus from Wikipedia for instructions and download link) followed by WikiExtractor.py to produce consolidated plain text files of articles.

The consolidated files have XML wrappers around each individual entry, and they are stored in a directory hierarchy. As with the Gutenberg corpus, a custom Python script is used to remove unnecessary information and to copy the files into one large flat directory. Here is the script for the Wikipedia pages:

# Loops over all of the Wikipedia text files in /var/bigdisk/wikipedia/text
# Removes XML wrappers, and copies them to ./wk_raw
# Files have to be renamed using their sub-directories to ensure duplicate
# names do not overwrite each other

import os
import lxml.html

# Empty the output directory
outputdir = "/var/bigdisk/wikipedia/wk_raw"
for f in os.listdir(outputdir):
    fpath = os.path.join(outputdir, f)
    try:
        if os.path.isfile(fpath):
            os.unlink(fpath)
    except Exception, e:
        print e

# Recursively walk the entire directory tree, copying every file into
# the flat output directory
for dirname, dirnames, filenames in os.walk('/var/bigdisk/wikipedia/text'):
    for fname in filenames:
        infile = os.path.join(dirname, fname)
        # Prefix the last two characters of the sub-directory name so that
        # identically named files from different sub-directories do not clash
        ofname = dirname[-2:] + "_" + fname
        outfile = os.path.join(outputdir, ofname)

        # Copy the file, stripping XML tags and decoding HTML entities
        print "Copying: " + infile
        fout = open(outfile, "w")
        for line in open(infile):
            ln = line.strip()
            if len(ln) > 0:
                try:
                    ln = lxml.html.fromstring(ln).text_content()
                except:
                    ln = ""
                ln = ln.encode('latin-1', 'ignore')
            fout.write(ln + "\n")
        fout.close()

This simply walks the directory tree, copying each file. During the copy, all XML tags are removed and HTML entity references are converted to the characters they represent. The text is then encoded as ISO-8859-1 (aka 'latin-1'), silently dropping any characters outside that set. WikiExtractor.py reuses the same file names in each sub-directory, so new names are created that also incorporate the sub-directory name.
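The script above relies on lxml for the tag and entity handling, but the same transformation can be illustrated with the standard library alone (this is essentially the `re.sub` approach that appears commented out in the script, plus entity decoding). The sample line below is invented for illustration; it is not taken from the Wikipedia dump.

```python
import re
from html import unescape

# A line as WikiExtractor.py might emit it: an XML wrapper plus entities
line = '<doc id="12" title="Example">Caf&eacute; society &amp; <b>politics</b></doc>'

# Strip tags with a non-greedy regex, then decode the HTML entities
text = unescape(re.sub('<[^<]+?>', '', line))
print(text)  # Café society & politics
```

Note the order matters: decoding entities first could introduce literal `<` characters that the tag-stripping regex would then mangle.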

The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (21.6MB). A combined (Wikipedia and Gutenberg) frequency table is also available (45.7MB).
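The word frequency table scripts themselves are described elsewhere, but the core counting step can be sketched in a few lines (the function name and tokenising regex here are illustrative assumptions, not the author's actual code):

```python
import collections
import re

def word_frequencies(text):
    """Count case-folded words in a block of plain text."""
    words = re.findall(r"[a-z']+", text.lower())
    return collections.Counter(words)

sample = "The cat sat on the mat. The mat sat still."
freqs = word_frequencies(sample)
print(freqs["the"], freqs["mat"])  # 3 2
```

In practice this would be run over every file in the flat `wk_raw` directory, summing the `Counter` objects as it goes.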

Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Wikipedia dataset will prove a challenge. I shall address this in a future article.
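For reference, the modification itself is small; a minimal N-gram counter (again, illustrative names rather than the actual scripts) only needs `zip()` over offset copies of the token list:

```python
import collections

def ngram_frequencies(words, n):
    """Count n-grams (as tuples of words) in a list of tokens."""
    grams = zip(*(words[i:] for i in range(n)))
    return collections.Counter(grams)

words = "the cat sat on the mat".split()
bigrams = ngram_frequencies(words, 2)
print(bigrams[("the", "cat")])  # 1
```

The difficulty with Wikipedia is not the code but the memory footprint: the number of distinct N-grams grows far faster than the number of distinct words.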
