As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.
The English-language download of Wikipedia (pages-articles.xml.bz2, available from http://en.wikipedia.org/wiki/Wikipedia:Database_download) contains a large amount of markup. Use the wikipedia2text scripts (see Generating a Plain Text corpus from Wikipedia for instructions and download link) followed by WikiExtractor.py to produce consolidated plain text files of articles.
The consolidated files have XML wrappers around each individual entry, and they are stored in a directory hierarchy. As with the Gutenberg corpus, a custom Python script is used to remove unnecessary information and to copy the files into one large flat directory. Here is the script for the Wikipedia pages:
# Loops over all of the Wikipedia text files in /var/bigdisk/wikipedia/text
# Removes XML wrappers, and copies them to ./wk_raw
# Files have to be renamed using sub-directories to ensure duplicate names
# do not overwrite

import string
import os
import gc
import shutil
import lxml.html

# Recursively walk the entire directory tree finding all .txt files which
# are not in old sub-directories. readme.txt files are also skipped.

# Empty the output directory
outputdir = "/var/bigdisk/wikipedia/wk_raw"
for f in os.listdir(outputdir):
    fpath = os.path.join(outputdir, f)
    try:
        if (os.path.isfile(fpath)):
            os.unlink(fpath)
    except Exception, e:
        print e

for dirname, dirnames, filenames in os.walk('/var/bigdisk/wikipedia/text'):
    for fname in filenames:
        infile = os.path.join(dirname, fname)
        ofname = dirname[-2:] + "_" + fname

        # Output filename
        outfile = os.path.join(outputdir, ofname)

        # Copy the file, removing all XML tags
        print "Copying: " + infile
        fout = open(outfile,"w")
        for line in open(infile):
            # remove XML tags
            #ln = re.sub('<[^<]+?>', '', line)
            ln = line.strip()
            if ( len(ln) > 0):
                #print ln
                try:
                    ln = lxml.html.fromstring(ln).text_content()
                except:
                    ln = ""
                ln = ln.encode('latin-1', 'ignore')
                #print ln
                fout.write(ln + "\n")
        fout.close()
The script simply walks the directory tree, copying each file. During the copy, all XML tags are removed and HTML entities are converted to the characters they represent. The text is then encoded as ISO-8859-1 (‘8 bit ASCII’, aka ‘latin-1’); any character with no ISO-8859-1 equivalent is dropped. WikiExtractor.py reuses the same file names in every sub-directory, so each output name is prefixed with the last two characters of its sub-directory name to keep it unique.
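The per-line clean-up and the renaming can be tried in isolation. The following is only a minimal sketch, written for Python 3 and using the standard library in place of lxml: it reuses the regular expression that is commented out in the script above and html.unescape for the entities. The function names clean_line and flat_name are illustrative and do not appear in the original script, and the sample path is an assumed example of WikiExtractor.py output.

```python
import html
import re

# Same pattern as the commented-out regex in the script above
TAG_RE = re.compile(r"<[^<]+?>")


def clean_line(line):
    """Strip tags, decode HTML entities, and keep only Latin-1 characters."""
    text = TAG_RE.sub("", line.strip())
    text = html.unescape(text)
    # Round-trip through Latin-1, dropping anything it cannot represent
    return text.encode("latin-1", "ignore").decode("latin-1")


def flat_name(dirname, fname):
    """Prefix the file name with the last two characters of its directory."""
    return dirname[-2:] + "_" + fname


if __name__ == "__main__":
    print(clean_line('<doc id="12" title="Anarchism">'))
    print(clean_line("Caf&eacute; &amp; bar <b>now</b>"))
    # Assumed example path; real sub-directory names depend on the extractor run
    print(flat_name("/var/bigdisk/wikipedia/text/AA", "wiki_00"))
```

A regular expression is a cruder tool than a real HTML parser, so this version may behave differently from the lxml-based script on malformed markup.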
The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (21.6MB). A combined (Wikipedia and Gutenberg) frequency table is also available (45.7MB).
Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Wikipedia dataset will prove a challenge. I shall address this in a future article.
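To give a rough idea of what counting N-grams involves, here is a minimal Python 3 sketch. It is not taken from the word frequency table scripts; the function name ngram_counts is hypothetical, and the input is assumed to be a list of words that has already been tokenised.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in the list."""
    # zip over n staggered copies of the list to form the n-grams
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


if __name__ == "__main__":
    counts = ngram_counts("the cat sat on the cat".split(), 2)
    for gram, freq in counts.most_common():
        print(freq, gram)
```

Holding every count in a single in-memory Counter is fine for a small sample, but the number of distinct N-grams grows far faster than the number of distinct words, which is why the full Wikipedia text needs a different approach.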