Calculating N-Gram Frequency Tables

The Word Frequency Table scripts can be easily expanded to calculate N-Gram frequency tables. This post explains how.

The Word Frequency Table scripts are not limited to string (word) keys: any hashable Python type can be used as a key. This means we could easily create an n-gram frequency table using tuples to represent n-grams. This works well for relatively small amounts of data, but the memory usage and memory churn of tuple keys add inefficiencies that become significant for large amounts of data. Therefore we shall use simple space-separated text strings to represent each n-gram.
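To illustrate the two representations, the standalone sketch below (not part of the scripts themselves) counts bigrams from a token list using both tuple keys and joined-string keys; both give identical counts, but the string form is the one we shall use:

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Tuple keys: each bigram is a 2-tuple of words
tuple_counts = Counter(zip(tokens, tokens[1:]))

# String keys: each bigram is a single space-separated string
string_counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

print(tuple_counts[("the", "cat")])   # 1
print(string_counts["the cat"])       # 1
```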

The previous word frequency script only requires two modifications. First, the constructor must be modified to store the size of the required n-grams:

class WordFrequencyBuilder(object):
    # n = Length of N-Gram
    def __init__ (self, n):
        self.N = n
        # Create one sentence tokenizer for re-use as required
        self.myTokenizer = SentenceTokenizer()

        # Create an empty frequency distribution
        self.myFD = FreqDist()

        # regex for 'words' to drop (numbers/punctuation)
        self.regexDrop = re.compile(r'^[\d\WeE]+$')
        # regex for punctuation (breaks N-grams)
        self.regexPunct = re.compile(r'^\W+$')
        self.regexNum = re.compile(r'^[\deE\-\+\.]+$')

In this modified constructor, ‘N’ stores the size of the n-grams to search for. Setting this in the constructor ensures we do not accidentally mix different n-gram sizes in the same table.
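The two filter regexes used by the class can be checked in isolation. This quick standalone sketch shows what each one matches (note that the number pattern also accepts exponent forms such as ‘1e-10’):

```python
import re

# Same patterns as in the constructor
regexPunct = re.compile(r'^\W+$')           # token is pure punctuation
regexNum = re.compile(r'^[\deE\-\+\.]+$')   # numbers, incl. signs/exponents

assert regexNum.match("3.14") is not None
assert regexNum.match("1e-10") is not None
assert regexPunct.match("...") is not None
assert regexPunct.match("cat") is None      # ordinary words match neither
assert regexNum.match("cat") is None
```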

Next we modify the processText() method to count n-grams rather than individual word frequencies. We shall mark sentence beginnings and ends with ‘&lt;s&gt;’ and ‘&lt;/s&gt;’ markers. This implementation can optionally skip numbers and punctuation; a skipped token breaks the current n-gram, i.e. an n-gram cannot span a skipped token. The calling methods will need to be modified to pass these parameters.

Here is the new processText() implementation:

    # Used by buildTableForFile() to process a section of text.
    # The supplied text is assumed to be a paragraph.
    # text: Text to process
    # incl_num, incl_punct: Should numbers or punctuation be included?
    # A skipped (excluded) number or punctuation token breaks the
    # current n-gram.
    # If N>1, sentence start/end are recorded as '<s>' and '</s>'
    # If N=1, the '<s>' and '</s>' markers are not used
    def processText(self, text, incl_num, incl_punct):
        # segment the text into words and sentences
        # Only words are required, but sentence segmentation is involved
        # because we want to interpret full stops correctly
        sentences = self.myTokenizer.segment_text(text)
        for sentence in sentences:
            n_gram = ['<s>'] if self.N > 1 else []
            for word in sentence:
                wd = word.lower().strip()
                if (len(wd) == 0):
                    continue
                if (self.regexNum.match(wd)):
                    # numeric
                    if (not incl_num):
                        n_gram = []    # reset (break) the n-gram
                        continue
                elif (self.regexPunct.match(wd)):
                    # punctuation
                    if (not incl_punct):
                        n_gram = []    # reset (break) the n-gram
                        continue

                # token accepted: add it to the current n-gram
                n_gram.append(wd)

                # Shrink the n-gram if it has grown too big
                if (len(n_gram) > self.N):
                    del n_gram[0]

                # Record the n-gram once it is big enough
                # (FreqDist behaves like a collections.Counter)
                if (len(n_gram) == self.N):
                    self.myFD[" ".join(n_gram)] += 1

            # sentence finish - record a sentence-end marker if N>1
            if (self.N > 1):
                n_gram.append('</s>')
                if (len(n_gram) > self.N):
                    del n_gram[0]
                if (len(n_gram) == self.N):
                    self.myFD[" ".join(n_gram)] += 1

That is all that is required to convert the word frequency table script into an n-gram frequency table script. The main limitation is that the frequency table is constructed entirely in memory. This is not a problem for simple word frequencies, but it becomes a major problem for large datasets such as the Wikipedia or Gutenberg texts. Even a bigram frequency table for either of these corpora will take more than 8GB of RAM. In a future article I shall address this issue by creating a disk-based version of the table. A disk version can also perform more sophisticated subset queries (e.g. “find the frequency table for n-grams that start with the following words”).
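As a rough preview of what a disk-based table might look like, here is a minimal sketch using an SQLite table (a hypothetical schema of my own, not the future article's implementation; the UPSERT syntax requires SQLite 3.24 or later):

```python
import sqlite3

# Hypothetical on-disk frequency table; ":memory:" is used for the demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE freq (ngram TEXT PRIMARY KEY, count INTEGER)")

def record(ngram):
    # Insert the n-gram, or increment its count if already present
    conn.execute(
        "INSERT INTO freq VALUES (?, 1) "
        "ON CONFLICT(ngram) DO UPDATE SET count = count + 1", (ngram,))

for ng in ["the cat", "the cat", "cat sat"]:
    record(ng)

# Subset query: all recorded bigrams starting with a given word
rows = conn.execute(
    "SELECT ngram, count FROM freq WHERE ngram LIKE 'the %'").fetchall()
```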
