Assignment 1: Language modeling and text generation
Due Thursday, February 16

Your task

  • Make a corpus of text that's not built into NLTK.
  • Make unigram probability distributions from this corpus and from the Brown corpus (from nltk.corpus import brown). Do appropriate smoothing so you don't have zero probabilities all over the place. (One possible smoothing approach is sketched after this list.)
  • Compare these probability distributions. What are the 50 words in your corpus whose probability is highest relative to their probability in the Brown corpus?
  • Build a probability distribution for bigrams of words in your corpus. What are the 50 bigrams with the highest pointwise mutual information relative to the unigram probabilities in your corpus? (The PMI formula appears in the sketch below.)
  • Write code that creates, for each Markov state (the previous n-1 words of an n-gram), a probability distribution over the word that comes next. (Code to get you started is below.) Use it to make a function called generate_text that generates text from your corpus. Show examples for n-grams with n = 1 through 5.
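
To make the distribution steps concrete, here is a minimal sketch of one possible approach. It uses NLTK's LidstoneProbDist for smoothing and computes PMI directly from the definition PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). The function names, the gamma value, and the decision to lowercase everything are illustrative assumptions, not requirements; any reasonable smoothing method is fine.

import math
from nltk.probability import FreqDist, LidstoneProbDist

def unigram_dist(words):
    # Lidstone (add-gamma) smoothing keeps unseen words from getting
    # probability zero; gamma=0.5 is an arbitrary illustrative choice.
    freqs = FreqDist(w.lower() for w in words)
    return LidstoneProbDist(freqs, gamma=0.5, bins=freqs.B() + 1)

def most_distinctive(my_words, brown_words, n=50):
    # Words whose smoothed probability in your corpus is highest
    # relative to their smoothed probability in the Brown corpus.
    mine = unigram_dist(my_words)
    brown = unigram_dist(brown_words)
    vocab = set(w.lower() for w in my_words)
    return sorted(vocab, key=lambda w: mine.prob(w) / brown.prob(w),
                  reverse=True)[:n]

def pmi(bigram, unigram_probs, bigram_probs):
    # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), where
    # bigram_probs is a smoothed distribution over bigram tuples
    # built the same way as the unigram one.
    w1, w2 = bigram
    return math.log(bigram_probs.prob(bigram) /
                    (unigram_probs.prob(w1) * unigram_probs.prob(w2)), 2)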

Helpful code

Here's one way to build a corpus from a text file:

import nltk
file_contents = open('my_corpus.txt', encoding='utf-8').read()
text = nltk.Text(nltk.word_tokenize(file_contents))
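
Depending on your NLTK installation, you may need to download the tokenizer models before nltk.word_tokenize will work (for example, by running nltk.download('punkt')).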

nltk.probability contains useful classes for making probability distributions out of frequency counts. When you have a distribution whose probabilities add up to 1, you can use its .generate() method to sample a random item from it.
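
For example, here is a minimal illustration (the word list is made up):

from nltk.probability import FreqDist, MLEProbDist

freqs = FreqDist(['the', 'cat', 'sat', 'on', 'the', 'mat'])
dist = MLEProbDist(freqs)   # probabilities now sum to 1
print(dist.prob('the'))     # 2/6, about 0.33
print(dist.generate())      # samples one word at random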

Here's some code to get you started on the Markov text generator. It creates a mapping from Markov states to probability distributions of words that can come next.

from collections import defaultdict
from nltk.probability import FreqDist, MLEProbDist

def make_freqs(text, size=3):
    """
    For each context, make a FreqDist of what comes next.
    """
    # This defaultdict acts like a dictionary, but it will create an
    # empty FreqDist() when asked about a context it hasn't seen.
    freqs = defaultdict(FreqDist)

    # Scan over windows of the appropriate size.
    for left in range(len(text) - size + 1):
        # If we are creating a model based on N-grams, then the context we
        # use for Markov chaining will contain (N-1) words and we will predict
        # the Nth.
        right = left + size - 1
        nextword = text[right]
        
        # Make the sequence of words something that we can store in a
        # dictionary.
        context = tuple(text[left:right])

        # Record the fact that this word appeared in this context once.
        freqs[context][nextword] += 1

        # Also record that this context appeared at all. You might want to
        # choose a random context, for example, to get the process started,
        # or to fall back on if you end up in a state where no known word
        # comes next.
        freqs[None][context] += 1
    return freqs

def make_probs(text, size=3):
    """
    Convert the FreqDists to ProbDists using the maximum likelihood estimate.
    """
    freqs = make_freqs(text, size)
    probs = {}
    for context in freqs:
        # Use a maximum likelihood estimate, so that we're always generating
        # words we know about.
        probs[context] = MLEProbDist(freqs[context])
    return probs
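
Here is one possible shape for generate_text, sketched under the assumptions hinted at in the comments above: start from a random context drawn from the None entry, and fall back on a fresh random context if you reach a state with no known next word. The variable names and the length parameter are illustrative choices, not part of the assignment.

def generate_text(text, size=3, length=50):
    """
    Generate roughly `length` words of text using Markov states of
    (size - 1) previous words.
    """
    probs = make_probs(text, size)
    context = probs[None].generate()      # random starting state
    words = list(context)
    while len(words) < length:
        if context in probs:
            words.append(probs[context].generate())
        else:
            # Unknown state: restart from a random known context.
            context = probs[None].generate()
            words.extend(context)
            continue
        # Slide the window forward by one word.
        context = tuple(words[-(size - 1):]) if size > 1 else ()
    return ' '.join(words)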

Assignment guidelines

  • The assignment is due on Thursday, February 16.
  • What you should turn in consists of code and a brief writeup describing what you did.
  • It's okay to get advice from other people, but the code in this assignment should be code you wrote yourself. We'll have group assignments later on in the class.
  • Some helpful background appears in chapters 1 and 2 of the book.
  • Turn in the assignment by e-mailing it to havasi@media.mit.edu. You can do this in one of two ways:
    • Make a zip file of your code, writeup, and any supporting materials you need, and attach the zip file in the e-mail.
    • Or put up the writeup as a Web page, linking to the code and any supporting materials, and e-mail a link to the page.