# Assignment 1: Language modeling and text generation

Due Thursday, February 16.

• Make a corpus of text that's not built into NLTK.
• Make unigram probability distributions from this corpus and from the Brown corpus (`from nltk.corpus import brown`). Do appropriate smoothing so you don't have zero probabilities all over the place.
• Compare these probability distributions. What are the 50 words in your corpus that are most likely relative to the Brown corpus?
• Build a probability distribution for bigrams of words in your corpus. What are the 50 bigrams with the most pointwise mutual information compared to the unigrams in your corpus?
• Write code that will create probability distributions for each Markov state of the previous n words. (Code to get you started is below.) Use it to make a function called `generate_text` that generates text from your corpus. Show examples for n-grams with n=1 through 5.
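To make the smoothing and PMI steps concrete, here is a small self-contained sketch of the underlying math using plain `Counter` objects. The toy corpora are made up for illustration; in the assignment you would use NLTK's smoothing classes (such as `LaplaceProbDist`) and your real corpora instead.

```python
import math
from collections import Counter

# Toy corpora standing in for your corpus and the Brown corpus.
mine = "the cat sat on the mat the cat ran".split()
other = "the dog ran in the park".split()

def laplace_prob(word, counts, vocab_size):
    """Add-one (Laplace) smoothing: no word gets probability zero."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

c_mine, c_other = Counter(mine), Counter(other)
vocab = set(mine) | set(other)

# Relative likelihood: how much more probable is each word in my corpus?
ratios = {
    w: laplace_prob(w, c_mine, len(vocab)) / laplace_prob(w, c_other, len(vocab))
    for w in vocab
}
top = sorted(ratios, key=ratios.get, reverse=True)  # take the first 50 in practice

# Pointwise mutual information of a bigram (w1, w2) relative to the unigrams:
# PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
bigrams = Counter(zip(mine, mine[1:]))
n_uni, n_bi = len(mine), len(mine) - 1

def pmi(w1, w2):
    p_joint = bigrams[(w1, w2)] / n_bi
    p1, p2 = c_mine[w1] / n_uni, c_mine[w2] / n_uni
    return math.log2(p_joint / (p1 * p2))
```

Note that this sketch computes PMI from unsmoothed counts; for the assignment, think about whether your PMI numbers should use the smoothed distributions as well.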

Here's one way to build a corpus from a text file:

```
import nltk

# Read the file's contents (any plain-text file works here).
with open("my_corpus.txt") as f:
    file_contents = f.read()

text = nltk.Text(nltk.word_tokenize(file_contents))
```

`nltk.probability` contains useful classes for making probability distributions out of frequency counts. When you have a distribution whose probabilities add up to 1, you can use its `.generate()` method to sample a random item from it.
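For example, with a toy frequency count (not from a real corpus):

```python
from nltk.probability import FreqDist, MLEProbDist

fd = FreqDist(["the", "the", "cat"])  # raw counts: the=2, cat=1
pd = MLEProbDist(fd)                  # probabilities that add up to 1

p = pd.prob("the")     # 2/3 under the maximum likelihood estimate
word = pd.generate()   # a randomly sampled item: "the" or "cat"
```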

Here's some code to get you started on the Markov text generator. It creates a mapping from Markov states to probability distributions of words that can come next.

```
from collections import defaultdict
from nltk.probability import FreqDist, MLEProbDist

def make_freqs(text, size=3):
    """
    For each context, make a FreqDist of what comes next.
    """
    # This defaultdict acts like a dictionary, but it will create an
    # empty FreqDist automatically the first time each context is seen.
    freqs = defaultdict(FreqDist)

    # Scan over windows of the appropriate size.
    for left in range(len(text) - size + 1):
        # If we are creating a model based on N-grams, then the context we
        # use for Markov chaining will contain (N-1) words and we will predict
        # the Nth.
        right = left + size - 1
        nextword = text[right]

        # Make the sequence of words something that we can store in a
        # dictionary.
        context = tuple(text[left:right])

        # Record the fact that this word appeared in this context once.
        freqs[context][nextword] += 1

        # Also record that this context appeared at all. You might want to
        # choose a random context, for example, to get the process started,
        # or to fall back on if you end up in a state where no known word
        # comes next.
        freqs[None][context] += 1
    return freqs

def make_probs(text, size=3):
    """
    Convert the FreqDists to ProbDists using the maximum likelihood estimate.
    """
    freqs = make_freqs(text, size)
    probs = {}
    for context in freqs:
        # Use a maximum likelihood estimate, so that we're always generating
        # words that actually occurred after this context.
        probs[context] = MLEProbDist(freqs[context])
    return probs
```
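Once you have a mapping like the one `make_probs` returns, sampling a next word is a matter of looking up the current context and calling `.generate()`. Here is a minimal sketch; the mapping is hand-built (with made-up words) so the snippet stands on its own, but with real text you would build it from your corpus instead.

```python
from nltk.probability import FreqDist, MLEProbDist

# A hand-built stand-in for the output of make_probs(..., size=3).
probs = {
    ("the", "cat"): MLEProbDist(FreqDist(["sat", "sat", "ran"])),
    None: MLEProbDist(FreqDist([("the", "cat")])),
}

context = probs[None].generate()      # pick a random starting context
nextword = probs[context].generate()  # sample a word that can follow it
# From here, generate_text just needs to append nextword and slide the
# context window forward one word at a time, falling back on probs[None]
# when it reaches a context with no known continuation.
```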