Assignment 4: Knowledge Extraction from Wikipedia
Due Thursday, April 19

Assignment overview

In class on April 3, we started writing knowledge extraction scripts using Wikipedia as a corpus. Your task is to build on this, creating a tool that learns informative statements just by reading encyclopedic text.

Your task

  • Choose a relation between entities that you want your system to learn. Some examples include "person X is a member of group Y", "event X happened before event Y", "X is a desirable/undesirable property of Y", or "X is located in Y".
  • Identify a pattern in text that tends to indicate your relation, but in an unstructured way.
    • In other words, the pattern should appear in natural language, not in structured data such as tables, infoboxes, or categories. However, you may use related structured information when reading the natural language text if it helps.
  • This pattern will produce a number of false positives. Collect a sample of its matches to build a training set, and mark each one as a positive or negative example.
  • When you have enough examples, train a classifier on your training set.
    • Part of this involves finding a set of features that help to sort out good examples from bad. These might be surface features such as the number of words between matching entities, grammatical features such as parts of speech, or semantic features such as WordNet or ConceptNet similarity.
    • The classifier doesn't have to be Naïve Bayes. We will cover other classifiers in lecture, and some of them may be more effective depending on your data. Particularly in NLTK, though, Naïve Bayes often provides a good combination of power, speed, and ease of use.
  • Re-run your information extractor, using the classifier to refine its output. Include the scores that the classifier gives, but also improve your overall precision by dropping outputs whose score falls below a threshold you choose. Include the resulting data when you turn in the assignment.
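
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions, not the expected solution: the relation ("X is located in Y"), the regular expression, the three features, the toy training sentences, and the 0.5 threshold are all made up for the example. To keep the sketch free of dependencies, it uses a small hand-written Naïve Bayes (the class name TinyNaiveBayes is hypothetical); the same (feature dictionary, label) pairs are the input format that NLTK's NaiveBayesClassifier.train expects, so swapping it in should be straightforward.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical surface pattern for "X is located in Y":
#   "<X> is a/an/the <...> in <Y>", where X and Y start with a capital letter.
PATTERN = re.compile(
    r"(?P<x>[A-Z][\w ]*?) is (?:a|an|the) (?P<mid>[^.]*?) in "
    r"(?P<y>[A-Z]\w*(?: [A-Z]\w*)*)"
)

def features(match):
    """Turn one pattern match into a dictionary of simple surface features."""
    mid = match.group("mid").split()
    return {
        "gap_words": min(len(mid), 5),               # words between the entities
        "head_noun": mid[-1].lower() if mid else "",  # last word before "in"
        "y_words": len(match.group("y").split()),     # length of the Y entity
    }

def candidates(text):
    """Yield (x, y, features) for every pattern match, good or bad."""
    for m in PATTERN.finditer(text):
        yield m.group("x").strip(), m.group("y"), features(m)

class TinyNaiveBayes:
    """Minimal categorical Naive Bayes with add-alpha smoothing (a stand-in)."""

    def __init__(self, labeled, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter(label for _, label in labeled)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for feats, label in labeled:
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def prob(self, feats, label):
        """Return the estimated P(label | feats)."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for lab, n in self.label_counts.items():
            score = math.log(n / total)
            for name, value in feats.items():
                count = self.value_counts[(lab, name)][value]
                vocab = len(self.values[name]) + 1  # one extra slot for unseen values
                score += math.log((count + self.alpha) / (n + self.alpha * vocab))
            log_scores[lab] = score
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return math.exp(log_scores[label] - top) / norm

# Toy hand-labeled training set (invented for this sketch).
TRAIN = [
    ("Boston is a city in Massachusetts.", True),
    ("Lyon is a city in France.", True),
    ("The Louvre is a famous museum in Paris.", True),
    ("Kyoto is a city in Japan.", True),
    ("Hamlet is a character in Shakespeare plays.", False),
    ("Frodo is a character in Tolkien novels.", False),
    ("Ishmael is the narrator in Moby Dick.", False),
]

labeled = [(feats, label)
           for sentence, label in TRAIN
           for _, _, feats in candidates(sentence)]
model = TinyNaiveBayes(labeled)

def extract(text, model, threshold=0.5):
    """Re-run the pattern, keeping only matches the classifier scores highly."""
    results = []
    for x, y, feats in candidates(text):
        score = model.prob(feats, True)
        if score >= threshold:
            results.append((x, y, score))
    return results

if __name__ == "__main__":
    new_text = "Osaka is a city in Japan. Bilbo is a character in Tolkien stories."
    for x, y, score in extract(new_text, model):
        print(f"located_in({x}, {y})  score={score:.2f}")
```

With this toy data, the "head_noun" feature does most of the work: matches whose middle ends in "city" or "museum" score well, while "character" matches fall below the threshold. A real system would need many more examples and richer features, such as parts of speech or WordNet similarity.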

Helpful code and data

60,000 Wikipedia articles in Wikitext format (421 MB)

Code for cleaning up Wiki articles and reading links as chunks
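
For orientation, here is a rough idea of what "reading links as chunks" can mean. This is not the provided course code, just a minimal sketch that assumes the common Wikitext link forms [[Target]] and [[Target|label]]; the function name read_links is hypothetical. It does not handle templates, nested brackets, or file and category links.

```python
import re

# Matches [[Target]] and [[Target|label]]; nested brackets are not handled.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def read_links(wikitext):
    """Split wikitext into plain strings and (target, label) link tuples."""
    chunks, pos = [], 0
    for m in LINK.finditer(wikitext):
        if m.start() > pos:
            chunks.append(wikitext[pos:m.start()])  # plain text before the link
        target = m.group(1).strip()
        label = m.group(2) or target                # no label: show the target
        chunks.append((target, label))
        pos = m.end()
    if pos < len(wikitext):
        chunks.append(wikitext[pos:])               # trailing plain text
    return chunks

if __name__ == "__main__":
    print(read_links("[[Boston]] is a city in [[Massachusetts|Mass.]]"))
```

Treating each link as a single chunk gives you ready-made entity boundaries: the link target names the entity even when the visible label is abbreviated.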

Assignment guidelines

  • The assignment is due on Thursday, April 19.
  • You may work in groups and share code and data for this assignment. Each person in the group must still submit their own writeup.
  • Turn in your code, a writeup that describes what you did for each step, and the resulting data.
  • Some helpful background appears in chapter 7 of the book.
  • Turn in the assignment by e-mailing it to havasi@media.mit.edu. You can do this in one of two ways:
    • Make a zip file of your code, writeup, and any supporting materials you need, and attach it to the e-mail.
    • Or put up the writeup as a Web page, linking to the code and any supporting materials, and e-mail a link to the page.