In class on April 3, we started writing knowledge extraction
scripts using Wikipedia as a corpus. Your task is to build on this,
creating a tool that learns informative statements just by reading
Wikipedia.
- Choose a relation between entities that you want your tool
to learn. Some examples include "person X is a member of group Y",
"event X happened before event Y", "X is a desirable/undesirable
property of Y", or "X is located in Y".
- Identify a pattern in text that tends to indicate your
relation, but in an unstructured way.
- In other words, the pattern should appear in natural
language, not in structured data such as tables, infoboxes, or
categories. However, you may use related structured
information when reading the natural language text if it helps.
- This pattern will produce a number of false positives. Use
some examples to build a training set, and mark them as
positive or negative examples.
- When you have enough examples, train a classifier on your
training set.
- Part of this involves finding a set of features
to sort out good examples from bad. These might be surface
features such as the number of words between matching entities,
grammatical features such as parts of speech, or semantic
features such as WordNet or ConceptNet similarity.
- The classifier doesn't have to be Naïve Bayes. In fact, we will
cover other classifiers in lecture that may be more
effective depending on your data. But, particularly in NLTK,
Naïve Bayes often provides the optimal combination of power,
speed, and ease of use.
- Re-run your information extractor, using the classifier to
refine its output. Include the scores that the classifier assigns,
but also improve your overall precision by dropping outputs whose
score is too low. Include the resulting data when you turn
in the assignment.
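The steps above can be sketched end to end as follows. This is only a minimal illustration, not the code from class: the relation ("X is located in Y"), the regular expression, the three surface features, the toy labeled sentences, and the 0.5 threshold are all assumptions chosen for the example, and every function name here is hypothetical. The classifier is NLTK's `NaiveBayesClassifier`; a real training set would need far more than eight examples.

```python
import re
from nltk.classify import NaiveBayesClassifier

# Hypothetical pattern for "X is located in Y": a capitalized name,
# up to four lowercase words, the word "in", then another capitalized name.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PATTERN = re.compile(rf"({NAME}) ((?:[a-z]+ ){{0,4}})in ({NAME})")

def find_candidates(text):
    """Yield (x, words_between, y) for every pattern match in text."""
    for match in PATTERN.finditer(text):
        x, middle, y = match.groups()
        yield x, middle.split(), y

def features(x, between, y):
    """Surface features only; grammatical or semantic features could be added."""
    return {
        "gap": len(between),
        "first_word": between[0] if between else "<none>",
        "last_word": between[-1] if between else "<none>",
    }

# Toy hand-labeled matches (invented for illustration). True means the match
# really expresses "X is located in Y"; False marks a false positive.
LABELED = [
    ("Paris is located in France", True),
    ("Kyoto is a city in Japan", True),
    ("Boston lies in Massachusetts", True),
    ("Nairobi is situated in Kenya", True),
    ("Einstein was born in March", False),
    ("Napoleon was defeated in June", False),
    ("Alice believes in Santa", False),
    ("Mozart was interested in Bach", False),
]

def build_training_set(labeled_sentences):
    """Turn labeled sentences into (feature dict, label) pairs for NLTK."""
    training = []
    for sentence, label in labeled_sentences:
        for x, between, y in find_candidates(sentence):
            training.append((features(x, between, y), label))
    return training

classifier = NaiveBayesClassifier.train(build_training_set(LABELED))

def extract(text, classifier, threshold=0.5):
    """Re-run the pattern, keeping only matches the classifier scores highly."""
    results = []
    for x, between, y in find_candidates(text):
        score = classifier.prob_classify(features(x, between, y)).prob(True)
        if score >= threshold:
            results.append((x, y, score))
    return results

if __name__ == "__main__":
    sample = "Lima is located in Peru. Darwin was born in February."
    for x, y, score in extract(sample, classifier):
        print(f"{x} is located in {y} (score {score:.2f})")
```

Raising `threshold` trades recall for precision, which is the refinement the last step asks for. Keeping the score alongside each pair makes it easy to report both.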
Helpful code and data
- 60,000 Wikipedia articles in Wikitext format (421 MB)
- Code for cleaning up Wiki articles and reading links as chunks
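The provided helper code is not reproduced here. As a rough idea of what reading links as chunks can involve, here is a minimal sketch that assumes the standard wikitext link forms `[[Target]]` and `[[Target|label]]`. The function name `split_into_chunks` and the chunk representation are hypothetical, and the sketch does not handle nested links, templates, or other markup.

```python
import re

# Matches [[Target]] and [[Target|label]]. This is a simplification of
# real wikitext, which also allows nested and namespaced links.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def split_into_chunks(wikitext):
    """Split wikitext into plain-text strings and (target, label) link tuples."""
    chunks = []
    position = 0
    for match in LINK.finditer(wikitext):
        if match.start() > position:
            chunks.append(wikitext[position:match.start()])
        target = match.group(1).strip()
        label = match.group(2) or target  # a missing label falls back to the target
        chunks.append((target, label))
        position = match.end()
    if position < len(wikitext):
        chunks.append(wikitext[position:])
    return chunks

if __name__ == "__main__":
    print(split_into_chunks("[[Paris]] is the capital of [[France|the French Republic]]."))
```

Treating each link as a single chunk gives the extractor ready-made entity candidates, since a link target names a specific Wikipedia article.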