Assignment 3: Named entity recognition in Dutch
Due Thursday, March 15

CoNLL 2002

CoNLL 2000 was a bake-off for general chunking in English, which led to the creation of some very successful chunking systems. CoNLL 2002 followed up with the more difficult task of named entity recognition in Spanish and Dutch. We're going to attempt the version in Dutch, whose corpus text is from the Belgian newspaper De Morgen.

In class we presented a Naïve Bayes chunker that does reasonably well at identifying noun phrases. But on conll2002 it only gets an F-score of 40.8%. There's lots of room to improve it.

What's the F-score, anyway?

We forgot to actually give the formula for the F-score in class:

f = (2 * precision * recall) / (precision + recall)

The F-score is the harmonic mean of precision and recall, which means that it ranges from 0 to 1 as they do, and the score goes to 0 as either your precision or recall goes to 0. Thus, it rewards a balance of precision and recall, instead of going to either extreme.
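
To see the formula in action, here is a minimal sketch in Python. The function name f_score and the sample values are made up for illustration, and returning 0 when both inputs are 0 is a convention assumed here to avoid dividing by zero.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both between 0 and 1)."""
    if precision + recall == 0:
        # Assumed convention: report 0 instead of dividing by zero.
        return 0.0
    return (2.0 * precision * recall) / (precision + recall)


# Balanced scores: the F-score equals both of them.
balanced = f_score(0.5, 0.5)

# Lopsided scores with the same arithmetic mean (0.5) score much lower.
lopsided = f_score(0.9, 0.1)

print("balanced: %.3f" % balanced)
print("lopsided: %.3f" % lopsided)
```

Both pairs average to 0.5, but only the balanced pair keeps an F-score of 0.5; the lopsided pair drops well below it.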

Your task

Begin with the provided code, and edit it -- particularly the feature extractor -- to improve its F-score.

You may use any external resources or data, within reason (don't just run some code that did well at CoNLL 2002).

In addition to looking at the precision, recall, and F-score, make a confusion matrix, showing which tags are commonly confused with other tags (see Helpful Code below). Use it to understand the types of errors you are making.

In your writeup:

  • Describe the changes you made in your code.
  • List three features or changes you tried that improved the chunker's performance.
  • List three features or changes you tried that didn't improve the chunker's performance.
  • Look at the confusion matrix your chunker produces. What kinds of errors are most common?
  • Sometimes, you can get an IOB tag wrong and yet still get the chunks right. Why is this? Which entries in the confusion matrix are often (but not always) harmless for this reason?
  • Look at the most informative Bayesian features that the underlying classifier has learned, using chunker.tagger.classifier.show_most_informative_features(n), and describe some interesting features you see.
  • Describe an error your chunker makes that could be fixed if you had better training data.
  • Describe an error your chunker makes that seems to be unsolvable at the level of chunking. (You may need to Google Translate some sentences to find out what they mean.)

Helpful code: Printing a confusion matrix

NLTK's code for confusion matrices doesn't care what the inputs were, and it also doesn't care what the final chunks are. It just wants a list of tags, and a gold-standard list of tags to compare it to. So here's how to show a confusion matrix for your chunker:

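>>> import nltk
>>> # Gold-standard IOB tags, flattened across all test sentences
>>> gold_tags = [tag for item in test_data
...     for (word, pos, tag) in nltk.chunk.tree2conlltags(item)]
>>> # Tags predicted by the chunker's tagger for the same tokens
>>> result_tags = [tag for sentence in test_data
...     for (token, tag) in chunker.tagger.tag(sentence.leaves())]
>>> print(nltk.ConfusionMatrix(gold_tags, result_tags))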

       |           B                       I                   |
       |     B     -     B     B     I     -     I     I       |
       |     -     M     -     -     -     M     -     -       |
       |     L     I     O     P     L     I     O     P       |
       |     O     S     R     E     O     S     R     E       |
       |     C     C     G     R     C     C     G     R     O |
 B-LOC |  <364>   23     1    32     1     .     .    23    35 |
B-MISC |    80  <459>    8    51     .     2     .    24   124 |
 B-ORG |   106    95  <175>   72     .     2    20    56   160 |
 B-PER |    93     8     1  <377>    .     .     1   110   113 |
 I-LOC |     4     4     .    12   <10>    2     .    24     8 |
I-MISC |    11    11     1     6     .   <51>   11    61    63 |
 I-ORG |    17    12     5     9     1     9  <126>  141    76 |
 I-PER |     4    13     .    17     .     .     .  <359>   30 |
     O |   287   844     4   186     .    19     .   100<32533>|
(row = reference; col = test)

The diagonal contains correct tags; everything else represents the number of times one tag was confused for another. For example, 93 instances of B-PER were tagged as B-LOC instead.
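
If the full matrix is hard to scan, it may help to list only the largest off-diagonal entries. The sketch below is one possible way to do that with the standard library; top_confusions is a hypothetical helper (not part of NLTK), and the toy lists stand in for the real gold_tags and result_tags.

```python
from collections import Counter


def top_confusions(gold_tags, result_tags, n=5):
    """Return the n most frequent (gold, predicted) pairs where the tags differ."""
    errors = Counter(
        (gold, predicted)
        for gold, predicted in zip(gold_tags, result_tags)
        if gold != predicted
    )
    return errors.most_common(n)


# Toy data standing in for the real gold_tags and result_tags lists.
toy_gold = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG", "B-LOC"]
toy_result = ["B-PER", "I-PER", "O", "B-LOC", "B-MISC", "B-LOC", "B-LOC"]

for (gold, predicted), count in top_confusions(toy_gold, toy_result):
    print("%s tagged as %s: %d" % (gold, predicted, count))
```

On your own data, call top_confusions(gold_tags, result_tags) with the two lists built above.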

Assignment guidelines

  • The assignment is due on Thursday, March 15.
  • What you should turn in consists of code and a writeup, which describes what you did and answers the questions.
  • It's okay to get advice from other people, but the code in this assignment should be code you wrote yourself.
  • Some helpful background appears in chapter 7 of the book.
  • Turn in the assignment by e-mail. You can do this in one of two ways:
    • Make a zip file of your code, writeup, and any supporting materials you need, and attach the zip file in the e-mail.
    • Or put up the writeup as a Web page, linking to the code and any supporting materials, and e-mail a link to the page.


Probably the hard part is going to be distinguishing people, places, and organizations. Here are some things you could try (but you don't have to do all of them!):

  • The given feature extractor doesn't yet take its history into account. Can you get your classifier to learn that an I-PER can follow a B-PER, but an I-ORG can't?
  • Consider this as a similar problem to distinguishing male from female names. Are there features that tend to identify whether something names a person, place, or organization?
  • Take the context into account: can you find a list of prominent Belgian places, or common Belgian names, and use them as extra training data? Does it help?
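
As a starting point for these ideas, here is a rough sketch of a feature extractor that looks at the previous tag, the shape of the word, and a small place list. It assumes the (sentence, i, history) signature used by the chunker in chapter 7 of the book, which may not match the provided code exactly. The name ner_features, the feature names, the KNOWN_PLACES list, and the sample sentence are all invented for illustration. Whether any of these features actually helps is for you to test.

```python
# Hypothetical gazetteer; a real one would be loaded from an external list.
KNOWN_PLACES = {"Brussel", "Antwerpen", "Gent", "Leuven"}


def ner_features(sentence, i, history):
    """Features for token i of sentence, a list of (word, pos) pairs.

    history holds the IOB tags already assigned to tokens 0..i-1.
    """
    word, pos = sentence[i]
    # History: lets the classifier learn which tags may follow which.
    prev_tag = history[i - 1] if i > 0 else "<START>"
    prev_word = sentence[i - 1][0].lower() if i > 0 else "<START>"
    return {
        "pos": pos,
        "word": word.lower(),
        "prev-word": prev_word,
        "prev-tag": prev_tag,
        # Word-shape features, in the spirit of the name-gender classifier.
        "capitalized": word[:1].isupper(),
        "all-caps": word.isupper(),
        "suffix3": word[-3:].lower(),
        # External knowledge: is this word on the place list?
        "in-gazetteer": word in KNOWN_PLACES,
    }


# Made-up example: "Hij woont in Gent" ("He lives in Ghent").
sentence = [("Hij", "Pron"), ("woont", "V"), ("in", "Prep"), ("Gent", "N")]
history = ["O", "O", "O"]
features = ner_features(sentence, 3, history)
print(features)
```

Note that this uses the place list as a feature rather than as extra training data; both approaches are worth comparing.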