A large corpus can provide a wide variety of useful information, provided that there are decent tools to extract it. In Natural Language Processing (NLP), for example, statistical information obtained from large corpora (consisting of tens of millions of words) is used to inform many different tasks, ranging from guessing the most likely parsing for a sentence to determining the likelihood that a document matches key terms in a search.
In this tutorial, we will look at one particular English corpus, the Wall Street Journal (WSJ) corpus, which is a component of the Penn Treebank, and show how it can be manipulated using Python. (The article assumes at least basic familiarity with Python. If Python is new to you, try the Python-related links at the end of the article.) We will first build some homegrown tools for parsing and manipulating the WSJ corpus, and then discuss how the Natural Language Toolkit (NLTK) for Python can be used to accomplish some of the same tasks.
The full WSJ corpus comes with the Penn Treebank, which is available from the Linguistic Data Consortium (LDC). The full corpus is only available to members of the LDC, but a small part of it can be found in one of the NLTK's modules. Currently, there are three NLTK modules:
nltk: the actual Python packages and modules for the Natural Language Toolkit
nltk-data: a collection of corpora and sample data that can be easily used with NLTK
nltk-contrib: third-party modules and packages based on NLTK
(The latest version of the NLTK, at the time of writing, is 1.4. If you install another version, there's no guarantee that all of the code here will work.)
Full installation instructions for the NLTK can be found here. For now, you only need to download and install nltk-data; instructions for its installation are available for both Unix and Windows. We will assume here that the reader is working in a Unix environment and that nltk-data is installed under /usr/share/nltk.
Our corpus of choice for this tutorial is the WSJ corpus, which consists of WSJ articles that have been tagged for their part-of-speech and annotated for their grammatical structure. For each article, there are three files: the raw text, the tagged text, and the annotated text. (We'll ignore the annotated texts here and focus on the raw and tagged ones.)
Let's have a look at a sample file from the corpus, which is a short article about Zenith obtaining a lucrative contract with the American Navy. The plain text (raw) version of the article looks like this (wsj_0099):
Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., received a $534 million Navy contract for software and services of microcomputers over an 84-month period. Rockwell International Corp. won a $130.7 million Air Force contract for AC-130U gunship replacement aircraft. Martin Marietta Corp. was given a $29.9 million Air Force contract for low-altitude navigation and targeting equipment. Federal Data Corp. got a $29.4 million Air Force contract for intelligence data handling.
The tagged version of the same article looks like this (wsj_0099.pos):
[ Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ] ,/, [ a/DT subsidiary/NN ] of/IN [ Zenith/NNP Electronics/NNP Corp./NNP ] ,/, received/VBD [ a/DT $/$ 534/CD million/CD Navy/NNP contract/NN ] for/IN [ software/NN ] and/CC [ services/NNS ] of/IN [ microcomputers/NNS ] over/IN [ an/DT 84-month/JJ period/NN ] ./. [ Rockwell/NNP International/NNP Corp./NNP ] won/VBD [ a/DT $/$ 130.7/CD million/CD Air/NNP Force/NNP contract/NN ] for/IN [ AC-130U/NN gunship/NN replacement/NN aircraft/NN ] ./. [ Martin/NNP Marietta/NNP Corp./NNP ] was/VBD given/VBN [ a/DT $/$ 29.9/CD million/CD Air/NNP Force/NNP contract/NN ] for/IN [ low-altitude/NN navigation/NN ] and/CC [ targeting/VBG|NN equipment/NN ] ./. [ Federal/NNP Data/NNP Corp./NNP ] got/VBD [ a/DT $/$ 29.4/CD million/CD Air/NNP Force/NNP contract/NN ] for/IN [ intelligence/NN data/NNS handling/NN ] ./.
In the tagged version, each sentence in the article has been broken down into words, and each word has been associated with a tag that describes how the word functions in the sentence. These tags refer to what is traditionally known as a part-of-speech, such as noun, verb, adjective, or adverb. (And if you ever watched Grammar Rock, you may remember others, like the conjunction: "Conjunction junction, what's your function? Hookin' up words and phrases and clauses.")
Wall Street Journal (WSJ) Tagset

CC      Coordinating conjunction
EX      Existential there
IN      Preposition/subord. conjunction
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
POS     Possessive ending
PP$     Possessive pronoun
RBS     Adverb, superlative
SYM     Symbol (mathematical or scientific)
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund/present participle
VBN     Verb, past participle
VBP     Verb, non-3rd ps. sing. present
VBZ     Verb, 3rd ps. sing. present
WDT     wh-determiner
WP$     Possessive wh-pronoun
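If you find yourself glossing tags in code, the table above can be kept as a plain dictionary. Here is a minimal sketch (the TAGSET dictionary and describe function are illustrative conveniences, not part of any NLTK module):

```python
# A few entries from the WSJ tagset above, as a lookup table.
TAGSET = {
    "CC": "Coordinating conjunction",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "VBD": "Verb, past tense",
    "VBN": "Verb, past participle",
}

def describe(tag):
    # fall back to the raw tag for anything not in the table
    return TAGSET.get(tag, tag)

print(describe("VBD"))  # Verb, past tense
print(describe("XYZ"))  # XYZ
```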
When writing programs to analyze corpora, we often want quick-and-dirty tools for the rapid extraction of information. However, we also sometimes want to build larger systems. The ideal would be to have general-purpose tools that can be reused, either in full-scale applications or in short one-off scripts. Scripting languages fit the bill quite well, especially those with very good string processing capabilities, such as Perl and Python. Since Python has the Natural Language Toolkit (NLTK), which provides various tools for natural language processing and comes with a sample of the WSJ corpus, it is our language of choice.
One question we might immediately ask ourselves is: How often do the different tags occur in the WSJ corpus? We can answer this question by extracting all of the tags from the corpus and counting the number of times they occur using a Python script written to do the job, such as count_tags.py. In broad strokes, the script walks through each file in the corpus directory, splits each line into word/tag tokens, tallies the tags in a dictionary, and prints the totals at the end.
The script would be run on the commandline as follows:
[stuart@localhost]$ python code/count_tags.py /usr/share/nltk/treebank/wsj_tagged/
CC      1124
CD      1414
DT      3990
EX      48
FW      2
...
The output consists of two tab-separated columns. The first column lists the tags, and the second column has the number of times each occurs in the corpus. After you've run the scripts, see what the least and most frequent tags are. The default order is alphabetical by tag, but the output can be piped to Unix utilities to be sorted by value. We'll leave that as an exercise for the reader...
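If you want a nudge on that exercise, a sort pipeline along these lines works. The rows below are hypothetical stand-ins for the script's real output; in practice you would pipe count_tags.py itself into sort:

```shell
# Sort tag counts by the second (count) column, largest first.
# e.g.: python code/count_tags.py /usr/share/nltk/treebank/wsj_tagged/ | sort -k2,2 -rn
printf 'CC\t1124\nEX\t48\nDT\t3990\n' | sort -k2,2 -rn
```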
Since we assume basic familiarity with Python, we don't need to go through count_tags.py in detail. The only part of the script that is not straightforward is the function parseLine:
def parseLine(line) :
    words = re.split(r" +", line)    # break line into words
    for w in words :                 # go through the words
        if "/" in w :
            # split on the unescaped slash; the tag is the last part
            pos = re.split(r"(?<!\\)/", w)[-1]
            try :
                tagList[pos] = tagList[pos] + 1    # increment counter
            except KeyError :
                tagList[pos] = 1                   # initialize counter
    return
Let's see how it works by looking at how an actual line from the corpus would be processed. We'll look at a line from wsj_0049.pos which poses some special challenges:
[ the/DT Iran\/Contra/NNP affair/NN ]
To make discussion easier, let's first establish some terminology. We will use the term token for a particular pairing of a wordform with a part-of-speech. In other words, the/DT is the first token in the line above, Iran\/Contra/NNP is the second, and affair/NN is the third. In the WSJ corpus, a token consists of a wordform and a part-of-speech tag separated from one another by a slash. We use the term wordform (instead of simply word) because we want to emphasize that we are dealing with a particular form of a word. After all, a word (e.g., break) may have multiple forms (e.g., breaking, broken, broke, etc.).
Using this terminology, we can say that parseLine splits each line into tokens and that these tokens are then iterated over in a for loop. Splitting the line above would produce a list with five elements, as follows:

['[', 'the/DT', 'Iran\\/Contra/NNP', 'affair/NN', ']']
The square brackets are ignored during the next step, which is to split a token into a wordform and a part-of-speech at the slash. However, some wordforms contain slashes in the original article (e.g., Iran/Contra), and in the tagging, a backslash is used to distinguish real slashes from the slashes that separate wordform from part-of-speech. To ensure that the token is split on the proper slash, we split using a regular expression that matches only slashes not preceded by a backslash. This is done using a regular expression trick known as a "negative lookbehind assertion", which is described in the Python library documentation on regular expression syntax. (More on regular expressions can be found in the Python Regular Expression HOWTO.)
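To see the assertion in action, here is a minimal, self-contained sketch (standalone, not an excerpt from count_tags.py):

```python
import re

# Split a token on a slash that is NOT preceded by a backslash.
# The escaped slash inside "Iran\/Contra" is left alone.
token = "Iran\\/Contra/NNP"
parts = re.split(r"(?<!\\)/", token)
print(parts)  # ['Iran\\/Contra', 'NNP']
```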
As another exercise in corpus manipulation, let's take our corpus and analyze the frequency of words by part-of-speech. In other words, we want to produce a list of wordforms that tells us which parts-of-speech they function as, and how frequently. The Python script make_wordlist.py accomplishes this task. In broad strokes, it walks over the corpus files just as count_tags.py does, but splits each token into a wordform and a tag and tallies each (wordform, tag) pair rather than the tag alone.
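The heart of that logic can be sketched as follows (a self-contained sketch with illustrative names, not the actual internals of make_wordlist.py):

```python
import re
from collections import defaultdict

wordList = defaultdict(int)  # maps (wordform, tag) -> frequency

def parse_line(line):
    for w in re.split(r" +", line.strip()):
        if "/" in w:
            # split on the unescaped slash, as in count_tags.py
            wordform, tag = re.split(r"(?<!\\)/", w)
            wordList[(wordform, tag)] += 1

parse_line("[ the/DT Iran\\/Contra/NNP affair/NN ]")
for (wordform, tag), count in sorted(wordList.items()):
    print("%s\t%s\t%d" % (wordform, tag, count))
# prints (tab-separated):
# Iran\/Contra  NNP  1
# affair        NN   1
# the           DT   1
```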
This script is run in the same manner as the last one, although the output is obviously different, consisting of three columns (wordform, tag, frequency count):
[stuart@localhost freshmeat]$ python code/make_wordlist.py /usr/share/nltk/treebank/wsj_tagged/
!       .       3
#       #       1
$       $       332
%       JJ      1
%       NN      153
As before, you may want to sort the output differently using Unix utilities, but even without any custom sorting, it should be obvious that all sorts of interesting information about word usage can be obtained from this kind of word list. The sample of the WSJ corpus available in the NLTK consists of only about 40,000 words, however, which limits its utility. As mentioned in the beginning, statistical information obtained from word lists can inform a variety of natural language processing tasks. Search technology, for instance, can take advantage of this data to second-guess the intentions of users performing searches. For example, we find that the word yield functions primarily as a noun in the portion of the WSJ corpus available here:
...
yield     NN    17
yielded   VBD   1
yielding  VBG   2
yielding  JJ    1
yielding  NN    1
yields    NNS   4
...
On the basis of this type of information, we can assume that, all things being equal, if a user searches on the word yield, documents in which the word functions as a noun (e.g., wsj_0090: "They are keeping a close watch on the yield on the S&P 500.") are better matches than documents in which the word functions as a verb (e.g., wsj_0099: "There are no signs, however, of China's yielding on key issues."). The important proviso here is the qualification all things being equal. The genre of a text, the immediate local environment of a word, and a variety of other factors influence these statistics, and more sophisticated statistical models enable more sensitive fine-tuning of searches. For more information about the use of word statistics in natural language processing, see Manning and Schütze's book Foundations of Statistical Natural Language Processing.
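To make the "all things being equal" reasoning concrete, here is a small sketch that estimates how often the yield family is nominal, using the counts shown above (the calculation and variable names are illustrative, not part of the scripts in this article):

```python
# (wordform, tag) frequencies for the "yield" family, as listed above.
counts = {
    ("yield", "NN"): 17,
    ("yielded", "VBD"): 1,
    ("yielding", "VBG"): 2,
    ("yielding", "JJ"): 1,
    ("yielding", "NN"): 1,
    ("yields", "NNS"): 4,
}

# Tags beginning with "NN" are nominal; estimate P(noun) for the family.
noun = sum(n for (w, t), n in counts.items() if t.startswith("NN"))
total = sum(counts.values())
print("P(noun) = %.2f" % (noun / float(total)))  # P(noun) = 0.85
```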
So far, we have written our own Python code to break the corpus down into tokens, but ideally, we shouldn't have to reinvent the wheel and write all of this low-level logic. There should be pre-existing tools that know about tags and tokens and the like, which could simply be used in whatever script we write. Fortunately, the world sometimes lives up to our ideals. Enter the Natural Language Toolkit (NLTK), which is, according to its authors, "a suite of program modules, data sets, tutorials, and exercises, covering symbolic and statistical natural language processing". In other words, the NLTK provides functionality in Python for language processing, and since it's Open Source, it's free, in every sense of the term, meaning that you can peek under the hood, tinker with it, and contribute to its development.
You can learn more about what the NLTK has to offer by consulting the NLTK documentation, which is reasonably good. In addition, there are two academic articles on the NLTK and a few tutorials. But if you're feeling impatient and want to get your hands dirty, there is a mini NLTK tutorial by David Mertz (author of the Charming Python column).
But before we can use the NLTK, we need to install it. The first step is to download the required files for the NLTK. As you will recall, the NLTK is divided into three modules. The module nltk-data should already be installed, and the module nltk-contrib can be ignored. It's the NLTK itself that you should be installing now. After you follow the installation instructions for the NLTK, you should familiarize yourself with its contents. As a step in that direction, we'll use the NLTK's functionality to perform the same two tasks handled by the scripts discussed above.
The NLTK is organized into multiple packages which handle different domains in natural language processing: tagging, parsing, probability, text classification, etc. Since we are only doing fairly basic corpus work, the only package we need is the corpus package, which includes functionality for handling "tokenization" (the process of breaking texts down into tokens). Fortunately, it includes a decent tokenized interface to the Treebank sample.
To illustrate the NLTK in action, let's tackle an earlier task, that of
counting the number of tags in a corpus. The script nltk_count_tags.py
should produce output identical to that of count_tags.py.
The main difference is that the parsing of corpus files and their
breakdown into sentences, words, tags, etc. is handled by the NLTK's
functionality! The script imports the
treebank module from
nltk.corpus and calls
read() on each file to
obtain a parsed version of it.
def main() :
    for f in treebank.items('tagged') :
        corpus = treebank.read(f)
        for sentenceToken in corpus['SENTS'] :
            for wt in sentenceToken['WORDS'] :
                pos = wt['POS']
                try :
                    tagList[pos] = tagList[pos] + 1
                except KeyError :
                    tagList[pos] = 1
    printResults()
    return
The program is run as follows:
[stuart@localhost]$ python code/nltk_count_tags.py
CC      1124
CD      1414
DT      3990
EX      48
FW      2
...
You may have noticed that, unlike the previous scripts, this one does not take commandline arguments telling the script where the WSJ corpus files can be found. This is because the NLTK knows the location of the corpus in the filesystem. To find the path to these files and get a listing of them, you can query the NLTK using the following code (from nltk_wsj_filepaths.py):
from nltk.corpus import treebank

print "BASE"
print "  %s" % treebank.rootdir()
for g in treebank.groups() :
    print "%s" % g.upper()
    for item in treebank.items(g) :
        print "  %s" % item
The script nltk_make_wordlist.py is very similar to make_wordlist.py. Again, the main difference is that the parsing of corpus files and their breakdown into sentences, words, tags, etc. is handled by the NLTK. The script uses the NLTK's treebank parser to read each file and tokenize it, and all of the tokens are parsed and entered into a dictionary along with their frequency.
The program is run as follows:
[stuart@localhost freshmeat]$ python code/nltk_make_wordlist.py
!       .       3
#       #       1
$       $       332
%       JJ      1
%       NN      153
As they say, the journey of a thousand miles begins with a single step. Now that you have the NLTK installed and have used a small part of its functionality to perform a few simple tasks, you're ready to dig more deeply into corpus linguistics. The first step is to learn about some of the other parts of the NLTK, for tagging or parsing or text classification. Of course, the best programming skills in the world won't make up for bad theory and/or poor algorithms, so you might try reading more widely in the fields of linguistics and computational linguistics.