This chapter introduces concepts in algorithms, data structures, program design, and advanced Python programming. It also contains a review of the basic mathematical notions of set, relation, and function, and illustrates them in terms of Python data structures. It contains many working program fragments that you should try yourself.
Object-Oriented Programming is a programming paradigm in which complex structures and processes are decomposed into modules, each encapsulating a single data type and the legal operations on that type.
An important data type in language processing is the syntactic tree. Here we will review the parts of the NLTK code that define the Tree class.
The first line of a class definition is the class keyword followed by the class name, in this case Tree. This class is derived from Python's built-in list class, permitting us to use standard list operations to access the children of a tree node.
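A sketch of how such a definition begins (an illustration, not NLTK's exact source):

    class Tree(list):
        """A tree structure; the children of a node are stored and
        accessed using the standard list operations inherited from list."""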
Next we define the initializer, also known as the constructor. It has a special name, starting and ending with double underscores; Python knows to call this function when you ask for a new tree object by writing t = Tree(node, children). The constructor's first argument is special, and is standardly called self, giving us a way to refer to the current object from inside the definition. This constructor calls the list initializer (similar to calling self = list(children)), then defines the node property of a tree.
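A constructor along these lines might look as follows (a sketch, shown as a method of the Tree class above):

    # (a method of the Tree class)
    def __init__(self, node, children):
        list.__init__(self, children)   # initialize the underlying list with the children
        self.node = node                # record the node label of this subtree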
Next we define another special function that Python knows to call when we index a Tree. The first case is the simplest: when the index is an integer, e.g. t[2], we just ask for the list item in the obvious way. The other cases are for handling slices, like t[1:2], or t[:].
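A minimal sketch of such an indexing method:

    # (a method of the Tree class)
    def __getitem__(self, index):
        if isinstance(index, int):
            # a simple integer index, e.g. t[2]
            return list.__getitem__(self, index)
        elif isinstance(index, slice):
            # a slice, e.g. t[1:2] or t[:]
            return list.__getitem__(self, index)
        else:
            raise TypeError('Tree indices must be integers or slices')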
This method was for accessing a child node. Similar methods are provided for setting and deleting a child (using __setitem__ and __delitem__).
Two other special member functions are __repr__() and __str__(). The __repr__() function produces a string representation of the object, one which can be executed to re-create the object, and is accessed from the interpreter simply by typing the name of the object and pressing 'enter'. The __str__() function produces a human-readable version of the object; here we call a pretty-printing function we have defined called pp().
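Sketches of these two methods (the pretty-printing method pp() itself is not shown here):

    # (methods of the Tree class)
    def __repr__(self):
        # a representation that could be evaluated to re-create the tree
        children = ', '.join(repr(child) for child in self)
        return 'Tree(%r, [%s])' % (self.node, children)

    def __str__(self):
        return self.pp()   # delegate to the pretty-printer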
Next we define some member functions that do other standard operations on trees. First, for accessing the leaves:
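A sketch of a leaves() method that collects the leaf tokens from left to right:

    # (a method of the Tree class)
    def leaves(self):
        leaves = []
        for child in self:
            if isinstance(child, Tree):
                leaves.extend(child.leaves())   # recurse into subtrees
            else:
                leaves.append(child)            # a leaf, e.g. a word
        return leaves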
Next, for computing the height:
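A sketch of a height() method, where a leaf child contributes a height of one:

    # (a method of the Tree class)
    def height(self):
        max_child_height = 0
        for child in self:
            if isinstance(child, Tree):
                max_child_height = max(max_child_height, child.height())
            else:
                max_child_height = max(max_child_height, 1)
        return 1 + max_child_height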
And finally, for enumerating all the subtrees (optionally filtered):
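A sketch of a subtrees() generator; the optional filter argument is a function that decides whether a given subtree should be reported:

    # (a method of the Tree class)
    def subtrees(self, filter=None):
        if not filter or filter(self):
            yield self
        for child in self:
            if isinstance(child, Tree):
                for subtree in child.subtrees(filter):
                    yield subtree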
This section will discuss the tag.ngram module.
Programming is a skill which is acquired over several years of experience with a variety of programming languages and tasks. Key high-level abilities are algorithm design and its manifestation in structured programming. Key low-level abilities include familiarity with the syntactic constructs of the language, and knowledge of a variety of diagnostic methods for trouble-shooting a program which does not exhibit the expected behaviour.
We have just seen how the same task can be performed in different ways, with implications for efficiency. Another factor influencing program development is programming style. Consider the following program to compute the average length of words in the Brown Corpus:
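A procedural version along these lines might look as follows; the variable tokens is a hypothetical stand-in for the words of the Brown Corpus (the corpus-reading code is omitted):

    tokens = ['colorless', 'green', 'ideas', 'sleep', 'furiously']   # stand-in data
    count = 0
    total = 0
    for token in tokens:
        count += 1
        total += len(token)
    print float(total) / count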
In this program we use the variable count to keep track of the number of tokens seen, and total to store the combined length of all words. This is a low-level style, not far removed from machine code, the primitive operations performed by the computer's CPU. The two variables are just like a CPU's registers, accumulating values at many intermediate stages, values which are almost meaningless. We say that this program is written in a procedural style, dictating the machine operations step by step. Now consider the following program which computes the same thing:
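A declarative version might look like this; the first line is again hypothetical stand-in data, and the remaining three lines correspond to the description below:

    corpus = [['the', 'cat', 'sat'], ['colorless', 'green', 'ideas']]   # stand-in sentences
    tokens = [word for sent in corpus for word in sent]
    total = sum(map(len, tokens))
    print float(total) / len(tokens)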
The first line uses a list comprehension to construct the sequence of tokens. The second line maps the len function to this sequence, to create a list of length values, which are summed. The third line computes the average as before. Notice here that each line of code performs a complete, meaningful action. Moreover, they do not dictate how the computer will perform the computations; we state high level relationships like "total is the sum of the lengths of the tokens" and leave the details to the Python interpreter. Accordingly, we say that this program is written in a declarative style.
Here is another example to illustrate the procedural/declarative distinction. Notice again that the procedural version involves low-level steps and a variable having meaningless intermediate values:
|
The declarative version (given second) makes use of higher-level built-in functions:
|
What do these programs compute? Which version did you find easier to interpret?
Consider one further example, which sorts three-letter words by their final letters. The words come from the widely-used Unix word-list, made available as an NLTK corpus called words. Two words ending with the same letter will be sorted according to their second-last letters. The result of this sort method is that many rhyming words will be contiguous. Two programs are given; which one is more declarative, and which is more procedural?
As an aside, for readability we define a function for reversing strings that will be used by both programs:
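One simple way to write such a helper (the book's own definition may differ):

    def reverse(word):
        return word[::-1]   # a slice with step -1 runs through the string backwards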
Here's the first program. We define a helper function reverse_cmp which calls the built-in cmp comparison function on reversed strings. The cmp function returns -1, 0, or 1, depending on whether its first argument is less than, equal to, or greater than its second argument. We tell the list sort function to use reverse_cmp instead of cmp (the default).
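A sketch of this first program; wordlist is a hypothetical stand-in for the words corpus:

    wordlist = ['dog', 'log', 'cat', 'bat', 'sun', 'banana']   # stand-in data

    def reverse_cmp(x, y):
        return cmp(reverse(x), reverse(y))

    three_letter_words = [w for w in wordlist if len(w) == 3]
    three_letter_words.sort(reverse_cmp)   # sort using our comparison function
    print three_letter_words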
Here's the second program. In the first loop it collects up all the three-letter words in reversed form. Next, it sorts the list of reversed words. Then, in the second loop, it iterates over each position in the list using the variable i, and replaces each item with its reverse. We have now re-reversed the words, and can print them out.
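A sketch of this second program, reusing the same wordlist and the reverse() helper:

    reversed_words = []
    for word in wordlist:
        if len(word) == 3:
            reversed_words.append(reverse(word))          # collect three-letter words, reversed
    reversed_words.sort()                                 # sort the reversed forms
    for i in range(len(reversed_words)):
        reversed_words[i] = reverse(reversed_words[i])    # re-reverse each item in place
    print reversed_words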
Choosing between procedural and declarative styles is just that, a question of style. There are no hard boundaries, and it is possible to mix the two. Readers new to programming are encouraged to experiment with both styles, and to make the extra effort required to master higher-level constructs, such as list comprehensions, and built-in functions like map and filter.
|
Commands:
list [first [,last]]: list source code for the current file
next: continue execution until the next line in the current function is reached
cont: continue execution until a breakpoint is reached (or the end of the program)
break: list the breakpoints
break n: insert a breakpoint at this line number in the current file
break file.py:n: insert a breakpoint at this line in the specified file
break function: insert a breakpoint at the first executable line of the function
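To experiment with these commands, the debugger can be started on any function call; a minimal sketch (the function and its data are hypothetical):

    import pdb

    def average_length(tokens):
        total = 0
        for token in tokens:
            total += len(token)
        return float(total) / len(tokens)

    # at the (Pdb) prompt, try 'list', 'next', 'break' and 'cont'
    pdb.run('average_length(["the", "cat", "sat"])')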
Figure 10.1: T9: Text on 9 Keys
An algorithm is a "recipe" for solving a problem. For example, to multiply 16 by 12 we might use any of the following methods: multiply 16 by 10 and 16 by 2 and add the results; multiply 16 by 6 and double the answer; add twelve copies of 16 together; or carry out long multiplication digit by digit.
Each of these methods is a different algorithm, and requires different amounts of computation time and different amounts of intermediate information to store. A similar situation holds for many other superficially simple tasks, such as sorting a list of words. Now, as we saw above, Python lists provide a built-in sort() method that performs this task efficiently. However, NLTK-Lite also provides several algorithms for sorting lists, to illustrate the variety of possible methods. To illustrate the difference in efficiency, we will create a list of 1000 numbers, randomize the list, then sort it, counting the number of list manipulations required.
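Setting up the test data might look like this (a sketch; the counting of list manipulations happens inside the sorting functions themselves):

    from random import shuffle
    a = range(1000)   # the numbers 0..999, as a list (Python 2)
    shuffle(a)        # randomize the order in place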
Now we can try a simple sort method called bubble sort, which scans through the list many times, exchanging adjacent items if they are out of order. It sorts the list a in place, and returns the number of times it modified the list:
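A minimal sketch of such a bubble sort (the version in nltk_lite.misc.sort may differ in detail):

    def bubble(a):
        count = 0
        for i in range(len(a) - 1, 0, -1):
            for j in range(i):
                if a[j] > a[j+1]:
                    a[j], a[j+1] = a[j+1], a[j]   # exchange adjacent items that are out of order
                    count += 1                    # record one modification of the list
        return count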
We can try the same task using various sorting algorithms. Evidently merge sort is much better than bubble sort, and quicksort is better still.
|
Readers are encouraged to look at nltk_lite.misc.sort to see how these different methods work. The collection of NLTK-Lite modules exemplifies a variety of algorithm design techniques, including brute-force, divide-and-conquer, dynamic programming, and greedy search. Readers who would like a systematic introduction to algorithm design should consult the resources mentioned at the end of this tutorial.
In Chapter 6 we saw how to sort a list of items according to some property of those items:
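For instance, to sort words by their length we can supply a comparison function that calls len() on both arguments (the word list is hypothetical stand-in data):

    wordlist = ['dog', 'elephant', 'cat', 'aardvark', 'bee']   # stand-in data
    wordlist.sort(lambda x, y: cmp(len(x), len(y)))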
This is inefficient when the list of items gets long, as we compute len() twice for every comparison (about 2n log(n) times). The following is more efficient:
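A sketch of this version, which computes each length only once:

    decorated = [(len(w), w) for w in wordlist]   # decorate: pair each word with its length
    decorated.sort()                              # sort: tuples are compared on the length first
    wordlist = [w for (length, w) in decorated]   # undecorate: discard the lengths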
This technique is called decorate-sort-undecorate. We can compare its performance by timing how long it takes to execute it a million times.
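One way to do the timing is with the standard timeit module, whose timeit() method runs a statement a million times by default (the word list in the setup string is hypothetical):

    from timeit import Timer
    setup = "words = ['dog', 'elephant', 'cat', 'aardvark', 'bee']"
    print Timer("sorted(words, lambda x, y: cmp(len(x), len(y)))", setup).timeit()
    print Timer("[w for (n, w) in sorted([(len(w), w) for w in words])]", setup).timeit()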
MORE: consider what happens as the lists get longer...
Find words which, when reversed, make legal words. Here is an extremely wasteful solution:
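A sketch of the wasteful approach, which compares every word against every other word (quadratic in the size of wordlist, which again stands in for the word list); reverse() is the helper defined earlier:

    for word1 in wordlist:
        for word2 in wordlist:
            if word1 == reverse(word2):
                print word1, word2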
More efficient:
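A sketch of a faster approach: build a set once, so that each membership test is cheap:

    words = set(wordlist)
    for word in wordlist:
        if reverse(word) in words:
            print word, reverse(word)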
Observe that this output contains redundant information; each word and its reverse is included. How could we remove this redundancy?
Presorting, sets:
Find words which have at least (or exactly) one instance of all vowels. Instead of writing extremely complex regular expressions, some simple preprocessing does the trick:
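A sketch of this idea using sets of characters (wordlist again stands in for the word list):

    vowels = set('aeiou')
    for word in wordlist:
        if vowels <= set(word):     # the word contains at least one of every vowel
            print word

    # for exactly one of each vowel, additionally require that the word
    # contains exactly five vowel characters in total
    exactly_one = [w for w in wordlist if vowels <= set(w)
                   and len([c for c in w if c in vowels]) == 5]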
Many NLP tasks can be construed as search problems. For example, the task of a parser is to identify one or more parse trees for a given sentence. As we saw in Part II, there are several algorithms for parsing. A recursive descent parser performs backtracking search, applying grammar productions in turn until a match with the next input word is found, and backtracking when there is no match. We saw in Chapter 8 that the space of possible parse trees is very large; a parser can be thought of as providing a relatively efficient way to find the right solution(s) within a very large space of candidates.
As another example of search, suppose we want to find the most complex sentence in a text corpus. Before we can begin we have to be explicit about how the complexity of a sentence is to be measured: word count, verb count, character count, parse-tree depth, etc. In the context of learning this is known as the objective function, the property of candidate solutions we want to optimize.
In this section we will explore some other search methods which are useful in NLP. For concreteness we will apply them to the problem of learning word segmentations in text, following the work of [Brent, 1995]. Put simply, this is the problem faced by a language learner in dividing a continuous speech stream into individual words. We will consider this problem from the perspective of a child hearing utterances from a parent, e.g.
(1a) | doyouseethekitty |
(1b) | seethedoggy |
(1c) | doyoulikethekitty |
(1d) | likethedoggy |
Our first challenge is simply to represent the problem: we need to find a way to separate the text content from the segmentation. We will borrow an idea from IOB-tagging (Chapter 5), by annotating each character with a boolean value to indicate whether or not a word-break appears after the character. We will assume that the learner is given the utterance breaks, since these often correspond to extended pauses. Here is a possible representation, including the initial and target segmentations:
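A sketch of such a representation, built from the four utterances above; seg1 marks only the utterance breaks (the initial segmentation), while seg2 marks the target word breaks:

    text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
    seg1 = "0000000000000001000000000010000000000000000100000000000"
    seg2 = "0100100100100001001001000010100100010010000100010010000"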
Observe that the segmentation strings consist of zeros and ones. They are one character shorter than the source text, since a text of length n can only be broken up in n-1 places.
Now let's check that our chosen representation is effective. We need to make sure we can read segmented text from the representation. The following function, segment(), takes a text string and a segmentation string, and returns a list of strings.
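A minimal sketch of such a function:

    def segment(text, segs):
        words = []
        last = 0
        for i in range(len(segs)):
            if segs[i] == '1':          # a word break follows character i
                words.append(text[last:i+1])
                last = i + 1
        words.append(text[last:])       # the final word runs to the end of the text
        return words

For example, applying it to the initial segmentation above recovers the four utterances:

    >>> segment(text, seg1)
    ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']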
Listing 10.1 (segment.py): Program to Reconstruct Segmented Text from String Representation
Now the learning task becomes a search problem: find the bit string that causes the text string to be correctly segmented into words. Our first task is done: we have represented the problem in a way that allows us to reconstruct the data, and to focus on the information to be learned.
Now that we have effectively represented the problem we need to choose the objective function. We assume the learner is acquiring words and storing them in an internal lexicon. Given a suitable lexicon, it is possible to reconstruct the source text as a sequence of lexical items. Following [Brent, 1995], we can use the size of the lexicon and the amount of information needed to reconstruct the source text as the basis for an objective function, as shown in Figure 10.2.
Figure 10.2: Calculation of Objective Function for Given Segmentation
It is a simple matter to implement this objective function, as shown in Listing 10.2.
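A minimal sketch of such a function; here the cost is simply the number of lexical items used to reconstruct the text plus the number of characters needed to store the lexicon (the exact weighting in Figure 10.2 may differ):

    def evaluate(text, segs):
        words = segment(text, segs)
        text_size = len(words)                    # items needed to spell out the text
        lexicon_size = len(' '.join(set(words)))  # characters needed to store the lexicon
        return text_size + lexicon_size           # smaller is better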
Listing 10.2 (evaluate.py): Computing the Cost of Storing the Lexicon and Reconstructing the Source Text
The most obvious search method is exhaustive: score every possible segmentation and keep the cheapest. With 55 possible break positions in our text, there are 2^55 candidate bit strings to evaluate. For a computer that can do 100,000 evaluations per second, this would take over 10,000 years!
A better strategy is hill climbing: starting from a given location in the search space, evaluate nearby locations and move to a new location only if it is an improvement on the current location.
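A minimal sketch of such a hill-climbing search over segmentations, reusing evaluate() from above; the neighbourhood is defined by flipping a single randomly chosen bit:

    from random import randint

    def flip(segs, pos):
        # toggle the break indicator at position pos
        return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]

    def hill_climb(text, segs, iterations=1000):
        best = evaluate(text, segs)
        for i in range(iterations):
            candidate = flip(segs, randint(0, len(segs) - 1))
            score = evaluate(text, candidate)
            if score < best:              # keep the move only if it improves the objective
                segs, best = candidate, score
        return segs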
Listing 10.3 (hill_climb.py): Hill-Climbing Search |
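Hill climbing can get stuck in a local optimum. A non-deterministic alternative, simulated annealing, sometimes accepts a move that makes the score worse, with a probability that shrinks as a "temperature" parameter is lowered. A minimal sketch, reusing flip() and evaluate() from above (the cooling schedule and acceptance rule are illustrative assumptions, not the book's exact listing):

    from math import exp
    from random import randint, random

    def anneal(text, segs, iterations=1000, temperature=10.0, cooling=0.99):
        score = best = evaluate(text, segs)
        best_segs = segs
        for i in range(iterations):
            candidate = flip(segs, randint(0, len(segs) - 1))
            cand_score = evaluate(text, candidate)
            # always accept improvements; accept worsenings with probability exp(-delta/temperature)
            if cand_score < score or random() < exp((score - cand_score) / temperature):
                segs, score = candidate, cand_score
                if score < best:
                    best_segs, best = segs, score
            temperature *= cooling
        return best_segs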
Listing 10.4 (anneal.py): Non-Deterministic Search Using Simulated Annealing |
One of the difficulties in re-using functions is remembering the order of arguments. Consider the following function, which finds the n most frequent words that are at least min_len characters long:
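A sketch of such a function; the name freq_words, and the choice to obtain words by whitespace-splitting a file, are illustrative assumptions:

    def freq_words(file, min_len, num):
        freqs = {}
        for word in open(file).read().split():
            if len(word) >= min_len:
                freqs[word] = freqs.get(word, 0) + 1
        items = freqs.items()
        items.sort(key=lambda pair: pair[1], reverse=True)   # most frequent first
        return [word for (word, count) in items[:num]]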
This function has three arguments. It follows the convention of listing the most basic and substantial argument first (the file). However, it might be hard to remember the order of the second and third arguments on subsequent use. We can make this function more readable by using keyword arguments. These appear in the function's argument list with an equals sign and a default value:
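The same sketch with keyword arguments and default values; only the first line changes:

    def freq_words(file, min_len=1, num=10):
        freqs = {}
        for word in open(file).read().split():
            if len(word) >= min_len:
                freqs[word] = freqs.get(word, 0) + 1
        items = freqs.items()
        items.sort(key=lambda pair: pair[1], reverse=True)
        return [word for (word, count) in items[:num]]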
Now there are several equivalent ways to call this function:
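For example (the file name is hypothetical):

    freq_words('corpus.txt', 3, 20)                 # positional arguments
    freq_words('corpus.txt', min_len=3, num=20)     # keyword arguments
    freq_words('corpus.txt', num=20, min_len=3)     # keyword arguments, in any order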
When we use an integrated development environment such as IDLE, typing the name of a function followed by an opening parenthesis at the command prompt will display its arguments. Using named arguments makes it easier for someone else to re-use the code.
A side-effect of having named arguments is that they permit optionality. Thus we can leave out any arguments for which we are happy with the default value.
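For example:

    freq_words('corpus.txt', min_len=3)   # num defaults to 10
    freq_words('corpus.txt', num=2)       # min_len defaults to 1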
Another common use of optional arguments is to permit a flag, e.g.:
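A sketch of adding a verbose flag to the same function (only the relevant lines are shown):

    def freq_words(file, min_len=1, num=10, verbose=False):
        if verbose:
            print 'Reading', file
        # ... the rest of the function is unchanged ...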
These functions start by initializing some storage, and iterate over input to build it up, before returning some final object (a large structure or aggregated result). The standard way to do this is to initialize an empty list, accumulate the material, then return the list:
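A sketch of this list-building style, for the task of extracting the nouns from tagged text (the tag test assumes Brown-style lowercase tags beginning with 'nn'):

    def find_nouns(tagged_text):
        nouns = []
        for word, tag in tagged_text:
            if tag.startswith('nn'):
                nouns.append(word)
        return nouns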
We can apply this function to some tagged text to extract the nouns:
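For example, with a small hypothetical piece of tagged text:

    >>> tagged_text = [('the', 'at'), ('cat', 'nn'), ('sat', 'vbd'),
    ...                ('on', 'in'), ('the', 'at'), ('mat', 'nn')]
    >>> find_nouns(tagged_text)
    ['cat', 'mat']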
However, a superior way to do this is to define a generator:
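A sketch of the generator version; the body is the same except that we yield each noun as soon as it is found, instead of accumulating a list:

    def find_nouns(tagged_text):
        for word, tag in tagged_text:
            if tag.startswith('nn'):
                yield word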
The first time this function is called, it gets as far as the yield statement and stops. The calling program gets the first word and does any necessary processing. Once the calling program is ready for another word, execution of the function is continued from where it stopped, until the next time it encounters a yield statement.
Let's see what happens when we call the function:
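With the same hypothetical tagged_text as before (the hexadecimal address will vary):

    >>> find_nouns(tagged_text)
    <generator object at 0x...>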
Calling it does not give us a list of words directly; it returns a generator object. To see the words, we can convert it to a list:
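Continuing with the same hypothetical tagged_text:

    >>> list(find_nouns(tagged_text))
    ['cat', 'mat']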
We can also iterate over it in the usual way:
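For example:

    >>> for noun in find_nouns(tagged_text):
    ...     print noun
    cat
    mat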
[Efficiency]
Knowing a bit about sets will come in useful when you look at Chapter 11. A set is a collection of entities, called the members of the set. Sets can be finite or infinite, or even empty. In Python, we can define a set just by listing its members; the notation is similar to specifying a list:
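For example, using the set() constructor (Python 2.4 and later):

    >>> A = set(['a', 'b', 1, 2, 3])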
In mathematical notation, we would specify this set as:
(2) | {'a', 'b', 1, 2, 3} |
Set membership is a relation — we can ask whether some entity x belongs to a set A (in mathematical notation, written x ∈ A).
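In Python, membership is tested with the in operator:

    >>> 'a' in A
    True
    >>> 'c' in A
    False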
However, sets differ from lists in that they are unordered collections. Two sets are equal if and only if they have exactly the same members:
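For example:

    >>> set([3, 2, 1]) == set([1, 2, 3])
    True
    >>> [3, 2, 1] == [1, 2, 3]
    False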
The cardinality of a set A (written ∣A∣) is the number of members in A. We can get this value using the len() function:
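For the set A defined above:

    >>> len(A)
    5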
The argument to the set() constructor can be any sequence, including a string, and just calling the constructor with no argument creates the empty set (written ∅).
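For example:

    >>> set('banana') == set(['a', 'b', 'n'])    # duplicates are removed
    True
    >>> len(set())                               # the empty set has no members
    0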
We can construct new sets out of old ones. The union of two sets A and B (written A ∪ B) is the set of elements which belong to A or B. Union is represented in Python with |:
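For example:

    >>> odds = set([1, 3, 5, 7, 9])
    >>> primes = set([2, 3, 5, 7])
    >>> odds | primes == set([1, 2, 3, 5, 7, 9])
    True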
The intersection of two sets A and B (written A ∩ B) is the set of elements which belong to both A and B. Intersection is represented in Python with &. If the intersection of two sets is empty, they are said to be disjoint.
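Continuing with the same two sets:

    >>> odds & primes == set([3, 5, 7])
    True
    >>> odds & set([4, 8]) == set()      # no members in common: the sets are disjoint
    True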
The (relative) complement of two sets A and B (written A − B) is the set of elements which belong to A but not B. Complement is represented in Python with -.
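For example:

    >>> odds - primes == set([1, 9])
    True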
So far, we have described how to define 'basic' sets and how to form new sets out of those basic ones. All the basic sets have been specified by listing all their members. Often we want to specify set membership more succinctly:
(3) | the set of positive integers less than 10 |
(4) | the set of people in Melbourne with red hair |
We can informally write these sets using the following predicate notation:
(5) | {x | x is a positive integer less than 10} |
(6) | {x | x is a person in Melbourne with red hair} |
In axiomatic set theory, the axiom schema of comprehension states that given a one-place predicate P, there is a set A such that for all x, x belongs to A if and only if (written ≡) P(x) is true:
(7) | ∃A∀x.(x ∈ A ≡ P(x)) |
From a computational point of view, (7) is problematic: we have to treat sets as finite objects in the computer, but there is nothing to stop us defining infinite sets using comprehension. Now, there is a variant of (7), called the axiom of restricted comprehension, which allows us to specify a set A with a predicate P so long as we only consider xs which belong to some already defined set B:
(8) | ∀B ∃A∀x. (x ∈ A ≡ x ∈ B ∧ P(x)) |
(For all sets B there is a set A such that for all x, x belongs to A if and only if x belongs to B and P(x) is true.) This is equivalent to the following set in predicate notation:
(9) | {x | x ∈ B ∧ P(x)} |
(8) corresponds pretty much to what we get with list comprehension in Python: if you already have a list, then you can define a new list in terms of the old one, using an if condition. In other words, (10) is the Python counterpart of (8).
(10) | set([x for x in B if P(x)]) |
To illustrate this further, the following list comprehension relies on the existence of the previously defined set nats (n % 2 is the remainder when n is divided by 2):
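The definition of nats is assumed here to be the natural numbers up to ten:

    >>> nats = set(range(1, 11))
    >>> evens = set([n for n in nats if n % 2 == 0])
    >>> evens == set([2, 4, 6, 8, 10])
    True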
Now, when we defined evens before, what we actually had was a set of strings, rather than Python integers. But we can use int to coerce the strings to be of the right type:
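Assuming the earlier definition had produced the numerals as strings, the coercion might look like this:

    >>> evens = set(['2', '4', '6', '8', '10'])        # hypothetical string-valued definition
    >>> evens = set([int(n) for n in evens])
    >>> evens == set([2, 4, 6, 8, 10])
    True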
If every member of A is also a member of B, we say that A is a subset of B (written A ⊆ B). The subset relation is represented in Python with <=.
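For example:

    >>> set([2, 4]) <= set([2, 4, 6, 8, 10])
    True
    >>> set([2, 4]) <= set([2, 4])       # every set is a subset of itself
    True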
As the above examples show, B can contain more members than A for A ⊆ B to hold, but this need not be so. Every set is a subset of itself. To exclude the case where a set is a subset of itself, we use the relation proper subset (written A ⊂ B). In Python, this relation is represented as <.
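For example:

    >>> set([2, 4]) < set([2, 4, 6, 8, 10])
    True
    >>> set([2, 4]) < set([2, 4])        # a set is not a proper subset of itself
    False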
Sets can contain other sets. For instance, the set A = {{a}, {b} } contains the two singleton sets {a} and {b}. Note that {a} ⊆ A does not hold, since a belongs to {a} but not to A. In Python, it is a bit more awkward to specify sets whose members are also sets; the latter have to be defined as frozensets, i.e., immutable objects.
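For example:

    >>> A = set([frozenset(['a']), frozenset(['b'])])
    >>> frozenset(['a']) in A
    True
    >>> 'a' in A
    False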
We also need to be careful to distinguish between the empty set ∅ and the set whose only member is the empty set: {∅}.
We write 〈x1, …, xn〉 for the ordered n-tuple of objects x1, …, xn, where n ≥ 0. These are exactly the same as Python tuples. Two tuples are equal if and only if they have the same length and contain the same objects in the same order.
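For example:

    >>> (1, 2, 3) == (1, 2, 3)
    True
    >>> (1, 2, 3) == (3, 2, 1)       # same objects, different order
    False
    >>> (1, 2) == (1, 2, 2)          # different lengths
    False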
A tuple with just 2 elements is called an ordered pair, with just three elements, an ordered triple, and so on.
Given two sets A and B, we can form a set of ordered pairs by drawing the first member of the pair from A and the second from B. The Cartesian product of A and B, written A × B, is the set of all such pairs. More generally, we have for any sets S1, …, Sn,
(11) | S1 × … × Sn = {〈x1, …, xn〉 ∣ xi ∈ Si} |
In Python, we can build Cartesian products using list comprehension. As you can see, the sets in a Cartesian product don't have to be distinct.
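For example:

    >>> A = set([1, 2])
    >>> B = set(['a', 'b'])
    >>> AxB = set([(x, y) for x in A for y in B])
    >>> AxB == set([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
    True
    >>> AxA = set([(x, y) for x in A for y in A])    # the two sets need not be distinct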
In general, a relation R is a set of tuples. For example, in set-theoretic terms, the binary relation kiss is the set of all ordered pairs 〈x, y〉 such that x kisses y. More formally, an n-ary relation over sets S1, …, Sn is any set R ⊆ S1 × … × Sn.
Given a binary relation R over two sets A and B, not everything in A need stand in the R relation to something in B. As an illustration, consider the set evens and the relation mod defined as follows:
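The exact definition used here is an assumption; one definition consistent with the discussion below pairs each even number with its proper multiples within evens:

    >>> evens = set([2, 4, 6, 8, 10])
    >>> mod = set([(x, y) for x in evens for y in evens if x != y and y % x == 0])
    >>> sorted(mod)
    [(2, 4), (2, 6), (2, 8), (2, 10), (4, 8)]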
Now, mod ⊆ evens × evens, but there are elements of evens, namely 6, 8 and 10, which do not stand in the mod relation to anything else in evens. In this case, we say that only 2 and 4 are in the domain of the mod relation. More formally, for a relation R over A × B, we define
(12) | dom(R) = {x ∣ ∃y.〈x, y〉 ∈ R} |
Correspondingly, the set of entities in B which are the second member of a pair in R is called the range of R, written ran(R).
We can visually represent the relation mod by drawing arrows to indicate elements that stand in the relation, as shown in Figure 10.3.
Figure 10.3: Visual Representation of a relation
The domain and range of the relation are shown as shaded areas in Figure 10.3.
A relation R ⊆ A × B is a (set-theoretic) function just in case it meets the following two conditions: (1) R is single-valued, i.e., each x in A is paired with at most one y in B (if 〈x, y〉 ∈ R and 〈x, z〉 ∈ R then y = z); and (2) R is total, i.e., every x in A is paired with some y in B (dom(R) = A).
Thus, the mod relation defined earlier is not a function, since the element 2 is paired with four items, not just one. By contrast, the relation doubles defined as follows is a function:
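A definition of doubles consistent with the discussion (each element of evens is paired with exactly one value, namely its double):

    >>> doubles = set([(n, 2*n) for n in evens])
    >>> sorted(doubles)
    [(2, 4), (4, 8), (6, 12), (8, 16), (10, 20)]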
If f is a function ⊆ A × B, then we also say that f is a function from A to B. We also write this as f: A → B. If 〈x, y〉 ∈ f, then we write f(x) = y. Here, x is called an argument of f and y is a value. In such a case, we may also say that f maps x to y.
Given that functions always map a given argument to a single value, we can also represent them in Python using dictionaries (which incidentally are also known as mapping objects). The update() method on dictionaries can take as input any iterable of key/value pairs, including sets of two-membered tuples:
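For example, using the doubles relation defined above:

    >>> d = {}
    >>> d.update(doubles)
    >>> d[2]
    4
    >>> d == {2: 4, 4: 8, 6: 12, 8: 16, 10: 20}
    True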
A function f: S1 × … × Sn → T is called an n-ary function; we usually write f(s1, …, sn) rather than f(〈s1, …, sn〉). For sets A and B, we write B^A for the set of all functions from A to B, that is {f ∣ f: A → B}. If S is a set, then we can define a corresponding function fS called the characteristic function of S, defined as follows:
(13) | fS(x) = True if x ∈ S; fS(x) = False if x ∉ S |
fS is a member of the set {True, False}^S.
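As a concrete (hypothetical) illustration, a characteristic function can be represented in Python as a dictionary mapping members of some domain to True or False:

    >>> S = set([2, 4, 6])
    >>> f_S = dict((x, x in S) for x in [1, 2, 3, 4, 5, 6])
    >>> f_S[2], f_S[3]
    (True, False)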
It can happen that a relation meets condition (1) above but fails condition (2); such relations are called partial functions. For instance, let's slightly modify the definition of doubles:
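One way to write such a modification (an assumption consistent with the discussion below) is to keep only those pairs whose second member is itself in evens:

    >>> doubles2 = set([(n, 2*n) for n in evens if 2*n in evens])
    >>> sorted(doubles2)
    [(2, 4), (4, 8)]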
doubles2 is a partial function since its domain is a proper subset of evens. In such a case, we say that doubles2 is defined for 2 and 4 but undefined for the other elements in evens.
☼ Consider the relation doubles, where evens is defined as in the text earlier:
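Here evens and doubles are as in the sketches above:

    >>> evens = set([2, 4, 6, 8, 10])
    >>> doubles = set([(n, 2*n) for n in evens])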
Is doubles a relation over evens? Explain your answer.
◑ What happens if you try to update a dictionary with a relation which is not a function?
☼ Write a couple of Python functions which for any set of pairs R, return the domain and range of R.
◑ Let S be a family of three children, {Bart, Lisa, Maggie}. Define relations R ⊆ S × S such that:
◑ Write a Python function which for any set of pairs R, returns True if and only if R is a function.
[Brent1995]
About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2007 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7.5, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].
This document is Revision: 4518 Wed May 16 20:08:28 EST 2007