Language and Linguistics Compass 2/1 (2008): 18–40, 10.1111/j.1749-818x.2007.00046.x

Treebank-Based Probabilistic Phrase Structure Parsing

Aoife Cahill*
Universität Stuttgart

Abstract

The area of probabilistic phrase structure parsing has been a central and active field in computational linguistics. Stochastic methods in natural language processing, in general, have become very popular as more and more resources become available. One of the main advantages of probabilistic parsing is in disambiguation: it is useful for a parsing system to return a ranked list of potential syntactic analyses for a string. In this article, we introduce probabilistic context-free grammars (PCFGs) and outline some of their strengths and weaknesses. We concentrate on the automatic extraction of stochastic grammars from treebanks (large collections of hand-corrected syntactic structures). We describe the current state of the field and the current research on improving the basic PCFG model. This includes lexicalized, history-based and generative models. Finally, we briefly mention some research into probabilistic phrase structure parsing for domains other than traditional treebank text and languages other than English (Chinese, Arabic, German and French).

Introduction

The task of parsing is a central one in computational linguistics. For further processing (e.g. semantic, prosodic), one would like to have an analysis of the internal structure of a sentence. In many cases, this analysis takes the form of a phrase structure tree. In this article, we describe the current state of the field of treebank-based statistical phrase structure parsing. There are, of course, other approaches to probabilistic phrase structure parsing that do not rely on treebanks for training, but these are not discussed here. So what is the motivation behind treebank-based statistical parsing? With the amount of data available to researchers nowadays, it is becoming increasingly easy to train statistical parsing models. These models offer several advantages over hand-crafted grammars: they are easy to acquire, achieve wide coverage of the training domain, are robust (almost always providing some analysis) and can be easily adapted to new domains and data (Armstrong-Warwick 1993). Another advantage is the natural inclusion of lexical distributional information.


But one of the main advantages over hand-crafted grammars is in disambiguation. A statistical parser can return a ranked n-best list of possible analyses of a string. A hand-crafted system, while it might in general return far fewer analyses, has no way of indicating which ones are more likely.

This article first introduces context-free grammars (CFGs) and their probabilistic extension (PCFGs). We briefly mention some algorithms for parsing with PCFGs and then go on to discuss some of the problems with the basic model. Much of the remainder of the article is dedicated to a brief description of the large body of work in statistical phrase structure parsing, as researchers develop methods for overcoming the major weaknesses of the original model. Most of the work described here is based on the Penn Treebank. At the end we mention some work on other data and on languages other than English. There are many introductory materials on probabilistic parsing, including Charniak (1993), Manning and Schütze (1999) and Jurafsky and Martin (2000). We point the reader to these books for more detailed discussions of the topic.

Context-Free Grammars

The most common kinds of grammars are context free. They are popular because they are mathematically straightforward and well understood. They are widely used to describe computer programming languages, for example. In natural language processing, they can be used to provide a ‘shallow’ analysis of a natural language string, where ‘shallow’ here refers to a purely surface syntactic analysis, as opposed to a ‘deep’ analysis that would also provide some relation between the string and its meaning.1 Providing ‘deep’ analyses for natural languages is a large research area and includes dependency parsing and unification (or constraint-based) grammars. Unification grammars extend simple context-free formalisms with the addition of feature structures or attribute value matrices (AVMs). Members of the unification grammar family include lexical functional grammar (LFG) (Kaplan and Bresnan 1982), functional unification grammar (FUG) (Kay 1985), PATR-II (Shieber 1984) and head-driven phrase structure grammar (HPSG) (Pollard and Sag 1994).

There are a number of advantages and disadvantages to both ‘shallow’ and ‘deep’ processing. Deep grammars tend to be precise (because they are often hand-crafted), allow semantic composition and are often reversible (the same grammar can be used for parsing and generation). Some of them have wide coverage, but they are not always as robust as their shallower counterparts. For example, deep grammars will often disallow ungrammatical input, while shallow parsers usually provide a parse for any input given to them (one can argue whether this is a desired feature or not). Shallow parsers are usually fast, and statistical disambiguation methods can easily be integrated into them. Shallow parsers cannot usually be used for generation: because they can parse ungrammatical input, they will generate ungrammatical output, too.

Table 1. A context-free grammar.

Σ (terminals): Alice, bought, a, plant, with, yellow, leaves
V (non-terminal symbols): S, NP, VP, PP, N, V, P, Adj, Det
S (start symbol): S
P (production rules):
  S → NP VP    NP → N    VP → V NP    VP → V NP PP    NP → NP PP
  PP → P NP    NP → Det N    NP → Adj N
  N → Alice    V → bought    Det → a    N → plant    P → with    Adj → yellow    N → leaves

The output of shallow parsers also does not usually contain the analysis of long-distance dependencies that is required for further semantic processing.

Formally, a CFG is a four-tuple (Σ, V, S and P), where Σ is a finite, non-empty set of terminals (the alphabet); V is a finite, non-empty set of non-terminal symbols (category labels) such that Σ ∩ V = ∅; S ∈ V is the start symbol; and P is a finite set of production rules, A → α, where A ∈ V and α ∈ (V ∪ Σ)*. If we take the grammar in Table 1 and parse the sentence Alice bought a plant with yellow leaves, we get the structure (called a context-free phrase structure tree) in Figure 1.

There are a number of ways in which a context-free tree for a particular sentence can be built up using a CFG. The most common are top-down and bottom-up. In a top-down construction or derivation, you start with the start symbol of the grammar (in this case S). You then recursively carry out the following tasks:

1. for each non-terminal node, n, labelled L, select a rule, r, with L on the left hand side (LHS) and construct children for n with the symbols on the right hand side (RHS) of r; and
2. locate the next non-terminal node that has no children.

Figure 2 gives a sample top-down derivation for the tree in Figure 1. Bottom-up parsing is carried out in a similar fashion, but instead of starting with the grammar’s start symbol, you start with the words and build the tree upwards.
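To make the formal definition concrete, here is a small sketch in Python. The representation of trees as nested tuples and the names RULES, licensed and FIG1 are choices made for this article, not part of any standard toolkit; the tree shown is the NP-attachment analysis of the example sentence.

```python
# A minimal sketch (Python): the CFG of Table 1 as a set of productions,
# plus a check that a phrase structure tree only uses rules from P.
# Trees are nested tuples: (label, child1, ..., childN); leaves are words.

RULES = {
    ("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")), ("NP", ("NP", "PP")), ("PP", ("P", "NP")),
    ("NP", ("Det", "N")), ("NP", ("Adj", "N")),
    ("N", ("Alice",)), ("V", ("bought",)), ("Det", ("a",)),
    ("N", ("plant",)), ("P", ("with",)), ("Adj", ("yellow",)),
    ("N", ("leaves",)),
}

def licensed(tree):
    """Return True if every local tree corresponds to a production in P."""
    if isinstance(tree, str):          # a word (terminal): nothing to check
        return True
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return (label, rhs) in RULES and all(licensed(c) for c in children)

# One parse of the example sentence (PP attached to the object NP).
FIG1 = ("S",
        ("NP", ("N", "Alice")),
        ("VP", ("V", "bought"),
               ("NP", ("NP", ("Det", "a"), ("N", "plant")),
                      ("PP", ("P", "with"),
                             ("NP", ("Adj", "yellow"), ("N", "leaves"))))))

print(licensed(FIG1))  # True
```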


Fig. 1. A phrase structure tree for the sentence Alice bought a plant with yellow leaves.

There are, however, some drawbacks to simple CFGs. It can be complicated to treat long-distance dependencies (in constructions such as who did Mary see?) with regular CFGs [you need to introduce more complex mechanisms such as slash categories (Gazdar et al. 1985) to account for this]. However, for most computational purposes, given the wide range of efficient algorithms for CFGs, a CFG is generally accepted as being sufficient to at least get a basic syntactic representation for an input string.

Probabilistic Context-Free Grammars

Probabilistic context-free grammars extend CFGs by associating a probability with each production rule. Formally, a PCFG is defined as a five-tuple (Σ, V, S, P and D), where Σ is a finite, non-empty set of terminals (the alphabet); V is a finite, non-empty set of non-terminal symbols (category labels) such that Σ ∩ V = ∅; S ∈ V is the start symbol; P is a finite set of production rules, A → α, where A ∈ V and α ∈ (V ∪ Σ)*; and D is a function assigning a probability to each member of P. Moreover,

∀A ∈ V:   ∑_{A → α ∈ P} D(α | A) = 1;

that is, the probabilities of all RHSs for a given LHS must sum to 1. A PCFG defines the probability of a tree, P(T), as the product of the probabilities of the token occurrences of the rules expanding each LHS to its RHS in the tree:

P(T) = ∏_{i=1}^{n} P(RHS_i | LHS_i)                                 (1)



Fig. 2. A top-down derivation for the tree in Figure 1.

If we associate probabilities with each of the CFG rules in Table 1 to get the PCFG in Table 2, we can calculate the probability of the trees in Figure 3 for the string Alice bought a plant with yellow leaves. The probability of each tree is calculated by multiplying together the probabilities of each of the rules used to derive the tree. Hence, the probability of the tree on the left of Figure 3 is P(T1) = 1.0 * 0.2 * 0.1 * 0.8 * 1.0 * 0.4 * 0.2 * 1.0 * 0.5 * 1.0 * 1.0 * 0.2 * 1.0 * 0.4 = 0.0000512, while the probability of the tree on the right is P(T2) = 1.0 * 0.2 * 0.1 * 0.2 * 1.0 * 0.2 * 1.0 * 0.5 * 1.0 * 1.0 * 0.2 * 1.0 * 0.4 = 0.000032. Choosing the tree with the highest probability gives us the one on the left, which, in this case, is the desired analysis.

Table 2. A probabilistic version of the context-free grammar in Table 1.

Σ (terminals): Alice, bought, a, plant, with, yellow, leaves
V (non-terminal symbols): S, NP, VP, PP, N, V, P, Adj, Det
S (start symbol): S
P and D (production rules with their probabilities):
  D(S → NP VP) = 1.0
  D(VP → V NP) = 0.8
  D(VP → V NP PP) = 0.2
  D(NP → N) = 0.2
  D(NP → NP PP) = 0.4
  D(NP → Det N) = 0.2
  D(NP → Adj N) = 0.2
  D(PP → P NP) = 1.0
  D(N → Alice) = 0.1
  D(N → plant) = 0.5
  D(N → leaves) = 0.4
  D(V → bought) = 1.0
  D(Det → a) = 1.0
  D(P → with) = 1.0
  D(Adj → yellow) = 1.0

Fig. 3. Two possible parse trees for the sentence Alice bought a plant with yellow leaves.
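As a concrete check of the numbers above, the following sketch (Python, purely illustrative) multiplies out the rule probabilities from Table 2 for the two derivations in Figure 3, following equation (1):

```python
# A minimal sketch: P(T) as the product of rule probabilities (equation 1),
# using the PCFG of Table 2. Each derivation is listed as the sequence of
# rules used, in top-down order.
from math import prod

D = {
    "S -> NP VP": 1.0, "VP -> V NP": 0.8, "VP -> V NP PP": 0.2,
    "NP -> N": 0.2, "NP -> NP PP": 0.4, "NP -> Det N": 0.2, "NP -> Adj N": 0.2,
    "PP -> P NP": 1.0, "N -> Alice": 0.1, "N -> plant": 0.5, "N -> leaves": 0.4,
    "V -> bought": 1.0, "Det -> a": 1.0, "P -> with": 1.0, "Adj -> yellow": 1.0,
}

# T1: the PP attaches to the object NP (uses NP -> NP PP).
t1 = ["S -> NP VP", "NP -> N", "N -> Alice", "VP -> V NP", "V -> bought",
      "NP -> NP PP", "NP -> Det N", "Det -> a", "N -> plant",
      "PP -> P NP", "P -> with", "NP -> Adj N", "Adj -> yellow", "N -> leaves"]

# T2: the PP attaches to the VP (uses VP -> V NP PP).
t2 = ["S -> NP VP", "NP -> N", "N -> Alice", "VP -> V NP PP", "V -> bought",
      "NP -> Det N", "Det -> a", "N -> plant",
      "PP -> P NP", "P -> with", "NP -> Adj N", "Adj -> yellow", "N -> leaves"]

print(prod(D[r] for r in t1))  # ≈ 5.12e-05 (the 0.0000512 above)
print(prod(D[r] for r in t2))  # ≈ 3.2e-05  (the 0.000032 above)
```

A Viterbi-style parser never enumerates complete derivations in this way; as described in the next section, the products are accumulated incrementally in a chart.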

The expansion of each node depends only on the node category. This is called a Markov assumption (where the probability of an event depends only on the n previous events) and is commonly referred to as an independence assumption. For more details on the mathematics behind PCFGs, see Charniak (1993). One of the main advantages of a probabilistic CFG over a regular one is that we can now associate each parse of a string with a probability, allowing us to disambiguate to a large extent. Probabilistic CFGs are also used as language models (Charniak 2001; Roark 2001; Xu et al. 2002), assigning probabilities to strings of words (e.g. in speech recognition). However, PCFG parsing in general has certain known weaknesses, such as its inability to take lexical information or structural context into account.


This is due to the independence assumptions that are inherent in the formalism, as it assumes that the expansion of any non-terminal label is independent of any other expansion. We will return to this in the section on the limitations of PCFGs.

Basic Parsing Algorithms for PCFGs

Given a PCFG, we would like to parse with it. There are very efficient algorithms available for calculating the most probable parse, most of which are chart (or tableau) based. A chart parser stores intermediate results, avoiding the need to recompute identical analyses in different parts of the search space. Active chart parsers, in addition, store partially explored analyses. A chart is made up of ‘cells’ that store the intermediate analyses. A chart-based parsing algorithm fills the cells with ‘items’. An item is basically a partial analysis (or derivation), where in each cell, all items cover the same part of the input string. Therefore, in probabilistic chart parsing, when two items with the same LHS are added to the same cell, the algorithm can always choose the item with the higher probability and disregard the lower-probability item.2 This is because in any complete analysis involving the items, the lower-probability item will always lead to a lower-probability overall derivation. These algorithms are called Viterbi algorithms. Viterbi algorithms are an instance of dynamic programming, and although they cannot be applied to all types of probabilistic parsing methods, they are well suited to PCFG parsing. The Cocke–Younger–Kasami (CYK) algorithm (Younger 1967; Aho and Ullman 1972), developed in 1965 (independently and at roughly the same time by Cocke, Younger and Kasami), is a chart-based bottom-up dynamic programming algorithm that can be used with a PCFG to efficiently find the most probable parse of a string. The Earley parsing algorithm (Earley 1968), a top-down chart-based algorithm, has also been adapted for PCFGs (Stolcke 1995). For more information about the details of these parsing algorithms, see Manning and Schütze (1999) or Jurafsky and Martin (2000).

It is not always possible to apply Viterbi algorithms to calculate the most probable parse. For example, in data-oriented parsing (DOP), the most probable parse is calculated by summing over the probabilities of all derivations that lead to that parse. Viterbi algorithms cannot be used because sub-derivations that have a low probability will be pruned, even if they are part of the most probable parse. In this case, random sampling (e.g. Monte Carlo) is used to approximate the most probable parse (Bod 2000). Sometimes the use of Viterbi is impractical because the size of the chart grows too large. When this is the case, the chart is often pruned using methods such as beam search (e.g. Collins 1999, inter alia). Klein and Manning (2003) describe an A* search algorithm that is guaranteed to find the most probable parse and is much more efficient than standard Viterbi.
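The sketch below illustrates the Viterbi CYK idea in Python. It is a minimal illustration rather than a production implementation: it assumes the grammar is in Chomsky normal form (only binary rules A → B C and lexical rules A → w, which the grammar in Table 2 is not without further transformation), and the toy grammar and probabilities at the bottom are invented for the example.

```python
# A minimal Viterbi CYK sketch for a PCFG in Chomsky normal form (CNF).
# binary_rules: {(B, C): [(A, prob), ...]}   for rules A -> B C
# lexical_rules: {word: [(A, prob), ...]}    for rules A -> word
# Returns (probability, tree) for the most probable parse, or None.

def viterbi_cyk(words, binary_rules, lexical_rules, start="S"):
    n = len(words)
    # chart[i][j]: best (prob, tree) per category for the span words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill the length-1 (lexical) cells.
    for i, w in enumerate(words):
        for cat, p in lexical_rules.get(w, []):
            chart[i][i + 1][cat] = (p, (cat, w))

    # Combine adjacent spans bottom-up, keeping only the best item per category.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):               # split point
                for b, (pb, tb) in chart[i][k].items():
                    for c, (pc, tc) in chart[k][j].items():
                        for a, pr in binary_rules.get((b, c), []):
                            p = pr * pb * pc
                            if p > chart[i][j].get(a, (0.0, None))[0]:
                                chart[i][j][a] = (p, (a, tb, tc))

    return chart[0][n].get(start)


# Tiny CNF grammar (hypothetical probabilities, for illustration only).
binary = {("NP", "VP"): [("S", 1.0)], ("V", "NP"): [("VP", 1.0)],
          ("Det", "N"): [("NP", 0.5)]}
lexical = {"Alice": [("NP", 0.5)], "bought": [("V", 1.0)],
           "a": [("Det", 1.0)], "plant": [("N", 1.0)]}

print(viterbi_cyk("Alice bought a plant".split(), binary, lexical))
# (0.25, ('S', ('NP', 'Alice'), ('VP', ...)))
```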


Automatically Acquiring a PCFG From a Treebank

A treebank is a collection of syntactically annotated sentences where each structure has been hand corrected by linguists. The idea is that a linguist finds it easier to assign an analysis to one particular sentence than to come up with a comprehensive grammar from scratch. We can assume that there is an implicit grammar underlying all the analyses; this grammar is often made quite explicit by means of annotation guidelines. A treebank provides a way of using supervised learning to automatically acquire a PCFG. There are also methods for automatically acquiring a PCFG from an unannotated corpus (unsupervised learning), and in recent years there has been a revived interest in this topic (Clark 2001; van Zaanen 2002; Dennis 2005; Klein 2005; Bod 2006). The advantage of unsupervised learning is that one does not need to hand-annotate sentences, and it would provide a cost-effective way of developing parsing resources for minority languages for which there are no treebanks. However, at the moment, unsupervised learning approaches have yet to achieve the same kinds of results as even the most basic PCFG automatically extracted from a treebank. Hopefully in the coming years, the unsupervised methods will continue to develop and begin to achieve results comparable to the supervised methods. For now, we will concentrate on the supervised learning of a PCFG from a treebank.

The Penn Treebank (Marcus et al. 1993) is the most widely used treebank in computational linguistics. The parsed part contains text from the Brown Corpus, one million words from the Wall Street Journal, parts of the Switchboard Corpus and ATIS. The texts are all part-of-speech tagged and a syntactic structure has been assigned to each sentence. In the later versions of the Penn Treebank (Marcus et al. 1994), the structures also contain some information that would be useful in the extraction of predicate-argument information. This information includes additional functional labels (such as SBJ for subject, TMP for temporal modifier, etc.). Also included were trace elements and co-indexation to indicate ‘moved’ constituents. Figure 4 shows an example tree from the Penn Treebank, including functional labels (-SBJ to mark subject) and an empty node (*T*-1) to indicate movement.

Fig. 4. An example Penn Treebank tree with function labels and empty nodes.


It is straightforward to automatically extract a simple PCFG from a fully parsed treebank. The probability associated with each rule is determined by relative frequency; that is, by counting the number of times a rule occurs in the corpus and dividing it by the number of occurrences of all rules expanding the same LHS. Formally, this is expressed as:

P(LHS → RHS_j) = #(LHS → RHS_j) / ∑_{i=1}^{n} #(LHS → RHS_i)                 (2)
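A minimal sketch of equation (2) in Python follows. The nested-tuple tree representation and the toy treebank (including the verb slept) are assumptions made for illustration, not drawn from any actual treebank.

```python
# A minimal sketch of relative-frequency PCFG estimation (equation 2).
# Trees are nested tuples: (label, child1, ..., childN); leaves are words.
from collections import Counter, defaultdict

def rules(tree):
    """Yield (LHS, RHS) pairs for every local tree."""
    if isinstance(tree, str):
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for c in children:
        yield from rules(c)

def estimate_pcfg(treebank):
    counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy 'treebank' with one transitive and one intransitive VP.
treebank = [
    ("S", ("NP", ("N", "Alice")), ("VP", ("V", "bought"),
                                   ("NP", ("Det", "a"), ("N", "plant")))),
    ("S", ("NP", ("N", "Alice")), ("VP", ("V", "slept"))),
]
for rule, p in sorted(estimate_pcfg(treebank).items()):
    print(rule, p)   # e.g. ('VP', ('V', 'NP')) 0.5 and ('VP', ('V',)) 0.5
```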

Here again, the probabilities for all RHSs of a given LHS will sum to 1 (see also Table 2). So, for example, if in your corpus you have three VP rules for intransitive, transitive and ditransitive verbs (VP → V, VP → V NP and VP → V NP NP), the sum of the probabilities of the three VP rules must be 1.

Some of the earliest experiments in automatically extracting a PCFG from the first version of the Penn Treebank were carried out at the University of Sheffield (Krotov et al. 1994). However, at that stage, the computational power needed for parsing with a PCFG of that magnitude was not available.3 This prompted further research into the compaction of large PCFGs (Krotov et al. 1998). Other early research includes work by Pereira and Schabes (1992), who take advantage of the partial bracketing of the ATIS corpus, and Charniak (1995). It was a few years before computational power had caught up with the demands of PCFG parsing, enough to allow the first parsing experiments using the entire Penn Treebank. There was other work carried out on probabilistic parsing in the meantime (e.g. Magerman and Marcus 1991; Magerman and Weir 1992), but this work did not use simple PCFGs such as those extracted from a treebank. Charniak (1996) describes the first treebank-based PCFG experiments in detail. His paper was written to refute a ‘common wisdom’ among the research community that parsing with grammars extracted from treebanks did not produce good results. Charniak extracted 10,605 rules (after first performing some automatic pre-processing of the original treebank structures). He used a PCFG parser with a more complex probability model than the standard one based on relative frequencies, which was better able to efficiently identify the most probable parse and also to account for the right-branching nature of English.

The PARSEVAL metrics (Black et al. 1991) are usually used to evaluate the output of PCFG parsing. These include precision and recall:

Precision: the percentage of non-terminal bracketings in the PCFG parse that also appeared in the treebank parse;
Recall: the percentage of non-empty non-terminal bracketings from the treebank that also appeared in the PCFG parse.

Often both labelled and unlabelled precision and recall are calculated, where unlabelled scores only reflect the accuracy of the constituent boundaries, and not the labels assigned to the constituents. Another common measure for the accuracy of a parser is the f-score (also called the f-measure), which is calculated as the harmonic mean of precision and recall:

F-score = (2 × Precision × Recall) / (Precision + Recall)                    (3)
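These scores are easy to compute once parses are reduced to labelled bracketings. The sketch below is a simplified illustration; real evaluation software (such as the commonly used evalb program) applies further conventions, for example regarding punctuation.

```python
# A minimal PARSEVAL-style sketch: labelled precision, recall and f-score
# computed from bracketings represented as (label, start, end) spans.
from collections import Counter

def parseval(test_brackets, gold_brackets):
    test, gold = Counter(test_brackets), Counter(gold_brackets)
    matched = sum((test & gold).values())        # multiset intersection
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / sum(test.values())
    recall = matched / sum(gold.values())
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Hypothetical bracketings for a four-word sentence.
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
test = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("PP", 2, 4)]
print(parseval(test, gold))   # (0.75, 0.75, 0.75)
```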

One of the earliest applications of the PARSEVAL metrics appears in Grishman et al. (1992). Charniak (1996) evaluates the sentences of length 2–40 in his test set in terms of precision and recall, and reports unlabelled precision of 78.8 and unlabelled recall of 80.4. However, these are not the only metrics available for evaluation. There have also been a number of papers on evaluating the output of parsers using dependencies rather than bracketed constituents (Lin 1995; Carroll et al. 1998, 2002; Clark and Hockenmaier 2002; King et al. 2003; Preiss 2003; Kaplan et al. 2004; Miyao and Tsujii 2004).

Limitations of Basic PCFGs

Parsing with PCFGs allows a ranking of solutions, which is very useful for disambiguation. However, PCFGs have a certain bias towards smaller, less-hierarchical trees. This is due to the independence assumptions that allow us to calculate the probability of a parse tree by multiplying the probabilities of all rules that contribute to it. A larger, more hierarchical tree will include more rules, and therefore more rule probabilities to multiply out, possibly resulting in a smaller probability than a less-preferred tree with less structure. Similarly, PCFGs are biased towards non-terminals with fewer expansions over those with many expansions.

The independence assumptions inherent in PCFGs are their weakest point. They mean that the probability of rewriting a non-terminal X with a production R is independent of the previous sequence of rewrites, and hence that the rewrites are insensitive to local information. For example, it has been shown that in English there is a strong tendency for the subject of a sentence to be a pronoun (Kuno 1972; Francis et al. 1999, inter alia). However, if you take a standard PCFG, no distinction is made between NP → Pron and NP → Det Noun in subject position versus elsewhere, for example. It would be nice to be able to prefer the first rule when we are looking at a subject NP and the second one otherwise. Another major problem with PCFGs is their lack of sensitivity to lexical clues in the sentence. Taking an example from Charniak (1993), a simple PCFG model will assign the same parse to each of the following strings:


Fig. 5. Augmenting all non-root, non-preterminal nodes with their parent category label.

Alice bought a plant with Mary.
Alice bought a plant with yellow leaves.

The PCFG will have two rules for the attachment of the PP. But, whether the most frequent rule in the PCFG attaches the PP to bought or to plant, it will be consistently wrong for one of the strings.

First Steps Towards Improved PCFGs

The independence assumptions of PCFG parsing are very strong. This means that if we extract a simple PCFG from, for example, the Penn Treebank, it will be insensitive to much relevant contextual or lexical information. This can be explained by the fact that there are not very many different category labels in the treebank, but rather a relatively small set of broad-coverage labels. For example, the labels on transitive and intransitive verbs are identical, although syntactically these verb types behave very differently. However, there are various ways in which contextual or lexical information can be ‘smuggled’ into PCFGs, increasing their accuracy while remaining within the CFG framework and hence maintaining its computational simplicity compared to other more complex parsing methods.

Johnson (1999) investigates the idea of a ‘parent transformation’ (credited to Charniak), where each node N of a tree is annotated with its parent category label P to give N^P. The category label NP, for example, becomes NP^S if it occurs under an S node (Figure 5). A training corpus can be transformed automatically in this manner before extracting a parent-annotated grammar. This grammar now encodes additional contextual information that a basic PCFG does not. For example, it is now possible to distinguish between NPs occurring as subjects of a sentence and NPs occurring as objects of a verb: NPs in subject position are daughters of S nodes, and NPs in object position are daughters of VP nodes. Subject NPs will be annotated NP^S and object NPs will be annotated NP^VP. Johnson performs experiments on the Penn-II Treebank, training on Sections 02-21 and testing on Section 22. The accuracy of the parser output is measured in terms of labelled precision and recall as defined above. The parent transformation achieves labelled precision of 0.8 and labelled recall of 0.792, a significant improvement on the basic PCFG, which achieves labelled precision of 0.735 and labelled recall of 0.697. These results show that this transformation is a very simple yet effective method of weakening some of the independence assumptions that cause basic PCFGs to perform poorly.
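A minimal sketch of the parent transformation in Python (the nested-tuple tree representation is as in the earlier sketches; keeping preterminals unannotated follows the convention described for Figure 5, though other choices are possible):

```python
# A minimal sketch of parent annotation: every non-root, non-preterminal
# node label N with parent label P is rewritten as N^P.
def is_preterminal(tree):
    return len(tree) == 2 and isinstance(tree[1], str)

def parent_annotate(tree, parent=None):
    if isinstance(tree, str) or is_preterminal(tree):
        return tree                       # words and preterminals unchanged
    label = tree[0] if parent is None else f"{tree[0]}^{parent}"
    return (label,) + tuple(parent_annotate(c, tree[0]) for c in tree[1:])

tree = ("S",
        ("NP", ("N", "Alice")),
        ("VP", ("V", "bought"), ("NP", ("Det", "a"), ("N", "plant"))))
print(parent_annotate(tree))
# ('S', ('NP^S', ('N', 'Alice')),
#       ('VP^S', ('V', 'bought'), ('NP^VP', ('Det', 'a'), ('N', 'plant'))))
```

Running the transformation over an entire treebank before rule extraction is what yields the parent-annotated grammar described above.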


Fig. 6. A head-lexicalized tree.

Lexicalization and Other More Complex PCFG Parsing

Hindle and Rooth (1993) demonstrate that lexical dependencies are crucial for resolving ambiguities such as PP attachment. However, basic PCFGs do not take lexical information into account. For example, not all verbs can take two NP objects, yet for a simple PCFG, all verbs are equally likely to take two NP objects. One way to overcome this problem is to annotate each phrasal category with its head word (Figure 6). However, if we were to do this, we would run into the notorious ‘sparse data’ problem, meaning that many of the rules we would extract from the treebank (e.g. S^bought → NP^Alice VP^bought) would not occur very often (in fact, most of them would occur only once), and many of the rules required to parse new input would not have been seen at all. One simple solution would be to abstract away from the surface forms and use root forms instead; however, this would still be plagued by sparse data problems. A further refinement might be to lexicalize only certain categories such as VPs (giving rules such as VP^bought → V^bought NP, or VP^sleeps → V^sleeps, thereby capturing the difference between transitive and intransitive verb syntax).

Some early, more complex approaches to PCFG parsing are documented in the literature. These include Charniak (1997), which implements a model that first computes a set of parses and later applies a word-based probability model to choose the most probable parse. This is very similar to the approach of Carroll and Rooth (1998), who present a system for head-lexicalized PCFG parsing that addresses the problem of learning open-class word valences. It parses using a hand-crafted unlexicalized PCFG and then automatically determines the lexicalized frequencies from a corpus of raw text, simulating lexicalization of the chart. The context-free framework allows the use of efficient chart parsing techniques while incorporating important lexical dependencies.
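To make head lexicalization concrete, the following sketch percolates head words up a tree using a tiny head table. The head-finding rules (HEADS) are simplified assumptions for illustration and are far cruder than the head tables used by actual lexicalized parsers.

```python
# A minimal head-lexicalization sketch: annotate each category with the
# head word percolated up from its head daughter (cf. Figure 6).
# HEADS maps a phrasal label to its possible head daughters, in order of
# preference; this table is an illustrative assumption.
HEADS = {"S": ["VP"], "VP": ["V"], "NP": ["N"], "PP": ["P"]}

def lexicalize(tree):
    if len(tree) == 2 and isinstance(tree[1], str):      # preterminal
        return (f"{tree[0]}^{tree[1]}", tree[1]), tree[1]
    children, head_words = [], {}
    for child in tree[1:]:
        annotated, head = lexicalize(child)
        children.append(annotated)
        head_words[child[0]] = head
    # choose the head word from the preferred head daughter
    head = next(head_words[c] for c in HEADS[tree[0]] if c in head_words)
    return (f"{tree[0]}^{head}",) + tuple(children), head

tree = ("S",
        ("NP", ("N", "Alice")),
        ("VP", ("V", "bought"), ("NP", ("Det", "a"), ("N", "plant"))))
print(lexicalize(tree)[0])
# ('S^bought', ('NP^Alice', ('N^Alice', 'Alice')),
#  ('VP^bought', ('V^bought', 'bought'),
#   ('NP^plant', ('Det^a', 'a'), ('N^plant', 'plant'))))
```

Extracting rules from trees annotated in this way yields exactly the kind of lexicalized rules (e.g. S^bought → NP^Alice VP^bought) whose sparseness is discussed above.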


Magerman (1995) outlines a statistical decision-tree model, which differs from that of Charniak (1997) mainly in the type of probabilities it considers as part of the probability model. A probabilistic LR parser is described by Inui et al. (1997). This model was based on an earlier parser by Briscoe and Carroll (1993) but improves on the probability model. It gains some context sensitivity by assigning a probability to each LR parsing action according to its left and right context. There is a large body of work relating to PCFG parsing, as models become more and more complex. For a comprehensive discussion of early work, see Chapter 4 of Collins (1999). Chiang (2000) automatically extracts a lexicalized tree-adjoining grammar (TAG) from the Penn Treebank and achieves results comparable to a purely lexicalized PCFG, with a more flexible formalism. There has also been much recent work on automatically extracting ‘deep’ probabilistic grammars from the Penn Treebank. These include grammars for combinatory categorial grammar (Hockenmaier and Steedman 2002), lexical functional grammar (Cahill et al. 2002) and head-driven phrase structure grammar (Miyao et al. 2003).

Current State of the Art

The most influential research in probabilistic phrase structure parsing is probably Collins (1999). He presents three parsing models (1, 2 and 3). Here we will limit the discussion to the final models as presented in Chapter 7 of his thesis. These differ in significant ways from earlier versions, such as the one presented in Chapter 6 of the thesis, and also in Collins (1996, 1997). Model 1 is a basic history-based model that attempts to overcome the problem of sparse data in lexicalized parsing. A history-based model (Black et al. 1992) incorporates a rich context model (where anything that has previously been generated can appear in the conditioning context). In the Collins model, this is done by decomposing (markovizing) the generation of the RHS of a rule so that the head constituent is generated first, then the left modifiers, and lastly the right modifiers. The probability of a constituent being generated depends on the category of the parent and head nodes, the head word, and a complex feature representing distance, intervening words and punctuation. A parse tree is represented as the sequence of decisions corresponding to a head-centred, top-down derivation of the tree. Collins Model 2 incorporates a distinction between complements and adjuncts, and Model 3 incorporates traces for wh-movement. Table 3 outlines the results achieved by each model when evaluated against Section 23 of the Penn-II Treebank.

Table 3. Published parsing results against Section 23 of the Wall Street Journal (note: not always directly comparable).

                               ≤40 words              ≤100 words
                             LP     LR     F-score    LP     LR     F-score
Collins (1999) M1            88.2   87.9   88.0       87.7   87.5   87.6
Collins (1999) M2            88.7   88.5   88.6       88.3   88.1   88.2
Collins (1999) M3            88.7   88.6   88.6       88.3   88.0   88.1
Charniak (2000)              90.1   90.1   90.1       89.6   89.5   89.5
Charniak and Johnson (2005)  –      –      –          –      –      91.0
McClosky et al. (2006a)      –      –      –          –      –      92.1
Bod (2003)                   –      –      –          90.8   90.7   90.7
Petrov and Klein (2007)      90.7   90.5   90.6       90.2   89.9   90.0

LP is labelled precision, and LR is labelled recall, as defined above. Model 3 achieves the best results with precision of 88.7 and recall of 88.6 on sentences of length ≤40. Model 2 outperforms Model 1 with precision and recall of 88.7 and 88.5 on sentences of length ≤40. Each model performs slightly worse overall on sentences of length ≤100.

The parser presented in Charniak (2000) is inspired by a log-linear (or maximum entropy) probability model defined over a set of features. The strength of these models lies in their flexibility and their novel approach to smoothing (Berger et al. 1996). Smoothing is a vital component in any lexicalized parser, because without it the parser will very quickly run into sparse data problems. Smoothing attempts to compensate for sparse data by redistributing probability mass from observed events to unobserved events.4 Charniak’s parser achieves a 13% error reduction over the results in Collins (1997). Table 3 presents the results in terms of labelled precision and recall for Charniak’s parser. It achieves precision and recall of 90.1 on sentences of length ≤40, with a slight drop in performance on sentences of length ≤100. The output of Charniak’s (2000) parser has since been improved with the integration of a maximum-entropy reranker (Charniak and Johnson 2005). The reranker takes as input the 50-best parses returned by the Charniak parser and uses features derived from each parse to assign it a weight, returning the parse among the original best 50 that is assigned the highest weight. This model achieves an f-score of 91.0, which is one of the highest results reported in the literature on Section 23 of the Wall Street Journal to date.

The self-training parsing system of McClosky et al. (2006a) currently achieves the best f-score of 92.1 on Wall Street Journal Section 23 of the Penn Treebank. McClosky et al. (2006a) retrain the Charniak parser on both the Wall Street Journal (inserted five times into the training data) and 1.75 million sentences of the North American News Text corpus that had been parsed by the original Charniak parser.


In addition, they use the reranking technology described in Charniak and Johnson (2005) to achieve maximum results.

The DOP framework combines already-seen tree fragments to build up the most probable parse tree (Bod 1996, 2001; Bod and Scha 2003). The idea behind DOP is that contextual information is explicitly encoded in the tree fragments, addressing a key weakness of basic PCFGs. DOP uses tree fragments (of potentially all depths) rather than grammar rules, and combines fragments using node substitution. Bod (2003) achieves one of the current best results on Wall Street Journal Section 23 with an f-score of 90.7 (90.8% precision and 90.6% recall). One of the main disadvantages of DOP is that the calculation of the most probable parse, which is a sum-of-products calculation, has been shown to be an NP-hard problem (Sima'an 1995, 1999, 2003). Therefore, in real computing terms, these grammars are expensive, both in time and space. There have been a number of attempts to address these issues. The task of building the DOP parse space for an input string has been examined by Bod (1995), Goodman (1998) and Sima'an (1999), inter alia. Selecting the best parse from that space according to the model has been investigated by Bod (1995, 2000) and Chappelier and Rajman (2003), for example. It has also been shown that the simple relative frequency estimator of the original DOP models is inherently biased (Bonnema et al. 2000; Johnson 2002; Sima'an and Buratto 2003). The DOP model is still very much in development as different strategies to overcome its main weaknesses are explored. It is one of the more challenging statistical parsing models in the field at the moment, and it will be interesting to see the solutions proposed for approximating the most probable parse calculation, and the development of new and improved estimators to overcome the original bias.

Currently, the best unlexicalized PCFG model is that of Petrov and Klein (2007). They present several improvements to previous methods using hierarchically state-split PCFGs. State-split PCFGs are derived from treebanks (e.g. Matsuzaki et al. 2005; Petrov et al. 2006). In the model of Petrov and Klein (2007), the PCFGs are iteratively refined (by splitting all symbols in two; for example, DT becomes DT-1 and DT-2) in order to give improved parameter estimation. Hierarchical pruning is used to speed up parsing with no loss in accuracy. Finding the most probable parse is an NP-complete problem, so several methods of approximation are investigated. Empirically, Petrov and Klein show that, for their model, reranking methods maximize the exact match metric best; however, to maximize f-score, non-reranking methods perform best. For English, an f-score of 90.6 on sentences of length ≤40 and an f-score of 90.0 for all sentences is achieved, slightly lower than the reranking model of Charniak and Johnson (2005) and the results reported in McClosky et al. (2006a). The model is also shown to generalize well to other languages and treebanks, without any language- or treebank-specific modifications.


A summary of the main results for each of the parsers described in this section is given in Table 3. It is important to note that there may be slight differences in the way the parsers are evaluated (e.g. whether punctuation markers are counted as constituents or not), so they are not necessarily directly comparable.

PCFG Parsing in Other Domains and Languages

We see that tremendous progress has been made with respect to parsing Section 23 of the Wall Street Journal. However, that cannot be considered representative of the English language. There have been a number of attempts to port parsers trained on the Wall Street Journal to other domains, including the Brown corpus (a slightly more representative corpus of American English) and the ATIS corpus (spoken queries to an airline reservation system). Gildea (2001) showed that a relatively basic PCFG model was very sensitive to relevant training data and that a small amount of relevant data was better than a large amount of irrelevant, out-of-domain data. McClosky et al. (2006b) recently showed that their self-trained parsing model was not overfitted to Wall Street Journal text and also performed almost as well on Brown corpus data as a parser trained only on Brown data.

The PCFG model seems to be a reasonable one for English, but how does it fare when applied to other languages? Bikel and Chiang (2000) carry out experiments with a lexicalized PCFG on a section of the Penn Chinese Treebank and show that it does not perform as well as an automatically extracted TAG for Chinese (Chiang 2000). Bikel (2002) describes a multilingual parsing engine based on Collins's (1999) model. This engine is designed to work with any new annotated corpus in any language: one simply needs to write some language- and treebank-specific code to use it. The time needed to write this code is far less than what would be needed to rewrite the entire parser for the new language and treebank. This is an important engineering advancement and provides researchers with the means to try out state-of-the-art techniques developed for English in their own language (provided they have a treebank readily available and it does not differ too greatly from the Penn Treebank style). The downloadable version of the parser (http://www.cis.upenn.edu/~dbikel/software.html#stat-parser) comes with a language package for English, Chinese and Arabic (corresponding to the Penn Treebanks for those languages). Bikel (2004) reports parsing results for Chinese with an f-score of 81.2 for sentences of length ≤40 and an f-score of 79.0 on sentences of length ≤100. These results are considerably lower than current state-of-the-art results (with similar technology) for English. Similarly, the preliminary results for Arabic reported in Bikel (2004) are also much lower than for English (an f-score of 75.7 for sentences of length ≤40 and 72.9 for all sentences). Petrov and Klein (2007) report improved results for Chinese with their unlexicalized parser, achieving an f-score of 86.3 for sentences of length ≤40 and an f-score of 83.3 on sentences of length ≤100.


Beil et al. (1999) and Dubey (2004) show that the lexicalized PCFG models developed for English (Collins 1999, etc.) are not suitable for German and that there are other aspects of German that are more important than lexicalization. Dubey (2005) also shows that long-distance dependencies play an important role in parsing German, given the relatively free word order and rich morphology of the language. Schiehlen (2004) suggests some treebank transformations that can be applied to improve German PCFG parsing. Dubey (2005) investigates integrating a morphological analyser into the parser (by means of a suffix analyser) and improving the unlexicalized parsing. Petrov and Klein (2007) report the highest results so far for German against the NEGRA corpus (80.7 f-score). Arun and Keller (2005) show that lexicalization does in fact help when parsing French. They perform experiments with a parser based on Collins (1999) and achieve an f-score of 81 against a French treebank.

Conclusions

In this article, we have provided an overview of the current state of the field of treebank-based PCFG parsing. It is a very competitive field, as everyone strives to achieve the best f-score against Section 23 of the Wall Street Journal. While researchers have proven that they can achieve good results against such text, it remains to be proven how useful these parsers are in the grand scheme of general natural language processing. Lease et al. (2006) outline some current applications of the Charniak parser in language modelling, speech recognition, machine translation and sentence condensation. The applications of statistical parsing will drive the field in the future; they are likely to drive the need, for example, to include functional tags (Gabbard et al. 2006) or empty categories (Schmid 2006) in PCFG parser output. For many other languages, the challenge of designing good probabilistic models is still open. For some languages (Czech, Japanese, etc.), the CFG model is not appropriate and dependency banks (rather than treebanks) have been developed instead. That work remains beyond the scope of this article.

Short Biography

Aoife Cahill is currently a post-doctoral researcher at the Institute for Natural Language Processing at the University of Stuttgart, Germany. Since her time as an undergraduate student, she has been interested in the area of statistical parsing. She received her PhD from Dublin City University, Ireland, in 2004. The title of her thesis was ‘Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations’.


She was instrumental in the development of new technology in the automatic acquisition of statistical unification-based grammar resources, and after completing her PhD she went on to jointly lead, together with Professor Josef van Genabith at Dublin City University, a group of eight post-graduate and post-doctoral researchers in a project further developing the automatic acquisition of lexical functional grammar (LFG) resources. She spent 6 months in 2006 as a visiting researcher at the Palo Alto Research Center, California. Her main interests centre around statistical natural language processing, specifically integrating machine learning techniques with traditional hand-crafted methodologies. She has published a book chapter, journal articles and conference papers on the automatic acquisition of LFG resources, the development of tools relating to this acquisition and, more recently, statistical surface realization.

Notes

* Correspondence address: Aoife Cahill, Institut für Maschinelle Sprachverarbeitung (IMS), Universität Stuttgart, Azenbergstraße 12, D-70174 Stuttgart, Germany. Email: [email protected].
1 This use of the term shallow differs from the more common use, where a shallow analysis consists of constituent boundaries, with very little hierarchical structure. These shallow analyses can often be underspecified.
2 This is in contrast to purely symbolic chart parsing, where all analyses of a constituent are stored in the cell.
3 They extracted 2733 rules from just over 45,000 words of Dow Jones text and predicted an exponential growth in the number as more words were included.
4 As an example, the simplest smoothing algorithm is ‘Add One’ or Laplace’s Law, which adds a count of 1 to all (including unseen) events.

Works Cited Aho, Alfred, and Jeffrey Ullman. 1972. The theory of parsing, translation and compiling: Volume 1: parsing. Englewood Cliffs, NJ: Prentice-Hall. Armstrong-Warwick, Susan. 1993. Preface. Computational Linguistics 19.iii–iv. Arun, Abhishek, and Frank Keller. 2005. Lexicalization in crosslinguistic probabilistic parsing: the case of French. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, 306–13. Morristown, NJ: Association for Computational Linguistics. Beil, Franz, Glenn Carroll, Detlef Prescher, Stefan Riezler, and Mats Rooth. 1999. Inside-outside estimation of a lexicalized PCFG for German. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, 269–276. Morristown, NJ: Association for Computational Linguistics. Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22:39–71. Bikel, Daniel. 2002. Design of a multi-lingual, parallel-processing statistical parsing engine. Proceedings of Human Language Technology Conference (HLT) 2002, San Diego, CA, 24–7. San Francisco, CA: Morgan Kaufmann. ——. 2004. On the parameter space of generative lexicalized statistical parsing models, PhD thesis, University of Pennsylvania. Bikel, Daniel, and David Chang. 2000. Two statistical parsing models applied to the Chinese © 2007 The Author Language and Linguistics Compass 2/1 (2008): 18–40, 10.1111/j.1749-818x.2007.00046.x Journal Compilation © 2007 Blackwell Publishing Ltd

Black, Ezra, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos. 1992. Towards history-based grammars: using richer models for probabilistic parsing. Human Language Technology Conference: Proceedings of the workshop on Speech and Natural Language, Harriman, NY, 134–9. Morristown, NJ: Association for Computational Linguistics.
Black, Ezra, Steven Abney, Dan Flickenger, Claudia Gdaniec, Ralph Grishman, Phil Harrison, Donald Hindle, Robert Ingria, Fred Jelinek, Judith Klavans, Mark Liberman, Mitch Marcus, Salim Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. Procedure for quantitatively comparing the syntactic coverage of English grammars. Human Language Technology Conference: Proceedings of the workshop on Speech and Natural Language, Pacific Grove, CA, 306–11. Morristown, NJ: Association for Computational Linguistics.
Bod, Rens. 1995. Enriching linguistics with statistics: performance models of natural language, PhD thesis. Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands.
——. 1996. Data-oriented language processing. An overview. Technical Report LP-96-13. Institute for Logic, Language and Computation, University of Amsterdam.
——. 2000. Parsing with the shortest derivation. Proceedings of the 18th International Conference on Computational Linguistics (COLING 00), Saarbrücken, Germany, 69–75. Morristown, NJ: Association for Computational Linguistics.
——. 2001. What is the minimal set of fragments that achieves maximal parse accuracy? Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, 66–73. Morristown, NJ: Association for Computational Linguistics.
——. 2003. An efficient implementation of a new DOP model. Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03), Budapest, Hungary, 19–26. Morristown, NJ: Association for Computational Linguistics.
——. 2006. An all-subtrees approach to unsupervised parsing. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 865–72. Morristown, NJ: Association for Computational Linguistics.
Bod, Rens, and Remko Scha. 2003. A DOP model for phrase structure trees. Data oriented parsing, ed. by Rens Bod, Remko Scha and Khalil Sima'an, 13–23. Stanford, CA: CSLI Publications.
Bonnema, Remko, Paul Buying, and Remko Scha. 2000. Parse tree probability in data oriented parsing. Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, 219–32.
Briscoe, E. J., and J. Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics 19.25–60.
Cahill, Aoife, Mairéad McCarthy, Josef van Genabith, and Andy Way. 2002. Parsing with PCFGs and automatic F-structure annotation. Proceedings of the Seventh International Conference on Lexical-Functional Grammar, ed. by Miriam Butt and Tracy Holloway King, 76–95. Stanford, CA: CSLI Publications.
Carroll, Glenn, and Mats Rooth. 1998. Valence induction with a head-lexicalized PCFG. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3), Granada, Spain, ed. by Nancy Ide and Atro Voutilainen, 36–45.
Carroll, John, Edward Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: a survey and a new proposal. Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 447–54.
Carroll, John, Anette Frank, Dekang Lin, Detlef Prescher, and Hans Uszkoreit (editors). 2002. Beyond PARSEVAL – Towards improved evaluation measures for parsing systems, workshop of the third LREC conference, 2 June 2002, Las Palmas, Canary Islands, Spain.

Chappelier, Jean-Cédric, and Martin Rajman. 2003. Parsing DOP with Monte-Carlo techniques. Data-oriented parsing, ed. by Rens Bod, Remko Scha and Khalil Sima'an, 83–106. Stanford, CA: CSLI Publications.
Charniak, Eugene. 1993. Statistical language learning. Cambridge, MA: MIT Press.
——. 1995. Parsing with context-free grammars and word statistics. Technical Report CS-95-28. Providence, RI: Brown University.
——. 1996. Treebank grammars. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA, 1031–6. Cambridge, MA: MIT Press.
——. 1997. Statistical parsing with a context free grammar and word statistics. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI, 598–603. Menlo Park, CA: AAAI Press/MIT Press.
——. 2000. A maximum-entropy-inspired parser. Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), Seattle, WA, 132–9. San Francisco, CA: Morgan Kaufmann Publishers.
——. 2001. Immediate-head parsing for language models. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, 124–31. Morristown, NJ: Association for Computational Linguistics.
Charniak, Eugene, and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 173–80. Morristown, NJ: Association for Computational Linguistics.
Chiang, David. 2000. Statistical parsing with an automatically-extracted tree adjoining grammar. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 456–63. Morristown, NJ: Association for Computational Linguistics.
Clark, Alexander. 2001. Unsupervised induction of stochastic context-free grammars using distributional clustering. Proceedings of the 2001 Workshop on Computational Natural Language Learning, Toulouse, France, 105–12. Morristown, NJ: Association for Computational Linguistics.
Clark, Stephen, and Julia Hockenmaier. 2002. Evaluating a wide-coverage CCG parser. Proceedings of the LREC 2002 Beyond Parseval Workshop, Las Palmas, Canary Islands, Spain, 60–66.
Collins, Michael. 1996. A new statistical parser based on bigram lexical dependencies. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA, 184–91. Morristown, NJ: Association for Computational Linguistics.
——. 1997. Three generative, lexicalized models for statistical parsing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, ed. by Philip R. Cohen and Wolfgang Wahlster, 16–23. Somerset, NJ: Association for Computational Linguistics.
——. 1999. Head-driven statistical models for natural language parsing, PhD thesis. Philadelphia, PA: University of Pennsylvania.
Dennis, Simon. 2005. An exemplar-based approach to unsupervised parsing. Proceedings of the Twenty-Seventh Conference of the Cognitive Science Society, Stresa, Italy. Mahwah, NJ: Lawrence Erlbaum Associates.
Dubey, Amit. 2004. Statistical parsing for German: modeling syntactic properties and annotation differences, PhD thesis. Saarland University, Germany.
——. 2005. What to do when lexicalization fails: parsing German with suffix analysis and smoothing. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 314–21. Morristown, NJ: Association for Computational Linguistics.
Earley, Jay. 1968. An efficient context-free parsing algorithm, PhD thesis. Carnegie Mellon University.
Francis, Hartwell S., Michelle L. Gregory, and Laura A. Michaelis. 1999. Are lexical subjects deviant? Thirty-Fifth Annual Regional Meeting of the Chicago Linguistic Society 35, ed. by Sabrina J. Billings, John P. Boyle and Aaron M. Griffith, 85–97.
Gabbard, Ryan, Seth Kulick, and Mitchell Marcus. 2006. Fully parsing the Penn Treebank. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, New York, NY, 184–91. Morristown, NJ: Association for Computational Linguistics.

Gazdar, Gerald, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized phrase structure grammar. Oxford, UK: Basil Blackwell.
Gildea, Daniel. 2001. Corpus variation and parser performance. Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), Pittsburgh, PA, 167–202.
Goodman, Joshua. 1998. Parsing inside-out, PhD thesis. Cambridge, MA: Harvard University.
Grishman, Ralph, Catherine Macleod, and John Sterling. 1992. Evaluating parsing strategies using standardized parse files. Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy, 156–61. Morristown, NJ: Association for Computational Linguistics.
Hindle, Donald, and Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics 19.103–20.
Hockenmaier, Julia, and Mark Steedman. 2002. Generative models for statistical parsing with combinatory categorial grammar. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 335–42. Morristown, NJ: Association for Computational Linguistics.
Inui, Kentaro, Virach Sornlertlamvanich, Hozumi Tanaka, and Takenobu Tokunaga. 1997. A new formalization of probabilistic GLR parsing. Proceedings of the 5th International Workshop on Parsing Technologies, Boston, MA, 123–34.
Johnson, Mark. 1999. PCFG models of linguistic tree representations. Computational Linguistics 24.613–32.
——. 2002. The DOP estimation method is biased and inconsistent. Computational Linguistics 28.71–6.
Jurafsky, Daniel, and James H. Martin. 2000. Speech and language processing. Englewood Cliffs, NJ: Prentice-Hall.
Kaplan, Ron, and Joan Bresnan. 1982. Lexical-functional grammar: a formal system for grammatical representation. The mental representation of grammatical relations, ed. by Joan Bresnan, 173–281. Cambridge, MA: MIT Press.
Kaplan, Ron, Stefan Riezler, Tracy Holloway King, John T. Maxwell, Alexander Vasserman, and Richard Crouch. 2004. Speed and accuracy in shallow and deep stochastic parsing. Proceedings of the Human Language Technology Conference and the 4th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL'04), Boston, MA, 97–104.
Kay, Martin. 1985. Parsing in functional-unification grammar. Natural language parsing, ed. by D. R. Dowty, Lauri Karttunen and A. M. Zwicky, 251–78. Cambridge, UK: Cambridge University Press.
King, Tracy Holloway, Richard Crouch, Stefan Riezler, Mary Dalrymple, and Ron Kaplan. 2003. The PARC700 dependency bank. Proceedings of the EACL03: 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), Budapest, Hungary, 1–8.
Klein, Dan. 2005. The unsupervised learning of natural language structure, PhD thesis. Stanford University.
Klein, Dan, and Christopher Manning. 2003. A* parsing: fast exact Viterbi parse selection. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, 40–7.
Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn Treebank grammar. Proceedings of COLING/ACL98: Joint Meeting of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics, Montréal, Canada, 699–703. Morristown, NJ: Association for Computational Linguistics.
Krotov, Alexander, Rob Gaizauskas, and Yorick Wilks. 1994. Acquiring a stochastic context-free grammar from the Penn Treebank. Proceedings of the Third Conference on the Cognitive Science of Natural Language Processing, Dublin, Ireland, 79–86.
Kuno, Susumu. 1972. Functional sentence perspective: a case study from Japanese and English. Linguistic Inquiry 3.269–320.
Lease, Matthew, Eugene Charniak, Mark Johnson, and David McClosky. 2006. A look at parsing and its applications. Proceedings of the American Association for Artificial Intelligence (AAAI 2006), Boston, MA.
Lin, Dekang. 1995. A dependency-based method for evaluating broad-coverage parsers. Proceedings of the International Joint Conference on AI, Montréal, Canada, 1420–7.
Magerman, David. 1995. Statistical decision-tree models for parsing. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, 276–83. Morristown, NJ: Association for Computational Linguistics.
Magerman, David, and Mitch Marcus. 1991. Pearl: a probabilistic chart parser. Proceedings of the 6th Conference of the European Chapter of the ACL, Berlin, Germany, 40–7. Morristown, NJ: Association for Computational Linguistics.
Magerman, David M., and Carl Weir. 1992. Efficiency, robustness and accuracy in Picky chart parsing. Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, DE, 40–7. Morristown, NJ: Association for Computational Linguistics.
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Marcus, Mitch, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19.313–30.
Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: annotating predicate argument structure. Proceedings of the Workshop on Human Language Technology, Princeton, NJ, 110–5. Morristown, NJ: Association for Computational Linguistics.
Matsuzaki, Takuya, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, MI, 75–82. Morristown, NJ: Association for Computational Linguistics.
McClosky, David, Eugene Charniak, and Mark Johnson. 2006a. Effective self-training for parsing. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York, NY, 152–9. Morristown, NJ: Association for Computational Linguistics.
——. 2006b. Reranking and self-training for parser adaptation. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 337–44. Morristown, NJ: Association for Computational Linguistics.
Miyao, Yusuke, and Jun'ichi Tsujii. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 1392–7. Morristown, NJ: Association for Computational Linguistics.
Miyao, Yusuke, Takashi Ninomiya, and Jun'ichi Tsujii. 2003. Probabilistic modeling of argument structures including non-local dependencies. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 285–91.
Pereira, Fernando C. N., and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, DE, 128–35. Morristown, NJ: Association for Computational Linguistics.
Petrov, Slav, and Dan Klein. 2007. Improved inference for unlexicalized parsing. Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, NY, 404–11. Morristown, NJ: Association for Computational Linguistics.
Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 433–40. Morristown, NJ: Association for Computational Linguistics.
Pollard, Carl, and Ivan Sag. 1994. Head-driven phrase structure grammar. Stanford, CA: CSLI Publications.
Preiss, Judita. 2003. Using grammatical relations to compare parsers. Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03), Budapest, Hungary, 291–8.
Roark, Brian. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics 27.249–76.
Schiehlen, Michael. 2004. Annotation strategies for probabilistic parsing in German. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 390–6. Morristown, NJ: Association for Computational Linguistics.
Schmid, Helmut. 2006. Trace prediction and recovery with unlexicalized PCFGs and slash features. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 177–84. Morristown, NJ: Association for Computational Linguistics.
Shieber, Stuart M. 1984. The design of a computer language for linguistic information. Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics, Stanford, CA, 362–6. Morristown, NJ: Association for Computational Linguistics.
Sima'an, Khalil. 1995. Computational complexity of probabilistic disambiguation by means of tree-grammars. Proceedings of the 16th Conference on Computational Linguistics (COLING'96), Copenhagen, Denmark, 1175–80. Morristown, NJ: Association for Computational Linguistics.
——. 1999. Learning efficient disambiguation, PhD thesis. University of Amsterdam, The Netherlands.
——. 2003. Computational complexity of disambiguation under DOP1. Data-oriented parsing, ed. by Rens Bod, Remko Scha, and Khalil Sima'an, 63–81. Stanford, CA: CSLI Publications.
Sima'an, Khalil, and L. Buratto. 2003. Backoff parameter estimation for the DOP model. Proceedings of the 14th European Conference on Machine Learning (ECML'03), Lecture Notes in Artificial Intelligence (LNAI 2837), Cavtat-Dubrovnik, Croatia, ed. by N. Lavrac, D. Gamberger, H. Blockeel and L. Todorovski, 373–84. Berlin, Germany: Springer.
Stolcke, Andreas. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics 21.165–202.
van Zaanen, Menno. 2002. Bootstrapping structure into language: alignment-based learning, PhD thesis. University of Leeds, Leeds, UK.
Xu, Peng, Ciprian Chelba, and Frederick Jelinek. 2002. A study on richer syntactic dependencies for structured language modeling. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 191–8.
Younger, D. 1967. Recognition and parsing of context-free languages in time n³. Information and Control 10.189–208.
