
Two monolingual parses are better than one (synchronous parse)∗

Chris Dyer
UMIACS Laboratory for Computational Linguistics and Information Processing
Department of Linguistics
University of Maryland, College Park, MD 20742, USA
redpony AT umd.edu

Abstract

We describe a synchronous parsing algorithm that is based on two successive monolingual parses of an input sentence pair. Although the worst-case complexity of this algorithm is and must be O(n⁶) for binary SCFGs, its average-case run-time is far better. We demonstrate that for a number of common synchronous parsing problems, the two-parse algorithm substantially outperforms alternative synchronous parsing strategies, making it efficient enough to be utilized without resorting to a pruned search.

1 Introduction

Synchronous context free grammars (SCFGs) generalize monolingual context-free grammars to generate strings concurrently in pairs of languages (Lewis and Stearns, 1968), in much the same way that finite state transducers (FSTs) generalize finite state automata (FSAs).1 Synchronous parsing is the problem of finding the best derivation, or forest of derivations, of a source and target sentence pair ⟨f, e⟩ under an SCFG, G.2 Solving this problem is necessary for several applications, for example, optimizing how well an SCFG translation model fits parallel training data. Wu (1997) describes a bottom-up O(n⁶) synchronous parsing algorithm for ITGs, a binary SCFG with a restricted form. For general grammars, the situation is even worse: the problem has been shown to be NP-hard (Satta and Peserico, 2005).

∗ This work was supported in part by the GALE program of DARPA, Contract No. HR0011-06-2-001. The author wishes to thank Philip Resnik, Adam Lopez, Phil Blunsom, and Jason Eisner for helpful discussions.
1 SCFGs have enjoyed a resurgence in popularity as the formal basis for a number of statistical translation systems, e.g. Chiang (2007). However, translation requires only the manipulation of SCFGs using monolingual parsing algorithms.
2 It is assumed that n = |f| ≈ |e|.

Even if we restrict ourselves to binary ITGs, the O(n⁶) run-time makes large-scale learning applications infeasible. The usual solution is to use a heuristic search that avoids exploring edges that are likely (but not guaranteed) to be low probability (Zhang et al., 2008; Haghighi et al., 2009). In this paper, we derive an alternative synchronous parsing algorithm starting from a conception of parsing with SCFGs as a composition of binary relations. This enables us to factor the synchronous parsing problem into two successive monolingual parses. Our algorithm runs more efficiently than O(n⁶) with many grammars (including those that previously required heuristic search with other parsers), making it possible to take advantage of synchronous parsing without developing search heuristics; and the SCFGs are not required to be in a normal form, making it possible to easily parse with more complex SCFG types.

2 Synchronous parsing

Before presenting our algorithm, we review the O(n⁶) synchronous parser for binary ITGs.3

2.1 ITG synchronous parsing algorithm

Wu (1997) describes a bottom-up synchronous parsing algorithm that can be understood as a generalization of the CKY algorithm. CKY defines a table consisting of n² cells, with each cell corresponding to a span [i, j] in the input sentence; the synchronous variant defines a table in 4 dimensions, with cells corresponding to a source span [s, t] and a target span [u, v]. The bottom of the chart is initialized first, and pairs of items are combined from bottom to top. Since combining items from the n⁴ cells involves considering two split points (one source, one target), it is not hard to see that this algorithm runs in time O(n⁶).

3 Generalizing the algorithm to higher rank grammars is possible (Wu, 1997), as is converting a grammar to a weakly equivalent binary form in some cases (Huang et al., 2009).
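The recursion is easy to sketch. The following is a minimal, hypothetical inside-probability variant for a binary ITG in normal form (the rule encodings and names are assumptions for illustration, not the paper's implementation); it makes explicit both the n⁴ cells, indexed by a source span and a target span, and the two split points that add the remaining n² factor.

```python
from collections import defaultdict

def itg_inside(f, e, lex, straight, inverted):
    """O(n^6) inside algorithm for a binary ITG (after Wu, 1997), sketched.

    lex[(X, fw, ew)]    -> p(X -> <fw, ew>)   word-pair rules
    straight[(X, Y, Z)] -> p(X -> [Y Z])      monotone combination
    inverted[(X, Y, Z)] -> p(X -> <Y Z>)      swapped combination
    Returns chart[(s, t, u, v)][X]: inside probability of X spanning
    source span [s, t) and target span [u, v).  No null alignments.
    """
    n, m = len(f), len(e)
    chart = defaultdict(lambda: defaultdict(float))

    # Initialize 1x1 cells with word-pair (lexical) rules.
    for s in range(n):
        for u in range(m):
            for (X, fw, ew), p in lex.items():
                if fw == f[s] and ew == e[u]:
                    chart[(s, s + 1, u, u + 1)][X] += p

    # Combine items bottom-up over all n^4 (source span, target span) cells;
    # the two inner split-point loops contribute the additional n^2 factor.
    for flen in range(1, n + 1):
        for elen in range(1, m + 1):
            for s in range(n - flen + 1):
                t = s + flen
                for u in range(m - elen + 1):
                    v = u + elen
                    cell = chart[(s, t, u, v)]
                    for sp in range(s + 1, t):        # source split point
                        for up in range(u + 1, v):    # target split point
                            for (X, Y, Z), p in straight.items():
                                cell[X] += (p * chart[(s, sp, u, up)][Y]
                                              * chart[(sp, t, up, v)][Z])
                            for (X, Y, Z), p in inverted.items():
                                cell[X] += (p * chart[(s, sp, up, v)][Y]
                                              * chart[(sp, t, u, up)][Z])
    return chart
```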

2.2 Parsing, intersection, and composition

We motivate an alternative conception of the synchronous parsing problem as follows. It has long been appreciated that monolingual parsing computes the intersection of an FSA and a CFG (Bar-Hillel et al., 1961; van Noord, 1995). That is, if S is an FSA encoding some sentence s, intersection of S with a CFG, G, results in a parse forest which contains all derivations of s, and only derivations of s; that is, L(S) ∩ L(G) ∈ {{s}, ∅}.4 Crucially for our purposes, the resulting parse forest is also itself a CFG.5 Figure 1 illustrates, giving two equivalent representations of the forest S ∩ G, once as a directed hypergraph and once as a CFG. While S ∩ G appears similar to G, the non-terminals (NTs) of the resulting CFG are a cross product of pairs of states from S and NTs from G.6

[Figure 1: A CFG, G (with rules such as S → NP VP, NP → DT NN, NP → PRN, VP → V NP); an FSA, S, encoding the sentence "i saw the forest"; and two equivalent representations of the parse forest S ∩ G, (a) as a directed hypergraph and (b) as a CFG with span-indexed non-terminals, e.g. 0S4 → 0NP1 1VP4 and 2NP4 → 2DT3 3NN4.]
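To make the construction concrete, here is a minimal CKY-style sketch of S ∩ G for the toy grammar and sentence of Figure 1 (the rule encodings and helper names are hypothetical, not the paper's implementation). Each non-terminal of the output CFG pairs a span of the linear-chain FSA with a grammar NT.

```python
from collections import defaultdict

def intersect_sentence_cfg(words, lex_rules, un_rules, bin_rules):
    """Sketch of the Bar-Hillel-style intersection S ∩ G, where S is the
    linear-chain FSA for `words` and G is an epsilon-free CFG, via CKY.
    The output is itself a CFG: its non-terminals pair a span with a
    grammar NT, e.g. (0, 'S', 4), as in Figure 1(b).

    Assumed rule encodings:
      lex_rules: {('DT', 'the'), ...}      X -> word
      un_rules:  {('NP', 'PRN'), ...}      X -> Y   (assumed acyclic)
      bin_rules: {('S', 'NP', 'VP'), ...}  X -> Y Z
    """
    n = len(words)
    forest = defaultdict(list)              # (i, X, j) -> list of RHSs

    def close_unaries(i, j):
        # One pass suffices if the unary rules are acyclic.
        for (X, Y) in un_rules:
            if (i, Y, j) in forest:
                forest[(i, X, j)].append(((i, Y, j),))

    for i, w in enumerate(words):           # length-1 spans
        for (X, term) in lex_rules:
            if term == w:
                forest[(i, X, i + 1)].append((w,))
        close_unaries(i, i + 1)

    for span in range(2, n + 1):            # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (X, Y, Z) in bin_rules:
                    if (i, Y, k) in forest and (k, Z, j) in forest:
                        forest[(i, X, j)].append(((i, Y, k), (k, Z, j)))
            close_unaries(i, j)
    return forest

# Toy grammar and sentence from Figure 1: "i saw the forest".
lex = {('PRN', 'i'), ('V', 'saw'), ('DT', 'the'), ('NN', 'forest')}
uns = {('NP', 'PRN')}
bins = {('S', 'NP', 'VP'), ('VP', 'V', 'NP'), ('NP', 'DT', 'NN')}
forest = intersect_sentence_cfg(['i', 'saw', 'the', 'forest'], lex, uns, bins)
# forest[(0, 'S', 4)] == [((0, 'NP', 1), (1, 'VP', 4))]
```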

When dealing with SCFGs, rather than intersection, parsing computes a related operation, composition.7 The standard MT decoding-by-parsing task can be understood as computing the composition of an FST,8 F, which encodes the source sentence f, with the SCFG, G, representing the translation model. The result is the translation forest, F ◦ G, which encodes all translations of f licensed by the translation model. While G can generate a potentially infinite set of strings in the source and target languages, F ◦ G generates only f in the source language (albeit with possibly infinitely many derivations), but any number of different strings in the target language. It is not hard to see that a second composition operation of an FST, E, encoding the target string e, with the e-side of F ◦ G (again using a monolingual parsing algorithm) will result in a parse forest that exactly derives ⟨f, e⟩, which is the goal of synchronous composition. Figure 2 shows an example. In F ◦ G ◦ E the NTs (nodes) are the cross product of pairs of states from E, the NTs from G, and pairs of states in F. Thus, synchronous parsing is the task of computing F ◦ G ◦ E. Since composition is associative, we can compute this quantity either as (F ◦ G) ◦ E or F ◦ (G ◦ E). Alternatively, we can use an algorithm that performs 3-way composition directly.
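In set notation, this is only a restatement of the preceding paragraph (with L(·) as in footnote 4): F fixes the source side and E fixes the target side.

```latex
L(F \circ G) = \{\, \langle f', e' \rangle \in L(G) : f' = f \,\}
\qquad
L(F \circ G \circ E) = \{\, \langle f', e' \rangle \in L(G) : f' = f,\ e' = e \,\}
```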

2.3 The two-parse algorithm9

The two-parse algorithm refers to performing a synchronous parse by computing either (F ◦ G) ◦ E or F ◦ (G ◦ E). Each composition operation is carried out using a standard monolingual parsing algorithm, such as Earley's or CKY. In the experiments below, since we use ε-free grammars, we use a variant of CKY for unrestricted CFGs (Chiang, 2007). Once the first composition is done, the resulting parse forest must be converted into a CFG representation that the second parser can utilize. This is straightforward to do: each node becomes a unique non-terminal symbol, with its incoming edges corresponding to different ways of rewriting it. Tails of edges are non-terminal variables in the RHS of these rewrites. A single bottom-up traversal of the forest is sufficient to perform the conversion. Since our parser operates more efficiently with a determinized grammar, we left-factor the grammar during this traversal as well.
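The conversion step can be sketched as follows. This is a minimal sketch under an assumed hypergraph encoding (node and edge names are hypothetical, and the left-factoring mentioned above is omitted): one fresh non-terminal per forest node, one rewrite per incoming edge.

```python
def forest_to_cfg(root, incoming):
    """Convert a parse forest (directed hypergraph) into an equivalent CFG,
    the intermediate step between the two monolingual parses.

    Assumed encoding: incoming[node] is a list of edges, and an edge is a
    list of symbols, each either ('t', terminal_string) or ('nt', tail_node).
    Each node becomes one unique non-terminal; each incoming edge becomes
    one rewrite of it, with tail nodes appearing as RHS non-terminals.
    """
    nt_of = {}                          # forest node -> fresh NT symbol
    rules = []                          # (lhs, rhs) pairs of the output CFG

    def visit(node):                    # single bottom-up traversal
        if node in nt_of:
            return nt_of[node]
        nt = nt_of[node] = f"N{len(nt_of)}"
        for edge in incoming.get(node, []):
            rhs = [sym if kind == 't' else visit(sym)
                   for kind, sym in edge]
            rules.append((nt, tuple(rhs)))
        return nt

    return visit(root), rules

# Example edge encoding, using the Figure 1 forest:
# the node 2NP4 is built from 2DT3 and 3NN4 (NP -> DT NN), so
# incoming["2NP4"] = [[('nt', "2DT3"), ('nt', "3NN4")]]
```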

4 L(x) denotes the set of strings generated by the grammar/automaton x. In future mentions of intersection and composition operations, this will be implicit.
5 The forest derives only s, but possibly with many derivations.
6 Each pair of states from the FSA corresponds to a span [i, j] in a CKY table.
7 Intersection is a special case of composition where the input and output labels on the transducers are identical (Mohri, 2009).
8 FSTs used to represent the source and target sentences have identical input and output labels on every transition.
9 Satta (submitted) has independently derived this algorithm.


[Figure 2: An SCFG, G; two FSAs, E and F; and two equivalent representations of F ◦ G. The synchronous parse forest of the pair ⟨ab, cd⟩ with G is given under F ◦ G ◦ E.]

Analysis. Monolingual parsing runs in worst case O(|G| · n³) time, where n is the length of the input being parsed and |G| is a measure of the size of the grammar (Graham et al., 1980). Since the grammar term is constant for most typical parsing applications, it is generally not considered carefully; however, in the two-parse algorithm, the size of the grammar term for the second parse is not |G| but |F ◦ G|, which clearly depends on the size of the input F, and so understanding the impact of this term is key to understanding the algorithm's run-time. If G is an ε-free SCFG with non-terminals N and maximally two NTs in a rule's right hand side, and n is the number of states in F (corresponding to the number of words in f in a sentence pair ⟨f, e⟩), then the number of nodes in the parse forest F ◦ G will be O(|N| · n²). This can be shown easily since, by stipulation, we are able to use CKY+ to perform the parse, and there will be maximally as many nodes in the forest as there are cells in the CKY chart times the number of NTs. The number of edges will be O(|N| · n³), which occurs when every node can be derived from all possible splits. This bound on the number of edges implies that |F ◦ G| ∈ O(n³).10 Therefore, the worst case run-time of the two-parse algorithm is O(|N| · n³ · n³ + |G| · n³) = O(|N| · n⁶), the same as the bound on the ITG algorithm. We note that while the ITG algorithm requires that the SCFGs be rank-2 and in a normal form, the two-parse algorithm analysis holds as long as the grammars are rank-2 and ε-free.11
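Assembled into one line, this is just the bound derived above, with |F ◦ G| ∈ O(|N| · n³) playing the role of the grammar term of the second parse:

```latex
\underbrace{O(|F \circ G| \cdot n^{3})}_{\text{second parse}}
+ \underbrace{O(|G| \cdot n^{3})}_{\text{first parse}}
= O(|N| \cdot n^{3} \cdot n^{3} + |G| \cdot n^{3})
= O(|N| \cdot n^{6})
```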

3 Experiments

We now describe two different synchronous parsing applications, with different classes of SCFGs, and compare the performance of the two-parse algorithm with that of previously used algorithms.

Phrasal ITGs. Here we compare the performance of the two-parse algorithm and the O(n⁶) ITG parsing algorithm on an Arabic-English phrasal ITG alignment task. We used a variant of the phrasal ITG described by Zhang et al. (2008).12 Figure 3 plots the average run-time of the two algorithms as a function of the Arabic sentence length. The two-parse approach is far more efficient. In total, aligning the 80k sentence pairs in the corpus completed in less than 4 hours with the two-parse algorithm but required more than 1 week with the baseline algorithm.13

10 How tight these bounds are depends on the ambiguity in the grammar w.r.t. the input: to generate n³ edges, every item in every cell must be derivable by every combination of its subspans. Most grammars are substantially less ambiguous.
11 Since many widely used SCFGs meet these criteria, including hierarchical phrase-based translation grammars (Chiang, 2007), SAMT grammars (Zollmann and Venugopal, 2006), and phrasal ITGs (Zhang et al., 2008), a detailed analysis of ε-containing and higher rank grammars is left to future work.
12 The restriction that phrases contain exactly a single alignment point was relaxed, resulting in much larger and more ambiguous grammars than those used in the original work.
13 A note on implementation: our ITG aligner was minimal; it only computed the probability of the sentence pair using the inside algorithm. With the two-parse aligner, we stored the complete forest during both the first and second parses.

[Figure 3: Average synchronous parser run-time (in seconds) as a function of Arabic sentence length (in words), comparing Wu (1997) with this work.]

"Hiero" grammars. An alternative approach to computing a synchronous parse forest is based on cube pruning (Huang and Chiang, 2007). While it is more commonly used to integrate a target m-gram LM during decoding, Blunsom et al. (2008), who required synchronous parses to discriminatively train an SCFG translation model, repurposed this algorithm to discard partial derivations during translation of f if the derivation yielded a target m-gram not found in e (p.c.). We replicated their BTEC Chinese-English baseline system and compared the speed of their 'cube-parsing' technique and our two-parse algorithm.14 The SCFG used here was extracted from a word-aligned corpus, as described in Chiang (2007).15 The following table compares the average per-sentence synchronous parse time.

Algorithm                 avg. run-time (sec)
Blunsom et al. (2008)     7.31
this work                 0.20

4 Discussion

Thinking of synchronous parsing as two composition operations has both conceptual and practical benefits. The two-parse strategy can outperform both the ITG parsing algorithm (Wu, 1997) and the 'cube-parsing' technique (Blunsom et al., 2008). The latter result points to a connection with recent work showing that determinization of edges before LM integration leads to fewer search errors during decoding (Iglesias et al., 2009). Our results are somewhat surprising in light of work showing that 3-way composition algorithms for FSTs operate far more efficiently than performing successive pairwise compositions (Allauzen and Mohri, 2009). This is certainly because the 3-way algorithm used here (the ITG algorithm) does an exhaustive search over all n⁴ span pairs without awareness of any top-down constraints. This suggests that faster composition algorithms that incorporate top-down filtering may still be discovered.

14 To the extent possible, the two experiments were carried out using the exact same code base, which was a C++ implementation of an SCFG-based decoder.
15 Because of the mix of terminal and non-terminal symbols, such grammars cannot be used by the ITG synchronous parsing algorithm.

References

C. Allauzen and M. Mohri. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Comp. Sci., 20(4):613–627.
Y. Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172.
P. Blunsom, T. Cohn, and M. Osborne. 2008. Probabilistic inference for machine translation. In EMNLP.
D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
S. L. Graham, W. L. Ruzzo, and M. Harrison. 1980. An improved context-free recognizer. ACM Trans. Program. Lang. Syst., 2(3):415–462.
A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. 2009. Better word alignments with supervised ITG models. In Proc. of ACL/IJCNLP, pages 923–931.
L. Huang and D. Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In ACL.
L. Huang, H. Zhang, D. Gildea, and K. Knight. 2009. Binarization of synchronous context-free grammars. Computational Linguistics, 35(4).
G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. 2009. Hierarchical phrase-based translation with weighted finite state transducers. In Proc. NAACL.
P. M. Lewis, II and R. E. Stearns. 1968. Syntax-directed transduction. J. ACM, 15(3):465–488.
M. Mohri. 2009. Weighted automata algorithms. In M. Droste, W. Kuich, and H. Vogler, editors, Handbook of Weighted Automata, Monographs in Theoretical Computer Science, pages 213–254. Springer.
G. Satta and E. Peserico. 2005. Some computational complexity results for synchronous context-free grammars. In Proceedings of NAACL.
G. Satta. submitted. Translation algorithms by means of language intersection.
G. van Noord. 1995. The intersection of finite state automata and definite clause grammars. In Proc. of ACL.
D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–404.
H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. 2008. Bayesian learning of non-compositional phrases with synchronous parsing. In Proceedings of ACL.
A. Zollmann and A. Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proc. of the Workshop on SMT.
