Stochastic Attribute-Value Grammars

Steven P. Abney

AT&T Laboratories, Rm. A216, 180 Park Avenue, Florham Park, NJ 07932

Probabilistic analogues of regular and context-free grammars are well-known in computational linguistics, and currently the subject of intensive research. To date, however, no satisfactory probabilistic analogue of attribute-value grammars has been proposed: previous attempts have failed to define an adequate parameter-estimation algorithm. In the present paper, I define stochastic attribute-value grammars and give an algorithm for computing the maximum-likelihood estimate of their parameters. The estimation algorithm is adapted from (Della Pietra, Della Pietra, and Lafferty, 1995). To estimate model parameters, it is necessary to compute the expectations of certain functions under random fields. In the application discussed by Della Pietra, Della Pietra, and Lafferty (representing English orthographic constraints), Gibbs sampling can be used to estimate the needed expectations. The fact that attribute-value grammars generate constrained languages makes Gibbs sampling inapplicable, but I show that sampling can be done using the more general Metropolis-Hastings algorithm.

1. Introduction

Stochastic versions of regular grammars and context-free grammars have received a great deal of attention in computational linguistics for the last several years, and basic techniques of stochastic parsing and parameter estimation have been known for decades. However, regular and context-free grammars are widely deemed linguistically inadequate; standard grammars in computational linguistics are attribute-value (AV) grammars of some variety. Before the advent of statistical methods, regular and context-free grammars were considered too inexpressive for serious consideration, and even now the reliance on stochastic versions of the less-expressive grammars is often seen as an expedient necessitated by the lack of an adequate stochastic version of attribute-value grammars.




Proposals have been made for extending stochastic models developed for the regular and context-free cases to grammars with constraints.¹

¹ I confine my discussion here to Brew and Eisele because they aim to describe parametric models of probability distributions over the languages of constraint-based grammars, and to estimate the parameters of those models. Other authors have assigned weights or preferences to constraint-based grammars but not discussed parameter estimation. One approach of the latter sort that I find of particular interest is that of Stefan Riezler (Riezler, 1996), who describes a weighted logic for constraint-based grammars that characterizes the languages of the grammars as fuzzy sets. This interpretation avoids the need for normalization that Brew and Eisele face, though parameter estimation still remains to be addressed.

(Brew, 1995) sketches a probabilistic version of Head-Driven Phrase Structure Grammar (HPSG). He proposes a stochastic process for generating attribute-value structures, that is, directed acyclic graphs (dags). A dag is generated starting from a single node labelled with the (unique) most general type. Each type S has a set of maximal subtypes T1, ..., Tm. To expand a node labelled S, one chooses a maximal subtype T stochastically. One then considers equating the current node with other nodes of type T, making a stochastic yes/no decision for each. Equating two nodes creates a re-entrancy. If the current node is equated with no other node, one proceeds to expand it. Each maximal type introduces types U1, ..., Un, corresponding to values of attributes; one creates a child node for each introduced type, and then expands each child in turn. A limitation of this approach is that it permits one to specify only the average rate of re-entrancies; it does not permit one to specify more complex context dependencies.

(Eisele, 1994) takes a logic-programming approach to constraint grammars. He assigns probabilities to proof trees by attaching parameters to logic program clauses. He presents the following logic program as an example:

    1. p(X,Y,Z) :- q(X,Y), r(Y,Z).    1
    2. q(a,b).                        0.4
    3. q(X,c).                        0.6
    4. r(b,d).                        0.5
    5. r(X,e).                        0.5

The probability of a proof tree is defined to be proportional to the product of the probabilities of clauses used in the proof. Normalization is necessary, because some derivations lead to invalid proof trees: for example, the derivation


    p(X,Y,Z)
      ⇓ by 1
    q(X,Y), r(Y,Z)
      ⇓ by 3
    r(c,Z)              [Y=c]
      ⇓ by 4
                        [Y=c, b=c, Z=d]

is invalid because of the illegal assignment b = c.

Both Brew and Eisele associate weights with analogues of rewrite rules. In Brew's case, we can view type expansion as a stochastic choice from a finite set of rules of the form X → φi, where X is the type to expand and each φi is a sequence of introduced child types. A re-entrancy decision is a stochastic choice between two rules, X → yes and X → no, where X is the type of the node being considered for re-entrancy. In Eisele's case, expanding a goal term can be viewed as a stochastic choice among a finite set of rules X → φi, where X is the predicate of the goal term and each φi is a program clause whose head has predicate X. The parameters of the models are essentially weights on such rules, representing the probability of choosing φi when making a choice of type X.

In these terms, Brew and Eisele propose estimating parameters as the empirical relative frequency of the corresponding rules. That is, the weight of the rule X → φi is obtained by counting the number of times X rewrites as φi in the training corpus, divided by the total number of times X is rewritten in the training corpus. For want of a standard term, let us call these estimates Empirical Relative Frequency (ERF) estimates. To deal with incomplete data, both Brew and Eisele appeal to the Expectation-Maximization (EM) algorithm, applied however to ERF rather than maximum likelihood estimates.

Under certain independence conditions, ERF estimates are maximum likelihood estimates. Unfortunately, these conditions are violated when there are context dependencies of the sort found in attribute-value grammars, as will be shown below. As a consequence, applying the ERF method to attribute-value grammars does not generally yield maximum likelihood estimates. This is true whether one uses EM or not: a method that yields the "wrong" estimates on complete data does not improve when EM is used to extend the method to incomplete data.
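To make the ERF recipe concrete, here is a minimal sketch in Python (my own illustration, not code from the paper; the rule representation is an assumption) that computes ERF estimates from a corpus of trees annotated with the rules used to derive them:

    from collections import defaultdict

    def erf_estimates(corpus):
        # corpus: list of (rules_used, count) pairs, where rules_used lists
        # the rules (lhs, rhs) used to derive one tree type, with repeats,
        # and count is that tree type's token frequency in the corpus.
        rule_count = defaultdict(float)
        lhs_count = defaultdict(float)
        for rules_used, count in corpus:
            for lhs, rhs in rules_used:
                rule_count[(lhs, rhs)] += count
                lhs_count[lhs] += count
        # ERF weight: frequency of the rule, divided by the total
        # frequency of rules with the same lefthand side.
        return {r: c / lhs_count[r[0]] for r, c in rule_count.items()}

    # The toy corpus used in section 2.1 below: four tree types,
    # with token counts 4, 2, 3, 3.
    corpus = [
        ([("S", "A A"), ("A", "a"), ("A", "a")], 4),
        ([("S", "A A"), ("A", "b"), ("A", "b")], 2),
        ([("S", "B"), ("B", "a a")], 3),
        ([("S", "B"), ("B", "b b")], 3),
    ]
    print(erf_estimates(corpus))   # e.g. ('A', 'a') maps to 2/3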


Eisele identifies an important symptom that something is amiss with ERF estimates: the probability distribution over proof trees that one obtains does not agree with the frequency of proof trees in the training corpus. Eisele recognizes that this problem arises only where there are context dependencies.

Fortunately, solutions to the context-dependency problem have been described (and indeed are currently enjoying a surge of interest) in statistics, machine learning, and statistical pattern recognition, particularly image processing. The models of interest are known as random fields. Random fields can be seen as a generalization of Markov chains and stochastic branching processes. Markov chains are stochastic processes corresponding to regular grammars, and random branching processes are stochastic processes corresponding to context-free grammars. The evolution of a Markov chain describes a line, in which each stochastic choice depends only on the state at the immediately preceding time-point. The evolution of a random branching process describes a tree, in which a finite-state process may spawn multiple child processes at the next time-step, but the number of processes and their states depend only on the state of the unique parent process at the preceding time-step. In particular, stochastic choices are independent of other choices at the same time-step: each process evolves independently. If we permit re-entrancies, that is, if we permit processes to re-merge, we generally introduce context-sensitivity. In order to re-merge, processes must be "in synch," which is to say, they cannot evolve in complete independence of one another. Random fields are a particular class of multi-dimensional random processes, that is, processes corresponding to probability distributions over an arbitrary graph. The theory of random fields can be traced back to (Gibbs, 1902); indeed, the probability distributions involved are known as Gibbs distributions.

To my knowledge, the first application of random fields to natural language was (Mark et al., 1992). The problem of interest was how to combine a stochastic context-free grammar with n-gram language models. In the resulting structures, the probability


of choosing a particular word is constrained simultaneously by the syntactic tree in which it appears and the choices of words at the n preceding positions. The context-sensitive constraints introduced by the n-gram model are reflected in re-entrancies in the structure of statistical dependencies, as in the following dependency graph for the sentence "there was no response", in which each word also depends on the word preceding it:

    [S [NP there] [VP was [NP no response]]]

In this diagram, the choice of label on a node z with parent x and preceding word y is dependent on the label of x and y, but conditionally independent of the label on any other node.

(Della Pietra, Della Pietra, and Lafferty, 1995; henceforth, DD&L) also apply random fields to natural language processing. The application they consider is the induction of English orthographic constraints, that is, inducing a grammar of possible English words. DD&L describe an algorithm called Improved Iterative Scaling (IIS) for selecting informative features of words to construct a random field, and for setting the parameters of the field optimally for a given set of features, to model an empirical word distribution.

It is not immediately obvious how to use the IIS algorithm to equip attribute-value grammars with probabilities. In brief, the difficulty is the following. The IIS algorithm requires the computation of the expectations, under random fields, of certain functions. In general, computing these expectations involves summing over all configurations (all possible character sequences, in the orthography application), which is not possible when the configuration space is large. Instead, DD&L use Gibbs sampling to estimate the needed expectations.

Gibbs sampling is possible for the application that DD&L consider. A prerequisite for Gibbs sampling is that the configuration space be closed under relabelling of graph


nodes. In the orthography application, the configuration space is the set of possible English words, represented as finite linear graphs labelled with ASCII characters. Every way of changing a label, that is, every substitution of one ASCII character for a different one, yields a possible English word. By contrast, the set of graphs admitted by an attribute-value grammar G is highly constrained. If one changes an arbitrary node label in a dag admitted by G, one does not necessarily obtain a new dag admitted by G. Hence, Gibbs sampling is not applicable. However, I will show that a more general sampling method, the Metropolis-Hastings algorithm, can be used to compute the maximum-likelihood estimate of the parameters of AV grammars.

2. Stochastic Context-Free Grammars

Let us begin by examining stochastic context-free grammars (SCFGs) and asking why the natural extension of SCFG parameter estimation to attribute-value grammars fails. A point of terminology: I will use the term grammar to refer to an unweighted grammar, be it a context-free grammar or attribute-value grammar. A grammar equipped with weights (and other paraphernalia as necessary) I will refer to as a model. Occasionally I will also use model to refer to the weights themselves, or the probability distribution they define.

Throughout we will use the following stochastic context-free grammar for illustrative purposes. Let us call the underlying grammar G1, and the grammar equipped with weights as shown, M1:


    1. S → A A    β1 = 1/2
    2. S → B      β2 = 1/2
    3. A → a      β3 = 2/3        (1)
    4. A → b      β4 = 1/3
    5. B → a a    β5 = 1/2
    6. B → b b    β6 = 1/2

The probability of a given tree is computed as the product of probabilities of the rules used in it. For example, consider the following tree, in which each nonterminal node is annotated with the weight of the rule expanding it:

    [S:β1 [A:β3 a] [A:β3 a]]        (2)

Let x be tree (2) and let q1 be the probability distribution over trees defined by model M1. Then:

    q1(x) = β1 · β3 · β3 = 1/2 · 2/3 · 2/3 = 2/9

In parsing, we use the probability distribution q1(x) defined by model M1 to disambiguate: the grammar assigns some set of trees {x1, ..., xn} to a sentence, and we choose the tree xi that has greatest probability q1(xi). The issue of efficiently computing the most-probable parse for a given sentence has been thoroughly addressed in the literature. The standard parsing techniques can be readily adapted to the random-field models to be discussed below, so I simply refer the reader to the literature. Instead, I concentrate on parameter estimation, which for attribute-value grammars cannot be accomplished by standard techniques.

By parameter estimation we mean determining values for the weights β. In order for a stochastic grammar to be useful, we must be able to compute the correct weights, where by correct weights we mean the weights that best account for a training corpus.
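The product-of-rule-weights computation is trivial to code; the following sketch (mine, not the paper's) reproduces q1(x) = 2/9 for tree (2):

    from math import prod

    # Weights of model M1, keyed by rule number as in display (1).
    beta = {1: 1/2, 2: 1/2, 3: 2/3, 4: 1/3, 5: 1/2, 6: 1/2}

    def tree_probability(rules_used):
        # Probability of a tree under an SCFG: the product of the
        # weights of the rules used in its derivation.
        return prod(beta[r] for r in rules_used)

    print(tree_probability([1, 3, 3]))   # tree (2): 0.222... = 2/9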


The degree to which a given set of weights accounts for a training corpus is measured by the similarity between the distribution q(x) determined by the weights and the distribution of trees x in the training corpus.

2.1 The Goodness of a Model

The distribution determined by the training corpus is known as the empirical distribution. For example, suppose we have a training corpus containing twelve trees of the following four types from L(G1):

    tree                        count c(x)    p̃(x)
    x1 = [S [A a] [A a]]        4             4/12
    x2 = [S [A b] [A b]]        2             2/12
    x3 = [S [B a a]]            3             3/12        (3)
    x4 = [S [B b b]]            3             3/12
                                N = 12

where c(x) is the count of how often the tree (type) x appears in the corpus, and p̃(·) is the empirical distribution, defined as:

    p̃(x) = c(x)/N        N = Σ_x c(x)

In comparing a distribution q to the empirical distribution p̃, we shall actually measure dissimilarity rather than similarity. Our measure for the dissimilarity of distributions is the Kullback-Leibler (KL) divergence, defined as:

    D(p̃ || q) = Σ_x p̃(x) ln [p̃(x)/q(x)]

The divergence between p̃ and q at point x is the log of the ratio of p̃(x) to q(x). The overall divergence between p̃ and q is the average divergence, where the averaging is over tree tokens in the corpus; i.e., point divergences ln [p̃(x)/q(x)] are weighted by p̃(x) and summed.


For example, let q1 be, as before, the distribution determined by model M1. The following table shows q1, p̃, the ratio q1(x)/p̃(x), and the weighted point divergence p̃(x) ln(p̃(x)/q1(x)). The sum of the fourth column is the KL divergence D(p̃ || q1) between p̃ and q1. The third column contains q1(x)/p̃(x) rather than p̃(x)/q1(x) so that one can see at a glance whether q1(x) is too large (> 1) or too small (< 1).

          q1      p̃      q1/p̃    p̃ ln(p̃/q1)
    x1    2/9     1/3    0.67    0.14
    x2    1/18    1/6    0.33    0.18
    x3    1/4     1/4    1.00    0.00        (4)
    x4    1/4     1/4    1.00    0.00
                                 0.32

The total divergence D(p̃ || q1) = 0.32.
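The divergence computation is easy to check numerically; these few lines (my own sketch, not from the paper) reproduce the 0.32 figure:

    from math import log

    def kl_divergence(p, q):
        # D(p || q); terms with p(x) = 0 contribute nothing.
        return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

    p_emp = {"x1": 1/3, "x2": 1/6, "x3": 1/4, "x4": 1/4}
    q1 = {"x1": 2/9, "x2": 1/18, "x3": 1/4, "x4": 1/4}
    print(round(kl_divergence(p_emp, q1), 2))   # 0.32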

One set of weights is better than another if its divergence from the empirical distribution is less. For example, let us consider a different set of weights for grammar G1. Let M′ be G1 with weights (1/2, 1/2, 1/2, 1/2, 1/2, 1/2), and let q′ be the probability distribution determined by M′. Then the computation of the KL divergence is as follows:

          q′      p̃      q′/p̃    p̃ ln(p̃/q′)
    x1    1/8     1/3    0.38    0.33
    x2    1/8     1/6    0.75    0.05
    x3    1/4     1/4    1.00    0.00
    x4    1/4     1/4    1.00    0.00
                                 0.38

The fit for x2 improves, but that is more than offset by a poorer fit for x1. The distribution q1 is a better distribution than q′, in the sense that q1 is more similar (less dissimilar) to the empirical distribution than q′ is.

One reason for adopting minimal KL divergence as a measure of goodness is that minimizing KL divergence maximizes likelihood. The likelihood of distribution q is the probability of the training corpus according to q:

    L(q) = ∏_{x in training} q(x) = ∏_x q(x)^c(x)

Since log is monotone increasing, maximizing likelihood is equivalent to maximizing log likelihood:


    ln L(q) = Σ_x c(x) ln q(x) = N Σ_x p̃(x) ln q(x)

The expression on the right-hand side is -N times the cross entropy of q with respect to p̃; hence maximizing log likelihood is equivalent to minimizing cross entropy. Finally, D(p̃ || q) is equal to the cross entropy of q less the entropy of p̃, and the entropy of p̃ is constant with respect to q; hence minimizing cross entropy (maximizing likelihood) is equivalent to minimizing divergence.

2.2 The ERF Method

For stochastic context-free grammars, it can be shown that the ERF method yields the best model for a given training corpus. First, let us introduce some terminology and notation. With each rule i in a stochastic context-free grammar is associated a weight βi and a function fi(x) that returns the number of times rule i is used in the derivation of tree x. For example, consider tree (2), repeated here for convenience:

    [S:β1 [A:β3 a] [A:β3 a]]

Rule 1 is used once and rule 3 is used twice; accordingly f1(x) = 1, f3(x) = 2, and fi(x) = 0 for i ∈ {2, 4, 5, 6}.

We use the notation p[f] to represent the expectation of f under probability distribution p; that is, p[f] = Σ_x p(x)f(x). The ERF method instructs us to choose the weight βi for rule i proportional to its empirical expectation p̃[fi]. Algorithmically, we compute the expectation of each rule's frequency, and normalize among rules with the same lefthand side.


To illustrate, let us consider corpus (3) again. The expectation of each rule frequency fi is a sum of terms p̃(x)fi(x). These terms are shown for each tree in the following table.

                              p̃      p̃f1    p̃f2    p̃f3    p̃f4    p̃f5    p̃f6
    x1 = [S [A a] [A a]]      1/3    1/3           2/3
    x2 = [S [A b] [A b]]      1/6    1/6                  2/6
    x3 = [S [B a a]]          1/4           1/4                  1/4
    x4 = [S [B b b]]          1/4           1/4                         1/4
    p̃[f] =                           1/2    1/2    2/3    1/3    1/4    1/4
    β =                              1/2    1/2    2/3    1/3    1/2    1/2

For example, in tree x1, rule 1 is used once and rule 3 is used twice. The empirical probability of x1 is 1/3, so x1's contribution to p̃[f1] is 1/3 · 1, and its contribution to p̃[f3] is 1/3 · 2. The weight βi is obtained from p̃[fi] by normalizing among rules with the same lefthand side. For example, the expected rule frequencies p̃[f1] and p̃[f2] of rules with lefthand side S already sum to 1, so they are adopted without change as β1 and β2. On the other hand, the expected rule frequencies p̃[f5] and p̃[f6] for rules with lefthand side B sum to 1/2, not 1, so they are doubled to yield weights β5 and β6. It should be observed that the resulting weights are precisely the weights of model M1.

It can be proven that the ERF weights are the best weights for a given context-free grammar, in the sense that they define the distribution that is most similar to the empirical distribution. That is, if β are the ERF weights (for a given grammar), defining distribution q, and β′ defining q′ is any set of weights such that q ≠ q′, then D(p̃ || q) < D(p̃ || q′).

One might expect the best weights to yield D(p̃ || q) = 0, but such is not the case. We have just seen, for example, that the best weights for grammar G1 yield distribution q1, yet D(p̃ || q1) = 0.32 > 0. A closer inspection of the divergence calculation (4) reveals that q1 is sometimes less than p̃, but never greater than p̃. Could we improve the fit by increasing q1? For that matter, how can it be that q1 is never greater than p̃? As


probability distributions, q1 and p̃ should have the same total mass, namely, one. Where is the missing mass for q1? The answer is of course that q1 and p̃ are probability distributions over L(G), but not all of L(G) appears in the corpus. Two trees are missing, and they account for the missing mass. These two trees are:

    [S [A a] [A b]]        [S [A b] [A a]]        (5)

Each of these trees has probability 0 according to p̃ (hence they can be ignored in the divergence calculation), but probability 1/9 according to q1.

Intuitively, the problem is this. The distribution q1 assigns too little weight to trees x1 and x2, and too much weight to the "missing" trees (5); call them x5 and x6. Yet exactly the same rules are used in x5 and x6 as are used in x1 and x2. Hence there is no way to increase the weight for trees x1 and x2, improving their fit to p̃, without simultaneously increasing the weight for x5 and x6, making their fit to p̃ worse. The distribution q1 is the best compromise possible.

To say it another way, our assumption that the corpus was generated by a context-free grammar means that any context dependencies in the corpus must be accidental, the result of sampling noise. There is indeed a dependency in corpus (3): in the trees where there are two A's, the A's always rewrite the same way. If corpus (3) was generated by a stochastic context-free grammar, then this dependency is accidental. This does not mean that the context-free assumption is wrong. If we generate twelve trees at random from q1, it would not be too surprising if we got corpus (3). More extremely, if we generate a random corpus of size 1 from q1, it is quite impossible for the resulting empirical distribution to match the distribution q1. But as the corpus size


increases, the fit between p̃ and q1 becomes ever better.

3. Attribute-Value Grammars

But what if the dependency in corpus (3) is not accidental? What if we wish to adopt a grammar that imposes the constraint that both A's rewrite the same way? We can impose such a constraint by means of an attribute-value grammar.

We may formalize an attribute-value grammar as a context-free grammar with attribute labels and path equations. An example is the following grammar; let us call it G2:

    1. S → 1:A 2:A    ⟨1 1⟩ = ⟨2 1⟩
    2. S → 1:B
    3. A → 1:a                            (G2)
    4. A → 1:b
    5. B → 1:a
    6. B → 1:b

The following illustrates how a dag is generated from G2:

    [Figure: the derivation in four stages, (a)-(d), as described in the text below.]

We begin in (a) with a single node labelled with the start category of G2, namely, S. A node x is expanded by choosing a rule that rewrites the category of x. In this case, we choose rule 1 to expand the root node. Rule 1 instructs us to create two children, both labelled A. The edge to the first child is labelled "1" and the edge to the second child is labelled "2". The constraint ⟨1 1⟩ = ⟨2 1⟩ indicates that the "1" child of the "1" child of x is identical to the "1" child of the "2" child of x. We create an unlabelled node to represent this grandchild of x and direct appropriately labelled edges from the children, yielding (b).

We proceed to expand the newly introduced nodes. We choose rule 3 to expand the


first "A" node. In this case, a child with edge labelled "1" already exists, so we use it rather than creating a new one. Rule 3 instructs us to label this child "a", yielding (c). Now we expand the second "A" node. Again we choose rule 3. We are instructed to label the "1" child "a", but it already has that label, so we do not need to do anything. Finally, in (d), the only remaining node is the bottommost node, labelled "a". Since its label is a terminal category, it does not need to be expanded, and we are done.

Let us back up to (c) again. Here we were free to choose rule 4 instead of rule 3 to expand the righthand "A" node. Rule 4 instructs us to label the "1" child "b", but we cannot, inasmuch as it is already labelled "a". The derivation fails, and no dag is generated.

The language L(G2) is the set of dags produced by successful derivations, namely:

    x1 = [S [A a] [A a]]    x2 = [S [A b] [A b]]    x3 = [S [B a]]    x4 = [S [B b]]        (6)

(In x1 the two A's share the single node labelled a, and in x2 they share the node labelled b.)

(The edges of the dags should actually be labelled with 1's and 2's, but I have suppressed the edge labels for the sake of perspicuity.)

3.1 AV Grammars and the ERF Method

Now we face the question of how to attach probabilities to grammar G2. The natural extension of the method we used for context-free grammars is the following. Associate a weight with each of the six rules of grammar G2. For example, let M2 be the model consisting of G2 plus weights (β1, ..., β6) = (1/2, 1/2, 2/3, 1/3, 1/2, 1/2). Let φ2(x) be the weight that M2 assigns to dag x; it is defined to be the product of the weights of the rules used to generate x. For example, the weight φ2(x1) assigned to tree x1 of (6) is 2/9, computed as follows:

    x1 = [S:β1 [A:β3 a] [A:β3 a]]

Rule 1 is used once and rule 3 is used twice; hence φ2(x1) = β1 β3 β3 = 1/2 · 2/3 · 2/3 = 2/9. Observe that φ2(x1) = β1 β3², which is to say, β1^f1(x1) β3^f3(x1). Moreover, since β⁰ = 1, it does not hurt to include additional factors βi^fi(x1) for those i where fi(x1) = 0. That is, we can define the dag weight φ corresponding to rule weights β = (β1, ..., βn) generally as:

    φ(x) = ∏_{i=1}^{n} βi^fi(x)

The next question is how to estimate weights. Let us consider what happens when we use the ERF method. Let us assume a corpus distribution for the dags (6) analogous to the distribution in (3):

          x1     x2     x3     x4
    p̃ =   1/3    1/6    1/4    1/4        (7)

Using the ERF method, we estimate rule weights as follows:

          p̃      p̃f1    p̃f2    p̃f3    p̃f4    p̃f5    p̃f6
    x1    1/3    1/3           2/3
    x2    1/6    1/6                  2/6
    x3    1/4           1/4                  1/4                        (8)
    x4    1/4           1/4                         1/4
    p̃[f] =       1/2    1/2    2/3    1/3    1/4    1/4
    β =          1/2    1/2    2/3    1/3    1/2    1/2

This table is identical to the one given earlier in the context-free case. We arrive at the same weights M2 we considered above, defining dag weights φ2(x).

3.2 Why the ERF Method Fails

But at this point a problem arises: φ2 is not a probability distribution. Unlike in the context-free case, the four dags in (6) constitute the entirety of L(G). This time, there


are no missing dags to account for the missing probability mass. There is an obvious "fix" for this problem: we can simply normalize φ2. We might define the distribution q for an AV grammar with weight function φ as:

    q(x) = (1/Z) φ(x)

where Z is the normalizing constant:

    Z = Σ_{x ∈ L(G)} φ(x)

In particular, for φ2, we have Z = 2/9 + 1/18 + 1/4 + 1/4 = 7/9. Dividing φ2 by 7/9 yields the ERF distribution:

              x1     x2      x3      x4
    q2(x) =   2/7    1/14    9/28    9/28

On the face of it, then, we can transplant the methods we used in the context-free case to the AV case and nothing goes wrong. The only problem that arises (φ not summing to one) has an obvious fix (normalization). However, something has actually gone very wrong. The ERF method yields the best weights only under certain conditions that we inadvertently violated by changing L(G) and re-apportioning probability via normalization.

In point of fact, we can easily see that the ERF weights (8) are not the best weights for our example grammar. Consider the alternative model M given in (9), defining probability distribution q:

    S → A A    (3 + 2√2)/(6 + 2√2)
    S → B      3/(6 + 2√2)
    A → a      √2/(1 + √2)                    (9)
    A → b      1/(1 + √2)
    B → a      1/2
    B → b      1/2

These weights are proper, in the sense that weights for rules with the same lefthand side sum to one. The reader can verify that φ sums to Z = 3/(3 + √2) and that q is:

             x1     x2     x3     x4
    q(x) =   1/3    1/6    1/4    1/4


That is, q = p̃. Comparing q2 (the ERF distribution) and q to p̃, we observe that D(p̃ || q2) = 0.07 but D(p̃ || q) = 0. In short, in the AV case, the ERF weights do not yield the best weights. This means that the ERF method does not converge to the correct weights as the corpus size increases. If there are genuine dependencies in the grammar, the ERF method converges systematically to the wrong weights. Fortunately, there are methods that do converge to the right weights. These are methods that have been developed for random fields.
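These figures are easy to verify by brute force over the four dags; the sketch below (mine, not the paper's) normalizes the product-of-weights scores for both weight settings and compares divergences:

    from math import log, sqrt, prod

    p_emp = [1/3, 1/6, 1/4, 1/4]
    # Rule-frequency vectors (f1, ..., f6) for dags x1..x4 of G2.
    f = [(1, 0, 2, 0, 0, 0), (1, 0, 0, 2, 0, 0),
         (0, 1, 0, 0, 1, 0), (0, 1, 0, 0, 0, 1)]

    def field_dist(beta):
        # Normalize the dag weights phi(x) into a distribution.
        phi = [prod(b**k for b, k in zip(beta, fx)) for fx in f]
        Z = sum(phi)
        return [w / Z for w in phi]

    def kl(p, q):
        return sum(px * log(px / qx) for px, qx in zip(p, q) if px > 0)

    erf = (1/2, 1/2, 2/3, 1/3, 1/2, 1/2)             # the ERF weights (8)
    r2 = sqrt(2)
    alt = ((3 + 2*r2) / (6 + 2*r2), 3 / (6 + 2*r2),  # the weights (9)
           r2 / (1 + r2), 1 / (1 + r2), 1/2, 1/2)

    print(round(kl(p_emp, field_dist(erf)), 2))      # 0.07
    print(round(kl(p_emp, field_dist(alt)), 2))      # 0.0 (up to rounding)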

4. Random Fields

A random field defines a probability distribution over a set of labelled graphs called configurations. In our case, the configurations are the dags generated by the grammar, i.e., Ω = L(G). The weight assigned to a configuration is the product of the weights assigned to selected features of the configuration. We use the notation:

    φ(x) = ∏_i βi^fi(x)

where βi is the weight for feature i and fi(·) is its frequency function; that is, fi(x) is the number of times that feature i occurs in configuration x. (For most purposes, a feature can be identified with its frequency function; I will not always make a careful distinction between them.) I use the term feature here as it is used in the machine learning and statistical pattern recognition literature, not as in the constraint grammar literature, where feature is synonymous with attribute. In my usage, dag edges are labelled with attributes, not features. Features are rather like geographic features of dags: a feature is some larger or smaller piece of structure that occurs, possibly at more than one place, in a dag.

The probability of a configuration (that is, a dag) is proportional to its weight, and is obtained by normalizing the weight distribution:


    q(x) = (1/Z) φ(x)        Z = Σ_{x ∈ Ω} φ(x)

If we identify the features of a configuration with local trees (equivalently, with applications of rewrite rules) the random field model is almost identical to the model we considered in the previous section. There are two important differences. First, we no longer require weights to sum to one for rules with the same lefthand side. Second, the model does not require features to be identified with rewrite rules. We use the grammar to define the set of configurations Ω = L(G), but in defining a probability distribution over L(G), we can choose features of dags however we wish.

Let us consider an example. Let us continue to assume grammar G2 generating language (6), and let us continue to assume the empirical distribution (7). But now, rather than taking rule applications to be features, let us adopt the following two features:

    1. an A node whose "1" child is labelled a:  [A a]
    2. a node labelled B:  [B]

For purposes of illustration, take feature 1 to have weight β1 = √2 and feature 2 to have weight β2 = 3/2. The functions f1 and f2 represent the frequencies of features 1 and 2, respectively:

           x1     x2     x3     x4
    f1 =   2      0      0      0
    f2 =   0      0      1      1
    φ =    2      1      3/2    3/2        Z = 6
    q =    1/3    1/6    1/4    1/4

We are able to exactly recreate the empirical distribution using fewer features than before. Intuitively, we need only use as many features as are necessary to distinguish among trees that have different empirical probabilities.
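In code, the field computation for this example is just a few lines (my own sketch; the feature frequencies are read off the table rather than matched against dag structure):

    from math import sqrt

    freq = {"x1": (2, 0), "x2": (0, 0), "x3": (0, 1), "x4": (0, 1)}
    beta = (sqrt(2), 3/2)     # weights of features 1 and 2

    phi = {x: beta[0]**f1 * beta[1]**f2 for x, (f1, f2) in freq.items()}
    Z = sum(phi.values())     # 2 + 1 + 3/2 + 3/2 = 6
    q = {x: w / Z for x, w in phi.items()}
    print(q)                  # x1: 1/3, x2: 1/6, x3: 1/4, x4: 1/4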


This added flexibility is welcome, but it does make parameter estimation more involved. Now we must not only choose values for weights, we must also choose the features that weights are to be associated with. We would like to do both in a way that permits us to find the best model, in the sense of the model that minimizes the Kullback-Leibler distance with respect to the empirical distribution. The IIS algorithm (Della Pietra, Della Pietra, and Lafferty, 1995) provides a method to do precisely that.

5. Field Induction

In outline, the IIS algorithm is as follows:

    1. Start (t = 0) with the null field, containing no features.
    2. Feature Selection. Consider every feature that might be added to field Mt and choose the best one.
    3. Weight Adjustment. Readjust weights for all features. The result is field Mt+1.
    4. Iterate until the field cannot be improved.

For the sake of concreteness, let us take features to be labelled subdags. In step 2 of the algorithm we do not consider every conceivable labelled subdag, but only the atomic (i.e., single-node) subdags and those complex subdags that can be constructed by combining features already in the field or by combining a feature in the field with some atomic feature. We also limit our attention to features that actually occur in the training corpus. In our running example, the atomic features are:

    S    A    B    a    b

Features can be combined by adding connecting arcs. For example:

    [A] + [a] = [A a]
    [S] + [A] = [S A]
    [S A] + [A] = [S A A]
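The outer loop of the algorithm can be pictured as the following skeleton (my own paraphrase of the outline above; select_feature and adjust_weights stand in for the procedures developed in sections 5.2-5.4 and the appendices, and are passed in as functions):

    def induce_field(select_feature, adjust_weights, min_gain=1e-4):
        # Greedy IIS-style induction: repeatedly add the highest-scoring
        # feature, then retune the weights of all features.
        field = {}   # feature -> weight; the empty field is the null field
        while True:
            feat, weight, gain = select_feature(field)
            if feat is None or gain < min_gain:
                return field        # the field cannot be improved
            field[feat] = weight    # feature selection (step 2)
            adjust_weights(field)   # weight adjustment (step 3)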

5.1 The Null Field

Field induction begins with the null field. With the corpus we have been assuming, the null field takes the following form:

              x1     x2     x3     x4
    φ(x) =    1      1      1      1        Z = 4
    q(x) =    1/4    1/4    1/4    1/4

No dag x has any features, so φ(x) = ∏_i βi^fi(x) is a product of zero terms, and hence has value 1. As a result, q is the uniform distribution. The Kullback-Leibler divergence D(p̃ || q) is 0.03. The aim of feature selection is to choose a feature that reduces this divergence as much as possible.

The astute reader will note that there is a problem with the null field if L(G) is infinite. Namely, it is not possible to have a uniform probability mass distribution over an infinite set. If each dag in an infinite set of dags is assigned a constant nonzero probability ε, then the total probability is infinite, no matter how small ε is. There are a couple of ways of dealing with the problem. The approach that DD&L adopt is to assume a consistent prior distribution p(k) over graph sizes k, and a family of random fields qk representing the conditional probability q(x|k); the probability of a tree is then p(k)q(x|k). All the random fields have the same features and weights, differing only in their normalizing constants.

I will take a somewhat different approach here. As sketched at the beginning of section 3, we can generate dags from an AV grammar much as proposed by Brew and Eisele. If we ignore failed derivations, the process of dag generation is completely analogous to the process of tree generation from a stochastic CFG; indeed, in the limiting case in which none of the rules contain constraints, the grammar is a CFG. To obtain an initial


distribution, we associate a weight with each rule, the weights for rules with a common lefthand side summing to one. The probability of a dag is proportional to the product of weights of rules used to generate it. (Renormalization is necessary because of the failed derivations.) We estimate weights using the ERF method: we estimate the weight of a rule as the relative frequency of the rule in the training corpus, among rules with the same lefthand side. The resulting initial distribution (the ERF distribution) is not the maximum likelihood distribution, as we know. But it can be taken as a useful first approximation. Intuitively, we begin with the ERF distribution and construct a random field to take account of context-dependencies that the ERF distribution fails to capture, incrementally improving the fit to the empirical distribution.

In this framework, a model consists of: (1) An AV grammar G whose purpose is to define a set of dags L(G). (2) A set of initial weights attached to the rules of G. The weight of a dag is the product of weights of rules used in generating it. Discarding failed derivations and renormalizing yields the initial distribution p0(x). (3) A set of features f1, ..., fn with weights β1, ..., βn, to define the field distribution:

    q(x) = (1/Z) p0(x) ∏_i βi^fi(x)

5.2 Feature Selection

At each iteration, we select a new feature f by considering all atomic features, and all complex features that can be constructed from features already in the field. Holding the weights constant for all old features in the field, we choose the best weight β for f (how β is chosen will be discussed shortly), yielding a new distribution q_{β,f}. The score for feature f is the reduction it permits in D(p̃ || q_old), where q_old is the old field. That is, the score for f is D(p̃ || q_old) - D(p̃ || q_{β,f}). We compute the score for each candidate feature and add to the field that feature with the highest score.

To illustrate, consider the two atomic features 'a' and 'B'. Given the null field as old


field, the best weight for 'a' is β = 7/5, and the best weight for 'B' is β = 1. This yields q and D(p̃ || q) for each feature as follows:

                     x1      x2      x3      x4
    p̃                1/3     1/6     1/4     1/4
    φ_a              7/5     1       7/5     1        Z = 24/5
    q_a              7/24    5/24    7/24    5/24
    p̃ ln(p̃/q_a)      0.04    -0.04   -0.04   0.05     D = 0.01
    φ_B              1       1       1       1        Z = 4
    q_B              1/4     1/4     1/4     1/4
    p̃ ln(p̃/q_B)      0.10    -0.07   0.00    0.00     D = 0.03

The better feature is 'a', and 'a' would be added to the field if these were the only two choices. Intuitively, 'a' is better than 'B' because 'a' permits us to distinguish the set {x1, x3} from the set {x2, x4}; the empirical probability of the former is 1/3 + 1/4 = 7/12, whereas the empirical probability of the latter is 5/12. Distinguishing these sets permits us to model the empirical distribution better (since the old field assigns them equal probability, counter to the empirical distribution). By contrast, the feature 'B' distinguishes the set

{x1, x2} from {x3, x4}. The empirical probability of the former is 1/3 + 1/6 = 1/2, and the empirical probability of the latter is also 1/2. The old field models these probabilities exactly correctly, so making the distinction does not permit us to improve on the old field. As a result, the best weight we can choose for 'B' is 1, which is equivalent to not having the feature 'B' at all.
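Since L(G2) is small enough to enumerate, the two candidates' scores can be checked directly. The sketch below (mine; it finds the best weight by grid search rather than by the Newton iteration described next) reproduces the figures in the table:

    from math import log

    p_emp = [1/3, 1/6, 1/4, 1/4]
    f_a = [1, 0, 1, 0]   # occurrences of the label a in x1..x4
    f_B = [0, 0, 1, 1]   # occurrences of the label B in x1..x4

    def kl_after_adding(feature, beta):
        # D(p_emp || q) after adding one feature with weight beta
        # to the null field.
        phi = [beta**k for k in feature]
        Z = sum(phi)
        return sum(p * log(p * Z / w) for p, w in zip(p_emp, phi))

    def best_weight(feature):
        grid = [x / 100 for x in range(1, 500)]
        return min(grid, key=lambda b: kl_after_adding(feature, b))

    for name, feat in [("a", f_a), ("B", f_B)]:
        b = best_weight(feat)
        print(name, b, round(kl_after_adding(feat, b), 2))
    # a 1.4 0.01      (beta = 7/5)
    # B 1.0 0.03      (no improvement over the null field)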

5.3 Selecting the Initial Weight

DD&L show that there is a unique weight β̂ that maximizes the score for a new feature f (provided that the score for f is not constant for all weights). Writing q_β for the distribution that results from assigning weight β to feature f, β̂ is the solution to the equation


    q_β[f] = p̃[f]        (10)

Intuitively, we choose the weight such that the expectation of f under the resulting new field is equal to its empirical expectation.

Solving equation (10) for β is easy if L(G) is small enough to enumerate. Then the sum over L(G) that is implicit in q_β[f] can be expanded out, and solving for β is simply a matter of arithmetic. Things are a bit trickier if L(G) is too large to enumerate. DD&L show that we can solve equation (10) if we can estimate q_old[f = k] for k from 0 to the maximum value of f in the training corpus. (See Appendix A for details.) We can estimate q_old[f = k] by means of random sampling. The idea is actually rather simple: to estimate how often the feature appears in "the average dag", we generate a representative mini-corpus from the distribution q_old and count. That is, we generate dags at random in such a way that the relative frequency of dag x is q_old(x) (in the limit), and we count how often the feature of interest appears in the dags of our generated mini-corpus.

The application that DD&L consider is the induction of English orthographic constraints, that is, inducing a field that assigns high probability to "English-sounding" words and low probability to non-English-sounding words. For this application, Gibbs sampling is appropriate. Gibbs sampling does not work for the application to AV grammars, however. Fortunately, there is an alternative random sampling method we can use: Metropolis-Hastings sampling. We will discuss the issue in some detail shortly.
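Appendix A spells out the Newton iteration for equation (10) in terms of sampled counts. As a compact sketch (my own; it assumes the counts count[u] = the number of sampled dags with f = u have already been collected, and that f is not constant in the sample):

    from math import exp

    def solve_initial_log_weight(count, f_emp, iters=20):
        # Newton's method for equation (10) in terms of the log weight
        # alpha = ln(beta): find alpha such that q_alpha[f] = f_emp.
        alpha = 0.0   # start at the "null" weight beta = 1
        for _ in range(iters):
            s0 = sum(c * exp(alpha * u) for u, c in count.items())
            s1 = sum(c * u * exp(alpha * u) for u, c in count.items())
            s2 = sum(c * u * u * exp(alpha * u) for u, c in count.items())
            # the update derived in Appendix A
            alpha += (s0 * s0 * f_emp - s0 * s1) / (s0 * s2 - s1 * s1)
        return alpha

The weight itself is then β̂ = e^α̂, and the gain of the candidate can be estimated from the same quantities, as in Appendix A.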

5.4 Readjusting Weights

When a new feature is added to the field, the best value for its initial weight is chosen, but the weights for the old features are held constant. In general, however, adding the new feature may make it necessary to readjust weights for all features. The second half


of the IIS algorithm involves finding the best weights for a given set of features. The method is very similar to the method for selecting the initial weight for a new feature.

Let (β1, ..., βn) be the old weights for the features. We wish to compute "increments" (δ1, ..., δn) to determine a new field with weights (δ1 β1, ..., δn βn). Consider the equation:

    q_old[δi^f# fi] = p̃[fi]        (11)

where f#(x) = Σ_i fi(x) is the total number of features of dag x. The reason for the factor δi^f# is a bit involved. Very roughly, we would like to choose weights so that the expectation of fi under the new field is equal to p̃[fi]. Now q_new(x) is:

    q_new(x) = (1/Z) p0(x) ∏_j (δj βj)^fj(x) = (1/Z_δ) q_old(x) ∏_j δj^fj(x)

where we factor Z as Z_δ Z_β, for Z_β the normalization constant in q_old. Hence:

    q_new[fi] = q_old[(1/Z_δ) fi ∏_j δj^fj]

Now there are two problems with this expression: it requires us to compute Z_δ, which we are not able to do, and it requires us to determine the weights δj for all the features simultaneously, not just the weight δi for feature i. We might consider approximating q_new[fi] by ignoring the normalization factor and assuming that all features have the same weight as feature i. Since ∏_j δi^fj(x) = δi^f#(x), we arrive at the expression on the lefthand side of equation (11).

One might expect the approximation just described to be rather poor, but it is proven in (Della Pietra, Della Pietra, and Lafferty, 1995) that solving equation (11) for δi (for each i) and setting the new weight for feature i to δi βi is guaranteed to improve the model. This is the real justification for equation (11), and the reader is referred to (Della Pietra, Della Pietra, and Lafferty, 1995) for details.

Solving (11) yields improved weights, but it does not necessarily immediately yield the globally best weights. We can obtain the globally best weights by iterating. Set

βi ← δi βi, for all i, and solve equation (11) again. Repeat until the weights no longer

change. As with equation (10), solving equation (11) is straightforward if L(G) is small enough to enumerate, but not if L(G) is large. In that case, we must use random sampling. We generate a representative mini-corpus and estimate expectations by counting in the mini-corpus. (See Appendix B.)

5.5 Random Sampling

We have seen that random sampling is necessary both to set the initial weight for features under consideration and to adjust all weights after a new feature is adopted. Random sampling involves creating a corpus that is representative of a given model distribution q(x). To take a very simple example, a fair coin can be seen as a method for sampling from the distribution q in which q(H) = 1/2, q(T) = 1/2. Saying that a corpus is representative is actually not a comment about the corpus itself but about the method by which it was generated: a corpus representative of distribution q is one generated by a process that samples from q. Saying that a process M samples from q is to say that the empirical distributions of corpora generated by M converge to q in the limit. For example, if we flip a fair coin once, the resulting empirical distribution over (H, T) is either (1, 0) or (0, 1), not the fair-coin distribution (1/2, 1/2). But as we take larger and larger corpora, the resulting empirical distributions converge to (1/2, 1/2).

An advantage of SCFGs that random fields lack is the transparent relationship between an SCFG defining a distribution q and a sampler for q. We can sample from q by performing stochastic derivations: each time we have a choice among rules expanding a category X, we choose rule X → φi with probability βi, where βi is the weight of rule X → φi.
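For the context-free case, the stochastic derivation is a few lines of code; this sketch (mine, with an assumed grammar representation) samples trees from M1:

    import random

    # Grammar M1: category -> list of (righthand side, weight) pairs.
    M1 = {
        "S": [(("A", "A"), 1/2), (("B",), 1/2)],
        "A": [(("a",), 2/3), (("b",), 1/3)],
        "B": [(("a", "a"), 1/2), (("b", "b"), 1/2)],
    }

    def derive(grammar, cat):
        # Expand each nonterminal by choosing a rule with probability
        # equal to its weight; terminals are returned as-is.
        if cat not in grammar:
            return cat
        rhss, weights = zip(*grammar[cat])
        rhs = random.choices(rhss, weights=weights)[0]
        return (cat,) + tuple(derive(grammar, child) for child in rhs)

    print(derive(M1, "S"))   # e.g. ('S', ('A', 'a'), ('A', 'a'))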


Now we can sample from the initial distribution p0 by performing stochastic derivations. At the beginning of section 3, we sketched how to generate dags from an AV grammar G via nondeterministic derivations, and we defined the initial distribution in terms of weights attached to the rules of G. We can convert those nondeterministic derivations into stochastic derivations by choosing rule X → φi with probability βi when expanding a node labelled X. Some derivations fail, but throwing away failed derivations has the effect of renormalizing the weight function, so that we generate a dag x with probability p0(x), as desired.

The Metropolis-Hastings algorithm provides us with a means of converting the sampler for the initial distribution p0(x) into a sampler for the field distribution q(x). Generally, let p(·) be a distribution for which we have a sampler. We wish to construct a sample x1, ..., xN from a different distribution q(·). Assume that items x1, ..., xn are already in the sample, and we wish to choose xn+1. The sampler for p(·) proposes a new item y. We do not simply add y to the sample (that would give us a sample from p(·)); rather, we make a stochastic decision whether to accept the proposal y or reject it. If we accept y, it is added to the sample (xn+1 = y), and if we reject y, then xn is repeated (xn+1 = xn).

The acceptance decision is made as follows. If p(y) > q(y), then y is overrepresented among the proposals. We can quantify the degree of overrepresentation as p(y)/q(y). The idea is to reject y with a probability corresponding to its degree of overrepresentation. However, we do not consider the absolute degree of overrepresentation, but rather the degree of overrepresentation relative to xn. (If y and xn are equally overrepresented, there is no reason to reject y in favor of xn.) That is, we consider the value

    r = [p(y)/q(y)] / [p(xn)/q(xn)] = p(y)q(xn) / [p(xn)q(y)]

If r ≤ 1, then y is underrepresented relative to xn, and we accept y with probability one. If r > 1, then we accept y with a probability that diminishes as r increases: specifically,


with probability 1/r. In brief, the acceptance probability of y is A(y|xn) = min(1, 1/r). It can be shown that proposing items with probability p(·) and accepting them with probability A(·|xn) yields a sampler for q(·) (see, e.g., (Winkler, 1995)).²

The acceptance probability A(y|xn) reduces in our case to a particularly simple form. If r ≤ 1 then A(y|xn) = 1. Otherwise, writing φ(x) for the "field weight" ∏_i βi^fi(x), we have:

    A(y|xn) = [Z^-1 φ(y) p0(y) · p0(xn)] / [Z^-1 φ(xn) p0(xn) · p0(y)] = φ(y)/φ(xn)        (12)
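A sketch of the resulting sampler (my own rendering; propose() is an assumed function implementing the stochastic derivations just described, and field_weight() computes φ):

    import random

    def metropolis_hastings_sample(propose, field_weight, n_samples):
        # Independence-chain Metropolis-Hastings: proposals are drawn
        # from the initial distribution p0, and a proposal y is accepted
        # with probability min(1, phi(y)/phi(x)), per equation (12).
        x = propose()
        sample = []
        for _ in range(n_samples):
            y = propose()
            # accept with probability min(1, phi(y)/phi(x))
            if random.random() * field_weight(x) <= field_weight(y):
                x = y
            sample.append(x)    # on rejection, x is repeated
        return sample

Note that the p0 factors cancel in the acceptance ratio because the proposals are drawn from p0 itself; only the field weights need to be computed.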

² The Metropolis-Hastings acceptance probability is usually given in the form

    A(y|x) = min(1, [π(y) g(y,x)] / [π(x) g(x,y)])

in which π is the distribution we wish to sample from (q, in our notation) and g(x,y) is the proposal probability: the probability that the input sampler will propose y if the previous configuration was x. The case we consider is a special case in which the proposal probability is independent of x: the proposal probability g(x,y) is, in our notation, p(y). The original Metropolis algorithm is also a special case of the Metropolis-Hastings algorithm, in which the proposal probability is symmetric, that is, g(x,y) = g(y,x). The acceptance function then reduces to min(1, π(y)/π(x)), which is min(1, q(y)/q(x)) in our notation. I mention this only to point out that it is a different special case. Our proposal probability is not symmetric, but rather independent of the previous configuration, and though our acceptance function reduces to a form (12) that is similar to the original Metropolis acceptance function, it is not the same: in general, φ(y)/φ(x) ≠ q(y)/q(x).

6. Final Remarks

In summary, we cannot simply transplant CF methods to the AV grammar case. In particular, the ERF method yields correct weights only for SCFGs, not for AV grammars. We can define a probabilistic version of AV grammars with a correct weight-selection method by going to random fields. Feature selection and weight adjustment can be accomplished using the IIS algorithm. In feature selection, we need to use random sampling to find the initial weight for a candidate feature, and in weight adjustment we need to use random sampling to solve the weight equation. The random sampling method that DD&L used is not appropriate for sets of dags, but we can solve that problem by using the Metropolis-Hastings method instead.


Open questions remain. First, random sampling is notorious for being slow, and it remains to be shown whether the approach proposed here will be practicable. I expect practicability to be quite sensitive to the choice of grammar: the more the grammar's distribution diverges from the initial context-free approximation, the more features will be necessary to "correct" it, and the more random sampling will be called on. A second issue is incomplete data. The approach described here assumes complete data (a parsed training corpus). Fortunately, an extension of the method to handle incomplete data (unparsed training corpora) is described in (Riezler, 1997), and I refer readers to that paper.

As a closing note, it should be pointed out explicitly that the random field techniques described here can be profitably applied to context-free grammars as well. As Stanley Peters nicely put it, there is a distinction between possibilistic and probabilistic context-sensitivity. Even if the language described by the grammar of interest (that is, the set of possible trees) is context-free, there may well be context-sensitive statistical dependencies. Random fields can be readily applied to capture such statistical dependencies, whether or not L(G) is context-sensitive.

Acknowledgments

This work has greatly profited from the comments, criticism, and suggestions of a number of people, including Yoav Freund, John Lafferty, Stanley Peters, Hans Uszkoreit, and members of the audience at talks I gave at Saarbrücken and Tübingen. Michael Miller and Kevin Mark introduced me to random fields as a way of dealing with context-sensitivities in language, planting the idea that led (much later) to this paper. Finally, I would especially like to thank Marc Light and Stefan Riezler for extended discussions of the issues addressed here and helpful criticism of my first attempts to present this material. All responsibility for flaws and errors of course remains with me.

References

Brew, Chris. 1995. Stochastic HPSG. In Proceedings of EACL-95.

Della Pietra, Stephen, Vincent Della Pietra, and John Lafferty. 1995. Inducing features of random fields. Technical Report CMU-CS-95-144, CMU.

Eisele, Andreas. 1994. Towards probabilistic extensions of constraint-based grammars. Technical Report Deliverable R1.2.B, DYANA-2.

Gibbs, W. 1902. Elementary Principles of Statistical Mechanics. Yale University Press, New Haven, CT.

Mark, Kevin, Michael Miller, Ulf Grenander, and Steve Abney. 1992. Parameter estimation for constrained context-free language models. In Proceedings of the Fifth DARPA Workshop on Speech and Natural Language, San Mateo, CA. Morgan Kaufmann.

Riezler, Stefan. 1996. Quantitative constraint logic programming for weighted grammar applications. Talk given at LACL, September.

Riezler, Stefan. 1997. Probabilistic Constraint Logic Programming. Arbeitspapiere des Sonderforschungsbereichs 340, Bericht Nr. 117, Universität Tübingen.

Winkler, Gerhard. 1995. Image Analysis, Random Fields and Dynamic Monte Carlo Methods. Springer.

A. Initial Weight Estimation

In the feature selection step, we choose an initial weight for each candidate feature f so as to maximize the gain

    G = D(p̃ || q_old) - D(p̃ || q_{f,β})

of adding f to the field. It is actually more convenient to consider log weights α = ln β. For a given feature f, the log weight α̂ that maximizes gain is the solution to the equation:

    q_α[f] = p̃[f]

where q_α is the distribution that results from adding f to the field with log weight α. This equation can be solved using Newton's method. Define:

    F(α) = p̃[f] - q_α[f]        (13)

To find the value of α for which F(α) = 0, we begin at a convenient point α0 (the "null" weight α0 = 0 recommends itself) and iteratively compute:

    α_{t+1} = α_t - F(α_t)/F′(α_t)        (14)

(Della Pietra, Della Pietra, and Lafferty, 1995) show that F′(α_t) is equal to the negative of the variance of f under the new field, which I will write -V_α[f]. To compute the iteration (14) we need to be able to compute F(α_t) and F′(α_t). For F(α_t) we require p̃[f] and q_α[f], and F′(α_t) can be expressed as q_α[f]² - q_α[f²]. p̃[f] is simply the average value of f in the training corpus. The remaining terms are all of the form q_α[f^r]. We can re-express this expectation in terms of the old field q_old:

    q_α[f^r] = Σ_x f^r(x) q_α(x)
             = [Σ_x f^r(x) e^{αf(x)} q_old(x)] / [Σ_x e^{αf(x)} q_old(x)]
             = q_old[f^r e^{αf}] / q_old[e^{αf}]

The expectations q_old[f^r e^{αf}] can be obtained by generating a random sample (z1, ..., zN) of size N from q_old and computing the average value of f^r e^{αf}. That is, q_old[f^r e^{αf}] ≈ (1/N) s_r(α), where:

    s_r(α) = Σ_k f^r(z_k) e^{αf(z_k)} = Σ_u count_k[f(z_k) = u] u^r e^{αu}

This yields:

    q_α[f^r] = s_r(α) / s_0(α)

and the Newton iteration (14) reduces to:

    α_{t+1} = α_t + [s_0(α_t)² p̃[f] - s_0(α_t) s_1(α_t)] / [s_0(α_t) s_2(α_t) - s_1(α_t)²]

To compare candidates, we also need to know the gain D(p̃ || q_old) - D(p̃ || q_α̂) for each candidate. This can be expressed as follows (Della Pietra, Della Pietra, and Lafferty, 1995):

    G(f, α̂) = p̃[f] α̂ - ln q_old[e^{α̂f}] ≈ p̃[f] α̂ - ln s_0(α̂) + ln N

Putting everything together, the algorithm for feature selection has the following form. The array E[f] is assumed to have been initialized with the empirical expectations p̃[f].

    procedure SelectFeature ()
    begin
        Fill array C[f, u] = count_k[f(z_k) = u] by sampling from the old field
        Ĝ ← 0; g ← none
        for each f in candidates do
            α ← 0
            until α is accurate enough do
                s0 ← 0; s1 ← 0; s2 ← 0
                for u from 0 to u_max do
                    x ← C[f, u] e^{αu}
                    s0 ← s0 + x
                    s1 ← s1 + x·u
                    s2 ← s2 + x·u²
                end
                α ← α + (s0² E[f] - s0 s1) / (s0 s2 - s1²)
            end
            G ← α E[f] - ln s0 + ln N
            if G > Ĝ then Ĝ ← G; g ← f; α̂ ← α
        end
        return g, α̂, Ĝ
    end

B. Adjusting Field Weights

The procedure for adjusting field weights has much the same structure as the procedure for choosing initial weights. In terms of log weights, we wish to compute increments (δ1, ..., δn) such that the new field, with log weights (α1 + δ1, ..., αn + δn), has a lower divergence than the old field (α1, ..., αn). We choose each δi as the solution to the equation:

    p̃[fi] = q_old[fi e^{δi f#}]

Again, we use Newton's method. We wish to find δ such that F_i(δ) = 0, where:

    F_i(δ) = p̃[fi] - q_old[fi e^{δf#}]

As (Della Pietra, Della Pietra, and Lafferty, 1995) show, the first derivative is:

    F_i′(δ) = -q_old[fi f# e^{δf#}]

We see that the expectations we need to compute by sampling from q_old are of the form q_old[fi f#^r e^{δf#}]. We generate a random sample (z1, ..., zN) and define:

    s_r(i, δ) = Σ_k fi(z_k) f#(z_k)^r e^{δf#(z_k)}
              = Σ_m Σ_u count_k[fi(z_k) = u ∧ f#(z_k) = m] u m^r e^{δm}
              = Σ_m m^r e^{δm} Σ_{k | f#(z_k) = m} fi(z_k)

As we generate the sample we update the array C[i, m] = Σ_{k | f#(z_k) = m} fi(z_k). We estimate q_old[fi f#^r e^{δf#}] as the average value of fi f#^r e^{δf#} in the sample, namely, (1/N) s_r(i, δ). This permits us to compute F_i(δ) and F_i′(δ). The resulting Newton iteration is:

    δ_{t+1} = δ_t + [N p̃[fi] - s_0(i, δ_t)] / s_1(i, δ_t)

The estimation procedure is:

    procedure AdjustWeights (α1, ..., αn)
    begin
        until the field converges do
            Fill array C[i, m] by sampling from q_α
            for i from 1 to n do
                δ ← 0
                until δ is sufficiently accurate do
                    s0 ← 0; s1 ← 0
                    for m from 0 to m_max do
                        x ← C[i, m] e^{δm}
                        s0 ← s0 + x
                        s1 ← s1 + x·m
                    end
                    δ ← δ + (N E[fi] - s0) / s1
                end
                αi ← αi + δ
            end
        end
        return (α1, ..., αn)
    end