Adaptive Mixtures of Probabilistic Transducers

Yoram Singer
AT&T Laboratories
600 Mountain Ave., Room 2A-407
Murray Hill, NJ 07974
[email protected]

Abstract

We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an on-line learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best transducer from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.

1 Introduction

Supervised learning of probabilistic mappings between temporal sequences is an important goal of natural data analysis and classification with a broad range of applications, including handwriting and speech recognition, natural language processing and biological sequence analysis. Research efforts in supervised learning of probabilistic mappings have been almost exclusively focused on estimating the parameters of a predefined model. For example, Giles et al. (1992) used a second order recurrent neural network to induce a finite state automaton that classifies input sequences, and Bengio and Frasconi (1994) introduced an input-output HMM architecture that was used for similar tasks. In this paper we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers which we call suffix tree transducers. Mixture models, often referred to as mixtures of experts, have been shown to be a powerful approach both theoretically and experimentally. See (DeSantis et al., 1988; Jacobs et al., 1991; Haussler and Barron, 1993; Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1993; Helmbold and Schapire, 1995) for analyses and applications of mixture models from different perspectives such as connectionism, Bayesian inference and computational learning theory. We combine techniques used for compression and unsupervised learning (Willems et al., 1995; Ron et al., 1996) to devise an on-line algorithm that efficiently estimates the weights of all the possible models from an arbitrarily large (possibly infinite) pool of suffix tree transducers. Moreover, we employ the mixture paradigm for the estimation of the free parameters of each model in the pool and achieve an efficient estimate of these parameters. We present formal analysis, simulations and experiments with natural data which show that the learning algorithm indeed tracks the best model in an arbitrarily large pool of models, yielding an accurate approximation of the source.

2 Mixture of Suffix Tree Transducers

Let Σ_in and Σ_out be two finite alphabets. A Suffix Tree Transducer T over (Σ_in, Σ_out) is a |Σ_in|-ary tree where every internal node of T has one child for each symbol in Σ_in. The nodes of the tree are associated with pairs (s, γ_s), where s is the string associated with the path (sequence of symbols in Σ_in) that leads from the root to that node, and γ_s : Σ_out → [0, 1] is an output probability function. A suffix tree transducer maps arbitrarily long input sequences over Σ_in to output sequences over Σ_out as follows. The probability that a suffix tree transducer T will output a symbol y ∈ Σ_out at time step n, denoted by P(y | x_1 … x_n, T), is γ_{s^n}(y), where s^n is the string labeling the leaf reached by taking the path corresponding to x_n x_{n−1} x_{n−2} … starting at the root of T. We assume without loss of generality that the input sequence x_1 … x_n was padded with an arbitrarily long sequence … x_{−2} x_{−1} x_0, so that the leaf corresponding to s^n is well defined for any time step n ≥ 1. The probability that T will output a string y_1 … y_n in Σ_out^n given an input string x_1 … x_n in Σ_in^n, denoted by P(y_1 … y_n | x_1 … x_n, T), is therefore ∏_{k=1}^n γ_{s^k}(y_k). Note that only the leaves of T are used for prediction. A suffix tree transducer is thus a probabilistic mapping that induces a measure over the possible output strings given an input string. A sub-model T′ of a suffix tree transducer T is obtained by removing some of the internal nodes and leaves of T and using the leaves of the pruned tree T′ for prediction. An example of a suffix tree transducer and two of its possible sub-models is given in Figure 1.

[Figure 1 shows the full suffix tree transducer (nodes ε, 0, 1, 00, 10, 01, 11, 010, 110, each with its output probabilities for a, b and c) together with two sub-trees; the diagram itself is not reproduced here.]

Figure 1: A suffix tree transducer over (Σ_in, Σ_out) = ({0, 1}, {a, b, c}) and two of its possible sub-models (sub-trees). The strings labeling the nodes are the suffixes of the input string used to predict the output string. At each node there is an output probability function defined for each of the possible output symbols. For instance, using the full suffix tree transducer, the probability of observing the symbol b given that the input sequence is 010 is 0.1. The probability of the current output, when each transducer is associated with a weight (prior), is the weighted sum of the predictions of each transducer. For example, assume that the weights of the trees are 0.7 (full tree), 0.2 (large sub-tree), and 0.1 (small sub-tree); then the probability that the output y_n = a given that (x_{n−2}, x_{n−1}, x_n) = (0, 1, 0) is 0.7 P(a | 010, T_1) + 0.2 P(a | 10, T_2) + 0.1 P(a | 0, T_3) = 0.7 · 0.8 + 0.2 · 0.7 + 0.1 · 0.5 = 0.75.
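To make the mapping concrete, here is a minimal sketch in Python. The Node class, the function names and the numeric values are our own illustrative assumptions, not taken from the paper; only the backward traversal and the product of per-step probabilities follow the definition above.

```python
# Sketch of a suffix tree transducer (Section 2).  Node layout and numbers are
# hypothetical; the traversal and the product of probabilities are the point.

class Node:
    def __init__(self, gamma, children=None):
        self.gamma = gamma              # output probability function gamma_s
        self.children = children or {}  # one child per input symbol; empty at a leaf

def predict_symbol(root, x_prefix):
    """P(y | x_1..x_n, T): follow x_n, x_{n-1}, ... from the root down to a leaf.
    x_prefix is assumed to be padded far enough to the left to reach a leaf."""
    node, i = root, len(x_prefix) - 1
    while node.children:
        node = node.children[x_prefix[i]]
        i -= 1
    return node.gamma

def sequence_probability(root, x_prefix, y_seq):
    """P(y_1..y_n | x_1..x_n, T) = prod_k gamma_{s^k}(y_k)."""
    prob, pad = 1.0, len(x_prefix) - len(y_seq)
    for k, y_k in enumerate(y_seq, start=1):
        prob *= predict_symbol(root, x_prefix[:pad + k])[y_k]
    return prob

# A depth-one transducer over Sigma_in = {0, 1}, Sigma_out = {a, b} (made-up numbers).
root = Node(gamma={'a': 0.5, 'b': 0.5},
            children={'0': Node({'a': 0.9, 'b': 0.1}),
                      '1': Node({'a': 0.2, 'b': 0.8})})
print(sequence_probability(root, "0011", "abab"))   # 0.9 * 0.1 * 0.2 * 0.8
```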

For a suffix tree transducer T and a node s ∈ T we use the following notations:

- The set of all possible complete¹ sub-trees of T (including T itself) is denoted by Sub(T).
- The set of leaves of T is denoted by Leaves(T).
- The set of sub-trees T′ ∈ Sub(T) such that s ∈ Leaves(T′) is denoted by L_s(T).
- The set of sub-trees T′ ∈ Sub(T) such that s ∈ T′ and s ∉ Leaves(T′) is denoted by I_s(T).
- The set of sub-trees T′ ∈ Sub(T) such that s ∈ T′ is denoted by A_s(T).

¹ A suffix tree is complete if each node is either a leaf or all of its children are in the tree as well.


Whenever it is clear from the context we will omit the dependency on T and denote the above sets by L_s, I_s and A_s. Note that the above definitions imply that A_s = L_s ∪ I_s. For a given suffix tree transducer T we are interested in the mixture of all possible sub-models (sub-trees) of T. We associate with each sub-tree (including T itself) a weight which can be interpreted as its prior probability. We later show how the learning algorithm for a mixture of suffix tree transducers adapts these weights in accordance with the performance (evidence, in Bayesian terms) of each sub-tree on past observations. Direct calculation of the mixture probability is infeasible since there might be exponentially many such sub-trees. However, the technique introduced in (Willems et al., 1995) can be generalized and applied to our setting. Let T′ be a sub-tree of T. Denote by n_1 the number of internal nodes of T′ and by n_2 the number of leaves of T′ which are not leaves of T. For example, n_1 = 2 and n_2 = 1 for the small sub-tree of the suffix tree transducer depicted in Figure 1. We define the prior weight of a tree T′, denoted by P_0(T′), to be (1 − α)^{n_1} α^{n_2}, where α ∈ (0, 1). It can be easily verified by induction on the number of nodes of T that this definition of the weights is a proper measure, i.e., Σ_{T′ ∈ Sub(T)} P_0(T′) = 1. This distribution over suffix trees can be extended to trees of unbounded depth, assuming that T is an infinite |Σ_in|-ary suffix tree transducer. The prior weights can be interpreted as if they were created by the following stochastic process. We start with a suffix tree that includes only the root node. With probability α we stop the process, and with probability 1 − α we add all the possible |Σ_in| children of the node and continue the process recursively for each of the children. For a prior probability distribution over suffix trees of unbounded depth, the above process ends when no new children are created. For bounded depth trees, the recursive process stops if either no new children were created at a node or if the node is a leaf of the maximal suffix tree T. For a node s that belongs to either a finite or an infinite suffix tree transducer, the weight α is therefore the a priori probability that the node is a leaf of a transducer T′ ∈ Sub(T):

\alpha = \sum_{T' \in L_s} P_0(T') \Big/ \sum_{T' \in A_s} P_0(T') .    (1)

Therefore, for a bounded depth transducer T, if s ∈ Leaves(T) then A_s = L_s ⇒ α = 1 and the recursive process stops with probability one. In fact, we can associate a different prior probability α_s with each node s ∈ T. For the sake of brevity we omit the condition on the node s and assume that all the priors for internal nodes are equal, that is, ∀s ∉ Leaves(T): α_s = α. Using the recursive prior over suffix tree transducers, we can calculate the prediction of the mixture of all T′ ∈ Sub(T) on the first output symbol y = y_1 as follows:

\alpha\,\gamma_{\epsilon}(y) + (1-\alpha)\Big(\alpha\,\gamma_{x_1}(y) + (1-\alpha)\big(\alpha\,\gamma_{x_0 x_1}(y) + (1-\alpha)(\alpha\,\gamma_{x_{-1} x_0 x_1}(y) + (1-\alpha)\cdots)\big)\Big) .    (2)

The above calculation ends when a leaf of T is reached or when the beginning of the input sequence is encountered. Therefore, the prediction time of a single symbol is bounded by the maximal depth of T, or the length of the input sequence if T is infinite. Let us assume from now on that T is finite. (A similar derivation holds when T is infinite.) Let γ̃_s(y) be the prediction propagating from a node s labeled by x_l … x_1 (we give a precise definition of γ̃_s in the next section), that is,

\tilde{\gamma}_s(y) = \tilde{\gamma}_{x_l \ldots x_1}(y) = \alpha\,\gamma_{x_l \ldots x_1}(y) + (1-\alpha)\Big(\alpha\,\gamma_{x_{l-1} \ldots x_1}(y) + (1-\alpha)\big(\alpha\,\gamma_{x_{l-2} \ldots x_1}(y) + (1-\alpha)(\cdots + \gamma_{x_j \ldots x_1}(y))\big)\Big) ,    (3)

where x_j … x_1 is the label of a leaf in T. Let σ = x_{1−|s|}; then the above sum can be evaluated recursively as follows:

\tilde{\gamma}_s(y) = \begin{cases} \gamma_s(y) & s \in \mathrm{Leaves}(T) \\ \alpha\,\gamma_s(y) + (1-\alpha)\,\tilde{\gamma}_{\sigma s}(y) & \text{otherwise} \end{cases}    (4)

The prediction of the whole mixture of suffix tree transducers on y is the prediction propagated from the root node ε, namely γ̃_ε(y). For example, for the input sequence … 0110, output symbol y = b, and α = 1/2, the predictions propagated from the nodes of the full suffix tree transducer from Figure 1 are

\tilde{\gamma}_{110}(b) = 0.4 , \qquad \tilde{\gamma}_{10}(b) = 0.5\,\gamma_{10}(b) + 0.5 \cdot 0.4 = 0.3 ,
\tilde{\gamma}_{0}(b) = 0.5\,\gamma_{0}(b) + 0.5 \cdot 0.3 = 0.25 , \qquad \tilde{\gamma}_{\epsilon}(b) = 0.5\,\gamma_{\epsilon}(b) + 0.5 \cdot 0.25 = 0.25 .
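The recursion of Equ. (4) only visits the nodes on the path selected by the input, so it can be evaluated with a single bottom-up pass. The short sketch below (Python; the function name is ours, and it operates on the probabilities of one output symbol only) reproduces the numbers of the example above:

```python
# Bottom-up evaluation of Equ. (4) along the path selected by the input ...0110
# in Figure 1, for the single output symbol b and alpha = 1/2.
# path_gammas[0] is gamma at the root and path_gammas[-1] is gamma at the leaf.

def mixture_prediction(path_gammas, alpha):
    tilde = path_gammas[-1]                   # at a leaf, gamma~ = gamma
    tildes = [tilde]
    for gamma in reversed(path_gammas[:-1]):  # propagate towards the root
        tilde = alpha * gamma + (1.0 - alpha) * tilde
        tildes.append(tilde)
    return list(reversed(tildes))             # gamma~ at every node, root first

gammas_b = [0.25, 0.2, 0.2, 0.4]              # gamma_s(b) for s = eps, 0, 10, 110
print(mixture_prediction(gammas_b, alpha=0.5))
# -> [0.25, 0.25, 0.3, 0.4], matching the worked example above
```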

3 An Online Learning Algorithm

We now describe an efficient learning algorithm for the mixture of suffix tree transducers. The learning algorithm uses the recursive priors and the evidence to efficiently update the posterior weight of each possible transducer in the mixture. In this section we assume that the output probability functions are known. Hence, at each time step we need to evaluate the following:

P(y_n \mid x_1 \ldots x_n) = \sum_{T' \in \mathrm{Sub}(T)} P(y_n \mid x_1 \ldots x_n, T')\,P_{n-1}(T') ,    (5)

where P_n(T′) is the posterior weight of the suffix tree transducer T′ after n input-output pairs, (x_1, y_1), …, (x_n, y_n). Omitting the dependency on the input sequence for the sake of brevity, P_n(T′) equals

P_n(T') = \frac{P_0(T')\,P(y_1 \ldots y_n \mid T')}{\sum_{T'' \in \mathrm{Sub}(T)} P_0(T'')\,P(y_1 \ldots y_n \mid T'')} = \frac{P_0(T') \prod_{i=1}^{n} P(y_i \mid T')}{\sum_{T'' \in \mathrm{Sub}(T)} P_0(T'') \prod_{i=1}^{n} P(y_i \mid T'')} .    (6)

Direct evaluation of the above is again infeasible due to the fact that there might be exponentially many sub-models of T. However, using the technique of recursive evaluation as in Equ. (4) we can efficiently calculate the prediction of the mixture. Similar to the definition of the recursive prior α, define q_n(s) to be the posterior probability (after n input-output pairs) that the node s is a leaf. As shown subsequently in Thm. 1, q_n(s) is the ratio between the weighted predictions of all sub-trees for which the node s is a leaf and the weighted predictions of all sub-trees that include s:

q_n(s) = \sum_{T' \in L_s} P_0(T')\,P(y_1 \ldots y_n \mid T') \Big/ \sum_{T' \in A_s} P_0(T')\,P(y_1 \ldots y_n \mid T') .    (7)

We can compute the predictions of the nodes along the path defined by x_n x_{n−1} x_{n−2} … simply by replacing the prior weight α with the posterior weight q_{n−1}(s), as follows:

\tilde{\gamma}_s(y_n) = \begin{cases} \gamma_s(y_n) & s \in \mathrm{Leaves}(T) \\ q_{n-1}(s)\,\gamma_s(y_n) + (1 - q_{n-1}(s))\,\tilde{\gamma}_{\sigma s}(y_n) & \text{otherwise} \end{cases}    (8)

where σ = x_{n−|s|}. We also show in Thm. 1 that γ̃_s is the following weighted prediction of all the suffix tree transducers that include the node s:

\tilde{\gamma}_s(y_n) = \frac{\sum_{T' \in A_s} P(y_n \mid T')\,P_{n-1}(T')}{\sum_{T' \in A_s} P_{n-1}(T')} = \frac{\sum_{T' \in A_s} P_0(T')\,P(y_1 \ldots y_n \mid T')}{\sum_{T' \in A_s} P_0(T')\,P(y_1 \ldots y_{n-1} \mid T')} .    (9)

In order to update q_n(s) we introduce one more variable, which we denote by r_n(s). Setting r_0(s) = log(α/(1 − α)) for all s, r_n(s) is updated as follows:

r_n(s) = r_{n-1}(s) + \log(\gamma_s(y_n)) - \log(\tilde{\gamma}_{\sigma s}(y_n)) .    (10)

Based on the definition of q_n(s), r_n(s) is the log-likelihood ratio between the weighted predictions of the sub-trees for which s is a leaf and the weighted predictions of the sub-trees for which s is an internal node. The new posterior weights q_n(s) are calculated from r_n(s) using a sigmoid function:

q_n(s) = 1 \Big/ \left(1 + e^{-r_n(s)}\right) .    (11)

For a node s′ that is not reached at time step n we simply define r_n(s′) = r_{n−1}(s′) and q_n(s′) = q_{n−1}(s′). In summary, for each new observation pair, we traverse the tree by following the path that corresponds to the input sequence x_n x_{n−1} x_{n−2} … The predictions at the nodes along the path are calculated using Equ. (8). Given these predictions, the posterior probabilities of being a leaf are updated for each node s along the path using Equ. (10) and Equ. (11). Finally, the probability of y_n induced by the whole mixture is the probability propagated from the root node, as stated by the following theorem.

Theorem 1  \tilde{\gamma}_{\epsilon}(y_n) = \sum_{T' \in \mathrm{Sub}(T)} P(y_n \mid T')\,P_{n-1}(T') .

Proof: Let s be a node reached at time step n. Define

L_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in L_s} P_0(T')\,P(y_1 \ldots y_n \mid T') , \qquad
I_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in I_s} P_0(T')\,P(y_1 \ldots y_n \mid T') , \qquad
A_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in A_s} P_0(T')\,P(y_1 \ldots y_n \mid T') .

Note that since A_s = L_s ∪ I_s, then A_s(n) = L_s(n) + I_s(n). Also, since s was reached at time step n,

L_s(n) = \sum_{T' \in L_s} P_0(T')\,P(y_1 \ldots y_n \mid T') = \sum_{T' \in L_s} P_0(T')\,P(y_1 \ldots y_{n-1} \mid T')\,\gamma_s(y_n) ,    (12)

and therefore

\gamma_s(y_n) = L_s(n) / L_s(n-1) .    (13)

Define r_n(s) = ∞ for all s ∈ Leaves(T), and A_s(−1) = 1 for all s. Let the depth of a node be the node's maximal distance from any of the leaves and denote it by d. We now prove the following by induction on n and d:

r_n(s) = \log\big(L_s(n) / I_s(n)\big) ,    (14)
\tilde{\gamma}_s(y_n) = A_s(n) / A_s(n-1) .    (15)

First note that the following holds based on the induction hypotheses:

q_n(s) = 1\big/\big(1 + e^{-r_n(s)}\big) = 1\big/\big(1 + I_s(n)/L_s(n)\big) = L_s(n)\big/\big(L_s(n) + I_s(n)\big) = L_s(n)/A_s(n) .    (16)

Equ. (14) holds by definition for d = 0 (s ∈ Leaves(T)) and for all n. Since A_s(n) = L_s(n) for every s ∈ Leaves(T), from Equ. (13) we get that γ̃_s(y_n) = γ_s(y_n) = L_s(n)/L_s(n−1) = A_s(n)/A_s(n−1), and

Equ. (15) holds as well. For n = 0 and for all d, Equ. (14) holds from the definition of the prior probabilities:

r_0(s) \stackrel{\mathrm{def}}{=} \log\left(\frac{\alpha}{1-\alpha}\right)
= \log\left(\frac{\sum_{T' \in L_s} P_0(T') \big/ \sum_{T' \in A_s} P_0(T')}{1 - \sum_{T' \in L_s} P_0(T') \big/ \sum_{T' \in A_s} P_0(T')}\right)
= \log\left(\frac{L_s(0)/A_s(0)}{1 - L_s(0)/A_s(0)}\right) = \log\left(\frac{L_s(0)}{A_s(0) - L_s(0)}\right) = \log\left(\frac{L_s(0)}{I_s(0)}\right) ,    (17)

and Equ. (15) holds from the definition of A_s(−1). Assume that the induction assumptions hold for d′ < d, n′ ≤ n and for d′ ≤ d, n′ < n, and let σ = x_{n−|s|}. Using the induction hypotheses for γ̃_{σs} at the node σs (of depth d − 1 < d) and for r_{n−1}(s) (at time n − 1 < n) we get

r_n(s) = r_{n-1}(s) + \log\big(\gamma_s(y_n)\big/\tilde{\gamma}_{\sigma s}(y_n)\big)
= \log\left(\frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{A_{\sigma s}(n-1)}{A_{\sigma s}(n)}\right) .    (18)

The sets A_{σs} and I_s consist of complete sub-trees. Therefore, for every σ, T′ ∈ A_{σs} ⟺ T′ ∈ I_s, and

A_{\sigma s}(n) = I_s(n) \;\Rightarrow\; r_n(s) = \log\left(\frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{I_s(n-1)}{I_s(n)}\right) = \log\big(L_s(n)/I_s(n)\big) .    (19)

Similarly, using the induction hypotheses for q_{n−1}(s) and the node σs,

\tilde{\gamma}_s(y_n) = q_{n-1}(s)\,\gamma_s(y_n) + (1 - q_{n-1}(s))\,\tilde{\gamma}_{\sigma s}(y_n)
= \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{A_{\sigma s}(n)}{A_{\sigma s}(n-1)}
= \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{I_s(n)}{I_s(n-1)}
= \frac{L_s(n) + I_s(n)}{A_s(n-1)} = \frac{A_s(n)}{A_s(n-1)} .    (20)

Hence, the induction hypotheses hold for all n and d, and in particular

\tilde{\gamma}_{\epsilon}(y_n) = A_{\epsilon}(n)/A_{\epsilon}(n-1) = \sum_{T' \in A_{\epsilon}} P_0(T')\,P(y_1 \ldots y_n \mid T') \Big/ \sum_{T' \in A_{\epsilon}} P_0(T')\,P(y_1 \ldots y_{n-1} \mid T') ,    (21)

and the theorem holds since by definition A_ε = Sub(T).

We now discuss the performance of a mixture of suffix tree transducers. Let Loss_n(T) be the negative log-likelihood of a suffix tree transducer T achieved on n input-output pairs,

\mathrm{Loss}_n(T) = -\sum_{i=1}^{n} \log_2\big(P(y_i \mid T)\big) .    (22)

Similarly, the loss of the mixture is defined to be

\mathrm{Loss}^{\mathrm{mix}}_n = -\sum_{i=1}^{n} \log_2\Big(\sum_{T' \in \mathrm{Sub}(T)} P(y_i \mid T')\,P_{i-1}(T')\Big) = -\sum_{i=1}^{n} \log_2\big(\tilde{\gamma}_{\epsilon}(y_i)\big) .    (23)

The advantage of using a mixture of suffix tree transducers over a single suffix tree is due to the robustness of the solution, in the sense that the predictions of the mixture are almost as good as the predictions of the best suffix tree in the mixture, as shown in the following theorem.


Theorem 2  Let T be a (possibly infinite) suffix tree transducer, and let (x_1, y_1), …, (x_n, y_n) be any possible sequence of input-output pairs. The loss of the mixture is at most

\min_{T' \in \mathrm{Sub}(T)} \big\{ \mathrm{Loss}_n(T') - \log_2(P_0(T')) \big\} .

The running time of the algorithm is Dn, where D is the maximal depth of T, or \tfrac{1}{2}(n+1)^2 when T is of unbounded depth.

Proof: The proof of the first part of the theorem is based on a technique introduced by DeSantis et al. (1988). Based on the definition of A_ε(n) and P_{n−1}(T′) from Thm. 1 we can rewrite Loss^mix_n as

\mathrm{Loss}^{\mathrm{mix}}_n = -\sum_{i=1}^{n} \log_2\Big(\sum_{T' \in \mathrm{Sub}(T)} P(y_i \mid T')\,P_{i-1}(T')\Big)
= -\sum_{i=1}^{n} \log_2\left(\frac{A_{\epsilon}(i)}{A_{\epsilon}(i-1)}\right)
= -\log_2\left(\prod_{i=1}^{n} \frac{A_{\epsilon}(i)}{A_{\epsilon}(i-1)}\right) .    (24)

Canceling the numerator of one term with the denominator of the next, we get

\mathrm{Loss}^{\mathrm{mix}}_n = -\log_2\left(\frac{A_{\epsilon}(n)}{A_{\epsilon}(0)}\right)
= -\log_2\left(A_{\epsilon}(n) \Big/ \sum_{T' \in \mathrm{Sub}(T)} P_0(T')\right)
= -\log_2\Big(\sum_{T' \in \mathrm{Sub}(T)} P_0(T')\,P(y_1 \ldots y_n \mid T')\Big)
\le -\log_2\Big(\max_{T' \in \mathrm{Sub}(T)} P_0(T')\,P(y_1 \ldots y_n \mid T')\Big)
= \min_{T' \in \mathrm{Sub}(T)} \big\{ \mathrm{Loss}_n(T') - \log_2(P_0(T')) \big\} .    (25)

This completes the proof of the first part of the theorem. At time step i, we traverse the suffix tree T until a leaf is reached (at most D nodes are visited) or the beginning of the input sequence is encountered (at most i nodes are visited). Hence, the running time for a sequence of length n is Dn for a bounded depth suffix tree, or Σ_{i=1}^n i ≤ (1/2)(n+1)² for an unbounded depth suffix tree.
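Putting Equ. (8), (10) and (11) together, a single on-line update only touches the nodes on the current path. The following minimal sketch (Python; the node representation and the function names are ours, and the output functions γ_s are assumed known, as in this section) illustrates one such update:

```python
import math

# One per-symbol update of Section 3 (Equ. (8), (10), (11)); a sketch, not the
# paper's code.  'path' lists the nodes from the root to the leaf reached by
# x_n x_{n-1} ...; each node is a dict holding its output distribution 'gamma'
# and its statistic 'r', initialised to log(alpha / (1 - alpha)).  Nodes that
# are not on the path are left untouched.

def sigmoid(r):
    return 1.0 / (1.0 + math.exp(-r))

def online_update(path, y):
    tildes = [0.0] * len(path)
    tildes[-1] = path[-1]['gamma'][y]          # leaf: gamma~ = gamma
    for i in range(len(path) - 2, -1, -1):     # internal nodes, Equ. (8)
        q = sigmoid(path[i]['r'])              # q_{n-1}(s), before the update
        tildes[i] = q * path[i]['gamma'][y] + (1.0 - q) * tildes[i + 1]
    for i in range(len(path) - 1):             # Equ. (10); Equ. (11) is implicit
        path[i]['r'] += math.log(path[i]['gamma'][y]) - math.log(tildes[i + 1])
    return tildes[0]                           # mixture probability of y (Thm. 1)
```

After the update, sigmoid(node['r']) gives the new posterior q_n(s) of Equ. (11); a leaf keeps q = 1 throughout, which corresponds to r_n(s) = ∞ in the proof of Thm. 1.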

4 Parameter Estimation

In this section we describe how the output probability functions are estimated. We devise two adaptive schemes that track the maximum likelihood assignment for the parameters of the output probability functions. To simplify notation, we omit the dependency on the node s and regard the sub-sequence y_{i_1} y_{i_2} … y_{i_k} (where the i_j are the time steps at which the node s was reached) as an individual sequence y_1 y_2 y_3 … We show later in this section how the estimations of the output probability functions at the different nodes are combined together. Denote by c_n(y) = |{i | y_i = y, 1 ≤ i ≤ n}| the number of times an output symbol y ∈ Σ_out was observed out of the n times a node was visited, and let K = |Σ_out|. A commonly used estimator approximates the output probability function at time step n + 1 by adding a constant δ to each count c_n(y):

\gamma(y) \approx \hat{\gamma}_n(y) = \frac{c_n(y) + \delta}{\sum_{y' \in \Sigma_{\mathrm{out}}} c_n(y') + \delta K} = \frac{c_n(y) + \delta}{n + \delta K} .    (26)

The estimator that achieves the best likelihood on a given sequence of length n, called the maximum likelihood estimator (MLE), is

\gamma_{\mathrm{ML}}(y) = \frac{c_n(y)}{n} .    (27)

The maximum likelihood estimator is evaluated in batch, after the entire sequence is observed, while the on-line estimator defined by Equ. (26) can be used on-the-fly as new symbols arrive. The special case of δ = 1/2 is called Laplace's modified rule of succession, or the add-1/2 estimator. Krichevsky and Trofimov (1981) proved that the likelihood attained by the add-1/2 estimator is almost as good as the likelihood achieved by the maximum likelihood estimator. Applying the bound of Krichevsky and Trofimov in our setting results in the following bound on the predictions of the add-1/2 estimator:

-\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i)\big) = -\sum_{i=1}^{n} \log_2\left(\frac{c_{i-1}(y_i) + 1/2}{i - 1 + K/2}\right)
\le \underbrace{-\sum_{i=1}^{n} \log_2\big(c_n(y_i)/n\big)}_{= -\sum_i \log_2(\gamma_{\mathrm{ML}}(y_i))} + (K-1)\left(\tfrac{1}{2}\log_2(n) + 1\right) .    (28)
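To make the comparison tangible, here is a small sketch (Python; the alphabet and the sequence are made up for illustration) that computes the two losses appearing in Equ. (28) for a single node:

```python
import math
from collections import Counter

# Contrast the on-line add-1/2 estimator of Equ. (26) with the batch maximum
# likelihood estimator of Equ. (27) on one node's output sequence.

def add_half_log_loss(sequence, alphabet_size):
    """-sum_i log2 gamma^_{i-1}(y_i) with the add-1/2 estimator."""
    counts, loss = Counter(), 0.0
    for i, y in enumerate(sequence):            # i = number of symbols seen so far
        loss -= math.log2((counts[y] + 0.5) / (i + alphabet_size / 2.0))
        counts[y] += 1
    return loss

def mle_log_loss(sequence):
    """-sum_i log2 (c_n(y_i)/n), evaluated with the final counts."""
    counts, n = Counter(sequence), len(sequence)
    return -sum(math.log2(counts[y] / n) for y in sequence)

seq = list("aababcabaa")
print(add_half_log_loss(seq, alphabet_size=3), mle_log_loss(seq))
# By Equ. (28) the gap between the two is at most (K - 1) * (log2(n)/2 + 1).
```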

For a mixture of suffix tree transducers, we use the add-1/2 estimator at each node separately. To do so, we simply need to store counts of the number of appearances of each symbol at each node of the maximal suffix tree transducer T. Let T′ ∈ Sub(T) be any suffix tree transducer from the pool of possible sub-models whose parameters are set so that it achieves the maximum likelihood on (x_1, y_1), …, (x_n, y_n), and let T̂′ be the same suffix tree transducer but with parameters adaptively estimated using the add-1/2 estimator. Denote by n_s the number of times a node s ∈ Leaves(T′) was reached on the input-output sequence, and let l(T′) = |Leaves(T′)| be the total number of leaves of T′. From Thm. 1 we get

\mathrm{Loss}^{\mathrm{mix}}_n \le \mathrm{Loss}_n(\hat{T}') - \log_2(P_0(\hat{T}')) .    (29)

From the bound of Equ. (28) on the add-1/2 estimator we get

\mathrm{Loss}_n(\hat{T}') \le \mathrm{Loss}_n(T') + \sum_{s \in \mathrm{Leaves}(T')} \Big( \tfrac{1}{2}(K-1)\log_2(n_s) + K - 1 \Big)
= \mathrm{Loss}_n(T') + l(T')(K-1) + \frac{K-1}{2} \sum_{s} \log_2(n_s)
\le \mathrm{Loss}_n(T') + l(T')(K-1) + \frac{l(T')(K-1)}{2} \log_2\left(\frac{\sum_s n_s}{l(T')}\right)
= \mathrm{Loss}_n(T') + l(T')(K-1)\left(\tfrac{1}{2}\log_2\left(\frac{n}{l(T')}\right) + 1\right) ,    (30)

where we used the concavity of the logarithm function to obtain the last inequality. Finally, combining Equ. (29) and Equ. (30) we get that

\mathrm{Loss}^{\mathrm{mix}}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + l(T')(K-1)\left(\tfrac{1}{2}\log_2\left(\frac{n}{l(T')}\right) + 1\right) .    (31)

The above bound holds for any T′ ∈ Sub(T); thus we obtain the following corollary.

Corollary 3  Let T be a (possibly infinite) suffix tree transducer, and let (x_1, y_1), …, (x_n, y_n) be any possible sequence of input-output pairs. The loss of the mixture is at most

\min_{T' \in \mathrm{Sub}(T)} \left\{ \mathrm{Loss}_n(T') - \log_2(P_0(T')) + l(T')(K-1)\left(\tfrac{1}{2}\log_2\left(\frac{n}{l(T')}\right) + 1\right) \right\} ,

where the parameters of each T′ ∈ Sub(T) are set so as to maximize the likelihood of T′ on the entire input-output sequence. The running time of the algorithm is K·D·n, where D is the maximal depth of T, or (K/2)(n+1)² when T is of unbounded depth.

Note that the above corollary holds for any n. Therefore, the mixture model tracks the best transducer although that transducer may in fact change with time. The additional cost we need to 'pay' for not knowing ahead of time the transducer's structure and parameters grows sublinearly with the length of the input-output sequence, like O(log(n)). Normalizing this additional cost term by the length of the sequence, we get that the predictions of the adaptively estimated mixture lag behind the predictions of the best transducer by an additive factor which behaves like C log(n)/n → 0, where the constant C depends on the total number of parameters of the best transducer. When K is large, the smoothing of the output probabilities using the add-1/2 estimator is too crude. Furthermore, in many real problems only a small subset of the full output alphabet is observed in a given context (a node in the tree). For example, when mapping phonemes to phones (Riley, 1991), for a given sequence of input phonemes the possible phones that can be pronounced are limited to a few possibilities (usually about two to four). Therefore, we would like to devise an estimation scheme that statistically depends on the actual local (context dependent) alphabet and not on the full alphabet. Such an estimation scheme can be devised by employing again a mixture of models, one model for each possible subset Σ′_out of Σ_out. Although there are 2^K subsets of Σ_out, we now show that if the estimation technique depends only on the size of each subset, then the whole mixture can be maintained in time linear in K. As before, we omit the dependency on the node and treat the sub-sequence that reaches a node as an individual sequence. Denote by γ̂_n(y | Σ′_out) the estimate of γ(y) after n observations, assuming that the actual alphabet is Σ′_out. Let i = |Σ′_out|; then, using the add-1/2 estimator,

\hat{\gamma}_n(y \mid \Sigma'_{\mathrm{out}}) = \hat{\gamma}_n(y \mid |\Sigma'_{\mathrm{out}}| = i) = \frac{c_n(y) + 1/2}{n + i/2} .    (32)

For the sake of brevity we will simply denote γ̂_n(y | |Σ′_out| = i) as γ̂_n(y | i).

Let Σ^n_out be the set of different output symbols that were observed at a node, i.e.,

\Sigma^{n}_{\mathrm{out}} \stackrel{\mathrm{def}}{=} \{ \sigma \mid \sigma \in \Sigma_{\mathrm{out}},\; \sigma = y_i,\; 1 \le i \le n \} .

Lastly, we define Σ^0_out to be the empty set and let K_n = |Σ^n_out|. Since each possible alphabet Σ′_out of size i = |Σ′_out| should contain all symbols in Σ^n_out, there are exactly \binom{K - K_n}{i - K_n} possible alphabets of size i. Let w^n_i be the posterior probability (after n observations) of any alphabet of size i. The prediction of the mixture of all possible subsets Σ′_out such that Σ^n_out ⊆ Σ′_out ⊆ Σ_out is

\hat{\gamma}_n(y) = \sum_{j=K_n}^{K} \binom{K - K_n}{j - K_n}\, w^n_j\, \hat{\gamma}_n(y \mid j) .    (33)

Evaluation of this sum requires O(K) operations (and not O(2^K)). Furthermore, we can compute Equ. (33) in an on-line fashion as follows. Let W^n_i be the (prior weighted) likelihood of all alphabets

of size i:

W^n_i \stackrel{\mathrm{def}}{=} \sum_{|\Sigma'_{\mathrm{out}}| = i} w^0_i \prod_{k=1}^{n} \hat{\gamma}_{k-1}(y_k \mid \Sigma'_{\mathrm{out}}) = \binom{K - K_n}{i - K_n}\, w^0_i \prod_{k=1}^{n} \hat{\gamma}_{k-1}(y_k \mid i) .    (34)

Without loss of generality let us assume that all alphabets are equally likely a priori.² That is, P_0(Σ′_out) = 1/2^K, and therefore W^0_i = \binom{K}{i}/2^K. The weights W^n_i can be updated incrementally depending on the following cases. If the number of different symbols observed so far exceeds a given size (|Σ^n_out| > i), then all alphabets of this size are too small and should be eliminated from the mixture by setting their posterior probability to zero. Otherwise, if the next symbol was observed before (y_{n+1} ∈ Σ^n_out), the output probability is approximated by the add-1/2 estimator. Lastly, if the next symbol is entirely new (y_{n+1} ∉ Σ^n_out), we need to re-scale W^n_i for i ≥ |Σ^{n+1}_out| for the following reason. Since y_{n+1} ∉ Σ^n_out, then Σ^{n+1}_out = Σ^n_out ∪ {y_{n+1}} ⟹ K_{n+1} = K_n + 1 and

\binom{K - K_{n+1}}{i - K_{n+1}} = \binom{K - K_n - 1}{i - K_n - 1} = \binom{K - K_n}{i - K_n} \cdot \frac{i - K_n}{K - K_n} .

Hence, the number of possible alphabets of size i has decreased by a factor of (i − K_n)/(K − K_n), and each weight W^{n+1}_i (i ≥ K_{n+1}) has to be multiplied by this factor in addition to the prediction of the add-1/2 estimator on y_{n+1}. The above three cases can be summarized as an update rule of W^{n+1}_i from W^n_i as follows:

W^{n+1}_i = W^n_i \times \begin{cases} 0 & \text{if } K_{n+1} > i \\ \dfrac{c_n(y_{n+1}) + 1/2}{n + i/2} & \text{if } K_{n+1} \le i \text{ and } y_{n+1} \in \Sigma^{n}_{\mathrm{out}} \\ \dfrac{i - K_n}{K - K_n} \cdot \dfrac{1/2}{n + i/2} & \text{if } K_{n+1} \le i \text{ and } y_{n+1} \notin \Sigma^{n}_{\mathrm{out}} \end{cases}    (35)

Applying Bayes rule for the mixture of all possible subsets of the output alphabet, we get

\hat{\gamma}_n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^{n}_i = \sum_{i=K_n}^{K} W^{n+1}_i \Big/ \sum_{i=K_n}^{K} W^{n}_i .    (36)

Let W̃^n_i be the posterior probability of all possible alphabets of size i, that is, W̃^n_i = W^n_i / Σ_j W^n_j. Let P^n_i(y_{n+1}) = W^{n+1}_i / W^n_i be the sum of the predictions of all possible alphabets of size i on y_{n+1}. Thus, the prediction of the mixture γ̂_n(y_{n+1}) can be re-written as

\hat{\gamma}_n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^{n}_i = \sum_{i=1}^{K} \tilde{W}^n_i\, P^n_i(y_{n+1}) .
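A minimal sketch of this bookkeeping (Python; the class name, the toy alphabet and the data are ours) keeps one weight per alphabet size and applies the three cases of Equ. (35):

```python
import math
from collections import Counter

# Sketch of the mixture over all subsets of the output alphabet, maintained per
# node through the size-indexed weights W_i of Equ. (34)-(36).

class AlphabetSizeMixture:
    def __init__(self, full_alphabet):
        self.K = len(full_alphabet)
        self.counts = Counter()          # c_n(y)
        self.n = 0                       # number of observations so far
        self.observed = set()            # Sigma_out^n
        # W_i^0 = C(K, i) / 2^K, i.e. a uniform prior over all 2^K subsets.
        self.W = {i: math.comb(self.K, i) / 2.0 ** self.K
                  for i in range(1, self.K + 1)}

    def _next_weight(self, i, y):
        """W_i^{n+1} for a candidate next symbol y (the three cases of Equ. (35))."""
        K_n = len(self.observed)
        K_n1 = K_n if y in self.observed else K_n + 1
        if K_n1 > i:
            return 0.0
        if y in self.observed:
            return self.W[i] * (self.counts[y] + 0.5) / (self.n + i / 2.0)
        return self.W[i] * ((i - K_n) / (self.K - K_n)) * 0.5 / (self.n + i / 2.0)

    def predict(self, y):
        """gamma^_n(y) = sum_i W_i^{n+1} / sum_i W_i^n  (Equ. (36))."""
        return sum(self._next_weight(i, y) for i in self.W) / sum(self.W.values())

    def update(self, y):
        self.W = {i: self._next_weight(i, y) for i in self.W}
        self.counts[y] += 1
        self.observed.add(y)
        self.n += 1

mix = AlphabetSizeMixture("abcdefghij")      # K = 10, as in the Figure 2 setting
for y in "aabcbacbaa":                       # only a few of the ten symbols appear
    print(round(mix.predict(y), 3), end=" ")
    mix.update(y)
```

Since every case of Equ. (35) is normalized within a given size i, predict sums to one over the full alphabet, so the ratio in Equ. (36) is a proper probability, and each update costs O(K) as claimed above.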

In Figure 2 we give an illustration of the values obtained for W̃^n_i in the following experiment. We simulated a multinomial source whose output symbols belong to an alphabet of size |Σ_out| = 10 and set the probabilities of observing any of the last five symbols to zero. Therefore, the actual alphabet is of size 5. The posterior probabilities W̃^n_i (1 ≤ i ≤ 10) were calculated after each iteration. These probabilities are plotted left to right and top to bottom. The very first observations rule out alphabets of size smaller than 5, setting their posterior probability to zero. After a few observations, the posterior probability is concentrated around the actual size, yielding an accurate and adaptive estimate of the actual alphabet size of the source.

² One can easily define and use non-uniform priors. For instance, the assumption that all alphabet sizes are equally likely a priori and all alphabets of the same size have the same prior probability results in a prior that favors small and large alphabets. Such a prior is used in the example appearing in Figure 2.


[Figure 2 consists of a grid of small panels, one per iteration, each showing the posterior probabilities W̃^n_i for i = 1, …, 10; the panels themselves are not reproduced here.]

Figure 2: An illustration of adaptive estimation of the alphabet size for a multinomial source with a large number of possible outcomes when only a subset of the full alphabet is actually observed. In this illustration, examples were randomly drawn from an alphabet of size 10 but only five symbols have a non-zero probability. The posterior probability concentrates on the right size after less than 20 iterations. The initial prior used in this example assigns an equal probability to the different alphabet sizes.

Let γ_ML denote again the maximum-likelihood estimator of γ, evaluated after n input-output observation pairs:

\gamma_{\mathrm{ML}}(y) = \begin{cases} c_n(y)/n & y \in \Sigma^{n}_{\mathrm{out}} \\ 0 & y \notin \Sigma^{n}_{\mathrm{out}} \end{cases}    (37)

We can now apply the same technique used to bound the loss of a mixture of transducers to bound the loss of a mixture of all possible subsets of the output alphabet. First we bound the likelihood of the mixture of add-1/2 estimators based on all possible subsets of Σ_out. We do so by comparing the mixture's likelihood to the likelihood achieved by a single add-1/2 estimator that uses the "right" size (i.e., the estimator that uses the final Σ^n_out for prediction):

-\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i)\big) \le -\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i \mid \Sigma^{n}_{\mathrm{out}})\big) - \log_2\big(P_0(\Sigma^{n}_{\mathrm{out}})\big)
= -\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i \mid \Sigma^{n}_{\mathrm{out}})\big) + \log_2\big(2^K\big) .    (38)

Using the bound on the likelihood of the add-1/2 estimator from Equ. (28), we get

-\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i \mid \Sigma^{n}_{\mathrm{out}})\big) \le -\sum_{i=1}^{n} \log_2\big(\gamma_{\mathrm{ML}}(y_i)\big) + (K_n - 1)\left(\tfrac{1}{2}\log_2(n) + 1\right) .    (39)

Combining Equations (38) and (39), we obtain a bound on the predictions of the mixture of smoothed predictors compared to the predictions of the maximum likelihood estimator:

-\sum_{i=1}^{n} \log_2\big(\hat{\gamma}_{i-1}(y_i)\big) \le -\sum_{i=1}^{n} \log_2\big(\gamma_{\mathrm{ML}}(y_i)\big) + (K_n - 1)\left(\tfrac{1}{2}\log_2(n) + 1\right) + K .    (40)

It is easy to verify that if n ≥ 2^{2K/(K−K_n)}, which is the case when the observed alphabet is small (K_n ≪ K), then a mixture of add-1/2 predictors, one for each possible subset of the alphabet, achieves a better performance than a single add-1/2 predictor based on the full alphabet. Applying the on-line mixture estimation technique twice, first for the mixture of all possible sub-models of a suffix tree transducer T and then for the parameters of each model in the mixture, yields an accurate and efficient on-line learning algorithm for both the structure and the parameters of the source. As before, let T′ ∈ Sub(T) be a suffix tree transducer whose parameters are set so as to maximize the likelihood of T′ on the entire input-output sequence. l(T′) is again the number

of leaves of T′, and n_s the number of times a node s ∈ Leaves(T′) was reached on an input-output sequence (x_1, y_1), …, (x_n, y_n). Denote by m_s the number of non-zero parameters of the output probability function of T′ at node s. Note that for a given node s and for all n′ ≤ n, the value K_{n′} for that node satisfies K_{n′} ≤ m_s. Denote by m(T′) = Σ_{s∈Leaves(T′)} m_s the total number of non-zero parameters of the maximum likelihood transducer T′. Combining the bound on the mixture of transducers from Thm. 1 with the bound on the mixture of parameters from Equ. (40), while following the same steps used to derive Equ. (31), we get

\mathrm{Loss}^{\mathrm{mix}}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \sum_{s \in \mathrm{Leaves}(T')} \Big[ (m_s - 1)\big(\tfrac{1}{2}\log_2(n_s) + 1\big) + K \Big]
= \mathrm{Loss}_n(T') - \log_2(P_0(T')) + l(T')(K-1) + m(T') + \tfrac{1}{2} \sum_{s \in \mathrm{Leaves}(T')} (m_s - 1)\log_2(n_s) .    (41)

Now,

\sum_{s} (m_s - 1)\log_2(n_s) = \sum_{s} (m_s - 1)\log_2\big(n_s/(m_s - 1)\big) + \sum_{s} (m_s - 1)\log_2(m_s - 1)
\le \sum_{s} (m_s - 1)\log_2\big(n_s/(m_s - 1)\big) + m(T')\log_2(K) .

Using the log-sum inequality (Cover and Thomas, 1991) we get

\sum_{s} (m_s - 1)\log_2\big(n_s/(m_s - 1)\big) \le \Big(\sum_{s} (m_s - 1)\Big)\log_2\left(\frac{\sum_s n_s}{\sum_s (m_s - 1)}\right) = \big(m(T') - l(T')\big)\log_2\left(\frac{n}{m(T') - l(T')}\right) .

Therefore,

\mathrm{Loss}^{\mathrm{mix}}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \tfrac{1}{2}\,m(T')\log_2\left(\frac{n}{m(T') - l(T')}\right) + l(T')(K-1) + m(T')\log_2(2K) .    (42)

As before, the above bound holds for any T′ ∈ Sub(T). Therefore, the mixture model again tracks the best transducer, but the additional cost for specifying the parameters is smaller. That is, the predictions of a mixture of suffix tree transducers, each with parameters estimated using a mixture of all possible alphabets, lag behind the predictions of the maximum likelihood model by an additive factor that scales again like C log(n)/n, but the constant C now depends on the total number of non-zero parameters of the maximum likelihood model. Using efficient data structures to maintain the pointers to all the children of a node, both the time and the space complexity of the algorithm are O(Dn|Σ_out|) (O(n²|Σ_out|) for unbounded depth transducers). Furthermore, if the model is kept fixed after the training phase, we can evaluate the final estimation of the output functions once for each node, and the time complexity during a test (prediction only) phase reduces to O(D) per observation.

5 Evaluation and Applications

In this section we present evaluation results of the model and its learning algorithm on artificial data. We also discuss and present results obtained from learning the syntactic structure of noun phrases. In all the experiments described in this section, α was set to 1/2.

[Figure 3 plots the normalized negative log-likelihood (y-axis, log base |Σ_out|) against the number of examples / 100 (x-axis) for four curves: (a) Source, (b) Mixture of Models and Parameters, (c) Mixture of Models, (d) Single Model; the plot itself is not reproduced here.]

Figure 3: Performance comparison of the predictions of a single model, two mixture models and the true underlying transducer. The y-axis is the normalized negative log-likelihood, where the base of the log is |Σ_out|.

The simplicity of the learning algorithm enables evaluation of the algorithm on millions of input-output pairs in a few minutes. For example, the average update time for a mixture of suffix tree transducers of maximal depth 10 when |Σ_out| = 4 is about 0.2 milliseconds on an SGI Indy 200 MHz workstation. Results of the on-line predictions of the various estimation schemes are shown in Figure 3. In this example, Σ_out = Σ_in = {1, 2, 3, 4}. The description of the source is as follows. If x_n ≥ 3 then y_n is uniformly distributed over Σ_out; otherwise (x_n ≤ 2), y_n = x_{n−5} with probability 0.9 and y_n = 5 − x_{n−5} with probability 0.1. The input sequence x_1 x_2 … was created entirely at random. This source can be implemented by a sparse suffix tree transducer of maximal depth five. Note that the actual size of the alphabet is only 2 at half of the leaves of the tree. We used a mixture of suffix tree transducers of maximal depth 20 to learn the source. The normalized negative log-likelihood values are given for (a) the true source, (b) a mixture model of suffix tree transducers with parameters estimated using a mixture of all subsets of the output alphabet, (c) a mixture model of the possible suffix tree transducers with parameters estimated using the add-1/2 scheme, and (d) a single model of depth 8 whose parameters are estimated using the add-1/2 scheme (note that the source creating the examples can be embedded in this model). Clearly, the prediction of the mixture model converges to the prediction of the source much faster than that of the single model. Furthermore, applying the mixture estimation both for the structure and the parameters significantly accelerates the convergence. In less than 10,000 examples the cross-entropy between the mixture model and the source is less than 0.01 bits, while even after 50,000 examples the cross-entropy between the source and a single model is about 0.1 bits.

As mentioned in the introduction, there is a broad range of applications of probabilistic transducers. Here we briefly demonstrate and discuss how to induce an English noun phrase recognizer based on a suffix tree transducer. Recognizing noun phrases is an important task in automatic natural text processing for applications such as information retrieval, translation tools and data extraction from texts. A common practice is to recognize noun phrases by first analyzing the text with a part-of-speech tagger, which assigns the appropriate part-of-speech tag (verb, noun, adjective, etc.) to each word in context. Then, noun phrases are identified by manually defined regular expression patterns that are matched against the part-of-speech sequences. An alternative approach is to learn a mapping from the part-of-speech sequence to an output sequence that identifies the words that belong to a noun phrase (Church, 1988). We followed the second approach by building a suffix tree transducer based on a labeled data set from the Penn Tree Bank corpus. We defined Σ_in to be the set of possible part-of-speech tags and set Σ_out = {0, 1}, where an output symbol given its corresponding input sequence (the part-of-speech tags of the words constituting the sentence) is

1 iff the word is part of a noun phrase. We used 250,000 marked tags and tested the performance on 37,000 tags. Testing the performance of the suffix tree transducer based tagger is performed by freezing the model structure, the mixture weights, and the estimated parameters. The suffix tree transducer was of maximal depth 15, hence very long phrases could potentially be identified. By comparing the predictions of the transducer to a threshold (1/2), we labeled each tag in the test data as either a member of a noun phrase or a "filler". An example of the prediction of the transducer and the resulting classification is given in Table 1. We found that 2.4% of the words were labeled incorrectly. For comparison, when a single (zero-depth) suffix tree transducer is used to classify words, the error rate is about 15%. It is impractical to compare the performance of our transducer based tagger to previous works due to the many differences in the training and test data sets used to construct the different systems. Nevertheless, the low error rate achieved by the mixture of suffix tree transducers and the efficiency of its learning algorithm suggest that this approach might be a powerful tool for language processing. One of our future research goals is to investigate methods for incorporating linguistic knowledge through the set of priors.

Sentence:   Tom  Smith  ,    group chief executive of   U.K. metals
POS tag:    PNP  PNP    ,    NN    NN    NN        IN   PNP  NNS
Class:      1    1      0    1     1     1         0    1    1
Prediction: 0.99 0.99   0.01 0.98  0.98  0.98      0.02 0.99 0.99

Sentence:   and  industrial materials maker ,    will become chairman .
POS tag:    CC   JJ         NNS       NN    ,    MD   VB     NN       .
Class:      1    1          1         1     0    0    0      1        0
Prediction: 0.67 0.96       0.99      0.96  0.03 0.03 0.01   0.87     0.01

Table 1: Extraction of noun phrases using a suffix tree transducer. In this typical example, two long noun phrases were identified correctly with high confidence.
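For completeness, the per-tag decision rule used above amounts to comparing each prediction with 1/2. A tiny sketch (the helper name is ours; the probabilities are those shown in the first half of Table 1):

```python
# Threshold the transducer's per-tag predictions to obtain noun-phrase labels.

def label_noun_phrase_tags(predictions, threshold=0.5):
    return [1 if p > threshold else 0 for p in predictions]

preds = [0.99, 0.99, 0.01, 0.98, 0.98, 0.98, 0.02, 0.99, 0.99]
print(label_noun_phrase_tags(preds))   # -> [1, 1, 0, 1, 1, 1, 0, 1, 1], the Class row
```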

Acknowledgments

Thanks to Y. Bengio, Y. Freund, L. Lee, F. Pereira, D. Ron, M. Shaw, R. Schapire, N. Tishby, and M. Warmuth for helpful discussions. Part of this work was done while the author was at the Hebrew University of Jerusalem.

References

Y. Bengio and P. Frasconi. An input/output HMM architecture. In Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, 1994.

N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pages 382-391, 1993.

K.W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, pages 136-143, 1988.

T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.


A. DeSantis, G. Markowski, and M.N. Wegman. Learning probabilistic prediction functions. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 312-328, 1988.

C.L. Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4:393-405, 1992.

D. Haussler and A. Barron. How well do Bayes methods work for on-line prediction of {+1, -1} values? In The 3rd NEC Symposium on Computation and Cognition, 1993.

D.P. Helmbold and R.E. Schapire. Predicting nearly as well as the best pruning of a decision tree. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 61-68, 1995.

R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixture of local experts. Neural Computation, 3:79-87, 1991.

R.E. Krichevsky and V.K. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27:199-207, 1981.

N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

M.D. Riley. A statistical model for generating pronunciation networks. In Proc. of IEEE Conf. on Acoustics, Speech and Signal Processing, pages 737-740, 1991.

D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 1996. (to appear).

F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.
