Automatic Segmentation of Continuous Trajectories with Invariance to Nonlinear Warpings of Time

Lawrence K. Saul
AT&T Labs - Research
180 Park Ave, E-171
Florham Park, NJ 07932
[email protected]

Abstract

We study the classification problem that arises when two variables, one continuous (x) and one discrete (s), evolve jointly in time. We suppose that the vector x traces out a smooth multidimensional curve, to each point of which the variable s attaches a discrete label. The trace of s thus partitions the curve into different segments whose boundaries occur where s changes value. We consider how to learn the mapping between x and s from examples of segmented curves. Our approach is to model the conditional random process that generates segments of constant s along the curve of x. We suppose that the variable s evolves stochastically as a function of the arc length traversed by x. Since arc length does not depend on the rate at which a curve is traversed, this gives rise to a family of Markov processes whose predictions, $\Pr[s \mid x]$, are invariant to nonlinear warpings (or reparameterizations) of time. We show how to learn the parameters of these Markov processes from labeled and/or unlabeled examples of segmented curves. The resulting models are motivated by problems in automatic speech recognition, where x are acoustic features and s are phonetic transcriptions.
1 INTRODUCTION

The automatic segmentation of continuous trajectories poses a challenging problem in machine learning. The problem arises whenever a multidimensional trajectory $\{x(t)\,|\,t \in [0,\tau]\}$ must be described by a sequence of discrete labels $s_1 s_2 \ldots s_n$. A simple way to map trajectories into sequences is to specify consecutive time intervals such that $s(t) = s_k$ for $t \in [t_{k-1}, t_k]$. This attaches the labels $s_k$ to contiguous arcs along the trajectory. The learning problem is to discover such a mapping from labeled and/or unlabeled examples.

In this paper, we study this problem, paying special attention to the fact that curves have intrinsic geometric properties that do not depend on the rate at which they are traversed (do Carmo, 1976). Such properties include, for example, the total arc length and the maximum distance between any two points on the curve. Given a multidimensional trajectory $\{x(t)\,|\,t \in [0,\tau]\}$, these properties are invariant to reparameterizations $t \rightarrow f(t)$, where $f(t)$ is any monotonic function that maps the interval $[0,\tau]$ into itself. Put another way, the intrinsic geometric properties of the curve are invariant to nonlinear warpings of time.

Invariance to nonlinear warpings of time is an example of a mathematical symmetry. The importance of such symmetries in statistical pattern recognition (Duda & Hart, 1973) is well known. For example, in the problem of object recognition from two-dimensional images, one often incorporates invariances to translations, rotations, and changes of scale (Simard et al., 1993). In the segmentation of continuous trajectories, one naturally encounters the question of invariance to nonlinear warpings of time. A better understanding of this invariance is therefore valuable in its own right. Beyond its mathematical interest, however, the principled handling of this invariance suggests new algorithms for the automatic segmentation of continuous trajectories. Indeed, the primary motivation for this work is its potential application to automatic speech recognition, a subject to which we return in the final section of the paper.

The study of curves requires some simple notions from
differential geometry. As a matter of terminology, we refer to particular parameterizations of curves as trajectories. We regard two trajectories $x_1(t)$ and $x_2(t)$ as equivalent to the same curve if there exists a monotonically increasing function $f$ for which $x_1(t) = x_2(f(t))$. (To be precise, we mean the same oriented curve: the direction of traversal matters.) Here, as in what follows, we adopt the convention of using $x(t)$ to denote an entire trajectory, as opposed to constantly writing out $\{x(t)\,|\,t \in [0,\tau]\}$. When necessary to refer to the value of $x(t)$ at a particular moment in time, we use a different index, such as $x(t_1)$.

Let us return now to the problem of automatic segmentation. Consider two variables, one continuous (x) and one discrete (s), that evolve jointly in time. Thus the vector x traces out a smooth multidimensional curve, to each point of which the variable s attaches a discrete label. Note that each trace of s yields a partition of the curve into different components; in particular, the boundaries of these components occur at the points where s changes value. We refer to such partitions as segmentations and to the regions of constant s as segments; see figure 1. Our goal in this paper is to learn a probabilistic mapping between trajectories $x(t)$ and segmentations $s(t)$ from labeled and/or unlabeled examples.

Consider the conditional random process that generates segments of constant s along the curve traced out by x. Given a trajectory $x(t)$, let $\Pr[s(t) \mid x(t)]$ denote the conditional probability distribution over possible segmentations. Suppose that for any two equivalent trajectories $x(t)$ and $x(f(t))$, we have the identity:

$$\Pr[s(t) \mid x(t)] = \Pr[s(f(t)) \mid x(f(t))]. \tag{1}$$

Eq. (1) captures a fundamental invariance, namely that the probability that the curve is segmented in a particular way is independent of the rate at which it is traversed. In this paper, we study Markov processes with this property. We call them Markov processes on curves (MPCs) because for these processes it is unambiguous to write $\Pr[s \mid x]$ without providing explicit parameterizations for the trajectories, $x(t)$ or $s(t)$. The distinguishing feature of MPCs is that the variable s evolves as a function of the arc length traversed along x, a quantity that is manifestly invariant to nonlinear warpings of time.

The main contributions of this paper are: (i) to postulate eq. (1) as a fundamental invariance of random processes; (ii) to introduce MPCs as a family of probabilistic models that capture this invariance; (iii) to derive monotonically convergent learning procedures
for MPCs based on the principle of maximum likelihood estimation; and (iv) to contrast the properties of MPCs with those of hidden Markov models (HMMs), especially as they relate to problems in automatic speech recognition (Rabiner & Juang, 1993). In terms of previous work, our motivation most closely resembles that of Tishby (1990), who several years ago proposed a dynamical system approach to speech processing.

[Figure 1: Two variables, one continuous (x) and one discrete (s), evolve jointly in time. The trace of s partitions the curve of x into different segments (here $s(t) = s_1, s_2, s_3$ between the START point at $t = 0$ and the END point at $t = \tau$) whose boundaries occur where s changes value. Markov processes on curves model the conditional distribution $\Pr[s \mid x]$.]

The organization of this paper is as follows. In section 2, we begin by reviewing some basic concepts from differential geometry. We then introduce MPCs as a family of continuous-time Markov processes that parameterize the conditional probability distribution $\Pr[s \mid x]$. The processes are derived from a set of differential equations that describe the pointwise evolution of s along the curve traced out by x. In section 3, we consider how to learn the parameters of MPCs in both supervised and unsupervised settings. These settings correspond to whether the learner has access to labeled or unlabeled examples. Labeled examples consist of trajectories $x(t)$, along with their corresponding segmentations:

$$\{\text{start} \rightarrow (s_1, t_1) \cdots (s_n, t_n) \rightarrow \text{end}\}. \tag{2}$$

The ordered pairs in eq. (2) indicate that $s(t)$ takes the value $s_k$ between times $t_{k-1}$ and $t_k$; the start and end states are used to mark endpoints. Unlabeled examples consist only of the trajectories $x(t)$ and the boundary values:

$$\{(0, \text{start}) \longrightarrow (\tau, \text{end})\}. \tag{3}$$

Eq. (3) specifies only that the Markov process starts at time $t = 0$ and terminates at some later time $\tau$. In this case, the learner must infer its own target values for $s(t)$ in order to update its parameter estimates. We view both types of learning as instances of maximum
likelihood estimation and describe an Expectation-Maximization (EM) algorithm for the more general case of unlabeled (or partially labeled) examples. In section 4, we discuss the application of MPCs to automatic speech recognition (Rabiner & Juang, 1993). Here we can identify the curves x with time-varying spectral signatures and the segmentations s with phonetic transcriptions. We discuss possible advantages of MPCs over hidden Markov models, the current leading technology for automatic speech recognition. The most important of these are: (i) the natural handling of variations in speaking rate, i.e., the rate at which acoustic features (summarized by x) change with time; and (ii) the emphasis on learning a recognition model $\Pr[s \mid x]$, as opposed to a synthesis model $\Pr[x \mid s]$. Finally, we conclude by outlining our plans for future work.
2 MARKOV PROCESSES ON CURVES

Markov processes on curves are based fundamentally on the notion of arc length. After reviewing how to compute arc lengths along curves, we show how they can be used to define random processes that capture the invariance of eq. (1).
2.1 ARC LENGTH

Let $g(x)$ define a $D \times D$ matrix-valued function over $x \in \mathbb{R}^D$. If $g(x)$ is everywhere non-negative definite, then we can use it as a metric to compute distances along curves. In particular, consider two nearby points separated by the infinitesimal vector $dx$. We define the squared distance between these two points as:

$$d\ell^2 = dx^\top g(x)\, dx. \tag{4}$$
Arc length along a curve is the non-decreasing function computed by integrating these local distances. Thus, for the trajectory $x(t)$, the arc length between the points $x(t_1)$ and $x(t_2)$ is given by:

$$\ell = \int_{t_1}^{t_2} dt\, \left[\dot{x}^\top g(x)\, \dot{x}\right]^{\frac{1}{2}}, \tag{5}$$
where $\dot{x} = \frac{d}{dt}[x(t)]$ denotes the time derivative of x. Note that the arc length between two points is invariant under reparameterizations of the trajectory, $x(t) \rightarrow x(f(t))$, where $f(t)$ is any smooth monotonic function of time that maps the interval $[t_1, t_2]$ into itself.

In the special case where $g(x)$ is the identity matrix, eq. (5) reduces to the standard definition of arc length in Euclidean space. More generally, however, eq. (4) defines a non-Euclidean metric for computing arc lengths. Thus, for example, if the metric $g(x)$ varies as a function of x, then eq. (5) can assign different arc lengths to the trajectories $x(t)$ and $x(t) + x_0$, where $x_0$ is a constant displacement.
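As a concrete illustration (our sketch, not part of the original paper), the following Python fragment approximates eq. (5) for a sampled trajectory by summing the discretized integrand; the function and variable names are ours. The second example verifies numerically that a monotonic warping of time leaves the computed arc length unchanged.

```python
import numpy as np

def arc_length(x, dt, metric):
    """Approximate eq. (5): integrate [xdot^T g(x) xdot]^(1/2) over time
    for a trajectory sampled as an array x of shape (T, D)."""
    xdot = np.gradient(x, dt, axis=0)            # finite-difference velocity
    return sum(np.sqrt(v @ metric(p) @ v) * dt for p, v in zip(x, xdot))

# Euclidean metric (g = identity): a unit circle has arc length 2*pi.
t = np.linspace(0.0, 2.0 * np.pi, 2001)
circle = np.column_stack([np.cos(t), np.sin(t)])
print(arc_length(circle, t[1] - t[0], lambda p: np.eye(2)))      # ~6.283

# The same curve traversed at a nonuniform rate has the same length.
s = np.linspace(0.0, 1.0, 2001)
warped = np.column_stack([np.cos(2 * np.pi * s**2), np.sin(2 * np.pi * s**2)])
print(arc_length(warped, s[1] - s[0], lambda p: np.eye(2)))      # ~6.283
```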
2.2 STATES AND LIFETIMES

The problem of segmentation is to map a trajectory $x(t)$ into a sequence of discrete labels $s_1 s_2 \ldots s_n$. If these labels are attached to contiguous arcs along the curve of x, then we can describe this sequence by a piecewise constant function of time, $s(t)$, as in figure 1. We refer to the possible values of s as states. In what follows, we introduce a family of conditional random processes that evolve s as a function of the arc length traversed along the curve traced out by x. These random processes are based on a simple premise, namely that the probability of remaining in a particular state decays exponentially with the cumulative arc length traversed in that state. The signature of a state is the particular way in which it computes arc length.

To formalize this idea, we associate with each state i the following quantities: (i) a position-dependent matrix $g_i(x)$ that can be used to compute arc lengths, as in eq. (5); (ii) a decay parameter $\lambda_i$ that measures the probability per unit arc length that s makes a transition from state i to some other state; and (iii) a set of transition probabilities $a_{ij}$, where $a_{ij}$ represents the probability that, having decayed out of state i, the variable s makes a transition to state j. Thus, $a_{ij}$ defines a stochastic transition matrix with zero elements along the diagonal and rows that sum to one: $a_{ii} = 0$ and $\sum_j a_{ij} = 1$. Together, these quantities can be used to define a Markov process along the curve traced out by x. In particular, let $p_i(t)$ denote the probability that s is in state i at time t, based on its history up to that point in time. A Markov process is defined by the set of differential equations:
$$\frac{dp_i}{dt} = -\lambda_i\, p_i \left[\dot{x}^\top g_i(x)\, \dot{x}\right]^{\frac{1}{2}} + \sum_{j \neq i} \lambda_j\, p_j\, a_{ji} \left[\dot{x}^\top g_j(x)\, \dot{x}\right]^{\frac{1}{2}}. \tag{6}$$

The right hand side of eq. (6) consists of two competing terms. The first term computes the probability that s decays out of state i; the second computes the probability that s decays into state i. Both probabilities are proportional to measures of arc length, and combining them gives the overall change in probability that occurs in the time interval $[t, t+dt]$. The process is Markovian because the evolution of $p_i$ depends only on quantities available at time t; thus the future is independent of the past given the present.

Eq. (6) has certain properties of interest. First, note that summing both sides over i gives the identity $\sum_i dp_i/dt = 0$. This shows that $p_i$ remains a normalized probability distribution: i.e., $\sum_i p_i = 1$ at all times. Second, suppose that we start in state i and do not allow return visits: i.e., $p_i = 1$ and $a_{ji} = 0$ for all j. In this case, the second term of eq. (6) vanishes, and we obtain a simple, one-dimensional linear differential equation for $p_i(t)$. It follows that the probability of remaining in state i decays exponentially with the amount of arc length traversed by x, where arc length is computed using the matrix $g_i(x)$. The decay parameter $\lambda_i$ controls the typical amount of arc length traversed in state i; it may be viewed as an inverse lifetime or, to be more precise, an inverse lifelength. Finally, noting that arc length is a reparameterization-invariant quantity, we observe that these dynamics capture the fundamental invariance of eq. (1).

Let $a_{0i}$ denote the probability that the variable s makes an immediate transition from the start state, denoted by the zero index, to state i; put another way, this is the probability that the first segment belongs to state i. Given a trajectory $x(t)$, the Markov process in eq. (6) gives rise to a conditional probability distribution over possible segmentations, $s(t)$. Consider the segmentation in which $s(t)$ takes the value $s_k$ between times $t_{k-1}$ and $t_k$, and let

$$\ell_{s_k} = \int_{t_{k-1}}^{t_k} dt\, \left[\dot{x}^\top g_{s_k}(x)\, \dot{x}\right]^{\frac{1}{2}} \tag{7}$$
denote the arc length traversed in state $s_k$. From eq. (6), we know that the probability of remaining in a particular state decays exponentially with this arc length. Thus, the conditional probability of this segmentation is given by:

$$\Pr[s \mid x] = \prod_{k=1}^{n} \lambda_{s_k} e^{-\lambda_{s_k} \ell_{s_k}} \prod_{k=0}^{n} a_{s_k s_{k+1}}, \tag{8}$$

where we have used $s_0$ and $s_{n+1}$ to denote the start and end states of the Markov process. The first product in eq. (8) multiplies the probabilities that each segment traverses exactly its observed arc length. The second product multiplies the probabilities for transitions between states $s_k$ and $s_{k+1}$. The leading factors of $\lambda_{s_k}$ are included to normalize each state's duration model.

2.3 INFERENCE

There are many important quantities that can be computed from the distribution $\Pr[s \mid x]$. Of particular interest is the most probable segmentation:

$$s^* = \arg\max_s \left\{ \ln \Pr[s \mid x] \right\}. \tag{9}$$

Given a particular trajectory $x(t)$, eq. (9) calls for a maximization over all piecewise constant functions of time, $s(t)$. In practice, this maximization can be performed by discretizing the time axis and applying a dynamic programming procedure. The resulting segmentations will be optimal at some finite temporal resolution, $\Delta t$. For example, let $\alpha_i(t)$ denote the log-likelihood of the most probable segmentation, ending in state i, of the subtrajectory up to time t. Starting from the initial condition $\alpha_i(0) = \ln[a_{0i}]$, we compute

$$\alpha_j(t + \Delta t) = \max_i \left\{ \alpha_i(t) - \lambda_i \Delta t \left[\dot{x}^\top g_i(x)\, \dot{x}\right]^{\frac{1}{2}} + \ln[\lambda_i a_{ij}]\,(1 - \delta_{ij}) \right\}, \tag{10}$$

where $\delta_{ij}$ is the discrete delta function. Also, at each time step, let $\Psi_j(t + \Delta t)$ record the value of i that maximizes the right hand side of eq. (10). Suppose that the Markov process terminates at time $\tau$. Enforcing the endpoint condition $s^*(\tau) = \text{end}$, we find the most likely segmentation by back-tracking:

$$s^*(t - \Delta t) = \Psi_{s^*(t)}(t). \tag{11}$$

These recursions yield a segmentation that is optimal at some finite temporal resolution $\Delta t$. Generally speaking, by choosing $\Delta t$ to be sufficiently small, one can minimize the errors introduced by discretization. In practice, one would choose $\Delta t$ to reflect the time scale beyond which it is not necessary to consider changes of state.

Other types of inferences can also be made from the distribution, eq. (8). For example, one can compute the marginal probability that the Markov process terminates at precisely the observed time. This is done by summing the probabilities

$$\Pr[s(\tau) = \text{end} \mid x(t)] = \sum_{s(t)} \Pr[s(t) \mid x(t)] \times \begin{cases} 1 & \text{if } s(\tau) = \text{end}, \\ 0 & \text{otherwise}, \end{cases} \tag{12}$$

where the zero-one weighting factor selects out only those segmentations that terminate precisely at time $\tau$. Similarly, one can compute the posterior probability, $\Pr[s(t_1) = i \mid x(t), s(\tau) = \text{end}]$, that at an earlier moment in time, $t_1$, the variable s was in state i. Both types of inferences are handled by discretizing the time axis and applying a dynamic programming procedure similar to eqs. (10)-(11). In the interest of brevity, we do not give the details of these constructions, noting only that in most respects they are completely analogous to the ones for discrete-time hidden Markov models (Rabiner & Juang, 1993).
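To make the inference procedure concrete, here is a minimal sketch (ours) of the discretized recursion in eqs. (10)-(11). For simplicity it backtracks from the best final state rather than enforcing an explicit end state; all names are illustrative.

```python
import numpy as np

def most_probable_segmentation(x, dt, metrics, lam, a, a0):
    """Viterbi-style dynamic programming for eqs. (10)-(11).
    x: (T, D) sampled trajectory; metrics: list of N functions g_i(x);
    lam: (N,) decay parameters; a: (N, N) transition matrix with zero
    diagonal; a0: (N,) start probabilities."""
    T, N = len(x), len(lam)
    xdot = np.gradient(x, dt, axis=0)
    with np.errstate(divide="ignore"):
        logtrans = np.log(lam[:, None] * a)   # ln[lam_i a_ij] for i != j
    np.fill_diagonal(logtrans, 0.0)           # (1 - delta_ij) removes it at i = j
    alpha = np.log(a0)                        # alpha_i(0) = ln a_{0i}
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        dl = np.array([np.sqrt(xdot[t] @ metrics[i](x[t]) @ xdot[t]) * dt
                       for i in range(N)])    # arc length traversed in state i
        cand = (alpha - lam * dl)[:, None] + logtrans   # score of move i -> j
        back[t] = cand.argmax(axis=0)
        alpha = cand.max(axis=0)
    s = np.empty(T, dtype=int)                # backtrack, eq. (11)
    s[-1] = alpha.argmax()                    # simplification: best final state
    for t in range(T - 1, 0, -1):
        s[t - 1] = back[t, s[t]]
    return s
```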
3 LEARNING FROM EXAMPLES

In this section, we consider how to learn Markov processes of the form, eq. (6). By learning, we mean how to estimate the parameters $\{\lambda_i, a_{ij}, g_i(x)\}$ from examples of segmented (or non-segmented) curves. Our first step is to assume a convenient parameterization for the matrices $g_i(x)$ that compute arc lengths. We then show how to fit these matrices, along with the parameters $\lambda_i$ and $a_{ij}$, by maximum likelihood estimation.

A variety of parameterizations can be considered for the matrices $g_i(x)$. In this paper, we consider the very simple form:

$$g_i(x) = \left[(x - \mu_i)^\top \Phi_i^{-1} (x - \mu_i)\right]^2 \Sigma_i^{-1}, \tag{13}$$

where the parameters $\mu_i$, $\Phi_i$ and $\Sigma_i$ are set by maximum likelihood estimation. Here, $\Phi_i$ and $\Sigma_i$ are positive-definite $D \times D$ square matrices, while $\mu_i$ is a D-dimensional vector. We also impose the determinant constraint $|\Phi_i|\,|\Sigma_i|^{\frac{1}{2}} = 1$; this eliminates the degenerate solution, $g_i(x) = 0$, in which every trajectory is assigned zero arc length. Note that there remains an artificial degree of freedom associated with simultaneously rescaling $\Phi_i$ and $\Sigma_i$.

The form of eq. (13) is designed to endow each state with a characteristic signature. In particular, consider the differential arc lengths that appear in eq. (6):

$$\left[\dot{x}^\top g_i(x)\, \dot{x}\right]^{\frac{1}{2}} = \left[(x - \mu_i)^\top \Phi_i^{-1} (x - \mu_i)\right] \left[\dot{x}^\top \Sigma_i^{-1}\, \dot{x}\right]^{\frac{1}{2}}.$$

If x is close to $\mu_i$, then both the arc length and the corresponding probability of decay (out of state i) are small. Each state is therefore characterized by the values of x that allow it to persist. Intuitively, the parameters $\mu_i$ can be viewed as target vectors associated with each state of the Markov process. Typical deviations about $\mu_i$ are encoded by $\Phi_i$ and $\Sigma_i$. In what follows, we show how to learn the parameters that best characterize each state.
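Under this parameterization, the differential arc length factors into a position term and a velocity term. A small helper (ours, assuming the form of eq. (13) as reconstructed above) makes the state-signature intuition explicit:

```python
import numpy as np

def differential_arc_length(x, xdot, mu, Phi_inv, Sigma_inv):
    """[xdot^T g_i(x) xdot]^(1/2) for the metric of eq. (13): the position
    term vanishes as x approaches the target mu, so the state persists
    cheaply near its characteristic region of feature space."""
    position = (x - mu) @ Phi_inv @ (x - mu)      # (x - mu)^T Phi^{-1} (x - mu)
    velocity = np.sqrt(xdot @ Sigma_inv @ xdot)   # [xdot^T Sigma^{-1} xdot]^(1/2)
    return position * velocity
```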
3.1 LABELED EXAMPLES

Suppose we are given examples of segmented trajectories, $\{x^\alpha(t), s^\alpha(t)\}$, where the index $\alpha$ runs over the examples in the training set. As shorthand, let $\mathbb{1}_i^\alpha(t)$ denote the indicator function that selects out segments associated with state i:

$$\mathbb{1}_i^\alpha(t) = \begin{cases} 1 & \text{if } s^\alpha(t) = i, \\ 0 & \text{otherwise}. \end{cases} \tag{14}$$

Also, let $\ell_i^\alpha$ denote the total arc length traversed by state i in the $\alpha$th example:

$$\ell_i^\alpha = \int dt\; \mathbb{1}_i^\alpha(t) \left[\dot{x}_\alpha^\top g_i(x_\alpha)\, \dot{x}_\alpha\right]^{\frac{1}{2}}. \tag{15}$$

In this paper we view learning as a problem in maximum likelihood estimation. Thus we seek the parameters that maximize the conditional log-likelihood:

$$\sum_\alpha \ln \Pr[s^\alpha \mid x^\alpha] = -\sum_{i\alpha} \lambda_i \ell_i^\alpha + \sum_{ij} n_{ij} \ln[\lambda_i a_{ij}], \tag{16}$$

where $n_{ij}$ is the overall number of observed transitions from state i to state j. The first term in eq. (16) measures the log-likelihood of observed segments in isolation, while the second measures the log-likelihood of observed transitions.

Eq. (16) has a convenient form for maximum likelihood estimation. In particular, there are closed-form solutions for the values of $\lambda_i$ and $a_{ij}$ that maximize this log-likelihood; they are given by:

$$a_{ij} = n_{ij}/n_i, \tag{17}$$

$$\lambda_i^{-1} = \frac{1}{n_i} \sum_\alpha \ell_i^\alpha, \tag{18}$$
where $n_i = \sum_j n_{ij}$. In general, we cannot find closed-form solutions for the maximum-likelihood estimates of $\{\mu_i, \Phi_i, \Sigma_i\}$. However, we can update these parameters in an iterative fashion that is guaranteed to increase the log-likelihood at each step. Denoting the updated parameters by $\{\tilde{\mu}_i, \tilde{\Phi}_i, \tilde{\Sigma}_i\}$, we consider the iterative scheme (derived in the appendix):

$$\tilde{\mu}_i = \frac{\sum_\alpha \int dt\; \mathbb{1}_i^\alpha \left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}} x_\alpha}{\sum_\alpha \int dt\; \mathbb{1}_i^\alpha \left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}}}, \tag{19}$$

$$\tilde{\Phi}_i = \frac{\sum_\alpha \int dt\; \mathbb{1}_i^\alpha \left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}} \delta_i^\alpha \delta_i^{\alpha\top}}{\sum_\alpha \int dt\; \mathbb{1}_i^\alpha \left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}}}, \tag{20}$$

$$\tilde{\Sigma}_i = c_i \sum_\alpha \int dt\; \mathbb{1}_i^\alpha \left[\delta_i^{\alpha\top} \tilde{\Phi}_i^{-1} \delta_i^\alpha\right] \frac{\dot{x}_\alpha \dot{x}_\alpha^\top}{\left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}}}, \tag{21}$$
where the constant $c_i$ is determined by the determinant constraint $|\tilde{\Phi}_i|\,|\tilde{\Sigma}_i|^{\frac{1}{2}} = 1$ and we have introduced the shorthand notation

$$\delta_i^\alpha(t) = x^\alpha(t) - \tilde{\mu}_i \tag{22}$$

for the difference between $x^\alpha(t)$ and its (re-estimated) target value in state i. Note that all the variables in eqs. (19)-(21) with the subscript $\alpha$ have an implicit time dependence.

Some intuition for the form of these updates can be gained by considering the points distributed along $x^\alpha(t)$, as weighted by the measure $\mathbb{1}_i^\alpha(t) \left[\dot{x}_\alpha^\top \Sigma_i^{-1} \dot{x}_\alpha\right]^{\frac{1}{2}}$. The updates for $\mu_i$ and $\Phi_i$ simply compute the mean and covariance of this distribution. The update for $\Sigma_i$ has a similar interpretation, though its derivation relies on the introduction of an auxiliary function, $Q(\tilde{\Sigma}_i, \Sigma_i)$, as in the Expectation-Maximization (EM) procedure (Dempster, Laird, & Rubin, 1977). Note that it is important to perform the updates in the order shown, since (for example) the $\tilde{\Sigma}_i$-update depends on the re-estimated value of $\tilde{\Phi}_i$. By taking gradients of eq. (16), one can show that the fixed points of this iterative procedure correspond to stationary points of the log-likelihood. A proof sketch of monotonic convergence is given in the appendix.

In the case of labeled examples, the above procedures for maximum likelihood estimation can be invoked independently for each state i. One first iterates eqs. (19)-(21) to estimate the parameters that determine $g_i(x)$. These parameters are then used to compute the arc lengths, $\ell_i^\alpha$, that appear in eq. (15). Given these arc lengths, the decay parameters and transition probabilities follow directly from eqs. (17)-(18). Thus the problem of learning given labeled examples is relatively straightforward.
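Given labeled segmentations, eqs. (17)-(18) can be computed in a few lines. The sketch below (ours) omits the bookkeeping for the start and end states and assumes the per-segment arc lengths have already been computed via eq. (15).

```python
import numpy as np

def estimate_decay_and_transitions(state_seqs, seg_lengths, n_states):
    """Closed-form MLE of eqs. (17)-(18).
    state_seqs: per example, the list of visited states [s_1, ..., s_n];
    seg_lengths: per example, the matching arc lengths [l_1, ..., l_n]."""
    n = np.zeros((n_states, n_states))           # n_ij: counts of i -> j
    total_arc = np.zeros(n_states)               # sum over examples of l_i
    for states, lengths in zip(state_seqs, seg_lengths):
        for s_k, l_k in zip(states, lengths):
            total_arc[s_k] += l_k
        for s_k, s_next in zip(states[:-1], states[1:]):
            n[s_k, s_next] += 1
    n_i = n.sum(axis=1)
    a = n / np.maximum(n_i, 1.0)[:, None]        # eq. (17): a_ij = n_ij / n_i
    lam = n_i / np.maximum(total_arc, 1e-12)     # eq. (18), guarded division
    return lam, a
```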
3.2 UNLABELED EXAMPLES

In this section we consider the problem of unsupervised learning. In this setting, the learner does not have access to labeled examples; the only available information consists of the trajectories $x^\alpha(t)$, as well as the fact that each process terminates at some time $\tau_\alpha$. The goal of unsupervised learning is to maximize the conditional log-likelihood

$$\sum_\alpha \ln \Pr[s^\alpha(\tau_\alpha) = \text{end} \mid x^\alpha(t)], \tag{23}$$

which measures the probability that for each trajectory $x^\alpha(t)$, some probable segmentation can be found that terminates at precisely the observed time. The marginal probabilities in eq. (23)
are computed by summing $\Pr[s(t) \mid x(t)]$ over allowed segmentations, as in eq. (12).

The maximization of this log-likelihood defines a problem in hidden variable density estimation. The hidden variables are the states of the Markov process. If these variables were known, the problem would reduce to the one considered in the previous section. To fill in these missing values, we avail ourselves of the Expectation-Maximization (EM) algorithm (Baum, 1972; Dempster, Laird, & Rubin, 1977). Roughly speaking, the EM algorithm works by converting the maximization of eq. (23) into a weighted version of the problem where the segmentations, $s^\alpha(t)$, are known. The weights are determined by the posterior probabilities, $\Pr[s(t) \mid x^\alpha(t), s^\alpha(\tau_\alpha) = \text{end}]$, derived from the current parameter estimates.

In the interest of brevity, we do not give a detailed account of the full EM algorithm for MPCs. We note, however, that eqs. (10)-(11) by themselves suffice to implement a very good approximation to the full procedure. This approximation is to compute, based on the current parameter estimates, the optimal segmentation, $s^{*\alpha}(t)$, for each trajectory in the training set; one then re-estimates the parameters of the Markov process by treating the inferred segmentations, $s^{*\alpha}(t)$, as targets. This approximation reduces the problem of parameter estimation to the one considered in the previous section. It can be viewed as a winner-take-all approximation to the full EM algorithm, analogous to the Viterbi approximation for hidden Markov models (Rabiner & Juang, 1993).

Essentially the same algorithm can also be applied to the intermediate case of partially labeled examples. Suppose, for example, that the learner has access to labeled state sequences but not to segmented curves; in other words, examples are provided in the form:

$$\{\text{start} \rightarrow (s_1, ?) \cdots (s_n, ?) \rightarrow \text{end}\}. \tag{24}$$

The ability to handle such examples is important for two reasons: first, because they provide significantly more information than unlabeled examples, and second, because they are often much cheaper to generate than fully segmented curves. As before, we can view the learning problem for these examples as one in hidden variable density estimation. In this case, the hidden variables are not the states of the Markov process per se, but only the times at which they change. We can incorporate knowledge of the state sequence into the EM algorithm simply by restricting the sums over paths in eqs. (10) and (12) to those that pass through the desired sequence.
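The winner-take-all approximation amounts to a short alternating loop. The skeleton below (ours) reuses the most_probable_segmentation sketch from section 2.3 and a hypothetical reestimate routine standing in for eqs. (17)-(21); it is an outline of the procedure, not the full EM algorithm.

```python
def viterbi_em(trajectories, dts, params, n_iters=10):
    """Winner-take-all (Viterbi-style) approximation to EM for MPCs."""
    for _ in range(n_iters):
        # Hard E-step: decode each trajectory under the current parameters,
        # using the dynamic programming recursion of eqs. (10)-(11).
        segs = [most_probable_segmentation(x, dt, params.metrics,
                                           params.lam, params.a, params.a0)
                for x, dt in zip(trajectories, dts)]
        # M-step: treat the inferred segmentations as labeled targets and
        # re-estimate {lam_i, a_ij, g_i(x)} as in section 3.1 (hypothetical
        # helper implementing eqs. (17)-(21)).
        params = reestimate(trajectories, dts, segs, params)
    return params
```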
4 AUTOMATIC SPEECH RECOGNITION

The Markov processes in this paper were conceived as models for automatic speech recognition (Rabiner & Juang, 1993). Speech recognizers take as input a sequence of feature vectors, each of which encodes a short window of speech. Acoustic feature vectors typically have ten or more components, so that a particular sequence of feature vectors can be viewed as tracing out a multidimensional curve. The goal of a speech recognizer is to translate this curve into a sequence of words, or more generally, a sequence of sub-syllabic units known as phonemes. Denoting the feature vectors by $x_t$ and the phonemes by $s_t$, we can view this problem as the discrete-time equivalent of the segmentation problem in MPCs.

Why consider MPCs as models of speech recognition? Hidden Markov models (HMMs), the current leading technology, are also based on probabilistic methods. These models manipulate joint distributions of the form:

$$\Pr[s, x] = \prod_t \Pr[s_t \mid s_{t-1}]\, \Pr[x_t \mid s_t]. \tag{25}$$
Though HMMs have led to significant advances in speech recognition, they are handicapped by certain weaknesses. One of these is the poor manner in which they handle variations in speaking rate. Intuitively, we can represent these variations by nonlinear warpings of time. For example, consider the pair of trajectories $x_t$ and $y_t$, where $y_t$ is created by the doubling operation:

$$y_t = \begin{cases} x_{t/2} & \text{if } t \text{ even}, \\ y_{t-1} & \text{if } t \text{ odd}. \end{cases} \tag{26}$$

Both trajectories trace out the same curve, but $y_t$ does so at half the rate of $x_t$. Hidden Markov models will not assign these trajectories the same likelihood, nor are they guaranteed to infer equivalent segmentations. This example shows that HMMs do not even approximately capture the invariances modeled by MPCs or other arc-length based descriptions of speech (Tishby, 1990).

Admittedly, the warping in eq. (26) represents a highly idealized picture of acoustic variability. Nevertheless, there is a great deal of empirical evidence that HMMs suffer from the inability to model variations in speaking rate (Siegler & Stern, 1995).
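The effect of eq. (26) is easy to verify numerically. In the toy check below (ours), a caricature of an HMM's frame-by-frame emission score doubles under the doubling operation, while the discretized arc length of section 2.1 is exactly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3)).cumsum(axis=0)   # a sampled trajectory x_t
y = np.repeat(x, 2, axis=0)                        # eq. (26): every frame doubled

# An HMM-style score accrues one term per frame, so it grows with duration.
frame_score = lambda traj: sum(-0.5 * f @ f for f in traj)
print(frame_score(y) / frame_score(x))             # = 2.0: not warp-invariant

# Arc length sums |x_{t+1} - x_t|; repeated frames contribute nothing.
arc = lambda traj: np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
print(arc(y) / arc(x))                             # = 1.0: warp-invariant
```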
For example, word error rates increase dramatically when one moves from scripted to spontaneous speech. Also, one generally observes that consonants are more frequently botched than vowels. The reason is that in HMMs, the contribution of particular states to the overall log-likelihood is in direct proportion to their duration. Thus training procedures designed to maximize the log-likelihood are inherently biased to model long-lived phonemes (i.e., vowels) more accurately than short-lived ones.

MPCs are quite different from HMMs in this respect. In MPCs, the contribution of each state to the log-likelihood is determined by its arc length. The weighting by arc length attaches a more important role to short-lived but non-stationary phonemes. Of course, one can imagine heuristics in HMMs that achieve the same effect, such as dividing each state's contribution to the log-likelihood by its observed (or inferred) duration. Unlike such heuristics, however, the state-dependent metric $g(x)$ in MPCs is learned from data; in particular, it is designed to reweight the speech signal in a way that reflects the actual statistics of acoustic trajectories.

So far we have emphasized the invariance to nonlinear warpings of time as the main difference between MPCs and HMMs. Another important difference, however, lies in what each tries to model. While MPCs attempt to model the conditional distribution $\Pr[s \mid x]$, HMMs attempt to model the joint distribution, $\Pr[s, x]$. Only the former is required for speech recognition, yet HMMs attempt something much more ambitious by learning a generative model of acoustic trajectories. Maximum likelihood training in HMMs is designed to increase the likelihood of observed trajectories, $\Pr[x]$. Unfortunately, because HMMs do not represent the true model of speech, maximizing this likelihood does not always translate into minimizing error rates. These issues point to yet another difference between MPCs and HMMs. Learning in MPCs is directed at learning a recognition model, $\Pr[s \mid x]$, as opposed to a synthesis model, $\Pr[x \mid s]$. The direction of conditioning is a crucial difference between maximum likelihood estimation in MPCs and HMMs.

In terms of previous work, our motivation for MPCs most closely resembles that of Tishby (1990), who stressed the importance of invariance to nonlinear warpings of time as a mathematical symmetry. In that MPCs stress the continuous nature of the speech signal, they also bear some resemblance to so-called segmental acoustic models (Ostendorf, Digalakis, & Kimball, 1996) of speech. Unlike HMMs, segmental acoustic models enforce the constraint that acoustic feature vectors within the same phonemic state trace out a continuous trajectory. Despite this shared emphasis on continuity, however, segmental models and MPCs
differ in fundamental respects. In particular, segmental models incorporate the constraint of continuity by building a more complicated synthesis model $\Pr[x \mid s]$ of acoustic trajectories. They retain, however, the usual Markov assumption between states:

$$\Pr[s_t \mid s_{t-1}, s_{t-2}, \ldots] = \Pr[s_t \mid s_{t-1}]. \tag{27}$$

By contrast, MPCs build a recognition model $\Pr[s \mid x]$ whose very definition is conditioned on the existence of a continuous trajectory. Moreover, the Markov assumption in MPCs, as embodied by eq. (6), is conditioned on the current position and tangent vector of the acoustic feature trajectory. This differs from the Markov assumption in eq. (27), which is made independent of (or unconditioned on) the acoustic features. Finally, to the best of our knowledge, MPCs are novel in two key respects: the formulation of a warp-invariant probabilistic model explicitly in terms of arc length, and the emphasis on learning a metric $g(x)$ for each hidden state of the Markov process. These ideas differentiate MPCs from segmental acoustic models as well as ordinary HMMs.

The starting point of this work was to postulate eq. (1) as an invariance of random processes. Of course, it would be naive to expect speech signals to exhibit a strict invariance to nonlinear warpings of time. The acoustic realization of a phoneme does depend to some extent on the speaking rate, and certain phonemes are more likely to be stretched or shortened than others. To accommodate this, one can relax the warping invariance in MPCs. This is most easily done by building models of the space-time[1] trajectories $X(t) = \{x(t), t\}$ and computing generalized arc lengths, $dL = [\dot{X}^\top G(X)\, \dot{X}]^{\frac{1}{2}} dt$, where $\dot{X} = \{\dot{x}, 1\}$ and $G(X)$ is a space-time metric. The effect of replacing $\dot{x}$ by $\dot{X}$ is to allow each acoustic feature vector to contribute a finite amount to the overall log-likelihood even when $|\dot{x}|$ is zero, that is, even when it represents a perfectly stationary frame of speech.

[1] The admixture of space and time coordinates in this way is an old idea from physics, originating in the theory of relativity (Einstein, 1924) (though in that context the metric is negative-definite).

We are currently evaluating MPCs as engines for automatic speech recognition. Naturally, we expect that many further elaborations will be required to surpass the finely tuned performance of modern recognizers. These may include more sophisticated parameterizations of the metric $g_i(x)$, the use of information from higher order derivatives (e.g., $\dot{x}$ and $\ddot{x}$), and/or transition probabilities $a_{ij}(x)$ that vary along the length
of the curve. Nevertheless, we hope that this paper serves to introduce the basic principles of MPCs, as well as to suggest an intriguing departure from traditional methods in automatic speech recognition.
A REESTIMATION FORMULAS

In this appendix we derive the reestimation formulas, eqs. (19)-(21), and show that they lead to monotonic increases in the log-likelihood, eq. (16). We begin by examining a simpler problem. Let $\{x(t)\,|\,t \in [0,\tau]\}$ denote a D-dimensional trajectory, and let $\psi(x) \geq 0$ denote an everywhere non-negative function of x. Now consider the function:

$$\ell(\Sigma) = \int_0^\tau dt\, \left[\dot{x}^\top \Sigma^{-1} \dot{x}\right]^{\frac{1}{2}} \psi(x(t)), \tag{28}$$

where $\Sigma$ is a $D \times D$ positive-definite matrix. The right hand side of eq. (28) clearly depends on the trajectory $x(t)$ and the function $\psi(x)$, but for now let us regard both of these as fixed and consider $\ell(\Sigma)$ simply as a function of the matrix $\Sigma$. Since $\Sigma$ is positive-definite and $\psi(x) \geq 0$, we immediately observe that the function $\ell(\Sigma)$ is bounded below by zero.

Let us consider how to find the value of $\Sigma$ that minimizes $\ell(\Sigma)$, subject to the determinant constraint $|\Sigma| = 1$. Note that the matrix elements of $\Sigma^{-1}$ appear nonlinearly in the right hand side of eq. (28); thus it is not possible to compute their optimal values in closed form. As an alternative, we consider the auxiliary function:

$$Q(\hat{\Sigma}, \Sigma) = \int_0^\tau dt\; \frac{\psi(x(t))}{2} \left\{ \frac{\dot{x}^\top \hat{\Sigma}^{-1} \dot{x}}{\left[\dot{x}^\top \Sigma^{-1} \dot{x}\right]^{\frac{1}{2}}} + \left[\dot{x}^\top \Sigma^{-1} \dot{x}\right]^{\frac{1}{2}} \right\}, \tag{29}$$

where $\hat{\Sigma}$ is a $D \times D$ positive-definite matrix like $\Sigma$. It follows directly from the definition in eq. (29) that $\ell(\Sigma) = Q(\Sigma, \Sigma)$. Somewhat less trivially, we observe that $Q(\hat{\Sigma}, \Sigma) \geq Q(\hat{\Sigma}, \hat{\Sigma})$ for all positive definite matrices $\hat{\Sigma}$ and $\Sigma$. This inequality follows from the concavity of the square root function, as illustrated in figure 2.
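For completeness, the tangent bound in figure 2 is a one-line consequence of the arithmetic-geometric mean inequality; this verification is ours, in the same notation as eq. (29).

```latex
% For z >= 0 and rho > 0, AM-GM gives
%   (1/2)[ z/sqrt(rho) + sqrt(rho) ] >= sqrt( (z/sqrt(rho)) * sqrt(rho) ) = sqrt(z),
% with equality iff z = rho. Applying this pointwise with
% z = xdot^T SigmaHat^{-1} xdot and rho = xdot^T Sigma^{-1} xdot, then
% integrating against psi(x(t))/2 >= 0, yields Q(SigmaHat, Sigma) >= l(SigmaHat).
\[
  \sqrt{z} \;\le\; \tfrac{1}{2}\!\left[\frac{z}{\sqrt{\rho}} + \sqrt{\rho}\right],
  \qquad z \ge 0,\; \rho > 0 .
\]
```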
Consider the value of $\hat{\Sigma}$ which minimizes $Q(\hat{\Sigma}, \Sigma)$, subject to the determinant constraint $|\hat{\Sigma}| = 1$. We denote this value by $\tilde{\Sigma} = \arg\min_{|\hat{\Sigma}|=1} Q(\hat{\Sigma}, \Sigma)$. Because the matrix elements of $\hat{\Sigma}^{-1}$ appear linearly in $Q(\hat{\Sigma}, \Sigma)$, this minimization essentially reduces to computing the covariance matrix of the tangent vector $\dot{x}$, as distributed along the trajectory $x(t)$. In particular, we have:

$$\tilde{\Sigma} \propto \int_0^\tau dt\; \frac{\dot{x} \dot{x}^\top}{\left[\dot{x}^\top \Sigma^{-1} \dot{x}\right]^{\frac{1}{2}}}\, \psi(x(t)), \tag{30}$$

where the constant of proportionality is determined by the constraint $|\tilde{\Sigma}| = 1$. To minimize $\ell(\Sigma)$ with respect to $\Sigma$, we now consider the iterative procedure where at each step we replace $\Sigma$ by $\tilde{\Sigma}$. We observe that:

$$\ell(\tilde{\Sigma}) = Q(\tilde{\Sigma}, \tilde{\Sigma}) \leq Q(\tilde{\Sigma}, \Sigma) \leq Q(\Sigma, \Sigma) = \ell(\Sigma),$$

where the first inequality is due to concavity and the second holds since $\tilde{\Sigma}$ minimizes $Q(\hat{\Sigma}, \Sigma)$; equality generally holds only when $\tilde{\Sigma} = \Sigma$. In other words, this iterative procedure converges monotonically to a local minimum of $\ell(\Sigma)$.

Let us now relate the problem of minimizing $\ell(\Sigma)$ to the original problem of maximizing the likelihood in eq. (16). There we saw that for each state of the MPC, it was necessary to optimize the parameters $\{\mu, \Phi, \Sigma\}$. Here, for notational convenience, we have dropped the subscript denoting the state index of these parameters. Note that in terms of these parameters, maximizing each state's contribution to the log-likelihood is equivalent to minimizing the total arc length of its segments in the training set. This problem can be viewed as a particular instance of the one considered above, provided that we make the identification:

$$\psi(x) = (x - \mu)^\top \Phi^{-1} (x - \mu). \tag{31}$$

Of course, now in addition to minimizing the arc length with respect to $\Sigma$, we must also optimize the values of $\mu$ and $\Phi$. To this end, note that eq. (31) defines a standard quadratic form; hence for fixed $\Sigma$, the values of $\mu$ and $\Phi$ that minimize eq. (28) are given simply by the mean and covariance matrix of the points $x(t)$ along each state's segments, as weighted by the measure $[\dot{x}^\top \Sigma^{-1} \dot{x}]^{\frac{1}{2}}$. Within each state, we thus obtain a monotonically convergent learning procedure by alternately optimizing $\mu$ and $\Phi$ for fixed $\Sigma$, then optimizing $\Sigma$ for fixed $\mu$ and $\Phi$. This leads directly to the reestimation formulas in eqs. (19)-(21).

[Figure 2: The square root function is concave and upper bounded by $\sqrt{z} \leq \frac{1}{2}\left[z/\sqrt{\rho} + \sqrt{\rho}\right]$ for all $\rho \geq 0$. The bounding tangents are shown for $\rho = \frac{1}{10}$ and $\rho = 1$.]

Acknowledgements

The author thanks F. Pereira, M. Rahim, and the anonymous reviewers for many helpful comments about the presentation of these ideas.
References

L. Baum (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. In O. Shisha, editor, Inequalities, 3:1-8. New York: Academic Press.

A. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38.

M. P. do Carmo (1976). Differential Geometry of Curves and Surfaces. Prentice Hall.

R. O. Duda and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York: Wiley.

A. Einstein (1924). The Principle of Relativity. Dover.

M. Ostendorf, V. Digalakis, and O. Kimball (1996). From HMMs to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 4:360-378.

L. Rabiner and B. Juang (1993). Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall.

M. A. Siegler and R. M. Stern (1995). On the effects of speech rate in large vocabulary speech recognition systems. In Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, 612-615.

P. Simard, Y. LeCun, and J. Denker (1993). Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems 5:50-58. San Mateo, CA: Morgan Kaufmann.

N. Tishby (1990). A dynamical system approach to speech processing. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing, 365-368.