IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 1, JANUARY 2004
Target-Directed Mixture Dynamic Models for Spontaneous Speech Recognition

Jeff Z. Ma, Member, IEEE, and Li Deng, Senior Member, IEEE
Abstract—In this paper, a novel mixture linear dynamic model (MLDM) for speech recognition is developed and evaluated, where several linear dynamic models are combined (mixed) to represent different vocal-tract-resonance (VTR) dynamic behaviors and the mapping relationships between the VTRs and the acoustic observations. Each linear dynamic model is formulated as a pair of state-space equations, where the target-directed property of the VTRs is incorporated in the state equation and a linear regression function is used for the observation equation to approximate the nonlinear mapping relationship. A version of the generalized EM algorithm is developed for learning the model parameters, where the constraint that the VTR targets change at the segmental level (rather than at the frame level) is imposed in both the parameter learning and the model scoring algorithms. Speech recognition experiments are carried out to evaluate the new model using the N-best re-scoring paradigm in a Switchboard task. Compared with a baseline recognizer using the triphone HMM acoustic model, the new recognizer demonstrated improved performance under several experimental conditions. The performance was shown to increase with an increased number of mixture components in the model.

Index Terms—Dynamic system, mixture model, phonetic target, speech recognition, state-space model.
I. INTRODUCTION

SPEECH recognition technology has achieved significant progress with the introduction of hidden Markov models (HMMs) and related statistical, data-driven modeling and pattern recognition techniques. However, current technology is far from mature. It is very fragile: it breaks down with minor changes in speaker characteristics and in other factors related to the recognizers' operating environments. The fragility is particularly apparent in the recognizers' sensitivity to speaking-style variation. This exemplifies weaknesses and limitations of the current HMM approach, where the basic target-directed dynamic properties of human speech production are conspicuously missing from the mathematical representation of the speech model. To overcome such limitations of current technology and to pursue the next generation of solutions to speech recognition problems, it appears necessary to build into the speech models some key dynamic properties of the human speech process.

In recent research along this direction [4], [8]–[10], [21], we have been developing a statistical coarticulatory model for spontaneous speech recognition. Our new approach is based on statistical Bayesian theory, yet is a radical departure from the current HMM-based approach. Rather than using a large number of unstructured Gaussian mixture components to account for the tremendous variation in the observable acoustic data of highly coarticulated spontaneous speech, the new model provides a rich structure for the partially observable (hidden) dynamics in the domain of vocal-tract resonances (VTRs). In the design of the speech recognizer, we used a statistical nonlinear dynamic system to describe the physical process of spontaneous speech production, where knowledge of the VTR dynamic behavior in speech production is naturally incorporated into model training and decoding. This coarticulatory model was formulated in mathematical terms as a constrained and simplified state-space nonlinear system for each phone:

    z(k+1) = \Phi z(k) + (I - \Phi) T + w(k)        (1)

    o(k) = h(z(k)) + v(k)        (2)

Manuscript received January 17, 2000; revised October 10, 2002. This work was supported by the National Science Foundation via Johns Hopkins University and by NSERC of Canada. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dirk van Compernolle. J. Z. Ma was with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada N2L 3G1. He is now with BBN Technologies, Cambridge, MA USA. L. Deng was with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada N2L 3G1. He is now with Microsoft Research, Redmond, WA 98052-6399 USA (e-mail: deng@microsoft.com). Digital Object Identifier 10.1109/TSA.2003.818074
In the above state equation (1), $z(k)$ represents the hidden VTR dynamics, $T$ is the phone's VTR target, and $\Phi$ is the "time-constant" matrix which controls how fast the dynamic variable $z(k)$ approaches its target. The target-directed and asymptotic behavior of the dynamics can be seen by setting $k \to \infty$, which forces the system to enter the asymptotic region where $z(k+1) \approx z(k)$. With the assumption of mild levels of noise, (1) then directly gives the target-directed behavior: $z(k) \to T$. $w(k)$ is a zero-mean Gaussian noise with covariance $Q$. In the measurement equation (2), $o(k)$ is the vector of acoustic measurements (Mel-frequency cepstral coefficients, or MFCCs) computed from a conventional speech preprocessor. The nonlinear function $h(\cdot)$, which is implemented by a multilayer perceptron (MLP), represents the physically nonlinear relationship between the hidden dynamic space and the observable acoustic space. $v(k)$ is a zero-mean Gaussian noise with covariance $R$.

During our earlier work [4], [10], we found that the nonlinear function represented by the MLP did not provide an accurate mapping between the VTR and the MFCC spaces. This was observed from large prediction errors in the MFCC domain using the MLP. Efforts to improve the nonlinear mapping, however, resulted in relatively minor improvements in recognition performance. Such efforts included the use of "piece-wise nonlinear" functions with multiple MLPs and the use of alternative neural network architectures such as radial basis functions.
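To make the target-directed behavior of the state equation (1) concrete, the following sketch simulates one scalar VTR dimension. The parameter values (target, time constant, noise variance) are illustrative assumptions, not values from the paper.

    # A minimal simulation of the target-directed state equation (1):
    #   z(k+1) = phi * z(k) + (1 - phi) * T + w(k),  w(k) ~ N(0, q).
    # All parameter values below are illustrative, not taken from the paper.
    import numpy as np

    rng = np.random.default_rng(0)
    phi, T, q = 0.9, 1500.0, 10.0   # time constant, VTR target (Hz), noise variance
    z = np.empty(60)
    z[0] = 500.0                    # start far from the target
    for k in range(59):
        z[k + 1] = phi * z[k] + (1.0 - phi) * T + rng.normal(0.0, np.sqrt(q))

    # With mild noise, z(k) settles near the target T, as the text describes.
    print(round(z[0], 1), round(z[-1], 1))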
From some analysis experiments, we found that the relationship between the MFCC space and the VTR space is highly complex (including a many-to-one relationship), and this relationship appears difficult to learn.¹ One principal difficulty of the many nonlinear methods available is the gross approximation needed to implement the model learning algorithm. In addition, the large amount of computation required by the approximate learning algorithm made it difficult to tune the speech recognition system extensively.

To overcome the above difficulties inherent in the nonlinear methods we experimented with, we seek linear methods that may provide adequate approximations to the nonlinear relationship between the VTR and the MFCC spaces in the formulation of the speech model, while gaining computational effectiveness and efficiency in model learning. The most straightforward linear method is to use a linear regression function to replace the MLP mapping in (2), while keeping the target-directed, linear state dynamics of (1) intact. This gives the measurement equation of the state-space model as follows:

    o(k) = H z(k) + b + v(k)        (3)

which can be rewritten as

    o(k) = \bar{H} \bar{z}(k) + v(k)        (4)

where $\bar{z}(k) = [z(k)^T, 1]^T$ and $\bar{H} = [H, b]$. Comparing speech recognition results of this linear system with its nonlinear counterpart using MLPs, we found that both recognizers gave similar performance. It appears that the effectiveness in learning the linear model adequately compensated for its weakness in representing the nonlinear function between the VTR and the MFCC spaces by a linear approximation of apparently low accuracy.

One way of improving modeling accuracy while maintaining effectiveness in model learning is to extend the linear dynamic system model discussed above to its mixture version. That is, rather than using one single set of model parameters to characterize each phone, we can use multiple sets of model parameters. This gives rise to the mixture linear dynamic model reported in this paper. Since separate dynamic system models are constructed for different phones, and thus the VTR space has already been constrained for each phone, the degree of nonlinearity of each individual model is expected to be relatively minor. Therefore, a small number of mixture components is expected to be adequate for representing the phone-dependent nonlinear function between the VTR and the MFCC spaces. The development of effective model learning and likelihood-scoring algorithms for this new mixture linear dynamic system model is the focus of the present study.

The organization of this paper is as follows. In Section II, a mathematical description of the mixture linear dynamic system model is provided. A maximum-likelihood parameter estimation algorithm is derived in Section III for learning the model. A likelihood-scoring algorithm for the model is described in Section IV for recognizer evaluation. In Section V, the model, which is used to construct the speech recognizer, is evaluated on the Switchboard database under the N-best list re-scoring paradigm. A set of analysis experiments is also reported in Section V, which sheds light on why the new model outperforms the conventional triphone HMM. In Section VI, conclusions are drawn and future work is discussed.

¹Detailed descriptions of these experiments are given in Section V-C-2.
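As a concrete check of the rewriting from (3) to (4) above, which amounts to absorbing the bias $b$ into an augmented matrix acting on an augmented state, the following small numerical sketch can be used; the dimensions are illustrative assumptions.

    # Check that (3), o = H z + b + v, equals (4), o = H_bar z_bar + v,
    # with z_bar = [z; 1] and H_bar = [H  b] (noise term omitted here).
    import numpy as np

    rng = np.random.default_rng(1)
    D, Do = 4, 12                        # illustrative VTR and MFCC dimensions
    H, b, z = rng.normal(size=(Do, D)), rng.normal(size=Do), rng.normal(size=D)

    z_bar = np.append(z, 1.0)            # augmented state vector [z; 1]
    H_bar = np.hstack([H, b[:, None]])   # augmented matrix [H  b]

    assert np.allclose(H @ z + b, H_bar @ z_bar)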
Fig. 1. Diagram to show different mixture-component dynamics for one phone.
II. PHONE-DEPENDENT, MIXTURE LINEAR DYNAMIC MODEL

One motivation for developing the new mixture dynamic model is that different speakers (such as males and females) may have different VTR dynamics (different targets $T$ and time constants $\Phi$) even when uttering the same phone. As illustrated in Fig. 1, two different VTR dynamics, characterized by two sets of model parameters, may belong to the same phone. Therefore, the VTR dynamics and the resulting measurable acoustic dynamics, whose parameters are distinct for each separate phone, are represented mathematically by a combination of a set of linear dynamic models (LDMs). This is called the mixture linear dynamic model (MLDM), which can be written succinctly in the following form:

    p(O) = \sum_{m=1}^{M} \omega_m p_m(O)        (5)

where $M$ is the total number of linear dynamic models (or mixture components) for each phone, and $\omega_m$ is the mixture weight. $p_m(O)$ is the likelihood of the $m$-th LDM, which is expressed in the same state-space form as in (1) and (4) except that the parameters are indexed by $m$:

    z(k+1) = \Phi_m z(k) + (I - \Phi_m) T_m + w(k)        (6)

    o(k) = \bar{H}_m \bar{z}(k) + v(k)        (7)

where $w(k)$ and $v(k)$ have covariances $Q_m$ and $R_m$, respectively.

The mixture concept here applies at the phone-segment level, so we are mixing the probability of the entire phone segment. The probability of the segment is produced by the linear dynamic model given jointly in (6) and (7). The use of segment-level mixture components is intended to represent the sources of speech variability, including speakers' vocal-tract-shape differences and speaking-habit differences, as systematic variations in speech. The mixture concept at the segmental level discussed here is similar to that in [11], [13], [14], and [16] (all summarized in [20]). The key difference between the mixture model in this paper and those other mixture models is the different dynamic models in the mixture components.
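Organizationally, the phone-dependent MLDM of (5)-(7) is simply $M$ parallel copies of the linear dynamic model per phone, plus a weight vector. The container below is a hypothetical sketch of that structure, not the authors' implementation.

    # Hypothetical per-phone container for the MLDM of (5)-(7).
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class LDMComponent:          # one mixture component m
        Phi: np.ndarray          # D x D time-constant matrix in (6)
        T: np.ndarray            # D-dimensional VTR target in (6)
        H_bar: np.ndarray        # Do x (D+1) augmented observation matrix in (7)
        Q: np.ndarray            # D x D state-noise covariance
        R: np.ndarray            # Do x Do observation-noise covariance

    @dataclass
    class PhoneMLDM:             # one phone
        weights: np.ndarray      # (M,) mixture weights, summing to one
        components: List[LDMComponent]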
An important constraint, which we call the mixture-path constraint, is imposed on the above MLDM in the model formulation. That is, for each sequence of acoustic observations associated with a phone, be it a training token or a test token, the dynamics is constrained to be produced from a fixed mixture component of the dynamic system model. This means that the VTR target in a phone is not permitted to switch from one mixture component to another at the frame level.² This type of constraint is motivated by the physical nature of the speech model: the target, which is correlated with the phonetic identity, is defined at the segment (phone) level, not at the frame level. For example, in Fig. 1, if a token chooses mixture component one ($T_1$ and $\Phi_1$) at the beginning frame, it must remain within the same mixture component until the end of the segment, without switching to mixture component two during the same segment. This mixture-path constraint applies in both model parameter learning and likelihood re-scoring.

²This same mixture-path constraint has been imposed on the mixture trended HMM; see [11].

The mixture-path constraint has direct consequences for the model computation. If the mixture components were allowed to switch at the frame level, the number of Gaussian components in the VTR variable $z(k)$ would grow exponentially with the time frame. (Given fixed segment boundaries, this would be $M$ to the power of the segment duration.) However, the mixture-path constraint as implemented in the current model prohibits the mixture components from switching during any phone segment, and thus eliminates the exponential-growth problem by keeping the number of possible Gaussian components fixed at $M$ (with assumed fixed segment boundaries). Nevertheless, the Gaussian component associated with $z(k)$ in the previous phone segment is free to continue with any mixture component in the current phone segment. This would still make the total number of possible Gaussian components in an entire utterance grow exponentially with the number of phone segments in the utterance transcription. To overcome this difficulty, we adopt a merging strategy that combines all Gaussian components at the end of each phone segment. The merging is done according to the posterior probabilities of the mixture-component dynamics for producing the entire phone segment. All the component dynamics in the following phone start from the merged single point.

To summarize, we have the phone-dependent MLDM presented in (6) and (7), together with the mixture-path constraint and the merging strategy (described above) that complete the model formulation. The entire model parameter set is $\Theta = \{\Phi_m, T_m, \bar{H}_m, Q_m, R_m, \omega_m; m = 1, \ldots, M\}$. Note that each of the parameters is indexed by the mixture component $m$.

III. PARAMETER ESTIMATION ALGORITHM

One principal contribution of this study is the development of the parameter estimation (or learning) algorithm, which allows automatic determination of all parameters of the mixture linear dynamic system model discussed above from a given set of training data. The algorithm is based on the Expectation-Maximization (EM) principle [5] for maximum likelihood. We will use $O = (o(1), o(2), \ldots, o(K))$ to represent the sequence of acoustic (MFCC) observations, and $Z = (z(1), z(2), \ldots, z(K))$ the sequence of VTR state dynamic variables. To proceed with the development of the parameter estimation algorithm, we need to define a discrete random variable $M^n$, which indicates the observation-to-mixture assignment for every sequence of observations of a phone.
For example, for a given sequence of observations $O^n$ of a phone, if $M^n = m$, it means that the $m$-th model (mixture component) is the true one that generated the sequence of observations. For simplicity, we will use $M$ to denote the collection $\{M^n\}$. The EM algorithm described in this section treats $Z$ and $M$ as missing data, and treats the measurements $O$ as observation or training data. To impose the mixture-path constraint on the training process, we define the joint variable $(Z, M)$ at the segment level.

Let there be $N$ training tokens for a phone, where a token is a sequence of speech frames for that phone.³ We use $(O, Z, M)$ to represent the joint variables

    O = (O^1, \ldots, O^N), \quad Z = (Z^1, \ldots, Z^N), \quad M = (M^1, \ldots, M^N)

where $O^n$, $Z^n$, and $M^n$ are the $n$-th training token, its corresponding hidden dynamics, and its corresponding mixture-component assignment, respectively. We assume that the tokens are independent of each other and that all discrete random variables $M^n$ have an identical distribution. These assumptions are both reasonable because there is usually no strong correlation among tokens and all models have the same structure.

³For continuous speech, phone boundaries have to be segmented first. In our experiments, all phone boundaries are provided by a conventional HMM system.

The development of the EM algorithm described below consists of several steps. First, we develop an explicit expression for the joint probability density function (PDF) of the observation and missing data. We also develop an expression for the mixture-component weighting factor. These expressions are then used to compute the conditional expectation required in the E-step of the EM algorithm. This conditional expectation is expressed as a function of a set of sufficient statistics computed from the linear Kalman filter. Finally, re-estimation formulas are derived using the conditional expectation in the M-step of the EM algorithm.

A. Joint PDF of Observation and Missing Data

Due to the token-independence assumption, the joint PDF of the observation and missing data, given the parameter set $\Theta$, can be written as

    p(O, Z, M \mid \Theta) = \prod_{n=1}^{N} p(O^n, Z^n, M^n \mid \Theta) = \prod_{n=1}^{N} p(O^n, Z^n \mid M^n, \Theta) \, P(M^n \mid \Theta)        (8)

In (8), $p(O^n, Z^n \mid M^n = m, \Theta)$ is the conditional joint PDF of $O^n$ and $Z^n$ given a fixed mixture component. It can be further expressed as [23]

    p(O^n, Z^n \mid M^n = m, \Theta) = p(z^n(0) \mid m) \prod_{k=1}^{K_n} p(z^n(k) \mid z^n(k-1), m) \prod_{k=1}^{K_n} p(o^n(k) \mid z^n(k), m)        (9)

where $K_n$ is the total number of frames of the $n$-th training token (MFCC sequence), and $p(z^n(0) \mid m)$ is the distribution of the initial value of the hidden dynamic variable given the mixture component.
From the model definition in (6) and (7), the component densities in (9) are Gaussian, and we then have

    p(O^n, Z^n \mid M^n = m, \Theta) = p(z^n(0) \mid m) \prod_{k=1}^{K_n} N\big( z^n(k); \, \Phi_m z^n(k-1) + (I - \Phi_m) T_m, \, Q_m \big) \prod_{k=1}^{K_n} N\big( o^n(k); \, \bar{H}_m \bar{z}^n(k), \, R_m \big)        (10)

Substituting this into (8), we obtain the joint PDF of the observation and missing data in the explicit form

    p(O, Z, M \mid \Theta) = \prod_{n=1}^{N} P(M^n \mid \Theta) \, p(z^n(0) \mid M^n) \prod_{k=1}^{K_n} N\big( z^n(k); \, \Phi_{M^n} z^n(k-1) + (I - \Phi_{M^n}) T_{M^n}, \, Q_{M^n} \big) \prod_{k=1}^{K_n} N\big( o^n(k); \, \bar{H}_{M^n} \bar{z}^n(k), \, R_{M^n} \big)        (11)

B. Mixture-Component Weighting Factor

The conditional joint PDF for $(O^n, M^n = m)$ is

    p(O^n, M^n = m \mid \Theta) = p(O^n \mid M^n = m, \Theta) \, P(M^n = m \mid \Theta)        (12)

The PDF for the observation sequence $O^n$ is

    p(O^n \mid \Theta) = \sum_{m=1}^{M} p(O^n \mid M^n = m, \Theta) \, P(M^n = m \mid \Theta)        (13)

In the above equation, all $P(M^n = m \mid \Theta)$ have an identical distribution, so we use one common variable $\omega_m$ to replace them. The conditional PDF of $M^n$ given $O^n$ is

    P(M^n = m \mid O^n, \Theta) = \frac{\omega_m \, p(O^n \mid M^n = m, \Theta)}{\sum_{l=1}^{M} \omega_l \, p(O^n \mid M^n = l, \Theta)}        (14)

where we define the (token-dependent) mixture-component weighting factor to be

    \gamma_m^n \equiv P(M^n = m \mid O^n, \Theta)        (15)

The mixture-component weighting factor is the posterior probability of the mixture component given the token. Due to the independence assumption among tokens, $\gamma_m^n$ has two notable properties which we will use later. First, the factors sum to unity: $\sum_{m=1}^{M} \gamma_m^n = 1$. Second, they satisfy

    P(M^n = m \mid O^n, O^{(-n)}, \Theta) = P(M^n = m \mid O^n, \Theta) = \gamma_m^n        (16)

where $O^{(-n)}$ denotes the set of $O$ excluding $O^n$.

C. E-Step

Given the various PDFs computed above, we are now in a position to derive an iterative EM algorithm for parameter estimation. For the model presented in this paper, both $Z$ and $M$ are treated as missing data. The $Q$-function in the E-step of the EM algorithm is computed below as the conditional expectation over the missing data [5], [25]

    Q(\Theta \mid \bar{\Theta}) = E\big[ \log p(O, Z, M \mid \Theta) \,\big|\, O, \bar{\Theta} \big]        (17)

where $\bar{\Theta}$ denotes the model parameters associated with the immediately previous iteration of the EM algorithm. Substituting (11) into (17) above, we have

    Q(\Theta \mid \bar{\Theta}) = \sum_{n=1}^{N} E\Big[ \log p(O^n, Z^n \mid M^n, \Theta) + \log P(M^n \mid \Theta) \,\Big|\, O, \bar{\Theta} \Big]        (18)

Substituting (14) into the above equation, using the property shown in (16), changing the order of the summations, and using the common variable $\omega_m$ to replace all $P(M^n = m \mid \Theta)$, we obtain

    Q(\Theta \mid \bar{\Theta}) = \sum_{n=1}^{N} \sum_{m=1}^{M} \bar{\gamma}_m^n \Big( E\big[ \log p(O^n, Z^n \mid M^n = m, \Theta) \,\big|\, O^n, M^n = m, \bar{\Theta} \big] + \log \omega_m \Big)        (19)

where $\bar{\gamma}_m^n$ has the same expression as $\gamma_m^n$ in (14) and (15), except that $\Theta$ is replaced by $\bar{\Theta}$ from the previous EM iteration. We can express $Q$ above as
    Q(\Theta \mid \bar{\Theta}) = Q_1 + Q_2        (20)

where $Q_1$ and $Q_2$ are equal to the first part and the second part of (19), respectively.

From the model definition in (6) and (7), $p(z(k) \mid z(k-1), m)$ is a Gaussian with mean $\Phi_m z(k-1) + (I - \Phi_m) T_m$ and covariance $Q_m$, and $p(o(k) \mid z(k), m)$ is a Gaussian as well, with mean $\bar{H}_m \bar{z}(k)$ and covariance $R_m$. Fixing $p(z(0) \mid m)$ as a Gaussian with zero mean and a given covariance, we can simplify $Q_1$ to

    Q_1 = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{M} \bar{\gamma}_m^n \sum_{k=1}^{K_n} \Big( \log |Q_m| + E_m^n\big[ e_z(k)^T Q_m^{-1} e_z(k) \big] + \log |R_m| + E_m^n\big[ e_o(k)^T R_m^{-1} e_o(k) \big] \Big) + \text{const}        (21)

where $e_z(k) = z(k) - \Phi_m z(k-1) - (I - \Phi_m) T_m$ and $e_o(k) = o(k) - \bar{H}_m \bar{z}(k)$. For notational simplicity, we will use $E_m^n[\cdot]$ to represent $E[\cdot \mid O^n, M^n = m, \bar{\Theta}]$ henceforth.

D. M-Step

1) Reestimating $\omega_m$: For simplicity, we use $\omega_m$ to represent $P(M^n = m \mid \Theta)$, which we call the mixture-component weighting probabilities. Since in the $Q$-function only $Q_2$ is related to $\omega_m$, we can obtain the re-estimation formula by setting the partial derivative of $Q_2$ with respect to $\omega_m$ to zero, and then solving it subject to the constraint $\sum_{m=1}^{M} \omega_m = 1$. To proceed, we define the Lagrangian equation

    L(\omega, \lambda) = Q_2 + \lambda \Big( \sum_{m=1}^{M} \omega_m - 1 \Big)        (22)

Taking the derivative of $L$ with respect to $\omega_m$, we obtain

    \frac{\partial L}{\partial \omega_m} = \frac{1}{\omega_m} \sum_{n=1}^{N} \bar{\gamma}_m^n + \lambda        (23)

Setting the derivative equal to zero, we have the reestimate for $\omega_m$

    \hat{\omega}_m = -\frac{1}{\lambda} \sum_{n=1}^{N} \bar{\gamma}_m^n        (24)

Taking $\sum_{m=1}^{M}$ on both sides of (24) and using the property $\sum_{m=1}^{M} \hat{\omega}_m = 1$, we obtain

    \lambda = -\sum_{m=1}^{M} \sum_{n=1}^{N} \bar{\gamma}_m^n = -N        (25)

This gives the re-estimation formula for $\omega_m$

    \hat{\omega}_m = \frac{1}{N} \sum_{n=1}^{N} \bar{\gamma}_m^n        (26)
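The pair of formulas (15) and (26) is a standard responsibility/weight update applied at the segment level. A minimal sketch follows, assuming the per-token, per-component segment log-likelihoods log p(O^n | m) have already been computed (they come from the Kalman filter of Section III-E).

    # E-step responsibilities (15) and M-step weight update (26).
    import numpy as np

    def weighting_factors(log_w, seg_loglik):
        """log_w: (M,) log mixture weights; seg_loglik: (N, M) values of
        log p(O^n | m). Returns gamma with rows summing to one."""
        s = log_w[None, :] + seg_loglik            # unnormalized log posterior
        s -= s.max(axis=1, keepdims=True)          # log-sum-exp stabilization
        g = np.exp(s)
        return g / g.sum(axis=1, keepdims=True)

    def reestimate_weights(gamma):
        """(26): omega_m = (1/N) * sum_n gamma_m^n."""
        return gamma.mean(axis=0)

    gamma = weighting_factors(np.log([0.5, 0.5]),
                              np.array([[-10.0, -12.0], [-20.0, -19.0]]))
    print(reestimate_weights(gamma))    # updated mixture weights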
2) Reestimating $\Phi_m$ and $T_m$: To reestimate $\Phi_m$ and $T_m$, we note that they are related only to $Q_1$; specifically, only the state-noise term of (21) includes $\Phi_m$ and $T_m$. The relevant partial derivatives are⁴

    \frac{\partial Q_1}{\partial T_m} = \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} (I - \Phi_m)^T Q_m^{-1} E_m^n\big[ z(k) - \Phi_m z(k-1) - (I - \Phi_m) T_m \big]        (27)

and

    \frac{\partial Q_1}{\partial \Phi_m} = \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} Q_m^{-1} E_m^n\Big[ \big( z(k) - \Phi_m z(k-1) - (I - \Phi_m) T_m \big) \big( z(k-1) - T_m \big)^T \Big]        (28)

Setting the above derivatives to zero, we obtain the re-estimates for $T_m$ and $\Phi_m$

    \hat{T}_m = (I - \Phi_m)^{-1} \frac{ \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ z(k) - \Phi_m z(k-1) \big] }{ \sum_{n=1}^{N} \bar{\gamma}_m^n K_n }        (29)

and

    \hat{\Phi}_m = \Big( \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ (z(k) - T_m)(z(k-1) - T_m)^T \big] \Big) \Big( \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ (z(k-1) - T_m)(z(k-1) - T_m)^T \big] \Big)^{-1}        (30)

where $\hat{T}_m$ and $\hat{\Phi}_m$ stand for the newly re-estimated values. Note that in the above, the parameters $\Phi_m$ and $T_m$ are updated alternately at separate EM iterations: each re-estimate holds the other parameter fixed at its previous value. This gives rise to the generalized EM algorithm, i.e., local optimization in the M-step rather than global optimization.

3) Reestimating $\bar{H}_m$: To reestimate $\bar{H}_m$, we note that it is included only in the observation term of $Q_1$. The relevant partial derivative is

    \frac{\partial Q_1}{\partial \bar{H}_m} = \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} R_m^{-1} \Big( o(k) \, E_m^n[\bar{z}(k)]^T - \bar{H}_m E_m^n\big[ \bar{z}(k) \bar{z}(k)^T \big] \Big)        (31)

Setting the above to zero, we have the reestimate

    \hat{\bar{H}}_m = \Big( \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} o(k) \, E_m^n[\bar{z}(k)]^T \Big) \Big( \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ \bar{z}(k) \bar{z}(k)^T \big] \Big)^{-1}        (32)

4) Reestimating $Q_m$ and $R_m$: Since the noise covariances $Q_m$ and $R_m$ are included only in $Q_1$, we compute the following derivatives

    \frac{\partial Q_1}{\partial Q_m^{-1}} = \frac{1}{2} \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} \Big( Q_m - E_m^n\big[ e_z(k) e_z(k)^T \big] \Big)        (33)

and

    \frac{\partial Q_1}{\partial R_m^{-1}} = \frac{1}{2} \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} \Big( R_m - E_m^n\big[ e_o(k) e_o(k)^T \big] \Big)        (34)

Letting the derivatives equal zero, we obtain the estimates

    \hat{Q}_m = \frac{ \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ e_z(k) e_z(k)^T \big] }{ \sum_{n=1}^{N} \bar{\gamma}_m^n K_n }        (35)

and

    \hat{R}_m = \frac{ \sum_{n=1}^{N} \bar{\gamma}_m^n \sum_{k=1}^{K_n} E_m^n\big[ e_o(k) e_o(k)^T \big] }{ \sum_{n=1}^{N} \bar{\gamma}_m^n K_n }        (36)

In the above, $E_m^n[e_z(k) e_z(k)^T]$ is calculated according to

    E_m^n\big[ e_z(k) e_z(k)^T \big] = E_m^n[z(k) z(k)^T] - E_m^n[z(k) z(k-1)^T] \Phi_m^T - \Phi_m E_m^n[z(k-1) z(k)^T] + \Phi_m E_m^n[z(k-1) z(k-1)^T] \Phi_m^T - u_m E_m^n[z(k)]^T - E_m^n[z(k)] u_m^T + \Phi_m E_m^n[z(k-1)] u_m^T + u_m E_m^n[z(k-1)]^T \Phi_m^T + u_m u_m^T        (37)

with $u_m = (I - \Phi_m) T_m$, and $E_m^n[e_o(k) e_o(k)^T]$ is calculated according to

    E_m^n\big[ e_o(k) e_o(k)^T \big] = o(k) o(k)^T - \bar{H}_m E_m^n[\bar{z}(k)] \, o(k)^T - o(k) \, E_m^n[\bar{z}(k)]^T \bar{H}_m^T + \bar{H}_m E_m^n\big[ \bar{z}(k) \bar{z}(k)^T \big] \bar{H}_m^T        (38)

⁴In deriving the above formulas, we use the matrix calculus formulas: $\partial(x^T A y)/\partial A = x y^T$, $\partial(x^T A^T y)/\partial A = y x^T$, and $\partial(x^T A^T B A y)/\partial A = B^T A x y^T + B A y x^T$, where $x$ and $y$ are vectors, and $A$ and $B$ are matrices.

E. Calculation of the Sufficient Statistics

In order to obtain the re-estimates for the model parameters according to the formulas derived above as the M-step of the EM algorithm, a set of conditional expectations needs to be calculated. Essentially, three conditional expectations (all conditioned on the observation sequences), $E_m^n[z(k)]$, $E_m^n[z(k) z(k)^T]$, and $E_m^n[z(k) z(k-1)^T]$, are required by the M-step. The conditional expectation $E_m^n[z(k)]$, which denotes $E[z(k) \mid O^n, M^n = m, \bar{\Theta}]$, is precisely the Kalman smoother for the $m$-th mixture component and for the $n$-th observation (token). All conditional expectations required in the M-step can be calculated using the results of the Kalman smoothing algorithm. We now list the computational steps of the Kalman smoothing algorithm below.⁵ In the following, we drop the indices $n$ and $m$, and write $\hat{z}(k \mid j)$ and $\Sigma(k \mid j)$ for the mean and covariance of $z(k)$ given observations up to frame $j$.

Forward recursion (or Kalman filtering):

    \hat{z}(k \mid k-1) = \Phi \hat{z}(k-1 \mid k-1) + (I - \Phi) T        (39)

    \Sigma(k \mid k-1) = \Phi \Sigma(k-1 \mid k-1) \Phi^T + Q        (40)

    e(k) = o(k) - \bar{H} \hat{\bar{z}}(k \mid k-1)        (41)

    \Sigma_e(k) = H \Sigma(k \mid k-1) H^T + R        (42)

    K(k) = \Sigma(k \mid k-1) H^T \Sigma_e(k)^{-1}        (43)

    \hat{z}(k \mid k) = \hat{z}(k \mid k-1) + K(k) e(k)        (44)

    \Sigma(k \mid k) = \Sigma(k \mid k-1) - K(k) H \Sigma(k \mid k-1)        (45)

⁵Details of the algorithm and its derivations can be found in [1], [15], [17], [19], [26].
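The forward recursion (39)-(45) is the textbook Kalman filter specialized to the state-space pair (6)-(7). The sketch below follows that standard form (the initialization and variable names are our assumptions), and it returns the innovations and their covariances reused by the likelihood computations in (54) and (60).

    # Kalman-filter forward recursion (39)-(45) for one mixture component.
    import numpy as np

    def kalman_forward(O, Phi, T, H_bar, Q, R, z0, P0):
        D = Phi.shape[0]
        H, b = H_bar[:, :D], H_bar[:, D]   # split the augmented matrix [H  b]
        u = (np.eye(D) - Phi) @ T          # constant drive toward the target
        z, P = z0, P0
        innovations, inn_covs = [], []
        for o in O:
            z_pred = Phi @ z + u                   # (39) state prediction
            P_pred = Phi @ P @ Phi.T + Q           # (40) covariance prediction
            e = o - (H @ z_pred + b)               # (41) innovation
            S = H @ P_pred @ H.T + R               # (42) innovation covariance
            K = P_pred @ H.T @ np.linalg.inv(S)    # (43) Kalman gain
            z = z_pred + K @ e                     # (44) state update
            P = P_pred - K @ H @ P_pred            # (45) covariance update
            innovations.append(e); inn_covs.append(S)
        return np.array(innovations), np.array(inn_covs)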
Backward recursion (or Kalman smoothing):

    J(k-1) = \Sigma(k-1 \mid k-1) \Phi^T \Sigma(k \mid k-1)^{-1}        (46)

    \hat{z}(k-1 \mid K) = \hat{z}(k-1 \mid k-1) + J(k-1) \big[ \hat{z}(k \mid K) - \hat{z}(k \mid k-1) \big]        (47)

    \Sigma(k-1 \mid K) = \Sigma(k-1 \mid k-1) + J(k-1) \big[ \Sigma(k \mid K) - \Sigma(k \mid k-1) \big] J(k-1)^T        (48)

Based on the Kalman smoothing results above, the three required conditional expectations are computed as follows:

    E[z(k) \mid O] = \hat{z}(k \mid K)        (49)

    E[z(k) z(k)^T \mid O] = \Sigma(k \mid K) + \hat{z}(k \mid K) \hat{z}(k \mid K)^T        (50)

    E[z(k) z(k-1)^T \mid O] = \Sigma(k, k-1 \mid K) + \hat{z}(k \mid K) \hat{z}(k-1 \mid K)^T        (51)

where $\Sigma(k, k-1 \mid K)$ is recursively calculated by [24]

    \Sigma(k-1, k-2 \mid K) = \Sigma(k-1 \mid k-1) J(k-2)^T + J(k-1) \big[ \Sigma(k, k-1 \mid K) - \Phi \Sigma(k-1 \mid k-1) \big] J(k-2)^T        (52)

for $k = K, K-1, \ldots, 2$, where the recursion is initialized by

    \Sigma(K, K-1 \mid K) = \big[ I - K(K) H \big] \Phi \Sigma(K-1 \mid K-1)        (53)

F. Updating $\gamma_m^n$

To update $\gamma_m^n$ according to (15), $p(O^n \mid M^n = m, \bar{\Theta})$ must be calculated. The calculation proceeds as follows:

    p(O^n \mid M^n = m, \bar{\Theta}) = \prod_{k=1}^{K_n} p_m(e(k))        (54)

where the innovation sequence $e(k)$ and its covariance $\Sigma_e(k)$ are computed directly from the Kalman filtering described earlier, and $p_m(e(k))$ is the Gaussian density

    p_m(e(k)) = \frac{1}{(2\pi)^{D_o/2} |\Sigma_e(k)|^{1/2}} \exp\Big( -\frac{1}{2} e(k)^T \Sigma_e(k)^{-1} e(k) \Big)        (55)

G. Merging the Mixture Dynamics

As mentioned earlier, at the end of each phone segment, all possible mixture components for $z(k)$ are merged into a single one. This is carried out according to

    \hat{z}(k) = \sum_{m=1}^{M} \beta_m \hat{z}_m(k)        (56)

    \Sigma(k) = \sum_{m=1}^{M} \beta_m \Big( \Sigma_m(k) + \big[ \hat{z}_m(k) - \hat{z}(k) \big] \big[ \hat{z}_m(k) - \hat{z}(k) \big]^T \Big)        (57)

    \beta_m \propto \omega_m \, p(O \mid M = m)        (58)

where $\beta_m$ is proportional to the posterior probability of mixture component $m$ being chosen given the observation. All the component dynamics in the following phone start from this merged single point.

IV. LIKELIHOOD-SCORING ALGORITHM

The speech model presented so far combines different linear dynamic models (mixture components), according to the mixture-component weighting probabilities, to describe the VTR dynamics. After the weighting probabilities and all other model parameters are trained as described in Section III, the likelihood of the model for each phone, given a sequence of observations, can be computed directly. We describe this computation below. The likelihood $p(O)$ is equal to⁶

    p(O) = \sum_{m=1}^{M} \omega_m \, p(O \mid M = m)        (59)

Based on the estimation theory for dynamic systems ([12], [19], [26], etc.), the likelihood function for each individual mixture-component model is calculated from its innovation sequence $e(k)$ according to

    p(O \mid M = m) = \prod_{k=1}^{K} p_m(e(k))        (60)

where $p_m(e(k))$ is the PDF of the innovation sequence, which has the Gaussian form of (55), and where the innovation $e(k)$ and its covariance $\Sigma_e(k)$ are computed from the Kalman filtering recursion described earlier. The log-likelihood for the entire mixture model then becomes

    \log p(O) = \log \sum_{m=1}^{M} \omega_m \, p(O \mid M = m)        (61)

⁶In this section, all superscripts $n$ of $O$ are dropped because a fixed token for a particular phone is referred to.
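Given the innovations from the Kalman filter, (60) is a sum of Gaussian log-densities and (61) a log-sum over weighted components. The sketch below implements these, plus a moment-matching merge in the spirit of (56)-(58); the merge is our reading of the merging step, not a verbatim transcription of it.

    # Likelihood scoring (59)-(61) and segment-end merging per (56)-(58).
    import numpy as np

    def component_loglik(innovations, inn_covs):
        """(60): log p(O | m) accumulated from the innovation sequence."""
        ll = 0.0
        for e, S in zip(innovations, inn_covs):
            _, logdet = np.linalg.slogdet(S)
            ll += -0.5 * (len(e) * np.log(2 * np.pi) + logdet
                          + e @ np.linalg.solve(S, e))
        return ll

    def mixture_loglik(weights, comp_logliks):
        """(61): log sum_m omega_m p(O | m), computed stably."""
        x = np.log(weights) + np.asarray(comp_logliks)
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    def merge_components(betas, means, covs):
        """Collapse M Gaussians into one by moment matching, with betas
        proportional to the component posteriors as in (58)."""
        b = np.asarray(betas) / np.sum(betas)
        z = sum(bi * mi for bi, mi in zip(b, means))
        P = sum(bi * (Pi + np.outer(mi - z, mi - z))
                for bi, mi, Pi in zip(b, means, covs))
        return z, P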
For a speech utterance consisting of a sequence of phones, with the phones' dynamic segments (the frames between the phone boundaries) given, the log-likelihoods of the phones in the sequence, as defined in (61), are summed to give the total log-likelihood score for the entire utterance.
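At the utterance level, the rescoring computation is thus a plain sum over the aligned phone segments; schematically (function and variable names are illustrative):

    # Utterance score for N-best rescoring: sum of per-phone scores (61)
    # over the phone segmentation provided by the HMM system.
    def utterance_loglik(segments, phone_models, score_phone):
        """segments: list of (phone_label, frames) pairs from the alignment;
        score_phone: computes the mixture log-likelihood (61) of a segment."""
        return sum(score_phone(phone_models[label], frames)
                   for label, frames in segments)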
TABLE I WER PERFORMANCE OF THE BASELINE TRIPHONE HMM SYSTEM; THE NUMBER OF MODEL PARAMETERS IS IN PARENTHESES
V. SPEECH RECOGNITION EXPERIMENTS In this section we first introduce the experimental paradigm and design of the new recognizer based on the speech model presented in Sections II–IV. We then report the evaluation results of the new recognizer on Switchboard data. The evaluation is based on performance comparison with a conventional triphone HMM recognizer under identical conditions.
TABLE II PERFORMANCE (WER) OF MIXTURE LINEAR DYNAMIC SYSTEM MODEL WITH HALF AN HOUR OF TRAINING DATA; THE NUMBERS OF RECOGNIZER PARAMETERS ARE IN PARENTHESES
A. Experimental Paradigm

In all the experiments reported in this section, we use an N-best list re-scoring paradigm to evaluate the new recognizer on the Switchboard spontaneous telephone speech data. The N-best lists (transcription hypotheses) and their phone-level segmentations (or alignments) are obtained from a conventional triphone-based HMM recognizer (with use of a language model). The acoustic model used in that recognizer also serves as the baseline for gauging the recognizer performance improvement gained by the use of the new MLDM speech model developed in this work. The baseline HMM acoustic model we used in this work comes from one of the systems for the Switchboard task described in some detail in [4], [21]. We build the MLDM models for a total of 44 distinct phone-like symbols, including 8 context-dependent phones, one silence (sil), and one short pause (sp). The VTR targets of the context-dependent phones are affected by the anticipatory tongue position associated with the following phone.

B. Experiments on N-Best List Re-Scoring

All the male speakers from the WS'97 DevTest set are selected as the test data set, which results in a total of 23 male speakers comprising 24 conversations, 1243 utterances, 9970 words, and 50 min of speech.

1) Baseline Acoustic Model: The baseline HMM acoustic model was trained on 60 hours of Switchboard training data.⁷ This baseline model has word-internal triphones clustered by a decision tree. The total number of parameters in this baseline HMM triphone acoustic model is approximately 3 276 000, which can be broken down into the product of: 1) 39, the MFCC feature vector dimension; 2) 12, the number of Gaussian mixture components for each HMM state; 3) 2, for the Gaussian mean and diagonal covariance matrix in each mixture component; and 4) 3500, the total number of distinct HMM states clustered by the decision tree. The performance of the HMM baseline system is listed in Table I, where "Ref 5" and "5-best" mean the 5 best hypotheses with and without the reference included, respectively, and "Ref 100" and "100-best" mean the 100 best hypotheses with and without the reference included, respectively. We also add the

⁷See http://www.clsp.jhu.edu/ws97/ws97_general.html.
"Oracle" and "By chance" performance to Table I to calibrate the recognizer's performance. The "Oracle" WER is calculated by always choosing the best hypothesis, and the "By chance" WER is computed by randomly picking one out of all hypotheses. We note that the HMM acoustic model may be unfairly treated in this evaluation scenario, because the hypotheses being rescored were derived from a similar HMM system (with a language model) and hence are highly confusable to that system. This may explain why the model scores almost as poorly as the "By chance" output.

2) Speech Model Trained With One Speaker's Data: In this set of experiments, only one speaker's (speaker ID: 1028) training data, about half an hour long, extracted from the standard training set, was used for training. We gradually increased the number of mixture components in the MLDM. The results, expressed as word error rate (WER), are listed in Table II. The numbers of parameters used are also given in the table.⁸ From one mixture component to two, there is a large improvement in performance. The WER drops from 39.1% to 33.9% (about 15% relative error reduction) for the "Ref 5" case, and from 55.7% to 50.7% (about 10% relative error reduction) for the "Ref 100" case. There are about one and two percentage points of absolute error reduction for the cases without references included, "5-best" and "100-best," respectively.

When we further increased the number of mixture components to four, we observed no further decrease in the WER from the results shown in Table II. Two factors might account for this observation. First, the amount of training data is not enough for the increased number (four) of mixture components; we received warnings during training indicating that some models suffer from under-training. Second, the confusability among the different phones increases with the increasing number of mixture components. In order to pin down the more likely cause between the two possibilities, we used more training data to train the models in the next set of experiments.

⁸The total number of parameters is calculated according to $3 \cdot M \cdot 44 \cdot D$ (diag. $\Phi$, $T$, and diag. $Q$) $+ M \cdot 44 \cdot D_o$ (diag. $R$) $+ M \cdot 44 \cdot (D+1) \cdot D_o$ (for $\bar{H}$), plus $M \cdot 44$ mixture weights, where $M$ is the number of mixture components, $D$ the dimension of the VTR dynamics (4), and $D_o$ the dimension of the observations (12).
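As a quick arithmetic check of the two parameter counts quoted above (a sketch following the breakdowns in the text and in footnote 8):

    # Baseline HMM: 39 (MFCC dim) x 12 (Gaussians/state) x 2 (mean + diag cov)
    # x 3500 (decision-tree states).
    print(39 * 12 * 2 * 3500)                          # 3276000

    # MLDM, 44 phones, M mixtures, D = 4 (VTR dim), Do = 12 (MFCC dim):
    # 3*D (diag Phi, T, diag Q) + Do (diag R) + (D+1)*Do (H_bar) + 1 (weight).
    M, D, Do = 4, 4, 12
    print(44 * M * (3 * D + Do + (D + 1) * Do + 1))    # 14960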
TABLE III PERFORMANCE (WER) OF MIXTURE LINEAR DYNAMIC SYSTEM MODEL WITH INCREASED AMOUNTS OF TRAINING DATA; THE NUMBERS OF RECOGNIZER PARAMETERS ARE IN PARENTHESES
3) Speech Model Trained With Multiple Speakers' Data: To investigate how the amount of training data and the number of mixture components affect recognizer performance, we extracted multiple speakers' data from the standard training set for training the speech model. In Table III, "1 hour" means that we added another half an hour of data to the original half-hour training set, where the new half hour of training data came from 30 different speakers. "2 hour" means that we added one more hour of training data to the "1 hour" training set; the additional hour of data came from 50 different speakers.

Comparing the results shown in Tables II and III with two mixture components, we observe that doubling the amount of training data does not seem to make a substantial performance difference. The recognizer has slightly better performance using the four-mixture-component model trained on the same one-hour training data. Compared with its counterpart trained on half an hour of data, this model achieves about 1% absolute WER reduction for the "100-best" and "Ref 100" cases. Furthermore, when the four-mixture-component model is trained on the "2 hour" data set, some further WER reduction is achieved. From this set of experiments, we conclude that using more mixture components is able to improve the model's performance even though it may simultaneously increase the confusions among the phones; more training data is needed to train models with an increasing number of mixture components.

Compared with the baseline triphone HMM acoustic model given in Table I, the new system based on the MLDM has higher performance, especially when the references are included in the N-best list. When the number of mixture components is increased to four and the amount of training data is increased accordingly, further performance improvement is observed. In particular, in the situation with references included, the new system gives more than 25% and 10% relative error reduction for the "Ref 5" and "Ref 100" cases, respectively. This means that the new system is able to score the correct reference hypotheses with higher likelihoods than the HMM baseline system. For the cases without references included, the new system achieves about 1.0% absolute WER reduction.

C. Analysis Experiments

The number of parameters used by the MLDM with four mixture components is 14 960. In contrast, the number of parameters used by the baseline HMM system is 3 276 000. The re-scoring results shown above are significant in that the new system uses a considerably smaller number of model parameters yet outperforms the HMM baseline system.
Fig. 2. Comparison of log-likelihoods of the speech model with and without the state dynamics being modified.

TABLE IV COMPARISON OF RECOGNIZER WERS USING THE SPEECH MODEL WITH AND WITHOUT THE STATE DYNAMICS BEING MODIFIED
1) Analysis Experiments—Modifying State Dynamics: Why is the new dynamic system model able to achieve better performance than the HMM system using so small a number of parameters? We believe the main reason is that some key aspects of the true dynamic properties of speech have been explicitly incorporated into the new system. To examine this belief, we performed a set of analysis experiments in which we deliberately modified the dynamic property of the model. To do this, we set the "time-constant" parameter $\Phi_m$ for all models to zero. This changes the state equation to

    z(k+1) = T_m + w(k)        (62)

Now the hidden state dynamics is modified to be a flat one with noise added. All other parts of the speech model were kept identical to the system described earlier. We carried out the speech recognition experiments using the same training data (1 hour) and the same N-best re-scoring paradigm. The linear dynamic system model with four mixture components was used.

The log-likelihoods during model training are plotted in Fig. 2 as a function of the EM iteration number. The solid line is associated with the model without the state dynamics modified (i.e., with the trained parameter $\Phi_m$), and the dashed line is associated with the model with the state dynamics modified (i.e., with $\Phi_m = 0$). The log-likelihood of the model after modifying the state dynamics is observed to be uniformly lower than that of the original model, especially at the early iterations of the EM algorithm.

The N-best re-scoring results using the speech model with the state dynamics modified are listed in Table IV. For comparison purposes, the results using the original model are also shown in the same table.
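In terms of the simulation sketch in the Introduction, (62) removes the dependence on the previous state entirely; a one-line variant (illustrative values again):

    # With Phi = 0, the state equation (62) degenerates to z(k+1) = T + w(k):
    # a flat trajectory around the target, with no target-directed motion.
    import numpy as np

    rng = np.random.default_rng(0)
    T, q = 1500.0, 10.0
    z = T + rng.normal(0.0, np.sqrt(q), size=60)    # no dependence on z(k)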
We observe from Table IV that when the dynamic property is modified by setting $\Phi_m = 0$, the system performance becomes worse, especially for the "Ref 100" and "100-best" cases. Compared with the baseline HMM system listed in Table I, the system with the state dynamics modified still performs better in the cases with the references included. It also retains half the gain for the "5-best" case. This interesting performance, despite the lack of dynamics, may be due to the use of Kalman filtering. Note that during the Kalman filtering process, due to the noise added to the state and observation equations, the estimated hidden variable $\hat{z}(k)$ still varies with time $k$. This causes the estimated mean of the observation, $\bar{H} \hat{\bar{z}}(k)$, and its covariance to change with time as well. Such changes are more desirable than the constant means and variances associated with individual states in conventional HMMs. In [20], it was shown that the use of segment-level mixtures (with no dynamics) can lead to moderate improvement of speech recognition accuracy in some tasks, consistent with the results we have reported here.

2) Analysis Experiments—Varying Forms of the Observation Equation: In another set of analysis experiments, we aim to examine the importance of the mixture linear form of the observation equation proposed in this paper. We use several alternative forms (linear and nonlinear) of the observation equation, while keeping the state equation identical, and then compare the recognizers constructed from these different forms of the observation equation in the dynamic system model. For the nonlinear observation equations, we trained the recognizers using various forms of the approximate EM algorithm.

The first alternative form we explored is the global MLP. The WERs of the associated recognizer, evaluated on the identical Switchboard task to the one described earlier, are shown in the third row of Table V, where the MLDM's performance described earlier is duplicated in the first row. The number of parameters for this system is 2656.⁹ Significant degradation in WER is seen for the MLP nonlinear form, which is most likely due to the nonrigorous nature of the learning procedure.

The second alternative form of the observation equation we explored is the "piece-wise nonlinear" function, which replaces the single, global MLP. The motivation of these experiments is to investigate the effect of the modeling accuracy of the nonlinear mapping $h(\cdot)$ on the system performance. The average MLP prediction error is large (about 220).¹⁰ The "piece-wise nonlinear" function was implemented by separating the entire acoustic space into many subspaces and using different MLPs to describe these subspaces.

⁹The global MLP has 100 hidden nodes. So, $2\,656 = 100 \cdot (D + D_o)$ (MLP) $+ 44 \cdot (3D + D_o)$ ($\Phi$, $T$, $Q$, $R$), where $D = 4$ and $D_o = 12$.

¹⁰The prediction error is calculated according to

    \text{pred\_err} = \frac{ \sum_{n=1}^{N} \sum_{k=1}^{K_n} \| o^n(k) - h(\hat{z}^n(k)) \|^2 }{ \sum_{n=1}^{N} K_n }

where $N$ is the number of training tokens for a phone and $K_n$ is the length (in frames) of the $n$-th token. With the average pred_err of 220, the average difference on each MFCC dimension between its true value and its predicted one is about 4. This is large considering that most MFCC values are between -10 and 10.
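A quick arithmetic check of the prediction-error figures quoted above, assuming pred_err is a squared Euclidean norm averaged over frames:

    # With an average squared prediction error of 220 over Do = 12 MFCC
    # dimensions, the per-dimension RMS error is sqrt(220 / 12) ~= 4.3,
    # matching the "about 4" per-MFCC difference quoted in the footnote.
    import math
    print(math.sqrt(220 / 12))                       # ~4.28
    print(math.sqrt(110 / 12), math.sqrt(80 / 12))   # after subspace MLPs: ~3.0, ~2.6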
TABLE V PERFORMANCE COMPARISON OF DIFFERENT NONLINEAR FUNCTIONS (WER); THE NUMBERS OF RECOGNIZER PARAMETERS ARE IN PARENTHESES
We separated the entire acoustic space into 256 and 512 subspaces, respectively, which decreases the prediction errors by over 50% (to about 110) and over 60% (to about 80). But the recognizer's performance has not improved correspondingly. The results, named "subspace 256" and "subspace 512," respectively, are listed in Table V. These experiments demonstrate two things: first, the relationship between the MFCC space and the VTR space is complicated (many-to-one) and difficult to learn (even with 512 subspaces the prediction error is still large); second, the mapping accuracy does not much affect the system performance at the current level (which motivates trying the linear cases). The numbers of parameters for these two systems are 123 936 and 246 816, respectively.¹¹

The third alternative form we explored is the phone-dependent MLP. The hope was that by adding new phone-dependent parameters in the observation equation, we could gain discrimination among different phones. The WER results, labeled "phone MLP" in Table V, unfortunately, have not been better than the other forms of the observation equation. The number of parameters for this system is 71 456.¹²

The final alternative form of the observation equation we explored is the radial-basis-function neural network (RBFNN), which is thought to approximate a nonlinear function more smoothly [1]. We used a global RBFNN for all phone models. The number of parameters for this system is 3056.¹³ The WER results of the associated recognizer, labeled "RBFNN," are shown in Table V. Performance comparable to the earlier alternative forms of the observation equation is observed.

The analysis experiments reported in this section, based on a number of alternative forms of the observation equation, suggest the importance of the mixture linear form. Its superiority is likely to originate from the consistent, rigorous maximum-likelihood procedure used to learn the model parameters. Earlier in this section, we also showed that the MLDM is superior to the single linear model. This suggests that by using multiple sets of a linear model, we have gained additional modeling accuracy while maintaining the effectiveness of model learning.

¹¹Each subspace MLP has 30 hidden nodes. So, $123\,936 = 30 \cdot 256 \cdot (D + D_o)$ (MLPs) $+ 44 \cdot (3D + D_o)$ ($\Phi$, $T$, $Q$, $R$), and $246\,816 = 30 \cdot 512 \cdot (D + D_o)$ (MLPs) $+ 44 \cdot (3D + D_o)$ ($\Phi$, $T$, $Q$, $R$), where $D = 4$ and $D_o = 12$.

¹²Each phone MLP has 100 hidden nodes. So, $71\,456 = 100 \cdot 44 \cdot (D + D_o)$ (MLPs) $+ 44 \cdot (3D + D_o)$ ($\Phi$, $T$, $Q$, $R$), where $D = 4$ and $D_o = 12$.

¹³The global RBFNN has 100 Gaussian kernels. So, $3\,056 = 100 \cdot (2D + D_o)$ (RBFNN) $+ 44 \cdot (3D + D_o)$ ($\Phi$, $T$, $Q$, $R$), where $D = 4$ and $D_o = 12$.
VI. DISCUSSION AND CONCLUSION

In this paper, a novel target-directed MLDM of speech is developed. This new model originated from our previous work [9], [10], where a nonlinear form of the observation equation was used to represent the nonlinear relationship between the hidden VTR space and the acoustic space. The use of the nonlinear function caused difficulties in state estimation and, specifically, in the computation of the conditional expectation of the nonlinear function. While many nonlinear filtering approaches have been developed to cope with these difficulties by employing approximation, the effect of the approximation has been poorly understood, especially for the current model of speech. In our previous work [10], we adopted the iterated extended Kalman filtering approach to approximate the sufficient statistics of the hidden states, as required in the E-step of the EM algorithm. We further made the following approximation in the M-step of the EM algorithm:

    E\big[ h(z(k)) \mid O \big] \approx h\big( E[z(k) \mid O] \big)

whose accuracy had not been well understood. However, once the nonlinear observation equation is replaced by a linear one, which was one motivation of this work, we no longer need the above two approximations. Nevertheless, the relationship between the VTR hidden space and the acoustic space is physically nonlinear.¹⁴ This requires us to approximate a nonlinear function by a linear one or by a set of linear ones. To improve the approximation accuracy by use of linear models, we have developed the MLDM. The basic idea underlying this development is that within limited input and output spaces, a global nonlinear relation can be relatively accurately approximated by combining a set of linear regression functions. Applying this idea to the speech modeling problem, we approximate the nonlinear relation between the VTR space and the MFCC space for each separate phone by a mixture of static, linear regression functions. The division of the input-output spaces is achieved in two ways. First, the phone-dependent linear mapping characterized by the regression matrix parameter $\bar{H}$ allows the VTR space to limit itself to only a narrow range of variation specific to the phone. Second, a further limitation of the VTR changes in a phone is achieved by using a set of regression matrices $\bar{H}_m$, each of which further narrows the range of the VTR variation.

¹⁴Similar nonlinear relationships were used in other related work [3], [22].

In Section V, we reported speech recognition experiments aiming to evaluate the new dynamic model of speech. With the use of two mixture components and of only one speaker's data (about half an hour) for model training, the new model is able to achieve significantly better performance than the baseline triphone HMM system trained on 60 hours of data. When the number of mixture components is increased to four and the amount of training data is increased to two hours (from about 81 different speakers), the new model outperforms the baseline HMM system in all cases (with and without references included). For the cases with references included, the new model decreases the relative WER by 10–15%. For the cases without references included, the corresponding absolute WER reduction is about 1–2%. The results show that the mixture concept at the segment level captures the systematic variations in speech, which is consistent with the observations of other researchers [20].

The evaluation results show that the target-directed MLDM proposed in this paper is a promising new approach to spontaneous speech recognition. Our future research will incorporate the phonetic-boundary optimization methods [18] into the learning and scoring algorithms described in this paper. This type of global optimization methodology is expected to further improve the recognition accuracy.

ACKNOWLEDGMENT

We thank Prof. F. Jelinek for the encouragement, discussions, and support of this work, and thank Dr. J. Bridle for suggesting an analysis experiment described in Section V-C and in Table IV. We finally thank the two reviewers who provided constructive suggestions that improved the quality of this paper.

REFERENCES
[1] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association. New York: Academic, 1988.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1994.
[3] C. Blackburn and S. Young, "Toward improved speech recognition using a speech production model," in Proc. Eurospeech, vol. 2, 1995, pp. 1623–1626.
[4] J. Bridle, L. Deng, J. Picone, H. Richards, J. Ma, T. Kamm, M. Schuster, S. Pike, and R. Reagan, "An investigation of segmental hidden dynamic models of speech coarticulation for automatic speech recognition," in Final Report for the 1998 Workshop on Language Engineering, 1998, pp. 1–61.
[5] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. B-39, pp. 1–38, 1977.
[6] L. Deng, "A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal," Signal Process., vol. 27, pp. 65–78, 1992.
[7] L. Deng, "A stochastic model of speech incorporating hierarchical nonstationarity," IEEE Trans. Speech Audio Processing, vol. 1, pp. 471–474, 1993.
[8] L. Deng, "Computational models for speech production," in Computational Models of Speech Pattern Processing, ser. NATO ASI Series. New York: Springer, 1999, pp. 199–214.
[9] L. Deng, "A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition," Speech Commun., vol. 24, no. 4, pp. 299–323, 1998.
[10] L. Deng and J. Z. Ma, "Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics," J. Acoust. Soc. Amer., vol. 108, no. 6, pp. 3036–3048, Dec. 2000.
[11] L. Deng and M. Aksmanovic, "Speaker-independent phonetic classification using hidden Markov models with state-conditioned mixtures of trend functions," IEEE Trans. Speech Audio Processing, vol. 5, pp. 319–324, July 1997.
[12] V. Digalakis, J. Rohlicek, and M. Ostendorf, "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition," IEEE Trans. Speech Audio Processing, vol. 1, pp. 431–442, 1993.
[13] H. Gish and K. Ng, "A segmental speech model with applications to word spotting," in Proc. Int. Conf. Acoust., Speech, Signal Processing, vol. II, 1993, pp. 447–450.
[14] Y. Gong and J.-P. Haton, "Stochastic trajectory modeling for speech recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, vol. I, 1994, pp. 57–60.
[15] A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic, 1970.
[16] O. Kimball, "Segment modeling alternatives for continuous speech recognition," Ph.D. dissertation, Boston Univ., Boston, MA, 1994.
[17] K. Kitagawa, "Non-Gaussian state-space modeling of nonstationary time series," J. Amer. Statist. Assoc., vol. 82, pp. 1032–1041, 1987.
[18] J. Z. Ma and L. Deng, "Optimization of dynamic regimes in a statistical hidden dynamic model for conversational speech recognition," in Proc. Eurospeech, vol. 3, 1999, pp. 1339–1342.
[19] J. M. Mendel, Lessons in Estimation Theory for Signal Processing, Communications and Control. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[20] M. Ostendorf, V. Digalakis, and O. Kimball, "From HMM's to segment models: A unified view of stochastic modeling for speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 360–378, 1996.
[21] J. Picone, S. Pike, R. Reagan, T. Kamm, J. Bridle, L. Deng, Z. Ma, H. Richards, and M. Schuster, "Initial evaluation of hidden dynamic models on conversational speech," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Mar. 1999, pp. 109–112.
[22] H. Richards and J. Bridle, "The HDM: A segmental hidden dynamic model of coarticulation," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Mar. 1999, pp. 357–360.
[23] R. H. Shumway, "An approach to time series smoothing and forecasting using the EM algorithm," J. Time Series Anal., vol. 3, no. 4, 1982.
[24] R. H. Shumway and D. S. Stoffer, "Dynamic linear models with switching," J. Amer. Statist. Assoc., vol. 86, pp. 763–769, 1991.
[25] R. L. Streit and T. E. Luginbuhl, "Probabilistic multi-hypothesis tracking," Studies in Probabilistic Multi-Hypothesis Tracking and Related Topics, vol. SES-98-01, pp. 5–50, 1998.
[26] H. Tanizaki, Nonlinear Filters, 2nd ed. New York: Springer-Verlag, 1996.
Jeff Z. Ma (M'95) received the B.Sc. degree in electrical engineering from Xi'an Jiaotong University, China, in 1989, the M.Sc. degree in pattern recognition from the Chinese Academy of Sciences, China, in 1992, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2000. He was a Research Assistant in the Department of Computer Science, University of Hong Kong, from 1993 to 1995, and a Research Assistant in the Department of Electrical and Computer Engineering, University of Waterloo, from 1996 to 2000. In 2000, he joined BBN Technologies, Cambridge, MA. He has been working on conversational speech recognition in different languages (English, Mandarin, and Arabic), speech recognition over IP channels, Mandarin broadcast-news audio indexing, and topic identification. His current research interests include conversational speech recognition, topic classification, discriminative training, and dynamic models for speech recognition.
Li Deng (S'83–M'86–SM'91) received the B.S. degree from the University of Science and Technology of China in 1982 and the M.S. and Ph.D. degrees from the University of Wisconsin–Madison in 1984 and 1986, respectively. He worked on large-vocabulary automatic speech recognition at INRS-Telecommunications, Montreal, QC, Canada, from 1986 to 1989. In 1989, he joined the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, as an Assistant Professor; he became a Full Professor in 1996. From 1992 to 1993, he conducted sabbatical research at the Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, and from 1997 to 1998 at the ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan. In 1999, he joined Microsoft Research, Redmond, WA, as a Senior Researcher, and he is currently a Principal Investigator in the DARPA-EARS program and an Affiliate Professor of Electrical Engineering at the University of Washington. His research interests include acoustic-phonetic modeling of speech, speech and speaker recognition, speech synthesis and enhancement, speech production and perception, auditory speech processing, noise-robust speech processing, statistical methods and machine learning, nonlinear signal processing, spoken language systems, multimedia signal processing, and multimodal human–computer interaction. In these areas, he has published over 200 technical papers and book chapters and has given keynote, tutorial, and other invited lectures. He recently completed the book Speech Processing—A Dynamic and Optimization-Oriented Approach (New York: Marcel Dekker, 2003). Dr. Deng served on the Education Committee and the Speech Processing Technical Committee of the IEEE Signal Processing Society from 1996 to 2000 and is currently serving as an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He is a Technical Chair of the 2004 International Conference on Acoustics, Speech, and Signal Processing (ICASSP).