Use of Generalized Dynamic Feature Parameters for Speech Recognition

Rathinavelu Chengalvarayan, Member, IEEE, and Li Deng, Senior Member, IEEE

Abstract—In this study, a new hidden Markov model that integrates generalized dynamic feature parameters into the model structure is developed and evaluated using maximum-likelihood (ML) and minimum-classification-error (MCE) pattern recognition approaches. In addition to the motivation of direct minimization of error rate, the MCE approach automatically eliminates the necessity of artificial constraints, which were essential for the model formulation based on the ML approach, on the weighting functions in the definition of the generalized dynamic parameters. We design the loss function for minimizing error rate specifically for the new model, and derive an analytical form of the gradient of the loss function that enables the implementation of the MCE approach. The convergence property of the training procedure based on the MCE approach is investigated, and the experimental results from a standard TIMIT phonetic classification task demonstrate a 13.4% error rate reduction compared with the ML approach.
I. INTRODUCTION
DURING the past decade, use of the dynamic feature parameters associated with speech spectra has resulted in demonstrable success in enhancing the performance of speech recognition systems [2], [9], [11], [12], [15]–[17]. In practically all these systems, however, the speech spectral dynamics have been represented as naively as taking differences of, or other experimentally chosen combinations of, the "static" feature parameters (e.g., cepstral coefficients as a simple transformation of the speech spectra). This representation is often given the names of delta-cepstrum, delta-delta-cepstrum, etc., depending on the order of the temporal differencing, and has been confined strictly to the speech preprocessing domain in speech recognizer design. The objective of the research reported in this paper is to generalize the already successful, despite its empirical nature, delta-cepstrum technique such that the design of the dynamic features of speech is gracefully integrated into the overall speech recognizer design, including optimization of the speech model parameters. Although the basic principle guiding our research is sufficiently general and can be applied to all types of speech recognizers, we restrict our presentation to only the recognizer based on the hidden Markov model (HMM) representation of the speech spectra/cepstra.

This paper is devoted to the maximum likelihood (ML) and the more effective but more computation-intensive minimum classification error (MCE) approaches to the integrated design of the dynamic feature parameters of speech. Both of these approaches are applied to a common statistical model of speech, viz., the phonetic HMM with continuous mixture Gaussian output densities, which has proved to be a most effective modeling machinery for high-performance speech recognition [9], [22]. When the generalized dynamic features are used as an integrated component in this HMM as the speech model, we shall call the model the integrated HMM.

The organization of this paper is as follows. The statistical model of speech incorporating generalized dynamic feature parameters is described in detail in Section II. Section III contains the detailed description of the joint optimization (via ML) of the conventional HMM parameters and of the time-varying, state-dependent weighting functions that define the generalized dynamic features of speech. In Section IV, we provide full detail of the MCE training algorithm developed in this study specifically for the integrated HMM. In particular, in Section IV-C, we rigorously carry out the computation of the gradient of the loss function with respect to the new model parameters (not present in the conventional HMM). In Section V, we report experimental results obtained using a standard phonetic classification task from the TIMIT data base as a test-bed to evaluate the performance of the speech recognizer incorporating the generalized dynamic features. The experimental results provide preliminary evidence for the effectiveness of our new approach to the use of dynamic characteristics of speech spectra in phonetic classification. The superior performance of the MCE approach over the ML approach, both for the integrated HMM and for the conventional HMM, is also demonstrated and analyzed. Finally, we summarize our findings in Section VI.

Manuscript received December 7, 1994; revised October 22, 1996. The work of C. Rathinavelu was supported by the Natural Sciences and Engineering Research Council of Canada under a Commonwealth Scholarship. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Amro El-Jaroudi. R. Chengalvarayan was with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont., Canada N2L 3G1. He is now with the Speech Processing Group, Bell Laboratories, Lucent Technologies, Naperville, IL 60566 USA. L. Deng is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont., Canada N2L 3G1 (e-mail: [email protected]).

II. A STATISTICAL MODEL OF SPEECH INCORPORATING GENERALIZED DYNAMIC FEATURE PARAMETERS

The statistical model that incorporates generalized dynamic speech features described in this paper extends the earlier unimodal Gaussian version of the integrated HMM [8] to the current Gaussian mixture version. This statistical model integrates the dynamic features, which belong traditionally to the preprocessing domain, into the speech modeling process.
The integration is accomplished by defining a set of HMM-state-dependent weighting functions, which serve the role of converting the static features to the dynamic ones in a time-varying manner, as a set of intrinsic parameters of the model that can be learned from the speech data.

Let {X^{(1)}, X^{(2)}, ..., X^{(L)}} denote a set of static-feature (vector) sequences (i.e., variable-length tokens), and let X^{(n)} denote the nth sequence having T_n frames. The dynamic feature vector at time frame t is defined as a linear combination of the static features stretching over the interval of f frames forward and b frames backward according to

    y_t(i, m) = \sum_{\tau = -b}^{f} w_\tau(i, m)\, x_{t+\tau}    (1)

where w_\tau(i, m) is the \tau th weighting coefficient associated with the mth mixture residing in the Markov state i. In matrix form, (1) can be written as
    y_t(i, m) = X_t\, w(i, m)

with

    X_t = [x_{t-b} \;\; x_{t-b+1} \;\; \cdots \;\; x_{t+f}], \qquad w(i, m) = [w_{-b}(i, m), \ldots, w_{f}(i, m)]^T

or, element by element,

    y_{t,d}(i, m) = [x_{t-b,d} \;\; \cdots \;\; x_{t+f,d}]\, w(i, m)

where subscript d denotes the individual element in the feature vector. The static feature matrix X_t above has the dimensionality D \times (f + b + 1), with D being the dimension of the static feature vectors. According to the definition of (1), the dynamic features can be physically interpreted as the output of a time-varying linear filter with the static feature vector sequence serving as the input. The time-varying filter coefficients evolve slowly according to the Markov chain in the HMM.
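To make (1) and its matrix form concrete, the following minimal Python sketch (our own illustration; the names are hypothetical, and the clamping of frames at the sequence boundaries is an assumption not specified in the paper) computes one generalized dynamic feature vector:

    import numpy as np

    def dynamic_feature(X, t, w, f=2, b=2):
        """Generalized dynamic feature of (1): y_t = sum_{tau=-b..f} w_tau * x_{t+tau}.
        X: (T, D) static feature sequence; w: (f+b+1,) weights for one
        state/mixture pair; frames outside [0, T) are clamped (assumption)."""
        T, D = X.shape
        taus = np.arange(-b, f + 1)
        frames = np.clip(t + taus, 0, T - 1)
        return X[frames].T @ w        # X_t w(i, m): (D, f+b+1) @ (f+b+1,) -> (D,)

With w fixed in advance to the usual regression weights, this reduces to the conventional delta-cepstrum; in the integrated HMM, w(i, m) is instead a trainable, state- and mixture-dependent model parameter.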
A finite mixture Gaussian density associated with each integrated HMM state i (of a total of N states) assumes the form

    b_i(z_t^{(n)}) = \sum_{m=1}^{M} c_{i,m}\, b_{i,m}^{x}(x_t^{(n)})\, b_{i,m}^{y}(y_t^{(n)})    (2)

where z_t^{(n)} = (x_t^{(n)}, y_t^{(n)}) is the augmented feature vector of the nth token at frame t, M is the number of mixture components, and c_{i,m} is the mixture weight for the mth mixture in state i. In (2), b_{i,m}^{x} and b_{i,m}^{y} are the D-dimensional^1 unimodal Gaussian densities

    b_{i,m}^{x}(x_t) = (2\pi)^{-D/2} |\Sigma_{i,m}^{x}|^{-1/2} \exp\{-\tfrac{1}{2}(x_t - \mu_{i,m}^{x})^T (\Sigma_{i,m}^{x})^{-1} (x_t - \mu_{i,m}^{x})\}    (3)

    b_{i,m}^{y}(y_t) = (2\pi)^{-D/2} |\Sigma_{i,m}^{y}|^{-1/2} \exp\{-\tfrac{1}{2}(y_t - \mu_{i,m}^{y})^T (\Sigma_{i,m}^{y})^{-1} (y_t - \mu_{i,m}^{y})\}    (4)

where variables x and y stand for the static and the generalized dynamic features defined by (1), respectively. Superscripts T and -1 denote vector transposition and matrix inversion, respectively.

^1 In our actual implementation of the recognizer using speech cepstra as the static features x (Section IV), the zeroth-order cepstrum is removed from the static features after the associated dynamic feature is determined from (1). So the generalized dynamic features y defined in (1) have a dimension one higher than the static features x.
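The densities (2)–(4) are conveniently evaluated in the log domain. A minimal sketch (our own illustration, assuming the diagonal covariances used in the implementation reported in Section V) of log b_i(z_t) follows:

    import numpy as np

    def log_output_prob(x, y, c, mu_x, var_x, mu_y, var_y):
        """log b_i(z_t) of (2): mixture over products of the diagonal
        Gaussians (3) and (4). c: (M,) mixture weights; mu_x, var_x: (M, Dx);
        mu_y, var_y: (M, Dy)."""
        def log_gauss(v, mu, var):    # diagonal Gaussian, evaluated per mixture
            return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (v - mu) ** 2 / var, axis=1)
        log_mix = np.log(c) + log_gauss(x, mu_x, var_x) + log_gauss(y, mu_y, var_y)
        m = log_mix.max()
        return m + np.log(np.exp(log_mix - m).sum())   # log-sum-exp over mixtures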
III. MAXIMUM LIKELIHOOD CRITERION FOR TRAINING PARAMETERS OF THE INTEGRATED HMM

In this section, we describe closed-form solutions for jointly training all the integrated HMM parameters using the celebrated expectation-maximization (EM) algorithm according to the ML criterion. The algorithm consists of an iterative expectation step (E-step) and maximization step (M-step). The E-step involves evaluation and simplification of the conditional expectation Q(\Lambda | \Lambda_0), where \Lambda and \Lambda_0 are the present estimate of the model parameters and the estimate prior to the current iteration, respectively. The present estimate of the model parameters is obtained in the M-step via maximization of Q(\Lambda | \Lambda_0). According to the celebrated Baum's inequality [4], each iteration of the above two steps leads to a set of new model parameters with the objective function guaranteed to be nondecreasing. The objective function can be simplified through a set of well-established procedures [20] in the E-step to a form suitable for maximization in the M-step. Reestimates for the model parameters are obtained in the M-step of the EM algorithm by simultaneous maximization of the simplified objective function with respect to all model parameters. The reestimation formulas for joint optimization of the state-dependent weighting functions defining the dynamic features and of the mixture Gaussian means for the dynamic features are the key result of this part of the study. The reestimation formulas for the remaining parameters are similar to those for the conventional HMM [19].

We now formally present the reestimation procedure for joint optimization of w(i, m) and \mu_{i,m}^{y}. In order for the ML approach to the problem of estimating dynamic feature parameters to be sensible, constraints on the parameters w(i, m) must be provided. This is so because an infinitely high likelihood would otherwise be achieved by uniformly setting w(i, m) = 0, destroying discriminability among different speech classes. (In the minimum classification error approach to the parameter estimation problem presented in the next section, the need for the constraints is entirely eliminated.) In this study, we explore a nonlinear constraint (NC) that is imposed on the solution of the problem of jointly optimizing w(i, m) and \mu_{i,m}^{y} in the M-step of the EM algorithm:

    \sum_{\tau = -b}^{f} w_\tau^2(i, m) = \theta

where \theta > 0 is a model-specific constant, serving the role of eliminating the possibility that all w_\tau(i, m)'s are set to zero (singularity), which would give an infinitely large but senseless likelihood. The reestimation formulas can then be derived by setting the partial derivatives of the objective function Q with respect to each of the model parameters w_\tau(i, m) and \mu_{i,m}^{y} to zero, subject to the above constraint, and solving the resulting system of equations. Solution of this set of nonlinear equations, based on the Newton–Raphson method, gives the reestimates of the model parameters as the outcome of the M-step of the EM algorithm [20].
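Assuming the sum-of-squares form of the NC written above, a simple way to keep iterates on the constraint surface is to rescale after each unconstrained update. The following is a minimal sketch of that projection only; it is not the Newton–Raphson solution of the constrained M-step used by the authors:

    import numpy as np

    def project_to_nc(w, theta=2.0):
        """Rescale the weight vector so that sum(w**2) == theta (the NC of
        Section III); theta = 2 is the value used later in Section V-C."""
        norm_sq = float(np.sum(w ** 2))
        if norm_sq == 0.0:
            raise ValueError("the all-zero weight vector cannot satisfy the NC")
        return w * np.sqrt(theta / norm_sq)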
Although the ML approach has enjoyed wide success in HMM-based speech recognition technology [3], [19], several inherent limitations of the approach have been under intensive investigation recently. Such investigation has led to a new approach aiming at direct minimization of the expected speech classification or recognition error rate [6], [13]. This new approach has been successfully applied to a variety of popular structures for speech recognition, including the HMM [6], dynamic time warping [5], and neural networks [14]. Both the theoretical and practical issues of the MCE approach have been investigated in the greatest detail for speech recognition systems based on the HMM representation of speech statistics, because of the HMM's widely acknowledged status as the prevalent technology and because of its rich theoretical structure, which makes it most suitable for such investigation. In the next section, the MCE approach is extended from the conventional HMM to the integrated HMM.
IV. MINIMUM CLASSIFICATION ERROR TRAINING FOR THE INTEGRATED HMM

The approach we employed in the previous section for the parameter estimation problem associated with the integrated HMM was built on the maximum likelihood (ML) principle. Use of an entirely new theoretical principle, the minimum classification error (MCE) principle, to design the integrated HMM is the subject of the current section. One principal motivation for the use of MCE training is to overcome a key weakness of ML training as applied to the integrated HMM; that is, the constraints applied to the weighting parameters transforming static speech features to dynamic ones are entirely artificial. The only reason for introducing these constraints was to avoid the infinitely high likelihoods favored by the ML training criterion. Such infinitely high likelihoods would be obtained simply by setting the weighting parameters to be uniformly zero, which destroys the discriminability of the classifier among different speech classes altogether.

In the supervised training mode that we assume, each training token (an augmented data sequence Z) is known to belong to one of K classes C_k, k = 1, ..., K. The goal of MCE training is to find the classifier parameter set, denoted by \Lambda, such that the probability of misclassifying any Z is minimized; the resulting \Lambda gives the optimal solution of the classifier. In the integrated HMM, the classifier parameter set \Lambda consists of all the state-dependent, mixture-dependent weighting functions w(i, m), together with the conventional HMM parameters [including Markov transition probabilities a, mixture weights c, mixture Gaussian mean vectors \mu, and mixture Gaussian covariance matrices \Sigma], for all the models, each representing a distinctive class of the speech sounds to be classified.

A. The MCE Optimization Criterion for the Integrated HMM

The first step in the formulation of the objective function is to choose an appropriate discriminant function g_j(Z; \Lambda) according to the following decision rule for classification:

    C(Z) = C_k \quad \text{if} \quad g_k(Z; \Lambda) = \max_j g_j(Z; \Lambda)    (5)

where C(Z) is the class associated with the test data Z as determined by the classifier.

In our implementation of the integrated HMM, we choose the most likely (optimal) state path traversing the Markov model as the basis for defining the discriminant function. The log-likelihood score of the input utterance along the optimal state sequence \Theta = (\theta_1, ..., \theta_{T_n}) for the model associated with the jth class can be written as

    g_j(Z^{(n)}; \Lambda) = \max_{\Theta} \sum_{t=1}^{T_n} \left[ \log a_{\theta_{t-1}\theta_t}^{(j)} + \log b_{\theta_t}^{(j)}(z_t^{(n)}) \right]    (6)

where b_i^{(j)}(z_t^{(n)}) is the probability of generating the feature vector z_t^{(n)} at time t in state i by the model for the jth class, a^{(j)} is the transition probability of the jth model, and T_n is the number of frames of the nth observation sequence. In our implementation of the integrated HMM, we assume uncorrelatedness between the static features and the generalized dynamic features. Hence, the output probability takes the form

    b_i^{(j)}(z_t) = \sum_{m=1}^{M} c_{i,m}^{(j)}\, b_{i,m}^{x,(j)}(x_t)\, b_{i,m}^{y,(j)}(y_t)    (7)

where vector z_t = (x_t, y_t) is the augmented feature vector defined in Section II.

Given a discriminant function, a misclassification measure for an input training utterance Z from class C_k can be defined as follows to quantify the classification behavior:

    d_k(Z; \Lambda) = -g_k(Z; \Lambda) + \log \left[ \frac{1}{K-1} \sum_{j \neq k} \exp(\eta\, g_j(Z; \Lambda)) \right]^{1/\eta}    (8)

where \eta is a positive number and K is the total number of classes. d_k(Z; \Lambda) above is a quantity that indicates the degree of confusion between the correct class and the other competing classes for a given input utterance Z. When \eta approaches \infty, the misclassification measure becomes

    d_k(Z; \Lambda) \to -g_k(Z; \Lambda) + g_{\hat{j}}(Z; \Lambda) \quad \text{as} \quad \eta \to \infty    (9)

where \hat{j} = \arg\max_{j \neq k} g_j(Z; \Lambda) is the most confusable class. Clearly, a positive value of d_k(Z; \Lambda) indicates a misclassification and a negative value implies a correct decision.
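The misclassification measure (8) can be computed stably with a shifted log-sum-exp, as in the following sketch (our own illustration; g is the vector of discriminant scores g_j(Z; \Lambda)):

    import numpy as np

    def misclassification_measure(g, k, eta=10.0):
        """d_k of (8) for score vector g, correct class k, and eta > 0."""
        g = np.asarray(g, dtype=float)
        others = np.delete(g, k)                  # the K-1 competing classes
        m = others.max()
        # (1/eta) * log of the mean of exp(eta * g_j) over the competitors
        return -g[k] + np.log(np.mean(np.exp(eta * (others - m)))) / eta + m

As eta grows, the competitor term tends to the largest competing score, recovering the limit (9).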
Given a misclassification measure, we further define a smoothed loss function for each class C_k as follows:

    \ell_k(Z; \Lambda) = \frac{1}{1 + \exp(-\gamma\, d_k(Z; \Lambda))}    (10)

which approximates the classification error count. That is, the loss function assigns a near-zero penalty when an input is correctly classified and a near-unity penalty when an input is misclassified. The parameter \gamma controls the slope of the above smoothed zero-one function. Finally, given a loss function defined for each class, we define the overall loss function for the entire classifier as

    \ell(Z; \Lambda) = \sum_{k=1}^{K} \ell_k(Z; \Lambda)\, 1(Z \in C_k)    (11)

where 1(E) is the indicator function of a logic expression E, giving the value one if E is true and zero otherwise. An empirical average loss function can be obtained by averaging the overall loss function for the classifier over all N utterances in the training set according to

    L(\Lambda) = \frac{1}{N} \sum_{n=1}^{N} \ell(Z^{(n)}; \Lambda)    (12)

When the loss function (11), specific to each training token Z, is used as the performance criterion to be minimized to achieve a minimum-error classifier design, we call the procedure token-by-token training. When the empirical average loss function (12), which is independent of any single training token, is used as the performance criterion, we call the procedure batch training. In this paper, we report only the results obtained by token-by-token training.
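A sketch of the loss computation (our own illustration) for (10) and (12), taking the misclassification measures d_k as inputs:

    import numpy as np

    def token_loss(d_k, gamma=1.0):
        """Smoothed zero-one loss (10) for one training token."""
        return 1.0 / (1.0 + np.exp(-gamma * d_k))

    def average_loss(d_values, gamma=1.0):
        """Empirical average loss (12) over the N training tokens."""
        return float(np.mean([token_loss(d, gamma) for d in d_values]))

Token-by-token training descends on token_loss each time a token is presented; batch training descends on average_loss.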
B. The Optimization Algorithm

Since (11) is a well-defined function, standard optimization techniques can be used to achieve a minimum-error classifier design. In the current work, we examine the gradient descent method, in which the parameters are adjusted in the negative gradient direction. The loss function is minimized, each time a training token is presented, by adaptively adjusting the parameter set \Lambda according to

    \Lambda_{l+1} = \Lambda_l - \epsilon_l \nabla \ell(Z; \Lambda)|_{\Lambda = \Lambda_l}    (13)

where \Lambda_l is the parameter set at the lth iteration, \nabla \ell(Z; \Lambda) is the gradient of the loss function for training sample Z, and \epsilon_l is a small positive learning constant

    \epsilon_l = \epsilon \left( 1 - \frac{n}{I} \right)    (14)

where \epsilon is a positive number, I is a large prescribed number (the limit for the number of iterations), and n is the number of epochs (an epoch represents a completion of processing on the entire training data) in the training. The step size given by (14) is chosen empirically, as too large a value leads to unstable behavior and too small a value leads to unnecessarily slow learning. With an appropriate step size chosen, the update rule given by (13) will converge to a (local) minimum.
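A sketch of one update of (13)–(14) follows (our own illustration; the values of eps and I are placeholders, since the paper chooses them empirically):

    import numpy as np

    def gpd_update(params, grad, n, eps=0.1, I=1000):
        """One token-by-token step of (13) with the decaying step size (14),
        eps_l = eps * (1 - n / I); params and grad are flat arrays in the
        transformed, unconstrained domain of Section IV-C."""
        step = eps * (1.0 - n / I)
        return np.asarray(params) - step * np.asarray(grad)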
C. Gradient Computation

In MCE discriminative training, the integrated HMM parameters are adaptively adjusted to reduce the overall loss function along a gradient descent direction. These model parameters must satisfy certain constraints, such as the positive definiteness of the covariance matrices and the stochastic constraint \sum_m c_{i,m} = 1, which would otherwise need to be checked in every iteration cycle. For easier implementation, the constrained parameters are transformed to an unconstrained domain and the gradient is computed with respect to the transformed parameters \tilde{\mu}, \tilde{\sigma}, and \tilde{c}, as follows:

    \tilde{\mu}_{i,m,d} = \frac{\mu_{i,m,d}}{\sigma_{i,m,d}}    (15)

    \tilde{\sigma}_{i,m,d} = \log \sigma_{i,m,d}    (16)

    c_{i,m} = \frac{\exp(\tilde{c}_{i,m})}{\sum_{m'} \exp(\tilde{c}_{i,m'})}    (17)

where \sigma_{i,m,d}^2 denotes the dth diagonal element of the covariance matrix; \Sigma_{i,m}^{x,(j)} and \Sigma_{i,m}^{y,(j)} are the diagonal covariance matrices (as implemented in this study) associated with class j, mixture m, and state i, for the static and dynamic features, respectively, and c_{i,m}^{(j)} is the mth mixture weight in state i associated with class j. We make a special note here that in the discriminative training for the integrated HMM, the need for the constraints on the parameters w(i, m) defining the dynamic parameters, essential in ML training, is completely eliminated.

The following gradient equations are obtained by computing the partial derivatives of the loss function \ell_k with respect to each integrated HMM parameter of the jth model for a given training token Z belonging to the kth class (the class index j is implicit on the model parameters):

    \frac{\partial \ell_k}{\partial w_\tau(i,m)} = -\alpha\, \psi_j \sum_{t \in T_i} \zeta_t(i,m) \sum_{d} \frac{y_{t,d} - \mu_{i,m,d}^{y}}{(\sigma_{i,m,d}^{y})^2}\, x_{t+\tau,d}    (18)

    \frac{\partial \ell_k}{\partial \tilde{\mu}_{i,m,d}^{x}} = \alpha\, \psi_j \sum_{t \in T_i} \zeta_t(i,m)\, \frac{x_{t,d} - \mu_{i,m,d}^{x}}{\sigma_{i,m,d}^{x}}    (19)

    \frac{\partial \ell_k}{\partial \tilde{\mu}_{i,m,d}^{y}} = \alpha\, \psi_j \sum_{t \in T_i} \zeta_t(i,m)\, \frac{y_{t,d} - \mu_{i,m,d}^{y}}{\sigma_{i,m,d}^{y}}    (20)

    \frac{\partial \ell_k}{\partial \tilde{\sigma}_{i,m,d}^{x}} = \alpha\, \psi_j \sum_{t \in T_i} \zeta_t(i,m) \left[ \left( \frac{x_{t,d} - \mu_{i,m,d}^{x}}{\sigma_{i,m,d}^{x}} \right)^2 - 1 \right]    (21)

    \frac{\partial \ell_k}{\partial \tilde{\sigma}_{i,m,d}^{y}} = \alpha\, \psi_j \sum_{t \in T_i} \zeta_t(i,m) \left[ \left( \frac{y_{t,d} - \mu_{i,m,d}^{y}}{\sigma_{i,m,d}^{y}} \right)^2 - 1 \right]    (22)

    \frac{\partial \ell_k}{\partial \tilde{c}_{i,m}} = \alpha\, \psi_j \sum_{t \in T_i} \left[ \zeta_t(i,m) - c_{i,m} \right]    (23)

where the set T_i includes all the time indices such that the state index of the optimal state sequence at time t belongs to the ith state of the Markov chain, i.e.,

    T_i = \{ t : \theta_t = i, \; 1 \le t \le T_n \}    (24)

Other quantities in (18)–(23) are

    \alpha = \gamma\, \ell_k(Z; \Lambda)\, [1 - \ell_k(Z; \Lambda)]    (25)

    \psi_j = \begin{cases} -1, & \text{if } j = k \\ \dfrac{\exp(\eta g_j)}{\sum_{i' \neq k} \exp(\eta g_{i'})}, & \text{if } j \neq k \end{cases}    (26)

    \zeta_t(i,m) = \frac{c_{i,m}\, b_{i,m}(z_t)}{\sum_{m'} c_{i,m'}\, b_{i,m'}(z_t)}    (27)

and

    b_{i,m}(z_t) = b_{i,m}^{x}(x_t)\, b_{i,m}^{y}(y_t)    (28)

Note in the above that \alpha of (25) serves as an adaptive step size of the parameter adjustment. It can be seen from (25) that a substantial parameter adjustment is made when the absolute value of d_k is small, that is, when the training token is likely to be misclassified. On the other hand, when the absolute value of d_k is large, that is, when the input token is either unlikely to cause confusion or is obviously an extreme outlier, the amount of adjustment is accordingly reduced. Detailed derivations leading to (18)–(23) are provided in the Appendix.
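To make the common scaling in (18)–(23) concrete, the following sketch (our own illustration, not the authors' code) evaluates the product of \alpha of (25) and \psi_j of (26) from a vector of discriminant scores:

    import numpy as np

    def gradient_scale(g, j, k, eta=10.0, gamma=1.0):
        """alpha * psi_j, the factor multiplying every gradient in (18)-(23),
        for model j, correct class k, and discriminant scores g."""
        g = np.asarray(g, dtype=float)
        others = [i for i in range(len(g)) if i != k]
        m = max(g[others])
        lse = np.log(np.mean(np.exp(eta * (g[others] - m)))) / eta + m
        d_k = -g[k] + lse                              # (8)
        l_k = 1.0 / (1.0 + np.exp(-gamma * d_k))       # (10)
        alpha = gamma * l_k * (1.0 - l_k)              # (25): peaks near d_k = 0
        if j == k:
            psi = -1.0                                 # (26), correct class
        else:
            e = np.exp(eta * (g[others] - m))
            psi = float(e[others.index(j)] / e.sum())  # (26), competitor j
        return alpha * psi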
V. EXPERIMENTAL EVALUATION OF THE INTEGRATED MODEL

In view of the preliminary evidence for the effectiveness of MCE training for the conventional HMM reported in [6], and of the motivation for the use of generalized dynamic features allowing joint optimization in the recognizer design, it is naturally expected that the integrated HMM trained using the MCE principle as described in Section IV be superior in performance to both the integrated HMM using ML training (see Section III for details) and the conventional HMM constructed from MCE training. This expectation has been confirmed in the phonetic classification experiments carried out in this study.

A. Task and Corpus

The experiments described in this section are aimed at classifying the 61 quasiphonemic labels defined in the TIMIT data base. In keeping with the convention adopted by the speech recognition community, we folded 22 phone labels into the remaining 39 in determining classification accuracy. After the folding, the 39 effective categories are: uw ux, uh, ah ax ax-h, aa ao, ae, eh, ih ix, ey, iy, y, ay, ow, aw, oy, er axr, r, l el, w, m em, n en nx, ng eng, dx, v, th, dh, hh hv, z, s, sh zh, f, jh, ch, b, d, g, p, t, k, and bcl dcl gcl kcl pcl tcl epi ps q pau. Confusions within the same category are not counted in calculating classification accuracy. The training set consists of 442 speakers, both male and female, resulting in 3536 sentences from a training subset of the TIMIT data base. The test set consists of 160 sentences (a total of 5775 phone tokens), spoken by 20 speakers completely disjoint from the training set. In these data sets, only "sx" and "si" sentences were used.

B. Computation of the Static Speech Features
The raw speech data in TIMIT are in the form of waveforms. The static speech features are computed under the following analysis conditions:

Sampling rate: 16 kHz
Frame size: 10 ms (160 samples)
Window type: Hamming
Window length: 32 ms (512 samples)
Window overlap: 22 ms (352 samples)
Analysis: short-time spectrum analysis
Features: mel-frequency cepstrum coefficients (MFCC's)

For the computation of the MFCC's, 25 triangular bandpass filters are simulated, spaced linearly from 0–1 kHz and exponentially from 1–8.86 kHz, with adjacent filters overlapping in frequency by 50%. The fast Fourier transform (FFT) power spectrum points are combined using a weighted sum to obtain the output of each triangular filter. The MFCC's (static features) are then computed according to

    \text{MFCC}_n = \sum_{j=1}^{25} E_j \cos \left[ n \left( j - \frac{1}{2} \right) \frac{\pi}{25} \right]    (29)

where E_j is the log-energy output of the jth mel-filter. An eight-component static feature vector is extracted every 10 ms throughout the signal. The augmented feature vectors used for the benchmark HMM consist of 15 elements: seven cepstrum coefficients, seven delta cepstra, and the delta log energy. For the integrated HMM, only the static feature vectors are used as the raw data for the recognizer, which constructs the dynamic feature parameters internally in the manner described in detail in Section II.
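A sketch of (29) follows (our own illustration; whether the cosine index n starts at zero or one, and how the coefficient carrying the overall log energy is treated, are assumptions here):

    import numpy as np

    def mfcc(E, n_coeffs=8):
        """MFCC of (29): cosine transform of the 25 log mel-filter energies E."""
        E = np.asarray(E, dtype=float)
        J = len(E)
        n = np.arange(n_coeffs)[:, None]          # coefficient index (assumed 0..7)
        j = np.arange(1, J + 1)[None, :]          # filter index 1..J
        return (E[None, :] * np.cos(n * (j - 0.5) * np.pi / J)).sum(axis=1)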
C. Study of Convergence Property of the Training Procedure

In this section, we report the results of an empirical study on the convergence property of the MCE training procedure described in Section IV. Fixing the 39-phone classification task, we first initialized the integrated HMM with the parameters obtained from ML training. The integrated HMM in use is the version that contains the nonlinear constraint on the state-dependent weighting function that determines the generalized dynamic parameters. Some parameters of the models used to obtain the results reported in this section are: number of HMM states N = 3, number of Gaussian mixtures in each state M = 5, optimal window parameters f = 2 and b = 2, and nonlinear weight-constraint parameter \theta = 2.

Fig. 1 presents the empirical results on the behavior of the training procedure for the context-dependent phonetic classification task. The upper graph plots the 39-phone classification rates as a function of the epoch, or iteration, of the MCE training algorithm for the training data (closed set, solid lines) and for the test data (open set, dashed lines), respectively. (Training data consist of all phone tokens from 442 speakers, and test data of those from 20 speakers.)

Fig. 1. Convergence characteristics of the gradient-descent-based MCE training procedure. Upper graph plots the context-dependent 39-phone classification rates as a function of the training epoch for the training data (closed set with solid lines) and for the test data (open set with dashed lines), respectively. The nonlinear-constraint version of the integrated HMM is used. Lower graph plots the average loss, defined by (12) and evaluated at the end of each training epoch, as a function of the training epoch.

The lower graph of Fig. 1 plots the average loss, defined by (12) and evaluated at the end of each training epoch, as a function of the training epoch. The convergence behavior of the MCE training, which we expected from general theoretical considerations, is confirmed by the results shown in Fig. 1; that is, the classification rates monotonically increase with the training epoch, and the average loss monotonically decreases, both reaching their respective asymptotic values after 20 epochs of training. Note that the decreasing values of the average loss over the training epochs follow, in a qualitative manner, the same trend as those of the classification error rate for the training set. This indicates that the original objective set out for minimizing the classification error via MCE training is accomplished. In the next section we report full details of the phonetic classification results, focusing on the comparative performance of the integrated HMM versus the conventional state-of-the-art HMM.

D. Phonetic Classification Results

The main goal of the experiments designed in this study is to investigate the relative effectiveness of the generalized dynamic-parameter technique in comparison with the conventional one. We have therefore attempted to keep all other aspects of the speech models associated with the conventional and the generalized techniques as much in common as possible, and to keep the recognizer structure as simple as possible. For both the integrated HMM and its benchmark counterpart (i.e., the conventional HMM using the preprocessor that appends the delta feature vectors to the static ones), each phone is represented by a three-state, left-to-right HMM with no skips. For the experiments reported in this paper, we fixed the number of iterations of the EM algorithm at five for the integrated HMM as well as for the benchmark HMM. The covariance matrices in all the states of all the models are diagonal and are not tied.
For the ML approach, we have implemented the integrated HMM with the nonlinear constraint imposed on the state-dependent weights that define the generalized dynamic parameters. For the MCE approach, the initial model parameters were taken directly from the integrated HMM trained by the ML criterion with the nonlinear constraint. Further, for both the integrated HMM and the benchmark HMM, we have explored context-independent and context-dependent versions of the phonetic model. For the context-independent version, a total of 39 models (39 × 3 = 117 states) were constructed, one for each of the 39 classes intended for the classification task. For the context-dependent version, a total of 1209 states were constructed, with each three-state combination out of these states representing one allophone conditioned on predefined merged phonetic classes as left and right contexts (i.e., generalized triphones). These predefined merged phonetic classes (15 in total) are listed in Table I; they were modified from the merged classes published in [10] and [18].

We have counted the total number of legal triphones encountered in the training set of the TIMIT data base to be as large as 21 127 without merging of phones as the context. This would amount to as many as 21 127 × 3 = 63 381 HMM states in total. Even with the merging of the phonetic classes into a total of 15, the total number of HMM states would still be as many as 15 × 15 × 39 × 3 = 26 325 if no state tying were put in place. To reduce the HMM state size, we adopted the following state-tying strategy:

• The center states of all triphone HMM's are tied, independent of the left and right contexts.
• The right-most states of all triphone HMM's are tied, independent of the left context.
• The left-most states of all triphone HMM's are tied, independent of the right context.

After this state tying, the total number of states for all the triphone models is reduced to (15 + 15 + 1) × 39 = 1209. This set of context-dependent models was used throughout the experiments reported in this section.

The results shown in Table II are obtained by varying the window variables (f and b) in the definition of the generalized dynamic parameters. The task is context-dependent classification of six stop consonants in the TIMIT data.^2 We observe from Table II that the optimal window length over which the generalized dynamic speech features are determined is about five frames (equivalent to 50 ms for our 10-ms frame size). Use of a window size of two or six frames results in a noticeable drop in the performance measure.

The experimental results for the 39-phone classification task shown in Table III are obtained using the optimal window f = 2, b = 2 determined from the results just described. We observe from Table III that for both the context-independent and context-dependent classification tasks, the integrated HMM initialized by the ML-trained model with the nonlinear constraint (last row, NC-IHMM) is superior to the benchmark HMM.

^2 Compared with the 39-phone classification task, this six-stop classification task incurs much less computation in the MCE training.
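The state counts quoted for the tying strategy above can be checked directly (a worked restatement of the arithmetic, not part of the original experiments):

    n_classes = 39
    untied = 21127 * 3                  # all legal triphones, 3 states each -> 63381
    merged = 15 * 15 * n_classes * 3    # merged contexts, no state tying    -> 26325
    tied = (15 + 15 + 1) * n_classes    # tied center/left/right states      -> 1209
    print(untied, merged, tied)         # 63381 26325 1209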
TABLE I LIST OF 15 MERGED PHONETIC CLASSES USED IN CONSTRUCTING CONTEXT-DEPENDENT PHONETIC MODELS
TABLE II CLASSIFICATION RATE FOR SIX STOP CONSONANTS IN TIMIT, IN THE CONTEXT-DEPENDENT MODE, AS A FUNCTION OF THE WINDOW VARIABLES (f AND b). RESULTS ARE SHOWN FOR THE BENCHMARK HMM (COLUMN 2) AND FOR THE INTEGRATED HMM INITIALIZED BY THE ML-TRAINED MODEL WITH THE NC CONSTRAINT (COLUMN 3). ALL MODELS WERE TRAINED USING THE MCE CRITERION. THE NUMBER OF MIXTURES IN ALL HMM STATES IS FIXED AT ONE
The best classification rate, 81.45%, achieved with use of the context-dependent integrated HMM trained by the MCE criterion starting from the nonlinear-constraint ML version of the integrated HMM, represents a 7.1% error rate reduction compared with the benchmark HMM also trained with the MCE criterion (80.04%). It also represents a 13.4% error rate reduction compared with the same version of the integrated HMM trained with only the ML criterion (78.58%; see [21]). In general, moving from ML training to MCE training, we achieve a classification error rate reduction ranging from 10–25% for the integrated HMM as well as for the benchmark HMM. To the best of our knowledge, even for the benchmark HMM (i.e., the same HMM evaluated in [6]), the results we report in the present study are the first demonstrating the effectiveness of MCE training for the standard TIMIT 39-phone classification
TABLE III TIMIT 39-PHONE CONTEXT-INDEPENDENT (LEFT COLUMNS) AND CONTEXT-DEPENDENT (RIGHT COLUMNS) CLASSIFICATION RATE AS A FUNCTION OF THE MODEL TYPE (ALL USING THE MCE TRAINING) AND OF THE NUMBER OF GAUSSIAN MIXTURES IN THE HMM STATE
task. We also note from Table III the quantified superiority in performance of the context-dependent models and of the five-mixture models over the unimodal Gaussian models, for both the benchmark models and the integrated models. This superiority is consistent with our earlier ML-training results reported in [21].

For the best classification result we obtained (the 81.45% rate), we show in Fig. 2 the confusion matrix displaying details of the classification error distribution. Comparing this confusion matrix with its counterpart obtained using ML training [20], in addition to the 13.4% lower overall number of errors, we observe somewhat different error distribution patterns. This indicates a potential for achieving performance superior to both classifiers by intelligently combining them [7].

Finally, presented in Tables V and VI are the context-independent and context-dependent classification results, respectively, for each of five broad phonetic classes: vowels, stops, nasals, semivowels, and fricatives. The five broad phonetic classes used in the experiment are defined in Table IV. The superior performance of the integrated HMM over the benchmark HMM demonstrated in Table III for the ensemble of 39 phones carries over, to varying degrees, to every one of the five broad phonetic classes.
Fig. 2. Confusion matrix for the standard 39-phone classification task with use of the integrated HMM trained using the minimum-classification-error criterion.
TABLE IV BROAD PHONETIC CLASSES FOR USE IN THE FINAL EXPERIMENT
We note in particular that for the context-dependent experiment, the greatest improvement of the integrated HMM over the benchmark HMM is for classification of the six-stop class: more than a 20% error rate reduction (from 12.83% to 10.26%).

VI. SUMMARY AND CONCLUSION
In comparison with the conventional technique exploiting dynamic features, which largely relied on empirical evidence for selecting the weights that convert static speech features to dynamic ones, our new, generalized dynamic-feature technique is based on solid theoretical ground. Within the theoretical framework described in this paper, the use of dynamic features of speech is automatically integrated as a subcomponent of the overall speech modeling strategy, rather than being treated as just a narrow signal processing problem. Specifically, the new integrated HMM generalizes the currently widely used dynamic-parameter (delta-cepstrum) technique in two ways. First, the model contains state-dependent weighting functions for transforming static speech features into dynamic ones, instead of having the weights prefixed by the preprocessor. Second, the theoretically motivated EM algorithm and MCE procedure developed for the integrated HMM allow joint optimization of the state-dependent weighting functions
and the remaining conventional HMM parameters.

The effectiveness of MCE training for the conventional HMM has been reported in [6], and this effectiveness is extended in our current study to the integrated HMM and to a new (TIMIT) evaluation task. (The evaluation task reported in [6] was designed from the Resource Management data base.) To the best of our knowledge, the results reported in the current paper are the first to demonstrate the superiority of MCE training over ML training on the TIMIT speech data. Moving from ML training to MCE training is particularly desirable, at least theoretically, for the integrated HMM. With the conventional HMM, the sole motivation for the use of MCE training in place of ML training is the general consideration of minimizing error rate, given the poor approximation of the HMM, as a source model, to the true statistical characteristics of the speech process. While the same motivation applies to the integrated HMM, the MCE approach additionally and automatically eliminates the need for the unrealistic and artificial constraints that are essential to the formulation of the integrated HMM under the ML design philosophy. These constraints have been imposed on the state-dependent weighting functions in the definition of the generalized dynamic parameters. Elimination of these constraints by moving away from the ML approach appears to be a significant contributing factor in the improvement of classifier performance (on the 39-phone TIMIT task) from the best classification rate of 78.58% obtained by the ML approach to 81.45% by the MCE approach. We designed the loss function for minimizing error rate specifically for the new model, and derived an analytical form of the gradient of the loss function that enables the implementation of the new approach. The convergence property of the new training procedure based on the MCE approach was investigated, and the experimental results from a standard TIMIT phonetic classification task demonstrated a 13.4% error rate reduction compared with the ML approach.

TABLE V CONTEXT-INDEPENDENT CLASSIFICATION RATE FOR EACH OF FIVE BROAD PHONETIC CLASSES (ALL USING THE MCE TRAINING)

TABLE VI CONTEXT-DEPENDENT CLASSIFICATION RATE FOR EACH OF FIVE BROAD PHONETIC CLASSES (ALL USING THE MCE TRAINING)

In conclusion, our evaluation results have provided preliminary evidence for the effectiveness of the integrated HMM, which generalizes the traditional use of the dynamic characteristics of speech spectra, in phonetic classification. The effectiveness of the MCE approach over the ML approach demonstrated in our study for both the integrated HMM and the benchmark one, however, is balanced by its significantly greater computational burden. We, like the researchers who first proposed the MCE approach [1], [13], have not been able to develop EM-like algorithms [4] for efficient computation in implementing MCE training. We have had to rely on the gradient descent algorithm for the MCE training, necessarily incurring a large computational requirement. It seems unlikely that this computation-related downside of the MCE approach, compared with the ML approach, can be easily overcome.

APPENDIX
GRADIENT COMPUTATION FOR THE INTEGRATED HMM

This appendix provides detailed derivations of the gradient computation for the integrated HMM leading to (18)–(23). Let \lambda_j denote a parameter associated with model j. In the case of token-by-token training, for a training token Z from class C_k, we can write the gradient as

    \frac{\partial \ell_k}{\partial \lambda_j} = \frac{\partial \ell_k}{\partial d_k} \cdot \frac{\partial d_k}{\partial g_j} \cdot \frac{\partial g_j}{\partial \lambda_j}    (30)

The first factor on the right-hand side of (30) can be simplified to

    \frac{\partial \ell_k}{\partial d_k} = \gamma\, \ell_k (1 - \ell_k)    (31)

The second factor on the right-hand side of (30) can be simplified as follows:

    \frac{\partial d_k}{\partial g_j} = \begin{cases} -1, & \text{if } j = k \\ \dfrac{\exp(\eta g_j)}{\sum_{i' \neq k} \exp(\eta g_{i'})}, & \text{if } j \neq k \end{cases}    (32)

Using (31) and (32), (30) can be rewritten as

    \frac{\partial \ell_k}{\partial \lambda_j} = \gamma\, \ell_k (1 - \ell_k)\, \psi_j\, \frac{\partial g_j}{\partial \lambda_j}    (33)

where

    \psi_j = \begin{cases} -1, & \text{if } j = k \\ \dfrac{\exp(\eta g_j)}{\sum_{i' \neq k} \exp(\eta g_{i'})}, & \text{if } j \neq k \end{cases}    (34)

In the rest of this appendix, the class index j will be omitted for clarity of presentation.

1) Derivation of (18): The partial derivative of g with respect to each weighting coefficient w_\tau(i, m) is given by

    \frac{\partial g}{\partial w_\tau(i,m)} = -\sum_{t \in T_i} \zeta_t(i,m) \sum_{d} \frac{y_{t,d} - \mu_{i,m,d}^{y}}{(\sigma_{i,m,d}^{y})^2}\, x_{t+\tau,d}

In matrix form, this becomes

    \frac{\partial g}{\partial w(i,m)} = -\sum_{t \in T_i} \zeta_t(i,m)\, X_t^T (\Sigma_{i,m}^{y})^{-1} (y_t - \mu_{i,m}^{y})

for i = 1, ..., N, m = 1, ..., M, and \tau = -b, ..., f. The set T_i and the a posteriori probabilities \zeta_t(i, m) are defined in (24) and (27), respectively. Combining this with (33) gives (18).

2) Derivation of (20): With the transformed mean \tilde{\mu}_{i,m,d}^{y} = \mu_{i,m,d}^{y} / \sigma_{i,m,d}^{y} of (15), the partial derivative of g is

    \frac{\partial g}{\partial \tilde{\mu}_{i,m,d}^{y}} = \sum_{t \in T_i} \zeta_t(i,m)\, \frac{y_{t,d} - \mu_{i,m,d}^{y}}{\sigma_{i,m,d}^{y}}

3) Derivation of (22): Denote the dth diagonal element of \Sigma_{i,m}^{y} by (\sigma_{i,m,d}^{y})^2, and let \tilde{\sigma}_{i,m,d}^{y} = \log \sigma_{i,m,d}^{y} as in (16). Then

    \frac{\partial g}{\partial \tilde{\sigma}_{i,m,d}^{y}} = \sum_{t \in T_i} \zeta_t(i,m) \left[ \left( \frac{y_{t,d} - \mu_{i,m,d}^{y}}{\sigma_{i,m,d}^{y}} \right)^2 - 1 \right]

4) Derivation of (23): With the mixture weights parametrized by (17), the partial derivative of g is

    \frac{\partial g}{\partial \tilde{c}_{i,m}} = \sum_{t \in T_i} \left[ \zeta_t(i,m) - c_{i,m} \right]

The gradient computations for the static feature parameters can be carried out in a similar yet relatively simpler way, and are omitted here.
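Analytic gradients such as (18)–(23) are easy to verify numerically; the following generic central-difference check (our own illustration, not from the paper) can be applied to any scalar loss:

    import numpy as np

    def finite_diff_check(loss_fn, params, analytic_grad, eps=1e-5):
        """Return the largest absolute gap between a central-difference
        gradient of loss_fn and the supplied analytic gradient."""
        params = np.asarray(params, dtype=float)
        num = np.empty_like(params)
        for i in range(params.size):
            p_hi, p_lo = params.copy(), params.copy()
            p_hi[i] += eps
            p_lo[i] -= eps
            num[i] = (loss_fn(p_hi) - loss_fn(p_lo)) / (2.0 * eps)
        return float(np.max(np.abs(num - analytic_grad)))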
ACKNOWLEDGMENT

The authors thank Dr. C. Lee for useful discussions on technical details of the minimum classification error method, which forms the basis of this work.

REFERENCES

[1] S. Amari, "A theory of adaptive pattern classifiers," IEEE Trans. Electron. Comput., vol. EC-16, pp. 299–307, 1967.
[2] T. Applebaum and B. Hanson, "Regression features for recognition of speech in quiet and in noise," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1991, vol. 2, pp. 985–988.
[3] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 179–190, 1983.
[4] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1–8, 1972.
[5] P. C. Chang and B. H. Juang, "Discriminative template training for dynamic programming speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1992, vol. 1, pp. 493–496.
[6] W. Chou, C. H. Lee, B. H. Juang, and F. K. Soong, "A minimum error rate pattern recognition approach to speech recognition," Int. J. Pattern Recog. Artif. Intell., vol. 8, no. 1, pp. 5–31, 1994.
[7] L. Deng and D. Braam, "Context-dependent Markov model structured by locus equations: Application to phonetic classification," J. Acoust. Soc. Amer., vol. 96, pp. 2008–2025, Oct. 1994.
[8] L. Deng, "Integrated optimization of dynamic feature parameters for hidden Markov modeling of speech," IEEE Signal Processing Lett., vol. 1, no. 4, pp. 66–69, 1994.
[9] L. Deng et al., "Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 39, pp. 1677–1681, July 1991.
[10] L. Deng, M. Lennig, F. Seitz, and P. Mermelstein, "Large vocabulary word recognition using context-dependent allophonic hidden Markov models," Comput. Speech Lang., vol. 4, pp. 345–357, 1990.
[11] S. Furui, "Speaker independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 52–59, 1986.
[12] V. Gupta, M. Lennig, and P. Mermelstein, "Integration of acoustic information in a large vocabulary word recognizer," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Dallas, TX, 1988, pp. 697–700.
[13] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error rate training," IEEE Trans. Signal Processing, vol. 40, pp. 3043–3054, 1992.
[14] S. Katagiri, C. H. Lee, and B.-H. Juang, "Discriminative multi-layer feed-forward networks," in Proc. IEEE-SP Workshop Neural Networks for Signal Processing, Princeton, NJ, Sept. 1991, pp. 11–20.
[15] C. Lee, L. Rabiner, R. Pieraccini, and J. Wilpon, "Acoustic modeling for large vocabulary speech recognition," Comput. Speech Lang., vol. 4, pp. 127–165, 1990.
[16] K. Lee and H. Hon, "Context-dependent phonetic hidden Markov models for continuous speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, pp. 599–609, Apr. 1990.
[17] H. Leung, B. Chigier, and J. Glass, "A comparative study of signal representations and classification techniques for speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, 1993, vol. 1, pp. 680–683.
[18] A. Ljolje, "High accuracy phone recognition using context clustering and quasitriphonic models," Comput. Speech Lang., vol. 8, pp. 129–151, 1994.
[19] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.
[20] R. Chengalvarayan, "Integrated design of preprocessing and modeling components of a speech recognition system," Ph.D. dissertation, Univ. Waterloo, Waterloo, Ont., Canada, 1995.
[21] R. Chengalvarayan and L. Deng, "Use of generalized dynamic feature parameters for speech recognition: Maximum likelihood and minimum classification error approaches," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Detroit, MI, 1995, vol. 1, pp. 373–376.
[22] P. Woodland, J. Odell, V. Valtchev, and S. Young, "Large vocabulary continuous speech recognition using HTK," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Adelaide, Australia, 1994, vol. 1.
Rathinavelu Chengalvarayan (S'92–M'96) was born in Madras, India. He received the B.E. degree in electronics and communications engineering and the M.E. degree in communication systems engineering from Anna University, Madras, India. He received the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, Ont., Canada, in 1992 and 1995, respectively. His Ph.D. dissertation involved research on the integrated design of preprocessing and modeling components of a speech recognition system. From March 1986 to December 1990, he was a Deputy Engineer at Bharat Electronics Limited, Bangalore, India, where he was involved in a number of projects ranging from telephone exchange systems to wireless equipment, with emphasis on speech signal processing. He served as a Post-Doctoral Fellow at the University of Waterloo from January 1996 to August 1996. He is currently a Member of Technical Staff at Bell Laboratories, Lucent Technologies, Naperville, IL. His current interests include speech segment modeling, continuous speech recognition, speaker adaptation, and model-based discriminative feature extraction. Dr. Rathinavelu was a recipient of a Canadian Commonwealth Scholarship award. He is a member of the IEEE Signal Processing Society and of the Acoustical Society of America.
Li Deng (S'83–M'86–SM'91) received the B.S. degree in biophysics from the University of Science and Technology of China, Hefei, in 1982, and the M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin, Madison, in 1984 and 1986, respectively. He worked on large vocabulary automatic speech recognition at INRS-Telecommunications, Montreal, P.Q., Canada, from 1986 to 1989. Since 1989, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont., Canada, where he is currently a Full Professor. From 1992 to 1993, he conducted sabbatical research at the Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, working on statistical models of speech production and on advanced speech recognition algorithms. His research interests include acoustic-phonetic modeling of speech, speech recognition, synthesis, enhancement, speech production and perception, statistical methods for signal analysis and modeling, nonlinear signal processing, neural network algorithms, computational phonology for the world's languages, and auditory speech processing.