IEEE SIGNAL PROCESSING LETTERS, VOL. 1, NO. 4, APRIL 1994


Integrated Optimization of Dynamic Feature Parameters for Hidden Markov Modeling of Speech

Li Deng, Senior Member, IEEE

Abstract-Construction of dynamic (delta) features of speech, which has in the past been confined to the preprocessing domain in the hidden Markov modeling (HMM) framework, is generalized and formulated as an integrated speech modeling problem. This generalization allows us to utilize state-dependent weights to transform static speech features into dynamic ones. In this letter, we describe a rigorous theoretical framework that naturally incorporates the generalized dynamic-parameter technique and present a maximum-likelihood-based algorithm for integrated optimization of the conventional HMM parameters and of the time-varying weighting functions that define the dynamic features of speech.

I. INTRODUCTION

ONE MAJOR advance during the past decade in automatic speech recognition, which has been driven mainly by the hidden Markov model (HMM)-based technology, is the introduction of the "dynamic" feature parameters of speech. These dynamic or delta features have been obtained simply by taking differences of, or other experimentally chosen combinations of, the "static" feature parameters (e.g., cepstral coefficients) over an empirically determined, fixed time span. The earliest reports on the use of this dynamic-parameter technique appear to be those of [8] and [9] in the mid-1980's. Since then, the success of the technique has been consistently reported by all major automatic speech recognition laboratories (e.g., [1], [3]-[6], [10]-[13]).

Despite the apparent success of the dynamic-parameter technique, its theoretical justification has been weak. In fact, without additional assumptions, use of the dynamic features in conjunction with the static features, which themselves already form sufficient statistics, creates a theoretical inconsistency in the ensuing HMM modeling framework. Further, due to the lack of theoretical guidance, the dynamic-parameter technique employed to date relies solely on heuristic choices of the ways in which the static features are combined in forming the dynamic features. In all previous speech modeling and recognition work known to the author, derivation of the dynamic features has been confined strictly to the speech preprocessing domain, completely divorced from the speech modeling issue.

The purpose of this letter is to develop a generalized HMM theory in which use of dynamic features of speech, traditionally treated as a narrow signal processing problem, is automatically integrated as a natural subcomponent of the overall speech-modeling strategy. In particular, the theory generalizes the currently widely used dynamic-parameter technique in two significant aspects. First, it provides a new formulation of the HMM, which contains state-dependent weighting functions that are responsible for transforming static speech features into dynamic ones in a time-varying manner; in contrast, in the conventional HMM, such weighting is prefixed (once and for all) and is totally independent of the modeling issue. Second, central to our new formulation of the HMM to be described below is a novel maximum-likelihood method that jointly optimizes the dynamic-feature weighting functions and the remaining conventional HMM parameters.

II. THE MODEL

The new formulation of the HMM described in this letter accepts only the static features of speech (sufficient statistics); the dynamic features are automatically assimilated into the modeling process, with the weighting functions incorporated as a set of trainable intrinsic parameters of the model.

Denote by X_1^T = (X_1, X_2, ..., X_T) the vector (dimensionality D) sequence of static speech features of duration T (in units of frames). The dynamic feature Y_t at each frame t, t = B+1, B+2, ..., T-F, is defined in the model as a linear combination of the static features spanning the temporal interval of F frames forward and B frames backward

$$Y_t = \sum_{m=-B}^{F} w_m(i)\, X_{t+m} \qquad (1)$$

where w_m(i) is the scalar¹ weighting coefficient, which is defined for generality as a function of the HMM state i. The remaining parameters of the proposed model are similar to the parameters of the conventional Gaussian² HMM (e.g., [14]). These parameters are the following:
1) the transition probabilities a_ij, i, j = 1, 2, ..., N, of the N-state homogeneous Markov chain;
2) the state-dependent covariance matrices C_Y(i) of size D x D for the dynamic features;
3) the state-dependent Gaussian mean vectors θ_Y(i).
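To make (1) concrete, the following minimal sketch computes the dynamic feature for a single frame and a single state. The function name, 0-based frame indexing, and the toy dimensions are illustrative assumptions, not part of the letter; the final assertion checks the conventional first-order delta as a fixed-weight special case.

```python
import numpy as np

def dynamic_feature(X, w, t, B, F):
    """Equation (1): Y_t as a weighted combination of the static frames
    X_{t-B}, ..., X_{t+F}, with state-dependent weights w = (w_{-B}, ..., w_F).

    X : (T, D) static feature sequence (e.g., cepstra); t is 0-based here
    w : (B + F + 1,) weighting coefficients w_m(i) for the current HMM state i
    """
    frames = X[t - B : t + F + 1]          # (B + F + 1, D) window of static frames
    return frames.T @ w                    # (D,) dynamic feature Y_t

# Conventional first-order deltas are the fixed special case w = (-1, +1):
X = np.random.randn(100, 13)               # toy cepstral sequence
Y = dynamic_feature(X, np.array([-1.0, 1.0]), t=50, B=1, F=0)
assert np.allclose(Y, X[50] - X[49])
```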

Manuscript received January 7, 1994; approved March 4, 1994. The associate editor coordinating the review of this letter and approving it for publication was Dr. J. R. Rohlicek. The author is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada N2L 3G1. IEEE Log Number 9401805.

¹The more general case of vector-valued weighting coefficients can be treated similarly.
²The model formulation and the parameter learning methods discussed in this letter can be straightforwardly extended to the mixture-HMM version of the proposed model.

1070-9908/94$04.00 © 1994 IEEE


For the maximum-likelihood approach to the problem of parameter learning for the model outlined above, it is necessary to impose constraints on the parameters w_m(i) defined in (1). This necessity arises from the fact that an infinitely high likelihood can be achieved by uniformly setting w_m(i) = 0, without discriminability among different speech classes.³ In this letter, we explore two types of constraints. The Type-I constraint is of the linear form

$$\sum_{m=-B}^{F} w_m(i) = C \qquad (2)$$

where C ≠ 0 is a model-specific constant, serving the role of eliminating the possibility that all w_m(i) are set to zero. The Type-II constraint is of the nonlinear form

$$\sum_{m=-B}^{F} w_m^2(i) = C'. \qquad (3)$$


Although the Type-II constraint complicates the parameter learning algorithm (see Section III), it embraces the conventional delta-parameter techniques as special cases in an elegant way: the first-order delta-parameter setup becomes a degenerate instance of the above model where C' = 2 and w_m = +1, -1; the delta-delta-parameter setup is an instance of the model where C' = 6 and w_m = +1, -2, +1; and so on.
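As a quick sanity check of these special-case constants (including the fourth-order setting C' = 70 mentioned in Section IV), the squared sums of the classical weight vectors can be verified directly; the snippet below is illustrative only.

```python
import numpy as np

# Classical fixed weight choices and their Type-II constants C' = sum of squares
delta        = np.array([+1.0, -1.0])                    # first-order delta, C' = 2
delta_delta  = np.array([+1.0, -2.0, +1.0])              # delta-delta, C' = 6
fourth_order = np.array([+1.0, -4.0, +6.0, -4.0, +1.0])  # fourth-order, C' = 70

for w, c in [(delta, 2.0), (delta_delta, 6.0), (fourth_order, 70.0)]:
    assert np.sum(w ** 2) == c
```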

III. JOINT OPTIMIZATION OF CONVENTIONAL HMM PARAMETERS AND OF WEIGHTS DEFINING DYNAMIC FEATURES

We have developed closed-form solutions for jointly estimating the state-dependent weighting coefficients defining the dynamic features and the remaining HMM parameters based on the EM algorithm. The estimates are obtained through an iterative procedure, where each iteration consists of an E step and an M step. The E step involves evaluation and simplification of the conditional expectation⁴

$$Q(\Theta\,|\,\Theta_0) = E\big[\log P(Y_1^T, S\,|\,\Theta)\,\big|\,X_1^T, \Theta_0\big] \qquad (4)$$

where Θ and Θ₀ stand for the models in the current and the previous iteration, respectively. The Q function in (4) can be rewritten in terms of the expectation over state sequences S in the form of a weighted sum

$$Q(\Theta\,|\,\Theta_0) = \sum_{S} P(S\,|\,X_1^T, \Theta_0)\,\log P(Y_1^T, S\,|\,\Theta). \qquad (5)$$

Through a set of well-established procedures⁵ in the E step, we simplify (5) into

$$Q(\Theta\,|\,\Theta_0) = \sum_{i=1}^{N}\sum_{t=1}^{T}\gamma_t(i)\left\{-\frac{D}{2}\log(2\pi) - \frac{1}{2}\log\big|C_Y(i)\big| - \frac{1}{2}\big(Y_t - \theta_Y(i)\big)^{T} C_Y^{-1}(i)\big(Y_t - \theta_Y(i)\big)\right\}. \qquad (6)$$

Re-estimates of the model parameters are obtained in the M step by maximization of (6) with respect to all model parameters. Re-estimation formulas for the transition probabilities and for the covariance matrices are similar to those for the conventional HMM and are not dealt with here. In this letter, we describe the re-estimation formulas for the weighting coefficients and for the Gaussian means. After substituting (1) into (6) and removing optimization-independent terms and factors in (6), an equivalent objective function can be written as

$$\tilde{Q} = -\frac{1}{2}\sum_{i=1}^{N}\sum_{t=1}^{T}\gamma_t(i)\Big(\sum_{m=-B}^{F} w_m(i)\,X_{t+m} - \theta_Y(i)\Big)^{T} C_Y^{-1}(i)\Big(\sum_{m=-B}^{F} w_m(i)\,X_{t+m} - \theta_Y(i)\Big) \qquad (7)$$

where γ_t(i) = P(s_t = i | X_1^T, Θ₀), which can be computed efficiently by use of the standard forward-backward recursions [2]. The re-estimation formulas can then be established by solving the system of equations obtained by setting the partial derivatives of (7) with respect to each of the parameters to zero, subject to the constraint expressed in either (2) or (3).
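Since every re-estimation formula below is weighted by γ_t(i), the following minimal sketch shows the standard scaled forward-backward computation of these posteriors. The function name, array shapes, and the use of an explicit initial-state distribution are illustrative assumptions, not details from the letter.

```python
import numpy as np

def state_posteriors(log_b, a, pi):
    """Scaled forward-backward: gamma[t, i] = P(s_t = i | observations, model).

    log_b : (T, N) per-frame log observation likelihoods log b_i(Y_t)
    a     : (N, N) transition probabilities a_ij
    pi    : (N,) initial state probabilities
    """
    T, N = log_b.shape
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))  # per-frame rescale for stability
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    c = np.zeros(T)                                       # per-frame scaling factors
    alpha[0] = pi * b[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                                 # scaled forward pass
        alpha[t] = (alpha[t - 1] @ a) * b[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                        # scaled backward pass
        beta[t] = (a @ (b[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```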

³We note that use of discriminative types of parameter learning could eliminate the need for the constraints.
⁴The expectation is taken over the "hidden" state sequence S.
⁵Interested readers are referred to a paper by the author [7] for a detailed exposition.

A. Solution for Constraint Type-I

Based on the Lagrange multiplier method, the objective function incorporating the Type-I constraint (2) is

$$\tilde{Q} + \sum_{i=1}^{N}\lambda_i\Big(\sum_{m=-B}^{F} w_m(i) - C\Big). \qquad (8)$$

In this case, the system of equations obtained by setting the partial derivatives of (8) with respect to w_m(i), θ_Y(i), and λ_i to zero, i = 1, ..., N; m = -B, -B+1, ..., 0, ..., F, is linear. The re-estimates of θ_Y(i) and w_m(i), as the solution of the above linear system, can be put into matrix form (omitting the state label i for clarity). This is shown at the bottom of this page, where

$$U(m,l) = \sum_{t=1}^{T}\gamma_t(i)\,X_{t+l}^{T}\,C_Y^{-1}(i)\,X_{t+m}$$

with m, l = -B, ..., F, and

$$R(d,m) = \sum_{t=1}^{T}\gamma_t(i)\,\big[C_Y^{-1}(i)\,X_{t+m}\big]_{(d)}$$

denotes the dth component of the indicated vector, with d = 1, ..., D, and m = -B, ..., F.
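A compact way to see the Type-I M step in action: the sketch below assembles U, R, and the constraint into one linear (KKT) system and solves it for a single state. It is a reconstruction from (8) under stated assumptions (padded feature array, NumPy, my own stacking order), not code from the letter; the block layout of the original bottom-of-page matrix differs typographically.

```python
import numpy as np

def solve_type1(X, gamma, Cinv, B, F, C=2.0):
    """Type-I re-estimation for one state (hypothetical helper).

    X     : (T + B + F, D) static features, padded so every shift is valid
    gamma : (T,) state-occupancy posteriors gamma_t(i) for this state
    Cinv  : (D, D) inverse covariance C_Y(i)^{-1} from the previous iteration
    C     : the Type-I constant in (2)
    Returns (w, theta, lam).
    """
    K = B + F + 1
    T = gamma.shape[0]
    D = Cinv.shape[0]
    # Z[t, k] = X_{t+m} with m = k - B, for the T "center" frames
    Z = np.stack([X[k : k + T] for k in range(K)], axis=1)   # (T, K, D)
    ZC = Z @ Cinv                                            # rows X_{t+m}^T Cinv
    # U(l, m) = sum_t gamma_t X_{t+l}^T Cinv X_{t+m}
    U = np.einsum("t,tld,tmd->lm", gamma, ZC, Z)
    # R(d, m) = sum_t gamma_t [Cinv X_{t+m}]_d
    R = np.einsum("t,tmd->dm", gamma, ZC)
    G = gamma.sum()
    # KKT system in the unknowns (w, theta, lambda)
    A = np.zeros((K + D + 1, K + D + 1))
    A[:K, :K] = U
    A[:K, K:K + D] = -R.T
    A[:K, -1] = -1.0
    A[K:K + D, :K] = R
    A[K:K + D, K:K + D] = -G * Cinv
    A[-1, :K] = 1.0                      # constraint row: sum_m w_m = C
    rhs = np.zeros(K + D + 1)
    rhs[-1] = C
    sol = np.linalg.solve(A, rhs)
    return sol[:K], sol[K:K + D], sol[-1]
```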

B. Solution for Constraint Type-II

The objective function incorporating the Type-II constraint (3) has the form

$$\tilde{Q} + \sum_{i=1}^{N}\lambda_i\Big(\sum_{m=-B}^{F} w_m^2(i) - C'\Big). \qquad (9)$$

Setting the derivatives of (9) with respect to w_l(i), λ_i, and θ_Y(i) to zero, we establish the following set of nonlinear equations:

$$\sum_{m=-B}^{F} U(l,m)\,w_m(i) - 2\lambda_i\,w_l(i) = \sum_{d=1}^{D} R(d,l)\,\theta_Y^{(d)}(i) \qquad (10)$$

$$\sum_{m=-B}^{F} w_m^2(i) = C' \qquad (11)$$

$$\theta_Y(i) = \frac{\sum_{t=1}^{T}\gamma_t(i)\sum_{m=-B}^{F} w_m(i)\,X_{t+m}}{\sum_{t=1}^{T}\gamma_t(i)} \qquad (12)$$

with l = -B, ..., F and i = 1, ..., N, where θ_Y^{(d)}(i) denotes the dth component of θ_Y(i). Strict application of the EM algorithm [2] would require solution of the above nonlinear system for w_l(i) and θ_Y(i) simultaneously. In our current implementation, to simplify the numerical solution, we invoke the generalized EM algorithm [15], which states that the M step (global maximization) in the EM algorithm can be replaced by partial maximization, where a subset of parameters is fixed at the values set by the previous EM iteration while the remaining parameters are optimized. The procedure we adopt is to solve the simpler nonlinear system defined by (10) and (11) only for w_l(i) (and λ_i) first, while fixing θ_Y(i), and then to use the obtained result to compute θ_Y(i) according to (12) separately.
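Under this generalized EM procedure, the weight update with θ_Y(i) frozen reduces, for each state, to a one-dimensional search for the multiplier. A minimal sketch, assuming U and b = R^T θ_Y are precomputed as above; the names, the substitution mu = -2*lam, and the bisection scheme are mine, not the letter's:

```python
import numpy as np

def type2_weights(U, b, C_prime, tol=1e-10):
    """Partial M step for the Type-II constraint: with theta_Y held fixed,
    solve (10) and (11), i.e. (U - 2*lam*I) w = b subject to ||w||^2 = C'.
    With mu = -2*lam, ||w(mu)||^2 is monotone decreasing on the region where
    U + mu*I is positive definite, so we can bisect on mu."""
    evals, V = np.linalg.eigh(U)
    bt = V.T @ b                          # right-hand side in the eigenbasis of U
    def norm2(mu):
        return np.sum((bt / (evals + mu)) ** 2)
    lo = -evals.min() + 1e-9              # left edge of positive-definiteness
    hi = lo + 1.0
    while norm2(hi) > C_prime:            # expand until the norm drops below C'
        hi = 2.0 * hi - lo
    while hi - lo > tol:                  # bisect on the constraint (11)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm2(mid) > C_prime else (lo, mid)
    mu = 0.5 * (lo + hi)
    w = V @ (bt / (evals + mu))
    return w, -0.5 * mu                   # weights and the multiplier lam

def update_theta(w, xbar):
    """Equation (12): theta_Y = sum_m w_m * (gamma-weighted mean of X_{t+m});
    xbar is (B+F+1, D), row m holding sum_t gamma_t X_{t+m} / sum_t gamma_t."""
    return xbar.T @ w
```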

[Bottom-of-page matrix equation: the Type-I linear system in the weights w_m, the mean components θ_Y^{(d)}, and the multiplier λ, assembled from the quantities U(m,l), R(d,m), and P^{(l)}(d), together with ones and zeros arising from the constraint.]
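The block structure of that system is not cleanly recoverable from the scan; the arrangement below is one reconstruction consistent with setting the derivatives of (8) to zero. The stacking order and the symbol Γ for Σ_t γ_t(i) are assumptions, not the paper's typography.

```latex
% Reconstructed block form of the Type-I system (state label i omitted).
% w = (w_{-B}, ..., w_F)^T; U = [U(l,m)]; R = [R(d,m)]; \Gamma = \sum_t \gamma_t(i).
\[
\begin{bmatrix}
U & -R^{\mathsf{T}} & -\mathbf{1} \\
R & -\Gamma\, C_Y^{-1} & \mathbf{0} \\
\mathbf{1}^{\mathsf{T}} & \mathbf{0}^{\mathsf{T}} & 0
\end{bmatrix}
\begin{bmatrix} w \\ \theta_Y \\ \lambda \end{bmatrix}
=
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \\ C \end{bmatrix}
\]
```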

IV. DISCUSSION AND CONCLUSION

In this letter, we developed a rigorous theoretical framework that allows optimal construction of dynamic features of speech for potential use in HMM-based speech recognition. The conventional technique exploiting the dynamic features, although highly successful, nevertheless lacks theoretical justification and hence can rely only on empirical evidence for selecting the weights that convert static features into dynamic ones. The main contribution of this study is its creation of a first paradigm (to the best knowledge of the author) in the modern speech recognition literature in which parameters that belong traditionally to the preprocessing domain can be learned jointly with those pertaining to the ensuing speech


models, based on theoretical principle rather than on heuristic grounds. In particular, applying this joint-optimization paradigm to the construction of the dynamic speech features from the static ones, we generalize the conventional dynamic-parameter technique in a principled way. To see this, let us first relax the state-dependence assumption for the weights w_m(i) in our model with constraint Type II. Now, if we arbitrarily (as opposed to using the optimization method described in this letter) set w_{-1} = -1, w_0 = 1 (B = 1, F = 0, C' = 2), then our model reduces to use of delta features in the conventional HMM. Similarly, setting w_{-1} = 1, w_0 = -2, w_1 = 1 (B = 1, F = 1, C' = 6) gives rise to delta-delta features; setting w_{-2} = 1, w_{-1} = -4, w_0 = 6, w_1 = -4, w_2 = 1 (B = 2, F = 2, C' = 70) gives rise to fourth-order delta features; and so on. In view of the prevailing success of the conventional, heuristics-driven dynamic-parameter technique in speech recognition applications, the theory developed in this study is of potentially vast practical significance.

ACKNOWLEDGMENT

The author wishes to thank Dr. R. Rohlicek and a reviewer for their constructive comments, which improved the quality of this letter.

REFERENCES

[1] T. Applebaum and B. Hanson, "Regression features for recognition of speech in quiet and in noise," in Proc. ICASSP, 1991, vol. 2, pp. 985-988.


[2] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1-8, 1972.
[3] P. Brown, "The acoustic modeling problem in automatic speech recognition," Tech. Rep. RC 12750, IBM Thomas J. Watson Res. Ctr., 1987.
[4] Y. Chow et al., "BYBLOS: The BBN continuous speech recognition system," in Proc. ICASSP, 1987, pp. 89-92.
[5] L. Deng, P. Kenny, M. Lennig, and P. Mermelstein, "Modeling acoustic transitions in speech by state-interpolation hidden Markov models," IEEE Trans. Signal Processing, vol. 40, no. 2, pp. 265-272, Feb. 1992.
[6] L. Deng et al., "Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition," IEEE Trans. Acoust. Speech Signal Processing, vol. 39, no. 7, pp. 1677-1681, July 1991.
[7] L. Deng, "A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal," Signal Processing, vol. 27, no. 1, pp. 65-78, Apr. 1992.
[8] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. Acoust. Speech Signal Processing, vol. 34, pp. 52-59, 1986.
[9] V. Gupta, M. Lennig, and P. Mermelstein, "Integration of acoustic information in a large vocabulary word recognizer," in Proc. ICASSP, 1987, pp. 697-700.
[10] X. Huang et al., "The SPHINX-II speech recognition system: An overview," Comput. Speech Language, vol. 7, no. 2, pp. 137-148, 1993.
[11] C. Lee, L. Rabiner, R. Pieraccini, and J. Wilpon, "Acoustic modeling for large vocabulary speech recognition," Comput. Speech Language, vol. 4, pp. 127-165, 1990.
[12] K. Lee and H. Hon, "Context-dependent phonetic hidden Markov models for continuous speech recognition," IEEE Trans. Acoust. Speech Signal Processing, vol. 38, no. 4, pp. 599-609, Apr. 1990.
[13] H. Leung, B. Chigier, and J. Glass, "A comparative study of signal representations and classification techniques for speech recognition," in Proc. ICASSP, 1993, pp. 680-683.
[14] L. A. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, vol. 28, pp. 729-734, 1982.
[15] C. F. J. Wu, "On the convergence properties of the EM algorithm," Ann. Stat., vol. 11, pp. 95-103, 1983.