Information Fusion 5 (2004) 141–151 www.elsevier.com/locate/inffus

Integration of acoustic and articulatory information with application to speech recognition

Ka-Yee Leung, Manhung Siu *

Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong

Received 9 September 2002; received in revised form 5 March 2003; accepted 28 April 2003

Abstract

In speech recognition, fusion of multiple systems often results in improved recognition accuracy or robustness. Previously suggested system fusions have focused mainly on the recognition process; training, on the other hand, is performed independently for each system. In this paper, we investigate the combination of a Mel frequency cepstral coefficient (MFCC) based acoustic feature (ACF) system and an articulatory feature (AF) based system. In addition to proposing an asynchronous combination that makes the state combination more flexible during recognition, we propose an efficient combination approach during the model training stage. We show that combining the models during training not only improves performance but also simplifies the fusion process during recognition. Because fusion during training removes inconsistencies between the individual models, such as in state or phoneme alignments, it is particularly useful for highly constrained recognition fusion such as synchronous model combination. Compared with fusion of separately trained AF and ACF systems, fusion of jointly trained AF and ACF models resulted in more than 3% absolute phoneme recognition error reduction on the TIMIT corpus for synchronous combination and 1% for asynchronous combination.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Speech recognition; Articulatory feature; Acoustic feature; Asynchronous combinations; Retraining parameters; System fusion; Joint training

* Corresponding author. Tel.: +852-2358-8027. E-mail addresses: [email protected], [email protected] (M. Siu).

1566-2535/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.inffus.2003.10.007

1. Introduction

Automatic speech recognition (ASR) has made tremendous progress in the past decade. While a single ASR system can perform quite well, many recent works focus on the integration of multiple systems [4,7,8,14,16,17], which results in better performance than using a single system alone. Almost all of these works focused on integration at different stages of the recognition process, such as the frame level [7], state level [4,16,17] or word level [8]. For some system fusions, combination at the state level has been shown to be more effective [14]. However, integration during recognition at the state level poses two mismatches: (1) combination of two state sequences causes alignment problems because of differences in state definitions between the systems; (2) since the two systems are trained independently, their models are not optimized for joint recognition.

In this paper, we have investigated the state-level integration of two systems with different feature sets: one uses the acoustic feature (ACF), Mel frequency cepstral coefficients (MFCCs), and the other uses articulatory features (AF). We have improved their combination by introducing an asynchronous combination. Furthermore, similar to the development of speaker adaptive training, in which the trained model is optimized for recognition with speaker adaptation, we propose to fuse the two models during training so that the resulting models are ''optimized'' when they are used jointly in recognition.

Cepstral coefficients are derived from the speech signal by signal processing. They are widely used for speech recognition partly because they separate the effect of the source excitation from the vocal tract characteristics. The Mel scale, which gives finer resolution to low frequencies than to high frequencies, is applied when computing the coefficients. This is motivated by human perception of the frequency scale.


The articulatory feature set represents the positions or actions of the human articulators and can be thought of as an intermediate representation of the acoustic signal. Articulatory features include voicing, tongue position, etc. The integration of the ACF-based and AF-based systems was originally proposed in [14], which showed that the combined system was more robust in noisy environments. In this paper, we compare the performance of integrating the two systems during training and during recognition and show that integration during training not only improves recognition accuracy but also simplifies the combination during recognition.

The remainder of this paper is organized as follows. In Section 2, overviews of the ACF system and the AF system are provided. The description of the ACF system is relatively short because it is widely used and detailed descriptions can easily be obtained from [21]. In Section 3, we describe the combination of the two systems during recognition, including the proposed asynchronous combination that can solve the state alignment problem. In Section 4, combination at the training level is described, which includes the training of AF models using ACF modeling information as well as the joint training of both systems. The experimental setup and results are reported in Section 5. Finally, the paper is summarized in Section 6.

2. Baseline systems

The two systems that are combined in this work are the ACF recognition system using MFCCs and the AF-based system. Both systems are HMM-based and use a three-state left-to-right hidden Markov model (HMM) per phoneme. The main differences between the systems are in their feature sets, observation probability modeling and parameter estimation procedures.

2.1. Acoustic feature system

In the ACF system, the observation distributions are modeled using Gaussian mixtures. The observation likelihood $b_j(o_t) = p(o_t \mid q_t = j)$ for an observation $o_t$ at time $t$ generated by state $j$ is given by

$$p(o_t \mid q_t = j) = \sum_{k=1}^{K} w_{jk}\, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk}), \qquad (1)$$

where $K$ is the number of mixture components, $\mathcal{N}(\cdot)$ is the Gaussian density with mean $\mu_{jk}$ and covariance $\Sigma_{jk}$, and $w_{jk}$ is the mixture weight. Model parameters, including the state transitions, Gaussian means, covariances and mixture weights, are estimated using the Baum–Welch parameter estimation procedure [25]. For a given observation sequence $O_1^T$ of $T$ frames, the posterior probability of being in state $j$ at time $t$, $p(q_t = j \mid O_1^T)$, denoted $\gamma_t(j)$, is given by

$$\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{\sum_i \alpha_t(i)\, \beta_t(i)}, \qquad (2)$$

where $\alpha_t(j) = p(O_1^t, q_t = j)$ and $\beta_t(j) = p(O_{t+1}^T \mid q_t = j)$. They can be computed recursively using the following equations:

$$\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad (3)$$

$$\beta_t(i) = \sum_j \beta_{t+1}(j)\, a_{ij}\, b_j(o_{t+1}), \qquad (4)$$

where $a_{ij}$ is the state transition probability of moving from state $i$ to state $j$. The posterior probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$ is typically denoted $\xi_t(i,j)$ and is given by

$$\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O_1^T) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_i \sum_j \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}. \qquad (5)$$

The $\alpha_t$, $\beta_t$, $\gamma_t$ and $\xi_t$ can be considered the soft alignments and are re-estimated in each iteration. That is why, in acoustic feature models, the final state/phoneme alignments can be very different from the initial state alignments. Using the $\alpha_t$ and $\beta_t$ variables, the transition probabilities $\hat{a}_{ij}$, mixture weights $\hat{c}_{il}$, means $\hat{\mu}_{il}$ and covariances $\hat{\Sigma}_{il}$ can be re-estimated using the following equations:

$$\hat{a}_{ij} = \frac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}, \qquad (6)$$

$$\hat{c}_{il} = \frac{\sum_t \gamma_t(i,l)}{\sum_t \gamma_t(i)}, \qquad (7)$$

$$\hat{\mu}_{il} = \frac{\sum_t \gamma_t(i,l)\, o_t}{\sum_t \gamma_t(i,l)}, \qquad (8)$$

$$\hat{\Sigma}_{il} = \frac{\sum_t \gamma_t(i,l)\, (o_t - \mu_{il})(o_t - \mu_{il})^T}{\sum_t \gamma_t(i,l)}, \qquad (9)$$

where

$$\gamma_t(i,l) = \gamma_t(i)\, \frac{c_{il}\, b_{il}(o_t)}{b_i(o_t)}, \qquad (10)$$

and $i$, $j$ are the state indexes and $l$ is the mixture component. For the ACF system, recognition is performed using the standard Viterbi algorithm [27].
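To make these quantities concrete, the following NumPy sketch (our illustration, not the authors' implementation; the toy model sizes and the uniform initial state prior are assumptions) evaluates the Gaussian-mixture likelihood of Eq. (1) and the soft alignments $\gamma_t(j)$ and $\xi_t(i,j)$ of Eqs. (2)–(5) for a single observation sequence.

```python
# Minimal sketch of Eqs. (1)-(5): GMM observation likelihoods and the
# forward-backward "soft alignments" gamma_t(j) and xi_t(i, j).
import numpy as np

def gmm_likelihood(o, weights, means, covs):
    """Eq. (1): b_j(o) = sum_k w_jk N(o; mu_jk, Sigma_jk) for every state j."""
    n_states, n_mix, dim = means.shape
    b = np.zeros(n_states)
    for j in range(n_states):
        for k in range(n_mix):
            diff = o - means[j, k]
            inv = np.linalg.inv(covs[j, k])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** dim) * np.linalg.det(covs[j, k]))
            b[j] += weights[j, k] * norm * np.exp(-0.5 * diff @ inv @ diff)
    return b

def forward_backward(obs, trans, weights, means, covs):
    """Eqs. (2)-(5): alpha, beta, gamma and xi for one observation sequence."""
    T, n_states = len(obs), trans.shape[0]
    b = np.array([gmm_likelihood(o, weights, means, covs) for o in obs])  # T x N
    alpha = np.zeros((T, n_states))
    beta = np.zeros((T, n_states))
    alpha[0] = b[0] / n_states                      # uniform initial state prior
    for t in range(1, T):                           # Eq. (3)
        alpha[t] = (alpha[t - 1] @ trans) * b[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                  # Eq. (4)
        beta[t] = trans @ (beta[t + 1] * b[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)       # Eq. (2)
    xi = np.zeros((T - 1, n_states, n_states))
    for t in range(T - 1):                          # Eq. (5)
        xi[t] = alpha[t][:, None] * trans * (b[t + 1] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi

# Toy usage: 3-state left-to-right phoneme model, 2 mixtures, 4-dim features.
rng = np.random.default_rng(0)
trans = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
weights = np.full((3, 2), 0.5)
means = rng.normal(size=(3, 2, 4))
covs = np.tile(np.eye(4), (3, 2, 1, 1))
gamma, xi = forward_backward(rng.normal(size=(10, 4)), trans, weights, means, covs)
print(gamma.shape, xi.shape)   # (10, 3) (9, 3, 3)
```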

2.2. Articulatory feature system

Our AF system is similar to the one used in [15] in the extraction of articulatory features. Our system is a hybrid HMM/ANN system, as shown in Fig. 1.

[Fig. 1. Block diagram of the AF system: speech feature extraction (MFCCs) feeds five articulatory feature MLPs (voicing, front-back, rounding, manner and place), whose outputs are concatenated and passed to a phoneme MLP.]

Hybrid HMM/ANN systems use artificial neural networks (ANNs) for the HMM observation distributions and, thus, are different from traditional HMM systems that use codebooks or Gaussian mixtures. Among the different types of ANNs, multi-layer perceptrons (MLPs) [6] are used in our system. The MLPs are used at two different stages in our AF system. First, different MLPs are trained as feature classifiers to extract articulatory features. Second, the outputs of the AF classifiers are fed to a single phoneme MLP to estimate the posterior probability of phoneme occurrence.

2.2.1. Articulatory features

In our system, five different articulatory features, as tabulated in Table 1, are extracted by five different feature classifiers. The input to these feature classifiers is the acoustic-based MFCCs, and the extraction of the articulatory features is illustrated within the dotted box in Fig. 1. To better model the articulatory features, multiple frames of MFCCs are used in the AF extraction. The first AF classifier is the voicing MLP, which takes as input $n+1$ frames of MFCCs, namely frames $t - \frac{n}{2}, \ldots, t, \ldots, t + \frac{n}{2}$, created by a moving window. Voicing information can be useful for the classification of the other features, and the accuracy of the voicing MLP is over 90%, which makes its output reliable enough to be used as input to the other feature classifiers. So, the estimated voicing scores together with the MFCC vectors are the inputs to the remaining four feature MLPs.

Every feature classifier contains a different number of output nodes to represent the probabilities of the different output classes. For example, the MLP for voicing contains three output nodes to represent the classes voiced, unvoiced and silence. The five feature classifiers generate a total of 27 outputs for each frame, which are then concatenated to form an AF vector. These AF vectors serve as the input to the ''phoneme MLP'' that estimates the phoneme posterior probabilities. In [15], the phoneme MLP used a single output node for each phoneme. However, different parts of a phoneme can have different characteristics; for example, the place of articulation of a diphthong changes from its beginning to its end. Analogous to the ACF system, which uses three different states with unique distributions to model a phoneme, we generalize the AF model by using three output nodes for each phoneme to represent three different parts of the phoneme. In our case, instead of using the phoneme MLP to estimate the phoneme posterior probabilities, the phoneme MLP is modified to estimate the state posterior probabilities.

2.2.2. Training the MLPs

Training the feature classifiers and the phoneme MLP requires the phoneme/state labels for each frame of the training data. For the phoneme MLP, this means that phoneme alignments or state alignments are needed. However, for the feature classifiers, we need the correct states of the articulators for each frame. While the state of the articulators at any time can be found using X-ray or other imaging techniques [9,11], this information is not available for most speech recognition tasks. To make the AF approach applicable to more speech recognition tasks, we approximate the state of the articulators using a mapping from the phoneme labels. This mapping gives the theoretical state of the articulators when producing the phoneme. Although the phoneme-to-AF mapping may potentially limit the accuracy of the feature classifiers, past research adopting the mapping approach has shown that it is sufficient for phoneme recognition [15].
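The two-stage pipeline of Fig. 1 can be sketched as follows. This is only a structural illustration: the layer sizes follow the setup reported later in Section 5, the weights are random placeholders for the trained classifiers, and all function and variable names are ours.

```python
# Structural sketch of the two-stage AF pipeline in Fig. 1 (hypothetical
# layer sizes; randomly initialised weights stand in for trained MLPs).
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, n_hidden, n_out):
    """One hidden layer of sigmoid units followed by a softmax output layer."""
    w1 = rng.normal(scale=0.1, size=(x.size, n_hidden))
    w2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
    h = 1.0 / (1.0 + np.exp(-(x @ w1)))
    z = h @ w2
    e = np.exp(z - z.max())
    return e / e.sum()

# A window of 9 consecutive 26-dimensional MFCC frames around frame t.
mfcc_window = rng.normal(size=(9, 26)).ravel()

# Stage 1: five feature classifiers.  Voicing is estimated first; its
# scores are appended to the input of the remaining four classifiers.
voicing    = mlp(mfcc_window, 50, 3)                 # voiced/unvoiced/silence
x2 = np.concatenate([mfcc_window, voicing])
front_back = mlp(x2, 50, 4)
rounding   = mlp(x2, 50, 4)
manner     = mlp(x2, 50, 6)
place      = mlp(x2, 50, 10)

# Concatenate the 27 classifier outputs into one AF vector per frame.
af_vector = np.concatenate([voicing, front_back, rounding, manner, place])
assert af_vector.size == 27

# Stage 2: the phoneme MLP maps AF vectors to state posteriors
# (assuming 39 phonemes x 3 states per phoneme = 117 outputs).
state_posteriors = mlp(af_vector, 700, 39 * 3)
print(state_posteriors.shape)  # (117,)
```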

Table 1
Classes of articulatory features and their values

Feature group   Values                                                                          # Values
Voicing         Voiced, unvoiced, silence                                                       3
Front-back      Front, back, nil, silence                                                       4
Rounding        Rounded, not rounded, nil, silence                                              4
Manner          Vowel, stop, fricative, approximant-lateral, nasal, silence                     6
Place           High, middle, low, dental, labial, coronal, palatal, velar, glottal, silence    10
Total                                                                                           27
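As an illustration of the phoneme-to-AF mapping discussed in Section 2.2.2, the toy dictionary below assigns Table 1 values to a few phonemes. The specific entries are textbook-style guesses, not the authors' actual mapping table.

```python
# Illustrative phoneme-to-AF target mapping (Section 2.2.2).  The entries
# are hypothetical examples; only the value sets come from Table 1.
PHONEME_TO_AF = {
    # phoneme: (voicing, front-back, rounding, manner, place)
    "iy":  ("voiced",   "front",   "not rounded", "vowel",     "high"),
    "s":   ("unvoiced", "nil",     "nil",         "fricative", "coronal"),
    "sil": ("silence",  "silence", "silence",     "silence",   "silence"),
}

def af_targets(phoneme_labels):
    """Map a frame-level phoneme alignment to per-frame AF class labels."""
    return [PHONEME_TO_AF[p] for p in phoneme_labels]

print(af_targets(["sil", "s", "iy", "iy"]))
```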


Different approaches can be used to obtain the phoneme/state alignments:

1. If the phoneme alignments of the training data are given, as in the case of the TIMIT corpus [22], these alignments can be used directly. To get the state alignments, the most straightforward approach is to partition each phoneme into equal portions. This is often used to form an initial model.
2. Phoneme and state alignments can be re-derived from prior models, such as the initial model mentioned above, using the Viterbi algorithm.

As these alignments define the labels for the phoneme MLP, they have a significant impact on the quality of the model. In later sections, we will demonstrate the importance of using good phoneme/state alignments.

For a given set of alignments, the MLP training involves learning the weights using the back-propagation approach [23], which minimizes the total square error between the estimated posterior probabilities and the target outputs. For an MLP with $K$ outputs, denote $g(o_t)$ and $t(o_t)$ as the two $K$-dimensional column vectors representing, respectively, the posterior probabilities estimated by the MLP and the binary target labels (the correct label is marked as one) for input feature $o_t$. The cost function, $E$, can be expressed as

$$E = \sum_{t=1}^{T} \| g(o_t) - t(o_t) \|^2, \qquad (11)$$

where $T$ is the total number of training patterns.
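A minimal sketch of the cost in Eq. (11), with one-hot targets built from a frame-level state alignment (toy sizes; names are ours):

```python
# Summed squared error between MLP posteriors g(o_t) and one-hot targets
# t(o_t) derived from an alignment, as in Eq. (11).
import numpy as np

def squared_error_cost(posteriors, alignment):
    """posteriors: T x K MLP outputs; alignment: length-T state labels."""
    T, K = posteriors.shape
    targets = np.zeros((T, K))
    targets[np.arange(T), alignment] = 1.0      # correct label marked as one
    return np.sum((posteriors - targets) ** 2)  # Eq. (11)

rng = np.random.default_rng(2)
g = rng.dirichlet(np.ones(5), size=8)           # 8 frames, 5 output classes
labels = rng.integers(0, 5, size=8)
print(squared_error_cost(g, labels))
```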

Interested readers can refer to [20]. Generating better alignments iteratively and retraining the MLPs can give a much better performance than using the TIMIT phoneme segmentations alone. This approach is adopted for AF model training in this paper.

2.2.3. Recognition

There are a number of ways to use the AF models for recognition, such as using the AF outputs as features to be modeled by Gaussian mixtures, as is commonly done in ACF systems [12,13,15]. The approach we adopt in this paper is to use the phoneme MLP output posterior probabilities to estimate the observation probabilities. To be consistent with the ACF model, a three-state left-to-right topology is also used. Since the values estimated by the phoneme MLP are the posterior probabilities of states, it is not easy to compute a joint probability for a sequence of frames. As a result, the posterior probabilities are converted into observation likelihoods during recognition. This is achieved by normalizing the posterior probabilities with the state prior probabilities, as suggested in [6]. After this normalization, the standard Viterbi recognition algorithm can be applied.
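A small sketch of this posterior-to-likelihood conversion (the division by state priors follows the hybrid HMM/ANN recipe of [6]; the numbers and names below are illustrative only):

```python
# Convert MLP state posteriors into scaled likelihoods for Viterbi decoding:
# p(o_t | q_t = j) is proportional to p(q_t = j | o_t) / p(q_t = j).
import numpy as np

def scaled_likelihoods(posteriors, state_priors, eps=1e-10):
    """posteriors: T x N MLP outputs; state_priors: length-N prior estimates."""
    return posteriors / (state_priors + eps)

# Toy example: 4 frames, 6 states; priors estimated from training counts.
rng = np.random.default_rng(3)
post = rng.dirichlet(np.ones(6), size=4)
priors = np.array([0.3, 0.2, 0.1, 0.15, 0.15, 0.1])
print(np.log(scaled_likelihoods(post, priors)))   # log-likelihoods for Viterbi
```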

3. System fusion in recognition

In [14], several ways of combining AF models with ACF models during recognition were evaluated, and the best approach was a weighted linear combination of the respective log observation probabilities at the state level. That is, the observation likelihood, denoted $P(o_t \mid s, \bar{s}; \Theta)$,¹ depends on two states, namely an ACF state $s$ and an AF state $\bar{s}$. The combined state observation likelihood can be expressed as

$$\log P(o_t \mid s, \bar{s}; \Theta) = w_{af} \log P_{af}(o_t \mid \bar{s}; \Theta) + (1 - w_{af}) \log P_{acf}(o_t \mid s; \Theta), \qquad (12)$$

where $w_{af}$ is a weighting factor for the AF log likelihood. In general, the ACF state $s$ and the AF state $\bar{s}$ are different. However, because the same phoneme set is shared by both systems, there exists only one phoneme label for each frame and, thus, $s$ and $\bar{s}$ are constrained to be states of the same phoneme. The definition of the joint state $(s,\bar{s})$ and the combined state transition probabilities will be introduced in the following sections.

3.1. Synchronous state combination

By making $s = \bar{s}$ in Eq. (12), synchronous combination of AF states and ACF states can be performed. This constrains the two models to enter and exit each state at the same time. This constraint is reasonable only if there are commonly defined meanings for the phoneme states which are shared by both systems, or if the observation distributions across the different states within a phoneme are similar. In [14], the distributions across the three AF states within a phoneme were tied and thus synchronous combination was sufficient. If the systems are trained separately and their states have different distributions, it is likely that the synchrony constraint is not optimal.

3.2. Asynchronous state combination

Asynchronous state combination was proposed by Mirghafori et al. [18] and has been applied to combining different sub-band HMMs for speech recognition. It can also be applied to asynchronously combine any two sets of HMMs, such as our AF and ACF HMMs, by using the phoneme topology shown in Fig. 2. While the individual ACF and AF models are three-state HMMs, the combination in effect expands the number of states of the combined model to nine, accounting for all the combinations of AF and ACF states.

¹ To be exact, the observation $o_t$ in the ACF system is a single frame of MFCCs with 39 dimensions, while $o_t$ in the AF system is a vector of the 27 outputs of the AF classifiers. However, we can assume a ''super-feature vector'' that is the concatenation of the ACF and AF features and that each system only evaluates a subset of this vector.

[Fig. 2. Topology used for asynchronous combination of the AF model and the ACF HMM: the three states of the ACF model and the three states of the AF model form a 3 × 3 grid of joint states (1,1)–(3,3), allowing state asynchrony within the phoneme.]

It should be noted that the number of observation distributions remains unchanged. Since the expanded phoneme topology is still an HMM, Viterbi recognition can be applied. One remaining issue is the estimation of the state transition probabilities. There are two ways to estimate them. One simple approach to calculating the combined state transition probability $a_{(m,i)(n,j)}$ from state $(m,i)$ to $(n,j)$ is to use a normalized product of the state transition probabilities from the individual models. That is,

$$a_{(m,i)(n,j)} = C \cdot a_{ij} \cdot \bar{a}_{mn}, \qquad (13)$$

where $a_{ij}$ is the ACF transition probability from state $i$ to $j$, $\bar{a}_{mn}$ is the AF state transition probability from state $m$ to $n$, and $C$ is a normalization constant ensuring that $\sum_{(n,j)} a_{(m,i)(n,j)} = 1$. Another approach is to re-estimate the probabilities during training, which will be discussed in Section 4.

While the initial phoneme alignments for both systems may be the same, it is unlikely that the state alignments of the two independently trained systems are exactly the same. That is, the state distributions of the two systems, even with the same state index, capture different portions of a phoneme. By making use of the asynchronous state combination, more flexible state alignments of the two models can be achieved during recognition, at a higher computational cost. It should also be noted that the expanded topology shown in Fig. 2 only allows for state asynchrony within a phoneme: the two models start with the same beginning state and end with the same ending state. While across-phoneme asynchrony is possible, it would greatly increase the model complexity. The implication is that asynchronous combination can handle state alignment inconsistency only under the assumption that the phoneme alignment is consistent.
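As a concrete illustration of Eqs. (12) and (13), the sketch below combines per-state log observation likelihoods with weight $w_{af}$ and builds the nine-state joint transition matrix as a normalized product of the two three-state transition matrices. All inputs are toy values and the indexing convention is our own.

```python
# Eq. (12): weighted log-linear combination of AF and ACF observation scores.
# Eq. (13): joint transitions as a normalised product of individual transitions.
import numpy as np

def combined_log_obs(log_b_acf, log_b_af, w_af):
    """Eq. (12) for every joint state (s, s_bar): a 3 x 3 score grid."""
    return (1.0 - w_af) * log_b_acf[:, None] + w_af * log_b_af[None, :]

def joint_transitions(a_acf, a_af):
    """Eq. (13): a_{(m,i)(n,j)} = C * a_ij * a_bar_mn, rows normalised to 1."""
    n = a_acf.shape[0]
    joint = np.zeros((n * n, n * n))
    for m in range(n):              # current AF state
        for i in range(n):          # current ACF state
            for nn in range(n):     # next AF state
                for j in range(n):  # next ACF state
                    joint[m * n + i, nn * n + j] = a_acf[i, j] * a_af[m, nn]
    return joint / joint.sum(axis=1, keepdims=True)   # normalisation constant C

a_acf = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
a_af  = np.array([[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
print(joint_transitions(a_acf, a_af).shape)            # (9, 9)
print(combined_log_obs(np.log([0.2, 0.5, 0.3]),
                       np.log([0.1, 0.6, 0.3]), w_af=0.6))
```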

4. System fusion during training and recognition

While better performance can be obtained by combining the ACF and AF models during recognition, asynchronous combination is needed, which increases computation and memory usage. Moreover, when the two models are trained, the knowledge that they will later be combined is not taken advantage of. Inconsistencies between model training and joint recognition may be removed if training is also performed jointly.

How model inconsistency can be removed depends on the particular training algorithms and criteria. For the ACF and AF models, different optimization criteria are used during training. The ACF system uses the maximum likelihood (ML) criterion with an iterative training procedure involving two steps: (a) the estimation of the state alignments based on the model likelihood of the observations against the previous model, resulting in either hard decisions on state boundaries (i.e. Viterbi training) or soft decisions using partial counts, and (b) the re-estimation of parameters based on the alignments. For the MLP training of the AF system, once state/phoneme alignments are given, the MLP weights are iteratively re-estimated to minimize the total square error. One difficulty in joint training is that different training criteria are used to train the individual models. However, both the ACF and AF model training depend on state/phoneme alignments that are re-estimated iteratively to optimize the individual training criteria. One direct way of joint training is therefore to share these alignments during training.

We consider two types of alignments, state alignments and phoneme alignments. A state alignment, which is of finer resolution, is the mapping between the observations and the state indexes that generate these observations. Similarly, a phoneme alignment is the mapping between the observations and the phoneme indexes. Because asynchronous combination allows for more flexible state alignments within the same phoneme, state alignment consistency is mainly a concern for synchronous combination. In synchronous combination, the two models may have different meanings/alignments between the states and the data, and forcing the states to align during recognition may then result in poor performance. This inconsistency, however, can be removed by sharing the alignments during training. While asynchronous combination explicitly handles the state alignment issues during recognition, it is still constrained to be phoneme-synchronous, i.e., both models are forced to align the test data with the same phoneme boundaries during recognition, as discussed in Section 3.2. Phonemes, compared to states, are better defined. However, inconsistency of phoneme alignments still exists between the ACF and the AF systems because of the iterative nature of the training algorithms.


With a new set of alignments generated during each iteration, starting with the same initial alignment does not have a significant impact on the final model alignments when training is completed. In ACF model training, it is well known that the phoneme as well as the state alignments are re-estimated in each training iteration, either by making hard decisions via Viterbi training or by calculating the partial counts via the E-M algorithm; interested readers can refer to [25]. In fact, in ACF-based speech recognition, it is common to start model training with a ''flat start'', i.e., using a uniform segmentation as the initial state alignment [21]. Similarly, the MLP parameters are estimated based on a given set of phoneme/state alignments, which again can be re-estimated at each iteration. The fact that the two individually trained systems result in different phoneme alignments reflects that the two training algorithms, with different optimization criteria, capture different distinctive characteristics of the phonemes. The phoneme alignment inconsistency, which can affect both synchronous and asynchronous combinations, can be removed via fusion during training.

When the AF and the ACF models are combined, the parameter set of the combined models includes the Gaussian mixture parameters of the ACF model, the MLP weights of the AF model and the combined state transition probabilities. To have a more systematic view of joint training, we explore the effects of different modes of sharing, including (a) updating the ACF model while keeping the AF model unchanged, (b) updating the AF parameters while keeping the ACF parameters unchanged and (c) jointly updating all model parameters.

4.1. Re-estimating the ACF models

In Section 3.2, we described how one can expand the HMM state space into the product of the ACF and AF states and how to combine them asynchronously. In fact, the combined phoneme model is an HMM with the observation probability denoted $b_{j,\bar{j}}(o_{t+1})$. The log observation distribution can be expressed as

$$\log b_{j,\bar{j}}(o_{t+1}) = (1 - w_{af}) \log b_j(o_{t+1}) + w_{af} \log b^{af}_{\bar{j}}(o_{t+1}), \qquad (14)$$

where $(j,\bar{j})$ denotes the joint state with ACF state $j$ and AF state $\bar{j}$. Previous research on the asynchronous combination of two sub-band HMMs [3] has shown that re-estimating the combined state transition probabilities using the joint observation distribution yields better recognition performance than multiplying the state transition probabilities of the two sub-band models. This suggests that we should consider re-estimating the combined state transition probabilities. Besides the state transition probabilities, other model parameters can also be re-estimated so that the individual model parameters can incorporate information from the combined model.

Changing the observation probabilities causes changes in the count accumulators $\alpha_t$ and $\beta_t$, which, in effect, changes the joint and posterior probabilities $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O)$ and $\gamma_t(i) = p(q_t = i \mid O)$ as described in Eqs. (2)–(5). These variables can be thought of as the soft decision state alignment. $\alpha_t$ and $\beta_t$ for the expanded state space can be expressed as

$$\alpha^{JOINT}_{t+1}(j,\bar{j}) = \left[ \sum_{i=1}^{N} \sum_{\bar{i}=1}^{\bar{N}} \alpha^{JOINT}_t(i,\bar{i})\, a_{(i,\bar{i})(j,\bar{j})} \right] b_{(j,\bar{j})}(o_{t+1}), \qquad (15)$$

$$\beta^{JOINT}_t(i,\bar{i}) = \sum_{j=1}^{N} \sum_{\bar{j}=1}^{\bar{N}} a_{(i,\bar{i})(j,\bar{j})}\, \beta^{JOINT}_{t+1}(j,\bar{j})\, b_{(j,\bar{j})}(o_{t+1}), \qquad (16)$$

where $\alpha^{JOINT}_t(i,\bar{i})$ and $\beta^{JOINT}_t(i,\bar{i})$ are the forward probability and backward probability, respectively, of the combined model state $(i,\bar{i})$ at time $t$. To re-estimate the ACF model parameters, the AF states are marginalized. That is,

$$\alpha^J_{t+1}(j) = \sum_{\bar{j}=1}^{\bar{N}} \alpha^{JOINT}_{t+1}(j,\bar{j}), \qquad (17)$$

$$\beta^J_t(i) = \sum_{\bar{i}=1}^{\bar{N}} \beta^{JOINT}_t(i,\bar{i}), \qquad (18)$$

where $N$ and $\bar{N}$ are the total numbers of ACF states and AF states, respectively. This is because out of the $N\bar{N}$ joint states, there are only $N$ unique ACF states; the others are all tied to these $N$ states [27]. $\alpha^J_t(i)$ and $\beta^J_t(i)$ are the joint forward and backward probabilities for ACF model state $i$ at time $t$. Using these new $\alpha^J_t$ and $\beta^J_t$, the expression of $\gamma_t(j)$ in Eq. (2) becomes

$$\gamma^J_t(j) = \frac{\alpha^J_t(j)\, \beta^J_t(j)}{\sum_i \alpha^J_t(i)\, \beta^J_t(i)}. \qquad (19)$$

Since the estimated ACF model will be combined with the AF model rather than used alone, $\gamma^{JOINT}_t(j,\bar{j})$ and $\xi^{JOINT}_t((i,\bar{i}),(j,\bar{j}))$ are used during the estimation of the combined state transition probabilities, and they can be expressed as

$$\gamma^{JOINT}_t(j,\bar{j}) = \frac{\alpha^{JOINT}_t(j,\bar{j})\, \beta^{JOINT}_t(j,\bar{j})}{\sum_i \sum_{\bar{i}} \alpha^{JOINT}_t(i,\bar{i})\, \beta^{JOINT}_t(i,\bar{i})}, \qquad (20)$$

$$\xi^{JOINT}_t((i,\bar{i}),(j,\bar{j})) = \frac{\alpha^{JOINT}_t(i,\bar{i})\, a_{(i,\bar{i})(j,\bar{j})}\, b_{(j,\bar{j})}(o_{t+1})\, \beta^{JOINT}_{t+1}(j,\bar{j})}{\sum_i \sum_{\bar{i}} \sum_j \sum_{\bar{j}} \alpha^{JOINT}_t(i,\bar{i})\, a_{(i,\bar{i})(j,\bar{j})}\, b_{(j,\bar{j})}(o_{t+1})\, \beta^{JOINT}_{t+1}(j,\bar{j})}, \qquad (21)$$

and the combined state transition probabilities can be estimated by

$$\hat{a}^J_{(i,\bar{i})(j,\bar{j})} = \frac{\sum_t \xi^{JOINT}_t((i,\bar{i}),(j,\bar{j}))}{\sum_t \gamma^{JOINT}_t(i,\bar{i})}. \qquad (22)$$

The re-estimation equations for the $l$th Gaussian mixture's parameters of the HMM are summarized as follows:

$$\hat{c}^J_{il} = \frac{\sum_t \gamma^J_t(i,l)}{\sum_t \gamma^J_t(i)}, \qquad (23)$$

$$\hat{\mu}^J_{il} = \frac{\sum_t \gamma^J_t(i,l)\, o_t}{\sum_t \gamma^J_t(i,l)}, \qquad (24)$$

$$\hat{\Sigma}^J_{il} = \frac{\sum_t \gamma^J_t(i,l)\, (o_t - \mu_{il})(o_t - \mu_{il})^T}{\sum_t \gamma^J_t(i,l)}, \qquad (25)$$

where

$$\gamma^J_t(i,l) = \gamma^J_t(i)\, \frac{c_{il}\, b_{il}(o_t)}{b_i(o_t)}. \qquad (26)$$

It should be noted that in Eq. (26) the estimation of the mixture posterior probability uses the ACF system likelihood only, because the mixture posterior probability is independent of the AF likelihood. Because the parameters are estimated to optimize the joint likelihood, the resulting models are only useful for recognition using the combined ACF and AF models. Furthermore, the re-estimation is done iteratively. This means that after a new set of parameters $(a^J_{ij}, \mu^J_{il}, \Sigma^J_{il}, c^J_{il})$ is obtained, they are combined with the AF model, which is unchanged, to re-estimate the alignments $(\gamma^{JOINT}_t, \alpha^{JOINT}_t, \beta^{JOINT}_t, \xi^{JOINT}_t)$.

From the above recursive equations, it may not be obvious how optimizing the joint likelihood would affect the ACF alignments and, implicitly, the model parameters. Since the product of the forward probability, $\alpha^J_t(j)$, and the backward probability, $\beta^J_t(j)$, forms the partial counts or soft alignments of the observations with the models, one way to understand the effect of the joint likelihood on alignment is to analyze the changes to $\alpha^J_t(j)$. A similar analysis can also be derived for the other alignment variables such as $\beta^J_t(j)$ or $\gamma^J_t$. By combining Eqs. (14), (15) and (17), one can rewrite $\alpha^J_{t+1}(j)$ as

$$\alpha^J_{t+1}(j) = \sum_{\bar{j}=1}^{\bar{N}} \alpha^{JOINT}_{t+1}(j,\bar{j}) = \sum_{\bar{j}=1}^{\bar{N}} \left[ \sum_{i=1}^{N} \sum_{\bar{i}=1}^{\bar{N}} \alpha^{JOINT}_t(i,\bar{i})\, a_{(i,\bar{i})(j,\bar{j})} \right] b_j(o_{t+1})^{(1-w_{af})}\, b^{af}_{\bar{j}}(o_{t+1})^{w_{af}}. \qquad (27)$$

In Sections 4.1.1 and 4.1.2, we examine the effect of this new $\alpha^J_{t+1}(j)$ estimation under synchronous and asynchronous combination.
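The joint forward recursion and the marginalization over AF states can be sketched as follows (toy inputs; the ACF-major state indexing and the uniform initialization are our assumptions, not the paper's):

```python
# Sketch of the joint forward recursion of Eq. (15) over the expanded
# (ACF state, AF state) space, and the marginalisation of Eq. (17) that
# recovers the ACF-state forward probabilities alpha^J_t(j).  Observation
# scores combine the two streams as in Eq. (14).
import numpy as np

def joint_forward(b_acf, b_af, a_joint, w_af):
    """b_acf: T x N, b_af: T x Nbar observation likelihoods;
       a_joint: (N*Nbar) x (N*Nbar) transitions, indexed as j*Nbar + jbar."""
    T, N = b_acf.shape
    Nbar = b_af.shape[1]
    # Eq. (14) in the probability domain: b_{(j,jbar)} = b_j^(1-w) * b_jbar^w
    b_joint = (b_acf[:, :, None] ** (1 - w_af)) * (b_af[:, None, :] ** w_af)
    b_joint = b_joint.reshape(T, N * Nbar)
    alpha = np.zeros((T, N * Nbar))
    alpha[0] = b_joint[0] / (N * Nbar)              # uniform initial occupancy
    for t in range(1, T):                           # Eq. (15)
        alpha[t] = (alpha[t - 1] @ a_joint) * b_joint[t]
    # Eq. (17): marginalise the AF state to get alpha^J_t(j)
    alpha_acf = alpha.reshape(T, N, Nbar).sum(axis=2)
    return alpha, alpha_acf

rng = np.random.default_rng(4)
T, N, Nbar, w_af = 6, 3, 3, 0.6
a_joint = rng.random((N * Nbar, N * Nbar))
a_joint /= a_joint.sum(axis=1, keepdims=True)
alpha, alpha_acf = joint_forward(rng.random((T, N)), rng.random((T, Nbar)),
                                 a_joint, w_af)
print(alpha.shape, alpha_acf.shape)                 # (6, 9) (6, 3)
```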


4.1.1. Effect on synchronous combination

In synchronous combination, because $i = \bar{i}$ and $j = \bar{j}$, we can simplify Eq. (27) by removing $\bar{i}$ and $\bar{j}$:

$$\alpha^J_{t+1}(j) = \sum_{i=1}^{N} \alpha^{JOINT}_t(i,i)\, a_{i,j}\, b_j(o_{t+1})^{(1-w_{af})}\, b^{af}_{j}(o_{t+1})^{w_{af}}, \qquad (28)$$

where $a_{(i,\bar{i}),(j,\bar{j})}$ is replaced by $a_{i,j}$, i.e., only the ACF state transition probabilities are used in the synchronous combination. Comparing Eq. (28) with Eq. (3), one major difference is in the computation of the observation likelihood, which is now expressed as a geometric combination of the AF and ACF observation likelihoods. Let us consider the case in which the observation $o_{t+1}$ matches well with ACF state $j$, resulting in a high $b_j(o_{t+1})$. Because it is aligned differently in the AF model, $o_{t+1}$ could match poorly with AF state $j$, resulting in a low $b^{af}_j(o_{t+1})$. In this case, the overall $\alpha^J_{t+1}(j)$ is reduced and the ACF alignment will be modified such that it is less likely for this observation to be aligned to state $j$ in the ACF model. Thus, incorporation of the AF likelihood into the ACF training can remove the inconsistency between the state alignments of the two systems.

4.1.2. Effect on asynchronous combination

In asynchronous combination, the analysis is more involved. To make the analysis clearer, let us assume, without loss of generality, that the joint transition probability can be factored into the transition of the ACF states and the transition of the AF states, as described in Eq. (13). That is, $a_{(i,\bar{i})(j,\bar{j})} = a_{i,j}\, \bar{a}_{\bar{i},\bar{j}}$, where $\bar{a}_{\bar{i},\bar{j}}$ is the transition probability of the AF states from $\bar{i}$ to $\bar{j}$. Then, Eq. (27) can be rewritten as

$$\alpha^J_{t+1}(j) = \sum_{\bar{j}=1}^{\bar{N}} \sum_{i=1}^{N} \sum_{\bar{i}=1}^{\bar{N}} \alpha^{JOINT}_t(i,\bar{i})\, a_{i,j}\, \bar{a}_{\bar{i},\bar{j}}\, b_j(o_{t+1})^{(1-w_{af})}\, b^{af}_{\bar{j}}(o_{t+1})^{w_{af}} = \sum_{i=1}^{N} \sum_{\bar{i}=1}^{\bar{N}} \alpha^{JOINT}_t(i,\bar{i})\, a_{i,j}\, b_j(o_{t+1})^{(1-w_{af})} \underbrace{\left[ \sum_{\bar{j}=1}^{\bar{N}} \bar{a}_{\bar{i},\bar{j}}\, b^{af}_{\bar{j}}(o_{t+1})^{w_{af}} \right]}_{\text{AF phoneme likelihood}}. \qquad (29)$$

Because the asynchronous combination allows the ACF states within a phoneme to match with different AF states of the same phoneme, the state alignment inconsistency can be resolved during recognition. However, for observations near phoneme boundaries, it is possible that the two models cannot reach a consensual decision about the exact location of the phoneme boundaries. In fact, marking phoneme boundaries is a difficult task even for humans [26], because speech is generated by the continuous motion of the human articulators. In cases where two phonemes have similar articulatory features, such as two vowels, or a stop closure and silence, the boundaries cannot be determined with certainty.


As shown in Eq. (29), the term $\sum_{\bar{j}} \bar{a}_{\bar{i},\bar{j}}\, b^{af}_{\bar{j}}(o_{t+1})^{w_{af}}$ is the AF contribution of the phoneme to which $j$ belongs. This term's value can affect the ACF alignment, similar to the discussion in Section 4.1.1. For example, in the case that the AF likelihood of $o_t$ for a particular phoneme (across all $\bar{j}$) is low while the ACF likelihood $b_j(o_t)$ is high, $\alpha^J_t(j)$ is reduced, making it less likely that $o_t$ is assigned to the phoneme. This, in effect, modifies the ACF phoneme alignment by considering the AF likelihood and removes the inconsistency between training and joint recognition. However, inconsistency in phoneme alignments affects fewer observations than state alignment inconsistency, so the impact of re-training the ACF model with asynchronous combination would be smaller than re-training with synchronous combination.

4.2. Re-estimating the AF model

In addition to using joint information to re-estimate the ACF HMM parameters, one can also use the joint information to re-estimate the AF parameters. As discussed in Section 2, one important factor in MLP training is the state alignments, and using a set of joint alignments is a direct way to combine the ACF HMM with the AF training. To simplify our discussion, alignments are generalized to include both fixed state/phoneme boundaries and the $\gamma_t$ and $\xi_t$, which can be viewed as soft boundaries. The advantage of using $\gamma_t$ to generalize the AF training is that it no longer needs hard decisions, as has been reported in [24]. This also makes the AF model/state definition more consistent with the state definition of the ACF HMM states and may simplify the model combination during recognition.

There are two interpretations of $\gamma_t(k)$. One is to consider $\gamma_t(k)$ as a soft count of the contribution of a state $k$ at time $t$. In that case, the cost function $E$ in Eq. (11) can be rewritten as

$$E = \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma_t(k)\, \| g(o_t) - t(o_t) \|^2, \qquad (30)$$

where $T$ is the total number of training patterns, $K$ is the number of MLP output classes, $g(o_t)$ is the MLP output and $t(o_t)$ is a $K$-dimensional column vector of all zeros except a one in row $k$. Another interpretation of $\gamma_t(j)$ is as the state posterior probability for frame $t$ that the MLP should try to predict. In this sense, denoting $\Gamma_t = [\gamma_t(1), \ldots, \gamma_t(K)]^T$, the square error criterion $E$ becomes

$$E = \sum_{t=1}^{T} \| g(o_t) - \Gamma_t \|^2. \qquad (31)$$

Given a set of models, either the Viterbi algorithm or the forward–backward algorithm can be used to generate the new training alignments. They are then used to re-estimate the AF model parameters. This is done even though the ACF model parameters, such as $\mu_{il}$, remain unchanged.

4.3. Joint re-estimation of ACF and AF models

Instead of keeping one model fixed and re-estimating the other, both models can be re-trained jointly. The steps are summarized in Table 2. Initially, models are trained for each system independently. They are then combined using a particular combination weight $w_{af}$ for the purpose of computing the $\gamma_t$ scores. Throughout the re-estimation process, we assume that this combination weight is kept unchanged. The iterative process can be stopped either empirically, by measuring the test-set recognition performance, or by measuring the changes in the $\xi_t$ from iteration to iteration. It can also be stopped by measuring the change in log likelihood. In our experiments, we used the test-set accuracy as the stopping criterion.

The iterative joint training can be done either synchronously or asynchronously. In the synchronous case, the state indexes of both models are assumed to be the same during training and recognition. This, in effect, constrains the two models to enter and exit a state simultaneously. In the asynchronous case, an expanded HMM topology is used for each phoneme, which allows asynchronous state alignment within the phoneme. This means that state transitions are not synchronized. However, the joint training does impose a new constraint: both models have to enter and exit a phoneme at the same time. In other words, the two models share the same phoneme alignment at each training iteration. Furthermore, both soft decision and hard decision alignments can be used for both the AF model training and the ACF model training.

Table 2
Joint training of AF and ACF model parameters

BEGIN
  Initialize:  i = 0
               train the AF model
               train the ACF model
               M(i) = joint model recognition accuracy on a separate test set
  DO
    i = i + 1
    Find joint alignments of training data
    Re-estimate: (a) AF model parameters
                 (b) ACF model parameters
    Evaluate M(i)
  UNTIL M(i) converges
END
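The procedure in Table 2 amounts to the following control flow. The helper functions below are placeholders for the AF (MLP) and ACF (HMM) training, alignment and scoring steps described earlier; only the loop structure is meant to mirror the table.

```python
# High-level sketch of the joint training loop in Table 2 (placeholders only).
import random

def train_af_model():   return {"kind": "AF"}          # placeholder
def train_acf_model():  return {"kind": "ACF"}         # placeholder

def joint_alignments(af, acf, w_af):                   # placeholder for the
    return ["soft or hard state/phoneme alignments"]   # joint forward-backward

def reestimate_af(af, alignments):   return af         # placeholder MLP update
def reestimate_acf(acf, alignments): return acf        # placeholder Baum-Welch

def joint_accuracy(af, acf, w_af):                     # placeholder scoring on a
    return 60.0 + random.random()                      # held-out test set

def joint_training(w_af=0.6, max_iter=10, tol=0.05):
    af, acf = train_af_model(), train_acf_model()      # independent initial models
    prev = joint_accuracy(af, acf, w_af)                # M(0)
    for i in range(1, max_iter + 1):
        align = joint_alignments(af, acf, w_af)        # shared joint alignments
        af = reestimate_af(af, align)                  # (a) update AF parameters
        acf = reestimate_acf(acf, align)               # (b) update ACF parameters
        acc = joint_accuracy(af, acf, w_af)            # evaluate M(i)
        if acc - prev < tol:                           # UNTIL M(i) converges
            break
        prev = acc
    return af, acf

joint_training()
```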


5. Experiments and results

All our experiments were performed on the TIMIT [22] task with the original phoneme set mapped to 39 context-independent phonemes [1]. One major advantage of the TIMIT task is the availability of time-aligned phonetic transcriptions, which facilitates the training of the AF MLPs. Furthermore, the TIMIT corpus has been used by many other researchers in the speech processing community [2,4,10], and the performance of our system is comparable to that of other researchers [5]. While context-dependent phoneme models would give better performance, our goal here is to demonstrate the usefulness of the fusion, and context-independent phoneme models simplify our experiments significantly. Only the SI and SX sentences were used.

For the ACF model, iterative HMM training was performed using HTK [21] and recognition was performed using a recognizer that gave exactly the same results as the HTK recognizer. The MFCCs were generated with a 20 ms Hamming window and 10 ms overlap. A thirty-nine-dimensional feature vector, including 12 MFCCs, the normalized power and their first- and second-order derivatives, was used in the ACF model.

For the AF model, the feature and phoneme MLPs were trained using the Quicknet software developed by the Realization Group at ICSI, University of California, Berkeley [19]. Instead of using 39 features as in the ACF system, second-order derivatives were not used, resulting in only 26 features per frame. Nine consecutive frames, i.e., $[t-4, \ldots, t, \ldots, t+4]$, were concatenated to form a single input vector to the feature MLPs. Each feature MLP consisted of 50 hidden units, while 700 hidden units were used for the phoneme MLP. For the feature MLPs, the phoneme labels and alignments from the TIMIT corpus were mapped to the corresponding AF labels, as was done in [14]. These were used as the initial alignments for training the AF classifiers. To get the initial state alignments for training the phoneme MLP, each phoneme was partitioned into states as described in Section 2.2.2. Both the AF and the ACF models were re-estimated for several iterations until they converged. The best individual system results are shown in the first two rows of Table 3.

Table 3
Performance of joint recognition (w_af = 0.6)

Models used                              Phoneme Acc (%)
ACF model (ACFM)                         60.92
AF model (AFM)                           63.48
Synchronous combined AFM and ACFM        63.70
Asynchronous combined AFM and ACFM       66.13

5.1. Combination of the AF model and the ACF model during recognition

One determining factor in both synchronous and asynchronous model combination is the combination weight $w_{af}$. We tried nine different values of $w_{af}$, from 0.1 to 0.9, in the recognition experiments, and the best result was obtained at $w_{af} = 0.6$. This ($w_{af} > 0.5$) is reasonable given that the baseline AF performance is better than that of the ACF system. It also suggests that the combination weight depends on the performance of the individual models and may change with the two systems' performances. Experimental results for different recognition combinations of the AF model and the ACF model are summarized in Table 3. The combined models outperform both single-feature models. Asynchronous combination gives a relative error reduction of 9.3%, while synchronous combination gives only a small reduction compared to the AF model baseline. From these experiments, it is clear that combination during recognition can give significant error reduction and that asynchronous combination is important.

5.2. Joint re-estimation of model parameters

We performed a series of experiments to evaluate the effects of re-estimating the combined model parameters. These included the re-estimation of only the ACF model (the combined state transition probabilities and the Gaussian mixture parameters), the re-estimation of only the AF model (the weights of the phoneme MLP) and the joint re-estimation of the parameters of both the ACF model and the AF model. The joint re-estimation can generally be applied to either synchronous or asynchronous combinations. The same combination weight $w_{af} = 0.6$ was kept in all the experiments. To guarantee that any gain obtained from the combined model re-estimation was not simply due to more training iterations on the individual models, the ACF model was trained until recognition performance did not improve with further training. These best models from the individual training were used as the starting point for joint training.

Table 4 shows the results of different training options. The first row compares the asynchronous and synchronous combinations and serves as the baseline result for model combination. In our first experiment, we investigated the effect of re-estimating the state transition probabilities only, while keeping the other model parameters unchanged. In the asynchronous case, this means the re-estimation of the transition probabilities of the expanded phoneme topology. In the case of synchronous combination, we re-estimated a set of transition probabilities that forced the AF and ACF states to transit at the same time. Surprisingly, the re-trained transitions did not help the asynchronous combination.


Table 4
Recognition performance of jointly trained models (w_af = 0.6)

Joint retraining                       Asyn combination (%)    Syn combination (%)
Joint recognition                      66.13                   63.70
Re-train state transition only         66.13                   64.21
Re-train ACF parameters only           66.31                   64.27
Re-train AF parameters only            66.72                   66.56
Re-train both AF and ACF               67.38                   66.90

In the synchronous combination, a small gain was obtained, but it was still inferior to that of the asynchronous combination. In the third row, not only did we re-train the transition probabilities, we also re-trained all the ACF parameters, including the Gaussian means, covariances and mixture weights; the AF model parameters remained unchanged. Re-training all ACF model parameters gave a slight gain for both asynchronous and synchronous combination. However, the gap between them remained large, suggesting that the incorporation of the AF likelihood into the ACF model training was not sufficient to change the state alignment of the ACF model to match that of the AF model.

Instead of re-training the ACF model, we then kept the ACF model unchanged and modified the AF model. In the fourth row, we re-trained the AF model using the hard decision joint model alignments. It is interesting to see that a larger improvement was obtained. For the asynchronous combination, the improvement was still modest; however, the synchronous combination approached the result of the asynchronous combination. This suggests that the AF model was adopting the ACF-like definition of the states. We also performed some experiments using soft decisions to re-train the AF models, but the performance was not as good as with hard decisions.

Finally, we re-trained both the ACF and AF models jointly, and the results are tabulated in the fifth row. The improvements were more than 1% for asynchronous combination and 3% for synchronous combination. As described in [5], the 95% significance value for the TIMIT test set is less than 0.5%, so the improvements reported above should be significant. The results suggest that sharing the same phoneme (for the asynchronous case) or state (for the synchronous case) alignments during model re-estimation is beneficial to the combined system. It is satisfying to see that the best gain was obtained by iteratively retraining both models instead of retraining either one alone. However, it seems that the AF model was more responsive to sharing a common alignment than the ACF model. One possibility is that, with ML training, a small change in alignment does not have a significant impact, whereas for MLPs, because of the discriminative nature of the training, alignment changes can have a bigger impact. Because synchronous combination is more restrictive, removing the inconsistency has a bigger effect. For the asynchronous combination, as expected, phoneme alignment affected a smaller number of observations than state alignment and, thus, the gain from joint training for asynchronous combination was smaller.

Another useful result of joint re-estimation was to narrow the gap between synchronous and asynchronous combinations. Because synchronous combination, compared to the asynchronous case, is simpler in model topology and less computationally intensive in recognition, the joint training makes it an attractive choice when computational complexity is a concern.

6. Conclusions

Fusion of the acoustic feature and articulatory feature systems has been investigated both during recognition and during training. When fusion was performed during recognition, the proposed asynchronous combination significantly improved the phoneme recognition performance, but at the expense of increased model size and computation. When fusion was performed during both recognition and training, the more efficient recognition combination, namely synchronous combination, performed almost as well as the more complex asynchronous combination. The results demonstrate the advantage of performing model fusion during training. One limitation of the current ACF and AF fusion is that the maximum likelihood criterion was used in ACF training, making it difficult to fully utilize the AF information. One obvious direction is to investigate using discriminatively trained ACF systems for system fusion.

Acknowledgements

The authors are grateful to the anonymous reviewers for their comments, which helped to improve the quality of the paper. Part of this research was supported by the Hong Kong Research Grants Council under project number HKUST 6049/00E.

References

[1] A.K. Halberstadt, J.R. Glass, Heterogeneous acoustic measurements for phonetic classification, in: Proceedings of Eurospeech, Rhodes, Greece, September 1997, pp. 401–404.
[2] A.M.A. Ali, J. Van der Spiegel, P. Mueller, Robust auditory-based speech processing using the average localized synchrony detection, IEEE Transactions on Speech and Audio Processing 10 (5) (2002) 279–292.
[3] B. Mak, Y.C. Tam, Asynchrony with trained transition probabilities improves performance in multi-band speech recognition, in: Proceedings of ICSLP, vol. IV, Beijing, China, October 2000, pp. 149–152.
[4] C.A. Antoniou, T.J. Reynolds, Modular neural networks exploit multiple front-ends to improve speech recognition systems, in: Proceedings of the International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, Neural Networks I, Brighton, UK, 2000, pp. 205–208.
[5] D.W. Purnell, E.C. Botha, Improved generalization of MCE parameter estimation with application to speech recognition, IEEE Transactions on Speech and Audio Processing 10 (4) (2002) 232–239.
[6] H.A. Bourlard, N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
[7] H. Bourlard, S. Dupont, Subband-based speech recognition, in: Proceedings of ICASSP, Munich, Germany, April 1997, pp. 1251–1254.
[8] J. Fiscus, A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER), in: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA, December 1997, pp. 347–354.
[9] J. Frankel, K. Richmond, S. King, P. Taylor, An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces, in: Proceedings of ICSLP, vol. IV, Beijing, October 2000, pp. 254–257.
[10] J. Ming, F.J. Smith, Improved phone recognition using Bayesian triphone models, in: Proceedings of ICASSP, vol. 1, Seattle, WA, May 1998, pp. 409–412.
[11] J. Zacks, T.R. Thomas, A new neural network for articulatory speech recognition and its application to vowel identification, Computer, Speech and Language 8 (1994) 189–209.
[12] K. Erler, G. Freeman, Using articulatory features for speech recognition, in: Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada, May 1995, pp. 562–566.
[13] K. Erler, L. Deng, HMM representation of quantized articulatory features for recognition of highly confusible words, in: Proceedings of ICASSP, vol. I, San Francisco, CA, March 1992, pp. 545–548.
[14] K. Kirchhoff, G.A. Fink, G. Sagerer, Conversational speech recognition using acoustic and articulatory input, in: Proceedings of ICASSP, vol. III, Istanbul, Turkey, June 2000, pp. 1435–1438.
[15] K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, Ph.D. thesis, University of Bielefeld, 1999.
[16] K. Leung, M. Siu, Speech recognition using combined acoustic and articulatory information with retraining of acoustic model parameters, in: Proceedings of ICSLP, Denver, CO, September 2002, pp. 2117–2120.
[17] M. Richardson, J. Bilmes, C. Diorio, Hidden-articulator Markov models: performance improvements and robustness to noise, in: Proceedings of ICSLP, vol. III, Beijing, China, October 2000, pp. 131–134.
[18] N. Mirghafori, N. Morgan, Sooner or later: exploring asynchrony in multi-band speech recognition, in: Proceedings of Eurospeech, vol. II, Budapest, Hungary, September 1999, pp. 595–598.
[19] P. Färber, Quicknet on MultiSpert: Fast Parallel Neural Network Training, ICSI Technical Report TR-97-047, 1998, http://www.icsi.berkeley.edu/techreports.
[20] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, Inc., 2001.
[21] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, P. Woodland, The HTK Book for HTK 3.0, Microsoft Corporation, July 2000.
[22] V. Zue, S. Seneff, J. Glass, Speech database development at MIT: TIMIT and beyond, Speech Communication 9 (4) (1990) 351–356.
[23] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by back-propagating errors, Nature 323 (1986) 533–536.
[24] Y. Yen, M. Fanty, R. Cole, Speech recognition using neural networks with forward–backward probability generated targets, in: Proceedings of ICASSP, vol. IV, Munich, Germany, April 1997, pp. 3241–3244.
[25] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
[26] J.P. Hosom, Automatic Time Alignment of Phonemes Using Acoustic-Phonetic Information, Ph.D. thesis, Oregon Graduate Institute of Science and Technology, 1999.
[27] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1998.