ROBUST SPEECH RECOGNITION USING DYNAMIC NOISE ADAPTATION

Steven Rennie, Pierre Dognin, Petr Fousek
IBM Thomas J. Watson Research Center
{sjrennie, pdognin}@us.ibm.com, petr [email protected]

978-1-4577-0539-7/11/$26.00 ©2011 IEEE

ABSTRACT

Dynamic noise adaptation (DNA) [1, 2] is a model-based technique for improving automatic speech recognition (ASR) performance in noise. DNA has shown promise on artificially mixed data such as the Aurora II and DNA+Aurora II tasks [1]—significantly outperforming well-known techniques like the ETSI AFE and fMLLR [2]—but has never been tried on real data. In this paper, we present new results generated by commercial-grade ASR systems trained on large amounts of data. We show that DNA improves upon the performance of the spectral subtraction (SS) and stochastic fMLLR algorithms of our embedded recognizers, particularly in unseen noise conditions, and describe how DNA has been evolved to become suitable for deployment in low-latency ASR systems. DNA improves our best embedded system, which utilizes SS, fMLLR, and fMPE [3], by over 22% relative at SNRs below 6 dB, reducing the word error rate in these adverse conditions from 4.24% to 3.29%.

Index Terms— Dynamic Noise Adaptation (DNA), robust speech recognition (ASR), model adaptation, fMLLR, fMPE, spectral subtraction, ETSI AFE, DNA + Aurora II, Vector Taylor Series (VTS), Algonquin.

1. INTRODUCTION

Model-based approaches to robust ASR that utilize explicit models of noise, channel distortion, and their interaction with speech are a well-established and continually evolving research paradigm in robust ASR. Many interesting and effective approximate modeling and inference techniques have been developed to represent these acoustic entities [4–6], and the reasonably well understood but complicated interactions between them [7–11]. While results showing the promise of these techniques on less sophisticated systems trained on small amounts of artificially mixed data abound, there has been little evidence that these techniques can improve state-of-the-art ASR systems. In this paper we show that a model-based technique, dynamic noise adaptation (DNA), can substantially improve the performance of commercial-grade speech recognizers trained on large amounts of data, deliver match-trained word error rate (WER) performance, and improve our best recognizer in low SNR conditions.

2. DNA MODEL

The DNA model consists of a speech model, noise model, channel model, and interaction model, which describes how these acoustic entities combine to generate noisy speech, as described below.

2.1. Interaction Model

The model of noisy speech in the time domain is

y(t) = h(t) ∗ x(t) + n(t),   (1)

where ∗ denotes linear convolution, h(t) models all channel effects, including propagation distortion and speaker-dependent vocal tract characteristics, and n(t) models all other sources of acoustic interference. In the frequency domain,

|Y|² = |H|²|X|² + |N|² + 2|H||X||N| cos θ = |H|²|X|² + |N|² + ε,   (2)

where |X| and θx represent the magnitude and phase spectrum of x(t), θ = θx + θh − θn, and ε = 2|H||X||N| cos θ is the phase term. For uniformly distributed phases, the expected value of the phase term is zero, and more generally, if the speech dominates the noise or vice versa, this term is small. Ignoring the phase term ε, and assuming that the channel response |H| is constant over each Mel frequency band, the relationship in the log Mel spectral domain becomes

y ≈ log(exp(x + h) + exp(n)) = f(x + h, n),   (3)

for each frequency band, where frequency subscripts are omitted for brevity, and y represents the log Mel transform of |Y|². Mel binning substantially reduces the error in this approximation. In this work we will model this error as zero-mean and Gaussian distributed:

p(y|x + h, n) = N(y; f(x + h, n), ψ²).   (4)
More accurate models of phase interaction in the log Mel domain have recently been proposed, and improve performance at the expense of increased computation [11]. From (3) we can see that in the log Mel spectral domain (and therefore in the Mel cepstral domain), the relationship between the speech and channel is approximately linear, but the relationship between the speech and the noise is highly non-linear.
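The interaction function f in (3) is just a per-band log-sum. The following minimal sketch (with hypothetical log Mel band values, not from the paper) compares it against exact power-domain mixing with the phase term of (2); the error is smallest where one source dominates.

```python
import numpy as np

def f_interaction(x_plus_h, n):
    """Interaction function of Eq. (3): y ≈ log(exp(x + h) + exp(n)),
    applied per Mel band in the log power domain."""
    # np.logaddexp is a numerically stable log(exp(a) + exp(b)).
    return np.logaddexp(x_plus_h, n)

# Hypothetical log Mel values for three bands (illustrative numbers only).
x_h = np.array([2.0, -1.0, 0.5])  # speech plus channel, x + h
n = np.array([0.0, 0.0, 3.0])     # noise

y_approx = f_interaction(x_h, n)

# Exact power-domain mixing including the phase term of Eq. (2):
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=x_h.shape)
y_exact = np.log(np.exp(x_h) + np.exp(n)
                 + 2.0 * np.sqrt(np.exp(x_h) * np.exp(n)) * np.cos(theta))

# The approximation error; it is largest when the speech and noise
# powers are comparable (bands 0 and 1 here), small when one dominates.
err = y_exact - y_approx
```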
2.2. Speech Model

Most previous model-based approaches set in the Mel spectral (cepstral) domain [1, 5] have utilized a diagonal covariance acoustic model (AM). In this work we utilize a band-quantized Gaussian mixture model (BQ-GMM) [12], so that the speech model used by DNA can be efficiently computed and stored. A BQ-GMM is a constrained, diagonal covariance GMM. Each acoustic state s of a BQ-GMM is associated with a set of F univariate Gaussian BQ atoms—one for each feature dimension f—which are selected from a reduced set of B Gaussians at each dimension, for all F dimensions.
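The band-quantized evaluation just described can be sketched as follows. This is an illustrative toy (hypothetical model sizes, random parameters), not the paper's implementation: each state picks one shared atom per band, so a frame costs B × F Gaussian evaluations plus table lookups, rather than S × F evaluations.

```python
import numpy as np

# Illustrative BQ-GMM (hypothetical sizes): S states, F bands,
# but only B << S distinct univariate Gaussian atoms per band.
S, F, B = 1000, 24, 32
rng = np.random.default_rng(1)

atom_mu = rng.normal(size=(F, B))               # atom means, per band
atom_var = rng.uniform(0.5, 2.0, size=(F, B))   # atom variances, per band
a = rng.integers(0, B, size=(S, F))             # a(s, f): state -> atom index

def log_likelihoods(x):
    """Per-state log p(x|s): evaluate only the B*F shared atoms once,
    then assemble each state's score by table lookup."""
    ll_atom = (-0.5 * np.log(2.0 * np.pi * atom_var)
               - 0.5 * (x[:, None] - atom_mu) ** 2 / atom_var)  # shape (F, B)
    # Gather each state's selected atom per band and sum over bands.
    return ll_atom[np.arange(F), a].sum(axis=1)  # shape (S,)

x = rng.normal(size=F)  # one frame of F-dimensional features
ll = log_likelihoods(x)
```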
ICASSP 2011
Specifically:

p(s) = πs,   p(x|s) = ∏f N(x; μa(s,f), σ²a(s,f)),   (5)
where a(s, f) maps acoustic state s to Gaussian atom a in frequency band f. A BQ-GMM with S states therefore has B × F univariate Gaussians in total, rather than S × F. By design B ≪ S, and so BQ-GMMs can be both evaluated and stored efficiently. This optimization is particularly useful for techniques like DNA that explicitly model the interaction between speech and noise, because the acoustic likelihoods are more expensive to evaluate than in a typical AM.

2.3. Noise Model

DNA models noise in the Mel spectrum as a Gaussian process. The dynamically evolving component of this noise—the noise level—is assumed to be changing slowly relative to the frame rate, and is modeled as follows:

p(lf,0) = N(lf,0; βf, ω²f,0),   (6)

p(lf,τ | lf,τ−1) = N(lf,τ; lf,τ−1, γ²f),   (7)

where lf,τ is a random variable representing the noise level in frequency band f at frame τ. Note that the noise is assumed to evolve independently at each frequency, which is rarely the case, but in practice we have so far found that introducing coupling only introduces additional unknown parameters and does not improve performance. The transient component of the noise process at each frequency band is modeled as zero-mean and Gaussian. The conditional probability density of the noise in frequency band f during frame τ is:

p(nf,τ | lf,τ) = N(nf,τ; lf,τ, φ²f).   (8)

The separation of the evolving and transient components of the noise facilitates robust tracking of the noise level during inference. This simple graphical model is descriptive of evolving diffuse noise in the log Mel spectrum, which is the predominant form of interference in many acoustic environments, including cars (see [1] for details).

2.4. Channel Model

In previous work, channel distortion was not modeled in DNA. In this work, we model the channel vector as a parameter, which is stochastically adapted during inference in a manner similar to how noise was adapted in [5]:

p(hf,τ) = δ(hf,τ − ĥf(τ)),   (9)

where ĥf(τ) is the current estimate of the channel in frequency band f at frame τ.

3. INFERENCE

Inference in the DNA model inherently requires approximations, since the exact noise posterior for a given frame is a K^T component GMM for an utterance with T frames, when a |s| = K component GMM speech model is used. We do inference in the model in a sequential fashion, and approximate the noise prior at frame τ + 1 given the noise posterior at frame τ as Gaussian:

p(lf,τ+1) ≈ N(lf,τ+1; βf,τ+1, ω²f,τ+1),   (10)
where

βf,τ+1 = E[lf,τ | y0:τ] = ∑_{sτ} p(sτ | y0:τ) E[lf,τ | y0:τ, sτ],   (11)

ω²f,τ+1 = Var[lf,τ | y0:τ] + γ²f
        = ∑_{sτ} p(sτ | y0:τ) {Var[lf,τ | y0:τ, sτ] + (E[lf,τ | y0:τ] − E[lf,τ | y0:τ, sτ])²} + γ²f.   (12)

In this paper we use a slight variation of Algonquin [10] to compute the conditional posterior of the noise level and speech for each Gaussian atom a. Algonquin iteratively linearizes the interaction function (3) in (4) about a context-dependent expansion point, usually taken as the current (prior or posterior) estimates of the speech and noise. For each Gaussian atom a:

p(y | x, n, h) ≈ N(y; αa(x + h) + (1 − αa)n + ba, ψ²),   (13)

αa = (∂f/∂x)|_{x̂a, l̂a, n̂a} = |Ĥa|²|X̂a|² / (|Ĥa|²|X̂a|² + |N̂a|²),   (14)

ba = f(x̂a + ĥa, n̂a) − αa(x̂a + ĥa) − (1 − αa)n̂a.   (15)
Given αa, the posterior distribution of x and n is Gaussian. Once the final estimate of αa has been determined, the posterior distribution of l can be determined by integrating out the speech and transient noise to obtain a Gaussian posterior likelihood for l, and then combining it with the current noise level prior. This is more efficient than computing the full joint posterior of x, n, and l. In this paper we use DNA as a front-end module, and reconstruct DNA-compensated Mel features for recognition by the back-end. The MMSE estimate of the Mel speech features for frame τ, given the above approximations, is:

x̂f,τ = E[xf,τ | y0:τ] = ∑_{sτ} p(sτ | y0:τ) E[xf,τ | y0:τ, sτ].   (16)
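The iterated linearization of (13)–(15) can be sketched for a single band and a single Gaussian atom. This is a hedged toy in the spirit of Algonquin, not the paper's exact implementation: the channel is folded into the speech variable, the priors and observation variance ψ² are placeholder values, and the posterior update is the standard linear-Gaussian (Kalman-style) one implied by the linearized model.

```python
import numpy as np

def f(x, n):
    return np.logaddexp(x, n)  # interaction function of Eq. (3), channel folded into x

def algonquin_posterior(y, mu_x, var_x, mu_n, var_n, psi2=0.01, iters=3):
    """Iteratively relinearize f at the current posterior means and
    recompute the Gaussian posterior of (x, n) from the priors."""
    x_hat, n_hat = mu_x, mu_n  # initial expansion point: prior means
    for _ in range(iters):
        alpha = np.exp(x_hat) / (np.exp(x_hat) + np.exp(n_hat))      # cf. Eq. (14)
        b = f(x_hat, n_hat) - alpha * x_hat - (1.0 - alpha) * n_hat  # cf. Eq. (15)
        # Linearized observation: y ≈ alpha*x + (1 - alpha)*n + b + eps
        resid = y - (alpha * mu_x + (1.0 - alpha) * mu_n + b)
        s = alpha**2 * var_x + (1.0 - alpha)**2 * var_n + psi2  # innovation variance
        kx = alpha * var_x / s            # Kalman-style gains
        kn = (1.0 - alpha) * var_n / s
        x_hat = mu_x + kx * resid         # posterior means; next expansion point
        n_hat = mu_n + kn * resid
    post_var_x = var_x - kx * alpha * var_x
    post_var_n = var_n - kn * (1.0 - alpha) * var_n
    return x_hat, post_var_x, n_hat, post_var_n

# Noise-dominated band (illustrative priors): the posterior should place
# most of the observed level y on the noise, not the speech.
xh, vx, nh, vn = algonquin_posterior(y=3.0, mu_x=0.0, var_x=1.0,
                                     mu_n=3.0, var_n=0.5)
```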
4. EXPERIMENTAL SETUP

All experiments were conducted on internal IBM databases. The audio data is US English in-car speech recorded in various noise conditions (0, 30, and 60 mph), and sampled at 16 kHz. The training subset is composed of 786 hours of speech from 10.3K speakers, for a total of 803K utterances. The test subset contains a total of 206K words in 38.9K utterances from 128 held-out speakers. There are 47 tasks covering four domains (navigation, command & control, digits & dialing, radio) in 7 US regional accents.

Our reference model is a 10K Gaussian model built on all the training data. We use a set of 91 phonemes, each modeled with a 3-state left-to-right hidden Markov model (HMM). These states are modeled using 2-phoneme left and right context dependencies, within word, yielding a total of 865 context-dependent (CD) states. Acoustic models for these CD states are built on 40-dimensional features obtained using Linear Discriminant Analysis (LDA). Training consists of 30 iterations of the EM algorithm, with CD state alignments re-estimated every few steps. Similarly, a clean 10K model was built on a 345-hour subset of the training data (composed of utterances with SNR greater than 20 dB), with the same phoneme set, HMM topology, and number of CD states (865) as the reference 10K model. Finally, an fMPE model was built on top of the reference model by adding 30 iterations of EM and training a feature-space Minimum Phone Error (fMPE) transformation. The fMPE transform uses a secondary acoustic model with 512 Gaussians, with an inner and outer context of 17 and 9 frames, respectively. Once trained, each acoustic model was compressed with hierarchical band-quantization [13].

DNA speech models are diagonal covariance GMMs, internally represented as BQ-GMMs for speed and storage efficiency. First, a speech model was built on all SNR conditions in the training data; this is our DNA model. Then, a second speech model was built on clean conditions of 20 dB and above; this is our DNA 20dB+ model. Both models use 256 Gaussians to model speech and 16 Gaussians to model silence in the training data.

4.1. Recognition Setup

Recognition is done using the IBM embedded speech recognizer. Its front-end includes a zero-latency model-based speech activity detector producing speech/silence labels for spectral subtraction, as well as smoothed labels for speech end-pointing. Every 15 ms, 13 mel-frequency cepstra are computed, mean-normalized, and transformed using LDA, MLLT, and optionally DNA, fMPE, and fMLLR, to produce a 40-dimensional feature vector. Frames labeled as speech enter a back-end which contains a hierarchical BQ labeler and our decoder. Recognition is performed using static graphs pre-compiled from constrained task-specific word grammars sharing a word pronunciation dictionary. In each of the 47 tasks, recognition runs continuously on a sequence of utterances with known end-points. Adaptive techniques (voice activity detection, spectral subtraction, and cepstral mean normalization) run in on-line continuous adaptation mode; however, speaker boundaries are not known. This is a one-pass decoding setup where the best hypothesis is used to adapt the fMLLR transform (with partial updates to the matrix made every 5 frames using stochastic gradient descent). The final transform is applied only to the subsequent utterance. In this setup, DNA is reset at each utterance.
We would expect better DNA performance if more contextual information were used. Spectral subtraction (SS) is performed in the power spectral domain prior to Mel filter binning. Our model-based voice activity detector operates on the non-subtracted features, their deltas, double deltas, and normalized raw energy tracks (39+3 features). For non-speech frames, the noise estimate is adapted (with exponential forgetting factor α = 0.9). For speech frames, the noise estimate is simply subtracted from the input. A smooth transition from high to low input levels is ensured by flooring the estimated output to a fraction of the current noise estimate. Very low input levels stay unmodified. As for DNA, the first ten frames of every utterance are assumed to be speech-free. DNA uses these frames to initialize the parameters of its noise model, as described in [2].

In order to assess how well the studied techniques compare under various SNR conditions, all test utterances were partitioned into SNR bins as follows. In each utterance, speech boundaries were found using forced alignment with a reference text. The SNR was estimated by taking the ratio of the average Mel-binned power spectra in speech and speech-free regions. Subsequently, a histogram of all SNRs was built and a range that covered 99.6% of the utterances was found, leaving 0.2% outliers on each side. The dB range was linearly split into 8 bins and the data were partitioned accordingly. The bin with the highest SNR was later discarded since it contained too few utterances. Once all utterances from the 47 tasks are decoded, each utterance is assigned to an SNR bin, effectively repartitioning the test data. For each bin, Word Error Rate (WER) and Sentence Error Rate (SER) are calculated, providing a measure of SNR-dependent performance.
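The SS front-end described above can be sketched per power-spectral bin. This is a minimal illustration of the stated recipe (forgetting factor α = 0.9, subtraction on speech frames, output floored to a fraction of the noise estimate); the flooring fraction and the synthetic frames are assumptions, not values from the paper.

```python
import numpy as np

def spectral_subtract(frames, is_speech, alpha=0.9, floor_frac=0.1):
    """Sketch of the SS front-end, operating on power-spectral frames.
    `floor_frac` is an assumed flooring fraction (not given in the text)."""
    noise = frames[0].copy()  # assume leading frames are speech-free
    out = np.empty_like(frames)
    for t, frame in enumerate(frames):
        if not is_speech[t]:
            # Adapt the noise estimate with exponential forgetting.
            noise = alpha * noise + (1.0 - alpha) * frame
            out[t] = frame
        else:
            # Subtract the noise estimate, flooring the output to a
            # fraction of the current noise estimate.
            out[t] = np.maximum(frame - noise, floor_frac * noise)
    return out

# Synthetic example: 10 noise-only frames, then 10 frames with added speech power.
T, K = 20, 8
rng = np.random.default_rng(2)
frames = 1.0 + 0.1 * rng.random((T, K))
frames[10:] += 5.0
labels = np.array([False] * 10 + [True] * 10)
cleaned = spectral_subtract(frames, labels)
```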
Table 1. WERs and SERs (%) for the 10K reference, clean, and fMPE back-end models, with/without DNA and with spectral subtraction (SS) and fMLLR enabled/disabled (+/−). The DNA20dB+ result was obtained using a DNA speech model trained only on clean data.

                       WER (%) / SER (%)
Model           SS− fMLLR−    SS+ fMLLR−    SS− fMLLR+    SS+ fMLLR+
10K reference
  no-DNA        2.49 / 6.15   1.91 / 5.12   1.49 / 3.99   1.43 / 3.86
  DNA           1.83 / 5.11   1.86 / 5.21   1.38 / 3.91   1.39 / 3.79
10K clean
  no-DNA        4.10 / 8.60   2.66 / 6.28   1.60 / 4.38   1.52 / 4.21
  DNA           1.82 / 5.16   1.88 / 5.17   1.42 / 4.06   1.41 / 3.95
  DNA20dB+      1.88 / 5.27   1.92 / 5.36   1.42 / 3.94   1.41 / 3.97
fMPE
  no-DNA        1.34 / 3.77   1.18 / 3.41   1.08 / 3.00   1.00 / 2.79
  DNA           1.21 / 3.77   1.25 / 3.80   0.99 / 2.89   1.00 / 2.92
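As a sanity check, the headline relative gains discussed in Section 5 and the abstract follow directly from the numbers in Table 1 (they match the quoted figures up to rounding):

```python
# Relative WER reductions, recomputed from Table 1 and the abstract.
def rel_gain(base, new):
    return 100.0 * (base - new) / base

dna_alone = rel_gain(2.49, 1.83)   # DNA vs. baseline, 10K reference model
ss_alone = rel_gain(2.49, 1.91)    # SS vs. the same baseline
dna_fmllr = rel_gain(2.49, 1.38)   # DNA + fMLLR vs. the same baseline
low_snr = rel_gain(4.24, 3.29)     # abstract: best system at SNR < 6 dB
```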
5. EXPERIMENTAL RESULTS

We present decoding results on our test set to demonstrate that DNA is a relevant noise robustness algorithm for a commercial recognition system. We compare DNA to SS and fMLLR—two common techniques used to improve recognition performance on noisy data—and report results for all possible combinations of DNA, SS, and fMLLR on the 10K reference, clean, and fMPE models.

Table 1 summarizes the recognition performance of all these combinations. In general, DNA consistently improves performance, particularly when used alone. One exception is when DNA is combined with spectral subtraction on the fMPE model, but this artifact disappears as soon as fMLLR is turned on. On the reference 10K model, DNA alone gains 26.5% relative over the baseline. This is 3.5% more than spectral subtraction, the difference being most marked in the lowest SNR regions, as shown later. Unfortunately, the gains due to SS and DNA are not additive. However, DNA combines well with fMLLR, with both methods together providing an impressive 44.5% relative gain over the baseline.

DNA is by design aimed at mismatched scenarios with unseen noise conditions. This is to some extent simulated by training on clean data, as shown in the second part of Table 1. Here DNA considerably outperforms both spectral subtraction and fMLLR, showing essentially no regression from matched training. Gains from fMLLR and DNA are again additive. This suggests that with DNA, an acoustic model of virtually the same quality can be reached with much less data, and, perhaps more importantly, with a significantly reduced burden of collecting real noisy recordings.

The third part of the table represents an attempt to improve upon a very mature system featuring SS, fMLLR, and fMPE.
In this combination DNA does not show an improvement, which suggests that these techniques are strong enough to cope with reasonably well-matched noise conditions; gains from DNA could only be expected on mismatched tasks.

More insight into the results can be gained by measuring performance as a function of SNR, as illustrated in Figures 1 to 3. All figures show word error rate against SNR level; the background histogram gives the number of words in each bin. Common to all figures is the observation that DNA brings the most gain at the lowest SNRs. Compared to spectral subtraction, this is the area where DNA has the edge (see the blue curves in all figures). This holds even for the fMPE model (Figure 3), on which the word error rates for DNA are otherwise similar to those of SS; the reason is that there is little low-SNR data in the test set. At higher SNRs DNA sometimes slightly degrades performance, but this does not appear to be a general trend.

The observations on the clean model are shown by the green lines in Figures 1 and 2. Comparing both dash-dotted lines indicates how well fMLLR is able to manage unseen noises, both lines being well above DNA's solid lines for the lowest SNR. The dashed green lines indicate that spectral subtraction is not of much help on this task. Finally, comparing solid green to solid blue in both figures confirms that, apart from the lowest SNR bin, clean-trained models with DNA are capable of delivering the same performance as standard multi-condition training, with or without fMLLR.

[Fig. 1. WERs as a function of SNR for the 10K reference model (blue and red) and the 10K clean model (green); fMLLR is off. Results are reported for SS and DNA included (+) and excluded (−) from the front-end pipeline.]

[Fig. 2. WERs as a function of SNR for the 10K reference model (blue and red) and the 10K clean model (green); fMLLR is on. Results are reported for SS and DNA included (+) and excluded (−) from the front-end pipeline.]

[Fig. 3. WERs as a function of SNR for the fMPE model; fMLLR is on. Results are reported for SS and DNA included (+) and excluded (−) from the front-end pipeline.]

6. REFERENCES
[1] S. Rennie, T. Kristjansson, P. Olsen, and R. Gopinath, "Dynamic noise adaptation," in ICASSP, 2006.

[2] S. Rennie and P. Dognin, "Beyond linear transforms: Efficient nonlinear dynamic adaptation for noise robust speech recognition," in Interspeech, September 2008.

[3] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, "fMPE: Discriminatively trained features for speech recognition," in ICASSP, 2005.

[4] Y. Ephraim, D. Malah, and B.-H. Juang, "On the application of hidden Markov models for enhancing noisy speech," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1846–1856, December 1989.

[5] L. Deng, J. Droppo, and A. Acero, "Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition," IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 568–580, 2003.

[6] J. Droppo and A. Acero, "Noise robust speech recognition with a switching linear dynamic model," in ICASSP, 2004.

[7] A. Nádas, D. Nahamoo, and M. A. Picheny, "Speech recognition using noise-adaptive prototypes," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 10, pp. 1495–1503, October 1989.

[8] M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.

[9] P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in ICASSP, Atlanta, Georgia, May 1996, vol. 2, pp. 733–736.

[10] B. J. Frey, L. Deng, A. Acero, and T. Kristjansson, "Algonquin: Iterating Laplace's method to remove multiple types of acoustic distortion for robust speech recognition," in Eurospeech, 2001.

[11] L. Deng, J. Droppo, and A. Acero, "Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 2, pp. 133–143, 2004.

[12] E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods," in ICASSP, 1993, pp. 692–695.

[13] R. Bakis, D. Nahamoo, M. A. Picheny, and J. Sedivy, "Hierarchical labeler in a speech recognition system," U.S. Patent 6,023,673.