DISTANT-TALKING ROBUST SPEECH RECOGNITION USING LATE REFLECTION COMPONENTS OF ROOM IMPULSE RESPONSE
Randy Gomez, Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology, JAPAN
E-mail:
{randy-g,even,sawatari,shikano}@is.naist.jp
ABSTRACT

We propose a robust and fast dereverberation technique for real-time speech recognition applications. First, we effectively identify the late reflection components of the room impulse response. We use this information together with the concept of Spectral Subtraction (SS) to remove the late reflection components of the reverberant signal. In the absence of the clean speech in the actual scenario, an approximation is carried out in estimating the late reflection, and the estimation error is corrected through multi-band SS. The multi-band coefficients are optimized during offline training and used in the actual online dereverberation. The proposed method performs better and faster than the relevant approach using multi-step LPC and a reverberant matched model. Moreover, the proposed method is robust to speaker and microphone locations.

1. INTRODUCTION

Reverberation significantly degrades the performance of distant-talking speech recognition applications. Thus, it is important to suppress the reverberation effects to minimize model mismatch prior to input to the recognizer. Techniques such as inverse filtering are effective but take much computation time, which precludes real-time application. In this research, we focus on a single-channel real-time dereverberation framework which can be easily extended to multiple channels. A novel approach based on this framework is proposed in [1]. This approach employs a numerical criterion based on minimum squared error through multi-step Linear Prediction Coefficients (LPC) to effectively estimate the late reflection, and makes use of single-band SS to remove it from the observed signal. Although [1] works well in estimating the late reflection, this approach requires the complete reverberant utterance for processing, since multi-step LPC's performance is directly proportional to the observed data. Thus, real-time speech recognition is difficult to realize.

In our proposed method, we extended and modified [1], resulting in real-time dereverberation in realistic reverberant conditions. Instead of using multi-step LPC, we devise an approach that effectively estimates the late reflection using the measured impulse response and suppresses its effect through multi-band SS. Unlike in [1], the proposed method does not need to wait for the whole reverberant utterance to start processing; thus, real-time implementation is possible.

2. SPECTRAL SUBTRACTION

A reverberant speech signal contains the effects of both the early and late reflections (when referring to the early reflection we include by definition the direct signal). Although there exists a strong correlation, due to articulatory constraints, between the speech and the effects of the reverberant environment condition (i.e., early and late reflections), this strong correlation is lost due to articulatory movements [2]. Thus, we can write
1-4244-1484-9/08/$25.00 ©2008 IEEE
x(n) = xE(n) + xL(n),    (1)

where xE(n), xL(n) are the uncorrelated early and late reflection components of the reverberant signal x(n). Denote s(n) as the clean speech, and suppose that, given the room impulse response h(n) = [hE hL], its early coefficients hE and late coefficients hL are identified in advance. Equation (1) becomes

x(n) = s(n) * hE + s(n) * hL.    (2)

Since xE(n), xL(n) are uncorrelated to some constraint [2], we can use SS [3] to remove xL(n). The target signal xE(n) becomes

xE(n) = x(n) - xL(n).    (3)

The reasons for removing xL(n) are the following:
(1) The late reflection has lower energy compared to the early reflections.
(2) The late reflection tends to be static over time and is not as sensitive to the microphone-to-speaker distance as the early reflection.
(3) The late reflection falls outside the framework which the 3-state HMM is designed to handle.
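As a minimal sketch of the subtraction in Equation (3) carried out per frequency bin on short-time magnitude spectra (the flooring constant `beta` is a common SS safeguard against negative magnitudes, not part of Equation (3) itself):

```python
import numpy as np

def spectral_subtract(X_mag, XL_mag, beta=0.01):
    """Single-band spectral subtraction: remove the late-reflection
    magnitude |XL(f)| from the observed |X(f)|, flooring negative
    results at beta * |X(f)| as in conventional SS."""
    diff = X_mag - XL_mag
    return np.maximum(diff, beta * X_mag)
```

In practice X_mag and XL_mag would be magnitude spectra of one analysis frame; bins where the estimate overshoots the observation are clipped to the floor rather than going negative.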
ICASSP 2008
Fig. 1. Ideal dereverberation where the clean speech signal is known.
3. PROPOSED METHOD

Based on the SS concept given in Equation (3), a fast and simple dereverberation approach can be constructed as depicted in Figure 1. This figure shows that the late reverberant components can easily be removed if the late reflection impulse response hL and the clean speech s(n) are given in order to estimate xL(n). This approach is faster and more accurate than that of [1], since we use the exact impulse response boundary and do not rely on multi-step LPC, which takes time to estimate xL(n). Figure 1 is ideal in the sense that hL and s(n) are not available. In Figure 2, we show our proposed method, which is an alternative implementation of the ideal case shown in Figure 1. It is possible to measure the room impulse response h(n), and from that we can experimentally identify hL, as explained in Section 3.1. Likewise, we can assume that by using the actual reverberant signal x(n) instead of s(n) (note that s(n) is not available) we can arrive at a crude estimate x̂L(n) instead of the exact xL(n). Although this results in significant estimation error, we can correct it through multi-band SS, where the multi-band coefficients δ = {δ1, ..., δK} are trained offline to minimize the error between x̂L(n) and xL(n), as described in Section 3.3.

Fig. 2. Practical implementation of the ideal fast dereverberation.

Moreover, the early reverberation effects in the target signal xE(n) can be handled by the 3-state HMM architecture through Cepstral Mean Normalization and adaptation techniques [4].

3.1. Identifying the Impulse Response Boundary hL

Suppose that we are able to measure the room impulse response h(n) (see Section 4.1); we need to effectively find the boundary for hL. To do so, we varied the length of the impulse response in generating reverberant test data sets and performed recognition experiments using a clean model. The result of the experiment is shown in Figure 3, where the horizontal axis is the length of the impulse response and the vertical axis shows the recognition performance. It is obvious in this figure that the steep decrease in performance starts at 70 ms, which suggests the beginning of the effect of the late reflection xL(n). The steep decrease is attributed to the fact that the recognizer cannot deal with reverberation that falls outside the 3-state HMM framework. Thus, this part is due to the late reflection hL. Moreover, this figure shows that the recognizer is robust to the effects of the early part hE, which causes the early reflections xE(n).

3.2. Estimating x̂L(n) Instead of xL(n)

Since it is not feasible to estimate xL(n), because s(n) is not available in the actual scenario, we make the crude assumption that we can instead estimate x̂L(n) using the observed reverberant signal x(n), as shown in Figure 2. This assumption, however, results in significant estimation error and would render the conventional single-band SS inoperative, since SS needs a good estimate of xL(n). To correct this error, we employ multi-band SS similar to that in [5]. We introduce an offline training scheme for computing the multi-band coefficients that minimize the error between xL(n) and the crude estimate x̂L(n), which is discussed in Section 3.3.
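The two steps above (truncating the measured h(n) at the 70 ms boundary of Figure 3 to obtain hL, then convolving the observed signal rather than the unavailable clean s(n)) can be sketched as follows. The 16 kHz rate follows Table 1; the function names are illustrative:

```python
import numpy as np

FS = 16000                        # sampling frequency (Table 1)
BOUNDARY = FS * 70 // 1000        # 70 ms boundary from Figure 3, in taps

def split_impulse_response(h):
    """Split a measured room impulse response h(n) into its early part hE
    and late part hL, zero-padded so both stay time-aligned with h."""
    h_early, h_late = np.zeros_like(h), np.zeros_like(h)
    h_early[:BOUNDARY] = h[:BOUNDARY]
    h_late[BOUNDARY:] = h[BOUNDARY:]
    return h_early, h_late

def estimate_late_reflection(x, h_late):
    """Crude estimate x̂L(n): convolve the *observed* reverberant signal
    x(n) with hL, since the clean s(n) is unavailable at run time."""
    return np.convolve(x, h_late)[:len(x)]
```

Because both parts are zero-padded rather than chopped, hE + hL reconstructs h exactly, and the late convolution keeps the correct time alignment.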
3.3. Correcting the Estimation Error Through Training of Multi-band Coefficients for SS

Although s(n) is not available in the actual scenario, we can have access to it in the training database. Thus, we optimize the values of the multi-band coefficients offline, in a form of training, to minimize the error between xL(n) and x̂L(n). Figure 4 shows this process. For each selected clean signal s(n) in the database, the actual late reflection xL(n) = hL * s(n) and the crude late reflection estimate x̂L(n) = hL * h * s(n) are computed using the late part of the impulse response and the clean speech in the database. Next, the power spectral densities (PSD) XL(f) and X̂L(f) of both signals are estimated using Welch's method. The window type, overlap and frame length are the same as those used in the multi-band SS. Figure 5 shows an example of the PSDs of both signals. For a given set of bands B = {B1, ..., BK}, the coefficients δ = {δ1, ..., δK} are determined by minimizing the squared error in each band k:

Ek = Σf∈Bk |XL(f) - δk X̂L(f)|².    (4)
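Since Equation (4) is a scalar least-squares problem in each band, every δk has the closed-form solution δk = Σf XL(f)X̂L(f) / Σf X̂L(f)². A sketch under the assumption that bands are given as FFT-bin ranges (in practice the two PSDs would come from Welch's method, e.g. scipy.signal.welch):

```python
import numpy as np

def fit_band_coefficients(psd_true, psd_est, bands):
    """Closed-form minimizer of Ek = sum_f |XL(f) - delta_k * X̂L(f)|^2
    (Equation (4)) in each band B_k, given as (lo, hi) bin ranges."""
    deltas = []
    for lo, hi in bands:
        t, e = psd_true[lo:hi], psd_est[lo:hi]
        deltas.append(float(np.dot(t, e) / np.dot(e, e)))
    return deltas
```

Each δk is just the projection of the true late-reflection PSD onto the crude estimate within that band, so a band where the estimate is systematically too small gets δk > 1 and vice versa.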
Fig. 3. Recognition performance as a function of the length of the impulse response used to generate the reverberant test data.
C (Center microphone)
L1/R1 (Left/Right microphone 1.0m from Center C)
Fig. 4. Obtaining the values of the multi-band coefficients offline using the clean utterances.
L2/R2 (Left/Right microphone 2.0m from Center C)
Fig. 6. Microphone-speaker set-up for acquiring the room impulse response using TSP.
Table 1. System specifications
Sampling frequency: 16 kHz
Frame length: 25 ms
Frame period: 10 ms
Pre-emphasis: 1 - 0.97 z^-1
Feature vectors: 12-order MFCC, 12-order ΔMFCC, 1-order ΔE
HMM: PTM, 2000 states
Training data: Adult and Senior by JNAS
Test data: Adult and Senior by JNAS
Fig. 5. Power spectral densities of the real late reverberant component XL(f) and the estimated late reverberant component X̂L(f).
Thus, in the actual online multi-band SS using the optimized δ, the target signal XE(f) in the frequency domain is given as

|XE(f)|^γ = |X(f)|^γ - δk |X̂L(f)|^γ,  if |X(f)|^γ - δk |X̂L(f)|^γ > β |X(f)|^γ
|XE(f)|^γ = β |X(f)|^γ,                otherwise    (5)
for f∈Bk, with β the flooring coefficient and γ the power exponent, as in conventional SS. We tried different numbers of bands and finally chose the one used by the recognizer in obtaining the mel scale (see Table 1 in Section 4). Moreover, the resulting δ coefficients from training, which are used in the actual multi-band SS, are {3.430, 1.913, 1.647, 0刊0, 0.664, 2.743, 2.655, 1.995, 1.699, 1.232, 1.794, 1.324}.

4. EXPERIMENTAL RESULTS

4.1. Experimental Conditions

We use the Time Stretched Pulse (TSP) method [6] to measure the actual room impulse response h(n) and to simulate reverberant utterances for both the training and test data, in the same manner as [1]. In this experiment we use a single-channel directional microphone. The room set-up is shown in Figure 6, with source/speaker distances of 0.5 m, 1.0 m, 1.5 m, and 2.0 m, and microphones located at positions L2, L1, C, R1, and R2. The reverberation time of the measured impulse response is around 400 ms. Reverberant signals are obtained using a 6000-tap filter.

JULIUS [7] is used as the recognizer, with a Phonetically Tied Mixture (PTM) [8] model and a 20K-word Japanese newspaper dictation task from JNAS [9] comprising a combined 561 speakers (male and female). The open test set constitutes 44 speakers (male and female) with a combined 200 utterances. A summary of the conditions used in recognition is given in Table 1.
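The online subtraction rule of Equation (5), together with one way the mel-spaced bands used by the recognizer might be generated, can be sketched as follows. Only the 16 kHz rate and the 12-band count come from the text; the 512-point FFT size and the exact mel formula are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bands(n_bands, fs=16000, n_fft=512):
    """Mel-spaced band edges mapped to contiguous (lo, hi) FFT-bin
    ranges, mimicking the mel spacing of the recognizer's filterbank."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 1))
    bins = np.floor(edges * n_fft / fs).astype(int)
    return [(int(bins[k]), int(bins[k + 1])) for k in range(n_bands)]

def multiband_ss(X_mag, XL_mag, deltas, bands, beta=0.01, gamma=2.0):
    """Multi-band spectral subtraction (Equation (5)): per band B_k,
    subtract delta_k * |X̂L(f)|^gamma from |X(f)|^gamma, flooring the
    result at beta * |X(f)|^gamma when the subtraction overshoots."""
    Xg, XLg = X_mag ** gamma, XL_mag ** gamma
    out = np.zeros_like(Xg)       # bins outside every band stay zero
    for (lo, hi), d in zip(bands, deltas):
        diff = Xg[lo:hi] - d * XLg[lo:hi]
        out[lo:hi] = np.maximum(diff, beta * Xg[lo:hi])
    return out                    # |XE(f)|^gamma per bin
```

With n_bands = 12 the bands tile the bins from DC up to the Nyquist bin, so one trained δk applies to every bin of its band, matching the 12 trained coefficients listed above.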
4.2. Recognition Performance

In Figure 7 we show the basic recognition results at each speaker-to-center-microphone distance: 0.5 m, 1.0 m, 1.5 m, and 2.0 m. At each of these distances we also consider the 5 microphone positions R2, R1, C, L1, and L2 (refer to Figure 6 for the room configuration). Figure 7 shows that the proposed method (A) outperforms the multi-LPC approach (B) in all cases. Moreover, the recognition performance improvement from using the proposed method is obvious compared to (C) and (D).

4.3. Robustness to Microphone Positions and Speaker Distances

A variation in speaker location would imply a variation of δ. The result shown in Figure 8 shows that the proposed method is independent of δ, and thus robust to variation in location. When using only one set of δ measured at the farthest microphone distance of 2.0 m (we refer to this as the robust δ), the recognition performance does not vary much as compared to using several
Figure 7 plots recognition accuracy against the microphone locations (L2, L1, C, R1, R2) for each speaker-to-center microphone distance of 0.5 m, 1.0 m, 1.5 m, and 2.0 m, under four test/model conditions:
A: Test: reverberated, processed by the proposed method. Model: reverberated, processed by the proposed method.
B: Test: reverberated, processed by the multi-LPC method. Model: reverberated, processed by the multi-LPC method.
C: Test: reverberated, no processing. Model: reverberated, no processing.
D: Test: reverberated, no processing. Model: not reverberated, no processing.
Fig. 7. Basic Recognition Performance.
Figure 8 plots recognition accuracy against the microphone locations under three conditions:
- Processed by the proposed method using a single value of δ (robust)
- Processed by the proposed method using several values of δ (matched)
- Reverberated without any processing
Fig. 8. Robustness of the Proposed Method.
matched δ. This points to the fact that xL(n) does not vary much as well.
5. CONCLUSION
Although the multi-LPC approach [1] is novel in the sense that it can adaptively estimate xL(n), real-time dereverberation for real-time speech recognition is not feasible with it. It is true that the proposed method requires a measurement of the room impulse response in advance, but this trade-off is negligible, since we are able to execute a fast, real-time dereverberation implementation, which is not achieved in [1]. Moreover, since xL(n) does not vary so much with distance (as shown in Figure 8), we only need to measure a single impulse response and calculate a single set of δ. Currently we are expanding this research to microphone arrays.
7. REFERENCES
[1] K. Kinoshita, T. Nakatani, and M. Miyoshi, "Spectral Subtraction Steered by Multi-step Forward Linear Prediction for Single Channel Speech Dereverberation," in Proceedings of ICASSP, 2006.
[2] K. Kinoshita, T. Nakatani, and M. Miyoshi, "Efficient Dereverberation Framework for Automatic Speech Recognition," in Proceedings of ICSLP, Vol. 1, pp. 92-95, 2005.
[3] S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. on ASSP, vol. 27(2), pp. 113-120, 1979.
[4] C. J. Leggetter and P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[5] S. Kamath and P. Loizou, "A Multi-Band Spectral Subtraction Method for Enhancing Speech Corrupted by Colored Noise," in Proceedings of ICASSP, 2002.
[6] Y. Suzuki, F. Asano, H.-Y. Kim, and T. Sone, "An optimum computer-generated pulse signal suitable for the measurement of very long impulse responses," J. Acoust. Soc. Am.