Multichannel Speech Dereverberation Based on Convolutive Nonnegative Tensor Factorization for ASR Applications

Seyedmahdad Mirsamadi and John H. L. Hansen*

Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, Richardson, TX 75080-3021, U.S.A.
{mirsamadi,john.hansen}@utdallas.edu

* This project was funded by AFRL under contract FA8750-12-10188 and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen.
Abstract

Room reverberation is a primary cause of failure in distant speech recognition (DSR) systems. In this study, we present a multichannel spectrum enhancement method for reverberant speech recognition, which extends a single-channel dereverberation algorithm based on convolutive nonnegative matrix factorization (NMF). The generalization to a multichannel scenario is shown to be a special case of convolutive nonnegative tensor factorization (NTF). The presented algorithm integrates information across the different channels in the magnitude short-time Fourier transform (STFT) domain. By doing so, it eliminates any limitations on the array geometry and any need for information about the source location, making the algorithm particularly suitable for distributed microphone arrays. Experiments are performed on speech data convolved with actual room impulse responses from the AIR database. Relative WER improvements using a clean-trained ASR system range from +7.1% to +30.1%, depending on the number of channels and the source-to-microphone distance (1 to 3 meters).

Index Terms: reverberation, automatic speech recognition, nonnegative matrix and tensor factorization
1. Introduction

Automatic speech recognition (ASR) systems have now reached a performance level which enables them to be used in many real-world applications. However, this is generally limited to the case where the speech signal is captured by a close-talking microphone. In distant speech recognition (DSR), where the microphone is far from the speaker's mouth, recognition accuracy degrades drastically as a result of room reverberation and environmental noise. Reverberation in particular, which occurs as multiple sound reflections are captured by the distant-talking microphone, is a major challenge that must be addressed in DSR. Although these reflections can be viewed as corruptive noise terms added to the desired speech signal, the reverberation problem is fundamentally different from the scenarios handled by conventional noise-robustness techniques, because the reflections act as noise that is nonstationary, colored, and also correlated with the desired speech signal. The reverberation problem can potentially be addressed in three different stages of the ASR system front-end, namely (i) the waveform domain, (ii) the magnitude short-time Fourier transform (STFT) domain, and (iii) the feature (cepstrum) domain [1]. A multichannel solution (i.e., the use of a microphone array) can
theoretically be exploited in any of these stages by integrating information from the different microphones into a single set of features. However, microphone array solutions have traditionally been applied mostly in the waveform domain, in the form of either fixed or adaptive beamformers. The directional sound capture provided by a beamformer offers a certain degree of signal enhancement, which in turn improves the extracted features. However, this approach is not optimal for reverberant speech recognition, mainly because in ASR we are interested in obtaining features that closely resemble those of the clean training set, not in recovering the actual speech waveform [2]. Other factors, such as the need to know the speaker's location (which is difficult to estimate in a reverberant environment) and certain limitations on the array geometry (for example, to avoid spatial aliasing at high frequencies), further limit the applicability of beamforming approaches in realistic DSR scenarios.

In this paper, we propose a multichannel speech dereverberation method based on nonnegative tensor factorization (NTF) which operates in the magnitude STFT domain. A multichannel approach that fuses the information available from different sensors in the magnitude time-frequency domain is particularly useful in the context of distributed microphone array systems, because it does not depend on phase information, which is difficult to preserve across the independent channels of a distributed system [3, 4]. Furthermore, no assumptions are necessary about the specific locations of the source or the individual sensors.

Nonnegative tensor factorization (NTF) [5] is a generalization of nonnegative matrix factorization (NMF) to tensors (multi-way arrays). NMF is a multivariate data analysis technique popularized by the simple algorithms of [6] (referred to as the Lee-Seung algorithms), which were extended to the convolutive case in [7]. A special formulation of convolutive NMF was used for single-channel speech dereverberation in [8], shown to provide significant improvements in ASR accuracy in [9], and further extended to the Gammatone subband domain in [10]. In this study, we extend the method of [8] to a multichannel framework. We show that while single-channel dereverberation is a special case of convolutive nonnegative matrix factorization, multichannel dereverberation can be considered a special case of convolutive nonnegative tensor factorization with a third-order tensor. The resulting algorithm is similar in nature to the multichannel version of the latent-variable decomposition approach developed in [11], but without any statistical priors assumed for the input. The proposed multichannel extension is shown to provide recognition improvements in highly reverberant conditions compared with the original single-channel approach.

The remainder of this paper is organized as follows. In
Sec. 2 we describe a model for room reverberation in the short time Fourier transform domain as the basis for spectrum enhancement approaches to dereverberation. In Sec. 3 we present the NTF-based multichannel dereverberation algorithm, and explain the connections with a standard NTF problem. We provide experimental results in Sec. 4 which illustrate consistent improvements in recognition accuracy. We discuss some interesting properties of the proposed algorithm in Sec. 5, and finally conclude the paper in Sec. 6.
2. Reverberation Model

A reverberant speech signal is modelled in the time domain by a convolution between the clean speech signal and the room impulse response (RIR) from the source location to the microphone. The length of the RIR is often much longer than the typical time segments used in speech recognition systems, which causes the spectral content of each frame to influence subsequent frames (referred to as the "spectral smearing" effect). This effect can be modelled in the magnitude STFT domain by a convolution of the form

$$X^{(i)}(m,k) = \sum_{p=0}^{L-1} H_k^{(i)}(p)\, S(m-p,k), \qquad (1)$$

where $m$, $k$, and $i \in \{1, \cdots, N\}$ are the frame, frequency, and channel indices, $N$ is the number of microphones, $S(m,k)$ and $X^{(i)}(m,k)$ are the magnitude STFTs of the clean speech and the $i$'th microphone signal, and $H_k^{(i)}(m)$ is the subband envelope of the RIR from the source location to the $i$'th microphone. The reverberant spectrogram model of Eq. (1) has been the basis for many recent studies on speech dereverberation [12, 13].
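As a concrete illustration of Eq. (1), the following sketch synthesizes multichannel reverberant magnitude spectrograms from a clean spectrogram by convolving each subband along the frame axis. This is an illustrative numpy rendering under assumed array shapes, not code from the paper.

```python
import numpy as np

# Toy dimensions: K freq bins, M frames, L filter taps, N channels.
K, M, L, N = 257, 200, 10, 4
rng = np.random.default_rng(0)

S = rng.random((M, K))      # clean magnitude STFT, S(m, k) >= 0
H = rng.random((N, K, L))   # subband RIR envelopes H_k^{(i)}(p) >= 0

# X^{(i)}(m, k) = sum_p H_k^{(i)}(p) * S(m - p, k): each past frame
# leaks into the current one ("spectral smearing").
X = np.zeros((N, M, K))
for i in range(N):
    for p in range(L):
        X[i, p:, :] += H[i, :, p] * S[:M - p, :]
```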
3. NTF-based dereverberation algorithm

3.1. Problem Formulation

Based on the model in Eq. (1), the multichannel dereverberation problem can be stated as finding the nonnegative factors $\hat{H}_k^{(i)}(m)$ for all channels, together with the single common nonnegative factor $\hat{S}(m,k)$, which jointly minimize an error criterion between the reverberant spectrograms $X^{(i)}(m,k)$ and their estimates given by $Z^{(i)}(m,k) \triangleq \hat{H}_k^{(i)}(m) * \hat{S}(m,k)$ (here, $*$ denotes convolution along the frame index $m$). We therefore define the following cost function using the Euclidean distance between the two spectra as the error criterion:

$$E = \sum_{k,m} \big\| \mathbf{X}(m,k) - \mathbf{Z}(m,k) \big\|_F^2, \qquad (2)$$

where

$$\mathbf{Z}(m,k) = \sum_{p=0}^{L-1} \hat{\mathbf{H}}_k(p)\, \hat{S}(m-p,k), \qquad (3)$$

$$\hat{\mathbf{H}}_k(m) = \big[\, \hat{H}_k^{(1)}(m), \cdots, \hat{H}_k^{(N)}(m) \,\big]^T, \qquad (4)$$

$$\mathbf{X}(m,k) = \big[\, X^{(1)}(m,k), \cdots, X^{(N)}(m,k) \,\big]^T. \qquad (5)$$

Note that the cost function of Eq. (2) is the sum of all the error terms associated with each individual channel. The minimization of Eq. (2) subject to the nonnegativity constraints

$$H_k^{(i)}(p) \geq 0 \quad \text{and} \quad S(m,k) \geq 0, \qquad \text{for all } i, p, m, k, \qquad (6)$$

is expected to yield an estimate of the clean speech spectrogram as well as the subband envelopes of the RIRs. To address the scaling indeterminacy inherent in the problem, we impose the following additional constraint:

$$\sum_{i=1}^{N} \sum_{p=0}^{L-1} H_k^{(i)}(p) = 1, \qquad k = 1, \cdots, K. \qquad (7)$$

It is important to use this normalization strategy instead of normalizing each filter $H_k^{(i)}(m)$ individually (as is done in [8] and [10]), because it allows the algorithm to adjust the gain of each filter according to the SNR of the corresponding microphone signal (see Sec. 5 for more details).

3.2. Relations with the standard NTF

As illustrated in Fig. 1, the set of magnitude spectrograms for the different channels can be considered as the frontal slices of a third-order tensor $\mathcal{X}$. In this view, the dereverberation problem described above is equivalent to the following convolutive NTF problem:

$$\mathbf{X}^{(i)} = \sum_{p=0}^{L-1} \mathbf{H}^{(i)}(p)\, \overset{p\rightarrow}{\mathbf{S}}, \qquad i = 1, \cdots, N, \qquad (8)$$

where $\mathbf{X}^{(i)}$ is the magnitude spectrogram matrix of the $i$'th channel, which forms the $i$'th frontal slice of the tensor $\mathcal{X}$. The base matrices $\mathbf{H}^{(i)}(p)$ are all diagonal matrices of the form

$$\mathbf{H}^{(i)}(p) = \begin{bmatrix} H_1^{(i)}(p) & & 0 \\ & \ddots & \\ 0 & & H_K^{(i)}(p) \end{bmatrix}, \qquad (9)$$

which constitute the $i$'th frontal slices of the tensor $\mathcal{H}(p)$, as shown in Fig. 1. The matrix $\mathbf{S}$ in Eq. (8) is the magnitude spectrogram matrix of the clean speech signal, and the operator $\overset{p\rightarrow}{(\cdot)}$ shifts the rows of its argument matrix by $p$ positions to the right, filling in zeros from the left. Note that in the single-channel case (i.e., $N = 1$) the NTF problem in Eq. (8) simplifies to an NMF problem, similar to [8]. Because of the diagonal form of the base matrices in Eq. (8), the optimization of the components can be carried out independently in each subband, as detailed in the next section.

[Figure 1: Nonnegative tensor factorization model for multichannel dereverberation.]
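To make the shift-operator notation of Eq. (8) concrete, here is a small numerical check (a hedged numpy sketch, not from the paper; the names shift and X_slice are illustrative) that one frontal slice built from the diagonal base matrices matches the per-subband convolution of Eq. (1):

```python
import numpy as np

def shift(S, p):
    """Shift the rows of S by p positions to the right, zero-filling from the left."""
    out = np.zeros_like(S)
    out[:, p:] = S[:, :S.shape[1] - p]
    return out

K, M, L = 4, 6, 3
rng = np.random.default_rng(1)
S = rng.random((K, M))   # clean spectrogram: K subbands x M frames
H = rng.random((K, L))   # diagonal entries H_k^{(i)}(p) for one channel i

# Frontal slice per Eq. (8): X^{(i)} = sum_p diag(H(:, p)) (p-> S)
X_slice = sum(np.diag(H[:, p]) @ shift(S, p) for p in range(L))

# The same slice, subband by subband, as the 1-D convolution of Eq. (1)
X_conv = np.stack([np.convolve(S[k], H[k])[:M] for k in range(K)])
assert np.allclose(X_slice, X_conv)
```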
3.3. Multiplicative update rules

A standard gradient descent minimization of Eq. (2) does not necessarily preserve the nonnegativity of the components. However, it has been shown in [6] that by using a variable step size in each iteration, simple multiplicative update formulas can be derived which ensure the nonnegativity of the results. The gradient of Eq. (2) with respect to the filter coefficients is

$$\frac{\partial E}{\partial \mathbf{H}_k(p)} = 2 \sum_m \big[ \mathbf{Z}(m,k) - \mathbf{X}(m,k) \big]\, S(m-p,k). \qquad (10)$$

We choose the following vector of step sizes for the gradient descent optimization of the filter coefficients:

$$\boldsymbol{\eta}_H = \mathbf{H}_k(p) \oslash 2 \sum_m \mathbf{Z}(m,k)\, S(m-p,k), \qquad (11)$$

where $\oslash$ represents element-wise division. Each element of $\boldsymbol{\eta}_H$ is the step size used for the corresponding channel. Using Eqs. (10) and (11), the update rule for $\hat{\mathbf{H}}_k(p)$ is

$$\hat{\mathbf{H}}_k(p) \leftarrow \hat{\mathbf{H}}_k(p) \odot \sum_m \mathbf{X}(m,k)\, \hat{S}(m-p,k) \oslash \sum_m \mathbf{Z}(m,k)\, \hat{S}(m-p,k), \qquad (12)$$

where $\odot$ represents element-wise multiplication. Similarly, using the derivative with respect to $S(l,k)$,

$$\frac{\partial E}{\partial S(l,k)} = 2 \sum_m \big[ \mathbf{Z}(m,k) - \mathbf{X}(m,k) \big]^T \mathbf{H}_k(m-l), \qquad (13)$$

and choosing the following step size parameter,

$$\eta_S = \frac{\hat{S}(l,k)}{2 \sum_m \mathbf{Z}^T(m,k)\, \hat{\mathbf{H}}_k(m-l)}, \qquad (14)$$

we obtain the multiplicative update formula for the clean STFT estimates:

$$\hat{S}(l,k) \leftarrow \hat{S}(l,k)\, \frac{\sum_m \mathbf{X}^T(m,k)\, \hat{\mathbf{H}}_k(m-l)}{\sum_m \mathbf{Z}^T(m,k)\, \hat{\mathbf{H}}_k(m-l)}. \qquad (15)$$

At the end of each iteration, all filter coefficients are normalized according to Eq. (7). Note that, similar to the single-channel case in [8], both final update rules, Eqs. (12) and (15), contain cross-correlation terms in their numerators and denominators. These correlations can be computed via FFT multiplication in the modulation frequency domain in order to reduce the computational complexity of the algorithm.

[Figure 2: Overall front-end processing using NTF-based dereverberation.]
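For concreteness, the following is a compact numpy sketch of these per-subband multiplicative updates, Eqs. (12) and (15), with the normalization of Eq. (7) applied after each iteration. It reuses the (N, M, K) array layout of the earlier sketches; the function name ntf_dereverb, the epsilon guard, and the mean-based initialization of S are our illustrative choices, not prescribed by the paper (the filter initialization follows Sec. 4).

```python
import numpy as np

def ntf_dereverb(X, L=10, n_iter=10, eps=1e-12):
    """X: (N, M, K) array of magnitude spectrograms (channels x frames x freq)."""
    N, M, K = X.shape
    S = X.mean(axis=0).copy()                               # initial clean estimate (assumption)
    H = np.tile(1.0 - np.arange(L) / (2.0 * L), (N, K, 1))  # tap init from Sec. 4

    def reconstruct(H, S):
        """Z(i, m, k) = sum_p H[i, k, p] * S(m - p, k), as in Eq. (3)."""
        Z = np.zeros((N, M, K))
        for p in range(L):
            Z[:, p:, :] += H[:, :, p][:, None, :] * S[None, :M - p, :]
        return Z

    for _ in range(n_iter):
        Z = reconstruct(H, S) + eps
        for p in range(L):                                  # filter update, Eq. (12)
            num = np.einsum('imk,mk->ik', X[:, p:, :], S[:M - p, :])
            den = np.einsum('imk,mk->ik', Z[:, p:, :], S[:M - p, :]) + eps
            H[:, :, p] *= num / den
        H /= H.sum(axis=(0, 2), keepdims=True) + eps        # normalization, Eq. (7)
        Z = reconstruct(H, S) + eps
        num = np.zeros((M, K))                              # clean-spectrum update, Eq. (15)
        den = np.zeros((M, K))
        for p in range(L):
            num[:M - p, :] += np.einsum('imk,ik->mk', X[:, p:, :], H[:, :, p])
            den[:M - p, :] += np.einsum('imk,ik->mk', Z[:, p:, :], H[:, :, p])
        S *= num / (den + eps)
    return S, H
```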
4. Experiments

We conduct speech recognition experiments to assess the performance of the proposed multichannel dereverberation algorithm for ASR applications. The CMU Sphinx3 system was used for the recognition experiments. We used 13-dimensional Mel-frequency cepstral coefficients (MFCCs) along with their delta and double-delta extensions as speech features. Acoustic models (3-state HMMs with 8 Gaussians per state) were trained on clean utterances from the TIMIT database. Cepstral mean normalization (CMN) and a trigram language model were used in all experiments.
The overall front-end processing scheme for the experiments is depicted in Fig. 2. Magnitude STFTs are first computed for all input channels using a 64 ms Hamming window with a window shift of 16 ms. These magnitude spectra are then jointly processed by 10 iterations of the NTF update algorithm described in Sec. 3.3 to obtain an estimate of the clean speech spectrum. A filter length of L = 10 is used for the subband filters $H_k^{(i)}(m)$, which are all initialized as $H_k^{(i)}(m) = 1 - m/2L$, $(m = 0, \ldots, L-1)$. The phase values from one of the input channels are then combined with the dereverberated magnitude spectrum to reconstruct the time-domain signal, which is finally passed to a conventional MFCC extraction unit with a frame size of 25 ms and a skip rate of 10 ms.

The test data was created by convolving TIMIT test utterances with RIRs from the AIR database [14]. We used a subset of the RIRs collected at different locations in a stairway area with a reverberation time (T60) of approximately 0.8 s, which makes this a challenging scenario for ASR. We used pairs of binaural RIRs at distances of d = 1, 2, 3 m from the source, with an azimuth difference of 30 degrees between the pairs at each distance (see Fig. 3).

Fig. 4 shows the resulting word error rates (WERs) from ASR experiments performed for different source-to-microphone distances within the reverberant environment. Dual-channel NTF provides relative improvements of +7.1%, +13.7%, and +13.6% over the single-channel NMF algorithm for distances of d = 1 m, 2 m, and 3 m, respectively. The relative improvements for the 4-channel case are +18.7%, +26.6%, and +30.1%. In general, the improvement grows with the distance between the source and the microphones. Fig. 5 compares the dereverberated spectrograms produced by the single-channel NMF algorithm and the 4-channel NTF algorithm for a source-to-microphone distance of 2 m; the multichannel algorithm removes the spectral smearing effect more effectively.
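A hedged sketch of this front-end (Fig. 2) using scipy's STFT helpers is given below; it reuses the ntf_dereverb sketch from Sec. 3.3, and the function name dereverb_frontend and the choice of the first channel's phase for reconstruction are our illustrative assumptions (the paper only specifies "one of the input channels").

```python
import numpy as np
from scipy.signal import stft, istft

def dereverb_frontend(xs, fs=16000):
    """xs: list of N time-aligned channel waveforms -> dereverberated waveform."""
    nper, hop = int(0.064 * fs), int(0.016 * fs)   # 64 ms window, 16 ms shift
    specs = [stft(x, fs, window='hamming', nperseg=nper,
                  noverlap=nper - hop)[2] for x in xs]
    X = np.stack([np.abs(Z).T for Z in specs])     # (N, frames, freq)
    S_hat, _ = ntf_dereverb(X, L=10, n_iter=10)
    # Combine the dereverberated magnitude with the phase of the first channel.
    Y = S_hat.T * np.exp(1j * np.angle(specs[0]))
    _, y = istft(Y, fs, window='hamming', nperseg=nper, noverlap=nper - hop)
    return y   # then feed a standard 25 ms / 10 ms MFCC front-end
```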
[Figure 3: Source and microphone locations in the ASR experiments (microphone positions 1-12 at distances of 1, 2, and 3 m from the source, 30° apart in azimuth).]

[Figure 4: Word error rates from the reverberant speech recognition experiments. The plotted WERs (%) are:

                              d = 1 m   d = 2 m   d = 3 m
  MFCC (no dereverberation)    36.3      68.8      85.1
  NMF_1ch                      19.8      37.9      56.7
  NTF_2ch                      18.4      32.7      49.0
  NTF_4ch                      16.1      27.8      39.6  ]

[Figure 5: Spectrogram comparison between the dereverberation provided by the single-channel and the multichannel algorithms.]
5. Discussion

The NTF-based multichannel dereverberation algorithm presented in this study has properties that make it attractive for distributed array systems, which are emerging as effective solutions for distant speech recognition in applications such as smart home or office environments. In a distributed array, the microphones are generally at unknown, random locations in the room, and different channels have different gains and signal-to-noise ratios (SNRs). Another challenge is a possible lack of synchrony among the channels, resulting from the independent processing units of the recording devices [4]. As mentioned in Sec. 1, the NTF-based dereverberation method is independent of the signal phases and therefore circumvents the synchronization problem.

Another interesting property of the proposed algorithm is that the update rules automatically adjust the filter taps of each channel (i.e., $H_k^{(i)}(m)$) according to the corresponding SNR. To illustrate this, we performed an experiment in a 4-channel scenario in which three of the microphones are at a distance of 1 m from the source (microphones 1, 2, and 3 in Fig. 3), while one microphone is located 3 m away from the source (microphone 11 in Fig. 3) and thus has a much lower signal-to-reverberation ratio (SRR). Fig. 6 shows the normalized subband filters after 10 iterations of the algorithm, for an example frequency bin k = 10. The weights of the low-SRR channel are driven to small values during adaptation compared with those of the other channels, minimizing the effect of this channel on the final clean STFT estimates.
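This behaviour is easy to probe with the sketch from Sec. 3.3: after running the updates, the total gain $\sum_p H_k^{(i)}(p)$ that each channel receives in a given subband indicates how much it contributes to the clean estimate. A purely illustrative check, reusing X and ntf_dereverb from the earlier sketches:

```python
# Per-channel gain in subband k = 10; with one distant (low-SRR) microphone,
# its total should come out noticeably smaller than the others'.
S_hat, H = ntf_dereverb(X, L=10, n_iter=10)
per_channel_gain = H[:, 10, :].sum(axis=-1)   # shape (N,), one value per channel
print(per_channel_gain)
```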
[Figure 6: Subband filters $H_k^{(i)}(m)$ ($i = 1, \ldots, 4$; tap index $m = 0, \ldots, 9$) after convergence in a 4-channel scenario with one low-SRR channel, for frequency bin k = 10.]

6. Conclusions

In this study, we presented a multichannel dereverberation algorithm for ASR applications. The proposed algorithm generalizes a single-channel NMF-based method to a general, not necessarily uniformly spaced, multichannel framework by using nonnegative tensor factorization. The algorithm was experimentally shown to provide relative WER improvements of up to +30% in highly reverberant conditions, a gain achieved by reformulating the single-channel NMF approach as a multichannel solution. The algorithm was also shown to be robust against varying signal quality among the different channels (arising from the different spatial locations of the microphones).
7. References

[1] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, "Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 114-126, Nov. 2012.

[2] M. L. Seltzer, "Microphone array processing for robust speech recognition," Ph.D. dissertation, Carnegie Mellon University, 2001.

[3] Z. Liu, "Sound source separation with distributed microphone arrays in the presence of clock synchronization errors," in International Workshop on Acoustic Echo and Noise Control (IWAENC 2008), Sep. 2008.

[4] N. Ono, H. Kohno, N. Ito, and S. Sagayama, "Blind alignment of asynchronously recorded signals for distributed microphone array," in Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA '09. IEEE Workshop on, Oct. 2009, pp. 161-164.

[5] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.

[6] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2000, pp. 556-562.

[7] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Independent Component Analysis and Blind Signal Separation, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2004, vol. 3195, pp. 494-499.
[8] H. Kameoka, T. Nakatani, and T. Yoshioka, "Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, April 2009, pp. 45-48.

[9] K. Kumar, "A spectro-temporal framework for compensation of reverberation for speech recognition," Ph.D. dissertation, Carnegie Mellon University, 2011.

[10] K. Kumar, R. Singh, B. Raj, and R. Stern, "Gammatone sub-band magnitude-domain dereverberation for ASR," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, May 2011, pp. 4604-4607.

[11] R. Singh, B. Raj, and P. Smaragdis, "Latent-variable decomposition based dereverberation of monaural and multi-channel signals," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, March 2010, pp. 1914-1917.

[12] R. Maas, E. A. P. Habets, A. Sehr, and W. Kellermann, "On the application of reverberation suppression to robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012, pp. 297-300.

[13] J. Erkelens and R. Heusdens, "Correlation-based and model-based blind single-channel late-reverberation suppression in noisy time-varying acoustical environments," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 18, no. 7, pp. 1746-1765, Sept. 2010.

[14] M. Jeub, M. Schafer, and P. Vary, "A binaural room impulse response database for the evaluation of dereverberation algorithms," in Digital Signal Processing, 2009 16th International Conference on, July 2009.