IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 3, MARCH 2014
Spectro-Temporal Filtering for Multichannel Speech Enhancement in Short-Time Fourier Transform Domain

Yu Gwang Jin, Student Member, IEEE, Jong Won Shin, Member, IEEE, and Nam Soo Kim, Senior Member, IEEE
Abstract—In this letter, we propose a spectro-temporal filtering algorithm for multichannel speech enhancement in the short-time Fourier transform (STFT) domain. Compared with the traditional multiplicative filtering technique, the proposed method takes account of interdependencies between components in adjacent frames and frequency bins. For spectro-temporal filtering, speech and noise power spectral density (PSD) matrices are estimated based on an extended formulation utilizing temporal and spectral correlations, and the parametric noise reduction filter based on these PSD matrices is applied to the input microphone array signal. Moreover, multichannel speech presence probabilities are also estimated within a unified framework. A number of experimental results show that the proposed spectro-temporal filtering method improves the performance of multichannel speech enhancement.

Index Terms—Microphone array, multichannel speech enhancement, multichannel speech presence probability, parameterized non-causal multichannel Wiener filter, spectro-temporal filtering.
I. INTRODUCTION

Nowadays, the adoption of multiple microphones to exploit spatial diversity has become indispensable for reliable speech communication in adverse environments. A number of multichannel speech enhancement approaches have been proposed [1]–[5]. In many traditional multichannel enhancement schemes, the channel transfer function (TF) relating the target speech source and the microphone input should be known in advance or at least estimated from the received data [3]. Under the assumption that the exact channel TFs are known, effective noise reduction can be achieved without much speech distortion. However, it is generally difficult to estimate the unknown
Manuscript received October 31, 2013; revised December 25, 2013; accepted January 14, 2014. Date of current version February 04, 2014. This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (Grant No. 2012R1A2A2A01045874) and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2013-H0301-13-4005) supervised by the NIPA (National IT Industry Promotion Agency). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mathew Magimai Doss.
Y. G. Jin and N. S. Kim are with the Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul, Korea (e-mail: [email protected]; [email protected]).
J. W. Shin is with the School of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, Korea (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LSP.2014.2302897
channel TFs in real environments, and inaccurate estimates may lead to a dramatic degradation in speech enhancement performance. Recently, an optimal filtering technique which unifies the parameterized multichannel non-causal Wiener filter (PMWF), the minimum variance distortionless response (MVDR) filter and the generalized sidelobe canceller (GSC) was proposed [4]; it depends only on the noise and noisy data power spectral density (PSD) matrices. It is worth noting that this technique makes the multichannel speech enhancement framework look similar to that of the well-established single channel approaches. In general, system identification algorithms are often applied in the time-frequency domain for the sake of achieving both computational efficiency and an improved convergence rate. In [6], identification of linear time-invariant (LTI) systems is formulated in the short-time Fourier transform (STFT) domain, and the result implies that cross-band filters are generally required to perfectly represent an LTI system. In order to reduce the computation incurred by the cross-band filters, the multiplicative TF (MTF) approach is widely used for modeling an LTI system in the STFT domain, in which the time-domain convolution is approximated by multiplying each STFT coefficient separately by a constant. However, since practical implementations employ finite-length analysis windows, the MTF approximation is not considered adequate for high quality speech enhancement. For more successful speech enhancement in adverse acoustic environments, it is desirable to employ a more realistic model which also considers the spectral and temporal correlations. In this letter, we propose a spectro-temporal filtering algorithm for multichannel speech enhancement in the STFT domain utilizing the temporal and spectral correlations.
In contrast to the conventional multiplicative filtering methods, the proposed approach constructs spectro-temporal filters which relate a clean speech signal not only to the noisy observations in the same frame and frequency bin, but also to components in adjacent frames and frequency bins. In order to construct an optimal spectro-temporal filter, we first estimate the extended PSD matrices of the speech and noise signals encompassing cross-correlations between adjacent components, and the extended PMWF designed from these PSD matrices is applied to the input microphone array signal for noise reduction and interference rejection. In addition, multichannel speech presence probabilities are also estimated within a unified framework. From a number of experiments, we can see that the proposed spectro-temporal filtering approach improves the performance of multichannel speech enhancement.
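To make the MTF limitation concrete, the following toy numpy sketch (our own construction, not from this letter; the window, hop and impulse-response lengths are arbitrary choices) convolves a signal with a short impulse response and measures how far the true output STFT is from the bin-wise multiplicative model:

```python
import numpy as np
from numpy.fft import rfft

# Frame-wise "MTF" check: convolve a signal with a short impulse response,
# then compare the true output STFT to H(k) * X(k, l) per frame.
rng = np.random.default_rng(1)
n_fft, hop = 256, 128
x = rng.standard_normal(4 * n_fft)
h = rng.standard_normal(32) * np.hanning(32)   # short room-like response
y = np.convolve(x, h)[: len(x)]

win = np.hanning(n_fft)

def stft(sig):
    # simple half-overlapped analysis STFT
    starts = range(0, len(sig) - n_fft + 1, hop)
    return np.array([rfft(win * sig[i:i + n_fft]) for i in starts])

X, Y = stft(x), stft(y)
H = rfft(h, n_fft)        # MTF: one multiplicative gain per frequency bin
Y_mtf = X * H             # multiplicative-transfer-function model of Y
err = np.linalg.norm(Y - Y_mtf) / np.linalg.norm(Y)
print(f"relative MTF modeling error: {err:.3f}")   # nonzero for finite windows
```

The residual error stems from the finite analysis window; cross-band and cross-frame terms [6], [7] are needed to capture it, which is precisely what motivates the spectro-temporal filters proposed in this letter.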
1070-9908 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
JIN et al.: SPECTRO-TEMPORAL FILTERING FOR MULTICHANNEL SPEECH ENHANCEMENT
II. PROBLEM STATEMENT

Let $Y_m(k,l)$, $X_m(k,l)$ and $V_m(k,l)$ denote the STFT coefficients of the noisy speech, clean speech and noise signal, respectively, for the $k$-th frequency bin at frame $l$ observed from the $m$-th microphone. When we use an array of $N$ microphones with an arbitrary geometry, the STFT component $Y_m(k,l)$ is given by

$$Y_m(k,l) = X_m(k,l) + V_m(k,l). \qquad (1)$$

In conventional approaches to multichannel speech enhancement, an estimate of the clean speech in the $m$-th channel is obtained by multiplying the input microphone array signal by a proper noise reduction gain as follows:

$$\hat{X}_m(k,l) = \mathbf{h}_m^H(k,l)\,\mathbf{y}(k,l) \qquad (2)$$

where $\mathbf{y}(k,l) = [Y_1(k,l), \ldots, Y_N(k,l)]^T$, $\mathbf{x}(k,l) = [X_1(k,l), \ldots, X_N(k,l)]^T$, $\mathbf{v}(k,l) = [V_1(k,l), \ldots, V_N(k,l)]^T$, and $\mathbf{h}_m(k,l)$ is an $N$-dimensional filter gain vector. The superscripts $H$, $T$ and $*$ denote the transpose-conjugate, transpose and conjugate operators, respectively, and the notation $\hat{a}$ denotes an estimate of a random quantity $a$. In (2), the filter gains $\mathbf{h}_m(k,l)$ can be estimated under a variety of criteria, and the relevant estimates require some knowledge on speech and noise statistics [4].

III. SPECTRO-TEMPORAL FILTERING FOR MULTICHANNEL SPEECH ENHANCEMENT

During the past few decades, a number of methods have been developed to address the problem of finding a suitable noise reduction filter based on the MTF assumption. Those multichannel speech enhancement techniques have demonstrated better performance than the conventional single channel approaches since they utilize additional information about the spatial properties of the speech and noise components. According to the analysis of system identification in the STFT domain [6], however, cross-band filters are generally required for a perfect representation of an LTI system, and a finite-length analysis window makes the MTF approximation less accurate than the cross-multiplicative TF approximation [7]. The multiplicative filtering method considers only the spatial correlation while ignoring the spectral and temporal correlations. Hence it is possible to further improve the performance of the MTF-based speech enhancement techniques by employing spectro-temporal filters. In this letter, in order to take advantage of the spectral and temporal correlations of the input signals, we propose a spectro-temporal filtering approach given as follows:

$$\hat{X}_m(k,l) = \tilde{\mathbf{h}}_m^H(k,l)\,\tilde{\mathbf{y}}(k,l) \qquad (3)$$

where $\tilde{\mathbf{y}}(k,l)$ is obtained by stacking the components of $\mathbf{y}(k',l')$ for $k' = k-L, \ldots, k+L$ and $l' = l-M, \ldots, l$ in a specific order, and $\tilde{\mathbf{h}}_m(k,l)$ is the corresponding extended filter gain vector. Here, the dimension of the noise reduction filter gains is extended from $N$ to $D = N(2L+1)(M+1)$ when we take adjacent frequency bins from $k-L$ to $k+L$ and adjacent frames from $l-M$ to $l$ into account. It is found that (3) is a generalized form of (2), which is a special case with $L = 0$ and $M = 0$. Since $\tilde{\mathbf{y}}(k,l)$ is obtained by stacking the components of $\mathbf{y}$ in a specific order, we can similarly extend both the speech components and the noise components to $\tilde{\mathbf{x}}(k,l)$ and $\tilde{\mathbf{v}}(k,l)$, respectively, in the same order so that their dimension also becomes $D$. Then, (1) and (3) can be rewritten as vector equations,

$$\tilde{\mathbf{y}}(k,l) = \tilde{\mathbf{x}}(k,l) + \tilde{\mathbf{v}}(k,l), \qquad (4)$$

$$\hat{\mathbf{x}}(k,l) = \tilde{\mathbf{H}}^H(k,l)\,\tilde{\mathbf{y}}(k,l). \qquad (5)$$

The definition of the PSD matrices introduced in [4] is also generalized similarly in the following way:

$$\tilde{\boldsymbol{\Phi}}_{\mathbf{yy}}(k,l) = E\big[\tilde{\mathbf{y}}(k,l)\tilde{\mathbf{y}}^H(k,l)\big], \quad \tilde{\boldsymbol{\Phi}}_{\mathbf{xx}}(k,l) = E\big[\tilde{\mathbf{x}}(k,l)\tilde{\mathbf{x}}^H(k,l)\big], \quad \tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l) = E\big[\tilde{\mathbf{v}}(k,l)\tilde{\mathbf{v}}^H(k,l)\big] \qquad (6)$$

where the dimension of each PSD matrix becomes $D \times D$. These PSD matrices preserve all the information of the conventional PSD matrices, such as $\boldsymbol{\Phi}_{\mathbf{yy}}(k,l)$ or $\boldsymbol{\Phi}_{\mathbf{vv}}(k,l)$, and contain additional information concerned with spectral and temporal correlations.

A. Extended Parameterized Multichannel Non-causal Wiener Filter

In [4], a gain function for the PMWF, which includes the GSC and MVDR beamformer as special cases, has been proposed as a function of the PSD matrices:

$$\mathbf{h}_m(k,l) = \frac{\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(k,l)\,\boldsymbol{\Phi}_{\mathbf{xx}}(k,l)}{\beta + \mathrm{tr}\big\{\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(k,l)\,\boldsymbol{\Phi}_{\mathbf{xx}}(k,l)\big\}}\,\mathbf{u}_m \qquad (7)$$

where $\mathbf{u}_m$ is an $N$-dimensional elementary vector whose $m$-th element is one, and the real $\beta \ge 0$ is a factor that controls the trade-off between the noise reduction and speech distortion. In this work, we extend the PMWF in (7) to accommodate the proposed spectro-temporal filters and the generalized PSD matrices. The optimal filter gain matrix $\tilde{\mathbf{H}}(k,l)$ for the extended PMWF is given by

$$\tilde{\mathbf{H}}(k,l) = \frac{\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}^{-1}(k,l)\,\tilde{\boldsymbol{\Phi}}_{\mathbf{xx}}(k,l)}{\beta + \mathrm{tr}\big\{\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}^{-1}(k,l)\,\tilde{\boldsymbol{\Phi}}_{\mathbf{xx}}(k,l)\big\}}\,\mathbf{U} \qquad (8)$$

where the $D \times N$ matrix $\mathbf{U}$ consists of $N$ columns of elementary vectors, each of which has only one non-zero value at the position corresponding to the current frame and frequency bin for a specific channel. In other words, $\mathbf{U}$ in (8) plays the role of picking up the desired spectro-temporal components. Because $\tilde{\mathbf{H}}(k,l)$ depends only on $\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l)$ and $\tilde{\boldsymbol{\Phi}}_{\mathbf{xx}}(k,l)$, an explicit estimation of the channel TFs in real environments is not required. Once $\tilde{\mathbf{H}}(k,l)$ is obtained, the clean speech component can be estimated through (5).
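As an illustration of (8), the following hypothetical numpy sketch builds the extended PMWF gain from given extended PSD matrices and applies it as in (5). The letter gives no implementation; the function names, the random toy PSD matrices, and the assumed stacking order behind the selection matrix `U` are our own choices, and the stacked dimension is taken as $D = N(2L+1)(M+1)$ per the extension described above.

```python
import numpy as np

def filter_dim(n_mics, L, M):
    # Stacked dimension D = N(2L+1)(M+1): N microphones, bins k-L..k+L,
    # frames l-M..l. (L, M) = (0, 0) recovers the conventional
    # multiplicative filter of dimension N.
    return n_mics * (2 * L + 1) * (M + 1)

def extended_pmwf(phi_vv, phi_xx, U, beta=1.0):
    """Extended PMWF gain of Eq. (8) from extended PSD matrices (sketch)."""
    a = np.linalg.solve(phi_vv, phi_xx)           # Phi_vv^{-1} Phi_xx
    return a @ U / (beta + np.trace(a).real)      # D x N gain matrix

# Toy example: N = 2 microphones, L = 0 adjacent bins, M = 1 past frame.
rng = np.random.default_rng(0)
N, D = 2, filter_dim(2, 0, 1)                     # D = 4
A = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
B = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
phi_vv = A @ A.conj().T + np.eye(D)               # positive-definite noise PSD
phi_xx = B @ B.conj().T                           # speech PSD (Hermitian)
U = np.zeros((D, N))                              # elementary-vector columns
U[0, 0] = U[1, 1] = 1.0                           # pick current-bin/frame entries
H = extended_pmwf(phi_vv, phi_xx, U, beta=1.0)
y_tilde = rng.standard_normal(D) + 1j * rng.standard_normal(D)
x_hat = H.conj().T @ y_tilde                      # Eq. (5): all-channel estimate
```

Note that in this reading, increasing $L$ grows $D$ by a factor of $(2L+1)$ while increasing $M$ grows it by $(M+1)$, which is consistent with the computational remark in Section IV.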
Compared with the conventional multichannel speech enhancement approaches, the filter expressed in (8) estimates the clean speech spectra of all channels at once, not one specific channel. If we want the enhanced output for a specific channel, it suffices to use only the corresponding column of $\tilde{\mathbf{H}}(k,l)$. Though estimating the clean speech spectra for all the channels simultaneously may be considered a computational overhead, it is quite meaningful if the processed output in the current frame is utilized to estimate the PSDs in the following frames as in the single channel decision-directed technique [8], which will be one of our future studies. For this reason, in this letter, we attempt to enhance the signal components for all the channels and evaluate the performance averaged over all the channels.

B. Speech and Noise Statistics Estimation

For a good performance of multichannel speech enhancement, it is important to robustly estimate the clean speech and noise PSD matrices, $\tilde{\boldsymbol{\Phi}}_{\mathbf{xx}}(k,l)$ and $\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l)$, from the input signal. Motivated by the techniques employed in most of the conventional single channel speech enhancement algorithms, we recursively update the estimates for $\tilde{\boldsymbol{\Phi}}_{\mathbf{yy}}(k,l)$ and $\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l)$ in the following way:

$$\hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{yy}}(k,l) = \alpha_y\,\hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{yy}}(k,l-1) + (1-\alpha_y)\,\tilde{\mathbf{y}}(k,l)\tilde{\mathbf{y}}^H(k,l), \qquad (9)$$

$$\hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{vv}}(k,l) = \tilde{\alpha}_v(k,l)\,\hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{vv}}(k,l-1) + \big(1-\tilde{\alpha}_v(k,l)\big)\,\tilde{\mathbf{y}}(k,l)\tilde{\mathbf{y}}^H(k,l) \qquad (10)$$

where $\alpha_y$ and $\tilde{\alpha}_v(k,l)$ are two forgetting factors. Usually the factor $\alpha_y$ is fixed to a constant value, and the time-varying frequency-dependent smoothing factor $\tilde{\alpha}_v(k,l)$ is updated as

$$\tilde{\alpha}_v(k,l) = \alpha_v + (1-\alpha_v)\,p(k,l) \qquad (11)$$

where $0 < \alpha_v < 1$, and $p(k,l)$ denotes the speech presence probability (SPP) for each frequency bin and frame, which will be addressed in the next subsection. The minima controlled recursive averaging (MCRA) algorithm [9] has been known as one of the most popular techniques to estimate the noise power spectrum in adverse conditions in a single channel scenario, and it has recently been generalized to the multichannel case [5]. In this work, we apply the multichannel MCRA approach in order to track the noise PSD matrix in (10) and consequently the SPP in (14). In our implementation, since the noise and speech components are assumed to be uncorrelated, the PSD matrix of the clean speech is simply obtained as

$$\hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{xx}}(k,l) = \hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{yy}}(k,l) - \hat{\tilde{\boldsymbol{\Phi}}}_{\mathbf{vv}}(k,l) \qquad (12)$$

which is seen as a smoothed version of maximum-likelihood estimation [4], [5], [10].

C. Speech Presence Probability Estimation

To track the noise PSD matrix $\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l)$ reliably, the SPP needs to be estimated. In [10], an approach to estimate the SPP in the single-channel case was generalized to the multichannel case under the assumption that the speech and noise components are multivariate Gaussian and their real and imaginary parts are uncorrelated and identically distributed. We can extend this derivation to the spectro-temporal filtering case without difficulty. The likelihood ratio in the $k$-th frequency bin, $\Lambda(k,l)$, is given by

$$\Lambda(k,l) = \frac{\big|\tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}(k,l)\big|}{\big|\tilde{\boldsymbol{\Phi}}_{\mathbf{yy}}(k,l)\big|} \exp\Big\{ \tilde{\mathbf{y}}^H(k,l)\big( \tilde{\boldsymbol{\Phi}}_{\mathbf{vv}}^{-1}(k,l) - \tilde{\boldsymbol{\Phi}}_{\mathbf{yy}}^{-1}(k,l) \big) \tilde{\mathbf{y}}(k,l) \Big\} \qquad (13)$$

where $|\cdot|$ denotes the determinant of a square matrix. Based on (13), the SPP is computed as follows:

$$p(k,l) = \left[ 1 + \frac{q(k,l)}{1-q(k,l)} \frac{1}{\Lambda(k,l)} \right]^{-1} \qquad (14)$$

where $q(k,l)$ is the a priori speech absence probability (SAP), which is recursively updated by the multichannel MCRA algorithm [5].

IV. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed spectro-temporal filtering technique for multichannel speech enhancement, we carried out a number of objective quality measurements under various noisy conditions with different values of $L$ and $M$, which determine the dimensions of the noise reduction filters. We simulated a reverberant room by using the image method [11], [12]. The speech and the interference sources were located at (1.737 m, 4.6 m, 1.4 m) and (3.337 m, 4.6 m, 1.4 m), respectively. We consider a two-channel scenario ($N = 2$) where the microphones are located at (2.437 m, 5.6 m, 1.4 m) and (2.637 m, 5.6 m, 1.4 m). Ten utterances from the TIMIT database were used as the speech source signals, and the babble, factory and F-16 noises from the NOISEX-92 database were used as the interference signals at 0, 5, 10, 15 and 20 dB signal-to-interference ratio (SIR). The experimental results shown in Fig. 1 and Table I are obtained by averaging over all of these SIRs. Each signal was sampled at 16 kHz, and a half-overlapped analysis window of length 512 was applied. The noise reduction filter in (3) was implemented in a variety of dimensions by varying $L$ and $M$. It is noted that the parameter setting ($L = 0$, $M = 0$) is equivalent to the conventional multiplicative filtering approach. In this work, the values of $L$ and $M$ were experimentally determined depending on an estimate of the input SIR and the signal-to-noise ratio (SNR). In order to verify the performance of the proposed spectro-temporal filtering approach, we evaluated the cepstrum distance (CD) and the log-likelihood ratio (LLR) [13] between the estimated signal and the clean signal, and the segmental SNR (segSNR) measured after noise reduction. The results obtained under different noisy conditions with various values of $L$ and $M$ are shown in Fig. 1. From the results we can see that the speech enhancement performance measured in terms of CD, LLR and segSNR improved as the value of $L$ or $M$ increased.
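The statistics estimation underlying these experiments, i.e., the recursions (9)-(12) and the SPP of (13)-(14), might be sketched as below. This is a hypothetical numpy illustration, not the authors' code; the function names and the default forgetting factors are our own assumptions.

```python
import numpy as np

def update_psds(y_tilde, phi_yy, phi_vv, p, alpha_y=0.92, alpha_v=0.92):
    """One recursive update of the extended PSD estimates, Eqs. (9)-(12).

    y_tilde : (D,) stacked spectro-temporal observation for one (k, l)
    phi_yy  : (D, D) running noisy-speech PSD estimate
    phi_vv  : (D, D) running noise PSD estimate
    p       : speech presence probability for this bin and frame
    """
    outer = np.outer(y_tilde, y_tilde.conj())
    phi_yy = alpha_y * phi_yy + (1 - alpha_y) * outer        # Eq. (9)
    a_v = alpha_v + (1 - alpha_v) * p                        # Eq. (11)
    phi_vv = a_v * phi_vv + (1 - a_v) * outer                # Eq. (10)
    phi_xx = phi_yy - phi_vv                                 # Eq. (12)
    return phi_yy, phi_vv, phi_xx

def speech_presence_prob(y_tilde, phi_vv, phi_xx, q=0.5):
    """Multichannel SPP under the Gaussian model, Eqs. (13)-(14)."""
    phi_yy = phi_vv + phi_xx
    # quadratic term of the likelihood ratio, Eq. (13)
    quad = (y_tilde.conj() @ np.linalg.solve(phi_vv, y_tilde)
            - y_tilde.conj() @ np.linalg.solve(phi_yy, y_tilde)).real
    log_lam = (np.linalg.slogdet(phi_vv)[1]
               - np.linalg.slogdet(phi_yy)[1] + quad)        # log of Eq. (13)
    return 1.0 / (1.0 + (q / (1.0 - q)) * np.exp(-log_lam))  # Eq. (14)
```

In a full system, `speech_presence_prob` would feed the smoothing factor of `update_psds` at the next frame, with the a priori SAP `q` tracked by the multichannel MCRA algorithm [5].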
Additionally, to investigate the performance improvement in the case of simultaneous increases of $L$ and $M$, we further evaluated CD, LLR and segSNR under different conditions. The results obtained from the case of F-16 interfering signals are summarized in Table I. As seen in Table I, the enhanced speech signal obtained with both $L$ and $M$ increased showed better quality than those obtained with only one of them increased, and than the setting ($L = 0$, $M = 0$), which is equivalent to the conventional multiplicative algorithm. Furthermore, we verified that the proposed spectro-temporal filtering approach outperformed the multiplicative filtering approach in the coherent plus non-coherent noise field, too. The results are shown in Table II, for which the F-16 noise was used as a directional noise source and additive white Gaussian noise was simultaneously used as a diffuse noise. It is worth noting that increasing $L$ from 0 to 1 is computationally more expensive than increasing $M$ from 0 to 1 since the dimension of the noise reduction filter is proportional to $(2L+1)$ and $(M+1)$, respectively.

Fig. 1. Results of cepstrum distance, log-likelihood ratio and segmental SNR under different noisy conditions: (a) with various values of $L$, (b) with various values of $M$.

TABLE I. Performances of the proposed spectro-temporal filtering with different values of $L$ and $M$: F-16 interfering signal.

TABLE II. Performances of the proposed spectro-temporal filtering with different values of $L$ and $M$: F-16 interfering signal and additive white Gaussian noise.

V. CONCLUSIONS

In this letter, we proposed a spectro-temporal filtering algorithm for multichannel speech enhancement in the STFT domain utilizing temporal and spectral correlations. In contrast to the conventional multiplicative filtering method, the proposed approach considers correlations between components in adjacent frames and frequency bins. An extended form of speech and noise PSD matrices was introduced, and all the subsequent steps, including spectro-temporal filtering, PSD estimation and SPP computation, were performed under a unified framework. Experimental results demonstrated that the proposed approach improves the performance of multichannel speech enhancement.
REFERENCES

[1] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propagat., vol. AP-30, no. 1, pp. 27–34, Jan. 1982.
[2] S. Gannot, D. Burstein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614–1626, Aug. 2001.
[3] S. Gannot and I. Cohen, "Adaptive beamforming and postfiltering," in Springer Handbook of Speech Processing, J. Benesty, Y. Huang, and M. M. Sondhi, Eds. New York, NY, USA: Springer-Verlag, 2007, ch. 47, pp. 945–978.
[4] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260–276, Feb. 2010.
[5] M. Souden, J. Chen, J. Benesty, and S. Affes, "An integrated solution for online multichannel noise tracking and reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2159–2169, Sep. 2011.
[6] Y. Avargel and I. Cohen, "System identification in the short-time Fourier transform domain with crossband filtering," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1305–1319, May 2007.
[7] Y. Avargel and I. Cohen, "Adaptive system identification in the short-time Fourier transform domain using cross-multiplicative transfer function approximation," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 162–173, Jan. 2008.
[8] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[9] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, Sep. 2003.
[10] M. Souden, J. Chen, J. Benesty, and S. Affes, "Gaussian model-based multichannel speech presence probability," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1072–1077, Jul. 2010.
[11] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943–950, Apr. 1979.
[12] E. A. Lehmann, "Image-source method for room acoustics," [Online]. Available: http://www.eric-lehmann.com/ism_code.html
[13] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC, 2007.