IEEE SIGNAL PROCESSING LETTERS, VOL. 14, NO. 10, OCTOBER 2007
A Statistical Model-Based Residual Echo Suppression
Seung Yeol Lee, Student Member, IEEE, and Nam Soo Kim, Member, IEEE
Abstract—In this letter, we propose a novel residual echo suppression (RES) algorithm based on a statistical model constructed in the acoustic echo cancellation framework. In the proposed approach, all the possible near-end and far-end signal conditions are classified into four distinct hypotheses, and the power spectral density estimation is carried out according to the result of hypothesis testing. The distribution of each signal component is characterized by a parametric model, and the conventional likelihood ratio test is performed to make an optimal decision. The experimental results show that the proposed algorithm yields improved performance compared to that of the previous RES technique.

Fig. 1. Block diagram of AEC system with RES post-filter.
Index Terms—Acoustic echo cancellation, post filter, residual echo suppression (RES).
I. INTRODUCTION
Generally, acoustic echo seriously disturbs conversation in two-way telecommunication. To overcome this problem, acoustic echo cancellers (AECs) have been deployed to make conversation more comfortable by reducing the effect of acoustic echo. In many practical applications, however, some amount of residual echo still exists at the output of the AEC filter. This may come from a mismatch between the actual echo path and the employed adaptive filter structure, the slow tracking capability of the adaptation algorithm, interfering signals such as active near-end speech, and a variety of other unpredictable effects. Usually, in order to further reduce the residual echo, residual echo suppression (RES) filters are applied after the AEC. Various RES post-filtering techniques [1]–[7] have been developed to satisfy the echo attenuation requirement recommended by the ITU-T standard [8].

In this letter, we propose a novel algorithm for RES. In the proposed algorithm, the frequency response of the RES filter is determined differently according to the existence of active near-end speech and residual echo. For this, all the possible signal conditions are classified into four categories depending on the presence or absence of the near-end speech and the residual echo. Identification of each category can be treated as a hypothesis testing problem, and we apply a set of parametric models to perform a likelihood ratio test. All the parameters are specified in terms of the power spectral densities (PSDs) of the relevant signals, and their estimates are updated depending on the decision made by the hypothesis testing [9]. The optimal RES filter gain is given as a function of the signal-to-noise ratio (SNR) and signal-to-echo ratio (SER), which are obtained in a decision-directed manner [10], [11]. The performance of the proposed algorithm is evaluated through echo return loss enhancement (ERLE) and speech attenuation tests.
Manuscript received December 1, 2006; revised February 20, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alan McCree. The authors are with the School of Electrical Engineering and INMC, Seoul National University, Shillim-dong, Gwanak-gu, Seoul 151-744, Korea (e-mail:
[email protected];
[email protected]). Digital Object Identifier 10.1109/LSP.2007.896452
II. STATISTICAL MODEL FOR RES
A conventional AEC framework is shown in Fig. 1, where $x(t)$ is the far-end speech signal at time $t$, $h$ represents the impulse response of the real acoustic echo path, and $\hat{h}$ characterizes the corresponding echo path estimate. Let $y(t)$ be the signal picked up by the microphone and $e(t)$ be the AEC output signal that is to be transmitted to the far-end. Then

$$y(t) = d(t) + s(t) + n(t) \qquad (1)$$

$$e(t) = y(t) - \hat{d}(t) \qquad (2)$$

in which $d(t)$ is the echo signal, $s(t)$ is the near-end speech, $n(t)$ is the background noise, and $\hat{d}(t)$ is the adaptive filter output. In a practical implementation of the AEC, it is almost impossible to suppress the residual echo signal $r(t) = d(t) - \hat{d}(t)$ completely, owing to the inherent modeling mismatch and the lack of adaptability. Therefore, the AEC output $e(t)$ is usually further processed by a post-filter that performs RES.

RES is generally performed in the frequency domain. For this, all the signal components in (2) are assumed to be stationary random processes. In addition, we also assume that the near-end speech $s(t)$, the background noise $n(t)$, and the residual echo $r(t)$ are statistically independent. Based on these assumptions, we can derive the equivalent frequency domain relations as follows:

$$E(i,k) = Y(i,k) - \hat{D}(i,k) = S(i,k) + N(i,k) + R(i,k) \qquad (3)$$

where $Y(i,k)$, $E(i,k)$, $S(i,k)$, $N(i,k)$, $\hat{D}(i,k)$, and $R(i,k)$ represent the frequency domain spectra of $y(t)$, $e(t)$, $s(t)$, $n(t)$, $\hat{d}(t)$, and $r(t)$, respectively, computed in the $i$th frame for frequency bin $k$.
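As a minimal sketch of the time-domain signal model above, the following Python snippet builds the microphone signal as echo plus near-end speech plus noise, subtracts the adaptive filter output, and exposes the residual echo as the difference between the true and estimated echoes. The filter coefficients, signal lengths, noise levels, and the names `fir_filter` and `simulate_aec_signals` are illustrative assumptions, not values or identifiers from the letter.

```python
import random


def fir_filter(h, x):
    """Causal FIR filtering: out[t] = sum_j h[j] * x[t - j]."""
    return [
        sum(h[j] * x[t - j] for j in range(len(h)) if t - j >= 0)
        for t in range(len(x))
    ]


def simulate_aec_signals(num_samples=200, seed=0):
    """Generate toy signals that follow the AEC signal model.

    All numeric values below are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]   # far-end signal
    s = [rng.gauss(0.0, 0.5) for _ in range(num_samples)]   # near-end speech
    n = [rng.gauss(0.0, 0.05) for _ in range(num_samples)]  # background noise

    h = [0.6, 0.3, -0.2, 0.1]         # hypothetical true echo path
    h_hat = [0.55, 0.32, -0.15, 0.0]  # hypothetical imperfect estimate

    d = fir_filter(h, x)              # echo picked up by the microphone
    d_hat = fir_filter(h_hat, x)      # adaptive filter output

    y = [d[t] + s[t] + n[t] for t in range(num_samples)]  # microphone signal
    e = [y[t] - d_hat[t] for t in range(num_samples)]     # AEC output
    r = [d[t] - d_hat[t] for t in range(num_samples)]     # residual echo
    return {"x": x, "s": s, "n": n, "y": y, "e": e, "r": r}


if __name__ == "__main__":
    sig = simulate_aec_signals()
    # The AEC output decomposes into near-end speech, noise, and residual echo.
    worst = max(
        abs(sig["e"][t] - (sig["s"][t] + sig["n"][t] + sig["r"][t]))
        for t in range(len(sig["e"]))
    )
    print("max decomposition error:", worst)
```

Because the estimated echo path differs from the true one, the residual echo is nonzero whenever the far-end signal is active, which is the situation the RES post-filter is meant to handle.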
1070-9908/$25.00 © 2007 IEEE
LEE AND KIM: A STATISTICAL MODEL-BASED RESIDUAL ECHO SUPPRESSION
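Let $\hat{s}(t)$ be the output of the RES post-filter and $\hat{S}(i,k)$ be the corresponding spectrum. Then

$$\hat{S}(i,k) = G(i,k)\,E(i,k) \qquad (4)$$

in which $G(i,k)$ denotes the gain of the RES post-filter. Finally, $\hat{s}(t)$ is obtained by taking the inverse Fourier transform of $\hat{S}(i,k)$ and is then transmitted to the far-end.

For a successful RES, we need an accurate estimate of the residual echo, which varies depending on the presence of active near-end or far-end signals. In order to cope with such variability, we classify each condition into four hypotheses, $H_0$, $H_1$, $H_2$, and $H_3$, depending on the presence or absence of the active near-end and far-end signal components. In this setup, we assume that the performance of the AEC filter is not perfect, and for that reason, the existence of an active far-end signal usually gives rise to a meaningful amount of residual echo. From (3), each hypothesis can be expressed as

$$\begin{aligned} H_0 &: E(i,k) = N(i,k) \\ H_1 &: E(i,k) = S(i,k) + N(i,k) \\ H_2 &: E(i,k) = N(i,k) + R(i,k) \\ H_3 &: E(i,k) = S(i,k) + N(i,k) + R(i,k) \end{aligned} \qquad (5)$$

First, we separate the hypotheses according to the presence of an active far-end signal. It can be seen from (5) that the hypotheses $H_0$ and $H_1$ indicate the absence of the far-end signal, while $H_2$ and $H_3$ account for an active far-end signal. Discrimination of $H_0$ and $H_1$ from $H_2$ and $H_3$ can be achieved by applying voice activity detection (VAD) to the far-end signal, $x(t)$. One of the promising approaches to VAD is the statistical model-based technique that computes the global speech absence probability (GSAP) [9], [12]. The VAD output is zero when the signal is not found to have any active spectral component, and its value becomes one in the presence of active signal power. If the VAD output obtained from the far-end signal is one, the state is classified into $H_2$ or $H_3$. On the other hand, the state is $H_0$ or $H_1$ when the VAD output is zero.

The problem of testing between $H_0$ and $H_1$ in accordance with near-end signal activity is the same as the case considered in conventional noise suppression techniques [10]. When we apply the technique proposed in [9] to the current task of classifying $H_0$ and $H_1$, the GSAP of the error signal without residual echo, $p(H_0 \mid \mathbf{E}(i))$, is given by

$$p(H_0 \mid \mathbf{E}(i)) = \frac{1}{1 + q \prod_{k=1}^{M} \Lambda(i,k)} \qquad (6)$$

in which $\mathbf{E}(i)$ denotes the spectrum of $e(t)$ in the $i$th frame and $E(i,k)$ is its spectral component in the $k$th frequency bin, with $M$ being the total number of frequency bins. In (6), $q$ is the ratio between the prior probabilities as defined by

$$q = \frac{P(H_1)}{P(H_0)} \qquad (7)$$

and $\Lambda(i,k)$ is the likelihood ratio computed in the $i$th frame and $k$th frequency bin such that

$$\Lambda(i,k) = \frac{p(E(i,k) \mid H_1)}{p(E(i,k) \mid H_0)} = \frac{1}{1 + \xi(i,k)} \exp\left\{ \frac{\gamma(i,k)\,\xi(i,k)}{1 + \xi(i,k)} \right\} \qquad (8)$$

where $\xi(i,k)$ and $\gamma(i,k)$ are called the a priori and a posteriori SNRs, respectively [10], [13]. If $p(H_0 \mid \mathbf{E}(i))$ exceeds a threshold $\theta_1$, we assume that the near-end speech is inactive, resulting in hypothesis $H_0$. On the contrary, if $p(H_0 \mid \mathbf{E}(i))$ is smaller than $\theta_1$, $H_1$ is decided to be true, implying the presence of active near-end speech.

Now, what remains is how to discriminate between $H_2$ and $H_3$. In analogy to computing $p(H_0 \mid \mathbf{E}(i))$, we can obtain the GSAP in the presence of active residual echo as follows:

$$p(H_2 \mid \mathbf{E}(i)) = \frac{1}{1 + q \prod_{k=1}^{M} \tilde{\Lambda}(i,k)} \qquad (9)$$

in which the likelihood ratio, $\tilde{\Lambda}(i,k)$, is described as

$$\tilde{\Lambda}(i,k) = \frac{p(E(i,k) \mid H_3)}{p(E(i,k) \mid H_2)} = \frac{1}{1 + \tilde{\xi}(i,k)} \exp\left\{ \frac{\tilde{\gamma}(i,k)\,\tilde{\xi}(i,k)}{1 + \tilde{\xi}(i,k)} \right\} \qquad (10)$$

where

$$\tilde{\xi}(i,k) = \frac{\xi(i,k)\,\zeta(i,k)}{\xi(i,k) + \zeta(i,k)} \qquad (11)$$

$$\tilde{\gamma}(i,k) = \frac{\gamma(i,k)\,\kappa(i,k)}{\gamma(i,k) + \kappa(i,k)} \qquad (12)$$

with $\zeta(i,k)$ and $\kappa(i,k)$ being the a priori and a posteriori SERs, which will be discussed in the next section. Basically, the SERs used in this work represent the ratios between the speech signal $S(i,k)$ and the residual echo $R(i,k)$. If $p(H_2 \mid \mathbf{E}(i))$ lies below a threshold $\theta_2$, we decide that $H_3$ is true, showing that there exists an active near-end signal. On the other hand, if $p(H_2 \mid \mathbf{E}(i))$ exceeds $\theta_2$, a decision is made in favor of $H_2$. We can summarize the tests for detecting each condition as follows:

$$\begin{aligned} H_0 &: V(i) = 0 \text{ and } p(H_0 \mid \mathbf{E}(i)) > \theta_1 \\ H_1 &: V(i) = 0 \text{ and } p(H_0 \mid \mathbf{E}(i)) \le \theta_1 \\ H_2 &: V(i) = 1 \text{ and } p(H_2 \mid \mathbf{E}(i)) > \theta_2 \\ H_3 &: V(i) = 1 \text{ and } p(H_2 \mid \mathbf{E}(i)) \le \theta_2 \end{aligned} \qquad (13)$$

where $V(i)$ indicates the VAD output for the far-end signal, $x(t)$.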
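The four-way decision described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the likelihood ratio uses the standard complex-Gaussian form from statistical model-based VAD, the hypotheses are numbered 0 to 3 (0: noise only, 1: near-end speech, 2: residual echo, 3: both), the SNR and SER are merged into one effective ratio against the combined noise-plus-echo power, and the function names, the prior ratio `q`, and the thresholds are all hypothetical.

```python
import math


def log_likelihood_ratio(prior_ratio, posterior_ratio):
    """Log of (1 / (1 + xi)) * exp(gamma * xi / (1 + xi)) for one frequency bin."""
    return posterior_ratio * prior_ratio / (1.0 + prior_ratio) - math.log1p(prior_ratio)


def speech_absence_probability(prior_ratios, posterior_ratios, q=1.0):
    """Global speech absence probability 1 / (1 + q * prod_k Lambda_k).

    The product is accumulated in the log domain to avoid overflow.
    """
    z = math.log(q) + sum(
        log_likelihood_ratio(xi, gamma)
        for xi, gamma in zip(prior_ratios, posterior_ratios)
    )
    if z > 700.0:  # exp(z) would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def combine(snr, ser):
    """Ratio against noise plus residual echo: 1 / (1/snr + 1/ser)."""
    return snr * ser / (snr + ser)


def classify_frame(far_end_active, xis, gammas, zetas, kappas,
                   q=1.0, theta1=0.5, theta2=0.5):
    """Return the index (0..3) of the selected hypothesis for one frame.

    xis/gammas are per-bin a priori/a posteriori SNRs and zetas/kappas are
    per-bin a priori/a posteriori SERs; all are assumed strictly positive.
    """
    if not far_end_active:
        p_absent = speech_absence_probability(xis, gammas, q)
        return 0 if p_absent > theta1 else 1
    xi_eff = [combine(x, z) for x, z in zip(xis, zetas)]
    gamma_eff = [combine(g, k) for g, k in zip(gammas, kappas)]
    p_absent = speech_absence_probability(xi_eff, gamma_eff, q)
    return 2 if p_absent > theta2 else 3


if __name__ == "__main__":
    bins = 8
    quiet = [1.0] * bins   # a posteriori ratio near 1: no near-end speech evidence
    loud = [10.0] * bins   # large a posteriori ratio: strong near-end speech evidence
    print(classify_frame(False, [1.0] * bins, quiet, [1.0] * bins, quiet))
    print(classify_frame(True, [1.0] * bins, loud, [1.0] * bins, loud))
```

Working in the log domain keeps the product over frequency bins numerically stable when the number of bins is large, which is a design choice of this sketch rather than something stated in the letter.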
III. PSD ESTIMATION FOR RES
A crucial part of the RES operation is a robust estimation of the PSDs of the relevant signal components. In this section, we describe the procedures for estimating the PSDs of the background noise and residual echo. Let $\hat{\lambda}_n(i,k)$ and $\hat{\lambda}_r(i,k)$ be the estimates for the PSDs of the background noise and residual echo, respectively. For robustness reasons, $\hat{\lambda}_n(i,k)$ should be updated only when $H_0$ is decided to be true. Updating $\hat{\lambda}_r(i,k)$ should be performed by considering the model of the loudspeaker-enclosure-microphone (LEM) system [1]. In a conventional LEM system, the residual echo spectrum is usually given by the product of the far-end signal spectrum, $X(i,k)$, with the frequency response of the system mismatch, $H_\Delta(i,k)$, such that

$$R(i,k) = H_\Delta(i,k)\,X(i,k) \qquad (14)$$

where $H_\Delta(i,k) = H(i,k) - \hat{H}(i,k)$, i.e., the difference in frequency response between the actual echo path, $H(i,k)$, and its estimate, $\hat{H}(i,k)$ [1]. By (14), a straightforward way to estimate the PSD of the residual echo is given by

$$\hat{\lambda}_r(i,k) = |\hat{H}_\Delta(i,k)|^2\,\hat{\lambda}_x(i,k) \qquad (15)$$

where $\hat{\lambda}_x(i,k)$ is the PSD estimate for the far-end signal, $x(t)$. The squared magnitude response of the system mismatch, $|\hat{H}_\Delta(i,k)|^2$, can be updated by means of the decision directed approach given by

$$|\hat{H}_\Delta(i,k)|^2 = \begin{cases} \alpha_H\,|\hat{H}_\Delta(i-1,k)|^2 + (1 - \alpha_H)\,\dfrac{\hat{\lambda}_e(i,k)}{\hat{\lambda}_x(i,k)}, & \text{if } H_2 \text{ is chosen in the } i\text{th frame} \\ |\hat{H}_\Delta(i-1,k)|^2, & \text{otherwise} \end{cases} \qquad (16)$$

in which $\alpha_H$ is an appropriate smoothing parameter, and $\hat{\lambda}_e(i,k)$ represents a PSD estimate for the AEC output, $e(t)$. Both $\hat{\lambda}_x(i,k)$ and $\hat{\lambda}_e(i,k)$ can be easily updated through a first-order recursion with suitable smoothing parameters $\alpha_x$ and $\alpha_e$.

Once the estimates for the PSDs of the background noise and residual echo are obtained, the next step is to update the a priori SNR, $\xi(i,k)$, and the a priori SER, $\zeta(i,k)$. Among many possible approaches, we apply the decision directed technique proposed in [10]. Given $\hat{\lambda}_n(i,k)$ and $\hat{\lambda}_r(i,k)$, the decision directed approach updates $\xi(i,k)$ and $\zeta(i,k)$ in the following way:

$$\hat{\xi}(i,k) = \alpha_\xi\,\frac{|G(i-1,k)\,E(i-1,k)|^2}{\hat{\lambda}_n(i-1,k)} + (1 - \alpha_\xi)\,[\gamma(i,k) - 1]\,u[\gamma(i,k) - 1] \qquad (17)$$

$$\hat{\zeta}(i,k) = \alpha_\zeta\,\frac{|G(i-1,k)\,E(i-1,k)|^2}{\hat{\lambda}_r(i-1,k)} + (1 - \alpha_\zeta)\,[\kappa(i,k) - 1]\,u[\kappa(i,k) - 1] \qquad (18)$$

where $u[\cdot]$ is a unit step function, and $G(i-1,k)$ represents the gain of the RES post-filter computed in the previous frame. The a posteriori SNR, $\gamma(i,k)$, and the a posteriori SER, $\kappa(i,k)$, can be obtained from the instantaneous spectrum of the AEC filter output signal, $E(i,k)$, such that

$$\gamma(i,k) = \frac{|E(i,k)|^2}{\hat{\lambda}_n(i,k)} \qquad (19)$$

$$\kappa(i,k) = \frac{|E(i,k)|^2}{\hat{\lambda}_r(i,k)} \qquad (20)$$

The optimal gain, $G(i,k)$, of the RES post-filter is described in terms of $\xi(i,k)$ and $\zeta(i,k)$ [4], [9]. It is noted that $G(i,k)$ modifies the magnitude of $E(i,k)$ while retaining its phase. According to the minimum mean squared error criterion, it can be derived that

$$G(i,k) = \frac{\xi(i,k)\,\zeta(i,k)}{\xi(i,k) + \zeta(i,k) + \xi(i,k)\,\zeta(i,k)} \qquad (21)$$

which is equivalent to the Wiener filtering solution [11].

IV. SIMULATION RESULTS
In order to evaluate the performance of the proposed RES algorithm, we conducted computer simulations under various conditions. Twenty sentences were spoken by four speakers and sampled at 16 kHz. For performance assessment, we artificially created 20 data files such that each file was obtained by mixing the far-end signal with the near-end signal. The far-end speech was passed through a filter simulating the acoustic echo path before being mixed. The echo level measured at the input microphone was 3.5 dB lower than that of the input speech on average. Two types of noise sources, the babble and vehicular noises from the NOISEX-92 database, were added to the clean speech waveforms at varying SNRs.

To simulate the echo, the LEM system was modeled by a time-invariant FIR filter derived from an analysis of room acoustics [14]. The simulation environment was designed to fit a small office room of size 4 × 3 × 3 m³. In order to estimate the echo, an adaptive filter with a fixed number of taps was used, and the coefficients of the AEC filter were adapted by means of the normalized least mean square (NLMS) algorithm with an adaptive step-size control strategy [1]. The smoothing parameters used for PSD estimation were set to fixed values.

In our approach, three different tests were conducted to identify each state of the signal condition. In Fig. 2, we plot the receiver operating characteristic (ROC) curves of the applied detection algorithms evaluated over the test data. Appropriate thresholds, $\theta_1$ and $\theta_2$, were selected so that a good trade-off could be made between the false-alarm and detection probabilities.

Fig. 2. ROC curves for hypothesis testing: (a) far-end signal detection, (b) testing between $H_0$ and $H_1$, and (c) testing between $H_2$ and $H_3$.

The performance of the RES approach was measured in terms of $\mathrm{ERLE}(t)$, which is defined by

$$\mathrm{ERLE}(t) = 10 \log_{10} \frac{\mathbb{E}[y^2(t)]}{\mathbb{E}[\hat{s}^2(t)]} \qquad (22)$$

with $\mathbb{E}[\cdot]$ denoting the expected value at time $t$, and ERLE denotes the corresponding value averaged over the entire duration. For the purpose of comparison, we also evaluated the performance of the original AEC system without any RES module and the one with the RES algorithm proposed by Gustafsson et al. [3]. The overall results for ERLE are plotted in Fig. 3(a). From these results, we can observe that the ERLEs of the proposed algorithm were higher than those of the previous RES technique in all the tested conditions.

Another important factor in the performance evaluation is the speech attenuation during the double-talk periods. This attenuation is shown in Fig. 3(b), where we can see that the proposed algorithm resulted in a level of attenuation similar to that of the conventional RES technique. In Fig. 4, an example of the variation of $\mathrm{ERLE}(t)$ over time is given in conjunction with the corresponding waveform. We can observe that the proposed algorithm attenuated the residual echo more efficiently than the conventional RES technique while preserving the near-end signal quite well.

Fig. 3. Performance of RES algorithms: (a) ERLE score and (b) speech attenuation during double-talk.

Fig. 4. Time variation of ERLE(t).

V. CONCLUSIONS
In this letter, we have proposed a novel RES algorithm based on a statistical model. The principal contribution of this work is a systematic classification of the AEC state according to the activity of near-end and far-end signals. In order to test each hypothesis, a statistical approach resulting in likelihood ratio tests has been adopted. The performance of the proposed approach has been found superior to that of the conventional technique through ERLE evaluation tests.
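The per-bin processing of Section III and the ERLE measure of Section IV can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the system mismatch is tracked as a smoothed ratio of AEC-output PSD to far-end PSD and is adapted only in echo-only frames, that the a priori SNR and SER follow a decision-directed update in the style of [10], that the gain is the Wiener gain against noise plus residual echo, and that ERLE compares microphone power with post-filter output power. All function names, default smoothing constants, and numeric inputs are hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero in this sketch


def update_mismatch(prev_h2, psd_e, psd_x, echo_only, alpha=0.9):
    """Track the squared mismatch magnitude; adapt only in echo-only frames."""
    if not echo_only:
        return prev_h2
    return alpha * prev_h2 + (1.0 - alpha) * psd_e / max(psd_x, EPS)


def decision_directed(prev_clean_power, prev_interf_psd, posterior, alpha=0.98):
    """Blend the previous-frame ratio with the rectified (posterior - 1)."""
    return (alpha * prev_clean_power / max(prev_interf_psd, EPS)
            + (1.0 - alpha) * max(posterior - 1.0, 0.0))


def wiener_gain(snr, ser):
    """Speech power over speech + noise + residual echo power: 1 / (1 + 1/snr + 1/ser)."""
    if snr <= 0.0 or ser <= 0.0:
        return 0.0
    return 1.0 / (1.0 + 1.0 / snr + 1.0 / ser)


def erle_db(mic, out):
    """ERLE in dB from sample averages of microphone and output power."""
    p_mic = sum(v * v for v in mic) / len(mic)
    p_out = sum(v * v for v in out) / len(out)
    return 10.0 * math.log10(max(p_mic, EPS) / max(p_out, EPS))


if __name__ == "__main__":
    # One frequency bin, one frame, with made-up numbers.
    h2 = update_mismatch(prev_h2=0.05, psd_e=0.2, psd_x=2.0, echo_only=True)
    psd_r = h2 * 2.0                  # residual echo PSD from mismatch and far-end PSD
    psd_n = 0.01                      # hypothetical noise PSD
    e_power = 0.5                     # |E|^2 in the current frame
    prev_clean = 0.3                  # |G * E|^2 from the previous frame
    snr = decision_directed(prev_clean, psd_n, e_power / psd_n)
    ser = decision_directed(prev_clean, psd_r, e_power / psd_r)
    print("gain:", wiener_gain(snr, ser))
    print("ERLE (dB):", erle_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1]))
```

The gain approaches one only when both the SNR and the SER are large, so the post-filter attenuates a bin whenever either the noise or the residual echo dominates it.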
REFERENCES
[1] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. New York: Wiley, 2004.
[2] V. Turbin, A. Gilloire, and P. Scalart, “Comparison of three post-filtering algorithms for residual echo reduction,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1997, pp. 307–310.
[3] S. Gustafsson, R. Martin, and P. Vary, “Combined acoustic echo control and noise reduction for hands-free telephony,” Signal Process., vol. 64, no. 1, pp. 21–32, Jan. 1998.
[4] S. Gustafsson, R. Martin, P. Jax, and P. Vary, “A psychoacoustic approach to combined acoustic echo cancellation and noise reduction,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 245–256, Jul. 2002.
[5] G. Enzner and P. Vary, “Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones,” Signal Process., vol. 86, no. 6, pp. 1140–1156, Jun. 2006.
[6] J. D. Gordy and R. A. Goubran, “On the perceptual performance limitations of echo cancellers in wideband telephony,” IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 1, pp. 33–42, Jan. 2006.
[7] C. Faller and J. Chen, “Suppressing acoustic echo in a spectral envelope space,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 1048–1062, Sep. 2005.
[8] ITU-T Recommendation G.167, Acoustic Echo Controllers. Helsinki, 1993.
[9] N. S. Kim and J.-H. Chang, “Spectral enhancement based on global soft decision,” IEEE Signal Process. Lett., vol. 7, no. 5, pp. 108–110, May 2000.
[10] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.
[11] R. L. B. Jeannès, P. Scalart, G. Faucon, and C. Beaugeant, “Combined noise and echo reduction in hands-free systems: A survey,” IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 808–820, Nov. 2001.
[12] J.-H. Chang, N. S. Kim, and S. K. Mitra, “Voice activity detection based on multiple statistical models,” IEEE Trans. Signal Process., vol. 54, no. 6, pp. 1965–1976, Jun. 2006.
[13] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, pp. 1–3, Jan. 1999.
[14] S. McGovern, A Model for Room Acoustics, 2003 [Online]. Available: http://2pi.us/rir.html