ICA-based Noise Reduction for Mobile Phone Speech Communication

Zhipeng Zhang and Minoru Etoh
Research Laboratories, NTT DoCoMo
3-5 Hikari-no-oka, Yokosuka, Kanagawa, 239-8536 Japan
{zzp, etoh}@mml.yrp.nttdocomo.co.jp

Abstract

We propose a frequency-domain Independent Component Analysis (ICA) method with robust and computationally light post-processing for background noise reduction in mobile phone speech communication. In our scenario, the primary target is noise reduction rather than multi-source signal separation. This target motivates a new physical constraint: we place a restriction on the amplitude range of the transfer functions rather than assuming that the amplitudes are constant. When there are diffraction, obstacles and reflections in the real-world environment, it is better to assume that the transfer function amplitude (derived from the distance to the mouth) varies within a certain range. Our two-microphone experiment shows that the ICA-based noise reduction significantly improves speech recognition performance, especially in severe noise conditions.

1. Introduction

The mobile communication era has made noise reduction a vital research topic in recent years, since communication environments are not limited to calm indoor circumstances. The SNR in such environments is typically severe, and the computational and sensing resources are limited. In light of these environments and limitations, we have proposed an Independent Component Analysis (ICA) [1]-based noise reduction technique that formulates parameter estimation, stabilized by a priori information, within a Bayesian framework of maximum a posteriori (MAP) estimation. MAP-ICA has shown its robustness in mobile phone environments [2]. The new experiments, which use the conditions shown in Fig. 1, are detailed in the following sections.¹ In this paper, we introduce another enhancement to our previous work: the use of a post-processing filter derived from a physical constraint between a microphone and the mouth.

Although ICA offers well-formulated solutions to signal separation, the real world is often so complex that we cannot model the mixing/unmixing processes, because of unknown parameters such as moving speakers, non-stationary background noise, and long reverberation, all of which degrade the performance of ICA-based signal separation.

One good way to overcome this drawback of ICA is to purify the result by post-processing that screens for physically plausible signals. Such a post-filter can improve the output signal quality by suppressing the residual crosstalk and background noise. Aichner et al. proposed a Wiener filter [3] to achieve these goals in ICA. They estimate the power spectral densities of the residual crosstalk and the background noise and remove them from the separated signal. This method can be applied to a very wide range of applications, though its SNR improvement is small.

Other methods [4, 5] utilize direction of arrival (DOA) information as a physical constraint, which leads to a time-frequency masking filter at a low computational cost. By combining ICA and time-frequency masking, we can expect SNR improvement in a practical way. Such a physical constraint contributes greatly to performance improvement. Mukai et al. assumed, as physical constraints, that the microphones and sources do not move and that there is no diffraction or reverberation [4]; that is, the transfer functions between the sources and the microphones were constants. They assumed that the target source exists in a sphere. The estimated spheres can be utilized for sorting or classifying the output signals of ICA, which in principle solves the permutation problem. This method, based on audible-directional constraints, works well for multi-source signal separation in general. It is still limited under more uneven conditions, where the assumption fails because of diffraction, obstacles and reflections of sound waves between the microphone and the target source in the real-world environment. For mobile phones, where the microphones move and the background environments are complex, we may consider physical constraints other than DOA.

¹ Especially in our applications, a many-microphone system is hard to commercialize because of the limited space. In addition, signal separation in frequency-domain ICA, which is commonly used for noise reduction, requires solving the permutation problem, which becomes harder as more microphones are used. Thus we focus on a two-microphone system without losing the generality of our proposed scheme.

Fig. 1. Experiment Conditions (diagram labels: main mic, sub mic, airport noise, restaurant noise; distances: 15cm, 60cm, 30cm)
1-4244-1251-X/07/$25.00 ©2007 IEEE.
In our application scenario, the primary target is noise reduction rather than multi-source signal separation. This target motivates a new physical constraint: a restriction (a threshold) on the amplitudes of the transfer functions, rather than the assumption that the amplitudes are constant. When there are diffraction, obstacles and reflections in the real-world environment, it is better to assume that the transfer function amplitude varies within a certain range. Roughly speaking, we adopt a distance constraint rather than a directional constraint. This restriction is applied to the output of ICA estimation at each frequency bin. When the output of ICA estimation exceeds the threshold, we treat it as an estimation error and apply a masking filter to suppress it. Fig. 2 shows the framework of the proposed method. We believe that this framework can be extended to support a wide range of applications.

Fig. 3 and Fig. 4 show a sample of the original clean speech data and of the noisy speech data recorded by the two microphones ("main mic" and "sub mic"), respectively. Fig. 5 and Fig. 6 show the results extracted by the traditional ICA method and by our method, respectively. These results show that the proposed method successfully suppresses the noise and outperforms traditional ICA.

The remainder of the paper is organized as follows. We first explain our ICA-based noise reduction technique, and then propose a masking filter for ICA post-processing. The next section reports some experiments on speech recognition. The paper concludes with a general discussion and issues related to future research.

Fig. 2. Framework of proposed method. (flowchart blocks: Speech input (frequency domain); Initializing W using a priori knowledge; Performing MAP-ICA; Output of ICA; Measuring distance between target and microphones; Setting threshold (i.e., physical); Comparing the output of ICA with the threshold; Identifying error signals (frequency bin) of ICA estimation; Masking the result by post filtering; Result)

Fig. 3. Original speech signal

Fig. 4. Noisy speech (2 channels)

Fig. 5. Speech signal extracted by ICA
Fig. 6. Speech signal purified by the proposed method

2. ICA Technique for Noise Reduction

2.1 Transfer Functions

Usually two factors affect the transfer function: the microphone type and its position relative to the mouth. In a mobile phone, because we do not change the microphone, the transfer function depends only on the position. We assume that the transfer function amplitude (derived from the distance to the mouth) varies within a certain range. To confirm this assumption, we used a HATS (Head And Torso Simulator) to measure the transfer function at two different points, A and B, as shown in Fig. 7. In practice, points A and B are taken to be the common microphone positions under the regular operating conditions of the phone. We adopted the TSP (Time-Stretched Pulse) method for the measurements [7]. Fig. 8 shows the amplitude of the transfer function from the mouth of the HATS to positions A and B. We can see that there is no obvious difference between A and B except at low frequencies. This confirms the assumption that the transfer function amplitude varies within a certain range.

Fig. 7. Configuration of transfer function measurement of points A, B (main mic positions) (diagram labels: HATS; microphone position A; microphone position B; distances: 1cm, 2.5cm)

Fig. 8. Transfer function (AMP) of positions A, B

2.2 System Configuration of Fixed-point ICA

ICA is a statistical method that decomposes multivariate data into a linear sum of non-orthogonal basis vectors. The source signal S(ω) is mixed by the mixing matrix A(ω) in the frequency domain. We thus have

    Y(\omega) = A(\omega) S(\omega),    (1)

where ω and Y(ω) denote the frequency and the observation signal, respectively. Here A(ω) corresponds to the frequency response of the mixing filters from the sources to the microphones. The source signals can be recovered if we know the frequency responses from all sources to all microphones. However, in most cases we have no information about either the sources or the microphones. The problem of recovering the original signal S, given only the sensor outputs Y(ω), is to obtain Z(ω) by estimating the unmixing matrix W blindly:

    Z(\omega) = W(\omega) Y(\omega).    (2)

To estimate W blindly, we adopt the fixed-point ICA, which was first developed in [6]. There are three reasons for choosing this algorithm in our mobile phone noise reduction system. First, the key advantage of this algorithm is that it converges faster than other algorithms with almost the same separation quality. This makes it suitable for mobile phone applications, where computational power is very limited and real-time processing is necessary. Second, we place the main mic at position A (in Fig. 7) and hold the sub mic at the HATS ear, and we measure the corresponding transfer function for use as the initial value. This provides a good initial value and results in early convergence and better separation quality. Third, this algorithm only has to extract one independent component (the target speaker's voice), meaning that just one separation vector w (one row of the separation matrix W) must be estimated such that the separated component equals one independent component.

2.3 Bayesian ICA for Mobile Phone Noise Reduction

We have formulated parameter estimation stabilized by a priori information as a Bayesian framework of maximum a posteriori (MAP) estimation [2]. We have used the mouth-to-microphone transfer function as the a priori information to update the matrix as follows:

    \hat{W} = \arg\max_{w} \sum_{t=1}^{T} \log p\{y(t); w\}\, g(w),    (3)

Here g(w) expresses the prior knowledge about the unmixing matrix. Using these parameters as initial values, the unmixing matrix can be updated with high efficiency.

3. Post-processing for ICA

3.1 Invariant in ICA

The ratio of elements in the same column of the unmixing matrix W is invariable [4], and it is common to use this ratio to deduce geometric information related to the source signals:

    [W^{-1}]_{11} / [W^{-1}]_{21} = A_{21} / A_{11},    (4)

where A and W denote the mixing matrix and the unmixing matrix, respectively. In [4], a near-field model is applied to constrain the transfer function of A:

    A_{i1} = \frac{1}{\| p_i - q \|} \exp\left( j \omega c^{-1} \| p_i - q \| \right),    (5)

where p_i represents the location of microphone i, q is the location of the target, and c is the speed of wave propagation. By taking the ratio of (5) for a pair of microphones, we have:

    \left| [W^{-1}]_{11} / [W^{-1}]_{21} \right| = \| p_1 - q \| / \| p_2 - q \|,    (6)

This equation yields a sphere within which the target source exists. This estimated sphere has been utilized for classifying the output signals of ICA [4]. However, in our mobile noise reduction system, there are obstacles and reflections between the source and the microphone, so it is impossible to use the geometric information of the sphere. Instead, we suppose that the amplitude of the transfer function varies within a certain range that is decided by the distance between a microphone and the target. Fig. 9 shows this ratio of amplitudes (A21/A11) for a pair of transfer functions at each frequency bin. The transfer functions are measured in advance. We can see that this ratio is continuous across frequency bins, with no unexpected values. We can therefore find estimation errors based on this continuity property.

Fig. 9. |A21/A11| of transfer function (x-axis: frequency bins; y-axis: |A21/A11|)

3.2 Masking Filter

Fig. 10 shows an example of a simple binary filter. The x-axis indicates the frequency bin, while the y-axis represents the absolute value of the estimated W21/W11. We can clearly see that there are some peaks in the estimated W21/W11. We compare the estimated W21/W11 to a threshold that is decided according to the near-field model. When the absolute value of W21/W11 at a frequency bin exceeds the threshold, we treat the ICA result as an estimation error and apply a masking filter to this frequency bin and the adjoining bins (we call this the "estimation error area"). We set the threshold heuristically. The final output in the estimation error area is a combination of the ICA estimation (W_ICA) and the initial value (W_init).
    W_{final} = \lambda W_{ICA} + (1.0 - \lambda) W_{init}    (7)
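The threshold comparison of Section 3.2 and the combination in Eq. (7) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the paper: each frequency bin holds a 2x2 unmixing matrix stored as nested lists, W11 is non-zero, the number of adjoining bins (`adjoin`) is a free parameter, the weight is 0 inside the estimation error area and 1 elsewhere, and all function names and toy values are hypothetical.

```python
def find_error_bins(ratio_mag, threshold, adjoin=1):
    """Flag bins whose |W21/W11| exceeds the threshold, plus
    `adjoin` neighbouring bins on each side (the estimation error area)."""
    n = len(ratio_mag)
    error = [False] * n
    for k, r in enumerate(ratio_mag):
        if r > threshold:
            for i in range(max(0, k - adjoin), min(n, k + adjoin + 1)):
                error[i] = True
    return error


def blend(w_ica, w_init, lam):
    """Eq. (7) for one bin: lam * W_ICA + (1 - lam) * W_init, element-wise."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_ica, row_init)]
            for row_ica, row_init in zip(w_ica, w_init)]


def post_filter(w_ica_bins, w_init_bins, threshold, adjoin=1):
    """Apply the masking filter to every frequency bin."""
    # w[1][0] is W21 and w[0][0] is W11 (assumed non-zero).
    ratio_mag = [abs(w[1][0] / w[0][0]) for w in w_ica_bins]
    error = find_error_bins(ratio_mag, threshold, adjoin)
    # Assumed binary weight: 0 in the estimation error area, 1 otherwise.
    lam = [0.0 if e else 1.0 for e in error]
    w_final = [blend(wi, w0, l)
               for wi, w0, l in zip(w_ica_bins, w_init_bins, lam)]
    return w_final, error


# Toy data (made up for illustration): five bins, one obvious peak at bin 2.
w_init = [[1.0, 0.0], [0.5, 1.0]]
w_ica_bins = [
    [[1.0, 0.1], [0.4, 1.0]],
    [[1.0, 0.1], [0.6, 1.0]],
    [[1.0, 0.1], [3.0, 1.0]],
    [[1.0, 0.1], [0.5, 1.0]],
    [[1.0, 0.1], [0.45, 1.0]],
]
w_final, error = post_filter(w_ica_bins, [w_init] * 5, threshold=1.0, adjoin=1)
```

In this toy run the peak and its two neighbours fall back to the initial value, while the remaining bins keep the ICA estimate. Because `blend` accepts any weight between 0 and 1, a graded weight could be passed instead of the binary one.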
We apportion the weighting factor (λ) between them according to how much the peak value exceeds the threshold. The simplest case is a binary filter. The filter indicates whether the ICA estimate is reliable or not. In the estimation error area, we use the initial value instead of the ICA estimation result:

    \lambda = \begin{cases} 0, & \text{in estimation error areas} \\ 1, & \text{otherwise} \end{cases}    (8)

Fig. 10. A binary masking filter (x-axis: frequency bins; y-axis: |W21/W11|; annotations: threshold, estimation error areas)

4. Experiment

4.1 Task and Data

To confirm the performance of the proposed method in a real system, we evaluated it through speech recognition, as in other contributions [8], because it is hard to obtain the reference signal that is needed for objective assessment. By measuring speech recognition performance, we can indirectly evaluate the practical usefulness of our scheme. The task of the system is to recognize Japanese sentences. Thirty utterances from one speaker were recorded at 16 kHz with 16-bit resolution in noisy environments. An artificial human torso (B&K) uttered the clean speech. The two noise sources (70dB) were placed 60cm and 30cm from the torso. The noise data (airport noise and restaurant noise) were non-stationary. We adjusted the clean speech power (60, 70 and 80dB).

4.2 Feature Vector and HMM

The speech signals were converted into 25-dimensional acoustic vectors consisting of 12-dimensional cepstral-mean-normalized MFCCs [9] and their first derivatives, as well as normalized log energy coefficients. The HMM [9] used in our experiments was a 5-state left-to-right HMM; each state had 4 mixtures.

4.3 Recognition Result

ICA was performed to extract clean speech, with an FFT length of 512, and recognition experiments were then performed. Fig. 11 shows the comparison results (accuracy %) for three techniques: the original speech recorded by the main microphone ("input"), the output of traditional ICA ("ICA"), and the proposed method, all under the three SNR conditions. The proposed method achieved about 10% improvement in word accuracy compared to traditional ICA. This confirms the effectiveness of the proposed method.

Fig. 11. Recognition results by ICA and proposed method (x-axis: clean speech power, 60dB, 70dB, 80dB; y-axis: ACC (%), 0 to 80; series: Input, ICA, Proposed method)

5. Conclusion

We have described a masking filter method for ICA-based noise reduction in mobile phone speech communication. The proposed method creates a threshold derived from the distance between the microphone and the target and uses this threshold to correct the estimation errors of ICA. The proposed method was confirmed to be superior to the traditional ICA method. Although this paper considered mobile phone environments, the framework of the proposed method is flexible enough to cope with various applications, so it should prove valuable in other areas. Future research includes investigating (1) robustness in more severe scenarios using a variety of testing environments, (2) more effective filtering techniques than the binary masking filter, and (3) MOS assessment methods.

Acknowledgements

The authors wish to express their thanks to Mr. Naka and Mr. Kikuiri of NTT DoCoMo for valuable discussions.

References

[1] T-W. Lee, et al., "Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources", Neural Computation, Vol. 11, pp. 417-441, 1999.
[2] Z. Zhang, et al., "A Fast- and High-convergence Method for ICA-based Noise Reduction in Mobile Phone Speech Communication", in Proceedings of AES conference, 2005.
[3] R. Aichner, et al., "Post-processing for Convolutive Blind Source Separation", in Proceedings of ICSSSP2006, pp. 37-40.
[4] R. Mukai, et al., "Near-Field Frequency Domain Blind Source Separation for Convolutive Mixtures", in Proceedings of ICASSP 2004, pp. 49-52, 2004.
[5] S. Kurita, et al., "Evaluation of blind signal separation method using directivity pattern under reverberant conditions", in Proc. of ICASSP 2000, pp. 3140-3143, 2000.
[6] A. Hyvarinen, "Survey on Independent Component Analysis", Neural Computing, vol. 2, pp. 94-128, 1999.
[7] Y. Suzuki, et al., "An optimum computer-generated pulse signal suitable for the measurement of very long impulse responses", J. Acoust. Soc. Am., pp. 1119-1123, 1995.
[8] A. Kamimura, et al., "Rapid filter adaptation for frequency-domain ICA analysis in various car environments", in Proc. of EUROSPEECH 2005.
[9] http://htk.eng.cam.ac.uk/