
ROBUST SPEECH SEPARATION USING TIME-FREQUENCY MASKING

Parham Aarabi, Guangji Shi, and Omid Jahromi

The Artificial Perception Laboratory
University of Toronto
10 Kings College Road, Ontario, Canada, M5S 3G4
fparham@ecf, guangji@comm, [email protected]

ABSTRACT

A multi-microphone time-frequency speech masking technique is proposed. This technique utilizes both the time-frequency magnitude and phase information in order to estimate the Signal-to-Noise Ratio (SNR) maximizing masking coefficients for each time-frequency block given that the direction (or alternatively, the time-delay of arrival) of the speaker of interest is known. Using this masking algorithm, speech features (such as formants) from the direction of interest are preserved while features from other directions are severely degraded. Digit recognition experiments indicate that the proposed technique can result in a substantial increase in the digit recognition accuracy rate. At 0dB, for example, the proposed technique results in a digit recognition accuracy rate improvement of 26% over the single microphone case and an improvement of 12% over the two microphone superdirective beamforming case.

1. INTRODUCTION

The necessity for accurate speech recognition systems capable of handling adverse environments with multiple speakers has in recent years fueled speech separation and enhancement research [1, 5, 3, 2, 6, 4]. This has resulted in numerous techniques with varying degrees of success, most of which employ multiple microphones [1, 3, 2, 6, 4]. Beamforming techniques, for example, utilize knowledge about the direction of the speech source of interest in order to reduce noise from other directions. The resulting SNR gain is significant as long as a large number of microphones are available [6, 4]. Independent Component Analysis (ICA), on the other hand, is capable of producing large SNR gains with few microphones [1, 3]. However, ICA has several limitations that have hampered its application in real-world situations [1, 6]. Interestingly, both of these popular techniques (as well as many other speech separation techniques) are similar in the sense that they are not specifically designed to function for speech signals. Speech has certain characteristics whose utilization can provide a significant edge in the de-noising and signal separation tasks [2, 5].

In this paper, we extend the phase error filtering technique initially proposed in [2] to include the magnitudes of the two microphones in addition to the phase information. This technique transforms two noisy time-domain signals recorded by two microphones into their time-frequency (TF) representations. For each time-frequency component or block, a phase-error measure is derived from the information in both microphones. Based on this, the time-frequency block for each microphone is scaled by a masking value between zero and one. Basically, TF blocks with large phase-errors are ‘punished’ by a small mask value (0) and TF blocks with small phase-errors are ‘rewarded’ by a large mask value (1). In the following sections, we formulate four different TF masks and analyze them theoretically, through SNR-gain simulations, and digit recognition experiments.

2. TIME-FREQUENCY MASKING

We assume that a pair of signals are received by two microphones. Two short signal segments (15ms in duration) are then obtained by windowing each microphone signal by a Hanning window. We model the Fourier transforms of the pair of signals obtained in the $k$th time segment as follows:

$$X_{1,k}(\omega) = S_k(\omega) + N_{1,k}(\omega) \tag{1}$$

$$X_{2,k}(\omega) = S_k(\omega)\, e^{-j\omega\tau} + N_{2,k}(\omega) \tag{2}$$

where $S_k(\omega)$ is the signal component arriving directly from the source of interest (with a time-delay-of-arrival (TDOA) of $\tau$) in time-segment $k$, and $N_{1,k}(\omega)$ and $N_{2,k}(\omega)$ are possibly dependent noises corresponding to the $k$th time segment. The noises here could include microphone dependent noise, other speakers, and reverberations (non-direct arrival of the signal from the source of interest). In this paper, we use the notation $Y_k(\omega)$ to denote the time-frequency representation of the time domain signal $y(t)$.

This TF representation is obtained by half-overlapped segmentation, windowing (using a Hanning window), and taking the Fourier transform of the $k$th time segment of $y(t)$. Note also that $y(t)$ can be retrieved from $Y_k(\omega)$ by taking the inverse Fourier transforms of $Y_k(\omega)$ for each $k$, half-overlapping them, and then adding them to obtain a time domain estimate for $y(t)$. Furthermore, because of the finite windows, the frequency domain will be discrete. Hence, the TF representation that is discussed in this paper has both a discrete time-segment index $k$ and a discrete frequency index $\omega$.

In this paper, as in [2], we assume that $|N_{1,k}(\omega)| = |N_{2,k}(\omega)| = |N_k(\omega)|$. Clearly, there is an inherent assumption in equations 1 and 2 that the direct signal-of-interest component has the same magnitude for both microphones. These assumptions would be valid if the inter-microphone distance is small [6, 2]. By rearranging equations 1 and 2, we obtain:

$$\frac{|S_k(\omega)|^2}{|N_k(\omega)|^2} = \frac{\left| X_{1,k}(\omega)\, e^{j\angle N_{2,k}(\omega)} - e^{j\angle N_{1,k}(\omega)}\, X_{2,k}(\omega) \right|^2}{\left| X_{1,k}(\omega) - e^{j\omega\tau} X_{2,k}(\omega) \right|^2} \tag{3}$$

Equation 3 defines the signal-to-noise ratio of each time-frequency component (at time-segment index $k$ and frequency $\omega$) in terms of the received signals. For clarity, we will define the signal to noise ratio as:

$$R_k(\omega) = \frac{|S_k(\omega)|^2}{|N_k(\omega)|^2} \tag{4}$$

and also define the terms

$$\rho_k(\omega) = \frac{|X_{1,k}(\omega)|}{2|X_{2,k}(\omega)|} + \frac{|X_{2,k}(\omega)|}{2|X_{1,k}(\omega)|} \tag{5}$$

and

$$\theta_k(\omega) = \angle X_{1,k}(\omega) - \angle X_{2,k}(\omega) - \omega\tau \tag{6}$$

Equation 5 is a measure of the relative disparity between the magnitudes of the received signals. Equation 6 is the phase-error definition, which in the absence of noise would be 0 (or a multiple of $2\pi$). The phase-error has been extensively explored before for time-frequency masking applications [6, 2]. Based on these definitions, we can re-state equation 3 as follows:

$$R_k(\omega) = \frac{\rho_k(\omega) - \cos(\angle N_k(\omega))}{\rho_k(\omega) - \cos(\theta_k(\omega))} \tag{7}$$

where $\angle N_k(\omega) = \angle N_{2,k}(\omega) - \angle N_{1,k}(\omega) - \angle X_{2,k}(\omega) + \angle X_{1,k}(\omega)$.

Our goal is to devise a method of filtering (or masking) each time-frequency component such that the product of the mask and the time-frequency signals have a higher overall SNR than prior to masking. In other words, we want to find a set of optimal masking functions $\eta_{1,k}(\omega)$ and $\eta_{2,k}(\omega)$ such that the resulting signals $\tilde{X}_{1,k}(\omega)$ and $\tilde{X}_{2,k}(\omega)$, defined as:

$$\tilde{X}_{1,k}(\omega) = \eta_{1,k}(\omega)\, X_{1,k}(\omega) \tag{8}$$

$$\tilde{X}_{2,k}(\omega) = \eta_{2,k}(\omega)\, X_{2,k}(\omega) \tag{9}$$

have less noise than $X_{1,k}(\omega)$ and $X_{2,k}(\omega)$, respectively.

Note that we seek masks that can be separately applied to each microphone signal to reduce noise. Our result would then be two masked signals which can either be used individually or combined by further multi-channel techniques [6]. Note that in this paper, we take $\tilde{X}_{1,k}(\omega)$ to be the result of the TF masking. Hence, we do not include the multichannel integration that could potentially improve the results even further. It can be shown that the optimal (SNR maximizing) value for the masks can be defined as [6, 2]:

$$\eta_{1,k}(\omega) = \eta_{2,k}(\omega) = \eta_k(\omega) = \frac{R_k(\omega)}{1 + R_k(\omega)} \tag{10}$$

As a result, combining equations 7 and 10, we obtain:

$$\eta_k(\omega) = \frac{\rho_k(\omega) - \cos(\angle N_k(\omega))}{2\rho_k(\omega) - \cos(\angle N_k(\omega)) - \cos(\theta_k(\omega))} \tag{11}$$

The maximum value of this optimal mask is:

$$\eta_{\max,k}(\omega) = \frac{\rho_k(\omega) + 1}{2\rho_k(\omega) + 1 - \cos(\theta_k(\omega))} \tag{12}$$

and the minimum value is:

$$\eta_{\min,k}(\omega) = \frac{\rho_k(\omega) - 1}{2\rho_k(\omega) - 1 - \cos(\theta_k(\omega))} \tag{13}$$
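The relations of this section can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it draws synthetic time-frequency values that satisfy the model of equations 1 and 2 with equal noise magnitudes, then compares the SNR recovered through equation 7 with the true $|S_k|^2/|N_k|^2$ and confirms that the optimal mask of equation 10 lies between the bounds of equations 13 and 12. The function names, the random ranges, the TDOA value, and the exclusion of nearly coincident noise phasors are all illustrative assumptions.

```python
import numpy as np

def magnitude_disparity(X1, X2):
    """Equation 5: equals 1 for equal magnitudes and grows as they diverge."""
    a1, a2 = np.abs(X1), np.abs(X2)
    return a1 / (2 * a2) + a2 / (2 * a1)

def phase_error(X1, X2, omega, tau):
    """Equation 6, wrapped to (-pi, pi]."""
    return np.angle(X1 * np.conj(X2) * np.exp(-1j * omega * tau))

def mask_bounds(rho, theta):
    """Equations 13 and 12: (minimum, maximum) of the optimal mask."""
    c = np.cos(theta)
    return (rho - 1) / (2 * rho - 1 - c), (rho + 1) / (2 * rho + 1 - c)

# Synthetic TF blocks; every range below is an arbitrary illustrative choice.
rng = np.random.default_rng(0)
n = 200
omega = 2 * np.pi * rng.uniform(100.0, 4000.0, n)   # angular frequency, rad/s
tau = 1e-4                                          # assumed TDOA, seconds
S = rng.uniform(0.5, 2.0, n) * np.exp(1j * rng.uniform(0, 2 * np.pi, n))
noise_mag = rng.uniform(0.2, 2.0, n)                # |N1| = |N2| = |N|
phi1 = rng.uniform(0, 2 * np.pi, n)
# Keep the delay-compensated noise phasors apart: if they coincide,
# equation 7 degenerates to 0/0.
phi2 = phi1 - omega * tau - rng.uniform(0.5, 2 * np.pi - 0.5, n)
N1 = noise_mag * np.exp(1j * phi1)
N2 = noise_mag * np.exp(1j * phi2)

X1 = S + N1                                         # equation 1
X2 = S * np.exp(-1j * omega * tau) + N2             # equation 2

rho = magnitude_disparity(X1, X2)
theta = phase_error(X1, X2, omega, tau)
angle_N = phi2 - phi1 - np.angle(X2) + np.angle(X1)

R_true = np.abs(S) ** 2 / noise_mag ** 2            # equation 4
R_eq7 = (rho - np.cos(angle_N)) / (rho - np.cos(theta))
eta_opt = R_true / (1 + R_true)                     # equation 10
eta_min, eta_max = mask_bounds(rho, theta)

print("largest relative error of equation 7:",
      np.max(np.abs(R_eq7 - R_true) / R_true))
```

In practice $\angle N_k(\omega)$ is not observable, so only the bounds returned by `mask_bounds` can be computed from the two microphone signals; the oracle value is used here purely as a consistency check.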

3. MASK SELECTION UNDER UNCERTAINTY

3.1. Minimum Mean Square Error Masking

All the terms in equation 11 can easily be obtained except $\angle N_k(\omega)$. However, if we assume that this term has a uniform distribution in the range $[0, 2\pi]$, or, equivalently, that the angles $\angle N_{1,k}(\omega)$ and $\angle N_{2,k}(\omega)$ are independent uniform random variables in the range $[0, 2\pi]$, we obtain the minimum mean-square error (MMSE) estimate $\eta_{\mathrm{mmse},k}(\omega)$ of $\eta_k(\omega)$ as follows:

$$\eta_{\mathrm{mmse},k}(\omega) = E\left[\eta_k(\omega) \mid \rho_k(\omega), \theta_k(\omega)\right] = 1 - \frac{\rho_k(\omega) - \cos(\theta_k(\omega))}{\sqrt{\left(2\rho_k(\omega) - \cos(\theta_k(\omega))\right)^2 - 1}} \tag{14}$$

where the expectation is performed over the unknown phase parameter $\angle N_k(\omega)$.
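The closed form of equation 14 can be compared against a direct average of the optimal mask over the unknown noise phase. The sketch below is an illustrative check, assuming the mask of equation 11 and a uniformly distributed $\angle N_k(\omega)$; the function names and the three test points are hypothetical choices, not values from the paper.

```python
import numpy as np

def optimal_mask(rho, theta, angle_n):
    """Equation 11 for a known noise phase term angle_n."""
    c_n = np.cos(angle_n)
    return (rho - c_n) / (2 * rho - c_n - np.cos(theta))

def mmse_mask(rho, theta):
    """Equation 14: expectation of equation 11 over a uniform noise phase."""
    a = 2 * rho - np.cos(theta)
    return 1 - (rho - np.cos(theta)) / np.sqrt(a * a - 1)

# Uniform grid over one period; for a smooth periodic integrand the plain
# average over such a grid converges very quickly to the true mean.
phi = np.linspace(0.0, 2 * np.pi, 4096, endpoint=False)

# (rho, theta) test points, chosen arbitrarily with rho > 1.
cases = [(1.05, 0.1), (1.5, 1.0), (3.0, 3.0)]
results = []
for rho, theta in cases:
    averaged = optimal_mask(rho, theta, phi).mean()
    closed_form = mmse_mask(rho, theta)
    results.append((rho, theta, averaged, closed_form))
    print(f"rho={rho}, theta={theta}: "
          f"averaged={averaged:.6f}, closed form={closed_form:.6f}")
```

The closed form is undefined at the single point where the magnitude disparity equals 1 and the phase error equals 0, because the term under the square root vanishes there; an implementation would need to handle that case separately.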

3.2. Max-Reward Max-Punish Masking

The MMSE mask of equation 14 has the drawback of being too lenient on large phase errors while being too aggressive on smaller phase errors. In this section, we propose an alternate masking function which is as lenient as possible on small phase errors while as aggressive as possible on large phase errors. This maximum-reward-maximum-punish (MRMP) masking function is simply a linear combination of the minimum and maximum values of the optimal mask, as defined by equations 13 and 12. It can be stated as:

$$\eta_{\mathrm{mrmp},k}(\omega) = \alpha_k(\omega)\, \eta_{\max,k}(\omega) + \left(1 - \alpha_k(\omega)\right) \eta_{\min,k}(\omega) \tag{15}$$

where $\alpha_k(\omega)$ is the mixing ratio of the upper and lower mask bounds. When $\alpha_k(\omega) = 1$, the upper bound of the mask is used and when $\alpha_k(\omega) = 0$, the lower bound is used. In this paper, the mixing ratio $\alpha_k(\omega) = e^{-10(\theta_k(\omega))^2}$ is used, where the sharpness of the exponential function is chosen to only reward very small phase-errors. It should be mentioned that the exact choice of the mixing ratio does not have a significant effect on the results as long as the sharpness of the mixing ratio is maintained.

4. TF MASKING EXAMPLE

In this section we present a simulated example of the MRMP TF masking technique. The setup of the simulation is shown in Figure 1, with two speakers (one is the speaker of interest, and the other is considered as noise) with a relative SNR of 0dB, and independent Gaussian noise added to each microphone to result in a 30dB speech-signal-to-Gaussian-noise ratio. No reverberations were modeled for this simulation and a sampling rate of 44.1kHz was used for the recordings. Also, the speakers were assumed to be far away from the microphones such that their signals arrived with equal intensities at the microphones. Two 1.5s recordings from a single male speaker were used as the signal of interest and noise.

The spectrograms of the speakers are shown in Figure 2. The spectrograms of the signal recorded by the two microphones are shown in Figure 3 and the resulting TF mask derived from the MRMP masking algorithm is shown in Figure 4 (left). Figure 4 (right) illustrates the application of the TF mask of Figure 4 (left) to the noisy spectrogram of the first microphone (Figure 3 left). As shown, the main characteristics of the speech signal of interest have been maintained while the main spectral characteristics of the noise source have been removed. The noise signal in fact has been degraded beyond the point of comprehension, resulting in the better perception of the speech signal of interest.

[Figure 1 labels: Speaker 1 (speaker of interest); Speaker 2 (noise); 40°; 0.2m microphone pair; speech intensity ratio = 0dB; independent Gaussian noise = 30dB; recording length ~= 1.5s]

Fig. 1. Simulation setup for the example of Figures 2 to 4.

Fig. 2. The Spectrogram (or Time-Frequency representation magnitudes) of the signal of interest (left) and the noise source (right). Both are male speakers.

Fig. 3. The resulting Spectrogram for the left microphone (left) and the right microphone (right) which include equal speaker and noise intensities plus 30dB independent Gaussian noise.

5. SIMULATIONS
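As a rough illustration of how the MRMP mask of equation 15 behaves on a two-source mixture, the sketch below builds one synthetic frame directly in the frequency domain. It is a toy under stated assumptions and is not comparable to the simulations reported in this paper: the target dominates the even-indexed bins and the interferer the odd-indexed bins, the frequency range and the interferer TDOA are picked so that the interferer's phase error stays away from multiples of $2\pi$, and the `eps` guard and the final clipping are additions for numerical safety. The function name `mrmp_mask` and all parameter values are hypothetical.

```python
import numpy as np

def mrmp_mask(X1, X2, omega, tau, sharpness=10.0, eps=1e-12):
    """MRMP mask of equation 15 for one frame of two-microphone spectra."""
    a1, a2 = np.abs(X1), np.abs(X2)
    rho = a1 / (2 * a2) + a2 / (2 * a1)                  # equation 5
    # Phase error of equation 6, wrapped to (-pi, pi] before it is squared.
    theta = np.angle(X1 * np.conj(X2) * np.exp(-1j * omega * tau))
    c = np.cos(theta)
    eta_max = (rho + 1) / (2 * rho + 1 - c)              # equation 12
    eta_min = (rho - 1) / (2 * rho - 1 - c + eps)        # equation 13, guarded
    alpha = np.exp(-sharpness * theta ** 2)              # mixing ratio
    return np.clip(alpha * eta_max + (1 - alpha) * eta_min, 0.0, 1.0)

rng = np.random.default_rng(0)
n = 256
omega = 2 * np.pi * np.linspace(500.0, 3500.0, n)        # rad/s
tau_target = 0.0       # assumed: speaker of interest at broadside
tau_noise = 2.5e-4     # assumed interferer TDOA in seconds

# Assumed sparsity pattern: target strong in even bins, interferer in odd bins.
target_bins = np.arange(n) % 2 == 0
amp_s = np.where(target_bins, 1.0, 0.05)
amp_v = np.where(target_bins, 0.05, 1.0)
S = amp_s * np.exp(1j * rng.uniform(0, 2 * np.pi, n))
V = amp_v * np.exp(1j * rng.uniform(0, 2 * np.pi, n))

# Two-microphone mixture following the model of equations 1 and 2,
# with the interferer playing the role of the noise terms.
X1 = S + V
X2 = S * np.exp(-1j * omega * tau_target) + V * np.exp(-1j * omega * tau_noise)

mask = mrmp_mask(X1, X2, omega, tau_target)
X1_masked = mask * X1                                    # equation 8

print("mean mask, target-dominated bins:    ", mask[target_bins].mean())
print("mean mask, interferer-dominated bins:", mask[~target_bins].mean())
```

Bins dominated by the target have a small phase error and a magnitude disparity near 1, so the mask stays close to the upper bound; bins dominated by the interferer have a large phase error, so the mixing ratio collapses and the mask falls to the lower bound. Real speech is far less cleanly separated across bins than this toy pattern assumes.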

A simulation was performed in order to analyze the SNR gain in relation to the angle of arrival difference between the speaker of interest and the noise source. Figure 5 (left) illustrates the setup of the two speakers, with the speaker of interest remaining stationary in front of the microphones and the noise speaker placed at nine different locations in the environment. The SNR gain results for the four different techniques are shown in Figure 5 (right). As shown, the MRMP technique results in the highest SNR gain (approximately 7dB when the noise is not in the same direction as the speaker of interest).

The real advantage of the proposed technique is its ability to increase the perceptual quality of noisy speech signals. In order to obtain a quantitative measurement of this increase, a speaker independent digit recognition experiment was performed using Sensory Inc.’s Voice Extreme module. The setup of the experiment is depicted in Figure 6.

Four different speakers spoke 22 digits each. Noise, in the form of a second speaker (also speaking digits and synchronized with the speaker of interest) was added to the original speech signal. The resulting digit recognition accuracies are shown in Figure 7. Again, at higher SNRs the MRMP technique performs better than the other techniques. At lower SNRs, the minimum mask does slightly better than the MRMP. This is understandable since the minimum mask is in fact more aggressive which pays off at lower SNRs but degrades the signal unnecessarily at higher SNRs. At 20dB, the MRMP result comes very close to the digit recognition accuracy of the noiseless speech signals (81.25% compared to 81.5%). At all SNRs, the MRMP technique performs much better than the superdirective (SD) beamforming technique of [4].

Fig. 4. The resulting MRMP TF mask (left) and the resulting Spectrogram after TF masking (right). For the mask, the darker areas correspond to higher mask values.

[Figure 5, left panel labels: Speaker 1 (speaker of interest); initial and final Speaker 2 (noise) locations, spaced 22.5° apart; 0.2m microphone pair; speech intensity ratio = 0dB; indep. Gaussian noise = 20dB; recording length ~= 3s. Right panel: Signal-to-Noise Ratio gain (in dB) versus angle of arrival of noise (in degrees, -90 to 90); curves: Min, Max, MMSE, MRMP]

Fig. 5. Setup for a simulation with a non-stationary noise source (left) and the resulting SNR gains (right).

[Figure 6 labels: Speaker 1 (speaker of interest); Speaker 2 (noise); 30°; 0.4m microphone pair; TF Masking System; Sensory Inc. Voice Extreme Digit Recognition Module; output "ONE, TWO, ..."; speech intensity ratio = 0dB; no independent Gaussian noise; recording length ~= 200s; sampling rate = 16kHz]

Fig. 6. Simulation setup for digit recognition experiments.

[Figure 7, both panels: Digit Recognition Accuracy Rate (%) versus Signal-to-Noise Ratio (dB, -10 to 20). Left panel curves: Min, Max, MRMP, MMSE, No Noise, No Masking. Right panel curves: MRMP, No Noise, No Masking, SD Beamforming]

Fig. 7. Average digit recognition results with four speakers for the proposed techniques (left) and in comparison to the superdirective beamforming technique of [4] (right).

6. CONCLUSIONS

A technique for the Time-Frequency masking of noisy speech signals obtained by two microphones was proposed. This technique resulted in a substantial increase in the SNR as well as the digit recognition accuracy rate, much more so than current techniques such as beamforming.

7. REFERENCES

[1] P. Aarabi. Genetic sensor selection enhanced independent component analysis and its applications to speech recognition. In Proceedings of the 5th IEEE Workshop on Nonlinear Signal and Info. Processing, June 2001.

[2] P. Aarabi and G. Shi. Multi-channel time-frequency data fusion. In Proceedings of the 5th International Conference on Information Fusion, August 2002.

[3] A. Bell and T. Sejnowski. An info.-max. approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159, July 1995.

[4] J. Bitzer, K.U. Simmer, and K.D. Kammeyer. Multi-microphone noise reduction techniques for hands-free speech recognition - a comparative study. In Proceedings of ROBUST’99, pp. 171-174, May 1999.

[5] B.J. Frey, T.T. Kristjansson, L. Deng, and A. Acero. Learning dynamic noise models from noisy speech for robust speech recognition. In Proceedings of NIPS 2001, December 2001.

[6] G. Shi. Robust speech separation using phase-error filtering, October 2002. M.A.Sc. Thesis, Department of Elec. and Comp. Engineering, University of Toronto.