BLIND SEPARATION OF SPEECH MIXTURES BASED ON NONSTATIONARITY

Dinh-Tuan Pham
Laboratoire de Modélisation et Calcul, BP 53X, 38041 Grenoble Cedex, France
[email protected]

Christine Servière, Hakim Boumaraf
Laboratoire des Images et des Signaux, BP 46, 38402 St Martin d'Hères Cedex, France
[email protected] [email protected]

ABSTRACT

This paper presents a method for blind separation of convolutive mixtures of speech signals, based on the joint diagonalization of the time-varying spectral matrices of the observation records and a novel technique to handle the problem of permutation ambiguity in the frequency domain. Simulations show that our method works well even for rather realistic mixtures in which the mixing filter has a quite long impulse response and strong echoes.

1. INTRODUCTION

Blind separation of realistic convolutive audio mixtures is still a largely unsolved problem [1], even though there have been several works on this topic [2, 3, 4]. The difficulty is that the mixing filter often has a quite long impulse response which contains strong peaks corresponding to echoes. In this context, the frequency domain approach seems more appropriate, since it reduces the problem to a set of independent separation problems of instantaneous mixtures, one for each frequency bin. But the long impulse response of the mixing filter requires working with very narrow bins, leading to poor accuracy of the spectral estimates. Further, the finite Fourier transform (computed by the FFT) tends to produce nearly Gaussian variables, and it is well known that blind separation does not work in the case of instantaneous Gaussian mixtures. Fortunately, speech signals are also highly nonstationary, so one can exploit this nonstationarity to separate their mixtures. This allows one to ignore their non-Gaussianity and use only their second order statistics [5], which leads to a joint diagonalization problem. Note that the idea of exploiting nonstationarity was introduced in [2], but these authors used an ad hoc criterion, unlike ours, which is based on the Gaussian mutual information. This criterion is related to the maximum likelihood, and our experience in the case of instantaneous mixtures shows that it is quite powerful. Such a criterion has in fact been considered in [4], but without using the nonstationarity idea.

This paper extends an earlier paper of the authors [6] by providing a new technique to solve the problem of permutation ambiguity in the frequency domain. This is the biggest challenge in blind separation of audio signals. In [6] a solution was proposed based on the continuity of the frequency response of the separation filter, but it has some weaknesses. Here a new solution, based on an idea similar to [7], is developed and shown to work better.

Thanks to the European Bliss Project for funding.

2. MODEL AND METHODS

We consider the blind separation of convolutive mixtures:

$$x_k(t) = \sum_{n=-\infty}^{\infty} \sum_{j=1}^{K} H_{kj}(n)\, s_j(t-n), \qquad 1 \le k \le K, \tag{1}$$

where $\{x_1(t)\}, \dots, \{x_K(t)\}$ denote the observed sequences, $\{s_1(t)\}, \dots, \{s_K(t)\}$ denote the source sequences and $\{H_{kj}(n)\}$ are the elements of the impulse response matrix $\{H(n)\}$ of the mixing filter. The goal is to recover the sources through another filtering operation $y(t) = \sum_{n=-\infty}^{\infty} G(n)\, x(t-n)$, where $x(t) = [x_1(t) \cdots x_K(t)]^T$ ($T$ denoting the transpose), $y(t) = [y_1(t) \cdots y_K(t)]^T$ is the recovered source vector and $\{G(n)\}$ is the impulse response matrix of the separation filter. In the blind context, the idea is to adjust the filter $\{G(n)\}$ such that the reconstructed sources $\{y_k(t)\}$ are as mutually independent as possible. By adopting a second order approach, we in fact focus only on the interspectra between the reconstructed sources at all frequencies. However, as we are dealing with nonstationary signals, we need to consider the time-varying spectra, that is the localized spectra around each given time point. It is precisely the time evolution of these spectra which helps us to separate the sources. Indeed, from (1), the time-varying spectrum of the vector observation sequence $\{x(t)\}$ is $S_x(t,f) = H(f)\, S_s(t,f)\, H^*(f)$, where $H(f) = \sum_{n=-\infty}^{\infty} e^{i n f 2\pi} H(n)$ denotes the frequency response of the mixing filter$^1$ at frequency $f$, $S_s(t,f)$ is the diagonal matrix whose diagonal elements are the time-varying spectra of the sources and $^*$ denotes the conjugate transpose. As in [6], the diagonalization criterion can be expressed as

$$\sum_t \Big\{ \frac{1}{2} \log\det\operatorname{diag}\big[G(f)\,\hat S_x(t,f)\,G^*(f)\big] - \log\big|\det G(f)\big| \Big\} \tag{2}$$

where $\operatorname{diag}(\cdot)$ denotes the operator which builds a diagonal matrix from its argument and the summation is over the time points of interest. This criterion is to be minimized with respect to $G(f)$ to obtain the frequency response of the separation filter. We have already developed a simple and very fast algorithm to minimize this criterion [8].

In practice, the spectrum $\hat S_x(t,f)$ is estimated over a (high resolution) grid of frequencies. We follow the same method as in [6], subdividing the data sequence into blocks (actually half overlapping) and estimating the spectrum as if the data inside each block came from a stationary process. In each block we compute the FFT, form the periodogram and then average it over consecutive frequencies to estimate the spectrum. Specifically, the periodogram of the $k$-th data block, starting at $n_k + 1$ and of length $N$, is

$$P_x(k,f) = \frac{1}{N} \Big[ \sum_{t=n_k+1}^{n_k+N} x(t)\, e^{2\pi i f t} \Big] \Big[ \sum_{t=n_k+1}^{n_k+N} x(t)\, e^{2\pi i f t} \Big]^* .$$

The frequencies$^2$ are taken to be of the form $f = n/N$, $n = 0, \dots, N/2$, with $N$ chosen to be a power of 2 to take advantage of the fast Fourier transform. The time-varying spectrum at the midpoint $t_k = n_k + (N+1)/2$ of the $k$-th block is then simply estimated by

$$\hat S_x\Big(t_k, \frac{n}{N}\Big) = \frac{1}{2m+1} \sum_{l=n-m}^{n+m} P_x\Big(k, \frac{l \bmod N}{N}\Big),$$

where $m$ is a bandwidth parameter.

$^1$ For simplicity, we use the same symbol to denote the impulse or the frequency response of a filter, depending on its argument.
$^2$ Actually the relative frequency with respect to the sampling frequency.

3. THE PERMUTATION AMBIGUITY PROBLEM

The advantage of the frequency domain approach, as explained in the introduction, comes however at a price. The joint diagonalization only provides the matrices $G(f)$ up to a scale change and a permutation: if $G(f)$ is a solution then so is $\Pi(f) D(f) G(f)$ for any diagonal matrix $D(f)$ and any permutation matrix $\Pi(f)$. Thus, one only gets a separation filter with frequency response matrix of the form $G(f) = \Pi(f) D(f) \hat H^{-1}(f)$, where $\hat H(f)$ is a consistent estimator of $H(f)$ but $\Pi(f)$ and $D(f)$ are arbitrary permutation and diagonal matrices. The scale ambiguity is however intrinsic to the blind separation of convolutive mixtures and cannot be lifted.

In [6] we proposed a method to solve the permutation ambiguity problem based on the continuity of the frequency response of the separation filter, which is more or less equivalent to constraining the separating filter to have short support in the time domain [4, 3]. Although this method can detect most frequency permutation jumps, its weakness is that even a single wrong detection can cause wrong permutations over a large block of frequencies. In this paper we propose a complementary method, based on an idea similar to that in [7], which introduces some frequency coupling [3]. The main idea is that, for a speech signal at least, the energy over different frequency bins appears to vary in time in a similar way, up to a gain factor. For example, if a time block contains a long period of pause, one would expect its energy to be nearly zero in all frequency bins. Thus we consider the "profiles" $E(f, k; \cdot)$, defined as the logarithm of the $k$-th diagonal element of $G(f) S_x(t,f) G^*(f)$ viewed as a function of the time $t$, and we assume that if the profiles $E(f', k'; \cdot)$ and $E(f'', k''; \cdot)$ come from the same source, they are similar up to an additive constant. To check this similarity, the first idea which comes to mind is to consider the correlations. But this is awkward and very time consuming, as there are $K^2 L(L-1)/2$ correlations to be computed, $L$ denoting the number of frequency bins. Here we propose another method with a computational cost growing only linearly with $L$. First we center all profiles by subtracting their time averages, to get rid of the additive constant. The notation $E'$ will be used for the centered profiles. Suppose that we knew the profile $E'_k(\cdot)$ of the $k$-th source (assumed to be frequency independent); then we could find a permutation $\pi_1(f), \dots, \pi_K(f)$, which specifies that $E(f, \pi_k(f), \cdot)$ comes from the $k$-th source, by minimizing the criterion

$$\sum_{k=1}^{K} \big\| E'(f, \pi_k(f), \cdot) - E'_k(\cdot) \big\|^2 \tag{3}$$

over all possible permutations of $\{1, \dots, K\}$. But we do not know $E'_k(\cdot)$, therefore we "estimate" it by $\bar E'(\cdot, \pi_k(\cdot), \cdot)$, the average over all frequencies $f$ of $E'(f, \pi_k(f), \cdot)$. Thus we end up minimizing (3) with $E'_k(\cdot)$ set to $\bar E'(\cdot, \pi_k(\cdot), \cdot)$. This can be done iteratively. We start with some initial profiles, minimize (3) with respect to the permutations $\pi_k(f)$, then reset $E'_k(\cdot)$ to $\bar E'(\cdot, \pi_k(\cdot), \cdot)$ and minimize again, and so on. It can be seen that this iteration decreases (3) at each step and therefore converges to a minimum (possibly local). To construct the initial profiles, we apply the "permutation correction method" in [6] and proceed as if there were no permutation error, that is we simply take $E'_k(\cdot)$ to be the average, over all frequencies $f$, of $E'(f, k, \cdot)$. We are currently investigating other ways to improve the initialization of the profiles.
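The iteration described above can be illustrated by a minimal NumPy sketch. It is not the authors' implementation: the function name `align_permutations`, the array layout `E[f, k, t]` and the stopping rule (stop when no permutation changes) are assumptions made for illustration, the search over permutations is exhaustive (reasonable only for small K), and the profiles are assumed to have already passed through the preliminary correction of [6], so the initial permutation is the identity.

```python
import itertools
import numpy as np


def align_permutations(E, n_iter=50):
    """Align separator outputs across frequency bins from log-energy profiles.

    E : array of shape (L, K, T); E[f, k, t] is the profile E(f, k; t) for
        frequency bin f, separator output k and time block t (assumed layout).
    Returns (perm, ref) where perm[f, k] = pi_k(f) is the output of bin f
    assigned to source k, and ref[k] is the final reference profile E'_k.
    """
    L, K, T = E.shape
    # Centre each profile in time to remove the additive (gain) constant.
    Ec = E - E.mean(axis=2, keepdims=True)
    # All K! candidate permutations (exhaustive search, small K only).
    cands = np.array(list(itertools.permutations(range(K))))  # (K!, K)
    rows = np.arange(L)[:, None]
    # Initialisation: proceed as if there were no permutation error.
    perm = np.tile(np.arange(K), (L, 1))
    for _ in range(n_iter):
        # Reference profiles: average over f of E'(f, pi_k(f), .).
        ref = Ec[rows, perm].mean(axis=0)            # (K, T)
        # cost[f, p] = sum_k ||E'(f, cands[p, k], .) - ref[k]||^2, as in (3).
        diff = Ec[:, cands, :] - ref                 # (L, K!, K, T)
        cost = (diff ** 2).sum(axis=(2, 3))          # (L, K!)
        new_perm = cands[cost.argmin(axis=1)]        # best permutation per bin
        if np.array_equal(new_perm, perm):
            break                                    # no change: fixed point
        perm = new_perm
    ref = Ec[rows, perm].mean(axis=0)
    return perm, ref
```

Each pass costs on the order of L·K!·K·T operations, so the cost grows linearly with the number of frequency bins L, in line with the remark above. Like the iteration in the text, the sketch may stop at a local minimum, which is why the quality of the initial profiles matters.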

4. DESIGN AND SIMULATION RESULTS

We considered mixtures of real sound sources obtained from pre-measured room impulse responses. These responses are obtained from the Matlab routine roomix.m of Alex Westner (found at http://sound.media.mit.edu/ica-bench), which uses a library of impulse responses measured in a real 3.5 m × 7 m × 3 m conference room (using 8 preset positions). The sources are speech signals sampled at 11 kHz. The provided room responses, however, correspond to signals sampled at 22 kHz, so we have "downsampled" them to 11 kHz. These responses are quite long, up to 8192 lags, but become quite small at high lags, so that we can truncate them to 256 lags while still retaining all echoes. They are displayed in figure 1. Figure 2 shows the frequency response of H21 to illustrate its rapid variation as a function of frequency.
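As a rough illustration of how such a test mixture is formed, the sketch below applies the convolutive model (1) with causal FIR mixing filters truncated to a finite number of lags. The function name `convolutive_mix` and the array layout `H[k, j, n]` are assumptions made for this example. The signals and filters in the usage part are random placeholders, not the speech sources or the roomix.m responses used in the paper.

```python
import numpy as np


def convolutive_mix(sources, H):
    """Mix K sources through a K x K matrix of causal FIR filters.

    sources : array (K, T), one source sequence s_j(t) per row.
    H       : array (K, K, P), H[k, j, n] = H_kj(n) for lags n = 0..P-1
              (assumed layout; lags outside this range are taken as zero).
    Returns x of shape (K, T) with x_k(t) = sum_j sum_n H_kj(n) s_j(t - n).
    """
    K, T = sources.shape
    x = np.zeros((K, T))
    for k in range(K):
        for j in range(K):
            # Full linear convolution, truncated to the observation length.
            x[k] += np.convolve(sources[j], H[k, j])[:T]
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=(2, 32768))                  # placeholder sources
    # Placeholder 256-lag responses with an exponentially decaying envelope.
    H = rng.normal(size=(2, 2, 256)) * np.exp(-np.arange(256) / 40.0)
    x = convolutive_mix(s, H)
    print(x.shape)
```

Truncating the convolution output to the first T samples amounts to assuming the sources are zero before the start of the record.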

Figure 1: Impulse response of the considered filter (four panels: impulse responses h11, h12, h21 and h22, plotted against lag from 0 to 250).

Figure 2: Frequency response of H21 for the considered filter; left panel = real part, right panel = imaginary part (frequency axis from 0 to 5000 Hz).

We take as block length N = 2048 with an overlap of half a block (yielding 31 time blocks) and estimate the spectral matrices by averaging over 5 frequencies (m = 2). As in [6], we consider the performance index

$$r(f) = \left| \frac{(GH)_{12}(f)\,(GH)_{21}(f)}{(GH)_{11}(f)\,(GH)_{22}(f)} \right|^{1/2},$$

where $(GH)_{ij}(f)$ is the $ij$ element of the matrix $G(f)H(f)$. For a good separation, this index should be close to 0 or infinity (in the latter case the estimated sources are permuted). When r crosses the value 1, this means that a permutation has occurred. Figure 3 plots min(r, 1) and min(1/r, 1) versus frequency (in Hz), before and after applying the new method of frequency permutation correction (but always with a preliminary correction by the method in [6]). One can see that the new method eliminates many permutation errors (relative to a global permutation) and, more importantly, these errors now occur in isolated frequency channels instead of in frequency bands as before.

Figure 3: Separation index (solid red) and its inverse (blue dots) truncated at 1, before (upper panel) and after (lower panel) applying the new permutation correction.
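The settings quoted in this section (N = 2048, half-overlapping blocks, m = 2) and the index r(f) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names `block_spectra` and `separation_index` and the array layouts are hypothetical, the data are assumed real-valued, and the index is written for the two-source case only.

```python
import numpy as np


def block_spectra(x, N=2048, m=2):
    """Time-varying spectral matrices from half-overlapping blocks.

    x : real array (K, T). Returns S of shape (n_blocks, N//2 + 1, K, K),
    where S[b, n] estimates the spectral matrix of block b at frequency n/N.
    """
    K, T = x.shape
    S = []
    for n_k in range(0, T - N + 1, N // 2):
        # np.fft.fft uses exp(-2*pi*i*f*t); conjugating gives the exp(+...)
        # convention of the text (valid for real data; the phase due to the
        # block origin cancels in the periodogram).
        X = np.conj(np.fft.fft(x[:, n_k:n_k + N], axis=1))    # (K, N)
        P = np.einsum('il,jl->lij', X, X.conj()) / N          # periodogram
        acc = np.zeros_like(P)
        for d in range(-m, m + 1):
            acc += np.roll(P, -d, axis=0)    # adds P[(n + d) mod N] at row n
        S.append(acc[: N // 2 + 1] / (2 * m + 1))
    return np.array(S)


def separation_index(G, H):
    """Index r(f) for K = 2; G and H are arrays (F, 2, 2) of frequency responses."""
    GH = G @ H
    num = np.abs(GH[:, 0, 1] * GH[:, 1, 0])
    den = np.abs(GH[:, 0, 0] * GH[:, 1, 1])
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.sqrt(num / den)
```

With a record of 32768 samples, `block_spectra` returns 31 blocks, which matches the block count quoted above if the records used were of about that length (an assumption; the record length is not stated in the text). Values of `separation_index` near 0 indicate separation without permutation, and very large values indicate separation with the two outputs swapped.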

Figure 4: The mean profiles of the two estimated sources (solid) and ± their standard deviations (- - and -·).

The mean profiles of the two sources are shown in figure 4 together with their standard deviations, indicated by the upper and lower curves. They differ by less than the standard deviations, which explains the limited ability of the method to eliminate permutation errors. Still, enough of them have been eliminated. Using a longer data record, which produces longer profiles, also helps. Note that the profiles are computed from the estimated sources, which can be contaminated by the other sources to some degree. The profiles of the exact sources (not shown) are somewhat more separated.

Figure 5: Impulse response of the global filter $(G * H)(n)$ (four panels: impulse responses of $(G * H)_{11}$, $(G * H)_{12}$, $(G * H)_{21}$ and $(G * H)_{22}$, plotted against lag from −1000 to 1000).

The impulse response of the global filter $(G * H)(n)$ is shown in figure 5. One can see that $(G * H)_{11}(n)$ is much smaller than $(G * H)_{12}(n)$ and $(G * H)_{22}(n)$ is somewhat smaller than $(G * H)_{21}(n)$, meaning that the sources are well separated (and permuted). This can be confirmed by looking at the original sources, the mixtures and the separated sources, displayed in figure 6 (noting that there is a global permutation).

Figure 6: Sources, mixtures and estimated sources (panels: Source 1, Source 2, Mixture 1, Mixture 2, Separated source 2, Separated source 1).

5. CONCLUSION

We have introduced a method for blind separation of speech signals, which exploits the specific features of such signals: their nonstationarity and the presence of pauses. Our method is able to separate convolutive mixtures with fairly long impulse responses containing strong echoes.

6. REFERENCES

[1] R. Mukai, S. Araki, and S. Makino, "Separation and dereverberation performance of frequency domain blind source separation," in Proceedings of ICA 2001 Conference, San Diego, USA, Dec. 2001, pp. 230–235.

[2] L. Parra and C. Spence, "Convolutive blind source separation of non-stationary sources," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 3, pp. 320–327, May 2000.

[3] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," in International Workshop on Independence & Artificial Neural Networks, University of La Laguna, Tenerife, Spain, Feb. 1998.

[4] H.-C. Wu and J. C. Principe, "Simultaneous diagonalization in the frequency domain (SDIF) for source separation," in Proceedings of ICA 1999 Conference, Aussois, France, Jan. 1999, pp. 245–250.

[5] D. T. Pham and J.-F. Cardoso, "Blind separation of instantaneous mixtures of non stationary sources," IEEE Trans. Signal Processing, vol. 49, no. 9, pp. 1837–1848, 2001.

[6] D. T. Pham, C. Servière, and H. Boumaraf, "Blind separation of convolutive audio mixtures using nonstationarity," in Proceedings of ICA 2003 Conference, Nara, Japan, Apr. 2003.

[7] J. Anemüller and B. Kollmeier, "Amplitude modulation decorrelation for convolutive blind source separation," in Proceedings of ICA 2000 Conference, Helsinki, Finland, June 2000, pp. 215–220.

[8] D. T. Pham, "Joint approximate diagonalization of positive definite matrices," SIAM J. on Matrix Anal. and Appl., vol. 22, no. 4, pp. 1136–1152, 2001.