Modeling the Short Time Fourier Transform Ratio and Application to Underdetermined Audio Source Separation Dinh-Tuan Pham1 , Zaher El-Chami2 , Alexandre Gu´erin2 , and Christine Servi`ere3 1
Laboratory Jean Kuntzmann, CNRS - INPG - UJF (Grenoble, France),
[email protected] 2 Orange Labs, (Lannion, France) {zaher.elchami,alexandre.guerin}@orange.ft-group.com 3 GIPSA-lab, CNRS - INPG (Grenoble, France)
[email protected] Abstract. This paper presents the theoretical background for the Model Based Underdetermined Source Separation presented in [5]. We show that for a given frequency band, in contrast to customary assumption, the observed Short-Time Fourier Transform (STFT) ratio coming from one source is not constant in time, but is a random variable whose distribution we have obtained. Using this distribution and the TimeFrequency (TF) “disjoint” assumption of sources, we are able to obtain promising results in separating four audio sources from two microphones in a real reverberant room.
1
Introduction
Blind Source Separation (BSS) is a well known technique to recover multiple original sources from their mixtures. Applications of BSS can be found in different fields like acoustic, biomedical and even economic. In the acoustic domain, BSS aims at separating speech and/or audio sources that have been mixed and then captured by multiple microphones in usually reverberant environment (the problem is referred as the cocktail party). The difficulty of the problem lies in the nature of the mixing system which is not instantaneous but convolutive (due to numerous reflections) with long impulse responses. In real reverberant situation, such responses can have thousands of taps even with small sampling rate (2400 taps for 60-dB reverberation time4 of 300ms with 8KHz sampling rate). The problem gets harder in the underdetermined case (number of sources greater than number of microphones): in this case, no unique algebraic solution exists unless additional information on the mixing matrix or on the sources is given. Different approaches are proposed in the literature to separate underdetermined audio mixtures. Most of these methods operate in the frequency domain 4
this is the time for which the sound energy has decayed by 60dB
2
(FD). The time-domain mixed signals xj (t) observed at the microphones are converted into frequency-domain time-series signals by a short-time Fourier transform (STFT): Xj (t, ω) =
L−1 X k=0
w(k)xj (t + k)e−iωk =
N X
Xj,p (t, ω),
(1)
p=1
where w(·) is an analysis window (e.g. Hamming) and Xj,p (t, ω) is the contributed STFT of the pth source to the j th sensor, that is of xj,p (t) = (aj,p ?sp )(t), aj,p (·) denoting the filter impulse response describing the effect of the pth source on the j th microphone and ? denoting the convolution. Working in the FD has two major reasons. Firstly, it permits transforming convolutive mixtures into instantaneous ones in each frequency band. Specifically Xj,p (t, ω) may be approximated by Aj,p (ω)Sp (t, ω) where Aj,p is the frequency response of the filter aj,p and Sp (t, ω) is the STFT of sp (t). Secondly, the sparseness of speech signals becomes prominent in the Time-Frequency (TF) domain. This feature plays a crucial role in the underdetermined case: It entails that in the TF domain, the quasi-support5 of the source would be nearly disjoint, hence in each TF bin there is a high probability that at most one source is dominant. Thus the sources can be separated by binary masks. Still it remains the problem of how to allocate each TF bin to the corresponding source. Applying the “disjoint” assumption, at each bin (t, ω), the last sum in (1) contains one single non negligible term, hence: Xj (t, ω) ≈ Xj,q (t, ω) ≈ Ajq (ω)Sq (t, ω) (2) where q denoted the dominant source index at (t, ω). The ratio of the observed STFT at this point would then be: R(t, ω) =
X1,q (t, ω) A1q (ω) X1 (t, ω) ≈ ≈ X2 (t, ω) X2,q (t, ω) A2q (ω)
(3)
Being frequency and source dependent (but time independent), this ratio can be used in each frequency band to identify the sources in each TF bin. Most existing methods follows this approach. Note that, the modulus and argument of R(t, ω) are no other than the traditional Interchannel Level Difference (ILD) and Interchannel Phase Difference (IPD). As we can see, underdetermined audio separation is based on two approximation: the sparseness or more exactly the “disjoint” assumption, which results in the first approximation in (2) and (3), and the second approximation in (2) or more exactly in (3). The validity of the sparseness assumption can be found in [2, ?,4]. In this paper we will show that the second approximation in (2) and (3) unfortunately are not valid in realistic situations where the room impulse response often exceeds the STFT window length L. More precisely, we will show that for a fixed frequency band ω and even with only one active source sp (t, ω), the observed STFT ratio X1,p (t, ω)/X2,p (t, ω) can not be considered as constant 5
by quasi-support we mean the set of TF bins for which the signal is not negligible
3
in time, but is a random variable with an infinite variance. The logarithm of this ratio however has a finite variance and its mean and variance would tend to log[A1,p (ω)/A2,p (ω)] and 0 as L → ∞ (as expected). But for L not large with respect to the impulse response length, the variance can be appreciable. We will derive formulas for the mean and variance of this “log ratio” and also its probability density function. Having this “log ratio” model and after estimating its parameters by an Expectation Maximization (EM) algorithm, soft masks can be derived to separate the sources. Note that Balan and Rosca [1] has also considered a probabilistic model for the observed STFT ratio, but they still rely on the approximation (3) while relaxing somewhat the “disjoint assumption”.
2
The STFT ratio distribution
We consider the case where only one source sp is active at the two microphones: xj,p (t) = (aj,p ? sp )(t) =
M −1 X
ajp (k)sp (t − k),
j = 1, 2.
k=0
The time-domain observed signals xj,p (t) are converted into frequency-domain time-series Xj,p (t, ω) signals using the STFT: Xj,p (t, ω) =
L−1 X
w(k)xj,p (t + k)e−ikω =
k=0
0 X
w(−k)eikω xj,p (t − k)
k=1−L
As we can see from the above second right hand side, Xj,p (t, ω) can be considered as the output of the filter wω : wω (k) = w(−k)eikω applied to the signal xj,p (t). At its turn, xj,p (t) is the filtered version of sp (t) by the filter aj,p . Thus, for a given ω, Xj,p (t, ω) is the filtered version of sp (t) by the filter aj,p ? wω , which has Fourier transform W (ω − ·)Aj,p (·), W denoting the Fourier transform of w. 2.1
Variance and Correlation
Note that the filter ajp ? wω has support in [1 − M, L − 1]. Therefore if we assume that the signal sp can be considered stationary in the time frame [t − L + 1, t + M − 1] with power spectral density denoted by Dp (t, ·) 6 , then X1,p (t, ω) and X2,p (t, ω) have variances Z π dλ 2 σj,p (t, ω) = E|Xj,p (t, ω)|2 = |W (ω − λ)|2 |Ajp (λ)|2 Dp (t, λ) , j = 1, 2 2π −π and (complex) covariance ∗ cov[X1,p (t, ω), X2,p (t, ω)] = E|X1,p (t, ω)X2,p (t, ω)] Z π dλ = |W (ω − λ)|2 A1p (λ)A∗2p (λ)Dp (t, λ) . 2π −π 6
This assumption might not be quite realistic for large L + M − 2, however the calculations might still be valid by interpreting Dp (t, ·) as the average spectral density
4
Note that for large L, |W |2 will have the shape of a Dirac: it is negligible outside in interval centered at 0 of length of the order 1/L. Thus if we assume that the spectral density does not vary much in a interval of frequencies of length about 1/L, we may pull the term Dp (t, λ) outside the above integrals, which yields the approximations to th complex correlation between X1,p (t, ω) and X2,p (t, ω) Rπ |W (ω − λ)|2 A1p (λ)A∗2p (λ)dλ −π Rπ ρp (t, ω) ≈ R π |W (ω − λ)|2 |A1p (λ)|2 dλ −π |W (ω − λ)|2 |A2p (λ)|2 dλ]1/2 −π and to their variance ratio Rπ 2 |W (ω − λ)|2 |A1p (λ)|2 dλ σ1,p (t, ω) −π R ≈ . π 2 (t, ω) σ2,p |W (ω − λ)|2 |A2p (λ)|2 dλ −π The above approximations are reasonable and will be made. It follows that the variance ratio and the correlation don’t vary in time and depend only on the chosen analysis window and the filter impulse responses. Note that if we make the unrealistic assumption that M is much smaller than L, then A1p and A2,p also don’t vary much in a interval of length about 1/L and thus |ρp (t, ω)| ≈ 1. This means that X1p (t, ω)/X1p (t, ω) is constant in time and we get the second approximation in (2) and (3) mentioned in the introduction. 2.2
Ratio distribution
To simplify the notations, in this section we will omit the ω variable since it remains fixed throughout. By the approximations in previous section, the variance ratio and correlation do not depend on time, hence the time variable will also be dropped in their symbols. We assume that L is large enough so that, by the Central Limit Theorem, the pair X1,p (t) and X2,p (t) can be considered 2 2 as Gaussian (complex circular) with variance σ1,p (t) and σ2,p (t) and correlation ρp . To compute the distribution of X1,p (t)/X2,p (t), a trick is used to take the correlated item X2,p (t) out of X1,p (t). It can be checked that the difference X1,p (t)/σ1,p (t) − ρp X2,p (t)/σ2,p (t) is uncorrelated with, hence independent of, X2,p (t) and has variance 1 − |ρp |2 . Thus: X1,p (t)/σ1,p (t) = ρp X2,p (t)/σ2,p + (1 − |ρp |2 )1/2 R1|2,p (t) where R1|2,p (t) is a complex standard normal variable independent of X2,p (t). Therefore q R1|2,p (t) σ1,p X1,p (t) R(t) = = ρp + 1 − |ρp |2 (4) X2,p (t) σ2,p X2,p (t)/σ2,p (t) Both R1|2,p (t) and X2,p (t)/σ2,p (t) are complex standard normal variables, hence √ are distributed as V eiU where V is an exponential variable with unit mean and U is uniform in [0, 2π]. Thus the right hand side of (4) has the same distribution as: q √ iU σ1,p ρp + 1 − |ρp |2 Re σ2,p
5
where R is the ratio of two independent exponential variables of same mean and U is uniform in [0, 2π] (since the difference modulo 2π of 2 independent uniform variables in [0, 2π] is again uniform in [0, 2π]). Explicit calculation shows that R admits the density function 1/(1 + r)2 , hence it has an infinite mean but its square root has a finite mean. Therefore R(t) has infinite variance. To avoid this problem, we propose to consider its logarithm, which is distributed as √ (5) log(σ1,p /σ2,p ) + i arg ρp + log[|ρp | + (1 − |ρp |2 )1/2 ReiU ] (arg z denoting the argument of the complex number z: z = |z|ei arg z ), since (U − arg ρp ) mod 2π has the same distribution as U . The first two terms in (5) are non random and correspond to the √ mean of log[X1,p (t)/X2,p (t)] since the remainder term log[|ρp | + (1 − |ρp |2 )1/2 ReiU ] has zero mean as it can be seen later. The joint density of the real and imaginary part of this complex variable can be obtained from the known joint density of the independent √ random variables R, U and the transformation r, u 7→ log[|ρp | + (1 − |ρp |2 )1/2 reiu ] = x + iy. An explicit (and tedious) calculation yields this joint density: 1 − |ρp |2 1 . (6) p|ρp | (x, y) = 4π (cosh x − |ρp | cos x)2 This density is an even function of both of its arguments, hence the corresponding complex variable has zero mean and uncorrelated real and imaginary parts. Finally, the time set of log |R(t)| and arg R(t) have the joint density: p|ρp | (x − log |rp |, y − arg rp )
(7)
where rp = (σ1,p /σ2,p )ei arg ρp . Note that rp and |ρp | may be viewed respectively as standing for the source sp position in space and for the reverberation degree of the acoustic path between sp and the set of microphones, so that |ρp | tends to one when the mixture tends to be anechoic.
3
Application to sparse source separation
In [5], we have used the density model given in (6) – (7) to separate audio sources in the underdetermined case. The method and results are briefly described here. Under the “disjoint” assumption, at most one source is dominant in each TF bin, but we don’t know which one it is. Thus given the one source log ratio density function in (6) –(7), we adopt the following model for the distribution of the real and imaginary parts of log R(t, ω) for a given ω: p(x, y|ρρ, r , µ) =
N X
µp p|ρp | (x − log |rp |, y − arg rp )
p=1
where ρ = (|ρ1 |, . . . , |ρN |), r = (r1 , . . . , rN ) and µ = (µ1 , .., µN ) are the frequency dependent model parameters (the frequency variable ω is again omitted for simplicity), µp denoting the a priori probability that the pth source being dominant
6
in the considered frequency band. As no model is known for the parameters rp and |ρp | as function of frequency or that such a model is too complex, we choose to estimate them independently at each frequency band, for example, by the maximum likelihood method based on the data set {log |R(t)|, arg R(t)} and the model (6) – (7). Once estimated, the a posteriori probability that pth source is dominant at the TF point (t, ω) is given by: µp p|ρp | [log |R(t)/rp |, arg R(t)/rp ] . πp (t) = PN i=1 µi p|ρi | [log |R(t)/ri |, arg R(t)/ri ] These a posteriori probabilities can be used to construct a binary or a soft mask to extract the pth source. To maximize the likelihood, the EM (expectation-maximization) algorithm can be used. The hidden variable is taken as the indicator variable that indicates which source is dominant at each TF point. Of course, working in FD has the permutation ambiguities problem. These ambiguities should be aligned properly so that the separated frequency components that originate from the same source are grouped together, using for example the methods in [3].
4
Experiments and Results
Room dimension Source Signals
Loudspeakers height 1.6m
1.2 m
50° 15°
1.1m
1m
1.5 m
0.8m
-15° -45°
Sampling Rate STFT frame size STFT frame shift STFT frame type Microphone distance Microphone height Microphone type
3,55 x 4,45 x 2,5 m Speeches of 10s 4 male and 4 female 16kHz 2048 points 512 points Hanning Window D=1m and D=5cm 1,4m Omnidirectional
Fig. 1. Recording arrangement used for development data.
We adopt the experimental setup (room dimensions and source/microphone positions) as described in Fig. 1. Further, development data can be downloaded from the Signal Separation Evaluation Campaign site at [6]. First, in order to highlight the bad approximation of X1,p (t, ω)/X2,p (t, ω) by A1p (ω)/A2p (ω) we compute numerically the correlation ρp (ω) between the twos STFT output X1,p (t, ω) and X2,p (t, ω) and their standard deviation ratio σ1,p (ω)/σ2,p (ω). We choose p = 1, hence the computation is based on the mixing filters which relate the first source to the set of microphones (D=5cm, source angle = −45◦ ). Theoretically, the “log ratio” log[X1,1 (t, ω)/X2,1 (t, ω)] has means log[σ1,1 (ω)/σ2,1 (ω)] + i arg ρ1 (ω) and its variability is controlled by
7
p ρ1 (ω) but is better measured by 1 − |ρ1 (ω)|2 , since the later (by numerical calculation) varies more or less linearly with the standard deviationspof its real and imaginary parts. Therefore we plot in Figure 2 the histogram of 1 − |ρ1 |2 (left panel) and the scatter plot of log[(σ1,1 /σ2,1 )ei arg ρ1 /(A11 /A21 )] (right panel, real and imaginary parts as abscissa and ordinate). One can see that for many frequency p bins the variability of the “log ratio” is significant. In fact half of them have a 1 − |ρ1 |2 exceeding 0.4554. This corresponds to a standard deviation of log |X1,1 (t, ω)/X2,1 (t, ω)| of 0.53 and a difference of ±0.53 in the logarithm corresponds to a 70% increase or 41% decrease in value. The means of the “log ratio” can also be quite different from log[A11 (ω)/A21 (ω)] for some frequencies. Scatter plot of log (σ1,1/σ2,1)ei arg ρ1/(A11/A21)
Histogram of (1 − |ρ1|2)1/2 3 120
2
100 1 80 0 60 −1
40
−2
20 0
−3 0.2
0.4
0.6
0.8
p Fig. 2. Histogram of 1 − |ρ1 |2 , left log[(σ1,1 /σ2,1 )ei arg ρ1 /(A11 /A21 )], right panel.
log.01
panel,
log .1
and
0
scatter
log10
plot
of
Next, we examine the performance of the algorithm in section 3. We measure it by the improvement of the Signal to Interference Ratio (SIR) and Signal to distortion Ratio (SDR) as in [4]. For both criteria, larger number refers to better results. SIR improvement is calculated by the difference between the mean of the Input and Output SIR: P 2 2 1X t |xj,p (t)| P P 10 log10 InputSIRp = 2 2 j=1 t| q6=p xj,q (t)| P 2 2 1X t |yj,pp (t)| P P OutputSIRp = 10 log10 2 2 j=1 t| q6=p yj,qp (t)| where yj,qp is the inverse STFT of the observation Xj,q masked by Mp 7 . The SDR for output p is defined by the mean of the power ratio between xj,p and the distortion in yj,pp P 2 2 1X t |αp xj,p (t)| SDRi = 10 log10 P 2 2 j=1 t |yj,pp (t) − αp xj,p (t)| 7
Mp (t, ω) = πp (t, ω) is the separation soft mask that extracts the source p
8
where αp is an adjusting factor for P amplitude differences P and is optimally obtained by least-mean-square: αp = ( t yj,pp (t)xj,p (t))/ t |xj,p (t)|2 . Mixture
4 females live recording 4 males live recording D = 5cm D = 1m D = 5cm D = 1m Over all Source s1 s2 s3 s4 s1 s2 s3 s4 s1 s2 s3 s4 s1 s2 s3 s4 SIR 7.8 8.3 8 7.1 9.3 7 9.9 9.8 7.8 6.6 7.3 7.2 8.2 6.8 9 8.4 8.1 SDR 6.7 1.5 6.4 8.8 6.8 7.1 6.7 5.9 4.5 3 7.8 7.2 5.2 6.4 6.7 6.7 6.1
Table 1. Results for live recording with two microphone spacing (5cm, 1m) and two different types of speaker; Overall performance of the separation is presented in the last column.
Table 1 displays the performance results. As we can see the new algorithm gives promising results in a very chalenging situation (average SIR improvement of 8.1dB). These results compare favorably with other previous methods (see [5]). Our method can make use of soft mask which gives less distortion in the separated sources, in another term, less artefacts as it can be seen in the obtained SDR average (6.1dB).
5
Conclusion
In this paper, we show the non validity of approximating the one source observed STFT ratio by the filter frequency response ratio. Instead, the first ratio is a random variable, the distribution of which we have obtained in close form. A separation procedure based on this distribution model has been briefly described. Experiments are conducted in a real reverberant room where results show good performance for the new proposed algorithm.
References 1. R. Balan and J. Rosca (2000) Statistical Properties of STFT Ratios for two Channel Systems and Applications to Blind Source Separation. In Proc. iCA 2000: 429–434. Helsinki, Findland, June 2000. 2. 0. Yilmaz and S. Rickard (2004). Blind separation of speech mixtures via timefrequency masking. IEEE Transactions on Signal Processing 52(7): 1830–1847 3. H. Sawada, S. Araki and S. Makino (2007). Measuring Dependence of Bin-wise Separated Signals for Permutation Alignment in Frequency-domain BSS. ISCAS, New Orleans, USA, May 2007 4. Araki, S., H. Sawada and S. Makino (2007). K-means Based Underdetermined Blind Speech Separation. In Blind Speech Separation: 243–270. S. Makino, Te-Won Lee and H. Sawada Editors, Springer: New-York. 5. Z. El-Chami, D.-T. Pham, Ch. Servi`ere and A. Gu´erin (2008) A New-Model Based Underdetermined Source Separation, IWAENC, Seattle, USA, September 2008. 6. Signal Separation Evaluation Campaign (2008), http://sisec.wiki.irisa.fr/tikiindex.php