Circuits Syst Signal Process (2010) 29: 183–194 DOI 10.1007/s00034-009-9141-4
Voice Activity Detection Based on Discriminative Weight Training Incorporating a Spectral Flatness Measure Sang-Ick Kang · Joon-Hyuk Chang
Received: 13 March 2008 / Revised: 10 February 2009 / Published online: 29 December 2009 © Springer Science+Business Media, LLC 2009
Abstract  In this paper, we present an approach to incorporate discriminative weight training into a statistical model-based voice activity detection (VAD) method. In our approach, the VAD decision rule is derived from the optimally weighted likelihood ratios (LRs) using a minimum classification error (MCE) method. An adaptive online means of selecting two kinds of weights based on a power spectral flatness measure (PSFM) is devised for performance improvement. The proposed approach is compared to conventional schemes under various noise conditions, and shows better performance.

Keywords  Voice activity detection · Minimum classification error · Statistical model · Power spectral flatness measure

This work was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute for Information Technology Advancement) (IITA-2008-C1090-0902-0010). This research was also financially supported by the MKE and KOTEF through the Human Resource Training Project for Strategic Technology.

S.-I. Kang · J.-H. Chang
School of Electronic Engineering, Inha University, Incheon 402-751, Korea
e-mail: [email protected]
S.-I. Kang, e-mail: [email protected]

1 Introduction

Voice activity detection (VAD) refers to the classical problem of distinguishing active speech from non-speech and has many applications in a variety of speech communication systems, such as speech coding, speech recognition, speech enhancement, hands-free conferencing, and echo cancellation [4, 13]. VAD is an essential component of variable-rate speech coding, one of the most elegant ways of enhancing
capacity and coverage of communication bandwidth [1]. Recently, numerous VAD approaches have been developed as a preprocessing step for speech enhancement. Among these, we focus on the statistical model-based VAD method proposed by Sohn et al. [5, 15], as it provides performance better than or comparable to standard VAD algorithms such as ITU-T G.729 Annex B and ETSI AMR VAD, while requiring optimization of only a few relevant parameters [7, 8]. The novelty of the statistical model-based VAD has been recognized in many studies [2, 3, 6, 11, 12, 14], which mainly employ a decision rule based on the likelihood ratio (LR) in the discrete Fourier transform (DFT) domain. Our recent research [10] incorporated a discriminative weight training (DWT) scheme into the optimally weighted LR-based decision rule to gain a performance improvement. The approach proposed here is an extension of that DWT-based VAD algorithm [10]: a power spectral flatness measure (PSFM) is incorporated in an online fashion for further performance improvement. In a number of experiments, the proposed technique is found to be superior to the conventional approaches [10, 15].
2 A Statistical Model-Based VAD

For the VAD, we assume that the noise signal n(t) is added to the speech signal x(t), with their sum denoted by y(t) in the time domain. y(t) is transformed by a DFT as follows:

Y(t) = X(t) + N(t)    (1)

where Y(t) = [Y_1(t), Y_2(t), ..., Y_M(t)], X(t) = [X_1(t), X_2(t), ..., X_M(t)], and N(t) = [N_1(t), N_2(t), ..., N_M(t)] denote the DFT coefficients of the noisy speech signal, the clean speech, and the added noise, respectively. Given two classes, H_0 and H_1, which respectively indicate speech absence and presence, it is assumed that

H_0 (speech absent):  Y_k(t) = N_k(t)    (2)
H_1 (speech present): Y_k(t) = X_k(t) + N_k(t).    (3)

With a Gaussian pdf assumption, the distributions of the noisy spectral components conditioned on both hypotheses are given by

p(Y_k | H_0) = \frac{1}{\pi \lambda_{n,k}} \exp\left(-\frac{|Y_k|^2}{\lambda_{n,k}}\right)    (4)

p(Y_k | H_1) = \frac{1}{\pi[\lambda_{n,k} + \lambda_{x,k}]} \exp\left(-\frac{|Y_k|^2}{\lambda_{n,k} + \lambda_{x,k}}\right)    (5)

where λ_{x,k} and λ_{n,k} respectively denote the variances of speech and noise for the individual frequency band. The LR of the kth frequency band is given by

\Lambda_k \equiv \frac{p(Y_k|H_1)}{p(Y_k|H_0)} = \frac{1}{1+\xi_k} \exp\left(\frac{\gamma_k \xi_k}{1+\xi_k}\right)    (6)
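As a concrete illustration, the per-bin LR of (6) follows directly from the two SNR quantities. The following minimal Python sketch (the function name and interface are ours, not from the paper) assumes the speech and noise variances have already been estimated:

```python
import numpy as np

def likelihood_ratio(Y, lambda_n, lambda_x):
    """Per-bin likelihood ratio of Eq. (6) under the Gaussian model.

    Y        : complex DFT coefficients of the noisy frame (length M)
    lambda_n : per-bin noise variances  (lambda_{n,k})
    lambda_x : per-bin speech variances (lambda_{x,k})
    """
    xi = lambda_x / lambda_n            # a priori SNR, xi_k
    gamma = np.abs(Y) ** 2 / lambda_n   # a posteriori SNR, gamma_k
    return np.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
```

Note that with λ_{x,k} = 0 the ratio reduces to 1, i.e., the bin carries no evidence for speech.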
where ξ_k = λ_{x,k}/λ_{n,k} and γ_k = |Y_k|²/λ_{n,k} denote the a priori signal-to-noise ratio (SNR) and the a posteriori SNR, respectively [6]. The a posteriori SNR γ_k is estimated using λ_{n,k}, and the a priori SNR ξ_k is estimated by the well-known decision-directed (DD) method as follows [4, 6, 14]:

\hat{\xi}_k(t) = \alpha \frac{|\hat{X}_k(t-1)|^2}{\lambda_{n,k}(t-1)} + (1-\alpha) P[\gamma_k(t) - 1]    (7)

where |X̂_k(t−1)|², the speech spectral amplitude estimate of the previous frame, is obtained by the minimum mean-square error (MMSE) estimator [15]. Also, α is a weight usually chosen in the range (0.95, 0.99) [6], and the function P[x] = x if x ≥ 0 and P[x] = 0 otherwise. The final decision in the conventional statistical model-based VADs is established from the geometric mean of the LRs computed for the individual frequency bins [3, 10–12, 14–16] and is obtained by

\log \Lambda(t) = \frac{1}{M} \sum_{k=1}^{M} \log \Lambda_k(t) \mathop{\gtrless}_{H_0}^{H_1} \eta    (8)
where an input frame is classified as speech presence if the geometric mean of the LRs is greater than a certain threshold value η and as speech absence otherwise. It should be noted that the usual step to derive a statistical model-based VAD is to construct the geometric mean of equally weighted LRs.
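The DD recursion (7) and the equal-weight decision rule (8) can be sketched in a few lines; this is a minimal illustration under our own variable names, not the authors' implementation:

```python
import numpy as np

def dd_a_priori_snr(X_prev, lambda_n_prev, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate, Eq. (7).

    X_prev        : MMSE speech amplitude estimates of the previous frame
    lambda_n_prev : noise variances of the previous frame
    gamma         : a posteriori SNRs of the current frame
    """
    return (alpha * np.abs(X_prev) ** 2 / lambda_n_prev
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))  # P[x] = max(x, 0)

def vad_decision(log_LR, eta=0.0):
    """Eq. (8): speech present iff the mean per-bin log-LR exceeds eta."""
    return bool(np.mean(log_LR) > eta)
```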
3 MCE Training Incorporating PSFM

We first review the technique to obtain optimal weights for the LRs based on DWT [10]. The weights {w_k} must satisfy the following conditions:

\sum_{k=1}^{M} w_k = 1,    (9)

w_k \ge 0, \quad k = 1, 2, \ldots, M.    (10)

Let log Λ_w = {w_1 log Λ_1, w_2 log Λ_2, ..., w_M log Λ_M} represent the optimally weighted LR vector and \overline{\log \Lambda_w} be \frac{1}{M}\sum_{k=1}^{M} w_k \log \Lambda_k. Two discriminant functions, g_s and g_n, are prepared to decide whether each frame is classified as speech or noise, as follows:

g_s(\log \Lambda_w) = \overline{\log \Lambda_w} - \theta,    (11)

g_n(\log \Lambda_w) = \theta - \overline{\log \Lambda_w}    (12)
where θ denotes a threshold value of the combined score. If g_s(log Λ_w) is greater than g_n(log Λ_w), the frame is classified as a speech frame. Specifically, since minimum classification error (MCE)
Fig. 1 Loss function L(t) for γ = 1
training requires a discriminative function for each cluster, the two functions are employed [10]. In our approach, estimation of the weights is performed under a discriminative training framework, and the generalized probabilistic descent (GPD) technique is applied, since GPD is optimal in the sense of a probabilistic descent search [9]. Let D denote the misclassification measure of the training data log Λ_w(t). Then,

D(\log \Lambda_w(t)) = \begin{cases} -g_s(\log \Lambda_w(t)) + g_n(\log \Lambda_w(t)) & \text{if } C^m(Y(t)) = H_1 \\ -g_n(\log \Lambda_w(t)) + g_s(\log \Lambda_w(t)) & \text{if } C^m(Y(t)) = H_0 \end{cases}    (13)

where C^m(·) denotes a VAD decision obtained by manual labeling at every tth frame. When (13) is negative, the classification is considered correct. The GPD approach approximates the empirical classification error by a smooth objective function, the 0–1 step loss function, defined by

L(t) = \frac{1}{1 + \exp(-\gamma D(\log \Lambda_w(t)))}, \quad \gamma > 0    (14)

where γ denotes the gradient of the sigmoid function. Once the parameter γ is specified, the weights are trained as illustrated in Fig. 1 according to the following criterion:

\{\hat{w}_k\} = \arg\min_{\{w_k\}} L.    (15)

Using the obtained weights, we finally construct the VAD method, comparing a given threshold value with the optimally weighted LR score as follows:

\overline{\log \Lambda_w}(t) = \frac{1}{M} \sum_{k=1}^{M} w_k \log \Lambda_k(t) \mathop{\gtrless}_{H_0}^{H_1} \eta.    (16)
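The MCE/GPD training of (13)–(15) reduces to a sigmoid-weighted gradient step per labeled frame. The sketch below is our own reading of the update, not the authors' code; in particular, the constraint handling (clipping negative weights and renormalizing to satisfy (9) and (10)) is an assumption, since the paper does not spell out the projection:

```python
import numpy as np

def gpd_step(w, log_LR, label, theta=0.0, gamma=1.0, eps=0.1):
    """One GPD update of the weights for a single hand-labeled frame.

    w      : current weights (nonnegative, summing to 1, length M)
    log_LR : per-band log likelihood ratios, log Lambda_k(t)
    label  : 1 if the frame is labeled H1 (speech), 0 if H0 (noise)
    """
    M = len(w)
    score = np.dot(w, log_LR) / M           # weighted score of Eq. (16)
    sign = -1.0 if label == 1 else 1.0
    D = 2.0 * sign * (score - theta)        # misclassification measure, Eq. (13)
    L = 1.0 / (1.0 + np.exp(-gamma * D))    # sigmoid loss, Eq. (14)
    # chain rule: dL/dw_k = gamma * L * (1 - L) * 2 * sign * log_LR_k / M
    grad = gamma * L * (1.0 - L) * 2.0 * sign * log_LR / M
    w = np.maximum(w - eps * grad, 0.0)     # descent step; clip to keep (10)
    return w / w.sum()                      # renormalize to satisfy (9)
```

For a speech-labeled frame, a band with a large positive log-LR receives a larger weight after the step, which is the intended discriminative behavior.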
Here, we present a novel method to select a group of optimal weights considering the spectral shape. The concept underlying this approach is that the optimal weights
Fig. 2 PSFM trajectory under car noise condition at SNR = 5 dB. Clean speech signals corresponding to the noisy signals are shown in this figure
Fig. 3 Weights distribution according to PSFM
should be different depending on the spectral distribution of the input signal. Through extensive simulation, it was found that the spectral shape plays an important role in choosing the weights, similar to the case in [4]. Among the various informative techniques for spectral shape modeling, we select the following PSFM, rather than a simple spectral flatness measure, as a quantitative measure of the "whiteness" of the spectrum, since the PSFM is known to be relevant in characterizing the spectral shape [4]:

\Xi(t) = \frac{\left(\prod_{k=0}^{M-1} \hat{\lambda}_{y,k}(t)\right)^{1/M}}{\frac{1}{M} \sum_{k=0}^{M-1} \hat{\lambda}_{y,k}(t)}    (17)

Here the PSFM Ξ(t) is defined as the ratio of the geometric average to the arithmetic average of the power estimates of the different frequency bins [10], as shown in Fig. 2 for easy comprehension.

Fig. 4 ROC curves for babble noise condition (SNR = 5 dB)

Fig. 5 ROC curves for car noise condition (SNR = 5 dB)

The estimated power spectrum of noisy speech is obtained by
\hat{\lambda}_{y,k}(t) = \alpha_y \hat{\lambda}_{y,k}(t-1) + (1 - \alpha_y) |Y_k(t)|^2    (18)

where α_y is an experimentally chosen smoothing coefficient, set to 0.5. To avoid abrupt variation, temporal smoothing is applied to the PSFM such that

\Xi_y(t) = \alpha_\Xi \Xi_y(t-1) + (1 - \alpha_\Xi) \frac{\left(\prod_{k=0}^{M-1} \hat{\lambda}_{y,k}(t)\right)^{1/M}}{\frac{1}{M} \sum_{k=0}^{M-1} \hat{\lambda}_{y,k}(t)}    (19)

where α_Ξ is an experimentally chosen smoothing coefficient, set to 0.5, and Ξ_y(t) represents the smoothed PSFM. Here, based on the above PSFM, we introduce
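Equations (17)–(19) amount to a few lines of array code. A sketch follows (function names are ours); the geometric mean is computed in the log domain for numerical safety, which is an implementation choice, not something the paper prescribes:

```python
import numpy as np

def psfm(lam_y):
    """Eq. (17): ratio of geometric to arithmetic mean of power estimates."""
    geo = np.exp(np.mean(np.log(lam_y)))  # log-domain geometric mean
    return geo / np.mean(lam_y)

def smooth_power(lam_prev, Y, alpha_y=0.5):
    """Eq. (18): first-order recursive estimate of the noisy power spectrum."""
    return alpha_y * lam_prev + (1.0 - alpha_y) * np.abs(Y) ** 2

def smooth_psfm(xi_prev, lam_y, alpha_xi=0.5):
    """Eq. (19): temporally smoothed PSFM."""
    return alpha_xi * xi_prev + (1.0 - alpha_xi) * psfm(lam_y)
```

A perfectly flat ("white") spectrum gives Ξ = 1, while a peaky, speech-like spectrum drives Ξ toward 0.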
Table 1 FRR and FAR of VAD under the various noise conditions

Noise    SNR (dB)   No weight          Weight             Proposed
                    FRR      FAR       FRR      FAR       FRR      FAR
Car       5         23.41%   1.80%     22.50%   1.88%     20.02%   2.23%
         15         11.86%   2.98%     10.72%   3.71%      9.69%   3.48%
Street    5         36.53%   2.27%     36.41%   2.23%     31.01%   5.77%
         15         28.18%   2.17%     27.95%   2.09%     24.46%   2.42%
Office    5         44.67%  15.71%     45.05%  13.37%     36.90%  23.67%
         15         31.78%   3.17%     31.43%   3.71%     27.67%   8.22%
White     5         38.69%   0.03%     38.80%   0.04%     29.72%   2.60%
         15         29.70%   1.02%     30.13%   1.01%     25.01%   1.69%
Fig. 6 ROC curves for office noise condition (SNR = 5 dB)
two different regions depending on Ξ_y(t), which we partition as A (low PSFM) and B (high PSFM). We choose two intervals because the performance improvement is limited even as the number of partitions grows. In this regard, {w_k}_A is the set of optimal weights in region A, while {w_k}_B is the best fit in region B, as follows:

\{w_k\} = \begin{cases} \{w_k\}_A & 0 \le \Xi_y(t) < \Xi_0 \\ \{w_k\}_B & \Xi_0 \le \Xi_y(t) \le 1 \end{cases}    (20)

where Ξ_0 (= 0.44) is a specific boundary level determined from experiments under various noise conditions. From (20), it is readily apparent that {w_k} is replaced by either {w_k}_A or {w_k}_B according to Ξ_y(t). The advantage of this estimate is that the optimal weights vary every frame according to the spectral shape given by the PSFM as time progresses, which yields a reliable real-time estimation of {w_k}.
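The selection rule (20) is a simple threshold switch on the smoothed PSFM; a minimal sketch (names are ours) is:

```python
def select_weights(xi_y, w_A, w_B, xi_0=0.44):
    """Eq. (20): use the region-A weights when the smoothed PSFM indicates a
    peaky, low-flatness spectrum (xi_y < xi_0), and the region-B weights for
    a flatter, more noise-like spectrum (xi_y >= xi_0)."""
    return w_A if xi_y < xi_0 else w_B
```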
Fig. 7 ROC curves for street noise condition (SNR = 5 dB)
Fig. 8 ROC curves for white noise condition (SNR = 5 dB)
To derive two sets of {wk }, we conduct the MCE training on 16 frequency subbands1 for the two intervals. The two sets of weight distributions for two different intervals are shown in Fig. 3.
4 Experiments and Results

In order to evaluate the performance of the proposed algorithm, we investigated the speech detection and false-alarm probabilities (P_d and P_f) in application to the NTT database, which consists of a number of speech files [3].

¹ We adopt 16 subbands from channel combining in a 128-point DFT to take full consideration of frequency subband correlation and computational efficiency.
Fig. 9 ROC curves for babble noise condition (SNR = 15 dB)
Fig. 10 ROC curves for car noise condition (SNR = 15 dB)
For training, we constructed a reference decision on a 230 s long clean speech sample by hand labeling the active and inactive frames of the speech segments every 10 ms frame. The percentage of hand-marked active speech frames was 57.1%, consisting of 44.0% voiced sound frames and 13.1% unvoiced sound frames. In order to create noisy environments, we added car, street, office, and white noises to the clean speech data at SNR = 5, 10, and 15 dB. The parameters used for defining the objective function L were selected such that γ = 1, and the step size for the parameter update was set to ε(t) = 1 − t/40000 for a smooth convergence of the objective function to the minimal point during the training process. The initial values for the weights and ε are 1/M and 1, respectively, which enable efficient convergence. Also, the threshold value θ of the combined score was experimentally set to 0 as a suitable boundary between the two classes.
Fig. 11 ROC curves for office noise condition (SNR = 15 dB)
Fig. 12 ROC curves for street noise condition (SNR = 15 dB)
For testing, we used different speech samples (220 s in duration) from the NTT database. As a reference, we manually labeled the test material using 10 ms frames. To simulate noisy conditions, car, street, office, and white noises were again added to the clean speech data at 5 and 15 dB SNR. The receiver operating characteristics (ROCs), showing the trade-off between P_d and P_f in the clean, car, and street noise environments, are shown in Figs. 4 and 5. Under the given noise conditions, it is evident from the figures that the proposed algorithm outperformed, or was at least comparable to, our previously reported methods [5, 15], which show better or similar performance compared with G.729B and AMR [7, 8]. The performance improvement was found to be greater in the office and white noise cases at all SNRs. The performances of the VAD algorithms are summarized in Table 1 in terms of the false rejection ratio (FRR) and false acceptance rate (FAR). From the results, we can see that the proposed algorithm yields performance superior to the conventional approaches (no weight, weight).
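The two error rates reported in Table 1 can be computed directly from the frame-level decisions and the hand labels; a small sketch (our own helper, not the authors' evaluation script):

```python
import numpy as np

def frr_far(decisions, labels):
    """False rejection ratio (speech frames classified as noise) and
    false acceptance rate (noise frames classified as speech)."""
    decisions = np.asarray(decisions, dtype=bool)  # True = speech detected
    labels = np.asarray(labels, dtype=bool)        # True = hand-labeled speech
    frr = float(np.mean(~decisions[labels]))
    far = float(np.mean(decisions[~labels]))
    return frr, far
```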
Fig. 13 ROC curves for white noise condition (SNR = 15 dB)
Fig. 14 ROC curves when the noise type changes
Also, as depicted in Figs. 5–9, we conducted a test under the babble noise condition (from the NOISEX-92 database), which is known to be among the most difficult to handle in actual environments [16]. As a result, the proposed algorithm obtained nearly the same performance as the conventional approach. This might be attributed to the fact that the frequency characteristics of babble noise resemble those of the speech signal. Next, we carried out a number of experiments in which the added noises were made artificially time-varying. This setting can be considered realistic, since the proposed approach must be evaluated in actual non-stationary environments. For this, we varied the noise type of the noisy speech segments (office → street → car → white) as the SNR was changed (5 dB → 10 dB → 15 dB) every 2–3 s. From the results, it is found that the proposed approach outperforms the original method, and thus the approach is effective when the noise characteristics are time-varying.
5 Conclusions

In this paper, we have proposed a novel VAD technique based on the MCE algorithm using the PSFM, in which optimally weighted LRs are integrated into the geometric mean at every frame for a robust VAD decision. The proposed approach yields better performance than the conventional method in non-stationary noise environments while preserving computational efficiency.
References

1. J.-H. Chang, N.S. Kim, Distorted speech rejection for automatic speech recognition in wireless communication. IEICE Trans. Inf. Syst. E87-D(7), 1978–1981 (2004)
2. J.-H. Chang, J.W. Shin, N.S. Kim, Voice activity detector employing generalised Gaussian distribution. Electron. Lett. 40(24), 1561–1563 (2004)
3. J.-H. Chang, N.S. Kim, S.K. Mitra, Voice activity detection based on multiple statistical models. IEEE Trans. Signal Process. 54(6), 1965–1976 (2006)
4. J.-H. Chang, S. Gazor, N.S. Kim, S.K. Mitra, Multiple statistical models for soft decision in noisy speech enhancement. Pattern Recognit. 40(3), 1123–1134 (2007)
5. Y.D. Cho, A. Kondoz, Analysis and improvement of a statistical model-based voice activity detector. IEEE Signal Process. Lett. 8(10), 276–278 (2001)
6. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-32(6), 1109–1121 (1984)
7. ETSI, Voice activity detector (VAD) for adaptive multi-rate (AMR) speech traffic channels. ETSI EN 301 708 v7.1.1
8. ITU-T, A silence compression scheme for G.729 optimised for terminals conforming to ITU-T V.70. ITU-T Rec. G.729 Annex B
9. B.-H. Juang, W. Chou, C.-H. Lee, Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 5(3), 257–265 (1997)
10. S.-I. Kang, Q.-H. Jo, J.-H. Chang, Discriminative weight training for a statistical model-based voice activity detection. IEEE Signal Process. Lett. 15, 170–173 (2008)
11. Y.C. Lee, S.S. Ahn, Statistical model-based VAD algorithm with wavelet transform. IEICE Trans. Fundam. E89-A(6), 1594–1600 (2006)
12. J. Ramirez, J.M. Gorriz, J.C. Segura, C.G. Puntonet, A.J. Rubio, Speech/non-speech discrimination based on contextual information integrated bispectrum LRT. IEEE Signal Process. Lett. 13(8), 497–500 (2006)
13. M.H. Savoji, A robust algorithm for accurate endpointing of speech signals. Speech Commun. 8, 45–60 (1989)
14. J. Sohn, W. Sung, A voice activity detector employing soft decision based noise spectrum adaptation. Proc. Int. Conf. Acoust. Speech Signal Process. 1, 365–368 (1998)
15. J. Sohn, N.S. Kim, W. Sung, A statistical model-based voice activity detection. IEEE Signal Process. Lett. 6(1), 1–3 (1999)
16. A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)