
Signal Processing 92 (2012) 1699–1705


Voice activity detection based on conditional MAP criterion incorporating the spectral gradient

Sang-Kyun Kim, Joon-Hyuk Chang*
School of Electronic Engineering, Hanyang University, Seoul 133-791, Korea
* Corresponding author. E-mail address: [email protected] (J.-H. Chang).

Article history: Received 16 August 2011; received in revised form 3 January 2012; accepted 5 January 2012; available online 13 January 2012.

Abstract

In this paper, we propose a novel approach that improves a statistical model-based voice activity detection (VAD) method through a modified conditional maximum a posteriori (MAP) criterion incorporating a spectral gradient scheme. The proposed conditional MAP exploits not only the voice activity decision in the previous frame, as in [1], but also the spectral gradient of the observed spectra between the current frame and the past frames, in order to efficiently exploit the inter-frame correlation of voice activity. As a result, the proposed VAD leads to six separate thresholds that are adaptively selected in the likelihood ratio test (LRT) depending on both the previous VAD result and the estimated spectral gradient parameter. Experimental results demonstrate that the proposed approach yields better results than the previous conditional MAP-based method.

Keywords: Voice activity detection; Spectral gradient; Conditional MAP; Likelihood ratio test

1. Introduction

Robust speech processing in adverse environments has been an important issue in recent years. Indeed, voice activity detection (VAD), which determines whether the input data is speech or noise, is a crucial component of speech processing systems [1-4]. In voice communication applications, transmission is generally stopped when noise is detected, and only a general description of the background information is transmitted. At the decoder stage, the inactive frames are then reconstructed through comfort noise generation (CNG), which produces natural background sounds. For this reason, a VAD for speech coding can be aggressive, since the associated CNG can compensate for misclassified frames [5]. In contrast, a passive VAD is needed for speech recognition, since misclassified frames can degrade the recognition accuracy. Many traditional VAD algorithms are based on heuristic rules involving several


parameters such as linear predictive coding parameters, energy levels, formant shape and cepstral features [6]. Among the various VAD approaches, we focus on the likelihood ratio test (LRT), which uses the distributions of both speech and noise, owing to its impressive performance and efficient implementation. In the statistical modeling of speech, the distributions of noisy speech and noise are assumed to follow a statistical distribution such as the Gaussian, Laplacian or generalized gamma [2-4]. Based on the assumed statistical model, the LRT is established from the maximum a posteriori (MAP) criterion, which chooses the hypothesis (speech or noise) with the higher probability. These conventional MAP-based methods rely on the assumption that frames are independent. Recently, Shin et al. [1] exploited the inter-frame correlation of speech signals. Specifically, this is achieved by incorporating a simple but rigorous rule, the conditional MAP (CMAP) criterion, which is conditioned not only on the data of the current frame but also on the voice activity decision of the previous frame. This algorithm results in an adaptive decision threshold for the LRT based on the voice activity result in the previous frame. This method proved efficient in exploiting


the inter-frame correlation. However, this approach does not fully consider the spectral variation when forming the CMAP criterion. In this paper, we propose a novel technique for the LRT that incorporates the spectral gradient into the decision criterion, in addition to the aforementioned voice activity result of the previous frame within the CMAP framework. Considering the spectral gradient in the CMAP enables more exact detection, since we can account for the spectral variation, i.e., increasing, decreasing or sustained spectra, associated with the voice activity. As a result, the decision thresholds of the LRT take six different values depending on the status of voice activity in the previous frame and the spectral gradient between the current power spectrum and the averaged long-term power spectrum. In a number of comparative VAD experiments, the proposed approach shows better performance than the previous algorithm of Shin et al. [1]. The paper is organized as follows. In Section 2, we briefly review the CMAP-based VAD. In Section 3, we present the spectral gradient scheme and apply it to the CMAP to improve VAD performance. Finally, an objective evaluation of the previous methods and our approach is given in Section 4.

2. Review of CMAP-based VAD

In the time domain, it is assumed that the noise signal d(t) is added to the clean speech signal x(t), their sum being the noisy speech signal y(t). These signals are transformed by the short-term fast Fourier transform (FFT) as follows:

$$\mathbf{Y}(n) = \mathbf{X}(n) + \mathbf{D}(n) \tag{1}$$

where $\mathbf{Y}(n) = [Y(0,n), Y(1,n), \ldots, Y(M-1,n)]$, $\mathbf{X}(n) = [X(0,n), X(1,n), \ldots, X(M-1,n)]$ and $\mathbf{D}(n) = [D(0,n), D(1,n), \ldots, D(M-1,n)]$ at the $n$th frame. Assuming that speech is degraded by uncorrelated additive noise, the two hypotheses $H_0(n)$ and $H_1(n)$, indicating speech absence and presence in the noisy spectral component, respectively, are

$$H_0(n):\; Y(k,n) = D(k,n) \tag{2}$$

$$H_1(n):\; Y(k,n) = X(k,n) + D(k,n) \tag{3}$$

Under the Gaussian probability density function (pdf) assumption, adopted for its practical superiority [7], the distributions of the noisy spectral components conditioned on the two hypotheses are given by

$$p(Y(k,n)\mid H_0(n)) = \frac{1}{\pi\lambda_d(k,n)}\exp\left\{-\frac{|Y(k,n)|^2}{\lambda_d(k,n)}\right\} \tag{4}$$

$$p(Y(k,n)\mid H_1(n)) = \frac{1}{\pi(\lambda_d(k,n)+\lambda_x(k,n))}\exp\left\{-\frac{|Y(k,n)|^2}{\lambda_d(k,n)+\lambda_x(k,n)}\right\} \tag{5}$$

where $\lambda_d(k,n)$ and $\lambda_x(k,n)$ denote the noise and speech variances for each frequency bin, respectively. The likelihood ratio (LR) of the $k$th frequency band is then

$$\Lambda(k,n) \triangleq \frac{p(Y(k,n)\mid H_1(n))}{p(Y(k,n)\mid H_0(n))} = \frac{1}{1+\xi(k,n)}\exp\left\{\frac{\gamma(k,n)\,\xi(k,n)}{1+\xi(k,n)}\right\} \tag{6}$$

where $\xi(k,n) = \lambda_x(k,n)/\lambda_d(k,n)$ and $\gamma(k,n) = |Y(k,n)|^2/\lambda_d(k,n)$ represent the a priori and a posteriori signal-to-noise ratios (SNRs), respectively [7]. The a posteriori SNR $\gamma(k,n)$ is computed from the estimate of $\lambda_d(k,n)$, and the a priori SNR $\xi(k,n)$ is estimated by the well-known decision-directed (DD) method [8]:

$$\hat{\xi}(k,n) = \alpha\,\frac{|\hat{X}(k,n-1)|^2}{\lambda_d(k,n-1)} + (1-\alpha)\,U[\gamma(k,n)-1] \tag{7}$$

where $|\hat{X}(k,n-1)|^2$ is the speech spectral amplitude estimate of the previous frame obtained with the minimum mean-square error (MMSE) estimator [7], $\alpha$ is a weight usually chosen in the range (0.95, 0.99) [8], and $U[x] = x$ if $x \ge 0$ and $U[x] = 0$ otherwise. The computational burden can be reduced by using the subband combining technique of [9]. The final decision in conventional statistical model-based VADs is reached through the geometric mean of the LRs computed over the individual frequency bins [10-13]:

$$\log \Lambda(n) \triangleq \frac{1}{M}\sum_{k=0}^{M-1}\log \Lambda(k,n) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta \tag{8}$$

where an input frame is classified as speech if the geometric mean of the LRs exceeds the threshold $\eta$, and as non-speech otherwise.
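As a concrete illustration, the following minimal Python sketch computes the per-bin log LR of Eq. (6) with the DD estimate of Eq. (7) and applies the geometric-mean rule of Eq. (8). The function name, the threshold value eta = 0.2 and the assumption that a noise variance estimate is supplied externally (e.g., from initial noise-only frames) are ours, not the paper's.

```python
import numpy as np

def lrt_vad_frame(Y, lambda_d, xhat_prev_sq, lambda_d_prev, alpha=0.98, eta=0.2):
    """Classify one frame by the Gaussian LRT of Eqs. (6)-(8).

    Y             : complex STFT bins Y(k, n) of the current frame
    lambda_d      : noise variance estimate lambda_d(k, n) per bin
    xhat_prev_sq  : |X_hat(k, n-1)|^2, MMSE amplitude estimate of the previous frame
    lambda_d_prev : lambda_d(k, n-1)
    """
    gamma = np.abs(Y) ** 2 / lambda_d                    # a posteriori SNR
    xi = alpha * xhat_prev_sq / lambda_d_prev \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)  # DD estimate, Eq. (7)
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)      # log of Eq. (6)
    stat = np.mean(log_lr)                               # geometric mean, Eq. (8)
    return stat > eta, stat
```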

On the other hand, the previous CMAP-based VAD originates from the conventional MAP decision rule

$$\frac{P(H(n)=H_1(n)\mid \mathbf{Y}(n))}{P(H(n)=H_0(n)\mid \mathbf{Y}(n))} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; 1 \tag{9}$$

where $H(n)$ denotes the correct hypothesis in the $n$th frame. By Bayes' rule, this is rewritten as the following LRT criterion:

$$\frac{P(\mathbf{Y}(n)\mid H(n)=H_1(n))}{P(\mathbf{Y}(n)\mid H(n)=H_0(n))} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \alpha\,\frac{P(H(n)=H_0(n))}{P(H(n)=H_1(n))} \tag{10}$$

where $\alpha \ge 1$ [7]. Shin et al. proposed a way to incorporate the inter-frame correlation of voice activity into the MAP criterion. More specifically, the a posteriori probability $P(H(n)\mid \mathbf{Y}(n))$ is conditioned not only on the current observation $\mathbf{Y}(n)$ but also on the decision in the previous frame, yielding $P(H(n)\mid \mathbf{Y}(n), H(n-1))$. This implies

$$\frac{P(H(n)=H_1(n)\mid \mathbf{Y}(n),\,H(n-1)=H_i)}{P(H(n)=H_0(n)\mid \mathbf{Y}(n),\,H(n-1)=H_i)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \alpha, \quad i=0,1 \tag{11}$$

where $\alpha$ is the threshold. Using this criterion, the following LRT is derived [1]:

$$\frac{P(\mathbf{Y}(n)\mid H(n)=H_1(n),\,H(n-1)=H_i)}{P(\mathbf{Y}(n)\mid H(n)=H_0(n),\,H(n-1)=H_i)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \alpha\,\frac{P(H(n)=H_0\mid H(n-1)=H_i)}{P(H(n)=H_1\mid H(n-1)=H_i)} \tag{12}$$

It is noted that the likelihoods $P(\mathbf{Y}(n)\mid H(n)=H_1(n),\,H(n-1)=H_i)$ and $P(\mathbf{Y}(n)\mid H(n)=H_0(n),\,H(n-1)=H_i)$ can be simplified, since the distribution of $\mathbf{Y}(n)$ in the current frame makes the dominant contribution:

$$\frac{P(\mathbf{Y}(n)\mid H(n)=H_1(n))}{P(\mathbf{Y}(n)\mid H(n)=H_0(n))} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \alpha\,\frac{P(H(n)=H_0\mid H(n-1)=H_i)}{P(H(n)=H_1\mid H(n-1)=H_i)} \tag{13}$$

where two different thresholds result, depending on the decision on speech activity in the previous frame $(n-1)$.
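A minimal sketch of the resulting decision logic follows: the frame statistic from Eq. (8) is compared against one of two thresholds selected by the previous frame's decision. The threshold values t0 and t1 are illustrative placeholders; in [1] they follow from alpha and the transition probabilities on the right-hand side of (13).

```python
def cmap_decide(stat, prev_speech, t0=2.0, t1=1.0):
    """Two-threshold CMAP decision of Eq. (13).

    Choosing t1 < t0 encodes P(H1 | H(n-1) = H1) > P(H1 | H(n-1) = H0):
    a frame following speech needs weaker evidence to be kept as speech.
    """
    return stat > (t1 if prev_speech else t0)
```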

3. Proposed method based on the spectral gradient

The previous section shows that the method of Shin et al. derives two separate thresholds depending on the decision on speech activity in the previous frame. Here, we propose a way to incorporate the spectral gradient, which reflects the time-varying spectral change, into the conditional term of the CMAP. For example, in onset regions the spectral power increases, which is relevant for detecting voice activity. To make the algorithm rigorous, we first define the spectral gradient of each frame as the difference between the current power spectrum and the average long-term power spectrum:

$$\Delta(n) = \sum_{k=0}^{M-1}\left(|Y(k,n)|^2 - E(k,n)\right) \tag{14}$$

where $E(k,n)$ denotes the average long-term spectral estimate over the previous frames, given by [9]

$$E(k,n) = \beta E(k,n-1) + (1-\beta)\,|Y(k,n)|^2 \tag{15}$$

where $\beta = 0.8$ is a weight. Note that $E(k,1)$ is initialized to $|Y(k,1)|^2$. Using $\Delta(n)$, we categorize three cases by comparison with a given threshold $G$:

$$G(n) = \begin{cases} G_1, & \Delta(n) > G\\ G_0, & -G < \Delta(n) \le G\\ G_{-1}, & \Delta(n) \le -G \end{cases} \tag{16}$$

In this equation, $G_1$ implies an ascending spectral gradient, since the current power sufficiently exceeds the average long-term power. Conversely, in the case of $G_{-1}$ the spectral gradient is descending, while $G_0$ refers to the static class.
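The gradient computation of Eqs. (14)-(16) can be sketched as follows; the threshold value G_thr and the function name are hypothetical placeholders, since the paper does not report the setting of G.

```python
import numpy as np

def spectral_gradient_class(Y, E_prev, beta=0.8, G_thr=1.0e5):
    """Return the gradient class j of Eq. (16) and the updated E(k, n)."""
    P = np.abs(Y) ** 2                     # current power spectrum |Y(k, n)|^2
    E = beta * E_prev + (1.0 - beta) * P   # long-term average, Eq. (15)
    delta = np.sum(P - E)                  # spectral gradient, Eq. (14)
    if delta > G_thr:
        return 1, E                        # G_1: ascending (e.g., speech onset)
    if delta > -G_thr:
        return 0, E                        # G_0: static
    return -1, E                           # G_{-1}: descending (e.g., offset)
```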

Fig. 1 shows a typical example of the contours of $|Y(k,n)|^2$ and $E(k,n)$; the proposed $\Delta(n)$ successfully characterizes the spectral change, especially in the onset and offset regions.

Fig. 1. (a) Waveform of the input file (car noise, SNR = 10 dB) and (b) plot of $|Y(k,n)|^2$ and $E(k,n)$.

In this regard, the correlative characteristic of speech occurrence in consecutive frames associated with $G(n)$ can be represented by

$$P(H(n)=H_1\mid H(n-1)=H_1,\,G(n)=G_1) > P(H(n)=H_1\mid H(n-1)=H_1) \tag{17}$$

or

$$P(H(n)=H_0\mid H(n-1)=H_0,\,G(n)=G_1) < P(H(n)=H_0\mid H(n-1)=H_0) \tag{18}$$

where $H(n)$ denotes the correct hypothesis at the $n$th frame [7]. Based on this motivation, we propose a novel voice activity decision rule, analogous to (11), through the incorporation of $G(n)$:

$$\frac{P(H(n)=H_1(n)\mid \mathbf{Y}(n),\,H(n-1)=H_i,\,G(n)=G_j)}{P(H(n)=H_0(n)\mid \mathbf{Y}(n),\,H(n-1)=H_i,\,G(n)=G_j)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \alpha, \quad i=0,1 \text{ and } j=-1,0,1 \tag{19}$$

In a similar way as in [1], this criterion is the basis for the following LRT obtained via Bayes' rule:

$$\frac{P(\mathbf{Y}(n)\mid H(n)=H_1(n))}{P(\mathbf{Y}(n)\mid H(n)=H_0(n))} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; c_{ij}, \quad i=0,1 \text{ and } j=-1,0,1 \tag{20}$$

where

$$c_{ij} = \alpha\,\frac{P(H(n)=H_0\mid H(n-1)=H_i,\,G(n)=G_j)}{P(H(n)=H_1\mid H(n-1)=H_i,\,G(n)=G_j)}.$$

We note from the proposed test statistic that six separate thresholds are formed, depending on the speech activity in the previous frame and the spectral gradient. Specifically, for example,

$$c_{1,1} = \alpha\left(\frac{P(H(n)=H_0\mid H(n-1)=H_1,\,G(n)=G_1)}{P(H(n)=H_1\mid H(n-1)=H_1,\,G(n)=G_1)}\right)$$

is used if the presence of speech was detected in the previous frame and $G_1$ is chosen from (16). Conversely,

$$c_{0,1} = \alpha\left(\frac{P(H(n)=H_0\mid H(n-1)=H_0,\,G(n)=G_1)}{P(H(n)=H_1\mid H(n-1)=H_0,\,G(n)=G_1)}\right)$$

is used if the absence of speech was detected in the previous frame and $G_1$ is selected. These six thresholds depend on the speech activity in the previous frame and, when combined with the spectral gradient, provide more reliable test statistics for voice activity by successfully considering the inter-frame correlation. It is observed from Fig. 2 that the threshold in the proposed method is adaptively determined by taking advantage of the spectral gradient, and that the performance is uniform over the processed time, without any period of convergence to correct estimates.

Fig. 2. (a) Waveform of the test file (street noise, SNR = 10 dB); (b) manual VAD (silence = 0, unvoiced = 1, voiced = 2); (c) threshold of the CMAP; (d) threshold of the proposed method; and (e) output of the proposed VAD.
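Putting the pieces together, the proposed rule (20) reduces to a table lookup over six thresholds. The sketch below uses the c_ij values reported in Section 4; note that they apply to the decision statistic as computed in the authors' implementation, so their scale need not match the toy statistic of the earlier sketches.

```python
# Thresholds c_ij from Section 4, indexed by (previous decision i, gradient class j).
C = {
    (0, 0): 26.2, (0, 1): 25.0, (0, -1): 26.8,   # previous frame: noise
    (1, 0): 22.7, (1, 1): 24.7, (1, -1): 18.0,   # previous frame: speech
}

def proposed_decide(stat, prev_speech, grad_class):
    """Six-threshold decision of Eq. (20); grad_class is j in {-1, 0, 1}."""
    return stat > C[(int(prev_speech), grad_class)]
```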

4. Experiments and results

The conventional methods and the proposed method were evaluated quantitatively in various noise environments. For the test material, 456 s of speech was recorded by four male and four female speakers and sampled at 8 kHz. The clean speech data were obtained by concatenating clean speech sentences (each of 8 s duration). The analysis was performed with a trapezoidal window of 10 ms and 3 ms overlap, as in [6]. To evaluate the performance, we first made reference decisions on the clean speech material by hand labeling every 10 ms [6]. The proportion of hand-marked speech frames was 58.2%, consisting of 44.5% voiced and 13.4% unvoiced sounds. To simulate various noise environments, stationary noise (white), quasi-stationary noise (car) and non-stationary office noise (similar to babble) were directly added to the clean speech data, resulting in SNRs of 5, 10 and 15 dB. The thresholds were set experimentally to ensure robust performance under the various noise conditions: $c_{0,0} = 26.2$, $c_{0,1} = 25$, $c_{0,-1} = 26.8$, $c_{1,0} = 22.7$, $c_{1,1} = 24.7$ and $c_{1,-1} = 18$. The initial value of $c_{ij}$ was set to $c_{0,0}$, because we assumed that only noise exists in the first several frames. Also, we further added the street, babble and subway noise types of the widely used Aurora database to show that the results do not depend on the aforementioned data set. Note that the Aurora database is a collection of short sentences containing digits; by concatenating these sentences, we made an additional 140 s of test material and again made reference decisions by hand labeling.

Tables 1 and 2 show comparative results of the conventional methods and the proposed approach in terms of PE (probability of error), PM (probability of miss) and PF (false alarm probability). For the test, we used the car, office, white, street, babble and subway noises with SNRs varying from 0 dB to 25 dB. To aid the repeatability of the results, the standardized VADs, ITU-T G.729 Annex B and the recent ITU-T G.729B Appendix III, were included [14,16]. The results of the well-known standard VAD algorithms ETSI AMR VAD option 1 and option 2 [15] were also included to show that the performance difference is practically meaningful. From the results, it is evident that the proposed VAD algorithm performs better in most environmental conditions than the previously reported VAD methods, including the first-order CMAP [1], G.729 Annex B, G.729B Appendix III and the AMR VAD 1 and 2. For the AMR VADs, hand labeling every 20 ms was used, since the AMR VAD operates on 20 ms frames.

The proposed approach was also found to exhibit superior performance in non-stationary noise. The test results confirm that the proposed method effectively enhances the performance of the statistical model-based VAD for non-stationary noise types such as office, street and babble, especially at high SNRs. This is attributable to the fact that the spectral gradient scheme suits highly varying spectral conditions. In low SNR conditions, however, the performance improvement is limited, because it is hard to determine the spectral gradient reliably in adverse conditions. It is also seen that the false-alarm probability PF increased slightly for the non-stationary street and babble noises, because some speech-like or peaky noise components are misclassified as speech.
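For reference, the error measures reported in Tables 1 and 2 can be computed from frame-aligned boolean arrays of hand-labeled references and VAD decisions, as in this sketch (the function and array names are ours):

```python
import numpy as np

def vad_error_rates(ref, hyp):
    """PE, PM, PF in percent from boolean frame labels (True = speech)."""
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    pm = np.mean(~hyp[ref])          # P_M: speech frames classified as noise
    pf = np.mean(hyp[~ref])          # P_F: noise frames classified as speech
    pe = np.mean(ref != hyp)         # P_E: overall error probability
    return 100.0 * pe, 100.0 * pm, 100.0 * pf
```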

Table 1
Comparison (Part I) of voice activity detection probability of error (PE), probability of miss (PM) and false alarm probability (PF) for G.729B, G.729B App. III and AMR VAD1.

Noise    SNR (dB) | G.729B            | G.729B App. III   | AMR VAD1
                  | PE    PM    PF    | PE    PM    PF    | PE    PM    PF
Car      0        | 31.63 13.51 56.86 | 23.93 0.74  56.22 | 7.06  0.87  15.67
Car      5        | 27.52 13.00 47.72 | 22.44 0.56  52.90 | 7.87  0.51  18.11
Car      10       | 23.48 10.03 42.20 | 21.97 0.25  52.21 | 7.92  0.36  18.45
Car      15       | 19.79 7.14  37.40 | 20.33 0.17  48.41 | 7.57  0.22  17.80
Car      20       | 16.09 4.83  31.77 | 19.33 0.12  46.08 | 5.83  0.35  13.45
Car      25       | 14.59 4.32  28.90 | 17.91 0.08  42.73 | 5.47  0.34  12.61
Office   0        | 28.51 38.14 15.10 | 16.27 15.28 17.66 | 24.48 5.93  50.30
Office   5        | 26.45 28.67 23.35 | 14.86 9.12  22.86 | 22.64 5.25  46.85
Office   10       | 22.74 22.29 23.37 | 12.83 5.02  23.70 | 20.25 2.57  44.85
Office   15       | 19.28 17.30 22.04 | 12.11 2.60  25.34 | 18.65 1.13  43.06
Office   20       | 15.94 12.82 20.27 | 11.72 1.34  26.18 | 18.84 0.46  44.45
Office   25       | 13.20 8.48  19.78 | 11.38 0.65  26.62 | 14.77 0.34  34.86
White    0        | 37.15 63.45 0.54  | 22.50 31.79 9.56  | 11.77 14.75 7.61
White    5        | 25.19 42.80 0.68  | 11.79 12.19 11.24 | 9.90  10.39 9.22
White    10       | 17.35 29.14 0.94  | 9.65  6.70  13.75 | 8.19  6.17  11.01
White    15       | 10.94 18.04 1.06  | 7.34  3.79  12.29 | 6.87  2.87  12.44
White    20       | 7.07  11.20 1.32  | 7.60  1.99  15.41 | 6.47  1.22  13.77
White    25       | 5.27  7.66  1.94  | 8.09  1.03  17.92 | 6.65  0.79  14.80
Street   0        | 32.28 29.10 35.50 | 36.52 13.63 59.69 | 38.72 9.38  68.41
Street   5        | 27.20 15.18 39.37 | 35.08 3.85  66.69 | 40.40 1.87  79.40
Street   10       | 24.50 9.27  39.92 | 34.83 2.22  67.85 | 41.19 0.68  82.21
Street   15       | 21.22 5.96  36.67 | 37.46 0.80  74.57 | 38.09 0.68  75.97
Street   20       | 18.62 3.05  34.39 | 36.88 0.70  73.50 | 31.80 0.51  63.49
Street   25       | 18.04 2.56  33.71 | 37.01 0.65  73.82 | 28.56 0.43  57.04
Babble   0        | 32.81 24.60 41.11 | 32.69 12.36 53.27 | 39.03 5.57  72.89
Babble   5        | 25.28 14.82 35.86 | 29.42 5.36  53.79 | 40.31 1.38  79.73
Babble   10       | 24.17 8.15  40.39 | 31.49 1.74  61.60 | 40.84 0.50  81.69
Babble   15       | 23.94 5.59  42.53 | 32.59 1.05  64.53 | 36.84 0.43  73.70
Babble   20       | 15.29 4.26  26.46 | 27.30 0.94  53.98 | 39.04 0.21  78.35
Babble   25       | 15.44 3.03  28.00 | 27.59 0.87  54.64 | 38.26 0.19  76.80
Subway   0        | 32.16 32.35 31.95 | 35.84 11.06 62.64 | 37.81 5.71  72.52
Subway   5        | 25.29 17.33 33.89 | 34.22 3.86  67.05 | 37.73 2.54  75.80
Subway   10       | 21.08 11.92 30.99 | 34.75 2.15  70.01 | 39.69 0.71  81.85
Subway   15       | 19.51 7.30  32.72 | 35.00 1.10  71.67 | 37.28 0.73  76.81
Subway   20       | 19.67 5.52  34.97 | 40.57 0.29  84.14 | 25.46 1.22  51.67
Subway   25       | 19.45 4.18  34.91 | 36.24 0.15  72.78 | 23.53 0.63  46.71


Table 2
Comparison (Part II) of voice activity detection probability of error (PE), probability of miss (PM) and false alarm probability (PF) for AMR VAD2, the CMAP-based method [1] and the proposed technique.

Noise    SNR (dB) | AMR VAD2          | CMAP [1]          | Proposed
                  | PE    PM    PF    | PE    PM    PF    | PE    PM    PF
Car      0        | 9.06  0.34  21.21 | 6.27  2.03  12.15 | 6.02  1.83  11.85
Car      5        | 8.29  0.63  18.96 | 5.86  1.85  11.46 | 5.62  1.78  10.94
Car      10       | 7.13  0.56  16.29 | 5.61  1.75  10.99 | 5.51  1.69  10.82
Car      15       | 6.45  0.76  14.36 | 5.40  1.65  10.61 | 5.34  1.60  10.53
Car      20       | 5.87  1.04  12.58 | 5.22  1.53  10.38 | 5.17  1.51  10.26
Car      25       | 5.75  1.13  12.18 | 5.06  1.46  10.07 | 5.01  1.42  10.00
Office   0        | 18.73 8.70  32.71 | 17.34 13.81 22.25 | 16.94 13.68 21.47
Office   5        | 16.61 5.44  32.15 | 16.21 12.24 21.72 | 15.78 11.93 21.13
Office   10       | 15.41 1.58  34.67 | 11.27 7.48  16.65 | 10.84 6.79  16.48
Office   15       | 16.11 0.73  37.52 | 8.40  2.92  16.03 | 8.16  2.81  15.57
Office   20       | 16.61 0.33  39.26 | 6.05  1.53  12.34 | 5.92  1.42  12.18
Office   25       | 12.47 0.36  29.33 | 5.86  1.48  11.95 | 5.78  1.05  12.36
White    0        | 10.57 11.29 9.57  | 10.81 5.17  18.66 | 10.35 4.83  18.03
White    5        | 9.30  3.87  16.85 | 6.61  3.43  11.02 | 6.34  3.06  10.92
White    10       | 9.11  2.08  18.89 | 6.12  2.25  11.53 | 5.86  1.94  11.32
White    15       | 9.27  1.61  19.93 | 5.84  1.80  11.45 | 5.71  1.75  11.21
White    20       | 8.39  0.29  19.66 | 5.43  1.45  10.97 | 5.39  1.44  10.88
White    25       | 6.68  0.27  15.60 | 5.08  1.09  10.63 | 5.05  1.09  10.56
Street   0        | 43.65 7.95  79.80 | 41.25 81.67 0.32  | 40.21 79.52 0.41
Street   5        | 44.51 1.19  88.36 | 39.36 74.92 3.36  | 39.70 78.71 0.19
Street   10       | 47.19 0.09  94.87 | 40.18 79.07 0.81  | 25.45 37.68 13.14
Street   15       | 43.28 0.25  86.85 | 41.33 81.66 0.50  | 20.22 25.93 14.53
Street   20       | 43.46 0.08  87.38 | 17.03 28.48 5.43  | 16.77 15.53 18.02
Street   25       | 41.87 0.04  84.22 | 12.61 10.75 14.49 | 12.55 8.74  16.40
Babble   0        | 38.91 17.50 60.58 | 40.12 78.47 1.29  | 38.83 75.39 1.81
Babble   5        | 36.02 4.00  68.44 | 36.19 63.40 8.63  | 36.12 63.23 8.68
Babble   10       | 39.90 1.03  79.25 | 26.10 36.93 15.13 | 22.01 32.17 11.88
Babble   15       | 36.15 0.41  72.34 | 18.81 25.52 12.02 | 15.32 17.47 13.71
Babble   20       | 27.86 0.45  55.61 | 15.16 18.32 11.96 | 14.74 13.75 15.74
Babble   25       | 26.87 0.36  53.71 | 11.98 8.82  15.17 | 11.72 12.36 11.07
Subway   0        | 36.84 27.35 47.11 | 32.14 57.79 6.17  | 30.57 54.53 6.31
Subway   5        | 29.78 6.26  55.21 | 29.51 46.16 11.14 | 24.65 28.32 20.90
Subway   10       | 33.94 2.09  68.38 | 16.40 10.72 22.56 | 13.86 8.47  20.27
Subway   15       | 32.75 0.83  67.96 | 14.74 9.15  19.84 | 11.32 7.39  15.58
Subway   20       | 35.58 0.39  73.63 | 12.64 8.73  16.59 | 12.40 5.57  19.31
Subway   25       | 34.97 0.26  70.11 | 10.87 6.54  15.25 | 10.72 3.46  18.07

Table 3
Computational cost comparison per single frame (= 10 ms). The FFT part is excluded, since the FFT routine can be reused in the noise suppression, which is a preprocessing module of speech coding.

                     G.729B   CMAP [1]   Proposed
Computational cost   1241     8979       9002

Even though the superiority of the proposed approach in the objective tests is clear, its computational complexity should also be evaluated against that of the conventional methods for a fair comparison. In the case of G.729B, we excluded the feature extraction part, since some coding parameters are reused in the G.729B encoder. Indeed, the FFT routine can be shared with the noise suppression routine, which is a primary preprocessing module of many commercial codecs. In this regard, Table 3 summarizes the computational complexity in terms of weighted operation costs, with add = 1, multiplication = 1, division = 5 and exponent = 10. The results show that the proposed method requires almost the same computation as Sohn's method [7] and the CMAP [1]; these MAP-based methods need much more computation than the G.729B method, but since most of the VAD computation can be reused in the noise suppression algorithm, as in [16], the computational load of the proposed method can be much reduced in practice.
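The weighting scheme behind Table 3 can be illustrated as follows; the per-frame operation counts in the example call are made-up placeholders, not the counts underlying the table.

```python
# Weighted operation-cost model used for Table 3.
WEIGHTS = {"add": 1, "mul": 1, "div": 5, "exp": 10}

def weighted_cost(op_counts):
    """op_counts: mapping of operation name to its count per frame."""
    return sum(WEIGHTS[op] * n for op, n in op_counts.items())

# Example with hypothetical counts:
# weighted_cost({"add": 512, "mul": 512, "div": 128, "exp": 128}) == 2944
```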

5. Conclusions

In this paper, we have proposed a novel VAD technique based on the CMAP criterion in which the spectral gradient is incorporated for a robust VAD decision. The proposed CMAP criterion selects the hypothesis with the maximal conditional probability given the current observation, the voice activity decision in the previous frame and the class derived from the spectral gradient, where tracking the spectral gradient amounts to computing the difference between the current power spectrum and the averaged long-term power spectrum. The proposed approach yields better performance than the conventional methods in various noise environments.


Acknowledgments

This work was supported by the research fund of Hanyang University (HY-2011-201100000000210).

References

[1] J.W. Shin, H.J. Kwon, S.H. Jin, N.S. Kim, Voice activity detection based on conditional MAP criterion, IEEE Signal Processing Letters 15 (2008) 256-260.
[2] J.-H. Chang, J.W. Shin, N.S. Kim, Voice activity detector employing generalised Gaussian distribution, Electronics Letters 40 (24) (2004) 1561-1562.
[3] J.-H. Chang, N.S. Kim, Voice activity detection based on complex Laplacian model, Electronics Letters 39 (7) (2003) 632-634.
[4] J.W. Shin, J.-H. Chang, N.S. Kim, Statistical modeling of speech signals based on generalized gamma distribution, IEEE Signal Processing Letters 12 (3) (2005) 258-261.
[5] K. El-Maleh, Classification-Based Techniques for Digital Coding of Speech-Plus-Noise, Ph.D. Thesis, McGill University, 2004.
[6] J.-H. Chang, N.S. Kim, S.K. Mitra, Voice activity detection based on multiple statistical models, IEEE Transactions on Signal Processing 54 (6) (2006) 1965-1976.
[7] J. Sohn, N.S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1) (1999) 1-3.


[8] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-32 (6) (1984) 1109-1121.
[9] 3GPP2 Spec., Enhanced variable rate codec (EVRC), 3GPP2-C.S0014-0, vol. 1.0, April 2004.
[10] Y.D. Cho, K. Al-Naimi, A. Kondoz, Improved voice activity detection based on a smoothed statistical likelihood ratio, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 7-11 May 2001.
[11] J. Ramirez, J.C. Segura, C. Benitez, L. Garcia, A. Rubio, Statistical voice activity detection using a multiple observation likelihood ratio test, IEEE Signal Processing Letters 12 (10) (2005) 689-692.
[12] J.-H. Chang, J.W. Shin, N.S. Kim, Likelihood ratio test with complex Laplacian model for voice activity detection, in: Proceedings of Eurospeech, August 2003, pp. 1065-1068.
[13] W.J. Lee, J.-H. Chang, Minima-controlled speech presence uncertainty tracking method for speech enhancement, Signal Processing 91 (1) (2011) 155-161.
[14] ITU-T, A silence compression scheme for G.729 optimised for terminals conforming to Recommendation V.70, ITU-T Rec. G.729, Annex B, 1996.
[15] ETSI, Voice activity detection (VAD) for adaptive multi-rate (AMR) speech traffic channels, ETSI EN 301 708 v7.1.1, 1999.
[16] ITU-T, Appendix III: G.729 Annex B enhancement in voice-over-IP applications, Option 2, 2005.