Speech Communication 52 (2010) 664–677 www.elsevier.com/locate/specom
Improved likelihood ratio test based voice activity detector applied to speech recognition

J.M. Górriz a,*, J. Ramírez a, E.W. Lang b, C.G. Puntonet c, I. Turias d

a Dpt. Signal Theory, Networking and Communications, University of Granada, 18071 Granada, Spain
b Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
c Dpt. Computer Architecture and Technology, University of Granada, 18071 Granada, Spain
d Dpt. Lenguajes y Sistemas Informáticos, University of Cádiz, 11202 Algeciras, Spain

Received 7 December 2007; received in revised form 18 September 2009; accepted 4 March 2010
Abstract

Nowadays, the accuracy of speech processing systems is strongly affected by acoustic noise. This is a serious obstacle regarding the demands of modern applications. Therefore, these systems often need a noise reduction algorithm working in combination with a precise voice activity detector (VAD). The computation needed to achieve denoising and speech detection must not exceed the limitations imposed by real time speech processing systems. This paper presents a novel VAD for improving speech detection robustness in noisy environments and the performance of speech recognition systems in real time applications. The algorithm is based on a Multivariate Complex Gaussian (MCG) observation model and defines an optimal likelihood ratio test (LRT) involving multiple and correlated observations (MCO) based on a jointly Gaussian probability distribution (jGpdf) and a symmetric covariance matrix. The complete derivation of the jGpdf-LRT for the general case of a symmetric covariance matrix is shown in terms of the Cholesky decomposition, which allows the VAD decision rule to be computed efficiently. An extensive analysis of the proposed methodology for a low dimensional observation model demonstrates: (i) the improved robustness of the proposed approach, by means of a clear reduction of the classification error as the number of observations is increased, and (ii) the trade-off between the number of observations and the detection performance. The proposed strategy is also compared to different VAD methods, including the G.729, AMR and AFE standards as well as other recently reported algorithms, showing a sustained advantage in speech/non-speech detection accuracy and speech recognition performance using the AURORA databases.

© 2010 Elsevier B.V. All rights reserved.

Keywords: Voice activity detection; Generalized complex Gaussian probability distribution function; Robust speech recognition
1. Introduction

The new voice services, including discontinuous speech transmission (Benyassine et al., 1997; ITU, 1996; ETSI, 1999) or distributed speech recognition (DSR) over wireless and IP networks (ETSI, 2002), demand increasing levels of performance in adverse noise environments together with the design of high response rate speech processing systems.
* Corresponding author. Tel.: +34 958240842. E-mail address: [email protected] (J.M. Górriz).
These systems often require a noise reduction scheme working in combination with a precise voice activity detector (VAD) (Bouquin-Jeannes and Faucon, 1995) for estimating the noise spectrum during non-speech periods in order to compensate for its harmful effect on the speech signal. Indeed, the non-speech detection algorithm is an important and sensitive part of most existing single-microphone noise reduction schemes. Well known noise suppression algorithms such as Wiener filtering (WF) or spectral subtraction (Berouti et al., 1979; Boll, 1979) are widely used for robust speech recognition, and an accurate VAD is critical for them to attain a high level of performance.
Ineffective speech/non-speech detection is an important source of performance degradation in automatic speech recognition (ASR) systems. On the one hand, noise parameters such as the noise spectrum are updated during non-speech periods, so the speech enhancement system is strongly influenced by the quality of the noise estimation; on the other hand, frame-dropping (FD), a technique frequently used in speech recognition to reduce the number of insertion errors caused by noise, is based on the VAD decision, and speech classification errors lead to loss of speech and thus to irrecoverable deletion errors. An example of such a system is the ETSI standard for DSR, which incorporates noise suppression methods (see Fig. 1). The so-called advanced front-end (AFE) (ETSI, 2002) considers an energy-based VAD to estimate the noise spectrum for Wiener filtering and a different VAD for non-speech FD. The recognizer is usually based on Hidden Markov Models (HMMs), and the task consists of recognizing connected digits, which are modeled as whole-word HMMs with a number of states per word, Gaussian mixtures per state, etc., by maximizing the likelihood of each digit. First, an HMM is trained for each vocabulary word using a number of examples of that word. Second, to recognize an unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.

During the last decade numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1994, 1995). Most of them have focused on the development of robust algorithms, with special attention to the derivation and study of noise robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002; Sohn et al., 1999). The different approaches include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992), higher order statistics (Górriz et al., 2006; Ramírez et al., 2006) or combinations of different features (Tanyer and Özer, 2000; ITU, 1996; ETSI, 1999).
Fig. 1. ETSI standard for DSR: front-end feature extraction. (Block diagram: DFT, MEL-scale filter-bank, log, DCT and MFCC, Δ, ΔΔ stages of the base ETSI FE; Wiener filtering and frame-dropping driven by the VAD in the enhanced front-end.)
Sohn et al. (1999) proposed a robust VAD algorithm based on a statistical likelihood ratio test (LRT) involving a single observation vector and an HMM-based hang-over scheme. Later, Cho et al. (2001) suggested an improvement based on a smoothed LRT. The statistical model proposed by Sohn was extended to generalized Gaussian distributions in (Chang et al., 2004), deriving a complex model that uses the independence assumption of the real and imaginary parts. Most VADs in use today normally consider hang-over algorithms based on empirical models to smooth the VAD decision, which yields significant improvements in word-ending accuracy. It has been shown recently (Ramírez et al., 2001; Górriz et al., 2005, 2006) that incorporating long-term speech information into the decision rule also benefits speech/pause discrimination in high noise environments. However, an inherent delay is inevitably introduced, thus challenging the performance of real time processing systems. Finally, an important assumption made in the latter works needs revision: the independence of adjacent observations. In any speech processing system the input signal is usually decomposed into overlapping frames, so a clear statistical dependence between adjacent feature vectors is introduced. In this sense, some approaches (Gorriz et al., 2009) tried to take this dependence into account assuming dependence between adjacent observations only, i.e. a tridiagonal covariance matrix; however, this assumption is arguable, for example for strongly correlated speech segments.

In this work we propose a novel advance in VAD by means of a multiple correlated observation likelihood ratio test (MCO-LRT), which is defined in terms of the Cholesky decomposition and of the previous observations, thus avoiding the inclusion of any processing delay (if required). The dependence between observations is addressed by using a Multivariate Complex Gaussian (MCG) model with a symmetric (or tridiagonal) covariance matrix and the following assumption: the observations are jointly Gaussian distributed with non-zero correlations. Important issues that also need to be discussed are: (i) the increased computational complexity, mainly due to the definition of the decision rule over large data sets, and (ii) the optimal criterion for the decision rule.

The paper is organized as follows: Section 2 reviews the theoretical background on observation models and LRT statistical decision theory used for VAD. Then, in Section 3, we propose a novel LRT based on an MCG observation model, the so-called MCO-LRT, with two possibilities: symmetric or tridiagonal covariance matrices. Section 4 considers its application to the problem of detecting speech in a noisy signal and addresses the computation of the novel MCO-LRT for VAD. Sections 4.1 and 4.2 analyze the proposed method for two, three and N consecutively correlated speech observations, respectively, using, as an example, an utterance of the AURORA 3 Spanish SpeechDat-Car (SDC) database (Moreno et al., 2000). Section 5 describes the experimental framework considered for the evaluation of
the proposed endpoint detection in both VAD and speech recognition frameworks using the AURORA databases (Moreno et al., 2000; Hirsch and Pearce, 2000). Finally, Section 6 summarizes the conclusions of this work.

2. Background on observation models for VAD

For single observation based VAD, the noise signal n is usually assumed to be added to the speech signal s, with their sum denoted by y. Given the two hypotheses H_0 and H_1, which indicate speech absence and speech presence respectively, it is assumed that:

H_0 (speech absent):  \mathbf{y} = \mathbf{N}
H_1 (speech present): \mathbf{y} = \mathbf{S} + \mathbf{N}        (1)

where \mathbf{y} = [y(\omega_1), y(\omega_2), \ldots, y(\omega_K)]^T, \mathbf{N} = [N(\omega_1), N(\omega_2), \ldots, N(\omega_K)]^T and \mathbf{S} = [S(\omega_1), S(\omega_2), \ldots, S(\omega_K)]^T are the DFT coefficients of the noisy speech, noise and clean speech, respectively. The above statistical model is completed with the assumption that the DFT coefficients of each process are asymptotically independent (Sohn et al., 1999), and with an appropriate specification of the distribution of the DFT coefficients (Chang et al., 2004). The most common choice is the complex Gaussian pdf, as in (Sohn et al., 1999; Chang et al., 2004; Górriz et al., 2005). The first assumption allows us to write the probability distribution of the random vector y as:

p(\mathbf{y}) = \prod_{k=1}^{K} p(y(\omega_k))        (2)

The second specification requires an additional independence assumption: if y_R(\omega_k) and y_I(\omega_k) for k = 1, \ldots, K denote the real and imaginary parts of the DFT coefficients of y, respectively, then

p(y(\omega_k)) = p(y_R(\omega_k))\, p(y_I(\omega_k))        (3)
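To make the observation model of Eqs. (1)-(3) concrete, the following Python sketch builds the per-frame DFT observation vector and splits it into real and imaginary parts; the frame size, window and FFT length are illustrative choices, not fixed by the original formulation.

```python
import numpy as np

def dft_observation(frame, n_fft=256):
    """Compute the DFT coefficients y(w_1), ..., y(w_K) of one frame and
    return the real and imaginary parts, which Eq. (3) treats independently."""
    y = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    return y.real, y.imag
```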
Finally, the Generalized Gaussian distribution (GGD) is derived on the DFT domain for each part, such that:

p(y_P(\omega_k)) = \frac{v_k}{2\sigma_y}\sqrt{\frac{\Gamma(3/v_k)}{\Gamma^3(1/v_k)}}\; \exp\left\{ -\left[ \sqrt{\frac{\Gamma(3/v_k)}{\Gamma(1/v_k)}}\, \frac{|y_P(\omega_k)|}{\sigma_y} \right]^{v_k} \right\}        (4)

where v_k denotes a shape parameter controlling the distribution shape and \sigma_y is the standard deviation of y(\omega_k). Note that for v_k = 1 or 2, the GGD becomes the Laplacian or Gaussian density, respectively. Although the methodology described in this paper is general, v_k is assumed to be equal to 2 in the following in order to obtain a relation to previous Gaussian-based approaches (Ramírez et al., 2001; Sohn et al., 1999); the experimental comparison to the referenced VADs is therefore fair.

From the previous equations it is straightforward to evaluate the distributions of the DFT coefficients under the respective hypotheses (H_0, H_1). Under a two-hypothesis test, the optimal decision rule that minimizes the error probability is the Bayes classifier. Given an observation vector ŷ to be classified, the problem reduces to selecting the class (H_0 or H_1) with the largest posterior probability P(H_i | ŷ). From the Bayes rule, the LRT for a single observation under the GGD-based observation model is expressed as:

L(\hat{\mathbf{y}}) = \frac{p_{\mathbf{y}|H_1}(\hat{\mathbf{y}}|H_1)}{p_{\mathbf{y}|H_0}(\hat{\mathbf{y}}|H_0)} \;\gtrless\; \frac{P[H_0]}{P[H_1]} \quad\Rightarrow\quad \hat{\mathbf{y}} \leftrightarrow H_1 \;(>), \qquad \hat{\mathbf{y}} \leftrightarrow H_0 \;(<)        (5)

where P[\cdot] denotes the prior probability of each hypothesis and p_{\mathbf{y}|H_s} is the conditional probability of the observation y given the occurrence of H_s, for s = 0, 1.
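As an illustration of Eqs. (2)-(5), the following sketch evaluates the single-observation log-LRT over the K DFT bins in the Gaussian case (v_k = 2), assuming the noise and speech variances are known; the helper names, the per-bin variances and the equal-prior threshold are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from scipy.special import gamma

def ggd_logpdf(x, sigma, v):
    """Log of the generalized Gaussian density of Eq. (4) with shape v
    and standard deviation sigma (v = 2 reduces to the Gaussian case)."""
    a = np.sqrt(gamma(3.0 / v) / gamma(1.0 / v))                      # scale inside the exponent
    k = (v / (2.0 * sigma)) * np.sqrt(gamma(3.0 / v) / gamma(1.0 / v) ** 3)
    return np.log(k) - (a * np.abs(x) / sigma) ** v

def single_obs_log_lrt(y_bins, sigma_n, sigma_s, v=2.0):
    """Log-LRT of Eq. (5), accumulated over the K DFT bins (Eq. (2)) and
    applied independently to real and imaginary parts (Eq. (3))."""
    llr = 0.0
    for y in y_bins:                                                  # complex DFT coefficients
        for part in (y.real, y.imag):
            llr += ggd_logpdf(part, np.sqrt(sigma_n**2 + sigma_s**2), v)  # H1: speech + noise
            llr -= ggd_logpdf(part, sigma_n, v)                           # H0: noise only
    return llr

# Toy usage with equal priors (threshold 0): positive log-LRT -> speech
rng = np.random.default_rng(0)
frame = rng.normal(size=200) * 0.1
y_bins = np.fft.rfft(frame * np.hamming(200))
print(single_obs_log_lrt(y_bins, sigma_n=2.0, sigma_s=4.0) > 0.0)
```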
3. Multivariate Complex Gaussian (MCG) model based likelihood ratio test

In the LRT above, only a single observation, represented by a vector ŷ, is considered. The performance of the decision procedure can be improved by incorporating consecutive observations (contextual information) into the statistical test. In addition, Gaussian-based approaches, e.g. (Sohn et al., 1999; Ramírez et al., 2001), do not consider the complex nature of the DFT coefficients since they only use the magnitude in the pdf. The use of GGD-based approaches in the single observation LRT (Chang et al., 2004) allows the phase to be introduced "as a feature" in the VAD problem. This can also be achieved in a multiple observation (MO) scenario with a novel specification of the DFT coefficient distribution.

3.1. The independence assumption

When N measurements ŷ_1, ŷ_2, \ldots, ŷ_N are available in a two-class classification problem, a multiple observation likelihood ratio test (MO-LRT) can be defined by:

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \frac{p_{\mathbf{y}_1,\ldots,\mathbf{y}_N|H_1}(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N | H_1)}{p_{\mathbf{y}_1,\ldots,\mathbf{y}_N|H_0}(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N | H_0)}        (6)
This test involves the evaluation of an Nth-order LRT, for which a computationally efficient method is available when the individual measurements ŷ_k, k = 1, \ldots, N, are independent (Ramírez et al., 2001):

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \prod_{j=1}^{N} \frac{p_{\mathbf{y}_j|H_1}(\hat{\mathbf{y}}_j|H_1)}{p_{\mathbf{y}_j|H_0}(\hat{\mathbf{y}}_j|H_0)} = \prod_{j=1}^{N} \prod_{k=1}^{K} \frac{p_{y_j(\omega_k)|H_1}(\hat{y}_j(\omega_k)|H_1)}{p_{y_j(\omega_k)|H_0}(\hat{y}_j(\omega_k)|H_0)}        (7)

where the second equality holds by virtue of Eq. (2).
Fig. 2. Example of processing windows of an utterance in the AURORA 3 database (T_s = 1/8000 s; frame size = 200 samples; overlap between adjacent frames = 120 samples). The panels show the noisy signal and the 1st, 2nd and 3rd overlapping analysis windows (time axis in units of T_s). Note how the overlap between frames affects the statistical independence assumption.
However, if the measurements are not independent, as for example in the VAD problem where the frames used in the computation of the observation vectors usually overlap (see Fig. 2), a more appropriate model must be considered.

3.2. The novel approach

In this paper an MCG observation model is proposed for the set of observation vectors, which are assumed to be independently distributed in their components (Sohn et al., 1999; Chang et al., 2004; Ramírez et al., 2001) and in their real and imaginary parts (Chang et al., 2004). Thus, the MO-LRT in Eq. (6) can be rewritten as:

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \prod_{P \in \{R,I\}} \prod_{\omega} \frac{p_{\mathbf{y}^P_\omega|H_1}(\hat{\mathbf{y}}^P_\omega|H_1)}{p_{\mathbf{y}^P_\omega|H_0}(\hat{\mathbf{y}}^P_\omega|H_0)}        (8)

where \hat{\mathbf{y}}^P_\omega = (\hat{y}^P_1(\omega), \hat{y}^P_2(\omega), \ldots, \hat{y}^P_N(\omega)) is the vector obtained by joining the "P" part of the frequency component \omega of each observation vector ŷ_k, k = 1, \ldots, N, in the MO window, and the probability law is given by the jointly Gaussian probability density function (jGpdf)¹:

p_{\mathbf{y}^P_\omega|H_s}(\hat{\mathbf{y}}^P_\omega|H_s) = K_{H_s,N}\, \exp\left\{ -\frac{1}{2} (\hat{\mathbf{y}}^P_\omega)^T \left(C^N_{\mathbf{y}_\omega,H_s}\right)^{-1} \hat{\mathbf{y}}^P_\omega \right\}        (9)
for s = 0, 1, where K_{H_s,N} = (2\pi)^{-N/2}\, |C^N_{\mathbf{y}_\omega,H_s}|^{-1/2}, C^N_{\mathbf{y}_\omega,H_s} is the N-order covariance matrix of the observation vector under hypothesis H_s, and |\cdot| denotes the determinant of a matrix. As follows from Eq. (9), this novel LRT includes the dependence between adjacent observations, hence we name it MCO-LRT. The motivation for using the MCG model based MCO-LRT is evident: (i) the Mahalanobis distance has been successfully applied in many fields (Manly, 1986) whenever there is a clear dependence between observations, and (ii) to our knowledge, the use of generalized Gaussian models in multivariate analysis is not well established yet (Cho and Bui, 2005), and the determination of model parameters including correlation requires a high computational load (Niehsen, 1999). Moreover, the N-order covariance matrix used in the probability law of the MCG model automatically introduces the coefficient phase "as a feature" as well as the time dependence of the observations.

¹ In the following we assume for simplicity that the observation vector y_\omega is real. The extension of the results to the complex scenario is achieved by applying the derived expressions to the real and imaginary parts of the vector independently, i.e. using Eq. (8).
3.2.1. Dependence between observations

For our purposes (the speech signal, sampled at 8 kHz, is usually processed on a frame-by-frame basis, and the feature vectors ŷ_k are computed using a 25 ms frame size and a 10 ms overlap), the covariance matrix is preliminarily modeled as a symmetric tridiagonal matrix (Gorriz et al., 2009); thus, considering the correlation between adjacent observations exclusively, we express it as:
\left[C^N_{\mathbf{y}_\omega}\right]_{ij} = \begin{cases} \sigma^2_{y_i}(\omega) \equiv E[|y^\omega_i|^2] & \text{for } i = j \\ r_{ij}(\omega) \equiv E[y^\omega_i y^\omega_j] & \text{for } j = i + 1 \\ 0 & \text{otherwise} \end{cases}        (10)

where 1 \le i \le j \le N, and \sigma^2_{y_i}(\omega) and r_{ij}(\omega) are the variance and correlation frequency components of the observation vector (denoted \sigma_i and r_i for clarity, respectively). This approach reduces the computational effort of the algorithm, with additional benefits obtained from the properties of symmetric tridiagonal matrices (Yamani and Abdelmonem, 1997). A more realistic assumption is to consider the general case, i.e. a symmetric positive definite matrix, since voice segments are strongly correlated and the previous assumption could be thought to be flawed (although experimental results argue against this statement). In this case the model for the symmetric covariance matrix is expressed as:

\left[C^N_{\mathbf{y}_\omega}\right]_{ij} = \begin{cases} \sigma^2_{y_i}(\omega) \equiv E[|y^\omega_i|^2] & \text{for } i = j \\ r_{ij}(\omega) \equiv E[y^\omega_i y^\omega_j] & \text{for } j \neq i \end{cases}        (11)

The computation of the MCO-LRT VAD using this general model is achieved by means of the Cholesky decomposition (Golub and Loan, 1996), as shown in the following section for N observations. This novel approach allows Eq. (8) to be computed efficiently while including multiple correlations among the frames within the sliding window. This efficient computation (Golub and Loan, 1996) is based on the properties summarized in Appendix A.

3.2.2. The observation model

The observation model is completely defined by selecting the set of observation vectors y_k, k = 1, \ldots, N. The model selected for the observation vector is similar to that used by Sohn et al. (1999), consisting of the discrete Fourier transform (DFT) coefficients of the clean speech (S(\omega)) and the additive noise (N(\omega)). In the latter work they are assumed to be asymptotically independent Gaussian random variables. Thus, the binary hypothesis is rewritten in terms of each observation vector as:

\hat{\mathbf{y}}_k = \mathbf{N}_k \quad \text{under } H_0
\hat{\mathbf{y}}_k = \mathbf{S}_k + \mathbf{N}_k \quad \text{under } H_1        (12)

for k = 1, \ldots, N, where \mathbf{N}_k = (N_k(\omega_1), \ldots, N_k(\omega_K)) and \mathbf{S}_k = (S_k(\omega_1), \ldots, S_k(\omega_K)) are the kth noise and clean signal DFT observation vectors, respectively. However, in the previous works (Sohn et al., 1999; Ramírez et al., 2001) the covariance matrix given in Eq. (9) is either not considered at all (single observation) or assumed to be diagonal (statistical independence of the set of observations). Thus, we may expect to obtain a better characterization of the problem by introducing this statistical dependence. In the next sections we present a new algorithm based on this methodology for N = 2, N = 3 and for any number of observations N.
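As an illustration of the tridiagonal model of Eq. (10), the following sketch assembles the N-order covariance matrix for one frequency bin from a window of DFT observations; approximating the expectations by sample means over the K frequency bins is an illustrative choice that anticipates the frequency-ergodicity assumption used later in the paper.

```python
import numpy as np

def tridiagonal_covariance(window_dft):
    """Symmetric tridiagonal covariance of Eq. (10).
    window_dft: (N, K) array with the real part of the DFT coefficients of
    the N frames in the sliding window."""
    n_frames, _ = window_dft.shape
    cov = np.zeros((n_frames, n_frames))
    for i in range(n_frames):
        cov[i, i] = np.mean(window_dft[i] ** 2)               # sigma_i^2 ~ E[|y_i|^2]
    for i in range(n_frames - 1):
        r = np.mean(window_dft[i] * window_dft[i + 1])        # r_i ~ E[y_i y_{i+1}]
        cov[i, i + 1] = cov[i + 1, i] = r
    return cov
```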
In order to obtain a connection with previous proposals, the assumption of vanishing squared correlation functions must be considered along with symmetric tridiagonal matrices. Thus, we implement a novel robust speech detector with an inherently small delay that is mainly intended for real time applications such as mobile communications. The decision function will be described in terms of the correlation and variance coefficients, which constitute a correction to the previous LRT method (Ramírez et al., 2001) that assumes uncorrelated observation vectors.

4. MCO-LRT over continuous speech processing systems

The use of the MCO-LRT for voice activity detection is mainly motivated by two factors: (i) the optimal behavior of the decision rule so defined, and (ii) a multiple observation vector for classification defines a reduced variance LRT, achieving clear improvements in robustness against the presence of acoustic noise in the environment. The second property is also achieved by the previous MO-LRT (Ramírez et al., 2001) when a large window size (N) is selected, which counteracts the non-optimality of the decision rule based on the independence assumption. Consequently, substantial improvements are expected when dealing with a small MCO window size (N). On the other hand, for real time applications the model order cannot be as high as is usually selected (Ramírez et al., 2001) for speech recognition systems. The proposed approach is therefore expected to improve on the independent MO methodology for VAD when a low model order is selected.

The proposed MCO-LRT VAD is defined over the sliding window of observation vectors {ŷ_{l-m}, \ldots, ŷ_{l-1}, ŷ_l, ŷ_{l+1}, \ldots, ŷ_{l+m}}. Applying a log transformation to Eq. (8) and using the jGpdf model in Eq. (9) leads to:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left\{ \hat{\mathbf{y}}^T_\omega D^\omega_N \hat{\mathbf{y}}_\omega + \ln\left( \frac{|C^N_{\hat{\mathbf{y}}_\omega,H_0}|}{|C^N_{\hat{\mathbf{y}}_\omega,H_1}|} \right) \right\}        (13)

where D^\omega_N = (C^N_{\hat{\mathbf{y}}_\omega,H_0})^{-1} - (C^N_{\hat{\mathbf{y}}_\omega,H_1})^{-1}, N = 2m + 1 is the order of the model, l denotes the frame being classified as speech (H_1) or non-speech (H_0), and ŷ_\omega is the previously defined frequency observation vector over the sliding window. The evaluation of the LRT requires the computation of the inverse and the determinant of a matrix, as shown in Eq. (13). This is not an implementation obstacle because of the reduced MCO window size (N) and the selected model for the covariance matrix, in which only the dependence between adjacent observations is considered.
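A direct sketch of Eq. (13), computing the quadratic term and the log-determinant ratio per frequency bin from the two hypothesis covariances; explicit inversion is affordable here only because N is small, and the argument layout is an illustrative choice.

```python
import numpy as np

def mco_log_lrt(y_windows, cov_h0, cov_h1):
    """Eq. (13): y_windows is a (K, N) array holding, for each of the K bins,
    the length-N observation vector over the sliding window; cov_h0/cov_h1
    are the per-bin (N, N) covariance matrices (sequences of length K)."""
    llr = 0.0
    for y, c0, c1 in zip(y_windows, cov_h0, cov_h1):
        d = np.linalg.inv(c0) - np.linalg.inv(c1)          # D_N = C_H0^{-1} - C_H1^{-1}
        _, logdet0 = np.linalg.slogdet(c0)
        _, logdet1 = np.linalg.slogdet(c1)
        llr += 0.5 * (y @ d @ y + logdet0 - logdet1)
    return llr
```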
4.1. MCO-LRT for N = 2

The improvement provided by the proposed methodology is evaluated in this section by studying the case N = 2 (see Appendix A). In this low dimensional problem, explicit expressions for the evaluation of the second-order MCO-LRT can be obtained and a connection with previous proposals (Górriz et al., 2005) can be shown. In this case the covariance matrix is:

C^2_{\mathbf{y}_\omega} = \begin{pmatrix} \sigma_1(\omega) & r_1(\omega) \\ r_1(\omega) & \sigma_2(\omega) \end{pmatrix}        (14)

and, assuming vanishing squared correlations under H_0 and H_1, the LRT can be evaluated according to:

\ell_{l,2} = \frac{1}{2}\sum_\omega \left\{ \frac{\gamma_1(\omega)\xi_1(\omega)}{1+\xi_1(\omega)} + \frac{\gamma_2(\omega)\xi_2(\omega)}{1+\xi_2(\omega)} - \ln[1+\xi_1(\omega)] - \ln[1+\xi_2(\omega)] + 2\sqrt{\gamma_1(\omega)\gamma_2(\omega)}\left( \frac{\rho^{H_1}_1(\omega)}{\sqrt{(1+\xi_1(\omega))(1+\xi_2(\omega))}} - \rho^{H_0}_1(\omega) \right) \right\}        (15)

where \rho^{H_1}_1(\omega) = \frac{r^{H_1}_1(\omega)}{\sqrt{\sigma^{H_1}_1(\omega)\,\sigma^{H_1}_2(\omega)}} and \rho^{H_0}_1(\omega) are the correlation coefficients of the observations under H_1 and H_0, respectively, \xi_i(\omega) \equiv \sigma_{s_i}(\omega)/\sigma_{n_i}(\omega) and \gamma_i(\omega) \equiv (y^\omega_i)^2/\sigma_{n_i}(\omega) are the a priori and a posteriori SNRs, and l indexes the second observation. Finally, assuming that the correlation coefficients are negligible under H_0 (noise correlation coefficients), we can relate the decision rule to the previous MO-LRT (Ramírez et al., 2001) as follows:

\ell_{l,2} = \frac{1}{2}\sum_\omega \left\{ L_1(\omega) + L_2(\omega) + \frac{2\,\rho^{H_1}_1\sqrt{\gamma_1\gamma_2}}{\sqrt{(1+\xi_1)(1+\xi_2)}} \right\}        (16)

where L_i(\omega) \equiv \frac{\gamma_i(\omega)\xi_i(\omega)}{1+\xi_i(\omega)} - \ln(1+\xi_i(\omega)), for i = 1, 2, are the independent LRTs of the observations ŷ_1, ŷ_2, which are corrected by the term depending on \rho^{H_1}_1. At this point, ergodicity of the process in frequency must be assumed in order to estimate the new model parameter \rho^{H_1}_1. This means that the correlation coefficients are constant in frequency (wide sense stationarity), so an ensemble average can be computed using the sample mean correlation of the observations ŷ_1 and ŷ_2 included in the sliding window. Fig. 3 illustrates the motivation for using the correlation correction of Eq. (16). We show the evaluation of the proposed VAD on an utterance of the AURORA 3 Spanish SpeechDat-Car database (Moreno et al., 2000). As indicated by the black arrows, the span of the decision function over voice periods is increased significantly while the span over noise periods is decreased radically. On the other hand, the decision rule of the previous MO-LRT VAD is non-stationary and noisy for a small number of observations, as shown in the same figure.
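A compact sketch of the N = 2 decision function of Eq. (16), taking per-bin a priori and a posteriori SNR arrays and estimating the single correlation coefficient as a sample mean over frequency (the frequency-ergodicity assumption above); the variable names and the estimator are illustrative.

```python
import numpy as np

def mco_lrt_n2(gamma1, gamma2, xi1, xi2, y1, y2):
    """Eq. (16): two-observation MCO-LRT with the correlation correction.
    gamma*/xi* are a posteriori / a priori SNR arrays over the K bins;
    y1, y2 are the (real) DFT coefficients used to estimate rho under H1."""
    L1 = gamma1 * xi1 / (1.0 + xi1) - np.log(1.0 + xi1)
    L2 = gamma2 * xi2 / (1.0 + xi2) - np.log(1.0 + xi2)
    rho_h1 = np.mean(y1 * y2) / np.sqrt(np.mean(y1**2) * np.mean(y2**2))   # sample correlation over frequency
    corr = 2.0 * rho_h1 * np.sqrt(gamma1 * gamma2) / np.sqrt((1.0 + xi1) * (1.0 + xi2))
    return 0.5 * np.sum(L1 + L2 + corr)
```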
4.2. MCO-LRT for N observations

The improvement provided by the proposed methodology is evaluated in this section by studying the case N = 3 and then generalizing it to any number N of observations for the proposed models shown in Section 3.2.

4.2.1. Tridiagonal covariance matrix

In this case, the properties of a symmetric and tridiagonal matrix come into play:

C^3_{\mathbf{y}_\omega} = \begin{pmatrix} \sigma_1(\omega) & r_1(\omega) & 0 \\ r_1(\omega) & \sigma_2(\omega) & r_2(\omega) \\ 0 & r_2(\omega) & \sigma_3(\omega) \end{pmatrix}        (17)

The likelihood ratio can be expressed as (see Appendix B):

\ell_{l,3} = \sum_\omega \left\{ \ln\frac{K_{H_1,3}}{K_{H_0,3}} + \frac{1}{2}\,\hat{\mathbf{y}}^T_\omega D^\omega_3\, \hat{\mathbf{y}}_\omega \right\}        (18)

where K_{H_s,3} for s = 0, 1 and D^\omega_3 are defined in Sections 3 and 4, respectively. Assuming that the squared correlation coefficients vanish under H_0 and H_1, the log-LRT can be evaluated as follows (for clarity we have omitted the frequency dependence of the parameters):

\ell_{l,3} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{3}\left[ \frac{\gamma_i\xi_i}{1+\xi_i} - \ln(1+\xi_i) \right] + 2\sqrt{\gamma_1\gamma_2}\left( \frac{\rho^{H_1}_1}{\sqrt{(1+\xi_1)(1+\xi_2)}} - \rho^{H_0}_1 \right) + 2\sqrt{\gamma_2\gamma_3}\left( \frac{\rho^{H_1}_2}{\sqrt{(1+\xi_2)(1+\xi_3)}} - \rho^{H_0}_2 \right) - \sqrt{\gamma_1\gamma_3}\left( \frac{\rho^{H_1}_1\rho^{H_1}_2}{\sqrt{(1+\xi_1)(1+\xi_2)}\sqrt{(1+\xi_2)(1+\xi_3)}} - \rho^{H_0}_1\rho^{H_0}_2 \right) \right\}        (19)

Again the correlation coefficients under H_0 can be neglected, thus obtaining:

\ell_{l,3} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{3} L_i(\omega) + \frac{2\,\rho^{H_1}_1\sqrt{\gamma_1\gamma_2}}{\sqrt{(1+\xi_1)(1+\xi_2)}} + \frac{2\,\rho^{H_1}_2\sqrt{\gamma_2\gamma_3}}{\sqrt{(1+\xi_2)(1+\xi_3)}} - \frac{\rho^{H_1}_1\rho^{H_1}_2\sqrt{\gamma_1\gamma_3}}{\sqrt{(1+\xi_1)(1+\xi_2)^2(1+\xi_3)}} \right\}        (20)

A generalization of the result for N observations, assuming an Nth-order tridiagonal matrix C^N_{\mathbf{y}_\omega}, is given by:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left[ \sum_{i=l-m}^{l+m} L_i(\omega) + \sum_{i=l-m}^{l+m-1} \frac{2\sqrt{\gamma_i\gamma_{i+1}}\,\rho^{H_1}_i}{\sqrt{(1+\xi_i)(1+\xi_{i+1})}} \right]        (21)

which can be recursively computed as:

\ell_{l+1,N} = \ell_{l,N} - \Phi_{l-m}(\omega) + \Phi_{l+(m+1)}(\omega)        (22)

where

\Phi_{l-m}(\omega) = \frac{1}{2}\left[ L_{l-m}(\omega) + \frac{2\sqrt{\gamma_{l-m}\gamma_{l-m+1}}\,\rho^{H_1}_{l-m}}{\sqrt{(1+\xi_{l-m})(1+\xi_{l-m+1})}} \right]        (23)
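A sketch of the recursion in Eqs. (22) and (23): when the window slides one frame forward, only the contribution of the leaving frame is removed and that of the entering frame added. The per-frame term below aggregates over frequency, and the variable names are illustrative.

```python
import numpy as np

def frame_term(gamma, xi, gamma_next, xi_next, rho_h1):
    """Phi_i of Eq. (23), summed over the frequency bins: the independent
    LRT L_i plus the correlation correction with the next frame."""
    L_i = gamma * xi / (1.0 + xi) - np.log(1.0 + xi)
    corr = 2.0 * np.sqrt(gamma * gamma_next) * rho_h1 / np.sqrt((1.0 + xi) * (1.0 + xi_next))
    return 0.5 * np.sum(L_i + corr)

def slide(l_prev, phi_leaving, phi_entering):
    """Eq. (22): update the MCO-LRT when the window advances one frame."""
    return l_prev - phi_leaving + phi_entering
```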
Fig. 3. Comparison between the MO-LRT and the jGpdf-based MCO-LRT (N = 2) on an utterance of the AURORA 3 database. Top panel: noisy speech and jGpdf-LRT decision; bottom panel: MO-LRT and jGpdf-LRT values in dB (time axis in units of T_s). Note how the correlation correction term provides an almost binary decision rule, decreasing the decision span in noise periods and increasing it in speech periods.
4.2.2. Symmetric covariance matrix

In this case the log-LRT given in Eq. (13) is computed by means of the Cholesky decomposition (Golub and Loan, 1996). Given an N × N symmetric positive definite covariance matrix C^N_{\mathbf{y}_\omega,H_s} under hypothesis H_s, s = {0, 1}, the Cholesky decomposition yields an upper triangular matrix U^s_N with strictly positive diagonal entries [u_{ii}]_s such that C^N_{\mathbf{y}_\omega,H_s} = (U^s_N)^T U^s_N, where T stands for matrix transpose. Using P2 in Appendix A, the second term of Eq. (13) can be written as:

\ln\left( \frac{|C^N_{\hat{\mathbf{y}}_\omega,H_0}|}{|C^N_{\hat{\mathbf{y}}_\omega,H_1}|} \right) = \sum_{i=1}^{N} \ln\left( \frac{[u^2_{ii}]_0}{[u^2_{ii}]_1} \right)        (24)

In addition, the first term can be rewritten as follows:

\hat{\mathbf{y}}^T_\omega D^\omega_N \hat{\mathbf{y}}_\omega = \hat{\mathbf{y}}^T_\omega \left[ (U^0_N)^{-1}(U^0_N)^{-T} - (U^1_N)^{-1}(U^1_N)^{-T} \right] \hat{\mathbf{y}}_\omega = \|\mathbf{z}^0_\omega\|^2 - \|\mathbf{z}^1_\omega\|^2        (25)

where \|\cdot\|^2 stands for the squared norm of the vector \mathbf{z}^s_\omega \equiv (U^s_N)^{-T}\hat{\mathbf{y}}_\omega. Finally, after some algebra the complete MCO-LRT VAD transforms into:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{N} \left( z^0_{\omega,i} \right)^2 - \sum_{i=1}^{N} \left( z^1_{\omega,i} \right)^2 + \sum_{i=1}^{N} \ln\left( \frac{[u^2_{ii}]_0}{[u^2_{ii}]_1} \right) \right\}
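A sketch of the Cholesky-based evaluation of Eqs. (24) and (25) for one frequency bin. NumPy returns a lower-triangular factor L with C = L Lᵀ (i.e. L = Uᵀ in the notation above), so the z vectors are obtained by a triangular solve instead of an explicit inverse; the function name is an illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_triangular

def mco_log_lrt_cholesky(y_vec, cov_h0, cov_h1):
    """Per-bin MCO-LRT term via Cholesky factors, following Eqs. (24)-(25)."""
    l0 = np.linalg.cholesky(cov_h0)
    l1 = np.linalg.cholesky(cov_h1)
    z0 = solve_triangular(l0, y_vec, lower=True)          # z^0 = L_0^{-1} y
    z1 = solve_triangular(l1, y_vec, lower=True)          # z^1 = L_1^{-1} y
    logdet_ratio = 2.0 * (np.sum(np.log(np.diag(l0))) - np.sum(np.log(np.diag(l1))))
    return 0.5 * (z0 @ z0 - z1 @ z1 + logdet_ratio)
```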