Speech Communication 52 (2010) 664–677 www.elsevier.com/locate/specom
Improved likelihood ratio test based voice activity detector applied to speech recognition

J.M. Górriz a,*, J. Ramírez a, E.W. Lang b, C.G. Puntonet c, I. Turias d

a Dpt. Signal Theory, Networking and Communications, University of Granada, 18071 Granada, Spain
b Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
c Dpt. Computer Architecture and Technology, University of Granada, 18071 Granada, Spain
d Dpt. Lenguajes y Sistemas Informáticos, University of Cádiz, 11202 Algeciras, Spain

Received 7 December 2007; received in revised form 18 September 2009; accepted 4 March 2010
Abstract

Nowadays, the accuracy of speech processing systems is strongly affected by acoustic noise. This is a serious obstacle regarding the demands of modern applications. Therefore, these systems often need a noise reduction algorithm working in combination with a precise voice activity detector (VAD). The computation needed to achieve denoising and speech detection must not exceed the limitations imposed by real time speech processing systems. This paper presents a novel VAD for improving speech detection robustness in noisy environments and the performance of speech recognition systems in real time applications. The algorithm is based on a Multivariate Complex Gaussian (MCG) observation model and defines an optimal likelihood ratio test (LRT) involving multiple and correlated observations (MCO) based on a jointly Gaussian probability distribution (jGpdf) and a symmetric covariance matrix. The complete derivation of the jGpdf-LRT for the general case of a symmetric covariance matrix is shown in terms of the Cholesky decomposition, which allows the VAD decision rule to be computed efficiently. An extensive analysis of the proposed methodology for a low dimensional observation model demonstrates: (i) the improved robustness of the proposed approach, by means of a clear reduction of the classification error as the number of observations is increased, and (ii) the trade-off between the number of observations and the detection performance. The proposed strategy is also compared to different VAD methods, including the G.729, AMR and AFE standards as well as other recently reported algorithms, showing a sustained advantage in speech/non-speech detection accuracy and speech recognition performance using the AURORA databases.

© 2010 Elsevier B.V. All rights reserved.

Keywords: Voice activity detection; Generalized complex Gaussian probability distribution function; Robust speech recognition
1. Introduction

The new voice services, including discontinuous speech transmission (Benyassine et al., 1997; ITU, 1996; ETSI, 1999) or distributed speech recognition (DSR) over wireless and IP networks (ETSI, 2002), demand increasing levels of performance in adverse noise environments together with the design of high response rate speech processing systems.
* Corresponding author. Tel.: +34 958240842. E-mail address: [email protected] (J.M. Górriz).
These systems often require a noise reduction scheme working in combination with a precise voice activity detector (VAD) (Bouquin-Jeannes and Faucon, 1995) for estimating the noise spectrum during non-speech periods in order to compensate for its harmful effect on the speech signal. Indeed, the non-speech detection algorithm is an important and sensitive part of most existing single-microphone noise reduction schemes. Well known noise suppression algorithms such as Wiener filtering (WF) or spectral subtraction (Berouti et al., 1979; Boll, 1979) are widely used for robust speech recognition, and an accurate VAD is critical for them to attain a high level of performance.
Ineffective speech/non-speech detection is an important source of performance degradation in automatic speech recognition (ASR) systems. On the one hand, noise parameters such as the noise spectrum are updated during non-speech periods, so the speech enhancement system is strongly influenced by the quality of the noise estimation; on the other hand, frame-dropping (FD), a technique frequently used in speech recognition to reduce the number of insertion errors caused by noise, is based on the VAD decision, and speech classification errors lead to loss of speech and thus to irrecoverable deletion errors. An example of such a system is the ETSI standard for DSR, which incorporates noise suppression methods (see Fig. 1). The so-called advanced front-end (AFE) (ETSI, 2002) considers an energy-based VAD to estimate the noise spectrum for Wiener filtering and a different VAD for non-speech FD. The recognizer is usually based on Hidden Markov Models (HMMs), and the task consists of recognizing connected digits, which are modeled as whole-word HMMs with a number of states per word, Gaussian mixtures per state, etc., by maximizing the likelihood of each digit. First, an HMM is trained for each vocabulary word using a number of examples of that word. Second, to recognize an unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.

During the last decade numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1994, 1995). Most of them have focused on the development of robust algorithms, with special attention to the derivation and study of noise robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002; Sohn et al., 1999). The different approaches include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992), higher order statistics (Górriz et al., 2006; Ramírez et al., 2006) or combinations of different features (Tanyer and Özer, 2000; ITU, 1996; ETSI, 1999).
Fig. 1. ETSI standard for DSR: front-end feature extraction. (Block diagram: DFT, MEL-scale filter-bank, log, DCT and MFCC, Δ, ΔΔ stages of the base ETSI FE; Wiener filtering and frame-dropping driven by the VAD in the enhanced front-end.)
Sohn et al. (1999) proposed a robust VAD algorithm based on a statistical likelihood ratio test (LRT) involving a single observation vector and an HMM-based hang-over scheme. Later, Cho et al. (2001) suggested an improvement based on a smoothed LRT. The statistical model proposed by Sohn was extended to generalized Gaussian distributions in (Chang et al., 2004), deriving a complex model that uses the independence assumption of the real and imaginary parts. Most VADs in use today normally consider hang-over algorithms based on empirical models to smooth the VAD decision, which yields significant improvements in word-ending accuracy. It has been shown recently (Ramírez et al., 2001; Górriz et al., 2005, 2006) that incorporating long-term speech information into the decision rule also benefits speech/pause discrimination in high noise environments. However, an inherent delay is inevitably introduced, thus challenging the performance of real time processing systems. Finally, an important assumption made in the latter works needs revision: the independence of adjacent observations. In any speech processing system the input signal is usually decomposed into overlapping frames, so a clear statistical dependence between adjacent feature vectors is introduced. In this sense, some approaches (Gorriz et al., 2009) tried to take this dependence into account assuming dependence between adjacent observations only, i.e. a tridiagonal covariance matrix; however, this assumption is arguable, for example for strongly correlated speech segments.

In this work we propose a novel advance in VAD by means of a multiple correlated observation likelihood ratio test (MCO-LRT), which is defined in terms of the Cholesky decomposition and of the previous observations, thus avoiding the inclusion of any processing delay (if required). The dependence between observations is addressed by using a Multivariate Complex Gaussian (MCG) model with a symmetric (or tridiagonal) covariance matrix and the following assumption: the observations are jointly Gaussian distributed with non-zero correlations. Important issues that also need to be discussed are: (i) the increased computational complexity, mainly due to the definition of the decision rule over large data sets, and (ii) the optimal criterion for the decision rule.

The paper is organized as follows: Section 2 reviews the theoretical background on observation models and LRT statistical decision theory used for VAD. Then, in Section 3, we propose a novel LRT based on an MCG observation model, the so-called MCO-LRT, with two possibilities: symmetric or tridiagonal covariance matrices. Section 4 considers its application to the problem of detecting speech in a noisy signal and addresses the computation of the novel MCO-LRT for VAD. Sections 4.1 and 4.2 analyze the proposed method for two, three and N consecutively correlated speech observations, respectively, using, as an example, an utterance of the AURORA 3 Spanish SpeechDat-Car (SDC) database (Moreno et al., 2000). Section 5 describes the experimental framework considered for the evaluation of
the proposed endpoint detection in both VAD and speech recognition frameworks using the AURORA databases (Moreno et al., 2000; Hirsch and Pearce, 2000). Finally, Section 6 summarizes the conclusions of this work.

2. Background on observation models for VAD

For single observation based VAD, the noise signal n is usually assumed to be added to the speech signal s, with their sum denoted by y. Given the two hypotheses H_0 and H_1, which indicate speech absence and speech presence respectively, it is assumed that:

H_0 (speech absent):  \mathbf{y} = \mathbf{N}
H_1 (speech present): \mathbf{y} = \mathbf{S} + \mathbf{N}        (1)

where \mathbf{y} = [y(\omega_1), y(\omega_2), \ldots, y(\omega_K)]^T, \mathbf{N} = [N(\omega_1), N(\omega_2), \ldots, N(\omega_K)]^T and \mathbf{S} = [S(\omega_1), S(\omega_2), \ldots, S(\omega_K)]^T are the DFT coefficients of the noisy speech, noise and clean speech, respectively. The above statistical model is completed with the assumption that the DFT coefficients of each process are asymptotically independent (Sohn et al., 1999), and with an appropriate specification of the distribution of the DFT coefficients (Chang et al., 2004). The most common choice is the complex Gaussian pdf, as in (Sohn et al., 1999; Chang et al., 2004; Górriz et al., 2005). The first assumption allows us to write the probability distribution of the random vector y as:

p(\mathbf{y}) = \prod_{k=1}^{K} p(y(\omega_k))        (2)

The second specification requires an additional independence assumption: if y_R(\omega_k) and y_I(\omega_k) for k = 1, \ldots, K denote the real and imaginary parts of the DFT coefficients of y, respectively, then

p(y(\omega_k)) = p(y_R(\omega_k))\, p(y_I(\omega_k))        (3)
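To make the observation model of Eqs. (1)-(3) concrete, the following Python sketch builds the per-frame DFT observation vector and splits it into real and imaginary parts; the frame size, window and FFT length are illustrative choices, not fixed by the original formulation.

```python
import numpy as np

def dft_observation(frame, n_fft=256):
    """Compute the DFT coefficients y(w_1), ..., y(w_K) of one frame and
    return the real and imaginary parts, which Eq. (3) treats independently."""
    y = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    return y.real, y.imag
```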
Finally, the Generalized Gaussian distribution (GGD) is derived on the DFT domain for each part, such that:

p(y_P(\omega_k)) = \frac{v_k}{2\sigma_y}\sqrt{\frac{\Gamma(3/v_k)}{\Gamma^3(1/v_k)}}\; \exp\left\{ -\left[ \sqrt{\frac{\Gamma(3/v_k)}{\Gamma(1/v_k)}}\, \frac{|y_P(\omega_k)|}{\sigma_y} \right]^{v_k} \right\}        (4)

where v_k denotes a shape parameter controlling the distribution shape and \sigma_y is the standard deviation of y(\omega_k). Note that for v_k = 1 or 2, the GGD becomes the Laplacian or Gaussian density, respectively. Although the methodology described in this paper is general, v_k is assumed to be equal to 2 in the following in order to obtain a relation to previous Gaussian-based approaches (Ramírez et al., 2001; Sohn et al., 1999); the experimental comparison to the referenced VADs is therefore fair.

From the previous equations it is straightforward to evaluate the distributions of the DFT coefficients under the respective hypotheses (H_0, H_1). Under a two-hypothesis test, the optimal decision rule that minimizes the error probability is the Bayes classifier. Given an observation vector ŷ to be classified, the problem reduces to selecting the class (H_0 or H_1) with the largest posterior probability P(H_i | ŷ). From the Bayes rule, the LRT for a single observation under the GGD-based observation model is expressed as:

L(\hat{\mathbf{y}}) = \frac{p_{\mathbf{y}|H_1}(\hat{\mathbf{y}}|H_1)}{p_{\mathbf{y}|H_0}(\hat{\mathbf{y}}|H_0)} \;\gtrless\; \frac{P[H_0]}{P[H_1]} \quad\Rightarrow\quad \hat{\mathbf{y}} \leftrightarrow H_1 \;(>), \qquad \hat{\mathbf{y}} \leftrightarrow H_0 \;(<)        (5)

where P[\cdot] denotes the prior probability of each hypothesis and p_{\mathbf{y}|H_s} is the conditional probability of the observation y given the occurrence of H_s, for s = 0, 1.
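As an illustration of Eqs. (2)-(5), the following sketch evaluates the single-observation log-LRT over the K DFT bins in the Gaussian case (v_k = 2), assuming the noise and speech variances are known; the helper names, the per-bin variances and the equal-prior threshold are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from scipy.special import gamma

def ggd_logpdf(x, sigma, v):
    """Log of the generalized Gaussian density of Eq. (4) with shape v
    and standard deviation sigma (v = 2 reduces to the Gaussian case)."""
    a = np.sqrt(gamma(3.0 / v) / gamma(1.0 / v))                      # scale inside the exponent
    k = (v / (2.0 * sigma)) * np.sqrt(gamma(3.0 / v) / gamma(1.0 / v) ** 3)
    return np.log(k) - (a * np.abs(x) / sigma) ** v

def single_obs_log_lrt(y_bins, sigma_n, sigma_s, v=2.0):
    """Log-LRT of Eq. (5), accumulated over the K DFT bins (Eq. (2)) and
    applied independently to real and imaginary parts (Eq. (3))."""
    llr = 0.0
    for y in y_bins:                                                  # complex DFT coefficients
        for part in (y.real, y.imag):
            llr += ggd_logpdf(part, np.sqrt(sigma_n**2 + sigma_s**2), v)  # H1: speech + noise
            llr -= ggd_logpdf(part, sigma_n, v)                           # H0: noise only
    return llr

# Toy usage with equal priors (threshold 0): positive log-LRT -> speech
rng = np.random.default_rng(0)
frame = rng.normal(size=200) * 0.1
y_bins = np.fft.rfft(frame * np.hamming(200))
print(single_obs_log_lrt(y_bins, sigma_n=2.0, sigma_s=4.0) > 0.0)
```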
3. Multivariate Complex Gaussian (MCG) model based likelihood ratio test

In the LRT above, only a single observation, represented by a vector ŷ, is considered. The performance of the decision procedure can be improved by incorporating consecutive observations (contextual information) into the statistical test. In addition, Gaussian-based approaches, e.g. (Sohn et al., 1999; Ramírez et al., 2001), do not consider the complex nature of the DFT coefficients since they only use the magnitude in the pdf. The use of GGD-based approaches in the single observation LRT (Chang et al., 2004) allows the phase to be introduced "as a feature" in the VAD problem. This can also be achieved in a multiple observation (MO) scenario with a novel specification of the DFT coefficient distribution.

3.1. The independence assumption

When N measurements ŷ_1, ŷ_2, \ldots, ŷ_N are available in a two-class classification problem, a multiple observation likelihood ratio test (MO-LRT) can be defined by:

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \frac{p_{\mathbf{y}_1,\ldots,\mathbf{y}_N|H_1}(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N | H_1)}{p_{\mathbf{y}_1,\ldots,\mathbf{y}_N|H_0}(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N | H_0)}        (6)
This test involves the evaluation of an Nth-order LRT, for which a computationally efficient method is available when the individual measurements ŷ_k, k = 1, \ldots, N, are independent (Ramírez et al., 2001):

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \prod_{j=1}^{N} \frac{p_{\mathbf{y}_j|H_1}(\hat{\mathbf{y}}_j|H_1)}{p_{\mathbf{y}_j|H_0}(\hat{\mathbf{y}}_j|H_0)} = \prod_{j=1}^{N} \prod_{k=1}^{K} \frac{p_{y_j(\omega_k)|H_1}(\hat{y}_j(\omega_k)|H_1)}{p_{y_j(\omega_k)|H_0}(\hat{y}_j(\omega_k)|H_0)}        (7)

where the second equality holds by virtue of Eq. (2).
Fig. 2. Example of processing windows of an utterance in the AURORA 3 database (T_s = 1/8000 s; frame size = 200 samples; overlap between adjacent frames = 120 samples). The panels show the noisy signal and the 1st, 2nd and 3rd overlapping analysis windows (time axis in units of T_s). Note how the overlap between frames affects the statistical independence assumption.
However, if the measurements are not independent, as for example in the VAD problem where the frames used in the computation of the observation vectors usually overlap (see Fig. 2), a more appropriate model must be considered.

3.2. The novel approach

In this paper an MCG observation model is proposed for the set of observation vectors, which are assumed to be independently distributed in their components (Sohn et al., 1999; Chang et al., 2004; Ramírez et al., 2001) and in their real and imaginary parts (Chang et al., 2004). Thus, the MO-LRT in Eq. (6) can be rewritten as:

L_N(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) = \prod_{P \in \{R,I\}} \prod_{\omega} \frac{p_{\mathbf{y}^P_\omega|H_1}(\hat{\mathbf{y}}^P_\omega|H_1)}{p_{\mathbf{y}^P_\omega|H_0}(\hat{\mathbf{y}}^P_\omega|H_0)}        (8)

where \hat{\mathbf{y}}^P_\omega = (\hat{y}^P_1(\omega), \hat{y}^P_2(\omega), \ldots, \hat{y}^P_N(\omega)) is the vector obtained by joining the "P" part of the frequency component \omega of each observation vector ŷ_k, k = 1, \ldots, N, in the MO window, and the probability law is given by the jointly Gaussian probability density function (jGpdf)¹:

p_{\mathbf{y}^P_\omega|H_s}(\hat{\mathbf{y}}^P_\omega|H_s) = K_{H_s,N}\, \exp\left\{ -\frac{1}{2} (\hat{\mathbf{y}}^P_\omega)^T \left(C^N_{\mathbf{y}_\omega,H_s}\right)^{-1} \hat{\mathbf{y}}^P_\omega \right\}        (9)
for s = 0, 1, where K_{H_s,N} = (2\pi)^{-N/2}\, |C^N_{\mathbf{y}_\omega,H_s}|^{-1/2}, C^N_{\mathbf{y}_\omega,H_s} is the N-order covariance matrix of the observation vector under hypothesis H_s, and |\cdot| denotes the determinant of a matrix. As follows from Eq. (9), this novel LRT includes the dependence between adjacent observations, hence we name it MCO-LRT. The motivation for using the MCG model based MCO-LRT is evident: (i) the Mahalanobis distance has been successfully applied in many fields (Manly, 1986) whenever there is a clear dependence between observations, and (ii) to our knowledge, the use of generalized Gaussian models in multivariate analysis is not well established yet (Cho and Bui, 2005), and the determination of model parameters including correlation requires a high computational load (Niehsen, 1999). Moreover, the N-order covariance matrix used in the probability law of the MCG model automatically introduces the coefficient phase "as a feature" as well as the time dependence of the observations.

¹ In the following we assume for simplicity that the observation vector y_\omega is real. The extension of the results to the complex scenario is achieved by applying the derived expressions to the real and imaginary parts of the vector independently, i.e. using Eq. (8).
3.2.1. Dependence between observations

For our purposes (the speech signal, sampled at 8 kHz, is usually processed on a frame-by-frame basis, and the feature vectors ŷ_k are computed using a 25 ms frame size and a 10 ms overlap), the covariance matrix is preliminarily modeled as a symmetric tridiagonal matrix (Gorriz et al., 2009); thus, considering the correlation between adjacent observations exclusively, we express it as:
\left[C^N_{\mathbf{y}_\omega}\right]_{ij} = \begin{cases} \sigma^2_{y_i}(\omega) \equiv E[|y^\omega_i|^2] & \text{for } i = j \\ r_{ij}(\omega) \equiv E[y^\omega_i y^\omega_j] & \text{for } j = i + 1 \\ 0 & \text{otherwise} \end{cases}        (10)

where 1 \le i \le j \le N, and \sigma^2_{y_i}(\omega) and r_{ij}(\omega) are the variance and correlation frequency components of the observation vector (denoted \sigma_i and r_i for clarity, respectively). This approach reduces the computational effort of the algorithm, with additional benefits obtained from the properties of symmetric tridiagonal matrices (Yamani and Abdelmonem, 1997). A more realistic assumption is to consider the general case, i.e. a symmetric positive definite matrix, since voice segments are strongly correlated and the previous assumption could be thought to be flawed (although experimental results argue against this statement). In this case the model for the symmetric covariance matrix is expressed as:

\left[C^N_{\mathbf{y}_\omega}\right]_{ij} = \begin{cases} \sigma^2_{y_i}(\omega) \equiv E[|y^\omega_i|^2] & \text{for } i = j \\ r_{ij}(\omega) \equiv E[y^\omega_i y^\omega_j] & \text{for } j \neq i \end{cases}        (11)

The computation of the MCO-LRT VAD using this general model is achieved by means of the Cholesky decomposition (Golub and Loan, 1996), as shown in the following section for N observations. This novel approach allows Eq. (8) to be computed efficiently while including multiple correlations among the frames within the sliding window. This efficient computation (Golub and Loan, 1996) is based on the properties summarized in Appendix A.

3.2.2. The observation model

The observation model is completely defined by selecting the set of observation vectors y_k, k = 1, \ldots, N. The model selected for the observation vector is similar to that used by Sohn et al. (1999), consisting of the discrete Fourier transform (DFT) coefficients of the clean speech (S(\omega)) and the additive noise (N(\omega)). In the latter work they are assumed to be asymptotically independent Gaussian random variables. Thus, the binary hypothesis is rewritten in terms of each observation vector as:

\hat{\mathbf{y}}_k = \mathbf{N}_k \quad \text{under } H_0
\hat{\mathbf{y}}_k = \mathbf{S}_k + \mathbf{N}_k \quad \text{under } H_1        (12)

for k = 1, \ldots, N, where \mathbf{N}_k = (N_k(\omega_1), \ldots, N_k(\omega_K)) and \mathbf{S}_k = (S_k(\omega_1), \ldots, S_k(\omega_K)) are the kth noise and clean signal DFT observation vectors, respectively. However, in the previous works (Sohn et al., 1999; Ramírez et al., 2001) the covariance matrix given in Eq. (9) is either not considered at all (single observation) or assumed to be diagonal (statistical independence of the set of observations). Thus, we may expect to obtain a better characterization of the problem by introducing this statistical dependence. In the next sections we present a new algorithm based on this methodology for N = 2, N = 3 and for any number of observations N.
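As an illustration of the tridiagonal model of Eq. (10), the following sketch assembles the N-order covariance matrix for one frequency bin from a window of DFT observations; approximating the expectations by sample means over the K frequency bins is an illustrative choice that anticipates the frequency-ergodicity assumption used later in the paper.

```python
import numpy as np

def tridiagonal_covariance(window_dft):
    """Symmetric tridiagonal covariance of Eq. (10).
    window_dft: (N, K) array with the real part of the DFT coefficients of
    the N frames in the sliding window."""
    n_frames, _ = window_dft.shape
    cov = np.zeros((n_frames, n_frames))
    for i in range(n_frames):
        cov[i, i] = np.mean(window_dft[i] ** 2)               # sigma_i^2 ~ E[|y_i|^2]
    for i in range(n_frames - 1):
        r = np.mean(window_dft[i] * window_dft[i + 1])        # r_i ~ E[y_i y_{i+1}]
        cov[i, i + 1] = cov[i + 1, i] = r
    return cov
```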
In order to obtain a connection with previous proposals, the assumption of vanishing squared correlation functions must be considered along with symmetric tridiagonal matrices. Thus, we implement a novel robust speech detector with an inherently small delay that is mainly intended for real time applications such as mobile communications. The decision function will be described in terms of the correlation and variance coefficients, which constitute a correction to the previous LRT method (Ramírez et al., 2001) that assumes uncorrelated observation vectors.

4. MCO-LRT over continuous speech processing systems

The use of the MCO-LRT for voice activity detection is mainly motivated by two factors: (i) the optimal behavior of the decision rule so defined, and (ii) a multiple observation vector for classification defines a reduced variance LRT, achieving clear improvements in robustness against the presence of acoustic noise in the environment. The second property is also achieved by the previous MO-LRT (Ramírez et al., 2001) when a large window size (N) is selected, which counteracts the non-optimality of the decision rule based on the independence assumption. Consequently, substantial improvements are expected when dealing with a small MCO window size (N). On the other hand, for real time applications the model order cannot be as high as is usually selected (Ramírez et al., 2001) for speech recognition systems. The proposed approach is therefore expected to improve on the independent MO methodology for VAD when a low model order is selected.

The proposed MCO-LRT VAD is defined over the sliding window of observation vectors {ŷ_{l-m}, \ldots, ŷ_{l-1}, ŷ_l, ŷ_{l+1}, \ldots, ŷ_{l+m}}. Applying a log transformation to Eq. (8) and using the jGpdf model in Eq. (9) leads to:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left\{ \hat{\mathbf{y}}^T_\omega D^\omega_N \hat{\mathbf{y}}_\omega + \ln\left( \frac{|C^N_{\hat{\mathbf{y}}_\omega,H_0}|}{|C^N_{\hat{\mathbf{y}}_\omega,H_1}|} \right) \right\}        (13)

where D^\omega_N = (C^N_{\hat{\mathbf{y}}_\omega,H_0})^{-1} - (C^N_{\hat{\mathbf{y}}_\omega,H_1})^{-1}, N = 2m + 1 is the order of the model, l denotes the frame being classified as speech (H_1) or non-speech (H_0), and ŷ_\omega is the previously defined frequency observation vector over the sliding window. The evaluation of the LRT requires the computation of the inverse and the determinant of a matrix, as shown in Eq. (13). This is not an implementation obstacle because of the reduced MCO window size (N) and the selected model for the covariance matrix, in which only the dependence between adjacent observations is considered.
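A direct sketch of Eq. (13), computing the quadratic term and the log-determinant ratio per frequency bin from the two hypothesis covariances; explicit inversion is affordable here only because N is small, and the argument layout is an illustrative choice.

```python
import numpy as np

def mco_log_lrt(y_windows, cov_h0, cov_h1):
    """Eq. (13): y_windows is a (K, N) array holding, for each of the K bins,
    the length-N observation vector over the sliding window; cov_h0/cov_h1
    are the per-bin (N, N) covariance matrices (sequences of length K)."""
    llr = 0.0
    for y, c0, c1 in zip(y_windows, cov_h0, cov_h1):
        d = np.linalg.inv(c0) - np.linalg.inv(c1)          # D_N = C_H0^{-1} - C_H1^{-1}
        _, logdet0 = np.linalg.slogdet(c0)
        _, logdet1 = np.linalg.slogdet(c1)
        llr += 0.5 * (y @ d @ y + logdet0 - logdet1)
    return llr
```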
4.1. MCO-LRT for N = 2

The improvement provided by the proposed methodology is evaluated in this section by studying the case N = 2 (see Appendix A). In this low dimensional problem, explicit expressions for the evaluation of the second-order MCO-LRT can be obtained and a connection with previous proposals (Górriz et al., 2005) can be shown. In this case the covariance matrix is:

C^2_{\mathbf{y}_\omega} = \begin{pmatrix} \sigma_1(\omega) & r_1(\omega) \\ r_1(\omega) & \sigma_2(\omega) \end{pmatrix}        (14)

and, assuming vanishing squared correlations under H_0 and H_1, the LRT can be evaluated according to:

\ell_{l,2} = \frac{1}{2}\sum_\omega \left\{ \frac{\gamma_1(\omega)\xi_1(\omega)}{1+\xi_1(\omega)} + \frac{\gamma_2(\omega)\xi_2(\omega)}{1+\xi_2(\omega)} - \ln[1+\xi_1(\omega)] - \ln[1+\xi_2(\omega)] + 2\sqrt{\gamma_1(\omega)\gamma_2(\omega)}\left( \frac{\rho^{H_1}_1(\omega)}{\sqrt{(1+\xi_1(\omega))(1+\xi_2(\omega))}} - \rho^{H_0}_1(\omega) \right) \right\}        (15)

where \rho^{H_1}_1(\omega) = \frac{r^{H_1}_1(\omega)}{\sqrt{\sigma^{H_1}_1(\omega)\,\sigma^{H_1}_2(\omega)}} and \rho^{H_0}_1(\omega) are the correlation coefficients of the observations under H_1 and H_0, respectively, \xi_i(\omega) \equiv \sigma_{s_i}(\omega)/\sigma_{n_i}(\omega) and \gamma_i(\omega) \equiv (y^\omega_i)^2/\sigma_{n_i}(\omega) are the a priori and a posteriori SNRs, and l indexes the second observation. Finally, assuming that the correlation coefficients are negligible under H_0 (noise correlation coefficients), we can relate the decision rule to the previous MO-LRT (Ramírez et al., 2001) as follows:

\ell_{l,2} = \frac{1}{2}\sum_\omega \left\{ L_1(\omega) + L_2(\omega) + \frac{2\,\rho^{H_1}_1\sqrt{\gamma_1\gamma_2}}{\sqrt{(1+\xi_1)(1+\xi_2)}} \right\}        (16)

where L_i(\omega) \equiv \frac{\gamma_i(\omega)\xi_i(\omega)}{1+\xi_i(\omega)} - \ln(1+\xi_i(\omega)), for i = 1, 2, are the independent LRTs of the observations ŷ_1, ŷ_2, which are corrected by the term depending on \rho^{H_1}_1. At this point, ergodicity of the process in frequency must be assumed in order to estimate the new model parameter \rho^{H_1}_1. This means that the correlation coefficients are constant in frequency (wide sense stationarity), so an ensemble average can be computed using the sample mean correlation of the observations ŷ_1 and ŷ_2 included in the sliding window. Fig. 3 illustrates the motivation for using the correlation correction of Eq. (16). We show the evaluation of the proposed VAD on an utterance of the AURORA 3 Spanish SpeechDat-Car database (Moreno et al., 2000). As indicated by the black arrows, the span of the decision function over voice periods is increased significantly while the span over noise periods is decreased radically. On the other hand, the decision rule of the previous MO-LRT VAD is non-stationary and noisy for a small number of observations, as shown in the same figure.
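A compact sketch of the N = 2 decision function of Eq. (16), taking per-bin a priori and a posteriori SNR arrays and estimating the single correlation coefficient as a sample mean over frequency (the frequency-ergodicity assumption above); the variable names and the estimator are illustrative.

```python
import numpy as np

def mco_lrt_n2(gamma1, gamma2, xi1, xi2, y1, y2):
    """Eq. (16): two-observation MCO-LRT with the correlation correction.
    gamma*/xi* are a posteriori / a priori SNR arrays over the K bins;
    y1, y2 are the (real) DFT coefficients used to estimate rho under H1."""
    L1 = gamma1 * xi1 / (1.0 + xi1) - np.log(1.0 + xi1)
    L2 = gamma2 * xi2 / (1.0 + xi2) - np.log(1.0 + xi2)
    rho_h1 = np.mean(y1 * y2) / np.sqrt(np.mean(y1**2) * np.mean(y2**2))   # sample correlation over frequency
    corr = 2.0 * rho_h1 * np.sqrt(gamma1 * gamma2) / np.sqrt((1.0 + xi1) * (1.0 + xi2))
    return 0.5 * np.sum(L1 + L2 + corr)
```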
4.2. MCO-LRT for N observations

The improvement provided by the proposed methodology is evaluated in this section by studying the case N = 3 and then generalizing it to any number N of observations for the proposed models shown in Section 3.2.

4.2.1. Tridiagonal covariance matrix

In this case, the properties of a symmetric and tridiagonal matrix come into play:

C^3_{\mathbf{y}_\omega} = \begin{pmatrix} \sigma_1(\omega) & r_1(\omega) & 0 \\ r_1(\omega) & \sigma_2(\omega) & r_2(\omega) \\ 0 & r_2(\omega) & \sigma_3(\omega) \end{pmatrix}        (17)

The likelihood ratio can be expressed as (see Appendix B):

\ell_{l,3} = \sum_\omega \left\{ \ln\frac{K_{H_1,3}}{K_{H_0,3}} + \frac{1}{2}\,\hat{\mathbf{y}}^T_\omega D^\omega_3\, \hat{\mathbf{y}}_\omega \right\}        (18)

where K_{H_s,3} for s = 0, 1 and D^\omega_3 are defined in Sections 3 and 4, respectively. Assuming that the squared correlation coefficients vanish under H_0 and H_1, the log-LRT can be evaluated as follows (for clarity we have omitted the frequency dependence of the parameters):

\ell_{l,3} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{3}\left[ \frac{\gamma_i\xi_i}{1+\xi_i} - \ln(1+\xi_i) \right] + 2\sqrt{\gamma_1\gamma_2}\left( \frac{\rho^{H_1}_1}{\sqrt{(1+\xi_1)(1+\xi_2)}} - \rho^{H_0}_1 \right) + 2\sqrt{\gamma_2\gamma_3}\left( \frac{\rho^{H_1}_2}{\sqrt{(1+\xi_2)(1+\xi_3)}} - \rho^{H_0}_2 \right) - \sqrt{\gamma_1\gamma_3}\left( \frac{\rho^{H_1}_1\rho^{H_1}_2}{\sqrt{(1+\xi_1)(1+\xi_2)}\sqrt{(1+\xi_2)(1+\xi_3)}} - \rho^{H_0}_1\rho^{H_0}_2 \right) \right\}        (19)

Again the correlation coefficients under H_0 can be neglected, thus obtaining:

\ell_{l,3} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{3} L_i(\omega) + \frac{2\,\rho^{H_1}_1\sqrt{\gamma_1\gamma_2}}{\sqrt{(1+\xi_1)(1+\xi_2)}} + \frac{2\,\rho^{H_1}_2\sqrt{\gamma_2\gamma_3}}{\sqrt{(1+\xi_2)(1+\xi_3)}} - \frac{\rho^{H_1}_1\rho^{H_1}_2\sqrt{\gamma_1\gamma_3}}{\sqrt{(1+\xi_1)(1+\xi_2)^2(1+\xi_3)}} \right\}        (20)

A generalization of the result for N observations, assuming an Nth-order tridiagonal matrix C^N_{\mathbf{y}_\omega}, is given by:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left[ \sum_{i=l-m}^{l+m} L_i(\omega) + \sum_{i=l-m}^{l+m-1} \frac{2\sqrt{\gamma_i\gamma_{i+1}}\,\rho^{H_1}_i}{\sqrt{(1+\xi_i)(1+\xi_{i+1})}} \right]        (21)

which can be recursively computed as:

\ell_{l+1,N} = \ell_{l,N} - \Phi_{l-m}(\omega) + \Phi_{l+(m+1)}(\omega)        (22)

where

\Phi_{l-m}(\omega) = \frac{1}{2}\left[ L_{l-m}(\omega) + \frac{2\sqrt{\gamma_{l-m}\gamma_{l-m+1}}\,\rho^{H_1}_{l-m}}{\sqrt{(1+\xi_{l-m})(1+\xi_{l-m+1})}} \right]        (23)
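A sketch of the recursion in Eqs. (22) and (23): when the window slides one frame forward, only the contribution of the leaving frame is removed and that of the entering frame added. The per-frame term below aggregates over frequency, and the variable names are illustrative.

```python
import numpy as np

def frame_term(gamma, xi, gamma_next, xi_next, rho_h1):
    """Phi_i of Eq. (23), summed over the frequency bins: the independent
    LRT L_i plus the correlation correction with the next frame."""
    L_i = gamma * xi / (1.0 + xi) - np.log(1.0 + xi)
    corr = 2.0 * np.sqrt(gamma * gamma_next) * rho_h1 / np.sqrt((1.0 + xi) * (1.0 + xi_next))
    return 0.5 * np.sum(L_i + corr)

def slide(l_prev, phi_leaving, phi_entering):
    """Eq. (22): update the MCO-LRT when the window advances one frame."""
    return l_prev - phi_leaving + phi_entering
```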
Fig. 3. Comparison between the MO-LRT and the jGpdf-based MCO-LRT (N = 2) on an utterance of the AURORA 3 database. Top panel: noisy speech and jGpdf-LRT decision; bottom panel: MO-LRT and jGpdf-LRT values in dB (time axis in units of T_s). Note how the correlation correction term provides an almost binary decision rule, decreasing the decision span in noise periods and increasing it in speech periods.
4.2.2. Symmetric covariance matrix

In this case the log-LRT given in Eq. (13) is computed by means of the Cholesky decomposition (Golub and Loan, 1996). Given an N × N symmetric positive definite covariance matrix C^N_{\mathbf{y}_\omega,H_s} under hypothesis H_s, s = {0, 1}, the Cholesky decomposition yields an upper triangular matrix U^s_N with strictly positive diagonal entries [u_{ii}]_s such that C^N_{\mathbf{y}_\omega,H_s} = (U^s_N)^T U^s_N, where T stands for matrix transpose. Using P2 in Appendix A, the second term of Eq. (13) can be written as:

\ln\left( \frac{|C^N_{\hat{\mathbf{y}}_\omega,H_0}|}{|C^N_{\hat{\mathbf{y}}_\omega,H_1}|} \right) = \sum_{i=1}^{N} \ln\left( \frac{[u^2_{ii}]_0}{[u^2_{ii}]_1} \right)        (24)

In addition, the first term can be rewritten as follows:

\hat{\mathbf{y}}^T_\omega D^\omega_N \hat{\mathbf{y}}_\omega = \hat{\mathbf{y}}^T_\omega \left[ (U^0_N)^{-1}(U^0_N)^{-T} - (U^1_N)^{-1}(U^1_N)^{-T} \right] \hat{\mathbf{y}}_\omega = \|\mathbf{z}^0_\omega\|^2 - \|\mathbf{z}^1_\omega\|^2        (25)

where \|\cdot\|^2 stands for the squared norm of the vector \mathbf{z}^s_\omega \equiv (U^s_N)^{-T}\hat{\mathbf{y}}_\omega. Finally, after some algebra the complete MCO-LRT VAD transforms into:

\ell_{l,N} = \frac{1}{2}\sum_\omega \left\{ \sum_{i=1}^{N} \left( z^0_{\omega,i} \right)^2 - \sum_{i=1}^{N} \left( z^1_{\omega,i} \right)^2 + \sum_{i=1}^{N} \ln\left( \frac{[u^2_{ii}]_0}{[u^2_{ii}]_1} \right) \right\}
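A sketch of the Cholesky-based evaluation of Eqs. (24) and (25) for one frequency bin. NumPy returns a lower-triangular factor L with C = L Lᵀ (i.e. L = Uᵀ in the notation above), so the z vectors are obtained by a triangular solve instead of an explicit inverse; the function name is an illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_triangular

def mco_log_lrt_cholesky(y_vec, cov_h0, cov_h1):
    """Per-bin MCO-LRT term via Cholesky factors, following Eqs. (24)-(25)."""
    l0 = np.linalg.cholesky(cov_h0)
    l1 = np.linalg.cholesky(cov_h1)
    z0 = solve_triangular(l0, y_vec, lower=True)          # z^0 = L_0^{-1} y
    z1 = solve_triangular(l1, y_vec, lower=True)          # z^1 = L_1^{-1} y
    logdet_ratio = 2.0 * (np.sum(np.log(np.diag(l0))) - np.sum(np.log(np.diag(l1))))
    return 0.5 * (z0 @ z0 - z1 @ z1 + logdet_ratio)
```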