Speech Communication 48 (2006) 1458–1485
www.elsevier.com/locate/specom

Feature extraction from higher-lag autocorrelation coefficients for robust speech recognition

Benjamin J. Shannon, Kuldip K. Paliwal *

School of Microelectronic Engineering, Griffith University, Nathan Campus, Brisbane, QLD 4111, Australia

Received 4 July 2005; received in revised form 1 August 2006; accepted 1 August 2006

* Corresponding author. Tel.: +61 7 3875 6536; fax: +61 7 3875 5198. E-mail address: K.Paliwal@griffith.edu.au (K.K. Paliwal).
Abstract

In this paper, a feature extraction method that is robust to additive background noise is proposed for automatic speech recognition. Since the background noise corrupts the autocorrelation coefficients of the speech signal mostly at the lower time lags, while the higher-lag autocorrelation coefficients are least affected, this method discards the lower-lag autocorrelation coefficients and uses only the higher-lag autocorrelation coefficients for spectral estimation. The magnitude spectrum of the windowed higher-lag autocorrelation sequence is used here as an estimate of the power spectrum of the speech signal. This power spectral estimate is processed further (as in the well-known Mel frequency cepstral coefficient (MFCC) procedure) by the Mel filter bank, log operation and the discrete cosine transform to get the cepstral coefficients. These cepstral coefficients are referred to as the autocorrelation Mel frequency cepstral coefficients (AMFCCs). We evaluate the speech recognition performance of the AMFCC features on the Aurora and the resource management databases and show that they perform as well as the MFCC features for clean speech, and that their recognition performance is better than that of the MFCC features for noisy speech. Finally, we show that the AMFCC features perform better than the features derived from the robust linear prediction-based methods for noisy speech.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Speech recognition; Feature extraction; Robustness to noise; MFCC
1. Introduction

A speech recogniser is trained in a given acoustic environment and is normally deployed (or tested) in a different environment; thus, there is always a mismatch between the training and test environments. This mismatch causes a drastic degradation in speech recognition performance. One of the major factors responsible for the mismatch between the training and test environments is additive background noise (uncorrelated with speech) (Juang, 1991; Gong, 1995). A number of methods have been proposed in the literature to overcome this environmental mismatch problem. These include robust feature extraction methods (Ghitza, 1986; Mansour and Juang, 1989; Paliwal and Sondhi, 1991; Paliwal and Sagisaka, 1997), speech enhancement methods (Kim et al., 2003; Hermus and Wambacq, 2004), feature compensation methods (Hermansky and Morgan, 1994; Stern et al., 1996), multi-band methods (Bourlard
and Dupont, 1996; Tibrewala and Hermansky, 1997), missing feature methods (Lippmann, 1997; Cooke et al., 1997; Raj et al., 2004), and model compensation (and adaptation) methods (Gales and Woodland, 1996; Bellegarda, 1997; Lee, 1998).

The focus of this paper is on robust feature extraction for speech recognition. We are interested in developing a feature extraction method that can deal with the additive background noise distortion in a robust manner. Here, we are given a frame of the observed (noisy) signal x(n), n = 0, 1, …, N − 1, for analysis, where N is the frame length (in number of samples). This can be expressed as

x(n) = s(n) + d(n),    (1)

where s(n) is the clean speech signal and d(n) is the background noise signal. (We denote the power spectrum of the noisy signal x(n) by Pxx(ω) and its autocorrelation function by rxx(n); similarly, for the clean signal s(n) the corresponding symbols are Pss(ω) and rss(n), and for the noise signal d(n) they are Pdd(ω) and rdd(n).) Our aim is to extract recognition features from the noisy speech signal x(n) in such a manner that they capture the spectral characteristics of the clean speech signal s(n) accurately and are least affected by the noise d(n).

The Mel-frequency cepstral coefficients (MFCCs) are perhaps the most widely used features in current state-of-the-art speech recognition systems (Rabiner and Juang, 1993; Huang et al., 2001). In MFCC feature extraction (Davis and Mermelstein, 1980), the noisy signal x(n) is processed in the following steps: (1) perform short-time Fourier analysis of the signal x(n) using a finite-duration window (such as a 32 ms Hamming window) and use the periodogram method (Kay, 1988) to compute the power spectral estimate P̂xx(ω) of the signal x(n); (2) apply the Mel filter bank to the power spectrum P̂xx(ω) to get the filter-bank energies; and (3) compute the discrete cosine transform (DCT) of the log filter-bank energies to get the MFCCs. The MFCC features perform reasonably well for the recognition of clean speech, but their performance is very poor for noisy speech. This happens because the periodogram-based power spectral estimate used in MFCC computation is severely affected by the additive background noise, and this degrades the recognition performance of the MFCC features for noisy speech.
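For reference, steps (1)–(3) above can be sketched in executable form. The following single-frame Python sketch is not the authors' implementation; the filter-bank size (23 channels), the FFT length and the Mel-scale helper functions are our own assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular Mel filter bank as an (n_mels, n_fft//2 + 1) weight matrix."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mfcc_frame(frame, fs=8000, n_mels=23, n_ceps=12):
    """Steps (1)-(3) for one frame: periodogram -> Mel filter bank -> log -> DCT."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(N), 2 * N)) ** 2   # periodogram estimate
    energies = mel_filterbank(n_mels, 2 * N, fs) @ spec             # filter-bank energies
    return dct(np.log(energies + 1e-12), type=2, norm='ortho')[:n_ceps]
```

The AMFCC front-end described in Section 4 keeps everything after the spectral estimate and changes only how P̂xx(ω) is obtained.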
In the present paper, we use autocorrelation domain processing for robust estimation of the power spectrum from noisy speech. The autocorrelation function of a signal is related to the signal's power spectrum through the Fourier transform (Kay, 1988) and it has the following two attractive properties:

(1) Additivity property: If two signals are uncorrelated, the autocorrelation function of their sum is equal to the sum of their autocorrelation functions. Thus,

rxx(n) = rss(n) + rdd(n),    (2)

as s(n) and d(n) are uncorrelated signals.

(2) Robustness property: The autocorrelation function of a white random noise signal is zero everywhere except at the zeroth time lag (Kay, 1979). For broadband noise signals, it is mainly confined to the lower time lags and is very small or zero at the higher time lags (Mansour and Juang, 1989). As a result, the additive noise d(n) does not affect the higher lags of the autocorrelation function. Thus the higher-lag autocorrelation coefficients are relatively robust to additive noise distortion.

Because of these attractive properties, autocorrelation domain processing has been used in the past for autoregressive (AR) spectral estimation (or linear prediction (LP) analysis) of noisy signals. The initial effort in this direction was based on the use of high-order Yule–Walker equations (Gersch, 1970; Chan and Langford, 1982), where the autocorrelation coefficients involved in the Yule–Walker equation set exclude the zero-lag coefficient. Other similar methods have been used that either avoid the zero-lag coefficient (Cadzow, 1982; Paliwal, 1986a; Paliwal, 1986b; Paliwal, 1986c; McGinn and Johnson, 1989) or reduce the contribution of the first few coefficients (Mansour and Juang, 1989; Hernando and Nadeu, 1997). All of these techniques are based on all-pole modelling of the causal part of the autocorrelation sequence of the signal x(n). Two of these techniques (Mansour and Juang, 1989; Hernando and Nadeu, 1997) have been used to extract cepstral coefficient features for speech recognition and were found to provide some robustness to noise, but their recognition performance for clean speech is worse than that of the conventional linear prediction cepstral coefficient (LPCC) features (Hernando and Nadeu, 1997).

In the present paper, we propose a robust feature extraction method based on autocorrelation domain processing. (Preliminary results for this method were reported earlier in conference papers (Shannon and Paliwal, 2004; Shannon and Paliwal, 2005).)
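The additivity and robustness properties above are easy to check numerically for a single 32 ms frame. The sketch below is illustrative only (the "speech-like" test signal and the biased estimator are our own assumptions); it shows that the cross-terms left by a finite frame are small and that the white-noise autocorrelation is concentrated near lag zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256                                        # one 32 ms frame at 8 kHz
n = np.arange(N)
s = np.sin(2 * np.pi * 500 / 8000 * n) + 0.5 * np.sin(2 * np.pi * 1500 / 8000 * n)
d = rng.standard_normal(N)                     # white noise frame
x = s + d

def biased_autocorr(v):
    """Biased estimate r(k) = (1/N) * sum_n v(n) v(n+k), k = 0..N-1."""
    return np.correlate(v, v, mode='full')[len(v) - 1:] / len(v)

r_ss, r_dd, r_xx = biased_autocorr(s), biased_autocorr(d), biased_autocorr(x)

# Additivity: for uncorrelated s and d the cross-terms of a finite frame are
# small, so r_xx(k) stays close to r_ss(k) + r_dd(k).
print(np.max(np.abs(r_xx - (r_ss + r_dd))))

# Robustness: the white-noise autocorrelation is concentrated at lag zero;
# beyond 2 ms (16 samples at 8 kHz) its coefficients are comparatively small.
print(abs(r_dd[0]), np.abs(r_dd[16:]).max())
```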
Since the broadband noise distortion affects only the lower-lag autocorrelation coefficients, we discard them and utilise only the higher-lag (>2 ms) autocorrelation coefficients. We multiply the higher-lag autocorrelation coefficients of x(n) by a suitable window function (to be discussed later in this paper) and compute the magnitude spectrum of the windowed sequence as an estimate of the signal's power spectrum. We call this method of power spectrum estimation the higher-lag autocorrelation spectrum estimation (HASE) method. Steps (2) and (3) of the MFCC feature extraction procedure described above are applied to this power spectrum estimate to obtain the cepstral coefficient features. We call these features autocorrelation Mel frequency cepstral coefficients (AMFCCs).

Note that the HASE method of power spectrum estimation (used here for AMFCC computation) is based on the additivity and robustness properties of the signal's autocorrelation function. These two properties are strictly valid only in an asymptotic sense (under an ergodicity assumption), i.e., when the autocorrelation coefficients are computed from infinitely long signals. For AMFCC computation, we carry out only a short-time analysis of the speech signal (with a signal duration of about 32 ms) to compute the autocorrelation function. This autocorrelation function may not satisfy these two properties strictly. The effect of the short-time analysis on these properties and on the speech recognition performance is discussed later in this paper.

This paper is organised as follows. Section 2 describes some of the properties of the short-time autocorrelation function relevant in the context of the AMFCC method of feature extraction. The spectral estimation method used for computing the AMFCCs is described in Section 3. This method uses the magnitude spectrum of the windowed one-sided higher-lag autocorrelation sequence as an estimate P̂xx(ω) of the signal's power spectrum. The AMFCC feature extraction method is presented in Section 4. This method is evaluated for robust speech recognition and its performance is compared with that of other feature extraction methods reported in the literature. The recognition experiments carried out for this purpose and their results are described in Section 5. The conclusions are presented in Section 6.

2. Some properties of the short-time autocorrelation function

The autocorrelation function of a signal contains the same information about the signal as its power
spectrum (Kay, 1988). In the power spectrum domain, the information is presented as a function of frequency, and in the autocorrelation domain it is presented as a function of time (Anstey, 1966). In this section, we demonstrate some of the properties of the short-time autocorrelation function relevant in the context of the AMFCC method of feature extraction. Since the AMFCC method discards the zeroth and lower-lag autocorrelation coefficients and uses only the higher-lag autocorrelation coefficients for spectral estimation, it is necessary to know whether and how these coefficients contain the spectral information necessary for speech recognition. Also, since we are proposing the AMFCC method as a robust feature extraction procedure on the basis that the additive noise distortion has most of its autocorrelation coefficients concentrated near the lower time lags and its higher-lag autocorrelation coefficients are zero (or very small), we want to know what types of noise signals have this property and to what extent they satisfy it (under the short-time analysis framework). This question is explored in the second part of this section.

2.1. Speech signals

The commonly used source-system model of speech production views the speech signal as the output of a linear, time-varying system (or filter) excited by either a white noise source (for unvoiced speech) or a periodic pulse train source (for voiced speech) (Huang et al., 2001). For speech recognition, we are typically interested in extracting the magnitude response of the time-varying system as a function of time (we assume that the magnitude response of the system carries sufficient speech information for accurate recognition). For this, we perform a frame-wise short-time analysis of the speech signal to compute its power spectrum and use its (smoothed) envelope to get the system's magnitude response.

In order to illustrate whether and how the higher-lag autocorrelation coefficients contain the spectral information necessary for speech recognition, we first consider a 32 ms frame of a clean (noise-free) voiced speech signal from a female /ey/ sound. We show its power spectrum and autocorrelation function in Fig. 1(a) and (b), respectively (the autocorrelation function is obtained from the power spectrum through inverse Fourier transformation; thus, it is a biased estimate of the autocorrelation function for the frame).
Fig. 1. Power spectrum and autocorrelation function of a 32 ms long voiced speech frame of female / ey / sound. (a) Signal power spectrum (in dB). (b) Autocorrelation sequence associated with the signal spectrum in (a). (c) Smooth spectral envelope (in dB) computed by retaining the first 12 cepstral coefficients of the signal spectrum shown in (a). (d) Autocorrelation sequence associated with the smooth spectrum shown in (c). (e) Excitation power spectrum (in dB) estimated by subtracting the smooth (log) spectrum shown in (c) from the signal (log) spectrum shown in (a). (f) Autocorrelation sequence associated with the excitation spectrum shown in (e).
To illustrate the system and source components for this signal, we perform homomorphic analysis and compute the smooth spectral envelope by retaining the first 12 cepstral coefficients. This is used as the power spectral estimate of the system component. The residual power spectrum (obtained by dividing the signal's power spectrum by the smooth spectral envelope) is used as the power spectral estimate of the (excitation) source component. Fig. 1(c) and (e) show these power spectral estimates of the system and source components. Their associated autocorrelation functions are shown in Fig. 1(d) and (f), respectively. We can observe from Fig. 1(d) and (f) that the autocorrelation function of the system component is confined to the lower lags only, and that of the source component shows the periodicity of the voiced speech signal. Since the signal's autocorrelation function (shown in Fig. 1(b)) is the convolution of the autocorrelation functions of the system and source components, it is periodic, with the smooth spectral envelope information repeated periodically over the entire lag range, and its values are quite large at the higher lags (especially when we remember that it is a biased estimate). Because of this, we should be able to discard the lower portion of the signal's autocorrelation function and still get a good power spectral estimate for speech recognition from the higher-lag autocorrelation coefficients (as shown in the later sections).

The same analysis is repeated for a 32 ms frame of a clean (noise-free) unvoiced speech signal from a female /s/ sound and the results are shown in Fig. 2. Here the autocorrelation function of the system component (Fig. 2(d)) is confined to the lower lags and that of the source component (Fig. 2(f)) is non-periodic and non-zero at the higher lags. As a result of the convolution, the signal's autocorrelation function (Fig. 2(b)) contains the spectral information about the system component at the higher lags (though not to the same extent as it does for the voiced speech signal). Thus, we should be able to use the signal's higher-lag autocorrelation coefficients to estimate the power spectrum of unvoiced speech needed for feature extraction.
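For reference, the homomorphic decomposition used to produce the system and source components in Figs. 1 and 2 can be sketched as follows. This is our own minimal version, not the authors' code; the FFT length and the small flooring constant are assumptions.

```python
import numpy as np

def envelope_and_excitation(frame, n_ceps=12, nfft=512):
    """Split a frame's log power spectrum into a smooth envelope (system part)
    and a residual (excitation part) by keeping the first n_ceps cepstral
    coefficients."""
    spec = np.abs(np.fft.fft(frame * np.hamming(len(frame)), nfft)) ** 2
    log_spec = np.log(spec + 1e-12)
    ceps = np.fft.ifft(log_spec).real            # real cepstrum of the frame
    lifter = np.zeros(nfft)
    lifter[:n_ceps] = 1.0                        # keep quefrencies 0..n_ceps-1
    lifter[-(n_ceps - 1):] = 1.0                 # ...and their mirrored part
    smooth_log = np.fft.fft(ceps * lifter).real  # smooth (log) envelope, Figs. 1(c)/2(c)
    return smooth_log, log_spec - smooth_log     # excitation (log) spectrum, Figs. 1(e)/2(e)
```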
Fig. 2. Power spectrum and autocorrelation function of a 32 ms long unvoiced speech frame of a female /s/ sound. (a) Signal power spectrum (in dB). (b) Autocorrelation sequence associated with the spectrum in (a). (c) Smooth spectral envelope (in dB) computed by retaining the first 12 cepstral coefficients of the signal spectrum shown in (a). (d) The autocorrelation sequence associated with the spectrum shown in (c). (e) Excitation power spectrum (in dB) estimated by subtracting the smooth (log) spectrum shown in (c) from the signal (log) spectrum shown in (a). (f) Autocorrelation sequence associated with the excitation spectrum shown in (e).
2.2. Noise signals

As mentioned earlier, the AMFCC method is proposed here as a robust feature extraction procedure on the basis that the additive noise distortion has most of its autocorrelation coefficients concentrated near the lower time lags and its higher-lag autocorrelation coefficients are zero (or very small). In this context, it is important to know what types of noise signals have this property and to what extent they satisfy it.

In order to answer this question, we first consider the stationary white random noise signal. Theoretically (and asymptotically), its autocorrelation function should be zero for all lags except the zeroth lag (Kay, 1979). We want to know whether this is true for short-time analysis. We take 2 s of computer-generated (artificial) white Gaussian noise and perform a short-time analysis (with a Hamming window) every 10 ms using a frame length of 32 ms. Fig. 3(a) shows the spectrogram of this signal (in this paper, we use a 10 ms frame shift and a 32 ms Hamming-windowed frame for spectrogram computation). For illustration, we take three different frames starting at 0.5, 1 and 1.5 s. We show the waveforms of these frames in Fig. 3(b)–(d), their respective power spectra in Fig. 3(e)–(g) and their respective autocorrelation functions in Fig. 3(h)–(j). As expected, the higher-lag autocorrelation coefficients are smaller in magnitude than the zeroth autocorrelation coefficient, but they have non-zero values due to the short-time analysis. Similarly, it can be shown that the stationary broadband random noise signals (white random noise can be considered a particular case of broadband random noise in which the band extends from zero to half the sampling frequency) satisfy this property, where the magnitude of the lower-lag autocorrelation coefficients is large and that of the higher-lag coefficients is very small (Mansour and Juang, 1989).

In addition to these noises (white and broadband), we can identify the following two other noises that satisfy this property: chirp noise and impulsive noise. Typical examples of chirp noise are the sounds generated by emergency sirens and the chirping of birds. Typewriter (or keyboard) noise and the sound of machine gun fire are typical examples of impulsive noise. Here, we generate artificial chirp and impulsive noises (the procedure used to generate these artificial noises is described later in Section 5) and perform short-time analysis. The results are shown in Figs. 4 and 5.
Fig. 3. Short-time analysis of the artificial white random noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
We can observe from these figures that most of the autocorrelation coefficients are concentrated near the lower time lags and the higher-lag autocorrelation coefficients are close to zero.

So far, we have considered only artificial noises and shown that their autocorrelation functions have the required property that makes the AMFCC method suitable for robust feature extraction. Now, we consider real-life noises. As an example, we take the car noise from the Aurora-2 database (Pearce and Hirsch, 2000) and perform the short-time analysis (similar to that done for the artificial noises). The results are shown in Fig. 6. We can see from the spectrogram (Fig. 6(a)) that the car noise is a kind of broadband (low-pass) noise. From the autocorrelation plots shown in Fig. 6(h)–(j), it can be observed that the magnitude of its lower-lag autocorrelation coefficients is larger than that of the higher-lag coefficients, but its higher-lag autocorrelation coefficients are not as small as those observed for the artificial noises. Thus, the car noise satisfies the required robustness property, but not to the same extent as the artificial noises. We have experimented with the other real-life noises from the Aurora-2 database and made similar observations.

Thus, we have seen that (1) the higher-lag autocorrelation coefficients of the speech signal s(n) contain information about the signal's power spectrum Pss(ω), and (2) the magnitude of the higher-lag autocorrelation coefficients of the noise signal d(n) is relatively small for some noises. If we assume the additivity property (Eq. (2)) in the autocorrelation domain to be true (this property holds only asymptotically and may not be completely satisfied when we do short-time analysis), the higher-lag autocorrelation coefficients of the noisy speech signal x(n) will be relatively insensitive to the additive background noise distortion.
Fig. 4. Short-time analysis of the artificial chirp noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
Therefore, we discard the lower-lag autocorrelation coefficients and use only the higher-lag (lag greater than 2 ms) autocorrelation coefficients of the noisy speech signal for robust spectral estimation. In order to improve the robustness of this spectral estimator further, we de-emphasise the more corrupt lower-lag autocorrelation coefficients during spectral estimation. The lower-lag coefficients can be attenuated by using a tapered window function. This also has the added effect of attenuating the very high-lag autocorrelation coefficients, which have a relatively high estimation variance (Kay, 1988).

3. Spectral estimation from higher-lag autocorrelation coefficients

Typically, spectral estimation in automatic speech recognition is performed using either a linear prediction (LP) analysis algorithm or a Fourier-transform-based periodogram algorithm. These two approaches can be discussed in terms of the autocorrelation coefficients they employ. A conventional LP analysis algorithm (such as the autocorrelation method) used in the speech processing literature (Makhoul, 1975) utilises only the first few lower-lag autocorrelation coefficients, while the periodogram method can be considered as utilising the full autocorrelation sequence. In our higher-lag autocorrelation spectrum estimation (HASE) method, we propose to use only the higher-lag portion of one side (the causal side) of the autocorrelation sequence. By considering only the causal side of the autocorrelation sequence, we are not losing any information, since the autocorrelation sequence is symmetric.

A number of LP-based techniques have been reported in the literature which utilise the windowed one-sided autocorrelation sequence to improve the robustness of the spectral estimates.
Fig. 5. Short-time analysis of the artificial impulsive noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
Two of these techniques, the short-time modified coherence (SMC) method (Mansour and Juang, 1989) and the one-sided autocorrelation linear prediction coefficient (OSALPC) method (Hernando and Nadeu, 1997), have been used in the past for automatic speech recognition. Both of these techniques use all-pole modelling of the windowed autocorrelation sequence for computing the spectral estimates. As mentioned earlier in the introduction, these techniques provide some robustness to noise, but their recognition performance for clean speech is much worse than that of the conventional autocorrelation method of LP analysis (Hernando and Nadeu, 1997). This happens because the use of higher-lag autocorrelation coefficients provides robust spectral estimates for noisy speech, but these autocorrelation coefficients do not follow the all-pole model as well as the clean speech signal does. Hence, in our HASE method, we do not use all-pole modelling of the windowed one-sided (causal) autocorrelation sequence to compute the spectral estimate. Instead, we use the magnitude spectrum of the windowed one-sided autocorrelation sequence as an estimate of the power spectrum of the signal. However, before doing so, we discard the first few lower-lag autocorrelation coefficients, as they are affected more by the additive noises (of the type discussed in the preceding section). We utilise only the higher-lag (greater than 2 ms) autocorrelation coefficients. Thus, our method positions the window on the autocorrelation sequence at a higher starting lag than the previous LP-based methods.
Fig. 6. Analysis of car noise signal using 32 ms frames. (a) Spectrogram of a 2 s sample of car noise. (b)–(d) Signal frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Logarithmic power spectra, computed using a Hamming window, of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the spectra shown in (e)–(g), respectively.
The HASE method for computing the power spectrum of the observed signal x(n), n = 0, 1, …, N − 1 (where N is 256 for a 32 ms long speech signal sampled at 8 kHz), is described formally as follows:

(1) Multiply the observed signal x(n) by a window function ws(n) to get the windowed signal xw(n) = x(n)ws(n), n = 0, 1, …, N − 1. We use here the Hamming window function for ws(n).

(2) Compute the biased estimate of the one-sided autocorrelation coefficients

R(i) = (1/N) Σ_{n=0}^{N−i−1} xw(n) xw(n + i),    i = 0, 1, …, N − 1.

This computation can be done in a computationally efficient manner using the fast Fourier transform (FFT) algorithm.

(3) Discard the first L (=16) lower-lag autocorrelation coefficients and multiply the remaining higher-lag autocorrelation sequence by a window function wr(n) to get the windowed higher-lag autocorrelation sequence Rw(n) = R(L + n) wr(n), n = 0, 1, …, M − 1, where M = N − L.

(4) Append 2N − M zeros to the end of the windowed higher-lag autocorrelation sequence Rw(n), n = 0, 1, …, M − 1, and compute the discrete Fourier transform (DFT) of the appended sequence through the FFT algorithm. Use the absolute value of this transform as the desired power spectral estimate P̂xx(ω).
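A compact implementation of steps (1)–(4) might look as follows. This is a sketch under the stated parameters (N = 256, L = 16 at 8 kHz), not the authors' code; the choice of wr(n) is passed in, since the window design is discussed next.

```python
import numpy as np

def hase_spectrum(x, fs=8000, lag_cut_ms=2.0, wr=None):
    """HASE power spectral estimate of one frame x, following steps (1)-(4)."""
    N = len(x)
    xw = x * np.hamming(N)                            # step (1): signal window ws(n)
    X = np.fft.fft(xw, 2 * N)                         # step (2): biased autocorrelation
    r = np.fft.ifft(np.abs(X) ** 2).real[:N] / N      #           computed via the FFT
    L = int(round(lag_cut_ms * 1e-3 * fs))            # step (3): drop lags below 2 ms
    if wr is None:
        wr = np.hamming(N - L)                        # placeholder; Section 3.1 motivates
    Rw = r[L:] * wr                                   #             a DDR window instead
    return np.abs(np.fft.fft(Rw, 2 * N))              # step (4): zero-pad to 2N, take |DFT|
```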
The window function wr(n) used on the one-sided (causal) higher-lag autocorrelation sequence plays an important role in our spectral estimation method. Therefore, we first discuss the design of this window function and then provide some results to demonstrate the spectral estimation performance of our method with respect to the previous LP-based methods.

3.1. Window function design

In our HASE method, we use the Hamming window function for ws(n) to window the signal, but we do not know what type of window function we should use for wr(n) to window the one-sided higher-lag autocorrelation sequence. In order to see its effect on the spectral estimation performance of the HASE method, we first use the Hamming window function for wr(n) to window the higher-lag autocorrelation sequence. We apply this HASE method to analyse a 32 ms long voiced speech frame of a female /ey/ sound. (Note that this frame was used earlier in Fig. 1.)
Fig. 7. Illustration of the HASE method (using Hamming windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female / ey / sound. (a) one-sided (causal) autocorrelation sequence (in dotted line), a 30 ms Hamming window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
The Hamming-windowing operation carried out on the higher-lag autocorrelation sequence is illustrated in Fig. 7(a). The power spectral estimate of the speech frame using the HASE method is shown in Fig. 7(b) along with its periodogram estimate. We can see from this figure that the periodogram method captures spectral details of the harmonics down to about 43 dB below the strongest components, i.e., its dynamic range is about 43 dB. The HASE method captures the higher-magnitude harmonics well, but fails to capture the harmonics lying more than about 22 dB below them. In other words, its dynamic range is limited to about 22 dB.

In order to explain this dynamic range issue, we show the shape of the Hamming window function and its power spectrum in Fig. 8(a). We see that the magnitude of the highest side-lobe is about 43 dB lower than that of the main lobe, i.e., the dynamic range of the Hamming window is about 43 dB. Also note here that the Hamming window is applied only to the signal in the periodogram method, but it is applied to both the signal and the higher-lag autocorrelation sequence in the HASE method. It is well known (Mansour and Juang, 1989) that the power spectrum of an autocorrelation sequence of a signal has a dynamic range twice as large as that of the signal's power spectrum. The HASE method here uses the Hamming window on the autocorrelation sequence and, hence, the resulting power spectral estimate of the signal has a dynamic range of about 22 dB.
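The "dynamic range" of a window, in the sense used here (main-lobe peak minus its highest side-lobe in the power spectrum), can be estimated numerically. The helper below is our own rough sketch, not a procedure from the paper.

```python
import numpy as np

def window_dynamic_range_db(w, nfft=1 << 15):
    """Rough dynamic range of a window: peak minus highest side-lobe, in dB."""
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(w, nfft)) + 1e-300)
    mag_db -= mag_db.max()
    k = 0
    while k + 1 < len(mag_db) and mag_db[k + 1] < mag_db[k]:
        k += 1                      # walk down the main lobe to its first minimum
    return -mag_db[k + 1:].max()    # highest side-lobe level below the peak

print(window_dynamic_range_db(np.hamming(256)))   # roughly 43 dB for the Hamming window
```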
Fig. 8. Window function and its power spectrum for (a) Hamming window, (b) Kaiser window (α = 11.3) and (c) DDR Hamming window.
This aspect of windowing an autocorrelation sequence has also been reported previously by Mansour and Juang (1989). While developing the SMC method of robust LP analysis, they compared the SMC power spectral estimates with the corresponding estimates obtained by using the conventional autocorrelation method of LP analysis (Makhoul, 1975). To compute the SMC spectra in their investigation, a Hamming window was applied to the autocorrelation sequence. It was reported (Mansour and Juang, 1989) that, in certain cases, formants present in the conventional LP spectrum were reduced or totally lost in the SMC spectrum. Mansour and Juang (1989) made particular reference to third formants (which have low magnitude) and attributed this spectral loss to the dynamic range of the window function applied to the autocorrelation sequence. These comments are consistent with our observations.

As mentioned earlier, the power spectrum of an autocorrelation sequence has a dynamic range twice that of the corresponding signal's power spectrum. Therefore, when we window the one-sided autocorrelation sequence, we need to use a window function that has a dynamic range twice as large as that of the window function we would typically use on the original time-domain signal. For example, a Hamming window, which is typically applied to the time-domain signal in speech recognition, has a dynamic range of 43 dB. To produce a spectral estimate with an equivalent dynamic range from the windowed autocorrelation sequence, we need to use a window function with an 86 dB dynamic range. The Kaiser window function can provide such a window (Harris, 1978). The function used for generating a Kaiser window of length M is given by

w(n) = I0(2α √[(n/(M − 1)) (1 − n/(M − 1))]) / I0(α)  for 0 ≤ n < M, and w(n) = 0 otherwise,    (3)

where I0(x) is a modified Bessel function of the first kind and α is the design parameter that sets the window's dynamic range. To get a Kaiser window function with a dynamic range of 86 dB, we set α = 11.3. The resulting Kaiser window function and its power spectrum are shown in Fig. 8(b). We use this window function to window the higher-lag autocorrelation sequence in the HASE method and analyse the same speech frame as the one used in Fig. 7. We show the Kaiser-windowing operation carried out on the higher-lag autocorrelation sequence in Fig. 9(a) and the resulting power spectral estimate of the speech frame in Fig. 9(b). By comparing Fig. 9(b) with Fig. 7(b), we see that the use of the Kaiser window function solves the dynamic range problem faced earlier with the Hamming window function, but it produces slightly wider harmonics. This happens because the Kaiser window function has a wider main lobe than the Hamming window function (see Fig. 8).
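Eq. (3) can be implemented directly. In this parameterisation it appears to coincide with the standard Kaiser window of shape parameter β = α; the quick check below is our own observation, not a statement from the paper.

```python
import numpy as np
from scipy.special import i0
from scipy.signal.windows import kaiser

def kaiser_eq3(M, alpha=11.3):
    """Kaiser window as written in Eq. (3)."""
    x = np.arange(M) / (M - 1)
    return i0(2 * alpha * np.sqrt(x * (1 - x))) / i0(alpha)

w = kaiser_eq3(240)
# Eq. (3) should match the usual Kaiser window with shape parameter beta = alpha,
# up to floating-point error:
print(np.allclose(w, kaiser(240, beta=11.3)))
```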
Fig. 9. Illustration of the HASE method (using Kaiser windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female /ey/ sound. (a) One-sided (causal) autocorrelation sequence (in dotted line), a 30 ms Kaiser window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
However, this is not a big problem, as for feature extraction in speech recognition we are interested only in the spectral envelope that captures the formant structure, and slightly wider harmonics do not spoil this structure. The main point is that the HASE method (with the Kaiser-windowed higher-lag autocorrelation sequence) provides the same dynamic range (about 43 dB) as the periodogram method and captures all the harmonics that are visible in the periodogram estimate.

Though the Kaiser window function provides good performance, it has two problems. Firstly, it is computationally more expensive to design a Kaiser window function than the cosine window functions (such as the Hamming window function), as it involves computation of a Bessel function (see Eq. (3)). Secondly, every time we decide to use a different window function ws(n) on the time-domain signal (for example, a Blackman window instead of a Hamming window), we have to modify the feature extraction software to design a Kaiser window function that has a dynamic range equal to twice that of the new window function. To solve these problems, we propose a design method specifically for computing a window function that has twice the dynamic range of the window function ws(n) used on the time-domain signal and which can be generated automatically inside the software once we specify the window function for ws(n).
Given that we are applying a Hamming window to the time-domain signal and using the autocorrelation sequence from 2 ms to 32 ms (with length M = 240 samples at an 8 kHz sampling rate), we want to design a window function of length M having twice the dynamic range of a Hamming window. The design procedure is outlined below.

(1) Construct a Hamming window of length M/2.
(2) Compute its two-sided (biased) autocorrelation sequence of length M − 1, which has its maximum at the zeroth lag in the centre.
(3) Pad this (M − 1)-long autocorrelation sequence with one zero at the end to get the desired window of length M.

This window function will have a dynamic range of about 86 dB. We call this window function the double-dynamic-range (DDR) Hamming window. The DDR Hamming window function and its power spectrum are shown in Fig. 8(c). We use this window function to window the higher-lag autocorrelation sequence in the HASE method and analyse the same speech frame as the one used in Figs. 7 and 9. We show the DDR Hamming-windowing operation carried out on the higher-lag autocorrelation sequence in Fig. 10(a) and the resulting power spectral estimate of the speech frame in Fig. 10(b). By comparing Fig. 10(b) with Fig. 9(b), we see that the DDR Hamming window performs as well as the Kaiser window function in terms of its spectral estimation performance.
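The three design steps translate directly into code, as sketched below. The peak normalisation in the last line is our own choice; the paper does not specify a scaling, and a constant gain does not change the shape of the resulting spectral estimate.

```python
import numpy as np

def ddr_hamming(M):
    """Double-dynamic-range (DDR) Hamming window of length M: the biased
    two-sided autocorrelation of a length-M/2 Hamming window, padded with
    one trailing zero (design steps (1)-(3) above)."""
    h = np.hamming(M // 2)
    r = np.correlate(h, h, mode='full') / (M // 2)   # length M-1, peak at the centre
    w = np.append(r, 0.0)                            # pad to length M
    return w / w.max()                               # peak normalised to 1 (our choice)

wr = ddr_hamming(240)   # the 2-32 ms lag range at 8 kHz used in this paper
```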
Fig. 10. Illustration of the HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female /ey/ sound. (a) One-sided (causal) autocorrelation sequence (in dotted line), a 30 ms DDR Hamming window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
Fig. 11. Comparison of spectral estimation methods using a 32 ms frame of clean synthetic voiced speech. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
The resulting HASE method (with DDR Hamming windowing of the higher-lag autocorrelation sequence) captures all the harmonic peaks visible in the periodogram estimate, and its spectral dynamic range is the same (about 43 dB) as that of the periodogram method.

3.2. Spectral estimation performance of the HASE method

As mentioned earlier, the HASE method uses the magnitude spectrum of the windowed one-sided higher-lag autocorrelation sequence as the estimate of the signal's power spectrum. In order to investigate the spectral estimation performance of this method, we first use a synthetic voiced speech signal for analysis (we use a synthetic speech signal here because its power spectrum is known a priori and can be used as a reference spectrum; this helps in evaluating the performance of a given spectral estimation method). For generating this synthetic signal, we take a segment of a real /r/ sound and compute the LP coefficients using the autocorrelation method of LP analysis. Using the resulting all-pole (or AR) model, we filter a periodic impulse train (of period 5 ms) to produce a 5 s long synthetic /r/ sound. From the middle of this sound, we excise a 32 ms frame for analysis.
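A rough recipe for this kind of synthetic signal is sketched below. The /r/ segment itself is not available here, so a random placeholder stands in for it, and the LP order of 10 is our assumption; only the overall procedure (autocorrelation-method LP analysis, then exciting the all-pole model with a 5 ms impulse train) follows the text.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorr(frame, order=10):
    """LP coefficients by the autocorrelation method (solve R a = r)."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))               # A(z) = 1 - sum_k a_k z^-k

fs = 8000
rng = np.random.default_rng(0)
r_segment = rng.standard_normal(256)                 # placeholder for a real /r/ frame
A = lpc_autocorr(r_segment, order=10)

excitation = np.zeros(5 * fs)                        # 5 s of excitation
excitation[::int(0.005 * fs)] = 1.0                  # impulse train with a 5 ms period
synthetic = lfilter([1.0], A, excitation)            # all-pole (AR) synthesis
mid = len(synthetic) // 2
frame = synthetic[mid:mid + int(0.032 * fs)]         # excise a 32 ms frame
```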
Here, we are interested in answering two questions: (1) How does the shape of the window function used for weighting the causal higher-lag autocorrelation sequence affect the spectral estimation performance of the HASE method? and (2) How does the HASE method compare with the other spectral estimation methods used in the past to extract cepstral coefficient features for speech recognition? To answer these questions, we compare the spectral estimation performance of the following six methods: (1) the periodogram method with a Hamming window (as used in MFCC feature extraction (Davis and Mermelstein, 1980)), (2) the HASE method with the Hamming-windowed one-sided higher-lag autocorrelation sequence, (3) the HASE method with the DDR Hamming-windowed one-sided higher-lag autocorrelation sequence, (4) the autocorrelation method of LP analysis (Makhoul, 1975), (5) the SMC method of LP analysis (Mansour and Juang, 1989), and (6) the OSALPC method of LP analysis (Hernando and Nadeu, 1997). Among these six methods, methods 1 and 4 have been used in the past to extract the MFCC and LPCC features, respectively, while methods 5 and 6 are based on the windowed one-sided autocorrelation sequence (like our method).
We use the 32 ms long frame of the synthetic speech signal and compute its power spectrum using the six spectral estimation methods mentioned above. Fig. 11 shows the results. The dashed line in all the plots is the original power spectrum of the synthetic speech signal (obtained as the power response of the AR model used to generate the synthetic speech signal) and it serves as a reference for evaluating the performance of these six spectral estimators. Fig. 11(a) shows the spectral estimate made by the periodogram method (method 1). It can be seen from this figure that its envelope matches the original power spectrum of the synthetic signal very well. Therefore, it is not surprising that this method is used for spectral estimation to extract the (currently popular) MFCC features. Fig. 11(b) shows the spectral estimate made by our HASE method using the Hamming windowed one-sided higher-lag autocorrelation sequence (method 2). The original power spectrum of the synthetic signal shows four peaks, but the power spectral estimate by this method only reveals three. The reduction in spectral dynamic range that results from using a Hamming window to window the autocorrelation sequence is very apparent. The original spectrum shows a dynamic range of 45 dB. The dynamic range revealed by the Hamming-windowed method is approximately half of that value, which demonstrates the previously discussed shortcoming. Fig. 11(c) shows the spectral estimate made by our HASE method using the DDR Hamming windowed one-sided higher-lag autocorrelation sequence (method 3). Since the DDR Hamming window has a large dynamic range, the four peaks of the original spectrum are revealed in the spectral estimate made by this method. Furthermore, the agreement between the original spectrum and the envelope of the spectral estimate is excellent. This plot shows that a very good power spectral estimate of speech can be made using the one-sided autocorrelation sequence, if the window function is designed appropriately. Fig. 11(d) shows a power spectrum estimate computed from the AR parameters when the autocorrelation method of LP analysis is applied to the synthetic speech signal (method 4). This figure shows a good match between this estimate and the original power spectrum. This is the reason that the LPCC feature set (based on this method) has been so popular in the past. Fig. 11(e) shows a power spectrum estimate computed from the AR parameters when the SMC algorithm is applied to the synthetic speech signal
(method 5). This plot shows the low dynamic range of the SMC algorithm due to the Hamming window. It also shows that, out of the four peaks revealed in the AR model response, only two are shown in the SMC spectrum. Fig. 11(f) shows a power spectrum estimate computed from the AR parameters when the OSALPC algorithm is applied to the synthetic speech signal (method 6). Even though a Hamming window is used in the OSALPC algorithm, the dynamic range appears to be the same as that of the AR model response. This is due to the power spectrum estimate being squared during the algorithm, rather than the effect of a high-dynamic-range window. Three out of the four peaks are present in the OSALPC spectrum, which is better than the SMC spectrum, but not as good as the spectrum computed using the proposed algorithm.

In Fig. 11, we have seen the spectral estimation performance of the six methods for the clean (noise-free) synthetic speech signal. Now, we investigate their performance for the noisy synthetic speech signal. For this, we take the same 32 ms long frame of synthetic voiced speech and corrupt it by adding the following four noises to make its signal-to-noise ratio (SNR) equal to 10 dB: the artificial white Gaussian random noise, the artificial chirp noise, the artificial impulsive noise and the real car noise. The resulting spectral estimation performance of the six methods for these noise cases is shown in Figs. 12–15, respectively. By comparing these figures with Fig. 11, we can see that all six methods suffer in terms of their spectral estimation performance when applied to the signal corrupted by noise, but it is somewhat difficult to say from these figures which method suffers more and which less. To comment on their relative performance in the presence of noise, we compute the formant estimation performance (measured in terms of the number of well-estimated formants) of each of the six methods for this noisy synthetic speech frame for the four different noises (white, chirp, impulse and car) using Figs. 12–15. We call a formant well estimated if it matches one of the four formants present in the original power spectrum of the synthetic signal (determination of a well-estimated formant is somewhat subjective; we use this measure only to give a qualitative indication of the formant estimation performance of the different methods). The formant estimation performance of the six methods for each of the four noise conditions is summarised in Table 1.
Fig. 12. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic white random noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 13. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic chirp noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 14. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic impulsive noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 15. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the real car noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Table 1
Formant estimation performance of six methods on a 32 ms frame of noisy synthetic speech (number of well-estimated formants)

Method                                      | White | Chirp | Impulse | Car
(1) Periodogram method – Hamming            |   3   |   3   |    3    |  3
(2) HASE method – Hamming                   |   3   |   3   |    3    |  3
(3) HASE method – DDR Hamming               |   3   |   4   |    3    |  3
(4) Autocorrelation method of LP analysis   |   2   |   2   |    2    |  2
(5) SMC method                              |   2   |   2   |    2    |  2
(6) OSALPC method                           |   3   |   3   |    3    |  3
From this table, it can be seen that the HASE method (using the DDR Hamming windowed higher-lag autocorrelation sequence) is the most successful in estimating the formants of this 32 ms frame of the synthetic speech signal in the presence of noise.

So far, we have investigated the spectral estimation performance of the HASE method for synthetic speech. Now, we investigate it for real speech. For this, we take a speech utterance from the Aurora database (MAL_19Z96Z8A). We compute the spectrogram of this real speech utterance (with a 10 ms frame shift and a 32 ms frame length) using the periodogram method and the HASE method (with the DDR Hamming-windowed higher-lag autocorrelation sequence). The spectrogram plots are shown in Fig. 16(a).
Fig. 16. Spectrograms of real speech (utterance 'MAL_19Z96Z8A' from Aurora database) using the periodogram method (upper plot in each subfigure) and HASE method with DDR Hamming windowed higher-lag autocorrelation sequence (lower plot in each subfigure) in various noise conditions. (a) Clean utterance. (b) Chirp noise. (c) Impulse noise and (d) car noise.
The top plot in this figure corresponds to the periodogram method and the bottom plot to the HASE method. To investigate the performance of these methods for noisy speech, we take this utterance (MAL_19Z96Z8A) and corrupt it by adding the following three noises to make its signal-to-noise ratio (SNR) equal to 10 dB: the artificial chirp noise, the artificial impulsive noise and the real car noise. The spectrograms of the noisy speech for these three noises using the two methods (periodogram and HASE) are shown in Fig. 16(b)–(d), respectively. The spectrograms for the clean condition (Fig. 16(a)) again confirm that the HASE method captures the formant structure as well as the periodogram method. In the HASE method, the extra windowing operation on the autocorrelation sequence results in a less pronounced (or smoother) harmonic structure than the periodogram method. In the chirp (Fig. 16(b)) and impulse (Fig. 16(c)) noise conditions, the HASE method shows substantially lower noise in the spectrogram than the periodogram method. In the car noise condition (Fig. 16(d)), the noise differences in the spectrograms from the two methods are more subtle.
4. AMFCC feature extraction method

In this section, we present a brief outline of our AMFCC feature extraction method and point out where and how it differs from the popular MFCC feature extraction method. The AMFCC feature extraction front-end shown in Fig. 17 uses the HASE method (with DDR Hamming-windowed higher-lag autocorrelation sequence) for spectral estimation. In this front-end, we first pre-emphasise the input speech signal using the pre-emphasis filter H(z) = 1 − 0.97 z^{-1}.
Fig. 17. Block diagram of AMFCC feature extraction algorithm.
Fig. 18. Block diagram of MFCC feature extraction algorithm.
In order to carry out the short-time analysis of the pre-emphasised speech signal, we perform frame blocking with a frame size of 32 ms and a frame shift of 10 ms, and the signal is analysed sequentially in a frame-wise manner. The Hamming window is applied to the pre-emphasised signal of a given frame, and a biased autocorrelation estimator is applied to the windowed signal to get the 32 ms long autocorrelation sequence (this is done in a computationally efficient manner using the FFT algorithm). We discard the autocorrelation coefficients for time lags less than 2 ms and apply the DDR window to the remaining (2–32 ms) higher-lag autocorrelation sequence. We compute the magnitude spectrum of this windowed autocorrelation sequence and use it as the power spectral estimate of the signal. This spectrum is then processed by the Mel filter bank, logarithm and DCT operations to get 12 AMFCCs. We also compute the log energy of the windowed speech signal and append it to the 12 AMFCCs to get the 13 base features. We compute the first and the second derivatives (delta and delta–delta) of the time sequence of each base feature. These derivatives are concatenated to the base feature set to get the final AMFCC feature set (having 39 features).

Since the MFCCs are currently the most popular features used for speech recognition, we show the block diagram for the MFCC feature extraction front-end in Fig. 18 and point out where and how these two front-ends differ. The two front-ends differ only in the procedure used for estimating the power spectrum of the speech signal. In the MFCC case, it is obtained by using the periodogram estimate computed from the Hamming-windowed speech signal. In the AMFCC case, it is obtained by computing the magnitude spectrum of the DDR Hamming-windowed higher-lag autocorrelation sequence derived from the Hamming-windowed speech signal. This difference can be seen by noting the operations carried out by the two methods between the 'windowing' and 'Mel filter bank' blocks in Figs. 17 and 18, respectively.
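As a concrete illustration of the front-end just described, the sketch below computes the 12 base AMFCCs for a single pre-emphasised 32 ms frame. It is our own sketch, not the authors' code: librosa's Mel filter bank is borrowed for brevity, a plain one-sided Hamming taper stands in for the paper's DDR Hamming lag window, and the FFT length and log floor are arbitrary choices.

```python
import numpy as np
import librosa  # used only for its Mel filter bank
from scipy.fftpack import dct

def amfcc_frame(frame, sr=8000, low_lag_ms=2.0, n_mels=23, n_ceps=12, n_fft=512):
    """Return 12 base AMFCCs for one pre-emphasised frame (256 samples = 32 ms at 8 kHz)."""
    n = len(frame)
    w = frame * np.hamming(n)

    # Biased autocorrelation estimate for lags 0..n-1, computed via the FFT.
    spec = np.fft.rfft(w, 2 * n)
    r = np.fft.irfft(np.abs(spec) ** 2)[:n] / n

    # Discard lags below 2 ms and taper the remaining higher-lag sequence
    # (a one-sided Hamming taper is used here in place of the DDR Hamming window).
    k0 = int(round(low_lag_ms * 1e-3 * sr))             # 16 samples at 8 kHz
    r_high = r[k0:] * np.hamming(2 * (n - k0))[n - k0:]

    # The magnitude spectrum of the windowed higher-lag autocorrelation sequence
    # serves as the power spectral estimate of the frame (the HASE estimate).
    pspec = np.abs(np.fft.rfft(r_high, n_fft))

    # Mel filter bank, logarithm and DCT, exactly as in the MFCC procedure.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=0.0, fmax=sr / 2)
    log_mel = np.log(mel_fb @ pspec + 1e-10)
    return dct(log_mel, type=2, norm='ortho')[1:n_ceps + 1]
```

The frame log energy and the delta and delta–delta streams would then be appended per frame to build the 39-dimensional feature vector.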
5. Recognition experiments and results

In this section, we evaluate the recognition performance of the AMFCC features for noisy speech at different SNRs using the Aurora-2 and the resource management (RM) databases. In order to get the noisy speech utterances, we corrupt the
speech signal by adding different types of artificial and real noises. In our recognition experiments, we use clean speech utterances for training the recogniser and the noisy speech utterances at different SNRs to evaluate its performance. The SNR of an utterance is defined here as the ratio of the energies of the clean signal and the noise computed over the whole utterance. Here, we first describe the databases and the experimental setup used in our recognition experiments. We then describe the results where we compare the recognition performance of the AMFCC features with that of the MFCC features first, and then with that of the LPCC, SMC and OSALPC features. The parameters used for implementing these (MFCC, AMFCC, LPCC, SMC and OSALPC) feature extraction methods are listed in Table 2.

5.1. Databases and experimental setup

5.1.1. Aurora-2 task
The Aurora-2 task (Pearce and Hirsch, 2000) was originally developed for evaluating the speech recognition performance of different feature extraction methods under noisy environments and is highly suited in the current context where we are investigating the AMFCC method for robust feature extraction. This task provides speech utterances and scripts to perform speaker-independent connected-digit recognition experiments in clean and noisy conditions. The utterances in the Aurora-2 database consist of digit sequences (e.g., sil–one–four–seven–three–five–three–three–sil) that have been derived from the TI-digit database, down-sampled to 8 kHz and then filtered by a G.712 characteristic filter. The Aurora-2 scripts provide for two types of training conditions: clean-condition training and multi-condition training. In our experiments, we use the clean-condition training situation and train our recogniser with the clean training set (having 8440 utterances). We use hidden Markov models (HMMs) to model the digits and pauses using the same topology as prescribed in the scripts. The test data in the Aurora-2 corpus is divided into three sets, 'a', 'b' and 'c'. In our experiments, we use only the test set 'a' (28 028 utterances) for testing the recogniser. This test set consists of speech corrupted by four different real-life noises (subway, babble, car and exhibition) at seven SNR levels (clean, then 20 dB to −5 dB in 5 dB steps).
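The whole-utterance SNR definition above amounts to applying a single gain to the noise before the two signals are added. A minimal sketch of that scaling (the function name and defaults are ours, not part of the Aurora scripts):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the whole-utterance SNR equals `snr_db`,
    with SNR taken as the ratio of clean-signal energy to noise energy."""
    noise = noise[:len(clean)]
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```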
Table 2
Parameters used in the MFCC, AMFCC, LPCC, SMC and OSALPC methods

Parameter                    MFCC      AMFCC         LPCC      SMC         OSALPC
Sampling freq. (kHz)         8         8             8         8           8
Frame shift (ms)             10        10            10        10          10
Frame size (ms)              32        32            32        32          32
Preemphasis coeff.           0.97      0.97          0.97      0.97        0.97
No. of cepstral coeff.       12        12            12        12          12
Feature vector size          39        39            39        39          39
Mel filter range (kHz)       0–4       0–4           –         –           –
No. of Mel filters           23        23            –         –           –
LP order                     –         –             12        12          12
Signal window                Hamming   Hamming       Hamming   Rect.       Rect.
Autocorr. estimator          –         Biased        Biased    Coherence   Coherence
Autocorr. window             –         DDR Hamming   Rect.     Hamming     Hamming
Autocorr. lag range (ms)     –         2–32          0–1.5     0–16 (a)    0–16 (a)

(a) The zeroth lag is set to zero in the SMC and OSALPC methods.
We also test our recogniser on speech utterances corrupted by the following three types of artificial noises: the white random noise, the chirp noise and the impulsive noise. The artificial white noise is obtained through a Gaussian random number generator. The artificial chirp noise used in this paper is periodic with a period of 32 ms. One period of this noise is generated as a sinusoidal signal whose frequency changes linearly from zero to half of the sampling frequency over the period. To create the artificial impulsive noise, we begin with a 32 ms block of zeros. To this block, we add a unit pulse of 2 ms duration. The starting position of the 2 ms pulse is randomly selected between 0 and 30 ms using a uniform random number generator. We then concatenate this block with another 32 ms block that contains only zeros. These two steps are then repeated, but this time the sign of the 2 ms pulse is reversed so that the overall mean stays at zero. These four steps are then repeated continuously to get a sufficiently long sequence of the impulsive noise. Thus, in this noise, the separation between successive pulses varies randomly between 32 and 92 ms.
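The two artificial noises described above can be generated along the following lines; this is our own sketch of the stated recipe (sampling rate, function names and the random generator are assumptions), not the authors' code.

```python
import numpy as np

def chirp_noise(n_samples, sr=8000, period_ms=32):
    """Periodic chirp: each 32 ms period sweeps linearly from 0 Hz to sr/2."""
    period_s = period_ms / 1000.0
    p = int(sr * period_s)                              # samples per period
    t = np.arange(p) / sr
    # Phase for a linear sweep from 0 to sr/2 over one period.
    one_period = np.sin(2 * np.pi * (0.5 * (sr / 2.0) / period_s) * t ** 2)
    return np.tile(one_period, int(np.ceil(n_samples / p)))[:n_samples]

def impulsive_noise(n_samples, sr=8000, rng=None):
    """Alternating-sign 2 ms pulses: one pulse per 64 ms, starting at a random
    position between 0 and 30 ms within its 32 ms block."""
    rng = np.random.default_rng() if rng is None else rng
    blk, pulse = int(0.032 * sr), int(0.002 * sr)
    blocks, sign, total = [], 1.0, 0
    while total < n_samples:
        b = np.zeros(2 * blk)                           # 32 ms pulse block + 32 ms of zeros
        start = rng.integers(0, int(0.030 * sr) + 1)    # pulse start between 0 and 30 ms
        b[start:start + pulse] = sign
        sign = -sign                                    # reverse the sign to keep a zero mean
        blocks.append(b)
        total += len(b)
    return np.concatenate(blocks)[:n_samples]
```

Within a single 32 ms analysis frame the impulsive noise contains at most one 2 ms pulse, which is what keeps its autocorrelation confined to the low lags that the AMFCC method discards.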
5.1.2. Resource management (RM) task
This task uses the RM database (Price et al., 1988) for carrying out the speaker-independent continuous speech recognition experiments. The vocabulary of the RM task is 991 words. We use here the speaker-independent part of the RM database. We down-sample the speech utterances from 16 kHz to 8 kHz. The training set consists of 3990 utterances spoken by 109 speakers. This is used to train a set of six-mixture, tied-state, cross-word triphone HMMs in accordance with the RM scripts, which are supplied with the HTK (HMM tool kit) distribution (HTK, Hidden Markov model tool kit, 2005). For testing the recognition system, we use the February-89 test set, which consists of 300 utterances spoken by 10 speakers. These utterances are corrupted by the same (artificial and real) noises as used in the Aurora-2 task (see the preceding subsection) to get the noisy utterances at SNRs of clean and 30 dB down to 5 dB in 5 dB steps.
5.2. Results: AMFCC versus MFCC

We evaluate the recognition performance of the AMFCC feature extraction method for speech corrupted by the artificial as well as the real-life noises and compare it with that of the MFCC method. The word recognition accuracy results are listed in Tables 3 and 4, respectively, for the Aurora-2 task. The corresponding results for the RM task are provided in Tables 5 and 6. In these tables, the first column shows the SNRs of noisy speech, and the second and the third column groups show the results for the artificial and the real-life noises. In the artificial noise group, the first sub-column lists the word recognition accuracies for the white random noise, the second sub-column for the chirp noise and the third sub-column for the impulsive noise. The mean recognition accuracies for the artificial noise (obtained by averaging the corresponding word recognition accuracies for the three noises) are shown
Table 3
AMFCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    99.11   99.11   99.11     99.11      99.11    99.09    98.99   99.17    99.09
20 dB    93.64   98.46   99.14     97.08      97.27    96.64    97.97   95.93    96.95
15 dB    87.35   98.00   99.02     94.79      93.89    89.48    96.42   92.63    93.11
10 dB    74.24   94.50   99.02     89.25      84.49    69.71    88.64   81.98    81.20
5 dB     54.10   87.29   98.86     80.08      64.66    36.85    62.63   60.32    56.11
0 dB     29.54   80.69   98.50     69.58      38.47    12.00    25.26   30.76    26.62
−5 dB    12.10   74.03   96.44     60.86      17.65    2.93     11.42   12.93    11.23
Table 4
MFCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    99.08   99.08   99.08     99.08      99.08    99.00    98.96   99.29    99.08
20 dB    94.26   96.75   98.83     96.61      97.45    96.43    97.64   96.24    96.94
15 dB    86.06   93.58   98.68     92.77      94.14    87.48    96.45   92.90    92.74
10 dB    69.08   84.62   97.97     83.89      83.45    62.94    86.58   79.45    78.10
5 dB     42.83   64.51   95.46     67.60      59.32    30.08    53.27   53.41    49.02
0 dB     14.98   36.57   86.64     46.06      31.84    10.91    17.72   23.45    20.98
−5 dB    7.98    14.61   60.79     27.79      13.26    5.05     8.98    11.23    9.63
Table 5
AMFCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    93.44   93.44   93.44     93.44      93.44    93.44    93.44   93.44    93.44
30 dB    88.93   92.19   93.40     91.51      91.45    92.62    92.54   91.76    92.09
25 dB    84.06   90.67   93.36     89.36      89.89    90.98    91.76   89.30    90.48
20 dB    73.37   89.14   92.66     85.06      83.75    85.90    89.61   84.30    85.89
15 dB    53.60   86.10   92.58     77.43      69.42    74.74    82.82   71.11    74.52
10 dB    26.88   80.63   92.15     66.55      41.70    51.07    64.86   47.52    51.29
5 dB     8.37    70.97   90.12     56.49      20.35    20.39    34.80   20.15    23.92
Table 6
MFCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    93.48   93.48   93.48     93.48      93.48    93.48    93.48   93.48    93.48
30 dB    89.96   91.64   93.28     91.63      92.07    92.35    92.46   91.92    92.20
25 dB    84.86   90.20   92.74     89.27      90.16    91.37    91.68   90.55    90.94
20 dB    75.82   86.26   91.76     84.61      83.79    86.37    89.07   85.24    86.12
15 dB    53.72   75.63   89.61     72.99      68.29    75.36    81.92   73.37    74.73
10 dB    28.40   53.82   82.31     54.84      42.75    49.28    60.91   50.22    50.79
5 dB     12.19   31.01   67.57     36.92      23.85    23.73    31.93   21.98    25.37
in the fourth sub-column. Similarly, in the real noise column, we show the word recognition accuracies for the subway, babble, car and exhibition noises
in the first, second, third and fourth sub-columns, respectively, and their corresponding means are listed in the fifth sub-column. In order to make
the comparison between the MFCC and AMFCC features easier, we plot the mean recognition accuracies for the artificial and the real noises on the Aurora-2 and the RM tasks as a function of SNR in Fig. 19. We can make the following two observations from Tables 3–6 and Fig. 19: (1) When tested on the clean speech utterances, the AMFCC method does as well as the MFCC method in terms of the recognition performance. This shows that the higher-lag autocorrelation coefficients used for spectral estimation in the AMFCC method capture the power spectral envelope of speech to the same
extent as done by the periodogram estimate used in the MFCC method. (2) For noisy speech utterances, the AMFCC method provides better recognition results than the MFCC method. This shows the usefulness of the AMFCC method for robust feature extraction.

As mentioned earlier, we want the robustness and the additivity assumptions to be true for successful functioning of the AMFCC method. These assumptions are not strictly valid for the short-time analysis carried out in the AMFCC feature extraction. Here, we make some comments about the effect of short-time analysis on the validity of these
Fig. 19. Mean recognition accuracy of the AMFCC and the MFCC features as a function of SNR. (a) Aurora-2 database speech tested with three artificial noises: white random, chirp and impulsive. (b) Resource management database speech tested with the same three artificial noises. (c) Aurora-2 database speech tested with four real-life noises: subway, babble, car and exhibition. (d) Resource management database speech tested with the same four real-life noises.
assumptions using the AMFCC results shown in Tables 3 and 5 for the artificial noises. The robustness assumption requires the noise autocorrelation coefficients to be zero for time lags greater than 2 ms for the successful operation of the AMFCC method (as implemented here). As seen from Figs. 3–5, the validity of this assumption for short-time analysis varies across noise conditions; it is perfectly valid for the impulsive noise, less valid for the chirp noise and least valid for the white noise. Its impact can be clearly seen in Tables 3 and 5, where the speech recognition performance of the AMFCC method deteriorates least for the impulsive noise, more for the chirp noise and most for the white noise. Also, for the impulsive noise case (where the robustness assumption is perfectly satisfied), the degradation in performance occurs due to the additivity assumption alone, and this degradation is comparatively mild. We use an 'oracle' experiment to demonstrate the effect of the additivity assumption on the speech recognition performance of the AMFCC method in Appendix A. There we show that the speech recognition performance does not degrade much due to the deviation from the additivity assumption resulting from the short-time analysis. In other words, the additivity assumption is
quite valid for the short-time analysis, and the speech recognition performance of the AMFCC method at a given SNR is mainly dictated by how well the robustness property holds for the noise corrupting the speech signal.

5.3. Results: AMFCC versus LPCC, SMC and OSALPC

In this subsection, we compare the recognition performance of the AMFCC method with that of the three LP-based methods (namely, the LPCC, SMC and OSALPC methods) of feature extraction. We list the word recognition accuracy results of the three LP-based methods in Tables 7–9, respectively, for the Aurora-2 task, and in Tables 10–12, respectively, for the RM task. We also show the mean recognition accuracies of these methods as a function of SNR in Fig. 20 for the artificial and the real noises on the Aurora and the RM tasks. By comparing this figure and these tables with the corresponding figure and tables shown in the preceding subsection for the AMFCC features, we can make the following two observations: (1) In terms of its recognition performance on the clean speech utterances, the AMFCC method does better
Table 7
LPCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.71   98.71   98.71     98.71      98.71    98.94    98.84   98.86    98.84
20 dB    84.56   95.64   98.93     93.04      96.44    96.52    97.55   95.25    96.44
15 dB    64.48   90.70   98.71     84.63      91.93    88.12    95.29   88.28    90.91
10 dB    44.77   77.86   98.31     73.65      76.14    61.94    79.93   69.24    71.81
5 dB     29.78   56.00   97.54     61.11      48.33    23.04    41.16   39.86    38.10
0 dB     10.93   27.17   95.09     44.40      27.02    −0.51    11.63   15.46    13.40
−5 dB    7.77    11.39   89.01     36.06      15.35    −2.72    7.84    8.49     7.24
Table 8
SMC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.62   98.62   98.62     98.62      98.62    98.55    98.00   98.52    98.42
20 dB    81.42   95.52   98.56     91.83      95.61    89.21    97.02   95.25    94.27
15 dB    65.40   88.24   98.56     84.07      90.33    72.31    91.23   90.96    86.21
10 dB    47.28   75.04   98.56     73.63      71.60    47.10    69.49   78.96    66.79
5 dB     28.80   55.82   98.28     60.97      39.42    25.45    41.69   51.62    39.55
0 dB     19.28   33.50   96.90     49.89      18.11    4.99     19.53   25.42    17.01
−5 dB    10.01   14.74   92.05     38.93      9.61     −4.75    10.65   12.03    6.88
Table 9
OSALPC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.37   98.37   98.37     98.37      98.37    98.43    97.88   98.49    98.29
20 dB    83.57   97.42   98.56     93.18      94.96    90.36    96.72   95.37    94.35
15 dB    66.01   90.24   98.53     84.93      87.10    75.33    92.51   91.05    86.50
10 dB    49.28   67.82   98.37     71.82      66.50    50.00    74.14   78.00    67.16
5 dB     33.34   42.40   97.88     57.87      36.01    26.12    46.56   51.93    40.16
0 dB     20.60   17.65   95.86     44.70      16.12    6.50     23.50   25.46    17.90
−5 dB    10.59   6.88    86.15     34.54      6.66     −5.77    12.14   12.10    6.28
Table 10
LPCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    91.57   91.57   91.57     91.57      91.57    91.57    91.57   91.57    91.57
30 dB    87.06   89.57   91.57     89.40      89.93    90.98    90.86   90.16    90.48
25 dB    81.25   87.62   91.25     86.71      88.01    90.39    89.85   87.97    89.06
20 dB    69.06   82.82   90.55     80.81      80.81    87.54    88.32   82.82    84.87
15 dB    41.77   68.60   88.48     66.28      63.92    77.12    83.09   68.10    73.06
10 dB    18.28   42.57   86.39     49.08      40.23    54.47    67.04   44.44    51.55
5 dB     6.98    17.39   80.04     34.80      23.22    27.61    41.43   18.81    27.77
Table 11
SMC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    92.03   92.03   92.03     92.03      92.03    92.03    92.03   92.03    92.03
30 dB    85.64   89.11   91.76     88.84      89.46    90.12    90.63   89.22    89.86
25 dB    80.15   87.39   91.76     86.43      86.92    88.60    90.04   88.44    88.50
20 dB    71.24   82.47   91.60     81.77      82.35    85.20    88.29   85.16    85.25
15 dB    48.53   76.57   90.82     71.97      68.67    74.85    83.60   73.34    75.11
10 dB    25.64   63.06   89.57     59.42      43.54    52.83    74.78   52.21    55.84
5 dB     8.00    36.78   88.11     44.30      18.37    21.98    53.34   25.77    29.86
Table 12
OSALPC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    91.92   91.92   91.92     91.92      91.92    91.92    91.92   91.92    91.92
30 dB    86.27   89.61   91.49     89.12      90.20    90.39    90.94   90.28    90.45
25 dB    79.19   87.82   91.14     86.05      87.90    88.44    90.12   87.54    88.50
20 dB    69.63   82.47   90.72     80.94      80.42    84.54    88.25   83.80    84.25
15 dB    49.47   73.45   89.89     70.94      68.12    72.98    83.44   73.02    74.39
10 dB    25.04   53.45   88.47     55.65      43.91    48.73    72.32   51.27    54.06
5 dB     8.00    24.56   85.30     39.29      21.94    18.40    48.26   22.34    27.73
than the three LP-based methods. (2) For the noisy speech utterances, the recognition performance of
the AMFCC method is better than that of the three LP-based methods for most of the cases.
Fig. 20. Mean recognition accuracy of the AMFCC, the LPCC, the SMC and the OSALPC features as a function of SNR. (a) Aurora-2 database speech tested with three artificial noises: white random, chirp and impulsive. (b) Resource management database speech tested with the same three artificial noises. (c) Aurora-2 database speech tested with four real-life noises: subway, babble, car and exhibition. (d) Resource management database speech tested with the same four real-life noises.
6. Conclusions

In this paper, we have proposed the AMFCC method of feature extraction for robust speech recognition. This method uses the windowed (one-sided) higher-lag autocorrelation sequence for estimating the power spectrum of the speech signal. In order to capture the spectral envelope information of speech, the window applied to the autocorrelation sequence has to have a dynamic range twice that of the Hamming window. We have evaluated the speech recognition performance of the AMFCC features on the Aurora and the RM databases and
shown that they perform as well as the MFCC features for clean speech and their recognition performance is better than the MFCC features for noisy speech. Finally, we have shown that the AMFCC features perform better than the features derived from the LP-based methods for both the clean and noisy speech utterances.

Appendix A. Effect of additivity assumption on speech recognition performance

If the clean speech signal s(n) and the noise signal d(n) in Eq. (1) are correlated, the autocorrelation
function of the observed (noisy) signal x(n) is given by

r_xx(n) = r_ss(n) + r_dd(n) + r_sd(n) + r_ds(n),   (4)

where the last two terms, r_sd(n) and r_ds(n), are the cross-correlation functions of s(n) and d(n). When s(n) and d(n) are uncorrelated, these cross-correlation terms in Eq. (4) become zero. As a result, Eq. (4) reduces to Eq. (2), which defines the additivity property necessary for successful functioning of the AMFCC method of feature extraction. As mentioned earlier, this property is strictly valid only asymptotically (i.e., when the signal x(n) available for analysis is infinitely long). In practice, we perform short-time analysis of x(n) for AMFCC feature extraction. Under the short-time analysis constraint, the cross-terms in Eq. (4) are not zero even when the signals s(n) and d(n) are uncorrelated. As a result, the additivity property (Eq. (2)) is not valid for the short-time analysis.

In this appendix, we investigate empirically to what extent this deviation from the additivity assumption due to short-time analysis affects the speech recognition performance of the AMFCC method. For this, we carry out an 'oracle' experiment where we assume that we can access the clean speech signal s(n) as well as the noise signal d(n) individually. This allows us to modify the AMFCC method so that we can enforce the additivity assumption for short-time analysis. For this, we perform short-time analysis of s(n) and d(n) separately and compute their short-time autocorrelation sequences r_ss(n) and r_dd(n). These two sequences are then added according to Eq. (2) to obtain the short-time autocorrelation sequence r_xx(n). As a result, the modified AMFCC method (denoted here by the AMFCC-O method) ensures the validity of the additivity assumption for short-time analysis. We show a composite block diagram of the AMFCC and the AMFCC-O methods in Fig. 21. In this figure, we have highlighted two paths by dotted lines and labelled them "ALL CROSS TERMS" and "ZERO CROSS TERMS", respectively. The top path ("ALL CROSS TERMS") is used for the AMFCC method and the bottom path ("ZERO CROSS TERMS") for the AMFCC-O method. Using features computed by the AMFCC-O method and the AMFCC method, we carry out speech recognition experiments on the Aurora database and compare the speech recognition performance of the two methods. Since the AMFCC-O method requires the noise signal as an input, the pre-mixed noisy utterances in the Aurora database cannot be used in these experiments. Instead, we take the clean speech from the subway case and mix it with the subway, babble, car and exhibition noises provided in the Aurora database. The noise signal levels are adjusted to produce noisy speech utterances with global SNRs equal to the seven default test SNRs found in the Aurora database: clean, 20, 15, 10, 5, 0 and −5 dB.
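A minimal sketch of the difference between the two paths (our own illustration; pre-emphasis, frame blocking and windowing are assumed to have been applied to both inputs beforehand): the oracle path replaces the autocorrelation of the noisy frame by the sum of the separately computed clean-speech and noise autocorrelations, so the cross-terms in Eq. (4) are zero by construction.

```python
import numpy as np

def biased_autocorr(x):
    """Biased autocorrelation estimate for lags 0..len(x)-1, computed via the FFT."""
    n = len(x)
    spec = np.fft.rfft(x, 2 * n)
    return np.fft.irfft(np.abs(spec) ** 2)[:n] / n

def noisy_autocorr_amfcc(clean_frame, noise_frame):
    """AMFCC path: autocorrelation of the observed frame s + d (all cross-terms present)."""
    return biased_autocorr(clean_frame + noise_frame)

def noisy_autocorr_amfcc_o(clean_frame, noise_frame):
    """AMFCC-O path: r_ss + r_dd added directly, enforcing the additivity assumption."""
    return biased_autocorr(clean_frame) + biased_autocorr(noise_frame)
```

The rest of the AMFCC-O pipeline (lag trimming, lag windowing, magnitude spectrum, Mel filter bank, logarithm and DCT) is unchanged.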
Fig. 21. Composite block diagram showing AMFCC and AMFCC-O feature extraction methods. The path labelled ‘‘ALL CROSS TERMS’’ shows the typical implementation of the AMFCC method. The path labelled ‘‘ZERO CROSS TERMS’’ shows the AMFCC-O method, which eliminates the cross-correlation terms from the short-time autocorrelation sequence.
Fig. 22. Recognition accuracy results for AMFCC-O, AMFCC and MFCC features for four Aurora noises.
In the speech recognition experiments, the base feature set consists of 12 cepstral coefficients (it does not include the logarithmic energy or the zeroth cepstral coefficient). A 36-dimensional feature vector is formed by concatenating the delta and acceleration coefficients to the base feature set. The results of the experiments are presented in Fig. 22. For reference, we also plot the results for the MFCC features. We can see from this figure that the AMFCC-O method provides very little improvement in speech recognition performance over the AMFCC method for some of the (lower) SNRs; the two methods are comparable at the other SNRs. These results show that the additivity assumption used in the AMFCC
method is quite good even for the short-time analysis.

References

Anstey, N.A., 1966. Correlation techniques. Can. J. Explor. Geophys. 2 (1), 55–82.
Bellegarda, J.R., 1997. Statistical techniques for robust ASR: Review and perspectives. Proc. Eurospeech, KN33–KN36.
Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and recombination of partial frequency bands. Proc. ICSLP, 426–429.
Cadzow, J.A., 1982. Spectral estimation: An overdetermined rational model equation approach. Proc. IEEE 70 (Sep.), 907–939.
Chan, Y.T., Langford, R.P., 1982. Spectral estimation via the high-order Yule–Walker equations. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 30 (5), 689–698.
Cooke, M.P., Morris, A., Green, P.D., 1997. Missing data techniques for robust speech recognition. Proc. ICASSP, 863–866.
Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 28 (4), 357–365.
Gales, M., Woodland, P., 1996. Mean and variance adaptation within the MLLR framework. Comput. Speech Language 10, 249–264.
Gersch, W., 1970. Estimation of the autoregressive parameters of mixed autoregressive moving-average time series. IEEE Trans. AC 5 (October), 583–588.
Ghitza, O., 1986. Auditory nerve representation as a front-end for speech recognition in a noisy environment. Comput. Speech Language 1, 109–130.
Gong, Y., 1995. Speech recognition in noisy environments: A survey. Speech Commun. 16 (3), 261–291.
Harris, F.J., 1978. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66 (1), 51–83.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2 (4), 578–586.
Hermus, K., Wambacq, P., 2004. Assessment of signal subspace based speech enhancement for noise robust speech recognition. Proc. ICASSP, 945–948.
Hernando, J., Nadeu, C., 1997. Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition. IEEE Trans. Speech Audio Process. 5 (1), 80–84.
HTK, Hidden Markov model tool kit, available from: , 2005.
Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
Juang, B.H., 1991. Speech recognition in adverse environments. Comput. Speech Language 5, 275–294.
Kay, S.M., 1979. The effects of noise on the autoregressive spectral estimator. IEEE Trans. Acoust. Speech Signal Process. (ASSP) ASSP-27 (5), 478–485.
Kay, S., 1988. Modern Spectral Analysis. Prentice Hall.
Kim, H.G., Schwab, M., Moreau, N., Sikora, T., 2003. Enhancement of noisy speech for noise robust front-end and speech reconstruction at back-end of DSR system. Proc. Eurospeech, 545–548.
Lee, C.H., 1998. On stochastic features and model compensation approaches to robust speech recognition. Speech Commun. 25, 29–48.
Lippmann, R.P., 1997. Speech recognition by machines and humans. Speech Commun. 22, 1–16.
Makhoul, J., 1975. Linear prediction: A tutorial review. Proc. IEEE 63 (April), 561–580.
Mansour, D., Juang, B.H., 1989. The short-time modified coherence representation and noisy speech recognition. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 37 (6), 795–804.
McGinn, D.P., Johnson, D.H., 1989. Estimation of all-pole model parameters from noise-corrupted sequences. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 37 (3), 433–436.
Paliwal, K.K., 1986a. A noise-compensated long correlation matching method for AR spectral estimation of noisy signals. Proc. ICASSP, 1369–1372.
Paliwal, K.K., 1986b. A constrained forward–backward correlation prediction method for AR spectral estimation of noisy signals. Proc. EUSIPCO, 295–298.
Paliwal, K.K., 1986c. Robust LP analysis method based on pitch information for noisy speech. Proc. EUSIPCO, 593–596.
Paliwal, K.K., Sagisaka, Y., 1997. Cyclic autocorrelation-based linear prediction analysis of speech. Proc. Eurospeech, 279–282.
Paliwal, K.K., Sondhi, M.M., 1991. Recognition of noisy speech using cumulant-based linear prediction analysis. Proc. ICASSP, 429–432.
Pearce, D., Hirsch, H., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. Proc. ICSLP 4 (Oct.), 29–32.
Price, P., Fisher, W., Bernstein, J., Pallett, D., 1988. The DARPA 1000-word resource management database for continuous speech recognition. Proc. ICASSP, 651–654.
Rabiner, L., Juang, B., 1993. Fundamentals of Speech Recognition. Prentice Hall.
Raj, B., Seltzer, M.L., Stern, R.M., 2004. Reconstruction of missing features for robust speech recognition. Speech Commun. 43, 275–296.
Shannon, B.J., Paliwal, K.K., 2004. MFCC computation from magnitude spectrum of higher lag autocorrelation coefficients for robust speech recognition. Proc. ICSLP.
Shannon, B.J., Paliwal, K., 2005. Influence of autocorrelation lag ranges on robust speech recognition. Proc. ICASSP 2 (March).
Stern, R.M., Acero, A., Liu, F.H., Ohshima, Y., 1996. Signal processing for robust speech recognition. In: Lee, C., Soong, F., Paliwal, K.K. (Eds.), Automatic Speech Recognition: Advanced Topics. Kluwer Academic Publishers, Boston, pp. 357–384.
Tibrewala, S., Hermansky, H., 1997. Sub-band based recognition of noisy speech. Proc. ICASSP, 1255–1258.