Speech Communication 48 (2006) 1458–1485
www.elsevier.com/locate/specom

Feature extraction from higher-lag autocorrelation coefficients for robust speech recognition

Benjamin J. Shannon, Kuldip K. Paliwal *

School of Microelectronic Engineering, Griffith University, Nathan Campus, Brisbane, QLD 4111, Australia

Received 4 July 2005; received in revised form 1 August 2006; accepted 1 August 2006

* Corresponding author. Tel.: +61 7 3875 6536; fax: +61 7 3875 5198. E-mail address: K.Paliwal@griffith.edu.au (K.K. Paliwal).
Abstract

In this paper, a feature extraction method that is robust to additive background noise is proposed for automatic speech recognition. Since the background noise corrupts the autocorrelation coefficients of the speech signal mostly at the lower time lags, while the higher-lag autocorrelation coefficients are least affected, this method discards the lower-lag autocorrelation coefficients and uses only the higher-lag autocorrelation coefficients for spectral estimation. The magnitude spectrum of the windowed higher-lag autocorrelation sequence is used here as an estimate of the power spectrum of the speech signal. This power spectral estimate is processed further (as in the well-known Mel frequency cepstral coefficient (MFCC) procedure) by the Mel filter bank, log operation and the discrete cosine transform to get the cepstral coefficients. These cepstral coefficients are referred to as the autocorrelation Mel frequency cepstral coefficients (AMFCCs). We evaluate the speech recognition performance of the AMFCC features on the Aurora and the resource management databases and show that they perform as well as the MFCC features for clean speech, and that their recognition performance is better than that of the MFCC features for noisy speech. Finally, we show that the AMFCC features perform better than the features derived from the robust linear prediction-based methods for noisy speech.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Speech recognition; Feature extraction; Robustness to noise; MFCC
1. Introduction

A speech recogniser is trained in a given acoustic environment and is normally deployed (or tested) in a different environment; thus, there is always a mismatch between the training and test environments. This mismatch causes a drastic degradation in speech recognition performance. One of the major factors responsible for the mismatch between the training and test environments is additive background noise (uncorrelated with speech) (Juang, 1991; Gong, 1995). A number of methods have been proposed in the literature to overcome this environmental mismatch problem. These include robust feature extraction methods (Ghitza, 1986; Mansour and Juang, 1989; Paliwal and Sondhi, 1991; Paliwal and Sagisaka, 1997), speech enhancement methods (Kim et al., 2003; Hermus and Wambacq, 2004), feature compensation methods (Hermansky and Morgan, 1994; Stern et al., 1996), multi-band methods (Bourlard
and Dupont, 1996; Tibrewala and Hermansky, 1997), missing feature methods (Lippmann, 1997; Cooke et al., 1997; Raj et al., 2004), and model compensation (and adaptation) methods (Gales and Woodland, 1996; Bellegarda, 1997; Lee, 1998).

The focus of this paper is on robust feature extraction for speech recognition. We are interested in developing a feature extraction method that can deal with the additive background noise distortion in a robust manner. Here, we are given a frame of the observed (noisy) signal x(n), n = 0, 1, …, N − 1, for analysis, where N is the frame length (in number of samples). This can be expressed as

x(n) = s(n) + d(n),    (1)

where s(n) is the clean speech signal and d(n) is the background noise signal. (We denote the power spectrum of the noisy signal x(n) by Pxx(ω) and its autocorrelation function by rxx(n); similarly, for the clean signal s(n) the corresponding symbols are Pss(ω) and rss(n), and for the noise signal d(n) they are Pdd(ω) and rdd(n).) Our aim is to extract recognition features from the noisy speech signal x(n) in such a manner that they capture the spectral characteristics of the clean speech signal s(n) accurately and are least affected by the noise d(n).

The Mel-frequency cepstral coefficients (MFCCs) are perhaps the most widely used features in current state-of-the-art speech recognition systems (Rabiner and Juang, 1993; Huang et al., 2001). In MFCC feature extraction (Davis and Mermelstein, 1980), the noisy signal x(n) is processed in the following steps: (1) perform short-time Fourier analysis of the signal x(n) using a finite-duration window (such as a 32 ms Hamming window) and use the periodogram method (Kay, 1988) to compute the power spectral estimate P̂xx(ω) of the signal x(n); (2) apply the Mel filter bank to the power spectrum P̂xx(ω) to get the filter-bank energies; and (3) compute the discrete cosine transform (DCT) of the log filter-bank energies to get the MFCCs. The MFCC features perform reasonably well for the recognition of clean speech, but their performance is very poor for noisy speech. This happens because the periodogram-based power spectral estimate used in MFCC computation is severely affected by the additive background noise, and this degrades the recognition performance of the MFCC features for noisy speech.
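For reference, steps (1)–(3) above can be sketched in executable form. The following single-frame Python sketch is not the authors' implementation; the filter-bank size (23 channels), the FFT length and the Mel-scale helper functions are our own assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular Mel filter bank as an (n_mels, n_fft//2 + 1) weight matrix."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mfcc_frame(frame, fs=8000, n_mels=23, n_ceps=12):
    """Steps (1)-(3) for one frame: periodogram -> Mel filter bank -> log -> DCT."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(N), 2 * N)) ** 2   # periodogram estimate
    energies = mel_filterbank(n_mels, 2 * N, fs) @ spec             # filter-bank energies
    return dct(np.log(energies + 1e-12), type=2, norm='ortho')[:n_ceps]
```

The AMFCC front-end described in Section 4 keeps everything after the spectral estimate and changes only how P̂xx(ω) is obtained.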
In the present paper, we use autocorrelation domain processing for robust estimation of the power spectrum from noisy speech. The autocorrelation function of a signal is related to the signal's power spectrum through the Fourier transform (Kay, 1988) and it has the following two attractive properties:

(1) Additivity property: If two signals are uncorrelated, the autocorrelation function of their sum is equal to the sum of their autocorrelation functions. Thus,

rxx(n) = rss(n) + rdd(n),    (2)

as s(n) and d(n) are uncorrelated signals.

(2) Robustness property: The autocorrelation function of a white random noise signal is zero everywhere except at the zeroth time lag (Kay, 1979). For broadband noise signals, it is mainly confined to the lower time lags and is very small or zero at the higher time lags (Mansour and Juang, 1989). As a result, the additive noise d(n) does not affect the higher lags of the autocorrelation function. Thus the higher-lag autocorrelation coefficients are relatively robust to additive noise distortion.

Because of these attractive properties, autocorrelation domain processing has been used in the past for autoregressive (AR) spectral estimation (or linear prediction (LP) analysis) of noisy signals. The initial effort in this direction was based on the use of high-order Yule–Walker equations (Gersch, 1970; Chan and Langford, 1982), where the autocorrelation coefficients involved in the Yule–Walker equation set exclude the zero-lag coefficient. Other similar methods have been used that either avoid the zero-lag coefficient (Cadzow, 1982; Paliwal, 1986a; Paliwal, 1986b; Paliwal, 1986c; McGinn and Johnson, 1989) or reduce the contribution of the first few coefficients (Mansour and Juang, 1989; Hernando and Nadeu, 1997). All of these techniques are based on all-pole modelling of the causal part of the autocorrelation sequence of the signal x(n). Two of these techniques (Mansour and Juang, 1989; Hernando and Nadeu, 1997) have been used to extract cepstral coefficient features for speech recognition and were found to provide some robustness to noise, but their recognition performance for clean speech is worse than that of the conventional linear prediction cepstral coefficient (LPCC) features (Hernando and Nadeu, 1997).

In the present paper, we propose a robust feature extraction method based on autocorrelation domain processing. (Preliminary results for this method were reported earlier in conference papers (Shannon and Paliwal, 2004; Shannon and Paliwal, 2005).)
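The additivity and robustness properties above are easy to check numerically for a single 32 ms frame. The sketch below is illustrative only (the "speech-like" test signal and the biased estimator are our own assumptions); it shows that the cross-terms left by a finite frame are small and that the white-noise autocorrelation is concentrated near lag zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256                                        # one 32 ms frame at 8 kHz
n = np.arange(N)
s = np.sin(2 * np.pi * 500 / 8000 * n) + 0.5 * np.sin(2 * np.pi * 1500 / 8000 * n)
d = rng.standard_normal(N)                     # white noise frame
x = s + d

def biased_autocorr(v):
    """Biased estimate r(k) = (1/N) * sum_n v(n) v(n+k), k = 0..N-1."""
    return np.correlate(v, v, mode='full')[len(v) - 1:] / len(v)

r_ss, r_dd, r_xx = biased_autocorr(s), biased_autocorr(d), biased_autocorr(x)

# Additivity: for uncorrelated s and d the cross-terms of a finite frame are
# small, so r_xx(k) stays close to r_ss(k) + r_dd(k).
print(np.max(np.abs(r_xx - (r_ss + r_dd))))

# Robustness: the white-noise autocorrelation is concentrated at lag zero;
# beyond 2 ms (16 samples at 8 kHz) its coefficients are comparatively small.
print(abs(r_dd[0]), np.abs(r_dd[16:]).max())
```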
Since the broadband noise distortion affects only the lower-lag autocorrelation coefficients, we discard them and utilise only the higher-lag (>2 ms) autocorrelation coefficients. We multiply the higher-lag autocorrelation coefficients of x(n) by a suitable window function (to be discussed later in this paper) and compute the magnitude spectrum of the windowed sequence as an estimate of the signal's power spectrum. We call this method of power spectrum estimation the higher-lag autocorrelation spectrum estimation (HASE) method. Steps (2) and (3) of the MFCC feature extraction procedure described above are applied to this power spectrum estimate to obtain the cepstral coefficient features. We call these features autocorrelation Mel frequency cepstral coefficients (AMFCCs).

Note that the HASE method of power spectrum estimation (used here for AMFCC computation) is based on the additivity and robustness properties of the signal's autocorrelation function. These two properties are strictly valid only in an asymptotic sense (under an ergodicity assumption), i.e., when the autocorrelation coefficients are computed from infinitely long signals. For AMFCC computation, we carry out only a short-time analysis of the speech signal (with a signal duration of about 32 ms) to compute the autocorrelation function. This autocorrelation function may not satisfy these two properties strictly. The effect of the short-time analysis on these properties and on the speech recognition performance is discussed later in this paper.

This paper is organised as follows. Section 2 describes some of the properties of the short-time autocorrelation function relevant in the context of the AMFCC method of feature extraction. The spectral estimation method used for computing the AMFCCs is described in Section 3. This method uses the magnitude spectrum of the windowed one-sided higher-lag autocorrelation sequence as an estimate P̂xx(ω) of the signal's power spectrum. The AMFCC feature extraction method is presented in Section 4. This method is evaluated for robust speech recognition and its performance is compared with that of other feature extraction methods reported in the literature. The recognition experiments carried out for this purpose and their results are described in Section 5. The conclusions are presented in Section 6.

2. Some properties of the short-time autocorrelation function

The autocorrelation function of a signal contains the same information about the signal as its power
spectrum (Kay, 1988). In the power spectrum domain, the information is presented as a function of frequency, and in the autocorrelation domain it is presented as a function of time (Anstey, 1966). In this section, we demonstrate some of the properties of the short-time autocorrelation function relevant in the context of the AMFCC method of feature extraction. Since the AMFCC method discards the zeroth and lower-lag autocorrelation coefficients and uses only the higher-lag autocorrelation coefficients for spectral estimation, it is necessary to know whether and how these coefficients contain the spectral information necessary for speech recognition. Also, since we are proposing the AMFCC method as a robust feature extraction procedure on the basis that the additive noise distortion has most of its autocorrelation coefficients concentrated near the lower time lags and its higher-lag autocorrelation coefficients are zero (or very small), we want to know what types of noise signals have this property and to what extent they satisfy it (under the short-time analysis framework). This question is explored in the second part of this section.

2.1. Speech signals

The commonly used source-system model of speech production views the speech signal as the output of a linear, time-varying system (or filter) excited by either a white noise source (for unvoiced speech) or a periodic pulse train source (for voiced speech) (Huang et al., 2001). For speech recognition, we are typically interested in extracting the magnitude response of the time-varying system as a function of time (we assume that the magnitude response of the system carries sufficient speech information for accurate recognition). For this, we perform a frame-wise short-time analysis of the speech signal to compute its power spectrum and use its (smoothed) envelope to get the system's magnitude response.

In order to illustrate whether and how the higher-lag autocorrelation coefficients contain the spectral information necessary for speech recognition, we first consider a 32 ms frame of a clean (noise-free) voiced speech signal from a female /ey/ sound. We show its power spectrum and autocorrelation function in Fig. 1(a) and (b), respectively (the autocorrelation function is obtained from the power spectrum through inverse Fourier transformation; thus, it is a biased estimate of the autocorrelation function for the frame).
Fig. 1. Power spectrum and autocorrelation function of a 32 ms long voiced speech frame of female / ey / sound. (a) Signal power spectrum (in dB). (b) Autocorrelation sequence associated with the signal spectrum in (a). (c) Smooth spectral envelope (in dB) computed by retaining the first 12 cepstral coefficients of the signal spectrum shown in (a). (d) Autocorrelation sequence associated with the smooth spectrum shown in (c). (e) Excitation power spectrum (in dB) estimated by subtracting the smooth (log) spectrum shown in (c) from the signal (log) spectrum shown in (a). (f) Autocorrelation sequence associated with the excitation spectrum shown in (e).
To illustrate the system and source components for this signal, we perform homomorphic analysis and compute the smooth spectral envelope by retaining the first 12 cepstral coefficients. This is used as the power spectral estimate of the system component. The residual power spectrum (obtained by dividing the signal's power spectrum by the smooth spectral envelope) is used as the power spectral estimate of the (excitation) source component. Fig. 1(c) and (e) show these power spectral estimates of the system and source components. Their associated autocorrelation functions are shown in Fig. 1(d) and (f), respectively. We can observe from Fig. 1(d) and (f) that the autocorrelation function of the system component is confined to the lower lags only, and that of the source component shows the periodicity of the voiced speech signal. Since the signal's autocorrelation function (shown in Fig. 1(b)) is the convolution of the autocorrelation functions of the system and source components, it is periodic, with the smooth spectral envelope information repeated periodically over the entire lag range, and its values are quite large at the higher lags (especially when we remember that it is a biased estimate). Because of this, we should be able to discard the lower portion of the signal's autocorrelation function and still get a good power spectral estimate for speech recognition from the higher-lag autocorrelation coefficients (as shown in the later sections).

The same analysis is repeated for a 32 ms frame of a clean (noise-free) unvoiced speech signal from a female /s/ sound and the results are shown in Fig. 2. Here the autocorrelation function of the system component (Fig. 2(d)) is confined to the lower lags and that of the source component (Fig. 2(f)) is non-periodic and non-zero at the higher lags. As a result of the convolution, the signal's autocorrelation function (Fig. 2(b)) contains the spectral information about the system component at the higher lags (though not to the same extent as it does for the voiced speech signal). Thus, we should be able to use the signal's higher-lag autocorrelation coefficients to estimate the power spectrum of unvoiced speech needed for feature extraction.
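For reference, the homomorphic decomposition used to produce the system and source components in Figs. 1 and 2 can be sketched as follows. This is our own minimal version, not the authors' code; the FFT length and the small flooring constant are assumptions.

```python
import numpy as np

def envelope_and_excitation(frame, n_ceps=12, nfft=512):
    """Split a frame's log power spectrum into a smooth envelope (system part)
    and a residual (excitation part) by keeping the first n_ceps cepstral
    coefficients."""
    spec = np.abs(np.fft.fft(frame * np.hamming(len(frame)), nfft)) ** 2
    log_spec = np.log(spec + 1e-12)
    ceps = np.fft.ifft(log_spec).real            # real cepstrum of the frame
    lifter = np.zeros(nfft)
    lifter[:n_ceps] = 1.0                        # keep quefrencies 0..n_ceps-1
    lifter[-(n_ceps - 1):] = 1.0                 # ...and their mirrored part
    smooth_log = np.fft.fft(ceps * lifter).real  # smooth (log) envelope, Figs. 1(c)/2(c)
    return smooth_log, log_spec - smooth_log     # excitation (log) spectrum, Figs. 1(e)/2(e)
```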
Fig. 2. Power spectrum and autocorrelation function of a 32 ms long unvoiced speech frame of a female /s/ sound. (a) Signal power spectrum (in dB). (b) Autocorrelation sequence associated with the spectrum in (a). (c) Smooth spectral envelope (in dB) computed by retaining the first 12 cepstral coefficients of the signal spectrum shown in (a). (d) The autocorrelation sequence associated with the spectrum shown in (c). (e) Excitation power spectrum (in dB) estimated by subtracting the smooth (log) spectrum shown in (c) from the signal (log) spectrum shown in (a). (f) Autocorrelation sequence associated with the excitation spectrum shown in (e).
2.2. Noise signals

As mentioned earlier, the AMFCC method is proposed here as a robust feature extraction procedure on the basis that the additive noise distortion has most of its autocorrelation coefficients concentrated near the lower time lags and its higher-lag autocorrelation coefficients are zero (or very small). In this context, it is important to know what types of noise signals have this property and to what extent they satisfy it.

In order to answer this question, we first consider the stationary white random noise signal. Theoretically (and asymptotically), its autocorrelation function should be zero for all lags except the zeroth lag (Kay, 1979). We want to know whether this is true for short-time analysis. We take 2 s of computer-generated (artificial) white Gaussian noise and perform a short-time analysis (with a Hamming window) every 10 ms using a frame length of 32 ms. Fig. 3(a) shows the spectrogram of this signal (in this paper, we use a 10 ms frame shift and a 32 ms Hamming-windowed frame for spectrogram computation). For illustration, we take three different frames starting at 0.5, 1 and 1.5 s. We show the waveforms of these frames in Fig. 3(b)–(d), their respective power spectra in Fig. 3(e)–(g) and their respective autocorrelation functions in Fig. 3(h)–(j). As expected, the higher-lag autocorrelation coefficients are smaller in magnitude than the zeroth autocorrelation coefficient, but they have non-zero values due to the short-time analysis. Similarly, it can be shown that the stationary broadband random noise signals (white random noise can be considered a particular case of broadband random noise in which the band extends from zero to half the sampling frequency) satisfy this property, where the magnitude of the lower-lag autocorrelation coefficients is large and that of the higher-lag coefficients is very small (Mansour and Juang, 1989).

In addition to these noises (white and broadband), we can identify the following two other noises that satisfy this property: chirp noise and impulsive noise. Typical examples of chirp noise are the sounds generated by emergency sirens and the chirping of birds. Typewriter (or keyboard) noise and the sound of machine gun fire are typical examples of impulsive noise. Here, we generate artificial chirp and impulsive noises (the procedure used to generate these artificial noises is described later in Section 5) and perform short-time analysis. The results are shown in Figs. 4 and 5.
Fig. 3. Short-time analysis of the artificial white random noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
We can observe from these figures that most of the autocorrelation coefficients are concentrated near the lower time lags and the higher-lag autocorrelation coefficients are close to zero.

So far, we have considered only artificial noises and shown that their autocorrelation functions have the required property that makes the AMFCC method suitable for robust feature extraction. Now, we consider real-life noises. As an example, we take the car noise from the Aurora-2 database (Pearce and Hirsch, 2000) and perform the short-time analysis (similar to that done for the artificial noises). The results are shown in Fig. 6. We can see from the spectrogram (Fig. 6(a)) that the car noise is a kind of broadband (low-pass) noise. From the autocorrelation plots shown in Fig. 6(h)–(j), it can be observed that the magnitude of its lower-lag autocorrelation coefficients is larger than that of the higher-lag coefficients, but its higher-lag autocorrelation coefficients are not as small as those observed for the artificial noises. Thus, the car noise satisfies the required robustness property, but not to the same extent as the artificial noises. We have experimented with the other real-life noises from the Aurora-2 database and made similar observations.

Thus, we have seen that (1) the higher-lag autocorrelation coefficients of the speech signal s(n) contain information about the signal's power spectrum Pss(ω), and (2) the magnitude of the higher-lag autocorrelation coefficients of the noise signal d(n) is relatively small for some noises. If we assume the additivity property (Eq. (2)) in the autocorrelation domain to be true (this property holds only asymptotically and may not be completely satisfied when we do short-time analysis), the higher-lag autocorrelation coefficients of the noisy speech signal x(n) will be relatively insensitive to the additive background noise distortion.
Fig. 4. Short-time analysis of the artificial chirp noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
Therefore, we discard the lower-lag autocorrelation coefficients and use only the higher-lag (lag greater than 2 ms) autocorrelation coefficients of the noisy speech signal for robust spectral estimation. In order to improve the robustness of this spectral estimator further, we de-emphasise the more corrupt lower-lag autocorrelation coefficients during spectral estimation. The lower-lag coefficients can be attenuated by using a tapered window function. This also has the added effect of attenuating the very high-lag autocorrelation coefficients, which have a relatively high estimation variance (Kay, 1988).

3. Spectral estimation from higher-lag autocorrelation coefficients

Typically, spectral estimation in automatic speech recognition is performed using either a linear prediction (LP) analysis algorithm or a Fourier-transform-based periodogram algorithm. These two approaches can be discussed in terms of the autocorrelation coefficients they employ. A conventional LP analysis algorithm (such as the autocorrelation method) used in the speech processing literature (Makhoul, 1975) utilises only the first few lower-lag autocorrelation coefficients, while the periodogram method can be considered as utilising the full autocorrelation sequence. In our higher-lag autocorrelation spectrum estimation (HASE) method, we propose to use only the higher-lag portion of one side (the causal side) of the autocorrelation sequence. By considering only the causal side of the autocorrelation sequence, we are not losing any information, since the autocorrelation sequence is symmetric.

A number of LP-based techniques have been reported in the literature which utilise the windowed one-sided autocorrelation sequence to improve the robustness of the spectral estimates.
Fig. 5. Short-time analysis of the artificial impulsive noise signal using 32 ms frames. (a) Spectrogram of a 2 s long sample of the noise signal. (b)–(d) Waveforms of noise frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Power spectra (periodogram estimate with a Hamming window) of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the power spectra shown in (e)–(g), respectively.
Two of these techniques, the short-time modified coherence (SMC) method (Mansour and Juang, 1989) and the one-sided autocorrelation linear prediction coefficient (OSALPC) method (Hernando and Nadeu, 1997), have been used in the past for automatic speech recognition. Both of these techniques use all-pole modelling of the windowed autocorrelation sequence for computing the spectral estimates. As mentioned earlier in the introduction, these techniques provide some robustness to noise, but their recognition performance for clean speech is much worse than that of the conventional autocorrelation method of LP analysis (Hernando and Nadeu, 1997). This happens because the use of higher-lag autocorrelation coefficients provides robust spectral estimates for noisy speech, but these autocorrelation coefficients do not follow the all-pole model as well as the clean speech signal does. Hence, in our HASE method, we do not use all-pole modelling of the windowed one-sided (causal) autocorrelation sequence to compute the spectral estimate. Instead, we use the magnitude spectrum of the windowed one-sided autocorrelation sequence as an estimate of the power spectrum of the signal. However, before doing so, we discard the first few lower-lag autocorrelation coefficients, as they are affected more by the additive noises (of the type discussed in the preceding section). We utilise only the higher-lag (greater than 2 ms) autocorrelation coefficients. Thus, our method positions the window on the autocorrelation sequence at a higher starting lag than the previous LP-based methods.
Fig. 6. Analysis of car noise signal using 32 ms frames. (a) Spectrogram of a 2 s sample of car noise. (b)–(d) Signal frames taken at 0.5, 1.0 and 1.5 s, respectively. (e)–(g) Logarithmic power spectra, computed using a Hamming window, of the frames shown in (b)–(d), respectively. (h)–(j) Autocorrelation sequences corresponding to the spectra shown in (e)–(g), respectively.
The HASE method for computing the power spectrum of the observed signal x(n), n = 0, 1, …, N − 1 (where N is 256 for a 32 ms long speech signal sampled at 8 kHz), is described formally as follows:

(1) Multiply the observed signal x(n) by a window function ws(n) to get the windowed signal xw(n) = x(n)ws(n), n = 0, 1, …, N − 1. We use here the Hamming window function for ws(n).

(2) Compute the biased estimate of the one-sided autocorrelation coefficients

R(i) = (1/N) Σ_{n=0}^{N−i−1} xw(n) xw(n + i),    i = 0, 1, …, N − 1.

This computation can be done in a computationally efficient manner using the fast Fourier transform (FFT) algorithm.

(3) Discard the first L (=16) lower-lag autocorrelation coefficients and multiply the remaining higher-lag autocorrelation sequence by a window function wr(n) to get the windowed higher-lag autocorrelation sequence Rw(n) = R(L + n) wr(n), n = 0, 1, …, M − 1, where M = N − L.

(4) Append 2N − M zeros to the end of the windowed higher-lag autocorrelation sequence Rw(n), n = 0, 1, …, M − 1, and compute the discrete Fourier transform (DFT) of the appended sequence through the FFT algorithm. Use the absolute value of this transform as the desired power spectral estimate P̂xx(ω).
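A compact implementation of steps (1)–(4) might look as follows. This is a sketch under the stated parameters (N = 256, L = 16 at 8 kHz), not the authors' code; the choice of wr(n) is passed in, since the window design is discussed next.

```python
import numpy as np

def hase_spectrum(x, fs=8000, lag_cut_ms=2.0, wr=None):
    """HASE power spectral estimate of one frame x, following steps (1)-(4)."""
    N = len(x)
    xw = x * np.hamming(N)                            # step (1): signal window ws(n)
    X = np.fft.fft(xw, 2 * N)                         # step (2): biased autocorrelation
    r = np.fft.ifft(np.abs(X) ** 2).real[:N] / N      #           computed via the FFT
    L = int(round(lag_cut_ms * 1e-3 * fs))            # step (3): drop lags below 2 ms
    if wr is None:
        wr = np.hamming(N - L)                        # placeholder; Section 3.1 motivates
    Rw = r[L:] * wr                                   #             a DDR window instead
    return np.abs(np.fft.fft(Rw, 2 * N))              # step (4): zero-pad to 2N, take |DFT|
```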
The window function wr(n) used on the one-sided (causal) higher-lag autocorrelation sequence plays an important role in our spectral estimation method. Therefore, we first discuss the design of this window function and then provide some results to demonstrate the spectral estimation performance of our method with respect to the previous LP-based methods.

3.1. Window function design

In our HASE method, we use the Hamming window function for ws(n) to window the signal, but we do not know what type of window function we should use for wr(n) to window the one-sided higher-lag autocorrelation sequence. In order to see its effect on the spectral estimation performance of the HASE method, we first use the Hamming window function for wr(n) to window the higher-lag autocorrelation sequence. We apply this HASE method to analyse a 32 ms long voiced speech frame of a female /ey/ sound. (Note that this frame was used earlier in Fig. 1.)
Fig. 7. Illustration of the HASE method (using Hamming windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female / ey / sound. (a) one-sided (causal) autocorrelation sequence (in dotted line), a 30 ms Hamming window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
The Hamming-windowing operation carried out on the higher-lag autocorrelation sequence is illustrated in Fig. 7(a). The power spectral estimate of the speech frame using the HASE method is shown in Fig. 7(b) along with its periodogram estimate. We can see from this figure that the periodogram method captures spectral details of the harmonics down to about 43 dB below the strongest components, i.e., its dynamic range is about 43 dB. The HASE method captures the higher-magnitude harmonics well, but fails to capture the harmonics lying more than about 22 dB below them. In other words, its dynamic range is limited to about 22 dB.

In order to explain this dynamic range issue, we show the shape of the Hamming window function and its power spectrum in Fig. 8(a). We see that the magnitude of the highest side-lobe is about 43 dB lower than that of the main lobe, i.e., the dynamic range of the Hamming window is about 43 dB. Also note here that the Hamming window is applied only to the signal in the periodogram method, but it is applied to both the signal and the higher-lag autocorrelation sequence in the HASE method. It is well known (Mansour and Juang, 1989) that the power spectrum of an autocorrelation sequence of a signal has a dynamic range twice as large as that of the signal's power spectrum. The HASE method here uses the Hamming window on the autocorrelation sequence and, hence, the resulting power spectral estimate of the signal has a dynamic range of about 22 dB.
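The "dynamic range" of a window, in the sense used here (main-lobe peak minus its highest side-lobe in the power spectrum), can be estimated numerically. The helper below is our own rough sketch, not a procedure from the paper.

```python
import numpy as np

def window_dynamic_range_db(w, nfft=1 << 15):
    """Rough dynamic range of a window: peak minus highest side-lobe, in dB."""
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(w, nfft)) + 1e-300)
    mag_db -= mag_db.max()
    k = 0
    while k + 1 < len(mag_db) and mag_db[k + 1] < mag_db[k]:
        k += 1                      # walk down the main lobe to its first minimum
    return -mag_db[k + 1:].max()    # highest side-lobe level below the peak

print(window_dynamic_range_db(np.hamming(256)))   # roughly 43 dB for the Hamming window
```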
Fig. 8. Window function and its power spectrum for (a) Hamming window, (b) Kaiser window (α = 11.3) and (c) DDR Hamming window.
This aspect of windowing an autocorrelation sequence has also been reported previously by Mansour and Juang (1989). While developing the SMC method of robust LP analysis, they compared the SMC power spectral estimates with the corresponding estimates obtained by using the conventional autocorrelation method of LP analysis (Makhoul, 1975). To compute the SMC spectra in their investigation, a Hamming window was applied to the autocorrelation sequence. It was reported (Mansour and Juang, 1989) that, in certain cases, formants present in the conventional LP spectrum were reduced or totally lost in the SMC spectrum. Mansour and Juang (1989) made particular reference to third formants (which have low magnitude) and attributed this spectral loss to the dynamic range of the window function applied to the autocorrelation sequence. These comments are consistent with our observations.

As mentioned earlier, the power spectrum of an autocorrelation sequence has a dynamic range twice that of the corresponding signal's power spectrum. Therefore, when we window the one-sided autocorrelation sequence, we need to use a window function that has a dynamic range twice as large as that of the window function we would typically use on the original time-domain signal. For example, a Hamming window, which is typically applied to the time-domain signal in speech recognition, has a dynamic range of 43 dB. To produce a spectral estimate with an equivalent dynamic range from the windowed autocorrelation sequence, we need to use a window function with an 86 dB dynamic range. The Kaiser window function can provide such a window (Harris, 1978). The function used for generating a Kaiser window of length M is given by

w(n) = I0(2α √[(n/(M − 1)) (1 − n/(M − 1))]) / I0(α)  for 0 ≤ n < M, and w(n) = 0 otherwise,    (3)

where I0(x) is a modified Bessel function of the first kind and α is the design parameter that sets the window's dynamic range. To get a Kaiser window function with a dynamic range of 86 dB, we set α = 11.3. The resulting Kaiser window function and its power spectrum are shown in Fig. 8(b). We use this window function to window the higher-lag autocorrelation sequence in the HASE method and analyse the same speech frame as the one used in Fig. 7. We show the Kaiser-windowing operation carried out on the higher-lag autocorrelation sequence in Fig. 9(a) and the resulting power spectral estimate of the speech frame in Fig. 9(b). By comparing Fig. 9(b) with Fig. 7(b), we see that the use of the Kaiser window function solves the dynamic range problem faced earlier with the Hamming window function, but it produces slightly wider harmonics. This happens because the Kaiser window function has a wider main lobe than the Hamming window function (see Fig. 8).
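Eq. (3) can be implemented directly. In this parameterisation it appears to coincide with the standard Kaiser window of shape parameter β = α; the quick check below is our own observation, not a statement from the paper.

```python
import numpy as np
from scipy.special import i0
from scipy.signal.windows import kaiser

def kaiser_eq3(M, alpha=11.3):
    """Kaiser window as written in Eq. (3)."""
    x = np.arange(M) / (M - 1)
    return i0(2 * alpha * np.sqrt(x * (1 - x))) / i0(alpha)

w = kaiser_eq3(240)
# Eq. (3) should match the usual Kaiser window with shape parameter beta = alpha,
# up to floating-point error:
print(np.allclose(w, kaiser(240, beta=11.3)))
```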
Fig. 9. Illustration of the HASE method (using Kaiser windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female /ey/ sound. (a) One-sided (causal) autocorrelation sequence (in dotted line), a 30 ms Kaiser window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
However, this is not a big problem, as for feature extraction in speech recognition we are interested only in the spectral envelope that captures the formant structure, and slightly wider harmonics do not spoil this structure. The main point is that the HASE method (with the Kaiser-windowed higher-lag autocorrelation sequence) provides the same dynamic range (about 43 dB) as the periodogram method and captures all the harmonics that are visible in the periodogram estimate.

Though the Kaiser window function provides good performance, it has two problems. Firstly, it is computationally more expensive to design a Kaiser window function than the cosine window functions (such as the Hamming window function), as it involves computation of a Bessel function (see Eq. (3)). Secondly, every time we decide to use a different window function ws(n) on the time-domain signal (for example, a Blackman window instead of a Hamming window), we have to modify the feature extraction software to design a Kaiser window function that has a dynamic range equal to twice that of the new window function. To solve these problems, we propose a design method specifically for computing a window function that has twice the dynamic range of the window function ws(n) used on the time-domain signal and which can be generated automatically inside the software once we specify the window function for ws(n).
Given that we are applying a Hamming window to the time-domain signal and using the autocorrelation sequence from 2 ms to 32 ms (with length M = 240 samples at an 8 kHz sampling rate), we want to design a window function of length M having twice the dynamic range of a Hamming window. The design procedure is outlined below.

(1) Construct a Hamming window of length M/2.
(2) Compute its two-sided (biased) autocorrelation sequence of length M − 1, which has its maximum at the zeroth lag in the centre.
(3) Pad this (M − 1)-long autocorrelation sequence with one zero at the end to get the desired window of length M.

This window function will have a dynamic range of about 86 dB. We call this window function the double-dynamic-range (DDR) Hamming window. The DDR Hamming window function and its power spectrum are shown in Fig. 8(c). We use this window function to window the higher-lag autocorrelation sequence in the HASE method and analyse the same speech frame as the one used in Figs. 7 and 9. We show the DDR Hamming-windowing operation carried out on the higher-lag autocorrelation sequence in Fig. 10(a) and the resulting power spectral estimate of the speech frame in Fig. 10(b). By comparing Fig. 10(b) with Fig. 9(b), we see that the DDR Hamming window performs as well as the Kaiser window function in terms of its spectral estimation performance.
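The three design steps translate directly into code, as sketched below. The peak normalisation in the last line is our own choice; the paper does not specify a scaling, and a constant gain does not change the shape of the resulting spectral estimate.

```python
import numpy as np

def ddr_hamming(M):
    """Double-dynamic-range (DDR) Hamming window of length M: the biased
    two-sided autocorrelation of a length-M/2 Hamming window, padded with
    one trailing zero (design steps (1)-(3) above)."""
    h = np.hamming(M // 2)
    r = np.correlate(h, h, mode='full') / (M // 2)   # length M-1, peak at the centre
    w = np.append(r, 0.0)                            # pad to length M
    return w / w.max()                               # peak normalised to 1 (our choice)

wr = ddr_hamming(240)   # the 2-32 ms lag range at 8 kHz used in this paper
```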
Fig. 10. Illustration of the HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence) on a 32 ms long voiced speech frame of female /ey/ sound. (a) One-sided (causal) autocorrelation sequence (in dotted line), a 30 ms DDR Hamming window function starting at 2 ms (in dashed line) and windowed autocorrelation sequence (in solid line). (b) Power spectral estimate by the HASE method (in solid line) and the periodogram method (in dotted line).
Fig. 11. Comparison of spectral estimation methods using a 32 ms frame of clean synthetic voiced speech. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
The resulting HASE method (with DDR Hamming windowing of the higher-lag autocorrelation sequence) captures all the harmonic peaks visible in the periodogram estimate, and its spectral dynamic range is the same (about 43 dB) as that of the periodogram method.

3.2. Spectral estimation performance of the HASE method

As mentioned earlier, the HASE method uses the magnitude spectrum of the windowed one-sided higher-lag autocorrelation sequence as the estimate of the signal's power spectrum. In order to investigate the spectral estimation performance of this method, we first use a synthetic voiced speech signal for analysis (we use a synthetic speech signal here because its power spectrum is known a priori and can be used as a reference spectrum; this helps in evaluating the performance of a given spectral estimation method). For generating this synthetic signal, we take a segment of a real /r/ sound and compute the LP coefficients using the autocorrelation method of LP analysis. Using the resulting all-pole (or AR) model, we filter a periodic impulse train (of period 5 ms) to produce a 5 s long synthetic /r/ sound. From the middle of this sound, we excise a 32 ms frame for analysis.
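A rough recipe for this kind of synthetic signal is sketched below. The /r/ segment itself is not available here, so a random placeholder stands in for it, and the LP order of 10 is our assumption; only the overall procedure (autocorrelation-method LP analysis, then exciting the all-pole model with a 5 ms impulse train) follows the text.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorr(frame, order=10):
    """LP coefficients by the autocorrelation method (solve R a = r)."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))               # A(z) = 1 - sum_k a_k z^-k

fs = 8000
rng = np.random.default_rng(0)
r_segment = rng.standard_normal(256)                 # placeholder for a real /r/ frame
A = lpc_autocorr(r_segment, order=10)

excitation = np.zeros(5 * fs)                        # 5 s of excitation
excitation[::int(0.005 * fs)] = 1.0                  # impulse train with a 5 ms period
synthetic = lfilter([1.0], A, excitation)            # all-pole (AR) synthesis
mid = len(synthetic) // 2
frame = synthetic[mid:mid + int(0.032 * fs)]         # excise a 32 ms frame
```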
Here, we are interested in answering two questions: (1) How does the shape of the window function used for weighting the causal higher-lag autocorrelation sequence affect the spectral estimation performance of the HASE method? and (2) How does the HASE method compare with the other spectral estimation methods used in the past to extract cepstral coefficient features for speech recognition? To answer these questions, we compare the spectral estimation performance of the following six methods: (1) the periodogram method with a Hamming window (as used in MFCC feature extraction (Davis and Mermelstein, 1980)), (2) the HASE method with the Hamming-windowed one-sided higher-lag autocorrelation sequence, (3) the HASE method with the DDR Hamming-windowed one-sided higher-lag autocorrelation sequence, (4) the autocorrelation method of LP analysis (Makhoul, 1975), (5) the SMC method of LP analysis (Mansour and Juang, 1989), and (6) the OSALPC method of LP analysis (Hernando and Nadeu, 1997). Among these six methods, methods 1 and 4 have been used in the past to extract the MFCC and LPCC features, respectively, while methods 5 and 6 are based on the windowed one-sided autocorrelation sequence (like our method).
We use the 32 ms long frame of the synthetic speech signal and compute its power spectrum using the six spectral estimation methods mentioned above. Fig. 11 shows the results. The dashed line in all the plots is the original power spectrum of the synthetic speech signal (obtained as the power response of the AR model used to generate the synthetic speech signal) and it serves as a reference for evaluating the performance of these six spectral estimators. Fig. 11(a) shows the spectral estimate made by the periodogram method (method 1). It can be seen from this figure that its envelope matches the original power spectrum of the synthetic signal very well. Therefore, it is not surprising that this method is used for spectral estimation to extract the (currently popular) MFCC features. Fig. 11(b) shows the spectral estimate made by our HASE method using the Hamming windowed one-sided higher-lag autocorrelation sequence (method 2). The original power spectrum of the synthetic signal shows four peaks, but the power spectral estimate by this method only reveals three. The reduction in spectral dynamic range that results from using a Hamming window to window the autocorrelation sequence is very apparent. The original spectrum shows a dynamic range of 45 dB. The dynamic range revealed by the Hamming-windowed method is approximately half of that value, which demonstrates the previously discussed shortcoming. Fig. 11(c) shows the spectral estimate made by our HASE method using the DDR Hamming windowed one-sided higher-lag autocorrelation sequence (method 3). Since the DDR Hamming window has a large dynamic range, the four peaks of the original spectrum are revealed in the spectral estimate made by this method. Furthermore, the agreement between the original spectrum and the envelope of the spectral estimate is excellent. This plot shows that a very good power spectral estimate of speech can be made using the one-sided autocorrelation sequence, if the window function is designed appropriately. Fig. 11(d) shows a power spectrum estimate computed from the AR parameters when the autocorrelation method of LP analysis is applied to the synthetic speech signal (method 4). This figure shows a good match between this estimate and the original power spectrum. This is the reason that the LPCC feature set (based on this method) has been so popular in the past. Fig. 11(e) shows a power spectrum estimate computed from the AR parameters when the SMC algorithm is applied to the synthetic speech signal
(method 5). This plot shows the low dynamic range of the SMC algorithm due to the Hamming window. It also shows that, out of the four peaks revealed in the AR model response, only two are shown in the SMC spectrum. Fig. 11(f) shows a power spectrum estimate computed from the AR parameters when the OSALPC algorithm is applied to the synthetic speech signal (method 6). Even though a Hamming window is used in the OSALPC algorithm, the dynamic range appears to be the same as that of the AR model response. This is due to the power spectrum estimate being squared during the algorithm, rather than the effect of a high-dynamic-range window. Three out of the four peaks are present in the OSALPC spectrum, which is better than the SMC spectrum, but not as good as the spectrum computed using the proposed algorithm.

In Fig. 11, we have seen the spectral estimation performance of the six methods for the clean (noise-free) synthetic speech signal. Now, we investigate their performance for the noisy synthetic speech signal. For this, we take the same 32 ms long frame of synthetic voiced speech and corrupt it by adding the following four noises to make its signal-to-noise ratio (SNR) equal to 10 dB: the artificial white Gaussian random noise, the artificial chirp noise, the artificial impulsive noise and the real car noise. The resulting spectral estimation performance of the six methods for these noise cases is shown in Figs. 12–15, respectively. By comparing these figures with Fig. 11, we can see that all six methods suffer in terms of their spectral estimation performance when applied to the signal corrupted by noise, but it is somewhat difficult to say from these figures which method suffers more and which less. To comment on their relative performance in the presence of noise, we compute the formant estimation performance (measured in terms of the number of well-estimated formants) of each of the six methods for this noisy synthetic speech frame for the four different noises (white, chirp, impulse and car) using Figs. 12–15. We call a formant well estimated if it matches one of the four formants present in the original power spectrum of the synthetic signal (determination of a well-estimated formant is somewhat subjective; we use this measure only to give a qualitative indication of the formant estimation performance of the different methods). The formant estimation performance of the six methods for each of the four noise conditions is summarised in Table 1.
Fig. 12. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic white random noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 13. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic chirp noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 14. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the synthetic impulsive noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Fig. 15. Comparison of spectral estimation methods using a 32 ms frame of synthetic voiced speech corrupted by the real car noise at 10 dB SNR. The dashed line in each plot is the original power spectrum of the synthetic signal. The solid line in each plot is the power spectral estimate using the respective method. (a) Periodogram method with Hamming window, (b) HASE method (using Hamming windowed higher-lag autocorrelation sequence), (c) HASE method (using DDR Hamming windowed higher-lag autocorrelation sequence), (d) autocorrelation method of LP analysis, (e) SMC method, and (f) OSALPC method.
Table 1
Formant estimation performance of six methods on a 32 ms frame of noisy synthetic speech (number of well-estimated formants)

Method                                      | White | Chirp | Impulse | Car
(1) Periodogram method – Hamming            |   3   |   3   |    3    |  3
(2) HASE method – Hamming                   |   3   |   3   |    3    |  3
(3) HASE method – DDR Hamming               |   3   |   4   |    3    |  3
(4) Autocorrelation method of LP analysis   |   2   |   2   |    2    |  2
(5) SMC method                              |   2   |   2   |    2    |  2
(6) OSALPC method                           |   3   |   3   |    3    |  3
From this table, it can be seen that the HASE method (using the DDR Hamming windowed higher-lag autocorrelation sequence) is the most successful in estimating the formants of this 32 ms frame of the synthetic speech signal in the presence of noise.

So far, we have investigated the spectral estimation performance of the HASE method for synthetic speech. Now, we investigate it for real speech. For this, we take a speech utterance from the Aurora database (MAL_19Z96Z8A). We compute the spectrogram of this real speech utterance (with a 10 ms frame shift and a 32 ms frame length) using the periodogram method and the HASE method (with the DDR Hamming-windowed higher-lag autocorrelation sequence). The spectrogram plots are shown in Fig. 16(a).
Fig. 16. Spectrograms of real speech (utterance 'MAL_19Z96Z8A' from Aurora database) using the periodogram method (upper plot in each subfigure) and HASE method with DDR Hamming windowed higher-lag autocorrelation sequence (lower plot in each subfigure) in various noise conditions. (a) Clean utterance. (b) Chirp noise. (c) Impulse noise and (d) car noise.
The top plot in this figure corresponds to the periodogram method and the bottom plot to the HASE method. To investigate the performance of these methods for noisy speech, we take this utterance (MAL_19Z96Z8A) and corrupt it by adding the following three noises to make its signal-to-noise ratio (SNR) equal to 10 dB: the artificial chirp noise, the artificial impulsive noise and the real car noise. The spectrograms of the noisy speech for these three noises using the two methods (periodogram and HASE) are shown in Fig. 16(b)–(d), respectively. The spectrograms for the clean condition (Fig. 16(a)) again confirm that the HASE method captures the formant structure as well as the periodogram method. In the HASE method, the extra windowing operation on the autocorrelation sequence results in a less pronounced (or smoother) harmonic structure than the periodogram method. In the chirp (Fig. 16(b)) and impulse (Fig. 16(c)) noise conditions, the HASE method shows substantially lower noise in the spectrogram than the periodogram method. In the car noise condition (Fig. 16(d)), the noise differences in the spectrograms from the two methods are more subtle.
4. AMFCC feature extraction method

In this section, we present a brief outline of our AMFCC feature extraction method and point out where and how it differs from the popular MFCC feature extraction method. The AMFCC feature extraction front-end shown in Fig. 17 uses the HASE method (with DDR Hamming-windowed higher-lag autocorrelation sequence) for spectral estimation. In this front-end, we first pre-emphasise the input speech signal using the pre-emphasis filter H(z) = 1 − 0.97 z^{-1}.
Fig. 17. Block diagram of AMFCC feature extraction algorithm.
Fig. 18. Block diagram of MFCC feature extraction algorithm.
In order to carry out the short-time analysis of the pre-emphasised speech signal, we perform frame blocking with a frame size of 32 ms and a frame shift of 10 ms, and the signal is analysed sequentially in a frame-wise manner. The Hamming window is applied to the pre-emphasised signal of a given frame, and a biased autocorrelation estimator is applied to the windowed signal to get the 32 ms long autocorrelation sequence (this is done in a computationally efficient manner using the FFT algorithm). We discard the autocorrelation coefficients for time lags less than 2 ms and apply the DDR window to the remaining (2–32 ms) higher-lag autocorrelation sequence. We compute the magnitude spectrum of this windowed autocorrelation sequence and use it as the power spectral estimate of the signal. This spectrum is then processed by the Mel filter bank, logarithm and DCT operations to get 12 AMFCCs. We also compute the log energy of the windowed speech signal and append it to the 12 AMFCCs to get the 13 base features. We compute the first and the second derivatives (delta and delta–delta) of the time sequence of each base feature. These derivatives are concatenated to the base feature set to get the final AMFCC feature set (having 39 features).

Since the MFCCs are currently the most popular features used for speech recognition, we show the block diagram for the MFCC feature extraction front-end in Fig. 18 and point out where and how these two front-ends differ. The two front-ends differ only in the procedure used for estimating the power spectrum of the speech signal. In the MFCC case, it is obtained by using the periodogram estimate computed from the Hamming-windowed speech signal. In the AMFCC case, it is obtained by computing the magnitude spectrum of the DDR Hamming-windowed higher-lag autocorrelation sequence derived from the Hamming-windowed speech signal. This difference can be seen by noting the operations carried out by the two methods between the 'windowing' and 'Mel filter bank' blocks in Figs. 17 and 18, respectively.
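As a concrete illustration of the front-end just described, the sketch below computes the 12 base AMFCCs for a single pre-emphasised 32 ms frame. It is our own sketch, not the authors' code: librosa's Mel filter bank is borrowed for brevity, a plain one-sided Hamming taper stands in for the paper's DDR Hamming lag window, and the FFT length and log floor are arbitrary choices.

```python
import numpy as np
import librosa  # used only for its Mel filter bank
from scipy.fftpack import dct

def amfcc_frame(frame, sr=8000, low_lag_ms=2.0, n_mels=23, n_ceps=12, n_fft=512):
    """Return 12 base AMFCCs for one pre-emphasised frame (256 samples = 32 ms at 8 kHz)."""
    n = len(frame)
    w = frame * np.hamming(n)

    # Biased autocorrelation estimate for lags 0..n-1, computed via the FFT.
    spec = np.fft.rfft(w, 2 * n)
    r = np.fft.irfft(np.abs(spec) ** 2)[:n] / n

    # Discard lags below 2 ms and taper the remaining higher-lag sequence
    # (a one-sided Hamming taper is used here in place of the DDR Hamming window).
    k0 = int(round(low_lag_ms * 1e-3 * sr))             # 16 samples at 8 kHz
    r_high = r[k0:] * np.hamming(2 * (n - k0))[n - k0:]

    # The magnitude spectrum of the windowed higher-lag autocorrelation sequence
    # serves as the power spectral estimate of the frame (the HASE estimate).
    pspec = np.abs(np.fft.rfft(r_high, n_fft))

    # Mel filter bank, logarithm and DCT, exactly as in the MFCC procedure.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=0.0, fmax=sr / 2)
    log_mel = np.log(mel_fb @ pspec + 1e-10)
    return dct(log_mel, type=2, norm='ortho')[1:n_ceps + 1]
```

The frame log energy and the delta and delta–delta streams would then be appended per frame to build the 39-dimensional feature vector.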
5. Recognition experiments and results

In this section, we evaluate the recognition performance of the AMFCC features for noisy speech at different SNRs using the Aurora-2 and the resource management (RM) databases. In order to get the noisy speech utterances, we corrupt the
speech signal by adding different types of artificial and real noises. In our recognition experiments, we use clean speech utterances for training the recogniser and the noisy speech utterances at different SNRs to evaluate its performance. The SNR of an utterance is defined here as the ratio of the energies of the clean signal and the noise computed over the whole utterance. Here, we first describe the databases and the experimental setup used in our recognition experiments. We then describe the results where we compare the recognition performance of the AMFCC features with that of the MFCC features first, and then with that of the LPCC, SMC and OSALPC features. The parameters used for implementing these (MFCC, AMFCC, LPCC, SMC and OSALPC) feature extraction methods are listed in Table 2.

5.1. Databases and experimental setup

5.1.1. Aurora-2 task
The Aurora-2 task (Pearce and Hirsch, 2000) was originally developed for evaluating the speech recognition performance of different feature extraction methods under noisy environments and is highly suited in the current context where we are investigating the AMFCC method for robust feature extraction. This task provides speech utterances and scripts to perform speaker-independent connected-digit recognition experiments in clean and noisy conditions. The utterances in the Aurora-2 database consist of digit sequences (e.g., sil–one–four–seven–three–five–three–three–sil) that have been derived from the TI-digit database, down-sampled to 8 kHz and then filtered by a G.712 characteristic filter. The Aurora-2 scripts provide for two types of training conditions: clean-condition training and multi-condition training. In our experiments, we use the clean-condition training situation and train our recogniser with the clean training set (having 8440 utterances). We use hidden Markov models (HMMs) to model the digits and pauses using the same topology as prescribed in the scripts. The test data in the Aurora-2 corpus is divided into three sets, 'a', 'b' and 'c'. In our experiments, we use only the test set 'a' (28 028 utterances) for testing the recogniser. This test set consists of speech corrupted by four different real-life noises (subway, babble, car and exhibition) at seven SNR levels (clean, then 20 dB to −5 dB in 5 dB steps).
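The whole-utterance SNR definition above amounts to applying a single gain to the noise before the two signals are added. A minimal sketch of that scaling (the function name and defaults are ours, not part of the Aurora scripts):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the whole-utterance SNR equals `snr_db`,
    with SNR taken as the ratio of clean-signal energy to noise energy."""
    noise = noise[:len(clean)]
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```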
Table 2
Parameters used in the MFCC, AMFCC, LPCC, SMC and OSALPC methods

Parameter                    MFCC      AMFCC         LPCC      SMC         OSALPC
Sampling freq. (kHz)         8         8             8         8           8
Frame shift (ms)             10        10            10        10          10
Frame size (ms)              32        32            32        32          32
Preemphasis coeff.           0.97      0.97          0.97      0.97        0.97
No. of cepstral coeff.       12        12            12        12          12
Feature vector size          39        39            39        39          39
Mel filter range (kHz)       0–4       0–4           –         –           –
No. of Mel filters           23        23            –         –           –
LP order                     –         –             12        12          12
Signal window                Hamming   Hamming       Hamming   Rect.       Rect.
Autocorr. estimator          –         Biased        Biased    Coherence   Coherence
Autocorr. window             –         DDR Hamming   Rect.     Hamming     Hamming
Autocorr. lag range (ms)     –         2–32          0–1.5     0–16 (a)    0–16 (a)

(a) The zeroth lag is set to zero in the SMC and OSALPC methods.
We also test our recogniser on speech utterances corrupted by the following three types of artificial noises: the white random noise, the chirp noise and the impulsive noise. The artificial white noise is obtained through a Gaussian random number generator. The artificial chirp noise used in this paper is periodic with a period of 32 ms. One period of this noise is generated as a sinusoidal signal whose frequency changes linearly from zero to half of the sampling frequency over the period. To create the artificial impulsive noise, we begin with a 32 ms block of zeros. To this block, we add a unit pulse of 2 ms duration. The starting position of the 2 ms pulse is randomly selected between 0 and 30 ms using a uniform random number generator. We then concatenate this block with another 32 ms block that contains only zeros. These two steps are then repeated, but this time the sign of the 2 ms pulse is reversed so that the overall mean stays at zero. These four steps are then repeated continuously to get a sufficiently long sequence of the impulsive noise. Thus, in this noise, the separation between successive pulses varies randomly between 32 and 92 ms.
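The two artificial noises described above can be generated along the following lines; this is our own sketch of the stated recipe (sampling rate, function names and the random generator are assumptions), not the authors' code.

```python
import numpy as np

def chirp_noise(n_samples, sr=8000, period_ms=32):
    """Periodic chirp: each 32 ms period sweeps linearly from 0 Hz to sr/2."""
    period_s = period_ms / 1000.0
    p = int(sr * period_s)                              # samples per period
    t = np.arange(p) / sr
    # Phase for a linear sweep from 0 to sr/2 over one period.
    one_period = np.sin(2 * np.pi * (0.5 * (sr / 2.0) / period_s) * t ** 2)
    return np.tile(one_period, int(np.ceil(n_samples / p)))[:n_samples]

def impulsive_noise(n_samples, sr=8000, rng=None):
    """Alternating-sign 2 ms pulses: one pulse per 64 ms, starting at a random
    position between 0 and 30 ms within its 32 ms block."""
    rng = np.random.default_rng() if rng is None else rng
    blk, pulse = int(0.032 * sr), int(0.002 * sr)
    blocks, sign, total = [], 1.0, 0
    while total < n_samples:
        b = np.zeros(2 * blk)                           # 32 ms pulse block + 32 ms of zeros
        start = rng.integers(0, int(0.030 * sr) + 1)    # pulse start between 0 and 30 ms
        b[start:start + pulse] = sign
        sign = -sign                                    # reverse the sign to keep a zero mean
        blocks.append(b)
        total += len(b)
    return np.concatenate(blocks)[:n_samples]
```

Within a single 32 ms analysis frame the impulsive noise contains at most one 2 ms pulse, which is what keeps its autocorrelation confined to the low lags that the AMFCC method discards.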
5.1.2. Resource management (RM) task
This task uses the RM database (Price et al., 1988) for carrying out the speaker-independent continuous speech recognition experiments. The vocabulary of the RM task is 991 words. We use here the speaker-independent part of the RM database. We down-sample the speech utterances from 16 kHz to 8 kHz. The training set consists of 3990 utterances spoken by 109 speakers. This is used to train a set of six-mixture, tied-state, cross-word triphone HMMs in accordance with the RM scripts, which are supplied with the HTK (HMM tool kit) distribution (HTK, Hidden Markov model tool kit, 2005). For testing the recognition system, we use the February-89 test set, which consists of 300 utterances spoken by 10 speakers. These utterances are corrupted by the same (artificial and real) noises as used in the Aurora-2 task (see the preceding subsection) to get the noisy utterances at SNRs of clean and 30 dB down to 5 dB in 5 dB steps.
5.2. Results: AMFCC versus MFCC

We evaluate the recognition performance of the AMFCC feature extraction method for speech corrupted by the artificial as well as the real-life noises and compare it with that of the MFCC method. The word recognition accuracy results are listed in Tables 3 and 4, respectively, for the Aurora-2 task. The corresponding results for the RM task are provided in Tables 5 and 6. In these tables, the first column shows the SNRs of noisy speech, and the second and the third column groups show the results for the artificial and the real-life noises. In the artificial noise group, the first sub-column lists the word recognition accuracies for the white random noise, the second sub-column for the chirp noise and the third sub-column for the impulsive noise. The mean recognition accuracies for the artificial noise (obtained by averaging the corresponding word recognition accuracies for the three noises) are shown
Table 3
AMFCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    99.11   99.11   99.11     99.11      99.11    99.09    98.99   99.17    99.09
20 dB    93.64   98.46   99.14     97.08      97.27    96.64    97.97   95.93    96.95
15 dB    87.35   98.00   99.02     94.79      93.89    89.48    96.42   92.63    93.11
10 dB    74.24   94.50   99.02     89.25      84.49    69.71    88.64   81.98    81.20
5 dB     54.10   87.29   98.86     80.08      64.66    36.85    62.63   60.32    56.11
0 dB     29.54   80.69   98.50     69.58      38.47    12.00    25.26   30.76    26.62
−5 dB    12.10   74.03   96.44     60.86      17.65    2.93     11.42   12.93    11.23
Table 4
MFCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    99.08   99.08   99.08     99.08      99.08    99.00    98.96   99.29    99.08
20 dB    94.26   96.75   98.83     96.61      97.45    96.43    97.64   96.24    96.94
15 dB    86.06   93.58   98.68     92.77      94.14    87.48    96.45   92.90    92.74
10 dB    69.08   84.62   97.97     83.89      83.45    62.94    86.58   79.45    78.10
5 dB     42.83   64.51   95.46     67.60      59.32    30.08    53.27   53.41    49.02
0 dB     14.98   36.57   86.64     46.06      31.84    10.91    17.72   23.45    20.98
−5 dB    7.98    14.61   60.79     27.79      13.26    5.05     8.98    11.23    9.63
Table 5
AMFCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    93.44   93.44   93.44     93.44      93.44    93.44    93.44   93.44    93.44
30 dB    88.93   92.19   93.40     91.51      91.45    92.62    92.54   91.76    92.09
25 dB    84.06   90.67   93.36     89.36      89.89    90.98    91.76   89.30    90.48
20 dB    73.37   89.14   92.66     85.06      83.75    85.90    89.61   84.30    85.89
15 dB    53.60   86.10   92.58     77.43      69.42    74.74    82.82   71.11    74.52
10 dB    26.88   80.63   92.15     66.55      41.70    51.07    64.86   47.52    51.29
5 dB     8.37    70.97   90.12     56.49      20.35    20.39    34.80   20.15    23.92
Table 6
MFCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    93.48   93.48   93.48     93.48      93.48    93.48    93.48   93.48    93.48
30 dB    89.96   91.64   93.28     91.63      92.07    92.35    92.46   91.92    92.20
25 dB    84.86   90.20   92.74     89.27      90.16    91.37    91.68   90.55    90.94
20 dB    75.82   86.26   91.76     84.61      83.79    86.37    89.07   85.24    86.12
15 dB    53.72   75.63   89.61     72.99      68.29    75.36    81.92   73.37    74.73
10 dB    28.40   53.82   82.31     54.84      42.75    49.28    60.91   50.22    50.79
5 dB     12.19   31.01   67.57     36.92      23.85    23.73    31.93   21.98    25.37
in the fourth sub-column. Similarly, in the real noise column, we show the word recognition accuracies for the subway, babble, car and exhibition noises
in the first, second, third and fourth sub-columns, respectively, and their corresponding means are listed in the fifth sub-column. In order to make
the comparison between the MFCC and AMFCC features easier, we plot the mean recognition accuracies for the artificial and the real noises on the Aurora-2 and the RM tasks as a function of SNR in Fig. 19. We can make the following two observations from Tables 3–6 and Fig. 19: (1) When tested on the clean speech utterances, the AMFCC method does as well as the MFCC method in terms of the recognition performance. This shows that the higher-lag autocorrelation coefficients used for spectral estimation in the AMFCC method capture the power spectral envelope of speech to the same
extent as done by the periodogram estimate used in the MFCC method. (2) For noisy speech utterances, the AMFCC method provides better recognition results than the MFCC method. This shows the usefulness of the AMFCC method for robust feature extraction.

As mentioned earlier, we want the robustness and the additivity assumptions to be true for successful functioning of the AMFCC method. These assumptions are not strictly valid for the short-time analysis carried out in the AMFCC feature extraction. Here, we make some comments about the effect of short-time analysis on the validity of these
Fig. 19. Mean recognition accuracy of the AMFCC and the MFCC features as a function of SNR. (a) Aurora-2 database speech tested with three artificial noises: white random, chirp and impulsive. (b) Resource management database speech tested with the same three artificial noises. (c) Aurora-2 database speech tested with four real-life noises: subway, babble, car and exhibition. (d) Resource management database speech tested with the same four real-life noises.
assumptions using the AMFCC results shown in Tables 3 and 5 for the artificial noises. The robustness assumption requires the noise autocorrelation coefficients to be zero for time lags greater than 2 ms for the successful operation of the AMFCC method (as implemented here). As seen from Figs. 3–5, the validity of this assumption for short-time analysis varies across noise conditions; it is perfectly valid for the impulsive noise, less valid for the chirp noise and least valid for the white noise. Its impact can be clearly seen in Tables 3 and 5, where the speech recognition performance of the AMFCC method deteriorates least for the impulsive noise, more for the chirp noise and most for the white noise. Also, for the impulsive noise case (where the robustness assumption is perfectly satisfied), the degradation in performance occurs due to the additivity assumption alone, and this degradation is comparatively mild. We use an 'oracle' experiment to demonstrate the effect of the additivity assumption on the speech recognition performance of the AMFCC method in Appendix A. There we show that the speech recognition performance does not degrade much due to the deviation from the additivity assumption resulting from the short-time analysis. In other words, the additivity assumption is
quite valid for the short-time analysis, and the speech recognition performance of the AMFCC method at a given SNR is mainly dictated by how well the robustness property holds for the noise corrupting the speech signal.

5.3. Results: AMFCC versus LPCC, SMC and OSALPC

In this subsection, we compare the recognition performance of the AMFCC method with that of the three LP-based methods (namely, the LPCC, SMC and OSALPC methods) of feature extraction. We list the word recognition accuracy results of the three LP-based methods in Tables 7–9, respectively, for the Aurora-2 task, and in Tables 10–12, respectively, for the RM task. We also show the mean recognition accuracies of these methods as a function of SNR in Fig. 20 for the artificial and the real noises on the Aurora and the RM tasks. By comparing this figure and these tables with the corresponding figure and tables shown in the preceding subsection for the AMFCC features, we can make the following two observations: (1) In terms of its recognition performance on the clean speech utterances, the AMFCC method does better
Table 7
LPCC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.71   98.71   98.71     98.71      98.71    98.94    98.84   98.86    98.84
20 dB    84.56   95.64   98.93     93.04      96.44    96.52    97.55   95.25    96.44
15 dB    64.48   90.70   98.71     84.63      91.93    88.12    95.29   88.28    90.91
10 dB    44.77   77.86   98.31     73.65      76.14    61.94    79.93   69.24    71.81
5 dB     29.78   56.00   97.54     61.11      48.33    23.04    41.16   39.86    38.10
0 dB     10.93   27.17   95.09     44.40      27.02    −0.51    11.63   15.46    13.40
−5 dB    7.77    11.39   89.01     36.06      15.35    −2.72    7.84    8.49     7.24
Table 8
SMC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.62   98.62   98.62     98.62      98.62    98.55    98.00   98.52    98.42
20 dB    81.42   95.52   98.56     91.83      95.61    89.21    97.02   95.25    94.27
15 dB    65.40   88.24   98.56     84.07      90.33    72.31    91.23   90.96    86.21
10 dB    47.28   75.04   98.56     73.63      71.60    47.10    69.49   78.96    66.79
5 dB     28.80   55.82   98.28     60.97      39.42    25.45    41.69   51.62    39.55
0 dB     19.28   33.50   96.90     49.89      18.11    4.99     19.53   25.42    17.01
−5 dB    10.01   14.74   92.05     38.93      9.61     −4.75    10.65   12.03    6.88
Table 9
OSALPC + E + D + DD features: Recognition performance on the Aurora-2 task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    98.37   98.37   98.37     98.37      98.37    98.43    97.88   98.49    98.29
20 dB    83.57   97.42   98.56     93.18      94.96    90.36    96.72   95.37    94.35
15 dB    66.01   90.24   98.53     84.93      87.10    75.33    92.51   91.05    86.50
10 dB    49.28   67.82   98.37     71.82      66.50    50.00    74.14   78.00    67.16
5 dB     33.34   42.40   97.88     57.87      36.01    26.12    46.56   51.93    40.16
0 dB     20.60   17.65   95.86     44.70      16.12    6.50     23.50   25.46    17.90
−5 dB    10.59   6.88    86.15     34.54      6.66     −5.77    12.14   12.10    6.28
Table 10
LPCC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    91.57   91.57   91.57     91.57      91.57    91.57    91.57   91.57    91.57
30 dB    87.06   89.57   91.57     89.40      89.93    90.98    90.86   90.16    90.48
25 dB    81.25   87.62   91.25     86.71      88.01    90.39    89.85   87.97    89.06
20 dB    69.06   82.82   90.55     80.81      80.81    87.54    88.32   82.82    84.87
15 dB    41.77   68.60   88.48     66.28      63.92    77.12    83.09   68.10    73.06
10 dB    18.28   42.57   86.39     49.08      40.23    54.47    67.04   44.44    51.55
5 dB     6.98    17.39   80.04     34.80      23.22    27.61    41.43   18.81    27.77
Table 11
SMC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    92.03   92.03   92.03     92.03      92.03    92.03    92.03   92.03    92.03
30 dB    85.64   89.11   91.76     88.84      89.46    90.12    90.63   89.22    89.86
25 dB    80.15   87.39   91.76     86.43      86.92    88.60    90.04   88.44    88.50
20 dB    71.24   82.47   91.60     81.77      82.35    85.20    88.29   85.16    85.25
15 dB    48.53   76.57   90.82     71.97      68.67    74.85    83.60   73.34    75.11
10 dB    25.64   63.06   89.57     59.42      43.54    52.83    74.78   52.21    55.84
5 dB     8.00    36.78   88.11     44.30      18.37    21.98    53.34   25.77    29.86
Table 12
OSALPC + E + D + DD features: Recognition performance on the resource management task

         Artificial noise                      Real noise
SNR      White   Chirp   Impulse   Mean       Subway   Babble   Car     Exhib.   Mean
Clean    91.92   91.92   91.92     91.92      91.92    91.92    91.92   91.92    91.92
30 dB    86.27   89.61   91.49     89.12      90.20    90.39    90.94   90.28    90.45
25 dB    79.19   87.82   91.14     86.05      87.90    88.44    90.12   87.54    88.50
20 dB    69.63   82.47   90.72     80.94      80.42    84.54    88.25   83.80    84.25
15 dB    49.47   73.45   89.89     70.94      68.12    72.98    83.44   73.02    74.39
10 dB    25.04   53.45   88.47     55.65      43.91    48.73    72.32   51.27    54.06
5 dB     8.00    24.56   85.30     39.29      21.94    18.40    48.26   22.34    27.73
than the three LP-based methods. (2) For the noisy speech utterances, the recognition performance of
the AMFCC method is better than that of the three LP-based methods for most of the cases.
Fig. 20. Mean recognition accuracy of the AMFCC, the LPCC, the SMC and the OSALPC features as a function of SNR. (a) Aurora-2 database speech tested with three artificial noises: white random, chirp and impulsive. (b) Resource management database speech tested with the same three artificial noises. (c) Aurora-2 database speech tested with four real-life noises: subway, babble, car and exhibition. (d) Resource management database speech tested with the same four real-life noises.
6. Conclusions

In this paper, we have proposed the AMFCC method of feature extraction for robust speech recognition. This method uses the windowed (one-sided) higher-lag autocorrelation sequence for estimating the power spectrum of the speech signal. In order to capture the spectral envelope information of speech, the window applied to the autocorrelation sequence has to have a dynamic range twice that of the Hamming window. We have evaluated the speech recognition performance of the AMFCC features on the Aurora and the RM databases and
shown that they perform as well as the MFCC features for clean speech and their recognition performance is better than the MFCC features for noisy speech. Finally, we have shown that the AMFCC features perform better than the features derived from the LP-based methods for both the clean and noisy speech utterances.

Appendix A. Effect of additivity assumption on speech recognition performance

If the clean speech signal s(n) and the noise signal d(n) in Eq. (1) are correlated, the autocorrelation
function of the observed (noisy) signal x(n) is given by

r_xx(n) = r_ss(n) + r_dd(n) + r_sd(n) + r_ds(n),   (4)

where the last two terms, r_sd(n) and r_ds(n), are the cross-correlation functions of s(n) and d(n). When s(n) and d(n) are uncorrelated, these cross-correlation terms in Eq. (4) become zero. As a result, Eq. (4) reduces to Eq. (2), which defines the additivity property necessary for successful functioning of the AMFCC method of feature extraction. As mentioned earlier, this property is strictly valid only asymptotically (i.e., when the signal x(n) available for analysis is infinitely long). In practice, we perform short-time analysis of x(n) for AMFCC feature extraction. Under the short-time analysis constraint, the cross-terms in Eq. (4) are not zero even when the signals s(n) and d(n) are uncorrelated. As a result, the additivity property (Eq. (2)) is not valid for the short-time analysis.

In this appendix, we investigate empirically to what extent this deviation from the additivity assumption due to short-time analysis affects the speech recognition performance of the AMFCC method. For this, we carry out an 'oracle' experiment where we assume that we can access the clean speech signal s(n) as well as the noise signal d(n) individually. This allows us to modify the AMFCC method so that we can enforce the additivity assumption for short-time analysis. For this, we perform short-time analysis of s(n) and d(n) separately and compute their short-time autocorrelation sequences r_ss(n) and r_dd(n). These two sequences are then added according to Eq. (2) to obtain the short-time autocorrelation sequence r_xx(n). As a result, the modified AMFCC method (denoted here by the AMFCC-O method) ensures the validity of the additivity assumption for short-time analysis. We show a composite block diagram of the AMFCC and the AMFCC-O methods in Fig. 21. In this figure, we have highlighted two paths by dotted lines and labelled them "ALL CROSS TERMS" and "ZERO CROSS TERMS", respectively. The top path ("ALL CROSS TERMS") is used for the AMFCC method and the bottom path ("ZERO CROSS TERMS") for the AMFCC-O method. Using features computed by the AMFCC-O method and the AMFCC method, we carry out speech recognition experiments on the Aurora database and compare the speech recognition performance of the two methods. Since the AMFCC-O method requires the noise signal as an input, the pre-mixed noisy utterances in the Aurora database cannot be used in these experiments. Instead, we take the clean speech from the subway case and mix it with the subway, babble, car and exhibition noises provided in the Aurora database. The noise signal levels are adjusted to produce noisy speech utterances with global SNRs equal to the seven default test SNRs found in the Aurora database: clean, 20, 15, 10, 5, 0 and −5 dB.
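A minimal sketch of the difference between the two paths (our own illustration; pre-emphasis, frame blocking and windowing are assumed to have been applied to both inputs beforehand): the oracle path replaces the autocorrelation of the noisy frame by the sum of the separately computed clean-speech and noise autocorrelations, so the cross-terms in Eq. (4) are zero by construction.

```python
import numpy as np

def biased_autocorr(x):
    """Biased autocorrelation estimate for lags 0..len(x)-1, computed via the FFT."""
    n = len(x)
    spec = np.fft.rfft(x, 2 * n)
    return np.fft.irfft(np.abs(spec) ** 2)[:n] / n

def noisy_autocorr_amfcc(clean_frame, noise_frame):
    """AMFCC path: autocorrelation of the observed frame s + d (all cross-terms present)."""
    return biased_autocorr(clean_frame + noise_frame)

def noisy_autocorr_amfcc_o(clean_frame, noise_frame):
    """AMFCC-O path: r_ss + r_dd added directly, enforcing the additivity assumption."""
    return biased_autocorr(clean_frame) + biased_autocorr(noise_frame)
```

The rest of the AMFCC-O pipeline (lag trimming, lag windowing, magnitude spectrum, Mel filter bank, logarithm and DCT) is unchanged.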
Fig. 21. Composite block diagram showing AMFCC and AMFCC-O feature extraction methods. The path labelled ‘‘ALL CROSS TERMS’’ shows the typical implementation of the AMFCC method. The path labelled ‘‘ZERO CROSS TERMS’’ shows the AMFCC-O method, which eliminates the cross-correlation terms from the short-time autocorrelation sequence.
Fig. 22. Recognition accuracy results for AMFCC-O, AMFCC and MFCC features for four Aurora noises.
In the speech recognition experiments, the base feature set consists of 12 cepstral coefficients (it does not include the logarithmic energy or the zeroth cepstral coefficient). A 36-dimensional feature vector is formed by concatenating the delta and acceleration coefficients to the base feature set. The results of the experiments are presented in Fig. 22. For reference, we also plot the results for the MFCC features. We can see from this figure that the AMFCC-O method provides very little improvement in speech recognition performance over the AMFCC method for some of the (lower) SNRs; the two methods are comparable at the other SNRs. These results show that the additivity assumption used in the AMFCC
method is quite good even for the short-time analysis.

References

Anstey, N.A., 1966. Correlation techniques. Can. J. Explor. Geophys. 2 (1), 55–82.
Bellegarda, J.R., 1997. Statistical techniques for robust ASR: Review and perspectives. Proc. Eurospeech, KN33–KN36.
Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and recombination of partial frequency bands. Proc. ICSLP, 426–429.
Cadzow, J.A., 1982. Spectral estimation: An overdetermined rational model equation approach. Proc. IEEE 70 (Sep.), 907–939.
Chan, Y.T., Langford, R.P., 1982. Spectral estimation via the high-order Yule–Walker equations. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 30 (5), 689–698.
Cooke, M.P., Morris, A., Green, P.D., 1997. Missing data techniques for robust speech recognition. Proc. ICASSP, 863–866.
Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 28 (4), 357–365.
Gales, M., Woodland, P., 1996. Mean and variance adaptation within the MLLR framework. Comput. Speech Language 10, 249–264.
Gersch, W., 1970. Estimation of the autoregressive parameters of mixed autoregressive moving-average time series. IEEE Trans. AC 5 (October), 583–588.
Ghitza, O., 1986. Auditory nerve representation as a front-end for speech recognition in a noisy environment. Comput. Speech Language 1, 109–130.
Gong, Y., 1995. Speech recognition in noisy environments: A survey. Speech Commun. 16 (3), 261–291.
Harris, F.J., 1978. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66 (1), 51–83.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2 (4), 578–586.
Hermus, K., Wambacq, P., 2004. Assessment of signal subspace based speech enhancement for noise robust speech recognition. Proc. ICASSP, 945–948.
Hernando, J., Nadeu, C., 1997. Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition. IEEE Trans. Speech Audio Process. 5 (1), 80–84.
HTK, Hidden Markov model tool kit, available from: , 2005.
Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
Juang, B.H., 1991. Speech recognition in adverse environments. Comput. Speech Language 5, 275–294.
Kay, S.M., 1979. The effects of noise on the autoregressive spectral estimator. IEEE Trans. Acoust. Speech Signal Process. (ASSP) ASSP-27 (5), 478–485.
Kay, S., 1988. Modern Spectral Analysis. Prentice Hall.
Kim, H.G., Schwab, M., Moreau, N., Sikora, T., 2003. Enhancement of noisy speech for noise robust front-end and speech reconstruction at back-end of DSR system. Proc. Eurospeech, 545–548.
Lee, C.H., 1998. On stochastic features and model compensation approaches to robust speech recognition. Speech Commun. 25, 29–48.
Lippmann, R.P., 1997. Speech recognition by machines and humans. Speech Commun. 22, 1–16.
Makhoul, J., 1975. Linear prediction: A tutorial review. Proc. IEEE 63 (April), 561–580.
Mansour, D., Juang, B.H., 1989. The short-time modified coherence representation and noisy speech recognition. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 37 (6), 795–804.
McGinn, D.P., Johnson, D.H., 1989. Estimation of all-pole model parameters from noise-corrupted sequences. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 37 (3), 433–436.
Paliwal, K.K., 1986a. A noise-compensated long correlation matching method for AR spectral estimation of noisy signals. Proc. ICASSP, 1369–1372.
Paliwal, K.K., 1986b. A constrained forward–backward correlation prediction method for AR spectral estimation of noisy signals. Proc. EUSIPCO, 295–298.
Paliwal, K.K., 1986c. Robust LP analysis method based on pitch information for noisy speech. Proc. EUSIPCO, 593–596.
Paliwal, K.K., Sagisaka, Y., 1997. Cyclic autocorrelation-based linear prediction analysis of speech. Proc. Eurospeech, 279–282.
Paliwal, K.K., Sondhi, M.M., 1991. Recognition of noisy speech using cumulant-based linear prediction analysis. Proc. ICASSP, 429–432.
Pearce, D., Hirsch, H., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. Proc. ICSLP 4 (Oct.), 29–32.
Price, P., Fisher, W., Bernstein, J., Pallett, D., 1988. The DARPA 1000-word resource management database for continuous speech recognition. Proc. ICASSP, 651–654.
Rabiner, L., Juang, B., 1993. Fundamentals of Speech Recognition. Prentice Hall.
Raj, B., Seltzer, M.L., Stern, R.M., 2004. Reconstruction of missing features for robust speech recognition. Speech Commun. 43, 275–296.
Shannon, B.J., Paliwal, K.K., 2004. MFCC computation from magnitude spectrum of higher lag autocorrelation coefficients for robust speech recognition. Proc. ICSLP.
Shannon, B.J., Paliwal, K., 2005. Influence of autocorrelation lag ranges on robust speech recognition. Proc. ICASSP 2 (March).
Stern, R.M., Acero, A., Liu, F.H., Ohshima, Y., 1996. Signal processing for robust speech recognition. In: Lee, C., Soong, F., Paliwal, K.K. (Eds.), Automatic Speech Recognition: Advanced Topics. Kluwer Academic Publishers, Boston, pp. 357–384.
Tibrewala, S., Hermansky, H., 1997. Sub-band based recognition of noisy speech. Proc. ICASSP, 1255–1258.