
AUDITORY FEATURES BASED ON GAMMATONE FILTERS FOR ROBUST SPEECH RECOGNITION

Jun Qi (1), Dong Wang (2), Yi Jiang (1), Runsheng Liu (1)
1. Department of Electronic Engineering, Tsinghua University, Beijing, China
2. Center of Speech and Language Technology, Tsinghua University, Beijing, China

ABSTRACT

A major challenge for automatic speech recognition (ASR) is the significant performance reduction in noisy environments. Recent research has shown that auditory features based on Gammatone filters are promising for improving the robustness of ASR systems against noise, though the research is far from extensive and the generalizability of the new features is unknown. This paper presents our implementation of the Gammatone filter-based feature and experimental results on Mandarin speech data. Through careful design, we obtained significant performance gains with the new feature in various noise conditions when compared with the widely used MFCC and PLP features. A particular novelty of our implementation is that the filter design is purely in the time domain. This means that the channel signals are obtained with a set of Gammatone filters applied directly to the speech signals in the time domain, which is totally different from the commonly adopted frequency-domain design that first converts the signals to spectra and then applies the filter banks to them. The time-domain implementation on the one hand avoids the approximation introduced by short-time spectral analysis and hence is more precise; on the other hand, it avoids the complex spectral computation and hence simplifies hardware realization.

Index Terms: Gammatone filters, feature extraction, robust speech recognition

1. INTRODUCTION

A major challenge for automatic speech recognition (ASR) is the significant performance reduction in noisy environments. Plenty of research has focused on robust speech recognition to mitigate this problem. Well-known approaches include channel normalization, signal enhancement, model adaptation and audio-visual recognition, among others. In this paper we are most interested in noise-robust feature extraction, particularly auditory features. Generally speaking, auditory features leverage the characteristics of the human auditory system, i.e., different forms of response to different frequency components of signals. This characteristic can be accounted for with appropriately distributed filterbanks, leading to the widely used Mel frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features. A more general approach is based on Gammatone filters (GF), which involve a dedicated design of mathematical forms of frequency response that match physiological experimental results. Similar to MFCC, this feature is usually referred to as Gammatone frequency cepstral coefficients (GFCC) [1, 2, 3]. Some attempts have been made to utilize GFCC in ASR, for instance [4, 5, 2], though the investigation is far from being extensive [4] and the reported improvements are insignificant [5].

The results presented in [2] seem quite promising, but the gains could not be well generalized to other data, as shown in Section 4. We provide a new implementation of GFCC in this paper, which shows consistent and significant performance gains under various noise types and levels. A particular novelty of our implementation is that the Gammatone filters are realized purely in the time domain. Specifically, the filters are applied directly to the time series of speech signals by simple operations such as delay, summation and multiplication. This is quite different from the widely adopted frequency-domain design, for instance [4, 2], where signals are first transformed to frequency spectra and the Gammatone filters are then applied to them. The time-domain implementation avoids the unnecessary approximation introduced by short-time spectral analysis, and saves a considerable proportion of the computation involved in the FFT.

This paper is organized as follows: the time-domain GF design is presented in Section 2; the processing steps of our GFCC implementation are described in Section 3. The experiments are presented in Section 4, followed by the conclusions drawn in Section 5.

2. TIME-DOMAIN GAMMATONE FILTERS

A Gammatone filter is formally represented by its impulse response in the time domain:

g(t) = a t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi)    (1)

where f_c is the central frequency of the filter and \phi is the phase, which is usually set to 0. The constant a controls the gain, and n is the order of the filter, which is usually set to be equal to or less than 4 [6]. Finally, b is the decay factor, which is related to f_c and is given by:

b = 1.019 \times 24.7 \times (4.37 f_c / 1000 + 1).    (2)
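To make Eqs. (1) and (2) concrete, the following minimal sketch evaluates the impulse response of a single filter. It is written in Python/NumPy, which the paper does not prescribe; the sampling rate, duration and function name are illustrative choices of ours.

import numpy as np

def gammatone_impulse_response(fc, fs=16000, n=4, a=1.0, phi=0.0, duration=0.025):
    """Evaluate g(t) of Eq. (1) on a discrete time grid.

    fc       : central frequency in Hz
    fs       : sampling rate in Hz (16 kHz assumed, matching the data used later)
    n        : filter order, usually <= 4 [6]
    duration : length of the response to evaluate, in seconds (illustrative choice)
    """
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)        # decay factor, Eq. (2)
    t = np.arange(0.0, duration, 1.0 / fs)                # time axis in seconds
    return a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t + phi)

# Example: 25 ms of the impulse response of a channel centred at 1 kHz
g = gammatone_impulse_response(fc=1000.0)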

A set of GFs with different f_c forms a Gammatone filterbank, which can be applied to obtain the signal characteristics at various frequencies, resulting in a time-frequency representation similar to the FFT-based short-time spectral analysis. In order to simulate human auditory behavior, the central frequencies of the filterbank are often equally distributed on the Bark scale.

We now derive the time-domain GF implementation. First notice that (1) consists of two components: the filter envelope a t^{n-1} e^{-2\pi b t} and the amplitude modulator \cos(2\pi f_c t + \phi) at frequency f_c. Applying Fourier analysis, the frequency-domain representation of g(t) is obtained:

G(f) = \frac{a(n-1)!}{2(2\pi b)^n} \left\{ \left(\frac{j(f - f_c)}{b} + 1\right)^{-n} + \left(\frac{j(f + f_c)}{b} + 1\right)^{-n} \right\}.    (3)

As stated in [7], the term [j(f + f_c)/b + 1]^{-n} can be ignored if f_c/b is sufficiently large, which leads to the following approximation of (3):

G(f) \approx \frac{a(n-1)!}{2(2\pi b)^n} \left(1 + \frac{j(f - f_c)}{b}\right)^{-n}.

By setting s = j 2\pi f, the Laplace transform of the GF is obtained:

G(s) = \frac{a(n-1)!}{2} \left(s - (j 2\pi f_c - 2\pi b)\right)^{-n}.

Sampling the impulse response at frequency f_s and applying the impulse-invariant transformation [8], the Laplace transform of the original continuous signal corresponds to the following Z-transform of the sampled discrete series:

G(z) = \frac{a(n-1)!}{2} \left(1 - e^{j 2\pi f_c/f_s - 2\pi b/f_s} z^{-1}\right)^{-n}.

Letting A(z) be an elementary transform:

A(z) = \frac{1}{1 - e^{j 2\pi f_c/f_s - 2\pi b/f_s} z^{-1}},    (4)

G(z) can be regarded as a cascade of n recursive applications of A(z). Note that A(z) depends on the central frequency f_c, and so does the entire transform G(z). This can be simplified by a cascaded application of a series of filters which first remove the f_c component, then apply a base filter \hat{G}(z) that is independent of f_c, and finally compensate for f_c. This process is shown in Fig. 1, where x(t) is the input signal and y(t; f_c) is the filtered signal dependent on f_c.

[Fig. 1. Time domain Gammatone filtering: x(t) passes through a multiplier, the base filter, and a second multiplier to produce y(t; f_c).]

Considering the special case n = 4, the base transform takes the following form:

\hat{G}(z) = \frac{3a}{1 - 4 m z^{-1} + 6 m^2 z^{-2} - 4 m^3 z^{-3} + m^4 z^{-4}}    (5)

where m = e^{-2\pi b / f_s}. Note that (5) indicates that the filter channel y(t; f_c) in Fig. 1 can be obtained by a series of filters, each of which involves only simple multiplications and summations in the time domain. This is fundamentally different from the commonly adopted implementation, which realizes the filters in the frequency domain and hence involves the complex FFT-based spectral analysis. We will see in Section 4 that this new design significantly speeds up the feature extraction process. We finally note that this time-domain GF design was proposed in [9]; we employ the technique here to produce GFCCs for robust ASR. To the authors' best knowledge, this is the first work in this direction.
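As an illustration of the structure in Fig. 1 and Eqs. (4)-(5), the following Python/SciPy sketch shifts the signal down by f_c, runs the f_c-independent base filter as n cascaded first-order recursions, and shifts back up. The function name and the placement of the overall gain a(n-1)!/2 are our own conventions, not taken from the released toolkit.

import numpy as np
from math import factorial, pi
from scipy.signal import lfilter

def gammatone_channel(x, fc, fs, n=4, a=1.0):
    """One time-domain Gammatone channel y(t; fc), following Fig. 1 and Eqs. (4)-(5)."""
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)        # decay factor, Eq. (2)
    m = np.exp(-2.0 * pi * b / fs)                        # pole of the base filter, m = e^(-2*pi*b/fs)
    t = np.arange(len(x))
    xb = x * np.exp(-2j * pi * fc * t / fs)               # remove the fc component (first multiplier)
    for _ in range(n):                                    # base filter: n cascaded 1/(1 - m z^-1);
        xb = lfilter([1.0], [1.0, -m], xb)                # for n = 4 this matches the denominator of Eq. (5)
    gain = a * factorial(n - 1) / 2.0                     # a(n-1)!/2, i.e. 3a for n = 4
    return gain * xb * np.exp(2j * pi * fc * t / fs)      # compensate for fc (second multiplier)

# A filterbank is obtained by calling the routine once per central frequency, e.g.
#   channels = [gammatone_channel(x, fc, 16000) for fc in centre_frequencies]
Only delays, multiplications and summations are involved, as noted above.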

3. GFCC IMPLEMENTATION

With the time-domain design of GFs presented in the previous section, GFCC features can be derived from GF-generated Cochleagrams by duplicating the procedure presented in [2]. However, we find that some revisions of the implementation steps lead to better numerical properties of the resulting features. This section presents the details of our GFCC implementation.¹

¹ The code is publicly available at http://homepages.inf.ed.ac.uk/v1dwang2/public/tools/index.html

3.1. Pre-emphasis

It is well known that pre-emphasis helps reduce the dynamic range of the spectrum and intensify the low-frequency components, which usually carry more information of the speech signal. Following this idea, we implement the pre-emphasis as a second-order low-pass filter given by:

H(z) = 1 + 4 e^{-2\pi b / f_s} z^{-1} + e^{-2\pi b / f_s} z^{-2}

where b and f_s are as defined in (2) and (4) respectively.

3.2. Average-based framing

The GFCC implementation proposed in [2] extracts frames from the GF outputs by down-sampling y(t; f_m) to 100 Hz, where f_m is the central frequency of the m-th GF. This approach tends to produce high variation even with a low-pass filter applied. We instead use an averaging approach, which frames y(t; f_m) with a window covering K points and shifted every L points. For the n-th frame, the average value of y(t; f_m) within the window t \in [nL, nL + K) is computed as the m-th component:

\bar{y}(n; m) = \frac{1}{K} \sum_{i=0}^{K-1} \gamma(f_m) \, |y(nL + i; f_m)|

where |\cdot| denotes the magnitude of a complex number, \gamma(f_m) is a central-frequency-dependent factor, and m is the index of the channel whose central frequency is f_m. In our experiments we choose K = 400, L = 160, and 0 \le m \le 31. This means each frame corresponds to a 32-dimensional vector \bar{y}(n) = [\bar{y}(n; 0), \bar{y}(n; 1), ..., \bar{y}(n; 31)]^T. For 16 kHz signals, these settings result in 100 frames per second, exactly the same as the down-sampling approach and the regular frame rate of MFCCs and PLPs. The resulting matrix \bar{y}(n; m) provides a frequency-time representation of the original signal and is often referred to as a Cochleagram [3]. A typical Cochleagram is shown in Fig. 2.
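A minimal NumPy sketch of this averaging step follows. The layout of the channel-output array and the calibration factors gamma are assumed to come from the preceding filtering stage; the names are ours.

import numpy as np

def cochleagram(y, gamma, K=400, L=160):
    """Average-based framing of the channel outputs y(t; f_m).

    y     : complex array of shape (num_channels, num_samples), one row per GF channel
    gamma : array of shape (num_channels,), the factors gamma(f_m)
    K, L  : window length and shift in samples (400/160 = 25 ms / 10 ms at 16 kHz)
    Returns an array of shape (num_channels, num_frames), i.e. the Cochleagram.
    """
    num_frames = 1 + (y.shape[1] - K) // L
    cols = [gamma * np.mean(np.abs(y[:, n * L : n * L + K]), axis=1)   # y_bar(n; m) over t in [nL, nL+K)
            for n in range(num_frames)]
    return np.stack(cols, axis=1)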

3.3. Log-based cosine transform

Given the Cochleagrams, the discrete cosine transform (DCT) is applied to obtain component-uncorrelated cepstral features.² While the DCT can be applied to Cochleagrams directly (as in [2]), we first place a logarithm on the Cochleagrams, as is usually done in MFCC processing. Though there is not much theoretical advantage to the logarithm, we find that it leads to more numerical stability. The following equation presents the exact cepstral form:

F(n, u) = \sqrt{\frac{2}{M}} \sum_{i=1}^{M} \frac{1}{3} \log(\bar{y}(n, i)) \cos\left[\frac{\pi u}{2M}(2i - 1)\right]

where M is the total number of channels, which is 32 in our case, and u ranges from 0 to 31 accordingly. Based on the observation that most values of F(\cdot, u) are close to zero when u \ge 13, we choose the first 12 components of F(n, u) as the feature vector. This results in the static GFCC features, as illustrated in Fig. 2.

² We use the term "cepstra" merely to note that GFCCs are decorrelated by the DCT. The usage is not strictly precise, as GFCCs actually reside in the frequency domain rather than the time domain where cepstra usually live.
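For reference, a direct NumPy transcription of the transform above is given below; the small floor eps is our addition to keep the logarithm finite, and everything else follows the equation.

import numpy as np

def static_gfcc(y_bar_col, num_ceps=12, eps=1e-8):
    """Log-based cosine transform of one Cochleagram column, keeping the first 12 components."""
    M = len(y_bar_col)                                     # number of channels, 32 in our setting
    i = np.arange(1, M + 1)                                # channel index i = 1, ..., M
    logs = np.log(y_bar_col + eps) / 3.0                   # (1/3) log of each channel value
    F = np.array([np.sqrt(2.0 / M) * np.sum(logs * np.cos(np.pi * u * (2 * i - 1) / (2.0 * M)))
                  for u in range(M)])
    return F[:num_ceps]                                    # static GFCC vector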

[Fig. 2. Cochleagram and GFCC (panels: feature abstraction, Cochleagram, static features).]

3.4. Dynamic feature generation

Dynamic features are generally helpful for capturing temporal information. We produce the first- and second-order dynamic features as follows:

\Delta F(n, u) = \frac{\sum_{k=1}^{K} k \left(F(n+k, u) - F(n-k, u)\right)}{2 \sum_{k=1}^{K} k^2}

\Delta\Delta F(n, u) = \frac{\sum_{k=1}^{K} k \left(\Delta F(n+k, u) - \Delta F(n-k, u)\right)}{2 \sum_{k=1}^{K} k^2}

where K is set to 2, which means the length of the context window is 5 frames.
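A compact sketch of the regression above is given below (NumPy); the edge handling by clamping the frame index is our assumption, since the paper does not specify the boundary treatment.

import numpy as np

def delta(feat, K=2):
    """First-order dynamic features; feat has shape (num_frames, num_coeffs)."""
    N = feat.shape[0]
    denom = 2.0 * sum(k * k for k in range(1, K + 1))      # 2 * sum_k k^2
    out = np.zeros_like(feat)
    for n in range(N):
        for k in range(1, K + 1):
            plus = feat[min(n + k, N - 1)]                 # clamp at the sequence edges (assumption)
            minus = feat[max(n - k, 0)]
            out[n] += k * (plus - minus)
    return out / denom

# Second-order features are obtained by applying the same routine to the first-order deltas.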

4. EXPERIMENTS

We test the proposed GFCC features on a grammar-based speech recognition task in Mandarin. This section first describes the training and test data as well as the experimental settings, and then presents comparative results with MFCC, PLP and the frequency-domain GFCC implementation proposed in [2].

4.1. Data profile

The training dataset is from the standard Mandarin speech database collected under the state-sponsored 863 research program, which involves 127 hours of read speech. The test data consist of recordings of 10 speakers collected in a noise-cancelled studio. Each speaker speaks 600 short Chinese utterances involving 200 Chinese names, 200 stock names and 200 Chinese place names. This amounts to 6000 clean utterances, into which three types of noise are intentionally mixed to obtain the noisy data for robustness testing.

4.2. Experimental settings

The acoustic models are 174 monophones (including short pause and long silence) modelled by 3-state left-to-right hidden Markov models (HMMs). The state emissions are modelled by Gaussian mixture models (GMMs), where the number of Gaussian components is fixed to 8 and all components are diagonal. The language model is a simple loop graph that corresponds to the grammar illustrated as follows:

$entry = bei3 jing1 | shang4 hai3 | ... | lu2 fang1 ;
( SILENCE $entry SILENCE )

where a uniform distribution is applied over the possible alternatives. The main goal of this research is to provide a comparative study of MFCC, PLP and Gammatone-based features. For a fair comparison, all the features are derived from the same frequency range (80-5000 Hz) and the frame rates are all set to 100 per second. Each feature involves 12 static cepstral coefficients and their first- and second-order derivatives, resulting in 36-dimensional vectors. Cepstral mean subtraction (CMS) is applied to all features to suppress noise and channel mismatch. The HTK toolkit from Cambridge [10] is used to extract MFCCs and PLPs, train the acoustic models and conduct the decoding. The toolkit provided by Prof. Wang is used to generate the frequency-domain GFCC.³

³ http://www.cse.ohio-state.edu/pnl/shareware/gfcc/

4.3. Robust recognition results

The robust recognition test involves a comparative study of four features: MFCC, PLP, the frequency-domain GFCC proposed in [2] (GFCC-o) and the time-domain GFCC proposed in this paper (GFCC-n). These features are tested under 3 noise types (white noise, F16 noise, and babble noise) and 8 noise levels (0, 3, 6, 9, 12, 15, 20 and 30 dB). The noise is obtained from the Noise92 database and is mixed with the clean utterances in the test database. The results for the three types of noise are shown in Fig. 3, 4 and 5 respectively. We observe from these results that the time-domain GFCC outperforms PLP and MFCC in all noise conditions, while the frequency-domain GFCC does not provide any gain in most cases. This indicates that the performance of GFCCs largely depends on implementation details, and the outstanding gains reported in [2] are not easy to generalize to other databases and tasks. Fig. 6 shows the average processing time for one second of speech signal. A desktop computer with two 2.4 GHz Intel cores and 2 GB of memory was used to conduct the comparison. We see clearly that the time-domain GFCC is remarkably faster than the frequency-domain implementation, though still a little slower than MFCC and PLP.

[Fig. 3. Recognition accuracy (%) vs. SNR (dB) with white noise, for GFCC-n, GFCC-o, PLP and MFCC.]
[Fig. 4. Recognition accuracy (%) vs. SNR (dB) with F16 noise, for GFCC-n, GFCC-o, PLP and MFCC.]
[Fig. 5. Recognition accuracy (%) vs. SNR (dB) with babble noise, for GFCC-n, GFCC-o, PLP and MFCC.]
[Fig. 6. Average processing time (in milliseconds) for one second of speech signal.]

5. CONCLUSIONS

We presented a time-domain GFCC feature to improve the robustness of speech recognition in noisy conditions. Compared to the frequency-domain implementation, the new implementation realizes the Gammatone filters in the time domain and hence avoids the time-consuming FFT computation. Experiments on a grammar-based Mandarin speech recognition task demonstrate that the time-domain GFCC provides much faster processing and consistent performance improvement over the frequency-domain design. When compared to MFCC and PLP, the time-domain GFCC provides better recognition performance under all noise conditions, albeit at the expense of marginally higher computational cost. These results show great potential of the time-domain GFCC features for robust speech recognition, particularly with hardware realization.

6. REFERENCES

[1] Yushi Zhang and Waleed H. Abdulla, "Gammatone auditory filterbank and independent component analysis for speaker identification systems," Tech. Rep., The University of Auckland, 2005.
[2] Y. Shao, Z. Jin, D.L. Wang, and S. Srinivasan, "An auditory-based feature for robust speech recognition," in ICASSP, 2009, pp. 4625-4628.
[3] Y. Shao and D.L. Wang, "Robust speaker identification using auditory features and computational auditory scene analysis," in ICASSP, 2007, pp. 1589-1592.
[4] W. H. Abdulla, Advance in Communication and Software Technologies, chapter "Auditory Based Feature Vectors for Speech Recognition Systems," pp. 231-236, WSEAS Press, 2002.

[5] L. Bezrukov, H. Wagner, and H. Ney, "Gammatone features and feature combination for large vocabulary speech recognition," in ICASSP, 2007, vol. 4, pp. 649-654.
[6] R.D. Patterson and B.C.J. Moore, Frequency Selectivity in Hearing, chapter "Auditory filters and excitation patterns as representations of frequency resolution," pp. 123-177, Academic Press Ltd., London, 1986.
[7] E. de Boer and C. Kruidenier, "On ringing limits of the auditory periphery," Biological Cybernetics, vol. 63, no. 6, pp. 433-442, 1990.
[8] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421-2424, 2006.
[9] John Holdsworth, Ian Nimmo-Smith, Roy Patterson, and Peter Rice, "Implementing a gammatone filter bank," Tech. Rep., Cambridge Electronic Design, 1988.
[10] S. Young et al., The HTK Book, Cambridge, 3.4 edition, 2006.