

Speech Communication 54 (2012) 814–835 www.elsevier.com/locate/specom

Analysis and design of Wavelet-Packet Cepstral coefficients for automatic speech recognition

Eduardo Pavez, Jorge F. Silva ⇑

University of Chile, Department of Electrical Engineering, Av. Tupper 2007, Santiago 412-3, Chile

Received 3 July 2011; received in revised form 31 January 2012; accepted 2 February 2012
Available online 18 February 2012

Abstract

This work proposes using Wavelet-Packet Cepstral coefficients (WPCCs) as an alternative way to do filter-bank energy-based feature extraction (FE) for automatic speech recognition (ASR). The rich coverage of time-frequency properties of Wavelet Packets (WPs) is used to obtain new sets of acoustic features, which achieve performance that is competitive with, and in some configurations better than, the widely adopted Mel-Frequency Cepstral coefficients (MFCCs) on the TIMIT corpus. In the analysis, concrete filter-bank design considerations are stipulated to obtain most of the phone-discriminating information embedded in the speech signal, where the filter-bank frequency selectivity and better discrimination in the lower frequency range [200 Hz–1 kHz] of the acoustic spectrum are important aspects to consider.

© 2012 Elsevier B.V. All rights reserved.

Keywords: Wavelet Packets; Filter-bank analysis; Automatic speech recognition; Filter-bank selection; Cepstral coefficients; The Gray code

⇑ Corresponding author. Tel.: +56 2 9784090; fax: +56 2 6953881.
E-mail addresses: [email protected] (E. Pavez), [email protected]. cl (J.F. Silva). URL: http://www.ids.uchile.cl/josilva/ (J.F. Silva).
0167-6393/$ - see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2012.02.002

1. Introduction

Feature extraction (FE) is one of the key dimensions of design in automatic speech recognition (ASR) (Quatieri, 2002). The most recognized and widely adopted approach for acoustic FE is the use of Mel-Frequency Cepstral coefficients (MFCCs). MFCC analysis is a short-time scheme in which a signature of the acoustic signal spectrum is computed from a filter-bank with central frequencies projected uniformly on the Mel scale (Quatieri, 2002). This scale is derived from well-documented studies of the human auditory system (Quatieri, 2002). Departing from this direction, there has been interest in the use of alternative signal processing techniques to propose new ways of doing short-time filter-bank analysis on the acoustic signal (Silva and Narayanan, 2009; Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). The use of Wavelets and Wavelet Packets (Daubechies, 1992; Mallat, 1989; Vetterli and Kovacevic, 1995) has been of particular interest in this context.

Wavelet Packets (WPs) (Vetterli and Kovacevic, 1995; Mallat, 1989; Coifman et al., 1990) have emerged as important signal representation schemes impacting compression, detection and classification (Crouse et al., 1998; Etemad and Chellapa, 1998; Ramchandran et al., 1996; Vasconcelos, 2004; Willsky, 2002; Learned et al., 1992; Scott and Nowak, 2004). This collection of bases is particularly appealing for the analysis of pseudo-stationary time series processes and quasi-periodic random fields, such as the acoustic speech process (Silva and Narayanan, 2009; Choueiter and Glass, 2007; Chang and Kuo, 1993; Learned et al., 1992). WPs belong to the category of structured bases, those whose orthonormal basis elements are generated from a finite number of elementary transformations (Vetterli and Kovacevic, 1995; Daubechies, 1992; Ramchandran et al., 1996). From an engineering point of view, these kinds of representations are attractive because

E. Pavez, J.F. Silva / Speech Communication 54 (2012) 814–835

they can be implemented with a basic two-channel filter (TCF) and down-sampling operations (Vetterli and Kovacevic, 1995). WPs can be used to characterize a rich covering of signal-space decompositions, and in particular, they provide a way of generating sub-band dependent partitions of the observation space. In conclusion, WPs induce a family of structural filter-banks with a rich covering of time-frequency characteristics that has the potential to enrich the way conventional MFCC features describe the short-term behavior of the acoustic speech process.

WPs and multi-rate filter bank analysis have been adopted to improve the performance of conventional MFCC features in the context of ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In particular, Farooq and Datta (2001) proposed a WP filter-bank representation in which the objective was to mimic the MEL-scale frequency resolution. They used the Daubechies (DB) two-channel filter (Daubechies, 1992), with which performance improvements were observed for specific phone subcategories (stop and unvoiced) in a portion of the TIMIT corpus. More recently, Choueiter and Glass (2007) explored the problem of two-channel filter-bank design and, in particular, the novel framework of rational filter-banks. The focus of that work was to improve the frequency selectivity with respect to the conventionally adopted Daubechies (DB) WPs with standard dyadic structure, by designing a type of MEL-frequency filter-bank structure. Better performances were obtained in a simplified phone-segmented classification task with respect to MFCCs. These seminal works provide concrete evidence of the advantage of adopting WPs for parameterizing the speech acoustic process.

However, the problem of adapting the WP basis-structure to the decision task, in the sense of finding the filter-bank topology, within the collection of tree-structured WP bases, that best captures the time-frequency acoustic information for a given complexity constraint (feature dimension), remains an unexplored direction. As pointed out in (Choueiter and Glass, 2007), this direction has the potential to further adapt WP filter-bank solutions (acoustic energy-signature) to the phone discrimination task at hand. On the other hand, the results reported so far have considered simplified settings, in terms of the classification task or data-sets. Thus, a systematic analysis in standard phone recognition experiments would be beneficial to support the adoption of WP-based features as a competitive front-end alternative for doing acoustic FE.

In this work we propose the Wavelet-Packet Cepstral coefficients (WPCCs) and show concrete results that complement previous work supporting the use of WPs as an FE technique for ASR. This work builds upon the ideas recently proposed in Silva and Narayanan (2009), in which the problem of optimal filter-bank selection for pattern recognition (PR) was formulated based on the minimum probability of error decision principle (Silva et al., 2012; Vasconcelos, 2004). Here we explore WP filter-bank selection to propose a family of WPCCs. These features are


log-energy-based acoustic signatures rotated with the discrete cosine transform (the Cepstrum), as proposed in Farooq and Datta (2001), where the energy signatures are obtained from a bank of ﬁlters selected from the family of WP ﬁlter-banks. For the ﬁlter-bank selection, we use a complexity regularized criterion adopted from standard tree-structured bases selection problems (Silva and Narayanan, 2009; Etemad and Chellapa, 1998; Saito and Coifman, 1994; Coifman et al., 1992). In particular, we use acoustic energy, the Fisher-scatter ratio (Duda and Hart, 1983), and the Kullback-Leibler divergence (KLD) as ﬁdelity measures. The last two criteria are phone-discriminative in nature, while energy is based on the principle of increasing the frequency resolution in bands with higher acoustic energy, proposed in Chang and Kuo (1993) for the problem of texture classiﬁcation. As supporting results, we run standard phone recognition experiments in the TIMIT corpus. We contrast the diﬀerent ﬁlter-bank solutions with respect to a number of design elements. Among them are the ﬁdelity measure to select the ﬁlter-banks, the number of bands, the number of features, and the frequency selectivity of the two-channel ﬁlter (TCF) that induces the family of WPs. Interestingly, we found competitive results and concrete solutions that outperform the MFCCs. In the analysis, we show performance trends and dependencies that explain what the important design variables are to be considered for the construction of good acoustic features for ASR. At the end, WPCCs oﬀer a rich collection of acoustic features that extend the idea of short-time (segmental) energy-signature for acoustic event detection. The rest of the article is organized as follows. Section 2 revisits the standard approach for obtaining short-term acoustic features. 
Sections 3 and 4 are devoted to the presentation of the WPCCs, where background material is covered to aid understanding of the filter-bank properties of WPs, and Section 5 covers the filter-bank selection problem. Finally, Sections 6 and 7 show the filter-bank structure of the obtained solutions and the phone-classification performances, respectively. Final remarks are presented in Section 8, and supplemental material is presented in the Appendix.

2. Revisiting the filter bank Cepstral analysis view of feature extraction

We revisit the standard feature extraction (FE) technique for ASR based on filter-bank energy features and the application of the Cepstral transform (Quatieri, 2002) illustrated in Fig. 1a. Given the acoustic signal, the scheme has the following phases: a high-pass pre-emphasis filter $1 - 0.97z^{-1}$ is applied on the whole acoustic signal; the resulting signal is segmented with a Hamming window of 32 ms, creating overlapped short-term acoustic segments every 10 ms (segmental analysis); each acoustic segment is passed through a bank of triangular shaped filters with center frequencies forming an equipartition of the MEL scale, as shown in Fig. 1; and finally, in each segment the filter-bank energies (FBE) are computed to form a vector, where the logarithm


Fig. 1. Illustration of the phases that characterize the standard approach for acoustic feature extraction in speech recognition.

function (point-wise) and the Discrete Cosine transform (DCT) are applied to create the MEL frequency Cepstral coeﬃcients (MFCCs) (Davis and Mermelstein, 1980). In this work we explore an extension of this framework for acoustic FE, where, instead of using the perceptually motivated MEL ﬁlter-bank structure, we study the rich collection of ﬁlter-banks induced from the Wavelet Packet (WP) bases (Vetterli and Kovacevic, 1995; Mallat, 2009). The next section is devoted to explaining the methodology adopted to induce a new set of ﬁlter-bank energy features from the WPs, and, later, we present the proposed Wavelet Packet Cepstral coeﬃcients (WPCCs) for ASR.
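The phases listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the front-end used in the experiments: the sampling rate (16 kHz), the number of triangular filters (24), the number of retained cepstral coefficients (13), the particular Mel formula, and the function names `mel_filterbank` and `mfcc` are all assumptions made for the example. Only the pre-emphasis filter, the 32 ms Hamming window, and the 10 ms frame shift come from the text.

```python
import numpy as np

def hz_to_mel(f):
    # One common Mel-scale formula (an assumption; the paper does not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose center frequencies are equally spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(signal, fs=16000, win_ms=32, hop_ms=10, n_filters=24, n_ceps=13):
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis 1 - 0.97 z^{-1}, applied to the whole signal.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(win)                 # overlapped Hamming segments
    power = np.abs(np.fft.rfft(frames, n=win, axis=1)) ** 2
    fbe = power @ mel_filterbank(n_filters, win, fs).T  # filter-bank energies
    log_fbe = np.log(fbe + 1e-10)                     # point-wise logarithm
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (m + 0.5) / n_filters)   # unnormalized DCT-II
    return log_fbe @ dct.T

features = mfcc(np.random.default_rng(0).standard_normal(16000))
print(features.shape)
```

The WPCCs studied in the remainder of the paper keep the log and DCT stages and replace only the filter-bank that produces the energies.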

3. Wavelet Packets

WPs were proposed by Coifman et al. (1992) as a collection of bases with an underlying tree-structure. They offer different time-frequency representation qualities and, consequently, the potential to adapt to complex time series phenomena like the speech acoustic process (Silva and Narayanan, 2009). Here we provide a brief introduction to this family with focus on its filter-bank characteristics. Excellent expositions can be found in Mallat (2009), Vetterli and Kovacevic (1995) and Daubechies (1992).

3.1. WP sub-space decomposition: tree-structured collection

Let $\mathcal{X}$ be the signal space of interest that, without loss of generality, is associated with a finite level of scale $2^{L}$ or resolution $2^{-L}$, $L$ being an integer strictly greater than zero (Mallat, 2009). Consequently, $\mathcal{X}$ can be equipped with an orthonormal basis $B_L \triangleq \{\phi_L(t - 2^L n)\}_{n \in \mathbb{Z}}$ (Mallat, 2009; Vetterli and Kovacevic, 1995; Daubechies, 1992). The WP framework provides a way of decomposing the basis $B_L$ into two orthonormal collections, $B^0_{L+1} \triangleq \{\phi^0_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$ and $B^1_{L+1} \triangleq \{\phi^1_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$, where, denoting by $U^0_{L+1} \triangleq \mathrm{span}\{\phi^0_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} \triangleq \mathrm{span}\{\phi^1_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$, we have that (Mallat, 2009)

$$\mathcal{X} = U^0_{L+1} \oplus U^1_{L+1}. \tag{1}$$

The structure of the WP framework comes from the fact that $B^1_{L+1}$ and $B^0_{L+1}$ are induced by a discrete time pair of conjugate mirror filters (CMF) that we denote by $(h(n), g(n))$ (Mallat, 2009, Chap. 7.1.3). More precisely, the basis elements $\phi^0_{L+1}(t)$, $\phi^1_{L+1}(t)$ associated with the scale $L+1$ are induced from $\phi_L(t)$, of the scale $L$, by

$$\phi^0_{L+1}(t) = \sum_{n=-\infty}^{\infty} h(n)\, \phi_L(t - 2^L n), \qquad \phi^1_{L+1}(t) = \sum_{n=-\infty}^{\infty} g(n)\, \phi_L(t - 2^L n), \tag{2}$$

where $h(n)$ and $g(n)$ are related by the perfect reconstruction property, i.e., $g(n) = (-1)^{1-n} h(1-n)$, $\forall n \in \mathbb{Z}$ (Coifman et al., 1992), (Mallat, 2009, Th. 8.1). Iterating the application of the CMF pair $(h(n), g(n))$ on each basis element $\phi^0_{L+1}(t)$ and $\phi^1_{L+1}(t)$ (Mallat, 2009, Th. 8.1), we can continue, in a binary tree-structured way, with the construction of alternative bases and subspace decompositions for $\mathcal{X}$. More precisely, after a fixed number of iterations, we can create $\phi^p_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^j - 1\}$, where $U^p_{L+j} = \mathrm{span}\{\phi^p_{L+j}(t - 2^{L+j} n) : n \in \mathbb{Z}\}$, see Fig. 2a. Furthermore, by construction, $\forall j \geq 1$, $\forall p \in \{0, \ldots, 2^j - 1\}$,

$$U^p_{L+j} = U^{2p}_{L+j+1} \oplus U^{2p+1}_{L+j+1}, \tag{3}$$

where $\phi^{2p}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} h(n)\, \phi^p_{L+j}(t - 2^{L+j} n)$ and $\phi^{2p+1}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} g(n)\, \phi^p_{L+j}(t - 2^{L+j} n)$.

At the end, the WPs can be seen as a family of tree-structured bases induced from the iteration of the two channel ﬁlter (TCF) ðhðnÞ; gðnÞÞ as illustrated in Fig. 2a.
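As a discrete-time illustration of one TCF iteration, the sketch below builds the high-pass filter from the low-pass one through $g(n) = (-1)^{1-n} h(1-n)$, splits a coefficient sequence into its two children by filtering and keeping every second sample, and checks that the total energy is preserved. The periodic (circular) boundary handling, the choice of the Haar and 4-tap Daubechies low-pass filters, and the function names `cmf_highpass` and `wp_split` are assumptions made for the example.

```python
import numpy as np

def cmf_highpass(h):
    """Return (offsets, g) with g(n) = (-1)^(1-n) h(1-n) for h supported on 0..K-1."""
    K = len(h)
    offsets = np.arange(2 - K, 2)                     # n = 2-K, ..., 1
    g = np.array([(-1.0) ** (1 - n) * h[1 - n] for n in offsets])
    return offsets, g

def wp_split(d, h):
    """One TCF iteration with periodic extension.

    low(n)  = sum_k d(k) h(k - 2n),  high(n) = sum_k d(k) g(k - 2n).
    """
    d = np.asarray(d, dtype=float)
    N = len(d)
    assert N % 2 == 0, "length must be even to down-sample by 2"
    n = np.arange(N // 2)
    low = sum(h[m] * d[(2 * n + m) % N] for m in range(len(h)))
    offsets, g = cmf_highpass(h)
    high = sum(gm * d[(2 * n + m) % N] for m, gm in zip(offsets, g))
    return low, high

s = 1.0 / np.sqrt(2.0)
haar = np.array([s, s])
db2 = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
                3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))

d = np.random.default_rng(0).standard_normal(16)
for h in (haar, db2):
    low, high = wp_split(d, h)
    # Orthonormality of the split implies energy conservation.
    print(np.isclose(np.sum(d ** 2), np.sum(low ** 2) + np.sum(high ** 2)))
```

Calling `wp_split` again on `low` or `high` corresponds to descending one more level in the binary tree of Fig. 2a.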


Fig. 2. Binary tree-structure and representation of the family of Wavelet Packet bases.

3.2. Inter-scale relationship of the WP transform coefficients

A key property of WPs is the inter-scale relationship induced from (2) among the WP transform coefficients obtained across scales (Mallat, 2009). More precisely, let $x(t)$ be in $U^p_j \subset \mathcal{X}$ with transform coefficients given by

$$d^p_j(n) \triangleq \langle x(t), \phi^p_j(t - 2^j n) \rangle, \quad \forall n \in \mathbb{Z}. \tag{4}$$

Projecting $x(t)$, instead, in the alternative basis associated with $U^{2p}_{j+1} \oplus U^{2p+1}_{j+1}$, we have that (Mallat, 2009, Prop. 8.4)

$$d^{2p}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k)\, h(k - 2n), \qquad d^{2p+1}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k)\, g(k - 2n), \quad \forall n \in \mathbb{Z}. \tag{5}$$

Considering the fact that those are orthonormal bases, Parseval's relationship (Mallat, 2009) implies that

$$\|x(t)\|^2 = \sum_{n \in \mathbb{Z}} |d^p_j(n)|^2 = \sum_{n \in \mathbb{Z}} |d^{2p}_{j+1}(n)|^2 + \sum_{n \in \mathbb{Z}} |d^{2p+1}_{j+1}(n)|^2. \tag{6}$$

By induction, a closed-form relationship in the transform coefficients can be obtained for every pair of basis elements in the WPs, as illustrated in Fig. 2b. The beauty of this result is that we pass from an analysis in continuous time in (4) to a discrete time analysis (algorithm) in (5). In fact, assuming that $x(t)$ lives in a finite resolution space $\mathcal{X}$, Eq. (4) with $j = L$ and $p = 0$ can be seen as a generalized Sampling theorem (Zhou and Sun, 1999; Walter, 1992). Furthermore, the WP binary structure manifested in (5) permits a fast algorithm implementation of the WP analysis (Mallat, 2009). Concerning the algorithmic part, the next section addresses the filter-bank implementation of WPs (Vetterli and Kovacevic, 1995).

3.3. WP filter bank implementation

From a discrete time filter-bank point of view (Vetterli and Kovacevic, 1995), the basic iteration in (5) can be implemented by the application of a two channel filter (TCF), with impulse responses $h(n)$ and $g(n)$, followed by a down-sampling by 2 operation (Vetterli and Kovacevic, 1995; Mallat, 2009). This view is generalized in the following result.

Proposition 1 (Vaidyanathan (1993, Chap. 11.3.3)). Let $x(t)$ be in a finite $2^L$ scale space $\mathcal{X}$, with transform coefficients $(d^0_L(n))_{n \in \mathbb{Z}}$ obtained from (4). Let us consider an arbitrary sub-space $U^p_j$ induced from the WP filter bank decomposition with $j > L$ and $p \in \{0, \ldots, 2^{j-L} - 1\}$. Let us denote by $(h_0(n))_{n \in \mathbb{Z}}$ and $(h_1(n))_{n \in \mathbb{Z}}$ the conjugate mirror filter pair (with transfer functions $H_0(z)$ and $H_1(z)$), by $U^{p_1}_{L+1}, \ldots, U^{p_{j-L-1}}_{j-1}$ the sequence of intermediate sub-spaces used to go from $\mathcal{X}$ to $U^p_j$, and by $\Theta(j,p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0,1\}^{j-L}$ the binary path code. In the last definition, choosing $\theta_k$ implies filtering with $H_{\theta_k}(z)$ and then applying the down-sampler by 2 at step $k$ of the iteration. Then $(d^p_j(n))_{n \in \mathbb{Z}}$ is obtained by passing $(d^0_L(n))_{n \in \mathbb{Z}}$ through the following discrete time filter

$$H_{\Theta(j,p)}(z) = \prod_{i=1}^{j-L} H_{\theta_i}\!\left(z^{2^{i-1}}\right), \tag{7}$$

and then applying the down-sampler by $2^{j-L}$ operator.

Proof. The proof of this result is a consequence of Proposition 2 presented in Appendix B. Fig. 3 illustrates the relationship. □
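The equivalence stated in Proposition 1 can be checked numerically. The sketch below compares the step-by-step cascade (filter, then down-sample by 2, repeated along the path code) with a single pass through the product of up-sampled filters followed by one aggregated down-sampling. It is a minimal sketch under stated assumptions: filters are applied in the correlation form of (5), signals are extended periodically, the Haar pair is used as the TCF, and the names `circ_filter`, `upsample_taps`, `cascade`, `equivalent`, and `HAAR` are hypothetical helpers introduced for the example.

```python
import numpy as np

def circ_filter(d, taps):
    """y(n) = sum_k d(k) f(k - n) with periodic extension; taps maps offset -> f(offset)."""
    N = len(d)
    n = np.arange(N)
    return sum(v * d[(n + m) % N] for m, v in taps.items())

def upsample_taps(taps, factor):
    """Taps of f(z^factor): insert factor-1 zeros between consecutive coefficients."""
    return {m * factor: v for m, v in taps.items()}

def cascade(d, path, filters):
    """Filter with H_theta and down-sample by 2 at every step of the path code."""
    for theta in path:
        d = circ_filter(d, filters[theta])[::2]
    return d

def equivalent(d, path, filters):
    """Apply prod_i H_{theta_i}(z^(2^(i-1))) and then down-sample once by 2^len(path)."""
    y = d
    for i, theta in enumerate(path):
        y = circ_filter(y, upsample_taps(filters[theta], 2 ** i))
    return y[:: 2 ** len(path)]

s = 1.0 / np.sqrt(2.0)
HAAR = {0: {0: s, 1: s},      # h_0: low-pass
        1: {0: -s, 1: s}}     # h_1(n) = (-1)^(1-n) h_0(1-n): high-pass

d0 = np.random.default_rng(1).standard_normal(32)
path = (1, 0, 1)              # binary path code (theta_1, theta_2, theta_3)
print(np.allclose(cascade(d0, path, HAAR), equivalent(d0, path, HAAR)))
```

The second form is the one used in Section 4: dropping the final down-sampler leaves an ordinary linear time-invariant filter whose frequency response can be inspected.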


Fig. 3. The equivalent systems stated in Proposition 1. The aggregated down-sampler is by $K = 2^{j-L}$.

Fig. 4. Illustration of the frequency division of Wavelet Packet bases for two tree structures. The ideal Shannon conjugate filter pair is considered, which provides perfect dyadic partitions of the interval $[-\pi, \pi]$. Scenario (a–c) shows a recursive iteration of $H_0(z)$ (Wavelet type), and scenario (b–d) presents a balanced tree structure (uniform frequency resolution).

4. Frequency response of the WP filter banks

Note that the process that relates $(d^0_L(n))_{n \in \mathbb{Z}}$ with $(d^p_j(n))_{n \in \mathbb{Z}}$ in Proposition 1 is linear but not time invariant. Consequently, it is misleading to talk about the frequency response associated with the process of projecting $x(t)$ into the WP sub-space $U^p_j$. We can circumvent this issue by considering only the equivalent filtering part of the process in (7) and, consequently, avoiding the last down-sampling stage.¹ More precisely, we consider the frequency response of the equivalent linear time-invariant (LTI) system just before the down-sampling stage. This characterizes the frequency content associated with each subspace, with which we can define the frequency decomposition achieved by a given WP basis. To illustrate this, let us consider the Shannon WPs (Mallat, 2009) induced by the perfect low and high pass filters presented in Figs. 4 and 5, i.e.,

$$|H_0(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [-\pi/2 + 2k\pi,\ \pi/2 + 2k\pi] \\ 0 & \text{otherwise} \end{cases}$$

and

$$|H_1(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [\pi/2 + 2k\pi,\ 3\pi/2 + 2k\pi] \\ 0 & \text{otherwise.} \end{cases}$$

¹ An alternative interpretation is presented in Appendix A. This analysis is not based on the filter-bank view of WPs presented here.

Following Section 3.1, each WP basis of $\mathcal{X}$ can be represented by the leaves of a binary tree, as shown in Fig. 2(a). More precisely, a basis is indexed by $\{(j_i, p_i) : i = 1, \ldots, M\}$,² associated with the basis element $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$ and sub-space decomposition $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$. For each leaf $(j_i, p_i)$ of this tree, we can obtain its equivalent filter $H_i(z) \triangleq H_{\Theta(j_i,p_i)}(z)$ by (7) and, consequently, reduce the analysis to the frequency response of an M-channel filter-bank, see Fig. 6. Examples of the frequency response before the down-sampling stage are presented in Figs. 4 and 5. From these, we can notice that for the Wavelet type of structure, produced by iterating $H_0(e^{j\omega})$ in every step, we obtain a solution that increases the resolution in the low frequency range. In general, in each step of iterating the TCF, we reduce the frequency support of the resulting sub-space by half, as illustrated in Fig. 4c.

² It is necessary that $j_i > L$ and $p_i \in \{0, \ldots, 2^{j_i - L} - 1\}$, $\forall i \in \{1, \ldots, M\}$. In addition there are structural conditions to guarantee that $\{(j_i, p_i) : i = 1, \ldots, M\}$ corresponds to the leaves of a binary tree rooted at node $(L, 0)$, not detailed here for space considerations. We refer the reader to Breiman et al. (1984), Chou et al. (1989) and Scott (2005) for a systematic exposition of this point.

Fig. 5. The same scenario as in Fig. 4. Scenario (a–c) shows a recursive iteration of $H_1(z)$, and scenario (b–d) the reciprocal, in terms of frequency selectivity, of the Wavelet type in Fig. 4(a) and (c).

Fig. 6. The equivalent M-channel filter-bank of a WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$.

4.1. Frequency ordering: the Gray code

Concerning frequency ordering, however, the up-sampled versions of $H_0(z)$ and $H_1(z)$ do not necessarily play the role of the low and high pass filters, respectively, in the band of interest. The reason is that the side lobes of

these filters, out of the original frequency range of their definition $[-\pi, \pi]$, are brought into $[-\pi, \pi]$ after the up-sampling operation in a non-trivial way (Mallat, 2009). This is a direct consequence of the result presented in Proposition 1. An example of this phenomenon is shown in Fig. 5a, for the case of iterating $H_1(z)$. This scenario does not provide a solution that decomposes the high frequency range of the signal, see Fig. 5c, as one would expect from its reciprocal Wavelet solution shown in Fig. 4c. To illustrate this mirroring effect more clearly, let us consider Fig. 5b and d. In this scenario, the frequency support of the equivalent filter $H_1(z)H_1(z^2)$ is not the highest band in the interval $[0, \pi]$ as expected. In fact, the supports of $H_1(z)$ and $H_1(z^2)$ are $[\pi/2, 3\pi/2]$ and $[\pi/4, 3\pi/4]$, respectively. Thus $H_1(z)H_1(z^2)$ has support in $[\pi/2, 3\pi/4]$. For further details on this frequency ordering issue, we refer the reader to Mallat (2009, Section 8.1.2) and Atto et al. (2007, 2010).

Fortunately, there is a simple closed-form rule to relabel any admissible node $(j, p)$ in the WP tree as an equivalent node $(j, k)$, at the same depth (scale), so that the resulting labels are frequency ordered (Mallat, 2009). This mapping $k = G(p)$ is called the Gray code and it is presented in Appendix C for completeness. Then, for each WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$, we can compute the ordered indexes $\{(j_i, k_i) : i = 1, \ldots, M\}$, with $k_i = G(p_i)$, (C.1), where each induced subspace atom $U^{p_i}_{j_i}$ captures the signal information concentrated in the band

$$I^{k_i}_{j_i} \triangleq \left[-(k_i + 1)\pi 2^{-j_i},\ -k_i \pi 2^{-j_i}\right] \cup \left[k_i \pi 2^{-j_i},\ (k_i + 1)\pi 2^{-j_i}\right]. \tag{8}$$

Then, $B$ produces $I_B = \{I^{k_i}_{j_i} : i = 1, \ldots, M\}$, a partition of the discrete time frequency range $[-\pi, \pi]$.

Extending this analysis to WPs with an arbitrary conjugate mirror filter pair $(h_0(n), h_1(n))$, their frequency selectivity property depends upon how $H_0(e^{j\omega})$ is concentrated in $[-\pi/2, \pi/2]$. Consequently, we only have an approximation of the clean selectivity properties of the Shannon WPs in (8). For the applications on acoustic speech signals, this will be one of the critical aspects to evaluate. In the following, we concentrate on the family of Daubechies (DB) WPs (Mallat, 2009; Daubechies, 1992), exploring different filter order solutions (associated with the number of zeros at $\pi$ of $H_0(z)$), which provide a tradeoff between the order of the TCF and the concentration of $H_0(e^{j\omega})$ in the range $[-\pi/2, \pi/2]$, or frequency selectivity (Mallat, 2009, Chap. 8.1.2). We choose the family of compactly supported Daubechies wavelets (Daubechies, 1992) because it offers a rich range of frequency selectivities. In fact, we can go from the Haar Wavelet (Vetterli and Kovacevic, 1995; Mallat, 2009), where $H_0(z)$ has one zero at $\pi$, with almost no frequency selectivity but perfect time localization, to the Shannon Wavelet that offers perfect frequency selectivity (in the limit where the number of zeros at $\pi$ of $H_0(z)$ goes to infinity) (Mallat, 2009). On the theoretical side, this family offers the minimum order TCF solution $(h_0(n), h_1(n))$ for a given number of vanishing moments or zeros at $\pi$ of $H_0(z)$. This last attribute is associated with the frequency selectivity of the TCF (Mallat, 2009, Th. 7.9).

5. Wavelet Packet filter-bank selection

The last aspect in the implementation of the WP acoustic features is to decide appropriate WP filter-bank structures for the phone recognition task we have at hand. We follow the data-driven approach independently proposed by Etemad and Chellapa (1998) and Saito and Coifman (1994),³ and revisited by Silva and Narayanan (2009). The idea is to use supervised data to select a filter-bank structure (or a frequency partition of $[-\pi, \pi]$) that provides a nearly-optimal phonetic discrimination basis solution. More details of the formulation of this problem can be found in Silva and Narayanan (2009), Silva and Narayanan (2007) and Vasconcelos (2004).

³ This work was inspired by the seminal work of Coifman and Wickerhauser (Coifman et al., 1992) in the context of basis selection for sparse signal representation.

To formulate the optimization problem, let us first introduce some notation. Following Silva and Narayanan (2009), we represent the process of producing a particular basis in the WP family by a rooted binary tree (Scott, 2005). For simplicity, let $J > 0$ be the maximum number of iterations of the sub-band decomposition process. Let $G = (V, E)$ be a graph with

$$V = \{(0,0), (1,0), (1,1), \ldots, (J,0), \ldots, (J, 2^J - 1)\}, \tag{9}$$

and $E$ the collection of arcs on $V \times V$ that characterizes a full-rooted binary tree with root $v_{root} = (0,0)$ as shown in Fig. 2a. Instead of representing the tree as a collection of arcs in $G$, we use the convention of Breiman et al. (1984), in which subgraphs are represented by a subset of nodes of the full graph. More formally, we define a rooted binary tree $T = \{v_0, v_1, \ldots\} \subset V$ as a collection of nodes with only one of degree 2, the root node, and the remaining nodes with degree 3 (internal nodes) or degree 1 (leaf nodes) (Cormen et al., 1990). We define $\mathcal{L}(T)$ as the set of leaves of $T$ and $\mathcal{I}(T)$ as the set of internal nodes; consequently, $\mathcal{L}(T) \cup \mathcal{I}(T) = T$. We say that a rooted binary tree $S$ is a subtree of $T$ if $S \subset T$. In the previous definition, if the roots of $S$ and $T$ are the same, then $S$ is a pruned subtree of $T$, denoted by $S \ll T$. In addition, if the root of $S$ is an internal node of $T$, then $S$ is called a branch. In particular, we denote the largest branch of $T$ rooted at $v \in T$ as $T_v$. We define the size of the tree $T$ as the number of leaves, i.e., the cardinality of $\mathcal{L}(T)$, denoted as $|T|$. Finally, in our problem, $T_{full} = V$ in (9) denotes the full binary tree; consequently, the collection of WP bases is indexed by the admissible trees $\{T \subset V : T \ll T_{full}\}$.

In this context, any pruned version of the full-rooted binary tree represents a particular way of iterating the TCF $(h_0(n), h_1(n))_{n \in \mathbb{Z}}$ of the WP. More precisely, if we let $T = \{(j_i, p_i) : i \in \{1, \ldots, M\}\}$ be an admissible WP binary tree, then we denote its basis by

$$B_T \triangleq \bigcup_{i=1}^{M} B^{p_i}_{j_i}, \tag{10}$$

its sub-space decomposition by

$$U_T \triangleq \left\{U^{p_i}_{j_i} : i = 1, \ldots, M\right\}, \tag{11}$$

where $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$, and its ideal Shannon frequency partition by

$$I_T \triangleq \left\{I^{k_i}_{j_i} : i = 1, \ldots, M\right\}, \tag{12}$$

with $k_i = G(p_i)$ from (C.1) and $I^{k_i}_{j_i}$ from (8). Finally, as we are interested in extending the filter-bank Cepstral analysis view for acoustic FE (Section 2), for each $T \ll T_{full}$ and for any point $x \in \mathcal{X}$, we define the filter-bank energy signature of $x$ relative to $T$ by

$$m_T(x) \triangleq \left(E^p_j(x)\right)_{(j,p) \in \mathcal{L}(T)}, \tag{13}$$

where $E^p_j(x)$ denotes the energy of $x$ in the subspace $U^p_j$, and by orthonormality $\|x\|^2 = \sum_{(j,p) \in \mathcal{L}(T)} E^p_j(x)$.

5.1. The tree-pruning problem

Here we revisit the approach in Silva and Narayanan (2009), where the selection of the WP basis was based on approximating the minimum probability of error decision (Silva et al., 2012). This formulation is reduced to find an


optimal tradeoff between the estimation and approximation errors and, consequently, addresses a complexity-regularization problem. More precisely, we address the solution of

$$T^*(\lambda) = \arg\min_{T \ll T_{full}} \; -F(m_T(X); Y) + \lambda \Phi(T), \tag{14}$$

where $X$ is the random object representing the raw acoustic observation in our signal space $\mathcal{X}$, and $Y$ is the class label random variable with values in the finite alphabet space of phonetic classes $\mathcal{Y}$. The first term in (14) involves $F(\cdot\,;\cdot)$, which is a measure designed to capture the discriminative information of $m_T(X)$ relative to the class label $Y$ (fidelity measure). The second term $\Phi(\cdot)$ is a non-decreasing real function (cost term) designed to incorporate estimation error effects. The solution of (14), for all $\lambda > 0$, resides in the solution of the following cost-fidelity problem (Scott, 2005) (Silva and Narayanan, 2009, Sec. IV.D):

$$T^{k*} = \arg\max_{\{T \ll T_{full} \,:\, |T| \leq k\}} F(m_T(X); Y). \tag{15}$$

The problem in (15) is equivalent to finding the filter-bank of length $k$ that maximizes the fidelity gain $F(m_T(X); Y)$, for all $k \in \{2, 3, \ldots, |T_{full}|\}$. Interestingly, when the fidelity measure is additive,⁴ or alternatively affine,⁵ with respect to the structure of $T$, which will be the case for all measures experimentally evaluated in this work (see Section 5.2), the solution of (15) admits an efficient implementation with complexity $O(|T_{full}| \log |T_{full}|)$ (Silva and Narayanan, 2009, Th. 2 and 3). Furthermore, (15) offers an embedded solution structure, i.e., $T^{2*} \ll T^{3*} \ll \cdots \ll T^{(|T_{full}|-1)*} \ll T_{full}$ (Silva and Narayanan, 2009, Th. 3). For completeness, the algorithm for solving (15) is presented in Section 5.3.

5.2. Fidelity measures

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be independent and identically distributed (i.i.d.) realizations of the joint vector $(X, Y)$, where every pair $(x_i, y_i)$ corresponds to a speech segment and its respective phone label. As fidelity measures, we use the indicators proposed by Saito and Coifman (1994), Etemad and Chellapa (1998) and Silva and Narayanan (2009). All of them can be written in the additive form:

$$F(m_T(X); Y) = \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X); Y). \tag{16}$$

⁴ A tree functional $\rho(\cdot)$ is additive if $\rho(T) = \sum_{(j,p) \in \mathcal{L}(T)} \rho(j,p)$ (Scott, 2005).

⁵ A tree functional $\rho(\cdot)$ is affine if, for any $T$, $S$ rooted binary trees such that $S \ll T$, then $\rho(T) = \rho(S) + \sum_{s \in \mathcal{L}(S)} \left[\rho(T_s) - \rho(\{s\})\right]$, where $\{s\}$ is the trivial tree rooted at $s$, see (Scott, 2005).

5.2.1. KLD fidelity estimate

The first fidelity measure is the symmetric version of the Kullback-Leibler divergence (KLD) (Kullback, 1958) proposed in Saito and Coifman (1994). Let us define the normalized energy of $x \in \mathcal{X}$ by $\bar{E}^p_j(x) \triangleq E^p_j(x) / \|x\|^2$, and the number of examples in class $y \in \mathcal{Y}$ by $N_y \triangleq \sum_{i=1}^{N} \mathbf{1}_{\{y\}}(y_i)$. Let the energy map $e(j,p,y)$ be given by

$$e(j,p,y) = \frac{1}{N_y} \sum_{i=1}^{N} \mathbf{1}_{\{y\}}(y_i)\, \bar{E}^p_j(x_i), \tag{17}$$

for any pair $(j,p) \in \{0, \ldots, J\} \times \{0, \ldots, 2^j - 1\}$ and $y \in \mathcal{Y}$. For a binary tree $T$, its class conditional energy signature is defined by

$$e_T(y) = \left(e(j,p,y)\right)_{(j,p) \in \mathcal{L}(T)}, \tag{18}$$

where from Parseval's relationship we have that $\sum_{(j,p) \in \mathcal{L}(T)} e(j,p,y) = 1$. Therefore, we can treat $e_T(y)$ as a probability mass function and define the KLD fidelity as (Saito and Coifman, 1994)

$$F(m_T(X); Y) = \sum_{y,z \in \mathcal{Y}} D\left(e_T(y) \,\|\, e_T(z)\right). \tag{19}$$

Here $D$ is the discrete KLD (Gray, 1990; Cover and Thomas, 1991). To write the functional in its additive form, in (16), we consider the following equalities:

$$\begin{aligned}
F(m_T(X); Y) &= \sum_{y,z \in \mathcal{Y}} D\left(e_T(y) \,\|\, e_T(z)\right) \\
&= \sum_{y,z \in \mathcal{Y}} \sum_{(j,p) \in \mathcal{L}(T)} e(j,p,y) \log \frac{e(j,p,y)}{e(j,p,z)} \\
&= \sum_{(j,p) \in \mathcal{L}(T)} \sum_{y,z \in \mathcal{Y}} e(j,p,y) \log \frac{e(j,p,y)}{e(j,p,z)} \\
&= \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X); Y),
\end{aligned}$$

where the leaf functional is

$$F(E^p_j(X); Y) = \sum_{y,z \in \mathcal{Y}} e(j,p,y) \log \frac{e(j,p,y)}{e(j,p,z)}. \tag{20}$$
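The KLD fidelity and its additive decomposition over the leaves can be illustrated with a small numerical sketch. The input layout (one row of leaf energies per speech segment, one column per leaf of the tree), the synthetic data, and the function names `energy_map`, `kld`, and `kld_leaf_fidelity` are assumptions made for the example; the normalization uses the fact that the leaf energies of a segment sum to its squared norm.

```python
import numpy as np

def energy_map(E, labels):
    """Class-conditional normalized energy signatures, as in (17)-(18).

    E: array (N, M) with the energy of each of N segments in each of M leaves.
    Returns a dict mapping each class y to the vector (e(j,p,y)) over the leaves.
    """
    E = np.asarray(E, dtype=float)
    labels = np.asarray(labels)
    E_norm = E / E.sum(axis=1, keepdims=True)      # E_j^p(x) / ||x||^2
    return {y: E_norm[labels == y].mean(axis=0) for y in np.unique(labels)}

def kld(p, q):
    """Discrete Kullback-Leibler divergence D(p || q)."""
    return float(np.sum(p * np.log(p / q)))

def kld_leaf_fidelity(e):
    """Per-leaf functional (20): sum over all ordered class pairs (y, z)."""
    classes = list(e)
    return sum(e[y] * np.log(e[y] / e[z]) for y in classes for z in classes)

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=60)               # three synthetic phone classes
E = rng.uniform(0.1, 1.0, size=(60, 4)) + 0.5 * np.eye(4)[labels]

e = energy_map(E, labels)
per_leaf = kld_leaf_fidelity(e)                    # one value per leaf
total = sum(kld(e[y], e[z]) for y in e for z in e) # definition (19)
print(np.isclose(per_leaf.sum(), total))
```

Individual leaf terms may be negative even though their sum, a sum of divergences, is not; this is why the selection algorithm works with the gain of splitting a node rather than with single-leaf values in isolation.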

5.2.2. Parametric version of the mutual information: Fisher fidelity estimate

The second indicator is the mutual information (MI) adopted in Silva and Narayanan (2009). Assuming the Markov tree property presented in Prop. 3 (Silva and Narayanan, 2009), the functional is affine (Silva and Narayanan, 2009, Th. 3). To simplify the estimation, we assume that the class conditional distributions are Gaussian, in which case MI reduces to a version of the Fisher discriminative indicator (Silva and Narayanan, 2007; Padmanabhan et al., 2005) proposed by Etemad and Chellapa (1998). More precisely, let the energy vector of a signal $x_i$ in the tree $T$ be given by $m_T(x_i) = (E_j^p(x_i))_{(j,p) \in T}$, and let $\hat{P}(\{y\}) = N_y / N$ denote the class probability mass, $\forall y \in \mathcal{Y}$. Assuming that the class conditional probability of $m_T(X)$ is a multivariate Gaussian distribution, the maximum likelihood estimators of its mean and covariance are

$$
\hat{\mu}_y = \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{I}_{\{y\}}(y_i)\, m_T(x_i) \qquad (21)
$$

and

$$
\Sigma_y = \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{I}_{\{y\}}(y_i)\, (m_T(x_i) - \hat{\mu}_y)(m_T(x_i) - \hat{\mu}_y)^{\dagger}, \qquad (22)
$$

respectively. The unconditional mean estimator is $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} m_T(x_i)$. Now we can define the within-class scatter matrix $S_w$ for the tree $T$ by

$$
S_w(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\})\, \Sigma_y, \qquad (23)
$$

and the between-class scatter matrix by

$$
S_b(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\})\, (\hat{\mu}_y - \hat{\mu})(\hat{\mu}_y - \hat{\mu})^{\dagger}. \qquad (24)
$$
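The estimators in (21)-(24) can be sketched for the simplest case, the one-node tree $\{(j,p)\}$, where the energy vector is a scalar and $\mathrm{tr}(S_w^{-1} S_b)$ reduces to the ratio $S_b / S_w$. The Python snippet below is only an illustration under that scalar assumption; the function name and the input layout (one energy value and one label per training segment) are hypothetical.

```python
from collections import defaultdict


def fisher_ratio_1d(energies, labels):
    """Scalar version of tr(S_w^{-1} S_b) for a one-node tree.

    energies[i] is the sub-band energy of segment i and labels[i] its
    phone label. Class priors are N_y / N and the variances use the
    maximum likelihood normalisation 1 / N_y, as in (21)-(24).
    """
    n = len(energies)
    by_class = defaultdict(list)
    for e, y in zip(energies, labels):
        by_class[y].append(e)

    mu = sum(energies) / n          # unconditional mean
    s_w = 0.0                       # within-class scatter (scalar)
    s_b = 0.0                       # between-class scatter (scalar)
    for values in by_class.values():
        prior = len(values) / n
        mu_y = sum(values) / len(values)
        var_y = sum((v - mu_y) ** 2 for v in values) / len(values)
        s_w += prior * var_y
        s_b += prior * (mu_y - mu) ** 2
    return s_b / s_w                # undefined if every class has zero variance
```

For example, energies `[0, 2, 4, 6]` with labels `a, a, b, b` give class means 1 and 5, a within-class scatter of 1 and a between-class scatter of 4. The multivariate case needed for the two-leaf tree $t_v$ would replace the scalar division with a matrix inverse and a trace.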

Finally, for a rooted binary tree $T$, the fidelity functional of its leaf $(j,p)$ is consequently defined by

$$
F(E_j^p(X), Y) = \mathrm{tr}\big(S_w^{-1}(t_v)\, S_b(t_v)\big) - \mathrm{tr}\big(S_w^{-1}(\{(j,p)\})\, S_b(\{(j,p)\})\big). \qquad (25)
$$

In this context, $t_v$ is the binary tree rooted at $v = (j,p)$ with leaves $(j+1, 2p)$ and $(j+1, 2p+1)$ (see Fig. 2a), and $\{(j,p)\}$ is the one-node tree.

5.2.3. Energy fidelity estimate

Finally, as a non-discriminative indicator, we consider the average subspace energy proposed in Chang and Kuo (1993), i.e.,

$$
F(E_j^p(X), Y) = \frac{1}{N} \sum_{i=1}^{N} E_j^p(x_i). \qquad (26)
$$

With the average energy fidelity measure in (26), the algorithm to solve (15), presented in Section 5.3, splits the leaf of the tree $T^k$ with the highest average energy to find $T^{k+1}$, the solution of order $k+1$.

5.3. Minimum cost tree pruning algorithm

To conclude this section, a dynamic programming (DP) algorithm to solve (15) is presented. We refer the interested reader to Scott (2005), Chou et al. (1989), Bohanec and Bratko (1994), Breiman et al. (1984) and Silva and Narayanan (2009) for a systematic exposition of the computational complexity, as well as the theoretical results, of this algorithm.

Phase 0: (Choice of parameters) Choose a specific CMF pair $h_0, h_1$, a maximum level of decomposition $J$ and a fidelity functional $F$.

Phase 1: (Computation: subband measurements and fidelity gain)
– $\forall j \in \{0, \ldots, J-1\}$, $\forall p \in \{0, \ldots, 2^j - 1\}$ compute: $E_j^p(x_i)$, $\forall x_i \in X$
– $\forall j \in \{0, \ldots, J-2\}$, $\forall p \in \{0, \ldots, 2^j - 1\}$ compute the fidelity gain $\Delta(j,p)$:
  if ($F$ is the KLD functional)
    $\Delta(j,p) = F(E_{j+1}^{2p}(X), Y) + F(E_{j+1}^{2p+1}(X), Y)$
  else
    $\Delta(j,p) = F(E_j^p(X), Y)$
  end

Phase 2: (Initialization) Initialize $T^2 = \{(0,0), (1,0), (1,1)\}$, then $\mathcal{L}(T^2) = \{(1,0), (1,1)\}$.

Phase 3: (Iteration)
for $k = 2$ to $k = 2^J - 2$
  1. compute: $(j^*, p^*) = \arg\max_{(j,p) \in \mathcal{L}(T^k):\, j \le J-1} \Delta(j,p)$
  2. save: $T^{k+1} = T^k \cup \{(j^*+1, 2p^*), (j^*+1, 2p^*+1)\}$
end

6. Analysis of filter bank solutions

The TIMIT corpus was adopted for all the experiments presented in this work. TIMIT is one of the standard corpora used to evaluate new methods and techniques in ASR, mainly because it is a phonetically balanced task and has good coverage of speakers and dialects. These properties make TIMIT a sufficiently challenging corpus with which to evaluate new ASR methods, which justifies its wide adoption by the community. The TIMIT corpus consists of 6300 utterances covering the 8 major dialects of the United States. There are 630 different speakers, each one speaking 10 sentences. TIMIT phonetic transcriptions contain 64 phonetic classes, from which we have adopted the standard folding proposed in Lee and Hon (1989), which reduces the number of phonetic classes to 39 plus the silence model.

The training set proposed in the TIMIT corpus was used to extract supervised data for the tree-pruning stage in Section 5.1. More precisely, we used the phonetic segmentations and labels of the TIMIT database, folded into 39 classes, to select the supervised training data. For each phone-segmented signal, we took three 20 ms segments, from the left, center, and right positions of the signal, and we considered those as realizations of the phoneme. With this data, we computed the fidelity measures presented in Section 5.2, i.e., the Fisher, the symmetric KLD, and the Energy tree functionals, respectively. Finally, those measures were used to create the filter-bank solutions by solving the pruning problem in (14) and (15).

In addition, we have adopted four different pairs of two-channel filters (TCFs) (see Section 3.3), associated with the Daubechies (DB) Wavelets (Daubechies, 1992; Mallat, 2009; Vetterli and Kovacevic, 1995) of order 6, 12, 24 and 44, respectively. With these we have good coverage of frequency selectivity properties to obtain a fairly representative family of WP filter-bank solutions. It is important to point out that frequency selectivity was one of the key dimensions considered in this analysis.
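The filter-bank solutions mentioned above come from the iteration of Section 5.3, which repeatedly splits the leaf with the largest fidelity gain. A minimal Python sketch of that greedy step is given below. It assumes the gains $\Delta(j,p)$ have already been computed and stored in a dictionary; the function name, the `max_depth` and `n_leaves` parameters and the stopping rule are illustrative assumptions rather than the authors' implementation.

```python
def greedy_wp_tree(gain, max_depth, n_leaves):
    """Grow a WP tree by splitting the leaf with the largest gain.

    gain      -- dict mapping a node (j, p) to its fidelity gain (assumed given)
    max_depth -- deepest level allowed for the children of a split (assumption)
    n_leaves  -- number of bands requested for the filter bank (assumption)
    Returns the leaves (j, p) ordered by their position p / 2**j.
    """
    leaves = [(1, 0), (1, 1)]  # leaves of the initial tree T^2
    while len(leaves) < n_leaves:
        # Only leaves whose children stay within max_depth can be split.
        candidates = [leaf for leaf in leaves if leaf[0] < max_depth]
        if not candidates:
            break
        j, p = max(candidates, key=lambda leaf: gain.get(leaf, float("-inf")))
        leaves.remove((j, p))
        leaves += [(j + 1, 2 * p), (j + 1, 2 * p + 1)]
    return sorted(leaves, key=lambda leaf: leaf[1] / 2 ** leaf[0])
```

Each pass adds exactly one band, so the sequence of trees is nested: the solution with $k+1$ bands always refines the solution with $k$ bands. Nodes missing from `gain` are treated as having the lowest possible gain.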

[Figure 7: three panels. Horizontal axis: index of frequency bands. Vertical axis: depth of the WP decomposition (scale index).]

Fig. 7. Distribution of the KLD fidelity gains $\Delta(j,k)$ indexed by the scale $j$ (vertical axes) and frequency location $k$ (horizontal axes), considering the frequency-ordered WP sub-space decomposition structure. A whiter color indicates a higher fidelity gain.

6.1. Analysis of fidelity gains across scale and frequency location

In this section we report the sensitivity of the WP filter-bank selection algorithm to the frequency selectivity, which is proportional to the order of the Daubechies TCF (DB-TCF) (Mallat, 2009). For that purpose, we have analyzed the fidelity gains across scale and position, represented by a scale index $j$ and a frequency localization index (position) $k$. We compared the fidelity gains of iterating the TCF (see Section 3.1) for the three fidelity functionals (Fisher, KLD and Energy). Fig. 7 shows the KLD-based gains of decomposing a frequency-ordered node $(j,k)$ (associated with a WP subspace) for the DB-TCF of orders 6, 12 and 24. As expected, higher discriminative gains are obtained in the low frequency domain. It is important to note in the figure that the KLD gain structure is not very sensitive to the order of the TCF, and tends to stabilize as the order (frequency selectivity) increases. This stability phenomenon was also observed with the Fisher-based gains, as well as with the Energy gains, although each of them has a particular fidelity gain structure, as shown in Fig. 8. This shows that the frequency selectivity does not imply a major change in the fidelity gains and, consequently, in the filter-bank tree-structures obtained from solving the minimum cost tree-pruning problem in (15).

On the other hand, Fig. 8 illustrates the gains for the three fidelity criteria with the DB-TCF of order 44 (the highest selectivity). Interestingly, all the plots show that the salient information for discriminating phonemes, relative to the fidelity measure adopted, is localized in the low frequency domain. Consequently, the solutions of the optimal tree-pruning problem offer structures that give priority to iterating the TCF in this frequency range. In this regard, the non-discriminative criterion in Fig. 8 shows only minor differences with respect to the discriminative criteria in Fig. 8c and b. However, these differences are sufficient to characterize a particular way of zooming in on the lower frequency region of the acoustic space. These zooming patterns could potentially imply some marginal but important differences in ASR recognition performance, as we shall see in the following sections.

6.2. Analysis of the filter-bank frequency responses

In order to contrast the filter-bank solutions induced from different frequency selectivity conditions, Fig. 9 shows the equivalent filter-bank frequency responses obtained with the DB-TCF of orders 6 and 44, respectively. Confirming our previous analysis, the frequency selectivity does not significantly affect the structure of the filter-bank solutions, i.e., the way of iterating the TCF. This can be observed in the main lobes of the solutions, which are centered at the same frequencies when comparing solutions with the same number of frequency bands (rows of Fig. 9). In fact, the solutions of size 6 (Fig. 9) and size 14 (Fig. 9) have the same tree topology; however, their frequency supports are clearly different.

Concerning the frequency support, the trend is the following: the family of DB Wavelets converges to the Shannon Wavelets as the order of the TCF increases,6 so the frequency supports of the filter-banks converge to the Shannon WP partitions in (8). Alternatively, for any order of the TCF, the frequency support of a subspace with arbitrarily large depth (scale) gets narrower following the Shannon WP frequency support, which in the limit converges to a fixed frequency point. Details of this result are presented in Section 3.2 (Atto et al., 2007; Atto et al., 2010). For our finite scale regime, the higher the order of the DB-TCF, the closer we are to the Shannon frequency partition in Section 4. Hence, by increasing the order of the TCF, the frequency bands are more clearly localized and the overlap between adjacent bands, or what we called between-band interference, is reduced.

6 A systematic exposition of this fact is presented in Shen and Strang (1996) and Shen and Strang (1998).

[Figure 8: three panels. Horizontal axis: index of frequency bands. Vertical axis: depth of the WP decomposition (scale index).]

Fig. 8. Fidelity gains $\Delta(j,k)$ indexed by the scale $j$ (vertical axes) and frequency location $k$ (horizontal axes), considering the frequency-ordered WP subspace decomposition structure. The Daubechies filter of order 44 is considered and the results are presented for the three methods. A whiter color indicates a higher fidelity gain.

Associated with each frequency-ordered leaf $(j,k)$ of a given WP tree, we have its main lobe centered in the frequency range $I_j^k$ in (8). However, there are also secondary lobes with significant gains, which are not necessarily adjacent to the target band $I_j^k$, in particular for solutions with a small TCF order. This phenomenon produces a very complex interference pattern, as illustrated in Fig. 9. Interpreting these results, the projection onto the subspace associated with a given WP node $(j,k)$ contains information from its target Shannon band $I_j^k$, from the neighboring bands of $I_j^k$ and, less intuitively, from undetermined non-adjacent bands because of the gains of the secondary lobes, as illustrated in Fig. 9. The good news is that those secondary-interference lobes vanish as the frequency selectivity increases. These asymptotic trends have a formal justification in the fact that the DB WPs converge to the Shannon WPs as the TCF order tends to infinity (Shen and Strang, 1996, 1998).

Finally, Fig. 10 shows the frequency responses of the equivalent filter-banks obtained with a discriminative and a non-discriminative method. We use the DB-TCF of order 44 to induce filter-banks with clearer structures and reduced side-lobe interference. As was illustrated in Figs. 7 and 8, the pruned solutions offer higher resolution in the low frequency region. In general, the M-channel filter-bank solutions of the same size are similar (rows of Fig. 10), but as we increase the number of bands, some minor differences can be observed. In conclusion, for a clean acoustic speech process, the filter-banks obtained are largely independent of the pruning method, and no major contrast is observed between the use of a discriminative and a non-discriminative criterion. This confirms the preliminary results obtained in Silva and Narayanan (2009), where it was claimed that the acoustic speech process is an optimal design, in the sense that it allocates energy in the frequency bands that offer higher frequency discrimination. These results are based on short-time (frame-by-frame) information analysis of acoustic speech processes to discriminate phonemes, and do not consider, for instance, a noisy scenario or higher-level contextual information, where alternative trends could be observed.
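To make the notion of an equivalent filter-bank frequency response more concrete, the sketch below cascades a two-channel filter pair along a path of the WP tree and evaluates the magnitude response of the resulting filter. It is only an illustration: it uses the 4-tap Daubechies pair as a stand-in for the higher-order DB-TCFs used in the paper, the path is a sequence of low-pass/high-pass choices in natural (not frequency) order, and all function names are hypothetical.

```python
import cmath
import math

SQ3, SQ2 = math.sqrt(3.0), math.sqrt(2.0)
# 4-tap Daubechies low-pass filter (stand-in for the paper's DB-TCFs).
H0 = [(1 + SQ3) / (4 * SQ2), (3 + SQ3) / (4 * SQ2),
      (3 - SQ3) / (4 * SQ2), (1 - SQ3) / (4 * SQ2)]
# High-pass filter obtained by the alternating flip of the low-pass filter.
H1 = [(-1) ** n * H0[len(H0) - 1 - n] for n in range(len(H0))]


def upsample(h, factor):
    """Insert factor - 1 zeros between consecutive taps."""
    out = [0.0] * ((len(h) - 1) * factor + 1)
    out[::factor] = h
    return out


def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def equivalent_filter(path):
    """Cascade H_b(z^(2^level)) for each choice b (0 = low, 1 = high) in path."""
    h_eq = [1.0]
    for level, bit in enumerate(path):
        h_eq = convolve(h_eq, upsample(H1 if bit else H0, 2 ** level))
    return h_eq


def magnitude_response(h, n_points=256):
    """|H(e^{jw})| on a uniform grid of w in [0, pi]."""
    grid = [math.pi * m / (n_points - 1) for m in range(n_points)]
    return [abs(sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h)))
            for w in grid]
```

With this pair, the two-level low-pass path `[0, 0]` has gain 2 at zero frequency and a null at the highest frequency, while any side lobes appear in between. Repeating the computation with a longer, more selective filter pair is the kind of comparison summarized in Fig. 9.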

7. Phone recognition experiments

The analysis made in this work considered a number of degrees of freedom for acoustic FE, such as: the fidelity measure for the filter-bank selection problem presented in Section 5.1 (and, therefore, the set of embedded tree-structured WP filter-banks); the frequency selectivity of the TCF; the filter-bank size; and the feature space dimension. As presented in previous sections, we induce the WPCCs by: first, selecting an M-channel WP filter-bank; second, deriving the frequency-ordered energy coefficients; and finally, applying the DCT for de-correlation as well as for dimensionality reduction (Quatieri, 2002) by choosing the first m < M transformed DCT coefficients. The resulting WPCC features are these m Cepstral coefficients plus the log-energy of the frame.

The experiments are conducted in a sequence of incremental steps. First, we start the analysis with a simplified mono-phone recognition task that does not consider contextual information appended to the WPCC feature vector, i.e., delta and acceleration coefficients. This initial phase is designed to explore the feature space dimension (number of Cepstral coefficients) and the WP tree size (number of bands) in order to define an initial range of values to be explored in the more complex settings. This analysis is conducted under different frequency selectivity conditions for the TCF, and for all the fidelity measures. We then expand the analysis, enriching the feature vector with delta and acceleration coefficients, under the same mono-phone recognition task, to see if we observe similar trends. For that we re-run the phone recognition experiments in the range of values obtained in the previous phase. Finally, we run a state-of-the-art phone recognition experiment considering context-dependent HMM acoustic phone models (tri-phones) with a bigram language model. As a benchmark in all the phases mentioned, we have chosen the standard MFCC features computed with the 22-channel Mel filters, adopting the first 12 Cepstral coefficients plus the frame log-energy as the feature vector.
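The three-step construction of the WPCC features described above (band energies, DCT, truncation to m coefficients plus the frame log-energy) can be sketched as follows. This is a minimal illustration under stated assumptions: the band energies are taken as given and strictly positive, the DCT is the orthonormal DCT-II, the first m coefficients include the zeroth one, and the frame log-energy is computed as the logarithm of the summed band energies. None of these conventions is confirmed by the text, and the function names are hypothetical.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a real sequence."""
    m = len(x)
    out = []
    for n in range(m):
        scale = math.sqrt(1.0 / m) if n == 0 else math.sqrt(2.0 / m)
        acc = sum(v * math.cos(math.pi * n * (k + 0.5) / m)
                  for k, v in enumerate(x))
        out.append(scale * acc)
    return out


def wpcc(band_energies, n_ceps):
    """Return n_ceps cepstral coefficients followed by the frame log-energy.

    band_energies -- frequency-ordered, strictly positive sub-band energies
    n_ceps        -- number m < M of DCT coefficients kept (assumed to
                     start at index 0)
    """
    log_e = [math.log(e) for e in band_energies]
    ceps = dct_ii(log_e)[:n_ceps]
    frame_log_energy = math.log(sum(band_energies))  # assumed definition
    return ceps + [frame_log_energy]
```

For a 24-band filter bank and `n_ceps = 12` this yields a 13-dimensional static feature vector, the same size as the MFCC benchmark described above. A flat spectrum puts all the cepstral mass in the first coefficient, which is the de-correlation property the DCT is used for.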

[Figure 9: two columns of panels. Horizontal axis: normalized frequency. Vertical axis: amplitude gain.]

Fig. 9. Frequency response of the Wavelet Packet filter-bank solutions. The solutions were obtained with Daubechies filters of order 6 (left column) and 44 (right column), respectively. Plots are normalized over the interval [0, 8 kHz].

In general, for each speech segment, we computed the MFCC and WPCC features using a Hamming window of 32 ms with a frame rate of 10 ms. The ASR system was implemented with the HTK toolbox (Young, 2009), where for each phone acoustic model we adopted the standard 5-state hidden Markov model (HMM) (Rabiner, 1989) with 3 emitting states, the standard left-to-right topology, and a 16-component Gaussian mixture as the observation distribution (Rabiner, 1989). We used the steps proposed in the TIMIT documentation to train all models in this work, and the Core-test set of the TIMIT corpus was used to obtain the ASR performance figures.
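The short-time analysis setting above (32 ms Hamming window, one frame every 10 ms) can be sketched in a few lines of Python. The snippet assumes a 16 kHz sampling rate, which is consistent with the [0, 8 kHz] range of the plots but is not stated in this passage. The function names are hypothetical, and any trailing samples that do not fill a full window are simply dropped.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(x, fs=16000, win_ms=32, hop_ms=10):
    """Split x into overlapping Hamming-windowed frames.

    fs is an assumed sampling rate; win_ms and hop_ms follow the text.
    """
    win = int(fs * win_ms / 1000)   # 512 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)   # 160 samples at 16 kHz
    w = hamming(win)
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        frames.append([s * c for s, c in zip(x[start:start + win], w)])
    return frames
```

Under these assumptions one second of audio produces 97 frames of 512 samples each, and each frame is then passed to the filter bank to obtain the sub-band energies.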

[Figure 10: two columns of panels. Horizontal axis: normalized frequency. Vertical axis: amplitude gain.]

Fig. 10. Frequency response of the Wavelet Packet filter-bank solutions. The figures show a comparison between non-discriminative and discriminative criteria, Energy (left column) and KLD (right column), respectively. The solutions were obtained with Daubechies filters of order 44. Plots are normalized over the interval [0, 8 kHz].

7.1. Context-independent phone recognition experiments

The pruning solutions of size 24 obtained from the three fidelity functionals (KLD, Fisher and Energy) are presented here. The acoustic features are the WPCCs plus log-energy with a fixed number of bands, where we varied the number of Cepstral coefficients from 6 to 24 to gain insight into the most appropriate dimension for the feature space. In this context, Fig. 11a shows the performance trends of the Fisher fidelity WPCC solutions across the feature space dimension and for different frequency selectivity conditions, given by the order of the DB-TCF (db6, db12, db24 and db44). In each of these performance curves, the curse of dimensionality is observed, as expected. There is an initial increasing trend in performance that later saturates and decreases, attributed to the well-understood estimation error phenomenon present in this learning-decision problem. The results show an optimal range for the feature space dimension starting approximately at dimension 11 and ending approximately at dimension 19. This good range of feature dimensions is practically invariant when we increase the number of bands and the frequency selectivity of the filter-bank solutions. This behavior is also consistent with the other two fidelity measures, KLD and Energy, exemplified in Fig. 11b and c for the WPCC filter-bank solutions of 24 bands in each case.

Considering the good range of feature dimensions obtained in the previous set of experiments, we fixed one of them, dimension 13 (12 Cepstral coefficients plus log-energy), to show the performance trend with respect to the number of bands of the WP filter-bank solutions (WP tree size). The experiments again consider all fidelity measures and TCF orders (db6, db12, db24 and db44). Fig. 12 shows these trends. Again we observed a performance trend that increases, then saturates, and finally decreases as we explore WP filter-bank solutions with an increasing number of bands. Since in this case the feature dimension is fixed, this trend cannot be attributed to the curse of dimensionality and, consequently, has to do with the acoustic discrimination power of the filter-bank solutions. From these results we conclude that a good range of exploration for the number of bands is from 18 to 26.

Before we change the focus to the next set of experiments, a couple of remarks should be made. It is very interesting to observe the trend with respect to the frequency selectivity in the obtained results, Figs. 11 and 12. In

[Figure 11: three panels (a)-(c). Horizontal axis: number of Cepstral coefficients (6 to 24). Vertical axis: % accuracy on the Core-test set. Curves: db44, db24, db12, db6 and the MFCCE baseline.]

Fig. 11. Recognition accuracies in the Core-test set as a function of the number of Cepstral coefficients for a fixed size of WP filter-bank (number of bands) and static features. Effect of frequency selectivity for the Fisher functional filter-banks of size 24 (11a), KLD functional filter-banks of size 24 (11b), and Energy functional filter-banks of size 24 (11c).

[Figure 12: three panels (a)-(c). Horizontal axis: number of bands (14 to 30). Vertical axis: % accuracy on the Core-test set. Curves: db44, db24, db12, db6 and the MFCCE baseline.]

Fig. 12. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for a fixed number of 12 Cepstral coefficients and static features. Effect of frequency selectivity for the Fisher functional filter-banks (a), KLD (b) and Energy (c).

almost all cases, increasing the frequency selectivity provides better performance for any given dimension, filter-bank size, and fidelity measure adopted. This supports our conjecture that inter-band interference is something to be avoided for acoustic discrimination and, consequently, that better performance can be achieved by increasing the order of the DB-TCF in our context. This is congruent with some of the results presented in Choueiter and Glass (2007) for the case of a simplified phone-segmented classification task. It is also important to note that we have already obtained concrete settings for our WPCCs that outperform the standard MFCC features under the same scenario, which does not consider contextual information in the acoustic features. In this mono-phone recognition task, this benchmark has 44.87% recognition accuracy.

Finally, we add delta and acceleration coefficients to the analysis. It is well understood that dynamic features improve recognition rates, but it is interesting to observe their particular effects on our WP filter-bank features. We consider a similar set of scenarios (number of bands, number of Cepstral coefficients) to explore the effect of the frequency selectivity and the fidelity criterion. Fig. 13 shows recognition accuracies as a function of the number of bands for a given fixed Cepstral feature dimension in the set {11, 12, 13, 14}, which maps to a feature vector of dimensions {36, 39, 42, 45}, respectively, and with the maximum order (frequency selectivity) in the TCF. In general, the best set of results is obtained in the range of 20–26 bands, as illustrated in Fig. 13. In addition, out of this range, the Energy fidelity criterion systematically shows the best performance curves and, consequently, the most competitive results with respect to the standard MFCCs (39-dimensional feature vector), which have a baseline accuracy of 55.3%. In spite of that, the best result is obtained with the KLD fidelity

[Figure 13: four panels. Horizontal axis: number of bands (14 to 26). Vertical axis: % accuracy on the Core-test set. Curves: KLD db44, Energy db44, Fisher db44 and the MFCCEDA baseline.]

Fig. 13. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed numbers of Cepstral coefficients, adding delta and acceleration features. Comparison of the solutions obtained for all pruning methods and the highest frequency selectivity considered (DB 44).

measure, with the solution of 22 bands and 12 Cepstral coefficients (a 39-dimensional feature vector) shown in Fig. 13b, which reaches a recognition accuracy of 55.36%. Fig. 14, on the other hand, revisits the effect of the frequency selectivity on the recognition accuracy for the KLD and Energy based solutions with 12 Cepstral coefficients. This confirms that the higher order DB-TCF achieves the best performance. Finally, Table 1 presents the gain of adding delta and acceleration coefficients to the feature vector. This gap increases with the frequency resolution of the WP filters, reaffirming the advantage of adopting higher order TCFs for this task.

7.2. Context-dependent phone recognition experiments

Finally, we evaluate performance in the standard phone recognition task that considers context-dependent HMMs, Cepstral acoustic features plus delta and acceleration, and a bi-gram language model. For this, we focus the analysis on the range of 20–26 bands, and on Cepstral feature dimensions in the neighborhood of 13 coefficients. This is the range of values with good performance observed in the previous set of experiments. Fig. 15 shows recognition accuracies as a function of the number of Cepstral coefficients. Here we report the best trends, observed for the case of 24 and 26 filter-bank bands with the DB44 TCF. These trends were obtained with 9 to 15 Cepstral coefficients, i.e., feature space dimensions from 30 to 48. The estimation-approximation error trade-off can be observed, as expected; however, these trends are different from those in the context-independent case shown in Fig. 11. The reason is that, in this context, the number of models is larger, as is the number of model parameters to be estimated, but the training data remains the same. This causes the estimation error to dominate the approximation error earlier, in lower dimensional feature spaces, with respect to the results shown in Fig. 11.
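The feature dimensions quoted above follow from stacking static, delta and acceleration coefficients: m Cepstral coefficients plus log-energy give m + 1 static features, and the dynamic features triple that number (for example, 9 gives 30 and 15 gives 48). The Python sketch below illustrates this with a common regression formula for deltas over a window of two frames on each side, with edge frames replicated. The window size, the edge handling and the function names are assumptions, since the paper does not spell them out here.

```python
def deltas(frames, theta=2):
    """Regression-based delta coefficients for a list of feature vectors.

    theta is the half-width of the regression window (assumed value);
    frames beyond either end are replaced by the first or last frame.
    """
    t_max = len(frames) - 1
    denom = 2.0 * sum(th * th for th in range(1, theta + 1))
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            num = sum(th * (frames[min(t + th, t_max)][d]
                            - frames[max(t - th, 0)][d])
                      for th in range(1, theta + 1))
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(frames, theta=2):
    """Append delta and acceleration coefficients to each static vector."""
    d1 = deltas(frames, theta)
    d2 = deltas(d1, theta)   # acceleration = delta of the deltas
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

With 13 static features per frame the stacked vector has 39 entries, the dimension of the MFCC baseline used in these experiments.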

[Figure 14: two panels (a) and (b). Horizontal axis: number of bands (14 to 30). Vertical axis: % accuracy on the Core-test set. Curves: db44, db24, db12, db6 and the MFCCEDA baseline.]

Fig. 14. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for a fixed number of 12 Cepstral coefficients, adding delta and acceleration features. Comparison of solutions at different frequency selectivity for Energy (a) and KLD (b).

Table 1
Average gains in recognition accuracy when passing from WPCCE to WPCCEDA acoustic features. Accuracies obtained in a scenario with 12 Cepstral coefficients plus log-energy and a number of bands from 14 to 30. The first row shows the average recognition accuracy of static features in the four Daubechies Wavelet scenarios for the KLD, Fisher and Energy solutions. The second row shows the accuracy obtained when running the same task using delta and acceleration features, and the third row shows the accuracy gain.

            DB6 (%)    DB12 (%)    DB24 (%)    DB44 (%)
WPCCE       42.09      43.29       43.97       44.26
WPCCEDA     51.19      53.13       54.2        54.5
Gain        9.1        9.84        10.22       10.24

We observed again that the Energy and the KLD methods offer the best performance trends, which is consistent with the previous context-independent phone recognition results. The best two performances are achieved with the Energy functional, in the scenario with 26 bands and 11 Cepstral coefficients (68.04%) and with 24 bands and 11 Cepstral coefficients (68.09%), Fig. 15b, respectively. Those results are very competitive with the state-of-the-art MFCC features, with a baseline of 67.28%, where, in fact, they offer a relative improvement of 1.2% in the best case tested. To conclude this analysis, the equivalent filter-banks of the Energy solutions with 24 and 26 bands are presented in Fig. 16a and b, respectively. The Mel scale has a linear-uniform frequency partitioning in the lower frequency range and moves to a uniform logarithmic partitioning in the rest (Quatieri, 2002). Following this trend, our best two solutions, shown in Fig. 16a and b and in Table 2, offer an

[Figure 15: two panels. Horizontal axis: number of Cepstral coefficients (9 to 15). Vertical axis: % accuracy on the Core-test set. Curves: KLD db44, Energy db44, Fisher db44 and the MFCCEDA baseline.]

Fig. 15. Phone recognition accuracies with context-dependent phone models as a function of the number of Cepstral coefficients, considering delta and acceleration features.

approximately uniform partition (with the same bandwidth) in the interval [0, 1 kHz] and then an increasing bandwidth from 2 Hz to 8 kHz, as depicted in Table 2. Hence, as expected, our data-driven WP ﬁlter-bank solutions oﬀer, in general, the Mel frequency partition type of structure.

7.3. Final analysis

Finally, our solutions are compared with two state-of-the-art dyadic WP based features for ASR. In particular, we implemented the 24 and 26 band WP energy signatures considered by Farooq and Datta (2001) and Choueiter and Glass (2007), respectively. The ideal frequency partitions of those WP solutions are shown in Tables 3 and 4, respectively. In Farooq and Datta (2001), the FE is implemented with the Daubechies TCF of order 6 (DB6), considering a vector of 13 Cepstral coefficients. On the other hand, the acoustic features proposed in Choueiter and Glass (2007) (for the case of dyadic WP) were obtained from the concatenation of 26 log-energy vectors plus dynamic features obtained at the phone segmental level, where, at the end of this process, principal component analysis (PCA) was used to reduce the dimensionality of the resulting vector, targeting a phone-segmented classification task. Their dyadic WPs were implemented using Daubechies (DB) TCFs of orders 4, 6, 10 and 12, respectively.

To contextualize these solutions in our time-series phone recognition scenario and to make them comparable with our solutions, we only consider their WP filter-bank structure. More precisely, we consider the binary-tree topologies of the WP bases with their respective dyadic partitions of the frequency space and their induced WPCCs plus dynamic features (delta and acceleration), based on the general scheme presented in Section 2. The accuracies obtained for the 24 and 26 band WP solutions with DB6 were 63.37% and 61.19%, respectively. For the 26 band WP, increasing the order of the TCF to DB12 improves the performance to 64.59%, which is consistent with our previous analysis of frequency selectivity. Because of this trend, we also tried the unexplored DB44 for the 24 and 26 band solutions, obtaining improved accuracies of 66.45% and 66.33%, respectively. In general, all these results are below the MFCC baseline of 67.28% for this phone-recognition task and, in consequence, below the best performance of 68.09% reported for the WPCCs, even in the scenario in which we match the filter orders by adopting DB44.

In terms of the filter-bank structure, our best data-driven solution with 24 bands, presented in Table 2, offers frequency bands similar to those adopted in Farooq and Datta (2001) and Choueiter and Glass (2007), presented in Tables 3 and 4, respectively. The reason, again, is that our solution follows the general structure of the Mel scale. However, it is important to emphasize the minor structural mismatches to explain the performance differences among the WP solutions. On this point, the entries in bold in Tables 3 and 4 indicate the bands that differ, in terms of bandwidth or frequency support, from our best solution shown in Table 2. In particular, the 24 band solution in Table 3 has different frequency partitions in the intervals [0, 250 Hz], [1000 Hz, 1500 Hz] and [3000 Hz, 5000 Hz]. The same comparison can be made for the 26 band WP of Table 4, where the differences are concentrated in the [0, 250 Hz] and [5000 Hz, 6000 Hz] regions. It is worth mentioning that the 26 band WP can be generated from our 24 band solution by splitting the (6,0) and (3,5) leaves; therefore, the structural differences are minor, but important enough to induce particular feature attributes for the task.

8. Summary, discussion and final remarks

This work proposes the Wavelet-Packet Cepstral coefficient (WPCC) as a dynamic filter-bank structure to perform short-time (frame-by-frame) acoustic analysis for ASR. A collection of log-energy based acoustic signatures with different time-frequency resolutions was derived, extending the conventional MFCC scheme. In the process, the filter-bank properties and basis structure of Wavelet Packets (WPs) were fully considered, where the interpretation of WP as a filter-bank analysis scheme was put into the frame-by-frame acoustic analysis context. In particular, the equivalent filter-bank frequency response of a WP basis was defined, where the Gray code and the concept of


Fig. 16. Frequency response of the ﬁlter-banks with the two highest performances tested. The frequency range is normalized over the interval [0, 8 kHz].

Table 2
Shannon WP frequency partition of the interval [0, 8 kHz] for the filter-bank solution of Fig. 16a. It contains the frequency ordered leaves of the WP tree, i.e., $\{(j, k = g(p)) : (j, p) \in \mathcal{L}(T)\}$, and their respective frequency supports ($I_j^k$) and bandwidths in Hz.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(5, 0)         [0, 250]           250
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000
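The rows of Table 2 follow a simple rule that can be checked numerically. The sketch below is only an illustration, under the assumption that j counts the depth of the leaf in the WP tree so that every band at depth j has width 8000 / 2^j Hz (this matches every row of Tables 2 to 4). The names `shannon_band`, `is_partition` and `TABLE2_LEAVES` are hypothetical and not part of the original work.

```python
from fractions import Fraction

F_MAX_HZ = 8000  # upper edge of the analysed interval [0, 8 kHz]


def shannon_band(j, k, f_max=F_MAX_HZ):
    """Ideal support (lo, hi) in Hz of the frequency-ordered leaf (j, k).

    Assumption: j is the depth in the tree, so the band width is f_max / 2**j.
    """
    width = Fraction(f_max, 2 ** j)
    return (k * width, (k + 1) * width)


# Frequency-ordered leaves (j, k) copied from Table 2.
TABLE2_LEAVES = (
    [(5, 0)]
    + [(6, k) for k in range(2, 8)]
    + [(5, k) for k in range(4, 16)]
    + [(4, 8), (4, 9), (3, 5), (3, 6), (3, 7)]
)


def is_partition(leaves, f_max=F_MAX_HZ):
    """True if the leaf bands tile [0, f_max] with no gap and no overlap."""
    bands = sorted(shannon_band(j, k, f_max) for j, k in leaves)
    covers_ends = bands[0][0] == 0 and bands[-1][1] == f_max
    contiguous = all(a[1] == b[0] for a, b in zip(bands, bands[1:]))
    return covers_ends and contiguous
```

Calling `is_partition(TABLE2_LEAVES)` confirms that the 24 leaves tile the interval, and the same check can be applied to the leaves of Tables 3 and 4.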

Table 3
Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 24 bands considered by Farooq and Datta (2001). It contains the frequency ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(6, 8)         [1000, 1125]       125
(6, 9)         [1125, 1250]       125
(6, 10)        [1250, 1375]       125
(6, 11)        [1375, 1500]       125
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(4, 6)         [3000, 3500]       500
(4, 7)         [3500, 4000]       500
(3, 4)         [4000, 5000]       1000
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000

E. Pavez, J.F. Silva / Speech Communication 54 (2012) 814–835

Table 4
Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 26 bands considered by Choueiter and Glass (2007). It contains the frequency ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(4, 10)        [5000, 5500]       500
(4, 11)        [5500, 6000]       500
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000


filter-bank frequency ordering was revisited. This last point is an important concept that, to the best of our knowledge, has not been treated in previous work on the topic (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996).

The main contribution of this work is the systematic exploration of the problem of WP filter-bank selection to obtain adaptive and nearly optimal energy-based filter-bank signatures for an ASR task. This important dimension of analysis has not been considered in previous studies on Wavelets and WPs for ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In this regard, Farooq and Datta (2001) considered a fixed tree-topology (frequency partition pattern) based on the Mel scale, while in the work of Choueiter and Glass (2007) the objective was to obtain a specific critical-band frequency partition by adopting two previously unexplored filter-bank design methods, as well as rational and dyadic WP filter-banks.

In this work, the filter-bank selection problem was addressed by a complexity regularized criterion, with the objective of modeling the well-understood trade-off between feature discrimination and feature complexity. Three methods were explored to provide a wide range of data-driven filter-bank solutions induced from the proposed WPCC analysis scheme, and the performances of those solutions were evaluated and contrasted. It is worth noting that all the proposed filter-bank selection methods reduce to an equivalent tree-pruning problem with additive or affine functionals, which consequently admit computationally efficient implementations, i.e., a complexity that grows polynomially with the size of the problem (Silva and Narayanan, 2009).

Moving on to the experimental findings, as reported in Section 6, there are only marginal differences in the fidelity gains observed when considering discriminative and non-discriminative fidelity indicators. This implies that the filter-bank solutions obtained show similar structures, where in general they provide increased frequency resolution in the low-frequency range. This is consistent with the well-known fact that the discriminative information of the speech acoustic process is embedded in lower frequency bands, and that the speech production-perception process can be considered an optimal communication design, in the sense that there is more signal energy in the frequency region where more perception (frequency discrimination) is available. On the experimental side, this is demonstrated under concrete experimental conditions and with the standard HMM-based phone recognition task. The energy-fidelity-based WPCC solutions offer the best performance results compared with two discriminative fidelity indicators (Fisher-scatter based, and Kullback-Leibler divergence-based) and, furthermore, they show a number of constructions that outperform the state-of-the-art MFCCs. Interestingly, under clean acoustic conditions, our data-driven frequency selectivity methods offer filter-bank solutions that follow, in general, the structure of the Mel scale, although our approach offers performance improvements


Fig. 17. Commuting relationship between the down-sampler and filtering.

with respect to the state-of-the-art Mel-based energy signatures (MFCCs). In addition, we show that frequency selectivity in the design of the Wavelet Packet filter-banks is a critical dimension of analysis for obtaining good performances. More precisely, the better the selectivity of the two-channel filter (the basic block that constructs the WP basis family), the better the phone recognition performances obtained from their filter-bank solutions, which agrees with some of the findings presented in Choueiter and Glass (2007).

Although the reported ASR performance improvements can be considered to be marginal, the generality of our WPCC construction is worth emphasizing. WPCC acoustic features offer a natural way of extending the MFCC filter-bank analysis paradigm by considering a much more general way of characterizing the filter-bank analysis part. In fact, we provide a way of creating not only a fixed solution, but also a family of embedded filter-bank solutions (and their respective Cepstral energy-based features) with increased frequency discrimination. As we have shown in our experiments, these solutions are adapted to the task, i.e., they offer the optimal estimation-approximation error trade-off, which depends on a number of strongly task-dependent factors. To mention a few of them: the intrinsic acoustic-discrimination complexity of the task (the approximation error part); the modeling assumptions; the number of model parameters; the amount of data (the estimation error part); and the presence of distortion or noise in the training data.

9. Future work

For some applications, it would be beneficial to work with non-optimal parsimonious representations, to save algorithmic complexity at the expense of sacrificing some accuracy. An example of this would be scenarios with communication constraints, or scenarios where the task has a smaller vocabulary, in which the algorithmic complexity associated with Viterbi decoding is a critical issue in the design of an ASR solution. In this context, the proposed embedded filter-bank solutions have the flexibility to address the trade-off between performance and algorithmic complexity. In this regard, we believe that there are a number of directions to be explored with respect to the operational flexibility that the proposed WPCCs offer for ASR applications. Another important future work direction is to evaluate the WPCCs in the problem of robust ASR under different noisy conditions, source coding distortions and channel degradations.
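The trade-off between performance and complexity mentioned above goes back to the complexity-regularized tree pruning summarized in Section 8. As a rough illustration of why an additive functional admits an efficient solution, the sketch below prunes a full binary tree bottom-up, visiting each node once. It is not the authors' implementation: the function `prune`, the toy cost values and the per-leaf penalty `lam` are hypothetical, and only the additive case (a sum of leaf costs plus a penalty per leaf) is covered.

```python
def prune(cost, depth, lam):
    """Bottom-up pruning of a full binary tree rooted at node (0, 0).

    cost  : dict mapping node (j, p) to an additive leaf cost (toy values)
    depth : maximum depth of the full tree
    lam   : penalty paid for every leaf that is kept
    Returns (value, leaves) minimising sum(cost[leaf]) + lam * len(leaves).
    """
    def solve(j, p):
        as_leaf = cost[(j, p)] + lam
        if j == depth:
            return as_leaf, [(j, p)]
        # Optimal sub-trees of the two children are solved independently
        # because the objective is additive over leaves.
        left_val, left_leaves = solve(j + 1, 2 * p)
        right_val, right_leaves = solve(j + 1, 2 * p + 1)
        split_val = left_val + right_val
        if split_val < as_leaf:
            return split_val, left_leaves + right_leaves
        return as_leaf, [(j, p)]

    return solve(0, 0)


# Hypothetical leaf costs for a depth-2 tree.
TOY_COST = {(0, 0): 10, (1, 0): 6, (1, 1): 1,
            (2, 0): 1, (2, 1): 1, (2, 2): 1, (2, 3): 1}

small_penalty = prune(TOY_COST, depth=2, lam=0.5)  # keeps three leaves
large_penalty = prune(TOY_COST, depth=2, lam=5)    # collapses to the root
```

In this toy example, increasing `lam` moves the solution from a three-leaf tree to the root alone, which loosely mirrors the idea of a family of solutions of decreasing complexity.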

Acknowledgment

The work was supported by funding from FONDECYT Grant 1110145, CONICYT-Chile. We are grateful to the anonymous reviewers for their suggestions and comments, which contributed to improving the quality and organization of the work. We thank S. Beckman for proofreading this material.

Appendix A. Wavelet Packets: an alternative view of its subspace frequency content

The conjugate mirror filter pair $(h(n), g(n))$ maps the canonical basis $B_L$ of $X$ to an alternative orthogonal basis $B^0_{L+1} \cup B^1_{L+1}$. Importantly, we can associate the sub-spaces $U^0_{L+1} = \mathrm{span}\{\phi^0_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} = \mathrm{span}\{\phi^1_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ with a frequency content of $X$ by the following relationship:

$$\hat{\phi}^0_{L+1}(w) = \hat{h}(2^L w)\, \hat{\phi}_L(w), \qquad \hat{\phi}^1_{L+1}(w) = \hat{g}(2^L w)\, \hat{\phi}_L(w), \qquad \text{(A.1)}$$

where $\hat{\phi}^0_{L+1}(w)$ and $\hat{h}(2^L w)$ denote the Fourier transform (FT) and the Discrete-Time Fourier transform (DTFT) of $\phi^0_{L+1}(t)$ and $h(n)$ (alternatively, $\phi^1_{L+1}(t)$ and $g(n)$), respectively. Iterating the application of $(h(n), g(n))$, we induce $\phi^p_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^j - 1\}$, where the frequency content of any arbitrary sub-space in the chain, for instance $U^p_{L+j} = \mathrm{span}\{\phi^p_{L+j}(t - 2^{L+j} n) : n \in \mathbb{Z}\}$, is inherited from (A.1) by:

$$\hat{\phi}^{2p}_{L+j+1}(w) = \hat{h}(2^L w)\, \hat{\phi}^p_{L+j}(w), \qquad \hat{\phi}^{2p+1}_{L+j+1}(w) = \hat{g}(2^L w)\, \hat{\phi}^p_{L+j}(w). \qquad \text{(A.2)}$$

Figs. 4 and 5 illustrate those frequency maps for the ideal Shannon pair of filters that provides a perfect partition of the frequency content of $X$.

Appendix B. Multi-rate filter-bank property

Proposition 2 (Vetterli and Kovacevic, 1995, Chap. 2, pp. 72–73). Let $h(n)$ be the impulse response of an LTI system with transfer function $H(z)$. Then for any $(x(n)) \in \mathbb{R}^{\mathbb{Z}}$, passing $x(n)$ through a down-sampler by $N$ and then through the LTI system with transfer function $H(z)$ is equivalent to passing $x(n)$ through $H(z^N)$ and then through the down-sampler by a factor of $N$. Fig. 17 illustrates the relationship.

Appendix C. The Gray code

Proposition 3 (Mallat, 2009, Chap. 8.1.2). Let $(j, p)$ be an admissible node of the Shannon WP decomposition with binary path $\Theta(j, p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0, 1\}^{j-L}$. Then its equivalent frequency-ordered label $(j, k)$ is constructed by the following rule:

$$k = G(p) = \sum_{i=1}^{j-L} \tilde{\theta}_i\, 2^{i} \in \{0, \ldots, 2^{j-L} - 1\}, \qquad \text{(C.1)}$$

where $\tilde{\theta}_i = \sum_{l=i}^{j-L} \theta_l \bmod 2 \in \{0, 1\}$, $\forall i \in \{1, \ldots, j - L\}$.
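The cumulative-sum rule of Proposition 3 can be tried out on small trees. The sketch below is a minimal illustration, assuming that the path bits are read as the binary digits of p with the least significant digit first and that bit positions are counted from 0 in the code; the paper's exact indexing of the path may differ, and the name `gray_order` is hypothetical.

```python
def gray_order(p, n_bits):
    """Frequency-ordered index k for natural (filter-path) index p.

    Each output bit k_i is the sum modulo 2 of the input bits p_l with
    l >= i (assumed convention: bit 0 is the least significant digit of p).
    """
    bits = [(p >> i) & 1 for i in range(n_bits)]
    k = 0
    for i in range(n_bits):
        k_i = sum(bits[i:]) % 2
        k |= k_i << i
    return k


# Natural-to-frequency ordering for a tree of depth 3 (8 leaves).
ORDER_DEPTH3 = [gray_order(p, 3) for p in range(8)]
```

Under this convention the map is a permutation of {0, ..., 2^n - 1}, and it is undone by the standard binary Gray code, i.e. `p == k ^ (k >> 1)` for `k = gray_order(p, n)`.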


References

Atto, A.M., Pastor, D., Isar, A., 2007. On the statistical decorrelation of the wavelet packet coefficients of a band-limited wide-sense stationary random process. Signal Processing 87 (10), 2320–2335.
Atto, A.M., Pastor, D., Mercier, G., 2010. Wavelet packets of fractional brownian motion: Asymptotic analysis and spectrum estimation. IEEE Transactions on Information Theory 56 (9), 429–441.
Bohanec, M., Bratko, I., 1994. Trading accuracy for simplicity in decision trees. Machine Learning 15, 223–250.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Chang, T., Kuo, C.J., 1993. Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing 2 (4), 429–441.
Chou, P., Lookabaugh, T., Gray, R., 1989. Optimal pruning with applications to tree-structure source coding and modeling. IEEE Transactions on Information Theory 35 (2), 299–315.
Choueiter, G., Glass, J., 2007. An implementation of rational wavelets and filter design for phonetic classification. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), 939–948.
Coifman, R., Meyer, Y., Quake, S., Wickerhauser, V., 1990. Signal processing and compression with wavelet packets. Tech. rep., Numerical Algorithms Research Group, New Haven, CT, Yale University.
Coifman, R.R., Meyer, Y., Wickerhauser, M.V., 1992. Wavelet analysis and signal processing. In: B. Ruskai (Ed.), Wavelets and their Applications. Jones and Bartlett, pp. 153–178.
Coifman, R.R., Wickerhauser, M.V., 1992. Entropy-based algorithm for best basis selection. IEEE Transactions on Information Theory 38 (2), 713–718, March.
Cormen, T., Leiserson, C., Rivest, R.L., 1990. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley Interscience, New York.
Crouse, M.S., Nowak, R.D., Baraniuk, R.G., 1998. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing 46 (46), 886–902, April.
Daubechies, I., 1992. Ten Lectures on Wavelets. SIAM, Philadelphia.
Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (4), 357–366.
Duda, R.O., Hart, P.E., 1983. Pattern Classification and Scene Analysis. Wiley, New York.
Etemad, K., Chellapa, R., 1998. Separability-based multiscale basis selection and feature extraction for signal and image classification. IEEE Transactions on Image Processing 7 (10), 1453–1465, October.
Farooq, O., Datta, S., 2001. Mel filter-like admissible wavelet packet structure for speech recognition. IEEE Signal Processing Letters 8 (7), 196–198.
Gray, R.M., 1990. Entropy and Information Theory. Springer-Verlag, New York.
Kim, K., Youn, D., Lee, C., 2000. Evaluation of wavelet filters for speech recognition. In: IEEE Int. Conf. Syst. Man. Cybern. pp. 2891–2894.
Kullback, S., 1958. Information Theory and Statistics. Wiley, New York.
Learned, R.E., Karl, W.C., Willsky, A.S., 1992. Wavelet packet based transient signal classification., 109–112.
Lee, K.-F., Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing 37 (11), 1641–1648.
Mallat, S., 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674–693, July.
Mallat, S., 2009. A Wavelet Tour of Signal Processing. 3rd ed. Academic Press.
Padmanabhan, M., Dharanipragada, S., 2005. Maximizing information content in feature extraction. IEEE Transactions on Speech and Audio Processing 13 (4), 512–519, July.
Quatieri, T.F., 2002. Discrete-time Speech Signal Processing principles and practice. Prentice Hall.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286, February.
Ramchandran, K., Vetterli, M., Herley, C., 1996. Wavelet, subband coding, and best bases. Proceedings of the IEEE 84 (4), 541–560, April.
Saito, N., Coifman, R.R., 1994. Local discriminant basis. In: Proc. SPIE 2303, Mathematical Imaging: Wavelet Applications in Signal and Image Processing 2–14.
Scott, C., 2005. Tree pruning with subadditive penalties. IEEE Transactions on Signal Processing 53 (12), 4518–4525.
Scott, C., Nowak, R.D., 2004. Templar: A wavelet-based framework for pattern learning and analysis. IEEE Transactions on Signal Processing 52 (8), 2264–2274, August.
Shen, J., Strang, G., 1996. Asymptotic analysis of Daubechies polynomials. Proceedings of the American Mathematical Society 124 (12), 3819–3833.
Shen, J., Strang, G., 1998. Asymptotics of Daubechies filters, scaling functions, and wavelets. Applied and Computational Harmonic Analysis 5, 312–331.
Silva, J., Narayanan, S., 2007. Minimum probability of error signal representation. In: IEEE Workshop Machine Learning for Signal Processing, August.
Silva, J., Narayanan, S., 2009. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Transactions on Signal Processing 57 (5), 1796–1810.
Silva, J.F., Narayanan, S.S., 2012. On signal representations within the Bayes decision framework. Pattern Recognition 45 (5), 1853–1865, May.
Tan, B., Minyue, F., Spray, A., Dermody, P., 1996. The use of wavelet transform in phoneme recognition. In: Int. Conf. Spoken Lang. Process. pp. 2431–2434.
Vaidyanathan, P.P., 1993. Multirate Systems and Filter Banks. NY Prentice-Hall, Englewood Cliffs.
Vasconcelos, N., 2004. Minimum probability of error image retrieval. IEEE Transactions on Signal Processing 52 (8), 2322–2336.
Vetterli, M., Kovacevic, J., 1995. Wavelet and Subband Coding. Prentice-Hall, Englewood Cliffs, NY.
Walter, G.G., 1992. A sampling theorem for wavelet subspaces. IEEE Transactions on Information Theory 38 (2), 881–884.
Willsky, A.S., 2002. Multiresolution Markov models for signal and image processing. Proceedings of the IEEE 90 (8), 1396–1458, August.
Young, S., 2009. The HTK Book (for HTK Version 3.4).
Zhou, X., Sun, W., 1999. On the sampling theorem for wavelet subspaces. The Journal of Fourier Analysis and Applications 5 (4), 347–354.
