Available online at www.sciencedirect.com
Speech Communication 54 (2012) 814–835 www.elsevier.com/locate/specom
Analysis and design of Wavelet-Packet Cepstral coefficients for automatic speech recognition
Eduardo Pavez, Jorge F. Silva ⇑
University of Chile, Department of Electrical Engineering, Av. Tupper 2007, Santiago 412-3, Chile
Received 3 July 2011; received in revised form 31 January 2012; accepted 2 February 2012. Available online 18 February 2012
Abstract
This work proposes using Wavelet-Packet Cepstral coefficients (WPCCs) as an alternative way to do filter-bank energy-based feature extraction (FE) for automatic speech recognition (ASR). The rich coverage of time-frequency properties of Wavelet Packets (WPs) is used to obtain new sets of acoustic features, which achieve competitive, and in some cases better, performance than the widely adopted Mel-Frequency Cepstral coefficients (MFCCs) on the TIMIT corpus. In the analysis, concrete filter-bank design considerations are stipulated to obtain most of the phone-discriminating information embedded in the speech signal, where the filter-bank frequency selectivity and better discrimination in the lower frequency range [200 Hz–1 kHz] of the acoustic spectrum are important aspects to consider.
© 2012 Elsevier B.V. All rights reserved.
Keywords: Wavelet Packets; Filter-bank analysis; Automatic speech recognition; Filter-bank selection; Cepstral coefficients; The Gray code
⇑ Corresponding author. Tel.: +56 2 9784090; fax: +56 2 6953881. E-mail addresses: [email protected] (E. Pavez), [email protected] (J.F. Silva). URL: http://www.ids.uchile.cl/josilva/ (J.F. Silva). 0167-6393/$ - see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2012.02.002
1. Introduction
Feature extraction (FE) is one of the key dimensions of design in automatic speech recognition (ASR) (Quatieri, 2002). The most recognized and widely adopted approach for acoustic FE is the use of Mel-Frequency Cepstral coefficients (MFCCs). The MFCC approach is a short-time analysis scheme, in which a signature of the acoustic signal spectrum is computed from a filter-bank with central frequencies projected uniformly on the Mel scale (Quatieri, 2002). This scale is derived from well-documented studies of the human auditory system (Quatieri, 2002). Departing from this direction, there has been interest in the use of alternative signal processing techniques to propose new ways of doing short-time filter-bank analysis on the acoustic signal (Silva and Narayanan, 2009; Farooq and Datta, 2001; Choueiter
and Glass, 2007; Kim et al., 2000; Tan et al., 1996). The use of Wavelets and Wavelet Packets (Daubechies, 1992; Mallat, 1989; Vetterli and Kovacevic, 1995) has been of particular interest in this context. Wavelet Packets (WPs) (Vetterli and Kovacevic, 1995; Mallat, 1989; Coifman et al., 1990) have emerged as important signal representation schemes impacting compression, detection and classification (Crouse et al., 1998; Etemad and Chellapa, 1998; Ramchandran et al., 1996; Vasconcelos, 2004; Willsky, 2002; Learned et al., 1992; Scott and Nowak, 2004). This collection of bases is particularly appealing for the analysis of pseudo-stationary time series processes and quasi-periodic random fields, such as the acoustic speech process (Silva and Narayanan, 2009; Choueiter and Glass, 2007; Chang and Kuo, 1993; Learned et al., 1992). WPs belong to the category of structured bases, those whose orthonormal basis elements are generated from a finite number of elementary transformations (Vetterli and Kovacevic, 1995; Daubechies, 1992; Ramchandran et al., 1996). From an engineering point of view, these kinds of representations are attractive because
they can be implemented with a basic two-channel filter (TCF) and down-sampling operations (Vetterli and Kovacevic, 1995). WPs can be used to characterize a rich covering of signal-space decompositions and, in particular, they provide a way of generating sub-band dependent partitions of the observation space. In summary, WPs induce a family of structural filter-banks with a rich covering of time-frequency characteristics that has the potential for enriching the way conventional MFCC features describe the short-term behavior of the acoustic speech process. WPs and multi-rate filter-bank analysis have been adopted to improve the performance of conventional MFCC features in the context of ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In particular, Farooq and Datta (2001) proposed a WP filter-bank representation in which the objective was to mimic the MEL-scale frequency resolution. They used the Daubechies (DB) two-channel filter (Daubechies, 1992), with which performance improvements were observed for specific phone subcategories (stop and unvoiced) in a portion of the TIMIT corpus. More recently, Choueiter and Glass (2007) explored the problem of two-channel filter-bank design and, in particular, the novel framework of rational filter-banks. The focus of that work was to improve the frequency selectivity with respect to the conventionally adopted Daubechies (DB) WPs with standard dyadic structure, by designing a type of MEL-frequency filter-bank structure. Better performances were obtained in a simplified phone-segmented classification task with respect to MFCCs. These seminal works provide concrete evidence of the advantage of adopting WPs for parameterizing the speech acoustic process.
However, the problem of adapting the WP basis-structure to the decision task, in the sense of finding the filter-bank topology, within the collection of tree-structured WP bases, that best captures the time-frequency acoustic information for a given complexity constraint (feature dimension), remains an unexplored direction. As pointed out in Choueiter and Glass (2007), this direction has the potential to further adapt WP filter-bank solutions (acoustic energy-signature) to the phone discrimination task at hand. On the other hand, the results reported so far have considered simplified settings, in terms of the classification task or data-sets. Thus, a systematic analysis in standard phone recognition experiments would be beneficial to support the adoption of WP-based features as a competitive front-end alternative for doing acoustic FE. In this work we propose the Wavelet-Packet Cepstral coefficients (WPCCs) and show concrete results that complement previous work supporting the use of WPs as an FE technique for ASR. This work builds upon the ideas recently proposed in Silva and Narayanan (2009), in which the problem of optimal filter-bank selection for pattern recognition (PR) was formulated based on the minimum probability of error decision principle (Silva et al., 2012; Vasconcelos, 2004). Here we explore WP filter-bank selection to propose a family of WPCCs. These features are
log-energy-based acoustic signatures rotated with the discrete cosine transform (the Cepstrum), as proposed in Farooq and Datta (2001), where the energy signatures are obtained from a bank of filters selected from the family of WP filter-banks. For the filter-bank selection, we use a complexity regularized criterion adopted from standard tree-structured bases selection problems (Silva and Narayanan, 2009; Etemad and Chellapa, 1998; Saito and Coifman, 1994; Coifman et al., 1992). In particular, we use acoustic energy, the Fisher-scatter ratio (Duda and Hart, 1983), and the Kullback-Leibler divergence (KLD) as fidelity measures. The last two criteria are phone-discriminative in nature, while energy is based on the principle of increasing the frequency resolution in bands with higher acoustic energy, proposed in Chang and Kuo (1993) for the problem of texture classification. As supporting results, we run standard phone recognition experiments in the TIMIT corpus. We contrast the different filter-bank solutions with respect to a number of design elements. Among them are the fidelity measure to select the filter-banks, the number of bands, the number of features, and the frequency selectivity of the two-channel filter (TCF) that induces the family of WPs. Interestingly, we found competitive results and concrete solutions that outperform the MFCCs. In the analysis, we show performance trends and dependencies that explain what the important design variables are to be considered for the construction of good acoustic features for ASR. At the end, WPCCs offer a rich collection of acoustic features that extend the idea of short-time (segmental) energy-signature for acoustic event detection. The rest of the article is organized as follows. Section 2 revisits the standard approach for obtaining short-term acoustic features. 
Sections 3 and 4 are devoted to the presentation of the WPCCs, where background material is covered to aid understanding of the filter-bank properties of WPs, and Section 5 covers the filter-bank selection problem. Finally, Sections 6 and 7 show the filter-bank structure of the obtained solutions and the phone-classification performances, respectively. Final remarks are presented in Section 8, and supplemental material is presented in the Appendix.

2. Revisiting the filter bank Cepstral analysis view of feature extraction

We revisit the standard feature extraction (FE) technique for ASR based on filter-bank energy features and the application of the Cepstral transform (Quatieri, 2002), illustrated in Fig. 1a. Given the acoustic signal, the scheme has the following phases: a high pass pre-emphasis filter $1 - 0.97z^{-1}$ is applied on the whole acoustic signal; the resulting signal is segmented with a Hamming window of 32 ms, creating overlapped short-term acoustic segments every 10 ms (segmental analysis); each acoustic segment is passed through a bank of triangular shaped filters with center frequencies forming an equipartition of the MEL scale, as shown in Fig. 1; and finally, in each segment the filter-bank energies (FBE) are computed to form a vector, where the logarithm
Fig. 1. Illustration of the phases that characterize the standard approach for acoustic feature extraction in speech recognition.
function (point-wise) and the Discrete Cosine transform (DCT) are applied to create the MEL frequency Cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980). In this work we explore an extension of this framework for acoustic FE, where, instead of using the perceptually motivated MEL filter-bank structure, we study the rich collection of filter-banks induced from the Wavelet Packet (WP) bases (Vetterli and Kovacevic, 1995; Mallat, 2009). The next section is devoted to explaining the methodology adopted to induce a new set of filter-bank energy features from the WPs, and, later, we present the proposed Wavelet Packet Cepstral coefficients (WPCCs) for ASR.
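The phases described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the front-end used in the experiments: the 16 kHz sampling rate, the 24 filters, the 13 retained coefficients, the common mel formula 2595·log10(1 + f/700), the use of the FFT power spectrum, and all function names are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the text does not fix one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose center frequencies are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(signal, fs=16000, win_ms=32, hop_ms=10, n_filters=24, n_ceps=13):
    # Phase 1: pre-emphasis filter 1 - 0.97 z^-1 on the whole signal.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Phase 2: 32 ms Hamming-windowed segments every 10 ms.
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(win)
    # Phase 3: filter-bank energies (FBE) from the power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    fbe = power @ mel_filterbank(n_filters, win, fs).T
    # Phase 4: point-wise logarithm followed by the DCT (type II, written out explicitly).
    log_fbe = np.log(fbe + 1e-12)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_fbe @ dct.T
```

Calling `mfcc` on one second of audio sampled at 16 kHz returns one row of `n_ceps` coefficients per 10 ms segment. Replacing `mel_filterbank` by another bank of filters, while keeping the logarithm and DCT stages, is the kind of extension explored in the rest of the paper.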
3. Wavelet Packets

WPs were proposed by Coifman et al. (1992) as a collection of bases with an underlying tree-structure. They offer different time-frequency representation qualities and, consequently, the potential to adapt to complex time series phenomena like the speech acoustic process (Silva and Narayanan, 2009). Here we provide a brief introduction to this family with a focus on its filter-bank characteristics. Excellent expositions can be found in Mallat (2009), Vetterli and Kovacevic (1995) and Daubechies (1992).

3.1. WP sub-space decomposition: tree-structured collection

Let $\mathcal{X}$ be the signal space of interest that, without loss of generality, is associated with a finite level of scale $2^{L}$ or resolution $2^{-L}$, $L$ being an integer strictly greater than zero (Mallat, 2009). Consequently, $\mathcal{X}$ can be equipped with an orthonormal basis $B_L \equiv \{\phi_L(t - 2^L n)\}_{n \in \mathbb{Z}}$ (Mallat, 2009; Vetterli and Kovacevic, 1995; Daubechies, 1992). The WP framework provides a way of decomposing the basis $B_L$ into two orthonormal collections, $B^0_{L+1} \equiv \{\phi^0_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$ and $B^1_{L+1} \equiv \{\phi^1_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$, where, denoting by $U^0_{L+1} \equiv \mathrm{span}\{\phi^0_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} \equiv \mathrm{span}\{\phi^1_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$, we have that (Mallat, 2009)

$$\mathcal{X} = U^0_{L+1} \oplus U^1_{L+1}. \qquad (1)$$

The structure of the WP framework comes from the fact that $B^1_{L+1}$ and $B^0_{L+1}$ are induced by a discrete time pair of conjugate mirror filters (CMF) that we denote by $(h(n), g(n))$ (Mallat, 2009, Chap. 7.1.3). More precisely, the basis elements $\phi^0_{L+1}(t)$, $\phi^1_{L+1}(t)$ associated with the scale $L+1$ are induced from $\phi_L(t)$, of the scale $L$, by

$$\phi^0_{L+1}(t) = \sum_{n=-\infty}^{\infty} h(n)\, \phi_L(t - 2^L n), \qquad \phi^1_{L+1}(t) = \sum_{n=-\infty}^{\infty} g(n)\, \phi_L(t - 2^L n), \qquad (2)$$

where $h(n)$ and $g(n)$ are related by the perfect reconstruction property, i.e., $g(n) = (-1)^{1-n} h(1-n)$, $\forall n \in \mathbb{Z}$ (Coifman et al., 1992), (Mallat, 2009, Th. 8.1). Iterating the application of the CMF pair $(h(n), g(n))$ on each basis element $\phi^0_{L+1}(t)$ and $\phi^1_{L+1}(t)$ (Mallat, 2009, Th. 8.1), we can continue, in a binary tree-structured way, with the construction of alternative bases and subspace decompositions for $\mathcal{X}$. More precisely, after a fixed number of iterations, we can create $\phi^p_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^j - 1\}$, where $U^p_{L+j} = \mathrm{span}\{\phi^p_{L+j}(t - 2^{L+j} n) : n \in \mathbb{Z}\}$, see Fig. 2a. Furthermore, by construction, $\forall j \geq 1$, $\forall p \in \{0, \ldots, 2^j - 1\}$,

$$U^p_{L+j} = U^{2p}_{L+j+1} \oplus U^{2p+1}_{L+j+1}, \qquad (3)$$

where $\phi^{2p}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} h(n)\, \phi^p_{L+j}(t - 2^{L+j} n)$ and $\phi^{2p+1}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} g(n)\, \phi^p_{L+j}(t - 2^{L+j} n)$.
In summary, the WPs can be seen as a family of tree-structured bases induced from the iteration of the two-channel filter (TCF) $(h(n), g(n))$, as illustrated in Fig. 2a.
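To make the conjugate mirror filter relation of this section concrete, the following sketch builds $g(n)$ from $h(n)$ through $g(n) = (-1)^{1-n} h(1-n)$ and evaluates the inner products that must vanish (or equal one) for the two-channel split to be orthonormal. The choice of the 4-tap Daubechies low-pass filter, the dictionary representation of the taps and the helper names are assumptions made for this example only.

```python
import numpy as np

def db2_lowpass():
    """4-tap Daubechies low-pass filter h(n), n = 0..3, stored as {index: value}."""
    s3 = np.sqrt(3.0)
    taps = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
    return {n: float(v) for n, v in enumerate(taps)}

def mirror_filter(h):
    """g(n) = (-1)^(1-n) h(1-n), written with the substitution m = 1 - n."""
    return {1 - m: (-1.0) ** m * v for m, v in h.items()}

def shifted_inner(a, b, k):
    """sum_n a(n) b(n - 2k) for finitely supported filters a and b."""
    return sum(v * b.get(n - 2 * k, 0.0) for n, v in a.items())

h = db2_lowpass()
g = mirror_filter(h)

# Orthonormality conditions of the pair: unit norm, and orthogonality to every
# even shift of itself and of the other filter.
for k in range(-2, 3):
    print(k, shifted_inner(h, h, k), shifted_inner(g, g, k), shifted_inner(h, g, k))
```

Up to floating-point rounding, the first two printed columns are 1 for k = 0 and 0 otherwise, and the third is always 0. These are the conditions that make the collections generated by $h$ and $g$ orthonormal at every iteration of the tree.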
Fig. 2. Binary tree-structure and representation of the family of Wavelet Packet bases.
3.2. Inter-scale relationship of the WP transform coefficients

A key property of WPs is the inter-scale relationship induced from (2) among the WP transform coefficients obtained across scales (Mallat, 2009). More precisely, let $x(t)$ be in $U^p_j \subset \mathcal{X}$ with transform coefficients given by

$$d^p_j(n) \equiv \langle x(t), \phi^p_j(t - 2^j n) \rangle, \quad \forall n \in \mathbb{Z}. \qquad (4)$$

Projecting $x(t)$, instead, in the alternative basis associated with $U^{2p}_{j+1} \oplus U^{2p+1}_{j+1}$, we have that (Mallat, 2009, Prop. 8.4)

$$d^{2p}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k)\, h(k - 2n), \qquad d^{2p+1}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k)\, g(k - 2n), \quad \forall n \in \mathbb{Z}. \qquad (5)$$

Considering the fact that those are orthonormal bases, Parseval's relationship (Mallat, 2009) implies that

$$\|x(t)\|^2 = \sum_{n \in \mathbb{Z}} |d^p_j(n)|^2 = \sum_{n \in \mathbb{Z}} |d^{2p}_{j+1}(n)|^2 + \sum_{n \in \mathbb{Z}} |d^{2p+1}_{j+1}(n)|^2. \qquad (6)$$

By induction, a closed-form relationship in the transform coefficients can be obtained for every pair of basis elements in the WPs, as illustrated in Fig. 2b. The beauty of this result is that we pass from an analysis in continuous time in (4) to a discrete time analysis (algorithm) in (5). In fact, assuming that $x(t)$ lives in a finite resolution space $\mathcal{X}$, Eq. (4) with $j = L$ and $p = 0$ can be seen as a generalized Sampling theorem (Zhou and Sun, 1999; Walter, 1992). Furthermore, the WP binary structure manifested in (5) permits a fast algorithm implementation of the WP analysis (Mallat, 2009). Concerning the algorithmic part, the next section addresses the filter-bank implementation of WPs (Vetterli and Kovacevic, 1995).

3.3. WP filter bank implementation

From a discrete time filter-bank point of view (Vetterli and Kovacevic, 1995), the basic iteration in (5) can be implemented by the application of a two-channel filter (TCF), with impulse responses $h(n)$ and $g(n)$, followed by a down-sampling by 2 operation (Vetterli and Kovacevic, 1995; Mallat, 2009). This view is generalized in the following result.
Proposition 1 (Vaidyanathan (1993, Chap. 11.3.3)). Let $x(t)$ be in a finite $2^L$ scale space $\mathcal{X}$, with transform coefficients $(d^0_L(n))_{n \in \mathbb{Z}}$ obtained from (4). Let us consider an arbitrary sub-space $U^p_j$ induced from the WP filter bank decomposition with $j > L$ and $p \in \{0, \ldots, 2^{j-L} - 1\}$. Let us denote by $(h_0(n))_{n \in \mathbb{Z}}$ and $(h_1(n))_{n \in \mathbb{Z}}$ the conjugate mirror filter pair (with transfer functions $H_0(z)$ and $H_1(z)$), by $U^{p_1}_{L+1}, \ldots, U^{p_{j-L-1}}_{j-1}, U^p_j$ the sequence of intermediate sub-spaces used to go from $\mathcal{X}$ to $U^p_j$, and by $\Theta(j,p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0,1\}^{j-L}$ the binary path code. In the last definition, choosing $\theta_k$ implies filtering with $H_{\theta_k}(z)$ and then applying the down-sampler by 2 at step $k$ of the iteration. Then $(d^p_j(n))_{n \in \mathbb{Z}}$ is obtained by passing $(d^0_L(n))_{n \in \mathbb{Z}}$ through the following discrete time filter

$$H_{\Theta(j,p)}(z) = \prod_{i=1}^{j-L} H_{\theta_i}\!\left(z^{2^{i-1}}\right), \qquad (7)$$

and then applying the down-sampler by $2^{j-L}$ operator.

Proof. The proof of this result is a consequence of Proposition 2 presented in Appendix B. Fig. 3 illustrates the relationship. □
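As a numerical illustration of the iteration in (5), the energy split in (6) and the equivalence stated in Proposition 1, the sketch below applies one and two analysis steps to a random coefficient vector. Several choices are assumptions of the example rather than part of the paper: the 4-tap Daubechies filter, the periodic extension of the finite sequence (so that Parseval's relationship holds exactly on a finite vector), and the helper names `periodize`, `analyze` and `equivalent_filter`.

```python
import numpy as np

S3 = np.sqrt(3.0)
# Low-pass taps h(n), n = 0..3, and mirror filter g(n) = (-1)^(1-n) h(1-n), as {index: value}.
H = {n: float(v) for n, v in enumerate(np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0)))}
G = {1 - m: (-1.0) ** m * v for m, v in H.items()}

def periodize(taps, N):
    """Wrap a finitely supported filter onto the indices 0..N-1 (periodic extension)."""
    f = np.zeros(N)
    for n, v in taps.items():
        f[n % N] += v
    return f

def analyze(d, taps, factor=2):
    """out(n) = sum_k d(k) f(k - factor * n), i.e. the iteration (5) when factor = 2."""
    N = len(d)
    f = periodize(taps, N)
    k = np.arange(N)[None, :]
    n = np.arange(N // factor)[:, None]
    return f[(k - factor * n) % N] @ d

def equivalent_filter(f1, f2):
    """Taps of F1(z) F2(z^2): f_eq(m) = sum_l f2(l) f1(m - 2l), as in (7) for two steps."""
    out = {}
    for l, b in f2.items():
        for m, a in f1.items():
            out[m + 2 * l] = out.get(m + 2 * l, 0.0) + a * b
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal(16)                 # stands in for the coefficients d_L^0(n)

low, high = analyze(d, H), analyze(d, G)    # one step of (5)
energy_gap = d @ d - (low @ low + high @ high)   # should vanish by (6)

cascade = analyze(analyze(d, G), H)         # path code (1, 0): two filter/down-sample steps
direct = analyze(d, equivalent_filter(G, H), factor=4)  # one filter, down-sampling by 4
```

Here `energy_gap` is zero up to rounding, and `cascade` and `direct` coincide, which is the two-step case of the product formula (7): filtering with $H_1(z) H_0(z^2)$ and down-sampling once by 4 gives the same coefficients as the two-stage cascade.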
Fig. 3. The equivalent systems stated in Proposition 1. The aggregated down-sampler is by $K = 2^{j-L}$.
Fig. 4. Illustration of the frequency division of Wavelet Packet bases for two tree structures. The ideal Shannon conjugate filter pair is considered, which provides perfect dyadic partitions of the interval $[-\pi, \pi]$. Scenario (a–c) shows a recursive iteration of $H_0(z)$ (Wavelet type), and scenario (b–d) presents a balanced tree structure (uniform frequency resolution).
4. Frequency response of the WP filter banks

Note that the process that relates $(d^0_L(n))_{n \in \mathbb{Z}}$ with $(d^p_j(n))_{n \in \mathbb{Z}}$ in Proposition 1 is linear but not time invariant. Consequently, it is misleading to talk about the frequency response associated with the process of projecting $x(t)$ into the WP sub-space $U^p_j$. We can circumvent this issue by considering only the equivalent filtering part of the process in (7) and, consequently, avoiding the last down-sampling stage.1 More precisely, we consider the frequency response of the equivalent linear time-invariant (LTI) system just before the down-sampling stage. This characterizes the frequency content associated with each subspace, with which we can define the frequency decomposition achieved by a given WP basis. To illustrate this, let us consider the Shannon WPs (Mallat, 2009) induced by the perfect low and high pass filters presented in Figs. 4 and 5, i.e.,

$$|H_0(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [-\pi/2 + 2k\pi,\ \pi/2 + 2k\pi] \\ 0 & \text{otherwise} \end{cases}$$

and

$$|H_1(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [\pi/2 + 2k\pi,\ 3\pi/2 + 2k\pi] \\ 0 & \text{otherwise.} \end{cases}$$

1 An alternative interpretation is presented in Appendix A. This analysis is not based on the filter-bank view of WPs presented here.
Following Section 3.1, each WP basis of $\mathcal{X}$ can be represented by the leaves of a binary tree, as shown in Fig. 2(a). More precisely, a basis is indexed by $\{(j_i, p_i) : i = 1, \ldots, M\}$,2 associated with the basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$ and sub-space decomposition $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$. For each leaf $(j_i, p_i)$ of this tree, we can obtain its equivalent filter $H_i(z) \equiv H_{\Theta(j_i, p_i)}(z)$ by (7) and, consequently, reduce the analysis to the frequency response of an M-channel filter-bank, see Fig. 6. Examples of the frequency response before the down-sampling stage are presented in Figs. 4 and 5. From these, we can notice that for the Wavelet type of structure, produced by iterating $H_0(e^{j\omega})$ in every step, we obtain a solution that increases the resolution in the low frequency range. In general, in each step of iterating the TCF, we reduce the frequency support of the resulting sub-space by half, as illustrated in Fig. 4c.

2 It is necessary that $j_i > L$ and $p_i \in \{0, \ldots, 2^{j_i - L} - 1\}$, $\forall i \in \{1, \ldots, M\}$. In addition, there are structural conditions to guarantee that $\{(j_i, p_i) : i = 1, \ldots, M\}$ corresponds to the leaves of a binary tree rooted at node $(L, 0)$, not detailed here for space considerations. We refer the reader to Breiman et al. (1984), Chou et al. (1989) and Scott (2005) for a systematic exposition of this point.

Fig. 5. The same scenario as in Fig. 4. Scenario (a–c) shows a recursive iteration of $H_1(z)$, and scenario (b–d) the reciprocal, in terms of frequency selectivity, of the Wavelet type in Fig. 4(a) and (c).
Fig. 6. The equivalent M-channel filter-bank of a WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$.

4.1. Frequency ordering: the Gray code

Concerning frequency ordering, however, the up-sampled versions of $H_0(z)$ and $H_1(z)$ do not necessarily play the role of the low and high pass filters, respectively, in the band of interest. The reason is that the side lobes of
these filters, outside the original frequency range of their definition $[-\pi, \pi]$, are brought into $[-\pi, \pi]$ after the up-sampling operation in a non-trivial way (Mallat, 2009). This is a direct consequence of the result presented in Proposition 1. An example of this phenomenon is shown in Fig. 5a, for the case of iterating $H_1(z)$. This scenario does not provide a solution that decomposes the high frequency range of the signal, see Fig. 5c, as one would expect from its reciprocal Wavelet solution shown in Fig. 4c. To illustrate this mirroring effect more clearly, let us consider Fig. 5b and d. In this scenario, the frequency support of the equivalent filter $H_1(z) H_1(z^2)$ is not the highest band in the interval $[0, \pi]$ as expected. In fact, the supports of $H_1(z)$ and $H_1(z^2)$ are $[\pi/2, 3\pi/2]$ and $[\pi/4, 3\pi/4]$, respectively. Thus $H_1(z) H_1(z^2)$ has support in $[\pi/2, 3\pi/4]$. For further details on this frequency ordering issue, we refer the reader to Mallat (2009, Section 8.1.2) and Atto et al. (2007, 2010).

Fortunately, there is a simple closed-form rule to relabel any admissible node $(j, p)$ in the WP tree as an equivalent node $(j, k)$, at the same depth (scale), so that the resulting labels are frequency ordered (Mallat, 2009). This mapping $k = G(p)$ is called the Gray code and it is presented in Appendix C for completeness. Then, for each WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$, we can compute the ordered indexes $\{(j_i, k_i) : i = 1, \ldots, M\}$, with $k_i = G(p_i)$, (C.1), where each induced subspace atom $U^{p_i}_{j_i}$ captures the signal information concentrated in the band

$$I^{k_i}_{j_i} \equiv \left[-(k_i + 1)\pi 2^{-j_i},\ -k_i \pi 2^{-j_i}\right] \cup \left[k_i \pi 2^{-j_i},\ (k_i + 1)\pi 2^{-j_i}\right]. \qquad (8)$$

Then, $B$ produces $I_B = \{I^{k_i}_{j_i} : i = 1, \ldots, M\}$, a partition of the discrete time frequency range $[-\pi, \pi]$. Extending this analysis to WPs with an arbitrary conjugate mirror filter pair $(h_0(n), h_1(n))$, their frequency selectivity property depends upon how $H_0(e^{j\omega})$ is concentrated in $[-\pi/2, \pi/2]$. Consequently, we only have an approximation of the clean selectivity properties of the Shannon WPs in (8). For the applications on acoustic speech signals, this will be one of the critical aspects to evaluate. In the following, we concentrate on the family of Daubechies (DB) WPs (Mallat, 2009; Daubechies, 1992), exploring different filter order solutions (associated with the number of zeros at $\pi$ of $H_0(z)$), which provide a tradeoff between the order of the TCF and the concentration of $H_0(e^{j\omega})$ in the range $[-\pi/2, \pi/2]$, or frequency selectivity (Mallat, 2009, Chap. 8.1.2). We choose the family of compactly supported Daubechies wavelets (Daubechies, 1992) because it offers a rich range of frequency selectivities. In fact, we can go from the Haar Wavelet (Vetterli and Kovacevic, 1995; Mallat, 2009), where $H_0(z)$ has one zero at $\pi$, with almost no frequency selectivity but perfect time localization, to the Shannon Wavelet that offers perfect frequency selectivity (in the limit where the number of zeros at $\pi$ of $H_0(z)$ goes to infinity) (Mallat, 2009). On the theoretical side, this family offers the minimum order TCF solution $(h_0(n), h_1(n))$ for a given number of vanishing moments or zeros at $\pi$ of $H_0(z)$. This last attribute is associated with the frequency selectivity of the TCF (Mallat, 2009, Th. 7.9).
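The relabeling discussed in this section can be sketched as follows. The paper defers the exact formula to its Appendix C, which is not reproduced here, so the recursion below is the standard frequency-ordering recursion for wavelet packet trees discussed in Mallat (2009, Section 8.1.2) and should be read as an assumption of this example. The function names and the convention that the most significant bit of $p$ is the first filtering step are also illustrative choices, and the depth is counted from the root of the tree.

```python
import math

def gray_relabel(p, depth):
    """Frequency-ordered index k = G(p) for the node with natural index p at a given depth.

    Recursion used (assumed, see lead-in): G(2p + b) = 2 G(p) + b if G(p) is even,
    and 2 G(p) + (1 - b) if G(p) is odd, with G(0) = 0 at the root.
    """
    k = 0
    for level in reversed(range(depth)):
        bit = (p >> level) & 1              # 0: low-pass branch, 1: high-pass branch
        k = 2 * k + (bit if k % 2 == 0 else 1 - bit)
    return k

def shannon_band(p, depth):
    """Positive-frequency part of the ideal band (8) for node (depth, p)."""
    k = gray_relabel(p, depth)
    width = math.pi / 2 ** depth
    return k * width, (k + 1) * width

# Node reached by applying H1 twice (p = 3 at depth 2): the text states that its
# support is [pi/2, 3*pi/4], i.e. the third band and not the highest one.
print(gray_relabel(3, 2), shannon_band(3, 2))
# Frequency-ordered labels of the eight nodes at depth 3.
print([gray_relabel(p, 3) for p in range(8)])
```

With this rule the node obtained by iterating $H_1(z)$ twice is mapped to $k = 2$, which matches the support $[\pi/2, 3\pi/4]$ derived above, while the highest band at depth 2 corresponds to $p = 2$ (high-pass followed by low-pass).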
5. Wavelet Packet filter-bank selection

The last aspect in the implementation of the WP acoustic features is to decide appropriate WP filter-bank structures for the phone recognition task at hand. We follow the data-driven approach independently proposed by Etemad and Chellapa (1998) and Saito and Coifman (1994),3 and revisited by Silva and Narayanan (2009). The idea is to use supervised data to select a filter-bank structure (or a frequency partition of $[-\pi, \pi]$) that provides a nearly-optimal phonetic discrimination basis solution. More details of the formulation of this problem can be found in Silva and Narayanan (2009), Silva and Narayanan (2007) and Vasconcelos (2004).

3 This work was inspired by the seminal work of Coifman and Wickerhauser (Coifman et al., 1992) in the context of basis selection for sparse signal representation.

To formulate the optimization problem, let us first introduce some notation. Following Silva and Narayanan (2009), we represent the process of producing a particular basis in the WP family by a rooted binary tree (Scott, 2005). For simplicity, let $J > 0$ be the maximum number of iterations of the sub-band decomposition process. Let $G = (V, E)$ be a graph with

$$V = \{(0,0), (1,0), (1,1), \ldots, (J,0), \ldots, (J, 2^J - 1)\}, \qquad (9)$$

and $E$ the collection of arcs on $V \times V$ that characterizes a full-rooted binary tree with root $v_{root} = (0,0)$, as shown in Fig. 2a. Instead of representing the tree as a collection of arcs in $G$, we use the convention of Breiman et al. (1984), in which subgraphs are represented by a subset of nodes of the full graph. More formally, we define a rooted binary tree $T = \{v_0, v_1, \ldots\} \subset V$ as a collection of nodes with only one of degree 2, the root node, and the remaining nodes with degree 3 (internal nodes) and leaf nodes (Cormen et al., 1990). We define $\mathcal{L}(T)$ as the set of leaves of $T$ and $\mathcal{I}(T)$ as the set of internal nodes; consequently, $\mathcal{L}(T) \cup \mathcal{I}(T) = T$. We say that a rooted binary tree $S$ is a subtree of $T$ if $S \subset T$. In the previous definition, if the roots of $S$ and $T$ are the same, then $S$ is a pruned subtree of $T$, denoted by $S \ll T$. In addition, if the root of $S$ is an internal node of $T$, then $S$ is called a branch. In particular, we denote the largest branch of $T$ rooted at $v \in T$ as $T_v$. We define the size of the tree $T$ as the number of leaves, i.e., the cardinality of $\mathcal{L}(T)$, denoted as $|T|$. Finally, in our problem, $T_{full} = V$ in (9) denotes the full binary tree; consequently, the collection of WP bases is indexed by the admissible trees $\{T \subset V : T \ll T_{full}\}$. In this context, any pruned version of the full-rooted binary tree represents a particular way of iterating the TCF $(h_0(n), h_1(n))_{n \in \mathbb{Z}}$ of the WP. More precisely, if we let $T = \{(j_i, p_i) : i \in \{1, \ldots, M\}\}$ be an admissible WP binary tree, then we denote its basis by

$$B_T \equiv \bigcup_{i=1}^{M} B^{p_i}_{j_i}, \qquad (10)$$

its sub-space decomposition by

$$U_T \equiv \left\{U^{p_i}_{j_i} : i = 1, \ldots, M\right\}, \qquad (11)$$

where $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$, and its ideal Shannon frequency partition by

$$I_T \equiv \left\{I^{k_i}_{j_i} : i = 1, \ldots, M\right\}, \qquad (12)$$

with $k_i = G(p_i)$ from (C.1) and $I^{k_i}_{j_i}$ from (8). Finally, as we are interested in extending the filter-bank Cepstral analysis view for acoustic FE (Section 2), for each $T \ll T_{full}$ and for any point $x \in \mathcal{X}$, we define the filter-bank energy signature of $x$ relative to $T$ by

$$m_T(x) \equiv \left(E^p_j(x)\right)_{(j,p) \in \mathcal{L}(T)}, \qquad (13)$$

where $E^p_j(x)$ denotes the energy of $x$ in the subspace $U^p_j$, and by orthonormality $\|x\|^2 = \sum_{(j,p) \in \mathcal{L}(T)} E^p_j(x)$.

5.1. The tree-pruning problem

Here we revisit the approach in Silva and Narayanan (2009), where the selection of the WP basis was based on approximating the minimum probability of error decision (Silva et al., 2012). This formulation reduces to finding an optimal tradeoff between the estimation and approximation errors and, consequently, addresses a complexity-regularization problem. More precisely, we address the solution of

$$T^{*}(\lambda) = \arg\min_{T \ll T_{full}} \; -F(m_T(X), Y) + \lambda \Phi(T), \qquad (14)$$
where $X$ is the random object representing the raw acoustic observation in our signal space $\mathcal{X}$, and $Y$ is the class label random variable with values in the finite alphabet space of phonetic classes $\mathcal{Y}$. The first term in (14) involves $F(\cdot, \cdot)$, which is a measure designed to capture the discriminative information of $m_T(X)$ relative to the class label $Y$ (fidelity measure). The second term $\Phi(\cdot)$ is a non-decreasing real function (cost term) designed to incorporate estimation error effects. The solution of (14), for all $\lambda > 0$, resides in the solution of the following cost-fidelity problem (Scott, 2005) (Silva and Narayanan, 2009, Sec. IV.D):

$$T^{k*} = \arg\max_{\{T \ll T_{full} \,:\, |T| \leq k\}} F(m_T(X), Y). \qquad (15)$$

The problem in (15) is equivalent to finding the filter-bank of size $k$ that maximizes the fidelity gain $F(m_T(X), Y)$, for all $k \in \{2, 3, \ldots, |T_{full}|\}$. Interestingly, when the fidelity measure is additive,4 or alternatively affine,5 with respect to the structure of $T$, which will be the case for all measures experimentally evaluated in this work (see Section 5.2), the solution of (15) admits an efficient implementation with complexity $O(|T_{full}| \log |T_{full}|)$ (Silva and Narayanan, 2009, Th. 2 and 3). Furthermore, (15) offers an embedded solution structure, i.e., $T^{2*} \ll T^{3*} \ll \cdots \ll T^{(|T_{full}|-1)*} \ll T_{full}$ (Silva and Narayanan, 2009, Th. 3). For completeness, the algorithm for solving (15) is presented in Section 5.3.

5.2. Fidelity measures
Let $\{(x_i, y_i)\}_{i=1}^{N}$ be independent and identically distributed (i.i.d.) realizations of the joint vector $(X, Y)$, where every pair $(x_i, y_i)$ corresponds to a speech segment and its respective phone label. As fidelity measures, we use the indicators proposed by Saito and Coifman (1994), Etemad and Chellapa (1998) and Silva and Narayanan (2009). All of them can be written in the additive form:

$$F(m_T(X), Y) = \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X), Y). \qquad (16)$$

4 A tree functional $\rho(\cdot)$ is additive if $\rho(T) = \sum_{(j,p) \in \mathcal{L}(T)} \rho(j, p)$ (Scott, 2005).
5 A tree functional $\rho(\cdot)$ is affine if, for any rooted binary trees $T, S$ such that $S \ll T$, $\rho(T) = \rho(S) + \sum_{s \in \mathcal{L}(S)} \left[\rho(T_s) - \rho(\{s\})\right]$, where $\{s\}$ is the trivial tree rooted at $s$, see (Scott, 2005).

5.2.1. KLD fidelity estimate

The first fidelity measure is the symmetric version of the Kullback-Leibler divergence (KLD) (Kullback, 1958) proposed in Saito and Coifman (1994). Let us define the normalized energy of $x \in \mathcal{X}$ by $\bar{E}^p_j(x) \equiv E^p_j(x) / \|x\|^2$, and the number of examples in class $y \in \mathcal{Y}$ by $N_y \equiv \sum_{i=1}^{N} I_{\{y\}}(y_i)$. Let the energy map $e(j, p, y)$ be given by

$$e(j, p, y) = \frac{1}{N_y} \sum_{i=1}^{N} I_{\{y\}}(y_i)\, \bar{E}^p_j(x_i), \qquad (17)$$

for any pair $(j, p) \in \{0, \ldots, J\} \times \{0, \ldots, 2^j - 1\}$ and $y \in \mathcal{Y}$. For a binary tree $T$, its class conditional energy signature is defined by

$$e_T(y) = \left(e(j, p, y)\right)_{(j,p) \in \mathcal{L}(T)}, \qquad (18)$$

where from Parseval's relationship we have that $\sum_{(j,p) \in \mathcal{L}(T)} e(j, p, y) = 1$. Therefore, we can treat $e_T(y)$ as a probability mass function and define the KLD fidelity as (Saito and Coifman, 1994)

$$F(m_T(X), Y) = \sum_{y, z \in \mathcal{Y}} D(e_T(y) \,\|\, e_T(z)). \qquad (19)$$

Here $D$ is the discrete KLD (Gray, 1990; Cover and Thomas, 1991). To write the functional in its additive form (16), we consider the following equalities:

$$F(m_T(X), Y) = \sum_{y, z \in \mathcal{Y}} D(e_T(y) \,\|\, e_T(z)) = \sum_{y, z \in \mathcal{Y}} \sum_{(j,p) \in \mathcal{L}(T)} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)} = \sum_{(j,p) \in \mathcal{L}(T)} \sum_{y, z \in \mathcal{Y}} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)} = \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X), Y),$$

where the leaf functional is

$$F(E^p_j(X), Y) = \sum_{y, z \in \mathcal{Y}} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)}. \qquad (20)$$
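The energy map (17) and the additive decomposition of the KLD fidelity into the leaf functionals (20) can be checked numerically with a short sketch. The array layout (one row per example, one column per leaf of the tree), the use of the row sum as $\|x\|^2$ (justified by the orthonormality relation after (13)), the integer class labels and all function names are assumptions of this example; the leaf energies themselves would come from a WP analysis that is not reproduced here.

```python
import numpy as np

def energy_map(leaf_energy, labels, n_classes):
    """e(j,p,y) of (17): class-wise average of the normalized leaf energies.

    leaf_energy: array (N, M), one column per leaf (j,p) of the tree T.
    Returns an array (n_classes, M) whose rows sum to one.
    """
    normalized = leaf_energy / leaf_energy.sum(axis=1, keepdims=True)
    return np.stack([normalized[labels == y].mean(axis=0) for y in range(n_classes)])

def leaf_kld(e):
    """Leaf functional (20), evaluated for every leaf; e has shape (classes, leaves)."""
    log_ratio = np.log(e[:, None, :]) - np.log(e[None, :, :])   # indexed [y, z, leaf]
    return (e[:, None, :] * log_ratio).sum(axis=(0, 1))

def pairwise_kld(e):
    """Right-hand side of (19): sum over ordered class pairs of D(e_T(y) || e_T(z))."""
    n = e.shape[0]
    return sum(float(np.sum(e[y] * np.log(e[y] / e[z]))) for y in range(n) for z in range(n))
```

For any strictly positive energy map, `leaf_kld(e).sum()` agrees with `pairwise_kld(e)` up to rounding, which is the chain of equalities leading to (20). Each leaf term is also non-negative, so adding leaves never decreases this fidelity.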
5.2.2. Parametric version of the mutual information: Fisher fidelity estimate
The second indicator is the mutual information (MI) adopted in Silva and Narayanan (2009). Assuming the Markov tree property presented in Prop. 3 (Silva and Narayanan, 2009), the functional is affine (Silva and Narayanan, 2009, Th. 3). To simplify the estimation, we assume that the class conditional distributions are Gaussian, where MI reduces to a version of the Fisher discriminative indicator (Silva and Narayanan, 2007; Padmanabhan et al., 2005), proposed by Etemad and Chellapa (1998). More precisely, let the energy vector of a signal $x_i$ in the tree $T$ be given by $m_T(x_i) = \left( E_j^p(x_i) \right)_{(j,p) \in T}$, and let $\hat{P}(\{y\}) = N_y / N$ denote the class probability mass, $\forall y \in \mathcal{Y}$. Assuming that the class conditional probability of object $m_T(X)$ is a multivariate Gaussian distribution, the maximum likelihood estimators of its mean and covariance are
\[
\hat{\mu}_y = \frac{1}{N_y} \sum_{i=1}^{N} I_{\{y\}}(y_i) \, m_T(x_i) \tag{21}
\]
and
\[
\Sigma_y = \frac{1}{N_y} \sum_{i=1}^{N} I_{\{y\}}(y_i) \, (m_T(x_i) - \hat{\mu}_y)(m_T(x_i) - \hat{\mu}_y)^{\dagger}, \tag{22}
\]
respectively. The unconditional mean estimator is $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} m_T(x_i)$. Now we can define the within-class scatter matrix $S_w$ for the tree $T$ by
\[
S_w(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\}) \, \Sigma_y, \tag{23}
\]
and the between-class scatter matrix by
\[
S_b(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\}) \, (\hat{\mu} - \hat{\mu}_y)(\hat{\mu} - \hat{\mu}_y)^{\dagger}. \tag{24}
\]
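The estimators in (21)-(24) translate almost directly into code. The following is a minimal pure-Python sketch under stated assumptions: `features` is a hypothetical list of real-valued energy vectors $m_T(x_i)$, `labels` holds the matching phone labels, and all function names are illustrative rather than taken from the paper.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]

def add_outer(acc, u, scale):
    """In place: acc += scale * u u^T (real-valued case of the dagger product)."""
    for r, a in enumerate(u):
        for c, b in enumerate(u):
            acc[r][c] += scale * a * b

def scatter_matrices(features, labels):
    """Return (S_w, S_b) as in (23) and (24), using the ML estimates (21) and (22)."""
    n, d = len(features), len(features[0])
    mu = mean_vector(features)                      # unconditional mean
    s_w = [[0.0] * d for _ in range(d)]
    s_b = [[0.0] * d for _ in range(d)]
    for y in sorted(set(labels)):
        members = [m for m, lab in zip(features, labels) if lab == y]
        prior = len(members) / n                    # class mass N_y / N
        mu_y = mean_vector(members)                 # class mean, eq. (21)
        sigma_y = [[0.0] * d for _ in range(d)]
        for m in members:                           # class covariance, eq. (22)
            add_outer(sigma_y, [a - b for a, b in zip(m, mu_y)], 1.0 / len(members))
        for r in range(d):                          # within-class scatter, eq. (23)
            for c in range(d):
                s_w[r][c] += prior * sigma_y[r][c]
        add_outer(s_b, [a - b for a, b in zip(mu, mu_y)], prior)  # eq. (24)
    return s_w, s_b
```

For a one-node tree the feature is a scalar, so both matrices are 1 x 1 and the trace of $S_w^{-1} S_b$ is simply their ratio.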
Finally, for a rooted binary tree $T$, its leaf $(j,p)$ fidelity functional is consequently defined by
\[
F(E_j^p(X), Y) = \mathrm{tr}\left( S_w^{-1}(t_v) S_b(t_v) \right) - \mathrm{tr}\left( S_w^{-1}(\{(j,p)\}) S_b(\{(j,p)\}) \right). \tag{25}
\]
In this context, $t_v$ is the binary tree rooted at $v = (j,p)$ with leaves $(j+1, 2p)$ and $(j+1, 2p+1)$ (see Fig. 2a), and $\{(j,p)\}$ is the one-node tree.

5.2.3. Energy fidelity estimate
Finally, as a non-discriminative indicator, we consider the average subspace energy proposed in Chang and Kuo (1993), i.e.,
\[
F(E_j^p(X), Y) = \frac{1}{N} \sum_{i=1}^{N} E_j^p(x_i). \tag{26}
\]
With the average energy fidelity measure in (26), the algorithm to solve (15), presented in Section 5.3, splits the leaf of the tree $T^k$ with the highest average energy to find $T^{k+1}$, the solution of order $k+1$.

5.3. Minimum cost tree pruning algorithm
To conclude this section, a dynamic programming (DP) algorithm to solve (15) is presented. We refer the interested reader to Scott (2005), Chou et al. (1989), Bohanec and Bratko (1994), Breiman et al. (1984) and Silva and Narayanan (2009) for a systematic exposition of the computational complexity, as well as theoretical results, of this algorithm.

Phase 0: (Choice of parameters)
  Choose a specific CMF pair $h_0, h_1$, a maximum level of decomposition $J$ and a fidelity functional $F$.
Phase 1: (Computation: Subband measurements and Fidelity Gain)
  $\forall j \in \{0, \ldots, J-1\}$, $\forall p \in \{0, \ldots, 2^j - 1\}$ compute:
    – $E_j^p(x_i)$ : $\forall x_i \in X$
  $\forall j \in \{0, \ldots, J-2\}$, $\forall p \in \{0, \ldots, 2^j - 1\}$ compute:
    – Fidelity gain $D(j,p)$
      if ($F$ is KLD functional)
        $D(j,p) = F(E_{j+1}^{2p}(X), Y) + F(E_{j+1}^{2p+1}(X), Y)$
      else
        $D(j,p) = F(E_j^p(X), Y)$
      end
Phase 2: (Initialization)
  Initialize: $T^2 = \{(0,0), (1,0), (1,1)\}$, then $L(T^2) = \{(1,0), (1,1)\}$
Phase 3: (Iteration)
  for $k = 2$ to $k = 2^J - 2$
    1. compute: $(j^*, p^*) = \arg\max_{(j,p) \in L(T^k) :\, j \leq J-1} D(j,p)$
    2. save: $T^{k+1} = T^k \cup \{(j^*+1, 2p^*), (j^*+1, 2p^*+1)\}$
  end

6. Analysis of filter bank solutions
The TIMIT corpus was adopted for all the experiments presented in this work. TIMIT is one of the standard corpora used to evaluate new methods and techniques in ASR, mainly because it is a phonetically balanced task and has good coverage of speakers and dialects. These properties make TIMIT a sufficiently challenging corpus with which to evaluate new ASR methods, which justifies its wide adoption by the community. The TIMIT corpus consists of 6300 utterances for the 8 major dialects of the United States. There are 630 different speakers, each one speaking 10 sentences. TIMIT phonetic transcriptions contain 64 phonetic classes, from which we have adopted the standard folding proposed in Lee and Hon (1989) that reduces the number of phonetic classes to 39 plus the silence model. The training set proposed in the TIMIT corpus was used to extract supervised data for the tree-pruning stage in Section 5.1. More precisely, we used the phonetic segmentations and labels of the TIMIT database, folded into 39 classes, to select the supervised training data. For each phone-segmented signal, we took three 20 ms segments, from the left, center, and right positions of the signal, and we considered those as realizations of the phoneme. With this data, we computed the fidelity measures presented in Section 5.2, i.e., the Fisher, the symmetric KLD, and the Energy tree functionals, respectively. Finally, those measures were used to create the filter-bank solutions by solving the pruning problem in (14) and (15). In addition, we have adopted four different pairs of two channel filters (TCFs) (see Section 3.3), associated with the Daubechies (DB) Wavelets (Daubechies, 1992; Mallat, 2009; Vetterli and Kovacevic, 1995) of order 6, 12, 24 and 44, respectively. With these we have good coverage of frequency selectivity properties to obtain a fairly representative family of WP filter-bank solutions.
It is important to point out that frequency selectivity was one of the key dimensions considered in this analysis.
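The filter-bank solutions referred to above come from the leaf-splitting iteration of Section 5.3 (Phases 2 and 3). A minimal sketch of that iteration follows, assuming the fidelity gains $D(j,p)$ of Phase 1 have already been computed and stored in a dictionary. The names `grow_wp_trees`, `gain`, `max_level` and `max_leaves` are hypothetical, nodes are indexed in the natural tree order $(j,p)$, and ties are broken by taking the smallest node index, which is an arbitrary choice.

```python
def grow_wp_trees(gain, max_level, max_leaves):
    """Greedy leaf splitting: return {k: sorted leaves of the tree with k leaves}.

    gain       -- hypothetical dict {(j, p): D(j, p)} of precomputed fidelity gains
    max_level  -- maximum decomposition level J; only leaves with j <= J - 1 are split
    max_leaves -- stop once a tree with this many leaves has been produced
    """
    leaves = {(1, 0), (1, 1)}                      # Phase 2: leaves of the initial tree
    solutions = {2: sorted(leaves)}
    k = 2
    while k < max_leaves:
        candidates = sorted(
            node for node in leaves if node[0] <= max_level - 1 and node in gain
        )
        if not candidates:                         # nothing left that may be split
            break
        j, p = max(candidates, key=lambda node: gain[node])   # Phase 3, step 1
        leaves.remove((j, p))                      # Phase 3, step 2: split the best leaf
        leaves.update({(j + 1, 2 * p), (j + 1, 2 * p + 1)})
        k += 1
        solutions[k] = sorted(leaves)
    return solutions
```

Because each tree is obtained from the previous one by a single split, the returned solutions are nested, which is what allows a whole family of filter-banks of increasing size to be produced in one pass.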
[Figure 7: three image panels of fidelity gains. Horizontal axis: index of frequency bands; vertical axis: depth of the WP decomposition (scale index).]
Fig. 7. Distribution of the KLD fidelity gains $D(j,k)$ indexed by the scale $j$ (vertical axes) and frequency location $k$ (horizontal axes) considering the frequency-ordered WP sub-space decomposition structure. A whiter color indicates a higher fidelity gain.
6.1. Analysis of fidelity gains across scale and frequency location
In this section we report the sensitivity of the WP filter-bank selection algorithm to the frequency selectivity, which is proportional to the order of the Daubechies TCF (DB-TCF) (Mallat, 2009). For that purpose, we have analyzed the fidelity gains across scale and position, represented by a scale index $j$ and a frequency localization index (position) $k$. We compared the fidelity gains of iterating the TCF (see Section 3.1) for the three fidelity functionals (Fisher, KLD and Energy). Fig. 7 shows the KLD-based gains of decomposing a frequency-ordered node $(j,k)$ (associated with a WP subspace) for the DB-TCF of orders 6, 12 and 24. As expected, higher discriminative gains are obtained in the low frequency domain. It is important to note in the figure that the KLD gain structure is not very sensitive to the order of the TCF, and tends to stabilize as the order (frequency selectivity) increases. This stability phenomenon was also observed with the Fisher-based gains, as well as the Energy gains. However, each of them has a particular fidelity gain structure, as shown in Fig. 8. This shows that the frequency selectivity does not imply a major change in the fidelity gains and, consequently, in the filter-bank tree-structures obtained from solving the minimum cost tree-pruning problem in (15). On the other hand, Fig. 8 illustrates the gains for the three fidelity criteria with the DB-TCF of order 44 (the highest selectivity). Interestingly, all the plots show that the salient information for discriminating phonemes, relative to the fidelity measure adopted, is localized in the low frequency domain. Consequently, the solutions of the optimal tree-pruning problem offer structures that give priority to iterating the TCF in this frequency range. In this regard, the non-discriminative criterion in Fig. 8, with respect to the discriminative criteria in Fig. 8c and b, has minor differences. However, these differences are sufficient to characterize a particular way of zooming in on the lower frequency region of the acoustic space. These zooming patterns could potentially imply some marginal but important differences in ASR recognition performances, as we shall see in the following sections.
6.2. Analysis of the filter-bank frequency responses
In order to contrast the filter-bank solutions induced from different frequency selectivity conditions, Fig. 9 shows the equivalent filter-bank frequency responses obtained for the scenarios with DB-TCF of orders 6 and 44, respectively. Confirming our previous analysis, the frequency selectivity does not significantly affect the structure of the filter-bank solutions, i.e., the way of iterating the TCF. This can be observed in the main lobes of the solutions, which are centered at the same frequencies, when comparing the solutions with the same number of frequency bands illustrated in the rows of Fig. 9. In fact, the solutions of size 6 (Fig. 9) and size 14 (Fig. 9) have the same tree topology; however, their frequency supports are clearly different. Concerning the frequency support, the trend is the following: the family of DB Wavelets converges to the Shannon Wavelets as the order of the TCF increases (see footnote 6), and hence the frequency supports of the filter-banks converge to the Shannon WP partitions in (8). Alternatively, for any order of the TCF, the frequency support of a subspace with arbitrarily large depth (scale) gets narrower following the Shannon WP frequency support, which in the limit converges to a fixed frequency point. Details of this result are presented in Section 3.2 (Atto et al., 2007, 2010). For our finite scale regime, the higher the order of the DB-TCF, the closer we are to the Shannon frequency partition in Section 4. Hence, by increasing the order of the TCF, the frequency bands are more clearly localized and the overlap between adjacent bands, or what we called between-band interference, is reduced.

6 A systematic exposition of this fact is presented in Shen and Strang (1996) and Shen and Strang (1998).

Associated with each frequency-ordered leaf $(j,k)$ of a given WP tree, we have its main lobe centered in the
[Figure 8: three image panels of fidelity gains, one per fidelity method. Horizontal axis: index of frequency bands; vertical axis: depth of the WP decomposition (scale index).]
Fig. 8. Fidelity gains $D(j,k)$ indexed by the scale $j$ (vertical axes) and frequency location $k$ (horizontal axes) considering the frequency-ordered WP subspace decomposition structure. The Daubechies of order 44 is considered and the results are presented for the three methods. Whiter color indicates higher fidelity gain.
frequency range $I_j^k$ in (8). However, there are also secondary lobes with significant gains, which are not necessarily adjacent to the target band $I_j^k$, in particular for solutions with a small TCF order. This phenomenon produces a very complex interference pattern, as illustrated in Fig. 9. Interpreting these results, the projection onto the subspace associated with a given WP node $(j,k)$ contains information from: its target Shannon band $I_j^k$; the neighboring bands of $I_j^k$; and, less intuitively, undetermined non-adjacent bands, because of the gains of the secondary lobes illustrated in Fig. 9. The good news is that those secondary-interference lobes vanish as the frequency selectivity increases. These asymptotic trends have a formal justification in the fact that the DB WPs converge to the Shannon WPs as the TCF order tends to infinity (Shen and Strang, 1996, 1998). Finally, Fig. 10 shows the frequency response of the equivalent filter-banks obtained with a discriminative and a non-discriminative method. We use the DB-TCF of order 44 to induce filter-banks with clearer structures and reduced side-lobe interference. As was illustrated in Figs. 7 and 8, the pruned solutions offer higher resolution in the low frequency region. In general, the M-channel filter-bank solutions of the same size are similar (rows of Fig. 10), but as we increase the number of bands, some minor differences can be observed. In conclusion, for a clean acoustic speech process, the filter-banks obtained are largely independent of the pruning method, and no major contrast is observed between the use of a discriminative and a non-discriminative criterion. This verifies the preliminary results obtained in Silva and Narayanan (2009), where it was claimed that the acoustic speech process is an optimal design, in the sense that it allocates energy in the frequency bands that offer higher frequency discrimination. These results are based on short-time (frame by frame) information analysis of acoustic speech processes to discriminate phonemes, and do not consider, for instance, a noisy scenario, or higher level contextual information, where alternative trends could be observed.
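The main-lobe and side-lobe behaviour discussed in this section can be explored numerically. A node reached by a sequence of low-pass/high-pass choices has an equivalent filter whose frequency response is the product of the stage responses evaluated at $\omega, 2\omega, 4\omega, \ldots$, one factor per level. The sketch below is illustrative only: the function names are hypothetical, the high-pass filter is derived from the low-pass one with the alternating-flip rule (one common convention, which may differ from the paper's), and `path` follows the natural tree order, so the Gray-code reordering to frequency order is not applied.

```python
import cmath
import math

def highpass_from_lowpass(h0):
    """Alternating-flip rule (assumed convention): h1[n] = (-1)^n * h0[L - 1 - n]."""
    n = len(h0)
    return [((-1) ** i) * h0[n - 1 - i] for i in range(n)]

def freq_response(taps, omega):
    """Discrete-time Fourier transform of a finite filter at angular frequency omega."""
    return sum(c * cmath.exp(-1j * omega * i) for i, c in enumerate(taps))

def equivalent_gain(h0, path, omega):
    """Amplitude gain at omega of the WP node reached by `path`.

    path -- list of 0 (low-pass branch) or 1 (high-pass branch), root level first.
    """
    h1 = highpass_from_lowpass(h0)
    response = 1.0 + 0.0j
    for level, branch in enumerate(path):
        taps = h1 if branch else h0
        response *= freq_response(taps, (2 ** level) * omega)
    return abs(response)

# Haar pair, used here only because its taps are short and well known.
HAAR_LOWPASS = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
```

Sweeping `omega` over $[0, \pi]$ for a fixed path gives a curve comparable in kind to the amplitude-gain plots discussed above; with the short Haar filter the side lobes are pronounced, which is the low-selectivity end of the trend described in the text.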
7. Phone recognition experiments
The analysis made in this work considered a number of degrees of freedom for acoustic FE, such as: the fidelity measure for the filter-bank selection problem presented in Section 5.1 (and, therefore, the set of embedded tree-structured WP filter-banks); the frequency selectivity of the TCF; the filter-bank size; and the feature space dimension. As we presented in previous sections, we induce the WPCCs by: first, selecting an M-channel WP filter-bank; second, deriving the frequency-ordered energy coefficients; and finally, applying the DCT for de-correlation as well as for dimensionality reduction (Quatieri, 2002) by choosing the first m < M transformed DCT coefficients. The resulting WPCC features are the previously mentioned m Cepstral coefficients plus the log-energy of the frame. The experiments are conducted in a sequence of incremental steps. First, we start the analysis with a simplified mono-phone recognition task that does not consider contextual information appended to the WPCC feature vector, i.e., delta and acceleration coefficients. This initial phase is designed to explore the feature space dimension (number of Cepstral coefficients) and WP tree size (number of bands) to define an initial range of values to be explored in the more complex settings. This analysis is conducted under different frequency selectivity for the TCF, and for all the fidelity measures. We then expand the analysis, enriching the feature vector with delta and acceleration coefficients, under the same mono-phone recognition task, to see if we observe similar trends. For that, we re-run the phone recognition experiments in the range of values obtained in the previous phase. Finally, we run a state-of-the-art phone recognition experiment considering context dependent HMM-acoustic phone models (tri-phones) with a bigram language model. As a benchmark in all the phases mentioned, we have chosen the standard MFCC features computed with the 22 channel MEL-filters and adopting the first 12 Cepstral coefficients plus frame log-energy as the feature vector.
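The three-step WPCC construction described above (band energies, DCT, truncation to m coefficients plus frame log-energy) can be sketched as follows. Several details are assumptions made for illustration and are not stated in this section: the logarithm is applied to the band energies before the DCT, the DCT is the unnormalized type II, the retained coefficients start at index 0, and a small floor guards the logarithm. All names are hypothetical.

```python
import math

def dct_ii(values, num_coeffs):
    """First `num_coeffs` coefficients of the unnormalized DCT-II of `values`."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * q * (i + 0.5) / n) for i, v in enumerate(values))
        for q in range(num_coeffs)
    ]

def wpcc_features(band_energies, frame, num_coeffs, floor=1e-12):
    """Cepstral vector for one frame: m DCT coefficients plus the frame log-energy.

    band_energies -- frequency-ordered energies of the M filter-bank channels
    frame         -- time-domain samples of the same frame
    num_coeffs    -- m, the number of Cepstral coefficients kept (m < M)
    """
    log_bands = [math.log(max(e, floor)) for e in band_energies]
    cepstrum = dct_ii(log_bands, num_coeffs)
    frame_energy = sum(s * s for s in frame)
    return cepstrum + [math.log(max(frame_energy, floor))]
```

With M = 24 bands and m = 12, the returned vector has 13 entries, which matches the static feature dimension of the benchmark configuration.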
[Figure 9: grid of filter-bank frequency-response plots, two columns. Horizontal axis: normalized frequency (0 to 1); vertical axis: amplitude gain.]
Fig. 9. Frequency response of the Wavelet Packet filter-bank solutions. The solutions were obtained with Daubechies of order 6 (left column) and 44 (right column), respectively. Plots are normalized over the interval [0, 8 kHz].
In general, for each speech segment, we computed the MFCC and WPCC features using Hamming windows of 32 ms with a frame rate of 10 ms. The ASR system was implemented with the HTK toolbox (Young, 2009), where for each phone acoustic model we adopted the standard 5-state hidden Markov model (HMM) (Rabiner, 1989) with 3 emitting states, the standard left-to-right topology, and a mixture of 16 Gaussians as the observation distribution (Rabiner, 1989). We used the steps proposed in the TIMIT documentation to train all models in this work, and the Core-test set of the TIMIT corpus was used for obtaining ASR performances.
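The short-time analysis described above can be sketched as a simple framing routine. The 32 ms window and 10 ms frame shift come from the text; the 16 kHz sampling rate is an assumption (it is consistent with plots normalized over [0, 8 kHz]), as are the standard 0.54/0.46 Hamming coefficients and the choice to drop a trailing partial frame. The function names are hypothetical.

```python
import math

def hamming(n):
    """Standard Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, sample_rate=16000, window_ms=32, hop_ms=10):
    """Split `samples` into overlapping Hamming-windowed frames.

    At the assumed 16 kHz rate this gives 512-sample windows shifted by 160 samples.
    A trailing partial frame is dropped.
    """
    win = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    window = hamming(win)
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        segment = samples[start:start + win]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames
```

A 512-sample window is a power of two, which is convenient for a dyadic WP decomposition of each frame, although that link is an observation here rather than something the text states.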
[Figure 10: grid of filter-bank frequency-response plots, two columns. Horizontal axis: normalized frequency (0 to 1); vertical axis: amplitude gain.]
Fig. 10. Frequency response of the Wavelet Packet filter-bank solutions. The figures show a comparison between non-discriminative and discriminative criteria, Energy (left column) and KLD (right column), respectively. The solutions were obtained with Daubechies of order 44. Plots are normalized over the interval [0, 8 kHz].
7.1. Context-independent phone recognition experiments
The pruning solutions of size 24 obtained from the three fidelity functionals (KLD, Fisher and Energy) are presented here. The acoustic features are the WPCC plus log-energy with a fixed number of bands, where we varied the number of Cepstral coefficients from 6 to 24, to gain insight into the most appropriate dimension for the feature space. In this context, Fig. 11a shows the performance trends of the Fisher fidelity WPCC solutions across the feature space dimension and for different frequency selectivity given by the order of the DB-TCF (db6, db12, db24 and db44). In each of these performance curves, the curse of dimensionality is observed, as expected. There is an initial increasing trend in performance that later saturates and decreases, attributed to the well-understood estimation error phenomenon present in this learning-decision problem. The results show an optimal range for the feature space dimension starting approximately at dimension 11 and ending approximately at dimension 19. This good range of feature dimension is practically invariant when we increase the number of bands and the frequency selectivity of the filter-bank solutions. This behavior is also consistent with the other two fidelity measures, KLD and Energy, exemplified in Fig. 11b and c for the WPCC filter-bank solutions of 24 bands in each case. Considering the good range of feature dimension obtained in the previous set of experiments, we fixed one of them, dimension 13 (12 Cepstral coefficients plus log-energy), to show the performance trend with respect to the number of bands of the WP filter-bank solutions (WP tree size). The experiments again consider all fidelity measures and TCF orders (db6, db12, db24 and db44). Fig. 12 shows these trends. Again we observed a performance trend that increases, then saturates, and finally decreases as we explore WP filter-bank solutions with an increasing number of bands. Since in this case the feature dimension is fixed, this trend cannot be attributed to the curse of dimensionality and, consequently, has to do with the acoustic discrimination power of the filter-bank solutions. From these results we conclude that a good range of exploration in the number of bands is from 18 to 26. Before we change the focus to the next set of experiments, a couple of remarks should be made. It is very interesting to observe the trend with respect to the frequency selectivity in the obtained results, Figs. 11 and 12. In
[Figure 11: three panels (a)-(c). Horizontal axis: number of Cepstral coefficients (6 to 24); vertical axis: % accuracy on the Core-test set. Curves per panel: db44, db24, db12, db6 and the MFCC_E baseline.]
Fig. 11. Recognition accuracies in the Core-test set as a function of the number of Cepstral coefficients for a fixed size of WP filter-bank (number of bands) and static features. Effect of frequency selectivity for the Fisher functional filter-banks of size 24 (a), KLD functional filter-banks of size 24 (b), and Energy functional filter-banks of size 24 (c).
[Figure 12: three panels (a)-(c). Horizontal axis: number of bands (14 to 30); vertical axis: % accuracy on the Core-test set. Curves per panel: db44, db24, db12, db6 and the MFCC_E baseline.]
Fig. 12. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed 12 Cepstral coefficients and static features. Effect of frequency selectivity for the Fisher functional filter-banks (a), KLD (b) and Energy (c).
almost all cases, increasing the frequency selectivity provides better performance for any given dimension, filter-bank size, and fidelity measure adopted. This supports our conjecture that inter-band interference is something to be avoided for acoustic discrimination and, consequently, that better performance can be achieved by increasing the order of the DB-TCF in our context. This is congruent with some of the results presented in Choueiter and Glass (2007) for the case of a simplified phone-segmented classification task. It is also important to note that we have already obtained concrete settings for our WPCCs that outperform the standard MFCC features, under the same scenario that does not consider contextual information in the acoustic features. In this mono-phone recognition task, this benchmark has 44.87% recognition accuracy. Finally, we add delta and acceleration coefficients to the analysis. It is well understood that dynamic features improve recognition rates, but it is interesting to observe their particular effects on our WP filter-bank features. We consider a similar set of scenarios (number of bands, number of Cepstral coefficients) to explore the effect of the frequency selectivity and the fidelity criterion. Fig. 13 shows recognition accuracies as a function of the number of bands for a given fixed Cepstral feature dimension in the set {11, 12, 13, 14}, which maps to a feature vector of dimensions {36, 39, 42, 45}, respectively, and with the maximum order (frequency selectivity) in the TCF. In general, the best set of results is obtained in the range of 20–26 bands, illustrated in Fig. 13. In addition, out of this range, the energy fidelity criterion systematically shows the best performance curves and, consequently, the most competitive results with respect to the standard MFCCs (39-dimensional feature vector) with a baseline of 55.3% in accuracy. In spite of that, the best result is obtained with the KLD fidelity
[Figure 13: four panels. Horizontal axis: number of bands (14 to 26); vertical axis: % accuracy on the Core-test set. Curves per panel: KLD db44, Energy db44, Fisher db44 and the MFCC_EDA baseline.]
Fig. 13. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed numbers of Cepstral coefficients, adding delta and acceleration features. Comparison of solutions obtained for all pruning methods and the highest frequency selectivity considered (DB 44).
measure, for the solution of 22 bands and 12 Cepstral coefficients (a 39-dimensional feature vector) shown in Fig. 13b, with a recognition accuracy of 55.36%. Fig. 14, on the other hand, revisits the effect of the frequency selectivity on the recognition accuracy for the KLD and Energy based solutions with 12 Cepstral coefficients. This verifies that the higher order DB-TCF achieves the best performance. Finally, Table 1 presents the gain of adding delta and acceleration coefficients to the feature vector. This gain increases with the frequency resolution of the WP filters, reaffirming the advantage of adopting higher order TCFs for this task.

7.2. Context-dependent phone recognition experiments
Finally, we evaluate performance in the standard phone recognition task that considers context-dependent HMMs, Cepstral acoustic features plus delta and acceleration, and a bi-gram language model. For this, we focus the analysis on the range of 20–26 bands, and the Cepstral feature dimension in the neighborhood of 13 coefficients. This is the range of values with good performance observed in the previous set of experiments. Fig. 15 shows recognition accuracies as a function of the number of Cepstral coefficients. Here we report the best trends, observed for the case of 24 and 26 filter-bank bands with the DB44 TCF. These trends were obtained with 9 to 15 Cepstral coefficients, i.e., feature space dimensions from 30 to 48. The estimation-approximation error trade-off can be observed, as expected; however, these trends are different from those in the context-independent case, shown in Fig. 11. The reason is that, in this context, the number of models is larger, as is the number of model parameters to be estimated, but the training data remains the same. This causes the estimation error to dominate the approximation error earlier, in lower dimensional feature spaces, with respect to the results shown in Fig. 11.
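The feature dimensions quoted above follow from appending delta and acceleration coefficients to the m Cepstral coefficients plus log-energy, giving 3(m + 1) entries per frame. The sketch below reproduces that count and shows one common way of computing the dynamic features. The regression formula and the window of two frames on each side are assumptions made for illustration; this section does not specify how the delta and acceleration coefficients were computed. All names are hypothetical.

```python
def feature_dimension(num_cepstral):
    """Static vector (m Cepstral + log-energy) tripled by delta and acceleration."""
    return 3 * (num_cepstral + 1)

def delta(frames, width=2):
    """Regression-based time derivative of a feature sequence (assumed formula).

    Frames beyond the sequence boundaries are replaced by the first or last frame.
    """
    denom = 2 * sum(t * t for t in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for i in range(len(frames)):
        row = []
        for c in range(len(frames[i])):
            num = 0.0
            for t in range(1, width + 1):
                ahead = frames[min(i + t, last)][c]
                behind = frames[max(i - t, 0)][c]
                num += t * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Append delta and acceleration (delta of delta) to each static frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

For example, `feature_dimension(12)` gives the 39-dimensional vector used by the MFCC baseline, and the range of 9 to 15 coefficients maps to dimensions 30 to 48.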
[Figure 14: two panels (a)-(b). Horizontal axis: number of bands (14 to 30); vertical axis: % accuracy on the Core-test set. Curves per panel: db44, db24, db12, db6 and the MFCC_EDA baseline.]
Fig. 14. Recognition accuracies in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed 12 Cepstral coefficients adding delta and acceleration features. Comparison of solutions at different frequency selectivity for energy (a) and KLD (b).
Table 1
Average gains in recognition accuracy when passing from WPCC_E to WPCC_EDA acoustic features. Accuracies obtained in a scenario with 12 Cepstral coefficients plus log-energy and number of bands from 14 to 30. The first row shows the average recognition accuracy of static features in the four Daubechies Wavelet scenarios for the KLD, Fisher and Energy solutions. The second row shows the accuracy obtained when running the same task using delta and acceleration features, and the third row shows the accuracy gain.

             DB6 (%)   DB12 (%)   DB24 (%)   DB44 (%)
WPCC_E       42.09     43.29      43.97      44.26
WPCC_EDA     51.19     53.13      54.2       54.5
Gain         9.1       9.84       10.22      10.24
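The gain row of Table 1 can be checked by subtracting the static accuracy from the accuracy with dynamic features, column by column. The short sketch below does this with the tabulated values; the variable names are hypothetical.

```python
# Accuracies (%) copied from Table 1.
static_acc = {"DB6": 42.09, "DB12": 43.29, "DB24": 43.97, "DB44": 44.26}
dynamic_acc = {"DB6": 51.19, "DB12": 53.13, "DB24": 54.2, "DB44": 54.5}

# Gain from adding delta and acceleration features, rounded to two decimals.
gains = {name: round(dynamic_acc[name] - static_acc[name], 2) for name in static_acc}
```

Recomputed from the rounded entries, the DB24 gain comes out as 10.23 rather than the tabulated 10.22; a difference of this size is what one would expect if the gain was computed before the accuracies were rounded, although the table does not say so. In all cases the gain grows with the filter order.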
We observed again that the Energy and KLD methods offer the best performance trends, which is consistent with the previous context-independent phone recognition results. The best two performances are achieved with the Energy functional, in the scenarios with 26 bands and 11 Cepstral coefficients (68.04%) and with 24 bands and 11 Cepstral coefficients (68.09%), Fig. 15b, respectively. Those results are very competitive with the state-of-the-art MFCC features (baseline of 67.28%); in fact, they offer a relative improvement of 1.2% in the best case tested. To conclude this analysis, the equivalent filter-banks of the Energy solutions with 24 and 26 bands are presented in Fig. 16a and b, respectively. The Mel-scale has a linear-uniform frequency partitioning in the lower frequency range and moves to a uniform logarithmic partitioning in the rest (Quatieri, 2002). Following this trend, our best two solutions, shown in Fig. 16a and b and in Table 2, offer an
[Figure 15: two panels. Horizontal axis: number of Cepstral coefficients (9 to 15); vertical axis: % accuracy on the Core-test set. Curves per panel: KLD db44, Energy db44, Fisher db44 and the MFCC_EDA baseline.]
Fig. 15. Phone recognition accuracies with context-dependent phone models as a function of the number of Cepstral coefficients, considering delta and acceleration features.
approximately uniform partition (with the same bandwidth) in the interval [0, 1 kHz] and then an increasing bandwidth from 2 Hz to 8 kHz, as depicted in Table 2. Hence, as expected, our data-driven WP filter-bank solutions offer, in general, the Mel frequency partition type of structure.
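The relative improvement quoted in this section for the best Energy solution over the MFCC baseline follows directly from the two reported accuracies (68.09% and 67.28%):

\[
\frac{68.09 - 67.28}{67.28} = \frac{0.81}{67.28} \approx 0.0120,
\]

i.e., about 1.2% relative, which corresponds to an absolute difference of 0.81 percentage points.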
7.3. Final analysis
respectively. In general, all these results are below the MFCC baseline of 67.28% for this phone-recognition task and, in consequence, below the best performance of 68.09% reported for the WPCCs, even in the scenario in which we match the filter orders adopting DB44. In terms of the filter-bank structure, our best data-driven solution with 24 bands presented in Table 2 offers frequency bands similar to those adopted in Farooq and Datta (2001) and Choueiter and Glass (2007), presented in Tables 3 and 4, respectively. The reason again is that our solution follows the general structure of the MELscale. However, it is important to emphasize the minor structural mismatches to justify the performance differences among the WP solutions. On this, the entries in bold in Tables 3 and 4 indicate the bands that have differences, in terms of bandwidth or frequency support, with respect to our best solution shown in Table 2. In particular, the 24 band solution in Table 3 has different frequency partitions in the intervals [0, 250 Hz], [1000 Hz, 1500 Hz] and [3000 Hz, 5000 Hz]. The same comparison can be made for the 26 band WP of Table 4, where the differences are concentrated in the [0, 250 Hz] and [5000 Hz, 6000 Hz] regions. It is worth mentioning that the 26 band WP can be generated from our 24 band solution, by splitting the (6,0) and (3,5) leaves, therefore, the structural differences are minor, but important to induce particular feature attributes for the task.
8. Summary, discussion and final remarks This work proposes the Wavelet-Packet Cepstral coefficient (WPCC) as a dynamic filter-bank structure to perform short-time (frame-by-frame) acoustic analysis for ASR. A collection of log-energy based acoustic signatures with different time-frequency resolutions was derived, extending the conventional MFCC scheme. In the process, the filter-bank properties and basis structure of WaveletPackets (WPs) were fully considered, where the interpretation of WP as a filter-bank analysis scheme was put into the frame-by-frame acoustic analysis context. In particular, the equivalent filter-bank frequency response of a WP basis was defined, where the Gray code and the concept of
3
3
2.5
2.5
2
2
Amplitude Gain
Amplitude Gain
Finally our solutions are compared with two state-ofthe-art dyadic WP based features for ASR. In particular, we implemented the 24 and 26 band WP energy-signatures considered by Farooq and Datta (2001) and Choueiter and Glass (2007), respectively. The ideal frequency partitions of those WP solutions are shown in Tables 3 and 4, respectively. In (Farooq and Datta, 2001), the FE is implemented with the Daubechies TCF of order 6 (DB6) considering a vector of 13 Cepstral coefficients. On the other hand, the acoustic features proposed in Choueiter and Glass (2007) (for the case of dyadic WP) were obtained from the concatenation of 26 log energy vectors plus dynamic features obtained at the phone segmental level, where, at the end of this process, principal component analysis (PCA) was used to reduce the dimensionality of the resulting vector, targeting a phone segmented classification task. Their dyadic WPs were implemented using Daubechies (DB) TCF of orders 4, 6, 10 and 12, respectively. To contextualize these solutions in our time-series phone recognition scenario and to make them comparable with our solutions, we only consider their WP filter bank structure. More precisely, we consider the binary-tree topologies of the WP bases with their respective dyadic partition of the frequency space and their induced WPCCs plus dynamic features (delta and acceleration) based on the general scheme presented in Section 2. The accuracies obtained for the 24 and 26 band WP solutions with DB6 were 63.37% and 61.19%, respectively. For the 26 band WP, increasing the order of the TCF to DB12 improves the performance to 64.59%, which is consistent with our previous analysis on frequency selectivity. Because of this trend, we also tried the unexplored DB44 for the 24 and 26 band solutions obtaining improvements of 66.45% and 66.33%,
[Fig. 16: two panels, (a) and (b), each plotting Amplitude Gain (0 to 3) against Normalized Frequency (0 to 1).]
Fig. 16. Frequency response of the filter-banks with the two highest performances tested. The frequency range is normalized over the interval [0, 8 kHz].
Table 2
Shannon WP frequency partition of the interval [0, 8 kHz] for the filter-bank solution of Fig. 16a. It contains the frequency-ordered leaves of the WP tree, i.e., $\{(j, k = G(p)) : (j, p) \in L(T)\}$, and their respective frequency supports ($I^k_j$) and bandwidths in Hz.

Leaf (j, k)    Band I^k_j (Hz)    Bandwidth (Hz)
(5, 0)         [0, 250]           250
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000
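The band edges listed in Table 2 follow directly from the leaf labels. The short Python sketch below is only an illustration, not the authors' code: it assumes that the first index j can be read as the number of two-channel splits from the root (that is, L = 0 in the tables), so that a frequency-ordered leaf (j, k) covers the k-th of 2^j equal sub-bands of [0, 8 kHz]. The names `leaf_band`, `is_partition` and `TABLE_2_LEAVES` are hypothetical helpers introduced here.

```python
F_MAX_HZ = 8000  # upper edge of the analysed interval [0, 8 kHz]


def leaf_band(depth, k, f_max=F_MAX_HZ):
    """Ideal (Shannon) band, in Hz, of the frequency-ordered leaf (depth, k).

    Assumption: `depth` is the number of two-channel splits from the root,
    so [0, f_max] is divided into 2**depth sub-bands of equal width.
    """
    width = f_max / 2 ** depth
    return k * width, (k + 1) * width


# Leaves of Table 2, listed in frequency order.
TABLE_2_LEAVES = (
    [(5, 0)]
    + [(6, k) for k in range(2, 8)]
    + [(5, k) for k in range(4, 16)]
    + [(4, 8), (4, 9)]
    + [(3, 5), (3, 6), (3, 7)]
)


def is_partition(leaves, f_max=F_MAX_HZ):
    """Return True if consecutive leaf bands tile [0, f_max] with no gap or overlap."""
    edge = 0.0
    for depth, k in leaves:
        low, high = leaf_band(depth, k, f_max)
        if low != edge:
            return False
        edge = high
    return edge == f_max


if __name__ == "__main__":
    for depth, k in TABLE_2_LEAVES:
        low, high = leaf_band(depth, k)
        print(f"({depth}, {k}): [{low:g}, {high:g}] Hz, bandwidth {high - low:g} Hz")
    print("tiles [0, 8 kHz]:", is_partition(TABLE_2_LEAVES))
```

Under the same assumption, the check can be repeated for the leaf lists of Tables 3 and 4 to confirm that each admissible tree induces a complete partition of the frequency axis.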
Table 3
Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 24 bands considered by Farooq and Datta (2001). It contains the frequency-ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I^k_j (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(6, 8)         [1000, 1125]       125
(6, 9)         [1125, 1250]       125
(6, 10)        [1250, 1375]       125
(6, 11)        [1375, 1500]       125
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(4, 6)         [3000, 3500]       500
(4, 7)         [3500, 4000]       500
(3, 4)         [4000, 5000]       1000
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000
Table 4
Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 26 bands considered by Choueiter and Glass (2007). It contains the frequency-ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I^k_j (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(4, 10)        [5000, 5500]       500
(4, 11)        [5500, 6000]       500
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000
filter-bank frequency ordering was revisited. This last point is an important concept that, to the best of our knowledge, has not been treated in previous work on the topic (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996).

The main contribution of this work is the systematic exploration of the problem of WP filter-bank selection to obtain adaptive and nearly optimal energy-based filter-bank signatures for an ASR task. This important dimension of analysis has not been considered in previous studies on Wavelets and WPs for ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In this regard, Farooq and Datta (2001) considered a fixed tree topology (frequency partition pattern) based on the Mel scale, while in the work of Choueiter and Glass (2007) the objective was to obtain a specific critical-band frequency partition by adopting two previously unexplored filter-bank design methods, as well as rational and dyadic WP filter-banks. In this work, the filter-bank selection problem was addressed with a complexity-regularized criterion, with the objective of modeling the well-understood trade-off between feature discrimination and feature complexity. Three methods were explored to provide a wide range of data-driven filter-bank solutions induced from the proposed WPCC analysis scheme, and the performance of those solutions was evaluated and contrasted. It is worth noting that all the proposed filter-bank selection methods reduce to an equivalent tree-pruning problem with additive or affine functionals, which consequently admits computationally efficient implementations, i.e., with a complexity that grows polynomially with the size of the problem (Silva and Narayanan, 2009).

Moving on to the experimental findings, as reported in Section 6, there are only marginal differences in the fidelity gains observed when considering discriminative and non-discriminative fidelity indicators.
This implies that the filter-bank solutions obtained show similar structures, which in general provide increased frequency resolution in the low-frequency range. This verifies the well-known fact that the discriminative information of the speech acoustic process is embedded in the lower frequency bands, and that the speech production-perception process can be considered an optimal communication design, in the sense that there is more signal energy in the frequency region where more perception (frequency discrimination) is available. On the experimental side, this is demonstrated under concrete experimental conditions and with the standard HMM-based phone recognition task. The energy-fidelity-based WPCC solutions offer the best performance compared with two discriminative fidelity indicators (Fisher-scatter-based and Kullback-Leibler divergence-based) and, furthermore, they include a number of constructions that outperform the state-of-the-art MFCCs. Interestingly, under clean acoustic conditions, our data-driven frequency selectivity methods offer filter-bank solutions that follow, in general, the structure of the Mel scale, although our approach offers performance improvements
with respect to the state-of-the-art Mel-based energy signatures (MFCCs). In addition, we show that frequency selectivity in the design of the Wavelet Packet filter-banks is a critical dimension of analysis for obtaining good performance. More precisely, the better the selectivity of the two-channel filter (the basic block that constructs the WP basis family), the better the phone recognition performance obtained from the resulting filter-bank solutions, which agrees with some of the findings presented in Choueiter and Glass (2007).

Although the reported ASR performance improvements can be considered marginal, the generality of our WPCC construction is worth emphasizing. WPCC acoustic features offer a natural way of extending the MFCC filter-bank analysis paradigm by considering a much more general characterization of the filter-bank analysis part. In fact, we provide a way of creating not only a fixed solution, but also a family of embedded filter-bank solutions (and their respective Cepstral energy-based features) with increased frequency discrimination. As we have shown in our experiments, these solutions are adapted to the task, i.e., they offer the optimal estimation-approximation error trade-off, which depends on a number of strongly task-dependent factors. To mention a few of them: the intrinsic acoustic-discrimination complexity of the task (the approximation part); the modeling assumptions; the number of model parameters; the amount of data (the estimation error part); and the presence of distortion or noise in the training data.

9. Future work
For some applications, it would be beneficial to work with non-optimal parsimonious representations, saving algorithmic complexity at the expense of some accuracy. Examples are scenarios with communication constraints, or scenarios where the task has a smaller vocabulary, in which the algorithmic complexity associated with Viterbi decoding is a critical issue in the design of an ASR solution. In this context, the proposed embedded filter-bank solutions have the flexibility to address the trade-off between performance and algorithmic complexity. In this regard, we believe that there are a number of directions to be explored with respect to the operational flexibility that the proposed WPCCs offer for ASR applications. Another important direction for future work is to evaluate the WPCCs in the problem of robust ASR under different noisy conditions, source coding distortions and channel degradations.

Fig. 17. Commuting relationship between the down-sampler and filtering.

Acknowledgment
The work was supported by funding from FONDECYT Grant 1110145, CONICYT-Chile. We are grateful to the anonymous reviewers for their suggestions and comments, which contributed to improving the quality and organization of the work. We thank S. Beckman for proofreading this material.

Appendix A. Wavelet Packets: an alternative view of its subspace frequency content
The conjugate mirror filter pair $(h(n), g(n))$ maps the canonical basis $B_L$ of $X$ to an alternative orthogonal basis $B^0_{L+1} \cup B^1_{L+1}$. Importantly, we can associate the sub-spaces $U^0_{L+1} = \mathrm{span}\{\phi^0_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} = \mathrm{span}\{\phi^1_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ with a frequency content of $X$ by the following relationship:
$$\hat{\phi}^{0}_{L+1}(w) = \hat{h}(2^{L} w)\, \hat{\phi}_{L}(w), \qquad \hat{\phi}^{1}_{L+1}(w) = \hat{g}(2^{L} w)\, \hat{\phi}_{L}(w), \tag{A.1}$$
where $\hat{\phi}^{0}_{L+1}(w)$ and $\hat{h}(2^{L} w)$ denote the Fourier transform (FT) and the Discrete-Time Fourier transform (DTFT) of $\phi^{0}_{L+1}(t)$ and $h(n)$ (alternatively, $\phi^{1}_{L+1}(t)$ and $g(n)$), respectively. Iterating the application of $(h(n), g(n))$, we induce $\phi^{p}_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^{j} - 1\}$, where the frequency content of any arbitrary sub-space in the chain, for instance $U^{p}_{L+j} = \mathrm{span}\{\phi^{p}_{L+j}(t - 2^{L+j} n) : n \in \mathbb{Z}\}$, is inherited from (A.1) by:
$$\hat{\phi}^{2p}_{L+j+1}(w) = \hat{h}(2^{L+j} w)\, \hat{\phi}^{p}_{L+j}(w), \qquad \hat{\phi}^{2p+1}_{L+j+1}(w) = \hat{g}(2^{L+j} w)\, \hat{\phi}^{p}_{L+j}(w). \tag{A.2}$$
Figs. 4 and 5 illustrate those frequency maps for the ideal Shannon pair of filters, which provides a perfect partition of the frequency content of $X$.

Appendix B. Multi-rate filter-bank property
Proposition 2 (Vetterli and Kovacevic, 1995, Chap. 2, pp. 72–73). Let $h(n)$ be the impulse response of an LTI system with transfer function $H(z)$. Then, for any $(x(n)) \in \mathbb{R}^{\mathbb{Z}}$, the following two operations are equivalent: passing $x(n)$ through a down-sampler by a factor $N$ and then through the LTI system with transfer function $H(z)$; and passing $x(n)$ through $H(z^N)$ and then through the down-sampler by a factor $N$. Fig. 17 illustrates the relationship.

Appendix C. The Gray code
Proposition 3 (Mallat, 2009, Chap. 8.1.2). Let $(j, p)$ be an admissible node of the Shannon WP decomposition with binary path $\Theta(j, p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0, 1\}^{j-L}$. Then its equivalent frequency-ordered label $(j, k)$ is constructed by the following rule:
$$k = G(p) = \sum_{i=1}^{j-L} \tilde{\theta}_i\, 2^{i-1} \in \{0, \ldots, 2^{j-L} - 1\}, \tag{C.1}$$
where $\tilde{\theta}_i = \sum_{l=i}^{j-L} \theta_l \bmod 2 \in \{0, 1\}$, $\forall i \in \{1, \ldots, j-L\}$.
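The frequency-ordering rule of Proposition 3 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it takes p to be the natural-order node index whose children are 2p (low-pass branch) and 2p + 1 (high-pass branch), as in the recursion given by Mallat (2009), and it works on the integer p rather than on an explicit path vector. The function names are hypothetical.

```python
def freq_order_index(p):
    """Frequency-ordered index k = G(p) of the natural-order node index p.

    Bit i of k is the parity of bits i, i + 1, ... of p, i.e. the
    inverse of the binary-reflected Gray code.
    """
    k = 0
    while p:
        k ^= p
        p >>= 1
    return k


def gray_code(k):
    """Binary-reflected Gray code of k; it maps k = G(p) back to p."""
    return k ^ (k >> 1)


def freq_order_recursive(p):
    """Same permutation, computed with the recursion on the children 2p and 2p + 1.

    A child keeps the low/high order of its branch when the parent's
    frequency index is even, and swaps it when that index is odd.
    """
    if p == 0:
        return 0
    parent = freq_order_recursive(p // 2)
    return 2 * parent + ((p % 2) ^ (parent % 2))


if __name__ == "__main__":
    # Depth-3 example: natural order p = 0..7 and its frequency order.
    print([freq_order_index(p) for p in range(8)])
```

The swap at odd parents reflects the spectrum folding caused by down-sampling a high-pass branch, which is why the leaves have to be relabelled before their bands can be read in increasing frequency, as in Tables 2 to 4.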
References
Atto, A.M., Pastor, D., Isar, A., 2007. On the statistical decorrelation of the wavelet packet coefficients of a band-limited wide-sense stationary random process. Signal Processing 87 (10), 2320–2335.
Atto, A.M., Pastor, D., Mercier, G., 2010. Wavelet packets of fractional Brownian motion: asymptotic analysis and spectrum estimation. IEEE Transactions on Information Theory 56 (9), 429–441.
Bohanec, M., Bratko, I., 1994. Trading accuracy for simplicity in decision trees. Machine Learning 15, 223–250.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Chang, T., Kuo, C.J., 1993. Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing 2 (4), 429–441.
Chou, P., Lookabaugh, T., Gray, R., 1989. Optimal pruning with applications to tree-structure source coding and modeling. IEEE Transactions on Information Theory 35 (2), 299–315.
Choueiter, G., Glass, J., 2007. An implementation of rational wavelets and filter design for phonetic classification. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), 939–948.
Coifman, R., Meyer, Y., Quake, S., Wickerhauser, V., 1990. Signal processing and compression with wavelet packets. Tech. rep., Numerical Algorithms Research Group, New Haven, CT, Yale University.
Coifman, R.R., Meyer, Y., Wickerhauser, M.V., 1992. Wavelet analysis and signal processing. In: Ruskai, B. (Ed.), Wavelets and their Applications. Jones and Bartlett, pp. 153–178.
Coifman, R.R., Wickerhauser, M.V., 1992. Entropy-based algorithm for best basis selection. IEEE Transactions on Information Theory 38 (2), 713–718, March.
Cormen, T., Leiserson, C., Rivest, R.L., 1990. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley Interscience, New York.
Crouse, M.S., Nowak, R.D., Baraniuk, R.G., 1998. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing 46 (4), 886–902, April.
Daubechies, I., 1992. Ten Lectures on Wavelets. SIAM, Philadelphia.
Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (4), 357–366.
Duda, R.O., Hart, P.E., 1983. Pattern Classification and Scene Analysis. Wiley, New York.
Etemad, K., Chellappa, R., 1998. Separability-based multiscale basis selection and feature extraction for signal and image classification. IEEE Transactions on Image Processing 7 (10), 1453–1465, October.
Farooq, O., Datta, S., 2001. Mel filter-like admissible wavelet packet structure for speech recognition. IEEE Signal Processing Letters 8 (7), 196–198.
Gray, R.M., 1990. Entropy and Information Theory. Springer-Verlag, New York.
Kim, K., Youn, D., Lee, C., 2000. Evaluation of wavelet filters for speech recognition. In: IEEE Int. Conf. Syst. Man. Cybern., pp. 2891–2894.
Kullback, S., 1958. Information Theory and Statistics. Wiley, New York.
Learned, R.E., Karl, W.C., Willsky, A.S., 1992. Wavelet packet based transient signal classification, 109–112.
Lee, K.-F., Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing 37 (11), 1641–1648.
Mallat, S., 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674–693, July.
Mallat, S., 2009. A Wavelet Tour of Signal Processing, 3rd ed. Academic Press.
Padmanabhan, M., Dharanipragada, S., 2005. Maximizing information content in feature extraction. IEEE Transactions on Speech and Audio Processing 13 (4), 512–519, July.
Quatieri, T.F., 2002. Discrete-time Speech Signal Processing: Principles and Practice. Prentice Hall.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286, February.
Ramchandran, K., Vetterli, M., Herley, C., 1996. Wavelet, subband coding, and best bases. Proceedings of the IEEE 84 (4), 541–560, April.
Saito, N., Coifman, R.R., 1994. Local discriminant basis. In: Proc. SPIE 2303, Mathematical Imaging: Wavelet Applications in Signal and Image Processing, pp. 2–14.
Scott, C., 2005. Tree pruning with subadditive penalties. IEEE Transactions on Signal Processing 53 (12), 4518–4525.
Scott, C., Nowak, R.D., 2004. Templar: a wavelet-based framework for pattern learning and analysis. IEEE Transactions on Signal Processing 52 (8), 2264–2274, August.
Shen, J., Strang, G., 1996. Asymptotic analysis of Daubechies polynomials. Proceedings of the American Mathematical Society 124 (12), 3819–3833.
Shen, J., Strang, G., 1998. Asymptotics of Daubechies filters, scaling functions, and wavelets. Applied and Computational Harmonic Analysis 5, 312–331.
Silva, J., Narayanan, S., 2007. Minimum probability of error signal representation. In: IEEE Workshop Machine Learning for Signal Processing, August.
Silva, J., Narayanan, S., 2009. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Transactions on Signal Processing 57 (5), 1796–1810.
Silva, J.F., Narayanan, S.S., 2012. On signal representations within the Bayes decision framework. Pattern Recognition 45 (5), 1853–1865, May.
Tan, B., Minyue, F., Spray, A., Dermody, P., 1996. The use of wavelet transform in phoneme recognition. In: Int. Conf. Spoken Lang. Process., pp. 2431–2434.
Vaidyanathan, P.P., 1993. Multirate Systems and Filter Banks. Prentice-Hall, Englewood Cliffs, NJ.
Vasconcelos, N., 2004. Minimum probability of error image retrieval. IEEE Transactions on Signal Processing 52 (8), 2322–2336.
Vetterli, M., Kovacevic, J., 1995. Wavelets and Subband Coding. Prentice-Hall, Englewood Cliffs, NJ.
Walter, G.G., 1992. A sampling theorem for wavelet subspaces. IEEE Transactions on Information Theory 38 (2), 881–884.
Willsky, A.S., 2002. Multiresolution Markov models for signal and image processing. Proceedings of the IEEE 90 (8), 1396–1458, August.
Young, S., 2009. The HTK Book (for HTK Version 3.4).
Zhou, X., Sun, W., 1999. On the sampling theorem for wavelet subspaces. The Journal of Fourier Analysis and Applications 5 (4), 347–354.