Canonicalization of Feature Parameters for Automatic Speech Recognition

Takashi Fukuda and Tsuneo Nitta
Graduate School of Engineering, Toyohashi University of Technology, Japan
[email protected], [email protected]

Abstract

Acoustic models (AMs) of an HMM-based classifier include various types of hidden variables, such as gender type, speaking rate, and acoustic environment. If there exists a canonicalization process that reduces the influence of the hidden variables on the AMs, a robust automatic speech recognition (ASR) system can be realized. In this paper, we describe the configuration of a canonicalization process targeting gender type as the hidden variable. The proposed canonicalization process is composed of multiple distinctive phonetic feature (DPF) extractors corresponding to the hidden variable and a DPF selector in which the distances between the input DPFs and the AMs are compared. In the DPF extraction stage, an input sequence of acoustic feature vectors is mapped onto three DPF spaces corresponding to male, female, and neutral voice by three multilayer neural networks (MLNs). Experiments are carried out by comparing (A) the combination of the canonicalized DPF and a single HMM classifier with (B) the combination of a single acoustic feature set (MFCC) and multiple HMM classifiers. The results show that the proposed canonicalization method outperforms both the conventional ASR system with MFCC and a single HMM and the ASR system with multiple HMMs, while requiring less memory and computation time.

1. Introduction

Recent automatic speech recognition (ASR) systems can achieve recognition accuracy of more than 90% when speech is uttered clearly in low-noise environments. This is mainly due to the powerful support of a stochastic classifier, typically an HMM-based classifier, and of large-scale speech corpora. However, because the acoustic models (AMs) of an HMM-based classifier include hidden variables such as gender type, speaking rate, and acoustic environment, a single HMM classifier cannot always obtain sufficient performance. To address this problem, several strategies for estimating AMs from speech data collected under various conditions have been investigated [1]. Such strategies can represent various types of acoustic variation in the AMs. On the other hand, an approach that decodes in parallel with multiple HMMs corresponding to the hidden variables has recently been proposed ([2, 3]; see Figure 1). Multi-path acoustic modeling, which represents hidden variables with several paths in the same AM instead of applying multiple HMMs, has also been proposed [4, 5]. These approaches may achieve robust ASR systems in various respects; however, they require a large amount of memory and computation time.

Fig. 1: A single feature extractor and multiple HMM classifiers. (A single feature extractor, e.g. MFCC, LPC, or RASTA, feeds N HMM classifiers in parallel; their hypotheses are combined into the recognition result.)

If there exists a canonicalization process of feature parameters that reduces the influence of the hidden variables on the AMs, a robust ASR system can be realized at low cost. In this paper, we describe the configuration of a canonicalization process targeting gender type as the hidden variable. In the proposed method, three distinctive phonetic feature (DPF) extractors composed of multilayer neural networks (MLNs) map an input sequence of acoustic feature vectors onto three DPF spaces corresponding to male, female, and neutral voice (see Figure 2). A DPF selector, in which the distances between the input DPFs and the AMs are compared, then outputs the DPF nearest to the AMs as the canonicalized DPF. The canonicalization described above is realized, for the first time, by introducing a DPF space between the acoustic feature space and the AMs of an HMM classifier.

Fig. 2: Multiple DPF extractors and a single HMM classifier. (Input local features, LF: 25 dim. × 3, are fed to a gender-dependent (GD) DPF extractor for male voice, a GD-DPF extractor for female voice, and a gender-independent (GI) DPF extractor for neutral voice, each producing DPFs of 15 dim. × 3; a DPF selector with a minimum-distance criterion outputs the canonicalized DPF, 15 dim. × 3, to the HMM classifier with monophone models.)

This paper is organized as follows. Section 2 outlines the implementation of a DPF extractor, and Section 3 explains the canonicalization process of feature parameters. Section 4 describes the experimental setup and results and provides a discussion. Finally, Section 5 presents our conclusions.

2. Overview of a DPF extractor

This section gives an overview of the DPF extractor, which plays an important role in the canonicalization process of feature parameters. The configuration of the DPF extractor [6] is illustrated in Figure 3. At the acoustic feature extraction stage, input speech is first converted into local features (LFs) that represent the variation of the spectrum along the time and frequency axes [7]. The LFs are then entered into an MLN with four layers, including two hidden layers, after combining the current frame xt with the two frames located three points before and after it (xt-3, xt+3), giving an input of 25 dim. × 3 frames. The MLN has 45 output units (15 × 3) corresponding to a set of triphones, or context-dependent DPFs, consisting of three DPF vectors (a preceding-context DPF vector, a current DPF vector, and a following-context DPF vector) of 15 dimensions each. The two hidden layers consist of 256 and 96 units, counted from the input layer. The DPF set is adjusted to balance the distances among phonemes by redesigning the traditional Japanese DPF set [8]. The fifteen DPF elements are mora, high, low, nil (an intermediate expression of "high / low"), anterior, back, nil (an intermediate expression of "anterior / back"), coronal, plosive, affricative, continuant, voiced, unvoiced, nasal, and semi-vowel. The MLN is trained with the backpropagation algorithm to output a value of 1 for the DPF elements corresponding to an input phoneme and its adjacent phonemes (a triphone set). The number of training samples per triphone set is limited to a maximum of 30, selected by nearest-neighbor clustering on the D1 data set of Section 4.1. In our previous work, we proposed approximating the DPF distributions with the logarithmic normal distribution (LND) in the HMM [9]. In this paper, output probabilities in the HMM classifier are represented by LNDs with only negative skewness, and diagonal covariance matrices are used.

Fig. 3: A DPF extractor. (Local feature extraction converts speech into LFs; the frames xt-3, xt, xt+3 — 25 dim. × 3 frames of LF + ΔP — are input to an MLN that outputs three 15-dim. DPF vectors, y0: DPFpre, y1: DPFcur, and y2: DPFfol, which are sent to the HMM.)
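To make the extractor concrete, the following is a minimal sketch of the forward pass of such an MLN (75 = 25 × 3 inputs, hidden layers of 256 and 96 units, 45 outputs). The sigmoid activations, the random initialization, and the class name are our assumptions for illustration; the paper specifies only the layer sizes, the backpropagation training, and the target value of 1 for the DPF elements of the triphone set.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DPFExtractorMLN:
    """Sketch of the four-layer MLN of Section 2: 75 inputs
    (25-dim LF x 3 frames: x_{t-3}, x_t, x_{t+3}), hidden layers of
    256 and 96 units, 45 outputs (15-dim DPF vectors for preceding,
    current, and following context). Sigmoid activations and random
    initialization are assumptions, not stated in the paper."""

    def __init__(self, rng=np.random.default_rng(0)):
        sizes = [75, 256, 96, 45]
        self.weights = [rng.normal(0.0, 0.1, (m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, lf_tm3, lf_t, lf_tp3):
        # Concatenate the three 25-dim local-feature frames.
        h = np.concatenate([lf_tm3, lf_t, lf_tp3])
        for W, b in zip(self.weights, self.biases):
            h = sigmoid(h @ W + b)
        # Split the 45 outputs into DPF_pre, DPF_cur, DPF_fol.
        return h[:15], h[15:30], h[30:45]

mln = DPFExtractorMLN()
dpf_pre, dpf_cur, dpf_fol = mln.forward(*(np.zeros(25) for _ in range(3)))
```

In training, the target vector for each input frame would set the elements of the current and adjacent phonemes' DPFs to 1 and the rest to 0, with at most 30 samples per triphone set as described above.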

3. Canonicalization of feature parameters

In this section, we describe the configuration of the canonicalization process of feature parameters. In the canonicalization process shown in Figure 2, three MLNs first map an input sequence of acoustic feature vectors onto three DPF spaces corresponding to male, female, and neutral voice (the DPFs output from the three MLNs are called DPFmale, DPFfemale, and DPFneutral, respectively). The two DPF extractors for male and female voice are trained independently with the male and female data of the D1 data set in Section 4.1. The DPF extractor for neutral voice is trained with both the male and female data in D1. The distortion of the DPF vectors matching the voice of the input (male, female, or neutral) is smaller than that of DPF vectors extracted from a single MLN. Thus, using the DPF vectors corresponding to the speaker's characteristics as the input of an HMM classifier is expected to reduce the influence of the hidden variable. Next, the canonicalization process needs the DPF selector described below. To select the desired DPF vectors, i.e., those corresponding to the gender type of the input speech, several strategies are possible, including a distortion measure and an information criterion. In the following, we first explain the labeled DPF vectors and the initial training of the AMs:

(1) Two MLNs corresponding to male and female voice are trained independently using the male and female speech in the D1 data set described in Section 4.1.
(2) Input speech is mapped onto the DPF space matching its gender type by the corresponding MLN. Here, the MLN is selected using the label information in the D1 data set.
(3) AMs are trained using the DPF vectors selected by the above process.

Next, the AMs are redesigned using the three DPF extractors and the DPF selector. For every word, the DPF selector outputs the DPF vectors that minimize the distance d in the following equation as the canonicalized DPF:

d = \frac{1}{N} \sum_{i=1}^{N} \min_j D_M(\mathbf{p}_i, \mathbf{q}_j), \quad j = 1, 2, \ldots, M  (1)

D_M(\mathbf{p}_i, \mathbf{q}_j) = (\log \mathbf{p}_i - \boldsymbol{\mu}_j)^t \, \Sigma_j^{-1} \, (\log \mathbf{p}_i - \boldsymbol{\mu}_j)  (2)

where p_i is the input DPF vector of the i-th frame, q_j is the distribution parameter set of the middle state of the j-th HMM, and D_M is the Mahalanobis distance between p_i and q_j. N and M are the total number of frames in the vowel intervals and the number of monophone models in a word, respectively, and μ_j and Σ_j are the mean vector and covariance matrix of the model. Here, D_M is calculated between the single-mixture HMM (mix. = 1) and the input DPF vectors in vowel intervals; an interval in which the value of the DPF element "mora" exceeds 0.5 is regarded as a vowel interval. The DPF elements in vowel intervals are more reliable than those in consonantal parts. When D_M between the candidate DPFmale vectors and the AMs is close (within ±25%) to D_M between the candidate DPFfemale vectors and the AMs, the DPF selector instead outputs the third candidate, DPFneutral, as the canonicalized DPF. Finally, the canonicalized DPF vectors are input to the HMM classifier. This iterative training, i.e., redesigning the AMs with the canonicalized DPF vectors, is continued until the HMM likelihood converges.
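As an illustration of the selector, the sketch below follows Eqs. (1) and (2) with diagonal covariances, detects vowel intervals via the "mora" element (> 0.5), and applies the ±25% fallback to DPFneutral. The function names, the assumption that "mora" is the first DPF element, and the exact form of the tolerance test are ours, not the paper's.

```python
import numpy as np

def word_distance(dpf_frames, models):
    """Eq. (1): average over vowel frames of the minimum Eq. (2)
    Mahalanobis distance to the middle state of each monophone HMM.
    dpf_frames: (T, 15) DPF vectors of one word.
    models: list of (mu, diag_var) pairs in the log domain (diagonal
            covariances, matching the LND modeling of the DPFs)."""
    eps = 1e-6
    # Vowel intervals: frames whose "mora" element (index 0 assumed)
    # exceeds 0.5.
    vowels = dpf_frames[dpf_frames[:, 0] > 0.5]
    if len(vowels) == 0:
        return np.inf
    log_p = np.log(np.clip(vowels, eps, None))
    dists = []
    for p in log_p:
        d_m = [np.sum((p - mu) ** 2 / var) for mu, var in models]  # Eq. (2)
        dists.append(min(d_m))  # min over the M monophones in the word
    return np.mean(dists)

def select_canonical_dpf(dpf_male, dpf_female, dpf_neutral, models, tol=0.25):
    """Minimum-distance DPF selection with the +/-25% neutral fallback."""
    d_male = word_distance(dpf_male, models)
    d_female = word_distance(dpf_female, models)
    # If the male and female distances are within 25% of each other,
    # fall back to the gender-independent (neutral) DPF stream.
    if abs(d_male - d_female) <= tol * min(d_male, d_female):
        return dpf_neutral
    return dpf_male if d_male < d_female else dpf_female

# Toy usage with one dummy monophone model.
models = [(np.zeros(15), np.ones(15))]
canon = select_canonical_dpf(*(np.random.rand(40, 15) for _ in range(3)), models)
```

In the full system this selection would run word by word, with the AMs retrained on the selected streams until the likelihood converges, as described above.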

4. Experiments

4.1. Speech database

The following two data sets were used:

D1. Acoustic-model design set with clean speech: a subset of the "ASJ (Acoustical Society of Japan) Continuous Speech Database", consisting of 9,003 sentences uttered by 30 male and 30 female speakers (16 kHz, 16 bit) [10].

D2. Test data set with clean speech: a subset of the "Tohoku University and Matsushita Spoken Word Database", consisting of 100 words uttered by 10 unknown male and 10 unknown female speakers [11]. The total number of words is 1,962. The sampling rate was converted from 24 kHz to 16 kHz.

4.2. Experimental Setup

Input speech is sampled at 16 kHz, and a 512-point FFT of a 25-ms Hamming-windowed speech segment is applied every 10 ms. The resulting FFT power spectrum is then integrated into the outputs of a 24-channel band-pass filter (BPF) bank with mel-scaled center frequencies. At the acoustic feature extraction stage, the BPF-bank outputs are converted into local features (LFs), and the LFs are then mapped to DPFs. The D1 data set was used to design 43 Japanese monophone HMMs with five states and three loops. In the HMMs, output probabilities are represented by normal distributions (ND) for the MFCC parameter set and by logarithmic normal distributions (LND) with only negative skewness for the DPFs [9], and diagonal covariance matrices are used.
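A minimal sketch of this front end is given below, assuming an HTK-style mel scale, triangular filters, and log-compressed filterbank outputs; the paper specifies only the 16-kHz sampling, 25-ms Hamming window, 10-ms shift, 512-point FFT, and 24 mel-scaled channels.

```python
import numpy as np

def mel(f):
    """HTK-style mel scale (an assumption; the paper says only 'mel-scaled')."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_ch=24, n_fft=512, sr=16000):
    """Triangular filters with mel-scaled center frequencies."""
    edges = np.linspace(mel(0), mel(sr / 2), n_ch + 2)
    hz = 700.0 * (10.0 ** (edges / 2595.0) - 1.0)
    bins = np.floor((n_fft // 2 + 1) * hz / (sr / 2)).astype(int)
    fb = np.zeros((n_ch, n_fft // 2 + 1))
    for ch in range(n_ch):
        lo, mid, hi = bins[ch], bins[ch + 1], bins[ch + 2]
        if mid > lo:
            fb[ch, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        if hi > mid:
            fb[ch, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)
    return fb

def bpf_outputs(speech, sr=16000):
    """25-ms Hamming windows every 10 ms -> 512-point FFT power spectrum
    -> 24-channel mel BPF-bank outputs (one 24-dim vector per frame)."""
    win, hop, n_fft = int(0.025 * sr), int(0.010 * sr), 512
    fb = mel_filterbank(n_fft=n_fft, sr=sr)
    hamming = np.hamming(win)
    frames = []
    for start in range(0, len(speech) - win + 1, hop):
        seg = speech[start:start + win] * hamming
        power = np.abs(np.fft.rfft(seg, n_fft)) ** 2
        frames.append(np.log(fb @ power + 1e-10))  # log compression (assumption)
    return np.array(frames)

bpf = bpf_outputs(np.random.randn(16000))  # 1 s of dummy speech -> (98, 24)
```

The LF extraction of [7] would then derive time and frequency derivatives from these BPF-bank outputs; that step is outside the scope of this sketch.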

4.3. Experimental Results

Speaker-independent isolated spoken-word recognition tests were carried out with the D2 data set. In the experiments, the four types of ASR systems illustrated in Figure 4 were investigated. In baseline 1, the input of the HMM is a conventional acoustic feature set with 38 dimensions, consisting of MFCC with cepstral mean normalization (CMN), the dynamic features Δt and ΔtΔt, and ΔP and ΔΔP. The parallel HMMs of baseline 2 in Figure 4 (b) are followed by a hypothesis selector that picks the output of the HMM matching the target hidden variable based on a maximum-likelihood (ML) criterion. The original DPF in Figure 4 (c) denotes a method using a single DPF extractor trained on both the male and female voices in the D1 data set.

Fig. 4: ASR systems evaluated in the experiments: (a) baseline 1 (MFCC: dim. = 38, single HMM); (b) baseline 2 (MFCC: dim. = 38, parallel HMMs for male and female voice with an ML-criterion hypothesis selector); (c) original DPF (dim. = 45, single DPF extractor and single HMM); (d) canonicalized DPF (dim. = 45, GD-DPF extractors for male and female voice, a GI-DPF extractor for neutral voice, a minimum-distance feature vector selector, and a single HMM).
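For reference, a sketch of assembling the 38-dimensional baseline-1 vector is shown below, assuming 12 MFCCs per frame, a log-power term for ΔP, and a standard two-frame regression window for the dynamic features; the 12 + 12 + 12 + 1 + 1 layout is our assumption, since the paper gives only the total dimensionality.

```python
import numpy as np

def deltas(x, win=2):
    """Regression-based dynamic features over +/-win frames (the paper does
    not specify the window; win=2 is an assumption). Edges wrap via np.roll,
    which is acceptable for a sketch."""
    num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
              for k in range(1, win + 1))
    return num / (2 * sum(k * k for k in range(1, win + 1)))

def baseline1_features(mfcc, log_power):
    """38-dim baseline-1 vector: 12 CMN-normalized MFCCs, their deltas and
    delta-deltas, plus Delta-P and Delta-Delta-P of the log power
    (12 + 12 + 12 + 1 + 1 = 38, our assumed layout)."""
    c = mfcc - mfcc.mean(axis=0)            # cepstral mean normalization (CMN)
    d, dd = deltas(c), deltas(deltas(c))    # Delta-t and Delta-t-Delta-t
    p = log_power[:, None]
    dp, ddp = deltas(p), deltas(deltas(p))  # Delta-P and Delta-Delta-P
    return np.hstack([c, d, dd, dp, ddp])

feats = baseline1_features(np.random.randn(100, 12), np.random.randn(100))
assert feats.shape == (100, 38)
```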

Figure 5 shows the experimental results. The canonicalized DPF outperformed both baseline systems except in the single-mixture case. The original DPF with a single DPF extractor also yielded better performance than the baseline systems. This gain in performance can be explained as follows: the DPF vectors are designed to classify phonemes even when the hidden variable of gender is present, and the canonicalized DPF vectors can form a feature space that is independent of the target hidden variable, gender type.

Fig. 5: Performance comparison among ASR systems: word error rate [%] versus the number of mixtures (mix. = 1, 2, 4, 8, 16) for (a) baseline 1, (b) baseline 2, (c) original DPF, and (d) canonicalized DPF.

4.4. Discussion

The proposed canonicalization process has three DPF extractors (two for male and female voice and one for neutral voice). Here, the effect of adding the DPF extractor for neutral voice is investigated. Figure 6 shows the recognition results when the DPF extractor for neutral voice is eliminated from the canonicalization process, and Tables 1 and 2 give the selection rates of the three DPFs (DPFmale, DPFfemale, and DPFneutral) together with the gender-type accuracy of the selections. When the DPF extractor for neutral voice is eliminated, the recognition performance degrades, as shown in Figure 6. Moreover, Table 2 shows that around 13% of the DPFs selected for the D2 data set (Tohoku Univ. corpus) do not match the input gender type. These selection errors can be absorbed into the new category of neutral voice (see Table 1); thus the DPF extractor for neutral voice contributes to the improvement in performance.

Fig. 6: Difference of canonicalization units: word error rate [%] versus the number of mixtures (mix. = 1, 2, 4, 8, 16), with and without the DPF extractor for neutral voice (dim. = 45).

Table 1. DPF selection rate (DSR) [%] and gender-type accuracy (ACC) [%] with the DPF extractor for neutral voice.

              ASJ corpus (Training set)    Tohoku Univ. corpus (Test set)
              DSR        ACC               DSR        ACC
DPFmale       34.7       100.0             34.8       94.9
DPFfemale     31.5       99.9              26.1       97.8
DPFneutral    33.8       --                39.1       --

Table 2. DPF selection rate (DSR) [%] and gender-type accuracy (ACC) [%] without the DPF extractor for neutral voice.

              ASJ corpus (Training set)    Tohoku Univ. corpus (Test set)
              DSR        ACC               DSR        ACC
DPFmale       48.5       99.5              51.2       86.9
DPFfemale     51.5       96.6              48.8       88.3

5. Conclusion

A canonicalization process for feature parameters was proposed. The canonicalization is realized by introducing a DPF space between the acoustic feature space and the AMs of an HMM classifier. In an experiment on an isolated spoken-word recognition task, the combination of the canonicalized DPFs and a single HMM showed better performance, with less computation time and memory, than a conventional ASR system with multiple HMMs. The proposed canonicalization process can reduce the influence of gender type as a hidden variable on the AMs. In future work, we will investigate canonicalization methods targeting other types of hidden variables, such as speaking style and acoustic environment.

Acknowledgements

This work was supported by the 21st Century COE Program "Intelligent Human Sensing" of the Ministry of Education, Culture, Sports, Science and Technology. The authors also thank the Hori Information Science Promotion Foundation for supporting this work.

References

[1] K. W. Church, "Speech and Language Processing: Where have we been and where are we going?," Proc. Eurospeech'03, vol. 1, pp. 1-4, 2003.
[2] S. Matsuda, T. Jitsuhiro, K. Markov and S. Nakamura, "LVCSR Robust to Noise and Speaking Styles," IPSJ SIG Technical Report, 2004-SLP-50, pp. 37-44, 2004 (in Japanese).
[3] T. Shinozaki and S. Furui, "Spontaneous Speech Recognition Using a Massively Parallel Decoder," Proc. Third Spontaneous Speech Science and Technology Workshop, pp. 67-72, 2004 (in Japanese).
[4] A. Lee, Y. Mera, K. Shikano and H. Saruwatari, "Selective multi-path acoustic model based on database likelihoods," Proc. ICSLP'02, pp. 2661-2664, 2002.
[5] M. Ida and S. Nakamura, "Rapid Environment Adaptation Method Based on HMM Composition with Prior Noise GMM and Multi-SNR Models for Noisy Speech Recognition," IEICE Trans., vol. J86-D-II, no. 2, pp. 195-203, Feb. 2003.
[6] T. Fukuda, W. Yamamoto and T. Nitta, "Distinctive Phonetic Feature Extraction for Robust Speech Recognition," Proc. ICASSP'03, vol. II, pp. 25-28, 2003.
[7] T. Nitta, "Feature extraction for speech recognition based on orthogonal acoustic-feature planes and LDA," Proc. ICASSP'99, pp. 421-424, 1999.
[8] T. Fukuda and T. Nitta, "Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition," IEICE Trans. Information and Systems, 2004.
[9] T. Fukuda and T. Nitta, "Noise-robust ASR by Using Distinctive Phonetic Features Approximated with Logarithmic Normal Distribution of HMM," Proc. Eurospeech'03, vol. III, pp. 2185-2188, 2003.
[10] T. Kobayashi, S. Itahashi, S. Hayamizu and T. Takezawa, "ASJ continuous speech corpus for research," J. Acoust. Soc. Jpn., vol. 48, no. 12, pp. 888-893, 1992.
[11] S. Makino, K. Niyada, Y. Mafune and K. Kido, "Tohoku University and Matsushita isolated spoken word database," J. Acoust. Soc. Jpn., vol. 48, no. 12, pp. 899-905, 1992.