DATA SAMPLING ENSEMBLE ACOUSTIC MODELLING IN SPEAKER INDEPENDENT SPEECH RECOGNITION Xin Chen and Yunxin Zhao Department of Computer Science University of Missouri, Columbia, MO 65211 USA
[email protected] [email protected]
ABSTRACT
In this paper, we extend our recent data-sampling based ensemble acoustic modeling technique to the speaker-independent task of TIMIT and propose new methods to further improve the effectiveness of the ensemble acoustic models. We propose applying overlapped speaker clustering in data sampling to construct an ensemble of acoustic models for speaker-independent speech recognition. In addition, we evaluate data sampling for recurrent neural networks (RNNs) to construct an RNN-based frame classifier. We also investigate using CVEM in place of EM in our ensemble acoustic model training. By using these methods on the speaker-independent TIMIT phone recognition task, we have obtained a 2.5% absolute gain in phone accuracy over a standard HMM baseline system.
Index Terms: ensemble acoustic modeling, recurrent neural network, speaker overlapped clustering, data sampling, speaker adaptation
1. INTRODUCTION
Combining multiple speech recognition systems has become widely used for improving the accuracy of speech recognition [1][2][3][4], where the individual systems work independently and the word hypotheses of the multiple systems are combined. Recently, a novel technique has been proposed for generating random-forests based multiple acoustic models by randomly sampling the phonetic questions and combining the multiple scores for each speech frame [5]. This approach requires only one decoding search for each speech utterance, and therefore it has a significantly lower computational complexity than system-level combination. Along this direction, a Cross Validation (CV) data sampling method was subsequently proposed to generate ensemble acoustic models [6]. Both the question sampling and the data sampling methods have been evaluated on a speaker-dependent conversational speech recognition task of telehealth captioning [7] and led to significant improvements in accuracy. In this paper, we extend our recent data-sampling based ensemble acoustic modeling technique of [6] to the speaker-independent task of TIMIT and propose new methods to further improve the effectiveness of the ensemble acoustic models. We propose applying overlapped speaker clustering in data sampling to construct an ensemble of acoustic models for speaker-independent speech recognition. We combine these acoustic models in triphone HMM states to compute the scores for each speech frame as in [5]. In addition, we evaluate data sampling for recurrent neural networks (RNNs) to construct an RNN-based frame classifier. We also investigate using CVEM of [8] in place of EM in our ensemble acoustic model training. By using these methods on the
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
speaker-independent TIMIT phone recognition task, we have obtained a 2.5% absolute gain in phone accuracy over the baseline system. For the telehealth captioning task, we used overlapped data partitions on multiple talkers’ data to train multiple acoustic models and obtained some interesting results. The rest of the paper is organized as follows. In section 2 we introduce data sampling methods for ensemble acoustic model training. In section 3 we describe data sampling for the RNN ensemble. In section 4 we present the experimental results. In section 5 we discuss utilizing multiple-talker data in the telehealth captioning task within the ensemble framework. In section 6 we conclude our work and discuss possible future extensions.
2. DATA SAMPLING METHODS FOR ENSEMBLE ACOUSTIC MODELING OF HMM
2.1 CV data sampling
In CV based data partition, multiple training data sets are produced through data sampling and each sampled training data set is used to train one set of acoustic models. For an N-fold CV sampling, an (N-1)/N fraction of the training data is included in each sampled training set, and each data sample is used exactly N-1 times overall to avoid bias in data usage. For details of CV data partition, please refer to [6]. For each triphone HMM state, its tied-state Gaussian Mixture Densities (GMDs) in the N sets of acoustic models are combined to form its ensemble model. As in [5], the triphone HMM states that share the same tied states in every set of acoustic models form a Random-Forests (RF) tied state and share the same ensemble model. In decoding search, the likelihood scores calculated from the GMDs within the same RF-tied state are combined for each speech frame. The acoustic score combining methods have been discussed in [5][6]. It remained an open question whether this method would show any advantage on speaker-independent tasks. Our new results on TIMIT confirm the effectiveness of this method on the speaker-independent task (see section 4).
2.2 Speaker clustering based data sampling
It is of interest to investigate the potential of speaker clustering based data sampling for training ensemble models, since speaker clustering may increase the diversity among the acoustic model sets for speaker-independent recognition tasks. A basic distinction between the data sampling method proposed here and conventional speaker clustering is that the multiple sampled datasets overlap, and by controlling the amount of overlap we may trade off the accuracy of individual model sets against the diversity among the model sets.
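The N-fold CV data sampling of Section 2.1 can be illustrated with a minimal sketch. It assumes utterances are dealt round-robin into N disjoint folds and that each training set leaves out exactly one fold; the function name `cv_training_sets` and the utterance IDs are hypothetical, and the actual partition procedure is the one described in [6].

```python
def cv_training_sets(utterances, n_folds):
    """Return n_folds training sets, each omitting one held-out fold.

    Assumption: folds are formed by round-robin assignment; the paper
    defers the exact partition details to [6].
    """
    # Split the data into n_folds disjoint folds.
    folds = [utterances[i::n_folds] for i in range(n_folds)]
    training_sets = []
    for held_out in range(n_folds):
        # Keep every fold except the held-out one: an (N-1)/N fraction.
        subset = [u for k, fold in enumerate(folds) if k != held_out for u in fold]
        training_sets.append(subset)
    return training_sets


utts = [f"utt{i:02d}" for i in range(20)]  # hypothetical utterance IDs
sets = cv_training_sets(utts, n_folds=5)
# Count how many training sets each utterance appears in.
usage = {u: sum(u in s for s in sets) for u in utts}
print(len(sets[0]), set(usage.values()))  # 16 {4}
```

With 20 utterances and N = 5, each training set holds 16 utterances (4/5 of the data) and every utterance is used N-1 = 4 times, matching the unbiased usage property stated above.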
In [4], an utterance-clustered (non-overlapped) multiple-decoder system was reported to provide higher recognition accuracy than a single-model system.
ICASSP 2010
Following the idea of CV data sampling, we propose a novel method for generating speaker clustering based overlapped data partitions. The speaker clustering method consists of two parts. In the first part, core models are iteratively estimated from the data of the clustered small sets of talkers. In the second part, the core models obtained from the first part are used to find the overlapped, sampled training data sets. Multiple training data sets are therefore produced and each data set is used to train one set of acoustic models. The procedure is detailed in Fig. 1.
3. DATA SAMPLING FOR NEURAL NETWORK
Since data sampling generated HMM/GMM ensembles showed very good performance in [6], one interesting question is whether a similar advantage would hold for other classifiers, for example, the Recurrent Neural Network (RNN). An RNN estimates frame-level phone posterior probabilities discriminatively and has been widely used in speech recognition. The posterior probabilities of phonemes at each frame t, p(c|x_t), where c is a phoneme label and x_t is the acoustic feature vector at time t, can be used in frame classification, phoneme recognition, as well as general speech recognition tasks [10]. In the current work, we apply CV data sampling to build an ensemble of RNNs. In classification, we combine the posterior probabilities from each base RNN with a simple average. We evaluate the performance of the RNN ensemble against the conventional method of using a single RNN for frame classification.
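The simple-average combination of posteriors can be sketched as follows. This is an illustration only: the posterior values are made up, three classes stand in for the 39 phonemes, and the function names are hypothetical rather than part of any toolkit.

```python
def average_posteriors(per_model_posteriors):
    """Average p(c | x_t) over the base models for a single frame."""
    n_models = len(per_model_posteriors)
    n_classes = len(per_model_posteriors[0])
    return [
        sum(p[c] for p in per_model_posteriors) / n_models
        for c in range(n_classes)
    ]


def classify_frame(per_model_posteriors):
    """Return the class index with the highest averaged posterior."""
    avg = average_posteriors(per_model_posteriors)
    return max(range(len(avg)), key=avg.__getitem__)


# Made-up posteriors from three base classifiers for one frame.
frame = [
    [0.50, 0.40, 0.10],
    [0.20, 0.60, 0.20],
    [0.45, 0.35, 0.20],
]
avg = average_posteriors(frame)
print(classify_frame(frame))  # 1
```

In this toy frame two of the three base models individually prefer class 0, yet the averaged posterior favors class 1, which shows how the ensemble decision can differ from a majority of its members.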
4. EXPERIMENTAL RESULTS
We conducted a number of experiments to evaluate and compare the performance of the proposed ensemble models on the TIMIT task. The evaluations were in terms of phone recognition accuracy using HMMs and frame classification accuracy using RNN.
Fig. 1 N-fold speaker clustering based ensemble model training
Part 1: core model training
Step-1 Randomly select a small percentage of speakers without replacement and assign them to one of N folds; repeat N times to fill up the N folds.
Step-2 Train acoustic models (referred to as core models) for each fold by using the selected speaker data for the fold.
Step-3 Calculate forced-alignment likelihood scores on the data of all selected speakers with the core models, and reassign the speakers to the folds based on the ranking of the likelihoods while keeping the fold size unchanged. Note that the scenario in which one speaker is used by many folds should be avoided for the sake of model diversity. We apply a speaker selection probability, 1 - t/T, where t is the number of times that a speaker has been used and T is the maximum allowed number of times, in the speaker assignment step to reduce the chance that a speaker is repeatedly used in many folds.
Step-4 Return to Step-2 unless a sufficient number of iterations or model convergence has been reached.
Part 2: overlapped speaker clustering
We first fix the desired fraction of the full set of training data for each model set as K/N. We then assign each speaker to the K folds corresponding to the core models that give the top-K ranked likelihood scores for the speaker’s data. The folds are made to have approximately the same number of speakers.
Note that by isolating the core model training from the overlapped speaker clustering into two parts, we gain the flexibility of generating many speaker clustered datasets with different fraction parameters without the need for changing the core models. Depending on the amount of data overlap among the clusters, it may be desirable to estimate the combiner weights from each test utterance. In the current implementation, however, only a simple average is used for model combination.
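The top-K assignment of Part 2 and the selection probability of Step-3 can be sketched as below. This is a simplified illustration under stated assumptions: the likelihood scores are made up, the function names are hypothetical, and the fold-size balancing mentioned in Fig. 1 is not implemented.

```python
def assign_top_k(speaker_scores, k):
    """Assign each speaker to the k folds whose core models score it highest.

    speaker_scores maps a speaker ID to a list of per-fold likelihood
    scores (one per core model). Fold balancing is omitted in this sketch.
    """
    n_folds = len(next(iter(speaker_scores.values())))
    folds = [[] for _ in range(n_folds)]
    for speaker, scores in speaker_scores.items():
        # Rank folds by likelihood, best first.
        ranked = sorted(range(n_folds), key=lambda f: scores[f], reverse=True)
        for f in ranked[:k]:
            folds[f].append(speaker)
    return folds


def selection_probability(times_used, max_uses):
    """Speaker selection probability 1 - t/T from Step-3."""
    return 1.0 - times_used / max_uses


# Made-up log-likelihood scores for 4 speakers under N = 4 core models.
scores = {
    "spk_a": [-10.0, -12.0, -15.0, -11.0],
    "spk_b": [-14.0, -9.0, -10.5, -13.0],
    "spk_c": [-11.0, -13.0, -9.5, -10.0],
    "spk_d": [-9.0, -10.0, -12.0, -14.0],
}
folds = assign_top_k(scores, k=2)
print(folds)
# [['spk_a', 'spk_d'], ['spk_b', 'spk_d'], ['spk_b', 'spk_c'], ['spk_a', 'spk_c']]
print(selection_probability(3, 3))  # 0.0
```

With K = 2 and N = 4, each speaker lands in two folds, so each fold here receives K/N = 1/2 of the speakers. Because the core models are fixed, the same scores can be reused with a different K to produce datasets with a different overlap fraction.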
4.1 Phone recognition experiment setup
For the phone recognition experiments, the TIMIT database was used. The training data set consisted of 3696 utterances from 462 speakers, and the standard test data set consisted of 1344 utterances from 168 speakers. There were 39 phonemes. A phone bigram language model was trained from the transcriptions of all the 3696 utterances. The acoustic features were 13 MFCCs, plus delta and double delta features. HTK [9] was used for training the individual model sets. The baseline phone recognition accuracy was 71.72%, obtained by using a crossword triphone HMM system with Gaussian mixture densities of size 16.
4.2 CV data sampling based ensemble acoustic models
Fig. 2 The effects of ensemble acoustic models with EM or CVEM
We first applied a 10-fold CV data sampling to produce an ensemble of 10 sets of acoustic models with the GMD mixture size equal to 16, which gave a 1.3% absolute phone accuracy gain over the baseline. We next increased the mixture size of the GMDs to 24 and 32, and trained their respective 10-CV ensemble models. Finally, as in the hierarchical mixture modeling (Hie_Mix) of [6], we combined the ensemble models with the two mixture sizes of 16 and 32, and combined the single models with all three mixture sizes. For each of these model architectures, we also compared
using EM versus using CVEM [8] for model parameter estimation. For EM, 2 iterations were used; for CVEM, the number of folds was set to 10 and 8 iterations of CVEM were performed after 2 iterations of EM (this implementation was optimized in [6]). The ensemble of 10 model sets (mix32) with CVEM yielded an average phoneme accuracy of 74.26%, which is a 2.5% absolute gain over the baseline. The ensemble of 20 model sets (Hie_Mix) with EM training yielded an average phoneme accuracy of 73.94%. It is worth noting that in these experiments the ensemble acoustic models always outperformed the single models trained by either EM or CVEM. For example, at the mixture size of 16, the ensemble of 10-CV acoustic models yielded a phoneme accuracy of 73.07%, in contrast to the 71.98% of the CVEM trained single model.
4.4 RNN ensemble classifier
For each RNN model setup, we used an MLP with 273 input nodes (6 context frames plus the current frame for 39 phonemes, i.e., 7*39), 200 hidden nodes with recurrent connections and 39 output nodes. We used the NICO toolkit to train the RNNs [10]. The frame level phoneme classification task was evaluated on TIMIT. Training data included 4620 utterances, and testing data included 1680 utterances. Each frame was represented by 12 MFCCs plus log energy. The 10-fold CV partition was used to generate the RNN ensemble, and the baseline RNN classifier was trained on the full set of training data.
Table 1 The effect of ensemble RNN on frame classification
4.3 Speaker clustering based ensemble
We evaluated the proposed speaker clustering (SC) based data sampling method in constructing ensemble acoustic models. The fold size was set to 10. We used random sampling without replacement to initialize the speakers for the core models. Each core model had 5% of the full set of training utterances. The maximum reuse of each speaker was empirically set to 3, and 5 iterations were used in generating the core models. The overlapped data sets used different fractions of the full training set: 90%, 70%, 50%, and 30% of the total training data. For comparison, random sampling (RS) with replacement was also used, where the overlapped sampled datasets contained 90%, 70%, 50%, and 30% of the full training data set, corresponding to the cases of SC. The results are shown in Fig. 3.
Frame Classification Accuracy
1 RNN     10 RNNs
69.3%     70.6%
The average accuracy of the 10 individual RNN classifiers in the ensemble was 69.1%, with a standard deviation of 0.0028, and this average performance was lower than the baseline. The ensemble RNN classifier, however, gave a 1.3% absolute increase in frame classification accuracy over the baseline. This result indicates that while training data size and coverage are important to recognition accuracy, an ensemble classifier also benefits from the diversity that sampling the training data has generated for the base classifiers.
4.5 Speaker adaptation
We evaluated the Maximum Likelihood Linear Regression (MLLR) based speaker adaptation for the TIMIT task, where the two SA sentences of each test speaker were excluded from the test data and used as adaptation speech. One global MLLR transform was used. The ensemble models were generated from CV data sampling. For each speaker, we first used the two SA sentences to adapt the single model and the ensemble model, respectively, and then used the adapted models to perform recognition.
Fig. 3 The effects of speaker clustering versus random sampling in data sampling for training ensemble acoustic models
As shown in Fig. 3, the proposed SC based ensemble model outperformed the RS based ensemble model in all the cases. Fig. 3 also suggests that the fraction of data overlap is an important factor determining the quality of the ensemble acoustic models, and 50% appears to be the best choice for the TIMIT task. As the amount of overlap was reduced among the data sets, each model set became weaker due to the smaller amount of training data, while the diversity among the model sets increased. At the 90% data overlap, SC and CV based sampling gave similar accuracy, since the heavy data overlap decreased the effect of speaker clustering on the diversity among the sampled datasets. Using SC with a 50% data overlap gave a 0.14% absolute accuracy gain over the 10-fold CV at the GMD mixture size of 16, while each model set in the SC model ensemble is only about half the size of each model set in the 10-fold CV model ensemble.
Fig. 4 The effects of MLLR adaptation on ensemble models
It is observed that the accuracy of the single models decreased when using the full or the diagonal MLLR transforms, suggesting insufficient adaptation data. However, the ensemble model of mix24_10M_CV using MLLR with a full transform matrix outperformed the baseline, suggesting that the ensemble model is able to compensate for the overfitting effect.
5. DISCUSSIONS
For a training dataset with only a few talkers such as our telehealth task [7], it is desirable to be able to use other speakers’ data to enhance the speaker-dependent acoustic models. In [11] the
authors used other speakers’ data as prior information in guiding the phonetic decision tree clustering for each test speaker and improved recognition performance. Along this line, we investigated using other speakers’ data to help the speech recognition of each target test speaker within the framework of ensemble models. Experiments were performed on the telehealth automatic captioning task that has five speakers in total. For details of the experiment setup, please refer to [7]. For each target test speaker whose speech is to be recognized, we excluded the data of the current test speaker from the training set and compared the following three cases of using other talkers’ speech to train acoustic models. First, we put the other four speakers’ data together to train one set of models (referred to as 1M). Second, we used each one of the other four speakers’ data to train a set of models and combined the four sets of models (referred to as 4M_V1). Third, following the CV approach, we used the training data from three of the four other speakers to train one set of models, and by varying the selections we trained four sets of models and combined them (referred to as 4M_V2). These models were used to recognize the speech of the current test speaker. By rotating the test speaker over the five speakers, we obtained averaged word recognition accuracy for the three cases, shown in Table 2.
Table 2 The effects of ensemble models versus single model of other speakers
Average word accuracy
current work, we have proposed data sampling methods for ensemble acoustic modeling of a speaker-independent task, including CV data partition and speaker clustering. By combining several methods, we achieved a 2.5% absolute gain over an HMM baseline in speaker-independent TIMIT phoneme recognition. When using an ensemble of 10 model sets, our proposed speaker-clustering method at the 50% data overlap outperformed CV data partition in recognition accuracy while using only about half its model size. Within our ensemble acoustic modeling framework for speech recognition, there are a number of possible future extensions. Since there are many ways of generating speaker clusters and therefore acoustic models, how to select the models that are best for an ensemble is an open question for exploration. Another possibility is combining data sampling with discriminative training methods such as MMIE, MCE, and fMPE, which are successful in acoustic modeling but may overfit the training data; through data sampling, the diversity of the ensemble models may well compensate for the overfitting in the individual models. A further potential extension is to better utilize state-of-the-art multi-core computer architectures to carry out parallel computation of the acoustic likelihood scores from the models in the ensemble to speed up decoding search.
1M         4M_V1      4M_V2
51.13%     51.76%     55.75%
7. ACKNOWLEDGEMENT
This work is supported in part by the National Science Foundation under grant award IIS-0916639. The authors would like to thank Dr. Bryan Pellom and Dr. Kadri Hacioglu of Rosetta Stone for their helpful discussions on RNN.
It is seen from Table 2 that the ensemble models 4M_V2 yielded a 4.62% absolute word accuracy gain over the models 1M, and 4M_V2 also yielded a 3.99% absolute word accuracy gain over the ensemble models 4M_V1, again confirming the advantage of using ensemble models trained from overlapped multiple training datasets. Furthermore, we observed in our experiments that MLLR, Maximum A Posteriori estimation (MAP), and Maximum Likelihood estimation (ML) based adaptations of the ensemble models always outperformed the adaptations of the single model. For instance, with the ML method (referred to as 4M_ADAPT), we used each one of the four model sets of 4M_V2 to initialize EM to generate an ensemble of four adapted model sets from the current test speaker’s training data (the procedure used one EM iteration, and the minimum count for GMD model update was set to 3). We then combined the 4M_ADAPT ensemble with the baseline speaker-dependent model of the current test speaker with equal weights (0.5, 0.5). The word accuracy increased to 80.89%, a 1.68% absolute gain over the speaker-dependent baseline of 79.21%. We next combined the 4M_ADAPT ensemble with the speaker-dependent ensemble of 60 models that was deemed best in [6]. With the weights empirically set to (0.1, 0.9) for the two ensembles, the accuracy increased to 82.64%, and with the weights changed to (0.02, 0.98) to give more emphasis to the speaker-dependent ensemble model, the accuracy increased to 82.73%, which was a 3.52% absolute accuracy gain over the speaker-dependent baseline models, and a 0.26% absolute accuracy gain over the speaker-dependent ensemble of 60 models [6]. Although these results on the telehealth data are of a preliminary nature, they demonstrate some interesting effects of using overlapped data partitions in training multiple models when there are only a few talkers in a training dataset.
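The three ways of using the other speakers' data (1M, 4M_V1, 4M_V2) can be made concrete with a small sketch that lists which speakers feed each model set for a given target speaker. The speaker IDs and the function name `other_speaker_configs` are placeholders introduced for illustration; model training itself is not shown.

```python
def other_speaker_configs(speakers, target):
    """List the training speakers of each model set for one target speaker."""
    # The target speaker's data is always excluded from training.
    others = [s for s in speakers if s != target]
    return {
        # One model set trained on all other speakers pooled together.
        "1M": [others],
        # One model set per other speaker.
        "4M_V1": [[s] for s in others],
        # CV style: each model set leaves out one of the other speakers.
        "4M_V2": [[s for s in others if s != left_out] for left_out in others],
    }


speakers = ["S1", "S2", "S3", "S4", "S5"]  # placeholder speaker IDs
cfg = other_speaker_configs(speakers, target="S1")
for name, model_sets in cfg.items():
    print(name, model_sets)
```

For target S1, 4M_V2 trains four model sets on three speakers each, so every other speaker contributes to three of the four sets. This is the overlapped partition that the comparison in Table 2 favors.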
6. CONCLUSION AND FUTURE WORK
Ensemble acoustic modeling is a promising new direction to improve the accuracy of speech recognition. In the
8. REFERENCES
[1] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER),” Proc. IEEE ASRU Workshop, pp. 347-352, 1997.
[2] O. Siohan, B. Ramabhadran, and B. Kingsbury, “Constructing ensembles of ASR systems using randomized decision trees,” Proc. ICASSP, pp. I-197-I-200, 2005.
[3] R. Zhang, et al., “Investigations on ensemble based semi-supervised acoustic model training,” Proc. EuroSpeech, pp. 1677-1680, 2005.
[4] T. Shinozaki and S. Furui, “Spontaneous speech recognition using a massively parallel decoder,” Proc. ICSLP, pp. 1705-1708, 2004.
[5] J. Xue and Y. Zhao, “Random forests of phonetic decision trees for acoustic modeling in conversational speech recognition,” IEEE Trans. ASLP, vol. 16, no. 3, pp. 519-528, 2008.
[6] X. Chen and Y. Zhao, “Data sampling based ensemble acoustic modeling,” Proc. ICASSP, pp. 3805-3808, 2009.
[7] Y. Zhao, et al., “An automatic captioning system for telemedicine,” Proc. ICASSP, pp. I-957-I-960, 2006.
[8] T. Shinozaki and M. Ostendorf, “Cross-validation EM training for robust parameter estimation,” Proc. ICASSP, vol. IV, pp. 437-440, 2007.
[9] HTK Toolkit, U.K. http://htk.eng.cam.ac.
[10] N. Strom, “Phoneme probability estimation with dynamic sparsely connected artificial networks,” The Free Speech Journal, no. 5, 1997.
[11] R.-S. Hu and Y. Zhao, “Knowledge-based adaptive decision tree state tying for conversational speech recognition,” IEEE Trans. ASLP, vol. 15, no. 5, pp. 2160-2168, Sept. 2007.