AUTOMATIC LANGUAGE IDENTIFICATION USING LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Sergio Mendoza, Larry Gillick, Yoshiko Ito, Stephen Lowe, Michael Newman

Dragon Systems, Inc., 320 Nevada Street, Newton, MA 02160

ABSTRACT

We have developed a highly accurate automatic language identification system based on large vocabulary continuous speech recognition (LVCSR). Each test utterance is recognized in a number of languages, and the language ID decision is based on the probability of the output word sequence reported by each recognizer. Recognizers were implemented for this test in English, Japanese, and Spanish, using the Ricardo corpus of telephone monologues. When tested on the OGI corpus of digitally recorded telephone speech, we obtained error rates of 3% or lower on 2-way and 3-way closed-set classification of ten-second and one-minute speech segments.

1. INTRODUCTION

The capability to do high-quality language classification of telephone speech has important commercial applications, such as routing calls to directory assistance, or for traveler information services. As speech recognition technology has improved over the last few years, large vocabulary continuous speech recognition (LVCSR) has become increasingly attractive as a front end for such voice-information retrieval applications, other examples of which are speaker and topic identification. There exist specialized techniques for dealing with each of these applications, but from the conceptual point of view it is very appealing to have a unified framework for analyzing all of these systems. Furthermore, Dragon has argued for several years that LVCSR is the best way to extract information from a given speech signal, and we have previously published results of successful experiments in speaker and topic identification [1, 2].

Most existing approaches to language identification (LID) make use of one or more phoneme recognizers in the front end [3-10], although at least one other group has recently reported on an LVCSR-based system [11]. Preliminary work on language identification was published by us in [12], where we reported on a successful attempt to discriminate between Spanish and English using microphone data and a single recognizer based on the Wall Street Journal corpus. In contrast, early results on spontaneous telephone speech, using a Switchboard recognizer (with or without a second recognizer built from Spanish data collected in-house), were disappointing. The negative result was not very clear-cut, however, due to a number of confounding factors affecting the result. In particular, there were gross differences between the corpora from which the recognizers were trained.

This time we have carefully rebuilt the language identification system using a family of recognizers, one for each "target" language, where we have tried to minimize systematic bias by using comparable amounts of data collected over the same channel. The recognition system is described in section 2, and the LID system in section 3. Results of tests performed on data from the OGI corpus are given in section 4.


2. OVERVIEW OF THE RECOGNITION SYSTEM

A naive LID system consists of N recognizers, one for each of the target languages. A test utterance would be recognized by each of the recognizers, resulting in N putative transcripts together with the (log) likelihood of each. The classification procedure would consist of picking the language for which the reported likelihood of the recognized word sequence was highest.
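A minimal sketch of this naive decision rule, assuming a hypothetical recognize() interface (not defined in the paper) that returns a transcript together with its log likelihood:

```python
def naive_language_id(utterance, recognizers):
    """Pick the language whose recognizer reports the highest log likelihood.

    `recognizers` maps a language name to a hypothetical recognize() callable
    returning (transcript, log_likelihood); the real Dragon recognizers are
    not reproduced here.
    """
    best_language, best_log_likelihood = None, float("-inf")
    for language, recognize in recognizers.items():
        _transcript, log_likelihood = recognize(utterance)
        if log_likelihood > best_log_likelihood:
            best_language, best_log_likelihood = language, log_likelihood
    return best_language
```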


A realistic LID system is a little more complicated. Early (unpublished) experiments at Dragon tried to do LID with English and Spanish recognizers built from very different corpora, and were not very successful. Unfortunately it was not clear how much of this failure resulted from the asymmetry between the recognizers, and how much from problems intrinsic to the method. In order to reduce systematic biases arising from factors such as channel, quantity and quality of acoustic data, and conversational style, Dragon set about collecting the "Ricardo" corpus of transcribed telephone speech in several languages, recorded as 8 kHz mu-law data over an analog line. The data for each language was collected from about 40 speakers, evenly balanced by gender. Each speaker recorded as many as 21 monologues of up to one minute in length, prompted by a set of questions on diverse topics such as food, weather, and books recently read. As far as possible the same set of questions (translated into the target language) was used in collecting each sub-corpus. From this corpus, we built recognizers in English, Japanese, and Spanish (a Mandarin recognizer has since been completed). For each language, we partitioned the available data into training and test sets of about 30 and 10 speakers respectively. Each speaker recorded between 15 and 20 minutes of speech, for a total of about 8 hours of acoustic training data per language. All the data in the corpus is transcribed, but there are no time markings.

We constructed acoustic models using a "flat start" procedure. From all the training data, we built a tied mixture model of 256 Gaussian components. We defined monophone acoustic models from these basis components, where the mixture weights were initially set equal (whence the name "flat start"). The only exception was the silence model, which was built separately from a sample of silence frames. The transcripts were then used to run several iterations of supervised Baum-Welch adaptation on the models, resulting in models which produced reasonably accurate time-alignments of the training data. From these time-alignments, our standard model-building algorithm built full-context acoustic models. The acoustic models used for the test consist of mixtures of up to 16 Gaussians, with a total of about 2700 output distributions. Each distribution represents a node of a set of triphones which have been clustered using a decision-tree algorithm [13].

The signal processing produces a 44-parameter feature vector every 10 ms (12 each of cepstral, difference, and second difference parameters, as well as 8 spectral parameters), which is reduced to 24 parameters by an IMELDA transformation [14]. In addition, the feature vectors are channel-normalized by averaging each feature over all speech identified as such by the "harmonicity" algorithm for detecting voiced speech [15].

The language model for each language was built entirely from the transcripts of the acoustic training data, which consisted of only around 60-70,000 tokens, with a vocabulary size of about 6-7,000 words. From this we trained unigram and bigram probabilities, keeping all bigrams. Because the training data was so sparse, the count ratios used in Good-Turing smoothing of the bigrams were rather noisy, and we substituted an absolute discounting method due to Ney et al. [16] (a sketch of this smoothing scheme is given at the end of this section).

The recognition performance of the recognizers was measured on the test set of 10 speakers, with an error rate in the range 60-65% in all three languages, and a recognition speed of around 20 times real time running on a 90 MHz Pentium with 64 MB of RAM.
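The sketch of bigram estimation with absolute discounting referred to above, in the spirit of Ney et al. [16]; the discount value, sentence markers, and backoff-to-unigram details are illustrative assumptions rather than the settings actually used:

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences, discount=0.5):
    """Bigram probabilities with absolute discounting, backing off to unigrams."""
    unigram_counts, bigram_counts = Counter(), defaultdict(Counter)
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigram_counts.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[prev][cur] += 1

    total = sum(unigram_counts.values())
    unigram_prob = {w: c / total for w, c in unigram_counts.items()}

    def bigram_prob(prev, cur):
        followers = bigram_counts.get(prev)
        if not followers:
            return unigram_prob.get(cur, 1.0 / total)
        history_count = sum(followers.values())
        # The mass freed by subtracting a constant discount from every seen
        # bigram count is redistributed according to the unigram distribution.
        backoff_weight = discount * len(followers) / history_count
        discounted = max(followers.get(cur, 0) - discount, 0.0) / history_count
        return discounted + backoff_weight * unigram_prob.get(cur, 1.0 / total)

    return bigram_prob

lm = train_bigram_lm([["hello", "world"], ["hello", "there"]])
print(lm("hello", "world"))   # seen bigram: discounted estimate
print(lm("world", "hello"))   # unseen bigram: backed-off estimate
```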

3. THE LANGUAGE ID SYSTEM

As mentioned earlier, an LID system is more than just an ensemble of recognizers. We were able to enhance the performance of the system considerably with a small amount of judicious tweaking.

The simplest step was to preprocess the test data so as to cut out sections of silence lasting more than a few hundred milliseconds. (Some of the test data consisted of almost 50% silence!) The chopping was implemented using harmonicity [15], but leaving a generous window of a few hundred milliseconds on either side of any voicing to avoid chopping of unvoiced speech. The motivation for cutting out silence was twofold. Firstly, one would not expect silence to contain language-specific information, and therefore silence frames can contribute only to the statistical noise in the results. More importantly, we discovered that the score for silence frames is systematically lower (i.e. higher probability) than for speech frames, and this tended to dominate the score differences on test utterances containing a high proportion of silence. Experimentally, we found that cutting out silence gave us an absolute reduction in the error rate of 4%.

The most important enhancement of the recognition scores came from controlling for the intrinsic acoustics of the test utterances. A large component of the recognition score reflects the raw acoustic match between the data and the models, independent of what the person is saying, or of the language being spoken. We can derive an independent measure of this match with what we call "best score" [12]. For each frame of the utterance, the frame is scored against every output distribution in the model, and the lowest (best) score is kept, totally independent of all other frames. This best score is then summed over all frames. Best score contains all the information about the raw acoustic match, with no continuity restrictions, and none of the higher-level information arising from knowledge of phoneme and/or word sequences. Subtracting best score from the raw recognition score improved language identification by a full 10% (absolute). The final step in the processing of the recognition scores is to divide by the number of frames in the test utterance, to produce a normalized "score - best score" per frame.
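A minimal sketch of the best-score correction and per-frame normalization just described, assuming per-frame scores are negative log probabilities (lower is better); the array layout is a hypothetical illustration, not the Dragon recognizer's actual interface:

```python
import numpy as np

def best_score(frame_scores):
    """Sum over frames of the best (lowest) score against any output distribution.

    `frame_scores` has shape (num_frames, num_output_distributions), holding the
    score of each frame against every output distribution in the model.
    """
    return np.min(frame_scores, axis=1).sum()

def normalized_lid_score(recognition_score, frame_scores):
    """The "score - best score" per frame used as the language ID feature."""
    num_frames = frame_scores.shape[0]
    return (recognition_score - best_score(frame_scores)) / num_frames
```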


Each utterance in the test set is represented by a vector of N such scores, one for each language. Given these numbers, we classify the utterances using logistic regression [17]. First, we estimate the regression coefficients by maximizing the log likelihood of a set of utterances for which the language is known (this was done using pre-1995 data from the OGI multi-language telephone speech corpus [18]). Then, for each of the test utterances, the logistic regression algorithm produces estimates of the probability that the utterance comes from each of the target languages. If a hard decision is required, we choose the language with the highest probability.
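A sketch of this classification stage, using scikit-learn's logistic regression as a stand-in for the fitting procedure of [17]; the function names and data layout are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lid_classifier(score_vectors, languages):
    """Fit logistic regression on calibration utterances with known languages.

    Each row of `score_vectors` is one utterance's vector of N
    "(score - best score) per frame" values, one entry per target language.
    """
    classifier = LogisticRegression()
    classifier.fit(score_vectors, languages)
    return classifier

def classify_utterance(classifier, score_vector):
    """Return per-language probabilities and the hard (argmax) decision."""
    probabilities = classifier.predict_proba([score_vector])[0]
    soft_decision = dict(zip(classifier.classes_, probabilities))
    hard_decision = classifier.classes_[np.argmax(probabilities)]
    return soft_decision, hard_decision
```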

4. RESULTS AND DISCUSSION

We tested our LID system on the new 1995 OGI corpus of digital telephone speech. The test material consisted of 59 one-minute and 222 10-second speech segments, evenly distributed among English, Japanese, and Spanish. Each one-minute utterance came from a different speaker, and the 10-second utterances came from chopping the one-minute segments into a number of pieces, each containing approximately 10 seconds of speech. We performed three two-way and one three-way closed-set tests.

In table 1 we present the LID results, expressed as percentages of incorrectly classified utterances, together with the numbers of errors (and total number of test utterances). It is important to note that most of the errors came from a single Japanese speaker. Overall, the number of errors is too small to make any statistically significant comparisons between the various test conditions. As an extreme example, the error rate reported on the 3-way test is twice as high as the average error rate on the 2-way tests, but in fact the same one error was made in both cases! Future tests will require either a larger test set and/or harder data to measure differences in LID error rate between the various test conditions, in particular when the amount of test data is varied. We should also point out that our system performed better in this test than we had expected from development test experiments run on the earlier (analog line) OGI test data, which suggests the new data is cleaner.

Table 1: Results of 2- and 3-way language ID

In figure 1 we present an example of a scatter plot showing the separation of Japanese and English speakers, for the 10-second test condition. On each axis, we are plotting the "Score - Best score" per frame, as measured by the appropriate recognition models. The rogue Japanese speaker is clearly visible as a cluster of points on the wrong side of the line, in the upper right corner.

The recognition scores used to produce the results quoted above have a number of distinct components, arising principally from the acoustic, language, and duration models. Most other systems throw away the acoustic score on the grounds that the raw score is too noisy, and rely instead on clever post-processing of the output strings with a specially designed "language model". We ran a number of experiments to try to isolate the contributions of the various components, and the results are shown in table 2.

Figure 1: Scatter plot showing separation of Japanese and English speakers ("Score - Best score" per frame; horizontal axis: English recognizer)

Table 2: Average 3-way percent error rate for the different test methods

  Test method                                  Average 3-way percent error rate
  Acoustic score only                          16.2
  Score                                        14.0
  Best score                                   41.0
  Acoustic score - Best score                   5.4
  Score - Best score                            2.7
  Score - Best score (Phoneme recognition)      5.4

The first entry shows the contribution from just the acoustic score, without the best score correction or the language and duration models. The second entry includes these latter two components, and shows a modest improvement. The third entry shows that best score by itself is a very poor tool, but the next two entries show how valuable it is when it is subtracted from either the score or the acoustic score, as a control for noise in the acoustics.

The final entry in the table shows how well the system performs when phoneme recognition is substituted for word-level recognition. Otherwise, the system is identical to the evaluation system (in particular, it still uses score - best score). One motivation for trying this is that phoneme recognition typically runs much faster than full LVCSR. We were also interested in facilitating a comparison with other phoneme-based systems [3-10]. (Note that we are not claiming any similarity between our system and these others.) With phonemes, the system makes 12 errors instead of 6, out of 222 test utterances. Although these numbers are small, because the 12 errors include all of the original 6 plus 6 new errors, a McNemar test suggests that this is a significant difference in performance (P < 0.05), and so the word-based system appears to perform better.
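The paper does not say which form of the McNemar test was used; an exact (binomial) version over the discordant utterances, with b = 6 utterances that only the phoneme-based system got wrong and c = 0 that only the word-based system got wrong, reproduces the stated significance:

```python
from math import comb

def exact_mcnemar_p_value(b, c):
    """Two-sided exact (binomial) McNemar p-value for discordant counts b and c."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(exact_mcnemar_p_value(6, 0))  # 0.03125, consistent with P < 0.05
```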

5. FUTURE WORK

Over the coming year we will add Mandarin to the LID system, to allow 4-way language classification. In view of the low error rates obtained when testing on prompted monologues, we intend to extend our tests to the much harder domain of conversational telephone speech from the CallFriend corpus, where we expect a much higher word error rate, with a corresponding degradation in the LID rate. We also hope to explore the possibility of modifying our system to allow it to classify languages for which we do not have transcribed training data, a capability that a number of other systems already possess.

6. REFERENCES

[1] L. Gillick et al., "Application of large vocabulary continuous speech recognition to topic and speaker identification using telephone speech," Proc. ICASSP 1993, v.2, pp. 471-474.
[2] B. Peskin et al., "Topic and Speaker Identification via Large Vocabulary Continuous Speech Recognition," Proceedings of the ARPA HLT Workshop 1993, pp. 119-124.
[3] J. Gauvain and L. Lamel, "Identification of Non-Linguistic Speech Features," Proceedings of the ARPA HLT Workshop 1993, pp. 96-101.
[4] Y. Muthusamy and R. A. Cole, "Automatic Segmentation and Identification of Ten Languages Using Telephone Speech," Proc. ICSLP 1992, v.2, pp. 1007-1010.
[5] M. Zissman and E. Singer, "Automatic Language Identification of Telephone Speech Messages using Phoneme Recognition and N-gram Modeling," Proc. ICASSP 1993, v.2, pp. 309-402.
[6] L. Lamel and J. Gauvain, "Identifying Non-linguistic Speech Features," Proc. Eurospeech 1993, v.1, pp. 23-30.
[7] Y. Muthusamy et al., "Comparison of Approaches to Automatic Language Identification using Telephone Speech," Proc. Eurospeech 1993, pp. 1307-1310.
[8] A. A. Reyes et al., "Three Language Identification Methods based on HMMs," Proc. ICSLP 1994, pp. 1895-1898.
[9] T. J. Hazen and V. W. Zue, "Automatic Language Identification using a Segment-based Approach," Proc. Eurospeech 1993, pp. 1303-1306.
[10] K. Berkling et al., "Analysis of Phoneme-based Features for Language Identification," Proc. ICASSP 1994, v.1, pp. 289-292.
[11] T. Schultz et al., "Experiments with LVCSR based language identification," Proc. SRS 1995, pp. 89-92.
[12] S. Lowe et al., "Language identification via large vocabulary speaker independent continuous speech recognition," Proceedings of the ARPA HLT Workshop 1994, pp. 437-441.
[13] R. Roth et al., "Dragon Systems' 1994 Large Vocabulary Continuous Speech Recognizer," Proc. SLS Technology Workshop, Austin, January 1995, pp. 116-120.
[14] M. Hunt et al., "An Investigation of PLP and IMELDA Acoustic Representations and of their Potential for Combination," Proc. ICASSP 1991, v.2, pp. 881-884.
[15] M. Hunt, "A robust method of detecting the presence of voiced speech," Proc. 15th International Congress on Acoustics, Trondheim, Norway, June 1995.
[16] H. Ney et al., "On Structuring Probabilistic Dependencies in Stochastic Language Modeling," Computer Speech and Language 1994, v.8, pp. 1-38.
[17] P. McCullagh and J. A. Nelder, Generalized Linear Models. New York, NY: Chapman and Hall, 1983.
[18] Y. K. Muthusamy et al., "The OGI Multi-language Telephone Speech Corpus," Proc. ICSLP 1992, v.2, pp. 895-898.
