Phonotactic Recognition of Greek and Cypriot Dialects from Telephone Speech

Iosif Mporas, Todor Ganchev, Nikos Fakotakis

Artificial Intelligence Group, Wire Communications Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, 26500 Rion, Greece
[email protected], {tganchev, fakotaki}@wcl.ee.upatras.gr

Abstract. In the present work we report recent progress in the development of a dialect recognition system for Standard Modern Greek and the Cypriot dialect of Greek. Specifically, we rely on a compound recognition scheme in which the outputs of multiple phone recognizers, trained on different European languages, are combined. This achieves higher recognition accuracy than the mainstream phone recognizer alone. The evaluation results reported here indicate high recognition accuracy, up to 95%, which makes the proposed solution a feasible addition to existing spoken dialogue systems, such as voice banking applications, call routers, voice portals, smart-home environments, e-Government speech-oriented services, etc.

Keywords: Dialect recognition, phone recognition.

1 Introduction

The globalization tendency of the last two decades has pushed the speech technology community toward the development of multilingual systems. Specifically, multilingual speech recognition and synthesis, which enable automatic speech translation, will become increasingly important [1]. The cornerstone of multilingual speech applications is Language Identification (LID), i.e. the task of automatically recognizing the language of a speech signal. LID is essential for multilingual functionalities such as spoken dialogue systems that support a group of languages (e.g. infokiosks, voice banking, e-Government, voice portals, etc.), spoken document retrieval, and human-to-human communication systems (e.g. call routers, speech-to-speech translation) [1]. Due to the high importance of LID, intensive efforts have been devoted to the development of this technology, leading to significant progress over the last few years [2]. One of the challenging research tasks related to LID is Dialect Identification (DID). Similarly to LID, in the DID task a system is expected to correctly identify one among the different dialects of a given language from a spoken utterance. The DID functionality is crucial for spoken dialogue systems whenever the speech recognition engine needs to be adapted to the speaking style and manner of articulation of users originating from areas that speak different dialects. Dialect adaptation facilitates

higher speech recognition performance than the baseline without adaptation. Various techniques have been proposed for addressing the challenges of the DID task. Most of them bear similarity to the corresponding LID techniques from which they were originally inspired. Generally speaking, however, DID is considered a more difficult task than LID [3], due to the intrinsic similarities among the dialects of a single language. In the DID task, various sources of information encoded at the different levels of spoken language can be utilized for discriminating among dialects: the acoustic level (e.g. spectral information), the prosodic level (e.g. duration and rhythm), the phonotactic level (e.g. language models) and the lexical level [4]. At the acoustic level, spectral information is extracted from the speech signal through speech parameterization techniques and fed to powerful classification algorithms such as Gaussian mixture models [3], support vector machines [5] and neural networks [6]. At the prosodic level, the duration of phonetic units [4, 7, 8, 9] and rhythm [10] have been exploited. Lexical information has also proved a useful source for recognizing dialects or languages [11, 12]. To date, the most successful approach for both LID and DID is the phonotactic approach [13, 14]. In the phonotactic approach, the speech signal is decomposed into its corresponding phone sequence, which is then scored against dialect-specific language models; the language model with the maximum probability score indicates the recognized dialect. The decomposition of the speech waveform into a phone sequence can be performed using a single phone recognizer followed by the target language models (PRLM), or using parallel phone recognizers (PPRLM).
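The core of the phonotactic approach, scoring a phone sequence against dialect-specific n-gram statistics, can be sketched in a few lines. The following is a hypothetical minimal illustration with our own function names and simple add-one smoothing; it is not the implementation of any of the cited systems, which use dedicated language-modeling toolkits:

```python
import math
from collections import Counter

def train_phone_lm(transcripts, n=3):
    """Count n-grams and their (n-1)-gram histories over a list of
    phone-label sequences; these counts back a smoothed n-gram model."""
    ngrams, hists = Counter(), Counter()
    for seq in transcripts:
        for i in range(n - 1, len(seq)):
            hist = tuple(seq[i - n + 1:i])
            ngrams[hist + (seq[i],)] += 1
            hists[hist] += 1
    return ngrams, hists

def phonotactic_logprob(phones, ngrams, hists, vocab_size, n=3):
    """Add-one-smoothed log-likelihood of a phone sequence under the
    n-gram counts returned by train_phone_lm."""
    logp = 0.0
    for i in range(n - 1, len(phones)):
        hist = tuple(phones[i - n + 1:i])
        num = ngrams.get(hist + (phones[i],), 0) + 1
        den = hists.get(hist, 0) + vocab_size
        logp += math.log(num / den)
    return logp
```

A phone sequence scores highest under the language model of the dialect whose phone patterns it resembles, which is the property PRLM and PPRLM systems exploit.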
In the present work, we investigate the task of automatic recognition between the two major dialectal categories of the Greek language, namely Standard Modern Greek and Cypriot. We develop and evaluate various configurations of a DID system that follow the phonotactic approach. Specifically, we utilize six different language-dependent phone recognizers and investigate several configurations for the fusion of their outputs. The remainder of this paper is organized as follows: In Section 2, we describe the main differences between Standard Modern Greek and the Cypriot dialect, and briefly outline the characteristics of the speech corpora used for performance evaluation. In Section 3, we offer a detailed description of the architecture of the developed DID system. The experimental setup is explained in Section 4. Section 5 is devoted to the analysis of the experimental results. Finally, Section 6 offers a summary and conclusions.

2 Greek and Cypriot Speech Corpora

The Cypriot dialect differs from Standard Modern Greek in phonology, morphology, vocabulary and syntax. At the phonological level, where the approach employed here operates, the basic difference is that Standard Modern Greek has no variation in vowel length; that is, there is no phonemic distinction between long and short vowels. However, in the Cypriot

Fig. 1. DID system for Greek – Cypriot dialects. Phone recognizers and language models are denoted as PR and LM respectively.

dialect there are expanded phonemic consonant-length distinctions. A detailed description of the differences between Greek and Cypriot can be found in [15]. Two speech corpora capture Standard Modern Greek and the Cypriot dialect: the SpeechDat(II) FDB5000 Greek database [16] and the Orientel Cypriot Greek database [17]. Both databases include spontaneous answers to prompted questions. The recorded utterances consist of isolated and connected digits, natural numbers, money amounts, yes/no answers, dates, application words and phrases, phonetically rich words and phonetically rich sentences. The SpeechDat(II) FDB5000 Greek database consists of prompted speech recordings from 5000 native Greek speakers (both male and female), recorded over the fixed telephone network of Greece. The Orientel Cypriot Greek database consists of recordings of 1000 native Cypriot speakers, collected over the fixed telephone network of Cyprus following the conventions of the SpeechDat project. In both databases, the speech waveforms were sampled at 8 kHz and stored in 8-bit A-law format.

3 System Description

The DID system presented in this section follows the PPRLM approach. Fig. 1 outlines the general architecture of the system. As the figure shows, the unlabelled speech signal (i.e. of unknown dialect) is first pre-processed and parameterized. The feature vector sequence, F, is then forwarded in parallel to N=6 phone recognizers. Each phone recognizer decomposes the feature vector sequence into a phone sequence On, n=1,…,6, consisting of labels from the phone set of the nth recognizer. For every one of

the computed phone sequences, the likelihood against the corresponding dialect-dependent n-gram language model Lnl is computed:

    Pnl = P(On | Lnl),    (1)

where n=1,…,N is the index of the phone recognizer and l=1, 2 is the index of the target dialect. The decision D about the unknown input speech waveform is derived by applying the maximum-likelihood criterion to the scores of all computed phone sequences against the language models:

    D = arg max_{n,l} P(On | Lnl).    (2)
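In code, the decision of Eq. (2) is a single argmax over all recognizer/dialect pairs. A minimal sketch, assuming scores[n][l] already holds the log-likelihood of phone sequence On under language model Lnl (the function name is hypothetical):

```python
def ml_decide(scores):
    """Eq.-(2)-style decision: scores[n][l] = log P(On | Lnl).
    Returns the dialect index l of the overall best-scoring pair."""
    best_score, best_dialect = float("-inf"), None
    for n, row in enumerate(scores):
        for l, s in enumerate(row):
            if s > best_score:
                best_score, best_dialect = s, l
    return best_dialect
```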

Alternatively, the phone sequence likelihoods are utilized as input I to a discriminant classifier C, which maps the input I (corresponding to the unknown input speech waveform) to one of the target dialects:

    C : I → {Dl},  l = 1, 2,    (3)

where Dl is the decision that the unknown speech waveform belongs either to Standard Modern Greek or to the Cypriot dialect.

3.1 Phone Recognizers

The system studied in the present work utilizes six parallel phone recognizers, each trained on the phone set of one language: Czech, Russian, Hungarian, Greek, British English and German. The Czech, Russian and Hungarian phone recognizers were trained on the SpeechDat-E database [18]. These recognizers utilize Mel-scale filter-bank energies and temporal patterns of critical-band spectral densities, and phone posterior probabilities are computed using neural networks. Further details concerning these phone recognizers are available in [19]. The remaining three phone recognizers (Greek, British English and German) were trained on the SpeechDat(II)-FDB databases [20] using the HTK toolkit [21]. Each phone is modeled by a 3-state left-to-right hidden Markov model (HMM). The features are the first 12 Mel-frequency cepstral coefficients together with the 0th coefficient; phone models are trained on this base feature vector augmented with the first and second derivative coefficients. Each HMM state is modeled by a mixture of eight continuous Gaussian distributions. The training data for the Greek phone recognizer amounted to almost 6 hours of speech recordings. The British English phone recognizer was trained on approximately 21 hours of

speech. Finally, the German phone recognizer was trained on nearly 14 hours of speech. All phone models in these three recognizers are context-independent.

3.2 Language Models

For the construction of the language models we used the CMU-Cambridge Statistical Language Modeling (SLM) Toolkit [22]. Every phone sequence was modeled by an m-gram language model. For each dialect and for each phone recognizer output, one 3-gram (m=3) language model was trained, resulting in a total of 6x2=12 language models.

3.3 Fusion of PRLMs

The output scores of each PRLM were forwarded as input to a linear discriminant classifier. The classifier makes the final decision about the dialect of the unknown speech waveform by fusing the outputs of the parallel tokenizers (i.e. the phone recognizers). Besides the case where we fuse the outputs of all six phone recognizers, in Section 5 we also investigate the fusion of other subsets of PRLM scores.
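The fusion step of Section 3.3 can be illustrated with a two-class Fisher linear discriminant over the vector of PRLM scores. This is an illustrative sketch assuming NumPy; the paper does not specify the exact discriminant variant or its training procedure, and the names below are our own:

```python
import numpy as np

def fit_lda(X0, X1):
    """Two-class Fisher linear discriminant.
    X0, X1: arrays whose rows are score vectors for each dialect class."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter, lightly regularized for invertibility.
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)
    Sw += 1e-6 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)   # projection direction
    b = -0.5 * w @ (m0 + m1)           # threshold midway between class means
    return w, b

def classify(x, w, b):
    """Return 1 if the projected score exceeds the threshold, else 0."""
    return int(x @ w + b > 0)
```

In this sketch, the 12 PRLM log-likelihoods of an utterance would form the input vector x, and the two classes would correspond to the Standard Modern Greek and Cypriot training utterances.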

4 Experimental Setup

For the present evaluation of dialect recognition performance, we utilized recordings from the Greek and Cypriot speech corpora described in Section 2. Specifically, we used 20,000 speech files (10,000 from each dialect). The data were divided into 80% (8,000 files per dialect) for training the language models, 10% (1,000 files per dialect) for training the linear discriminant classifier, and 10% (1,000 files per dialect) for evaluating the identification accuracy. All utterance types mentioned in Section 2 were used in the present evaluation. The experiments were performed with 5-fold cross-validation, which takes advantage of the available data while keeping the training and testing datasets non-overlapping. Two types of errors can occur during dialect recognition. The first, a false negative (miss), occurs when the true target is falsely rejected as a non-target. The second, a false positive, occurs when a trial from a non-target is accepted as if it came from the target dialect; this error is also known as a false alarm. We use the miss probability and the false alarm probability as indicators of DID performance. These two error types are used in the Detection Error Trade-off (DET) [23] plots shown in the experimental results section. The DET plots are computed from the scores of the Cypriot and Greek dialect models, and therefore show the system performance at multiple operating points of the DID system.
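The miss and false-alarm probabilities underlying a DET plot, and the equal error rate referred to in Section 5, can be computed from lists of target and non-target trial scores along the following lines (a simplified sketch with hypothetical names; real DET plots additionally warp both axes to the normal-deviate scale):

```python
def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm probabilities swept over all thresholds,
    i.e. the raw operating points of a DET curve."""
    points = []
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        points.append((thr, p_miss, p_fa))
    return points

def equal_error_rate(target_scores, nontarget_scores):
    """Operating point where miss and false-alarm rates are closest."""
    pts = det_points(target_scores, nontarget_scores)
    _, p_miss, p_fa = min(pts, key=lambda t: abs(t[1] - t[2]))
    return (p_miss + p_fa) / 2
```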

5 Experimental Results

Following the experimental protocol outlined in Section 4, we evaluated the performance of the individual language-specific phone recognizers as well as several score fusion schemes. In Fig. 2 we present the DET curves for the Russian (RU), Hungarian (HU), Czech (CZ), Greek (GR), British English (BR) and German (GE) phone recognizers, and the score-level fusion (FU) of all six PRLM outputs. As the figure shows, the British English and German phone recognizers exhibited superior performance compared to the remaining phone recognizers. We attribute this to the significantly larger amounts of training data used for building these models. On the other hand, the performance of the Czech, Russian and Hungarian phone recognizers is significantly inferior to that of the Greek one, mainly because of the differences between the phone sets of these languages and those of the Greek dialects. The fusion results for different sets of phone recognizers are presented in Fig. 3. Specifically, the fusion result for all phone recognizers (FU) is contrasted with those obtained for the fusion of the top-3 scorers (FU3: GE, BR, GR) and the top-2 scorers (FU2: GE, BR). As the DET plots show, FU2 does not offer any

Fig. 2. DET plots of the individual phone recognizers: Czech (CZ), Russian (RU), Hungarian (HU), Greek (GR), British English (BR), German (GE) and the fusion of all of them (FU).

Fig. 3. DET plots for different fusion schemes: FU – fusion of all phone recognizers; FU3 – fusion of the top-3 (GE, BR, GR) phone recognizers; FU2 – fusion of the top-2 (GE, BR) scorers. Plots of the individual Greek (GR), British English (BR) and German (GE) phone recognizers are also shown.

advantage compared to the individual Greek and British English phone recognizers. However, adding the Greek phone recognizer to the top-2 set significantly improves the recognition accuracy. As Fig. 3 shows, the top-3 fusion scheme, FU3, exhibits the highest recognition accuracy in the region of the Equal Error Rate (EER), i.e. where the probability of missing a target trial equals the probability of accepting a non-target trial as originating from the target. Finally, the fusion of all six phone recognizers (FU) offers more balanced performance across all regions of the DET plot. In particular, the FU fusion scheme provides satisfactory recognition accuracy at the low-miss-probability, low-false-alarm and EER operating points of the Greek-Cypriot dialect identification system. Such balanced performance makes the FU scheme more predictable and thus a more attractive choice for a wide range of applications. However, for applications whose preferred operating point is near the EER, the FU3 scheme maximizes the dialect recognition performance. The FU3 scheme is also the option of interest when computational demands and memory requirements are important factors that need to be bounded.

6 Conclusion

Present-day speech recognition technology relies on large speech corpora, which serve to create statistical models of speech. However, for many European languages and dialects such extensive resources are not available, or the existing data are of very limited size, mainly due to the high cost of collecting spoken corpora. As a consequence, speakers of these languages and dialects cannot comfortably take advantage of modern technology, since they often experience difficulties when accessing public-domain spoken dialogue systems. In the present contribution, we presume that a dialect recognition component makes a spoken interaction application aware of the dialect spoken by the user. This information enables the selection of dialect-specific speech recognition settings, which would improve the overall task completion rate in spoken interaction and, as a result, the quality of service. Specifically, by evaluating different configurations of our dialect identification component, we studied a fusion scheme that exploits a number of phone recognizers trained on significant amounts of speech data. The use of a compound phone recognition scheme improves dialect identification rates compared to the use of the mainstream Greek phone recognizer alone. In summary, the reported effort led to the successful development of a compound dialect recognition component that achieves recognition accuracy above 95% and has the potential to facilitate improved speech recognition for Cypriot speakers.

Acknowledgments. The authors would like to thank the anonymous reviewers for their useful comments and corrections.

References

1. Schultz, T., Kirchhoff, K.: Multilingual Speech Processing. Academic Press, Elsevier (2006)
2. Martin, A.F., Le, A.N.: NIST 2007 Language Recognition Evaluation. In: Odyssey 2008 - The Speaker and Language Recognition Workshop, ISCA Tutorial and Research Workshop (2008)
3. Torres-Carrasquillo, P.A., Gleason, T.P., Reynolds, D.A.: Dialect Identification using Gaussian Mixture Models. In: Odyssey 2004 - The Speaker and Language Recognition Workshop, ISCA Tutorial and Research Workshop, pp. 297--300 (2004)
4. Tong, R., Ma, B., Li, H., Chng, E.S.: Integrating Acoustic, Prosodic and Phonotactic Features for Spoken Language Identification. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 205--208 (2006)
5. Campbell, W.M., Singer, E., Torres-Carrasquillo, P.A., Reynolds, D.A.: Language Recognition with Support Vector Machines. In: Odyssey 2004 - The Speaker and Language Recognition Workshop, ISCA Tutorial and Research Workshop, pp. 285--288 (2004)
6. Braun, J., Levkowitz, H.: Automatic Language Identification with Perceptually Guided Training and Recurrent Neural Networks. In: 5th International Conference on Spoken Language Processing (ICSLP), pp. 3201--3205 (1998)
7. Ghesquiere, P.J., Compernolle, D.V.: Flemish Accent Identification based on Formant and Duration Features. In: 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 749--752 (2002)
8. Lin, C.Y., Wang, H.C.: Fusion of Phonotactic and Prosodic Knowledge for Language Identification. In: 9th International Conference on Spoken Language Processing (ICSLP), pp. 425--428 (2006)
9. Hazen, T., Zue, V.: Segment-based Automatic Language Identification. Journal of the Acoustical Society of America, 101 (4), pp. 2323--2331 (1997)
10. Farinas, J., Pellegrino, F., Rouas, J.L., Andre-Obrecht, R.: Merging Segmental and Rhythm Features for Automatic Language Identification. In: 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 753--756 (2002)
11. Huang, R., Hansen, J.: Dialect/Accent Classification via Boosted Word Modeling. In: 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 585--588 (2005)
12. Campbell, W.M., Richardson, F., Reynolds, D.A.: Language Recognition with Word Lattices and Support Vector Machines. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 425--428 (2007)
13. Zissman, M.: Comparison of Four Approaches to Automatic Language Identification. IEEE Transactions on Speech and Audio Processing, 4 (1), pp. 31--44 (1996)
14. Tsai, W.H., Chang, W.W.: Chinese Dialect Identification using an Acoustic-Phonotactic Model. In: 6th European Conference on Speech Communication and Technology (EUROSPEECH), pp. 367--370 (1999)
15. Kontosopoulos, N.G.: Dialects and Idioms of Modern Greek (in Greek). Grigoris Publications (1994)
16. Chatzi, I., Fakotakis, N., Kokkinakis, G.: Greek Speech Database for Creation of Voice Driven Teleservices. In: 5th European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1755--1758 (1997)
17. Kostoulas, T., Georgila, K.: Orientel Cypriot Greek Database, V.2.0 (2007)
18. Pollak, P., Cernocky, J., Boudy, J., Choukri, K., Van den Heuvel, H., Vicsi, K., Virag, A., Siemund, R., Majewski, W., Staroniewicz, P., Tropf, H.: SpeechDat(E) - Eastern European Telephone Speech Databases. In: XLDB Workshop and Satellite Event to LREC Conference on Very Large Telephone Speech Databases (2000)
19. Schwarz, P., Matejka, P., Cernocky, J.: Towards Lower Error Rates in Phoneme Recognition. In: Sojka, P., Kopecek, I., Pala, K. (eds.): TSD 2004, LNAI 3206, pp. 465--472. Springer, Heidelberg (2004)
20. Hoge, H., Draxler, C., Van den Heuvel, H., Johansen, F.T., Sanders, E., Tropf, H.S.: SpeechDat Multilingual Speech Databases for Teleservices: Across the Finish Line. In: 6th European Conference on Speech Communication and Technology (EUROSPEECH), pp. 2699--2702 (1999)
21. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book (for HTK Version 3.3). Cambridge University (2005)
22. Clarkson, P.R., Rosenfeld, R.: Statistical Language Modeling Using the CMU-Cambridge Toolkit. In: 5th European Conference on Speech Communication and Technology (EUROSPEECH), pp. 2707--2710 (1997)
23. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET Curve in Assessment of Detection Task Performance. In: 5th European Conference on Speech Communication and Technology (EUROSPEECH), vol. 4, pp. 1895--1898 (1997)