COMBINATION OF STOCHASTIC UNDERSTANDING AND MACHINE TRANSLATION SYSTEMS FOR LANGUAGE PORTABILITY OF DIALOGUE SYSTEMS

Bassam Jabaian 1,2, Laurent Besacier 1, Fabrice Lefèvre 2
1 LIG, University Joseph Fourier, Grenoble - France
2 LIA, University of Avignon, Avignon - France
{bassam.jabaian,laurent.besacier}@imag.fr,
[email protected] which the portability was considered. We showed that the system can be ported either at the train level (TrainOnTarget), by training a new semantic tagger for the target language using SMT to translate the source training data or at the test level (TestOnSource), by automatically translating the test data and using the current source SLU system. Our experiments showed that, if TestOnSource is the best performing approach, they are pretty comparable in terms of performance and cost. Both solutions require to manually translating a subset of the source corpus to train the SMT system. In this work, we now intend to improve even further their performance. Two main lines are drawn: developing a new SLU system based on a different underlying statistical model with an increased robustness to alignment imperfections in the training data should help TrainOnTarget ; increasing CRF robustness to translation errors might impact positively on TestOnSource. For the first point, phrase-based SMT is investigated as an alternative to CRF. The foreseen advantage is the capacity of SMT to infer its own alignment from the parallel data (word/concept in this case). To address the last point, the first approach investigated is to retrain the source SLU model using training data smeared by the effects of automatic translation (smeared cross-lingual training data, SCTD). The second approach consists in using statistical post edition (SPE) [11, 12] in the source language to automatically post edit the output of the machine translator before passing it to the semantic tagger. Finally all the approaches proposed in the paper, regardless of their respective performance, can be seen as parallel systems with differentiated behaviors. And we show that their combination can lead to an additional reduction in concept error rate when compared with any of them individually. The paper is structured as follows: Section 2 reviews two strategies to port a SLU system to a new language. 
An SMT-based approach for SLU is presented in Section 3, along with the CRF baseline. In Section 4, two solutions to improve robustness to translation errors are introduced (SPE and SCTD). Section 5 describes the MEDIA corpus used in our experiments, as well as the tools used to translate the corpus into the target language. Section 6 gathers the experimental results of all the methods.
Abstract
In this paper, several approaches for language portability of dialogue systems are investigated, with a focus on the spoken language understanding (SLU) component. We show that the use of statistical machine translation (SMT) can greatly reduce the time and cost of porting an existing system from a source to a target language. Using automatically translated training data, we study phrase-based machine translation as an alternative to conditional random fields for conceptual decoding, to compensate for the loss of a precise concept-word alignment. Two ways to increase SLU robustness to translation errors (smeared training data and translation post-editing) are also shown to improve performance when test data are translated and then decoded in the source language. Overall, the combination of all these approaches further reduces the concept error rate. Experiments were carried out on the French MEDIA dialogue corpus, a subset of which was manually translated into Italian.
Index Terms: Spoken Dialogue Systems, Spoken Language Understanding, Language Portability, Statistical Machine Translation.
1. INTRODUCTION
A very challenging aspect of porting a spoken dialogue system (SDS) from one language to another lies in the effort reduction that can be obtained by making the best use of existing resources in the source language when developing the system in the target language. While some SDS components are expected to be rather language-independent (mainly the dialogue manager), most of them are not. In this paper we focus on the portability of the SLU module. Recently, several works showed that the use of automatic machine translation at different levels of the understanding process can help port an SLU system with minimal human effort [1, 2, 3, 4, 5]. Although good performance can already be obtained, a difficulty remains in the capacity of these approaches to transfer all of the semantic knowledge from the source data to the target language. For instance, propagating the segmental alignment between words and conceptual units is quite complex. In [4], the STC model is proposed, which does not use any alignment information. In [3], the conceptual segments are translated separately so as to maintain the chunking between source and target language, but at the cost of degrading the translation quality. In a previous work [5], we investigated multiple approaches for SLU portability across languages, with variants on how the conceptual alignment is treated by the system. All these approaches were based on conditional random fields (CRF) as the semantic tagger and differed in the level at
978-1-4577-0539-7/11/$26.00 ©2011 IEEE
2. STRATEGIES FOR SLU PORTABILITY The portability of the SLU system to a new language can be done using two different strategies [5], briefly recalled in the following sections.
ICASSP 2011
2.1 Train On Target
H(c_{n-1}, c_n, w_{n-2}^{n+2}) = exp( Σ_{i=1}^{M} λ_i · h_i(c_{n-1}, c_n, w_{n-2}^{n+2}) )
The general idea behind this strategy is to translate the source training corpus into the target language and to infer the corresponding semantic annotations. In [5], the approach that gave the best performance used word-to-word alignments to infer the concepts associated with the target corpus from annotated data in the source language. In the example in Figure 1, we see that it is possible (while generally not obvious) to match the semantic concepts directly to the chunks in the target language using the alignment information. The Berkeley aligner [15] was used for automatic source/target alignment. In the rest of the paper, we refer to this approach as TrainOnTarget.
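The annotation transfer described above can be sketched as follows. This is a minimal illustration, assuming word alignments given as (source index, target index) pairs in the style produced by tools such as the Berkeley aligner; the function and variable names are illustrative, not taken from the paper's actual tooling.

```python
# Sketch (assumptions): project source-side concept labels onto a target
# sentence through word-to-word alignments, as in TrainOnTarget.

def project_concepts(src_concepts, alignment, tgt_len):
    """src_concepts: one concept label per source word.
    alignment: list of (src_idx, tgt_idx) word-alignment pairs.
    Returns one concept label per target word ('null' if unaligned)."""
    tgt_concepts = ["null"] * tgt_len
    for src_idx, tgt_idx in alignment:
        tgt_concepts[tgt_idx] = src_concepts[src_idx]
    return tgt_concepts

# Figure 1 example: "Je voudrais réserver une chambre à Marseille" (French)
# aligned with "Vorrei prenotare una stanza a Marsiglia" (Italian).
src = ["command-tache"] * 3 + ["nbChambre"] * 2 + ["loc-ville"] * 2
align = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
print(project_concepts(src, align, 6))
# -> ['command-tache', 'command-tache', 'nbChambre', 'nbChambre',
#     'loc-ville', 'loc-ville']
```

In practice the alignment is many-to-many and noisy, which is precisely the imperfection the PB-SMT variant of Section 3 is meant to be more robust to.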
The log-linear model is based on feature functions h_i(c_{n-1}, c_n, w_{n-2}^{n+2}) representing the information extracted from the training corpus; Z is a normalization term and the weights λ_i are estimated during training.
3.2. SLU/PB-SMT
In this approach, we consider each sequence of concepts to be the translation of an initial word sequence. Thus, semantic tagging is seen as a translation task: the best concept sequence Ĉ for the word sequence W is defined as:

Ĉ = argmax_C P(C|W) = argmax_C P(W|C) · P(C)
Fig 1: Example of inferring concepts in the target language using word-to-word alignment. The French sentence "Je voudrais réserver une chambre à Marseille" (concepts: command-tache, nbChambre, loc-ville) is aligned with the Italian sentence "Vorrei prenotare una stanza a Marsiglia".
2.2 Test On Source
In this approach, an SLU system is available in the source language, and an SMT system is used to translate the target test sentences back into the source language. The automatic translations are used as inputs for the source SLU system. In other words, this approach consists in porting the system at the test level without any change in the training process; it will be referred to as TestOnSource in the rest of the paper.
3. CASTING SLU AS AN SMT TASK
The goal of SLU is to extract a sequence of concepts from a sequence of words. The generation of concepts can be presented as follows. Given V, a vocabulary of words output as hypotheses by an ASR system, and C = c_1, …, c_N, a sequence of concept tags hypothesized from a word sequence W = w_1, …, w_M, each concept c_n is associated with a subsequence of words in W. Several efficient methods have been proposed for concept tagging [6, 8]. For their very good performance we chose CRF [7] as a baseline but, observing that SLU can be cast as an SMT task, we also test phrase-based SMT.
4. INCREASING SLU ROBUSTNESS TO TRANSLATION ERRORS
As already mentioned, the best method for SLU portability is also the simplest one, TestOnSource [4, 5]. Its main weakness is that the quality of the tagging depends strongly on the quality of the machine translation: the SLU system has to deal with input made noisy by translation errors. To increase the robustness of this approach, we propose two methods in this paper. The first takes the noise coming from the machine translation into account during SLU training; the second automatically corrects the output of the machine translation before sending it to the SLU model. It should be noted that, although not evaluated here (as no audio data are available in the target language), both methods would also be suitable for dealing with the speech recognition errors that occur in a real SLU task.
3.1. SLU/CRF
CRF represent a log-linear model, normalized at the sentence level. In order to train a CRF tagger, the training data need to be presented in the BIO formalism (as in [8]), which denotes the boundaries between concepts. For example, the sentence "Je voudrais réserver une chambre à Marseille" ("I would like to book a room in Marseille") is represented as a sequence of (word, concept) pairs:

(je, B_command-tache) (voudrais, I_command-tache) (réserver, I_command-tache) (une, B_nbChambre) (chambre, I_nbChambre) (à, B_loc-ville) (Marseille, I_loc-ville)

The CRF models the probability of the concepts given the words as follows:
4.1. Smeared Cross-lingual Training Data
P(c_1^N | w_1^N) = (1/Z) ∏_{n=1}^{N} H(c_{n-1}, c_n, w_{n-2}^{n+2}), with
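The BIO conversion used to prepare the CRF training data (Section 3.1) can be sketched as follows; this is a minimal illustration, and the function name is an assumption, not part of CRF++ or any toolkit used in the paper.

```python
# Sketch (assumptions): convert a per-word concept annotation into the
# BIO formalism; consecutive identical concepts form one segment.

def to_bio(words, concepts):
    """words, concepts: parallel lists of equal length.
    Returns a list of (word, BIO-tag) pairs."""
    pairs = []
    prev = None
    for w, c in zip(words, concepts):
        prefix = "I" if c == prev else "B"  # B opens a segment, I continues it
        pairs.append((w, f"{prefix}_{c}"))
        prev = c
    return pairs

words = "je voudrais réserver une chambre à Marseille".split()
concepts = ["command-tache"] * 3 + ["nbChambre"] * 2 + ["loc-ville"] * 2
for w, t in to_bio(words, concepts):
    print(w, t)  # reproduces the tagged example of Section 3.1
```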
To solve this equation, we need a language model over concepts, P(C), which can be trained with the SRILM toolkit [14] on the corpus of concept sequences, and a translation model P(W|C), which can be a PB-SMT model, for instance. We used the Moses toolkit [10] to train such a PB-SMT model from a "parallel" corpus pairing word sentences with concept sequences. The weights of this model are optimized using minimum error rate training (MERT), which in a traditional translation task optimizes the BLEU score.

The performance of the initial SLU/PB-SMT approach was improved by exploiting specific characteristics of the SLU task. First, assuming that the concepts of a sentence occur in the same order as the words they cover, the phrase table was re-trained with a monotone constraint during automatic word alignment. Then, since one of the major difficulties of the translation process is to align a source word automatically and correctly with its counterpart in the target language, we used the BIO formalism to help the alignment process; in that way, the phrase table extraction was performed on a better-aligned parallel corpus. Next, since the evaluation metric of the SLU task is the CER rather than the BLEU score, we modified the MERT optimization process to optimize CER directly. Finally, to avoid out-of-vocabulary words coming from city names absent from the training data, the list of cities was added to the training data and the SLU/PB-SMT system retrained.
The principle of this method is to train a SLU model in the source language with additional noisy data coming from the output of an automatic translation system. Practically, we
translate the available training data from the target to the source language and infer the concepts associated with these noisy data (with the same method as for TrainOnTarget). Then, we add the smeared data (now semantically annotated) to the original "clean" data, and the whole set is used to train a new SLU model, which now encompasses knowledge of both clean and SMT-smeared data.
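The SCTD assembly just described can be sketched as follows. This is a schematic under stated assumptions: `translate` and `infer_concepts` are hypothetical stand-ins for the SMT system and the alignment-based concept projection, not actual toolkit APIs, and real corpora would be read from files.

```python
# Sketch (assumptions): build the SCTD training set by concatenating the
# clean source-language corpus with noisy back-translations of the target
# data, re-annotated via alignment-based concept projection.

def build_sctd_corpus(clean_corpus, target_sentences, translate, infer_concepts):
    """clean_corpus: list of (words, concepts) pairs in the source language.
    translate: callable mapping a target sentence to a (noisy) source one.
    infer_concepts: callable annotating the noisy source sentence.
    Returns the clean data plus the smeared, re-annotated data."""
    smeared = []
    for tgt in target_sentences:
        noisy_src = translate(tgt)  # noisy back-translation
        smeared.append((noisy_src, infer_concepts(noisy_src, tgt)))
    return clean_corpus + smeared

# Toy usage with dummy stand-ins for the MT and projection steps:
clean = [(["une", "chambre"], ["nbChambre", "nbChambre"])]
out = build_sctd_corpus(
    clean,
    [["una", "stanza"]],
    translate=lambda s: ["une", "chambre"],
    infer_concepts=lambda src, tgt: ["nbChambre"] * len(src),
)
print(len(out))  # -> 2 (one clean pair plus one smeared pair)
```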
With regard to machine translation performance, using this post-editor increased the BLEU score of the Italian-to-French system from 47.18 to 49.25.

Table 1: Overview of the MEDIA corpus and its Italian translation (# sentences).
4.2. Statistical Post-Editing
Several recent works in machine translation have used a phrase-based machine translation approach to post-edit the output of another machine translation system [11, 12]. Such a process, while not necessarily intuitive, was proposed to improve the quality of the translated data sent to human post-editors. To train such a statistical post-editor, [12] uses the output of an SMT system, together with its manual post-edition, as a parallel training set. In our case, since the output of the SMT system is used as input to the SLU system trained on the source-language training data, we propose to post-edit this output to decrease the noise coming from the SMT system. To train an SPE, our choice was to automatically translate the available training set from the target language to the source language, and then use the translation output, with the corresponding clean source part, as a parallel corpus to train a phrase-based post-editor. The expectation is that the SPE can reorder words or recover missing words in many sentences.
5. CORPUS AND TOOLS DESCRIPTION
All the experiments in this paper were performed on the French MEDIA corpus. As described in [13], this corpus covers a domain related to hotel room reservation and tourism information. The corpus is made of 1257 dialogs from 250 speakers, divided into three parts: a training set (approx. 13k sentences), a development set (approx. 1.3k sentences) and an evaluation set (approx. 3.5k sentences). A subset of the training set (approx. 5.6k sentences), as well as the development and test sets, has been translated manually into Italian in the context of the LUNA project [3]. In this study, we use two statistical machine translation systems to obtain automatic translations from French to Italian and from Italian to French. To perform these translations we use the Moses toolkit [10], a state-of-the-art phrase-based translation system using log-linear models. We use the part of the MEDIA training corpus manually translated into Italian as a parallel corpus for Moses in both directions to learn the translation models, and the French and Italian parts to train the language models. The development set with its translation is used as a parallel corpus to tune the log-linear weights of the SMT systems. We obtain a French-to-Italian system with a BLEU score of 43.62 and an Italian-to-French system with a BLEU score of 47.18, both measured on the manually translated MEDIA test set. Since only one reference per utterance is used to evaluate BLEU, and since a small SMT training set is used (5.6k sentences), these performances can be considered acceptable. These systems are used to obtain an automatic translation of the remaining (not manually translated) part of the training set, so a full translation of the MEDIA training set (manual + automatic) is available. Table 1 gives a brief overview of the available subsets at this point.
We also automatically translate the Italian manual training part of the corpus into French in order to use this translation, with the original French part, as a parallel corpus to train the SPE and to provide the noisy data for SCTD.
MEDIA data          Train   Dev    Test
French MEDIA        13K     1.3K   3.5K
Italian manual      5.6K    1.3K   3.5K
Italian automatic   7.4K    -      -
6. EXPERIMENTS AND RESULTS
To evaluate the performance of the proposed approaches, we use the Italian manual translation of the MEDIA test set. The CER is the evaluation criterion chosen for this study. It is defined as the ratio of the sum of deleted, inserted and substituted concepts to the total number of concepts in the reference.
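The CER definition above can be computed with a standard Levenshtein alignment over concept sequences; a minimal sketch (the function name is illustrative, and real scoring tools also report the S/D/I breakdown):

```python
# Sketch: concept error rate = (substitutions + deletions + insertions)
# divided by the number of reference concepts, via edit distance.

def concept_error_rate(ref, hyp):
    """ref, hyp: lists of concept labels."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions
    for j in range(n + 1):
        d[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / len(ref)

ref = ["command-tache", "nbChambre", "loc-ville"]
hyp = ["command-tache", "loc-ville"]           # one concept deleted
print(f"{concept_error_rate(ref, hyp):.2f}")   # -> 0.33
```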
6.1. SLU Portability Strategies
We trained a baseline CRF tagger with unigram and bigram features on the French MEDIA corpus (12.9% CER on the original French test data), and then gave the automatic French translation of the Italian test set as input to this tagger in order to evaluate the performance of the TestOnSource approach. We also applied the TrainOnTarget method as described in Section 2.1. Results, reported in Table 2, show that CRF is more robust with TestOnSource than with TrainOnTarget. All our experiments used the CRF++ toolkit [9].

Table 2: Evaluation (CER %) of the different strategies of SLU portability on the SLU/CRF system

Model                       Sub   Del    Ins   CER
SLU/CRF (TestOnSource)      5.2   12.1   2.6   19.9
SLU/CRF (TrainOnTarget)     3.1   15.0   2.3   20.5
6.2. SLU/PB-SMT
Our first attempts to build a PB-SMT model for Italian SLU clearly led to lower performance than the CRFs (CER = 28.1% after MERT tuning for PB-SMT, compared to ~20% for CRF). The progressive improvements of this model, as proposed in Section 3, are evaluated in Table 3. Using a monotone constraint during word alignment reduced the CER by 0.6%. Converting the training data to the BIO format before training the phrase table reduced the CER significantly, by 2.8%. Optimizing CER instead of BLEU reduced the CER by a further 0.3%. Adding a list of cities to the training data and retraining the PB-SMT model obtained a final 0.5% reduction. Results show that, despite this fine tuning of the SMT approach, the CRF-based approaches still lead to better performance. Moreover, in a parallel experiment, a PB-SMT model was built for French SLU in order to test it with the TestOnSource method. Unfortunately, the performance of this combination was very disappointing and well below that of the other methods, so we decided to discard it in this study. From a simple analysis of the type of errors made by each model, we can observe that the CRF-based methods have a high level of deletions compared to the other error types, while the PB-SMT method presents a better trade-off between deletion and insertion errors, even though it ends up with a higher overall CER.
Table 3: Iterative improvements of the SLU/PB-SMT on the Italian MEDIA test set (%)

SLU/PB-SMT        Sub   Del    Ins    CER
Initial           6.5   4.0    18.6   29.1
+ MERT (BLEU)     6.3   9.3    12.5   28.1
+ Monotone algn   7.4   8.4    11.8   27.5
+ BIO format      6.5   10.6   7.7    24.7
+ MERT (CER)      6.4   10.9   7.2    24.4
+ City list       7.2   10.5   6.1    23.9

Table 5: System combination with and without PB-SMT

Model               Sub   Del    Ins   CER
BASIC               6.2   9.7    2.7   18.6
ALL                 5.4   10.5   2.3   18.2
ALL – SLU/PB-SMT    6.6   10.2   2.7   19.4
8. SUPPORT
This work is supported by the ANR-funded PORT-MEDIA project (ANR 08 CORD 026 01). More information can be found on the project website, www.port-media.org.
9. REFERENCES
Table 4: Evaluation (CER %) of the approaches proposed to increase system robustness against translation noise

Model                      Sub   Del    Ins   CER
SLU/CRF (TestOnSource)     5.2   12.1   2.6   19.9
+SCTD                      5.9   11.4   2.3   19.6
+SPE                       6.5   10.6   2.5   19.7
+SCTD +SPE                 6.4   9.9    2.9   19.3
In this paper we proposed and compared several methods for SLU portability across languages. CRFs and phrase-based SMT were used for this task, and results show that using a CRF tagger on translated test data gives the best performance. We also proposed and evaluated several methods to increase the robustness of this approach to translation errors: training with additional noisy data, and post-editing the outputs of the automatic translation. Both lead to better performance, and serializing the two methods further decreased the concept error rate. Finally, it was shown that a combination of all the approaches benefits from their complementary behaviors and obtains a significant improvement in performance.
We tried to increase the performance of the SLU/CRF TestOnSource method by improving its robustness to translation errors. First, we automatically translated the Italian manual part of the MEDIA training set into French, using the same SMT system that was used to translate the test set, and then aligned this translation with the original French data to infer the corresponding concepts for this new corpus. We trained a new CRF tagger on both the original French training set and the translated one (approach +SCTD, described in Section 4.1). The method of Section 4.2 (SPE) was also evaluated: the post-edited translated test set was sent either to the baseline CRF (+SPE) or to the CRF trained with smeared data (+SCTD +SPE). The performance of these approaches is shown in Table 4. Both methods, training on noisy data and SPE, improve the performance of the semantic tagger, and their serialization gave the best performance.
7. CONCLUSION
6.3. System Combination
We propose to combine the three main approaches (SLU/CRF TestOnSource, SLU/CRF TrainOnTarget and SLU/PB-SMT) in order to benefit from their respective characteristics and enhance performance. The combination (referred to as BASIC in Table 5) is simple: a confusion network is built from the three hypotheses, and the concept sequence corresponding to the highest-probability path is output. The performance is significantly better (-1.3% CER), which shows that the methods are complementary. Finally, we combine all the methods proposed in this paper (SLU/CRF TrainOnTarget, SLU/CRF TestOnSource, +SCTD, +SPE, +SCTD+SPE, SLU/PB-SMT) and obtain the best performance measured so far on this test set (18.2%). To check the influence of the SLU/PB-SMT method on the combination, we also combined all methods except SLU/PB-SMT and evaluated the result. This evaluation shows that the SLU/PB-SMT method, despite giving the worst individual performance, has a great influence on the combination.
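The combination idea can be illustrated with a much-simplified stand-in: position-wise majority voting over concept hypotheses that have already been aligned to the same length. This is only a sketch of the intuition; the actual system builds a confusion network from pairwise alignments of the hypotheses, and the function name is illustrative.

```python
# Sketch (assumptions): majority voting over already length-aligned
# concept hypotheses, as a toy proxy for confusion-network combination.
from collections import Counter

def combine_hypotheses(hyps):
    """hyps: list of equal-length concept sequences from parallel systems.
    Returns the position-wise majority concept sequence."""
    combined = []
    for position in zip(*hyps):
        combined.append(Counter(position).most_common(1)[0][0])
    return combined

hyps = [
    ["command-tache", "nbChambre", "loc-ville"],  # e.g. SLU/CRF TestOnSource
    ["command-tache", "null",      "loc-ville"],  # e.g. SLU/CRF TrainOnTarget
    ["command-tache", "nbChambre", "null"],       # e.g. SLU/PB-SMT
]
print(combine_hypotheses(hyps))
# -> ['command-tache', 'nbChambre', 'loc-ville']
```

Even this toy shows why a weak individual system can still help: each system's isolated errors are outvoted where the other two agree.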
[1] K. Suenderman, J. Liscombe, "From rule-based to statistical grammars: Continuous improvement of large-scale spoken dialog systems," in ICASSP 2009.
[2] K. Suenderman, J. Liscombe, "Localization of speech recognition in spoken dialog systems: How machine translation can make our lives easier," in Interspeech 2009.
[3] C. Servan, N. Camelin, C. Raymond, F. Bechet, R. De Mori, "On the use of machine translation for spoken language understanding portability," in ICASSP 2010.
[4] F. Lefèvre, F. Mairesse, S. Young, "Cross-lingual spoken language understanding from unaligned data using discriminative classification models and machine translation," in Interspeech 2010.
[5] B. Jabaian, L. Besacier, F. Lefèvre, "Investigating multiple approaches for SLU portability to a new language," in Interspeech 2010.
[6] S. Hahn, P. Lehnen, C. Raymond, H. Ney, "A comparison of various methods for concept tagging for spoken language understanding," in LREC 2008.
[7] J. Lafferty, A. McCallum, F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in ICML 2001.
[8] C. Raymond, G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Interspeech 2007.
[9] CRF++ is available online: https://crfpp.sourceforge.net
[10] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL 2007.
[11] M. Simard, C. Goutte, P. Isabelle, "Statistical phrase-based post-editing," in NAACL 2007.
[12] A. Diaz de Ilarraza, G. Labaka, K. Sarasola, "Statistical post-editing: A valuable method in domain adaptation of RBMT systems for less-resourced languages," in MATMT 2008.
[13] H. Bonneau-Maynard, S. Rosset, C. Ayache, A. Kuhn, D. Mostefa, "Semantic annotation of the French MEDIA dialog corpus," in Eurospeech 2005.
[14] A. Stolcke, "SRILM - an extensible language modeling toolkit," in ICSLP 2002.
[15] P. Liang, B. Taskar, D. Klein, "Alignment by agreement," in HLT-NAACL 2006.