Vietnamese Automatic Speech Recognition: the FLaVoR Approach

Quan Vu, Kris Demuynck, Dirk Van Compernolle
K.U.Leuven/ESAT/PSI
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
[email protected]
www.esat.kuleuven.be/psi/spraak/

Abstract. Automatic speech recognition for languages in Southeast Asia, including Chinese, Thai and Vietnamese, typically models both acoustics and language at the syllable level. This paper presents a new approach to recognizing these languages by exploiting information at the word level. The new approach, adapted from our FLaVoR architecture [1], consists of two layers. In the first layer, a pure acoustic-phonemic search generates a dense phoneme network enriched with meta-data. In the second layer, word decoding is performed on the composition of a series of finite state transducers (FSTs), combining various knowledge sources across sub-lexical, word-lexical and word-based language models. Experimental results on the Vietnamese Broadcast News corpus show that our approach is both effective and flexible.

Key words: Automatic speech recognition, finite state transducer, phoneme network, Vietnamese

1 Introduction

Like Chinese [5], Thai [4] and other languages in Southeast Asia, Vietnamese is a tonal, morpho-syllabic language in which each syllable is represented by a unique word unit (WU) and most WUs are also morphemes, except for some foreign words, mainly borrowed from English and French. Note that the term WU as used here has a meaning similar to the term character in Chinese. There are six distinct tones and around seven thousand WUs in the language. Of these 7000, about five thousand WUs are frequently used [2]. The Vietnamese writing system, on the other hand, is completely different from the other writing systems of Southeast Asia, including Chinese. It is based on an extended Latin symbol set, as given in Fig 1. The underlined symbols in Fig 1 are not in the original Latin system. In addition, the last five symbols are tone marks (the sixth tone has no mark). Fig 2 shows some examples of Vietnamese WUs and words; the example sequence contains six WUs and three words. With this writing system, even though WUs are separated by spaces, the problem of word segmentation is not trivial. As with Chinese, words are not well defined: each word is composed of one to several WUs with different meanings.

Fig. 1. Vietnamese written symbols.

Fig. 2. Example of Vietnamese morphemes and words.

For the automatic speech recognition problem, most systems for Chinese [3], Thai [4] or Vietnamese [2] share a similar approach in both acoustic modeling (AM) and language modeling (LM). Specifically, acoustic modeling is typically based on the decomposition of a syllable into initial and final parts, while the language model is trained on WUs or words. As reported in [6], the performance of a system using a word-based LM is better than that of one using a WU-based LM. However, with the word-based LM approach the search network is much bigger, as the vocabulary size increases considerably. More precisely, in those systems the standard speech recognition architecture brings in all available knowledge sources very early in the process. In fact, an all-in-one search strategy is adopted which completely integrates the acoustic model with the word-based LM. Consequently, decoding becomes more expensive in terms of memory and time when a long-span word-based LM is exploited.

In this paper, we present a new approach for recognizing the languages mentioned above. Our approach is based on the FLaVoR architecture and exploits the compositional property of FSTs [7]. The approach consists of two steps. First, a pure acoustic-phonetic search generates a dense phoneme graph, or phoneme FST, enriched with meta-data. Then, the output of the first step is composed with a series of FSTs, including sub-lexical, word-lexical and word-based LM FSTs, from which a usual word decoding is carried out. The word-based LM is trained using the word segmentation procedure of [6].

The paper is structured as follows. First, we briefly describe the FLaVoR architecture in Section 2. Section 3 presents our approach in detail. In Section 4, we report experimental results on the Vietnamese Broadcast News corpus and compare them to our previous work. Finally, some conclusions and remarks are given in Section 5.

2 The FLaVoR Approach

2.1 The Architecture

As depicted in Fig 3, the key aspect of the FLaVoR architecture consists of splitting up the search engine into two separate layers. The first layer performs phoneme recognition and outputs a dense phoneme network, which acts as an interface to the second layer. In this second layer, the actual word decoding is accomplished by means of sophisticated probabilistic morpho-phonological and morpho-syntactic models.

Fig. 3. FLaVoR architecture.

Specifically, in the first layer, a search algorithm determines the network of most probable phoneme strings F given the acoustic features X of the incoming signal. The knowledge sources employed are an acoustic model p(X|F) and a phoneme transition model p(F). The isolation of the low-level acoustic-phonemic search makes the first layer generic for a full natural language; that is, the phoneme recognizer can function in any knowledge domain for a specific language. In the word decoding stage, the search algorithm has two knowledge sources at its disposal: a morpho-phonological and a morpho-syntactic component. The morpho-phonological component converts the phoneme network into sequences of morphemes and hypothesizes word boundaries. Its knowledge sources include a morpheme lexicon, constraints on morpheme sequences, and pronunciation rules. The morpho-syntactic language model provides a probability measure for each hypothesized word based on morphological and syntactic information about the word and its context. In this work, only part of the FLaVoR architecture was exploited. Specifically, the probabilistic left-corner grammar (PLCG), shallow parsing and associated search components shown in Fig 3 were skipped. Instead, a full composition of transducers, including the phoneme network, sub-lexicon and word-based LM, is performed.

2.2 Finite State Transducers

It is important to note that all the knowledge sources mentioned above can be represented as FSTs, and that all these FSTs can be combined into one transducer. Transducers are a compact and efficient means of representing the knowledge sources and are a good match for the decoding process. A weighted finite-state transducer (Q, Σ ∪ {ε}, Ω ∪ {ε}, K, E, i, F, λ, ρ) is a structure with a set of states Q, an alphabet of input symbols Σ, an alphabet of output symbols Ω, a weight semiring K, a set of arcs E, a single initial state i with weight λ, and a set of final states F weighted by the function ρ : F → K. A weighted finite-state acceptor is a weighted finite-state transducer without the output alphabet. Composition is defined as follows: let T1 : Σ* × Ω* → K and T2 : Ω* × Γ* → K be two transducers defined over the same semiring K. Their composition T1 ◦ T2 realizes the function T : Σ* × Γ* → K.

Three different FST toolkits were used in our experiments. The AT&T FSM Toolkit [7] is written in C, while both the RWTH FSA [8] and MIT FST [9] toolkits are written in C++, exploiting the STL (the C++ Standard Template Library). In addition, the AT&T FSM Toolkit requires a specific license agreement in order to use its source code, while both RWTH FSA and MIT FST are provided as open source. The organization of AT&T FSM and MIT FST is similar, in that each algorithm or operation corresponds to an executable file. In contrast, RWTH FSA combines all the algorithms and operations into a single file. Moreover, RWTH FSA supports labels with Unicode encoding, so it can work directly with languages such as Chinese and Japanese. In the following section, we describe how the FLaVoR architecture is applied to Vietnamese automatic speech recognition.
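Since every knowledge source is an FST over a common semiring, the decoding network can be built by composition. The following is a minimal, illustrative sketch of weighted composition over the tropical semiring (weights are negative log probabilities and add along a path). It is not the API of any of the toolkits above, and it omits the epsilon-filter machinery that a full implementation needs; the arc representation is an assumption made for this sketch.

```python
# A toy FST is a dict mapping a state to a list of arcs
# (input_symbol, output_symbol, weight, next_state).
def compose(t1, t2, start1=0, start2=0):
    """Compose two FSTs: arcs pair up when t1's output matches t2's input.
    States of the result are (state_of_t1, state_of_t2) pairs."""
    result = {}
    stack = [(start1, start2)]
    seen = {(start1, start2)}
    while stack:
        s1, s2 = stack.pop()
        arcs = []
        for (i1, o1, w1, n1) in t1.get(s1, []):
            for (i2, o2, w2, n2) in t2.get(s2, []):
                if o1 == i2:  # symbols match: chain the two mappings
                    arcs.append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
        result[(s1, s2)] = arcs
    return result
```

For example, composing a transducer that maps "a" to "A" with one that maps "A" to "x" yields a transducer mapping "a" to "x" whose arc weight is the sum of the two component weights.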

3 The Proposed Approach

As suggested in Fig 4, our approach includes the following steps.

1. A phoneme transducer F is generated from the corresponding acoustic models. F contains a set of best matching phonemes with their optimal start and end times.
2. F is composed with M, where M represents WU pronunciations, mapping phone sequences to WU sequences according to a pronouncing dictionary.
3. The resulting FST of step 2 is composed with W, where W represents word segmentations, mapping WU sequences to word sequences according to a predefined lexicon.
4. Finally, the resulting FST of step 3 is composed with the word-based LM G to produce the final FST, FT. Viterbi decoding is used to find the best path (hypothesis) through this final FST.
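The decoding in step 4 can be sketched as plain dynamic programming: once the final FST is composed, the best hypothesis is the path with the lowest total weight (negative log probability) from the initial state to a final state. For an acyclic composed network this is a shortest-path pass over a topological order. The graph layout and word symbols below are invented for illustration, not taken from the paper's example.

```python
def best_path(arcs, start, finals, topo_order):
    """arcs: state -> list of (output_word, weight, next_state).
    Returns the cheapest word sequence and its total cost."""
    INF = float("inf")
    cost = {s: INF for s in topo_order}
    back = {}
    cost[start] = 0.0
    for s in topo_order:                      # relax arcs in topological order
        if cost[s] == INF:
            continue
        for word, w, n in arcs.get(s, []):
            if cost[s] + w < cost[n]:
                cost[n] = cost[s] + w
                back[n] = (s, word)
    end = min(finals, key=lambda s: cost[s])  # cheapest reachable final state
    words, s = [], end
    while s != start:                         # trace the best path backwards
        s, word = back[s]
        words.append(word)
    return list(reversed(words)), cost[end]
```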

Fig. 4. The proposed approach.

Thus, each path in the composition F ◦ M ◦ W ◦ G pairs a sequence of phones with a word sequence, assigning it a weight corresponding to the likelihood that the word sequence is pronounced. The composition F ◦ M ◦ W ◦ G can therefore serve as the modeling network for standard Viterbi decoding in the usual way. It is important to note that the proposed approach may not lead to a real-time system, as it requires the composition and optimization of a series of FSTs.

Consider the abstract example illustrated in Fig 5. In this example, there are four WUs, namely A, B, C, D, which map respectively to the four pronunciations ab, ac, eb, ec, as shown in Fig 5-b (the symbol eps represents the empty transition). Furthermore, there are three words, AB, CD, D, represented in the WU-to-word dictionary (Fig 5-c). The word-based LM is simply a bigram with a transition from AB to CD, as in Fig 5-d. Finally, the phone transducer includes a path for the phone sequence (a, b, a, c, e, b, e, c), as given in Fig 5-a. By composing these transducers according to the procedure described above, we obtain the final transducer depicted in Fig 5-e.

4 Experimental Results

In this section we present the experimental results of our approach on the Vietnamese Broadcast News corpus (VNBN). The results include the phone error rate, word-based LM perplexity, word error rate and FST sizes.

Fig. 5. An abstract example illustrating the approach.

4.1 Training Corpus and Test Data

Acoustic Training and Test Data. We used the VNBN for training and testing [2]. The acoustic training data was collected from July to August 2005 from VOV, the national radio broadcaster (mostly in the Hanoi and Saigon dialects), and comprises a total of 20 hours. The recordings were manually transcribed and segmented into sentences, resulting in a total of 19496 sentences and a vocabulary of 3174 WUs. The corpus was further divided into training and test sets, as shown in Table 1. The speech was sampled at 16 kHz with 16-bit resolution and parameterized into 12-dimensional MFCCs plus energy, together with their delta and acceleration coefficients (39 front-end parameters in total).

Table 1. Data for training and testing.

             Training               Testing
Dialect      Length (h)  #Sent.     Length (h)  #Sent.
Hanoi        18.0        17502      1.0         1021
Saigon        2.0         1994      -           -
Total        20.0        19496      1.0         1021
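The 39-dimensional front end is assembled from the 13 static coefficients (12 MFCCs + energy) with the standard delta regression formula, then the same formula applied again for the accelerations. A sketch with numpy; the window half-width N=2 is an assumed common default, not a value stated in the paper.

```python
import numpy as np

def add_deltas(static, N=2):
    """static: (frames, 13) array -> (frames, 39) with deltas and accelerations."""
    def delta(x):
        # pad by repeating edge frames so every frame has a full window
        padded = np.pad(x, ((N, N), (0, 0)), mode="edge")
        denom = 2 * sum(n * n for n in range(1, N + 1))
        # d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)
        return sum(n * (padded[N + n:len(x) + N + n] -
                        padded[N - n:len(x) + N - n])
                   for n in range(1, N + 1)) / denom
    d = delta(static)
    return np.hstack([static, d, delta(d)])
```

A constant signal yields zero deltas, and a linear ramp yields a delta of 1.0 per frame away from the padded edges, which is a quick sanity check on the regression weights.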

Language Model Training Data. The language model training data comes from newspaper text sources. In particular, a 100M-WU collection from the nationwide newspaper VOV was employed, covering all issues between 1998 and 2005 [2]. Numeric expressions and abbreviated words occurring in the texts were replaced by suitable labels. In addition, the transcriptions of the acoustic training data were also added.

4.2 Acoustic Models

Fig. 6. Initial-Final units for Vietnamese.

As depicted in Fig 6, we follow the usual approach for Chinese acoustic modeling [3], in which each syllable is decomposed into initial and final parts. While most Vietnamese syllables consist of an initial and a final part, some have only the final. The initial part always corresponds to a consonant. The final part comprises the main sound plus tone and an optional ending sound. This decomposition results in a total of 44 phones, as shown in Fig 6. An interesting point of our decomposition scheme relates to the tone of a syllable: we treat the tone as a distinct phoneme, placed immediately after the main sound. With this approach, context-dependent models can be built straightforwardly. Fig 7 illustrates the process of making triphones from a syllable.
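The decomposition and triphone construction of Figs 6 and 7 can be sketched as follows. The syllable fields (initial, main sound, tone, ending) and the `sil` context padding are illustrative assumptions for this sketch, not the paper's actual phone inventory.

```python
def syllable_to_phones(initial, main, tone, ending=None):
    """Flatten one syllable into a phone sequence; the tone is a
    distinct phoneme inserted right after the main sound."""
    phones = []
    if initial:          # some Vietnamese syllables have no initial part
        phones.append(initial)
    phones.append(main)
    phones.append(tone)  # tone follows the main sound as its own phoneme
    if ending:
        phones.append(ending)
    return phones

def to_triphones(phones):
    """Expand a phone sequence into left-context/right-context triphones,
    padding the utterance boundaries with silence."""
    lefts = ["sil"] + phones[:-1]
    rights = phones[1:] + ["sil"]
    return [f"{l}-{c}+{r}" for l, c, r in zip(lefts, phones, rights)]
```

Because the tone is just another phoneme in the sequence, the standard triphone machinery (and the tree-based clustering below) applies unchanged.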

Fig. 7. Construction of triphones. We use a tree-based state tying technique in which a set of 35 left and 17 right questions was designed, based on the Vietnamese linguistic knowledge. Initially, all of the states to be clustered are placed in the root node of the tree and the log likelihood of the training data calculated on the assumption that all of the states in that node are tied. This node is then split into two by nding the question which partitions the states in the parent node so as to give the maximum increase in log likelihood. The process is repeated until the likelihood increase is smaller

than a predened threshold. Fig 8 shows the split process of the decision tree for the main sound a ˘.
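The greedy split step above can be sketched as follows. The scalar samples and the single-Gaussian likelihood are simplifying assumptions made for this sketch; real systems split on accumulated HMM state occupation statistics rather than raw samples.

```python
import math

def node_loglik(samples):
    """Log likelihood of samples under one shared ML-fitted Gaussian,
    i.e. the likelihood if all contributing states were tied."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def split_node(states, questions, threshold):
    """states: name -> samples; questions: name -> predicate on state name.
    Returns (question, likelihood gain) for the best split, or None if no
    split improves the likelihood by more than the threshold."""
    pooled = [x for s in states.values() for x in s]
    best = None
    for qname, pred in questions.items():
        yes = [x for name, s in states.items() if pred(name) for x in s]
        no = [x for name, s in states.items() if not pred(name) for x in s]
        if not yes or not no:
            continue  # question does not partition this node
        gain = node_loglik(yes) + node_loglik(no) - node_loglik(pooled)
        if best is None or gain > best[1]:
            best = (qname, gain)
    if best and best[1] > threshold:
        return best  # in a full build, recurse on the two child nodes
    return None
```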

Fig. 8. Decision tree based state tying.

4.3 Language Model

Both the trigram WU-based LM and the word-based LM were trained on the text corpus described above, using the SRI LM toolkit [10] with Kneser-Ney smoothing. For the WU-based LM, a lexicon of the 5K most frequent WUs was used. This lexicon gives a 1.8% OOV rate on the newspaper corpus and about 1.0% on the VNBN. The process of building the word-based LM consists of two steps. In the first step, the WU sentences were segmented into words using the maximum match method. A named-entity list was also added to the original wordlist at this step to improve segmentation quality. After word segmentation, we chose the vocabulary to be the top-N most frequent words; the commonly used WUs (5K) were then added to the vocabulary as well. In particular, a lexicon consisting of 40K words and 5K WUs was selected. Table 2 reports the perplexities of both LMs on the same test set, containing 580 sentences randomly selected from VOV issues of 2006.

Table 2. WU-based and word-based perplexities.

                 bigram   trigram
WU-based LM      188.6    136.2
Word-based LM    384.5    321.4
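The maximum match segmentation used in the first step of word-based LM training can be sketched as a greedy longest-match scan from left to right: at each position, take the longest dictionary word starting there, falling back to a single WU. The toy lexicon below is invented, standing in for the actual 40K-word list.

```python
def max_match(wus, lexicon, max_len=4):
    """Greedily segment a WU sequence into words using longest match.
    wus: list of WU strings; lexicon: set of WU tuples forming words."""
    words, i = [], 0
    while i < len(wus):
        # try the longest candidate first, shrinking down to a single WU
        for n in range(min(max_len, len(wus) - i), 0, -1):
            cand = tuple(wus[i:i + n])
            if n == 1 or cand in lexicon:  # a lone WU is always a valid word
                words.append(" ".join(cand))
                i += n
                break
    return words
```

Note that greedy longest match can mis-segment ambiguous sequences, which is why the named-entity list mentioned above helps segmentation quality.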

Fig. 9. Phone recognition results: phone error rate in the phone network versus phone network events (arcs/frame).

4.4 Results

Phone Recognition Results. As mentioned in [1], the key prerequisite for the proposed approach is the generation of high quality phoneme lattices in the acoustic-phonemic layer. The quality of a phoneme lattice is defined by both a low phoneme error rate and a low event rate (density). The phoneme decoding was based on the architecture described in [1], in which the decoder extends the word-pair approximation to the phoneme level in order to assure a high quality output. Hence, to obtain high quality phoneme lattices, a phoneme transition model (an N-gram between phonemes) of a sufficiently high order N has to be used, or the LM context has to be artificially expanded to (M-1) phonemes. The acoustic models used are context-dependent (CD) tied-state phoneme models (2410 states) with mixtures of tied Gaussians (48216 Gaussians) as observation density functions. Fig 9 shows the results of phoneme recognition experiments with N=3 and M=3 and with different lattice densities. The values on the ordinate are the phoneme error rates (ins. + del. + sub.) of the phoneme lattice, i.e. the error rate of the path that best matches the reference transcription. The values on the abscissa are the average number of events (an arc representing a CD phoneme) per frame in the lattice. The phone graph that results in a phone error rate of 8% was selected as input to the second layer of our approach. As shown in Fig 9, this rate is already reached at an event rate of less than 4.

Size of Transducers. Table 3 reports the sizes of the optimized transducers, in terms of transitions. They include:

- the 5162-WU pronunciation dictionary M;
- the 39882-word lexicon W;
- the trigram word-based LM G, built as described in the previous subsection;
- the phone transducer F (the reported number is the average number of transitions over the sentences in the test set);
- the final composed transducer FT, also averaged over the test sentences.

Table 3. Size (number of arcs) of the transducers mentioned in Fig 4.

              F      M      W      G        FT
FST (MIT)     4681   18741  52131  3491884  58918
FSA (Aachen)  6112   21635  58472  3689128  66121
FSM (AT&T)    4868   20196  54618  3512841  64767

Two main observations can be drawn from the experiment. Firstly, the optimized transducers have acceptable sizes, even with a trigram word-based LM. Secondly, the MIT FST Toolkit performed best in the optimization, though the differences are not significant. Although computing times are not reported here, our experiments showed that the AT&T FSM Toolkit is the fastest.

Word Error Rate. Finally, we report the WU error rate of the new approach and compare the results with the previous work [2] (the baseline). The acoustic model in [2] is identical to the one described in this paper. The LMs, however, differ: in the previous work the LM was trained at the WU level, while in this work it was trained at the word level. Moreover, both experiments used the same training and test sets, given in Table 1.

Table 4. The WU error rate for the two approaches.

                  bigram   trigram
Baseline          20.8     19.1
The new approach  19.0     17.8

As shown in Table 4, the new approach yields significant improvements over the previous results. It is also observed that the WU error rate with the trigram WU-based LM is roughly comparable to that obtained with a bigram word-based LM.

5 Conclusion

We presented a new approach for recognizing the languages of Southeast Asia in which word boundaries are not clearly marked. The main advantage of our approach is that it allows for an easier integration of different knowledge sources. In this paper, we used FSTs as a tool for combining the phoneme network with the word-based LM to demonstrate the idea. Experimental results on the VNBN showed that our approach is both robust and flexible.

6 Acknowledgments

The research reported in this paper was funded by IWT in the GBOU programme, project FLaVoR (Project number 020192). www.esat.kuleuven.be/psi/spraak/projects/FLaVoR

References

1. Kris Demuynck, Tom Laureys, Dirk Van Compernolle and Hugo Van hamme, "FLaVoR: a Flexible Architecture for LVCSR", Eurospeech 2003, Geneva, Switzerland, pp. 1973-1976, 2003.
2. Ha Nguyen, Quan Vu, "Selection of Basic Units for Vietnamese Large Vocabulary Continuous Speech Recognition", The 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future, HoChiMinh City, Vietnam, pp. 320-326, 2006.
3. J. Y. Zhang et al., "Improved Context-Dependent Acoustic Modeling for Continuous Chinese Speech Recognition", Eurospeech 2001, Aalborg, Denmark, pp. 1617-1620, 2001.
4. Sinaporn Suebvisai et al., "Thai Automatic Speech Recognition", ICASSP 2005, Philadelphia, PA, USA, pp. 857-860, 2005.
5. Y. Liu and P. Fung, "Modeling partial pronunciation variations for spontaneous Mandarin speech recognition", Computer Speech and Language, 17, pp. 357-379, 2003.
6. B. Xiang et al., "The BBN Mandarin Broadcast News Transcription System", InterSpeech 2005, Lisbon, Portugal, pp. 1649-1652, 2005.
7. Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech and Language, 16(1), pp. 69-88, 2002.
8. S. Kanthak and H. Ney, "FSA: An Efficient and Flexible C++ Toolkit for Finite State Automata Using On-Demand Computation", ACL 2004, Barcelona, Spain, pp. 510-517, 2004.
9. Lee Hetherington, "MIT Finite State Transducer Toolkit", 2005, http://people.csail.mit.edu/ilh//fst/
10. Andreas Stolcke, "SRILM - An Extensible Language Modeling Toolkit", Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, pp. 901-904, 2002.
