the use of lexica in automatic speech recognition - limsi



MARTINE ADDA-DECKER AND LORI LAMEL

THE USE OF LEXICA IN AUTOMATIC SPEECH RECOGNITION

1. INTRODUCTION

The lexicon plays a pivotal role in automatic speech recognition, being the link between the acoustic-level representation and the word sequence output by the speech recognizer. The role of the lexicon can be considered as twofold: first, the lexicon specifies what words or lexical items are known by the system; second, the lexicon provides the means to build acoustic models for each entry. Lexical design thus entails two main parts: definition and selection of the vocabulary items, and representation of each pronunciation entry using the basic acoustic units of the recognizer. For large vocabulary speech recognition, the vocabulary is usually selected to maximize lexical coverage for a given size lexicon, and the elementary units of choice are usually phonemes or phone-like units. In this chapter we address the two main aspects of lexicon development. We consider the development and use of lexica in large vocabulary speech recognition and in spoken language systems, dealing with different types of speech (from read to spontaneous) and with different languages. Language-dependent design considerations are mentioned for the different European languages for which we have had experience in lexical design for speech recognition.

2. GENERAL OVERVIEW

Speech recognition [DixonMartin79] deals with automatically generating a sequence of elementary units, usually words, given an acoustic speech signal. In order to produce such word sequences the recognition system requires lexical knowledge. Designing lexica for automatic speech recognition is hence a key aspect in system development. As speech recognizers are mainly based on acoustic phoneme or phone-like^1 (subword) units, a

^1 Here we distinguish between phonemes, the elementary sounds in a given language such that there exist two words differing by only a single phoneme, and phones, which correspond to different contextual variants (or allophones) of phonemes. For example, a flap (a realization of an alveolar stop or nasal) is a phone, but not a phoneme in English.

© 1998 Kluwer Academic Publishers. Printed in The Netherlands.

lex.tex; 8/09/1998; 15:48; p.1

lexicon is viewed in this context as a pronunciation lexicon, providing for each lexical item $w_i$ one or more phonemic transcriptions $\varphi(w_i)$. Lexicon development has two main aspects: representing basic units of written language, which can be considered as the output of the recognizer; and describing the spoken form of language, which is the input to the recognizer. During development, attention must be given to both parts in order to achieve the best performing system.

Definition and selection of the basic units of written language (the word list) depends on the type of application, but lexical coverage is always a major concern. For large vocabulary speech recognition (LVSR), such as dictation or broadcast news transcription tasks, the word list is generally obtained from training text corpora (e.g. newspapers, task-specific text documents) and, when available, transcriptions of speech data. For spoken language systems for information retrieval tasks, the word list is usually determined by the word occurrences in typical queries (often bootstrapping with a Wizard of Oz collection^2 or with typed queries), completed by task- and domain-specific knowledge. For conversational speech recognition the only available text data are speech corpora transcribed for this purpose.

Associated with each lexical entry are one or more pronunciations. These pronunciations may be taken from existing pronunciation dictionaries, created manually, or generated by automatic grapheme-phoneme conversion software. The word list, obviously related to lexical coverage, is also related to language model accuracy. The accuracy of the acoustic models is linked to the consistency of the pronunciations associated with each lexical entry.

The recognition vocabulary or word list^3 generally consists of a simple list of lexical items as observed in running text.^4 Depending on the form of the training texts, different word lists can be derived. Various text forms can be generated by applying different normalization steps to the text corpora (see Section 5.1).

^2 In a Wizard of Oz setup a human is used to replace system components that have not yet been developed, such as a speech recognizer. In this case, the human types a transcription (or paraphrase) of what was said for use in further processing stages.
^3 It can also be a list of root forms (stems) augmented by derivation, declension, and/or composition rules. The latter approach is more powerful in terms of language coverage, which is a desirable quality for recognizer development, but more difficult to integrate in present state-of-the-art recognizer technology.
^4 For simplicity lexical items can be considered to be character strings separated by spaces. Lexical item definition is discussed further in Section 5.1. In some languages such as Japanese or Chinese, the notion of a word is less evident, and other representations may be more appropriate.

Example texts:
    Mr. Green has joined the Greens. Mr. Green's hat is green. He likes eating greens and green's his favorite color.

Normalized texts (no case distinction):
    it is green with a green hat on the green. (Mr. Green is playing golf and wearing a green hat)
    green's wife likes greens. (from her kitchen garden)
    he likes greens. (in politics)

Word lists:
    wl1: green Green Green's green's greens Greens
    wl2: green green's greens
    wl3: green greens

Figure 1. Example texts (top) and sample word lists wl1, wl2 and wl3 illustrate the impact of text normalizations on lexicon size and coverage.
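The effect shown in Figure 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenization pattern and the helper names `word_list` and `green_forms` are assumptions made for illustration, not the processing used to build the lexica described in this chapter.

```python
import re

# Example text taken from the top part of Figure 1.
TEXT = (
    "Mr. Green has joined the Greens. Mr. Green's hat is green. "
    "He likes eating greens and green's his favorite color."
)

def word_list(text, ignore_case=False, split_apostrophe=False):
    """Return the sorted set of lexical items found in `text`."""
    if ignore_case:
        text = text.lower()
    if split_apostrophe:
        # Treat the apostrophe as a word boundary mark.
        text = text.replace("'", " ")
    # Simplifying assumption: a lexical item is a run of letters/apostrophes.
    return sorted(set(re.findall(r"[A-Za-z']+", text)))

def green_forms(words):
    """Keep only the entries that are forms of the word 'green'."""
    return [w for w in words if w.lower().startswith("green")]

wl1 = green_forms(word_list(TEXT))
wl2 = green_forms(word_list(TEXT, ignore_case=True))
wl3 = green_forms(word_list(TEXT, ignore_case=True, split_apostrophe=True))
print(len(wl1), len(wl2), len(wl3))
```

Each additional normalization shrinks the set of distinct forms, at the cost of the case and genitive information that the finer-grained lists preserve.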

Figure 1 provides example texts (top) and sample partial word lists obtained using different normalizations. In wl1 there are 6 case-sensitive entries corresponding to different forms of the word green. Without upper/lower case distinction (middle) there are 3 entries for green, as shown in wl2. This form entails some loss in syntactic and semantic information. If during text processing the apostrophe (or single quote) is considered as a word boundary mark, the sample word list is reduced to the 2 forms given in wl3. In English, the apostrophe is mainly used to build the genitive form of nouns, but it can also be found in contracted forms, like won't, you'd, he'll, she's, we've or in proper names (d'angelo, o'keefe), and has a limited impact on lexical coverage. For example, in a lexicon containing 100,000 entries, only 4% of the words contain an apostrophe. In contrast, the apostrophe is quite frequent in French, occurring in word sequences like l'ami, j'ai, c'est. If frequently occurring forms containing the apostrophe are included as separate lexical entries, the number of lexical entries increases significantly (by over 20%). Therefore the characteristics of the language must be taken into account in determining an appropriate text normalization.

The other part of lexicon development concerns the creation of pronunciations for each lexical item of the word list. A preliminary step consists in the choice of elementary units to be used for the given language. The most common units are phonemes or phones. In the former case only standard pronunciations are given without explicit representation of allophonic variants, even common ones. A reason for choosing

such a representation is that most allophonic variants can be predicted by rules, and that their use is optional. More importantly, there is often a continuum between different allophones of a given phoneme and the decision as to which occurred in any given utterance is subjective. By using a phonemic representation, no hard decision is imposed, and it is left to the acoustic models to implicitly represent the observed variants in the training data.^5 The lexicon is then represented with standard pronunciations using the language-dependent phone sets. Some automatic approaches to pronunciation development have been explored where pronunciations are learned from the training corpus. While such approaches are easily applied to fixed vocabulary tasks, they are difficult to extend, as in general spoken samples of each lexical item are required.

3. ROLE OF THE LEXICON IN A SPEECH RECOGNIZER

An overview of a generic speech recognition system is given in Figure 2. The lexicon plays a role in training the recognizer and during speech-to-text conversion or recognition. The recognizer's main components are the acoustic and language models, the lexicon, and the search engine. Before describing in more detail the lexicon development process and its link with the other main components of the recognizer, we briefly review the statistical approach commonly used for LVSR [Baker75], [Jelinek76], [YoungBloothooft97].

For Hidden Markov Model (HMM) based systems [RabinerJuang86], acoustic modeling consists of estimating probability density functions of a sequence of acoustic feature vectors. Acoustic features are chosen so as to reduce model complexity while trying to keep the relevant information (i.e. the linguistic information for the speech recognition problem). Sequences of acoustic feature vectors are computed from overlapping portions (usually on the order of 30 ms) of the acoustic signal so as to capture its time-varying nature. The acoustic feature vectors are typically computed using short-time cepstral features based either on a Fourier transform or on linear prediction techniques [RabinerSchaefer78]. The Fourier transform is a general method to analyze the frequency content of a signal. Linear predictive analysis is based on the assumption that the observed signal is produced by an acoustical excitation of the vocal tract, which is modeled by a time-varying linear filter. Acoustic units generally correspond to subword units which, when compared with word

^5 A publicly available lexicon represented with a limited number of allophones to match the phonetic transcriptions is distributed with the TIMIT corpus [Garofoloetal93]. The allophones include: released and unreleased plosives, flap allophones of /d/, /t/ and /n/, as well as fronted allophones of /u/ and //.

[Figure 2 (block diagram). Training blocks: training texts; training speech + transcripts; text normalization; word list/transcripts and word list/training; phone set; pronunciation generation; training lexicon; language model training; acoustic model training; recognizer lexicon, acoustic models and language models. Decoding block: speech is decoded into words using the recognizer lexicon, the language model and the acoustic models.]

Figure 2. Generic speech recognizer overview. The training process focuses on lexicon development and its interrelationship with acoustic and language modeling.

models, reduce the number of parameters, enable cross-word modeling and facilitate porting to new vocabularies.

Language models are used to model regularities in natural language, and can therefore be used in speech recognition to predict probable word sequences during decoding. The most popular methods, such as statistical n-gram models, attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. A backoff mechanism [Katz87] is generally used to get a better estimate of the probabilities of infrequent n-grams by relying on lower order n-grams to model rare and unobserved word sequences. Another advantage of the

backoff mechanism is that LM size can be controlled by relying more or less on the backoff component. This effect is obtained by simply varying the minimum number of n-gram observations required to include the n-gram in the model. Language models are typically compared by measuring the perplexity^6 of the models on a set of development texts not included in the training material.

One of the most important problems in implementing the decoder of a large vocabulary speech recognizer is the design of an efficient search algorithm to deal with the huge search space [YoungBloothooft97]. Even for research purposes where real-time recognition is not needed, there is a limit on computing resources (memory and CPU time) above which the development process becomes too costly. The most commonly used approach for small and medium vocabulary sizes is a one-pass frame-synchronous dynamic programming procedure. This basic strategy has been extended to deal with large vocabularies by adding features such as fast match, N-best rescoring, progressive search and one-pass dynamic network decoding. Many LVSR systems use multiple-pass decoders to reduce the computational resources needed for evaluation runs. In this case, information is transmitted between passes by means of N-best lists, word graphs or word lattices.^7

In this statistical framework the speech recognizer has to determine the most probable word sequence $\hat{w}_1^N$ given the acoustic signal $x_1^T$:

$$\hat{w}_1^N = \arg\max_{\{w_1^n\}} \Pr(w_1^n)\,\Pr(x_1^T \mid w_1^n)$$

where $w_1^n$ is a sequence of $n$ words each in the lexicon, $n$ being a positive integer. $\Pr(w_1^n)$ is to be provided by a language model, and $\Pr(x_1^T \mid w_1^n)$ by an acoustic model. The recognition decision is taken as a joint optimization of two terms: $\Pr(w)$, the a priori probability of a word or a word sequence as given by the language model, and $\Pr(x \mid w)$, the

conditional probability of the signal corresponding to the word sequence, given by the acoustic model.

^6 The corpus perplexity is given by $[\Pr(w_1 \ldots w_N)]^{-1/N}$, where $\Pr(w_1 \ldots w_N)$ is the probability that the language model produces the word sequence $w_1 \ldots w_N$.
^7 Word lattices are graphs where nodes correspond to times and where arcs representing word hypotheses have associated acoustic and language model scores.

This formulation shows the relationship of the lexicon $\{w_i, \varphi(w_i)\}$ with:

• the recognizer's output $\hat{w}_1^N$. The output $\hat{w}_1^N$ is a sequence of items from the lexicon $\{w_i\}$. Pronounced items which are not in the lexicon (referred to as out-of-vocabulary words or OOVs) are necessarily missing in the recognizer's output, and thus misrecognized. Hence the motivation for maximizing lexical coverage by appropriate definition and selection of the lexical items during training. These issues are addressed further in Section 5 on text normalization and word list selection, where comparisons are made across different languages.

• the language model $\Pr(w)$. The lexical unit, $w_i$, can be considered the basic observation for statistical language models. Given a fixed amount of training data, less reliable language models (LMs) are usually obtained for highly inflected languages (with large lexical variety) than for less inflected languages. Text normalization aimed at reducing lexical variety will hence influence the LM efficiency.

• the acoustic model $\Pr(x \mid w)$. Acoustic modeling most commonly makes use of context-dependent phone units.^8 $\Pr(x \mid w)$ is obtained via a pronunciation lexicon, where each word $w_i$ is described as a sequence of the appropriate phones:

  $$\varphi(w_i) = \varphi_{i1}\,\varphi_{i2} \ldots \varphi_{im}$$
  $$\Pr(x \mid w_i) \approx \Pr(x \mid \varphi(w_i)) = \Pr(x \mid \varphi_{i1}\,\varphi_{i2} \ldots \varphi_{im})$$

  Consistent use of the different phone symbols in the lexicon is probably the most important requirement in pronunciation generation.

• the decoder $\arg\max_{\{w_1^n\}}$. The search space to be explored by the decoder is related to the lexicon size and the language model (LM) complexity. For a bigram LM the search space is proportional to the lexicon size. Pronunciation variants introduce additional entries in the search space. The computational requirements can be controlled by limiting lexicon size and pronunciation variants. Optimizing the lexical coverage for a fixed vocabulary size is thus an important aspect in lexicon development for LVSR systems.
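The decision rule and the role of the lexicon in it can be illustrated with a toy example. The sketch below is not a real decoder: the lexicon, the phone symbols, the hypothesis list and all probabilities are invented for illustration, and the exhaustive loop stands in for the search procedures discussed above. It only shows that the score is the sum of the log language model and log acoustic model terms, and that a hypothesis containing an out-of-vocabulary item can never be output, however well it matches acoustically.

```python
import math

# Hypothetical toy lexicon: word -> list of pronunciations (phone strings).
LEXICON = {
    "the": ["dh ax"],
    "green": ["g r iy n"],
    "grin": ["g r ih n"],
}

def decode(hypotheses, lm_logprob, am_logprob, lexicon):
    """Return the in-lexicon hypothesis maximizing log Pr(w) + log Pr(x|w)."""
    best, best_score = None, -math.inf
    for words in hypotheses:
        if any(w not in lexicon for w in words):
            continue  # OOV items can never appear in the recognizer output
        score = lm_logprob[words] + am_logprob[words]
        if score > best_score:
            best, best_score = words, score
    return best

# Invented scores: Pr(w) from a language model, Pr(x|w) from an acoustic model.
LM = {
    ("the", "green"): math.log(0.6),
    ("the", "grin"): math.log(0.1),
    ("the", "greene"): math.log(0.3),
}
AM = {
    ("the", "green"): math.log(0.2),
    ("the", "grin"): math.log(0.3),
    ("the", "greene"): math.log(0.9),  # best acoustic match, but OOV
}

result = decode(list(LM), LM, AM, LEXICON)
print(result)
```

In this toy setting the hypothesis with the proper name scores highest, but because the name is not in the lexicon the decoder returns the best in-vocabulary sequence instead, which is the misrecognition mechanism described in the first item of the list.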
For read newspaper and broadcast news tasks, increasing the lexicon size up to 65k words has improved performance, despite the potential for increased confusability of the lexical entries, provided enough texts are available for LM training [ARPA'94].^9 Minimizing the OOV rate contributes to

^8 In some real-time systems context-independent phone units may be used in order to reduce the computation time and search space. Such considerations may affect the choice of elementary lexical units.
^9 At present, a technical limit of 65k words is generally imposed in order to index each lexical entry using a single 16-bit integer.

minimizing the global error rate of a recognition system. While different sites have reported different estimates of the number of recognition errors due to each OOV, on average the contribution is between 1.6 and 2.2 errors per OOV. For example, in the LIMSI system, going from 20k words to 65k words recovers on average 1.2 times as many errors as OOV words removed.

The lexicon may be used at different levels of decoding: the most common usage is to integrate the lexical knowledge directly during acoustic decoding in combination with appropriate language and acoustic models. The use of a lexicon may be delayed until after a first acoustic-phonetic decoding pass, as in the case of phone recognition. Such an approach also presents the advantage of overcoming the technical limit of 65k words presently observed by large vocabulary speech recognizers [Caratyetal97].

For domain-specific recognizers (e.g. the recognition module in spoken dialog systems), the lexicon size is generally on the order of 2k-3k words and the tradeoff between lexicon size and lexical coverage is less crucial. Typically task-specific knowledge is used to form an initial lexicon, which is then completed with task-specific data. Here the problem lies in predicting the lexical items and phone transcriptions that occur in naturally spoken queries.

4. MULTILINGUAL ASPECTS

In this section we study some of the linguistic properties of four languages (chosen for study based on availability of text corpora), and how these characteristics influence lexical development for LVSR [Youngetal97]. Table I compares newspaper text corpora in terms of lexical variety and lexical coverage obtained for different size lexica on the same text data. The text sizes are comparable (37M words) for the different languages, except for Italian for which only 26M words of text were available.^10 In order to compare across the languages, we have introduced the ratio

$$\frac{\#\text{words}}{\#\text{distinct words}}$$

which is indicative of how well statistical language models can be estimated. According to this measure French and Italian are quite comparable, with a ratio about half that of English, and more than twice that of German.

^10 The newspaper text corpora compared are the Wall Street Journal (American English), Le Monde (French), Frankfurter Rundschau (German) from the ACL-ECI CD-ROM, distributed by Elsnet and LDC, and Il Sole 24 Ore (Italian) (after [LamelDeMori95]).

TABLE I. Comparison of WSJ, Il Sole 24 Ore, Le Monde and Frankfurter Rundschau text corpora in terms of number of distinct words and lexical coverage of the text data for different lexicon sizes. OOV rates are shown for 20k and 65k lexica.

language                  English   Italian   French     German
corpus                    WSJ       Sole 24   Le Monde   FR
total # words             37.2M     25.7M     37.7M      36M
# distinct words          165k      200k      280k       650k
#words/#distinct words    225       128       135        55
5k coverage               90.6%     88.3%     85.2%      82.9%
20k coverage              97.5%     96.3%     94.7%      90.0%
65k coverage              99.6%     99.0%     98.3%      95.1%
20k-OOV rate              2.5%      3.7%      5.3%       10.0%
65k-OOV rate              0.4%      1.0%      1.7%       4.9%
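The derived rows of Table I follow directly from the raw counts and coverages. The short sketch below recomputes them from the figures in the table; the dictionary layout and function names are assumptions chosen for illustration.

```python
# Figures copied from Table I:
# language -> (total words, distinct words, 20k coverage %, 65k coverage %)
TABLE_I = {
    "English": (37.2e6, 165e3, 97.5, 99.6),
    "Italian": (25.7e6, 200e3, 96.3, 99.0),
    "French": (37.7e6, 280e3, 94.7, 98.3),
    "German": (36.0e6, 650e3, 90.0, 95.1),
}

def words_per_distinct_word(total_words, distinct_words):
    """Average number of occurrences per distinct word form."""
    return total_words / distinct_words

def oov_rate(coverage_percent):
    """The OOV rate is the complement of the lexical coverage."""
    return round(100.0 - coverage_percent, 1)

for language, (total, distinct, cov20, cov65) in TABLE_I.items():
    ratio = words_per_distinct_word(total, distinct)
    print(f"{language}: ratio {ratio:.1f}, "
          f"20k OOV {oov_rate(cov20)}%, 65k OOV {oov_rate(cov65)}%")
```

The ratio makes the contrast concrete: each distinct English form is observed roughly four times as often as each distinct German form in corpora of similar size, so English n-gram statistics can be estimated from considerably more evidence per word.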

The difference in lexical coverage for French and English mainly stems from the number and gender agreement in French for adjectives and past participles, and the large number of different forms for a given verb (about 40 forms as opposed to at most 5 in English). German is also a highly inflected language where one can observe similar phenomena as in French. In addition, German has case declension for articles, adjectives and nouns. The four cases nominative/genitive/dative/accusative (N/G/D/A) may generate different acoustically similar forms. In German all nouns are capitalized and most words can be nominalized, which generates lexical items which are homophones.

The major reason, however, for the poor lexical coverage in German is word compounding. In English, compounds that do not already exist as single words (e.g. blackbird, coastguard, strong-minded, world-wide) are freely constructed by simple concatenation of separate words, such as black beetle, mother tongue. In technical domains simple concatenation is most common (e.g. speech recognition, the speech recognition problem). In French, aside from a limited number of compounds (such as gentilhomme, porte-manteaux), prepositions are added to complete the expressions, as in reconnaissance de la parole, le problème de la reconnaissance de la parole. In contrast, in German words are joined together to form a new single word (e.g. Spracherkennung, das Spracherkennungsproblem) which are then subject

to the same case variations. This tendency makes text processing and efficient vocabulary selection particularly complex in German. In German 20k words are needed to have the same lexical coverage as for 5k words in English.

For state-of-the-art LVSR systems in English, very high lexical coverage (over 99%) can be achieved. Lexical coverage remains lower for languages with high inflectional generation (e.g. French), or with compound word generation (e.g. German). The inflectional generation mechanism is linear, i.e. producing for each root form a finite number of derived forms. In contrast, the compounding mechanism is exponential. In German compounds of two or three morphemes are very common. For example, the word Stadt (city) appears in 2500 different forms. The word Krieg (war) is observed in over 1000 compound forms. More than 80 forms containing the compound Bürgerkrieg (civil war) can be found, of which only ten are included in the 65k German lexicon.^11

TABLE II. Relative percentage of proper names (including acronyms) in French lexica of increasing size.

Lexicon size    5k    20k    65k    90k
Proper names    9%    17%    32%    40%

While there is no explicit limit to the number of constituents, compounds including four or more morphemes are significantly less frequent

^11
Lexical entry               Decomposition                      Declension
Meaning: civil war (masculine)
Bürgerkrieg                 Bürger + Krieg                     N/D/A sg.
Bürgerkriege                Bürger + Kriege                    N/G/A pl., D sg.
Bürgerkriegen               Bürger + Kriegen                   D pl.
Bürgerkrieges               Bürger + Krieges                   G sg. (variant)
Bürgerkriegs                Bürger + Kriegs                    G sg.
Meaning: civil war refugees (masculine)
Bürgerkriegsflüchtlinge     Bürger + Krieg-s + Flüchtlinge     N/G/A pl.
Bürgerkriegsflüchtlingen    Bürger + Krieg-s + Flüchtlingen    D pl.
Meaning: civil war area (neuter)
Bürgerkriegsgebiet          Bürger + Krieg-s + Gebiet          N/D/A sg.
Bürgerkriegsgebieten        Bürger + Krieg-s + Gebieten        D pl.
Meaning: civil war party (feminine)
Bürgerkriegsparteien        Bürger + Krieg-s + Parteien        N/G/D/A pl.

(e.g. Bundes-sozial-hilfe-gesetz, Bundes-liga-spiel-tag, Bau-spar-kassen-beratungs-stellen, ...) and are most often observed in technical texts. The OOV problem could be reduced in German by text preprocessing aimed at separating compound words into their constituent building blocks [Hausser94]. This step is far from straightforward, requiring a refined morphological analysis.

Independently of the language, proper names pose a problem for lexical coverage. The importance of proper names (including acronyms) in lexica of increasing size is illustrated in Table II for French, where percentage-wise they account for twice as many entries in the 90k lexicon as in the 20k lexicon. It is also difficult to determine pronunciations for proper names, particularly those of foreign origin, as the potential pronunciations may more or less depend on the influence of the language of origin.
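To make the compound splitting idea mentioned above more concrete, the following sketch shows a naive, purely lexicon-driven decompounder. It is not the morphological analysis the text calls for and is not taken from the cited work: the stem list, the single linking element "s" and the longest-match strategy are all assumptions chosen so that the Bürgerkrieg examples can be split, and real German text would defeat it quickly.

```python
# Hypothetical stem inventory, limited to the forms needed for the examples.
STEMS = {"bürger", "krieg", "kriege", "gebiet", "flüchtlinge", "parteien"}
LINKERS = ("", "s")  # assumed linking elements between constituents

def split_compound(word, stems=STEMS):
    """Split `word` into known stems, or return None if no split is found."""
    word = word.lower()
    if not word:
        return None
    # Try the longest known stem first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in stems:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        for link in LINKERS:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], stems)
                if tail:
                    marked = head + "-" + link if link else head
                    return [marked] + tail
    return None

print(split_compound("Bürgerkriegsgebiet"))
print(split_compound("Spracherkennung"))
```

A word whose constituents are missing from the stem list is simply left unsplit, which hints at why a full morphological analysis is needed before such preprocessing can reduce the OOV rate reliably.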

5. WORD LIST DEVELOPMENT

In this section we discuss issues in word list development for some common tasks such as recognition of read newspaper texts [Gauvainetal90], [PaulBaker92] and broadcast news [Graff97]. The problem of word item definition, i.e. what is to be considered a lexical item in the training texts, is addressed in the next section dealing with text normalizations. Section 5.2 addresses the problem of word item selection for a given sized lexicon.

5.1. Text normalization

For LVSR applications one of the most easily available sources of texts is newspapers. A major requirement for the word list is high lexical coverage during testing, which implies that the training text material should be closely related (in time and topics) to the test data. The text data need to be preprocessed for lexicon and language model (LM) development. A straightforward definition of a lexical item as a graphemic string between two blanks is too simplistic and generates a large number of spurious items. There is a certain amount of text normalization to be carried out, in order to clean the texts^12 and to define what is actually to be considered a lexical item in each language. Once the texts are normalized, a task vocabulary can be selected and language models trained.

^12 While not the subject of this discussion, the text data contain errors of different types. Some are due to typographical errors such as misspellings (milllion, officals) or missing spaces (littleknown), others may arise from prior text processing.

A common motivation for normalization in all languages is to reduce lexical variability so as to increase the coverage for a fixed size task vocabulary. Normalization decisions are generally language-specific. Much of speech recognition research for American English has been supported by ARPA and has been based on text materials which were processed to remove upper/lower case distinction and compounds [PaulBaker92]. Thus, for instance, no lexical distinction is made between Gates, gates or Green, green. In the French Le Monde corpus, capitalization of proper names is distinctive, with different lexical entries for Pierre, pierre or Roman, roman.

In this section we consider the French language which, in addition to high lexical variety, makes frequent use of diacritic symbols that are particularly prone to spelling, encoding and formatting errors. An extensive study has been carried out on different types of normalizations [Addaetal97] using a text set T0 of 40M words from Le Monde (years 1987-88). Some of the normalization steps can be considered as baseline, such as the coding of accents and other diacritics (in ISO-Latin1); separation of the text into articles, paragraphs and sentences; correction of frequent formatting and punctuation errors; and processing of unambiguous punctuation markers. These are carried out to produce a baseline V0 text form. Other possible normalizations considered here are:

N1: processing of ambiguous punctuation marks (hyphen -, apostrophe ') not including compounds
N2: processing of capitalized sentence starts
N3: digit processing (110 → cent dix)
N4: acronym processing (ABCD → A. B. C. D.)
N5: emphatic capital processing (Etat → état)
N6: decompounding (arc-en-ciel → arc en ciel)
N7: no case distinction (Paris → paris)
N8: no diacritics (énervé → enerve)

Starting with the baseline form V0, sequentially applying the elementary operations N1 ... N8 produces versions $V_i = V_{i-1} \circ N_i$.
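The sequential construction of the text versions can be sketched as a chain of text-to-text functions. The code below is an illustrative simplification: it implements only rough stand-ins for N6, N7 and N8 (the function names and the sample sentence are invented), and it is not the processing used in the study cited above.

```python
import unicodedata

def n6_decompound(text):
    """N6 stand-in: split hyphenated compounds (arc-en-ciel -> arc en ciel)."""
    return text.replace("-", " ")

def n7_no_case(text):
    """N7 stand-in: remove the case distinction (Paris -> paris)."""
    return text.lower()

def n8_no_diacritics(text):
    """N8 stand-in: strip diacritics by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def text_versions(baseline, steps):
    """Apply the steps in order, keeping every intermediate version."""
    versions = [baseline]
    for step in steps:
        versions.append(step(versions[-1]))
    return versions

def counts(text):
    """Return (total number of words, number of distinct words)."""
    words = text.split()
    return len(words), len(set(words))

BASELINE = "arc-en-ciel arc en ciel Paris paris énervé enerve"
STEPS = [n6_decompound, n7_no_case, n8_no_diacritics]
stats = [counts(version) for version in text_versions(BASELINE, STEPS)]
print(stats)
```

On this sample the decompounding step raises the total word count while the later steps leave it unchanged, and every step lowers the number of distinct words, which is the qualitative behavior the two groups of normalizations are meant to have.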
Normalizations N1, N3, N4, N6 can be considered as "decompounding" rules, modifying word boundaries and the total number of words. N2, N5, N7, N8 keep the total number of words unchanged, but reduce intraword graphemic variability. In Fig. 3 (left) the number of distinct words is seen to decrease with additional normalizations, especially for steps N1, N2, N3. The impact of the "decompounding" rules on the total number of words is illustrated in Fig. 3 (right): an increase is observed for text versions V1, V3, V4 and V6.

Figure 4 (left) gives the corresponding OOV rates (complementary measure of lexical coverage) on the text data using 65k entry lexica (containing the most frequent 65k words in the corresponding normalized text version). The OOV rate curve is seen to parallel the #-distinct-word curve of Figure 3. A large reduction in OOV rate is obtained for the V1, V2 and V3 text versions, which correspond to the processing of ambiguous punctuation marks, sentence-initial capitalization, and digits. Subsequent normalizations improve coverage, but to a lesser extent.

[Figure 3 plots. Left panel: # distinct words (axis from 240K to 440K) for text versions V0 to V8. Right panel: # words (axis from 36.0M to 42.0M) for text versions V0 to V8.]

Figure 3. Number of distinct words (left) and total number of words (right) for normalization versions Vi on T0 text data.

[Figure 4 plots. Left panel: % OOV (axis from 1.0 to 4.0) for text versions V0 to V8, training T0. Right panel: # words / # distinct words (axis from 80 to 180) for text versions V0 to V8.]

Figure 4. OOV rates on the T0 text using 65k word lists (left) and ratio #words/#distinct words (right) for different normalization combinations Vi.

This shows the importance of processing ambiguous punctuation marks and digits prior to word list selection for French. The influence of punctuation processing can be considered as language-specific, whereas digit processing is probably important for all languages. For example, in the Sqale project [Youngetal97] numbers were decompounded in order to increase lexical coverage. The number 1991, which in standard German is written as neunzehnhunderteinundneunzig, was represented by the word sequence neunzehn hundert ein und neunzig. The ratio #words/#distinct words is shown in Fig. 4 (right).

This ratio is seen to double with increasing text versions, which should enable more robust estimates of the LM parameters. However, better LM accuracy may be achieved with a larger number of word forms by maintaining linguistically meaningful distinctions if there are sufficient training data available. In our French LVSR system we generally use V5 text forms, which are a good compromise between outputting standard written French and lexical coverage. The final choice of a given normalization version is largely application-driven.

5.2. Word list selection

In practice, the selection of words is done so as to minimize the system's OOV rate by including the most useful words. In this context useful means an expected input to the recognizer, but also trainable for LMs given the training text corpora. In order to meet the latter condition, it is common to choose the N most frequent words in the training data. This criterion does not, however, guarantee the usefulness of the lexicon according to the first requirement. Therefore it is common practice to use a set of additional development data in order to select a word list adapted to the expected test conditions.

To measure lexical coverage as a function of the training text corpus, the Le Monde newspaper corpus has been divided into different subsets (differing in size and epoch):

T0: years 1987-88 (40M words)^13
T0': years 1994-95 (40M words)
T1: years 1987-95 (185M words)
T2: years 1991-95 (105M words)^14

On the left of Figure 5 the OOV rates are given for a 65k word list containing the most frequent words in the training data T0. A substantial degradation in lexical coverage is observed on a set of development texts containing 20,000 words (Le Monde, May 1996). The figure on the right shows the OOV rates on the same development texts for different 65k word lists derived from the different text subsets.
The text subsets T00 , T1 and T2 have almost identical OOV rates, showing that corpus size is not critical. That the text epoch is more important than text size for optimizing coverage, can be seen by comparing OOV rates for subsets T0 and T00 , where a 25% relative OOV reduction is obtained. Concerning normalization, OOV word rates are reduced by about 40% when going from raw but clean data (V0 text form) to the V5 normalized form. 13

These were baseline resources in the Aupelf French recognizer evaluation project[Dolmazonetal97]. 14 T2 is signi cantly smaller than T1 , but contains on average more recent data.

lex.tex; 8/09/1998; 15:48; p.14

[Figure 5 plots: % OOV (vertical axis, 1.0 to 4.0) against normalization version (horizontal axis: V0, V5, V6, V8). Left panel: curves T0-65kT0 and dev-65kT0. Right panel: curves dev-65kT0, dev-65kT0', dev-65kT1 and dev-65kT2.]

Figure 5. OOV rates for normalization versions V0, V5, V6 and V8 on development test data using 65k word lists derived from different training text sets: T0 (40M words), T0' (40M words), T1 (185M words) and T2 (105M words).
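The selection criterion and the coverage measure of Section 5.2 can be sketched in a few lines of Python. This is only an illustration and not the procedure used to produce Figure 5: the function names, the whitespace tokenization and the toy texts are assumptions made for the example.

```python
from collections import Counter


def select_word_list(training_tokens, size):
    """Return the `size` most frequent words of the training text as a set."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(word_list, dev_tokens):
    """Percentage of running words in the development text missing from the word list."""
    if not dev_tokens:
        return 0.0
    missing = sum(1 for token in dev_tokens if token not in word_list)
    return 100.0 * missing / len(dev_tokens)


def relative_reduction(baseline, improved):
    """Relative reduction (in %) when going from a baseline rate to an improved rate."""
    return 100.0 * (baseline - improved) / baseline


# Toy data standing in for normalized training and development texts.
train = "le train part de paris le train arrive a lyon le soir".split()
dev = "le train part de lyon le matin".split()

word_list = select_word_list(train, size=4)
print(f"{oov_rate(word_list, dev):.1f}% OOV on the development text")
```

Running the same measurement with word lists built from different training subsets, or from differently normalized versions of the same text, gives the kind of comparison summarized in Figure 5.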

6. PRONUNCIATION DEVELOPMENT

The pronunciation lexicon provides the rules for constructing acoustic word models for the recognizer. As the acoustic observations of a given lexical item are prone to substantial variation (due to pronunciation, speaker, acoustic channel, ...), the acoustic word model must be able to deal with this variability. Pronunciation variants are included in the lexicon to complement the acoustic model's ability to implicitly capture the observed variation. Pronunciation development is concerned with two related problems: phone set definition and pronunciation generation using this phone set.

Using the basic phone set, more complex phone symbols can be derived for labeling the acoustic models, depending generally on left and right phone contexts. The selection of the contexts usually entails a trade-off between resolution and robustness, and is highly dependent on the available training data. Different approaches have been investigated[YoungBloothooft97], such as modeling all possible context-dependent units, using decision trees for context selection, and selecting contexts based on their observed frequency of occurrence in the training data. In all cases, smoothing or backoff techniques are used to model infrequent or unobserved contextual units. Context-dependent models increase acoustic modeling accuracy by providing a means to account for a large number of coarticulation effects and commonly observed pronunciation variants. They can be considered as implicit pronunciation rules. Both during training and recognition, context-dependent phone models are aligned with acoustic segments of a minimum duration that depends on the HMM topology (typically 3 frames, 30 ms), as shown in Figure 6. The importance of including pronunciation variants which allow phones to be inserted or deleted depends on this minimum duration parameter.
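The derivation of context-dependent symbols and the backoff to less specific units can be pictured with the following sketch. The `left-phone+right` label format, the backoff order (right context, then left context, then the context-independent phone), the `min_count` threshold and the toy counts are all assumptions for illustration; the text does not specify how the recognizer encodes or selects its contexts.

```python
from collections import Counter


def triphone_labels(phones):
    """Label each phone with its left and right neighbours ('#' marks a word edge)."""
    padded = ["#"] + list(phones) + ["#"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


def back_off(label, counts, min_count):
    """Pick the most specific unit seen at least `min_count` times in training."""
    left, rest = label.split("-")
    phone, right = rest.split("+")
    for candidate in (label, f"{phone}+{right}", f"{left}-{phone}"):
        if counts[candidate] >= min_count:
            return candidate
    return phone  # context-independent model as the last resort


# Hypothetical occurrence counts of contextual units in the training data.
counts = Counter({"#-k+u": 50, "u+p": 12, "p-a": 25})

labels = triphone_labels(["k", "u", "p", "a", "n"])
models = [back_off(label, counts, min_count=10) for label in labels]
print(labels)
print(models)
```

Raising `min_count` moves more units towards context-independent models (more robust, lower resolution); lowering it keeps more specific contexts, which is the trade-off described above.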

Figure 6. 3-state left-to-right CDHMM (continuous mixture density hidden Markov model). An acoustic phone-like segment is temporally modeled as a sequence of 3 states, each state being acoustically modeled by a weighted sum of Gaussian densities.

6.1. Phone set definition

In each language different choices of phone symbols are generally possible. To guide the phone set definition it is important to have an idea of the relative importance of each possible symbol in the language, and more pragmatically in the speech corpora available for that language. In English or German, affricates like /Q/, //, /8/, and diphthongs like /aj/, /aw/, /ɔj/, can be represented by either one or two phone symbols, and consequently by one or two HMMs, as shown in Fig. 7. A consequence of using a single phone unit is that the minimum duration is half that required for a sequence of two phones, which may be desirable for fast speaking rates. A representation using two phones may provide more robust training if the individual phones also occur frequently or if the affricate or diphthong is infrequent.

[Figure 7 diagram: two panels labeled "1 symbol" and "2 symbols".]

Figure 7. Impact on acoustic/temporal modeling depending on the choice of one or two symbols for affricates or diphthongs.
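The duration argument can be made concrete with a small calculation. The sketch below assumes a left-to-right topology without skip transitions, so that each state consumes at least one frame, and a 10 ms frame step; both are consistent with the "3 frames, 30 ms" figure quoted earlier, but the function names, the parameter names and the 40 ms segment are illustrative assumptions.

```python
def min_duration_ms(num_phones, states_per_phone=3, frame_ms=10):
    """Shortest duration (ms) that a sequence of phone models can be aligned to."""
    return num_phones * states_per_phone * frame_ms


def can_align(num_phones, segment_ms, **topology):
    """True if a segment is long enough for the given number of phone models."""
    return segment_ms >= min_duration_ms(num_phones, **topology)


# A hypothetical affricate or diphthong realized in 40 ms by a fast speaker:
print("one symbol :", can_align(1, 40))
print("two symbols:", can_align(2, 40))
```

With one symbol the segment can be aligned; with two symbols the models require at least twice the minimum duration, so the same segment is too short.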

Pronunciations can be generated using a standard set of (more or less detailed) IPA (International Phonetic Alphabet) symbols[Pullum96]. If this representation makes distinctions which are not appropriate given the recognizer's characteristics or the variability in speaking styles, rewrite rules can be applied to reduce the phone set or to simplify pronunciations. If the phone symbol set makes fine distinctions (such as different stop allophones: unreleased, released, aspirated, unaspirated, sonorant-like), many variants must be explicitly specified in order to account for different pronunciation variations. If the basic phone set remains close to a phonemic representation, pronunciation variants are necessary only for major deviations from the baseform, as the acoustic models can account for some of the variability.

In Table III we illustrate the phone symbol choice for a subset of German vowels.[15] Vowels are grouped by type, where for each type there are three symbols corresponding to a lax version (e.g. ɪ), a tense version with lexical stress (e.g. i:), and a tense version without lexical stress (e.g. i). The lexical stress puts emphasis on the syllable containing this vowel, which generally entails an increase in duration and energy.

TABLE III. Examples of some vowel symbols used in German standard pronunciations: IPA codes, recognizer codes (1 and 2), example word. ' indicates lexical stress.

IPA code   Recognizer1   Recognizer2   Example
i:         i             i             v'iel
i          i             i             vit'al
ɪ          I             i             w'ill
o:         o             o             M'osel
o          o             o             Mor'al
ɔ          O             o             M'ost
u:         u             u             H'ut
u          u             u             Argum'ent
ʊ          U             u             H'und
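Rewrite rules of the kind mentioned above amount to a symbol-to-symbol mapping. The sketch below encodes the two recognizer columns of Table III as dictionaries and applies them to a pronunciation given as a list of symbols. The IPA symbols used as keys for the lax vowels (ɪ, ɔ, ʊ), the names `RECOGNIZER1`, `RECOGNIZER2` and `rewrite`, and the two example transcriptions are assumptions made for the illustration.

```python
# Vowel rewrite rules read off the columns of Table III (IPA code -> recognizer code).
RECOGNIZER1 = {
    "i:": "i", "i": "i", "ɪ": "I",
    "o:": "o", "o": "o", "ɔ": "O",
    "u:": "u", "u": "u", "ʊ": "U",
}
# The second set merges the lax vowel with the tense ones.
RECOGNIZER2 = {ipa: code.lower() for ipa, code in RECOGNIZER1.items()}


def rewrite(pronunciation, rules):
    """Map each phone symbol through the rules; symbols without a rule pass through."""
    return [rules.get(phone, phone) for phone in pronunciation]


viel = ["f", "i:", "l"]   # assumed transcription of "viel"
will = ["v", "ɪ", "l"]    # assumed transcription of "will"

for name, rules in (("recognizer1", RECOGNIZER1), ("recognizer2", RECOGNIZER2)):
    print(name, rewrite(viel, rules), rewrite(will, rules),
          "distinct vowel symbols:", len(set(rules.values())))
```

The first rule set keeps the lax/tense distinction with 6 symbols, while the second collapses the 9 IPA codes to 3 and leaves more of the discrimination to the language model.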

As HMM-based recognizers are not particularly good at modeling duration, a phone set which requires fine duration distinctions is not a very appropriate choice.[16] Rather than distinguishing three forms for each vowel, a more appropriate choice for a recognizer corresponds to the recognizer1 column in Table III (reducing the 9 IPA codes to only 6 effective symbols). The recognizer2 column might be of interest for recognition tasks where the language model is able to discriminate among the lexical items, thus relying less on the acoustic evidence (only 3 symbols are used).

[15] The German DUDEN uses 67 phone symbols in its pronunciations, whereas our German recognition system makes use of only 47 phone symbols.
[16] In fact this point is more complicated than it appears at first. Duration is generally judged as a relative measure, not an absolute one. It is highly correlated with the speaking rate and both lexical and sentential stress, as well as semantic novelty.

6.2. Pronunciation generation

The first part of pronunciation generation concerns the generation of baseform or "standard" pronunciations. For some languages (such as French, Spanish and Italian) an initial set of pronunciations can be generated using grapheme-to-phoneme rules (these rules have typically been developed for speech synthesis). If large pronunciation lexica already exist for the language(s) of interest, these can be adapted for use in speech recognition. Consistent use of the different phone symbols in the lexicon needs to be checked. The standard pronunciations can be augmented by pronunciation variants if significant differences in spectral content which are unlikely to be represented by the acoustic models can be observed, or if there can be a severe temporal mismatch between the proposed pronunciation and the produced utterances. Pronunciation variants are discussed further in Section 6.3. For many potential applications adding new items to the existing lexicon, particularly proper names, remains a problem.

For the LIMSI French lexicon, initial pronunciations were produced using grapheme-to-phoneme rules[Prouts80]. These pronunciations were then modified manually or by rules (such as for optional schwas and liaisons). For German, we started with a 65k lexicon provided by Philips. Pronunciations for new words were generated using statistical grapheme-to-phoneme conversion[Minker96] and manually verified. We have developed a utility to facilitate adding words to our American English lexicon[LamelAdda96]. While this utility can be run in an automatic mode, our experience is that human verification is required, and that interactive use is more efficient. For example, an erroneous transcription produced with an early version of the lexicon was obtained for the word "used". The program derived the pronunciation /st/, from the word "us".
These types of errors can only be detected manually. An overview of the tool is shown in Figure 8. First, all source dictionaries are searched to find the missing words. The source lexicons that we make use of are (in order of decreasing confidence): the LIMSI "master" lexicon, which contains pronunciations for over 80k words; the TIMIT lexicon[Garofoloetal93] (different phone set with some allophonic distinctions); a modified version of the Moby Pronunciator v1.3[Ward92] (different phone set and conventions for diphthongs); and a modified version of MIT pronunciations for words in the Merriam Webster Pocket dictionary of 1964 (different conventions for unstressed syllables). The Carnegie Mellon Pronouncing Dictionary (version cmudict.0.4)[CMU95] (represented with a smaller phone set) and the Merriam Webster American English Pronouncing Dictionary[KenyonKnott53] (a book) are also used for reference.

[Figure 8 flowchart. Box labels: word list; dico lookup (master); new word list; dico lookup (reference dictionary: dico1 ... dicoN, temp); apply affix rules (rules); return transcriptions / all transcriptions with derivation; manual correction; manual modification; new lexicon.]

Figure 8. Pronunciation generation tool.

If a word is not located in any of the source dictionaries, affix rules are applied in an attempt to automatically generate a pronunciation.[17] Some example affix rules are given in Figure 9 along with example words. The rules apply to either prefixes (P) or suffixes (S) and specify ordered actions (strip, strip+add, ...) which apply to the words (letters) and context-dependent actions to modify pronunciations. For example, if the word "blurred" is unknown, the letter sequence "ed" is removed and the "r" undoubled. If the word "blur" is located, the phone /d/ is added to the returned pronunciation. While treating a word list, all pronunciations for new words are kept in a temporary dictionary so that inflected forms can be derived. When multiple pronunciations can be derived they are presented for selection, along with their source. We observed that often when no rules applied, it was because the missing word was actually a compound word (carpool), or an inflected form of a compound word (carpools). Thus, the ability to easily split such words and concatenate the result of multiple rule applications was added.

Affix   Rule type        Del     Add     Add phones        Context      Example word
P/S                      affix   affix                     A/V/UV/C
S       strip+add        ier     y       /i/               any          sleepier
P       strip            anti    -       /an{t}[*,aj]/     any
S       strip+add        iness   y       /n*s/             any          sleepiness
S       strip            ness    -       /n*s/             any          carelessness
S       strip+undouble   ed      -       /d/               t,d          wedded, emitted
                                         /d/               V            blurred, quizzed
S       strip+add        ed      e       /d/               /t,d/        rated, provided
                                         /d/               V            raised
                                         /t/               UV           raced
S       strip            ed      -       /d/               d,t          lifted, handed
                                         /d/               V            prospered
                                         /t/               UV           walked

Figure 9. Some example affix rules. Phones in {} are optional, phones in [ ] are alternates.

At the current time we have not developed any specific tools for consistency checking, but make use of Unix utilities to extract and verify all words with a given orthographic form. By using the pronunciation generation tool, we ensure that pronunciations of new words are consistent with respect to pronunciation variants in the master lexicon. For example, if the /d/ is optional in certain /nd/ sequences (such as candidate) it is also optional in other similar words (candidates, candidacy).

Once a reasonably large source lexicon has been created (perhaps in the range of 50k-100k words depending upon the language of interest), the most frequent forms in the language are usually covered. However, specific tasks usually require adding new words. In spoken language information retrieval systems the specific lexical items may not appear in the general language, and a spontaneous speaking style may not be well modeled (see Section 7). For dictation tasks domain-specific words are needed, and for more general news transcription tasks there is a continuous need to add words for current events.[18] Proper names are particularly time-consuming to add, as they often require listening to the signal to generate a correct pronunciation. The pronunciation of proper names foreign to the target language can be quite variable, depending upon the talker's knowledge of the language of origin. For example, the city of Worcester in England should be pronounced /wWst/, but those not familiar with the name often mispronounce it as /w=rQst/. Similarly, the proper names "Houston" (the street in New York is pronounced /hawstn/ and the city in Texas is /hjustn/), "Pirrone", and "SCSI" may be pronounced differently depending upon the talker's experience.

[17] This algorithm was inspired by a set of rules written by David Shipman, now at Voice Processing Corporation, while he was at MIT.
[18] For example, in 1996 our master lexicon, largely developed for the Wall Street Journal and North American Business News tasks, contained about 80,000 words. In order to develop a broadcast news transcription system, we needed to add 15,000 words to our master lexicon. About 50% of these words are proper names or acronyms and 30% correspond to new verb forms, plurals, possessive forms and compound words. The remaining words are mainly word fragments or mispronunciations which occur in the acoustic training transcriptions.
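The lookup-then-affix-rule procedure of Section 6.2 can be sketched as follows. This is a simplified reconstruction for the suffix "ed" only and not the authors' tool: the rule ordering, the function names, the ARPAbet-like phone symbols, the vowel inserted after /t,d/ and the toy dictionaries are all assumptions.

```python
UNVOICED = {"p", "t", "k", "f", "s", "sh", "ch", "th"}


def ed_phones(last_phone):
    """Phones added for the suffix 'ed', chosen from the final phone of the stem."""
    if last_phone in ("t", "d"):
        return ["ih", "d"]  # the vowel symbol is an assumption
    if last_phone in UNVOICED:
        return ["t"]
    return ["d"]


def undouble(stem):
    """Drop one of two identical final letters ('blurr' -> 'blur')."""
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        return stem[:-1]
    return stem


def lookup(word, sources):
    """Search the dictionaries in order of decreasing confidence."""
    for name, lexicon in sources:
        if word in lexicon:
            return lexicon[word], name
    return None


def derive(word, sources):
    """Return (phones, derivation) for a word, or None if nothing applies."""
    found = lookup(word, sources)
    if found:
        return found[0], f"found in {found[1]}"
    if not word.endswith("ed"):
        return None
    stem = word[:-2]
    candidates = []
    if undouble(stem) != stem:
        candidates.append((undouble(stem), "strip+undouble"))
    candidates += [(stem + "e", "strip+add"), (stem, "strip")]
    for base_word, action in candidates:
        base = lookup(base_word, sources)
        if base:
            phones, name = base
            return phones + ed_phones(phones[-1]), f"{action} from '{base_word}' ({name})"
    return None


# Toy dictionaries; the entries and phone symbols are invented for the example.
sources = [
    ("master", {"blur": ["b", "l", "er"], "us": ["ah", "s"]}),
    ("reference", {"walk": ["w", "ao", "k"]}),
]

for word in ("blurred", "walked", "used"):
    print(word, derive(word, sources))
```

Because the toy dictionaries contain "us" but not "use", the word "used" is derived from "us", which reproduces the kind of error described above. Returning the derivation along with the phones is what makes such cases easy to spot during manual verification.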

6.3. Pronunciation Variants

Figure 10. Spectrograms of coupon: /kupan/ (left, 406c0210) and /kjupan/ (right, 20ac0103 ). The grid is 100ms by 1 kHz.

Generating pronunciation variants is time-consuming and error-prone, as it involves a lot of manual work. Therefore an active research area in pronunciation modeling deals with automatic generation of pronunciation variants (cf. [Jelinek96], [Rolduc98]). For speech recognition two often antagonistic goals have to be considered concerning pronunciation variants. The first goal is to increase the accuracy of the acoustic models, and the second is to minimize the number of homophones in the lexicon. As a general rule, if pronunciation variants increase homophone rates, word error rates are likely to increase despite better acoustic modeling. It is nonetheless important that the lexicon contain multiple pronunciations for some of the entries. These are needed for homographs (words spelled the same, but pronounced differently) which reflect different parts of speech (verb or noun) such as excuse, record, and produce. In some lexicons part of speech tags are included to distinguish the different graphemic forms. Alternate pronunciations should also be provided when there are either dialectal or commonly accepted variants. One common example is the suffix -ization, which can be pronounced with a diphthong (/aj/) or a schwa (/ə/). Another example is the palatalization of the /k/ in a /u/ context resulting from the insertion of a /j/, such as in the word coupon (pronounced /kupan/ or /kjupan/) as shown in Figure 10. If these types of alternative pronunciations are not explicitly represented in the lexicon, the acoustic models will be less accurate.
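The tension between adding variants and keeping homophones rare can be checked mechanically. The sketch below measures how many words share a pronunciation with another word before and after a variant is added; the lexicon format (word mapped to a list of pronunciation strings), the function names and the toy entries are assumptions for the example.

```python
from collections import defaultdict


def homophone_sets(lexicon):
    """Group words that share at least one pronunciation string."""
    by_pron = defaultdict(set)
    for word, prons in lexicon.items():
        for pron in prons:
            by_pron[pron].add(word)
    return {pron: words for pron, words in by_pron.items() if len(words) > 1}


def homophone_rate(lexicon):
    """Fraction of words that have a pronunciation in common with another word."""
    confusable = set().union(*homophone_sets(lexicon).values())
    return len(confusable) / len(lexicon)


# Toy lexicon with invented ASCII phone strings.
baseforms = {"and": ["ae n d"], "an": ["ae n"], "coupon": ["k uw p aa n"]}

# Adding a reduced variant of "and" makes it a homophone of "an".
with_variants = dict(baseforms, **{"and": ["ae n d", "ae n"]})

print(homophone_rate(baseforms), homophone_sets(baseforms))
print(homophone_rate(with_variants), homophone_sets(with_variants))
```

A check of this kind can be run each time variants are added, so that variants which mainly create new homophones are reviewed before being kept.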

[Figure 11: spectrograms with phone-level segmentation labels.]

Figure 11. Spectrograms of the word interest with pronunciation variants: /*n*s/ (left) and /*ntr*s/ (right) taken from the WSJ corpus (sentences 20tc0106, 40lc0206). The grid is 100ms by 1 kHz. Segmentation of these utterances with a single pronunciation of interest /*ntr*st/ (middle) and with multiple variants /*ntr*st/ /*ntr*s/ /*n*s/ (bottom). The /I/ and /t/ segments are light and dark grey respectively.

Figure 11 shows two examples of the word interest by different speakers reading the same text prompt: "In reaction to the news, interest rates plunged...". The pronunciations are those chosen by the recognizer during segmentation using forced alignment. In the example on the left, the /t/ is deleted, and the /n/ is produced as a nasal flap. In the example on the right, the speaker said the word with 2 syllables, without the optional vowel and producing a /tr/ cluster. Segmenting the training data without pronunciation variants is illustrated in the middle. Whereas no /t/ was observed in the first pronunciation example, two /t/ segments had been aligned. An optimal alignment with a pronunciation dictionary including all required variants is shown on the bottom. Better alignment will result in more accurate acoustic phone models.

coupon              k{j}upan
interest            *ntr*st   *n{t}*st
counting            kawn{t}'8
industrialization   *ndstril.[,aj]zejMn
excuse              kskju[s,z]

Figure 12. Example alternate pronunciations for American English. Phones in {} are optional, phones in [ ] are alternates.
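The compact notation of Figure 12 can be expanded mechanically into the list of pronunciations it stands for. The sketch below is one possible reading of the notation ({x} optional, [x,y] alternates, an empty alternate meaning that the phone may be absent); the function name and the regular expression are assumptions, and the second example string is made up.

```python
import itertools
import re

# One token is an optional group {..}, an alternate group [..], or a literal run.
TOKEN = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|([^{}\[\]]+)")


def expand(compact):
    """List every pronunciation encoded by a compact lexicon entry."""
    choices = []
    for match in TOKEN.finditer(compact):
        if match.lastindex == 1:       # {x}: absent or present
            choices.append(["", match.group(1)])
        elif match.lastindex == 2:     # [x,y]: exactly one of the alternates
            choices.append(match.group(2).split(","))
        else:                          # plain phones
            choices.append([match.group(3)])
    variants = ("".join(parts) for parts in itertools.product(*choices))
    return list(dict.fromkeys(variants))  # drop duplicates, keep order


print(expand("k{j}upan"))
print(expand("a{b}[c,d]"))
```

The first call yields the two pronunciations of coupon discussed in the text, with and without the inserted /j/.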

Fast speaking rates tend to cause problems for speech recognizers. Fast speakers tend to poorly articulate unstressed syllables (and sometimes skip them completely), particularly in long words with sequences of unstressed syllables. Although such long words are typically well recognized, often a nearby function word is deleted. To reduce these kinds of errors, alternate pronunciations for long words such as Minneapolis (/m*nipl*s/ or /m*nipl*s/) and positioning (/pz*Mn'8/ or /pz*Mn'8/) can be included in the lexicon, allowing schwa-deletion or syllabic consonants in unstressed syllables. Some example alternate pronunciations in our American English lexicon are given in Figure 12. For each word the baseform transcription is used to generate a pronunciation graph to which word-internal phonological rules are optionally applied during training and recognition to account for some of the phonological variations observed in fluent speech. The pronunciation for "counting" allows the /t/ to be optional, as a result of a word-internal phonological rule. The second word, "interest", may be produced with 2 or 3 syllables, depending upon the speaker, where in the latter case the /t/ may be omitted and the [n] realized as a nasal flap ~D.

Some example pronunciations for French and German are given in Figure 13. For French, the base pronunciations were semi-automatically extended to annotate potential liaisons and pronunciation variants. Alternate pronunciations were added for about 10% of the words, not including word-final optional liaisons. In German, variants are included to account for different numbers of syllables (neuem) or choices of consonants (nächste, instrumental) depending upon the speaker's dialect.


French:
sont          s~=   s~=t(V)
contenu       k~=t{}ny
était         [e,]t   [e,]tt(V)
décembre      des~br   des~br(V.)   des~b(C)
désertions    dezr[s,t]j~=
squatter      skwate   skwater(V)   skwat[,]r
Morgan        m=rg~   mcrgn
Wonder        w=nd[,]r   v~=dr

German:
neuem         n=jm   n=jm
nächste       n[k,c]st
Instrument    b*n[s,M]trWmnt
Pfennige      {p}fn*g

Figure 13. Example alternate pronunciations for French and German. Phones in {} are optional, phones in [ ] are alternates. () specify a context constraint, where V stands for vowel, C for consonant and the period represents silence.

7. LEXICAL MODELING FOR SPONTANEOUS SPEECH

Compared to read speech, spontaneous speech is more variable in terms of speaking rate and style, and has different lexical items[19] and syntactic structures. Instead of simply reading aloud an existing text, the speaker generally formulates a message so as to be understood (and not necessarily transcribed). Moreover, speaking is done while the message is being composed; this conjunction results in variations in the speaking rate, speech disfluencies (hesitations, restarts, incomplete words (fragments), repeated words, ...) and rearranging of word sequences or incorrect syntactic structures[StolckeShriberg96]. This increased variability, which may be considered independent of the language under consideration, leads to surface forms of variability which tend to be language-dependent. For example, hesitation filler words in American English are usually "uh, uhm" whereas in French the sound is more likely to be "euh". For read speech the same events may arise, but in a significantly smaller proportion. Various approaches have been tried to explicitly model these effects in the lexicon and in the acoustic and language models.

[19] For example, the first person singular form is quite frequent in spontaneous speech but rare in newspaper texts.

7.1. Word list development

Concerning spontaneous speech, there are generally no or only very limited amounts of transcribed data available for lexicon development purposes. Any transcriptions are necessarily produced after the speech and to a greater or lesser extent represent what was actually said. The human transcriber can be faced with situations where even providing an orthographic transcription is difficult. The transcriber must decide whether to stay close to the uttered speech signal, or to stay close to a normalized written version of what the speaker was trying to say (as judged by the transcriber).

Here we consider two types of spontaneous speech: that taken from radio and television broadcasts, and that found in Spoken Language Dialog Systems (SLDS) for information retrieval. In the case of news broadcasts, newspaper texts (and transcriptions if available) can be used for word list development. Acoustic data of this type is readily available, being produced in large quantities on a daily basis. For SLDS it is necessary to collect application-specific data, which is useful for accurate modeling at different levels (acoustic, lexical, syntactic and semantic). Data is often collected using a Wizard of Oz setup or a bootstrap dialog system. Our experience is that while a bootstrap system is effective for collecting representative acoustic data, the user's vocabulary is affected by the system prompts and the formulation of the returned information. Acquiring sufficient amounts of text training data is more challenging than obtaining acoustic data. With 10k queries relatively robust acoustic models can be trained, but these queries contain only on the order of 100k words, which probably give an incomplete coverage of the task (i.e. they are not sufficient for word list development) and are insufficient for training n-gram language models. Table IV shows the evolution of data collection with the Esprit Mask[Gauvainetal97] system, specifying the size of the text corpus and the number of lexical items and proportion of word fragments at six month intervals.

TABLE IV. Mask data collection: evolution of corpus size and word list, with words & word fragments (about 25% of new items are fragments).

Month                       Jun95      Dec95      May96
#speakers                   146        313        392
#queries                    9.6k       18.7k      26.6k
#items                      69.6k      150.8k     205.4k
#distinct items             1370       2080       2530
#distinct words/fragments   1180/190   1690/390   2060/470
#new words/fragments        -          510/200    370/80

Most SLDSs focus on information retrieval tasks,[20] with lexicon sizes typically well below the maximum size of 65k entries. The word list is usually designed using a priori task-specific knowledge and completed by task-specific collected and transcribed data. For example, the recognition vocabulary of the Mask system contains 2000 words, including 600 station names selected to cover the French Railway commercial needs, other task-specific words (dates, times), and all words occurring at least twice in the training data. For spontaneous speech, it is important that the lexicon include pseudo words for hesitations, as well as extraneous words such as "bof, ben" (in French) or "uh-huh, uh-uh" (in English) as they are commonly observed. Breath noise is also often modeled as a separate lexical item.

[20] A notable exception is the Verbmobil project[Wahlster93] concerned with spoken language translation.

7.2. Pronunciation development

Pronunciation modeling is also more difficult for spontaneous speech than for read speech, as there is a larger proportion of non-lexical events in the signal. Pronunciation variants are influenced by a variety of factors including speaking rate, shared knowledge, lexical novelty, etc. Because these effects are less often observed in read speech, most research has been carried out on spontaneous speech. Pronunciation variants can be word-internal, cross-word (at the word juncture), or can involve several words, usually for common word sequences. Word-internal variants are often handled by adding variants to the lexicon, but evidently systematic variants can also be handled by phonological rules. Some recent work has addressed automatic generation of pronunciation variants, associating probabilities with each one based on observations made on large training corpora[Jelinek96], [Rolduc98]. In developing the Mask pronunciations we experimented with allowing the following word-internal variations:

- systematic optional final schwa, even if no final "e" is present in the graphemic form of the word (e.g. Brest /brst{}/),
- optional vowels in unstressed positions for frequent words (e.g. voudrais /v{u}dr/),
- systematic optional liaison for nouns, verbs, adjectives and adverbs,
- contracted forms for frequent words.

Figure 14. Pronunciation variants in French. Examples taken from the Mask data for the city name Abbeville pronounced as: /abvil/, /abevil/, /abvil/, /abvil/, /abvil/. The grid is 100ms by 1 kHz.
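The first of the word-internal variations listed above, the systematic optional final schwa, lends itself to a rule that is applied to every baseform. The sketch below is a guess at what such a rule could look like: restricting it to pronunciations that end in a consonant, the toy vowel set, the "@" symbol for schwa and the example transcriptions are all assumptions.

```python
# Toy vowel inventory (assumed symbols, "@" standing for schwa).
VOWELS = {"a", "e", "i", "o", "u", "y", "E", "O", "@"}


def add_optional_final_schwa(phones, schwa="@"):
    """Append an optional schwa, written {@}, to baseforms ending in a consonant."""
    optional = "{" + schwa + "}"
    if phones and phones[-1] not in VOWELS and phones[-1] != optional:
        return list(phones) + [optional]
    return list(phones)


brest = ["b", "r", "E", "s", "t"]   # assumed baseform for "Brest"
paris = ["p", "a", "r", "i"]        # assumed baseform for "Paris"

print(add_optional_final_schwa(brest))
print(add_optional_final_schwa(paris))
```

The output uses the {} notation of the lexicon figures, so the optional schwa can be expanded or left implicit by whatever component builds the pronunciation graph.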

Some examples illustrating word-internal variation are shown in Figure 14 for the city Abbeville with 4 (top), 3, and 2 (lower right) syllables. The lower right corresponds to the pronunciation /abvil/, generated by the grapheme-to-phoneme rules. Accounting for the observed pronunciation variants can improve recognizer performance. Phonological rules have been proposed to account for some of the phonological variations observed in fluent speech[Oshikaetal75], [Cohen89], particularly those occurring at word boundaries. These rules are optionally applied during training and recognition. Their use during training results in better acoustic models, as they are less "polluted" by wrong transcriptions. Their use during recognition reduces the number of mismatches. We have used the same mechanism to handle liaisons, mute-e, and final consonant cluster reduction for French.

As for speaking rate, the information flow rate can be increased either by uniformly speaking faster, or by reducing the number of syllables, particularly on word sequences with low information content. The latter is rather frequent in spontaneous speech, but very difficult to handle for speech recognizers. Reduction of the number of syllables is particularly common for numbers (typically dates) where the contextual information is sufficient for understanding. For example in German, the "und" in "99" (neun-und-neunzig) is often reduced to a syllabic-n. The same type of reduction is observed for the word "and" in "one hundred and fifty". Compound words can provide an easy way of allowing reduced pronunciations for frequent word sequences particularly prone to cross-word reduction phenomena (what did you/whatja, going to/gonna). A few example compound words and their pronunciations are given in Table V, where the first pronunciation is formed by concatenation of the pronunciations of the component words. Some example spectrograms showing various degrees of reduction for the word sequence "what did you ..." are shown in Figure 15. In the first example, "what did you see", the words are fairly distinct, but the second /d/ of "did" is palatalized. The second example, "what did you wear", has a flap at the boundary of "what" and "did". In the third example, "what did you think of that", the "what did you" is pronounced as /wa/.

TABLE V. Some example compound words and their pronunciations.

Compound word   Concatenated pronunciation   Reduced pronunciations
what did you    wa{t}d*dju                   wa{t}d*dj   wa{t}d*   w[a]
don't know      don{t}no                     dno
let me          ltmi                         lmi
i am            aj@m                         ajm   ajm
going to        go'8t[u]                     g[=]n

Figure 15. Spectrograms illustrating different pronunciations for the word sequence "what did you ...". The grid is 100ms by 1 kHz.

8. DISCUSSION

We have attempted to cover the main aspects of lexical modeling, largely based on our experience in lexical modeling for automatic speech recognition in American English, French and German. Lexical modeling entails the elaboration of the word list (or recognition vocabulary) and the association of pronunciations with each entry. For read speech data, or dictation tasks, lexical item definition is usually based on existing written texts which have been processed using more or less complex normalization steps. Standard phonemic transcriptions, as provided by pronunciation dictionaries or by automatic grapheme-to-phoneme conversion systems, can be used to achieve satisfactory acoustic model accuracy, provided the use of the phone alphabet is consistent within the dictionary.

Lexicon development for spontaneous speech is more difficult for both word list definition and pronunciation modeling. In spontaneous speech variability is added to the acoustic signal, in terms of speaking rate and style, speech disfluencies, new or incomplete lexical items and syntactic structures. The word list is usually derived from task-specific spontaneous speech training data which need to be transcribed, completed with appropriate a priori knowledge. Reduced form pronunciations are often needed for the lexical entries, and phonological rules have been used to account for frequent variations. Compound words have also been used to model contracted forms of common word sequences. Using phonological variants during training results in better acoustic models, as they are less "polluted" by wrong transcriptions. Their use during recognition reduces the number of mismatches. Lexical modeling and phonological variation are likely to remain important research areas in embedding speech recognition in applications for use by the general public.

REFERENCES

ARPA'94. Proc. ARPA Spoken Language Technology Workshop, San Francisco: Morgan Kaufmann, 1994. (see also other proceedings years 1991-1998)
CMU95. Carnegie Mellon Pronouncing Dictionary, CMUDICT V0.4, 1995.
Rolduc98. Modeling Pronunciation Variation for Automatic Speech Recognition, ESCA/COST/A2RT Workshop, Rolduc, The Netherlands, May 1998.
Addaetal97. G. Adda, M. Adda-Decker, J.L. Gauvain, L. Lamel, "Text Normalization and Speech Recognition in French," Proc. ESCA Eurospeech'97, Rhodes, 5, pp. 2711-2714, Sept. 1997.

lex.tex; 8/09/1998; 15:48; p.29

30

MARTINE ADDA-DECKER AND LORI LAMEL

Baker75. J. Baker, "The Dragon System - An Overview," IEEE Trans. on Acoustics Speech and Signal Processing, vol ASSP-23, pp. 24-29, Feb. 1975.

Caratyetal97. M. Caraty, C. Montacié, and F. Lefèvre, "Dynamic Lexicon for a Very Large Vocabulary Vocal Dictation," Proc. ESCA Eurospeech'97, Rhodes, 5, pp. 2691-2694, Sept. 1997.

Cohen89. M. Cohen, Phonological Structures for Speech Recognition, PhD Thesis, U. Ca. Berkeley, 1989.

DixonMartin79. N.R. Dixon and T.B. Martin, eds. "Automatic Speech & Speaker Recognition," New York: IEEE Press, 1979.

Dolmazonetal97. J.M. Dolmazon, F. Bimbot, G. Adda, M. El Bèze, J.C. Caerou, J. Zeiliger, M.A Decker, "ARC B1 - Organisation de la première campagne AUPELF pour l'évaluation des systèmes de dictée vocale", 1ères JST Francil, Avignon, pp. 13-18, April 1997.

Garofoloetal93. J. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N.L. Dahlgren, "Documentation for the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," Feb. 1993.

Gauvainetal90. J.L. Gauvain, L.F. Lamel, M. Eskenazi, "Design considerations & text selection for BREF, a large French read-speech corpus," Proc. ICSLP-90, 2, pp. 1097-2000, Kobe, Japan, Nov. 1990.

Gauvainetal97. J.L. Gauvain, S.K. Bennacef, L. Devillers, L.F. Lamel, and S. Rosset, "The Spoken Language Component of the Mask Kiosk," in Human Comfort and Security of Information Systems, Advanced Interfaces for the Information Society, editors K.C. Varghese and S. Pfleger, Springer Verlag, 1997.

Graff97. D. Graff, "The 1996 Broadcast News Speech and Language Model Corpus," Proc. DARPA Speech Recognition Workshop, Chantilly, VA, pp. 11-14, Feb. 1997.

Hausser94. R. Hausser, ed., Computer-Morphologie, Dokumentation zur Ersten Morpholympics 1994, University Erlangen, Germany.

Jelinek76. F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proc. of the IEEE, 64(4), pp. 532-556, April 1976.

Jelinek96. F. Jelinek, "DoD Workshops on Conversational Speech Recognition at Johns Hopkins," Proc. DARPA Speech Recognition Workshop, Harriman, NY, pp. 148-153, Feb. 1996.

Katz87. S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-35(3), pp. 400-401, March 1987.

KenyonKnott53. J.S. Kenyon and T.A. Knott, "A Pronouncing Dictionary of American English," MA: Merriam-Webster, 1953.

LamelDeMori95. L. Lamel and R. DeMori, "Speech Recognition of European Languages," Proc. IEEE Automatic Speech Recognition Workshop, Snowbird, pp. 51-54, Dec. 1995.

LamelAdda96. L.F. Lamel, G. Adda, "On Designing Pronunciation Lexicons for Large Vocabulary, Continuous Speech Recognition," Proc. ICSLP'96, Philadelphia, PA, 1, pp. 6-9, Oct. 1996.

Minker96. W. Minker, "Grapheme-to-Phoneme conversion, an Approach based on Hidden Markov Models," Technical Report 96-04, LIMSI-CNRS, January 1996.

Oshikaetal75. Oshika B.T., Zue V.W., Weeks R.V., Neu H., and Aurbach J.: "The Role of Phonological Rules in Speech Understanding Research," IEEE Trans. Acoustics, Speech, Signal Processing, ASSP-23, pp. 104-112, 1975.

PaulBaker92. Paul D.B., Baker J.M.: "The Design for the Wall Street Journal-based CSR Corpus," Proc. ICSLP'92, Banff, CA, 2, pp. 899-902, Oct. 1992.

Prouts80. Prouts B.: Contribution à la synthèse de la parole à partir du texte: Transcription graphème-phonème en temps réel sur microprocesseur, Thèse de docteur-ingénieur, U. Paris XI, Nov. 1980.

Pullum96. G.K. Pullum & W.A. Ladusaw, Phonetic Symbol Guide, The University of Chicago Press, Chicago and London, 1996.

RabinerSchaefer78. L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, New Jersey: Prentice-Hall, 1978.

RabinerJuang86. L.R. Rabiner and Juang, B.H.: "An introduction to Hidden Markov Models," IEEE ASSP Magazine, 3(1), pp. 4-16, 1986.

StolckeShriberg96. A. Stolcke, E. Shriberg, "Statistical Language Modeling for Speech Disfluencies," Proc. IEEE ICASSP-96, Atlanta, GA, I, pp. 405-408, May 1996.

Ward92. G. Ward, Moby Pronunciator, v1.3, 1992.

Youngetal97. S.J. Young, M. Adda-Decker, X. Aubert, C. Dugast and J.L. Gauvain, D.J. Kershaw, L. Lamel, D.A. Leeuwen, D. Pye, A.J. Robinson, H.J.M. Steeneken, P.C. Woodland, "Multilingual large vocabulary speech recognition: the European SQALE project", Computer Speech & Language, 11(1), pp. 73-89, Jan. 1997.

YoungBloothooft97. S. Young and G. Bloothooft, Eds., Corpus-based methods in language and speech processing, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1997.

Wahlster93. W. Wahlster, "Verbmobil: Translation of Face-to-Face Dialogs," Proc. ESCA Eurospeech'93, Berlin, Plenary, pp. 29-38, Sept. 1993.
