
Modeling Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode

M. Ostendorf, B. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley and T. Zeppenfeld

November 17, 1996

Abstract

This paper describes the research efforts of the "Hidden Speaking Mode" group participating in the 1996 summer workshop on speech recognition. The goal of this project is to model pronunciation variations that occur in conversational speech in general and, more specifically, to investigate the use of a hidden speaking mode to represent systematic variations that are correlated with the word sequence (e.g. predictable from syntactic structure). This paper describes the theoretical formulation of hidden mode modeling, as well as some results in error analysis, language modeling and pronunciation modeling.

1 Introduction


Spontaneous, conversational speech tends to be much more variable than the careful read speech that much of speech recognition work has focused on in the past, and not surprisingly the recognition accuracy is much lower on spontaneous speech. Pronunciation differences, in particular, represent one important source of variability that is not well accounted for by current recognition systems. For example, the word "because" might be pronounced with a full or a reduced vowel in the initial syllable, or the whole initial syllable might be dropped. Increasing the allowed pronunciation variability of words is needed to handle the reduction phenomena that seem to be a cause of many errors in conversational speech. Unfortunately, as many researchers have noticed, simply increasing the allowable set of pronunciations in all contexts often does not help and may even hurt performance, since the gain of including more pronunciations may be offset by a loss due to increased confusability. If it is the case that pronunciation changes are systematic, then models can be varied dynamically so as to reduce the added confusability. Thus, the goal of the "Hidden Speaking Mode Group", which participated in the 1996 summer DoD workshop on speech recognition, was to develop a method for allowing pronunciation variations depending on a hidden speaking mode. The speaking "mode" would vary within and across utterances and would reflect speaking "style", e.g. indicating the likelihood of reduced or sloppy speech vs. clear vs. exaggerated speech. By changing the allowed pronunciations as a function of the speaking mode, we can account for systematic variability without the increased confusability associated with a static model.

We focus on capturing variability associated with speaking style, rather than on variability due to dialect or background noise, because of the evidence showing that style has a dramatic impact on recognition performance. In a 1995 study done by SRI, speech recorded over a telephone channel under three conditions showed very different error rates. Spontaneous conversational speech gave an error rate of 52.6%, while the same word sequences read and "acted" (simulated spontaneous speech) by the same speakers led to 28.8% and 37.4% error rates respectively [1]. Since the word sequence and speakers are fixed, the drop in accuracy from read to spontaneous speech must be due to style-related differences.

Speaking style appears to be correlated with the word sequence, and therefore it should be at least somewhat predictable from syntactic and discourse structure. For example, content words (especially nouns) are much more often clearly articulated than function words, which can be reduced to the point of having only a few milliseconds of acoustic evidence. The word sequence "going to" can be reduced to "gonna" when the following word is a verb but not if a noun phrase follows. Old or shared information in the conversation is more likely to be reduced than a new word. Similarly, words at the end of a sub-topic or discourse segment are more likely to be mumbled, while the initial phrase after a topic change will be clearly articulated. Syntactic and discourse structure is of course difficult to extract, but simple text analyses may still be useful for predicting speaking mode.

Because text analysis will necessarily be simplistic and it may be based on errorful data from a recognizer, it is important to also rely on acoustic cues to speaking style. It has been well-established that higher speaking rates are associated with higher recognition errors [2, 3], and rate is perhaps the best candidate for predicting pronunciation variations. However, we have anecdotally noticed reduction phenomena in regions of low energy and pitch range, where a speaker may be mumbling. Thus, the hidden speaking mode model will be conditioned on both acoustic cues and language cues.

In the remaining sections of the paper, we describe the mathematical framework and recognition and training algorithms developed for hidden mode modeling, followed by a summary of the experimental results obtained at the workshop and a discussion of the open questions raised by this work.

2 Hidden Mode Modeling

2.1 Mathematical Framework

Mathematically, the standard problem of recognizing the word sequence w = (w_1, ..., w_N) given acoustic observations x = (x_1, ..., x_T) can be expressed using conditional distributions as

    ŵ = argmax_w p(w|x) = argmax_w p(x|w) p(w) ≈ argmax_{w,q} p(x|q) p(q|w) p(w),

where q is a phone sequence associated with the word sequence, p(x|q) is the acoustic model, p(q|w) gives the pronunciation likelihoods, and p(w) is the standard language model. With hidden mode conditioning, these equations become

    ŵ = argmax_w p(w|x,y)
      = argmax_w Σ_m p(x|w,m) p(m|f(w),y) p(w)
      ≈ argmax_{w,q} Σ_m p(x|q,m) p(q|w,m) p(m|f(w),y) p(w),

where the new variables introduced, m, y and f(w), are sequences of mode labels, acoustic cues to the mode, and language cues to the mode, respectively. With hidden mode conditioning, p(x|q,m) is the acoustic model and p(q|w,m) is the pronunciation likelihood, which is interpolated by the mode likelihood p(m|f(w),y).

Figure 1: Recognition with a hidden speaking mode. The search combines the acoustic model p(x|q,m), the pronunciation dictionary p(q|w,m), the hidden mode model p(m|y,f(w)) (driven by the acoustic cues y), and the language model p(w).

The sequence models are simplified using Markov and conditional independence assumptions as in typical recognition systems, e.g.

    p(x|q,m) = ∏_{t=1}^{T} p(x_t | q_{i(t)}, m_{j(t)}),

where i(t) and j(t) indicate the phone and mode state associated with time t. Mode conditioning can be implemented either directly in the acoustic model p(x_t|q_{i(t)}, m_{j(t)}) and/or in a pronunciation probability p(q|w_i, m_i). Both approaches were explored at the workshop. Direct acoustic model mode conditioning can be incorporated by including the mode as a factor in tree-based distribution clustering, together with questions about neighboring phonetic context. Pronunciation probabilities can incorporate mode conditioning in several ways. The simplest approach is to estimate mode-dependent pronunciation probabilities for each possible pronunciation of each word, but these probabilities will only be robust for the most frequent words. As an alternative, we also investigated using decision trees to predict baseform expansion rule probabilities based on the mode or mode-dependent cues in combination with other factors associated with phonetic context. Decision tree pronunciation prediction as proposed in [4] can also be extended to include mode as a prediction factor.

Another design issue to resolve is what time scale the mode varies on. For example, m might be allowed to change at each frame, syllable, word or utterance. Error analyses from the 1995 workshop [5] show that utterance-level factors are not good predictors of error, which suggests that a mode varying within the utterance would be more useful. To restrict the scope of the effort and simplify implementation in recognition and training, we chose to work with a slowly varying mode, assuming that the mode did not change mid-word. Assuming a word-level mode, the mode sequence includes one mode value m_i for each word: m = {m_1, ..., m_N}, where N is the number of words (or hypothesized words) in an utterance. The mode likelihood model assumes conditional independence of modes at each word given the acoustic and language cues:

    p(m|f(w), y) = ∏_{i=1}^{N} p(m_i | f(w), y(w_i)).

The distribution p(m_i|f(w), y(w_i)) is represented using a decision tree with questions about the language cues f(w) and the acoustic cues y(w_i) within a window of the target word w_i.
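To make the combination of these terms concrete, the following is a minimal sketch (not the workshop implementation) of how a mode-marginalized score for a single hypothesized word could be computed. The scoring callbacks acoustic_ll, pron_prob and mode_prob are hypothetical placeholders for an acoustic model, a mode-dependent pronunciation model, and the decision-tree mode likelihood.

    import math

    def mode_marginalized_score(phones, word, lang_cues, acoust_cues,
                                acoustic_ll, pron_prob, mode_prob, modes=(1, 2)):
        # Sum over the hidden mode, following sum_m p(x|q,m) p(q|w,m) p(m|f(w),y).
        total = 0.0
        for m in modes:
            # acoustic_ll returns log p(x|q,m); the other callbacks return probabilities.
            total += (math.exp(acoustic_ll(phones, m))
                      * pron_prob(phones, word, m)
                      * mode_prob(m, lang_cues, acoust_cues))
        return math.log(total) if total > 0.0 else float("-inf")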

2.2 Automatic Training

On a small task, it might be possible to hand-label data with modes according to a coding system developed to capture pronunciation-related speaking style variation. However, our experience was that it was difficult to define such a coding scheme, and impractical to label a sufficiently large amount of data by hand. As a result, the work focused on unsupervised learning of initial speaking mode labels through various clustering techniques using acoustic cues y. Given an initial mode labeling, one can estimate mode-dependent pronunciation and/or acoustic models and conditional mode likelihoods, and then iteratively

improve all models jointly using Viterbi-style estimation. The problem of finding the hidden speaking "modes" can be thought of as analogous to finding the modes or component distributions of Gaussian mixtures.

Two clustering methods were explored, both based at least in part on decision trees [6]. In the first approach, decision trees are designed to predict regions of recognition error (due at least in part to the acoustic model) vs. regions where the recognizer output was correct, using Chase's error analysis tool [7]. (Errors due to language modeling alone were omitted from clustering.) The leaves of the resulting tree defined a set of "pre-modes". While the acoustic error regions are certainly correlated with different speaking modes, the resulting clusters will not necessarily reflect systematic pronunciation differences. Therefore, the "pre-mode" clusters were subsequently merged using agglomerative clustering with a distance measure on the pronunciation probability distributions of the 100 most frequent words, weighted by the relative frequency of each word. In the second approach, regions of pronunciation similarity were clustered directly by using the acoustic cues to the mode as features in decision tree clustering to predict baseform expansion rule probabilities. This approach has the advantage of clustering directly on pronunciation variability, which is the goal of the hidden mode modeling. However, it is only possible when pronunciation variability can be expressed with a low-dimensional vector, as in the roughly 20 rules used in the CMU Janus system.
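As an illustration of the first approach, the sketch below merges "pre-mode" clusters greedily using a frequency-weighted distance between their per-word pronunciation distributions. The specific per-word distance (total variation) and the equal-weight merge are assumptions made for the sketch; the report does not specify the measure actually used.

    def premode_distance(p1, p2, word_freq):
        # Frequency-weighted distance between the pronunciation probability
        # distributions of two pre-mode clusters.  p1[w] and p2[w] map each
        # frequent word w to {pronunciation: probability}; word_freq[w] is the
        # word's relative frequency.
        dist = 0.0
        for w, freq in word_freq.items():
            prons = set(p1[w]) | set(p2[w])
            tv = 0.5 * sum(abs(p1[w].get(q, 0.0) - p2[w].get(q, 0.0)) for q in prons)
            dist += freq * tv
        return dist

    def merge_premodes(clusters, word_freq, n_modes=2):
        # Greedy agglomerative merging of pre-mode clusters down to n_modes modes.
        clusters = list(clusters)
        while len(clusters) > n_modes:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = premode_distance(clusters[i], clusters[j], word_freq)
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            # Merge cluster j into cluster i by averaging the two distributions
            # (a simplification; an occupancy-weighted average would be more faithful).
            merged = {}
            for w in word_freq:
                prons = set(clusters[i][w]) | set(clusters[j][w])
                merged[w] = {q: 0.5 * (clusters[i][w].get(q, 0.0) + clusters[j][w].get(q, 0.0))
                             for q in prons}
            clusters[i] = merged
            del clusters[j]
        return clusters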

2.3 Mode-dependent triphone training

As described earlier, one approach to incorporating mode in acoustic modeling is to train HMMs to model mode-specific phones. This requires modeling the bracketed term in

    Σ_m {p(x|q,m)} p(q|w,m) p(m|f(w),y) p(w),                                    (1)

the acoustic modeling equation that formulates the hidden mode conditioning. Without mode, phone-based HMMs are used to construct the acoustic model. Context-independent phones are usually based on a relatively small phone set; the 64K-word, Pronlex-based dictionary used for WS96 contained 42 phones. The dictionary entries are of the form given in Table 1(a).

(a)  Word      Pronunciation
     ABACK     ax b ae k
     ABANDON   ax b ae n d ih n

(b)  Word      Mode  Pronunciation
     ABACK     1     ax.1 b.1 ae.1 k.1
     ABACK     2     ax.2 b.2 ae.2 k.2
     ABANDON   1     ax.1 b.1 ae.1 n.1 d.1 ih.1 n.1
     ABANDON   2     ax.2 b.2 ae.2 n.2 d.2 ih.2 n.2

Table 1: Example dictionary entries with context independent phones: (a) without mode; (b) with mode.

Acoustic models are trained for each of these context independent phones. Word acoustic models based on these monophone pronunciations can then be found by concatenating the component monophone HMMs. Mode is introduced into this modeling procedure by providing a different set of acoustic models for each mode. If the mode is assumed to be slowly varying so that it can be considered constant over each word, mode-dependent pronunciations can be implemented by including separate entries in the dictionary for each possible mode. For example, assuming there are two modes 1 and 2, mode can be introduced as

in Table 1(b). The pronunciation chosen for each word hypothesis, and therefore the mode-specific HMMs used to evaluate an observation, depends on the value of the mode hypothesized in Equation 1. Given a small number of modes, it should be possible to reliably train a set of mode-dependent monophone acoustic HMMs. Phonetic context is typically captured through the use of triphones. Examples of dictionary entries based on word-internal triphone models are given in Table 2(a).

(a)  Word      Pronunciation
     ABACK     ax+b ax-b+ae b-ae+k ae-k
     ABANDON   ax+b ax-b+ae b-ae+n ae-n+d d-ih+n ih-n

(b)  Word      Mode  Pronunciation
     ABACK     1     ax.1+b.1 ax.1-b.1+ae.1 b.1-ae.1+k.1 ae.1-k.1
     ABACK     2     ax.2+b.2 ax.2-b.2+ae.2 b.2-ae.2+k.2 ae.2-k.2
     ABANDON   1     ax.1+b.1 ax.1-b.1+ae.1 ... d.1-ih.1+n.1 ih.1-n.1
     ABANDON   2     ax.2+b.2 ax.2-b.2+ae.2 ... d.2-ih.2+n.2 ih.2-n.2

Table 2: Dictionary entries with context dependent phones: (a) without mode; (b) with mode.
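The following sketch illustrates the dictionary expansion in Tables 1(b) and 2(b): each phone of a baseline pronunciation is tagged with a hypothesized word-level mode and then expanded into word-internal triphones. It is an illustration of the construction, not the tools actually used to build the WS96 dictionaries.

    def word_internal_triphones(phones):
        # Expand a monophone pronunciation into word-internal triphones,
        # e.g. ['ax','b','ae','k'] -> ['ax+b', 'ax-b+ae', 'b-ae+k', 'ae-k'].
        out = []
        for i, p in enumerate(phones):
            left = phones[i - 1] + "-" if i > 0 else ""
            right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
            out.append(left + p + right)
        return out

    def mode_tagged_entries(dictionary, n_modes=2):
        # Produce mode-dependent dictionary entries in the style of Table 2(b):
        # every phone is tagged with the hypothesized word-level mode.
        entries = []
        for word, phones in dictionary.items():
            for m in range(1, n_modes + 1):
                tagged = [p + "." + str(m) for p in phones]
                entries.append((word, m, word_internal_triphones(tagged)))
        return entries

    # Example, using the baseline Pronlex-style entries of Table 1(a):
    base = {"ABACK": ["ax", "b", "ae", "k"],
            "ABANDON": ["ax", "b", "ae", "n", "d", "ih", "n"]}
    for word, mode, prons in mode_tagged_entries(base):
        print(word, mode, " ".join(prons))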

Modeling context through the use of triphones increases the number of models needed. In the WS96 50-hour acoustic training set, there are approximately 80,000 distinct triphones. It is generally accepted that such a large number of models cannot be reliably trained, and techniques are available to reduce the number of independently trained models. We have used the tree-based state-clustering procedure available in HTK V2.0. This assigns each triphone state to a class, which is represented by one multiple-mixture Gaussian observation distribution. The tree clustering is based on a set of acoustic-phonetic categories that allows clustering the individual triphone states based on phonetic context. Example categories are given in Table 3.

Class          Context   Phones
L Class-Stop   Left      p, b, t, d, k, g
R Class-Stop   Right     p, b, t, d, k, g
L Vowel-Long   Left      iy, ow, aw, ao, uw, en, el
R Vowel-Back   Right     uw, uh, ow, ah, ax

Table 3: Example phonetic categories determined by right and left phonetic context.

In this way, each state of each triphone is assigned to a class. An example classification from the baseline system is given in Figure 2. For the 60-hour acoustic training set used in WS96, this procedure reduces the number of distinct triphones to approximately 8,000. Note that in this procedure the identity of the center phone specifies the root of the tree. Questions are asked only about the phonetic context, i.e. about the properties of the left and right phones. Because all the triphones in a given tree share the same center phone, it is not necessary to ask questions about the center phone.

Expanding the triphone pronunciations to include mode labels, as shown in Table 2(b), will increase the number of context-dependent models that must be trained even beyond the number of already untrainable triphones. We intend to address this problem by expanding the triphone clustering technique to allow questions about the center phone of a context-dependent model, so that clustering can be based on questions about the speaking mode. Ideally, the clustering procedure should both yield robust acoustic models and also determine which acoustic cues to the mode are relevant and reliable in varying phonetic contexts.
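As a small illustration of how the clustering questions in Table 3 are answered, the sketch below parses a word-internal triphone name and tests a left- or right-context question against the corresponding phone set. The phone sets are taken from Table 3; the parsing conventions follow the triphone notation of Table 2.

    # Phone sets behind a few of the example categories in Table 3.
    PHONE_SETS = {
        "Class-Stop": {"p", "b", "t", "d", "k", "g"},
        "Vowel-Long": {"iy", "ow", "aw", "ao", "uw", "en", "el"},
        "Vowel-Back": {"uw", "uh", "ow", "ah", "ax"},
    }

    def parse_triphone(name):
        # Split a word-internal triphone name such as "ax-b+ae" into
        # (left, center, right); missing contexts come back as None.
        left = right = None
        if "-" in name:
            left, name = name.split("-", 1)
        if "+" in name:
            name, right = name.split("+", 1)
        return left, name, right

    def answer(question, triphone):
        # Answer a context question such as "L_Class-Stop" or "R_Vowel-Back"
        # for one triphone, as the tree-clustering procedure would.
        side, category = question.split("_", 1)
        left, _, right = parse_triphone(triphone)
        phone = left if side == "L" else right
        return phone is not None and phone in PHONE_SETS[category]

    # e.g. answer("L_Class-Stop", "b-ae+k") is True;
    #      answer("R_Vowel-Back", "b-ae+k") is False.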

Figure 2: Clusters for state 3 of the triphone jh.

ABACK     [ . ax ] [ b ' ae k ]
ABANDON   [ . ax ] [ b ' ae n ] [ d . ih n ]
ABDICATE  [ ' ae b ] [ d . ih ] [ k + ey t ]

Table 4: Stress Marked and Syllabified Pronunciations.

2.3.1 Stress, Syllable, and Position Dependent Phones

As a baseline experiment, Barbara Wheatley developed a dictionary that included syllabification and stress information. This was done using William Fisher's tsylb program as an initial pass followed by hand editing. The entries were of the form given in Table 4, where syllable boundaries and primary and secondary stress markings were added to the baseline dictionary. We attempted to train and test word-internal triphones that were clustered using the baseline set of phonetic contexts augmented by questions about phone position and lexical stress.

The purpose of this experiment was twofold. The additional information included in the dictionary can be thought of as a kind of mode, albeit one that remains constant and independent of the acoustic cues y (see Equation 1). Hence this is a primitive form of speaking mode modeling. Also, as mentioned in the previous section, the standard training procedure asks questions only about the phonetic context and not about the center phone. To cluster based on the speaking mode requires modifying this procedure so that questions can be asked about the mode of the center phone. Hence the state clustering procedures developed in this experiment will also be useful for clustering based on mode hypotheses.

To make this information accessible during the state-clustering process, a coded version of this dictionary was derived by Michiel Bacchiani. A 4-digit code was appended to each phone in every pronunciation in the dictionary. The 4-digit code was of the form ABCD, as described in Table 5. Based on this coding scheme, example dictionary entries with codes appended to the context-independent phones are given in Table 6. This increased the number of context-independent phones, referred to as coded monophones, in the acoustic training set from 41 to approximately 1300. The number of coded triphones increased to about 114,000. A set of categories was developed based on the coded markings described in Table 5. These are described in Tables 7, 8, and 9.

For example, inspecting the code of the phone b.1201 indicates that: it is a middle phone in the word (A=1); it is in the last syllable (B=2); it is in the onset initial position of the syllable (C=0); and the syllable has primary stress (D=1). This information is used in deciding which classes to assign to the HMM states of context-dependent models of b.1201. It is also available in the classification of neighboring phones, for example, in determining how to classify the states of ax.0020+b.1201 based on available contextual information.

Digit   A: Phone Position   B: Syllable Position   C: Phone Position   D: Stress
Value   in Word             in Word                in Syllable
  0     first               first                  onset initial       stressless
  1     middle              middle                 onset other         primary
  2     last                last                   nucleus             secondary
  3     only                only                   coda only
  4                                                coda initial
  5                                                coda other
  6                                                ambisyllabic

Table 5: Phone Position and Stress Codes. Note that in 'syllables' that contained no vowel, the consonant series was treated as a coda. The word syllable position annotation for ambisyllabics is always 1 (middle).

(a)  Word      Coded Phone Pronunciation
     ABACK     ax.0020 b.1201 ae.1221 k.2231
     ABANDON   ax.0020 b.1101 ae.1121 n.1131 d.1200 ih.1220 n.2230
     ABDICATE  ae.0021 b.1031 d.1100 ih.1120 k.1202 ey.1222 t.2232

(b)  Word      Coded Phone Pronunciation
     ABACK     ax.0020+b.1201 ax.0020-b.1201+ae.1221 ... ae.1221-k.2231
     ABANDON   ax.0020+b.1101 ax.0020-b.1101+ae.1121 ... ih.1220-n.2230
     ABDICATE  ae.0021+b.1031 ae.0021-b.1031+d.1100 ... ey.1222-t.2232

Table 6: Codified version of the Stress Marked and Syllabified Pronunciations: (a) Context independent; (b) Context dependent.
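A small sketch of how the 4-digit ABCD code can be decoded, following Table 5 and the b.1201 example above. It handles coded monophones only; coded triphones such as ax.0020+b.1201 would first need to be split into their component phones.

    # Meanings of the four code digits (Table 5); values without an entry
    # are unused for that digit.
    A_WORD_POS = {"0": "first", "1": "middle", "2": "last", "3": "only"}
    B_SYLL_POS = {"0": "first", "1": "middle", "2": "last", "3": "only"}
    C_SYLL_ROLE = {"0": "onset initial", "1": "onset other", "2": "nucleus",
                   "3": "coda only", "4": "coda initial", "5": "coda other",
                   "6": "ambisyllabic"}
    D_STRESS = {"0": "stressless", "1": "primary", "2": "secondary"}

    def decode_coded_phone(coded):
        # Decode a coded phone such as "b.1201" into its base phone and the
        # position/stress features given by the ABCD code of Table 5.
        phone, code = coded.split(".")
        a, b, c, d = code
        return {"phone": phone,
                "phone position in word": A_WORD_POS[a],
                "syllable position in word": B_SYLL_POS[b],
                "position in syllable": C_SYLL_ROLE[c],
                "stress": D_STRESS[d]}

    # decode_coded_phone("b.1201") -> middle phone of the word, last syllable,
    # onset initial, primary stress, matching the example in the text.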

2.3.2 Training Stress, Syllable, and Position Dependent Acoustic HMMs

The recommended HTK HMM training procedure was modified to allow questions to be asked about the center phone in determining state-clustered triphones. This required modifications to some HTK programs and library routines; contact Bill Byrne ([email protected]) for the most recent copy of this software. The state clustering algorithm requires estimates of the expected state occupancies for triphones. The following procedure was used to derive the needed estimates.

2.3.3 Finding the coded triphone state occupancy statistics

Step 1  Following the usual procedures, single mixture, uncoded monophones were found.
Step 2  Coded monophones were obtained by cloning the uncoded monophones. This created a model for each of the coded monophones in the training set.
Step 3  Using the coded monophones, forced alignment was performed over the training set to choose which pronunciations should be used for training. This also specified the inventory of coded triphones needed for acoustic training.
Step 4  Uncoded triphones were found by cloning the uncoded monophones.
Step 5  Four iterations of Baum-Welch reestimation were performed using the uncoded triphones.
Step 6  Coded triphones were found by cloning the uncoded triphones.
Step 7  A single iteration of Baum-Welch reestimation was performed. The expected state occupancies needed for state-clustering were found at this step.

This procedure has been found to work well in the experiments attempted. Investigation into the clustering and training of coded triphones is continuing. Using the occupancy statistics, the modified state-clustering routine was used to cluster the coded triphone states. This yielded 7514 triphone states, compared to approximately 6500 in the baseline system obtained from the statistics found at Step 5. The usual mixture splitting and reestimation procedures were then used to further refine the HMMs.

Position of phone in word (Code Digit A)
(0,3) vs. (1,2)    first phone in word
(0,1) vs. (2,3)    last phone in word
(0) vs. (1,2,3)    first but not last phone in word
(1) vs. (0,2,3)    middle phone in word
(2) vs. (0,1,3)    last but not first phone in word
(3) vs. (0,1,2)    only phone in word

Position of syllable in word (Code Digit B)
(0,3) vs. (1,2)    first syllable in word
(0,1) vs. (2,3)    last syllable in word
(0) vs. (1,2,3)    first but not last syllable in word
(1) vs. (0,2,3)    middle syllable in word
(2) vs. (0,1,3)    last but not first syllable in word
(3) vs. (0,1,2)    only syllable in word

Table 7: Classes Determined by Phone and Syllable Position in Word. These categories apply only to the center phone.
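For illustration, the sketch below shows the kind of computation that underlies tree-based state clustering with occupancy statistics: the log likelihood of a tied cluster is approximated with a single diagonal Gaussian weighted by state occupancy, and a candidate question is scored by the gain it yields. This is a simplified version of the criterion used by HTK-style clustering, shown only to make the role of the Step 7 statistics concrete.

    import math

    def pooled_log_likelihood(states):
        # Approximate log likelihood of tying a set of states into one
        # diagonal-Gaussian cluster.  Each state is a dict with 'occ'
        # (expected occupancy), 'sum' (occupancy-weighted feature sums) and
        # 'sumsq' (occupancy-weighted sums of squares), each a list over dims.
        occ = sum(s["occ"] for s in states)
        if occ <= 0:
            return 0.0
        dims = len(states[0]["sum"])
        ll = 0.0
        for d in range(dims):
            sx = sum(s["sum"][d] for s in states)
            sxx = sum(s["sumsq"][d] for s in states)
            var = max(sxx / occ - (sx / occ) ** 2, 1e-6)  # pooled variance, floored
            ll += -0.5 * occ * (math.log(2.0 * math.pi * var) + 1.0)
        return ll

    def split_gain(states, answers):
        # Gain in log likelihood from splitting a cluster with one question;
        # answers[i] is True if states[i] answers the question "yes".
        yes = [s for s, a in zip(states, answers) if a]
        no = [s for s, a in zip(states, answers) if not a]
        if not yes or not no:
            return 0.0
        return (pooled_log_likelihood(yes) + pooled_log_likelihood(no)
                - pooled_log_likelihood(states))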

2.4 Results and Discussion

An example clustering of the coded triphone states is given in Figure 3. It is potentially significant that the first question asked in clustering is whether the stop is the first phone in the syllable. This corresponds with the notion that pre-vocalic stops are likely to be spoken clearly, while post-vocalic stops are likely to be reduced. Another significant feature of the tree is that questions about both primary and secondary stress are used. Finally, the right children of the L_aa node at level 2 appear to be a consonant cluster.

Position of phone in the syllable (Code Digit C)

Question applied to center phone    Features identified
(0) vs. (1,2,3,4,5,6)               singleton questions
  ...
(6) vs. (0,1,2,3,4,5)               singleton questions
(0,1,6) vs. (2,3,4,5)               phone in the syllable onset
(0,1) vs. (2,3,4,5,6)               phone in the syllable onset and not ambisyllabic
(0,6) vs. (1,2,3,4,5)               phone syllable-initial consonant
(0,1,2) vs. (3,4,5,6)               phone in the syllable coda
(0,1,2,6) vs. (3,4,5)               phone in the syllable coda and not in the onset
(0,1,2,4) vs. (3,5,6)               phone possibly syllable-final (5 may be followed by another 5 or a 6)
(0,1,2,4,6) vs. (3,5)               phone possibly syllable-final and not ambisyllabic

Question applied to left phone      Features identified
(0,1,6) vs. (2,3,4,5)               center phone is C: non-initial in onset cluster;
                                    center phone is V: syllable-initial
(2) vs. (0,1,3,4,5,6)               center phone is C: post-vocalic;
                                    center phone is V: V-V syllable onset
(4) vs. (0,1,2,3,5,6)               consonant non-initial in coda cluster
                                    (never have V or onset C after 4)

Question applied to right phone     Features identified
(0,2) vs. (1,3,4,5,6)               center phone is V: open syllable
(0) vs. (1,2,3,4,5,6)               center phone is V: open syllable before C;
                                    center phone is C: syllable-final before heterosyllabic C
(2) vs. (0,1,3,4,5,6)               center phone is V: open syllable before V;
                                    center phone is C: prevocalic
(1) vs. (0,2,3,4,5,6)               if right phone is in onset-other, then in onset consonant cluster

Table 8: Classes Determined by Phone Position in Syllable.

Figure 3: Clusters for state 3 of the coded phone b.

Lexical stress (Code Digit D)
Questions asked about center phone, and left and right contexts:
(0) vs. (1,2)      unstressed vs. stressed
(1) vs. (0,2)      primary stress vs. not primary stress

Conjoined questions (Code Digits B and D)
Questions asked about center phone, and left and right contexts:
B=3 and D≠1 vs. B≠3 and D=1      single syllable words with no primary stress (function words)

Table 9: Classes Determined by Syllabic Stress.

A rescoring experiment was performed with word-internal triphone models. The test was performed on the 445-utterance ICSI subset of the WS96 dev-test set using the previously generated word lattices that form the workshop baseline. The results are presented in Table 10.

                      Word Accuracy
Number of Mixtures    Uncoded Triphones    Coded Triphones
6                     43.67                44.35
8                     45.06                45.51
10                    45.16                46.34

Table 10: Baseline, state-clustered triphone system compared to the stress and position dependent coded triphone system. Results are found by rescoring WS96 baseline, bigram word lattices with the word-internal HMMs indicated here.

Final results that would allow comparison of stress and position coded models to a recognizer incorporating cross-word HMMs and a trigram language model are not yet available. Cross-word triphones incorporating stress and position information have been trained, but it is not yet clear how to incorporate these models in recognition; the number of cross-word triphones is prohibitively large when stress and position information is included. However, the intermediate results presented here are encouraging. They suggest that stress and syllabic information can be used to find better acoustic equivalence classes, which lead to improved acoustic models. They also suggest that there is now a viable mechanism available which can be used to train mode-dependent acoustic models and that this procedure can be used in further studies of acoustic modeling of speaking mode.


2.5 Recognition

The recognition algorithm relies on a multi-pass search strategy, which reduces the search space by using standard, static-pronunciation hidden Markov models in a first pass of recognition that results in a word lattice or N-best list. The lattice or N-best list must be annotated with at least hypothesized word and silence times, and ideally also with hypothesized phone labels and times, for use in computing acoustic features y. In rescoring, the dictionary or at least the relative probability of each entry in the dictionary must vary dynamically throughout the utterance, since the mode can change with each hypothesized word. The combined acoustic/pronunciation model of the i-th hypothesized word w_i is given by either

    p(x(w_i) | w_i, f(w_i), y(w_i)) = Σ_m p(x(w_i) | w_i, m) p(m | f(w_i), y(w_i)),                       (2)

using a single pronunciation and mode conditioning directly in the acoustic model, or by

    p(x(w_i) | w_i, f(w_i), y(w_i)) ≈ max_k p(x(w_i) | q_k) Σ_m p(q_k | m) p(m | f(w_i), y(w_i)),          (3)

using mode conditioning in the pronunciation model, where x(w_i) and y(w_i) are the cepstral and mode acoustic features associated with word w_i given its time markings. The language features f(w_i) are based on the hypothesized word sequence associated with w_i, which will be different for w_i in different N-best word strings. In other words, the mode likelihood provides the probability of the pronunciation in direct acoustic model mode conditioning and acts as an interpolation factor in pronunciation likelihood mode conditioning.

In summary, the differences between hidden speaking mode utterance rescoring and a standard rescoring procedure are that (1) additional acoustic and text analyses are needed for extracting y(w_i) and f(w_i) for each hypothesized word w_i in an utterance, (2) a mode likelihood must be computed for each hypothesized word, and (3) each hypothesized word must point to a different pronunciation weight distribution or weighted collection of dictionary entries.
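The sketch below illustrates the per-word rescoring computation of Equation (3): for each candidate pronunciation of a hypothesized word, the pronunciation probability is interpolated over modes by the mode likelihood, and the best pronunciation is kept. The callbacks are hypothetical placeholders for the acoustic score, the mode-dependent pronunciation probabilities and the decision-tree mode likelihood.

    import math

    def rescore_word(prons, lang_cues, acoust_cues,
                     acoustic_ll, pron_prob, mode_prob, modes=(1, 2)):
        # acoustic_ll(q) returns log p(x(w_i)|q); pron_prob(q, m) returns p(q|w_i, m);
        # mode_prob(m, lang_cues, acoust_cues) returns p(m|f(w_i), y(w_i)).
        best = float("-inf")
        for q in prons:
            mode_weighted = sum(pron_prob(q, m) * mode_prob(m, lang_cues, acoust_cues)
                                for m in modes)
            if mode_weighted > 0.0:
                score = acoustic_ll(q) + math.log(mode_weighted)
                best = max(best, score)
        return best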

3 Experimental Results

The focus of the Hidden Speaking Mode Group's effort was on developing appropriate models for each of the terms introduced in the equations of Section 2, including the acoustic model p(x|q,m), the pronunciation model p(q|m,w) and the mode likelihood p(m|f(w),y(w)), as well as on exploring methods for unsupervised learning of the mode. As for all the workshop groups, the experimental paradigm was conversational speech recognition on the Switchboard task [8]. Results were obtained starting from two different baseline systems: an HTK system developed for the workshop trained on 60 hours of data (gender-independent), and the CMU Janus system trained on 140 hours of data (gender-dependent) [9].

A major thrust of the summer effort was data analysis to determine appropriate acoustic features y(w_i) and preliminary mode clustering. Over 100 features were studied, with some based on forced alignments given the known word transcription (useful in initial mode clustering only), some based on recognized word and phone labels and times, and some that were purely acoustic. The features included various measures

of speaking rate, SNR and/or energy, normalized fundamental frequency, presence and duration of silence, and phone label distance measures between different alignment/recognition alternatives. Speaker gender was included as a control to ensure that the normalization techniques were effective and the unsupervised mode clustering did not simply learn gender, and in fact gender was never used. The "goodness" criterion for evaluating features was prediction of acoustic modeling errors, i.e. regions where an incorrectly recognized word string had a higher acoustic model likelihood than the correct word sequence. Analyses of individual features showed that normalization is very important, and the best normalization methods were conversation-level. In decision tree error prediction experiments, using recognition on the training data to define a sufficiently large number of error regions, acoustic cues alone gave almost as good performance as the superset of features that included those based on recognizer hypotheses (cross-validation error rates of 25% vs. 24%, respectively, compared to 36% chance). The most important features were speaking rate (having two measures was better than one) and presence of silence, but SNR also played a significant role. We anticipate that these features may also be useful for research in confidence scoring.

Among the intonation features developed for use in the decision trees, we developed a new means of measuring the minimum fundamental frequency (minimum f0) of an utterance. The minimum f0 for a phrase is a linguistically important value related to pitch range and can be used for normalizing f0 values within a phrase. Unfortunately, it is frequently difficult to automatically get reasonable measures of minimum f0 because of phenomena such as glottalization and frequency halving. Our new approach for finding the minimum f0 is based upon forming a histogram of the f0 values for each speaker. The f0 histogram for a typical speaker has at least two peaks, including a major peak and a smaller peak at half the frequency of the major peak. Many of the frames with f0 values in the lower peak are from glottalized speech or f0 tracking errors. A good value for minimum f0 can be found by choosing either the position of the minimum between the major peak and the smaller half-frequency peak or three quarters of the frequency of the major peak. (Three quarters times the frequency of the major peak works because it is an approximation of the minimum between the peaks.) This approach for finding minimum f0 is very easy to perform automatically. This value for minimum f0 was used for normalizing some of our features that were based upon the f0 of a segment.

We developed a different feature based upon phone duration that appears to be well correlated with incorrect phone segmentations. This measure is based upon the RMS deviation of the durations of the phones in the region from mean values for the durations of the phones:

    DDUR = sqrt(sum((dur(phone[i]) - MeanDur(phone[i]))^2)/N)

where N is the number of phones in the region, i is the phone index, dur() is the duration of the phone hypothesis, and MeanDur() is the mean duration for that phone in the training corpus. This measure should be fairly highly correlated with incorrect phone segmentations because we believed the greatest deviations from the mean durations occur in regions where the alignments are bad. Because this measure is also correlated with deviations from the average speaking rate, we also tried a slightly modified version of this measure that attempted to normalize the duration difference measure for overall speaking rate. In this case, the overall length of the region is used to normalize the individual phone durations and the comparison is made between normalized mean durations and normalized observations:

    DUR = sum(dur(phone[i]))
    MDUR = sum(MeanDur(phone[i]))
    DDUR2 = sqrt(sum(((dur(phone[i])/DUR) - (MeanDur(phone[i])/MDUR))^2))

Both of these measurements ignored the duration contributed by the silence segments.
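A runnable version of two of the acoustic cues just described is sketched below: the histogram-based minimum-f0 estimate (using the three-quarters-of-the-major-peak approximation) and the DDUR/DDUR2 duration-deviation measures. The histogram bin count is an arbitrary choice for the sketch, and silence segments are assumed to have been filtered out of the inputs beforehand.

    import math
    import numpy as np

    def minimum_f0(f0_values, n_bins=50):
        # Histogram-based minimum-f0 estimate for one speaker: take three
        # quarters of the frequency of the major histogram peak as an
        # approximation of the valley between the major peak and the
        # half-frequency (glottalization/halving) peak.
        f0 = np.asarray([v for v in f0_values if v > 0])
        counts, edges = np.histogram(f0, bins=n_bins)
        peak = int(np.argmax(counts))
        major_peak_f0 = 0.5 * (edges[peak] + edges[peak + 1])
        return 0.75 * major_peak_f0

    def ddur(durs, mean_durs):
        # RMS deviation of hypothesized phone durations from their corpus means.
        n = len(durs)
        return math.sqrt(sum((d - m) ** 2 for d, m in zip(durs, mean_durs)) / n)

    def ddur2(durs, mean_durs):
        # Rate-normalized version: durations and mean durations are first
        # normalized by their totals over the region.
        total = sum(durs)
        mean_total = sum(mean_durs)
        return math.sqrt(sum((d / total - m / mean_total) ** 2
                             for d, m in zip(durs, mean_durs)))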

Once the training data has been assigned initial mode labels, a model for predicting the hidden mode has to be created. Our goal was to use decision trees to predict the mode based upon a combination of simple linguistic features that could be derived from the text and of simple acoustic features. During recognition these features would be based upon the hypothesized word sequence. The initial mode labels for training the mode prediction models were generated by clustering the terminal nodes of the tree for predicting the error regions.

Some of the text-based features with which we experimented were very simple and included: the frequency of the word (and its neighbors) in training, the frequency of the word's bigram in training, the position of the word in the utterance, whether or not the word was a function word, and the number and type of phonemes in the word. Other text-based features were more sophisticated, such as: the part-of-speech of the word and its neighbors, the given/new status of the word, whether or not a word is contained in a queue of the most recent content words in the conversation, and whether the word occurs before or after the pivot point (the first main verb [Meteer]) in the utterance. Acoustic-based features included: speaking rate, word duration and normalized duration, and recognized silences before and after the word.

Unfortunately, the distribution of initial mode labels from the clustering of error regions that was to be used for training the mode model was heavily skewed towards one of the mode labels. This made it impossible to grow good trees for predicting the mode labels. In the future, we believe this problem can be alleviated by using data separate from the recognizer's training data for training the mode prediction model.

Experiments were also performed with these features for predicting correctly and incorrectly recognized words. Some of the most important features for predicting recognition performance were: the word's duration and normalized duration, the word's part-of-speech and function word class, the presence of silences before or after the word, the word's and its neighbors' frequencies in the training data, and the word's bigram frequency. Trees grown with these features were able to predict the recognition results on an independent test set for 69% of the words.
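For concreteness, the sketch below assembles a few of the simple word-level features listed above into a feature dictionary that could be handed to a decision-tree trainer. The lookup tables (unigram and bigram counts, a function-word list, per-boundary silence durations) are assumed to be precomputed; the feature set is only a subset of the one described.

    def word_features(i, words, word_freq, bigram_freq, function_words, silence):
        # Feature dictionary for the i-th hypothesized word.  `silence` holds the
        # recognized silence duration at each word boundary (len(words) + 1 entries),
        # so silence[i] precedes word i and silence[i + 1] follows it.
        w = words[i]
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i + 1 < len(words) else "</s>"
        return {
            "word_freq": word_freq.get(w, 0),
            "prev_word_freq": word_freq.get(prev_w, 0),
            "next_word_freq": word_freq.get(next_w, 0),
            "bigram_freq": bigram_freq.get((prev_w, w), 0),
            "position_in_utterance": i,
            "is_function_word": w in function_words,
            "silence_before": silence[i],
            "silence_after": silence[i + 1],
        }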
The fact that the presence (but not duration) of silence is important for predicting acoustic modeling errors raises the question of whether silence should be an acoustic cue or a language cue. In other words, silence could be treated as a word in the language model, just as utterance boundaries now are. The silence "word" has been used successfully in the CMU Janus system [9]. We conducted experiments with the HTK system to begin exploring the merits of treating silence as a word.

Training Data. In order to train a language model with silence included, we needed a data source that would indicate the presence and duration of non-word intervals in the Switchboard training set. We decided to use the Janus forced alignments of the training data because they were readily available and contained the necessary information. (Forced alignments of the training data using HTK and the BBN utterance segmentation

were not available.) From the Janus forced alignments, we extracted a representation of each utterance that indicated the type and duration of each non-word interval, including both silence and noises. For the purpose of these experiments, silence and noise were treated as equivalent, and sequences of silence and noises (with no intervening words) were combined to determine the total duration of the "silence" interval. However, intervals representing a minimal path through a noise acoustic model were excluded, on the hypothesis that these are likely to be transcription errors.

For use in language modeling, the silence intervals needed to be quantized into one or more units that we could treat as words. We set a lower bound of 80 msec for the minimum duration that we would treat as a word, to avoid any possibility of treating a stop closure as a silence word. A second threshold was set at 250 msec, based on analyses performed at SRI which suggested that the relative frequency of linguistic pauses vs. hesitations changes at around 250 msec. A third threshold was set at 500 msec, based on histograms of silence and noise durations that indicated sharp drop-offs above 500 msec. (This distribution may be artifactual, but it is not simply an automatic utterance detector threshold, since the Janus segmentations were based on the original Switchboard "mrk" files; these reflect the original transcriber turn divisions, which were done manually based on listening to the conversations.) These threshold choices gave us the following types of silence "word": "short" silence, 80-250 msec; "long" silence, 250-500 msec; and "ultra-long" silence, above 500 msec. In addition, we continued to treat utterance boundary as a word; its acoustic manifestation is silence (which may include noise, in HTK), so silences adjacent to utterance boundaries were incorporated into the utterance boundary marker, not treated as separate words. We regarded these divisions of silence as preliminary and not necessarily optimal, and so experimented with different groupings, as described below (e.g., combining long and ultra-long into a single category).

A drawback of deriving our training data from the Janus forced alignments is that the Janus utterance segmentation is systematically different from the BBN utterance segmentation used in HTK; the BBN segments are typically much shorter. In addition, both segmentations are "acoustic," derived from the word-level time marks and the original transcription segmentation; they do not reflect the linguistic segmentation of Switchboard transcriptions done by the Penn Treebank project. We anticipate that linguistic segmentations will provide better bases for predicting hidden modes. Therefore, we wanted to construct a language model that would use the linguistic segment boundaries rather than the acoustic segment boundaries. This required integrating the linguistically segmented transcriptions with the Janus force-aligned transcriptions. Since the linguistic segmentations used a later version of the transcriptions, aligning the Janus output with the linguistically segmented transcriptions was not trivial. We obtained software developed at SRI to align the BBN segmentations with the linguistic segmentations and adapted it to work with the Janus output. In a fairly short time, we were able to develop a version that produced merged transcriptions that were largely correct.
The residual errors would have required extensive work to detect and fix; given the time constraints of the workshop, we decided to simply use the initial merger despite its occasional errors. Trigram language models were trained using both segmentations, i.e., the acoustically segmented data with silences indicated (1.4M words; 1.5M including silence "words"), and the linguistically segmented data merged with the silence information (1.1M words; 1.2M with silence). Several models were trained on each data set, using different groupings of silence words.

Test Conditions. Testing was performed using N-best rescoring of the HTK output. This required augmenting the HTK output sequences to include silence information at the word level. For each N-best list, the top 20 hypotheses were force-aligned with the test data to obtain phone-level alignments, including the location and duration of silences (normally treated at the phone level in HTK). The location and duration of silences were extracted from the phone alignments and quantized into silence "words" using the same thresholds and groupings as in training. For the language models trained on acoustically segmented data, test sets were prepared using the entire development test set. For those trained on the linguistically segmented data, test sets were prepared using the subset of the development test set for which linguistic segmentations are available.

Results. We were able to perform some initial exploratory experiments using the language models and test data described above. The word error rates on tests using the acoustic segmentations are shown in Table 11.

    Reference: WS96 trigram, no silence words            46.4%
    Baseline: Janus training data, no silence words      46.6%
    Three silence words (S, L, U)                        46.4%
    Two silence words (L, U)                             46.8%
    One silence word (L+U)                               46.5%

Table 11: Word error rates for trigram rescoring with language models trained on the acoustically segmented data, with and without silence words.

The baseline test, i.e., using the Janus training data but without any silence words, was necessary to establish the performance degradation due to the mismatch between the Janus segmentation used in training the language model and the BBN segmentation used in producing the N-best lists. As the results show, there was a small increase in error rate (0.2%). We tested three conditions using silence words. In each case, utterance boundary was also represented in the language model, which effectively adds another silence word. The best performance, a slight improvement (0.2%) over the baseline, was obtained using all three silence words (short, long, and ultra-long). Some preliminary tests had suggested that the short silence might be too noisy to be beneficial, but omitting it in these tests did not prove helpful: the error rate increased above the baseline to 46.8%. A further hypothesis was that treating the ultra-long silence as a separate category was harmful because the data became too fragmented. This hypothesis did receive some support: merging ultra-long with long silence brought the error rate down again to just below the baseline. Overall, however, the differences in performance are very slight and do not provide very convincing evidence for any of these variants.

The linguistic segmentation also provided inconclusive results, as shown in Table 12.

    Baseline on ling. seg., no silence words                      45.1%
    Two silence words (S, L+U) and utterance end silence          45.8%
    One silence word (L+U) and utterance end silence              45.9%
    One silence word (L+U), no utterance end silence              45.2%

Table 12: Word error rates for trigram rescoring with language models trained on the linguistically segmented data, with and without silence words.

With the linguistic segmentation, it was harmful to assume that each test utterance has an implicit end marker associated with silence, as shown by the last two results above. This seems consistent with the fact that the segmentation is determined linguistically, not acoustically. We obtained our best result, 45.2%, without assuming end-of-utterance silences and using a single long silence word. This result is not quite as good as the baseline, with no silence words.

Directions for further work.

Our initial results have not shown a clear benefit of treating silence as a word. Nevertheless, they can be viewed as promising, given the limitations of the implementation we used. We have seen that treating silence as a word results in little change in error rate, even though the silence "words" block the actual word context, suggesting that silence and words may provide equivalent amounts (but different kinds) of information. Thus, an extended n-gram framework that allowed both silence and the word context may combine their benefits to yield better results. Further improvements may be obtained by refining the specifics of the implementation, e.g., the number and types of silence words, the treatment of noise events as distinct from silence, and the integration of utterance boundary markers with silence words.

Three strategies were pursued for pronunciation modeling, in part because of the differences in the HTK and Janus systems. Results were obtained for modeling pronunciation variations without mode dependence to provide a baseline. In the Janus system, pronunciation variations were generated using a small set of rules for phenomena such as flapping and vowel reduction (a toy example of such a rule is sketched below). Adding pronunciations reduced the error rate from 39.0% to 38.4%, and using pronunciation probabilities derived from the relative likelihood of rule application further reduced the error rate to 37.6%. (The Janus results were reported only on the male subset of the standard test set.) Analogous experiments were conducted using the HTK system, but adding pronunciations only for the 100 most frequent words. The new pronunciations and their relative likelihoods were based on the results of the Janus system. Perhaps because the Janus pronunciations are tuned to a different acoustic model, there was no gain in performance when using these in the HTK system: 47.0% error baseline performance compared to 47.5% with unweighted additional pronunciations and 47.1% error with pronunciations weighted by their relative frequency.

The third approach to pronunciation modeling used distribution clustering, and the baseline (no mode) experiment involved adding stress and syllable structure information to the inventory of clustering questions. Though no recognition experiments have been completed as yet, the training results showed that these features are important in that they are used early in the clustering process. One might expect syllable position and stress to be associated with the relative strength of articulation of a phone (e.g. the strength of a burst for consonants or the distance from a neutral position for a vowel), but it was interesting to see that these were important factors even before many phonetic contextual effects were accounted for.

Finally, although the mode-dependent pronunciation probability distributions are yet to be evaluated in a recognition system, the initial mode clustering experiments did provide evidence to suggest that pronunciation dynamics are at least somewhat predictable from acoustic cues to speaking mode (specifically, speaking rate and normalized energy measures). Pronunciation differences (e.g. differences in the relative likelihood of the pronunciations /ae n d/ and /ax n/ for "and") were found both by clustering probability distributions of phonological rules based on acoustic features, and by clustering pronunciation probability distributions associated with the acoustically-derived pre-mode regions.
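As a toy illustration of rule-based pronunciation expansion of the kind used in the Janus experiments, the sketch below applies an intervocalic flapping rule to a baseform and assigns each variant a probability from the rule's application likelihood. Both the rule inventory and the probability value are placeholders, not the workshop's actual rule set.

    def apply_flapping(phones, flap_prob=0.7):
        # Intervocalic /t/ or /d/ may surface as a flap (dx).  Returns a list of
        # (pronunciation, probability) pairs covering all rule-application choices.
        vowels = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er",
                  "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
        variants = [(list(phones), 1.0)]
        for i in range(1, len(phones) - 1):
            if phones[i] in ("t", "d") and phones[i - 1] in vowels and phones[i + 1] in vowels:
                expanded = []
                for var, p in variants:
                    flapped = list(var)
                    flapped[i] = "dx"
                    expanded.append((var, p * (1.0 - flap_prob)))
                    expanded.append((flapped, p * flap_prob))
                variants = expanded
        return variants

    # e.g. apply_flapping(["l", "ae", "t", "er"]) yields the baseform and the
    # flapped variant with their relative probabilities.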

4 Conclusions

In summary, we have introduced a new approach for handling speaking style variability in speech recognition based on a hidden speaking mode that controls allowable pronunciation variability. Under the assumption that pronunciation variations are systematically related to the speaking mode, a mode likelihood is predicted from acoustic observations such as speaking rate and relative energy as well as from language cues related to the information status (e.g. new vs. old, content vs. function) of words in the local context. We describe different ways of mode conditioning: in the word pronunciation likelihoods (directly or via baseform expansion rules) and in the acoustic models using distribution clustering. Standard training and recognition algorithms are extended to incorporate mode-dependent modeling.

Data analyses were conducted to identify acoustic cues to the mode, and initial pronunciation clustering experiments demonstrate that modes do influence pronunciation likelihood. However, it remains to be shown that mode-dependent acoustic modeling will improve recognition performance. In addition, it is an open question as to where mode-conditioning will be most effective: in the acoustic model or in the pronunciation likelihood. Because of time limitations, many issues related to mode modeling were not explored in depth, such as the use of textual cues to the mode and assumptions about the form of the mode likelihood model. These are just a few of the questions that the idea of hidden mode modeling will raise, making this a fruitful area for future study.

Acknowledgments

The Hidden Speaking Mode group would like to thank: BBN for data resources, SRI for software, the other WS96 groups for help and collaboration on various fronts, Victor Jimenez for heroic efforts in providing recognition lattices and baseline results, and Lin Chase for help with error analysis.

References

[1] M. Weintraub, K. Taussig, K. Hunicke-Smith, and A. Snodgrass, "Effect of Speaking Style on LVCSR Performance," these proceedings.

[2] D. Pallett, J. Fiscus, W. Fisher, J. Garofolo, B. Lund, M. Przybocki, "1993 benchmark tests for the ARPA Spoken Language Program," Proc. ARPA Workshop on Spoken Language Technology, pp. 15-40, 1994.

[3] N. Mirghafori, E. Fosler and N. Morgan, "Towards robustness to fast speech in ASR," Proc. Int'l. Conf. on Acoust., Speech and Signal Proc., vol. 1, pp. 335-338, 1996.

[4] M. Riley, "A statistical model for generating pronunciation networks," Proc. Int'l. Conf. on Acoust., Speech and Signal Proc., vol. II, pp. S11.1-S11.4, 1991.

[5] 1995 Language Modeling Summer Research Workshop Technical Reports, Section 2.7, CLSP Research Notes No. 1, Johns Hopkins University, February 27, 1996.

[6] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth and Brooks, 1984.

[7] L. Chase, R. Rosenfeld and W. Ward, "Error-responsive modifications to speech recognizers: negative n-grams," Proc. Int'l. Conf. on Spoken Language Processing, vol. 2, pp. 827-830, 1994.

[8] J. Godfrey, E. Holliman and J. McDaniel, "Switchboard: telephone speech corpus for research and development," Proc. Int'l. Conf. on Acoust., Speech and Signal Proc., vol. 1, pp. 517-520, 1992.

[9] M. Finke et al., "Janus-II: Translation of spontaneous conversational speech," presentation at the Large Vocabulary Speech Recognition - Hub 5 Workshop, 1996.
