Modeling Word-Level Rate-of-Speech Variation in Large Vocabulary Conversational Speech Recognition
Jing Zheng, Horacio Franco, Andreas Stolcke
Speech Technology and Research Laboratory, SRI International
Correspondence Address: Jing Zheng, Speech Technology and Research Laboratory, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
Email: [email protected] Tel: (650) 859-6129 Fax: (650) 859-5984
Number of Pages: 31 (including tables and figures)
Number of Tables: 3
Number of Figures: 3
Key Words: Rate-of-speech modeling; Large vocabulary conversational speech recognition; Pronunciation modeling
Received 2 February 2001; revised November 2001 and April 2002
ABSTRACT
Variations in rate of speech (ROS) produce variations in both spectral features and word pronunciations that affect automatic speech recognition systems. To deal with these ROS effects, we propose to use a set of parallel rate-specific acoustic and pronunciation models. Rate switching is permitted at word boundaries, to allow within-sentence speech rate variation, which is common in conversational speech. Because of the parallel structure of rate-specific models and the maximum likelihood decoding method, our approach does not require ROS estimation before recognition, which is hard to achieve. We evaluate our models on a large-vocabulary conversational speech recognition task over the telephone. Experiments on the NIST 2000 Hub-5 development set show that word-level ROS-dependent modeling results in a 2.2% absolute reduction in word error rate over a rate-independent baseline system. Relative to an enhanced baseline system that models crossword phonetic elision and reduction in a multiword dictionary, rate-dependent models achieve an absolute improvement of 1.5%. Furthermore, we introduce a novel method for modeling reduced pronunciations that are common in fast speech, based on the approach of skipping short phones in the pronunciation models while preserving the phonetic context for the adjacent phones. This method is shown to also produce a small additional improvement on top of ROS-dependent acoustic modeling.
ZUSAMMENFASSUNG
Variations in speaking rate (“rate of speech”, ROS) influence both the spectral properties and the pronunciation of words, and thus affect automatic speech recognition. To account for these effects, we use several parallel, ROS-specific acoustic and pronunciation models in the recognizer. ROS changes are allowed at word boundaries, so that adaptation to ROS changes within a sentence is possible. Owing to the parallel structure of the ROS-specific models and the use of the maximum likelihood method, determining the ROS before recognition is not necessary, which is typically a difficult problem. We test our models on the recognition of telephone conversations. Experiments with the NIST 2000 Hub-5 corpus yielded an absolute reduction in word error rate of 2.2% when using ROS-dependent acoustic models, compared with a ROS-independent baseline system. Relative to an improved baseline system, in which phonetic elisions and reductions at word boundaries are captured by means of multiwords, a ROS-dependent system yields an absolute improvement of 1.5%. In addition, we present a new method for modeling reduced pronunciation variants, which often occur in fast speech. This method allows short segments in the pronunciation model to be skipped while preserving the phonetic context of neighboring segments. The method yields a slight additional improvement over the ROS-dependent acoustic models.
Résumé
Variations in rate of speech (ROS) affect the spectral cues of the speech signal as well as pronunciation; automatic speech recognition systems are therefore exposed to them. To counter these effects, we propose to use in parallel two groups of acoustic and pronunciation models, adapted according to the rate of speech. The choice between these two groups can switch at word boundaries in order to account, within an utterance, for variations in this rate, which are common in conversational speech. Thanks to the parallelism of the two groups of models and to the decoding method based on maximum likelihood, our approach does not require estimating the rate of speech before the recognition decision, which would be difficult to achieve. We evaluate our models on a large-vocabulary automatic recognition task for telephone speech. Experiments on a NIST 2000 Hub-5 development configuration show that our modeling obtains a 2.2% improvement in word recognition rate compared with a baseline system with no treatment of rate-of-speech dependence. Relative to an improved baseline system in which coarticulation and elisions are modeled in a multiword dictionary, our rate-of-speech-dependent modeling obtains a 1.5% improvement. We have also introduced a new model of phonetic reductions, frequent in fast speech, in which short phones may be omitted as segments but are preserved as phonetic context for the adjacent phones. This approach also yielded a slight improvement in addition to that obtained by accounting for rate-of-speech variations.
1. INTRODUCTION
Rate of speech (ROS)[1] has been observed as an important factor that affects the performance of a transcription system; speaking either too fast or too slow would lead to higher word error rate (WER) (Siegler and Stern, 1995; Mirghafori et al., 1996) for a number of possible reasons. First, ROS is related to the degree of acoustic realization; changes in ROS would result in variation in both acoustic observations and underlying pronunciation baseforms. Furthermore, some features commonly used in recognition systems are duration related and clearly influenced by speech rate, such as delta and delta-delta features. For these reasons, much prior work employs rate-dependent acoustic models to reduce model mismatch and improve robustness against speech rate variation.

However, previous research addressed rate-of-speech effects mostly at the sentence or higher levels. In Mirghafori et al. (1996), an input utterance was first classified as fast or slow, using a ROS estimator, and then fed to a rate-specific system tuned to fast or slow speech. An obvious problem with this method is that errors in the ROS classification are likely to trigger errors in the recognition step because of model mismatch. In Richardson et al. (1999), utterances were normalized based on a ROS measure, by performing cepstral feature interpolation on the time axis. Both of the above approaches presume that the speech rate within an utterance is uniform, which is often not the case in conversational speech. In our earlier research work on broadcast news speech (Zheng et al., 2000), we found that speech rate variation within sentences is common. However, local ROS is often hard to estimate robustly. Richardson et al. (1999) observed that although phone-level ROS normalization gave considerably larger improvement than sentence-level normalization when the correct phone sequence was known, it failed to yield any improvement when the correct phone sequence was unknown. This result indicates the potential benefit from modeling ROS at a more local level, but also suggests that in order to realize the benefit, the problem of robust estimation of local ROS must be solved first.

We will address the local ROS estimation problem by using parallel rate-dependent acoustic and pronunciation models at the word level. Each word is given two groups of rate-specific pronunciations: one group of “fast” pronunciations and one group of “slow” pronunciations,[2] each being implemented by rate-specific phone models. The recognizer is allowed to select the fast or the slow pronunciation for each word automatically during search, based on the maximum likelihood criterion. In this way, we account for within-sentence speech rate variation, and avoid the requirement of prerecognition ROS classification. To train the rate-specific phone models, we use a duration-based ROS measure to partition the training data into rate-specific categories. Because of the availability of transcriptions in training, robust and accurate ROS estimation is not an issue in our approach. In an experiment with a multiword-augmented dictionary, we verified the importance of modeling ROS at the word level instead of at a more global level, especially the multiword level.

[1] We use the terms “rate of speech”, “speaking rate”, and “speech rate” interchangeably in this paper.
As observed by Siegler and Stern (1995), fast speech frequently produces changes in word pronunciation as well as in phone articulation. To address this, we explore a new method for modeling rate-dependent pronunciation variation. Based on the concept of a zero-length phone (Zheng et al., 2000), we enable models of some short phones to be skipped in the search without changing the phonetic contexts of their neighboring phones; thus, we are able to model the coarticulatory effects of those short phones. A data-driven algorithm is used to generate the rate-dependent pronunciation dictionary with zero-length phones automatically from alignment data. The method effectively allows words to have different pronunciations (or pronunciation probabilities) for different ROS.

The remainder of the paper is organized as follows. Section 2 introduces the ROS measure used for partitioning the training data. Section 3 reports experimental results with rate-dependent acoustic modeling and compares different training approaches. Section 4 introduces rate-dependent pronunciation modeling and reports results with rate dependency in both acoustic and pronunciation models. Section 5 addresses issues with ROS modeling in a multiword-augmented recognition system. Section 6 summarizes the work presented.

[2] In principle, we could group the words in any number of clusters based on a certain ROS measure; a typical choice would be a trichotomy of slow, normal, and fast speech. Because of the limited amount of available training data we chose to use only two clusters, which we refer to as “slow” and “fast”.
2. RATE-OF-SPEECH MEASURE
Two types of methods are typically used to estimate the ROS of an input utterance. The first is based on phone durations, which are usually obtained from phone-level segmentations via forced Viterbi alignments. When the utterance transcription is known, this duration-based method can provide robust ROS estimation (Mirghafori et al., 1996); however, when the transcription is unknown, we can only use the hypothesis from a prior recognition run, whose quality is hard to guarantee. The second method involves estimating ROS directly from the waveform or acoustic features of the input utterance. Morgan and Fosler (1998) developed a signal-processing-based measure, known as mrate, to estimate syllable rate for rapid speech detection. To achieve robustness, the computation must use a data window of sufficient length (1-2 seconds), which is generally too long for estimating local ROS. Tuerk and Young (1999) proposed another measure based on the Euclidean distance between successive feature vectors for modeling speaking rate, and showed some discriminative power in classifying fast and slow phones. However, they did not report experimental results using this measure in an automatic speech recognition (ASR) system.
Under our proposed approach, training the rate-specific models requires partitioning the training data into rate-specific categories at the word level; we therefore need the ROS for each word to be estimated locally. The output of this process should give each word in the training transcription a rate class label. We decided to use only two ROS classes, “fast” and “slow”, for several reasons. First, increasing the number of ROS classes would reduce the amount of training data in each class, which is not desirable for a large vocabulary task. Second, in our method, search complexity increases rapidly with the number of ROS classes, as the number of pronunciations is proportional to the number of classes.
[Figure 1]

Because we need to compute ROS only for the training data for which transcriptions are available, it is relatively straightforward to obtain the duration of each word and its component phones by computing forced Viterbi alignments, and then applying duration-based ROS estimation methods. Absolute ROS measures, such as phones per second (PPS) and inverse mean duration (IMD) (Mirghafori et al., 1996), were used in previous work. However, we felt that these measures are not suitable for our purposes since they do not consider the fact that different phone types have different duration distributions. Figure 1 illustrates the duration distributions of five typical phonemes, /d/, /p/, /ch/, /ih/ and /ay/,[3] as estimated from the training corpus. Clearly the duration distributions for different phone types differ substantially. Based on PPS or IMD as the ROS measure, words composed of short phones would seem inherently “faster” than those composed of longer phones, even when spoken at the same speaking rate. Therefore, we use a relative ROS measure, $R_W(D)$, defined as 1 minus the distribution function of the word duration considered as a random variable:

$$R_W(D) = P_W(d > D) = 1 - \sum_{d=0}^{D} P_W(d) \qquad (1)$$

where $W$ is a given word, $D$ is the duration of $W$ in frame units (the step size of the signal analysis window, 10 ms in our system), and $P_W(d)$ is the probability of that type of word having duration $d$. $R_W(D)$ is the probability of $W$ having a duration longer than $D$. The measure $R_W(D)$ always falls within the range [0,1], and can be compared between different word categories.

It is interesting to note that the ROS obtained from equation (1) has a close-to-uniform distribution, since $R_W(\cdot)$ can be viewed as a histogram-equalization transform (Gonzalez and Woods, 1992) mapping the word duration $D$ to the range of [0,1] with an equalized histogram. However, in practice, $P_W(d)$ is hard to estimate directly because of data sparseness. To address this problem we assume that, within a word, the duration distributions of its component subword units, such as phones, are independent of each other. Thus, the duration probability of a word equals the convolution of its component subword unit probabilities, which are easier to estimate reliably from training data. This can be formulated as

$$P_W(D) = P_1(d_1) * P_2(d_2) * \cdots * P_n(d_n) = \sum_{d_1 + d_2 + \cdots + d_n = D} \; \prod_{i=1}^{n} P_i(d_i) \qquad (2)$$

where $d_1, d_2, \ldots, d_n$ are the durations of the subunits of word $W$, and $P_i(d_i)$ are the corresponding probabilities. To partially account for dependence between nearby phones we use context-dependent subword units, specifically triphones, for the purpose of estimating $P_W(d)$. The triphone duration distributions are estimated directly from the training corpus.

[3] We use OGIbet for phone labeling throughout the paper.
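Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes that each subword-unit duration distribution is available as an array indexed by duration in frames; the function names and the toy probabilities are hypothetical and serve only to illustrate the computation, not to reproduce the system used in the experiments.

```python
import numpy as np


def word_duration_pmf(unit_pmfs):
    """Eq. (2): convolve the duration PMFs of a word's subword units.

    Each PMF is a sequence whose index d is a duration in frames.
    """
    pmf = np.array([1.0])  # identity element for convolution
    for unit_pmf in unit_pmfs:
        pmf = np.convolve(pmf, np.asarray(unit_pmf, dtype=float))
    return pmf


def relative_ros(word_pmf, duration):
    """Eq. (1): R_W(D) = 1 - sum_{d=0}^{D} P_W(d); larger means faster."""
    if duration < 0:
        return 1.0
    if duration >= len(word_pmf):
        return 0.0
    cdf = np.cumsum(word_pmf)
    return float(np.clip(1.0 - cdf[duration], 0.0, 1.0))


# Toy two-unit word; the probabilities are made up for illustration.
p1 = [0.0, 0.5, 0.5]         # unit 1 lasts 1 or 2 frames
p2 = [0.0, 0.25, 0.5, 0.25]  # unit 2 lasts 1, 2, or 3 frames
pmf = word_duration_pmf([p1, p2])
ros_short = relative_ros(pmf, 2)  # shortest possible token of this word
ros_long = relative_ros(pmf, 5)   # longest possible token of this word
```

In this toy case the shortest token receives a ROS close to 1 and the longest a ROS of 0, which matches the reading of $R_W(D)$ as the probability that the word lasts longer than the observed duration.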
We used the ROS measure thus defined for all word tokens in the training data. We found that 80% of sentences with five or more words have both at least one word belonging to the fastest 33% and one word belonging to the slowest 33% of all words. This suggests that in conversational speech, speech rate is usually not uniform within a sentence.
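A statistic of this kind could be computed along the following lines. This is only a sketch under stated assumptions: each sentence is represented as a list of word-level ROS values from equation (1), where a higher value means a faster word, and the fastest and slowest thirds are taken as quantiles over all word tokens. The function name and the toy data are hypothetical.

```python
import numpy as np


def mixed_rate_fraction(sentences, min_words=5, tail=1.0 / 3.0):
    """Fraction of sentences with at least min_words words that contain
    both a word in the slowest tail and a word in the fastest tail,
    where the tails are defined over all word tokens."""
    all_ros = np.concatenate([np.asarray(s, dtype=float) for s in sentences])
    slow_cut, fast_cut = np.quantile(all_ros, [tail, 1.0 - tail])
    eligible = [np.asarray(s, dtype=float)
                for s in sentences if len(s) >= min_words]
    if not eligible:
        return 0.0
    mixed = sum(1 for s in eligible
                if (s <= slow_cut).any() and (s >= fast_cut).any())
    return mixed / len(eligible)


# Toy data (made up): word-level ROS values per sentence.
toy_sentences = [
    [0.05, 0.10, 0.95, 0.90, 0.50],                    # slow and fast words
    [0.45, 0.50, 0.55, 0.48, 0.52],                    # mid-rate words only
    [0.02, 0.03, 0.04, 0.06, 0.96, 0.97, 0.98, 0.99],  # slow and fast words
    [0.01, 0.99],                                      # too short to count
]
fraction = mixed_rate_fraction(toy_sentences)
```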
Equation (1) can also be applied directly to subword units, thus allowing us to calculate the ROS of individual phones. This gives us an approach to study the variation of ROS at a more local level. For each word or sentence that has at least two phones, we computed the standard deviation of the ROS of all of its phones. From the definition we see that the phone ROS ranges from 0 to 1; thus, its standard deviation must also fall within [0,1]. Dividing the interval [0,1] into 100 equal bins, we collected the histograms of phone ROS deviation in both the within-word case and the within-sentence case on the whole training data, as depicted in Figure 2. The data suggest that the word is a better unit than the sentence for ROS modeling, since the average phone-level ROS deviation within a word is significantly smaller than within a sentence. This means that ROS is more stable at the word level than at the sentence level, and thus classifying each word as fast or slow makes more sense than classifying the entire sentence as fast or slow.

[Figure 2]
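The histogram computation described above can be sketched as follows, assuming that the phone-level ROS values have already been grouped per word or per sentence. The function name and toy values are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not specify the estimator.

```python
import numpy as np


def ros_std_histogram(units, n_bins=100):
    """Histogram of per-unit standard deviations of phone ROS.

    units: sequences of phone-level ROS values in [0, 1], one sequence
    per word (or per sentence). Units with fewer than two phones are
    skipped. Returns (counts, bin_edges) over n_bins equal bins on [0, 1].
    """
    stds = [float(np.std(u)) for u in units if len(u) >= 2]
    return np.histogram(stds, bins=n_bins, range=(0.0, 1.0))


# Toy data (made up): the same six phones grouped by word and by sentence.
by_word = [[0.2, 0.4], [0.5], [0.1, 0.9, 0.5]]
by_sentence = [[0.2, 0.4, 0.5, 0.1, 0.9, 0.5]]
word_counts, word_edges = ros_std_histogram(by_word)
sent_counts, sent_edges = ros_std_histogram(by_sentence)
```

Comparing the two histograms, or simply their means, gives the within-word versus within-sentence contrast discussed in the text.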
We note that the proposed ROS measure applies to individual words, and therefore does not include the duration of interword pauses that contribute to other common definitions of speech rate. The reason for this difference is that our approach aims at improving the acoustic modeling of the speech portions of the signal only, that is, the portions accounted for by word pronunciations. Furthermore, our goal is to do so by modeling the effects of ROS on the acoustic features, not by modeling ROS itself as a discriminator feature. For these reasons we are not concerned about ROS estimation per se, and have not investigated the quantitative relationship between our ROS measure and others proposed in the literature.
3. RATE-DEPENDENT ACOUSTIC MODELING
We focus on rate-dependent acoustic modeling alone, without changes to word pronunciations. In the proposed method, each word is given parallel pronunciations of “fast” and “slow” phone models. Both fast and slow pronunciations are initialized from the original rate-independent version, with a simple replacement of rate-independent phones by rate-specific phones. For example, the original rate-independent pronunciation of “WORD” is /w er d/. Consequently, the fast and slow pronunciations are /w_f er_f d_f/ and /w_s er_s d_s/, consisting of fast and slow phone models, respectively. The recognizer automatically finds the pronunciations that maximize the likelihood score during search, and thus avoids the need for ROS estimation before recognition. In addition, the search algorithm is allowed to select pronunciations of different rates across word boundaries (but not within a word), thus accounting for speech rate variation within a sentence.

The introduction of parallel ROS-specific pronunciations is reminiscent of the introduction of parallel state paths in recent work on phone hidden Markov model (HMM) topologies (Iyer et al., 1999). Parallel path HMMs aim to model the acoustic consistency of adjacent frames, emulating trajectory-based segment models (Ostendorf et al., 1996), similar to the way our models enforce ROS consistency over adjacent phones.
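The construction of parallel rate-specific pronunciations can be sketched in a few lines of Python. The dictionary layout, the function name, and the "_f"/"_s" suffix convention for naming rate-specific phone models are assumptions made for illustration; the actual lexicon format of the recognizer is not described here.

```python
def expand_lexicon(lexicon, rates=("f", "s")):
    """Give every pronunciation one rate-specific copy per rate class.

    lexicon: dict mapping a word to a list of pronunciations, each a
    list of rate-independent phone labels. Every phone in a copy carries
    the same rate tag, so the rate can change only at word boundaries.
    """
    expanded = {}
    for word, prons in lexicon.items():
        expanded[word] = [
            (rate, [f"{phone}_{rate}" for phone in pron])
            for pron in prons
            for rate in rates
        ]
    return expanded


lexicon = {"WORD": [["w", "er", "d"]]}
expanded = expand_lexicon(lexicon)
# expanded["WORD"] holds a fast copy (w_f er_f d_f) and a slow copy (w_s er_s d_s)
```

Because both copies are ordinary alternative pronunciations of the same word, a standard search can pick between them per word without a separate rate-classification step.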
3.1 Training rate-dependent acoustic models
Our initial experiments were performed on SRI’s 1998 Hub-5 evaluation system (Weintraub et al., 1998), which uses continuous-density genonic HMMs (Digalakis et al., 1996) for acoustic modeling. The system used a multipass recognition strategy (Murveit et al., 1993). For the sake of simplicity, we ran our experiments with only the first-pass recognizer, which used gender-dependent non-crossword genonic HMMs (1,730 male genones and 1,458 female genones with 64 Gaussians per genone) and a bigram grammar with a 33,275-word vocabulary. The pronunciation dictionary was derived from the CMU version 0.4 lexicon with stress information stripped. Most words have a single entry in the CMU lexicon, while some have multiple entries for common pronunciation variants. For example, the word “was” has three entries: /w aa z/, /w ah z/, and /w ao z/. Generally speaking, the lexicon does not cover the possible pronunciation variations caused by different speaking rates. The recognizer used a two-pass (forward and backward) Viterbi beam search algorithm; in the first pass a lexical tree was used in the grammar backoff node to speed up search. Below we report results from the backward pass. The acoustic features used were 9 Mel Frequency Cepstral Coefficients (C1-C8 plus C0) with their first- and second-order derivatives, obtained from 18 filterbanks covering 300-3300 Hz in 10 ms time frames. The acoustic training set consisted of 87 hours of speech for males and 106 hours for females, from a combination of corpora: (1) Macrophone read telephone speech, (2) 3,094 conversation sides from the BBN-segmented Switchboard-1 training set (with some hand-corrections), and (3) 100 CallHome English training conversations.

We first calculated the ROS for all words in the training corpus based on the above-mentioned measure, sorted words by ROS, and then split them into fast and slow categories. The ROS threshold for splitting was selected to achieve equal amounts of training data for fast and slow speech. The training transcriptions were labeled accordingly. We then prepared a special training lexicon: word tokens labeled “fast” were given pronunciations with fast phones only, and similarly for “slow” words. In this way, we were able to train fast and slow models simultaneously using the standard Baum-Welch training procedure.

We used the DECIPHER(TM) genonic training tools to run standard maximum likelihood estimation (MLE) gender-dependent training, and obtained rate-dependent models with 3,233 genones for male speech and 2,501 genones for female speech. The genone clustering for rate-dependent models used the same information loss threshold as previously used for rate-independent training.

Results. We compared the rate-dependent acoustic model with the rate-independent one (the baseline system) on a development subset of the 1998 Hub-5 evaluation data, consisting of 1,143 sentences from 20 speakers (9 male, 11 female). The first two rows of Table 1 show the WER for the two models. Note that all results reported here are based on speaker-independent within-word triphone acoustic models and a bigram language model, and are therefore not comparable to those for the full evaluation system.

[Table 1]
Rate-dependentmodelingyieldsanabsoluteWERredu significant( p