Modeling Word-Level Rate-of-Speech Variation in Large Vocabulary Conversational Speech Recognition

Jing Zheng, Horacio Franco, Andreas Stolcke

Speech Technology and Research Laboratory, SRI International

Correspondence Address: Jing Zheng, Speech Technology and Research Laboratory, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA

Email: [email protected] Tel: (650) 859-6129 Fax: (650) 859-5984

Number of Pages: 31 (including tables and figures)

Number of Tables: 3

Number of Figures: 3

Key Words: Rate-of-speech modeling; Large vocabulary conversational speech recognition; Pronunciation modeling

Received 2 February 2001; revised November 2001 and April 2002

ABSTRACT

Variations in rate of speech (ROS) produce variations in both spectral features and word pronunciations that affect automatic speech recognition systems. To deal with these ROS effects, we propose to use a set of parallel rate-specific acoustic and pronunciation models. Rate switching is permitted at word boundaries, to allow within-sentence speech rate variation, which is common in conversational speech. Because of the parallel structure of rate-specific models and the maximum likelihood decoding method, our approach does not require ROS estimation before recognition, which is hard to achieve. We evaluate our models on a large-vocabulary conversational speech recognition task over the telephone. Experiments on the NIST 2000 Hub-5 development set show that word-level ROS-dependent modeling results in a 2.2% absolute reduction in word error rate over a rate-independent baseline system. Relative to an enhanced baseline system that models crossword phonetic elision and reduction in a multiword dictionary, rate-dependent models achieve an absolute improvement of 1.5%. Furthermore, we introduce a novel method for modeling reduced pronunciations that are common in fast speech, based on the approach of skipping short phones in the pronunciation models while preserving the phonetic context for the adjacent phones. This method is shown to also produce a small additional improvement on top of ROS-dependent acoustic modeling.


ZUSAMMENFASSUNG (German abstract)

Variations in speaking rate ("rate of speech", ROS) influence both the spectral properties and the pronunciation of words, and thus affect automatic speech recognition. To account for these effects, we use several parallel, ROS-specific acoustic and pronunciation models in the recognizer. ROS changes are permitted at word boundaries, so that adaptation to ROS changes within a sentence is possible. Owing to the parallel structure of the ROS-specific models and the use of the maximum-likelihood method, determining the ROS before recognition is not necessary, which is typically a difficult problem. We test our models on the recognition of telephone conversations. Experiments with the NIST 2000 Hub-5 corpus yielded an absolute reduction in word error rate of 2.2% when using ROS-dependent acoustic models, compared with a ROS-independent baseline system. Compared with an improved baseline system, in which phonetic elisions and reductions at word boundaries are captured by means of multiwords, a ROS-dependent system yields an absolute improvement of 1.5%. In addition, we present a new method for modeling reduced pronunciation variants, which often occur in fast speech. This method allows short segments to be skipped in the pronunciation model while the phonetic context of neighboring segments is preserved. It yields a slight additional improvement of the ROS-dependent acoustic models.


RÉSUMÉ (French abstract)

Variations in speaking rate (ROS) affect the spectral cues of the speech signal and the pronunciation; automatic speech recognition systems are therefore exposed to them. To counter these effects, we propose to use in parallel two groups of acoustic and pronunciation models, adapted according to speaking rate. The choice between these two groups can switch at word boundaries, in order to account for variations of this rate in the course of an utterance, which are common in conversational speech. Thanks to the parallelism of the two groups of models and to the maximum-likelihood decoding method, our approach does not require estimating the speaking rate before the recognition decision, which would be difficult to achieve. We evaluate our models on a large-vocabulary telephone speech recognition task. Experiments on a NIST 2000 Hub-5 development configuration show that our modeling obtains a 2.2% improvement in word recognition rate compared with a baseline system that has no treatment of speaking-rate dependence. Relative to an improved baseline system in which coarticulation and elisions are modeled in a multiword dictionary, our rate-dependent modeling obtains a 1.5% improvement. We have also introduced a new model of phonetic reductions, frequent in fast speech, in which short phones can be omitted as segments but preserved as phonetic context for the adjacent phones. This approach also gave a slight improvement in addition to that obtained by accounting for speaking-rate variations.


1. INTRODUCTION

Rate of speech (ROS)¹ has been observed to be an important factor affecting the performance of a transcription system; speaking either too fast or too slowly leads to a higher word error rate (WER) (Siegler and Stern, 1995; Mirghafori et al., 1996), for a number of possible reasons. First, ROS is related to the degree of acoustic realization; changes in ROS result in variation in both acoustic observations and underlying pronunciation baseforms. Furthermore, some features commonly used in recognition systems are duration related and clearly influenced by speech rate, such as delta and delta-delta features. For these reasons, much prior work employs rate-dependent acoustic models to reduce model mismatch and improve robustness against speech rate variation.

However, previous research addressed rate-of-speech effects mostly at the sentence or higher levels. In Mirghafori et al. (1996), an input utterance was first classified as fast or slow, using a ROS estimator, and then fed to a rate-specific system tuned to fast or slow speech. An obvious problem with this method is that errors in the ROS classification are likely to trigger errors in the recognition step because of model mismatch. In Richardson et al. (1999), utterances were normalized based on a ROS measure, by performing cepstral feature interpolation on the time axis. Both of the above approaches presume that the speech rate within an utterance is uniform, which is often not the case in conversational speech. In our earlier research work on broadcast news speech (Zheng et al., 2000), we found that speech rate variation within sentences is common. However, local ROS is often hard to estimate robustly. Richardson et al. (1999) observed that although phone-level ROS normalization gave considerably larger improvement than sentence-level normalization when the correct phone sequence was known, it failed to yield any improvement when the correct phone sequence was unknown. This result indicates the potential benefit from modeling ROS at a more local level, but also suggests that in order to realize the benefit, the problem of robust estimation of local ROS must be solved first.

We will address the local ROS estimation problem by using parallel rate-dependent acoustic and pronunciation models at the word level. Each word is given two groups of rate-specific pronunciations: one group of "fast" pronunciations and one group of "slow" pronunciations,² each being implemented by rate-specific phone models. The recognizer is allowed to select the fast or the slow pronunciation for each word automatically during search, based on the maximum likelihood criterion. In this way, we account for within-sentence speech rate variation, and avoid the requirement of prerecognition ROS classification. To train the rate-specific phone models, we use a duration-based ROS measure to partition the training data into rate-specific categories. Because transcriptions are available in training, robust and accurate ROS estimation is not an issue in our approach. In an experiment with a multiword-augmented dictionary, we verified the importance of modeling ROS at the word level instead of at a more global level, especially the multiword level.

¹ We use the terms "rate of speech", "speaking rate", and "speech rate" interchangeably in this paper.

² In principle, we could group the words in any number of clusters based on a certain ROS measure; a typical choice would be a trichotomy of slow, normal, and fast speech. Because of the limited amount of available training data we chose to use only two clusters, which we refer to as "slow" and "fast".

As observed by Siegler and Stern (1995), fast speech frequently produces changes in word pronunciation as well as in phone articulation. To address this, we explore a new method for modeling rate-dependent pronunciation variation. Based on the concept of a zero-length phone (Zheng et al., 2000), we enable models of some short phones to be skipped in the search without changing the phonetic contexts of their neighboring phones; thus, we are able to model the coarticulatory effects of those short phones. A data-driven algorithm is used to generate the rate-dependent pronunciation dictionary with zero-length phones automatically from alignment data. The method effectively allows words to have different pronunciations (or pronunciation probabilities) for different ROS.

The remainder of the paper is organized as follows. Section 2 introduces the ROS measure used for partitioning the training data. Section 3 reports experimental results with rate-dependent acoustic modeling and compares different training approaches. Section 4 introduces rate-dependent pronunciation modeling and reports results with rate dependency in both acoustic and pronunciation models. Section 5 addresses issues with ROS modeling in a multiword-augmented recognition system. Section 6 summarizes the work presented.
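The skipping idea described in the introduction can be illustrated with a small sketch. It is only a toy rendering of the concept, not the data-driven algorithm of Section 4: the `left-center+right` triphone notation, the `sil` padding at word edges, the phone sequence, and the set of skippable phones are all assumptions made for the example.

```python
def triphones(phones):
    """Label each phone with its left and right context (assumed l-c+r notation)."""
    padded = ["sil"] + phones + ["sil"]  # hypothetical boundary context
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def skip_variants(phones, skippable):
    """Return the full triphone sequence plus one variant per skippable phone.

    A skipped phone contributes no acoustic model of its own, but the labels
    of its neighbors are left untouched, so they still carry it as context.
    """
    base = triphones(phones)
    variants = [base]
    for i, phone in enumerate(phones):
        if phone in skippable:
            variants.append(base[:i] + base[i + 1:])
    return variants


# Hypothetical example: a three-phone word whose final /d/ may be skipped.
variants = skip_variants(["ae", "n", "d"], skippable={"d"})
```

In the reduced variant the model for /n/ is still the one trained with /d/ as its right context, which is how the coarticulatory effect of the skipped phone is retained.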

2. RATE-OF-SPEECH MEASURE

Two types of methods are typically used to estimate the ROS of an input utterance. The first is based on phone durations, which are usually obtained from phone-level segmentations via forced Viterbi alignments. When the utterance transcription is known, this duration-based method can provide robust ROS estimation (Mirghafori et al., 1996); however, when the transcription is unknown, we can only use the hypothesis from a prior recognition run, whose quality is hard to guarantee. The second method involves estimating ROS directly from the waveform or acoustic features of the input utterance. Morgan and Fosler (1998) developed a signal-processing-based measure, known as mrate, to estimate syllable rate for rapid speech detection. To achieve robustness, the computation must use a data window of sufficient length (1-2 seconds), which is generally too long for estimating local ROS. Tuerk and Young (1999) proposed another measure based on the Euclidean distance between successive feature vectors for modeling speaking rate, and showed some discriminative power in classifying fast and slow phones. However, they did not report experimental results using this measure in an automatic speech recognition (ASR) system.

Under our proposed approach, training the rate-specific models requires partitioning the training data into rate-specific categories at the word level; we therefore need the ROS for each word to be estimated locally. The output of this process should give each word in the training transcription a rate class label. We decided to use only two ROS classes, "fast" and "slow", for several reasons. First, increasing the number of ROS classes would reduce the amount of training data in each class, which is not desirable for a large vocabulary task. Second, in our method, search complexity increases rapidly with the number of ROS classes, as the number of pronunciations is proportional to the number of classes.

[Figure 1]

Because we need to compute ROS only for the training data for which transcriptions are available, it is relatively straightforward to obtain the duration of each word and its component phones by computing forced Viterbi alignments, and then applying duration-based ROS estimation methods. Absolute ROS measures, such as phones per second (PPS) and inverse mean duration (IMD) (Mirghafori et al., 1996), were used in previous work. However, we felt that these measures are not suitable for our purposes since they do not consider the fact that different phone types have different duration distributions. Figure 1 illustrates the duration distributions of five typical phonemes, /d/, /p/, /ch/, /ih/ and /ay/,³ as estimated from the training corpus. Clearly the duration distributions for different phone types differ substantially. Based on PPS or IMD as the ROS measure, words composed of short phones would seem inherently "faster" than those composed of longer phones, even when spoken at the same speaking rate. Therefore, we use a relative ROS measure, $R_W(D)$, defined as 1 minus the distribution function of the word duration considered as a random variable:

$$R_W(D) = P_W(d > D) = 1 - \sum_{d=0}^{D} P_W(d) \qquad (1)$$

where $W$ is a given word, $D$ is the duration of $W$ in frame units (the step size of the signal analysis window, 10 ms in our system), and $P_W(d)$ is the probability of that type of word having duration $d$. $R_W(D)$ is the probability of $W$ having a duration longer than $D$. The measure $R_W(D)$ always falls within the range [0,1], and can be compared between different word categories.

³ We use OGIbet for phone labeling throughout the paper.

It is interesting to note that the ROS obtained from equation (1) has a close-to-uniform distribution, since $R_W(\cdot)$ can be viewed as a histogram-equalization transform (Gonzalez and Woods, 1992) mapping the word duration $D$ to the range [0,1] with an equalized histogram. However, in practice, $P_W(d)$ is hard to estimate directly because of data sparseness. To address this problem we assume that, within a word, the duration distributions of its component subword units, such as phones, are independent of each other. Thus, the duration probability of a word equals the convolution of its component subword unit probabilities, which are easier to estimate reliably from training data. This can be formulated as

$$P_W(D) = P_1(d_1) * P_2(d_2) * \cdots * P_n(d_n) = \sum_{d_1 + d_2 + \cdots + d_n = D} \; \prod_{i=1}^{n} P_i(d_i) \qquad (2)$$

where $d_1, d_2, \ldots, d_n$ are the durations of the subunits of word $W$, and $P_i(d_i)$ are the corresponding probabilities. To partially account for dependence between nearby phones we use context-dependent subword units, specifically triphones, for the purpose of estimating $P_W(d)$. The triphone duration distributions are estimated directly from the training corpus.
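Equations (1) and (2) can be combined in a few lines of code. The sketch below is a minimal illustration under the independence assumption stated above; the function names and the two toy triphone duration distributions are hypothetical and do not come from the training corpus.

```python
def convolve(p, q):
    """Discrete convolution of two duration PMFs indexed by frame count."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def word_duration_pmf(unit_pmfs):
    """Equation (2): P_W as the convolution of the subword-unit duration PMFs."""
    pmf = [1.0]  # zero-duration "empty word", the identity for convolution
    for unit in unit_pmfs:
        pmf = convolve(pmf, unit)
    return pmf


def relative_ros(pmf, duration):
    """Equation (1): R_W(D) = 1 - sum_{d=0}^{D} P_W(d), clipped at 0."""
    return max(0.0, 1.0 - sum(pmf[: duration + 1]))


# Hypothetical triphone duration PMFs; list index = duration in 10 ms frames.
tri_a = [0.0, 0.5, 0.5]          # lasts 1 or 2 frames
tri_b = [0.0, 0.0, 0.25, 0.75]   # lasts 2 or 3 frames

p_w = word_duration_pmf([tri_a, tri_b])
ros_short_token = relative_ros(p_w, 3)  # shortest possible token of this word
ros_long_token = relative_ros(p_w, 5)   # longest possible token of this word
```

Under this definition a token that is short for its word type receives a value near 1 (fast) and a long token a value near 0 (slow), regardless of how long the word's phones are in absolute terms.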

We used the ROS measure thus defined for all word tokens in the training data. We found that 80% of sentences with five or more words have both at least one word belonging to the fastest 33% and one word belonging to the slowest 33% of all words. This suggests that in conversational speech, speech rate is usually not uniform within a sentence.
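A statistic of this kind could be computed along the following lines. This is a sketch only: the representation of a sentence as a list of per-word ROS values, the simple index-based tertile cutoffs, and the function names are assumptions, and larger ROS values are taken to mean faster words, as in equation (1).

```python
def has_mixed_rate(word_ros, slow_cut, fast_cut):
    """True if the sentence has a word in the slowest and in the fastest third."""
    return min(word_ros) <= slow_cut and max(word_ros) >= fast_cut


def mixed_rate_fraction(sentences, min_words=5):
    """Fraction of sentences with at least min_words words that mix rates.

    Assumes at least one sentence meets the length requirement.
    """
    all_ros = sorted(r for sentence in sentences for r in sentence)
    slow_cut = all_ros[len(all_ros) // 3]        # about the 33rd percentile
    fast_cut = all_ros[(2 * len(all_ros)) // 3]  # about the 67th percentile
    eligible = [s for s in sentences if len(s) >= min_words]
    mixed = sum(has_mixed_rate(s, slow_cut, fast_cut) for s in eligible)
    return mixed / len(eligible)
```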

Equation (1) can also be applied directly to subword units, thus allowing us to calculate the ROS of individual phones. This gives us an approach to study the variation of ROS at a more local level. For each word or sentence that has at least two phones, we computed the standard deviation of the ROS of all of its phones. From the definition we see that the phone ROS ranges from 0 to 1; thus, its standard deviation must also fall within [0,1]. Dividing the interval [0,1] into 100 equal-width bins, we collected histograms of this phone-level ROS standard deviation in both the within-word case and the within-sentence case on the whole training data, as depicted in Figure 2. The data suggest that the word is a better unit than the sentence for ROS modeling, since the average phone-level ROS deviation within a word is significantly smaller than within a sentence. This means that ROS is more stable at the word level than at the sentence level, and thus classifying each word as fast or slow makes more sense than classifying the entire sentence as fast or slow.

[Figure 2]
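The within-word versus within-sentence comparison can be sketched as follows. The text does not say whether the population or the sample standard deviation was used; the population form (`statistics.pstdev`) is an assumption here, as are the function names and the toy ROS values.

```python
from statistics import pstdev


def ros_spread(phone_ros):
    """Standard deviation of phone-level ROS in one word or sentence.

    Returns None for units with fewer than two phones, which are excluded.
    """
    return pstdev(phone_ros) if len(phone_ros) >= 2 else None


def histogram(values, n_bins=100):
    """Counts of values falling into n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1  # v == 1.0 goes in last bin
    return counts


# Hypothetical sentence of two words, each with two phone-level ROS values.
words = [[0.2, 0.3], [0.8, 0.9]]
word_spreads = [ros_spread(w) for w in words]
sentence_spread = ros_spread([r for w in words for r in w])
```

In this toy case each word is internally consistent while the sentence mixes a slow and a fast word, so the within-word spreads are much smaller than the within-sentence spread, which is the pattern reported for Figure 2.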


We note that the proposed ROS measure applies to individual words, and therefore does not include the duration of interword pauses that contribute to other common definitions of speech rate. The reason for this difference is that our approach aims at improving the acoustic modeling of the speech portions of the signal only, that is, the portions accounted for by word pronunciations. Furthermore, our goal is to do so by modeling the effects of ROS on the acoustic features, not by modeling ROS itself as a discriminator feature. For these reasons we are not concerned about ROS estimation per se, and have not investigated the quantitative relationship between our ROS measure and others proposed in the literature.

3. RATE-DEPENDENT ACOUSTIC MODELING

We focus on rate-dependent acoustic modeling alone, without changes to word pronunciations. In the proposed method, each word is given parallel pronunciations of "fast" and "slow" phone models. Both fast and slow pronunciations are initialized from the original rate-independent version, with a simple replacement of rate-independent phones by rate-specific phones. For example, the original rate-independent pronunciation of "WORD" is /w er d/. Consequently, the fast and slow pronunciations are /w_f er_f d_f/ and /w_s er_s d_s/, consisting of fast and slow phone models, respectively. The recognizer automatically finds the pronunciations that maximize the likelihood score during search, and thus avoids the need for ROS estimation before recognition. In addition, the search algorithm is allowed to select pronunciations of different rates across word boundaries (but not within a word), thus accounting for speech rate variation within a sentence.

The introduction of parallel ROS-specific pronunciations is reminiscent of the introduction of parallel state paths in recent work on phone hidden Markov model (HMM) topologies (Iyer et al., 1999). Parallel path HMMs aim to model the acoustic consistency of adjacent frames, emulating trajectory-based segment models (Ostendorf et al., 1996), similar to the way our models enforce ROS consistency over adjacent phones.
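The lexicon expansion described at the start of this section amounts to a simple relabeling. The sketch below assumes a dictionary mapping each word to a list of phone sequences and writes the rate tag as a `_f` or `_s` suffix; both the data layout and the suffix notation are illustrative choices, not the format used by the recognizer.

```python
def rate_specific_lexicon(lexicon, rates=("f", "s")):
    """Give every pronunciation one parallel copy per rate class.

    All phones within a copy share the same rate tag, so the rate can change
    between words but never inside a word.
    """
    expanded = {}
    for word, pronunciations in lexicon.items():
        expanded[word] = [
            [f"{phone}_{rate}" for phone in pron]
            for pron in pronunciations
            for rate in rates
        ]
    return expanded


# The example from the text: "WORD" with rate-independent phones /w er d/.
parallel = rate_specific_lexicon({"WORD": [["w", "er", "d"]]})
```

Because the fast and slow variants are ordinary dictionary entries, the decoder needs no special handling: choosing between them is part of the usual maximum likelihood search over pronunciations.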


3.1 Training rate-dependent acoustic models

Our initial experiments were performed on SRI's 1998 Hub-5 evaluation system (Weintraub et al., 1998), which uses continuous-density genonic HMMs (Digalakis et al., 1996) for acoustic modeling. The system used a multipass recognition strategy (Murveit et al., 1993). For the sake of simplicity, we ran our experiments with only the first-pass recognizer, which used gender-dependent non-crossword genonic HMMs (1,730 male genones and 1,458 female genones with 64 Gaussians per genone) and a bigram grammar with a 33,275-word vocabulary. The pronunciation dictionary was derived from the CMU version 0.4 lexicon with stress information stripped. Most words have a single entry in the CMU lexicon, while some have multiple entries for common pronunciation variants. For example, the word "was" has three entries: /w aa z/, /w ah z/, and /w ao z/. Generally speaking, the lexicon does not cover the possible pronunciation variations caused by different speaking rates. The recognizer used a two-pass (forward and backward) Viterbi beam search algorithm; in the first pass a lexical tree was used in the grammar backoff node to speed up search. Below we report results from the backward pass. The acoustic features used were 9 Mel Frequency Cepstral Coefficients (C1-C8 plus C0) with their first- and second-order derivatives, obtained from 18 filterbanks covering 300-3300 Hz in 10 ms time frames. The acoustic training set consisted of 87 hours of speech for males and 106 hours for females, from a combination of corpora: (1) Macrophone read telephone speech, (2) 3,094 conversation sides from the BBN-segmented Switchboard-1 training set (with some hand-corrections), and (3) 100 CallHome English training conversations.

We first calculated the ROS for all words in the training corpus based on the above-mentioned measure, sorted words by ROS, and then split them into fast and slow categories. The ROS threshold for splitting was selected to achieve equal amounts of training data for fast and slow speech. The training transcriptions were labeled accordingly. We then prepared a special training lexicon: word tokens labeled "fast" were given pronunciations with fast phones only, and similarly for "slow" words. In this way, we were able to train fast and slow models simultaneously using the standard Baum-Welch training procedure.

We used the DECIPHER™ genonic training tools to run standard maximum likelihood estimation (MLE) gender-dependent training, and obtained rate-dependent models with 3,233 genones for male speech and 2,501 genones for female speech. The genone clustering for rate-dependent models used the same information loss threshold as previously used for rate-independent training.

Results. We compared the rate-dependent acoustic model with the rate-independent one (the baseline system) on a development subset of the 1998 Hub-5 evaluation data, consisting of 1,143 sentences from 20 speakers (9 male, 11 female). The first two rows of Table 1 show the WER for the two models. Note that all results reported here are based on speaker-independent within-word triphone acoustic models and a bigram language model, and are therefore not comparable to those for the full evaluation system.

[Table 1]

Rate-dependentmodelingyieldsanabsoluteWERredu significant( p