Combining Information Sources for Confidence Estimation with CRF Models
M.S. Seigel, P.C. Woodland {mss46,pcw}@eng.cam.ac.uk
Interspeech 2011, Florence
Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK
Outline
• Introduction
• CRF models for Confidence Estimation
• Predictor Features from Lattices
• Experiments
• Conclusion
Introduction Overview
• What is confidence estimation (CE)?
  • ASR systems generate the "most likely" transcription, which is errorful
  • How confident are we that the transcription is correct?
  • CE is the task of estimating this measure of confidence
• Goals of this work:
  • Develop a flexible classification-based CE framework
  • Investigate the utility of various predictor features
  • Investigate the effect of combining different information sources
  • Investigate potential improvements over utterances
• Propose using a CRF to combine predictor features from various information sources to generate confidence scores for words in a sequence
CRF models for Confidence Estimation CRF Overview
[Figure: linear-chain CRF — label nodes c1, c2, ..., cn above observation nodes w1, w2, ..., wn]

• How can CRFs be applied to CE?
  • Given a set of predictor features per word wi (observations)
  • Assign the label Correct or Incorrect to each word, ci
• Why CRFs? Interesting characteristics:
  • CRFs are sequential models (word sequences, "runs" of errors)
  • CRFs are insensitive to highly correlated inputs
  • Definition of arbitrary feature functions is possible
CRFs for Confidence Estimation CRF Formulation
• CRFs were introduced as a discriminative model for sequence data (Lafferty et al., 2001)
• The CRF models the posterior probability of the label sequence c conditioned on the observation sequence w:

  p_θ(c|w) ∝ exp( Σ_k λ_k t_k(c, w) + Σ_j μ_j g_j(c, w) )

• With "transition features" t_k(c, w) and "emission features" g_j(c, w)

[Figure: linear-chain CRF with transition weights λ_t between labels c1, ..., cn and emission weights μ_g linking each ci to wi]
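As a toy illustration of this formulation (not code from the paper), a brute-force Python sketch of p_θ(c|w) over Correct/Incorrect tags; the weights and the single continuous observation per word are made up for the example:

```python
import math
from itertools import product

# Toy linear-chain CRF over binary tags: C (Correct) / I (Incorrect).
# Weights below are illustrative, not values from the paper.
LABELS = ["C", "I"]
lam = {("C", "C"): 0.8, ("C", "I"): -0.4,
       ("I", "C"): -0.4, ("I", "I"): 0.6}      # transition weights (lambda)
mu = {"C": 2.0, "I": -2.0}                     # emission weights (mu)

def log_score(labels, obs):
    """Unnormalised log-score: sum of emission and transition features."""
    s = sum(mu[c] * x for c, x in zip(labels, obs))
    s += sum(lam[a, b] for a, b in zip(labels, labels[1:]))
    return s

def posterior(labels, obs):
    """p_theta(c|w): normalise by summing over all label sequences.
    (Brute force for clarity; real CRFs use forward-backward.)"""
    Z = sum(math.exp(log_score(list(seq), obs))
            for seq in product(LABELS, repeat=len(obs)))
    return math.exp(log_score(labels, obs)) / Z
```

For example, `posterior(["C", "C", "I"], [0.9, 0.8, 0.1])` is the probability that the first two words are correct and the third wrong, given one predictor value per word.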
CRFs for Confidence Estimation Feature Representation
[Figure: linear-chain CRF as on the previous slide]

• CRFs classically use discrete predictor/input features and feature functions (moment constraints)
• Continuous predictor features can be represented by defining the (g) feature functions accordingly:
  • Quantisation with binning
  • Spline features and distribution constraints (Yu et al., 2009)
• Spline feature functions outperformed binning
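A hedged sketch of the two representations for a continuous predictor; the bin edges and knots are illustrative, and the spline version is a simplified piecewise-linear stand-in for the construction in Yu et al. (2009):

```python
def bin_features(x, edges):
    """Quantisation with binning: map a continuous predictor x to a one-hot
    indicator over len(edges)+1 bins (classic moment-constraint features)."""
    feats = [0] * (len(edges) + 1)
    feats[sum(1 for e in edges if x >= e)] = 1
    return feats

def spline_features(x, knots):
    """Piecewise-linear features: x is split between its two neighbouring
    knots, so the continuous value is preserved rather than coarsened
    (a simplified illustration of spline features)."""
    feats = [0.0] * len(knots)
    for i in range(len(knots) - 1):
        a, b = knots[i], knots[i + 1]
        if a <= x <= b:
            t = (x - a) / (b - a)
            feats[i], feats[i + 1] = 1.0 - t, t
            break
    return feats
```

For x = 0.6, binning gives the single indicator [0, 0, 1, 0] over edges [0.25, 0.5, 0.75], while the spline form over knots [0.0, 0.5, 1.0] spreads weight 0.8/0.2 over the two neighbouring knots, which is why it loses less information.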
CRFs for Confidence Estimation Applying the CRF Model for CE
• Training:
  • Model parameters θ = (λ1, ..., λk, μ1, ..., μj) are estimated by maximising the conditional log-likelihood
  • A gradient-based technique (L-BFGS) is used to optimise
• Test:
  • Given a sequence of word-level predictor features for an utterance, assign a corresponding sequence of Correct/Incorrect tags
  • Word-level marginal probabilities for the Correct tag are calculated during the forward-backward pass
  • These word-level marginals are the confidence scores assigned to each word
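The test-time marginal computation can be sketched as a standard forward-backward pass in log space (an illustration, not the paper's implementation; scores here are arbitrary log-potentials):

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def word_marginals(emis, trans):
    """Forward-backward over a linear-chain CRF in log space.
    emis[t][s]: emission score of label s at word t; trans[a][b]: transition
    score from label a to label b. Returns per-word label marginals; the
    marginal of the Correct label is the word's confidence score."""
    n, S = len(emis), len(emis[0])
    fwd = [list(emis[0])]
    for t in range(1, n):
        fwd.append([emis[t][b] + logsumexp([fwd[-1][a] + trans[a][b]
                                            for a in range(S)])
                    for b in range(S)])
    bwd = [[0.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [logsumexp([trans[a][b] + emis[t + 1][b] + bwd[t + 1][b]
                             for b in range(S)])
                  for a in range(S)]
    logZ = logsumexp(fwd[-1])
    return [[math.exp(fwd[t][s] + bwd[t][s] - logZ) for s in range(S)]
            for t in range(n)]
```

Each row of the result sums to one, so element 0 of row t can be read off directly as the confidence that word t is Correct.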
Predictor Features from Lattices One-best case
• Extract features for words in the 1-best from the corresponding lattice
• Define a set I of lattice arcs a which intersect each word i in the 1-best hypothesis r

[Figure: lattice showing reference arcs in r (one-best), the arcs in set I for the current word, and the median intersect]
Predictor Features from Lattices Hypothesis Injection
• Extract features for words in an alternative hypothesis from a lattice
• A simulated/injected reference hypothesis is found which best matches the alternative hypothesis

[Figure: lattice showing simulated reference arcs in r, the arcs in set I for the current word, and the median intersect]
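The arc set I can be sketched as a simple time-overlap query, assuming a simplified (start, end, word, posterior) tuple representation for lattice arcs (the real lattice format differs):

```python
def arcs_intersecting(arcs, start, end):
    """The set I for one hypothesis word: all lattice arcs whose time span
    overlaps the word's span [start, end). Arcs are (start, end, word,
    posterior) tuples -- an assumed, simplified lattice representation."""
    return [a for a in arcs if a[0] < end and a[1] > start]
```

A word spanning [0.0, 0.5) would collect every arc that is at least partially live in that interval, regardless of its label.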
Predictor Features from Lattices Predictor Feature Definition
• Given the set of lattice arcs I for each word, a set of features is computed
• The lattice arc posterior ratio (LAPR):

  LAPR(r_i) = ( Σ_{a∈I} posterior of word-matched arcs ) / ( Σ_{a∈I} posterior of arc a )

• The lattice acoustic stability (LAS):

  LAS(r_i) = ( Σ_GSF #word-matched arcs at GSF in top % ) / ( #GSF × #arcs in top % )

• The alignment feature (Levenshtein alignment):

  LA(r_i) = δ(r_i, w_i)
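Given such a set I, LAPR and the alignment feature reduce to a few lines; LAS is omitted since it additionally sweeps grammar scale factors. The (start, end, word, posterior) arc tuples are an assumed representation, not the paper's:

```python
def lapr(arcs_I, word):
    """Lattice arc posterior ratio: posterior mass of arcs in I carrying the
    hypothesis word, over the total posterior mass of I."""
    total = sum(p for (_, _, w, p) in arcs_I)
    matched = sum(p for (_, _, w, p) in arcs_I if w == word)
    return matched / total if total > 0 else 0.0

def la(hyp_word, aligned_ref_word):
    """Levenshtein alignment feature LA(r_i) = delta(r_i, w_i): 1 if the
    hypothesis word matches the word it aligns to, else 0."""
    return 1 if hyp_word == aligned_ref_word else 0
```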
Experiments Setup
• Results reported on the 2010 Cambridge Arabic ASR system
  • Decoding: P1+P2 lattice generation, P3 adapt and rescore
  • The P2 system used is a graphemic word-based system
  • The P3 system is the result of rescoring with phonetic acoustic models
  • A similar morpheme-based pair is used for ROVER experiments
• Subsets of 2009/2010 GALE data used:

  Usage    | Name                   | Size (hours)
  Training | dev10c+dev10r+dev09sub | 27.5
  Test     | dev10d                 | 18.6
  Test     | eval09ns               | 6.5
Experiments Evaluation
• The standard word-level CE measure, the normalised cross entropy (NCE) score, is used
• Performance over utterances is also of interest
• Consider the utterance-level mean absolute deviation (UMAD):

  UMAD = (1/S) Σ_{i=1..S} (1/L_i) | Σ_{j=1..L_i} c_ij − Σ_{j=1..L_i} α_ij |

• Where c_ij is the ideal confidence value and α_ij is the confidence score
• UMAD relates the average confidence score to the empirical error rate on a per-utterance basis
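Both measures can be sketched directly from their definitions (an illustration; `confs` are confidence scores α and `correct` are the 0/1 ideal confidences c):

```python
import math

def nce(confs, correct):
    """Normalised cross entropy: how much better the confidences predict
    correctness than a constant guess at the base rate.
    NCE = (H_max - H) / H_max; higher is better."""
    pc = sum(correct) / len(correct)          # empirical P(correct)
    h_max = -sum(math.log2(pc) if c else math.log2(1 - pc) for c in correct)
    h = -sum(math.log2(p) if c else math.log2(1 - p)
             for p, c in zip(confs, correct))
    return (h_max - h) / h_max

def umad(utt_confs, utt_correct):
    """Utterance-level mean absolute deviation: per utterance i, the absolute
    difference between summed confidences and the number of correct words,
    scaled by 1/L_i, averaged over the S utterances."""
    return sum(abs(sum(a) - sum(c)) / len(a)
               for a, c in zip(utt_confs, utt_correct)) / len(utt_confs)
```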
Experiments 1-Best Confidence Annotation
• Annotating the 1-best hypothesis from a lattice using features from that lattice
• LAPR and LAS predictor features are available in this case
• Improvements in NCE score over the baseline (effect of sequence information)
• Improvements in UMAD (scores better over sequences)

  dev10d (30.8%)
  Sys | Feat | NCE   | UMAD
  DT  | LAPR | 0.358 | 11.33
  CRF | LAPR | 0.367 | 10.26

• Further improvements on the above with the combination of LAPR+LAS
Experiments Alternative Hypothesis Confidence Estimation

• The task is to annotate a hypothesis from a source other than the lattice
• P3 consensus hypothesis experiments are the most interesting
• CNP-only: improvements due to sequence modelling
• LAPR: outperforms CNP (30% rel. on NCE, 2% abs. on UMAD)
• All: NCE gain 18% rel., UMAD gain 1.5% abs.

  dev10d (27.3%)
  Sys | Feat | NCE   | UMAD
  DT  | CNP  | 0.223 | 14.01
  CRF | CNP  | 0.247 | 12.78
  CRF | LAPR | 0.325 | 10.64
  DT  | all  | 0.297 | 11.79
  CRF | all  | 0.353 | 10.30
Experiments Application to ROVER Combination

• Two-way ROVER experiments, with word/morphemic P3 systems
• CNP-only: improvements in NCE
• All: 1% relative WER gain, NCE gain avg 28% rel., UMAD gain avg 1.2% abs.

  dev10d
  Sys | Feat | WER  | NCE   | UMAD
  DT  | CNP  | 25.0 | 0.130 | 15.06
  CRF | CNP  | 25.1 | 0.175 | 13.65
  DT  | all  | 25.1 | 0.220 | 12.80
  CRF | all  | 24.8 | 0.289 | 10.91
Conclusion
• The CRF modelling framework allows us to:
  • Simply combine multiple predictor features (from different sources)
  • Reap the rewards of sequential modelling
  • Use accurate representations of predictor features
  • Easily extend the model for further experimentation
• This framework ultimately yields improved confidence measures
• We have also:
  • Described how accurate predictor features may be extracted from recognition lattices (e.g. LAPR)
  • Shown that combining different sources of information yields the largest gains
  • Shown that the improved scores may be applied successfully (ROVER)