Combining Information Sources for Confidence Estimation with CRF Models

M.S. Seigel, P.C. Woodland {mss46,pcw}@eng.cam.ac.uk

Interspeech 2011, Florence
Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK

Outline

• Introduction
• CRF models for Confidence Estimation
• Predictor Features from Lattices
• Experiments
• Conclusion


Introduction: Overview

What is confidence estimation (CE)?
• ASR systems generate a "most likely" transcription, which is errorful
• How confident are we that the transcription is correct?
• CE is the task of estimating this measure of confidence

Goals of this work:
• Develop a flexible classification-based CE framework
• Investigate the utility of various predictor features
• Investigate the effect of combining different information sources
• Investigate potential improvements for utterances

We propose using a CRF to combine predictor features from various information sources to generate confidence scores for words in a sequence.



CRF models for Confidence Estimation: CRF Overview

[Figure: linear-chain CRF with label sequence <s>, c1, c2, …, cn over the observed word sequence w1, w2, …, wn]

How can CRFs be applied to CE?
• Given a set of predictor features per word wi (the observations)
• Assign the label Correct or Incorrect to each word ci

Why CRFs? Interesting characteristics:
• CRFs are sequential models (word sequences, "runs" of errors)
• CRFs are insensitive to highly correlated inputs
• Arbitrary feature functions can be defined

CRFs for Confidence Estimation: CRF Formulation

• CRFs were introduced as a discriminative model for sequence data (Lafferty et al., 2001)
• The CRF models the posterior probability of the label sequence c conditioned on the observation sequence w:

  pθ(c|w) ∝ exp( Σk λk tk(c, w) + Σj μj gj(c, w) )

• with "transition features" tk(c, w) and "emission features" gj(c, w)

[Figure: linear-chain CRF with transition weights λt between labels <s>, c1, c2, …, cn and emission weights μg connecting each label ci to its word wi]

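The formulation above can be sketched numerically. The toy example below uses invented weights (not the paper's learned parameters) and a single hand-picked emission predictor per word, and computes pθ(c|w) for a two-label {Correct, Incorrect} chain by brute-force normalisation:

```python
import itertools
import math

# Toy linear-chain CRF over labels C (Correct) and I (Incorrect).
# Weights are invented for illustration; in the paper they are
# learned by maximising the conditional log-likelihood.
LABELS = ["C", "I"]
trans_w = {("C", "C"): 0.8, ("C", "I"): -0.3,     # lambda_k
           ("I", "C"): -0.3, ("I", "I"): 0.8}
emit_w = {"C": 2.0, "I": -2.0}                     # mu_j

def score(labels, obs):
    """Unnormalised log-score: transition plus emission features."""
    s = sum(trans_w[(a, b)] for a, b in zip(labels, labels[1:]))
    # Emission feature: label weight times the centred predictor value,
    # so low word posteriors favour the Incorrect label.
    s += sum(emit_w[c] * (x - 0.5) for c, x in zip(labels, obs))
    return s

def posterior(labels, obs):
    """p(c|w) by brute-force normalisation over all label sequences."""
    z = sum(math.exp(score(seq, obs))
            for seq in itertools.product(LABELS, repeat=len(obs)))
    return math.exp(score(labels, obs)) / z

obs = [0.9, 0.8, 0.1]               # e.g. word posteriors from a lattice
p = posterior(("C", "C", "I"), obs)
```

Brute-force enumeration is exponential in the sequence length; the forward-backward recursion described later gives the same quantities efficiently.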

CRFs for Confidence Estimation: Feature Representation

[Figure: linear-chain CRF, as before, with the emission feature functions g between each label ci and word wi highlighted]

• CRFs classically use discrete predictor/input features and feature functions (moment constraints)
• Continuous predictor features can be represented by defining the (g) feature functions accordingly:
  • Quantisation with binning
  • Spline features and distribution constraints (Yu et al., 2009)
• Spline feature functions outperformed binning
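Quantisation with binning can be sketched as follows: a continuous predictor value is mapped to a one-hot indicator vector over bins, giving discrete feature functions a CRF can consume. The bin edges are illustrative, not taken from the paper:

```python
# Quantisation with binning: map a continuous predictor feature to a
# one-hot indicator vector over bins.

def bin_features(x, edges):
    """One-hot indicator over len(edges)+1 bins.

    edges: sorted interior bin boundaries; values below edges[0] fall
    in bin 0, values >= edges[-1] fall in the last bin.
    """
    idx = sum(x >= e for e in edges)      # index of the containing bin
    feats = [0.0] * (len(edges) + 1)
    feats[idx] = 1.0
    return feats

edges = [0.2, 0.4, 0.6, 0.8]              # 5 bins over [0, 1]
f = bin_features(0.73, edges)             # one-hot with bin 3 active
```

Spline feature functions replace this hard assignment with a smooth interpolation between knots, which is why they outperform plain binning.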

CRFs for Confidence Estimation: Applying the CRF Model for CE

Training:
• Model parameters θ = (λ1, …, λk, μ1, …, μj) are estimated by maximising the conditional log-likelihood
• A gradient-based technique (L-BFGS) is used to optimise

Test:
• Given a sequence of word-level predictor features for an utterance, assign a corresponding sequence of Correct/Incorrect tags
• Word-level marginal probabilities for the Correct tag are calculated during the forward-backward pass
• These word-level marginals are the confidence scores assigned to each word

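The test-time marginal computation can be sketched with a small forward-backward pass. The potentials below are illustrative stand-ins, not the paper's trained model:

```python
import math

# Word-level marginals p(c_i = Correct | w) via forward-backward on a
# toy linear-chain CRF with illustrative potentials.
LABELS = ["C", "I"]

def potentials(obs):
    """Per-position emission and pairwise transition potentials."""
    emit = [{"C": math.exp(2.0 * (x - 0.5)),
             "I": math.exp(-2.0 * (x - 0.5))} for x in obs]
    trans = {(a, b): math.exp(0.8 if a == b else -0.3)
             for a in LABELS for b in LABELS}
    return emit, trans

def marginals(obs):
    emit, trans = potentials(obs)
    n = len(obs)
    # Forward pass: alpha[i][c] sums paths ending at position i, label c.
    alpha = [{c: emit[0][c] for c in LABELS}]
    for i in range(1, n):
        alpha.append({c: emit[i][c] * sum(alpha[-1][p] * trans[(p, c)]
                                          for p in LABELS)
                      for c in LABELS})
    # Backward pass: beta[i][c] sums paths leaving position i, label c.
    beta = [None] * n
    beta[n - 1] = {c: 1.0 for c in LABELS}
    for i in range(n - 2, -1, -1):
        beta[i] = {c: sum(trans[(c, nc)] * emit[i + 1][nc] * beta[i + 1][nc]
                          for nc in LABELS)
                   for c in LABELS}
    z = sum(alpha[n - 1][c] for c in LABELS)
    # The Correct-tag marginal at each position is the confidence score.
    return [alpha[i]["C"] * beta[i]["C"] / z for i in range(n)]

conf = marginals([0.9, 0.8, 0.1])   # one confidence score per word
```

Each marginal already lies in [0, 1], so it can be emitted directly as the word's confidence score without further calibration.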


Predictor Features from Lattices: One-best case

• Extract features for words in the 1-best hypothesis from the corresponding lattice
• Define a set I of lattice arcs a which intersect each word i in the 1-best hypothesis r

[Figure: lattice showing the reference arc in r (one-best), the arcs in set I for the current word, and the median intersection point]

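Constructing the set I can be sketched as a time-overlap test. The flat (word, start, end, posterior) arc representation below is a simplification of a real lattice:

```python
# Build the set I of lattice arcs that time-intersect a 1-best word.
# Arcs are (word, start, end, posterior) tuples; times are seconds.

def intersecting_arcs(word_start, word_end, arcs):
    """Return arcs whose time span overlaps [word_start, word_end)."""
    return [a for a in arcs
            if a[1] < word_end and a[2] > word_start]

arcs = [("cat", 0.00, 0.40, 0.6),
        ("hat", 0.05, 0.45, 0.3),
        ("sat", 0.50, 0.90, 0.8)]             # hypothetical arcs
I = intersecting_arcs(0.00, 0.40, arcs)       # first two arcs intersect
```

The same routine serves the hypothesis-injection case: only the source of the reference word timings changes.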

Predictor Features from Lattices: Hypothesis Injection

• Extract features for words in an alternative hypothesis from a lattice
• A simulated/injected reference hypothesis is found which best matches the alternative hypothesis

[Figure: lattice showing the reference arc in r (simulated), the arcs in set I for the current word, and the median intersection point]


Predictor Features from Lattices: Predictor Feature Definition

• Given the set of lattice arcs I for each word, a set of features is computed
• The lattice arc posterior ratio (LAPR):

  LAPR(ri) = ( Σ_{a∈I} posterior of word-matched arc a ) / ( Σ_{a∈I} posterior of arc a )

• The Lattice Acoustic Stability (LAS):

  LAS(ri) = ( Σ_{GSF} #word-matched arcs at that GSF in the top % ) / ( #GSF × #arcs in the top % )

• The alignment feature (Levenshtein Alignment):

  LA(ri) = δ(ri, wi)
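The LAPR can be sketched directly from its definition: the posterior mass of intersecting arcs carrying the same word, over the total posterior mass of all intersecting arcs. The (word, posterior) arc representation and values here are hypothetical:

```python
# Lattice arc posterior ratio (LAPR) for one 1-best word, given the
# set I of intersecting arcs as (word_label, posterior) pairs.

def lapr(word, arcs):
    """Posterior mass of word-matched arcs over total posterior mass."""
    total = sum(p for _, p in arcs)
    matched = sum(p for w, p in arcs if w == word)
    return matched / total if total > 0 else 0.0

# Arcs intersecting the 1-best word "cat" (hypothetical posteriors)
arcs = [("cat", 0.55), ("cat", 0.15), ("hat", 0.20), ("at", 0.10)]
score = lapr("cat", arcs)   # 0.70
```

A value near 1 means most of the competing lattice mass agrees with the 1-best word, so the word is likely correct.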


Experiments: Setup

Results reported on the 2010 Cambridge Arabic ASR system:
• Decoding: P1+P2 lattice generation, P3 adapt and rescore
• The P2 system used is a graphemic word-based system
• The P3 system is the result of rescoring with phonetic acoustic models
• A similar morpheme-based pair was used for the ROVER experiments

Subsets of 2009/2010 GALE data used:

  Usage    | Name                   | Size (hours)
  Training | dev10c+dev10r+dev09sub | 27.5
  Test     | dev10d                 | 18.6
  Test     | eval09ns               | 6.5


Experiments: Evaluation

• The standard word-level CE measure, the Normalised Cross Entropy (NCE) score, is used
• Performance over utterances is also of interest
• Consider the utterance-level mean absolute deviation (UMAD):

  UMAD = (1/S) Σ_{i=1}^{S} | (1/Li) Σ_{j=1}^{Li} cij − (1/Li) Σ_{j=1}^{Li} αij |

  where cij is the ideal confidence value and αij is the confidence score
• UMAD relates the average confidence score to the empirical error rate on a per-utterance basis

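The UMAD can be sketched as follows, taking the ideal confidence of a word to be 1 if it is correct and 0 otherwise; the data values are hypothetical:

```python
# Utterance-level mean absolute deviation (UMAD): for each utterance,
# compare the mean assigned confidence with the mean ideal confidence
# (1 for correct words, 0 for errors), then average the absolute
# differences over utterances.

def umad(utterances):
    """utterances: list of (ideal, scores) pairs, where ideal is a
    list of 0/1 correctness labels and scores the confidence values."""
    devs = []
    for ideal, scores in utterances:
        n = len(ideal)
        devs.append(abs(sum(ideal) / n - sum(scores) / n))
    return sum(devs) / len(devs)

# Two hypothetical utterances
data = [([1, 1, 0], [0.9, 0.8, 0.4]),   # mean ideal 2/3, mean score 0.70
        ([1, 0], [0.6, 0.5])]           # mean ideal 0.5, mean score 0.55
d = umad(data)
```

A lower UMAD means the average confidence per utterance tracks the empirical correctness rate more closely.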

Experiments: 1-Best Confidence Annotation

• Annotating the 1-best hypothesis from a lattice using features from that lattice; the LAPR and LAS predictor features are available in this case
• Improvements in NCE score over the baseline (effect of sequence information)
• Improvements in UMAD (scores better over sequences)

  dev10d (30.8%)
  Feat | Sys | NCE   | UMAD
  LAPR | DT  | 0.358 | 11.33
  LAPR | CRF | 0.367 | 10.26

• Further improvements on the above with the combination of LAPR+LAS



Experiments: Alternative Hypothesis Confidence Estimation

• Task is to annotate a hypothesis from a source other than the lattice
• P3 consensus hypothesis experiments are the most interesting
• CNP-only: improvements due to sequence modelling
• LAPR: outperforms CNP (30% rel. on NCE, 2% abs. on UMAD)
• All: NCE gain 18% rel., UMAD gain 1.5% abs.

  dev10d (27.3%)
  Sys | Feat | NCE   | UMAD
  DT  | CNP  | 0.223 | 14.01
  CRF | CNP  | 0.247 | 12.78
  CRF | LAPR | 0.325 | 10.64
  DT  | all  | 0.297 | 11.79
  CRF | all  | 0.353 | 10.30



Experiments: Application to ROVER Combination

• Two-way ROVER experiments, with the word/morphemic P3 systems
• CNP-only: improvements in NCE
• All: 1% relative WER gain, NCE gain avg. 28% rel., UMAD gain avg. 1.2% abs.

  dev10d
  Sys | Feat | WER  | NCE   | UMAD
  DT  | CNP  | 25.0 | 0.130 | 15.06
  CRF | CNP  | 25.1 | 0.175 | 13.65
  DT  | all  | 25.1 | 0.220 | 12.80
  CRF | all  | 24.8 | 0.289 | 10.91



Conclusion

The CRF modelling framework allows us to:
• Simply combine multiple predictor features (from different sources)
• Reap the rewards of sequential modelling
• Use accurate representations of predictor features
• Easily extend the model for further experimentation

This framework ultimately yields improved confidence measures.

We have also:
• Described how accurate predictor features may be extracted from recognition lattices (e.g. LAPR)
• Shown that combining different sources of information yields the largest gains
• Shown that the improved scores may be applied successfully (ROVER)
