Discriminative Training for Segmental Minimum Bayes Risk Decoding

Vlasios Doumpiotis, Stavros Tsakalidis, Bill Byrne
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, The Johns Hopkins University

Segmental Minimum Bayes Risk Decoding (SMBR)

• Lattices are segmented into sequences of separate decision problems involving small sets of confusable words
• Separate sets of acoustic models, specialized to discriminate between the competing words in these classes, are applied in subsequent SMBR decoding passes
• Results in a refined search space that allows the use of specialized discriminative models
• Improvement in performance over MMI


Review of MAP Decoding vs. Minimum Bayes-Risk Decoders

MAP decoding, given an utterance A, produces the sentence hypothesis
$\hat{W} = \arg\max_{W \in \mathcal{W}} P(W|A)$

MAP is the optimum decoding criterion when performance is measured under the Sentence Error Rate criterion. For other criteria, such as Word Error Rate, other decoding schemes may be better. If L(W, W') is the loss function between word strings W and W', the MBR recognizer seeks the optimal hypothesis as
$\hat{W} = \arg\min_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} L(W, W') \, P(W|A)$

Minimum Bayes-Risk decoders attempt to find the sentence hypothesis with the least expected error under a given task-specific loss function. If L(W, W') is the 0/1 loss function, MBR decoding reduces to MAP decoding.
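To make the contrast concrete, here is a minimal sketch (not from the slides) of MBR rescoring over an N-best list, assuming hypothesis posteriors P(W|A) are already available; word-level edit distance plays the role of L(W, W').

```python
# Minimal MBR rescoring over an N-best list (illustrative sketch, not the
# authors' implementation). Each hypothesis is a list of words paired with
# its posterior probability P(W|A).

def levenshtein(ref, hyp):
    """Word-level edit distance, used here as the loss L(W, W')."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def mbr_decode(nbest):
    """nbest: list of (word_list, posterior). Returns the hypothesis with
    minimum expected loss; MAP decoding would instead return the entry
    with the largest posterior."""
    return min(
        nbest,
        key=lambda cand: sum(p * levenshtein(w, cand[0]) for w, p in nbest),
    )[0]

# MAP picks the single most probable string ("A B C", posterior 0.4), while
# MBR prefers "A B D", which agrees better with the probable competitors.
nbest = [("A B C".split(), 0.40), ("A B D".split(), 0.35), ("A E D".split(), 0.25)]
print(mbr_decode(nbest))
```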

Segmental Minimum Bayes Risk Decoding

SMBR addresses the MBR search problem over very large lattices.
• Each word string in the lattice is segmented into N substrings: $W = W_1 \cdots W_N$
• This effectively segments the lattice as well: $\mathcal{W} = \mathcal{W}_1 \cdots \mathcal{W}_N$
• Given a specific lattice segmentation, the MBR hypothesis can then be obtained through a sequence of independent MBR decision rules:



" ˆ i = minW !∈W ! W i W ∈Wi L(W, W )Pi (W |A)

Lattice Segmentation and Pinching

• Every path in the lattice is aligned to the MAP hypothesis
• Low- and high-confidence regions are identified
• High-confidence regions: retain only the MAP hypothesis
• The word order of the original lattice is preserved
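As a rough illustration of the idea (much simplified relative to the risk-based lattice cutting in the slides), the sketch below assumes every lattice path has already been aligned word-by-word against the MAP hypothesis and uses agreement mass as the confidence measure; the threshold and data layout are hypothetical.

```python
# Much-simplified view of cutting and pinching. The path-to-MAP alignment,
# which is the hard part, is assumed to be done already.

def pinch(map_hyp, aligned_paths, posteriors, confidence=0.9):
    """map_hyp: list of MAP words. aligned_paths: paths aligned position-by-position
    to map_hyp (same length). posteriors: path posteriors. Returns a list of segment
    sets: high-confidence positions keep only the MAP word; low-confidence positions
    keep the set of competing words, preserving the original word order."""
    segments = []
    for i, map_word in enumerate(map_hyp):
        # Posterior mass of paths that agree with the MAP word at position i.
        agree = sum(p for path, p in zip(aligned_paths, posteriors) if path[i] == map_word)
        if agree >= confidence:
            segments.append({map_word})                            # pinched region
        else:
            segments.append({path[i] for path in aligned_paths})   # confusable set
    return segments

map_hyp = ["B", "NINE", "A"]
paths = [["B", "NINE", "A"], ["V", "NINE", "A"], ["B", "NINE", "8"]]
print(pinch(map_hyp, paths, [0.5, 0.3, 0.2]))
# e.g. [{'B', 'V'}, {'NINE'}, {'A', '8'}]
```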

Lattice Cutting and Pinching

[Figure: an example alphadigit lattice before and after cutting and pinching. Every path is aligned to the MAP hypothesis; high-confidence words (e.g. SILENCE, NINE, OH) are pinched down to the MAP word, while low-confidence regions are kept as small numbered confusion sets such as {A, J}, {A, 8}, and {V, B}.]

Objectives
1. Identify potential errors in the MAP hypothesis
2. Derive a new search space for subsequent decoding passes

Regions of low confidence:
• Models will be trained to fix the errors in the MAP hypothesis
• The search space contains portions of the MAP hypothesis plus alternatives

Regions of high confidence:
• The search space is restricted to the MAP hypothesis

Because the structure of the original lattice is retained, we can perform acoustic rescoring over this pinched lattice.
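A hedged sketch of what that rescoring pass could look like: `score` stands in for an acoustic likelihood computation and `specialized_models` for the per-confusion-set model table, neither of which is specified in the slides.

```python
# Rescoring a pinched lattice with specialized models (sketch only).

def rescore_pinched_lattice(segments, audio_segments, specialized_models, score, base_model):
    """segments: list of sets of candidate words (singletons in pinched regions).
    audio_segments: the stretch of audio spanned by each segment set.
    score(model, audio, word): hypothetical acoustic log-likelihood function."""
    output = []
    for words, audio in zip(segments, audio_segments):
        if len(words) == 1:
            # High-confidence region: the search space is just the MAP word.
            output.append(next(iter(words)))
            continue
        # Low-confidence region: apply the model set trained for this confusion
        # set if one exists, otherwise fall back to the baseline models.
        model = specialized_models.get(frozenset(words), base_model)
        output.append(max(words, key=lambda w: score(model, audio, w)))
    return output
```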


Minimum Error Estimation for SMBR

Suppose we have a labeled training set (A, W). A reasonable approach to estimation for an MBR decoder is
$\min_\theta \sum_{W'} L(W, W') \, P(W'|A; \theta)$

Note that if L is the 0/1 loss function, MMI results: minimizing $\sum_{W' \neq W} P(W'|A; \theta) = 1 - P(W|A; \theta)$ is equivalent to $\max_\theta P(W|A; \theta)$.

How does this change for SMBR? If we assume that each segment set contains single-word hypotheses and the loss function is binary, then we can treat the estimation problem for each segment set separately:
$\min_{\theta_i} \sum_{W' \in \mathcal{W}_i} L(W_i, W') \, P_i(W'|A; \theta_i)$
The problem simplifies to separate MMI estimation procedures for the small-vocabulary ASR problems identified in the segmented lattices.
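As an illustration of what "separate MMI estimation per segment set" means, the sketch below computes the per-set conditional log-likelihood that such training would increase; `loglik` and `logprior` are hypothetical stand-ins for the acoustic and prior scores, not part of the authors' system.

```python
import math

# Per-segment-set MMI criterion (sketch). For a binary confusion set such as
# {B, V}, training raises the posterior of the correct word against its competitor.

def segment_mmi_objective(examples, confusion_set, loglik, logprior):
    """examples: list of (audio_segment, correct_word) drawn from training lattices
    whose segment set equals confusion_set. loglik(audio, w) ~ log p(A | w; theta_i).
    Returns the summed log posterior of the correct word, the quantity MMI increases."""
    total = 0.0
    for audio, correct in examples:
        scores = {w: loglik(audio, w) + logprior(w) for w in confusion_set}
        # A real implementation would use log-sum-exp here for numerical stability.
        log_denominator = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[correct] - log_denominator
    return total
```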

Iterative SMBR Estimation and Decoding

Our goal is to develop a joint estimation and decoding procedure that improves over MMI.

1. Generate lattices, initially with MMI acoustic models
2. Segment and pinch lattices
3. Identify errors
4. Train sets of models to resolve the errors
5. Rescore the pinched lattices using the models tuned to fix the errors in each segment set
6. Repeat... (a schematic sketch of this loop follows below)
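A schematic, control-flow-only sketch of the loop above; every stage is passed in as a callable because the slides do not specify the implementation of any of them.

```python
# Driver for the iterative SMBR estimation/decoding loop (sketch). The stage
# functions are placeholders supplied by the caller, not a real API.

def iterative_smbr(data, models, generate_lattices, cut_and_pinch,
                   train_specialized_models, rescore, n_iterations=3):
    """models: the MMI-trained acoustic models used to start the loop (step 1)."""
    hypotheses = []
    for _ in range(n_iterations):
        lattices = generate_lattices(data, models)                  # step 1
        pinched = [cut_and_pinch(lat) for lat in lattices]          # steps 2-3: segment,
                                                                    # pinch, expose errors
        models = train_specialized_models(pinched, data, models)    # step 4
        hypotheses = [rescore(lat, models) for lat in pinched]      # step 5
    return models, hypotheses                                       # step 6: repeat above
```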

We need to establish that:
• Lattice cutting finds segment sets similar to the dominant confusion pairs observed in decoding
• The segment sets identified in the test set are also found consistently in the training set

Put differently, does the decoder behave the same on the training set as on the test set?


Dominant Confusion Sets in MMI Decoding



• HTK baseline: whole-word models, MFCCs, 12-mixture Gaussian HMMs, ATT FSM decoder
• 46,730 training utterances, 3,112 test utterances

Ten Most Frequent Confusion Sets Found by Lattice Cutting

    Test    Count      Training    Count
    F+S      1089      F+S         15197
    P+T       843      P+T         10744
    8+H       784      8+H         10370
    M+N       772      M+N         10242
    V+Z       557      V+Z          8068
    B+D       389      B+D          5996
    L+OH      343      L+OH         5108
    B+V       314      B+V          4963
    A+K       292      5+I          4413
    5+I       289      J+K          3653

Ten Most Frequent ASR Word Errors (test set counts)

    F+S      58    60
    M+N      54    42
    V+Z      45    35
    P+T      32    44
    8+H      40    29
    B+V      17    34
    A+8      10    40
    B+D      12    33
    L+OH     16    23
    C+V      16    17

Hypothesized errors via unsupervised lattice cutting agree with actual errors


Discriminative Training on OGI AlphaDigits



[Figure: WER (%) vs. training iteration (0-8) on OGI Alphadigits. MMI: 10.7 (ML baseline), 9.98, 9.36, 9.27, 9.07, 9.03. MRT: 8.47, 8.17, 7.92, 7.86.]

Observations
• Initial ML performance of 10.7% WER is reduced to 9.07% with MMI.
• MinRisk training: a further ~1% absolute WER reduction beyond the best MMI performance.
• Overall WER decreases as MMI training progresses.


MMI Improvement Is Not Uniform Over All Error Types

[Figure: per-pair error counts after MMI iterations 1, 2, and 3 (MMI-1, MMI-2, MMI-3), broken down by substitution direction, e.g. C→V, V→B, B→D, L→OH, A→8, 8→H, B→V, P→T, M→N, V→Z, F→S and their reverses.]

Overall reduction in WER is at the expense of specific errors.

Minimum Risk Training

[Figure: the same per-pair error breakdown after minimum risk training iterations 1, 2, and 3 (MRT-1, MRT-2, MRT-3).]

Overall error rate is not reduced at the expense of individual hypotheses.

Conclusions

• SMBR: a divide-and-conquer approach to ASR
• Unsupervised approach to identify and eliminate recognition errors: SMBR is used to identify regions that are likely to contain errors, which are then rescored with models trained for each type of error
• SMBR yields further improvements over MMI
• Arguably, discriminative training is improved by introducing a training criterion based on a good approximation to the Word Error Rate rather than the Sentence Error Rate
