Discriminative Training for Segmental Minimum Bayes Risk Decoding
Vlasios Doumpiotis, Stavros Tsakalidis, Bill Byrne
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, The Johns Hopkins University
Segmental Minimum Bayes Risk Decoding (SMBR)
• Lattices are segmented into sequences of separate decision problems involving small sets of confusable words
• Separate sets of acoustic models, specialized to discriminate between the competing words in these classes, are applied in subsequent SMBR decoding passes
• Results in a refined search space that allows the use of specialized discriminative models
• Improvement in performance over MMI
Review: MAP Decoding vs. Minimum Bayes-Risk Decoders

MAP decoding: given an utterance A, produce the sentence hypothesis

$\hat{W} = \arg\max_{W \in \mathcal{W}} P(W \mid A)$
MAP is the optimum decoding criterion when performance is measured under the Sentence Error Rate criterion. For other criteria, such as Word Error Rate, other decoding schemes may be better.

Minimum Bayes-Risk decoders attempt to find the sentence hypothesis with the least expected error under a given task-specific loss function. If L(W, W') is the loss function between word strings W and W', the MBR recognizer seeks the optimal hypothesis as

$\hat{W} = \arg\min_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} L(W, W') \, P(W \mid A)$

If L is the 0/1 loss function, MAP decoding results.
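To make the decision rule concrete, here is a minimal Python sketch of MBR decoding. It assumes the hypotheses and posteriors come from an N-best list rather than a full lattice (an assumption for illustration only) and uses word-level edit distance as the loss L(W, W'):

```python
# Minimal MBR decoding sketch over an N-best list (not a full lattice).
# Each hypothesis is a word list paired with its posterior P(W|A).
# The loss is word-level edit (Levenshtein) distance, approximating WER.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)]

def mbr_decode(nbest):
    """nbest: list of (word_list, posterior).
    Returns the W' minimizing sum_W L(W, W') P(W|A)."""
    return min((w_prime for w_prime, _ in nbest),
               key=lambda w_prime: sum(p * edit_distance(w, w_prime)
                                       for w, p in nbest))

# Toy example: MAP picks ["B"], MBR picks ["D"].
nbest = [(["B"], 0.40), (["D"], 0.35), (["D", "NINE"], 0.25)]
print(mbr_decode(nbest))  # ['D']
```

In the toy example, MAP would output ["B"] (highest posterior), while MBR prefers ["D"] because it has lower expected edit distance against the competing hypotheses.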
Segmental Minimum Bayes Risk Decoding

Addresses the MBR search problem over very large lattices.
• Each word string in the lattice is segmented into N substrings: $W = W_1 \cdots W_N$
• This effectively segments the lattice as well: $\mathcal{W} = \mathcal{W}_1 \cdots \mathcal{W}_N$
• Given a specific lattice segmentation, the MBR hypothesis can then be obtained through a sequence of independent MBR decision rules:
$\hat{W}_i = \arg\min_{W' \in \mathcal{W}_i} \sum_{W \in \mathcal{W}_i} L(W, W') \, P_i(W \mid A), \qquad i = 1, \ldots, N$
Lattice Segmentation and Pinching
• Every path in the lattice is aligned to the MAP hypothesis
• Low- and high-confidence regions are identified
• High-confidence regions: retain only the MAP hypothesis
• The word order of the original lattice is preserved
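A minimal sketch of the pinching step, assuming segments have already been obtained by aligning lattice paths to the MAP hypothesis; the confidence threshold and the segment dictionary layout are illustrative assumptions, not values from the paper:

```python
# Lattice pinching sketch. Each segment is represented (hypothetically) as a
# dict with the MAP word, its competing words, and the MAP word's posterior.
# High-confidence segments collapse to the MAP word alone; low-confidence
# segments keep the MAP word plus its confusable alternatives.

CONFIDENCE_THRESHOLD = 0.9  # illustrative value, not taken from the paper

def pinch(segments):
    pinched = []
    for seg in segments:
        if seg["map_posterior"] >= CONFIDENCE_THRESHOLD:
            # High confidence: restrict the search space to the MAP hypothesis.
            pinched.append([seg["map_word"]])
        else:
            # Low confidence: keep the MAP word and its competitors for a
            # later decoding pass with specialized discriminative models.
            pinched.append(sorted({seg["map_word"], *seg["alternatives"]}))
    return pinched

segments = [
    {"map_word": "NINE", "alternatives": [], "map_posterior": 0.99},
    {"map_word": "B", "alternatives": ["D", "V"], "map_posterior": 0.55},
]
print(pinch(segments))  # [['NINE'], ['B', 'D', 'V']]
```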
Lattice Cutting and Pinching
[Figure: example word lattice for an alphadigit utterance (arcs labeled SILENCE, NINE, OH, and letter/digit words), shown before and after cutting and pinching. After pinching, high-confidence regions keep only the MAP hypothesis, while low-confidence regions retain small confusion sets such as {A, J}, {A, 8}, and {V, B}.]
Objectives
1. Identify potential errors in the MAP hypothesis
2. Derive a new search space for subsequent decoding passes

Regions of low confidence
• Models will be trained to fix the errors in the MAP hypothesis
• The search space contains portions of the MAP hypothesis plus alternatives

Regions of high confidence
• The search space is restricted to the MAP hypothesis

Because the structure of the original lattice is retained, we can perform acoustic rescoring over this pinched lattice.
Minimum Error Estimation for SMBR
Suppose we have a labeled training set (A, W). A reasonable approach to estimation for an MBR decoder is minimum risk training:

$\min_\theta \sum_{W'} L(W, W') \, P(W' \mid A; \theta)$
Note that if L is the 0/1 loss function, MMI results: $\max_\theta P(W \mid A; \theta)$
How does this change for SMBR? If we assume that each segment set contains only one-word strings and the loss function is binary, then we can treat the estimation problem for each segment set separately:

$\min_{\theta_i} \sum_{W'} L(W, W') \, P_i(W' \mid A; \theta_i)$

The problem simplifies to separate MMI estimation procedures for the small-vocabulary ASR problems identified in the segmented lattices.
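Writing the binary loss as L(W, W') = 1 − δ(W, W'), the reduction within a segment set follows in one line (P_i is the posterior restricted to the segment set, which sums to one over it):

```latex
\min_{\theta_i} \sum_{W' \in \mathcal{W}_i} \bigl(1 - \delta(W, W')\bigr)\, P_i(W' \mid A; \theta_i)
  \;=\; \min_{\theta_i} \bigl(1 - P_i(W \mid A; \theta_i)\bigr)
  \;=\; \max_{\theta_i} P_i(W \mid A; \theta_i)
```

i.e., MMI estimation over the small vocabulary of the segment set.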
Iterative SMBR Estimation and Decoding

Our goal is to develop a joint estimation and decoding procedure that improves over MMI.
1. Generate lattices, initially with MMI acoustic models
2. Segment and pinch lattices
3. Identify errors
4. Train sets of models to resolve the errors
5. Rescore the pinched lattices using the models tuned to fix the errors in each segment set
6. Repeat...
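A runnable skeleton of this loop is sketched below; the step functions are simplified stand-ins (hypothetical, not the paper's implementation) so that the control flow executes end to end, whereas in practice each step would wrap the lattice generation, cutting, and MMI training tools described in the talk:

```python
# Skeleton of the iterative SMBR estimation/decoding loop. The step functions
# are simplified stand-ins (hypothetical, not the paper's code) so the control
# flow runs end to end; in practice they wrap HTK / FSM decoding, lattice
# cutting, and MMI training.

def generate_lattices(data, models):        # step 1: decode data to lattices
    return [{"utt": u, "paths": [["B"], ["D"]]} for u in data]

def cut_and_pinch(lattices):                # step 2: segment and pinch
    return [{"utt": lat["utt"],
             "segments": [sorted({path[0] for path in lat["paths"]})]}
            for lat in lattices]

def train_specialized_models(pinched):      # steps 3-4: identify errors, train
    classes = {tuple(seg) for lat in pinched
               for seg in lat["segments"] if len(seg) > 1}
    return {c: "model<" + "+".join(c) + ">" for c in classes}

def rescore(pinched, models):               # step 5: rescore pinched lattices
    # Stand-in: keep the first alternative in every segment.
    return [[seg[0] for seg in lat["segments"]] for lat in pinched]

def iterative_smbr(data, models, iterations=3):
    hyps = None
    for _ in range(iterations):             # step 6: repeat
        pinched = cut_and_pinch(generate_lattices(data, models))
        models = train_specialized_models(pinched)
        hyps = rescore(pinched, models)
    return hyps, models

print(iterative_smbr(["utt01", "utt02"], models="mmi_baseline"))
```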
We need to establish that:
• Lattice cutting finds segment sets similar to the dominant confusion pairs observed in decoding
• The segment sets identified in the test set are also found consistently in the training set
Put differently, does the decoder behave the same on the training set as on the test set?
Dominant Confusion Sets in MMI Decoding
• HTK baseline: whole-word models, MFCCs, 12-mixture Gaussian HMMs, AT&T FSM decoder
• 46,730 training utterances, 3,112 test utterances
Ten most frequent ASR word errors vs. ten most frequent confusion sets found by lattice cutting (test and training sets):

  ASR Word Errors       Lattice Cutting (Test)    Lattice Cutting (Training)
  Pair    Counts        Pair    Count             Pair    Count
  F+S     58 / 60       F+S     1089              F+S     15197
  M+N     54 / 42       P+T      843              P+T     10744
  P+T     45 / 35       8+H      784              8+H     10370
  8+H     32 / 44       M+N      772              M+N     10242
  A+8     40 / 29       V+Z      557              V+Z      8068
  B+D     17 / 34       B+D      389              B+D      5996
  C+V     10 / 40       L+OH     343              L+OH     5108
  V+Z     12 / 33       B+V      314              B+V      4963
  B+V     16 / 23       A+K      292              5+I      4413
  L+OH    16 / 17       5+I      289              J+K      3653
Hypothesized errors via unsupervised lattice cutting agree with actual errors
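A small sketch of how such confusion-set statistics can be tallied from lattice-cutting output; the segment representation (a set of competing words per low-confidence segment) is a hypothetical stand-in:

```python
# Tally confusion-pair statistics from lattice-cutting output. Each utterance
# is represented (hypothetically) as a list of segments, where a low-confidence
# segment is the set of competing words it retains after pinching.

from collections import Counter

def confusion_pair_counts(pinched_utterances):
    counts = Counter()
    for segments in pinched_utterances:
        for seg in segments:
            if len(seg) == 2:                    # binary confusion class
                counts["+".join(sorted(seg))] += 1
    return counts

pinched = [
    [{"NINE"}, {"B", "D"}, {"F", "S"}],
    [{"F", "S"}, {"OH"}, {"M", "N"}],
]
print(confusion_pair_counts(pinched).most_common(3))
# [('F+S', 2), ('B+D', 1), ('M+N', 1)]
```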
Discriminative Training on OGI AlphaDigits
[Figure: WER (%) vs. training iteration (0 through 8) on the OGI AlphaDigits test set, comparing MMI and MRT (minimum risk training). Plotted values: 10.7, 9.98, 9.36, 9.27, 9.07, 9.03, 8.47, 8.17, 7.92, 7.86.]
Observations
• Initial ML performance of 10.7% WER is reduced to 9.07% with MMI.
• Minimum risk training: a further 1% WER reduction beyond the best MMI performance.
• Overall WER decreases as MMI training progresses...
MMI Improvement Is Not Uniform Over All Error Types

[Figure: error counts after MMI iterations 1-3 (MMI-1, MMI-2, MMI-3), broken down by directional error type for the dominant confusion pairs (C→V, B→D, L→OH, A→8, 8→H, B→V, P→T, M→N, V→Z, F→S, in both directions).]
Overall reduction in WER is at the expense of specific errors
Minimum Risk Training

[Figure: error counts after minimum risk training iterations 1-3 (MRT-1, MRT-2, MRT-3), broken down by the same directional error types.]
Overall error rate is not reduced at the expense of individual hypotheses
Conclusions
• SMBR: a divide-and-conquer approach to ASR
• Unsupervised approach to identify and eliminate recognition errors: SMBR is used to identify regions that are likely to contain errors, then we rescore with models trained for each type of error
• SMBR yields further improvements over MMI
• Arguably, discriminative training is improved by introducing a training criterion based on a good approximation to the Word Error Rate rather than the Sentence Error Rate