Confidence based multiple classifier fusion in speaker verification

Fernando Huenupán, Nestor Becerra Yoma, Carlos Molina and Claudio Garretón
Speech Processing and Transmission Laboratory, Department of Electrical Engineering, Universidad de Chile, Santiago, Chile
[email protected] Telephone: +56-2-678 4205 Fax: +56-2-695 3881
ABSTRACT

A novel framework for Multiple Classifier System (MCS) fusion based on a Bayes-based confidence measure (BBCM) is proposed. Compared with ordinary Bayesian fusion, the presented approach leads to reductions of 20% and 25% in EER and in the area below the ROC curve, respectively, in speaker verification.
I. INTRODUCTION
The problem of pattern recognition using multiple classifier systems (MCS) has been addressed in several fields (Fumera and Roli, 2005). In particular, MCS has been widely employed in handwriting recognition (Xu et al., 1992; Kittler et al., 1998) and in the fusion of multimedia data, voice and image (Fumera and Roli, 2005; Kittler et al., 1998). The motivation behind MCS is the fact that the response to the same input signal can be classifier dependent, so the error of a given classifier may be corrected by the whole system. In speaker verification (SV), neural networks (Farrell, 1995; Farrell et al., 1997; Xiang and Berger, 2003; Yegnanarayana et al., 2005), linear combination (Benzeghiba and Bourlard, 2003; Mak et al., 2003) and binary methods (Genoud et al., 1996) are among the most popular approaches to the problem of optimizing the use of multiple classifiers. Nevertheless, most of those combining techniques do not provide the mathematical analysis required for the relevant performance evaluations (Kim and Ko, 2003).

In MCS, classifiers are usually combined in parallel at two levels: the abstract level and the score level. At the abstract level, the binary decisions made by the individual classifiers are combined (see Fig. 1). At the score level, the scores of the individual classifiers are merged by means of a set of weights (see Fig. 2) (Mak et al., 2003). In Figs. 1 and 2: $O$ is the observed input signal; $CL_j$ is classifier $j$, where $1 \le j \le J$ and $J$ is the number of classifiers; $S_{CL_j}(O)$ is the score at the output of classifier $j$; $d_{CL_j}$ is the local decision or classification due to classifier $j$; and $D(O)$ is the final decision or classification that corresponds to $O$. $D(O)$ and $d_{CL_j}$ indicate one of the $M$ classes denoted by $C_m$, where $1 \le m \le M$ and $M$ is the total number of classes.

From pattern recognition theory, the most straightforward formal strategy to fuse classifiers is probably Bayes classification theory (Duda and Hart, 1973; Kittler et al., 1998):

$$ D(O) = \arg\max_m \left\{ \Pr\!\left[C_m \mid S(O)\right] \right\} = \arg\max_m \left\{ \frac{\Pr\!\left[S(O) \mid C_m\right]\Pr(C_m)}{\sum_{m=1}^{M} \Pr\!\left[S(O) \mid C_m\right]\Pr(C_m)} \right\} $$   (1)

where $S(O) = \left[S_{CL_1}(O), \ldots, S_{CL_j}(O), \ldots, S_{CL_J}(O)\right]$. Theoretically, the classification error is optimally minimized by (1). However, the a priori multivariate p.d.f.'s $\Pr\!\left[S(O) \mid C_m\right]$ may require an unmanageable amount of training data to be reliably estimated (Kittler et al., 1998). As a consequence, the problem is substantially simplified if the maximization in (1) can be expressed in terms of computations performed by the individual classifiers. The classical techniques to simplify Bayesian fusion (Kittler et al., 1998; Kuncheva, 1999, 2002) are the Product Rule, Max Rule, Min Rule, Mean Rule and Majority Vote Rule. Among the several MCS combination rules in the literature, the Mean Rule and the Vote Rule are the most frequently employed approximations (Fumera and Roli, 2005; Kittler and Alkoot, 2003).
On the other hand, the Product Rule corresponds to the optimal Bayesian fusion if the classifiers are statistically independent, and the Vote Rule allows combining the local decisions of the individual classifiers as in Fig. 1. Concerning SV, the problem of classifier fusion in MCS has not been tackled in depth with the proper mathematical analysis.

The problem of assessing the accuracy of automatic speech recognition (ASR) systems in telephone-based services has attracted the attention of several authors (Andorno et al., 2002; Hazen et al., 2000; Kwan et al., 2002; Lee and Huo, 2000; Stolcke et al., 1997): to avoid user frustration, the interaction needs to be very effective and efficient, and confirmation loops should be avoided. Consequently, reliably assessing the operation of ASR is necessary to decide whether a recognized word or sentence should be accepted or rejected. In (Yoma et al., 2005), a Bayes-based confidence measure (BBCM) was proposed. BBCM is a probability itself and incorporates a priori information about the recognizer performance. Moreover, when compared with standard confidence measures, BBCM dramatically improves the discrimination of misrecognized words. Surprisingly, the problem of assessing the accuracy of SV systems has not been exhaustively addressed in the specialized literature.

The contribution of this paper concerns: a) the application of the MCS strategy to SV by using simplifications of Bayesian fusion; b) a new classification criterion based on BBCM maximization, applicable to any pattern recognition problem independently of the number of classes; c) a new strategy for classifier fusion in MCS based on BBCM, also applicable to any pattern recognition problem independently of the number of classes; and d) modelling the problem of confidence measure in SV by applying BBCM. As shown here, the BBCM based MCS fusion scheme corresponds to the ordinary Bayes fusion weighted by the reliability of each individual classifier. Moreover, BBCM provides a formal model for heuristic weighting functions employed elsewhere. Finally, the approach and analysis presented in this paper have not been found in the specialized literature.
II. BAYES CLASSIFIER COMBINATION APPLIED TO SPEAKER VERIFICATION (SV)
In SV, the task is to decide whether a user who claims a given identity is indeed that person, and two classes are possible: client, $C_1$, and impostor, $C_2$. In the enrollment process, each user is prompted to pronounce a given number of utterances that are employed to generate the user's speaker dependent (SD) model. In verification, the speech signal from a user who claims an identity is compared with the SD model associated with the claimed identity and with an impostor model. This impostor model is denominated speaker independent (SI) because it is usually trained with a wide variety of users. In an SV MCS system, every classifier may have its own SD and SI models. The comparison of the input speech signal with each classifier $CL_j$ yields a score $S_{CL_j}(O)$, $\Pr\!\left[S_{CL_j}(O) \mid C_1\right]$ and $\Pr\!\left[S_{CL_j}(O) \mid C_2\right]$. Consequently, the classifier array provides a set of scores $S(O)$, $\Pr\!\left[S(O) \mid C_1\right]$ and $\Pr\!\left[S(O) \mid C_2\right]$. If the a priori probabilities of client and impostor are assumed uniformly distributed, then $\Pr(C_1) = \Pr(C_2) = 0.5$ and (1) can be written as:

$$ D(O) = \arg\max_m \left\{ \Pr\!\left[C_m \mid S(O)\right] \right\} = \arg\max_m \left\{ \frac{\Pr\!\left[S(O) \mid C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S(O) \mid C_m\right]} \right\} $$   (2)
where $m = 1$ (client) or $m = 2$ (impostor). As mentioned above, $\Pr\!\left[S(O) \mid C_m\right]$ may require an unmanageable amount of data, and the Bayesian fusion expressed in (2) can be approximated with the Product Rule, the Mean Rule and the Majority Vote Rule, among others. In this paper three standard SV techniques were addressed: forced Viterbi based score (Furui, 1997); Majority Voting Rule for sequences of feature vectors, MVR-FV (Radova and Padrta, 2004); and Support Vector Machines, SVM (Campbell et al., 2004).
2.1 Product Rule

If the classifiers are considered statistically independent, the Bayes decision rule in (2) can be expressed as (Kittler et al., 1998):

$$ D(O) = \arg\max_m \left\{ \prod_{j=1}^{J} \Pr\!\left[C_m \mid S_{CL_j}(O)\right] \right\} = \arg\max_m \left\{ \prod_{j=1}^{J} \frac{\Pr\!\left[S_{CL_j}(O) \mid C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S_{CL_j}(O) \mid C_m\right]} \right\} $$   (3)

with $M = 2$, where $J$ is the total number of classifiers to be fused.
2.2 Mean Rule

According to (Tax et al., 1997), if the estimates produced by the classifiers are not reliable, the Product Rule may be very sensitive to classification errors. For instance, if one classifier gives a score equal to zero, the Product Rule will also output a score of zero. The Mean Rule is more stable than the Product Rule and is usually preferred when classifiers show high error levels. The Mean Rule is defined as:

$$ D(O) = \arg\max_m \left\{ \frac{1}{J} \sum_{j=1}^{J} \Pr\!\left[C_m \mid S_{CL_j}(O)\right] \right\} = \arg\max_m \left\{ \frac{1}{J} \sum_{j=1}^{J} \frac{\Pr\!\left[S_{CL_j}(O) \mid C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S_{CL_j}(O) \mid C_m\right]} \right\} $$   (4)

with $M = 2$, where $J$ is the total number of classifiers to be fused.

2.3 Weighted Majority Vote Rule

The Majority Vote Rule in the context of MCS, MVR-MCS, corresponds to a straightforward scheme that is widely employed in the literature to combine the outputs of classifiers as indicated in Fig. 1 (Kittler and Alkoot, 2003). MVR-MCS is defined as:
$$ D(O) = \begin{cases} C_1 & \text{if } \Delta(O) \ge 0 \\ C_2 & \text{if } \Delta(O) < 0 \end{cases} $$   (5)

where

$$ \Delta(O) = \sum_{j=1}^{J} \Delta_{CL_j} $$

and:

$$ \Delta_{CL_j} = \begin{cases} 1 & \text{if } d_{CL_j}(O) = C_1 \\ -1 & \text{if } d_{CL_j}(O) = C_2 \end{cases} $$
In this paper, a weighted version of the Majority Vote Rule, WMVR-MCS, as defined in (De Stefano et al., 2002), is addressed:

$$ \Delta(O) = \sum_{j=1}^{J} \Delta_{CL_j}\,\alpha_j $$   (6)

where $\alpha_j = \Pr\!\left[d_{CL_j} \mid S_{CL_j}(O)\right]$.
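For concreteness, the three approximations above reduce to a few lines of code once the per-classifier posteriors $\Pr\!\left[C_m \mid S_{CL_j}(O)\right]$, the local decisions $d_{CL_j}$ and the weights $\alpha_j$ are available. The following Python sketch illustrates Eqs. (3)-(6) with purely hypothetical values; it is not the implementation used in the experiments reported below.

```python
import numpy as np

def product_rule(posteriors):
    """Product Rule, Eq. (3): posteriors has shape (J, M), where
    posteriors[j, m] = Pr[C_m | S_CLj(O)]. Returns the index of the winning class."""
    return int(np.argmax(np.prod(posteriors, axis=0)))

def mean_rule(posteriors):
    """Mean Rule, Eq. (4): average the per-classifier posteriors over j."""
    return int(np.argmax(np.mean(posteriors, axis=0)))

def weighted_majority_vote(decisions, weights):
    """Weighted Majority Vote Rule, Eqs. (5)-(6), for the two-class case.
    decisions[j] is +1 (C1, client) or -1 (C2, impostor);
    weights[j] = alpha_j = Pr[d_CLj | S_CLj(O)] (the unweighted MVR uses weights of 1)."""
    delta = np.sum(np.asarray(decisions) * np.asarray(weights))
    return 0 if delta >= 0 else 1  # class index: 0 -> C1, 1 -> C2

# Hypothetical posteriors from J = 3 classifiers for M = 2 classes (client, impostor).
P = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.4, 0.6]])
print(product_rule(P), mean_rule(P))                          # both select class 0 (client)
print(weighted_majority_vote([+1, +1, -1], [0.7, 0.6, 0.6]))  # 0 (client)
```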
III. BAYES BASED CONFIDENCE MEASURE (BBCM) IN SV
BBCM was first proposed in the field of automatic speech recognition (ASR). An ASR system receives a speech signal as input and delivers a string of recognized words (i.e. $w_1, w_2, \ldots, w_n, \ldots, w_N$), where $w_n$ denotes the nth word in the string. If $WF$ denotes a given word feature (Andorno et al., 2002; Hazen et al., 2000; Kwan et al., 2002; Lee and Huo, 2000; Stolcke et al., 1997) (e.g. likelihood, word density confidence measure, etc.), BBCM is defined in ASR as (Yoma et al., 2005):

$$ BBCM(WF_i) = \Pr(w_i \text{ is OK} \mid WF_i) = \frac{\Pr(WF_i \mid w_i \text{ is OK})\,\Pr(w_i \text{ is OK})}{\Pr(WF_i)} $$   (7)

where "OK" substitutes "correct" in (Yoma et al., 2005). The event "OK" corresponds to the fact that word $w_i$, which is contained in at least one of the N-best hypotheses, was properly recognized (i.e., it is in the transcription of the testing utterance). Notice that $BBCM(WF_i)$ is a probability itself. Moreover, the distribution $\Pr(WF_i \mid w_i \text{ is OK})$ and the probability $\Pr(w_i \text{ is OK})$ provide information about the performance of the recognition engine.

As mentioned above, there are two classes in SV: client, $C_1$, and impostor, $C_2$. Moreover, the word feature employed in ASR can certainly be replaced with the score of a given classifier. As a consequence, applying the definition of BBCM to the SV problem leads to:

$$ BBCM\!\left[S_{CL_j}(O)\right] = \Pr\!\left[d_{CL_j} \text{ is OK} \mid S_{CL_j}(O)\right] = \frac{\Pr\!\left[S_{CL_j}(O) \mid d_{CL_j} \text{ is OK}\right]\Pr\!\left(d_{CL_j} \text{ is OK}\right)}{\Pr\!\left[S_{CL_j}(O)\right]} $$   (8)
where "$d_{CL_j}$ is OK" denotes that the decision of classifier $CL_j$, which is client ($C_1$) or impostor ($C_2$) in SV, is correct. Therefore (8) can be expressed as:

$$ BBCM\!\left[S_{CL_j}(O)\right] = \Pr\!\left[d_{CL_j} \text{ is OK} \mid S_{CL_j}(O)\right] = \Pr\!\left\{ \left[O \in C_1 \wedge d_{CL_j}(O) = C_1\right] \vee \left[O \in C_2 \wedge d_{CL_j}(O) = C_2\right] \mid S_{CL_j}(O) \right\} $$   (9)
where $O \in C_m$ denotes that the input signal $O$ belongs to class $C_m$. It is possible to consider that:

$$ \Pr\!\left\{ \left[O \in C_1 \wedge d_{CL_j}(O) = C_1\right] \wedge \left[O \in C_2 \wedge d_{CL_j}(O) = C_2\right] \mid S_{CL_j}(O) \right\} = 0 $$

which means that a classifier chooses only one class each time. Consequently, $BBCM\!\left[S_{CL_j}(O)\right] = \Pr\!\left[d_{CL_j} \text{ is OK} \mid S_{CL_j}(O)\right]$ in (9) is written as:

$$ \Pr\!\left[d_{CL_j} \text{ is OK} \mid S_{CL_j}(O)\right] = \Pr\!\left[O \in C_1 \wedge d_{CL_j}(O) = C_1 \mid S_{CL_j}(O)\right] + \Pr\!\left[O \in C_2 \wedge d_{CL_j}(O) = C_2 \mid S_{CL_j}(O)\right] $$   (10)
The generalization of this result is straightforward, and the application of (10) to the M-class classification problem leads to:

$$ BBCM\!\left[S_{CL_j}(O)\right] = \Pr\!\left[d_{CL_j} \text{ is OK} \mid S_{CL_j}(O)\right] = \sum_{m=1}^{M} \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] $$   (11)
By using Bayes' theorem, $\Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right]$ leads to:

$$ \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] = \frac{\Pr\!\left[S_{CL_j}(O) \mid O \in C_m \wedge d_{CL_j}(O) = C_m\right] \cdot \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S_{CL_j}(O) \mid C_m\right]\Pr(C_m)} $$   (12)
Finally, $BBCM\!\left[S_{CL_j}(O)\right]$ in (11) can be expressed as a function of a priori p.d.f.'s:

$$ BBCM\!\left[S_{CL_j}(O)\right] = \sum_{m=1}^{M} \frac{\Pr\!\left[S_{CL_j}(O) \mid O \in C_m \wedge d_{CL_j}(O) = C_m\right] \cdot \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S_{CL_j}(O) \mid C_m\right]\Pr(C_m)} $$   (13)
Observe that (13) should be applicable to any classification problem independently of the number of classes M. Finally, $\Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right]$ in (12) suggests the following definition:

$$ BBCM\!\left[S_{CL_j}(O) \wedge C_m\right] = \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] $$   (14)

where $BBCM\!\left[S_{CL_j}(O) \wedge C_m\right]$ denotes the probability of $d_{CL_j}(O)$ being OK given the score $S_{CL_j}(O)$ and the selected class $C_m$.
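As a rough illustration of how the quantities in (8) might be estimated in practice, the sketch below builds a histogram-based BBCM curve from a held-out set of classifier scores together with labels indicating whether the classifier's decision was correct (in this paper, database B plays that role; see Section V). The binning strategy, smoothing constant and variable names are illustrative assumptions, not the authors' estimation procedure.

```python
import numpy as np

def fit_bbcm_curve(dev_scores, dev_correct, n_bins=20):
    """Histogram-based estimate of BBCM[S_CLj(O)] = Pr(d_CLj is OK | S_CLj(O)),
    Eq. (8). dev_scores are classifier scores from a held-out set and
    dev_correct[i] is True when the classifier's decision for trial i was correct
    (client accepted or impostor rejected)."""
    dev_scores = np.asarray(dev_scores, dtype=float)
    dev_correct = np.asarray(dev_correct, dtype=bool)
    edges = np.histogram_bin_edges(dev_scores, bins=n_bins)
    p_ok = dev_correct.mean()                                          # Pr(d_CLj is OK)
    p_s, _ = np.histogram(dev_scores, bins=edges, density=True)        # Pr(S_CLj(O))
    p_s_ok, _ = np.histogram(dev_scores[dev_correct], bins=edges, density=True)  # Pr(S | OK)
    bbcm = np.where(p_s > 0, p_s_ok * p_ok / np.maximum(p_s, 1e-12), p_ok)
    return edges, np.clip(bbcm, 0.0, 1.0)

def bbcm_lookup(score, edges, bbcm):
    """Return the BBCM value of the bin containing the given score."""
    idx = int(np.clip(np.searchsorted(edges, score) - 1, 0, len(bbcm) - 1))
    return float(bbcm[idx])
```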
IV. BBCM BASED CLASSIFIER FUSION IN SV

As mentioned above, the most straightforward scheme to combine multiple classifiers is Bayes fusion. The Bayes classification rule states that the recognized class is the one that maximizes the a posteriori probability given the observed signal. In this section it is suggested that BBCM as defined in (14) could also be employed as the classification criterion that needs to be maximized to optimally decide the recognized class.
4.1 BBCM as a classification criterion

According to (14), $BBCM\!\left[S_{CL_j}(O) \wedge C_m\right]$ can be written as:

$$ BBCM\!\left[S_{CL_j}(O) \wedge C_m\right] = \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] = \Pr\!\left[O \in C_m \mid S_{CL_j}(O)\right] \cdot \Pr\!\left[d_{CL_j}(O) = C_m \mid S_{CL_j}(O), O \in C_m\right] $$   (15)
where $\Pr\!\left[O \in C_m \mid S_{CL_j}(O)\right] = \Pr\!\left[C_m \mid S_{CL_j}(O)\right]$ is the a posteriori probability that appears in the maximization according to the Bayes classification rules in (3) and (4); and $\Pr\!\left[d_{CL_j}(O) = C_m \mid S_{CL_j}(O), O \in C_m\right]$ corresponds to additional information incorporated by BBCM and is related to the reliability of an individual classifier given an input and a selected class. As a consequence, it is reasonable to classify by selecting the class that maximizes $BBCM\!\left[S_{CL_j}(O) \wedge C_m\right]$:

$$ d_{CL_j} = \arg\max_m \left\{ BBCM\!\left[S_{CL_j}(O) \wedge C_m\right] \right\} = \arg\max_m \left\{ \Pr\!\left[O \in C_m \mid S_{CL_j}(O)\right] \cdot \Pr\!\left[d_{CL_j}(O) = C_m \mid S_{CL_j}(O), O \in C_m\right] \right\} $$   (16)
Observe that (16) maximizes a confidence measure instead of the conventional a posteriori probability $\Pr\!\left[C_m \mid S_{CL_j}(O)\right]$. Nevertheless, the confidence measure employed here is a probability itself and also incorporates $\Pr\!\left[C_m \mid S_{CL_j}(O)\right]$, besides the information on classifier reliability. Finally, the result denoted by (16) should be applicable to any classification problem independently of the number of classes.
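A small numerical illustration of the criterion in (16), with hypothetical posterior and reliability values for a single classifier and M = 2 classes; it only shows how the reliability term can overturn a decision that the posterior alone would make.

```python
import numpy as np

def bbcm_decision(posterior, reliability):
    """Eq. (16): choose the class m that maximizes
    BBCM[S_CLj(O) ^ C_m] = Pr[C_m | S_CLj(O)] * Pr[d_CLj = C_m | S_CLj(O), O in C_m].
    posterior[m] and reliability[m] are indexed by class m."""
    return int(np.argmax(np.asarray(posterior) * np.asarray(reliability)))

# Hypothetical two-class example: the posterior alone favours the client (class 0),
# but a low client-side reliability, e.g. a score near T_EER, flips the decision.
print(bbcm_decision([0.55, 0.45], [0.6, 0.9]))  # -> 1 (impostor)
```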
4.2 Fusion with BBCM

In (Yoma et al., 2005), it is proposed that BBCM could also be estimated with a combination of features. In the context of MCS at the score level (Fig. 2), a feature corresponds to a classifier score and BBCM can be defined as:

$$ BBCM\!\left[S(O)\right] = \Pr\!\left[D(O) \text{ is OK} \mid S(O)\right] $$   (17)

Then, according to Section 4.1, the final classification decision of MCS at the score level could be given by:

$$ D(O) = \arg\max_m \left\{ BBCM\!\left[S(O) \wedge C_m\right] \right\} = \arg\max_m \left\{ \Pr\!\left[O \in C_m \mid S(O)\right] \cdot \Pr\!\left[D(O) = C_m \mid S(O), O \in C_m\right] \right\} $$   (18)
where $BBCM\!\left[S(O) \wedge C_m\right]$ is estimated as:

$$ BBCM\!\left[S(O) \wedge C_m\right] = \frac{\Pr\!\left[S(O) \mid O \in C_m \wedge D(O) = C_m\right] \cdot \Pr\!\left[O \in C_m \wedge D(O) = C_m\right]}{\sum_{m=1}^{M} \Pr\!\left[S(O) \mid C_m\right]\Pr(C_m)} $$   (19)
In (19), the a priori p.d.f. $\Pr\!\left[S(O) \mid O \in C_m \wedge D(O) = C_m\right]$ required to estimate $BBCM\!\left[S(O) \wedge C_m\right]$ also demands a large amount of training data, which in turn is not always available. Consequently, the same approximations mentioned in Section 2 for Bayesian fusion are adopted here.
4.2.1 Mean Rule with BBCM

The approximation of (18) by means of the Mean Rule leads to:

$$ D(O) = \arg\max_m \left\{ \frac{1}{J} \sum_{j=1}^{J} BBCM\!\left[S_{CL_j}(O) \wedge C_m\right] \right\} = \arg\max_m \left\{ \frac{1}{J} \sum_{j=1}^{J} \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] \right\} $$   (20)
and $D(O)$ in (20) can be written as:

$$ D(O) = \arg\max_m \left\{ \frac{1}{J} \sum_{j=1}^{J} \Pr\!\left[O \in C_m \mid S_{CL_j}(O)\right] \cdot \Pr\!\left[d_{CL_j}(O) = C_m \mid S_{CL_j}(O), O \in C_m\right] \right\} $$   (21)

As can be seen, $D(O)$ in (21) corresponds to the Mean Rule approximation weighted by the reliability of classifier $j$ given class $C_m$.
4.2.2 Product Rule with BBCM

If the classifiers are assumed statistically independent, $BBCM\!\left[S(O)\right]$ in (17) can be approximated as (Yoma et al., 2005):

$$ BBCM\!\left[S(O)\right] = BBCM\!\left[S_{CL_1}(O)\right] \cdot BBCM\!\left[S_{CL_2}(O)\right] \cdot \ldots \cdot BBCM\!\left[S_{CL_j}(O)\right] \cdot \ldots \cdot BBCM\!\left[S_{CL_J}(O)\right] $$   (22)

By using (22) and the BBCM based fusion criterion (18), $D(O)$ can be approximated as:

$$ D(O) = \arg\max_m \left\{ \prod_{j=1}^{J} BBCM\!\left[S_{CL_j}(O) \wedge C_m\right] \right\} = \arg\max_m \left\{ \prod_{j=1}^{J} \Pr\!\left[O \in C_m \wedge d_{CL_j}(O) = C_m \mid S_{CL_j}(O)\right] \right\} $$   (23)

As done for the BBCM Mean Rule, $D(O)$ in (23) can be written as:

$$ D(O) = \arg\max_m \left\{ \prod_{j=1}^{J} \Pr\!\left[O \in C_m \mid S_{CL_j}(O)\right] \cdot \Pr\!\left[d_{CL_j}(O) = C_m \mid S_{CL_j}(O), O \in C_m\right] \right\} $$   (24)
As with the BBCM Mean Rule in (21), the BBCM Product Rule in (24) corresponds to the ordinary Product Rule approximation weighted by the reliability of classifier $j$ given class $C_m$.
4.2.3 Weighted Majority Vote Rule (WMVR-MCS) with BBCM

By replacing $\Pr\!\left[d_{CL_j} \mid S_{CL_j}(O)\right]$ with $BBCM\!\left[S_{CL_j}(O) \wedge d_{CL_j}(O)\right]$ in (6), the BBCM WMVR-MCS is written as:

$$ \Delta(O) = \sum_{j=1}^{J} \Delta_{CL_j} \cdot BBCM\!\left[S_{CL_j}(O) \wedge d_{CL_j}(O)\right] $$   (25)

The same interpretation given to the BBCM Mean Rule and the BBCM Product Rule is also applicable to (25).
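The BBCM based fusion rules in (21), (24) and (25) can be sketched as follows, again with hypothetical posteriors, reliabilities and local decisions; the reliability terms themselves would be estimated from the BBCM curves of Section III.

```python
import numpy as np

def bbcm_mean_rule(posteriors, reliabilities):
    """BBCM Mean Rule, Eq. (21): the ordinary Mean Rule weighted by the
    reliability of each classifier given the candidate class.
    posteriors[j, m] = Pr[C_m | S_CLj(O)];
    reliabilities[j, m] = Pr[d_CLj = C_m | S_CLj(O), O in C_m]."""
    return int(np.argmax(np.mean(posteriors * reliabilities, axis=0)))

def bbcm_product_rule(posteriors, reliabilities):
    """BBCM Product Rule, Eq. (24)."""
    return int(np.argmax(np.prod(posteriors * reliabilities, axis=0)))

def bbcm_wmvr(decisions, bbcm_values):
    """BBCM WMVR-MCS, Eq. (25): decisions[j] is +1 (C1) or -1 (C2) and
    bbcm_values[j] = BBCM[S_CLj(O) ^ d_CLj(O)]."""
    delta = np.sum(np.asarray(decisions) * np.asarray(bbcm_values))
    return 0 if delta >= 0 else 1  # class index: 0 -> C1, 1 -> C2

# Hypothetical values for J = 3 classifiers and M = 2 classes.
P = np.array([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6]])
R = np.array([[0.9, 0.8], [0.7, 0.7], [0.5, 0.9]])
print(bbcm_mean_rule(P, R), bbcm_product_rule(P, R), bbcm_wmvr([+1, +1, -1], [0.9, 0.7, 0.9]))
```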
4.3 Difference between BBCM and the classic Bayesian score

As mentioned above, BBCM incorporates information about the classifier performance and confidence, besides the a posteriori probability itself. In contrast, Bayes classification employs only the a posteriori probability. Moreover, the BBCM score lies between 0 and 1 by definition, but its sum over the classes is not necessarily equal to one. On the other hand, in the Bayes rule the a posteriori probabilities are defined in a range that is classifier dependent but sum to one over all the classes. As a result, BBCM seems an interesting framework to address the problem of MCS fusion, due to the fact that the concept of confidence is applicable to any classifier and BBCM values are defined in the interval [0, 1].
V. EXPERIMENTS

The results presented here were obtained with a database composed of 40 speakers (20 males and 20 females). The vocabulary corresponds to Spanish digits. Each speaker pronounced the 10-digit sequence "0-1-2-3-4-5-6-7-8-9" three times for enrollment. For verification, every speaker uttered the four-digit sequences "1-8-6-4", "4-5-2-0" and "9-5-7-3" three times each. The enrollment and verification speech signals were recorded on the same telephone line, at the Speech Processing and Transmission Laboratory (LPTV), Universidad de Chile. The database was divided into two groups, A and B. Database A, composed of 30 speakers (15 males and 15 females), is used for testing. Database B, composed of 10 speakers (5 males and 5 females), is employed to estimate the required a priori p.d.f.'s and BBCM curves. The database employed in this paper is similar to those mentioned elsewhere (Asami et al., 2005; Yegnanarayana et al., 2005).
5.1 Individual Classifiers

As mentioned above, three standard SV techniques were addressed: forced Viterbi based score (Furui, 1997); Majority Voting Rule for sequences of feature vectors, MVR-FV (Radova and Padrta, 2004); and Support Vector Machines, SVM (Campbell et al., 2004). The Viterbi and MVR-FV based SV systems were implemented at LPTV. In contrast, the SVM SV system was implemented in Matlab with the OSU SVM toolbox [1].

5.1.1 Viterbi based score classifier

The input signal is processed with the forced-Viterbi algorithm in order to estimate the log likelihood $\log L(O)$ (Furui, 1997):

$$ \log L(O) = \sum_{t=1}^{T} \left[ \log P(O_t \mid \lambda_{SD}) - \log P(O_t \mid \lambda_{SI}) \right] $$   (26)

where $O$ is the observation sequence $O = [O_1, O_2, \ldots, O_t, \ldots, O_T]$, and $P(O_t \mid \lambda_{SD})$ and $P(O_t \mid \lambda_{SI})$ represent the likelihoods given by the SD ($\lambda_{SD}$) and SI ($\lambda_{SI}$) HMMs, respectively. Both models, $\lambda_{SD}$ and $\lambda_{SI}$, correspond to the sequence of triphone HMMs that compose the testing sequence $O$. The log likelihood is then normalized by the number of frames $T$ in the verification utterance: $\log L(O)' = \log L(O) / T$.
[1] http://sourceforge.net/projects/svm/ and http://svm.sourceforge.net/license.shtml
It is worth highlighting that $\lambda_{SD}$ is computed with the enrollment data pronounced by each client, and $\lambda_{SI}$ is estimated with 60 speakers (30 male + 30 female) that are different from the ones in databases A and B mentioned above. Each of these speakers uttered the digit sequence "0-1-2-3-4-5-6-7-8-9" three times using the same telephone line that was employed for the testing data.
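A minimal sketch of the score in (26) and its per-frame normalization, assuming that the frame log-likelihoods of the forced-Viterbi alignment against the SD and SI HMMs have already been computed by an HMM decoder (not shown here); the decision threshold is a hypothetical parameter tuned on development data.

```python
import numpy as np

def normalized_log_likelihood(loglik_sd, loglik_si):
    """Eq. (26) with the per-frame normalization described above:
    log L(O)' = (1/T) * sum_t [log P(O_t | lambda_SD) - log P(O_t | lambda_SI)].
    loglik_sd[t] and loglik_si[t] are the frame log-likelihoods obtained from the
    forced-Viterbi alignment against the SD and SI HMMs (computed elsewhere)."""
    diff = np.asarray(loglik_sd, dtype=float) - np.asarray(loglik_si, dtype=float)
    return float(np.mean(diff))

def viterbi_decision(loglik_sd, loglik_si, threshold):
    """Accept the claimed identity (return +1, client) if the normalized score
    exceeds a decision threshold tuned on development data; otherwise return -1."""
    return 1 if normalized_log_likelihood(loglik_sd, loglik_si) > threshold else -1
```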
5.1.2 Majority voting rule for sequences of feature vectors (MVR-FV)

Consider an input speech signal, $O$, composed of $T$ frames. This signal is divided into $I$ non-overlapping K-frame windows, where $K$ is the number of frames within every window. Consider window $W_i$, which starts at frame $O_{i \cdot K}$ and ends at frame $O_{(i+1) \cdot K - 1}$. The log likelihood, as defined in (26), of the ith window $W_i$ is:

$$ \log L(W_i) = \frac{1}{K} \sum_{k = i \cdot K}^{(i+1) \cdot K - 1} \left[ \log \Pr(O_k \mid \lambda_{SD}) - \log \Pr(O_k \mid \lambda_{SI}) \right] $$   (27)
The local decision at window $W_i$ is estimated according to (Radova and Padrta, 2004):

$$ D(W_i) = \begin{cases} 1 & \text{if } \log L(W_i) > T \\ -1 & \text{if } \log L(W_i) \le T \end{cases} $$   (28)

where $T$ here denotes a decision threshold. Then, the final decision is taken according to the score:

$$ S(O) = \frac{1}{I} \sum_{i=1}^{I} D(W_i) $$   (29)
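A minimal sketch of Eqs. (27)-(29), again assuming precomputed frame log-likelihoods; K = 2 matches the setting reported in Section 5.2, while the window threshold below is only an illustrative default.

```python
import numpy as np

def mvr_fv_score(loglik_sd, loglik_si, K=2, threshold=0.0):
    """MVR-FV score, Eqs. (27)-(29): split the frame log-likelihood differences
    into I non-overlapping K-frame windows, take a +/-1 vote per window against
    the window threshold T of Eq. (28), and return the average vote of Eq. (29)."""
    diff = np.asarray(loglik_sd, dtype=float) - np.asarray(loglik_si, dtype=float)
    n_windows = len(diff) // K                                           # I complete windows
    window_ll = diff[:n_windows * K].reshape(n_windows, K).mean(axis=1)  # Eq. (27)
    votes = np.where(window_ll > threshold, 1, -1)                       # Eq. (28)
    return float(votes.mean())                                           # Eq. (29)
```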
5.1.3 Support Vector Machine (SVM)

A support vector machine is a two-class classifier constructed with a sum of a kernel function $K(\cdot, \cdot)$ (Burges, 1998; Campbell et al., 2004):

$$ f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + b $$   (30)

where the $t_i$ are targets and $\sum_{i=1}^{N} \alpha_i t_i = 0$. The vectors $x_i$ are support vectors and are obtained from the training set. The target values are 1 or −1, depending on whether the corresponding support vector belongs to class 1 or class 2, respectively. For classification, the class decision is based on whether the value $f(x)$ is above or below a threshold. The SVM was implemented with the OSU SVM toolbox for MATLAB [1].
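The sketch below reproduces the flavour of this classifier with scikit-learn rather than the OSU SVM toolbox actually used by the authors; the training vectors, codebook sizes and zero decision threshold are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative stand-in for the SVM classifier: the authors used the OSU SVM
# toolbox for MATLAB, whereas scikit-learn is used here only for the sketch.
# X_client and X_impostor stand for the codebook vectors derived from the
# training data (256 client and 512 impostor codewords in the paper); the
# random vectors below are purely hypothetical.
rng = np.random.default_rng(0)
X_client = rng.normal(loc=+1.0, size=(256, 33))     # 33 cepstral features per vector
X_impostor = rng.normal(loc=-1.0, size=(512, 33))
X = np.vstack([X_client, X_impostor])
t = np.hstack([np.ones(len(X_client)), -np.ones(len(X_impostor))])  # targets +1 / -1

svm = SVC(kernel="rbf", gamma="scale")              # Gaussian kernel, as in Eq. (30)
svm.fit(X, t)

x_test = rng.normal(size=(1, 33))
score = svm.decision_function(x_test)[0]            # f(x) in Eq. (30)
decision = 1 if score > 0.0 else -1                 # zero threshold, for the sketch only
print(score, decision)
```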
5.2 Experimental setup

Enrollment and verification utterances are decomposed into a sequence of triphones. Thirty-three cepstral coefficients are computed per frame: the frame energy plus ten static coefficients, and their first and second time derivatives. In the Viterbi based score classifier, the HMMs are trained with the Viterbi algorithm. Each triphone is modelled with a three-state left-to-right HMM topology without skip-state transitions, with one multivariate Gaussian density per state in the speaker-dependent models and eight multivariate Gaussian densities per state in the speaker-independent model. Both models employ diagonal covariance matrices. In MVR-FV, K is set equal to two and the threshold T in (28) is estimated with database B. In SVM, the K-means algorithm is used to group the training data (256 codewords for the client class and 512 codewords for the impostor class), and a Gaussian kernel (Burges, 1998) is employed.

False acceptance (FA) and false rejection (FR) error rates are computed with database A as follows: FR curves are estimated with 30 speakers x 9 verification signals per client = 270 signals; and FA curves are computed by avoiding cross-gender impostor trials, with 14 impostors x 9 verification signals per impostor x 30 users = 3780 experiments. The baseline in this paper is given by the Viterbi based score, which is more accurate than MVR-FV and SVM. The baseline system gives an EER equal to 5.9%. The EERs provided by MVR-FV and SVM are 7.42% and 14.82%, respectively. The lower accuracy of SVM should be due to the fact that SVM was not specifically proposed to address the problem of text-dependent SV; actually, SVM has been applied to text-independent SV (Campbell et al., 2004). Results are presented in Table 1 and Figs. 3 and 4.
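For reference, the two figures of merit in Table 1 can be computed from lists of client and impostor scores with a simple threshold sweep, as in the sketch below; the paper does not specify its exact ROC-area scaling, so only relative comparisons are meaningful with this code.

```python
import numpy as np

def eer_and_roc_area(client_scores, impostor_scores):
    """Threshold sweep over the pooled scores: returns the EER (in %) and the
    area under the FR-versus-FA trade-off curve, the two figures of merit of
    Table 1 (area scaling here is illustrative, not the paper's)."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([client, impostor]))
    fr = np.array([(client < th).mean() for th in thresholds])     # false rejection rate
    fa = np.array([(impostor >= th).mean() for th in thresholds])  # false acceptance rate
    i = int(np.argmin(np.abs(fr - fa)))
    eer = 100.0 * 0.5 * (fr[i] + fa[i])                            # equal error rate, %
    order = np.argsort(fa)
    area = float(np.trapz(fr[order], fa[order]))                   # area under FR(FA)
    return eer, area
```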
VI. DISCUSSION

Figure 3 shows the BBCM curves obtained for the SV systems mentioned in Section 5.1. As can be seen in Fig. 3, the minimum value of BBCM takes place when the classifier score is equal to T_EER (the equal error rate threshold) in the database where the a priori p.d.f.'s and BBCM are evaluated (database B). This result is rather intuitive and means that the closer the score is to the decision threshold, the lower the classification confidence. However, BBCM provides a formal framework to model confidence as a probability and is not a heuristic method as proposed elsewhere. Moreover, the comparison of Fig. 3 with Table 1 suggests that the higher the EER of the SV system, the wider the BBCM curve.

According to Table 1 and Fig. 4, all the fusion methods outperformed the baseline system, the Viterbi based score. When compared with the baseline system, the ordinary MCS fusion methods give EER reductions of 44.2%, 49.9% and 30.8% with the Mean Rule, the Product Rule and WMVR-MCS, respectively. However, when BBCM based MCS fusion is applied, reductions as high as 55.3%, 50.6% and 42.8% are achieved with the BBCM Mean Rule, the BBCM Product Rule and BBCM WMVR-MCS, respectively. The highest improvement with BBCM MCS takes place with the Mean Rule: a reduction in EER as high as 20% when compared with the ordinary Mean Rule. The superior performance of the BBCM based MCS fusion schemes can also be observed in Fig. 4: the BBCM based MCS fusion strategies lead to average reductions as high as 64% and 19.4% in the area below the ROC curve when compared with the baseline system and the classical MCS fusion methods, respectively.

The reductions in EER obtained with MCS at the abstract level are lower than at the score level. This could be due to the fact that the hard decisions made by the classifiers may incorporate errors that are more difficult to overcome. Nevertheless, the reductions in EER and in the area below the ROC curve achieved by BBCM WMVR are 17% and 25%, respectively, when compared with the conventional weighted MVR. The lower improvement obtained with the BBCM Product Rule, with respect to the ordinary fusion method, must be due to the fact that the imprecision of the BBCM curves is amplified in the Product Rule scheme, which is more sensitive to errors than the Mean Rule (Tax et al., 1997). Observe that the BBCM curves are estimated with a database different from the testing database.
VII. CONCLUSIONS

The problem of MCS fusion in SV is addressed by employing a Bayes-based confidence measure, BBCM. BBCM as a classification criterion, generalized to any classification problem independently of the number of classes, is also proposed. Moreover, a new framework based on BBCM is presented for MCS fusion. Although tested on a telephone-line SV task, this confidence based fusion also generalizes to any classification context independently of the number of classes. It is shown that the BBCM based MCS fusion scheme corresponds to the ordinary Bayes fusion weighted by the reliability of each individual classifier. In addition, BBCM provides a formal model for heuristic weighting functions employed elsewhere. The confidence based MCS fusion leads to reductions as high as 20% and 25% in EER and in the area below the ROC curve, respectively, when compared with ordinary Bayesian fusion. Finally, the applicability of confidence based MCS fusion to SV in mismatched conditions and to other classification problems is proposed as future work.
REFERENCES

Andorno, M., Laface, P., Gemello, R., 2002. Experiments in confidence scoring for word and sentence verification. In: Proc. ICSLP, pp. 1377-1380.
Asami, T., Iwano, K., Furui, S., 2005. Stream-weight optimization by LDA and adaboost for multi-stream speaker verification. In: Proc. ICSLP, Lisbon, Portugal, pp. 2185-2188.
Benzeghiba, M.F., Bourlard, H., 2003. Hybrid HMM/ANN and GMM combination for user-customized password speaker verification. In: Proc. ICASSP, vol. 2, pp. 225-228.
Burges, C., 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), pp. 121-167.
Campbell, W.M., Campbell, J.P., Reynolds, D.A., Jones, D.A., Leek, T.R., 2004. High-level speaker verification with support vector machines. In: Proc. ICASSP, Montréal, Quebec, Canada, pp. 73-76.
De Stefano, C., Della Cioppa, A., Marcelli, A., 2002. An adaptive weighted majority vote rule for combining multiple classifiers. In: Proc. IEEE ICPR, vol. 2, pp. 192-195.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. John Wiley and Sons.
Farrell, K.R., 1995. Text-dependent speaker verification using data fusion. In: Proc. ICASSP, vol. 1, pp. 349-352.
Farrell, K.R., Ramachandran, R.P., Sharma, M., Mammone, R., 1997. Sub-word speaker verification using data fusion methods. In: Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 531-540.
Fumera, G., Roli, F., 2005. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(6), pp. 942-956.
Furui, S., 1997. Recent advances in speaker recognition. Pattern Recognition Letters, 18, pp. 859-872.
Genoud, D., Bimbot, F., Gravier, G., Chollet, G., 1996. Combining methods to improve speaker verification. In: Proc. ICSLP, vol. 3, pp. 1756-1759.
Hazen, T.J., Burianek, Th., Polifroni, J., Seneff, S., 2000. Recognition confidence scoring for use in speech understanding systems. In: Proc. ISCA Tutorial and Research Workshop, Paris, France, pp. 213-220.
Kim, T., Ko, H., 2003. Utterance verification under distributed detection and fusion framework. In: Proc. Eurospeech, Geneva, Switzerland, pp. 889-892.
Kittler, J., Alkoot, F.M., 2003. Sum versus vote fusion in multiple classifier systems. IEEE Trans. Pattern Analysis and Machine Intelligence, 25, pp. 110-115.
Kittler, J., Hatef, M., Duin, R.P.W., Matas, J., 1998. On combining classifiers. IEEE Trans. Pattern Analysis and Machine Intelligence, 20, pp. 226-239.
Kuncheva, L.I., 1999. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition, 34(2), pp. 299-314.
Kuncheva, L.I., 2002. A theoretical study on six classifier fusion strategies. IEEE Trans. Pattern Analysis and Machine Intelligence, 24, pp. 281-286.
Kwan, K.Y., Lee, T., Yang, C., 2002. Unsupervised N-best based model adaptation using model-level confidence measures. In: Proc. ICSLP, pp. 69-72.
Lee, C.H., Huo, Q., 2000. On adaptive decision rules and decision parameter adaptation for automatic speech recognition. Proc. IEEE, 88(8), pp. 1241-1267.
Mak, M.W., Cheung, M.C., Kung, S.Y., 2003. Robust speaker verification from GSM-transcoded speech based on decision fusion and feature transformation. In: Proc. ICASSP, pp. 745-748.
Radova, V., Padrta, A., 2004. Comparison of several speaker verification procedures based on GMM. In: Proc. Interspeech, pp. 1777-1780.
Stolcke, A., König, Y., Weintraub, M., 1997. Explicit word error minimization in N-best list rescoring. In: Proc. 5th Eur. Conf. Speech Communication and Technology, vol. 1, pp. 163-166.
Tax, D.M.J., Duin, R.P.W., Van Breukelen, M., 1997. Comparison between product and mean classifier combination rules. In: Proc. International Conference on Statistical Techniques in Pattern Recognition, pp. 165-170.
Xiang, B., Berger, T., 2003. Efficient text-independent speaker verification with structural Gaussian mixture models and neural network. IEEE Trans. Speech and Audio Processing, 11, pp. 447-456.
Xu, L., Krzyzak, A., Suen, C.Y., 1992. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man, and Cybernetics, 22(3), pp. 418-435.
Yegnanarayana, B., Mahadeva Prasanna, S.R., Zachariah, J.M., Gupta, C.S., 2005. Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system. IEEE Trans. Speech and Audio Processing, 13(4), pp. 578-582.
Yoma, N.B., Carrasco, J., Molina, C., 2005. Bayes-based confidence measure in speech recognition. IEEE Signal Processing Letters, 12, pp. 745-748.
Classifier           | EER (%) | ROC area
---------------------|---------|---------
Viterbi based score  | 5.93    | 118.4
MVR-FV               | 7.42    | 229.4
SVM                  | 14.82   | 782.4
BBCM Mean Rule       | 2.65    | 31.8
Mean Rule            | 3.31    | 39.5
BBCM Product Rule    | 2.93    | 35.8
Product Rule         | 2.97    | 41.5
BBCM WMVR-MCS        | 3.39    | 59.5
WMVR-MCS             | 4.1     | 79.4

Table 1: EER and area below the ROC curve with isolated classifiers and MCS fusion schemes.
[Figure 1 block diagram: the observation sequence O is fed to classifiers CL_1, CL_2, ..., CL_J; each classifier produces a score S_{CL_j}(O) that is compared with a decision threshold to obtain a local decision d_{CL_j}; the local decisions are combined by the fusion block into the final decision D(O).]

Figure 1: MCS fusion at abstract level.
[Figure 2 block diagram: the observation sequence O is fed to classifiers CL_1, CL_2, ..., CL_J; the scores S_{CL_1}(O), ..., S_{CL_J}(O) are combined directly by the fusion block into the final decision D(O).]

Figure 2: MCS fusion at score level.
[Figure 3: three panels plotting BBCM (vertical axis, 0.0 to 1.2) as a function of the classifier score, with the minimum located at the threshold T_EER: a) Viterbi based score S_{CL_1}(O); b) MVR-FV score S_{CL_2}(O); c) SVM score S_{CL_3}(O).]

Figure 3: BBCM curves with: a) Viterbi based score, b) MVR-FV and c) SVM.
Figure 4: DET curves provided by the MCS fusion schemes, compared with the baseline system (Viterbi based score): a) BBCM Mean Rule compared with the ordinary Mean Rule; b) BBCM Product Rule compared with the ordinary Product Rule; and c) BBCM WMVR-MCS compared with the classical WMVR-MCS.