Understanding Scores in Forensic Speaker Recognition* W. M. Campbell, K. J. Brady, J. P. Campbell, R. Granville, D. A. Reynolds MIT Lincoln Laboratory [email protected]

Abstract Recent work in forensic speaker recognition has introduced many new methodologies for scoring. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross-entropy, has resulted in a systematic manner of evaluating and designing these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types—words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them “only as numbers.”

1. Introduction Forensic speaker recognition is a developing area in speaker recognition that spans many disciplines. Signal processing approaches to speaker recognition typically focus on automatic methods such as Gaussian mixture models (GMMs) [1] and support vector machines (SVMs) [2, 3]. Forensic linguistics focuses on "high-level" features that distinguish speakers and relies primarily on a human analyst for processing and interpretation. Other methods include spectrogram reading and hybrid approaches between the different methods. A typical scenario in forensic speaker recognition is that an analyst receives two samples of speech: one from a known source and one that is questioned. The goal is to determine whether the source is the same, i.e., whether there is a speaker match or not. An analyst in this situation may wish to use automatic tools to provide insight into whether there is a match. At least two levels of interpretation may be of interest. First, if the automatic system produces a score, the analyst may wonder what confidence the score implies. Second, the analyst may want to understand why the system made a decision. The problem of interpretable scores and their evaluation has been addressed recently in [4]. Estimating the scores has been explored in [5]. The basic idea is to produce a posterior probability of match (something we will refer to as a confidence score for the rest of the paper), along with an objective measure that tells us when we are producing good estimates.

* This work was sponsored by the Department of Defense under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

The quality of confidence estimates can be objectively evaluated using the normalized cross-entropy (NCE) measure [4, 5]. A difficulty with cross-entropy is that it is hard to interpret directly. We propose using the quantities of calibration and refinement from the statistical literature to clarify the concept. After the analyst has a reasonable understanding of the score, the interpretation of why the system made a decision is of interest. High-level automatic speaker recognition systems can potentially bridge this gap. High-level features are understandable as human-observable quantities. We show that the interaction of SVM modeling with high-level features gives a convenient interpretation of scoring that can be of potential use to an analyst. The outline of the paper is as follows. In Section 2, we review confidence scores and normalized cross-entropy. In Section 3, we introduce the concepts of calibration and refinement in a qualitative manner. In Section 4, we provide quantitative values for calibration and refinement and their relation to NCE. In Section 5, experiments demonstrate the concepts and their utility. In Section 6, we discuss interpretation of high-level speaker recognition with SVMs. In Section 7, we show an interpretation example for SVM models via an experiment.

2. Confidence Scores Standard cepstral and high-level speaker recognition systems produce scores that vary widely in interpretation and scale. GMM cepstral systems with UBM-only normalization (no TNorm) produce scores that are nominally log-likelihood ratios, but the dynamic range across speakers is small and does not represent a true likelihood ratio because of the frame-independence assumption. An SVM system produces scores that measure distance to a hyperplane. In both cases, a better representation is needed to unify analysis and comparison of algorithms. A natural choice is to remap the scores from a speaker recognition system to estimate the posterior probability

$$p(\omega = 1 \mid E).$$

We will call this posterior probability the confidence score of a speaker recognition system. In the equation, ω represents the event of interest; i.e., for our example in the introduction, ω = 1 means the speakers match, and ω = 0 means they do not. The variable E represents the evidence that we have about the match. Methods for estimating this posterior probability are given in [4, 5].
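As a simple illustration (a minimal sketch only, not necessarily the estimators of [4, 5]), a raw recognition score can be remapped to an estimate of p(ω = 1 | E) with logistic calibration fit on held-out development trials; the toy scores and labels below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy development trials: one raw recognition score per trial and a same-speaker label.
dev_scores = np.array([[-1.2], [0.3], [2.1], [-0.7], [1.8], [0.9]])
dev_labels = np.array([0, 0, 1, 0, 1, 1])  # 1 = same speaker (omega = 1)

# Logistic (Platt-style) calibration: maps a raw score to an estimate of p(omega = 1 | E).
calibrator = LogisticRegression()
calibrator.fit(dev_scores, dev_labels)

test_scores = np.array([[1.5], [-0.4]])
confidences = calibrator.predict_proba(test_scores)[:, 1]
print(confidences)  # confidence scores in [0, 1]
```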

Once we have a confidence score from a system, it is important to be able to evaluate the utility of the score. One solution to this problem is to use strictly proper scoring rules [4]. These scoring rules evaluate the posterior estimate q(E) produced by a forecaster from the evidence and encourage the forecaster to elicit posterior probabilities. A strictly proper scoring rule that is appropriate for this task is based on cross-entropy. For a sequence of many trials with evidence E_i, the scoring rule produces the objective measure:

$$H_{\mathrm{cond}} = -\pi_1 \frac{1}{N_{\mathrm{true}}} \sum_{i \in \text{true trials}} \log\big(q(E_i)\big) \;-\; \pi_0 \frac{1}{N_{\mathrm{false}}} \sum_{i \in \text{false trials}} \log\big(1 - q(E_i)\big) \qquad (1)$$

Figure 1: Explanation of the concept of calibration. (The figure uses a weather forecaster analogy: over the days J_1, J_2, ..., J_n on which the forecaster gives a confidence of 0.2, the actual weather should be rain about 20% of the time if the forecaster is well calibrated.)

In (1), the log function is base 2 (a convention that we will use for the rest of this paper), π_1 is the prior probability of a match, and π_0 = 1 − π_1. Intuitively, the scoring rule in (1) discourages values near 0 for true trials and values near 1 for false trials. It is common to normalize the quantity in (1) by a baseline and form the normalized cross-entropy (NCE)

$$\mathrm{NCE} = \frac{H_{\mathrm{base}} - H_{\mathrm{cond}}}{H_{\mathrm{base}}}.$$
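For illustration, the sketch below evaluates (1) and the NCE for a small set of toy trials; the scores, labels, prior of 0.5, and the use of the prior entropy h(π_1) as H_base are assumptions made for this example.

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def h_cond(conf, labels, prior1=0.5):
    """Cross-entropy of equation (1); conf holds q(E_i), labels are 1 (match) / 0 (non-match)."""
    conf = np.clip(np.asarray(conf, dtype=float), 1e-12, 1 - 1e-12)
    labels = np.asarray(labels)
    true_term = -np.mean(np.log2(conf[labels == 1]))       # (1/N_true)  sum over true trials
    false_term = -np.mean(np.log2(1 - conf[labels == 0]))  # (1/N_false) sum over false trials
    return prior1 * true_term + (1 - prior1) * false_term

def nce(conf, labels, prior1=0.5):
    """Normalized cross-entropy; the prior entropy is assumed as the baseline H_base."""
    h_base = h(prior1)
    return (h_base - h_cond(conf, labels, prior1)) / h_base

# Toy trials: confidence scores and ground-truth match labels.
conf = [0.9, 0.7, 0.2, 0.6, 0.1, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(h_cond(conf, labels), nce(conf, labels))
```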

Although the cross-entropy in (1) gives an objective measure of forecaster performance, it is not obvious how to interpret this number. What values of NCE are good? Where is the error originating from? We propose a partial solution to the interpretation problem in the next section.

3. Calibration and Refinement DeGroot introduces the idea of calibration and refinement in [6]. Both calibration and refinement are related to NCE in a qualitative and quantitative manner. The starting point in understanding these concepts is to observe a forecaster over a large number of trials (a frequentist perspective). We assume that the confidence score that a forecaster produces belongs to a discrete finite set; a typical example would be that the forecaster produces confidence scores of 0, 0.1, 0.2, ..., 1.0 (11 possible values). We will call the discrete values of the confidence c_1, c_2, ..., c_n. This assumption simplifies our analysis considerably. A question to ask is: what does a forecaster mean when he claims the probability of match is 0.2? An interpretation over many trials is shown in Figure 1 using a weather forecaster analogy. We look at the trials where the forecaster produced a confidence score of 0.2. The forecaster is well calibrated if rain actually occurs on 20% of those trials. Mathematically,

we write that p(ω = 1 | q(E) = c_i) = c_i. Note that this definition does not say anything about the prior probability of rain. We'll come back to this issue later.

An important point in observing the calibration of a forecaster is that we are limited to values of confidence that the analyst produced. If the forecaster never gives a confidence score of 0.5, then we cannot judge their calibration at this value. A complementary concept to calibration is refinement. Refinement addresses the following difficulty. Suppose the prior probability of match is 0.5. Then, a forecaster could always produce 0.5 as their confidence score. This forecaster would be perfectly calibrated but not very useful. The refinement of the forecaster is the degree to which the forecaster gives scores near 0 and 1. A refined analyst should not give "middle of the road" confidences, but instead should give confident scores. Qualitatively, refinement and calibration provide us with ways of evaluating a forecaster. The forecaster should produce confidences that are close to the true posterior probabilities and that are refined. We will show in the next section that these concepts are the only ones necessary to relate the performance of a forecaster to NCE.

4. Quantitative Calibration and Refinement A remarkable property of strictly proper scoring rules is that they can be decomposed into values representing calibration error and refinement [6]. That is, for the particular case of NCE, we can write H_cond = H_calib + H_refine. In this section, we derive formulas for these terms from DeGroot's general equations and explain how these values relate to the qualitative concepts discussed in Section 3. We first consider the version of H_cond for discrete values. A rearrangement of (1) gives

$$H_{\mathrm{cond}} = -\sum_i p\big(q(E) = c_i\big)\Big[\, p(\omega = 1 \mid q(E) = c_i)\,\log(c_i) \;+\; p(\omega = 0 \mid q(E) = c_i)\,\log(1 - c_i) \,\Big] \qquad (2)$$
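The conditional probabilities p(ω = 1 | q(E) = c_i) in (2), and the calibration check of Section 3, can be estimated empirically by counting over trials. A minimal sketch follows; the 0.1-spaced confidence grid and the toy trials are assumptions made for illustration.

```python
import numpy as np

def empirical_calibration(conf, labels, grid=np.round(np.arange(0.0, 1.01, 0.1), 1)):
    """For each discrete confidence value c_i actually produced, estimate p(omega = 1 | q(E) = c_i)."""
    conf = np.asarray(conf, dtype=float)
    labels = np.asarray(labels)
    # Snap each confidence to the nearest grid value c_i (the discrete-forecaster assumption).
    snapped = grid[np.argmin(np.abs(conf[:, None] - grid[None, :]), axis=1)]
    table = {}
    for c in grid:
        mask = snapped == c
        if mask.any():  # we can only judge calibration at values the forecaster produced
            table[float(c)] = float(labels[mask].mean())
    return table

conf = [0.18, 0.22, 0.21, 0.83, 0.78, 0.80, 0.52, 0.49]
labels = [0, 0, 1, 1, 1, 1, 0, 1]
for c, post in empirical_calibration(conf, labels).items():
    print(f"claimed {c:.1f}  observed p(match) {post:.2f}")
```

A well-calibrated forecaster produces claimed values close to the observed proportions.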

4.1. Refinement We define the entropy function as h(p) = −p log(p) − (1 − p) log(1 − p). Then, from DeGroot's equation (5.6), we obtain

$$H_{\mathrm{refine}} = \sum_i p\big(q(E) = c_i\big)\, h\big(p(\omega = 1 \mid q(E) = c_i)\big) = H(\omega \mid q(E)),$$

where the probability of each confidence value is obtained by breaking out the priors, p(q(E) = c_i) = π_0 p(q(E) = c_i | ω = 0) + π_1 p(q(E) = c_i | ω = 1). That is, refinement is just the conditional entropy of the actual match given the analyst's estimate. Since the refinement is a conditional entropy, it is always positive. The interpretation of refinement is as follows. First, note that the goal is to make H_cond as small as possible. Since H_cond is the sum of H_calib and H_refine, this means we should make both terms as small as possible. Now, suppose that we have a reasonably calibrated estimator; then p(ω = 1 | q(E) = c_i) is approximately c_i, and the terms in the refinement equation become p(q(E) = c_i) h(c_i). Now, h() is a concave function with a maximum at 0.5 and minima at 0 and 1 (it looks like an inverted parabola). Therefore, to make refinement as small as possible, we should have high probabilities p(q(E) = c_i) when c_i is near 0 or 1, and small values of p(q(E) = c_i) when c_i is near 0.5.

4.2. Calibration Calibration can be expressed using DeGroot's equation (5.5),

$$H_{\mathrm{calib}} = \sum_{i,j} p\big(\omega = j,\, q(E) = c_i\big)\, \log \frac{p(\omega = j \mid q(E) = c_i)}{c_i^{(j)}}, \qquad c_i^{(1)} = c_i,\; c_i^{(0)} = 1 - c_i. \qquad (3)$$

The resulting simplification is not too insightful and is just H_calib = H_cond − H(ω | q(E)). The quantity in (3) looks somewhat like a mutual information, but differs in that c_i is used in place of a probability mass. From the equation, we observe that if calibration is perfect, then H_calib is zero. Further, the quantity H_calib is always greater than or equal to zero [6]. Thus, we can view H_calib as a measure of the calibration error.

4.3. Calculating Calibration and Refinement

In many cases, the prior in the test set will not be exactly the same as the priors used in estimating the confidence score. We give formulas that avoid this problem by explicitly breaking out the priors. We denote the match prior π_k = p(ω = k).

Calculating refinement can be accomplished by noting (see [7]) that H(ω | q(E)) = H(ω, q(E)) − H(q(E)) = H(q(E) | ω) − H(q(E)) + H(ω). Each of these quantities can be calculated as follows:

Step 1:
$$H(\omega) = h(\pi_1)$$

Step 2:
$$H(q(E) \mid \omega) = \pi_0\, H(q(E) \mid \omega = 0) + \pi_1\, H(q(E) \mid \omega = 1), \qquad
H(q(E) \mid \omega = j) = -\sum_i p\big(q(E) = c_i \mid \omega = j\big) \log\big( p(q(E) = c_i \mid \omega = j) \big)$$

Step 3:
$$H(q(E)) = -\sum_i p\big(q(E) = c_i\big) \log\big( p(q(E) = c_i) \big)$$

Calibration can then be calculated by first using (2) to compute H_cond and then computing the calibration error as H_calib = H_cond − H_refine.
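As a worked sketch of Steps 1-3 and the decomposition (the 0.1-spaced confidence grid, the toy trials, and the prior of 0.5 are assumptions, not the paper's data), the following code estimates the probabilities from counts, computes H_cond, H_refine, and H_calib, and numerically checks the chain-rule identity used above.

```python
import numpy as np

GRID = np.round(np.arange(0.0, 1.01, 0.1), 1)  # assumed discrete confidence values c_1, ..., c_n
EPS = 1e-12

def log2s(x):
    # Base-2 log with clipping so that zero-probability bins contribute zero.
    return np.log2(np.clip(x, EPS, None))

def class_conditional(conf, labels, j):
    """Empirical p(q(E) = c_i | omega = j): histogram of confidences over the grid within class j."""
    conf = np.asarray(conf, dtype=float)
    labels = np.asarray(labels)
    snapped = GRID[np.argmin(np.abs(conf[:, None] - GRID[None, :]), axis=1)]
    sel = snapped[labels == j]
    return np.array([(sel == c).mean() for c in GRID])

def decompose(conf, labels, prior1=0.5):
    prior0 = 1.0 - prior1
    p_q_w0 = class_conditional(conf, labels, 0)        # p(q = c_i | omega = 0)
    p_q_w1 = class_conditional(conf, labels, 1)        # p(q = c_i | omega = 1)
    p_q = prior0 * p_q_w0 + prior1 * p_q_w1            # p(q = c_i), priors broken out
    post1 = prior1 * p_q_w1 / np.clip(p_q, EPS, None)  # p(omega = 1 | q = c_i)

    # Equation (2): cross-entropy over the discrete confidence values.
    h_cond = -np.sum(p_q * (post1 * log2s(GRID) + (1 - post1) * log2s(1 - GRID)))

    # Refinement H(omega | q(E)), computed directly ...
    h_refine = -np.sum(p_q * (post1 * log2s(post1) + (1 - post1) * log2s(1 - post1)))

    # ... and via Steps 1-3 with the chain rule, as a numerical check.
    h_omega = -(prior0 * np.log2(prior0) + prior1 * np.log2(prior1))          # Step 1
    h_q_given_w = -(prior0 * np.sum(p_q_w0 * log2s(p_q_w0))
                    + prior1 * np.sum(p_q_w1 * log2s(p_q_w1)))                # Step 2
    h_q = -np.sum(p_q * log2s(p_q))                                           # Step 3
    assert abs(h_refine - (h_q_given_w - h_q + h_omega)) < 1e-9

    # Calibration error from the decomposition H_cond = H_calib + H_refine.
    h_calib = h_cond - h_refine
    return h_cond, h_calib, h_refine

conf = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
hc, hcal, href = decompose(conf, labels)
print(f"H_cond = {hc:.3f}, H_calib = {hcal:.3f}, H_refine = {href:.3f}")
```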

5. Confidence Scoring Experiments 5.1. GMM cepstral system We performed our initial set of experiments with the same data set as our confidence estimation paper [5]. This data set used the male subset of 3,483 files from the training and testing portions of the NIST 2000 speaker recognition evaluation. The evaluation provided a variety of durations, nominally around 30 s and 2 minutes. The GMM system used to produce scores had 2048 mixture components and used GMM-UBM scoring. Input features were MFCCs and associated deltas. MAP adaptation was used to produce models. The confidence estimate was produced using a speaker match scenario. For the comparison of two conversations, two speaker models were produced; each model was then scored against the other conversation's utterance, producing bidirectional scores s1 and s2. In addition, meta information was automatically extracted from the conversations. The resulting strategy for confidence estimation is shown in Figure 2.

Figure 2: Confidence score estimation inputs. (The bidirectional scores s1 and s2, the likelihood ratio numerator/denominator, SNR, channel classification (CB, Elec, Cell), duration, and other meta information feed the confidence estimation module; the example output shown is 0.42.)
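As a rough illustration of the bidirectional scoring just described (not the authors' implementation: the model sizes, features, and data below are toy assumptions standing in for the 2048-component MAP-adapted GMM-UBM system), each conversation's model is scored against the other conversation relative to a UBM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr(model, ubm, feats):
    # Average per-frame log-likelihood ratio of a speaker model against the UBM.
    return model.score(feats) - ubm.score(feats)

# Toy features standing in for MFCC+delta vectors of two conversations and a background set.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(500, 20))
feats_b = rng.normal(0.3, 1.0, size=(500, 20))
background = rng.normal(size=(2000, 20))

ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(background)
# Stand-in for MAP adaptation: small GMMs trained directly on each conversation.
model_a = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(feats_a)
model_b = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(feats_b)

s1, s2 = llr(model_a, ubm, feats_b), llr(model_b, ubm, feats_a)  # bidirectional scores
print(s1, s2)
```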

The confidence estimator was selected as a multilayer perceptron (MLP). Training data for the MLP were drawn from the Switchboard 2 Phase 3 corpus (distinct from the NIST SRE 2000 corpus). The best configuration was found to use the durations of the two inputs, the bidirectional scores, and the channel labels. We used a match prior of 0.5 in our experiments. For the best MLP confidence estimator we found that H_cond = 0.39. We also compared this to two other types of confidence estimators. One estimator used Gaussian approximations of the average bidirectional score, s = 0.5 s1 + 0.5 s2, for both the match and non-match cases to form a log-likelihood ratio [5]. The H_cond in this case was 0.49. Another estimator was based on hard decisions; it is shown in Figure 3 and had an H_cond of 0.53. In the figure, EER is the equal error rate of the system.

Figure 3: Confidence estimator based on hard decisions. (The bidirectional scores s1 and s2 are averaged, s = 0.5 s1 + 0.5 s2, and s is compared to the threshold set at the EER operating point; scores above the threshold map to a confidence of 1 − EER.)
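To make the MLP confidence estimator concrete, the sketch below trains a small MLP on score-level and meta-information features and outputs an estimate of p(ω = 1 | E). The feature layout, network size, and synthetic data are assumptions made for illustration, not the configuration used in the experiments above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each trial: [s1, s2, duration1, duration2, channel1, channel2] with a same-speaker label.
rng = np.random.default_rng(1)
n = 400
labels = rng.integers(0, 2, size=n)
scores = rng.normal(loc=labels[:, None] * 1.5, scale=1.0, size=(n, 2))  # s1, s2 (toy)
durations = rng.uniform(30, 120, size=(n, 2))                           # seconds (toy)
channels = rng.integers(0, 3, size=(n, 2))                              # 0=CB, 1=Elec, 2=Cell (toy coding)
features = np.hstack([scores, durations, channels])

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
mlp.fit(features, labels)

confidence = mlp.predict_proba(features[:5])[:, 1]  # estimated p(omega = 1 | E) for a few trials
print(confidence)
```

The H_cond of such an estimator can then be evaluated with the cross-entropy sketch given after equation (1).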