Open-Set Speaker Identification in Broadcast News

Chao Gao, Guruprasad Saikumar, Amit Srivastava, Premkumar Natarajan
Raytheon BBN Technologies, Cambridge, MA 02138
{cgao, gsaikuma, asrivast, pnataraj}@bbn.com

ABSTRACT

In this paper, we examine the problem of text-independent open-set speaker identification (OS-SI) in broadcast news. In particular, we investigate the impact of the number of registered speakers on OS-SI performance, which is a central issue in designing a practical OS-SI system. We amend the maximum mutual information (MMI)-based discriminative training scheme to facilitate its incorporation in OS-SI systems. We also improve the implementation to allow the MMI-based approach to be applied to 2048-component Gaussian mixture models. All systems are evaluated using the NIST RT-03, RT-04 and FBIS corpora, with a maximum of 82 registered speakers. Our study shows that a notable performance improvement can be obtained with MMI-based discriminative training, which reduces the equal error rate (EER) by 15.9% relative to the GMM-MAP scheme.

1. INTRODUCTION

Closed-set speaker identification [1] aims to determine the correct speaker of a given test utterance from a registered population. This problem becomes open-set speaker identification (OS-SI) when it is coupled with the possibility that a given test utterance is from an unknown speaker. OS-SI is considered to be one of the most challenging problems in speaker recognition, and it is critical to a wide range of applications such as broadcast monitoring, indexing of speech documents and audio surveillance. For example, in a broadcast monitoring system, it is desirable to identify and track occurrences of utterances from the registered speakers, which involves both OS-SI and speaker diarization.
While speaker diarization in broadcast news has been widely investigated and is available in some commercial systems [2][3], OS-SI in broadcast news is much less explored. In this paper, we investigate OS-SI in the text-independent mode for the broadcast domain. Given a set of registered speakers and a test utterance, OS-SI can be described as a two-stage process: 1) nominate the registered speaker that best matches the utterance, and 2) verify whether the utterance is from the nominated speaker or from an unregistered speaker. One major factor that determines the complexity of OS-SI is the number of registered speakers. As the population grows, the confusion between registered speakers increases, and the association of a test utterance with the nominated speaker becomes less confident. In this paper, we use a discriminative training scheme [4] based on the maximum mutual information (MMI) criterion for
978-1-4577-0539-7/11/$26.00 ©2011 IEEE
speaker recognition. Previous research [5][6][7] on discriminative training has been limited to training GMMs with a small number of Gaussian components. In this paper, we extend the algorithm to train GMMs with 2048 components. To the best of our knowledge, this is the first time such a large number of Gaussian components has been trained with a discriminative approach in speaker recognition. Another contribution of this paper is that we apply discriminative training to both target and cohort speaker models specifically for the OS-SI problem. Our experimental results show that a notable performance improvement can be obtained from the MMI discriminative (MMI-DISC) approach, as compared to the classic GMM-maximum a posteriori (MAP) [8][9] approach. We also investigate how OS-SI system performance is correlated with the number of registered speakers. The study shows that system performance degrades as the number of registered speakers increases, yet the degradation slows down when more than 17 speakers are enrolled.

The rest of this paper is organized as follows. Section 2 introduces the basic OS-SI framework and the error metrics used in this paper. Section 3 briefly reviews the GMM-MAP system. In Section 4, we introduce the modified MMI-DISC training for the OS-SI system. In Section 5, we present the experimental results. Finally, we conclude this paper in Section 6.

2. OPEN-SET SPEAKER IDENTIFICATION

OS-SI requires a two-stage process that involves both an identification step and a verification step.

2.1. Two stages

Assume that $N$ speakers are registered in the system and that their statistical models are described by $\lambda_1, \lambda_2, \ldots, \lambda_N$. Let $O$ denote the feature vector sequence extracted from the test utterance. The process of OS-SI can then be stated as:

$$O \in \begin{cases} \lambda_j, & \text{if } j = \arg\max_{1 \le i \le N} p(O \mid \lambda_i) \text{ and } p(O \mid \lambda_j) \ge \tau \\ \text{imposter}, & \text{otherwise} \end{cases} \qquad (1)$$

where $\tau$ is a predetermined threshold.
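The two-stage decision can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the function name `open_set_identify`, the dictionary of per-speaker scores, and the choice of `>=` at the threshold are all hypothetical choices made for the example.

```python
from typing import Mapping, Optional


def open_set_identify(scores: Mapping[str, float], threshold: float) -> Optional[str]:
    """Return the accepted speaker id, or None when the trial is declared an imposter.

    `scores` maps each registered speaker id to the (log-)likelihood score of the
    test utterance under that speaker's model (hypothetical input format).
    """
    if not scores:
        return None
    # Stage 1 (identification): nominate the best-scoring registered speaker.
    nominated = max(scores, key=lambda speaker: scores[speaker])
    # Stage 2 (verification): accept the nomination only if its score
    # reaches the threshold (assumption: ties at the threshold are accepted).
    if scores[nominated] >= threshold:
        return nominated
    return None
```

Returning `None` for an imposter keeps the two outcomes of the second stage distinct from a wrong nomination in the first stage, which matters when the errors are counted separately.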
In the first stage of OS-SI, the speaker model that yields the maximum likelihood is nominated; in the second stage, the system decides whether to assign the observation to the nominated speaker or to declare that it originated from an imposter.

2.2. Error metrics
ICASSP 2011
Three types of errors (confusion, false rejection and false alarm) exist in the OS-SI problem. The definitions of these errors are listed in Table 1.

Table 1. Errors in open-set speaker identification

Reference    Stage 1              Stage 2      Error type (case)
Speaker j    Speaker i (i ≠ j)    Speaker i    Confusion (1)
Speaker j    Speaker i (i ≠ j)    Imposter     False rejection (2)
Speaker j    Speaker j            Imposter     False rejection (3)
Imposter     Speaker i            Speaker i    False alarm (4)

Although the causes of error in the first three cases in Table 1 are different, the consequences are the same: a trial from registered speaker j is missed. Therefore, we group confusion and false rejection into a single class of misses. In this context, the performance of OS-SI can be illustrated by a detection error tradeoff (DET) curve [10], where the missing probability is plotted against the false alarm probability. False alarm probabilities are computed from the proportion of non-target trials whose scores exceed the threshold $\tau$, i.e.

$$P_{\mathrm{FA}}(\tau) = \frac{\text{number of non-target trials with score} > \tau}{\text{number of non-target trials}} \qquad (2)$$

Missing probabilities are computed from the proportion of target trials whose scores fall below the threshold $\tau$ or whose hypothesized speaker is incorrect, i.e.

$$P_{\mathrm{miss}}(\tau) = \frac{\text{number of target trials with score} < \tau \text{ or with an incorrect hypothesized speaker}}{\text{number of target trials}} \qquad (3)$$

In Table 1, we also note that the first-stage errors of case 1 and case 2 will result in an error regardless of the decision made in the second stage; hence, another error metric of interest here is the error rate of the first stage alone, the closed-set identification error.

3. GMM-MAP SYSTEM

In the GMM-MAP system [8], each speaker is represented by a GMM. A speaker-independent universal background model (UBM) is first trained with sufficient data from non-target speakers using the expectation maximization (EM) algorithm. Then, the parameters of each speaker model are derived by adapting the well-trained parameters of the UBM using the speaker's training data. This allows speaker characteristics to be modeled carefully with a large number of Gaussian components even when only limited speaker training data is available. In addition, the alignment of Gaussian components between the speaker models and the UBM benefits the decoding process, since it enables fast scoring [8][11]. In the fast scoring approach, the likelihood of a speaker model can be approximated very well using only the top C best-scoring Gaussian components selected by the UBM. This technique is especially helpful to our broadcast monitoring system, since the processing speed of our system needs to be comparable to real time. Furthermore, in our system, where unconstrained cohort normalization (UCN) [12] is employed, hundreds of speaker models, including both registered speakers and cohort speakers, are involved. Such a system requires a larger number of score calculations during decoding, and the introduction of fast scoring is particularly helpful.

4. DISCRIMINATIVE TRAINING

4.1. MMI-based discriminative training scheme

In an MMI-based discriminative training scheme, the MMI objective function can be optimized by using the extended Baum-Welch algorithm [4], where the mean and variance are updated by:

$$\mu_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O) + D_{jm}\,\mu'_{jm}}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}}, \qquad \sigma^2_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(O^2) - \theta^{\mathrm{den}}_{jm}(O^2) + D_{jm}\left(\sigma'^2_{jm} + \mu'^2_{jm}\right)}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}} - \mu^2_{jm} \qquad (4)$$

Here $j$ and $m$ denote the indices of the speaker model and its Gaussian component, respectively. $\mu'_{jm}$ and $\sigma'^2_{jm}$ are the mean and variance in the previous iteration, and $D_{jm}$ is a smoothing constant [4]. The terms

$$\gamma_{jm}, \quad \theta_{jm}(O), \quad \theta_{jm}(O^2) \qquad (5)$$

are the occupation counts and the first- and second-order statistics at the Gaussian component level. The numerator statistics $\gamma^{\mathrm{num}}_{jm}$, $\theta^{\mathrm{num}}_{jm}(O)$ and $\theta^{\mathrm{num}}_{jm}(O^2)$ are ordinary maximum likelihood (ML) statistics [4]. The posterior probabilities for the denominator can be estimated by

$$P(\lambda_j \mid O_r) = \frac{p(O_r \mid \lambda'_j)^{\kappa}}{\sum_{i} p(O_r \mid \lambda'_i)^{\kappa}} \qquad (6)$$

The likelihood $p(O_r \mid \lambda'_j)$ of training segment $O_r$ is evaluated by using the old model parameters $\lambda'_j$. $\kappa$ is a compensation factor for the incorrect assumption of independence between frames in each segment [8]. Finally, (7) is expressed in terms of the weight of the m-th Gaussian component and the number of Gaussian components in the speaker model.

4.2. Extended MMI-based discriminative training

We extend the MMI-DISC training scheme for the OS-SI system in two aspects: 1) a fast scoring technique is employed to calculate the likelihood $p(O_r \mid \lambda'_j)$ in (6), and 2) discriminative training is applied to both target speaker models and cohort models. In order to keep the speaker GMMs well aligned with the UBM and thus allow fast scoring, Gaussian component weights are trained only in MAP adaptation in our implementation, and are not updated in MMI iterations.

We use three sets of models in our system: the UBM, the target speaker models and the cohort speaker models. The cohort models are used in UCN for robust verification performance, and the final score is computed as

$$S(O) = \log p(O \mid \lambda_j) - \frac{1}{L} \sum_{l=1}^{L} \log p(O \mid \lambda_{c_l}) \qquad (8)$$

where $\lambda_j$ and $\lambda_{c_l}$ belong to the target speaker models and the cohort speaker models, respectively. $\lambda_{c_1}, \ldots, \lambda_{c_L}$ are the L cohort models associated with the top L likelihoods.

To improve OS-SI performance, both identification and verification performance should be optimized. To this end, utterances from registered speakers are not only used to
discriminate among registered speakers but also to distinguish registered speakers from the top L cohort speakers. As described by (6), each training utterance from a registered speaker is used to calculate the denominator statistics for all enrolled speaker models and the top L cohort models, in proportion to its likelihood score. A single iteration of this algorithm can be summarized as follows:

Extended MMI-based Discriminative Training Algorithm
1. Calculate the likelihoods of all speaker models for each training segment from target speakers. Note that the likelihoods are calculated using the fast scoring technique.
2. Calculate the posterior probabilities over the set that includes all target speakers and the top L cohort speakers associated with each target training segment. Training utterances from target speakers and from cohort speakers are distinguished, and each cohort training utterance is associated with its correct cohort speaker.
3. Calculate the numerator statistics using the ordinary ML approach.
4. Calculate the denominator statistics according to (5) and (6).
5. Update the means and variances according to (4).

5. EXPERIMENTAL RESULTS

5.1. Experimental setup

To evaluate the performance of the OS-SI system in broadcast news, three broadcast corpora, FBIS (Arabic and Chinese), NIST RT-03 and RT-04 (English), are used in our experiments. First, these corpora are split into training and testing sets based on two rules: 1) training data should be older than testing data, to avoid the same news data appearing in both sets; 2) the number of common speakers in both sets should be maximized. By doing so, 82 common speakers (60 male and 22 female) with a minimum of 2 minutes of training data are selected as target speakers. Secondly, 20 male and 20 female speakers with a minimum of 2 minutes of training data in each of the three languages are selected as cohort speakers. Finally, 5 hours of gender-balanced data from over 200 non-target speakers, who are not cohort speakers either, are selected from the three languages and used for UBM training. The testing data set includes 3676 segments from target speakers and 3743 segments from imposters. The length of a testing segment ranges from 3 to 15 seconds.

Feature analysis is performed using 29 (14 base and 15 first-order derivative) perceptual linear prediction (PLP) cepstral coefficients on a 25-ms frame basis with a 10-ms skip rate. In both the training and decoding processes, features are normalized by the mean and variance of the features in the UBM training data.

A language-independent UBM with 2048 Gaussian components is trained by three iterations of EM. Each target and cohort speaker model is then estimated from the UBM by MAP adaptation. The relevance factor in MAP adaptation is set to 16 for all experiments. The number of Gaussian components is also fixed at 2048 for all speakers across all experiments.

In the decoding process, fast scoring is employed. In our experiments, the top 5 best-scoring Gaussian components are used. The verification decision is made based on the normalized log-likelihood score described in (8), where the number of cohort models L is fixed in our experiments.

5.2. MMI-based discriminative training performance

In this experiment, the performance of the MMI-DISC training approach is compared with that of MAP adaptation. Initial GMMs, including 82 target speaker models and 120 cohort models, are obtained from a single iteration of MAP adaptation. These models are then further discriminatively trained by three iterations of MMI using the algorithm described in Section 4. In our experiment, the initial models are also our baseline models. We found that additional MAP iterations yield no significant further gains in performance. Fig. 1 shows that models trained by the MMI-DISC approach reduce the EER by 15.9% relative and also reduce the closed-set identification error from 7.48% to 6.23%.
Fig. 1. DET curves of MAP and MMI-DISC. MMI-DISC reduces EER by 15.9% relative to MAP adaptation.

5.3. Impact of the population of registered speakers

This experiment aims to characterize the OS-SI system performance with respect to the number of registered speakers. Assuming M speakers are enrolled in our system, the empirical performance of a system with N (N<M) enrolled speakers is computed using Monte Carlo simulations, wherein 50 sets of N out of M speakers are randomly selected as target speakers. Counts of misses and false alarms at the EER operating point are accumulated and used to compute system performance for N registered speakers. We can train the MAP models for all M target speakers and the cohort speakers once, and then use the models of the N selected speakers to perform the Monte Carlo simulation, since each MAP speaker model is trained independently using its own segments.
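The subset simulation can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure used in the experiments: the trial format, the function name `simulate_subsets`, the fixed `threshold` argument (the search for the EER operating point is left out), and the treatment of trials from speakers outside the sampled subset as imposter trials are all hypothetical choices. It relies on the per-model scores being computable once for all M speakers and reused for every subset.

```python
import random
from typing import Dict, Optional, Sequence, Tuple

# A trial is (true speaker id, or None for an imposter; score per speaker model).
Trial = Tuple[Optional[str], Dict[str, float]]


def simulate_subsets(
    trials: Sequence[Trial],
    speakers: Sequence[str],
    n: int,
    threshold: float,
    num_sets: int = 50,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (miss rate, false-alarm rate) pooled over random n-speaker subsets."""
    rng = random.Random(seed)
    misses = targets = false_alarms = nontargets = 0
    for _ in range(num_sets):
        enrolled = set(rng.sample(list(speakers), n))
        for truth, scores in trials:
            # Stage 1: nominate the best-scoring model among the enrolled subset.
            nominated = max(enrolled, key=lambda s: scores[s])
            # Stage 2: accept the nomination only if its score reaches the threshold.
            accepted = scores[nominated] >= threshold
            if truth in enrolled:
                targets += 1
                # Miss: rejected, or accepted as the wrong speaker.
                misses += (not accepted) or (nominated != truth)
            else:
                # Assumption: speakers outside the subset count as imposters.
                nontargets += 1
                false_alarms += accepted
    miss_rate = misses / targets if targets else 0.0
    fa_rate = false_alarms / nontargets if nontargets else 0.0
    return miss_rate, fa_rate
```

Pooling the counts over all sampled subsets before dividing, rather than averaging per-subset rates, mirrors the accumulation of miss and false-alarm counts described above.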
We can see from Fig. 2 that the EER increases sharply as the number of registered speakers increases from 2 to 17, a relative increase of 199.6%. In comparison, the growth in EER becomes slower after that: adding 65 more target speakers to the set produces a relative increase of only 48.1% in EER. As the number of speakers increases from 17 to 82, the identification error curve can be approximated by a line of slope 0.047, which is slightly more than half of the EER slope, 0.082. This observation indicates that verification and identification contribute almost equally to the increase in the overall missing probability, since the EER curve represents all missing errors and the identification error curve represents the missing errors caused in the identification stage. Additionally, the false alarm probability is determined solely by verification. Therefore, verification generally plays the more important role in the change in overall system performance as more target speakers are added.
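The slope comparison can be made explicit with a short calculation. It uses only the two fitted slopes quoted above; attributing the remainder of the EER slope to verification-stage misses is an interpretation of the argument, not a separately reported measurement.

```latex
\frac{0.047}{0.082} \approx 0.57,
\qquad
\frac{0.082 - 0.047}{0.082} = \frac{0.035}{0.082} \approx 0.43
```

Under this reading, roughly 57% of the growth in missing probability between 17 and 82 speakers comes from the identification stage and roughly 43% from the verification stage.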
Fig. 2: MAP system performance with different numbers of registered speakers. System performance degrades significantly from 2 to 17 speakers, and the trend slows down after 17 speakers.

The same Monte Carlo simulations are also conducted for the models trained by MMI-DISC, although this introduces a simulation error. Because each MMI model with N (N<M) target speakers is trained by using all M target speakers' training segments, this simulation usually underestimates the performance of MMI models. However, Fig. 3 shows that the "underestimated" performance of MMI models is still superior to that of MAP models in all cases. At the same time, this alternative MMI-DISC system provides more design flexibility and can be implemented much more easily, which is valuable to many practical systems.
6. CONCLUSION

In this paper, we study the problem of OS-SI and compare the MAP adaptation and MMI-DISC training approaches. We show that the MMI-DISC approach outperforms MAP adaptation in our experiment by a noticeable relative margin of 15.9% (from an EER of 15.15% to 12.74%). In addition, we examine the characteristics of OS-SI performance with respect to the number of registered speakers. Generally, the system performance degrades as the set of registered speakers grows. However, our results show that the degradation slows down after a certain threshold value (17 in our experiment). It is also observed that verification generally plays a more important role than identification in the OS-SI system.
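As a quick consistency check, the relative margin follows directly from the two EER values quoted above:

```latex
\frac{15.15 - 12.74}{15.15} = \frac{2.41}{15.15} \approx 0.159 = 15.9\%
```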
7. REFERENCES

[1] H. Gish and M. Schmidt, “Text-independent speaker identification,” IEEE Signal Process. Mag., Oct. 1994, pp. 18–32.
[2] BBN Broadcast Monitoring System™, http://www.bbn.com/products_and_services/bbn_broadcast_monitoring_system/.
[3] IBM Translingual Automatic Language Exploration System, http://domino.research.ibm.com/comm/research_projects.nsf/pages/tales.index.html.
[4] D. Povey, “Discriminative Training for Large Vocabulary Speech Recognition,” Ph.D. thesis, Cambridge University, 2004.
[5] P. Angkititrakul and J. H. L. Hansen, “Discriminative In-set/Out-of-set Speaker Recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 498–508, Feb. 2007.
[6] Q. Y. Hong and S. Kwong, “A Discriminative Training Approach for Text-independent Speaker Recognition,” Signal Process., vol. 85, pp. 1449–1463, 2005.
[7] K. P. Markov and S. Nakagawa, “Discriminative Training of GMM Using a Modified EM Algorithm for Speaker Recognition,” in Proc. ICSLP, 1998, vol. 2, pp. 177–180.
[8] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., Academic Press, Inc., vol. 10, pp. 19–41, 2000.
[9] E. Singer and D. A. Reynolds, “Analysis of Multitarget Detection for Speaker and Language Recognition,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, Toledo, Spain, ISCA, pp. 301–308, May 2004.
[10] A. Martin et al., “The DET Curve in Assessment of Detection Task Performance,” in Proc. Eurospeech 97, pp. 1895–1898.
[11] P. A. Torres-Carrasquillo, E. Singer, W. Campbell, T. Gleason, A. McCree, D. A. Reynolds, F. Richardson, W. Shen, and D. Sturim, “The MITLL NIST LRE 2007 Language Recognition System,” in Interspeech 2008, Brisbane, Australia, Sept. 22–26, 2008.
[12] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score normalization for text-independent speaker verification systems,” Digital Signal Processing, vol. 10, no. 1/2/3, pp. 42–54, 2000.
Fig. 3: Performance comparison between MAP and MMI-DISC with different numbers of registered speakers.