Learning Polynomial Function Based Neutral-Emotion GMM Transformation for Emotional Speaker Recognition

Zhenyu Shan, Yingchun Yang*
College of Computer Science, Zhejiang University
[email protected], [email protected]

Abstract

One of the biggest challenges in speaker recognition is dealing with speaker-emotion variability. The basic problem is how to train the emotion GMMs of the speakers from their neutral speech and how to calculate the scores of feature vectors against the emotion GMMs. In this paper, we present a new neutral-emotion GMM transformation algorithm to overcome this limitation. A transformation function based on a polynomial is learned to represent the relationship between the neutral and emotion GMMs, and it is adopted in testing to calculate scores against the emotion GMMs. Experiments carried out on MASC show that the performance is improved, with an EER reduction of 39.5% from the baseline system.

1. Introduction

In speaker recognition, performance degradation is induced by many factors, including background noise, channel effects, health condition and emotion variability. Emotion variability refers to the mismatch in emotional state between the training and testing speech, and recognition under such mismatch is named emotional speaker recognition in this paper. In recent years, some efforts have been devoted to solving the problem (e.g. [2,5,6]). Scherer [3] presented a structured training approach which aims at familiarizing the system with the emotion variation of the user's voice during training. However, in many applications it is unfriendly to ask the registered speakers to provide speech in various emotional states. When only neutral speech is obtained in training, it is hard to train the emotion model and to calculate the scores of feature vectors against it. In our former work, a neutral-emotion GMM transformation algorithm was presented to train the emotion GMM directly from the speaker's neutral speech [7], based on the following assumption: if two speakers' neutral speech spaces follow similar distributions, so do their emotion speech spaces, especially when they share the same culture.

In this method [7], the KL-distance between neutral GMMs (of order $n$) is calculated to find the $k$ speakers whose spaces are most similar to the registered speaker's. Then, the registered speaker's emotion GMM is trained by transforming the emotion GMMs of these speakers. However, the order of the transformed GMM is $n^k$ times that of the original, which leads to an exponential increase in computation cost.

In this paper, we present a new neutral-emotion GMM transformation algorithm based on the same assumption as above. The transformation function is defined by a polynomial to establish a relationship between the neutral and emotion GMMs. It is adopted in speaker recognition to weaken the impact of speaker-emotion variability with little increase in computation cost. In our method, only neutral speech is required in training, while the testing speech may contain various emotional states whose labels are not available. The performance evaluation is carried out on the MASC database, and promising results are achieved compared with the baseline system.

The remainder of this paper is organized as follows: Section 2 presents the neutral-emotion GMM transformation algorithm. Section 3 describes the emotional speaker recognition method. The performance evaluation is shown in Section 4. Conclusions are drawn in Section 5.

2. Transformation Algorithm

2.1 Gaussian Mixture Model

The Gaussian Mixture Model (GMM) [1] serves to represent the distribution of a speech space. It is a weighted sum of $n$ components, defined as:

$$G(x) = \sum_{i=1}^{n} w_i\, g_i(x; \mu_i, \Sigma_i) \qquad (1)$$

where $x$ is a $D$-dimensional feature vector, $w_i$ are the mixture weights, and $g_i$ is a $D$-variate Gaussian component with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. In GMM-based speaker recognition, each speaker has a GMM representing the distribution of his/her speech space.

* Corresponding author. This work is supported by grants NCET-04-0545, NSFC 60525202/60533040, 863 Program 2006AA01Z136, PCSIRT0652, ZPNSF Y106705, and the National Key Technology R&D Program (No. 2006BAH02A01).


The model representing the distribution of the neutral/emotion speech space is named the neutral/emotion GMM.
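For concreteness, a minimal sketch of evaluating Equation (1) in the log domain might look as follows; the function name and parameter layout are illustrative assumptions, not part of the paper:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_prob(x, weights, means, covs):
    """Compute log G(x) for Equation (1) stably via log-sum-exp:
    log sum_i w_i g_i(x; mu_i, Sigma_i)."""
    log_terms = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=c)
                 for w, m, c in zip(weights, means, covs)]
    return logsumexp(log_terms)
```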

2.2 Transformation Function

Under the assumption in Section 1, there is a relationship between the distributions of the neutral and emotion speech spaces. It can be represented by the transformation function:

$$M_k = f(G_k), \quad k = 1, \dots, K \qquad (2)$$

where $G_k$ and $M_k$ are the neutral and emotion GMMs respectively, and $k$ is the speaker index. If $f$ is known, $M_k$ can be transformed from $G_k$ by Equation (2).

However, the form of $f$ is unknown here, and it is hard to obtain an analytic solution. Equation (2) can also be understood in another way: it is a functional relation between $G(x)$ and $M(x)$, meaning that the value of $f(G_k(x))$ equals $M_k(x)$ for any given feature vector $x$:


$$M(x) = f(G(x)) \qquad (3)$$

In Equation (3), the problem of solving the equation between models is converted into solving a functional equation. Since the log-probability is usually used in testing, Equation (3) is rewritten as:

$$\log(M(x)) = f(\log(G(x))) \qquad (4)$$

2.3 Transformation Algorithm

A polynomial is a smooth and continuous function that can approximate other, more complex functions. Thus we employ it to define the structure of the transformation function:

$$f(x) = \sum_{i=0}^{p} a_i x^i \qquad (5)$$

where $a_i$ are the coefficients, $p$ is the degree, and $x$ takes the value of $\log(G(x))$. The function from the neutral GMM to itself is defined as $f(x) = x$, and the function from the neutral GMM to an emotion GMM can be learned by the transformation algorithm, which includes three steps: obtaining the critical features, calculating the training data, and solving the function.

In the first step, appropriate features are obtained by clustering. By our assumption, all the features of all speakers would be needed to solve Equation (4); however, it is unrealistic to collect the speech of all speakers. Thus, we choose representative features of the speakers to solve the equation. In our method, they are obtained by GMM-based clustering: all speakers' features are pooled to train an $l$-order GMM with the Expectation-Maximization (EM) algorithm, and the mean vectors of this GMM are used as the clustering result, so the number of critical features is $l$.

In the second step, the neutral GMM $G$ and emotion GMM $M$ of each speaker are trained by the EM algorithm. The data for learning the transformation function are obtained by calculating the value pairs $(\log(G(x)), \log(M(x)))$, where $x$ ranges over the critical features.

In the third step, the least squares method [9] is adopted to solve Equation (5). The degree is initially set to a high value to calculate the coefficients; the final degree is the largest $p$ such that $|a_p| > \varepsilon$, where $\varepsilon$ is a small positive number close to zero.
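As an illustrative sketch of the second and third steps, the pairs $(\log(G(x)), \log(M(x)))$ over the critical features can be fitted by least squares; here `np.polyfit` stands in for the solver of [9], and the data below are random placeholders, not values from the paper:

```python
import numpy as np

def learn_transformation(neutral_logprobs, emotion_logprobs,
                         max_degree=10, eps=1e-6):
    """Fit a polynomial f so that emotion ~ f(neutral) in the
    log-probability domain (Equations 4-5), via least squares."""
    # Fit with a deliberately high degree, as the paper does.
    coeffs = np.polyfit(neutral_logprobs, emotion_logprobs, deg=max_degree)
    # np.polyfit returns coefficients from highest to lowest degree;
    # trim leading coefficients whose magnitude is at most eps so the
    # final degree is the largest p with |a_p| > eps.
    keep = np.nonzero(np.abs(coeffs) > eps)[0]
    coeffs = coeffs[keep[0]:] if keep.size else coeffs[-1:]
    return np.poly1d(coeffs)

# Usage: scores of critical features against one speaker's neutral and
# anger GMMs (random stand-ins here, purely for illustration).
rng = np.random.default_rng(0)
neutral = rng.uniform(-60, -20, size=128)
anger = 0.9 * neutral - 3.0 + rng.normal(0, 0.5, size=128)
f_anger = learn_transformation(neutral, anger)
print(f_anger(-40.0))  # transformed anger log-score for a neutral score
```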

3. Emotional Speaker Recognition

Figure 1. The scoring process of a feature vector against the GMMs. $G_n$ is the neutral GMM and $G_1, G_2, \dots, G_m$ are the emotion GMMs. The solid line is the scoring path based on the transformation function, used when only neutral speech is available in training. The dashed line applies when speech in various emotional states is available in training.

In this section, the emotional speaker recognition method based on the transformation algorithm is described. In GMM-based speaker recognition [1], speaker $s_i$ is represented by a GMM $G_i$ trained from his/her neutral speech. In emotional speaker recognition, speaker $s_i$ is represented by models $(G_i^1, G_i^2, \dots, G_i^m)$ trained on individual training sets for the different emotional states, where $m$ is the number of emotional states. If the models can be trained from emotion speech, the score of the testing features $X = x_1, x_2, \dots, x_T$ against the models of speaker $s_i$ is defined as:

$$\mathrm{Score} = \frac{1}{T}\sum_{j=1}^{T} \max_{k=1,\dots,m} \log\big(G_i^k(x_j)\big) \qquad (6)$$

where $G_i^k(x_j)$ is the posterior probability of the $j$-th frame vector $x_j$ against the GMM of the $k$-th emotional state of speaker $s_i$. When only neutral speech is obtained in training, the emotion GMMs cannot be trained by the EM algorithm and the scores cannot be calculated according to Equation (6). Combining Equation (6) with the transformation function (Equation 4), it can be converted to:

$$\mathrm{Score} = \frac{1}{T}\sum_{j=1}^{T} \max_{k=1,\dots,m} f_k\big(\log(G_i(x_j))\big) \qquad (7)$$

where $G_i(x_j)$ is the posterior probability of the $j$-th frame vector $x_j$ against the neutral GMM of speaker $s_i$, and $k$ indexes the emotional state of the scored GMM. The scoring process is shown in Figure 1. The score of a feature vector against the neutral GMM is calculated first; then the emotion scores are obtained through the functions $f_k$. Each is a polynomial evaluation, which adds little computation cost compared with a posterior probability calculation. The final score is the maximum of the neutral and emotion scores.
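A minimal sketch of the scoring in Equation (7) follows; `neutral_logprob` is assumed to return $\log(G_i(x_j))$ for one frame (e.g., the `gmm_log_prob` above), `transforms` holds the learned polynomials $f_k$, and all names are illustrative:

```python
import numpy as np

def emotional_score(neutral_logprob, transforms, frames):
    """Equation (7): compute the neutral log-probability once per frame,
    map it through each emotion's polynomial f_k, and average the
    per-frame maxima over the utterance."""
    per_frame = []
    for x in frames:
        neutral = neutral_logprob(x)                   # log(G_i(x_j))
        candidates = [neutral]                         # f(x) = x for neutral
        candidates += [f_k(neutral) for f_k in transforms]
        per_frame.append(max(candidates))              # max over states
    return float(np.mean(per_frame))
```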

4. Performance Evaluation

4.1 Database

An emotional speech database, MASC (Mandarin Affective Speech Corpus), is used in our experiments. The corpus contains recordings of 68 Chinese speakers (23 female and 45 male) in five emotional states: neutral, anger, elation, panic and sadness. Each speaker reads 5 phrases and 20 sentences three times for each emotional state, and 2 paragraphs only for neutral. The sentences cover all the phonemes and the most common consonant clusters in Mandarin; more details can be found in [4]. Only the sentences (2-10 s) of the corpus are used in the experiments, divided into two parts: the sentences of the last 50 speakers (the emotion database) are used to learn the transformation function, and the sentences of the other 18 speakers (8 female and 10 male) are used for the performance evaluation. The first 5 sentences, each read three times, of each speaker are used for training (training dataset), and the remaining 5×3×15×18 = 4050 are used for testing (testing dataset).

4.2 Learning the Transformation Function

This experiment is designed to learn the transformation function for each emotional state. In the feature extraction, the speech is segmented into frames using a 20-ms window with a 10-ms shift. An energy-based voice activity detector is used to remove silence, and 13-dimensional MFCCs are extracted from the speech frames [8].

First, the critical features are obtained as the component means of a 128-order GMM, trained by the EM algorithm on all the features of the 50 development speakers; the number of critical features is thus 128. Then, the neutral and emotion utterances of each speaker are used to train neutral and emotion GMMs respectively. Combined over the 128 critical features and 50 speakers, 128×50 = 6400 points are obtained to learn the transformation function. Four functions are learned for the four emotional states in MASC; they are used in Subsection 4.3. Figure 2 shows the relationship between the scores of the 128 critical features against the neutral and anger GMMs of the 21st speaker. The curve is the result of the transformation function between the neutral and anger GMM with $p = 10$.

Figure 2. The curve is the result of the transformation function. The points are the scores of the critical features $x$ against the neutral and anger GMM. The X-axis and Y-axis stand for the neutral and anger scores respectively.
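A sketch of the critical-feature step under these settings, using scikit-learn's EM-based GMM as a stand-in; the pooled data here are random placeholders for the 50 speakers' MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Pool MFCC frames from the development speakers and fit a
# 128-component GMM with EM; the component means serve as the
# 128 critical features.
pooled_mfccs = np.random.randn(100_000, 13)  # placeholder for real MFCCs
gmm = GaussianMixture(n_components=128, covariance_type="diag",
                      max_iter=100).fit(pooled_mfccs)
critical_features = gmm.means_               # shape: (128, 13)
```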

4.3 Experiment Strategies

Four verification experiments are designed to evaluate the performance of the neutral-emotion GMM transformation algorithm. The T-norm technique is used to normalize the dissimilarity scores; a minimal sketch follows this list.

Lower Baseline (L): The first experiment is traditional GMM-based recognition. Only the neutral sentences in the training dataset are used to train 32-order GMMs, while all the sentences in the testing dataset are used for testing. Because of the emotion mismatch between the training and testing speech, this result is the lower limit of the four experiments.

Upper Baseline (H): The testing sentences are the same as in the first experiment. All the sentences in the five emotional states in the training dataset are used for training, so every emotional state of the testing utterances is covered by the training data. The result of this experiment is the upper limit. The recognition method follows the description in Section 3.

ENGT1 and ENGT2: The transformation algorithms proposed in [7] and in this paper are applied in ENGT1 and ENGT2 respectively. The training and testing sentences of both experiments are the same as in the first experiment. In ENGT1, the sentences of the last 50 speakers are used to build the emotion database, and Equation (6) is used as the scoring method. In ENGT2, the scoring method is defined by Equation (7).
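For reference, a minimal sketch of T-norm as commonly defined; the cohort construction is not detailed in the paper, so this formulation is an assumption:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Normalize a trial score by the mean and standard deviation of the
    same test utterance's scores against a cohort of impostor models."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()
```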

4.4 Results and Discussion

Figure 3. The DET curves of the four experiments: Lower Baseline (L) (EER=16.06%), Upper Baseline (H) (EER=5.15%), ENGT1 (EER=14.22%) and ENGT2 (EER=10.31%).

The Detection Error Trade-off (DET) curves of the four experiments are shown in Figure 3. The second experiment (EER=5.15%) outperforms the first (EER=16.06%), with the EER reduced by 10.91% absolute. The performance is improved when each emotional state in testing is also present in training; thus it is helpful to familiarize the system with speech in various emotional states. In ENGT1 and ENGT2, the performance is improved by the transformation algorithms, with EER reductions of 27.4% and 39.5% from the lower baseline. Further, ENGT2 is better than ENGT1, with the EER reduced by 3.91% absolute. This shows the effect of the two transformation algorithms: the polynomial-based algorithm achieves the better result.

However, the polynomial-based transformation algorithm does not reach the result obtained when the emotion GMMs are trained from the speaker's own emotion speech. This means the emotion GMM produced by the transformation algorithm does not represent the distribution of the emotion speech space exactly. Two factors may cause the difference. One is that the number of critical features is not enough for learning the transformation function. The second factor is the limitation of the precondition of this algorithm: it assumes that the transformation function can be approximated by a polynomial. Though polynomials can fit most curves, the relationship between the neutral and emotion GMM may be more complex.
5. Conclusion

This paper presents a new neutral-emotion GMM transformation algorithm based on a polynomial function. It establishes a transformation function between the neutral and emotion GMMs. In emotional speaker recognition, the scores of testing features against the emotion GMMs are calculated with this function. Only neutral speech is required in training, while the testing utterances may be in various emotional states. The verification experiments carried out on MASC show that the performance is improved by our algorithm, with an EER reduction of 39.5% from the baseline system. This indicates that the transformation algorithm can overcome the speaker-emotion variability problem to some extent.

References

[1] Douglas A. Reynolds, Richard C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Transactions on Speech and Audio Processing, 3:72-83, Jan. 1995.
[2] Zhaohui Wu, Dongdong Li, Yingchun Yang, "Rules Based Feature Modification for Affective Speaker Recognition," ICASSP 2006, pp. 661-664, May 2006.
[3] K. R. Scherer, T. Johnstone, G. Klasmeyer, "Can automatic speaker verification be improved by training the algorithms on emotional speech?" Proceedings of ICSLP 2000, vol. 2, pp. 807-810, Beijing, China, 2000.
[4] Tian Wu, Yingchun Yang, Zhaohui Wu, Dongdong Li, "MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition," ODYSSEY 2006, pp. 1-5, June 2006.
[5] Wei Wu, Thomas Fang Zheng, Ming-Xing Xu, Huan-Jun Bao, "Study on Speaker Verification on Emotional Speech," ICSLP 2006, pp. 2102-2105, Sep. 2006.
[6] K. R. Scherer, "A cross-cultural investigation of emotion inferences from voice and speech: implications for speech technology," ICSLP 2000.
[7] Shan Zhenyu, Yang Yingchun, Ye Ruizhi, "Natural-Emotion GMM Transformation Algorithm for Emotional Speaker Recognition," Interspeech 2007, pp. 782-785, 2007.
[8] Rivarol Vergin, Douglas O'Shaughnessy, Vishwa Gupta, "Compensated Mel Frequency Cepstrum Coefficients," Proceedings of ICASSP, 1996.
[9] Charles L. Lawson, Richard J. Hanson, "Solving Least Squares Problems," Prentice-Hall, 1995.