Statistical Approach for Voice Personality Transformation

Ki-Seung Lee
Abstract—A voice transformation method which changes the source speaker's utterances so as to sound similar to those of a target speaker is described. Speaker individuality transformation is achieved by altering the LPC cepstrum, the average pitch period, and the average speaking rate. The main objective of the work involves building a nonlinear relationship between the parameters for the acoustical features of two speakers, based on a probabilistic model. The conversion rules involve a probabilistic classification and a cross correlation probability between the acoustic features of the two speakers. The parameters of the conversion rules are estimated by maximizing the likelihood of the training data. To obtain transformed speech signals which are perceptually closer to the target speaker's voice, prosody modification is also involved. Prosody modification is achieved by scaling the excitation spectrum and by time scale modification with appropriate modification factors. An evaluation by objective tests and informal listening tests clearly indicated the effectiveness of the proposed transformation method. We also confirmed that the proposed method leads to smoothly evolving spectral contours over time, which, from a perceptual standpoint, produced results that were superior to conventional vector quantization (VQ)-based methods.

Index Terms—Maximum likelihood (ML) estimation, prosody modification, voice conversion.

Manuscript received December 21, 2004; revised March 13, 2006. This work was supported by Konkuk University in 2005. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nick Campbell. The author is with the Department of Electronic Engineering, Konkuk University, Seoul 143-701, Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2006.876760
I. INTRODUCTION

Voice personality transformation [1]–[11] is a process in which a voice personality is altered so that one speaker's voice can be heard as another's. It has numerous applications in a variety of areas, such as the personification of text-to-speech synthesis systems, preprocessing for speech recognition [12], improving the intelligibility of abnormal speech uttered by a person with a speech problem [8], and improving the effectiveness of foreign language training systems [14]. Voice personality transformation is generally performed in two steps. In the first step, the training stage, a set of speech feature parameters of both the source and target speakers is extracted, and appropriate mapping rules that transform the parameters of the source speaker onto those of the target speaker are generated. In the second step, the transformation stage, the features of the source signal are transformed using the mapping rules developed in the training stage, so that the synthesized speech possesses the personality of the target speaker. To implement voice personality transformation, two problems need to be considered: what features are extracted from
the underlying speech signals, and how to modify these features so that the transformed speech signals mimic the target speaker's voice. The first problem is closely related to our past knowledge of automatic speaker recognition/verification tasks. Extracting speaker-specific features from given speech signals plays an important role in these two tasks. It is also an important issue in voice personality transformation tasks. It is known that the vocal tract transfer function is a major factor in specifying speaker individuality [22]. For this reason, feature parameters representing the vocal tract transfer function have been widely used in voice personality transformation, including formant frequencies [4], [5], the LPC cepstrum [2], [10], [11], and line spectrum pair (LSP) coefficients [9]. In this work, the LPC cepstrum is used as the feature parameter that represents the vocal tract transfer function. Prosody is another clue in discriminating speaker individuality [22]. It is also known that the speaking style of a speaker is highly correlated with prosody information [2]. Hence, prosody modification is highly desirable in terms of obtaining transformed speech signals which are perceptually closer to a target voice. Previous efforts at prosody modification largely focused on pitch modification [1], [2], [5], [10]. However, since one speaker's prosody is represented by a wide variety of acoustic features [22], pitch modification alone is insufficient in terms of prosody. Phoneme duration and short-time energy have also been involved in voice personality transformation tasks [9]. In the proposed method, prosody modification is accomplished by modifying not only the pitch but the average speaking rate as well. To modify the speaking rate, we employed the rate of speech (ROS) of a given utterance, which is measured as the average number of vowels (or syllables) per second [23], and applied time scale modification (TSM) [24], [25] to the source speaker's utterances with a reasonable modification factor. The second problem can be formulated as building acceptable mapping rules from the source speaker's feature parameters to those of the target speaker. In previous studies, the entire speaker space was partitioned into several clusters using vector quantization (VQ) [21]; the mapping rules for each partition were then estimated in the form of a histogram [1] or by a minimum mean square error criterion [3], [10]. The underlying assumption is that each cell implicitly corresponds to a phoneme, so that the mapping rules reflect phonetic variation. These methods, however, reveal problems due to the hard-clustering property of VQ-based classification. According to Stylianou's study [7], the VQ-based classification method causes a discontinuity problem in transition regions. Moreover, Knagenhjelm [26] noted that spectral discontinuities between adjacent frames are one of the major sources of degradation of quality in speech coding
systems. Hence, it would be desirable to adopt a soft-clustering approach [7], [11] for voice conversion.

This work mainly focuses on transforming the spectral characteristics of two speakers. We first define a model describing the relationship between two speakers' feature parameters; conversion rules are then built based on the model. Our assumption is that the occurrences of source and target features are controlled by both intra- and inter-speaker probabilistic models: an underlying intra probabilistic model accounts for intra-speaker variabilities, whereas inter-speaker variabilities are accounted for by an inter probabilistic model. In the proposed method, the intra and inter probabilistic models are represented by a Gaussian mixture model (GMM) and by cross correlational probabilities, respectively. The parameters representing each probabilistic model are obtained by means of maximum likelihood (ML) estimation. The conversion rules are derived from the probabilistic models using minimum mean square error (MMSE) estimation [20]. The GMM-based method [7] employed transformation matrices in order to minimize the overall distance between the target and the transformed feature vectors. Here, the proposed conversion function is derived only from the probabilistic models. The resulting conversion rules include not only the source speaker's probabilistic model, but the target speaker's probabilistic model and the cross correlation probabilities as well. This approach has several advantages over conventional methods. Compared with VQ-based approaches [1], [3], the proposed conversion function produces continuously evolving features over time. Hence, a situation in which transformed features change abruptly in transition regions can be avoided. Moreover, since the proposed conversion rules are represented by a simple linear combination of likelihood values and the target speaker's mean vectors, the computational complexity and required memory size are reduced.
Objective and subjective tests were performed to evaluate the efficiency of the proposed method. In the objective tests, the LPC cepstrum distance reduction ratio and the likelihood ratio are used as measures of the performance of the transformation. ABX tests using several phonetically balanced sentences were performed to evaluate subjective performance. In addition, a preference test was administered to evaluate the improvement in performance in terms of quality.

This paper is organized as follows. Section II provides an overview of the proposed voice transformation method, including the training procedure and the online transformation procedure. In Section III, the modelling and transformation of the LPC cepstrum coefficients are presented. The procedure used for prosody modification is described in Section IV. Experimental results are presented in Section V. Finally, concluding remarks are summarized in Section VI.

II. OVERVIEW OF THE VOICE TRANSFORMATION SYSTEM

Fig. 1. Block diagram of the proposed voice transformation system.

A block diagram of the proposed voice personality transformation system is shown in Fig. 1. In the training stage, voices from the source and target speakers are first recorded, and an analysis is performed on these speech samples to derive the feature parameters to be transformed. In this work, the LPC cepstrum, the pitch, and the number of vowels per unit second were used as feature parameters. In practice, even if two speakers utter the same words, given their different speaking rates, it is unlikely that a synchronized set of LPC cepstrum sequences would be produced. To time-align these sequences, dynamic time warping (DTW) [17] was applied in a preprocessing step. The resulting time-aligned LPC cepstrum sequences were used to build the conversion rules for the vocal
tract transfer functions. The average pitch values for each speaker and the average ROS were used to build the rules for prosody modification.

In the online stage, the feature parameters used in the training stage were extracted from the incoming speech signals. The feature parameters were then modified, based on the conversion rules from the training stage. The modified short-time speech signals were synthesized from the modified parameters. Finally, continuous waveforms were obtained by concatenating the short-time modified speech signals. In this procedure, the synchronized overlap and add (SOLA) [25] algorithm was used to modify the speaking rate. Each part of the proposed system is described in more detail in the following sections.
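As a concrete illustration of the alignment step described above, the following is a minimal DTW sketch, assuming a Euclidean local distance between cepstral frames; the function name and interface are illustrative, not the implementation used in the paper.

```python
import numpy as np

def dtw_align(X, Y):
    """Time-align two LPC cepstrum sequences X (Tx, p) and Y (Ty, p)
    with classic DTW; returns the warping path as (source, target) pairs."""
    Tx, Ty = len(X), len(Y)
    # Local distance between every pair of frames.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # Accumulated cost with the usual (diagonal, up, left) recursion.
    C = np.full((Tx, Ty), np.inf)
    C[0, 0] = D[0, 0]
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                continue
            best = min(C[i - 1, j - 1] if i and j else np.inf,
                       C[i - 1, j] if i else np.inf,
                       C[i, j - 1] if j else np.inf)
            C[i, j] = D[i, j] + best
    # Backtrack from the end to recover the aligned frame pairs.
    path, i, j = [(Tx - 1, Ty - 1)], Tx - 1, Ty - 1
    while i or j:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: C[s])
        path.append((i, j))
    return path[::-1]
```

The aligned pairs provide the synchronized vector set that the training procedure of Section III operates on.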
III. TRANSFORMATION RULES FOR LPC CEPSTRA

A. Model

Fig. 2. Graphical depiction of the proposed model for representing the relationship between a source speaker's feature space and the target speaker's feature space (in the case of $N = M = 4$).

In this paper, an LPC cepstrum vector is assumed to be generated by a number of random sources, as depicted in Fig. 2. Each random source is assumed to be a Gaussian random variable, and a set of random sources is characterized by an underlying speaker. This assumption is based on Gaussian mixture speaker models [18], which have been employed in speaker identification tasks [18] and voice personality transformation tasks [7]. To reflect the dependencies of the two speakers' LPC cepstrum vectors, the cross correlational probabilities between the two speakers' random sources are employed, as shown in Fig. 2. In the figure, four random sources are assumed in both the source and target speakers' LPC cepstrum spaces. According to this model, the joint probability of the source LPC cepstrum $x$, the target LPC cepstrum $y$, the source speaker's $i$th random source $\omega_i^x$, and the target speaker's $j$th random source $\omega_j^y$ is given by

$$p(x, y, \omega_i^x, \omega_j^y) = p(\omega_i^x)\, p(x \mid \omega_i^x)\, p(\omega_j^y \mid \omega_i^x)\, p(y \mid \omega_j^y) \qquad (1)$$

where $p(\omega_j^y \mid \omega_i^x)$ is the cross correlational probability between the $i$th random source of the source LPC cepstrum and the $j$th random source of the target LPC cepstrum. In [1], this term was represented by a histogram of a discrete set of feature vectors. This term is regarded as describing the dependencies of the two random vector sets in this work. Since each random source is assumed to be Gaussian,

$$p(x \mid \omega_i^x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i^x|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i^x)^T (\Sigma_i^x)^{-1} (x - \mu_i^x)\right) \qquad (2)$$

$$p(y \mid \omega_j^y) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_j^y|^{1/2}} \exp\!\left(-\frac{1}{2}(y - \mu_j^y)^T (\Sigma_j^y)^{-1} (y - \mu_j^y)\right) \qquad (3)$$

where $\Sigma_i^x$ and $\mu_i^x$ are the covariance matrix and mean vector of the $i$th random source for the source LPC cepstrum, respectively. Similarly, $\Sigma_j^y$ and $\mu_j^y$ are the covariance matrix and mean vector of the $j$th random source for the target LPC cepstrum, respectively. $p$ is the order of the LPC cepstrum.

B. Maximum Likelihood Parameter Estimation

Given training speech from the source and target speakers, the goal of model training is to estimate the parameters $\lambda = \{p(\omega_i^x), \mu_i^x, \Sigma_i^x, p(\omega_j^y \mid \omega_i^x), \mu_j^y, \Sigma_j^y\}$ describing $p(x, y)$ which, in some sense, best match the actual distribution of the training feature vectors. To this end, maximum likelihood (ML) estimation is employed in this work. The aim of ML estimation is to find the model parameters for which the likelihood of the underlying probabilistic model is maximized, given the training data. For a sequence of time-aligned training vectors $(x_t, y_t)$, $t = 1, \ldots, T$, the likelihood function of the parameter set $\lambda$ over $X, Y$ can be written as

$$p(X, Y \mid \lambda) = \sum_{\mathbf{s}} \prod_{t=1}^{T} p(x_t, y_t, \omega_{s_t}^x, \omega_{q_t}^y \mid \lambda) \qquad (4)$$

where $\mathbf{s} = \{(s_1, q_1), \ldots, (s_T, q_T)\}$, with $1 \le s_t \le N$ and $1 \le q_t \le M$, is a particular sequence of random sources (or states, in the HMM context). Note that $N$ and $M$ are the numbers of random sources for the source and target speakers' LPC cepstrum vectors, respectively. Hence, the notation $\sum_{\mathbf{s}}$ denotes the sum over all possible random source sequences. The optimal parameter set $\lambda^*$ is given by

$$\lambda^* = \arg\max_{\lambda}\; p(X, Y \mid \lambda). \qquad (5)$$
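A small numerical sketch may help fix the model: under the frame-independence assumption used later in (19), the likelihood (4) reduces to a per-frame marginalization of the joint probability (1). All names below are illustrative; prior_x holds $p(\omega_i^x)$, mu_*/cov_* hold the Gaussian parameters of (2) and (3), and h[i, j] holds the cross correlational probability $p(\omega_j^y \mid \omega_i^x)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(X, Y, prior_x, mu_x, cov_x, h, mu_y, cov_y):
    """Log-likelihood of time-aligned cepstrum pairs under the joint model
    p(x, y) = sum_ij p(w_i^x) p(x|w_i^x) p(w_j^y|w_i^x) p(y|w_j^y)."""
    N, M = len(mu_x), len(mu_y)
    total = 0.0
    for x, y in zip(X, Y):
        px = np.array([multivariate_normal.pdf(x, mu_x[i], cov_x[i])
                       for i in range(N)])
        py = np.array([multivariate_normal.pdf(y, mu_y[j], cov_y[j])
                       for j in range(M)])
        # Marginalize (1) over both speakers' random sources.
        total += np.log((prior_x * px) @ h @ py)
    return total
```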
Since the likelihood function (4) is a nonlinear function of the parameters $\lambda$, a direct maximization cannot be achieved. However, ML parameter estimates can be obtained iteratively using a special case of the expectation-maximization (EM) algorithm [19]. The basic concept of the EM algorithm is, beginning with an initial model $\lambda$, to estimate a new model $\bar{\lambda}$ such that $p(X, Y \mid \bar{\lambda}) \ge p(X, Y \mid \lambda)$. The new model then becomes the initial model for the next iteration, and the process is repeated until an acceptable convergence threshold is reached. For each EM iteration, the following reestimation formulas are used, which guarantee a monotonic increase in the model's likelihood value.

Cross correlational probability:

$$\bar{p}(\omega_j^y \mid \omega_i^x) = \frac{\sum_{t=1}^{T} \gamma_t(i, j)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad (6)$$

Probability of random source:

$$\bar{p}(\omega_i^x) = \frac{1}{T} \sum_{t=1}^{T} \gamma_t(i) \qquad (7)$$

Mean vectors:

$$\bar{\mu}_i^x = \frac{\sum_{t=1}^{T} \gamma_t(i)\, x_t}{\sum_{t=1}^{T} \gamma_t(i)} \qquad (8)$$

$$\bar{\mu}_j^y = \frac{\sum_{t=1}^{T} \gamma_t(j)\, y_t}{\sum_{t=1}^{T} \gamma_t(j)} \qquad (9)$$

Covariance matrices:

$$\bar{\Sigma}_i^x = \frac{\sum_{t=1}^{T} \gamma_t(i)\,(x_t - \bar{\mu}_i^x)(x_t - \bar{\mu}_i^x)^T}{\sum_{t=1}^{T} \gamma_t(i)} \qquad (10)$$

$$\bar{\Sigma}_j^y = \frac{\sum_{t=1}^{T} \gamma_t(j)\,(y_t - \bar{\mu}_j^y)(y_t - \bar{\mu}_j^y)^T}{\sum_{t=1}^{T} \gamma_t(j)} \qquad (11)$$

The a posteriori probabilities are given by

$$\gamma_t(i, j) = \frac{p(x_t, y_t, \omega_i^x, \omega_j^y)}{\sum_{i'=1}^{N} \sum_{j'=1}^{M} p(x_t, y_t, \omega_{i'}^x, \omega_{j'}^y)} \qquad (12)$$

$$\gamma_t(i) = \sum_{j=1}^{M} \gamma_t(i, j) \qquad (13)$$

$$\gamma_t(j) = \sum_{i=1}^{N} \gamma_t(i, j) \qquad (14)$$
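The following sketch implements one such reestimation pass, assuming the standard forms written above and diagonal covariances (as used in Section V), and omitting the confidence-interval pruning of (15)-(17) described next; all names are illustrative.

```python
import numpy as np

def diag_gauss(x, mu, var):
    """Diagonal-covariance Gaussian density."""
    d = x - mu
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def em_step(X, Y, prior_x, mu_x, var_x, h, mu_y, var_y):
    """One EM pass over aligned frames X, Y of shape (T, p), cf. (6)-(14)."""
    N, M, T = len(mu_x), len(mu_y), len(X)
    gamma = np.zeros((T, N, M))
    for t in range(T):
        px = np.array([diag_gauss(X[t], mu_x[i], var_x[i]) for i in range(N)])
        py = np.array([diag_gauss(Y[t], mu_y[j], var_y[j]) for j in range(M)])
        g = (prior_x * px)[:, None] * h * py[None, :]
        gamma[t] = g / g.sum()                       # a posteriori, (12)
    gx = gamma.sum(axis=2)                           # (13)
    gy = gamma.sum(axis=1)                           # (14)
    prior_x = gx.sum(axis=0) / T                     # (7)
    h = gamma.sum(axis=0) / gx.sum(axis=0)[:, None]  # (6)
    mu_x = (gx.T @ X) / gx.sum(axis=0)[:, None]      # (8)
    mu_y = (gy.T @ Y) / gy.sum(axis=0)[:, None]      # (9)
    for i in range(N):                               # (10), diagonal terms only
        d = X - mu_x[i]
        var_x[i] = (gx[:, i, None] * d * d).sum(axis=0) / gx[:, i].sum()
    for j in range(M):                               # (11), diagonal terms only
        d = Y - mu_y[j]
        var_y[j] = (gy[:, j, None] * d * d).sum(axis=0) / gy[:, j].sum()
    return prior_x, mu_x, var_x, h, mu_y, var_y
```

Iterating em_step until the likelihood gain falls below a threshold gives the ML estimates of (5).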
An important implementation issue associated with the EM algorithm is its initialization. In practice, the initialization of the EM algorithm affects its convergence rate but can also modify the final result [19]. In this work, the model parameters are initialized by the use of a standard full-search VQ procedure [21]: the probability of a random source, a mean vector, and a covariance matrix are estimated independently, using the clusters obtained by VQ for each training vector set. The initial cross correlational probability is given by a histogram, obtained by accumulating the vector correspondences between the source and target clusters.

In order to improve the generalization of the model, it is useful to penalize a cross correlational probability that has a low frequency. A possible way of doing this is to replace the probability estimate used above by the lower bound of the confidence interval of this estimate, i.e., $p(\omega_j^y \mid \omega_i^x)$ is replaced by

$$\tilde{p}(\omega_j^y \mid \omega_i^x) = p(\omega_j^y \mid \omega_i^x) - \rho\,\sigma_{ij} \qquad (15)$$

where

$$\sigma_{ij} = \sqrt{\frac{p(\omega_j^y \mid \omega_i^x)\left(1 - p(\omega_j^y \mid \omega_i^x)\right)}{n_i}} \qquad (16)$$

$$n_i = \sum_{t=1}^{T} \gamma_t(i). \qquad (17)$$

With a pruning factor $\rho$ (1.96 under a normal approximation), $\tilde{p}(\omega_j^y \mid \omega_i^x)$ is the lower bound of the 95% confidence interval. When $\tilde{p}(\omega_j^y \mid \omega_i^x) < 0$, it is set to 0. With such an estimate, the sum of the probabilities reaches a value lower than 1, which requires a renormalization.

C. MMSE of Target LPCC

A useful criterion for estimating the target LPC cepstrum is to choose the estimated (or transformed) LPC cepstrum that minimizes the mean-squared error $E[\|Y - \hat{Y}\|^2]$. It is well known [20] that the resulting estimator is the conditional mean $E[Y \mid X]$. Hence, the estimated LPC cepstrum can be written as

$$\hat{Y} = E[Y \mid X] = E[y_1, \ldots, y_{T'} \mid x_1, \ldots, x_{T'}] \qquad (18)$$

where $T'$ is the length of the LPC cepstrum sequence to estimate. Assuming that the observations of both the source and target LPC cepstra are independent in different time frames, the above equation can be written as follows:

$$\hat{y}_t = E[y_t \mid x_t], \qquad t = 1, \ldots, T'. \qquad (19)$$

Using (1), the above equation can be represented by

$$\hat{y}_t = \sum_{i=1}^{N} \sum_{j=1}^{M} p(\omega_i^x \mid x_t)\, p(\omega_j^y \mid \omega_i^x)\, \mu_j^y. \qquad (20)$$

This is the final form of the proposed conversion function. If $p(\omega_i^x \mid x_t)$ in (20) is replaced by

$$p(\omega_i^x \mid x_t) = \begin{cases} 1, & i = \arg\min_k \|x_t - \mu_k^x\|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (21)$$
then the conversion function becomes

$$\hat{y}_t = \sum_{j=1}^{M} p(\omega_j^y \mid \omega_{i^*}^x)\, c_j^y \qquad (22)$$

where $c_j^y$ is the $j$th code vector in the VQ codebook and $i^*$ is the VQ code vector index having the minimum mean square error. The form of (22) is of the type used by Abe et al. in a VQ-mapping approach [1]. Comparing (22) with the proposed conversion function (20), the conversion function of the VQ-mapping method can be seen as a special case of the proposed conversion function, obtained under the substitution (21), in which each random source is replaced by a centroid in the VQ codebook. This means that the proposed method yields a more generalized form of the conversion function than the VQ-mapping approach.
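To make the contrast concrete, the two conversion functions might be sketched as follows, reusing diag_gauss from the EM sketch above; mapping[i] plays the role of the histogram row $p(\omega_j^y \mid \omega_i^x)$, and all names are illustrative.

```python
import numpy as np

def convert_frame(x, prior_x, mu_x, var_x, h, mu_y):
    """Soft mapping of (20): a weighted sum of the target mean vectors."""
    px = np.array([diag_gauss(x, mu_x[i], var_x[i]) for i in range(len(mu_x))])
    post = prior_x * px
    post /= post.sum()              # p(w_i^x | x), cf. the indicator in (21)
    return (post @ h) @ mu_y        # sum_ij p(w_i^x|x) p(w_j^y|w_i^x) mu_j^y

def convert_frame_vq(x, centroids_x, mapping, codebook_y):
    """Hard VQ mapping in the spirit of (21)-(22): only the nearest source
    centroid contributes, so the output set is limited by the codebook."""
    i_star = np.argmin(np.linalg.norm(centroids_x - x, axis=1))
    return mapping[i_star] @ codebook_y
```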
IV. PROSODY MODIFICATION

In this work, prosody modification is achieved by changing the overall pitch contour and the overall speaking rate. The average pitch period of a speaker contributes a great deal to speech individuality [22]. Hence, we modify the source speaker's excitation spectrum by the average pitch modification factor $\beta_p$, defined by

$$\beta_p = \frac{\bar{T}^y}{\bar{T}^x} \qquad (23)$$

where $\bar{T}^x$ and $\bar{T}^y$ are the average pitch periods of the source and the target speakers, respectively. Modifying the excitation signal is achieved by linearly interpolating the real and imaginary parts of the short-time Fourier transform (STFT) of the excitation signal to yield

$$\hat{E}_n(k) = (1 - \delta_k)\, E_n(l_k) + \delta_k\, E_n(l_k + 1) \qquad (24)$$

where

$$l_k = \lfloor \beta_p k \rfloor, \qquad \delta_k = \beta_p k - l_k. \qquad (25)$$

Note that $l_k$ is the integer part of $\beta_p k$, $n$ denotes an analysis interval, and $E_n(k)$ represents the windowed STFT of the excitation signal. Short-time modified speech is obtained by the short-time inverse Fourier transform (STIFT). Since pitch modification changes the locations of the pitch pulses of the short-time modified speech, phase consistency between neighboring frames would be lost. To compensate for phase mismatches between neighboring frames, the synchronized overlap and add (SOLA) algorithm [25] is employed to build a continuous speech signal from the modified short-time speech. Another useful aspect of the SOLA algorithm is that time scale modification (TSM) can be implemented, which is achieved by using different analysis/synthesis frame intervals. Let $S_a$ and $S_s$ be the analysis and synthesis frame intervals, respectively; the time scale modification factor is then defined as

$$\beta_t = \frac{S_s}{S_a}. \qquad (26)$$

Note that the speech signal time scale is expanded when $\beta_t > 1$ and compressed when $\beta_t < 1$.

To apply TSM to voice personality transformation, two factors are of concern: how to measure the speaking rate, and how to determine a proper time scale modification factor, in the sense that the speaking rate of the modified utterances is close to that of the target speech. It is known that the sequence of vowels in a spoken utterance roughly corresponds to the rhythm of speech [23]. Accordingly, the number of vowels per unit second is used as a measurement of speaking rate. Since the phonetic transcription for each utterance in the training corpus is available, the number of vowels per unit second can be easily computed. In the proposed method, an average of the number of vowels per unit second is computed from all utterances in the training corpus; the time scale modification factor is then defined as

$$\beta_t = \frac{\bar{V}^x}{\bar{V}^y} \qquad (27)$$

where $\bar{V}^x$ and $\bar{V}^y$ are the average numbers of vowels per unit second of the source and the target speakers, respectively. Hence, it can be said that the proposed scheme changes only the global speaking rate, while the local characteristics of the source speaker's speaking rate are preserved.
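A sketch of the two prosody operations follows, under the readings of (23)-(25) and (27) reconstructed above; the directions of the two ratios are one plausible convention, and all names are illustrative.

```python
import numpy as np

def scale_excitation_spectrum(E, beta_p):
    """Frequency-scale one frame of the excitation STFT by linear
    interpolation of its real and imaginary parts, cf. (24)-(25)."""
    K = len(E)
    out = np.zeros(K, dtype=complex)
    for k in range(K):
        l = int(beta_p * k)          # integer part of beta_p * k
        frac = beta_p * k - l        # fractional part
        if l + 1 < K:
            out[k] = (1.0 - frac) * E[l] + frac * E[l + 1]
    return out

def modification_factors(T_src, T_tgt, ros_src, ros_tgt):
    """Average pitch factor (23) and time scale factor (27) from the average
    pitch periods and the average vowels-per-second (ROS) of both speakers."""
    beta_pitch = T_tgt / T_src       # shift the source F0 toward the target
    beta_time = ros_src / ros_tgt    # > 1 expands when the target is slower
    return beta_pitch, beta_time
```

The time scale factor is then realized by running SOLA with a synthesis interval S_s = beta_time * S_a, per (26).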
V. EXPERIMENTAL RESULTS

The database used to obtain the conversion rules consists of 200 utterances of the Korean language. This material was obtained from three male speakers and one female speaker, whom we refer to as M1, M2, M3, and F, respectively. An additional 100 utterances from these speakers were prepared for both the objective and subjective evaluations. Speech signals were digitized at a rate of 16 kHz. The orders of the LPC coefficients and the LPC cepstrum were 20 and 30, respectively. A 25-ms Hanning window was used to extract the LPC parameters at 10-ms intervals. We constrained each Gaussian component to have a diagonal covariance matrix. Variance limiting [18] was also used in estimating each component of a covariance matrix.

Two voice transformation experiments are presented: one being male-to-male conversion (M3 → M1), the other being male-to-female conversion (M2 → F).

A. Comparison With the VQ-Mapping Method in Terms of Cross Correlational Probabilities

One of the novelties of the proposed method is the learning of the cross correlational probabilities. It is therefore worth examining the differences between the initial cross correlation probabilities (from the histogram estimated as in Abe's work [1]) and the final cross correlation probabilities obtained after iterations. An example is shown in Table I. This example was obtained when the number of source/target speakers' random sources was set to 8, in the case of male-to-male conversion. As shown, the results of the proposed method are somewhat different from those of Abe's work [1]. This can be explained by the fact that, since the probability model (including centroids) of each random source is updated at each
iteration, the cross correlational probabilities between the two speakers' random sources also change accordingly.

TABLE I
EXAMPLE OF CROSS-CORRELATIONAL PROBABILITIES FOR 8 × 8 RANDOM SOURCES; TOP: INITIAL CROSS-CORRELATIONAL PROBABILITIES; BOTTOM: CROSS-CORRELATIONAL PROBABILITIES AFTER 20 ITERATIONS

One of the interesting points observed in the two cross correlation probability tables is that, after iterations, the cross correlational probability having the highest frequency is increased, whereas the remaining ones are decreased. For example, one dominant entry is increased from 0.5136 to 0.7586 after 20 iterations, while the remaining cross correlational probabilities become smaller. There are some exceptions, e.g., an entry that is increased from 0.0025 to 0.0038, even though it does not correspond to the largest frequency. The major reason for the increase in the largest cross correlational probabilities is the pruning of the cross correlational probabilities: in the proposed ML parameter estimation, a cross correlational probability having a low frequency is penalized, which increases the frequencies of the cross correlational probabilities having larger frequencies. Assuming that cross correlational probabilities having low frequency are of less generality than the others, further decreasing the low-frequency cross correlational probabilities (or, equivalently, further increasing the high-frequency ones) should be helpful in increasing the generalization of the model. From this point of view, the cross correlation probability matrix resulting from the proposed ML parameter estimation is more desirable than the one from the histogram used in Abe's work [1].

B. Objective Evaluation

Two objective measures were considered in evaluating the performance of the underlying voice transformation methods. First, the cepstral distance reduction ratio [6] was used, given by

$$R_{cd} = \left(1 - \frac{d(Y, \hat{Y})}{d(Y, X)}\right) \times 100 \qquad (28)$$

where $X$, $Y$, and $\hat{Y}$ are the LPC cepstrum sequences for the source speaker, the target speaker, and the transformed speech, respectively, and $d(\cdot, \cdot)$ denotes the averaged Euclidean distance between two vector sequences. Note that the zeroth cepstral coefficient is omitted in computing $d(\cdot, \cdot)$ because it is not affected by the conversion function. In the case where all the transformed LPC cepstrum coefficients are exactly the same as the target LPC cepstrum coefficients, the reduction ratio takes a value of 100. This means that the larger the reduction ratio, the greater the similarity between the transformed LPC cepstrum and the target LPC cepstrum.

Another objective measure is the following log likelihood ratio [9]:

$$R_{ll} = \frac{1}{T'} \sum_{t=1}^{T'} \log \frac{p(\hat{y}_t \mid \lambda^y)}{p(\hat{y}_t \mid \lambda^x)} \qquad (29)$$

where $\lambda^x$ and $\lambda^y$ are probabilistic models estimated from the source speaker's training corpus and the target speaker's training corpus, respectively. In this work, 256-mixture Gaussian mixture models (GMMs) were employed to represent each speaker's probabilistic model. According to the above equation, $R_{ll}$ normally takes a negative value when $\hat{Y}$ is close to the source speaker's LPC cepstrum sequence, and a positive value when $\hat{Y}$ is close to the target speaker's LPC cepstrum sequence. Hence, larger positive values of the log likelihood ratio indicate that the transformed LPC cepstrum coefficients are statistically closer to the target LPC cepstrum coefficients.

The two types of conversion functions introduced in Section III (the proposed one and the original VQ-mapping approach of Abe et al.) were examined. Fig. 3 presents $R_{cd}$ and $R_{ll}$ as measured on the test corpus for these two methods, as a function of the number of source/target random sources (or the number of centroids in the case of VQ-mapping). In most cases, $R_{cd}$ and $R_{ll}$ take the form of increasing curves with an increasing number of random sources (or centroids). The only exception occurs for small numbers of centroids (16 or fewer) when the VQ-mapping method is applied. For all cases, higher $R_{cd}$ and $R_{ll}$ are observed for the proposed probabilistic conversion than for the VQ-mapping approach. In particular, the superiority of the probabilistic conversion is clearer when the number of random sources is large (32 or more). For example, $R_{cd}$ of the statistical method using 64 random sources is slightly higher than that of the VQ-mapping method in which 128 centroids are used. This is mainly due to the fact that, since the proposed conversion function is represented by a weighted sum of the target speaker's prototype mean vectors, a wide variety of transformed vectors can be obtained by the proposed method, whereas for the VQ-mapping method, the number of possible transformed vectors is limited by the size of the underlying codebook.
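Both measures, in the reconstructed forms above, are simple to compute; the sketch below assumes (T, p) arrays with the zeroth cepstral coefficient already removed, and speaker GMMs exposing per-frame log-densities (e.g., sklearn.mixture.GaussianMixture.score_samples). Names are illustrative.

```python
import numpy as np

def cd_reduction_ratio(src, tgt, conv):
    """Cepstral distance reduction ratio R_cd of (28), in percent."""
    d = lambda a, b: np.mean(np.linalg.norm(a - b, axis=1))
    return 100.0 * (1.0 - d(tgt, conv) / d(tgt, src))

def log_likelihood_ratio(conv, gmm_src, gmm_tgt):
    """Average log likelihood ratio R_ll of (29) for converted frames."""
    return np.mean(gmm_tgt.score_samples(conv) - gmm_src.score_samples(conv))
```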
Fig. 3. Objective performance of the two voice transformation methods. (a) M3 → M1 conversion. (b) M2 → F conversion.
For the statistical conversion, the highest $R_{cd}$ of the M3 → M1 conversion is 54.8%. This means that the average cepstral distance between the target and the transformed vectors is about half the cepstral distance between the target and the source (untransformed) vectors. Considering the fact that differences in vocal tract response between male and female speakers are greater than differences within the same gender, it can be inferred that the source-target distance $d(Y, X)$ in (28) would generally be larger in the case of different-gender transformation (e.g., the M2 → F conversion). Nevertheless, the highest $R_{cd}$ in the M2 → F conversion was about 68%, which is higher than in the case of the M3 → M1 (same-gender) conversion. This indicates that a voice personality transformation task can be successfully applied to the transformation of speaker characteristics across gender.
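As a quick check of what this figure implies under the reconstructed form of (28):

$$R_{cd} = 54.8 \;\Longrightarrow\; \frac{d(Y, \hat{Y})}{d(Y, X)} = 1 - \frac{54.8}{100} = 0.452 \approx \frac{1}{2}$$

that is, the transformed-to-target distance is indeed about half the source-to-target distance.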
In terms of the likelihood ratio, the difference in performance between the two methods is even more visible, as shown in Fig. 3. For the M3 → M1 conversion, the likelihood ratio obtained by the proposed statistical method using 32 random sources is equivalent to that of the VQ-mapping method, in which 256 centroids are used. For both conversion experiments, the difference in the likelihood ratio was found to increase as the number of random sources (or centroids) increased. Consequently, the proposed method yields transformed vectors which are statistically closer to the target, compared to the previous VQ-mapping approach.

Fig. 4. (a) Example of cepstral distance between the source and the target (solid line) as well as between the transformed and the target (dotted line), evaluated for 100 consecutive frames. (b) Example of log likelihood of the source LPC cepstrum coefficients with respect to the target model (solid line) as well as of the transformed LPC cepstrum coefficients with respect to the target model (dotted line), evaluated for the same speech samples as in (a).

To inspect the local behavior of the objective measures, an example of the frame cepstral distance and log likelihood measured for one second of test speech is shown in Fig. 4. The results shown in this figure were obtained when the number of random sources (or centroids) was set to 128. As expected, the cepstral distance curve of the probabilistic method lies below that of the VQ-mapping method. It is also noteworthy that the shape
of the cepstral distance curve of the VQ-mapping method sometimes shows high spikes, whereas the probabilistic method provides a more regular reduction in cepstral distance. This is also due to the soft-classification property of the statistical method. This example shows that the proposed conversion function yields a more "consistent" performance across frames than the VQ-mapping method. An example of the frame log likelihood
of the transformed LPC cepstrum coefficients with respect to the target model evaluated for the same speech is shown in Fig. 4(b). It can be seen that the log likelihood curve obtained by the statistical method is frequently, but not always, higher than that obtained using the VQ-mapping method. However, the variations in the log likelihood obtained from the proposed method are higher than those from the VQ-mapping method.
TABLE II ABX TEST RESULTS FOR THE PROBABILISTIC (PROPOSED) METHOD AND THE VQ-MAPPING METHOD
TABLE III ABX TEST RESULTS FOR THE PROBABILISTIC (PROPOSED) METHOD IN THE CASE WHERE TSM (TIME SCALE MODIFICATION) IS APPLIED OR NOT
C. Subjective Evaluation

The eventual goal of a voice personality transformation task is to generate modified speech that mimics the target speech. Hence, it is very important to evaluate performance in terms of how closely the transformed speech signals resemble the target speaker's voice. To this end, two subjective listening tests were conducted. The first was designed to evaluate the performance related to converting the individuality of a speaker, using the ABX test. In this test, 15 test sentences were used and each sentence was presented three times to 18 listeners. The first and second stimuli, A and B, were either the source speaker's or the target speaker's speech, while the last stimulus, X, was the transformed speech obtained using the method under evaluation. The subjects were then asked to select either A or B as a candidate for X. In this test, the number of random sources or centroids was set to 128. We also compared the results from the two methods, the proposed statistical method and the VQ-mapping approach. In order to compare only the vocal tract response conversion aspect, the same prosody modification method introduced in Section IV was employed in both methods. Audio examples are available on the Web site: http://www.konkuk.ac.kr/~kseung/audio_demo/demo_page.html.

Table II shows the ABX test results. In terms of correct identification ratio, the statistical method was 5–6% higher than the VQ-mapping approach.

To evaluate the effectiveness of the method for changing the speaking rate, another ABX test was performed on the converted utterances. In this test, the converted utterances were obtained by two methods: LPC cepstrum modification and pitch modification were employed in both methods, while time scale modification (TSM) was employed only in the second method. The results are shown in Table III. The identification ratio was dramatically increased in the case of the M3 → M1 conversion. Indeed, the difference in speaking rate between M3 and M1 is relatively large, leading to a large time scale modification factor $\beta_t$. Accordingly, TSM plays an important role in the case of the M3 → M1 conversion. No large
difference in speaking rate between M2 and F was observed. In this case, applying TSM did not increase the identification ratio to a significant extent. Note that the role of TSM in voice transformation may be dependent on the underlying language.

Although the ABX test results appear to be somewhat promising, the listeners participating in the ABX test indicated that the transformed source speech did not sound exactly like the target speech and was perceptually different. One reason for this is the difference in the speaking styles of the source and the target. Another reason might be that the excitation was not completely transformed. In practice, the listeners' selection was strongly influenced by speaking style. A possible method for alleviating the weakness in speaking-style conversion is the use of time-varying pitch modification [2] or phoneme-specific time scale modification [9]. However, the quality of the transformed speech was slightly worse when a time-varying modification factor was used [2]. Accordingly, carefully designed prosody modification rules would be highly desirable for obtaining converted utterances that are perceived to be more similar to the target.

TABLE IV
PREFERENCE TEST RESULTS

In addition to the ABX test, we also performed a preference test to determine whether speech samples converted using the proposed method sounded more pleasant to listeners than those converted using the VQ-mapping approach. In this test, 10 relatively short sentences were used in the evaluation and 15 listeners participated. A paired comparison procedure was used: each utterance converted using the proposed method was paired with the same utterance converted using the VQ-mapping method, and the order within each pair was random. Each listener was allowed to listen to any pair of utterances as many times as needed before determining which utterance in the pair sounded more natural or more pleasant. Overall, for 62% of the total stimuli, listeners preferred utterances converted using the proposed method over the VQ-mapping method, as shown in Table IV. The listeners indicated that utterances converted using the proposed method sounded more comfortable and were less noisy than those produced using the VQ-mapping method. However, they also indicated that utterances converted by the two methods commonly sounded "ambiguous" and "unclear". This is mainly due to the bandwidth-widening problem caused by the averaging effects of VQ. Each centroid in the VQ codebook is essentially an average of a small cluster of spectra. Unfortunately, each centroid, since it is an averaged spectrum, tends to have larger bandwidths [8]. The weighted sum of the mean vectors, employed in the proposed conversion function, further increases the bandwidth of the spectrum. There are several methods for enhancing the formants. Given that LPC cepstrum coefficients are used as
a feature parameter in the proposed algorithm, the method of cepstral weighting [16] is the preferred method for further improving the quality of converted utterances.

Some artifacts were also observed in the converted speech. In some cases, the converted utterances were seriously degraded both in intelligibility and in naturalness. To analyze the source of these artifacts, we listened to the converted speech signals carefully. The findings show that the artifacts are more pronounced when both the LPC cepstrum and the excitation spectrum are modified simultaneously. Modifying only one parameter (the LPC cepstrum or the excitation spectrum) did not lead to a serious degradation in quality. One reason for the artifacts is an insufficient transformation of each parameter: the two distortions caused by the two independent transformations add up, resulting in an increased number of artifacts. Another possible explanation is that, since the two parameters inherently originate from the same speech signals, an implicit relationship exists between them. The converted utterances, however, were obtained by an independent transformation of each parameter, so this relationship is lost after transformation. This results in artifacts in the converted utterances. One possible method for alleviating this problem is to build a model that explains the relationship between the LPC parameters and the excitation parameters for one speaker; the conversion function would then be constructed under the constraint that this relationship is preserved. We will focus on this issue in the next version of the voice personality transformation algorithm.

VI. CONCLUSION

A new voice transformation algorithm is proposed, based on a probabilistic model involving an intra-speaker model and an inter-speaker model. The underlying assumption is that one speaker's LPC cepstrum is generated by a number of random sources which can be modelled by a Gaussian function. To model inter-speaker dependencies, cross correlational probabilities between the random sources of the source and the target are employed. The conversion function is derived from the probabilistic model, based on a minimum mean square error criterion. The resulting conversion function includes an intra-speaker model for the target, which has not been considered in previous methods. Several related issues are also discussed, including a method for increasing the generalization of the model.

To achieve better performance in prosodic conversion, time scale modification was employed, which has not been frequently used in previous methods. The time scale modification factor is given by the ratio of the source and the target speakers' speaking rates, represented by the number of vowels per unit second. We confirmed that the application of time scale modification sometimes led to a remarkable improvement in voice conversion, especially when the two speakers' speaking rates were very different from each other.

Although the proposed method produces results that are superior to the previous VQ-mapping approach in terms of both objective and subjective evaluations, there are still limitations, as found in previous voice personality transformation methods.
According to informal listening tests, the quality of the transformed speech signals is not comparable to that of coded speech signals obtained by low-bit-rate speech coders. This means that the quality of the transformed speech signals is not yet sufficient for practical use. Hence, the major contribution of the work presented here lies in the two major factors of the voice personality transformation task: modelling the relationship between the two speakers' feature spaces, and building conversion rules based on the corresponding model. The resulting conversion rules are represented by a modified form of the previous VQ-mapping approach. To increase the usefulness of the voice transformation system, practical aspects should be considered. For example, the distortions found in the transformed speech signals should be analyzed in terms of perceptual aspects. Moreover, research efforts should be directed at reducing differences in perceptual quality between raw speech signals and transformed speech signals, as well as at obtaining transformed speech that sounds closer to the target speaker. Our future studies will focus on these issues.

ACKNOWLEDGMENT

The author would like to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1988, vol. 1, pp. 565–568.
[2] M. Savic and I. H. Nam, "Voice personality transformation," Digital Signal Process., vol. 4, pp. 107–110, 1991.
[3] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Speech Commun., vol. 11, pp. 175–187, 1992.
[4] H. Mizuno and M. Abe, "Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectral tilt," Speech Commun., vol. 16, no. 2, pp. 153–164, 1995.
[5] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Commun., vol. 16, no. 2, pp. 207–216, 1995.
[6] N. Iwahashi and Y. Sagisaka, "Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks," Speech Commun., vol. 16, no. 2, pp. 139–152, 1995.
[7] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, Mar. 1998.
[8] N. Bi and Y. Qi, "Application of speech conversion to alaryngeal speech enhancement," IEEE Trans. Speech Audio Process., vol. 5, no. 2, pp. 97–105, Mar. 1997.
[9] L. M. Arslan, "Speaker transformation algorithm using segmental codebooks (STASC)," Speech Commun., vol. 28, pp. 211–226, 1999.
[10] K. S. Lee, D. H. Youn, and I. W. Cha, "A new voice personality transformation based on both linear and nonlinear prediction analysis," in Proc. Int. Conf. Spoken Language Process., 1996, pp. 1401–1404.
[11] K. S. Lee, D. H. Youn, and I. W. Cha, "Voice conversion using a low dimensional vector mapping," IEICE Trans. Inf. Syst., vol. E85-D, no. 8, pp. 1297–1305, Aug. 2002.
[12] S. J. Cox and J. S. Bridle, "Unsupervised speaker adaptation by probabilistic spectrum fitting," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1989, vol. 1, pp. 294–297.
[13] M. A. Richards, "Helium speech enhancement using the short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 6, pp. 841–853, Dec. 1982.
[14] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, no. 5/6, pp. 453–467, 1990.
[15] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[16] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[17] G. M. White and R. B. Neely, "Speech recognition experiments with linear prediction, bandpass filtering, and dynamic programming," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 2, pp. 183–188, Apr. 1976.
[18] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[19] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. 39, pp. 1–38, 1977.
[20] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part I. New York: Wiley, 1968.
[21] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, Jan. 1980.
[22] D. G. Childers, B. Yegnanarayana, and K. Wu, "Voice conversion: Factors responsible for quality," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1985, vol. 1, pp. 748–751.
[23] T. Pfau and G. Ruske, "Estimating the speaking rate by vowel detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1998, pp. 945–948.
[24] E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech," Speech Commun., vol. 16, no. 2, pp. 175–206, 1995.
[25] S. Roucos and A. M. Wilgus, "High quality time-scale modification for speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1985, pp. 493–496.
[26] H. P. Knagenhjelm and W. B. Kleijn, "Spectral dynamics is more important than spectral distortion," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1995, pp. 732–735.

Ki-Seung Lee was born in Seoul, Korea, in 1968. He received the B.S., M.S., and Ph.D. degrees in electronics engineering from Yonsei University, Seoul, in 1991, 1993, and 1997, respectively. Since September 2001, he has been an Assistant Professor at Konkuk University, Seoul. From February 1997 to September 1997, he was with the Center for Signal Processing Research (CSPR), Yonsei University. From October 1997 to September 2000, he was with the Speech Processing Software and Technology Research Department, Shannon Laboratories, AT&T Labs-Research, Florham Park, NJ, where he worked on ASR/TTS-based very low bit rate speech coding and prosody generation for the AT&T TTS systems. From November 2000 to August 2001, he was with the Human and Computer Interaction Laboratories, Samsung Advanced Institute of Technology (SAIT), Suwon, Korea, where he worked on a corpus-based TTS system. His research interests include the various fields of speech signal processing, including voice transformation, speech segmentation, speech synthesis, and real-time implementation of speech processing systems.