NEW REFINEMENT SCHEMES FOR VOICE CONVERSION
Cheng-Yuan Lin and J.-S. Roger Jang
Dept. of Computer Science, National Tsing Hua University, Taiwan
Email: {gavins, jang}@cs.nthu.edu.tw
ABSTRACT
New refinement schemes for voice conversion are proposed in this paper. We take mel-frequency cepstral coefficients (MFCC) as the basic feature and adopt cepstral mean subtraction to compensate for channel effects. We propose an S/U/V (Silence/Unvoiced/Voiced) decision rule such that two sets of codebooks are used to capture the difference between the unvoiced and voiced segments of the source speaker. Moreover, we apply three schemes to refine the synthesized voice: pitch refinement with PSOLA, energy equalization, and frame concatenation based on synchronized pitch marks. The satisfactory performance of the voice conversion system is demonstrated through ABX listening tests and MOS grades.
1. INTRODUCTION
Voice conversion, a technology that modifies a source speaker's speech to sound as if a target speaker had spoken it, has a number of useful applications, such as personification of text-to-speech synthesis systems, preservation of speaker characteristics in interpreting systems, and movie dubbing. In this paper, we propose new schemes for refining the training and synthesis procedures of voice conversion. We employ mel-frequency cepstral coefficients (MFCC), the feature usually adopted in speech recognition [12], as the basic feature. Before constructing the mapping codebooks, we classify speech frames into three categories: silence, unvoiced, and voiced. Codebooks for unvoiced and voiced frames are constructed separately so that the mapping between consonants and vowels can be described more precisely. In addition, we use the data grouping induced by the alignment of dynamic time warping to generate the codebooks of the target speech [11].
During the frame concatenation stage, we apply several refinement schemes to adjust the pitch and energy profiles of the converted voice. Moreover, to create smooth transitions between neighboring frames, we propose pitch-mark-synchronous concatenation and cross-fading based on duplicated fundamental periods. Details of these refinement schemes are described in Section 4.
*This project is supported by the MOE Program for Promoting Academic Excellency of Universities (grant number: 89-E-FA04-1-4), to which the authors express many thanks.
2. RELATED WORK
Voice conversion has been studied for several decades. Some representative work is listed below:
1. Voice conversion based on VQ (Vector Quantization) [1].
2. Voice conversion based on GMM (Gaussian Mixture Model) [6][9].
3. Voice conversion based on LMR (Linear Multivariate Regression) or DFW (Dynamic Frequency Warping) [10].
4. Voice conversion based on static speaker characteristics [8].
Most of the methods proposed in the above papers take LPC (Linear Predictive Coding) or LSF (Line Spectrum Frequency) [12] parameters as the basic feature, since these parameters (plus the residual signals, if necessary) can be converted back to the original time-domain signals. However, these parameters do not correspond well to human perception of speech signals, as can be seen from the fact that most speech recognition systems employ MFCC instead. Moreover, most papers do not address the issue of post-processing, which is an important step in generating high-quality voice. Our system takes MFCC as the basic feature and employs STC (Sine Transform Coder) to obtain the synthesis features (frequency, amplitude, phase). To boost the quality, we propose several post-processing schemes to refine the synthesized voice, as described in the following sections.
3. MAPPING CODEBOOK GENERATION
3.1 Decision Rule for Silence/Unvoiced/Voiced Frames
Human speech signals can be roughly divided into two categories: unvoiced and voiced. The voiced part usually represents the stable region of the vowel part of a syllable, while the unvoiced part usually represents the consonant part. In general, the unvoiced part does not carry information about the speaker's identity and thus can be modeled by a white-noise generator. Consequently, we need to treat unvoiced and voiced parts separately in order to generate high-quality synthesized voice. To achieve this, we classify each speech frame into one of three categories: silence, unvoiced, or voiced. The decision rule for S/U/V (silence/unvoiced/voiced) classification is based on the following criteria (a simple sketch of the rule is given after the list):
1. If the AMDF (Average Magnitude Difference Function) [4] curve of the frame shows a sequence of almost equally spaced (within 10% variation) local minima, and the obtained pitch is between 100 and 1000 Hz, then the frame is voiced.
2. Otherwise, if the energy is larger than an energy threshold and the zero-crossing rate (ZCR) is larger than a ZCR threshold, then the frame is unvoiced.
3. Otherwise, the frame is classified as silence.
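As an illustration of the above rule, the following Python sketch (assuming NumPy) classifies a single frame. The helper thresholds energy_th and zcr_th and the equal-spacing test are illustrative choices, not the exact values used in our experiments.

import numpy as np

def amdf(frame):
    """Average magnitude difference function over all lags."""
    n = len(frame)
    return np.array([np.mean(np.abs(frame[k:] - frame[:n - k])) if k > 0 else 0.0
                     for k in range(n)])

def suv_classify(frame, fs, energy_th=0.01, zcr_th=0.25):
    """Classify one frame (1-D NumPy array of samples) as 'voiced', 'unvoiced', or 'silence'."""
    d = amdf(frame)
    lo, hi = int(fs / 1000), int(fs / 100)                 # lag range for 100-1000 Hz
    valleys = [k for k in range(max(lo, 2), min(hi, len(d) - 1))
               if d[k] < d[k - 1] and d[k] < d[k + 1]]
    if len(valleys) >= 2:
        gaps = np.diff(valleys)
        evenly_spaced = gaps.max() - gaps.min() <= 0.1 * gaps.mean()
        pitch = fs / valleys[0]
        if evenly_spaced and 100 <= pitch <= 1000:
            return 'voiced'
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # fraction of sign changes
    if energy > energy_th and zcr > zcr_th:
        return 'unvoiced'
    return 'silence'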
Fig. 1 shows the result of S/U/V detection on a sample utterance.
Fig. 1. S/U/V detection using three factors: pitch (with low- and high-pitch thresholds), energy (with low- and high-energy thresholds), and zero-crossing rate (with its threshold), plotted below the waveform against time in seconds.
After S/U/V classification is performed, we take the MFCC of each non-silence frame (with cepstral mean subtraction to compensate for channel effects) as the basic feature for vector quantization on unvoiced and voiced frames, respectively, as explained in the following subsection.
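Cepstral mean subtraction itself is a one-line operation on the utterance's MFCC matrix; a minimal sketch follows (the MFCC front end is assumed to be available and is not shown).

import numpy as np

def cepstral_mean_subtraction(mfcc):
    """Remove the per-utterance mean of each MFCC dimension.

    mfcc has shape (num_frames, num_coeffs); subtracting the long-term
    cepstral mean removes the approximately constant channel term.
    """
    return mfcc - mfcc.mean(axis=0, keepdims=True)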
3.2 Codebook Generation for the Source Speaker
After S/U/V classification, the non-silence frames are kept for generating two sets of codebooks, one for unvoiced and one for voiced frames. The purpose of this step is to prepare the representative frames (prototypes or centers) of the source speaker for the subsequent alignment between the source speaker (speaker A) and the target speaker (speaker B) via DTW (Dynamic Time Warping) [12]. We use vector quantization to extract the two codebooks for the unvoiced and voiced frames of the source speaker. Fig. 2 shows the flowchart of U/V codebook generation for the source speaker.
Fig. 2. U/V codebook generation for the source speaker
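As a sketch of this step, the code below clusters the source speaker's non-silence MFCC frames into the two codebooks. SciPy's kmeans2 is used here as a convenient stand-in for whatever vector quantization routine (e.g., LBG) is actually preferred; the codebook sizes are the ones reported in Section 5.

import numpy as np
from scipy.cluster.vq import kmeans2

def build_uv_codebooks(mfcc, suv_labels, voiced_size=512, unvoiced_size=10):
    """Build voiced/unvoiced codebooks from the source speaker's frames.

    mfcc:       (num_frames, num_coeffs) features after cepstral mean subtraction
    suv_labels: per-frame labels 'silence'/'unvoiced'/'voiced'
    Returns a dict with one code vector array per class.
    """
    labels = np.asarray(suv_labels)
    voiced_cb, _ = kmeans2(mfcc[labels == 'voiced'], voiced_size, minit='points')
    unvoiced_cb, _ = kmeans2(mfcc[labels == 'unvoiced'], unvoiced_size, minit='points')
    return {'voiced': voiced_cb, 'unvoiced': unvoiced_cb}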
3.3 Data Grouping for the Target Speaker
Once we have the codebooks for the source speaker, we can find the data grouping (for the target speaker) induced by the alignment procedure of dynamic time warping. Once the data grouping is done, we can find the centroid of each group, and the mapping between source code vectors and target centroids can be established intuitively. The principle of the induced grouping can be explained via the example in Fig. 3.
Fig. 3. An example of data grouping.
In Fig. 3, the first and second rows show the mapping from the frame indices of the source speech to the code vector indices, while the first and third rows show the alignment between the frame indices of the source sentence and those of the target sentence. As a result, we can establish the mapping from the code vector indices to the frame indices of the target sentence, and this mapping induces a grouping on the target frames. For instance, frames 1, 10, and 13 of the target sentence should be in the same group since they correspond to code vector index 140 in the second row; similarly, frames 4 and 12 of the target sentence should be in the same group since they correspond to code vector index 203. After alignment and grouping, the frames of the target sentence are partitioned into several groups; in general, the number of groups in the target sentence equals the codebook size of the source speaker. To establish the mapping from source frames to target frames, we still need to compute the "centroid" of each group of target frames.
Usually, the centroid of a group is the average of all vectors in the group. In our case, however, we would like to avoid converting an MFCC vector back into a speech frame, so the centroid should be one of the data points in the group. The centroid is therefore taken as the data point that has the minimal total distance to all the other data points in the same group.
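The induced grouping and the medoid-style centroid described above can be sketched as follows; the DTW path is assumed to be given as a list of (source_frame, target_frame) index pairs produced by any standard DTW implementation.

import numpy as np
from collections import defaultdict

def induced_grouping(dtw_path, source_code_index):
    """Group target frame indices by the code vector of the aligned source frame.

    dtw_path:          list of (i_source, j_target) pairs from DTW alignment
    source_code_index: source_code_index[i] = nearest code vector index of source frame i
    """
    groups = defaultdict(list)
    for i_src, j_tgt in dtw_path:
        groups[source_code_index[i_src]].append(j_tgt)
    return groups

def group_centroid(target_mfcc, member_indices):
    """Return the member frame with minimal total distance to the other members."""
    members = target_mfcc[member_indices]
    pairwise = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
    return member_indices[int(np.argmin(pairwise.sum(axis=1)))]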
Once the centroid of each target group is determined, we can perform frame-to-frame voice conversion via the following steps:
1. Take a source frame and find its closest code vector.
2. Find the corresponding group of target frames.
3. Return the centroid of the identified target group.
Note that unvoiced and voiced frames are processed with the unvoiced and voiced codebooks, respectively. Fig. 4 shows the flowchart of the mapping codebook generation; a sketch of the conversion lookup is given after the figure.
Fig. 4. Mapping codebook generation
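The conversion steps above amount to a nearest-neighbor search followed by a table lookup. The sketch below assumes the mapping from code vector index to target centroid frame has been stored in a plain dictionary during training.

import numpy as np

def convert_frame(src_mfcc, suv_label, codebooks, centroid_tables):
    """Map one non-silence source frame to its target centroid frame.

    codebooks:       {'voiced': ..., 'unvoiced': ...} arrays of source code vectors
    centroid_tables: {'voiced': ..., 'unvoiced': ...} dicts of code index -> target frame
    """
    cb = codebooks[suv_label]
    idx = int(np.argmin(np.linalg.norm(cb - src_mfcc, axis=1)))  # closest code vector
    return centroid_tables[suv_label][idx]                       # its target centroid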
In addition, we can employ the STC method to encode the centroid frames of the target speaker in order to save space in the mapping codebook. If the centroid frame is identified as voiced, we apply the harmonic sinusoidal model [7]:
s(x) = \sum_{l=1}^{L} A_l \cos(\omega_0 l x + \phi_l).
Otherwise (for an unvoiced frame), we adopt the exponential sinusoidal model [7]:
s(x) = \sum_{l=1}^{L} A_l \exp(-d_l x) \cos(\omega_l x + \phi_l).
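The two models translate directly into code; in the sketch below, the amplitudes, phases, decay rates, and the fundamental frequency omega0 are assumed to come from the STC analysis stage, which is not shown.

import numpy as np

def harmonic_frame(amps, phases, omega0, num_samples):
    """Voiced centroid: s(x) = sum_l A_l cos(omega0 * l * x + phi_l)."""
    x = np.arange(num_samples)
    a = np.asarray(amps)[:, None]
    p = np.asarray(phases)[:, None]
    l = np.arange(1, len(a) + 1)[:, None]          # harmonic numbers 1..L
    return (a * np.cos(omega0 * l * x + p)).sum(axis=0)

def exponential_frame(amps, decays, omegas, phases, num_samples):
    """Unvoiced centroid: s(x) = sum_l A_l exp(-d_l x) cos(omega_l * x + phi_l)."""
    x = np.arange(num_samples)
    a = np.asarray(amps)[:, None]
    d = np.asarray(decays)[:, None]
    w = np.asarray(omegas)[:, None]
    p = np.asarray(phases)[:, None]
    return (a * np.exp(-d * x) * np.cos(w * x + p)).sum(axis=0)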
4. POST PROCESSING
Once the training procedure is accomplished, we obtain the mapping codebook, which defines a one-to-one correspondence between source code vectors and target centroid frames. The synthesizer therefore only needs to find, for each input frame, the most similar code vector and decode the corresponding target frame. The converted frames are then joined by a concatenation-based method. However, the synthesized voice is occasionally filled with undesirable buzzy components. Consequently, to achieve the best quality of the concatenation-based waveform, we apply three schemes to refine the synthesized voice. Fig. 5 shows the flowchart of the proposed synthesizer.
Fig. 5. The flowchart of the proposed refinement schemes.
4.1. Pitch Refinement
We employ the PSOLA (Pitch Synchronous Overlap and Add) [3] technique to adjust the pitch of each synthesized frame. The synthesized sentence should have an average pitch similar to that of the target speaker, while its pitch contour should remain as close as possible to that of the source sentence, so that the cadence of the source speaker is maintained. The pitch refinement scheme consists of the following steps (a numeric sketch follows the list):
1. Let p_{s,mean} and p_{t,mean} be the average pitches of the source and target speakers, respectively. (Here we use semitones, instead of Hz, as the unit for pitch.)
2. Let p_s(i) be the pitch of frame i of the source sentence. The pitch of the corresponding converted frame is then adjusted to p_{t,mean} + p_s(i) - p_{s,mean}. In other words, the converted sentence has the same pitch slope as the source sentence.
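The per-frame pitch targets handed to PSOLA can be computed as below; the Hz/semitone conversion uses the usual MIDI-style reference (A4 = 440 Hz), which is an illustrative choice, and the PSOLA resynthesis itself is assumed to be provided elsewhere.

import numpy as np

def hz_to_semitone(f):
    """Convert frequency in Hz to semitones (69 = A4 = 440 Hz); voiced frames only."""
    return 69.0 + 12.0 * np.log2(np.asarray(f, dtype=float) / 440.0)

def semitone_to_hz(s):
    return 440.0 * 2.0 ** ((np.asarray(s, dtype=float) - 69.0) / 12.0)

def refined_pitch_track(source_pitch_hz, target_mean_semitone):
    """Shift the source pitch contour so its mean matches the target speaker.

    Implements p_target(i) = p_t,mean + (p_s(i) - p_s,mean) in semitones,
    which preserves the pitch slope (cadence) of the source sentence.
    """
    ps = hz_to_semitone(source_pitch_hz)
    pt = target_mean_semitone + (ps - ps.mean())
    return semitone_to_hz(pt)        # per-frame pitch targets for PSOLA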
4.2. Energy Equalization
The above step modifies the pitch curve of the synthesized voice so that the converted speech has the same average pitch as the target speaker while the cadence of the source sentence is maintained. However, this is not sufficient, since complete prosody information should also include energy. Accordingly, we add an energy equalization scheme to refine the synthesized voice. The goal is to maintain the same energy profile as that of the source speaker. To achieve this, we simply adjust the amplitude of each converted frame such that its energy equals that of the corresponding source frame.
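Energy equalization reduces to a per-frame gain; a minimal sketch:

import numpy as np

def equalize_energy(converted_frame, source_frame, eps=1e-12):
    """Scale the converted frame so that its energy matches the source frame's energy."""
    gain = np.sqrt((np.sum(source_frame ** 2) + eps) / (np.sum(converted_frame ** 2) + eps))
    return gain * converted_frame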
4.3. Frame Concatenation
Before concatenating the converted frames, we need to make sure that the concatenation keeps the same fundamental period and produces a smooth transition. To achieve this goal, we propose the following strategies (a sketch of the cross-fading step follows the list):
1. To keep the same fundamental period across frames, we identify the pitch marks of neighboring frames and choose the concatenation points as the last pitch mark of the first frame and the first pitch mark of the second frame.
2. To generate a smooth transition while keeping the original length of the converted speech, we duplicate several fundamental periods of the neighboring frames and use cross-fading to create a smooth transition between frames, as shown in Fig. 6.
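The sketch below joins two frames at their boundary pitch marks and cross-fades over roughly one fundamental period. It is only illustrative: the linear fade shape, the one-period overlap, and the omission of the period-duplication bookkeeping that preserves total length (Fig. 6) are simplifying assumptions.

import numpy as np

def concat_pitch_synchronous(frame_a, marks_a, frame_b, marks_b):
    """Join frame_a and frame_b at their boundary pitch marks with a cross-fade.

    marks_a / marks_b: sample indices of the pitch marks inside each frame.
    The joint is the last pitch mark of frame_a and the first pitch mark of frame_b.
    """
    head = frame_a[:marks_a[-1]]                 # frame_a up to its last pitch mark
    tail = frame_b[marks_b[0]:]                  # frame_b from its first pitch mark on

    # Cross-fade over roughly one fundamental period around the joint.
    period = marks_a[-1] - marks_a[-2] if len(marks_a) > 1 else len(head) // 2
    period = min(period, len(head), len(tail))
    if period < 1:
        return np.concatenate([head, tail])
    ramp = np.linspace(0.0, 1.0, period)
    blended = head[-period:] * (1.0 - ramp) + tail[:period] * ramp
    return np.concatenate([head[:-period], blended, tail[period:]])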
Fig. 6. Demonstration of frame concatenation
5. EXPERIMENTAL RESULTS
We take 5160 frames from a male source speaker as training data and compare them with 4980 frames from a female target speaker to extract the mapping codebooks. The codebooks have 512 and 10 vectors for voiced and unvoiced frames, respectively. We use 50 sentences for training the codebooks (and also for the inside test), and another 50 sentences for the outside test. To evaluate the performance of our system subjectively, two forced-choice (ABX) [6] experiments and MOS (mean opinion score) [5] tests were performed, where A and B are the source and target speech utterances, respectively, and X is the result of converting the source speaker's utterance toward the target speaker. In the first ABX experiment, we used 50 stimuli A, B, and X without applying any refinement schemes to the synthesized voice; an accompanying MOS test estimated the listening quality on a 5-point scale: 1-bad, 2-poor, 3-fair, 4-good, and 5-excellent. In the second experiment, we used the same 50 stimuli but applied the proposed refinement schemes to concatenate the frames more smoothly; as in the first experiment, an MOS test was also conducted. Table 1 shows the experimental results.

Table 1: Perceptual tests. (Test question: "X is closer to A or to B?")
              First ABX   First MOS   Second ABX   Second MOS
Inside test     84.0%        3.3         92.0%        3.7
Outside test    56.0%        2.4         74.0%        3.1

6. CONCLUSIONS
In this paper, we have presented new refinement schemes for voice conversion, including S/U/V classification, data grouping based on DTW alignment, and post-processing adjustments for pitch, energy, and smooth concatenation. The proposed system has been evaluated by formal listening tests (ABX and MOS), and the results demonstrate that the proposed refinement schemes are feasible in practice.
7. REFERENCES
[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," Proc. ICASSP, 1988, pp. 655-658.
[2] Cheng-Yuan Lin, J.-S. Roger Jang, and Shaw-Hwa Hwang, "An on-the-fly Mandarin singing voice synthesis system," Proc. IEEE PCM, 2002.
[3] F. Charpentier and E. Moulines, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Proc. European Conf. on Speech Communication and Technology, Paris, 1989, pp. 13-19.
[4] G. S. Ying, L. H. Jamieson, and C. D. Michell, "A probabilistic approach to AMDF pitch detection," Proc. ICSLP, 1996, pp. 1201-1204.
[5] ITU-T, Methods for Subjective Determination of Transmission Quality, International Telecommunication Union, 1996.
[6] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, 1998, pp. 285-288.
[7] M. W. Macon, "Speech Synthesis Based on Sinusoidal Modeling," Ph.D. thesis, Georgia Institute of Technology, 1996.
[8] L. C. Schwardt and J. A. Du Preez, "Voice conversion based on static speaker characteristics," Proc. COMSIG '98, 1998, pp. 57-62.
[9] T. Toda, H. Saruwatari, and K. Shikano, "Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum," Proc. ICASSP, 2001, vol. 2, pp. 841-844.
[10] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Proc. ICASSP, 1992, pp. 145-148.
[11] W. Verhelst and J. Mertens, "Voice conversion using partitions of spectral feature space," Proc. ICASSP, 1996, vol. 1, pp. 365-368.
[12] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000, pp. 424-426.