7th International Conference on Spoken Language Processing [ICSLP2002], Denver, Colorado, USA, September 16-20, 2002

ISCA Archive — http://www.isca-speech.org/archive

TRANSFORMATION OF SPECTRAL ENVELOPE FOR VOICE CONVERSION BASED ON RADIAL BASIS FUNCTION NETWORKS

Tomomi Watanabe¹, Takahiro Murakami², Munehiro Namba³, Tetsuya Hoya⁴, and Yoshihisa Ishida⁵

¹ ² ⁵ Department of Electronics and Communications, School of Science and Technology, Meiji University, 1-1-1, Higashimita, Tama-ku, Kawasaki, 214-8571 Japan; e-mail: {twatanabe, murakami}@bach.mind.meiji.ac.jp, [email protected]
³ Department of Mathematics and Informatics, School of Education, Tokyo Gakugei University, 4-1-1, Nukuikitamachi, Koganei, Tokyo, 184-8501 Japan; e-mail: [email protected]
⁴ Laboratory for Advanced Brain Signal Processing, BSI-RIKEN, 2-1, Hirosawa, Wakoh, Saitama, 351-0198 Japan; e-mail: [email protected]

ABSTRACT

This paper presents a novel algorithm that modifies the speech uttered by a source speaker to sound as if produced by a target speaker. In particular, we address the issue of transforming the vocal tract characteristics from one speaker to another. The approach is based on estimating spectral envelopes using radial basis function (RBF) networks, one of the well-known models of artificial neural networks. The simulation results show that the proposed method achieves nearly optimal spectral conversion performance. Moreover, the average cepstrum distance to the target speech is reduced by 87%, and in the listening tests a mean opinion score (MOS) of around 84% is obtained.

1. INTRODUCTION

Voice conversion is a technique that modifies a source speaker's utterance so that it is perceived as if a target speaker had spoken it. This technique has numerous applications, e.g., personification of text-to-speech systems, reduction of speaker variability, and improvement of the intelligibility of abnormal speech uttered by a speaker with speech organ problems. There have been many attempts directed at the problem of voice conversion recently, for instance, methods based on segmental codebooks [1], on the subspace method and Gaussian mixture models [2, 3], or on multilayer feedforward neural networks trained with the backpropagation algorithm [4]. In this paper, we propose a novel method for voice conversion in which the transformation of the vocal tract characteristics is modeled by means of an RBF network [5], well known for its rapid training, generality, and simplicity. In the proposed method, the RBF network is applied to nonlinear function approximation of the LPC spectral envelope transformation. Additionally, the average of the fundamental frequency (F0) is modified to match that of the target F0 using the time-domain pitch-synchronous overlap-add method (TD-PSOLA) [7].

2. THE RBF NETWORK

Figure 1(a) illustrates a three-layer RBF network with m inputs, one hidden layer containing P RBFs, and n outputs. As shown in the figure, the input vector x = [x1, x2, …, xm] is applied to all the RBFs in the hidden layer. Each RBF has a parameter vector called the centroid, with which the input vector is compared, and generates a radially symmetric response given by the so-called Gaussian response function shown in Figure 1(b):

$$h_i(\mathbf{x}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2 I_i^2} \right), \qquad (1)$$

where c_i (i = 1, 2, …, P) is the centroid vector of the i-th RBF and I_i is the radius, which determines the width of the symmetric response of the hidden node. Thus, h_i(x) diminishes rapidly as x and c_i move apart from each other. Each RBF response is weighted by a connection vector w_k = [w_k1, w_k2, …, w_kP]^T (T: vector transpose) to the output layer; these connections are called the weights. The network outputs y(x) = [y1, y2, …, yn] are then given as the linearly weighted summation of the h_i(x):

$$y_k(\mathbf{x}) = \sum_{i=1}^{P} h_i(\mathbf{x}) \, w_{ki}, \qquad k = 1, 2, \ldots, n. \qquad (2)$$
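To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of the network's forward pass; the function and variable names are ours, not the paper's.

```python
import numpy as np

def rbf_forward(x, centroids, radii, W):
    """Forward pass of the RBF network, Eqs. (1)-(2).
    x: (m,) input; centroids: (P, m); radii: (P,); W: (P, n)."""
    # Eq. (1): Gaussian response of each hidden node.
    sq_dist = np.sum((centroids - x) ** 2, axis=1)   # ||x - c_i||^2
    h = np.exp(-sq_dist / (2.0 * radii ** 2))        # h_i(x)
    # Eq. (2): linearly weighted summation of the responses.
    return h @ W                                     # y_k(x) = sum_i h_i(x) w_ki
```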

The training of an RBF network does not normally require any iterative procedure; the training is completed by simply learning the weights once both the centroid vectors and their radii are fixed. In this paper, the centroid vectors c_i are set to the phoneme centroids obtained by the k-means algorithm [5], which is described in the next subsection. In general implementations of RBF networks, the radii I_i are fixed so as to moderately cover the space spanned by the centroids [6]. However, we empirically found that the setting

$$I_i = \|\mathbf{c}_i\|_2 \qquad (3)$$

yields better results. Therefore, we assign the radii as in Eq. (3). During training, the weight vectors are determined so as to minimize the averaged error between the actual network outputs and the desired outputs over the pattern vectors in the training set. The training set is composed of M pairs of input vectors x_j and target vectors y(x_j). The weight vector array W = [w1, w2, …, wn] is then adjusted by employing the least squares method [4]:

[Figure 1. An RBF network: (a) structure of an RBF network — input vector, input layer, hidden layer of RBFs, weights, and summation into the output vector y(x) at the output layer; (b) response of the RBF.]

$$W = (H^T H)^{-1} H^T Y, \qquad (4)$$

where

$$H = \begin{bmatrix} \mathbf{h}(\mathbf{x}_1) \\ \mathbf{h}(\mathbf{x}_2) \\ \vdots \\ \mathbf{h}(\mathbf{x}_M) \end{bmatrix} \quad \text{with} \quad \mathbf{h}(\mathbf{x}_k) = [h_1(\mathbf{x}_k), h_2(\mathbf{x}_k), \ldots, h_P(\mathbf{x}_k)],$$

and Y = [y(x1), y(x2), …, y(xM)]^T denotes the set of desired output vectors. H^T H in Eq. (4), however, may not be invertible. Therefore, in order to calculate the pseudoinverse of H^T H, we exploit the singular value decomposition (SVD) [6].
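As a companion sketch, the closed-form training of Eqs. (3)-(4) might look as follows in NumPy. np.linalg.pinv computes the pseudoinverse via the SVD, in the spirit of the paper's remark; the radius setting is our reading of Eq. (3).

```python
import numpy as np

def train_rbf_weights(X, Y, centroids):
    """Closed-form weight training, Eqs. (3)-(4).
    X: (M, m) inputs x_j; Y: (M, n) targets y(x_j); centroids: (P, m)."""
    # Eq. (3): radius of each RBF set to the norm of its centroid
    # (our reading of the paper's empirical setting).
    radii = np.linalg.norm(centroids, axis=1)
    # Row k of H is h(x_k) = [h_1(x_k), ..., h_P(x_k)].
    sq_dist = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    H = np.exp(-sq_dist / (2.0 * radii ** 2))
    # Eq. (4) via the SVD-based pseudoinverse, which also handles
    # the case where H^T H is not invertible.
    W = np.linalg.pinv(H) @ Y
    return radii, W
```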

2.1 The k-means Algorithm

For the setting of the RBF network centroids, the k-means algorithm [5] is summarized as follows:

Step 1: Prepare a set of training vectors {x1, x2, …, xN}.
Step 2: Assign the initial P (< N) centroid vectors to arbitrarily chosen P vectors in the training set.
Step 3: For k = 1 to N, do the following:
Step 3.1: Find the nearest centroid c_j to x_k.
Step 3.2: Update c_j as follows:

$$\mathbf{c}_j(k) = \mathbf{c}_j(k-1) + \frac{I_j(\mathbf{x}_k)}{\sum_{J=1}^{k} I_j(\mathbf{x}_J)} \left[ \mathbf{x}_k - \mathbf{c}_j(k-1) \right], \qquad (5)$$

where

$$I_j(\mathbf{x}_k) = \begin{cases} 1 & \text{if } \|\mathbf{x}_k - \mathbf{c}_j\| \le \|\mathbf{x}_k - \mathbf{c}_l\| \ \forall l, \\ 0 & \text{otherwise.} \end{cases}$$
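A short Python sketch of this one-pass update is given below. Our reading of Eq. (5) is that the denominator counts how many vectors have been assigned to centroid j so far, so each winning centroid moves to the running mean of its assigned vectors; the random initialization stands in for the paper's phoneme-envelope initialization.

```python
import numpy as np

def sequential_kmeans(X, P, seed=0):
    """One-pass k-means centroid update per Steps 1-3 and Eq. (5).
    X: (N, m) training vectors; P: number of centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize to P arbitrarily chosen training vectors
    # (the paper initializes them to phoneme spectral envelopes).
    centroids = X[rng.choice(len(X), size=P, replace=False)].copy()
    wins = np.zeros(P)  # running sum of I_j(x_J) for each centroid
    for x in X:  # Step 3
        # Step 3.1: nearest centroid c_j to x_k.
        j = np.argmin(np.sum((centroids - x) ** 2, axis=1))
        # Step 3.2, Eq. (5): the winner moves toward the running
        # mean of the vectors assigned to it so far.
        wins[j] += 1.0
        centroids[j] += (x - centroids[j]) / wins[j]
    return centroids
```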

3. VOICE CONVERSION ALGORITHM

The proposed method for voice conversion is summarized in Figure 2. The procedure has two phases: (1) training of the RBF network as a preprocess, and (2) the actual speech transformation. In order to represent the spectral characteristics of the vocal tract, we use LPC spectral envelopes.

[Figure 2. Procedure for voice conversion: (1) a training set obtained from spectral envelopes trains the RBF network; (2) the source speaker's speech signal is split by overlapping frame analysis into LPC spectral envelopes and LPC residual signals, the vocal tract transformation is applied to the envelopes, the envelopes and residual signals are recombined and converted back into a speech signal, and the average F0 is matched to that of the target's F0, yielding the converted speech signal.]

In the first phase (1), the RBF network is trained on the training set. The spectral envelope of a phoneme uttered by the source speaker and that of the same phoneme uttered by the target speaker are used, respectively, as the input vector and the target vector of each pair in the training set. In the second phase (2), the source speech signal is divided into overlapped blocks, with an overlap of half the frame length, using a Hanning window, and the frame vectors so obtained are analyzed into both LPC spectral envelopes and LPC residual signals. The trained RBF network is then used as an implicit nonlinear transformation for modifying the LPC spectral envelopes. The transformed spectral envelopes are recombined with the residual signals. Then, by adding the overlapping frame signals, the converted speech, with its average F0 matched to that of the target's F0 using the TD-PSOLA technique [7], is finally produced.
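As an illustration of phase (2), the sketch below implements the analysis/recombination chain with NumPy/SciPy under several assumptions of ours: the envelope is exchanged in the frequency domain, `transform` is a hypothetical placeholder for the trained RBF mapping, and the TD-PSOLA F0-matching step is omitted.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

FRAME, HOP, ORDER = 256, 128, 16      # 256 points = 23.2 ms at 11 kHz

def lpc(frame, order=ORDER):
    """LPC coefficients by the autocorrelation method;
    returns A(z) = [1, -a_1, ..., -a_p]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    r[0] += 1e-9                       # guard against silent frames
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

def lpc_envelope(a, nfft=FRAME):
    """Magnitude of the LPC spectral envelope, 1/|A(e^jw)|."""
    return 1.0 / np.abs(np.fft.rfft(a, nfft))

def convert(speech, transform):
    """Frame-wise analysis, envelope transformation, and
    overlap-add resynthesis; `transform` stands in for the
    trained RBF network."""
    win = np.hanning(FRAME)
    out = np.zeros(len(speech))
    for s in range(0, len(speech) - FRAME + 1, HOP):
        frame = speech[s:s + FRAME] * win
        a = lpc(frame)
        residual = lfilter(a, [1.0], frame)      # inverse filtering
        env = transform(lpc_envelope(a))         # modified envelope
        # Recombine: impose the transformed envelope on the
        # (whitened) residual spectrum, then overlap-add.
        spec = np.fft.rfft(residual) * env
        out[s:s + FRAME] += np.fft.irfft(spec, FRAME)
    return out
```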

4. SIMULATION STUDY

4.1 Parameter Settings

We attempted to convert the voice characteristics of speech sampled at 11 [kHz] between speakers of the same gender. The speech was divided into overlapped blocks of 256 data points (23.2 [ms]). The spectral envelopes were modeled with 16 LPC coefficients. The number of centroids was chosen as the number of phonemes plus one (noise), and the centroids were initialized to input spectral envelopes of the phonemes: a total of six centroids was prepared for speech consisting of the five Japanese vowels /a/, /i/, /u/, /e/, and /o/, or nine centroids for speech consisting of the three Japanese vowels /a/, /i/, and /o/ and four affricates, i.e., /k/, /n/, /ch/, and /w/.

4.2 Simulation Results

Figures 3 and 4 show the simulation results (for two trials) for the five vowels /a/, /i/, /u/, /e/, and /o/ uttered by a Japanese male, using the proposed method. Table 1 gives the average cepstrum distance (over the first eight cepstrum coefficients) from the source speech and from the converted speech to the target speech. Table 2 summarizes the mean opinion scores for the speech utterances. As seen in Figure 3, each converted spectral envelope approximates the target envelope to a large extent. In Figures 3 and 4, it is evident that the formants of the converted speech have moved from those of the source speech toward those of the target speech. Moreover, as shown in Table 1, the average cepstrum distances to the target speech were reduced by around 87%, which is reflected in Table 2: listeners judged the converted speech to be closer to the target speaker than to the source speaker around 78% of the time.
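The objective measure could be computed along the lines below; the exact definition used in Table 1 (normalization, inclusion of the zeroth coefficient) is not given in the paper, so this is our assumption.

```python
import numpy as np

def cepstrum(envelope, ncep=8):
    """First `ncep` cepstrum coefficients (c_0 excluded) of a
    magnitude envelope, via the inverse FFT of the log spectrum."""
    c = np.fft.irfft(np.log(np.maximum(envelope, 1e-12)))
    return c[1:ncep + 1]

def avg_cepstrum_distance(envs_a, envs_b):
    """Average Euclidean distance between the first eight cepstrum
    coefficients of paired frames (our reading of Table 1's measure)."""
    return float(np.mean([np.linalg.norm(cepstrum(a) - cepstrum(b))
                          for a, b in zip(envs_a, envs_b)]))
```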

5. CONCLUSIONS

In this paper, we have proposed a novel method for voice conversion. The approach is based on transformation of voice characteristics using RBF networks. The objective evaluations verified that the RBF network could capture the transformation function that modifies the spectral envelope of the source speaker into that of the target speaker. This was especially evident for the vowels. The subjective evaluations in the listening tests demonstrated that voice conversion was achieved with the proposed method. When comparing the converted speech with natural speech, however, we observed that the quality of the converted speech deteriorated significantly. Therefore, future work includes further modifications to compensate for this drawback.

REFERENCES

[1] L. M. Arslan, "Speaker Transformation Algorithm using Segmental Codebooks (STASC)", Speech Communication 28, 1999, pp. 211-226.
[2] T. Inoue, M. Nishida, M. Fujimoto, and Y. Ariki, "Voice conversion using subspace method and Gaussian mixture model", Technical Report of IEICE Japan, 2001, SP2001-9.
[3] A. Kain and M. W. Macon, "Spectral Voice Conversion for Text-to-Speech Synthesis", Proc. ICASSP, vol. 1, 1998, pp. 285-288.
[4] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of Formants for Voice Conversion using Artificial Neural Networks", Speech Communication 16, 1995, pp. 207-216.
[5] P. D. Wasserman, "Advanced Methods in Neural Computing", Van Nostrand Reinhold, New York, 1993, pp. 147-176.
[6] S. Haykin, "Neural Networks: A Comprehensive Foundation", 2nd ed., Prentice Hall, New Jersey, 1999, pp. 256-317.
[7] E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech", Speech Communication 16, 1995, pp. 175-205.

Table 1. Objective results: average cepstrum distances to the target speech

                    Source speech    Converted speech
Male → Male         1.09             0.14 (-87%)
Female → Female     1.13             0.18 (-84%)

Table 2. Subjective results of the listening test: mean opinion scores (rate at which the voice characteristics were judged converted to the target)

                    Vowels only    Combination of affricates and vowels
Male → Male         78%            75%
Female → Female     76%            72%

[Figure 3. Simulation results: converted spectral envelopes by the proposed method (thick line), source envelopes (broken line), and target envelopes (thin line), plotted as Amplitude [dB] versus Frequency [Hz]: (a) a vowel part /a/, (b) a vowel part /i/, (c) a vowel part /u/, (d) a vowel part /e/, and (e) a vowel part /o/.]

[Figure 4. Simulation results, plotted as Frequency [Hz] versus Time [sec]: (a) source speech spectrogram, (b) target speech spectrogram, and (c) converted speech spectrogram by the proposed method, for the five vowels /a/, /i/, /u/, /e/, and /o/ uttered sequentially.]