

Speech Communication 58 (2014) 124–138 — www.elsevier.com/locate/specom

Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data

Ning Xu a,b,c,*, Yibing Tang a, Jingyi Bao d, Aiming Jiang a,b, Xiaofeng Liu a,b, Zhen Yang e

a College of IoT Engineering, Hohai University, Changzhou, China
b Changzhou Key Laboratory of Robotics and Intelligent Technology, Hohai University, Changzhou, China
c Ministry of Education Key Lab of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, China
d School of Electronic Information and Electric Engineering, Changzhou Institute of Technology, Changzhou, China
e College of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China

Received 8 May 2013; received in revised form 28 October 2013; accepted 14 November 2013 Available online 26 November 2013

Abstract

Voice conversion (VC) is a technique that aims to map the individuality of a source speaker to that of a target speaker, wherein Gaussian mixture model (GMM) based methods are evidently prevalent. Despite their wide use, two major problems remain to be resolved, i.e., over-smoothing and over-fitting. The latter arises naturally when the structure of the model is too complicated given a limited amount of training data. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature ensures that the over-fitting problem can be alleviated significantly. Meanwhile, the GP framework makes it straightforward to perform non-linear mapping by introducing sophisticated kernel functions. Thus this kind of method deserves to be explored thoroughly, as done in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the intercorrelations embedded among both excitation and vocal tract features. Moreover, the accuracy of computing the kernel functions of the GP can be improved by resorting to an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational cost. Experiments have been conducted to confirm the effectiveness of the proposed method both objectively and subjectively, demonstrating that improvements can be obtained by the GP-based method compared to the traditional GMM-based approach.

© 2013 Elsevier B.V. All rights reserved.

Keywords: Asymmetric training; Coherent training; Gaussian processes; Gaussian mixture model; Voice conversion


The work is supported in part by the Grant from the National Natural Science Foundation of China (11274092, 61271335), the Grant from the Fundamental Research Funds for the Central Universities (2011B11114, 2011B11314, 2012B07314, 2012B04014), the Grant from the National Natural Science Foundation for Young Scholars of China (61101158, 61201301, 31101643), the Grant from the Jiangsu Province Natural Science Foundation for Young Scholars of China (BK20130238), and the open research fund of the Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education (NYKL201305).

* Corresponding author at: College of IoT Engineering, Hohai University, Changzhou, China. E-mail addresses: [email protected] (N. Xu), [email protected] (Y. Tang), [email protected] (J. Bao), [email protected] (A. Jiang), [email protected] (X. Liu), [email protected] (Z. Yang).

© 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.specom.2013.11.005

1. Introduction

Voice conversion, in a word, is a technique that aims to modify the speaker-dependent information of a source speech so as to match that of a target speech, with the message unaltered. There are many applications of VC, such as customizing voices for text-to-speech (TTS) systems (Stylianou, 2009), transforming somebody's voice so that it sounds like that of a well-known celebrity (Ye and Young, 2006), and improving the intelligibility of deficient voices uttered by a person with a speech impairment (Erro, 2008).


In general, the process of VC mainly consists of two stages, namely training and transformation, wherein the objective is to learn a mapping function from training observations (training stage) that can then be used to map arbitrary test features (including both prosodic and spectral features) of a source speech onto the acoustic space of a target speech (transformation stage). Various approaches have been proposed in the literature, e.g. standard codebook mapping (Abe et al., 1988), weighted codebook mapping (Arslan, 1999), artificial neural network (ANN) mapping (Narendranath et al., 1995; Desai et al., 2010), as well as Gaussian mixture model (GMM)-based mapping (Stylianou et al., 1998; Kain, 2001; Lee, 2007; Toda et al., 2008, 2007; Erro et al., 2010). The standard codebook mapping method, so-called vector quantization (VQ), was used extensively in the early days of VC; there, a one-to-one correspondence between source and target spectral codebooks is derived in the training stage. In the transformation stage, the obtained relationship is used to transform the short-time spectral envelope of a source speech into an estimated envelope that is close to the desired one. Since this transformation is achieved as a linear combination of the target codebook centroids of a limited set of vectors, it inevitably leads to discontinuities in the transformed speech and suffers from degraded speech quality. An enhanced algorithm named weighted codebook mapping was therefore proposed. In this case, the converted vector is calculated by weighting all of the target codebook entries, where the weighting factors are obtained according to the contribution of the line spectral frequencies (LSFs). This approach solves the discontinuity problem to some extent.
Although ANN is capable of handling nonlinear mapping rules for the spectral envelope, giving fairly good results, the performance of VC systems using ANNs has been confirmed to be relatively inferior to that obtained with the Gaussian mixture model (GMM). It is worth noting that GMM-based statistical mapping methods have made great contributions to VC by significantly improving the quality and similarity of the converted speech compared to other alternatives. Thus, a multitude of GMM variants have been proposed and used extensively in most state-of-the-art VC systems (Lee, 2007; Toda et al., 2008; Erro et al., 2010). On the other hand, it should be noted that GMM-based methods mainly suffer from the over-smoothing and over-fitting problems: (a) the over-smoothing phenomenon arises from the fact that the converted spectra are excessively smoothed compared to the natural ones, which may be attributed to the averaging nature of GMM. The nature of statistical modeling inevitably reduces the details of the spectrum, thus degrading the synthesized speech. Much attention has been paid to addressing this problem. For example, Chen et al. (2003) found that most of the values in the cross-covariance matrices of GMM


are extremely small, which may lead to over-smoothing that makes converted speech sound muffled. They therefore proposed designing a mapping function based on the concept of maximum a posteriori adaptation in order to alleviate this problem. Later, the idea of perceptual post-filtering (Ye and Young, 2006) was proposed to avoid the excessive broadening of the formants caused by the over-smoothing effect. Recently, Toda et al. (2007) demonstrated that the variances of the converted spectra are less versatile than those of natural spectra, so they introduced an enhanced version of GMM that takes the global variances into consideration. Clearly, the over-smoothing problem of GMM is so well established that a myriad of methods have been proposed in the literature. (b) The over-fitting problem refers to the fact that a trained model gives very good results on the training data while predicting unseen test data poorly. This problem arises when the structure of the model is too complicated given that the amount of training data is limited. For example, suppose the dimensionality of the feature parameters is 20; then a joint GMM model (Kain, 2001) with 256 mixtures trained on 10 utterances may inevitably result in over-fitting. Compared to the vast number of proposals addressing the over-smoothing problem, relatively little attention has been paid to the over-fitting problem of GMM until now. Among the limited literature, Helander et al. (2008) proposed taking the inter-relationships of LSFs into consideration, making GMM more reliable when only a small training data set is available. Later, they proceeded by resorting to a combination of partial least squares (PLS) regression and GMM modeling, restricting the degrees of freedom in the mapping functions by selecting a suitable number of components adaptively (Helander et al., 2012).
In addition, variational Bayesian techniques have also been used to estimate the parameters of a GMM in a fully Bayesian way, alleviating the over-fitting problem to a certain extent (Marume et al., 2007; Xu and Yang, 2010). Recently, Pilkington et al. (2011) proposed using Gaussian process (GP) experts to perform spectral conversion, which was found to be insensitive to the over-fitting problem and to predict the target spectra accurately. In essence, GP provides a unified, principled and probabilistic approach to machine learning problems by resorting to the Bayesian formalism (Rasmussen and Williams, 2006); it is intrinsically a nonparametric model, with the advantage of making full use of the information in the training set without strong artificial assumptions. It should be emphasized that the over-fitting problem can be largely suppressed thanks to the nonparametric nature of GP, which in fact allows relatively few degrees of freedom in the model. Moreover, it has been reported that GP can be interpreted as a Bayesian version of the well-known support vector machine (SVM).¹ Therefore, an excellent nonlinear mapping capability is inherent, which is definitely helpful for solving regression problems like VC.

In this paper, we focus on using GP as a basic tool for mapping feature parameters in the context of VC. The basic approach to using GP for VC is similar to that of Pilkington et al. (2011). However, in order to improve the performance of the GP-based approach, this paper makes several further investigations: (a) the earlier approach mainly focuses on mapping spectra using GPs, leaving prosodic information to be translated separately. Although this methodology works reasonably well, something important has been missed; for example, the coherent information between excitation and vocal tract is ignored when we model them independently. Fortunately, GP provides a unified framework to address this problem by grouping them together and modeling them jointly. (b) In general, high-dimensional feature parameters containing various kinds of correlations are more informative for modeling. For example, in a classical GMM-based VC approach (Kain, 2001), the aligned source and target spectral features are concatenated into one augmented vector to capture coherent information. In Toda et al. (2007), both static and dynamic cepstral coefficients are combined into a new feature vector for the sake of being more informative. However, the computational costs of using such high-dimensional vectors are always high in practice, inevitably leading to the curse of dimensionality (Bishop, 2006). The same problem arises in the case of GP, especially when coherent training is adopted, which means feature vectors of higher dimensionality are under consideration. As a result, we present in this paper a new implementation of the GP-based method that preserves the advantage of using high-resolution (i.e.
high dimensional) feature vectors without significant computational cost. This paper is organized as follows. Section 2 introduces the basic concept of GP and explains how it can be integrated into VC. Section 3 describes the details of the proposed methods, including their motivations. Implementation details of the proposed VC system are provided in Section 4, making this research reproducible. Finally, extensive experiments are conducted, both objective and subjective, to evaluate the effectiveness of the methods in Section 5, and conclusions are drawn in Section 6.

¹ Although there are some similarities between GP and SVM, both of which are kernel-based methods, the details of prediction and the underlying logic differ. Note that the prediction made by a GP gives more informative results, e.g. error bars. Moreover, the use of the marginal likelihood to set the parameters of the kernels or covariance functions makes it more Bayesian.

2. Voice conversion using Gaussian processes

2.1. Gaussian processes

Formally, a GP is defined as a collection of random variables, any finite number of which has a joint Gaussian distribution (Rasmussen and Williams, 2006). In other words, a GP can be completely specified by its second-order statistics, i.e., its mean and covariance. For example, suppose in a standard case (i.e., vector inputs and scalar outputs) we have a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$ of a real process $y = f(\mathbf{x})$; it then follows that

$y \sim \mathcal{GP}(m(\mathbf{x}),\ k(\mathbf{x}, \mathbf{x}'))$   (1)

In practice, we generally presume the mean function to be zero by removing the empirical mean of the data in advance, resulting in $y \sim \mathcal{GP}(0,\ k(\mathbf{x}, \mathbf{x}'))$.² This setting allows the variability in the data to be modelled solely by the covariance function. One typical choice of covariance function is the squared exponential (SE) function, specified as

$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\!\left(-\frac{1}{2l^2}\,|\mathbf{x} - \mathbf{x}'|^2\right)$   (2)

where the unknown variance $\sigma^2$ and length-scale $l$ are often referred to as hyper-parameters, which can be inferred from the training data. Note that the covariance between the outputs is explicitly expressed as a function of the inputs. In other words, the covariance function defines a measure of similarity, which assumes that points with close inputs are likely to have similar outputs.

Now let us consider using a GP for the regression problem. Imagine we have a training data set $\{\mathbf{x}_i, y_i \mid i = 1, 2, \ldots, n\}$ and a single test input $\mathbf{x}_*$. We can then make a full probabilistic prediction about $y_*$ by first building a GP model from the training data and then calculating the posterior distribution of $y_*$ given all of the available information. Specifically, according to the definition of a GP, a prior joint distribution of the training outputs $\mathbf{y} = \{y_i \mid i = 1, 2, \ldots, n\}$ and the test output $y_*$ can be formulated as

$\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\ \begin{bmatrix} K(X, X) & K(X, \mathbf{x}_*) \\ K(\mathbf{x}_*, X) & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix}\right)$   (3)

where $K(X, \mathbf{x}_*)$ denotes the $n \times 1$ matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries $K(X, X)$ and $K(\mathbf{x}_*, X)$. Under the assumption that the hyper-parameters involved in $k(\cdot, \cdot)$ (for example, $\sigma$ and $l$ in Eq. (2)) were learned from the training data in advance, the posterior distribution

² This is true for a standard single GP; however, if multiple GPs are used (for example, with the clustered training approach discussed later), the normalization should be done cluster by cluster.


$P(y_* \mid X, \mathbf{y}, \mathbf{x}_*)$ can be obtained straightforwardly by resorting to the Gaussian identities (Bishop, 2006):

$P(y_* \mid X, \mathbf{y}, \mathbf{x}_*) = \mathcal{N}(y_*;\ \bar{y}_*,\ \mathbb{V}[y_*])$   (4)

where

$\bar{y}_* = K(\mathbf{x}_*, X)\,K(X, X)^{-1}\,\mathbf{y}$   (5)

$\mathbb{V}[y_*] = k(\mathbf{x}_*, \mathbf{x}_*) - K(\mathbf{x}_*, X)\,K(X, X)^{-1}\,K(X, \mathbf{x}_*)$   (6)

Note that the predicted mean $\bar{y}_*$ can be viewed as a weighted sum of all the training outputs $\{y_i,\ i = 1, 2, \ldots, n\}$, so that this kind of method is often called a linear smoother (Rasmussen and Williams, 2006). Besides, the point-wise error bar provided by $\mathbb{V}[y_*]$ is a well-known tool for analyzing the uncertainties involved in the modeling, in contrast to alternative models such as SVM. The hyper-parameters were assumed known in the discussion above, so before closing this subsection it is worth describing briefly how they can be learned, i.e., the way GPs are trained; more details can be found in Rasmussen and Williams (2006). Assuming all of the unknown hyper-parameters are absorbed into $\theta$, the quantity we aim to optimise is $P(\mathbf{y} \mid X, \theta)$, often called the marginal likelihood. More specifically, we maximize the log marginal likelihood $L = \log P(\mathbf{y} \mid X, \theta)$ with respect to the hyper-parameters of the covariance function, where

$\log P(\mathbf{y} \mid X, \theta) = -\frac{1}{2}\,\mathbf{y}^T K(X, X)^{-1}\,\mathbf{y} - \frac{1}{2}\log|K(X, X)| - \frac{n}{2}\log 2\pi$   (7)

and its gradient is

$\frac{\partial L}{\partial \theta} = \frac{1}{2}\,\mathrm{tr}\!\left((\mathbf{D}\mathbf{D}^T - K(X, X)^{-1})\,\frac{\partial K(X, X)}{\partial \theta}\right)$   (8)

where $\mathbf{D} = K(X, X)^{-1}\,\mathbf{y}$ and $\mathrm{tr}(\cdot)$ denotes the trace. Note that although the maximization of Eq. (7) is a non-convex optimisation task, the gradients (i.e., Eq. (8)) can be obtained easily, so standard gradient optimisers such as conjugate gradient (CG) can be used (Rasmussen and Williams, 2006).

2.2. Training and transformation based on GP

The idea that GPs can be used for VC is motivated by the following reasons: (a) the core issue of VC can be formulated as solving a regression problem, for which GP is inherently qualified. Moreover, since we often have little prior information about the physical relationship between the source and the corresponding target features, it is not advisable to use parametric models that place strong artificial assumptions. (b) Most current VC systems are subject to the over-fitting problem, especially when the amount of training data is limited. This is often caused by a model having too many degrees of freedom compared to the amount of training data available. With a GP, however, there is no need to worry about that, because its nonparametric nature entails relatively few model parameters, making it insusceptible to over-fitting. (c) Essentially, GP is a kind of kernel method, which means nonlinear mapping capacity is inherent. Moreover, its flexibility can be significantly improved by exploiting more sophisticated covariance functions. (d) The major computational complexity of GP is mainly related to the number of training samples rather than the dimensionality of the feature vectors. That means one is free to choose a relatively high order of feature vectors, representing high-resolution characteristics, without significantly increasing computational costs.

The details of the training and conversion algorithms based on GP can be described as follows. Assume that we have a parallel corpus in which the source and target speakers speak the same sentences, so that phonetically equivalent feature vectors can be extracted and then paired. Without loss of generality, suppose the paired sequences of feature vectors are denoted $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$ and $Y = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N]$, both of dimension $D \times N$. Our task can then be formulated as addressing a regression problem by designing a specific mapping function from those training data, which can be solved by resorting to the GP formalism described in Section 2.1. Moreover, two implementation points deserve attention: (a) the training data first need to be normalized by removing the empirical mean, in accordance with the zero-mean GP assumption. (b) A standard GP takes vector inputs and produces scalar outputs; for the problem at hand, however, multiple inputs and multiple outputs are required. Therefore, a simple but effective strategy is applied that uses a separate GP for each dimension of the output feature vector. The technical details are summarized in Tables 1 and 2.

2.3. Computationally efficient GP

As mentioned in Rasmussen and Williams (2006), one of the major problems of GP is that it typically scales as $O(N^3)$, where $N$ denotes the number of training samples. For large problems, for example $N > 10{,}000$, both storage and computational costs are prohibitively expensive. Numerous algorithms have been proposed to address this problem in the machine learning community (Snelson, 2007), the most popular of which is based on the concept of selecting a subset of the data (SD). Although this technique sounds naive, it serves as a prototype for most of the other, more sophisticated techniques. It should be noted that one of the main issues of SD is that the information contained in the rest of the training data is discarded. For highly redundant data sets this is not a serious problem; however, in real applications where the amount of training data is limited, such as VC, the performance of the system would be affected significantly by using only a subset of the whole training data. In this paper, we adopt a strategy that makes full use of the whole training data set while alleviating the computational difficulty of GP. Note that the major cost is due to the inversion of the covariance matrix $K(\cdot, \cdot)$, whose dimensionality is $N \times N$.



Table 1
Training procedure based on GP.

Step 1: $X$ and $Y$ are first normalized by removing the empirical means, resulting in $\bar{X} = [\bar{\mathbf{x}}_1, \ldots, \bar{\mathbf{x}}_i, \ldots, \bar{\mathbf{x}}_N]$ and $\bar{Y} = [\bar{\mathbf{y}}_1, \ldots, \bar{\mathbf{y}}_i, \ldots, \bar{\mathbf{y}}_N]$, where $\bar{\mathbf{x}}_i = \mathbf{x}_i - \tilde{\mathbf{x}}$ and $\bar{\mathbf{y}}_i = \mathbf{y}_i - \tilde{\mathbf{y}}$, with $\tilde{\mathbf{x}} = \frac{1}{N}\sum_{j=1}^{N}\mathbf{x}_j$ and $\tilde{\mathbf{y}} = \frac{1}{N}\sum_{j=1}^{N}\mathbf{y}_j$.
Step 2: For $j$ from 1 to $D$:
  1. The subset $\{\bar{X}, \bar{Y}_{j:}\}$ is collected, where $\bar{Y}_{j:}$ denotes the $j$th row of $\bar{Y}$.
  2. The $j$th GP is trained using $\{\bar{X}, \bar{Y}_{j:}\}$ in a standard manner. The type of covariance function of the $j$th GP is carefully chosen, and the corresponding unknown hyper-parameter set $\theta_j$ is initialized randomly.
  3. The estimate of $\theta_j$, denoted $\hat{\theta}_j$, is derived by maximizing the log marginal likelihood $L = \log P(\bar{Y}_{j:} \mid \bar{X}, \theta_j)$ with respect to $\theta_j$ under the Bayesian formalism.
Step 3: $\hat{\theta} = \{\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_D\}$ as well as $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{y}}$ are preserved for further use.

Table 2
Conversion procedure based on GP.

Step 1: Test inputs $X_* = [\mathbf{x}_{*1}, \ldots, \mathbf{x}_{*i}, \ldots, \mathbf{x}_{*L}]$ are also normalized in advance, resulting in $\bar{X}_* = [\bar{\mathbf{x}}_{*1}, \ldots, \bar{\mathbf{x}}_{*L}]$, where $\bar{\mathbf{x}}_{*i} = \mathbf{x}_{*i} - \tilde{\mathbf{x}}$.
Step 2: For $j$ from 1 to $D$: the $j$th row of the converted matrix $\bar{Y}_*$, denoted $\bar{Y}_{*,j:}$, is predicted by iteratively evaluating Eq. (5) point by point.
Step 3: Test outputs $Y_*$ are reconstructed by adding to $\bar{Y}_*$ the empirical mean of the training outputs, i.e., $\mathbf{y}_{*i} = \bar{\mathbf{y}}_{*i} + \tilde{\mathbf{y}}$ for all $i$.
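As a concrete illustration of the procedures in Tables 1 and 2, the following minimal NumPy/SciPy sketch trains one GP per output dimension (hyper-parameters fitted by maximizing the log marginal likelihood of Eq. (7)) and converts via the predictive mean of Eq. (5). This is our own sketch, not the authors' implementation; frames are stored as rows rather than the $D \times N$ columns used in the text, and a small jitter term is added to the kernel matrix for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(A, B, log_sigma, log_l):
    # Squared exponential covariance of Eq. (2): k(x, x') = sigma^2 exp(-|x - x'|^2 / (2 l^2))
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(2.0 * log_sigma) * np.exp(-0.5 * np.maximum(sq, 0.0) / np.exp(2.0 * log_l))

def neg_log_marginal(theta, X, y, jitter=1e-6):
    # Negative log marginal likelihood of Eq. (7), with theta = (log sigma, log l).
    K = se_kernel(X, X, theta[0], theta[1]) + jitter * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return np.inf
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2.0 * np.pi)

def train(X, Y):
    # Table 1: remove the empirical means, then fit one GP per output dimension.
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    thetas = [minimize(neg_log_marginal, np.zeros(2), args=(Xc, Yc[:, j])).x
              for j in range(Y.shape[1])]
    return {"mx": mx, "my": my, "Xc": Xc, "Yc": Yc, "thetas": thetas}

def convert(model, Xstar, jitter=1e-6):
    # Table 2: normalize the test inputs, evaluate the predictive mean of Eq. (5)
    # dimension by dimension, then add back the empirical mean of the outputs.
    Xs = Xstar - model["mx"]
    Xc, Yc = model["Xc"], model["Yc"]
    out = np.empty((len(Xs), Yc.shape[1]))
    for j, th in enumerate(model["thetas"]):
        K = se_kernel(Xc, Xc, th[0], th[1]) + jitter * np.eye(len(Xc))
        out[:, j] = se_kernel(Xs, Xc, th[0], th[1]) @ np.linalg.solve(K, Yc[:, j])
    return out + model["my"]
```

On a toy parallel corpus, converting the training inputs themselves recovers the training targets almost exactly, as expected for an interpolating GP with a small jitter.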

This implies that the smaller $N$ is, the lower the training cost. Thus, we proceed to partition the whole data set into several non-overlapping subsets, and the training procedure is then carried out in each subset in turn, which we call clustered training. Note that for each cluster the number of training samples is now relatively small compared to that of the whole data set, leading to efficient computation of the inversion of the covariance matrix. It can easily be verified that the cost of training indeed decreases as the number of clusters increases. For instance, suppose the number of clusters is $Q$ and the whole data set is equally partitioned; the computational complexity is then approximately $O\!\left(D \cdot \left(\frac{N}{Q}\right)^3 \cdot Q\right)$.
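The partition-then-train idea can be sketched as follows; this is a hypothetical helper (function names are ours, not from the paper) that uses k-means as a stand-in for a generic VQ codebook:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_partition(X, Y, Q, seed=0):
    # Training stage: VQ-style clustering of the aligned source frames; each of the
    # Q clusters then trains its own GP on roughly N/Q frames, so a single O(N^3)
    # covariance inversion shrinks to Q inversions of size about (N/Q)^3 each.
    centroids, labels = kmeans2(X, Q, minit="++", seed=seed)
    subsets = [(X[labels == q], Y[labels == q]) for q in range(Q)]
    return centroids, subsets

def assign_cluster(centroids, x):
    # Transformation stage: route a test frame to the GP of its nearest centroid.
    return int(np.argmin(np.sum((centroids - x) ** 2, axis=1)))
```

Each returned subset would then be normalized and modeled independently, following the per-cluster procedure described around Fig. 1.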

The concept of clustered training for GP is illustrated in Fig. 1. Note that care should be taken when using the clustered training method. Specifically, in the training stage: (a) the training data are first clustered by the vector quantization (VQ) algorithm (Rabiner and Schafer, 2009); (b) for each category, the data are normalized individually and then modeled by GPs in the manner described in Table 1. In the transformation stage: (a) the test inputs are classified into categories by computing distances against the centroids resulting from the VQ step; (b) for each category, conversion is performed separately according to the algorithm listed in Table 2.

3. Enhancement of GP-based mapping

This section describes two novel implementations of the GP-based method, namely coherent training and asymmetric training, both of which are proposed to improve the performance of the standard GP (Pilkington et al., 2011).

3.1. Coherent training

The term coherent training refers to the fact that the prosodic (in this paper, specifically F0) and spectral parameters are combined as a whole during training. This idea is motivated by two reasons: (a) as mentioned in Section 1, the classical GP method (Pilkington et al., 2011) deals with prosodic and spectral features independently; most attention has been focused on modeling spectral information using GP, leaving prosodic features far from fully exploited. Although this simplification may give perceptually acceptable results, it is theoretically unsatisfying.³ Moreover, combining prosodic and spectral parameters into one augmented feature vector is more informative than modeling both types of features separately, since the coherent information is exploited explicitly. (b) In most state-of-the-art GMM-based systems, such as Toda et al.
(2007), F0 is modeled by a univariate Gaussian while the distribution of the spectral features is fitted by a GMM. Despite the artificial Gaussian assumption, this works reasonably well. This is the basis on which we propose to model them jointly in the unified framework of GP, which essentially assumes the joint distribution of F0 and the spectral features to be multivariate Gaussian.

3.2. Asymmetric training

In general, the sizes of the feature vectors of both speakers fed to a VC system are the same. For

³ When LPC is used as the analysis/synthesis model, as in Stylianou et al. (1998), the residual contains prosodic information beyond F0; in such a case it is inadequate to convert only F0 while leaving the residual unaltered. On the other hand, F0 is the sole prosody-related feature in the harmonic plus stochastic model (HSM) used in this paper.



Fig. 1. The illustration of the algorithm of clustered training for GP.

instance, as illustrated in Fig. 1, both $X_q$ and $Y_q$ are of dimensionality $D \times N_q$. This configuration (in particular, the same dimensionality of the feature vectors) provides a consistent formulation, thus simplifying the subsequent manipulations. We call this symmetric training in this paper; otherwise, asymmetric training. Specifically, the term asymmetric training refers to a VC system trained with a data set consisting of $X$ and $Y$ of dimensionality $D_x \times N$ and $D_y \times N$, respectively, where $D_x \neq D_y$. It will be verified that the asymmetric training algorithm, in which we ensure that $D_x > D_y$ in this paper, leads to an improvement in the performance of the system without significant extra computational cost. This idea is motivated by the following reasons: (a) the covariance functions (or kernel functions) in GP, which are known to play an important role in the performance of the system, are used to measure the similarity of data points by computing distances between their inputs. For example, as expressed in Eq. (2), the output of the SE covariance function (or Gaussian covariance function) is indeed a measure of the difference between the inputs $\mathbf{x}$ and $\mathbf{x}'$. Hence, to some extent, the choice of the parametrization of the inputs is crucial. A deeper examination of the choice of feature parameters will be conducted in Section 4. Meanwhile, for a given parametrization, it is well known that feature vectors of reasonably high order are often more informative than those of relatively low order. For instance, a higher-order linear prediction (LP) spectrum often provides more detail than a lower-order one, and is thus more accurate in some speech processing tasks (Makhoul, 1975). Therefore, in this paper a reasonably high order of input feature parameters is chosen in order to compute the covariance function more accurately, where the order is denoted $D_x$.
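To make point (a) concrete, the following toy check (our illustration, not the paper's code) verifies that the kernel matrix, and hence the expensive $O(N^3)$ inversion, is $N \times N$ regardless of how large the input dimensionality $D_x$ is made:

```python
import numpy as np

def kernel_matrix_shape(N=200, Dx=40):
    # The SE kernel of Eq. (2) compares frames only through pairwise distances,
    # so a richer (higher-dimensional) input parametrization changes the kernel
    # values but never the N x N size of K(X, X) that must be inverted.
    X = np.random.default_rng(0).normal(size=(N, Dx))
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(X ** 2, 1)[None, :] - 2.0 * X @ X.T
    K = np.exp(-0.5 * np.maximum(sq, 0.0))
    return K.shape
```

K is $N \times N$ whether $D_x$ is 20 or 40; only the cheap pairwise-distance step scales with $D_x$, which is what makes enlarging the input order essentially free.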
(b) Recall the example mentioned in Section 2, where the computational cost for an equally partitioned data set is approximately $O\!\left(D \cdot \left(\frac{N}{Q}\right)^3 \cdot Q\right)$. Note that this cost is mainly associated with the dimensionality of the feature vectors $D$ (as illustrated in Fig. 1), given that the number of training samples ($N$) and the number of clusters ($Q$) are fixed. This means that if $N$ and $Q$ are fixed, we only need to reduce $D$ in order to alleviate the computational complexity. It should be emphasized that in our proposed method $D$ refers specifically to the dimensionality of the output feature vectors, as illustrated in Fig. 1 (in that case, the dimensionalities of the source and target vectors are equal); we will thus denote it explicitly as $D_y$. (c) In conclusion, we tend to increase $D_x$ in order to compute the covariance functions more accurately, while we prefer to reduce $D_y$ for the sake of reducing the computational complexity. This is why we propose the asymmetric training algorithm, wherein $D_x > D_y$.

4. Implementation

This section explains the details of the implementation of the proposed VC system, whose systematic diagram is shown in Fig. 2.

4.1. Analysis and synthesis model

As can be seen, a harmonic plus stochastic model (HSM) is used as the analysis/synthesis model (Erro et al., 2007). The duration of the analysis frame is 30 ms with 50% overlap. For each frame, a prosodic component represented by the fundamental frequency (F0), a harmonic component consisting of the amplitudes and phases of the harmonic sinusoids below 5 kHz, and a stochastic component represented by an all-pole filter can be obtained according to the HSM. In principle, all of the features mentioned above should be modeled and translated by the VC system. However, according to preliminary experiments, converting only the prosodic and harmonic components, with the stochastic component unchanged, is perceptually acceptable. Thus,



Fig. 2. Overview of the proposed system.

we will focus on the transformation of the prosodic and harmonic components of the HSM. Specifically, the procedure for analyzing speech using HSM is as follows. (a) An estimate of the pitch trajectory is required beforehand; the modified autocorrelation method presented in Erro (2008) is used for this task. (b) With the results of U/V detection (obtained during pitch detection in step (a)), it is assumed that the voiced parts of speech consist of both harmonic and stochastic components, while the unvoiced parts contain only stochastic components. In order to distinguish harmonic components from stochastic ones in voiced frames, a simple way is to set a threshold for piece-wise linear separation, where a constant maximum voicing frequency of 5 kHz is chosen. In short, the voiced parts are considered combinations of harmonic and stochastic components separated by the maximum voicing frequency, while the unvoiced parts are modeled solely by stochastic components. (c) For the voiced frames, the amplitudes and phases of the sinusoids within a constant-length speech frame can be inferred by resorting to least squares optimization (Stylianou, 1996). On the other hand, for the unvoiced frames the LPC technique is used straightforwardly to obtain an all-pole filter model.

4.2. Parametrization

The amplitudes of harmonics obtained by HSM should be further parameterized for two reasons: (a) the number of

harmonics in each voiced frame is variable, which means the length of one feature vector will differ from another, causing great inconvenience for the subsequent data alignment module. (b) Although the raw amplitudes of harmonics are mathematically efficient in terms of analysis and synthesis, they have little relevance to speaker individuality, which is very important to VC. According to previous studies, the influence of the amplitudes of harmonics has been shown to be more decisive than that of their phases in terms of human perception. Thus, conversion of phases is not involved in this paper. What we really need is to keep the phases coherently correlated with the amplitudes so that a realistic speech waveform can be reconstructed; a minimum-phase approach is adopted by considering a linear-in-frequency term for the prediction of the converted phases (Erro et al., 2007). Besides, the prosodic information of HSM is mainly restricted to fundamental frequencies (F0s), whose parametrization is unnecessary because of their simplicity. In conclusion, the object of parametrization reduces to finding an appropriate representation for the amplitudes of harmonics alone. Two kinds of coefficients are most preferred in state-of-the-art VC systems, i.e., cepstral coefficients (CCs) and line spectral frequencies (LSFs), both of which are physically meaningful and well established in the speech community. In this paper, cepstral coefficients have been adopted, since we apply a strategy that trains multiple GPs for individual dimensions of the output feature vectors




Fig. 3. Spectral distortion as a function of the number of training data for GP and GMM-based methods. (a) M-M, (b) M-F, (c) F-M, (d) F-F.

separately (see Fig. 1). In other words, the strong inter-dimensional correlations of LSFs imply that they are not suitable for the task at hand. The procedures for parametrization and its inversion can be described briefly as follows. In the training stage, a frequency-domain implementation of the linear prediction coding (LPC) algorithm is applied to obtain the all-pole representation of a given set of harmonics (Makhoul, 1975). CCs can then be inferred straightforwardly from the LPCs (Rabiner and Schafer, 2009). In the conversion stage, after the CCs are translated back into LPCs, the amplitudes of the harmonics of any analysis frame are determined by sampling the LPC spectrum at integer multiples of the fundamental frequency below 5 kHz. Besides, the phases of the harmonics have to be corrected in order to avoid discontinuities between adjacent frames (Erro et al., 2007).

4.3. Alignment

Once the CCs are obtained, the underlying correspondence between source and target should be exploited in order to train the mapping function. In general, it is difficult to establish the relationship between the acoustic patterns of source and target given nonparallel training data, where nonparallel means the utterances recorded from the two speakers are different. To avoid this difficulty, simple prior knowledge is assumed in this paper, i.e., the training corpus is parallel. In such a situation, acoustic frames of the source speaker can be paired with their phonetically equivalent frames of the target speaker by the classical dynamic time warping (DTW) algorithm (Rabiner and Schafer, 2009). Moreover, the database used in this paper includes full phonetic labeling, which can be utilized to refine the performance of DTW. In fact, the DTW algorithm we used operates at the phoneme level: (a) the boundaries of different phonemes are determined automatically from the existing labeling. (b) The boundaries are used as anchor points so that the CCs of both speakers can be categorized into different types of phonemes. (c) Within each phoneme duration, DTW is performed to align the source frames with those of the target (note that CCs are obtained only from harmonic components, which means only voiced frames are involved in the DTW procedure). It should be emphasized that only one-to-one mapping correspondences are retained; for example, if multiple source frames are matched with a certain target frame, only one of the pairs is included, and vice versa. Preliminary experiments have shown that, despite its simplicity, this phoneme-based DTW gives very good results for a parallel training database.

4.4. Model training

In this paper, we apply Bayesian inference to GP in order to obtain estimates of the hyper-parameters of the covariance functions (Rasmussen and Williams, 2006). Although for most interesting models in practice the required computations of Bayesian inference are often



analytically intractable and approximations are necessary, GP is an exception. Specifically, we maximize the marginal likelihood by computing its partial derivatives with respect to the hyper-parameters and using a conjugate gradient optimizer; the hyper-parameters that maximize the marginal likelihood are chosen as the final estimates. It should be noted that there is no guarantee that the marginal likelihood is free of multiple optima. A simple and practical remedy is re-initialization: in this paper, for each GP model, around 30 random re-initializations of the hyper-parameters were performed, and the trial with the largest marginal likelihood was picked as the final estimate.

5. Experiments

5.1. Experimental setup

The CMU ARCTIC database, consisting of 1132 phonetically balanced utterances recorded from 7 speakers, is used for the experiments. The recordings are sampled at 16 kHz with an average duration of approximately 3 s. Full phonetic labeling is provided, which is used for the improvement of DTW as mentioned in Section 4.3. In this paper, four directions, namely BDL-to-SLT (male-female), BDL-to-RMS (male-male), CLB-to-RMS (female-male), and CLB-to-SLT (female-female), were evaluated in order to validate the VC system for all possible inter-gender and intra-gender cases. The number of training sentences in the following experiments was varied from 5 to 100, along with an independent set of 30 utterances used as test data. Specifically, the training and test data were prepared as follows. For the training data, the utterances from arctic-a0001.wav to arctic-a0593.wav of both the source and target speakers were analyzed, resulting in CCs. Then, DTW was applied to ensure parallelism of the feature vectors. Finally, the required number of training frames was selected randomly from the pool of training frames.
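The hyper-parameter estimation described in Section 4.4 (maximizing the log marginal likelihood with random re-initializations) can be sketched with scikit-learn, which the paper does not use; the toy data, the restart count, and the linear (dot-product) kernel are illustrative, and scikit-learn's default L-BFGS-B optimizer stands in for the conjugate gradient optimizer mentioned above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

rng = np.random.RandomState(0)

# Toy stand-ins for N aligned source/target frames (one output dimension,
# as in the per-dimension GPs of Fig. 1).
X = rng.randn(80, 5)                          # source features
y = X @ rng.randn(5) + 0.05 * rng.randn(80)   # one target CC dimension

# Linear (dot-product) kernel plus a noise term.  Hyper-parameters are
# chosen by maximizing the log marginal likelihood; n_restarts_optimizer
# re-initializes them randomly to reduce the risk of poor local optima
# (the paper reports about 30 restarts per GP).
gp = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(),
                              n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)
print(gp.predict(X).shape)  # → (80,)
```

The optimizer differs from the paper's, but the objective being maximized, the log marginal likelihood, is the same quantity.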
For the test data, the utterances from arctic-b0510.wav to arctic-b0539.wav of both speakers were used in the same way as above. It is worth noting that such a limited amount of training data was selected deliberately, because one of the major concerns of this paper is to evaluate the capability of GP to alleviate the problem of over-fitting. Several methods were involved in the experiments for comparison.

- GP-CA: the proposed GP-based mapping using coherent and asymmetric training.
- GP-C: the proposed GP-based mapping using coherent training only.
- GP-A: the proposed GP-based mapping using asymmetric training only.

- GP: the standard GP-based mapping without coherent and asymmetric training, as described in Pilkington et al. (2011).
- GMM: one of the state-of-the-art GMM-based mappings, considering global variances (GVs) and using maximum likelihood (ML) estimation (Toda et al., 2007). It should be noted that dynamic features were not included in this paper.4 Besides, full covariance matrices were used. Informal listening tests show that the quality of the converted speech when considering GVs is somewhat better than without GVs, at the cost of larger spectral distortion.
- GMM-VB: GMM-based mapping that evaluates the expectation of the conditional distribution of target features given source features, with model parameters estimated by variational Bayes (VB) (Xu and Yang, 2010). Note that in the conversion process, the conditional predictive distribution of the feature vectors can be represented as a mixture of Student's t distributions, considering the small size of the dataset used in this paper. Moreover, under the VB framework the number of mixtures can be selected automatically in terms of the complexity-to-accuracy trade-off. Besides, full covariance matrices were chosen. Thanks to VB, this mapping method suffers less from over-fitting.

For the methods mentioned above (GP, GP-A, GMM, GMM-VB), most attention has been focused on the conversion of spectral information, leaving the prosodic information (particularly F0s) to be modeled separately by a univariate Gaussian and converted by a linear transformation. Specifically, the conversion function for F0s is defined as

log(F0_tgt) = μ_tgt + (σ_tgt / σ_src) (log(F0_src) − μ_src)        (9)

where μ_src and μ_tgt denote the means of the log-scaled F0s of the source and target speakers, respectively, and σ_src and σ_tgt the corresponding standard deviations. It should be emphasized that using log(F0) instead of F0 fits better with the way the human ear perceives sounds. On the other hand, experiments were conducted to determine the order of the spectral feature parameters. As a result, an order of 20 was selected empirically, taking the accuracy-versus-complexity tradeoff into consideration.
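The log-domain linear F0 transformation of Eq. (9) is simple enough to state directly in code. A minimal sketch, assuming the statistics are computed over log F0 of voiced frames; the speaker statistics below are made-up illustrative values:

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Mean/variance-normalized linear transformation of log F0 (Eq. (9))."""
    log_f0 = np.log(np.asarray(f0_src, dtype=float))
    return np.exp(mu_tgt + (sigma_tgt / sigma_src) * (log_f0 - mu_src))

# Illustrative log-F0 statistics (not estimated from any real corpus).
mu_src, sigma_src = np.log(120.0), 0.15   # male-like source
mu_tgt, sigma_tgt = np.log(210.0), 0.20   # female-like target

# A source frame at the source mean maps exactly onto the target mean.
print(convert_f0([120.0], mu_src, sigma_src, mu_tgt, sigma_tgt))  # → [210.]
```

In practice μ and σ would be estimated from the voiced training frames of each speaker.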

4 Although dynamic features were not included, the derivation of the underlying conversion function remains almost unaltered compared to Toda et al. (2007). More specifically, the implementation procedure is as follows. First, the conversion function is derived by setting the linear conversion matrix to the identity matrix. Then, the most probable mixture component sequence is selected. The updating of the GVs is obtained analogously to the procedure proposed in Toda et al. (2007); the only difference is that the linear conversion matrix is replaced by the identity matrix.
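The cepstral distortion used throughout the objective tests of Section 5.2 (Eq. (10)) can also be written compactly. A sketch, assuming the 20th-order CC vectors are already time-aligned as described in the text:

```python
import numpy as np

def cepstral_distortion_db(c, c_hat):
    """Per-frame spectral distortion of Eq. (10)."""
    c, c_hat = np.asarray(c, dtype=float), np.asarray(c_hat, dtype=float)
    # sd_n [dB] = (10 / ln 10) * sqrt(2 * sum_i (c_n(i) - c_hat_n(i))^2)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((c - c_hat) ** 2))

# Two 20th-order CC vectors differing in one coefficient by 0.1.
c = np.zeros(20)
c_hat = np.zeros(20)
c_hat[0] = 0.1
print(round(cepstral_distortion_db(c, c_hat), 3))  # → 0.614
```

The overall distortion is then the average of this per-frame value over all test frames.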


5.2. Objective tests

Objective tests were conducted to evaluate the performance of the systems. The spectral distortion for the nth frame between the converted and target vectors was defined as in Toda et al. (2007):

sd_n [dB] = (10 / ln 10) · sqrt( 2 · Σ_{i=1}^{20} (c_n(i) − ĉ_n(i))² )        (10)

where c_n(i) and ĉ_n(i) are the ith elements of the CCs of the natural target and the converted target at frame n, respectively. The overall distortion was then obtained by averaging over all test frames. Note that the parallelism of the natural and converted vectors was ensured by time-aligning the CCs from source to target using DTW in advance.

5.2.1. Performance on alleviation of over-fitting

One of the major concerns of this paper is the alleviation of over-fitting, for which several models are compared in this subsection. Fig. 3 illustrates the spectral distortion obtained by GP and by GMMs of varying model complexity as a function of the number of training data. The kernel function of GP was empirically chosen as a linear function. Several insights can be obtained from Fig. 3: (a) as expected, the performance of both systems improves as more training data are involved. (b) Usually, the more mixtures GMMs have, the better they perform, provided that the amount of training data is sufficient. However, as illustrated in Fig. 3, on average the performance of the GMM with 4 components is

found to be superior to that of all other competitors, which may be attributed to the limited amount of training data. This phenomenon is well explained by over-fitting: when the degrees of freedom of a model are excessive relative to the number of training data, the model makes poor test predictions. This result shows that, for such a small amount of training data, a reasonable choice for the number of GMM components is 4. (c) GP clearly provides better results than GMM, especially when the number of training data is extremely small (for example, when only 5 utterances are used as training data, as illustrated in Fig. 3).

It should be noted that when a linear covariance function is used in GP, the resulting mapping function (i.e., Eq. (5)) can be represented in standard linear regression form, so the question of what causes the obvious improvements of the GP-based method over the traditional GMM-based method arises naturally. One plausible explanation is the nonparametric nature of GP, which is insensitive to over-fitting and therefore yields more robust model parameters given a limited amount of data. To clarify this point further, training data were used instead of test data for the evaluation of spectral distortion. Specifically, ten sentences were used to train both GMM and GP, and the same collection of sentences was then used as test data. The results are shown in Fig. 4. The spectral distortion of GMM decreases as the number of mixtures increases, which is a clear demonstration of over-fitting (i.e., models are prone to giving very good results on training data while making poor predictions on unseen test data).

Fig. 4. Spectral distortion as a function of the number of mixtures using the same 10 utterances as training and test data. (a) M-M, (b) M-F, (c) F-M, (d) F-F.

On the other hand, the GP-based mapping with a linear kernel is relatively immune to over-fitting, as a comparison of the results in Figs. 3 and 4 shows. Although the GMM-based mapping with 1 mixture is also a standard linear regression, the difference is that the output of the GMM-based mapping is fundamentally a linear combination of source feature vectors, whereas the output of the GP-based mapping with a linear kernel is a linear combination of target feature vectors. More information related to the target may thus be preserved in the GP-based method, which leads to better results.

Although the GMM-based mapping described above does not mainly focus on the alleviation of over-fitting, several improvements in the literature aim to address this problem. One of them is GMM-based mapping via variational Bayes (VB). Specifically, in the training stage, model parameters are estimated under the VB framework instead of traditional expectation maximization (EM). Then, in the conversion stage, a nonlinear mapping is obtained by evaluating the expectation of the predictive conditional distribution of the target given the source observations on a frame-by-frame basis (Xu and Yang, 2010). Fig. 5 illustrates the results of the GP-based and GMM-VB-based mappings for comparison. Thirty sentences not included in training were used as test data. The digits in parentheses indicate the best number of mixtures selected automatically by the VB algorithm. On average, GMM-VB performs slightly better than GP, confirming its effectiveness in alleviating over-fitting when compared to the results in Fig. 3. However, note that the covariance function of GP used in this experiment is merely a linear function, which means further improvements may be expected from more appropriate kernels or mixtures of GPs (Chatzis and Demiris, 2012).

5.2.2. Performance on clustered training

Fig. 6 examines the effectiveness of the clustered training algorithm of GP. The digits in parentheses indicate the number of clusters used by GP. In this experiment, the training data included 10 sentences, and another set of 30 sentences was used as test data. The kernel function of GP was linear. Note that GP with 1 cluster is the standard GP. It is obvious that the time costs can be significantly reduced by increasing the number of clusters, while the resulting performance remains well acceptable.

5.2.3. Performance on different kernel functions

As mentioned above, the choice of the kernel function has a major impact on the performance of GP. Thus, various types of kernel functions were tested; more details on these functions can be found in Rasmussen and Williams (2006). The results are listed in Table 3, where 10 sentences were used as training data and 30 sentences not included in training were used as test data. No clusters were

applied. It is interesting to observe that, on average, the performance of the linear kernel is indeed good enough despite its simplicity, while improvements can be obtained with more appropriate kernels. The results listed in Table 3 show no preference for more sophisticated kernels over simpler ones. However, note that this comparison was made with a limited amount of training data; it is therefore instructive to examine the results given a sufficient amount of training data. The comparison results are shown in Table 4, involving the simplest and the most complicated kernels of Table 3, i.e., the linear kernel and a combination of a neural network function, a Matérn function, a squared exponential function and a linear function (the so-called composite kernel), respectively. The number of training sentences was varied from 10 to 100, with a separate set of 30 sentences used as test data. Considering the large amount of training data, it is advisable to use the clustered training algorithm described in Section 2.3 to save time; thus 16 clusters were used, taking the complexity-to-accuracy tradeoff into consideration. It is evident that with a relatively limited amount of training data (for example, fewer than 30 sentences), the linear kernel performs slightly better than the more sophisticated kernel. However, it is inferior to the more sophisticated one when the amount of training data is large enough (for example, with 100 training sentences). The reason may be that with a sufficient number of training data, the linear kernel is less flexible than the composite kernel, whereas with a limited amount of training data the composite kernel is too complicated, resulting in some over-fitting.

5.2.4. Performance on coherent training

The evaluation of the effectiveness of the proposed coherent training can be divided into two aspects.
Firstly, the spectral distortions are considered, comparing GP and GP-C in Table 5. In this experiment, ten sentences were used as training data, together with a separate set of 30 sentences for testing. No clusters were used, and the kernel function was linear. From the results, a visible improvement is obtained in all conversion directions by jointly modeling F0s and spectral features on the basis of coherent training. Secondly, the conversion of F0s is examined. As described previously, the traditional method transforms F0s by univariate Gaussian modeling followed by a linear translation (i.e., Eq. (9)). GP-C, by contrast, uses information from both F0s and spectra in order to predict more accurately. Table 6 summarizes the results obtained by the traditional and GP-C methods. In this experiment, the training and test data were the same as those for Table 5, and the configuration of GP-C was unaltered. To evaluate the effectiveness of F0 conversion, an objective measurement is used: the Pearson correlation coefficient between the converted F0s and the natural target F0s. The correlation

coefficients range from −1 to 1, with −1 or 1 indicating a perfectly negative or positive linear relationship, respectively; in this experiment, a higher value means better F0 conversion. According to Table 6, the improvement in F0 conversion is confirmed.

Fig. 5. Spectral distortion as a function of the number of training data for GP and GMM-VB methods. (a) M-M, (b) M-F, (c) F-M, (d) F-F.

Fig. 6. Illustration of spectral distortion versus training time as a function of the number of clusters with 95% confidence intervals. (a) M-M, (b) M-F, (c) F-M, (d) F-F.
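The Pearson correlation used in Section 5.2.4 to score F0 conversion can be computed directly; whether the paper correlates raw or log-scaled F0 values is not stated here, so the sketch below uses raw F0 contours of voiced frames (an assumption), and the contours themselves are illustrative:

```python
import numpy as np

def f0_correlation(f0_converted, f0_target):
    """Pearson correlation between converted and natural target F0 contours."""
    x = np.asarray(f0_converted, dtype=float)
    y = np.asarray(f0_target, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Perfectly proportional contours correlate at 1.0.
print(round(f0_correlation([100, 110, 120], [200, 220, 240]), 3))  # → 1.0
```

Values close to 1 indicate that the converted contour follows the natural target contour closely.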

Table 3
Spectral distortion [dB] for different kernel functions.

Kernel function                   M-M    M-F    F-M    F-F
Squared exponential (SE)          5.56   5.03   5.38   4.92
Matérn (1.5)                      5.58   5.00   5.32   4.96
Matérn (2.5)                      5.57   5.29   5.38   4.88
Neural networks (NN)              5.53   5.17   5.35   4.94
Linear (LIN)                      5.53   5.05   5.38   4.99
Rational quadratic (RQ)           5.67   5.07   5.39   4.93
RQ + LIN                          5.63   5.08   5.45   4.95
RQ × LIN                          5.61   5.06   5.63   4.96
Matérn (1.5) + SE + NN + LIN      5.55   5.05   5.36   4.93

Table 5
Spectral distortion [dB] obtained by GP and GP-C for comparison.

Direction   GP     GP-C
M-M         5.53   5.46
M-F         5.05   4.97
F-M         5.38   5.32
F-F         4.99   4.92

5.2.5. Performance on asymmetric training

The effectiveness of GP-A is now examined. As explained earlier, the experiment varied the dimensionality of the input vectors while keeping that of the output vectors fixed to 20 as usual. Ten sentences were used for training and another set of 30 sentences for testing. The covariance function was linear, and neither clustered training nor coherent training was involved in this experiment. The results are illustrated in Fig. 7, where some interesting observations can be made. The distortions decrease almost monotonically at first, which is reasonable because higher-order features generally capture more detail than lower-order ones, resulting in higher accuracy. Thereafter, however, the distortions tend to increase as the order becomes overly high, presumably because overly high-order CCs are prone to capturing random fluctuations of the spectral envelopes. It should be noted that the best input order (i.e., the order of inputs that yields the least distortion), or equivalently the number of informative CCs, is highly variable. This is because the effective number of CCs depends strongly on the harmonic resolution: if F0 is low, the analysis band (0-5 kHz in this paper) contains many harmonics from which high-dimensional CCs can be computed, whereas if F0 is high, the number of harmonics, and hence the number of informative CCs, is reduced.
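The asymmetric setup of Section 5.2.5, independent per-dimension GPs whose input dimensionality Dx exceeds the output dimensionality Dy, can be sketched as follows; scikit-learn and the synthetic linear data are illustrative stand-ins for the paper's implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

rng = np.random.RandomState(1)
N, Dx, Dy = 60, 25, 20            # asymmetric: Dx > Dy

X = rng.randn(N, Dx)                                   # higher-order inputs
Y = X @ rng.randn(Dx, Dy) + 0.05 * rng.randn(N, Dy)    # 20th-order target CCs

# One independent scalar GP per output dimension.  The dominant O(N^3)
# training cost depends only on N, so enlarging Dx sharpens the
# covariance (kernel) computation at no extra asymptotic cost, while
# Dy only fixes how many scalar GPs are trained.
models = [GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(),
                                   random_state=0).fit(X, Y[:, d])
          for d in range(Dy)]

Y_hat = np.column_stack([m.predict(X) for m in models])
print(Y_hat.shape)  # → (60, 20)
```

With Dx = 25, the first input dimension could carry log F0 while the remaining 24 carry CCs, mirroring the GP-CA configuration described in Section 5.3.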

5.3. Subjective results

In this subsection, the GP-based method was evaluated subjectively against the baseline GMM-based method. As usual, ten utterances were used as training data, while a separate set of 10 sentences selected randomly from the pool of the evaluation set was used as test data. The experiments were conducted for each of the 4 conversion directions (i.e., M-M, M-F, F-M, F-F) and each of the 3 conversion methods (i.e., GMM, GP, GP-CA) by evaluating the quality of the converted speech and the similarity between the converted and natural target speech. Both listening tests used the mean opinion score (MOS) method on a 5-point scale (quality evaluation: 5 = very natural, 1 = completely unnatural; similarity evaluation: 5 = very similar, 1 = definitely different). Eight subjects were involved in these experiments, each of whom was asked to listen to each test sample and then assign a MOS score for quality and similarity. All of the experiments were conducted in a normal but silent room with headphones. The model configurations were as follows. For the GP-based methods, the kernel function was linear, without clustered training. The dimension of the inputs of GP-CA was fixed to 25 (the first dimension representing F0 and the remainder CCs) and that of the outputs to 20. For the GMM-based method (Toda et al., 2007), the number of mixtures was empirically set to 4 according to the results obtained from Fig. 3.

Table 6
Correlation coefficients of the converted F0s obtained by the traditional and proposed methods.

Direction   Traditional (Eq. (9))   GP-C
M-M         0.55                    0.59
M-F         0.45                    0.49
F-M         0.53                    0.61
F-F         0.57                    0.63

Table 4
Spectral distortion [dB] as a function of the number of training sentences.

Direction  Kernel                          10     30     50     70     100
M-M        LIN                             5.59   5.42   5.37   5.36   5.34
M-M        Matérn (1.5) + SE + NN + LIN    5.60   5.45   5.36   5.34   5.24
M-F        LIN                             5.09   4.89   4.85   4.83   4.83
M-F        Matérn (1.5) + SE + NN + LIN    5.05   4.89   4.86   4.79   4.74
F-M        LIN                             5.31   5.19   5.17   5.15   5.13
F-M        Matérn (1.5) + SE + NN + LIN    5.33   5.20   5.14   5.07   5.03
F-F        LIN                             4.98   4.85   4.83   4.83   4.81
F-F        Matérn (1.5) + SE + NN + LIN    5.00   4.87   4.80   4.78   4.70
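A composite kernel in the spirit of Table 4 can be assembled by summing elementary kernels. The sketch below uses scikit-learn (an assumption; the paper names no toolkit), which provides Matérn, squared exponential (RBF) and linear (DotProduct) kernels but no neural-network kernel, so that term is omitted:

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern, RBF, DotProduct

# Matérn(1.5) + squared exponential + linear, in the spirit of the
# composite kernel of Table 4 (neural-network term omitted: scikit-learn
# has no such kernel).
composite = Matern(nu=1.5) + RBF() + DotProduct()

X = np.random.RandomState(0).randn(4, 3)
K = composite(X)                 # 4 x 4 Gram matrix of the summed kernel
print(K.shape)  # → (4, 4)
```

When such a kernel is passed to a GP regressor, the hyper-parameters of every summand are optimized jointly against the marginal likelihood.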


Fig. 7. Spectral distortion as a function of the order of inputs for GP-A. Note that the order of outputs was fixed to 20. (a) M-M, (b) M-F, (c) F-M, (d) F-F.


Fig. 8. Subjective tests using a 5-point scale with 95% confidence intervals.
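The 95% confidence intervals shown in Fig. 8 can be obtained from per-listener MOS scores with a t-based interval; the score vector below is illustrative, not the paper's data:

```python
import numpy as np
from scipy import stats

def mos_ci95(scores):
    """Mean opinion score and half-width of a 95% t-based confidence interval."""
    s = np.asarray(scores, dtype=float)
    half = stats.t.ppf(0.975, len(s) - 1) * s.std(ddof=1) / np.sqrt(len(s))
    return s.mean(), half

# Hypothetical quality scores from eight listeners for one system/direction.
mean, half = mos_ci95([4, 3, 4, 5, 3, 4, 4, 3])
print(round(mean, 2))  # → 3.75
```

The interval is then plotted as mean ± half-width for each system and conversion direction.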

The results are illustrated in Fig. 8, where improvements in both similarity and quality can be observed. The improvements may be attributed to the following reasons: (a) the GP-based method is less sensitive to over-fitting than the GMM-based method, consequently resulting in more robust model parameters.

(b) The proposed coherent and asymmetric training algorithms make positive contributions to the performance of the system. (c) The flexibility of GMM may be restricted by the limited number of components needed to mitigate over-fitting, given such a small amount of training data. This means that better performance could be acquired by increasing



the number of training data, thus allowing more components to be added to the GMM. Moreover, note that the baseline GMM-based method differs slightly from Toda et al. (2007) in that no dynamic features were included, which implies that an increase in performance may be expected by absorbing dynamic features into the baseline system.

6. Conclusion

GP is well known as a nonparametric model that is good at solving non-linear regression problems without worrying about over-fitting. This is why it was introduced to VC earlier and is explored in depth in this paper. Implementations of the GP-based method are described in detail, along with the clustered training strategy, which alleviates the computational complexity of GP. Moreover, several improvements over the standard GP have been proposed: (a) a coherent training and transformation algorithm that absorbs both F0s and CCs into the unified framework of GP, making use of the underlying inter-relations between prosodic and vocal tract parameters. (b) An asymmetric training algorithm that increases the accuracy of computing the covariance functions without additional computational cost. Experiments have confirmed that the results obtained by GP are promising, especially when a limited amount of training data is available, making GP a promising alternative to GMM in the context of VC. Further improvement of the GP-based method may be obtained by resorting to more powerful and flexible kernel functions, e.g., those considering the human auditory perception mechanism.

References

Abe, M., Nakamura, S., Shikano, K., Kuwabara, H., 1988. Voice conversion through vector quantization. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., New York, USA, pp. 655–658.
Arslan, L.M., 1999. Speaker transformation algorithm using segmental codebooks (STASC). Speech Commun. 28, 211–226.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Chatzis, S.P., Demiris, Y., 2012. Nonparametric mixtures of Gaussian processes with power-law behavior. IEEE Trans. Neural Networks Learn. Syst. 23, 1862–1871.
Chen, Y., Chu, M., Chang, E., Liu, J., Liu, R., 2003. Voice conversion with smoothed GMM and MAP adaptation. In: Proc. Interspeech, Geneva, Switzerland, pp. 2413–2416.
Desai, S., Black, A.W., Yegnanarayana, B., Prahallad, K., 2010. Spectral mapping using artificial neural networks for voice conversion. IEEE Trans. Audio Speech Lang. Process. 18, 954–964.

Erro, D., 2008. Intra-lingual and cross-lingual voice conversion using harmonic plus stochastic models (Ph.D. thesis). Universitat Politècnica de Catalunya.
Erro, D., Moreno, A., Bonafonte, A., 2007. Flexible harmonic/stochastic speech synthesis. In: Proc. ISCA Workshop Speech Synth., Bonn, Germany, pp. 194–199.
Erro, D., Moreno, A., Bonafonte, A., 2010. Voice conversion based on weighted frequency warping. IEEE Trans. Audio Speech Lang. Process. 18, 922–931.
Helander, E., Nurminen, J., Gabbouj, M., 2008. LSF mapping for voice conversion with very small training sets. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 4669–4672.
Helander, E., Silén, H., Virtanen, T., Gabbouj, M., 2012. Voice conversion using dynamic kernel partial least squares regression. IEEE Trans. Audio Speech Lang. Process. 20, 806–817.
Kain, A., 2001. High resolution voice transformation (Ph.D. thesis). Oregon Health and Science University, Portland, USA.
Lee, K.S., 2007. Statistical approach for voice personality transformation. IEEE Trans. Audio Speech Lang. Process. 15, 641–651.
Makhoul, J., 1975. Linear prediction: a tutorial review. Proc. IEEE 63, 561–580.
Marume, M., Nankaku, Y., Sako, S., Tokuda, K., Kitamura, T., 2007. Voice conversion based on variational Bayes method. Technical Report of IEICE, vol. 107, pp. 103–108.
Narendranath, M., Murthy, H.A., Rajendran, S., Yegnanarayana, B., 1995. Transformation of formants for voice conversion using artificial neural networks. Speech Commun. 16, 207–216.
Pilkington, N.C.V., Zen, H., Gales, M.J.F., 2011. Gaussian process experts for voice conversion. In: Proc. Interspeech, Florence, Italy, pp. 2761–2764.
Rabiner, L.R., Schafer, R.W., 2009. Theory and Applications of Digital Speech Processing. Prentice Hall.
Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian Processes for Machine Learning. MIT Press, Cambridge.
Snelson, E.L., 2007. Flexible and efficient Gaussian process models for machine learning (Ph.D. thesis). University of London.
Stylianou, Y., 1996. Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification (Ph.D. thesis). École Nationale Supérieure des Télécommunications.
Stylianou, Y., 2009. Voice transformation: a survey. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Taipei, Taiwan, pp. 3585–3588.
Stylianou, Y., Cappé, O., Moulines, E., 1998. Continuous probabilistic transform for voice conversion. IEEE Trans. Audio Speech Lang. Process. 6, 131–142.
Toda, T., Black, A.W., Tokuda, K., 2007. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15, 2222–2235.
Toda, T., Black, A.W., Tokuda, K., 2008. Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Commun. 50, 215–227.
Xu, N., Yang, Z., 2010. A voice conversion algorithm in the context of sparse training data. J. Nanjing Univ. Posts Telecommun. 30, 1–7.
Ye, H., Young, S., 2006. Quality enhanced voice morphing using maximum likelihood transformations. IEEE Trans. Audio Speech Lang. Process. 14, 1301–1312.