IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 8, NOVEMBER 2007
On Using Multiple Models for Automatic Speech Segmentation

Seung Seop Park, Member, IEEE, and Nam Soo Kim, Member, IEEE
Abstract—In this paper, we propose a novel approach to automatic speech segmentation for unit-selection based text-to-speech systems. Instead of using a single automatic segmentation machine (ASM), we make use of multiple independent ASMs to produce a final boundary time-mark. Specifically, given multiple boundary time-marks provided by separate ASMs, we first compensate for the potential ASM-specific, context-dependent systematic error (or bias) of each time-mark and then compute the weighted sum of the bias-removed time-marks, yielding the final time-mark. The bias and weight parameters required for the proposed method are obtained beforehand for each phonetic context (e.g., /p/-/a/) through a training procedure in which manual segmentations are utilized as the references. For the training procedure, we first define a cost function in order to quantify the discrepancy between the automatic and manual segmentations (or the error) and then minimize the sum of costs with respect to the bias and weight parameters. When a squared error is used for the cost, the bias parameters are easily obtained by averaging the errors of each phonetic context; then, with the bias parameters fixed, the weight parameters are simultaneously optimized through a gradient projection method, which is adopted to handle the set of constraints imposed on the weight parameter space. A decision tree which clusters all the phonetic contexts is utilized to deal with unseen phonetic contexts. Our experimental results indicate that the proposed method improves the percentage of boundaries that deviate less than 20 ms from the reference boundary from 95.06% with an HMM-based procedure and 96.85% with a previous multiple-model based procedure to 97.07%.

Index Terms—Automatic speech segmentation, speech synthesis, unit selection.
Manuscript received September 28, 2006; revised June 11, 2007. This paper was supported in part by the Brain Korea 21 Project and the Korea Science and Engineering Foundation (KOSEF) under Grant R0A-2007-000-10022-0 funded by the Korean government (MOST). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Simon King.

The authors are with the School of Electrical Engineering and INMC, Seoul National University, Seoul 151-742, Korea (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2007.903933

I. INTRODUCTION

Nowadays, concatenative speech synthesis based on the unit-selection technique has become the predominant approach to text-to-speech (TTS) systems [1]–[4]. In this technique, a sequence of pre-recorded speech segments, or units, which realize the given phonetic and prosodic descriptions as faithfully as possible is selected from a large speech corpus and then concatenated sequentially. One of the advantages of unit selection is that it can produce natural-sounding speech by utilizing the original speech waveform in a corpus as much
as possible. However, if appropriate units with the desired phonetic and prosodic characteristics cannot be found in the corpus, or if the selected units cannot be spliced smoothly, unpleasant speech is likely to be generated. Signal processing can be applied to mitigate the degradation somewhat, but a large modification of the units may also degrade the output speech in an unpredictable way. Therefore, it is fair to say that the quality of unit selection is to a large extent dependent on the quality of the speech corpus.

Speech corpora that are to be used for unit selection should be designed and established carefully by considering the speech quality, contextual coverage, prosodic variation, and system requirements [5]–[8]. For such corpora, we need: 1) a large amount of speech waveform; 2) accurate phonetic labels; and 3) accurate segmentations that delimit the speech waveform according to the phonetic labels. There should be a sufficient amount of waveform segments for each unit type in order to cover the various prosodic variations of the target language or target domain. The phonetic labels can be generated manually or automatically. Several approaches to automatic phonetic transcription have been proposed [9]–[13]. For example, in [11], given a speech utterance and the associated orthographic transcript, the best pronunciation is selected from a multiple-pronunciation lexicon by applying a speech recognizer.

Given the speech waveform and phonetic labels, the boundaries of each speech segment should be marked. Traditionally, manual segmentation has been employed in the development of a variety of TTS applications since it is considered the most reliable and precise way to obtain the segmentation information. However, it is usually time-consuming and labor-intensive. Therefore, an automatic method for segmentation is more feasible and practical, especially when the required size of the speech database is huge.

In the literature, a variety of approaches to automatic speech segmentation have been developed [14]. Most of the approaches are based on the hidden Markov model (HMM), which is widely used in the area of automatic speech recognition. In the HMM-based framework, each phone unit is modeled by a context-dependent or context-independent HMM. The model parameters are trained on a collection of speech data with the corresponding transcripts, and the trained HMMs are then used to align a speech signal along the associated transcripts by means of Viterbi decoding. For HMM-based segmentation, we should specify a number of factors such as the set of features to be used, the window length and frame rate for feature extraction, and the various model configurations which involve the context dependency, the number of states of each model, and the number of mixtures for each state. In [15] and [16], the segmentation accuracy with various combinations of these model configurations is
reported.

It is often claimed that HMMs perform better for some transitions than for others, and that they make similar error patterns depending on the transition [14], [17], [18]. For example, since context-dependent HMMs (CDHMMs) are usually trained with phonemes of the same context, some CDHMMs can be constructed by mistakenly incorporating parts from other phonemes [14]. As a result, a CDHMM trained in this way would make errors in the same direction for the boundaries of the corresponding context.

Though HMM-based automatic segmentation can be implemented very efficiently using the Baum–Welch and Viterbi algorithms, the results are often not accurate enough to be directly applied to TTS. In order to circumvent this limitation, various post-processing techniques have been developed [14], [17]–[32]. The goal of these methods is to improve the accuracy of segmentation by refining the initial segmentations. To achieve this goal, some of these methods try to emulate the quality of the target segmentation through supervised learning procedures. For example, in [14], [17], and [18], the average discrepancies between the automatic and manual segmentations, or the biases, are statistically computed by partitioning the entire space of phonetic transitions and then subtracted from the automatic segmentations. Other post-processing approaches attempt to refine the segment boundaries with the use of various acoustic features such as the F0 contour [30] and the spectral variation function (SVF) [31].

Recently, there have appeared techniques that utilize multiple boundary time-marks obtained from a variety of segmentation methods to produce a single final time-mark [33], [34]. The multiple time-marks are simply averaged in [33], while, in [34], a single (presumably the best) boundary time-mark among those provided by the multiple segmentation methods is selected depending on the phonetic context, e.g., /p/ on the left and /a/ on the right, of the boundary.

In this paper, we propose a general framework which extends the previous multiple-model based segmentation methods. The final time-mark in the proposed method is determined by first removing the bias of each model's time-mark and then calculating the weighted sum of all the bias-removed time-marks. The bias and weight parameters required for the proposed method are specified for each phonetic context and each separate ASM. The parameters are pre-trained through a learning procedure in which a limited amount of manually-segmented data is provided as a reference. For the training procedure, we first define a cost function which quantifies the discrepancy between the automatic and manual segmentations, or the error, and then find the bias and weight parameters which minimize the total cost. The bias parameters are obtained by simply averaging the errors corresponding to each phonetic context and ASM pair. For training the weight parameters, we apply a constrained optimization technique, called the gradient projection (GP) method [35], because of the constraints imposed on the weight parameters. To cope with the phonetic contexts unseen during the training session, we utilize a decision tree with which all phonetic contexts are clustered effectively. For performance evaluation, we compute the mean absolute error (MAE), the root mean square error (RMSE), and the percentage of boundaries deviating less than 20 ms from the manually-determined time-marks, as in [25].
This paper is organized as follows. The next section gives an overview of the proposed segmentation procedure and develops a method to train the bias and weight parameters. The material and methodology used to evaluate the proposed method are presented in Section III, and the results of the performance evaluation are provided in Section IV. Finally, conclusions are drawn in Section V.

II. SEGMENTATION PROCEDURE

A. Automatic Segmentation by Weighted Sum of Multiple Bias-Corrected Results (ASWSBC)

Let an automatic segmentation machine (ASM) be a general system that performs a segmentation task automatically, i.e., an ASM produces a sequence of boundary time-marks given an utterance and the corresponding phonetic transcript. We also define the boundary type (btype) of a boundary time-mark (bmark) as the pair of the two phonetic identities adjacent to that time-mark. A phone boundary produced by an ASM is dominantly affected by the two adjacent (left and right) phonemes [34], and an ASM can therefore be expected to perform differently depending on the btype. An ASM applies a specific algorithm to align an utterance along its phonetic labels. For example, it may adopt an HMM-based approach with or without post-processing techniques for boundary refinement. In this paper, however, our interest lies not in the specific algorithm of an ASM but in a general method for determining the boundary time-marks when multiple ASMs are available.

Suppose that there are $K$ ASMs which use a variety of algorithms of their own. Let us call these ASMs the base ASMs. Given a speech signal and the corresponding phonetic transcript $T$, the $k$th base ASM produces a set of bmarks $\{(p_n, t_{n,k})\}_{n=1}^{N}$, where $p_n$ is the $n$th phonetic identity, $t_{n,k}$ is the $n$th bmark given by the $k$th base ASM, and $N$ denotes the number of boundaries determined according to the transcript $T$. The btype of the $n$th bmark is defined as $b_n = (p_n, p_{n+1})$, which is independent of the ASM index $k$ since all the base ASMs share the same phonetic transcript $T$. Given the $K$ base ASMs, our goal is to produce a single, improved set of bmarks $\{\hat{t}_n\}_{n=1}^{N}$ based on all the bmark sets.

Some approaches to this goal have been investigated in previous studies. In [33], $\hat{t}_n$ is obtained by averaging the base results, i.e.,

$$\hat{t}_n = \frac{1}{K} \sum_{k=1}^{K} t_{n,k}. \qquad (1)$$

On the other hand, in [34], the final boundary decision is made by selecting the best base ASM depending on the btype as follows:

$$\hat{t}_n = \mathrm{sel}\bigl(t_{n,1}, \ldots, t_{n,K};\, b_n\bigr) \qquad (2)$$

where $\mathrm{sel}(\cdot)$ is a mapping that chooses one of its arguments according to $b_n$.

In this section, we propose a new method, called automatic segmentation by weighted sum of multiple bias-corrected results (ASWSBC), which is an extension of our previous work presented in [34]. An overview of the proposed ASWSBC method is shown in Fig. 1. In the proposed method, we first
remove the bias of each bmark provided by the base ASMs, and then compute the weighted sum of the bias-removed bmarks to produce the final bmark. The parameters necessary for the bias removal and weighting are separately specified for each btype and ASM pair. Therefore, the $n$th final bmark $\hat{t}_n$ is obtained as follows:

$$\hat{t}_n = \sum_{k=1}^{K} w_k(b_n)\,\tilde{t}_{n,k}, \qquad \tilde{t}_{n,k} = t_{n,k} - \beta_k(b_n) \qquad (3)$$

where $\beta_k(b)$ and $w_k(b)$ are, respectively, the bias and weight parameters for the $k$th base ASM when the btype is $b$, and $\tilde{t}_{n,k}$ indicates the bias-removed bmark. The bias and weight parameters are employed, respectively, to remove the bias and to weigh the importance of the segmentation of every ASM depending on the btype. Since multiplying a time by a negative value does not make sense and the final bmark $\hat{t}_n$ is desired to be confined in the region $[\min_k \tilde{t}_{n,k},\, \max_k \tilde{t}_{n,k}]$, we assume that the weights for any btype $b$ are constrained by

$$w_k(b) \ge 0 \quad \text{for } k = 1, \ldots, K, \qquad \sum_{k=1}^{K} w_k(b) = 1. \qquad (4)$$

Clearly, both the averaging and selection methods are nothing but special cases of the ASWSBC approach. If the averaging and selection methods are described in the ASWSBC framework shown in (3), it is not difficult to see that the corresponding weights are given as follows:

$$w_k(b) = \frac{1}{K} \quad \text{for } k = 1, \ldots, K \qquad (5)$$

for the averaging method, and

$$w_k(b) = \begin{cases} 1, & \text{if } k = k^{*}(b) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

for the selection method, where $k^{*}(b)$ is the index of the best ASM for btype $b$.

Fig. 1. Overview of the proposed ASWSBC method.

B. Training of Bias and Weight Parameters

A set of manually-segmented data is used as a target to train the weight and bias parameters. Let us assume that we have the manual segmentation data $\{r_n\}_{n=1}^{N_m}$, where $N_m$ is the total number of manual bmarks and $r_n$ is the $n$th manual bmark. Our goal is to train the bias and weight parameters such that they minimize the distance between the manual bmarks and the output bmarks obtained from (3). For notational brevity, we define the bias vector $\boldsymbol{\beta}(b)$, the weight vector $\mathbf{w}(b)$, and the error vector $\mathbf{e}_n$ of the $n$th bmark as follows:

$$\boldsymbol{\beta}(b) = [\beta_1(b), \ldots, \beta_K(b)]^{T}, \quad \mathbf{w}(b) = [w_1(b), \ldots, w_K(b)]^{T}, \quad \mathbf{e}_n = [e_{n,1}, \ldots, e_{n,K}]^{T}$$

in which $e_{n,k} = t_{n,k} - r_n$ is the discrepancy between the $n$th bmark of the $k$th base ASM and the corresponding $n$th manual bmark, and the superscript $T$ denotes matrix transpose. Let us denote the bias and weight databases (DBs) by $\mathcal{B} = \{\boldsymbol{\beta}(b) : b \in \mathcal{C}\}$ and $\mathcal{W} = \{\mathbf{w}(b) : b \in \mathcal{C}\}$, respectively, in which $\mathcal{C}$ is the collection of btypes observed in the manual segmentation, and $\boldsymbol{\beta}(b)$ and $\mathbf{w}(b)$ are, respectively, the bias and weight vectors for btype $b$. The overall distance between the manual and final segmentations for some $\mathcal{B}$ and $\mathcal{W}$ is given by

$$D(\mathcal{B}, \mathcal{W}) = \sum_{n=1}^{N_m} C(\hat{t}_n, r_n) \qquad (7)$$
$$= \sum_{b \in \mathcal{C}} D_b\bigl(\boldsymbol{\beta}(b), \mathbf{w}(b)\bigr) \qquad (8)$$

where, from (3) and (4),

$$\hat{t}_n - r_n = \mathbf{w}^{T}(b_n)\bigl(\mathbf{e}_n - \boldsymbol{\beta}(b_n)\bigr), \qquad (9)$$

$C(\cdot,\cdot)$ is a cost function which quantifies the difference between two bmarks, and, in (8), $D_b(\boldsymbol{\beta}(b), \mathbf{w}(b))$ is the subcost for btype $b$ given by

$$D_b\bigl(\boldsymbol{\beta}(b), \mathbf{w}(b)\bigr) = \sum_{n \in \mathcal{I}(b)} C(\hat{t}_n, r_n) \qquad (10)$$

in which $\mathcal{I}(b)$ is the set of indices for which the btype is $b$ in the training data. The optimal weight and bias parameters are estimated according to the following criterion:

$$(\mathcal{B}^{*}, \mathcal{W}^{*}) = \arg\min_{\mathcal{B},\, \mathcal{W}} D(\mathcal{B}, \mathcal{W}). \qquad (11)$$

From (8), we can see that $D(\mathcal{B}, \mathcal{W})$ is minimized when $D_b(\boldsymbol{\beta}(b), \mathbf{w}(b))$ is minimized for each btype $b$. Therefore, the optimal estimation procedure can be carried out separately for each btype as follows:

$$\bigl(\boldsymbol{\beta}^{*}(b), \mathbf{w}^{*}(b)\bigr) = \arg\min_{\boldsymbol{\beta}(b),\, \mathbf{w}(b)} D_b\bigl(\boldsymbol{\beta}(b), \mathbf{w}(b)\bigr). \qquad (12)$$
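To make the combination rule concrete, the following Python sketch applies (3) to the bmarks of $K$ base ASMs for a single boundary and evaluates it with the averaging weights of (5), the selection weights of (6), and a general trained weight vector. The data structures, names, and numbers are our own illustration only; in practice the per-btype parameters would come from the trained bias and weight DBs described later in this section.

```python
import numpy as np

def combine_bmarks(t, beta, w):
    """ASWSBC combination of one boundary, cf. (3).

    t    : shape (K,), bmarks t_{n,k} from the K base ASMs (ms)
    beta : shape (K,), per-ASM biases beta_k(b_n) for this btype (ms)
    w    : shape (K,), non-negative weights summing to one, cf. (4)
    """
    t, beta, w = map(np.asarray, (t, beta, w))
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9
    return float(w @ (t - beta))        # weighted sum of bias-removed bmarks

K = 3
t_n    = np.array([412.0, 405.0, 420.0])   # bmarks from the base ASMs (ms)
beta_b = np.array([  6.0,  -2.0,   9.0])   # illustrative biases for this btype (ms)

w_avg = np.full(K, 1.0 / K)                # (5): averaging method
w_sel = np.eye(K)[1]                       # (6): selection method, hypothetical k* = 1
w_gp  = np.array([0.2, 0.5, 0.3])          # example of general trained weights

for w in (w_avg, w_sel, w_gp):
    print(combine_bmarks(t_n, beta_b, w))
```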
1) Selection of the Cost Function: The complexity of solving the above equation depends on the specific form of the cost function $C(\cdot,\cdot)$. In general, it is difficult to find $\boldsymbol{\beta}^{*}(b)$ and $\mathbf{w}^{*}(b)$ since the bias and weight parameters may interact with each other. For example, when the absolute error is used for the cost function, solving (12) for each bias parameter leads to the condition that the bias should equal the median of a set of values which depends on both the weight vector and the other bias parameters. The optimal bias vector that satisfies this condition is not easy to find because each element of the bias vector depends not only on the weight vector but also on the other elements of the bias vector. An iterative procedure could be applied to solve (12), but it may require a large amount of computation. Moreover, such a dependency between the bias parameters of different ASMs does not make sense. On the other hand, if the squared error given by

$$C(x, y) = (x - y)^{2} \qquad (13)$$

is adopted for the cost function, each bias parameter can be obtained independently of both the weight parameters and the other bias parameters, as will be shown in Section II-B3. For this reason, and for mathematical tractability, we use the squared error for the cost function in the remainder of this paper.

2) Error Transformation: When the squared-error function is used for estimating the relevant parameters, the training procedure can be misled by a small number of gross errors caused by various reasons such as errors in the phonetic transcripts [36]. To circumvent this problem, we transform the individual time errors $e_{n,k}$ by an invertible transform function $g(\cdot)$ such that each transformed error is confined within a finite region, $[-1, 1]$, i.e.,

$$\tilde{e}_{n,k} = g(e_{n,k}) \in [-1, 1]. \qquad (14)$$

As a feasible choice for the error transformation, we apply the sigmoid function given by

$$g(e) = \frac{2}{1 + \exp(-\alpha e)} - 1 \qquad (15)$$

where $\alpha$ is a slope parameter. According to the error transformation, we make a modification to (9) such that

$$\hat{t}_n - r_n \;\longrightarrow\; \mathbf{w}^{T}(b_n)\bigl(\tilde{\mathbf{e}}_n - \tilde{\boldsymbol{\beta}}(b_n)\bigr) \qquad (16)$$

where $\tilde{\mathbf{e}}_n$ and $\tilde{\boldsymbol{\beta}}(b_n)$ are the transformed error and bias parameter vectors, respectively. Using (13) and (16), the subcost for btype $b$ in (8) is also modified to

$$D_b\bigl(\tilde{\boldsymbol{\beta}}(b), \mathbf{w}(b)\bigr) = \mathbf{w}^{T}(b)\,\Phi(b)\,\mathbf{w}(b) \qquad (17)$$

in which

$$\Phi(b) = \sum_{n \in \mathcal{I}(b)} \bigl(\tilde{\mathbf{e}}_n - \tilde{\boldsymbol{\beta}}(b)\bigr)\bigl(\tilde{\mathbf{e}}_n - \tilde{\boldsymbol{\beta}}(b)\bigr)^{T} \qquad (18)$$

is an error covariance matrix for btype $b$. The optimal weight and transformed bias parameters for btype $b$ are then estimated according to

$$\bigl(\tilde{\boldsymbol{\beta}}^{*}(b), \mathbf{w}^{*}(b)\bigr) = \arg\min_{\tilde{\boldsymbol{\beta}}(b),\, \mathbf{w}(b)} D_b\bigl(\tilde{\boldsymbol{\beta}}(b), \mathbf{w}(b)\bigr). \qquad (19)$$

Given the optimal transform-domain bias vector $\tilde{\boldsymbol{\beta}}^{*}(b)$, the optimal original-domain bias parameters can be obtained by applying the inverse transform as follows:

$$\beta_k^{*}(b) = g^{-1}\bigl(\tilde{\beta}_k^{*}(b)\bigr) \quad \text{for } k = 1, \ldots, K. \qquad (20)$$

3) Parameter Estimation: To find the optimal bias vector for btype $b$, we solve the following equation:

$$\frac{\partial D_b\bigl(\tilde{\boldsymbol{\beta}}(b), \mathbf{w}(b)\bigr)}{\partial \tilde{\boldsymbol{\beta}}(b)} = \mathbf{0} \qquad (21)$$

which leads to the optimal bias vector $\tilde{\boldsymbol{\beta}}^{*}(b)$ given by

$$\tilde{\boldsymbol{\beta}}^{*}(b) = \frac{1}{|\mathcal{I}(b)|} \sum_{n \in \mathcal{I}(b)} \tilde{\mathbf{e}}_n \qquad (22)$$

where $|\mathcal{I}(b)|$ is the cardinality of $\mathcal{I}(b)$. From (22), we can see that the optimal bias parameters are obtained by averaging the corresponding errors. Note that the optimal bias parameters can be obtained independently of the weight parameters. Once $\tilde{\boldsymbol{\beta}}^{*}(b)$ is found, the weight vector $\mathbf{w}^{*}(b)$ which minimizes

$$D_b\bigl(\tilde{\boldsymbol{\beta}}^{*}(b), \mathbf{w}(b)\bigr) = \mathbf{w}^{T}(b)\,\Phi^{*}(b)\,\mathbf{w}(b) \qquad (23)$$

is searched over the confined region given in (4), where $\Phi^{*}(b)$ denotes the error covariance matrix of (18) evaluated at $\tilde{\boldsymbol{\beta}}^{*}(b)$. We apply the GP method, which is suitable for solving constrained optimization problems [35]. At each iteration, we find a feasible direction by projecting the negative gradient of the objective function onto the tangent subspace specified by an active set of constraints and, moving along this direction, we find the minimal point, which becomes the next starting point. The application of the gradient projection method to minimize (23) is illustrated in the Appendix.
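To illustrate the estimation steps of Sections II-B2 and II-B3, the Python sketch below computes, for a single btype, the transform-domain bias of (22) by averaging sigmoid-transformed errors and then maps it back to the time domain with the inverse transform, as in (20). NumPy, the function names, the toy data, and the slope value are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid_transform(e, alpha):
    """Map a time error e (in ms) into (-1, 1), following the form given in (15)."""
    return 2.0 / (1.0 + np.exp(-alpha * e)) - 1.0

def inverse_sigmoid_transform(u, alpha):
    """Inverse of the transform above: recover a time-domain value, cf. (20)."""
    return np.log((1.0 + u) / (1.0 - u)) / alpha

def estimate_bias(errors, alpha=0.05):
    """Estimate the per-ASM bias for one btype.

    errors: array of shape (n_boundaries, K); errors[n, k] = t_{n,k} - r_n (ms).
    Returns (beta_tilde, beta): transform-domain and time-domain bias vectors.
    """
    e_tilde = sigmoid_transform(np.asarray(errors, dtype=float), alpha)
    beta_tilde = e_tilde.mean(axis=0)                    # (22): average of transformed errors
    beta = inverse_sigmoid_transform(beta_tilde, alpha)  # back to milliseconds
    return beta_tilde, beta

# Toy example: 4 boundaries of one btype, K = 3 base ASMs.
errs = np.array([[12.0, -3.0, 5.0],
                 [ 9.0, -1.0, 4.0],
                 [15.0, -4.0, 6.0],
                 [80.0, -2.0, 5.0]])   # one gross error is de-emphasized by the transform
print(estimate_bias(errs))
```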
4) Clustering of Boundary Types: It is important for a robust estimation of the bias and weight parameters that there be a sufficient amount of manually-segmented data for each btype. Since, however, the amount of manually-segmented data is usually not large, the training of the bias and weight parameters does not guarantee a reliable estimate for all the btypes. Furthermore, some btypes may not even be observed in the manually-segmented data, whereas the proposed method also requires bias and weight parameters for those unseen btypes. To overcome this difficulty, we employ a decision tree [37] which clusters all the btypes into a finite number of groups.

The decision tree is built as follows. First, all boundaries of the manually-segmented data are pooled together at the root node of the tree. Then, this pool is subsequently split into two child nodes according to phonetically-motivated questions, such as the place of articulation, the voicing of the phone, and the preceding and following phonetic context of the boundary. Node splitting stops when either child node would contain fewer than $M$ manual time-marks. The split at each node is made such that a distortion measure is minimized. Assume that there is a set of manual bmarks, $S$, at some node, and that a question $q \in Q$ splits $S$ into $S_y(q)$ and $S_n(q)$, where $Q$ is the question set and $S_y(q)$ and $S_n(q)$ are the subsets of $S$ whose answers to the question are "yes" and "no", respectively. Let us define the cost for a data set $S$ with respect to $\tilde{\boldsymbol{\beta}}$ and $\mathbf{w}$ as

$$D(S; \tilde{\boldsymbol{\beta}}, \mathbf{w}) = \sum_{n \in S} \bigl[\mathbf{w}^{T}\bigl(\tilde{\mathbf{e}}_n - \tilde{\boldsymbol{\beta}}\bigr)\bigr]^{2}. \qquad (24)$$

The best question $q^{*}$ which splits the node $S$ is selected according to

$$q^{*} = \arg\min_{q \in Q}\bigl[\Delta\bigl(S_y(q)\bigr) + \Delta\bigl(S_n(q)\bigr)\bigr] \qquad (25)$$

where

$$\Delta(S) = \min_{\tilde{\boldsymbol{\beta}},\, \mathbf{w}} D(S; \tilde{\boldsymbol{\beta}}, \mathbf{w}). \qquad (26)$$

Given the decision tree, all boundary types (including unseen ones) can be uniquely mapped to a leaf node of the tree, and the btypes reaching the same leaf node share the same weight and bias parameters.

III. MATERIAL AND METHODOLOGY

A. Material

The speech database consisted of 5000 utterances (286,082 phonemes) extracted from the Korean TTS research database provided by the Electronics and Telecommunications Research Institute (ETRI). The sentences were uttered by a professional female narrator in a soundproof studio environment and recorded at 16-bit precision with a 44.1-kHz sampling frequency. The recorded signal was then down-sampled to 16 kHz. In the database, phonetic transcripts for all utterances and manual segmentations for the first 2000 utterances were available. The phonetic transcripts were initially obtained by an automatic method and then corrected by human experts. The number of phonetic symbols we used was 43.

The manual segmentation was performed by four labelers. In order to aid the manual segmentation process, initial segment boundaries were obtained automatically using monophone HMMs and then provided to the labelers. Although the labelers were experts with at least two years of experience in manual segmentation, they had difficulties in marking boundaries of some specific phonetic transitions such as vowel-to-vowel and vowel-to-liquid, and produced somewhat different results from each other for these transitions. To reduce the inconsistencies between labelers [14], [15], the boundaries for those problematic transitions were determined by consensus among the labelers.
Fig. 2. Illustration of HMM-based automatic segmentation.
From the manual segmentation database, we randomly selected 1600 utterances for training the bias and weight parameters of the ASWSBC method, and the remaining 400 utterances were used for performance evaluation.

B. Methodology

1) Base ASMs: In order to apply the proposed ASWSBC method, we implemented 33 HMM-based base ASMs. Fig. 2 illustrates the HMM-based segmentation procedure. Firstly, as a data preparation step for HMM training, feature vectors were extracted from the speech signals, and the orthographic transcripts were transformed to phonetic transcripts. Then, HMMs were trained using the feature vectors and the phonetic transcripts, yielding a set of HMMs. Finally, the HMMs corresponding to the phonetic transcripts were concatenated and then aligned with the feature vectors through the Viterbi algorithm, resulting in the segment boundaries.

There are numerous factors that affect the performance of HMM-based segmentation. For the feature extraction, we should make some decisions on the kind of features to use,
the window length, and the frame shift. With regard to HMM training, we should specify various HMM configurations such as the context dependency (i.e., context-independent or context-dependent models), the number of states for each phone model, and the number of mixture components per state. For training context-independent HMMs, the "embedded training" procedure may or may not be applied following the "isolated-unit" training (see below). Instead of the Viterbi algorithm, the Forward-Backward algorithm could be applied for alignment [38]. Although each combination of these factors would potentially yield a different ASM, in our experiments we varied only the HMM configurations (refer to Table I for details) to build the base ASMs. In the following, our HMM training procedure is described in detail.

A feature vector was extracted every 5 ms (the "frame size") using a 24-ms-long Hamming window. The feature vector was composed of 12 mel-frequency cepstral coefficients (MFCCs) and normalized log energy, as well as their delta and delta-delta components (39 dimensions in total). The delta and delta-delta components were computed by linear regression using the four previous and four following frames (a "delta width" of 4). In our preliminary experiments, HMMs trained with feature vectors using a delta width of 2 performed slightly worse than those using a delta width of 4, and using a 12- or 18-ms window did not yield better overall results than applying the 24-ms window. A frame size smaller than the standard 10 ms of the speech recognition area has often been adopted for automatic segmentation [14]–[17]. We used a small frame size in order to obtain more precise segmentation results because the time resolution of the segment boundaries is restricted by the frame size. A drawback of using a small frame size is that it takes more time to train the HMMs due to the increased amount of frame data.

Given the feature vectors and phonetic transcripts, both context-independent and context-dependent HMMs were trained using the HTK software [39]. The basic structure of the phone HMMs was a left-to-right type without any state skipping (except for the silence HMM, whose topology was fixed as 3-state with transitions allowed between the first and last states), and the observation distribution of each state was characterized by mixtures of diagonal-covariance Gaussians. (Throughout this paper, an "n-state" HMM denotes one with n "emitting" states and two "dummy" states in HTK terms.)

The context-independent monophone HMMs (CIHMMs) were first initialized based on a small amount of manual segmentation data and then updated using the 1600 manually-segmented utterances through the isolated-unit training procedure. After the isolated-unit training, we could apply the embedded training procedure if desired. In the isolated-unit training, the state boundaries are softly decided through the Baum–Welch algorithm with the model boundaries fixed to the manually-determined ones [39]. On the other hand, in the embedded training, a sequence of models associated with the given phonetic transcript is concatenated and all the model parameters are simultaneously updated through the Baum–Welch algorithm [39]. Hence, not only the state but also the model boundaries are soft in the embedded training. An advantage of the embedded training is that all the available data can be used to estimate the HMM parameters since it does not require information about model boundaries, and therefore more reliable estimates of the parameters can be obtained. However, Kawai et al. [15] reported a performance degradation when the embedded training was applied in addition to the isolated-unit training using 2259 manually-segmented utterances.

To see whether applying the embedded training would be advantageous in our case, we performed some experiments in which 3-state CIHMMs obtained through the isolated-unit training only were compared with those obtained through the isolated-unit and embedded training procedures, while varying the number of mixtures per state from one up to eight. In addition, the number of manually-segmented utterances for the isolated-unit training was changed from 50 to 1600 in order to investigate the effect of the amount of manually transcribed data. For the embedded training, all 5000 utterances were utilized and the HMMs from the isolated-unit training were used as the "seed" models. The feature vectors for these experiments were the same as described above except for a 10-ms frame size and a delta width of 2.

Fig. 3. Effect of embedded training on the HMMs obtained by isolated-unit training in terms of the root mean square errors (in ms) as the number of mixtures per state and the size of manually-transcribed utterances for the isolated-unit training vary.

Fig. 3 shows the results, where the performance was evaluated in terms of RMSE. As we can see in the figure, the embedded training degraded the HMMs obtained from the isolated-unit training using 1600 manually-segmented utterances. It seems that more accurate parameters can be obtained by fixing the model boundaries to the "well-determined" manual boundaries than by utilizing the whole data without boundary information, when the amount of manually-segmented data is large enough. On the other hand, Fig. 3 indicates that when the size of the available manual data is as small as 50 or 100 utterances, better HMMs can be obtained through the embedded training procedure using the whole data, especially when the number of HMM parameters to be estimated (which is proportional to the number of mixtures) is large. An interesting question about the isolated-unit versus embedded training procedure is how the performance is influenced by bias-compensation. That is, after
bias-compensation, the CIHMMs obtained from the isolated-unit plus embedded training may perform better than the CIHMMs obtained from the isolated-unit training only, even though the data size for the isolated-unit training is large. In order to examine this possibility, we applied a bias-compensation procedure to the eight CIHMM sets (one per number of mixtures) obtained with 1600 manual utterances through the isolated-unit training and to the corresponding eight sets obtained with the additional embedded training (16 CIHMM systems in total). The biases were computed as described in Section II-B. On average, the RMSE improved from 12.17 to 10.51 ms for the isolated-unit-only CIHMMs, and from 12.95 to 10.59 ms for the isolated-unit plus embedded CIHMMs. The benefit from the bias-compensation was thus larger for the isolated-unit plus embedded CIHMMs than for the isolated-unit-only ones. Even after the bias-compensation, however, every isolated-unit-only CIHMM set performed slightly better than the corresponding isolated-unit plus embedded set in our setup. Based on these results, we chose to use the CIHMMs from the isolated-unit training with 1600 utterances and not to apply the embedded training procedure.

For training the context-dependent triphone HMMs (CDHMMs), the use of the whole data seems more adequate considering the large number of triphone model parameters. Hence, the embedded training procedure was applied to train the CDHMMs. For each number of states, the 1-mixture triphone models were initialized with the corresponding 1-mixture monophone models and then updated through the embedded training procedure. A tree-based state tying was applied in order to estimate the model parameters robustly, and the parameters of the tied-state HMMs were then re-estimated. The multiple-mixture triphone models were obtained by successively splitting the mixture component with the largest weight, starting from the 1-mixture triphone models.

2) Combination of Base ASMs: Given the base ASMs, the bias and weight parameters of the proposed ASWSBC method were trained using the 1600 manually-segmented utterances. The trained bias and weight parameters were used, respectively, to remove the bias and to weigh the significance of the segmentation of every ASM depending on the btype. There were 1151 btypes observed in the training database, while 1218 btypes existed in the entire database. The error transformation was performed in order to deemphasize the gross errors using the sigmoid function of (15) with a fixed slope parameter $\alpha$, and the squared error of (13) was used as the cost function for quantifying the transformed error. To cope with the unseen btypes and to estimate the parameters in a robust way, a decision tree was grown to cluster all the btypes in the training data as described in Section II-B4. The question set for building the tree was constructed based on the linguistic knowledge of the authors and was similar to that used in [40]. The parameter $M$ for stopping the node splitting was set to 10, 20, or 50, yielding 765, 625, or 429 leaf nodes, respectively. Given the decision tree, each btype observed in the entire database could be mapped to a unique leaf node of the tree, and the btypes reaching the same leaf node would share the same bias and weight parameters estimated from the manually-segmented data belonging to that leaf node. In this way, the bias (or weight) DB, which outputs a bias (or weight) vector for every btype, was constructed.

A variety of bias and weight DBs were built for performance comparison. We constructed two bias DBs, $\mathcal{B}^{*}$ and $\mathcal{B}_0$, and five separate weight DBs, $\mathcal{W}_{\mathrm{GP}}^{*}$, $\mathcal{W}_{\mathrm{GP}}^{0}$, $\mathcal{W}_{\mathrm{sel}}^{*}$, $\mathcal{W}_{\mathrm{sel}}^{0}$, and $\mathcal{W}_{\mathrm{avg}}$. Here, $\mathcal{B}^{*}$ is the optimal bias DB obtained by the error averaging procedure described in Section II-B3, while $\mathcal{B}_0$ is a null-bias DB which always outputs a zero vector for any btype. The null-bias DB was applied to see the performance improvement without the bias correction in the ASWSBC method. On the other hand, the various weight DBs were made in order to compare the weight parameters obtained with the gradient projection method to those based on the previous averaging and selection methods. $\mathcal{W}_{\mathrm{GP}}^{*}$ and $\mathcal{W}_{\mathrm{GP}}^{0}$ denote the optimal weight DBs obtained through the GP method using the biases given by $\mathcal{B}^{*}$ and $\mathcal{B}_0$, respectively. Similarly, $\mathcal{W}_{\mathrm{sel}}^{*}$ and $\mathcal{W}_{\mathrm{sel}}^{0}$ consist of the selection weights [see (6)] that were trained after removing the biases provided by $\mathcal{B}^{*}$ and $\mathcal{B}_0$, respectively. The selection weight should select the best ASM depending on the btype [34]. $\mathcal{W}_{\mathrm{avg}}$ always gives the averaging weights [see (5)] for any btype [33]. Once the bias and weight DBs had been specified, the final segmentation results were obtained based on (3).

IV. RESULTS

A. Base ASMs

As mentioned in Section III-B1, 33 different HMM-based base ASMs were created by varying HMM configurations such as the context dependency, the number of states per phone, and the number of mixture components per state. Table I shows the specific HMM configuration and performance of each base ASM. The total number of states (tied states for CDHMMs) in each HMM system is also presented in the table.

We can see from Table I that, as the number of states increased, the performance of the CIHMMs gradually improved up to seven states and then degraded. In our other experiments using a frame shift of 10 ms instead of 5 ms, the best results were obtained at three or four states, and the performance began to degrade severely when more states were used. These results indicate that the number of states should be chosen depending on the frame size, because the number of states together with the frame shift specifies the minimum duration of phonemes with which the HMMs can be properly aligned.

Table I also shows how the performance varies with the number of mixtures. In the case of the 3-state CIHMMs, the performance increased with the number of mixtures up to 16 mixtures and then began to degrade. For the other numbers of states, the performance of the CIHMMs also improved as more mixtures were used (up to eight mixtures, at least). These results imply that, given a fixed number of states and a sufficient amount of training data, a feasible way to better model the large variety of acoustic events inside a state of a monophone model is to increase the number of mixture components for the state. On the other hand, the performance of the CDHMMs degraded when more mixtures were used, as shown in Table I. Possible explanations for this phenomenon are: 1) as the number of mixtures increases, the number of model parameters increases, and the training data may be too small to reliably estimate the increased number of triphone-model parameters; and 2) each time the number of mixtures is increased, the embedded training procedure has to be applied one more time, and the temporal precision becomes blunted since the embedded training is based on the maximum likelihood criterion [15].
TABLE I PERFORMANCES OF THE BASE ASMs
TABLE II PERFORMANCES OF THE BASE ASMs AFTER BIAS CORRECTION WITH $\mathcal{B}^{*}$ (TRAINED WITH 1600 UTTERANCES AND $M = 20$)
B. Combination of Base ASMs

With the various bias and weight DBs obtained as in Section II-B, we applied (3) to get the combined segmentation results. Table II shows the performances of the base ASMs after bias correction with $\mathcal{B}^{*}$. From Tables I and II, the performance gain from the bias correction was about 1 and 2 ms, in terms of MAE or RMSE, for the best CIHMM-based (7-state, 8-mixture) and the best CDHMM-based (7-state, 1-mixture) base ASMs, respectively. In general, CDHMMs seem to benefit from bias compensation more than CIHMMs do. Possible reasons for this are: 1) in the embedded training procedure that was employed for building the CDHMMs, no pre-determined model boundaries are provided; and 2) a CDHMM is usually trained in the same or similar phonetic context. As a result of 1) and 2), some CDHMMs may be constructed by mistakenly using parts from adjacent phonemes. When those CDHMMs are used for the alignment, the errors of a particular phonetic context are likely to be distributed toward one direction, inducing a bias.

TABLE III PERFORMANCES OF THE ASWSBC APPROACH FOR VARIOUS WEIGHT AND BIAS DBs (TRAINED WITH 1600 UTTERANCES AND $M = 20$)

The performances with various combinations of the bias and weight DBs are shown in Table III. For the purpose of comparison, the performance of the best single base ASM with/without bias correction is also presented. As we can see from Table III, the use of multiple ASMs outperformed the single best ASM. This phenomenon to some extent confirms the effectiveness of applying multiple ASMs to automatic segmentation. Among the results shown in Table III, the simultaneous use of $\mathcal{B}^{*}$ and $\mathcal{W}_{\mathrm{GP}}^{*}$ achieved the best performance in terms of all figures of merit, which means that the proposed method is superior to the previous averaging and selection methods. Comparing the results of the weight DBs trained after bias removal with those of the weight DBs trained without it, it is also seen that removing the bias before weight training is advantageous.

Table IV shows how the averaging, selection, and optimal weight DBs are affected by the amount of training utterances and by the parameter $M$, which specifies the minimum number of reference bmarks in each leaf node of the btype clustering tree.
TABLE IV PERFORMANCES OF VARIOUS WEIGHT PARAMETERS AS THE TRAINING DATA SIZE VARIES (WITH 95% CONFIDENCE INTERVALS FOR MAE)
TABLE V RESULTS OF THE PAIRED-t TESTS FOR ABSOLUTE ERRORS: AVERAGING METHOD VERSUS PROPOSED METHOD
The experimental results indicate that the performance is somewhat insensitive to $M$, although it tends to degrade as $M$ increases. In general, the performance of each method improved as more training data were used. It is seen from Table IV that the optimal weights trained by the GP algorithm outperformed the averaging and selection weights in all training conditions. The 95% confidence intervals for the MAE are also shown in Table IV. The 95% confidence region of the proposed method did not overlap with that of the selection or averaging method for any training condition. We also performed a paired-$t$ test in order to see whether the absolute errors of the proposed method ($\mathcal{B}^{*}$ combined with $\mathcal{W}_{\mathrm{GP}}^{*}$) were significantly smaller than those of the averaging method. Table V shows the results of the paired-$t$ test. From the table, we can see that the proposed method is significantly better than the averaging method in terms of the absolute error. When the selection method was compared with the proposed method, the $t$-values were much larger than those shown in Table V. These results are considered to prove the superiority of the proposed approach over the averaging and selection methods.
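As a rough illustration of the figures of merit and the significance test used above, the sketch below computes the MAE, RMSE, the percentage of boundaries within 20 ms, and a paired t-test between the absolute errors of two methods, assuming per-boundary automatic and reference time-marks are available as arrays. It uses SciPy's ttest_rel; the variable names and the synthetic data are our own, not the paper's.

```python
import numpy as np
from scipy.stats import ttest_rel

def boundary_metrics(auto_ms, ref_ms, tol_ms=20.0):
    """MAE, RMSE, and percentage of boundaries deviating less than tol_ms."""
    err = np.asarray(auto_ms, dtype=float) - np.asarray(ref_ms, dtype=float)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    pct = 100.0 * np.mean(np.abs(err) < tol_ms)
    return mae, rmse, pct

# Synthetic example: reference bmarks and two hypothetical automatic methods.
rng = np.random.default_rng(0)
ref = np.cumsum(rng.uniform(50, 200, size=500))        # reference bmarks (ms)
method_a = ref + rng.normal(0.0, 8.0, size=ref.size)   # e.g., an averaging-style method
method_b = ref + rng.normal(0.0, 6.0, size=ref.size)   # e.g., a proposed-style method

print(boundary_metrics(method_a, ref))
print(boundary_metrics(method_b, ref))

# Paired t-test on the absolute errors of the two methods.
t_stat, p_value = ttest_rel(np.abs(method_a - ref), np.abs(method_b - ref))
print(t_stat, p_value)
```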
V. CONCLUSION

In this paper, we have proposed a general framework for automatic phonetic segmentation based on multiple ASMs. In the proposed method, given multiple boundary time-marks provided by various independent segmentation methods, biases are first subtracted from the time-marks, and a single final time-mark is then obtained by the weighted sum of the bias-corrected time-marks. The bias and weight parameters are estimated for each boundary type through a training procedure, in which the optimal bias parameters are calculated by averaging the errors corresponding to each boundary type and ASM pair, and the optimal weights are obtained by means of the gradient projection algorithm. To handle the boundary types unseen in the training data and to estimate the bias and weight parameters in a more robust way, all the boundary types are clustered using a decision tree. Our experimental results (Table III) show that the mean absolute error, root mean square error, and percentage of boundaries deviating less than 20 ms are improved from 6.86 ms, 10.21 ms, and 96.06% for the best single HMM-based segmentation to 4.91 ms, 7.84 ms, and 97.05% by the proposed method, respectively. We also performed the paired-$t$ test in order to compare the proposed method with two previous multiple-model based methods, i.e., the averaging and selection methods, in which the final time-mark is obtained by averaging all the ASMs' time-marks and by choosing the best ASM's time-mark depending on the btype, respectively. Based on the results of the paired-$t$ test, we confirmed that the proposed method performed significantly better than the previous multiple-model based approaches.

APPENDIX

In this appendix, the gradient projection method [35] used to minimize $D_b(\tilde{\boldsymbol{\beta}}^{*}(b), \mathbf{w}(b))$ in (23) is described.

Gradient projection method for finding $\mathbf{w}^{*}(b)$:

0) Initialize the weight vector $\mathbf{w}$ such that it is feasible, and calculate the error covariance matrix $\Phi^{*}(b)$.

1) Derive the index set of active non-negativity constraints defined by $\mathcal{A} = \{k : w_k = 0\}$, and set the working constraints to $\mathcal{A}$ together with the equality constraint $\sum_{k} w_k = 1$.

2) Find the feasible direction vector as follows:
$$\mathbf{d} = -\mathbf{P}\,\mathbf{g}$$
where $\mathbf{P}$ is a projection matrix whose $(i, j)$th component is given by
$$P_{ij} = \begin{cases} 0, & \text{if } i \in \mathcal{A} \text{ or } j \in \mathcal{A} \\ 1 - \dfrac{1}{K - |\mathcal{A}|}, & \text{else if } i = j \\ -\dfrac{1}{K - |\mathcal{A}|}, & \text{otherwise} \end{cases}$$
and $\mathbf{g} = 2\,\Phi^{*}(b)\,\mathbf{w}$ is the gradient vector at the point $\mathbf{w}$.

3) If $\mathbf{d} \neq \mathbf{0}$:
   a) calculate the largest step size $\lambda_{\max}$ such that $\mathbf{w} + \lambda\,\mathbf{d}$ is feasible for $0 \le \lambda \le \lambda_{\max}$;
   b) find $\lambda^{*}$ such that $D_b$ is minimized at $\mathbf{w} + \lambda^{*}\mathbf{d}$ over $0 \le \lambda \le \lambda_{\max}$;
   c) set $\mathbf{w} \leftarrow \mathbf{w} + \lambda^{*}\mathbf{d}$ and return to 1).

4) If $\mathbf{d} = \mathbf{0}$, find the Lagrange multipliers $\mu_k = g_k - \bar{g}$ for $k \in \mathcal{A}$, where $\bar{g}$ denotes the mean of the gradient components over the indices not in $\mathcal{A}$.
   a) If $\mu_k \ge 0$ for all $k \in \mathcal{A}$, stop the iteration with $\mathbf{w}^{*}(b) = \mathbf{w}$.
   b) Otherwise, delete from $\mathcal{A}$ the index $k'$ where $k' = \arg\min_{k \in \mathcal{A}} \mu_k$, and return to 2).
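The following Python sketch implements the steps above for a small problem. NumPy, the tolerance and iteration limits, and the toy matrix are our own choices, and the code is an illustration of the procedure under the assumption of a positive-definite $\Phi^{*}(b)$, not the authors' implementation.

```python
import numpy as np

def gp_weights(phi, tol=1e-9, max_iter=5000):
    """Minimize w @ phi @ w subject to sum(w) = 1 and w >= 0, following the
    active-set gradient projection steps sketched in the Appendix.
    Assumes phi is positive definite (add a small ridge otherwise)."""
    K = phi.shape[0]
    w = np.full(K, 1.0 / K)                 # step 0: feasible starting point
    active = set()                          # working set of constraints w_k = 0
    for _ in range(max_iter):
        free = [k for k in range(K) if k not in active]
        g = 2.0 * phi @ w                   # gradient of the quadratic objective
        # step 2: project -g onto {d : d_k = 0 for k in active, sum(d) = 0}
        d = np.zeros(K)
        d[free] = -(g[free] - g[free].mean())
        if np.linalg.norm(d) > tol:
            # step 3: exact line search for the quadratic, capped by feasibility
            lam_max = min((w[k] / -d[k] for k in free if d[k] < 0.0),
                          default=np.inf)
            lam = min(float(-(g @ d) / (2.0 * d @ phi @ d)), lam_max)
            w = np.clip(w + lam * d, 0.0, None)
            w /= w.sum()                    # guard against round-off
            active |= {k for k in free if w[k] <= tol}   # step 1 for the next pass
        else:
            # step 4: Lagrange multipliers of the active constraints
            g_bar = g[free].mean()
            mu = {k: g[k] - g_bar for k in active}
            if all(v >= -tol for v in mu.values()):
                return w                    # all multipliers non-negative: optimal
            active.remove(min(mu, key=mu.get))  # drop the most negative, retry
    return w

# Toy error covariance matrix for K = 3 base ASMs (illustrative values only).
Phi = np.array([[4.0, 1.0, 0.5],
                [1.0, 2.0, 0.3],
                [0.5, 0.3, 1.0]])
print(gp_weights(Phi))   # weights favor the ASM with the smallest error variance
```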
ACKNOWLEDGMENT
The authors would like to thank the three anonymous reviewers for their valuable comments, which considerably improved the quality of this paper.
REFERENCES

[1] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. ICASSP, Atlanta, GA, 1996, pp. 373–376.
[2] B. Möbius, "The Bell Labs German text-to-speech system," Comput. Speech Lang., vol. 13, pp. 319–358, 1999.
[3] A. K. Syrdal, C. W. Wightman, A. Conkie, Y. Stylianou, M. Beutnagel, J. Schroeter, V. Strom, K.-S. Lee, and M. J. Makashay, "Corpus-based techniques in the AT&T NextGen synthesis system," in Proc. ICSLP, Beijing, China, Oct. 2000, vol. 3, pp. 410–415.
[4] G. Coorman, J. Fackrell, P. Rutten, and B. Van Coile, "Segment selection in the L&H Realspeak laboratory TTS system," in Proc. ICSLP, Beijing, China, Oct. 2000, vol. 2, pp. 395–398.
[5] J. P. H. van Santen and A. L. Buchsbaum, "Methods for optimal text selection," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 553–556.
[6] J. P. H. van Santen, "Combinatorial issues in text-to-speech synthesis," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2511–2514.
[7] W. N. Campbell and A. Black, "Prosody and the selection of source units for concatenative synthesis," in Progress in Speech Synthesis. New York: Springer-Verlag, 1997, pp. 279–292.
[8] A. Black and K. Lenzo, "Optimal utterance selection for unit selection speech synthesis databases," Int. J. Speech Technol., vol. 6, pp. 357–363, 2003.
[9] A. Ljolje and M. D. Riley, "Automatic segmentation and labeling of speech," in Proc. ICASSP, Toronto, ON, Canada, 1992, pp. 473–476.
[10] F. Brugnara, D. Falavigna, and M. Omologo, "A HMM-based system for automatic segmentation and labelling of speech," in Proc. ICSLP, Banff, AB, Canada, 1992, pp. 803–806.
[11] R. E. Donovan, "Trainable Speech Synthesis," Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 1996.
[12] J. R. Bellegarda, "Unsupervised, language-independent grapheme-to-phoneme conversion by latent analogy," Speech Commun., vol. 46, no. 2, pp. 140–152, 2005.
[13] C. Van Bael, L. Boves, H. van den Heuvel, and H. Strik, "Automatic phonetic transcription of large speech corpora: A comparative study," in Proc. ICSLP, Pittsburgh, PA, 2006, pp. 1085–1088.
[14] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, "Automatic phonetic segmentation," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 617–625, Nov. 2003.
[15] H. Kawai and T. Toda, "An evaluation of automatic phone segmentation for concatenative speech synthesis," in Proc. ICASSP, Montreal, QC, Canada, 2004, vol. I, pp. 677–680.
[16] S. Nefti and O. Boëffard, "Acoustical and topological experiments for an HMM-based speech segmentation system," in Proc. Eurospeech, Aalborg, Denmark, 2001, pp. 1711–1714.
[17] J. Matoušek, D. Tihelka, and J. Psutka, "Automatic segmentation for Czech concatenative speech synthesis using statistical approach with boundary-specific correction," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 301–304.
[18] J. Adell, A. Bonafonte, J. A. Gómez, and M. Castro, "Comparative study of automatic phone segmentation methods for TTS," in Proc. ICASSP, Philadelphia, PA, 2005, vol. I, pp. 309–312.
[19] J. P. H. van Santen and R. Sproat, "High-accuracy automatic segmentation," in Proc. Eurospeech, Budapest, Hungary, 1999, pp. 2809–2812.
[20] A. Bonafonte, A. Nogueiras, and A. R. Garrido, "Explicit segmentation of speech using Gaussian models," in Proc. ICSLP, Philadelphia, PA, 1996, pp. 1269–1272.
[21] A. Sethy and S. Narayanan, "Refined speech segmentation for concatenative speech synthesis," in Proc. ICSLP, Denver, CO, 2002, pp. 145–148.
[22] E. Y. Park, S. H. Kim, and J. H. Chung, "Automatic speech synthesis unit generation with MLP based postprocessor against auto-segmented phoneme errors," in Proc. ICASSP, Phoenix, AZ, 1999, pp. 2985–2990.
[23] D. T. Toledano, "Neural network boundary refining for automatic speech segmentation," in Proc. ICASSP, Istanbul, Turkey, 2000, pp. 3438–3441.
[24] D. T. Toledano and L. A. H. Gómez, "Local refinement of phonetic boundaries: A general framework and its application using different transition models," in Proc. Eurospeech, Aalborg, Denmark, 2001, pp. 1695–1698.
[25] K. S. Lee, "MLP-based phone boundary refining for a TTS database," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 981–989, May 2006.
[26] V. Pollet and G. Coorman, "Statistical corpus-based speech segmentation," in Proc. ICSLP, Jeju, Korea, 2004.
[27] L. Wang, Y. Zhao, M. Chu, J. Zhou, and Z. Cao, "Refining segmental boundaries for TTS database using fine contextual dependent boundary models," in Proc. ICASSP, Montreal, QC, Canada, 2004, vol. I, pp. 641–644.
[28] L. Wang, Y. Zhao, M. Chu, F. K. Soong, J. Zhou, and Z. Cao, "Context-dependent boundary model for refining boundaries segmentation of TTS units," IEICE Trans. Inf. Syst., vol. E89-D, pp. 981–989, 2006.
[29] Y. J. Kim and A. Conkie, "Automatic segmentation combining an HMM-based approach and spectral boundary correction," in Proc. ICSLP, Denver, CO, 2002, pp. 145–148.
[30] T. Saito, "On the use of F0 features in automatic segmentation for speech synthesis," in Proc. ICSLP, Sydney, Australia, 1998, vol. VII, pp. 2839–2842.
[31] G. Flammia, P. Dalsgaard, O. Andersen, and B. Lindberg, "Segment based variable frame rate speech analysis and recognition using a spectral variation function," in Proc. ICSLP, Banff, AB, Canada, 1992, pp. 983–986.
[32] C. D. Mitchell, M. P. Harper, and L. H. Jamieson, "Using explicit segmentation to improve HMM phone recognition," in Proc. ICASSP, Detroit, MI, 1995, vol. I, pp. 229–232.
[33] J. Kominek and A. W. Black, "A family-of-models approach to HMM-based segmentation for unit selection speech synthesis," in Proc. ICSLP, Jeju, Korea, 2004.
[34] S. S. Park and N. S. Kim, "Automatic segmentation based on boundary-type candidate selection," IEEE Signal Process. Lett., vol. 13, no. 10, pp. 640–643, Oct. 2006.
[35] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Reading, MA: Addison-Wesley, 1984, pp. 330–334.
[36] J. Kominek, C. Bennett, and A. W. Black, "Evaluating and correcting phoneme segmentation for unit selection synthesis," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 313–316.
[37] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. New York: Chapman & Hall, 1984.
[38] T. Laureys, K. Demuynck, J. Duchateau, and P. Wambacq, "An improved algorithm for the automatic segmentation of speech corpora," in Proc. LREC, Las Palmas, Spain, 2002, vol. V, pp. 1564–1567.
[39] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.2). Cambridge, U.K.: Cambridge Univ., 2002.
[40] J. Odell, "The Use of Context in Large Vocabulary Speech Recognition," Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 1995.
Seung Seop Park (M’07) was born in Naju, Korea, in 1975. He received the B.S. and Ph.D. degrees in electrical engineering from Seoul National University (SNU), Seoul, Korea, in 2000 and 2007, respectively. From March 2000 to April 2003, he was with Netdus Corp., Seoul, where he developed a Korean TTS system. Currently, he is in a postdoctoral position at SNU. His research area includes speech synthesis, speech recognition, speech coding, and machine learning.
Nam Soo Kim (M’98) received the B.S. degree in electronics engineering from Seoul National University (SNU), Seoul, Korea, in 1988 and the M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology in 1990 and 1994, respectively. From 1994 to 1998, he was with Samsung Advanced Institute of Technology as a Senior Member of Technical Staff. Since 1998, he has been with the School of Electrical Engineering, SNU, where he is currently an Associate Professor. His research area includes speech signal processing, speech recognition, speech/audio coding, speech synthesis, adaptive signal processing, machine learning, and mobile communication.