A Bayesian Approach to Hidden Semi-Markov Model Based Speech Synthesis

Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda (Nagoya Institute of Technology)

1. Introduction

Background: Bayesian approach to hidden Markov model (HMM) based speech synthesis [Hashimoto; '09]
・ Reliable predictive distributions can be estimated by treating model parameters as random variables
・ Appropriate model structures can be selected by maximizing the marginal likelihood
・ Outperforms HMM-based speech synthesis based on the ML criterion

Problem: Inconsistency between training and synthesis
・ Although speech is synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them.

Proposal: A Bayesian approach to hidden semi-Markov model (HSMM) based speech synthesis, in which explicit state duration distributions are used consistently in both training and synthesis.

2. HMM-based speech synthesis

[Figure: Overview of HMM-based speech synthesis. Training part: speech database → speech signal → spectral parameter extraction and excitation parameter extraction → training of context-dependent HMMs & duration models. Synthesis part: text → text analysis → labels → parameter generation from HMM (spectral and excitation parameters) → excitation generation and synthesis filter → synthesized speech.]

3. Bayesian approach to HSMM based speech synthesis

HMM vs. HSMM
・ a_ij : state transition probability from the i-th to the j-th state
・ b_i(o_t) : output probability of observation o_t from the i-th state
・ p_i(d) : duration probability of staying d frames in the i-th state (explicit in the HSMM only; in an HMM the state duration is modeled implicitly by the self-transition probability)

Likelihood functions
・ HMM:  P(o | Λ) = Σ_z Π_{t=1..T} a_{z_{t-1} z_t} b_{z_t}(o_t)
・ HSMM: P(o | Λ) = Σ_{z,d} Π_{k=1..K} a_{z_{k-1} z_k} p_{z_k}(d_k) Π_{t=T_{k-1}+1..T_k} b_{z_k}(o_t), where the k-th segment stays in state z_k for d_k frames and T_k = d_1 + … + d_k (T_K = T)
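To make the difference concrete, the following small sketch (not from the poster; all numbers are hypothetical) contrasts the implicit, geometric state-duration distribution of an HMM with an explicitly specified HSMM duration distribution:

import numpy as np

# HMM: staying d frames in state i has probability a_ii**(d-1) * (1 - a_ii),
# a geometric distribution whose mode is always d = 1.
a_ii = 0.9                                   # hypothetical self-transition probability
d = np.arange(1, 31)
hmm_duration = a_ii ** (d - 1) * (1 - a_ii)

# HSMM: the duration pmf p_i(d) is a free model parameter; here a discretized
# Gaussian (a common choice in HSMM-based synthesis) centered on 10 frames.
mean, var = 10.0, 9.0
hsmm_duration = np.exp(-0.5 * (d - mean) ** 2 / var)
hsmm_duration /= hsmm_duration.sum()         # normalize over the plotted range

print("HMM  duration mode:", int(d[np.argmax(hmm_duration)]), "frame(s)")   # always 1
print("HSMM duration mode:", int(d[np.argmax(hsmm_duration)]), "frame(s)")  # 10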

ML approach
・ Training:  Λ_ML = argmax_Λ P(O | Λ, L)
・ Synthesis: ô = argmax_o P(o | Λ_ML, l)

Bayesian approach
・ Training and synthesis are both derived from the predictive distribution, in which the model parameters are marginalized out:
  P(o | O, l, L) ∝ Σ_z Σ_Z ∫ P(o, z | Λ, l) P(O, Z | Λ, L) P(Λ) dΛ
  o : synthesis data seq.   l : label seq. for synthesis   z : state seq. for synthesis data
  O : training data seq.    L : label seq. for training    Z : state seq. for training data
  Λ : model parameters
  P(o, z | Λ, l) : likelihood of synthesis data
  P(O, Z | Λ, L) : likelihood of training data
  P(Λ) : prior distribution of model parameters
・ Synthesis: ô = argmax_o P(o | O, l, L)
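As a toy illustration (not from the poster) of why marginalizing the model parameters gives more reliable predictive distributions, consider a one-dimensional Gaussian with known variance and a conjugate Gaussian prior on its mean; all values below are made up:

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                    # known observation variance
x = rng.normal(2.0, np.sqrt(sigma2), size=3)    # very small "training set"

# ML: plug in the point estimate of the mean; predictive variance stays sigma2.
mu_ml = x.mean()
var_ml = sigma2

# Bayes: prior mu ~ N(m0, s02); posterior and predictive distributions are Gaussian.
m0, s02 = 0.0, 10.0
n = len(x)
s_n2 = 1.0 / (1.0 / s02 + n / sigma2)           # posterior variance of the mean
m_n = s_n2 * (m0 / s02 + x.sum() / sigma2)      # posterior mean
var_bayes = sigma2 + s_n2                       # predictive variance includes parameter uncertainty

print(f"ML    predictive: N({mu_ml:.2f}, {var_ml:.2f})")
print(f"Bayes predictive: N({m_n:.2f}, {var_bayes:.2f})  <- wider, reflecting parameter uncertainty")

With only a few training samples the Bayesian predictive distribution is broader and less over-confident; this is the effect referred to above as estimating reliable predictive distributions.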

6. Experiments

Conditions
  Database      : ATR Japanese speech database b-set
  Sampling rate : 16 kHz
  Analysis win. : 25 ms Hamming window / 5 ms shift
  Feature vec.  : 24 mel-cepstrum coef. + Δ + ΔΔ, F0 + Δ + ΔΔ
  Topology      : 5-state left-to-right MSD-HMM and MSD-HSMM with single Gaussian state output pdfs
  Training data : 450 utterances
  Test data     : 53 utterances

Compared models
              Model   Model training   Model selection   Number of pdfs (duration pdf)
  ML-HMM      HMM     ML               MDL               87,267 (1,375)
  ML-HSMM     HSMM    ML               MDL               88,287 (1,415)
  Bayes-HMM   HMM     Bayes            Bayes             745,969 (15,025)
  Bayes-HSMM  HSMM    Bayes            Bayes             744,955 (17,450)

Subjective evaluation (20 sentences x 10 subjects, 5-point opinion score, 95% confidence intervals)

[Figure: Mean opinion scores for ML-HMM, ML-HSMM, Bayes-HMM, and Bayes-HSMM. The scores were 3.180, 3.225, 3.355, and 3.630, with Bayes-HSMM highest at 3.630.]

・ Speech quality was improved by using HSMMs
・ Bayes-HSMM outperformed ML-HSMM

Subjective evaluation comparing model structures
・ The model structures of ML-HSMM and Bayes-HSMM were swapped in order to compare the effect of the model structure, giving four systems labeled "training criterion - model selection criterion": ML-MDL, ML-Bayes, Bayes-MDL, and Bayes-Bayes

[Figure: Mean opinion scores for ML-MDL, ML-Bayes, Bayes-MDL, and Bayes-Bayes; the scores were 2.990, 2.995, 3.040, and 3.225 (95% confidence intervals).]
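For reference, a mean opinion score and the 95% confidence interval quoted with it can be computed from the raw ratings as follows (a generic sketch with made-up ratings, not the poster's data):

import numpy as np

# Hypothetical 5-point opinion scores: 20 sentences rated by 10 subjects each.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(20, 10)).astype(float)

scores = ratings.ravel()
mean = scores.mean()
# 95% confidence interval for the mean (normal approximation, 1.96 standard errors).
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(scores.size)

print(f"MOS = {mean:.3f} +/- {half_width:.3f} (95% CI)")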

Variational Bayesian method

・ The predictive distribution is used for both model training and speech parameter generation
・ Computing the required expectation (the sum over state sequences and the integral over model parameters) exactly is difficult ⇒ variational Bayesian method
・ Approximate posterior distributions over the state sequences and the model parameters are estimated in place of the true posterior distribution
・ A lower bound of the log marginal likelihood is defined using Jensen's inequality
・ The approximate posterior distributions are estimated by maximizing this lower bound
・ The optimization can be performed effectively by iterative updates, as in the EM algorithm
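The lower bound mentioned above is the standard variational bound. A sketch of its derivation, written with a generic approximate posterior Q(Z, Λ) (the poster's exact factorization and update formulas are not reproduced here), is:

\log P(O \mid L) = \log \sum_{Z} \int P(O, Z \mid \Lambda, L)\, P(\Lambda)\, d\Lambda
                = \log \sum_{Z} \int Q(Z, \Lambda)\, \frac{P(O, Z \mid \Lambda, L)\, P(\Lambda)}{Q(Z, \Lambda)}\, d\Lambda
                \ge \sum_{Z} \int Q(Z, \Lambda)\, \log \frac{P(O, Z \mid \Lambda, L)\, P(\Lambda)}{Q(Z, \Lambda)}\, d\Lambda \equiv \mathcal{F}

The inequality is Jensen's inequality applied to the concave logarithm. Assuming the usual factorization Q(Z, Λ) ≈ Q(Z) Q(Λ), maximizing F alternately with respect to Q(Z) and Q(Λ) yields the EM-like iterative updates referred to above.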

Generalized forward-backward algorithm

・ The expectations over state sequences of the HSMM can be computed efficiently by the generalized forward-backward algorithm
・ Partial forward likelihoods and partial backward likelihoods can be computed recursively
・ As a result, the Bayesian approach requires almost the same computational cost as the ML criterion
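A minimal sketch of such recursions for a plain (non-Bayesian) HSMM likelihood, assuming a maximum duration D, unit-variance Gaussian output pdfs, and no self-transitions (all model values below are made up; in the variational Bayesian setting the same recursions are run with the model parameters replaced by statistics of their approximate posteriors):

import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state left-to-right HSMM with explicit duration pmfs (hypothetical values).
N, D, T = 3, 6, 12                      # states, maximum duration, frames
pi = np.array([1.0, 0.0, 0.0])          # always start in state 0
A = np.array([[0.0, 1.0, 0.0],          # no self-transitions: duration is explicit
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])         # the final state can only continue in itself
dur = rng.random((N, D)); dur /= dur.sum(axis=1, keepdims=True)   # p_i(d), d = 1..D
means = np.array([0.0, 3.0, 6.0])
obs = rng.normal(np.repeat(means, 4), 1.0)                         # T = 12 observations

def log_b(i, t):
    # log output probability of frame t in state i (unit-variance Gaussian)
    return -0.5 * ((obs[t] - means[i]) ** 2 + np.log(2 * np.pi))

# Partial forward likelihood: alpha[t, j] = P(o_1..o_t, a segment ends at frame t in state j)
alpha = np.zeros((T, N))
for t in range(T):
    for j in range(N):
        for d in range(1, min(D, t + 1) + 1):
            seg = np.exp(sum(log_b(j, s) for s in range(t - d + 1, t + 1)))
            if t - d + 1 == 0:
                alpha[t, j] += pi[j] * dur[j, d - 1] * seg
            else:
                alpha[t, j] += (alpha[t - d] @ A[:, j]) * dur[j, d - 1] * seg

# Partial backward likelihood: beta[t, i] = P(o_{t+1}..o_T | a segment ends at frame t in state i)
beta = np.zeros((T, N))
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    for i in range(N):
        for j in range(N):
            for d in range(1, min(D, T - 1 - t) + 1):
                seg = np.exp(sum(log_b(j, s) for s in range(t + 1, t + d + 1)))
                beta[t, i] += A[i, j] * dur[j, d - 1] * seg * beta[t + d, j]

print("P(O) via forward :", alpha[T - 1].sum())
print("P(O) via backward:", sum(pi[j] * dur[j, d - 1]
                                * np.exp(sum(log_b(j, s) for s in range(d)))
                                * beta[d - 1, j]
                                for j in range(N) for d in range(1, min(D, T) + 1)))

The two printed values should match, which is the usual sanity check that the forward and backward recursions are consistent. Occupancy expectations, and hence the statistics needed for training, are then obtained by combining alpha, beta, and the segment probabilities.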


Conclusions
・ All processes are derived from a single predictive distribution
・ Model parameters are treated as random variables
・ The Bayesian approach gives better performance than the ML criterion



・ The proposed approach overcomes the inconsistency between training and synthesis and estimates reliable predictive distributions


・ The Bayesian approach overcame the over-fitting problem

Acknowledgement: the EMIME project, http://www.emime.org