Bernoulli versus Markov

Signal Processing 89 (2009) 662–668. doi:10.1016/j.sigpro.2008.09.004

Fast communication

Bernoulli versus Markov: Investigation of state transition regime in switching-state acoustic models

Jahanshah Kabudian (a), Mohammad Mehdi Homayounpour (a), Seyed Mohammad Ahadi (b)

(a) Department of Computer Engineering, Amirkabir University of Technology (Tehran Polytechnic), Hafez Avenue, Tehran 15914, Iran
(b) Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Hafez Avenue, Tehran 15914, Iran

Article history: Received 17 November 2007; received in revised form 14 April 2008; accepted 5 September 2008; available online 21 September 2008.

Abstract

In this paper, a new acoustic model called the time-inhomogeneous hidden Bernoulli model (TI-HBM) is introduced as an alternative to the hidden Markov model (HMM) in continuous speech recognition. Contrary to HMM, the state transition process in TI-HBM is not a Markov process but an independent (generalized Bernoulli) process. This difference eliminates dynamic programming at the state level in the TI-HBM decoding process, so the computational complexity of TI-HBM for probability evaluation and state estimation is O(NL) instead of O(N^2 L) in the HMM case, where N and L are the number of states and the sequence length respectively. As a new framework for phone duration modeling, TI-HBM is able to model acoustic-unit duration (e.g. phone duration) by using a built-in parameter named survival probability. Similar to the HMM case, the three essential problems of TI-HBM have been solved, and an EM-based method is proposed for training the TI-HBM parameters. Experiments in phone recognition for the Persian (Farsi) spoken language show that TI-HBM has some advantages over HMM (e.g. greater simplicity and increased speed in the recognition phase), and also outperforms HMM in terms of phone recognition accuracy.

Keywords: Time-inhomogeneous hidden Bernoulli model; Hidden Markov model; Speech recognition; Acoustic modeling; Phone recognition; Phone duration modeling; Persian (Farsi) spoken language

1. Introduction

The hidden Markov model (HMM) is the most popular and most successful tool for analyzing and modeling stochastic sequences in speech processing [1]. The usual assumption in HMM is that the state transition process is a Markov process, so the generated state sequence is governed by a Markov regime. It has been established experimentally that the state transition probabilities play a less important role than the observation density functions (emission probabilities) in automatic speech recognition [2, Section 8.5.1]. To our knowledge, however, there has been no attempt to relax the Markov dependency in acoustic models like HMM.


In this paper, a new acoustic model named the time-inhomogeneous hidden Bernoulli model (TI-HBM) is proposed, in which the Markov regime in the state transition process is relaxed. There have been many attempts at phone duration modeling [3–5]. The TI-HBM can model acoustic-unit duration (e.g. phone duration) by using a built-in parameter named survival probability, which is derived from the joint state–time distribution parameters. Employing TI-HBM in speaker-independent phoneme recognition leads to a simpler acoustic model, increased recognition speed, and higher phoneme recognition accuracy.

The paper is organized as follows. In Section 2, the main elements of TI-HBM and some useful propositions and corollaries are presented. In Section 3, an algorithm for simulating TI-HBM is proposed. In Section 4, the three essential problems of TI-HBM (probability evaluation, state estimation, and training) are addressed, and an EM-based training method is presented for estimating the TI-HBM parameters. In Section 5, we employ the TI-HBM model for speaker-independent phoneme recognition on the standard Persian continuous speech corpus FarsDat. Finally, in Section 6, we discuss the advantages of TI-HBM over HMM.

2. Time-inhomogeneous hidden Bernoulli model

TI-HBM is a new acoustic model which is able to simultaneously model both the state transition process and acoustic-unit (e.g. phone) duration by using a new parameter called the joint state–time distribution P_{S,T}(i,t). The parameter P(i,t) is the probability of being in state i at time t. The parameters of TI-HBM are therefore:

1. the joint state–time distribution P(i,t);
2. the parameters of the state-conditioned Gaussian mixtures {w_{im}, \mu_{im}, C_{im}}, with b_i(x_t) = \sum_{m=1}^{M} w_{im} N(x_t; \mu_{im}, C_{im}).

The parameters P(i,t) play roles similar to \pi_i and a_{ij} in the standard HMM. The following constraints must be satisfied:

\sum_{i=1}^{N} \sum_{t=1}^{L_{max}} P(i,t) = 1    (1)

P(i,t) = 0 for t > L_{max}    (2)

where L_{max} is the maximum length of the observation sequence X. We derive some useful parameters from P(i,t), which are needed for employing TI-HBM in the real world:

1. Time distribution function P_T(t) or P(t): P_T(t) is the probability of being at time t, computed as

P(t) = \sum_{i=1}^{N} P(i,t)    (3)

If we have K observation sequences, with length L_k for the k-th sequence, the time distribution function is computed from the relative frequency of observation vectors with time-index t (frame number t). The empirical estimate of P_T(t) is therefore

\hat{P}(t) = \frac{\sum_{k=1}^{K} 1(t \le L_k)}{\sum_{k=1}^{K} L_k}    (4)

1(cond) = 1 if cond is TRUE, 0 if cond is FALSE    (5)

2. Survival probability P_{T_{next}|T_{curr}}(t+1|t) or P(t+1|t): given that the process is at time t, P(t+1|t) is the probability of the process surviving to time t+1. In other words, at time t the process continues to time t+1 with probability P(t+1|t); otherwise it terminates at time t with probability 1 - P(t+1|t). P_{T_{next}|T_{curr}}(t+1|t) is computed using the Bayes formulation

P_{T_{next}|T_{curr}}(t+1|t) = \frac{P_{T_{next},T_{curr}}(t+1,t)}{P_{T_{curr}}(t)}    (6)

Since the sequence length L_k is always greater than zero,

P_{T_{next}|T_{curr}}(1|0) = 1    (7)

TI-HBM models acoustic-unit duration using the survival probabilities. As we shall see in Corollary 1, the survival probability is derived from P(i,t) as

P_{T_{next}|T_{curr}}(t+1|t) = \frac{P_T(t+1)}{P_T(t)} = \frac{\sum_{j=1}^{N} P_{S,T}(j,t+1)}{\sum_{j=1}^{N} P_{S,T}(j,t)}    (8)

3. State selection probability given time P_{S|T}(i|t) or P(i|t): P_{S|T}(i|t) is the probability of selecting state i at time t, computed as

P_{S|T}(i|t) = \frac{P_{S,T}(i,t)}{P_T(t)} = \frac{P_{S,T}(i,t)}{\sum_{j=1}^{N} P_{S,T}(j,t)}    (9)

It can be seen that the state selection and transition process is a generalized Bernoulli process with probabilities P_{S|T}(i|t). Contrary to the standard Bernoulli process, which is binary (like coin tossing), the generalized Bernoulli process is multi-valued with N outcomes (like tossing a die, for which N = 6) [6]. Since the probabilities P(i|t) change with time, the process is time-inhomogeneous.
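To make Eqs. (3), (8) and (9) concrete, the following sketch (ours, not the authors'; the array names are hypothetical) derives the three working quantities from a joint state–time table:

```python
import numpy as np

def derived_parameters(P_st):
    """P_st[i, t] = P(i, t), shape (N, Lmax), assumed to sum to 1 (Eq. (1))."""
    P_t = P_st.sum(axis=0)                   # Eq. (3): P_T(t) = sum_i P(i, t)
    survival = P_t[1:] / P_t[:-1]            # Eq. (8): P(t+1|t) = P_T(t+1)/P_T(t)
    P_s_given_t = P_st / P_t[np.newaxis, :]  # Eq. (9): P(i|t) = P(i, t)/P_T(t)
    return P_t, survival, P_s_given_t

# Toy example with N = 2 states and Lmax = 4
P_st = np.array([[0.20, 0.15, 0.10, 0.05],
                 [0.20, 0.15, 0.10, 0.05]])
P_t, survival, P_s_given_t = derived_parameters(P_st)
# P_t = [0.4, 0.3, 0.2, 0.1]; survival = [0.75, 2/3, 0.5]; P(i|t) = 0.5 everywhere
```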

Now we present some useful propositions and corollaries relating to TI-HBM. The proofs are simple and straightforward.

Proposition 1. P_{T_{next},T_{curr}}(t+1,t) = P_{T_{next}}(t+1).

Proof. If the next time-index T_{next} is t+1, then the current time-index T_{curr} is surely t. In other words,

P(T_{curr} = t | T_{next} = t+1) = 1    (10)

\frac{P(T_{curr} = t, T_{next} = t+1)}{P(T_{next} = t+1)} = 1    (11)

P(T_{next} = t+1, T_{curr} = t) = P(T_{next} = t+1) = P_{T_{next}}(t+1)    (12)  □

Proposition 2. The time distribution P_T(t) is a decreasing function of time, i.e. P_T(t+1) \le P_T(t).

Proof. First, define the functions f_k(t) = 1(t \le L_k). It is obvious that

f_k(t+1) \le f_k(t) for all t    (13)

Summing over the different k's and then dividing by \sum_k L_k, we have

\sum_{k=1}^{K} f_k(t+1) \le \sum_{k=1}^{K} f_k(t)    (14)

\left( \sum_{k=1}^{K} f_k(t+1) \right) \Big/ \left( \sum_{k=1}^{K} L_k \right) \le \left( \sum_{k=1}^{K} f_k(t) \right) \Big/ \left( \sum_{k=1}^{K} L_k \right)    (15)

\Rightarrow P_T(t+1) \le P_T(t)    (16)  □

Corollary 1. P_{T_{next}|T_{curr}}(t+1|t) = P_T(t+1)/P_T(t).

Proof. Using Proposition 1, we have

P_{T_{next}|T_{curr}}(t+1|t) = \frac{P_{T_{next},T_{curr}}(t+1,t)}{P_{T_{curr}}(t)} = \frac{P_{T_{next}}(t+1)}{P_{T_{curr}}(t)} = \frac{P_T(t+1)}{P_T(t)}    (17)

The events T = t and T_{curr} = t, and likewise T = t+1 and T_{next} = t+1, are equivalent. □

Corollary 2. The probability of generating a sequence of minimum length d is P(D \ge d) = P_T(d)/P_T(1).

Proof. If D is a variable for the sequence length, then

P(D \ge d) = P(1|0) \prod_{t=2}^{d} P(t|t-1) = \frac{P_T(2)}{P_T(1)} \frac{P_T(3)}{P_T(2)} \cdots \frac{P_T(d)}{P_T(d-1)} = \frac{P_T(d)}{P_T(1)}    (18)  □

Corollary 3. The probability of generating a sequence of exact length d is P_D(d) = (P_T(d) - P_T(d+1))/P_T(1).

Proof. If the sequence length is exactly d, then the process terminates before time d+1 with probability 1 - P(d+1|d):

P_D(d) = P(D = d) = \left\{ P(1|0) \prod_{t=2}^{d} P(t|t-1) \right\} (1 - P(d+1|d)) = \frac{P_T(d)}{P_T(1)} \left( 1 - \frac{P_T(d+1)}{P_T(d)} \right) = \frac{P_T(d) - P_T(d+1)}{P_T(1)}    (19)  □

Corollary 2 provides a way of converting the duration-distribution function P_D(\cdot) to the time distribution function P_T(\cdot):

P_T(d) = P_T(1) P(D \ge d)    (20)

P_T(1) = \frac{K}{\sum_{k=1}^{K} L_k} = \left( \frac{\sum_{k=1}^{K} L_k}{K} \right)^{-1} = \frac{1}{E\{D\}} = \frac{1}{\sum_{d=1}^{L_{max}} d \cdot P_D(d)}    (21)

According to Corollary 1 and Eq. (20), we can derive the survival probabilities from the duration-distribution function as

P_{T_{next}|T_{curr}}(t+1|t) = \frac{P_T(t+1)}{P_T(t)} = \frac{P(D \ge t+1)}{P(D \ge t)}    (22)

Eq. (22) is compatible with a result obtained in [7, p. 1115], and verifies the propositions and corollaries in another way.
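As an illustration of Eqs. (20)–(22), here is a sketch (ours; the function name is hypothetical) that converts a duration pmf P_D(d), d = 1..L_max, into P_T(t) and the survival probabilities:

```python
import numpy as np

def duration_to_time_distribution(P_D):
    d = np.arange(1, len(P_D) + 1)
    P_T1 = 1.0 / np.sum(d * P_D)         # Eq. (21): P_T(1) = 1/E{D}
    ccdf = 1.0 - np.concatenate(([0.0], np.cumsum(P_D)[:-1]))  # P(D >= d)
    P_T = P_T1 * ccdf                    # Eq. (20): P_T(d) = P_T(1) P(D >= d)
    survival = ccdf[1:] / ccdf[:-1]      # Eq. (22): P(t+1|t) = P(D>=t+1)/P(D>=t)
    return P_T, survival

# Durations uniform on {1, 2, 3}: E{D} = 2, so P_T(1) = 1/2
P_T, survival = duration_to_time_distribution(np.array([1/3, 1/3, 1/3]))
# P_T = [1/2, 1/3, 1/6]; survival = [2/3, 1/2]
```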

3. Simulation of TI-HBM

Simulating TI-HBM means generating an observation sequence X = {x_1, x_2, ..., x_t, ..., x_L} from the model. For this purpose, the algorithm in Fig. 1 is followed.

Fig. 1. Algorithm for simulating TI-HBM.

The time-independence assumption for the state-conditioned Gaussian mixture probability density functions is prevalent in acoustic models like HMM, and is also made in TI-HBM:

p(x_t|i,t) \simeq p(x_t|i) = \sum_{m=1}^{M} w_{im} N(x_t; \mu_{im}, C_{im})    (23)

If T = {1, 2, ..., t, ..., L} is the time-index sequence and Q is the state sequence of the generalized Bernoulli process, then the joint probability of surviving up to time L, traversing state sequence Q, and generating observation sequence X by TI-HBM is

P(T, X, Q) = (1 - P(L+1|L)) \prod_{t=1}^{L} P(t|t-1) P(q_t|t) p(x_t|q_t, t)
\simeq (1 - P(L+1|L)) \prod_{t=1}^{L} P(t|t-1) P(q_t|t) p(x_t|q_t)
= \frac{P_T(L) - P_T(L+1)}{P_T(1)} \prod_{t=1}^{L} P(q_t|t) p(x_t|q_t)
= P_D(L) \prod_{t=1}^{L} P(q_t|t) p(x_t|q_t)    (24)

The above equation can be written in another form:

P(T, X, Q) = P(T) P(Q|T) P(X|Q,T) \simeq P(T) P(Q|T) P(X|Q)    (25)

P(T) = P_D(L) = \frac{P_T(L) - P_T(L+1)}{P_T(1)}    (26)

P(Q|T) = \prod_{t=1}^{L} P(q_t|t)    (27)

P(X|Q,T) \simeq P(X|Q) = \prod_{t=1}^{L} p(x_t|q_t)    (28)

where P(T) is the probability of generating a sequence of exact length L. It can be seen that P(T) is a function of the P_T(t) parameters only. On the other hand, the parameters P_T(t) are optimally and globally determined by Eq. (4) and are fixed (constant values). Therefore P(T) is treated as a constant in the log-likelihood function of TI-HBM. Figs. 2 and 3 show the structures of the standard HMM and TI-HBM respectively.

Fig. 2. The standard HMM structure.

Fig. 3. The TI-HBM structure.
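A hedged sketch of the Fig. 1 procedure as the text describes it: at each frame, survive with probability P(t+1|t), draw a state from P(i|t), and emit from that state's Gaussian mixture (Eq. (23)). The names are ours, and P_t is assumed to have length L_max+1 with P_t[L_max] = 0, per Eq. (2):

```python
import numpy as np

def simulate_ti_hbm(P_t, P_s_given_t, weights, means, covs, rng):
    X, t = [], 0                      # t is a 0-based frame index here
    while True:
        survive = P_t[t] / P_t[t - 1] if t > 0 else 1.0  # Eq. (8); P(1|0) = 1
        if rng.random() >= survive:   # process terminates before emitting frame t
            break
        i = rng.choice(P_s_given_t.shape[0], p=P_s_given_t[:, t])  # state ~ P(i|t)
        m = rng.choice(weights.shape[1], p=weights[i])             # mixture component
        X.append(rng.multivariate_normal(means[i, m], covs[i, m])) # Eq. (23) emission
        t += 1
    return np.array(X)

# usage: X = simulate_ti_hbm(P_t, P_s_given_t, w, mu, C, np.random.default_rng(0))
```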


4. Three essential problems in TI-HBM

For employing TI-HBM in real-world applications, three essential problems (similar to those in the HMM case) must be solved: efficient probability evaluation, optimal state sequence estimation (decoding), and parameter estimation (training).

4.1. Efficient evaluation of the probability P(T,X)

The probability of generating an observation sequence X of length L is computed as follows:

P(T,X) = (1 - P(L+1|L)) \prod_{t=1}^{L} P(t|t-1) p(x_t|t)
= \frac{P_T(L) - P_T(L+1)}{P_T(1)} \prod_{t=1}^{L} p(x_t|t)
= \frac{P_T(L) - P_T(L+1)}{P_T(1)} \prod_{t=1}^{L} \sum_{i=1}^{N} p(x_t, i|t)
= \frac{P_T(L) - P_T(L+1)}{P_T(1)} \prod_{t=1}^{L} \sum_{i=1}^{N} P(i|t) p(x_t|i,t)
\simeq \frac{P_T(L) - P_T(L+1)}{P_T(1)} \prod_{t=1}^{L} \sum_{i=1}^{N} P(i|t) p(x_t|i)
= P_D(L) \prod_{t=1}^{L} \sum_{i=1}^{N} P(i|t) p(x_t|i)    (29)

In the standard HMM, this probability is computed using dynamic programming (DP)-based methods (the forward and backward procedures). The order of computations for evaluating P(X) in HMM is O(N^2 L) [1], while in TI-HBM the order for evaluating P(T,X) is O(NL). Since the state transition process in TI-HBM is not Markov-dependent, no DP-type search is needed for computing P(T,X).
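A hedged sketch of Eq. (29) in the log domain (our names; b[i, t] is the precomputed emission likelihood p(x_t|i)), showing that the evaluation is a plain O(NL) sum with no forward recursion:

```python
import numpy as np

def log_prob_TX(b, P_s_given_t, P_t):
    """b, P_s_given_t: shape (N, L); P_t: length >= L+1 (padded with 0 beyond Lmax)."""
    L = b.shape[1]
    log_P_D_L = np.log((P_t[L - 1] - P_t[L]) / P_t[0])  # Eq. (26): P_D(L)
    frame = (P_s_given_t[:, :L] * b).sum(axis=0)        # sum_i P(i|t) p(x_t|i)
    return log_P_D_L + np.log(frame).sum()              # Eq. (29) in log form
```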

4.2. Optimal state sequence estimation

Since the term P(T) has no effect on Q*, i.e.

Q^* = \arg\max_Q P(T,X,Q) = \arg\max_Q P(X,Q|T)    (30)

P(X,Q|T) is used instead of P(T,X,Q). If Q* is the optimal state sequence for generating X by TI-HBM, then

P(X,Q^*|T) = \max_Q P(X,Q|T) = \max_{q_1,q_2,...,q_L} \left\{ \prod_{t=1}^{L} P(q_t|t) p(x_t|q_t) \right\} = \prod_{t=1}^{L} \max_{q_t} P(q_t|t) p(x_t|q_t)    (31)

q_t^* = \arg\max_{q_t} \{ P(q_t|t) p(x_t|q_t) \}    (32)

It can be seen that the DP-search (the Viterbi algorithm, of order O(N^2 L)) is eliminated from the state estimation problem in TI-HBM, and the order of computations is O(NL).
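In code, Eqs. (31)–(32) reduce to one argmax per frame; a hedged sketch using the same assumed arrays as above:

```python
import numpy as np

def decode_states(b, P_s_given_t):
    scores = P_s_given_t[:, :b.shape[1]] * b      # P(q_t|t) p(x_t|q_t) for all t
    q_star = scores.argmax(axis=0)                # Eq. (32): per-frame argmax
    log_prob = np.log(scores.max(axis=0)).sum()   # Eq. (31): log P(X, Q*|T)
    return q_star, log_prob
```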

4.3. Training TI-HBM parameters

Suppose that we have a set \mathcal{X} of K observation sequences for training the TI-HBM parameters. If X^{(k)} is the k-th observation sequence, of length L_k, then

\mathcal{X} = \{X^{(1)}, X^{(2)}, ..., X^{(k)}, ..., X^{(K)}\}, \quad X^{(k)} = (x_1^{(k)}, x_2^{(k)}, ..., x_t^{(k)}, ..., x_{L_k}^{(k)})    (33)

If the TI-HBM parameter set is \theta, we seek an estimator \hat{\theta} under the maximum-likelihood criterion:

\hat{\theta} = \arg\max_\theta \log P(\mathcal{X}; \theta), \quad P(\mathcal{X}; \theta) = \prod_{k=1}^{K} P(X^{(k)}; \theta)    (34)

4.3.1. Estimating the P_T(t) parameters of TI-HBM

The estimate of the P_T(t) parameters is the number of observation vectors with time-index t divided by the total number of observation vectors (as in Eq. (4)). This estimator depends only on the lengths L_k and is independent of the X^{(k)}'s. It therefore yields the final estimate of the P_T(t) parameters; they are kept fixed and treated as constants in the subsequent stages of training. In the EM algorithm, we only estimate the P_{S|T}(i|t) parameters. After EM, the parameters P_{S,T}(i,t) are simply derived using P_{S,T}(i,t) = P_T(t) P_{S|T}(i|t). In practice, the estimator in Eq. (4) must be smoothed. One way is to parameterize P_D(\cdot) with a suitable distribution (e.g. a Gamma distribution), and convert P_D(\cdot) to P_T(\cdot) using Eqs. (20) and (21).

4.3.2. Training TI-HBM by the EM algorithm

We have used the EM algorithm [8] for training the TI-HBM parameters. The details of the mathematical manipulations can be found in [9]:

\hat{P}(i|t) = \frac{\sum_{k=1}^{K} P(i|t, x_t^{(k)}; \theta^{(n-1)}) 1(t \le L_k)}{\sum_{k=1}^{K} 1(t \le L_k)}    (35)

\hat{w}_{im} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(m,i|t, x_t^{(k)}; \theta^{(n-1)})}{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(i|t, x_t^{(k)}; \theta^{(n-1)})}    (36)

\hat{\mu}_{im} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(m,i|t, x_t^{(k)}; \theta^{(n-1)}) x_t^{(k)}}{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(m,i|t, x_t^{(k)}; \theta^{(n-1)})}    (37)

\hat{C}_{im} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(m,i|t, x_t^{(k)}; \theta^{(n-1)}) (x_t^{(k)} - \mu_{im})(x_t^{(k)} - \mu_{im})^T}{\sum_{k=1}^{K} \sum_{t=1}^{L_k} P(m,i|t, x_t^{(k)}; \theta^{(n-1)})}    (38)

where

P(i|t, x_t^{(k)}; \theta^{(n-1)}) = \frac{P(i, t, x_t^{(k)})}{P(t, x_t^{(k)})} = \frac{P(i, t, x_t^{(k)})}{\sum_{j=1}^{N} P(j, t, x_t^{(k)})} = \frac{P(i|t) p(x_t^{(k)}|i)}{\sum_{j=1}^{N} P(j|t) p(x_t^{(k)}|j)}    (39)

p(x_t^{(k)}|i) = \sum_{m=1}^{M} w_{im} N(x_t^{(k)}; \mu_{im}, C_{im})    (40)

P(m,i|t, x_t^{(k)}; \theta^{(n-1)}) = P(i|t, x_t^{(k)}; \theta^{(n-1)}) P(m|i, t, x_t^{(k)}; \theta^{(n-1)})    (41)

P(m|i, t, x_t^{(k)}; \theta^{(n-1)}) = \frac{w_{im} N(x_t^{(k)}; \mu_{im}, C_{im})}{\sum_{m'=1}^{M} w_{im'} N(x_t^{(k)}; \mu_{im'}, C_{im'})}    (42)

The estimated values \hat{\theta} are stored in \theta^{(n)} for the next iteration. After EM re-estimation of \hat{P}(i|t), the final value of \hat{P}(i,t) is computed as

\hat{P}_{S,T}(i,t) = \hat{P}_T^{sm}(t) \cdot \hat{P}_{S|T}(i|t)    (43)

where \hat{P}_T^{sm}(t) is the smoothed \hat{P}_T(t) derived from the parameterized P_D(\cdot) (as described in Section 4.3.1).
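A hedged sketch of the Eq. (35) update with the Eq. (39) posterior (our names; `b_fn(x)` stands for the N-vector of emission likelihoods p(x|i) from Eq. (40); the Gaussian updates of Eqs. (36)–(38) follow the usual GMM pattern and are omitted):

```python
import numpy as np

def reestimate_P_s_given_t(sequences, P_s_given_t, b_fn, L_max):
    num = np.zeros_like(P_s_given_t)          # shape (N, L_max)
    den = np.zeros(L_max)                     # counts of 1(t <= L_k), Eq. (35)
    for X in sequences:                       # X^(k), shape (L_k, Dim)
        for t, x in enumerate(X):             # 0-based t; the paper's t is 1-based
            joint = P_s_given_t[:, t] * b_fn(x)  # P(i|t) p(x|i)
            num[:, t] += joint / joint.sum()     # Eq. (39) posterior
            den[t] += 1.0
    return num / np.maximum(den, 1.0)[np.newaxis, :]  # Eq. (35)
```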

5. Experiments

We have employed TI-HBM in speaker-independent phone recognition for the Persian (Farsi) spoken language. For training the HMM and TI-HBM phone models, the standard Farsi phonetically balanced continuous speech database FarsDat [10] was used (available via the ELDA website [11]). FarsDat contains utterances of 304 speakers from 10 dialect regions inside Iran. Each speaker has uttered 20 sentences (of which 2 sentences are common among speakers). The utterances of the first 250 speakers were used for training the phone models (5000 sentences), and the utterances of the remaining 54 speakers were used for testing (1080 sentences). Thirty-two phone models were trained. The feature vectors are 13 cepstral coefficients (c0–c12) derived from perceptual linear prediction analysis, plus first-, second-, and third-order derivatives (52-dimensional). The HMM and TI-HBM models have 3 states, and 2, 4, 8, 16, 24 and 32 diagonal-covariance Gaussian PDFs per state. To improve the results, a phone-bigram language model was used, trained on the phone labels of the training set.

The final value of L_max was set to 2 L_max^{train}. P_D(\cdot) was parameterized (smoothed) with a Gamma distribution, truncated outside the interval L_min \le t \le L_max (L_min = 3), and then converted to P_T(\cdot) using Eqs. (20) and (21) in the interval 1 \le t \le L_max + 1. The survival probabilities were then computed from the smoothed P_T(\cdot).

Both the HMM and TI-HBM models were trained by the EM algorithm. As can be seen from the re-estimation formulas, TI-HBM training is completely independent of HMM training; but to make the comparison fair, we start the HMM and TI-HBM training from equivalent initial points. The initial values of the Gaussian mixture parameters for HMM and TI-HBM were the same. Since the state-transition-related parameters of HMM and TI-HBM are intrinsically different, one way to obtain equivalent starting points is to initialize the HMM parameters with \theta_0^{HMM} and then convert \theta_0^{HMM} to equivalent TI-HBM values \theta_0^{HBM}. The algorithm is as follows: the HMM parameters were initialized with \theta_0^{HMM}; using \theta_0^{HMM}, the optimal state sequences for all observation sequences were determined, and the ratio of the number of observation vectors with time-index t assigned to state i, to the number of observation vectors with time-index t, was used as the initial value of P(i|t) in TI-HBM. Thus both models were trained by the EM algorithm starting from equivalent initial points.


In the decoding process, the survival probabilities P(t|t-1) are used instead of P_D(d), because the phone durations d are not known before the end of the search. In practice, [P(t|t-1)]^{DSF} and [1 - P(t|t-1)]^{DSF} were used; this is equivalent to putting a weight on the duration-distribution function, i.e. using [P_D(L)]^{DSF} instead of P_D(L). The DSF parameter was optimally set to 3. After estimating P(i|t) by the EM algorithm, these parameters were extended to the interval L_max^{train} < t \le L_max as follows:

P(i|t) = P(i|L_max^{train}) for all i and L_max^{train} < t \le L_max    (44)

In the recognition phase, the DP-type search is eliminated at the state level, but a DP-search is still needed for finding the best phoneme segmentation (phoneme boundaries). The DP-search at the phone level is performed in every acoustic model, including HMM, TI-HBM and even neural-network and support-vector-machine-based acoustic models. Since the phoneme boundaries are not known a priori during recognition, we use an array t_0(ph) that keeps the entrance time into phone "ph" along the partial best path. Therefore, the relative time-index t' is used instead of t in the parameters P(t) and P(i|t):

P_T^{(ph)}(t') = P_T(t - t_0(ph) + 1)
P_{T_{next}|T_{curr}}^{(ph)}(t'|t'-1) = P_{T_{next}|T_{curr}}(t - t_0(ph) + 1 | t - t_0(ph))
P_{S|T}^{(ph)}(i|t') = P_{S|T}(i | t - t_0(ph) + 1)    (45)

Given that the search has been performed up to time t-1, so that the best accumulated probability and the best path (phone sequence) up to time t-1 are available, the recognition algorithm at time t is as follows. First, for time t and each phoneme "ph", the best state and the maximum probability are extracted using

p^{(ph)}(x_t) = \max_{q_t} \{ P^{(ph)}(q_t|t') p^{(ph)}(x_t|q_t) \}    (46)
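A hedged sketch of Eqs. (45)–(46) (our names; `t0` maps each phone to its entrance frame on the partial best path, and the per-phone tables are indexed by the relative time t' = t - t_0(ph) + 1):

```python
import numpy as np

def phone_frame_score(t, ph, t0, P_s_given_t, b):
    """Best state and score for phone `ph` at frame t; no state-level DP."""
    t_rel = t - t0[ph]                                # 0-based form of Eq. (45)
    scores = P_s_given_t[ph][:, t_rel] * b[ph][:, t]  # P(q|t') p(x_t|q), Eq. (46)
    return scores.max(), scores.argmax()
```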

Table 1. Phone recognition accuracy (%) for the test set.

No. of Gaussians per state    HMM      TI-HBM
2                             68.31    68.38
4                             71.63    71.74
8                             74.17    74.19
16                            75.68    75.94
24                            76.21    76.70
32                            76.84    77.22

Table 2. Elapsed time (s) for decoding 200 s of speech.

No. of Gaussians per state    HMM      TI-HBM    Speed-up (%)
2                             6.37     3.53      80.45
4                             8.61     5.81      48.19
8                             12.57    10.61     18.47
16                            20.34    20.25     0.44

This stage does not need DP. At the second stage, we extract the best accumulated probability, the best phoneme sequence and the phoneme segmentation (phoneme boundaries) up to time t by performing a DP-search over the p^{(ph)}(x_t) and P^{(ph)}(t'|t'-1) probabilities.

The phone recognition results are shown in Table 1. It can be seen that TI-HBM improves the phone recognition accuracy compared to the standard HMM. In another experiment, we compared the recognition time for the HMM and TI-HBM models. Table 2 shows the elapsed time for decoding 200 s of speech signal (on an Intel Pentium IV, 3.2 GHz processor). TI-HBM is always faster than HMM, and the speed-up factor is greater for a low number of Gaussians per state. This is because the main computational cost of both HMM and TI-HBM is the computation of the emission probabilities (Gaussian mixtures). TI-HBM will be considerably faster than HMM for applications with a low number of Gaussians per state, for feature vectors with few dimensions, or for discrete-HMM cases in which the index of the observation vector is known a priori without any computation (like amino acids in bioinformatics applications, where discrete HMMs are widely used). TI-HBM was also always faster than HMM in the training phase in our experiments (not reported here).

Another issue is the number of free parameters of the model. The number of HMM parameters {\pi_i, a_{ij}, w_{im}, \mu_{imk}, C_{imk}} with diagonal-covariance Gaussian mixtures is N + N^2 + NM + 2NM \cdot Dim, while for the TI-HBM parameters {P(i,t), w_{im}, \mu_{imk}, C_{imk}} it is N L_max^{avg} + NM + 2NM \cdot Dim (where L_max^{avg} is the average L_max over the Persian phoneme set in the FarsDat database). Assuming 2M \cdot Dim \gg L_max^{avg} (for example M = 32, Dim = 52, L_max^{avg} = 35), there is no considerable increase in the number of TI-HBM parameters. Furthermore, P(i,t) for each state i can be parameterized by a Gamma distribution to reduce the term N L_max^{avg} to 2N (each Gamma distribution has two parameters).
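As a worked check of this count (our arithmetic, using the example figures above): with N = 3, M = 32, Dim = 52 and L_max^{avg} = 35, the HMM has 3 + 9 + 96 + 9984 = 10092 free parameters, while the TI-HBM has 105 + 96 + 9984 = 10185, an increase of under 1%.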

6. Conclusion

In this paper, a new acoustic model named TI-HBM was introduced as an alternative to HMM for speech recognition. In TI-HBM, the state transition process is a generalized Bernoulli process instead of a Markov one. In terms of phoneme recognition accuracy, TI-HBM outperforms HMM. TI-HBM also has several simplicities and advantages over HMM:

1. TI-HBM is a new theoretical framework for processing time-series data, especially for automatic speech recognition, defined by a set of new parameters called the joint state–time distribution.
2. The DP-search is eliminated at the state level in TI-HBM, which makes it simpler than HMM.
3. TI-HBM is faster than HMM in both the recognition and training phases.
4. TI-HBM is capable of modeling acoustic-unit duration (e.g. phone duration) by employing a parameter named survival probability.
5. Probability computation in TI-HBM is performed in a non-recursive manner. Therefore, differentiating the TI-HBM likelihood function with respect to its parameters is simpler and faster than for HMM, and does not require the recursive forward and backward variables \alpha_t(i) and \beta_t(i).

The results of the comparison between HMM and TI-HBM confirm that the state transition structure in acoustic models like HMM or TI-HBM is less important than the observation density structure. Therefore, TI-HBM can be an alternative to HMM, with easier use, for applications like speech recognition in which the state and the time-index (frame number) are strongly related. The use of uniform segmentation for initializing HMM parameters (equally segmenting the speech signal corresponding to an acoustic unit and assigning each segment to a state) is evidence of this relationship [1]. Furthermore, TI-HBM can be used for modeling other speech acoustic units such as words and syllables. Employing TI-HBM in other applications like bioinformatics, time series and pattern recognition may reveal further advantages of this model.

Acknowledgement

This research was supported by the Iran Telecommunication Research Center (ITRC) under Contract T-500-9269.

References

[1] L.R. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[2] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice-Hall, Upper Saddle River, NJ, 2001.
[3] G. Linares, B. Lecouteux, D. Matrouf, P. Nocera, Phone duration models for fast broadcast news transcription, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Philadelphia, USA, 2005.
[4] D. Povey, Phone duration modeling for LVCSR, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Montreal, Canada, 2004.
[5] J. Pylkkönen, M. Kurimo, Using phone durations in Finnish large vocabulary continuous speech recognition, in: Proceedings of the Sixth Nordic Signal Processing Symposium (NORSIG), Espoo, Finland, June 2004.
[6] K.S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, second ed., Wiley, New York, 2001.
[7] P.M. Djurić, J.-H. Chun, An MCMC sampling approach to estimation of nonstationary hidden Markov models, IEEE Trans. Signal Process. 50 (5) (2002) 1113–1123.
[8] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B 39 (1) (1977) 1–38.
[9] J. Kabudian, A new acoustic model and its application to speaker-independent automatic speech recognition, Technical Report, Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran, 2007.
[10] M. Bijankhan, J. Sheikhzadegan, M.R. Roohani, Y. Samareh, C. Lucas, M. Tebyani, FarsDat—the speech database of Farsi spoken language, in: Proceedings of the Fifth Australian International Conference on Speech Science and Technology (SST), Perth, Australia, 1994, pp. 826–831.
[11] http://www.elda.org/catalogue/en/speech/S0112.html