
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 46, NO. 3, MAY 2000

Simulated Annealing for Maximum A Posteriori Parameter Estimation of Hidden Markov Models

Christophe Andrieu and Arnaud Doucet

Abstract—Hidden Markov models are mixture models in which the populations from one observation to the next are selected according to an unobserved finite state-space Markov chain. Given a realization of the observation process, our aim is to estimate both the parameters of the Markov chain and of the mixture model in a Bayesian framework. In this paper, we present an original simulated annealing algorithm which, in the same way as the EM (Expectation–Maximization) algorithm, relies on data augmentation, and is based on stochastic simulation of the hidden Markov chain. This algorithm is shown to converge toward the set of Maximum A Posteriori (MAP) parameters under suitable regularity conditions.

Index Terms—Bayesian estimation, data augmentation, hidden Markov models, maximum a posteriori, simulated annealing.

I. INTRODUCTION

Hidden Markov models (HMMs) are models in which the distribution from one observation to the next is selected according to an unobserved "hidden" finite state-space Markov chain. These models have many applications in signal processing [1], communications [2], [3], and applied statistics [5]–[9]. In several practical cases of interest, neither the parameters of the hidden Markov chain nor the parameters of the mixture model are known, and they have to be estimated from a finite number of observations. The most popular method used to perform Maximum-Likelihood (ML) or Penalized ML estimation of these parameters is the EM (Expectation–Maximization) algorithm [10]. When the number of observations is very large, the EM algorithm yields very satisfactory results in most applications. However, there are also well-documented shortcomings of this deterministic algorithm. For example, it can converge to local maxima or saddle points of the (penalized) log-likelihood function, and its limiting position is dependent on initialization. When small sample sizes are available, the objective function of the maximization becomes highly irregular and the EM algorithm can perform poorly. This is one of the main reasons why stochastic versions of EM, including SEM (Stochastic EM) [11] and MCEM (Monte Carlo EM) [6], [12], have been developed. In some applications, such as blind ML filter identification [4] or estimation of finite mixture distributions [13], it has been shown experimentally that these stochastic algorithms outperform the EM algorithm. However, contrary to the algorithm presented here, there is a lack of theoretical convergence results for SEM and MCEM.

In this paper, we address the problem of the estimation of these parameters in a Bayesian framework; that is, these parameters are regarded as random with appropriate prior distributions. Several Markov chain Monte Carlo (MCMC) algorithms have been proposed to estimate the posterior distribution of these parameters for models from this class [3], [5], [7], [14]. In a Bayesian framework, MCMC methods consist of building a homogeneous Markov chain whose asymptotic distribution converges to the posterior distribution of interest. These methods are particularly suitable for the estimation of conditional expectations, e.g., the MMSE (Minimum Mean-Square Error) estimate, via sample path averages [15], [16], but rather inefficient for the estimation of the MAP values of the parameters. Indeed, these homogeneous algorithms do not focus especially on high posterior density regions. Thus a large amount of the computational burden is spent exploring regions of low posterior probability, which are of no interest when one wants the MAP estimate.

Here, in order to achieve efficient computation of the MAP estimate of the parameters of an HMM, we design a simulated annealing algorithm which, under given conditions, converges toward the set of these estimates in the sense specified in Section IV. In this context, "simulated annealing" means that we build an inhomogeneous Markov chain whose sequence of invariant distributions concentrates on the set of the MAP parameters. Classical simulated annealing relies on a particular case of the Metropolis algorithm: exploration of the objective function is based on a random walk [17], [18]. In a Bayesian framework, such an implementation would not make use of the statistical structure of the model. Furthermore, one of the main drawbacks of classical simulated annealing based on the Metropolis algorithm is that it is not usually theoretically valid when some of the parameters to be estimated are continuous and do not lie in a compact set [19], [20]. Here we construct an algorithm which exploits the statistical structure of the model by introducing a set of missing data, here the hidden Markov chain, in a similar way to the EM algorithm [1]. This data augmentation provides the algorithm with two interesting properties. The first is algorithmic, as it allows the parameters to be easily updated. The second is theoretical, as the discrete nature of the missing data set ensures convergence of the algorithm for the whole set of parameters, including continuous parameters lying in unbounded sets. Using Monte Carlo simulations, we compare this algorithm and the EM algorithm. This algorithm is shown to perform better than EM, in the sense specified in Section V, while having an equivalent computational complexity at each iteration.

Manuscript received May 13, 1998; revised December 6, 1999. The work of C. Andrieu was supported by a post-doctoral grant from AT&T Laboratories, Cambridge. The work of A. Doucet was supported by an EPSRC grant. The authors are with the Signal Processing Group, Department of Engineering, University of Cambridge, CB2 1PZ Cambridge, U.K. (e-mail: {ca226}{ad2}@eng.cam.ac.uk). Communicated by P. Moulin, Associate Editor for Nonparametric Estimation, Classification, and Neural Networks. Publisher Item Identifier S 0018-9448(00)03096-0.


The paper is organized as follows. In Section II we formally present the model and the estimation objectives. In Section III, we introduce the simulated annealing algorithm and discuss implementation issues. In Section IV, we give sufficient conditions to ensure convergence of this scheme toward the set of MAP parameters. In Section V, we demonstrate the application of this algorithm to switching autoregressions with a Markov regime and to Markov-modulated Poisson processes. In Appendix A, notation is given. In Appendix B, the forward filtering–backward sampling procedure is described. In Appendix C, the way to evaluate the likelihood function is given. Finally, the proofs of convergence are grouped in Appendix D.

II. MODEL AND PROBLEM FORMULATION

A. Model Structure

Let us formally detail the signal model. Let $n = 1, 2, \ldots$ denote a discrete time index. Let $\{x_n\}$ denote a discrete-time, time-homogeneous, $m$-state, first-order Markov chain with transition probabilities

$$a_{ij} \triangleq \Pr(x_{n+1} = j \mid x_n = i), \qquad i, j \in \{1, \ldots, m\}. \tag{1}$$

Denote the initial probability distribution as $\pi = (\pi_1, \ldots, \pi_m)$, such that $\pi_i \geq 0$ and $\sum_{i=1}^{m} \pi_i = 1$. The transition probability matrix $A$ is an $m \times m$ matrix with elements satisfying $a_{ij} \geq 0$ and $\sum_{j=1}^{m} a_{ij} = 1$, for each $i$. For notational convenience, denote by $a_i \triangleq (a_{i1}, \ldots, a_{im})$ the $i$th row of the transition matrix, and by

$$x_{k:l} \triangleq (x_k, \ldots, x_l), \qquad y_{k:l} \triangleq (y_k, \ldots, y_l) \tag{2}$$

the Markov chain and observations from time $k$ to $l$, respectively.

At each time $n$, the observation $y_n$ is distributed, conditional upon $x_n = i$ and $\phi_i$, the vector parameter of the $i$th distribution, as

$$y_n \mid (x_n = i, \phi_i) \sim f(\,\cdot \mid \phi_i). \tag{3}$$

Unconditionally, we have

$$y_n \sim \begin{cases} \displaystyle\sum_{i=1}^{m} \pi_i\, f(\,\cdot \mid \phi_i) & \text{if } n = 1 \\[1ex] \displaystyle\sum_{i=1}^{m} \Pr(x_n = i)\, f(\,\cdot \mid \phi_i) & \text{if } n > 1 \end{cases}$$

which is a finite mixture distribution. The set of unknown parameters to estimate is $\theta \triangleq (\pi, A, \phi_1, \ldots, \phi_m) \in \Theta$, where $n_\theta$ is the dimension of $\Theta$.

B. Bayesian Model

In this paper, we adopt a fully Bayesian approach; that is, the parameter $\theta$ is regarded as random with prior distribution $p(d\theta)$. Prior distributions are used, in principle, to incorporate our knowledge of the parameters. Less informative priors should be employed when such knowledge is limited [21, pp. 357–367]. $p(d\theta)$ is a probability measure on $(\Theta, \mathcal{B}(\Theta))$, where $\mathcal{B}(\Theta)$ is the Borel $\sigma$-algebra associated to the parameter space $\Theta$. $p(d\theta)$ is assumed absolutely continuous with respect to the Lebesgue measure, and we denote by $p(\theta)$ the corresponding density. The following structure for $p(\theta)$ is assumed:

$$p(\theta) = p(\pi) \prod_{i=1}^{m} p(a_i) \prod_{i=1}^{m} p(\phi_i) \tag{4}$$

that is to say that the parameters of the mixture models, the initial distribution, and the rows of the transition matrix are mutually independent. We also assume that $p(\phi_i)$ can be evaluated pointwise up to a proportionality factor and that $\pi$ and $a_i$ have Dirichlet distributions as priors [21, pp. 134–135], i.e., $\pi \sim \mathcal{D}(\kappa_1, \ldots, \kappa_m)$ and $a_i \sim \mathcal{D}(\gamma_{i1}, \ldots, \gamma_{im})$, where $\kappa_j > 0$ and $\gamma_{ij} > 0$. If all these hyperparameters are equal to one, we obtain the uninformative uniform distribution commonly used in Bayesian estimation [8], [14]. Furthermore, this distribution is a conditionally conjugate prior [21, pp. 265–285], which allows efficient stochastic simulation of these parameters as required by our algorithm. Similar prior distributions are adopted in [3], [5], [7], and [14]. The prior distributions for $\phi_i$ will depend upon the application under study. Several examples will be presented in Section V.

C. Estimation Objectives

Given the observations $y_{1:N}$, where $N$ is the number of available data, our objective is to obtain the MAP estimate of $\theta$ defined as

$$\theta_{\mathrm{MAP}} \triangleq \arg\max_{\theta \in \Theta}\; p(\theta \mid y_{1:N}). \tag{5}$$

Performing this estimation involves solving a complex global optimization problem on a general (continuous and noncompact) state space.

III. SIMULATED ANNEALING VIA DATA AUGMENTATION

We present here a simulated annealing-like algorithm which, in a similar way to the EM algorithm, relies on the introduction of the hidden Markov chain $x_{1:N}$ to yield a so-called "augmented data set." The finite nature of this missing data set is the crucial feature which provides the algorithm with good convergence properties.

A. Presentation of the Algorithm

We construct an inhomogeneous Markov chain $(\theta^{(i)})_{i \geq 0}$ whose transition kernel at iteration $i$ depends on a deterministic cooling schedule $(T_i)_{i \geq 1}$ [18] satisfying $T_i > 0$ and $\lim_{i \to +\infty} T_i = 0$; $\theta^{(i)}$ is an estimate of the set of parameters at iteration $i$. Sufficient conditions on the cooling schedule to ensure convergence of the sequence toward the set of global maxima of $p(\theta \mid y_{1:N})$ are given in Section IV.


The algorithm proceeds as follows.

Simulated Annealing via Data Augmentation

1) Initialization: Set $\theta^{(0)}$ randomly.
2) Iteration $i$:
   • Sample $x_{1:N}^{(i)} \sim p(\,\cdot \mid y_{1:N}, \theta^{(i-1)})$.
   • Sample $\theta^{\star} \sim p(\,\cdot \mid y_{1:N}, x_{1:N}^{(i)})$.
   • Evaluate the acceptance probability

$$\alpha_{T_i}\left(\theta^{(i-1)}, \theta^{\star}\right) = \min\left\{1, \left(\frac{p(\theta^{\star} \mid y_{1:N})}{p(\theta^{(i-1)} \mid y_{1:N})}\right)^{1/T_i - 1}\right\}. \tag{6}$$

   • Sample $u \sim \mathcal{U}_{[0,1]}$. If $u \leq \alpha_{T_i}(\theta^{(i-1)}, \theta^{\star})$, set $\theta^{(i)} = \theta^{\star}$; otherwise set $\theta^{(i)} = \theta^{(i-1)}$.
3) Set $i \leftarrow i + 1$ and go to step 2).
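To make the structure of the iteration concrete, the following Python sketch implements the loop above. The helper names ffbs_sample (the forward filtering–backward sampling recursion of Appendix B), sample_theta_given_x (the full conditional of Section III-B), and log_posterior (equation (9), up to an additive constant) are our own assumptions and are passed in as callables; this is an illustrative sketch, not the paper's implementation.

    import numpy as np

    def simulated_annealing_da(y, theta0, temperatures, rng,
                               ffbs_sample, sample_theta_given_x, log_posterior):
        """Sketch of simulated annealing via data augmentation.

        ffbs_sample(y, theta, rng)        -> a draw of the hidden chain (Appendix B)
        sample_theta_given_x(y, x, rng)   -> a draw from p(theta | y, x)
        log_posterior(theta, y)           -> log p(theta | y) up to a constant
        """
        theta = theta0
        best, best_lp = theta0, log_posterior(theta0, y)
        for T in temperatures:
            x = ffbs_sample(y, theta, rng)                 # data augmentation step
            theta_prop = sample_theta_given_x(y, x, rng)   # proposal from the full conditional
            # Annealed Metropolis-Hastings acceptance ratio, cf. (6):
            log_ratio = (1.0 / T - 1.0) * (log_posterior(theta_prop, y)
                                           - log_posterior(theta, y))
            if np.log(rng.uniform()) <= min(0.0, log_ratio):
                theta = theta_prop
            lp = log_posterior(theta, y)
            if lp > best_lp:                               # keep the best visited value
                best, best_lp = theta, lp
        return best

Note that, unlike EM, the chain is not monotone in posterior value, which is why the best visited parameter is retained (this is also how the estimates are selected in Section V).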

B. Implementation Issues

At each iteration $i$, one must be able to sample efficiently from $p(x_{1:N} \mid y_{1:N}, \theta)$ and from $p(\theta \mid y_{1:N}, x_{1:N})$, and then evaluate an acceptance probability.

1) Simulation from $p(x_{1:N} \mid y_{1:N}, \theta)$: This is a discrete probability distribution. However, the evaluation of this distribution would be cumbersome due to its high dimensionality for large $N$. Fortunately, it is not necessary to evaluate this distribution in order to sample from it. Here we use the forward filtering–backward sampling recursion recently introduced independently by Carter and Kohn [22] and Chib [5]. This efficient recursion is detailed in Appendix B. Its computational cost is $O(m^2 N)$, which is linear in the data length $N$.

2) Simulation from $p(\theta \mid y_{1:N}, x_{1:N})$: From the assumptions on the prior distribution and the observation model, we have

$$p(\theta \mid y_{1:N}, x_{1:N}) = p(\phi_{1:m} \mid y_{1:N}, x_{1:N})\; p(\pi \mid x_1) \prod_{i=1}^{m} p(a_i \mid x_{1:N}). \tag{7}$$

Sampling from $p(\phi_{1:m} \mid y_{1:N}, x_{1:N})$ is problem-dependent and typically relies on classical techniques [23] which will be illustrated in Section V. Sampling from $p(\pi \mid x_1)$ and $p(a_i \mid x_{1:N})$ is easy, as we obtain

$$\pi \mid x_1 \sim \mathcal{D}\left(\kappa_1 + \delta_1(x_1), \ldots, \kappa_m + \delta_m(x_1)\right), \qquad a_i \mid x_{1:N} \sim \mathcal{D}\left(\gamma_{i1} + n_{i1}, \ldots, \gamma_{im} + n_{im}\right) \tag{8}$$

where $\delta_j(x) = 1$ if $x = j$ and $\delta_j(x) = 0$ otherwise, and $n_{ij}$ is the total number of one-step transitions from state $i$ to state $j$ in $x_{1:N}$.

3) Evaluation of the Acceptance Probability: The acceptance probability (6) can be computed because we can evaluate the following probability in a pointwise manner up to a normalizing constant:

$$p(\theta \mid y_{1:N}) \propto p(y_{1:N} \mid \theta)\; p(\theta). \tag{9}$$

The second term on the right is the prior distribution, which is typically easily evaluated analytically up to a proportionality factor. The first term is the likelihood for the data, which is usually calculated using Baum's forward–backward formula [1]. Here we do not make use of this recursion because, while simulating from $p(x_{1:N} \mid y_{1:N}, \theta)$, auxiliary quantities (the denominator of (40)) are computed that can be used to evaluate the likelihood. This alternative evaluation is detailed in Appendix C.
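As an illustration of step 2), the following sketch draws the initial distribution and the rows of the transition matrix from the Dirichlet full conditionals (8), given a sampled state sequence. The function name and the stand-in state sequence are ours.

    import numpy as np

    def sample_pi_and_A(x, m, kappa, gamma, rng):
        """Draw (pi, A) from the Dirichlet full conditionals of (8).

        x     : int array of states in {0, ..., m-1} (the sampled hidden chain)
        kappa : (m,) Dirichlet hyperparameters for the initial distribution
        gamma : (m, m) Dirichlet hyperparameters for the transition rows
        """
        # Count the one-step transitions n_ij in the sampled chain.
        counts = np.zeros((m, m))
        np.add.at(counts, (x[:-1], x[1:]), 1.0)
        # Initial distribution: conjugate Dirichlet update from x_1 alone.
        kappa_post = kappa.copy()
        kappa_post[x[0]] += 1.0
        pi = rng.dirichlet(kappa_post)
        # Each row of A is conditionally independent given the counts.
        A = np.vstack([rng.dirichlet(gamma[i] + counts[i]) for i in range(m)])
        return pi, A

    rng = np.random.default_rng(0)
    x = rng.integers(0, 3, size=500)          # stand-in state sequence
    pi, A = sample_pi_and_A(x, 3, np.ones(3), np.ones((3, 3)), rng)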

IV. STUDY OF THE ALGORITHM

In this section, our main result is Theorem 1, which proves that the proposed algorithm converges to the set of MAP parameters in the following sense: if the cooling schedule decreases slowly enough, then

$$\lim_{i \to +\infty} \left\| P^{(i)} - \pi_i \right\| = 0$$

where $\|\cdot\|$ is a norm defined in Section IV-A, $(T_i)_{i \geq 1}$ is an appropriate logarithmic cooling schedule (see Theorem 1), $P^{(i)}$ is the distribution of the $i$th sample of the inhomogeneous Markov chain built from our algorithm, and $(\pi_i)_{i \geq 1}$ is the sequence of normalized versions of $p(\theta \mid y_{1:N})^{1/T_i}$, which converges to a distribution located on the set of MAP parameters (Lemma 3).

Our proof of convergence relies on the study of the inhomogeneous Markov chain $(\theta^{(i)})_{i \geq 0}$. In Section IV-A, notation and hypotheses are given. In Section IV-B, we show that each transition kernel of the Markov chain $(\theta^{(i)})_{i \geq 0}$ is uniformly ergodic. In Section IV-C, we study the behavior of the sequence of invariant distributions of these transition kernels. Then, in Section IV-D, we show that a logarithmic cooling schedule ensures a convergence result of the algorithm toward the set of global maxima of the posterior distribution.

A. Notations and Assumptions

Let $\mathcal{P}$ be the set of all probability measures on $(\Theta, \mathcal{B}(\Theta))$ and $\mathcal{K}$ the set of all Markov transition kernels on $\Theta$. We use the following standard definitions and notations [16], [19], [24]. The total variation norm between two elements $\mu$ and $\nu$ of $\mathcal{P}$ is defined as

$$\|\mu - \nu\| \triangleq \sup_{B \in \mathcal{B}(\Theta)} |\mu(B) - \nu(B)|.$$

Let $(\theta^{(i)})_{i \geq 0}$ be an inhomogeneous Markov chain of initial distribution $P^{(0)}$ and Markov transition kernel $K_i$ at iteration $i$, satisfying, for any $B \in \mathcal{B}(\Theta)$,

$$\Pr\left(\theta^{(i)} \in B \mid \theta^{(i-1)}\right) = K_i\left(\theta^{(i-1)}, B\right). \tag{10}$$

We define, for $0 \leq j < i$, the composite kernel $K^{(j,i)} \triangleq K_{j+1} K_{j+2} \cdots K_i$, so that the probability distribution $P^{(i)}$ of $\theta^{(i)}$ is defined by

$$P^{(i)} = P^{(0)} K^{(0,i)}. \tag{11}$$

Let $\mathring{\Theta}$ be the interior of $\Theta$. We assume that the following assumptions hold.

Assumption 1: $p(y_{1:N} \mid \theta)$ and $p(\theta)$ are twice continuously differentiable on $\mathring{\Theta}$. This implies that $p(\theta \mid y_{1:N})$ is twice continuously differentiable on $\mathring{\Theta}$.

Assumption 2: The set of global maxima of the posterior density

$$M \triangleq \left\{\theta^{\star} \in \Theta : p(\theta^{\star} \mid y_{1:N}) = \sup_{\theta \in \Theta}\, p(\theta \mid y_{1:N})\right\}$$

is finite and is contained in $\mathring{\Theta}$. The Hessian matrix of $\log p(\theta \mid y_{1:N})$ is nonsingular at any point of $M$. We denote by $\theta^{\star}$ any element of $M$.

Assumption 3: There exist a state sequence $x^{\star}_{1:N}$ and $\epsilon > 0$ such that, for any $\theta \in \Theta$, $p(x^{\star}_{1:N} \mid y_{1:N}, \theta) \geq \epsilon$.

Assumption 4: For any $T \in (0, 1]$,

$$\int_{\Theta} p(\theta \mid y_{1:N})^{1/T}\, \left|\log p(\theta \mid y_{1:N})\right|\, d\theta < +\infty.$$

B. Study of the Transition Kernels

First, it is easily checked that $(\theta^{(i)}, x_{1:N}^{(i)})_{i \geq 1}$ is an inhomogeneous Markov chain of transition kernel at iteration $i$

$$p(x'_{1:N} \mid y_{1:N}, \theta)\left[\alpha_{T_i}(\theta, \theta')\, p(\theta' \mid y_{1:N}, x'_{1:N})\, d\theta' + \left(1 - \int_{\Theta} \alpha_{T_i}(\theta, \vartheta)\, p(\vartheta \mid y_{1:N}, x'_{1:N})\, d\vartheta\right) \delta_{\theta}(d\theta')\right] \tag{12}$$

where $\delta_{\theta}$ is the delta-Dirac point-mass measure located at $\theta$. Thus, integrating out $x_{1:N}$, we see that $(\theta^{(i)})_{i \geq 0}$ is also a Markov chain on $\Theta$, of transition kernel at iteration $i$

$$K_i(\theta, d\theta') = \alpha_{T_i}(\theta, \theta')\, q(\theta' \mid \theta)\, d\theta' + \left(1 - \int_{\Theta} \alpha_{T_i}(\theta, \vartheta)\, q(\vartheta \mid \theta)\, d\vartheta\right) \delta_{\theta}(d\theta') \tag{13}$$

where $q$ is a homogeneous transition kernel

$$q(\theta' \mid \theta) \triangleq \sum_{x_{1:N} \in \{1, \ldots, m\}^N} p(\theta' \mid y_{1:N}, x_{1:N})\, p(x_{1:N} \mid y_{1:N}, \theta) \tag{14}$$

the integration being with respect to the counting measure.

Lemma 1: $q$ satisfies detailed balance for $p(\theta \mid y_{1:N})$, i.e., for any $\theta, \theta' \in \Theta$,

$$p(\theta \mid y_{1:N})\, q(\theta' \mid \theta) = p(\theta' \mid y_{1:N})\, q(\theta \mid \theta'). \tag{15}$$

Proposition 1: There exists $\epsilon > 0$ such that $q$ satisfies, for all $\theta, \theta' \in \Theta$,

$$q(\theta' \mid \theta) \geq \epsilon\, s(\theta') \tag{16}$$

where $s(\cdot)$ is a probability density on $\Theta$.

Lemma 2: There exists $i_0$ such that, for any $i \geq i_0$, $K_i$ admits as invariant density

$$\pi_i(\theta) \triangleq \frac{p(\theta \mid y_{1:N})^{1/T_i}}{\displaystyle\int_{\Theta} p(\vartheta \mid y_{1:N})^{1/T_i}\, d\vartheta}. \tag{17}$$

Proposition 2: $K_i$ is a uniformly ergodic transition kernel. There exist $\epsilon_i \in (0, 1]$ and a probability density $s_i(\cdot)$ on $\Theta$ such that, for any $\theta \in \Theta$ and $B \in \mathcal{B}(\Theta)$,

$$K_i(\theta, B) \geq \epsilon_i \int_{B} s_i(\vartheta)\, d\vartheta \tag{18}$$

and, for any $k \in \mathbb{N}$ and any $\theta \in \Theta$,

$$\left\| K_i^{k}(\theta, \cdot) - \pi_i \right\| \leq (1 - \epsilon_i)^{k}.$$

Remark 1: In practice one can adopt a mixture of transition kernels

$$K'_i = \lambda K_i + (1 - \lambda) \widetilde{K}_i, \qquad \lambda \in (0, 1)$$

including the transition kernel $K_i$ studied above as a component and, for example, another Metropolis component $\widetilde{K}_i$ based on a random walk proposal distribution allowing a "local" exploration of the distribution. In this case, $K'_i$ remains uniformly ergodic [16, Proposition 3, p. 1716].

C. Study of the Invariant Distributions

We now study the sequence of invariant densities $(\pi_i)_{i \geq 1}$. To simplify notation, the probability distribution associated with the density $\pi_i$ will also be denoted $\pi_i$.

Lemma 3: The sequence of invariant distributions $(\pi_i)_{i \geq 1}$ converges to a probability distribution $\pi_{\infty}$ located on the set of global maxima $M$:

$$\lim_{i \to +\infty} \left\| \pi_i - \pi_{\infty} \right\| = 0 \tag{19}$$

where

$$\pi_{\infty}(d\theta) = \sum_{\theta^{\star} \in M} w(\theta^{\star})\, \delta_{\theta^{\star}}(d\theta) \tag{20}$$

and the weights $w(\theta^{\star})$ are proportional to $\left[\det\left(-\nabla^2 \log p(\theta^{\star} \mid y_{1:N})\right)\right]^{-1/2}$ [25].

Proposition 3 ([19, Theorem 3.2, pp. 871–872]): For cooling schedules of the type considered in Theorem 1, the sequence $(\pi_i)_{i \geq 1}$ is of bounded variation in total variation norm, i.e.,

$$\sum_{k=1}^{+\infty} \left\| \pi_{k+1} - \pi_k \right\| < +\infty. \tag{21}$$
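The concentration phenomenon of Lemma 3 is easy to visualize numerically. The following toy illustration (ours, with an arbitrary bimodal density standing in for the posterior) normalizes $p(\theta \mid y_{1:N})^{1/T}$ on a grid and shows the mass accumulating on the global maximum as $T$ decreases.

    import numpy as np

    # Toy bimodal "posterior": global mode at 2.0, local mode at -1.0.
    theta = np.linspace(-5.0, 5.0, 2001)
    dtheta = theta[1] - theta[0]
    p = 0.6 * np.exp(-0.5 * (theta - 2.0) ** 2) + 0.4 * np.exp(-2.0 * (theta + 1.0) ** 2)

    for T in [1.0, 0.5, 0.1, 0.02]:
        tempered = p ** (1.0 / T)
        tempered /= tempered.sum() * dtheta          # normalized pi_T on the grid, cf. (17)
        near = np.abs(theta - 2.0) < 0.5
        print(f"T={T:4.2f}: mass within 0.5 of the global mode = "
              f"{tempered[near].sum() * dtheta:.3f}")

As $T$ decreases, essentially all of the probability mass of $\pi_T$ concentrates near the global mode, the local mode being discounted at an exponential rate.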


D. Convergence of the Simulated Annealing Algorithm

In this subsection, we prove that $\|P^{(i)} - \pi_i\|$ goes to zero as $i$ increases for suitable logarithmic cooling schedules.

Proposition 4: For any positive integers $j < i$, we have

$$\left\| P^{(i)} - \pi_i \right\| \leq \prod_{k=j+1}^{i} (1 - \epsilon_k) + \sum_{k=j}^{i-1} \left\| \pi_{k+1} - \pi_k \right\|. \tag{22}$$

Lemma 4: For any $j \geq 0$, if

$$\sum_{k=j+1}^{+\infty} \epsilon_k = +\infty$$

(where $\epsilon_k$ is defined in Proposition 2), then

$$\lim_{i \to +\infty} \prod_{k=j+1}^{i} (1 - \epsilon_k) = 0. \tag{23}$$

Lemma 5: For any $\eta > 0$, there exists $j_0$ such that, for any $j \geq j_0$ and any $i > j$, if the cooling schedule satisfies the condition of Theorem 1, then

$$\sum_{k=j}^{i-1} \left\| \pi_{k+1} - \pi_k \right\| \leq \eta. \tag{24}$$

Theorem 1: If the cooling schedule is chosen as $T_i = C / \log(i + 1)$ for a sufficiently large constant $C$, then, for any initial distribution of $\theta^{(0)}$, the Markov chain generated by the simulated annealing algorithm converges in the following sense:

$$\lim_{i \to +\infty} \left\| P^{(i)} - \pi_i \right\| = 0. \tag{25}$$

Consequently, for any neighborhood $\mathcal{N}(M)$ of the set of global maxima $M$,

$$\lim_{i \to +\infty} \Pr\left(\theta^{(i)} \in \mathcal{N}(M)\right) = 1. \tag{26}$$

Remark 2: This proof of convergence can be extended to some simulated annealing algorithms derived from (homogeneous) Markov chain Monte Carlo algorithms which are uniformly ergodic (see, for example, [16, pp. 1714–1716]).

Remark 3: As outlined in [19, p. 873], if the set $M$ is of null Lebesgue measure (which is a realistic assumption, clearly true under Assumption 2), we cannot obtain $\lim_{i \to +\infty} \Pr(\theta^{(i)} \in M) = 1$. Note that in a discrete setup, the set of maxima is of nonnull measure with respect to the reference counting measure. This allows one to obtain such a convergence result under appropriate assumptions [17], [18], [26].

V. APPLICATIONS

A logarithmic cooling schedule is required to ensure theoretical convergence of the simulated annealing algorithm. In practice, it is well known that such cooling schedules decrease too slowly toward zero. So, as is usually done, we implement a geometric cooling schedule $T_i = T_0 \alpha^i$, $\alpha \in (0, 1)$ [18], [20]. In the following examples, we run our algorithm for 250 iterations, with $T_0$ and $\alpha$ chosen so that the final temperature $T_{250}$ is close to zero. For each example, we simulated 500 realizations of the signal and then ran 250 iterations of the SA (simulated annealing) algorithm and 250 iterations of the EM algorithm, as these algorithms have an equivalent computational complexity. The algorithms were initialized randomly with the same values of the parameters. We denote by $\hat{\theta}_{\mathrm{SA}}^{(j)}$ and $\hat{\theta}_{\mathrm{EM}}^{(j)}$, respectively, the estimates obtained using the SA algorithm and the EM algorithm for the $j$th realization. For the EM algorithm, the posterior probability increases at each iteration, so $\hat{\theta}_{\mathrm{EM}}^{(j)}$ is the value obtained at the final iteration. For the SA algorithm, this property is no longer valid, and we select as $\hat{\theta}_{\mathrm{SA}}^{(j)}$ the estimate which maximizes the penalized log-likelihood among the 250 iterations. To compare the algorithms, we compute the average difference of the penalized log-likelihood of the estimates obtained by both algorithms over the 500 simulations, i.e.,

$$\Delta \triangleq \frac{1}{500} \sum_{j=1}^{500} \left[\log p\left(\hat{\theta}_{\mathrm{SA}}^{(j)} \mid y_{1:N}\right) - \log p\left(\hat{\theta}_{\mathrm{EM}}^{(j)} \mid y_{1:N}\right)\right].$$

It is difficult to specify other objective criteria in order to compare these algorithms. However, we also compute, using Monte Carlo simulations, the bias and variance of both estimates with respect to the values of the true parameters.
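For concreteness, a geometric schedule as used in this section can be generated as follows. The choice $T_0 = 1$ and the target final temperature are our illustrative assumptions; the paper's exact values are not reproduced here.

    import numpy as np

    def geometric_schedule(n_iter, T0=1.0, T_final=1e-3):
        """Geometric cooling T_i = T0 * alpha**i, with alpha set from the final temperature."""
        alpha = (T_final / T0) ** (1.0 / n_iter)
        return T0 * alpha ** np.arange(1, n_iter + 1)

    temperatures = geometric_schedule(250)
    print(temperatures[0], temperatures[-1])   # close to T0, and to T_final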

A. Switching Autoregressions with a Markov Regime

We consider the problem of the MAP estimation of the parameters of a switching autoregression process with a Markov regime [5], [28], [29]. We assume that the observations are generated from the following system equation: for $n > r$,

$$y_n = \sum_{l=1}^{r} b_l(x_n)\, y_{n-l} + \sigma_{x_n}\, v_n \tag{27}$$

where $v_n$ is Gaussian white noise of variance $1$. We define $b_i \triangleq (b_1(i), \ldots, b_r(i))$ and $\sigma_i^2$, $i = 1, \ldots, m$. In this case, we want to estimate $\theta = (\pi, A, b_{1:m}, \sigma_{1:m}^2)$. We assume that

$$p\left(b_{1:m}, \sigma_{1:m}^2\right) = \prod_{i=1}^{m} p(b_i)\, p\left(\sigma_i^2\right). \tag{28}$$

The AR coefficients are assumed to be uniformly distributed on their domain of stability, which is a reasonable assumption in numerous physical applications, i.e., $b_i \sim \mathcal{U}_{\mathcal{S}_r}$, where $\mathcal{S}_r$ is the set of order-$r$ stationary AR processes. Note, however, that this does not imply that $\{y_n\}$ is stationary [28].
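As a sanity check of the model (27), one can simulate such a process directly. The sketch below is illustrative, with arbitrary two-state parameters of our own choosing.

    import numpy as np

    def simulate_switching_ar(N, A, pi, b, sigma, rng):
        """Simulate a switching AR process with a Markov regime, cf. (27).

        A : (m, m) transition matrix;  pi : (m,) initial distribution
        b : (m, r) AR coefficients per state;  sigma : (m,) noise std deviations
        """
        m, r = b.shape
        x = np.zeros(N, dtype=int)
        y = np.zeros(N)
        x[0] = rng.choice(m, p=pi)
        for n in range(1, N):
            x[n] = rng.choice(m, p=A[x[n - 1]])      # hidden Markov regime
        for n in range(N):
            past = y[max(0, n - r):n][::-1]          # y_{n-1}, ..., y_{n-r}
            y[n] = past @ b[x[n], :len(past)] + sigma[x[n]] * rng.standard_normal()
        return x, y

    rng = np.random.default_rng(1)
    A = np.array([[0.95, 0.05], [0.05, 0.95]])
    b = np.array([[0.5, -0.2], [-0.4, 0.1]])         # two states, AR(2)
    x, y = simulate_switching_ar(500, A, np.array([0.5, 0.5]), b,
                                 np.array([1.0, 2.0]), rng)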

There is an identifiability problem, as the posterior distribution is not modified by a permutation of the labels of the hidden Markov chain. Hence we introduce an additional ordering constraint on the autoregressive coefficients. To simplify notation, we do not enforce this constraint later on. For the variances of the dynamic noises, we set conjugate inverse Gamma prior distributions $\sigma_i^2 \sim \mathcal{IG}(\nu_0/2, \gamma_0/2)$, with $\nu_0, \gamma_0 > 0$ for all $i$, to ensure integrability of the posterior distribution. Note that these priors tend to the improper uninformative Jeffreys' prior as $\nu_0, \gamma_0 \to 0$.

To apply the algorithm described in Section III, one needs to be able to sample from

$$p\left(b_{1:m}, \sigma_{1:m}^2 \mid y_{1:N}, x_{1:N}\right). \tag{29}$$

There are several cases to distinguish.

• In the first case (30), sampling from $p(\sigma_{1:m}^2 \mid y_{1:N}, x_{1:N}, b_{1:m})$ is straightforward [23]. To sample from $p(b_{1:m} \mid y_{1:N}, x_{1:N}, \sigma_{1:m}^2)$ restricted to the stability domain, we use the efficient procedure described in [30].

• In the second case, we define the sufficient statistics in (31), so that (32) holds. Two cases must then be considered. If the noise variance is integrated out, $p(b_i \mid y_{1:N}, x_{1:N})$ is a Student-$t$ distribution restricted to the set $\mathcal{S}_r$ (33); we sample from this distribution by first sampling from a Student-$t$ distribution and then performing a rejection step [23]. Otherwise, we sample from $p(b_i \mid y_{1:N}, x_{1:N}, \sigma_i^2)$ by first sampling from the corresponding Gaussian distribution and then performing a rejection step. In both cases, $p(\sigma_i^2 \mid y_{1:N}, x_{1:N}, b_i)$ is an inverse Gamma distribution from which one can easily sample [23] (34).

For the numerical simulations, the following example is taken from [28, p. 501]. We have simulated 500 realizations of observations of a two-state hidden Markov chain whose parameters are specified in Tables I and II. The prior distribution we chose sets the same hyperparameters $\nu_0$ and $\gamma_0$ for all $i$. Assumptions 1 and 3 are clearly true. We assume that Assumption 2 is valid. One can check that Assumption 4 is satisfied for this prior distribution. In simulations, we obtained a strictly positive average difference $\Delta$, which means that the simulated annealing performs better in finding the global mode of the penalized likelihood.

TABLE I
TRANSITION MATRIX: TRUE VALUES AND SIMULATION RESULTS

TABLE II
AR PARAMETERS: TRUE VALUES AND SIMULATION RESULTS

In Tables I and II, we report the empirical means, biases, and standard deviations of the estimated parameters, defined, e.g., for the noise variances, as

$$\overline{\hat{\sigma}_i^2} \triangleq \frac{1}{500} \sum_{j=1}^{500} \hat{\sigma}_i^{2,(j)}, \qquad \mathrm{bias} \triangleq \overline{\hat{\sigma}_i^2} - \sigma_i^2, \qquad \mathrm{std} \triangleq \left(\frac{1}{500} \sum_{j=1}^{500} \left(\hat{\sigma}_i^{2,(j)} - \overline{\hat{\sigma}_i^2}\right)^2\right)^{1/2}$$

and similarly for the AR coefficients and the transition probabilities. This estimation problem appears easy, both in statistical and optimization terms, as the estimates obtained using the SA and the EM algorithms are similar and very accurate: their bias and variance with respect to the true values are very small. For such a simple estimation problem, there is no significant advantage in using a stochastic algorithm. We present in the next subsection a more complex problem.
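The rejection mechanism described above is simple to implement. The following is a minimal sketch assuming standard conjugate Gaussian/inverse Gamma forms for one regime, with a stability check standing in for the exact restricted sets of (30)–(33); the function names and the flat-prior simplification are ours.

    import numpy as np

    def stable(b):
        """True if all roots of z^r - b_1 z^(r-1) - ... - b_r lie inside the unit circle."""
        return np.all(np.abs(np.roots(np.r_[1.0, -b])) < 1.0)

    def sample_b_sigma2(y_seg, X_seg, b, nu0, gamma0, rng, max_tries=1000):
        """One Gibbs sweep for (b_i, sigma_i^2) with a stability rejection step.

        y_seg, X_seg : observations and lagged regressors for the times spent in state i
        b            : current AR coefficient vector for state i
        """
        resid = y_seg - X_seg @ b
        # sigma2 | b, y : inverse Gamma, drawn as the reciprocal of a Gamma variate.
        sigma2 = (0.5 * (gamma0 + resid @ resid)) / rng.gamma(0.5 * (nu0 + len(y_seg)))
        # b | sigma2, y : Gaussian (flat prior on the stability domain), then reject.
        cov = sigma2 * np.linalg.inv(X_seg.T @ X_seg)
        mean = np.linalg.lstsq(X_seg, y_seg, rcond=None)[0]
        for _ in range(max_tries):
            cand = rng.multivariate_normal(mean, cov)
            if stable(cand):
                return cand, sigma2
        raise RuntimeError("stability rejection failed")

When the current parameters place high posterior mass inside the stability domain, the rejection step terminates quickly; this is the same accept/reject idea as the Student-$t$ case above.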

B. Markov-Modulated Poisson Processes

We consider here the problem of MAP parameter estimation for Markov-modulated Poisson processes [5], [27], [1], [8]. Let us assume that we observe counting data. These counting data are modeled as a Poisson process with a parameter switching according to a hidden Markov chain of unknown statistics. More formally,

$$y_n \mid (x_n = i, \lambda_i) \sim \mathcal{P}(\lambda_i). \tag{35}$$

We want to estimate $\theta = (\pi, A, \lambda_{1:m})$. To complete the Bayesian model, we assume that the parameters $\lambda_i$, $i = 1, \ldots, m$, are a priori independent and have conjugate Gamma priors $\lambda_i \sim \mathcal{G}a(\alpha_0, \beta_0)$, where $\alpha_0 > 0$ and $\beta_0 > 0$. There is an identifiability problem, as the posterior distribution is not modified by a permutation of the labels of the hidden Markov chain. Hence we include an additional constraint, so that

$$\lambda_1 < \lambda_2 < \cdots < \lambda_m. \tag{36}$$

To implement the simulated annealing algorithm described in Section III, it is necessary to sample from $p(\lambda_{1:m} \mid y_{1:N}, x_{1:N})$, where

$$\lambda_i \mid (y_{1:N}, x_{1:N}) \sim \mathcal{G}a\left(\alpha_0 + \sum_{n : x_n = i} y_n,\; \beta_0 + n_i\right) \tag{37}$$

and $n_i$ is the number of time indices $n$ such that $x_n = i$. It is straightforward to sample from this distribution [23].

For the numerical simulations, the following example is taken from [8]. We have simulated 500 realizations of observations of a four-state hidden Markov chain whose parameters are specified in Tables III and IV. The prior distribution we chose is quite uninformative, as we set small values of $\alpha_0$ and $\beta_0$ for all $i$. This example is known to be difficult from a statistical point of view [8]. Assumptions 1 and 3 are clearly true. Assumption 2 is supposed valid. One can check that Assumption 4 is satisfied for this prior distribution. In simulations, the following results have been obtained.

TABLE III
TRANSITION MATRIX

TABLE IV
POISSON PROCESS INTENSITIES: TRUE VALUES AND SIMULATION RESULTS
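A sketch of the conditional draw (37), with a simple rejection step to enforce the ordering constraint (36), follows; the rejection mechanism and the stand-in data are our illustrative choices.

    import numpy as np

    def sample_intensities(y, x, m, alpha0, beta0, rng, max_tries=1000):
        """Draw lambda_1 < ... < lambda_m from the Gamma full conditionals (37),
        rejecting draws that violate the ordering constraint (36)."""
        counts = np.bincount(x, minlength=m)               # n_i = #{n : x_n = i}
        sums = np.bincount(x, weights=y, minlength=m)      # sum of y_n over {n : x_n = i}
        for _ in range(max_tries):
            # numpy's gamma uses shape/scale, so the rate beta0 + n_i is inverted.
            lam = rng.gamma(alpha0 + sums, 1.0 / (beta0 + counts))
            if np.all(np.diff(lam) > 0):
                return lam
        raise RuntimeError("ordering rejection failed")

    rng = np.random.default_rng(2)
    x = rng.integers(0, 4, size=1000)
    y = rng.poisson(1.0 + 2.0 * x)                         # stand-in counting data
    lam = sample_intensities(y, x, 4, alpha0=1.0, beta0=1.0, rng=rng)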

For the log-posterior distribution, we again obtained a strictly positive average difference $\Delta$, which means that the SA algorithm outperforms EM in terms of maximization. Table IV presents the empirical means, biases, and standard deviations of the estimated parameters, defined as in Section V-A. For this difficult statistical estimation problem, the estimates obtained using SA are significantly closer to the true values than those obtained using the EM algorithm, and the variance is reduced.

VI. CONCLUSION

In this paper, we have proposed an original simulated annealing algorithm to obtain MAP estimates of the parameters of HMMs. This algorithm relies on stochastic simulation of the hidden Markov chain. Under suitable regularity assumptions, the simulation of this finite missing data set gives uniform ergodicity of the inhomogeneous transition kernel. This property is used to obtain sufficient conditions that ensure convergence of the algorithm toward the set of MAP parameters, whereas the EM algorithm is only ensured to converge toward a stationary point of the posterior distribution. This algorithm has been applied to maximum a posteriori parameter estimation for switching autoregressions with a Markov regime and for Markov-modulated Poisson processes. It might be a good alternative to the EM algorithm for complex statistical problems, as it has an equivalent computational complexity at each iteration and appears in practice to be less sensitive to initialization.

APPENDIX A
NOTATION

See the Notation table at the top of this page.

APPENDIX B
FORWARD FILTERING–BACKWARD SAMPLING RECURSION

To sample from $p(x_{1:N} \mid y_{1:N})$, note that it can be decomposed as follows:

$$p(x_{1:N} \mid y_{1:N}) = p(x_N \mid y_{1:N}) \prod_{n=1}^{N-1} p(x_n \mid y_{1:n}, x_{n+1}) \tag{38}$$

where $p(x_n \mid y_{1:n}, x_{n+1}) \propto p(x_n \mid y_{1:n})\, a_{x_n x_{n+1}}$. Given $x_{n+1}$, $p(x_n \mid y_{1:n}, x_{n+1})$ is a discrete distribution. This suggests the following algorithm to sample from $p(x_{1:N} \mid y_{1:N})$ [5], [22].

1) Optimal filter (forward filtering): For $n = 1, \ldots, N$, compute the optimal filter [31]. For any $i \in \{1, \ldots, m\}$, evaluate the prediction step

$$p(x_n = i \mid y_{1:n-1}) = \sum_{j=1}^{m} a_{ji}\, p(x_{n-1} = j \mid y_{1:n-1}) \tag{39}$$

(with the convention $p(x_1 = i \mid y_{1:0}) \triangleq \pi_i$) and the update step

$$p(x_n = i \mid y_{1:n}) = \frac{f(y_n \mid \phi_i)\, p(x_n = i \mid y_{1:n-1})}{\displaystyle\sum_{j=1}^{m} f(y_n \mid \phi_j)\, p(x_n = j \mid y_{1:n-1})} \tag{40}$$

and store $p(x_n = i \mid y_{1:n})$ for $n = 1, \ldots, N$ and $i = 1, \ldots, m$.

2) Backward sampling: Sample $x_N$ from $p(x_N \mid y_{1:N})$; then, for $n = N - 1, \ldots, 1$, sample $x_n$ from

$$p(x_n \mid y_{1:n}, x_{n+1}) = \frac{p(x_n \mid y_{1:n})\, a_{x_n x_{n+1}}}{\displaystyle\sum_{j=1}^{m} p(x_n = j \mid y_{1:n})\, a_{j x_{n+1}}}. \tag{41}$$
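A compact implementation of this recursion for a generic emission density follows (a sketch in our notation, where emission[n, i] stands for $f(y_n \mid \phi_i)$). It also accumulates the log-likelihood from the denominators of (40), anticipating Appendix C.

    import numpy as np

    def ffbs_sample(emission, A, pi, rng):
        """Forward filtering-backward sampling, cf. (38)-(41).

        emission : (N, m) array with emission[n, i] = f(y_n | phi_i)
        Returns a draw x from p(x_{1:N} | y_{1:N}) and log p(y_{1:N} | theta).
        """
        N, m = emission.shape
        filt = np.zeros((N, m))
        loglik = 0.0
        pred = pi                                   # p(x_1 = i), cf. the convention above
        for n in range(N):
            unnorm = emission[n] * pred             # numerator of (40)
            c = unnorm.sum()                        # denominator of (40): p(y_n | y_{1:n-1})
            filt[n] = unnorm / c
            loglik += np.log(c)
            pred = filt[n] @ A                      # prediction step (39)
        x = np.zeros(N, dtype=int)
        x[-1] = rng.choice(m, p=filt[-1])
        for n in range(N - 2, -1, -1):              # backward sampling, cf. (41)
            w = filt[n] * A[:, x[n + 1]]
            x[n] = rng.choice(m, p=w / w.sum())
        return x, loglik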

APPENDIX C
EVALUATION OF THE LIKELIHOOD

The likelihood $p(y_{1:N} \mid \theta)$ can be decomposed as follows:

$$p(y_{1:N} \mid \theta) = \prod_{n=1}^{N} p(y_n \mid y_{1:n-1}, \theta) \tag{42}$$

where

$$p(y_n \mid y_{1:n-1}, \theta) = \sum_{j=1}^{m} f(y_n \mid \phi_j)\, p(x_n = j \mid y_{1:n-1}, \theta) \tag{43}$$

so that

$$\log p(y_{1:N} \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid y_{1:n-1}, \theta). \tag{44}$$

Equation (44) is evaluated when computing the forward filtering–backward sampling recursion in Appendix B: each term (43) is the denominator of (40). This evaluation of the likelihood was first proposed by Devijver [32], who suggests that it is numerically more robust than Baum's forward–backward procedure.
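In the sketch of Appendix B above, this quantity is exactly the loglik value accumulated from the normalizing constants. For instance, with Gaussian emissions one could evaluate (44) as follows, reusing the ffbs_sample function from the Appendix B sketch; the parameter values are illustrative.

    import numpy as np

    def gaussian_emissions(y, means, sds):
        """emission[n, i] = N(y_n; means[i], sds[i]^2), for use with ffbs_sample."""
        y = np.asarray(y)[:, None]
        return np.exp(-0.5 * ((y - means) / sds) ** 2) / (np.sqrt(2 * np.pi) * sds)

    rng = np.random.default_rng(3)
    y = rng.normal(0.0, 1.0, size=200)                     # stand-in data
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    emission = gaussian_emissions(y, np.array([0.0, 3.0]), np.array([1.0, 1.0]))
    x, loglik = ffbs_sample(emission, A, np.array([0.5, 0.5]), rng)
    print(loglik)                                          # log p(y_{1:N} | theta), cf. (44)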

APPENDIX D
PROOFS

Proof of Proposition 1: The transition kernel satisfies, for the state sequence $x^{\star}_{1:N}$ of Assumption 3 and any $\theta, \theta' \in \Theta$,

$$q(\theta' \mid \theta) = \sum_{x_{1:N}} p(\theta' \mid y_{1:N}, x_{1:N})\, p(x_{1:N} \mid y_{1:N}, \theta) \geq p(\theta' \mid y_{1:N}, x^{\star}_{1:N})\, p(x^{\star}_{1:N} \mid y_{1:N}, \theta). \tag{45}$$

From Assumptions 1 and 3, there exists $\epsilon > 0$ such that $p(x^{\star}_{1:N} \mid y_{1:N}, \theta) \geq \epsilon$ for any $\theta \in \Theta$, and this quantity is continuous in $\theta$. Denoting $s(\theta') \triangleq p(\theta' \mid y_{1:N}, x^{\star}_{1:N})$, which is a probability density on $\Theta$, (16) follows from (45).

Proof of Lemma 2: Under Assumption 2, $\sup_{\theta \in \Theta} p(\theta \mid y_{1:N}) = p(\theta^{\star} \mid y_{1:N}) < +\infty$, and, since the "temperature" $T_i$ decreases toward $0$, there exists $i_0$ such that, for any $i \geq i_0$, $T_i \leq 1$ and

$$\int_{\Theta} p(\theta \mid y_{1:N})^{1/T_i}\, d\theta \leq p(\theta^{\star} \mid y_{1:N})^{1/T_i - 1} \int_{\Theta} p(\theta \mid y_{1:N})\, d\theta < +\infty$$

so that the density $\pi_i$ defined in (17) is well defined. Then we note that $K_i$ is nothing but a Metropolis–Hastings algorithm with proposal distribution $q$ and invariant density $\pi_i$ [15], [16]. Indeed, the acceptance probability of such a Metropolis–Hastings algorithm satisfies

$$\min\left\{1, \frac{\pi_i(\theta')\, q(\theta \mid \theta')}{\pi_i(\theta)\, q(\theta' \mid \theta)}\right\} = \min\left\{1, \left(\frac{p(\theta' \mid y_{1:N})}{p(\theta \mid y_{1:N})}\right)^{1/T_i - 1}\right\}$$

because of Lemma 1. This is expression (6).

Proof of Proposition 2: From Proposition 1 and the form (13) of $K_i$, there exist $\epsilon_i > 0$ and a probability density $s_i$ on $\Theta$ such that, for any $\theta \in \Theta$ and $B \in \mathcal{B}(\Theta)$,

$$K_i(\theta, B) \geq \int_{B} \alpha_{T_i}(\theta, \vartheta)\, q(\vartheta \mid \theta)\, d\vartheta \geq \epsilon_i \int_{B} s_i(\vartheta)\, d\vartheta.$$

From [16, Theorem 1, p. 1758], this implies that $K_i$ is $\pi_i$-irreducible, as it is $s_i$-irreducible, as a result of Lemma 2. Furthermore, $K_i$ is clearly strongly aperiodic [24, p. 118]. Thus the Markov transition kernel $K_i$ is uniformly ergodic [16, pp. 1714–1716], [24, Ch. 16].

Proof of Lemma 3: This is an application of Laplace's formula; see, for example, [25]. The result follows under Assumptions 1 and 2.

Proof of Proposition 3: The proof is similar to the one of Haario and Saksman [19]. In applying the Lebesgue dominated convergence theorem, which allows for differentiation under the integral sign, Haario and Saksman assume that the quantity to be optimized is bounded above on the whole space. Here, instead, we use Assumption 4. First note that, for $i$ sufficiently large and all $\theta \in \Theta$,

$$p(\theta \mid y_{1:N})^{1/T_i} \leq p(\theta \mid y_{1:N})\; p(\theta^{\star} \mid y_{1:N})^{1/T_i - 1}$$

which, under Assumption 4 and using the dominated convergence theorem, allows us to differentiate $T \mapsto \int_{\Theta} p(\theta \mid y_{1:N})^{1/T}\, d\theta$ under the integral sign. This is the starting point of the proof of [19, Theorem 3.2], which is then the same.

Proof of Proposition 4: We have, for any positive integers $j < i$,

$$\left\| P^{(i)} - \pi_i \right\| \leq \left\| P^{(j)} K^{(j,i)} - \pi_j K^{(j,i)} \right\| + \left\| \pi_j K^{(j,i)} - \pi_i \right\|.$$

Using the straightforward generalization to inhomogeneous Markov chains of [24, Theorem 16.2.4, pp. 392–393], we obtain from Proposition 2

$$\left\| P^{(j)} K^{(j,i)} - \pi_j K^{(j,i)} \right\| \leq \prod_{k=j+1}^{i} (1 - \epsilon_k)$$

and one can verify that

$$\left\| \pi_j K^{(j,i)} - \pi_i \right\| \leq \sum_{k=j}^{i-1} \left\| \pi_{k+1} - \pi_k \right\|.$$

Thus (22) follows.

Proof of Lemma 4: Noting that $\log(1 - \epsilon_k) \leq -\epsilon_k$, we have

$$\prod_{k=j+1}^{i} (1 - \epsilon_k) = \exp\left(\sum_{k=j+1}^{i} \log(1 - \epsilon_k)\right) \leq \exp\left(-\sum_{k=j+1}^{i} \epsilon_k\right) \xrightarrow[i \to +\infty]{} 0$$

which is (23).

Proof of Lemma 5: From Proposition 3 and Lemma 3, the series $\sum_{k} \|\pi_{k+1} - \pi_k\|$ converges; hence, for any $\eta > 0$, its tail can be made smaller than $\eta$ by choosing $j$ large enough, which is (24).

Proof of Theorem 1: From Proposition 4, we have, for any nondecreasing integer sequence $(j(i))$ with $j(i) < i$,

$$\left\| P^{(i)} - \pi_i \right\| \leq \prod_{k=j(i)+1}^{i} (1 - \epsilon_k) + \sum_{k=j(i)}^{i-1} \left\| \pi_{k+1} - \pi_k \right\|.$$

Choosing $j(i) \to +\infty$ slowly enough, the first term converges to zero by Lemma 4 and the second term by Lemma 5; the theorem is proved. Equation (26) is straightforwardly obtained from [19, Corollary 5.4, p. 880].

ACKNOWLEDGMENT

The authors would like to acknowledge R. A. Cook, L. Davis, S. J. Godsill, and D. Leporini for various helpful comments on a preliminary version of this work.

REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, 1989.
[2] G. K. Kaleh and R. Vallet, "Joint parameter estimation and symbol detection for linear and nonlinear unknown channels," IEEE Trans. Commun., vol. 42, pp. 2406–2413, 1994.
[3] R. Chen and T. Li, "Blind restoration of linearly degraded discrete signals by Gibbs sampling," IEEE Trans. Signal Processing, vol. 43, pp. 2410–2413, 1995.
[4] O. Cappé, A. Doucet, M. Lavielle, and É. Moulines, "Simulation-based methods for blind maximum-likelihood filter identification," Signal Processing, vol. 73, no. 1, pp. 3–25, 1999.
[5] S. Chib, "Calculating posterior distributions and modal estimates in Markov mixture models," J. Econometrics, vol. 75, pp. 79–97, 1996.
[6] W. Qian and D. M. Titterington, "Estimation of parameters in hidden Markov models," Phil. Trans. Roy. Soc. London A, vol. 337, pp. 407–428, 1991.
[7] C. P. Robert, G. Celeux, and J. Diebolt, "Bayesian estimation of hidden Markov models: A stochastic implementation," Statist. Probab. Lett., vol. 16, pp. 77–83, 1993.
[8] C. P. Robert and D. M. Titterington, "Reparameterization strategies for hidden Markov models and Bayesian approaches to maximum-likelihood estimation," Statist. Comp., vol. 8, pp. 145–158, 1998.
[9] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. New York: Wiley, 1985.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1–38, 1977.

[11] G. Celeux and J. Diebolt, "The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem," Comput. Statist. Quart., vol. 2, pp. 73–82, 1985.
[12] G. C. G. Wei and M. A. Tanner, "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm," J. Amer. Statist. Assoc., vol. 85, pp. 699–704, 1990.
[13] J. Diebolt and E. H. S. Ip, "Stochastic EM: Method and application," in Markov Chain Monte Carlo in Practice, W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Eds. London, U.K.: Chapman & Hall, 1996.
[14] J. Diebolt and C. P. Robert, "Estimation of finite mixture distributions through Bayesian sampling," J. Roy. Statist. Soc. B, vol. 56, pp. 363–375, 1994.
[15] S. Chib and E. Greenberg, "Understanding the Metropolis–Hastings algorithm," Amer. Statistician, vol. 49, pp. 327–335, 1995.
[16] L. Tierney, "Markov chains for exploring posterior distributions," Ann. Statist., vol. 22, pp. 1701–1762, 1994.
[17] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli, "Convergence and finite-time behavior of simulated annealing," Adv. Appl. Probab., vol. 18, pp. 747–771, 1986.
[18] P. J. Van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and Applications. Amsterdam, The Netherlands: Reidel, 1987.
[19] H. Haario and E. Saksman, "Simulated annealing in general state space," Adv. Appl. Probab., vol. 23, pp. 866–893, 1991.
[20] M. N. M. Van Lieshout, "Stochastic annealing for nearest-neighbor point processes with application to object recognition," Adv. Appl. Probab., vol. 26, pp. 281–300, 1994.
[21] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. New York: Wiley, 1995.
[22] C. K. Carter and R. Kohn, "On Gibbs sampling for state space models," Biometrika, vol. 81, pp. 541–553, 1994.
[23] L. Devroye, Non-Uniform Random Variate Generation. New York: Springer, 1986.
[24] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. New York: Springer-Verlag, 1993.
[25] C. Hwang, "Laplace's method revisited: Weak convergence of probability measures," Ann. Probab., vol. 8, pp. 1177–1182, 1980.
[26] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, pp. 721–741, 1984.
[27] B. G. Leroux and M. L. Puterman, "Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models," Biometrics, vol. 48, pp. 545–558, 1992.
[28] U. Holst, G. Lindgren, J. Holst, and M. Thuvesholmen, "Recursive estimation in switching autoregressions with a Markov regime," J. Time Ser. Anal., vol. 15, pp. 489–506, 1994.
[29] A. B. Poritz, "Linear predictive hidden Markov models and the speech signal," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Paris, France, 1982, pp. 1291–1294.
[30] S. Chib, "Bayesian estimation of regressions with autoregressive errors: A Gibbs sampling approach," J. Econometrics, vol. 58, pp. 275–294, 1993.
[31] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979.
[32] P. A. Devijver, "Baum's forward–backward algorithm revisited," Pattern Recogn. Lett., vol. 3, pp. 369–373, 1985.