Signal Processing 51 (1996) 55-64

Recursive Kalman-type optimal estimation and detection of hidden Markov chains

Enzo Baccarelli, Roberto Cusani*

INFOCOM Department, University of Rome 'La Sapienza', Via Eudossiana 18, Rome 00184, Italy

Received 31 May 1995; revised 22 January 1996
Abstract
Efficient algorithms for computing the 'a posteriori' probabilities (APPs) of discrete-index, finite-state hidden Markov sequences are proposed. They are obtained by reducing the APP computation to the optimal nonlinear minimum mean square error (MMSE) estimation of the noisily observed sequences of the indicator functions associated with the chain states. Following an innovations approach, finite-dimensional and recursive Kalman-like 'filter' and 'smoothers' for the Markov chain state sequence are thus obtained, and exact expressions of their MSE performance are given. The filtered and smoothed state estimates coincide with the corresponding APP sequences. Finite-dimensional MMSE nonlinear filter and smoothers are also given for the so-called 'number of jumps' and for the 'occupation time' processes associated with the Markov state sequence.
*Corresponding author. Tel.: +39 6 44585 859; fax: +39 6 4873300; e-mail: [email protected].
Keywords: Hidden Markov models; Innovations approach; Multivariate point process; Recursive nonlinear finite-dimensional detectors
1. Introduction

Hidden Markov models (HMMs) are commonly employed as a powerful signal processing tool in many application fields such as speech recognition [11], communication systems, biological signal processing, etc. In particular, they are highly successful in automatic speech recognition, due to their ability to model the statistical variations of the signal (see, e.g., the recent contributions in [7, 14]).

In dealing with HMMs, a fundamental point is to develop efficient sequential processing algorithms whose complexity does not grow with the length of the data, so as to minimize the computational cost and the memory requirements. This is the case, for example, in HMM identification [6] or in the estimation of the chain states from noisy observations. A classic solution for this last problem is offered by the well-known Viterbi algorithm (VA) [9], which satisfies the specific optimality criterion of minimizing the state error probability over the whole state sequence. However, in some applications a performance better than that of the VA can be obtained by assuming different optimality criteria, such as the minimization of the single-state error probability [5]. In principle, a fully general approach to the detection of hidden Markov sequences should consist in the preliminary calculation of the APPs associated with the chain states; from these, every suitable decision statistic can be derived, depending on the optimality criterion assumed for the particular application.

In this paper an exhaustive set of recursive algorithms for the exact computation of the APPs of a hidden Markov chain is presented. Following an innovations approach [12, 13], the APP calculation is reduced to the optimal nonlinear MMSE estimation (filtering and smoothing) of the sequence of the indicator functions associated with the states of the Markov chain, on the basis of the available realization of the noisy observation process, suitably represented as a discrete-index multivariate point process [4, Chapter 2]. The proposed estimators are nonlinear, recursive and finite-dimensional (in the sense of [13]). In particular, the filter for the state sequence is formally similar to the classic Kalman filter for the MMSE estimation of Gauss-Markov processes in additive white Gaussian noise (AWGN). From this point of view, it can be considered as a direct extension of the standard approaches developed in [5, 8, 15] for the detection of continuous-time Markov chains impaired by AWGN.

The paper is organized as follows. In Section 2 suitable innovations representations are introduced for the hidden Markov chain. They are employed in Section 3 to derive the MMSE nonlinear filter, from which the optimal maximum 'a posteriori' probability (MAP) detector is obtained. Its complexity is compared to the real-time (i.e., with no decision delay) version of the VA. The MSE performance of the filter is also given through both recursive and on-line expressions. On the basis of the results obtained for the filter, general formulas for the smoother and its MSE performance are given in Section 4. Explicit recursive expressions for the three fundamental types of smoothers (fixed-point, fixed-interval, fixed-lag) are also reported. In Section 5 the finite-dimensional MMSE nonlinear filter and smoothers for the 'number of jumps' and for the 'occupation time' processes associated with the state sequence of the hidden Markov chain are also presented. Examples and conclusions are finally given in Section 6. Applications of the proposed algorithms in the digital communication field have been recently developed by the authors [2, 3].
2. Statement of the problem and innovations representation for hidden Markov chains

Let us consider a first-order Markov chain with N (generally complex) states {x(k) in {alpha_1, ..., alpha_N}, k >= 1}, defined on a given probability space (Omega, B, P) and characterized by the state transition probability matrix sequence {A(k) in [0, 1]^(N x N), k >= 1}, with elements a_ij(k) = P(x(k+1) = alpha_i | x(k) = alpha_j), 1 <= i, j <= N, and by the initial probability vector pi(1). In the sequel the chain state is identified with the N-dimensional vector x(k) of the indicator functions of the events {x(k) = alpha_i}, 1 <= i <= N, so that x(k) takes values in the orthonormal basis U = {u_1, ..., u_N} of R^N (see Section 3.1).

The chain is observed through a discrete memoryless channel whose output {y(k)} takes on M values, characterized by the matrix sequence {Q(k)}, with elements q_ij(k) = P(y(k) = d_i | x(k) = alpha_j), 1 <= i <= M, 1 <= j <= N. The M output values are represented by the vectors

d_1^T = [1 0 ... 0 0], ..., d_(M-1)^T = [0 0 ... 0 1], d_M^T = [0 0 ... 0 0], (1)

constituting a set of orthogonal (but non-orthonormal) elements of the Euclidean space R^(M-1). In this way, the observed process {y(k) in D} can be conveniently represented as an (M-1)-variate discrete-time point process. Moreover, the choice of the state space in (1) leads to generally nonsingular expressions for the gain coefficients of the estimators (see Section 3); this constitutes one of the main reasons for this choice in the present work.

After introducing the source and observed ordered sequences from step 1 to step k as X_1^k = {x(1), ..., x(k)} and Y_1^k = {y(1), ..., y(k)}, respectively, the 'innovations sequence' {w(k+1) in R^N, k >= 1} [12] pertaining to {x(k)} with respect to (w.r.t.) X_1^k and Y_1^k is defined as

w(k+1) = x(k+1) - E{x(k+1) | X_1^k, Y_1^k}, k >= 1, (2)

so that

E{x(k+1) | X_1^k, Y_1^k} = E{x(k+1) | x(k)} = A(k)x(k), k >= 1. (3)

In the same way, the innovations sequence {nu(k) in R^(M-1), k >= 1} pertaining to the process {y(k)} w.r.t. the sequences X_1^k and Y_1^(k-1) can be defined as

nu(k) = y(k) - E{y(k) | X_1^k, Y_1^(k-1)}, k >= 1, (4)

so that the following relationship holds:

E{y(k) | X_1^k, Y_1^(k-1)} = E{y(k) | x(k)} = Q^(1)(k)x(k), k >= 1, (5)

where Q^(1)(k) denotes the (M-1) x N matrix obtained from Q(k) by neglecting the last row. Therefore, the innovations representations (also called 'semimartingale representations' or 'Doob-Meyer decompositions'; see, e.g., [10, Chapter 2]) pertaining to {x(k)} and {y(k)} w.r.t. X_1^k, Y_1^k and X_1^k, Y_1^(k-1), respectively, take on the forms [1]

x(k+1) = A(k)x(k) + w(k+1), k >= 1, (6)

y(k) = Q^(1)(k)x(k) + nu(k), k >= 1. (7)
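To make this setting concrete, the following minimal NumPy sketch (the specific sizes and the values of A and Q are illustrative, not from the paper) generates a hidden Markov chain in the indicator-vector representation and maps each observation onto the point-process vectors d_i of Eq. (1); with M output symbols, y(k) lives in R^(M-1) and the Mth symbol is encoded by the zero vector d_M:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 4          # illustrative sizes: N chain states, M output symbols
# Column-stochastic transition matrix: A[i, j] = P(x(k+1)=alpha_i | x(k)=alpha_j)
A = np.array([[0.90, 0.05, 0.10],
              [0.05, 0.90, 0.10],
              [0.05, 0.05, 0.80]])
# Observation matrix: Q[i, j] = P(y(k) = d_i | x(k) = alpha_j)
Q = np.array([[0.70, 0.10, 0.10],
              [0.10, 0.70, 0.10],
              [0.10, 0.10, 0.70],
              [0.10, 0.10, 0.10]])
# Point-process state space of Eq. (1): rows d_1, ..., d_M of R^(M-1);
# d_M is the all-zero vector, so D has shape (M, M-1).
D = np.vstack([np.eye(M - 1), np.zeros(M - 1)])

def simulate(L, pi1):
    """Return indicator states X (L x N) and observations Y (L x (M-1))."""
    X = np.zeros((L, N)); Y = np.zeros((L, M - 1))
    s = rng.choice(N, p=pi1)
    for k in range(L):
        X[k, s] = 1.0
        Y[k] = D[rng.choice(M, p=Q[:, s])]   # channel draw, noise source of Eq. (7)
        s = rng.choice(N, p=A[:, s])         # Markov transition, Eq. (6)
    return X, Y

X, Y = simulate(200, np.full(N, 1 / N))
```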
3. Nonlinear Kalman-like filter and its MSE performance
Let us denote by {pi(k|k) in [0, 1]^N, k >= 1} the APP sequence of the process {x(k)} conditional on the observed sequence Y_1^k; from the previous section it coincides with the sequence of expected values of {x(k)} w.r.t. Y_1^k, i.e.,

pi(k|k) = x^(k|k) = E{x(k) | Y_1^k}, k >= 1. (8)

The 'estimate innovations sequence' {mu(k) in R^N, k >= 1} and the 'observation innovations sequence' {lambda(k) in R^(M-1), k >= 1} (the latter also called the 'fundamental Martingale Difference (MD) of the observations' [4]) are respectively defined as

mu(k) = E{x(k) | Y_1^k} - E{x(k) | Y_1^(k-1)} = x^(k|k) - x^(k|k-1), k >= 1, (9)

lambda(k) = y(k) - E{y(k) | Y_1^(k-1)} = y(k) - y^(k|k-1), k >= 1. (10)

Both {mu(k)} and {lambda(k)} are MD sequences (in the sense of [12]) w.r.t. {Y_1^k}.
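As a quick numerical illustration of the MD property (a continuation of the earlier sketch, not part of the paper): at k = 1 the prediction is y^(1|0) = Q^(1)pi(1), so the sample mean of y(1) over many independent runs should match Q^(1)pi(1), i.e., E{lambda(1)} = 0.

```python
# Monte Carlo check that lambda(1) = y(1) - y^(1|0) has zero mean.
pi1 = np.full(N, 1 / N)
Q1 = Q[:-1, :]                           # Q^(1): Q with its last row removed
y1 = np.array([simulate(1, pi1)[1][0] for _ in range(50_000)])
print(y1.mean(axis=0) - Q1 @ pi1)        # ~ 0, within Monte Carlo error
```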
So, the well-known MD-representation theorem [12] allows us to write directly

mu(k) = G(k; Y_1^k)lambda(k), k >= 1, (11)

where G(k; Y_1^k) is a time-varying gain which, in general, depends on the observed sequence up to step k. If G(k; Y_1^k) is not predictable (in the sense of [12]) w.r.t. Y_1^k, it cannot be precomputed on the basis of the observed sequence Y_1^(k-1) up to step k-1, so that the corresponding estimator is not recursive. However, in our case the observation process takes on a finite number of states only; as a consequence, it can be represented as a discrete-time (M-1)-variate point process and the gain G(k; Y_1^k) in (11) proves to be predictable w.r.t. Y_1^k. This property is proved in Appendix A and generalizes some results known in the literature for the case of univariate point processes [12, 13]. Therefore, Eq. (11) can be rewritten as

mu(k) = G(k; Y_1^(k-1))lambda(k), k >= 1, (12)

from which it follows that

G(k; Y_1^(k-1)) = E{mu(k)lambda^T(k) | Y_1^(k-1)}[E{lambda(k)lambda^T(k) | Y_1^(k-1)}]^(-1), k >= 1. (13)

From the representations (6) and (7) and from the properties of the innovations sequences in (2)-(4) we have

x^(k|k-1) = E{x(k) | Y_1^(k-1)} = A(k-1)x^(k-1|k-1), k >= 2, (14)

with the initial condition

x^(1|0) = E{x(1)} = pi(1), (15)

and also

y^(k|k-1) = E{y(k) | Y_1^(k-1)} = Q^(1)(k)x^(k|k-1), k >= 1. (16)

The conditional second-order statistics present in (13) can be expressed as [1]

[E{lambda(k)lambda^T(k) | Y_1^(k-1)}]^(-1) = (1 - sum_{j=1}^{M-1} y^_j(k|k-1))^(-1)
  | rho_11(k)   1    ...   1              |
  | 1    rho_22(k)   ...   1              |
  | ...         ...  ...   ...            |
  | 1           1    ...   rho_(M-1,M-1)(k) |,  k >= 1, (17)

where

rho_ii(k) = 1 + (1 - sum_{j=1}^{M-1} y^_j(k|k-1)) / y^_i(k|k-1), 1 <= i <= M-1, (18)

and

E{mu(k)lambda^T(k) | Y_1^(k-1)} = diag{x^(k|k-1)}Q^(1)T(k) - x^(k|k-1)y^T(k|k-1), k >= 1 (19)

(in the above equations and in the sequel, z_i represents the ith element of the vector z and diag{z} is defined as the diagonal square matrix with elements given by the components of the vector z). Therefore, Eq. (12), rewritten as

x^(k|k) = x^(k|k-1) + G(k; Y_1^(k-1))[y(k) - y^(k|k-1)], k >= 1, (20)
gives, together with (14)-(18), a recursive exact algorithm for the computation of the APP vector defined in (8) by means of a finite number of relationships. It is underlined here that (20) is not a standard linear Kalman filter and that G(k; Y_1^(k-1)) is not a standard Kalman gain (in fact, it depends on the observations). From the properties of the matrix Q(k), it follows that the filter in (20) is numerically stable. Furthermore, as a direct consequence of the state space assumed in (1), the expression of the gain G(k; Y_1^(k-1)) as given by (17)-(19) is, in general, not singular. On the contrary, using for the state space of the observed process {y(k)} an orthonormal basis of R^M systematically leads to singular expressions for the calculation of the gain sequence. This is the case of the approach followed in [16], based on a direct generalization of the results in [13] and applied to the problem of tracking manoeuvring targets. As emphasized by the same authors, the results of [16] could not be extended directly to the problems addressed in the present paper, where the matrix Q^T(k) (corresponding to the matrix A^T(k) of [16]) is a stochastic matrix.
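The recursion (14)-(20) is compact enough to state directly in code. The following sketch (a continuation of the earlier snippets, with our own variable names and the illustrative time-invariant A and Q) propagates the APP vector x^(k|k), building the gain from the second-order statistics (19); for brevity it uses a generic matrix inverse instead of the closed form (17)-(18), at the cost of the computational savings discussed in Section 3.2 below:

```python
def app_filter(Y, A, Q, pi1):
    """Recursive APP filter of Eqs. (14)-(20). Y: (L, M-1) point-process obs."""
    N = A.shape[0]; Q1 = Q[:-1, :]
    L = Y.shape[0]
    APP = np.zeros((L, N))
    x_pred = pi1.copy()                         # x^(1|0) = pi(1), Eq. (15)
    for k in range(L):
        y_pred = Q1 @ x_pred                    # y^(k|k-1), Eq. (16)
        # Cross- and auto-covariances of the innovations, Eq. (19) and (17):
        C_mu_lam = np.diag(x_pred) @ Q1.T - np.outer(x_pred, y_pred)
        C_lam = np.diag(y_pred) - np.outer(y_pred, y_pred)
        # Gain, Eq. (13); C_lam is invertible here because the last row of Q
        # is nonzero (the nonsingularity property discussed above).
        G = C_mu_lam @ np.linalg.inv(C_lam)
        x_filt = x_pred + G @ (Y[k] - y_pred)   # measurement update, Eq. (20)
        APP[k] = x_filt
        x_pred = A @ x_filt                     # time update, Eq. (14)
    return APP

APP = app_filter(Y, A, Q, np.full(N, 1 / N))
print(np.allclose(APP.sum(axis=1), 1.0))        # APPs sum to one (Section 3.2)
```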
3.1. Estimators for the states of the hidden Markov chain
The probability vector pi(k|k) = x^(k|k) is computed in (20) as the MMSE estimate of the N-dimensional random vector constituting the determinations of the indicator functions of the atomic events x(k) = alpha_i, 1 <= i <= N. Eq. (20) is formally similar to the classic Kalman filter and can be considered as the discrete-time counterpart (for discrete memoryless channels) of the 'Wonham filter' [5, 15], originally derived for the detection of continuous-time Markov chains crossing AWGN channels. It can be noted that the 'estimator' in (20) is nonlinear w.r.t. the observations, because the gain G(k; Y_1^(k-1)) does depend on the available observations Y_1^(k-1).

A suitable (linear or nonlinear) memoryless transformation of the APP vector pi(k|k) allows computation of any desired decision statistic, thus giving the real-time state estimate of the hidden Markov chain {x(k)} under any desired optimality criterion. For example, the MMSE and MAP estimates of x(k) based on Y_1^k are, respectively, given by

x^_MMSE(k|k) = sum_{i=1}^{N} alpha_i x^_i(k|k), (21)

x^_MAP(k|k) = alpha_j iff x^_j(k|k) >= x^_i(k|k) for every i != j. (22)

3.2. Computational issues

The mutual orthogonality of the elements of the state space employed in (1) allows us to obtain the inverse matrix in (17) by simply computing the reciprocals of M scalar quantities, thus drastically reducing the computational burden. Moreover, the sum of the elements of the vector x^(k|k) in (20) is unity, so that only N-1 components of x^(k|k) need to be effectively calculated. Therefore, Eq. (20) gives a minimal solution to the general problem at hand, because it constitutes a set of N scalar relationships and the APP vector pi(k|k) in (8) is indeed N-dimensional. In particular, the resulting MAP filter of (22) exhibits a computational complexity of the same order as that of the VA, when the latter operates with no decision delay. In fact, it can be verified that the number of arithmetic operations required at step k to compute x^_MAP(k|k) is proportional to the number of nonzero elements in the matrix A(k). Compared to the proposed procedure, the VA must additionally perform N maximum searches at each step to detect the surviving paths [9]. Moreover, in both cases N memory cells are necessary to store, step by step, the computed statistics.
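In code, the memoryless transformations (21) and (22) of the APP vector are one-liners (continuing the sketch above; the vector alpha of state values is illustrative):

```python
alpha = np.array([-1.0, 0.0, 1.0])       # illustrative state values alpha_i
x_mmse = APP @ alpha                     # Eq. (21): E{x(k) | Y_1^k} per step
x_map  = alpha[np.argmax(APP, axis=1)]   # Eq. (22): argmax of the APP vector
```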
3.3. MSE filter performance

The MSE performance of the filter in (20) can be exactly computed step by step (i.e., for every value of the index k) by means of expressions based on the knowledge of pi(1), {A(k)} and {Q(k)}. In particular, the conditional covariance matrices of the filtering error and of the one-step prediction error can be directly computed on-line from (20), respectively, as

S(k|k) = E{[x(k) - x^(k|k)][x(k) - x^(k|k)]^T | Y_1^k} = diag{x^(k|k)} - x^(k|k)x^T(k|k), k >= 1, (23)

S(k|k-1) = E{[x(k) - x^(k|k-1)][x(k) - x^(k|k-1)]^T | Y_1^(k-1)} = diag{x^(k|k-1)} - x^(k|k-1)x^T(k|k-1), k >= 1. (24)

Eqs. (23) and (24) are directly obtained from their definitions and from the orthonormality of the basis U. In [1] it is proved that (23) can also be posed in the recursive form

S(k|k) = S(k|k-1) + diag{mu(k)} - 2 Sym{x^(k|k-1)lambda^T(k)G^T(k; Y_1^(k-1))} - G(k; Y_1^(k-1))lambda(k)lambda^T(k)G^T(k; Y_1^(k-1)), k >= 1, (25)
where

S(k|k-1) = A(k-1)S(k-1|k-1)A^T(k-1) + R(k; Y_1^(k-1)), k >= 2, (26)

and

R(k; Y_1^(k-1)) = E{w(k)w^T(k) | Y_1^(k-1)} = diag{x^(k|k-1)} - A(k-1)diag{x^(k-1|k-1)}A^T(k-1), k >= 2 (27)

(in (25), Sym{B} = (1/2)(B + B^T) denotes the symmetric part of the matrix B). Eq. (26) must be initialized on the basis of the knowledge of pi(1) (i.e., S(1|0) = diag{pi(1)} - pi(1)pi^T(1)). Eqs. (25) and (26) are formally similar to the Riccati equation for the standard linear Kalman filter. Similarly to the classic expressions pertaining to the Kalman filter, further recursive expressions for the calculation of the gain in (13) have been found in the form [1]

E{mu(k)lambda^T(k) | Y_1^(k-1)} = S(k|k-1)Q^(1)T(k), k >= 1, (28)

E{lambda(k)lambda^T(k) | Y_1^(k-1)} = Q^(1)(k)S(k|k-1)Q^(1)T(k) + E{nu(k)nu^T(k) | Y_1^(k-1)}, k >= 1, (29)

where the second-order statistic of the sequence {nu(k)} is calculated recursively as

E{nu(k)nu^T(k) | Y_1^(k-1)} = diag{y^(k|k-1)} - Q^(1)(k)diag{x^(k|k-1)}Q^(1)T(k), k >= 1.
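The on-line expressions (23) and (24) follow directly from the filter output; the short continuation below (again with our variable names) computes S(k|k) and uses its trace as the per-step conditional MSE, which on long runs should track the empirical squared error of the simulation:

```python
# Conditional filtering-error covariance, Eq. (23), and its trace (per-step MSE).
S_filt = np.stack([np.diag(p) - np.outer(p, p) for p in APP])
mse_online = S_filt.trace(axis1=1, axis2=2)      # E{||x(k)-x^(k|k)||^2 | Y_1^k}
mse_empirical = ((X - APP) ** 2).sum(axis=1)     # single-run squared error
print(mse_online.mean(), mse_empirical.mean())   # close on long runs
```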
4. General formulas for the smoother

On the basis of the properties and relationships reported in the previous sections and in Appendix A, the application of the MD-representation theorem [12] leads to the following general finite-dimensional expression for the smoother [1]:

pi(k|L) = x^(k|L) = E{x(k) | Y_1^L} = x^(k|k) + sum_{m=k+1}^{L} S(k; m)Theta(m), L > k, k >= 1. (30)

In the above equation the sequence {Theta(m)}, defined as

Theta(m) = Q^(1)T(m)[E{lambda(m)lambda^T(m) | Y_1^(m-1)}]^(-1)lambda(m), m >= 1, (31)

is computed from the available realization of the observation process and from the results obtained for the filter (see (10), (16) and (17)), while the matrix sequence {S(k; m), m >= k+1}, defined as

S(k; m) = E{[x(k) - x^(k|k-1)][x(m) - x^(m|m-1)]^T | Y_1^(m-1)} = P(k; m) - x^(k|m-1)x^T(m|m-1), m >= k+1, k >= 1, (32)

can be recursively computed on the basis of the following relationships:

x^(k|m-1) = E{x(k) | Y_1^(m-1)} = x^(k|m-2) + {P(k; m-1) - x^(k|m-2)x^T(m-1|m-2)}Theta(m-1), m >= k+2, k >= 1, (33)

P(k; m) = E{x(k)x^T(m) | Y_1^(m-1)} = C(k; m)A^T(m-1), m >= k+2, k >= 1, (34)

where the matrix

C(k; m) = E{x(k)x^T(m-1) | Y_1^(m-1)}, m >= k+2, k >= 1, (35)

is updated column-by-column from the matrix P(k; m-1) as

C_i(k; m) = P_i(k; m-1) + {Lambda^(i)(k; m) - P_i(k; m-1)x^T(m-1|m-2)}Theta(m-1), m >= k+2, k >= 1. (36)

The conditional covariance matrix of the smoothing error takes on the form

S(k|L) = E{[x(k) - x^(k|L)][x(k) - x^(k|L)]^T | Y_1^L} = diag{x^(k|L)} - x^(k|L)x^T(k|L), L >= k+1, k >= 1. (38)
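The recursions (30)-(36) are finite-dimensional but somewhat intricate. As an independent numerical cross-check (a classical alternative route, not the paper's recursion), the same smoothed APPs pi(k|L) can be computed by the standard forward-backward method for HMMs; a minimal sketch under the earlier illustrative model:

```python
def forward_backward(Y, A, Q, pi1):
    """Standard forward-backward smoother: returns pi(k|L) for k = 1..L."""
    L = Y.shape[0]; N = A.shape[0]
    # Map each point-process observation back to its symbol index 0..M-1
    # (the all-zero row of Y encodes the Mth symbol, see Eq. (1)).
    sym = np.where(Y.sum(axis=1) > 0, Y.argmax(axis=1), Q.shape[0] - 1)
    fwd = np.zeros((L, N)); bwd = np.ones((L, N))
    f = pi1 * Q[sym[0]]
    fwd[0] = f / f.sum()
    for k in range(1, L):                    # normalized forward pass
        f = Q[sym[k]] * (A @ fwd[k - 1])
        fwd[k] = f / f.sum()
    for k in range(L - 2, -1, -1):           # normalized backward pass
        b = A.T @ (Q[sym[k + 1]] * bwd[k + 1])
        bwd[k] = b / b.sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

APP_smooth = forward_backward(Y, A, Q, np.full(N, 1 / N))
print(np.allclose(APP_smooth[-1], APP[-1]))  # at k = L it matches the filter
```

Both routes compute the exact conditional expectations E{x(k) | Y_1^L}, so on the same data APP_smooth should agree with the output of (30)-(36).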
5. Nonlinear finite-dimensional MMSE filter and smoothers for the 'number of jumps' and the 'occupation time' processes

Let us consider the 'number of jumps' and the 'occupation time' processes associated with the states of the hidden Markov chain {x(k)}, defined respectively as

N^ij(k) = sum_{r=2}^{k} I^ij(r), i != j, k >= 2, (39)

J^i(k) = sum_{r=1}^{k} I^i(r), k >= 1, (40)

where I^ij(r) and I^i(r) denote the indicator functions of the events {x(r-1) = alpha_i, x(r) = alpha_j} and {x(r) = alpha_i}, respectively, so that N^ij(k) counts the number of jumps of {x(r)} from state alpha_i to state alpha_j in the interval 1 <= r <= k, and J^i(k) is the time spent by {x(r)} in the state alpha_i in the same time interval. It can be shown that the (fixed-interval) smoothed sequence {x^(r|L), 1 <= r <= k, L >= k} of Section 4 constitutes a sufficient statistic w.r.t. the observations Y_1^L for the computation of the nonlinear MMSE filtered and smoothed estimates of the processes {N^ij(k), k >= 2} and {J^i(k), k >= 1}. In fact, the finite-dimensional nonlinear MMSE estimators (filter, for k = L; smoothers, for L > k) of the above processes can be directly expressed as [1]

N^ij(k|L) = E{N^ij(k) | Y_1^L} = sum_{r=2}^{k} {A_j(r-1)[u_i - x^(r-1|L)] + x^_j(r|L)}x^_i(r-1|L), i != j, (41)

where A_j(r-1) is the jth row of the matrix A(r-1), and

J^i(k|L) = E{J^i(k) | Y_1^L} = sum_{r=1}^{k} x^_i(r|L), L >= k, k >= 1. (42)
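Both estimators follow directly from the fixed-interval smoother output; a short continuation of the sketch above evaluates (41) and (42) for all k at once (u_i below is the ith unit vector of R^N; indices are 0-based):

```python
def jumps_and_occupation(APP_s, A, i, j):
    """MMSE estimates of N^ij(k) (Eq. (41)) and J^i(k) (Eq. (42)) for all k."""
    u_i = np.eye(A.shape[0])[i]
    # Per-step terms {A_j(r-1)[u_i - x^(r-1|L)] + x^_j(r|L)} x^_i(r-1|L):
    terms = (A[j] @ (u_i[:, None] - APP_s[:-1].T)
             + APP_s[1:, j]) * APP_s[:-1, i]
    N_hat = np.concatenate([[0.0], np.cumsum(terms)])   # N^ij(k|L), k = 1..L
    J_hat = np.cumsum(APP_s[:, i])                      # J^i(k|L),  k = 1..L
    return N_hat, J_hat

N01, J0 = jumps_and_occupation(APP_smooth, A, 0, 1)
```

Eq. (42) is exact by linearity of the conditional expectation; the per-step terms of (41) reduce to a_ji(r-1) x^_i(r-1|L) when the observations beyond step r-1 are uninformative, as expected.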