The Asymptotic Redundancy of Bayes Rules for Markov Chains

Kevin Atteson

April 17, 1997

Abstract: We derive the asymptotics of the redundancy of Bayes rules for Markov chains with known order, extending the work of Barron and Clarke [6, 5] on i.i.d. sources. These asymptotics are derived when the actual source is in the class of ψ-mixing sources, which includes Markov chains and functions of Markov chains. These results can be used to derive minimax asymptotic rates of convergence for universal codes when a Markov chain of known order is used as a model.

Index terms: universal coding, Markov chains, Bayesian statistics, asymptotics.
1 Introduction

Given data generated by a known stochastic process, methods of encoding the data to achieve the minimal average coding length, such as Huffman and arithmetic coding, are known [7]. Universal codes [15, 8] encode data such that, asymptotically, the average per-symbol code length is equal to its minimal value (the entropy rate) for any source within a wide class. For the well-known Lempel-Ziv code, the average per-symbol code length in excess of the entropy, i.e. the redundancy, goes to 0 for the class of all ergodic stochastic processes [15]. In fact, this is the optimal rate of convergence for this class: there is no code for which the redundancy goes to 0 at rate o(1) uniformly over the class of all ergodic stochastic processes [13]. However, it has been observed that the rates of convergence of universal methods such as Lempel-Ziv are slow [2, page 268]. In this paper we will derive some optimal rates of convergence for universal codes over finitely parameterized sources, in particular, Markov chains.

For sources having a finite number k of parameters and satisfying certain conditions, it has been shown that there is a code for which the redundancy goes to zero at rate $\frac{k\log n}{2n} + o\!\left(\frac{\log n}{n}\right)$, which is the optimum rate [12]. For finitely parameterized i.i.d. sources satisfying certain conditions, it has been shown that the cumulative redundancy (not per-symbol) of the Bayes rule with continuous prior $f(\theta)$ is:
$$\frac{k}{2}\log\frac{n}{2\pi e} + \frac12\log\det I(\theta) - \log f(\theta) + o(1) \qquad (1)$$
where $I(\theta) = -E_\theta\!\left[\nabla^2\log P_\theta(X)\right]$ is the Fisher information matrix [5]. The prior $f(\theta)\propto\sqrt{\det I(\theta)}$, known as Jeffreys' prior, equalizes the risk with the rate of growth subtracted off and so is minimax for the asymptotic problem (in the interior of the parameter space). From the noiseless coding theorem, this is a lower bound on the optimal asymptotic rate of convergence of any universal code. This lower bound can be achieved by the Bayes rule using Jeffreys' prior, assuming that the entropy lower bound can be achieved exactly for all n (see [1] for a sense in which this is true for arithmetic coding).¹ Here, we consider the asymptotics of optimal compression by Markov chains of fixed known order. Also, we assume that the actual source is from the class of ψ-mixing random processes, since it will not typically fit a Markov chain exactly. In particular, if the actual source is a Markov chain, then the analogue of equation (1) holds. Note that the generalization to finite memory tree sources (described in e.g. [14]) is immediate (see [9]). Specifically, we derive the asymptotics of the redundancy for Bayes rules with continuous prior densities over the transition probabilities of Markov chains of fixed known order, where the initial distribution of the Markov chain is assumed to be the stationary distribution. A Bayes prior which equalizes the risk with the rate of growth subtracted off is then found, providing an optimal rate of convergence. Note that the actual rules which achieve this optimal rate of convergence are expressed here in terms of integrals which have no closed form solution. Recently, an approximation of these rules has been found which works well in simulation [9]. As discussed in [5], other applications of this work include density estimation, hypothesis testing and portfolio selection (in this case, where the distributions are not necessarily i.i.d.), but we concentrate on the application to universal coding here.

¹The convergence of the average per-symbol code length to the entropy (e.g. Huffman coding of sufficiently long blocks) is not sufficient to demonstrate this. The coding procedure must have average code length (not per-symbol) equal to the entropy of sequences of length n. Huffman coding of blocks of length k yields code lengths proportional to n/k in excess of the length-n entropy, which dwarfs the subtle distinctions in code length for different priors discussed here and in [5].
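As a concrete illustration of formula (1) (a numerical sketch of my own, not taken from the paper), the following snippet compares the exact cumulative redundancy of the Bayes rule under Jeffreys' prior for an i.i.d. Bernoulli(p) source, which is a Beta(1/2, 1/2) mixture, with the right hand side of (1); the function names are hypothetical.

\begin{verbatim}
# Sketch: exact cumulative redundancy of the Beta(1/2,1/2) (Jeffreys) Bayes
# mixture for an i.i.d. Bernoulli(p) source versus the asymptotic formula (1).
import numpy as np
from scipy.special import gammaln, betaln

def exact_redundancy(p, n):
    """E_p[log P_p(X^n) - log M_n(X^n)], summing over the number of ones."""
    k = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    log_P = k * np.log(p) + (n - k) * np.log(1 - p)            # log P_p(x^n)
    log_M = betaln(k + 0.5, n - k + 0.5) - betaln(0.5, 0.5)    # Bayes mixture
    weights = np.exp(log_binom + log_P)                        # P_p(#ones = k)
    return float(np.sum(weights * (log_P - log_M)))

def asymptotic_redundancy(p, n):
    """Formula (1): one parameter, I(p) = 1/(p(1-p)), Jeffreys prior."""
    log_f = -np.log(np.pi) - 0.5 * np.log(p * (1 - p))         # Beta(1/2,1/2) log density
    return 0.5 * np.log(n / (2 * np.pi * np.e)) - 0.5 * np.log(p * (1 - p)) - log_f

for p in (0.2, 0.5, 0.8):
    print(p, exact_redundancy(p, 2000), asymptotic_redundancy(p, 2000))
\end{verbatim}

With Jeffreys' prior the right hand side of (1) reduces to $\frac12\log\frac{n}{2\pi e}+\log\pi$, independent of p, which is the equalization property mentioned above.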
2 The Asymptotic Redundancy of Bayes Rules

Consider a Markov chain of order k over a finite alphabet $A = \{0, 1, \ldots, a-1\}$. Let θ be a parameter vector representing the transition probabilities of the Markov chain, that is, $\theta_{t,u} = P_\theta(u\mid t)$ for all $t\in A^k$ and $u\in A$. We will assume that the transition probabilities are strictly positive and hence that the Markov chain is ergodic. We will write $P_\theta$ for the probability mass function given by the stationary Markov chain corresponding to the transition probabilities θ. Let $f(\theta)$ be a continuous Bayes prior density on θ. The Bayes rule corresponding to the prior $f(\theta)$ assigns coding length $-\log M_n(x^n)$ to each string $x^n\in A^n$, where $M_n(x^n)$ is the Bayesian marginal probability distribution [5, 11, 1]:
$$M_n(x^n) = \int P_\theta(x^n)\,f(\theta)\,d\theta \qquad (2)$$
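For one family of priors the integral in (2) can be evaluated in closed form, which is useful for sanity checks later on. The sketch below (my own illustration, with hypothetical function names) conditions on the first k symbols, so the stationary factor $P_\theta(x^k)$ is omitted, and takes a product of symmetric Dirichlet priors over the transition rows; the mixture then factorizes into Dirichlet-multinomial terms.

\begin{verbatim}
# Sketch: closed form of the conditional Bayes mixture under a product
# Dirichlet(alpha, ..., alpha) prior over each transition row.  This is an
# illustration only; the marginal (2) in the paper also includes the
# stationary initial distribution and allows a general prior f(theta).
from math import lgamma
from collections import Counter

def log_marginal_given_prefix(x, a, k, alpha=0.5):
    """log of the integral of prod_{t,u} theta_{t,u}^{N_{t,u}(x)} under the prior."""
    counts = Counter()                        # N_{t,u}: context t followed by symbol u
    for i in range(k, len(x)):
        counts[(tuple(x[i - k:i]), x[i])] += 1
    total = 0.0
    for t in {t for (t, _) in counts}:
        row = [counts[(t, u)] for u in range(a)]
        total += lgamma(a * alpha) - a * lgamma(alpha)
        total += sum(lgamma(c + alpha) for c in row) - lgamma(sum(row) + a * alpha)
    return total

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]         # a toy binary string
print(log_marginal_given_prefix(x, a=2, k=1))
\end{verbatim}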
Similarly, the regret [3, 1] of the Bayes rule is its nth-order redundancy, given by:
$$R_n(\theta;f) = E\!\left[\log P_\theta(X^n) - \log M_n(X^n)\right] = E\!\left[\log\frac{P_\theta(X^n)}{M_n(X^n)}\right]$$
where $X^n$ is the initial segment of a stationary random process (not necessarily a Markov chain) with values in A and having a distribution P such that $P(X_{k+1}=u\mid X^k=t) = \theta_{t,u}$ and $P(X^k=t) = P_\theta(t)$ for $t\in A^k$ and $u\in A$.

In practice, Markov chains are used as the model when the true source is not likely to be Markov. Here, we do not assume that the underlying stochastic process is a Markov chain of order k but consider modeling it with such a source. We derive results when the actual source is in the larger class of ψ-mixing random processes, which includes functions of Markov chains and other sources which need not be Markov of any order. The redundancy which we consider is the difference between the expected code length of the Bayes rule over Markov chains of order k and the minimal expected code length which can be achieved by modeling the source with a Markov chain of order k (namely, the entropy of the Markov chain formed by truncating the distribution of the source; see [7]). We determine the growth of this redundancy. To obtain from our result the redundancy relative to the minimal expected code length achievable for the actual source, the Kullback-Leibler divergence between the actual source and the Markov chain with transition probabilities θ must be added, which will be O(n). The main result depends upon the actual source only through certain asymptotic variances.

Our main theorem gives the exact asymptotics of the nth-order redundancy of a Bayes rule with continuous prior density $f(\theta)$. We state it here for the simpler case in which the actual source is a Markov chain:
Corollary 1 If $X_1, X_2, \ldots$ is a kth-order Markov chain such that $\theta_{t,u} > 0$ for all $t\in A^k$ and $u\in A$, then:
$$\lim_n\left(R_n(\theta;f) - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi e}\right) = \log\frac{\sqrt{\det(I_n(\theta))}}{f(\theta)} = \frac12\sum_{t\in A^k}\left((a-1)\log\pi_t - \sum_{u\in A}\log\theta_{t,u}\right) - \log f(\theta)$$
where $I_n(\theta)$ is the Fisher information matrix and $\pi_t = P_\theta(t)$, that is, the stationary probabilities.
Hence, as in [5], Jeffreys' prior is minimax for the asymptotic problem. There are two parts to the proof. The first part, which simplifies $M_n(X^n)$ using a uniform version of Laplace's method of integration, is presented in the next section. The second part, which determines the asymptotics of the resulting expectation using the theory of ψ-mixing random processes, is presented in the subsequent section.
3 Laplace's Method of Integration

Here, we show that for certain typical sequences, Laplace's method of integration applies uniformly to the Bayesian marginal distribution given by (2). Note that this theorem differs from some of the standard theorems on Laplace integrals (cf. page 56 of Breitung [4]) in that the functions f and g may become infinite at the boundary and also in that the limits are uniform, and so we must enforce more stringent conditions upon them so that the theorem holds. While the conditions presented here are not the most lenient possible, they suffice for our purposes and are easier to demonstrate in the application at hand. We write $a_n\sim b_n$ to denote the fact that $\lim_n\frac{a_n}{b_n} = 1$.
Lemma 1 Let D be a compact subset of $\mathbf{R}^j$ and E a bounded convex subset of $\mathbf{R}^k$. Let $f : D\times E\to\mathbf{R}$ be thrice continuously differentiable. Let $F : D\times E\to\mathbf{R}^{k\times k}$ be the Hessian matrix of $f(x,y)$ with respect to y. Suppose that $F(x,y)$ is negative definite for all $x\in D$ and $y\in E$ and also that, for each $x\in D$, there is a maximum of $f(x,y)$ in the interior of E, which we denote by $y(x)$. Let $g : E\to\mathbf{R}$ be continuous and integrable. Then:
$$\int_E e^{nf(x,y)}\,g(y)\,dy \sim \left(\frac{2\pi}{n}\right)^{\frac{k}{2}}\frac{g(y(x))\,e^{nf(x,y(x))}}{\left|\det F(x,y(x))\right|^{\frac12}}$$
uniformly for all $x\in D$.
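Before turning to the proof, a one-dimensional numerical illustration of the statement (a sketch with made-up f and g, not tied to the application below): as n grows, the ratio of the integral to the Laplace approximation tends to 1.

\begin{verbatim}
# Sketch: compare int_E e^{n f(y)} g(y) dy with
# (2*pi/n)^{1/2} * g(y*) * e^{n f(y*)} / sqrt(|f''(y*)|) for a concave f.
import numpy as np

def f(y):  return -(y - 0.3) ** 2 - 0.5 * (y - 0.3) ** 4   # max at y* = 0.3, f(y*) = 0
def g(y):  return 1.0 + y ** 2                             # continuous, integrable on E

ystar, f2 = 0.3, -2.0                                      # maximizer and f''(y*)
y = np.linspace(-1.0, 1.0, 200001)                         # E = [-1, 1]
dy = y[1] - y[0]
for n in (10, 100, 1000):
    integral = np.sum(np.exp(n * f(y)) * g(y)) * dy        # Riemann sum over E
    laplace = np.sqrt(2 * np.pi / n) * g(ystar) / np.sqrt(abs(f2))
    print(n, integral, laplace, integral / laplace)
\end{verbatim}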
Proof. We first summarize the proof. We will show that we can restrict the integral to an ellipsoid (whose shape and orientation depend upon x) which is centered at the maximum $y(x)$ and which becomes arbitrarily small at a certain rate with respect to n. Outside of the ellipsoid, $f(x,y)$ is bounded by its maximum on the boundary of the ellipsoid since f is concave in y, and so we can bound the integral (uniformly, using the compactness of D) in this region. Within the ellipsoid, we can approximate $g(y)$ by a constant since it is continuous (uniformly since, as will be shown, the ellipsoids are eventually all contained in a single compact set). Also, we can approximate $f(x,y)$ by a quadratic within the ellipsoid by using a Taylor series with remainder (again uniformly, using the compactness of D).

In order to show that we can restrict the integral to a shrinking ellipsoid around the maximum, we first prove that any such ellipsoid will eventually be contained in E. Note that the maximum $y(x)$ is continuous with respect to x and, since D is compact, $y[D]$ is compact. Now, since $y[D]$ is contained in the interior of E, which is open, there must be some $\delta > 0$ such that $B = \{y : \|y - y[D]\|\le\delta\}\subseteq E$, where $\|y - y[D]\| = \inf_{x\in D}\|y - y(x)\|$.

We will need to bound the quadratic form $y'^T F(x,y)\,y'$ repeatedly throughout the proof. Let
$$\lambda_{\max} = \sup_{x\in D}\ \sup_{y\in B_\delta(y(x))}\ \sup_{y'}\frac{y'^T F(x,y)\,y'}{y'^T y'},$$
that is, $\lambda_{\max}$ is the least upper bound of the largest eigenvalue of $F(x,y)$ over $x\in D$ and $y\in B_\delta(y(x)) = \{y : \|y - y(x)\|\le\delta\}$. Since D and $B_\delta(y(x))$ are compact, $F(x,y)$ is continuous and eigenvalues are continuous functions of matrices, $\lambda_{\max}$ is achieved. Since $F(x,y)$ is negative definite, $\lambda_{\max} < 0$. Similarly, we write
$$\lambda_{\min} = \inf_{x\in D}\ \inf_{y\in B_\delta(y(x))}\ \inf_{y'}\frac{y'^T F(x,y)\,y'}{y'^T y'},$$
which is also negative.
In order to ease the notation, we will transform the variable of integration y to a variable $z_n$ depending upon n. Let $M(x)$ be the square root of the positive definite matrix $-F(x,y(x))$, that is, the unique positive definite matrix $M(x)$ such that $M(x)M(x) = -F(x,y(x))$. We define $z_n$ by the equation:
$$z_n = n^{\frac12}M(x)\,(y - y(x)) \qquad (3)$$
Note that we will assume that this transformation holds throughout the remainder of the proof, that is, when we refer to both y and $z_n$ in a statement, it is assumed that they are related by (3). Let $E_n(x)$ be the set over which the transformed integral is taken, that is, $E_n(x) = \{z_n : y\in E\}$. We also transform $f(x,y)$ by setting $f_n(x,z_n) = n(f(x,y) - f(x,y(x)))$. Hence, by the change of variables formula, we obtain:
$$n^{\frac{k}{2}}\det(M(x))\,e^{-nf(x,y(x))}\int_E e^{nf(x,y)}g(y)\,dy = \int_{E_n(x)} e^{f_n(x,z_n)}g(y)\,dz_n \qquad (4)$$
Now let $B_n = \{z_n : \|z_n\|\le n^\varepsilon\}$ for some $0 < \varepsilon < \frac16$. For $z_n\in B_n$, we have that
$$\|y - y(x)\| = \left\|n^{-\frac12}M(x)^{-1}z_n\right\| \le n^{\varepsilon-\frac12}(-\lambda_{\max})^{-\frac12}$$
since the largest eigenvalue of $M(x)^{-1}$ is at most $(-\lambda_{\max})^{-\frac12}$. For sufficiently large n, we have that $n^{\varepsilon-\frac12}(-\lambda_{\max})^{-\frac12}\le\delta$ and so, for any $x\in D$, $B_n\subseteq E_n(x)$; we subsequently assume n to be large enough so that this is so.

Note that $f_n(x,z_n)$ is concave in $z_n$ and attains its maximum at $z_n = 0$. Hence, for any $z_n\in E_n(x)\setminus B_n$:
$$f_n\!\left(x,\frac{n^\varepsilon}{\|z_n\|}z_n\right) - f_n(x,z_n) \ge \left(1-\frac{n^\varepsilon}{\|z_n\|}\right)\bigl(f_n(x,0) - f_n(x,z_n)\bigr) \qquad (5)$$
which is obtained by rearranging the defining inequality for concave functions. Since the right hand side of (5) is positive, $f_n(x,z_n)\le f_n\!\left(x,\frac{n^\varepsilon}{\|z_n\|}z_n\right)$, so that $f_n(x,z_n)\le\sup\{f_n(x,z_n') : \|z_n'\| = n^\varepsilon\}$ for any $z_n\in E_n(x)\setminus B_n$. Since $\{z_n : \|z_n\| = n^\varepsilon\}$ is compact, there is some $z_n^*$ such that $f_n(x,z_n^*) = \sup\{f_n(x,z_n') : \|z_n'\| = n^\varepsilon\}$. By expanding $f_n$ in a Taylor series around $z_n = 0$, we have $f_n(x,z_n^*) = \frac12 z_n^{*T}F_n(x,z_n')\,z_n^*$ where $z_n' = t_n z_n^*$ for some $t_n\in[0,1]$ and where $F_n(x,z_n)$ is the Hessian matrix of $f_n(x,z_n)$ with respect to $z_n$, that is, $F_n(x,z_n) = M(x)^{-1}F(x,y)M(x)^{-1}$. We have that $\|M(x)^{-1}z_n^*\|^2\ge(-\lambda_{\min})^{-1}\|z_n^*\|^2$ and so
$$z_n^{*T}F_n(x,z_n')\,z_n^* \le \frac{\lambda_{\max}}{-\lambda_{\min}}\|z_n^*\|^2 = -\frac{\lambda_{\max}}{\lambda_{\min}}\,n^{2\varepsilon}.$$
Hence $f_n(x,z_n)\le f_n(x,z_n^*)\le -\frac{\lambda_{\max}}{2\lambda_{\min}}n^{2\varepsilon}$ for all $z_n\in E_n(x)\setminus B_n$. Thus:
$$\left|\lim_n\int_{E_n(x)\setminus B_n} e^{f_n(x,z_n)}g(y)\,dz_n\right| \le \lim_n\int_{E_n(x)\setminus B_n} e^{f_n(x,z_n)}\left|g(y)\right|dz_n \le \lim_n\int_E \exp\!\left(-\tfrac{\lambda_{\max}}{2\lambda_{\min}}n^{2\varepsilon}\right)\left|g(y)\right| n^{\frac{k}{2}}\det(M(x))\,dy \le \lim_n n^{\frac{k}{2}}(-\lambda_{\min})^{\frac{k}{2}}\exp\!\left(-\tfrac{\lambda_{\max}}{2\lambda_{\min}}n^{2\varepsilon}\right)\int_E\left|g(y)\right|dy = 0$$
Hence:
$$\lim_n\int_{E_n(x)} e^{f_n(x,z_n)}g(y)\,dz_n = \lim_n\left(\int_{E_n(x)\setminus B_n} e^{f_n(x,z_n)}g(y)\,dz_n + \int_{B_n} e^{f_n(x,z_n)}g(y)\,dz_n\right) = \lim_n\int_{B_n} e^{f_n(x,z_n)}g(y)\,dz_n \qquad (6)$$
uniformly for all $x\in D$, and we can restrict the integral to the set $B_n$.

Now note that g is uniformly continuous over the compact set B. Recall that for $z_n\in B_n$, we have that $\|y - y(x)\|\le(-\lambda_{\max})^{-\frac12}n^{\varepsilon-\frac12}$. Therefore, for any $\gamma > 0$, there is an N such that $|g(y) - g(y(x))|\le\gamma$ for all y such that $z_n\in B_n$ and $n\ge N$. Hence:
$$\lim_n\int_{B_n} e^{f_n(x,z_n)}g(y)\,dz_n = g(y(x))\lim_n\int_{B_n} e^{f_n(x,z_n)}\,dz_n \qquad (7)$$
uniformly for all $x\in D$. Note that $F_n(x,z_n)$ is continuously differentiable in $z_n$. We will now expand $f_n$ in a second-order Taylor series around $z_n = 0$ with remainder and bound the remainder. The remainder is bounded in absolute value by:
$$\left|\sum_{i,j,k}\frac{\partial^3 f_n(x,z_n')}{\partial(z_n)_i\,\partial(z_n)_j\,\partial(z_n)_k}(z_n)_i(z_n)_j(z_n)_k\right| \le \sum_{i,j,k}\left|\frac{\partial^3 f_n(x,z_n')}{\partial(z_n)_i\,\partial(z_n)_j\,\partial(z_n)_k}\right|\left|(z_n)_i\right|\left|(z_n)_j\right|\left|(z_n)_k\right| \qquad (8)$$
where $(z_n)_i$ denotes the ith component of the vector $z_n$ and $z_n' = tz_n$ for some $t\in(0,1)$. However:
$$\frac{\partial^3 f_n(x,z_n)}{\partial(z_n)_i\,\partial(z_n)_j\,\partial(z_n)_k} = \sum_{i',j',k'} n\,\frac{\partial^3 f(x,y)}{\partial y_{i'}\,\partial y_{j'}\,\partial y_{k'}}\,n^{-\frac12}\!\left(M(x)^{-1}\right)_{i',i}n^{-\frac12}\!\left(M(x)^{-1}\right)_{j',j}n^{-\frac12}\!\left(M(x)^{-1}\right)_{k',k} = n^{-\frac12}\sum_{i',j',k'}\frac{\partial^3 f(x,y)}{\partial y_{i'}\,\partial y_{j'}\,\partial y_{k'}}\left(M(x)^{-1}\right)_{i',i}\left(M(x)^{-1}\right)_{j',j}\left(M(x)^{-1}\right)_{k',k} \qquad (9)$$
where $y_i$ denotes the ith component of the vector y. Now define:
$$c = \sup_{x\in D}\ \sup_{z_n\in B_n}\ \sup_{i,j,k}\ n^{\frac12}\left|\frac{\partial^3 f_n(x,z_n)}{\partial(z_n)_i\,\partial(z_n)_j\,\partial(z_n)_k}\right| \qquad (10)$$
In fact, c is finite because D and $B_n$ are compact and the third derivative is continuous; moreover, c is independent of n by (9) and (10). Now, $f_n(x,z_n)$ is defined so that it achieves its maximum of 0 at $z_n = 0$ and such that $F_n(x,0)$ is the negative of the identity matrix. Hence, combining (8), (9) and (10), the Taylor series yields:
$$\left|f_n(x,z_n) + \frac12\|z_n\|^2\right| \le cn^{-\frac12}\sum_{i,j,k}\left|(z_n)_i\right|\left|(z_n)_j\right|\left|(z_n)_k\right| = cn^{-\frac12}\left(\sum_i\left|(z_n)_i\right|\right)^{3} \le cn^{-\frac12}k^{\frac32}\|z_n\|^3 \le ck^{\frac32}n^{3\varepsilon-\frac12}$$
where the second-to-last inequality is a well-known inequality on vector norms. Since $\varepsilon < \frac16$, this bound tends to 0. Thus:
$$\lim_n\int_{B_n} e^{f_n(x,z_n)}\,dz_n = \lim_n\int_{B_n} e^{-\frac12\|z_n\|^2}\,dz_n = (2\pi)^{\frac{k}{2}} \qquad (11)$$
uniformly for all $x\in D$. Combining (4), (6), (7) and (11) yields the desired result. □
In order to evaluate $\left|\det F(x,y(x))\right|$ for the case at hand, we will need the following lemma.
Lemma 2 If Λ is a diagonal matrix with diagonal entries $\lambda_1,\lambda_2,\ldots$ and C is a constant matrix (that is, a matrix whose entries are all equal to some constant c), then:
$$\det(\Lambda + C) = \prod_i\lambda_i + c\sum_i\prod_{j\ne i}\lambda_j$$
In particular, if $\lambda_i\ne0$ for all i:
$$\det(\Lambda + C) = \left(1 + c\sum_i\frac{1}{\lambda_i}\right)\prod_i\lambda_i$$
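Before the proof, here is a quick numerical sanity check of the identity (a sketch of my own, not part of the argument).

\begin{verbatim}
# Sketch: check det(Lambda + C) for diagonal Lambda and constant matrix C.
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.5, 2.0, size=6)               # diagonal entries lambda_1..lambda_6
c = 0.7
A = np.diag(lam) + c * np.ones((6, 6))

lhs = np.linalg.det(A)
rhs = np.prod(lam) + c * sum(np.prod(np.delete(lam, i)) for i in range(6))
rhs2 = (1 + c * np.sum(1 / lam)) * np.prod(lam)   # second form (all lambda_i nonzero)
print(lhs, rhs, rhs2)                             # all three agree up to rounding
\end{verbatim}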
Proof. The proof is by induction on the size of the matrices. The result holds for the base case of $1\times1$ matrices. Now suppose that the result holds for $n\times n$ matrices and consider an $(n+1)\times(n+1)$ matrix $\Lambda + C$. Expanding the determinant along the first row, whose entries are $(\lambda_1+c, c, \ldots, c)$:
$$\det(\Lambda + C) = (\lambda_1+c)\det M_1 - c\det M_2 + c\det M_3 - \cdots$$
where $M_j$ is the minor obtained by deleting the first row and the jth column. The minor $M_1$ is itself of the form diagonal-plus-constant with diagonal entries $\lambda_2,\ldots,\lambda_{n+1}$. Each minor $M_j$ with $j\ge2$ contains the all-c column that was originally the first column of $\Lambda + C$, and so is, up to column transpositions, of the form $\Lambda'+C$ where $\Lambda'$ has the diagonal entry corresponding to $\lambda_j$ replaced by 0. The third matrix of the expansion requires a single column transposition to be in the form on which the induction hypothesis can be used, the fourth requires two such transpositions, and so on. Hence we can change the sign of every odd term after the first and apply the induction hypothesis to each term, yielding:
$$\det(\Lambda + C) = (\lambda_1+c)\left(\prod_{i>1}\lambda_i + c\sum_{i>1}\prod_{j\notin\{1,i\}}\lambda_j\right) - c^2\sum_{i>1}\prod_{j\notin\{1,i\}}\lambda_j = \prod_i\lambda_i + c\sum_i\prod_{j\ne i}\lambda_j$$
which is the first statement of the lemma. The second statement follows by rearranging the terms. □

In the following lemma, we apply Laplace's method of integration to derive the asymptotics of a sample path in a typical set. Note that for
the i.i.d. case as presented in [5], we have that $P_\theta(x^n) = \prod_i P_\theta(x_i) = \exp\!\left(n\sum_i\frac{\log P_\theta(x_i)}{n}\right)$ and so we can apply Laplace integration to $\sum_i\frac{\log P_\theta(x_i)}{n}$, that is, use this function as f in Lemma 1. However, for Markov chains, we have that:
$$P_\theta(x^n) = P_\theta\!\left(x^k\right)\prod_i P_\theta\!\left(x_{i+1}\mid x^i_{i-k+1}\right) = P_\theta\!\left(x^k\right)\exp\!\left(n\sum_i\frac{\log P_\theta\!\left(x_{i+1}\mid x^i_{i-k+1}\right)}{n}\right)$$
where $x^j_i$ denotes the substring $x_i x_{i+1}\cdots x_j$. Hence, we apply Laplace integration to the probability of a string conditioned upon the initial symbols, that is, we allow the probability of the initial symbols, $P_\theta(x^k)$, to be accounted for in the function g of Lemma 1 rather than in f. Within our proof, we use the estimator $\hat\theta$ which maximizes the likelihood given the initial symbols $x^k$, that is, the value of θ which maximizes $P_\theta\!\left(x^n\mid x^k\right) = \prod_{t\in A^k}\prod_{u\in A}\theta_{t,u}^{N_{t,u}(x^n)}$, where $N_{t,u}(x^n)$ denotes the number of occurrences of tu in $x^n$. This conditional likelihood has a unique global maximum at $\hat\theta_{t,u} = \frac{N_{t,u}(x^n)}{\sum_{u'\in A}N_{t,u'}(x^n)}$ if $\sum_{u'\in A}N_{t,u'}(x^n) > 0$; otherwise, we assign $\hat\theta_{t,u}$ arbitrarily. We will also use the random variables $\hat\pi_{t,u} = \frac{N_{t,u}(x^n)}{n}$ and $\hat\pi_t = \sum_{u\in A}\hat\pi_{t,u}$, and the typical set $A_{n,\varepsilon} = \left\{x^n\in A^n : \min_{t\in A^k}\min_{u\in A}\hat\pi_{t,u}\ge\varepsilon\right\}$. For the remainder of the paper, we will assume that all logarithms are in base e for ease of analysis. Also, let $\hat I_n(\theta)$ be the empirical Fisher information matrix of the conditional distribution, that is, the Hessian matrix of $-\frac1n\log P_\theta\!\left(x^n\mid x^k\right)$ with respect to θ, evaluated at $\theta = \hat\theta$.
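The empirical quantities just defined are straightforward to compute; the following sketch (with hypothetical helper names of my own) tabulates $N_{t,u}(x^n)$, $\hat\theta_{t,u}$ and $\hat\pi_t$ for a binary string with a first-order model.

\begin{verbatim}
# Sketch: empirical counts N_{t,u}, conditional MLE theta_hat and pi_hat
# for a binary string with k = 1 (first-order contexts).
from collections import Counter

def empirical_quantities(x, a, k):
    N = Counter()
    for i in range(k, len(x)):
        N[(tuple(x[i - k:i]), x[i])] += 1
    n = len(x)                                 # pi_hat_{t,u} = N_{t,u} / n as in the text
    contexts = {t for (t, _) in N}
    theta_hat = {(t, u): N[(t, u)] / sum(N[(t, v)] for v in range(a))
                 for t in contexts for u in range(a)}
    pi_hat = {t: sum(N[(t, v)] for v in range(a)) / n for t in contexts}
    return N, theta_hat, pi_hat

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
N, theta_hat, pi_hat = empirical_quantities(x, a=2, k=1)
print(dict(N)); print(theta_hat); print(pi_hat)
\end{verbatim}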
Lemma 3 If $X_1, X_2,\ldots$ is a random process such that $\theta_{t,u} > 0$ for all $t\in A^k$ and $u\in A$, then:
$$M_n(x^n) \sim \left(\frac{2\pi}{n}\right)^{\frac{a^k(a-1)}{2}}P_{\hat\theta}(x^n)\,f\!\left(\hat\theta\right)\det\!\left(\hat I_n\!\left(\hat\theta\right)\right)^{-\frac12} = \left(\frac{2\pi}{n}\right)^{\frac{a^k(a-1)}{2}}P_{\hat\theta}(x^n)\,f\!\left(\hat\theta\right)\prod_{t\in A^k}\left(\hat\pi_t^{-\frac{a-1}{2}}\prod_{u\in A}\hat\theta_{t,u}^{\frac12}\right)$$
uniformly over $x^n\in A_{n,\varepsilon}$.
Proof. For any $x^n\in A_{n,\varepsilon}$:
$$M_n(x^n) = \int P_\theta(x^n)f(\theta)\,d\theta = \int P_\theta\!\left(x^k\right)\prod_{t\in A^k}\prod_{u\in A}\theta_{t,u}^{N_{t,u}(x^n)}f(\theta)\,d\theta = \int\exp\!\left(\sum_{t\in A^k}\sum_{u\in A}N_{t,u}(x^n)\log\theta_{t,u}\right)P_\theta\!\left(x^k\right)f(\theta)\,d\theta = \int\exp\!\left(n\sum_{t\in A^k}\sum_{u\in A}\hat\pi_{t,u}\log\theta_{t,u}\right)P_\theta\!\left(x^k\right)f(\theta)\,d\theta$$
Now let $f(\hat\pi,\theta) = \sum_{t\in A^k}\sum_{u\in A}\hat\pi_{t,u}\log\theta_{t,u}$ and $g(\theta) = P_\theta(x^k)f(\theta)$ for fixed initial symbols $x^k$. Note that $f(\hat\pi,\theta)$ is thrice continuously differentiable and strictly concave in θ with unique maximum at $\theta = \hat\theta$ for any fixed $\hat\pi$ contained in the compact set $\{\hat\pi : x^n\in A_{n,\varepsilon}\}$ (bounding $\hat\pi$ in this manner is necessary because of boundary effects). The function $g(\theta)$ is continuous since $f(\theta)$ is and since the stationary distribution $P_\theta(x^k)$ is continuous in the transition probabilities θ. Therefore, the conditions of Lemma 1 are satisfied and so:
$$M_n(x^n) \sim \left(\frac{2\pi}{n}\right)^{\frac{a^k(a-1)}{2}}\frac{P_{\hat\theta}\!\left(x^k\right)f\!\left(\hat\theta\right)}{\det\!\left(\hat I_n\!\left(\hat\theta\right)\right)^{\frac12}}\prod_{t\in A^k}\prod_{u\in A}\hat\theta_{t,u}^{N_{t,u}(x^n)} = \left(\frac{2\pi}{n}\right)^{\frac{a^k(a-1)}{2}}\frac{f\!\left(\hat\theta\right)}{\det\!\left(\hat I_n\!\left(\hat\theta\right)\right)^{\frac12}}P_{\hat\theta}(x^n) \qquad (12)$$
for fixed $x^k$, since $\hat I_n(\hat\theta)$ is the negation of the Hessian of $f(\hat\pi,\theta)$ at $\theta=\hat\theta$ and it is positive definite. Since there are only a finite number of values for $x^k$, this result holds uniformly over all $x^n\in A_{n,\varepsilon}$.

Now, using the fact that $\theta_{t,a-1} = 1 - \sum_{u=0}^{a-2}\theta_{t,u}$, we have, for the free coordinates $u,u'\in\{0,\ldots,a-2\}$:
$$\frac{\partial^2 f(\hat\pi,\theta)}{\partial\theta_{t,u}\,\partial\theta_{t',u'}} = -\frac{\hat\pi_{t,u}}{\theta_{t,u}^2}\mathbf{1}_{t=t',\,u=u'} - \frac{\hat\pi_{t,a-1}}{\theta_{t,a-1}^2}\mathbf{1}_{t=t'}$$
where $\mathbf{1}_{t=t'}$ is 1 if $t=t'$ and 0 otherwise. Note that since $\frac{\partial^2 f(\hat\pi,\theta)}{\partial\theta_{t,u}\,\partial\theta_{t',u'}} = 0$ for $t\ne t'$, the matrix $\hat I_n(\theta)$ is block diagonal. Since the determinant of a block diagonal matrix is the product of the determinants of the blocks and by Lemma 2, we obtain (using $\hat\pi_{t,u} = \hat\pi_t\hat\theta_{t,u}$):
$$\det\hat I_n\!\left(\hat\theta\right) = \prod_{t\in A^k}\left(1 + \frac{\hat\pi_{t,a-1}}{\hat\theta_{t,a-1}^2}\sum_{u=0}^{a-2}\frac{\hat\theta_{t,u}^2}{\hat\pi_{t,u}}\right)\prod_{u=0}^{a-2}\frac{\hat\pi_{t,u}}{\hat\theta_{t,u}^2} = \prod_{t\in A^k}\left(1 + \frac{\hat\pi_t}{\hat\theta_{t,a-1}}\sum_{u=0}^{a-2}\frac{\hat\theta_{t,u}}{\hat\pi_t}\right)\prod_{u=0}^{a-2}\frac{\hat\pi_t}{\hat\theta_{t,u}} = \prod_{t\in A^k}\left(1 + \frac{1-\hat\theta_{t,a-1}}{\hat\theta_{t,a-1}}\right)\prod_{u=0}^{a-2}\frac{\hat\pi_t}{\hat\theta_{t,u}} = \prod_{t\in A^k}\hat\pi_t^{a-1}\prod_{u\in A}\frac{1}{\hat\theta_{t,u}}$$
Combining with (12) yields the desired result. □
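The approximation in Lemma 3 is easy to examine numerically. The sketch below is my own check and uses simplifications that are not in the paper: it conditions on the first symbol, so the factor $P_{\hat\theta}(x^k)$ is dropped, and it takes $f\equiv1$ on the two free parameters of a binary first-order chain, for which the exact conditional mixture is a product of Beta functions.

\begin{verbatim}
# Sketch: numerical check of Lemma 3 for a binary first-order chain (a=2, k=1),
# conditioning on the first symbol and using a flat prior on (theta_{0,1}, theta_{1,1}).
import numpy as np
from math import lgamma, log, pi

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

rng = np.random.default_rng(1)
theta = {0: 0.3, 1: 0.8}                    # P(next = 1 | current symbol)
n = 5000
x = [0]
for _ in range(n):
    x.append(int(rng.random() < theta[x[-1]]))

N = np.zeros((2, 2))                        # N[t, u] = number of occurrences of tu
for t, u in zip(x[:-1], x[1:]):
    N[t, u] += 1

# exact log-marginal, conditional on x_1, under the flat prior
log_exact = sum(log_beta(N[t, 1] + 1, N[t, 0] + 1) for t in (0, 1))

# Lemma 3 approximation: (2*pi/n)^{a^k(a-1)/2} * prod theta_hat^N * det(I_hat)^{-1/2}
theta_hat = N / N.sum(axis=1, keepdims=True)
pi_hat = N.sum(axis=1) / n
log_lik_hat = np.sum(N * np.log(theta_hat))
log_det_I_hat = sum(log(pi_hat[t]) - log(theta_hat[t, 0]) - log(theta_hat[t, 1]) for t in (0, 1))
log_approx = log(2 * pi / n) + log_lik_hat - 0.5 * log_det_I_hat

print(log_exact, log_approx, log_exact - log_approx)   # the difference shrinks as n grows
\end{verbatim}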
4 ψ-mixing Random Processes

In order to determine the expectation of $\log M_n(X^n)$ and prove the main result, we will find the asymptotics of moments of the random variables $\hat\pi_{t,u}$. For these results, we will resort to the theory of ψ-mixing random variables [10], a class of random processes with dependencies which strictly includes Markov chains. Thus, our results will apply when modeling a source more general than Markov sources. Note that here $P_\theta(x^n)$ denotes the probability of $x^n$ under the Markov chain with transition probabilities θ, the transition probabilities of the actual ψ-mixing process, even though this process need not be a Markov chain. Most of the results of the theory of ψ-mixing require that the function ψ(n) vanish sufficiently rapidly, in particular, that $\sum_n\sqrt{\psi(n)} < \infty$, which we will assume here. The ψ-mixing property is a kind of "asymptotic independence": if one chooses sufficiently distant random variables from a ψ-mixing random process, these random variables will be roughly independent. Let $\pi_{t,u} = E\hat\pi_{t,u} = P(tu)$ and $\pi_t = E\hat\pi_t = P(t)$. Define the typical set $A'_{n,\varepsilon} = \left\{\max_{t\in A^k}\max_{u\in A}\left|\hat\pi_{t,u}-\pi_{t,u}\right|\le n^{-\varepsilon}\right\}$. Let $\mathbf{1}_A$ denote the indicator function of any set A. The following lemma allows us to restrict attention to these typical sets.
Lemma 4 For any ψ-mixing random process such that $\sum_n\sqrt{\psi(n)} < \infty$:
$$\lim_n E\!\left[\log\frac{P_\theta(X^n)}{M_n(X^n)}\,\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] = 0$$

Proof. First:
$$E\!\left[\log\frac{P_\theta(X^n)}{M_n(X^n)}\,\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] \le E\!\left[-\log M_n(X^n)\,\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] \le \sup_{x^n\in A^n}\bigl(-\log M_n(x^n)\bigr)\,P\!\left(\overline{A'_{n,\varepsilon}}\right)$$
where we have used the fact that $P_\theta(X^n)\le1$ for the first inequality; the supremum is O(n) since $M_n(x^n)\ge\int P_\theta\!\left(x^k\right)\prod_{t\in A^k}\prod_{u\in A}\theta_{t,u}^{\,n}\,f(\theta)\,d\theta$, which is bounded below by a positive constant times $e^{-cn}$ for some constant c. Also:
$$E\!\left[\log\frac{P_\theta(X^n)}{M_n(X^n)}\,\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] \ge E\!\left[\log P_\theta(X^n)\,\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] \ge E\!\left[\left(\log P_\theta\!\left(X^k\right) + n\sum_{t\in A^k}\sum_{u\in A}\log\theta_{t,u}\right)\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] \ge \left(\inf_{x^k\in A^k}\log P_\theta\!\left(x^k\right) + n\sum_{t\in A^k}\sum_{u\in A}\log\theta_{t,u}\right)P\!\left(\overline{A'_{n,\varepsilon}}\right)$$
where we have used the fact that $M_n(X^n)\le1$ for the first inequality and $N_{t,u}(X^n)\le n$ for the second. Thus, since both bounds are of the form $O(n)\,P\!\left(\overline{A'_{n,\varepsilon}}\right)$, we need only show that $\lim_n nP\!\left(\overline{A'_{n,\varepsilon}}\right) = 0$.

For any $t\in A^k$ and $u\in A$, let $A'_{t,u,n,\varepsilon} = \left\{\left|\hat\pi_{t,u}-\pi_{t,u}\right|\le n^{-\varepsilon}\right\}$. We show that $\lim_n nP\!\left(\overline{A'_{t,u,n,\varepsilon}}\right) = 0$, which implies that $\lim_n nP\!\left(\overline{A'_{n,\varepsilon}}\right) = 0$. Note that $n\hat\pi_{t,u} = \sum_{i=k+1}^n\mathbf{1}_{X^i_{i-k}=tu}$. The random variables $\mathbf{1}_{X^i_{i-k}=tu}$ are ψ-mixing since they are formed by applying a bounded function to a finite window of the ψ-mixing random process $X^n$; see page 31 of [10]. By Chebyshev's inequality, and Lemma 1.1.21 and Proposition 1.1.20 of [10], we have that:
$$\lim_n nP\!\left(\overline{A'_{t,u,n,\varepsilon}}\right) \le \lim_n n^{1+\frac{5\varepsilon}{2}}E\!\left[\left|\hat\pi_{t,u}-\pi_{t,u}\right|^{\frac52}\right] \le \lim_n n^{1+\frac{5\varepsilon}{2}}c\left(E\!\left[\left|\hat\pi_{t,u}-\pi_{t,u}\right|^{2}\right]\right)^{\frac54} \le \lim_n c\,n^{-\frac14+\frac{5\varepsilon}{2}}\left(nE\!\left[\left|\hat\pi_{t,u}-\pi_{t,u}\right|^{2}\right]\right)^{\frac54} = 0$$
since $\lim_n nE\!\left[\left|\hat\pi_{t,u}-\pi_{t,u}\right|^{2}\right] < \infty$ and we choose $\varepsilon < \frac1{10}$. Hence, the lemma is proved. □

We are now in a position to prove our main result, namely, to derive the asymptotics of the redundancy of Bayes rules for Markov chains. Now let
$I_n(\theta)$ be the Fisher information matrix, that is, $I_n(\theta) = E\,\hat I_n(\theta)$. Let $\sigma_{t,u}$ denote the constant governing the growth rate of the variance of $\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}$, that is, $\sigma_{t,u} = \lim_n nE\!\left[\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2\right]$, which exists as discussed in the proof of Lemma 4.
Theorem 1 For a ψ-mixing random process such that $\theta_{t,u} > 0$ for all $t\in A^k$ and $u\in A$:
$$\lim_n\left(R_n(\theta;f) - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi}\right) = \log\frac{\sqrt{\det(I_n(\theta))}}{f(\theta)} - \sum_{t\in A^k}\sum_{u\in A}\frac{\sigma_{t,u}}{2\pi_{t,u}} = \frac12\sum_{t\in A^k}\left((a-1)\log\pi_t - \sum_{u\in A}\log\theta_{t,u}\right) - \log f(\theta) - \sum_{t\in A^k}\sum_{u\in A}\frac{\sigma_{t,u}}{2\pi_{t,u}}$$
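The correction term involves the asymptotic variances $\sigma_{t,u}$, which for a Markov source reduce to $\pi_{t,u}(1-\theta_{t,u})$ (this is the content of Corollary 1 below). The following Monte Carlo sketch, my own illustration and not part of the proof, estimates $\sigma_{t,u}$ for a binary first-order Markov chain and compares it with that closed form.

\begin{verbatim}
# Sketch: Monte Carlo estimate of sigma_{t,u} = lim_n n E[(pi_hat_{t,u} -
# pi_hat_t * theta_{t,u})^2] for a binary first-order Markov chain, compared
# with pi_{t,u} (1 - theta_{t,u}) as derived in Corollary 1.
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([[0.7, 0.3],                  # theta[t, u] = P(next = u | current = t)
                  [0.2, 0.8]])
pi = np.array([theta[1, 0], theta[0, 1]])
pi = pi / pi.sum()                             # stationary distribution (0.4, 0.6)

n, reps = 1000, 500
acc = np.zeros((2, 2))
for _ in range(reps):
    x = np.empty(n + 1, dtype=int)
    x[0] = int(rng.random() < pi[1])           # start in the stationary law
    u01 = rng.random(n)
    for i in range(n):
        x[i + 1] = int(u01[i] < theta[x[i], 1])
    N = np.zeros((2, 2))
    for t, u in zip(x[:-1], x[1:]):
        N[t, u] += 1
    pi_hat_tu = N / n
    pi_hat_t = pi_hat_tu.sum(axis=1, keepdims=True)
    acc += n * (pi_hat_tu - pi_hat_t * theta) ** 2
print(acc / reps)                              # Monte Carlo estimate of sigma_{t,u}
print(pi[:, None] * theta * (1 - theta))       # pi_{t,u} (1 - theta_{t,u})
\end{verbatim}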
Proof. By Lemma 4, we have that:
$$\lim_n\left(R_n(\theta;f) - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi}\right) = \lim_n\left(E\!\left[\log\frac{P_\theta(X^n)}{M_n(X^n)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right] - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi}\right)$$
Note that $A'_{n,\varepsilon}$ is eventually contained in $A_{n,\varepsilon}$ because the $\pi_{t,u}$ are positive. Hence, by Lemma 3:
$$\left|\log\frac{P_{\hat\theta}(X^n)}{M_n(X^n)} - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi} - \log\frac{\sqrt{\det\hat I_n(\hat\theta)}}{f(\hat\theta)}\right|\mathbf{1}_{A'_{n,\varepsilon}} \le \gamma(n)$$
where $\lim_n\gamma(n) = 0$. Taking expected values and using the fact, as demonstrated in the proof of Lemma 4, that $\lim_n nP\!\left(\overline{A'_{n,\varepsilon}}\right) = 0$:
$$\left|E\!\left[\log\frac{P_{\hat\theta}(X^n)}{M_n(X^n)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right] - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi} - E\!\left[\log\frac{\sqrt{\det\hat I_n(\hat\theta)}}{f(\hat\theta)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right]\right| \le \gamma(n)$$
Note that $\hat\theta$ converges to θ uniformly within $A'_{n,\varepsilon}$: since $\hat\theta_{t,u} = \frac{\hat\pi_{t,u}}{\sum_{u'\in A}\hat\pi_{t,u'}}$ is continuous in $\hat\pi$, we have $\lim_n\hat\theta_{t,u} = \frac{\pi_{t,u}}{\sum_{u'\in A}\pi_{t,u'}} = \theta_{t,u}$ uniformly on $A'_{n,\varepsilon}$. Thus, for any continuous function h, $E\!\left[h(\hat\theta)\,\mathbf{1}_{A'_{n,\varepsilon}}\right]$ converges to $h(\theta)$. Therefore, $E\!\left[\log f(\hat\theta)\,\mathbf{1}_{A'_{n,\varepsilon}}\right]$ converges to $\log f(\theta)$. Also:
$$\log\det\hat I_n(\hat\theta)\,\mathbf{1}_{A'_{n,\varepsilon}} = \sum_{t\in A^k}\left((a-1)\log\hat\pi_t - \sum_{u\in A}\log\hat\theta_{t,u}\right)\mathbf{1}_{A'_{n,\varepsilon}}$$
as demonstrated in the proof of Lemma 3. Thus:
$$\lim_n E\!\left[\log\det\hat I_n(\hat\theta)\,\mathbf{1}_{A'_{n,\varepsilon}}\right] = \sum_{t\in A^k}\left((a-1)\log\pi_t - \sum_{u\in A}\log\theta_{t,u}\right) = \lim_n\log\det I_n(\theta)$$
It remains to find the asymptotics of $E\!\left[\log\frac{P_{\hat\theta}(X^n)}{P_\theta(X^n)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right]$. Note that $\log P_{\hat\theta}(X^k)$ is continuous in $\hat\theta$ (the ergodic distribution of a Markov chain is continuous in the transition probabilities) and so $\lim_n E\!\left[\log P_{\hat\theta}(X^k)\,\mathbf{1}_{A'_{n,\varepsilon}}\right] = E\!\left[\log P_\theta(X^k)\right]$. We simplify what remains by expanding each term of the form $-\log\frac{\theta_{t,u}}{\hat\theta_{t,u}}$ in a Taylor series around $\hat\theta_{t,u} = \theta_{t,u}$:
$$\log\frac{P_{\hat\theta}(X^n)}{P_\theta(X^n)} - \log\frac{P_{\hat\theta}(X^k)}{P_\theta(X^k)} = -\sum_{t\in A^k}\sum_{u\in A}n\hat\pi_{t,u}\log\frac{\theta_{t,u}}{\hat\theta_{t,u}} = -\sum_{t\in A^k}\sum_{u\in A}n\hat\pi_{t,u}\left(\frac{\theta_{t,u}-\hat\theta_{t,u}}{\hat\theta_{t,u}} - \frac{\left(\theta_{t,u}-\hat\theta_{t,u}\right)^2}{2\hat\theta_{t,u}^2} + \frac{\left(\theta_{t,u}-\hat\theta_{t,u}\right)^3}{3\tilde\theta_{t,u}^3}\right) \qquad (13)$$
where $\tilde\theta_{t,u}$ is between $\theta_{t,u}$ and $\hat\theta_{t,u}$. We will consider the first, second and third order terms of the Taylor series in equation (13) above in turn.

For the first-order term of (13), note that $-n\hat\pi_{t,u}\frac{\theta_{t,u}-\hat\theta_{t,u}}{\hat\theta_{t,u}} = n\!\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)$ and that $E\!\left[n\!\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)\right] = 0$, since $E\hat\pi_{t,u} = \pi_{t,u} = \pi_t\theta_{t,u}$ and $E\hat\pi_t = \pi_t$. Hence:
$$\lim_n E\!\left[n\!\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)\mathbf{1}_{A'_{n,\varepsilon}}\right] = -\lim_n E\!\left[n\!\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] = 0$$
because $\left|\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right|$ is absolutely bounded and $\lim_n nP\!\left(\overline{A'_{n,\varepsilon}}\right) = 0$, so the first-order term vanishes in the limit.

For the second-order term of (13), note that $n\hat\pi_{t,u}\frac{\left(\theta_{t,u}-\hat\theta_{t,u}\right)^2}{2\hat\theta_{t,u}^2} = \frac{n\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2}{2\hat\pi_{t,u}}$. On $A'_{n,\varepsilon}$ we have $\pi_{t,u}-n^{-\varepsilon}\le\hat\pi_{t,u}\le\pi_{t,u}+n^{-\varepsilon}$, so
$$\lim_n E\!\left[\frac{n\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2}{2\left(\pi_{t,u}+n^{-\varepsilon}\right)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right] \le \lim_n E\!\left[\frac{n\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2}{2\hat\pi_{t,u}}\,\mathbf{1}_{A'_{n,\varepsilon}}\right] \le \lim_n E\!\left[\frac{n\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2}{2\left(\pi_{t,u}-n^{-\varepsilon}\right)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right]$$
Since $\lim_n nE\!\left[\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2\mathbf{1}_{\overline{A'_{n,\varepsilon}}}\right] = 0$ (again because the variable is bounded and $\lim_n nP\!\left(\overline{A'_{n,\varepsilon}}\right) = 0$), both bounds converge to $\frac{\sigma_{t,u}}{2\pi_{t,u}}$, which is therefore the limiting contribution of the second-order term.

For the third-order term of (13), note that $\left|n\hat\pi_{t,u}\frac{\left(\theta_{t,u}-\hat\theta_{t,u}\right)^3}{3\tilde\theta_{t,u}^3}\right| = \frac{n\hat\theta_{t,u}\left|\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right|^3}{3\hat\pi_t^2\tilde\theta_{t,u}^3}$. On $A'_{n,\varepsilon}$, the quantities $\hat\pi_t$ and $\tilde\theta_{t,u}$ are eventually bounded away from 0, $\hat\theta_{t,u}\le1$, and $\left|\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right| \le \left|\hat\pi_{t,u}-\pi_{t,u}\right| + \theta_{t,u}\left|\hat\pi_t-\pi_t\right| \le (1+a)n^{-\varepsilon}$, so the term is bounded by a constant multiple of $n^{1-\frac{\varepsilon}{2}}\left|\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right|^{\frac52}$. By Lemma 1.1.21 and Proposition 1.1.20 of [10], $E\!\left[\left|\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right|^{\frac52}\right] \le c\left(E\!\left[\left(\hat\pi_{t,u}-\hat\pi_t\theta_{t,u}\right)^2\right]\right)^{\frac54} = O\!\left(n^{-\frac54}\right)$, so the expectation of the third-order term is $O\!\left(n^{-\frac14-\frac{\varepsilon}{2}}\right)$ and vanishes in the limit.

Putting together the results from this paragraph, we obtain:
$$\lim_n E\!\left[\log\frac{P_{\hat\theta}(X^n)}{P_\theta(X^n)}\,\mathbf{1}_{A'_{n,\varepsilon}}\right] = \sum_{t\in A^k}\sum_{u\in A}\frac{\sigma_{t,u}}{2\pi_{t,u}}$$
Putting this together with the preceding portion of the proof yields the desired result. □

It is relatively easy to apply the above result to the particular case in which the process generating the data is a Markov chain:
Corollary 1 If $X_1, X_2,\ldots$ is a kth-order Markov chain such that $\theta_{t,u} > 0$ for all $t\in A^k$ and $u\in A$, then:
$$\lim_n\left(R_n(\theta;f) - \frac{a^k(a-1)}{2}\log\frac{n}{2\pi}\right) = \log\frac{\sqrt{\det(I_n(\theta))}}{f(\theta)} - \frac{a^k(a-1)}{2} = \frac12\sum_{t\in A^k}\left((a-1)\log\pi_t - \sum_{u\in A}\log\theta_{t,u}\right) - \log f(\theta) - \frac{a^k(a-1)}{2}$$
Proof. By Proposition 1.1.20 of [10], we have:
$$\sigma_{t,u} = E\!\left[\left(\mathbf{1}_{X(1:k+1)=tu}-\mathbf{1}_{X(1:k)=t}\,\theta_{t,u}\right)^2\right] + 2\sum_{n=1}^{\infty}E\!\left[\left(\mathbf{1}_{X(n+1:n+k+1)=tu}-\mathbf{1}_{X(n+1:n+k)=t}\,\theta_{t,u}\right)\left(\mathbf{1}_{X(1:k+1)=tu}-\mathbf{1}_{X(1:k)=t}\,\theta_{t,u}\right)\right]$$
The first term equals
$$P(X(1:k+1)=tu) - 2P(X(1:k+1)=tu)\,\theta_{t,u} + P(X(1:k)=t)\,\theta_{t,u}^2 \qquad (14)$$
and the general term of the sum equals
$$P(X(n+1:n+k+1)=tu,\ X(1:k+1)=tu) - P(X(n+1:n+k+1)=tu,\ X(1:k)=t)\,\theta_{t,u} - P(X(n+1:n+k)=t,\ X(1:k+1)=tu)\,\theta_{t,u} + P(X(n+1:n+k)=t,\ X(1:k)=t)\,\theta_{t,u}^2 \qquad (15)$$
However:
$$P(X(n+1:n+k+1)=tu,\ X(1:k+1)=tu) = P(X(n+1:n+k)=t,\ X(1:k+1)=tu)\,\theta_{t,u}$$
and:
$$P(X(n+1:n+k+1)=tu,\ X(1:k)=t) = P(X(n+1:n+k)=t,\ X(1:k)=t)\,\theta_{t,u}$$
so that every term of the sum (15) cancels, and (14) yields $\sigma_{t,u} = \pi_{t,u}\left(1-\theta_{t,u}\right)$. Hence:
$$\sum_{t\in A^k}\sum_{u\in A}\frac{\sigma_{t,u}}{2\pi_{t,u}} = \sum_{t\in A^k}\sum_{u\in A}\frac{1-\theta_{t,u}}{2} = \frac{a^k(a-1)}{2}$$
Combining this with Theorem 1 yields the desired result. □

From the above result, it can be seen that the prior:
$$f(\theta) = c\prod_{t\in A^k}\frac{\pi_t^{\frac{a-1}{2}}}{\prod_{u\in A}\theta_{t,u}^{\frac12}} \qquad (16)$$
where c is chosen to normalize $f(\theta)$ (in fact, f is normalizable since $f(\theta)\le c\prod_{t\in A^k}\prod_{u\in A}\theta_{t,u}^{-\frac12}$, which is proportional to a product of Dirichlet densities), equalizes the asymptotic redundancy with its rate of growth subtracted off. Hence, the estimator corresponding to this prior distribution is minimax for the asymptotic problem. Because of the duality between estimators and codes (see [1], for example), this yields an optimal asymptotic minimax rate of convergence for all codes.
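The equalization property of (16) can be seen directly from Corollary 1: up to the normalizing constant c, the limit no longer depends on θ, whereas for a flat prior it does. The sketch below is my own check for a binary alphabet with k = 1; the function names are hypothetical.

\begin{verbatim}
# Sketch: the Corollary 1 constant 0.5 * sum_t ((a-1) log pi_t - sum_u log theta_{t,u})
# minus log f(theta) is independent of theta for the (unnormalized) prior (16),
# but not for a flat prior.  Binary alphabet, order k = 1.
import numpy as np

def stationary(theta):
    """Stationary distribution of a 2-state chain with transition matrix theta."""
    p = np.array([theta[1, 0], theta[0, 1]])
    return p / p.sum()

def corollary_constant(theta, log_f):
    pi = stationary(theta)
    half_log_det = 0.5 * sum(np.log(pi[t]) - np.log(theta[t]).sum() for t in (0, 1))
    return half_log_det - log_f(theta, pi)

def log_prior_16(theta, pi):   # unnormalized log of (16): prod_t pi_t^{1/2} / prod_u theta_{t,u}^{1/2}
    return sum(0.5 * np.log(pi[t]) - 0.5 * np.log(theta[t]).sum() for t in (0, 1))

def log_flat(theta, pi):       # f identically 1
    return 0.0

rng = np.random.default_rng(3)
for _ in range(3):
    p = rng.uniform(0.1, 0.9, size=2)
    theta = np.array([[1 - p[0], p[0]], [p[1], 1 - p[1]]])
    # first column is (numerically) zero for every theta; second column varies
    print(corollary_constant(theta, log_prior_16), corollary_constant(theta, log_flat))
\end{verbatim}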
5 Acknowledgements

I would like to thank my dissertation advisor, Dr. Max Mintz, and Dr. Andrew Barron for many discussions related to this work. I would also like to thank the anonymous reviewers for their very diligent reviews and many extremely useful comments.
References

[1] Kevin Atteson. A Mathematical Formalism for the Design of Optimal Adaptive Text Data Compression Rules. PhD thesis, University of Pennsylvania, 1995.

[2] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.

[3] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.

[4] Karl Wilhelm Breitung. Asymptotic Approximations for Probability Integrals. Springer-Verlag, 1994.

[5] Bertrand S. Clarke and Andrew R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36(3):453-471, 1990.

[6] Bertrand Salem Clarke. Asymptotic Cumulative Risk and Bayes Risk under Entropy Loss, with Applications. PhD thesis, University of Illinois at Urbana-Champaign, 1989.

[7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.

[8] Lee D. Davisson, Robert J. McEliece, Michael J. Pursley, and Mark S. Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269-279, May 1981.

[9] Jun-ichi Takeuchi and Tsutomu Kawabata. Approximation of Bayes code for Markov sources. In 1995 IEEE International Symposium on Information Theory, page 391, 1995.

[10] M. Iosifescu and R. Theodorescu. Random Processes and Learning. Springer-Verlag, 1969.

[11] Toshiyasu Matsushima, Hiroshige Inazumi, and Shigeichi Hirasawa. A class of distortionless codes designed by Bayes decision theory. IEEE Transactions on Information Theory, 37(5):1288-1293, September 1991.

[12] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30:629-636, 1984.

[13] Paul C. Shields. Universal redundancy rates do not exist. IEEE Transactions on Information Theory, 39(2):520-524, 1993.

[14] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.

[15] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.