Iterated Logarithmic Expansions of the Pathwise Code Lengths for Exponential Families

Lei Li

Florida State University

Bin Yu

Bell Laboratories, Lucent Technologies, and University of California at Berkeley

January 14, 2000

Abstract

Rissanen's Minimum Description Length (MDL) principle is a statistical modeling principle motivated by coding theory. For exponential families we obtain pathwise expansions, to the constant order, of the predictive and mixture code lengths used in MDL. The results are useful for understanding different MDL forms.

Key words: minimum description length, mixture code, predictive code, two-stage code, law of iterated logarithm

1 Introduction and background

The Minimum Description Length (MDL) principle was introduced by Rissanen as a fundamental principle for modeling data; see [15, 17] and the reference list in [18]. If we encode data from a source by prefix codes, the best code is the one that achieves the minimum description length among all prefix codes, if such a code exists. Because of the equivalence between a prefix codelength and the negative logarithm of the corresponding probability distribution (via Kraft's inequality), this in turn gives us a modeling principle, namely, the MDL principle: choose the model which gives the minimal description of the data. When the source distribution is known, Shannon's coding theorem dictates that we use the source distribution for the optimal code, and its codelength is the negative logarithm of the probability of the data. When the source distribution is unknown, the search for a universal modeling procedure is the search for a universal coding scheme. This universal coding scheme provides us with a distribution, the negative logarithm of whose frequency function or density function is the corresponding codelength of the (ideal) optimal code (ignoring the precision issue). Many favorable properties have been shown for MDL-based statistical procedures in both parametric and nonparametric contexts (see the two recent reviews by Barron, Rissanen and Yu [1] and Hansen and Yu [13]). A crucial step in implementing an MDL procedure is to choose the form of the universal code. Three forms have been studied extensively: two-stage, predictive and mixture, although new forms are appearing, such as Normalized Maximum Likelihood (NML) (Rissanen, [19]). In order to compare these forms, here we provide pathwise expansions of codelengths generated from exponential families. Finite mixtures of the corresponding conjugate distributions are used as priors for mixture codes. A related work on the mixture code is Takeuchi and Barron [20], where they obtain an almost sure expansion for exponential families and Jeffreys' priors (when they exist).

Given a parametric family $\{P_\theta : \theta \in \Theta\}$, where $\Theta$ is an open subset of $R^d$, we denote the density functions by $p(x|\theta)$. Assume $X_1, \ldots, X_n$ are independent and identically distributed (iid) random variables from $P_{\theta_0}$, where $\theta_0$ is an interior point of $\Theta$. Then the codelength of the optimal code is given by Shannon's coding theorem as $L_0 = -\sum_{i=1}^n \log p(X_i|\theta_0)$. The codelength corresponding to a given distribution $Q(x)$ is $L_Q = -\sum_{i=1}^n \log q(X_i)$, and the redundancy is $R_Q = L_Q - L_0$. Rissanen [17] shows that for each positive number $\epsilon$ and for all $\theta_0 \in \Theta$, except in a set whose volume goes to zero as $n \to \infty$,

$$ E_{P_{\theta_0}} R_Q \ge \frac{d - \epsilon}{2}\,\log n . \qquad (1) $$

Later Barron and Hengartner ([2]) prove that, except for a set of parameter values with volume zero,

$$ \liminf_{n\to\infty} \frac{E_{P_{\theta_0}} R_Q}{\frac{d}{2}\log n} \ge 1 . \qquad (2) $$

Several coding schemes have been shown to achieve this bound. Among them are the two-stage, mixture and predictive codes. In particular, under regularity conditions Clarke and Barron [5, 1] prove that the expected redundancy of the mixture code is given by

$$ E_{P_{\theta_0}} R_Q = \frac{d}{2}\log n - \frac{d}{2}\log 2\pi e + \log\frac{(\det I(\theta_0))^{1/2}}{w(\theta_0)} + o(1) , \qquad (3) $$

where $I(\theta)$ is the Fisher information matrix and $w(\theta)$ is the density of the prior. For the predictive code, the leading term $\frac{d}{2}\log n$ of the average regret has also been obtained; see [16, 18]. In this paper, we carry out a careful study of the pathwise codelengths of the predictive and mixture codes when the source distribution belongs to an exponential family and the prior is taken

to be a finite mixture of its conjugate distributions. It is shown that the leading term of the pathwise redundancies of these two codes is $\frac{d}{2}\log n$, and the second term is of the order of an iterated logarithm with the same nonpositive bounded random coefficient. The third term is a constant for the mixture code, and a finite random variable for the predictive code which, for some sources, does not depend on the parameter of the underlying distribution. A similar expansion for the two-stage codelength can be derived. The rest of the paper is organized as follows. Section 2 contains the main results and Section 3 consists of a discussion. All the proofs can be found in Section 4 (Appendix).
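To make the quantities $L_0$, $L_Q$ and $R_Q$ concrete, here is a minimal sketch (ours, not part of the paper; it assumes a Bernoulli source and a fixed, misspecified coding distribution $Q$) that computes the ideal codelengths in nats for one simulated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0, q = 1000, 0.3, 0.5        # true success probability and a fixed coding probability
x = rng.binomial(1, theta0, size=n)  # iid Bernoulli(theta0) data

def codelength(x, p):
    """Ideal codelength (in nats) of the string x under an iid Bernoulli(p) model."""
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

L0 = codelength(x, theta0)   # Shannon codelength under the true source distribution
LQ = codelength(x, q)        # codelength under the fixed distribution Q
print("R_Q =", LQ - L0)      # redundancy; grows linearly in n unless Q adapts to the data
```

The universal codes discussed below bring this redundancy down to the $\frac{d}{2}\log n$ rate appearing in (1)-(3).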

2 Main results

Consider a canonical exponential family of distributions $\{P_\theta : \theta \in \Theta\}$, where the natural parameter space $\Theta$ is an open set of $R^d$. The density function is given by

$$ p(x|\theta) = \exp\{\theta' T(x) - A(\theta)\} , \qquad (4) $$

with respect to some measure $\nu(dx)$ on the support. The transposition of a matrix (or vector) is denoted by a prime here and throughout the paper. $T(\cdot)$ is the sufficient statistic for the parameter $\theta$. Recall that $X_1, \ldots, X_n$ are iid random variables from $P_{\theta_0}$, where $\theta_0$ is an interior point of $\Theta$. We

denote the first derivative and the second derivative of $A(\theta)$ by $\dot A(\theta)$ and $\ddot A(\theta)$, respectively. The (differential) entropy of $P_\theta$ is

$$ H(\theta) = A(\theta) - \theta' \dot A(\theta) . \qquad (5) $$
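As a concrete illustration (ours; the paper does not spell this case out here, but it is the unit-variance instance of the Gaussian source used in Proposition 2.1 below), take $d = 1$ and the $N(\theta, 1)$ family written in the form (4): $T(x) = x$, $A(\theta) = \theta^2/2$, and $\nu(dx) = (2\pi)^{-1/2} e^{-x^2/2}\, dx$, so that

$$ p(x|\theta) = \exp\{\theta x - \theta^2/2\} , \qquad \dot A(\theta) = \theta = E_\theta T(X) , \qquad \ddot A(\theta) = 1 = I(\theta) , \qquad H(\theta) = A(\theta) - \theta \dot A(\theta) = -\theta^2/2 . $$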

First let us consider the mixture code. The conjugate prior of (4) has the following form:

$$ u(\theta) = \exp\{\xi'\theta - \lambda A(\theta) - B(\xi, \lambda)\} , \qquad (6) $$

where $\xi$ is a vector in $R^d$, $\lambda$ is a scalar, and

$$ B(\xi, \lambda) = \log \int_\Theta \exp\{\xi'\theta - \lambda A(\theta)\}\, d\theta . \qquad (7) $$

In this paper, we assume that the prior of $\theta$ takes the form of a finite mixture of the conjugate distributions, as in

$$ w(\theta) = \sum_{j=1}^J \alpha_j \exp\{\xi_j'\theta - \lambda_j A(\theta) - B(\xi_j, \lambda_j)\} = \sum_{j=1}^J \alpha_j u_j(\theta) . \qquad (8) $$

Not only is this form technically easy to handle for the purpose of obtaining the almost sure convergence, but it also has the virtue of approximating general priors, at least in the Binomial and Gaussian cases. In the first case, the conjugate prior is the Beta distribution. Notice that any prior $w(\theta)$ continuous on $[0, 1]$ can be uniformly approximated by the Bernstein polynomials (see [14])

$$ \sum_{j=0}^J \frac{J!}{j!(J-j)!}\, \theta^j (1-\theta)^{J-j}\, w\Big(\frac{j}{J}\Big) = \sum_{j=0}^J \frac{1}{J+1}\, w\Big(\frac{j}{J}\Big)\, b_{j+1, J-j+1}(\theta) , \qquad (9) $$

where $b_{r,s}(x) = \frac{(r+s-1)!}{(r-1)!(s-1)!}\, x^{r-1}(1-x)^{s-1}$ is the density function of the Beta distribution with integer parameters $r$ and $s$. In order to construct a legitimate distribution, we need to normalize the weights in (9). Since $\sum_j \frac{1}{J+1} w(\frac{j}{J})$ sums to unity as $J$ gets large, we see that the prior $w(\theta)$ can be well approximated by a finite mixture of Beta distributions.
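A small numerical sketch of this construction (ours; the prior $w$ below is an arbitrary illustrative density on $[0,1]$, and the truncation level $J$ is a tuning choice) builds the normalized Beta-mixture weights from the Bernstein coefficients and checks the approximation on a grid:

```python
import numpy as np
from scipy.stats import beta

def beta_mixture_approx(w, J):
    """Approximate a continuous prior w on [0,1] by a (J+1)-component Beta mixture,
    using the Bernstein weights w(j/J)/(J+1) of (9), renormalized to sum to one."""
    j = np.arange(J + 1)
    raw = w(j / J) / (J + 1)
    weights = raw / raw.sum()                          # legitimate mixture weights
    comps = [beta(jj + 1, J - jj + 1) for jj in j]     # b_{j+1, J-j+1} densities
    def density(t):
        return sum(a * c.pdf(t) for a, c in zip(weights, comps))
    return density

w = lambda t: 6.0 * t * (1.0 - t)                      # an illustrative prior density on [0,1]
approx = beta_mixture_approx(w, J=80)
grid = np.linspace(0.01, 0.99, 99)
print(np.max(np.abs(approx(grid) - w(grid))))          # sup-norm error shrinks as J grows
```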

In the case of the Gaussian distribution with unknown location parameter and known variance parameter, the conjugate prior is still Gaussian. Ferguson [10] remarked that an arbitrary density on the real line may be closely approximated by Gaussian mixtures. Under the prior $w(\theta)$, the marginal distribution of $(x_1, \ldots, x_n)$ is

$$ m(x_1, \ldots, x_n) = \int_\Theta \prod_{i=1}^n p(x_i|\theta)\, w(\theta)\, d\theta , \qquad (10) $$

and the codelength of the mixture code is given by $L_{\mathrm{mixture}} = -\log m(X_1, \ldots, X_n)$. Hence the redundancy of the mixture code is $R_{\mathrm{mixture}} = L_{\mathrm{mixture}} - L_0$. For the predictive code, after obtaining the observations $x_1, \ldots, x_{i-1}$, we estimate $\theta$ by the maximum likelihood estimate $\hat\theta_{i-1}$, and in turn encode the next observation according to this currently estimated distribution. This procedure (Rissanen [16, 17]) has intimate connections with the prequential approach to statistical inference advocated by Dawid [7, 8]. The redundancy of this coding scheme is given by $R_{\mathrm{predictive}} = L_{\mathrm{predictive}} - L_0$, where $L_{\mathrm{predictive}} = -\sum_{i=2}^n \log p(X_i|\hat\theta_{i-1})$, and we take $L_0$ excluding the first observation in this situation for simplicity. The pathwise asymptotics of $R_{\mathrm{predictive}}$ is given in Theorem 2.1 below.
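As a concrete illustration of the predictive codelength (ours, assuming the unit-variance Gaussian case treated in Proposition 2.1 below, so $d = 1$), the following sketch computes $L_{\mathrm{predictive}}$ by encoding each $X_i$ with the plug-in density $p(\cdot|\hat\theta_{i-1})$ and compares the resulting redundancy with the $\frac{d}{2}\log n$ leading term:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0 = 5000, 0.7
x = rng.normal(theta0, 1.0, size=n)              # iid N(theta0, 1) sample

def nll(y, mean):
    """-log density of N(mean, 1) at y, i.e. the ideal codelength in nats."""
    return 0.5 * np.log(2.0 * np.pi) + 0.5 * (y - mean) ** 2

theta_hat = np.cumsum(x) / np.arange(1, n + 1)   # MLE after each observation
L_pred = np.sum(nll(x[1:], theta_hat[:-1]))      # encode X_i with theta_hat_{i-1}
L0 = np.sum(nll(x[1:], theta0))                  # L_0, first observation excluded as in the text
print(L_pred - L0, 0.5 * np.log(n))              # R_predictive vs. the (d/2) log n leading term
```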

Theorem 2.1

$$ R_{\mathrm{predictive}} = \frac{d}{2}\log n - C_n(\omega)\,(\log\log n) + D_n(\omega) , \qquad (11) $$

where the sequence of nonnegative random variables $\{C_n(\omega)\}$ has the following property,

$$ \limsup_{n\to\infty} C_n(\omega) \le d , \qquad (12) $$

for almost all paths $\omega$, and the sequence of random variables $\{D_n(\omega)\}$ converges to an almost surely finite random variable $D(\omega)$.

As a matter of fact, the upper limit of $C_n(\omega)$ is bounded below by 1 when $d > 1$ (more accurate bounds are difficult to obtain in this case). For general source distributions, it is difficult to obtain the third term $\{D_n(\omega)\}$. But in some specific cases, this term can be calculated explicitly. The following is an example.

Proposition 2.1 Suppose the source of the data follows a Gaussian $N(\theta, \Sigma)$ with a known $\Sigma$; then the distributions of the random variables $\{D_n(\omega)\}$ and $D(\omega)$ do not depend on the parameters of the source distribution. The limiting mean of $D_n(\omega)$ is given by

$$ \lim_{n\to\infty} E(D_n(\omega)) = E(D(\omega)) = \frac{d(c_{\mathrm{euler}} + 1)}{2} , \qquad (13) $$

where $c_{\mathrm{euler}}$ is the Euler constant defined as $\lim_{n\to\infty} \big(\sum_{i=1}^n 1/i - \log n\big)$. Moreover, the expected redundancy of the predictive code satisfies

$$ \sup_{\theta_0} E(R_{\mathrm{predictive}}) = \frac{d}{2}\log n + \frac{d}{2}\, c_{\mathrm{euler}} + o(1) . \qquad (14) $$

A simulation study of the distribution of $D_n$ can be found in the discussion section. We can also compare the predictive codelength with $nH(\hat\theta_n)$ to get an expansion of the regret of the predictive code.

Proposition 2.2

$$ L_{\mathrm{predictive}} = \Big(-\sum_{i=1}^n \log p(X_i|\hat\theta_n)\Big) + \frac{d}{2}\log n + \tilde D_n(\omega) = nH(\hat\theta_n) + \frac{d}{2}\log n + \tilde D_n(\omega) , \qquad (15) $$

where the sequence of random variables $\{\tilde D_n(\omega)\}$ converges to an almost surely finite random variable $\tilde D(\omega)$. In addition, if the source follows a Gaussian $N(\theta, \Sigma)$, then $\tilde D_n(\omega) = D_n(\omega) + \log p(X_1|\theta_0)$, where $D_n(\omega)$ is the same as that in Proposition 2.1. Now we turn to the mixture codelength, whose pathwise asymptotics is as follows.

Theorem 2.2

$$ R_{\mathrm{mixture}} = \frac{d}{2}\log n - C_n(\omega)\,(\log\log n) - \frac{d}{2}\log 2\pi + \log\frac{(\det I(\theta_0))^{1/2}}{w(\theta_0)} + o(1) , \qquad (16) $$

where the sequence of nonnegative random variables $\{C_n(\omega)\}$ has the following property,

$$ \limsup_{n\to\infty} C_n(\omega) \le d , \qquad (17) $$

for almost all paths $\omega$. $\{C_n(\omega)\}$ is the same as that in Theorem 2.1. Similarly to Proposition 2.2, we have the following.

Proposition 2.3

$$ L_{\mathrm{mixture}} = \Big(-\sum_{i=1}^n \log p(X_i|\hat\theta_n)\Big) + \frac{d}{2}\log n - \frac{d}{2}\log 2\pi + \log\frac{(\det I(\theta_0))^{1/2}}{w(\theta_0)} + o(1) = nH(\hat\theta_n) + \frac{d}{2}\log n - \frac{d}{2}\log 2\pi + \log\frac{(\det I(\theta_0))^{1/2}}{w(\theta_0)} + o(1) . \qquad (18) $$

3 Discussion

3.1 The two-stage code

The two-stage code introduced by Rissanen ([18]) has a codelength with the same leading terms as that of the mixture code. That is,

$$ L_{\mathrm{two\text{-}stage}} = \Big(-\sum_{i=1}^n \log p(X_i|\hat\theta_n)\Big) + \frac{d}{2}\log n + \mathrm{Constant} = nH(\hat\theta_n) + \frac{d}{2}\log n + \mathrm{Constant} . \qquad (19) $$

Thus the iterated-logarithm pathwise expansion for the mixture code applies to the two-stage code except for the constant term. Later, for Jeffreys priors, Rissanen ([19]) modified the two-stage code to match the mixture codelength not only in the $\log n$ term but also in the constant term, by incorporating the Fisher information. This modified two-stage code is called the Normalized Maximum Likelihood (NML) code. For example, if the source follows a binomial distribution, then the NML turns out to be equivalent to the mixture code with respect to Jeffreys' prior. In this case, our pathwise expansion for the mixture code applies to the NML code up to the $o(1)$ term.
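A small numerical check of this correspondence (ours; it works in the Bernoulli case and in nats) compares the NML codelength with the codelength of the mixture code under the Jeffreys prior, i.e. the Beta(1/2, 1/2) mixture, as a function of the number of ones $k$ in a binary string of length $n$:

```python
import numpy as np
from scipy.special import gammaln, xlogy, betaln

def log_max_lik(n, k):
    """log of the maximized Bernoulli likelihood (k/n)^k (1-k/n)^(n-k), with 0 log 0 = 0."""
    return xlogy(k, k / n) + xlogy(n - k, 1.0 - k / n)

def nml_codelength(n, k):
    """Codelength (nats) of a binary string with k ones under the Bernoulli NML code."""
    kk = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(kk + 1) - gammaln(n - kk + 1)
    log_normalizer = np.logaddexp.reduce(log_binom + log_max_lik(n, kk))
    return log_normalizer - log_max_lik(n, k)

def jeffreys_codelength(n, k):
    """Codelength (nats) under the Beta(1/2,1/2) (Jeffreys) mixture code."""
    return -(betaln(k + 0.5, n - k + 0.5) - betaln(0.5, 0.5))

n = 500
for k in (0, 50, 250):
    print(k, nml_codelength(n, k), jeffreys_codelength(n, k))
```

For counts away from the endpoints the two columns essentially coincide, which is the sense in which the expansion for the Jeffreys-prior mixture carries over to the NML code.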

3.2 Comparing the mixture and predictive codes

Comparing Theorems 2.1 and 2.2, we see that the redundancies of the predictive and mixture codes have the same $\log n$ term and $\log\log n$ term. They differ in the third term, which is finite in both cases. In the case of the mixture code, this term is a constant depending on the parameter $\theta_0$ unless the prior is selected as Jeffreys' prior (cf. [6, 20]). For some families, such as the Gaussian distribution, proper Jeffreys' priors do not exist over the natural parameter spaces, so modifications have to be made, and this may result in a large constant term. In the case of the predictive code, the third term is a random variable. If the source follows a Gaussian distribution, then the distribution of this random variable does not depend on the parameter of the source distribution. We compare the average redundancies of the mixture and predictive codes using (3) and (14). The constant term for the predictive code does not depend on the source distribution, whereas that for the mixture code does, unless the prior is taken to be Jeffreys' (or its modifications).

In order to find the distribution of $D(\omega)$ in Proposition 2.1, we carried out a simulation study for the case $d = 1$ under the Gaussian assumption (with variance 1). We simulated 20,000 paths. For each of these paths, $D_{5000}(\omega)$ in Proposition 2.1 was calculated. The histogram of these 20,000 $D_{5000}(\omega)$'s is shown in Figure 1. Since $D(\omega)$ is the pathwise limit of $D_n(\omega)$, the distribution of $D_n(\omega)$ converges to the distribution of $D(\omega)$ as $n$ gets large. Here we use the distribution of $D_{5000}(\omega)$ to approximate the distribution of $D(\omega)$. The sample mean and standard deviation of the 20,000 $D_{5000}(\omega)$'s are 0.793 and 1.386, respectively, whereas the theoretical mean of $D_{5000}(\omega)$ is well approximated by $0.5 \times (1 + c_{\mathrm{euler}}) = 0.789$; see Proposition 2.1. The theoretical standard deviation is difficult to calculate. In the Gaussian case, if $d > 1$, then $D(\omega)$ is the sum of $d$ independent and identically distributed copies of that when $d = 1$ as discussed above. This can be seen from the proof of Theorem 2.1 (see the term $R_1$ in the proof).
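A sketch of this simulation (ours; it relies on the identity $D_n = R_{\mathrm{predictive}} - \frac{1}{2}\log n + \frac{S_n^2}{2n}$ that follows from (11) and (29) in the unit-variance Gaussian case with $d = 1$, and it uses fewer paths than the paper's 20,000 to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n = 2000, 5000                 # the paper uses 20,000 paths of length 5,000
theta0 = 0.0                            # the distribution of D_n does not depend on theta0
D = np.empty(n_paths)

for p in range(n_paths):
    x = rng.normal(theta0, 1.0, size=n)
    theta_hat = np.cumsum(x) / np.arange(1, n + 1)
    # pathwise redundancy of the predictive code; the log(2 pi) terms cancel
    R_pred = 0.5 * np.sum((x[1:] - theta_hat[:-1]) ** 2 - (x[1:] - theta0) ** 2)
    S_n = np.sum(x - theta0)
    D[p] = R_pred - 0.5 * np.log(n) + S_n ** 2 / (2.0 * n)   # D_n = I_n in the Gaussian case

c_euler = 0.57721566
print(D.mean(), 0.5 * (1.0 + c_euler))  # sample mean vs. the limiting mean from Proposition 2.1
print(D.std())                          # the paper reports about 1.386 from 20,000 paths
```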

4 Appendix

This section contains the proofs of the results in Sections 2 and 3. The following basic facts about the exponential family (4) are needed (see [3]).


Figure 1: The histogram of $D_{5000}(\omega)$ for a Gaussian source with $d = 1$. It is obtained from 20,000 paths.

1. $E(T(X)) = \dot A(\theta)$, and $Var(T(X)) = \ddot A(\theta)$.

2. $\dot A(\theta)$ is one to one on the natural parameter space.

3. The maximum likelihood estimate $\hat\theta_n$ based on $(X_1, \ldots, X_n)$ is given by

$$ \hat\theta_n = \dot A^{-1}(\bar T_n) , \qquad (20) $$

where $\bar T_n = \frac{1}{n}\sum_{i=1}^n T(X_i)$.

4. The Fisher information matrix $I(\theta) = \ddot A(\theta)$.

5. The MLE $\hat\theta_n \to \theta$ almost surely, since $\bar T_n \to E T(X) = \dot A(\theta)$ almost surely by the strong law of large numbers (a concrete instance for the Bernoulli family is sketched after this list).
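A minimal sketch of these facts for a concrete family (ours; it takes the Bernoulli family in its natural parameterization, $\theta = \log\frac{p}{1-p}$, $T(x) = x$, $A(\theta) = \log(1 + e^\theta)$):

```python
import numpy as np

rng = np.random.default_rng(3)

A_dot  = lambda th: 1.0 / (1.0 + np.exp(-th))       # A'(theta) = E_theta T(X), the mean p
A_ddot = lambda th: A_dot(th) * (1.0 - A_dot(th))   # A''(theta) = Var_theta T(X) = I(theta)
A_dot_inv = lambda m: np.log(m / (1.0 - m))         # inverse of A' on (0, 1)

theta0 = A_dot_inv(0.3)                             # natural parameter for success probability 0.3
x = rng.binomial(1, A_dot(theta0), size=10000)
theta_hat = A_dot_inv(np.mean(x))                   # MLE = A'^{-1}(T_bar_n), as in (20)
print(theta_hat, theta0, A_ddot(theta_hat))         # theta_hat -> theta0 a.s. (fact 5)
```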

Proof of Theorem 2.1. Let $G = \{\omega : \bar T_n(\omega) \to \dot A(\theta_0)\}$. It is easy to see that $\hat\theta_n \to \theta_0$ for each path in $G$. Since $P(G) = 1$, we only consider the paths in $G$ from now on. The redundancy can be rearranged as

$$ \begin{aligned} R_{\mathrm{predictive}} &= \sum_{i=1}^{n-1} \big\{ [A(\hat\theta_i) - \hat\theta_i' T(X_{i+1})] - [A(\theta_0) - \theta_0' T(X_{i+1})] \big\} \\ &= \sum_{i=1}^{n-1} \big\{ -[T(X_{i+1}) - \dot A(\theta_0)]'(\hat\theta_i - \theta_0) + [A(\hat\theta_i) - A(\theta_0)] - \dot A(\theta_0)'(\hat\theta_i - \theta_0) \big\} \\ &= \sum_{i=1}^{n-1} \big\{ -[T(X_{i+1}) - \dot A(\theta_0)]'(\hat\theta_i - \theta_0) + \tfrac{1}{2}(\hat\theta_i - \theta_0)' \ddot A(\theta_0)(\hat\theta_i - \theta_0) + r_i \big\} , \end{aligned} \qquad (21) $$

where $r_i$ is the residual term in the Taylor expansion of $A(\hat\theta_i) - A(\theta_0)$ up to the second term. In the sequel, we represent the $k$-th component of a vector $b$ by $b^{(k)}$. Then $r_i$ can be expressed as

$$ r_i = \sum_{\{k,l,m\}} [\hat\theta_i^{(k)} - \theta_0^{(k)}][\hat\theta_i^{(l)} - \theta_0^{(l)}][\hat\theta_i^{(m)} - \theta_0^{(m)}]\, \frac{\partial^3 A(\theta)}{\partial\theta^{(k)}\,\partial\theta^{(l)}\,\partial\theta^{(m)}}\bigg|_{\theta_0 + \eta(\hat\theta_i - \theta_0)} , \qquad (22) $$

where $\eta$ is a number between 0 and 1, and the sum is calculated over all possible triples $\{k, l, m\}$. Furthermore, we express $\hat\theta_i - \theta_0$ in terms of $\bar T_i - \dot A(\theta_0)$:

$$ \hat\theta_i - \theta_0 = \dot A^{-1}(\bar T_i) - \dot A^{-1}(\dot A(\theta_0)) = \ddot A^{-1}(\theta_0)(\bar T_i - \dot A(\theta_0)) + s_i , \qquad (23) $$

where $s_i = (s_i^{(1)}, \ldots, s_i^{(d)})'$, and

$$ s_i^{(k)} = \sum_{l,m=1,\ldots,d} s_{i,(l,m)}^{(k)} [\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] . \qquad (24) $$

Here $\{s_{i,(l,m)}^{(k)}\}$ are the third-order derivative terms evaluated at a point between $\theta_0$ and $\hat\theta_i$. Plugging this into (21), we have

$$ R_{\mathrm{predictive}} = R_1 + R_2 + R_3 , \qquad (25) $$

where

$$ R_1 = \frac{1}{2} \sum_{i=1}^{n-1} \big\{ -2[T(X_{i+1}) - \dot A(\theta_0)]' \ddot A^{-1}(\theta_0)[\bar T_i - \dot A(\theta_0)] + [\bar T_i - \dot A(\theta_0)]' \ddot A^{-1}(\theta_0)[\bar T_i - \dot A(\theta_0)] \big\} , $$

$$ R_2 = \sum_{k,l,m} \sum_{i=1}^{n-1} a_{i,(k,l,m)} [\bar T_i^{(k)} - \dot A(\theta_0)^{(k)}][\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] , \qquad (26) $$

$$ \begin{aligned} R_3 &= \sum_{i=1}^{n-1} [T(X_{i+1}) - \dot A(\theta_0)]' s_i \\ &= \sum_{k=1}^{d} \sum_{i=1}^{n-1} [T(X_{i+1})^{(k)} - \dot A(\theta_0)^{(k)}]\, s_i^{(k)} \\ &= \sum_{k,l,m} \sum_{i=1}^{n-1} s_{i,(l,m)}^{(k)} [T(X_{i+1})^{(k)} - \dot A(\theta_0)^{(k)}][\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] , \end{aligned} \qquad (27) $$

Here $\{a_{i,(k,l,m)}\}$ have absorbed the relevant coefficients. First let us study the main part $R_1$. Let

$$ Y_i = \ddot A^{-1/2}(\theta_0)[T(X_i) - \dot A(\theta_0)] = (Y_i^{(1)}, Y_i^{(2)}, \ldots, Y_i^{(d)})' . \qquad (28) $$

Then we can check that $Y_1, \ldots, Y_n$ are i.i.d. with $E Y_i = 0$ and $Var\, Y_i = I_d$, where $I_d$ is the identity matrix of order $d$. Let $\bar Y_i = \frac{1}{i}\sum_{j=1}^i Y_j$ and $S_i = \sum_{j=1}^i Y_j = (S_i^{(1)}, S_i^{(2)}, \ldots, S_i^{(d)})'$. Then

$$ \begin{aligned} R_1 &= \frac{1}{2} \sum_{i=1}^{n-1} \Big\{ \frac{-2 Y_{i+1}' S_i}{i} + \frac{S_i' S_i}{i^2} \Big\} \\ &= \frac{1}{2} \sum_{i=1}^{n-1} \Big\{ \frac{S_i' S_i}{i} - \frac{S_{i+1}' S_{i+1}}{i+1} \Big\} + \frac{1}{2} \sum_{i=1}^{n-1} \Big\{ \frac{-2 Y_{i+1}' S_i}{i} + \frac{S_i' S_i}{i^2} - \frac{S_i' S_i}{i} + \frac{S_{i+1}' S_{i+1}}{i+1} \Big\} \\ &= \frac{1}{2} \Big( S_1' S_1 - \frac{S_n' S_n}{n} \Big) + \frac{1}{2} \sum_{i=1}^{n-1} \Big\{ \frac{-2 Y_{i+1}' S_i}{i(i+1)} + \frac{S_i' S_i}{i^2(i+1)} + \frac{Y_{i+1}' Y_{i+1}}{i+1} \Big\} \\ &= \frac{1}{2} \Big( \| S_1 \|^2 - \frac{\| S_n \|^2}{n} \Big) + \frac{1}{2} \sum_{i=1}^{n-1} \frac{\| S_i \|^2}{i^2(i+1)} - \sum_{i=1}^{n-1} \frac{Y_{i+1}' S_i}{i(i+1)} + \frac{1}{2} \sum_{i=1}^{n-1} \frac{\| Y_{i+1} \|^2}{i+1} \\ &= \frac{d}{2}\log n - \frac{\| S_n \|^2}{2n \log\log n}\, \log\log n + I_n , \end{aligned} \qquad (29) $$

where

$$ I_n = \frac{1}{2} \| S_1 \|^2 + \frac{1}{2} \sum_{i=1}^{n-1} \frac{\| S_i \|^2}{i^2(i+1)} - \sum_{i=1}^{n-1} \frac{Y_{i+1}' S_i}{i(i+1)} + \frac{1}{2} \sum_{i=1}^{n-1} \Big( \frac{\| Y_{i+1} \|^2}{i+1} - \frac{d}{i+1} \Big) + \frac{d}{2} \Big( \sum_{i=1}^{n-1} \frac{1}{i+1} - \log n \Big) . \qquad (30) $$

Next we consider these terms separately. The first term is the leading term in (11). As for the second term, notice that

$$ \frac{\| S_n \|^2}{2n \log\log n} = \sum_{k=1}^d \Big[ \frac{S_n^{(k)}}{\sqrt{2n \log\log n}} \Big]^2 . $$

Apply the law of the iterated logarithm (see [9]) to each component $S_n^{(k)}$, namely,

$$ \limsup_{n\to\infty} \frac{S_n^{(k)}}{\sqrt{2n \log\log n}} = 1 , \qquad (31) $$

$$ \liminf_{n\to\infty} \frac{S_n^{(k)}}{\sqrt{2n \log\log n}} = -1 . \qquad (32) $$

In fact, $C_n(\omega) = \frac{\| S_n \|^2}{2n \log\log n}$ in (11). Next we consider the terms in $I_n$ separately.

1. The first term is finite almost surely.

2. Notice that $\sum_{i=1}^{n-1} \frac{\| S_i \|^2}{i^2(i+1)} = \sum_{k=1}^d \sum_{i=1}^{n-1} \frac{[S_i^{(k)}]^2}{i^2(i+1)}$, and $E\big( \frac{[S_i^{(k)}]^2}{i^2(i+1)} \big) = \frac{1}{i(i+1)}$. Thus

$$ E\Big( \sum_{i=1}^{+\infty} \frac{\| S_i \|^2}{i^2(i+1)} \Big) = \sum_{i=1}^{+\infty} \frac{d}{i(i+1)} = d < +\infty . $$

Thus $\sum_{i=1}^{n} \frac{\| S_i \|^2}{i^2(i+1)}$ converges almost surely and also in $L^1$.

3. Let $V_n = \frac{Y_n' S_{n-1}}{(n-1)n}$; then $E(V_n \mid Y_1, \ldots, Y_{n-1}) = 0$, so $\{V_n\}$ is a martingale difference sequence. Moreover,

$$ \begin{aligned} E(V_n^2) &= E\Big[ \frac{Y_n' S_{n-1} S_{n-1}' Y_n}{(n-1)^2 n^2} \Big] = \frac{1}{(n-1)^2 n^2} E[\mathrm{trace}(S_{n-1} S_{n-1}' Y_n Y_n')] \\ &= \frac{1}{(n-1)^2 n^2} E[\mathrm{trace}(S_{n-1} S_{n-1}' E(Y_n Y_n' \mid Y_1, \ldots, Y_{n-1}))] = \frac{1}{(n-1)^2 n^2} E[\mathrm{trace}(S_{n-1} S_{n-1}' E(Y_n Y_n'))] \\ &= \frac{1}{(n-1)^2 n^2} E[\mathrm{trace}(S_{n-1} S_{n-1}')] = \frac{1}{(n-1)^2 n^2} E(\| S_{n-1} \|^2) = \frac{d}{(n-1) n^2} . \end{aligned} $$

The expectation is evaluated by first conditioning on the $\sigma$-algebra $\sigma(Y_1, \ldots, Y_{n-1})$. Since $\sum_{n=1}^{+\infty} E V_n^2 < +\infty$, $\sum_{i=1}^{n-1} \frac{Y_{i+1}' S_i}{i(i+1)}$ converges almost surely, in $L^2$, and thus in $L^1$ (see [9] for the relevant result).

4. Rewrite the fourth term by each component of $Y_i$:

$$ \sum_{i=1}^{n-1} \Big( \frac{\| Y_{i+1} \|^2}{i+1} - \frac{d}{i+1} \Big) = \sum_{i=1}^{n-1} \sum_{k=1}^d \Big( \frac{[Y_{i+1}^{(k)}]^2}{i+1} - \frac{1}{i+1} \Big) = \sum_{k=1}^d \sum_{i=1}^{n-1} \frac{[Y_{i+1}^{(k)}]^2 - 1}{i+1} . $$

Notice that

$$ E\Big( \frac{[Y_{i+1}^{(k)}]^2 - 1}{i+1} \Big) = 0 , \qquad Var\Big( \frac{[Y_{i+1}^{(k)}]^2 - 1}{i+1} \Big) = \frac{Var([Y_1^{(k)}]^2)}{(i+1)^2} , \qquad \sum_{i=1}^{+\infty} Var\Big( \frac{[Y_{i+1}^{(k)}]^2 - 1}{i+1} \Big) < +\infty . $$

Thus by the two-series theorem (pages 49 and 218 in [9]), $\sum_{i=1}^{n-1} \big( \frac{\| Y_{i+1} \|^2}{i+1} - \frac{d}{i+1} \big)$ converges almost surely, in $L^2$, and thus in $L^1$.

5. Finally,

$$ \frac{d}{2} \Big( \sum_{i=1}^{n-1} \frac{1}{i+1} - \log n \Big) \longrightarrow \frac{d}{2} (c_{\mathrm{euler}} - 1) . $$

Now let us look at $R_2$. It is sufficient to consider one triple $(k, l, m)$. According to the law of the iterated logarithm, except for paths in a set with zero probability, the following is true for all sufficiently large $i$:

$$ | \bar T_i^{(k)} - \dot A(\theta_0)^{(k)} | \le 2 \sqrt{\frac{2 \log\log i}{i}} . $$

Also notice that $a_{i,(k,l,m)}$ is bounded as $i$ gets large. Therefore the summand in (26) is at most of the order $(\log\log i)^{3/2} / i^{3/2}$. This means $R_2$ converges for almost all paths.

As for $R_3$, notice that for each triple $(k, l, m)$,

$$ \sum_{i=1}^{n-1} s_{i,(l,m)}^{(k)} [T(X_{i+1})^{(k)} - \dot A(\theta_0)^{(k)}][\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] \qquad (33) $$

is a martingale with respect to $\sigma(X_1, \ldots, X_{n-1})$, and

$$ E\big\{ \big[ s_{i,(l,m)}^{(k)} [T(X_{i+1})^{(k)} - \dot A(\theta_0)^{(k)}][\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] \big]^2 \,\big|\, X_1, \ldots, X_i \big\} = Var(T(X_1)^{(k)}) \big[ s_{i,(l,m)}^{(k)} [\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] \big]^2 . $$

According to Theorem 2.17 in [12], (33) converges on the set

$$ W = \Big\{ \omega : \sum_{i=1}^{+\infty} \big[ s_{i,(l,m)}^{(k)} [\bar T_i^{(l)} - \dot A(\theta_0)^{(l)}][\bar T_i^{(m)} - \dot A(\theta_0)^{(m)}] \big]^2 < +\infty \Big\} . \qquad (34) $$

Again by the law of the iterated logarithm and the fact that $s_{i,(l,m)}^{(k)}$ is bounded as $i$ gets large, the summand in (34) is at most of order $(\log\log i)^2 / i^2$. Therefore $P(W) = 1$. This completes the proof.

Proof of Proposition 2.1. In this case, $R_2$ and $R_3$ vanish in (25) (see (21) and (23)). Hence $D_n$ is exactly $I_n$ in (29). Since $D_n$ converges to $D$ in $L^1$, we have

$$ E(D(\omega)) = \lim_{n\to\infty} E(I_n(\omega)) = \frac{d}{2}(c_{\mathrm{euler}} + 1) . $$

We can obtain (14) by taking the expectation of the second term in (29) and using $E \| S_n \|^2 = nd$.

Proof of Proposition 2.2. Notice that

$$ \begin{aligned} \sum_{i=1}^n (-\log p(X_i|\hat\theta_n)) - \sum_{i=1}^n (-\log p(X_i|\theta_0)) &= \sum_{i=1}^n \big\{ -[T(X_i) - \dot A(\theta_0)]'(\hat\theta_n - \theta_0) + [A(\hat\theta_n) - A(\theta_0)] - \dot A(\theta_0)'(\hat\theta_n - \theta_0) \big\} \\ &= -n[\bar T_n - \dot A(\theta_0)]'(\hat\theta_n - \theta_0) + n[A(\hat\theta_n) - A(\theta_0) - \dot A(\theta_0)'(\hat\theta_n - \theta_0)] \\ &= -\frac{n}{2} [\bar T_n - \dot A(\theta_0)]' \ddot A^{-1}(\theta_0) [\bar T_n - \dot A(\theta_0)] + O(n \| \hat\theta_n - \theta_0 \|^3) \\ &= -\frac{\| S_n \|^2}{2n} + o(1) , \end{aligned} \qquad (35) $$

where we use the Taylor expansions in (23). Next, we combine (25), (29) and (35). It is easy to check that the negative logarithm of the likelihood evaluated at the maximum likelihood estimate $\hat\theta_n$ is $nH(\hat\theta_n)$.

Proof of Theorem 2.2. If the prior is given by (6), then the marginal density of $(X_1, \ldots, X_n)$ is

$$ m(x_1, \ldots, x_n) = \exp\Big\{ B\Big( \sum_{i=1}^n T(x_i) + \xi,\; n + \lambda \Big) - B(\xi, \lambda) \Big\} , \qquad (36) $$

according to the definition of $B(\cdot, \cdot)$. Therefore

$$ R_m = B(\xi, \lambda) - B\Big( \sum_{i=1}^n T(X_i) + \xi,\; n + \lambda \Big) - nA(\theta_0) + \Big( \sum_{i=1}^n T(X_i) \Big)'\theta_0 = B(\xi, \lambda) - \log\Big( \int_\Theta \exp\{ n L_n(t, \omega) \}\, dt \Big) , \qquad (37) $$

where

$$ n L_n(t, \omega) = \xi' t - \lambda A(t) + \Big[ \sum_{i=1}^n T(X_i) \Big]'(t - \theta_0) - n[A(t) - A(\theta_0)] . \qquad (38) $$

Since $\bar T_n \to \dot A(\theta_0)$ almost surely, for almost every path, when $n$ is sufficiently large, the unique maximum of $L_n(t, \omega)$ is achieved at

$$ \tilde\theta_n(\omega) = \dot A^{-1}\Big( \frac{\sum_{i=1}^n T(X_i) + \xi}{n + \lambda} \Big) = \hat\theta_n(\omega) + O\Big(\frac{1}{n}\Big) \longrightarrow \theta_0 . $$

Notice that

$$ -\ddot L_n(t) = \frac{n + \lambda}{n}\, \ddot A(t) . $$

Let

$$ \Sigma_n = -\ddot L_n^{-1}(\tilde\theta_n(\omega)) = \frac{n}{n + \lambda}\, \ddot A^{-1}(\tilde\theta_n(\omega)) . $$

By expanding $L_n(t, \omega)$ at the saddle point $\tilde\theta_n(\omega)$ and applying the Laplace method (see [4, 21]), we have

$$ \log\Big( \int_\Theta \exp\{ n L_n(t, \omega) \}\, dt \Big) = -\frac{d}{2}\log n + \frac{d}{2}\log 2\pi + \frac{1}{2}\log(\det \Sigma_n) + n L_n(\tilde\theta_n(\omega), \omega) + O\Big(\frac{1}{n}\Big) = -\frac{d}{2}\log n + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) + n L_n(\tilde\theta_n(\omega), \omega) + o(1) . \qquad (39) $$

Next, we further expand $n L_n(\tilde\theta_n(\omega), \omega)$.

$$ \begin{aligned} n L_n(\tilde\theta_n(\omega), \omega) &= \xi'\tilde\theta_n(\omega) - \lambda A(\tilde\theta_n(\omega)) + \Big[ \sum_{i=1}^n T(X_i) \Big]'(\tilde\theta_n(\omega) - \theta_0) - n[A(\tilde\theta_n(\omega)) - A(\theta_0)] \\ &= \xi'\theta_0 - \lambda A(\theta_0) + n\big[ \bar T_n'(\tilde\theta_n(\omega) - \theta_0) - \dot A(\theta_0)'(\tilde\theta_n(\omega) - \theta_0) - \tfrac{1}{2}(\tilde\theta_n(\omega) - \theta_0)' \ddot A(\theta_0)(\tilde\theta_n(\omega) - \theta_0) \big] + o(1) \\ &= \xi'\theta_0 - \lambda A(\theta_0) + n\big[ (\bar T_n - \dot A(\theta_0))'(\tilde\theta_n(\omega) - \theta_0) - \tfrac{1}{2}(\tilde\theta_n(\omega) - \theta_0)' \ddot A(\theta_0)(\tilde\theta_n(\omega) - \theta_0) \big] + o(1) \\ &= \xi'\theta_0 - \lambda A(\theta_0) + \frac{n}{2}\big[ (\bar T_n - \dot A(\theta_0))' \ddot A^{-1}(\theta_0)(\bar T_n - \dot A(\theta_0)) \big] + o(1) . \end{aligned} \qquad (40) $$

In the last step, we use the expansion

$$ \tilde\theta_n(\omega) - \theta_0 = \ddot A^{-1}(\theta_0)(\bar T_n - \dot A(\theta_0)) + O(\| \bar T_n - \dot A(\theta_0) \|^2) . $$

Now we put (37), (39), and (40) together:

$$ \begin{aligned} R_m &= \frac{d}{2}\log n - \frac{n}{2}\big[ (\bar T_n - \dot A(\theta_0))' \ddot A^{-1}(\theta_0)(\bar T_n - \dot A(\theta_0)) \big] - \frac{d}{2}\log 2\pi + \frac{1}{2}\log(\det I(\theta_0)) - \{ \xi'\theta_0 - \lambda A(\theta_0) - B(\xi, \lambda) \} + o(1) \\ &= \frac{d}{2}\log n - \frac{\| S_n \|^2}{2n} - \frac{d}{2}\log 2\pi + \frac{1}{2}\log(\det I(\theta_0)) - \log w(\theta_0) + o(1) , \end{aligned} \qquad (41) $$

where we use the notation in (28). In the more general case, when the prior is a finite mixture of conjugate distributions as in (8), it is easy to see from (10) and (36) that the marginal density is given by

$$ m(x_1, \ldots, x_n) = \sum_{j=1}^J \alpha_j \exp\Big\{ B\Big( \sum_{i=1}^n T(x_i) + \xi_j,\; n + \lambda_j \Big) - B(\xi_j, \lambda_j) \Big\} . $$

According to (41), each term in this summation has the following almost sure asymptotics:

$$ \begin{aligned} \exp\Big\{ B\Big( \sum_{i=1}^n T(x_i) + \xi_j,\; n + \lambda_j \Big) - B(\xi_j, \lambda_j) \Big\} &= \exp\Big\{ -L_0 - \frac{d}{2}\log n + \frac{\| S_n \|^2}{2n} + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) + \log u_j(\theta_0) + o(1) \Big\} \\ &= [u_j(\theta_0) + o(1)]\, \exp\Big\{ -L_0 - \frac{d}{2}\log n + \frac{\| S_n \|^2}{2n} + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) \Big\} . \end{aligned} $$

Therefore,

$$ \begin{aligned} m(x_1, \ldots, x_n) &= \Big[ \sum_{j=1}^J \alpha_j u_j(\theta_0) + o(1) \Big] \exp\Big\{ -L_0 - \frac{d}{2}\log n + \frac{\| S_n \|^2}{2n} + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) \Big\} \\ &= [w(\theta_0) + o(1)]\, \exp\Big\{ -L_0 - \frac{d}{2}\log n + \frac{\| S_n \|^2}{2n} + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) \Big\} \\ &= \exp\Big\{ -L_0 - \frac{d}{2}\log n + \frac{\| S_n \|^2}{2n} + \frac{d}{2}\log 2\pi - \frac{1}{2}\log(\det I(\theta_0)) + \log w(\theta_0) + o(1) \Big\} . \end{aligned} \qquad (42) $$

Hence, (41) is valid even if the prior takes the more general form of a finite mixture of conjugates. Next, the lines regarding the term $\frac{\| S_n \|^2}{2n}$ in the proof of Proposition 2.2 can be used to finish the proof.

References

[1] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, pages 2743-2760, 1998.

[2] A. R. Barron and N. Hengartner. Information theory and superefficiency. Annals of Statistics, to appear, 1998.

[3] L. D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics: California, 1987.

[4] N. G. De Bruijn. Asymptotic Methods in Analysis. North-Holland: Amsterdam, 1958.

[5] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayesian methods. IEEE Transactions on Information Theory, 36:453-471, 1990.

[6] B. S. Clarke and A. R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41:37-64, 1994.

[7] A. P. Dawid. Present position and potential developments: some personal views, statistical theory, the prequential approach. J. R. Stat. Soc. Ser. A, 147, 1984.

[8] A. P. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In Fourth Valencia International Meeting on Bayesian Statistics, pages 15-20, 1992.

[9] R. Durrett. Probability: Theory and Examples. Wadsworth & Brooks/Cole, 1991.

[10] T. S. Ferguson. Bayesian density estimation via mixtures of normal distributions. In Recent Advances in Statistics, pages 287-302. Academic Press, New York, 1983.

[11] Y. Freund. Predicting a binary sequence almost as well as the optimal biased coin. Unpublished manuscript.

[12] P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press: New York, London, 1980.

[13] M. Hansen and B. Yu. Model selection and minimum description length principle. Journal of the American Statistical Association, 1998. Submitted.

[14] M. J. D. Powell. Approximation Theory and Methods. Cambridge University Press: Cambridge, 1981.

[15] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

[16] J. Rissanen. A predictive least squares principle. IMA Journal of Mathematical Control and Information, 3:211-222, 1986.

[17] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080-1100, 1986.

[18] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific: Singapore, 1989.

[19] J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42:40-47, 1996.

[20] J. Takeuchi and A. R. Barron. Asymptotically minimax regret by Bayesian mixtures. In Proceedings of the 1998 IEEE International Symposium on Information Theory, 1998.

[21] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82-86, 1986.