Strong Large Deviations for Rao Test Score and GLRT in Exponential Families

Pierre Moulin and Patrick R. Johnstone
University of Illinois at Urbana-Champaign
Beckman Inst., Coord. Sci. Lab, and Dept. of ECE
405 North Mathews Avenue, Urbana, IL 61801 USA

Abstract: Exact asymptotics are derived for composite hypothesis testing between two product probability measures $P^n$ vs $Q^n$, subject to a type-I error-probability constraint $\epsilon$. Here $P$ is known but $Q$ is an unknown element of a given $d$-dimensional regular exponential family. We study the Rao score test, which is a quadratic approximation to the GLRT. The type-II error probability is shown to vanish as $\exp\{-nD + \sqrt{nV\,\tau(\epsilon,d)} - \beta_d \ln n + \gamma_d\}$, where $D$ and $V$ are respectively the Kullback-Leibler divergence and the variance of information divergence between $P$ and $Q$; $\tau(\epsilon,d)$ is the $1-\epsilon$ quantile of the $\chi^2_d$ distribution; and the constants $\beta_d > 0$ and $\gamma_d$ are explicitly identified. The asymptotic regret relative to the Neyman-Pearson test (which knows $Q$) is reflected in the coefficient $\tau(\epsilon,d)$, as is the cost of dimensionality. Looser asymptotics (with $O(1)$ in place of $\gamma_d$) are obtained for the GLRT.

I. INTRODUCTION

Consider a measure space $(\mathcal{Y}, \mathcal{F}, \mu)$ and a $d$-dimensional regular exponential family of densities in canonical form

$p_\theta(y) = \exp\{\theta^\top t(y) - A(\theta)\}, \quad \theta \in \Theta,\ y \in \mathcal{Y}$ (1)

where $\Theta = \{\theta \in \mathbb{R}^d : \int_{\mathcal{Y}} \exp\{\theta^\top t(y)\}\, d\mu(y) < \infty\}$ is the natural parameter space (an open, convex set), $t : \mathcal{Y} \to \mathbb{R}^d$ is the sufficient statistic, and $A : \Theta \to \mathbb{R}$ is the log-partition function. Denote by $\mathbf{Y} \triangleq (Y_1, Y_2, \ldots, Y_n)$ a sequence of $n$ independent and identically distributed (i.i.d.) random variables drawn from $P_\theta$. Fix a particular $\theta_0 \in \Theta$ and consider the composite binary hypothesis test

$H_0 : \mathbf{Y} \sim P_{\theta_0}^n$
$H_1 : \mathbf{Y} \sim P_\theta^n, \quad \theta \in \Theta \setminus \{\theta_0\}.$ (2)

We wish to design a good deterministic test that guarantees a type-I error probability of at most $\epsilon$. In case the value of $\theta$ is known, this can be done by seeking a Neyman-Pearson (NP) test which minimizes the type-II error probability subject to the above constraint on the type-I error probability. The NP rule takes the form of a likelihood ratio test (LRT) [1]

$\delta_{\mathrm{LRT}}(\mathbf{Y}) \triangleq \mathbf{1}\left\{ \sum_{i=1}^n \ln \frac{p_\theta(Y_i)}{p_{\theta_0}(Y_i)} > \tau \right\}.$ (3)
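As a concrete illustration (not part of the original development), the following minimal Python sketch simulates the LRT (3) for the 1-D Gaussian family $N(\theta,1)$, which is used as a running example later in the paper. The parameter values, sample sizes, and the Monte Carlo calibration of $\tau$ are illustrative assumptions.

```python
# Minimal sketch of the LRT (3) for the 1-D Gaussian family N(theta, 1).
# The threshold tau is calibrated by Monte Carlo under H0; this is an
# illustration, not the paper's method for choosing tau.
import numpy as np

rng = np.random.default_rng(0)
theta0, theta, n, eps = 0.0, 0.5, 100, 0.05   # illustrative values

def llr(y, theta, theta0):
    """Log-likelihood ratio sum_i ln p_theta(Y_i)/p_theta0(Y_i) for N(., 1)."""
    return np.sum((theta - theta0) * y - (theta**2 - theta0**2) / 2, axis=-1)

# Calibrate tau so that P_I = Pr{llr > tau | H0} is (approximately) eps.
samples_h0 = rng.normal(theta0, 1.0, size=(50_000, n))
tau = np.quantile(llr(samples_h0, theta, theta0), 1 - eps)

# Estimate the type-II error probability P_II = Pr{llr <= tau | H1}.
samples_h1 = rng.normal(theta, 1.0, size=(50_000, n))
p2 = np.mean(llr(samples_h1, theta, theta0) <= tau)
print(f"tau = {tau:.3f}, estimated P_II = {p2:.4g}")
```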

For the composite hypothesis testing problem (2), the test may not depend on $\theta$, hence the LRT is excluded. The performance metrics for a decision rule $\delta : \mathcal{Y}^n \to \{0, 1\}$ are its type-I error probability $P_I(\delta, n) = P_{\theta_0}^n\{\delta(\mathbf{Y}) = 1\}$ and its type-II error probability $P_{II}(\delta, \theta, n) = P_\theta^n\{\delta(\mathbf{Y}) = 0\}$. Various approaches can be considered, including the Generalized Likelihood Ratio Test (GLRT) [2] and the Rao score test [3], which is a quadratic approximation to the GLRT [4]. These tests are often justified by an asymptotic optimality property. For large $n$, under regularity conditions, $P_{II}(\delta, \theta, n) = \exp\{-nD(p_{\theta_0}\|p_\theta) + o(n)\}$, where the error exponent is the Kullback-Leibler divergence between $p_{\theta_0}$ and $p_\theta$. Since the LRT of (3) achieves the same error exponent, such tests are uniformly most powerful on the exponential scale. Unfortunately the exponential approximation

$P_{II}(\delta, \theta, n) \approx e^{-nD(p_{\theta_0}\|p_\theta)}$ (4)

is often inadequate for moderately large $n$. This motivates exploring an approach based on strong large-deviations asymptotics, which yields the exact asymptotics of $P_{II}(\delta, \theta, n)$. Early work in this direction was done by Strassen [5], who identified $P_{II}(\delta, \theta, n)$ up to a (very large) constant for the NP test: the type-II error probability for the Neyman-Pearson test at significance level $1-\epsilon$ satisfies

$P_{II}(\delta_{\mathrm{LRT}}, \theta, n) = \exp\left\{-nD(p_{\theta_0}\|p_\theta) + \sqrt{nV(p_{\theta_0}\|p_\theta)}\, Q^{-1}(\epsilon) - \frac{1}{2}\ln n + O(1)\right\}$ (5)

where $Q^{-1}$ is the inverse of the Q function, $Q(x) = \int_x^\infty (2\pi)^{-1/2} \exp\{-t^2/2\}\, dt$, and $D(p_{\theta_0}\|p_\theta)$ and $V(p_{\theta_0}\|p_\theta)$ are the Kullback-Leibler (KL) divergence and the variance of information divergence, respectively given by

$D(p_{\theta_0}\|p_\theta) = -A(\theta_0) + A(\theta) + (\theta_0 - \theta)^\top \nabla A(\theta_0)$ (6)

and

$V(p_{\theta_0}\|p_\theta) = (\theta_0 - \theta)^\top I_{\theta_0}\, (\theta_0 - \theta)$ (7)

where $I_{\theta_0} = \nabla^2 A(\theta_0)$ is the Fisher information matrix.
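The divergences (6)-(7) are straightforward to evaluate once $A$ and its derivatives are known. As a hedged sketch, the snippet below hard-codes them for the $d$-variate Gaussian location family ($t(y) = y$, $A(\theta) = \|\theta\|^2/2$, so $\nabla A(\theta) = \theta$ and $I_\theta = I$); these closed forms are assumptions of this example, not generic code. The output also exhibits the local relation $D \approx V/2$ noted after (8) below.

```python
# Sketch: evaluating D (6) and V (7) from the log-partition function A for
# the d-variate Gaussian location family p_theta = N(theta, I). The gradient
# and Hessian of A are hard-coded for this family (an assumption of the
# example, not generic code).
import numpy as np

A = lambda th: 0.5 * np.dot(th, th)      # log-partition function
gradA = lambda th: th                    # grad A(theta) = theta here

def D_kl(theta0, theta):
    # (6): D(p_theta0 || p_theta) = -A(theta0) + A(theta)
    #      + (theta0 - theta)' grad A(theta0)
    return -A(theta0) + A(theta) + (theta0 - theta) @ gradA(theta0)

def V_info(theta0, theta):
    # (7): (theta0 - theta)' I_theta0 (theta0 - theta), with I_theta0 = I
    I0 = np.eye(len(theta0))
    return (theta0 - theta) @ I0 @ (theta0 - theta)

theta0, theta = np.zeros(3), np.array([0.3, -0.2, 0.1])
print(D_kl(theta0, theta), V_info(theta0, theta))   # D equals V/2 here
```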

We will use the following Taylor series expansion of the KL divergence. For any $\theta \in \Theta$ and $h \in \mathbb{R}^d$,

$D(p_{\theta+\eta h}\|p_\theta) = \frac{\eta^2}{2}\, h^\top I_\theta h + \frac{\eta^3}{3} \sum_{i,j,k=1}^d h_i h_j h_k\, \frac{\partial^3 A(\theta)}{\partial\theta_i\, \partial\theta_j\, \partial\theta_k} + O(\eta^4) \quad \text{as } \eta \to 0.$ (8)

In the limit as $\theta \to \theta_0$, this yields $D(p_{\theta_0}\|p_\theta) \sim \frac{1}{2} V(p_{\theta_0}\|p_\theta)$.

The $O(1)$ term in (5) was derived in [6]. Other related work includes [7], where a composite hypothesis test with finitely many alternatives was studied, and a vector likelihood ratio threshold test was found to have the same asymptotics for $P_{II}$ as the NP test, up to a constant. The limited usefulness of exponential approximations such as (4) has also been recognized by Unnikrishnan et al. [8].

The main contribution of this paper is to derive an exact asymptotic expansion for the type-II error probability of the Rao score test. The regret for not knowing $\theta$ appears in the second term of our asymptotic expansion, which is different from the second term in (5).

Theorem 1. Assume $\mu$ is the Lebesgue measure, $\theta_0 \in \mathrm{int}(\Theta)$, and the mapping $A$ is thrice continuously differentiable. Also assume that under $H_0$, the sufficient statistic $t(Y)$ has a density with respect to $\mu$. Then
(i) The type-I and type-II error probabilities for the Rao score test admit the asymptotic expansions

$P_I(\delta_{\mathrm{Rao}}, n) = \epsilon + O(n^{-1}),$ (9)

$P_{II}(\delta_{\mathrm{Rao}}, \theta, n) = \exp\left\{-nD(p_{\theta_0}\|p_\theta) + \sqrt{nV(p_{\theta_0}\|p_\theta)\,\tau(\epsilon,d)} - \frac{d+1}{4}\ln n + \gamma_d + o(1)\right\}$ (10)

for a constant

$\gamma_d = -\frac{\tau}{2} - \frac{1}{2}\ln(2\pi) + \frac{d-1}{4}\ln\tau - \frac{d+1}{4}\ln V(p_{\theta_0}\|p_\theta).$

(ii) The type-I and type-II error probabilities for the GLRT admit the asymptotic expansions

$P_I(\delta_{\mathrm{GLRT}}, n) = \epsilon + O(n^{-1/2}),$ (11)

$P_{II}(\delta_{\mathrm{GLRT}}, \theta, n) = \exp\left\{-nD(p_{\theta_0}\|p_\theta) + \sqrt{nV(p_{\theta_0}\|p_\theta)\,\tau(\epsilon,d)} - \frac{d+1}{4}\ln n + O(1)\right\}.$ (12)

In (10), we have denoted by $\tau(\epsilon,d)$ the quantile function for the $\chi^2$ distribution with $d$ degrees of freedom, evaluated at $1-\epsilon$, i.e.,

$\Pr\{\chi^2_d \le \tau(\epsilon,d)\} = 1 - \epsilon.$ (13)

Note that $\tau(\epsilon,1) = [Q^{-1}(\epsilon/2)]^2 > [Q^{-1}(\epsilon)]^2$ and $\tau(\epsilon,2) = 2\ln\epsilon^{-1}$. The function $\tau(\epsilon,d)$ decreases with $\epsilon$ and increases with $d$. Thus for all $\epsilon < 1/2$ and $d \ge 1$, we have $\tau(\epsilon,d) > \tau(1/2,d) \approx d$. Also for large $d$, the $\chi^2_d$ random variable is approximately $N(d, 2d)$, hence $\tau(\epsilon,d) \sim d + \sqrt{2d}\, Q^{-1}(\epsilon)$.

Example: 1-D Gaussian Family. Let $P_\theta = N(\theta, 1)$ (Gaussian with mean $\theta \in \mathbb{R}$ and unit variance), which belongs to the exponential family of (1) with dimension $d = 1$, sufficient statistic $t(y) = y$, and $A(\theta) = \theta^2/2$. Let $\theta_0 = 0$. Then $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n Y_i$ is $N(\theta, \frac{1}{n})$ for each $n$, and

$\delta_{\mathrm{Rao}}(\mathbf{Y}) = \delta_{\mathrm{GLRT}}(\mathbf{Y}) = \mathbf{1}\left\{ |\hat\theta_n| > \sqrt{\tau(\epsilon,1)/n} = n^{-1/2}\, Q^{-1}(\epsilon/2) \right\}.$

Here $D(P_{\theta_0}\|P_\theta) = \theta^2/2$ and $V(P_{\theta_0}\|P_\theta) = \theta^2$. The Fisher information is $I_\theta = 1$. We have $P_I(\delta_{\mathrm{Rao}}, n) = \epsilon$ (exactly, since $\hat\theta_n$ is Gaussian) and

$P_{II}(\delta_{\mathrm{Rao}}, \theta, n) = Q\left(\sqrt{n}\,\theta - Q^{-1}(\epsilon/2)\right).$

Using the asymptotic equality $Q(x) \sim \frac{1}{\sqrt{2\pi}\,x}\, e^{-x^2/2}$, we recover (10) and verify that the constant $\gamma_1$ is equal to $-\frac{1}{2}[Q^{-1}(\epsilon/2)]^2 - \frac{1}{2}\ln(2\pi\theta^2)$.

Remark 1. Comparing (5) and (10), we see that the dimensionality parameter $d$ affects primarily the second-order term in the expansion of $\log P_{II}$ because the ratio $\sqrt{\tau(\epsilon,d)}/Q^{-1}(\epsilon)$ increases as $\sqrt{d}$ for increasing $d$.

Remark 2. The assumption that $\theta_0$ is an interior point of the feasible set is crucial. In particular, for $d = 1$, the test (2) is two-sided. It is however well known that for a one-sided test (where $\theta$ is known to be on one side of $\theta_0$), the GLRT is a Uniformly Most Powerful test. We should therefore expect that the GLRT achieves a better coefficient in the $\sqrt{n}$ term of the expansion of $\log P_{II}$ if the feasible set under $H_1$ is closed and $\theta_0$ belongs to its boundary.

Remark 3. The second-order term is especially significant for small $D(p_{\theta_0}\|p_\theta)$ since $D(p_{\theta_0}\|p_\theta) \approx \frac{1}{2} V(p_{\theta_0}\|p_\theta)$ in that case. The relative size of the second-order term is then

$\sqrt{\frac{2\,\tau(\epsilon,d)}{n\,D(p_{\theta_0}\|p_\theta)}}.$

Hence one needs $nD(p_{\theta_0}\|p_\theta)$ to be much larger than $2\tau(\epsilon,d)$ for the first-order approximation $\log P_{II} \approx -nD(p_{\theta_0}\|p_\theta)$ to be reasonably accurate (and even that does not yield an accurate approximation for $P_{II}$). Table I shows the values of $\sqrt{\tau(\epsilon,d)}$ for $\epsilon = 10^{-3}$ and several values of $d$.

    d                   1      2       3       4       5      10
    sqrt(tau(eps,d))  3.29   3.72   4.033   4.298   4.530   5.440

TABLE I: Square root of test thresholds at significance level $1-\epsilon = 0.999$ and dimensionality parameters $d = 1, 2, 3, 4, 5, 10$.
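The entries of Table I are reproducible with SciPy's $\chi^2$ quantile function; the short check below (an illustration, not from the paper) also verifies the closed forms for $\tau(\epsilon,1)$ and $\tau(\epsilon,2)$ quoted after Theorem 1.

```python
# Sketch: reproducing the sqrt(tau(eps, d)) values of Table I with SciPy's
# chi-square quantile function, cf. (13).
import numpy as np
from scipy.stats import chi2, norm

eps = 1e-3
for d in (1, 2, 3, 4, 5, 10):
    tau = chi2.ppf(1 - eps, df=d)          # Pr{chi2_d <= tau} = 1 - eps
    print(d, round(np.sqrt(tau), 3))

# Closed forms: tau(eps,1) = [Q^{-1}(eps/2)]^2 and tau(eps,2) = 2 ln(1/eps).
print(norm.isf(eps / 2) ** 2, chi2.ppf(1 - eps, 1))   # should match
print(2 * np.log(1 / eps), chi2.ppf(1 - eps, 2))      # should match
```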

Remark 4. The appearance of the quantile $\tau(\epsilon,d)$ for the $\chi^2_d$ distribution in the expressions above can be explained as follows. Under $H_0$, the normalized score

$X_n = \sqrt{n}\, I_{\theta_0}^{-1/2}\left(\frac{1}{n}\sum_{i=1}^n t(Y_i) - E_0[t(Y)]\right)$

is asymptotically $N(0, I_d)$, and the Rao test decides $H_0$ whenever $\|X_n\|^2$ is below some threshold. Hence this threshold should be asymptotic to $\tau(\epsilon,d)$ to obtain $P_I = \epsilon$. We shall also see that the acceptance regions of the GLRT, stated in terms of $X_n$, can be sandwiched between the acceptance regions of two Rao tests with thresholds $\tau + O(n^{-1/2})$. This property will be used to obtain an elementary extension of the proof for the Rao test to the GLRT.

Remark 5. Similar results hold if $\mu$ is a counting measure and $t(Y)$ is a nonlattice random variable under $H_0$. However, the constant $O(1)$ term in the expressions for $P_{II}$ is replaced by a bounded oscillatory function if $t(Y)$ is a strongly lattice random variable and $\theta - \theta_0$ is a point in the centered lattice.

The Rao score test is reviewed in Sec. II together with its relation to the GLRT. A key lemma is stated and proved in Sec. III, and Theorem 1 is proved in Sec. IV. Throughout this paper, we use the shorthands $E_\theta$, $\mathrm{Var}_\theta$, and $\mathrm{Cov}_\theta$ for expectation, variance, and covariance with respect to $P_\theta$, respectively.

II. HYPOTHESIS TESTS

A. Exponential Families

First recall that the mean and covariance of $t(Y)$ are given by the gradient and Hessian of $A(\theta)$:

$E_\theta[t(Y)] = \nabla A(\theta), \qquad \mathrm{Cov}_\theta[t(Y)] = \nabla^2 A(\theta) = I_\theta$ (14)

where

$I_\theta = E_\theta\left[-\nabla^2 \ln p_\theta(Y)\right]$ (15)

is the Fisher information matrix. The likelihood score function is

$\nabla_\theta \ln p_\theta(Y) = -\nabla A(\theta) + t(Y).$ (16)

Since the feasible set is the natural parameter space $\Theta$, the ML estimator $\hat\theta_n$ is the solution to the likelihood equation

$\hat\xi_n \triangleq \frac{1}{n}\sum_{i=1}^n t(Y_i) = \nabla A(\hat\theta_n) \qquad \text{a.s. } P_\theta\ \ \forall\, \theta \in \Theta$ (17)

where $\hat\xi_n$ is also the ML estimator for the mean parameter $\xi \triangleq \nabla A(\theta)$. We also let $\xi_0 = \nabla A(\theta_0)$. The regular exponential family $\{p_\theta,\ \theta \in \Theta\}$ can be equivalently represented as $\{q_\xi,\ \xi \in \Xi\}$.
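As an illustration of the likelihood equation (17), the sketch below works out a concrete one-parameter case: the exponential distributions in canonical form $p_\theta(y) = \exp\{\theta y - A(\theta)\}$ on $y > 0$, with $A(\theta) = -\ln(-\theta)$ for $\theta < 0$. The family choice and sample size are assumptions of the example; the generic root-finding call is shown only to indicate how (17) would be solved when $\nabla A$ cannot be inverted in closed form.

```python
# Sketch of the ML equation (17) for the exponential distributions written
# in canonical form, where grad A(theta) = -1/theta so that (17) has the
# closed-form solution theta_hat = -1/xi_hat.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
theta_true = -2.0                        # natural parameter (rate = 2)
y = rng.exponential(scale=-1 / theta_true, size=1000)

xi_hat = np.mean(y)                      # average sufficient statistic, t(y) = y
gradA = lambda th: -1.0 / th
theta_hat = brentq(lambda th: gradA(th) - xi_hat, -1e6, -1e-6)  # solve (17)
print(theta_hat, -1.0 / xi_hat)          # numeric root vs closed form
```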

B. LRT

If $\theta$ is given, the hypothesis test is simple. The LRT takes the form

$\delta_{\mathrm{LRT}}(\mathbf{Y}) = \mathbf{1}\left\{ \frac{1}{n}\sum_{i=1}^n \ln \frac{p_\theta(Y_i)}{p_{\theta_0}(Y_i)} > \tau \right\}$

where the threshold $\tau$ is selected to achieve significance level $1-\epsilon$. For the exponential family (1), the test takes the form

$\delta_{\mathrm{LRT}}(\mathbf{Y}) = \mathbf{1}\left\{ -A(\theta) + A(\theta_0) + (\theta - \theta_0)^\top \frac{1}{n}\sum_{i=1}^n t(Y_i) > \tau \right\}.$ (18)

C. GLRT

The GLR test statistic is defined as

$\Lambda_n^{\mathrm{GLR}} \triangleq \sup_{\theta \in \Theta} \sum_{i=1}^n \ln \frac{p_\theta(Y_i)}{p_{\theta_0}(Y_i)}$

and the GLRT by

$\delta_{\mathrm{GLRT}}(\mathbf{Y}) \triangleq \mathbf{1}\left\{ \Lambda_n^{\mathrm{GLR}} > \tau \right\}$ (19)

where the threshold $\tau$ is selected to achieve significance level $1-\epsilon$. Since the exponential family is assumed to be regular, the Fisher information matrix $I_{\theta_0}$ has full rank. It is known [9], [4] that $2\Lambda_n^{\mathrm{GLR}}$ converges in distribution to a $\chi^2_d$ random variable under $H_0$. Moreover [4, p. 334],

$\Lambda_n^{\mathrm{GLR}} = \sup_{\theta \in \Theta}\left[ -n(A(\theta) - A(\theta_0)) + (\theta - \theta_0)^\top \sum_{i=1}^n t(Y_i) \right] = nD(p_{\hat\theta_n}\|p_{\theta_0}) = nD(q_{\hat\xi_n}\|q_{\xi_0})$ (20)

which follows from (17) and (6), respectively.
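The identity (20) gives a convenient way to compute the GLR statistic. The sketch below (illustrative parameters; the Monte Carlo check is not from the paper) evaluates $\Lambda_n^{\mathrm{GLR}} = nD(p_{\hat\theta_n}\|p_{\theta_0})$ for the 1-D Gaussian family and confirms the $\chi^2_1$ calibration of $2\Lambda_n^{\mathrm{GLR}}$ under $H_0$.

```python
# Sketch of the identity (20) for the 1-D Gaussian family N(theta, 1):
# Lambda_n^GLR = n * D(p_thetahat || p_theta0) with thetahat = mean(Y),
# and Wilks' theorem (2*Lambda -> chi2_1 under H0) checked by Monte Carlo.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
theta0, n, trials = 0.0, 100, 50_000     # illustrative values

y = rng.normal(theta0, 1.0, size=(trials, n))   # data under H0
theta_hat = y.mean(axis=1)                      # ML estimator, cf. (17)
lam = n * (theta_hat - theta0) ** 2 / 2         # (20): n*D with D = (dtheta)^2/2

# Empirical quantile of 2*Lambda vs the chi2_1 quantile (Wilks).
print(np.quantile(2 * lam, 0.95), chi2.ppf(0.95, 1))
```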

D. Rao Score Test

The Rao score test [3] is an alternative to the GLRT and is first-order equivalent to the GLRT [4, p. 339]. The test statistic is

$\Lambda_n^{\mathrm{Rao}} \triangleq \|X_n\|^2$ (21)

where

$X_n \triangleq \sqrt{n}\, I_{\theta_0}^{-1/2}\left( \nabla A(\hat\theta_n) - \nabla A(\theta_0) \right) = \sqrt{n}\, I_{\theta_0}^{-1/2}\left( \frac{1}{n}\sum_{i=1}^n t(Y_i) - \xi_0 \right).$ (22)

The test takes the form $\delta_{\mathrm{Rao}}(\mathbf{Y}) \triangleq \mathbf{1}\{\Lambda_n^{\mathrm{Rao}} > \tau\}$ where $\tau$ is the threshold of the Rao score test.

Under $H_0$, for each $n$, the random vector $X_n$ has mean 0 and covariance matrix $I_d$. Moreover, $X_n \xrightarrow{d} N(0, I_d)$ and hence $\|X_n\|^2$ converges in distribution to a $\chi^2_d$ random variable. Hence we choose the test threshold $\tau = \tau(\epsilon,d)$ to be the $1-\epsilon$ quantile for the $\chi^2_d$ distribution. The Rao score test takes the form

$\delta_{\mathrm{Rao}}(\mathbf{Y}) = \mathbf{1}\left\{ \|X_n\|^2 > \tau(\epsilon,d) \right\}.$ (23)

Defining the centered ball $B^d(\sqrt{\tau}) = \{x \in \mathbb{R}^d : \|x\|^2 \le \tau(\epsilon,d)\}$, we may write $\delta_{\mathrm{Rao}}(\mathbf{Y}) = \mathbf{1}\{X_n \notin B^d(\sqrt{\tau})\}$. Likewise, we have $\delta_{\mathrm{GLRT}}(\mathbf{Y}) = \mathbf{1}\{X_n \notin B_n^d(\sqrt{\tau})\}$ where

$B_n^d(\sqrt{\tau}) \triangleq \{x \in \mathbb{R}^d : nD(q_{\xi_0 + n^{-1/2} x}\|q_{\xi_0}) \le \tau(\epsilon,d)\}.$

Using the Taylor expansion (8) and the fact that $I_{\theta_0}$ has full rank, we can show there exist constants $c_1, c_2 \ge 0$ such that

$B^d\left(\sqrt{\tau} - \frac{c_1}{\sqrt{n}}\right) \subseteq B_n^d(\sqrt{\tau}) \subseteq B^d\left(\sqrt{\tau} + \frac{c_2}{\sqrt{n}}\right).$ (24)

III. A USEFUL LEMMA

The following lemma is instrumental to prove our main result, and the accuracy of the asymptotic approximations has been verified numerically. For $d \le 3$ and $n = 100$, the precision of the approximation is typically around 1%. Denote by $\phi_d$ the $d$-variate standard normal density.

Lemma 1. Define the integral

$I_n(d, V, \tau) \triangleq \int_{x \in \mathbb{R}^d :\, \|x\|^2 \le \tau} e^{\sqrt{nV}\, x_1}\, \phi_d(x)\, dx.$

The following asymptotic equality (as $n \to \infty$) holds for all $V, \tau > 0$, and $d \ge 1$:

$I_n(d, V, \tau) \sim \frac{e^{-\tau/2}}{\sqrt{2\pi}}\, \tau^{\frac{d-1}{4}}\, (nV)^{-\frac{d+1}{4}}\, e^{\sqrt{nV\tau}}.$

Proof. The solution for $d = 1$ is an immediate application of Laplace's method. The value of the integral $I_n(1, V, \tau)$ is determined by the integrand in a vanishing neighborhood of $x_1 = \sqrt{\tau}$:

$I_n(1, V, \tau) = \int_{|x_1| \le \sqrt{\tau}} e^{\sqrt{nV}\, x_1}\, \phi_1(x_1)\, dx_1 \sim \int_{-\infty}^{\sqrt{\tau}} e^{\sqrt{nV}\, x_1}\, \phi_1(x_1)\, dx_1 = e^{nV/2}\, Q(\sqrt{nV} - \sqrt{\tau}) \sim \frac{e^{-\tau/2}}{\sqrt{2\pi nV}}\, e^{\sqrt{nV\tau}} \quad \text{as } n \to \infty.$ (25)

For $d \ge 2$, we proceed as follows. Denote by $S^{k-1}(r) = \{x \in \mathbb{R}^k : \|x\| = r\}$ the $(k-1)$-sphere of radius $r$ embedded in $\mathbb{R}^k$. For $k = 1$ we have $S^0(r) = \{\pm r\}$. The volume of $S^{k-1}(r)$ is $|S^{k-1}(r)| = C_{k-1}\, r^{k-1}$ where $C_{k-1} = k\pi^{k/2}/\Gamma(\frac{k}{2}+1)$. By notational convention, $|S^0(r)| = 2$. We also denote by $B^k(r) = \{x \in \mathbb{R}^k : \|x\| \le r\}$ the centered ball of radius $r$ in $\mathbb{R}^k$. Let $x = (x_1, z)$ where $z \in \mathbb{R}^{d-1}$. Then

$I_n(d, V, \tau) = \int_{B^{d-1}(\sqrt{\tau})} dz\, \phi_{d-1}(z) \int_{|x_1| \le \sqrt{\tau - \|z\|^2}} e^{\sqrt{nV}\, x_1}\, \phi_1(x_1)\, dx_1 = \int_{B^{d-1}(\sqrt{\tau})} dz\, \phi_{d-1}(z)\, I_n(1, V, \tau - \|z\|^2).$

It follows from (25) that

$I_n(d, V, \tau) \sim \frac{e^{-\tau/2}}{(2\pi)^{(d-1)/2}\sqrt{2\pi nV}} \int_{B^{d-1}(\sqrt{\tau})} e^{\sqrt{nV(\tau - \|z\|^2)}}\, dz.$

Let $r = \|z\|$. The integral above is equal to

$\int_0^{\sqrt{\tau}} |S^{d-2}(r)|\, e^{\sqrt{nV(\tau - r^2)}}\, dr = C_{d-2} \int_0^{\sqrt{\tau}} r^{d-2}\, e^{\sqrt{nV(\tau - r^2)}}\, dr$

(where $C_0 = 2$) and is dominated by a vanishing neighborhood near $r = 0$ as $n \to \infty$. Taylor series expansion of the exponent at $r = 0$ yields

$\sqrt{\tau - r^2} = \sqrt{\tau} - \frac{r^2}{2\sqrt{\tau}} + O(r^4),$

hence

$I_n(d, V, \tau) \sim \frac{C_{d-2}\, e^{-\tau/2}\, e^{\sqrt{nV\tau}}}{(2\pi)^{(d-1)/2}\sqrt{2\pi nV}} \int_0^{\sqrt{\tau}} r^{d-2}\, e^{-\frac{1}{2}\sqrt{nV}\, r^2/\sqrt{\tau}}\, dr$

$\sim \frac{C_{d-2}\, e^{-\tau/2}\, e^{\sqrt{nV\tau}}}{(2\pi)^{(d-1)/2}\sqrt{2\pi nV}} \int_0^{\infty} r^{d-2}\, e^{-\frac{1}{2}\sqrt{nV}\, r^2/\sqrt{\tau}}\, dr$

$\stackrel{(a)}{=} \frac{C_{d-2}\, e^{-\tau/2}\, e^{\sqrt{nV\tau}}\, \tau^{\frac{d-1}{4}}}{(2\pi)^{(d-1)/2}\sqrt{2\pi}\, (nV)^{\frac{d+1}{4}}} \int_0^\infty u^{d-2}\, e^{-\frac{1}{2}u^2}\, du$

$\stackrel{(b)}{=} \frac{C_{d-2}\, e^{-\tau/2}\, e^{\sqrt{nV\tau}}\, \mu_{d-2}\, \tau^{\frac{d-1}{4}}}{2\, (2\pi)^{(d-1)/2}\, (nV)^{\frac{d+1}{4}}}$

$= \frac{e^{-\tau/2}}{\sqrt{2\pi}}\, \tau^{\frac{d-1}{4}}\, (nV)^{-\frac{d+1}{4}}\, e^{\sqrt{nV\tau}} \quad \text{as } n \to \infty,$

where in (a) we have used the change of variables $u = (nV/\tau)^{1/4}\, r$, and in (b) we have defined

$\mu_i \triangleq \int_{\mathbb{R}} |u|^i\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}\, du = \frac{2^{i/2}}{\sqrt{\pi}}\, \Gamma\left(\frac{i+1}{2}\right).$

This proves the claim. □
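The numerical verification mentioned before Lemma 1 can be reproduced along the following lines. This sketch compares direct quadrature of $I_n(d,V,\tau)$ with the asymptotic formula for $d = 1, 2$; the parameter values are illustrative, and the quadrature route (rather than the paper's unspecified procedure) is an assumption. Both sides are rescaled by $e^{-\sqrt{nV\tau}}$ to keep the numbers in a benign floating-point range.

```python
# Sketch: numerical sanity check of Lemma 1. The printed ratio
# I_quad / I_asym should approach 1 as nV grows.
import numpy as np
from scipy import integrate

def ratio(d, nV, tau):
    s, rt = np.sqrt(nV), np.sqrt(tau)
    # Quadrature of exp(-s*sqrt(tau)) * I_n(d, V, tau), values kept O(1).
    if d == 1:
        f = lambda x: np.exp(s * (x - rt) - x**2 / 2) / np.sqrt(2 * np.pi)
        val = integrate.quad(f, -rt, rt)[0]
    elif d == 2:
        f = lambda x, z: np.exp(s * (x - rt) - (x**2 + z**2) / 2) / (2 * np.pi)
        val = integrate.dblquad(f, -rt, rt,
                                lambda z: -np.sqrt(tau - z**2),
                                lambda z: np.sqrt(tau - z**2))[0]
    else:
        raise NotImplementedError("only d = 1, 2 in this sketch")
    # Asymptotic formula of Lemma 1, similarly rescaled by exp(-s*sqrt(tau)).
    asym = (np.exp(-tau / 2) / np.sqrt(2 * np.pi)
            * tau**((d - 1) / 4) * nV**(-(d + 1) / 4))
    return val / asym

tau = 5.0
for d in (1, 2):
    for nV in (1e2, 1e3, 1e4):
        print(d, nV, ratio(d, nV, tau))
```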

IV. PROOF OF THEOREM 1

We prove Part (i) only. The proof for Part (ii) is a direct extension of the proof for Part (i), using the sandwich property (24) of the acceptance region for the GLRT.

Type-I Error Probability. The type-I error probability for the Rao score test is $P_I(\delta_{\mathrm{Rao}}, n) = P_{\theta_0}^n\{\|X_n\|^2 > \tau(\epsilon,d)\}$. By the definition of $\tau(\epsilon,d)$ in (13) and the fact that $\|X_n\|^2$ converges in distribution to a $\chi^2_d$ random variable, $P_I(\delta_{\mathrm{Rao}}, n)$ tends to $\epsilon$ as $n \to \infty$. Moreover, by the multidimensional Berry-Esseen theorem [4] we have $|P_I(\delta_{\mathrm{Rao}}, n) - \epsilon| \le c\, n^{-1/2}$. Since $X_n$ has a density $f_n$, this can be strengthened to $|P_I(\delta_{\mathrm{Rao}}, n) - \epsilon| = O(n^{-1})$ using a two-term Edgeworth expansion for $f_n$ [11]:

$\sup_{x \in \mathbb{R}^d} \left| f_n(x) - f^*(x)\left[1 + n^{-1/2}\, h_3(x; \chi_3)\right] \right| = O(n^{-1})$

where $f^*$ is the $d$-variate Gaussian density with mean zero and identity covariance, and $h_3(\cdot; \chi_3)$ is an odd $d$-variate polynomial function parameterized by $\chi_3$, the set of third-order cumulants of $t(Y)$. Then the claim follows from the fact that $h_3$ is an odd function and that the integration domain for $X_n$ is $B^d(\sqrt{\tau(\epsilon,d)})$, a symmetric region.

Type-II Error Probability. The type-II error probability for the Rao score test is $P_{II}(\delta_{\mathrm{Rao}}, \theta, n) = P_\theta^n\{\|X_n\|^2 \le \tau(\epsilon,d)\}$. Denote by $F_{\theta,n}$ the cumulative distribution function (cdf) of $X_n$ under $p_\theta$ and by $F^*$ the cdf of the $d$-variate standard normal distribution. We have

$\frac{dF_{\theta,n}}{dF_{\theta_0,n}}(X_n) = \exp\left\{ \sum_{i=1}^n \ln \frac{p_\theta(Y_i)}{p_{\theta_0}(Y_i)} \right\} = \exp\left\{ -nD(p_{\theta_0}\|p_\theta) + \sqrt{n}\, L_0^\top X_n \right\}$

where we have defined the vector

$L_0 \triangleq I_{\theta_0}^{1/2}\, (\theta - \theta_0).$

Note from (7) that $\|L_0\|^2 = V(p_{\theta_0}\|p_\theta)$. We obtain

$P_{II}(\delta_{\mathrm{Rao}}, \theta, n) = P_\theta^n\{\|X_n\|^2 \le \tau(\epsilon,d)\}$

$= E_{\theta_0}\left[ \frac{dF_{\theta,n}}{dF_{\theta_0,n}}(X_n)\, \mathbf{1}\{\|X_n\|^2 \le \tau(\epsilon,d)\} \right]$

$= e^{-nD(p_{\theta_0}\|p_\theta)} \int_{B^d(\sqrt{\tau(\epsilon,d)})} e^{\sqrt{n}\, L_0^\top x}\, dF_{\theta_0,n}(x)$

$\stackrel{(a)}{=} e^{-nD(p_{\theta_0}\|p_\theta)} \int_{B^d(\sqrt{\tau(\epsilon,d)})} e^{\sqrt{n}\, L_0^\top x}\, dF^*(x)$

$\stackrel{(b)}{=} e^{-nD(p_{\theta_0}\|p_\theta)} \int_{B^d(\sqrt{\tau(\epsilon,d)})} e^{\sqrt{n}\, \|L_0\|\, x_1}\, dF^*(x)$

$\stackrel{(c)}{\sim} \exp\left\{ -nD(p_{\theta_0}\|p_\theta) + \sqrt{nV(p_{\theta_0}\|p_\theta)\,\tau(\epsilon,d)} - \frac{d+1}{4}\ln n + \gamma_d \right\}$

where (a) follows from [10], (b) holds because the integral depends on the vector $L_0$ only via its norm $\|L_0\| = \sqrt{V}$, and (c) follows from Lemma 1. The lemma also gives the value of the constant $\gamma_d$. This proves the claim. □

REFERENCES

[1] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, 3rd ed., Springer Texts in Statistics, 2005.
[2] O. Zeitouni, J. Ziv, and N. Merhav, "When is the Generalized Likelihood Ratio Test Optimal?", IEEE Trans. Inf. Theory, vol. 38, no. 5, pp. 1597-1602, Sep. 1992.
[3] C. R. Rao, "Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation," Proc. Cambridge Phil. Soc., vol. 44, pp. 50-57, 1948.
[4] A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008.
[5] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," Trans. Third Prague Conf. Information Theory, 1962. English translation by M. Luthy available from http://www.math.cornell.edu/~pmlut/strassen.pdf.
[6] P. Moulin, "The log-volume of optimal codes for memoryless channels, within a few nats," Nov. 2013, arXiv:1311.0181.
[7] Y.-W. Huang and P. Moulin, "Strong Large Deviations for Composite Hypothesis Testing," Proc. ISIT, 2014.
[8] J. Unnikrishnan, D. Huang, S. P. Meyn, and V. V. Veeravalli, "Universal and Composite Hypothesis Testing via Mismatched Divergence," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1587-1603, Mar. 2011.
[9] S. S. Wilks, "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses," Ann. Math. Statist., vol. 9, pp. 60-62, 1938.
[10] M. Iltis, "Sharp Asymptotics of Large Deviations in R^d," J. Theoretical Probability, vol. 8, no. 3, pp. 501-522, 1995.
[11] R. N. Bhattacharya and R. R. Rao, Normal Approximation and Asymptotic Expansions, SIAM Classics in Applied Mathematics, Philadelphia, 2010.