Rate Analysis for Detection of Sparse Mixtures

Jonathan G. Ligo, George V. Moustakides and Venugopal V. Veeravalli
arXiv:1509.07566v1 [cs.IT] 24 Sep 2015
Abstract: In this paper, we study the rate of decay of the probability of error for distinguishing between a sparse signal with noise, modeled as a sparse mixture, and pure noise. This problem has many applications in signal processing, evolutionary biology, bioinformatics, astrophysics and feature selection for machine learning. We let the mixture probability tend to zero as the number of observations tends to infinity and derive oracle rates at which the error probability can be driven to zero for a general class of signal and noise distributions. In contrast to the problem of detection of non-sparse signals, we see that the log-probability of error decays sublinearly rather than linearly and is characterized through the χ²-divergence rather than the Kullback-Leibler divergence. Our contributions are: (i) the first characterization of the rate of decay of the error probability for this problem; and (ii) the construction of a test based on the L1-Wasserstein metric that achieves the oracle rate in a Gaussian setting without prior knowledge of the sparsity level or the signal-to-noise ratio.
I. INTRODUCTION

We consider the problem of detecting a sparse signal in noise, modeled as a mixture, where the unknown sparsity level decreases as the number of samples collected increases. Of particular interest is the case where the unknown signal strength relative to the noise power is very small. This problem has many natural applications. In signal processing, it can be applied to detecting a signal in a multi-channel system or detecting covert communications [1], [2]. In evolutionary biology, the problem manifests in the reconstruction of phylogenetic trees in the multi-species coalescent model [3]. In bioinformatics, the problem arises in the context of determining gene expression from gene ontology datasets [4]. In astrophysics, detection of sparse mixtures is used to compare models of the cosmic microwave background to observed data [5]. Also, statistics developed from the study of this problem have been applied to high-dimensional feature selection when useful features are rare and weak [6].

Prior work on detecting a sparse signal in noise has been primarily focused on Gaussian signal and noise models, with the goal of determining the trade-off in signal strength with sparsity required for detection with vanishing probability of error. In contrast, this work considers a fairly general class of signal and noise models. Moreover, in this general class of sparse signal and noise models, we provide the first analysis of the rate at which the error probabilities vanish with sample size. In the problem of testing between n i.i.d. samples from two known distributions, it is well known that the error probability decays as $e^{-cn}$ for some constant c > 0 bounded by the Kullback-Leibler divergences between the two distributions [7], [8]. In this work, we show for the problem of detecting a sparse signal in noise that the error probability for an oracle detector decays at a slower rate determined by the sparsity level and the χ²-divergence between the signal and noise distributions.

In addition to determining the optimal trade-off between signal strength and sparsity for consistent detection, an important contribution in prior work has been the construction of adaptive (and, to some extent, distribution-free) tests that achieve the optimal trade-off without knowing the model parameters [2], [9]–[14]. We discuss prior work in more detail in Sec. II-A. However, the adaptive tests that have been proposed in these papers are not amenable to an analysis of the rate at which the error probability goes to zero. In contrast, we propose a test based on the L1-Wasserstein metric that is amenable to a rate analysis, and essentially achieves the oracle rate for a large range of levels of sparsity.

II. PROBLEM SETUP

Let {f0,n}, {f1,n} be sequences of probability density functions (PDFs) for real-valued random variables. We consider the following sequence of composite hypothesis testing problems with sample size n, called the (sparse) mixture detection problem:
$$H_{0,n}: X_1,\dots,X_n \sim f_{0,n}\ \text{i.i.d.} \quad \text{(null)} \qquad (1)$$
$$H_{1,n}: X_1,\dots,X_n \sim (1-\epsilon_n) f_{0,n} + \epsilon_n f_{1,n}\ \text{i.i.d.} \quad \text{(alternative)} \qquad (2)$$
where {f0,n} is known, {f1,n} is from some known family of sequences of PDFs F, and {ǫn} is a sequence of positive numbers such that ǫn → 0. We will also assume nǫn → ∞ so that a typical realization of the alternative is distinguishable from the null. Let P0,n, P1,n denote the probability measures under H0,n, H1,n respectively, and let E0,n, E1,n be the corresponding expectations with respect to some known and fixed {f0,n}, {f1,n} and {ǫn}. When convenient, we will drop the subscript n.

J. G. Ligo and V. V. Veeravalli are with the University of Illinois at Urbana-Champaign, Urbana, IL 61801. G. V. Moustakides is with the University of Patras, Rio, Greece 26500 and Rutgers University, New Brunswick, NJ 08910. This work was supported in part by the US National Science Foundation under the grant CCF 1514245, through the University of Illinois at Urbana-Champaign, and under the Grant CIF 1513373, through Rutgers University.
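As a concrete illustration, a minimal simulation sketch of the model (1)-(2) is given below for the special case where f0,n is standard normal and f1,n is N(µ, 1) (the Gaussian location model introduced next); the parameter values are illustrative assumptions only, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, beta=0.6, mu=2.0, null=False):
    """Draw n i.i.d. observations: f_{0,n} = N(0,1) under H_{0,n}, or the sparse
    mixture (1 - eps_n) N(0,1) + eps_n N(mu, 1) under H_{1,n}, with eps_n = n**(-beta)."""
    eps_n = n ** (-beta)
    x = rng.standard_normal(n)                # pure-noise component
    if not null:
        contaminated = rng.random(n) < eps_n  # Bernoulli(eps_n) contamination labels
        x[contaminated] += mu                 # contaminated samples get mean mu
    return x

x_null = sample_mixture(10_000, null=True)    # a realization of H_{0,n}
x_alt = sample_mixture(10_000)                # a realization of H_{1,n}
```

On average only nǫn = n^{1−β} of the n samples carry the signal, which is why the assumption nǫn → ∞ is needed for the two hypotheses to be distinguishable.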
Let Ln ≜ f1,n/f0,n. When f0,n(x) = f0(x) and f1,n(x) = f0,n(x − µn), we say that the model is a location model. When f0,n is a standard normal PDF, we call the location model a Gaussian location model. The distributions of the alternative in a location model are described by the set of sequences {(ǫn, µn)}. The location model can be considered as one where the null corresponds to pure noise (f0,n), while the alternative corresponds to a sparse signal (controlled by ǫn) with signal strength µn contaminated by additive noise. The relationship between ǫn and µn determines the signal-to-noise ratio (SNR), and characterizes when the hypotheses can be distinguished with vanishing probability of error. In the general case, f1,n can be thought of as the signal distribution. We define the probability of false alarm for a hypothesis test δn between H0,n and H1,n as
$$P_{FA}(n) \triangleq P_{0,n}(\delta_n = 1) \qquad (3)$$
and the probability of missed detection as
$$P_{MD}(n) \triangleq P_{1,n}(\delta_n = 0). \qquad (4)$$
A sequence of hypothesis tests {δn} is consistent if PFA(n), PMD(n) → 0 as n → ∞. We say we have a rate characterization for a sequence of consistent hypothesis tests {δn} if we can write
$$\lim_{n\to\infty}\frac{\log P_{FA}(n)}{g_0(n)} = c, \qquad \lim_{n\to\infty}\frac{\log P_{MD}(n)}{g_1(n)} = d \qquad (5)$$
where g0(n), g1(n) → ∞ as n → ∞ and −∞ < c, d < 0. The rate characterization describes the decay of the error probabilities for large sample sizes. All logarithms are natural. For the problem of testing between i.i.d. samples from two fixed distributions, the rate characterization has g0(n) = g1(n) = n, and c, d are called the error exponents [7]. In the mixture detection problem, g0 and g1 will be sublinear functions of n. The log-likelihood ratio between H1,n and H0,n is
$$\mathrm{LLR}(n) = \sum_{i=1}^{n}\log\big(1-\epsilon_n+\epsilon_n L_n(X_i)\big). \qquad (6)$$
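For the Gaussian location model, LLR(n) in (6) can be evaluated numerically in a stable way; the short sketch below is an illustrative aid (assuming f0 standard normal and f1 = N(µ, 1)) that uses a log-sum-exp form to avoid overflow when µ·xi is large.

```python
import numpy as np

def llr(x, eps_n, mu):
    """LLR(n) from (6) for the Gaussian location model, where
    L_n(x_i) = exp(mu * x_i - mu**2 / 2)."""
    log_Ln = mu * x - 0.5 * mu ** 2
    # log(1 - eps_n + eps_n * L_n(x_i)) computed as a stable log-sum-exp
    terms = np.logaddexp(np.log1p(-eps_n), np.log(eps_n) + log_Ln)
    return terms.sum()
```

The oracle test introduced next simply compares this statistic to zero.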
In order to perform an oracle rate characterization for the mixture detection problem, we consider the sequence of oracle likelihood ratio tests (LRTs) between H0,n and H1,n (i.e., with ǫn, f0,n, f1,n known):
$$\delta_n(X_1,\dots,X_n) \triangleq \begin{cases} 1 & \mathrm{LLR}(n) \ge 0 \\ 0 & \text{otherwise.}\end{cases} \qquad (7)$$
It is well known that (7) is optimal in the sense of minimizing (PFA(n) + PMD(n))/2 for testing between H0,n and H1,n, which is the average probability of error when the null and alternative are assumed to be equally likely [15], [16]. It is valuable to analyze PFA and PMD separately since many applications incur different penalties for false alarms and missed detections.

Location Model: The detectable region for a location model is the set of sequences {(ǫn, µn)} such that a sequence of consistent oracle tests {δn} exists. For convenience of analysis, we introduce the parameterization
$$\epsilon_n = n^{-\beta} \qquad (8)$$
where β ∈ (0, 1) as necessary. Following the terminology of [11], when β ∈ (0, 1/2), the mixture is said to be a "dense mixture". If β ∈ (1/2, 1), the mixture is said to be a "sparse mixture".

A. Related Work

Prior work on mixture detection has focused primarily on the Gaussian location model. The main goals in these works have been to determine the detectable region and to construct optimally adaptive tests (i.e., tests that are consistent without knowledge of {(ǫn, µn)}, whenever possible). The study of detection of mixtures where the mixture probability tends to zero was initiated by Ingster for the Gaussian location model [13]. Ingster characterized the detectable region, and showed that outside the detectable region the sum of the probabilities of false alarm and missed detection tends to one for any test. Since the generalized likelihood statistic tends to infinity under the null, Ingster developed an increasing sequence of simple hypothesis tests that are optimally adaptive. Donoho and Jin introduced the celebrated Higher Criticism test, which is optimally adaptive and computationally efficient, and also discussed some extensions to Subbotin distributions and χ²-distributions [2]. Cai et al. extended these results to the case where f0,n is standard normal and f1,n is a normal distribution with positive variance, derived limiting expressions for the distribution of LLR(n) under both hypotheses, and showed that the Higher Criticism test is optimally adaptive in this case [9]. Jager and Wellner proposed a family of tests based on φ-divergences and showed that they attain the full detectable region in the Gaussian location model [12]. Arias-Castro and Wang studied a location model where f0,n is some fixed but unknown symmetric distribution, and constructed an optimally adaptive test which relies only on the symmetry of the distribution when µn > 0 [11]. Cai and Wu gave an information-theoretic characterization of the detectable
region via an analysis of the sharp asymptotics of the Hellinger distance for a wide variety of distributions, and established a strong converse result showing that reliable detection is impossible outside the detectable region in many cases [10]. Walther numerically showed that while the popular Higher Criticism statistic is consistent, there exist optimally adaptive tests with significantly higher power for a given sample size at different sparsity levels [14]. Our work complements [14] by providing a benchmark to meaningfully compare the sample size and sparsity tradeoffs of different tests with an oracle test. It should be noted that all work except [9], [11] has focused on the case where β > 1/2, and no prior work has provided an analysis of the rate at which PFA, PMD can be driven to zero with sample size.

III. MAIN RESULTS FOR RATE ANALYSIS

A. General Case

Our main result is a characterization of the oracle rate via the test given in (7). The sufficient conditions required for the rate characterization are applicable to a broad range of parameters in the Gaussian location model (Sec. III-B).

Theorem 3.1: Assume that for all 0 < γ < γ0 for some γ0 ∈ (0, 1), the following conditions are satisfied:
$$\lim_{n\to\infty} E_0\!\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n \ge 1+\gamma/\epsilon_n\}}\right] = 0 \qquad (9)$$
$$\epsilon_n D_n \to 0 \qquad (10)$$
$$\sqrt{n}\,\epsilon_n D_n \to \infty \qquad (11)$$
where
$$D_n^2 = E_0[(L_n-1)^2] < \infty. \qquad (12)$$
Then for the test specified by (7),
$$\lim_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} = -\frac{1}{8}. \qquad (13)$$
Moreover, (13) holds with PFA replaced by PMD.

The quantity Dn² is known as the χ²-divergence between f0,n and f1,n. In contrast to the problem of testing between i.i.d. samples from two fixed distributions, the rate is not characterized by the Kullback-Leibler divergence for the mixture detection problem.

Proof: We provide a sketch of the proof for PFA, and leave the details to Appendix A. We first establish that
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} \le -\frac{1}{8}. \qquad (14)$$
By the Chernoff bound applied to PFA(n) and noting that X1,...,Xn are i.i.d.,
$$P_{FA}(n) = P_0(\mathrm{LLR}(n)\ge 0) \le \min_{0\le s\le 1}\big(E_0[(1-\epsilon_n+\epsilon_n L_n(X_1))^s]\big)^n \le \Big(E_0\Big[\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}\Big]\Big)^n. \qquad (15)$$
By direct computation, we see E0[Ln(X1) − 1] = 0, and the following sequence of inequalities holds:
$$E_0\Big[\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}\Big] = 1 - \frac{1}{2}E_0\left[\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2}\right]$$
$$\le 1 - \frac{1}{2}E_0\left[\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2}\,\mathbf{1}_{\{\epsilon_n(L_n(X_1)-1)\le\gamma\}}\right]$$
$$\le 1 - \frac{\epsilon_n^2 D_n^2}{2(1+\sqrt{1+\gamma})^2}\,E_0\left[\frac{(L_n(X_1)-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n(X_1)\le 1+\gamma/\epsilon_n\}}\right]$$
$$= 1 - \frac{\epsilon_n^2 D_n^2}{2(1+\sqrt{1+\gamma})^2}\left(1 - E_0\left[\frac{(L_n(X_1)-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n(X_1)\ge 1+\gamma/\epsilon_n\}}\right]\right).$$
Since the expectation in the last line tends to zero by (9), we have by (15)
$$\frac{1}{n}\log P_{FA}(n) \le \log\left(1 - (1-\gamma)\frac{\epsilon_n^2 D_n^2}{2(1+\sqrt{1+\gamma})^2}\right)$$
for sufficiently large n. Dividing both sides by ǫn²Dn² and taking a lim sup using (10), (11) establishes
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} \le -\frac{1-\gamma}{2(1+\sqrt{1+\gamma})^2}.$$
Since γ can be arbitrarily small, (14) is established. We now establish that
$$\liminf_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} \ge -\frac{1}{8}. \qquad (16)$$
The proof of (16) is similar to that of Cramér's theorem (Theorem I.4, [17]). The key difference from Cramér's theorem is that LLR(n) is the sum of i.i.d. random variables for each n, but the distributions of the summands defining LLR(n) in (6) change with n under either hypothesis. Thus, we modify the proof of Cramér's theorem by introducing an n-dependent tilting distribution and replacing the standard central limit theorem (CLT) with the Lindeberg-Feller CLT for triangular arrays (Theorem 2.4.5, [18]). We introduce the tilted distribution f̃n corresponding to f0,n by
$$\tilde{f}_n(x) = \frac{\big(1-\epsilon_n+\epsilon_n L_n(x)\big)^{s_n}}{\Lambda_n(s_n)}\,f_{0,n}(x) \qquad (17)$$
where Λn(s) = E0[(1−ǫn+ǫn Ln(X1))^s] and sn = arg min_{0≤s≤1} Λn(s). Let P̃, Ẽ denote the tilted measure and expectation, respectively (where we suppress the n for clarity). A standard dominated convergence argument (Lemma 2.2.5, [8]) shows that
$$\tilde{E}\big[\log(1-\epsilon_n+\epsilon_n L_n(X_1))\big] = 0. \qquad (18)$$
Define the variance, under the tilted measure, of the log-likelihood ratio of one sample between H1,n and H0,n as
$$\sigma_n^2 = \tilde{E}\big[\big(\log(1-\epsilon_n+\epsilon_n L_n(X_1))\big)^2\big]. \qquad (19)$$
We show in Appendix A that there exist positive constants C1, C2 such that C1ǫn²Dn² ≥ σn² ≥ C2ǫn²Dn² for sufficiently large n, and therefore σn² is well defined. By calculating PFA via changing to the P̃ measure and using the bounds on σn², we see that for sufficiently large n
$$\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} \ge \frac{\log\Lambda_n(s_n)}{\epsilon_n^2 D_n^2} + \frac{\log\tilde{P}(\mathrm{LLR}(n)\ge 0)}{n\epsilon_n^2 D_n^2} - \frac{\sqrt{C_1}}{\sqrt{n}\,\epsilon_n D_n\,\tilde{P}(\mathrm{LLR}(n)\ge 0)}. \qquad (20)$$
In Appendix A, we provide a lower bound on lim inf_{n→∞} log Λn(sn)/(ǫn²Dn²) by lower bounding Λn and using (9). By normalizing LLR(n) by √n σn, the Lindeberg-Feller CLT shows that P̃(LLR(n) ≥ 0) → 1/2, and applying conditions (10) and (11) shows that the last two terms in (20) tend to zero. This establishes (16), and therefore (13).

The analysis under H1,n for PMD relies on the fact that Xi ∼ (1−ǫn+ǫn Ln(x)) f0,n(x) i.i.d. under H1,n, which allows the use of 1−ǫn+ǫn Ln to change the measure from the alternative to the null. The upper bound is established identically, by noting the Chernoff bound furnishes
$$P_{MD}(n) = P_{1,n}(-\mathrm{LLR}(n) > 0) \le \left(E_1\left[\frac{1}{\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}}\right]\right)^n = \Big(E_0\Big[\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}\Big]\Big)^n.$$
Similarly, the previous analysis can be applied to show that (16) holds with PFA replaced by PMD.

When the conditions of Thm 3.1 do not hold, we have the following upper bound:

Theorem 3.2: If for all M sufficiently large,
$$E_0\big[L_n\,\mathbf{1}_{\{L_n > 1+M/\epsilon_n\}}\big] \to 1 \qquad (21)$$
then for the test specified by (7),
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n} \le -1. \qquad (22)$$
Moreover, (22) holds with PFA replaced by PMD.

Proof: The proof is via optimizing the Chernoff bound and is deferred to Appendix B. Unlike in the case of Thm 3.1, where the optimal parameter in the Chernoff bound is close to 1/2, here the optimal parameter in the Chernoff bound is close to 1 for PFA and close to 0 for PMD.

Corollary 3.4 shows that Thm 3.2 is indeed tight under H1,n. It is reasonable to believe that Thm 3.2 is also tight under H0,n (or at least of the right order), since nǫn is the average number of observations drawn from f1,n in the model of H1,n presented in the proof of Thm 3.3. However, we do not yet have a proof that log PFA(n)/(nǫn) → −1. We have the following universal lower bound for PMD:
Theorem 3.3: Let {δn} be any sequence of tests such that lim sup_{n→∞} PFA(n) < 1. Then,
$$\liminf_{n\to\infty}\frac{\log P_{MD}(n)}{n\epsilon_n} \ge -1. \qquad (23)$$
Proof: One can think of H1,n as follows:
1) Generate an i.i.d. Bernoulli(ǫn) vector of length n, B.
2) If Bi = 0, draw Xi from f0,n. If Bi = 1, draw Xi from f1,n.
Let Pb denote the probability measure associated with the distribution of the observations drawn under H1,n when B = b, and let z denote the all-zeros vector. Then,
$$P_{MD}(n) = P_1[\delta_n = 0] = \sum_{b\in\{0,1\}^n} P_b[\delta_n = 0]\,P[B = b] \qquad (24)$$
$$\ge P_z[\delta_n = 0]\,P[B = z] \qquad (25)$$
$$= P_0[\delta_n = 0]\,(1-\epsilon_n)^n \qquad (26)$$
$$= (1-P_{FA}(n))(1-\epsilon_n)^n,$$
where (24) follows from the law of total probability, (25) follows from keeping only the term corresponding to the all-zeros vector, and (26) follows because observations drawn under H1,n when B = z are identical in distribution to those drawn under H0,n. Taking logarithms, dividing by nǫn, and taking a lim inf establishes (23).

Note that the bound of Thm 3.3 is independent of any divergence between f0,n and f1,n, and the theorem holds for any consistent sequence of tests because PFA(n) → 0. Combining Thm 3.2 and 3.3 gives the following corollary:

Corollary 3.4: If for all M sufficiently large,
$$E_0\big[L_n\,\mathbf{1}_{\{L_n > 1+M/\epsilon_n\}}\big] \to 1 \qquad (27)$$
then for the test specified by (7),
$$\lim_{n\to\infty}\frac{\log P_{MD}(n)}{n\epsilon_n} = -1. \qquad (28)$$
Interestingly, so long as the condition of Thm 3.2 holds, no non-trivial sequence of tests (i.e., one with lim sup_{n→∞} PFA, PMD < 1) can achieve a better rate than (7) under H1,n. This is different from the case of testing i.i.d. observations from two fixed distributions, where allowing for large PFA can improve the rate under the alternative.

1) Comparison to Related Work: Cai and Wu [10] consider a model which is essentially as general as ours, and characterize the detection boundary for many cases of interest, but do not perform a rate analysis. Note that our rate characterization (13) depends on Dn, the χ²-divergence between f0,n and f1,n. While the Hellinger distance used in [10] can be upper bounded in terms of the χ²-divergence, a corresponding lower bound does not exist in general [19], and so our results cannot be derived using the methods of [10]. In fact, our results complement [10] in giving precise bounds on the error decay for this problem once the detectable region boundary has been established. Furthermore, as we will show in Thm 4.2, there are cases where the rates derived by analyzing the likelihood ratio test are essentially achievable.

B. Gaussian Location Model

In this section, we specialize Thm 3.1 and 3.2 to the Gaussian location model. The rate characterization proved is summarized in Fig. 1. We first recall some results from the literature for the detectable region for this model.

Theorem 3.5 ([2], [9], [11]): The boundary of the detectable region (in {(ǫn, µn)} space) is given by (with ǫn = n^{−β}):
1) If 0 < β ≤ 1/2, then µcrit,n = n^{β−1/2}. (Dense)
2) If 1/2 < β < 3/4, then µcrit,n = √(2(β − 1/2) log n). (Moderately Sparse)
3) If 3/4 ≤ β < 1, then µcrit,n = √(2(1 − √(1−β))² log n). (Very Sparse)
If in the dense case µn = n^r, then the LRT (7) is consistent if r > β − 1/2. Moreover, if r < β − 1/2, then PFA(n) + PMD(n) → 1 as n → ∞ for any sequence of tests. If in the sparse cases µn = √(2r log n), then the LRT is consistent if √r > µcrit,n/√(2 log n). Moreover, if √r < µcrit,n/√(2 log n), then PFA(n) + PMD(n) → 1 as n → ∞ for any sequence of tests.

We call the set of {(ǫn, µn)} sequences where (7) is consistent the interior of the detectable region. We now begin proving a rate characterization for the Gaussian location model by specializing Thm 3.1. Note that Ln(x) = e^{µn x − µn²/2} and Dn² = e^{µn²} − 1. A simple computation shows that the conditions in the theorem can be re-written as:
[Two-panel figure: (a) detectable region (r versus β) for µn = √(2r log n), ǫn = n^{−β}; (b) detectable region (r versus β) for µn = n^r, ǫn = n^{−β}.]
Fig. 1: Detectable regions for the Gaussian location model. Unshaded regions have PMD + PFA → 1 for any test (i.e., reliable detection is impossible). Green regions are where Corollaries 3.6 and 3.7 provide an exact rate characterization. The red region is where Thm 3.9 provides an upper bound on the rate, but no lower bound. The blue region is where Cor. 3.8 holds, and provides an upper bound on the rate for PFA and an exact rate characterization for PMD.

For all γ > 0 sufficiently small:
1)
$$\frac{e^{\mu_n^2}\,Q\!\left(-\frac{3}{2}\mu_n + \frac{\log(1+\gamma/\epsilon_n)}{\mu_n}\right) - 2Q\!\left(-\frac{1}{2}\mu_n + \frac{\log(1+\gamma/\epsilon_n)}{\mu_n}\right) + Q\!\left(\frac{1}{2}\mu_n + \frac{\log(1+\gamma/\epsilon_n)}{\mu_n}\right)}{e^{\mu_n^2}-1} \to 0 \qquad (29)$$
2)
$$\epsilon_n^2\big(e^{\mu_n^2}-1\big) \to 0 \qquad (30)$$
3)
$$n\epsilon_n^2\big(e^{\mu_n^2}-1\big) \to \infty \qquad (31)$$
where $Q(x) = \int_x^\infty \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt$.
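A short numerical check of conditions (29)-(31), as written above, can be done with the standard normal tail function; the sketch below is illustrative only, and the parameter choices are assumptions rather than values used later in the paper.

```python
import numpy as np
from scipy.stats import norm

def condition_29(n, beta, mu, gamma=0.1):
    """Left-hand side of (29) for the Gaussian location model with eps_n = n**(-beta)."""
    eps_n = n ** (-beta)
    a = np.log1p(gamma / eps_n) / mu
    Q = norm.sf                                   # Q(x) = 1 - Phi(x)
    num = (np.exp(mu ** 2) * Q(-1.5 * mu + a)
           - 2.0 * Q(-0.5 * mu + a)
           + Q(0.5 * mu + a))
    return num / np.expm1(mu ** 2)

# Dense-case example (beta = 0.4, mu_n = 1): the value of (29) should shrink with n,
# while n * eps_n**2 * (e^{mu^2} - 1) grows, in line with (30)-(31).
for n in [10 ** 4, 10 ** 6, 10 ** 8]:
    print(n, condition_29(n, 0.4, 1.0), n * n ** (-2 * 0.4) * np.expm1(1.0))
```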
Corollary 3.6 (Dense case): If ǫn = n^{−β} for β ∈ (0, 1/2) and µn = h(n)/n^{1/2−β}, where h(n) → ∞ and lim sup_{n→∞} µn/√((2β/3) log n) ≤ 1, then
$$\lim_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2\big(e^{\mu_n^2}-1\big)} = -\frac{1}{8} \qquad (32)$$
and the same result holds replacing PFA with PMD. If µn → 0, (32) can be rewritten as
$$\lim_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2\mu_n^2} = -\frac{1}{8}. \qquad (33)$$
Note that h(n) may grow arbitrarily slowly, making it negligible with respect to µcrit,n in Thm 3.5.

Proof: It is easy to verify (30) and (31) directly, and (29) if µn does not tend to zero. If µn → 0, to verify (29) it suffices to show that for any α ∈ ℝ, Q(αµn + log(1+γ/ǫn)/µn)/(e^{µn²} − 1) → 0. Since e^x − 1 ≥ x, it suffices to show that Q(αµn + log(1+γ/ǫn)/µn)/µn² → 0. This can be verified by the standard bound Q(x) ≤ e^{−x²/2} for x > 0, noting that αµn + log(1+γ/ǫn)/µn > 0 for sufficiently large n and that x/(e^x − 1) → 1 as x → 0.

Corollary 3.7 (Moderately sparse case): If ǫn = n^{−β} for β ∈ (1/2, 3/4) and µn = √(2(β − 1/2 + ξ) log n) for any 0 < ξ < (3−4β)/6, then
$$\lim_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2\big(e^{\mu_n^2}-1\big)} = -\frac{1}{8} \qquad (34)$$
and the same result holds replacing PFA with PMD.
Note that ξ can be replaced with an appropriately chosen sequence tending to 0 such that (30) and (31) hold.

Proof: It is easy to verify (30) and (31) directly. To verify (29), note that since Q(·) ≤ 1 and µn → ∞, the terms in the numerator of (29) without the e^{µn²} factor, divided by e^{µn²} − 1, tend to 0. Thus, it suffices to show Q(−(3/2)µn + log(1+γ/ǫn)/µn) → 0, or equivalently, that −(3/2)µn + log(1+γ/ǫn)/µn → ∞ for any fixed γ > 0. Applying log(1+γ/ǫn) ≥ log(γ/ǫn) = β log n + log γ shows
$$-\frac{3}{2}\mu_n + \frac{\log(1+\gamma/\epsilon_n)}{\mu_n} \ge -\frac{3}{2}\sqrt{2(\beta-1/2+\xi)\log n} + \frac{\beta\log n + \log\gamma}{\sqrt{2(\beta-1/2+\xi)\log n}}$$
$$= \left(-\frac{3}{2}\sqrt{2(\beta-1/2+\xi)} + \frac{\beta}{\sqrt{2(\beta-1/2+\xi)}}\right)\sqrt{\log n} + \frac{\log\gamma}{\sqrt{2(\beta-1/2+\xi)\log n}} \qquad (35)$$
where the last term tends to 0 with n. Thus, (35) tends to infinity if the coefficient of √(log n) is positive, i.e., if (1/2)(1−2ξ) < β < (1/4)(3−6ξ), which holds by the definition of ξ. Thus, (35) tends to infinity and (29) is proved.

For µn > √((2β/3) log n), (29) does not hold. However, Thm 3.2 and Cor. 3.4 provide a partial rate characterization for the case where µn grows faster than √(2β log n):

Corollary 3.8: If ǫn = n^{−β} for β ∈ (0, 1) and lim inf_{n→∞} µn/√(2β log n) > 1, then
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n} \le -1 \qquad (36)$$
and
$$\lim_{n\to\infty}\frac{\log P_{MD}(n)}{n\epsilon_n} = -1. \qquad (37)$$
Proof: The condition for Thm 3.2 and Cor. 3.4 given by (21) is Q(log(1+M/ǫn)/µn − µn/2) → 1. This holds if log(1+M/ǫn)/µn − µn/2 → −∞, which holds if r > β.

Theorems 3.1 and 3.2 do not hold when ǫn = n^{−β} and µn = √(2r log n) where r ∈ (β/3, β) for β ∈ (0, 3/4) or r ∈ ((1 − √(1−β))², β) for β ∈ (3/4, 1). For the remainder of the detectable region, we have an upper bound on the rate derived specifically for the Gaussian location setting.

Theorem 3.9: Let ǫn = n^{−β} and µn = √(2r log n) where r ∈ (β/3, β) for β ∈ (0, 3/4) or r ∈ ((1 − √(1−β))², β) for β ∈ (3/4, 1). Then,
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2\,e^{\mu_n^2}\,\Phi\!\left(\left(\frac{\beta}{2r}-\frac{3}{2}\right)\mu_n\right)} \le -\frac{1}{16}, \qquad (38)$$
where Φ denotes the standard normal cumulative distribution function. Moreover, (38) holds replacing PFA with PMD.

Proof: The proof is based on a Chernoff bound with s = 1/2. Details are given in Appendix C.

It is useful to note that nǫn² e^{µn²} Φ((β/(2r) − 3/2)µn) behaves on the order of n^{1−2β+r(2−(3/2−β/(2r))²)}/√(2r log n) for large n in Thm 3.9.

IV. TESTING BASED ON THE L1-WASSERSTEIN METRIC IN THE GAUSSIAN LOCATION MODEL
No adaptive tests prior to this work have had a precise rate characterization, even in the dense case of the Gaussian location model. Moreover, optimally adaptive tests for 0 < β < 1 such as Higher Criticism or CUSUM [2], [11] are not amenable to rate analysis, since the consistency proofs of these tests follow from constructing statistics which grow slowly under the null and slightly more quickly under the alternative via a result of Darling and Erdős [20]. In this section, we introduce a test based on a Wasserstein distance for the Gaussian location model. Wasserstein distances provide a metric for convergence in distribution [21] and naturally occur in estimating a mixture distribution from samples [22], [23]. Unlike in [22], [23], we are concerned with the detection of a mixture, rather than estimation, and the mixtures considered become sparser as the sample size increases. The L1-Wasserstein distance between cumulative distribution functions (CDFs) F, G is defined by
$$d_W(F,G) \triangleq \inf_{P_{XY}:\,X\sim F,\,Y\sim G} E[|X-Y|] = \int_{-\infty}^{\infty}|F(x)-G(x)|\,dx. \qquad (39)$$
Let Fn denote the empirical CDF of {x1,...,xn}: Fn(x) = (Σ_{i=1}^n 1{xi ≤ x})/n. Our test is based on the following concentration inequality for dW(Fn, F):
Theorem 4.1 ([24], Theorem 2): Let Fn be the empirical CDF of n samples drawn from a CDF F. Assume there exist α > 1, γ > 0 such that E_{α,γ}(F) = ∫ e^{γ|x|^α} dF(x) < ∞. Then, there exist positive constants C, c depending only on E_{α,γ}(F) such that
$$P(d_W(F_n,F) \ge x) \le Ce^{-cnx^2} \qquad (40)$$
for 0 < x ≤ 1.

An inspection of the proof of this concentration inequality shows that positive constants C, c exist for a family of distributions if there exist uniform positive and finite upper and lower bounds on E_{α,γ} for all distributions in the family. Note that the proof of Thm 4.1 may allow c to be very small. Our main result for testing in the Gaussian location model with rate guarantees is:

Theorem 4.2: Define the test
$$d_W(F_n,\Phi)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \frac{\log n}{\sqrt{n}} \qquad (41)$$
where Φ is the CDF of the standard normal distribution. Then, this test is consistent for the Gaussian location model with 0 < β < 1/2 when µn = h(n)/n^{1/2−β} and lim_{n→∞} h(n)/log n > 1.

Note that log n in the theorem can be replaced with any function g(n) such that g(n)n^{−ξ} → 0 for any ξ > 0 and g(n) → ∞, and analogous results still hold.

Proof: Taking α = 2, γ = 1/4, we see that for the family of distributions {(1−ǫ)Φ(·) + ǫΦ(· − µ)}_{0≤µ≤1, 0≤ǫ≤1}, there exist positive upper and lower bounds on E_{α,γ} independent of ǫ, µ. Thus, all constants C, c in the remainder of the proof exist and are positive. When µn > 1 the detection problem is uninteresting, as the dense case allows µn → 0.

Under H0,n: Thm 4.1 is directly applicable, and we see
$$P_{FA}(n) = P_{0,n}\left(d_W(F_n,\Phi) \ge \frac{\log n}{\sqrt{n}}\right) \le Ce^{-cn\left(\frac{\log n}{\sqrt{n}}\right)^2} = Ce^{-c\log^2 n}. \qquad (42)$$
Thus, PFA → 0 and the test is consistent under H0,n.

Under H1,n: Note that
$$d_W(F_n,\Phi) = \int_{-\infty}^{\infty}|F_n(t)-\Phi(t)|\,dt = \int_{-\infty}^{\infty}\big|F_n(t) - (1-\epsilon_n)\Phi(t) - \epsilon_n\Phi(t-\mu_n) + \epsilon_n\big(\Phi(t-\mu_n)-\Phi(t)\big)\big|\,dt$$
$$\ge \big|d_W\big(F_n,\,(1-\epsilon_n)\Phi(\cdot)+\epsilon_n\Phi(\cdot-\mu_n)\big) - \epsilon_n\,d_W\big(\Phi(\cdot),\,\Phi(\cdot-\mu_n)\big)\big| \qquad (43)$$
by the triangle inequality on the integrand. An application of Jensen's inequality shows that dW(Φ(·), Φ(· − µn)) ≥ µn, and equality is attained by taking Y = X + µn where X is standard normal. Using the lower bound on the test statistic (43), we see
$$P_{MD}(n) = P_{1,n}\left(d_W(F_n,\Phi) \le \frac{\log n}{\sqrt{n}}\right) \le P_{1,n}\left(-d_W\big(F_n,\,(1-\epsilon_n)\Phi(\cdot)+\epsilon_n\Phi(\cdot-\mu_n)\big) + \epsilon_n\mu_n \le \frac{\log n}{\sqrt{n}}\right)$$
$$= P_{1,n}\left(d_W\big(F_n,\,(1-\epsilon_n)\Phi(\cdot)+\epsilon_n\Phi(\cdot-\mu_n)\big) \ge \epsilon_n\mu_n - \frac{\log n}{\sqrt{n}}\right) \le Ce^{-cn\left(\epsilon_n\mu_n - \frac{\log n}{\sqrt{n}}\right)^2} \qquad (44)$$
by Thm 4.1. We see that for any h(n) as defined in the theorem, with µn = h(n)/n^{1/2−β}, PMD → 0.

As shown by (42) and (44),
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{\log^2 n} \le -c, \qquad \limsup_{n\to\infty}\frac{\log P_{MD}(n)}{n\left(\epsilon_n\mu_n - \frac{\log n}{\sqrt{n}}\right)^2} \le -c. \qquad (45)$$
From Cor. 3.6, we see that for µn → 0, log PFA scales with nǫn²µn². Taking µn as in the corollary, we have nǫn²µn² = h(n)², which can be a function that tends to infinity arbitrarily slowly. In this case, due to the threshold for our test statistic having a log n in the numerator, we require h(n) to tend to infinity at least logarithmically, explaining (45). Similarly, log PMD also has essentially the right scaling behavior. Thus, the predicted scaling for the decay of PFA, PMD is achievable modulo a logarithmic gap in µn due to the choice of threshold. Note that this test is easily generalized to other distributions, such as many members of the Subbotin family. Future work includes designing tests for the sparse case that are amenable to rate analysis.
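A minimal implementation sketch of the test (41) is given below; it approximates dW(Fn, Φ) by numerically integrating |Fn(t) − Φ(t)| on a uniform grid. The grid width and range are ad hoc choices for illustration, not prescribed by the paper.

```python
import numpy as np
from scipy.stats import norm

def wasserstein_to_std_normal(x, grid_pts=20000, tail=10.0):
    """Approximate d_W(F_n, Phi) = int |F_n(t) - Phi(t)| dt on a uniform grid."""
    xs = np.sort(np.asarray(x, dtype=float))
    t = np.linspace(min(xs[0], -tail), max(xs[-1], tail), grid_pts)
    Fn = np.searchsorted(xs, t, side="right") / xs.size   # empirical CDF at grid points
    dt = t[1] - t[0]
    return np.abs(Fn - norm.cdf(t)).sum() * dt             # Riemann-sum approximation

def wasserstein_test(x):
    """Test (41): declare H1 when d_W(F_n, Phi) exceeds log(n) / sqrt(n)."""
    n = len(x)
    return int(wasserstein_to_std_normal(x) >= np.log(n) / np.sqrt(n))
```

One practical caveat: for large n the threshold log n/√n is small, so the integration grid must be fine enough to resolve it.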
[Two-panel figure: each panel plots log Pe (for both PFA and PMD) against nǫn²(e^{µn²} − 1).]
(a) Simulations of error probabilities in the Gaussian location model with µn = 1, ǫn = n−0.4 for the test (7). A best fit line for log PMD is given as a blue dashed line and corresponding line for log PFA is given as a red dot-dashed line.
(b) Simulations of error probabilities in the Gaussian location model with µn = √(2(0.19) log n), ǫn = n^{−0.6} for the test (7). A best fit line for log PMD is given as a blue dashed line and the corresponding line for log PFA is given as a red dot-dashed line.
Fig. 2: Simulation results for Cor. 3.6 and 3.7.

V. NUMERICAL EXPERIMENTS

In this section, we provide numerical simulations to verify the rate characterization developed for the Gaussian location model. We first consider the dense case, with ǫn = n^{−0.4} and µn = 1. The conditions of Cor. 3.6 apply here, and we expect log PFA/(nǫn²(e^{µn²} − 1)) → −1/8. Simulations were done using direct Monte Carlo simulation with 10000 trials for the errors for n ≤ 10^6. Importance sampling via the hypothesis alternate to the true hypothesis (i.e., H0,n for simulating PMD, H1,n for simulating PFA) was used for 10^6 < n ≤ 2 × 10^7 with between 10000 and 15000 data points. The performance of the test given in (7) is shown in Fig. 2a. The dashed lines are the best fit lines between the log-error probabilities and nǫn²(e^{µn²} − 1) using data for n ≥ 344000. By Cor. 3.6, we expect the slope of the best fit lines to be approximately −1/8. This is the case, as the line corresponding to missed detection has slope −0.1325 with standard error 0.0038 (i.e., −1/8 is within 2 standard errors) and the line corresponding to false alarm has slope −0.1240 with standard error 0.0099 (i.e., −1/8 is within 1 standard error).

The moderately sparse case with ǫn = n^{−0.6} and µn = √(2(0.19) log n) is shown in Fig. 2b. The conditions of Cor. 3.7 apply here, and we expect log PFA/(nǫn²(e^{µn²} − 1)) → −1/8. Simulations were performed identically to the dense case. The dashed lines are the best fit lines between the log-error probabilities and nǫn²(e^{µn²} − 1) using data for n ≥ 100000. By Cor. 3.7, we expect the slope of the best fit lines to be approximately −0.125. Both best fit lines have slope −0.108 and standard error 0.002. It is important to note that PFA, PMD are both still large even at n = 2 × 10^7, and simulation to larger sample sizes should show better agreement with Cor. 3.7.
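The importance-sampling estimator described above can be sketched as follows for PFA: data are drawn under the alternate hypothesis H1,n and weighted by the likelihood ratio dP0/dP1 = e^{−LLR(n)}. The parameters and trial counts below are placeholders for illustration, not the values used to produce Fig. 2.

```python
import numpy as np

def pfa_importance_sampling(n, beta, mu, trials=10_000, seed=0):
    """Estimate P_FA of the oracle LRT (7) in the Gaussian location model by
    importance sampling: simulate under H_{1,n} and weight by exp(-LLR(n))."""
    rng = np.random.default_rng(seed)
    eps_n = n ** (-beta)
    total = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        x[rng.random(n) < eps_n] += mu                       # draw one sample path under H_{1,n}
        log_Ln = mu * x - 0.5 * mu ** 2
        llr = np.logaddexp(np.log1p(-eps_n), np.log(eps_n) + log_Ln).sum()
        if llr >= 0.0:                                       # {LLR(n) >= 0} is the false-alarm event
            total += np.exp(-llr)                            # importance weight dP0/dP1
    return total / trials
```

Estimating PMD works symmetrically: simulate under H0,n and weight by e^{+LLR(n)} on the event {LLR(n) < 0}.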
VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an oracle rate characterization for the error probability decay with sample size in a general mixture detection problem. In the Gaussian location model, we explicitly showed that the rate characterization holds for a large portion of the dense regime and the moderately sparse regime. A partial rate characterization (an upper bound on the rate and a universal lower bound on the rate under H1,n) was provided for the remainder of the detectable region. In contrast to the usual large deviations results [7], [8] for the decay of error probabilities, our results show that the log-probability of error decays sublinearly with sample size.

There are several possible extensions of this work. One is to provide corresponding lower bounds for the rate in cases not covered by Thm 3.1. Another is to provide a general analysis of the behavior that is not covered by Thm 3.1 and 3.2, present in Thm 3.9 in the Gaussian location model. As noted in [9], in some applications it is natural to require PFA(n) ≤ α for some fixed α > 0, rather than requiring PFA(n) → 0. While Thm 3.5 shows the detectable region is not enlarged in the Gaussian location model (and similarly for some general models [10]), it is conceivable that the oracle optimal test which fixes PFA (i.e., one which compares LLR(n) to a non-zero threshold) can achieve a better rate for PMD. It is expected that the techniques developed in this paper extend to the case where PFA(n) is constrained to a level α. Finally, it is important to develop tests that are amenable to a rate analysis and are computationally simple to implement over 0 < β < 1. The test presented in Thm 4.2 shows that this is possible in the dense regime, but it is numerically difficult to compute. In the case of 3/4 < β < 1 for the Gaussian location model, the test max_{i=1,...,n} Xi ≷_{H0}^{H1} √(2 log n) is optimally adaptive [2], and is likely amenable to rate analysis, as the CDF of the test statistic has a simple closed form that allows computing, for example, lim_{n→∞} log PFA(n)/(log log n) = −1/2. However, since this max test is not consistent for a large range of µn in the regime where 0 < β < 3/4, the problem of finding a rate-analyzable optimally adaptive test remains open.
REFERENCES

[1] R. Dobrushin, "A statistical problem arising in the theory of detection of signals in the presence of noise in a multi-channel system and leading to stable distribution laws," Theory of Probability & Its Applications, vol. 3, no. 2, pp. 161–173, 1958.
[2] D. Donoho and J. Jin, "Higher criticism for detecting sparse heterogeneous mixtures," Ann. Statist., vol. 32, no. 3, pp. 962–994, 06 2004.
[3] E. Mossel and S. Roch, "Distance-based species tree estimation: information-theoretic trade-off between number of loci and sequence length under the coalescent," arXiv preprint arXiv:1504.05289 [math.PR], 2015.
[4] J. J. Goeman and P. Bühlmann, "Analyzing gene expression data in terms of gene sets: methodological issues," Bioinformatics, vol. 23, no. 8, pp. 980–987, 2007.
[5] L. Cayón, J. Jin, and A. Treaster, "Higher criticism statistic: detecting and identifying non-Gaussianity in the WMAP first-year data," Monthly Notices of the Royal Astronomical Society, vol. 362, no. 3, pp. 826–832, 2005.
[6] D. Donoho and J. Jin, "Higher criticism thresholding: Optimal feature selection when useful features are rare and weak," Proceedings of the National Academy of Sciences, vol. 105, no. 39, pp. 14790–14795, 2008.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. NY: John Wiley and Sons, Inc., 2006.
[8] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. Springer Science & Business Media, 2009, vol. 38.
[9] T. T. Cai, X. J. Jeng, and J. Jin, "Optimal detection of heterogeneous and heteroscedastic mixtures," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 629–662, 2011.
[10] T. T. Cai and Y. Wu, "Optimal detection of sparse mixtures against a given null distribution," IEEE Trans. Info. Theory, vol. 60, no. 4, pp. 2217–2232, 2014.
[11] E. Arias-Castro and M. Wang, "Distribution-free tests for sparse heterogeneous mixtures," arXiv preprint arXiv:1308.0346 [math.ST], 2013.
[12] L. Jager and J. A. Wellner, "Goodness-of-fit tests via phi-divergences," Ann. Statist., vol. 35, no. 5, pp. 2018–2053, 10 2007.
[13] Y. Ingster and I. A. Suslina, Nonparametric Goodness-of-Fit Testing Under Gaussian Models. Springer Science & Business Media, 2003, vol. 169.
[14] G. Walther, "The average likelihood ratio for large-scale multiple testing and detecting sparse mixtures," arXiv preprint arXiv:1111.0328 [stat.ME], 2011.
[15] H. V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed. New York, NY: Springer, 1994.
[16] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
[17] F. den Hollander, Large Deviations. American Mathematical Soc., 2008, vol. 14.
[18] R. Durrett, Probability: Theory and Examples, 3rd ed. Duxbury Press, 2004.
[19] A. L. Gibbs and F. E. Su, "On choosing and bounding probability metrics," International Statistical Review, vol. 70, no. 3, pp. 419–435, 2002.
[20] D. A. Darling and P. Erdős, "A limit theorem for the maximum of normalized sums of independent random variables," Duke Math. J., vol. 23, no. 1, pp. 143–155, 03 1956.
[21] R. M. Dudley, Real Analysis and Probability. Cambridge University Press, 2002, vol. 74.
[22] J. Chen, "Optimal rate of convergence for finite mixture models," Ann. Statist., vol. 23, no. 1, pp. 221–233, 02 1995. [Online]. Available: http://dx.doi.org/10.1214/aos/1176324464
[23] X. Nguyen, "Convergence of latent mixing measures in finite and infinite mixture models," Ann. Statist., vol. 41, no. 1, pp. 370–400, 02 2013. [Online]. Available: http://dx.doi.org/10.1214/12-AOS1065
[24] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probability Theory and Related Fields, pp. 1–32, 2014.
APPENDIX

A. Proof of Theorem 3.1

In this section, we establish that
$$\liminf_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n^2 D_n^2} \ge -\frac{1}{8}. \qquad (46)$$
1) Some Useful Lemmas:

Lemma A.1: There exist positive constants C1, C2 such that for sufficiently large n, C1ǫn²Dn² ≥ σn² ≥ C2ǫn²Dn², where
$$\sigma_n^2 = \tilde{E}\big[\big(\log(1+\epsilon_n(L_n(X_1)-1))\big)^2\big]. \qquad (47)$$
Proof: We first show that for sufficiently large n,
$$C_1 \ge \frac{\sigma_n^2\,\Lambda_n(s_n)}{\epsilon_n^2 D_n^2}. \qquad (48)$$
Note that
$$(\log(1+x))^2(1+x)^s \le 2x^2 \quad\text{for } s\in(0,1),\ x\ge 1. \qquad (49)$$
This follows from 0 ≤ log(1+x) ≤ √x for x ≥ 0 and 1 ≤ (1+x)^s ≤ 2x for x ≥ 1 and s ∈ (0, 1). Also, note Λn(0) = Λn(1) = 1, implying sn ∈ (0, 1) by convexity of Λn (Lemma 2.2.5, [8]). For shorthand, we will write Ln = Ln(X1). Then,
$$\Lambda_n(s_n)\,\sigma_n^2 = E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\big]$$
$$= E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\{\epsilon_n(L_n-1)>1\}}\big] + E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le 1\}}\big]. \qquad (50)$$
We first consider E0[(log(1+ǫn(Ln−1)))²(1+ǫn(Ln−1))^{sn} 1{ǫn(Ln−1)>1}]. By (49), we have on the event {ǫn(Ln−1) > 1} that (log(1+ǫn(Ln−1)))²(1+ǫn(Ln−1))^{sn} ≤ 2(ǫn(Ln−1))². Thus,
$$E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\{\epsilon_n(L_n-1)>1\}}\big] \le E_0\big[2(\epsilon_n(L_n-1))^2\,\mathbf{1}_{\{\epsilon_n(L_n-1)>1\}}\big] \le 2\epsilon_n^2 E_0\big[(L_n-1)^2\big] = 2\epsilon_n^2 D_n^2 \qquad (51)$$
by monotonicity of E0.

We now consider E0[(log(1+ǫn(Ln−1)))²(1+ǫn(Ln−1))^{sn} 1{ǫn(Ln−1)≤1}]. A simple calculus argument shows that (log(1+x))² ≤ 5x² for x ≥ −1/2. Note that since Ln ≥ 0, −ǫn ≤ ǫn(Ln−1). Because ǫn → 0, for sufficiently large n we have that ǫn < 1/2 and (log(1+ǫn(Ln−1)))² ≤ 5(ǫn(Ln−1))² holds. Also, (1+ǫn(Ln−1))^{sn} ≤ 2^{sn} ≤ 2 on the event {ǫn(Ln−1) ≤ 1}. Thus,
$$E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le 1\}}\big] \le E_0\big[10(\epsilon_n(L_n-1))^2\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le 1\}}\big] \le E_0\big[10(\epsilon_n(L_n-1))^2\big] = 10\epsilon_n^2 D_n^2. \qquad (52)$$
Using (51), (52) in (50), we see for sufficiently large n that Λn(sn)σn² ≤ 12ǫn²Dn², establishing (48).

We now show that
$$C_2 \le \frac{\sigma_n^2\,\Lambda_n(s_n)}{\epsilon_n^2 D_n^2}. \qquad (53)$$
Taking any γ < 1/2, from the conditions of Theorem 3.1,
$$\Lambda_n(s_n)\,\sigma_n^2 = E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\big] \ge E_0\big[(\log(1+\epsilon_n(L_n-1)))^2(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le\gamma\}}\big]$$
$$\ge \frac{1}{2}E_0\big[(\log(1+\epsilon_n(L_n-1)))^2\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le\gamma\}}\big] \qquad (54)$$
$$\ge \frac{1}{4}E_0\big[(\epsilon_n(L_n-1))^2\,\mathbf{1}_{\{\epsilon_n(L_n-1)\le\gamma\}}\big] \qquad (55)$$
$$= \frac{D_n^2\epsilon_n^2}{4}E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n\le 1+\gamma/\epsilon_n\}}\right] = \frac{D_n^2\epsilon_n^2}{4}\left(1 - E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n\ge 1+\gamma/\epsilon_n\}}\right]\right) \qquad (56)$$
where (54) follows from (1+ǫn(Ln−1))^{sn} ≥ (1−ǫn)^{sn} ≥ 1/2 for sufficiently large n, as sn ∈ (0, 1) and ǫn → 0. A simple calculus argument shows that (1/2)x² ≤ (log(1+x))² for x ∈ [−1/2, 1/2]. This, along with −1/2 < −ǫn ≤ ǫn(Ln−1) ≤ γ < 1/2 on the event {ǫn(Ln−1) ≤ γ} for sufficiently large n, establishes (55). The definition of Dn² furnishes (56). Noting that E0[(Ln−1)²/Dn² 1{Ln ≥ 1+γ/ǫn}] → 0 by the assumptions of Thm 3.1 in the main text, (53) is established.

In order to remove the Λn(sn) factor from the bounds, note that Λn(sn) ≤ Λn(1) ≤ 1 and that Λn(s) ≥ (1−ǫn)^s ≥ 1/2 for sufficiently large n. This, along with (48) and (53), establishes the lemma. The lemma is established identically under H1,n by applying a change of measure to P0,n (which replaces sn with 1 − sn in the argument above).
Lemma A.2: Under the tilted measure, we have
$$\tilde{P}(\mathrm{LLR}(n) \ge 0) \to \frac{1}{2} \qquad (57)$$
as n → ∞.

Proof: For the proof, we will need the Lindeberg-Feller Central Limit Theorem:

Theorem A.3 (Theorem 2.4.5, [18]): For each n, let Xn,m, 1 ≤ m ≤ n, be independent zero-mean random variables. Suppose
$$\sum_{m=1}^{n} E\big[X_{n,m}^2\big] \to \sigma^2 > 0 \qquad (58)$$
and for all γ > 0,
$$\lim_{n\to\infty}\sum_{m=1}^{n} E\big[|X_{n,m}|^2\,\mathbf{1}_{\{|X_{n,m}|>\gamma\}}\big] = 0. \qquad (59)$$
Then, Sn = Xn,1 + ... + Xn,n converges in distribution to a normal distribution with mean zero and variance σ² as n → ∞.

Let {Xn,m}_{m=1}^n be drawn i.i.d. from H0,n. Define
$$\xi_{n,m} = \frac{\log\big(1+\epsilon_n(L_n(X_{n,m})-1)\big)}{\sqrt{n}\,\sigma_n}. \qquad (60)$$
Note
$$\sum_{m=1}^{n}\xi_{n,m} = \frac{\mathrm{LLR}(n)}{\sqrt{n}\,\sigma_n}. \qquad (61)$$
We show Σ_{m=1}^n ξn,m converges to a standard normal distribution. As stated in the main text, Ẽ[ξn,m] = 0 and Ẽ[ξn,m²] = 1/n. Thus, (58) is satisfied with σ² = 1. It remains to check (59). Since for fixed n the ξn,m are i.i.d., it suffices to verify that
$$n\,\tilde{E}\big[\xi_{n,1}^2\,\mathbf{1}_{\{|\xi_{n,1}|>\gamma\}}\big] = \tilde{E}\left[\frac{\big(\log(1+\epsilon_n(L_n(X_{n,1})-1))\big)^2}{\sigma_n^2}\,\mathbf{1}_{\left\{\frac{(\log(1+\epsilon_n(L_n(X_{n,1})-1)))^2}{n\sigma_n^2}>\gamma^2\right\}}\right] \qquad (62)$$
tends to zero as n → ∞. To simplify notation, let Ln = Ln(Xn,1). By Lemma A.1, it suffices to show that
$$\tilde{E}\left[\frac{\big(\log(1+\epsilon_n(L_n-1))\big)^2}{\epsilon_n^2 D_n^2}\,\mathbf{1}_{\left\{\frac{(\log(1+\epsilon_n(L_n-1)))^2}{\epsilon_n^2 D_n^2}>n\gamma^2\right\}}\right] \to 0,$$
which, changing to the E0 measure, is equivalent to showing for 0 < γ < γ0 that
$$E_0\left[\frac{\big(\log(1+\epsilon_n(L_n-1))\big)^2}{\epsilon_n^2 D_n^2}\,(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\left\{\frac{(\log(1+\epsilon_n(L_n-1)))^2}{\epsilon_n^2 D_n^2}>n\gamma^2\right\}}\right] \to 0 \qquad (63)$$
since Λn(sn) ∈ [1/2, 1] for sufficiently large n. We decompose the expectation in (63) into the sum of the same expectation restricted to the events {ǫn(Ln−1) > 1} and {ǫn(Ln−1) ≤ 1}, and first show that
$$E_0\left[\frac{\big(\log(1+\epsilon_n(L_n-1))\big)^2}{\epsilon_n^2 D_n^2}\,(1+\epsilon_n(L_n-1))^{s_n}\,\mathbf{1}_{\left\{\frac{(\log(1+\epsilon_n(L_n-1)))^2}{\epsilon_n^2 D_n^2}>n\gamma^2,\ \epsilon_n(L_n-1)>1\right\}}\right] \to 0. \qquad (64)$$
Applying (49) and (log(1+x))² ≤ x² for x > 0, the expectation in (64) is bounded above by
$$E_0\left[\frac{2\epsilon_n^2(L_n-1)^2}{\epsilon_n^2 D_n^2}\,\mathbf{1}_{\left\{\frac{(\epsilon_n(L_n-1))^2}{\epsilon_n^2 D_n^2}>n\gamma^2,\ \epsilon_n(L_n-1)>1\right\}}\right] \le 2E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\left\{\frac{(L_n-1)^2}{D_n^2}>n\gamma^2\right\}}\right] \qquad (65)$$
$$= 2E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n>1+\sqrt{n}D_n\gamma\}}\right] = 2E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n>1+\sqrt{n}D_n\epsilon_n\frac{\gamma}{\epsilon_n}\}}\right] \le 2E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\{L_n>1+\frac{\gamma}{\epsilon_n}\}}\right] \to 0,$$
where the last inequality holds for sufficiently large n since √n Dn ǫn → ∞, and the final limit follows from the first condition of the theorem.

We now show that the corresponding expectation restricted to {ǫn(Ln−1) ≤ 1} also tends to zero. Since Ln ≥ 0, −ǫn ≤ ǫn(Ln−1). Using (log(1+x))² ≤ 5x² for x ≥ −1/2 and (1+ǫn(Ln−1))^{sn} ≤ 2^{sn} ≤ 2 on this event, we see that for n sufficiently large such that ǫn < 1/2, it is bounded above by
$$E_0\left[\frac{10(\epsilon_n(L_n-1))^2}{\epsilon_n^2 D_n^2}\,\mathbf{1}_{\left\{\frac{5(\epsilon_n(L_n-1))^2}{\epsilon_n^2 D_n^2}>n\gamma^2,\ \epsilon_n(L_n-1)\le 1\right\}}\right] \le 10E_0\left[\frac{(L_n-1)^2}{D_n^2}\,\mathbf{1}_{\left\{\frac{(L_n-1)^2}{D_n^2}>\frac{n\gamma^2}{5}\right\}}\right].$$
The argument then proceeds identically from (65) to show that this tends to zero. Thus, (63) holds and the Lindeberg-Feller CLT shows that LLR(n)/(√n σn) converges to a standard normal distribution under the tilted measure. Therefore,
$$\tilde{P}(\mathrm{LLR}(n)\ge 0) = \tilde{P}\left(\frac{\mathrm{LLR}(n)}{\sqrt{n}\,\sigma_n}\ge 0\right) \to \frac{1}{2} \qquad (66)$$
as n → ∞, establishing the lemma. Verifying the Lindeberg-Feller CLT conditions for analyzing PMD is done by changing from the P1,n measure to the P0,n measure.

Lemma A.4:
$$\liminf_{n\to\infty}\frac{\log\Lambda_n(s_n)}{\epsilon_n^2 D_n^2} \ge -\frac{1}{8}. \qquad (67)$$
Proof: Consider the function (1+x)^s for 0 < s < 1 on the interval [−γ, γ], where 0 < γ
B. Proof of Theorem 3.2

Recall that we must show that if
$$E_0\big[L_n\,\mathbf{1}_{\{L_n > 1+M/\epsilon_n\}}\big] \to 1 \qquad (80)$$
for all M sufficiently large, then for the test specified by (7),
$$\limsup_{n\to\infty}\frac{\log P_{FA}(n)}{n\epsilon_n} \le -1. \qquad (81)$$
Let
$$\phi(x) = 1 + sx - (1+x)^s. \qquad (82)$$
By Taylor's theorem, we see for s ∈ (0, 1) and x ≥ −1 that φ(x) ≥ 0. Since E0[Ln − 1] = 0,
$$E_0\big[(1-\epsilon_n+\epsilon_n L_n(X_1))^s\big] = 1 - E_0\big[\phi(\epsilon_n(L_n(X_1)-1))\big]. \qquad (83)$$
Note this implies E0[φ(ǫn(Ln(X1)−1))] ∈ [0, 1], since E0[(1−ǫn+ǫnLn(X1))^s] is convex in s and equals 1 for s = 0, 1. As in the proof of Thm 3.1, by the Chernoff bound,
$$P_{FA}(n) \le \big(E_0\big[(1-\epsilon_n+\epsilon_n L_n(X_1))^s\big]\big)^n$$
for any s ∈ (0, 1). Thus, suppressing the dependence on X1 and assuming M > 1, we have
$$\frac{\log P_{FA}(n)}{n} \le \log E_0\big[(1-\epsilon_n+\epsilon_n L_n)^s\big] = \log\big(1-E_0[\phi(\epsilon_n(L_n-1))]\big) \le -E_0\big[\phi(\epsilon_n(L_n-1))\big] \qquad (84)$$
$$\le -E_0\big[\phi(\epsilon_n(L_n-1))\,\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\big] \qquad (85)$$
$$= -E_0\big[\big(1+s\epsilon_n(L_n-1)-(1+\epsilon_n(L_n-1))^s\big)\,\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\big]$$
$$\le -E_0\big[\big(s\epsilon_n(L_n-1)-(1+\epsilon_n(L_n-1))^s\big)\,\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\big]$$
$$\le -E_0\big[\big(s\epsilon_n(L_n-1)-2^s\epsilon_n^s(L_n-1)^s\big)\,\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\big] \qquad (86)$$
$$= -E_0\left[\epsilon_n(L_n-1)\left(s-\frac{2^s}{(\epsilon_n(L_n-1))^{1-s}}\right)\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\right] \le -E_0\left[\epsilon_n(L_n-1)\left(s-\frac{2}{M^{1-s}}\right)\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\right] \qquad (87)$$
$$= -\epsilon_n\left(s-\frac{2}{M^{1-s}}\right)E_0\left[L_n\left(1-\frac{1}{L_n}\right)\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\right] \le -\epsilon_n\left(s-\frac{2}{M^{1-s}}\right)\left(1-\frac{1}{1+\frac{M}{\epsilon_n}}\right)E_0\big[L_n\,\mathbf{1}_{\{\epsilon_n(L_n-1)\ge M\}}\big]$$
where (84) follows from log(1 − x) ≤ −x for x ≤ 1, (85) follows from φ(x) ≥ 0, (86) follows from (1+x)^s ≤ 2^s x^s for x ≥ 1 and taking M > 1, and (87) follows from s ∈ (0, 1). Dividing both sides of the inequality by ǫn and taking a lim sup_{n→∞} establishes
$$\limsup_{n\to\infty}\frac{\log P_{FA}}{n\epsilon_n} \le -s + \frac{2}{M^{1-s}}. \qquad (88)$$
Letting M → ∞ and optimizing over s ∈ (0, 1) establishes the bound. The proof for PMD is identical, replacing s with 1 − s.

C. Proof of Thm 3.9

Assume 3/2 > β/(2r) > 1/2. Recall from the proof of Thm 3.1 that
$$P_{FA}(n) \le \Big(E_0\Big[\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}\Big]\Big)^n \qquad (89)$$
and
$$E_0\Big[\sqrt{1-\epsilon_n+\epsilon_n L_n(X_1)}\Big] = 1 - \frac{1}{2}E_0\left[\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2}\right]. \qquad (90)$$
We write the observations as a multiple of µn, X = αµn. Then, taking µn = √(2r log n), we have
$$L_n(x) = e^{-\mu_n^2/2+\mu_n x} = n^{r(2\alpha-1)}. \qquad (91)$$
In view of (91),
$$\epsilon_n(L_n-1) = n^{r(2\alpha-1)-\beta} - n^{-\beta}. \qquad (92)$$
Thus, if r(2α − 1) − β > 0 we have ǫn(Ln − 1) → ∞, and if r(2α − 1) − β < 0 we have ǫn(Ln − 1) → 0 as n → ∞. Let κ = β/(2r) + 1/2.
Note that
$$\frac{x^2}{\big(1+\sqrt{1+x}\big)^2} \ge \frac{x^2}{4} - \frac{x^3}{8} \quad\text{for } x\ge -1.$$
Thus, on the event {X1 < κµn},
$$\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2} \ge \frac{\epsilon_n^2(L_n(X_1)-1)^2}{4} - \frac{\epsilon_n^3(L_n(X_1)-1)^3}{8} \ge \frac{\epsilon_n^2(L_n(X_1)-1)^2}{4} - \frac{(1-n^{-\beta})\,\epsilon_n^2(L_n(X_1)-1)^2}{8} \qquad (93)$$
$$= \frac{\epsilon_n^2(L_n(X_1)-1)^2}{8}\,(1+n^{-\beta}) \ge \frac{\epsilon_n^2(L_n(X_1)-1)^2}{8} \qquad (94)$$
where (93) follows from −1 ≤ −ǫn ≤ ǫn(Ln(X1)−1) ≤ 1 − n^{−β} on {X1 < κµn}, and (94) follows from the non-negativity of the terms involved. Then,
$$E_0\left[\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2}\right] \ge E_0\left[\frac{\epsilon_n^2(L_n(X_1)-1)^2}{\big(1+\sqrt{1+\epsilon_n(L_n(X_1)-1)}\big)^2}\,\mathbf{1}_{\{X_1<\kappa\mu_n\}}\right] \ge \frac{\epsilon_n^2}{8}\,E_0\big[(L_n(X_1)-1)^2\,\mathbf{1}_{\{X_1<\kappa\mu_n\}}\big]. \qquad (95)$$
Combining this with (89) and (90), and using that 3/2 > β/(2r) > 1/2 so that the dominant term of E0[(Ln(X1)−1)² 1{X1<κµn}] is e^{µn²}Φ((β/(2r) − 3/2)µn), gives the desired rate characterization. The proof for PMD is identical. Note that this bound is likely not tight (even if it has the right order), since we neglected the event {X1 ≥ κµn} to form the bound.