Recovering Block-structured Activations Using Compressive Measurements Sivaraman Balakrishnan∗, Mladen Kolar†, Alessandro Rinaldo‡, and Aarti Singh§
Abstract We consider the problems of detection and support recovery of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive (linear) measurements. We characterize the tradeoffs between the various problem dimensions, the signal strength and the number of measurements required to reliably detect and recover the support of the signal. In each case sufficient conditions are complemented with information theoretic lower bounds. This is closely related to the problem of (noisy) compressed sensing, where the analogous task is to detect or recover the support of a sparse vector using a small number of linear measurements. In compressed sensing, it has been shown that, at least in a minimax sense, for both detection and support recovery, adaptivity and contiguous structure only reduce signal strength requirements by logarithmic factors. On the contrary, we show that while for detection neither adaptivity nor structure reduce the signal strength requirement, for support recovery the signal strength requirement is strongly influenced by both structure and the ability to choose measurements adaptively.
Keywords: Adaptive procedures, compressive measurements, large average submatrix, signal detection, signal localization, structured Normal means
1 Introduction
In this paper, we consider the problems of detecting the presence of and estimating the support of a small contiguous block of positive signal that is embedded in a large data matrix, given access to noisy compressive measurements, that is, linear combinations of the matrix entries corrupted with noise. Statistical inference leveraging large structured matrices is one of the central themes in several applications such as matrix completion, community detection and multivariate linear regression. Data matrices with sparse, contiguous block-structured signal components form an idealized model for signals arising in several real-world applications. Consider, for instance, the expression pattern resulting from genes, grouped by pathways, expressed under the influence of drugs, grouped by similarity. The block structure in this case arises from the fact that groups of genes (belonging to common pathways, say) are co-expressed under
∗ Department of Statistics, UC Berkeley; e-mail: [email protected].
† Machine Learning Department, Carnegie Mellon University; e-mail: [email protected].
‡ Department of Statistics, Carnegie Mellon University; e-mail: [email protected].
§ Machine Learning Department, Carnegie Mellon University; e-mail: [email protected].
the influence of sets of similar drugs (Yoon et al., 2005). Similarly, the automated detection of particular artificial objects in natural satellite images is an important task in computer vision (Caron et al., 2002). The block-structured model can be seen as an idealization of a model for vehicles or buildings in natural images. The block structure in this case is a consequence of the geometry of the artificial objects. More generally, the task of detecting and localizing groups of anomalies in networks has a variety of applications in environmental monitoring using sensor networks, as detailed in the work of Arias-Castro et al. (2011a). The block-structured model we consider is an idealization of the so called thick cluster model considered in that work, and it can be expected that several of our results and techniques can be extended to handle this more flexible notion of structure. Indeed, in work that followed up on the initial posting of this work, Krishnamurthy et al. (2013) extend our results to the more general setting of graph-structured signals, although at the expense of optimal results in the setting we consider. In several applications, it is often difficult to measure, compute or store all the entries of the data matrix. For example, acquiring and storing high-resolution images is expensive, while measuring expression levels of all genes under all possible drugs is time-consuming and costly. However, if we have access to linear combinations of matrix entries (that is, compressive measurements) such as those obtained from compressive imaging cameras (Wakin et al., 2006) or from the combined expression of multiple genes under the influence of multiple drugs (see, for example, Kainkaryam et al., 2010, Kainkaryam and Woolf, 2009), then we might need to only make and store few such measurements, while still being able to infer the existence or estimate the position of the block of signal in the data matrix. In each of these cases the goal is to detect or recover the signal block, corresponding to a set of co-expressed genes and drugs or to the vehicles/buildings using only few compressive measurements of the data matrix. In this work we consider both the passive (non-adaptive) and active (adaptive) compressive measurements. The non-adaptive measurements are random or pre-specified linear combinations of matrix entries. The active measurements schemes are adaptive and use feedback from previously taken measurements to sequentially design linear combinations that are more informative. Related work. In the last few years, there has been a flurry of interest in the use of compressive measurements, primarily for inferring sparse vectors. Specifically, several papers, including Cand´es and Tao (2006), Donoho (2006), Cand´es and Tao (2007), Cand´es and Wakin (2008), have shown that it is possible to recover, in an `2 sense, a k-sparse vector in n dimensions using only O(k log n) incoherent compressive measurements, instead of measuring all of the n coordinates. Along with `2 recovery, researchers have also considered the problems of detection (Duarte et al., 2006, Haupt and Nowak, 2007, Arias-Castro, 2012, Arias-Castro et al., 2011b, Ingster et al., 2010) and support recovery (Wainwright, 2009a,b, Aeron et al., 2010, Reeves and Gastpar, 2012) of the non-zero entries of a sparse vector from non-adaptive compressive measurements. 
More recently, researchers have analyzed the possibility of taking adaptive compressive measurements, that is, where subsequent measurements are designed based on past observations (Cand´es and Davenport, 2013, Arias-Castro et al., 2013, Haupt et al., 2009). For example, Haupt et al. (2009) demonstrate upper bounds on the `2 error for recovering a sparse vector using an adaptive procedure called compressive distilled sensing, and Malloy
and Nowak (2012b) show that a compressive version of standard binary search achieves minimax performance for the localization of a one-sparse vector. Malloy and Nowak (2012a) extend the binary search procedure for identifying the support of k-sparse vectors. In Arias-Castro et al. (2013), the authors show that adaptive compressive schemes offer improvements over the passive schemes which, in terms of ℓ2 recovery and support recovery, are limited to a log factor. Arias-Castro (2012) further extends these results to the problem of detection from adaptive compressive measurements. In order to compare with the results we present in this work, Table 1 summarizes some results for the detection and exact support recovery of a sparse vector using passive and active compressive measurements. In order to avoid clutter, in each case we only give sufficient conditions. The corresponding references detail all the assumptions on the measurement matrices (for instance, the result of Wainwright (2009b) applies to Gaussian ensembles). In the table, we follow the standardization used throughout this paper (and various others on this topic), that is, the length of the vector is n and the number of non-zero coordinates in the vector is k. The number of measurements is m and each measurement is assumed to have unit ℓ2 norm (in expectation, if the measurement is random).

Table 1: Summary of known results for the sparse vector case. µmin represents the minimum (absolute) signal amplitude over all non-zero coordinates, and σ represents the standard deviation of the noise. C is a universal constant independent of the problem parameters.

    Passive detection: µmin/σ ≥ C √(n/(m k²)) for positive signals; µmin/σ ≥ C √(n/(m k)) for arbitrary signals (Arias-Castro, 2012).
    Passive localization: m ≥ C k log(n/k) and µmin/σ ≥ C √(n log n / m) (Wainwright, 2009b).
    Active detection: no improvement over the passive requirements (Arias-Castro, 2012).
    Active localization: m ≥ C k log(n/k) and µmin/σ ≥ C √(n log k / m) (Malloy and Nowak, 2012a; Arias-Castro et al., 2013).

Almost all of the prior work described so far has been focused on inference for sparse unstructured data vectors from (passive or adaptive) compressed measurements. In this paper, we focus on the relatively unexplored problems of detection and support recovery for structured data matrices from compressive measurements. We are concerned with signals that are both sparse and highly structured, taking the form of a contiguous sub-matrix within a larger noisy matrix. If the signal components are unstructured, the treatment of data matrices is exactly equivalent to the treatment of data vectors. However, in the structured case, it is clear that some notions of structure are more natural when considered in the context of data matrices rather than data vectors. Specifically, the contiguous block signals considered in this work can (somewhat unnaturally) be viewed as arising from a particular collection of structured signals on vectors. The works of Baraniuk et al. (2010) and Huang et al. (2011) have analyzed different forms of structured sparsity in the vector setting, for example, when the non-zero locations in a data vector form non-overlapping or partially-overlapping groups. These works do not, however, provide sharp results when specialized to the setting we consider, and do not address the adaptive setting. Castro (2012) and Tánczos and Castro (2013) study support recovery of unstructured and structured signals from adaptive measurements when the signal is observed
directly. Soni and Haupt (2011) consider the case of adaptive sensing of tree-structured signals. The class of tree-structured signals is quite different from the class of signals we consider, and the gains from adaptive sensing for tree-structured signals are relatively limited (to a log factor, as in the unstructured setting). Also related is the recent work on the recovery of low-rank matrices (see, for example, Negahban and Wainwright, 2011, Koltchinskii et al., 2011). Indeed, our analysis for support recovery in the passive setting (see Theorem 3) is inspired by the restricted strong convexity analysis of Negahban and Wainwright (2011). The signals under consideration in this paper are a special case of low rank and sparse matrices. Richard et al. (2012) study the recovery of low rank and sparse matrices from direct observations, while Golbabaee and Vandergheynst (2012) study recovery from compressive measurements. Each of these results focuses on recovery in Frobenius norm from passive measurements, while our focus is on detection and signal support recovery from possibly active measurements. Furthermore, the additional structure of block-structured signals is crucially leveraged in our analysis and by our estimators, and specializing these earlier results does not give sharp results even in the passive setting. Results applicable to the sparse and low-rank setting do, however, extend to the setup where there is a non-contiguous sub-matrix or block, also considered in Sun and Nobel (2013), Butucea and Ingster (2013), Butucea et al. (2013), Bhamidi et al. (2012) and Kolar et al. (2011) in the case of direct measurements. This setting is beyond the scope of our paper.

Summary of our contributions. We establish lower bounds on the minimum number of compressive measurements and the minimum (non-zero) signal amplitude needed to detect the presence of a block with positive signal, as well as to recover the support of the signal block, using both non-adaptive and adaptive measurements. We also establish matching upper bounds for procedures that guarantee consistent detection and support recovery of weak block-structured signals using few non-adaptive and adaptive compressive measurements. Our results indicate that adaptivity and structure play a key role and that significant improvements are possible over non-adaptive and unstructured settings. This is unlike the vector case where, at least in the settings analogous to the ones we consider here, contiguous structure and adaptivity have been shown to provide minor, if any, improvement. In our setting we take compressive measurements of a data matrix of size (n1 × n2), the block containing the signal is of size (k1 × k2), with minimum signal amplitude µmin in each coordinate of the (k1 × k2) block. The measurements are corrupted with Gaussian noise with standard deviation σ, and we have a budget of m compressive measurements, with each measurement matrix constrained to have unit Frobenius norm. Table 2 describes our main findings. In this table we describe both necessary and sufficient conditions. To reduce clutter in the expressions we use n1 = n2 := n and k1 = k2 := k. For detection, akin to the vector setting, structure and adaptivity play no role. The structured data matrix setting requires the amplitude of a signal to scale as n/(√m k²) for both non-adaptive and adaptive cases, which is the same as the amplitude needed to detect a k²-sparse non-negative vector of length n², as demonstrated in Arias-Castro (2012).
As in the vector case, the structure of the signal components as well as the power of adaptivity offer no advantage in the detection problem. For support recovery of the signal block, first notice that the necessary and sufficient conditions are tight in the scalings of µmin/σ up to log factors (which we ignore for this discussion). We use C for possibly different universal constants.
Table 2: Summary of main results of our paper. We use C for possibly different universal constants.

    Detection (passive and active): µmin/σ ≥ C n/(√m k²), necessary (Theorem 1); µmin/σ ≥ C n/(√m k²), sufficient (Arias-Castro, 2012).
    Localization, passive: µmin/σ ≥ C (n/√(mk)) max{1, √(log n / k)}, necessary (Theorem 2); m ≥ C log n and µmin/σ ≥ C (n/√(mk)) √(log n), sufficient (Theorem 3).
    Localization, active: µmin/σ ≥ (C/√m) max{n/k², 1/√k}, necessary (Theorem 4); m ≥ C log n and µmin/σ ≥ (C/√m) max{n/k², √(log(k) log log(k)/k)}, sufficient (Theorem 5).
The structured data matrix setting requires µmin/σ to scale as µmin/σ ≥ C n/√(mk) when using non-adaptive compressive measurements. In contrast, the unstructured setting requires a higher amplitude of µmin/σ ≥ C n/√m (the necessary conditions appear in Wainwright (2009b), Aeron et al. (2010), and Reeves and Gastpar (2012)). Structure without adaptivity already yields a factor of √k reduction in the smallest µmin that still allows for reliable support recovery. Adaptivity in the compressive measurement design yields further improvements: with adaptive measurements, identifying the signal block requires µmin/σ ≥ (C/√m) max{n/k², 1/√k}. In contrast, for the sparse vector case, Arias-Castro et al. (2013) showed that adaptive compressive measurements cannot recover the locations of the non-zero components if the amplitude of the signal is smaller than Cn/√m. It is clear that the potential gains from an adaptive measurement scheme are considerable, either a k² factor or an n√k factor. Exploiting the structure of the signal components and designing adaptive linear measurements can both yield significant gains if the signal corresponds to a contiguous block in a data matrix.

The rest of this paper is organized as follows. We describe the problem setup and notation in Section 2. We study the detection problem in Section 3. Section 4 is devoted to non-adaptive support recovery, while Section 5 is focused on adaptive support recovery. Finally, in Section 6 we present and discuss some simulations that support our findings. The proofs are given in the Appendix.
2 Preliminaries
We begin by defining some notation that will be used frequently in the paper.

Notation: We denote by [n] the set {1, . . . , n}, and 1I{·} denotes the indicator function. For a vector a ∈ R^n, we let supp(a) = {j : aj ≠ 0} be the support set (with an analogous definition for matrices A ∈ R^{n1×n2}), and ‖a‖q, q ∈ [1, ∞), the ℓq-norm defined as ‖a‖q = (Σ_{i∈[n]} |ai|^q)^{1/q}, with the usual extensions for q ∈ {0, ∞}, that is, ‖a‖0 = |supp(a)| and ‖a‖∞ = max_{i∈[n]} |ai|. For a matrix A ∈ R^{n1×n2}, we use the notation vec(A) to denote the vector in R^{n1 n2} formed by stacking the columns of A. We denote the Frobenius norm of A by |||A|||²fro := Σ_{i∈[n1], j∈[n2]} a²ij. For two matrices A and B of identical dimensions, ⟨⟨A, B⟩⟩ denotes the trace inner product, ⟨⟨A, B⟩⟩ = trace(Aᵀ B). For a matrix A ∈ R^{n1×n2}, define n = min(n1, n2), and denote the ordered singular values of A by σ1(A) ≥ . . . ≥ σn(A). Denote the operator norm of A by |||A|||op := σ1(A). For two sequences of numbers {an} and {bn}, we use an = O(bn) to denote that an ≤ C bn for some finite positive constant C and all n large enough. If an = O(bn) and bn = O(an), we write an ≍ bn. The notation an = o(bn) is used to denote that an bn⁻¹ → 0 as n → ∞.

Let A ∈ R^{n1×n2} be a signal matrix with unknown entries. We are interested in a structured setting where the support set of the matrix A, that is, the set of non-zero elements, is a contiguous block of size (k1 × k2), and the elements in the support set are equal to or larger than the level µmin > 0.¹ Although we focus on the case when k1 and k2 are known, we also discuss cases in which one can adapt to the unknown size of the support. In order to simplify the presentation of our results we will assume that kj ≤ nj/2 for j = 1, 2. The following collection contains the locations of all contiguous blocks of size k1 × k2:

    B = { Ir × Ic : Ir and Ic are contiguous subsets of [n1] and [n2], |Ir| = k1, |Ic| = k2 }.    (2.1)

With this, the signal matrix A = (aij) can be represented as aij = µij 1I{(i, j) ∈ B∗}, with µij ≥ µmin > 0, for some (unknown) B∗ ∈ B. We consider the following observation model, under which we collect m noisy linear measurements {y1, . . . , ym} of A:

    yi = ⟨⟨A, Xi⟩⟩ + εi,    i = 1, . . . , m,    (2.2)
where ε1, . . . , εm are i.i.d. N(0, σ²), with σ > 0 known.² We use the term amplitude to refer to µmin. The sensing matrices (Xi)i∈[m] ∈ R^{n1×n2} are normalized to satisfy one of the following two conditions: 1) |||Xi|||²fro = 1, or 2) E|||Xi|||²fro = 1, for i ∈ [m]. This normalization ensures that every measurement is made with the same amount of energy (possibly in expectation). The first condition is used when the measurements are non-random, see, for example, the detection problem described in Section 3, while the second condition is used for random measurements. Under the observation model in Eq. (2.2), we study two tasks: (1) detecting whether a contiguous block of positive signal exists in A, and (2) identifying the support of the signal, B∗. Both of these are formally defined below. We develop efficient algorithms for these two tasks that provably require the smallest number of measurements (up to logarithmic factors).
¹ We only deal with positive signals in the main text and briefly discuss how the methods should be changed to address signals that are both positive and negative.
² The measurement scheme could equivalently be written in vector form as yi = vec(A)ᵀ vec(Xi) + εi. To emphasize that the support of the signal is a contiguous block, we prefer the representation in Eq. (2.2).
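To make the measurement model concrete, here is a small simulation sketch (our own illustration, not code from the paper; the helper names block_signal and passive_measurements are hypothetical) that builds a block-constant signal supported on a contiguous (k1 × k2) block and collects passive Gaussian measurements normalized so that E|||Xi|||²fro = 1.

```python
# Minimal sketch (assumption-laden illustration) of the observation model in Eq. (2.2).
import numpy as np

def block_signal(n1, n2, k1, k2, mu_min, rng):
    """Zero matrix with a contiguous (k1 x k2) block of amplitude mu_min at a random position."""
    A = np.zeros((n1, n2))
    r = rng.integers(0, n1 - k1 + 1)
    c = rng.integers(0, n2 - k2 + 1)
    A[r:r + k1, c:c + k2] = mu_min
    return A, (r, c)

def passive_measurements(A, m, sigma, rng):
    """m sensing matrices with i.i.d. N(0, 1/(n1*n2)) entries (so E|||X_i|||_fro^2 = 1),
    observed through the trace inner product and corrupted by N(0, sigma^2) noise."""
    n1, n2 = A.shape
    X = rng.normal(scale=np.sqrt(1.0 / (n1 * n2)), size=(m, n1, n2))
    y = np.einsum("ijk,jk->i", X, A) + sigma * rng.normal(size=m)
    return y, X

rng = np.random.default_rng(0)
A, top_left = block_signal(n1=30, n2=30, k1=4, k2=4, mu_min=1.0, rng=rng)
y, X = passive_measurements(A, m=100, sigma=1.0, rng=rng)
```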
The algorithms are designed for one of two measurement schemes: (1) the measurement scheme can be implemented in an adaptive or sequential fashion by letting each Xi be a (possibly randomized) function of (yj, Xj)j∈[i−1], or (2) the measurement matrices are chosen all at once, ignoring the outcomes of previous measurements. We call the first scheme active, while the second scheme is referred to as passive.

Detection. The detection problem consists in deciding whether a (positive) contiguous block of signal exists in A. As we will show later, we can detect the presence of a contiguous block with a much smaller number of measurements than is required for identifying its support. Formally, detection is a hypothesis testing problem, with a composite alternative, of the form

    H0 : A = 0_{n1×n2}
    H1 : A ∈ A(µmin),    (2.3)

where the notation 0_{n1×n2} is used for the all-zero matrix in R^{n1×n2} and

    A(µmin) = {A : A has entries aij = µij 1I{(i, j) ∈ B∗}, µij ≥ µmin, B∗ ∈ B}.

A test T is a function that takes (yi, Xi)i∈[m] as input and outputs 1 if the null hypothesis is rejected, and 0 otherwise. For any test T, we define its risk as

    Rdet(T) := P0[ T((yi, Xi)i∈[m]) = 1 ] + sup_{A∈A(µmin)} PA[ T((yi, Xi)i∈[m]) = 0 ],

where P0 and PA denote the joint probability distributions of (yi, Xi)i∈[m] under the null hypothesis and alternative hypothesis, respectively. The risk Rdet(T) measures the sum of the type I error and the maximal type II error over the set of alternatives. The overall difficulty of the detection problem is quantified by the minimax detection risk

    Rdet := inf_T Rdet(T),

where the infimum is taken over all tests. When the amplitude of the signal is sufficiently small, as measured by µmin, the minimax risk can remain bounded away from zero, which implies that no test can reliably (that is, with a vanishing probability of error) distinguish H0 from H1. In Section 3 we characterize necessary and sufficient conditions for distinguishing H0 and H1 in terms of µmin.

Identifying the support of the signal A. The problem of identifying the support of the signal A (or support recovery) is that of determining the exact location of the non-zero elements of A. Let Ψ be an estimator of B∗, that is, a function that takes (yi, Xi)i∈[m] as input and outputs an element of B, the estimated location of the support. We define the risk of any such estimator as

    Rsup(Ψ) := sup_{A∈A(µmin)} PA[ Ψ((yi, Xi)i∈[m]) ≠ supp(A) ],

while the minimax support recovery risk is

    Rsup := inf_Ψ Rsup(Ψ),

where the infimum is taken over all such estimators Ψ. As in the detection task, the minimax risk specifies the minimal risk of any support identification procedure. We focus in this paper on exact support recovery. An interesting extension of the work in this paper, which we do not address, would be to understand the minimax risk of support recovery in the Hamming metric (for instance).

Below we provide upper and lower bounds on the minimax risk for the problems of detection and support recovery as functions of the tuple (n1, n2, k1, k2, m, µmin, σ), for both the active and passive sampling schemes. Our results identify (up to logarithmic factors) necessary and sufficient conditions on the minimal amplitude µmin of a signal that can be detected, or whose support can be recovered, given a budget of m possibly adaptive measurements.
3 Detection of contiguous blocks
In this section, we give necessary and sufficient conditions on the problem parameters (n1, n2, k1, k2, m, µmin, σ) for distinguishing between H0 and H1 defined in Eq. (2.3). We start with the model (2.2) and assume that each measurement matrix Xi can be chosen adaptively. We have the following lower bound on the magnitude of the signal.

Theorem 1. Fix any 0 < α < 1. Based on m (possibly adaptive) measurements, if

    µmin/σ ≤ 8(1 − α) √( n1 n2 / (m k1² k2²) ),

then the minimax detection risk Rdet ≥ α.

The lower bound given in Theorem 1 is established by analyzing the risk of the (optimal) likelihood ratio test under a uniform prior over the alternatives. Careful modifications of standard arguments are necessary to account for adaptivity. We closely follow the approach of Arias-Castro (2012), who established the analogue of Theorem 1 in the vector setting. We make minor modifications (detailed in Appendix A.1) to deal with the particular form of the block-structured signals we consider.

Next, we describe a test for distinguishing H0 and H1, proposed in Arias-Castro (2012). The test is given as

    T((yi)i∈[m]) = 1I{ Σi yi > σ √(2 m log(α⁻¹)) },    (3.1)

with the measurement matrices chosen passively as Xi = (n1 n2)^{−1/2} 1_{n1} 1ᵀ_{n2} (i = 1, . . . , m), where 1_n is used to denote the vector of ones in R^n. From Proposition 1 in Arias-Castro (2012), if

    µmin/σ ≥ C √( (n1 n2 / (m k1² k2²)) log(α⁻¹) )    (3.2)

for some universal constant C > 0, then Rdet(T) ≤ α. The lower bound given in Theorem 1 matches the upper bound given in Eq. (3.2) up to constants. Taken together they establish that the minimax rate for detection under the model in Eq. (2.2) is µmin ≍ σ (k1 k2)⁻¹ √(n1 n2 / m). It is worth pointing out that the structure of the activation pattern does not play any role in the minimax detection problem. Indeed, the rate established matches the known bounds for detection in the unstructured vector case (Arias-Castro, 2012). We will contrast this with the problem of support recovery below. Furthermore, the procedure that achieves the adaptive lower bound (up to constants) is non-adaptive, indicating that adaptivity cannot help much in the detection problem.
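As a concrete illustration of this test (a sketch under the stated design, not the authors' implementation; the name detect_block is ours), the snippet below applies the fixed all-ones design and the threshold σ√(2m log(α⁻¹)) from Eq. (3.1).

```python
# Minimal sketch of the passive detection test in Eq. (3.1).
import numpy as np

def detect_block(A, m, sigma, alpha, rng):
    """Return True when H0: A = 0 is rejected."""
    n1, n2 = A.shape
    X = np.full((n1, n2), 1.0 / np.sqrt(n1 * n2))   # X_i = (n1*n2)^(-1/2) * 1 1', |||X_i|||_fro = 1
    y = np.sum(A * X) + sigma * rng.normal(size=m)  # m repeated noisy measurements of <<A, X>>
    return float(np.sum(y)) > sigma * np.sqrt(2.0 * m * np.log(1.0 / alpha))

rng = np.random.default_rng(1)
A = np.zeros((30, 30)); A[5:9, 10:14] = 0.5
print(detect_block(A, m=200, sigma=1.0, alpha=0.05, rng=rng))                  # signal present
print(detect_block(np.zeros((30, 30)), m=200, sigma=1.0, alpha=0.05, rng=rng)) # null
```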
4 Support recovery from passive measurements
In this section, we address the problem of estimating the support set B∗ of the signal A from noisy linear measurements as in Eq. (2.2). We give a lower bound for the case when the measurement matrices (Xi)i∈[m] are independent with jointly Gaussian entries of variance 1/(n1 n2). The variance of the entries is set so that the energy of each measurement is E|||Xi|||²fro = 1.

For the upper bound we consider the case when the signal is block constant, that is, µij = µmin for (i, j) ∈ B∗. We consider the more general case of measurement matrices which have possibly correlated sub-Gaussian entries. In particular, denote by X the (m × n1 n2) matrix where each row is vec(Xi). We give upper bounds for the case when X = ΨΣ^{1/2}, where Ψ ∈ R^{m × n1 n2} has independent sub-Gaussian entries with variance 1/(n1 n2), and Σ ∈ R^{n1 n2 × n1 n2} satisfies

    0 < c ≤ λmin(Σ) ≤ λmax(Σ) ≤ C

for universal constants c, C. We state our results in terms of λmax(Σ) and λmin(Σ) and track the dependence on them in our proofs. The energy of each measurement is E|||Xi|||²fro = tr(Σ)/(n1 n2). In order to compare results across different measurement schemes we prefer to treat λmin(Σ) and λmax(Σ) as universal constants, so that the energy of each measurement is also at most a universal constant. In this case we say the measurement matrices are drawn from the Σ sub-Gaussian ensemble. It is easy to see that this generalizes the case of an independent Gaussian ensemble (by taking λmin(Σ) = λmax(Σ) = 1).
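As a hedged illustration of how such an ensemble can be drawn (Rademacher entries are one convenient sub-Gaussian choice; nothing here is prescribed by the paper beyond X = ΨΣ^{1/2}), consider the following sketch.

```python
# Minimal sketch of sampling from the Sigma sub-Gaussian ensemble described above.
import numpy as np

def sigma_ensemble(m, n1, n2, Sigma_sqrt, rng):
    d = n1 * n2
    Psi = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(d)  # independent entries, variance 1/(n1*n2)
    X = Psi @ Sigma_sqrt                                     # row i of X is vec(X_i)
    return X.reshape(m, n1, n2)

rng = np.random.default_rng(2)
Xs = sigma_ensemble(m=5, n1=10, n2=10, Sigma_sqrt=np.eye(100), rng=rng)
print(np.sum(Xs[0] ** 2))  # Frobenius energy of one measurement; equal to 1 for Sigma = I
```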
4.1 Lower bound
The following theorem gives a lower bound on the amplitude of the signal, below which no procedure can reliably identify the support set B∗.

Theorem 2. Consider sensing with measurement matrices Xi for which each entry is independent Gaussian with variance 1/(n1 n2). There exist positive constants C, α > 0, independent of the problem parameters (k1, k2, n1, n2), such that if

    µmin/σ ≤ C √( (n1 n2 / m) max{ 1/min(k1, k2), log(n1 n2)/(k1 k2) } ),

then the minimax support recovery risk Rsup ≥ α > 0.

The proof is based on a standard technique described in Chapter 2.6 of Tsybakov (2009). We start by identifying a subset of matrices that are hard to distinguish. Once a suitable finite set is identified, tools for establishing lower bounds on the error in multiple-hypothesis testing can be directly applied. These tools only require computing the Kullback-Leibler (KL) divergence between the induced distributions, which in our case are two multivariate normal distributions. Notice that the bound in Theorem 2 has two terms. The first term characterizes the difficulty of distinguishing two signals whose supports overlap considerably. The second term characterizes the difficulty of identifying the correct support when the signal is one of a large family of signals whose supports do not overlap. These constructions and calculations are described in detail in Appendix A.2.
4.2 Upper bound
We will investigate a procedure that searches over all contiguous blocks of size (k1 × k2), as defined in Eq. (2.1). For any block B ∈ B, let AB := 1I{B} ∈ R^{n1×n2}, where 1I{B} denotes the (n1 × n2) matrix with entries equal to 1 for indices in B and 0 otherwise. Our procedure is to solve the following program:

    B̂ = argmin_{B∈B} min_{µ∈R+} (1/(2m)) Σ_{i=1}^{m} ( yi − µ ⟨⟨AB, Xi⟩⟩ )².    (4.1)
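A brute-force sketch of this program follows (our illustration, not the authors' code): it scans all contiguous blocks, profiles out the non-negative amplitude µ by least squares, and keeps the block with the smallest objective. Combined with the measurement sketch from Section 2, localize_block(y, X, k1, k2) returns the top-left corner of B̂.

```python
# Minimal sketch of the exhaustive-search estimator in Eq. (4.1).
import numpy as np

def localize_block(y, X, k1, k2):
    m, n1, n2 = X.shape
    best, best_obj = None, np.inf
    for r in range(n1 - k1 + 1):
        for c in range(n2 - k2 + 1):
            # z_i = <<A_B, X_i>> for the indicator matrix A_B of the candidate block B
            z = np.sum(X[:, r:r + k1, c:c + k2], axis=(1, 2))
            denom = float(np.dot(z, z))
            mu = max(np.dot(y, z) / denom, 0.0) if denom > 0 else 0.0  # inner minimum over mu >= 0
            obj = np.sum((y - mu * z) ** 2) / (2 * m)
            if obj < best_obj:
                best, best_obj = (r, c), obj
    return best  # top-left corner of the estimated block
```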
This program can be solved in polynomial time by a brute force search for the outer minimization. We show the following result for the output of this procedure.

Theorem 3. Consider sensing with measurement matrices Xi that are drawn from the Σ sub-Gaussian ensemble. For block-constant signals, if

    m ≥ C (λmax(Σ)/λmin(Σ))² log(n1 n2/α)

for a universal constant C, and if

    µmin/σ ≥ 12 √( λmax(Σ) n1 n2 log(12 n1 n2/α) / (λmin(Σ)² min(k1, k2) m) ),

then the procedure described above outputs a block B̂ such that Rsup(B̂) ≤ α.

The proof follows the general recipe outlined in Negahban and Wainwright (2011) of establishing the "strong convexity" of the measurement ensemble for matrices that satisfy a certain cone condition. However, a direct application of those results leads to much weaker bounds in our setting, and several careful modifications are needed in order to obtain sharp results.

We present two other analyses in the Appendix. In Appendix A.4, we study a procedure that outputs the block B for which

    Ψ(B) := Σ_{(a,b)∈B} Σ_{i∈[m]} yi Xi,ab    (4.2)

is maximized. We show that this procedure achieves the same rate for uncorrelated designs, but without the assumption of a block-constant signal. We also show, in Appendix B, that a modification of the procedure in Eq. (4.1) is able to estimate the support of a signal that has both positive and negative components, without the assumption of a block-constant signal, albeit at a slower rate.

Comparing to the lower bound in Theorem 2, we observe that the procedure outlined in this section achieves the lower bound up to a log(n1 n2) factor. It is also worth noting that the amplitude of the signal needed for correct support recovery is larger than the bound we saw earlier for passive detection. The block structure of the activation allows us, even in the passive setting, to localize much weaker signals. A straightforward adaptation of results on the LASSO (Wainwright, 2009a) suggests that if the non-zero entries are spread out (say at random) then we would require µmin/σ ≥ C √(n1 n2 log(n1 n2)/m) to correctly identify the block B∗. Exploiting the structure in the signal collection allows us to gain a 1/√(min(k1, k2)) factor. As we will see in Section 5, even more substantial gains are possible when we choose measurements adaptively.

We also note that one can use a similar procedure to adapt to the unknown size of the activation block. In particular, one can perform the exhaustive search procedure over all possible sizes of activation blocks. Small modifications to the proof of Theorem 3 can be used to show that this procedure adapts to the unknown block size while still achieving the same risk. We omit the details but note that similar modifications have been detailed in Butucea and Ingster (2013) for a related problem.
5 Support recovery from active measurements
In this section, we study identification of B ∗ using adaptive procedures, that is, the measurement matrix Xi may be a function of (yj , Xj )j∈[i−1] .
5.1 Lower bound
The following theorem gives a lower bound on the signal amplitude needed for any active procedure to estimate the support set B∗.

Theorem 4. Fix any 0 < α < 1. There exists a constant C > 0, independent of the problem parameters (k1, k2, n1, n2), such that, given m adaptively chosen measurements, if

    µmin/σ < C(1 − α) max{ √(n1 n2/(m k1² k2²)), √(1/(m min(k1, k2))) },

then Rsup ≥ α.

The proof is similar to the proof of Theorem 2 in that we construct specific pairs of hypotheses that are hard to distinguish. The two terms in the lower bound reflect the hardness of estimating the support of the signal in two important cases. The first term reflects the difficulty of approximately estimating the support B∗. This term grows at the same rate as the detection lower bound, and its proof is similar. Given a coarse estimate of the support, we still need to identify the exact support of the signal. The hardness of this problem gives rise to the second term in the lower bound. The term is independent of n1 and n2 but has a considerably worse dependence on k1 and k2.
5.2 Upper bound
The upper bound is established by analyzing the procedures described in Algorithms 1 and 2 for approximate and exact recovery of the support of the signal. Algorithm 1 is used to approximately estimate the support B∗, that is, it locates an 8k1 × 8k2 block that contains the support B∗ with high probability. The algorithm essentially performs compressive binary search (Davenport and Arias-Castro, 2012) on a collection of non-overlapping blocks that partition the matrix.
Algorithm 1 Approximate support recovery
input: Measurement budget m ≥ log p (a), a collection D of p (b) blocks of size (2k1 × 2k2).
Initial support: J0^(1) := {1, . . . , p}, s0 := log2 p.
For each s in 1, . . . , log2 p:
  1. Allocate: ms := ⌊(m − s0) s 2^(−s−1)⌋ + 1.
  2. Split: Partition J0^(s) into two collections of blocks of equal size; J1^(s) contains the first p/2^s blocks and J2^(s) the remainder.
  3. Sensing matrix: Xs = √( 2^(−(s0−s+1)) / (4 k1 k2) ) on J1^(s), Xs = −√( 2^(−(s0−s+1)) / (4 k1 k2) ) on J2^(s), and 0 otherwise.
  4. Measure: yi^(s) = ⟨⟨A, Xs⟩⟩ + zi^(s) for i ∈ [1, . . . , ms].
  5. Update support: J0^(s+1) = J1^(s) if Σ_{i=1}^{ms} yi^(s) > 0, and J0^(s+1) = J2^(s) otherwise.
output: The single block in J0^(s0+1).
(a) Recall the definition p = n1 n2 / (4 k1 k2).
(b) We assume p is dyadic to simplify our presentation of the algorithm.
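The halving loop of Algorithm 1 can be sketched as follows (a simulation-style illustration rather than a verbatim transcription; the blocks argument lists the p disjoint (2k1 × 2k2) blocks of one of the collections described below, and the constants mirror the pseudocode without tuning).

```python
# Minimal sketch of the binary-search loop in Algorithm 1, simulated on a signal matrix A.
import numpy as np

def approx_support(A, blocks, k1, k2, m, sigma, rng):
    """blocks: list of (row_indices, col_indices) for the p disjoint (2k1 x 2k2) blocks."""
    p = len(blocks)
    s0 = int(np.log2(p))                              # assumes p is dyadic
    J = list(range(p))                                # indices of the surviving blocks
    for s in range(1, s0 + 1):
        ms = int((m - s0) * s * 2.0 ** (-s - 1)) + 1
        J1, J2 = J[:len(J) // 2], J[len(J) // 2:]
        amp = np.sqrt(2.0 ** (-(s0 - s + 1)) / (4 * k1 * k2))
        Xs = np.zeros_like(A)
        for j in J1:
            Xs[np.ix_(*blocks[j])] = amp              # +amp on the first half
        for j in J2:
            Xs[np.ix_(*blocks[j])] = -amp             # -amp on the second half; |||Xs|||_fro = 1
        y = np.sum(A * Xs) + sigma * rng.normal(size=ms)
        J = J1 if np.sum(y) > 0 else J2               # keep the half with the larger response
    return blocks[J[0]]                               # the single surviving block
```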
Algorithm 2 Exact recovery (of columns)
input: Measurement budget m, a sub-matrix B ∈ R^(4k1×4k2), success probability α.
1. Measure: yi^c = (4k1)^(−1/2) Σ_{l=1}^{4k1} Blc + zi^c for i = 1, . . . , m/5 and c ∈ {1, k2 + 1, 2k2 + 1, 3k2 + 1}.
2. Let l = argmax_c Σ_{i=1}^{m/5} yi^c, r = l + k2, mb = ⌊m/(6 log2 k2)⌋.
3. While r − l ≥ 1:
  (a) Let c = ⌊(r + l)/2⌋.
  (b) Measure yi^c = (4k1)^(−1/2) Σ_{l=1}^{4k1} Blc + zi^c for i = 1, . . . , mb.
  (c) If Σ_{i=1}^{mb} yi^c ≥ O( √( (mb σ² / log k2) log(log k2 / α) ) ) (a), then l = c; otherwise r = c.
output: Set of columns {l − k2 + 1, . . . , l}.
(a) The exact constants appear in the proof of Theorem 5.
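A sketch of the column search in Algorithm 2 is given below (an illustration only: the threshold constant c0 stands in for the constants from the proof of Theorem 5, the loop condition is slightly simplified so that it always terminates, k2 ≥ 2 is assumed, and columns are 0-indexed).

```python
# Minimal sketch of the column binary search in Algorithm 2, simulated on the
# (4k1 x 4k2) sub-matrix B returned by Algorithm 1.
import numpy as np

def column_measurements(B, c, reps, sigma, rng):
    """Repeated noisy measurements (4k1)^(-1/2) * sum_l B[l, c]."""
    return np.sum(B[:, c]) / np.sqrt(B.shape[0]) + sigma * rng.normal(size=reps)

def active_columns(B, k2, m, sigma, alpha, rng, c0=1.0):
    probes = [0, k2, 2 * k2, 3 * k2]                   # exactly one probe hits the active block
    scores = [np.sum(column_measurements(B, c, m // 5, sigma, rng)) for c in probes]
    l = probes[int(np.argmax(scores))]                 # an active column, with high probability
    r = l + k2                                         # past the right edge of the active block
    mb = max(m // (6 * int(np.log2(k2))), 1)
    while r - l > 1:                                   # binary search for the right boundary
        c = (r + l) // 2
        y = column_measurements(B, c, mb, sigma, rng)
        thresh = c0 * np.sqrt(mb * sigma ** 2 * np.log(np.log2(k2) / alpha) / np.log2(k2))
        if np.sum(y) >= thresh:
            l = c                                      # column c looks active
        else:
            r = c
    return list(range(l - k2 + 1, l + 1))              # estimated set of active columns
```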
Specifically, we create four collections D1, D2, D3 and D4, defined as³

    D1 := { B1,1 := [1, . . . , 2k1] × [1, . . . , 2k2], B1,2 := [2k1 + 1, . . . , 4k1] × [1, . . . , 2k2], . . . , B1,n1n2/4k1k2 := [n1 − 2k1, . . . , n1] × [n2 − 2k2, . . . , n2] },
    D2 := { B2,1 := [k1, . . . , 3k1] × [k2, . . . , 3k2], B2,2 := [3k1 + 1, . . . , 5k1] × [k2, . . . , 3k2], . . . , B2,n1n2/4k1k2 := [n1 − k1, . . . , n1, 1, . . . , k1] × [n2 − k2, . . . , n2, 1, . . . , k2] },
    D3 := { B3,1 := [k1, . . . , 3k1] × [1, . . . , 2k2], B3,2 := [3k1 + 1, . . . , 5k1] × [1, . . . , 2k2], . . . , B3,n1n2/4k1k2 := [n1 − k1, . . . , n1, 1, . . . , k1] × [n2 − 2k2, . . . , n2] }, and
    D4 := { B4,1 := [1, . . . , 2k1] × [k2, . . . , 3k2], B4,2 := [2k1 + 1, . . . , 4k1] × [k2, . . . , 3k2], . . . , B4,n1n2/4k1k2 := [n1 − 2k1, . . . , n1] × [n2 − k2, . . . , n2, 1, . . . , k2] },

each of which contains p := n1 n2/(4 k1 k2) non-overlapping blocks of the matrix A. D1 is a partition of the matrix into disjoint blocks of size (2k1 × 2k2), D3 is a similar partition shifted down by k1 rows, D4 is shifted to the right by k2 columns, and D2 is both shifted down by k1 rows and shifted to the right by k2 columns. Figure 1 illustrates this. Notice that one of these collections (one of D1 through D4) must contain a block that completely contains the support B∗. Algorithm 1 is run separately on each of the collections and outputs 4 blocks. As we will show in our analysis of this procedure, one of these 4 blocks contains the support B∗ with high probability. Algorithm 1 is a modification of compressive binary search (Davenport and Arias-Castro, 2012), adapted to the matrix setting. We focus our attention on the collection that completely contains B∗. The algorithm proceeds in steps. Starting with a collection of disjoint blocks, at each step the collection is split into two halves and a test is performed to determine which half is more likely to contain the support B∗. This process continues for log2(p) steps until only one block remains.

³ For simplicity, we assume n1 is a multiple of 2k1 and n2 of 2k2.

Figure 1: The collection of blocks D1 is shown in solid lines and the collection D2 is shown in dashed lines. The collections D3 and D4 overlap with these and are not shown. The (k1 × k2) block of activation is shown in red.

Algorithm 2 is used next to precisely identify B∗ within one of the four blocks identified by Algorithm 1. Algorithm 2 itself works in several stages: in the first stage the procedure repeatedly measures a small number of columns, exactly one of which is contained in B∗, to identify an active column with high probability. The next stage finds the first non-active column to the left and right by testing columns using a binary search procedure. In this way, all the active columns are located. Finally, Algorithm 2 is repeated on the rows to identify the active rows.

The following theorem states that Algorithm 1 and Algorithm 2 succeed in estimating the support B∗ with high probability if the amplitude of the signal is large enough.

Theorem 5. If

    µmin/σ ≥ C √(log(1/α)) max{ √(n1 n2/(m k1² k2²)), √(log(max(k1, k2)) log log(k1 k2) / (m min(k1, k2))) }

and m ≥ 3 log(n1 n2), then R(B̂) ≤ α, where B̂ is the block output by running Algorithm 1 for approximate localization followed by Algorithm 2 on the subset of rows and columns returned by Algorithm 1.

Note that the upper bound given in Theorem 5 matches the lower bound up to an O(√(log(max(k1, k2)) log log(k1 k2))) factor. It is worth noting that for small k1 and k2, when the first term dominates, our active support recovery procedure achieves the detection limits. This is the best result we could hope for. When the support is larger, the lower bound indicates that no procedure can achieve the detection rate. The active procedure still remains significantly more efficient than the passive one, and is able to recover the support of signals that are weaker by a √(n1 n2) factor. The great potential for gains from adaptive measurements is clearly seen in our model, which captures the fundamental interplay between structure and adaptivity.
6 Experiments
In this section, we perform a set of simulation studies to illustrate finite sample performance of the proposed procedures. We let n1 = n2 = n and k1 = k2 = k. Theorem 8 and Theorem 5 characterize the signal amplitude needed for the passive and active identification of a contiguous block, respectively. We demonstrate that the scalings predicted by these theorems are sharp by plotting the probability of successful recovery against appropriately rescaled signal amplitude and showing that the curves for different values of n and k line up.
Experiment 1. Figure 2 shows the probability of successful localization of B∗ using B̂ defined in Eq. (A.5), plotted against n⁻¹√(km) · (µmin/σ), where the number of measurements is m = 100. Each plot in Figure 2 represents a different relationship between k and n; in the first plot k = Θ(log n), in the second k = Θ(√n), while in the third plot k = Θ(n). The dashed vertical line denotes the threshold position for the scaled amplitude at which the probability of success is larger than 0.95. We observe that, irrespective of the problem size and the relationship between n and k, Theorem 8 tightly characterizes the minimum signal amplitude needed for successful identification.
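The rescaling used here can be reproduced with a short harness like the following (our sketch, not the authors' simulation code; it reuses the block_signal, passive_measurements and localize_block helpers introduced in the earlier sketches).

```python
# Minimal sketch of Experiment 1: estimate the success probability at a given
# rescaled amplitude n^{-1} * sqrt(k m) * (mu_min / sigma).
import numpy as np

def success_probability(n, k, m, snr_scaled, sigma=1.0, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    mu_min = snr_scaled * sigma * n / np.sqrt(k * m)   # undo the rescaling
    hits = 0
    for _ in range(trials):
        A, corner = block_signal(n, n, k, k, mu_min, rng)
        y, X = passive_measurements(A, m, sigma, rng)
        hits += localize_block(y, X, k, k) == corner
    return hits / trials
```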
[Figure 2 panels: legends (n, k) ∈ {(50, 4), (100, 5), (150, 6), (200, 7)}, {(50, 7), (100, 10), (150, 12), (200, 14)}, and {(50, 10), (100, 20), (150, 30), (200, 40)}; horizontal axis n⁻¹√(km) · SNR; vertical axis: probability of success.]
Figure 2: Probability of success with passive measurements (averaged over 100 simulation runs).
Experiment 2. Figure 3 shows the probability of successful localization of B∗ using the procedure outlined in Section 5.2, with m = 500 adaptively chosen measurements, plotted against the scaled signal amplitude. Here (µmin/σ) is scaled by n⁻¹√m k² in the first two plots, where k = Θ(log n) and k = Θ(√n) respectively, while in the third plot the amplitude is scaled by √(mk/log k), as k = Θ(n). The dashed vertical line denotes the threshold position for the scaled signal amplitude at which the probability of success is larger than 0.95. We observe that Theorem 5 sharply characterizes the minimum signal amplitude needed for successful identification.
[Figure 3 panels: legends (n, k) ∈ {(1280, 10), (2816, 11), (6144, 12)}, {(8960, 70), (10240, 80), (11520, 90)}, and {(4000, 1000), (16000, 2000), (24000, 3000)}; horizontal axes n⁻¹√m k² · SNR (first two panels) and √(mk/log(k)) · SNR (third panel); vertical axis: probability of success.]
Figure 3: Probability of success with adaptively chosen measurements (averaged over 100 simulation runs).
7 Discussion
In this paper, we establish the fundamental limits for the problem of detecting and localizing a contiguous block of weak activation in a data matrix from either adaptive or non-adaptive compressive measurements. Our bounds characterize the tradeoffs between the signal amplitude, size of the matrix, size of the non-zero block and number of measurements. We also demonstrate constructive computationally efficient estimators that achieve these bounds. Some interesting direct extensions of the work in this paper would be to address the case of unknown signal support size, to address the more general problem of recovering a signal block within a randomly permuted data matrix, commonly known as biclustering, and finally to understand localization beyond exact recovery. The contiguous block model we consider is one useful form of the more general class of structured sparse signals, where adaptivity helps considerably. Techniques developed in this work have since been applied to the problems of adaptive compressive sensing of tree-sparse and graph-structured signals in the works of Soni and Haupt (2013) and Krishnamurthy et al. (2013). We expect a full characterization of structured signals for which adaptive sensing can be useful will be a fruitful path for future investigation.
Acknowledgements We would like to thank Larry Wasserman for his ideas, indispensable advice and wise guidance. SB would also like to thank Martin Wainwright for helpful discussions related to the proof of Theorem 3. This research is supported in part by AFOSR under grant FA9550-101-0382, NSF under grant IIS-1116458 and NSF CAREER grant DMS 1149677.
References

S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Trans. Inf. Theory, 56(10):5111–5130, 2010.
E. Arias-Castro. Detecting a vector based on linear measurements. Electron. J. Stat., 6:547–558, 2012.
E. Arias-Castro, E. J. Candès, and A. Durand. Detection of an anomalous cluster in a network. Ann. Stat., 39(1):278–304, 2011a.
E. Arias-Castro, E. J. Candès, and Y. Plan. Global testing under sparse alternatives: Anova, multiple comparisons and the higher criticism. Ann. Stat., 39(5):2533–2556, 2011b.
E. Arias-Castro, E. J. Candès, and M. A. Davenport. On the fundamental limits of adaptive sensing. IEEE Trans. Inf. Theory, 59(1):472–481, 2013.
R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Trans. Inf. Theory, 56(4):1982–2001, 2010.
S. Bhamidi, P. S. Dey, and A. B. Nobel. Energy landscape for large average submatrix detection problems in Gaussian random matrices. arXiv preprint arXiv:1211.2284, 2012.
L. Birgé. An alternative point of view on Lepski's method. In State of the art in probability and statistics (Leiden, 1999), volume 36 of IMS Lecture Notes Monogr. Ser., pages 113–133. Inst. Math. Statist., Beachwood, OH, 2001.
C. Butucea and Y. I. Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652–2688, 2013.
C. Butucea, Y. I. Ingster, and I. Suslina. Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. arXiv preprint arXiv:1303.5647, 2013.
E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector? Appl. Comput. Harmon. Anal., 34(2):317–323, 2013.
E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory, 52(12):5406–5425, 2006.
E. J. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat., 35(6):2313–2351, 2007.
E. J. Candès and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Process. Mag., 25(2):21–30, 2008.
Y. Caron, P. Makris, and N. Vincent. A method for detecting artificial objects in natural environments. In 16th Int. Conf. Pattern Recogn., volume 1, pages 600–603. IEEE, 2002.
R. M. Castro. Adaptive sensing performance lower bounds for sparse signal detection and support estimation. arXiv preprint arXiv:1206.0648, 2012.
M. A. Davenport and E. Arias-Castro. Compressive binary search. arXiv preprint arXiv:1202.0937, 2012.
D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
M. F. Duarte, M. A. Davenport, M. J. Wainwright, and R. G. Baraniuk. Sparse signal detection from incoherent projections. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages III–305–III–308, 2006.
M. Golbabaee and P. Vandergheynst. Compressed sensing of simultaneous low-rank and joint-sparse matrices. arXiv preprint arXiv:1211.5058, 2012.
J. D. Haupt and R. D. Nowak. Compressive sampling for signal detection. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 3, pages III–1509–III–1512. IEEE, 2007.
J. D. Haupt, R. G. Baraniuk, R. M. Castro, and R. D. Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Proc. 43rd Asilomar Conf. Signals, Systems and Computers, pages 1551–1555. IEEE, 2009.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. J. Mach. Learn. Res., 12:3371–3412, 2011.
Y. I. Ingster, A. B. Tsybakov, and N. Verzelen. Detection boundary in sparse regression. Electron. J. Stat., 4:1476–1526, 2010.
I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc., 104(486):682–693, 2009.
R. M. Kainkaryam and P. J. Woolf. Pooling in high-throughput drug screening. Current Opinion in Drug Discovery & Development, 12(3):339–350, 2009.
R. M. Kainkaryam, A. Bruex, A. C. Gilbert, J. Schiefelbein, and P. J. Woolf. poolMC: Smart pooling of mRNA samples in microarray experiments. Bioinformatics, 11:299, 2010.
M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh. Minimax localization of structural information in large noisy matrices. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 909–917, 2011.
V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat., 39(5):2302–2329, 2011.
A. Krishnamurthy, J. Sharpnack, and A. Singh. Recovering graph-structured activations using adaptive compressive measurements. arXiv preprint arXiv:1305.0213, 2013.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Stat., 28(5):1302–1338, 2000.
M. L. Malloy and R. D. Nowak. Near-optimal adaptive compressed sensing. In Proc. 46th Asilomar Conf. Signals, Systems and Computers, pages 1935–1939. IEEE, 2012a.
M. L. Malloy and R. D. Nowak. Near-optimal compressive binary search. arXiv preprint arXiv:1203.1804, 2012b.
S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat., 39(2):1069–1097, 2011.
G. Reeves and M. Gastpar. The sampling rate-distortion tradeoff for sparsity pattern recovery in compressed sensing. IEEE Trans. Inf. Theory, 58(5):3065–3092, 2012.
E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proc. 29th Int. Conf. Mach. Learn., pages 1351–1358, 2012.
M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990, 2010.
M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab., 18(0), 2013.
A. Soni and J. D. Haupt. Efficient adaptive compressive sensing using sparse hierarchical learned dictionaries. In Proc. 45th Asilomar Conf. Signals, Systems and Computers, pages 1250–1254. IEEE, 2011.
A. Soni and J. D. Haupt. On the fundamental limits of recovering tree sparse vectors from noisy linear measurements. arXiv preprint arXiv:1306.4391, 2013.
X. Sun and A. B. Nobel. On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix. Bernoulli, 19(1):275–294, 2013.
E. Tánczos and R. M. Castro. Adaptive sensing for estimation of structured sparse signals. arXiv preprint arXiv:1311.7118, 2013.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Trans. Inf. Theory, 55(5):2183–2202, 2009a.
M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inf. Theory, 55(12):5728–5741, 2009b.
M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk. An architecture for compressive imaging. In Proc. IEEE Int. Conf. Image Processing, pages 1273–1276. IEEE, 2006.
S. Yoon, C. Nardini, L. Benini, and G. De Micheli. Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams. IEEE/ACM Trans. Comput. Biol. Bioinf., 2(4):339–354, 2005.
A Proofs of Main Results
In this appendix, we collect proofs of the results stated in the paper. Throughout the proofs, we will denote c1 , c2 , . . . positive constants that may change their value from line to line.
A.1 Proof of Theorem 1
We lower bound the Bayes risk of any test T . Recall, the null and alternate hypothesis, defined in Eq. (2.3), H0 : A = 0n1 ×n2 H1 : A ∈ A(µmin ). We will accomplish this by lower bounding the Bayes risk of any test T 0 for a simpler hypothesis testing problem H0 : H1 :
A = 0n1 ×n2 A = (aij ) with aij = µmin 1I{(i,j)∈B} , B ∈ B.
It is clear that inf T Rdet (T ) ≥ inf T 0 Rdet (T 0 ). To further bound the worst case risk of T 0 , we consider a uniform prior over the alternatives π, and bound the average risk Rπdet (T 0 ) = P0 [T 0 = 1] + EA∼π PA [T 0 = 0], which provides a desired lower bound. Under the prior π, the hypothesis testing problem becomes to distinguish between H0 : H1 :
A = 0n1 ×n2 A = (aij ) with aij = µmin EB∼π [1I{(i,j)∈B} ].
Both H0 and H1 are simple and the likelihood ratio test is optimal by the Neyman-Pearson lemma. The likelihood ratio is Q Eπ PA [(yi , Xi )i∈[m] ] Eπ m PA [yi |Xi ] , L≡ = Qmi=1 P0 [(yi , Xi )i∈[m] ] i=1 P0 [yi |Xi ] where the second equality follows by decomposing the probabilities by the chain rule and observing that P0 [Xi |(yj , Xj )j∈[i−1] ] = PA [Xi |(yj , Xj )j∈[i−1] ], since the sampling strategy (whether active or passive) is the same irrespective of the true hypothesis. The likelihood ratio can be further simplified as ! m X 2yi hhA, Xi ii − hhA, Xi ii2 L = Eπ exp . 2σ 2 i=1
The average risk of the likelihood ratio test 1 Rπdet (T 0 ) = 1 − kEπ PA − P0 kTV 2 is determined by the total variation distance between the mixture of alternatives from the null. By Pinkser’s inequality Tsybakov (2009), p kEπ PA − P0 kTV ≤ KL(P0 , Eπ PA )/2, 20
where KL(P0 , P1 ) denotes the Kullback-Leibler (KL) divergence between distributions P0 and P1 . Therefore a lower bound can be obtained by controlling the KL divergence. We have KL(P0 , Eπ PA ) = −E0 log L m X 2yi hhA, Xi ii − hhA, Xi ii2 ≤ −Eπ E0 2σ 2 i=1
= Eπ
m X
E0
i=1
m ≤ 2 2σ
hhA, Xi ii2 2σ 2 Eπ hhA, Xi ii2 :=
sup |||X|||fro ≤1
m |||C|||op , 2σ 2
where the first inequality follows by applying the Jensen’s inequality followed by Fubini’s theorem, and the second inequality follows using the fact that |||Xi |||fro = 1. The matrix C ∈ Rn1 n2 ×n1 n2 has elements Cii = µ2min Eπ PA [Aτ (i) = 1] and Cij = µ2min Eπ PA [Aτ (i) = 1, Aτ (j) = 1], where τ is an invertible map from a linear index in {1, . . . , n1 n2 } to an entry of A. A bound on the operator norm of C follows from the following two observations. Since the support of the signal forms a contiguous block and therefore in any row of C there are at most k1 k2 non-zero entries. Next we observe that each non-zero entry in C is of magnitude at most µ2min k1 k2 /(n1 − k1 )(n2 − k2 ). Combining the two observations, we have |||C|||op ≤ max j
X k
|Cjk | ≤
µ2min k12 k22 µ2 k 2 k 2 ≤ min 1 2 (n1 − k1 )(n2 − k2 ) 4n1 n2
where the last inequality follows from the assumption that k1 ≤ n1 /2 and k2 ≤ n2 /2. From this we obtain an upper bound on the KL divergence, which in turn gives us the following lower bound on the minimax detection risk r µmin m Rπdet (T ) ≥ 1 − k1 k2 64 . σ n1 n2
A.2 Proof of Theorem 2
The proof will proceed via two separate constructions. At a high level, the first construction is intended to capture the difficulty of distinguishing between supports that overlap everywhere, except for one row or column. The second construction deals with distinguishing between one support and a large collection of supports that do not overlap with the first one. Without loss of generality we assume k1 ≤ k2 . Construction 1. We consider two distributions PA1 and PA2 , where B1 = supp(A1 ) = [1, . . . , k1 ] × [1, . . . , k2 ] and B2 = supp(A2 ) = [1, . . . , k1 ] × [2, . . . , k2 + 1] and every non-zero element of the matrices is equal to µmin . Observe that the two supports differ in only one
21
column. We have the following lower bound, valid uniformly over all Ψ, Rsup (Ψ) = sup PA Ψ (yi , Xi )i∈[m] 6= supp(A) , A∈A(µmin )
≥ max PAj Ψ (yi , Xi )i∈[m] 6= Bj j∈{1,2}
1 ≥ 1 − kPA1 − PA2 kTV 2 p ≥ 1 − KL(PA1 − PA2 )/8. Therefore, we need to compute an upper bound on the KL divergence. The first bound in the theorem follows from the following calculation KL(P1 , P2 ) = EP1 log = =
1 EP 2σ 2 1
P1 P2 m X
(hhA2 , Xi ii − hhA1 , Xi ii)2
(A.1)
i=1
µ2min mk1 , σ 2 n1 n2
where we have used the fact that Xi is a random Gaussian matrix with independent entries of variance n11n2 . Construction 2. We consider a collection of distributions PA1 , . . . , PAt+1 where t = (n1 − k1 )(n2 − k2 ). The distribution PA1 is the same as in the first construction, while supports of signals A2 , . . . , At+1 do not overlap with the support of the signal A1 . In order to obtain a lower bound on the risk, we use Fano’s inequality (see, for example, Theorem 2.5 in Tsybakov (2009)). In order to apply the inequality, we need to compute the KL divergence between PA1 and every PAj . The same calculation as in the first construction gives us KL(PA1 , PAj ) ≤
µ2min mk1 k2 . σ 2 n1 n2
(A.2)
The second part of the theorem directly follows from an application of Theorem 2.5 in Tsybakov (2009).
A.3 Proof of Theorem 3
Proof. Define µ b to be the minimizer of the inner minimization in Equation 4.1. Define ∆ = µ bABb − µ∗ AB ∗ . By the optimality of µ bABb and the feasibility of µ∗ AB ∗ we get the following basic inequality m m 2 X 1 X hhXi , ∆ii2 ≤ i hhXi , ∆ii . m m i=1
i=1
Lemma 7 lower bounds the left hand side and Lemma 6 upper bounds the right hand side. Putting these together, and adjusting δ, we obtain that with probability at least 1 − δ, if λmax (Σ) 2 8(n1 − k1 )(n2 − k2 ) m≥C log , λmin (Σ) δ 22
for a universal constant C we have r p λmax (Σ) n1 n2 log (12(n1 − k1 )(n2 − k2 )/δ) |||∆|||fro ≤ 12 . λmin (Σ)2 m To finish the proof we need to make some structural observations regarding ∆. Consider two cases: either |b µ − µ∗ | < µ∗ /2 or |b µ − µ∗ | ≥ µ∗ /2. In the first case we have that µ b ≥ µ∗ /2 so every non-zero entry of ∆ that is in B\B ∗ has ∗ magnitude at least µ /2. Denote the number of entries in B\B ∗ by s. We have, s(µ∗ )2 . 4
|||∆|||2fro ≥
Further, we know that if s > 0 then s ≥ min(k1 , k2 ). So if |||∆|||fro
0 independent of the problem parameters (k1 , k2 , n1 , n2 ), such that if s n1 n2 log(k1 k2 ) log(n1 n2 ) µmin ≥C log(2/α) max , , σ m min(k1 , k2 ) k1 k2 b ≤ α, where B b is defined in Eq. (A.5). for 0 < α ≤ 1, then Rsup (B) 27
Proof. Let ∆(B) = Ψ(B ∗ )−Ψ(B) and observe that an error is made if ∆(B) < 0 for B 6= B ∗ . Therefore, the probability of an error is P[∪B∈B\B ∗ {∆(B) < 0}]. Under the conditions of the theorem, we will show that ∆(B) > 0 for all B ∈ B\B ∗ with large probability. For a fixed B ∈ B\B ∗ , we have X X X X yi Xi,ab yi Xi,ab − ∆(B) = (a,b)∈B\B ∗ i∈[m]
(a,b)∈B ∗ \B i∈[m]
=
X
X
tr(Xi A)
Xi,ab −
|
Xi,ab
(a,b)∈B ∗ \B
(a,b)∈B ∗ \B
i∈[m]
X
{z
}
T1 (B)
X
+
X
i
Xi,ab −
Xi,ab .
(a,b)∈B\B ∗
(a,b)∈B ∗ \B
i∈[m]
X
|
{z
}
T2 (B)
We express the first term as T1 (B) =
X
Z1,i Z2,i
i∈[m]
where
Z1,i Z2,i
∼ N2
0 0
,
|||A|||2
P
fro
n1 n2 (a,b)∈B ∗ \B
P
aab
n1 n2
(a,b)∈B ∗ \B
n1 n2 |B ∗ ∆B| n1 n2
aab
.
Then, by Corollary 15, m T1 (B) ≥ n1 n2
X (a,b)∈B ∗ \B
v u m s u aab − c1 t |B ∗ \B| n1 n2
X
a2ab log(c2 δ −1 )
(a,b)∈B ∗ \B
with probability 1 − δ/2. Next, we deal with the second term. Conditioning on , we have that kk22 |B ∗ ∆B| T2 (B) ∼ N 0, , n1 n2 so that
r T2 (B) ≤ c1 σ
m |B ∗ ∆B| log (c2 δ −1 ) n 1 n2
with probability at least 1 − δ/2 using the Gaussian tail bound (C.1) and (C.4). Combining the bounds on the two terms, we have that if r µmin n1 n2 ≥ c1 log (δ −1 ), σ m|B ∗ \B| then P[∆(B) < 0] ≤ δ.
28
(A.6)
Define N (l) = |{B ∈ B : |B∆B ∗ | = l}| to be the number of elements in B whose symmetric difference with B ∗ is equal to l. Note that N (l) = O(1) for any l. Using the union bound P[∪B∈B {∆(B) < 0}] X X (A.7) N (l)P[∆(B) < 0]. P[∆(B) < 0] + ≤ B∈B,|B∆B ∗ |=2k1 k2
l 0 independent of the problem parameters (k1 , k2 , n1 , n2 ), such that if m ≥ k1 k2 + C 0 log (n1 n2 ) and µmin ≥C σ
s
n1 n2 log(2/α) max m − k1 k2
log(k1 k2 ) log(n1 n2 ) , , min(k1 , k2 ) k1 k2
b ≤ α, where B b is defined in Eq. (B.2). for 0 < α ≤ 1, then Rsup (B) b is not going to recover the true block B ∗ if ∆(B) < 0 for some B ∈ B\B ∗ , Proof. The block B where ∆(B) = Ψ(B) − Ψ(B ∗ ) Therefore, P[error] = P[∪B∈B\B ∗ {∆(B) < 0}]. To bound P[∆(B) < 0], we will use Lemma 2 in Wainwright (2009b). 33
Lemma 10 (Wainwright (2009b)). If m − k1 k2 ≥
64σ 2 n1 n2 , µ2min
we have that
P P[∆(B) < 0] ≤ 4 exp −(m − k1 k2 )
(a,b)∈B ∗ \B
64(
P
(a,b)∈B ∗ \B
a2ab /(σ 2 n1 n2 )
!
a2ab /(σ 2 n1 n2 ) + 8)
.
The bound in the lemma can be simplified to |B ∗ \B|µ2min /(σ 2 n1 n2 ) . P[∆(B) < 0] ≤ 4 exp −(m − k1 k2 ) 64(|B ∗ \B|µ2min /(σ 2 n1 n2 ) + 8) Now P[∪B∈B\B ∗ {∆(B) < 0}] k1 k2 µ2min /(σ 2 n1 n2 ) ≤ c1 (n1 − k1 )(n2 − k2 ) exp −(m − k1 k2 ) 64(k1 k2 µ2min /(σ 2 n1 n2 ) + 8) min(k1 , k2 )µ2min /(σ 2 n1 n2 ) . + c2 k1 k2 exp −(m − k1 k2 ) 64(min(k1 , k2 )µ2min /(σ 2 n1 n2 ) + 8)
Suppose that k1 k2 µ2min /(σ 2 n1 n2 ) < 8. Then from the first equation, we require that r µmin n1 n2 ≥C log(2/α) log max (n1 − k1 , n2 − k2 ) σ (m − k1 k2 )k1 k2 and from the second µmin ≥ Cσ σ
r
n1 n2 log(2/α) log max (k1 , k2 ). (m − k1 k2 ) min(k1 , k2 )
If k1 k2 µ2min /(σ 2 n1 n2 ) > 8, then we need to have m − k1 k2 ≥ log max (n1 − k1 , n2 − k2 , k1 , k2 ) . The above results give us sufficient conditions on the sample size and the amplitude of the signal for recovery of the support B ∗ .
C Collection of tail bounds
In this section, we collect useful results on tail bounds of various random quantities used throughout the paper. We start by stating a lower and upper bound on the survival function of the standard Normal random variable. Let Z ∼ N (0, 1) be a standard Normal random variable. Then for t > 0 1 t 1 1 √ exp(−t2 /2) ≤ P(Z > t) ≤ √ exp(−t2 /2). 2 2π t + 1 2π t
34
(C.1)
C.1 Tail bounds for Chi-squared variables
Throughout the paper we will often use the following tail bounds for central χ2 random variables. These are well known and proofs can be found in the original papers. Lemma 11 (Laurent and Massart (2000)). Let X ∼ χ2d . For all x ≥ 0, √ P[X − d ≥ 2 dx + 2x] ≤ exp(−x) √ P[X − d ≤ −2 dx] ≤ exp(−x). Lemma 12 (Johnstone and Lu (2009)). Let X ∼ χ2d , then 3 2 1 −1 P[|d X − 1| ≥ x] ≤ exp − dx , x ∈ [0, ). 16 2
(C.2) (C.3)
(C.4)
The following result provide a tail bound for non-central χ2 random variable with noncentrality parameter ν. Lemma 13 (Birg´e (2001)). Let X ∼ χ2d (ν), then for all x > 0 p P[X ≥ (d + ν) + 2 (d + 2ν)x + 2x] ≤ exp(−x) p P[X ≤ (d + ν) − 2 (d + 2ν)x] ≤ exp(−x).
(C.5) (C.6)
Using the above results, we have a tail bound for sum of product-Normal random variables. σaa σab 0 be a bivariate Normal random , Lemma 14. Let Z = (Za , Zb ) ∼ N σab σbb 0 iid
variable and let (zia , zib ) ∼ Z, i = 1, . . . , n. Then for all t ∈ [0, νab /2) " # 3nt2 −1 X P n zia zib − σab ≥ t ≤ 4 exp − , (C.7) 2 16νab i √ √ where νab = max{(1 − ρab ) σaa σbb , (1 + ρab ) σaa σbb }. 0 = z /√σ . Then using (C.4) Proof. Let zia ia aa " # n 1X zia zib − σab | ≥ t P | n i=1 " # n 1X 0 0 t =P | zia zib − ρab | ≥ √ n σaa σbb i=1 " n # X 4nt 0 0 2 0 0 2 =P | ((zia + zib ) − 2(1 + ρab )) − ((zia − zib ) − 2(1 − ρab ))| ≥ √ σaa σbb i=1 " n # X 2nt 0 0 2 ≤P | ((zia + zib ) − 2(1 + ρab ))| ≥ √ σaa σbb i=1 " n # X 2nt 0 0 2 − zib +P | ((zia ) − 2(1 − ρab ))| ≥ √ σaa σbb i=1 nt 3nt2 ≤ 2P |χ2n − n| ≥ ≤ 4 exp − , 2 νab 16νab 35
√ √ where νab = max{(1 − ρab ) Σaa Σbb , (1 + ρab ) Σaa Σbb } and t ∈ [0, νa /2). Corollary 15. Let Xi be drawn i.i.d from the distribution of the product of two independent standard Normal random variables for i = 1 . . . n. Then for t ∈ [0, 1/2) X 3nt2 −1 Xi | > t] ≤ 4 exp − P[|n . (C.8) 16 i∈[n]
C.2 Other tail bounds
We also require a tail bound for general sub-exponential random variables. This version is an adaptation of a result from Rudelson and Vershynin (2010). Lemma 16. Suppose that X is sub-exponential with parameters (ν, b), that is, 2 2 ν λ E[exp(λ(X − E[X])] ≤ exp , 2 for all |λ| < 1/b. Then 2 ( t 2 exp − 2ν if 0 ≤ t ≤ 2 P (|X − E[X]| ≥ t) ≤ 2 t 2 exp − 2b for t > νb .
ν2 b
and
Finally, we use the Hanson-Wright inequality. This version is from Rudelson and Vershynin (2013). Lemma 17. Let X = (X1 , . . . , Xn ) ∈ Rn be a random vector with independent sub-Gaussian components Xi with parameter bounded by σ 2 which satisfy EXi = 0. Let A be an (n × n) matrix. Then for every t ≥ 0, !! 2 T t t T P X AX − EX AX ≥ t ≤ 2 exp −c min , , σ 4 |||A|||2fro σ 2 |||A|||op where c is a universal constant.
36