Estimation with Norm Regularization

Arindam Banerjee, Sheng Chen, Farideh Fazayeli, Vidyashankar Sivakumar
{banerjee,shengc,farideh,[email protected]}
Department of Computer Science & Engineering, University of Minnesota, Twin Cities

May 19, 2015

arXiv:1505.02294v2 [stat.ML] 18 May 2015
Abstract

Analysis of non-asymptotic estimation error and structured statistical recovery based on norm regularized regression, such as Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise model. This paper presents generalizations of such estimation error analysis on all four aspects compared to the existing literature. We characterize the restricted error set in which the estimation error vector lies, establish relations between the error sets for the constrained and regularized problems, and present an estimation error bound applicable to any norm. Precise characterizations of the bound are presented for a variety of design matrices, including sub-Gaussian, anisotropic, and correlated designs, noise models, including both Gaussian and sub-Gaussian noise, and loss functions, including least squares and generalized linear models. A key result from the analysis is that the sample complexity of all such estimators depends on the Gaussian width of the spherical cap corresponding to the restricted error set. Further, once the number of samples n crosses the required sample complexity, the estimation error decreases as c/√n, where c depends on the Gaussian width of the unit norm ball.
1 Introduction
Over the past decade, progress has been made in developing non-asymptotic bounds on the estimation error of structured parameters based on norm regularized regression. Such estimators are usually of the form [35, 26, 10]:
$$\hat\theta_{\lambda_n} = \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p}\; \mathcal{L}(\theta; Z^n) + \lambda_n R(\theta), \qquad (1)$$
where R(θ) is a suitable norm, L(·) is a suitable loss function, Z^n = {(y_i, X_i)}_{i=1}^n with y_i ∈ R and X_i ∈ R^p is the training set, and λ_n > 0 is a regularization parameter. The optimal parameter θ* is often assumed to be 'structured,' usually characterized or approximated as having a small value according to some norm R(·). Recent work has viewed such characterizations in terms of atomic norms, which give the tightest convex relaxation of a structured set of atoms to which θ* belongs [14]. Since θ̂_{λ_n} is an estimate of the optimal structure θ*, the focus has been on bounding a suitable measure of the error vector Δ̂_n = (θ̂_{λ_n} − θ*), e.g., the L2 norm ‖Δ̂_n‖_2.

To understand the state-of-the-art on non-asymptotic bounds on the estimation error for norm-regularized regression, four aspects of (1) need to be considered: (i) the norm R(θ), (ii) properties of the design matrix X = [X_1 ⋯ X_n]^T ∈ R^{n×p}, (iii) the loss function L(·), and (iv) the noise model, typically in terms of ω_i = y_i − E[y|X_i]. Most of the literature has focused on a linear model, y = Xθ + ω, and a squared loss function:
$$\mathcal{L}(\theta; Z^n) = \frac{1}{n}\|y - X\theta\|_2^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \langle\theta, X_i\rangle)^2.$$
Early work on such estimators focused on
the L1 norm [43, 41, 24], and led to sufficient conditions on the design matrix X, including the restricted isometry property (RIP) [13, 12] and restricted eigenvalue (RE) conditions [7, 26, 31]. While much of the development has focused on isotropic Gaussian design matrices, recent work has extended the analysis for the L1 norm to correlated Gaussian designs [31] as well as anisotropic sub-Gaussian design matrices [32]. Building on such development, [26] presents a unified framework for the case of decomposable norms and also considers generalized linear models (GLMs) for certain norms such as L1. Two key insights are offered in [26]: first, the error vector Δ̂_n lies in a restricted set, a cone or a star, for suitably large λ_n; and second, the loss function needs to satisfy restricted strong convexity (RSC), a generalization of the RE condition, on the restricted error set for the analysis to work out. For isotropic Gaussian design matrices, additional progress has been made. [14] considers a constrained estimation formulation for all atomic norms, where the gain condition, equivalent to the RE condition, uses Gordon's inequality [18, 19, 23] and is succinctly represented in terms of the Gaussian width of the intersection of the cone of the error set and a unit ball/sphere. [29] considers three related formulations for generalized Lasso problems, and establishes recovery guarantees based on Gordon's inequality and quantities related to the Gaussian width. Sharper analysis for recovery has been considered in [2], yielding a precise characterization of phase transition behavior using quantities related to the Gaussian width. [30] considers a linear programming estimator in a 1-bit compressed sensing setting and, interestingly, the concept of Gaussian width shows up in the analysis. In spite of these advances, with a few notable exceptions [36, 39], most existing results are restricted to isotropic Gaussian design matrices. Further, while a suitable scale for λ_n is known for special cases such as the L1 norm, a general analysis applicable to any norm R(·) has not been explored in the literature.

In this paper, we consider structured estimation problems with norm regularization of the form (1), and present a unified analysis which substantially generalizes existing results on all four pertinent aspects: the norm, the design matrix, the loss, and the noise model. The analysis we present applies to all norms, and the results can be divided into three groups: characterization of the error set and recovery guarantees, characterization of the regularization parameter λ_n, and characterization of restricted eigenvalue conditions or restricted strong convexity. We provide a summary of the key results below.

Restricted error set and recovery guarantee: We start with a characterization of the error set E_r to which the error vector Δ̂_n belongs. For a suitably large λ_n, we show that Δ̂_n belongs to the restricted error set
$$E_r = \left\{ \Delta \in \mathbb{R}^p \;\Big|\; R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\}, \qquad (2)$$
where β > 1 is a constant. The restricted error set has interesting structure, and forms the basis of the subsequent analysis for bounds on ‖Δ̂_n‖_2. As an alternative to regularized estimators, the literature has considered constrained estimators which directly focus on minimizing R(θ) under suitable constraints determined by the noise (y − Xθ) and/or the design matrix X [13, 7, 14, 15].
A recent example of such a constrained estimator is the generalized Dantzig selector (GDS) [15], which generalizes the Dantzig selector [11] corresponding to the L1 norm, and is given by:
$$\hat\theta_{\varphi_n} = \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p} R(\theta) \quad \text{s.t.} \quad R^*(X^T(y - X\theta)) \le \varphi_n, \qquad (3)$$
where R^*(·) denotes the dual norm of R(·). One can show [14, 15] that the restricted error set for such constrained estimators is of the form:
$$E_c = \{\Delta \in \mathbb{R}^p \mid R(\theta^* + \Delta) \le R(\theta^*)\}. \qquad (4)$$
One can readily see that E_r is larger than E_c, i.e., E_c ⊆ E_r, and E_r approaches E_c as β increases. We establish a geometric relationship between the two sets, which will possibly help in transferring analysis done on regularized estimators as in (1) to corresponding constrained estimators as in (3), and vice versa. Let ρB_2^p denote an L2 ball of radius ρ in R^p. Then, with A_r = E_r ∩ ρB_2^p and A_c = cone(E_c) ∩ ρB_2^p, we show that
$$w(A_r) \le \left(1 + \frac{2}{\beta - 1}\,\frac{\|\theta^*\|_2}{\rho}\right) w(A_c), \qquad (5)$$
where w(A) = E_g[sup_{a∈A} ⟨a, g⟩], with g an isotropic Gaussian vector, denotes the Gaussian width of the set A [23, 6, 14, 33]. (A gentle exposition to the Gaussian width and some of its properties is given in Appendix A.) For example, when ‖θ*‖_2 = 1 and one chooses ρ = 1, β = 2, we have w(A_r) ≤ 3w(A_c). Note that A_r corresponds to the spherical cap of the error set E_r at radius ρ, and A_c corresponds to the spherical cap of the error cone cone(E_c) at the same radius. For the special case of the L1 norm, [7] considered a simultaneous analysis of the Lasso and the Dantzig selector, and characterized the structure of the error sets for the regularized and constrained problems. Further, while the characterization in [7] was also geometric, it was not based on Gaussian widths. In contrast, our results apply to any norm, not just L1, and the geometric characterization is based on Gaussian widths. The utility of the Gaussian width based characterization becomes evident later when we establish sample complexity results for Gaussian and sub-Gaussian random matrices in terms of Gaussian widths of spherical caps.

We establish bounds on the estimation error Δ̂_n under two assumptions, which are subsequently shown to hold with high probability for (sub-)Gaussian designs and noise models. The first assumption is that the regularization parameter λ_n is suitably large. In particular, for any β > 1, the regularization parameter λ_n needs to satisfy
$$\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n)), \qquad (6)$$
where R^*(·) denotes the dual norm of R(·). The second assumption is that the design matrix X ∈ R^{n×p} satisfies the restricted strong convexity (RSC) condition [7, 26] in the error set E_r, so that for Δ ∈ E_r there exists a suitable constant κ > 0 such that
$$\delta\mathcal{L}(\Delta, \theta^*) \triangleq \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle\nabla\mathcal{L}(\theta^*), \Delta\rangle \ge \kappa\|\Delta\|_2^2. \qquad (7)$$
With such a suitably large λ_n and X satisfying the RSC condition, we establish the following bound:
$$\|\hat\Delta_n\|_2 \le \Psi(E_r)\,\frac{1+\beta}{\beta}\,\frac{\lambda_n}{\kappa}, \qquad (8)$$
where Ψ(E_r) = sup_{u∈E_r} R(u)/‖u‖_2 is a norm compatibility constant [26]. Note that the above bound is deterministic, but relies on assumptions on λ_n and κ. So, we focus on characterizations of λ_n and κ which hold with high probability for families of design matrices X and noise models ω.

Bounds on the regularization parameter λ_n: From (6) above, for the analysis to work, one needs to have λ_n ≥ R^*(∇L(θ*; Z^n)). There are a few challenges in getting a suitable bound for λ_n. First, the bound depends on θ*, but θ* is unknown and is the quantity one is interested in estimating. Second, the bound depends on Z^n, the samples, and is hence random. The goal will be to bound the expectation E[R^*(∇L(θ*; Z^n))] over all samples of size n, and obtain high-probability deviation bounds. Third, since the bound relies on the (dual) norm R^*(·) of a p-dimensional random vector, without proper care, the lower bound on λ_n may end up having a large scaling dependency, say √p, on the ambient dimensionality. Since the error bound in (8) is directly proportional to λ_n, such dependencies will lead to weak bounds.

In Section 3, we characterize the expectation E[R^*(∇L(θ*; Z^n))] in terms of the geometry of the unit norm ball of R, which leads to a sharp bound. Let Ω_R = {u ∈ R^p | R(u) ≤ 1} denote the unit norm ball. Then, for Gaussian design matrices and squared loss, we show that
$$\mathbb{E}[R^*(\nabla\mathcal{L}(\theta^*; Z^n))] \le \frac{c}{\sqrt{n}}\, w(\Omega_R), \qquad (9)$$
which scales as the Gaussian width of Ω_R. The result can be extended to the case of anisotropic Gaussian designs as well as correlated samples, where the constant c starts depending on the maximum eigenvalue (operator norm) of the corresponding covariance matrix. Further, the result also extends to the case of sub-Gaussian design matrices. In the sub-Gaussian case, the result is in terms of the 'sub-Gaussian width' of the unit norm ball, which can be upper bounded by a constant times the Gaussian width using generic chaining [33]. For high-probability bounds, the Gaussian case follows from existing results on large deviations of Lipschitz functions of Gaussian vectors, or in this case, equivalently, suprema of Gaussian processes [22, 8]. While there is no equivalent result on suprema of general sub-Gaussian processes, large deviation bounds are possible for important special cases, e.g., bounded random variables [22], random variables with sufficient smoothness, i.e., bounded Malliavin derivatives [40], etc. The results can also be extended to general convex losses from generalized linear models.

The above characterization allows one to choose λ_n ≥ (c/√n) w(Ω_R). For the special case of L1 regularization, Ω_R is the unit L1 norm ball, and the corresponding Gaussian width satisfies w(Ω_R) ≤ c₁√(log p), which explains the √(log p) term one finds in existing bounds for Lasso [26, 10]. When working with other norms, one simply needs to get an upper bound on the corresponding w(Ω_R).

Restricted eigenvalue conditions: When the loss function under consideration is the squared loss, the RSC condition in (7) reduces to the restricted eigenvalue (RE) condition on the design matrix. Our analysis focuses on establishing the RE condition on A = cone(E_r) ∩ S^{p−1}, the spherical cap obtained by intersecting the cone of the error set with the unit hypersphere, since it implies the RE condition on E_r. We show that for isotropic design matrices, in both the Gaussian and the sub-Gaussian case, an inequality of the form
$$\inf_{u\in A}\|Xu\|_2 \ge c_1\sqrt{n} - c_2\, w(A), \qquad (10)$$
where w(A) is the Gaussian width of the spherical cap A, always holds with high probability. Thus, the sample complexity is n_0 = c_2^2 w^2(A)/c_1^2 = O(w^2(A)), and for n > n_0 an RE condition of the form
$$\inf_{u\in A}\|Xu\|_2 \ge c_1\left(1 - \frac{c_2\, w(A)}{c_1\sqrt{n}}\right)\sqrt{n} = c_1\,\kappa_A(n, p)\,\sqrt{n}, \qquad (11)$$
is satisfied with high probability with κ_A(n, p) = Θ(1). Thus, one does not need to treat the RE condition as an assumption for isotropic Gaussian or sub-Gaussian designs: it always holds with high probability, with the phase transition happening at O(w^2(A)) samples. Our analysis technique for this result, as well as all other RE/RSC type results, relies on a simple covering argument, in particular a union bound over an ε-covering of the spherical cap A. We use Sudakov's inequality, also known as the weak converse of Dudley's inequality, to go from covering numbers to Gaussian widths [16, 23]. For anisotropic sub-Gaussian designs, with Σ ∈ R^{p×p} denoting the covariance matrix, we show that an inequality of the form
$$\inf_{u\in A}\|Xu\|_2 \ge c_1\sqrt{n} - c_2\,\Lambda_{\max}(\Sigma)\, w(A), \qquad (12)$$
where Λ_max(Σ) denotes the largest eigenvalue of Σ, always holds with high probability. We get a mildly sharper result for the case of anisotropic Gaussian designs. The constant c₁ has a dependency on the restricted minimum eigenvalue ν = inf_{u∈A}‖Σ^{1/2}u‖₂ of the true covariance Σ, as expected. Thus, for the anisotropic case, the sample complexity of the RE condition is O(Λ²_max(Σ)w²(A)). If the largest eigenvalue Λ_max(Σ) is a constant or grows slowly with the dimensionality, the phase transition behavior is qualitatively similar to that for isotropic designs. However, for highly correlated covariates, the sample complexity may have a strong dependency on the dimensionality, and one may need to consider alternative estimators [17, 25].

We also establish results for the setting where one has correlated isotropic samples. If Γ ∈ R^{n×n} denotes the covariance matrix for the samples, then for both Gaussian and sub-Gaussian isotropic designs, we show that an inequality of the form
$$\inf_{u\in A}\|Xu\|_2 \ge c_1\sqrt{\mathrm{Tr}(\Gamma)} - c_2\sqrt{\Lambda_{\max}(\Gamma)}\, w(A), \qquad (13)$$
where Λ_max(Γ) denotes the largest eigenvalue of Γ, always holds with high probability. Interestingly, the sample complexity of the RE condition here scales as the stable rank Tr(Γ)/Λ_max(Γ) of the covariance matrix Γ. Further, for the special case of independent samples, we recover the results discussed above for isotropic designs.

Generalized linear models and restricted strong convexity: For convex loss functions coming from generalized linear models (GLMs), the sample complexity and the associated phase transition behavior are determined by the restricted strong convexity (RSC) condition [26]. By generalizing our argument for RE conditions corresponding to squared loss, we show that the RSC conditions are satisfied for convex losses corresponding to GLMs for sub-Gaussian designs at the same order of sample complexity as that for squared loss. In particular, for A = cone(E_r) ∩ S^{p−1}, we show a high-probability lower bound of the form
$$\inf_{u\in A}\sqrt{n\,\delta\mathcal{L}(u, \theta^*)} \ge c_1\sqrt{n} - c_2\, w(A), \qquad (14)$$
where the constants c₁, c₂ depend on the probability mass in the tails of the design matrix distribution. Interestingly, the sample complexity still scales as O(w²(A)) for any spherical cap A. The result is thus a considerable generalization of earlier results on GLMs, which had looked at specific norms and associated cones and/or did not express the results in terms of the Gaussian width of A [26].

Putting everything together: With the above results in place, from (8), the main bound takes the form
$$\|\hat\Delta_n\|_2 \le \frac{1+\beta}{\beta}\,\Psi(E_r)\,\frac{c_2\, w(\Omega_R)}{\max\big(0,\, 1 - c_1 w(A)/\sqrt{n}\big)\,\sqrt{n}} \qquad (15)$$
with high probability, where w(Ω_R) is the Gaussian width of the unit norm ball, w(A) is the Gaussian width of the spherical cap corresponding to the error set E_r, and the result is valid only when n > n_A = O(w²(A)), which corresponds to the sample complexity. For the special case of the L1 norm, i.e., Lasso, the sample complexity n_A is of the order w²(A) = O(s log p). Further, w(Ω_R) = O(√(log p)) and Ψ(E_r) = √s. Plugging in these values and choosing β = 2, for n > c₃ s log p the bound ‖Δ̂_n‖₂ ≤ c√(s log p / n) holds with high probability. For other norms, one can simply plug in the widths to get the corresponding sample complexity and non-asymptotic error bounds.

The rest of the paper is organized as follows: Section 2 presents results on the restricted error set and deterministic error bounds under suitable bounds on the regularization parameter λ_n and RSC assumptions. Section 3 presents a characterization of λ_n in terms of the Gaussian width of the unit norm ball for Gaussian as well as sub-Gaussian designs and noise. Section 4 proves RE conditions and associated sample complexity results corresponding to squared loss functions. Results are presented for Gaussian and sub-Gaussian designs, including anisotropic and correlated cases, and always in terms of the Gaussian width of the spherical cap corresponding to the error set. Section 5 presents RSC conditions corresponding to general convex losses arising from generalized linear models, and the results are again in terms of the Gaussian width of the spherical cap corresponding to the error set. We conclude in Section 6. All technical arguments and proofs are in the appendix, along with a gentle exposition to Gaussian widths and related results.

A brief word on the notation used. We denote random matrices as X and random vectors as X_i, where i may be an index to a row or column of a random matrix. Vector norms are denoted as ‖·‖, e.g., ‖X_i‖₂ for a (random) vector X_i, and norms of random variables are denoted as |||·|||, e.g., |||X|||₂ = E[‖X‖₂].
Figure 1: Relationship between the error set of the regularized problem (A_r = E_r ∩ ρB_2^p, green region) and the constrained problem (A_c = cone(E_c) ∩ ρB_2^p, gray region) after intersection with a ball of radius ρ. While A_r will be larger in general, it will be within a constant factor of A_c in terms of Gaussian width (best viewed in color).
2 Restricted Error Set and Recovery Guarantees
In this section, we give a characterization of the restricted error set E_r in which the error vector Δ̂_n = (θ̂_{λ_n} − θ*) lies, establish clear relationships between the error sets for the regularized and constrained problems, and finally establish upper bounds on the estimation error. The error bound is deterministic, but involves quantities depending on θ*, X, and ω, for which we develop high probability bounds in Sections 3, 4, and 5.
2.1 The Restricted Error Set and the Error Cone
We start with a characterization of the restricted error set E_r to which Δ̂_n will belong.

Lemma 1 For any β > 1, assume that
$$\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n)), \qquad (16)$$
where R^*(·) is the dual norm of R(·). Then the error vector Δ̂_n = θ̂_{λ_n} − θ* belongs to the set
$$E_r = E_r(\theta^*, \beta) = \left\{\Delta\in\mathbb{R}^p \;\Big|\; R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta}R(\Delta)\right\}. \qquad (17)$$

The restricted error set E_r need not be convex for general norms. Interestingly, for β = 1, the inequality in (17) is just the triangle inequality, and is satisfied by all Δ. Note that β > 1 restricts the set of Δ which satisfy the inequality, yielding the restricted error set. In particular, Δ cannot go in the direction of θ*, i.e., Δ ≠ αθ* for any α > 0. Further, note that the condition in (16) is similar to that in [26] for β = 2, but the above characterization holds for any norm, not just decomposable norms [26].
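To make the definitions concrete, the following sketch (a minimal illustration written for this exposition; the dimensions, sparsity level, and the choice of the L1 norm are assumptions, not prescribed by the paper) checks whether a candidate error vector Δ lies in the restricted error set E_r of (17) and in the constrained error set E_c defined below in (19):

```python
import numpy as np

l1 = lambda v: np.abs(v).sum()

def in_Er(theta_star, delta, beta, norm=l1):
    # Restricted error set (17): R(theta* + Delta) <= R(theta*) + R(Delta)/beta
    return norm(theta_star + delta) <= norm(theta_star) + norm(delta) / beta

def in_Ec(theta_star, delta, norm=l1):
    # Constrained error set (19): R(theta* + Delta) <= R(theta*)
    return norm(theta_star + delta) <= norm(theta_star)

p, s, beta = 100, 5, 2.0
theta_star = np.zeros(p)
theta_star[:s] = 1.0                          # s-sparse parameter

delta_on = -0.5 * theta_star                  # shrinks the support of theta*
delta_off = np.zeros(p); delta_off[s] = 0.5   # adds mass off the support

print(in_Er(theta_star, delta_on, beta), in_Ec(theta_star, delta_on))    # True True
print(in_Er(theta_star, delta_off, beta), in_Ec(theta_star, delta_off))  # False False
```

Directions that only add norm mass on top of θ* violate both inequalities, while directions that shrink R(θ* + Δ) belong to both sets, illustrating E_c ⊆ E_r.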
Figure 2: Schematic for the norm regularized objective functions considered. (a) Loss and regularizer. (b) Overall objective. The finite sample estimate θ̂_{λ_n} has lower empirical loss than the optimum θ*. Bounding the difference between the losses yields a bound on ‖θ̂_{λ_n} − θ*‖.

While E_r need not be a convex set, we establish a relationship between E_r and the error set E_c corresponding to constrained estimators [13, 7, 14, 15]. A recent example of such a constrained estimator is the generalized Dantzig selector (GDS) [15], given by:
$$\hat\theta_{\varphi_n} = \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p} R(\theta) \quad \text{s.t.} \quad R^*(X^T(y - X\theta)) \le \varphi_n, \qquad (18)$$
where R^*(·) denotes the dual norm of R(·). One can show [14, 15] that the restricted error set for such constrained estimators [14, 15, 36] is of the form:
$$E_c = \{\Delta\in\mathbb{R}^p \mid R(\theta^* + \Delta) \le R(\theta^*)\}. \qquad (19)$$
By definition, it is easy to see that E_c is always convex, and that E_c ⊆ E_r, as shown schematically in Figure 1. The following result establishes a relationship between E_r and E_c in terms of their Gaussian widths.

Theorem 1 Let A_r = E_r ∩ ρB_2^p and A_c = cone(E_c) ∩ ρB_2^p, where ρB_2^p = {u : ‖u‖₂ ≤ ρ} is the L2 ball of any radius ρ > 0. Then, for any β > 1 we have
$$w(A_r) \le \left(1 + \frac{2}{\beta - 1}\,\frac{\|\theta^*\|_2}{\rho}\right) w(A_c), \qquad (20)$$
where w(A) denotes the Gaussian width of any set A, given by w(A) = E_g[sup_{a∈A}⟨a, g⟩], with g an isotropic Gaussian random vector, i.e., g ∼ N(0, I_{p×p}).

Thus, the Gaussian widths of the error sets of the regularized and constrained problems are closely related. See Figure 1 for more details. In particular, for ‖θ*‖₂ = 1, with ρ = 1, β = 2, we have w(A_r) ≤ 3w(A_c). Related observations have been made for the special case of the L1 norm [7], although past work did not provide an explicit characterization in terms of Gaussian widths. The result also suggests that it is possible to move between the error analysis of the regularized and the constrained versions of the estimation problem.
2.2 Recovery Guarantees

In order to establish recovery guarantees, we start by assuming that restricted strong convexity (RSC) is satisfied by the loss function in E_r, the error set, so that for any Δ ∈ E_r there exists a suitable constant κ so that
$$\delta\mathcal{L}(\Delta, \theta^*) \triangleq \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle\nabla\mathcal{L}(\theta^*), \Delta\rangle \ge \kappa\|\Delta\|_2^2. \qquad (21)$$
In Sections 4 and 5, we establish precise forms of the RSC condition for a wide variety of design matrices and loss functions. In order to establish recovery guarantees, we focus on the quantity
$$\mathcal{F}(\Delta) = \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n\big(R(\theta^* + \Delta) - R(\theta^*)\big). \qquad (22)$$
Since θ̂_{λ_n} = θ* + Δ̂_n is the estimated parameter, i.e., θ̂_{λ_n} is the minimum of the objective, we clearly have F(Δ̂_n) ≤ 0, which implies a bound on ‖Δ̂_n‖₂. Unlike previous analyses, the bound can be established without making any additional assumptions on the norm R(θ). We start with the following result, which expresses the upper bound on ‖Δ̂_n‖₂ in terms of the gradient of the objective at θ*.

Lemma 2 Assume that the RSC condition is satisfied in E_r by the loss L(·) with parameter κ. With Δ̂_n = θ̂_{λ_n} − θ*, for any norm R(·), we have
$$\|\hat\Delta_n\|_2 \le \frac{1}{\kappa}\,\|\nabla\mathcal{L}(\theta^*) + \lambda_n\nabla R(\theta^*)\|_2, \qquad (23)$$
where ∇R(·) is any sub-gradient of the norm R(·).

Figure 3 illustrates the above result. Note that the right hand side is simply the L2 norm of the gradient of the objective evaluated at θ*. For the special case when θ̂_{λ_n} = θ*, the gradient of the objective is zero, implying correctly that ‖Δ̂_n‖₂ = 0. While the above result provides useful insights about the bound on ‖Δ̂_n‖₂, the quantities on the right hand side depend on θ*, which is unknown. We present another form of the result in terms of quantities such as λ_n, κ, and the norm compatibility constant Ψ(E_r) = sup_{u∈E_r} R(u)/‖u‖₂, which are often easier to compute or bound.

Theorem 2 Assume that the RSC condition is satisfied in E_r by the loss L(·) with parameter κ. With Δ̂_n = θ̂_{λ_n} − θ*, for any norm R(·), we have
$$\|\hat\Delta_n\|_2 \le \Psi(C_r)\,\frac{1+\beta}{\beta}\,\frac{\lambda_n}{\kappa}. \qquad (24)$$
The above result is deterministic, but contains λn and κ. In Section 3, we give precise characterizations of λn , which needs to satisfy (16). In Sections 4 and 5, we characterize the RSC condition constant κ for different losses and a variety of design matrices.
2.3 A Special Case: Decomposable Norms

In recent work, [26] considered regularized regression for the special case of decomposable norms, defined in terms of a pair of subspaces M ⊆ M̄ of R^p. The model is assumed to be in the subspace M, and the definition considers the so-called perturbation subspace M̄^⊥, which is the orthogonal complement of M̄. A norm R(·) is considered decomposable with respect to the subspace pair (M, M̄^⊥) if R(θ + γ) = R(θ) + R(γ) for all θ ∈ M and γ ∈ M̄^⊥.
Figure 3: Schematic illustrating the error bound in Lemma 2. (a) Gradients of loss and norm. (b) Gradient of the overall objective. Under restricted strong convexity (RSC) of the loss function in the error set E_r, the error ‖Δ̂_n‖₂ can be bounded in terms of the gradient of the overall objective evaluated at θ*.

We show that for decomposable norms, the error set E_r in our analysis is included in the error cone defined in [26]. In the current context, let β = 2 and θ* ∈ M; then for any Δ = Δ_{M̄^⊥} + Δ_{M̄} ∈ E_r, we have
$$R(\theta^* + \Delta) \le R(\theta^*) + \tfrac{1}{2}R(\Delta) \qquad (25)$$
$$\Rightarrow\; R(\theta^* + \Delta_{\bar{M}^\perp} + \Delta_{\bar{M}}) \le R(\theta^*) + \tfrac{1}{2}R(\Delta_{\bar{M}^\perp} + \Delta_{\bar{M}}) \qquad (26)$$
$$\overset{(a)}{\Rightarrow}\; R(\theta^* + \Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) \le R(\theta^*) + \tfrac{1}{2}R(\Delta_{\bar{M}^\perp}) + \tfrac{1}{2}R(\Delta_{\bar{M}}) \qquad (27)$$
$$\overset{(b)}{\Rightarrow}\; R(\theta^*) + R(\Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) \le R(\theta^*) + \tfrac{1}{2}R(\Delta_{\bar{M}^\perp}) + \tfrac{1}{2}R(\Delta_{\bar{M}}) \qquad (28)$$
$$\Rightarrow\; R(\Delta_{\bar{M}^\perp}) \le 3R(\Delta_{\bar{M}}), \qquad (29)$$
where inequality (a) follows from the triangle inequality and (b) follows from decomposability of the norm. The last inequality is precisely the error cone in [26] for θ* ∈ M. As a result, for any Δ ∈ E_r, for decomposable norms we have
$$R(\Delta) = R(\Delta_{\bar{M}^\perp} + \Delta_{\bar{M}}) \le R(\Delta_{\bar{M}^\perp}) + R(\Delta_{\bar{M}}) \le 4R(\Delta_{\bar{M}}). \qquad (30)$$
Hence, the norm compatibility constant can be bounded as
$$\Psi(E_r) = \sup_{\Delta\in E_r}\frac{R(\Delta)}{\|\Delta\|_2} \le 4\sup_{\Delta\in E_r}\frac{R(\Delta_{\bar{M}})}{\|\Delta\|_2} \le 4\sup_{u\in\bar{M}\setminus\{0\}}\frac{R(u)}{\|u\|_2} = 4\Psi(\bar{M}), \qquad (31)$$
where Ψ(M̄) is the subspace compatibility constant in M̄, as used in [26].
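A quick numerical illustration of decomposability and the subspace compatibility constant (a sketch written for this exposition; the support set and dimensions are arbitrary assumptions): for the L1 norm with M̄ the subspace of vectors supported on a fixed set S of size s, one has ‖θ + γ‖₁ = ‖θ‖₁ + ‖γ‖₁ for θ ∈ M̄ and γ ∈ M̄^⊥, and Ψ(M̄) = sup_{u∈M̄}‖u‖₁/‖u‖₂ = √s.

```python
import numpy as np

rng = np.random.default_rng(1)
p, s = 50, 4
S = np.arange(s)                                          # fixed support: the subspace M-bar

theta = np.zeros(p); theta[S] = rng.normal(size=s)        # theta in M-bar
gamma = np.zeros(p); gamma[s:] = rng.normal(size=p - s)   # gamma in M-bar-perp

# Decomposability of the L1 norm over (M-bar, M-bar-perp)
print(np.isclose(np.abs(theta + gamma).sum(),
                 np.abs(theta).sum() + np.abs(gamma).sum()))   # True

# Subspace compatibility: ||u||_1 / ||u||_2 over M-bar is maximized by an
# equal-magnitude sign pattern on S, giving sqrt(s).
u = np.zeros(p); u[S] = 1.0
print(np.abs(u).sum() / np.linalg.norm(u), np.sqrt(s))         # 2.0  2.0
```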
3 Bounds on the Regularization Parameter

Recall that the parameter λ_n needs to satisfy the inequality
$$\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n)). \qquad (32)$$
The right hand side of the inequality has two issues: the expression depends on θ*, the optimal parameter which is unknown, and the expression is a random variable, since it depends on Z^n. In this section, we characterize E[R^*(∇L(θ*; Z^n))] in terms of the Gaussian width of the unit norm ball Ω_R = {u : R(u) ≤ 1}, and also discuss high-probability upper bounds. For ease of exposition, we present results for the case of squared loss, i.e., L(θ*; Z^n) = (1/2n)‖y − Xθ*‖₂², with the linear model y = Xθ* + ω, where ω can be Gaussian or sub-Gaussian noise. For this setting,
$$\nabla\mathcal{L}(\theta^*; Z^n) = \frac{1}{n}X^T(y - X\theta^*) = \frac{1}{n}X^T\omega. \qquad (33)$$
Before presenting the results, we introduce a few notations. We let ‖·‖_op denote the norm of linear operators which map from the Banach space endowed with norm R(·) to itself, and let Λ_max(·) denote the largest eigenvalue of a square matrix. We also recall the definition of the sub-Gaussian norm of a sub-Gaussian variable x, |||x|||_{ψ₂} = sup_{p≥1} p^{−1/2}(E[|x|^p])^{1/p} [38].

Gaussian Designs: First, we consider Gaussian designs, where X can be independent isotropic, independent anisotropic, or dependent isotropic, and the noise ω is elementwise independent Gaussian or sub-Gaussian. For independent isotropic design, X has i.i.d. entries sampled from N(0, 1). For independent anisotropic Gaussian design, we assume the rows of X ∈ R^{n×p} have covariance Σ_{p×p}. For dependent (correlated) isotropic design, the columns of X have covariance Γ_{n×n}.

Theorem 3 Let Ω_R = {u : R(u) ≤ 1}. For Gaussian design X and Gaussian or sub-Gaussian noise ω, we have
$$\mathbb{E}[R^*(\nabla\mathcal{L}(\theta^*; Z^n))] = \mathbb{E}\left[R^*\!\left(\tfrac{1}{n}X^T\omega\right)\right] \le \frac{\eta_1}{\sqrt{n}}\, w(\Omega_R), \qquad (34)$$
where the expectation is taken over both X and ω. The constant η₁ is given by
$$\eta_1 = \begin{cases} 1 & \text{if } X \text{ is independent isotropic,} \\ \sqrt{\Lambda_{\max}(\Sigma)} & \text{if } X \text{ is independent anisotropic,} \\ \sqrt{\Lambda_{\max}(\Gamma)} & \text{if } X \text{ is dependent isotropic.} \end{cases}$$
Further, for any τ > 0, with probability at least (1 − 3 exp(−min(τ²/(2Φ²), cn))), we have
$$R^*(\nabla\mathcal{L}(\theta^*; Z^n)) \le \frac{\eta_1\eta_2}{\sqrt{n}}\big(w(\Omega_R) + \tau\big), \qquad (35)$$
where Φ² = sup_{R(u)=1}‖u‖₂², c is an absolute constant, η₂ = 2 when ω is i.i.d. standard Gaussian, and η₂ = √(2K² + 1) when ω is i.i.d. centered unit-variance sub-Gaussian with sub-Gaussian norm at most K.

Sub-Gaussian Designs: We next consider the setting of sub-Gaussian designs, where X can again be independent isotropic, independent anisotropic, or dependent isotropic, as considered for Gaussian designs, and ω is elementwise independent Gaussian or sub-Gaussian noise.

Theorem 4 Let Ω_R = {u : R(u) ≤ 1}. For sub-Gaussian design X and Gaussian or sub-Gaussian noise ω, we have
$$\mathbb{E}[R^*(\nabla\mathcal{L}(\theta^*; Z^n))] \le \frac{\eta_0\eta_1}{\sqrt{n}}\, w(\Omega_R), \qquad (36)$$
where η₀ is a constant, and η₁ is as in Theorem 3: 1 for independent isotropic X, √(Λ_max(Σ)) for independent anisotropic X, and √(Λ_max(Γ)) for dependent isotropic X.
Interestingly, the analysis for the result above involves a 'sub-Gaussian width,' which can be upper bounded by a constant times the Gaussian width using generic chaining [33]. Further, one can get Gaussian-like exponential concentration around the expectation for important classes of sub-Gaussian random variables, including bounded random variables [22], and random variables X_u = ⟨h, u⟩, where u is any unit vector, whose Malliavin derivatives have almost surely bounded norm in L²[0, 1], i.e., ∫₀¹|D_r X_u|² dr ≤ η² [40]. For convenience, we denote Z₀ = sup_{R(u)≤1}⟨h, u⟩. The ideal scenario is to obtain dimension-independent concentration inequalities of the form
$$P\big(Z_0 - \mathbb{E}[Z_0] \ge \tau\big) \le \eta_3\exp(-\eta_4\tau^2), \qquad (37)$$
where η₃ and η₄ are constants. We show that such inequalities are indeed satisfied by a wide variety of distributions.

Theorem 5 Let h = [h₁ ⋯ h_p] be a vector of independent centered random variables, and let Z₀ = R^*(h) = sup_{R(u)≤1}⟨h, u⟩. Then, we have
$$P\big(Z_0 - \mathbb{E}[Z_0] \ge \tau\big) \le \eta_3\exp(-\eta_4\tau^2), \qquad (38)$$
if h satisfies either of the following conditions: (i) h is bounded, i.e., ‖h‖_∞ = L < ∞; (ii) for any unit vector u, the random variables Y_u = ⟨h, u⟩ are such that their Malliavin derivatives have almost surely bounded norm in L²[0, 1], i.e., ∫₀¹|D_r Y_u|² dr ≤ L².

Interestingly, both of the above conditions imply that h is a sub-Gaussian vector. In particular, the condition in (i) implies a finite sub-Gaussian norm [38]. The implication of condition (ii) is discussed in [40, Lemma 3.3], which in turn relies on [37, Theorem 9.1.1]. Based on Theorem 4 and Theorem 5, we easily get the high-probability upper bound for Z₀,
$$P\big(Z_0 \ge \eta_0 w(\Omega_R) + \tau\big) \le \eta_3\exp(-\eta_4\tau^2).$$
Hence, for any τ > 0, with probability at least 1 − (η₃ + 2)exp(−min(η₄τ², cn)), we have
$$R^*(\nabla\mathcal{L}(\theta^*; Z^n)) \le \frac{\eta_1\eta_2}{\sqrt{n}}\big(\eta_0 w(\Omega_R) + \tau\big), \qquad (39)$$
for the sub-Gaussian design matrix X, where η₁ and η₂ are those defined in Theorem 4.

Bounding the Gaussian width w(Ω_R): In certain cases, one may be able to directly obtain a bound on the Gaussian width w(Ω_R). Here, we provide a mechanism for bounding the Gaussian width w(Ω_R) of the unit norm ball in terms of the Gaussian width of a suitable cone, obtained by shifting or translating the norm ball. In particular, the result involves taking any point on the boundary of the unit norm ball, considering that as the origin, and constructing a cone using the norm ball. Since such a construction can be done with any point on the boundary, the tightest bound is obtained by taking the infimum over all points on the boundary. The motivation behind upper bounding the Gaussian width w(Ω_R) of the unit norm ball in terms of the Gaussian width of such a cone is that considerable advances have been made in recent years in upper bounding Gaussian widths of such cones [14, 2].

Lemma 3 Let Ω_R = {u : R(u) ≤ 1} be the unit norm ball and Θ_R = {u : R(u) = 1} be its boundary. For any θ̃ ∈ Θ_R, let ρ(θ̃) = sup_{θ: R(θ)≤1}‖θ − θ̃‖₂ be the diameter of Ω_R measured with respect to θ̃, and let G(θ̃) = cone(Ω_R − θ̃) ∩ ρ(θ̃)B_2^p, i.e., the cone of (Ω_R − θ̃) intersected with the ball of radius ρ(θ̃). Then
$$w(\Omega_R) \le \inf_{\tilde\theta\in\Theta_R} w(G(\tilde\theta)). \qquad (40)$$
Figure 4: Bounding the Gaussian width of a norm ball, e.g., corresponding to the L1 norm, by shifting the norm ball and using the width of the corresponding cone (Lemma 3).

The approach allows one to directly use existing results on bounding Gaussian widths of certain cones. In some cases, it may be easier to directly bound the Gaussian width of the norm ball, rather than using the shifting argument. The analysis and results for λ_n presented above can be extended to general convex losses arising in the context of GLMs for sub-Gaussian designs and sub-Gaussian noise (see Section 5).
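To see the scaling of Theorem 3 at work in the simplest case, the following Monte Carlo sketch (dimensions and trial counts are assumptions made here for illustration) uses the L1 norm, for which R*(·) = ‖·‖_∞ and w(Ω_R) = E‖g‖_∞, and compares the empirical average of R*(∇L(θ*; Z^n)) = ‖X^Tω/n‖_∞ with w(Ω_R)/√n for an isotropic Gaussian design and Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 200, 1000, 200

# Gaussian width of the unit L1 ball: w(Omega_R) = E ||g||_inf
g = rng.normal(size=(trials, p))
w_ball = np.abs(g).max(axis=1).mean()

# Empirical E[ R*(grad L(theta*; Z^n)) ] = E || X^T omega / n ||_inf
vals = []
for _ in range(trials):
    X = rng.normal(size=(n, p))
    omega = rng.normal(size=n)
    vals.append(np.abs(X.T @ omega).max() / n)

print(f"E ||X^T w / n||_inf   ~ {np.mean(vals):.4f}")
print(f"w(Omega_R) / sqrt(n)  ~ {w_ball / np.sqrt(n):.4f}")   # same order, as in (34)
```

The two printed quantities are of the same order, which is exactly what makes λ_n of order w(Ω_R)/√n a valid (and sharp) choice for the Lasso.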
4 Least Squares Models: Restricted Eigenvalue Conditions

The error bound analysis in Theorem 2 depends on the restricted strong convexity (RSC) assumption. In this section, we establish RSC conditions for both Gaussian and sub-Gaussian random design matrices when the loss function is the squared loss. For squared loss, i.e., L(θ; Z^n) = (1/2n)‖y − Xθ‖₂², the RSC condition (21) becomes equivalent to the restricted eigenvalue (RE) condition [7, 26], i.e., (1/2n)‖XΔ‖₂² ≥ κ‖Δ‖₂² for all Δ ∈ E_r, since
$$\delta\mathcal{L}(\Delta, \theta^*) = \frac{1}{2n}\|X\Delta\|_2^2 = \frac{1}{2n}\sum_{i=1}^n\langle X_i, \Delta\rangle^2. \qquad (41)$$
We make two simplifications which let us develop the RE results in terms of widths of spherical caps rather than over the error set E_r. Let n_{E_r} be the sample complexity for the RE condition over the set E_r, so that for n > n_{E_r} samples, with high probability,
$$\inf_{\Delta\in E_r}\frac{1}{2n}\|X\Delta\|_2^2 \ge \kappa_{E_r}\|\Delta\|_2^2, \qquad (42)$$
for some κ_{E_r} > 0. Let C_r = cone(E_r) and let n_{C_r} be the sample complexity for the RE condition over the cone C_r, so that for n > n_{C_r} samples, with high probability,
$$\inf_{\Delta\in C_r}\frac{1}{2n}\|X\Delta\|_2^2 \ge \kappa_{C_r}\|\Delta\|_2^2, \qquad (43)$$
for some κ_{C_r} > 0. Since E_r ⊆ C_r, we have n_{E_r} ≤ n_{C_r}. Thus, it is sufficient to obtain (an upper bound to) the sample complexity n_{C_r}, since that will serve as an upper bound to n_{E_r}, the sample complexity over E_r. Further, since C_r is a cone, the absolute magnitude ‖Δ‖₂ does not affect the sample complexity. As a result, it is sufficient to focus on the spherical cap A = C_r ∩ S^{p−1}. In particular, if n_A denotes the sample complexity for the RE condition over the spherical cap A, so that for n > n_A samples, with high probability,
$$\inf_{u\in A}\frac{1}{2n}\|Xu\|_2^2 \ge \bar\kappa_A\|u\|_2^2, \qquad (44)$$
for some κ̄_A > 0, then n_A = n_{C_r} ≥ n_{E_r}. Noting that ‖u‖₂ = 1 for u ∈ C_r ∩ S^{p−1}, we present sample complexity results for RE conditions of the form
$$\inf_{u\in A}\|Xu\|_2 \ge \kappa_A(n, p)\sqrt{n}, \qquad (45)$$
where κ_A(n, p) > 0 with high probability for n > n_A.

In this section, we characterize the sample complexity n_A over any given spherical cap A, and establish RE conditions for a variety of Gaussian and sub-Gaussian design matrices X, with isotropic, anisotropic, or dependent rows, i.e., when the samples (rows of X) are not independent. Results for certain types of design matrices and certain types of norms, especially the L1 norm, have appeared in the literature [7, 31, 32]. Our analysis considers a wider variety of design matrices and establishes RE conditions for any A ⊆ S^{p−1}, thus corresponding to all norms. Interestingly, the Gaussian width w(A) of the spherical cap A shows up in all bounds, as a measure of the size of the set A, even for sub-Gaussian design matrices. In fact, all existing RE results do implicitly have the width term, but in a form specific to the chosen norm [31, 32]. The analysis of atomic norms in [14] has the w(A) term explicitly, but the analysis relies on Gordon's inequality [18, 19, 23], which is applicable only for isotropic Gaussian design matrices. The proof technique we use is simple, a standard covering argument, and is largely the same across all design matrices considered. A unique aspect of our analysis, used in all the proofs, is a way to go from covering numbers of A to the Gaussian width of A using Sudakov's inequality, also known as the 'weak converse' of Dudley's inequality, connecting Gaussian widths and covering numbers [16, 23]. Our simple techniques are in contrast to much of the existing literature on RE conditions, which uses specialized tools such as Gaussian comparison principles [31, 26], and/or specialized analysis geared to a particular norm such as L1 [32].
4.1 Restricted Eigenvalue Conditions: Gaussian Designs

Here we focus on the case of Gaussian design matrices X ∈ R^{n×p}, and consider three settings: (i) independent-isotropic, where the entries are elementwise independent; (ii) independent-anisotropic, where the rows X_i are independent but each row has a covariance E[X_iX_i^T] = Σ ∈ R^{p×p}; and (iii) dependent-isotropic, where the rows are isotropic but the columns X_j are correlated with E[X_jX_j^T] = Γ ∈ R^{n×n}. For convenience, we assume E[x_{ij}²] = 1, noting that the analysis easily extends to the general case of E[x_{ij}²] = σ².

Independent Isotropic Gaussian (IIG) Designs: The IIG setting has been extensively studied in the literature [10, 26]. As discussed in the recent work on atomic norms [14], one can use Gordon's inequality [19, 23] to get RE conditions for the IIG setting. Our goal here is two-fold: first, we present the RE conditions obtained using our simple proof technique, and show that the result is equivalent, up to constants, to the RE condition obtained using Gordon's inequality, an arguably heavy-duty technique only applicable to the IIG setting; and second, we go over some facets of how we present the results, which will apply to all subsequent RE-style results as well as give a way to plug κ into the estimation error bound in (24).

Theorem 6 Let the design matrix X ∈ R^{n×p} be elementwise independent and normal, i.e., x_{ij} ∼ N(0, 1). Then, for any A ⊆ S^{p−1}, any n ≥ 2, and any τ > 0, with probability at least (1 − η₁exp(−η₂τ²)), we have
$$\inf_{u\in A}\|Xu\|_2 \ge \frac{1}{2}\sqrt{n} - \eta_0 w(A) - \tau, \qquad (46)$$
where η₀, η₁, η₂ > 0 are absolute constants.

As a result, the sample complexity is n_A = O(w²(A)). We consider the equivalent result one could obtain by directly using Gordon's inequality [19, 23]:

Theorem 7 (Gordon's inequality [19]) Let the design matrix X be elementwise independent and normal, i.e., x_{ij} ∼ N(0, 1). Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2exp(−τ²/2)), we have
$$\inf_{u\in A}\|Xu\|_2 \ge \gamma_n - w(A) - \tau, \qquad (47)$$
where γ_n = E[‖g‖₂] > n/√(n+1) is the expected length of a Gaussian random vector in R^n.
Interestingly, the results are equivalent up to constants. However, unlike Gordon's inequality, our proof technique generalizes to other design matrices, including sub-Gaussian designs. We emphasize three additional aspects in the context of the above analysis, which will continue to hold for all the subsequent results but will not be discussed explicitly. First, to get a form of the result which can be used as κ and plugged into the estimation error bound (24), one can simply choose τ = ½(½√n − η₀w(A)) so as to get
$$\inf_{u\in A}\|Xu\|_2 \ge \frac{1}{4}\sqrt{n} - \frac{\eta_0}{2}w(A), \qquad (48)$$
with high probability. Second, the result does not depend on the size of u, i.e., ‖u‖₂. For the above analysis, u ∈ A ⊆ C_r ∩ S^{p−1}, so that ‖u‖₂ = 1. However, one can consider the cone C_r intersected with a sphere ρS^{p−1} of a different radius ρ, giving A_ρ = C_r ∩ ρS^{p−1} so that u ∈ A_ρ has ‖u‖₂ = ρ. For simplicity, let A = A₁, i.e., corresponding to ρ = 1. Then, a straightforward extension yields inf_{u∈A_ρ}‖Xu‖₂ ≥ (½√n − η₀w(A) − τ)‖u‖₂ with probability at least (1 − η₁exp(−η₂τ²)), since ‖Xu‖₂ = ‖X(u/‖u‖₂)‖₂‖u‖₂ and w(A_{‖u‖₂}) = ‖u‖₂w(A) [14]. Finally, note that the leading constant ½ is a consequence of our choice of ε = ¼ for the ε-net covering of A in the proof. One can get other constants, less than 1, with different choices of ε, and the constants η₀, η₁, η₂ will change based on this choice.

Independent Anisotropic Gaussian (IAG) Designs: We consider a setting where the rows X_i of the design matrix are independent, but each row is sampled from an anisotropic Gaussian distribution, i.e., X_i ∼ N(0, Σ_{p×p}) with X_i ∈ R^p. This setting has been considered in the literature [31] for the special case of the L1 norm, and results have been established using Gaussian comparison techniques [23]. We show that equivalent results can be obtained by our simple technique, which does not rely on Gaussian comparisons [23, 26].

Theorem 8 Let the design matrix X be row-wise independent with each row X_i ∼ N(0, Σ_{p×p}). Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least 1 − η₁exp(−η₂τ²), we have
$$\inf_{u\in A}\|Xu\|_2 \ge \frac{1}{2}\sqrt{\nu}\sqrt{n} - \eta_0\sqrt{\Lambda_{\max}(\Sigma)}\, w(A) - \tau, \qquad (49)$$
where ν = inf_{u∈A}‖Σ^{1/2}u‖₂, Λ_max(Σ) denotes the largest eigenvalue of Σ, and η₀, η₁, η₂ > 0 are constants.

As a result, the sample complexity is n_A = O(Λ_max(Σ)w²(A)/ν). A comparison with the results of [31] is instructive. The leading term √ν appears in both results: the result in [31] is for any u and has a ‖Σ^{1/2}u‖₂ term, whereas we have simply considered inf_{u∈A} on both sides. The second term in [31] depends on the largest entry in the diagonal of Σ, √(log p), and ‖u‖₁. These terms are a consequence of the special-case analysis for the L1 norm. In contrast, we consider the general case and simply get the scaled Gaussian width term √(Λ_max(Σ))w(A).
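For a concrete feel for these RE-style lower bounds, consider the special case A = S^{p−1}, where inf_{u∈A}‖Xu‖₂ is simply the smallest singular value of X and w(A) is of order √p. The sketch below (a rough check written for this exposition; the dimensions and covariance are assumptions, and the unspecified constants η₀, η₁, η₂ are dropped) compares σ_min(X) with the corresponding width-based quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 200

# Isotropic Gaussian design: (46)-style comparison, constants dropped.
X = rng.normal(size=(n, p))
print(np.linalg.svd(X, compute_uv=False).min(), np.sqrt(n) - np.sqrt(p))

# Anisotropic Gaussian design: rows ~ N(0, Sigma) with assumed spectrum.
lam = np.linspace(0.5, 4.0, p)                      # eigenvalues of Sigma
Xa = rng.normal(size=(n, p)) @ np.diag(np.sqrt(lam))
sigma_min = np.linalg.svd(Xa, compute_uv=False).min()
# Over the full sphere, inf_u ||Sigma^{1/2} u||_2 = sqrt(lambda_min(Sigma)).
# (49)-style quantity with all constants set to 1, so it is a loose lower benchmark.
print(sigma_min, np.sqrt(lam.min()) * np.sqrt(n) - np.sqrt(lam.max()) * np.sqrt(p))
```

In both cases the observed smallest singular value sits above the width-based benchmark, and it collapses toward zero only when p becomes comparable to n, which is the phase transition the theorems quantify through w(A).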
Dependent Isotropic Gaussian (DIG) Designs: We now consider a setting where the rows of the design matrix X̃ are isotropic Gaussians, but the columns X̃_j are correlated with E[X̃_jX̃_j^T] = Γ ∈ R^{n×n}. Interestingly, correlation structure over the columns makes the samples dependent, a scenario which has not yet been widely studied in the literature [44, 27]. We show that our simple technique continues to work in this scenario and gives a rather intuitive result.

Theorem 9 Let X̃ ∈ R^{n×p} be a matrix whose rows X̃_i are isotropic Gaussian random vectors in R^p and whose columns X̃_j are correlated with E[X̃_jX̃_j^T] = Γ. Then, for any set A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − η₁exp(−η₂τ²)), we have
$$\inf_{u\in A}\|\tilde Xu\|_2 \ge \frac{3}{4}\sqrt{\mathrm{Tr}(\Gamma)} - \sqrt{\Lambda_{\max}(\Gamma)}\left(\eta_0 w(A) + \frac{5}{2}\right) - \tau, \qquad (50)$$
where η₀, η₁, η₂ > 0 are constants.

Note that with the assumption E[x_{ij}²] = 1, Γ will be a correlation matrix, implying Tr(Γ) = n and making the sample size dependence explicit. Intuitively, due to sample correlations, n samples are effectively equivalent to Tr(Γ)/Λ_max(Γ) = n/Λ_max(Γ) samples. For the special case of Γ = I_{n×n}, Tr(Γ) = n, Λ_max(Γ) = 1, and we recover the result for IIG designs.
4.2 Restricted Eigenvalue Conditions: Sub-Gaussian Designs
In this section, we focus on the case of sub-Gaussian design matrices X ∈ R^{n×p}, and consider three settings: (i) independent-isotropic, where the rows are independent and isotropic; (ii) independent-anisotropic, where the rows X_i are independent but each row has covariance E[X_iX_i^T] = Σ_{p×p}; and (iii) dependent-isotropic, where the rows are isotropic and the columns X_j are correlated with E[X_jX_j^T] = Γ_{n×n}. For convenience, we assume E[x_{ij}²] = 1 and the sub-Gaussian norm |||x_{ij}|||_{ψ₂} ≤ k [38]. In recent work, building on [1], [36] also considers generalizations of RE conditions to sub-Gaussian designs, although our proof techniques are different.

Independent Isotropic Sub-Gaussian Designs: We start with the setting where the sub-Gaussian design matrix X ∈ R^{n×p} has independent rows X_i and each row is isotropic.

Theorem 10 Let X ∈ R^{n×p} be a design matrix whose rows X_i are independent isotropic sub-Gaussian random vectors in R^p. Then, for any set A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2exp(−η₁τ²)), we have
$$\inf_{u\in A}\|Xu\|_2 \ge \sqrt{n} - \eta_0 w(A) - \tau, \qquad (51)$$
where η₀, η₁ > 0 are constants which depend only on the sub-Gaussian norm |||x_{ij}|||_{ψ₂} = k.

As in the case of Gaussian designs, the sample complexity is n_A = O(w²(A)).

Independent Anisotropic Sub-Gaussian Designs: We consider a setting where the rows X_i of the design matrix are independent, but each row is sampled from an anisotropic sub-Gaussian distribution, i.e., |||x_{ij}|||_{ψ₂} = k and E[X_iX_i^T] = Σ_{p×p}.

Theorem 11 Let the sub-Gaussian design matrix X be row-wise independent, with each row satisfying E[X_iX_i^T] = Σ ∈ R^{p×p}. Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2exp(−η₁τ²)), we have
$$\inf_{u\in A}\|Xu\|_2 \ge \sqrt{\nu}\sqrt{n} - \eta_0\Lambda_{\max}(\Sigma)\, w(A) - \tau, \qquad (52)$$
where ν = inf_{u∈A}‖Σ^{1/2}u‖₂, Λ_max(Σ) denotes the largest eigenvalue of Σ, and η₀, η₁ > 0 are constants which depend on the sub-Gaussian norm |||x_{ij}|||_{ψ₂} = k.
As a result, the sample complexity is n_A = O(Λ²_max(Σ)w²(A)/ν). Note that [32] establishes RE conditions for anisotropic sub-Gaussian designs for the special case of the L1 norm. In contrast, our results are general and stated in terms of the Gaussian width w(A).

Dependent Isotropic Sub-Gaussian Designs: We consider the setting where the sub-Gaussian design matrix X̃ has isotropic sub-Gaussian rows, but the columns X̃_j are correlated with E[X̃_jX̃_j^T] = Γ, implying dependent samples.

Theorem 12 Let X̃ ∈ R^{n×p} be a sub-Gaussian design matrix with isotropic rows and correlated columns with E[X̃_jX̃_j^T] = Γ ∈ R^{n×n}. Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2exp(−η₁τ²)), we have
$$\inf_{u\in A}\|\tilde Xu\|_2 \ge \sqrt{\mathrm{Tr}(\Gamma)} - \eta_0\sqrt{\Lambda_{\max}(\Gamma)}\, w(A) - \tau, \qquad (53)$$
where η₀, η₁ are constants which depend on the sub-Gaussian norm |||x_{ij}|||_{ψ₂} = k.
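A small numerical illustration of the dependent-sample setting (a sketch written for this exposition; the AR(1) choice of Γ, the Rademacher entries, and the dimensions are assumptions, and the unspecified constants are dropped): with X̃ = Γ^{1/2}Z for an i.i.d. sign matrix Z, the columns of X̃ have covariance Γ while each row stays isotropic, and over A = S^{p−1} (where w(A) is of order √p) the quantity inf_u‖X̃u‖₂ = σ_min(X̃) can be compared to a (53)-style benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 1000, 100, 0.8

# AR(1) correlation among the n samples (an assumed choice of Gamma).
idx = np.arange(n)
Gamma = rho ** np.abs(idx[:, None] - idx[None, :])
evals, evecs = np.linalg.eigh(Gamma)
Gamma_half = (evecs * np.sqrt(evals)) @ evecs.T

# Rademacher (sub-Gaussian) base matrix: rows of X_tilde are isotropic,
# columns have covariance Gamma, so the samples are dependent.
Z = rng.choice([-1.0, 1.0], size=(n, p))
X_tilde = Gamma_half @ Z

sigma_min = np.linalg.svd(X_tilde, compute_uv=False).min()
benchmark = np.sqrt(np.trace(Gamma)) - np.sqrt(evals.max()) * np.sqrt(p)
print(sigma_min, benchmark)   # observed value sits above the width-based benchmark
```

The benchmark shrinks as Λ_max(Γ) grows, matching the intuition that correlated samples are worth roughly Tr(Γ)/Λ_max(Γ) independent ones.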
4.3 Some Examples
In this section, we give examples of the analysis from the previous sections for three norms: the L1 norm, the group sparse norm, and the L2 norm. A summary of the results is given in Table 1. Other examples can be constructed for norms and error sets with known bounds on the Gaussian widths and norm compatibility constants [14, 26].

L1 Norm: Assume that the statistical parameter θ* is s-sparse, and note that ‖θ*‖₁ ≤ √s‖θ*‖₂. Since the L1 norm is a decomposable norm, following the result in (29) we have Ψ(C_r) ≤ 4Ψ(M̄) = 4√s. Applying Lemma 3, let θ̄ be a 1-sparse vector, with ρ(θ̄) = 2; then w(Ω_R) can be bounded by
$$w(\Omega_R) \le \inf_{\tilde\theta\in\Theta_R} w(G(\tilde\theta)) \le w(G(\bar\theta)) \overset{(a)}{=} O\big(\sqrt{\log p}\big), \qquad (54)$$
where (a) follows from the fact that the Gaussian width of G(θ̃), with θ̃ an s-sparse vector, is √(2s log(p/s) + (5/4)s) [14]. See Figure 4 for more details. From Theorem 3 and (54), the bound on λ_n is
$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{\log p}{n}}\right). \qquad (55)$$
Hence, the recovery error is bounded by
$$\|\hat\Delta_n\|_2 \le c_3\,\Psi(C_r)\,\frac{\lambda_n}{\kappa} = O\left(\sqrt{\frac{s\log p}{n}}\right), \qquad (56)$$
which is similar to the results obtained in well-known prior work [14, 26].

Group Norm: Suppose that the index set {1, 2, ⋯, p} can be partitioned into a set of T disjoint groups, say G = {G₁, G₂, ⋯, G_T}. Define the (1, ν)-group norm for a given vector ν = (ν₁, ⋯, ν_T) ∈ [1, ∞]^T as
$$\|\alpha\|_{G,\nu} = \sum_{t=1}^T\|\alpha_{G_t}\|_{\nu_t}. \qquad (57)$$
As shown in [26], the group norm is a decomposable norm. For a given subset S_G ⊂ {1, 2, ⋯, T} with cardinality s_G = |S_G|, define the subspace A(S_G) = {α ∈ R^p | α_{G_t} = 0, ∀t ∉ S_G}. Let ν_t ≥ 2; then we have
$$\|\Delta\|_{G,\nu} = \sum_{t\in S_G}\|\Delta_{G_t}\|_{\nu_t} \le \sum_{t\in S_G}\|\Delta_{G_t}\|_2 \le \sqrt{s_G}\,\|\Delta\|_2. \qquad (58)$$
Hence, from (31) and (58) we have
$$\Psi(C_r) \le 4\sqrt{s_G}. \qquad (59)$$
Applying Lemma 3, define θ̄ with one active group, with ρ(θ̄) = 2; then w(Ω_R) can be bounded by
$$w(\Omega_R) \le \inf_{\tilde\theta\in\Theta_R} w(G(\tilde\theta)) \le w(G(\bar\theta)) \overset{(a)}{=} O\big(\sqrt{m + \log T}\big), \qquad (60)$$
where m = max_t|G_t|, and (a) follows from the fact that the Gaussian width of G(θ̃), where θ̃ has k active groups, is √(2k(m + log(T − k)) + k) [14]. From Theorem 3 and (60), the bound on λ_n is
$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{m + \log T}{n}}\right). \qquad (61)$$
Hence, the recovery error is bounded by
$$\|\hat\Delta_n\|_2 \le c_3\,\Psi(C_r)\,\frac{\lambda_n}{\kappa} = O\left(\sqrt{\frac{s_G(m + \log T)}{n}}\right), \qquad (62)$$
which is similar to the results obtained in previous works [14, 26].

L2 Norm: With the L2 norm as the regularizer, the norm compatibility constant is
$$\Psi(C_r) = \sup_{\Delta\in C_r}\frac{\|\Delta\|_2}{\|\Delta\|_2} = 1. \qquad (63)$$
Applying Lemma 3, set ρ(θ̃) = 1; then w(Ω_R) can be bounded by
$$w(\Omega_R) \le \inf_{\tilde\theta\in\Theta_R} w(G(\tilde\theta)) = O(\sqrt{p}). \qquad (64)$$
From Theorem 3 and (64), the bound on λ_n is
$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{p}{n}}\right). \qquad (65)$$
Hence, the recovery error is bounded by
$$\|\hat\Delta_n\|_2 \le c_3\,\Psi(C_r)\,\frac{\lambda_n}{\kappa} = O\left(\sqrt{\frac{p}{n}}\right). \qquad (66)$$
5 Generalized Linear Models: Restricted Strong Convexity
In this section, we consider the setting where the conditional probability distribution of y_i|X_i follows an exponential family distribution: p(y_i|X_i; θ) = exp{y_i⟨θ, X_i⟩ − ψ(⟨θ, X_i⟩)}, where ψ(·) is the log-partition function [9, 5, 42]. Generalized linear models consider the negative likelihood of such conditional distributions as the loss function:
$$\mathcal{L}(\theta; Z^n) = \frac{1}{n}\sum_{i=1}^n\big(\psi(\langle\theta, X_i\rangle) - \langle\theta, y_iX_i\rangle\big). \qquad (67)$$
R(u)        | λ_n := c₁ w(Ω_R)/√n    | κ := [max{1 − c₂w(A)/√n, 0}]²  | Ψ(E_r)  | ‖Δ̂_n‖₂ := c₃Ψ(E_r)λ_n/κ
ℓ₁ norm     | O(√(log p / n))        | Θ(1) if n > c₂w²(A)            | √s      | O(√(s log p / n))
Group norm  | O(√((m + log T)/n))    | Θ(1) if n > c₂w²(A)            | √s_G    | O(√(s_G(m + log T)/n))
ℓ₂ norm     | O(√(p/n))              | Θ(1) if n > c₂w²(A)            | 1       | O(√(p/n))

Table 1: A summary of values for the regularization parameter λ_n, the RE condition constant κ, the norm compatibility constant Ψ(E_r), and the recovery bounds ‖Δ̂_n‖₂ for the ℓ₁, group, and ℓ₂ norms, in the case of a Gaussian design matrix with Gaussian noise. All results are given up to constants, with emphasis on the scale of the results.

Least squares regression and logistic regression are popular special cases of GLMs. Since ∇ψ(⟨θ, X_i⟩) = E[y_i|X_i], we have ∇L(θ*; Z^n) = (1/n)X^Tω, where ω_i = ∇ψ(⟨θ, X_i⟩) − y_i = E[y|X_i] − y_i plays the role of noise. Hence, the analysis in Section 3 can be applied assuming ω_i is Gaussian or sub-Gaussian. To obtain RSC conditions for GLMs, first note that
$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) = \frac{1}{n}\sum_{i=1}^n\nabla^2\psi\big(\langle\theta^*, X_i\rangle + \gamma_i\langle\Delta, X_i\rangle\big)\,\langle\Delta, X_i\rangle^2, \qquad (68)$$
where γ_i ∈ [0, 1], by the mean value theorem. Since ψ is of Legendre type [4, 5, 9], the second derivative ∇²ψ(·) is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis considers a suitable compact set where ℓ = ℓ_ψ(T) = min_{|a|≤2T}∇²ψ(a) is bounded away from zero. Outside this compact set, we will only use ∇²ψ(·) > 0. Then,
$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) \ge \frac{\ell}{n}\sum_{i=1}^n\langle X_i, \Delta\rangle^2\,\mathbb{I}[|\langle X_i, \theta^*\rangle| < T]\,\mathbb{I}[|\langle X_i, \Delta\rangle| < T]. \qquad (69)$$
We give a characterization of the RSC condition for independent isotropic sub-Gaussian design matrices X ∈ R^{n×p}. The analysis can be suitably generalized to the other design matrices considered in Section 4 by using the same techniques. As before, we denote Δ as u, and consider u ∈ A ⊆ S^{p−1} so that ‖u‖₂ = 1. Further, we assume ‖θ*‖₂ ≤ c₁ for some constant c₁. Assuming X has sub-Gaussian entries with |||x_{ij}|||_{ψ₂} ≤ k, ⟨X_i, θ*⟩ and ⟨X_i, u⟩ are sub-Gaussian random variables with sub-Gaussian norm at most Ck [39]. Let φ₁ = φ₁(T; u) = P{|⟨X_i, u⟩| > T} ≤ e·exp(−c₂T²/(C²k²)), and φ₂ = φ₂(T; θ*) = P{|⟨X_i, θ*⟩| > T} ≤ e·exp(−c₂T²/(C²k²)). The result we present is in terms of the constants ℓ = ℓ_ψ(T), φ₁ = φ₁(T; u), and φ₂ = φ₂(T; θ*) for any suitably chosen T.

Theorem 13 Let X ∈ R^{n×p} be a design matrix with independent isotropic sub-Gaussian rows. Then, for any set A ⊆ S^{p−1}, any α ∈ (0, 1), any τ > 0, and any
$$n \ge \frac{5}{\alpha^2(1-\phi_1-\phi_2)}\left(c\,w^2(A) + \frac{c_2(1-\phi_1-\phi_2)(1-\alpha)\tau^2}{K^4}\right)$$
for suitable constants c, c₂, with probability at least 1 − 3exp(−η₁τ²), we have
$$\inf_{u\in A}\sqrt{n\,\delta\mathcal{L}(\theta^*; u, X)} \ge \sqrt{\ell}\,\sqrt{\pi}\,\sqrt{n} - \eta_0 w(A) - \tau, \qquad (70)$$
where π = (1 − α)(1 − φ₁ − φ₂), ℓ = ℓ_ψ(T) = min_{|a|≤2T+β}∇²ψ(a), and the constants (η₀, η₁) depend on the sub-Gaussian norm |||x_{ij}|||_{ψ₂} = k.
The form of the result is closely related to the corresponding result for the RE condition on inf_{u∈A}‖Xu‖₂ in Section 4.2. Note that RSC analysis for GLMs was considered in [26] for specific norms, especially L1, whereas our analysis applies to any set A ⊆ S^{p−1}, and hence to any norm. Further, following a similar argument structure as in Section 4.2, the analysis for GLMs can be extended to anisotropic and dependent design matrices.
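For a concrete instance of the quantities in Theorem 13 (a sketch with assumed values of T; logistic regression is used here as the canonical GLM whose log-partition function is ψ(t) = log(1 + e^t)), ℓ_ψ(T) = min_{|a|≤2T}∇²ψ(a) can be computed in closed form, since ∇²ψ(a) = σ(a)(1 − σ(a)) is symmetric and decreasing in |a|:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ell_psi(T):
    # psi''(a) = sigma(a)(1 - sigma(a)) is maximal at a = 0 and decreasing in |a|,
    # so the minimum over |a| <= 2T is attained at the boundary a = 2T.
    a = 2.0 * T
    return sigmoid(a) * (1.0 - sigmoid(a))

for T in [0.5, 1.0, 2.0, 4.0]:
    print(T, ell_psi(T))
# ell_psi(T) stays bounded away from zero for moderate T, which is what makes
# the truncated lower bound (69) non-trivial.
```

The trade-off is visible in the theorem: a larger T makes ℓ_ψ(T) smaller but makes the truncation probabilities φ₁, φ₂ smaller as well, so T is chosen to balance the two.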
6 Conclusions
The paper presents a general set of results and tools for characterizing non-asymptotic estimation error in norm regularized regression problems. The analysis holds for any norm, and subsumes much of the existing literature focused on structured sparsity and related themes. The work can be viewed as a direct generalization of results in [26], which presented related results for decomposable norms. Our analysis illustrates the important role that Gaussian widths, as a measure of the size of suitable sets, play in such results. Further, the error sets for regularized and constrained versions of such problems are shown to be closely related [7].

While the paper presents a unified geometric treatment of non-asymptotic structured estimation with regularized estimators, several technical questions need further investigation. The focus of the analysis has been on thin-tailed distributions, and the RE/RSC type analysis presented really gives two-sided bounds, i.e., RIP, showing that thin-tailed distributions do satisfy the RIP condition. For heavy-tailed measurements, the lower and upper tails of quadratic forms behave differently [28, 1], and it may be possible to establish geometric estimation error analysis for general norms, some special cases of which have been investigated in recent years [20, 21, 1]. Further, the sample complexity of the phase transitions in the RE/RSC conditions for anisotropic designs depends on the largest eigenvalue (operator norm) of the covariance matrix, making the estimator sample inefficient for highly correlated designs. Since several real-world problems, including spatial and temporal problems, do have correlated observations, it will be important to investigate estimators which perform well in such settings [17]. Finally, the focus of this work is on parametric estimation, and it will be interesting to explore generalizations of the analysis to non-parametric settings.

Acknowledgements: We thank the reviewers of the conference version [3] for helpful comments and suggestions on related work. We thank Sergey Bobkov, Snigdhansu Chatterjee, and Pradeep Ravikumar for helpful discussions related to the paper. The research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.
Appendix A: Background and Preliminaries
We start with a review of some definitions and well-known results which will be used for our proofs.
A.1 Gaussian Width
In several of our proofs, we use the concept of the Gaussian width [19, 14], which is defined as follows.

Definition 1 (Gaussian width) For any set A ⊆ R^p, the Gaussian width of the set A is defined as
$$w(A) = \mathbb{E}_g\left[\sup_{u\in A}\langle g, u\rangle\right], \qquad (71)$$
where the expectation is over g ∼ N(0, I_{p×p}), a vector of independent zero-mean unit-variance Gaussian random variables.

The Gaussian width w(A) provides a geometric characterization of the size of the set A. We consider three perspectives on the Gaussian width, and provide some properties which are used in our analysis. First, consider the Gaussian process {Z_u} whose constituent Gaussian random variables Z_u = ⟨u, g⟩ are indexed by u ∈ A, with g ∼ N(0, I_{p×p}). Then the Gaussian width w(A) can be viewed as the expectation of the supremum of the Gaussian process {Z_u}. Bounds on the expectations of Gaussian and other empirical processes have been widely studied in the literature, and we will make use of generic chaining for some of our analysis [33, 34, 8, 23]. Second, ⟨u, g⟩ can be viewed as a Gaussian random projection of each u ∈ A to one dimension, and the Gaussian width simply measures the expectation of the largest value of such projections. Third, if A is the unit ball of any norm R(·), i.e., A = {x ∈ R^p | R(x) ≤ 1}, then w(A) = E_g[R^*(g)] by the definition of the dual norm. Thus, the Gaussian width is the expected value of the dual norm of a standard Gaussian random vector. For instance, if A is the unit ball of the L1 norm, w(A) = E[‖g‖_∞].

Below we list some simple and useful properties of the Gaussian width of A ⊆ R^p:

Property 1: w(A) ≤ w(B) for A ⊆ B.
Property 2: w(A) = w(conv(A)), where conv(A) denotes the convex hull of A.
Property 3: w(cA) = c·w(A) for any positive scalar c, where cA = {cx | x ∈ A}.
Property 4: w(ΓA) = w(A) for any orthogonal matrix Γ ∈ R^{p×p}.
Property 5: w(A + b) = w(A) for any A ⊆ R^p and fixed b ∈ R^p.

The last two properties illustrate that the Gaussian width is rotation and translation invariant.
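The definition lends itself to a direct Monte Carlo estimate (a sketch written for this exposition; the dimension and sample count are arbitrary choices). For the unit L1 ball, sup_{‖u‖₁≤1}⟨g, u⟩ = ‖g‖_∞, and for the unit L2 ball it is ‖g‖₂, so the estimates should be of order √(2 log p) and √p respectively:

```python
import numpy as np

rng = np.random.default_rng(0)
p, samples = 1000, 2000
G = rng.normal(size=(samples, p))

# w(unit L1 ball) = E sup_{||u||_1 <= 1} <g, u> = E ||g||_inf
w_l1 = np.abs(G).max(axis=1).mean()
# w(unit L2 ball) = E ||g||_2
w_l2 = np.linalg.norm(G, axis=1).mean()

print(w_l1, np.sqrt(2 * np.log(p)))   # ~3.4-3.5 vs 3.72 (same order)
print(w_l2, np.sqrt(p))               # ~31.6 vs 31.6
```

The same recipe applies to any set over which the supremum can be computed or approximated, which is how width-based sample complexities can be checked numerically for a specific norm.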
A.2 The Dudley-Sudakov Inequality

The Dudley-Sudakov inequality connects Gaussian widths with covering numbers [33, 23]. We begin with the definition of the covering number of a set A.

Definition 2 (Covering number) For any set A ⊆ S^{p−1} and any ε > 0, we say N_ε(A) ⊆ A is an ε-net of A, or ε-cover of A, if for every u ∈ A there exists a v ∈ N_ε(A) such that ‖u − v‖₂ ≤ ε. The covering number N(A, ε) is then the smallest cardinality of an ε-cover of A, defined as
$$N(A, \varepsilon) = \min\{|N_\varepsilon(A)| : N_\varepsilon(A) \text{ is an } \varepsilon\text{-net of } A\}. \qquad (72)$$

The covering number can be defined using distances d different from the L2 norm, and such covering numbers are denoted N(A, ε, d). The Dudley-Sudakov inequality and our usage of covering numbers in this paper are based on the L2 distance. The Dudley-Sudakov inequality [16, 23] gives upper and lower bounds on the Gaussian width in terms of covering numbers. In particular, the inequality states that
$$\sup_{\varepsilon>0} c\,\varepsilon\sqrt{\log N(A, \varepsilon)} \;\le\; w(A) \;\le\; \int_0^\infty\sqrt{\log N(A, \varepsilon)}\,d\varepsilon, \qquad (73)$$
where c > 0 is an absolute constant. For our analysis, we use the lower bound, which is often called weak converse of Dudley’s inequality or Sudakov’s inequality. Without loss of generality, our analysis will often use = 14 . With such a specific constant choice of , the weak converse of Dudley’s inequality gives an upper bound on the covering number in terms of the Gaussian width, i.e., 1 ≤ exp c1 w2 (A) . (74) N A, 4 The following lemmas give useful mechanisms to move from analysis done on a -net of a set to the full set. Lemma 4 Let N (A) be a -net of A for some ∈ [0, 1). Then, suphx, si ≤ max |hv, si| + ksk2 . v∈N (A)
x∈A
(75)
Lemma 5 Let C be a symmetric p × p matrix, and let N (A) be an -net of A ⊆ S p−1 for some ∈ [0, 12 ). Then, 1 max |hCv, vi| . (76) sup |hCu, ui| ≤ 1 − 2 v∈N (A) u∈A
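As a small sanity check of Lemma 4, the sketch below builds a greedy ε-net of a dense sample of the unit circle S^1 and compares sup_{x∈A} ⟨x, s⟩ with the net maximum plus ε‖s‖_2; the greedy construction and the grid resolution are illustrative choices, not the construction used in the proofs.

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedy epsilon-net: keep a point only if it is farther than eps from all kept points."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - v) > eps for v in net):
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(0)
eps = 0.25
# Dense sample of the unit circle S^1 standing in for the set A.
angles = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
A = np.stack([np.cos(angles), np.sin(angles)], axis=1)
net = greedy_eps_net(A, eps)

s = rng.standard_normal(2)
lhs = (A @ s).max()                                    # sup_{x in A} <x, s>
rhs = np.abs(net @ s).max() + eps * np.linalg.norm(s)  # net bound from Lemma 4
print(f"|net| = {len(net)},  sup = {lhs:.3f},  net bound = {rhs:.3f}")
assert lhs <= rhs + 1e-9
```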
A.3 Sub-Gaussian and Sub-exponential Random Variables (Vectors)
In the proofs, we will also frequently use properties of sub-Gaussian and sub-exponential random variables (and vectors). In particular, we are interested in their definitions in terms of moments.

Definition 3 (Sub-Gaussian / sub-exponential random variable) We say that a random variable x is sub-Gaussian (sub-exponential) if its moments satisfy

(E|x|^p)^{1/p} ≤ K_2 √p    ( (E|x|^p)^{1/p} ≤ K_1 p )   (77)

for all p ≥ 1 with a constant K_2 (K_1). The minimum value of K_2 (K_1) is called the sub-Gaussian (sub-exponential) norm of x, denoted |||x|||_{ψ2} (|||x|||_{ψ1}).

Definition 4 (Sub-Gaussian / sub-exponential random vector) We say that a random vector X in R^n is sub-Gaussian (sub-exponential) if the one-dimensional marginals ⟨X, x⟩ are sub-Gaussian (sub-exponential) random variables for all x ∈ R^n. The sub-Gaussian (sub-exponential) norm of X is defined as

|||X|||_{ψ2} = sup_{x ∈ S^{n−1}} |||⟨X, x⟩|||_{ψ2}    ( |||X|||_{ψ1} = sup_{x ∈ S^{n−1}} |||⟨X, x⟩|||_{ψ1} ) .   (78)

The following definitions and lemmas are from [38].

Lemma 6 Consider a finite number of independent centered sub-Gaussian random variables X_i. Then Σ_i X_i is also a centered sub-Gaussian random variable. Moreover,

||| Σ_i X_i |||_{ψ2}^2 ≤ C Σ_i |||X_i|||_{ψ2}^2 .   (79)

Lemma 7 Let X_1, …, X_n be independent centered sub-Gaussian random variables. Then X = (X_1, …, X_n) is a centered sub-Gaussian random vector in R^n, and

|||X|||_{ψ2} ≤ C max_{i ≤ n} |||X_i|||_{ψ2} ,   (80)

where C is an absolute constant.

Lemma 8 Consider a sub-Gaussian random vector X with sub-Gaussian norm K = max_i |||X_i|||_{ψ2}. Then Z = ⟨X, a⟩ is a sub-Gaussian random variable with sub-Gaussian norm |||Z|||_{ψ2} ≤ CK‖a‖_2.

Lemma 9 A random variable X is sub-Gaussian if and only if X^2 is sub-exponential. Moreover,

|||X|||_{ψ2}^2 ≤ |||X^2|||_{ψ1} ≤ 2 |||X|||_{ψ2}^2 .   (81)

Lemma 10 If X is sub-Gaussian (or sub-exponential), then so is X − EX. Moreover,

|||X − EX|||_{ψ2} ≤ 2 |||X|||_{ψ2} ,   |||X − EX|||_{ψ1} ≤ 2 |||X|||_{ψ1} .   (82)
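The moment characterization in Definition 3 can be probed numerically: the sketch below estimates max_p (E|x|^p)^{1/p}/√p over a small range of p for a Gaussian and a Rademacher variable (both sub-Gaussian), and for a heavy-tailed Student-t variable where the ratio is not bounded; the range of p and the sample sizes are illustrative choices.

```python
import numpy as np

def empirical_psi2_proxy(samples, p_max=10):
    """Proxy for the sub-Gaussian norm: max over p of (E|x|^p)^(1/p) / sqrt(p)."""
    ratios = []
    for p in range(1, p_max + 1):
        moment_p = np.mean(np.abs(samples) ** p) ** (1.0 / p)
        ratios.append(moment_p / np.sqrt(p))
    return max(ratios)

rng = np.random.default_rng(0)
n = 200000
gauss = rng.standard_normal(n)
rademacher = rng.choice([-1.0, 1.0], size=n)
heavy = rng.standard_t(df=3, size=n)   # heavy-tailed: the moment ratio keeps growing with p

for name, x in [("Gaussian", gauss), ("Rademacher", rademacher), ("Student-t(3)", heavy)]:
    print(f"{name:12s}: max_p (E|x|^p)^(1/p)/sqrt(p) ~ {empirical_psi2_proxy(x):.2f}")
```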
Appendix B: Restricted Error Set and Recovery Guarantees
Section 2 of the paper concerns the restricted error set. Lemma 1 characterizes the restricted error set. Theorem 1 establishes the relation between the constrained and regularized error sets; in particular, we prove that the Gaussian widths of the regularized and constrained error sets (cones) are of the same order. Starting from the assumption that the RSC condition is satisfied, Lemma 2 and Theorem 2 derive upper bounds on the L2 norm of the error. We collect the proofs of these results in this section.
B.1 The Restricted Error Set
Lemma 1 in Section 2 characterizes the set to which the error vector belongs. We give the proof of Lemma 1 below.

Lemma 1 For any β > 1, assume that

λ_n ≥ β R*(∇L(θ*; Z^n)) ,   (83)

where R*(·) is the dual norm of R(·). Then the error vector ∆̂_n = θ̂_{λ_n} − θ* belongs to the set

E_r = E_r(θ*, β) = { ∆ ∈ R^p | R(θ* + ∆) ≤ R(θ*) + (1/β) R(∆) } .   (84)

Proof: By the optimality of θ̂_{λ_n} = θ* + ∆̂_n, we have

L(θ* + ∆̂_n) + λ_n R(θ* + ∆̂_n) − { L(θ*) + λ_n R(θ*) } ≤ 0 .   (85)

Now, since L is convex,

L(θ* + ∆) − L(θ*) ≥ ⟨∇L(θ*), ∆⟩ ≥ −|⟨∇L(θ*), ∆⟩| .   (86)

Further, by the generalized Hölder inequality, we have

|⟨∇L(θ*), ∆⟩| ≤ R*(∇L(θ*)) R(∆) ≤ (λ_n/β) R(∆) ,   (87)

where we have used λ_n ≥ β R*(∇L(θ*; Z^n)). Hence, we have

L(θ* + ∆̂_n) − L(θ*) ≥ −(λ_n/β) R(∆̂_n) .   (88)

As a result,

λ_n { R(θ* + ∆̂_n) − R(θ*) − (1/β) R(∆̂_n) } ≤ 0 .   (89)

Noting that λ_n > 0 and rearranging completes the proof.
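For intuition, the sketch below checks Lemma 1 empirically for the L1 norm: it fits a Lasso-type estimate with a simple ISTA loop (a stand-in solver, not a method from the paper), sets λ_n at β times the dual norm of the gradient at θ*, and verifies that the resulting error vector lies in E_r; the dimensions, sparsity, and iteration count are illustrative.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=3000):
    """Proximal gradient (ISTA) for (1/n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ theta - y)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return theta

rng = np.random.default_rng(0)
n, p, s, beta = 200, 500, 10, 2.0
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0
omega = rng.standard_normal(n)
y = X @ theta_star + omega

grad_at_star = (2.0 / n) * X.T @ (X @ theta_star - y)   # = -(2/n) X^T omega
lam = beta * np.abs(grad_at_star).max()                  # lam >= beta * R*(grad L(theta*))
delta = ista_lasso(X, y, lam) - theta_star

lhs = np.abs(theta_star + delta).sum()                   # R(theta* + Delta)
rhs = np.abs(theta_star).sum() + np.abs(delta).sum() / beta
print(f"R(theta*+Delta) = {lhs:.3f} <= R(theta*) + R(Delta)/beta = {rhs:.3f}: {lhs <= rhs + 1e-8}")
```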
B.2 Relation between the Constrained and Regularized Error Cones
In this section we show that the sizes of the regularized and constrained error sets are of the same order. Recall from [14] that the error set for the constrained setting for atomic norms is a cone given by:

C_c = C_c(θ*) = cone { ∆ ∈ R^p | R(θ* + ∆) ≤ R(θ*) } .   (90)

The error set E_r is given by:

E_r = E_r(θ*, β) = { ∆ ∈ R^p | R(θ* + ∆) ≤ R(θ*) + (1/β) R(∆) } .

Below we provide the proof of Theorem 1.

Theorem 1 Let A_r = E_r ∩ ρB_2^p and A_c = C_c ∩ ρB_2^p, where B_2^p = {u | ‖u‖_2 ≤ 1} is the unit ball of the L2 norm and ρ is any suitable radius. Then, for any β > 1 we have

w(A_r) ≤ ( 1 + (2/(β−1)) (‖θ*‖_2 / ρ) ) w(A_c) ,   (91)

where w(A) denotes the Gaussian width of any set A, given by w(A) = E_g[sup_{a∈A} ⟨a, g⟩] for an isotropic Gaussian random vector g.

Proof: From the triangle inequality, we have R(∆) ≤ R(θ* + ∆) + R(θ*). Then,

E_r(θ*, β) = { ∆ ∈ R^p | R(θ* + ∆) ≤ R(θ*) + (1/β) R(∆) }
⊆ { ∆ ∈ R^p | R(θ* + ∆) ≤ R(θ*) + (1/β) R(θ* + ∆) + (1/β) R(θ*) }
= { ∆ ∈ R^p | (1 − 1/β) R(θ* + ∆) ≤ (1 + 1/β) R(θ*) }
= { ∆ ∈ R^p | R(θ* + ∆) ≤ ((β+1)/(β−1)) R(θ*) } = Ē_r(θ*, β) .   (92)

Let C̄_r(θ*, β) denote the following set:

C̄_r = C̄_r(θ*, β) = cone { ∆ − (2/(β−1)) θ* | ∆ ∈ Ē_r } + (2/(β−1)) θ* .   (93)

It follows naturally from the construction that E_r ⊆ C̄_r. Let Ā_r = C̄_r(θ*, β) ∩ ρB_2^p. Since E_r(θ*, β) ⊆ C̄_r(θ*, β), we have w(A_r) ≤ w(Ā_r). We define two additional sets for our analysis:

B̄_r = Ā_r − (2/(β−1)) θ* = { ∆ ∈ R^p | ∆ + (2/(β−1)) θ* ∈ Ā_r } ,   (94)

D̄_c = C_c(θ*) ∩ ( ρ + (2/(β−1)) ‖θ*‖_2 ) B_2^p = { ∆ ∈ R^p | ∆ ∈ C_c , ‖∆‖_2 ≤ ρ + (2/(β−1)) ‖θ*‖_2 } .   (95)

For any set S and any t > 0 we have w(tS) = t w(S) by the properties of the Gaussian width, so that

w(D̄_c) = ( 1 + (2/(β−1)) (‖θ*‖_2 / ρ) ) w(A_c) .   (96)

Further, since the Gaussian width is translation invariant, we have

w(Ā_r) = w(B̄_r) .   (97)

From the construction it is clear that B̄_r ⊂ D̄_c. Hence we have

w(D̄_c) ≥ w(B̄_r) .   (98)

Then, we have

w(Ā_r) = w(B̄_r) ≤ w(D̄_c) = ( 1 + (2/(β−1)) (‖θ*‖_2 / ρ) ) w(A_c) .

Noting that w(A_r) ≤ w(Ā_r) completes the proof.
B.3 Recovery Guarantees
Lemma 2 and Theorem 2 in the paper establish recovery guarantees. The result in Lemma 2 depends on θ*, which is unknown. On the other hand, Theorem 2 gives the result in terms of quantities such as λ_n and the norm compatibility constant Ψ(E_r) = sup_{u ∈ E_r} R(u)/‖u‖_2, which are easier to compute or bound. In this section we give the proofs of Lemma 2 and Theorem 2.

Lemma 2 Assume that the RSC condition is satisfied in E_r by the loss L(·) with parameter κ. With ∆̂_n = θ̂_{λ_n} − θ*, for any norm R(·), we have

‖∆̂_n‖_2 ≤ (1/κ) ‖∇L(θ*) + λ_n ∇R(θ*)‖_2 ,   (99)

where ∇R(·) is any sub-gradient of the norm R(·).

Proof: By the RSC property in E_r, for any ∆ ∈ E_r we have

L(θ* + ∆) − L(θ*) ≥ ⟨∇L(θ*), ∆⟩ + κ ‖∆‖_2^2 .   (100)
Figure 5: Error cone for the L1 norm in two dimensions: (a) the L1 norm ball; (b) the constrained error cone A_c; (c) the regularized error cone Ā_r and the shifted cone B̄_r; (d) the constrained error cone A_c and the shifted constrained error cone D̄_c; (e) all error cones.
Also, recall that any norm is convex, since by the triangle inequality, for t ∈ [0, 1], we have

R(tθ_1 + (1 − t)θ_2) ≤ R(tθ_1) + R((1 − t)θ_2) = t R(θ_1) + (1 − t) R(θ_2) .   (101)

As a result, for any sub-gradient ∇R(θ*) of R(θ*), we have

R(θ* + ∆) − R(θ*) ≥ ⟨∆, ∇R(θ*)⟩ .   (102)

Adding (100) and λ_n times (102), we get

L(θ* + ∆) − L(θ*) + λ_n (R(θ* + ∆) − R(θ*)) ≥ ⟨∇L(θ*) + λ_n ∇R(θ*), ∆⟩ + κ ‖∆‖_2^2 .   (103)

Now, by the Cauchy–Schwarz inequality, we have

|⟨∇L(θ*) + λ_n ∇R(θ*), ∆⟩| ≤ ‖∇L(θ*) + λ_n ∇R(θ*)‖_2 ‖∆‖_2
⟹ ⟨∇L(θ*) + λ_n ∇R(θ*), ∆⟩ ≥ −‖∇L(θ*) + λ_n ∇R(θ*)‖_2 ‖∆‖_2 .   (104)

Using (104) in (103), we have

F(∆) = L(θ* + ∆) − L(θ*) + λ_n (R(θ* + ∆) − R(θ*))
≥ −‖∇L(θ*) + λ_n ∇R(θ*)‖_2 ‖∆‖_2 + κ ‖∆‖_2^2
= κ ‖∆‖_2 ( ‖∆‖_2 − ‖∇L(θ*) + λ_n ∇R(θ*)‖_2 / κ ) .   (105)

Now, since F(∆̂_n) ≤ 0, from (105) we have

‖∆̂_n‖_2 ≤ ‖∇L(θ*) + λ_n ∇R(θ*)‖_2 / κ ,   (106)

which completes the proof.

Theorem 2 Assume that the RSC condition is satisfied in E_r by the loss L(·) with parameter κ. With ∆̂_n = θ̂_{λ_n} − θ*, for any norm R(·), we have

‖∆̂_n‖_2 ≤ ((1 + β)/β) (λ_n/κ) Ψ(E_r) .   (107)

Proof: By the RSC property in E_r, we have for any ∆ ∈ E_r

L(θ* + ∆) − L(θ*) ≥ ⟨∇L(θ*), ∆⟩ + κ ‖∆‖_2^2 .   (108)

By the definition of the dual norm, we have

|⟨∇L(θ*), ∆⟩| ≤ R*(∇L(θ*)) R(∆) .   (109)

Further, by construction, R*(∇L(θ*)) ≤ λ_n/β, implying

|⟨∇L(θ*), ∆⟩| ≤ (λ_n/β) R(∆)  ⟹  ⟨∇L(θ*), ∆⟩ ≥ −(λ_n/β) R(∆) .   (110)

Further, from the triangle inequality, we have

R(θ* + ∆) − R(θ*) ≥ −R(∆) .   (111)

Combining (108), (110), and λ_n times (111), we have

F(∆) = L(θ* + ∆) − L(θ*) + λ_n (R(θ* + ∆) − R(θ*)) ≥ −(λ_n/β) R(∆) + κ ‖∆‖_2^2 − λ_n R(∆)
= κ ‖∆‖_2^2 − λ_n ((1 + β)/β) R(∆) .   (112)

By definition of the norm compatibility constant Ψ(E_r), we have R(∆) ≤ ‖∆‖_2 Ψ(E_r), implying −R(∆) ≥ −‖∆‖_2 Ψ(E_r). Plugging this inequality back into (112), we have

F(∆) ≥ κ ‖∆‖_2 ( ‖∆‖_2 − ((1 + β)/β) (λ_n/κ) Ψ(E_r) ) .   (113)

Since F(∆̂_n) ≤ 0, we have

‖∆̂_n‖_2 ≤ ((1 + β)/β) (λ_n/κ) Ψ(E_r) ,   (114)

which completes the proof.
Appendix C: Bounds on the Regularization Parameter
In this section, we prove Theorem 3 and Theorem 4 of Section 3 of the paper. The regularization parameter must satisfy the condition λ_n ≥ β R*(∇L(θ*; Z^n)). In Theorem 3 we establish a high-probability upper bound on R*(∇L(θ*; Z^n)) in terms of the Gaussian width of the unit norm ball for the least squares loss and Gaussian designs. In Theorem 4 we obtain similar bounds for sub-Gaussian designs. We also give more details of the concentration result for R*(∇L(θ*; Z^n)) used in Theorem 4.
C.1 Proof of Theorem 3
Theorem 3 Let Ω_R = {u : R(u) ≤ 1} and Φ^2 = sup_{R(u)=1} ‖u‖_2^2. For Gaussian design X and Gaussian or sub-Gaussian noise ω, we have

E[R*(∇L(θ*; Z^n))] ≤ (η_1/√n) w(Ω_R) .   (115)

The constant η_1 is given by η_1 = 1 if X is independent isotropic, η_1 = √Λ_max(Σ) if X is independent anisotropic, and η_1 = √Λ_max(Γ) if X is dependent isotropic. Further, for any τ > 0, with probability at least (1 − 3 exp(−min(τ^2/(2Φ^2), cn))), we have

R*(∇L(θ*; Z^n)) ≤ (η_1 η_2/√n) (w(Ω_R) + τ) ,   (116)

where c is an absolute constant, η_2 = 2 when ω is i.i.d. standard Gaussian, and η_2 = √(2K^2 + 1) when ω is i.i.d. centered unit-variance sub-Gaussian with |||ω_i|||_{ψ2} ≤ K.

Proof: For the least squares loss, we first note that

R*(∇L(θ*; Z^n)) = R*((1/n) X^T ω) = (1/n) sup_{R(u)≤1} ⟨X^T ω, u⟩ = (1/n) ‖ω‖_2 · sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ ,   (117)

where ‖ω‖_2 is the length of a Gaussian or sub-Gaussian random vector. We then consider sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ for different X.

Case 1. If X is independent isotropic Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ = sup_{R(u)≤1} ⟨g, u⟩ ,

where g is a standard Gaussian random vector.

Case 2. If X is independent anisotropic Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ = sup_{R(u)≤1} ⟨Σ^{−1/2} X^T (ω/‖ω‖_2), Σ^{1/2} u⟩ ≤ sup_{R(Σ^{1/2}u) ≤ √Λ_max(Σ)} ⟨g, Σ^{1/2} u⟩ = sup_{R(u) ≤ √Λ_max(Σ)} ⟨g, u⟩ = √Λ_max(Σ) sup_{R(u)≤1} ⟨g, u⟩ ,

where √Λ_max(Σ) is the norm of the linear operator Σ^{1/2} mapping from the Banach space equipped with norm R(·) to itself.

Case 3. If X is dependent isotropic Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ = sup_{R(u)≤1} ⟨X^T Γ^{−1/2} Γ^{1/2} (ω/‖ω‖_2), u⟩ ≤ √Λ_max(Γ) sup_{R(u)≤1} ⟨X^T Γ^{−1/2} (Γ^{1/2}ω/‖Γ^{1/2}ω‖_2), u⟩ = √Λ_max(Γ) sup_{R(u)≤1} ⟨g, u⟩ ,

where Λ_max(Γ) denotes the largest eigenvalue of Γ.

For convenience, we denote Z = sup_{R(u)≤1} ⟨g, u⟩, and denote by η_1 the constant in front of sup_{R(u)≤1} ⟨g, u⟩ in the three cases, which is 1, √Λ_max(Σ), and √Λ_max(Γ) respectively. According to the discussion above, we need upper bounds on ‖ω‖_2 and Z in order to bound R*(∇L(θ*; Z^n)).

Upper bound for ‖ω‖_2: If ω is standard Gaussian, i.e., ω ∼ N(0, I_{n×n}), then E[‖ω‖_2] is the expected length of a Gaussian random vector in R^n, which satisfies

E[‖ω‖_2] < √(E[‖ω‖_2^2]) = √n .   (118)

Since the L2 norm ‖·‖_2 is a 1-Lipschitz function, using the concentration of Lipschitz functions of a Gaussian random vector, we have

P(|‖ω‖_2 − E[‖ω‖_2]| ≥ τ) ≤ 2 exp(−cτ^2) .

Setting τ = √n and using (118), we get the one-sided concentration

P(‖ω‖_2 ≥ 2√n) ≤ 2 exp(−cn) .   (119)

If ω consists of i.i.d. centered unit-variance sub-Gaussian elements with |||ω_i|||_{ψ2} < K, then ω_i^2 is sub-exponential with |||ω_i^2|||_{ψ1} < 2K^2. Similarly we have

E[‖ω‖_2] < √(E[‖ω‖_2^2]) = √n .   (120)

Applying Bernstein's inequality to ‖ω‖_2^2 = Σ_{i=1}^n ω_i^2, we obtain

P(|‖ω‖_2^2 − E[‖ω‖_2^2]| ≥ τ) ≤ 2 exp( −c min( τ^2/(4K^4 n), τ/(2K^2) ) ) .

Setting τ = 2K^2 n, we get

P(‖ω‖_2 ≥ √(2K^2 + 1) √n) ≤ 2 exp(−cn) .   (121)

For convenience, we denote by η_2 the constant in front of the √n term in both (119) and (121), which is 2 in the Gaussian case and √(2K^2 + 1) in the sub-Gaussian case.

Upper bound for Z: Next we bound Z = sup_{R(u)≤1} ⟨g, u⟩, where g is a standard Gaussian random vector. According to the definition of the Gaussian width, the expectation of Z is

E_g[Z] = E_g[ sup_{R(u)≤1} ⟨g, u⟩ ] = w(Ω_R) .

Combined with the bounds in (118) and (120), this immediately proves the upper bound on E[R*(∇L(θ*; Z^n))] in (115). Then, by invoking the concentration inequality for the supremum of a Gaussian process, we have

P(Z − E_g[Z] ≥ τ) ≤ exp( −τ^2/(2Φ^2) ) ,

where Φ^2 = sup_{R(u)≤1} E_g[⟨g, u⟩^2] = sup_{R(u)=1} ‖u‖_2^2. Plugging E_g[Z] = w(Ω_R) into the inequality, we have

P(Z ≥ w(Ω_R) + τ) ≤ exp( −τ^2/(2Φ^2) ) .   (122)

Last, applying a union bound to the inequalities (119), (121), and (122), we get

R*(∇L(θ*; Z^n)) ≤ (η_1 η_2/√n) (w(Ω_R) + τ)

with probability at least 1 − 3 exp(−min(τ^2/(2Φ^2), cn)), where the choices of η_1 and η_2 are determined by the properties of ω and X. This completes the proof.
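As an illustration of how Theorem 3 is used, the minimal sketch below sets the scale of the regularization parameter for the L1 norm from a Monte Carlo estimate of w(Ω_R) = E[‖g‖_∞] and compares β times the data-dependent dual norm of the gradient against that scale; the constants, loss normalization (1/n squared loss with gradient (1/n)Xᵀω), and problem sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, beta = 300, 2000, 2.0
X = rng.standard_normal((n, p))          # independent isotropic Gaussian design
omega = rng.standard_normal(n)           # standard Gaussian noise

# Monte Carlo estimate of w(Omega_R) for the L1 unit ball: E[||g||_inf]
G = rng.standard_normal((5000, p))
w_omega_r = np.abs(G).max(axis=1).mean()

# Dual norm of the gradient at theta*: here R*(.) = ||.||_inf and grad L(theta*) = (1/n) X^T omega
dual_grad = np.abs(X.T @ omega).max() / n
lam_theory = beta * 2.0 * w_omega_r / np.sqrt(n)   # eta_1 = 1, eta_2 = 2 (Gaussian noise), tau = 0

print(f"beta * R*(grad L(theta*))        = {beta * dual_grad:.4f}")
print(f"beta * eta1 * eta2 * w / sqrt(n) = {lam_theory:.4f}")
print("theory scale dominates the data-dependent quantity:", lam_theory >= beta * dual_grad)
```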
C.2 Proof of Theorem 4
Theorem 4 Let Ω_R = {u : R(u) ≤ 1}. For sub-Gaussian design X and Gaussian or sub-Gaussian noise ω, we have

E[R*(∇L(θ*; Z^n))] ≤ (η_0 η_1/√n) w(Ω_R) ,   (123)

where η_0 is a constant. The constant η_1 is given by η_1 = 1 if X is independent isotropic, η_1 = √Λ_max(Σ) if X is independent anisotropic, and η_1 = √Λ_max(Γ) if X is dependent isotropic.

In order to prove inequality (123), we first need the following theorem.

Theorem 14 Let Ω_R = {u : R(u) ≤ 1} be the unit norm ball. Assuming h is sub-Gaussian with |||h_i|||_{ψ2} ≤ k, we have

E[ sup_{R(u)≤1} ⟨h, u⟩ ] ≤ η_0 w(Ω_R) ,

where η_0 is a constant independent of n and p.

Proof: The quantity E[sup_{R(u)≤1} ⟨h, u⟩] can be considered the "sub-Gaussian width" of Ω_R, the unit norm ball, since it has the exact same form as the Gaussian width, with h an elementwise independent sub-Gaussian vector instead of a Gaussian vector. Next, we show that the sub-Gaussian width is always bounded by a constant times the Gaussian width. Consider the sub-Gaussian process Y = {Y_u}, Y_u = ⟨u, h⟩, indexed by u ∈ Ω_R, the unit norm ball. Consider the Gaussian process X = {X_u}, X_u = ⟨u, g⟩, where g ∼ N(0, I), indexed by the same set, i.e., u ∈ Ω_R. First, note that |Y_u − Y_v| = |⟨h, u − v⟩|, so that by Hoeffding's inequality [38, Proposition 5.10], we have

P(|Y_u − Y_v| ≥ ε) = P( | Σ_{j=1}^p (u_j − v_j) h_j | ≥ ε ) ≤ e · exp( −c ε^2 / (k^2 ‖u − v‖_2^2) ) ,   (124)

where k = max_j |||h_j|||_{ψ2} and c > 0 is an absolute constant. As a result, a direct application of the generic chaining argument for upper bounds on such empirical processes [33, Theorem 2.1.5] gives

E[ sup_{u,v} |Y_u − Y_v| ] ≤ η' E[ sup_u X_u ] = η' w(Ω_R) ,   (125)

where η' is an absolute constant. Further, since {Y_u} is a symmetric process, from [33, Lemma 1.2.8] we have

E[ sup_{u,v} |Y_u − Y_v| ] = 2 E[ sup_u Y_u ] .   (126)

As a result, with η_0 = η'/2, we have

E[ sup_{R(u)≤1} ⟨h, u⟩ ] = E[ sup_u Y_u ] ≤ η_0 w(Ω_R) .   (127)

That completes the proof.

To prove Theorem 4, we assume that for isotropic independent X, the sub-Gaussian norm of each ⟨X_i, u⟩ satisfies |||⟨X_i, u⟩|||_{ψ2} ≤ k for any u ∈ S^{p−1}.

Proof of Theorem 4: Equation (117) still holds for a sub-Gaussian design matrix, and the analysis for ‖ω‖_2 remains the same as for Gaussian designs. We again have three cases for sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩.

Case 4. If X is independent isotropic sub-Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ = sup_{R(u)≤1} ⟨h, u⟩ ,

where h is an i.i.d. sub-Gaussian random vector with |||h|||_{ψ2} ≤ k, and its exact distribution depends on ω.

Case 5. If X is independent anisotropic sub-Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ ≤ √Λ_max(Σ) sup_{R(u)≤1} ⟨h, u⟩ .

Case 6. If X is dependent isotropic sub-Gaussian, then

sup_{R(u)≤1} ⟨X^T (ω/‖ω‖_2), u⟩ ≤ √Λ_max(Γ) sup_{R(u)≤1} ⟨h, u⟩ .

Based on Theorem 14, it is easy to see that

E[R*(∇L(θ*; Z^n))] ≤ (η_0 η_1/√n) w(Ω_R) ,

where η_1 remains the same as in Theorem 3. This completes the proof.
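The claim of Theorem 14, that the "sub-Gaussian width" is controlled by the Gaussian width, can be illustrated numerically: the sketch below compares E[sup_{R(u)≤1} ⟨h, u⟩] for a Rademacher vector h with w(Ω_R) for the L1 unit ball, where both widths reduce to expected ∞-norms; the dimensions and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20000

for p in (100, 1000, 10000):
    g = rng.standard_normal((n_samples, p))            # Gaussian vector
    h = rng.choice([-1.0, 1.0], size=(n_samples, p))   # Rademacher: a bounded sub-Gaussian vector
    # For the L1 unit ball, sup_{R(u)<=1} <x, u> = ||x||_inf
    w_gauss = np.abs(g).max(axis=1).mean()
    w_sub = np.abs(h).max(axis=1).mean()                # equals 1 exactly for Rademacher
    print(f"p={p:6d}:  sub-Gaussian width = {w_sub:.2f},  Gaussian width = {w_gauss:.2f},  "
          f"ratio = {w_sub / w_gauss:.2f}")
```

In this particular example the ratio stays well below 1 for all p, consistent with (and much smaller than) the constant-factor bound of Theorem 14.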
C.3 Proof of Theorem 5
Theorem 5 Let h = [h_1 ⋯ h_p] be a vector of independent centered random variables, and let Z′ = R*(h) = sup_{R(u)≤1} ⟨h, u⟩. Then, we have

P( |Z′ − E[Z′]| ≥ τ ) ≤ η_3 exp(−η_4 τ^2) ,   (128)

if h satisfies either of the following conditions: (i) h is bounded, i.e., ‖h‖_∞ = L < ∞; (ii) for any unit vector u, Y_u = ⟨h, u⟩ is such that its Malliavin derivative has almost surely bounded norm in L^2[0, 1], i.e., ∫_0^1 |D_r Y_u|^2 dr ≤ L^2.

Proof: The key aspect of the result is to show that the concentration happens in a dimensionality-independent manner, i.e., the constants involved do not depend on p. We also recall that since Z′ = R*(h) = sup_{R(u)≤1} ⟨h, u⟩, one can view Z′ as a 1-Lipschitz function of h, or as the supremum of a sub-Gaussian process determined by h.

Part (i): When h is bounded, i.e., ‖h‖_∞ = L < ∞, the result follows from existing results on measure concentration of convex 1-Lipschitz functions on product spaces, e.g., see Chapter 4.2 in [22]. In particular, since the distribution over h is a product measure, the elements being independent, and R*(·) is a convex 1-Lipschitz function, following [22, Corollary 4.10] and subsequent discussions, it follows that for every τ ≥ 0,

P( |Z′ − m(Z′)| ≥ τ ) ≤ 4 exp(−τ^2/L^2) ,   (129)

where m(Z′) is the median of Z′ under P. One can go from the median m(Z′) to the mean E[Z′] with only changes in the constants, as shown in [22, Proposition 1.8]; see also the discussion leading to [22, Corollary 4.8].

Part (ii): In this setting, Y_u = ⟨h, u⟩ is assumed to be such that its Malliavin derivative has almost surely bounded norm in L^2[0, 1], i.e., ∫_0^1 |D_r Y_u|^2 dr ≤ L^2. As a result, it follows that sup_u ∫_0^1 |D_r Y_u|^2 dr ≤ L^2. Recall that Z′ = sup_{u:R(u)≤1} Y_u and E[Z′] = E[sup_{u:R(u)≤1} Y_u]. Then, following [40, Theorem 3.6], Z′ − E[Z′] is sub-Gaussian relative to L^2, so that

P( |Z′ − E[Z′]| > τ ) ≤ 2 exp(−τ^2/L^2) .   (130)

That completes the proof.
C.4 Proof of Lemma 3
Lemma 3 Let Ω_R = {u : R(u) ≤ 1} be the unit norm ball and Θ_R = {u : R(u) = 1} its boundary. For any θ̃ ∈ Θ_R, let ρ(θ̃) = sup_{θ:R(θ)≤1} ‖θ − θ̃‖_2 be the diameter of Ω_R measured with respect to θ̃, and let G(θ̃) = cone(Ω_R − θ̃) ∩ ρ(θ̃)B_2^p, i.e., the cone of (Ω_R − θ̃) intersected with the ball of radius ρ(θ̃). Then

w(Ω_R) ≤ inf_{θ̃ ∈ Θ_R} w(G(θ̃)) .   (131)

Proof: For any θ̃ ∈ Θ_R, consider the set F_R(θ̃) = Ω_R − θ̃ = {u : R(u + θ̃) ≤ 1}. Since the Gaussian width is translation invariant, the Gaussian widths of Ω_R and F_R(θ̃) are the same, i.e., w(Ω_R) = w(F_R(θ̃)). Since ρ(θ̃) = sup_{θ:R(θ)≤1} ‖θ − θ̃‖_2 is the diameter of Ω_R as well as of F_R(θ̃), a ball of radius ρ(θ̃) will include F_R(θ̃), so that F_R(θ̃) ⊆ ρ(θ̃)B_2^p. Further, by definition, F_R(θ̃) ⊆ cone(F_R(θ̃)) = cone(Ω_R − θ̃). Let G(θ̃) = cone(Ω_R − θ̃) ∩ ρ(θ̃)B_2^p. By construction, F_R(θ̃) ⊆ G(θ̃). Then,

w(Ω_R) = w(F_R(θ̃)) ≤ w(G(θ̃)) .

Noting that the analysis holds for any θ̃ ∈ Θ_R completes the proof.
Appendix D: Restricted Eigenvalue Conditions: Gaussian Designs
We focus on the results in Section 4.1. In particular, we consider RE conditions for Gaussian design matrices. We give proofs of the RE condition for three different cases: (i) the design matrix has i.i.d. isotropic Gaussian rows; (ii) the rows are independent but anisotropic, i.e., the features are correlated; and (iii) the rows are isotropic but the samples are dependent, i.e., the columns are correlated.
D.1 Independent Isotropic Gaussian (IIG) Designs
Our goal in this section is two-fold: first, we present the RE conditions obtained using our simple proof technique, and show that they are equivalent, up to constants, to the RE condition obtained using Gordon's inequality, an arguably heavy-duty technique applicable only in the IIG setting; second, we go over some facets of how we present the results, which will apply to all subsequent RE-style results.
Theorem 6 Let the design matrix X ∈ R^{n×p} be elementwise independent and normal, i.e., x_{ij} ∼ N(0, 1). Then, for any A ⊆ S^{p−1}, any n ≥ 2, and any τ > 0, with probability at least (1 − 2 exp(−τ^2/72)), we have

inf_{u ∈ A} ‖Xu‖_2 ≥ √n − η_0 w(A) − τ ,   (132)

where η_0 > 0 is an absolute constant.

Proof: Consider the function f(X) = ‖Xu‖_2 where u ∈ A ⊆ S^{p−1}. Then, from the triangle inequality, with ‖X‖_op denoting the operator norm, we have

f(X_1) − f(X_2) = ‖X_1 u‖_2 − ‖X_2 u‖_2 ≤ ‖(X_1 − X_2)u‖_2 ≤ ‖X_1 − X_2‖_op ‖u‖_2 = ‖X_1 − X_2‖_op ≤ ‖vec(X_1) − vec(X_2)‖_2 ,

implying f(X) has Lipschitz constant 1. Then, since X is a Gaussian random matrix, following the large deviation bound for Lipschitz functions of Gaussian variables, we have

P( |‖Xu‖_2 − E[‖Xu‖_2]| > δ ) ≤ 2 exp(−δ^2/2) .   (133)

Let Z_i = ⟨X_i, u⟩; then Z_i ∼ N(0, 1), and the Z_i are independent. Let Z ∈ R^n be the vector with elements Z_i. Then E[‖Xu‖_2] = E[‖Z‖_2] = γ_n, so that

P( |‖Xu‖_2 − γ_n| > δ ) ≤ 2 exp(−δ^2/2) .   (134)

Let N_ε(A) be an ε-net covering of A based on the L2 norm, and let N(A, ε) be the covering number. Then, for any v ∈ N_ε(A), we have

P( |‖Xv‖_2 − γ_n| > δ ) ≤ 2 exp(−δ^2/2) .   (135)

Taking a union bound over all v, and using the weak converse of Dudley's inequality at ε = 1/4, we have

P( max_{v ∈ N_ε(A)} |‖Xv‖_2 − γ_n| > δ ) ≤ 2 N(A, 1/4) exp(−δ^2/2) ≤ 2 exp( c w^2(A) − δ^2/2 ) .   (136)

With δ = √(2c) w(A) + τ/6, we have δ^2/2 ≥ c w^2(A) + τ^2/72, so that

P( max_{v ∈ N_ε(A)} |‖Xv‖_2 − γ_n| > η_1 w(A) + τ/6 ) ≤ 2 exp(−τ^2/72) .   (137)

Next, we focus on extending the result from the ε-net to the full set A. We will use the results of Lemmas 11 and 12, which are Lemmas 5.36 and 5.4 respectively from [38].

Lemma 11 Let A ⊆ S^{p−1}. Consider a matrix B which satisfies, for some δ > 0,

| ‖Bu‖_2^2 − 1 | ≤ max(δ, δ^2) ,  ∀u ∈ A .   (138)

Then,

1 − δ ≤ inf_{u ∈ A} ‖Bu‖_2 ≤ sup_{u ∈ A} ‖Bu‖_2 ≤ 1 + δ .   (139)

Conversely, if B satisfies (139) for some δ > 0, then

| ‖Bu‖_2^2 − 1 | ≤ 3 max(δ, δ^2) ,  ∀u ∈ A .   (140)

Lemma 12 Let C be a symmetric p × p matrix, and let N_ε(A) be an ε-net of A ⊆ S^{p−1} for some ε ∈ [0, 1/2). Then,

sup_{u ∈ A} |⟨Cu, u⟩| ≤ (1/(1 − 2ε)) max_{v ∈ N_ε(A)} |⟨Cv, v⟩| .   (141)

Note that, using the fact γ_n ≈ √n, we have proved above that, with probability 1 − 2 exp(−τ^2/72) and δ = η_1 w(A)/√n + τ/(6√n),

1 − δ ≤ min_{v ∈ N_ε(A)} (1/√n) ‖Xv‖_2 ≤ max_{v ∈ N_ε(A)} (1/√n) ‖Xv‖_2 ≤ 1 + δ .   (142)

Therefore, from the second result in Lemma 11, the following is true with probability 1 − 2 exp(−τ^2/72):

| (1/n) ‖Xv‖_2^2 − 1 | ≤ 3 max(δ, δ^2) ,  ∀v ∈ N_ε(A) .   (143)

Now, using the result of Lemma 12 and choosing ε = 1/4, with probability 1 − 2 exp(−τ^2/72),

sup_{u ∈ A} | (1/n) ‖Xu‖_2^2 − 1 | = sup_{u ∈ A} | ⟨((1/n) X^T X − I)u, u⟩ | ≤ (1/(1 − 2ε)) max_{v ∈ N_ε(A)} | ⟨((1/n) X^T X − I)v, v⟩ | ≤ 2 max_{v ∈ N_ε(A)} | (1/n) ‖Xv‖_2^2 − 1 | ≤ 6 max(δ, δ^2) .

Using the above result along with (142), and using the first result in Lemma 11, the following holds with probability 1 − 2 exp(−τ^2/72):

inf_{u ∈ A} (1/√n) ‖Xu‖_2 ≥ 1 − 6δ ≥ 1 − η_0 w(A)/√n − τ/√n ,   (144)

where in the last step we denote η_0 = 6η_1. That completes the proof.

We next consider the equivalent result one could obtain by directly using Gordon's inequality [18, 19, 23]:

Theorem 7 Let the design matrix X be elementwise independent and normal, i.e., x_{ij} ∼ N(0, 1). Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−τ^2/2)), we have

inf_{u ∈ A} ‖Xu‖_2 ≥ γ_n − w(A) − τ ,   (145)

where γ_n = E[‖g‖_2] > n/√(n+1) is the expected length of a Gaussian random vector in R^n.
Interestingly, the results are equivalent, up to constants. In fact, our technique provides an alternative proof to Gordon’s inequality using a covering argument, and not using Gaussian comparison techniques [16, 23].
Proof: We consider the function f(X) = inf_{u∈A} ‖Xu‖_2, which by the analysis for Theorem 6 has Lipschitz constant at most 1. Then, by the large deviation bound for Lipschitz functions of Gaussian variables, we have

P( | inf_{u∈A} ‖Xu‖_2 − E[ inf_{u∈A} ‖Xu‖_2 ] | > δ ) ≤ 2 exp(−δ^2/2) .   (146)

Now, from an application of Gordon's inequality [18, 19, 23] (see Theorem 15), we have

E[ inf_{u∈A} ‖Xu‖_2 ] ≥ γ_n − w(A) ,   (147)

where γ_n ≥ n/√(n+1) is the expected length of an isotropic Gaussian random vector in R^n. Then, we have

P( inf_{u∈A} ‖Xu‖_2 < γ_n − w(A) − δ ) ≤ 2 exp(−δ^2/2) .   (148)

Changing the direction of the inequality and setting τ = δ completes the proof.

The form of the result in Theorem 7 is the same as in Theorem 6, but it has an exact constant of 1 in front of the Gaussian width term instead of η_0 as in Theorem 6. Further, the proof of Theorem 7 relies on Gordon's inequality, which is applicable only to Gaussian design matrices and cannot be readily extended to other types of design matrices. On the other hand, our analysis technique for Theorem 6 is general, and can be extended with suitable modifications to other types of design matrices, such as sub-Gaussian designs, as shown in later sections. For the sake of completeness, we give a proof of the following key result used in the analysis of Theorem 7:

Theorem 15 Let the design matrix X be elementwise independent and normal, i.e., x_{ij} ∼ N(0, 1). Then, for any A ⊆ S^{p−1}, we have

E[ inf_{u∈A} ‖Xu‖_2 ] ≥ γ_n − w(A) ,   (149)

where γ_n = E[‖h‖_2] is the expected length of a Gaussian random vector in R^n.

For the proof, we start by reviewing the Gaussian comparison form of Gordon's inequality:

Theorem 16 (Gordon's inequality) Let (X_{u,v}, u ∈ U, v ∈ V) and (Y_{u,v}, u ∈ U, v ∈ V) be centered Gaussian processes, i.e., E[X_{u,v}] = E[Y_{u,v}] = 0. Assume that
1. ‖X_{u,v} − X_{u′,v′}‖_2 ≤ ‖Y_{u,v} − Y_{u′,v′}‖_2 if u ≠ u′,
2. ‖X_{u,v} − X_{u,v′}‖_2 = ‖Y_{u,v} − Y_{u,v′}‖_2.
Then,

E[ sup_{u∈U} inf_{v∈V} X_{u,v} ] ≤ E[ sup_{u∈U} inf_{v∈V} Y_{u,v} ] .   (150)

The proof of the result can be found in Chapter 3 of [23]. We will use the following form of the inequality, obtained by applying the above form to −X_{u,v} and −Y_{u,v} (whose increments are the same):

E[ inf_{u∈U} sup_{v∈V} X_{u,v} ] ≥ E[ inf_{u∈U} sup_{v∈V} Y_{u,v} ] .   (151)

Proof of Theorem 15: The proof follows by standard comparison techniques for Gaussian processes. Consider the family of processes

X_{u,v} = ⟨v, Xu⟩ = ⟨X, u ⊗ v⟩_Tr ,   (152)

where v ∈ S^{n−1} and u ∈ S^{p−1} are unit vectors. For comparison, we consider the following family of processes

Y_{u,v} = ⟨h, v⟩ + ⟨g, u⟩ ,   (153)

where g ∼ N(0, I) in R^p and h ∼ N(0, I) in R^n. Recall that for a canonical Gaussian process X_t = ⟨g, t⟩ with g ∼ N(0, I), we have

‖X_s − X_t‖_2 = ‖t − s‖_2 .   (154)

Hence, we have

‖X_{u,v} − X_{u′,v′}‖_2 = ‖u ⊗ v − u′ ⊗ v′‖_2 ,   (155)

‖Y_{u,v} − Y_{u′,v′}‖_2 = √( ‖u − u′‖_2^2 + ‖v − v′‖_2^2 ) .   (156)

A direct calculation shows that, for unit vectors u, v, u′, v′, the conditions of Gordon's inequality are satisfied by the two families of processes, i.e.,

‖u ⊗ v − u′ ⊗ v′‖_2 ≤ √( ‖u − u′‖_2^2 + ‖v − v′‖_2^2 ) ,   (157)

‖u ⊗ v − u ⊗ v′‖_2^2 = ‖v − v′‖_2^2 .   (158)

Hence, for any A ⊆ S^{p−1}, by Gordon's inequality we have

E[ inf_{u∈A} ‖Xu‖_2 ] = E[ inf_{u∈A} sup_{v∈S^{n−1}} ⟨v, Xu⟩ ] = E[ inf_{u∈A} sup_{v∈S^{n−1}} X_{u,v} ]
≥ E[ inf_{u∈A} sup_{v∈S^{n−1}} Y_{u,v} ] = E[ inf_{u∈A} sup_{v∈S^{n−1}} { ⟨h, v⟩ + ⟨g, u⟩ } ]
= E[ sup_{v∈S^{n−1}} ⟨h, v⟩ ] + E[ inf_{u∈A} ⟨g, u⟩ ]
= γ_n − w(A) ,   (159)

since E[sup_{v∈S^{n−1}} ⟨h, v⟩] = E[‖h‖_2] = γ_n and, by the symmetry of g, E[inf_{u∈A} ⟨g, u⟩] = −E[sup_{u∈A} ⟨g, u⟩] = −w(A). That completes the proof.
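Since Theorems 6 and 7 hold for any A ⊆ S^{p−1}, their lower bound can be checked on a concrete set: the sketch below takes A to be a finite random collection of s-sparse unit vectors, estimates w(A) by Monte Carlo, and compares inf_{u∈A} ‖Xu‖_2 with γ_n − w(A) for an isotropic Gaussian design; the sizes are illustrative, the constant in front of w(A) is taken as 1 as in Theorem 7, and γ_n is approximated by √n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, m = 200, 500, 5, 2000   # m = number of s-sparse unit vectors forming the finite set A

# Build A: random s-sparse unit vectors in R^p.
A = np.zeros((m, p))
for i in range(m):
    idx = rng.choice(p, size=s, replace=False)
    v = rng.standard_normal(s)
    A[i, idx] = v / np.linalg.norm(v)

# Monte Carlo Gaussian width of A: E[max_{u in A} <g, u>]
G = rng.standard_normal((2000, p))
w_A = (G @ A.T).max(axis=1).mean()

X = rng.standard_normal((n, p))                    # isotropic Gaussian design
inf_norm = np.linalg.norm(X @ A.T, axis=0).min()   # inf_{u in A} ||X u||_2
gamma_n = np.sqrt(n)                               # gamma_n ~ sqrt(n) up to lower-order terms

print(f"inf_u ||Xu||_2 = {inf_norm:.1f},  gamma_n - w(A) = {gamma_n - w_A:.1f}")
print("lower bound holds:", inf_norm >= gamma_n - w_A)
```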
D.2 Independent Anisotropic Gaussian (IAG) Designs
We consider a setting where the rows X_i of the design matrix are independent, but each row is sampled from an anisotropic Gaussian distribution, i.e., X_i ∼ N(0, Σ) with X_i ∈ R^p. The setting has been considered in the literature [31] for the special case of the L1 norm, and sharp results have been established using Gaussian comparison techniques [18, 19, 23]. We show that equivalent results can be obtained by our simple technique, which does not rely on Gaussian comparisons [16, 23].

Theorem 8 Let the design matrix X be row-wise independent with each row X_i ∼ N(0, Σ_{p×p}). Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least 1 − 2 exp(−τ^2/72), we have

inf_{u∈A} ‖Xu‖_2 ≥ ν√n − η_0 √Λ_max(Σ) w(A) − τ ,   (160)

where ν = inf_{u∈A} ‖Σ^{1/2}u‖_2, √Λ_max(Σ) denotes the largest eigenvalue of Σ^{1/2}, and η_0 > 0 is an absolute constant.

Proof: Note that XΣ^{−1/2} is a matrix with independent rows, each an isotropic Gaussian. For any u ∈ A, let ξ = Σ^{1/2}u, so that u = Σ^{−1/2}ξ and Xu = XΣ^{−1/2}ξ. Following the large deviation bound for Lipschitz functions of Gaussian variables, we have

P( | ‖XΣ^{−1/2}(ξ/‖ξ‖_2)‖_2 − γ_n | > δ ) ≤ 2 exp(−δ^2/2)
⟹ P( | ‖Xu‖_2 / ‖Σ^{1/2}u‖_2 − γ_n | > δ ) ≤ 2 exp(−δ^2/2) .   (161)

Note that ν = inf_{u∈A} ‖Σ^{1/2}u‖_2 and ‖Σ^{1/2}u‖_2 ≤ √Λ_max(Σ), so that ν ≤ ‖ξ‖_2 = ‖Σ^{1/2}u‖_2 ≤ √Λ_max(Σ). Let N_ε(A) be an ε-net covering of A based on the L2 norm, and let N(A, ε) be the covering number. Then, for any v ∈ N_ε(A), we have

P( | ‖Xv‖_2 / ‖Σ^{1/2}v‖_2 − γ_n | > δ ) ≤ 2 exp(−δ^2/2) .   (162)

Taking a union bound over all v, and using the weak converse of Dudley's inequality, we have

P( max_{v∈N_ε(A)} | ‖Xv‖_2 / ‖Σ^{1/2}v‖_2 − γ_n | > δ ) ≤ 2 N(A, ε) exp(−δ^2/2) ≤ 2 exp( c w^2(A) − δ^2/2 ) .   (163)

With δ = √(2c) w(A) + τ/(6√Λ_max(Σ)), we have δ^2/2 ≥ c w^2(A) + τ^2/(72 Λ_max(Σ)), so that

P( max_{v∈N_ε(A)} | ‖Xv‖_2 / ‖Σ^{1/2}v‖_2 − γ_n | > η_1 w(A) + τ/(6√Λ_max(Σ)) ) ≤ 2 exp( −τ^2/(72 Λ_max(Σ)) ) .   (164)

Next, we focus on extending the result from the ε-net to the full set A. We will use the result of Lemma 13 below, which is similar in spirit to the result of Lemma 11 and the proof of which is very similar to the proof of Lemma 14 in [38], together with Lemma 12 stated in the previous subsection.

Lemma 13 Let A ⊆ S^{p−1}. Consider a matrix B which satisfies, for some δ > 0,

| ‖Bu‖_2^2 / ‖Σ^{1/2}u‖_2^2 − 1 | ≤ max(δ, δ^2) ,  ∀u ∈ A .   (165)

Then,

ν − δ √Λ_max(Σ) ≤ inf_{u∈A} ‖Bu‖_2 ≤ sup_{u∈A} ‖Bu‖_2 ≤ √Λ_max(Σ) + δ √Λ_max(Σ) .   (166)

Conversely, if B satisfies (166) for some δ > 0, then

| ‖Bu‖_2^2 / ‖Σ^{1/2}u‖_2^2 − 1 | ≤ 3 max(δ, δ^2) ,  ∀u ∈ A .   (167)

Note that, using the fact γ_n ≈ √n, we have proved above that, with probability 1 − 2 exp(−τ^2/(72Λ_max(Σ))) and δ = η_1 w(A)/√n + τ/(6√(Λ_max(Σ) n)), for all v ∈ N_ε(A),

1 − δ ≤ min_{v∈N_ε(A)} (1/√n) ‖Xv‖_2 / ‖Σ^{1/2}v‖_2 ≤ max_{v∈N_ε(A)} (1/√n) ‖Xv‖_2 / ‖Σ^{1/2}v‖_2 ≤ 1 + δ
⟹ ν − δ√Λ_max(Σ) ≤ min_{v∈N_ε(A)} (1/√n) ‖Xv‖_2 ≤ max_{v∈N_ε(A)} (1/√n) ‖Xv‖_2 ≤ √Λ_max(Σ) + δ√Λ_max(Σ) .   (168)

Therefore, from the second result in Lemma 13, the following is true with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Σ))):

| (1/n) ‖Xv‖_2^2 / ‖Σ^{1/2}v‖_2^2 − 1 | ≤ 3 max(δ, δ^2) ,  ∀v ∈ N_ε(A)
⟹ | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | ≤ 3 ‖Σ^{1/2}v‖_2^2 max(δ, δ^2) ,  ∀v ∈ N_ε(A) .   (169)

Now, using the result of Lemma 12 and choosing ε = 1/4, with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Σ))),

sup_{u∈A} | (1/n) ‖Xu‖_2^2 − ‖Σ^{1/2}u‖_2^2 | = sup_{u∈A} | ⟨((1/n) X^T X − Σ)u, u⟩ | ≤ (1/(1 − 2ε)) max_{v∈N_ε(A)} | ⟨((1/n) X^T X − Σ)v, v⟩ | ≤ 2 max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | ≤ 6 ‖Σ^{1/2}u‖_2^2 max(δ, δ^2) .

Using the above result along with (168), and using the first result in Lemma 13, the following holds with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Σ))):

inf_{u∈A} (1/√n) ‖Xu‖_2 ≥ ν − 6δ ≥ ν − η_0 √Λ_max(Σ) w(A)/√n − τ/√n ,   (170)

where in the last step we denote η_0 = 6η_1. That completes the proof.
D.3 Dependent Isotropic Gaussian Designs
We now consider a setting where the rows of the design matrix X̃ are isotropic Gaussians, but the columns X̃_j are correlated with E[X̃_j X̃_j^T] = Γ ∈ R^{n×n}. Interestingly, correlation structure over the columns makes the samples dependent, a scenario which has not yet been widely studied in the literature [27, 44]. We show that our simple technique continues to work in this scenario and gives a rather intuitive result.

Theorem 9 Let X̃ ∈ R^{n×p} be a matrix whose rows X̃_i are isotropic Gaussian random vectors in R^p and whose columns X̃_j are correlated with E[X̃_j X̃_j^T] = Γ. Then, for any set A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−τ^2/72)), we have

inf_{u∈A} ‖X̃u‖_2 ≥ √Tr(Γ) − √Λ_max(Γ) (η_0 w(A) + 12) − τ ,   (171)

where η_0 > 0 is an absolute constant.

The analysis will rely on the following two results, which respectively give a deviation bound for the norm of a correlated Gaussian random vector, and allow converting the continuous problem to a discrete problem over an ε-net covering. Both results rely on the fact that f(g) = ‖Γ^{1/2}g‖_2 is Lipschitz with constant Λ_max(Γ^{1/2}) = √Λ_max(Γ).

Lemma 14 Let Γ ∈ R^{n×n} be a positive (semi)definite matrix, and let g ∼ N(0, I_{n×n}) be an isotropic Gaussian random vector. Then, for all δ > 0,

P( | ‖Γ^{1/2}g‖_2 − √Tr(Γ) | > δ + 2√Λ_max(Γ) ) ≤ 2 exp( −δ^2/(2Λ_max(Γ)) ) .   (172)

Proof: Since ‖Γ^{1/2}g‖_2 is Lipschitz with constant √Λ_max(Γ), by large deviation bounds for functions of Gaussian random variables we have

P( | ‖Γ^{1/2}g‖_2 − E[‖Γ^{1/2}g‖_2] | > δ ) ≤ 2 exp( −δ^2/(2Λ_max(Γ)) ) .   (173)

Integrating this tail bound, we observe that var(‖Γ^{1/2}g‖_2) ≤ 4Λ_max(Γ). Hence

√(E[‖Γ^{1/2}g‖_2^2]) − E[‖Γ^{1/2}g‖_2] = √Tr(Γ) − E[‖Γ^{1/2}g‖_2] ≤ 2√Λ_max(Γ) .   (174)

Combining (173) and (174) we get the required result.

Lemma 15 Let f : X → R, where X ⊆ R^n, be L-Lipschitz, i.e., |f(x) − f(y)| ≤ L‖x − y‖_2 for all x, y ∈ X. Let N_ε be an ε-net covering X. Then,

inf_{x∈X} f(x) ≥ inf_{y∈N_ε} f(y) − Lε .   (175)

Proof: For any x ∈ X and y ∈ N_ε with ‖x − y‖_2 ≤ ε, we have

|f(x) − f(y)| ≤ L‖x − y‖_2 ≤ Lε .   (176)

Then by the triangle inequality,

f(x) ≥ f(y) − Lε .   (177)

Taking the infimum over x on the left-hand side and over y on the right-hand side gives the required result.

Proof of Theorem 9: Let X ∈ R^{n×p} be a matrix whose rows are independent isotropic Gaussian random vectors. Then X̃ = Γ^{1/2}X. For any u ∈ S^{p−1}, Xu = g ∼ N(0, I_{n×n}), so that

‖X̃u‖_2 = ‖Γ^{1/2}Xu‖_2 = ‖Γ^{1/2}g‖_2 .   (178)

Then, following Lemma 14, we have

P( | ‖X̃u‖_2 − √Tr(Γ) | > δ + 2√Λ_max(Γ) ) ≤ 2 exp( −δ^2/(2Λ_max(Γ)) ) .   (179)

Let N_ε(A) be an ε-net covering of A. For any v ∈ N_ε(A) ⊆ S^{p−1}, we have

P( | ‖X̃v‖_2 − √Tr(Γ) | > δ + 2√Λ_max(Γ) ) ≤ 2 exp( −δ^2/(2Λ_max(Γ)) ) .   (180)

Taking a union bound over all v ∈ N_ε(A), we have

P( max_{v∈N_ε(A)} | ‖X̃v‖_2 − √Tr(Γ) | > δ + 2√Λ_max(Γ) ) ≤ 2 N(A, ε) exp( −δ^2/(2Λ_max(Γ)) ) ,   (181)

where N(A, ε) is the covering number. Choosing ε = 1/4, and using the weak converse of Dudley's inequality to convert the union bound in terms of the Gaussian width, we have

P( max_{v∈N_ε(A)} | ‖X̃v‖_2 − √Tr(Γ) | > δ + 2√Λ_max(Γ) ) ≤ 2 exp(c w^2(A)) exp( −δ^2/(2Λ_max(Γ)) ) .   (182)

Let δ = √(2cΛ_max(Γ)) w(A) + τ/6, with τ > 0, so that δ^2 ≥ 2cΛ_max(Γ) w^2(A) + τ^2/36. Then, with η_1 = √(2c), we have

P( max_{v∈N_ε(A)} | ‖X̃v‖_2 − √Tr(Γ) | > √Λ_max(Γ)(η_1 w(A) + 2) + τ/6 ) ≤ 2 exp( −τ^2/(72Λ_max(Γ)) ) .   (183)

Next, we focus on extending the result from the ε-net to the full set A. We will use the result of Lemma 16 below, which is similar in spirit to the result of Lemma 11 and the proof of which is very similar to the proof of Lemma 14 in [38], together with Lemma 12.

Lemma 16 Let A ⊆ S^{p−1}. Consider a matrix B which satisfies, for some δ > 0,

| ‖Bu‖_2^2 / Tr(Γ) − 1 | ≤ max(δ, δ^2) ,  ∀u ∈ A .   (184)

Then,

√Tr(Γ) − δ√Tr(Γ) ≤ inf_{u∈A} ‖Bu‖_2 ≤ sup_{u∈A} ‖Bu‖_2 ≤ √Tr(Γ) + δ√Tr(Γ) .   (185)

Conversely, if B satisfies (185) for some δ > 0, then

| ‖Bu‖_2^2 / Tr(Γ) − 1 | ≤ 3 max(δ, δ^2) ,  ∀u ∈ A .   (186)

Note that previously we have proved, with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Γ))) and δ = (√Λ_max(Γ)/√Tr(Γ)) (η_1 w(A) + 2) + τ/(6√Tr(Γ)), for all v ∈ N_ε(A),

√Tr(Γ) − δ√Tr(Γ) ≤ min_{v∈N_ε(A)} ‖X̃v‖_2 ≤ max_{v∈N_ε(A)} ‖X̃v‖_2 ≤ √Tr(Γ) + δ√Tr(Γ) .   (187)

Therefore, from the second result in Lemma 16, the following is true with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Γ))):

| ‖X̃v‖_2^2 / Tr(Γ) − 1 | ≤ 3 max(δ, δ^2) ,  ∀v ∈ N_ε(A)
⟹ | ‖X̃v‖_2^2 − Tr(Γ) | ≤ 3 Tr(Γ) max(δ, δ^2) ,  ∀v ∈ N_ε(A) .   (188)

Now, using the result of Lemma 12 and choosing ε = 1/4, with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Γ))),

sup_{u∈A} | ‖X̃u‖_2^2 − Tr(Γ) | = sup_{u∈A} | ⟨(X̃^T X̃ − Tr(Γ)I)u, u⟩ | ≤ (1/(1 − 2ε)) max_{v∈N_ε(A)} | ⟨(X̃^T X̃ − Tr(Γ)I)v, v⟩ | ≤ 2 max_{v∈N_ε(A)} | ‖X̃v‖_2^2 − Tr(Γ) | ≤ 6 Tr(Γ) max(δ, δ^2) .

Using the above result along with (187), and using the first result in Lemma 16, the following holds with probability at least 1 − 2 exp(−τ^2/(72Λ_max(Γ))):

inf_{u∈A} ‖X̃u‖_2 ≥ √Tr(Γ) − 6δ√Tr(Γ) ≥ √Tr(Γ) − √Λ_max(Γ)(η_0 w(A) + 12) − τ ,   (189)

where in the last step we denote η_0 = 6η_1. That completes the proof.
Appendix E: Restricted Eigenvalue Conditions: Sub-Gaussian Designs
In this section we consider the RE condition for sub-Gaussian design matrices with squared loss. As in the Gaussian case, we prove results for three different cases: (i) design matrices with i.i.d. isotropic rows; (ii) design matrices with independent but anisotropic rows; and (iii) design matrices with isotropic rows but dependent samples (correlated columns).
E.1 Independent Isotropic Sub-Gaussian Designs
We start with the setting where the sub-Gaussian design matrix X ∈ R^{n×p} has independent rows X_i, each of which is isotropic.

Theorem 10 Let X ∈ R^{n×p} be a design matrix whose rows X_i are independent isotropic sub-Gaussian random vectors in R^p. Then, for any set A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−η_1 τ^2)), we have

inf_{u∈A} ‖Xu‖_2 ≥ √n − η_0 w(A) − τ ,   (190)

where η_0, η_1 > 0 are constants which depend only on the sub-Gaussian norm |||x_{ij}|||_{ψ2} = k.

Proof: The analysis uses the results of Lemmas 11 and 12, which respectively help convert the problem from a minimum singular value problem for X to a minimum eigenvalue problem, and from a continuous to a discrete problem. With B = (1/√n) X, following Lemma 11, it suffices to show

| (1/n) ‖Xu‖_2^2 − 1 | ≤ max(δ, δ^2) =: t ,  ∀u ∈ A ,   (191)

where δ = η_0 w(A)/√n + τ/√n. From Lemma 12, using an ε = 1/4 net N_ε(A) on A ⊆ S^{p−1}, we have

sup_{u∈A} | (1/n) ‖Xu‖_2^2 − 1 | = sup_{u∈A} | ⟨((1/n) X^T X − I)u, u⟩ | ≤ 2 max_{v∈N_ε(A)} | ⟨((1/n) X^T X − I)v, v⟩ | = 2 max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − 1 | .   (192)

As a result, it suffices to prove that, with high probability,

max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − 1 | ≤ t/2 .   (193)

For any v ∈ N_ε(A) ⊆ S^{p−1} and any row X_i of X, the Z_i = ⟨X_i, v⟩ are independent sub-Gaussian random variables with E[Z_i^2] = 1 and |||Z_i|||_{ψ2} ≤ K. Therefore, the Z_i^2 − 1 are independent centered sub-exponential random variables with

|||Z_i^2 − 1|||_{ψ1} ≤ 2 |||Z_i^2|||_{ψ1} ≤ 4 |||Z_i|||_{ψ2}^2 ≤ 4K^2 .   (194)

By the large deviation inequality for sub-exponential variables, we have

P( | (1/n) ‖Xv‖_2^2 − 1 | ≥ t/2 ) = P( | (1/n) Σ_{i=1}^n (Z_i^2 − 1) | ≥ t/2 )
≤ 2 exp( −(c_2/K^4) min(t, t^2) n ) = 2 exp( −(c_2/K^4) δ^2 n )
≤ 2 exp( −(c_2/K^4) (η_0^2 w^2(A) + τ^2) ) .   (195)

Taking a union bound over all v ∈ N_ε(A), and using (74), i.e., the upper bound on the cardinality of the covering in terms of the Gaussian width of A, we have

P( max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − 1 | ≥ t/2 ) ≤ 2 N(A, 1/4) exp( −(c_2/K^4)(η_0^2 w^2(A) + τ^2) )
≤ 2 exp(c_1 w^2(A)) exp( −(c_2 η_0^2/K^4) w^2(A) ) exp( −(c_2/K^4) τ^2 )
= 2 exp( −((c_2 η_0^2/K^4) − c_1) w^2(A) ) exp( −(c_2/K^4) τ^2 )
≤ 2 exp( −(c_2/K^4) τ^2 ) ,

assuming η_0 ≥ (K^2/√c_2) √c_1. Denoting η_1 = c_2/K^4 completes the proof.
E.2 Independent Anisotropic Sub-Gaussian Designs
We consider a setting where the rows X_i of the design matrix are independent, but each row is sampled from an anisotropic sub-Gaussian distribution, i.e., |||x_{ij}|||_{ψ2} = k and E[X_i X_i^T] = Σ, where X_i ∈ R^p. Further, to maintain the results at the same scale as the previous results, we assume ‖x_{ij}‖_{L2} = 1 without loss of generality. The setting of anisotropic sub-Gaussian design matrices has been investigated in [32] for the special case of the L1 norm. In contrast, our analysis applies to any norm regularization, with the dependency captured by the Gaussian width of the corresponding error set. Further, the analysis can be viewed as an extension of our general proof technique used in the earlier sections.

Theorem 11 Let the sub-Gaussian design matrix X be row-wise independent, with each row satisfying E[X_i X_i^T] = Σ ∈ R^{p×p}. Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−η_1 τ^2)), we have

inf_{u∈A} ‖Xu‖_2 ≥ ν√n − η_0 √Λ_max(Σ) w(A) − τ ,   (196)

where ν = inf_{u∈A} ‖Σ^{1/2}u‖_2, √Λ_max(Σ) denotes the largest eigenvalue of Σ^{1/2}, and η_0, η_1 > 0 are constants which depend on the sub-Gaussian norm |||x_{ij}|||_{ψ2} = k.

Proof: The proof uses the results of Lemmas 12 and 13. With B = (1/√n) X, following Lemma 13, it suffices to show that

| (1/n) ‖Xu‖_2^2 − ‖Σ^{1/2}u‖_2^2 | ≤ max(δ, δ^2) =: t ,  ∀u ∈ A ,   (197)

where δ = η_0 Λ_max(Σ) w(A)/√n + τ/√n. From Lemma 12, using an ε = 1/4 net N_ε(A) on A ⊆ S^{p−1}, we have

sup_{u∈A} | (1/n) ‖Xu‖_2^2 − ‖Σ^{1/2}u‖_2^2 | = sup_{u∈A} | ⟨((1/n) X^T X − Σ)u, u⟩ | ≤ 2 max_{v∈N_ε(A)} | ⟨((1/n) X^T X − Σ)v, v⟩ | = 2 max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | .   (198)

As a result, it suffices to prove that, with high probability,

max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | ≤ t/2 .   (199)

For any v ∈ N_ε(A) ⊆ S^{p−1} and any row X_i of X, the Z_i = ⟨X_i, v⟩ are independent sub-Gaussian random variables with E[Z_i^2] = E[v^T X_i X_i^T v] = v^T E[X_i X_i^T] v = ‖Σ^{1/2}v‖_2^2 and |||Z_i|||_{ψ2} ≤ K√Λ_max(Σ). Therefore the Z_i^2 − ‖Σ^{1/2}v‖_2^2 are independent centered sub-exponential random variables with

|||Z_i^2 − ‖Σ^{1/2}v‖_2^2|||_{ψ1} ≤ 2 |||Z_i^2|||_{ψ1} ≤ 4 |||Z_i|||_{ψ2}^2 ≤ 4K^2 Λ_max(Σ) .   (200)

Therefore, by the large deviation inequality for sub-exponential variables, we have

P( | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | ≥ t/2 ) = P( | (1/n) Σ_{i=1}^n (Z_i^2 − ‖Σ^{1/2}v‖_2^2) | ≥ t/2 )
≤ 2 exp( −(c_2/(K^4 Λ_max^2(Σ))) min(t, t^2) n ) = 2 exp( −(c_2/(K^4 Λ_max^2(Σ))) δ^2 n )
≤ 2 exp( −(c_2/(K^4 Λ_max^2(Σ))) (η_0^2 w^2(A) Λ_max^2(Σ) + τ^2) ) .   (201)

Taking a union bound over all v ∈ N_ε(A), and using (74), i.e., the upper bound on the cardinality of the covering in terms of the Gaussian width of A, we have

P( max_{v∈N_ε(A)} | (1/n) ‖Xv‖_2^2 − ‖Σ^{1/2}v‖_2^2 | ≥ t/2 )
≤ 2 N(A, 1/4) exp( −(c_2/(K^4 Λ_max^2(Σ))) (η_0^2 w^2(A) Λ_max^2(Σ) + τ^2) )
≤ 2 exp(c_1 w^2(A)) exp( −(c_2/(K^4 Λ_max^2(Σ))) (η_0^2 w^2(A) Λ_max^2(Σ) + τ^2) )
= 2 exp( −((c_2 η_0^2/K^4) − c_1) w^2(A) ) exp( −(c_2/(K^4 Λ_max^2(Σ))) τ^2 )
≤ 2 exp( −(c_2/(K^4 Λ_max^2(Σ))) τ^2 ) ,

assuming η_0 ≥ (K^2/√c_2) √c_1. Denoting η_1 = c_2/(K^4 Λ_max^2(Σ)) completes the proof.
E.3 Dependent Isotropic Sub-Gaussian Designs
In this section, we consider the setting where the design matrix X̃ has isotropic sub-Gaussian rows, but the rows are dependent. In particular, we assume that |||x̃_{ij}|||_{ψ2} = k and E[X̃_j X̃_j^T] = Γ ∈ R^{n×n}. We show that the gain condition holds for such dependent sub-Gaussian designs:

Theorem 12 Let X̃ ∈ R^{n×p} be a sub-Gaussian design matrix with isotropic rows and correlated columns with E[X̃_j X̃_j^T] = Γ ∈ R^{n×n}. Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−η_1 τ^2)), we have

inf_{u∈A} ‖X̃u‖_2 ≥ √Tr(Γ) − √Λ_max(Γ) η_0 w(A) − τ ,   (202)

where η_0, η_1 are constants which depend on the sub-Gaussian norm |||x_{ij}|||_{ψ2} = k.

Proof: The analysis uses the results of Lemmas 12 and 16. With B = X̃ and following Lemma 16, it suffices to show that

sup_{u∈A} | ‖X̃u‖_2^2 − Tr(Γ) | ≤ max(δ, δ^2) Tr(Γ) =: t ,  ∀u ∈ A ,   (203)

where δ = η_0 w(A) Λ_max(Γ)/√Tr(Γ) + τ/√Tr(Γ). From Lemma 12, using an ε = 1/4 net N_ε(A) on A ⊆ S^{p−1}, we have

sup_{u∈A} | ‖X̃u‖_2^2 − Tr(Γ) | = sup_{u∈A} | ⟨(X̃^T X̃ − I Tr(Γ))u, u⟩ | ≤ 2 max_{v∈N_ε(A)} | ⟨(X̃^T X̃ − I Tr(Γ))v, v⟩ | = 2 max_{v∈N_ε(A)} | ‖X̃v‖_2^2 − Tr(Γ) | .   (204)

As a result, it suffices to prove that, with high probability,

max_{v∈N_ε(A)} | ‖X̃v‖_2^2 − Tr(Γ) | ≤ t/2 .   (205)

Let Z̃ = X̃v. Also, X̃ = Γ^{1/2}X, where E[X̃_j X̃_j^T] = Γ and X ∈ R^{n×p} has independent sub-Gaussian entries. Let Z = Xv, which is an isotropic sub-Gaussian random vector, with each entry Z_i = ⟨X_i, v⟩ such that |||Z_i|||_{ψ2} ≤ K. Then

‖X̃v‖_2^2 = ‖Γ^{1/2}Z‖_2^2 = ‖ Σ_{j=1}^n Z_j Γ_j^{1/2} ‖_2^2 = Σ_{j=1}^n Z_j^2 ‖Γ_j^{1/2}‖_2^2 + Σ_{j,k∈[1,…,n], j≠k} Z_j Z_k ⟨Γ_j^{1/2}, Γ_k^{1/2}⟩ ,   (206)

where Γ_j^{1/2}, Γ_k^{1/2} denote the jth and kth columns of Γ^{1/2}, and [1, …, n] denotes the set containing the first n natural numbers. We assume that Σ_{j=1}^n Z_j^2 ‖Γ_j^{1/2}‖_2^2 = Tr(Γ) almost surely. Then,

‖X̃v‖_2^2 = Σ_{j=1}^n ‖Γ_j^{1/2}‖_2^2 + Σ_{j≠k} Z_j Z_k ⟨Γ_j^{1/2}, Γ_k^{1/2}⟩ = Tr(Γ) + Σ_{j≠k} Z_j Z_k ⟨Γ_j^{1/2}, Γ_k^{1/2}⟩ .   (207)

Therefore, we get

| ‖X̃v‖_2^2 − Tr(Γ) | ≤ | Σ_{j,k∈[1,…,n], j≠k} Z_j Z_k ⟨Γ_j^{1/2}, Γ_k^{1/2}⟩ | .   (208)

The sum on the right-hand side is ⟨Γ′z, z⟩, where Γ′ is the off-diagonal part of Γ. It can also be written in the decoupled form

R_T(z) = Σ_{j∈T, k∈T^c} Z_j Z_k ⟨Γ_j^{1/2}, Γ_k^{1/2}⟩ .   (209)

We state the following decoupling lemma, whose proof is provided on p. 38 of [38], to bound the above quantity.

Lemma 17 Consider a double array of real numbers (a_{ij})_{i,j=1}^n such that a_{ii} = 0 for all i. Then

Σ_{i,j∈[1,…,n]} a_{ij} = 4 E[ Σ_{i∈T, j∈T^c} a_{ij} ] ,   (210)

where T is a random subset of [1, …, n] with average size n/2. In particular,

4 min_{T⊆[1,…,n]} Σ_{i∈T, j∈T^c} a_{ij} ≤ Σ_{i,j∈[1,…,n]} a_{ij} ≤ 4 max_{T⊆[1,…,n]} Σ_{i∈T, j∈T^c} a_{ij} ,   (211)

where the minimum and maximum are over all subsets T of [1, …, n].

From the above lemma,

| ‖X̃v‖_2^2 − Tr(Γ) | ≤ 4 max_{T⊆[1,…,n]} |R_T(z)| .   (212)

Therefore we want to bound the probability of the following event:

P( max_{v∈N_ε(A), T⊆[1,…,n]} |R_T(z)| > t/8 ) ≤ N(A, 1/4) · 2^n · max_{v∈N_ε(A), T⊆[1,…,n]} P( |R_T(z)| > t/8 ) ,   (213)

where the right-hand side follows from a union bound. To estimate the probability, we fix a vector v ∈ N_ε(A) and a subset T ⊆ [1, …, n], and condition on a realization of the random variables (Z_k)_{k∈T^c}. We can then express

R_T(z) = Σ_{j∈T} Z_j ⟨Γ_j^{1/2}, b⟩ ,  where  b = Σ_{k∈T^c} Z_k Γ_k^{1/2} .   (214)

Under this conditioning, b is a fixed vector, so R_T(z) is a sum of independent random variables. Moreover,

‖b‖_2 ≤ ‖Γ^{1/2}‖_2 ‖Z‖_2 = ‖Γ^{1/2}‖_2 ‖Xv‖_2 ≤ √Λ_max(Γ) c√n .   (215)

The inequality above follows from our results for the isotropic sub-Gaussian design scenario; it can be proved using the same arguments that

‖Z‖_2 = ‖Xv‖_2 ≤ √n + η_0 w(A) + τ ≤ c√n .   (216)

For a large enough c the above bound holds with overwhelming probability. Under these conditions, Σ_{j∈T} Z_j ⟨Γ_j^{1/2}, b⟩ is a sum of independent sub-Gaussian random variables, which in turn is a sub-Gaussian variable with sub-Gaussian norm

|||R_T(z)|||_{ψ2}^2 ≤ C K^2 Σ_{j∈T} ⟨Γ_j^{1/2}, b⟩^2 ,   (217)

and, since ‖Γ_j^{1/2}‖_2 ≤ √Λ_max(Γ) for every column j,

( Σ_{j∈T} ⟨Γ_j^{1/2}, b⟩^2 )^{1/2} ≤ √Λ_max(Γ) ‖b‖_2 ≤ Λ_max(Γ) c√n ,   (218)

so that |||R_T(z)|||_{ψ2} is at most a constant multiple of K Λ_max(Γ) √n. Denoting the conditional probability by P_T = P(· | (Z_k)_{k∈T^c}) and the expectation with respect to (Z_k)_{k∈T^c} by E_{T^c}, we obtain

P( |R_T(z)| > t/8 ) ≤ E_{T^c}[ P_T( |R_T(z)| > t/8 ) ]
≤ 2 exp( −c_4 ( (t/8) / (K Λ_max(Γ) √n) )^2 )
≤ 2 exp( −c_5 ( δ Tr(Γ) / (K Λ_max(Γ) √n) )^2 ) .

Substituting δ = η_0 w(A) Λ_max(Γ)/√Tr(Γ) + τ/√Tr(Γ) and using the earlier union bound, we get

P( max_{v∈N_ε(A), T⊆[1,…,n]} |R_T(z)| > t/8 )
≤ exp(c_1 w^2(A)) exp(n ln 2) exp( −c_5 ( η_0^2 w^2(A) Tr(Γ)/(K^2 n) + τ^2 Tr(Γ)/(K^2 Λ_max^2(Γ) n) ) )
≤ exp( −w^2(A) ( c_5 η_0^2 Tr(Γ)/(K^2 n) − c_1 ) + n ln 2 ) exp( −c_5 τ^2 Tr(Γ)/(K^2 Λ_max^2(Γ) n) ) .

Choosing η_0 > √( c_1 K^2 n/(c_5 Tr(Γ)) + n^2 K^2 ln 2/(c_5 w^2(A) Tr(Γ)) ) and η_1 = c_5 Tr(Γ)/(K^2 Λ_max^2(Γ) n) completes the proof.
Appendix F: Generalized Linear Models: Restricted Strong Convexity
We consider the bounds on the regularization parameter and the RE condition for GLM loss functions and any norm.
F.1 Noise for GLM models
For any GLM, the first derivative of the log-partition function ψ(u) with respect to u, evaluated at ⟨θ, x⟩, equals the conditional mean:

ψ′(⟨θ, x⟩) = ( ∫_{y|x} exp{y⟨θ, x⟩} y dy ) / ( ∫_{y|x} exp{y⟨θ, x⟩} dy ) = ( ∫_{y|x} exp{y⟨θ, x⟩} y dy ) / exp{ψ(⟨θ, x⟩)} = ∫_{y|x} P(y|x; θ) y dy = E[y|x] .

Now, the maximum likelihood loss function is the negative of the log of the probability distribution:

L(θ; Z_1^n) = −(1/n) Σ_{i=1}^n ⟨θ, y_i X_i⟩ + (1/n) Σ_{i=1}^n ψ(⟨θ, X_i⟩) .

Therefore,

∇L(θ*; Z_1^n) = −(1/n) Σ_{i=1}^n y_i X_i + (1/n) Σ_{i=1}^n X_i ψ′(⟨θ*, X_i⟩) = (1/n) X^T ( E_n[y|x] − y ) = (1/n) X^T w .

If the noise w is assumed to be Gaussian or sub-Gaussian, the analysis in Section 3 gives the bound on the regularization parameter, and the analysis for the recovery bounds follows analogously.
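As a concrete instance of this computation, the sketch below uses logistic regression (log-partition ψ(u) = log(1 + e^u)), evaluates the gradient of the negative log-likelihood at θ*, and checks that it equals (1/n)Xᵀw with w = E[y|x] − y; the problem sizes and the sparse choice of θ* are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 500, 50, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 0.5

def psi(u):             # log-partition of the Bernoulli GLM (logistic regression)
    return np.log1p(np.exp(u))

def psi_prime(u):       # psi'(u) = E[y | x] for the logistic model
    return 1.0 / (1.0 + np.exp(-u))

mu = psi_prime(X @ theta_star)
y = rng.binomial(1, mu).astype(float)

def glm_loss_grad(theta):
    """Gradient of L(theta) = (1/n) sum_i [ -y_i <theta, X_i> + psi(<theta, X_i>) ]."""
    return X.T @ (psi_prime(X @ theta) - y) / n

grad = glm_loss_grad(theta_star)
w = mu - y                                   # noise vector E[y|x] - y
assert np.allclose(grad, X.T @ w / n)        # gradient at theta* is (1/n) X^T w
print("R*(grad L(theta*)) for the L1 norm:", np.abs(grad).max())
```

The printed dual norm of the gradient is exactly the quantity that the condition λ_n ≥ β R*(∇L(θ*; Z^n)) must dominate when the regularizer is the L1 norm.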
F.2 RSC condition for GLMs
The RSC condition is as follows:

δL(θ*, u; Z_1^n) := L(θ* + u; Z_1^n) − L(θ*; Z_1^n) − ⟨∇L(θ*; Z_1^n), u⟩ ≥ κ ‖u‖_2^2   (219)

for all u ∈ A_r. For the general formulation of GLMs given earlier,

δL(θ*, u; Z_1^n) = −⟨θ* + u, (1/n) Σ_i y_i X_i⟩ + (1/n) Σ_i ψ(⟨θ* + u, X_i⟩) + ⟨θ*, (1/n) Σ_i y_i X_i⟩ − (1/n) Σ_i ψ(⟨θ*, X_i⟩) − ⟨ −(1/n) Σ_i y_i X_i + (1/n) Σ_i X_i ψ′(⟨θ*, X_i⟩), u ⟩ .

Simplifying the expression and applying the mean value theorem twice, we get

δL(θ*, u; Z_1^n) = (1/n) Σ_{i=1}^n ψ″(⟨θ*, X_i⟩ + γ_i ⟨u, X_i⟩) ⟨u, X_i⟩^2 ,   (220)

where γ_i ∈ [0, 1]. The RSC condition for GLMs then needs lower bounds for

δL(θ*, u; Z_1^n) = (1/n) Σ_{i=1}^n ψ″(⟨θ*, X_i⟩ + γ_i ⟨u, X_i⟩) ⟨u, X_i⟩^2 ,   (221)

where γ_i ∈ [0, 1]. The second derivative of the log-partition function is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis will suitably consider a compact set where ℓ = ℓ_ψ(T) = min_{|a|≤2T} ψ″(a) is bounded away from zero. The only assumption outside this compact set {a : |a| ≤ 2T} is that the second derivative is greater than 0. Further, we assume ‖θ*‖_2 ≤ c_1 for some constant c_1. With these assumptions,

δL(θ*, u; Z_1^n) ≥ (ℓ/n) Σ_{i=1}^n ⟨X_i, u⟩^2 I[|⟨X_i, θ*⟩| < T] I[|⟨X_i, u⟩| < T] .   (222)

We give a characterization of the RSC condition for independent isotropic sub-Gaussian design matrices X ∈ R^{n×p}. The analysis can be suitably generalized to the other design matrices considered in earlier sections using the same techniques. We consider u ∈ A ⊆ S^{p−1}, so that ‖u‖_2 = 1, and assume ‖θ*‖_2 ≤ c_1 for some constant c_1. Assuming X has sub-Gaussian entries with |||x_{ij}|||_{ψ2} ≤ k, ⟨X_i, θ*⟩ and ⟨X_i, u⟩ are sub-Gaussian random variables with sub-Gaussian norm at most Ck. Let φ_1 = φ_1(T; u) = P(|⟨X_i, u⟩| > T) ≤ e · exp(−c_2 T^2/(C^2 k^2)), and φ_2 = φ_2(T; θ*) = P(|⟨X_i, θ*⟩| > T) ≤ e · exp(−c_2 T^2/(C^2 k^2)). The result we present is in terms of the constants ℓ = ℓ_ψ(T), φ_1 = φ_1(T; u), and φ_2 = φ_2(T; θ*) for any suitably chosen T.

Theorem 13 Let X ∈ R^{n×p} be a design matrix with independent isotropic sub-Gaussian rows. Then, for any set A ⊆ S^{p−1}, any α ∈ (0, 1), any τ > 0, and any

n ≥ (2/(α^2 (1 − φ_1 − φ_2))) ( c w^2(A) + (c_2 (1 − φ_1 − φ_2)^5/K^4) (1 − α) τ^2 )

for suitable constants c, c_2, with probability at least 1 − 3 exp(−η_1 τ^2), we have

inf_{u∈A} √( n · δL(θ*; u, X) ) ≥ ℓ √π ( √n − η_0 w(A) − τ ) ,   (223)

where π = (1 − α)(1 − φ_1 − φ_2), ℓ = ℓ_ψ(T) = min_{|a|≤2T+β} ψ″(a), and the constants (η_0, η_1) depend on the sub-Gaussian norm |||x_{ij}|||_{ψ2} = k.

Proof: As in earlier sections, we work with vectors v ∈ N_ε(A). We make the observation that, for any u ∈ A with corresponding v ∈ N_ε(A) and some fixed constant β, if M_u(X), M_v(X) denote the sets M_u(X) = {X_i : |⟨X_i, u⟩| ≤ T + β} and M_v(X) = {X_i : |⟨X_i, v⟩| ≤ T}, then M_v(X) ⊆ M_u(X). This observation helps when we extend the argument from v ∈ N_ε(A) to u ∈ A, since applying Lemma 12 requires the submatrix relation M_v(X) ⊆ M_u(X).

For v ∈ N_ε(A), let Z_i = ⟨X_i, v⟩. Then Z_i is sub-Gaussian with ‖Z_i‖_{L2} = 1 and |||Z_i|||_{ψ2} = Ck = K. Now, for any fixed T, let Z̄_i = ⟨X_i, v⟩ I(|⟨X_i, v⟩| ≤ T) I(|⟨X_i, θ*⟩| ≤ T). Then the probability distribution over Z̄_i can be written as

P(Z̄_i = z) = P(⟨X_i, v⟩ = z) I(|⟨X_i, v⟩| ≤ T) I(|⟨X_i, θ*⟩| ≤ T) / P(|⟨X_i, v⟩| ≤ T, |⟨X_i, θ*⟩| ≤ T) ≤ P(⟨X_i, v⟩ = z) / (1 − φ_1 − φ_2) .   (224)

(With abuse of notation, we treat the distribution over Z̄_i as discrete for ease of exposition; a similar argument applies to the true continuous distribution, with more notation.) As a result, |||Z̄_i|||_{ψ2} ≤ K/(1 − φ_1 − φ_2). Let Z̄ ∈ R^n be the vector whose elements are the Z̄_i, implying that some elements can be zero. Note that

√( n · δL(θ*; u, X) ) ≥ ℓ ‖Z̄‖_2 ,   (225)

where ℓ = ℓ_ψ(T). Further, if Z̄_m ∈ R^m, for some m ≤ n, is the vector whose elements are of the form Z̄_i corresponding to m independent samples X_i, we follow the same style of analysis as for Theorem 10. Let max(δ, δ^2) = t with δ = η_0 w(A)/√n + τ/√n; given (195), and following the same analysis as in Section E.1, we have

P( | (1/m) ‖Z̄_m‖_2^2 − 1 | ≥ t/2 ) ≤ 2 exp( −(c_2 (1 − φ_1 − φ_2)^4/K^4) δ^2 m ) .   (226)

In particular, for any α ∈ (0, 1), for m ≥ (1 − α)(1 − φ_1 − φ_2) n, we have

P( | (1/((1 − α)(1 − φ_1 − φ_2) n)) ‖Z̄_m‖_2^2 − 1 | ≥ t/2 ) ≤ 2 exp( −(c_2 (1 − φ_1 − φ_2)^5/K^4) δ^2 (1 − α) n ) .   (227)

For convenience, let π = (1 − α)(1 − φ_1 − φ_2). Now, since the design matrix X consists of n samples X_i, consider the discrete random variable M_{v,θ*}(X) = |{X_i : |⟨X_i, v⟩| ≤ T, |⟨X_i, θ*⟩| ≤ T}|. Clearly, M_{v,θ*}(X) ∈ {0, 1, …, n} follows a binomial distribution with success probability at least (1 − φ_1 − φ_2), and E[M_{v,θ*}(X)] ≥ (1 − φ_1 − φ_2) n. Then, for any v ∈ N_ε(A), and noting that ‖Z̄‖_2 = ‖Z̄_m‖_2, we have

P( | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 )
= P( | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 | M_{v,θ*}(X) ≥ (1 − α)(1 − φ_1 − φ_2) n ) P( M_{v,θ*}(X) ≥ (1 − α)(1 − φ_1 − φ_2) n )
+ P( | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 | M_{v,θ*}(X) < (1 − α)(1 − φ_1 − φ_2) n ) P( M_{v,θ*}(X) < (1 − α)(1 − φ_1 − φ_2) n ) .

By an application of the Chernoff bound for binomial distributions,

P( M_{v,θ*}(X) < (1 − α)(1 − φ_1 − φ_2) n ) ≤ exp( −(α^2 (1 − φ_1 − φ_2)/2) n ) .   (228)

Noting that we are looking for an upper bound,

P( | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 )
≤ P( | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 | m ≥ (1 − α)(1 − φ_1 − φ_2) n ) + exp( −(α^2 (1 − φ_1 − φ_2)/2) n )
≤ 2 exp( −(c_2 (1 − φ_1 − φ_2)^5/K^4) δ^2 (1 − α) n ) + exp( −(α^2 (1 − φ_1 − φ_2)/2) n )
≤ 2 exp( −η_1 [η_0^2 w^2(A) + τ^2] ) + exp( −(α^2 (1 − φ_1 − φ_2)/2) n ) ,

where δ = η_0 w(A)/√n + τ/√n and η_1 = (c_2 (1 − φ_1 − φ_2)^5/K^4)(1 − α).

Taking a union bound over all v ∈ N_ε(A), and using (74), i.e., the upper bound on the cardinality of the covering in terms of the Gaussian width of A, we have

P( max_{v∈N_ε(A)} | (1/(πn)) ‖Z̄‖_2^2 − 1 | ≥ t/2 )
≤ N(A, 1/4) [ 2 exp( −η_1 [η_0^2 w^2(A) + τ^2] ) + exp( −(α^2 (1 − φ_1 − φ_2)/2) n ) ]
≤ 2 exp( c w^2(A) − η_1 [η_0^2 w^2(A) + τ^2] ) + exp( c w^2(A) − (α^2 (1 − φ_1 − φ_2)/2) n )
= 2 exp( −(η_1 η_0^2 − c) w^2(A) ) exp( −η_1 τ^2 ) + exp( c w^2(A) − (α^2 (1 − φ_1 − φ_2)/2) n )
≤ 2 exp( −η_1 τ^2 ) + exp( c w^2(A) − (α^2 (1 − φ_1 − φ_2)/2) n )
≤ 3 exp( −η_1 τ^2 ) ,

assuming η_0 ≥ √(c/η_1) and n ≥ (2/η_2)(c w^2(A) + η_1 τ^2), where η_2 = α^2 (1 − φ_1 − φ_2). As a result, from Lemma 11 and (225), we have

P( inf_{u∈A} √( n · δL(θ*; u, X) ) ≥ ℓ √π ( √n − η_0 w(A) − τ ) ) ≥ 1 − 3 exp( −η_1 τ^2 ) .   (229)
References

[1]
[2] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference, 3(3):224–294, 2014.
[3] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.
[4] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[5] O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. John Wiley and Sons, 1978.
[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[7] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[9] L. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[10] P. Buhlmann and S. van de Geer. Statistics for High Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.
[11] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[12] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, 2006.
[13] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203–4215, 2005.
[14] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[15] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.
[16] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1:290–330.
[17] M. A. T. Figueiredo and R. D. Nowak. Sparse estimation with strongly correlated variables using ordered weighted L1 regularization. arXiv:1409.4005, 2014.
[18] Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.
[19] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, volume 1317 of Lecture Notes in Mathematics, pages 84–106. Springer, 1988.
[20] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. arXiv:1312.3580, 2013.
[21] G. Lecué and S. Mendelson. Sparse recovery under weak moment assumptions. arXiv:1401.2188, 2014.
[22] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society.
[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.
[24] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
[25] A. Montanari and J. A. Pereira. Which graphical models are difficult to learn? In Neural Information Processing Systems (NIPS), pages 1303–1311, 2009.
[26] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538–557, 2012.
[27] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
[28] R. I. Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv:1312.2903, 2013.
[29] S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized Lasso: A precise analysis. arXiv:1311.0830, 2013.
[30] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
[31] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.
[32] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447, 2013.
[33] M. Talagrand. The Generic Chaining. Springer, 2005.
[34] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.
[35] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[36] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance. (To appear), 2015.
[37] A.-S. Ustunel. An Introduction to Analysis on Wiener Space. Springer-Verlag, 1995.
[38] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, chapter 5, pages 210–268. Cambridge University Press, 2012.
[39] R. Vershynin. Estimation in high dimensions: A geometric perspective. Submitted, 2014.
[40] A. B. Vizcarra and F. G. Viens. Some applications of the Malliavin calculus to sub-Gaussian and non-sub-Gaussian random fields. In Seminar on Stochastic Analysis, Random Fields and Applications, Progress in Probability, volume 59, pages 363–396. Birkhauser, 2008.
[41] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
[42] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[43] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.
[44] S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532–562, 2014.