Overcoming The Limitations of Phase Transition by Higher Order Analysis of Regularization Techniques Haolei Weng, Arian Maleki, Le Zheng
Abstract. We study the problem of estimating $\beta \in \mathbb{R}^p$ from its noisy linear observations $y = X\beta + w$, where $w \sim N(0, \sigma_w^2 I_{n \times n})$, under the following high-dimensional asymptotic regime: given a fixed number $\delta$, $p \to \infty$ while $n/p \to \delta$. We consider the popular class of $\ell_q$-regularized least squares (LQLS) estimators, a.k.a. bridge estimators, given by the optimization problem:
arXiv:1603.07377v1 [math.ST] 23 Mar 2016
$$\hat\beta(\lambda, q) \in \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q,$$
and characterize the almost sure limit of $\frac{1}{p}\|\hat\beta(\lambda, q) - \beta\|_2^2$. The expression we derive for this limit has no explicit form and hence is not useful for comparing different algorithms or for evaluating the effect of $\delta$ or of the sparsity level of $\beta$. To simplify the expressions, researchers have considered the ideal “no-noise” regime and have characterized the values of $\delta$ for which the almost sure limit is zero. This is known as the phase transition analysis. In this paper, we first perform the phase transition analysis of LQLS. Our results reveal some of the limitations and misleading features of the phase transition analysis. To overcome these limitations, we propose the study of these algorithms under the low-noise regime. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also makes an accurate comparison of different regularizers possible.
I. INTRODUCTION
A. Objective
Consider the linear regression problem where the goal is to estimate a vector $\beta \in \mathbb{R}^p$ from a set of $n$ noisy linear observations $y = X\beta + w$. This problem has been studied extensively in the last two centuries, since Gauss and Legendre developed the least squares estimate of $\beta$. The instability, or in statistical terms the high variance, of the least squares estimates led to the development of regularized least squares. One of the most popular regularization classes is the $\ell_q$-regularized least squares (LQLS), a.k.a. bridge regression [1], [2], given by the following optimization problem:
$$\hat\beta(\lambda, q) \in \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q, \tag{1}$$
where $\|\beta\|_q^q = \sum_{i=1}^{p} |\beta_i|^q$ and $1 \le q \le 2$.¹ These algorithms have been extensively studied in the literature. In particular, one can prove the consistency of these algorithms under the classical asymptotic analysis ($p$ fixed while $n \to \infty$) [4]. However, this asymptotic regime becomes irrelevant for high-dimensional problems in which the number of observations, $n$, is not much larger than $p$. Under this high-dimensional setting, if $\beta$ does not have any specific “structure”, we do not expect any estimator to perform well. One of the structures that has attracted attention in the last twenty years is sparsity, which assumes that only $k$ of the elements of $\beta$ are non-zero and the rest are zero. To understand the behavior of the estimators under this high-dimensional setting, a new asymptotic framework has been proposed in which it is assumed that $k, n, p \to \infty$, while $n/p \to \delta$ and $k/p \to \epsilon$, where $\delta$ and $\epsilon$ are fixed numbers [5]–[9]. One of the main notions that has been studied extensively in this asymptotic framework is the phase transition [5]–[7], [10]. Intuitively speaking, phase transition analysis ignores the noise $w$ and characterizes the value of $\delta$ above which the estimate we obtain from an algorithm converges to the true $\beta$ (in a certain sense that will be clarified later). While there is always noise on the observations, it is believed that phase transition analysis provides reliable information when the noise variance is small. In this paper, we start by analyzing the phase transition diagrams of LQLS for $1 \le q \le 2$. Our analysis reveals several limitations of the phase transition analysis. We will clarify these limitations in the next section. We then propose the higher-order analysis of these algorithms as a replacement for the phase transition analysis. As we will explain in the next section, not only does our new framework resolve the issues of the phase transition analysis, but it also sheds light on the peculiar behavior of the phase transition diagrams.
Furthermore, it enables us to address the following fundamental questions regarding the performance of different regularizers: (i) If $\delta < 1$ and $\beta$ is exactly sparse, i.e., it has at most $k$ non-zero elements, which LQLS estimator has the smallest MSE? (ii) What if $\delta > 1$ and the vector is still exactly sparse? (iii) What if $\beta$ is not exactly sparse, but still has many coefficients around zero? As we will see, our framework provides quantitative and accurate answers to all these questions.
¹ Bridge regression is a name used for LQLS with any $q \ge 0$. In this paper we only focus on $1 \le q \le 2$. To analyze the case $0 \le q < 1$, [3] has used the replica method from statistical physics.
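[Figure 1: phase transition curves in the $(\delta, \epsilon)$ plane; horizontal axis $\delta$ and vertical axis $\epsilon$, each ranging from 0 to 1.]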
Fig. 1. Phase transition curves of LQLS for (i) $q < 1$: these results are derived in [3] from the non-rigorous replica method from statistical physics. We have included them only for comparison purposes; in this paper we focus on $q \ge 1$. (ii) $q = 1$: the blue curve exhibits the phase transition of LASSO. Below this curve LASSO can “successfully” recover $\beta$. (iii) $q > 1$: the magenta curve represents the phase transition of LQLS for any $q > 1$. This figure is based on Informal Result 1 and will be carefully defined and derived in Section III.
B. Limitations of the phase transition and our solution
In this section, we intuitively describe the results of phase transition analysis, its limitations, and our new framework. Consider the class of LQLS estimators and suppose that we would like to compare the performance of these estimators through the phase transition diagrams. For the purpose of this section, we assume that the vector $\beta$ has only $k$ non-zero elements, where $k/p \to \epsilon$ with $\epsilon \in (0, 1]$. Since phase transition analysis is concerned with the noiseless setting, it considers $\lim_{\lambda \to 0} \hat\beta(\lambda, q)$, which is equivalent to $\hat\beta^{0}(q)$, the solution of
$$\arg\min_{\beta} \|\beta\|_q^q \quad \text{subject to } y = X\beta. \tag{2}$$
We have explained this equivalence more carefully in the Appendix. Below we informally state the results of the phase transition analysis. We will formalize this statement and describe in detail the conditions under which it holds later in the paper.
Informal Result 1. For a given $\epsilon > 0$ and $q \in [1, 2]$, there exists a number $M_q(\epsilon)$ such that if $\delta \ge M_q(\epsilon) + \gamma$ ($\gamma > 0$ is an arbitrary number), then as $p \to \infty$ (2) succeeds in recovering $\beta$, while if $\delta \le M_q(\epsilon) - \gamma$ the algorithm fails.²
The curve $\delta = M_q(\epsilon)$ is called the phase transition curve of (2). While the phase transition curves can be derived with different techniques, we will derive them as a simple byproduct of one of our main results in Section III. We will show that $M_q(\epsilon)$ is given by the following formula:
$$M_q(\epsilon) = \begin{cases} 1 & \text{if } 1 < q \le 2, \\ \inf_{\chi \ge 0}\, (1-\epsilon)\, E\eta_1^2(Z; \chi) + \epsilon(1 + \chi^2) & \text{if } q = 1, \end{cases} \tag{3}$$
where $\eta_1(u; \chi) = (|u| - \chi)_+ \operatorname{sign}(u)$ denotes the soft thresholding function. This result has several peculiar features: (i) As is clear from Figure 1, $q = 1$ seems to require fewer measurements than any $q > 1$. (ii) The exact values of the non-zero elements of $\beta$ do not have any effect on the phase transition. In fact, even the sparsity level does not have any effect on the phase transition for $q > 1$. (iii) For every $q > 1$, the phase transition of (2) happens at exactly the same value.
These features raise the following question: to what extent are these phase transition results useful in applications, where at least a small amount of noise is present in the observations? For instance, intuitively speaking, we do not expect to see much difference between the performance of LQLS for $q = 1.01$ and $q = 1$. However, according to the phase transition analysis, $q = 1$ outperforms $q = 1.01$ by a wide margin. More interestingly, the performance of LQLS for $q = 1.01$ seems to be closer to that of $q = 2$ than to that of $q = 1$.
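The $q = 1$ branch of formula (3) has no closed form, but it is easy to evaluate numerically. The sketch below is an illustration, not part of the paper's analysis: it uses the identity $E\eta_1^2(Z;\chi) = 2[(1+\chi^2)P(Z > \chi) - \chi\phi(\chi)]$ for $Z \sim N(0,1)$ and minimizes over $\chi$ by ternary search. The function names, the search interval `chi_max`, and the iteration count are assumptions made for this example.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(x):
    """P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def soft_threshold_risk_at_zero(chi):
    """Closed form of E[eta_1(Z; chi)^2] for Z ~ N(0, 1) and chi >= 0."""
    return 2.0 * ((1.0 + chi * chi) * normal_upper_tail(chi) - chi * normal_pdf(chi))


def lasso_objective(chi, eps):
    """The function of chi that is minimized in the q = 1 branch of (3)."""
    return (1.0 - eps) * soft_threshold_risk_at_zero(chi) + eps * (1.0 + chi * chi)


def m1(eps, chi_max=20.0, iters=200):
    """Approximate M_1(eps) by ternary search (the objective is convex in chi)."""
    lo, hi = 0.0, chi_max
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if lasso_objective(a, eps) < lasso_objective(b, eps):
            hi = b
        else:
            lo = a
    return lasso_objective(0.5 * (lo + hi), eps)


def phase_transition(q, eps):
    """M_q(eps) as given in (3)."""
    if q == 1:
        return m1(eps)
    if 1 < q <= 2:
        return 1.0
    raise ValueError("formula (3) covers 1 <= q <= 2 only")
```

Evaluating `phase_transition(1, eps)` on a grid of `eps` values traces the LASSO curve, while any `q` in $(1, 2]$ returns the constant 1, which is the sparsity-independent behavior discussed above.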
Another example is the fact that, according to phase transition analysis, the values of the non-zero elements of $\beta$ are irrelevant to the performance of LQLS. Even the sparsity level does not affect the phase transition of LQLS for $q > 1$. However, intuitively speaking, none of these features is expected. The main goal of this paper is to present a new analysis that will shed light on the misleading features of the phase transition analysis. Furthermore, our new framework enables us to provide an accurate comparison between different bridge estimators and to evaluate the usefulness of phase transition analysis in practice.
In our new framework the observation noise is present, but its variance, denoted by $\sigma_w^2$, is assumed to be small. Then we solve (1) with the optimal value of $\lambda$, for which the asymptotic mean square error of the algorithm, i.e., $\lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda, q) - \beta\|_2^2$, is minimized. Finally, we characterize the asymptotic mean square error in terms of $\sigma_w$. Since $\sigma_w$ is small, we only keep the largest two terms of $\lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda, q) - \beta\|_2^2$. As we will describe later, the phase transition of different algorithms can be derived from the first dominant term; however, the second dominant term is also very informative. It is capable of evaluating the importance of the phase transition analysis for practical situations and also provides a much more accurate analysis of different algorithms. Here is one of the results of our paper, presented informally to clarify our claims. All the conditions will be determined later.
² Different notions of success have been studied in the phase transition analysis. We will mention one notion later in our paper.
Informal Result 2. If $\lambda^*$ denotes the optimal value of $\lambda$, then for any $q \in (1, 2)$, $\delta > 1$ and $\epsilon < 1$,
$$\lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda^*, q) - \beta\|_2^2 = \frac{\sigma_w^2}{1 - 1/\delta} - \sigma_w^{2q}\, \frac{\delta^{q+1}(1-\epsilon)^2 (E|Z|^q)^2}{(\delta - 1)^{q+1} E|B|^{2q-2}} + o(\sigma_w^{2q}), \tag{4}$$
where $Z \sim N(0, 1)$ and $B$ is a random variable whose distribution is specified by the non-zero elements of $\beta$. We will clarify this in the next section. Finally, the limit notation we have used above is the almost sure limit.
As we will discuss in Section III, the first term $\frac{\sigma_w^2}{1 - 1/\delta}$ determines the phase transition. However, we have also derived the second dominant term in the expansion of the asymptotic MSE. This term enables us to clarify most of the confusing features of the phase transitions. Here are some important features of this term: (i) It is negative. Hence, the MSE that is predicted by the first term (and phase transition analysis) is overestimated, especially when $q$ is close to 1. (ii) Fixing $q$, the magnitude of the second dominant term grows as $\epsilon$ decreases. Hence, all values of $1 < q < 2$ benefit from the sparsity of $\beta$. (iii) Fixing $\epsilon$ and $\delta$, the power of $\sigma_w$ decreases as $q$ decreases. This makes the absolute value of the second dominant term bigger. As $q$ decreases to one, the order of the second dominant term gets closer to that of the first dominant term and hence the predictions of phase transition analysis become less accurate. To show some more interesting features of our approach, we also informally state a result we prove for LASSO.
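Informal Result 3. If $\lambda^*$ denotes the value of $\lambda$ that leads to the smallest asymptotic MSE, and if $\delta > M_1(\epsilon)$, then
$$\lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda^*, 1) - \beta\|_2^2 - \frac{\delta M_1(\epsilon)\, \sigma_w^2}{\delta - M_1(\epsilon)} = O(\exp(-\tilde\mu/\sigma_w^2)), \tag{5}$$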
where $\tilde\mu$ is a constant that depends on the non-zero coefficients of $\beta$. As can be seen here, compared to the other values of $q$, $q = 1$ has a smaller first-order term (according to Lemma 1) and a much smaller (in magnitude) second-order term. While the first dominant term of LASSO is much smaller than that of LQLS (for $q > 1$), the fact that the second dominant term of LQLS (for $q > 1$) is much bigger in magnitude than that of LASSO and is negative implies that the difference between the performance of LASSO and LQLS ($q > 1$) is not as extreme as the phase transition analysis predicts. We will compare the performance of LASSO and LQLS with $q > 1$ more formally later in the paper. In the rest of the paper, we first present all the assumptions required for our analysis. Then we present the formal statements of these results for the case where $\beta$ is sparse. Then we extend our results to the case $\epsilon = 1$ and provide a careful and fair comparison among all the different algorithms.
C. Organization of the paper
The organization of the paper is as follows: Section II presents the asymptotic framework of this paper formally. Section III discusses the main contributions of our paper. Section IV compares our results with the related work. Section V is devoted to the proofs of our main results.
II. THE ASYMPTOTIC FRAMEWORK
The main goal of this section is to formally introduce the asymptotic setting under which we study LQLS algorithms. In this section only, we write the vectors and matrices as $\beta(p)$, $X(p)$, $y(p)$, and $w(p)$ to emphasize the dependence on the length of $\beta$. Note that even though the number of rows of $X$ is $n = \delta p$, since we assume that $\delta$ is fixed, we do not include $n$ in our notation. The same argument applies to $y(p)$ and $w(p)$. Similarly, we may use $\hat\beta(\lambda, q, p)$ as a substitute for $\hat\beta(\lambda, q)$ in this section. Now we define a specific type of sequence known as a converging sequence. Our definition is borrowed from other papers [11]–[13] with some minor modifications.
Definition 1. A sequence of instances $\{\beta(p), X(p), w(p)\}$ is called a converging sequence if the following conditions hold:
- The empirical distribution of $\beta(p) \in \mathbb{R}^p$ converges weakly to a probability measure $p_\beta$ with bounded second moment. Further, $\frac{1}{p}\|\beta(p)\|_2^2$ converges to the second moment of $p_\beta$.
- The empirical distribution of $w(p) \in \mathbb{R}^n$ ($n = \delta p$) converges weakly to a zero-mean Gaussian distribution with variance $\sigma_w^2$. Furthermore, $\frac{1}{n}\|w(p)\|_2^2 \to \sigma_w^2$.
- The elements of $X(p)$ are iid with distribution $N(0, 1/n)$.
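For intuition, one simple way to obtain instances satisfying the conditions of Definition 1 is to draw all entries independently at random. The following minimal sketch generates a single instance for a given $p$. The choice of $p_\beta$ (a point mass at zero mixed with values $\pm 1$, with a hypothetical non-zero fraction `eps`), the function name, and the use of Python's standard `random` module are assumptions made purely for illustration.

```python
import math
import random


def make_instance(p, delta, eps, sigma_w, seed=0):
    """Draw (beta, X, w, y) with n = round(delta * p) observations.

    beta: entries are 0 with probability 1 - eps and +1 or -1 otherwise
          (an illustrative choice of p_beta, not prescribed by the paper).
    X:    n x p matrix with iid N(0, 1/n) entries.
    w:    iid N(0, sigma_w^2) noise.
    y:    X beta + w.
    """
    rng = random.Random(seed)
    n = max(1, round(delta * p))
    beta = [rng.choice((-1.0, 1.0)) if rng.random() < eps else 0.0 for _ in range(p)]
    scale = 1.0 / math.sqrt(n)  # standard deviation of each entry of X
    X = [[rng.gauss(0.0, scale) for _ in range(p)] for _ in range(n)]
    w = [rng.gauss(0.0, sigma_w) for _ in range(n)]
    y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) + w_i
         for row, w_i in zip(X, w)]
    return beta, X, w, y
```

Calling `make_instance` for increasing `p` with `delta`, `eps` and `sigma_w` held fixed mimics the asymptotic regime: the fraction of non-zeros approaches `eps`, and the average of the squared noise entries approaches `sigma_w ** 2`.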
For each of the problem instances in our converging sequence, we solve the LQLS problem (1) and obtain $\hat\beta(\lambda, q, p)$ as the estimate. The goal is to evaluate the accuracy of this estimate. Below we define observables as asymptotic measures of performance.
Definition 2. Let $\hat\beta(\lambda, q, p)$ be the sequence of solutions of LQLS for the converging sequence of instances $\{\beta(p), X(p), w(p)\}$. Consider a function $\psi : \mathbb{R}^2 \to \mathbb{R}$. An observable $J_\psi$ is defined as the almost sure limit
$$J_\psi\big(\beta(p), \hat\beta(\lambda, q, p)\big) \triangleq \lim_{p \to \infty} \frac{1}{p} \sum_{i=1}^{p} \psi\big(\beta_i(p), \hat\beta_i(\lambda, q, p)\big).$$
Note that in the above definition, we have assumed that the almost sure limit exists. A popular choice for $\psi$ is $\psi(u, v) = (u - v)^2$, which yields the MSE. In this case, we call our observable the asymptotic MSE (AMSE). Under this asymptotic framework we can evaluate the performance of LQLS. The following theorem describes how this can be done.
Theorem 1. Consider a converging sequence $\{\beta(p), X(p), w(p)\}$. Suppose that $\hat\beta(\lambda, q, p)$ is the solution of LQLS defined in (1). Then for any pseudo-Lipschitz function³ $\psi : \mathbb{R}^2 \to \mathbb{R}$, almost surely
$$\lim_{p \to \infty} \frac{1}{p} \sum_{i=1}^{p} \psi\big(\hat\beta_i(\lambda, q, p), \beta_i(p)\big) = E_{B,Z}\big[\psi\big(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}), B\big)\big], \tag{6}$$
where $B$ and $Z$ are two independent random variables with distributions $p_\beta$ and $N(0, 1)$, respectively, $\eta_q$ is the proximal operator for the function $\|\cdot\|_q^q$,⁴ and $\hat\sigma$ and $\chi$ satisfy the following equations:
$$\hat\sigma^2 = \sigma_w^2 + \frac{1}{\delta} E_{B,Z}\big[(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}) - B)^2\big], \tag{7}$$
$$\lambda = \chi\hat\sigma^{2-q}\Big(1 - \frac{1}{\delta} E\big[\eta_q'(B + \hat\sigma Z; \chi\hat\sigma^{2-q})\big]\Big). \tag{8}$$
The proof of this result is similar to but easier than the proof for the asymptotic performance of LASSO presented in [13]. We give the complete proof in the Appendix. Theorem 1 provides the first step in our analysis of LQLS. Note that this theorem enables us to characterize the asymptotic MSE of LQLS: given $\lambda$ and $\sigma_w^2$, we first calculate $\hat\sigma$ and $\chi$ from (7) and (8). From Corollary 3 (in Section V-C) we know that there exists a unique pair $(\hat\sigma, \chi)$ that satisfies (7) and (8). Then incorporating $\hat\sigma$ and $\chi$ in (6) gives the following expression for the asymptotic MSE of the algorithm:
$$\mathrm{AMSE}(\lambda, q; \sigma_w^2) \triangleq E_{B,Z}\big(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}) - B\big)^2. \tag{9}$$
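The proximal operator $\eta_q$ that appears in (6)–(9) is defined as $\eta_q(u; \chi) = \arg\min_z \frac{1}{2}(u - z)^2 + \chi|z|^q$. It has a closed form for $q = 1$ (soft thresholding) and $q = 2$ (linear shrinkage); for $1 < q < 2$ it can be computed from the stationarity condition $z + \chi q z^{q-1} = |u|$ on $z \ge 0$. The sketch below is one possible implementation that uses bisection for the intermediate case. The function name and the iteration count are assumptions for illustration.

```python
import math


def prox_lq(u, chi, q, iters=200):
    """eta_q(u; chi) = argmin_z 0.5 * (u - z)^2 + chi * |z|^q for 1 <= q <= 2."""
    if chi < 0 or not (1.0 <= q <= 2.0):
        raise ValueError("requires chi >= 0 and 1 <= q <= 2")
    a = abs(u)
    if q == 1.0:
        z = max(a - chi, 0.0)            # soft thresholding
    elif q == 2.0:
        z = a / (1.0 + 2.0 * chi)        # linear shrinkage
    else:
        # Solve z + chi * q * z^(q-1) = |u| for z in [0, |u|]; the left-hand
        # side is increasing in z, so bisection converges to the unique root.
        lo, hi = 0.0, a
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid + chi * q * mid ** (q - 1.0) < a:
                lo = mid
            else:
                hi = mid
        z = 0.5 * (lo + hi)
    return math.copysign(z, u)           # the minimizer has the same sign as u
```

For instance, `prox_lq(3.0, 1.0, 1)` applies soft thresholding at level 1, while for `q` strictly between 1 and 2 the output is never exactly zero unless `u` is zero, which reflects the fact that only $q = 1$ produces sparse estimates.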
Given the distribution of $B$ (the sparsity level included), the noise variance $\sigma_w^2$, the number of observations $\delta$, and the regularization parameter $\lambda$, it is straightforward to write a computer program to find the solution of (7) and (8). However, this approach does not shed much light on the performance of these algorithms, since there are many parameters involved and each affects the result in a non-trivial fashion. In this paper, we study the fixed points of (7) and (8) and derive their explicit forms in the low-noise regime.
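To make the fixed point equations concrete, consider the special case $q = 2$, where $\eta_2(u; \chi) = u/(1 + 2\chi)$ and the expectations in (7) and (8) can be computed by hand. Writing $c = 1/(1 + 2\chi)$ and $m_2 = E B^2$ for $B \sim p_\beta$, equation (7) becomes linear in $\hat\sigma^2$. The sketch below solves it, parameterizing by $\chi$ rather than $\lambda$, and then searches for the $\chi$ with the smallest AMSE. The function names, the search range `chi_max`, and the restriction to $\delta > 1$ in the search are assumptions made for this illustration.

```python
def ridge_state(chi, delta, sigma_w2, m2):
    """Solve (7)-(8) for q = 2 at a given chi; returns (sigma_hat^2, lambda, AMSE)."""
    c = 1.0 / (1.0 + 2.0 * chi)          # eta_2(u; chi) = c * u
    if c * c >= delta:
        raise ValueError("equation (7) has no finite positive solution here")
    # (7): sigma_hat^2 = sigma_w^2 + [(1 - c)^2 * m2 + c^2 * sigma_hat^2] / delta
    sigma_hat2 = (sigma_w2 + (1.0 - c) ** 2 * m2 / delta) / (1.0 - c * c / delta)
    lam = chi * (1.0 - c / delta)        # (8), using eta_2' = c
    amse = (1.0 - c) ** 2 * m2 + c * c * sigma_hat2   # (9)
    return sigma_hat2, lam, amse


def best_ridge_amse(delta, sigma_w2, m2, chi_max=10.0, iters=200):
    """Minimize the q = 2 AMSE over chi in [0, chi_max] by ternary search.

    Assumes delta > 1, so that every chi >= 0 is admissible in ridge_state.
    """
    lo, hi = 0.0, chi_max
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if ridge_state(a, delta, sigma_w2, m2)[2] < ridge_state(b, delta, sigma_w2, m2)[2]:
            hi = b
        else:
            lo = a
    return ridge_state(0.5 * (lo + hi), delta, sigma_w2, m2)[2]
```

With `chi = 0` (no regularization) and `delta > 1`, `ridge_state` returns an AMSE of `sigma_w2 / (1 - 1/delta)`; a small positive `chi` improves on this slightly when the noise is small.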
III. OUR MAIN CONTRIBUTIONS
A. Optimal tuning of $\lambda$
The performance of LQLS, as defined in (1), is not only a function of the shape of the regularizer, i.e., the value of $q$, but is also a function of the parameter $\lambda$. In practice, for any regularizer, a value of $\lambda$ is usually picked that gives the minimum prediction error or the minimum mean square error. In this paper, we consider the value of $\lambda$ that gives the minimum MSE. Let $\lambda_{*,q}$ denote the value of $\lambda$ that minimizes the AMSE as defined in (9). Then LQLS is solved with this specific value of $\lambda$, i.e.,
$$\hat\beta(\lambda_{*,q}, q) = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_{*,q}\|\beta\|_q^q.$$
Note that this is the best performance that LQLS can achieve in terms of the AMSE. Theorem 1 enables us to evaluate the asymptotic MSE of $\hat\beta(\lambda_{*,q}, q)$. The following result is a simple corollary of Theorem 1.
Corollary 1. Consider a converging sequence $\{\beta(p), X(p), w(p)\}$. Suppose that $\hat\beta(\lambda_{*,q}, q, p)$ is the solution of LQLS defined in (1). Then, almost surely
$$\lim_{p \to \infty} \frac{1}{p} \sum_{i=1}^{p} \big(\hat\beta_i(\lambda_{*,q}, q, p) - \beta_i(p)\big)^2 = E_{B,Z}\big(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}) - B\big)^2, \tag{10}$$
where $B$ and $Z$ are two independent random variables with distributions $p_\beta$ and $N(0, 1)$, respectively, and $\hat\sigma$ and $\chi$ satisfy the following equations:
$$\hat\sigma^2 = \sigma_w^2 + \frac{1}{\delta} E_{B,Z}\big[(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}) - B)^2\big], \tag{11}$$
$$\lambda_{*,q} = \chi\hat\sigma^{2-q}\Big(1 - \frac{1}{\delta} E\big[\eta_q'(B + \hat\sigma Z; \chi\hat\sigma^{2-q})\big]\Big). \tag{12}$$
³ A function $\psi : \mathbb{R}^2 \to \mathbb{R}$ is pseudo-Lipschitz of order $k$ if there exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^2$, we have $|\psi(x) - \psi(y)| \le L(1 + \|x\|_2^{k-1} + \|y\|_2^{k-1})\|x - y\|_2$. We consider pseudo-Lipschitz functions of order 2 in this paper.
⁴ The proximal operator of $\|\cdot\|_q^q$ is defined as $\eta_q(u; \lambda) \triangleq \arg\min_z \frac{1}{2}(u - z)^2 + \lambda|z|^q$. For further information on these functions, please refer to Section V-B.
According to Corollary 1, the study of the AMSE of $\hat\beta(\lambda_{*,q}, q)$ is reduced to the study of the solutions of (11) and (12). Note that one extra complication of (11) and (12) compared to (7) and (8) is that $\lambda_{*,q}$ has to be chosen optimally. As discussed before, it is of great interest to simplify these expressions and give explicit formulas for the AMSE. Toward this goal, we study (11) and (12) under the low-noise regime, where $\sigma_w^2$ is assumed to be small, and derive the explicit form of the AMSE. Since our results look very different (and their proof techniques are also different) for the sparse and dense cases, we study these two cases in separate sections below.
B. Analysis of AMSE for Sparse Signals
In this case, we assume that the distribution to which the empirical distribution of $\beta \in \mathbb{R}^p$ converges has the form $p_\beta(b) = (1 - \epsilon)\delta_0(b) + \epsilon G(b)$, where $\delta_0$ denotes a point mass at zero, while $G$ is a generic distribution that does not have any point mass at 0. Here, $\epsilon \in (0, 1)$ is a fixed number that denotes the sparsity level of $\beta$. The smaller $\epsilon$ is, the sparser $\beta$ will be. Since our results and proof techniques look very different for the cases $q > 1$ and $q = 1$, we study these cases separately. Before we summarize our results, we mention some notation that will be used throughout our paper: $P_G(A)$ denotes the probability of event $A$ under the measure $G$. Also, $E_G(f(B))$ denotes the expected value of $f(B)$ when $B \sim G$. If the distribution is clear from the context, we may drop the subscript $G$.
1) Results for $q > 1$: Our first result is concerned with the AMSE of LQLS for $1 < q \le 2$, when the number of observations is larger than the number of variables $p$, i.e., $\delta > 1$.
Theorem 2. Suppose $P_G(|B| < \sigma) = O(\sigma)$ and $E_G(|B|^2) < \infty$. Then for $1 < q < 2$, $\delta > 1$ and $\epsilon \in (0, 1)$, we have
$$\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat\beta(\lambda_{*,q}, q) - \beta\|_2^2 - \frac{\sigma_w^2}{1 - 1/\delta}}{\sigma_w^{2q}} = -\frac{\delta^{q+1}(1-\epsilon)^2 (E|Z|^q)^2}{(\delta - 1)^{q+1} E_G|B|^{2q-2}}, \quad \text{a.s.} \tag{13}$$
For $q = 2$, $\delta > 1$ and $\epsilon \in (0, 1)$, if $E_G(|B|^2) < \infty$, we have
$$\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat\beta(\lambda_{*,2}, 2) - \beta\|_2^2 - \frac{\sigma_w^2}{1 - 1/\delta}}{\sigma_w^{4}} = -\frac{\delta^{3}}{(\delta - 1)^{3} E_G|B|^{2}}, \quad \text{a.s.}$$
Note that $Z \sim N(0, 1)$ and $B \sim G$. The proof of this result is presented in Section V-F. There are several interesting features of this result that we would like to discuss: (i) The second dominant term of the AMSE is negative. This means that the actual MSE is better than the one predicted by the first-order term, especially for smaller values of $q$. (ii) Neither the sparsity level nor any other statistic of the distribution of the non-zero coefficients in $\beta$ appears in the first dominant term, i.e., $\frac{\sigma_w^2}{1 - 1/\delta}$. However, both appear in the second dominant term. As we will discuss later in this section, the first dominant term is the one that specifies the phase transition curve. (iii) The sparsity level has a major impact on the second dominant term. As $\epsilon$ decreases, the magnitude of the second-order term increases. Since the second dominant term is negative, the MSE will be much smaller than what is predicted by the first dominant term. (iv) For the fully dense vector, i.e., $\epsilon = 1$, (13) may imply that for $1 < q < 2$,
$$\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat\beta(\lambda_{*,q}, q) - \beta\|_2^2 - \frac{\sigma_w^2}{1 - 1/\delta}}{\sigma_w^{2q}} = 0, \quad \text{a.s.}$$
Hence, we require a different analysis to obtain the second dominant term (which has a different order). This will be done in Section III-C. (v) For $\epsilon < 1$, the choice of $q \in (1, 2]$ does not affect the first dominant term. However, it has a major impact on the second dominant term. In particular, as $q$ approaches 1, the order of the second dominant term in terms of $\sigma_w$ gets closer to that of the first dominant term. This means that in any practical setting, ignoring the second dominant term leads to possibly misleading conclusions. This is the main reason why the phase transition analysis described in the last section implies that the performance of LQLS for $q = 1.01$ is similar to that of LQLS for $q = 2$. Our next theorem discusses the asymptotic MSE in the regime $\delta < 1$.
Theorem 3. Suppose $E_G(|B|^2) < \infty$. Then for $1 < q \le 2$ and $\delta < 1$,
$$P\Big(\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda_{*,q}, q) - \beta\|_2^2 = 0\Big) = 0. \tag{14}$$
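The two dominant terms appearing in (13) are simple to evaluate, which makes it easy to see how quickly the second term catches up with the first as $q$ approaches 1. The sketch below is only an illustration of the formula as stated in Theorem 2: it uses the standard identity $E|Z|^q = 2^{q/2}\,\Gamma((q+1)/2)/\sqrt{\pi}$ for $Z \sim N(0,1)$, and it takes the moment in the denominator of (13) as an input named `moment` (a hypothetical parameter name) so that no particular distribution $G$ is assumed.

```python
import math


def abs_normal_moment(q):
    """E|Z|^q for Z ~ N(0, 1)."""
    return 2.0 ** (q / 2.0) * math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)


def first_order_term(delta, sigma_w2):
    """First dominant term sigma_w^2 / (1 - 1/delta), for delta > 1."""
    return sigma_w2 / (1.0 - 1.0 / delta)


def second_order_term(q, delta, eps, moment, sigma_w2):
    """Second dominant term of (13) for 1 < q < 2 and delta > 1.

    `moment` stands for the moment of |B| of order 2q - 2 that appears in the
    denominator of (13); it is supplied by the caller.
    """
    coeff = (delta ** (q + 1.0) * (1.0 - eps) ** 2 * abs_normal_moment(q) ** 2
             / ((delta - 1.0) ** (q + 1.0) * moment))
    return -coeff * sigma_w2 ** q        # sigma_w^(2q) = (sigma_w^2)^q


def two_term_amse(q, delta, eps, moment, sigma_w2):
    """Sum of the two dominant terms of the low-noise expansion."""
    return first_order_term(delta, sigma_w2) + second_order_term(q, delta, eps, moment, sigma_w2)
```

For a fixed small `sigma_w2`, comparing `second_order_term` at `q = 1.5` and at `q = 1.1` shows the correction growing in magnitude as `q` decreases, in line with feature (v) above.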
The proof of this theorem is presented in Section V-G. Theorems 2 and 3 show a notion of phase transition. For $\delta > 1$, as $\sigma_w \to 0$, $\mathrm{AMSE} = O(\sigma_w^2)$, and hence it will go to zero, while $\mathrm{AMSE} \not\to 0$ for $\delta < 1$. In the next section, we go over a sketch of the proofs of Theorems 2 and 3 and then shed some light on the properties of this phase transition discussed in Section I-B.
2) Proof intuition and the misleading feature of phase transition: We present the full proof of the above results in Sections V-F and V-G. However, since some parts of the proof clarify some features of our results, we present a brief proof sketch here. Note that according to Theorem 1, we have to study the fixed points of
$$\hat\sigma^2 = \sigma_w^2 + \frac{1}{\delta} E_{B,Z}\big[(\eta_q(B + \hat\sigma Z; \chi\hat\sigma^{2-q}) - B)^2\big], \tag{15}$$
$$\lambda_{*,q} = \chi\hat\sigma^{2-q}\Big(1 - \frac{1}{\delta} E\big[\eta_q'(B + \hat\sigma Z; \chi\hat\sigma^{2-q})\big]\Big). \tag{16}$$
Suppose that $(\chi^*, \hat\sigma_{\chi^*})$ is the unique solution of the above equations. We will prove the uniqueness of the solution in Lemma 11. Then, $\hat\sigma_{\chi^*}$ is the unique solution of the following fixed point equation:
$$\hat\sigma^2 = \sigma_w^2 + \frac{1}{\delta} E_{B,Z}\big[(\eta_q(B + \hat\sigma Z; \chi^*\hat\sigma^{2-q}) - B)^2\big]. \tag{17}$$
Indeed, as we will prove in Corollary 2, (17) has a unique fixed point solution. According to (17), we can in fact confirm that $\hat\sigma_{\chi^*}/\sigma_w = \Theta(1)$ as $\sigma_w \to 0$. This essentially implies that we can characterize $\hat\sigma_{\chi^*}$ by characterizing the behavior of $E_{B,Z}[(\eta_q(B + \hat\sigma Z; \chi^*\hat\sigma^{2-q}) - B)^2]$ for small values of $\hat\sigma$. Lemma 18 shows that
$$E_{B,Z}\big[(\eta_q(B + \hat\sigma Z; \chi^*\hat\sigma^{2-q}) - B)^2\big] = \hat\sigma^2 - \hat\sigma^{2q}\, \frac{(1-\epsilon)^2 (E|Z|^q)^2}{E|B|^{2q-2}} + o(\hat\sigma^{2q}). \tag{18}$$
Define $G(\hat\sigma) \triangleq (1 - \frac{1}{\delta})\hat\sigma^2 + \frac{1}{\delta}\hat\sigma^{2q}\, \frac{(1-\epsilon)^2 (E|Z|^q)^2}{E|B|^{2q-2}}$. Intuitively speaking, if we combine (17) and (18), we expect $\hat\sigma_{\chi^*}$ to be close to the solution of the following equation:
$$G(\hat\sigma) = \sigma_w^2. \tag{19}$$
It is straightforward to confirm that when $\delta > 1$, this equation has a unique solution, and that its square is smaller than $\frac{\sigma_w^2}{1 - 1/\delta}$. Furthermore, one can confirm that the solution of (19) satisfies:
$$\frac{\hat\sigma^2 - \frac{\sigma_w^2}{1 - 1/\delta}}{\hat\sigma^{2q}} = \frac{(1-\epsilon)^2 (E|Z|^q)^2}{(1 - \delta) E|B|^{2q-2}}.$$
Combined with Theorem 1, it is then straightforward to manipulate the last expression and obtain the exact asymptotic orders reported in (13). A more interesting and more informative situation is the following: what should we expect when $\delta < 1$? Figure 2 exhibits the shape of $G(\hat\sigma)$ for this case. As is clear from the figure, the value of $G(\hat\sigma)$ is negative below
$$\sigma_0(q) \triangleq \left(\frac{(1 - \delta) E|B|^{2q-2}}{(1-\epsilon)^2 (E|Z|^q)^2}\right)^{\frac{1}{2q-2}}.$$
Hence, $G(\hat\sigma) = \sigma_w^2$ cannot have a solution below $\sigma_0$. But what can we say about the solution of $G(\hat\sigma) = \sigma_w^2$? At this point, we should warn the reader that since the solution of this equation is not necessarily small, it does not necessarily have any connection with the solution of (17). However, intuitively speaking, if $\sigma_0(q)$ is small enough then it might still be a good approximation. It is straightforward to confirm that the value of $\hat\sigma^2$ that satisfies $G(\hat\sigma) = \sigma_w^2$ is given by $\sigma_0^2(q) + \frac{\sigma_w^2}{(q-1)(1/\delta - 1)} + o(\sigma_w^2)$. To obtain this result, we can use the Taylor expansion and the fact that $\frac{dG}{d\hat\sigma^2}\big|_{\hat\sigma^2 = \sigma_0^2} = (\frac{1}{\delta} - 1)(q - 1)$. Note that
$$\lim_{q \to 1} (\sigma_0(q))^{2q-2} = \frac{(1 - \delta)}{(1-\epsilon)^2 (E|Z|)^2} = \frac{\pi(1 - \delta)}{2(1-\epsilon)^2}. \tag{20}$$
We have used the dominated convergence theorem (DCT) to obtain the equation above.⁵ Suppose that $\frac{\pi(1-\delta)}{2(1-\epsilon)^2} < 1$ (this implies that if $q$ is close enough to 1, then $\sigma_0^{2q-2}(q) < 1$). Then, as $q \to 1$, $\sigma_0(q) \to 0$. We consider $q \approx 1$ so that the approximation just mentioned is reasonable. As we decrease $q$ toward 1, $\sigma_0^2(q) \to 0$; however, the other term $\frac{\sigma_w^2}{(q-1)(1/\delta - 1)} \to \infty$. On the other hand, as $q$ moves away from 1, the term $\frac{\sigma_w^2}{(q-1)(1/\delta - 1)}$ decreases; however, $\sigma_0(q)$ increases. One may optimize over the value of $q$ and see which value minimizes the AMSE. We will do this in Section III-B4 when we compare the performance of these algorithms with $q = 1$.
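The heuristic for $\delta < 1$ can be checked numerically. The following sketch, written under illustrative assumptions, evaluates $G$ as a function of $s = \hat\sigma^2$, computes $\sigma_0^2(q)$, solves $G = \sigma_w^2$ by bisection, and lets one compare the root with the first-order approximation $\sigma_0^2(q) + \sigma_w^2/((q-1)(1/\delta - 1))$. The moment $E|B|^{2q-2}$ that appears in $G$ is passed in as the hypothetical parameter `moment`, and all function names are chosen for this example.

```python
import math


def abs_normal_moment(q):
    """E|Z|^q for Z ~ N(0, 1)."""
    return 2.0 ** (q / 2.0) * math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)


def ratio_c(q, eps, moment):
    """(1 - eps)^2 (E|Z|^q)^2 / E|B|^(2q-2), the constant multiplying sigma_hat^(2q)."""
    return (1.0 - eps) ** 2 * abs_normal_moment(q) ** 2 / moment


def g_of_s(s, q, delta, eps, moment):
    """G written as a function of s = sigma_hat^2."""
    return (1.0 - 1.0 / delta) * s + ratio_c(q, eps, moment) * s ** q / delta


def sigma0_squared(q, delta, eps, moment):
    """sigma_0(q)^2 for delta < 1: the positive zero of G in the variable s."""
    return ((1.0 - delta) / ratio_c(q, eps, moment)) ** (1.0 / (q - 1.0))


def solve_g(sigma_w2, q, delta, eps, moment, iters=200):
    """Solve G = sigma_w^2 for s >= sigma_0^2 (G is increasing there) by bisection."""
    lo = sigma0_squared(q, delta, eps, moment)
    hi = lo + 1.0
    while g_of_s(hi, q, delta, eps, moment) < sigma_w2:
        hi *= 2.0                         # expand until the root is bracketed
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_of_s(mid, q, delta, eps, moment) < sigma_w2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with `q = 1.5`, `delta = 0.8`, `eps = 0.3`, `moment = 1.0` and a small `sigma_w2`, the value returned by `solve_g` can be compared against `sigma0_squared(...) + sigma_w2 / ((q - 1) * (1 / delta - 1))` to see how accurate the Taylor approximation is.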
⁵ There are several cases in this paper where DCT fails. In fact, switching the limit and the integral in most of those cases leads to incorrect results. Hence, we check the conditions required for DCT carefully. Here, DCT works because $|B|^{2q-2} \le \mathbb{1}(|B| \le 1) + \mathbb{1}(|B| > 1)|B|^2$ and we have assumed that $E_G|B|^2 < \infty$. The same argument applies to $|Z|^q$.
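[Figure 2: plot of $G(\hat\sigma)$ against $\hat\sigma$, with the level $\sigma_w^2$ and the origin marked on the axes.]
Fig. 2. The shape of $G(\hat\sigma)$ for $\delta < 1$. $G(\hat\sigma)$ is an approximation of $E_{B,Z}[(\eta_q(B + \hat\sigma Z; \chi^*\hat\sigma^{2-q}) - B)^2]$ for small values of $\hat\sigma$.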
3) Results for $q = 1$: So far we have studied the case $1 < q \le 2$. In this section, we study $q = 1$, i.e., the LASSO algorithm. In Theorems 2 and 3, we characterized the behavior of LQLS for $q > 1$ for a general class of distributions. It turns out that the distribution of the non-zero elements of $\beta$ may have a more serious impact on the second dominant term of LASSO. Hence, we consider two cases. Our first theorem considers distributions that do not have any mass around zero.
Theorem 4. Suppose $P_G(|B| > \mu) = 1$ with $\mu$ being a positive constant and $E_G(|B|^2) < \infty$. Then for $\delta > M_1(\epsilon)$,
$$\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat\beta(\lambda_{*,1}, 1) - \beta\|_2^2 - \frac{\delta M_1(\epsilon)}{\delta - M_1(\epsilon)}\sigma_w^2}{\phi\big(\tilde\mu\sqrt{\delta - M_1(\epsilon)}/(\sigma_w\sqrt{\delta})\big)} = 0, \quad \text{a.s.} \tag{21}$$
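where $\tilde\mu$ is any positive constant smaller than $\mu$. The proof of this theorem will be presented in Section V-H. Before we interpret this result, let us also mention our result for distributions that have more mass around zero.
Theorem 5. Suppose that $P_G(|B| \le \sigma) = \Theta(\sigma^{\ell})$ with $\ell > 0$ and $E_G(|B|^2) < \infty$. Then for $\delta > M_1(\epsilon)$,
$$-\Theta(\sigma_w^{\ell+2}) \cdot \Big(\underbrace{\log\log\cdots\log}_{m\ \text{times}} \frac{1}{\sigma_w}\Big)^{\ell/2} \lesssim \lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda_{*,1}, 1) - \beta\|_2^2 - \frac{\delta M_1(\epsilon)}{\delta - M_1(\epsilon)}\sigma_w^2 \lesssim -\Theta(\sigma_w^{\ell+2}), \quad \text{a.s.}$$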
where $m$ is an arbitrary but finite natural number. The proof of this theorem is presented in Section V-I. It is important to notice the difference between Theorems 4 and 5. The first point we would like to emphasize is that the first dominant terms are the same in both cases. The second dominant terms are different, though. As we will prove in Sections V-H and V-I, similar to LQLS for $1 < q \le 2$, the second dominant terms are in fact negative. Hence, the actual MSE will be smaller than the one predicted by the first dominant term. Furthermore, note that the magnitude of the second dominant term is much larger in Theorem 5 than in Theorem 4. This seems intuitive, since LASSO tends to shrink the coefficients towards zero, and hence, if the true coefficient vector $\beta$ has more mass around zero, the AMSE will be smaller. The more mass the distribution has around zero, the better this second-order term will be. Before we compare LASSO with LQLS for $q > 1$, we also discuss what happens if $\delta < M_1(\epsilon)$.
Theorem 6. Suppose that $E_G(|B|^2) < \infty$. Then for $\delta < M_1(\epsilon)$, we have
$$P\Big(\lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{1}{p}\|\hat\beta(\lambda_{*,1}, 1) - \beta\|_2^2 = 0\Big) = 0. \tag{22}$$
The proof of this theorem is presented in Section V-J. Theorems 4, 5 and 6 enable us to provide a fair comparison between LASSO and LQLS for $q > 1$. This will be discussed in the next section.
4) Discussion and comparison of $q > 1$ and $q = 1$: The goal of this section is to provide a fair comparison between the asymptotic performance of $q = 1$ and $q > 1$ based on the results we have mentioned so far. Note that all these results (except Theorems 3 and 6) are for the case $\epsilon < 1$ and are not necessarily valid for the dense case, $\epsilon = 1$. We postpone the discussion of $\epsilon = 1$ until Section III-C. We start the comparison with the first dominant term. According to Theorems 2, 4 and 5, the first dominant terms for $q = 1$ and $1 < q \le 2$ are $\frac{\delta M_1(\epsilon)\sigma_w^2}{\delta - M_1(\epsilon)}$ and $\frac{\delta\sigma_w^2}{\delta - 1}$, respectively. The following lemma, which will be proved in Section V-D, enables us to compare these quantities.
Lemma 1. $M_1(\epsilon)$ satisfies the following properties:
1) $M_1(\epsilon)$ is an increasing and continuous function of $\epsilon$.
2) $\lim_{\epsilon \to 0} M_1(\epsilon) = 0$.
3) $\lim_{\epsilon \to 1} M_1(\epsilon) = 1$.
4) $M_1(\epsilon) > \epsilon$, for $\epsilon \in (0, 1)$.
Note that according to this lemma, the dominant term for $q = 1$ is smaller than that of $q > 1$. The difference is larger when $\epsilon$ is small and it vanishes as $\epsilon \to 1$. As we discussed before, this term is in fact responsible for the phase transition behavior we discussed in Informal Result 1. To provide a fair comparison between LASSO and LQLS (with $q > 1$), we consider two different cases:
1) $M_1(\epsilon) < \delta < 1$: In this range, LASSO is operating below its phase transition, while LQLS is operating above its phase transition. However, as we discussed in Section III-B2, the fact that $\delta < 1$ does not necessarily mean that the estimate is not accurate. We heuristically argued that for the values of $q$ that are close to 1, the fixed point solution $\hat\sigma_{\chi^*}^2$ is close to $\sigma_0^2(q) + \frac{\sigma_w^2}{(q-1)(1/\delta - 1)}$, under the assumption that $\frac{\pi(1-\delta)}{2(1-\epsilon)^2} < 1$. The following lemma, which will be proved in Section V-E, shows that this assumption in fact holds.
Lemma 2. If $M_1(\epsilon) < \delta < 1$, then
$$\frac{\pi(1-\delta)}{2(1-\epsilon)^2} \le \frac{1-\delta}{1 - M_1(\epsilon)} < 1.$$
Then the question is how LQLS compares with LASSO, given the fact that $\sigma_0(q) \to 0$ as $q \to 1$. If we assume that these results accurately predict the AMSE of LQLS for all values of $q$ that are close to 1, then we can lower bound the minimum of $\sigma_0^2(q) + \frac{\sigma_w^2}{(q-1)(1/\delta - 1)}$ over a neighborhood of $q = 1$. According to (20), there exist a sufficiently small neighborhood $(1, 1 + \rho)$ and a positive constant $\alpha < 1$ such that $\sigma_0^2(q) > \alpha^{1/(q-1)}$ for $q \in (1, 1 + \rho)$. It is then straightforward to see that
$$\min_{q \in (1, 1+\rho)} \sigma_0^2(q) + \frac{\sigma_w^2}{(q-1)(1/\delta - 1)} > \min_{q \in (1, 1+\rho)} \alpha^{1/(q-1)} + \frac{\sigma_w^2}{(q-1)(1/\delta - 1)} = \Omega\big(\sigma_w^2 \log(1/\sigma_w^2)\big).$$
Hence, the asymptotic MSE of LQLS for the optimal $q > 1$ is $\Omega(\sigma_w^2 \log(1/\sigma_w^2))$. By comparing this result with the first dominant term in LASSO, i.e., $\frac{\delta M_1(\epsilon)}{\delta - M_1(\epsilon)}\sigma_w^2$, we can argue that even though the difference between LASSO and LQLS for $q > 1$ is not as extreme as what was predicted by the phase transition analysis, for small noise levels LASSO is better than any other LQLS with $q > 1$ by a logarithmic factor. Note that so far we have assumed that the second dominant term of LASSO is negligible, which is in fact the case when the distribution $G$ does not have any mass around zero. However, as we proved in Theorem 5, as the mass of the non-zero components increases around zero, the magnitude of the second dominant term of LASSO increases to $\sigma_w^{\ell+2}$. For instance, when $P_G(|B| \le \sigma) = \Theta(\sigma)$, it will essentially be $\sigma_w^3$ (note that we are ignoring the mismatch between the upper and lower bounds in the theorem). As the second dominant term of LASSO gets bigger in magnitude, the performance of LASSO will improve further. Hence, our conclusions will still be valid.
2) $\delta > 1$: In this range, all LQLS algorithms operate below their phase transitions. We would like to ask the following question: which value of $q$ leads to better AMSE given that $\beta$ is sparse, i.e., $\epsilon < 1$? Again we do some heuristic calculations to clarify the formulas here. Define
\[ H_q(\sigma_w^2) \triangleq \frac{\sigma_w^2}{1 - \frac{1}{\delta}} - \frac{\delta^{q+1}(1-\epsilon)^2 (E|Z|^q)^2}{(\delta-1)^{q+1}\, \epsilon\, E_G|B|^{2q-2}}\, \sigma_w^{2q}. \]
These are the top two dominant terms in the expansion of AMSE for LQLS ($1 < q < 2$). Given the fact that according to Theorems 4 and 5 the second dominant term for LASSO (in quite general settings) is much smaller than that of LQLS, our goal is to compare $H_q(\sigma_w^2)$ with the first dominant term of LASSO. To simplify the problem, we compare the limiting behavior
\[ \lim_{q \to 1} H_q(\sigma_w^2) = \frac{\sigma_w^2}{1 - \frac{1}{\delta}} - \frac{\delta^2 (1-\epsilon)^2 (E|Z|)^2}{(\delta-1)^2\, \epsilon}\, \sigma_w^2 \]
with the first dominant term of LASSO. Our comparison is reduced to the comparison of
\[ C_1 \triangleq \frac{\delta}{\delta-1} - \frac{\delta^2 (1-\epsilon)^2 (E|Z|)^2}{(\delta-1)^2\, \epsilon} \tag{23} \]
and
\[ C_2 \triangleq \frac{\delta M_1(\epsilon)}{\delta - M_1(\epsilon)}. \tag{24} \]
The following lemma, which will be proved in Section V-E, provides such a comparison:

Lemma 3. For $C_1$ and $C_2$ defined in (23) and (24), we have $C_1 < C_2$.
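The two heuristic comparisons in cases 1) and 2) can be probed numerically. The following is a minimal sketch and not part of the original analysis. For case 1) it minimizes $\alpha^{t} + c\,\sigma_w^2 t$ over $t = 1/(q-1)$, with $c = 1/(1/\delta - 1)$, and compares the minimum with $\sigma_w^2 \log(1/\sigma_w^2)$; the values of $\alpha$ and $\delta$ are arbitrary illustrative choices. For case 2) it evaluates $C_1 = \frac{\delta}{\delta-1} - \frac{\delta^2(1-\epsilon)^2(E|Z|)^2}{(\delta-1)^2\epsilon}$ and $C_2 = \frac{\delta M_1(\epsilon)}{\delta - M_1(\epsilon)}$ under the assumption that $M_1(\epsilon) = \min_{\chi \ge 0} (1-\epsilon)E\eta_1^2(Z;\chi) + \epsilon(1+\chi^2)$. This definition is not restated in this section, so it should be read as an assumption of the sketch; all function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def case1_lower_bound(sigma_w2, alpha, delta):
    """Minimum over t = 1/(q-1) > 0 of alpha**t + c*sigma_w2*t, c = 1/(1/delta - 1).

    The objective is convex in t, so the stationary point is the minimizer.
    The restriction q < 1 + rho is ignored here; it is inactive for small noise.
    """
    c = 1.0 / (1.0 / delta - 1.0)
    log_inv_alpha = math.log(1.0 / alpha)
    t_star = math.log(log_inv_alpha / (c * sigma_w2)) / log_inv_alpha
    return alpha ** t_star + c * sigma_w2 * t_star

def soft_threshold_risk(chi):
    """E[eta_1(Z; chi)^2] for Z ~ N(0, 1), in closed form."""
    return 2.0 * ((1.0 + chi * chi) * Phi(-chi) - chi * phi(chi))

def M1(eps, grid=20000, chi_max=10.0):
    """ASSUMED definition of M_1(eps): grid minimization over chi in [0, chi_max]."""
    best = float("inf")
    for i in range(grid + 1):
        chi = chi_max * i / grid
        value = (1.0 - eps) * soft_threshold_risk(chi) + eps * (1.0 + chi * chi)
        best = min(best, value)
    return best

def C1(eps, delta):
    """Constant (23), using (E|Z|)^2 = 2/pi."""
    e_abs_z_sq = 2.0 / math.pi
    return (delta / (delta - 1.0)
            - delta ** 2 * (1.0 - eps) ** 2 * e_abs_z_sq / ((delta - 1.0) ** 2 * eps))

def C2(eps, delta):
    """Constant (24), the first dominant LASSO coefficient."""
    m = M1(eps)
    return delta * m / (delta - m)

if __name__ == "__main__":
    for s in (1e-4, 1e-6, 1e-8):
        ratio = case1_lower_bound(s, alpha=0.5, delta=0.8) / (s * math.log(1.0 / s))
        print("sigma_w^2 =", s, "ratio to sigma_w^2*log(1/sigma_w^2):", ratio)
    for eps in (0.1, 0.5, 0.9):
        print("eps =", eps, "C1 =", C1(eps, 2.0), "C2 =", C2(eps, 2.0))
```

If the assumed definition of $M_1$ matches the one used in the paper, the printed ratios should stay bounded as $\sigma_w \to 0$, and $C_1$ should fall below $C_2$ on the tested grid, in line with Lemma 3.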
Based on the above heuristic argument, one may argue that if $\delta > 1$, the values of $q$ that are close to 1, but larger than one, can in fact outperform LASSO, even if sparsity is present in the data. Note that this result is in contrast to what we showed for $\delta < 1$.

C. Asymptotic performance for dense $\beta$, i.e., $\epsilon = 1$

We study these algorithms below the phase transition, i.e. for $\delta > 1$. We start with the case $P(|B| > \mu) = 1$. The following theorem characterizes the AMSE of LQLS for $1 \le q < 2$.

Theorem 7. Suppose that $P_G(|B| > \mu) = 1$ and $E_G(|B|^2) < \infty$. Then for $1 < q < 2$ and $\delta > 1$,
\[ \lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat{\beta}(\lambda_{*,q}, q) - \beta\|_2^2 - \frac{\sigma_w^2}{1-1/\delta}}{\sigma_w^4} = -\frac{\delta^3 (q-1)^2 (E|B|^{q-2})^2}{(\delta-1)^3 E|B|^{2q-2}} \quad a.s., \]
and for $q = 1$, we have
\[ \lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat{\beta}(\lambda_{*,1}, 1) - \beta\|_2^2 - \frac{\sigma_w^2}{1-1/\delta}}{\phi\big(\tilde{\mu}\sqrt{2(\delta-1)}/(\sigma_w\sqrt{\delta})\big)} = 0 \quad a.s., \]
where $\tilde{\mu}$ is any fixed positive number smaller than $\mu$.

In the dense case, the first dominant term is exactly the same for all values of $q$, including $q = 1$, and is equal to $\sigma_w^2/(1-1/\delta)$. Hence, to compare the performance in the low-noise regime, one should compare the second order terms. While there is no clear comparison among different values of $q$, and it may depend on the moments of $B$, one point is clear: LASSO performs worse than other values of $q$ when the actual distribution does not have any mass around zero. It is still important to note that even though it seems that the second dominant term has jumped from $\sigma_w^4$ to something exponentially small, in practice we will not see a large difference between $q = 1$ and $q = 1.001$. As is clear from Theorem 7, the coefficient that is multiplied by $\sigma_w^4$ is proportional to $(q-1)^2$, which makes the second dominant term for $q = 1.001$ very small too. A more informative and interesting case is when the distribution of nonzero coefficients, $G$, has more mass around zero.
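Theorem 8. Suppose that $P_G(|B| \le \sigma) = O(\sigma)$ and $E_G(|B|^2) < \infty$. Then for $1 < q < 2$ and $\delta > 1$,
\[ \lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat{\beta}(\lambda_{*,q}, q) - \beta\|_2^2 - \frac{\sigma_w^2}{1-1/\delta}}{\sigma_w^4} = -\frac{\delta^3 (q-1)^2 (E|B|^{q-2})^2}{(\delta-1)^3 E|B|^{2q-2}}, \quad a.s. \]
For $q = 2$ and $\delta > 1$, if $E_G(|B|^2) < \infty$,
\[ \lim_{\sigma_w \to 0} \lim_{p \to \infty} \frac{\frac{1}{p}\|\hat{\beta}(\lambda_{*,2}, 2) - \beta\|_2^2 - \frac{\sigma_w^2}{1-1/\delta}}{\sigma_w^4} = -\frac{\delta^3}{(\delta-1)^3 E|B|^2}, \quad a.s. \]
For $q = 1$ and $\delta > 1$, suppose that $P_G(|B| \le \sigma) = \Theta(\sigma^{\ell})$ with $\ell > 0$ and $E(|B|^2) < \infty$. Then
\[ -\Theta(\sigma_w^{2\ell+2}) \cdot \Big(\underbrace{\log\log\cdots\log}_{m \text{ times}} \frac{1}{\sigma_w}\Big)^{\ell} \lesssim \lim_{p \to \infty} \frac{1}{p}\|\hat{\beta}(\lambda_{*,1}, 1) - \beta\|_2^2 - \frac{\sigma_w^2}{1 - 1/\delta} \lesssim -\Theta(\sigma_w^{2\ell+2}), \quad a.s. \]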
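To make the role of the factor $(q-1)^2$ concrete, the following minimal sketch evaluates the coefficient of $\sigma_w^4$ from Theorems 7 and 8 for a discrete distribution of $B$ with no mass at zero. The two-point distribution and the value of $\delta$ are arbitrary choices made for illustration, and the function name is hypothetical.

```python
def second_order_coeff(q, delta, values, probs):
    """Coefficient of sigma_w^4 in the AMSE expansion for 1 < q <= 2.

    Implements -delta^3 (q-1)^2 (E|B|^{q-2})^2 / ((delta-1)^3 E|B|^{2q-2})
    for a discrete B taking `values` with probabilities `probs` (no zeros).
    """
    m1 = sum(p * abs(b) ** (q - 2.0) for b, p in zip(values, probs))        # E|B|^{q-2}
    m2 = sum(p * abs(b) ** (2.0 * q - 2.0) for b, p in zip(values, probs))  # E|B|^{2q-2}
    return -(delta ** 3) * (q - 1.0) ** 2 * m1 ** 2 / ((delta - 1.0) ** 3 * m2)

if __name__ == "__main__":
    values, probs, delta = [1.0, 2.0], [0.5, 0.5], 2.0
    for q in (1.001, 1.1, 1.5, 2.0):
        print("q =", q, "coefficient =", second_order_coeff(q, delta, values, probs))
```

At $q = 2$ the expression reduces to $-\delta^3/((\delta-1)^3 E|B|^2)$, matching the ridge case of Theorem 8, while for $q$ close to 1 the coefficient shrinks like $(q-1)^2$, which is the sense in which $q = 1.001$ and $q = 1$ behave similarly in practice.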
Note that as the mass of B around zero increases, the second dominant term of LASSO improves and for ℓ = 1 it has the same order as that of LQLS with q > 1 in the low-noise regime (note that we are ignoring the log term in the lower bound). Furthermore, the result keeps improving for LASSO as ℓ goes below 1; however, we expect the results to remain the same for q > 1. This implies that as soon as the distribution has a large mass around zero, LASSO starts outperforming LQLS.

IV. RELATED WORK

A. Other n/p → δ asymptotic results

The asymptotic framework that we considered in this paper evolved in a series of papers by Donoho and Tanner [5], [14]–[17]. This framework was used before on similar problems in engineering and physics [18]–[20]. Donoho and Tanner characterized the phase transition curve for LASSO and some of its variants. Inspired by this framework, many researchers started exploring the performance of different algorithms under this asymptotic setting [3], [8]–[13], [21]–[32]. Our paper performs the analysis of LQLS under this asymptotic framework. Also, we adopt the message passing analysis that was developed in a series of papers [6], [11]–[13], [33], [34]. The notion of phase transition we consider is similar to the one introduced in [11]. However, there are three major differences: (i) The analysis of [11] is performed for LASSO, while we have generalized the analysis to any LQLS with 1 < q ≤ 2. (ii) The analysis of [11] is performed on the least favorable
distribution for LASSO, while here we characterize the effect of the distribution of non-zero coefficients on the AMSE as well. (iii) Finally, [11] is only concerned with the dominant term in the AMSE of LASSO, while we characterize the second dominant term, whose importance has been discussed in the last few sections. Several researchers have also worked on the analysis of LQLS for q ≤ 1 [3], [31], [35]. These analyses are based on the non-rigorous, but widely accepted, replica method from statistical physics. The current paper extends the analysis of [3] to the q ≥ 1 case, makes the analysis rigorous by using the message passing framework rather than the replica method, and finally provides a higher order analysis that is not present in [3].

B. Other analysis frameworks

One of the first papers that compared the performance of penalization techniques is [36], which showed that there exists a value of λ with which Ridge regression, i.e. LQLS with q = 2, outperforms the vanilla least squares estimator. Since then, many more regularizers have been introduced to the literature, each with a certain purpose. For instance, we can mention LASSO [37], elastic net [38], SCAD [39], bridge regression [1], and more recently SLOPE [40]. There has been a large body of work on studying all these regularization techniques. We partition this work into the following categories and explain what has been done in each category about bridge regression:

(i) Simulation results: The main motivation for our work comes from the nice simulation study of bridge regression presented in [2]. That paper finds the optimal values of λ and q by generalized cross validation and compares the performance of the resulting algorithm with both LASSO and ridge. The main conclusion is that bridge regression can outperform both ridge (q = 2) and LASSO (q = 1).
Given our results we see that if sparsity is present in β, then smaller values of q perform better than ridge (in their second dominant term), and also when δ > 1, q > 1 can outperform LASSO. Also, note that when sparsity is not present all the schemes can outperform LASSO.

(ii) Fisherian asymptotics: Knight and Fu [4] studied the asymptotic properties of bridge regression under the asymptotic setting n → ∞, while p is fixed. They established the consistency and asymptotic normality of the estimates under quite general settings. Huang et al. [41] studied LQLS for q < 1 under a high-dimensional asymptotic in which p grows with n, while it is still assumed to be less than n, and not only derived the asymptotic distribution of the estimators, but also proved that LQLS has oracle properties in the sense of Fan and Li [39].⁶ They have also considered the case p > n, and have shown that under a partial orthogonality assumption on X, the bridge algorithm distinguishes correctly between covariates with zero and non-zero coefficients. Note that under the asymptotic regime of our paper both LASSO and the other bridge estimators have false discoveries [42] and non-zero AMSE. Hence, they will not provide consistent estimators. We should also mention that the analysis of bridge (for q < 1) under the asymptotic regime n/p → δ and k/p → ε is presented in [3]. Finally, the performance of the LASSO estimator under a variety of conditions has been studied extensively. We refer the reader to [43] for a review of those results.

(iii) Minimax analysis: One of the successful approaches that have been employed for studying the performance of regularization techniques such as LASSO is rate-optimal minimaxity [44], [45]. We refer the reader to [43] for a complete list of references in this direction. In this minimax approach, a lower bound for the prediction error or mean square error of any estimation technique is first derived.
Then a specific algorithm such as LASSO is considered and an upper bound is derived for the prediction error and mean square error when the design matrices satisfy certain conditions such as the restricted eigenvalue assumption [44], [46], the restricted isometry condition [47], or coherence conditions [48]. These conditions can be confirmed for matrices with iid subgaussian elements. Based on these evaluations we can claim that LASSO is rate-optimal minimax. This approach has some advantages and disadvantages compared to our asymptotic approach: (i) it works under more general conditions, and (ii) it provides information for any sample size. The price that we pay in the minimax analysis is the fact that the constants are not usually sharp and hence many algorithms have similar guarantees and cannot be compared to each other. Our asymptotic framework loses the generality and in return gives sharp constants that can then be used in comparing different algorithms, as we do in this paper.

V. PROOFS OF OUR MAIN RESULTS

A. Overview of this section

The goal of this section is to prove all the results we claimed in the previous sections. We start with some of the main properties of the function $\eta_q$. These derivations will be used often in the rest of the proofs. Since we expect these properties to be useful for other researchers working on similar problems, we have summarized them all in one section.

B. Properties of the proximal operator of $\|\cdot\|_q^q$

This section is devoted to the properties of $\eta_q$ defined as
\[ \eta_q(u; \chi) \triangleq \arg\min_z \frac{1}{2}(u - z)^2 + \chi|z|^q. \tag{25} \]

⁶ Note that [41] is not Fisherian asymptotics, since p can also grow.
We start with some basic properties of these functions. Since the explicit forms of the proximal operators for $q = 1$, $\eta_1(u; \chi) = (|u| - \chi)\,\mathrm{sign}(u)\,I(|u| > \chi)$, and for $q = 2$, $\eta_2(u; \chi) = \frac{u}{1+2\chi}$, are known, we focus our study on the case $1 < q < 2$.

Lemma 4. $\eta_q(u; \chi)$ satisfies the following properties:
(i) $u - \eta_q(u; \chi) = \chi q\, \mathrm{sign}(u) |\eta_q(u; \chi)|^{q-1}$.
(ii) $|\eta_q(u; \chi)| \le |u|$.
(iii) $\lim_{\chi \to 0} \eta_q(u; \chi) = u$ and $\lim_{\chi \to \infty} \eta_q(u; \chi) = 0$.
(iv) $\eta_q(-u; \chi) = -\eta_q(u; \chi)$.
(v) For $\alpha > 0$, we have $\eta_q(\alpha u; \alpha^{2-q}\chi) = \alpha \eta_q(u; \chi)$.
(vi) $|\eta_q(u; \chi) - \eta_q(\tilde{u}; \chi)| \le |u - \tilde{u}|$.

Proof: To prove (i) we should take the derivative of $\frac{1}{2}(u - z)^2 + \chi|z|^q$ and set it to zero. Proofs of parts (ii), (iii) and (iv) are straightforward and are hence skipped. To prove (v) note that
\[
\begin{aligned}
\eta_q(\alpha u; \alpha^{2-q}\chi) &= \arg\min_z \frac{1}{2}(\alpha u - z)^2 + \chi\alpha^{2-q}|z|^q \\
&= \arg\min_z \frac{\alpha^2}{2}(u - z/\alpha)^2 + \chi\alpha^2|z/\alpha|^q \\
&= \alpha \arg\min_{\tilde{z}} \frac{1}{2}(u - \tilde{z})^2 + \chi|\tilde{z}|^q = \alpha\eta_q(u; \chi).
\end{aligned} \tag{26}
\]
(vi) is a standard property of the proximal operators of convex functions [49].

In this paper, we will be dealing with various derivatives of $\eta_q$. Hence, our next lemma is concerned with the differentiability of these functions.
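Since $\eta_q$ has no closed form for $1 < q < 2$, it can be helpful to evaluate it numerically. The following minimal sketch solves the stationarity condition of Lemma 4 part (i), $v + \chi q v^{q-1} = |u|$, by bisection on $[0, |u|]$ and restores the sign afterwards. The function name and the use of bisection are choices made for this illustration and are not taken from the paper.

```python
import math

def eta_q(u, chi, q, iters=100):
    """Proximal operator of chi*|z|^q for 1 < q < 2, computed by bisection.

    For u != 0 the magnitude v = |eta_q(u; chi)| is the unique root in [0, |u|]
    of v + chi*q*v**(q-1) = |u| (Lemma 4 part (i)); the left side is increasing in v.
    """
    a = abs(u)
    if a == 0.0 or chi == 0.0:
        return u
    lo, hi = 0.0, a
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + chi * q * mid ** (q - 1.0) < a:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), u)

if __name__ == "__main__":
    u, chi, q, alpha = 1.3, 0.7, 1.5, 2.5
    v = eta_q(u, chi, q)
    # Residual of the stationarity condition in Lemma 4 part (i).
    print("part (i) residual:", (u - v) - chi * q * abs(v) ** (q - 1.0))
    # Scaling property of Lemma 4 part (v).
    print("part (v) residual:", eta_q(alpha * u, alpha ** (2.0 - q) * chi, q) - alpha * v)
```

Both residuals should be at the level of floating point error. The scaling property in part (v) is what later allows the fixed point equations to be rewritten in terms of $B/\hat{\sigma} + Z$.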
Lemma 5. For every $1 < q < 2$, $\eta_q(u; \chi)$ is a differentiable function of $(u, \chi)$ for $u \in \mathbb{R}$ and $\chi > 0$ with continuous partial derivatives. Moreover, $\frac{\partial \eta_q(u;\chi)}{\partial \chi}$ is differentiable with respect to $u$, for any given $\chi > 0$.

Proof: We start with the case $u_0 > 0$. The goal is to prove that $\eta_q(u; \chi)$ is differentiable at $(u_0, \chi_0)$. Since $u_0 > 0$, the optimal value $\eta_q(u_0; \chi_0)$ will be positive. Then Lemma 4 part (i) shows that $\eta_q(u_0; \chi_0)$ must satisfy
\[ \eta_q(u_0; \chi_0) + \chi_0 q\, \eta_q^{q-1}(u_0; \chi_0) = u_0. \tag{27} \]
Define the function $F(u, \chi, v) = u - v - \chi q v^{q-1}$. Note that $F(u, \chi, v)$ is equal to zero at $(u_0, \chi_0, \eta_q(u_0; \chi_0))$. It is straightforward to confirm that the derivative of $F$ with respect to $v$ is nonzero at $(u_0, \chi_0, \eta_q(u_0; \chi_0))$. Hence, by the implicit function theorem, $\eta_q(u; \chi)$ is a differentiable function of $(u, \chi)$ at $(u_0, \chi_0)$. Since $\eta_q(u; \chi) = -\eta_q(-u; \chi)$, the same result holds when $u_0 < 0$. We can then focus on the points $(0, \chi_0)$. First note that $\eta_q(0; \chi_0) = 0$. Hence,
\[ \frac{\partial \eta_q(0; \chi_0)}{\partial u} = \lim_{u \to 0} \frac{|\eta_q(u; \chi_0)|}{|u|} \le \lim_{u \to 0} \frac{|u|^{1/(q-1)}}{(\chi_0 q)^{1/(q-1)} |u|} = 0, \]
where the last inequality comes from (27). This means that the partial derivative of $\eta_q(u; \chi_0)$ with respect to $u$ at $(0, \chi_0)$ exists and is equal to zero. It is straightforward to see that the partial derivative with respect to $\chi$ at $(0, \chi_0)$ exists and is equal to zero as well. Hence, the partial derivatives with respect to both $u$ and $\chi$ exist for every $u \in \mathbb{R}$, $\chi > 0$.

Furthermore, we claim the partial derivatives are continuous everywhere. For $u \ne 0$, the result comes from the implicit function theorem, because $F(u, \chi, v)$ is a smooth function when $v \ne 0$. Hence, we only focus on the proof when $u = 0$:
\[ \lim_{(u,\chi) \to (0,\chi_0)} \frac{\partial \eta_q(u; \chi)}{\partial u} = 0. \]
Note that since $\frac{\partial \eta_q(u;\chi)}{\partial u} = \frac{\partial \eta_q(-u;\chi)}{\partial u}$, we only consider $u \to 0^+$ in the above limit. By taking the derivative of (27) with respect to $u$, we obtain
\[ \frac{\partial \eta_q(u; \chi)}{\partial u} + \chi q(q-1)\eta_q^{q-2}(u; \chi) \frac{\partial \eta_q(u; \chi)}{\partial u} = 1. \]
Moreover, it is straightforward to see from (27) that $\eta_q(u; \chi) \to 0$, as $(u, \chi) \to (0^+, \chi_0)$. Therefore we have
\[ \lim_{(u,\chi) \to (0^+,\chi_0)} \frac{\partial \eta_q(u; \chi)}{\partial u} = \lim_{(u,\chi) \to (0^+,\chi_0)} \frac{1}{1 + \chi q(q-1)\eta_q^{q-2}(u; \chi)} = 0. \]
The same approach can prove that the partial derivative $\frac{\partial \eta_q(u;\chi)}{\partial \chi}$ is continuous at $(0, \chi_0)$.

We now prove the second part of the lemma. Because $F(u, \chi, v)$ is infinitely many times differentiable in any open set with $v \ne 0$, the implicit function theorem further implies that $\frac{\partial \eta_q(u;\chi)}{\partial \chi}$ is differentiable at any $u \ne 0$. What remains is to show its
differentiability at $u = 0$. This follows by noting $\frac{\partial \eta_q(0;\chi)}{\partial \chi} = 0$, and
\[ \lim_{u \to 0} \frac{\partial \eta_q(u; \chi)}{\partial \chi} \Big/ u = \lim_{u \to 0} \frac{-q|\eta_q(u; \chi)|^{q-1}}{|u|(1 + \chi q(q-1)|\eta_q(u; \chi)|^{q-2})} = \lim_{u \to 0} \frac{-(|u| - |\eta_q(u; \chi)|)}{\chi|u|(1 + \chi q(q-1)|\eta_q(u; \chi)|^{q-2})} = 0, \]
where the last two equalities above are due to Lemma 4 (parts (i) and (iii)). This completes the proof of the differentiability properties of $\eta_q(u; \chi)$.

Since we will use the derivatives of $\eta_q$ many times in the paper, for both notational simplicity and saving some space, we use the notation $\partial_1 \eta_q(u; \chi)$ and $\partial_2 \eta_q(u; \chi)$ for $\frac{\partial \eta_q(u;\chi)}{\partial u}$ and $\frac{\partial \eta_q(u;\chi)}{\partial \chi}$, respectively. Also we will use the notation $\partial_1^2 \eta_q(u; \chi)$ instead of $\frac{\partial^2 \eta_q(u;\chi)}{\partial u^2}$.
Lemma 6. Consider a given $\chi > 0$. Then for every $1 < q < 3/2$, $\partial_1 \eta_q(u; \chi)$ is a differentiable function of $u$ for $u \in \mathbb{R}$ with continuous derivative; for $q = 3/2$, it is a weakly differentiable function of $u$; for $3/2 < q < 2$, $\partial_1 \eta_q(u; \chi)$ is differentiable at $u \ne 0$, but is not differentiable at zero.

Proof: According to Lemma 5 and the formula in Lemma 7 part (i), it is clear that $\partial_1 \eta_q(u; \chi)$ is differentiable at $u \ne 0$ with continuous derivative for every $1 < q < 2$. We thus focus on $u = 0$. We calculate the derivative of $\partial_1 \eta_q(u; \chi)$ at $u = 0$:
\[ \partial_1^2 \eta_q(0; \chi) = \lim_{u \to 0} \frac{\partial_1 \eta_q(u; \chi) - \partial_1 \eta_q(0; \chi)}{u} = \lim_{u \to 0} \frac{1}{u + \chi q(q-1)u|\eta_q(u; \chi)|^{q-2}}. \tag{28} \]
Note that Lemma 4 part (i) implies
\[ \lim_{u \to 0} \frac{u}{\chi q|\eta_q(u;\chi)|^{q-1}\mathrm{sign}(u)} = 1 + \lim_{u \to 0} \frac{|\eta_q(u;\chi)|^{2-q}}{\chi q} = 1. \]
Based on this result, we now consider $1 < q < 3/2$. From Equation (28) we have
\[ \partial_1^2 \eta_q(0; \chi) = \lim_{u \to 0} \frac{1}{u + \chi q(q-1)u(|u|/(\chi q))^{\frac{q-2}{q-1}}} = 0. \]
Furthermore, we calculate the limit of $\partial_1^2 \eta_q(u; \chi)$ (this second derivative can be obtained from the formula in Lemma 7 part (i)) as $u \to 0$:
\[ \lim_{u \to 0} \partial_1^2 \eta_q(u; \chi) = \lim_{u \to 0} \frac{-\chi q(q-1)(q-2)|\eta_q(u; \chi)|^{q-3}\mathrm{sign}(u)}{(1 + \chi q(q-1)|\eta_q(u; \chi)|^{q-2})^3} = 0. \]
Therefore, $\partial_1 \eta_q(u; \chi)$ is continuously differentiable on $(-\infty, +\infty)$. When $q > 3/2$, it is straightforward to do similar calculations and show $\lim_{u \to 0^+} \partial_1^2 \eta_q(u; \chi) = +\infty$ and $\lim_{u \to 0^-} \partial_1^2 \eta_q(u; \chi) = -\infty$. For $q = 3/2$, to show the weak differentiability, we show that $\partial_1 \eta_q(u; \chi)$ is a Lipschitz continuous function on $(-\infty, +\infty)$. Note that for $u \ne 0$,
\[ |\partial_1^2 \eta_q(u; \chi)| = \left| \frac{-\chi q(q-1)(q-2)|\eta_q(u; \chi)|^{q-3}\mathrm{sign}(u)}{(1 + \chi q(q-1)|\eta_q(u; \chi)|^{q-2})^3} \right| \le \frac{8}{9\chi^2}, \]
and $\partial_1^2 \eta_q(0^+; \chi) = -\partial_1^2 \eta_q(0^-; \chi) = \frac{8}{9\chi^2}$. Therefore, using the mean value theorem, it is straightforward to get $|\partial_1 \eta_{3/2}(u; \chi) - \partial_1 \eta_{3/2}(\tilde{u}; \chi)| \le \frac{8}{9\chi^2}|u - \tilde{u}|$, for $u\tilde{u} \ge 0$. When $u\tilde{u} < 0$, we have $|\partial_1 \eta_{3/2}(u; \chi) - \partial_1 \eta_{3/2}(\tilde{u}; \chi)| = |\partial_1 \eta_{3/2}(u; \chi) - \partial_1 \eta_{3/2}(-\tilde{u}; \chi)| \le \frac{8}{9\chi^2}|u + \tilde{u}| \le \frac{8}{9\chi^2}|u - \tilde{u}|$.

Lemma 7. The derivatives of $\eta_q(u; \chi)$ satisfy the following properties:
(i) $\partial_1 \eta_q(u; \chi) = \frac{1}{1 + \chi q(q-1)|\eta_q(u;\chi)|^{q-2}}$.
(ii) $\partial_2 \eta_q(u; \chi) = \frac{-q|\eta_q(u;\chi)|^{q-1}\mathrm{sign}(u)}{1 + \chi q(q-1)|\eta_q(u;\chi)|^{q-2}}$.
(iii) $0 \le \partial_1 \eta_q(u; \chi) \le 1$.
(iv) For $u > 0$, $\partial_1^2 \eta_q(u; \chi) > 0$.
(v) $|\eta_q(u; \chi)|$ is a decreasing function of $\chi$.
(vi) $\lim_{\chi \to \infty} \partial_1 \eta_q(u; \chi) = 0$.
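The closed-form derivatives in parts (i) and (ii) of Lemma 7 can be sanity-checked against central finite differences. The sketch below is only an illustration: it recomputes $\eta_q$ by bisection on the stationarity condition of Lemma 4 part (i), and the helper names, step size and test points are arbitrary choices rather than anything prescribed by the paper.

```python
import math

def eta_q(u, chi, q, iters=100):
    """|eta_q| solves v + chi*q*v**(q-1) = |u| on [0, |u|]; bisection, then sign."""
    a = abs(u)
    if a == 0.0 or chi == 0.0:
        return u
    lo, hi = 0.0, a
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + chi * q * mid ** (q - 1.0) < a:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), u)

def d1_eta(u, chi, q):
    """Lemma 7 part (i): derivative of eta_q with respect to u (u != 0)."""
    v = abs(eta_q(u, chi, q))
    return 1.0 / (1.0 + chi * q * (q - 1.0) * v ** (q - 2.0))

def d2_eta(u, chi, q):
    """Lemma 7 part (ii): derivative of eta_q with respect to chi (u != 0)."""
    v = abs(eta_q(u, chi, q))
    return -q * math.copysign(v ** (q - 1.0), u) / (1.0 + chi * q * (q - 1.0) * v ** (q - 2.0))

def finite_diff(f, x, h=1e-5):
    """Central finite difference of a scalar function f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    chi, q = 0.5, 1.5
    for u in (0.8, -1.7):
        fd_u = finite_diff(lambda t: eta_q(t, chi, q), u)
        fd_chi = finite_diff(lambda c: eta_q(u, c, q), chi)
        print("u =", u, "d1 error:", fd_u - d1_eta(u, chi, q),
              "d2 error:", fd_chi - d2_eta(u, chi, q))
```

The two errors should be small compared with the derivative values themselves, and `d1_eta` always lies in $[0, 1]$, in agreement with part (iii).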
Proof: To show (i), first note that according to Lemma 4 part (i), for $u > 0$ we have
\[ u - \eta_q(u; \chi) = \chi q\, \eta_q(u; \chi)^{q-1}. \tag{29} \]
By taking a derivative with respect to $u$ from both sides of the above equation, we obtain (i) for positive values of $u$. Similarly, the result can be proved for negative values of $u$. According to Lemma 5, $\partial_1 \eta_q(u; \chi)$ is a continuous function of $u$. Hence, (i) is proved. The proof of (ii) is similar: we take the derivative with respect to $\chi$ from both sides of (29). Part (iii) is a simple conclusion of part (i). To prove part (iv), we take two derivatives with respect to $u$ from both sides of (29) to obtain:
\[ \partial_1^2 \eta_q(u; \chi)\big(1 + \chi q(q-1)\eta_q^{q-2}(u; \chi)\big) = \chi q(q-1)(2-q)\eta_q^{q-3}(u; \chi)\big(\partial_1 \eta_q(u; \chi)\big)^2. \tag{30} \]
This proves the statement of part (iv). Part (v) is a simple application of part (ii). Finally, part (vi) is an application of part (i) of Lemma 7 and part (iii) of Lemma 4.

C. Discussion of the Solutions of (7) and (8)

The solutions of Equations (7), (8) and (15), (16) play a crucial role in our analysis. In this section, we explore some of the properties of these equations that we will employ later in our paper. In particular, we prove that for every $\lambda > 0$ there exist unique $\chi$ and $\hat{\sigma}$ that satisfy (7), (8). Furthermore, we will show that (15), (16) have a unique solution for $(\chi, \hat{\sigma})$ and we provide a simple expression for this unique solution. These simpler expressions will be used throughout the rest of the paper. We start with Stein's lemma [50], which will be used several times in this paper.

Lemma 8. Let $g : \mathbb{R} \to \mathbb{R}$ denote a weakly differentiable function. If $Z \sim N(0, 1)$ and $E|g'(Z)| < \infty$, we have $E(Zg(Z)) = E(g'(Z))$, where $g'$ denotes the weak derivative of $g$.

Consider the function
\[ G_\chi^q(\sigma^2) \triangleq E(\eta_q(B/\sigma + Z; \chi) - B/\sigma)^2, \tag{31} \]
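where $B \sim p_\beta$ and $Z \sim N(0, 1)$ are two independent random variables. Our first lemma is concerned with the behavior of this function.

Lemma 9. For $1 \le q \le 2$, $G_\chi^q(\sigma^2)$ is a decreasing function of $\sigma > 0$.

Proof: We consider four different cases: (i) $q = 1$, (ii) $q = 2$, (iii) $1 < q \le 3/2$, (iv) $3/2 < q < 2$.

(i) $q = 1$: Since $G_\chi^1(\sigma^2)$ is a differentiable function of $\sigma$, we will prove this lemma by showing that $\frac{dG_\chi^1(\sigma^2)}{d\sigma} < 0$. We have
\[
\begin{aligned}
\frac{dG_\chi^1(\sigma^2)}{d\sigma} &= -\frac{2}{\sigma^2} E\big[B(I(|B/\sigma + Z| > \chi) - 1)(\eta_1(B/\sigma + Z; \chi) - B/\sigma)\big] \\
&= -\frac{2}{\sigma^2} E\big[I(|B/\sigma + Z| < \chi)(B^2/\sigma)\big] < 0.
\end{aligned}
\]
To obtain the first equality, we have used the dominated convergence theorem.

(ii) $q = 2$: Since $\eta_2(u; \chi) = \frac{u}{1+2\chi}$, we have
\[ \frac{d}{d\sigma} G_\chi^2(\sigma^2) = \frac{1}{(1+2\chi)^2} \frac{d}{d\sigma} E\Big(Z - \frac{2\chi B}{\sigma}\Big)^2 = -\frac{8\chi^2 E(B^2)}{(1+2\chi)^2 \sigma^3} < 0. \]

(iii) $1 < q \le 3/2$: The strategy for this case is similar to that of the last two cases. We show that the derivative $\frac{dG_\chi^q(\sigma^2)}{d\sigma} < 0$:
\[
\begin{aligned}
\frac{dG_\chi^q(\sigma^2)}{d\sigma} &\overset{(a)}{=} 2E\big[(\eta_q(B/\sigma + Z; \chi) - B/\sigma)(\partial_1 \eta_q(B/\sigma + Z; \chi) - 1)(-B/\sigma^2)\big] \\
&= 2E\big[(\eta_q(B/\sigma + Z; \chi) - B/\sigma - Z)(\partial_1 \eta_q(B/\sigma + Z; \chi) - 1)(-B/\sigma^2)\big] \\
&\quad + 2E\big[Z(\partial_1 \eta_q(B/\sigma + Z; \chi) - 1)(-B/\sigma^2)\big].
\end{aligned} \tag{32}
\]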
To obtain Equality (a), we have used the dominated convergence theorem; we employed Lemma 4 part (vi) to confirm the conditions of the dominated convergence theorem. Our goal is to show that each of the last two terms is negative for every value $b$ of the random variable $B$ and hence the expected value with respect to $B$ is also negative. For the moment, we assume that $b > 0$, but the proof for the case $b < 0$ is similar. We have
\[
\begin{aligned}
& E_Z\big[(\eta_q(b/\sigma + Z; \chi) - b/\sigma - Z)(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)\big] \\
&\overset{(a)}{=} \int (\eta_q(z; \chi) - z)(\partial_1 \eta_q(z; \chi) - 1)\phi(z - b/\sigma)\,dz \\
&= \int_0^\infty (\eta_q(z; \chi) - z)(\partial_1 \eta_q(z; \chi) - 1)\phi(z - b/\sigma)\,dz + \int_{-\infty}^0 (\eta_q(z; \chi) - z)(\partial_1 \eta_q(z; \chi) - 1)\phi(z - b/\sigma)\,dz \\
&\overset{(b)}{=} \int_0^\infty (\eta_q(z; \chi) - z)(\partial_1 \eta_q(z; \chi) - 1)(\phi(z - b/\sigma) - \phi(z + b/\sigma))\,dz > 0.
\end{aligned} \tag{33}
\]
(a) is obtained by a change of variables. To obtain (b), we have used Lemma 4 part (iv), which proves $\eta_q(-z; \chi) = -\eta_q(z; \chi)$ and hence $\partial_1 \eta_q(-z; \chi) = \partial_1 \eta_q(z; \chi)$. The last inequality is due to the fact that, according to Lemma 4 part (ii) and Lemma 7 part (iii), for $z > 0$ we have $\eta_q(z; \chi) < z$ and $\partial_1 \eta_q(z; \chi) < 1$. Furthermore, we have used the fact that since $b/\sigma > 0$ we have $\phi(z - b/\sigma) - \phi(z + b/\sigma) > 0$. Hence we have
\[ E_Z\big[(\eta_q(b/\sigma + Z; \chi) - b/\sigma - Z)(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)\big] > 0, \]
and therefore
\[ E_Z\big[(\eta_q(b/\sigma + Z; \chi) - b/\sigma - Z)(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)(-b/\sigma^2)\big] < 0. \]
Now we should discuss the second term in (32):
\[ E_Z[Z(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)] \overset{(a)}{=} E(\partial_1^2 \eta_q(b/\sigma + Z; \chi)) = \int_0^\infty \partial_1^2 \eta_q(z; \chi)(\phi(z - b/\sigma) - \phi(z + b/\sigma))\,dz > 0. \tag{34} \]
Equality (a) is the result of Stein's lemma, i.e. Lemma 8. Note that according to Lemma 6, all the conditions required for Stein's lemma are satisfied. To obtain the last inequality, we have used Lemma 7 part (iv) and the fact that $\phi(z - b/\sigma) - \phi(z + b/\sigma) > 0$ for $b/\sigma > 0$. Hence we conclude that
\[ E_Z\big[Z(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)(-b/\sigma^2)\big] < 0. \]
The same approach would work for $b < 0$ and hence establishes $\frac{dG_\chi^q(\sigma^2)}{d\sigma} < 0$ for $1 < q \le 3/2$.

(iv) $3/2 < q < 2$: The proof of this case is similar to the last case. The only difference is that the proof we presented in (34) for the positivity of $E_Z[Z(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)]$ does not work, due to the non-differentiability of $\partial_1 \eta_q(u; \chi)$ for $q > 3/2$, as shown in Lemma 6. Hence our goal here is only to prove that $E_Z[Z(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)] > 0$ for $q > 3/2$. We have
\[
\begin{aligned}
E_Z[Z(\partial_1 \eta_q(b/\sigma + Z; \chi) - 1)] &= \int_{-\infty}^{-b/\sigma} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz + \int_{b/\sigma}^{\infty} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz \\
&\quad + \int_{-b/\sigma}^{b/\sigma} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz.
\end{aligned} \tag{35}
\]
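First note that
\[
\begin{aligned}
& \int_{-\infty}^{-b/\sigma} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz + \int_{b/\sigma}^{\infty} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz \\
&= \int_{b/\sigma}^{\infty} z(\partial_1 \eta_q(b/\sigma + z; \chi) - \partial_1 \eta_q(b/\sigma - z; \chi))\phi(z)\,dz > 0,
\end{aligned} \tag{36}
\]
where the last inequality is due to the fact that $|b/\sigma - z| < |b/\sigma + z|$ and hence, according to Lemma 7 part (iv), $\partial_1 \eta_q(b/\sigma + z; \chi) - \partial_1 \eta_q(b/\sigma - z; \chi) > 0$. Now we consider the remaining term:
\[ \int_{-b/\sigma}^{b/\sigma} z(\partial_1 \eta_q(b/\sigma + z; \chi) - 1)\phi(z)\,dz = \int_0^{b/\sigma} z(\partial_1 \eta_q(b/\sigma + z; \chi) - \partial_1 \eta_q(b/\sigma - z; \chi))\phi(z)\,dz > 0, \tag{37} \]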
where to obtain the last inequality, we used part (iv) of Lemma 7 and the fact that $b/\sigma - z < b/\sigma + z$. Combining (35), (36), and (37) establishes the final result.

Lemma 9 paves our way in the study of the solutions of (7), (8) and (15), (16). The following simple corollary is the first useful conclusion of Lemma 9.

Corollary 2. For a given $1 \le q \le 2$, let $\chi > 0$ be a fixed number that satisfies $\frac{1}{\delta}E(\eta_q^2(Z; \chi)) < 1$. Then the following equation has a unique solution for $\hat{\sigma}$:
\[ \hat{\sigma}^2 = \sigma_w^2 + \frac{1}{\delta} E(\eta_q(B + \hat{\sigma}Z; \chi\hat{\sigma}^{2-q}) - B)^2. \tag{38} \]
Furthermore, if $\frac{1}{\delta}E(\eta_q^2(Z; \chi)) \ge 1$, (38) does not have any solution.

Proof: First note that if $\sigma_w^2 > 0$, then $\hat{\sigma} = 0$ is not a fixed point. Hence, we can assume that $\hat{\sigma} > 0$ and divide both sides by $\hat{\sigma}^2$. We can then write (38) in the following form:
\[ 1 = \frac{\sigma_w^2}{\hat{\sigma}^2} + \frac{1}{\delta} G_\chi^q(\hat{\sigma}^2), \tag{39} \]
where $G_\chi^q(\hat{\sigma}^2)$ is defined in (31). Since both $\frac{\sigma_w^2}{\hat{\sigma}^2}$ and $G_\chi^q(\hat{\sigma}^2)$ are decreasing functions of $\hat{\sigma}^2$ (according to Lemma 9), their
sum will be strictly decreasing. Furthermore, it is straightforward to confirm that
\[
\begin{aligned}
\lim_{\hat{\sigma} \to 0} \frac{\sigma_w^2}{\hat{\sigma}^2} + \frac{1}{\delta} G_\chi^q(\hat{\sigma}^2) &= \infty, \\
\lim_{\hat{\sigma} \to \infty} \frac{\sigma_w^2}{\hat{\sigma}^2} + \frac{1}{\delta} G_\chi^q(\hat{\sigma}^2) &= \frac{1}{\delta} E(\eta_q^2(Z; \chi)) < 1,
\end{aligned} \tag{40}
\]
where the last inequality is according to the condition we considered in the corollary. Hence, in this case there is a unique value of $\hat{\sigma}$ that satisfies (38) (we know $G_\chi^q(\cdot)$ is a continuous function from the proof of Lemma 9). Moreover, from (40) it is clear that if $\frac{1}{\delta}E(\eta_q^2(Z; \chi)) \ge 1$, then (38) does not have any solution.

According to the above corollary, two cases can be specified that have slightly different behaviors:

1) $\delta \ge 1$: In this case, for every value of $\chi > 0$, according to Lemma 7 part (v), we have
\[ \frac{1}{\delta} E(\eta_q^2(Z; \chi)) < \frac{1}{\delta} E(\eta_q^2(Z; 0)) \le 1, \]
and hence (7) has a unique fixed point for $\hat{\sigma}$.

2) $\delta < 1$: In this case, there is a minimum value of $\chi$ that we call $\chi_{\min}$ below which $\frac{1}{\delta}E(\eta_q^2(Z; \chi)) \le 1$ does not hold. In this case, Corollary 2 confirms that for $\chi \in (\chi_{\min}, \infty)$, (7) has a unique fixed point.

Corollary 2 characterizes the existence and uniqueness of the solution of (7). Our next goal is to see if (8) and (7) can have a common solution as well. Our strategy to show this is the following: among all the pairs $(\chi, \hat{\sigma}_\chi)$ that satisfy (7), we show that at least one of them satisfies (8). We do this in the next few lemmas.

Lemma 10. Let $\delta < 1$. For each value of $\chi > \chi_{\min}$, define $\hat{\sigma}_\chi$ as the value of $\hat{\sigma}$ that satisfies (7). Then,
\[
\begin{aligned}
\lim_{\chi \to \infty} \chi\hat{\sigma}_\chi^{2-q}\Big(1 - \frac{1}{\delta} E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\Big) &= \infty, \\
\lim_{\chi \searrow \chi_{\min}} \chi\hat{\sigma}_\chi^{2-q}\Big(1 - \frac{1}{\delta} E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\Big) &= -\infty.
\end{aligned} \tag{41}
\]
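The existence condition of Corollary 2 can be illustrated in the ridge case $q = 2$, where everything is explicit. Since $\eta_2(u; \chi) = u/(1+2\chi)$ and $\chi\hat{\sigma}^{2-q} = \chi$ for $q = 2$, the right-hand side of (38) equals $\sigma_w^2 + \frac{1}{\delta}\frac{\hat{\sigma}^2 + 4\chi^2 E(B^2)}{(1+2\chi)^2}$, which is affine in $\hat{\sigma}^2$ with slope $\frac{1}{\delta}E(\eta_2^2(Z; \chi)) = \frac{1}{\delta(1+2\chi)^2}$. The sketch below is a minimal illustration under these observations; the function names and parameter values are hypothetical.

```python
def ridge_fixed_point_terms(sigma_w2, delta, chi, EB2):
    """Return (intercept, slope) of the affine map in (38) for q = 2."""
    shrink = 1.0 / (1.0 + 2.0 * chi)
    slope = shrink ** 2 / delta                               # (1/delta) E[eta_2(Z; chi)^2]
    intercept = sigma_w2 + (2.0 * chi * shrink) ** 2 * EB2 / delta
    return intercept, slope

def ridge_fixed_point_closed_form(sigma_w2, delta, chi, EB2):
    """Unique solution of (38) for q = 2, or None when Corollary 2 rules one out."""
    intercept, slope = ridge_fixed_point_terms(sigma_w2, delta, chi, EB2)
    if slope >= 1.0:
        return None
    return intercept / (1.0 - slope)

def ridge_fixed_point_iteration(sigma_w2, delta, chi, EB2, iters=2000):
    """Plain fixed point iteration of (38) for q = 2, started at sigma_w^2."""
    intercept, slope = ridge_fixed_point_terms(sigma_w2, delta, chi, EB2)
    if slope >= 1.0:
        return None
    s2 = sigma_w2
    for _ in range(iters):
        s2 = intercept + slope * s2
    return s2

if __name__ == "__main__":
    print(ridge_fixed_point_closed_form(0.1, 2.0, 0.3, 1.0))
    print(ridge_fixed_point_iteration(0.1, 2.0, 0.3, 1.0))
    # For delta < 1, small chi violates the condition of Corollary 2.
    print(ridge_fixed_point_closed_form(0.1, 0.5, 0.1, 1.0))
```

In this special case the threshold $\chi_{\min}$ for $\delta < 1$ is also explicit: the slope equals one at $\chi_{\min} = (1/\sqrt{\delta} - 1)/2$, and a solution exists only for larger $\chi$.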
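Proof: We first claim that $\hat{\sigma}_\chi^2 \to \sigma_w^2 + \frac{EB^2}{\delta}$, as $\chi \to \infty$, while $\hat{\sigma}_\chi \to \infty$, as $\chi \searrow \chi_{\min}$. For the first part, we only need to show that for any sequence $\chi_n \to \infty$, $\hat{\sigma}_{\chi_n}^2 \to \sigma_w^2 + \frac{EB^2}{\delta}$. For that purpose, we first show $\hat{\sigma}_\chi = O(1)$. Otherwise, there exists a sequence $\chi_n \to \infty$ such that $\hat{\sigma}_{\chi_n} \to \infty$. Since
\[ (\eta_q(B/\hat{\sigma}_{\chi_n} + Z; \chi_n) - B/\hat{\sigma}_{\chi_n})^2 \le 2(B/\hat{\sigma}_{\chi_n} + Z)^2 + 2B^2/\hat{\sigma}_{\chi_n}^2 \le 6B^2 + 4Z^2 \]
for large enough $n$, we can apply the dominated convergence theorem (DCT) to show $G_{\chi_n}^q(\hat{\sigma}_{\chi_n}^2) = o(1)$. Then Equation (39) implies
\[ 1 = \frac{\sigma_w^2}{\hat{\sigma}_{\chi_n}^2} + \frac{1}{\delta} G_{\chi_n}^q(\hat{\sigma}_{\chi_n}^2) = o(1), \]
where a contradiction arises. We now consider any convergent subsequence $\hat{\sigma}_{\chi_{k_n}} \to \sigma^*$ (note that (38) and $\hat{\sigma}_\chi = O(1)$ together imply $0 < \sigma^* < \infty$). Since
\[ (\eta_q(B + \hat{\sigma}_{\chi_{k_n}} Z; \hat{\sigma}_{\chi_{k_n}}^{2-q}\chi_{k_n}) - B)^2 \le 2(B + \hat{\sigma}_{\chi_{k_n}} Z)^2 + 2B^2 \le 4B^2 + 4\hat{\sigma}_{\chi_{k_n}}^2 Z^2 + 2B^2 \le 6B^2 + 8(\sigma^*)^2 Z^2 \]
when $n$ is large enough, we can apply DCT to show $\hat{\sigma}_{\chi_{k_n}}^2 G_{\chi_{k_n}}^q(\hat{\sigma}_{\chi_{k_n}}^2) \to EB^2$. Then Equation (39) leads to
\[ (\sigma^*)^2 = \lim_{n \to \infty} \hat{\sigma}_{\chi_{k_n}}^2 = \sigma_w^2 + \lim_{n \to \infty} \frac{\hat{\sigma}_{\chi_{k_n}}^2 G_{\chi_{k_n}}^q(\hat{\sigma}_{\chi_{k_n}}^2)}{\delta} = \sigma_w^2 + \frac{EB^2}{\delta}. \]
Thus, we have shown that any convergent subsequence of $\hat{\sigma}_{\chi_n}^2$ converges to the same limit $\sigma_w^2 + \frac{EB^2}{\delta}$. Hence the sequence also converges to that limit. For the second part of the claim, if it is not the case, then there exists a sequence $\chi_n \searrow \chi_{\min}$ such that $\hat{\sigma}_{\chi_n} = O(1)$. We can again use DCT to get $G_{\chi_n}^q(\infty) = E(\eta_q^2(Z; \chi_n))$. Hence Equation (39) gives
\[ 1 = \frac{\sigma_w^2}{\hat{\sigma}_{\chi_n}^2} + \frac{1}{\delta} G_{\chi_n}^q(\hat{\sigma}_{\chi_n}^2) \ge \frac{\sigma_w^2}{\hat{\sigma}_{\chi_n}^2} + \frac{1}{\delta} G_{\chi_n}^q(\infty) = \frac{\sigma_w^2}{\hat{\sigma}_{\chi_n}^2} + \frac{1}{\delta} E(\eta_q^2(Z; \chi_n)). \]
Note that $\frac{1}{\delta}E(\eta_q^2(Z; \chi_{\min})) = 1$; hence letting $n \to \infty$ on both sides of the above equation leads to $1 \ge \Omega(1) + 1$, which is a contradiction.

Now we prove the two equalities stated in the above lemma. To prove the first equality, note that as $\chi \to \infty$, $\hat{\sigma}_\chi^2 \to \sigma_w^2 + \frac{EB^2}{\delta}$. Hence, according to Lemma 7 part (vi) and the dominated convergence theorem, we have $E\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q}) \to 0$. Therefore, the limit can be trivially calculated. To prove the second equality, we have shown that as $\chi \searrow \chi_{\min}$, $\hat{\sigma}_\chi \to \infty$. Furthermore, we have
\[ E\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q}) \overset{(a)}{=} \frac{1}{\hat{\sigma}_\chi} E(Z\eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})) = E(Z\eta_q(B/\hat{\sigma}_\chi + Z; \chi)), \]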
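where (a) holds by Lemma 8. It is then straightforward to verify that
\[
\begin{aligned}
\lim_{\chi \searrow \chi_{\min}} E\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q}) &= \lim_{\chi \searrow \chi_{\min}} E(Z\eta_q(B/\hat{\sigma}_\chi + Z; \chi)) = E(Z\eta_q(Z; \chi_{\min})) \\
&= E(\eta_q^2(Z; \chi_{\min})) + \chi_{\min} q E(|\eta_q(Z; \chi_{\min})|^q),
\end{aligned} \tag{42}
\]
where the last equality is the result of Lemma 4 part (i). Note that
\[ \frac{1}{\delta} E|\eta_q(Z; \chi_{\min})|^2 = 1. \tag{43} \]
Hence, by combining (42) and (43) we have
\[ \lim_{\chi \searrow \chi_{\min}} \Big(1 - \frac{1}{\delta} E\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})\Big) = -\frac{1}{\delta}\chi_{\min} q E|\eta_q(Z; \chi_{\min})|^q. \tag{44} \]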
It is straightforward to combine (44) and the fact that $\hat{\sigma}_\chi \to \infty$ to show $\lim_{\chi \searrow \chi_{\min}} \chi\hat{\sigma}_\chi^{2-q}\big(1 - \frac{1}{\delta}E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\big) = -\infty$.

Corollary 3. For any value of $\delta$, $\sigma_w > 0$ and $\lambda > 0$, there exists a unique pair $(\chi, \hat{\sigma}_\chi)$ that satisfies both (7) and (8).

Proof: We consider two cases and prove the result separately.

(i) $\delta \ge 1$: First note that according to Corollary 2, for any $\chi > 0$ there exists a unique $\hat{\sigma}_\chi$ that satisfies (7). We can then show
\[ \lim_{\chi \to \infty} \chi\hat{\sigma}_\chi^{2-q}\Big(1 - \frac{1}{\delta} E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\Big) = \infty, \tag{45} \]
\[ \lim_{\chi \searrow 0} \chi\hat{\sigma}_\chi^{2-q}\Big(1 - \frac{1}{\delta} E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\Big) = 0. \tag{46} \]
To derive Equation (45), we can use exactly the same arguments as in the proof of Equation (41) in Lemma 10. For Equation (46), since $E|\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})| \le 1$, our goal will be showing $\chi\hat{\sigma}_\chi^{2-q} = o(1)$. First consider $\delta > 1$; then we claim $\hat{\sigma}_\chi = O(1)$. Otherwise, there exists a sequence $\chi_n \to 0$ such that $\hat{\sigma}_{\chi_n} \to \infty$. However, taking the limit on both sides of Equation (39), we arrive at $1 = \frac{1}{\delta} < 1$, which is a contradiction. In the case where $\delta = 1$, it is straightforward to see $\hat{\sigma}_\chi \to \infty$, as $\chi \searrow 0$. Otherwise, there exists a sequence $\chi_n \to 0$ such that $\hat{\sigma}_{\chi_n} \to \sigma^* < \infty$. Letting $n \to \infty$ on both sides of Equation (38) gives $(\sigma^*)^2 = \sigma_w^2 + (\sigma^*)^2$, where a contradiction arises. Hence, if we can show $\chi\hat{\sigma}_\chi^2 = O(1)$, then $\chi\hat{\sigma}_\chi^{2-q} = o(1)$ will be proved. Again starting from Equation (38), we have
\[
\begin{aligned}
0 &= \sigma_w^2 + \hat{\sigma}_\chi^2 E(\eta_q(B/\hat{\sigma}_\chi + Z; \chi) - B/\hat{\sigma}_\chi - Z)^2 + 2\hat{\sigma}_\chi^2 E Z(\eta_q(B/\hat{\sigma}_\chi + Z; \chi) - B/\hat{\sigma}_\chi - Z) \\
&= \sigma_w^2 + \hat{\sigma}_\chi^2 E\Big[\chi^2 q^2 |\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{2q-2} - \frac{2\chi q(q-1)|\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{q-2}}{1 + \chi q(q-1)|\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{q-2}}\Big],
\end{aligned}
\]
where we have used Lemma 4 part (i), Lemma 7 part (i) and Lemma 8. Therefore, we can get
\[ \chi\hat{\sigma}_\chi^2 = \sigma_w^2 \Big\{E\Big[-\chi q^2 |\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{2q-2} + \frac{2q(q-1)|\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{q-2}}{1 + \chi q(q-1)|\eta_q(B/\hat{\sigma}_\chi + Z; \chi)|^{q-2}}\Big]\Big\}^{-1} \triangleq \sigma_w^2 (A + B)^{-1}. \]
It is easily seen that $A \to 0$ and $\liminf_{\chi \to 0} B \ge 2q(q-1)E|Z|^{q-2}$, hence $\chi\hat{\sigma}_\chi^2 = O(1)$. Furthermore, one can use the implicit function theorem to show that in fact $\hat{\sigma}_\chi$ is a continuous function of $\chi$. According to Lemma 5, $\partial_1 \eta_q$ is also a continuous function of its arguments. Employing these facts, it is straightforward to show that $\chi\hat{\sigma}_\chi^{2-q}\big(1 - \frac{1}{\delta}E[\partial_1 \eta_q(B + \hat{\sigma}_\chi Z; \chi\hat{\sigma}_\chi^{2-q})]\big)$ is a continuous function of $\chi$. Hence, for any $\lambda > 0$, there is at least one value of $\chi$ that satisfies (8).

(ii) $\delta < 1$: Lemma 10 combined with the same argument we presented at the end of the previous case shows that there exists at least one solution pair.

To prove the uniqueness, suppose there are two different solutions denoted by $(\hat{\sigma}(\chi_1), \chi_1)$ and $(\hat{\sigma}(\chi_2), \chi_2)$, respectively. By applying Theorem 1 with $\psi(a, b) = (a - b)^2$, we have
\[ \lim_{p \to \infty} \frac{1}{p}\sum_{i=1}^p (\hat{\beta}_i(\lambda, q) - \beta_i)^2 = E[\eta_q(B + \hat{\sigma}(\chi_1)Z; \chi_1\hat{\sigma}^{2-q}(\chi_1)) - B]^2 \overset{(a)}{=} \delta(\hat{\sigma}^2(\chi_1) - \sigma_w^2), \]
where (a) is due to Equation (7). The same equations hold for the other pair $(\hat{\sigma}(\chi_2), \chi_2)$. Since they have the same AMSE, it follows that $\hat{\sigma}(\chi_1) = \hat{\sigma}(\chi_2)$. We then choose a different pseudo-Lipschitz function $\psi(a, b) = |a|$ in Theorem 1 to obtain
\[ \lim_{p \to \infty} \frac{1}{p}\sum_{i=1}^p |\hat{\beta}_i(\lambda, q)| = E|\eta_q(B + \hat{\sigma}(\chi_1)Z; \chi_1\hat{\sigma}^{2-q}(\chi_1))| = E|\eta_q(B + \hat{\sigma}(\chi_1)Z; \chi_2\hat{\sigma}^{2-q}(\chi_1))|. \]
Since $E|\eta_q(B + \hat{\sigma}(\chi_1)Z; \chi)|$, as a function of $\chi \in (0, \infty)$, is strictly decreasing based on Lemma 7 part (v), we conclude
$\chi_1=\chi_2$. The proof technique of uniqueness has already been shown in [13]. The key idea is to analyze the fixed point equations (7), (8) indirectly through Theorem 1. Working directly with those equations seems substantially harder.
So far, we have discussed the solution of (7) and (8). We now turn our attention to (15) and (16). Our goal here is to show that there exists a unique pair $(\chi,\hat\sigma)$ that satisfies these two equations. Furthermore, we would like to derive a simpler expression for the $(\chi,\hat\sigma)$ that satisfies (15) and (16). These simpler expressions will be used in the proofs of the main theorems. In the rest of this section, we make the following assumption:
Assumption 1. For every value of $\hat\sigma>0$, the set $\arg\min_{\chi\ge 0}\mathbb{E}[(\eta_q(B+\hat\sigma Z;\chi\hat\sigma^{2-q})-B)^2]$ has cardinality one.
Lemma 11. For every $1\le q\le 2$, there exists at least one pair $(\chi,\hat\sigma_\chi)$ that satisfies (15) and (16). If we call any one of those pairs $(\chi^*,\hat\sigma_{\chi^*})$, then $\hat\sigma^2_{\chi^*}$ is the unique fixed point of
$$G_q^*(\hat\sigma^2)\triangleq\sigma_w^2+\frac{1}{\delta}\min_{\chi\ge 0}\mathbb{E}(\eta_q(B+\hat\sigma Z;\chi\hat\sigma^{2-q})-B)^2,$$
and
$$\chi^*\in\arg\min_{\chi\ge 0}\mathbb{E}[(\eta_q(B+\hat\sigma_{\chi^*}Z;\chi\hat\sigma_{\chi^*}^{2-q})-B)^2].\tag{47}$$
Furthermore, under Assumption 1, the pair $(\chi^*,\hat\sigma_{\chi^*})$ is unique and $\lambda_{*,q}$ is unique.
Proof: We first show that $G_q^*(\sigma^2)/\sigma^2$ is a strictly decreasing function of $\sigma$ over $(0,\infty)$. For any given $\sigma_1>\sigma_2>0$, we can choose $\chi_1,\chi_2$ such that
$$\chi_1=\arg\min_{\chi\ge 0}\mathbb{E}(\eta_q(B+\sigma_1Z;\chi\sigma_1^{2-q})-B)^2,\qquad \chi_2=\arg\min_{\chi\ge 0}\mathbb{E}(\eta_q(B+\sigma_2Z;\chi\sigma_2^{2-q})-B)^2.$$
Applying Lemma 9, we can have Gqχ1 (σ12 ) = min Gqχ (σ12 ) ≤ Gqχ2 (σ12 ) ≤ Gqχ2 (σ22 ) = min Gqχ (σ22 ). χ
Since
Gq∗ (σi2 ) σi2
=
2 σw σi2
+ 1δ Gqχi (σi2 ), i = 1, 2, we conclude
χ
Gq∗ (σ12 ) σ12
Gq∗ (σ 2 ) = ∞, σ→0 σ2 lim
0. We have used Equation (40) to get the last equality above. Choosing a sufficiently large $\chi$ completes the proof for the second inequality in (48).
From now on, we assume Assumption 1 holds. The arguments without Assumption 1 follow similarly. Note that according to Assumption 1 and the fact that the fixed point of $G_q^*(\hat\sigma^2)$ is unique, we conclude that the $\chi$ that satisfies (47) is also unique. Call the unique fixed point of $G_q^*(\hat\sigma^2)$ and the unique result of (47) $\hat\sigma_{\chi^*}$ and $\chi^*$, respectively. Define
$$\lambda^*=\chi^*\hat\sigma_{\chi^*}^{2-q}\Big(1-\frac{1}{\delta}\mathbb{E}[\partial_1\eta_q(B+\hat\sigma_{\chi^*}Z;\chi^*\hat\sigma_{\chi^*}^{2-q})]\Big).$$
Our goal is to show that no other value of $\lambda$ can achieve the same or smaller AMSE than this $\lambda^*$. Suppose that there exists a different $\tilde\lambda$ achieving no greater AMSE than $\lambda^*$ does. According to Corollary 3, there exists a pair $(\tilde\chi,\tilde\sigma)$ that satisfies (7) and (8). We claim $\tilde\sigma\neq\hat\sigma_{\chi^*}$. Otherwise, Assumption 1 implies that
$$\mathbb{E}(\eta_q(B+\hat\sigma_{\chi^*}Z;\chi^*\hat\sigma_{\chi^*}^{2-q})-B)^2<\mathbb{E}(\eta_q(B+\tilde\sigma Z;\tilde\chi\tilde\sigma^{2-q})-B)^2.$$
Then according to Theorem 1, it means the AMSE for $\lambda^*$ is smaller than the AMSE for $\tilde\lambda$. Moreover, according to Theorem 1, the AMSE for $\tilde\lambda$ is given by
$$\mathbb{E}(\eta_q(B+\tilde\sigma Z;\tilde\chi\tilde\sigma^{2-q})-B)^2=\delta(\tilde\sigma^2-\sigma_w^2).$$
lim
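But note that,
$$\tilde\sigma^2=\sigma_w^2+\frac{1}{\delta}\mathbb{E}(\eta_q(B+\tilde\sigma Z;\tilde\chi\tilde\sigma^{2-q})-B)^2\ge\sigma_w^2+\frac{1}{\delta}\min_{\chi}\mathbb{E}(\eta_q(B+\tilde\sigma Z;\chi\tilde\sigma^{2-q})-B)^2=G_q^*(\tilde\sigma^2).$$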
So we immediately see Gq∗ (σ 2 ) σ2
Gq∗ (˜ σ2 ) σ ˜2
≤ 1. This implies the fixed point of Gq∗ (ˆ σ 2 ), i.e. σ ˆχ∗ is necessarily smaller than σ ˜ . Otherwise
since is a strictly decreasing function of σ, we have point. Therefore, the AMSE satisfies
2 Gq∗ (ˆ σχ ∗) 2 σ ˆχ∗
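$[1+(\chi^*(\epsilon))^2]-\mathbb{E}Z^2=(\chi^*(\epsilon))^2>0$, where Equality (a) is due to the fact that $\frac{\partial M_1(\epsilon)}{\partial\chi}\big|_{\chi=\chi^*(\epsilon)}=0$. To prove (ii) note that
$$\begin{aligned}
0&\le\lim_{\epsilon\to 0}\min_{\chi\ge 0}\,(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon(1+\chi^2)\le\lim_{\epsilon\to 0}\,(1-\epsilon)\mathbb{E}[\eta_1(Z;\log(1/\epsilon))]^2+\epsilon(1+\log(1/\epsilon)^2)\\
&\le\lim_{\epsilon\to 0}2(1-\epsilon)\int_{\log(1/\epsilon)}^{\infty}(z-\log(1/\epsilon))^2\phi(z)\,dz=\lim_{\epsilon\to 0}2(1-\epsilon)\int_{0}^{\infty}\tilde z^2\phi(\tilde z+\log(1/\epsilon))\,d\tilde z\\
&\le\lim_{\epsilon\to 0}2(1-\epsilon)e^{-\frac{\log^2(1/\epsilon)}{2}}\int_{0}^{\infty}\tilde z^2\phi(\tilde z)\,d\tilde z=0.
\end{aligned}$$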
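To prove part (iii), note that as $\epsilon\to 1$, $\chi^*(\epsilon)\to 0$. Otherwise, suppose $\chi^*(\epsilon)\to\chi^*>0$ (taking a convergent subsequence if necessary). Since $\mathbb{E}|\eta_1(Z;\chi^*(\epsilon))|^2\le\mathbb{E}|Z|^2$, we obtain
$$\lim_{\epsilon\to 1}M_1(\epsilon)=\lim_{\epsilon\to 1}\,(1-\epsilon)\mathbb{E}\eta_1^2(Z;\chi^*(\epsilon))+\epsilon(1+\chi^*(\epsilon)^2)=1+(\chi^*)^2.$$
However, it is straightforward to check that the risk with threshold value 0 is equal to 1. Hence, $\chi^*(\epsilon)\to 0$ as $\epsilon\to 1$. It is then easily seen that $\lim_{\epsilon\to 1}M_1(\epsilon)=1$. The last part, i.e. Part (iv), is clear from the definition of $M(\epsilon)=\epsilon+\epsilon\chi^2+(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2$.
E. Proof of Lemmas 2 and 3
Let $\tilde M(\epsilon)$ denote $1-\frac{(1-\epsilon)^2(\mathbb{E}|Z|)^2}{\epsilon}$. We also remind the reader that we defined $M_1(\epsilon)$ in the following way:
$$M_1(\epsilon)=\min_{\chi\ge 0}\;\epsilon(1+\chi^2)+(1-\epsilon)\big(1-\mathbb{E}[|Z|^2-(\eta_1(|Z|;\chi))^2]\big).\tag{49}$$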
We first compare $\tilde M$ and $M_1$. This comparison will be used in the proofs of Lemmas 2 and 3.
Lemma 12. Let $\epsilon\in(0,1)$ and $q\in(1,2]$. Then, $\tilde M(\epsilon)<M_1(\epsilon)$.
Proof: First note that simple calculations show that
$$\tilde M(\epsilon)=\min_{\chi\ge 0}\;\epsilon(1+\chi^2)+(1-\epsilon)(1-2\chi\mathbb{E}|Z|).\tag{50}$$
Moreover, we have that
$$\mathbb{E}[|Z|^2-(\eta_1(|Z|;\chi))^2]=\mathbb{E}(|Z|-\eta_1(|Z|;\chi))(|Z|+\eta_1(|Z|;\chi))<2\chi\mathbb{E}|Z|.$$
Hence for any $\chi>0$,
$$\epsilon(1+\chi^2)+(1-\epsilon)(1-2\chi\mathbb{E}|Z|)<\epsilon(1+\chi^2)+(1-\epsilon)\big(1-\mathbb{E}[|Z|^2-(\eta_1(|Z|;\chi))^2]\big).\tag{51}$$
It is clear that the minimizing $\chi$ happens at a non-zero value in (50). Hence, we conclude that $\tilde M(\epsilon)<M_1(\epsilon)$.
Now we discuss the proof of Lemma 2.
Proof of Lemma 2: It is straightforward to confirm that $\frac{\pi\epsilon(1-\delta)}{2(1-\epsilon)^2}=\frac{1-\delta}{1-\tilde M(\epsilon)}$. Lemma 12 has shown that $\tilde M(\epsilon)<M_1(\epsilon)$. Hence, we obtain
$$\frac{\pi\epsilon(1-\delta)}{2(1-\epsilon)^2}<\frac{1-\delta}{1-M_1(\epsilon)}<1,$$
where the last inequality is due to the fact that $M_1(\epsilon)<\delta<1$.
Finally, we can discuss the proof of Lemma 3.
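Proof of Lemma 3: It is straightforward to confirm $C_1=\frac{\delta^2\tilde M(\epsilon)-\delta}{(\delta-1)^2}$. We compute $C_1-C_2$:
$$\begin{aligned}
C_1-C_2&=\frac{\delta^2\tilde M(\epsilon)-\delta}{(\delta-1)^2}-\frac{\delta M_1(\epsilon)}{\delta-M_1(\epsilon)}\\
&=\frac{(\tilde M(\epsilon)-M_1(\epsilon))\delta^3+(2M_1(\epsilon)-\tilde M(\epsilon)M_1(\epsilon)-1)\delta^2}{(\delta-1)^2(\delta-M_1(\epsilon))}\\
&\overset{(a)}{\le}\frac{(\tilde M(\epsilon)-M_1(\epsilon))\delta^2+(2M_1(\epsilon)-\tilde M(\epsilon)M_1(\epsilon)-1)\delta^2}{(\delta-1)^2(\delta-M_1(\epsilon))}\\
&=\frac{(1-\tilde M(\epsilon))(M_1(\epsilon)-1)\delta^2}{(\delta-1)^2(\delta-M_1(\epsilon))}\overset{(b)}{<}0,
\end{aligned}$$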
where (a) and (b) hold because $\delta>1>M_1(\epsilon)\ge\tilde M(\epsilon)$.
F. Proof of Theorem 2
1) Roadmap of the proof: Since the proof of this result has several components and is long, we lay out the roadmap of the proof here to help the reader. Once we show the proof for $1<q<2$, since $\eta_2(u;\chi)$ has a nice explicit form, the proof for $q=2$ can be easily obtained and is thus skipped here. Our goal in this theorem is to characterize the behavior of the AMSE of LQLS when $\lambda$ is picked optimally. According to Lemma 11, this AMSE can be calculated through the following simple steps:
1. Find the solution of $\hat\sigma^2=\sigma_w^2+\min_{\chi\ge 0}\frac{1}{\delta}\mathbb{E}(\eta_q(B+\hat\sigma Z;\chi\hat\sigma^{2-q})-B)^2$, and call it $\hat\sigma^*$.
2. The AMSE is then equal to $\delta((\hat\sigma^*)^2-\sigma_w^2)$.
Hence our main goal here is to characterize $\hat\sigma^*$. Suppose (we will prove it later) $\hat\sigma^*\to 0$ as $\sigma_w^2\to 0$; then the key step is to characterize the behavior of $\min_{\chi\ge 0}\frac{1}{\delta}\mathbb{E}(\eta_q(B+\hat\sigma Z;\chi\hat\sigma^{2-q})-B)^2$ for small values of $\hat\sigma$. To do so, we consider the following definition:
$$R_q(\chi,\sigma)\triangleq(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi))^2+\epsilon\,\mathbb{E}_{B\sim G}(\eta_q(B/\sigma+Z;\chi)-B/\sigma)^2,\tag{52}$$
where $G(\cdot)$ is the distribution of nonzero elements of $\beta$. We study the behavior of $\chi_q^*(\sigma)\triangleq\arg\min_{\chi\ge 0}R_q(\chi,\sigma)$ and $R_q(\chi_q^*(\sigma),\sigma)$ for small values of $\sigma$. For notational simplicity, we omit the subscript $q$ when no confusion is caused. As the first step in the proof, we show in Section V-F2 that $\chi^*(\sigma)\to 0$ as $\sigma\to 0$. The next step is to characterize the rate at which $\chi^*(\sigma)\to 0$ as $\sigma\to 0$ and then characterize the behavior of $R(\chi^*(\sigma),\sigma)$. This is done in Section V-F3. Once we characterize the behavior of $R(\chi^*(\sigma),\sigma)$, we use this result in Section V-F4 to derive the behavior of the fixed point of $\hat\sigma^2=\sigma_w^2+\min_{\chi\ge 0}\frac{1}{\delta}\mathbb{E}(\eta_q(B+\hat\sigma Z;\chi\hat\sigma^{2-q})-B)^2$.
2) Proof of $\chi^*(\sigma)\to 0$ as $\sigma\to 0$:
Lemma 13. Let $\sigma>0$ be a fixed number. Suppose $\mathbb{E}_G(B^2)<\infty$. Then
1) $\lim_{\chi\to\infty}R(\chi,\sigma)=\frac{\epsilon\mathbb{E}(B^2)}{\sigma^2}$.
2) $\lim_{\chi\to 0}R(\chi,\sigma)=1$.
3) $R(\chi,\sigma)$ is differentiable with respect to $\chi$ on $[0,\infty)$. The derivative and expectation can be exchanged in (52).
Proof: The first two results are simple implications of Lemma 4 part (iii) and the dominated convergence theorem. For the third one, we have
$$\frac{(\eta_q(Z;\chi+\Delta))^2-(\eta_q(Z;\chi))^2}{\Delta}\le\frac{(\eta_q(Z;\chi+\Delta)+\eta_q(Z;\chi))(\eta_q(Z;\chi+\Delta)-\eta_q(Z;\chi))}{\Delta}\overset{(a)}{\le}2|Z||\partial_2\eta_q(Z;\tilde\chi)|\overset{(b)}{\le}2q|Z|^q.$$
To obtain Inequality (a), we employed Lemma 4 part (ii) and the mean value theorem; $\tilde\chi$ is a number in the range $[\chi,\chi+\Delta]$. To obtain Inequality (b), we employed Lemma 7 part (ii) and then upper bounded $|\partial_2\eta_q(u;\chi)|$ by $q|u|^{q-1}$. Similarly,
$$\frac{(\eta_q(B/\sigma+Z;\chi+\Delta)-B/\sigma)^2-(\eta_q(B/\sigma+Z;\chi)-B/\sigma)^2}{\Delta}\le 4q(|B|/\sigma+|Z|)^q.$$
Moreover, Lemma 5 guarantees that the limits of the left side terms in the above two inequalities exist. Thus, we can apply the dominated convergence theorem to exchange the limit $\Delta\to 0$ and the expectation in calculating $\lim_{\Delta\to 0}(R(\chi+\Delta,\sigma)-R(\chi,\sigma))/\Delta$.
As we discussed before, let
$$\chi_q^*(\sigma)=\arg\min_{\chi\ge 0}R_q(\chi,\sigma).\tag{53}$$
Lemma 13 implies that $\chi_q^*(\sigma)$ exists at least when $\sigma$ is small enough. We investigate the behavior of $\chi_q^*(\sigma)$ and $R_q(\chi_q^*(\sigma),\sigma)$ for small values of $\sigma$, in a sequence of lemmas below.
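To make the objects in (52) and (53) concrete, here is a minimal numerical sketch. It assumes that $\eta_q(u;\chi)$ is the minimizer of $\frac12(u-z)^2+\chi|z|^q$, which is the convention suggested by the stationarity identity $u-\eta_q(u;\chi)=\chi q|\eta_q(u;\chi)|^{q-1}\mathrm{sign}(u)$ used in this section, and it takes $G$ to be a point mass at a hypothetical value `b`. The expectation over $Z$ is approximated with a trapezoid rule; none of these names or numerical choices come from the paper.

```python
import math

def eta_q(u, chi, q, tol=1e-12):
    """Proximal map of chi*|z|^q for 1 < q <= 2 (assumed convention).

    Solves |u| = t + chi*q*t**(q-1) for t = |eta| by bisection on [0, |u|].
    """
    a = abs(u)
    if a == 0.0 or chi == 0.0:
        return u
    lo, hi = 0.0, a
    while hi - lo > tol * max(1.0, a):
        mid = 0.5 * (lo + hi)
        if mid + chi * q * mid ** (q - 1.0) < a:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), u)

def risk(chi, sigma, q, eps, b, n=2001, zmax=8.0):
    """R_q(chi, sigma) of (52) with G a point mass at b (illustrative choice).

    The Gaussian expectation is a trapezoid rule on [-zmax, zmax].
    """
    h = 2.0 * zmax / (n - 1)
    total = 0.0
    for i in range(n):
        z = -zmax + i * h
        w = h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        if i == 0 or i == n - 1:
            w *= 0.5
        null_part = eta_q(z, chi, q) ** 2
        signal_part = (eta_q(b / sigma + z, chi, q) - b / sigma) ** 2
        total += w * ((1.0 - eps) * null_part + eps * signal_part)
    return total

if __name__ == "__main__":
    # Limits from Lemma 13: R -> 1 as chi -> 0 and R -> eps*b^2/sigma^2 as chi -> infinity.
    print(risk(0.0, 1.0, 1.5, 0.1, 2.0))
    print(risk(1e6, 1.0, 1.5, 0.1, 2.0))
```

With these choices the two printed values should be close to $1$ and to $\epsilon b^2/\sigma^2 = 0.4$, matching parts 1) and 2) of Lemma 13. Minimizing `risk` over `chi` on a grid gives a rough numerical proxy for $\chi_q^*(\sigma)$.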
Lemma 14. Let $\chi^*(\sigma)$ denote the optimal threshold value as defined in (53). Then, for every $b\neq 0$ and $z\in\mathbb{R}$, $|\eta_q(b/\sigma+z;\chi^*(\sigma))|\to\infty$ as $\sigma\to 0$.
Proof: Suppose this is not the case. Then there exist a value of $b\neq 0$ and $z$ and a sequence $\sigma_k\to 0$ such that $|\eta_q(b/\sigma_k+z;\chi^*(\sigma_k))|$ is bounded. From Lemma 4 part (i) we know that
$$b/\sigma_k+z-\eta_q(b/\sigma_k+z;\chi^*(\sigma_k))=\chi^*(\sigma_k)q|\eta_q(b/\sigma_k+z;\chi^*(\sigma_k))|^{q-1}\mathrm{sign}(b/\sigma_k+z).\tag{54}$$
Since $|\eta_q(b/\sigma_k+z;\chi^*(\sigma_k))|$ is bounded, we conclude that $\chi^*(\sigma_k)=\Omega(1/\sigma_k)$. Based on this, we would like to show that for any other $\tilde b\neq 0$ and $\tilde z$, $|\eta_q(\tilde b/\sigma_k+\tilde z;\chi^*(\sigma_k))|$ is also bounded. First note that
$$|\tilde b/\sigma_k+\tilde z|=|\eta_q(\tilde b/\sigma_k+\tilde z;\chi^*(\sigma_k))|+\chi^*(\sigma_k)q|\eta_q(\tilde b/\sigma_k+\tilde z;\chi^*(\sigma_k))|^{q-1}.\tag{55}$$
If $|\eta_q(\tilde b/\sigma_k+\tilde z;\chi^*(\sigma_k))|$ is unbounded, then the right side of equation (55) (taking a subsequence if necessary) has order larger than $1/\sigma_k$, since $\chi^*(\sigma_k)=\Omega(1/\sigma_k)$. Thus (55) cannot hold for all $k$, and a contradiction arises. This means that if $|\eta_q(b/\sigma_k+z;\chi^*(\sigma_k))|$ is bounded for one $b\neq 0$ and $z$, then it will be bounded for every $b\neq 0$ and $z$. Therefore, $|\eta_q(B/\sigma_k+Z;\chi^*(\sigma_k))-B/\sigma_k|\to\infty$ a.s., as $k\to\infty$. Accordingly, we can use Fatou's lemma and show that $R(\chi^*(\sigma_k),\sigma_k)\to\infty$. However, since $R(0,\sigma_k)=1$ and $\chi^*(\sigma_k)$ is the optimal threshold value, we conclude that $R(\chi^*(\sigma_k),\sigma_k)$ must be bounded. This contradiction implies that $|\eta_q(b/\sigma+z;\chi^*(\sigma))|\to\infty$.
Lemma 15. Let $\chi^*(\sigma)$ denote the optimal threshold value as defined in (53). Then, $\chi^*(\sigma)\to 0$ as $\sigma\to 0$.
Proof: First note that
$$\begin{aligned}
R(\chi,\sigma)&=(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi))^2+\epsilon\mathbb{E}(\eta_q(B/\sigma+Z;\chi)-B/\sigma-Z)^2+2\epsilon\mathbb{E}Z(\eta_q(B/\sigma+Z;\chi)-B/\sigma-Z)+\epsilon\\
&\overset{(a)}{=}(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi))^2+\epsilon\chi^2q^2\mathbb{E}|\eta_q(B/\sigma+Z;\chi)|^{2q-2}+2\epsilon\mathbb{E}(\partial_1\eta_q(B/\sigma+Z;\chi)-1)+\epsilon\\
&\overset{(b)}{=}(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi))^2+\epsilon\chi^2q^2\mathbb{E}|\eta_q(B/\sigma+Z;\chi)|^{2q-2}+2\epsilon\mathbb{E}\Big[\frac{1}{1+\chi q(q-1)|\eta_q(B/\sigma+Z;\chi)|^{q-2}}-1\Big]+\epsilon.
\end{aligned}\tag{56}$$
To obtain Equality (a), we have used Lemma 8 ($\eta_q(u;\chi)$ is differentiable with respect to $u$ from Lemma 5), and also Lemma 4 part (i). To obtain Equality (b), we have used Lemma 7 part (i). Note that according to Lemma 14, $|\eta_q(B/\sigma+Z;\chi^*(\sigma))|\to\infty$ a.s., as $\sigma\to 0$. Hence, if $\chi^*(\sigma)\nrightarrow 0$ as $\sigma\to 0$, the second term in (56) goes off to infinity, while the other terms remain finite, and hence $R(\chi^*(\sigma),\sigma)\to\infty$, which is a contradiction.
3) Characterizing the behavior of $R(\chi^*,\sigma)$:
Lemma 16. Suppose that $\mathbb{P}_G(|B|\le\sigma)=O(\sigma)$ and $\mathbb{E}_G(|B|^2)<\infty$. If $\chi(\sigma)=C\sigma^{2q-2}$, with a fixed number $C>0$, then for $1<q<2$ we have
$$\lim_{\sigma\to 0}\frac{R(\chi,\sigma)-1}{\sigma^{2q-2}}=-2C(1-\epsilon)q\mathbb{E}|Z|^q+\epsilon C^2q^2\mathbb{E}|B|^{2q-2}.\tag{57}$$
Moreover, if $\chi(\sigma)=o(\sigma^{2q-2})$, then
$$\lim_{\sigma\to 0}\frac{R(\chi,\sigma)-1}{\sigma^{2q-2}}=0.\tag{58}$$
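The right-hand side of (57) is a quadratic in $C$, so the best constant and the resulting second-order coefficient can be checked by direct evaluation. The sketch below is only an arithmetic illustration: it reads (57) as $-2C(1-\epsilon)q\mathbb{E}|Z|^q+\epsilon C^2q^2\mathbb{E}|B|^{2q-2}$, uses the standard Gaussian moment formula $\mathbb{E}|Z|^q=2^{q/2}\Gamma((q+1)/2)/\sqrt{\pi}$, and treats `m_b` (the value of $\mathbb{E}|B|^{2q-2}$) as a user-supplied, hypothetical input.

```python
import math

def abs_moment_std_normal(q):
    """E|Z|^q for Z ~ N(0, 1): 2^(q/2) * Gamma((q+1)/2) / sqrt(pi)."""
    return 2.0 ** (q / 2.0) * math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)

def second_order_coeff(C, q, eps, m_b):
    """Right-hand side of (57) as a function of C; m_b stands for E|B|^(2q-2)."""
    ez = abs_moment_std_normal(q)
    return -2.0 * C * (1.0 - eps) * q * ez + eps * C * C * q * q * m_b

def best_constant(q, eps, m_b):
    """Minimizer of the quadratic in C: (1 - eps) E|Z|^q / (eps q E|B|^(2q-2))."""
    return (1.0 - eps) * abs_moment_std_normal(q) / (eps * q * m_b)

if __name__ == "__main__":
    q, eps, m_b = 1.5, 0.2, 1.0  # illustrative values only
    C = best_constant(q, eps, m_b)
    closed_form = -((1.0 - eps) ** 2) * abs_moment_std_normal(q) ** 2 / (eps * m_b)
    print(C, second_order_coeff(C, q, eps, m_b), closed_form)
```

The last two printed numbers should agree, which is the value $-(1-\epsilon)^2(\mathbb{E}|Z|^q)^2/(\epsilon\,\mathbb{E}|B|^{2q-2})$ obtained by substituting the minimizing $C$ into the quadratic.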
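Proof: We first focus on the case $\chi(\sigma)=C\sigma^{2q-2}$. According to (56) we have
$$\begin{aligned}
R(\chi,\sigma)-1&=(1-\epsilon)\mathbb{E}(\eta_q^2(Z;\chi)-Z^2)+\epsilon\chi^2q^2\mathbb{E}|\eta_q(B/\sigma+Z;\chi)|^{2q-2}-2\epsilon\chi q(q-1)\mathbb{E}\frac{|\eta_q(B/\sigma+Z;\chi)|^{q-2}}{1+\chi q(q-1)|\eta_q(B/\sigma+Z;\chi)|^{q-2}}\\
&\triangleq R_1+R_2+R_3.
\end{aligned}\tag{59}$$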
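Now we calculate the limit of each of the terms individually. We have
$$\begin{aligned}
R_1&=(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi)+Z)(\eta_q(Z;\chi)-Z)\overset{(c)}{=}-(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi)+Z)(\chi q|\eta_q(Z;\chi)|^{q-1}\mathrm{sign}(Z))\\
&=-(1-\epsilon)\chi q(\mathbb{E}|Z||\eta_q(Z;\chi)|^{q-1}+\mathbb{E}|\eta_q(Z;\chi)|^q),
\end{aligned}$$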
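where (c) is due to Lemma 4 part (i). Hence, we have
$$\lim_{\sigma\to 0}\frac{R_1}{\sigma^{2q-2}}=-C(1-\epsilon)q\lim_{\sigma\to 0}(\mathbb{E}|Z||\eta_q(Z;\chi)|^{q-1}+\mathbb{E}|\eta_q(Z;\chi)|^q)=-2C(1-\epsilon)q\mathbb{E}|Z|^q.\tag{60}$$
To obtain the last equality, we followed these steps:
1) According to the dominated convergence theorem, we have $\lim_{\sigma\to 0}\mathbb{E}|\eta_q(Z;\chi)|^q=\mathbb{E}\lim_{\sigma\to 0}|\eta_q(Z;\chi)|^q$. Note that the dominated convergence theorem applies since $|\eta_q(Z;\chi)|\le|Z|$ by Lemma 4 part (ii).
2) According to the conditions in this lemma, as $\sigma\to 0$, so does $\chi(\sigma)$.
3) According to Lemma 4 part (iii), $\lim_{\sigma\to 0}|\eta_q(Z;\chi)|^q=|Z|^q$.
Next we discuss $R_2$:
$$\lim_{\sigma\to 0}\frac{R_2}{\sigma^{2q-2}}=\lim_{\sigma\to 0}\frac{\epsilon\chi^2q^2\mathbb{E}|\eta_q(B/\sigma+Z;\chi)|^{2q-2}}{\sigma^{2q-2}}=\epsilon C^2q^2\lim_{\sigma\to 0}\mathbb{E}|\eta_q(B+\sigma Z;\chi\sigma^{2-q})|^{2q-2}=\epsilon C^2q^2\mathbb{E}|B|^{2q-2}.\tag{61}$$
The arguments we used to obtain the last equality are the same as the arguments used for $R_1$. Now we discuss $R_3$. To save some space in the proof of the second part of this lemma, we will show that if $\chi(\sigma)=O(\sigma^{2q-2})$, then
$$\lim_{\sigma\to 0}\frac{\chi}{\sigma^{2q-2}}\,\mathbb{E}\,\frac{1}{|\eta_q(B/\sigma+Z;\chi)|^{2-q}+\chi q(q-1)}=0,\tag{62}$$
which, by the form of $R_3$ in (59), gives $\lim_{\sigma\to 0}R_3/\sigma^{2q-2}=0$.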
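Define $\alpha_1=1$ if $1<q<3/2$ and $\alpha_1=\sigma^{2q-3+c}$ if $3/2\le q<2$, where $c>0$ is a sufficiently small constant we specify later. Let $F(b)$ be the cumulative distribution function of $|B|$ and $\alpha_2>1$ be a fixed constant. Define the following two intervals:
$$I_1(x)\triangleq[-x-\alpha_1,-x+\alpha_1],\qquad I_2(x)\triangleq[-x-\alpha_2,-x+\alpha_2].$$
Then the expression in (62) can be written as
$$\begin{aligned}
\frac{\chi}{\sigma^{2q-2}}\,\mathbb{E}\,\frac{1}{|\eta_q(|B|/\sigma+Z;\chi)|^{2-q}+\chi q(q-1)}
&=\frac{\chi}{\sigma^{2q-2}}\int_0^\infty\int_{z\in I_1(b/\sigma)}\frac{1}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\phi(z)\,dz\,dF(b)\\
&\quad+\frac{\chi}{\sigma^{2q-2}}\int_0^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\phi(z)\,dz\,dF(b)\\
&\quad+\frac{\chi}{\sigma^{2q-2}}\int_0^\infty\int_{\mathbb{R}\setminus I_2(b/\sigma)}\frac{1}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\phi(z)\,dz\,dF(b)\\
&\triangleq G_1+G_2+G_3.
\end{aligned}\tag{63}$$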
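We now bound the first integral on the right side of (63):
$$\begin{aligned}
G_1&\le\frac{\chi}{\sigma^{2q-2}}\int_0^\infty\int_{-b/\sigma-\alpha_1}^{-b/\sigma+\alpha_1}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b)\\
&\le\frac{\chi}{\sigma^{2q-2}}\int_0^{\sigma\log(1/\sigma)}\int_{-b/\sigma-\alpha_1}^{-b/\sigma+\alpha_1}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b)+\frac{\chi}{\sigma^{2q-2}}\int_{\sigma\log(1/\sigma)}^\infty\int_{-b/\sigma-\alpha_1}^{-b/\sigma+\alpha_1}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b)\\
&\overset{(a)}{\le}\mathbb{P}(|B|\le\sigma\log(1/\sigma))\frac{2\alpha_1\phi(0)}{q(q-1)\sigma^{2q-2}}+\frac{2\alpha_1\phi(\log(1/\sigma)-\alpha_1)}{q(q-1)\sigma^{2q-2}}\\
&\le O(1)\cdot\sigma^{c}\log(1/\sigma)\frac{\alpha_1}{\sigma^{2q-3+c}}+O(1)\cdot\frac{\alpha_1\phi((\log(1/\sigma))/2)}{\sigma^{2q-2}}\to 0,\quad\text{as }\sigma\to 0.
\end{aligned}\tag{64}$$
To obtain inequality (a), we have used the following inequalities:
$$\int_{-b/\sigma-\alpha_1}^{-b/\sigma+\alpha_1}\phi(z)\,dz\le 2\phi(0)\alpha_1\ \text{ for }b\le\sigma\log(1/\sigma),\qquad\int_{-b/\sigma-\alpha_1}^{-b/\sigma+\alpha_1}\phi(z)\,dz\le 2\alpha_1\phi(\log(1/\sigma)-\alpha_1)\ \text{ for }b>\sigma\log(1/\sigma),$$
where the second inequality holds for small values of $\sigma$ only. Furthermore, note that the choice of $\alpha_1$ guarantees that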
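$\sigma^{c}\log(1/\sigma)\,\alpha_1/\sigma^{2q-3+c}\to 0$. It is also straightforward to see that $\alpha_1\phi((\log(1/\sigma))/2)/\sigma^{2q-2}\to 0$. For the second term $G_2$, we have
$$\begin{aligned}
G_2&\le\frac{\chi}{\sigma^{2q-2}}\int_0^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{|\eta_q(b/\sigma+z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b)\\
&=\frac{\chi}{\sigma^{2q-2}}\int_0^{\sigma\log(1/\sigma)}\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}}+\frac{\chi}{\sigma^{2q-2}}\int_{\sigma\log(1/\sigma)}^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}}\\
&\overset{(a)}{\le}\frac{\chi}{\sigma^{2q-2}}\int_0^{\sigma\log(1/\sigma)}\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(\alpha_1;\chi)|^{2-q}}+\frac{\chi}{\sigma^{2q-2}}\int_{\sigma\log(1/\sigma)}^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(\alpha_1;\chi)|^{2-q}}\\
&\overset{(b)}{\le}\mathbb{P}(|B|\le\sigma\log(1/\sigma))\frac{2\alpha_2\phi(0)\chi}{\sigma^{2q-2}|\eta_q(\alpha_1;\chi)|^{2-q}}+\frac{2\alpha_2\phi(\log(1/\sigma)-\alpha_2)\chi}{\sigma^{2q-2}|\eta_q(\alpha_1;\chi)|^{2-q}}\\
&\overset{(c)}{\le}O(1)\cdot\sigma^{c}\log(1/\sigma)\frac{1}{\alpha_1^{2-q}\sigma^{c-1}}+O(1)\cdot\frac{\phi((\log(1/\sigma))/2)}{\alpha_1^{2-q}}\to 0,\quad\text{as }\sigma\to 0.
\end{aligned}\tag{65}$$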
To obtain Inequality (a), we used the fact that $\eta_q(u;\chi)$ is an increasing function of $u$ from Lemma 7 part (iii). The arguments for (b) are similar to the calculations in $G_1$ and hence will not be repeated here. To obtain (c), we have used the following key steps:
1) $\lim_{\sigma\to 0}\frac{\eta_q(\alpha_1;\chi)}{\alpha_1}=1$. To see why this is true, we know from Lemma 4 part (v) that $\lim_{\sigma\to 0}\frac{\eta_q(\alpha_1;\chi)}{\alpha_1}=\lim_{\sigma\to 0}\eta_q(1;\alpha_1^{q-2}\chi)$. Note that $\alpha_1^{q-2}\chi\to 0$. This is straightforward to show for the case $\alpha_1=1$. When $\alpha_1=\sigma^{2q-3+c}$, we have $\alpha_1^{q-2}\chi=O(1)\cdot\sigma^{2q^2+(c-5)q+4-2c}$ and $2q^2+(c-5)q+4-2c>0$ if $c$ is small enough. Finally note that since $\alpha_1^{q-2}\chi\to 0$, according to Lemma 4 part (iii), $\lim_{\sigma\to 0}\eta_q(1;\alpha_1^{q-2}\chi)=1$.
2) $\mathbb{P}(|B|\le\sigma\log(1/\sigma))=O(\sigma\log(1/\sigma))$. This is true by our assumption.
Finally note that when $\alpha_1=1$, $\sigma^{c}\log(1/\sigma)\frac{1}{\alpha_1^{2-q}\sigma^{c-1}}$ goes to zero; when $\alpha_1=\sigma^{2q-3+c}$, we can choose $c$ small enough such that $\alpha_1^{q-2}\sigma^{1-c}=\sigma^{2q^2+(c-7)q+7-3c}=o(1)$. For the final integral $G_3$, by using the dominated convergence theorem, we have
$$\lim_{\sigma\to 0}G_3=O(1)\cdot\lim_{\sigma\to 0}\mathbb{E}\,\frac{\mathbb{1}(|B/\sigma+Z|>\alpha_2)}{|\eta_q(B/\sigma+Z;\chi)|^{2-q}+\chi q(q-1)}=0.\tag{66}$$
Note that the dominated convergence theorem can be applied here since when $z\in\mathbb{R}\setminus I_2(b/\sigma)$, we know
$$\frac{\mathbb{1}(|B/\sigma+Z|>\alpha_2)}{|\eta_q(B/\sigma+Z;\chi)|^{2-q}+\chi q(q-1)}\le\frac{\mathbb{1}(|B/\sigma+Z|>\alpha_2)}{|\eta_q(B/\sigma+Z;\chi)|^{2-q}}\overset{(a)}{\le}\frac{1}{|\eta_q(\alpha_2;\chi)|^{2-q}}\overset{(b)}{\le}\frac{1}{|\eta_q(\alpha_2;1)|^{2-q}},$$
where to obtain Inequality (a) we employed Lemma 7 part (iii) and to obtain (b) we used part (v) of Lemma 7. Note that (b) only holds for small values of $\sigma$. Combining (63), (64), (65) and (66) together shows that
$$\lim_{\sigma\to 0}\frac{R_3}{\sigma^{2q-2}}=0.\tag{67}$$
Combining (60), (61), and (67) establishes (57). To prove (58), first note that (67) was established in the general setting $\chi(\sigma)=O(\sigma^{2q-2})$. Similar to what we did for the case $\chi(\sigma)=C\sigma^{2q-2}$, we can prove that for $\chi(\sigma)=o(\sigma^{2q-2})$,
$$\lim_{\sigma\to 0}\frac{R_1}{\sigma^{2q-2}}=0,\qquad\lim_{\sigma\to 0}\frac{R_2}{\sigma^{2q-2}}=0,$$
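to establish (58).
Lemma 17. Suppose $\mathbb{P}_G(|B|<\sigma)=O(\sigma)$ and $\mathbb{E}_G(|B|^2)<\infty$. If $\chi(\sigma)=C\sigma^{2q-2}$ with $C=\frac{(1-\epsilon)\mathbb{E}|Z|^q}{\epsilon q\mathbb{E}|B|^{2q-2}}$, then
$$\lim_{\sigma\to 0}\frac{R(\chi,\sigma)-1}{\sigma^{2q-2}}=-\frac{(1-\epsilon)^2(\mathbb{E}|Z|^q)^2}{\epsilon\mathbb{E}|B|^{2q-2}}.$$
Proof: This is a direct application of Lemma 16.
We are now in the position to characterize the sharp behavior of $\chi^*(\sigma)$ and $R(\chi^*(\sigma),\sigma)$, when $\sigma\to 0$.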
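Lemma 18. Suppose $\mathbb{P}_G(|B|\le\sigma)=O(\sigma)$ and $\mathbb{E}_G(|B|^2)<\infty$. For $1<q<2$, we have
$$\lim_{\sigma\to 0}\frac{\chi^*(\sigma)}{\sigma^{2q-2}}=\frac{(1-\epsilon)\mathbb{E}|Z|^q}{\epsilon q\mathbb{E}|B|^{2q-2}},\qquad R(\chi^*(\sigma),\sigma)=1-\sigma^{2q-2}\frac{(1-\epsilon)^2(\mathbb{E}|Z|^q)^2}{\epsilon\mathbb{E}|B|^{2q-2}}+o(\sigma^{2q-2}).$$
Proof: We first claim that $\chi^*(\sigma)=\Omega(\sigma^{2q-2})$. If this is not the case, then there exists a sequence $\sigma_k\to 0$ such that $\chi^*(\sigma_k)=o(\sigma_k^{2q-2})$. According to Lemma 16, $\lim_{k\to\infty}\frac{R(\chi^*(\sigma_k),\sigma_k)-1}{\sigma_k^{2q-2}}=0$. But by choosing $\chi(\sigma_k)=C\sigma_k^{2q-2}$ with $C=\frac{(1-\epsilon)\mathbb{E}|Z|^q}{\epsilon q\mathbb{E}|B|^{2q-2}}$, Lemma 17 says that $\lim_{k\to\infty}\frac{R(\chi(\sigma_k),\sigma_k)-1}{\sigma_k^{2q-2}}<0$, contradicting the optimality of $\chi^*(\sigma_k)$. Moreover, this choice of $C$ shows that there exists $\chi(\sigma)$ such that $R(\chi^*(\sigma),\sigma)\le R(\chi(\sigma),\sigma)<R(0,\sigma)=1$, implying that $\chi^*(\sigma)$ is a non-zero finite value when $\sigma$ is small enough. Hence it is a solution of $\frac{\partial R(\chi,\sigma)}{\partial\chi}=0$. More specifically, $\chi^*$ satisfies the following: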
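$$\begin{aligned}
0&\overset{(a)}{=}(1-\epsilon)\mathbb{E}\,\eta_q(Z;\chi^*)\partial_2\eta_q(Z;\chi^*)+\epsilon\mathbb{E}(\eta_q(B/\sigma+Z;\chi^*)-B/\sigma)\partial_2\eta_q(B/\sigma+Z;\chi^*)\\
&\overset{(b)}{=}(1-\epsilon)\mathbb{E}\,\frac{-q|\eta_q(Z;\chi^*)|^q}{1+\chi^*q(q-1)|\eta_q(Z;\chi^*)|^{q-2}}+\epsilon\mathbb{E}\,\frac{-(\eta_q(B/\sigma+Z;\chi^*)-B/\sigma-Z)\,q|\eta_q(B/\sigma+Z;\chi^*)|^{q-1}\mathrm{sign}(B/\sigma+Z)}{1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}\\
&\qquad+\epsilon\mathbb{E}(Z\partial_2\eta_q(B/\sigma+Z;\chi^*))\\
&\overset{(c)}{=}(1-\epsilon)\mathbb{E}\,\frac{-q|\eta_q(Z;\chi^*)|^q}{1+\chi^*q(q-1)|\eta_q(Z;\chi^*)|^{q-2}}+\epsilon\mathbb{E}\,\frac{\chi^*q^2|\eta_q(B/\sigma+Z;\chi^*)|^{2q-2}}{1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}+\epsilon\mathbb{E}(Z\partial_2\eta_q(B/\sigma+Z;\chi^*))\\
&\triangleq(1-\epsilon)H_1+\epsilon\chi^*H_2+\epsilon H_3,
\end{aligned}\tag{68}$$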
where (a) is due to Lemma 13 part (3), (b) is due to Lemma 7 part (ii) and (c) is the result of Lemma 4 part (i). Now we analyze $H_1$, $H_2$ and $H_3$, respectively. According to Lemma 15, $\chi^*(\sigma)\to 0$ as $\sigma\to 0$. Hence, $\eta_q(Z;\chi^*)\to Z$ as $\sigma\to 0$. Furthermore, Lemma 4 part (ii) shows $|\eta_q(Z;\chi^*)|\le|Z|$. Therefore, employing the dominated convergence theorem (DCT) we conclude that
$$\lim_{\sigma\to 0}H_1=-q\mathbb{E}|Z|^q.\tag{69}$$
To calculate $H_2$, we first note that $\lim_{\sigma\to 0}\eta_q(B+\sigma Z;\sigma^{2-q}\chi^*)=B$ and $|\eta_q(B+\sigma Z;\sigma^{2-q}\chi^*)|\le|B|+\sigma|Z|$. Moreover, according to Lemma 14, we know $|\eta_q(B/\sigma+Z;\chi^*)|\to\infty$, a.s., as $\sigma\to 0$. Hence, we can use DCT to conclude that
$$\lim_{\sigma\to 0}\frac{H_2}{\sigma^{2-2q}}=\lim_{\sigma\to 0}\mathbb{E}\,\frac{q^2|\eta_q(B+\sigma Z;\sigma^{2-q}\chi^*)|^{2q-2}}{1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}=q^2\mathbb{E}|B|^{2q-2}.\tag{70}$$
The only remaining term is $H_3=\mathbb{E}(Z\partial_2\eta_q(B/\sigma+Z;\chi^*))$. By Lemma 5 we know that $\partial_2\eta_q(B/\sigma+Z;\chi^*)$ is differentiable with respect to its first argument. So, we can apply Lemma 8. Furthermore, Lemma 7 part (ii) gives
$$\partial_2\eta_q(B/\sigma+Z;\chi^*)=\frac{-q|\eta_q(B/\sigma+Z;\chi^*)|^{q-1}\mathrm{sign}(B/\sigma+Z)}{1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}.$$
Hence we obtain
$$\begin{aligned}
-H_3&=\mathbb{E}\,\frac{q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}{(1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2})^3}+\mathbb{E}\,\frac{\chi^*q^2(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{2q-4}}{(1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2})^3}\\
&=\mathbb{E}\,\frac{q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{4-2q}}{(|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}+\mathbb{E}\,\frac{\chi^*q^2(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}}{(|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\\
&\triangleq q(q-1)J_1+q^2(q-1)J_2.
\end{aligned}$$
Note that
$$J_1\le\mathbb{E}\,\frac{1}{|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1)},\qquad J_2\le\mathbb{E}\,\frac{1/(4q(q-1))}{|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1)}.$$
Our next step is to show
$$\lim_{\sigma\to 0}\mathbb{E}\,\frac{1}{|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1)}=0.\tag{71}$$
Note that the dominated convergence theorem may not be applied directly here, since the function inside the expectation cannot be easily bounded. Let $\alpha_1$ be a number that satisfies $\eta_q(\alpha_1;\chi^*)=(\chi^*)^{\frac{1}{2-q}}$ and also let $\alpha_2=(\chi^*)^{\frac{1}{2}}$. Note the following simple fact about $\alpha_1$, based on Lemma 4 part (i):
$$\alpha_1=\eta_q(\alpha_1;\chi^*)+\chi^*q\,\eta_q^{q-1}(\alpha_1;\chi^*)=(q+1)(\chi^*)^{\frac{1}{2-q}}.\tag{72}$$
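Define the following three intervals:
$$I_1(x)\triangleq[-x-\alpha_1,-x+\alpha_1],\qquad I_2(x)\triangleq[-x-\alpha_2,-x+\alpha_2],\qquad I_3(x)=[-x-\alpha_3,-x+\alpha_3],$$
where $\alpha_3$ is a positive constant that does not change with $\sigma$. With these definitions, we start the proof of (71). We have
$$\begin{aligned}
\mathbb{E}\,\frac{1}{|\eta_q(|B|/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1)}
&=\int_0^\infty\int_{z\in I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1)}+\int_0^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1)}\\
&\quad+\int_0^\infty\int_{z\in I_3(b/\sigma)\setminus I_2(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1)}+\int_0^\infty\int_{\mathbb{R}\setminus I_3(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1)}\\
&\triangleq G_1+G_2+G_3+G_4.
\end{aligned}\tag{73}$$
We calculate each of the last four integrals separately.
$$G_1\le\int_0^\infty\int_{z\in I_1(b/\sigma)}\frac{1}{\chi^*q(q-1)}\phi(z)\,dz\,dF(b)\le\frac{2\alpha_1\phi(0)}{\chi^*q(q-1)}\le\frac{2(q+1)(\chi^*)^{\frac{1}{2-q}}\phi(0)}{\chi^*q(q-1)}\to 0,\tag{74}$$
as $\sigma\to 0$, where the last step is due to the fact that $\chi^*(\sigma)\to 0$ (Lemma 15) and that $2-q<1$. For $G_2$ we have
$$\begin{aligned}
G_2&\le\int_0^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}}\phi(z)\,dz\,dF(b)\overset{(d)}{\le}\int_0^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{\chi^*}\phi(z)\,dz\,dF(b)\\
&\le\int_0^{\sigma\log(1/\sigma)}\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{\chi^*}\phi(z)\,dz\,dF(b)+\int_{\sigma\log(1/\sigma)}^\infty\int_{z\in I_2(b/\sigma)\setminus I_1(b/\sigma)}\frac{1}{\chi^*}\phi(z)\,dz\,dF(b)\\
&\overset{(e)}{\le}\mathbb{P}(|B|\le\sigma\log(1/\sigma))\frac{2\phi(0)\alpha_2}{\chi^*}+\frac{2\phi(\log(1/\sigma)-\alpha_2)\alpha_2}{\chi^*}\\
&\le\mathbb{P}(|B|\le\sigma\log(1/\sigma))\frac{2\phi(0)}{(\chi^*)^{1/2}}+\frac{2\phi(\log(1/\sigma)-\alpha_2)}{(\chi^*)^{1/2}}\\
&\overset{(f)}{\le}O(1)\cdot\sigma^{2-q}\log(1/\sigma)+O(1)\cdot\frac{\phi((\log(1/\sigma))/2)}{\sigma^{q-1}}\to 0,\quad\text{as }\sigma\to 0.
\end{aligned}\tag{75}$$
To derive Inequality (d), we used the fact that $\eta_q(u;\chi)$ is an increasing function of $u$ (Lemma 7 part (iii)). Hence, for $z\notin I_1(b/\sigma)$, $|\eta_q(b/\sigma+z;\chi^*)|\ge\eta_q(\alpha_1;\chi^*)=(\chi^*)^{1/(2-q)}$, where the last equality is due to the definition of $\alpha_1$. To obtain (e), we used a similar argument as the one used to derive (a) in (64). Hence, we do not repeat it here. For (f), we have used the fact that $\mathbb{P}(|B|\le\sigma\log(1/\sigma))=O(\sigma\log(1/\sigma))$ and that $\chi^*=\Omega(\sigma^{2q-2})$. Regarding $G_3$ note that
$$\begin{aligned}
G_3&\le\int_0^\infty\int_{z\in I_3(b/\sigma)\setminus I_2(b/\sigma)}\frac{1}{|\eta_q(\alpha_2;\chi^*)|^{2-q}}\phi(z)\,dz\,dF(b)\\
&\le\int_0^{\sigma\log(1/\sigma)}\int_{z\in I_3(b/\sigma)\setminus I_2(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(\alpha_2;\chi^*)|^{2-q}}+\int_{\sigma\log(1/\sigma)}^\infty\int_{z\in I_3(b/\sigma)\setminus I_2(b/\sigma)}\frac{\phi(z)\,dz\,dF(b)}{|\eta_q(\alpha_2;\chi^*)|^{2-q}}\\
&\le\mathbb{P}(|B|\le\sigma\log(1/\sigma))\frac{2\phi(0)\alpha_3}{|\eta_q(\alpha_2;\chi^*)|^{2-q}}+\frac{2\phi(\log(1/\sigma)-\alpha_3)\alpha_3}{|\eta_q(\alpha_2;\chi^*)|^{2-q}}\\
&\overset{(g)}{\le}O(1)\cdot\sigma^{q^2-3q+3}\log(1/\sigma)+O(1)\cdot\frac{\phi((\log(1/\sigma))/2)}{\sigma^{(q-1)(2-q)}}\to 0,\quad\text{as }\sigma\to 0.
\end{aligned}\tag{76}$$
Since all the steps are similar to the corresponding steps in the calculations of $G_2$, we just describe Inequality (g). To obtain this inequality, we have used the following derivation:
$$\lim_{\sigma\to 0}\frac{\eta_q(\alpha_2;\chi^*)}{(\chi^*)^{1/2}}=\lim_{\sigma\to 0}\frac{\eta_q((\chi^*)^{1/2};\chi^*)}{(\chi^*)^{1/2}}=\lim_{\sigma\to 0}\eta_q(1;(\chi^*)^{q/2})=1,$$
where the last equality is due to Lemma 4 part (iii) and Lemma 15. Finally we can apply the dominated convergence theorem to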
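show that:
$$\lim_{\sigma\to 0}G_4=\lim_{\sigma\to 0}\mathbb{E}\,\frac{\mathbb{1}(|B/\sigma+Z|>\alpha_3)}{|\eta_q(B/\sigma+Z;\chi^*)|^{2-q}+\chi^*q(q-1)}=0.\tag{77}$$
To see why DCT can be applied, refer to the argument we presented in deriving (66). Putting (73), (74), (75), (76) and (77) together, we have proved $\lim_{\sigma\to 0}H_3=0$. This fact together with Equations (68), (69) and (70) gives us
$$\lim_{\sigma\to 0}\frac{\chi^*(\sigma)}{\sigma^{2q-2}}=\lim_{\sigma\to 0}\frac{-(1-\epsilon)H_1-\epsilon H_3}{\epsilon H_2/\sigma^{2-2q}}=\frac{(1-\epsilon)\mathbb{E}|Z|^q}{\epsilon q\mathbb{E}|B|^{2q-2}}.$$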
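Based on this result, applying Lemma 17 gives the asymptotics of $R(\chi^*(\sigma),\sigma)$.
4) Characterizing AMSE for $1<q<2$: Combining Theorem 1 and Lemma 11, we conclude that
$$\lim_{p\to\infty}\frac{1}{p}\|\hat\beta(\lambda_{*,q},q)-\beta\|_2^2=\tilde\sigma^2\cdot R(\chi^*(\tilde\sigma),\tilde\sigma)\quad\text{a.s.},$$
where $\tilde\sigma$ is the fixed point of the following equation:
$$\hat\sigma^2=\sigma_w^2+\frac{\hat\sigma^2}{\delta}R(\chi^*(\hat\sigma),\hat\sigma).$$
As we discussed in the proof of Lemma 11, this equation has a unique solution. Note that $R(\chi^*(\tilde\sigma),\tilde\sigma)\le R(0,\tilde\sigma)=1$. Therefore,
$$\tilde\sigma^2\le\sigma_w^2+\frac{1}{\delta}\tilde\sigma^2\ \Rightarrow\ \tilde\sigma^2\le\frac{\delta}{\delta-1}\sigma_w^2.\tag{78}$$
Hence, as $\sigma_w\to 0$, $\tilde\sigma\to 0$. Since $R(\chi^*(\tilde\sigma),\tilde\sigma)\to 1$ as $\tilde\sigma\to 0$ from Lemma 18, we have
$$\lim_{\sigma_w\to 0}\frac{\sigma_w^2}{\tilde\sigma^2}=\frac{\delta-1}{\delta}.$$
To obtain the next dominant term in $\tilde\sigma^2$, we use Lemma 18 again to have $\lim_{\tilde\sigma\to 0}\frac{R(\chi^*(\tilde\sigma),\tilde\sigma)-1}{\tilde\sigma^{2q-2}}=-\frac{(1-\epsilon)^2(\mathbb{E}|Z|^q)^2}{\epsilon\mathbb{E}|B|^{2q-2}}$. Hence,
$$\lim_{\sigma_w\to 0}\frac{\tilde\sigma^2-\frac{\sigma_w^2}{1-1/\delta}}{\sigma_w^{2q}}=\lim_{\sigma_w\to 0}\frac{(R(\chi^*(\tilde\sigma),\tilde\sigma)-1)\tilde\sigma^2}{(\delta-1)\sigma_w^{2q}}=-\frac{\delta^q(1-\epsilon)^2(\mathbb{E}|Z|^q)^2}{(\delta-1)^{q+1}\epsilon\mathbb{E}|B|^{2q-2}}.\tag{79}$$
Finally note that $\lim_{p\to\infty}\frac{1}{p}\|\hat\beta(\lambda_{*,q},q)-\beta\|_2^2=\delta(\tilde\sigma^2-\sigma_w^2)$, a.s. It is then straightforward to derive (13) from (79).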
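The fixed-point characterization above can be illustrated in the one case where everything is explicit. The sketch below is an assumption-laden illustration for $q=2$ only, not the paper's argument: taking $\eta_2(u;\chi)=u/(1+2\chi)$, a short calculation (ours, not stated in the text) gives $\min_\chi R_2(\chi,\sigma)=m/(m+\sigma^2)$ with $m=\epsilon\,\mathbb{E}B^2$, so the fixed-point equation becomes $x=\sigma_w^2+\frac{x}{\delta}\cdot\frac{m}{m+x}$ in $x=\hat\sigma^2$. The function names and the plain iteration scheme are hypothetical choices.

```python
def solve_sigma2(sigma_w, delta, m, iters=2000):
    """Iterate x <- sigma_w^2 + (x / delta) * m / (m + x), starting at sigma_w^2.

    m stands for eps * E[B^2]; the map is increasing and concave in x, so the
    iterates increase monotonically to the fixed point (assumed q = 2 case).
    """
    x = sigma_w ** 2
    for _ in range(iters):
        x = sigma_w ** 2 + (x / delta) * m / (m + x)
    return x

def amse(sigma_w, delta, m):
    """AMSE = delta * (sigma_tilde^2 - sigma_w^2) at the fixed point."""
    return delta * (solve_sigma2(sigma_w, delta, m) - sigma_w ** 2)

if __name__ == "__main__":
    sigma_w, delta, m = 1e-3, 2.0, 1.0  # illustrative values only
    x = solve_sigma2(sigma_w, delta, m)
    print(x, sigma_w ** 2 / x, amse(sigma_w, delta, m))
```

For small $\sigma_w$ and $\delta>1$ the printed ratio $\sigma_w^2/\tilde\sigma^2$ should be close to $(\delta-1)/\delta$, and $\tilde\sigma^2$ should respect the bound (78); both are consistent with the low-noise expansion derived above.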
G. Proof of Theorem 3
We claim $\lim_{\sigma\to 0}R(\chi^*(\sigma),\sigma)=1$. For $1<q<2$, first note $R(\chi^*(\sigma),\sigma)\le R(0,\sigma)=1$, hence $\limsup_{\sigma\to 0}R(\chi^*(\sigma),\sigma)\le 1$. On the other hand, Equation (56) shows that
$$R(\chi^*(\sigma),\sigma)\ge(1-\epsilon)\mathbb{E}(\eta_q(Z;\chi^*))^2+2\epsilon\mathbb{E}\,\frac{1}{1+\chi^*q(q-1)|\eta_q(B/\sigma+Z;\chi^*)|^{q-2}}-\epsilon.$$
Using Fatou's lemma combined with Lemmas 14 and 15, we then have $\liminf_{\sigma\to 0}R(\chi^*(\sigma),\sigma)\ge 1$ from the equation above. When $q=2$, $R(\chi^*(\sigma),\sigma)$ admits a nice explicit expression and can be easily shown to converge to 1. Now suppose $\lim_{\sigma_w\to 0}\lim_{p\to\infty}\frac{1}{p}\|\hat\beta(\lambda_{*,q},q)-\beta\|_2^2=0$. Using the same notation as in Section V-F4, we have $\tilde\sigma\to 0$ as $\sigma_w\to 0$. However, it means
$$1\ge 1-\frac{\sigma_w^2}{\tilde\sigma^2}=\frac{R(\chi^*(\tilde\sigma),\tilde\sigma)}{\delta}\to\frac{1}{\delta},\quad\text{as }\sigma_w\to 0,$$
which is a contradiction since $\delta<1$.
H. Proof of Theorem 4
1) Roadmap of the proof: The roadmap of the proof will be similar to the roadmap we discussed for the case $1<q\le 2$ in Section V-F1 with a few changes. Similar to Section V-F1, we define
$$R_1(\chi,\sigma)=(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon\,\mathbb{E}_{B\sim G}(\eta_1(B/\sigma+Z;\chi)-B/\sigma)^2,\tag{80}$$
and we study the behavior of $\chi^*(\sigma)\triangleq\arg\min_{\chi\ge 0}R_1(\chi,\sigma)$ and $R_1(\chi^*(\sigma),\sigma)$ for small values of $\sigma$. It will be shown in Lemma 19 that in this case, $\chi^*(\sigma)\to\chi^{**}$, where $\chi^{**}$ is a non-zero number that depends on the sparsity level and will be defined in Lemma 19. We then derive the rate at which the difference $|\chi^*(\sigma)-\chi^{**}|$ goes to zero in Lemma 20. The next step
is to characterize the behavior of the risk $R_1(\chi^*(\sigma),\sigma)$ for small values of $\sigma$, which is done in Lemma 21. Finally we use this result to derive the behavior of the fixed point of $\hat\sigma^2=\sigma_w^2+\inf_{\chi}\frac{1}{\delta}\mathbb{E}(\eta_1(B+\hat\sigma Z;\chi)-B)^2$ in Section V-H4.
2) Proof of $\chi^*(\sigma)\to\chi^{**}$ as $\sigma\to 0$: Our first lemma confirms that unlike $q>1$, $\chi^*(\sigma)\nrightarrow 0$.
Lemma 19. Suppose $\mathbb{E}_G|B|^2<\infty$, then for $q=1$,
$$\lim_{\sigma\to 0}\chi^*(\sigma)=\chi^{**},\qquad\lim_{\sigma\to 0}R(\chi^*(\sigma),\sigma)=(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^{**}))^2+\epsilon(1+(\chi^{**})^2),$$
where $\chi^{**}$ minimizes $(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon(1+\chi^2)$ over $[0,\infty)$.
Proof: We first claim that $\chi^*(\sigma_k)$ is bounded for any given sequence $\sigma_k\to 0$. If this is not true, there exists an unbounded subsequence $\chi^*(\sigma_{k_n})\to+\infty$ with $\sigma_{k_n}\to 0$. But because $\eta_1(b/\sigma_{k_n}+z;\chi^*(\sigma_{k_n}))=\mathrm{sign}(b/\sigma_{k_n}+z)(|b/\sigma_{k_n}+z|-\chi^*(\sigma_{k_n}))_+$, we know $|\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n}|^2\to+\infty$, a.s. By Fatou's lemma, we know $\mathbb{E}(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n})^2\to+\infty$, contradicting the boundedness of $R(\chi^*(\sigma_{k_n}),\sigma_{k_n})$.
Now we show the sequence $\chi^*(\sigma_k)$ converges to a finite constant, for any $\sigma_k\to 0$. Taking a convergent subsequence $\chi^*(\sigma_{k_n})$, due to the boundedness of $\chi^*(\sigma_k)$, the limit of the subsequence is finite. Call it $\tilde\chi$. Note that
$$\mathbb{E}(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n})^2=1+\mathbb{E}(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n}-Z)^2+2\mathbb{E}Z(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n}-Z).\tag{81}$$
Since $\eta_1(u;\chi)=\mathrm{sign}(u)(|u|-\chi)_+$, we have
$$\begin{aligned}
|\eta_1(Z;\chi^*(\sigma_{k_n}))|^2&\le|Z|^2,\\
(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n}-Z)^2&\le(\chi^*(\sigma_{k_n}))^2,\\
|Z(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n}-Z)|&\le|Z|\chi^*(\sigma_{k_n}).
\end{aligned}\tag{82}$$
Furthermore, note that all the terms above on the right are integrable. Hence, we can apply DCT to obtain
$$\lim_{n\to\infty}\,(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^*(\sigma_{k_n})))^2+\epsilon\mathbb{E}(\eta_1(B/\sigma_{k_n}+Z;\chi^*(\sigma_{k_n}))-B/\sigma_{k_n})^2=(1-\epsilon)\mathbb{E}(\eta_1(Z;\tilde\chi))^2+\epsilon(1+\tilde\chi^2).$$
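Since $\chi^*(\sigma_{k_n})$ is the optimal threshold value, $\tilde\chi$ has to be $\arg\min_{\chi\ge 0}(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon(1+\chi^2)$. Moreover, by taking derivatives, it is not hard to verify that $(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon(1+\chi^2)$, as a function of $\chi\ge 0$, is strongly convex and always has a unique minimizer. Since $\chi^*(\sigma_{k_n})$ is an arbitrary convergent subsequence, this implies that the sequence $\chi^*(\sigma_k)$ converges to that minimizer as well. This is true for any $\sigma_k\to 0$, hence $\chi^*(\sigma)$ goes to the minimizer as $\sigma\to 0$.
Our next goal is to characterize the difference between $(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^*(\sigma)))^2+\epsilon\mathbb{E}(\eta_1(B/\sigma+Z;\chi^*(\sigma))-B/\sigma)^2$ and $(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^{**}))^2+\epsilon(1+(\chi^{**})^2)$ and see how fast it goes to zero.
3) Characterizing the behavior of $R(\chi^*(\sigma),\sigma)$:
Lemma 20. Suppose $\mathbb{P}_G(|B|\ge\mu)=1$ and $\mathbb{E}_G|B|^2<\infty$, then for $q=1$,
$$|\chi^*(\sigma)-\chi^{**}|=O(\phi(-\mu/\sigma+\chi^{**})).$$
Proof: Define $U_\sigma\triangleq B/\sigma+Z$. By taking the derivative of $(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi))^2+\epsilon\mathbb{E}(\eta_1(B/\sigma+Z;\chi)-B/\sigma)^2$ with respect to $\chi$, we obtain the following equation for $\chi^*(\sigma)$:
$$\chi^*(\sigma)=\frac{(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z>\chi^*(\sigma)))-(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z<-\chi^*(\sigma)))+\epsilon\mathbb{E}(Z\mathbb{I}(U_\sigma>\chi^*(\sigma)))-\epsilon\mathbb{E}(Z\mathbb{I}(U_\sigma<-\chi^*(\sigma)))}{(1-\epsilon)\mathbb{E}(\mathbb{I}(Z>\chi^*(\sigma)))+(1-\epsilon)\mathbb{E}(\mathbb{I}(Z<-\chi^*(\sigma)))+\epsilon\mathbb{E}(\mathbb{I}(U_\sigma>\chi^*(\sigma)))+\epsilon\mathbb{E}(\mathbb{I}(U_\sigma<-\chi^*(\sigma)))}.\tag{83}$$
Letting $\sigma$ go to zero on both sides in the above equation, we then have
$$\chi^{**}=\frac{(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z>\chi^{**}))-(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z<-\chi^{**}))}{(1-\epsilon)\mathbb{E}(\mathbb{I}(Z>\chi^{**}))+(1-\epsilon)\mathbb{E}(\mathbb{I}(Z<-\chi^{**}))+\epsilon},\tag{84}$$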
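where we applied the dominated convergence theorem to obtain the above equality. We first study each of the terms in (83). By employing the mean value theorem, we get
$$\mathbb{E}[Z\mathbb{I}(Z>\chi^*)]-\mathbb{E}[Z\mathbb{I}(Z>\chi^{**})]=\int_{\chi^*}^{\chi^{**}}z\phi(z)\,dz=(\chi^{**}-\chi^*)\tilde\chi\phi(\tilde\chi),$$
where $\tilde\chi$ is a number between $\chi^*$ and $\chi^{**}$. Similarly,
$$\mathbb{E}[Z\mathbb{I}(Z<-\chi^*)]-\mathbb{E}[Z\mathbb{I}(Z<-\chi^{**})]=\int_{-\chi^{**}}^{-\chi^*}z\phi(z)\,dz=-\int_{\chi^*}^{\chi^{**}}z\phi(z)\,dz=-(\chi^{**}-\chi^*)\tilde\chi\phi(\tilde\chi),$$
$$\mathbb{E}[\mathbb{I}(Z>\chi^*)]-\mathbb{E}[\mathbb{I}(Z>\chi^{**})]=\int_{\chi^*}^{\chi^{**}}\phi(z)\,dz=(\chi^{**}-\chi^*)\phi(\tilde{\tilde\chi}),$$
and
$$\mathbb{E}[\mathbb{I}(Z<-\chi^*)]-\mathbb{E}[\mathbb{I}(Z<-\chi^{**})]=(\chi^{**}-\chi^*)\phi(\tilde{\tilde\chi}),$$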
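where $\tilde{\tilde\chi}$ is between $\chi^*$ and $\chi^{**}$. Now we consider the other four terms in (83). Let $F(b)$ denote the distribution function of $|B|$. We have
$$|\mathbb{E}(Z\mathbb{I}(B/\sigma+Z>\chi^*))|=\mathbb{E}\phi(\chi^*-B/\sigma)\le\mathbb{E}\phi(\chi^*-|B|/\sigma)\le\phi(\mu/\sigma-\chi^*),\tag{85}$$
where the last inequality is due to the fact that $\mathbb{P}(|B|\ge\mu)=1$. Note that we have assumed that $\sigma$ is so small that $\mu/\sigma-\chi^*>0$. Similarly,
$$|\mathbb{E}(Z\mathbb{I}(B/\sigma+Z<-\chi^*))|\le\phi(\mu/\sigma-\chi^*).\tag{86}$$
Finally,
$$\begin{aligned}
|\mathbb{E}(\mathbb{I}(B/\sigma+Z>\chi^*))-1+\mathbb{E}(\mathbb{I}(B/\sigma+Z<-\chi^*))|&=\mathbb{E}\,\mathbb{I}(|B/\sigma+Z|\le\chi^*)=\int_0^\infty\int_{-b/\sigma-\chi^*}^{-b/\sigma+\chi^*}\phi(z)\,dz\,dF(b)\\
&\le\int_{-\chi^*-\mu/\sigma}^{\chi^*-\mu/\sigma}\phi(z)\,dz\le 2\chi^*\phi(\mu/\sigma-\chi^*),
\end{aligned}\tag{87}$$
where to obtain the last inequality we have assumed that $\sigma$ is so small that $\chi^*-\mu/\sigma<0$. Define
$$e_1\triangleq\epsilon\big(\mathbb{E}(\mathbb{I}(B/\sigma+Z>\chi^*))-1+\mathbb{E}(\mathbb{I}(B/\sigma+Z<-\chi^*))\big),\qquad e_2\triangleq\epsilon\big(\mathbb{E}(Z\mathbb{I}(B/\sigma+Z>\chi^*))-\mathbb{E}(Z\mathbb{I}(B/\sigma+Z<-\chi^*))\big).$$
Also, define
$$S=(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z>\chi^{**}))-(1-\epsilon)\mathbb{E}(Z\mathbb{I}(Z<-\chi^{**})),\qquad T=(1-\epsilon)\mathbb{E}(\mathbb{I}(Z>\chi^{**}))+(1-\epsilon)\mathbb{E}(\mathbb{I}(Z<-\chi^{**}))+\epsilon.$$
Using the new notations, we can conclude these two equations:
$$\chi^{**}=\frac{S}{T},\qquad\chi^*(\sigma)=\frac{S+2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\tilde\chi\phi(\tilde\chi)+e_2}{T+2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\phi(\tilde{\tilde\chi})+e_1}.\tag{88}$$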
Hence, writing $D\triangleq T+2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\phi(\tilde{\tilde\chi})+e_1$ for the denominator in (88), we have
$$\begin{aligned}
\chi^*(\sigma)-\chi^{**}&=\frac{S+2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\tilde\chi\phi(\tilde\chi)+e_2}{D}-\frac{S}{T}\\
&=\frac{2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\tilde\chi\phi(\tilde\chi)+e_2}{D}-\frac{S\big(2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))\phi(\tilde{\tilde\chi})+e_1\big)}{T\,D}\\
&=\frac{2(1-\epsilon)(\chi^{**}-\chi^*(\sigma))(\tilde\chi\phi(\tilde\chi)-\chi^{**}\phi(\tilde{\tilde\chi}))}{D}+\frac{e_2-\chi^{**}e_1}{D}.
\end{aligned}\tag{89}$$
From (89) we get
$$(\chi^*(\sigma)-\chi^{**})\Big(1+\frac{2(1-\epsilon)(\tilde\chi\phi(\tilde\chi)-\chi^{**}\phi(\tilde{\tilde\chi}))}{D}\Big)=\frac{e_2-\chi^{**}e_1}{D}.\tag{90}$$
Note that in the above expression we know that $\chi^*(\sigma)\to\chi^{**}$, and hence $\tilde\chi\to\chi^{**}$ and $\tilde{\tilde\chi}\to\chi^{**}$. Therefore, we conclude that $\tilde\chi\phi(\tilde\chi)-\chi^{**}\phi(\tilde{\tilde\chi})\to 0$ and $(\chi^{**}-\chi^*(\sigma))\phi(\tilde{\tilde\chi})\to 0$. Moreover, since (85), (86) and (87) together show $e_1$ and $e_2$ go to
0 exponentially fast, we conclude from (90) that $(\chi^*(\sigma)-\chi^{**})/\sigma\to 0$. We can then have
$$\begin{aligned}
\lim_{\sigma\to 0}\frac{(\chi^*(\sigma)-\chi^{**})}{\phi(\mu/\sigma-\chi^*)}&=\lim_{\sigma\to 0}\frac{(\chi^*(\sigma)-\chi^{**})}{\phi(\mu/\sigma-\chi^{**})}\\
&\overset{(a)}{=}\lim_{\sigma\to 0}\frac{e_2-\chi^{**}e_1}{T\phi(\mu/\sigma-\chi^*)}\overset{(b)}{\le}\lim_{\sigma\to 0}\frac{2(1+\chi^*(\sigma)\chi^{**})\phi(\mu/\sigma-\chi^*)}{T\phi(\mu/\sigma-\chi^*)}=\frac{2(1+(\chi^{**})^2)}{T}.
\end{aligned}\tag{91}$$
We have used Equation (90) to get (a). To obtain (b) we used the following steps:
1) According to (87), $|e_1|\le 2\chi^*\phi(\mu/\sigma-\chi^*)$.
2) According to (85) and (86), $|e_2|\le 2\phi(\mu/\sigma-\chi^*)$.
Note that (91) has established our claim.
The next step is to characterize the behavior of the second order term in the risk of LASSO. According to Lemma 19, $\lim_{\sigma\to 0}R(\chi^*(\sigma),\sigma)$ exists. Recall that we denoted this quantity with $M_1(\epsilon)$.
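Lemma 21. Suppose $\mathbb{P}_G(|B|>\mu)=1$ and $\mathbb{E}_G|B|^2<\infty$, then for $q=1$,
$$|R(\chi^*(\sigma),\sigma)-M_1(\epsilon)|=O(\phi(\mu/\sigma-\chi^{**})).$$
Proof: We recall the two quantities:
$$M_1(\epsilon)=(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^{**}))^2+\epsilon(1+(\chi^{**})^2),\tag{92}$$
$$R(\chi^*(\sigma),\sigma)=(1-\epsilon)\mathbb{E}(\eta_1(Z;\chi^*(\sigma)))^2+\epsilon[1+\mathbb{E}(\eta_1(B/\sigma+Z;\chi^*(\sigma))-B/\sigma-Z)^2]+2\epsilon\mathbb{E}Z(\eta_1(B/\sigma+Z;\chi^*(\sigma))-B/\sigma-Z).\tag{93}$$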
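We bound $|R(\chi^*(\sigma),\sigma)-M_1(\epsilon)|$ by bounding the difference between the corresponding terms in (93) and (92). Note that we in fact know $e_1<0$, $e_2>0$ from the proof of Lemma 20, hence (90) implies $\chi^*(\sigma)>\chi^{**}$ for small enough $\sigma$. We start with
$$\begin{aligned}
|\mathbb{E}(\eta_1(Z;\chi^*(\sigma)))^2-\mathbb{E}(\eta_1(Z;\chi^{**}))^2|&=|\mathbb{E}(\eta_1(Z;\chi^*(\sigma))-\eta_1(Z;\chi^{**}))(\eta_1(Z;\chi^*(\sigma))+\eta_1(Z;\chi^{**}))|\\
&\le\mathbb{E}\big[\big||\chi^*(\sigma)-\chi^{**}|+\chi^*(\sigma)\mathbb{1}(|Z|\in(\chi^{**},\chi^*(\sigma)))\big|\cdot|\eta_1(Z;\chi^*(\sigma))+\eta_1(Z;\chi^{**})|\big]\\
&\le 2|\chi^*(\sigma)-\chi^{**}|\cdot\mathbb{E}|Z|+2\chi^*(\sigma)\mathbb{E}[\mathbb{1}(|Z|\in(\chi^{**},\chi^*(\sigma)))|Z|]\\
&\le 2|\chi^*(\sigma)-\chi^{**}|\cdot\mathbb{E}|Z|+4\chi^*(\sigma)|\chi^*(\sigma)-\chi^{**}|\tilde\chi\phi(\tilde\chi)\\
&=O(\phi(\mu/\sigma-\chi^{**})),
\end{aligned}\tag{94}$$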
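where $\tilde{\chi}$ is a number between $\chi^*(\sigma)$ and $\chi^{**}$ and the last equality is due to Lemma 20. The next step is to characterize the difference between $(\chi^{**})^2$ and $E(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z)^2$:
\[
\begin{aligned}
&|(\chi^{**})^2 - E(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z)^2| \\
&\quad\le |(\chi^*(\sigma))^2 - E(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z)^2| + |(\chi^{**})^2 - (\chi^*(\sigma))^2|.
\end{aligned} \tag{95}
\]
To bound the two terms on the right, first note that
\[
\begin{aligned}
0 &\le (\chi^*(\sigma))^2 - E(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z)^2 \\
&= E\big[1(|B/\sigma + Z| \le \chi^*(\sigma))\cdot((\chi^*(\sigma))^2 - (B/\sigma + Z)^2)\big] \\
&\le (\chi^*(\sigma))^2\int_0^\infty\int_{-b/\sigma - \chi^*(\sigma)}^{-b/\sigma + \chi^*(\sigma)}\phi(z)\,dz\,dF(b)
\le (\chi^*(\sigma))^2\int_{-\mu/\sigma - \chi^*(\sigma)}^{-\mu/\sigma + \chi^*(\sigma)}\phi(z)\,dz \\
&\le 2(\chi^*(\sigma))^3\phi(\mu/\sigma - \chi^*(\sigma)) = O(\phi(\mu/\sigma - \chi^{**})),
\end{aligned} \tag{96}
\]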
where the last equality holds since (χ∗ (σ) − χ∗∗ )/σ → 0. Furthermore, according to Lemma 20 (χ∗ (σ))2 − (χ∗∗ )2 = O(φ(µ/σ − χ∗∗ )).
(97)
Combining (95), (96), and (97), we obtain |(χ∗∗ )2 − E(η1 (B/σ + Z; χ∗ (σ)) − B/σ − Z)2 | = O(φ(µ/σ − χ∗∗ )).
(98)
Finally, we combine Inequality (87) and Lemma 8 to obtain
\[
0 \le E Z(B/\sigma + Z - \eta_1(B/\sigma + Z;\chi^*(\sigma))) = E(1 - \partial_1\eta_1(B/\sigma + Z;\chi^*(\sigma))) = P(|B/\sigma + Z| \le \chi^*(\sigma)) = O(\phi(\mu/\sigma - \chi^{**})). \tag{99}
\]
Combining (94), (98), and (99) together completes the proof.
4) Characterizing AMSE of LASSO (P (B > µ) = 1): We use the same strategy as in Section V-F4. Before proving this theorem, we mention the following result that will be used later in the proof:
Lemma 22. The function $E(\eta_1(\alpha + Z;\chi) - \alpha)^2$ is an increasing function of $\alpha > 0$ and is upper bounded by $1 + \chi^2$.
Proof: By taking the derivative we have
\[
\frac{d}{d\alpha}E(\eta_1(\alpha + Z;\chi) - \alpha)^2 = 2E(\alpha I(|\alpha + Z| \le \chi)) \ge 0.
\]
Hence,
\[
E(\eta_1(\alpha + Z;\chi) - \alpha)^2 \le \lim_{\alpha\to\infty}E(\eta_1(\alpha + Z;\chi) - \alpha)^2 = 1 + \chi^2.
\]
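Lemma 22 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the threshold value `chi = 1.5`, the grid of `alpha` values, and the midpoint-rule integrator `shifted_risk` are all hypothetical choices made here, not part of the paper. It evaluates $E(\eta_1(\alpha + Z;\chi) - \alpha)^2$ on a grid of $\alpha$ so that monotonicity and the bound $1 + \chi^2$ can be inspected.

```python
import math


def soft_threshold(u, chi):
    """eta_1(u; chi): the soft-thresholding function sign(u) * max(|u| - chi, 0)."""
    return math.copysign(max(abs(u) - chi, 0.0), u)


def shifted_risk(alpha, chi, half_width=12.0, n=40000):
    """Midpoint-rule approximation of E(eta_1(alpha + Z; chi) - alpha)^2 for Z ~ N(0, 1).

    The integration range [-half_width, half_width] and the number of cells n
    are illustrative numerical choices.
    """
    h = 2.0 * half_width / n
    total = 0.0
    for k in range(n):
        z = -half_width + (k + 0.5) * h
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += (soft_threshold(alpha + z, chi) - alpha) ** 2 * phi
    return total * h


chi = 1.5                                # illustrative threshold (assumption)
alphas = [0.25 * k for k in range(33)]   # alpha = 0, 0.25, ..., 8
values = [shifted_risk(a, chi) for a in alphas]

for a, v in zip(alphas[::8], values[::8]):
    print(f"alpha = {a:4.2f}   risk = {v:.6f}   bound 1 + chi^2 = {1 + chi**2:.6f}")
```

On this grid the computed values should be nondecreasing in `alpha` and approach $1 + \chi^2$ from below, which is the content of the lemma.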
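Combining Theorem 1 and Lemma 11 we conclude that
\[
\lim_{p\to\infty}\frac{1}{p}\|\hat{\beta}(\lambda_{*,1}, 1) - \beta\|_2^2 = \tilde{\sigma}^2\cdot R_1(\chi^*(\tilde{\sigma}), \tilde{\sigma}) \quad \text{a.s.},
\]
where $\tilde{\sigma}$ is the unique fixed point of the following equation:
\[
\hat{\sigma}^2 = \sigma_w^2 + \frac{\hat{\sigma}^2}{\delta}R_1(\chi^*(\hat{\sigma}), \hat{\sigma}). \tag{100}
\]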
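Note that according to Lemma 22 we have
\[
\begin{aligned}
R_1(\chi, \sigma) &= (1-\epsilon)E(\eta_1(Z;\chi))^2 + \epsilon E_{B\sim G}(\eta_1(B/\sigma + Z;\chi) - B/\sigma)^2 \\
&\le (1-\epsilon)E(\eta_1(Z;\chi))^2 + \epsilon(1 + \chi^2).
\end{aligned}
\]
Hence,
\[
R_1(\chi^*(\tilde{\sigma}), \tilde{\sigma}) = \min_{\chi\ge 0}R_1(\chi, \tilde{\sigma}) \le \min_{\chi\ge 0}\big[(1-\epsilon)E(\eta_1(Z;\chi))^2 + \epsilon(1 + \chi^2)\big] = M_1(\epsilon). \tag{101}
\]
Combining (100) and (101) we have
\[
\tilde{\sigma}^2 \le \frac{\sigma_w^2}{1 - M_1(\epsilon)/\delta}.
\]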
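Hence, when $M_1(\epsilon) < \delta$, $\tilde{\sigma}^2 \to 0$ as $\sigma_w^2 \to 0$. Then according to Lemma 21, $R_1(\chi^*(\tilde{\sigma}), \tilde{\sigma}) \to M_1(\epsilon)$. Therefore, we have
\[
\lim_{\sigma_w\to 0}\frac{\sigma_w^2}{\tilde{\sigma}^2} = \frac{\delta - M_1(\epsilon)}{\delta}.
\]
By noting from Lemma 21 that $R(\chi^*(\sigma), \sigma) - M_1(\epsilon) = O(\phi(\mu/\sigma - \chi^{**}))$, we get
\[
\lim_{\sigma_w\to 0}\frac{\tilde{\sigma}^2 - \frac{\delta\sigma_w^2}{\delta - M_1(\epsilon)}}{\phi\big(\tilde{\mu}\sqrt{\delta - M_1(\epsilon)}/(\sigma_w\sqrt{\delta})\big)}
= \lim_{\sigma_w\to 0}\frac{(R(\chi^*(\tilde{\sigma}), \tilde{\sigma}) - M_1(\epsilon))\tilde{\sigma}^2}{(\delta - M_1(\epsilon))\phi\big(\tilde{\mu}\sqrt{\delta - M_1(\epsilon)}/(\sigma_w\sqrt{\delta})\big)} = 0.
\]
Combining with $\lim_{p\to\infty}\frac{1}{p}\|\hat{\beta}(\lambda_{*,1}, 1) - \beta\|_2^2 = \delta(\tilde{\sigma}^2 - \sigma_w^2)$ leads to (21).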
I. Proof of Theorem 5
Since the roadmap of the proof of this theorem is similar to that we mentioned in Section V-H1, we do not repeat it here. Furthermore, note that the proof we presented in Lemma 19 for $\chi^*(\sigma) \to \chi^{**}$ is general. Hence, we start with characterizing the difference between $\chi^*(\sigma)$ and $\chi^{**}$.
Lemma 23. Suppose $P_G(|B| \le \sigma) = \Theta(\sigma^\ell)$ with $\ell > 0$ and $E_G|B|^2 < \infty$, then for $q = 1$,
\[
\alpha_m\sigma^\ell \le \chi^*(\sigma) - \chi^{**} \le \beta_m\sigma^\ell\cdot\Bigg(\sqrt{\underbrace{\log\log\ldots\log}_{m\ \text{times}}\Big(\frac{1}{\sigma}\Big)}\Bigg)^{\ell},
\]
for small enough σ, where m > 0 is an arbitrary integer number and αm , βm > 0 are two constants depending on m. Proof: To prove this result, we use exactly the same line of proof that we presented for Lemma 20. Hence, we do not repeat the entire proof and only mention the differences. To bound the difference between χ∗ (σ) and χ∗∗ , we use the formula in (89). Hence, we have to bound e1 and e2 . Since we use a different strategy to bound these terms, we mention our approach herein. First observe that Eφ(χ∗ + |B|/σ) ≤ E(ZI(B/σ + Z > χ∗ )) = Eφ(χ∗ − B/σ) ≤ Eφ(χ∗ − |B|/σ),
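where the two inequalities above are due to the simple fact $-\chi^*|B|/\sigma \le \chi^*B/\sigma \le \chi^*|B|/\sigma$. Let
\[
\log_m(a) \triangleq \underbrace{\log\log\ldots\log}_{m\ \text{times}}(a),
\]
and $F(b)$ be the probability function of $|B|$. Given an integer $m > 0$ and a constant $c > 0$, we then have
\[
\begin{aligned}
E\phi(\chi^* - |B|/\sigma)
={}& \int_0^{c\sigma(\log_m(1/\sigma))^{1/2}}\phi(\chi^* - b/\sigma)\,dF(b)
+ \sum_{i=1}^{m-1}\int_{c\sigma(\log_{m-i+1}(1/\sigma))^{1/2}}^{c\sigma(\log_{m-i}(1/\sigma))^{1/2}}\phi(\chi^* - b/\sigma)\,dF(b) \\
&+ \int_{c\sigma(\log(1/\sigma))^{1/2}}^{\infty}\phi(\chi^* - b/\sigma)\,dF(b) \\
\le{}& \phi(0)P(|B| \le c\sigma(\log_m(1/\sigma))^{1/2})
+ \sum_{i=1}^{m-1}\phi(c(\log_{m-i+1}(1/\sigma))^{1/2} - \chi^*)P(|B| \le c\sigma(\log_{m-i}(1/\sigma))^{1/2}) \\
&+ \phi(c(\log(1/\sigma))^{1/2} - \chi^*).
\end{aligned} \tag{102}
\]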
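Further note that
\[
\begin{aligned}
&\sum_{i=1}^{m-1}\phi(c(\log_{m-i+1}(1/\sigma))^{1/2} - \chi^*)P(|B| \le c\sigma(\log_{m-i}(1/\sigma))^{1/2}) \\
&\quad\le \frac{e^{(\chi^*)^2/2}}{\sqrt{2\pi}}\sum_{i=1}^{m-1}e^{-(c^2\log_{m-i+1}(1/\sigma))/4}P(|B| \le c\sigma(\log_{m-i}(1/\sigma))^{1/2}) \\
&\quad= O(1)\cdot\sum_{i=1}^{m-1}\sigma^\ell(\log_{m-i}(1/\sigma))^{\ell/2 - c^2/4},
\end{aligned}
\]
and $\phi(c(\log(1/\sigma))^{1/2} - \chi^*) \le \frac{1}{\sqrt{2\pi}}e^{(\chi^*)^2/2}\cdot\sigma^{c^2/4}$. We have used the simple inequality $-(a - b)^2/2 \le -b^2/4 + a^2/2$ in the above derivations. Hence, by choosing a sufficiently large $c$, we can conclude that the dominant term in (102) is $\phi(0)P(|B| \le c\sigma(\log_m(1/\sigma))^{1/2}) = \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2})$. Now we would like to derive a lower bound for $E\phi(\chi^* - B/\sigma)$. Choosing a fixed constant $C > 0$, then
\[
E\phi(\chi^* + |B|/\sigma) \ge \int_0^{C\sigma}\phi(\chi^* + b/\sigma)\,dF(b) \ge \phi(C + \chi^*)P(|B| \le C\sigma) = \Theta(\sigma^\ell).
\]
So far we have shown that $\Theta(\sigma^\ell) \le E(ZI(B/\sigma + Z > \chi^*)) \le \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2})$. Similarly, we know $\Theta(\sigma^\ell) = E\phi(\chi^* + |B|/\sigma) \le -E(Z1(B/\sigma + Z < -\chi^*)) = E\phi(\chi^* + B/\sigma) \le E\phi(\chi^* - |B|/\sigma) = \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2})$. It is easy to see that $e_2 = (E(ZI(B/\sigma + Z > \chi^*)) - E(ZI(B/\sigma + Z < -\chi^*))) = E\phi(\chi^* + B/\sigma) + E\phi(\chi^* - B/\sigma)$, hence
\[
\Theta(\sigma^\ell) \le e_2 \le \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2}). \tag{103}
\]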
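To bound $e_1$, recall that $e_1 = (E(I(B/\sigma + Z > \chi^*)) - 1 + E(I(B/\sigma + Z < -\chi^*))) = -E1(|B/\sigma + Z| \le \chi^*)$. We can have
\[
E1(|B/\sigma + Z| \le \chi^*) = E\int_{-B/\sigma - \chi^*}^{-B/\sigma + \chi^*}\phi(z)\,dz = 2\chi^*E\phi(-B/\sigma + a\chi^*),
\]
where $|a| \le 1$ depends on $B$. It is straightforward to confirm that there exist two positive constants $C_1$ and $C_2$ such that $C_1E\phi(\chi^* + |B|/\sigma) \le E\phi(-B/\sigma + a\chi^*) \le C_2E\phi(\chi^* - |B|/\sigma)$. Therefore, we can immediately get
\[
\Theta(\sigma^\ell) \le -e_1 \le \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2}). \tag{104}
\]
Now by looking back at Equation (90):
\[
(\chi^*(\sigma) - \chi^{**})\left(1 + \frac{2(1-\epsilon)\big(\tilde{\chi}\phi(\tilde{\chi}) - \chi^{**}\phi(\tilde{\tilde{\chi}})\big)}{T + 2(1-\epsilon)(\chi^{**} - \chi^*(\sigma))\phi(\tilde{\tilde{\chi}}) + e_1}\right) = \frac{e_2 - \chi^{**}e_1}{T + 2(1-\epsilon)(\chi^{**} - \chi^*(\sigma))\phi(\tilde{\tilde{\chi}}) + e_1},
\]
combined with the results from (103) and (104), we conclude
\[
\Theta(\sigma^\ell) \le \chi^*(\sigma) - \chi^{**} \le \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2}).
\]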
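The upper and lower rates in Lemma 23 differ only by an iterated-logarithm factor, which grows extremely slowly. The sketch below illustrates this; it is an illustration only, in which the helper names `iterated_log` and `rate_envelope` are hypothetical and the unspecified constants $\alpha_m$, $\beta_m$ are set to 1 purely for display.

```python
import math


def iterated_log(x, m):
    """log_m(x): the natural logarithm applied m times."""
    y = x
    for _ in range(m):
        if y <= 0.0:
            raise ValueError("iterated logarithm undefined: argument became non-positive")
        y = math.log(y)
    return y


def rate_envelope(sigma, ell, m):
    """Return (sigma^ell, sigma^ell * log_m(1/sigma)^(ell/2)).

    The constants alpha_m and beta_m of Lemma 23 are not known here and are
    taken to be 1 (assumption), so only the sigma-dependence is shown.
    """
    lower = sigma ** ell
    upper = sigma ** ell * iterated_log(1.0 / sigma, m) ** (ell / 2.0)
    return lower, upper


for sigma in (1e-4, 1e-7, 1e-10):
    lower, upper = rate_envelope(sigma, ell=2.0, m=3)
    print(f"sigma = {sigma:.0e}   lower = {lower:.3e}   upper = {upper:.3e}   ratio = {upper / lower:.3f}")
```

Note that $\log_m(1/\sigma)$ is only positive once $\sigma$ is small enough for the given $m$, which matches the "for small enough $\sigma$" qualifier in the lemma.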
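Lemma 24. Suppose $P_G(|B| \le \sigma) = \Theta(\sigma^\ell)$ and $E_G|B|^2 < \infty$, then for $q = 1$,
\[
-\beta_m\sigma^\ell\cdot\Bigg(\sqrt{\underbrace{\log\log\ldots\log}_{m\ \text{times}}\Big(\frac{1}{\sigma}\Big)}\Bigg)^{\ell} \le R(\chi^*(\sigma), \sigma) - M_1(\epsilon) \le -\alpha_m\sigma^\ell,
\]
for small enough $\sigma$, where $m > 0$ is an arbitrary integer number and $\alpha_m, \beta_m > 0$ are two constants depending on $m$.
Proof: We recall the two quantities:
\[
\begin{aligned}
M_1(\epsilon) ={}& (1-\epsilon)E(\eta_1(Z;\chi^{**}))^2 + \epsilon(1 + (\chi^{**})^2), \\
R(\chi^*(\sigma), \sigma) ={}& (1-\epsilon)E(\eta_1(Z;\chi^*(\sigma)))^2 + \epsilon\big(1 + E(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z)^2\big) \\
&+ 2\epsilon EZ(\eta_1(B/\sigma + Z;\chi^*(\sigma)) - B/\sigma - Z).
\end{aligned}
\]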
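Since $\chi^*(\sigma)$ is the optimal threshold value and $|\eta_1(B/\sigma + Z;\chi^{**}) - B/\sigma - Z| \le \chi^{**}$, we know
\[
\begin{aligned}
R(\chi^*(\sigma), \sigma) - R^{**} &\le R(\chi^{**}, \sigma) - R^{**} \\
&= \epsilon\big(E(\eta_1(B/\sigma + Z;\chi^{**}) - B/\sigma - Z)^2 - (\chi^{**})^2\big) + 2\epsilon EZ(\eta_1(B/\sigma + Z;\chi^{**}) - B/\sigma - Z) \\
&\le -2\epsilon E1(|B/\sigma + Z| \le \chi^{**}) \le -\Theta(\sigma^\ell),
\end{aligned} \tag{105}
\]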
where the last inequality holds by arguments similar to those used for bounding $e_1$ in Lemma 23. To derive the lower bound, we follow the same calculation steps as in the proof of Lemma 21 and use the bounds we derived for $|\chi^*(\sigma) - \chi^{**}|$ in Lemma 23. We can get
\[
|R(\chi^*(\sigma), \sigma) - M_1(\epsilon)| \le \Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2}). \tag{106}
\]
Putting (105) and (106) together completes the proof. The next step is to characterize the AMSE for small values of $\sigma$. However, since the proof is exactly the same as the proof we presented in Section V-H4, we do not repeat it here.
J. Proof of Theorem 6
The same strategy used to prove Theorem 3 works here. We hence skip the details. Note the key argument $\lim_{\sigma\to 0}R(\chi^*(\sigma), \sigma) = M_1(\epsilon)$ has been shown in Lemma 19.
K. Proof of Theorem 7
The roadmap of the proof is the same as in the sparse setting. Hence we do not repeat it here. We treat the cases $1 < q \le 2$ and $q = 1$ separately.
1) Proof for the case $1 < q \le 2$: Note that Lemma 15 remains valid when $\epsilon = 1$. Thus we know $\chi^*(\sigma) \to 0$, as $\sigma \to 0$. We aim to characterize the convergence rate of $\chi^*(\sigma)$. Towards that end, we first derive a result similar to Lemma 16.
Lemma 25. Suppose that $P_G(|B| > \mu) = 1$, $E_G(|B|^2) < \infty$ and $\chi(\sigma) = C\sigma^q$, where $\mu > 0$, $C > 0$ are fixed numbers, then we have for $1 < q < 2$,
\[
\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2} = C^2q^2E|B|^{2q-2} - 2Cq(q-1)E|B|^{q-2}.
\]
Proof: According to (59), for $\epsilon = 1$ we have
\[
R(\chi, \sigma) - 1 = \chi^2q^2E|\eta_q(B/\sigma + Z;\chi)|^{2q-2} - 2\chi q(q-1)E\frac{|\eta_q(B/\sigma + Z;\chi)|^{q-2}}{1 + \chi q(q-1)|\eta_q(B/\sigma + Z;\chi)|^{q-2}} \triangleq S_1 + S_2. \tag{107}
\]
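Using an argument similar to the one presented for deriving (61), it is straightforward to show
\[
\lim_{\sigma\to 0}\frac{S_1}{\sigma^2} = \lim_{\sigma\to 0}\frac{\chi^2q^2E|\eta_q(B/\sigma + Z;\chi)|^{2q-2}}{\sigma^2} = C^2q^2\lim_{\sigma\to 0}E|\eta_q(B + \sigma Z;\chi\sigma^{2-q})|^{2q-2} = C^2q^2E|B|^{2q-2}. \tag{108}
\]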
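Now we focus on analyzing $S_2$.
\[
\begin{aligned}
\frac{-S_2}{\sigma^2}
={}& 2C\sigma^{q-2}q(q-1)E\frac{|\eta_q(B/\sigma + Z;\chi)|^{q-2}}{1 + \chi q(q-1)|\eta_q(B/\sigma + Z;\chi)|^{q-2}} \\
={}& 2C\sigma^{q-2}q(q-1)E\frac{1}{|\eta_q(|B|/\sigma + Z;\chi)|^{2-q} + \chi q(q-1)} \\
={}& 2C\sigma^{q-2}q(q-1)\int_0^\infty\int_{-b/\sigma - \mu/(2\sigma)}^{-b/\sigma + \mu/(2\sigma)}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
&+ 2C\sigma^{q-2}q(q-1)\int_0^\infty\int_{\mathbb{R}\setminus[-b/\sigma - \mu/(2\sigma),\,-b/\sigma + \mu/(2\sigma)]}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\triangleq{}& T_1 + T_2,
\end{aligned}
\]
where $F(b)$ is the probability distribution function of $|B|$. We then consider $T_1$ and $T_2$ separately. For $T_1$, we have
\[
T_1 \le 2C\sigma^{q-2}q(q-1)\int_0^\infty\int_{-\mu/\sigma - \mu/(2\sigma)}^{-\mu/\sigma + \mu/(2\sigma)}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b) \le 2\sigma^{-3}\mu\phi(\mu/(2\sigma)) \to 0, \quad \text{as } \sigma \to 0. \tag{109}
\]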
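By using the dominated convergence theorem (DCT), we get
\[
\begin{aligned}
\lim_{\sigma\to 0}T_2
&= \lim_{\sigma\to 0}2C\sigma^{q-2}q(q-1)E\frac{1(Z \notin [-B/\sigma - \mu/(2\sigma), -B/\sigma + \mu/(2\sigma)])}{|\eta_q(B/\sigma + Z;\chi)|^{2-q} + \chi q(q-1)} \\
&= \lim_{\sigma\to 0}2Cq(q-1)E\frac{1(Z \notin [-B/\sigma - \mu/(2\sigma), -B/\sigma + \mu/(2\sigma)])}{|\eta_q(B + \sigma Z;\chi\sigma^{2-q})|^{2-q} + C\sigma^2q(q-1)} \\
&= 2Cq(q-1)E|B|^{q-2}.
\end{aligned} \tag{110}
\]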
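Note that DCT works here because for small enough $\sigma$,
\[
\frac{1(Z \notin [-B/\sigma - \mu/(2\sigma), -B/\sigma + \mu/(2\sigma)])}{|\eta_q(B + \sigma Z;\chi\sigma^{2-q})|^{2-q} + C\sigma^2q(q-1)} \le \frac{1}{|\eta_q(\mu/2;\chi\sigma^{2-q})|^{2-q}} \le \frac{1}{|\eta_q(\mu/2;1)|^{2-q}}.
\]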
Combining (107), (108), (109) and (110) together completes the proof.
Lemma 25 shows that by choosing an appropriate $\chi$ for $\sigma$ small enough, $R(\chi(\sigma), \sigma)$ is less than 1. This result will be used to show that $\chi^*(\sigma)$ cannot be too small. Then, we use this fact to derive the behavior of $\chi^*(\sigma)$. This is done in the next lemma.
Lemma 26. Suppose that $P_G(|B| > \mu) = 1$ and $E_G(|B|^2) < \infty$, then we have for $1 < q < 2$,
\[
\lim_{\sigma\to 0}\frac{\chi^*(\sigma)}{\sigma^q} = \frac{(q-1)E|B|^{q-2}}{qE|B|^{2q-2}}, \qquad
\lim_{\sigma\to 0}\frac{R(\chi^*(\sigma), \sigma) - 1}{\sigma^2} = -\frac{(q-1)^2(E|B|^{q-2})^2}{E|B|^{2q-2}}.
\]
Proof: Choosing $\chi(\sigma) = \frac{(q-1)E|B|^{q-2}}{qE|B|^{2q-2}}\cdot\sigma^q$ in Lemma 25, we have
\[
\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2} = -\frac{(q-1)^2(E|B|^{q-2})^2}{E|B|^{2q-2}} < 0, \tag{111}
\]
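hence we can conclude that $\chi^*(\sigma) > 0$ when $\sigma$ is small enough. Moreover, by a slight change of arguments in the proof of Lemma 25, summarized below:
1) the fact $\chi\sigma^{2-q} = o(1)$ used several times in Lemma 25 still holds here;
2) $\chi\sigma^{2-q} = o(1)$ and $\chi = o(\sigma^q)$ are sufficient to have $S_1 = o(\sigma^2)$;
3) the derivation for $T_1$ in (109) does not depend on $\chi$;
4) $\chi\sigma^{2-q} = o(1)$ and $\chi = o(\sigma^q)$ are sufficient to get $T_2 = o(1)$;
we can show
\[
\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2} = 0,
\]
for $\chi(\sigma) = o(\exp(-c/\sigma))$ with any fixed positive constant $c$. This implies that $\lim_{\sigma\to 0}\chi^*(\sigma)\cdot e^{\tilde{c}/\sigma} = +\infty$ for any $\tilde{c} > 0$. We will use the two properties about $\chi^*(\sigma)$ we showed so far in the following proof. Firstly, since $\chi^*(\sigma)$ is a non-zero finite value, it is a solution of $\frac{\partial R(\chi, \sigma)}{\partial\chi} = 0$. This means that
\[
\begin{aligned}
0 ={}& E\big((\eta_q(B/\sigma + Z;\chi^*) - B/\sigma)\partial_2\eta_q(B/\sigma + Z;\chi^*)\big) \\
\overset{(a)}{=}{}& E\left[\frac{-(\eta_q(B/\sigma + Z;\chi^*) - B/\sigma - Z)\,q|\eta_q(B/\sigma + Z;\chi^*)|^{q-1}\mathrm{sign}(B/\sigma + Z)}{1 + \chi^*q(q-1)|\eta_q(B/\sigma + Z;\chi^*)|^{q-2}}\right] + E(Z\partial_2\eta_q(B/\sigma + Z;\chi^*)) \\
\overset{(b)}{=}{}& \chi^*E\left[\frac{q^2|\eta_q(B/\sigma + Z;\chi^*)|^{2q-2}}{1 + \chi^*q(q-1)|\eta_q(B/\sigma + Z;\chi^*)|^{q-2}}\right] - E\left[\frac{q(q-1)|\eta_q(B/\sigma + Z;\chi^*)|^{4-2q}}{(|\eta_q(B/\sigma + Z;\chi^*)|^{2-q} + \chi^*q(q-1))^3}\right] \\
&- \chi^*E\left[\frac{q^2(q-1)|\eta_q(B/\sigma + Z;\chi^*)|^{2-q}}{(|\eta_q(B/\sigma + Z;\chi^*)|^{2-q} + \chi^*q(q-1))^3}\right] \\
\triangleq{}& \chi^*H_2 - I_1 - \chi^*I_2.
\end{aligned} \tag{112}
\]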
To derive Equality (a), we have used Lemma 7 part (ii). To obtain (b), we have used the following steps:
1) We used Lemma 4 part (i) to conclude that $\eta_q(B/\sigma + Z;\chi^*) - B/\sigma - Z = -\chi^*q|\eta_q(B/\sigma + Z;\chi^*)|^{q-1}\mathrm{sign}(B/\sigma + Z)$.
2) We used the expression we derived in Lemma 7 part (ii) for $\partial_2\eta_q(B/\sigma + Z;\chi^*)$ and then employed Lemma 8 to simplify the expression for $E(Z\partial_2\eta_q(B/\sigma + Z;\chi^*))$. Note that according to Lemma 5, $\partial_2\eta_q(B/\sigma + Z;\chi^*)$ is differentiable with respect to its first argument and hence Lemma 8 can be applied.
Now we evaluate $H_2$, $I_1$ and $I_2$ individually. Our goal is to show the following:
(i) $\lim_{\sigma\to 0}\frac{H_2}{\sigma^{2-2q}} = q^2E|B|^{2q-2}$.
(ii) $\lim_{\sigma\to 0}\frac{I_1}{\sigma^{2-q}} = q(q-1)E|B|^{q-2}$.
(iii) $\lim_{\sigma\to 0}\frac{I_2}{\sigma^{4-2q}} = q^2(q-1)E|B|^{2q-4}$.
The calculations for $H_2$ are similar to the calculations we did in Lemma 18, which give
\[
\lim_{\sigma\to 0}\frac{H_2}{\sigma^{2-2q}} = q^2E|B|^{2q-2}.
\]
We refer the reader to (70) for the derivation. Now we derive the asymptotic behavior of $I_1$ and $I_2$. We follow the same strategy used in the proof of Lemma 25. Let $F(b)$ denote the probability distribution function of $|B|$. For $I_1$ we have
\[
\begin{aligned}
I_1 ={}& \int_0^\infty\int_{-\infty}^{\infty}\frac{q(q-1)|\eta_q(b/\sigma + z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma + z;\chi^*)|^{2-q} + \chi^*q(q-1))^3}\phi(z)\,dz\,dF(b) \\
={}& \int_0^\infty\int_{-\frac{b}{\sigma} - \frac{\mu}{2\sigma}}^{-\frac{b}{\sigma} + \frac{\mu}{2\sigma}}\frac{q(q-1)|\eta_q(b/\sigma + z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma + z;\chi^*)|^{2-q} + \chi^*q(q-1))^3}\phi(z)\,dz\,dF(b) \\
&+ \int_0^\infty\int_{z\notin[-\frac{b}{\sigma} - \frac{\mu}{2\sigma},\,-\frac{b}{\sigma} + \frac{\mu}{2\sigma}]}\frac{q(q-1)|\eta_q(b/\sigma + z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma + z;\chi^*)|^{2-q} + \chi^*q(q-1))^3}\phi(z)\,dz\,dF(b)
\triangleq I_3 + I_4.
\end{aligned} \tag{113}
\]
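First note that
\[
\frac{|I_3|}{\sigma^{2-q}} \le \sigma^{q-2}\int_0^\infty\int_{-\frac{b}{\sigma} - \frac{\mu}{2\sigma}}^{-\frac{b}{\sigma} + \frac{\mu}{2\sigma}}\frac{q(q-1)|\frac{\mu}{2\sigma}|^{4-2q}}{(\chi^*q(q-1))^3}\phi(z)\,dz\,dF(b) \le \frac{\mu^{5-2q}\phi(\mu/(2\sigma))}{\sigma^{7-3q}2^{4-2q}(\chi^*)^3q^2(q-1)^2} \to 0, \quad \text{as } \sigma \to 0, \tag{114}
\]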
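where the last step is due to the fact that $\lim_{\sigma\to 0}\chi^*(\sigma)\cdot e^{\tilde{c}/\sigma} = +\infty$. For the calculation of $I_4$, first observe that
\[
\frac{1(z \notin [-\frac{b}{\sigma} - \frac{\mu}{2\sigma}, -\frac{b}{\sigma} + \frac{\mu}{2\sigma}])\cdot q(q-1)|\eta_q(b + \sigma z;\chi^*\sigma^{2-q})|^{4-2q}}{(|\eta_q(b + \sigma z;\chi^*\sigma^{2-q})|^{2-q} + \chi^*\sigma^{2-q}q(q-1))^3} \le \frac{q(q-1)}{|\eta_q(\mu/2;\chi^*\sigma^{2-q})|^{2-q}} \le \frac{q(q-1)}{|\eta_q(\mu/2;1)|^{2-q}}.
\]
Hence, applying DCT gives
\[
\lim_{\sigma\to 0}\frac{I_4}{\sigma^{2-q}} = q(q-1)E|B|^{q-2}. \tag{115}
\]
Combining (113), (114), and (115) we conclude that
\[
\lim_{\sigma\to 0}\frac{I_1}{\sigma^{2-q}} = q(q-1)E|B|^{q-2}. \tag{116}
\]
Using similar arguments, we can conclude that
\[
\lim_{\sigma\to 0}\frac{I_2}{\sigma^{4-2q}} = q^2(q-1)E|B|^{2q-4}. \tag{117}
\]
Finally, combining (112), (116) and (117), we have
\[
\lim_{\sigma\to 0}\frac{\chi^*(\sigma)}{\sigma^q} = \lim_{\sigma\to 0}\frac{I_1/\sigma^{2-q}}{H_2/\sigma^{2-2q} - I_2/\sigma^{2-2q}} = \frac{(q-1)E|B|^{q-2}}{qE|B|^{2q-2}}.
\]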
Now since we know the exact order of $\chi^*(\sigma)$, (111) shows the exact order of $R(\chi^*(\sigma), \sigma)$.
Characterizing AMSE for $1 < q < 2$: Based on the results of Lemma 26, calculating AMSE can be done in a similar way as in the proof of Theorem 2 (Section V-F4). We hence do not repeat it here.
2) Proof for the case $q = 1$:
Lemma 27. Suppose that $P_G(|B| > \mu) = 1$ and $E_G(|B|^2) < \infty$, then we have for $q = 1$,
\[
\chi^*(\sigma) = O(\phi(\mu/\sigma)), \qquad 0 > R(\chi^*(\sigma), \sigma) - 1 \gtrsim -\Theta(\phi^2(\mu/\sigma)).
\]
Proof: The first key observation is that Lemma 19 holds in general settings and hence it includes the case $\epsilon = 1$. According to that result, $\chi^*(\sigma) \to \chi^{**}$ as $\sigma \to 0$, where $\chi^{**} \triangleq \arg\min_{\chi\ge 0}(1-\epsilon)E(\eta_1(Z;\chi))^2 + \epsilon(1 + \chi^2)$. This implies that for $\epsilon = 1$, $\chi^{**} = 0$. Hence $\chi^*(\sigma) \to 0$, as $\sigma \to 0$. Note that
\[
\begin{aligned}
R(\chi, \sigma) - 1 &= E(\eta_1(B/\sigma + Z;\chi) - B/\sigma - Z)^2 + 2E(Z(\eta_1(B/\sigma + Z;\chi) - B/\sigma - Z)) \\
&\overset{(a)}{=} E(\eta_1(B/\sigma + Z;\chi) - B/\sigma - Z)^2 + 2E(\partial_1\eta_1(B/\sigma + Z;\chi) - 1) \\
&\overset{(b)}{\le} \chi^2 - 2E\int_{-B/\sigma - \chi}^{-B/\sigma + \chi}\phi(z)\,dz \overset{(c)}{=} \chi^2 - 4\chi E\phi(-B/\sigma + \alpha\chi).
\end{aligned}
\]
To obtain (a) we used Lemma 8; note that $\eta_1(u;\chi)$ is a weakly differentiable function of $u$. Inequality (b) holds since $|\eta_1(u;\chi) - u| \le \chi$. Finally, Equality (c) is the result of the mean value theorem and hence $|\alpha| \le 1$ depends on $B$. From the above inequality, it is straightforward to verify that if we choose $\chi(\sigma) = 3e^{-1}E\phi(\sqrt{2}B/\sigma)$, then $R(\chi^*(\sigma), \sigma) \le R(\chi(\sigma), \sigma) < 1$ for small enough $\sigma$. This means the optimal threshold $\chi^*(\sigma)$ is a non-zero finite value. Hence, as a solution to $\frac{\partial R(\chi, \sigma)}{\partial\chi} = 0$, $\chi^*$ satisfies
\[
\chi^* = \frac{E\phi(\chi^* - B/\sigma) + E\phi(\chi^* + B/\sigma)}{E1(|Z + B/\sigma| \ge \chi^*)} \le \frac{2E\phi(\chi^* - |B|/\sigma)}{E1(|Z + B/\sigma| \ge \chi^*)} \le \frac{2\phi(\chi^* - \mu/\sigma)}{E1(|Z + B/\sigma| \ge \chi^*)}, \tag{118}
\]
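where the last inequality holds for small values of $\sigma$. Since $E1(|Z + B/\sigma| \ge \chi^*) \to 1$, as $\sigma \to 0$ and $\phi(\chi^* - \mu/\sigma) \le \phi(\mu/(\sqrt{2}\sigma))e^{(\chi^*)^2/2}$, from (118) we can first conclude $\chi^*(\sigma) = o(\sigma)$, which in turn (use (118) again) implies $\chi^*(\sigma) = O(\phi(\mu/\sigma))$. We now turn to analyzing $R(\chi^*, \sigma)$:
\[
\begin{aligned}
R(\chi^*, \sigma) - 1 &= E(\eta_1(B/\sigma + Z;\chi^*) - B/\sigma - Z)^2 + 2E(\partial_1\eta_1(B/\sigma + Z;\chi^*) - 1) \\
&\ge -2E1(|B/\sigma + Z| \le \chi^*) \ge -2\int_{-\mu/\sigma - \chi^*}^{-\mu/\sigma + \chi^*}\phi(z)\,dz \ge -4\chi^*\phi(\mu/\sigma - \chi^*) \\
&\overset{(d)}{\ge} \frac{-8\phi^2(\chi^* - \mu/\sigma)}{E1(|Z + B/\sigma| \ge \chi^*)} \overset{(e)}{\sim} -8\phi^2(\mu/\sigma),
\end{aligned}
\]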
where (d) is due to (118) and (e) holds because $E1(|Z + B/\sigma| \ge \chi^*) \to 1$ and $\chi^*(\sigma) = o(\sigma)$.
Characterizing AMSE for $q = 1$: Based on the results of Lemma 27, calculating AMSE can be done in a similar way as in the proof of Theorem 2 (Section V-F4). We hence skip it here.
L. Proof of Theorem 8
Similar to the proof of Theorem 7, we consider two cases, i.e. $1 < q < 2$ and $q = 1$, and prove them separately. The proof for the case $q = 2$ is straightforward, hence skipped here.
1) Proof for the case $1 < q < 2$: Before we start the proof of our main result, we mention a simple lemma that will be used multiple times in our proof.
Lemma 28. Let $T(\sigma)$ and $\chi(\sigma)$ be two nonnegative sequences with the property: $\chi(\sigma)T^{q-2}(\sigma) \to 0$, as $\sigma \to 0$. Then,
\[
\lim_{\sigma\to 0}\frac{\eta_q(T(\sigma);\chi(\sigma))}{T(\sigma)} = 1.
\]
Proof: The proof is a simple application of the scale invariance property of $\eta_q$, i.e., Lemma 4 part (v). We have
\[
\lim_{\sigma\to 0}\frac{\eta_q(T(\sigma);\chi(\sigma))}{T(\sigma)} = \lim_{\sigma\to 0}\eta_q(1;\chi(\sigma)T^{q-2}(\sigma)) = 1, \tag{119}
\]
where the last step is the result of Lemma 4 part (iii).
Our first goal is to show that when $\chi = C\sigma^q$, then $\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2}$ is a negative constant by choosing an appropriate $C$. However, since this proof is long, we break it into several steps. These steps are summarized in Lemmas 29, 30, and 31. Then in Lemma 32 we employ these three results to show that if $\chi = C\sigma^q$, then
\[
\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2} = C^2q^2E|B|^{2q-2} - 2Cq(q-1)E|B|^{q-2}.
\]
Let $F(b)$ denote the distribution function of $|B|$.
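Lemma 28 and the scale invariance behind it can be checked numerically. The sketch below is a minimal illustration under assumptions: it takes $\eta_q(u;\chi)$ to be the proximal operator of $\chi|\cdot|^q$, which for $1 < q \le 2$ is the solution $z$ of $z + \chi q|z|^{q-1}\mathrm{sign}(z) = u$ (consistent with the identity quoted from Lemma 4 part (i)), and solves that equation by bisection. The function name `eta_q`, the values $q = 1.5$, $C = 1$, and the choice $T(\sigma) = 1/\sigma$ are hypothetical choices for display.

```python
import math


def eta_q(u, chi, q, iters=200):
    """Proximal operator of chi * |z|^q for 1 < q <= 2, found by bisection.

    For u > 0 the map z -> z + chi * q * z^(q-1) is increasing on [0, u],
    so the root of z + chi * q * z^(q-1) = u is bracketed by 0 and u.
    """
    if u == 0.0 or chi == 0.0:
        return u
    target = abs(u)
    lo, hi = 0.0, target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + chi * q * mid ** (q - 1.0) < target:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), u)


q, C = 1.5, 1.0                      # illustrative values (assumptions)
ratios = []
for sigma in (1.0, 0.1, 0.01, 0.001):
    T = 1.0 / sigma                  # a sequence with chi * T^(q-2) = C * sigma^2 -> 0
    chi = C * sigma ** q
    ratios.append(eta_q(T, chi, q) / T)
    print(f"sigma = {sigma:<6}  chi * T^(q-2) = {chi * T ** (q - 2):.2e}  eta_q(T; chi) / T = {ratios[-1]:.6f}")

# Scale invariance: eta_q(T; chi) = T * eta_q(1; chi * T^(q-2)).
T, chi = 100.0, C * 0.01 ** q
print(eta_q(T, chi, q), T * eta_q(1.0, chi * T ** (q - 2.0), q))
```

With these choices the ratio should increase toward 1 as $\sigma$ decreases, and the two numbers on the last line should agree up to the bisection tolerance.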
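Lemma 29. Suppose that $P_G(|B| < \sigma) = O(\sigma)$, $E_G(B^2) < \infty$ and $\chi(\sigma) = C\sigma^q$, where $C$ is a fixed number. Then for $1 < q < 2$ we have
\[
\sigma^{q-2}\int_0^\infty\int_{-\frac{b}{\sigma} - \alpha}^{-\frac{b}{\sigma} + \alpha}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \to 0,
\]
as $\sigma \to 0$. Note that $\alpha$ is an arbitrary positive constant.
Proof: The main idea of the proof is to break this integral into several pieces and prove that each piece converges to zero. Based on the value of $q$, we consider the following intervals. First find a non-negative integer $m^*$ such that $q \in [2 - 0.5^{\frac{1}{m^*+1}}, 2 - 0.5^{\frac{1}{m^*}})$. Note for $q > 3/2$, $m^*$ is zero. Denote $S_m^n(l) = l^m + l^{m+1} + \cdots + l^n$ ($m \le n$). Now we define the following intervals:
\[
\begin{aligned}
I_0 &= \left[-\frac{b}{\sigma} - \frac{\sigma}{\log(\frac{1}{\sigma})},\ -\frac{b}{\sigma} + \frac{\sigma}{\log(\frac{1}{\sigma})}\right], \\
I_i &= \left[-\frac{b}{\sigma} - \frac{\sigma^{2(2-q)^i - 1}}{(\log(1/\sigma))^{S_0^i(2-q)}},\ -\frac{b}{\sigma} + \frac{\sigma^{2(2-q)^i - 1}}{(\log(1/\sigma))^{S_0^i(2-q)}}\right], \quad 1 \le i \le m^*, \\
I_{m^*+1} &= \left[-\frac{b}{\sigma} - \frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}},\ -\frac{b}{\sigma} + \frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}}\right], \\
I_{m^*+2} &= \left[-\frac{b}{\sigma} - \alpha,\ -\frac{b}{\sigma} + \alpha\right].
\end{aligned} \tag{120}
\]
Using these intervals we have
\[
\begin{aligned}
&\sigma^{q-2}\int_0^\infty\int_{-\frac{b}{\sigma} - \alpha}^{-\frac{b}{\sigma} + \alpha}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
&\quad= \sigma^{q-2}\int_0^\infty\int_{I_0}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
&\qquad+ \sigma^{q-2}\sum_{i=1}^{m^*+2}\int_0^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b).
\end{aligned} \tag{121}
\]
Define
\[
\begin{aligned}
P_0 &\triangleq \sigma^{q-2}\int_0^\infty\int_{I_0}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b), \\
P_i &\triangleq \sigma^{q-2}\int_0^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b), \quad 1 \le i \le m^* + 2.
\end{aligned} \tag{122}
\]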
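Our goal is to show that $P_i \to 0$ as $\sigma \to 0$. Since these intervals have different forms, we consider four different cases (i) $i = 0$, (ii) $1 \le i \le m^*$, (iii) $i = m^* + 1$, and (iv) $i = m^* + 2$ and for each case we show that $P_i \to 0$. Let $|I_0|$ denote the Lebesgue measure of the interval $I_0$. For the first term, we have
\[
\begin{aligned}
P_0 &= \sigma^{q-2}\int_0^\infty\int_{I_0}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b)
\le \sigma^{q-2}\int_0^\infty\int_{I_0}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b) \\
&\le \sigma^{q-2}\int_0^{\tilde{C}_0\sigma\sqrt{\log(1/\sigma)}}\int_{I_0}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b) + \sigma^{q-2}\int_{\tilde{C}_0\sigma\sqrt{\log(1/\sigma)}}^\infty\int_{I_0}\frac{1}{\chi q(q-1)}\phi(z)\,dz\,dF(b) \\
&\le \frac{\sigma^{q-2}\phi(0)|I_0|P(|B| \le \tilde{C}_0\sigma\sqrt{\log(1/\sigma)})}{\chi q(q-1)} + \frac{\sigma^{q-2}\phi\big(\tilde{C}_0\sqrt{\log(1/\sigma)} - \frac{\sigma}{\log(1/\sigma)}\big)|I_0|}{\chi q(q-1)} \\
&\le \frac{\sigma^{q-2}\phi(0)\sigma^2\sqrt{\log(1/\sigma)}\,O(1)}{C\sigma^qq(q-1)\log(1/\sigma)} + \frac{\sigma^{q-2}\phi\big(\tilde{C}_0\sqrt{\log(1/\sigma)} - \frac{\sigma}{\log(1/\sigma)}\big)|I_0|}{\chi q(q-1)} \\
&\le O(1)\frac{1}{\sqrt{\log(1/\sigma)}} + O(1)\frac{\sigma^{q-2}\phi\big(\frac{\tilde{C}_0}{2}\sqrt{\log(1/\sigma)}\big)|I_0|}{\chi} \to 0.
\end{aligned} \tag{123}
\]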
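Note that to obtain the last statement we can choose $\tilde{C}_0 = 4$. Now we consider an arbitrary $0 < i \le m^*$ and show that $P_i \to 0$.
\[
\begin{aligned}
P_i ={}& \sigma^{q-2}\int_0^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\le{}& \sigma^{q-2}\int_0^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
={}& \sigma^{q-2}\int_0^{\tilde{C}_i\sigma\sqrt{\log(1/\sigma)}}\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
&+ \sigma^{q-2}\int_{\tilde{C}_i\sigma\sqrt{\log(1/\sigma)}}^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
\le{}& \frac{\sigma^{q-2}\phi(0)|I_i|P(|B| < \tilde{C}_i\sigma\sqrt{\log(1/\sigma)})}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{i-1} - 1}}{(\log(1/\sigma))^{S_0^{i-1}(2-q)}};\chi\Big)}
+ \frac{\sigma^{q-2}\phi\Big(\tilde{C}_i\sqrt{\log(1/\sigma)} - \frac{\sigma^{2(2-q)^i - 1}}{(\log(1/\sigma))^{S_0^i(2-q)}}\Big)|I_i|}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{i-1} - 1}}{(\log(1/\sigma))^{S_0^{i-1}(2-q)}};\chi\Big)}.
\end{aligned} \tag{124}
\]
Note that according to Lemma 28, we have
\[
\lim_{\sigma\to 0}\frac{\frac{\sigma^{2(2-q)^i - (2-q)}}{(\log(1/\sigma))^{S_1^i(2-q)}}}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{i-1} - 1}}{(\log(1/\sigma))^{S_0^{i-1}(2-q)}};\chi\Big)} = 1. \tag{125}
\]
It is then straightforward to confirm that if $\tilde{C}_i$ is large enough, the second term in (124) goes to zero. Hence, we focus on the first term:
\[
\begin{aligned}
\lim_{\sigma\to 0}\frac{\sigma^{q-2}\phi(0)|I_i|P(|B| < \tilde{C}_i\sigma\sqrt{\log(1/\sigma)})}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{i-1} - 1}}{(\log(1/\sigma))^{S_0^{i-1}(2-q)}};\chi\Big)}
&= \lim_{\sigma\to 0}\frac{\sigma^{q-2}\phi(0)\frac{\sigma^{2(2-q)^i - 1}}{(\log(1/\sigma))^{S_0^i(2-q)}}\sigma\sqrt{\log(1/\sigma)}\,O(1)}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{i-1} - 1}}{(\log(1/\sigma))^{S_0^{i-1}(2-q)}};\chi\Big)} \\
&\overset{(a)}{=} O(1)\cdot\lim_{\sigma\to 0}\frac{\sigma^{q-2}\frac{\sigma^{2(2-q)^i - 1}}{(\log(1/\sigma))^{S_0^i(2-q)}}\sigma\sqrt{\log(1/\sigma)}}{\frac{\sigma^{2(2-q)^i - (2-q)}}{(\log(1/\sigma))^{S_1^i(2-q)}}}
= \lim_{\sigma\to 0}\frac{O(1)}{\sqrt{\log(1/\sigma)}} = 0,
\end{aligned} \tag{126}
\]
where we have used (125) in (a). By combining (124) and (126) we conclude that
\[
\lim_{\sigma\to 0}\sum_{i=1}^{m^*}P_i = \lim_{\sigma\to 0}\sum_{i=1}^{m^*}\sigma^{q-2}\int_0^\infty\int_{I_i\setminus I_{i-1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) = 0. \tag{127}
\]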
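Our next step is to prove that $P_{m^*+1} \to 0$.
\[
\begin{aligned}
P_{m^*+1} ={}& \sigma^{q-2}\int_0^\infty\int_{I_{m^*+1}\setminus I_{m^*}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\le{}& \sigma^{q-2}\int_0^\infty\int_{I_{m^*+1}\setminus I_{m^*}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
={}& \sigma^{q-2}\int_0^{\tilde{C}_{m^*+1}\sigma\sqrt{\log(1/\sigma)}}\int_{I_{m^*+1}\setminus I_{m^*}}\frac{\phi(z)}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\,dz\,dF(b) \\
&+ \sigma^{q-2}\int_{\tilde{C}_{m^*+1}\sigma\sqrt{\log(1/\sigma)}}^\infty\int_{I_{m^*+1}\setminus I_{m^*}}\frac{\phi(z)}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\,dz\,dF(b) \\
\le{}& \frac{\sigma^{q-2}\phi(0)|I_{m^*+1}|P(|B| < \tilde{C}_{m^*+1}\sigma\sqrt{\log(1/\sigma)})}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{m^*} - 1}}{(\log(1/\sigma))^{S_0^{m^*}(2-q)}};\chi\Big)}
+ \frac{\sigma^{q-2}\phi\Big(\tilde{C}_{m^*+1}\sqrt{\log(1/\sigma)} - \frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}}\Big)|I_{m^*+1}|}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{m^*} - 1}}{(\log(1/\sigma))^{S_0^{m^*}(2-q)}};\chi\Big)}.
\end{aligned} \tag{128}
\]
It is straightforward to see that if $\tilde{C}_{m^*+1}$ is large enough, then the second term in (128) goes to zero. Hence, we focus on the first term:
\[
\begin{aligned}
\lim_{\sigma\to 0}\frac{\sigma^{q-2}\phi(0)|I_{m^*+1}|P(|B| < \tilde{C}_{m^*+1}\sigma\sqrt{\log(1/\sigma)})}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{m^*} - 1}}{(\log(1/\sigma))^{S_0^{m^*}(2-q)}};\chi\Big)}
&\overset{(a)}{=} \lim_{\sigma\to 0}\frac{\sigma^{q-2}\phi(0)\frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}}\sigma\sqrt{\log(1/\sigma)}\,O(1)}{\eta_q^{2-q}\Big(\frac{\sigma^{2(2-q)^{m^*} - 1}}{(\log(1/\sigma))^{S_0^{m^*}(2-q)}};\chi\Big)} \\
&\overset{(b)}{=} \lim_{\sigma\to 0}\frac{\sigma^{q-2}\phi(0)\frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}}\sigma\sqrt{\log(1/\sigma)}\,O(1)}{\frac{\sigma^{2(2-q)^{m^*+1} - (2-q)}}{(\log(1/\sigma))^{S_1^{m^*+1}(2-q)}}} \\
&= \lim_{\sigma\to 0}\frac{O(1)\sigma^{-2(2-q)^{m^*+1} + 1}}{\sqrt{\log(1/\sigma)}} \overset{(c)}{=} 0.
\end{aligned} \tag{129}
\]
To obtain Equality (a), we used the fact $P(|B| \le \sigma) = O(\sigma)$. To obtain (b), we have used Lemma 28. Finally, Equality (c) is due to the condition we imposed on $m^*$ that ensures $1 - 2(2-q)^{m^*+1} \ge 0$. Combining (128) and (129) proves
\[
\lim_{\sigma\to 0}P_{m^*+1} = 0. \tag{130}
\]
The last remaining term of (121) is $P_{m^*+2}$. To prove $P_{m^*+2} \to 0$, we have
\[
\begin{aligned}
P_{m^*+2} ={}& \sigma^{q-2}\int_0^\infty\int_{I_{m^*+2}\setminus I_{m^*+1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\le{}& \sigma^{q-2}\int_0^\infty\int_{I_{m^*+2}\setminus I_{m^*+1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
={}& \sigma^{q-2}\int_0^{\tilde{C}_{m^*+2}\sigma\sqrt{\log(1/\sigma)}}\int_{I_{m^*+2}\setminus I_{m^*+1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
&+ \sigma^{q-2}\int_{\tilde{C}_{m^*+2}\sigma\sqrt{\log(1/\sigma)}}^\infty\int_{I_{m^*+2}\setminus I_{m^*+1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b).
\end{aligned} \tag{131}
\]
By using the same strategy of calculations as we did for $P_i$ ($i < m^* + 2$), it is not hard to see that the second integral above goes to zero as $\sigma \to 0$, when $\tilde{C}_{m^*+2}$ is chosen large enough. Hence, we focus on the first integral. We have
\[
\begin{aligned}
&\sigma^{q-2}\int_0^{\tilde{C}_{m^*+2}\sigma\sqrt{\log(1/\sigma)}}\int_{I_{m^*+2}\setminus I_{m^*+1}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q}}\phi(z)\,dz\,dF(b) \\
&\quad\le \frac{\sigma^{q-2}\phi(0)2\alpha P(|B| \le \tilde{C}_{m^*+2}\sigma\sqrt{\log(1/\sigma)})}{\eta_q^{2-q}\Big(\frac{1}{(\log(1/\sigma))^{S_0^{m^*+1}(2-q)}};\chi\Big)}
\overset{(d)}{=} O(1)\sigma^{q-1}\sqrt{\log(1/\sigma)}\,(\log(1/\sigma))^{S_1^{m^*+2}(2-q)} \to 0,
\end{aligned} \tag{132}
\]
where Equality (d) is due to Lemma 28. Combining (131) and (132) proves that
\[
\lim_{\sigma\to 0}P_{m^*+2} = 0. \tag{133}
\]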
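Finally, combining (121), (123), (127), (130), and (133) finishes our proof.
Define
\[
I^\gamma \triangleq \left[-\frac{b}{\sigma} - \frac{\alpha}{\sigma^{1-\gamma}},\ -\frac{b}{\sigma} + \frac{\alpha}{\sigma^{1-\gamma}}\right]. \tag{134}
\]
In Lemma 29 we proved that:
\[
\sigma^{q-2}\int_0^\infty\int_{-\frac{b}{\sigma} - \alpha}^{-\frac{b}{\sigma} + \alpha}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \to 0.
\]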
In the next lemma, we would like to extend this result and show that in fact, Z ∞Z 1 σ q−2 φ(z)dzdF (b) → 0. 2−q + χq(q − 1) |η (b/σ + z; χ)| γ q 0 I Lemma 30. Suppose that PG (|B| < σ) = O(σ), EG (B 2 ) < ∞ and χ(σ) = Cσ q , where C is a fixed positive number. Then for 1 < q < 2 and any fixed 0 < γ < 0, we have Z ∞Z 1 φ(z)dzdF (b) → 0, σ q−2 2−q + χq(q − 1) |η (b/σ + z; χ)| γ q I \Im∗ +2 0 as σ → 0. Note that α is an arbitrary positive constant. Proof: As in the proof of Lemma 29, we break the integral into smaller subintervals and prove each one goes to zero. Consider the following intervals: b α b α Ji = − − q−1 i , − + q−1 i , σ σ 1+ S0 (2−q) σ σ 1+ S0 (2−q) where > 0 is an arbitrarily small number and i is an arbitrary natural number. Our goal is to show that the following integrals go to zero as σ → 0: Z ∞Z 1 q−2 φ(z)dzdF (b) → 0, Q−1 , σ 2−q + χq(q − 1) |η (b/σ + z; χ)| q 0 J \I ∗ Z ∞ Z 0 m +2 1 φ(z)dzdF (b) → 0, Qi , σ q−2 2−q + χq(q − 1) |η (b/σ + z; χ)| q 0 Ji+1 \Ji q−1
i
where i is an arbitrary natural number. Define σ ˜i , α1 σ 1+ (1+(2−q)+...+(2−q) ) . Then we have Z ∞Z 1 Q−1 = σ q−2 φ(z)dzdF (b) 2−q + χq(q − 1) 0 J0 \Im∗ +2 |ηq (b/σ + z; χ)| Z ∞Z 1 ≤ σ q−2 φ(z)dzdF (b) 2−q 0 J0 \Im∗ +2 |ηq (b/σ + z; χ)| Z ∞Z (a) 1 ≤ σ q−2 φ(z)dzdF (b) |η (α; χ)|2−q q 0 J0 \Im∗ +2 Z σ˜σ log(1/σ) Z 0 1 = σ q−2 φ(z)dzdF (b) 2−q 0 J0 \Im∗ +2 |ηq (α; χ)| Z Z ∞ 1 +σ q−2 φ(z)dzdF (b) σ |η (α; χ)|2−q q ∗ log(1/σ) J \I 0 m +2 σ ˜0 Z σ˜σ log(1/σ) Z ∞ (b) φ( log(1/σ) − σ˜10 )|J0 | 0 1 σ ˜0 q−2 q−2 ≤ σ dF (b) + σ dF (b). σ |ηq (α; χ)|2−q |ηq (α; χ)|2−q 0 σ ˜ log(1/σ) 0
Note that (a) and (b) are obtained by the similar arguments in the proof of Lemma 29, see the derivations in (124) for example. It is straightforward to notice that the second term above converges to zero. Hence we focus on the first term. Z σ˜σ log(1/σ) 0 1 σ q−2 σ log(1/σ) q−2 σ dF (b) ≤ O(1) = O(1)σ (q−1)( +1 ) log(1/σ) → 0. 2−q |η (α; χ)| σ ˜ q 0 0
Now we discuss an arbitrary Qi (i ≥ 0): Z ∞Z 1 q−2 φ(z)dzdF (b) Qi = σ 2−q + χq(q − 1) |η (b/σ + z; χ)| q 0 J \J Z ∞ Z i+1 i 1 φ(z)dzdF (b) ≤ σ q−2 |η (b/σ + z; χ)|2−q q 0 Ji+1 \Ji Z σ˜ σ log(1/σ) Z i+1 1 = σ q−2 φ(z)dzdF (b) 2−q 0 Ji+1 \Ji |ηq (b/σ + z; χ)| Z ∞ Z 1 +σ q−2 φ(z)dzdF (b) 2−q σ Ji+1 \Ji |ηq (b/σ + z; χ)| σ ˜ i+1 log(1/σ) Z σ˜ σ log(1/σ) Z i+1 1 φ(z)dzdF (b) ≤ σ q−2 1 2−q ; |η ( q σ 0 Ji+1 \Ji ˜i χ)| Z ∞ Z 1 q−2 +σ φ(z)dzdF (b). 1 2−q σ log(1/σ) Ji+1 \Ji |ηq ( σ ˜i ; χ)| σ ˜
(135)
i+1
It is again straightforward to show that the second integral in (135) converges to zero as σ → 0. We then study the first integral. Z σ˜ σ log(1/σ) Z i+1 1 q−2 lim σ φ(z)dzdF (b) 1 2−q σ→0 0 Ji+1 \Ji |ηq ( σ ˜i ; χ)| σ q−2 σ ≤ lim P |B| ≤ log(1/σ) σ→0 |ηq ( 1 ; χ)|2−q σ ˜i+1 σ ˜i ≤
σ q−1 log(1/σ) σ q−1 log(1/σ)˜ σi2−q O(1) = lim O(1) σ→0 |ηq ( 1 ; χ)|2−q σ σ→0 σ ˜i+1 ˜i+1 σ ˜i lim
q−1
= ≤
i
σ q−1 log(1/σ)(σ 1+ (1+(2−q)+...+(2−q) ) )2−q
lim
q−1
i+1 )
σ 1+ (1+(2−q)+...+(2−q) (q−1) lim (σ 1+ ) log(1/σ)O(1) = 0.
σ→0
O(1) (136)
σ→0
1 2 Now note that as i goes to infinity, the exponent of σ in interval Ji goes to q−1 1+ (1 + (2 − q) + (2 − q) + . . .) = 1+ . So, γ by choosing small enough and enough number of intervals, we can get the number 1 − γ from I arbitrarily close to 1. In the last two lemmas, we have been able to prove that for χ = Cσ q , Z ∞Z 1 φ(z)dzdF (b) → 0. σ q−2 2−q + χq(q − 1) |η (b/σ + z; χ)| γ q 0 I
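Next step is to extend this result and show that
\[
\sigma^{q-2}\int_0^\infty\int_{\mathbb{R}}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) = 0.
\]
Before we prove this result, we mention a simple lemma that will be applied several times in our proofs.
Lemma 31. For $1 < q < 2$ and $\alpha, \chi > 0$ we have
\[
\frac{1}{|\eta_q(\alpha;\chi)|^{2-q} + \chi q(q-1)} \le \frac{2}{|\alpha|^{2-q}(q-1)}.
\]
Proof: We consider two cases:
1) $\chi \le \frac{\alpha^{2-q}}{2q}$: Since we know $\eta_q(\alpha;\chi) \le \alpha$, in this case we have
\[
\eta_q(\alpha;\chi) = \alpha - \chi q\eta_q^{q-1}(\alpha;\chi) \ge \alpha - \chi q\alpha^{q-1} \ge \alpha - \frac{\alpha^{2-q}}{2q}q\alpha^{q-1} = \frac{\alpha}{2}.
\]
Hence,
\[
\frac{1}{|\eta_q(\alpha;\chi)|^{2-q} + \chi q(q-1)} \le \frac{1}{|\eta_q(\alpha;\chi)|^{2-q}} \le \frac{2^{2-q}}{\alpha^{2-q}}.
\]
2) $\chi \ge \frac{\alpha^{2-q}}{2q}$:
\[
\frac{1}{|\eta_q(\alpha;\chi)|^{2-q} + \chi q(q-1)} \le \frac{1}{\chi q(q-1)} \le \frac{2}{(q-1)\alpha^{2-q}}.
\]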
This completes our proof. Now we can consider one of the main results of this section.
Lemma 32. Suppose that $P_G(|B| \le \sigma) = O(\sigma)$, $E_G(|B|^2) < \infty$ and $\chi(\sigma) = C\sigma^q$, where $C > 0$ is a fixed positive number, then we have for $1 < q < 2$
\[
\lim_{\sigma\to 0}\frac{R(\chi, \sigma) - 1}{\sigma^2} = C^2q^2E|B|^{2q-2} - 2Cq(q-1)E|B|^{q-2}.
\]
Proof: We use the same roadmap as in the proof of Lemma 25. Recall that when $\epsilon = 1$, we have
\[
R(\chi, \sigma) - 1 = \chi^2q^2E|\eta_q(B/\sigma + Z;\chi)|^{2q-2} - 2\chi q(q-1)E\frac{|\eta_q(B/\sigma + Z;\chi)|^{q-2}}{1 + \chi q(q-1)|\eta_q(B/\sigma + Z;\chi)|^{q-2}} \triangleq S_1 + S_2. \tag{137}
\]
Similar to the proof of Lemma 25, we have
\[
\lim_{\sigma\to 0}\frac{S_1}{\sigma^2} = C^2q^2E|B|^{2q-2}. \tag{138}
\]
Hence, we focus on analyzing $S_2$. First note that restricting $|B|$ to be bounded away from 0 makes it possible to use the same arguments as in the proof of Lemma 25 to show
\[
\lim_{\sigma\to 0}E\frac{1(|B| > 1)}{|\eta_q(|B| + \sigma Z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)} = E|B|^{q-2}1(|B| > 1). \tag{139}
\]
Hence, without loss of generality, we assume $|B| \le 1$. Let $F(b)$ denote the distribution function of $|B|$, then we have
\[
\begin{aligned}
\frac{-S_2}{\sigma^2} ={}& 2Cq(q-1)E\frac{1}{|\eta_q(|B| + \sigma Z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)} \\
={}& 2Cq(q-1)\int_0^1\int_{-b/\sigma - b^c/(2\sigma)}^{-b/\sigma + b^c/(2\sigma)}\frac{1}{|\eta_q(b + \sigma z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)}\phi(z)\,dz\,dF(b) \\
&+ 2Cq(q-1)\int_0^1\int_{\mathbb{R}\setminus[-b/\sigma - b^c/(2\sigma),\,-b/\sigma + b^c/(2\sigma)]}\frac{1}{|\eta_q(b + \sigma z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)}\phi(z)\,dz\,dF(b) \\
\triangleq{}& 2Cq(q-1)(T_1 + T_2),
\end{aligned}
\]
where $c > 1$ is a constant we specify later. We first analyze $T_2$. Note that
\[
T_2 = E\frac{1(|B + \sigma Z| \ge |B|^c/2)}{|\eta_q(B + \sigma Z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)},
\]
and
\[
\frac{1(|B + \sigma Z| \ge |B|^c/2)}{|\eta_q(B + \sigma Z;C\sigma^2)|^{2-q} + C\sigma^2q(q-1)} \overset{(a)}{\le} \frac{2\cdot 1(|B + \sigma Z| \ge |B|^c/2)}{(q-1)|B + \sigma Z|^{2-q}} \le \frac{|B|^{c(q-2)}}{2^{q-3}(q-1)},
\]
where (a) is due to Lemma 31. For any $1 < q < 2$, it is straightforward to verify that $E|B|^{c(q-2)} < \infty$ if $c$ is chosen close enough to 1. So applying DCT gives
\[
\lim_{\sigma\to 0}T_2 = E1(|B| \ge |B|^c/2)|B|^{q-2} = E|B|^{q-2}1(|B| \le 1). \tag{140}
\]
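We now turn to bounding $T_1$. According to Lemmas 29 and 30, we know
\[
\sigma^{q-2}\int_0^\infty\int_{I^\gamma}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \to 0,
\]
where $I^\gamma = [-\frac{b}{\sigma} - \frac{\alpha}{\sigma^{1-\gamma}}, -\frac{b}{\sigma} + \frac{\alpha}{\sigma^{1-\gamma}}]$. Define $I_c^\gamma = [-\frac{b}{\sigma} - \frac{b^c}{\sigma^{1-\gamma}}, -\frac{b}{\sigma} + \frac{b^c}{\sigma^{1-\gamma}}]$ and $\tilde{I}_c = [-\frac{b}{\sigma} - \frac{b^c}{2\sigma}, -\frac{b}{\sigma} + \frac{b^c}{2\sigma}]$. Since $0 \le b \le 1$, we get $I_c^\gamma \subseteq I^\gamma$ for any given $\alpha > 1$. Therefore,
\[
T_3 \triangleq \sigma^{q-2}\int_0^1\int_{I_c^\gamma}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \to 0. \tag{141}
\]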
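Then the only part of $T_1$ that may be non-zero is
\[
\begin{aligned}
T_1 - T_3 ={}& \sigma^{q-2}\int_0^1\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{1}{|\eta_q(b/\sigma + z;\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\le{}& \sigma^{q-2}\int_0^1\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{1}{|\eta_q(b^c/\sigma^{1-\gamma};\chi)|^{2-q} + \chi q(q-1)}\phi(z)\,dz\,dF(b) \\
\overset{(a)}{\le}{}& \sigma^{q-2+(1-\gamma)(2-q)}\int_0^1\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{2b^{c(q-2)}}{q-1}\phi(z)\,dz\,dF(b) \\
={}& \sigma^{q-2+(1-\gamma)(2-q)}\int_0^{\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}}\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{2b^{c(q-2)}}{q-1}\phi(z)\,dz\,dF(b) \\
&+ \sigma^{q-2+(1-\gamma)(2-q)}\int_{\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}}^1\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{2b^{c(q-2)}}{q-1}\phi(z)\,dz\,dF(b) \\
\triangleq{}& T_4 + T_5,
\end{aligned} \tag{142}
\]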
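where Inequality (a) is the result of Lemma 31. We first bound $T_5$ in the following:
\[
T_5 \le \frac{2\sigma^{q-2+(1-\gamma)(2-q)}}{q-1}\int_{\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}}^1\frac{b^{c(q-1)}}{\sigma}\phi\Big(\frac{b}{2\sigma}\Big)dF(b) \le \frac{2\sigma^{q-3+(1-\gamma)(2-q)}}{q-1}\phi\big(\tilde{C}_f\sqrt{\log(1/\sigma)}/2\big).
\]
It is then straightforward to see that $T_5$ goes to zero by choosing large enough $\tilde{C}_f$. For the remaining term $T_4$, we have
\[
\begin{aligned}
T_4 &= \sigma^{q-2+(1-\gamma)(2-q)}\int_0^{\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}}\int_{\tilde{I}_c\setminus I_c^\gamma}\frac{2b^{c(q-2)}}{q-1}\phi(z)\,dz\,dF(b) \\
&\le \frac{2\sigma^{q-3+(1-\gamma)(2-q)}}{q-1}\int_0^{\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}}b^{c(q-1)}\phi\Big(\frac{b}{2\sigma}\Big)dF(b) \\
&\le \frac{2\sigma^{q-3+(1-\gamma)(2-q)}}{q-1}\big(\tilde{C}_f\sigma\sqrt{\log(1/\sigma)}\big)^{c(q-1)}\phi(0)P\big(|B| \le \tilde{C}_f\sigma\sqrt{\log(1/\sigma)}\big) \\
&\le O(1)\sigma^{c(q-1) - \gamma(2-q)}(\log(1/\sigma))^{(c(q-1)+1)/2} \to 0.
\end{aligned} \tag{143}
\]
To obtain the last statement, we can choose $\gamma$ close enough to zero. Putting the results of (137), (138), (139), (140), (141), (142) and (143) together completes the proof.
As in the other lemmas, our first goal is to characterize $\chi^*(\sigma)$. Towards this goal, we first show that $\chi^*(\sigma)$ cannot be either too large or too small. In particular, in Lemmas 33 and 34, we show that $\chi^*(\sigma) = \Omega(\sigma^q)$ and $\chi^*(\sigma) = O(\sigma^{q-1})$. Then we use this result in Lemma 35 to conclude that $\chi^*(\sigma) = \Theta(\sigma^q)$.
Lemma 33. Suppose $E_G(|B|^2) < \infty$, if $\lim_{\sigma\to 0}\frac{\chi(\sigma)}{\sigma^{q-1}} = \infty$ and $\chi(\sigma) = o(1)$, then $R(\chi, \sigma) \to \infty$, as $\sigma \to 0$.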
Proof: According to (59) we have R(χ, σ)
=
1 + χ2 q 2 E|ηq (B/σ + Z; χ)|2q−2 − 2χq(q − 1)E
,
1 + S1 − S2 .
1 |ηq (B/σ + Z; χ)|2−q + χq(q − 1) (144)
Since $\chi(\sigma) = o(1)$, it is straightforward to apply the dominated convergence theorem and show that
\[
\lim_{\sigma\to0}\frac{S_1}{\chi^2\sigma^{2-2q}} = q^2E|B|^{2q-2}.
\]
Because $\chi^2/\sigma^{2q-2}\to\infty$, $\lim_{\sigma\to0}S_1=\infty$. Also note
\[
S_2 \le 2\chi q(q-1)\cdot\frac{1}{\chi q(q-1)} = 2.
\]
Hence, $R(\chi,\sigma)\to\infty$.

Our next lemma shows that $\chi^*(\sigma)$ cannot be too small.

Lemma 34. Suppose that the same conditions for $|B|$ in Lemma 32 hold, if $\chi(\sigma) = o(\sigma^q)$, then
\[
\frac{R(\chi,\sigma)-1}{\sigma^2}\to 0, \tag{145}
\]
as σ → 0. Proof: Adopting the same notations from the proof of Lemma 33, first note that χ2 σ 2−2q q 2 E|ηq (B + σZ; χσ 2−q )|2q−2 S1 = lim = 0, 2 σ→0 σ→0 σ σ2 where to obtain the last equality we have employedh the dominated convergencei theorem. Now we study the behavior of S2 . α α Recall that in (120) and (134), we defined I0 = − σb − log(σ 1 ) , − σb + log(σ 1 ) and I γ , − σb − σ1−γ , − σb + σ1−γ . It is σ σ straightforward to use the same calculations as in the proof of Lemma 29 (see the derivations in (123)) to show that Z ∞Z Z ∞Z φ(z) φ(z) χ χ dzdF (b) ≤ 2 dzdF (b) → 0. 2 2−q σ 0 + χq(q − 1) σ 0 I0 |ηq (b/σ + z; χ)| I0 χq(q − 1) lim
Moreover, note that χ(σ) < Cσ q for small enough σ. Therefore, according to Lemma 7 part (v) we have |ηq (b/σ + z; χ)| ≥ |ηq (b/σ + z; Cσ q )|. Hence, Z ∞Z φ(z) χ 1 φ(z) dzdF (b) ≤ q 2−q dzdF (b) → 0, 2−q q 2−q σ σ 0 I γ \I0 |ηq (b/σ + z; χ)| 0 I γ \I0 |ηq (b/σ + z; Cσ )| R∞R φ(z) 1 where the last statement holds because of σ2−q dzdF (b) → 0 that has already been proved in 0 I γ \I0 |ηq (b/σ+z;Cσ q )|2−q Lemma 29 and 30. We thus have showed, Z ∞Z χ φ(z) dzdF (b) → 0. 2−q + χq(q − 1) σ2 0 |η (b/σ + z; χ)| γ q I χ σ2
Z
∞
Z
Based on the above result, we can easily follow the same derivations about $T_1$ in the proof of Lemma 32 and show that
\[
\lim_{\sigma\to0}\frac{\chi}{\sigma^2}\int_0^1\int_{-b/\sigma-b^c/(2\sigma)}^{-b/\sigma+b^c/(2\sigma)}\frac{\phi(z)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\,dz\,dF(b) = 0. \tag{146}
\]
Furthermore, because $\chi(\sigma) = o(\sigma^q)$, the calculations about $T_2$ and Equation (139) from the proof of Lemma 32 imply
\[
\lim_{\sigma\to0}\frac{\chi}{\sigma^2}\int_1^\infty\int_{-\infty}^{+\infty}\frac{\phi(z)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\,dz\,dF(b) = 0, \tag{147}
\]
\[
\lim_{\sigma\to0}\frac{\chi}{\sigma^2}\int_0^1\int_{\mathbb{R}\setminus[-b/\sigma-b^c/(2\sigma),\,-b/\sigma+b^c/(2\sigma)]}\frac{\phi(z)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\,dz\,dF(b) = 0. \tag{148}
\]
Putting (146), (147) and (148) together gives
\[
\lim_{\sigma\to0}\frac{S_2}{\sigma^2} = \lim_{\sigma\to0}\frac{2\chi q(q-1)}{\sigma^2}\int_0^\infty\int_{-\infty}^{\infty}\frac{\phi(z)}{|\eta_q(b/\sigma+z;\chi)|^{2-q}+\chi q(q-1)}\,dz\,dF(b) = 0.
\]

Combining the results from Lemma 32, 33 and 34, we can upper and lower bound the optimal threshold value $\chi^*(\sigma)$ as shown in the following corollary.

Corollary 4. $\chi^*(\sigma) = \Omega(\sigma^q)$ and $\chi^*(\sigma) = O(\sigma^{q-1})$.

Lemma 35. Suppose that $P_G(|B|\le\sigma) = O(\sigma)$ and $E_G(|B|^2)<\infty$, then we have for $1<q<2$,
\[
\lim_{\sigma\to0}\frac{\chi^*(\sigma)}{\sigma^q} = \frac{(q-1)E|B|^{q-2}}{qE|B|^{2q-2}},\qquad
\lim_{\sigma\to0}\frac{R(\chi^*(\sigma),\sigma)-1}{\sigma^2} = -\frac{(q-1)^2(E|B|^{q-2})^2}{E|B|^{2q-2}}.
\]
Proof: From Equation (112), we know that $\chi^*$ satisfies the following equation:
\[
0 = \chi^*H_2 - I_1 - \chi^*I_2.
\]
Our first goal is to show that $\frac{1}{\sigma^{2-q}}I_1 \to q(q-1)E|B|^{q-2}$, as $\sigma\to0$. Define the interval
\[
I_{-1} = [-b/\sigma-\chi^*(\sigma)^{1/(2-q)},\ -b/\sigma+\chi^*(\sigma)^{1/(2-q)}]. \tag{149}
\]
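Then we have,
\[
\begin{aligned}
\lim_{\sigma\to0}\frac{I_1}{\sigma^{2-q}}
&= \lim_{\sigma\to0}\frac{q(q-1)}{\sigma^{2-q}}\int_0^\infty\int_{-\infty}^{\infty}\frac{|\eta_q(b/\sigma+z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\,\phi(z)\,dz\,dF(b)\\
&= \lim_{\sigma\to0}\frac{q(q-1)}{\sigma^{2-q}}\int_0^\infty\int_{I_{-1}}\frac{|\eta_q(b/\sigma+z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\,\phi(z)\,dz\,dF(b)\\
&\quad+\lim_{\sigma\to0}\frac{q(q-1)}{\sigma^{2-q}}\int_0^\infty\int_{\mathbb{R}\setminus I_{-1}}\frac{|\eta_q(b/\sigma+z;\chi^*)|^{4-2q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\,\phi(z)\,dz\,dF(b).
\end{aligned}
\tag{150}
\]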
We first show that the first integral in (150) goes to zero. Note that ηq ((χ∗ )1/(2−q) ; χ∗ ) = (χ∗ )1/(2−q) ηq (1; 1) by Lemma 4 part (v), we thus have Z Z q(q − 1) ∞ |ηq (b/σ + z; χ∗ )|4−2q lim φ(z)dzdF (b) 2−q ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I−1 (|ηq (b/σ + z; χ )| Z Z q(q − 1) ∞ |ηq ((χ∗ )1/(2−q) ; χ∗ )|4−2q ≤ lim φ(z)dzdF (b) σ→0 σ 2−q (χ∗ q(q − 1))3 0 I−1 Z ∞Z ηq4−2q (1; 1) 1 ≤ lim 2−q φ(z)dzdF (b) ∗ 2 σ→0 σ 0 I−1 χ (q(q − 1)) Z C−1 σ√log(1/σ) Z ηq4−2q (1; 1) 1 ≤ lim 2−q φ(z)dzdF (b) ∗ 2 σ→0 σ 0 I−1 χ (q(q − 1)) Z ∞ Z ηq4−2q (1; 1) 1 + lim 2−q φ(z)dzdF (b) √ ∗ 2 σ→0 σ C−1 σ log(1/σ) I−1 χ (q(q − 1)) Since we have already shown that χ∗ (σ) = O(σ q−1 ) and χ∗ (σ) → 0, it is straightforward to see that the second integral above is negligible for large enough C−1 . For the first one, we know √ p Z Z ηq4−2q (1; 1) C−1 σ log(1/σ) (χ∗ )1/(2−q) σ log(1/σ) 1 lim φ(z)dzdF (b) ≤ lim O(1) = 0. (151) ∗ 2 σ→0 σ→0 σ 2−q χ∗ σ 2−q 0 I−1 χ (q(q − 1)) R∞R |ηq (b/σ+z;χ∗ )|4−2q 1 Our next goal is to find the limit of limσ→0 σ2−q φ(z)dzdF (b). In order to do that, 0 R\I−1 (|ηq (b/σ+z;χ∗ )|2−q +χ∗ q(q−1))3 we again break this integral into several subintervals. Recall the intervals I0 , I1 , . . . , J0 , J1 , . . . that we introduced in Lemma 29 and 30. We consider several cases: σ 1) In this case, we assume that χ∗ (σ)1/(2−q) = o( log(1/σ) ): we then have Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 R\I−1 (|ηq (b/σ + z; χ )| Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q = lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I0 \I−1 (|ηq (b/σ + z; χ )| Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q + lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I γ \I0 (|ηq (b/σ + z; χ )| Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q φ(z)dzdF (b) (152) + lim 2−q ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 R\I γ (|ηq (b/σ + z; χ )| Hence we have to evaluate each of the three terms in (152). 
For the first one, we know Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I0 \I−1 (|ηq (b/σ + z; χ )| Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q ≤ lim 2−q φ(z)dzdF (b) ∗ 2−q )3 σ→0 σ 0 I0 \I−1 (|ηq (b/σ + z; χ )| Z ∞Z 1 1 = lim 2−q φ(z)dzdF (b) ∗ 2−q σ→0 σ 0 I0 \I−1 |ηq (b/σ + z; χ )| Z ∞Z 1 1 ≤ lim 2−q φ(z)dzdF (b) 2−q ∗ σ→0 σ (1; 1) 0 I0 \I−1 χ ηq Z Cσ√log(1/σ) Z Z ∞ Z 1 1 1 1 ≤ lim 2−q φ(z)dzdF (b) + lim φ(z)dzdF (b). √ 2−q 2−q 2−q ∗ ∗ σ→0 σ σ→0 σ (1; 1) (1; 1) 0 I0 \I−1 χ ηq Cσ log(1/σ) I0 \I−1 χ ηq
Since we know that χ∗ (σ) = Ω(σ q ), it is straightforward to show that the second integral above goes to zero, if C is chosen properly. Hence, we again focus on the first integral, i.e., Z Cσ√log(1/σ) Z 1 1 φ(z)dzdF (b) 2−q ∗ σ 2−q 0 (1; 1) I0 \I−1 χ ηq p q σ 2 log(1/σ) 1 σ (a) p = o(1), ≤ O(1) 2−q ∗ = O(1) ∗ σ χ log(1/σ) χ log(1/σ) where (a) is due to χ∗ = Ω(σ q ). Now we study the second term of (152). Z ∞Z |ηq (b/σ + z; χ∗ )|4−2q 1 lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I γ \I0 (|ηq (b/σ + z; χ )| Z ∞Z 1 1 ≤ lim 2−q φ(z)dzdF (b) σ→0 σ |η (b/σ + z; χ∗ )|2−q γ q 0 I \I0 Our goal is to show that this integral goes to zero as well. In order to be able to do so, we use the following calculations: Z ∞Z 1 1 φ(z)dzdF (b) σ 2−q 0 |η (b/σ + z; χ∗ )|2−q γ q I \I0 m∗ +2 Z ∞ Z 1 X 1 ≤ φ(z)dzdF (b) σ 2−q i=1 0 |η (b/σ + z; χ∗ )|2−q q Ii \Ii−1 Z ` Z 1 X ∞ 1 + 2−q φ(z)dzdF (b) ∗ 2−q σ Ji \Ji−1 |ηq (b/σ + z; χ )| i=1 0 Z ∞Z 1 1 + 2−q φ(z)dzdF (b), σ |η (b/σ + z; χ∗ )|2−q q 0 J0 \Im∗ +2 where ` is chosen in a way that the interval J` covers I γ . Define mi = |Ii | and m ˜ i = |Ji |. Note that we did similar calculations for the case χ∗ (σ) = Cσ q in Lemmas 29 and 30. The key argument regarding χ(σ) that we used there to prove the result was that ηq (mi ; Cσ q ) = Θ(mi ) and ηq (m ˜ i ; Cσ q ) = Θ(m ˜ i ). Hence, if we can show that ∗ ∗ ηq (mi ; χ (σ)) = Θ(m ) and η ( m ˜ ; χ (σ)) = Θ( m ˜ ) in the current case, then those proofs will work and we will i q i i R∞R 1 1 have σ2−q φ(z)dzdF (b) → 0. For this purpose, we can use Lemma 28. Note that since ∗ 2−q γ |η (b/σ+z;χ )| 0 I \I0 q m0 < m1 < . . . < mm∗+2 < m ˜0 < m ˜1 < m ˜2 ... < m ˜ ` , we just need to confirm the condition of Lemma 28 for m0 . We have χ∗ (σ)m0q−2 = χ∗ (σ)σ q−2 log(1/σ)2−q → 0, by the assumption of case I. Hence under the assumption of Case I, we could also show that the second term of (152) goes to zero. In summary, we have showed that Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q φ(z)dzdF (b) → 0. 
2−q ∗ 2−q + χ∗ q(q − 1))3 σ 0 I γ (|ηq (b/σ + z; χ )| |η (b/σ+z;χ∗ )|4−2q
q 1 Furthermore, we know σ q−2 (|ηq (b/σ+z;χ ∗ )|2−q +χ∗ q(q−1))3 ≤ |η (b+σz;χ∗ σ 2−q )|2−q +χ∗ σ 2−q q(q−1) . We can then follow the q same track of calculations for deriving limσ→0 −S2 /σ 2 in the proof of Lemma 32 (note that those derivations work I1 more generally for σ 2−q χ∗ = o(1)), to get limσ→0 σ2−q = q(q − 1)E|B|q−2 . 1 1 σ σ ∗ 2−q 2) (χ ) = Ω( log(1/σ) ): So far we have discussed the case in which (χ∗ ) 2−q = o( log(1/σ) ). In this part, we focus on 1
σ the other case. Because (χ∗ ) 2−q = Ω( log(1/σ) ) and χ∗ = O(σ q−1 ), there exists a value of 1 ≤ m ¯ ≤ m∗ + 1 such that 1
1
∗ 2−q for σ small enough, (χ∗ ) 2−q = o(|Im = Ω(|Im−1 |). We then break the integral into: ¯ |) and (χ ) ¯ Z ∞Z ∗ 4−2q 1 |ηq (b/σ + z; χ )| lim φ(z)dzdF (b) σ→0 σ 2−q 0 (|η (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 q R\I−1 Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q = lim 2−q φ(z)dzdF (b) σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 0 Im ¯ \I−1 Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q + lim 2−q φ(z)dzdF (b) σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 0 I γ \Im ¯ Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q + lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 R\I γ (|ηq (b/σ + z; χ )|
(153)
Again we have to show that each of the first two integrals above goes to zero as σ → 0 and then calculate the last integral. First note that the calculation of the last integral is exactly the same as the calculation of the corresponding term in Case I. Therefore, if we can show that the first two integrals converge to zero, we are done. For the first integral we have Z ∞Z |ηq (b/σ + z; χ∗ )|4−2q 1 φ(z)dzdF (b) lim 2−q σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 0 Im ¯ \I−1 Z Cσ√log(1/σ) Z 1 |ηq (b/σ + z; χ∗ )|4−2q = lim 2−q φ(z)dzdF (b) σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 0 Im ¯ \I−1 Z ∞ Z 1 |ηq (b/σ + z; χ∗ )|4−2q + lim 2−q φ(z)dzdF (b) √ σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 Cσ log(1/σ) Im ¯ \I−1 Since according to Corollary 4, χ∗ (σ) = Ω(σ q ), it is straightforward to show that the second term in the above equation goes to zero by choosing large enough C. Regarding the second term, we know Z Cσ√log(1/σ) Z 1 |ηq (b/σ + z; χ∗ )|4−2q lim 2−q φ(z)dzdF (b) σ→0 σ (|ηq (b/σ + z; χ∗ )|2−q + χ∗ q(q − 1))3 0 Im ¯ \I−1 Z Cσ√log(1/σ) Z 1 1 ≤ φ(z)dzdF (b) 2−q σ |ηq (b/σ + z; χ∗ )|2−q 0 Im ¯ \I−1 Z Cσ√log(1/σ) Z 1 1 ≤ φ(z)dzdF (b) 2−q 2−q ∗ σ (1; 1) 0 Im ¯ \I−1 χ ηq m ¯ (a) p σ 2(2−q) −1 q−2 ≤ O(1)σ σ log(1/σ) (154) m ¯ χ∗ log(1/σ)(2−q) +...+1 ∗
∗ m +1 Note that (a) holds ¯ =m − 1 ≤ 0. According to the choice of m ¯ we know that + 1 since 2(2 − q) even when m ¯ 2(2−q)m−1 −1 σ ∗ 1/(2−q) . Therefore, the upper bound in (154) can be calculated: (χ ) =Ω m−1 ¯ S (2−q) (log(1/σ))
0
m ¯
p σ 2(2−q) −1 O(1)σ q−2 ∗ σ log(1/σ) = O m ¯ χ log(1/σ)(2−q) +...+1
1 p
log(1/σ)
! → 0.
1
∗ q−2 → 0. It implies that the arguments For the second integral in (153), note that (χ∗ ) 2−q = o(|Im ¯ |), hence χ (σ)|Im ¯| in calculating the second integral in Case I hold here as well. ∗ I2 I1 So far we have been able to characterize the limσ→0 σ2−q . It is now time to characterize σχ2−q and show that it goes to zero as σ → 0. As before we break the integral into several pieces. Recall the interval defined in (149). We have Z ∞Z χ∗ |ηq (b/σ + z; χ∗ )|2−q lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I−1 (|ηq (b/σ + z; χ )| Z ∞Z χ∗ |ηq (b/σ + z; χ∗ )|2−q ≤ lim 2−q φ(z)dzdF (b) σ→0 σ (χ∗ q(q − 1))3 0 I−1 Z ∞Z Z ∞Z χ∗ ηq2−q (1; 1) ηq2−q (1; 1) χ∗ 1 ≤ lim 2−q φ(z)dzdF (b) = lim φ(z)dzdF (b) ∗ 3 2−q ∗ 3 σ→0 σ σ→0 σ 0 I−1 (χ q(q − 1)) 0 I−1 χ (q(q − 1))
The upper bound above has been shown to be zero in the calculations about I1 (see Equation (151)). Moreover, note that when z ∈ / I−1 , we have |b/σ + z| > (χ∗ )1/(2−q) ⇒ |ηq (b/σ + z; χ∗ )| > ηq ((χ∗ )1/(2−q) ; χ∗ ) = (χ∗ )1/(2−q) ηq (1; 1). Hence, |ηq (b/σ + z; χ∗ )|2−q ≥ χ∗ ηq2−q (1; 1). Using this equation, we can obtain Z ∞Z |ηq (b/σ + z; χ∗ )|2−q χ∗ φ(z)dzdF (b) lim 2−q ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I γ \I−1 (|ηq (b/σ + z; χ )| Z ∞Z 1 |ηq (b/σ + z; χ∗ )|4−2q ≤ O(1) · lim 2−q φ(z)dzdF (b) ∗ 2−q + χ∗ q(q − 1))3 σ→0 σ 0 I γ \I−1 (|ηq (b/σ + z; χ )|
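The last term is the same as part of $\frac{I_1}{\sigma^{2-q}}$ that we showed converges to zero. Thus, we can conclude
\[
\lim_{\sigma\to0}\frac{\chi^*}{\sigma^{2-q}}\int_0^\infty\int_{I^\gamma}\frac{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\,\phi(z)\,dz\,dF(b) = 0.
\]
This together with the fact
\[
\frac{\chi^*}{\sigma^{2-q}}\cdot\frac{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}
\le \frac{1}{|\eta_q(b+\sigma z;\sigma^{2-q}\chi^*)|^{2-q}+\sigma^{2-q}\chi^*q(q-1)}\cdot\frac{1}{q(q-1)},
\]
we can again apply the calculations about $-S_2/\sigma^2$ from the proof in Lemma 32 to get
\[
\lim_{\sigma\to0}\frac{\chi^*}{\sigma^{2-q}}\int_0^\infty\int_{\mathbb{R}}\frac{|\eta_q(b/\sigma+z;\chi^*)|^{2-q}}{(|\eta_q(b/\sigma+z;\chi^*)|^{2-q}+\chi^*q(q-1))^3}\,\phi(z)\,dz\,dF(b) = 0.
\]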
Finally putting all the results we derived together, we have
\[
\lim_{\sigma\to0}\frac{\chi^*(\sigma)}{\sigma^q} = \lim_{\sigma\to0}\frac{I_1/\sigma^{2-q}+\chi^*I_2/\sigma^{2-q}}{H_2/\sigma^{2-2q}} = \frac{(q-1)E|B|^{q-2}}{qE|B|^{2q-2}}.
\]
Since we have derived the order of $\chi^*(\sigma)$, according to Lemma 32, we can immediately get the order of $R(\chi^*(\sigma),\sigma)$.

Characterizing AMSE for $1<q<2$: Based on the results of Lemma 35, calculating AMSE can be done in a similar way as in the proof of Theorem 2 (Section V-F4). We hence skip it here.

2) Proof for the case q = 1:

Lemma 36. Suppose that $P_G(|B|\le\sigma) = \Theta(\sigma^\ell)$ and $E_G(|B|^2)<\infty$, then we have for $q=1$,
\[
\alpha_m\sigma^\ell \le \chi^*(\sigma) \le \beta_m\sigma^\ell\cdot\Bigg(\sqrt{\underbrace{\log\log\ldots\log}_{m\ \text{times}}\frac{1}{\sigma}}\Bigg)^{\ell},
\]
\[
\tilde\alpha_m\sigma^{2\ell} \le 1-R(\chi^*(\sigma),\sigma) \le \tilde\beta_m\sigma^{2\ell}\cdot\Bigg(\underbrace{\log\log\ldots\log}_{m\ \text{times}}\frac{1}{\sigma}\Bigg)^{\ell},
\]
for small enough $\sigma$, where $m>0$ is an arbitrary integer number and $\alpha_m,\beta_m,\tilde\alpha_m,\tilde\beta_m>0$ are four constants depending on $m$.

Proof: Since the proof steps are similar to those in Lemma 27, we only mention the key differences and skip detailed explanations. From the proof in Lemma 27, we know $\chi^*(\sigma)\to0$, as $\sigma\to0$ and
\[
\chi^* = \frac{E\phi(\chi^*-B/\sigma)+E\phi(\chi^*+B/\sigma)}{E\mathbb{1}(|Z+B/\sigma|\ge\chi^*)}.
\]
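Since $E\mathbb{1}(|Z+B/\sigma|\ge\chi^*)\to1$ and we have shown in Lemma 23 that $\Theta(\sigma^\ell)\le E\phi(\chi^*-B/\sigma)+E\phi(\chi^*+B/\sigma)\le\Theta(\sigma^\ell(\log_m(1/\sigma))^{\ell/2})$, the bounds for $\chi^*(\sigma)$ are proved. Furthermore, we know
\[
\begin{aligned}
R(\chi^*(\sigma),\sigma)-1 &\le R(\chi,\sigma)-1\\
&= E(\eta_1(B/\sigma+Z;\chi)-B/\sigma-Z)^2+2E(\partial_1\eta_1(B/\sigma+Z;\chi)-1)\\
&\le \chi^2-2E\int_{-B/\sigma-\chi}^{-B/\sigma+\chi}\phi(z)\,dz = \chi^2-4\chi E\phi(-B/\sigma+\alpha\chi),
\end{aligned}
\]
where $|\alpha|\le1$ is dependent on $B$. If we choose $\chi(\sigma) = 3e^{-1}E\phi(\sqrt{2}B/\sigma)$ in the above inequality, it is straightforward to see $R(\chi^*(\sigma),\sigma)-1\le-\Theta((E\phi(\sqrt{2}B/\sigma))^2) = -\Theta(\sigma^{2\ell})$. For the other bound, note that
\[
\begin{aligned}
R(\chi^*(\sigma),\sigma)-1 &= E(\eta_1(B/\sigma+Z;\chi^*)-B/\sigma-Z)^2+2E(\partial_1\eta_1(B/\sigma+Z;\chi^*)-1)\\
&\ge -2E\mathbb{1}(|B/\sigma+Z|\le\chi^*) = -2E\int_{-B/\sigma-\chi^*}^{-B/\sigma+\chi^*}\phi(z)\,dz\\
&\ge -4\chi^*E\phi(-B/\sigma+\alpha\chi^*) = -\Theta(\sigma^{2\ell}(\log_m(1/\sigma))^{\ell}).
\end{aligned}
\]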
Characterizing AMSE for q = 1: Based on the results of Lemma 36, calculating AMSE can be done in a similar way as in the proof of Theorem 2 (Section V-F4). We hence do not repeat it here.
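The decomposition of $R(\chi,\sigma)-1$ used twice in the proof of Lemma 36 can be checked numerically. The following Python sketch assumes that $R(\chi,\sigma)$ denotes the risk $E(\eta_1(B/\sigma+Z;\chi)-B/\sigma)^2$ (its definition lies outside this section; this reading is consistent with the decomposition through Stein's lemma). It fixes $B/\sigma$ at a deterministic value `mu` and integrates with a plain trapezoid rule. All function names are illustrative and not taken from the paper.

```python
import numpy as np

def soft_threshold(u, chi):
    """eta_1(u; chi): proximal operator of chi * |.| (soft thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - chi, 0.0)

def gauss_expect(f, half_width=12.0, num=240001):
    """Approximate E f(Z), Z ~ N(0, 1), by the trapezoid rule on a fine grid."""
    z = np.linspace(-half_width, half_width, num)
    vals = f(z) * np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1])) * (z[1] - z[0]))

def risk_direct(mu, chi):
    """Assumed definition: R = E (eta_1(mu + Z; chi) - mu)^2 with mu = b / sigma."""
    return gauss_expect(lambda z: (soft_threshold(mu + z, chi) - mu) ** 2)

def risk_stein(mu, chi):
    """1 + E (eta_1(Y) - Y)^2 + 2 E (d/dy eta_1(Y) - 1), with Y = mu + Z."""
    shrink = gauss_expect(lambda z: (soft_threshold(mu + z, chi) - (mu + z)) ** 2)
    # For q = 1 the derivative of eta_1 is the indicator of |Y| > chi.
    deriv = gauss_expect(lambda z: (np.abs(mu + z) > chi).astype(float) - 1.0)
    return 1.0 + shrink + 2.0 * deriv

if __name__ == "__main__":
    for mu, chi in [(0.0, 0.5), (1.0, 0.1), (3.0, 1.0)]:
        print(mu, chi, risk_direct(mu, chi), risk_stein(mu, chi))
```

The two functions should agree up to quadrature error. For a small threshold the risk drops below 1, which is the sign of $R(\chi^*(\sigma),\sigma)-1$ that Lemma 36 quantifies.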
APPENDIX
M. The Equivalence between $\lim_{\lambda\to0}\hat\beta(\lambda,q)$ and $\hat\beta_0(q)$

In this section, we give a detailed explanation on the equivalence of $\lim_{\lambda\to0}\hat\beta(\lambda,q)$ to the solution $\hat\beta_0(q)$, defined as
\[
\hat\beta_0(q) = \arg\min_\beta\|\beta\|_q^q,\quad\text{subject to } y = X\beta. \tag{155}
\]
We first consider $1<q\le2$. Since in the noiseless setting where $\sigma_w=0$, the set $\{\beta\in\mathbb{R}^p: y = X\beta\}$ is a non-empty convex set, it is straightforward to confirm that both $\hat\beta(\lambda,q)$ with $\lambda>0$ and $\hat\beta_0(q)$ uniquely exist. We aim to show,
\[
\lim_{\lambda\to0^+}\hat\beta(\lambda,q) = \hat\beta_0(q).
\]
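Let $\tilde\beta$ be any vector such that $X\tilde\beta = y$. Since $\hat\beta(\lambda,q)$ is the global minimizer, we have
\[
\lambda\|\hat\beta(\lambda,q)\|_q^q \le \frac12\|y-X\hat\beta(\lambda,q)\|_2^2+\lambda\|\hat\beta(\lambda,q)\|_q^q \le \frac12\|y-X\tilde\beta\|_2^2+\lambda\|\tilde\beta\|_q^q = \lambda\|\tilde\beta\|_q^q,
\]
hence we obtain
\[
\|\hat\beta(\lambda,q)\|_q^q \le \|\tilde\beta\|_q^q,\quad\lambda>0, \tag{156}
\]
i.e., the solution $\hat\beta(\lambda,q)$ is bounded over $\lambda>0$. We consider any convergent sequence $\{\hat\beta(\lambda_n,q)\}_{n=1}^\infty$ such that $\lambda_n\to0$, as $n\to\infty$. Let $\lim_{n\to\infty}\hat\beta(\lambda_n,q) = \beta^*$. Due to the optimality of $\hat\beta(\lambda_n,q)$ we have
\[
X^T(y-X\hat\beta(\lambda_n,q)) = \lambda_n\nabla\|\hat\beta(\lambda_n,q)\|_q^q,
\]
where $\nabla\|\hat\beta(\lambda_n,q)\|_q^q$ is the gradient of $\|\beta\|_q^q$ at $\beta = \hat\beta(\lambda_n,q)$. Equation (156) implies $\nabla\|\hat\beta(\lambda_n,q)\|_q^q$ is bounded. Hence, letting $n$ go to infinity on both sides of the above equation gives $X^T(y-X\beta^*) = 0$. In other words, the convex function $\frac12\|y-X\beta\|_2^2$ has zero gradient at $\beta = \beta^*$. Therefore, $\beta^*\in\arg\min_\beta\|y-X\beta\|_2^2$, and since the set $\{\beta: y = X\beta\}$ is non-empty, the minimum value is zero, thus we know,
\[
y-X\beta^* = 0. \tag{157}
\]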
On the other hand, taking the limit $n\to\infty$ in Equation (156), we get
\[
\|\beta^*\|_q^q \le \|\tilde\beta\|_q^q.
\]
Since this is true for any $\tilde\beta$ satisfying $y = X\tilde\beta$, we have,
\[
\|\beta^*\|_q^q \le \min_{y = X\tilde\beta}\|\tilde\beta\|_q^q. \tag{158}
\]
Putting Equation (157) and (158) together, and noticing $\hat\beta_0(q)$ is unique, we obtain,
\[
\lim_{n\to\infty}\hat\beta(\lambda_n,q) = \hat\beta_0(q).
\]
Because we have shown the equation above holds for any convergent sequence, we can conclude $\lim_{\lambda\to0}\hat\beta(\lambda,q) = \hat\beta_0(q)$.

For $q=1$, almost all the arguments presented in the case $1<q\le2$ work as well, except that $\hat\beta_0(q)$ and $\hat\beta(\lambda,q)$ may not be unique. Accordingly, we may only draw case-dependent conclusions:
a. If $\hat\beta_0(1)$ is unique, then $\lim_{\lambda\to0}\hat\beta(\lambda,1) = \hat\beta_0(1)$. Here $\hat\beta(\lambda,1)$ need not be unique. It can be any optimal solution.
b. If $\hat\beta_0(1)$ is not unique, any convergent sequence $\{\hat\beta(\lambda_n,1)\}$ converges to one of the solutions $\arg\min_{\beta: y = X\beta}\|\beta\|_1$.

N. Proof of Theorem 1

This section contains the complete proof of Theorem 1. The proof for LASSO ($q=1$) has been shown in [13]. We aim to extend the results to $1<q\le2$. We will follow a similar proof strategy as the one proposed in [13]. However, as will be described later, some of the steps are more challenging for $1<q\le2$. Our goal is to address these steps in this section. Motivated by [13] we construct a sequence of estimates from approximate message passing (AMP) algorithms and establish their asymptotic equivalence to LQLS estimates. We then utilize the existing asymptotic results from the AMP framework to prove the theorem. The rest of the material is organized as follows. In Section V-N1, we briefly review approximate message passing algorithms and state relevant results that will be used later in our proof. We then give the whole proof in Section V-N2. Section V-N4 collects two useful results being applied throughout the proof.
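As a side illustration of the limit established in Section M, the case $q=2$ admits a closed form and can be checked numerically. The sketch below assumes $n<p$ and a Gaussian design with full row rank; under these assumptions $\hat\beta(\lambda,2) = X^T(XX^T+2\lambda I)^{-1}y$ and $\hat\beta_0(2)$ is the minimum $\ell_2$-norm solution $X^+y$. The dimensions, the seed and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def bridge_q2(X, y, lam):
    """Closed-form minimizer of 0.5 * ||y - X b||_2^2 + lam * ||b||_2^2 (q = 2)."""
    n = X.shape[0]
    # Dual form X^T (X X^T + 2 lam I)^{-1} y; equal to (X^T X + 2 lam I)^{-1} X^T y.
    return X.T @ np.linalg.solve(X @ X.T + 2.0 * lam * np.eye(n), y)

def min_norm_solution(X, y):
    """Minimum l2-norm solution of y = X b (assumes X has full row rank)."""
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n, p = 20, 50                                    # illustrative sizes with n < p
X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
y = X @ rng.normal(size=p)                       # noiseless observations
beta0 = min_norm_solution(X, y)
lams = [1e-1, 1e-2, 1e-4, 1e-6]
errors = [float(np.linalg.norm(bridge_q2(X, y, lam) - beta0)) for lam in lams]

if __name__ == "__main__":
    for lam, err in zip(lams, errors):
        print(f"lambda = {lam:g}, distance to min-norm solution = {err:.3e}")
```

The distance to $\hat\beta_0(2)$ shrinks as $\lambda$ decreases, in line with $\lim_{\lambda\to0}\hat\beta(\lambda,q) = \hat\beta_0(q)$. For $1\le q<2$ no closed form is available and an iterative solver would be needed.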
1) Approximate Message Passing Algorithms and Theories: Recall $\eta_q(u;\chi)$ is the proximal operator for the function $\|\cdot\|_q^q$. When $u$ is a vector, $\eta_q(u;\chi)$ is considered to operate component-wise. We are in the linear regression model setting: $y = X\beta+w$. The approximate message passing (AMP) algorithms [12], [33] generate a sequence of estimates $\beta^t\in\mathbb{R}^p$, from the following iterations, initialized at $\beta^0=0$, $z^0=y$:
\[
\begin{aligned}
\beta^{t+1} &= \eta_q(X^Tz^t+\beta^t;\theta_t),\\
z^t &= y-X\beta^t+\frac1\delta z^{t-1}\langle\partial_1\eta_q(X^Tz^{t-1}+\beta^{t-1};\theta_{t-1})\rangle,
\end{aligned}
\tag{159}
\]
where $\langle u\rangle = \frac1p\sum_{i=1}^pu_i$ denotes the average of a vector and $\{\theta_t\}$ is a sequence of tuning parameters specified during the iterations. A remarkable phenomenon about AMP is that the asymptotics of the sequence $\{\beta^t\}$ can be characterized by one dimensional functions. This is known as state evolution, introduced in [33].

Theorem 9. (Donoho, Maleki and Montanari, 2009; Bayati and Montanari, 2011). Let $\{\beta(p),X(p),w(p)\}$ be a converging sequence and $\psi:\mathbb{R}^2\to\mathbb{R}$ be a pseudo-Lipschitz function. Then for any iteration number $t>0$,
\[
\lim_{p\to\infty}\frac1p\sum_{i=1}^p\psi(\beta_i^{t+1},\beta_i) = E[\psi(\eta_q(B+\tau_tZ;\theta_t),B)],\quad a.s.,
\]
where $B\sim p_\beta$ and $Z\sim N(0,1)$ are independent and $\{\tau_t\}_{t=0}^\infty$ can be tracked through the recursion with $\tau_0^2 = \sigma_w^2+\frac1\delta E|B|^2$,
\[
\tau_{t+1}^2 = \sigma_w^2+\frac1\delta E[\eta_q(B+\tau_tZ;\theta_t)-B]^2. \tag{160}
\]
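To make the iteration (159) concrete, the following Python sketch runs it on a small synthetic problem. It is only an illustration under stated assumptions: the proximal operator $\eta_q$ is computed by bisection on its stationarity equation $z+\chi qz^{q-1} = |u|$ (a choice made here, not a method from the paper), the thresholds $\theta_t$ are passed in as a fixed list, and the problem sizes, sparsity and noise level are arbitrary. The function names are hypothetical.

```python
import numpy as np

def prox_lq(u, chi, q, iters=80):
    """eta_q(u; chi) = argmin_z 0.5 * (u - z)^2 + chi * |z|^q, for 1 <= q <= 2."""
    u = np.asarray(u, dtype=float)
    a = np.abs(u)
    if q == 1.0:                                  # soft thresholding
        return np.sign(u) * np.maximum(a - chi, 0.0)
    lo, hi = np.zeros_like(a), a.copy()
    for _ in range(iters):                        # bisection on z + chi*q*z^(q-1) = |u|
        mid = 0.5 * (lo + hi)
        too_big = mid + chi * q * mid ** (q - 1.0) > a
        hi = np.where(too_big, mid, hi)
        lo = np.where(too_big, lo, mid)
    return np.sign(u) * 0.5 * (lo + hi)

def prox_lq_deriv(eta, chi, q):
    """Derivative of eta_q in its first argument, evaluated from eta = eta_q(u; chi)."""
    if q == 1.0:
        return (np.abs(eta) > 0).astype(float)
    with np.errstate(divide="ignore"):
        return 1.0 / (1.0 + chi * q * (q - 1.0) * np.abs(eta) ** (q - 2.0))

def amp(X, y, q, thetas):
    """Run the iteration (159) once per entry of thetas, from beta^0 = 0, z^0 = y."""
    n, p = X.shape
    delta = n / p
    beta, z = np.zeros(p), y.copy()
    for theta in thetas:
        pseudo = X.T @ z + beta                   # X^T z^t + beta^t
        beta = prox_lq(pseudo, theta, q)          # beta^{t+1}
        onsager = z * np.mean(prox_lq_deriv(beta, theta, q)) / delta
        z = y - X @ beta + onsager                # z^{t+1}
    return beta

rng = np.random.default_rng(1)
n, p = 200, 100                                   # illustrative sizes, delta = 2
X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
beta_true = rng.normal(size=p) * (rng.random(p) < 0.1)
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = amp(X, y, q=1.5, thetas=[0.2] * 20)

if __name__ == "__main__":
    print("per-coordinate squared error:", np.mean((beta_hat - beta_true) ** 2))
```

The derivative is evaluated from $\eta = \eta_q(u;\chi)$ through $\partial_1\eta_q(u;\chi) = 1/(1+\chi q(q-1)|\eta|^{q-2})$, which follows from differentiating the stationarity equation.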
For the purpose of the proof in the next section, from now on we only consider AMP estimates $\{\beta^t\}$ with $\theta_t = \chi\tau_t^{2-q}$, where $\chi$ is a positive constant. When $\delta\ge1$, define $\chi_{\min}=0$; otherwise let $\chi_{\min}$ be the unique solution of the following equation:
\[
\frac1\delta E\eta_q^2(Z;\chi) = 1.
\]
Refer to Section V-C for the properties of $\chi_{\min}$. We can then obtain a useful property about the behavior of the sequence $\{\tau_t\}_{t=1}^\infty$.

Lemma 37. For any given $\chi\in(\chi_{\min},\infty)$, the sequence $\{\tau_t\}_{t=0}^\infty$, constructed according to (160), converges to a finite number, denoted by $\tau_*$, as $t\to\infty$. Moreover, the convergence is monotone, i.e., $|\tau_t^2-\tau_*^2|\searrow0$, as $t\to\infty$.

Proof: Denote $H(\tau) = \sigma_w^2+\frac1\delta E[\eta_q(B+\tau Z;\tau^{2-q}\chi)-B]^2$. According to Corollary 2 and follow-up discussions, we know $H(\tau) = \tau^2$ has a unique solution. Furthermore, since $H(0)>0$ and $H(\tau)<\tau^2$ when $\tau$ is large enough, it is straightforward to confirm the results stated in the above lemma.

Theorem 9 captures the asymptotics of $\{\beta^t\}$. In fact, more general state evolution properties can be derived. We state a few of them in the next theorem, paving our way to the proof in Section V-N2.

Theorem 10. Define $w_t\triangleq\frac1\delta\langle\partial_1\eta_q(X^Tz^{t-1}+\beta^{t-1};\chi\tau_{t-1}^{2-q})\rangle$. Under the conditions of Theorem 9, we have almost surely,
(i) $\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|\beta^{t+1}-\beta^t\|_2^2}{p} = 0$.
(ii) $\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|z^{t+1}-z^t\|_2^2}{p} = 0$.
(iii) $\lim_{p\to\infty}\frac{\|z^t\|_2^2}{n} = \tau_t^2$.
(iv) $\lim_{p\to\infty}w_t = \frac1\delta E[\partial_1\eta_q(B+\tau_{t-1}Z;\chi\tau_{t-1}^{2-q})]$.
Proof: All the results for $q=1$ have been derived in [13]. We here generalize them to the case $1<q\le2$. Since the proof is mostly a direct modification of that in [13], we only highlight the differences and refer the reader to [13] for detailed arguments. According to the proof of Lemma 4.3 in [13], we have
\[
\lim_{p\to\infty}\frac{\|z^{t+1}-z^t\|_2^2}{p} = \lim_{p\to\infty}\frac{\|\beta^{t+1}-\beta^t\|_2^2}{p} = E[\eta_q(B+Z_t;\tau_t^{2-q}\chi)-\eta_q(B+Z_{t-1};\tau_{t-1}^{2-q}\chi)]^2,\quad a.s.
\]
where $(Z_t,Z_{t-1})$ is jointly zero-mean Gaussian, independent from $B\sim p_\beta$, with covariance matrix defined by the recursion (4.13) in [13]. From Lemma 5, we know $\eta_q(u;\chi)$ is a differentiable function over $(-\infty,+\infty)\times(0,\infty)$. Hence we can apply the mean value theorem to obtain,
\[
\begin{aligned}
&E[\eta_q(B+Z_t;\tau_t^{2-q}\chi)-\eta_q(B+Z_{t-1};\tau_{t-1}^{2-q}\chi)]^2\\
&\quad\le E[\partial_1\eta_q(a;b)\cdot(Z_t-Z_{t-1})+\partial_2\eta_q(a;b)\cdot(\tau_t^{2-q}-\tau_{t-1}^{2-q})\chi]^2\\
&\quad\le 2E[(\partial_1\eta_q(a;b))^2\cdot(Z_t-Z_{t-1})^2]+2E[(\partial_2\eta_q(a;b))^2\cdot(\tau_t^{2-q}-\tau_{t-1}^{2-q})^2\chi^2]\\
&\quad\overset{(a)}{\le} 2E[(Z_t-Z_{t-1})^2]+2(\tau_t^{2-q}-\tau_{t-1}^{2-q})^2\chi^2\,E\frac{a^2}{b^2(q-1)^2},
\end{aligned}
\]
where $(a,b)$ is a point between $(B+Z_t,\tau_t^{2-q}\chi)$ and $(B+Z_{t-1},\tau_{t-1}^{2-q}\chi)$; we have used Lemma 4 part (ii) and Lemma 7 part (i)(ii) to obtain (a). Note that Lemma 37 implies the second term in the last inequality goes to zero, as $t\to\infty$. Regarding the first term, we can follow similar proof steps as for Lemma 5.7 in [13] to show $E(Z_t-Z_{t-1})^2\to0$, as $t\to\infty$. The key observations to make it work are:
1) Using Lemma 8, the calculation from Equation (C.7) to the next one in [13] holds generally for $1\le q\le2$.
2) According to [51] (see Equation (4.41)), Lemma C.1 in [13] holds for $\partial_1\eta_q(\cdot;\cdot)$ with $1\le q\le2$.

Result (iii) is a direct copy of Lemma 4.1 in [13]. We hence do not repeat the proof here. For (iv), Lemma F.3(b) in [13] implies the empirical distribution of $\{((X^Tz^{t-1}+\beta^{t-1})_i,\beta_i)\}_{i=1}^p$ converges weakly to the distribution of $(B+\tau_{t-1}Z,B)$. Since the function $J(a,b) = \partial_1\eta_q(a;\chi\tau_{t-1}^{2-q})$ is a bounded and continuous function according to Lemma 4 and Lemma 6 part (i), (iv) follows directly from the Portmanteau theorem.
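The recursion (160) with $\theta_t = \chi\tau_t^{2-q}$ and the monotone convergence stated in Lemma 37 can be explored numerically. The Python sketch below is a rough illustration under assumptions not made in the paper: $B$ follows a user-supplied discrete distribution, the expectation over $Z$ is approximated by Gauss-Hermite quadrature, and $\eta_q$ is computed by bisection. For $q=2$ the recursion is linear in $\tau_t^2$ because $\eta_2(u;\chi) = u/(1+2\chi)$, which gives a closed-form fixed point to compare against. Function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def prox_lq(u, chi, q, iters=80):
    """eta_q(u; chi) = argmin_z 0.5 * (u - z)^2 + chi * |z|^q, for 1 <= q <= 2."""
    u = np.asarray(u, dtype=float)
    a = np.abs(u)
    if q == 1.0:                                  # soft thresholding
        return np.sign(u) * np.maximum(a - chi, 0.0)
    lo, hi = np.zeros_like(a), a.copy()
    for _ in range(iters):                        # bisection on z + chi*q*z^(q-1) = |u|
        mid = 0.5 * (lo + hi)
        too_big = mid + chi * q * mid ** (q - 1.0) > a
        hi = np.where(too_big, mid, hi)
        lo = np.where(too_big, lo, mid)
    return np.sign(u) * 0.5 * (lo + hi)

def state_evolution(delta, sigma_w, chi, q, b_vals, b_probs, num_iters=60, deg=61):
    """Iterate (160) with theta_t = chi * tau_t^(2 - q); returns tau_0^2, tau_1^2, ..."""
    nodes, weights = hermegauss(deg)              # quadrature for weight exp(-x^2 / 2)
    weights = weights / weights.sum()             # E f(Z) ~ sum_i weights[i] * f(nodes[i])
    b_vals = np.asarray(b_vals, dtype=float)
    b_probs = np.asarray(b_probs, dtype=float)
    tau2 = sigma_w**2 + np.sum(b_probs * b_vals**2) / delta      # tau_0^2
    history = [tau2]
    for _ in range(num_iters):
        tau = np.sqrt(tau2)
        theta = chi * tau ** (2.0 - q)
        mse = 0.0
        for b, prob in zip(b_vals, b_probs):      # E over the discrete prior of B
            err = prox_lq(b + tau * nodes, theta, q) - b
            mse += prob * np.sum(weights * err**2)
        tau2 = sigma_w**2 + mse / delta
        history.append(tau2)
    return np.array(history)

if __name__ == "__main__":
    hist = state_evolution(0.8, 0.5, 0.5, 1.5, [0.0, 1.0], [0.9, 0.1])
    print("tau_t^2 for q = 1.5:", hist[:5], "...", hist[-1])
```

For $q<2$ the quadrature is only approximate, since the integrand is not a polynomial; a finer rule or Monte Carlo sampling may be preferable when high accuracy matters.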
2) The Main Proof Steps: We start the proof by a key lemma that characterizes the structural properties of the $\ell_q$-regularized cost function. It makes it possible to analyze the relation between the optimal solution $\hat\beta(\lambda,q)$ and any point close to it. Define $F(\beta)\triangleq\frac12\|y-X\beta\|_2^2+\lambda\|\beta\|_q^q$.

Lemma 38. Denote the nullspace of a matrix $X\in\mathbb{R}^{n\times p}$ by $\ker(X)\triangleq\{\beta\in\mathbb{R}^p\mid X\beta=0\}$. There exists a function $f(\epsilon,c_1,c_2,c_3,c_4)$ such that the following happens. If $\beta,r\in\mathbb{R}^p$ satisfy the following conditions
(i) $\|r\|_2\le c_1\sqrt p$
(ii) $F(\beta+r)\le F(\beta)$
(iii) $\|\nabla F(\beta)\|_2\le\sqrt p\,\epsilon$
(iv) $\sup_{0\le\mu_i\le1}\sum_{i=1}^p|\beta_i+\mu_ir_i|^{2-q}\le pc_2$
(v) $0<c_3\le\sigma_{\min}(X)$, where $\sigma_{\min}(X)$ is the minimum non-zero singular value of $X$.
(vi) $\|r^\parallel\|_2^2\le c_4\frac{\|r^\parallel\|_1^2}{p}$. The vector $r^\parallel\in\mathbb{R}^p$ is the projection of $r$ onto $\ker(X)$.
Then $\|r\|_2\le\sqrt p\,f(\epsilon,c_1,c_2,c_3,c_4)$. Moreover, $f(\epsilon,c_1,c_2,c_3,c_4)\to0$ as $\epsilon\to0$.

Proof: First note that
\[
\nabla F(\beta) = -X^T(y-X\beta)+\lambda q(|\beta_1|^{q-1}\mathrm{sign}(\beta_1),\ldots,|\beta_p|^{q-1}\mathrm{sign}(\beta_p))^T.
\]
According to Condition (ii) we have,
\[
\begin{aligned}
0 &\ge F(\beta+r)-F(\beta) = \frac12\|y-X\beta-Xr\|_2^2+\lambda\|\beta+r\|_q^q-\frac12\|y-X\beta\|_2^2-\lambda\|\beta\|_q^q\\
&= \frac12\|Xr\|_2^2-r^TX^T(y-X\beta)+\lambda(\|\beta+r\|_q^q-\|\beta\|_q^q)\\
&= \frac12\|Xr\|_2^2+r^T\nabla F(\beta)+\lambda\sum_{i=1}^p(|\beta_i+r_i|^q-|\beta_i|^q-qr_i|\beta_i|^{q-1}\mathrm{sign}(\beta_i))\\
&\overset{(a)}{\ge}\frac12\|Xr\|_2^2+r^T\nabla F(\beta)+\frac{\lambda q(q-1)}{2}\sum_{i=1}^p|\beta_i+\mu_ir_i|^{q-2}r_i^2,
\end{aligned}
\tag{161}
\]
where (a) is due to Lemma 39 and $\{\mu_i\}$ are numbers between 0 and 1. Note that we can decompose $r$ as $r = r^\parallel+r^\perp$ such that $r^\parallel\in\ker(X)$, $r^\perp\in\ker(X)^\perp$. Accordingly Condition (v) yields $c_3^2\|r^\perp\|_2^2\le\|Xr^\perp\|_2^2$. This important fact combined with Inequality (161) implies,
\[
\frac{c_3^2}{2}\|r^\perp\|_2^2\le\frac12\|Xr^\perp\|_2^2 = \frac12\|Xr\|_2^2\le-r^T\nabla F(\beta)\le\|r\|_2\cdot\|\nabla F(\beta)\|_2\le c_1\epsilon p,
\]
where the last inequality is derived from Condition (i) and (iii). We hence obtain
\[
\|r^\perp\|_2^2\le\frac{2c_1\epsilon p}{c_3^2}.
\]
Our next step is to bound $\|r^\parallel\|_2^2$. First note that by the Cauchy-Schwarz inequality, we have
\[
\sum_{i=1}^p|r_i| = \sum_{i=1}^p\sqrt{|\beta_i+\mu_ir_i|^{2-q}}\cdot\sqrt{r_i^2|\beta_i+\mu_ir_i|^{q-2}}
\le\sqrt{\sum_{i=1}^p|\beta_i+\mu_ir_i|^{2-q}}\cdot\sqrt{\sum_{i=1}^pr_i^2|\beta_i+\mu_ir_i|^{q-2}}.
\]
We thus obtain,
\[
\sum_{i=1}^pr_i^2|\beta_i+\mu_ir_i|^{q-2}\ge\frac{\|r\|_1^2}{\sum_{i=1}^p|\beta_i+\mu_ir_i|^{2-q}}. \tag{162}
\]
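Combining Inequality (161) and (162) gives,
\[
\|r\|_1^2\le\frac{-2r^T\nabla F(\beta)}{\lambda q(q-1)}\cdot\sum_{i=1}^p|\beta_i+\mu_ir_i|^{2-q}\overset{(a)}{\le}\frac{2c_1c_2}{\lambda q(q-1)}\,\epsilon p^2,
\]
where we have used Condition (i)(iii)(iv) to derive (a). Using the upper bounds we obtained for $\|r\|_1^2$ and $\|r^\perp\|_2^2$, together with Condition (vi), it is straightforward to verify the following chain of inequalities,
\[
\begin{aligned}
\|r^\parallel\|_2^2&\le\frac{c_4}{p}\|r^\parallel\|_1^2\le\frac{2c_4}{p}(\|r\|_1^2+\|r^\perp\|_1^2)\le\frac{2c_4}{p}(\|r\|_1^2+p\|r^\perp\|_2^2)\\
&\le\frac{2c_4}{p}\Big(\frac{2c_1c_2}{\lambda q(q-1)}\,\epsilon p^2+\frac{2c_1}{c_3^2}\,\epsilon p^2\Big) = \Big(\frac{4c_1c_2c_4}{\lambda q(q-1)}+\frac{4c_1c_4}{c_3^2}\Big)\cdot\epsilon p.
\end{aligned}
\]
We finally have
\[
\|r\|_2^2 = \|r^\parallel\|_2^2+\|r^\perp\|_2^2\le\Big(\frac{4c_1c_2c_4}{\lambda q(q-1)}+\frac{4c_1c_4}{c_3^2}+\frac{2c_1}{c_3^2}\Big)\cdot\epsilon p.
\]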
This completes the proof.

Note that Lemma 38 is a non-asymptotic and deterministic result. It sheds light on the behavior of the cost function $F(\beta)$ around its global minimum. Suppose $\beta+r$ is the global minimizer (a reasonable assumption according to Condition (ii)), and there is another point $\beta$ having small function value (indicated by its gradient from Condition (iii)); then the distance $\|r\|_2$ between $\beta$ and the optimal solution $\beta+r$ should also be small. This interpretation should not sound surprising, since we already know $F(\beta)$ is a strictly convex function. However, Lemma 38 enables us to characterize this property in a precise way, which is crucial in the high dimensional asymptotic analysis. We will apply Lemma 38 with specified $\beta$ and $r$ to prove the asymptotic equivalence between the bridge estimator $\hat\beta(\lambda,q)$ and the AMP estimator $\beta^t$, as stated below.

Proposition 1. Let $\{\beta(p),X(p),w(p)\}$ be a converging sequence. Denote the solution of LQLS by $\hat\beta(\lambda,q)$, and let $\{\beta^t\}_{t\ge0}$ be the sequence of estimates generated from the AMP algorithm. We then have
\[
\lim_{t\to\infty}\lim_{p\to\infty}\frac1p\|\hat\beta(\lambda,q)-\beta^t\|_2^2 = 0,\quad a.s. \tag{163}
\]
Proof: Let $\beta+r = \hat\beta(\lambda,q)$, $\beta = \beta^t$. Then if this pair of $\beta$ and $r$ satisfies the conditions in Lemma 38, we will have $\frac{\|\hat\beta(\lambda,q)-\beta^t\|_2^2}{p} = \frac{\|r\|_2^2}{p}$ being very small. In the rest of the proof, we aim to verify that the conditions in Lemma 38 hold with high probability and establish the connection between the iteration number $t$ and $\epsilon$ in Lemma 38.
a. Condition (i) follows from Lemma 40:
\[
\lim_{t\to\infty}\limsup_{p\to\infty}\frac{\|r\|_2}{\sqrt p}\le\limsup_{p\to\infty}\frac{\|\hat\beta(\lambda,q)\|_2}{\sqrt p}+\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|\beta^t\|_2}{\sqrt p}\le2\sqrt C,\quad a.s.
\]
b. Condition (ii) holds since $\hat\beta(\lambda,q)$ is the optimal solution.
c. Condition (iii) can be confirmed by Lemma 42. Note that $\epsilon\to0$, as $t\to\infty$.
d. Condition (iv) holds by choosing large enough $t$, according to Lemma 41.
e. Condition (v) is the result of Theorem 11.
f. Condition (vi) is a direct application of Theorem 12.
Note all the claims made above hold almost surely, as $p\to\infty$. Hence the result (163) follows directly.

Based on the results from Proposition 1, Theorem 9 and Lemma 37, we can use exactly the same arguments as the proof of Theorem 1.4 in [13] to obtain Theorem 1. Since the arguments are straightforward, we do not repeat them here.

3) A sequel of useful lemmas: We prove an array of lemmas that have been applied in the proof of Proposition 1.

Lemma 39. Given a constant $q$ satisfying $1<q\le2$, for any $x,r\in\mathbb{R}$, there exists a number $0\le\mu\le1$ such that
\[
|x+r|^q-|x|^q-rq|x|^{q-1}\mathrm{sign}(x)\ge\frac{q(q-1)}{2}|x+\mu r|^{q-2}r^2. \tag{164}
\]
Proof: Denote $f_q(x) = |x|^q$. When $q=2$, since $f_2(x)$ is a smooth function over $(-\infty,+\infty)$, we can apply Taylor's theorem to obtain (164). For any $1<q<2$, note that $f_q''(0) = \infty$, hence Taylor's theorem is not applicable to every $x\in\mathbb{R}$. We then prove the inequality above in separate cases. First observe that if (164) holds for any $x>0$, $r\in\mathbb{R}$, then it is true for any $x<0$, $r\in\mathbb{R}$ as well. It is also straightforward to confirm that when $x=0$, we can always choose $\mu=1$ to satisfy Inequality (164) for any $r\in\mathbb{R}$. We hence focus on the case $x>0$, $r\in\mathbb{R}$.
a. When $x+r>0$, since $f_q(x)$ is a smooth function over $(0,\infty)$, we can apply Taylor's theorem to obtain (164).
b. If $x+r=0$, choosing $\mu=0$, Inequality (164) is simplified to $(q-1)x^q\ge\frac{q(q-1)}{2}x^q$, which is clearly valid.
c. When $x+r<0$, we consider two different scenarios.
i. First suppose $-x-r\ge x$. We apply (164) to the pair $-r-x$ and $x$. Then we know there exists $0\le\tilde\mu\le1$ such that
\[
|x+r|^q-|x|^q\ge(-2x-r)q|x|^{q-1}+\frac{q(q-1)}{2}|\tilde\mu(-x-r)+(1-\tilde\mu)x|^{q-2}(2x+r)^2.
\]
Then it is straightforward to verify that there is $0\le\mu\le1$ so that $\mu(x+r)+(1-\mu)x = -\tilde\mu(-x-r)-(1-\tilde\mu)x$. Denote $g(y) = \frac{q(q-1)}{2}|\tilde\mu(-x-r)+(1-\tilde\mu)x|^{q-2}y^2+q|x|^{q-1}y$. Note that if we can show $g(-2x-r)\ge g(r)$, we can obtain Inequality (164). It is easily seen that the quadratic function $g(y)$ achieves its global minimum at $y_0 = \frac{-1}{q-1}|x|^{q-1}|\tilde\mu(-x-r)+(1-\tilde\mu)x|^{2-q}\le\frac{-1}{q-1}|x|<-x$. Moreover, note that $-2x-r\ge0$, $r<0$ and they are symmetric around $y=-x$, hence $g(-2x-r)\ge g(r)$.
ii. Consider $0<-x-r<x$. We again use (164) to obtain
\[
|x+r|^q-|x|^q\ge(-2x-r)q|x|^{q-1}+\frac{q(q-1)}{2}|\tilde\mu(-x-r)+(1-\tilde\mu)x|^{q-2}(2x+r)^2\ge(-2x-r)q|x|^{q-1}+\frac{q(q-1)}{2}|x|^{q-2}(2x+r)^2.
\]
Denote $h(y) = \frac{q(q-1)}{2}|x|^{q-2}y^2+q|x|^{q-1}y$. Then if we can show $h(-2x-r)\ge h(r)$, Inequality (164) will be established for $\mu=0$. Since $h(y)$ achieves its global minimum at $y_0 = \frac{-1}{q-1}|x|<-x$ and $-2x-r>r$, we can get $h(-2x-r)\ge h(r)$.

Lemma 40. Under the conditions of Proposition 1, there exists a positive constant $C$ such that
\[
\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|\beta^t\|_2^2}{p}\le C,\quad a.s.,\qquad
\limsup_{p\to\infty}\frac{\|\hat\beta(\lambda,q)\|_2^2}{p}\le C,\quad a.s.
\]
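Proof: The proof is a direct modification of the proof for Lemma 3.3 in [13]. For completeness, we give the entire proof here. To show the first inequality, according to Theorem 9 and Lemma 37, choosing a particular pseudo-Lipschitz function $\psi(x,y) = x^2$, we have
\[
\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|\beta^t\|_2^2}{p} = E_{B,Z}[\eta_q(B+\tau_*Z;\chi\tau_*^{2-q})]^2<\infty,\quad a.s.,
\]
where $B\sim p_\beta$ and $Z\sim N(0,1)$ are independent. For the second inequality, first note that since $\hat\beta(\lambda,q)$ is the optimal solution, we have
\[
\lambda\|\hat\beta(\lambda,q)\|_q^q\le F(\hat\beta(\lambda,q))\le F(0) = \frac12\|y\|_2^2 = \frac12\|X\beta+w\|_2^2\le\|X\beta\|_2^2+\|w\|_2^2\le[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2. \tag{165}
\]
We then consider the decomposition $\hat\beta(\lambda,q) = \hat\beta(\lambda,q)^\perp+\hat\beta(\lambda,q)^\parallel$, where $\hat\beta(\lambda,q)^\perp\in\ker(X)^\perp$ and $\hat\beta(\lambda,q)^\parallel\in\ker(X)$. Since $\ker(X)$ is a uniformly random subspace with dimension $p(1-\delta)_+$, we can apply Theorem 12 to conclude that there exists a constant $c(\delta)>0$ depending on $\delta$ such that the following holds with high probability,
\[
\begin{aligned}
\|\hat\beta(\lambda,q)\|_2^2&= \|\hat\beta(\lambda,q)^\parallel\|_2^2+\|\hat\beta(\lambda,q)^\perp\|_2^2\le\frac{\|\hat\beta(\lambda,q)^\parallel\|_1^2}{c(\delta)p}+\|\hat\beta(\lambda,q)^\perp\|_2^2\\
&\le\frac{2\|\hat\beta(\lambda,q)\|_1^2+2\|\hat\beta(\lambda,q)^\perp\|_1^2}{c(\delta)p}+\|\hat\beta(\lambda,q)^\perp\|_2^2\le\frac{2\|\hat\beta(\lambda,q)\|_1^2}{c(\delta)p}+\frac{2+c(\delta)}{c(\delta)}\|\hat\beta(\lambda,q)^\perp\|_2^2.
\end{aligned}
\tag{166}
\]
Moreover, Hölder's inequality combined with Inequality (165) yields,
\[
\frac{\|\hat\beta(\lambda,q)\|_1}{p}\le\Big(\frac{\|\hat\beta(\lambda,q)\|_q^q}{p}\Big)^{1/q}\le\Big(\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{\lambda p}\Big)^{1/q}. \tag{167}
\]
Using the results from (166) and (167), we can upper bound $\|\hat\beta(\lambda,q)\|_2^2$,
\[
\begin{aligned}
\frac{\|\hat\beta(\lambda,q)\|_2^2}{p}&\overset{(a)}{\le}\frac{2}{c(\delta)}\Big(\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{\lambda p}\Big)^{2/q}+\frac{2+c(\delta)}{pc(\delta)[\sigma_{\min}(X)]^2}\|X\hat\beta(\lambda,q)^\perp\|_2^2\\
&\overset{(b)}{\le}\frac{2}{c(\delta)}\Big(\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{\lambda p}\Big)^{2/q}+\frac{2+c(\delta)}{pc(\delta)[\sigma_{\min}(X)]^2}(2\|y-X\hat\beta(\lambda,q)\|_2^2+2\|y\|_2^2)\\
&\overset{(c)}{\le}\frac{2}{c(\delta)}\Big(\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{\lambda p}\Big)^{2/q}+\frac{2+c(\delta)}{pc(\delta)[\sigma_{\min}(X)]^2}\cdot4\|y\|_2^2\\
&\overset{(d)}{\le}\frac{2}{c(\delta)}\Big(\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{\lambda p}\Big)^{2/q}+\frac{16+8c(\delta)}{c(\delta)[\sigma_{\min}(X)]^2}\cdot\frac{[\sigma_{\max}(X)]^2\|\beta\|_2^2+\|w\|_2^2}{p}.
\end{aligned}
\]
To obtain (a), we have used $\|X\hat\beta(\lambda,q)^\perp\|_2^2\ge[\sigma_{\min}(X)]^2\|\hat\beta(\lambda,q)^\perp\|_2^2$; (b) is due to the simple fact $X\hat\beta(\lambda,q)^\perp = X\hat\beta(\lambda,q)$; (c)(d) hold since $\|y-X\hat\beta(\lambda,q)\|_2^2\le2F(\hat\beta(\lambda,q))$ and the inequalities in (165). Finally, because Inequality (166) holds with probability larger than $1-2^{-p}$, $\sigma_{\min}(X)$ and $\sigma_{\max}(X)$ go almost surely to non-zero constants by Theorem 11 and $(\beta,X,w)$ is a converging sequence, the right side of the last inequality converges to a finite number. This completes the proof.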
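Lemma 41. Under the conditions of Proposition 1, there exists a positive constant $\tilde C$ such that
$$\limsup_{t\to\infty}\limsup_{p\to\infty}\sup_{0\le\mu_i\le1}\frac{\sum_{i=1}^p|\mu_i\hat\beta_i(\lambda,q)+(1-\mu_i)\beta_i^t|^{2-q}}{p} < \tilde C, \quad \text{a.s.}$$
Proof: For any given $0 \le \mu_i \le 1$, it is straightforward to see
$$|\mu_i\hat\beta_i(\lambda,q)+(1-\mu_i)\beta_i^t|^{2-q} \le \max\{|\hat\beta_i(\lambda,q)|^{2-q}, |\beta_i^t|^{2-q}\} \le |\hat\beta_i(\lambda,q)|^{2-q}+|\beta_i^t|^{2-q}.$$
Hence, by using Hölder's inequality we know
$$\begin{aligned}
\sup_{0\le\mu_i\le1}\frac1p\sum_{i=1}^p|\mu_i\hat\beta_i(\lambda,q)+(1-\mu_i)\beta_i^t|^{2-q} &\le \frac1p\sum_{i=1}^p|\hat\beta_i(\lambda,q)|^{2-q}+\frac1p\sum_{i=1}^p|\beta_i^t|^{2-q} \\
&\le \Big[\frac1p\sum_{i=1}^p|\hat\beta_i(\lambda,q)|^2\Big]^{(2-q)/2}+\Big[\frac1p\sum_{i=1}^p|\beta_i^t|^2\Big]^{(2-q)/2}.
\end{aligned}$$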
Applying the results from Lemma 40 to the above inequality finishes the proof.

Lemma 42. Under the conditions of Proposition 1, we have
$$\lim_{t\to\infty}\lim_{p\to\infty}\frac{\|\nabla F(\beta^t)\|_2^2}{p} = 0, \quad \text{a.s.}$$
Proof: From the AMP updating rule (159), $\beta^t = \eta_q(X^Tz^{t-1}+\beta^{t-1};\tau_{t-1}^{2-q}\chi)$, we know
$$X^Tz^{t-1}+\beta^{t-1} = \beta^t+\tau_{t-1}^{2-q}\chi q\big(|\beta_1^t|^{q-1}\mathrm{sign}(\beta_1^t),\ldots,|\beta_p^t|^{q-1}\mathrm{sign}(\beta_p^t)\big)^T.$$
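The identity above is the coordinate-wise stationarity condition of the proximal operator. A minimal numerical sketch follows, assuming $\eta_q(u;\chi) = \arg\min_z \tfrac12(u-z)^2+\chi|z|^q$ with $1 \le q \le 2$. The names `eta_q` and `stationarity_residual` and the bisection solver are illustrative choices, not taken from the paper; the argument `chi` stands for the full threshold $\tau_{t-1}^{2-q}\chi$.

```python
import math

def eta_q(u, chi, q, tol=1e-12):
    """Scalar proximal operator argmin_z 0.5*(u - z)**2 + chi*|z|**q, for 1 <= q <= 2.

    Hypothetical helper: `chi` is the whole threshold (tau^{2-q} * chi in the text).
    """
    if chi == 0.0 or u == 0.0:
        return u
    a = abs(u)
    if q == 1:
        # q = 1 has the closed form of soft thresholding.
        return math.copysign(max(a - chi, 0.0), u)
    # For q > 1 the minimizer shares the sign of u and its magnitude m solves
    # m + chi*q*m**(q-1) = |u|. The left side increases in m, so bisect on [0, |u|].
    lo, hi = 0.0, a
    while hi - lo > tol * max(1.0, a):
        mid = 0.5 * (lo + hi)
        if mid + chi * q * mid ** (q - 1) < a:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), u)

def stationarity_residual(u, chi, q):
    """Return |u - (b + chi*q*|b|**(q-1)*sign(b))| with b = eta_q(u; chi)."""
    b = eta_q(u, chi, q)
    if b == 0.0:
        # Only happens for u = 0, or for q = 1 with |u| <= chi, where the
        # subgradient condition replaces the gradient identity.
        return 0.0
    return abs(u - (b + chi * q * abs(b) ** (q - 1) * math.copysign(1.0, b)))

if __name__ == "__main__":
    for q in (1, 1.5, 2):
        worst = max(stationarity_residual(u, 0.7, q) for u in (-3.0, -0.2, 0.5, 2.0))
        print(q, worst)
```

For $q = 2$ the bisection can be compared against the closed form $u/(1+2\chi)$, which gives a quick sanity check of the solver.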
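It is also straightforward to see $z^t = y - X\beta^t + w_t z^{t-1}$. Since we know the explicit form of $\nabla F(\beta^t)$:
$$\nabla F(\beta^t) = -X^T(y-X\beta^t)+\lambda q\big(|\beta_1^t|^{q-1}\mathrm{sign}(\beta_1^t),\ldots,|\beta_p^t|^{q-1}\mathrm{sign}(\beta_p^t)\big)^T,$$
we can then upper bound $\nabla F(\beta^t)$ in the following way,
$$\begin{aligned}
\frac{1}{\sqrt p}\|\nabla F(\beta^t)\|_2 &= \frac{1}{\sqrt p}\left\|-X^T(y-X\beta^t)+\lambda\,\frac{X^Tz^{t-1}+\beta^{t-1}-\beta^t}{\tau_{t-1}^{2-q}\chi}\right\|_2 \\
&= \frac{1}{\sqrt p}\left\|-X^T(z^t-w_tz^{t-1})+\lambda\,\frac{X^Tz^{t-1}+\beta^{t-1}-\beta^t}{\tau_{t-1}^{2-q}\chi}\right\|_2 \\
&\le \frac{\lambda\|\beta^{t-1}-\beta^t\|_2}{\sqrt p\,\tau_{t-1}^{2-q}\chi}+\frac{\|X^T(z^{t-1}-z^t)\|_2}{\sqrt p}+\frac{|\lambda+\tau_{t-1}^{2-q}\chi(w_t-1)|\cdot\|X^Tz^{t-1}\|_2}{\sqrt p\,\tau_{t-1}^{2-q}\chi} \\
&\le \frac{\lambda\|\beta^{t-1}-\beta^t\|_2}{\sqrt p\,\tau_{t-1}^{2-q}\chi}+\frac{\sigma_{\max}(X)\|z^{t-1}-z^t\|_2}{\sqrt p}+\frac{\sigma_{\max}(X)\,|\lambda+\tau_{t-1}^{2-q}\chi(w_t-1)|\cdot\|z^{t-1}\|_2}{\sqrt p\,\tau_{t-1}^{2-q}\chi}.
\end{aligned}$$
The first inequality follows by writing $-X^Tz^t = X^T(z^{t-1}-z^t)-X^Tz^{t-1}$, collecting the terms in $X^Tz^{t-1}$, and applying the triangle inequality.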
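To make the two recursions used in this proof concrete, here is a minimal pure-Python sketch of the iteration for the special case $q = 1$, where $\eta_1$ is soft thresholding. Several pieces are assumptions made for illustration and are not specified in this section: $\tau_{t-1}$ is replaced by the empirical estimate $\|z^{t-1}\|_2/\sqrt n$, $w_t$ is taken as $(1/\delta)$ times the average derivative of $\eta_1$ (consistent with its limit in the proof), the initialization is $\beta^0 = 0$, and the names `amp_q1` and `demo` as well as the problem sizes are hypothetical.

```python
import math
import random

def soft_threshold(u, theta):
    """eta_1(u; theta): the q = 1 proximal operator."""
    return math.copysign(max(abs(u) - theta, 0.0), u)

def amp_q1(X, y, chi, n_iter=25):
    """AMP sketch for q = 1: beta^t = eta_1(X^T z^{t-1} + beta^{t-1}; tau_{t-1} * chi)."""
    n, p = len(X), len(X[0])
    delta = n / p
    beta = [0.0] * p
    z = list(y)  # z^0 = y because beta^0 = 0 (assumed initialization)
    for _ in range(n_iter):
        # Assumption: estimate tau_{t-1} empirically by ||z^{t-1}||_2 / sqrt(n).
        tau = math.sqrt(sum(v * v for v in z) / n)
        pseudo = [beta[j] + sum(X[i][j] * z[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(u, chi * tau) for u in pseudo]
        # w_t = (1/delta) * (average derivative of eta_1) = (#nonzeros of beta^t) / n.
        w_t = sum(1 for b in beta if b != 0.0) / (delta * p)
        fit = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Residual update z^t = y - X beta^t + w_t z^{t-1}.
        z = [y[i] - fit[i] + w_t * z[i] for i in range(n)]
    return beta

def mse(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - v) ** 2 for x, v in zip(a, b)) / len(a)

def demo(n=100, p=200, k=10, sigma=0.01, chi=2.0, seed=0):
    """Return (MSE of the zero estimate, MSE of the AMP estimate) on synthetic data."""
    rng = random.Random(seed)
    # i.i.d. N(0, 1/n) design, matching the scaling E X_ij^2 = 1/n.
    X = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(p)] for _ in range(n)]
    beta = [0.0] * p
    for j in rng.sample(range(p), k):
        beta[j] = rng.choice((-1.0, 1.0))
    y = [sum(X[i][j] * beta[j] for j in range(p)) + rng.gauss(0.0, sigma) for i in range(n)]
    return mse([0.0] * p, beta), mse(amp_q1(X, y, chi), beta)

if __name__ == "__main__":
    print(demo())
```

The term $w_t z^{t-1}$ in the residual update is what distinguishes this recursion from plain iterative thresholding, and it is the reason the factor $(w_t - 1)$ appears in the bound above.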
By Lemma 37, Theorem 10 parts (i)(ii), and Theorem 11, it is straightforward to confirm that the first two terms on the right side of the last inequality, those involving $\|\beta^{t-1}-\beta^t\|_2$ and $\|z^{t-1}-z^t\|_2$, vanish almost surely as $p \to \infty$, $t \to \infty$. For the third term, Lemma 37 and Theorem 10 parts (iii)(iv) imply
$$\lim_{t\to\infty}\lim_{p\to\infty}\frac{|\lambda+\tau_{t-1}^{2-q}\chi(w_t-1)|\cdot\|z^{t-1}\|_2}{\sqrt p\,\tau_{t-1}^{2-q}\chi} = \frac{\sqrt\delta\,\tau_*}{\tau_*^{2-q}\chi}\left|\lambda-\tau_*^{2-q}\chi\Big(1-\frac1\delta\,\mathbb{E}\,\eta_q'(B+\tau_*Z;\tau_*^{2-q}\chi)\Big)\right| = 0, \quad \text{a.s.}$$
To obtain the last equality, we have used Equation (8).

4) Useful Reference Theorems: In this section, we refer to two useful theorems that have also been applied and cited in [13]. The first one concerns the limit of the singular values of random matrices and is taken from [52].

Theorem 11. (Bai and Yin, 1993). Let $X \in \mathbb{R}^{n\times p}$ be a matrix with i.i.d. entries with $\mathbb{E}X_{ij} = 0$, $\mathbb{E}X_{ij}^2 = 1/n$. Denote by $\sigma_{\max}(X)$, $\sigma_{\min}(X)$ the largest and smallest non-zero singular values of $X$, respectively. If $n/p \to \delta > 0$ as $p \to \infty$, then
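$$\lim_{p\to\infty}\sigma_{\max}(X) = \frac{1}{\sqrt\delta}+1, \quad \text{a.s.}, \qquad \lim_{p\to\infty}\sigma_{\min}(X) = \left|\frac{1}{\sqrt\delta}-1\right|, \quad \text{a.s.}$$
The second theorem establishes the relation between the $\ell_1$ and $\ell_2$ norms of vectors from a random subspace, shown in [53].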
Theorem 12. (Kashin, 1977). For a given constant $0 < v \le 1$, there exists a universal constant $c_v$ such that for any $p \ge 1$ and a uniformly random subspace $V$ of dimension $p(1-v)$,
$$P\Big(\forall\beta\in V:\ c_v\|\beta\|_2 \le \frac{1}{\sqrt p}\|\beta\|_1\Big) \ge 1-2^{-p}.$$

ACKNOWLEDGEMENT
This work is supported by NSF grant CCF-1420328.

REFERENCES
[1] L. Frank and J. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
[2] W. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
[3] L. Zheng, A. Maleki, X. Wang, and T. Long. Does ℓp-minimization outperform ℓ1-minimization? arXiv preprint arXiv:1501.03704, 2015.
[4] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, pages 1356–1378, 2000.
[5] D. Donoho and J. Tanner. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proceedings of the National Academy of Sciences, 102(27):9446–9451, 2005.
[6] D. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[7] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference, page iau005, 2014.
[8] N. El Karoui, D. Bean, P. Bickel, C. Lim, and B. Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013.
[9] J. Bradic and J. Chen. Robustness in sparse linear models: relative efficiency based on robust approximate message passing. arXiv preprint arXiv:1507.08726, 2015.
[10] M. Stojnic. Various thresholds for ℓ1-optimization in compressed sensing. arXiv preprint arXiv:0907.3666, 2009.
[11] D. L. Donoho, A. Maleki, and A. Montanari. Noise sensitivity phase transition. IEEE Trans. Inform. Theory, 57(10):6920–6941, Oct. 2011.
[12] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764–785, Feb. 2011.
[13] M. Bayati and A. Montanari. The LASSO risk for Gaussian matrices. IEEE Trans. Inform. Theory, 58(4):1997–2017, 2012.
[14] D. Donoho. For most underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Comm. Pure and Appl. Math, 2004.
[15] D. Donoho. For most underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Manuscript, submitted for publication, URL: http://www-stat.stanford.edu/~donoho/Reports, 2004.
[16] D. Donoho. High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete & Computational Geometry, 35(4):617–652, 2006.
[17] D. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences, 102(27):9452–9457, 2005.
[18] D. Guo and S. Verdú. Randomly spread CDMA: Asymptotics via statistical physics. IEEE Trans. Inform. Theory, 51(6):1983–2010, 2005.
[19] T. Tanaka. A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Transactions Information Theory, 48(11):2888–2910, 2002.
[20] A. Coolen. The Mathematical Theory of Minority Games: Statistical Mechanics of Interacting Agents (Oxford Finance Series). Oxford University Press, Inc., 2005.
[21] M. Stojnic. Block-length dependent thresholds in block-sparse compressed sensing. arXiv preprint arXiv:0907.3679, 2009.
[22] M. Stojnic. Under-determined linear systems and ℓq-optimization thresholds. arXiv preprint arXiv:1306.3774, 2013.
[23] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: A geometric theory of phase transitions in convex optimization. Technical report, DTIC Document, 2013.
[24] C. Thrampoulidis, E. Abbasi, and B. Hassibi. Precise error analysis of regularized M-estimators in high-dimensions. arXiv preprint arXiv:1601.06233, 2016.
[25] N. Karoui. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445, 2013.
[26] D. Donoho and A. Montanari. High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, pages 1–35, 2013.
[27] D. Donoho, M. Gavish, and A. Montanari. The phase transition of matrix recovery from Gaussian measurements matches the minimax MSE of matrix denoising. Proceedings of the National Academy of Sciences, 110(21):8405–8410, 2013.
[28] D. Donoho and A. Montanari. Variance breakdown of Huber (M)-estimators: n/p → m. 2015.
[29] D. Donoho, A. Maleki, and A. Montanari. The noise-sensitivity phase transition in compressed sensing. IEEE Transactions Information Theory, 57(10):6920–6941, 2011.
[30] R. Foygel and L. Mackey. Corrupted sensing: Novel guarantees for separating structured signals. IEEE Transactions Information Theory, 60(2):1223–1247, 2014.
[31] S. Rangan, V. Goyal, and A. Fletcher. Asymptotic analysis of MAP estimation via the replica method and compressed sensing. In Advances in Neural Information Processing Systems, pages 1545–1553, 2009.
[32] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová. Statistical-physics-based reconstruction in compressed sensing. Physical Review X, 2(2):021005, 2012.
[33] A. Maleki. Approximate message passing algorithms for compressed sensing. Ph.D. thesis, Stanford University, 2010.
[34] A. Maleki, L. Anitori, Z. Yang, and R. Baraniuk. Asymptotic analysis of complex LASSO via complex approximate message passing (CAMP). IEEE Transactions Information Theory, 59(7):4290–4308, 2013.
[35] Y. Kabashima, T. Wadayama, and T. Tanaka. A typical reconstruction limit for compressed sensing based on lp-norm minimization. Journal of Statistical Mechanics: Theory and Experiment, 2009(09):L09003, 2009.
[36] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[37] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, pages 267–288, 1996.
[38] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.
[39] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[40] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. Candès. SLOPE: adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.
[41] J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, pages 587–613, 2008.
[42] W. Su, M. Bogdan, and E. Candès. False discoveries occur early on the lasso path. arXiv preprint arXiv:1511.01957, 2015.
[43] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[44] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[45] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions Information Theory, 57(10):6976–6994, 2011.
[46] V. Koltchinskii. Sparsity in penalized empirical risk minimization. In Annales de l'IHP Probabilités et statistiques, volume 45, pages 7–57, 2009.
[47] E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[48] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2007.
[49] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[50] C. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
[51] Grigorios A Pavliotis. Stochastic processes and applications. Diffusion Processes, the Fokker-Planck, 2014.
[52] ZD Bai and YQ Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, pages 1275–1294, 1993.
[53] Boris Sergeevich Kashin. Diameters of some finite-dimensional sets and classes of smooth functions. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 41(2):334–351, 1977.