$\ell_1$-regularized ensemble learning
Logan Grosenick
Convex Optimization II Class Project
Abstract

Methods that use an $\ell_1$-norm to encourage model sparsity are now widely applied across many disciplines. However, aggregating such sparse models across fits to resampled data remains an open problem. Because resampling approaches have been shown to be of great utility in reducing model variance and improving variable selection, a method able to generate a single sparse solution from multiple fits to resampled data is desirable. We present a method, related to collaborative filtering and consensus optimization, that finds a sparse "global" consensus solution from many "local" models fit to subsampled data using constrained convex optimization. Called "consensus selection", this ensemble method preserves sparsity while decreasing both model variance and model bias relative to other state-of-the-art variable selection and fitting procedures for $\ell_1$-regularized models.
1 Introduction
Methods that use $\ell_1$-regularization as a heuristic for arriving at a sparse model are now applied in many fields. Examples of such models include the lasso [13], compressed sensing [5], total variation denoising [12], soft thresholding [7], and estimation of structure in graphical models [10], to name a few. Indeed, the prevalence of such $\ell_1$-norm approaches has led some to call them "the modern least squares" [6]. Despite their widespread adoption, it remains unclear how to sensibly combine $\ell_1$-regularized models while maintaining sparsity. Such aggregation typically occurs after models are fit to multiple subsamples generated by cross-validation, bootstrapping, or other resampling methods, and is used to improve model fit or model selection. The difficulty in combining $\ell_1$-regularized models results from variability in the set of nonzero coefficients selected across different samples of the same data. Simply averaging the sparse models does not make sense (as it might for non-sparse models) because such averaging destroys the sparsity sought by the user.

The potential benefits of model aggregation (or "ensemble") methods, that is, methods that construct a single model from a linear combination of many, are well established. For example, in the case of high-variance models such as regression trees, averaging many tree models together to form a new predictor can significantly reduce model variance and improve model generalization to new data [3]. This approach, called "bagging" (a sobriquet for bootstrap aggregating), as well as the related methods of subbagging (subsample aggregating), bragging (bootstrap robust aggregating), and random forests, are popular methods that in many (but not all) cases successfully trade off an appreciable amount of model variance for a small amount of model bias. The primary disadvantage of these models has been that they are hard to interpret: bagged trees are not trees, and bagged sparse models are no longer sparse, leaving the analyst without standard methods for measuring variable importance.

Recently, however, model aggregation has been fruitfully applied to variable selection. Just as averaging models in bagging can improve model fit, aggregating nonzero variables across sparse models can improve variable selection consistency. This is accomplished by observing the frequency with which variables are assigned nonzero coefficients across models, and then selecting only those variables that occur in more than a certain fraction of them. Such "stability selection" can be used to control the expected false positive rate within the selected set of variables [10]. Although the selection procedure itself does not yield coefficients, one can fix the model sparsity pattern using stability selection and then estimate model coefficients for just the selected variables. While stability selection has been proven effective at controlling false discovery rates on real-world data [10], it does not explicitly take into account model generalization error and can be over-conservative in its variable selection given correlated data (see the discussion following the article in [10]).

This paper presents a framework for sparse methods that combines (a) the model variance reduction and improved generalization performance of model averaging with (b) improved variable selection using model aggregation (as in stability selection). This is accomplished using an $\ell_1$-regularized consensus model, in which many fits to resampled data are constrained to agree on a single sparse global fit. Similar models have been proposed previously in cases where a prohibitively large number of independent training examples [2] or a widely distributed network of independent agents [11] require fitting a single problem in a distributed fashion, but to the best of our knowledge consensus models have not been used to aggregate models during resampling. Of particular interest when considering $\ell_1$-regularized models are problems in which there are many more variables than observations ($p \gg n$ for data $X \in \mathbb{R}^{n \times p}$). In this case sparse consensus methods could allow us to better use a limited amount of data to improve model stability while identifying a few important variables from among many irrelevant ones.
2 Consensus selection
We would like to solve the following problem: given m data sets generated by resampling, select the model that minimizes empirical risk across all the data sets while maintaining (or exceeding) a fixed level of sparsity. Because a single set of sparse coefficients is selected as a result of consensus among many submodels, we call this approach "consensus selection".

2.1 General form $\ell_1$-regularized consensus selection
Given a set of m convex loss functions $l_i$, data samples $D^{(i)}$, and coefficient estimates $\beta_{(i)}$ indexed by $i = 1, \dots, m$, the consensus selection problem is

$$\begin{aligned}
\underset{\beta_{(i)},\, z}{\text{minimize}} \quad & \sum_{i=1}^{m} l_i(\beta_{(i)}, D^{(i)}) + \lambda \|z\|_1 & \qquad (1) \\
\text{subject to} \quad & \beta_{(i)} - z = 0, \quad i = 1, \dots, m. & \qquad (2)
\end{aligned}$$

The first term in the objective is a sum of the empirical risks associated with each of the m different models fit to the data samples $D^{(i)}$, each with a corresponding set of coefficients $\beta_{(i)}$. The m equality constraints (2) ensure that each of these different sets of coefficients is equal to a global consensus model $z$. The second term of the objective is an $\ell_1$-norm penalty $\|z\|_1 = \sum_{j=1}^{p} |z_j|$ on the consensus coefficients with fixed penalty parameter $\lambda \ge 0$. This is a heuristic for encouraging model sparsity, and can be thought of as a convex relaxation of the $\ell_0$-pseudonorm, or the cardinality of a vector. In many applications having a sparse coefficient vector is quite desirable, as a model with fewer non-zero coefficients is simpler and can help identify which predictors have an important relationship with a response variable.
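As a concrete illustration of problem (1)-(2), a direct transcription with squared-error losses can be written in a modeling tool such as CVXPY. This is a sketch for small problems only (it does not exploit the decomposable structure used later), and the function and variable names are ours rather than anything from the original implementation.

```python
import cvxpy as cp

def consensus_selection(samples, lam):
    """Direct CVXPY transcription of (1)-(2) with squared-error losses l_i.

    samples : list of m resampled data sets (y_i, X_i)
    lam     : l1 penalty parameter lambda on the consensus coefficients z
    """
    p = samples[0][1].shape[1]
    z = cp.Variable(p)                          # global consensus coefficients
    betas = [cp.Variable(p) for _ in samples]   # one local model per sample

    # Sum of empirical risks plus an l1 penalty on the consensus vector
    risk = sum(0.5 * cp.sum_squares(y - X @ b) for (y, X), b in zip(samples, betas))
    objective = cp.Minimize(risk + lam * cp.norm1(z))

    # Equality constraints (2): every local model must agree with z
    constraints = [b == z for b in betas]
    cp.Problem(objective, constraints).solve()
    return z.value
```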
2.2 ADMM algorithm for $\ell_1$-regularized consensus selection
To solve (1) subject to (2), we will use the Alternating Direction Method of Multipliers (ADMM). ADMM is a classic algorithm recently rediscovered due to its decomposability and consequent scalability to very large problems [2]. In brief, ADMM combines the decomposability of dual ascent methods with a quadratic regularization term that ensures convergence without requiring the primal problem to be strongly convex (or finite). For a thorough analysis of this method as well as many modern applications (including consensus and collaborative filtering) see [2].
The augmented Lagrangian for the equality constrained convex optimization problem (1) is

$$L_\rho\big(z, \{\beta_{(i)}\}_{\forall i}, \{\gamma_{(i)}\}_{\forall i}\big) = \sum_{i=1}^{m} l_i(\beta_{(i)}, D^{(i)}) + \lambda\|z\|_1 + \sum_{i=1}^{m} \Big[ \gamma_{(i)}^T(\beta_{(i)} - z) + (\rho/2)\|\beta_{(i)} - z\|_2^2 \Big] \qquad (3)$$

where the $\gamma_{(i)}$ are dual variables corresponding to the relaxed equality constraints $\beta_{(i)} - z = 0$ for $i = 1, \dots, m$, and the final summand in the second sum is a quadratic regularization of the primal residual (the "augmentation" of the Lagrangian). Note that when the equality constraints (2) are satisfied (that is, when the iterates are feasible), these terms disappear.
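For reference, (3) can be transcribed directly in NumPy; here the losses are passed in as callables with the data baked into them, which is our own convention rather than anything prescribed above.

```python
import numpy as np

def augmented_lagrangian(z, betas, gammas, losses, lam, rho):
    """Evaluate L_rho in (3).

    losses        : list of m callables with losses[i](beta) = l_i(beta, D_i)
    betas, gammas : lists of m primal and dual vectors of length p
    """
    risk = sum(loss(b) for loss, b in zip(losses, betas))
    penalty = lam * np.sum(np.abs(z))
    augmentation = sum(g @ (b - z) + 0.5 * rho * np.sum((b - z) ** 2)
                       for g, b in zip(gammas, betas))
    return risk + penalty + augmentation
```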
ADMM decomposes the augmented Lagrangian and at each iteration alternates between optimizing over the $\beta_{(i)}$ with $z$ and the $\gamma_{(i)}$ fixed, optimizing over $z$ with the $\beta_{(i)}$ and $\gamma_{(i)}$ fixed, and finally updating the dual variables by adding the residual misfit between each $\beta_{(i)}$ and the consensus solution $z$. With a change of variables $\nu_{(i)} = (1/\rho)\gamma_{(i)}$, this is expressed

$$\beta_{(i)}^{k+1} := \operatorname*{argmin}_{\beta_{(i)}} \Big( l_i(\beta_{(i)}, D^{(i)}) + (\rho/2)\big\|\beta_{(i)} + \nu_{(i)}^k - z^k\big\|_2^2 \Big) \qquad (4)$$

$$z^{k+1} := \operatorname*{argmin}_{z} \Big( \lambda\|z\|_1 + (m\rho/2)\big\|z - (\bar\beta^{k+1} + \bar\nu^k)\big\|_2^2 \Big) \qquad (5)$$

$$\nu_{(i)}^{k+1} := \nu_{(i)}^k + \beta_{(i)}^{k+1} - z^{k+1}. \qquad (6)$$
Thus the algorithm can be split into parallel updates for the primal variable subproblems (4), parallel updates for the dual variable subproblems (6), and a collective update (5) of the consensus coefficients $z$ that gathers all the variables across the m samples and combines them.

Looking at (4), we see that if our loss function is expressed as the negative log-likelihood of $\beta_{(i)}$ given data $D^{(i)}$, then we are solving a maximum likelihood problem on that sample plus a regularization term. This is equivalent to solving a maximum a posteriori (MAP) estimation problem with prior distribution $\pi_{(i)}^k \sim \mathcal{N}\big(z^k - \nu_{(i)}^k,\ (1/\rho) I\big)$, an iteration-dependent prior centered at the difference between the current consensus solution and the integrated residual for the ith sample, $\nu_{(i)}^k = \sum_{t=1}^{k} \big(\beta_{(i)}^t - z^t\big)$. Thus if $\beta_{j(i)}^k$ (the jth primal variable entry in sample i) repeatedly overestimates the consensus solution, it will be adaptively shrunk towards it. If, on the other hand, it repeatedly underestimates $z_j^k$, a small amount will be added to it on each iteration. This sequential reweighting of the primal variables in proportion to their previous error bears at least a casual resemblance to boosting (from the perspective of trying to estimate $z^k$ with base learners $\beta_{(i)}$ and adaptive weights $\nu_{(i)}$), and as we will see has a direct connection to existing adaptive $\ell_1$ methods.

Examining the consensus update (5), we see that $\bar\beta^k = \frac{1}{m}\sum_{i=1}^{m} \beta_{(i)}^k$ is an average of bootstrapped MAP estimates, and thus by definition a "bagged" estimate of $\beta$ [3]. It is therefore an approximation to the mean of the posterior distribution [9], and so should (in most cases) decrease mean-squared prediction error relative to the individual MAP fits $\beta_{(i)}$, which are point estimates of the posterior mode. The solution to the lasso (or basis-pursuit denoising) minimization problem (5) is given by

$$z^{k+1} := S_{\lambda/m\rho}\big(\bar\beta^{k+1} + \bar\nu^k\big) \qquad (7)$$

where $S_\lambda(x) = \operatorname{sign}(x)(|x| - \lambda)_+$ is the soft-thresholding operator with threshold $\lambda$. The bagged estimate $\bar\beta^{k+1}$, first adjusted by adding $\bar\nu^k$, is thus thresholded to obtain the sparse consensus solution $z^{k+1}$. In this sense, (4) and (5) approximate "$\ell_1$-bagging" by iteratively bagging an ensemble of dense models to reduce their variance and then projecting the posterior mean of the ensemble estimates onto a sparse subspace.
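A minimal NumPy sketch of the loop defined by (4)-(7) is shown below. The local subproblems (4) are abstracted as proximal-style callables so that any convex loss can be plugged in; the function names and the fixed iteration count are our own simplifications (in practice one would monitor primal and dual residuals for convergence).

```python
import numpy as np

def soft_threshold(x, kappa):
    """S_kappa(x) = sign(x) * (|x| - kappa)_+, applied element-wise (equation (7))."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def admm_consensus(local_solvers, p, lam, rho, iters=100):
    """Generic consensus ADMM loop for updates (4)-(6).

    local_solvers : list of m callables; local_solvers[i](v) returns
                    argmin_b  l_i(b, D_i) + (rho/2) * ||b - v||_2^2,
                    i.e. the proximal step (4) evaluated at v = z^k - nu_i^k.
    """
    m = len(local_solvers)
    z = np.zeros(p)
    nus = [np.zeros(p) for _ in range(m)]
    for _ in range(iters):
        # (4): parallel primal updates (independent across samples)
        betas = [solve(z - nu) for solve, nu in zip(local_solvers, nus)]
        # (5): gather, average, and soft-threshold to get the sparse consensus z
        beta_bar = np.mean(betas, axis=0)
        nu_bar = np.mean(nus, axis=0)
        z = soft_threshold(beta_bar + nu_bar, lam / (m * rho))
        # (6): parallel dual updates accumulate each model's residual misfit
        nus = [nu + b - z for nu, b in zip(nus, betas)]
    return z
```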
3 Case study: The consensus lasso
The Least Absolute Shrinkage and Selection Operator, or lasso [13], in which the loss functions $l_i$ in (1) are given by squared-error loss, is a canonical example of a sparse regression method, and therefore serves as a good case study in using consensus selection to fit an $\ell_1$-regularized model. We begin by reviewing some relevant results from the literature on lasso problems.

3.1 The lasso
Consider the noise-corrupted observations

$$y = X\beta + \epsilon, \qquad \epsilon_i \sim \mathcal{N}(0, 1), \quad i = 1, \dots, n, \qquad (8)$$

with observations $y \in \mathbb{R}^n$, independent variables $X \in \mathbb{R}^{n \times p}$, and model coefficients $\beta \in \mathbb{R}^p$. The lasso [13] solves the estimation problem

$$\hat\beta = \operatorname*{argmin}_{\beta}\; (1/2)\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad \lambda \ge 0, \qquad (9)$$
which is an $\ell_1$-regularized version of standard least-squares regression. While straightforward and effective at generating a sparse fit to data, there are some disadvantages to using standard $\ell_1$-regularized methods like the lasso in practice. For example, given a group of correlated predictors, the lasso will tend to select only a subset of "representative" predictors from this group to include in the model [16]. This can make it difficult to interpret $\hat\beta$, as quite different sets of coefficients may be chosen by lasso models fit to different data samples drawn from the same distribution. In this regard the lasso is similar to regression trees and other hard-threshold, high-variance models that tend to benefit from model averaging. In addition, the lasso penalty biases model coefficients towards zero, which (1) makes it difficult to interpret coefficient magnitude in terms of original data units, and (2) severely limits the conditions under which the lasso can perform consistent variable selection [15].
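The selection instability described above is easy to demonstrate empirically. The sketch below is our own illustration using scikit-learn's Lasso (whose alpha parameter corresponds to $\lambda$ only up to the $1/(2n)$ scaling scikit-learn places on the squared-error term); it records how often each variable receives a nonzero coefficient across lasso fits to half-samples of the data.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_variability(X, y, lam, m=50, seed=0):
    """Fraction of half-samples in which each variable gets a nonzero lasso coefficient.

    Widely varying frequencies across correlated variables illustrate the
    instability of lasso selection under resampling.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(m):
        idx = rng.choice(n, size=n // 2, replace=False)     # half-sample of rows
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[idx], y[idx])
        counts += (fit.coef_ != 0)
    return counts / m
```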
3.2 Smoothed and reweighted lasso problems
Methods to overcome both the lasso's sensitivity to correlated data and the ill effects of coefficient bias have been developed, and work by modifying the regularization term in (9). For example, "elastic net" regression adds an $\ell_2$-norm penalty to the existing $\ell_1$-regularization of the lasso. This effectively shrinks the sample covariance estimate towards the identity matrix, stabilizing the estimate and allowing correlated variables to coexist in the model [16]. Coefficient bias, meanwhile, is addressed by the "adaptive lasso" [15], which uses a weighted $\ell_1$ penalty to adaptively penalize large coefficients less, thereby counteracting the bias induced by the penalty term. The "adaptive elastic net" [17] combines both approaches, yielding the estimates

$$\hat\beta = (1 + \rho/2)\, \operatorname*{argmin}_{\beta}\; (1/2)\|y - X\beta\|_2^2 + \rho\|\beta\|_2^2 + \lambda \sum_{j=1}^{p} \hat w_j |\beta_j|, \qquad \lambda, \rho \ge 0, \qquad (10)$$
where the $\hat w_j \ge 0$, $j = 1, \dots, p$, allow reweighting of the $\ell_1$ penalty associated with each coefficient. This approach improves on the lasso, elastic net, and adaptive lasso, particularly in the case of correlated data with many variables and few observations (the correlated "$p \gg n$" problem for $X \in \mathbb{R}^{n \times p}$) [15]. Two important caveats are (i) that the additional coefficient bias from the $\ell_2$-norm penalty must be counteracted with the heuristic "unshrinking" factor $1 + \rho/2$, and (ii) that the $\hat w_j$ must be supplied by the user (some good heuristics are suggested in [15, 17]) or arrived at using greedy recursive estimation of the weights [6]. Below we show that the consensus lasso with ADMM is both stable when fit to correlated data in the $p \gg n$ case, and that it adaptively rescales the model coefficients during model fitting in a manner similar to the adaptive lasso, but with weights determined automatically by the ADMM consensus process.

We consider one more lasso variant, the "random lasso", which will be used in the following section. The random lasso takes the same form as the adaptive lasso (that is, (10) with $\rho = 0$); however, the weights $\hat w_j$ are set to random values in the interval $[\alpha, 1]$, where $\alpha$ is a "weakness" parameter in $(0, 1]$. On its own this is not particularly useful, but combined with "stability selection" it can be very effective for variable selection [10]. In stability selection with the random lasso, m subsamples of size $\lfloor n/2 \rfloor$ are fit with random lassos, and the relative frequency $\hat\Pi_j^\lambda$ with which variable j appears in the different models is taken as an estimate of the relative frequency over all possible subsamples. The $\hat\Pi_j^\lambda$ plotted for all $\lambda \in \Lambda \subseteq \mathbb{R}_+$ trace out "stability paths" for the coefficients (see Figure 2), which show their estimated probability of inclusion as a function of $\lambda$. The set of "stable" coefficients that are chosen is given by

$$\hat S^{\text{stable}} = \Big\{\, j : \max_{\lambda \in \Lambda} \hat\Pi_j^\lambda \ge \tau \,\Big\}, \qquad (11)$$

where $\tau$ is a threshold chosen to control the expected number of false positives and $\Lambda$ is the set of all $\lambda$ values considered [10].
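A compact sketch of stability selection with the random lasso is given below; it is our own illustration rather than the R code used later in the comparisons. The random weights are applied by rescaling the columns of X, which yields the same nonzero pattern as the weighted $\ell_1$ penalty, and all parameter values shown are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lambdas, m=100, alpha_weak=0.5, tau=0.6, seed=0):
    """Sketch of stability selection with the random lasso (section 3.2).

    Random weights w_j in [alpha_weak, 1] are applied by rescaling the columns
    of X (an equivalent reparameterization of the weighted l1 penalty), the
    lasso is fit to m half-samples for every lambda in `lambdas`, and variables
    whose maximum selection frequency exceeds tau are kept, as in (11).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))            # estimated Pi_j^lambda
    for _ in range(m):
        idx = rng.choice(n, size=n // 2, replace=False)
        w = rng.uniform(alpha_weak, 1.0, size=p)  # random "weakness" weights
        Xw = X[idx] / w                           # column rescaling <=> weighted l1
        for k, lam in enumerate(lambdas):
            coef = Lasso(alpha=lam, max_iter=10000).fit(Xw, y[idx]).coef_
            freq[k] += (coef != 0)
    freq /= m
    stable = np.where(freq.max(axis=0) >= tau)[0]  # selection rule (11)
    return stable, freq
```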
3.3 The consensus lasso
Let $D_i = (y_{(i)}, X_{(i)})$ be the ith sample drawn from the data $D = (y, X)$ defined in (8). For the standard lasso problem (9) the general consensus selection problem from section 2.1 becomes

$$\begin{aligned}
\underset{\beta_{(i)},\, z}{\text{minimize}} \quad & \frac{1}{2}\sum_{i=1}^{m} \big\|y_{(i)} - X_{(i)}\beta_{(i)}\big\|_2^2 + \lambda\|z\|_1 & \qquad (12)\\
\text{subject to} \quad & \beta_{(i)} - z = 0, \quad i = 1, \dots, m,
\end{aligned}$$

where $\beta_{(i)}$ are the estimated model coefficients for the ith of m samples $D_1, \dots, D_m$. Following equations (4)-(7), the ADMM updates for the consensus lasso are

$$\beta_{(i)}^{k+1} := \big(X_{(i)}^T X_{(i)} + \rho I\big)^{-1}\big(X_{(i)}^T y_{(i)} + \rho(z^k - \nu_{(i)}^k)\big) \qquad (13)$$

$$z^{k+1} := S_{\lambda/m\rho}\big(\bar\beta^{k+1} + \bar\nu^k\big) \qquad (14)$$

$$\nu_{(i)}^{k+1} := \nu_{(i)}^k + \beta_{(i)}^{k+1} - z^{k+1}. \qquad (15)$$
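A NumPy sketch of updates (13)-(15) follows. Each local Gram matrix is factored once (Cholesky) and the factor reused across iterations, as described in section 3.4; for wide sub-blocks one would instead apply the matrix inversion lemma, which is omitted here. The interface and fixed iteration count are our own choices.

```python
import numpy as np

def consensus_lasso(samples, lam, rho, iters=100):
    """ADMM for the consensus lasso, following updates (13)-(15).

    samples : list of m resampled data sets (y_i, X_i), each X_i with p columns
    """
    m = len(samples)
    p = samples[0][1].shape[1]
    # Precompute Cholesky factors of X_i^T X_i + rho*I and the vectors X_i^T y_i
    factors, Xty = [], []
    for y, X in samples:
        factors.append(np.linalg.cholesky(X.T @ X + rho * np.eye(p)))
        Xty.append(X.T @ y)
    z = np.zeros(p)
    nus = np.zeros((m, p))
    betas = np.zeros((m, p))
    for _ in range(iters):
        for i in range(m):
            # (13): ridge-like primal update via the cached Cholesky factor
            rhs = Xty[i] + rho * (z - nus[i])
            L = factors[i]
            betas[i] = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # (14): soft-threshold the averaged estimate to get the sparse consensus z
        v = betas.mean(axis=0) + nus.mean(axis=0)
        z = np.sign(v) * np.maximum(np.abs(v) - lam / (m * rho), 0.0)
        # (15): dual updates
        nus += betas - z
    return z, betas
```

Because the primal and dual updates are independent across samples, the loop over i parallelizes directly (for example with Python's multiprocessing, as described in section 3.4).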
It is interesting to note that the first two ADMM steps, (13) and (14), are a Tikhonov (or "ridge") regression [14] and a soft-thresholding [7] step, respectively, which together constitute an elastic net regression [16]. We might therefore expect $\rho$ to improve model fit to correlated data by effectively decorrelating the variables and stabilizing the sample correlation matrix. Empirically, increasing $\rho$ does appear to stabilize a model fit to correlated data, reducing model variability from iteration to iteration and decreasing prediction error on new data (compare Figure 1a and 1b). However, in contrast to the elastic net, here the regularization is applied to the primal residual rather than to the coefficients themselves, and so should leave the coefficients unbiased as the residual diminishes over iterations.

Finally, we note that the element-wise soft-thresholding operation in (14) can be rewritten as

$$S_{(\lambda/m\rho) - \xi_j}\big(\bar\beta_j^{k+1}\big) = \operatorname{sign}\big(\bar\beta_j^{k+1}\big)\big(|\bar\beta_j^{k+1}| - (\lambda/m\rho) + \xi_j\big)_+,$$

where $\xi_j = |\bar\beta_j^{k+1} + \bar\nu_j^k| - |\bar\beta_j^{k+1}|$. Back-calculating the lasso problem that this reformulated soft-thresholding operator solves (so that the new lasso penalty parameter is $\lambda'_j = \lambda(1 - m\rho\,\xi_j)$), we see that the consensus update is equivalent to

$$z^{k+1} := \operatorname*{argmin}_{z}\; (m\rho/2)\big\|z - (\bar\beta^{k+1} + \bar\nu^k)\big\|_2^2 + \lambda \sum_{j=1}^{p} \hat w_j |z_j|, \qquad (16)$$

which is an adaptive lasso problem with weights

$$\hat w_j = 1 - m\rho\big(|\bar\beta_j^{k+1} + \bar\nu_j^k| - |\bar\beta_j^{k+1}|\big). \qquad (17)$$
Figure 1: (a-b) Effects of regularization parameter $\rho$ on model stability given correlated data. (a) Prediction error on out-of-sample data after each of 100 iterations for 100 different values of the $\ell_1$-regularization parameter $\lambda$ (each shown as a different line) with $\rho = 1$, and (b) with $\rho = 100$ (for 50 iterations). (c-d) Empirical estimates of the penalty function applied to each coefficient as a function of its size. Red points correspond to nonzero coefficients, and the effective penalty function is given by the second term in equation (17).

Examining the weights, we see that they lie in the interval [0, 1] and change the threshold applied to each element of $z^k$ based on the relationship between the averages of the primal and dual variables across samples. For example, if $\bar\beta_j^k$ and $z_j^k$ are both nonnegative for all k, then if $\bar\beta_j^k$ is reliably larger than $z_j^k$ the average integrated residual $\bar\nu_j^k$ will grow positively, decreasing $\hat w_j$ and relaxing the penalty on $z_j^k$. The dual variables therefore iteratively adjust the threshold on each variable to approximate a nonconvex penalty on the consensus coefficients $z$. This is shown empirically in Figure 1c, where the values of the primal and dual variables at the final model iteration (for a fixed $\lambda$) were used to plot $m\rho\big(|\bar\beta_j^{k+1} + \bar\nu_j^k| - |\bar\beta_j^{k+1}|\big)$ as a function of coefficient value. The shape of the resulting curve follows a nonconvex profile resembling an $\ell_q$-norm penalty, where $\|\beta\|_q = \sum_{j=1}^{p} |\beta_j|^q$ and $q < 1$. Such a penalty shrinks small coefficients more than large ones, approaching zero penalty as $m\rho\big(|\bar\beta_j^{k+1} + \bar\nu_j^k| - |\bar\beta_j^{k+1}|\big) \to 1$.
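For completeness, the implied weights (17) and the corresponding reweighted penalties $\lambda'_j = \lambda \hat w_j$ (used to produce plots like Figure 1c-d) can be computed directly from the averaged primal and dual variables; the function below is our own transcription.

```python
import numpy as np

def consensus_weights(beta_bar, nu_bar, lam, m, rho):
    """Adaptive weights (17) implied by the consensus soft-threshold step,
    together with the per-coefficient penalties lambda'_j = lam * w_j in (16)."""
    xi = np.abs(beta_bar + nu_bar) - np.abs(beta_bar)   # xi_j from section 3.3
    w = 1.0 - m * rho * xi                              # equation (17)
    return w, lam * w
```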
3.4 Resampling and computational details
There are many ways to resample data $D = (y, X)$ in order to construct $D_i = (y_{(i)}, X_{(i)})$, $i = 1, \dots, m$, including cross-validation, bootstrapping, various approaches to subsampling the observations, and schemes that subsample both the columns and rows of $X \in \mathbb{R}^{n \times p}$. In the experiments that follow we use two strategies. The first is a common subsampling scheme: drawing $m = \lfloor n/2 \rfloor$ subsamples without replacement across all p variables. This strategy approximates the bootstrap but is more computationally efficient [4]. The second approach probes how well consensus selection tolerates being decomposed by subsampling many small "sub-blocks" from across both the rows and the columns of the data matrix to arrive at a consensus solution. We note that such a variable-subsampling approach could also serve the additional function of decorrelating the variables during the fitting procedure (in the spirit of random forests).

A direct method was used to compute update (13), with the factorization calculated once and reused in all subsequent updates. In cases where $k > m$ for $X_{(i)} \in \mathbb{R}^{m \times k}$, the matrix inversion lemma was used to speed computation. The consensus selection by ADMM code was written in Python, using the RPy2 library to call R functions for the median probability and stability selection methods used in the comparisons that follow. Python's multiprocessing package was used to parallelize the primal and dual updates on a 24-core workstation.
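The two sampling schemes amount to drawing index sets; a sketch is given below (our own code, not the author's). For the block scheme, each (rows, cols) pair defines a small local data set y[rows], X[rows][:, cols], with the local coefficients understood to live on the sampled columns of the global consensus vector.

```python
import numpy as np

def half_samples(n, m, seed=0):
    """m subsamples of the row indices, each of size n // 2, drawn without replacement."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n, size=n // 2, replace=False) for _ in range(m)]

def block_samples(n, p, n_blocks, block=20, seed=0):
    """Random block x block sub-blocks of rows and columns of an n x p data matrix."""
    rng = np.random.default_rng(seed)
    return [(rng.choice(n, size=block, replace=False),
             rng.choice(p, size=block, replace=False))
            for _ in range(n_blocks)]
```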
4 Experiments on simulated data

4.1 Comparison to median probability model and stability selection
Figure 2: Coefficient paths for model fits to artificially generated data ("model 1", described in section 4). Red lines denote "signal" coefficients that should be nonzero; grey lines are noise variables that should be set to zero. The x-axis gives the value of the sparsity-governing parameter $\lambda$, and variables enter the model as this parameter is relaxed (from left to right). (a) A single lasso fit to one subsample. (b) Median lasso paths, (c) stability selection paths, and (d) consensus lasso paths fit over 50 subsamples of size $\lfloor n/2 \rfloor$.

Figure 3: Performance measures for the median probability (grey), stability selection (blue), and consensus selection (red) methods, each fit to 10 instances of each of the artificial data sets described in section 4. The top row shows ROC curves while the bottom row plots two types of error: mean coefficient error in the $\ell_1$-norm (x-axis) versus mean squared prediction error on new data (y-axis). The distinct points in the bottom plots correspond to different $\lambda$ for the median and consensus models and different thresholds $\tau$ for stability selection. The data types "adjacent" and "random index" correspond to the two artificial data sets described in section 4: one with correlated signal variables adjacent to one another and the other with a randomly distributed signal. The sampling schemes used are "standard", i.e. $m = \lfloor n/2 \rfloor$ samples drawn without replacement, and "block", in which $20 \times 20$ submatrices are sampled at random from the full data matrix. The latter approach is only used for consensus lasso fitting (in the right-hand plots); the other methods use "standard" half-sample resampling in all cases. Consensus selection reliably outperforms the other methods on these data.

Consensus selection for the lasso was compared to two recent methods for model selection in $\ell_1$-regularized problems: the median probability model [1] and stability selection with the random lasso [10] (described in section 3.2). Performance measures were (i) correct variable selection (summarized by a receiver operating characteristic (ROC) curve) and (ii) generalization performance on new (out-of-sample) data, measured by mean squared error (see Figure 3). Two different true models were used:

model 1 (adjacent signal variables): $\quad \beta_j = \begin{cases} 1 & \text{if } j = 1, \dots, 5 \\ -1 & \text{if } j = 6, \dots, 10 \\ 0 & \text{otherwise} \end{cases}$

model 2 (randomly indexed signal variables): $\quad \beta_j = \begin{cases} (-1)^j & \text{for 10 random indices } j \\ 0 & \text{otherwise.} \end{cases}$
In the first model the 10 signal variables are all adjacent (and thus correlated). In the second the variables are randomly distributed among the 1000 columns of the data matrix, making them more likely to be independent of each other but correlated with noise variables. For both models, ten problem instances, each with n = 100 observations on p = 1000 variables, were generated according to (8), with $X_i \sim \mathcal{N}(0, \Sigma)$, $\Sigma_{ij} = 0.9^{|i-j|}$, and the true coefficients described by one of the models above. Signal-to-noise, measured as $\|X\beta^{\text{true}}\|_2^2 / \|\epsilon\|_2^2$, was approximately $10 \pm 5$ across the data sets. For all methods, m = 50 subsamples of size $\lfloor n/2 \rfloor$ were drawn with replacement and fit over a grid of 50 values of $\lambda$. In addition, consensus selection was also fit on smaller, randomly sampled $20 \times 20$ "sub-blocks" as described in the previous section (with results shown in the right-hand panels of Figure 3). The median probability model and stability selection methods used the glmnet package [8] in R to fit the lasso. The basic median probability model was found by taking the median of the coefficients over all standard lasso fits to the m samples. The stability selection results were obtained by (i) fitting m random lasso models and using the stability selection criterion (11) to choose the stable variable indices, and then (ii) fitting a model to only the chosen variables over all the data to obtain the final model.
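A sketch of this data-generating process is shown below; it follows the description above but is our own reconstruction, and the seed handling and function signature are arbitrary.

```python
import numpy as np

def simulate(n=100, p=1000, adjacent=True, seed=0):
    """Generate one problem instance as described in section 4:
    model (8) with covariance Sigma_ij = 0.9**|i-j|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.9 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    if adjacent:                       # model 1: adjacent signal variables
        beta[:5] = 1.0
        beta[5:10] = -1.0
    else:                              # model 2: randomly indexed signal variables
        j = rng.choice(p, size=10, replace=False)
        beta[j] = (-1.0) ** j
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```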
4.2 Results
Example coefficient paths for the lasso fit to a single sample from model 1 over a grid of $\lambda$ values are shown in Figure 2a, while paths chosen using the median probability model and stability selection over 50 subsamples are shown in Figures 2b and 2c, respectively. In this case the individual lasso fit is very noisy, including many grey noise variables before any red signal variables enter the model. The median probability model over 50 subsamples does better in terms of avoiding noise variables, but is very conservative. Stability selection works in a different way, since by criterion (11) we choose the variables whose maximum stability path value exceeds a threshold for any $\lambda \in \Lambda$. By this criterion we would choose two true variables before including a false positive. Consensus selection chooses four true variables before including a false positive.

More informative than coefficient paths are concrete measures of model performance. The competitiveness of consensus selection compared to the other state-of-the-art variable selection methods is apparent in Figure 3. The average ROC curves for 10 different fits to data generated as before, with all models tested on out-of-sample data, demonstrate superior model selection properties, while the coefficient-error versus prediction-error curves show that consensus selection improves over the other methods in both model bias and model variance, possibly by iterating between bagging-like (variance-reducing) and boosting-like (bias-reducing) steps during the fitting procedure.

Finally, we note that consensus selection, when fit only to $20 \times 20$ sub-blocks sampled randomly from the rows and columns of the full matrix $X \in \mathbb{R}^{100 \times 1000}$ (with a total of 20 column and 100 row samples for 2000 total $20 \times 20$ blocks), still outperformed the other methods. This is interesting for at least three reasons: (i) it suggests that consensus selection, like boosting, can combine many "weak learners" into a successful estimator; (ii) it is reminiscent of random forests, where subsampling the variables in addition to the observations during the fitting procedure serves to decorrelate submodels and thereby reduce the variance of the ensemble model [4]; and (iii) it indicates that consensus selection models like those above may be fit blockwise to very large data by leveraging massively parallel devices like commodity graphics hardware (GPUs).
4.3 Summary
In the case of the lasso, consensus selection is competitive with state-of-the-art methods for model selection, and yields fits that generalize well to new data. It admits several interpretations that relate the stages of the algorithm to existing estimation methods, and that help explain its positive performance on sparse, correlated data, where it decreases both model variance and model bias relative to other state-of-the-art methods. These results are easily extensible to more general $\ell_1$-penalized models, and in cases where it is natural to decompose the problem across variables, the ADMM formulation will scale to very large problems.
References

[1] M. Barbieri and J. Berger. Optimal predictive model selection. Annals of Statistics, 32:870–897, 2004.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[3] L. Breiman. Bagging predictors. Machine Learning, 26:123–140, 1996.

[4] P. Bühlmann. Bagging, boosting and ensemble methods. In J. E. Gentle, W. Härdle, and Y. Mori (eds), Handbook of Computational Statistics, pages 877–907, 2004.

[5] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. PNAS, 100(5):2197–2202, 2003.

[6] E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14:69–82, 2008.

[7] D. L. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.

[8] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

[10] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72(4):417–473, 2010.

[11] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[12] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 50:259–268, 1992.

[13] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[14] A. Tikhonov. On the stability of inverse problems. 1943.

[15] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(467):1418–1429, 2006.

[16] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 2005.

[17] H. Zou and H. Zhang. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 2009.