Model-Averaged ℓ1 Regularization using Markov Chain Monte Carlo Model Composition
Technical Report No. 541, Department of Statistics, University of Washington
Chris Fraley* and Daniel Percival†
August 22, 2008; revised May 14, 2010
Abstract

This paper studies combining ℓ1 regularization and Markov chain Monte Carlo model composition techniques for Bayesian model averaging (BMA). The main idea is to resolve the model uncertainty issues arising from path point selection by treating the ℓ1 regularization path as a model space for BMA. The method is developed for linear and logistic regression, and applied to sample classification in four different biomedical datasets.
1 Introduction
In regression and classification problems, the goal is to predict a response from inputs, or predictors. In many modern data applications, a great number of predictors are available. In this case, dimension reduction is critical for model fitting, and often improves prediction accuracy and interpretability. Variable selection is the process of choosing a few predictors with which to build the regression or classification model. However, most variable selection methods raise the following question: once a set of variables is chosen, how certain are we that this is the correct set? Could another set do nearly as well? These questions are at the heart of model uncertainty. Bayesian model averaging (BMA) seeks to address model uncertainty by taking a weighted average over a class of models under consideration (see Hoeting et al. (1999) for an introduction). BMA techniques require a model space over which to average, and Hoeting et al. identify two approaches for determining the model space. The first approach, due to Volinsky et al. (1997), uses the 'leaps and bounds' algorithm (Furnival and Wilson 1974) to obtain a set of candidate models. The second approach uses Markov chain Monte Carlo model composition (MCMCMC), due to Madigan and York (1995), to directly approximate the posterior distribution. In this paper, we consider ℓ1 regularization as a method for determining the model space for BMA. Regularization is a popular "automatic" dimension reduction and variable selection technique. In the case of ℓ1 penalized linear regression, also known as the 'lasso' (Tibshirani 1996), one solves the following optimization problem:

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 + \lambda \sum_j |\beta_j| ,$$
* Insightful Corporation (now at Insilicos LLC) and Department of Statistics, University of Washington
† Department of Statistics, Carnegie Mellon University
in which λ is a nonnegative penalty parameter. For λ = 0, we recover the ordinary least squares (OLS) solution, and for λ sufficiently large we have β̂ = 0. Below this threshold, each particular choice of λ gives a different estimate β̂. This progression of the value of β̂ is the regularization path. As λ grows from zero, the amount of regularization increases. This has the effect of shrinking the entries of β towards zero and (due to the ℓ1 penalty) making some entries exactly zero, giving the lasso its popular sparsity property. A similar effect occurs in other settings, in particular for ℓ1 regularized logistic regression.

For BMA, the 'leaps and bounds' approach was extended by Yeung et al. (2005) to apply to 'wide' datasets, in which there are many more measurements than samples (p > n in statistical terminology), as is common in bioinformatics applications. Fraley and Seligman (2010) replace the models obtained by 'leaps and bounds' with those defined by the ℓ1 regularization path, yielding a method suitable for wide as well as narrow datasets. In this paper we treat the ℓ1 regularization path as a model space for MCMCMC, and develop a technique combining regularization and model averaging with the aim of resolving the model uncertainty issues arising from path point selection.

We first review the general Markov chain Monte Carlo model composition approach to BMA. We then outline a version of MCMCMC that uses the ℓ1 regularization path as the model space. Finally, we present examples of applications of the method.
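To make the regularization path concrete, the following minimal sketch traces an ℓ1 path on simulated data. It uses scikit-learn's lasso_path (the implementation in this paper uses the R package glmnet instead); the simulated data and the grid of 50 penalty values are illustrative choices, not part of the paper.

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))                     # n = 100 samples, p = 10 predictors
    beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7) # only three truly nonzero coefficients
    y = X @ beta_true + rng.normal(size=100)

    # lasso_path returns coefficient vectors at a decreasing sequence of penalty values;
    # as the penalty grows, more entries of beta are exactly zero.
    lambdas, coefs, _ = lasso_path(X, y, n_alphas=50)
    for lam, b in zip(lambdas[::10], coefs.T[::10]):
        print(f"lambda = {lam:.3f}, nonzero coefficients = {np.count_nonzero(b)}")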
2 Model Averaging using Regularization Paths

2.1 Markov Chain Monte Carlo Model Composition
Suppose we are given data D and a model space M, and wish to compute a quantity ∆. For example, in a regression problem, ∆ can be a single entry of β or the probability that a particular entry of β is zero. The posterior distribution of ∆ is given by:

$$P(\Delta \mid D) = \sum_{M \in \mathcal{M}} P(\Delta \mid M, D)\, P(M \mid D).$$

We can estimate this quantity via MCMC by constructing a Markov chain with equilibrium distribution P(M | D), the posterior distribution over the model space. This process is called Markov chain Monte Carlo model composition (MCMCMC, or MC³). Indexing the chain as {M(t)}, t = 1, 2, ..., T, we can then estimate ∆ by taking the following average:

$$\hat{\Delta} = \frac{1}{T} \sum_{t=1}^{T} g(M(t)),$$
where g(M(t)) calculates the quantity ∆ for the model M(t). By the law of large numbers, this estimate converges almost surely to the true value of ∆. Such a chain with the desired equilibrium property can be constructed via the Metropolis-Hastings (M-H) algorithm. Consider a model N ∈ M. At each step of the M-H algorithm, we must "propose" a new model N′ ∈ M. This is done via a proposal distribution P(N′ | N). One way to construct such a distribution is to first define a neighborhood around N, and then propose a new model from this neighborhood. For example, in the case of linear regression, the set of models formed by all possible subsets of a set of candidate covariates can be taken as the model space. The neighborhood of a particular model can then be defined as all models obtained by adding or removing exactly one covariate.
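As an illustration, the following minimal sketch implements this kind of one-covariate neighborhood proposal for the all-subsets model space. The function name and the representation of a model as a set of covariate indices are illustrative assumptions, not notation from the paper; note that this proposal is symmetric, which matters for the acceptance probability below.

    import random

    def propose_neighbor(included, p):
        """Propose a neighbor of the model `included` (a set of covariate indices,
        out of p candidates) by adding or removing exactly one covariate."""
        candidate = set(included)
        j = random.randrange(p)      # choose a covariate uniformly at random
        if j in candidate:
            candidate.remove(j)      # remove it if it is already in the model ...
        else:
            candidate.add(j)         # ... otherwise add it
        return candidate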
Once we have constructed the proposal distribution and drawn from it, the new proposal is "accepted" with probability

$$P(\text{Accept } N') = \min\left\{1,\; \frac{P(N \mid N')\, P(N' \mid D)}{P(N' \mid N)\, P(N \mid D)}\right\}.$$

Note that if the proposal distribution is symmetric, i.e. P(N′ | N) = P(N | N′), then this is simply the ratio of posterior model probabilities. If we accept N′, then N′ becomes the current model N for the next iteration, and a new N″ is proposed conditional on it. Otherwise, the original model N is kept and another N′ is proposed as before. The equilibrium distribution of this chain is then P(M | D). There are two difficulties in this setup. First, we must construct a sensible proposal distribution. In particular, the concept of a neighborhood in a model space can be hard to define. Later, we will see that this idea becomes quite natural in the context of a regularization path. The second difficulty is calculating the ratio of posterior model probabilities P(N′ | D)/P(N | D). This requires the computation of an integrated likelihood for each model, a notoriously difficult problem.
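A minimal sketch of a single Metropolis-Hastings step over a generic model space, using the acceptance probability above on the log scale. The helpers log_posterior (an unnormalized log of P(M | D)) and log_proposal_density (with log_proposal_density(a, b) meaning log P(a | b)) are hypothetical placeholders, not functions defined in the paper.

    import math
    import random

    def mh_step(current, propose, log_posterior, log_proposal_density):
        """One Metropolis-Hastings step; returns the next state of the chain."""
        candidate = propose(current)
        # log acceptance ratio: log[ P(N|N') P(N'|D) / ( P(N'|N) P(N|D) ) ]
        log_ratio = (log_proposal_density(current, candidate) + log_posterior(candidate)
                     - log_proposal_density(candidate, current) - log_posterior(current))
        if math.log(random.random()) < min(0.0, log_ratio):
            return candidate   # accept N'
        return current         # reject: keep N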
2.2 Combination Algorithm
We propose to combine MCMCMC and regularization by using the ℓ1 regularization path as the model space M for MCMCMC. When BMA is applied to a model space generated by ℓ1 regularization, MCMCMC has particular appeal, since the metric on this model space is quite natural. The ℓ1 regularization path can be represented by the interval [0, λmax], which is simply the range of possible λ penalization values; that is, a point α ∈ (0, λmax) corresponds to:

$$\hat{\beta}_\alpha = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 + \alpha \sum_j |\beta_j| .$$
We can thus write Mα to represent the model from this space at the point λ = α along the ℓ1 regularization path. Under this representation, λmax corresponds to the intercept-only model and 0 to the unpenalized regression. A similar setup can be used for the "elastic net" regularization technique (Zou and Hastie 2005). Since neighborhoods on the real line are intuitive, this model space framework is quite natural. Our combination algorithm then proceeds as follows (a code sketch is given after the list):

1. Begin by choosing a starting point α ∈ [0, λmax]. This may be done randomly or deterministically.

2. Propose a new α′ (discussed below).

3. Calculate P(Accept Mα′) = min{1, P(Mα′ | D) / P(Mα | D)}.

4. Draw a single U(0, 1) variate to determine acceptance. Set α = α′ if accepted; keep the current α otherwise.

5. Repeat steps 2–4 a large number of times.

6. Discard the first portion of this sequence of α values, and use the remaining tail as an approximation of the model space distribution (this is the distribution of the single parameter α).
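A minimal sketch of this loop, assuming a hypothetical helper log_marginal(alpha) that returns an approximation of the log integrated likelihood of the model at penalty value alpha (for instance -BIC/2, as discussed below). The uniform random-walk proposal, the tuning constant a, and the clipping of proposals back onto the interval are illustrative choices rather than the paper's exact implementation, which was written in R using glmnet.

    import numpy as np

    def mc3_over_path(log_marginal, lambda_max, a=0.05, n_iter=10_000, burn_in=2_500, seed=1):
        rng = np.random.default_rng(seed)
        alpha = rng.uniform(0.0, lambda_max)            # step 1: random starting point
        draws = []
        for _ in range(n_iter):
            # step 2: propose alpha' from U(alpha - a, alpha + a), mapped back onto the path
            proposal = float(np.clip(alpha + rng.uniform(-a, a), 0.0, lambda_max))
            # step 3: acceptance probability min(1, P(M_alpha' | D) / P(M_alpha | D))
            log_accept = min(0.0, log_marginal(proposal) - log_marginal(alpha))
            # step 4: accept or reject with a single uniform draw
            if np.log(rng.uniform()) < log_accept:
                alpha = proposal
            draws.append(alpha)
        # step 6: discard the burn-in and return the tail as the approximate posterior sample
        return np.array(draws[burn_in:])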
The MCMCMC method described above was implemented for elastic net paths for both normal linear regression and binomial family generalized linear models. The elastic net models were fit using the R package glmnet (Friedman et al. 2010). Note that ℓ1 regularization is a special case of the elastic net. Several issues arise with this algorithm, which we now discuss.

Choosing a proposal distribution. In the second step, the proposal distribution must be adjusted to ensure good properties. This raises three main concerns: (1) the acceptance rate, i.e., how often a proposed model (α) is accepted; (2) the autocorrelation of the samples (high autocorrelation is not desired); and (3) the potential for the chain to become stuck at a certain point in the middle of the path, or to be forced to its ends. Using either a U(α − a, α + a) or a N(α, a) proposal distribution can produce good results with respect to concerns (1) and (2). However, concern (3) is not fully addressed, since this scheme has the potential to become trapped at the ends of the interval representing the regularization path. When the chain is near an end, a significant part of the density of either proposal distribution lies outside of the space. If one of these points is chosen, it is simply mapped to the border, which creates an attracting effect.

There are two main ways in which this attracting effect could be resolved. First, a new proposal distribution could be devised in which the distributions are truncated to exclude any support outside of the model space. This approach produces the opposite effect: the endpoints become repelling. Another possibility is to alter the truncation to keep the distributions symmetric around the current model; however, this only enhances the attracting effect of the endpoints. A second approach is to re-parameterize the model space to be constraint free, for example via a logit transform. A drawback is that there remain issues with the "endpoints", which the transform distorts to occupy large areas of the real line, again creating an attracting effect. For now, the best approach is to choose the tuning parameter a wisely. This choice must take into account both the attraction concern and the number of modes in the posterior distribution over the model space. If the posterior is suspected to be unimodal, then a small tuning parameter is acceptable, and can be chosen small enough that the chain never reaches one of the endpoints. If the posterior has many modes, then a large tuning parameter allows the chain to jump between modes easily. Note that the notions of small and large are relative to the value of λmax.

Approximating Integrated Likelihood. Another major issue in this algorithm is computing the quantity P(Mα′ | D)/P(Mα | D) for use in step 3. Noting that P(M | D) ∝ P(D | M) P(M), there are two issues in this computation. First, a prior must be placed on the model space. At this time, we use a flat prior, so that all models are considered a priori equally likely; this effectively eliminates the prior terms from all calculations. Second, the integrated likelihood P(D | M) is a notoriously difficult quantity to compute. At this time, we use the BIC approximation. The main way to improve upon the BIC approximation would be to apply some sort of Monte Carlo integration technique.
As noted in Tibshirani (1996), the ℓ1 penalized regression model corresponds to a Bayesian regression model with independent double exponential priors on each coordinate of β, with the posterior mode used as the coefficient estimate. This gives a basic setup for Monte Carlo integration. However, in many cases the resulting integral is very high dimensional. Bayesian approaches to the lasso have been the subject of recent research (Park and Casella 2008; Hans 2009).

Length of Chains and Mixing. As in all Monte Carlo techniques, the length of the chain (the number of times steps 2–4 are repeated in the algorithm) and the convergence of the chain are important concerns. In our examples we found that 10,000 iterations, with the first 2,500 discarded,
were sufficient. A related concern is autocorrelation in the chain. We suggest regressing the sequence of α values on lagged copies of itself; choosing the smallest lag that gives a small R² roughly eliminates the autocorrelation. A sketch of this heuristic follows.
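A minimal sketch of the lag-selection heuristic, assuming alphas is the post-burn-in sequence of α draws. The R² threshold of 0.05 and the maximum lag are illustrative choices, not values prescribed by the paper.

    import numpy as np

    def choose_thinning_lag(alphas, max_lag=50, r2_threshold=0.05):
        """Smallest lag at which regressing the chain on its lagged copy gives a small R^2."""
        alphas = np.asarray(alphas)
        for lag in range(1, max_lag + 1):
            # R^2 of the simple regression of alphas[lag:] on alphas[:-lag] equals the
            # squared correlation between the two sequences
            r2 = np.corrcoef(alphas[:-lag], alphas[lag:])[0, 1] ** 2
            if r2 < r2_threshold:
                return lag
        return max_lag

    # Usage: keep every lag-th draw to roughly remove autocorrelation.
    # thinned = alphas[::choose_thinning_lag(alphas)]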
2.3 Processing MCMCMC Results
Post-processing of an MCMCMC analysis is straightforward. In the algorithm described above, we store at each step the coefficients of the current point (α′ if it was accepted, α otherwise). These stored values can be used for all calculations by simple averaging or by taking quantiles. Some items of interest are easily obtained (a sketch of these computations follows the list):

• The posterior model space distribution. An estimate of the posterior distribution over the model space can be obtained from the α values stored in the tail, via a density estimator. In the examples in the next section, we use a histogram of the tail α values.

• The probability of a coefficient being equal to zero. This is the fraction of models in the tail with that coefficient equal to zero.

• The value of a coefficient. This is the average over all of the stored values of that coefficient in the tail.

• Fitted values. Fitted values can be produced using the averaged coefficient values obtained above. Binomial regression problems can be further investigated by applying a decision boundary to produce classes.

• Credible intervals for a coefficient. These can be found by taking quantiles of the coefficient values stored in the tail.
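A minimal sketch of these summaries, assuming coef_tail is an array of shape (n_draws, p) holding the coefficient vector stored at each retained step of the chain; the function name and the 95% interval level are illustrative.

    import numpy as np

    def summarize(coef_tail):
        """Posterior summaries from the stored tail of the chain."""
        prob_zero = (coef_tail == 0.0).mean(axis=0)     # probability each coefficient is zero
        beta_hat = coef_tail.mean(axis=0)               # model-averaged coefficient estimates
        lower, upper = np.quantile(coef_tail, [0.025, 0.975], axis=0)  # 95% credible intervals
        return prob_zero, beta_hat, lower, upper

    # Fitted values come from the averaged coefficients, e.g. y_hat = X @ beta_hat;
    # for binomial models a decision boundary on the fitted probabilities yields classes.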
3 Examples

3.1 Example 1: Diabetes Data
The diabetes dataset used in Efron et al. (2004) is used to demonstrate the algorithm applied to a linear regression problem. An ℓ1 penalized linear regression model was fit to the data using glmnet. The usual coefficient plot and cross-validation results are displayed in Figure 1. According to the cross-validation results, the entire last half of the regularization path could contain a good choice for a single point. This displays a degree of model uncertainty that encourages application of our combination technique (a cross-validation sketch is given below). MCMCMC was run for 10,000 iterations, with the first 2,500 discarded to allow time for convergence. For analysis, only every 6th draw from this tail was used, to reduce autocorrelation. 59% of all moves in the model space were accepted, and of these 86% were moves made with probability one. The MSE for the "averaged" model was 2884.7, obtained by taking the coordinate-wise mean of the β vector over the tail of the Markov chain. The MSE for an OLS fit of the data is 2933. The process thus gives a marginal improvement.
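The following minimal sketch reproduces the cross-validation diagnostic in Python, assuming the copy of the Efron et al. (2004) diabetes data shipped with scikit-learn (the fits in this paper used glmnet in R). The fold count, penalty grid, and the 5% tolerance used to flag "near-optimal" penalties are illustrative choices.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoCV

    X, y = load_diabetes(return_X_y=True)
    cv_fit = LassoCV(n_alphas=100, cv=10).fit(X, y)

    # Mean cross-validated MSE at each penalty value; a long stretch of near-optimal
    # penalties along the path signals the model uncertainty discussed above.
    mean_mse = cv_fit.mse_path_.mean(axis=1)
    near_optimal = cv_fit.alphas_[mean_mse < 1.05 * mean_mse.min()]
    print("penalty at minimum CV error:", cv_fit.alpha_)
    print("near-optimal penalties range from", near_optimal.min(), "to", near_optimal.max())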
Figure 1: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross-validation of MSE over the ℓ1 regularization path. Bottom left: histogram of the posterior distribution over the model space (P(M | D)). Bottom right: plot of BIC over the path.
The remaining examples are all binomial family generalized linear models.
3.2 Example 2: South African Heart Disease Data
The South African Heart Disease data was used by Park and Hastie (2007) to demonstrate the glmpath algorithm. This data set consists of nine covariates, with the response being the presence of heart disease, a binary outcome. Cross-validation on an ℓ1-regularized fit of the data suggests model uncertainty issues (see Figure 2). The MCMCMC algorithm was run for 10,000 iterations, with the first 2,500 discarded to allow time for convergence. 54% of all proposed moves were accepted, with 99% of these made with probability one. Of the 462 cases fit, 332 were predicted correctly using the mean coefficient vector and a decision boundary at 0.5 (a prediction sketch follows). This is comparable to standard techniques.
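A minimal sketch of classification from the averaged coefficients in a binomial model, assuming beta_hat (with the intercept as its first entry) was computed by averaging the stored coefficient vectors over the tail of the chain; the function name and the 0/1 class encoding are illustrative.

    import numpy as np

    def predict_classes(X, beta_hat, boundary=0.5):
        """Class predictions from averaged coefficients via the logistic link."""
        X1 = np.column_stack([np.ones(len(X)), X])       # prepend an intercept column
        prob = 1.0 / (1.0 + np.exp(-(X1 @ beta_hat)))    # fitted probabilities
        return (prob > boundary).astype(int)

    # Usage: accuracy = (predict_classes(X, beta_hat) == y).mean()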
Figure 2: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross-validation of log-likelihood over the ℓ1 regularization path. Bottom left: histogram of the posterior distribution over the model space (P(M | D)). Bottom right: plot of BIC over the path.
3.3 Example 3: Leukemia Data
The leukemia data was used by Yeung et al. (2005) to demonstrate the iterative Bayesian model averaging algorithm for wide datasets. We consider only the two-class version of the data set, which consists of 38 training samples and 34 test samples. There are 3051 covariates, each representing the expression of a gene. The response is the presence of leukemia. Cross-validation over the ℓ1 regularization path again shows potential model uncertainty issues. The MCMCMC algorithm was run for 10,000 iterations, with the first 2,500 discarded to allow time for convergence. 53% of proposed moves were accepted, with 92% of these moves made with probability one. Every 5th observation of the tail was used to reduce autocorrelation. See Figure 3 for plots of the results. The "averaged" model correctly predicted 33 of the 34 cases in the test set. As in Yeung et al. (2005), we also evaluate performance using the Brier score:

$$\text{Brier Score} := \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 .$$
The Brier score for this model was 1.385. Yeung et al. achieved a score of 1.5, with 33 of 34 test cases predicted correctly. Alternatively, we can drop all covariates with P(βi = 0) > .5, keeping only those more likely than not to be nonzero, so that our solution is sparse. This strategy leaves 18 genes (compared to 20 in Yeung et al.) plus an intercept. This model predicts 31 of 34 test cases correctly, and has a Brier score of 1.876. A sketch of these computations follows.
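A minimal sketch of the Brier-score evaluation and the sparsification rule just described, assuming prob_zero holds the posterior probability that each coefficient is zero and coef_tail the stored coefficient draws; the 0.5 threshold comes from the text above, the rest is illustrative.

    import numpy as np

    def brier_score(y_true, y_prob):
        """Mean squared difference between outcomes and predicted probabilities."""
        return np.mean((np.asarray(y_true) - np.asarray(y_prob)) ** 2)

    def sparsify(coef_tail, prob_zero, threshold=0.5):
        """Zero out coefficients that are more likely than not to be zero."""
        beta_hat = coef_tail.mean(axis=0)
        beta_hat[np.asarray(prob_zero) > threshold] = 0.0
        return beta_hat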
Figure 3: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross-validation of log-likelihood over the ℓ1 regularization path. Bottom left: histogram of the posterior distribution over the model space (P(M | D)). Bottom right: plot of BIC over the path.
3.4 Example 4: Breast Cancer Data
The breast cancer data was also used by Yeung et al. (2005) to demonstrate iterative BMA. This dataset contains about 10,000 covariates, 78 training cases, and 19 test cases. An ℓ1 regularized fit of the data, with the optimal point chosen by AIC or BIC, gives very poor performance on the test data. The MCMCMC algorithm was run for 10,000 iterations, with the first 2,500 discarded to allow time for convergence. 53% of proposed moves were accepted, with 92% of these moves made with probability one. Every 5th observation of the tail was used to reduce autocorrelation. See Figure 4 for plots of the results. The "averaged" model predicts 11 of the 19 test cases correctly, with a Brier score of 4.79. This suggests a weakness in the technique, which may stem from the poor performance of ℓ1 regularization itself on this dataset.
Figure 4: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross-validation of log-likelihood over the ℓ1 regularization path. Bottom left: histogram of the posterior distribution over the model space (P(M | D)). Bottom right: plot of BIC over the path.
4 Future Work
There are several areas of potential improvement in this combination algorithm. First, there are issues with the proposal distribution in the Metropolis-Hastings step, the most significant being a tendency to become trapped near the edges of the regularization path. A new proposal distribution, or an automatic way to choose the tuning parameter of the current one, could solve this issue; re-parameterization of the model space could also help. Second, the BIC approximation is fast, but in many cases poor. BIC approximations are worst when the number of observations is very small compared to the number of covariates, which is precisely the situation where regularization techniques are most desirable. We would therefore prefer that the algorithm did not rely so heavily on BIC. Computing the integrated likelihood is a hard problem, so this improvement will probably be the most challenging to make. Third, the algorithm did not work well compared to other methods on the fourth example data set. We suspect this is a result of poor performance of ℓ1 regularization on that data set; a closer investigation will hopefully reveal the true cause, and will certainly aid in more appropriate application of the algorithm. Fourth, the question of the model space prior has not been fully addressed. Although the flat prior is easy to use, other priors may be more appropriate; for example, a prior that prefers sparse models could be of some benefit. Finally, it is worth noting that a simple direct use of the BIC curve over the regularization path, using BIC values as weights in a weighted average of models, may be as effective as the MCMCMC algorithm. As in the MCMCMC, BIC could be replaced by a better approximation of the integrated likelihood.

Acknowledgements

This work was supported by National Institutes of Health SBIR awards 2R44GM074313-02 and 3R44GM074313-04S1.
References

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–451, 2004.

C. Fraley and M. Seligman. Model-averaged ℓ1 regularization. Technical report, Insilicos LLC, 2010. (in preparation).

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 2010.

G. M. Furnival and R. W. Wilson. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.

C. Hans. Bayesian lasso regression. Biometrika, 96:835–845, 2009.

J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, 14(4):382–417, 1999.

D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.

M.-Y. Park and T. Hastie. An L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B, 69:659–677, 2007.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

C. T. Volinsky, D. Madigan, A. E. Raftery, and R. A. Kronmal. Bayesian model averaging in proportional hazard models: Assessing stroke risk. Journal of the Royal Statistical Society, Series C (Applied Statistics), 46:433–448, 1997.

K.-Y. Yeung, R. E. Bumgarner, and A. E. Raftery. Bayesian model averaging: Development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21:2394–2402, 2005.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.