SPARSE LINEAR REGRESSION WITH BETA PROCESS PRIORS

Bo Chen, John Paisley and Lawrence Carin
Department of Electrical & Computer Engineering, Duke University, Durham, NC, USA

ABSTRACT

A Bayesian approximation to finding the minimum ℓ0 norm solution for an underdetermined linear system is proposed that is based on the beta process prior. The beta process linear regression (BP-LR) model finds sparse solutions to the underdetermined model y = Φx + ε by modeling the vector x as the element-wise product of a non-sparse weight vector, w, and a sparse binary vector, z, that is drawn from the beta process prior. The hierarchical model is fully conjugate and is therefore amenable to fast inference methods. We demonstrate the model on a compressive sensing problem and on a correlated-feature problem, where we show the ability of the BP-LR to selectively remove irrelevant features while preserving the relevant groups of correlated features.

Index Terms— Bayesian nonparametrics, beta process, sparse linear regression, compressive sensing

1. INTRODUCTION

In this paper, we consider a method for finding sparse solutions to underdetermined linear systems using the beta process prior [7]. For the linear system

y = Φx + ε    (1)

where Φ ∈ R^{M×N} and M ≪ N, properties of the vector x ∈ R^N are typically constrained a priori, often through regularization of the form

x* = arg min_x ‖y − Φx‖₂² + λ‖x‖_p^p    (2)

In this formulation, the solution, x*, balances its ℓp norm against the Euclidean approximation error, with the penalty parameter λ defining their relative importance. For example, when p = 2 the result is the ridge regression solution [8], and when p = 1 the result is the lasso [14]. The lasso is known for producing sparse solutions, in which many of the entries of x* are set exactly to zero. This sparsity is often desired when it is known a priori that most of the features, or columns of Φ, are irrelevant to the prediction of y. Reducing the number of features can improve generalization of the model, thus improving prediction performance.

In the Bayesian setting, two models are commonly used to achieve this end: the relevance vector machine (RVM) [13] and the Bayesian lasso [12], the latter of which produces the ℓ1 penalty function when written in marginalized form. Ideally, the ℓ0 norm would be used; indeed, for compressive sensing [4], the ℓ1 norm is chosen as a relaxation of ℓ0, as the two are proven to share the same solution under certain conditions. This choice is made because ℓ1 minimization can be solved in polynomial time, whereas minimizing the ℓ0 norm has been shown to be NP-hard [1], requiring exhaustive enumeration of the 2^N subspaces of Φ to find the solution. Therefore, approximations to the ℓ0 problem are necessary [16], which in the fully Bayesian setting suggests the use of the beta process prior. The beta process is used to decompose the vector x into the element-wise multiplication of a weight vector, w, and a sparse binary vector, z,

y = Φ(w ◦ z) + ε    (3)

The prior on the binary vector, z, is what enforces sparse solutions and motivates its classification as an approximate ℓ0 solution. The resulting model is a beta process linear regression (BP-LR) model that is sparse in the coefficient vector. A natural application for this model is therefore the compressive sensing problem [4], where an N-dimensional coefficient vector that is sparse in some basis is measured using only M ≪ N measurements. Another application is group selection for gene expression analysis [17]. In the case of highly correlated features, the RVM and Bayesian lasso are known to select only a single feature and set the remaining correlated feature weights to zero. This neglects the information contained in the correlated genes, which may be of interest to the medical professional. We will show that the BP-LR model can be useful in this situation, where shrinkage on correlated features is not desired.

We review the beta process and present our BP-LR model in Section 2, and we show experimental results on a compressive sensing problem and a group-selection problem in Section 3. We conclude in Section 4.
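As a brief illustration of Eq. (2) — not part of this paper's method — the following sketch contrasts the p = 2 (ridge) and p = 1 (lasso) solutions on a toy underdetermined system. It assumes NumPy and scikit-learn are available, uses illustrative sizes and penalty values, and scikit-learn's Ridge and Lasso objectives differ from Eq. (2) only in how λ is scaled.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy underdetermined system: M = 50 measurements, N = 200 features.
rng = np.random.default_rng(0)
M, N = 50, 200
Phi = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[rng.choice(N, 10, replace=False)] = rng.standard_normal(10)  # sparse ground truth
y = Phi @ x_true + 0.01 * rng.standard_normal(M)

lam = 0.1
# p = 2: ridge regression -- coefficients shrink but remain nonzero.
x_ridge = Ridge(alpha=lam, fit_intercept=False).fit(Phi, y).coef_
# p = 1: lasso -- many coefficients are driven exactly to zero.
x_lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Phi, y).coef_

print("nonzeros (ridge):", int(np.sum(np.abs(x_ridge) > 1e-8)))
print("nonzeros (lasso):", int(np.sum(np.abs(x_lasso) > 1e-8)))
```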

2. BETA PROCESS FOR LINEAR REGRESSION

The beta process [7], here extended to a two-parameter beta process, is a nonparametric Bayesian prior that takes three inputs: two positive scalars, a and b, and a base measure, H0; it is denoted H ∼ BP(a, b, H0). For the model considered here, the base measure is inherently discrete, with

H0 = (1/N) Σ_{n=1}^{N} δ_{φn}    (4)

where φn ∈ R^M is the nth column of Φ. The discreteness of this base measure greatly simplifies the model, though we mention that, due to theoretical issues, the beta process is a true stochastic process only in the limit as N → ∞ [7]. Provided that N is reasonably small, which we assume to be the case since we know Φ, the discrete beta process, H = Σ_{n=1}^{N} πn δ_{φn}, can be drawn exactly and requires only the generation of the vector π, where

πn ∼ Beta(a/N, b(N − 1)/N)    (5)

for n = 1, . . . , N. The resulting H, unlike H0, is not a probability measure. Therefore, samples are not drawn directly from H; rather, H is used as the parameter of a Bernoulli process, B ∼ BeP(H), where B = Σ_{n=1}^{N} zn δ_{φn} and the binary vector z is generated as

zn ∼ Bernoulli(πn)    (6)

for n = 1, . . . , N. The resulting vector, z, will be sparse, as enforced through the vector π; we examine some more theoretical properties of this prior in the next section. The beta process therefore provides a natural framework for performing sparse linear regression, and when used in conjunction with a weight vector, w, it motivates the following BP-LR model.

A Beta Process Model for Sparse Linear Regression (BP-LR):

y = Φ(w ◦ z) + ε    (7)
w ∼ N(0, σw² I)    (8)
zn ∼ Bernoulli(πn)    (9)
πn ∼ Beta(a/N, b(N − 1)/N)    (10)
ε ∼ N(0, σε² I)    (11)

for n = 1, . . . , N, and where the symbol ◦ represents the element-wise multiplication of two vectors or matrices. In this model, we have defined x ≡ w ◦ z, with z providing the sparsity mechanism and w the weighting of those vectors in Φ selected by z. Due to space limitations, we do not provide inference equations. However, we mention that the fully analytical nature of the model allows for fast variational inference [3], in addition to the MCMC Gibbs sampling method [6]. We also note that separate gamma priors can be placed on the inverse variances, σw⁻² and σε⁻², which we find has a significant, beneficial impact on model learning. See [11] and references therein for more details on inference for the beta process. Figure 1 contains a graphical representation of the proposed model.

Fig. 1. A graphical representation of the BP-LR model.
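To make the generative specification in Eqs. (7)-(11) concrete, the following is a minimal sampling sketch. The dictionary Phi, the values of a and b, and the variances are illustrative placeholders, not settings used in the paper, and the inference procedure itself is not shown.

```python
import numpy as np

def sample_bp_lr(Phi, a, b, sigma_w=1.0, sigma_eps=0.1, seed=None):
    """Draw (y, w, z, pi) from the BP-LR generative model, Eqs. (7)-(11)."""
    rng = np.random.default_rng(seed)
    M, N = Phi.shape
    # Eq. (10): pi_n ~ Beta(a/N, b(N-1)/N), the discrete beta process weights.
    pi = rng.beta(a / N, b * (N - 1) / N, size=N)
    # Eq. (9): z_n ~ Bernoulli(pi_n), the sparse binary selector (Bernoulli process).
    z = rng.binomial(1, pi)
    # Eq. (8): w ~ N(0, sigma_w^2 I), the non-sparse weights.
    w = sigma_w * rng.standard_normal(N)
    # Eq. (7): y = Phi(w o z) + eps, with eps ~ N(0, sigma_eps^2 I) as in Eq. (11).
    x = w * z                      # x = w o z (element-wise product)
    y = Phi @ x + sigma_eps * rng.standard_normal(M)
    return y, w, z, pi

# Example usage with a random dictionary (placeholder sizes and hyperparameters).
Phi = np.random.default_rng(1).standard_normal((50, 200))
y, w, z, pi = sample_bp_lr(Phi, a=5.0, b=1.0, seed=2)
print("number of active coefficients:", int(z.sum()))
```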

2.1. Setting the hyperparameters a and b

The parameters a and b have a significant impact on model learning, and careful consideration must be given to their setting. We therefore provide a means for setting a and b as functions of two different parameters, S and F, for which an intuitive understanding is clear. Defining x ≡ w ◦ z and considering that wn = 0 is a probability-zero event, the expected ℓ0 norm of x, S = E[‖x‖0], can be calculated by solving E[Σn zn] = Σn E[zn], which yields

S = aN / (a + b(N − 1))    (12)

This parameter can be set to the desired sparsity level. We note that as N → ∞, ‖x‖0 is Poisson distributed, ‖x‖0 ∼ Po(a/b). A second parameter, F, can be set to control the maximum value that the posterior expectation of any πn can take. That is, given zn = 1, the posterior expectation of πn, F = E[πn | zn = 1], is equal to

F = (a + N) / (a + b(N − 1) + N)    (13)

We discuss these values for specific problems in the following section. Using these parameters, we can solve for a and b,

a = SN(1 − F) / (FN − S)    (14)
b = a(N − S) / (S(N − 1))    (15)

where 1 < S < N − 1 and S/N < F < 1.
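Assuming Eqs. (12)-(15) as written above, a small helper can map the intuitive parameters (S, F) to (a, b). The Monte Carlo check below uses the S = 100, F = 0.75, N = 500 values from the group-selection experiment in Section 3.1; the variable names are illustrative.

```python
import numpy as np

def sf_to_ab(S, F, N):
    """Map the expected sparsity S = E[||x||_0] and the cap F = E[pi_n | z_n = 1]
    to the beta process parameters (a, b) via Eqs. (14)-(15)."""
    assert 1 < S < N - 1 and S / N < F < 1, "require 1 < S < N-1 and S/N < F < 1"
    a = S * N * (1 - F) / (F * N - S)          # Eq. (14)
    b = a * (N - S) / (S * (N - 1))            # Eq. (15)
    return a, b

N = 500
a, b = sf_to_ab(S=100, F=0.75, N=N)

# Sanity check of Eq. (12): aN / (a + b(N-1)) should recover S.
print("implied E[||x||_0]:", a * N / (a + b * (N - 1)))

# Monte Carlo check: average number of active coefficients over many draws of (pi, z).
rng = np.random.default_rng(0)
pi = rng.beta(a / N, b * (N - 1) / N, size=(2000, N))
z = rng.binomial(1, pi)
print("empirical E[||x||_0]:", z.sum(axis=1).mean())
```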

3. EXPERIMENTS

We consider two problems to which the BP-LR model is applicable: group selection in the case of highly correlated features and compressive sensing.

3.1. Correlated Feature Selection

Empirical experience shows that the RVM and Bayesian lasso handle highly correlated features by selecting one (or a small fraction) of them and setting the rest to zero [17]. However, in some cases it may be desirable to spread the weight given to this one feature over all of the features in the group (e.g., for biological interpretation of the analysis of gene-expression data). We demonstrate the BP-LR model for group selection on a toy problem of N = 500 dimensions. We generated groups according to Section 5d in [17] and set σ = √15. In Figure 3, we compare the results for the BP-LR model with S = 100 and F = 0.75. We place a noninformative gamma prior on σε⁻² and set σw² = 1. We see that the beta process model was able to select the three groups, or the first 15 features, while the RVM and Bayesian lasso models selected only one or two from each group. This grouping behavior arises because, following feature selection in a given iteration (via z), the corresponding weights are calculated using ridge regression, which is a non-sparse solution [8]. Therefore, the weights among correlated features do not grow or shrink with respect to each other following their selection. We also see that the BP-LR model is more sparse in the selection of noisy features: the number of coefficients greater than 10⁻² is 14 for the BP-LR model, while it is 23 and 46 for the RVM and Bayesian lasso, respectively.

3.2. Compressive Sensing

For the compressive sensing problem, we used the 128 × 128 image in Figure 2 and compared our model with a variety of CS inversion algorithms [5][10][15]. We plot in Figure 4 the correlative error, defined as the magnitude of the reconstruction error divided by the magnitude of the original image. We see that the BP-LR model (here called BetaP CS) performs particularly well in low signal-to-noise conditions.

Fig. 2. The image used for the compressive sensing problem.
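The correlative error used in Figures 4 and 5 is, as described above, the norm of the reconstruction error divided by the norm of the original image; a minimal sketch (array names are illustrative):

```python
import numpy as np

def correlative_error(x_true, x_hat):
    """||x_hat - x_true||_2 / ||x_true||_2: reconstruction error relative to the original."""
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

# Example: a reconstruction equal to 90% of the true signal has correlative error 0.1.
x = np.ones(128 * 128)
print(correlative_error(x, 0.9 * x))   # -> 0.1
```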

Fig. 3. A comparison of the BP-LR (top), RVM (middle) and Bayesian lasso (bottom) results for the problem of correlated features. The first 15 features are correlated in sets of 5, with the remainder being noisy features. Only the BP-LR model selects all of the correlated features, while also selecting fewer noisy features.

In Figure 5, we plot the correlative error as a function of the number of measurements for an SNR equal to ten. In this case of high noise, we see that the performance of the BP-LR model is significantly better. This is perhaps due to the increased sparsity that arises from the approximate ℓ0 solution. Visual inspection, not given here to preserve space, indicates that the reconstruction of other algorithms tends to contain substantially more noise. In the toy problem of the previous section, we see this tendency in the RVM and Bayesian lasso, while the BP-LR model is more sparse in the noise. For this model, we used noninformative gamma priors on both variances and set a = 8M and b = 100.

3.2.1. Joint Sparsity Modeling

In the form of a single vector, z, the BP-LR model is in the class of spike-and-slab priors [9]. However, we note that the beta process prior can be used to model the joint sparsity [2] of multiple signals that share sparsity patterns. This shared sparsity occurs through the shared π vector, from which multiple z vectors, one for each signal, are drawn. Multiple weight vectors, w, are drawn as well, with the resulting form being

Y = Φ(W ◦ Z) + E    (16)

where W and Z are matrices with the number of columns equal to the number of signals. In this case, the setting of a and b needs to be reconsidered, as the definition of the variable F has changed, though S remains the same.
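A sketch of the joint-sparsity form in Eq. (16), again with illustrative placeholder values rather than the paper's settings: all signals share a single π vector, while each signal draws its own column of Z and W, so the columns of Z tend to share support.

```python
import numpy as np

def sample_joint_bp_lr(Phi, a, b, T, sigma_w=1.0, sigma_eps=0.1, seed=None):
    """Draw T jointly sparse signals, Y = Phi(W o Z) + E, with a shared pi vector."""
    rng = np.random.default_rng(seed)
    M, N = Phi.shape
    pi = rng.beta(a / N, b * (N - 1) / N, size=N)   # shared across all T signals
    Z = rng.binomial(1, pi[:, None], size=(N, T))   # one binary column per signal
    W = sigma_w * rng.standard_normal((N, T))       # one weight column per signal
    E = sigma_eps * rng.standard_normal((M, T))
    Y = Phi @ (W * Z) + E                           # Eq. (16)
    return Y, W, Z, pi

# Columns of Z tend to share support because they are drawn from the same pi.
Phi = np.random.default_rng(2).standard_normal((50, 200))
Y, W, Z, pi = sample_joint_bp_lr(Phi, a=5.0, b=1.0, T=4, seed=3)
print("active rows shared by all 4 signals:", int((Z.sum(axis=1) == 4).sum()))
```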

Fig. 4. A plot of the correlative error as a function of the signal-to-noise ratio. For higher noise levels, the improved performance of the beta process model is clear.

Fig. 5. A plot of the correlative error as a function of measurement number for an SNR of 10 dB. The BP-LR model exhibits improved performance for all measurements.

4. CONCLUSION

We have presented a sparse linear regression model that uses the beta process as a prior. This beta process linear regression (BP-LR) model approximates the minimum ℓ0 norm solution in a fully Bayesian setting; its fully conjugate structure yields analytical updates (not given here due to space constraints), so fast inference algorithms are available for use. We demonstrated our model on a group-selection toy problem, which is relevant to gene-analysis problems where groups of correlated features are sought. We also showed improved performance on a compressive sensing problem, most noticeably in low signal-to-noise scenarios. Due to space limitations, we only briefly discussed the further ability of the BP-LR model to solve multiple inversion problems in which joint sparsity is assumed.

5. REFERENCES

[1] E. Amaldi and V. Kann, "On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems," Theoretical Computer Science, vol. 209, pp. 237-260, 1998.
[2] D. Baron, M.B. Wakin, M.F. Duarte, S. Sarvotham and R.G. Baraniuk, "Distributed compressed sensing," preprint, 2005.
[3] M. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD dissertation, University College London, 2003.
[4] D.L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006.
[5] D.L. Donoho, Y. Tsaig, I. Drori and J.-L. Starck, "Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit," Stanford Statistics Technical Report 2006-2, April 2009.
[6] D. Gamerman and H.F. Lopes, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2nd ed., Chapman & Hall, 2006.
[7] N.L. Hjort, "Nonparametric Bayes estimators based on beta processes in models for life history data," Annals of Statistics, vol. 18, pp. 1259-1294, 1990.
[8] A.E. Hoerl and R.W. Kennard, "Ridge regression: biased estimation for nonorthogonal problems," Technometrics, special ed., vol. 42, pp. 80-86, Feb. 2000 (original 1970).
[9] H. Ishwaran and J.S. Rao, "Spike and slab variable selection: frequentist and Bayesian strategies," Annals of Statistics, vol. 33, no. 2, pp. 730-773, 2005.
[10] S. Ji, Y. Xue and L. Carin, "Bayesian compressive sensing," IEEE Transactions on Signal Processing, vol. 56, pp. 2346-2356, 2008.
[11] J. Paisley and L. Carin, "Nonparametric factor analysis with beta process priors," in International Conference on Machine Learning (ICML), Montreal, 2009.
[12] T. Park and G. Casella, "The Bayesian lasso," Journal of the American Statistical Association, vol. 103, pp. 681-686, 2008.
[13] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211-244, 2001.
[14] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Ser. B, vol. 58, pp. 267-288, 1996.
[15] J.A. Tropp and A.C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, pp. 4655-4666, 2007.
[16] J. Weston, A. Elisseeff, B. Schölkopf and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[17] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Ser. B, vol. 67, pp. 301-320, 2005.