Sparse Bayesian nonparametric regression
François Caron and Arnaud Doucet
Depts of Computer Science & Statistics, UBC
July 7, 2008
F. Caron (UBC)
Sparse Bayesian nonparametric regression
July 7, 2008
1 / 22
Introduction
Linear regression model
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_L) \tag{1}$$
with $y \in \mathbb{R}^L$, $X$ a design matrix of size $L \times K$, and $\beta \in \mathbb{R}^K$
Sparse estimate of $\beta$
Variable selection, decomposition of a signal over an overcomplete basis, etc.
Numerous models/algorithms
  Spike and Slab
  Lasso
  Relevance Vector Machine
Introduction
Prior distribution $p(\beta) = \prod_{k=1}^{K} p(\beta_k)$
Local minima of the objective function
$$\frac{1}{2\sigma^2} \|y - X\beta\|^2 - \log p(\beta) \tag{2}$$
Scale-mixture of Gaussians
$$p(\beta_k) = \int \mathcal{N}(\beta_k; 0, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2$$
  Laplace prior → Lasso objective function
  Normal-Jeffreys
  Normal-exponential gamma
Find a local minimum of Eq. (2) with the EM algorithm
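A minimal sketch of how an EM iteration for Eq. (2) might look with the Laplace prior $p(\beta_k) \propto \exp(-\gamma|\beta_k|)$. It assumes, purely for illustration, an orthonormal design ($X^T X = I$), so that the objective decouples across coordinates with $z = X^T y$. The names `em_laplace` and `soft_threshold` and the numerical values are hypothetical and not taken from the talk. The E-step uses $E[1/\sigma_k^2 \mid \beta_k] = \gamma/|\beta_k|$ and the M-step is a ridge update with those weights.

```python
def soft_threshold(z, t):
    """Closed-form Lasso solution for one coordinate of an orthonormal design."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def em_laplace(z, sigma2, gamma, n_iter=200):
    """EM for the Laplace prior, written as a scale mixture of Gaussians.

    Assumes X^T X = I, so each coordinate is updated independently.
    """
    beta = list(z)  # start from the least-squares estimate
    for _ in range(n_iter):
        new_beta = []
        for zk, bk in zip(z, beta):
            abs_bk = abs(bk)
            # E-step: E[1 / sigma_k^2 | beta_k] = gamma / |beta_k|.
            # M-step: minimise (z_k - b)^2 / (2 sigma2) + (gamma / |beta_k|) b^2 / 2,
            # rearranged to avoid the division so a zero coefficient stays at zero.
            new_beta.append(zk * abs_bk / (abs_bk + sigma2 * gamma))
        beta = new_beta
    return beta


z = [3.0, 0.5, -2.0]  # illustrative values of X^T y
beta_em = em_laplace(z, sigma2=1.0, gamma=1.0)
beta_lasso = [soft_threshold(zk, 1.0) for zk in z]
print(beta_em)
print(beta_lasso)
```

Under these assumptions the EM fixed point agrees with soft-thresholding at level $\sigma^2\gamma$, which is the Lasso solution for an orthonormal design. A general design would need a $K \times K$ linear solve in the M-step instead of the coordinate-wise update.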
Why this title is so bad
Sparse... but... one model does not lead to ‘strictly’ sparse estimates!
Bayesian... but... EM algorithm
nonparametric... but... the number of parameters is finite!
regression... YES IT IS :-)
Overview
1. Introduction
2. Models
3. Sparsity properties
4. Empirical results
5. Conclusion
Normal-gamma model
Gamma prior over $\sigma_k^2$
$$\sigma_k^2 \sim \mathcal{G}\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right)$$
Marginal distribution over $\beta_k$
$$p(\beta_k) \propto |\beta_k|^{\frac{\alpha}{K} - \frac{1}{2}} K_{\frac{\alpha}{K} - \frac{1}{2}}(\gamma |\beta_k|) \tag{3}$$
where $K_\nu(\cdot)$ is the modified Bessel function of the second kind.
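A short sketch of how one might draw coefficients from the normal-gamma prior, assuming the second argument of $\mathcal{G}$ is a rate (so the scale passed to Python's `random.gammavariate` is $2/\gamma^2$). The function name `sample_normal_gamma`, the seed, and the parameter values are illustrative assumptions. Under that parametrization $E[\beta_k^2] = E[\sigma_k^2] = 2\alpha/(K\gamma^2)$, so the Monte Carlo average of $\sum_k \beta_k^2$ should be close to $2\alpha/\gamma^2$.

```python
import math
import random


def sample_normal_gamma(K, alpha, gamma, rng):
    """Draw (beta_1, ..., beta_K) from the normal-gamma prior.

    sigma_k^2 ~ Gamma(shape=alpha / K, rate=gamma^2 / 2), beta_k ~ N(0, sigma_k^2).
    """
    scale = 2.0 / gamma ** 2  # random.gammavariate takes a scale, not a rate
    beta = []
    for _ in range(K):
        sigma2 = rng.gammavariate(alpha / K, scale)
        beta.append(rng.gauss(0.0, math.sqrt(sigma2)))
    return beta


rng = random.Random(0)  # fixed seed, illustrative only
K, alpha, gamma = 20, 2.0, 2.0
draws = [sample_normal_gamma(K, alpha, gamma, rng) for _ in range(5000)]
mean_sum_sq = sum(sum(b * b for b in beta) for beta in draws) / len(draws)
print(mean_sum_sq, 2.0 * alpha / gamma ** 2)
```

With a small shape $\alpha/K$, most sampled variances are tiny, so a typical draw has a few large coefficients and many that are close to zero.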
Normal-gamma model
[Figure panels: (a) α = 1, (b) α = 5, (c) α = 100]
Asymptotic properties
Bounded sum of the terms
$$\lim_{K \to \infty} E\left[\sum_{k=1}^{K} |\beta_k|\right] = \frac{2\alpha}{\gamma}, \qquad E\left[\sum_{k=1}^{K} \beta_k^2\right] = \frac{2\alpha}{\gamma^2}.$$
Stick breaking construction for the weights
  Let $\sigma_{(1)}^2 \ge \sigma_{(2)}^2 \ge \ldots \ge \sigma_{(K)}^2$ be the order statistics of the sequence $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$.
  $\left(\frac{\sigma_{(1)}^2}{\sum_k \sigma_{(k)}^2}, \frac{\sigma_{(2)}^2}{\sum_k \sigma_{(k)}^2}, \ldots\right)$ and $\sum_k \sigma_{(k)}^2$ are independent and respectively distributed according to $\mathrm{PD}(\alpha)$ and $\mathcal{G}(\alpha, \gamma^2/2)$
  Can be recovered from the (infinite) stick-breaking construction
  $$\pi_k = \zeta_k \prod_{j=1}^{k-1} (1 - \zeta_j) \quad \text{with } \zeta_j \sim \mathcal{B}(1, \alpha) \tag{4}$$
Coefficients $(\beta_k)$ are the weights (jumps) of the so-called variance gamma process (Brownian motion evaluated at times given by a gamma process)
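The stick-breaking recursion in Eq. (4) can be sketched as follows. This is an illustrative truncation to a finite number of sticks. The function name `stick_breaking`, the truncation level, the seed, and the final rescaling step (multiplying the weights by an independent gamma-distributed total mass, with $\gamma^2/2$ read as a rate) are assumptions made for the example rather than details given in the talk.

```python
import random


def stick_breaking(alpha, n_sticks, rng):
    """Weights pi_k = zeta_k * prod_{j<k} (1 - zeta_j), zeta_j ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(n_sticks):
        zeta = rng.betavariate(1.0, alpha)
        weights.append(zeta * remaining)
        remaining *= 1.0 - zeta
    return weights


rng = random.Random(1)  # fixed seed, illustrative only
alpha, gamma = 2.0, 1.0
pi = stick_breaking(alpha, n_sticks=200, rng=rng)

# Illustrative rescaling: an independent Gamma(alpha, rate=gamma^2 / 2) total mass
# turns the weights into variances sigma_k^2 (truncated at 200 sticks).
total = rng.gammavariate(alpha, 2.0 / gamma ** 2)
sigma2 = sorted((total * w for w in pi), reverse=True)
print(sum(pi), sigma2[:5])
```

The weights come out in size-biased rather than decreasing order, which is why the sketch sorts them before treating them as the ordered variances.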
Normal-inverse Gaussian model
Inverse-Gaussian prior over $\sigma_k^2$
$$\sigma_k^2 \sim \mathcal{IG}\left(\frac{\alpha}{K}, \gamma\right) \tag{5}$$
Marginal pdf of $\beta_k$
$$p(\beta_k) \propto \left(\frac{\alpha^2}{K^2} + \beta_k^2\right)^{-1/2} K_1\left(\gamma \sqrt{\frac{\alpha^2}{K^2} + \beta_k^2}\right) \tag{6}$$
Normal-inverse Gaussian model
[Figure panels: (d) α = 1, (e) α = 5, (f) α = 100]
Extension
$N$ vectors $\{y^n\}_{n=1}^{N}$ where $y^n \in \mathbb{R}^L$.
For a given $k$ the random variables $\{\beta_k^n\}_{n=1}^{N}$ are statistically dependent and exchangeable
Hierarchical model
$$\sigma_k^2 \sim \mathcal{G}\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right) \quad \text{or} \quad \sigma_k^2 \sim \mathcal{IG}\left(\frac{\alpha}{K}, \gamma\right)$$
for $k = 1, \ldots, K$ and $\beta_k^n \sim \mathcal{N}(0, \sigma_k^2)$ for $n = 1, \ldots, N$.
[Figure panels: (m) α = 1, (n) α = 5, (o) α = 100]
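A small sketch of the hierarchical model with the gamma prior, showing how one variance $\sigma_k^2$ is shared by all $N$ coefficients in row $k$. The function name `sample_group_coefficients`, the seed, and the parameter values are hypothetical, and $\gamma^2/2$ is again read as a rate.

```python
import math
import random


def sample_group_coefficients(K, N, alpha, gamma, rng):
    """Draw a K x N matrix with beta_k^n ~ N(0, sigma_k^2), sigma_k^2 shared across n."""
    rows = []
    for _ in range(K):
        sigma2 = rng.gammavariate(alpha / K, 2.0 / gamma ** 2)  # (shape, scale)
        sd = math.sqrt(sigma2)
        rows.append([rng.gauss(0.0, sd) for _ in range(N)])
    return rows


rng = random.Random(2)  # fixed seed, illustrative only
beta = sample_group_coefficients(K=50, N=8, alpha=5.0, gamma=1.0, rng=rng)
# Row norms u_k = sqrt(sum_n (beta_k^n)^2): a row is "active" when u_k is large.
u = [math.sqrt(sum(b * b for b in row)) for row in beta]
print(sorted(u, reverse=True)[:5])
```

Because the variance is shared, the entries of a row tend to be large or small together, which is the dependence across $n$ that the slide refers to.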
Extensions
As K → ∞, prior distributions over infinite matrices with real-valued entries
Complementary to the Indian buffet process and the infinite gamma-Poisson process, which are prior distributions over infinite matrices with integer-valued entries.
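Sparsity properties
Minimize
$$\sum_{n=1}^{N} \frac{1}{2\sigma_n^2} \|y^n - X\beta^n\|^2 + \sum_{k=1}^{K} \mathrm{pen}(\beta_k^{1:N})$$
where the penalty is the negative log-prior, up to an additive constant:

Prior            $\mathrm{pen}(\beta_k^{1:N})$
Lasso (N = 1)    $\gamma |\beta_k|$
NJ               $N \log(u_k)$
NG               $\left(\frac{N}{2} - \frac{\alpha}{K}\right) \log u_k - \log K_{\frac{\alpha}{K} - \frac{N}{2}}(\gamma u_k)$
NIG              $\frac{N+1}{2} \log(q_k) - \log K_{\frac{N+1}{2}}(\gamma q_k)$

with $u_k = \sqrt{\sum_{n=1}^{N} (\beta_k^n)^2}$ and $q_k = \sqrt{\frac{\alpha^2}{K^2} + u_k^2}$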
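The penalties in the table can be evaluated numerically with a short sketch such as the one below. Python's standard library has no Bessel functions, so `bessel_k` here is a crude trapezoidal quadrature of the integral representation $K_\nu(x) = \int_0^\infty e^{-x\cosh t}\cosh(\nu t)\,dt$. That helper, its truncation rule, the other function names, and the test values are all assumptions made for illustration, and a library routine would be preferable in practice.

```python
import math


def bessel_k(nu, x, n_steps=4000):
    """Modified Bessel function of the second kind, K_nu(x), for x > 0.

    Trapezoidal rule on K_nu(x) = int_0^inf exp(-x cosh t) cosh(nu t) dt,
    truncated where exp(-x cosh t) has dropped to about exp(-x - 80).
    Intended for moderate nu only.
    """
    t_max = math.acosh(1.0 + 80.0 / x)
    h = t_max / n_steps
    end = math.exp(-x * math.cosh(t_max)) * math.cosh(nu * t_max)
    total = 0.5 * (math.exp(-x) + end)
    for i in range(1, n_steps):
        t = i * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return h * total


def pen_lasso(beta_k, gamma):
    return gamma * abs(beta_k)


def pen_nj(u_k, N):
    return N * math.log(u_k)


def pen_ng(u_k, N, alpha_over_K, gamma):
    nu = alpha_over_K - N / 2.0
    return (N / 2.0 - alpha_over_K) * math.log(u_k) - math.log(bessel_k(nu, gamma * u_k))


def pen_nig(u_k, N, alpha_over_K, gamma):
    q_k = math.sqrt(alpha_over_K ** 2 + u_k ** 2)
    nu = (N + 1) / 2.0
    return nu * math.log(q_k) - math.log(bessel_k(nu, gamma * q_k))


gamma = 1.5
# With N = 1 and alpha / K = 1 the NG penalty equals gamma * |beta_k| plus a constant,
# so penalty differences should match the Lasso.
diff_ng = pen_ng(2.0, 1, 1.0, gamma) - pen_ng(1.0, 1, 1.0, gamma)
diff_lasso = pen_lasso(2.0, gamma) - pen_lasso(1.0, gamma)
print(diff_ng, diff_lasso)
```

The check at the end relies on $K_{1/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}$, which makes the normal-gamma penalty with $N = 1$ and $\alpha/K = 1$ linear in $|\beta_k|$.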
Sparsity properties
Figure: Contours of constant value of pen(β1) + pen(β2) for different priors: (p) Laplace, (q) Normal-Jeffreys, (r) Normal-gamma, (s) Normal-inverse Gaussian
Sparsity properties
The normal-gamma prior yields a thresholding rule for α/K ≤ 1 and hence sparse estimates
The normal-inverse Gaussian prior does not yield a thresholding rule, but it can yield “almost sparse” estimates
Empirical results
100 datasets with $L = 50$ and $\sigma = 1$
Correlation between $X_{k,i}$ and $X_{k,j}$ is $\rho^{|i-j|}$ with $\rho = 0.5$
True $\beta = (3, 1.5, 0, 0, 2, 0, 0, \ldots)^T \in \mathbb{R}^K$, where $K = 20, 60, 100, 200$
Parameters of the Lasso, NG and NIG are estimated by 5-fold cross-validation
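One way the simulated data described above might be generated is sketched below. The slide gives only the correlation structure, so the Gaussian design with unit variances, the AR(1) recursion used to obtain correlation $\rho^{|i-j|}$ between entries of a row, the function name `simulate_dataset`, and the seed are assumptions for illustration.

```python
import math
import random


def simulate_dataset(L, K, rho, sigma, rng):
    """One dataset y = X beta + eps with correlated predictors within each row of X."""
    beta = [0.0] * K
    beta[0], beta[1], beta[4] = 3.0, 1.5, 2.0  # true beta = (3, 1.5, 0, 0, 2, 0, ...)
    X, y = [], []
    for _ in range(L):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, K):
            # AR(1) recursion: corr(x_i, x_j) = rho^|i - j| with unit variances.
            row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        X.append(row)
        y.append(sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma))
    return X, y, beta


rng = random.Random(3)  # fixed seed, illustrative only
X, y, beta_true = simulate_dataset(L=50, K=20, rho=0.5, sigma=1.0, rng=rng)
print(len(X), len(X[0]), len(y))
```

Repeating the call 100 times for each value of $K$ would give the collection of datasets on which the Lasso, NG and NIG estimates are compared.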
Empirical results
Figure: Box plots of the MSE associated with the simulated data.
Conclusion
Why this title is not so bad
Sparse... Two classes of models which lead to sparser estimates
Bayesian nonparametric... Related to a class of nonparametric Bayesian models when K → ∞
regression... Extension to probit regression
Conclusion
Ongoing work with K. Murphy on graph learning with group sparsity
How to use the stick-breaking construction?
Marginal distribution?
Bibliography
Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24, 1–13.
Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1150–1159.
Griffin, J., & Brown, P. (2007). Bayesian adaptive lasso with non-convex penalization (Technical Report). Dept of Statistics, University of Warwick.