Sparse Bayesian nonparametric regression
François Caron and Arnaud Doucet
Depts of Computer Science & Statistics, UBC

July 7, 2008


Introduction

Linear regression model

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_L) \tag{1}$$

where $y \in \mathbb{R}^L$, $X$ is the design matrix of size $L \times K$, and $\beta \in \mathbb{R}^K$.

Sparse estimate of $\beta$
Variable selection, decomposition of a signal over an overcomplete basis, etc.
Numerous models/algorithms: Spike and Slab, Lasso, Relevance Vector Machine

Introduction

Prior distribution

$$p(\beta) = \prod_{k=1}^{K} p(\beta_k)$$

Local minima of the objective function

$$\frac{1}{2\sigma^2}\,\|y - X\beta\|^2 - \log p(\beta) \tag{2}$$

Scale mixture of Gaussians

$$p(\beta_k) = \int \mathcal{N}(\beta_k; 0, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2$$

Laplace prior → Lasso objective function
Normal-Jeffreys
Normal-exponential gamma

Find a local minimum of Eq. (2) with the EM algorithm
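The slides do not spell out the EM updates. As a minimal sketch, assume the Laplace prior $p(\beta_k) \propto \exp(-\gamma|\beta_k|)$, which corresponds to an exponential mixing density on $\sigma_k^2$. In that case the E-step reduces to $E[1/\sigma_k^2 \mid \beta_k] = \gamma/|\beta_k|$ and the M-step is a weighted ridge regression. The helper names (`solve`, `objective`, `em_laplace`), the toy data, and the `eps` safeguard are illustrative choices, not taken from the talk.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def objective(X, y, beta, sigma2, gamma):
    """Eq. (2) for the Laplace prior, up to an additive constant."""
    resid = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
    return sum(r * r for r in resid) / (2 * sigma2) + gamma * sum(abs(b) for b in beta)

def em_laplace(X, y, sigma2, gamma, n_iter=50, eps=1e-10):
    """EM iterations for the MAP estimate under the (assumed) Laplace prior."""
    K = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(K)] for i in range(K)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(K)]
    beta = [1.0] * K  # nonzero start: zero is a fixed point of the updates
    history = [objective(X, y, beta, sigma2, gamma)]
    for _ in range(n_iter):
        # E-step: expected precision E[1 / sigma_k^2 | beta_k] = gamma / |beta_k|
        w = [gamma / max(abs(b), eps) for b in beta]
        # M-step: weighted ridge regression (X'X / sigma^2 + diag(w)) beta = X'y / sigma^2
        A = [[XtX[i][j] / sigma2 + (w[i] if i == j else 0.0) for j in range(K)]
             for i in range(K)]
        beta = solve(A, [v / sigma2 for v in Xty])
        history.append(objective(X, y, beta, sigma2, gamma))
    return beta, history

# Toy data (hypothetical): 30 observations, 5 predictors, 2 null coefficients.
rng = random.Random(0)
beta_true = [3.0, 1.5, 0.0, 0.0, 2.0]
X = [[rng.gauss(0.0, 1.0) for _ in beta_true] for _ in range(30)]
y = [sum(x * b for x, b in zip(row, beta_true)) + rng.gauss(0.0, 1.0) for row in X]

beta_hat, history = em_laplace(X, y, sigma2=1.0, gamma=2.0)
print([round(b, 3) for b in beta_hat])
print(history[0], history[-1])  # the objective should not increase across iterations
```

Each iteration minimizes a quadratic upper bound on the objective, so the objective value is non-increasing. The coefficients shrink toward zero without reaching it exactly in finite time, which is why a small threshold is usually applied afterwards.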

Why this title is too bad

Sparse... but... one model does not lead to ‘strictly’ sparse estimates!
Bayesian... but... EM algorithm
nonparametric... but... the number of parameters is finite!
regression... YES IT IS :-)

Overview

1. Introduction
2. Models
3. Sparsity properties
4. Empirical results
5. Conclusion

Normal-gamma model

Gamma prior over $\sigma_k^2$

$$\sigma_k^2 \sim \mathcal{G}\!\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right)$$

Marginal distribution over $\beta_k$

$$p(\beta_k) \propto |\beta_k|^{\frac{\alpha}{K} - \frac{1}{2}}\, K_{\frac{\alpha}{K} - \frac{1}{2}}(\gamma |\beta_k|) \tag{3}$$

where $K_\nu(\cdot)$ is the modified Bessel function of the second kind.
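As a sanity check on Eq. (3) that is not part of the original slides, consider the special case $\alpha/K = 1$. The gamma prior is then an exponential distribution, and the marginal should reduce to the Laplace prior. Using the closed form $K_{1/2}(x) = \sqrt{\pi/(2x)}\, e^{-x}$:

```latex
\begin{align*}
p(\beta_k) &\propto |\beta_k|^{1/2}\, K_{1/2}(\gamma|\beta_k|)
  && \text{(Eq. (3) with } \alpha/K = 1\text{)} \\
&= |\beta_k|^{1/2} \sqrt{\frac{\pi}{2\gamma|\beta_k|}}\; e^{-\gamma|\beta_k|} \\
&\propto e^{-\gamma|\beta_k|}.
\end{align*}
```

So $-\log p(\beta_k) = \gamma|\beta_k| + \text{const}$, which is the Lasso penalty.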

Normal-gamma model

[Figure: three panels, (a) α = 1, (b) α = 5, (c) α = 100]

Asymptotic properties

Bounded sum of the terms

$$\lim_{K \to \infty} E\left[\sum_{k=1}^{K} |\beta_k|\right] = \frac{2\alpha}{\gamma}, \qquad E\left[\sum_{k=1}^{K} \beta_k^2\right] = \frac{2\alpha}{\gamma^2}.$$

Stick-breaking construction for the weights

Let $\sigma_{(1)}^2 \ge \sigma_{(2)}^2 \ge \ldots \ge \sigma_{(K)}^2$ be the order statistics of the sequence $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$.

$\left(\frac{\sigma_{(1)}^2}{\sum_k \sigma_{(k)}^2}, \frac{\sigma_{(2)}^2}{\sum_k \sigma_{(k)}^2}, \ldots\right)$ and $\sum_k \sigma_{(k)}^2$ are independent and respectively distributed according to $\mathrm{PD}(\alpha)$ and $\mathcal{G}(\alpha, \gamma^2/2)$.

Can be recovered from the (infinite) stick-breaking construction

$$\pi_k = \zeta_k \prod_{j=1}^{k-1} (1 - \zeta_j) \quad \text{with } \zeta_j \sim \mathcal{B}(1, \alpha) \tag{4}$$

Coefficients $(\beta_k)$ are the weights (jumps) of the so-called variance gamma process (Brownian motion evaluated at times given by a gamma process).
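The two statements on this slide can be explored numerically. The sketch below is illustrative only and assumes the gamma prior is parameterized by shape $\alpha/K$ and rate $\gamma^2/2$. The function names `expected_abs_sum` and `stick_breaking` are hypothetical. The first evaluates the exact finite-$K$ value of $E[\sum_k |\beta_k|]$, using $E|\beta_k| = \sqrt{2/\pi}\, E[\sigma_k]$, and the second draws a truncated version of the stick-breaking weights in Eq. (4).

```python
import math
import random

def expected_abs_sum(K, alpha, gamma):
    """Exact E[sum_k |beta_k|] for finite K, with sigma_k^2 ~ Gamma(alpha/K, rate gamma^2/2)."""
    a = alpha / K
    # E[sigma_k] = Gamma(a + 1/2) / Gamma(a) * sqrt(2) / gamma, and
    # E|beta_k| = sqrt(2 / pi) * E[sigma_k]
    per_term = 2.0 / (gamma * math.sqrt(math.pi)) * math.gamma(a + 0.5) / math.gamma(a)
    return K * per_term

def stick_breaking(alpha, n_weights, rng):
    """First n_weights terms of Eq. (4): pi_k = zeta_k * prod_{j<k} (1 - zeta_j)."""
    weights, remaining = [], 1.0
    for _ in range(n_weights):
        zeta = rng.betavariate(1.0, alpha)  # zeta_j ~ Beta(1, alpha)
        weights.append(zeta * remaining)    # break off a fraction of what is left
        remaining *= 1.0 - zeta             # length of the stick still unbroken
    return weights, remaining

alpha, gamma = 2.0, 2.0
for K in (10, 100, 1000, 10**6):
    # Values should approach the limit 2 * alpha / gamma as K grows.
    print(K, expected_abs_sum(K, alpha, gamma))

weights, remaining = stick_breaking(alpha, 50, random.Random(0))
# The weights plus the unbroken remainder should add up to 1 (up to rounding).
print(sum(weights) + remaining)
```

The stick-breaking weights come out in size-biased rather than decreasing order. Sorting them in decreasing order gives the ranked sequence that the Poisson-Dirichlet distribution describes.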

Normal-inverse Gaussian model

Inverse-Gaussian prior over $\sigma_k^2$

$$\sigma_k^2 \sim \mathcal{IG}\!\left(\frac{\alpha}{K}, \gamma\right) \tag{5}$$

Marginal pdf of $\beta_k$

$$p(\beta_k) \propto \left(\frac{\alpha^2}{K^2} + \beta_k^2\right)^{-1/2} K_1\!\left(\gamma \sqrt{\frac{\alpha^2}{K^2} + \beta_k^2}\right) \tag{6}$$

Normal-inverse Gaussian model

[Figure: three panels, (d) α = 1, (e) α = 5, (f) α = 100]

Extension

$N$ vectors $\{y^n\}_{n=1}^{N}$ where $y^n \in \mathbb{R}^L$.

For a given $k$, the random variables $\{\beta_k^n\}_{n=1}^{N}$ are statistically dependent and exchangeable.

Hierarchical model

$$\sigma_k^2 \sim \mathcal{G}\!\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right) \quad \text{or} \quad \sigma_k^2 \sim \mathcal{IG}\!\left(\frac{\alpha}{K}, \gamma\right)$$

for $k = 1, \ldots, K$, and $\beta_k^n \sim \mathcal{N}(0, \sigma_k^2)$ for $n = 1, \ldots, N$.

[Figure: three panels, (m) α = 1, (n) α = 5, (o) α = 100]

Extensions

As K → ∞, prior distributions over infinite matrices with real-valued entries.
Complementary to the Indian buffet process and the infinite gamma-Poisson process, which are prior distributions over infinite matrices with integer-valued entries.


Sparsity properties

Minimize

$$\sum_{n=1}^{N} \frac{1}{2\sigma_n^2}\, \|y^n - X\beta^n\|^2 + \sum_{k=1}^{K} \mathrm{pen}(\beta_k^{1:N})$$

where

Prior            pen(β_k^{1:N})
Lasso (N = 1)    $\gamma |\beta_k|$
NJ               $N \log(u_k)$
NG               $\left(\frac{N}{2} - \frac{\alpha}{K}\right) \log u_k - \log K_{\frac{\alpha}{K} - \frac{N}{2}}(\gamma u_k)$
NIG              $\frac{N+1}{2} \log(q_k) - \log K_{\frac{N+1}{2}}(\gamma q_k)$

with $u_k = \sqrt{\sum_{n=1}^{N} (\beta_k^n)^2}$ and $q_k = \sqrt{\frac{\alpha^2}{K^2} + u_k^2}$.
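To make the penalties in this table concrete, here is a small standard-library sketch. It is not the authors' code. The function names are hypothetical, and the Bessel function is approximated with the integral representation $K_\nu(x) = \int_0^\infty e^{-x\cosh t}\cosh(\nu t)\,dt$ and a trapezoidal rule. Each penalty takes the group $\beta_k^{1:N}$ as a list and is only defined for a nonzero group.

```python
import math

def bessel_k(nu, x, h=0.01):
    """K_nu(x) from int_0^inf exp(-x cosh t) cosh(nu t) dt, trapezoidal rule with step h."""
    if x <= 0:
        raise ValueError("x must be positive")
    t_max = math.acosh(1.0 + 745.0 / x)  # past this point exp(-x cosh t) underflows to 0
    n_steps = int(t_max / h) + 1
    total = 0.5 * math.exp(-x)  # endpoint t = 0 carries weight 1/2
    for i in range(1, n_steps + 1):
        t = i * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return h * total

def group_norm(betas):
    """u_k = sqrt(sum_n (beta_k^n)^2)."""
    return math.sqrt(sum(b * b for b in betas))

def pen_lasso(betas, gamma):
    """Lasso penalty (N = 1): gamma * |beta_k|."""
    return gamma * abs(betas[0])

def pen_nj(betas):
    """Normal-Jeffreys penalty: N * log(u_k)."""
    return len(betas) * math.log(group_norm(betas))

def pen_ng(betas, alpha_over_k, gamma):
    """Normal-gamma penalty from the table."""
    n, u = len(betas), group_norm(betas)
    order = alpha_over_k - n / 2
    return (n / 2 - alpha_over_k) * math.log(u) - math.log(bessel_k(order, gamma * u))

def pen_nig(betas, alpha_over_k, gamma):
    """Normal-inverse Gaussian penalty from the table."""
    n, u = len(betas), group_norm(betas)
    q = math.sqrt(alpha_over_k ** 2 + u * u)
    return (n + 1) / 2 * math.log(q) - math.log(bessel_k((n + 1) / 2, gamma * q))

# With alpha/K = 1 and N = 1 the normal-gamma penalty differs from the
# Lasso penalty only by a constant, so penalty differences should agree.
gamma = 2.0
d_ng = pen_ng([1.5], 1.0, gamma) - pen_ng([0.5], 1.0, gamma)
d_lasso = pen_lasso([1.5], gamma) - pen_lasso([0.5], gamma)
print(d_ng, d_lasso)
print(pen_nig([1.5], 0.05, gamma), pen_nig([0.01], 0.05, gamma))
```

The step size `h` and the cutoff `t_max` are pragmatic choices for moderate arguments. A library routine for $K_\nu$ would be the natural replacement in real use.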

Sparsity properties

Figure: Contours of constant value of pen(β1) + pen(β2) for different priors: (p) Laplace, (q) Normal-Jeffreys, (r) Normal-gamma, (s) Normal-inverse Gaussian.

Sparsity properties

The normal-gamma prior yields a thresholding rule for α/K ≤ 1 and therefore sparse estimates.
The normal-inverse Gaussian prior does not yield a thresholding rule, but it can give “almost sparse” estimates.


Empirical results

100 datasets with $L = 50$ and $\sigma = 1$
Correlation between $X_{k,i}$ and $X_{k,j}$ is $\rho^{|i-j|}$ with $\rho = 0.5$
True $\beta = (3, 1.5, 0, 0, 2, 0, 0, \ldots)^T \in \mathbb{R}^K$, where $K = 20, 60, 100, 200$
Parameters of the Lasso, NG and NIG are estimated by 5-fold cross-validation
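A dataset matching this description could be generated along the following lines. This is a sketch under stated assumptions. The slide only fixes the correlation structure, so standard normal predictors built with an AR(1) recursion are an assumption, and `simulate_dataset` is a hypothetical helper name.

```python
import math
import random

def simulate_dataset(L, K, rho=0.5, sigma=1.0, rng=None):
    """Draw (X, y, beta) with corr(X[k][i], X[k][j]) = rho^|i-j| within each row k."""
    if K < 5:
        raise ValueError("K must be at least 5 to hold the nonzero coefficients")
    rng = rng or random.Random()
    beta = [0.0] * K
    beta[0], beta[1], beta[4] = 3.0, 1.5, 2.0  # true beta = (3, 1.5, 0, 0, 2, 0, ...)
    scale = math.sqrt(1.0 - rho * rho)
    X, y = [], []
    for _ in range(L):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, K):
            # AR(1) step keeps unit variance and gives correlation rho^|i-j|
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        X.append(row)
        noise = rng.gauss(0.0, sigma)
        y.append(sum(x * b for x, b in zip(row, beta)) + noise)
    return X, y, beta

# 100 datasets with L = 50 observations, here for K = 20 predictors.
datasets = [simulate_dataset(50, 20, rng=random.Random(seed)) for seed in range(100)]
print(len(datasets), len(datasets[0][0]), len(datasets[0][0][0]))
```

Seeding each dataset separately keeps the experiment reproducible while leaving the datasets independent of one another.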

Empirical results

Figure: Box plots of the MSE associated with the simulated data.

Conclusion

Why this title is not so bad

Sparse... Two classes of models which lead to sparser estimates
Bayesian nonparametric... Related to a class of nonparametric Bayesian models when K → ∞
regression... Extension to probit regression

Conclusion

Ongoing work with K. Murphy on graph learning with group sparsity
How to use the stick-breaking construction?
Marginal distribution?
