Additive Gaussian Processes

David Duvenaud, Hannes Nickisch, Carl Rasmussen

Cambridge University Computational and Biological Learning Lab

January 13, 2012

Outline

Gaussian Process Regression
  Definition
  Properties

Additive Gaussian Processes
  Central Modeling Assumption
  Interpretability
  Related Work
  Results

Regression Methods

Given X, y, predict some new function value y∗ at location x∗.

Many methods with nice properties:
  - Linear Regression - Fast
  - Deep Belief Networks - Semi-supervised
  - Spline Models - Nonparametric
  - Gaussian Process Regression
      - Non-parametric
      - Data-efficient
      - Tractable joint posterior

Definition

Assume our data (X, y) is generated by y = f(x) + ε, where ε is Gaussian noise with variance σ² and f is a latent function about which we need to do inference.

A GP prior distribution over f means that, for any finite set of input locations X,

p(f(X) | θ) = N(µθ(X), Kθ(X, X))

where Kij = kθ(xi, xj) is the covariance function or kernel, which specifies the covariance between two function values f(xi), f(xj) given their locations xi, xj. For example, the squared-exponential kernel:

kθ(x, x′) = exp( −|x − x′|² / (2θ²) )

[Figure: the squared-exponential covariance kθ(x, x′) plotted against the distance x − x′]

Sampling from a GP

function simple_gp_sample
    % Choose a set of x locations.
    N = 100;
    x = linspace( -2, 2, N );

    % Specify the covariance between function
    % values, depending on their location.
    for j = 1:N
        for k = 1:N
            sigma(j,k) = covariance( x(j), x(k) );
        end
    end

    % Specify that the prior mean of f is zero.
    mu = zeros(N, 1);

    % Sample from a multivariate Gaussian.
    f = mvnrnd( mu, sigma );

    plot(x, f);
end

% Squared-exp covariance function.
function k = covariance(x, y)
    k = exp( -0.5*( x - y )^2 );
end

[Figure: a sample function f(x) drawn from this GP prior; repeated runs give different smooth random functions]
Swapping in a periodic covariance function gives periodic random functions:

% Periodic covariance function.
function c = covariance(x, y)
    c = exp( -0.5*( sin(( x - y )*1.5).^2 ) );
end

[Figure: a sample function drawn from the GP prior with this periodic covariance]

Conditional Posterior

After conditioning on some data (X, y),

p(f(x∗) | X, y, θ) = (1/Z) p(y | f, X, θ) p(f | θ)
                   = N(y | f, X, θ) N(f | µ, Kθ(X, X))
                   = N( f(x∗) | µ = k(x∗, X) K⁻¹ y,  Σ = k(x∗, x∗) − k(x∗, X) K⁻¹ k(X, x∗) )

[Figure: GP posterior mean and posterior uncertainty over f(x), shown as data points are successively added]
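A minimal MATLAB sketch of these predictive equations (not from the slides; the function name and arguments are made up for illustration), for one-dimensional inputs and the squared-exp kernel used earlier:

function [mu_star, cov_star] = gp_predict(X, y, x_star, sigma2)
    % X: N x 1 training inputs, y: N x 1 targets,
    % x_star: M x 1 test inputs, sigma2: observation-noise variance.
    k = @(a, b) exp( -0.5 * (a - b').^2 );      % squared-exp kernel matrix between two input vectors

    K      = k(X, X) + sigma2 * eye(numel(X));  % Ktheta(X, X) with the noise variance folded in
    K_star = k(x_star, X);                      % k(x*, X)

    mu_star  = K_star * (K \ y);                            % posterior mean  k(x*, X) K^-1 y
    cov_star = k(x_star, x_star) - K_star * (K \ K_star');  % posterior covariance
end

Folding the noise variance σ² into K here means the same matrix reappears in the evidence formula on the next slide.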

Normalization Constant

A problem with Bayesian models in general:

p(f | X, y, θ) = (1/Z) p(y | f, X, θ) p(f | θ)
              = p(y | f, X, θ) p(f | θ) / ∫ p(y | f, X, θ) p(f | θ) df

For a GP we can actually compute the model evidence p(y | X, θ), a.k.a. Z:

log p(y | X, θ) = log ∫ p(y | f, X, θ) p(f | θ) df
                = −½ yᵀ (Kθ + σ²I)⁻¹ y − ½ log |Kθ + σ²I| − (N/2) log(2π)

where the first term rewards data fit and the second penalizes model complexity.
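A minimal MATLAB sketch of this evidence computation (not from the slides; the function name gp_log_evidence is made up), again for one-dimensional column-vector inputs and a squared-exp kernel with lengthscale theta:

function lml = gp_log_evidence(X, y, theta, sigma2)
    % X: N x 1 inputs, y: N x 1 targets, theta: lengthscale, sigma2: noise variance.
    N = numel(y);
    K = exp( -0.5 * ((X - X') / theta).^2 ) + sigma2 * eye(N);   % Ktheta + sigma^2 I

    L = chol(K, 'lower');           % Cholesky factor, for stable solves and log-determinant
    alpha = L' \ (L \ y);           % (Ktheta + sigma^2 I)^{-1} y

    lml = - 0.5 * y' * alpha ...    % data-fit term
          - sum(log(diag(L))) ...   % complexity penalty: 0.5 * log |Ktheta + sigma^2 I|
          - 0.5 * N * log(2*pi);    % normalization constant
end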

Bayesian Occam’s Razor

Choice of Kernel Function

Depending on the kernel function, GP regression is equivalent to:
  - Linear Regression
  - Polynomial Regression
  - Splines
  - Kalman Filters
  - Generalized Additive Models

Can use gradients of the model evidence to learn which model best explains the data; no need for cross-validation.
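As one concrete instance, here is a minimal sketch (not the authors' code; the function name is made up) of the gradient of the log evidence with respect to a squared-exp lengthscale. Such gradients can be handed to any off-the-shelf optimiser to fit kernel hyperparameters.

function dlml = gp_evidence_grad_lengthscale(X, y, theta, sigma2)
    % Gradient of log p(y | X, theta) with respect to the lengthscale theta,
    % using  d/dtheta log p = 0.5 * trace( (alpha*alpha' - K^-1) * dK/dtheta ).
    N  = numel(y);
    D2 = (X - X').^2;                       % squared distances between inputs
    Kse = exp( -0.5 * D2 / theta^2 );       % squared-exp part of the kernel
    K   = Kse + sigma2 * eye(N);            % plus observation noise

    alpha     = K \ y;
    dK_dtheta = Kse .* D2 / theta^3;        % derivative of the SE kernel w.r.t. theta

    dlml = 0.5 * trace( (alpha * alpha' - inv(K)) * dK_dtheta );
end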

Not just for 1-D continuous functions

[Figure: a GP sample drawn over a two-dimensional input space]

  - D-dimensional input
  - Functions over discrete domains
  - Functions over strings, trees, trajectories
  - Classification

Performance

[Blei et al., 2011]

Limitations

  - Slow: O(N³) cost means that, in practice, N < 3000.
      - Recently some good O(NM²) approximations (FITC).
  - Most commonly used kernels have fairly limited generalization abilities.
  - Non-Gaussian noise requires approximate inference.

Best choice if:
  - Data is small / expensive to gather.
  - You want to do anything besides point prediction.

Outline

Gaussian Process Regression
  Definition
  Properties

Additive Gaussian Processes
  Central Modeling Assumption
  Interpretability
  Related Work
  Results

Central Dogma

Central modeling assumption:

[Figure: a two-dimensional surface f1(x1) + f2(x2) shown as the sum of its two one-dimensional components f1(x1) and f2(x2)]

We hope our high-dimensional function can be written as a sum of orthogonal low-dimensional functions. It's far easier to learn ten 1-dimensional functions than one 10-dimensional function!

Additivity in GPs

It is easy to express the additive property in a GP: use a kernel that is itself a sum of one-dimensional kernels.

[Figure: k1(x1, x1′) (1D kernel) + k2(x2, x2′) (1D kernel) = k1(x1, x1′) + k2(x2, x2′), and a draw f1(x1) + f2(x2) from the resulting first-order additive GP]
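To make the draw above concrete, here is a minimal sketch (not the slides' code; the function name is made up) that samples from a first-order additive GP on a two-dimensional grid by summing two 1-D squared-exp kernels, mirroring simple_gp_sample from earlier:

function additive_gp_sample_2d
    % Grid of 2-D input locations.
    n = 30;
    [X1, X2] = meshgrid( linspace(-4, 4, n) );
    X = [X1(:), X2(:)];
    N = size(X, 1);

    se = @(a, b) exp( -0.5 * (a - b').^2 );    % 1-D squared-exp base kernel

    % First-order additive kernel: k(x, x') = k1(x1, x1') + k2(x2, x2').
    K = se( X(:,1), X(:,1) ) + se( X(:,2), X(:,2) );

    % Draw one sample from the prior (small jitter for numerical stability).
    f = mvnrnd( zeros(1, N), K + 1e-8 * eye(N) );
    surf( X1, X2, reshape(f, n, n) );
end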

We can extend our prior to include more interaction terms:

f(x1, x2, x3, x4) = f1(x1) + f2(x2) + f3(x3) + f4(x4)
                  + f12(x1, x2) + f13(x1, x3) + f14(x1, x4) + f23(x2, x3) + f24(x2, x4) + f34(x3, x4)
                  + f123(x1, x2, x3) + f124(x1, x2, x4) + f134(x1, x3, x4) + f234(x2, x3, x4)
                  + f1234(x1, x2, x3, x4)

Corresponding GP model: assign each dimension i ∈ {1, ..., D} a one-dimensional base kernel ki(xi, xi′), and let zi = ki(xi, xi′). Then

kadd1(x, x′) = z1 + z2 + z3 + z4
kadd2(x, x′) = z1 z2 + z1 z3 + z1 z4 + z2 z3 + z2 z4 + z3 z4
kadd3(x, x′) = z1 z2 z3 + z1 z2 z4 + z1 z3 z4 + z2 z3 z4
kadd4(x, x′) = z1 z2 z3 z4

In D dimensions:

kadd1(x, x′) = σ1² Σ_{i=1..D} ki(xi, xi′)

kadd2(x, x′) = σ2² Σ_{i=1..D} Σ_{j=i+1..D} ki(xi, xi′) kj(xj, xj′)

kadd3(x, x′) = σ3² Σ_{i=1..D} Σ_{j=i+1..D} Σ_{k=j+1..D} ki(xi, xi′) kj(xj, xj′) kk(xk, xk′)

kaddn(x, x′) = σn² Σ_{1 ≤ i1 < i2 < ... < in ≤ D}  Π_{d=1..n} k_{i_d}(x_{i_d}, x′_{i_d})
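Evaluating kaddn naively means summing over all n-tuples of dimensions. A minimal sketch (an assumed implementation, not the authors' code; the function name is made up) that computes every order at once via the Newton-Girard identities for elementary symmetric polynomials, assuming squared-exp base kernels:

function k_add = additive_kernel_all_orders(x, xp, lengthscale)
    % x, xp: D x 1 input points; lengthscale: scalar or D x 1 lengthscales.
    % Returns k_add(n) = nth-order additive kernel, before its variance sigma_n^2.
    D = numel(x);
    z = exp( -0.5 * ( (x - xp) ./ lengthscale ).^2 );   % base kernels z_i = k_i(x_i, x_i')

    % Power sums p_k = sum_i z_i^k.
    p = zeros(1, D);
    for k = 1:D
        p(k) = sum( z.^k );
    end

    % Elementary symmetric polynomials e_n via Newton-Girard:
    % e_n = (1/n) * sum_{k=1}^{n} (-1)^(k-1) * e_{n-k} * p_k,  with e_0 = 1.
    e = zeros(1, D + 1);   % e(n+1) stores e_n
    e(1) = 1;
    for n = 1:D
        s = 0;
        for k = 1:n
            s = s + (-1)^(k-1) * e(n-k+1) * p(k);
        end
        e(n+1) = s / n;
    end

    k_add = e(2:end);      % e_n equals the sum over all n-tuples of products of base kernels
end

For D = 4 the returned vector matches kadd1 through kadd4 written out above, but the recursion avoids enumerating the tuples explicitly.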