Additive Gaussian Processes
David Duvenaud, Hannes Nickisch, Carl Rasmussen
Cambridge University Computational and Biological Learning Lab
January 13, 2012
Outline
Gaussian Process Regression
    Definition
    Properties

Additive Gaussian Processes
    Central Modeling Assumption
    Interpretability
    Related Work
    Results
Regression Methods

Given X, y, predict some new function value y∗ at location x∗.

Many methods with nice properties:
- Linear Regression - fast
- Deep Belief Networks - semi-supervised
- Spline Models - nonparametric
- Gaussian Process Regression:
    - non-parametric
    - data-efficient
    - tractable joint posterior
Definition

Assume our data (X, y) is generated by y = f(x) + ε, where ε is Gaussian noise with variance σ². f is a latent function which we need to do inference about.

A GP prior distribution over f means that, for any finite set of inputs X,

p(f_X | θ) = N( µ_θ(X), K_θ(X, X) )

where K_ij = k_θ(x_i, x_j) is the covariance function or kernel, which specifies the covariance between two function values f(x_1), f(x_2) given their locations x_1, x_2.

e.g. the squared-exponential kernel:
k_θ(x, x′) = exp( −|x − x′|²₂ / (2θ²) )

[Figure: covariance as a function of the distance x − x′]
Sampling from a GP

function simple_gp_sample
    % Choose a set of x locations.
    N = 100;
    x = linspace(-2, 2, N);

    % Specify the covariance between function
    % values, depending on their location.
    for j = 1:N
        for k = 1:N
            sigma(j,k) = covariance(x(j), x(k));
        end
    end

    % Specify that the prior mean of f is zero.
    mu = zeros(N, 1);

    % Sample from a multivariate Gaussian.
    f = mvnrnd(mu, sigma);

    plot(x, f);
end

% Squared-exp covariance function.
function k = covariance(x, y)
    k = exp( -0.5*( x - y )^2 );
end

[Figures: several independent draws from the GP prior with the squared-exponential covariance; each run of the script produces a different smooth random function on x ∈ [−2, 2]]
Sampling from a GP

The same script with a periodic covariance function produces periodic random functions:

% Periodic covariance function.
function c = covariance(x, y)
    c = exp( -0.5*( sin(( x - y )*1.5).^2 ));
end

[Figure: a draw from the GP prior with the periodic covariance]
Conditional Posterior

After conditioning on some data (X, y),

p(f(x∗) | X, y, θ) = (1/Z) p(y | f, X, θ) p(f | θ)
                   = N(y | f, X, θ) N(f | µ, K_θ(X, X))
                   = N( f(x∗) | µ = k(x∗, X) K⁻¹ y,  Σ = k(x∗, x∗) − k(x∗, X) K⁻¹ k(X, x∗) )

[Figures: GP posterior mean and posterior uncertainty over f(x), shown together with the data as observations are added one at a time; axes f(x) vs. x]
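A minimal MATLAB sketch of these posterior equations, in the style of simple_gp_sample above (not code from the slides). The function name gp_predict, the noise variance sn2, and the assumption that covariance() is available on the path are illustrative choices:

% GP posterior prediction: a minimal illustrative sketch.
% x, y: training inputs and targets (column vectors); xstar: test inputs;
% sn2: noise variance; covariance(): the 1-D kernel defined above.
function [mu_star, var_star] = gp_predict(x, y, xstar, sn2)
    N = numel(x);  M = numel(xstar);
    K = zeros(N, N);  ks = zeros(M, N);  kss = zeros(M, 1);
    for j = 1:N
        for k = 1:N,  K(j,k) = covariance(x(j), x(k));  end
    end
    for j = 1:M
        for k = 1:N,  ks(j,k) = covariance(xstar(j), x(k));  end
        kss(j) = covariance(xstar(j), xstar(j));
    end
    A = K + sn2 * eye(N);                     % K + sigma^2 I, for noisy observations
    mu_star  = ks * (A \ y(:));               % k(x*,X) K^-1 y
    var_star = kss - sum((ks / A) .* ks, 2);  % k(x*,x*) - k(x*,X) K^-1 k(X,x*)
end

For example, [m, s2] = gp_predict(x(:), y(:), linspace(-2, 2, 200)', 0.1) returns the posterior mean and variance on a test grid.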
Normalization Constant

Problem with Bayesian models:

p(f | X, y, θ) = (1/Z) p(y | f, X, θ) p(f | θ)
              = p(y | f, X, θ) p(f | θ) / ∫ p(y | f, X, θ) p(f | θ) df

Can actually compute the model evidence p(y | X, θ), a.k.a. Z:

log p(y | X, θ) = log ∫ p(y | f, X, θ) p(f | θ) df
                = −½ yᵀ(K_θ + σ²I)⁻¹ y  −  ½ log |K_θ + σ²I|  −  (N/2) log(2π)
                  [data fit]               [Bayesian Occam's Razor]
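A minimal MATLAB sketch of this formula (not from the slides); the function name, and taking the kernel matrix K and noise variance sn2 as inputs, are illustrative assumptions:

% Log marginal likelihood (evidence) of GP regression.
% y: N x 1 targets, K: N x N kernel matrix, sn2: noise variance sigma^2.
function lml = gp_log_evidence(y, K, sn2)
    N = numel(y);
    A = K + sn2 * eye(N);
    L = chol(A, 'lower');            % Cholesky factor for stable solves and log-det
    alpha = L' \ (L \ y(:));         % alpha = (K + sigma^2 I)^{-1} y
    lml = -0.5 * y(:)' * alpha ...   % data-fit term
          - sum(log(diag(L))) ...    % -0.5*log|K + sigma^2 I| = -sum(log(diag(L)))
          - 0.5 * N * log(2*pi);     % normalization constant
end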
Choice of Kernel Function

Depending on the kernel function, GP regression is equivalent to:
- Linear Regression
- Polynomial Regression
- Splines
- Kalman Filters
- Generalized Additive Models

Can use gradients of the model evidence to learn which model best explains the data; no need for cross-validation.
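The gradient being referred to is the standard expression for the derivative of the log evidence with respect to a kernel hyperparameter θ_j (as in Rasmussen & Williams, 2006; not shown on the slides):

\[
\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \theta)
= \tfrac{1}{2}\,\mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y}
\;-\; \tfrac{1}{2}\,\operatorname{tr}\!\left( K^{-1} \frac{\partial K}{\partial \theta_j} \right),
\qquad K = K_\theta + \sigma^2 I .
\]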
Not just for 1-D continuous functions

[Figure: a random function drawn from a GP over a 2-D input space]

- D-dimensional input
- Functions over discrete domains
- Functions over strings, trees, trajectories
- Classification
Performance
[Figure from Blei et al., 2011]
Limitations

- Slow: O(N³) cost means that in practice N < 3000.
    - Recently some good O(NM²) approximations (FITC).
- Most commonly used kernels have fairly limited generalization abilities.
- Non-Gaussian noise requires approximate inference.

Best choice if:
- Data is small / expensive to gather.
- You want to do anything besides point prediction.
Outline
Gaussian Process Regression
    Definition
    Properties

Additive Gaussian Processes
    Central Modeling Assumption
    Interpretability
    Related Work
    Results
Central Dogma

Central modeling assumption:

[Figure: a 2-D function f1(x1) + f2(x2) shown as the sum of two 1-D functions, f1(x1) and f2(x2)]

We hope our high-dimensional function can be written as a sum of orthogonal low-dimensional functions. It is far easier to learn ten 1-dimensional functions than one 10-dimensional function!
Additivity in GPs

It is easy to express the additive property in a GP:

[Figure: k1(x1, x1′) (a 1-D kernel) + k2(x2, x2′) (a 1-D kernel) = k1(x1, x1′) + k2(x2, x2′), a first-order additive kernel; below it, a draw f1(x1) + f2(x2) from the corresponding first-order additive GP]
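A minimal MATLAB sketch of this construction (illustrative, not from the slides): build a first-order additive kernel from two 1-D squared-exponential kernels and draw a random surface from the corresponding GP. The grid size, unit lengthscales, and jitter value are assumptions; the elementwise expansion requires MATLAB R2016b or later.

% Draw from a first-order additive GP in 2-D: k(x,x') = k1(x1,x1') + k2(x2,x2').
function additive_gp_sample_2d
    n = 30;                                        % grid resolution per dimension
    g = linspace(-4, 4, n);
    [X1, X2] = meshgrid(g, g);
    x = [X1(:), X2(:)];                            % n^2 x 2 input locations

    se = @(a, b) exp(-0.5 * (a - b').^2);          % 1-D squared-exp kernel matrix
    K = se(x(:,1), x(:,1)) + se(x(:,2), x(:,2));   % additive kernel: k1 + k2

    K = K + 1e-6 * eye(n^2);                       % jitter for numerical stability
    f = chol(K, 'lower') * randn(n^2, 1);          % sample f ~ N(0, K)

    surf(X1, X2, reshape(f, n, n));                % plot the additive random surface
end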
We can extend our prior to include more interaction terms:

f(x1, x2, x3, x4) = f1(x1) + f2(x2) + f3(x3) + f4(x4)
  + f12(x1, x2) + f13(x1, x3) + f14(x1, x4) + f23(x2, x3) + f24(x2, x4) + f34(x3, x4)
  + f123(x1, x2, x3) + f124(x1, x2, x4) + f134(x1, x3, x4) + f234(x2, x3, x4)
  + f1234(x1, x2, x3, x4)

Corresponding GP model: assign each dimension i ∈ {1, ..., D} a one-dimensional base kernel ki(xi, xi′), and let zi = ki(xi, xi′). Then

kadd1(x, x′) = z1 + z2 + z3 + z4
kadd2(x, x′) = z1 z2 + z1 z3 + z1 z4 + z2 z3 + z2 z4 + z3 z4
kadd3(x, x′) = z1 z2 z3 + z1 z2 z4 + z1 z3 z4 + z2 z3 z4
kadd4(x, x′) = z1 z2 z3 z4
In D dimensions:

kadd1(x, x′) = σ1² Σ_{i=1}^{D} ki(xi, xi′)

kadd2(x, x′) = σ2² Σ_{i=1}^{D} Σ_{j=i+1}^{D} ki(xi, xi′) kj(xj, xj′)

kadd3(x, x′) = σ3² Σ_{i=1}^{D} Σ_{j=i+1}^{D} Σ_{k=j+1}^{D} ki(xi, xi′) kj(xj, xj′) kk(xk, xk′)

kaddn(x, x′) = σn² Σ_{1 ≤ i1 < i2 < ... < in ≤ D} Π_{d=1}^{n} k_{id}(x_{id}, x′_{id})
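Evaluating the order-n sum directly requires summing over all D-choose-n subsets of dimensions. A minimal MATLAB sketch (illustrative, not from the slides) that computes every order at once via the Newton-Girard recurrence for elementary symmetric polynomials, the efficient evaluation used in the accompanying paper; squared-exponential base kernels, the function name, and unit variances σn² = 1 are assumptions here:

% All orders of the additive kernel between two points x and xp (1 x D vectors),
% via the Newton-Girard formulae for elementary symmetric polynomials.
function kadd = additive_kernel_orders(x, xp, lengthscales)
    D = numel(x);
    z = exp(-0.5 * ((x - xp) ./ lengthscales).^2);   % 1-D base kernel values z_i

    % Power sums p_k = sum_i z_i^k, for k = 1..D.
    p = zeros(1, D);
    for k = 1:D
        p(k) = sum(z.^k);
    end

    % Elementary symmetric polynomials e_0..e_D via Newton-Girard:
    % e_n = (1/n) * sum_{k=1}^{n} (-1)^(k-1) * e_{n-k} * p_k.
    e = zeros(1, D + 1);   % e(n+1) stores e_n
    e(1) = 1;              % e_0 = 1
    for n = 1:D
        s = 0;
        for k = 1:n
            s = s + (-1)^(k-1) * e(n-k+1) * p(k);
        end
        e(n+1) = s / n;
    end

    kadd = e(2:end);       % kadd(n) = n-th order additive kernel (sigma_n^2 = 1)
end

For example, additive_kernel_orders([0 0 0], [1 2 3], [1 1 1]) returns the first-, second-, and third-order additive kernel values between the two points.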