Gaussian Process Regression Networks

Andrew Gordon Wilson
[email protected]
mlg.eng.cam.ac.uk/andrew
University of Cambridge

Joint work with David A. Knowles and Zoubin Ghahramani

June 27, 2012 ICML, Edinburgh


Multiple responses with input dependent covariances

Two response variables:
- y1(x): concentration of cadmium at a spatial location x.
- y2(x): concentration of zinc at a spatial location x.

- The values of these responses, at a given spatial location x*, are correlated.
- We can account for these correlations (rather than assuming y1(x) and y2(x) are independent) to enhance predictions.
- We can further enhance predictions by accounting for how these correlations vary with geographical location x.
- Accounting for input dependent correlations is a distinctive feature of the Gaussian process regression network.

[Figure: spatial map over longitude and latitude, with a colour scale (roughly 0.0 to 0.9) showing how the correlation between the two responses varies with location.]

Motivation for modelling dependent covariances

Promise
- Many problems in fact have input dependent uncertainties and correlations.
- Accounting for dependent covariances (uncertainties and correlations) can greatly improve statistical inferences.

Uncharted Territory
- For convenience, response variables are typically seen as independent, or as having fixed covariances (e.g. the multi-task literature).
- The few existing models of dependent covariances are typically not expressive (e.g. Brownian motion covariance structure) or scalable (e.g. < 5 response variables).

Goal
- We want to develop expressive and scalable models (> 1000 response variables) for dependent uncertainties and correlations.

Outline

- Gaussian process review
- Gaussian process regression networks
- Applications

Gaussian processes

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x1), ..., f(xN)) ∼ N(µ, K), with µi = m(xi) and Kij = cov(f(xi), f(xj)) = k(xi, xj).

Posterior:

  p(f(x) | D) ∝ p(D | f(x)) p(f(x)),

i.e. GP posterior ∝ likelihood × GP prior.

[Figure: (a) sample functions f(t) drawn from the GP prior; (b) sample functions drawn from the GP posterior, both plotted against the input t.]
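To make the prior-to-posterior step concrete, here is a minimal numpy sketch of GP regression. The squared exponential kernel, lengthscale, and noise level are illustrative assumptions, not choices made in the talk.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance k(x, x') for 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and covariance of f(x_test) given noisy observations y_train."""
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                    # E[f(x*) | D]
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                    # cov[f(x*) | D]
    return mean, cov

# Toy usage: observe a noisy sine and predict on a grid of test inputs.
x = np.linspace(-10, 10, 20)
y = np.sin(x) + 0.1 * np.random.randn(20)
mu, cov = gp_posterior(x, y, np.linspace(-10, 10, 100))
```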

Gaussian processes

“How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?” David MacKay, 1998.


Semiparametric Latent Factor Model

The semiparametric latent factor model (SLFM) (Teh, Seeger, and Jordan, 2005) is a popular multi-output (multi-task) GP model for fixed signal correlations between outputs (response variables):

  y(x) = W f(x) + σy z(x)

with y(x) of size p × 1, W of size p × q, f(x) of size q × 1, and z(x) ∼ N(0, Ip).

- y(x): p × 1 vector of output variables (responses) evaluated at x.
- W: p × q matrix of mixing weights.
- f(x): q × 1 vector of Gaussian process functions.
- σy: hyperparameter controlling noise variance.
- x: input variable (e.g. geographical location).
- z(x): i.i.d. Gaussian white noise with p × p identity covariance Ip.

[Diagram: network with input x feeding node functions f1(x), ..., fq(x), which are mixed by fixed weights W11, ..., Wpq into outputs y1(x), ..., yp(x).]
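As a rough illustration of the generative model above, the following numpy sketch draws outputs from an SLFM prior. The RBF kernel and the particular values of p, q, and σy are assumptions made for the example, not values from the talk.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Squared exponential covariance for 1-D inputs x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def sample_slfm(x, p=3, q=2, sigma_y=0.1, seed=0):
    """Draw y(x) = W f(x) + sigma_y z(x): q GP node functions mixed by one fixed p x q matrix W."""
    rng = np.random.default_rng(seed)
    N = len(x)
    L = np.linalg.cholesky(rbf_kernel(x) + 1e-8 * np.eye(N))
    f = L @ rng.standard_normal((N, q))      # f(x): N x q samples from the GP prior
    W = rng.standard_normal((p, q))          # fixed mixing weights, shared across all x
    z = rng.standard_normal((N, p))          # i.i.d. noise z(x) ~ N(0, I_p)
    return f @ W.T + sigma_y * z             # y(x): N x p

y = sample_slfm(np.linspace(0, 5, 100))
```

Because W is the same at every input, the signal correlations between outputs are fixed; this is the limitation the GPRN removes.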

Deriving the GPRN

[Diagram: the network structure at x = x1 and at x = x2, with node functions f1, f2 and outputs y1, y2; the connections differ between the two inputs.]

At x = x1 the two outputs (responses) y1 and y2 are correlated since they share the basis function f1. At x = x2 the outputs are independent.

From the SLFM to the GPRN

[Diagram: the SLFM network (fixed weights Wij mixing node functions fi(x) into outputs yj(x)) on the left; a question mark stands in for the GPRN on the right.]

SLFM: y(x) = W f(x) + σy z(x), with y(x) p × 1, W p × q, f(x) q × 1, and z(x) ∼ N(0, Ip).

From the SLFM to the GPRN

[Diagram: the SLFM network with fixed weights Wij alongside the GPRN network, in which both the node functions f̂i(x) and the weights Wij(x) depend on the input x.]

SLFM: y(x) = W f(x) + σy z(x), with W p × q and z(x) ∼ N(0, Ip).

GPRN: y(x) = W(x) [ f(x) + σf ε(x) ] + σy z(x), where f̂(x) = f(x) + σf ε(x), W(x) is p × q, ε(x) ∼ N(0, Iq), and z(x) ∼ N(0, Ip).

Gaussian process regression networks

  y(x) = W(x) [ f(x) + σf ε(x) ] + σy z(x),   with f̂(x) = f(x) + σf ε(x), ε(x) ∼ N(0, Iq), z(x) ∼ N(0, Ip),

or, equivalently,

  y(x) = W(x) f(x) + σf W(x) ε(x) + σy z(x),

where W(x) f(x) is the signal and σf W(x) ε(x) + σy z(x) is the noise.

- y(x): p × 1 vector of output variables (responses) evaluated at x.
- W(x): p × q matrix of weight functions, with W(x)ij ∼ GP(0, kw).
- f(x): q × 1 vector of Gaussian process node functions, with f(x)i ∼ GP(0, kf).
- σf, σy: hyperparameters controlling noise variance.
- ε(x), z(x): Gaussian white noise.

[Diagram: GPRN network with input x, node functions f̂1(x), ..., f̂q(x), input-dependent weights W11(x), ..., Wpq(x), and outputs y1(x), ..., yp(x).]
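The same kind of sketch extends to the GPRN, where the weights are themselves GP draws and therefore change with x. The kernels, lengthscales, and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Squared exponential covariance for 1-D inputs x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def sample_gprn(x, p=3, q=2, sigma_f=0.1, sigma_y=0.1, seed=0):
    """Draw y(x) = W(x) [f(x) + sigma_f eps(x)] + sigma_y z(x) on a grid of inputs x."""
    rng = np.random.default_rng(seed)
    N = len(x)
    Lf = np.linalg.cholesky(rbf_kernel(x, lengthscale=1.0) + 1e-8 * np.eye(N))
    Lw = np.linalg.cholesky(rbf_kernel(x, lengthscale=2.0) + 1e-8 * np.eye(N))
    f = Lf @ rng.standard_normal((N, q))                          # node functions f_i(x) ~ GP(0, k_f)
    W = (Lw @ rng.standard_normal((N, p * q))).reshape(N, p, q)   # weight functions W_ij(x) ~ GP(0, k_w)
    f_hat = f + sigma_f * rng.standard_normal((N, q))             # f_hat(x) = f(x) + sigma_f eps(x)
    y = np.einsum('npq,nq->np', W, f_hat)                         # mix the noisy nodes with W(x) at each x
    return y + sigma_y * rng.standard_normal((N, p))              # add output noise sigma_y z(x)

y = sample_gprn(np.linspace(0, 5, 100))
```

Because each Wij(x) varies smoothly with x, the signal and noise covariances of y(x) are input dependent, which is exactly the behaviour motivated at the start of the talk.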

GPRN Inference

- We sample from the posterior over the Gaussian process weight and node functions using elliptical slice sampling (ESS) (Murray, Adams, and MacKay, 2010). ESS is especially good for sampling from posteriors with correlated Gaussian priors (a generic ESS update is sketched after this list).
- We also approximate this posterior using a message passing implementation of variational Bayes (VB).
- The computational complexity is cubic in the number of data points and linear in the number of response variables, per iteration of ESS or VB.
- Details are in the paper.
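For reference, here is a generic elliptical slice sampling update in the form given by Murray, Adams, and MacKay (2010). In the GPRN the state would hold the node and weight function values and log_lik would be the Gaussian observation likelihood; the sketch itself is model agnostic.

```python
import numpy as np

def elliptical_slice(f, prior_chol, log_lik, rng):
    """One elliptical slice sampling update for a state f with prior N(0, Sigma),
    where prior_chol is a Cholesky factor of Sigma and log_lik(f) is the log-likelihood."""
    nu = prior_chol @ rng.standard_normal(len(f))       # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())          # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse through f and nu
        if log_lik(f_new) > log_y:
            return f_new                                 # accepted
        # shrink the bracket towards theta = 0 (the current state) and try again
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

The update has no step-size parameters and leaves the Gaussian prior invariant, which is why it suits posteriors with correlated Gaussian priors such as the GPRN's.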

GPRN Results, Jura Heavy Metal Dataset

[Figure: the predicted spatially varying correlation plotted over longitude and latitude (colour scale roughly 0.0 to 0.9), alongside the GPRN used for this dataset: node functions f1(x) and f2(x) feeding outputs y1(x), y2(x), y3(x) through weight functions W11(x), W12(x), W21(x), W22(x), W31(x), W32(x).]

  y(x) = W(x) f(x) + σf W(x) ε(x) + σy z(x),

where W(x) f(x) is the signal and σf W(x) ε(x) + σy z(x) is the noise.

GPRN Results, Gene Expression 50D

[Bar chart: standardised mean square error (SMSE) on the GENE dataset (p = 50) for GPRN (ESS), GPRN (VB), CMOGP, LMC, and SLFM.]

GPRN Results, Gene Expression 1000D

[Bar chart: SMSE on the GENE dataset (p = 1000) for GPRN (ESS), GPRN (VB), CMOFITC, CMOPITC, and CMODTC.]

Training Times on GENE

Training time        GENE (50D)    GENE (1000D)
GPRN (VB)            12 s          330 s
GPRN (ESS)           40 s          9000 s
LMC, CMOGP, SLFM     minutes       days

Multivariate Volatility Results

[Bar chart: MSE on the EQUITY dataset for GPRN (VB), GPRN (ESS), GWP, WP, and MGARCH.]

Summary

- A Gaussian process regression network is used for multi-task regression and multivariate volatility, and can account for input dependent signal and noise covariances.
- Can scale to thousands of dimensions.
- Outperforms multi-task Gaussian process models and multivariate volatility models.

Generalised Wishart Processes

Recall that the GPRN model can be written as

  y(x) = W(x) f(x) + σf W(x) ε(x) + σy z(x),

where W(x) f(x) is the signal and σf W(x) ε(x) + σy z(x) is the noise. The induced noise process,

  Σnoise(x) = σf^2 W(x) W(x)^T + σy^2 I,

is an example of a Generalised Wishart Process (Wilson and Ghahramani, 2010). At every x, Σ(x) is marginally Wishart, and the dynamics of Σ(x) are governed by the GP covariance kernel used for the weight functions in W(x).
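As a small numerical sketch of the induced covariance, the function below assembles Σnoise(x) from a single p × q weight matrix W(x); the example weight values and hyperparameters are arbitrary.

```python
import numpy as np

def induced_noise_covariance(W_x, sigma_f=0.1, sigma_y=0.05):
    """Sigma_noise(x) = sigma_f^2 W(x) W(x)^T + sigma_y^2 I_p for one p x q matrix W(x)."""
    p = W_x.shape[0]
    return sigma_f**2 * W_x @ W_x.T + sigma_y**2 * np.eye(p)

def to_correlation(Sigma):
    """Convert a covariance matrix to a correlation matrix."""
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

# Since each W_ij(x) is a GP draw, W(x) varies smoothly with x, and so do
# Sigma_noise(x) and the induced correlations between outputs.
W_x = np.random.default_rng(0).standard_normal((3, 2))   # illustrative p=3, q=2 weights at one x
print(to_correlation(induced_noise_covariance(W_x)))
```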

GPRN Inference

  y(x) = W(x) f(x) + σf W(x) ε(x) + σy z(x),

where W(x) f(x) is the signal and σf W(x) ε(x) + σy z(x) is the noise.

Prior, induced through the GP priors on the node and weight functions (whose values are collected in the vector u):

  p(u | σf, γ) = N(0, CB)

Likelihood:

  p(D | u, σf, σy) = ∏_{i=1}^{N} N(y(xi); W(xi) f̂(xi), σy^2 Ip)

Posterior:

  p(u | D, σf, σy, γ) ∝ p(D | u, σf, σy) p(u | σf, γ)

We sample from the posterior using elliptical slice sampling (Murray, Adams, and MacKay, 2010) or approximate it using a message passing implementation of variational Bayes.
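Putting the prior, likelihood, and posterior above together, here is a hedged sketch of the unnormalised log posterior over u. The way u is packed into node and weight values is an illustrative assumption, not the exact layout used in the paper.

```python
import numpy as np

def log_posterior(u, C_B_chol, Y, q, sigma_y):
    """Unnormalised log p(u | D): Gaussian prior N(u; 0, C_B) plus Gaussian likelihood.

    Y is N x p. u packs, for each input x_i, the q node values f_hat(x_i) followed by
    the p*q weight values W(x_i) (row-major); C_B_chol is a Cholesky factor of C_B.
    """
    N, p = Y.shape
    blocks = u.reshape(N, q + p * q)
    f_hat = blocks[:, :q]                        # f_hat(x_i), shape (N, q)
    W = blocks[:, q:].reshape(N, p, q)           # W(x_i), shape (N, p, q)

    # log N(u; 0, C_B), up to an additive constant
    alpha = np.linalg.solve(C_B_chol, u)
    log_prior = -0.5 * alpha @ alpha - np.log(np.diag(C_B_chol)).sum()

    # sum_i log N(y(x_i); W(x_i) f_hat(x_i), sigma_y^2 I_p), again up to a constant
    resid = Y - np.einsum('npq,nq->np', W, f_hat)
    log_lik = -0.5 * (resid**2).sum() / sigma_y**2 - N * p * np.log(sigma_y)

    return log_prior + log_lik
```

A function like this is what a variational scheme would bound; elliptical slice sampling would use only the likelihood term, since the Gaussian prior is handled by the sampler itself.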