Learning graphical models for stationary time series

Francis Bach (Ecole des Mines de Paris)
Michael Jordan (UC Berkeley)

SAMSI - November 2006

Outline

• Graphical models: the machine learning perspective
  – Directed vs. undirected, Gaussian vs. non-Gaussian
  – Learning the ordering/structure

• Stationary time series
  – Spectral theory
  – Inference, parameter estimation

• Stationary time series and conditional independence
  – Spectral interpretation
  – Graphical models: inference, learning

• Extensions
  – Missing data, frequency-dependent graph, Bayesian formulation

Graphical models

• Random variables X1, . . . , Xn
• Families of distributions equivalently characterized by
  – A factorized distribution p(x1, . . . , xn)
  – A set of conditional independence statements
• Interpretability, inference, elicitation
• Two main types:
  – Directed (a.k.a. "Bayesian" networks)
  – Undirected (a.k.a. Markov random fields)
• Jointly Gaussian distributions: X ∼ N(µ, Σ) → different assumptions regarding the covariance matrix

Undirected graphical models

[Figure: undirected graph on X1, . . . , X6, with vertex sets A and B separated by C]

• Undirected graph G = (V, E)
• Factorized distributions: p(x) = (1/Z) ∏_{C ∈ C} ψ_C(x_C)
• Conditional independences:
  – pairwise: (i, j) ∉ E ⇒ Xi ⊥ Xj | X_{{1,...,n}\{i,j}}
  – global: C separates A and B ⇒ X_A ⊥ X_B | X_C
• Gaussian distributions: (i, j) ∉ E ⇒ (Σ⁻¹)_ij = 0
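A minimal numpy sketch of the Gaussian statement above (the chain graph and the numbers are made-up toy choices): the edges of G determine exactly which off-diagonal entries of Σ⁻¹ may be nonzero.

```python
import numpy as np

# Toy Gaussian undirected model on a chain X1 - X2 - X3 - X4:
# nonzero off-diagonal precision entries only on edges of the graph.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]

K = np.eye(n)                         # precision matrix Sigma^{-1}
for i, j in edges:
    K[i, j] = K[j, i] = -0.4          # arbitrary edge weights (K stays positive definite)

Sigma = np.linalg.inv(K)              # covariance of the model
K_back = np.linalg.inv(Sigma)

for i in range(n):
    for j in range(i + 1, n):
        if (i, j) not in edges:
            # missing edge <=> zero entry in the precision matrix
            assert abs(K_back[i, j]) < 1e-10
```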

Directed graphical models

[Figure: directed acyclic graph on X1, . . . , X6, with vertex sets A and B d-separated by C]

• Directed acyclic graph G = (V, E)
• Factorized distributions: p(x) = ∏_{i=1}^{n} p(x_i | x_{π_i})
• Conditional independences:
  – local: X_i ⊥ X_{non-descendants(i)} | X_{π_i}
  – global: C d-separates A and B ⇒ X_A ⊥ X_B | X_C

Directed Gaussian graphical models (zero-mean)

[Figure: directed acyclic graph on X1, . . . , X6]

p(x) = N(0, Σ) = ∏_{i=1}^{n} p(x_i | x_{π_i}) = ∏_{i=1}^{n} N(x_i | W_i x_{π_i}, d_i)

• From Σ to (W, d):
  W_i = Σ_{i,π_i} (Σ_{π_i,π_i})⁻¹
  d_i = Σ_{i,i} − Σ_{i,π_i} (Σ_{π_i,π_i})⁻¹ Σ_{π_i,i}
• From (W, d) to Σ⁻¹, with D = Diag(d) and W the matrix whose i-th row holds W_i on the columns π_i (zeros elsewhere):
  Σ⁻¹ = (I − W)ᵀ D⁻¹ (I − W)
• Given a topological order, Σ⁻¹ has a sparse Cholesky factor
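A small numpy sketch of both conversions (the DAG, the weights, and the variances below are made-up toy values):

```python
import numpy as np

# Build Sigma from (W, d) on a toy DAG, then recover (W, d) from Sigma
# using the formulas on the slide.
n = 4
parents = {0: [], 1: [0], 2: [0, 1], 3: [2]}            # topological order

W = np.zeros((n, n))
W[1, 0], W[2, 0], W[2, 1], W[3, 2] = 0.8, -0.5, 0.3, 0.6
d = np.array([1.0, 0.5, 0.7, 0.9])

M = np.eye(n) - W
P = M.T @ np.diag(1.0 / d) @ M                          # Sigma^{-1} = (I-W)^T D^{-1} (I-W)
Sigma = np.linalg.inv(P)

for i in range(n):
    pi = parents[i]
    if pi:
        S_pp_inv = np.linalg.inv(Sigma[np.ix_(pi, pi)])
        Wi = Sigma[i, pi] @ S_pp_inv                    # W_i = Sigma_{i,pi} Sigma_{pi,pi}^{-1}
        di = Sigma[i, i] - Sigma[i, pi] @ S_pp_inv @ Sigma[pi, i]
        assert np.allclose(Wi, W[i, pi]) and np.isclose(di, d[i])
    else:
        assert np.isclose(Sigma[i, i], d[i])
```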

Complexity of (exact) inference in graphical models
Undirected models

• Typically: compute p(x_i, x_j) for all (i, j) ∈ E
• Discrete models: exponential in the treewidth tw of G
  (tw + 1 = size of the largest clique in the optimally triangulated graph)

[Figure: graph on X1, . . . , X6 and its triangulation]

• Gaussian graphical model: O(n × tw³)

Complexity of (exact) inference in graphical models
Directed models

• Discrete models: exponential in the treewidth tw of the moralized graph (obtained by connecting parents)

[Figure: DAG on X1, . . . , X6, its moralized graph, and its triangulation]

• Gaussian graphical model: O(n × tw³) is not the best!
  Σ⁻¹ = (I − W)ᵀ D⁻¹ (I − W)
  Σ = (I − W)⁻¹ D (I − W)⁻ᵀ
  sparse triangular inversion ⇒ O(n²d), d = maximum fan-in
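A sketch of the idea behind this claim (toy W and d as in the earlier sketch): once (W, D) are known, Σb is obtained with two triangular solves and a diagonal scaling, never forming Σ.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Sigma b = (I - W)^{-1} D (I - W)^{-T} b via two sparse triangular solves.
n = 4
W = np.zeros((n, n))
W[1, 0], W[2, 0], W[2, 1], W[3, 2] = 0.8, -0.5, 0.3, 0.6   # toy DAG weights
d = np.array([1.0, 0.5, 0.7, 0.9])
b = np.arange(1.0, n + 1)

M = np.eye(n) - W                                # unit lower triangular in topological order
u = solve_triangular(M.T, b, lower=False)        # (I - W)^{-T} b
x = solve_triangular(M, d * u, lower=True)       # (I - W)^{-1} D u

Sigma = np.linalg.inv(M.T @ np.diag(1.0 / d) @ M)
assert np.allclose(x, Sigma @ b)                 # same answer as the dense route
```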

Learning the structure of graphical models

[Figure: model classes: undirected, directed, and their decomposable intersection]

• Approaches based on conditional independence tests
• Model selection approaches: for directed and decomposable models
  – Decomposition and hierarchical Bayesian approaches feasible
  – Inference with the model not always tractable in the directed non-Gaussian case

Learning the structure of directed graphical models

• Directed models:
  – Cost function J(G) from classical model selection criteria (AIC, BIC, marginal likelihood)
  – J(G) factorizes: J(G) = Σ_{i=1}^{n} J_i(π_i(G))
  – e.g., AIC: J_i(π_i(G)) = Î(x_i, x_{π_i(G)}) − (1/n) k_{π_i(G)}
    (empirical mutual information penalized by the number of parameters)
• Non-Bayesian setting: finding G = arg max J(G): is it hopeless?
  – Restriction to trees: maximum weight spanning forest, the global maximum can be found in O(n² log n) (Chow and Liu, 1968); see the sketch below
  – General case: greedy algorithm in the space of DAGs or equivalence classes of DAGs → not so bad... (Chickering, 2002)
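A hedged sketch of the Chow-Liu idea for Gaussian data (data and dimensions are made up; for jointly Gaussian variables the pairwise mutual information is −½ log(1 − ρ²_ij)):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Chow-Liu-style tree: maximum weight spanning tree on pairwise mutual
# information, computed here as a minimum spanning tree on negated weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # toy correlated samples

R = np.corrcoef(X, rowvar=False)
MI = -0.5 * np.log(1.0 - np.clip(R**2, 0.0, 1.0 - 1e-12))         # pairwise Gaussian MI
np.fill_diagonal(MI, 0.0)

tree = minimum_spanning_tree(np.triu(-MI))                        # maximize MI
edges = list(zip(*tree.nonzero()))                                # edges of the learned tree
print(edges)
```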

Stationary time series

• Multiple time series x(t) ∈ R^m, t ∈ Z
• Strict stationarity: p(x(t₁ + h), x(t₂ + h), . . . , x(t_k + h)) independent of h
• For jointly Gaussian distributions, equivalent to weak stationarity:
  ∀t, E x(t) = E x(0) and ∀t, u, E x(t)x(u)ᵀ = E x(t − u)x(0)ᵀ
• Strong assumption?
  – Yes, in general
  – No, after proper transformation and/or removal of non-stationarities

Stationary (zero-mean) Gaussian processes

• Joint distribution characterized by the mean µ (= 0) and the autocovariance function Γ(h) = E x(t + h)x(t)ᵀ = E x(h)x(0)ᵀ ∈ R^{m×m}
• Main modelling assumption: the joint covariance matrix Σ_h of x(t), x(t + 1), . . . , x(t + h − 1) is block Toeplitz:

  Σ_h = [ Γ(0)     Γ(−1)   Γ(−2)   · · ·   Γ(−h+1) ]
        [ Γ(1)     Γ(0)    Γ(−1)           ⋮       ]
        [ Γ(2)     Γ(1)    Γ(0)            ⋮       ]
        [ ⋮                        ⋱       Γ(−1)   ]
        [ Γ(h−1)   · · ·           Γ(1)    Γ(0)    ]
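A small numpy sketch of this structure (the moving-average autocovariance below is a made-up toy choice):

```python
import numpy as np

# Assemble the block Toeplitz covariance Sigma_H from an autocovariance Gamma(h),
# here the autocovariance of a toy vector MA(1): x(t) = eps(t) + Theta eps(t-1).
m, H = 2, 4
Theta = np.array([[0.5, 0.2], [0.0, 0.3]])

def Gamma(h):
    if h == 0:
        return np.eye(m) + Theta @ Theta.T
    if h == 1:
        return Theta
    if h == -1:
        return Theta.T
    return np.zeros((m, m))

Sigma_H = np.block([[Gamma(i - j) for j in range(H)] for i in range(H)])
assert np.allclose(Sigma_H, Sigma_H.T)        # symmetric, with constant block diagonals
```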

Spectral density matrix

• If Σ_{h=−∞}^{+∞} ||Γ(h)||₂ < ∞, the spectral density matrix is defined as

  f(ω) = (1/2π) Σ_{h=−∞}^{+∞} Γ(h) e^{−ihω} ∈ C^{m×m}

• Properties:
  – 2π-periodic
  – ∀ω, f(ω) is Hermitian positive semi-definite
• Fourier transform pair: Γ(h) = ∫₀^{2π} f(ω) e^{ihω} dω

Spectral representation

• Spectral representation theorem:

  x(t) = ∫₀^{2π} e^{itω} dZ(ω)

  with Z(ω) a random process with orthogonal increments, such that E dZ(ω) dZ(ω)* = f(ω)
• Infinite sum of independent sinusoids with covariance f(ω)
• Graphical model: [Figure: independent components at frequencies ω₁, ω₂, ω₃, ω₄]

AR(1) processes

• x(t + 1) = Ψ x(t) + ε(t), ε(t) ∼ N(0, Σ)
• One time series: m = 1, Σ = σ², |ψ| < 1
  – Spectral density: f(ω) = (1/2π) σ² / |1 − ψ e^{−iω}|²
  – Autocovariance function: γ(h) = σ² ψ^{|h|} / (1 − ψ²)

[Figure: spectral density and autocovariance function of a univariate AR(1)]

• Multiple time series (eigenvalues of Ψ inside the unit circle)
  – Spectral density matrix: f(ω) = (1/2π) (I − Ψ e^{−iω})⁻¹ Σ (I − Ψ e^{−iω})⁻*
  – Autocovariance function
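A numpy sketch of the multivariate formula (Ψ and Σ below are made-up stable toy values); it also checks the Hermitian positive semi-definite property from the previous slide:

```python
import numpy as np

# Spectral density matrix of a stable vector AR(1):
# f(omega) = (1/2pi) (I - Psi e^{-i omega})^{-1} Sigma (I - Psi e^{-i omega})^{-*}
m = 2
Psi = np.array([[0.5, 0.2], [0.0, 0.4]])      # eigenvalues inside the unit circle
Sig = np.array([[1.0, 0.3], [0.3, 1.0]])      # innovation covariance

def spectral_density(omega):
    A = np.linalg.inv(np.eye(m) - Psi * np.exp(-1j * omega))
    return A @ Sig @ A.conj().T / (2 * np.pi)

for omega in np.linspace(0.0, 2 * np.pi, 64, endpoint=False):
    f = spectral_density(omega)
    assert np.allclose(f, f.conj().T)                     # Hermitian
    assert np.linalg.eigvalsh(f).min() > -1e-12           # positive semi-definite
```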

Estimation from finite sample

• The sample autocovariance function is defined as

  Γ̂(h) = (1/T) Σ_{t=0}^{T−h−1} x(t + h) x(t)ᵀ, h ∈ [0, T − 1]

  (NB: usually after removing the mean and non-stationarities)
• Consistent estimator of Γ(h)
• Estimation of f(ω) as the Fourier transform of Γ̂(h)?
  – Periodogram: I(2πk/T) = (1/2π) d(k) d(k)*, with d(k) = (1/√T) Σ_{t=0}^{T−1} x(t) e^{−2iπkt/T}
  – Bad estimator of the spectral density
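A minimal sketch of the periodogram via the FFT, on simulated AR(1) data (all numbers made up):

```python
import numpy as np

# Periodogram of a univariate series at the Fourier frequencies omega_k = 2*pi*k/T.
rng = np.random.default_rng(0)
T, psi, sigma = 1024, 0.7, 1.0
x = np.zeros(T)
for t in range(1, T):
    x[t] = psi * x[t - 1] + sigma * rng.standard_normal()

d = np.fft.fft(x) / np.sqrt(T)                  # discrete Fourier transform d(k)
I = np.abs(d) ** 2 / (2 * np.pi)                # periodogram: very noisy, not consistent

omegas = 2 * np.pi * np.arange(T) / T
f_true = sigma**2 / (2 * np.pi * np.abs(1 - psi * np.exp(-1j * omegas)) ** 2)
print(np.mean(I), np.mean(f_true))              # comparable on average, wild pointwise
```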

Estimation from finite sample

[Figure: estimated autocovariance function and raw periodogram computed from a finite sample]

Estimation from finite sample of size T

• Smoothing the periodogram leads to consistent estimators
• Smoothing by a Gaussian window of standard deviation r
  – estimate consistent if r(T) → ∞ and r(T)/T → 0

[Figure: raw periodogram and smoothed periodogram]

• Given r, Γ̂(h) is "equal" to zero for h larger than H = 2T/r = time horizon
• Given the time horizon H, only need the spectral density at ω_h = 2πh/H (Nyquist theorem)
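A sketch of the smoothing step (the window width r and the white-noise data are illustrative choices):

```python
import numpy as np

def smooth_periodogram(I, r):
    # Circular convolution of the periodogram I with a Gaussian window of
    # standard deviation r (in frequency bins), computed via the FFT.
    T = len(I)
    k = np.arange(T)
    k = np.minimum(k, T - k)                   # circular distance to bin 0
    w = np.exp(-0.5 * (k / r) ** 2)
    w /= w.sum()
    return np.real(np.fft.ifft(np.fft.fft(I) * np.fft.fft(w)))

# e.g., on the raw periodogram of a white noise sample (toy data)
x = np.random.default_rng(0).standard_normal(1024)
I = np.abs(np.fft.fft(x)) ** 2 / (2 * np.pi * len(x))
f_hat = smooth_periodogram(I, r=8.0)           # consistent if r(T) -> infinity, r(T)/T -> 0
```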

Forecasting

• Find Ψ that minimizes e(Ψ) = E ‖ x(H + 1) − Σ_{j=1}^{H} Ψ_j x(H + 1 − j) ‖²
• Yule-Walker equations: ∀j ≥ 1, Σ_{i=1}^{H} Ψ_i Γ(i − j) = Γ(j)
• Durbin-Levinson algorithm: inversion of the linear system in O(m³H²)
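A hedged sketch: instead of Durbin-Levinson, the Yule-Walker system is solved here by a plain dense solve, on a univariate AR(2) with made-up coefficients, to check that the true coefficients are recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 20000, 5
phi = np.array([0.5, -0.3])                       # true AR(2) coefficients
x = np.zeros(T)
for t in range(2, T):
    x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.standard_normal()

gamma = np.array([x[h:] @ x[:T - h] / T for h in range(H + 1)])   # sample autocovariances

# Yule-Walker: sum_i Psi_i gamma(i - j) = gamma(j), j = 1..H (gamma(-h) = gamma(h))
G = np.array([[gamma[abs(i - j)] for i in range(1, H + 1)] for j in range(1, H + 1)])
Psi = np.linalg.solve(G, gamma[1:H + 1])
print(Psi)                                        # approximately [0.5, -0.3, 0, 0, 0]
```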

Graphical models for stationary time series
Time domain

• Classical approach in the time domain: sparse AR models

  x(t) = Σ_{h=1}^{H} Φ_h x(t − h) + ε(t),  ε(t) ∼ N(0, Σ)

  with both Φ and Σ⁻¹ sparse

Graphical models for stationary time series
State-space models

• a.k.a. structured HMMs, dynamic graphical models, dynamic Bayesian networks

Graphical model for stationary time series
Frequency domain

• Sparse models for the spectral density matrix
• i.e., for each ω, f(ω) factorizes in a given graphical model G
• Questions:
  – Does it correspond to conditional independences in the time domain?
  – Is inference tractable?
  – Is learning tractable?

Independence between whole time series

• Let x_i = (x_i(t))_{t∈Z} be the whole i-th time series
• Theorem 1 (Brillinger, 1996): the Gaussian time series x_i and x_j are marginally independent if and only if ∀ω, f(ω)_{ij} = 0
• Theorem 2 (Brillinger, 1996): the Gaussian time series x_i and x_j are conditionally independent given all other time series if and only if ∀ω, (f(ω)⁻¹)_{ij} = 0
• Graphical model between whole time series (Brillinger, 1996; Dahlhaus, 2000)
  – Interpretability
  – Efficient inference/learning algorithms?
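A numerical illustration of Theorem 2 on a toy AR(1) with diagonal Ψ (all parameters made up): the innovation precision has a zero in position (1, 3), and the inverse spectral density is then zero there at every frequency.

```python
import numpy as np

m = 3
Psi = np.diag([0.6, 0.3, 0.5])
K = np.array([[2.0, 0.7, 0.0],        # innovation precision Sigma^{-1},
              [0.7, 2.0, 0.7],        # with a zero in position (1, 3)
              [0.0, 0.7, 2.0]])
Sig = np.linalg.inv(K)

def f(omega):
    A = np.linalg.inv(np.eye(m) - Psi * np.exp(-1j * omega))
    return A @ Sig @ A.conj().T / (2 * np.pi)

for omega in np.linspace(0.0, 2 * np.pi, 32, endpoint=False):
    g = np.linalg.inv(f(omega))
    assert abs(g[0, 2]) < 1e-10       # zero at every frequency: x1 independent of x3 given x2
```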

Graphical models between whole time series

• Interpretation in terms of the spectral representation

[Figure: graphical model between the series x1, . . . , x4, replicated across frequencies ω_{k−1}, ω_k, ω_{k+1} and shown against the time indices t − 1, t, t + 1, t + 2]

Time domain vs. frequency domain

• Two different notions of sparsity. How do they relate?
• Distinct notions
• Shared models: AR(1) processes with Ψ diagonal

  f(ω) = (1/2π) (I − Ψ e^{−iω})⁻¹ Σ (I − Ψ e^{−iω})⁻*

Estimating joint spectral density

• Directed models
• If f(ω) factorizes in G, the estimation problem decouples
  – Estimation of local densities on the variables x_{i ∪ π_i(G)}
  – Simplest: one full smoothing of the joint spectral density
  – Adaptive post-smoothing to avoid over-smoothing

Inference

• Forecasting filter obtained by solving the Yule-Walker equations
• Using the structure of f(ω) to avoid O(m³)
• Efficient method for block Toeplitz systems
  – based on block circulant matrices

Block circulant matrices

• Approximation of a block Toeplitz matrix of order H:

  T = [ T_0      T_{−1}   · · ·   T_{−H+1} ]
      [ T_1      T_0              ⋮        ]
      [ ⋮                  ⋱      T_{−1}   ]
      [ T_{H−1}  · · ·     T_1    T_0      ]

  C = [ C_0      C_{−1}   · · ·   C_{−H+1} ]
      [ C_1      C_0              ⋮        ]
      [ ⋮                  ⋱      C_{−1}   ]
      [ C_{H−1}  · · ·     C_1    C_0      ]

  where C_k = T_k for 0 ≤ k ≤ H/2 − 1, C_{H−k} = C_{−k} (circularity), and C_{−k} = C_kᵀ
• Theorem: block circulant matrices are block diagonalized in the Fourier basis, with "eigenblocks"

  Ĉ_k = (1/√H) Σ_{j=0}^{H−1} C_j exp(−2iπjk/H)
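A numpy sketch of the theorem (random toy blocks; the unitary normalization used here may differ from the slide's 1/√H convention):

```python
import numpy as np
from scipy.linalg import block_diag

# A block circulant matrix is block diagonalized in the block Fourier basis,
# with eigenblocks given by the DFT of its blocks C_0, ..., C_{H-1}.
rng = np.random.default_rng(0)
m, H = 2, 4
blocks = [rng.standard_normal((m, m)) for _ in range(H)]
C = np.block([[blocks[(i - j) % H] for j in range(H)] for i in range(H)])

F = np.fft.ifft(np.eye(H)) * np.sqrt(H)       # unitary basis, F[t, k] = e^{2i*pi*tk/H}/sqrt(H)
U = np.kron(F, np.eye(m))

eigenblocks = [sum(blocks[j] * np.exp(-2 * np.pi * 1j * j * k / H) for j in range(H))
               for k in range(H)]
assert np.allclose(U.conj().T @ C @ U, block_diag(*eigenblocks))
```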

Preconditioned conjugate gradient

• Conjugate gradient for systems of the form Tx = b, where T is positive semi-definite
  – only requires matrix-vector multiplications
  – the number of operations depends on the condition number of T: might be very large in general
• Preconditioning
  – if an approximation C ≈ T is known, with a simple inversion routine
  – solve: (C^{−1/2} T C^{−1/2}) C^{1/2}x = C^{−1/2}b
  – C^{−1/2} T C^{−1/2} has a lower condition number
  – requires inverting systems with C and matrix-vector products by T
• Circulant preconditioning for Toeplitz systems: theoretical analysis of the resulting condition number
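A sketch of the scalar case, with a Strang-style circulant preconditioner applied through the FFT (the autocovariance and the tolerances are toy choices):

```python
import numpy as np
from scipy.linalg import toeplitz, solve_circulant
from scipy.sparse.linalg import cg, LinearOperator

H = 256
gamma = 0.9 ** np.arange(H)                    # AR(1)-like autocovariance
T_mat = toeplitz(gamma)                        # symmetric positive definite Toeplitz matrix
b = np.ones(H)

# Circulant approximation: keep the first diagonals, wrap the rest around
c = gamma.copy()
c[H // 2:] = gamma[H - np.arange(H // 2, H)]
M = LinearOperator((H, H), matvec=lambda v: np.real(solve_circulant(c, v)), dtype=float)

x, info = cg(T_mat, b, M=M)                    # preconditioned conjugate gradient
assert info == 0
assert np.linalg.norm(T_mat @ x - b) < 1e-4 * np.linalg.norm(b)
```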

Toeplitz systems for graphical models

• The eigenblocks Ĉ_k = (1/√H) Σ_{j=0}^{H−1} C_j exp(−2iπjk/H) are samples of the spectral density matrix
• Factorization in a DAG → efficient linear systems
• Computational complexity:
  – Each iteration of CG: O(m²dH) + O(m²H log H)
  – If f(ω) is bounded away from zero, known upper bound on the number of iterations
• NB: extensions to ν-step ahead predictors

Learning the structure for stationary time series

• Essentially similar to the i.i.d. case: cost function J(G)
• J(G) factorizes: J(G) = Σ_{i=1}^{n} J_i(π_i(G))
• e.g., AIC: J_i(π_i(G)) = Î(x_i, x_{π_i(G)}) − (1/n) k_{π_i(G)}
  (empirical mutual information penalized by the number of parameters)
• Heuristic definition of the number of parameters using the degrees of freedom of the linear smoothers used for smoothing the periodogram
• Bayesian formulation?

Simulation experiments

• Toy data simulated from sparse graphical models

[Figure, left: KL divergence from the truth vs. number of samples (128 to 8192), for the fully connected model and the learned model with and without post-smoothing]
[Figure, right: number of non-recovered edges vs. number of samples]

Climate dataset

• "Summary of the day" NCDC data (temperatures, precipitations), 1024 days
• Removal of the seasonal component
• One-step-ahead prediction

Climate dataset

• Comparison of various models

[Figure: mean prediction error vs. number of variables (20 to 240), for the Gaussian i.i.d. graphical model, sparse AR, stationary independent, and stationary graphical model]

Climate dataset

[Figure: predictive log likelihood, normalized by the log likelihood of the full Gaussian i.i.d. model, vs. number of variables (20 to 240), for the Gaussian i.i.d. graphical model, sparse AR, stationary independent, and stationary graphical model]

Missing data

• Amplitude modulating function: y_i(t) = 1 if x_i(t) is observed, zero otherwise
• Main assumption: y and x are independent
• Different types of missing observations:
  – Missing at random - time stationarity: y(t) stationary
  – Missing stretches: y(t) is a piecewise constant function of time
  – Typically: a combination of the above

[Figure: example of an amplitude modulating process over 200 time steps]

• Can we estimate spectral densities from missing data?

Missing data - y(t) stationary

• Notation:
  – amplitude modulated process: z(t) = x(t) ◦ y(t)
  – means: c^x = 0, c^y ≠ 0, c^z ≠ 0
  – spectral densities: f^x(ω), f^y(ω), f^z(ω)
  – autocovariance functions: Γ^x(h), Γ^y(h), Γ^z(h)
• Moment matching approach (Priestley, 1981)
• If x and y are independent:

  c^z = E z(0) = E y(0) ◦ E x(0) = 0

  Γ^z(h) = E z(t + h)z(t)ᵀ = E x(t + h)x(t)ᵀ ◦ E y(t + h)y(t)ᵀ = Γ^x(h) ◦ ( c^y (c^y)ᵀ + Γ^y(h) )

  f^z(ω) = f^x(ω) ◦ c^y (c^y)ᵀ + ∫₀^{2π} f^y(ω − θ) ◦ f^x(θ) dθ
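A univariate sketch of the moment matching idea (all numbers made up): since E[z(t+h)z(t)] = γ^x(h) E[y(t+h)y(t)], the autocovariance of x can be recovered from the observed z and y by elementwise division.

```python
import numpy as np

rng = np.random.default_rng(0)
T, psi = 50000, 0.8
x = np.zeros(T)
for t in range(1, T):
    x[t] = psi * x[t - 1] + rng.standard_normal()      # toy AR(1), zero mean
y = (rng.random(T) < 0.7).astype(float)                # observed with probability 0.7
z = x * y                                              # amplitude modulated process

def second_moment(u, h):
    return u[h:] @ u[:T - h] / (T - h)

for h in range(4):
    gamma_x_hat = second_moment(z, h) / second_moment(y, h)
    gamma_x_true = psi**h / (1 - psi**2)
    print(h, round(gamma_x_hat, 3), round(gamma_x_true, 3))   # estimates roughly agree
```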

Sensor networks - I

• Sensors distributed in a building, recording temperatures
• 1 measurement every 30 seconds, 44 sensors, 5 days (2^14 measurements)
• Many missing data (communication constraints), many long stretches (breakdowns)
• Preprocessing (the original data are not stationary!)
  – "Seasonal" component
  – Fixed effect common to all sensors
  – Modelling the residual

Sensor networks - II

• Prediction results (learning on the first half of the period; predicting half the sensors given the other half, in the second half of the period)
• Comparing models that can deal with missing data:
  – State-space model: 0.61 ± 0.34 degrees Celsius
  – Graphical model in frequency: 0.19 ± 0.05 degrees Celsius
  – Graphical model in time: ?

Frequency dependent graphs

• Given a priori partition of the frequency axis
• Learning the partition
  – hard
  – smoothness problem (the spectral density has to be smooth)

Bayesian formulation - I

• Usual annoyance with Gaussian stationary time series:
  – The likelihood is not a simple function of the spectral density f(ω)
  – Whittle approximation: likelihood of the discrete Fourier transform d_0, . . . , d_{T−1}, normally distributed with zero mean and diagonal covariance matrix, with diagonal elements f(2kπ/T)
  – cf. Toeplitz and circulant matrices
• Model defined by regression weights W_i(ω) and variances d_i(ω)
• Main difficulty: smoothness with respect to ω

Bayesian formulation - II

• Regression weights: W_i(ω)/d_i(ω)^{1/2} distributed as a Gaussian process with an appropriate covariance kernel
• Individual variances: log d_i(ω) distributed as a Gaussian process
• Does not seem to lead to a prior which is consistent with a prior on the full spectral density matrix
• Any ideas welcome

Conclusion

• Graphical models in the frequency domain for stationary time series
• Efficient inference and learning (parameters, graph)
• Different from time sparsity (cannot really be sparse in both)
• Simple way of dealing with dependent samples
• Reference: F. R. Bach and M. I. Jordan. Learning graphical models for stationary time series. IEEE Transactions on Signal Processing, 52(8), 2004.