Learning graphical models for stationary time series

Francis Bach (Ecole des Mines de Paris)
Michael Jordan (UC Berkeley)

SAMSI - November 2006
Outline

• Graphical models: the machine learning perspective
  – Directed vs. undirected, Gaussian vs. non-Gaussian
  – Learning the ordering/structure
• Stationary time series
  – Spectral theory
  – Inference, parameter estimation
• Stationary time series and conditional independence
  – Spectral interpretation
  – Graphical models: inference, learning
• Extensions
  – Missing data, frequency-dependent graph, Bayesian formulation
Graphical models

• Random variables X1, . . . , Xn
• Families of distributions equivalently characterized by
  – a factorized distribution p(x1, . . . , xn)
  – a set of conditional independence statements
• Interpretability, inference, elicitation
• Two main types:
  – directed (a.k.a. "Bayesian" networks)
  – undirected (a.k.a. Markov random fields)
• Jointly Gaussian distributions X ∼ N(µ, Σ) → different assumptions regarding the covariance matrix
Undirected graphical models

[Figure: two undirected graphs on X1, . . . , X6; subsets A, B, C illustrate separation]

• Undirected graph G = (V, E)
• Factorized distributions: p(x) = (1/Z) Π_{C ∈ C} ψ_C(x_C)
• Conditional independences:
  – pairwise: (i, j) ∉ E ⇒ Xi ⊥ Xj | X_{{1,...,n}\{i,j}}
  – global: C separates A and B ⇒ X_A ⊥ X_B | X_C
• Gaussian distributions: (i, j) ∉ E ⇒ (Σ^{−1})_{ij} = 0
Directed graphical models

[Figure: two directed acyclic graphs on X1, . . . , X6; subsets A, B, C illustrate d-separation]

• Directed acyclic graph G = (V, E)
• Factorized distributions: p(x) = Π_{i=1}^{n} p(xi | x_{πi})
• Conditional independences:
  – local: Xi ⊥ X_{non-descendants(i)} | X_{πi}
  – global: C d-separates A and B ⇒ X_A ⊥ X_B | X_C
Directed Gaussian graphical models (zero-mean)

[Figure: a DAG on X1, . . . , X6]

• N(0, Σ): p(x) = Π_{i=1}^{n} p(xi | x_{πi}) = Π_{i=1}^{n} N(xi | Wi x_{πi}, di)
• From Σ to (W, d):
  Wi = Σ_{i,πi} (Σ_{πi,πi})^{−1}
  di = Σ_{i,i} − Σ_{i,πi} (Σ_{πi,πi})^{−1} Σ_{πi,i}
• From (W, d) to Σ^{−1}, with D = Diag(d) and W = ((W_1)′, . . . , (W_n)′)′ the matrix stacking the rows Wi:
  Σ^{−1} = (I − W)^T D^{−1} (I − W)
• Given a topological order, Σ^{−1} has a sparse Cholesky factor
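To make these formulas concrete, here is a minimal NumPy sketch (function names and the test case are ours, not from the slides) that extracts (Wi, di) from Σ given the parent sets and rebuilds Σ^{−1} = (I − W)^T D^{−1} (I − W). With a fully connected DAG in topological order the reconstruction is exact, which is what the final check verifies.

```python
import numpy as np

def dag_params_from_cov(Sigma, parents):
    """Regression weights W_i and noise variances d_i for a Gaussian DAG.

    parents[i] is the list of parent indices pi_i (assumed to respect a topological order).
    """
    n = Sigma.shape[0]
    W = np.zeros((n, n))
    d = np.zeros(n)
    for i, pi in enumerate(parents):
        if pi:
            S_pp = Sigma[np.ix_(pi, pi)]
            S_ip = Sigma[i, pi]
            w = np.linalg.solve(S_pp, S_ip)        # W_i = Sigma_{i,pi} Sigma_{pi,pi}^{-1}
            W[i, pi] = w
            d[i] = Sigma[i, i] - S_ip @ w          # residual variance d_i
        else:
            d[i] = Sigma[i, i]
    return W, d

def precision_from_dag(W, d):
    """Sigma^{-1} = (I - W)^T D^{-1} (I - W)."""
    n = len(d)
    IW = np.eye(n) - W
    return IW.T @ np.diag(1.0 / d) @ IW

# consistency check on a random covariance and the complete DAG 0 -> 1 -> 2 (exact representation)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
W, d = dag_params_from_cov(Sigma, parents=[[], [0], [0, 1]])
print(np.allclose(precision_from_dag(W, d), np.linalg.inv(Sigma)))   # True
```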
Complexity of (exact) inference in graphical models
Undirected models

• Typically: compute p(xi, xj) for all (i, j) ∈ E
• Discrete models: exponential in the treewidth tw of G (= size of the largest clique in the optimally triangulated graph, minus 1)

[Figure: an undirected graph on X1, . . . , X6 and its triangulation]

• Gaussian graphical model: O(n × tw^3)
Complexity of (exact) inference in graphical models
Directed models

• Discrete models: exponential in the treewidth tw of the moralized graph (obtained by connecting parents)

[Figure: a DAG on X1, . . . , X6, its moralized graph, and its triangulation]

• Gaussian graphical model: O(n × tw^3) is not the best!
  Σ^{−1} = (I − W)^T D^{−1} (I − W)
  Σ = (I − W)^{−1} D (I − W)^{−T}
  sparse triangular inversion ⇒ O(n^2 d), d maximum fan-in
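As a complement, a small sketch (ours) of the covariance computation Σ = (I − W)^{−1} D (I − W)^{−T}. np.linalg.solve does not exploit triangularity; with a sparse triangular solver the same two solves give the O(n^2 d) cost quoted above.

```python
import numpy as np

def cov_from_dag(W, d):
    """Sigma = (I - W)^{-1} D (I - W)^{-T}.

    Assumes the variables are in topological order, so that I - W is unit lower triangular.
    """
    n = len(d)
    IW = np.eye(n) - W
    # (I - W) A = D^{1/2}: a triangular solver here gives the O(n^2 d) cost mentioned on the slide
    A = np.linalg.solve(IW, np.diag(np.sqrt(d)))
    return A @ A.T

W = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.2, 0.3, 0.0]])   # DAG in topological order
d = np.array([1.0, 0.5, 0.25])
Sigma = cov_from_dag(W, d)
print(np.allclose(np.linalg.inv(Sigma), (np.eye(3) - W).T @ np.diag(1 / d) @ (np.eye(3) - W)))   # True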
Learning the structure of graphical models

[Figure: Venn diagram of model classes: undirected, directed, decomposable]

• Approaches based on conditional independence tests
• Model selection approaches: for directed and decomposable models
  – decomposition and hierarchical Bayesian approaches feasible
  – inference with the model not always tractable in the directed non-Gaussian case
Learning the structure of directed graphical models

• Directed models:
  – cost function J(G) from classical model selection criteria (AIC, BIC, marginal likelihood)
  – J(G) factorizes: J(G) = Σ_{i=1}^{n} Ji(πi(G))
  – e.g., AIC: Ji(πi(G)) = Î(xi, x_{πi(G)}) − (1/n) k_{πi(G)}
    (empirical mutual information penalized by the number of parameters)
• Non-Bayesian setting: finding G = arg max J(G)? Is it hopeless?
  – Restriction to trees: maximum weight spanning forest; the global maximum can be found in O(n^2 log n) (Chow and Liu, 1968); see the sketch below
  – General case: greedy algorithm in the space of DAGs or equivalence classes of DAGs → not so bad... (Chickering, 2002)
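A minimal sketch of the tree-restricted case for Gaussian i.i.d. data (our code and naming): pairwise empirical mutual informations Î(xi, xj) = −(1/2) log(1 − ρ̂_{ij}^2) serve as edge weights, and a maximum-weight spanning tree is grown Kruskal-style.

```python
import numpy as np

def chow_liu_tree(X):
    """Maximum-weight spanning tree with Gaussian mutual-information edge weights.

    X: (T, n) array of T i.i.d. samples of n variables. Returns a list of tree edges (i, j).
    """
    n = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    mi = -0.5 * np.log(1.0 - R ** 2 + 1e-12)          # I(x_i, x_j) for jointly Gaussian pairs

    # Kruskal: sort candidate edges by decreasing weight, keep an edge if it joins two components
    edges = sorted(((mi[i, j], i, j) for i in range(n) for j in range(i + 1, n)), reverse=True)
    comp = list(range(n))                              # simple union-find
    def find(a):
        while comp[a] != a:
            comp[a] = comp[comp[a]]
            a = comp[a]
        return a
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            comp[ri] = rj
            tree.append((i, j))
    return tree

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
X[:, 1] += X[:, 0]; X[:, 2] += X[:, 1]                 # chain-structured dependence
print(chow_liu_tree(X))
```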
Stationary time series

• Multiple time series x(t) ∈ R^m, t ∈ Z
• Strict stationarity: p(x(t1 + h), x(t2 + h), . . . , x(tk + h)) independent of h
• For jointly Gaussian distributions, equivalent to weak stationarity:
  ∀t, E x(t) = E x(0) and ∀t, u, E x(t) x(u)^T = E x(t − u) x(0)^T
• Strong assumption?
  – Yes, in general
  – No, after proper transformation and/or removal of non-stationarities
Stationary (zero-mean) Gaussian processes

• Joint distribution characterized by the mean µ (= 0) and the autocovariance function
  Γ(h) = E x(t + h) x(t)^T = E x(h) x(0)^T ∈ R^{m×m}
• Main modelling assumption: the joint covariance matrix Σ_h of x(t), x(t + 1), . . . , x(t + h − 1) is block Toeplitz:

  Σ_h =
  | Γ(0)      Γ(−1)   Γ(−2)   · · ·   Γ(−h + 1) |
  | Γ(1)      Γ(0)    Γ(−1)                      |
  | Γ(2)      Γ(1)    Γ(0)                       |
  | . . .                              Γ(−1)     |
  | Γ(h − 1)  · · ·            Γ(1)    Γ(0)      |
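A small helper (ours) that assembles Σ_h from an autocovariance function, using Γ(−h) = Γ(h)^T; the AR(1) autocovariance used in the example is the one given a few slides below.

```python
import numpy as np

def block_toeplitz(gamma, h):
    """Covariance of (x(t), ..., x(t+h-1)) from the autocovariance function.

    gamma(k) must return the m x m matrix Gamma(k) for k >= 0; Gamma(-k) = Gamma(k)^T is used.
    """
    m = gamma(0).shape[0]
    Sigma = np.empty((h * m, h * m))
    for r in range(h):
        for c in range(h):
            k = r - c                                   # block (r, c) holds Gamma(r - c)
            block = gamma(k) if k >= 0 else gamma(-k).T
            Sigma[r * m:(r + 1) * m, c * m:(c + 1) * m] = block
    return Sigma

# example: univariate AR(1) autocovariance gamma(h) = sigma^2 psi^|h| / (1 - psi^2)
psi, sigma2 = 0.7, 1.0
gamma = lambda k: np.array([[sigma2 * psi ** abs(k) / (1 - psi ** 2)]])
print(block_toeplitz(gamma, 4))
```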
Spectral density matrix

• If Σ_{h=−∞}^{+∞} ||Γ(h)||_2 < ∞, the spectral density matrix is defined as
  f(ω) = (1/2π) Σ_{h=−∞}^{+∞} Γ(h) e^{−ihω} ∈ C^{m×m}
• Properties:
  – 2π-periodic
  – ∀ω, f(ω) is Hermitian positive semi-definite
• Fourier transform pair: Γ(h) = ∫_0^{2π} f(ω) e^{ihω} dω
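A hedged sketch (our code) of the forward transform: evaluating f(ω) on a frequency grid from a truncated set of lags, and checking that each f(ω) is Hermitian.

```python
import numpy as np

def spectral_density(gammas, omegas):
    """f(omega) = (1/2pi) sum_h Gamma(h) e^{-i h omega}, truncated to the available lags.

    gammas: list [Gamma(0), Gamma(1), ..., Gamma(H)]; Gamma(-h) = Gamma(h)^T is used implicitly.
    Returns an array of shape (len(omegas), m, m).
    """
    m = gammas[0].shape[0]
    f = np.zeros((len(omegas), m, m), dtype=complex)
    for k, w in enumerate(omegas):
        f[k] = gammas[0].astype(complex)
        for h, G in enumerate(gammas[1:], start=1):
            f[k] += G * np.exp(-1j * h * w) + G.T * np.exp(1j * h * w)
        f[k] /= 2 * np.pi
    return f

omegas = np.linspace(0, 2 * np.pi, 8, endpoint=False)
gammas = [np.eye(2), 0.5 * np.eye(2)]                   # toy MA(1)-like autocovariances
f = spectral_density(gammas, omegas)
print(np.allclose(f, np.conj(np.swapaxes(f, 1, 2))))    # Hermitian at every omega: True
```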
Spectral representation

• Spectral representation theorem: x(t) = ∫_0^{2π} e^{itω} dZ(ω),
  with Z(ω) a random process with orthogonal increments such that E dZ(ω) dZ(ω)^* = f(ω) dω
• Infinite sum of independent sinusoids with covariance f(ω)
• Graphical model:

[Figure: graphical model with one independent component per frequency ω1, ω2, ω3, ω4]
AR(1) processes

• x(t + 1) = Ψ x(t) + ε(t), ε(t) ∼ N(0, Σ)
• One time series: m = 1, Σ = σ^2, |ψ| < 1
  – Spectral density: f(ω) = (1/2π) σ^2 / |1 − ψ e^{−iω}|^2
  – Autocovariance function: γ(h) = σ^2 ψ^{|h|} / (1 − ψ^2)

[Plots: f(ω) over [0, 2π] and γ(h) for lags −10 to 10]

• Multiple time series (eigenvalues of Ψ inside the unit circle)
  – Spectral density matrix: f(ω) = (1/2π) (I − Ψ e^{−iω})^{−1} Σ (I − Ψ e^{−iω})^{−∗}
  – Autocovariance function: Γ(h) = Ψ^h Γ(0) for h ≥ 0, with Γ(0) solving Γ(0) = Ψ Γ(0) Ψ^T + Σ
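A short sketch (ours) of the multivariate formula, evaluating f(ω) = (1/2π)(I − Ψe^{−iω})^{−1} Σ (I − Ψe^{−iω})^{−∗} on a frequency grid.

```python
import numpy as np

def var1_spectral_density(Psi, Sigma, omegas):
    """Spectral density matrix of x(t+1) = Psi x(t) + eps(t), eps ~ N(0, Sigma)."""
    m = Psi.shape[0]
    f = np.empty((len(omegas), m, m), dtype=complex)
    for k, w in enumerate(omegas):
        A = np.linalg.inv(np.eye(m) - Psi * np.exp(-1j * w))   # (I - Psi e^{-i w})^{-1}
        f[k] = A @ Sigma @ A.conj().T / (2 * np.pi)
    return f

Psi = np.array([[0.6, 0.2], [0.0, 0.5]])            # eigenvalues inside the unit circle
Sigma = np.eye(2)
f = var1_spectral_density(Psi, Sigma, np.linspace(0, 2 * np.pi, 4, endpoint=False))
print(f[0].real)                                     # f(0) = (1/2pi) (I - Psi)^{-1} Sigma (I - Psi)^{-T}
```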
Estimation from finite sample

• The sample autocovariance function is defined as
  Γ̂(h) = (1/T) Σ_{t=0}^{T−h−1} x(t + h) x(t)^T,  h ∈ [0, T − 1]
  (NB: usually after removing the mean and non-stationarities)
• Consistent estimator of Γ(h)
• Estimation of f(ω) as the Fourier transform of Γ̂(h)?
  – Periodogram: I(2πk/T) = (1/2π) d(k) d(k)^*, with d(k) = (1/√T) Σ_{t=0}^{T−1} x(t) e^{−2iπkt/T}
  – Bad estimator of the spectral density
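The periodogram definition above maps directly onto the FFT; a sketch with our naming (the raw estimate is then smoothed, as on the following slides).

```python
import numpy as np

def periodogram(X):
    """Raw periodogram I(2 pi k / T) = (1/2pi) d(k) d(k)^*, d(k) = T^{-1/2} sum_t x(t) e^{-2i pi k t / T}.

    X: (T, m) array. Returns an array of shape (T, m, m), Hermitian at each frequency.
    """
    T = X.shape[0]
    d = np.fft.fft(X, axis=0) / np.sqrt(T)
    return d[:, :, None] * d[:, None, :].conj() / (2 * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 2))
I = periodogram(X)
print(I.shape, np.allclose(I[0], np.conj(I[0]).T))   # (256, 2, 2) True
```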
Estimation from finite sample

[Plots: estimated autocovariance function (lags −10 to 10) and raw periodogram (frequencies 0 to 2π)]
Estimation from finite sample of size T

• Smoothing the periodogram leads to consistent estimators
• Smoothing by a Gaussian window of standard deviation r
  – estimate consistent if r(T) → ∞ and r(T)/T → 0

[Plots: raw and smoothed periodogram over [0, 2π]]

• Given r, Γ̂(h) is "equal" to zero for h larger than H = 2T/r = time horizon
• Given the time horizon H, only need the spectral density at ω_h = 2πh/H (Nyquist theorem)
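A sketch (ours) of the smoothing step: circular convolution of the periodogram across frequencies with a periodized Gaussian window whose standard deviation is r frequency bins. The exact windowing convention is an assumption on our part.

```python
import numpy as np

def smooth_periodogram(I, r):
    """Circularly smooth a periodogram I of shape (T, m, m) with a Gaussian window (std = r bins)."""
    T = I.shape[0]
    k = np.arange(T)
    k = np.minimum(k, T - k)                           # circular distance to frequency bin 0
    w = np.exp(-0.5 * (k / r) ** 2)
    w /= w.sum()
    # circular convolution along the frequency axis, done with the FFT
    return np.fft.ifft(np.fft.fft(I, axis=0) * np.fft.fft(w)[:, None, None], axis=0)

# usage with the periodogram() sketch above, r growing slower than T:
# f_hat = smooth_periodogram(periodogram(X), r=8)
```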
Forecasting

• Find Ψ that minimizes e(Ψ) = E || x(H + 1) − Σ_{j=1}^{H} Ψ_j x(H + 1 − j) ||^2
• Yule-Walker equations: for j = 1, . . . , H,  Σ_{i=1}^{H} Ψ_i Γ(i − j) = Γ(j)
• Durbin-Levinson algorithm: inversion of the linear system in O(m^3 H^2)
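To make the equations concrete, a direct sketch (ours) that stacks the Yule-Walker equations into one dense block system and solves it. This costs O(m^3 H^3) rather than the O(m^3 H^2) of Durbin-Levinson, but follows the displayed equations literally.

```python
import numpy as np

def yule_walker(gamma, H):
    """Solve sum_i Psi_i Gamma(i - j) = Gamma(j), j = 1..H, by one dense block solve.

    gamma(k) returns the m x m autocovariance Gamma(k) for k >= 0, with Gamma(-k) = Gamma(k)^T.
    Returns Psi of shape (H, m, m).
    """
    m = gamma(0).shape[0]
    R = np.empty((H * m, H * m))
    G = np.empty((m, H * m))
    for j in range(1, H + 1):
        G[:, (j - 1) * m:j * m] = gamma(j)
        for i in range(1, H + 1):
            k = i - j
            R[(i - 1) * m:i * m, (j - 1) * m:j * m] = gamma(k) if k >= 0 else gamma(-k).T
    Psi_stack = np.linalg.solve(R.T, G.T).T            # [Psi_1 ... Psi_H] R = [Gamma(1) ... Gamma(H)]
    return Psi_stack.reshape(m, H, m).swapaxes(0, 1)

# univariate AR(1): the predictor should recover Psi_1 = psi and ~0 for higher lags
psi = 0.7
gamma = lambda k: np.array([[psi ** abs(k) / (1 - psi ** 2)]])
print(yule_walker(gamma, 3).ravel())                    # ~ [0.7, 0.0, 0.0]
```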
Graphical models for stationary time series
Time domain

• Classical approach in the time domain: sparse AR models
  x(t) = Σ_{h=1}^{H} Φ_h x(t − h) + ε(t),  ε(t) ∼ N(0, Σ),
  with both Φ and Σ^{−1} sparse
Graphical models for stationary time series
State-space models

• a.k.a. structured HMMs, dynamic graphical models, dynamic Bayesian networks
Graphical models for stationary time series
Frequency domain

• Sparse models for the spectral density matrix
• i.e., for each ω, f(ω) factorizes in a given graphical model G
• Questions:
  – Does it correspond to conditional independences in the time domain?
  – Is inference tractable?
  – Is learning tractable?
Independence between whole time series

• Let xi = (xi(t))_{t∈Z} be the whole i-th time series
• Theorem 1 (Brillinger, 1996): the Gaussian time series xi and xj are marginally independent if and only if ∀ω, f(ω)_{ij} = 0
• Theorem 2 (Brillinger, 1996): the Gaussian time series xi and xj are conditionally independent given all the other time series if and only if ∀ω, (f(ω)^{−1})_{ij} = 0 (see the sketch below)
• Graphical model between whole time series (Brillinger, 1996; Dahlhaus, 2000)
  – Interpretability
  – Efficient inference/learning algorithms?
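Theorem 2 suggests a simple diagnostic: scan the entries of f(ω)^{−1} (equivalently, partial coherences) over frequencies. A self-contained sketch with our naming, using a VAR(1) in which the third series is decoupled from the first two.

```python
import numpy as np

def max_partial_link(f):
    """max over omega of |(f(omega)^{-1})_{ij}|: ~0 for pairs that are conditionally independent."""
    return np.abs(np.linalg.inv(f)).max(axis=0)

# spectral density of a VAR(1) where series 2 is decoupled from series 0 and 1 (Sigma = I)
Psi = np.array([[0.6, 0.3, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.4]])
omegas = np.linspace(0, 2 * np.pi, 64, endpoint=False)
f = np.empty((64, 3, 3), dtype=complex)
for k, w in enumerate(omegas):
    A = np.linalg.inv(np.eye(3) - Psi * np.exp(-1j * w))
    f[k] = A @ A.conj().T / (2 * np.pi)
print(np.round(max_partial_link(f), 3))              # entries (0,2) and (1,2) are ~0
```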
Graphical models between whole time series

• Interpretation in terms of the spectral representation

[Figure: the spectral increments of the series x1, x2, x3, x4 at frequencies ω_{k−1}, ω_k, ω_{k+1} follow the same graphical model, independently at each frequency, linked to the time samples t − 1, t, t + 1, t + 2]
Time domain vs. frequency domain

• Two different notions of sparsity. How do they relate?
• Distinct notions
• Shared models: AR(1) processes with Ψ diagonal
  f(ω) = (1/2π) (I − Ψ e^{−iω})^{−1} Σ (I − Ψ e^{−iω})^{−∗}
Estimating joint spectral density

• Directed models
• If f(ω) factorizes in G, the estimation problem decouples
  – estimation of local densities on the variables x_{i ∪ πi(G)}
  – simplest: one full smoothing of the joint spectral density
  – adaptive post-smoothing to avoid over-smoothing
Inference

• Forecasting filter obtained by solving the Yule-Walker equations
• Using the structure of f(ω) to avoid O(m^3)
• Efficient method for block Toeplitz systems
  – based on block circulant matrices
Block circulant matrices

• Approximation of a block Toeplitz matrix of order H:

  T =
  | T_0       T_{−1}   · · ·   T_{−H+1} |
  | T_1       T_0                        |
  | . . .                                |
  | T_{H−1}   · · ·             T_0      |

  C =
  | C_0       C_{−1}   · · ·   C_{−H+1} |
  | C_1       C_0                        |
  | . . .                                |
  | C_{H−1}   · · ·             C_0      |

  where C_k = T_k for 0 ≤ k ≤ H/2 − 1, C_{H−k} = C_{−k} (circulant structure), and C_{−k} = C_k^T

• Theorem: block circulant matrices are block diagonalized in the Fourier basis, with "eigenblocks"
  Ĉ_k = (1/√H) Σ_{j=0}^{H−1} C_j exp(−2ijkπ/H)
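A numerical check of the block diagonalization (our code, using NumPy's unnormalized DFT convention rather than the 1/√H normalization of the slide): applying the block circulant matrix is equivalent, frequency by frequency, to multiplying by the eigenblocks obtained as the DFT of the block sequence. Solving systems in C then reduces to H independent m × m solves plus FFTs.

```python
import numpy as np

H, m = 8, 2
rng = np.random.default_rng(0)
C = rng.standard_normal((H, m, m))                 # blocks C_0, ..., C_{H-1}

# dense block circulant matrix: block (r, c) is C_{(r - c) mod H}
M = np.zeros((H * m, H * m))
for r in range(H):
    for c in range(H):
        M[r * m:(r + 1) * m, c * m:(c + 1) * m] = C[(r - c) % H]

x = rng.standard_normal((H, m))
y = (M @ x.ravel()).reshape(H, m)                  # block circulant times block vector

# block diagonalization: DFT_k(y) = Lambda_k @ DFT_k(x), with Lambda_k = sum_j C_j e^{-2 pi i j k / H}
Lam = np.fft.fft(C, axis=0)
xf = np.fft.fft(x, axis=0)
yf = np.einsum('kij,kj->ki', Lam, xf)
print(np.allclose(np.fft.fft(y, axis=0), yf))      # True
```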
Preconditioned conjugate gradient

• Conjugate gradient for systems of the form T x = b where T is positive semi-definite
  – only requires matrix-vector multiplications
  – the number of operations depends on the condition number of T, which might be very large in general
• Preconditioning
  – if an approximation C ≈ T is known, with a simple inversion routine,
  – solve (C^{−1/2} T C^{−1/2}) C^{1/2} x = C^{−1/2} b
  – C^{−1/2} T C^{−1/2} has a lower condition number
  – requires solving systems with C and matrix-vector products with T
• Circulant preconditioning for Toeplitz systems: theoretical analysis of the resulting condition number
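A univariate (m = 1) sketch of the idea (our code, with Strang's circulant preconditioner as one standard choice): Toeplitz matrix-vector products are done by circulant embedding and the FFT, and the preconditioner is applied in the Fourier domain.

```python
import numpy as np

def toeplitz_matvec(t, x):
    """y = T x for the symmetric Toeplitz matrix with first column t, via circulant embedding."""
    n = len(t)
    c = np.concatenate([t, [0.0], t[-1:0:-1]])          # size-2n circulant embedding of T
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

def strang_eigenvalues(t):
    """Eigenvalues (FFT of the first column) of Strang's circulant approximation of T."""
    n = len(t)
    s = t.copy()
    s[n // 2 + 1:] = t[1:(n - 1) // 2 + 1][::-1]        # keep the close diagonals, wrap the rest
    return np.fft.fft(s).real

def pcg_toeplitz(t, b, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for T x = b, circulant preconditioner applied by FFT."""
    lam = strang_eigenvalues(t)
    precond = lambda r: np.fft.ifft(np.fft.fft(r) / lam).real
    x = np.zeros_like(b)
    r = b - toeplitz_matvec(t, x)
    z = precond(r)
    p = z.copy()
    for it in range(max_iter):
        Tp = toeplitz_matvec(t, p)
        alpha = (r @ z) / (p @ Tp)
        x += alpha * p
        r_new = r - alpha * Tp
        if np.linalg.norm(r_new) < tol * np.linalg.norm(b):
            return x, it + 1
        z_new = precond(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x, max_iter

# AR(1) autocovariances give a positive definite Toeplitz matrix
n, psi = 256, 0.8
t = psi ** np.arange(n) / (1 - psi ** 2)
b = np.sin(np.arange(n) / 10.0)
x, iters = pcg_toeplitz(t, b)
T = t[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]
print(iters, np.allclose(T @ x, b, atol=1e-6))
```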
Toeplitz systems for graphical models

• The eigenblocks Ĉ_k = (1/√H) Σ_{j=0}^{H−1} C_j exp(−2ijkπ/H) are samples of the spectral density matrix
• Factorization in a DAG → efficient linear systems
• Computational complexity:
  – each iteration of CG: O(m^2 d H) + O(m^2 H log H)
  – if f(ω) is bounded away from zero, known upper bound on the number of iterations
• NB: extensions to ν-step-ahead predictors
Learning the structure for stationary time series

• Essentially similar to the i.i.d. case: cost function J(G)
• J(G) factorizes: J(G) = Σ_{i=1}^{n} Ji(πi(G))
• e.g., AIC: Ji(πi(G)) = Î(xi, x_{πi(G)}) − (1/n) k_{πi(G)}
  (empirical mutual information penalized by the number of parameters)
• Heuristic definition of the number of parameters using the degrees of freedom of the linear smoothers used for smoothing the periodogram
• Bayesian formulation?
Simulation experiments

• Toy data simulated from sparse graphical models

[Left plot: KL divergence from the truth vs. number of samples (128 to 8192), for the fully connected model, the learned model without post-smoothing, and the learned model with post-smoothing]
[Right plot: number of non-recovered edges vs. number of samples (0 to 9000)]
Climate dataset
• “summary of the day” NCDC data (temperatures, precipitations), 1024 days • removing of component
seasonal
• one step ahead prediction
Climate dataset

• Comparison of various models

[Plot: mean prediction error vs. number of variables (20 to 240) for the Gaussian i.i.d. graphical model, sparse AR, stationary independent, and stationary graphical models]
Climate dataset

[Plot: predictive log likelihood, normalized by the log likelihood of the full Gaussian i.i.d. model, vs. number of variables (20 to 240) for the Gaussian i.i.d. graphical model, sparse AR, stationary independent, and stationary graphical models]
Missing data

• Amplitude modulating function: yi(t) = 1 if xi(t) is observed, zero otherwise
• Main assumption: y and x are independent
• Different types of missing observations:
  – missing at random with time stationarity: y(t) stationary
  – missing stretches: y(t) is a piecewise constant function of time
  – typically: a combination of the above

[Plot: an amplitude modulating process y(t) over 200 time steps]

• Can we estimate spectral densities from missing data?
Missing data - y(t) stationary

• Notations:
  – amplitude modulated process: z(t) = x(t) ◦ y(t)
  – means: c^x = 0, c^y ≠ 0, c^z ≠ 0
  – spectral densities: f^x(ω), f^y(ω), f^z(ω)
  – autocovariance functions: Γ^x(h), Γ^y(h), Γ^z(h)
• Moment matching approach (Priestley, 1981)
• If x and y are independent: c^z = E z(0) = E y(0) ◦ E x(0) = 0 and
  Γ^z(h) = E z(t + h) z(t)^T = E x(t + h) x(t)^T ◦ E y(t + h) y(t)^T = Γ^x(h) ◦ (c^y (c^y)^T + Γ^y(h))
  f^z(ω) = f^x(ω) ◦ c^y (c^y)^T + ∫_0^{2π} f^y(ω − θ) ◦ f^x(θ) dθ
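In the stationary-y case, the identity Γ^z(h) = Γ^x(h) ◦ E[y(t + h) y(t)^T] suggests the moment-matching estimator Γ̂^x(h) = Γ̂^z(h) / Ê[y(t + h) y(t)^T] (elementwise division). A sketch with our naming and a toy AR(1) check; the exact estimator used in the talk may differ in details.

```python
import numpy as np

def autocov_missing(X, Y, H):
    """Moment-matching estimates of Gamma^x(h), h = 0..H, from amplitude-modulated data.

    X: (T, m) data (entries at unobserved positions are ignored); Y: (T, m) 0/1 observation mask.
    Assumes x is zero-mean and y is stationary and independent of x.
    """
    T, m = X.shape
    Z = X * Y                                           # z(t) = x(t) o y(t)
    gammas = []
    for h in range(H + 1):
        Gz = Z[h:].T @ Z[:T - h] / T                    # sample E z(t+h) z(t)^T
        Gyy = Y[h:].T @ Y[:T - h] / T                   # sample E y(t+h) y(t)^T (uncentered)
        gammas.append(Gz / np.maximum(Gyy, 1e-12))      # Gamma^x(h) = Gamma^z(h) / E[y y^T], elementwise
    return gammas

# toy check: AR(1) data with ~30% of the entries missing at random
rng = np.random.default_rng(0)
T, psi = 20000, 0.7
x = np.zeros((T, 1))
for t in range(1, T):
    x[t] = psi * x[t - 1] + rng.standard_normal(1)
Y = (rng.random((T, 1)) > 0.3).astype(float)
g = autocov_missing(x, Y, H=2)
print([round(float(gh[0, 0]), 2) for gh in g])          # ~ [1/(1-psi^2), psi/(1-psi^2), ...]
```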
Sensor networks - I

• Sensors distributed in a building, recording temperatures
• 1 measurement every 30 seconds, 44 sensors, 5 days (2^14 measurements)
• Many missing data (communication constraints), many long stretches (breakdowns)
• Preprocessing (the original data are not stationary!)
  – "seasonal" component
  – fixed effect common to all sensors
  – modelling the residual
Sensor networks - II

• Prediction results (learning on the first half of the period, predicting half of the sensors given the other half over the second half of the period)
• Comparing models that can deal with missing data
  – state-space model: 0.61 ± 0.34 degrees Celsius
  – graphical model in frequency: 0.19 ± 0.05 degrees Celsius
  – graphical model in time: ?
Frequency-dependent graphs

• Given an a priori partition of the frequency axis
• Learning the partition
  – hard
  – smoothness problem (the spectral density has to be smooth)
Bayesian formulation - I

• Usual annoyance with Gaussian stationary time series:
  – the likelihood is not a simple function of the spectral density f(ω)
  – Whittle approximation: likelihood of the discrete Fourier transform d_0, . . . , d_{T−1}, normally distributed with zero mean and diagonal covariance matrix, with diagonal elements f(2kπ/T)
  – cf. Toeplitz and circulant matrices
• Model defined by regression weights Wi(ω) and variances di(ω)
• Main difficulty: smoothness with respect to ω
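A sketch of the Whittle approximation for a zero-mean series (our code; scaling conventions for the DFT and the 2π factors vary across references, and the k = 0 and k = T/2 terms are treated loosely here).

```python
import numpy as np

def whittle_loglik(X, f):
    """Whittle approximation to the log-likelihood of a zero-mean stationary Gaussian series.

    X: (T, m) data; f: callable returning the m x m spectral density f(omega).
    The DFT coefficients d(k) are treated as independent complex Gaussians with covariance
    2 pi f(2 pi k / T); additive constants are dropped, so only comparisons are meaningful.
    """
    T, m = X.shape
    d = np.fft.fft(X, axis=0) / np.sqrt(T)
    ll = 0.0
    for k in range(1, T):                                # skip k = 0 (mean term)
        S = 2 * np.pi * f(2 * np.pi * k / T)
        sign, logdet = np.linalg.slogdet(S)
        ll -= logdet + np.real(d[k].conj() @ np.linalg.solve(S, d[k]))
    return ll

# usage sketch: score two candidate spectral densities against simulated white noise
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 1))
f_white = lambda w: np.eye(1) / (2 * np.pi)              # unit white noise
f_ar = lambda w: np.eye(1) / (2 * np.pi * abs(1 - 0.7 * np.exp(-1j * w)) ** 2)
print(whittle_loglik(X, f_white) > whittle_loglik(X, f_ar))   # the white-noise model wins: True
```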
Bayesian formulation - II

• Regression weights: Wi(ω)/di(ω)^{1/2} distributed as a Gaussian process with an appropriate covariance kernel
• Individual variances: log di(ω) distributed as a Gaussian process
• Does not seem to lead to a prior which is consistent with a prior on the full spectral density matrix
• Any ideas welcome
Conclusion

• Graphical models in the frequency domain for stationary time series
• Efficient inference and learning (parameters, graph)
• Different from time-domain sparsity (cannot really be sparse in both)
• Simple way of dealing with dependent samples
• Reference: F. R. Bach and M. I. Jordan. Learning graphical models for stationary time series. IEEE Transactions on Signal Processing, 52(8), 2004.