A Bayesian Markov-switching Model for Sparse Dynamic Network Estimation

Huijing Jiang, Aurélie C. Lozano, Fei Liu
IBM T. J. Watson Research Center

Abstract

Inferring Dynamic Bayesian Networks (DBNs) from multivariate time series data is a key step towards the understanding of complex systems, as it reveals important dependency relationships underlying such systems. Most traditional approaches assume a "static" DBN. Yet in many relevant applications, such as those arising in biology and the social sciences, the dependency structures may vary over time. In this paper, we introduce a sparse Markov-switching vector autoregressive model to capture structural changes in the dependency relationships over time. Our approach accounts for such structural changes via a set of latent state variables, which are modeled by a discrete-time, discrete-state Markov process. Assuming that the underlying structures are sparse, we estimate the networks at each state through the hierarchical Bayesian group Lasso, so as to efficiently capture dependencies with lags greater than one time unit. For computation, we develop an efficient algorithm based on the Expectation-Maximization method. We demonstrate the strength of our approach through simulation studies and a real data set concerning climate change.

1 Introduction

Inferring the underlying dependency network between variables in multivariate time-series data is a problem of significant interest that arises in a variety of fields, including the social sciences [2], climate science [18] and microbiology [25]. The dependency structures are typically modeled by Dynamic Bayesian Networks (DBNs), which are directed graphical models where the nodes represent time series variables and the edges represent directed dependency relationships. (The term "dynamic" refers to the fact that we are modeling dynamic processes; it does not mean that the network structure changes at every time point.) Most existing methods assume static DBNs, whereas real-world applications strongly suggest that the underlying dependency structure may vary over time. Anomaly detection and climate change modeling provide key examples where the primary interest is to understand how and when the

dependency relationships between various variables may have been altered over time. In addition, in many applications it is conceivable that the process may undergo different regime shifts and possibly switch back to previous states at various time points (e.g., a system may go back to normal after abnormal periods). Furthermore, for problems involving a massive number of variables, the dependency networks are typically assumed to be sparse.

A sample of relevant work on change point modeling includes [5], [4], and [6]. The literature on change point modeling in non-sparse DBNs includes [27], [7], [10], [9] and [11]. Bayesian approaches for sparse DBNs have been developed recently. For example, [24] provides a two-step approach to change point modeling in sparse DBNs, which learns the network topology first and then estimates the parameters with the topology fixed. We note that the structure-learning step of that approach has exponential-time complexity, and thus the method is not scalable to high-dimensional applications. A second approach, proposed in [15], simultaneously infers the network topology and the time-varying networks; the inference, however, is based on a reversible jump MCMC algorithm, which is computationally very expensive.

A noteworthy alternative approach for sparse estimation of time-varying DBNs was proposed in [25]. It consists of estimating the network at each time point t separately by solving a kernel-weighted lasso problem, where the weight associated with each observation depends on the distance between t and the observation's time point. The procedure is simple but relies on the assumption that the networks are constantly varying over time and that the variations are smooth over time; the notion of regime therefore does not apply. In addition, the weighting scheme implicitly assumes that close-by observations have related DBNs, and is thus unable to incorporate relatedness between distant observations (in contrast to our method). In many applications there is no prior knowledge of the characteristics of the changes, so defining an appropriate kernel function is not obvious; moreover, in some settings the smoothness assumption may not be applicable, and certain changes may be abrupt. We remark that an interesting Bayesian nonparametric approach


for Markov-switching model estimation has been proposed in [9] to address a completely different question. [9] considers the so-called "automatic relevance determination" prior to turn off entire lag blocks in the underlying vector autoregressive models, which is useful for selecting important lags. The networks estimated by their method are dense; therefore, their method cannot be applied to infer sparse network structures, where the focus is on selecting the important edges (a.k.a. features). Additionally, [9] employs an MCMC algorithm for inference, whose computational cost is very high.

Our main goal in this paper is to develop a novel and computationally efficient methodology to model the time-varying dependency structure underlying multivariate time series data, with a particular focus on regime change identification and on the sparsity of the estimated networks. In other words, the focus herein is to obtain a sparse graphical representation of the relationships between multiple time series, where it is essential to enforce both the sparsity and the grouping structure of the lagged variables belonging to the same variable. For that purpose we bring together two distinct frameworks: the Markov-switching modeling framework and the sparse learning framework. Specifically, we propose a Bayesian Markov-switching model for estimating sparse DBNs (MS-SDBN). Following [6], we introduce a latent discrete state variable to indicate the regime from which the observations at each time point have been drawn. While most of the change point modeling techniques in the literature do not allow the process to come back to previous states, we extend the method in [6] so that the system can switch back to a previous state at a change point. In addition to capturing important realistic settings, this allows us to alleviate the problem of sample scarcity more efficiently by borrowing strength across samples that may not be adjacent in time.

To enforce sparsity of the estimated DBNs, at each state we further utilize a group Lasso method [28, 16], where the lagged variables belonging to the same time series are considered as a group. This formulation naturally allows for the consideration of dependencies with lags greater than one (namely, the value of a variable at time t may be related to the values of another variable at time t − 1 but also at times t − 2 up to t − L) by enforcing selection or exclusion of variables within a group in an "all-in all-out" fashion, and offers better interpretability for the resulting Bayesian networks. We note that such group selection is not a requirement: if desired, our method can also accommodate individual selection of each lag component (by defining groups of size 1). Specifically, we extend a hierarchical Bayesian framework that adds flexibility to

the original group Lasso [16]. In this approach, a hierarchical prior is specified for the regression coefficients, which results in maximum a posteriori (MAP) estimation with sparsity-inducing regularization and can be seen as an iteratively reweighted adaptive group Lasso estimator. Here adaptivity refers to the fact that the amount of penalty may differ across groups of regression coefficients, similarly to the Adaptive Lasso algorithm [29]. Moreover, the penalty parameter λ is iteratively updated, thereby alleviating the need for parameter tuning (as opposed to non-Bayesian approaches). An additional benefit of such a quasi-Bayesian approach is its computational efficiency, which allows for graceful accommodation of high-dimensional datasets. Our method, by combining the strengths of the Markov-switching framework and the Bayesian group Lasso, provides a natural and integrated modeling framework that captures each piece of the modeling desiderata: regime change identification and sparse DBN estimation.

We evaluate our approach on simulated data in terms of the accuracy of change point detection and of the estimated Bayesian networks. The results demonstrate the strength of our formulation not only at detecting structural changes, but also at estimating the Bayesian network for each state. As an illustration of its usefulness in real applications, we apply the proposed method to climate measurement data consisting of monthly measurements for a comprehensive set of climate indices spanning 1951 to 2007. The insights obtained suggest that our method may provide a valuable complement to the existing methods of change point detection used in climate science.

The remainder of this paper is organized as follows. A brief review of dynamic Bayesian networks and the vector autoregressive (VAR) model is provided in Section 2. We introduce our model in Section 3. In Section 4, we describe the algorithm we propose to estimate the model's parameters. The setup and results of our experiments on simulated data are provided in Section 5. In Section 6 we describe the analysis of the climate data. Concluding remarks are given in Section 7.

2 Preliminaries: Dynamic Bayesian Networks and the Vector Autoregressive Model

We deal with Dynamic Bayesian Networks [12], which are directed graphical models of stochastic processes. We assume that the stochastic processes considered can be modeled by a Markovian transition model of order L, where L ≥ 1, so as to allow for dependencies with time lags up to L. Specifically, we consider the Vector Autoregressive (VAR) model [8, 26], which is commonly


used as it offers a convenient framework for structure learning. The VAR model assumes that the value of the multivariate time series at time t is a linear combination of its L past values plus Gaussian measurement noise. Namely, let {Y_{j,t}, j = 1, ..., d; t = 1, ..., T} denote the d-variate time series observed at T consecutive time points; the VAR model is defined as

(2.1)   $Y_{j,t} = \sum_{i=1}^{d} \sum_{l=1}^{L} \theta_{ij,l}\, Y_{i,t-l} + \epsilon_{j,t}, \qquad \epsilon_{j,t} \sim N(0, \sigma^2),$

where L is the maximum lag and the θ_{ij,l} are the lagged coefficients. It is easy to see that the non-zero coefficients of the VAR model induce a directed graph over the time series variables: there is a directed edge from time series Y_i to time series Y_j if θ_{ij,l} is non-zero for at least one l ∈ {1, ..., L}. This correspondence suggests the use of sparse modeling techniques to estimate the VAR model coefficients. Following [17], we concern ourselves with methods enforcing group-wise sparsity, as these methods are able to incorporate the natural grouping structure induced by the lagged variables belonging to the same time series. Specifically, we consider a variant of the group Lasso estimator [28]. In the context of the aforementioned VAR model, the group Lasso can be applied with variable groups $\{\{Y_{1,t-l}\}_{l=1}^{L}, \ldots, \{Y_{d,t-l}\}_{l=1}^{L}\}$, leading to the minimization of

$$\frac{1}{2} \sum_{t=L+1}^{T} \Big( Y_{j,t} - \sum_{i=1}^{d} \sum_{l=1}^{L} \theta_{ij,l}\, Y_{i,t-l} \Big)^2 + \lambda \sum_{i=1}^{d} \Big( \sum_{l=1}^{L} \theta_{ij,l}^2 \Big)^{1/2},$$

for j = 1, ..., d. As the group Lasso penalty induces group-wise variable selection, selection of the lagged variables belonging to the same time series is performed in an "all-in all-out" fashion, facilitating the mapping from the support pattern of the VAR model to the corresponding DBN. Indeed, a directed edge exists from time series Y_i to time series Y_j if the coefficient group {θ_{ij,l}}_{l=1}^{L} is non-zero.

3 Model

3.1 Markov Switching Modeling for Bayesian Networks  We extend the VAR model to allow for structural changes in the Bayesian networks over time. Specifically, we propose a Markov-switching vector autoregressive model as follows. We first introduce a latent state variable S_t, S_t ∈ {1, 2, ..., K}, at each time point, where K is the total number of possible states and S_t stands for the state at time t. Given the state variables {S_t}, we model the observed data Y_{j,t} using the VAR model

(3.2)   $Y_{j,t} = \sum_{i=1}^{d} \sum_{l=1}^{L} \theta_{ij,l}^{S_t}\, Y_{i,t-l} + \epsilon_{j,t}, \qquad \epsilon_{j,t} \sim N(0, \sigma^2_{S_t}).$

Note that the state variables S_t ∈ {1, ..., K} are defined jointly on all features (they do not depend on j). We further model S_t as a Markov chain with K × K transition probability matrix P, whose (i, j)-th element p_{ij} is the transition probability from state i to state j: p_{ij} = Pr(S_t = j | S_{t−1} = i), for all i, j ∈ {1, ..., K}. Note the difference between the transition probability matrix defined above and that of [6]. In [6], the transition probability matrix has zero entries except for the diagonal elements and the elements immediately next to the diagonal; this implies that the model in [6] only allows for a new state when a change point occurs. In our formulation, with a full transition probability matrix, we allow the process to go back to a previous state or forward to a new state at a change point. According to model (3.2), if two time points are from the same state, they have the same set of AR coefficients. For simplicity, we denote the regression coefficients at state k by θ_{ijkl} throughout the rest of the paper (θ_{ij,l}^{S_t} = θ_{ijkl} if S_t = k). We note that the model allows the DBN to vary over time as the states {S_t} change over time.
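To make the switching model (3.2) concrete, the sketch below simulates a small Markov-switching VAR. The dimensions, transition matrix, sparsity level, noise scale, and random seed are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, L, K, T = 4, 2, 2, 300            # series, max lag, states, length (illustrative)
P = np.array([[0.98, 0.02],          # assumed transition matrix, rows sum to 1
              [0.03, 0.97]])
sigma = np.array([0.1, 0.1])         # per-state noise standard deviations

# Per-state lag coefficients theta[k, i, j, l]: effect of Y_i at lag l+1 on Y_j in state k.
theta = rng.normal(0.0, 0.2, size=(K, d, d, L))
theta *= rng.random((K, d, d, 1)) < 0.3     # sparse: most coefficient groups zeroed out

S = np.zeros(T, dtype=int)                  # latent state path
Y = np.zeros((T, d))
Y[:L] = rng.normal(size=(L, d))             # initial observations D_0
for t in range(1, T):
    S[t] = rng.choice(K, p=P[S[t - 1]])
for t in range(L, T):
    k = S[t]
    mean = sum(Y[t - 1 - l] @ theta[k, :, :, l] for l in range(L))
    Y[t] = mean + rng.normal(0.0, sigma[k], size=d)
```

An edge i → j is present in the state-k network exactly when the coefficient group theta[k, i, j, :] is non-zero, which is the group structure the estimation procedure of Section 3.2 exploits.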

3.2 Sparse Modeling via the Bayesian Group Lasso  In order to naturally map the VAR model coefficients into the dependency structure of the DBNs, we make use of a group lasso technique for variable group selection and consider the lagged variables (features) for a given time series as a group, thereby allowing the variables belonging to the same group to be selected in an "all-in all-out" fashion (instead of selecting the lagged variables individually). In our context, for a given (i, j, k), we define {θ_{ijkl}, l = 1, ..., L} as a coefficient group. We adapt the Bayesian hierarchical framework for group variable selection in [16] as follows:

(3.3)   $\theta_{ijkl} \mid \sigma^2_{ijk} \sim N(0, \sigma^2_{ijk}), \qquad \sigma^2_{ijk} \mid \tau_{ijk} \sim G\Big(\tfrac{L+1}{2},\, 2\tau^2_{ijk}\Big), \qquad \tau_{ijk} \mid a_{ijk}, b_{ijk} \sim IG(a_{ijk}, b_{ijk}),$

where G(a, b) denotes the Gamma distribution with density f(x) = x^{a−1} b^{−a} Γ(a)^{−1} exp(−x/b), and IG(a, b) denotes the Inverse Gamma distribution with density f(x) = b^{a} Γ(a)^{−1} x^{−a−1} exp(−b/x).
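As a rough illustration of the hierarchy in (3.3), the snippet below draws coefficient groups from the prior and reports the typical group norm for two hyperparameter settings; the settings themselves are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_groups(L, a, b, n=20000):
    """Draw n coefficient groups of size L from the hierarchy in (3.3)."""
    tau = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=n)          # tau ~ IG(a, b)
    sigma2 = rng.gamma(shape=(L + 1) / 2, scale=2.0 * tau**2)      # sigma^2 | tau ~ G((L+1)/2, 2 tau^2)
    return rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(n, L))  # theta | sigma^2 ~ N(0, sigma^2 I)

for a, b in [(1.0, 0.05), (1.0, 1.0)]:
    norms = np.linalg.norm(sample_groups(L=3, a=a, b=b), axis=1)
    print(f"a={a}, b={b}: median ||theta||_2 = {np.median(norms):.3f}")
```

Smaller b concentrates the group norms near zero while keeping heavy tails, which is the behaviour made explicit by the marginal density derived next.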


This hierarchical formulation implies an adaptive version of the group lasso algorithm and allows for automatic updating of the smoothing parameters. As suggested in [16], θ_ijk = {θ_{ijkl}}_{l=1}^{L} can be estimated by the MAP estimate, which takes the form

$$\hat{\theta}_{ijk} = \arg\max_{\theta_{ijk}} \Big\{ \log L(\theta_{ijk} \mid a_{ijk}, b_{ijk}) + \log p(\theta_{ijk} \mid a_{ijk}, b_{ijk}) \Big\}.$$

Integrating σ²_{ijk} and τ_{ijk} out in (3.3), the marginal density for θ_{ijk} can be written as

$$p(\theta_{ijk} \mid a_{ijk}, b_{ijk}) = \frac{(2 b_{ijk})^{-L}\, \pi^{-(L-1)/2}\, \Gamma(L + a_{ijk})}{\Gamma\big((L+1)/2\big)\, \Gamma(a_{ijk})} \left( \frac{\|\theta_{ijk}\|_2}{b_{ijk}} + 1 \right)^{-(a_{ijk} + L)},$$

where $\|\theta_{ijk}\|_2 = \big(\sum_{l=1}^{L} \theta_{ijkl}^2\big)^{1/2}$ is the L2 norm of θ_{ijk}. We note that the marginal distribution involves the L2 norm of θ_{ijk}, which is directly related to the penalty term in the group lasso.

However, the marginal likelihood resulting from the hierarchical group lasso prior is not concave. As a result, finding the global mode by direct maximization is not feasible. An alternative approach proposed in [16] is to find local modes of the posterior using the Expectation-Maximization (EM) algorithm [19], with τ_{ijk} treated as latent variables. This leads to the following iteratively reweighted minimization algorithm,

$$\theta_{ijk}^{(m+1)} = \arg\max_{\theta_{ijk}} \Big\{ \log p(\mathbf{Y} \mid \theta_{ijk}) - w_{ijk}^{(m)} \|\theta_{ijk}\|_2 \Big\}, \qquad \text{where } w_{ijk}^{(m)} = \frac{a_{ijk} + L}{\|\theta_{ijk}^{(m)}\|_2 + b_{ijk}}.$$
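The following sketch implements this reweighting scheme for a single regression with grouped coefficients. A plain proximal-gradient (ISTA) loop stands in for the group-lasso solver of [20], and the unweighted squared-error loss is a simplification of the state-weighted loss used in the M-step of Section 4; hyperparameter values are illustrative.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Block soft-thresholding: shrink the whole group v towards zero by t in L2 norm."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def reweighted_group_lasso(X, y, groups, a=1.0, b=0.05, outer=10, inner=300):
    """MAP estimation under the hierarchical prior via iterative reweighting.

    X: (n, p) design, y: (n,) response, groups: list of index arrays,
    one per coefficient group (e.g. the L lags of one source series).
    """
    theta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # step size from the Lipschitz constant
    for _ in range(outer):
        # w_g = (a + |g|) / (||theta_g||_2 + b): the reweighting rule above
        w = [(a + len(g)) / (np.linalg.norm(theta[g]) + b) for g in groups]
        for _ in range(inner):                      # weighted group lasso for fixed weights
            z = theta - step * (X.T @ (X @ theta - y))
            for g, w_g in zip(groups, w):
                theta[g] = group_soft_threshold(z[g], step * w_g)
    return theta
```

In the full model, the same per-group reweighting is applied inside the M-step, with each observation additionally weighted by its state responsibility and noise variance as described in Section 4.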

For the parameters in the transition probability matrix, we assign Dirichlet priors: for a given state k, the transition probabilities to all possible states, p_{k·} = (p_{k1}, ..., p_{kK}), have prior density $p(p_{k\cdot}) \propto \prod_{k'=1}^{K} p_{kk'}^{\alpha_{k'}-1}$, where the α_{k'} are the hyperparameters of the Dirichlet distribution. A popular choice is α_{k'} = 1, corresponding to a noninformative prior on p_{k·}. The Dirichlet distribution is a popular prior for the probabilities of discrete random variables; indeed, it is the conjugate prior for the multinomial distribution, which results in efficient computation. We note that the noninformative prior on the transition probabilities does not imply that all states are equally likely at a transition: the transition probabilities are also updated according to the data likelihood. Finally, to complete the Bayesian hierarchical model, we assign the noninformative prior p(σ_k²) ∝ 1/σ_k² to the variances, where σ_k², introduced in (3.2), is the variance of the error terms when the process is in the kth state.

3.3 Selecting the Number of States  We utilize the Bayesian information criterion (BIC) [23] to select K, the number of states. The BIC is defined as −2 log L(ψ̂_K) + d_K log(N), where log L(ψ̂_K) is the log-likelihood of the observed data under ψ̂_K, d_K is the number of parameters, and N is the number of observations. The first term in the BIC measures the goodness of fit of a Markov model with K states, while the second term is an increasing function of the number of parameters d_K, which penalizes model complexity. Following [28], the complexity of our model, with its underlying group sparse structure, can be written as

$$d_K = \sum_{k=1}^{K}\sum_{i=1}^{d}\sum_{j=1}^{d} I\big(\|\theta_{ijk}\| > 0\big) + \sum_{k=1}^{K}\sum_{i=1}^{d}\sum_{j=1}^{d} (L-1)\,\frac{\|\theta_{ijk}\|}{\|\theta_{ijk}^{LS}\|},$$

where the θ_{ijk}^{LS} are the parameters estimated by ordinary least squares. We thus estimate K̂, the total number of states, by the value of K with the minimum BIC.
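A small sketch of the resulting model-selection computation is given below; group_bic only assumes access to the penalized and least-squares coefficient groups and to the observed-data log-likelihood, and fit_model is a hypothetical stand-in for the EM procedure of Section 4.

```python
import numpy as np

def group_bic(loglik, theta_hat, theta_ls, N):
    """BIC of Section 3.3 with the group-sparse complexity d_K.

    theta_hat, theta_ls: arrays of shape (K, d, d, L) holding the penalized and the
    ordinary-least-squares coefficient groups; loglik: observed-data log-likelihood.
    """
    L = theta_hat.shape[-1]
    norms = np.linalg.norm(theta_hat, axis=-1)       # ||theta_ijk||_2, shape (K, d, d)
    ls_norms = np.linalg.norm(theta_ls, axis=-1)
    ratio = np.divide(norms, ls_norms, out=np.zeros_like(norms), where=ls_norms > 0)
    d_K = np.sum(norms > 0) + (L - 1) * ratio.sum()
    return -2.0 * loglik + d_K * np.log(N)

# The candidate K with the smallest BIC is selected, e.g.:
# best_K = min(range(1, 11), key=lambda K: group_bic(*fit_model(Y, K)))
```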

4 Algorithm

Let θ = {θ_ijk}_{i,j=1,...,d; k=1,...,K}, σ² = {σ_k²}_{k=1,...,K}, and recall that P is the transition matrix. The unknown parameters in our model are ψ = (θ, σ², P), and we estimate them by the MAP estimates, which can be obtained by maximizing the posterior distribution p(ψ|Y). In this section, we develop an efficient algorithm to find the MAP estimates using an EM approach [19] in which the state variables S_t are treated as missing data. When the goal is to find MAP estimates, the EM algorithm converges to a local mode of the posterior distribution by iteratively alternating between an Expectation (E) step and a Maximization (M) step, as follows. In the E-step, we compute the expectation of the joint posterior distribution of the latent variables and unknown parameters, conditional on the observed data, denoted by Q(ψ; ψ^{(m)}). Let Y_t be the d-dimensional vector of observations at time t, and let Y_{t1:t2} be the collection of measurements from time t1 to t2. For simplicity, we set D_0 = {Y_1, ..., Y_L} to be the initial information consisting of the first L observations and then relabel Y_t ⇒ Y_{t−L} for t > L. Then we have


$$
\begin{aligned}
Q(\psi; \psi^{(m)}) &= E_{S,\tau \mid \mathbf{Y}, D_0, \psi^{(m)}}\big[\log p(\psi, S, \tau \mid Y_{1:T}, D_0)\big] \\
&= \sum_{t=1}^{T}\sum_{k=1}^{K} L_{tk} \log f(Y_t \mid Y_{t-1:t-L}, S_t = k, \theta, \sigma^2)
 + \sum_{t=2}^{T}\sum_{k=1}^{K}\sum_{k'=1}^{K} H_{t,k'k} \log p(S_t = k \mid S_{t-1} = k', P) \\
&\quad + \sum_{k=1}^{K} L_{1k} \log \pi_k
 - \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{K} w_{ijk} \|\theta_{ijk}\|_2
 - \sum_{k=1}^{K} \log \sigma_k^2
 + \sum_{k=1}^{K}\sum_{k'=1}^{K} (\alpha_{k'} - 1) \log p_{kk'} + \text{const},
\end{aligned}
$$

where L_{tk} = p(S_t = k | Y_{1:T}, D_0, ψ^{(m)}) and H_{t,k'k} = p(S_{t−1} = k', S_t = k | Y_{1:T}, D_0, ψ^{(m)}) are the posterior probabilities of the hidden state variables, and π_k = Pr(S_1 = k | Y_{1:T}, D_0) is the probability of the initial state being k. In the E-step, the posterior probabilities L_{tk} and H_{t,k'k} can be calculated using the three-step backward-forward algorithm [3] as follows.

1. Compute the forward probability α_k^{(m+1)}(t) = p(Y_{1:t}, S_t = k | D_0, ψ^{(m)}) by going forward iteratively in time:

$$\alpha_k^{(m+1)}(1) = p(Y_1, S_1 = k \mid D_0, \psi^{(m)}) = \pi_k^{(m)}\, p(Y_1 \mid S_1 = k, D_0, \psi^{(m)}),$$

$$\alpha_k^{(m+1)}(t) = p(Y_{1:t}, S_t = k \mid D_0, \psi^{(m)}) = f(Y_t \mid Y_{t-L:t-1}, S_t = k, D_0, \psi^{(m)}) \times \sum_{k'=1}^{K} p_{k'k}^{(m)}\, \alpha_{k'}^{(m+1)}(t-1).$$

2. Compute the backward probability β_k^{(m+1)}(t) = f(Y_{t+1:T} | Y_{1:t}, D_0, S_t = k, ψ^{(m)}) by going backward iteratively in time:

$$\beta_k^{(m+1)}(T) = 1,$$

$$\beta_k^{(m+1)}(t) = f(Y_{t+1:T} \mid Y_{1:t}, D_0, S_t = k, \psi^{(m)}) = \sum_{k'=1}^{K} f(Y_{t+1} \mid Y_{t:t-L+1}, D_0, S_{t+1} = k', \psi^{(m)})\, \beta_{k'}^{(m+1)}(t+1)\, p_{kk'}^{(m)}.$$

3. Compute the posterior probabilities

$$L_{tk}^{(m+1)} = p(S_t = k \mid Y_{1:T}, D_0, \psi^{(m)}) = \frac{\alpha_k^{(m+1)}(t)\, \beta_k^{(m+1)}(t)}{f(Y_{1:T} \mid D_0, \psi^{(m)})}$$

and

$$H_{t,k'k}^{(m+1)} = p(S_{t-1} = k', S_t = k \mid Y_{1:T}, D_0, \psi^{(m)}) = \frac{f(Y_t \mid S_t = k, Y_{t-1:t-L}, \psi^{(m)})\, p_{k'k}^{(m)}\, \alpha_{k'}^{(m+1)}(t-1)\, \beta_k^{(m+1)}(t)}{f(Y_{1:T} \mid D_0, \psi^{(m)})}.$$
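A compact sketch of these recursions is given below. It works with the log emission densities log f(Y_t | ·, S_t = k, ψ^{(m)}) supplied as an array, and rescales the forward variables for numerical stability, a standard implementation detail not spelled out in the text.

```python
import numpy as np

def forward_backward(log_emis, P, pi):
    """E-step smoothing probabilities L[t, k] and pairwise posteriors H[t, k', k].

    log_emis[t, k] = log f(Y_t | Y_{t-L:t-1}, S_t = k, psi); P: (K, K) transition
    matrix with P[k', k] = p_{k'k}; pi: initial state probabilities.
    """
    T, K = log_emis.shape
    emis = np.exp(log_emis - log_emis.max(axis=1, keepdims=True))  # per-time rescaling
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = pi * emis[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                      # forward recursion
        alpha[t] = emis[t] * (alpha[t - 1] @ P)
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = (P @ (emis[t + 1] * beta[t + 1])) / c[t + 1]
    L = alpha * beta
    L /= L.sum(axis=1, keepdims=True)
    H = np.zeros((T, K, K))
    for t in range(1, T):                      # pairwise posteriors H[t, k', k]
        H[t] = alpha[t - 1][:, None] * P * (emis[t] * beta[t])[None, :]
        H[t] /= H[t].sum()
    return L, H
```

In the model above, log_emis would be filled with the Gaussian VAR log-likelihoods implied by (3.2) under the current iterate ψ^{(m)}.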

In the M-step, we update ψ by maximizing Q(ψ; ψ^{(m)}), as follows.

1. The VAR coefficients θ_{ijk}^{(m+1)} are estimated by minimizing

$$\sum_{t=1}^{T} \frac{1}{2}\, L_{tk}^{(m+1)} \Big(Y_{j,t} - \sum_{i=1}^{d} X_{ti}\, \theta_{ijk}\Big)^2 \Big/ \sigma_k^{2,(m)} + \sum_{i=1}^{d} w_{ijk}^{(m+1)} \|\theta_{ijk}\|_2,$$

where X_{ti} = (Y_{i,t−1}, ..., Y_{i,t−L}) and the updated weights are calculated as

$$w_{ijk}^{(m+1)} = \frac{a_{ijk} + L}{\|\theta_{ijk}^{(m)}\|_2 + b_{ijk}}.$$

The above regularized minimization problem can be transformed into a standard group lasso formulation [28] by appropriately rescaling Y_{j,t} and X_{ti}. This group lasso problem can be solved efficiently using the optimization procedure proposed by [20].

2. The variances σ_k^{2,(m+1)} of each Markov state are updated as

$$\sigma_k^{2,(m+1)} = \frac{\sum_{j=1}^{d}\sum_{t=1}^{T} L_{tk}^{(m+1)} \Big(Y_{j,t} - \sum_{i=1}^{d} X_{ti}\, \theta_{ijk}^{(m+1)}\Big)^2}{d\, T_k^{(m+1)} + 2}, \qquad \text{where } T_k^{(m+1)} = \sum_{t=1}^{T} L_{tk}^{(m+1)}.$$

3. The transition probabilities p_{k'k} are updated as

$$p_{k'k}^{(m+1)} = \frac{\sum_{t=2}^{T} H_{t,k'k}^{(m+1)} + \alpha_k - 1}{\sum_{t=1}^{T} L_{tk'}^{(m+1)} + \sum_{k''=1}^{K} (\alpha_{k''} - 1)}.$$


4. The initial probabilities are updated as π_k^{(m+1)} = L_{1k}^{(m+1)}.

In summary, the proposed EM algorithm is computationally efficient: Q(ψ; ψ^{(m)}) can be derived in closed form in the E-step, the maximization in the M-step can be transformed into a standard group lasso formulation [28], and it can be carried out very efficiently by using the optimization procedure in [20]. The algorithm iterates between the E-step and the M-step until it converges. We summarize the algorithm in Figure 1.

EM algorithm

1. Input:
   a. Observations: {Y_{j,t}, j = 1, ..., d; t = 1, ..., T}, where Y_{j,t} is a measurement taken for feature j at time t.
   b. Algorithm: a two-step iterative algorithm which incorporates Bayesian hierarchical group selection, EM.

2. Initialization:
   a. Missing data: Markov state variables {S_t}_{t=1,...,T}, where S_t = k if time point t is in state k.
   b. Graphical structure: adjacency matrices for each Markov state, i.e. G_k = <V_k, E_k>, k = 1, ..., K, where V_k is the set of observations assigned to Markov state k.

3. Run EM until the algorithm converges:
   E-step: Impute the missing data S_t through the backward-forward steps and assign a Markov state to each time point.
   M-step: Update the parameters ψ and select causal effects through the Bayesian hierarchical group Lasso.

4. Output:
   a. The Markov switching path connected through all the detected change points.
   b. For each Markov state k, an edge i → j is placed into E_k if and only if feature i was selected as a causal effect on j.

Figure 1: EM algorithm to fit the Sparse Dynamic Bayesian Network (sDBN).
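The sketch below spells out one M-step pass. The rescaling of the responses and lagged regressors by sqrt(L_tk/σ_k²) is what turns item 1 into a standard group lasso; group_lasso_solver is a placeholder for the solver of [20] (the reweighted routine sketched in Section 3.2 could be used instead), and all argument names are illustrative.

```python
import numpy as np

def m_step_coefficients(y_j, X, resp_k, sigma2_k, theta_prev_jk, group_lasso_solver,
                        a=1.0, b=0.05):
    """Update the coefficient groups theta_{.jk} for one target series j in one state k.

    y_j: (T,) responses; X: (T, d*L) lagged design with the L lags of each source
    series stored contiguously; resp_k: (T,) responsibilities L_tk;
    theta_prev_jk: (d, L) previous iterate used to form the weights w_ijk.
    """
    L = theta_prev_jk.shape[1]
    scale = np.sqrt(resp_k / sigma2_k)               # turns the weighted loss into ordinary least squares
    weights = (a + L) / (np.linalg.norm(theta_prev_jk, axis=1) + b)   # w_ijk^{(m+1)}
    return group_lasso_solver(scale[:, None] * X, scale * y_j,
                              group_size=L, group_weights=weights)

def m_step_closed_form(resid2, resp, H_sum, d, alpha=1.0):
    """Closed-form updates of items 2-4: variances, transition matrix, initial probabilities.

    resid2[t, k] = sum_j (Y_jt - sum_i X_ti theta_ijk)^2; resp[t, k] = L_tk;
    H_sum[k', k] = sum_t H_{t,k'k}; alpha is the common Dirichlet hyperparameter.
    """
    K = resp.shape[1]
    T_k = resp.sum(axis=0)
    sigma2 = (resp * resid2).sum(axis=0) / (d * T_k + 2.0)
    P = (H_sum + alpha - 1.0) / (T_k[:, None] + K * (alpha - 1.0))
    pi = resp[0]
    return sigma2, P, pi
```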

5 Simulation

We investigate the performance of our method on synthetic data with respect to identifying the switching states and the resulting sparse Bayesian networks. We considered K = 2 and K = 5 states when generating the synthetic data, and maximum lags in the VAR model of L = 1 and L = 3. The synthetic data are generated according to the following steps.

1. Generate the state assignment sequence. Instead of assuming that the true generative process for the states is a Markov switching process, we randomly sampled T/60 change points and randomly assigned states to each block. We relax the Markov assumption to test whether our model still enables us to identify the underlying true process under a more general and realistic condition.

2. Generate a random directed graph (the true network) with a specified edge probability. We generated a set of d × d adjacency matrices A_1, ..., A_K, where the entry A_k[i, j] = 1 indicates that at Markov state k, Y_i influences Y_j, and A_k[i, j] = 0 otherwise. The value of each entry was chosen by sampling from a Bernoulli distribution, where the probability that an entry equals one was set to 0.2.

3. Create a sparse Markov switching VAR model corresponding to the Markov states and the networks generated in the previous two steps. For each Markov state, the VAR coefficients θ_{ijkl}, l = 1, ..., L, were sampled from a normal distribution with mean 0 and SD 0.20 if A_k[i, j] = 1, and set to 0 otherwise. The noise SD was set at 0.01 for all states.

4. Simulate data from the above switching VAR model. The sample size per feature per lag per state was set at 10. We considered p = 12 measurements and maximum lags L = 1 and L = 3.

To evaluate the accuracy of the switching state estimation, we use the Rand index [22], where we treat the switching modeling problem as clustering the T multivariate data vectors Y_t = (Y_{1,t}, ..., Y_{d,t}) into K Markov states. The Rand index measures clustering accuracy and is defined here as the fraction of pairs of data vectors (Y_t, Y_s) on which the true and estimated clusterings agree. Let C* and Ĉ denote the true and estimated clustering maps, respectively; the Rand index is then

$$R = \frac{\sum_{t<s} I\big(\hat{C}(Y_t, Y_s) = C^*(Y_t, Y_s)\big)}{\binom{T}{2}}.$$

To evaluate the accuracy of the DBN estimation, we use the F1 score, which is the harmonic mean of precision and recall in retrieving the true network edges. We note that larger values of the Rand index and the F1 score indicate higher accuracy.
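For completeness, the two evaluation metrics can be computed as below; the Rand index here follows the pair-agreement convention stated above, which makes it invariant to relabeling of the estimated states.

```python
import numpy as np
from itertools import combinations

def rand_index(true_states, est_states):
    """Fraction of pairs (t, s) on which the two clusterings agree about
    whether Y_t and Y_s belong to the same state."""
    true_states, est_states = np.asarray(true_states), np.asarray(est_states)
    pairs = list(combinations(range(len(true_states)), 2))
    agree = sum((true_states[t] == true_states[s]) == (est_states[t] == est_states[s])
                for t, s in pairs)
    return agree / len(pairs)

def f1_edges(true_adj, est_adj):
    """Harmonic mean of precision and recall over directed edges (0/1 adjacency arrays)."""
    tp = np.sum((true_adj == 1) & (est_adj == 1))
    precision = tp / max(est_adj.sum(), 1)
    recall = tp / max(true_adj.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```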


We compare our method (MS-SDBN) with two alternative methods. The first is a change point detection method extending the method of [13] to the VAR setting. This method estimates change points via fused-lasso penalized regression and subsequently estimates a sparse static network for each segment separately; we refer to it as "Fused-DBN". The second is the TV-DBN method of [25] (see Section 1 for details on this method). Note that the simulation setting is not favorable to TV-DBN; the comparison is only meant to illustrate the point made in Section 1 when we discussed TV-DBN, namely the value of our approach in real-world settings where the assumptions made by TV-DBN are not appropriate (i.e., where the networks are not smoothly and constantly varying over time). We remark that neither of the comparison methods allows for regime identification, as they do not allow the process to switch back to a previous state.

The results of our experiments are summarized in Table 1, where we show the average F1 score for the three methods and the average Rand index for our method only (since, to our knowledge, our method is the only one allowing for regime identification in DBNs), over 100 runs, along with the standard errors.

Table 1: Accuracy of the comparison methods in identifying the correct Bayesian networks, measured by the average Rand index and F1 score on synthetic data with varying numbers of Markov states (K) and lags (L). The numbers in parentheses are standard errors.

Type        Method      K=2, L=1      K=2, L=3      K=5, L=1       K=5, L=3
Rand index  MS-SDBN     0.94 (0.02)   0.96 (0.02)   0.97 (0.005)   0.97 (0.06)
            Fused-DBN   N/A           N/A           N/A            N/A
            TV-DBN      N/A           N/A           N/A            N/A
F1 score    MS-SDBN     0.76 (0.05)   0.91 (0.12)   0.76 (0.03)    0.87 (0.17)
            Fused-DBN   0.48 (0.02)   0.50 (0.03)   0.53 (0.04)    0.52 (0.02)
            TV-DBN      0.36 (0.02)   0.38 (0.02)   0.50 (0.06)    0.40 (0.04)

Overall, our method achieves very good accuracy under varying numbers of Markov states. The Rand indices do not change much from small to large numbers of states, which suggests similar accuracy in change point detection. The F1 scores, which are smaller under L = 1, suggest that the dependency structure may be more accurately recovered when the relationships involve multiple time lags; this is intuitive, since more lagged variables provide more information on the grouping structure of the variables. Our approach achieves significantly better F1 accuracy than the other methods in all the cases considered.

From a computational standpoint, our algorithm is particularly attractive compared to the alternative change-point-based method, Fused-DBN. Indeed, the Fused-DBN method involves applying the randomized lasso to a transformed data matrix of dimensions T × Tp, which implies that the number of features effectively considered is T times the actual number of features. This can be a significant impediment even in low-dimensional settings (for 10 features and 1,000 time points, one has to work with 10,000 features after the transformation). In addition, Fused-DBN is unable to leverage the grouping structure corresponding to dependencies with L > 1.

Another noteworthy strength of our method, compared to Fused-DBN and other change-point-based methods proposed in the literature, concerns regime change identification. In the existing methods, once the change points have been estimated, the coefficients are estimated individually for each interval. If some of the intervals between two subsequent change points are small (which may happen in many practical situations), the algorithms may be forced to work with extremely small sample sizes, leading to poor estimates. In contrast, our method considers states and allows the process to return to previous states; it is thus able to borrow strength across a wider set of samples, which may be far apart in time.

In the remainder of this section we focus in more detail on the results obtained by our method, using the synthetic data with K = 2 and L = 1. According to the BIC, the number of states K is estimated as 2. In the left panel of Figure 2, we show the Markov path estimated by our method and the true path for a particular simulation run, where the transition jumps highlighted in red in the true path (upper panel) are those missed by our method. As shown in the plot, our method is able to detect the change points with very little delay. We also observe that the method tends to miss a transition when the process remains in a state for too short a duration. In practice, however, such transient jumps rarely happen or may be of less interest in real applications. In the right panel of Figure 2 we show the corresponding estimated Bayesian networks along with the true networks. In the true networks (left column), we highlighted in red the false negatives (edges that exist in the true graphs but are missed in the estimated graphs), and in the estimated networks (right column) we highlighted in green the false positives (edges that do not exist in the true graphs but are selected by our method). As we can see from the plot, the estimated Bayesian networks


exhibit reasonable agreement with the true networks.

[Figure 2 here: the left panels plot the true and estimated Markov state paths against time; the right panels show the true and estimated 12-node networks at each state.]

Figure 2: Left: output switching path on one synthetic dataset with 2 Markov states; transition jumps missing in the estimated Markov path are highlighted in red. Right: the corresponding output networks: (a) true network at state 1; (b) estimated network at state 1; (c) true network at state 2; (d) estimated network at state 2. Edges coded in red are the false negatives and those in green are the false positives.

Figure 3 shows the BIC criterion proposed in Section 3.3 for different numbers of states, for two simulation runs with 2 and 5 Markov states respectively. In these examples the BIC criterion is minimized at the true number of states, which indicates that the proposed criterion is capable of recovering the true number of states.

[Figure 3 here: BIC versus the number of states for the two runs.]

Figure 3: BIC values under different numbers of states for one synthetic dataset with 2 and with 5 Markov states, respectively. The minimum BIC values are highlighted in red.

6 Application to Climate Modeling

Climate change is one of the greatest sociological and scientific issues of our time [1]. Many fundamental questions have yet to be resolved to improve our understanding of the climate system and therefore our capacity to predict and respond to climate variability. One important challenge resides in uncovering dependency relationships between various climate variables, characterizing how such relationships may have been altered over time, and detecting the time points at which those changes take place. Indeed, understanding climate "regimes", their specific characteristics and their duration may shed light on the underlying mechanisms in the climate system that resulted in its observed variability.

In our experiments, we focus on climate indices [21], which are time series data characterizing the state of the climate system over time. These indices describe various atmospheric and oceanic events and are compiled for the very purpose of climate monitoring. We used monthly measurements available at the Climate Prediction Center of the National Oceanic and Atmospheric Administration (NOAA) [21]. A brief description of the indices we used is provided in Table 2. Note that the indices we considered cover all broad categories of climate indices available, namely Teleconnections (characterizing phenomena where remote locations exhibit related weather changes), Atmosphere, El Nino Southern Oscillation (ENSO, a climate pattern that occurs across the tropical Pacific Ocean), Sea Surface Temperature (SST) Pacific, Sea Surface Temperature Atlantic, and others.

Table 2: Climate indices used for the modeling.

Code         Name                                 Type
AMM          Atlantic Meridional Mode             SST Atlantic
AO           Arctic Oscillation                   Atmosphere
EPO          Eastern Pacific Oscillation          Teleconnections
GMP          Global Mean Temperature              Others
Nino1+2      Extreme Eastern Pacific SST          ENSO
Nino3        Eastern Tropical Pacific SST         ENSO
Nino3+4      East Central Tropical Pacific SST    ENSO
Nino4        Central Tropical Pacific SST         ENSO
NOI          Northern Oscillation Index           Teleconnections
ONI          Oceanic Nino Index                   SST Pacific
PDO          Pacific Decadal Oscillation          Teleconnections
PNA          Pacific North American Index         Teleconnections
Sol.flux     Solar Cycle Flux                     Others
SOI          Southern Oscillation Index           Atmosphere
SW.Monsoon   South Western USA Monsoon            Atmosphere
WP           West Pacific Index                   Teleconnections

The multivariate time series data span 57 years (from January 1951 to May 2007). Prior to modeling, we performed non-parametric (linear and quadratic) detrending, as our goal is to detect changes in the dependency structure of the Bayesian networks, not in trends. Exploratory data analysis suggests considering a maximum lag of 3 months. We set (a = 1, b = 0.05) in the hierarchical group lasso prior (other values led to similar results). Two states are selected; the corresponding results are depicted in Figure 4. We performed various experiments using different initial values for the change points. All runs converged to a single change point in August 1978, suggesting that our method leads to stable results.

[Figure 4 here: the estimated dependency networks over the climate indices before and after the detected change point.]

Figure 4: Output dependency structures uncovered from the climate indices dataset: (a) prior to August 1978; (b) after August 1978. Edges coded in red are removed after August 1978, while edges highlighted in green are added.

We remark that in the two Bayesian networks we uncovered, Solar Flux is a key influencing variable. This is consistent with various findings from the climate science literature regarding the influence of the Sun's energy output on Earth's climate. Focusing on the changes in structure, we observe that the post-August-1978 structure becomes much sparser. Note, however, that one edge is added from Solar Cycle Flux to Pacific Decadal Oscillation. This intriguing phenomenon may confirm and help understand results from recent modeling studies, which have found resonant responses and positive feedbacks in the ocean-atmosphere system that may lead to an amplified response to solar irradiance variations [14]. As future work, it would be interesting to discuss the implications of these results with climate scientists and refine them.

Several researchers have identified change points with respect to the trend of the climate system indices in the mid-1970s, with intense warming starting in 1976 (see [1] and references therein). Our results suggest that changes in the underlying relationships governing the climate system occurred slightly later, namely in 1978. This illustrates the usefulness of our approach as a complementary tool in monitoring and understanding climate change: our focus is on detecting the structural changes in the relationships governing the climate indices rather than changes in trends, and our methodology also provides a characterization of the nature of such changes. We consider the application very compelling in view of the debate on the causes of climate change: human intervention versus natural cyclical properties of the climate. By allowing for regime changes and switching back to previous states, our method is able to infer whether or not the climate shows cyclical behavior. As future work we also plan to apply our method to paleoclimate data covering a much longer time span.

7 Concluding Remarks

In this article, we proposed a novel statistical method for change point modeling in sparse Bayesian networks. The resulting approach is valuable for regime change detection, anomaly detection, and dependency structure learning with multivariate time series data. Interesting directions for future research include automatically selecting the hyperparameters of the hierarchical Bayesian group lasso within the statistical model, and simultaneously modeling change points in the main trend within series and in the dependency relationships across series.


References

[1] Climate change 2007 - the physical science basis. IPCC Fourth Assessment Report on scientific aspects of climate change for researchers, students, and policymakers, 2007.
[2] A. Arnold, Y. Liu, and N. Abe. Temporal causal modeling with graphical Granger methods. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD-07), 2007.
[3] Leonard Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.
[4] B. Carlin, A. Gelfand, and A. F. M. Smith. Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 41(2):389–405, 1992.
[5] H. Chernoff and S. Zacks. Estimating the current mean of a normal distribution which is subject to changes in time. Annals of Mathematical Statistics, 35:999–1018, 1964.
[6] Siddhartha Chib. Estimation and comparison of multiple change-point models. Journal of Econometrics, 86:221–241, 1998.
[7] N. Dobigeon, J. Tourneret, and J. Davy. Joint segmentation of piecewise constant autoregressive processes by using a hierarchical model and a Bayesian sampling approach. IEEE Transactions on Signal Processing, 55(4):1251–1263, 2007.
[8] W. Enders. Applied Econometric Time Series. John Wiley & Sons, 2003.
[9] E. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.
[10] M. Grzegorczyk and D. Husmeier. Non-stationary continuous dynamic Bayesian networks. Advances in Neural Information Processing Systems (NIPS), 22:682–690, 2009.
[11] M. Grzegorczyk and D. Husmeier. Non-homogeneous dynamic Bayesian networks for continuous data. Machine Learning, 83:355–419, 2011.
[12] Keiji Kanazawa, Daphne Koller, and Stuart Russell. Stochastic simulation algorithms for dynamic probabilistic networks. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), 1995.
[13] Mladen Kolar, Le Song, and Eric P. Xing. Sparsistent learning of varying-coefficient models with structural changes. NIPS, 2009.
[14] Judith L. Lean. Cycles and trends in solar irradiance and climate. Wiley Interdisciplinary Reviews: Climate Change, 1(1):111–122, 2010.
[15] S. Lèbre, J. Becq, F. Devaux, G. Lelandais, and M. Stumpf. Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems Biology, 4(130), 2010.

[16] Anthony Lee, Francois Caron, Arnaud Doucet, and Chris Holmes. A Hierarchical Bayesian Framework for Constructing Sparsity-inducing Priors. http://arxiv.org/abs/1009.1914, 2010.
[17] A. C. Lozano, N. Abe, Y. Liu, and S. Rosset. Grouped graphical Granger modeling for gene expression regulatory networks discovery. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB-09), 2009.
[18] A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD-09), 2009.
[19] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 2008.
[20] Lukas Meier, Sara van de Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, 70(1):53–71, 2008.
[21] National Oceanic and Atmospheric Administration: Climate Prediction Center. Climate indices: Monthly atmospheric and ocean time series. http://www.esrl.noaa.gov/psd/data/climateindices/list/, 2011.
[22] William Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[23] Gideon E. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2), 1978.
[24] M. Siracusa and J. W. Fisher. Tractable Bayesian inference of time-series dependence structure. Journal of Machine Learning Research - Proceedings Track, pages 528–535, 2009.
[25] Le Song, Mladen Kolar, and Eric P. Xing. Time-varying dynamic Bayesian networks. NIPS, 2009.
[26] P. C. Woodland. Hidden Markov models using vector linear prediction and discriminative output distributions. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 509–512, 1992.
[27] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. International Conference on Machine Learning, 2007.
[28] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
[29] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
