Linear Embedding of Large-Scale Brain Networks for Twin fMRI

arXiv:1509.04771v1 [cs.AI] 15 Sep 2015

Moo K. Chung^1, Victoria G. Vilalta^2, Paul J. Rathouz^1, Benjamin B. Lahey^3, David H. Zald^2

^1 University of Wisconsin-Madison, ^2 Vanderbilt University, ^3 University of Chicago
email: [email protected]

September 17, 2015

Abstract

In many human brain network studies, we do not have a sufficient number ($n$) of images relative to the number ($p$) of voxels due to the prohibitively expensive cost of scanning enough subjects. Thus, brain network models usually suffer from the small-n large-p problem. Such a problem is often remedied by sparse network models, which are usually solved numerically by optimizing L1-penalties. Unfortunately, due to the computational bottleneck associated with optimizing L1-penalties, it is not practical to apply such methods to learn large-scale brain networks. In this paper, we introduce a new sparse network model based on cross-correlations that bypasses this computational bottleneck. Our model can build sparse brain networks at the voxel level with $p > 25000$. Instead of using a single sparse parameter that may not be optimal in other studies and datasets, we propose to analyze the collection of networks at every possible sparse parameter in a coherent mathematical framework using graph filtrations. The method is subsequently applied in determining the extent of genetic effects on functional brain networks at the voxel level, for the first time, using twin fMRI.

1 Introduction

In brain imaging, there have been many attempts at using high-dimensional imaging features via various multivariate approaches (Chung et al., 2013; Worsley et al., 2005a; He et al., 2008). However, these standard multivariate techniques suffer from the small-n large-p problem (Friston et al., 1995; Schäfer and Strimmer, 2005; Valdés-Sosa et al., 2005; Chung et al., 2013). Specifically, when the number of voxels is substantially larger than the number of images, the resulting linear model is under-determined and has an infinite number of solutions. The small-n large-p problem is often remedied by using sparse models, which regularize the under-determined system with additional sparse penalties. Popular sparse models include LASSO (Bickel and Levina, 2008; Peng et al., 2009; Huang et al., 2009; Chung et al., 2013), sparse canonical correlation (Avants et al., 2010), and graphical-LASSO (Banerjee et al., 2006, 2008; Friedman et al., 2008; Huang et al., 2009; Mazumder and Hastie, 2012; Witten et al., 2011). Most of these sparse models require optimizing an L1-norm penalty, which has been the major computational bottleneck for solving large-scale problems. Thus, almost all sparse brain network models in brain imaging have been restricted to at most a few hundred to a few thousand nodes. As far as the authors are aware, the 2527 MRI features used in learning a LASSO model for Alzheimer's disease (Xin et al., 2015) is probably the maximum number of variables used in any sparse model in brain imaging. In this paper, we propose a new large-scale sparse network model ($p > 25000$) that yields greater computational speed and efficiency by bypassing the computational bottleneck of optimizing L1-penalties.

There have been a few attempts at speeding up the computation for sparse models. By identifying block diagonal structures in the estimated (inverse) covariance matrix, it is possible to bypass the numerical optimization in the penalized log-likelihood method (Mazumder and Hastie, 2012; Witten et al., 2011). The proposed method substantially differs from (Mazumder and Hastie, 2012; Witten et al., 2011) in that we do not need to assume the data follow normality, since there is no need to specify a likelihood function. Further, the cost functions we are optimizing are different.

A sparse model $\mathcal{A}(\lambda)$ is usually parameterized by a tuning parameter $\lambda$ that controls the sparsity of the representation; increasing the sparse parameter makes the solution more sparse. So far, many existing sparse network approaches use a fixed parameter $\lambda$ that may not be optimal in other datasets or studies, and the final classification and statistical results differ depending on the choice of the sparse parameter. Instead of constructing networks at one fixed sparse parameter that may not be optimal, we propose to analyze the collection of sparse models over every possible sparse parameter using the graph filtration recently introduced in (Chung et al., 2013; Lee et al., 2012). The graph filtration is a threshold-free framework for analyzing a family of weighted graphs, but it requires hierarchically building specific nested subgraph structures. The main advantage of graph filtration is its avoidance of determining and using one fixed threshold that may not be optimal across different populations. The proposed method shares some similarities with existing multi-thresholding network models that threshold connectivity matrices at many different thresholds (Achard et al., 2006; He et al., 2008; Supekar et al., 2008; Lee et al., 2012). However, such approaches have been applied in an ad hoc fashion and rarely to sparse models. Further, constructing sparse models for multiple sparse parameters causes an additional computational bottleneck. Our new sparse model addresses the above-mentioned shortcomings within a unified mathematical framework.

Specifically, we present a new sparse network model based on cross-correlations for the first time. Although cross-correlations have often been used in the sciences in connection with time series and stochastic processes (Worsley et al., 2005a,b), the sparse version of cross-correlation has been somewhat neglected. The proposed sparse cross-correlation network model is used in constructing large-scale brain networks at the voxel level in a functional magnetic resonance imaging (fMRI) study consisting of monozygotic (MZ) and dizygotic (DZ) twin pairs. The method is applied in generalizing the widely used voxel-level univariate heritability index (HI) into a network-level multivariate feature called the heritability graph index (HGI). The HGI is then used in determining genetic effects by constructing large-scale HGI networks with every voxel taken as a network node.

2 Methods

2.1 Sparse Cross-Correlations

Let $V = \{v_1, \cdots, v_p\}$ be a node set where the data are defined. We expect the number of nodes to be very large. In this study, we have $p = 25972$ voxels in the brain that serve as the node set. Let $x_k(v_i)$ and $y_k(v_i)$ be the $k$-th paired scalar measurements at node $v_i$. Denote by $\mathbf{x}(v_i) = (x_1(v_i), \cdots, x_n(v_i))'$ and $\mathbf{y}(v_i) = (y_1(v_i), \cdots, y_n(v_i))'$ the paired data vectors over $n$ different images at voxel $v_i$. We expect $p$ to be significantly larger than $n$, i.e., $p \gg n$.

Center and scale $\mathbf{x}$ and $\mathbf{y}$ such that

$$\sum_{k=1}^n x_k(v_i) = \sum_{k=1}^n y_k(v_i) = 0,$$

$$\| \mathbf{x}(v_i) \|^2 = \mathbf{x}'(v_i)\mathbf{x}(v_i) = \| \mathbf{y}(v_i) \|^2 = \mathbf{y}'(v_i)\mathbf{y}(v_i) = 1$$

for all $v_i$. The reasons for centering and scaling will soon become obvious. We set up a linear model between $\mathbf{x}(v_i)$ and $\mathbf{y}(v_j)$:

$$\mathbf{y}(v_j) = b_{ij}\, \mathbf{x}(v_i) + \mathbf{e}, \quad (1)$$

where $\mathbf{e}$ is a zero-mean error vector whose components are independent and identically distributed. Since the data are all centered, there is no intercept in the linear regression (1). The least squares estimation (LSE) of $b_{ij}$ is given by

$$\widehat{b}_{ij} = \mathbf{x}'(v_i)\, \mathbf{y}(v_j), \quad (2)$$

which is the (sample) cross-correlation (Worsley et al., 2005a,b). The cross-correlation is invariant under the centering and scaling operations. The least squares estimation in (1) is equivalent to minimizing the sum of squares over all nodes:

$$\sum_{i,j=1}^p \| \mathbf{y}(v_j) - b_{ij}\, \mathbf{x}(v_i) \|^2. \quad (3)$$

We then propose the following sparse version of (3):

$$\frac{1}{2} \sum_{i,j=1}^p \| \mathbf{y}(v_j) - \beta_{ij}\, \mathbf{x}(v_i) \|^2 + \lambda \sum_{i,j=1}^p |\beta_{ij}|. \quad (4)$$

Denote the cost function (4) as $F(\beta; \mathbf{x}, \mathbf{y})$, which is minimized over every possible $\beta_{ij} \in \mathbb{R}$:

$$\widehat{\beta} = \arg\min_{\beta} F(\beta; \mathbf{x}, \mathbf{y}).$$

The estimated sparse cross-correlations $\widehat{\beta}(\lambda) = (\widehat{\beta}_{ij}(\lambda))$ shrink toward zero as the sparse parameter $\lambda \geq 0$ increases. The direct optimization of (4) for large $p$ is computationally demanding. However, there is no need to optimize (4) numerically using coordinate descent learning or the active-set algorithm, as is often done in sparse modeling (Peng et al., 2009; Friedman et al., 2008). We can show that the minimization of (4) is done algebraically.

Theorem 1. For $\lambda \geq 0$, the minimizer of $F(\beta; \mathbf{x}, \mathbf{y})$ is given by

$$\widehat{\beta}_{ij}(\lambda) = \begin{cases} \mathbf{x}'(v_i)\mathbf{y}(v_j) - \lambda & \text{if } \mathbf{x}'(v_i)\mathbf{y}(v_j) > \lambda \\ 0 & \text{if } |\mathbf{x}'(v_i)\mathbf{y}(v_j)| \leq \lambda \\ \mathbf{x}'(v_i)\mathbf{y}(v_j) + \lambda & \text{if } \mathbf{x}'(v_i)\mathbf{y}(v_j) < -\lambda \end{cases} \quad (5)$$

Proof. Write $F(\beta; \mathbf{x}, \mathbf{y})$ as

$$F(\beta; \mathbf{x}, \mathbf{y}) = \frac{1}{2} \sum_{i,j=1}^p f(\beta_{ij}),$$

where $f(\beta_{ij}) = \| \mathbf{y}(v_j) - \beta_{ij}\, \mathbf{x}(v_i) \|^2 + 2\lambda |\beta_{ij}|$. Since $f(\beta_{ij})$ is nonnegative and convex, $F(\beta; \mathbf{x}, \mathbf{y})$ is minimized if each component $f(\beta_{ij})$ achieves its minimum. So we only need to minimize each component $f(\beta_{ij})$ separately. Using the constraint $\|\mathbf{x}(v_i)\| = 1$, $f(\beta_{ij})$ simplifies to

$$1 - 2\beta_{ij}\, \mathbf{x}(v_i)'\mathbf{y}(v_j) + \beta_{ij}^2 + 2\lambda|\beta_{ij}| = (\beta_{ij} - \mathbf{x}(v_i)'\mathbf{y}(v_j))^2 + 2\lambda|\beta_{ij}| + 1 - [\mathbf{x}(v_i)'\mathbf{y}(v_j)]^2.$$

For $\lambda = 0$, the minimum of $f(\beta_{ij})$ is achieved when $\beta_{ij} = \mathbf{x}(v_i)'\mathbf{y}(v_j)$, which is the usual LSE. For $\lambda > 0$, since $f(\beta_{ij})$ is quadratic in $\beta_{ij}$, the minimum is achieved when

$$\frac{\partial f}{\partial \beta_{ij}} = 2\beta_{ij} - 2\mathbf{x}'(v_i)\mathbf{y}(v_j) \pm 2\lambda = 0. \quad (6)$$

The sign of $\lambda$ depends on the sign of $\beta_{ij}$. By rearranging terms, we obtain the desired result. □

Although it is not obvious yet, Theorem 1 is related to the orthogonal design in LASSO (Tibshirani, 1996) and to soft-shrinkage in wavelets (Donoho et al., 1995). To see this, let us transform the linear equations (1) into an index-free gigantic matrix equation. Let

$$X_{n \times p} = [\mathbf{x}(v_1)\ \mathbf{x}(v_2)\ \cdots\ \mathbf{x}(v_p)], \quad Y_{n \times p} = [\mathbf{y}(v_1)\ \mathbf{y}(v_2)\ \cdots\ \mathbf{y}(v_p)], \quad \mathbf{1}_{p \times 1} = [1\ 1\ \cdots\ 1]',$$

where $\mathbf{1}_{p \times 1}$ is a vector consisting of $p$ ones. Then (1) can be written as

$$\mathbf{1}_{p \times 1} \otimes \mathrm{vec}(Y) = \mathbb{X}\, \mathrm{vec}(\beta) + \mathbf{1}_{np^2 \times 1} \otimes \mathbf{e}, \quad (7)$$

where $\mathrm{vec}$ is the vectorization operation and $I_p$ is the $p \times p$ identity matrix. The gigantic design matrix $\mathbb{X}_{np^2 \times p^2}$ is a block diagonal matrix consisting of $p$ blocks $I_p \otimes \mathbf{x}(v_1), \cdots, I_p \otimes \mathbf{x}(v_p)$. Note $\mathbb{X}'\mathbb{X}$ is a block diagonal matrix, where the $i$-th block is

$$[I_p \otimes \mathbf{x}(v_i)]'[I_p \otimes \mathbf{x}(v_i)] = I_p \otimes [\mathbf{x}(v_i)'\mathbf{x}(v_i)] = I_p.$$

Thus, $\mathbb{X}$ is orthogonal. However, our formulation is not exactly the orthogonal design of LASSO as specified in (Tibshirani, 1996), since the noise components in (7) are not independent. Figure 1 (top) displays an example of obtaining sparse cross-correlations from the initial sample cross-correlation matrix

$$X'Y = \begin{pmatrix} \times & 0.4 & 0.5 & -0.7 \\ \times & \times & 0.3 & -0.1 \\ \times & \times & \times & 0.9 \end{pmatrix}.$$

For simplicity, only the upper triangular part is shown. Theorem 1 simplifies the time-consuming convex optimization problem into a simple thresholding problem.
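Theorem 1 reduces the sparse estimation to elementwise soft-thresholding of the sample cross-correlation matrix $X'Y$. The following is a minimal numerical sketch in Python/NumPy; the helper names, the toy dimensions, and the random data are ours, not the paper's.

```python
import numpy as np

def center_scale(M):
    """Center each column and scale it to unit norm, as the model requires."""
    M = M - M.mean(axis=0)
    return M / np.linalg.norm(M, axis=0)

def sparse_cross_correlation(X, Y, lam):
    """Closed-form minimizer of (4): soft-threshold X'Y per Theorem 1, eq. (5)."""
    C = X.T @ Y                       # sample cross-correlations x'(v_i) y(v_j)
    return np.sign(C) * np.maximum(np.abs(C) - lam, 0.0)

rng = np.random.default_rng(0)
n, p = 20, 100                        # small-n large-p toy data
X = center_scale(rng.standard_normal((n, p)))
Y = center_scale(rng.standard_normal((n, p)))
beta = sparse_cross_correlation(X, Y, lam=0.3)
print(np.mean(beta != 0))             # fraction of edges surviving the threshold
```

The whole estimation is a single $p \times p$ matrix product followed by a thresholding, which is what makes the voxel-level construction with $p > 25000$ feasible.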

2.2 Graph Filtrations

From the computed sparse cross-correlations $\widehat{\beta}(\lambda)$, we build graph representations for subsequent network quantification. Instead of analyzing the graph at a fixed $\lambda$, we analyze it over every possible $\lambda$. We need a few new definitions.

Definition 1. Suppose a weighted graph $\mathcal{G} = (V, W)$ is formed by the pair of a node set $V = \{v_1, \cdots, v_p\}$ and edge weights $W = (w_{ij})$. Let $^0$ and $^+$ be binarizing operations on graphs. The binary graph $\mathcal{G}^0$ of $\mathcal{G}$ is given by the adjacency matrix $A = (a_{ij})$:

$$a_{ij} = \begin{cases} 1 & \text{if } w_{ij} \neq 0 \\ 0 & \text{otherwise} \end{cases}.$$

The binary graph $\mathcal{G}^+(\lambda)$ of $\mathcal{G}$ is given by the adjacency matrix $A = (a_{ij})$:

$$a_{ij} = \begin{cases} 1 & \text{if } w_{ij} > \lambda \\ 0 & \text{otherwise} \end{cases}.$$
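The two operations in Definition 1 are elementwise comparisons on the weight matrix; a short sketch (the helper names are ours):

```python
import numpy as np

def binarize_nonzero(W):
    """Adjacency matrix of G^0: an edge wherever the weight is nonzero."""
    return (W != 0).astype(int)

def binarize_threshold(W, lam):
    """Adjacency matrix of G^+(lambda): an edge wherever the weight exceeds lambda."""
    return (W > lam).astype(int)
```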

Figure 1: Top: the sparse cross-correlations are estimated using thresholding rule (5) with two different sparse parameters $\lambda$. The edge weights shrunk to zero are not shown. Bottom: the equivalent binary graph can be obtained by simply thresholding the sample cross-correlations at $\lambda$.

Consider the weighted graph $\mathcal{G} = (V, \widehat{\beta}(\lambda))$ with sparse parameter $\lambda$. Once we obtain the binary graph $\mathcal{G}^0(\lambda)$ out of $\mathcal{G}$, we can use existing graph theoretic approaches or persistent homology (Carlsson and Memoli, 2008; Edelsbrunner and Harer, 2008; Singh et al., 2008; Ghrist, 2008; Lee et al., 2012; Chung et al., 2013) in analyzing the topological properties of the network. Note there are infinitely many binary graphs $\mathcal{G}^0(\lambda)$ corresponding to every possible $\lambda$. To avoid the problem of determining one parameter $\lambda$ that may not be universally optimal, we propose to analyze the collection of graphs $\{\mathcal{G}^0(\lambda) : \lambda \in \mathbb{R}^+\}$. At first glance, it seems we may have to deal with infinitely many binary graphs. However, we do not really need to analyze infinitely many graphs. To prove this, we need a new concept.

Definition 2. For graphs $\mathcal{G}_1 = (V, W)$, $W = (w_{ij})$ and $\mathcal{G}_2 = (V, Z)$, $Z = (z_{ij})$, $\mathcal{G}_1$ is a subset of $\mathcal{G}_2$, i.e., $\mathcal{G}_1 \subset \mathcal{G}_2$, if $w_{ij} \leq z_{ij}$ for all $i, j$. The collection of nested subgraphs $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \cdots \subset \mathcal{G}_k$ is called a graph filtration, and $k$ is the level of the filtration.

Our definition extends the concept of graph filtration in (Lee et al., 2012; Chung et al., 2013), where only undirected graphs were considered, to directed graphs as well. Graph filtrations are a special case of the Rips filtrations often studied in persistent homology (Carlsson and Memoli, 2008; Edelsbrunner and Harer, 2008; Singh et al., 2008; Ghrist, 2008; Lee et al., 2012). From Definitions 1 and 2, for an arbitrary graph $\mathcal{G} = (V, W)$, we have

$$\mathcal{G}^+(\lambda_1) \supset \mathcal{G}^+(\lambda_2) \supset \mathcal{G}^+(\lambda_3) \supset \cdots \quad (8)$$

for any $\lambda_1 \leq \lambda_2 \leq \lambda_3 \leq \cdots$. We will linearly embed the collection of sparse graphs $\{\mathcal{G}^0(\lambda) : \lambda \in \mathbb{R}^+\}$ in the form (8).

Theorem 2 (Soft-Thresholding). Let $\mathcal{G}^0(\lambda)$ be the binary graph of $\mathcal{G}(\lambda) = (V, \widehat{\beta}(\lambda))$ with sparse cross-correlations $\widehat{\beta}(\lambda)$. Let $\rho_{ij} = |\mathbf{x}(v_i)'\mathbf{y}(v_j)|$ be the absolute value of the sample cross-correlations. Let $\mathcal{H}^+(\lambda)$ be the binary graph of $\mathcal{H} = (V, \rho)$. Let

$$0 = \rho_{(0)} \leq \rho_{(1)} = \min_{i,j} \rho_{ij} \leq \cdots \leq \rho_{(q)} = \max_{i,j} \rho_{ij}$$

be the order statistics of $\rho_{ij}$. Then we have the following.
(1) $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\lambda)$.
(2) The collection of sparse graphs $\{\mathcal{G}^0(\lambda) : \lambda \in \mathbb{R}^+\}$ forms a graph filtration.
(3) For any $\lambda \geq 0$, $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\rho_{(i)})$ for some $i$.

Proof. (1) The adjacency matrix $A = (a_{ij})$ of $\mathcal{G}^0(\lambda)$ is given by $a_{ij} = 1$ if $\widehat{\beta}_{ij}(\lambda) \neq 0$ and $a_{ij} = 0$ otherwise. From Theorem 1, the adjacency matrix $A$ is equivalent to the adjacency matrix $B = (b_{ij})$ of $\mathcal{H}^+(\lambda)$:

$$b_{ij}(\lambda) = \begin{cases} 1 & \text{if } \rho_{ij} > \lambda \\ 0 & \text{otherwise} \end{cases}. \quad (9)$$

Thus $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\lambda)$.
(2) For all $0 \leq \lambda_1 \leq \lambda_2 \leq \cdots$, $b_{ij}(\lambda_1) \geq b_{ij}(\lambda_2) \geq \cdots$ for all $i, j$. Hence, $\mathcal{H}^+(\lambda_1) \supset \mathcal{H}^+(\lambda_2) \supset \cdots$ and, equivalently, $\mathcal{G}^0(\lambda_1) \supset \mathcal{G}^0(\lambda_2) \supset \cdots$. So the collection of $\mathcal{G}^0(\lambda_i)$ forms a graph filtration.
(3) If $\lambda < \rho_{(1)}$, $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\rho_{(0)})$. If $\rho_{(i)} \leq \lambda < \rho_{(i+1)}$, $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\rho_{(i)})$ for $1 \leq i \leq q - 1$. For $\rho_{(q)} \leq \lambda$, $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\rho_{(q)})$, the node set. Thus $\mathcal{G}^0(\lambda) = \mathcal{H}^+(\rho_{(i)})$ for any $\lambda \geq 0$ for some $i$. □

Theorem 2-(1) enables us to construct large-scale sparse networks by a simple thresholding rule. This completely bypasses the numerical optimizations that have been the main computational bottleneck in the field. The 4-node example shown in Figure 1 illustrates the equivalent soft-thresholding rule in Theorem 2-(1). Theorem 2-(2) and 2-(3) further show that we can monotonically map the collection of graphs $\{\mathcal{G}^0(\lambda) : \lambda \in \mathbb{R}^+\}$ onto the finite graph filtration

$$\mathcal{H}^+(\rho_{(0)}) \supset \mathcal{H}^+(\rho_{(1)}) \supset \cdots \supset \mathcal{H}^+(\rho_{(q)}). \quad (10)$$

For a graph with $p$ nodes, the maximum possible number of edges is $(p^2 - p)/2$, which is attained by the complete graph. Thus, $(p^2 - p)/2 + 1$ is the upper limit on the level of the filtration in (10).
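As a quick numerical check of Theorem 2-(1), the support of the soft-thresholded estimate coincides with directly thresholding the absolute sample cross-correlations; a sketch reusing X, Y, and the hypothetical helpers from the earlier sketches:

```python
import numpy as np

# Continuation of the earlier sketches: X, Y, sparse_cross_correlation,
# binarize_nonzero, and binarize_threshold are assumed to be in scope.
lam = 0.3
beta = sparse_cross_correlation(X, Y, lam)   # Theorem 1 estimate
G0 = binarize_nonzero(beta)                  # binary graph G^0(lambda)
H = np.abs(X.T @ Y)                          # rho_ij = |x'(v_i) y(v_j)|
Hplus = binarize_threshold(H, lam)           # binary graph H^+(lambda)
assert (G0 == Hplus).all()                   # Theorem 2-(1): G^0(lambda) = H^+(lambda)
```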

2.3 Statistical Inference

The quantification of a graph filtration is done using existing graph theoretic functions that provide a scalar feature per graph. Unfortunately, for large $p$, the maximum possible number of filtration levels is still too large for our study. With $p = 25972$ nodes, dealing with the collection of $(p^2 - p)/2 + 1 = 337259407$ binary graphs would cause a serious computational bottleneck for most graph theoretic functions. However, for monotonic functions such as the number of disjoint clusters (disconnected components) or the size of the largest cluster (in terms of the number of nodes), it is possible to further reduce the number of filtration levels to exactly $p - 1$ using the minimum spanning tree (MST) of the underlying graph, as follows.

Theorem 3. Let $\mathcal{M}$ be the minimum spanning tree (MST) of a weighted graph $\mathcal{G} = (V, W)$ with nonnegative edge weights $W$. Suppose the edge weights of $\mathcal{M}$ are $r_1 \leq r_2 \leq \cdots \leq r_{p-1}$. Also let $r_0 = 0$ and let $\#\mathcal{G}^+(\lambda)$ be the number of disjoint clusters of $\mathcal{G}^+(\lambda)$. Then for any $0 \leq \lambda_1 \leq \lambda_2$, $\#\mathcal{G}^+(\lambda_1) = \#\mathcal{M}^+(r_i) \leq \#\mathcal{G}^+(\lambda_2)$ for some $i$.

Proof. For a tree with $p$ nodes, there are $p - 1$ edges $r_1 \leq r_2 \leq \cdots \leq r_{p-1}$. No edge weight is above $r_{p-1}$; thus, when thresholded at $r_{p-1}$, all the edges are removed and we end up with the node set $V$. Thus, $\#\mathcal{M}^+(r_{p-1}) = p$. Since all edges are above 0, $\#\mathcal{M}^+(r_0) = 1$. In the tree, each time we delete an edge, the number of disjoint clusters increases by one. Therefore, we have $\#\mathcal{M}^+(r_i) = i + 1$. For any graph $\mathcal{G}$ with $p$ nodes and any $\lambda$, $1 \leq \#\mathcal{G}^+(\lambda) \leq p$. Thus, the ranges of $\#\mathcal{G}^+(\lambda_1)$ and $\#\mathcal{M}^+(r_i)$ match, and there exists some $r_i$ satisfying $\#\mathcal{G}^+(\lambda_1) = \#\mathcal{M}^+(r_i)$. Since $\mathcal{G}^+(\lambda)$ forms a graph filtration, $\mathcal{G}^+(\lambda_1) \supset \mathcal{G}^+(\lambda_2)$ and subsequently $\#\mathcal{G}^+(\lambda_1) \leq \#\mathcal{G}^+(\lambda_2)$. □

Figure 2: The result of linearly embedding the sparse graph filtrations in our data. The number of disjoint clusters (top) and the size of the largest cluster (bottom) are plotted over the cross-correlation values. At a fixed correlation, MZ-twins have a smaller number of clusters but a larger cluster size, indicating that many higher-correlation edges are still present within larger clusters.

From Theorems 2 and 3, we can monotonically map up to $q = (p^2 - p)/2$ ordered function values

$$\#\mathcal{G}^0(\rho_{(1)}) \leq \#\mathcal{G}^0(\rho_{(2)}) \leq \cdots \leq \#\mathcal{G}^0(\rho_{(q)})$$

onto the $p$ ordered function values

$$\#\mathcal{M}^+(r_0) \leq \#\mathcal{M}^+(r_1) \leq \cdots \leq \#\mathcal{M}^+(r_{p-1}). \quad (11)$$

Theorem 3 can be similarly applied to other monotonic graph functions such as the size of the largest cluster. Let $\&\mathcal{G}$ denote the size of the largest cluster of graph $\mathcal{G}$, in terms of the number of nodes belonging to the cluster. Then we can also show that

$$\&\mathcal{G}^+(\rho_{(1)}) \geq \&\mathcal{G}^+(\rho_{(2)}) \geq \cdots \geq \&\mathcal{G}^+(\rho_{(q)})$$

can be monotonically mapped to

$$\&\mathcal{M}^+(r_0) \geq \&\mathcal{M}^+(r_1) \geq \cdots \geq \&\mathcal{M}^+(r_{p-1}). \quad (12)$$
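A sketch of computing the monotonic feature vectors (11) and (12) with SciPy. One caveat on conventions: scipy's minimum_spanning_tree minimizes total weight, so we run it on the distance-like weights $1 - \rho$ (our assumption about how to realize the MST of Theorem 3 for correlation-valued weights) and map the surviving edge weights back to the $\rho$ scale:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

def monotonic_features(rho):
    """Evaluate the feature vectors (11) and (12) at the MST thresholds of Theorem 3.

    rho : (p, p) symmetric array of absolute cross-correlations.
    Returns the thresholds r_0 <= r_1 <= ... <= r_{p-1} and, at each one, the
    number of disjoint clusters and the size of the largest cluster of H^+(r_i).
    """
    # scipy minimizes total weight, so build the MST on 1 - rho and map back;
    # the p - 1 surviving edge weights give the critical thresholds.
    mst = minimum_spanning_tree(csr_matrix(1.0 - rho))
    thresholds = np.concatenate(([0.0], np.sort(1.0 - mst.data)))
    n_clusters, largest = [], []
    for lam in thresholds:
        adj = csr_matrix((rho > lam).astype(int))
        k, labels = connected_components(adj, directed=False)
        n_clusters.append(k)                       # feature vector (11)
        largest.append(np.bincount(labels).max())  # feature vector (12)
    return thresholds, np.array(n_clusters), np.array(largest)
```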

The monotonic feature vectors (11) and (12) are then used in quantifying sparse networks. Figure 2 displays the plot of the number of clusters and the size of the largest cluster over cross-correlations for the 25972-node whole brain networks in our twin fMRI data. There are clear group discriminations in the pattern of the monotonic feature vectors. Let $r_{i1} \leq r_{i2} \leq \cdots \leq r_{i\,p-1}$ be the ordered edge weights of the MST of graph $\mathcal{G}_i$. We are then interested in testing the null hypothesis of the equivalence of the two sets of monotonic feature vectors:

$$H_0 : (\#\mathcal{M}^+(r_{11}), \#\mathcal{M}^+(r_{12}), \cdots, \#\mathcal{M}^+(r_{1\,p-1})) = (\#\mathcal{M}^+(r_{21}), \#\mathcal{M}^+(r_{22}), \cdots, \#\mathcal{M}^+(r_{2\,p-1})).$$

As a test statistic, we propose to use

$$D_p = \max_{1 \leq i \leq p-1} |\#\mathcal{M}^+(r_{1i}) - \#\mathcal{M}^+(r_{2i})|. \quad (13)$$

$D_p$ can be viewed as a distance between the two graph filtrations $\{\mathcal{G}_1^+(\lambda) : \lambda \in \mathbb{R}\}$ and $\{\mathcal{G}_2^+(\lambda) : \lambda \in \mathbb{R}\}$, i.e.,

$$D_p = D_p\big(\{\mathcal{G}_1^+(\lambda) : \lambda \in \mathbb{R}\}, \{\mathcal{G}_2^+(\lambda) : \lambda \in \mathbb{R}\}\big).$$

The test statistic $D_p$ is related to the two-sample Kolmogorov-Smirnov (KS) test statistic (Chung et al., 2013; Gibbons and Chakraborti, 2011; Böhm and Hornik, 2010), which is a nonparametric test for determining the equivalence of two cumulative distribution functions. The asymptotic probability distribution of $D_p$ can be determined.

Theorem 4.

$$\lim_{p \to \infty} P\left(D_p / \sqrt{2(p-1)} \leq d\right) = 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2i^2 d^2}.$$

Proof. Combine the two samples and arrange them in increasing order. Represent $\#\mathcal{M}^+(r_{1i})$ and $\#\mathcal{M}^+(r_{2i})$ as $x$ and $y$, respectively. Then the combined sample can be represented as, for example, $xxyxyx\cdots$. Treat the sequence as a walk on a Cartesian grid, where $x$ indicates one step to the right and $y$ indicates one step up. Then from grid point $(0, 0)$ to $(p-1, p-1)$, there are $\binom{2p-2}{p-1}$ possible paths in total. Following closely the argument given on pages 192-194 of (Gibbons and Chakraborti, 2011) and in (Böhm and Hornik, 2010), we have

$$P(D_p/(p-1) \geq d) = 1 - \frac{A_{p-1,p-1}}{\binom{2p-2}{p-1}}, \quad (14)$$

where $A_{p-1,p-1} = A_{p-1,p-2} + A_{p-2,p-1}$ with boundary conditions $A_{0,p-1} = A_{p-1,0} = 1$. Asymptotically, (14) can be written as (Smirnov, 1939)

$$\lim_{p \to \infty} P\left(D_p / \sqrt{2(p-1)} \leq d\right) = 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2i^2 d^2}. \quad \square$$

Note Theorem 4 gives a nonparametric test procedure and does not assume any statistical distribution on the feature vectors, except that they are required to be monotonic functions in the range between 1 and $p$. Theorem 4 is then used to compute the p-value for testing $H_0$ in our study. Given the observed value $d_o$ of $d$, for large $p$, the p-value is given by

$$\text{p-value} = 2e^{-2d_o^2} - 2e^{-8d_o^2} + \cdots.$$

For any large observed value $d_o \geq 2$, the second term is on the order of $10^{-14}$ and insignificant.
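The test statistic (13) and the asymptotic p-value of Theorem 4 then follow directly; a sketch (the 50-term series truncation is our choice, and it is accurate for moderately large $d_o$):

```python
import numpy as np

def filtration_distance_test(feat1, feat2):
    """D_p of (13) and its asymptotic p-value from Theorem 4.

    feat1, feat2 : the two groups' monotonic feature vectors of length p - 1,
    e.g. #M^+(r_1), ..., #M^+(r_{p-1}) from monotonic_features above.
    """
    p = len(feat1) + 1
    d_p = np.max(np.abs(np.asarray(feat1) - np.asarray(feat2)))
    d_o = d_p / np.sqrt(2.0 * (p - 1))             # normalized observed value
    i = np.arange(1, 51)                           # truncated alternating series
    p_value = 2.0 * np.sum((-1.0) ** (i - 1) * np.exp(-2.0 * i**2 * d_o**2))
    return d_p, d_o, p_value
```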

Figure 3: Cross-correlations for MZ- and DZ-twins for all 25972 voxels in the brain template. There are more than 337 million cross-correlations among voxels. If we threshold at correlation 0.7, we still have more than 3 million connections, which are difficult to visualize and to interpret biologically. It is therefore necessary to reduce the number of edges with sparse network models.

2.4 Symmetric Cross-Correlations

The sparse cross-correlations $\widehat{\beta} = (\widehat{\beta}_{ij})$ are asymmetric and induce directed binary graphs. Suppose there is no preference between the data vectors $\mathbf{x}$ and $\mathbf{y}$ and they are interchangeable. Our twin data is one such interchangeable case, since there is no preference of one twin over the other; note that many paired datasets in the sciences are interchangeable. Then we have another linear model, where $\mathbf{x}$ and $\mathbf{y}$ are interchanged in (1):

$$\mathbf{x}(v_j) = c_{ij}\, \mathbf{y}(v_i) + \boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon}$ is a zero-mean error vector whose components are independent and identically distributed. The LSE of $c_{ij}$ is $\widehat{c}_{ij} = \mathbf{y}'(v_i)\mathbf{x}(v_j)$. The symmetric version $\zeta = (\zeta_{ij})$ of the (sample) cross-correlation is then given by

$$2\zeta_{ij} = \widehat{b}_{ij} + \widehat{c}_{ij} = \mathbf{x}'(v_i)\mathbf{y}(v_j) + \mathbf{y}'(v_i)\mathbf{x}(v_j).$$

The symmetric version $\eta = (\eta_{ij})$ of the sparse cross-correlation is similarly obtained as

$$2\eta = \arg\min_{\beta} F(\beta; \mathbf{x}, \mathbf{y}) + \arg\min_{\gamma} F(\gamma; \mathbf{y}, \mathbf{x}).$$

Figure 4: Representative MZ- (left) and DZ-twin (right) pairs of T-statistic maps obtained from fitting a general linear model testing the significance of the delay in hitting the response button for the $5 reward in contrast to the $0 reward. Bottom: the correlation between twin pairs. The cross-hairs are at slices x = 27, y = 31, z = 23.

For our study, each symmetrized cross-correlation matrix requires computing 1.3 billion entries and 5.2 GB of computer memory (Figure 3). Since the cross-correlation matrices are very dense, it is difficult to provide a biological interpretation or an interpretable visualization. That is the main biological reason we developed the proposed sparse model to reduce the number of connections. Note that all the theorems and methods presented here are applicable to symmetrized cross-correlations as well. To see this, define the sum of graphs with an identical node set $V$ as follows:

Definition 3. For graphs $\mathcal{G}_1 = (V, W)$ and $\mathcal{G}_2 = (V, Z)$, $\mathcal{G}_1 + \mathcal{G}_2 = (V, W + Z)$.

With the above sum of graphs, graph filtrations are additive in the following sense. Given two graph filtrations $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \cdots \subset \mathcal{G}_k$ and $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_k$,

$$\mathcal{G}_1 + \mathcal{H}_1 \subset \mathcal{G}_2 + \mathcal{H}_2 \subset \cdots \subset \mathcal{G}_k + \mathcal{H}_k.$$

This is a direct consequence of Definition 2. Thus the sum of any two graph filtrations is again a graph filtration. Using the two sparse cross-correlation estimates $\widehat{\beta}$ and $\widehat{\gamma}$, construct the weighted graphs $\mathcal{G}_1(\lambda) = (V, \widehat{\beta}(\lambda))$ and $\mathcal{G}_2(\lambda) = (V, \widehat{\gamma}(\lambda))$, and induce the binary graphs $\mathcal{G}_1^0(\lambda)$ and $\mathcal{G}_2^0(\lambda)$. Using the sample cross-correlations $X'Y$ and $Y'X$, we can also define the weighted graphs $\mathcal{H}_1 = (V, X'Y)$ and $\mathcal{H}_2 = (V, Y'X)$, and induce the binary graphs $\mathcal{H}_1^+(\lambda)$ and $\mathcal{H}_2^+(\lambda)$. Subsequently, we have

$$\mathcal{G}_1^0(\lambda) + \mathcal{G}_2^0(\lambda) = \mathcal{H}_1^+(\lambda) + \mathcal{H}_2^+(\lambda) \quad (15)$$

and (15) forms a graph filtration over $\lambda$. Thus, Theorem 2 can be extended to the sum of graphs $\mathcal{G}_1^0(\lambda) + \mathcal{G}_2^0(\lambda)$ as well.
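For interchangeable paired data, the symmetrization is just the average of the two directed estimates; a sketch reusing the hypothetical sparse_cross_correlation helper from above:

```python
def symmetric_sparse_cross_correlation(X, Y, lam):
    """Symmetrized sparse cross-correlation: 2*eta = argmin F(beta; x, y) + argmin F(gamma; y, x)."""
    beta_hat = sparse_cross_correlation(X, Y, lam)    # soft-thresholded X'Y
    gamma_hat = sparse_cross_correlation(Y, X, lam)   # soft-thresholded Y'X
    return 0.5 * (beta_hat + gamma_hat)               # symmetric; induces an undirected graph
```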

3 Application

3.1 Twin fMRI Dataset

The study consists of 11 monozygotic (MZ) and 9 same-sex dizygotic (DZ) twin pairs scanned with 3T functional magnetic resonance imaging (fMRI). Subjects went through a reward experiment of 3 runs of 40 trials. The 120 total trials consist of 40 $0, 40 $1, and 40 $5 rewards, which are randomly split into the 3 runs. Each trial has the same order of events for each subject. The aim of the task is to hit a target on time in order to make money. The target appears very briefly. First there is a cue indicating the amount of money at stake for that specific trial; then there is a delay period between the cue and the target in which participants wait for the target to appear. Then a white star (the target) flashes rapidly on the screen. If the participants hit the response button while the star is on the screen, they succeed in winning the money; if they hit too late, they do not win the money. After the response, a feedback slide is shown indicating to the participants whether they hit the target on time and the amount of money they made for that specific trial.

The fMRI dimension is 53 × 63 × 46 and the voxel size is a 3 mm cube. There are in total $p = 25972$ voxels in the brain template where fMRI measurements are obtained. Figure 4 shows the outline of the template.

Figure 5: Node colors are correlations of MZ- (left) and DZ-twins (middle). The edge colors are symmetrized cross-correlations between nodes, thresholded at 0.5. Right: node values are heritability indices (HI); edge values are heritability graph indices (HGI), thresholded at 0.5. MZ-twins show higher correlations at both the voxel and network levels compared to DZ-twins. Some low-correlation nodes have high cross-correlations; using the HI feature alone fails to detect such subtle network differences. Each voxel in the template is visualized as a network node.

The fMRI data went through spatial normalization to a template following the standard SPM pipeline (Penny et al., 2011). Images went through Gaussian kernel smoothing with 8 mm full-width-at-half-maximum (FWHM) spatially. Biologically, we are interested in knowing the extent of the genetic influence on the functional brain networks of these participants while anticipating a $5 reward, as measured by delays in hitting the response button. After fitting a general linear model at each voxel (Friston et al., 1995; Penny et al., 2011), we obtain the T-statistic map testing the significance of the delay for $5 trials relative to the baseline (Figure 4). These T-statistic maps are then used in computing the cross-correlations in this study.

3.2 Heritability Graph Index

As a way of quantifying the genetic contribution to phenotypes, the heritability index (HI) has been extensively used in neuroscience. HI determines the amount of variation (in terms of percentage) due to genetic influence and is usually estimated using Falconer's formula (Falconer and Mackay, 1995). At each fixed voxel $v_i$, HI is given by

$$\mathrm{HI}_i = 2[\rho_{MZ}(v_i) - \rho_{DZ}(v_i)],$$

where $\rho_{MZ}$ and $\rho_{DZ}$ are the pairwise correlations within MZ- and same-sex DZ-twin pairs. HI is a univariate feature measured at each voxel and does not quantify whether the change in one voxel is related to other voxels. We can extend the concept of HI to the network level by defining the heritability graph index (HGI):

$$\mathrm{HGI}_{ij} = 2[\varrho_{MZ}(v_i, v_j) - \varrho_{DZ}(v_i, v_j)],$$

where $\varrho_{MZ}$ and $\varrho_{DZ}$ are the symmetrized cross-correlations between voxels $v_i$ and $v_j$ for MZ- and DZ-twin pairs. Note that $\varrho_{MZ}(v_i, v_i) = \rho_{MZ}(v_i)$ and $\varrho_{DZ}(v_i, v_i) = \rho_{DZ}(v_i)$, so the diagonal entries of HGI are HI, i.e., $\mathrm{HGI}_{ii} = \mathrm{HI}_i$. HGI measures the genetic contribution, in terms of percentage, at the network level while generalizing the concept of HI. In Figure 5 (left and middle), the node values are the correlations $\rho_{MZ}$ and $\rho_{DZ}$ while the edge values are the cross-correlations $\varrho_{MZ}$ and $\varrho_{DZ}$. In Figure 5 (right), the node values are HI while the edge values are the off-diagonal entries of HGI. In Figure 5, the networks thresholded at 0.5 are shown. For attaching statistical significance to HGI, we computed the statistic $D_p$ and used Theorem 4. For our data, the observed $d_o$ is 2.40 for the number of clusters and 23.12 for the size of the largest cluster. The p-values are respectively less than 0.00002 and 0.00001, indicating very strong significance of HGI.
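A sketch of computing HI and HGI from the twin-paired data; the data layout (rows index twin pairs, columns index voxels, with one matrix per twin within each pair) and the function names are our assumptions:

```python
import numpy as np

def heritability_graph_index(X_mz, Y_mz, X_dz, Y_dz, lam=0.0):
    """HGI_ij = 2 * [rho_MZ(v_i, v_j) - rho_DZ(v_i, v_j)]; its diagonal is Falconer's HI.

    X_*, Y_* : (n_pairs, p) matrices holding the first and second twin of each
    pair; every column is assumed centered and scaled to unit norm.
    With lam > 0, the symmetrized cross-correlations are soft-thresholded first.
    """
    varrho_mz = symmetric_sparse_cross_correlation(X_mz, Y_mz, lam)
    varrho_dz = symmetric_sparse_cross_correlation(X_dz, Y_dz, lam)
    hgi = 2.0 * (varrho_mz - varrho_dz)
    hi = np.diag(hgi)              # HGI_ii = HI_i
    return hgi, hi
```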

4 Conclusion

In this paper, we explored large-scale sparse cross-correlation networks to address the important problem of determining the genetic contribution to functional brain networks. We presented a unified mathematical framework integrating network model building, parameter estimation, and statistical inference. Although the method is applied to twin fMRI data here, it can be applied to other, more general problems of correlating multimodal data, such as correlating fMRI to PET. We believe the theoretical framework we developed provides motivation for future work in various fields dealing with paired data.

Acknowledgment

This work was supported by NIH Research Grant 5 R01 MH098098 04. We thank Yuan Wang at the University of Wisconsin-Madison, Zhiwei Ma at Tsinghua University, and Matthew Arnold at the University of Bristol for valuable discussions about Theorem 4.


References

S. Achard, R. Salvador, B. Whitcher, J. Suckling, and E.D. Bullmore. A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. The Journal of Neuroscience, 26:63–72, 2006.

B.B. Avants, P.A. Cook, L. Ungar, J.C. Gee, and M. Grossman. Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis. NeuroImage, 50:1004–1016, 2010.

O. Banerjee, L.E. Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, page 96, 2006.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.

P.J. Bickel and E. Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36:199–227, 2008.

W. Böhm and K. Hornik. A Kolmogorov-Smirnov test for r samples. Institute for Statistics and Mathematics, Research Report Series, Report 105, 2010.

G. Carlsson and F. Memoli. Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:0808.2241, 2008.

M.K. Chung, J.L. Hanson, H. Lee, N. Adluru, A.L. Alexander, R.J. Davidson, and S.D. Pollak. Persistent homological sparse network approach to detecting white matter abnormality in maltreated children: MRI and DTI multimodal study. MICCAI, Lecture Notes in Computer Science (LNCS), 8149:300–307, 2013.

D.L. Donoho, I.M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B (Methodological), 57:301–369, 1995.

H. Edelsbrunner and J. Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.

D. Falconer and T. Mackay. Introduction to Quantitative Genetics, 4th ed. Longman, 1995.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432, 2008.

K.J. Friston, A.P. Holmes, K.J. Worsley, J.-P. Poline, C.D. Frith, and R.S.J. Frackowiak. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping, 2:189–210, 1995.

R. Ghrist. Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45:61–75, 2008.

J.D. Gibbons and S. Chakraborti. Nonparametric Statistical Inference. Chapman & Hall/CRC Press, 2011.

Y. He, Z. Chen, and A. Evans. Structural insights into aberrant topological patterns of large-scale cortical networks in Alzheimer's disease. Journal of Neuroscience, 28:4756, 2008.

S. Huang, J. Li, L. Sun, J. Liu, T. Wu, K. Chen, A. Fleisher, E. Reiman, and J. Ye. Learning brain connectivity of Alzheimer's disease from neuroimaging data. In Advances in Neural Information Processing Systems, pages 808–816, 2009.

H. Lee, H. Kang, M.K. Chung, B.-N. Kim, and D.S. Lee. Persistent brain network homology from the perspective of dendrogram. IEEE Transactions on Medical Imaging, 31:2267–2277, 2012.

R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical LASSO. The Journal of Machine Learning Research, 13:781–794, 2012.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104:735–746, 2009.

W.D. Penny, K.J. Friston, J.T. Ashburner, S.J. Kiebel, and T.E. Nichols. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press, 2011.

J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32, 2005.

G. Singh, F. Memoli, T. Ishkhanov, G. Sapiro, G. Carlsson, and D.L. Ringach. Topological analysis of population activity in visual cortex. Journal of Vision, 8:1–18, 2008.

N.V. Smirnov. Estimate of deviation between empirical distribution functions in two independent samples. Bulletin of Moscow University, 2:3–16, 1939.

K. Supekar, V. Menon, D. Rubin, M. Musen, and M.D. Greicius. Network analysis of intrinsic functional brain connectivity in Alzheimer's disease. PLoS Computational Biology, 4(6):e1000100, 2008.

R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.

P.A. Valdés-Sosa, J.M. Sánchez-Bornot, A. Lage-Castellanos, M. Vega-Hernández, J. Bosch-Bayard, L. Melie-García, and E. Canales-Rodríguez. Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions of the Royal Society B: Biological Sciences, 360:969–981, 2005.

D.M. Witten, J.H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20:892–900, 2011.

K.J. Worsley, A. Charil, J. Lerch, and A.C. Evans. Connectivity of anatomical and functional MRI data. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), volume 3, pages 1534–1541, 2005a.

K.J. Worsley, J.I. Chen, J. Lerch, and A.C. Evans. Comparing functional connectivity via thresholding correlations and singular value decomposition. Philosophical Transactions of the Royal Society B: Biological Sciences, 360:913, 2005b.

B. Xin, L. Hu, Y. Wang, and W. Gao. Stable feature selection from brain sMRI. 2015.