Nonparametric Estimation of Scale-Free Graphical Models

Yuancheng Zhu and Zhe Liu
University of Chicago

Siqi Sun
Toyota Technological Institute at Chicago

arXiv:1511.03796v1 [stat.ME] 12 Nov 2015

Abstract We present a nonparametric method for estimating scale-free graphical models. To avoid the usual Gaussian assumption, we restrict the graph to be a forest and build on the work of forest density estimation. The method is motivated by a Bayesian perspective and is equivalent to finding the maximum spanning tree of a weighted graph with a log degree penalty. We solve the optimization problem via a minorize-maximization procedure with Kruskal's algorithm. Simulations show that the proposed method outperforms competing parametric methods and is robust to the true data distribution. It also leads to improvements in predictive power and interpretability in two real data examples.

1 Introduction

Graphical models are used to encode the conditional independence relationships between random variables. In particular, a random vector $X = (X_1, \dots, X_d)$ is represented by an undirected graph $G = (V, E)$ with $d$ vertices if $X_i$ and $X_j$ are conditionally independent given the other variables whenever an edge $(i, j)$ is missing in the edge set $E$. One major statistical task is to learn the graph from $n$ i.i.d. copies of the random vector. Existing methods for learning graphical models make assumptions on either the underlying distribution or the graphical structure. Currently the most popular method, the graphical lasso [5], assumes that the random vector follows a multivariate Gaussian distribution. In this way, learning the graph is equivalent to estimating the precision matrix $\Omega$, since the conditional independence structure of a Gaussian random vector is entirely determined by the sparsity pattern of $\Omega$. The graphical lasso finds a sparse estimate of $\Omega$ by maximizing the $\ell_1$-regularized log likelihood. On the other hand, we can make no distributional assumption but restrict the graph to be a forest instead. Under this structural assumption, there exists a factorization of the density function involving only the univariate and bivariate marginal densities, which makes nonparametric estimation tractable in high dimensions. In this case, estimating the graph amounts to finding the maximum spanning tree of a weighted graph; see, for example, [2, 8] for details.

A wide variety of the networks arising in graphical modeling, such as protein, gene, and social networks, are reported to be scale-free. That is, the degree distribution of the vertices follows a power law:

$p(\text{degree} = k) \propto k^{-\alpha}$ for some $\alpha > 1$. In such scale-free networks, some vertices have many more connections than others; these highest-degree vertices are usually called hubs and serve significant roles in their networks. For instance, in a gene network, a hub usually represents a gene that plays a role in many biological processes [13]. The methods described above, however, take no prior information about such graphical structure into account. They usually do not fit a scale-free graphical model well, and fail to identify the important hubs, as they implicitly assume a priori that each edge is equally likely.

To utilize the prior belief that an underlying network may be scale-free, or more generally, contains some dominating hubs, several approaches have been proposed, including [3, 9, 11, 12]. Nevertheless, to the best of our knowledge, all the existing methods assume normality of the data distribution. In particular, they try to maximize the Gaussian likelihood as a function of the precision matrix, with some specifically designed penalty functions to encourage the scale-free feature in the estimated graph. Such distributional assumptions can be quite unrealistic and unnecessary in many applications. Even though the marginal distribution of each variable can be transformed to be approximately Gaussian, which allows arbitrary univariate distributions, the joint dependence is still restricted under the Gaussian assumption.

In this paper, we relax such distributional assumptions and estimate scale-free graphs using a nonparametric method. We build on the forest density estimation (FDE) method introduced in [8]. When the graph is known to be scale-free, it is likely that the graph structure is close to a forest. In fact, if a graph is generated by the Barabási and Albert model [1], a standard generative model for scale-free networks, it is indeed a tree. Thus, a forest structure can be viewed as a reasonable assumption or approximation. Since the FDE approach can be viewed as maximizing a log likelihood as a function of the forest, we can incorporate prior information to favor scale-free graphs. Motivated by such a Bayesian perspective, we formulate the problem of finding the maximum spanning tree of a weighted graph with a penalty term on the logarithm of the node degrees. We devise an algorithm based on a minorize-maximization procedure and Kruskal's algorithm [7] to find a locally optimal solution.

The rest of the paper is organized as follows. In the following section, we give background on forest density estimation and review some related work. We motivate our method for estimating scale-free graphical models in Section 3 and introduce our estimation algorithm. Experiments on synthetic data sets and real applications are presented in Section 4, followed by a conclusion in Section 5.

2 Background and related work

2.1 Forest density estimation

We say an undirected graph is a forest if it is acyclic. Let $F = (V_F, E_F)$ be a forest with vertex set $V_F = \{1, \dots, d\}$ and edge set $E_F \subseteq V_F \times V_F$. Let $X = (X_1, \dots, X_d)$ be a $d$-dimensional random vector with density function $p(x) > 0$. We say that $X$, or equivalently its density $p$, is Markov to $F$ if $X_i$ and $X_j$ are conditionally independent given the other random variables whenever edge $(i, j)$ is missing in $E_F$. A density $p$ that is Markov to $F$ has the following factorization:

$$p(x) = \prod_{(i,j)\in E_F} \frac{p_{ij}(x_i, x_j)}{p_i(x_i)\,p_j(x_j)} \prod_{k\in V_F} p_k(x_k), \tag{1}$$

where each $p_{ij}(x_i, x_j)$ is a bivariate marginal density and each $p_k(x_k)$ is a univariate marginal density.

Algorithm 1 Kruskal's (Chow-Liu) algorithm
Input: weight matrix $W = (w_{ij})_{d\times d}$
Initialize: $E^{(0)} \leftarrow \emptyset$
for $k = 1, \dots, d-1$ do
    $(i^{(k)}, j^{(k)}) \leftarrow \arg\max_{(i,j)} w_{ij}$ such that $E^{(k-1)} \cup \{(i^{(k)}, j^{(k)})\}$ does not contain a cycle
    $E^{(k)} \leftarrow E^{(k-1)} \cup \{(i^{(k)}, j^{(k)})\}$
end for
Output: the final edge set $E^{(d-1)}$
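For concreteness, here is a minimal Python sketch of Algorithm 1; the function name, the union-find cycle check, and the dense weight-matrix input format are our own implementation choices, not anything specified in the paper:

```python
import numpy as np

def kruskal_max_spanning_tree(W):
    """Return the edge set of a maximum spanning tree of the complete
    graph whose symmetric weight matrix is W (d x d)."""
    d = W.shape[0]
    parent = list(range(d))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Consider all candidate edges in order of decreasing weight.
    edges = sorted(((W[i, j], i, j) for i in range(d) for j in range(i + 1, d)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == d - 1:  # spanning tree complete
                break
    return tree
```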

With this factorization, we can write the expected log likelihood as

$$\mathbb{E}[\log p(X)] = \int p(x) \Bigg[ \sum_{(i,j)\in E_F} \log \frac{p_{ij}(x_i, x_j)}{p_i(x_i)\,p_j(x_j)} + \sum_{k\in V_F} \log p_k(x_k) \Bigg] dx = \sum_{(i,j)\in E_F} I(X_i; X_j) - \sum_{k\in V_F} H(X_k), \tag{2}$$

where

$$I(X_i; X_j) = \int p_{ij}(x_i, x_j) \log \frac{p_{ij}(x_i, x_j)}{p_i(x_i)\,p_j(x_j)}\, dx_i\, dx_j \quad\text{and}\quad H(X_k) = -\int p_k(x_k) \log p_k(x_k)\, dx_k$$

are the mutual information between the variables $X_i$ and $X_j$, and the entropy of $X_k$, respectively. We maximize the right-hand side of (2) to find the optimal $F^*$:

$$F^* = \arg\max_{F\in\mathcal{F}_d} \sum_{(i,j)\in E_F} I(X_i; X_j), \tag{3}$$

where $\mathcal{F}_d$ is the collection of spanning trees on vertices $\{1, \dots, d\}$. We let $\mathcal{F}_d$ contain only spanning trees because there is always a spanning tree that solves problem (2). This problem can be recast as finding the maximum spanning tree of a weighted graph, where the weight $w_{ij}$ of the edge between nodes $i$ and $j$ is $I(X_i; X_j)$. Kruskal's algorithm [7] is a greedy algorithm that is guaranteed to find an optimal solution; Chow and Liu [2] propose the same procedure in the setting of density estimation. The method is described in Algorithm 1.

However, this procedure is not practical since the true density $p$ is unknown. Suppose instead that we have $X^{(1)}, \dots, X^{(n)}$, which are $n$ i.i.d. copies of the random vector $X$. We replace the population mutual information by the corresponding estimates, defined by

$$\hat I(X_i; X_j) = \int \hat p_{ij}(x_i, x_j) \log \frac{\hat p_{ij}(x_i, x_j)}{\hat p_i(x_i)\,\hat p_j(x_j)}\, dx_i\, dx_j, \tag{4}$$

where $\hat p_{ij}(x_i, x_j)$ and $\hat p_k(x_k)$ are the kernel estimators of the bivariate and univariate marginal densities:

$$\hat p_{ij}(x_i, x_j) = \frac{1}{n} \sum_{t=1}^n \frac{1}{h_2^2}\, K\!\left(\frac{X_i^{(t)} - x_i}{h_2}\right) K\!\left(\frac{X_j^{(t)} - x_j}{h_2}\right), \qquad \hat p_k(x_k) = \frac{1}{n} \sum_{t=1}^n \frac{1}{h_1}\, K\!\left(\frac{X_k^{(t)} - x_k}{h_1}\right),$$

with a kernel function $K$ and bandwidths $h_2$ and $h_1$. The resulting estimator of the graph becomes

$$\hat F^* = \arg\max_{F\in\mathcal{F}_d} \sum_{(i,j)\in E_F} \hat I(X_i; X_j). \tag{5}$$

A held-out set is usually used to prune the spanning tree $\hat F^*$ by stopping Algorithm 1 early when the likelihood on the held-out set is maximized. Thus we obtain a forest estimate of the graph.
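As an illustration of the plug-in estimate (4), the sketch below fits kernel density estimates and integrates numerically on a grid. The use of scipy's gaussian_kde (with its default bandwidth) and the grid resolution are our own choices, not the paper's exact estimator:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_information_kde(x, y, grid_size=50):
    """Estimate I(X; Y) from two 1-d samples of equal length."""
    p_xy = gaussian_kde(np.vstack([x, y]))        # bivariate KDE
    p_x, p_y = gaussian_kde(x), gaussian_kde(y)   # univariate KDEs

    gx = np.linspace(x.min(), x.max(), grid_size)
    gy = np.linspace(y.min(), y.max(), grid_size)
    XX, YY = np.meshgrid(gx, gy)
    pts = np.vstack([XX.ravel(), YY.ravel()])

    joint = p_xy(pts).reshape(grid_size, grid_size)   # rows: gy, cols: gx
    marg = np.outer(p_y(gy), p_x(gx))                 # product of marginals
    # Guard against zeros before taking the log ratio.
    ratio = np.where((joint > 0) & (marg > 0), joint / marg, 1.0)
    integrand = joint * np.log(ratio)
    dx, dy = gx[1] - gx[0], gy[1] - gy[0]
    return integrand.sum() * dx * dy                  # Riemann-sum integral
```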

2.2 Related work

Before proceeding to present our nonparametric method for estimating scale-free graphical models based on the FDE procedure described above, we pause to review some of the existing approaches. As mentioned, the existing methods for estimating a scale-free network or a network with hubs assume that the data come from a multivariate Gaussian distribution. Under this assumption, learning the graph is equivalent to estimating the zero pattern of the precision matrix $\Omega$. The default method for estimating a sparse $\Omega$ is the graphical lasso [5], which maximizes the $\ell_1$-penalized log likelihood:

$$\hat\Omega_{\mathrm{Glasso}} = \arg\max_{\Omega \succ 0} \Big\{ \log\det\Omega - \operatorname{tr}(S\Omega) - \lambda \lVert \Omega - \operatorname{diag}(\Omega) \rVert_1 \Big\},$$

where $S$ is the empirical covariance matrix and $\lambda$ is a tuning parameter. To encourage a power law distribution of the vertex degrees, Liu and Ihler [9] propose to replace the $\ell_1$ penalty by a power law regularization term and solve

$$\hat\Omega_{\mathrm{SFGlasso}} = \arg\max_{\Omega \succ 0} \Big\{ \log\det\Omega - \operatorname{tr}(S\Omega) - \lambda \sum_i \log\big( \lVert \Omega_{-i,i} \rVert_1 + \varepsilon_i \big) \Big\},$$

where $\lVert \Omega_{-i,i} \rVert_1 = \sum_{j=1, j\neq i}^d |\Omega_{ij}|$, and the $\varepsilon_i$'s are some small quantities. The term $\lVert \Omega_{-i,i} \rVert_1$ acts as a surrogate of the estimated degree of the $i$th node. Along the same lines, Defazio and Caetano [3] impose a convex penalty by using submodular functions and their Lovász extension. Essentially, both methods try to penalize the log degree of each node, but end up using a continuous/convex surrogate to avoid the combinatorial problems involving the degrees. Another attempt is made by Tan et al. [11] to learn networks with hubs. To account for the existence of hubs, they design the penalty term

$$P(\Omega) = \min_{V, Z:\; \Omega = V + V^T + Z} \Big\{ \lambda_1 \lVert Z - \operatorname{diag}(Z) \rVert_1 + \lambda_2 \lVert V - \operatorname{diag}(V) \rVert_1 + \lambda_3 \sum_{j=1}^d \lVert (V - \operatorname{diag}(V))_j \rVert_2 \Big\},$$

where the sparse elements of $Z$ represent edges between non-hub nodes, and the non-zero columns of $V$ correspond to hub nodes. The estimate is then obtained as

$$\hat\Omega_{\mathrm{HGlasso}} = \arg\max_{\Omega \succ 0} \Big\{ \log\det\Omega - \operatorname{tr}(S\Omega) - P(\Omega) \Big\}.$$
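For reference, the Glasso baseline is implemented in scikit-learn; below is a minimal usage sketch (the alpha value and the sparsity threshold are arbitrary illustrations, not recommendations from the paper):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def glasso_graph(X, alpha=0.1):
    """Fit the graphical lasso to an n x d data matrix X and return the
    adjacency matrix given by the off-diagonal sparsity pattern of the
    estimated precision matrix."""
    model = GraphicalLasso(alpha=alpha).fit(X)
    Omega = model.precision_
    adj = (np.abs(Omega) > 1e-8) & ~np.eye(Omega.shape[0], dtype=bool)
    return adj
```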


3 Learning scale-free networks via FDE

To avoid the Gaussian assumption in estimating a scale-free graphical model, we start from the FDE approach and incorporate prior information to improve the estimator obtained in (5), especially when the sample size is small and the estimates $\hat I(X_i; X_j)$ have relatively large statistical error. To encourage the resulting spanning tree to be scale-free, we add a penalty on the logarithm of the node degrees and solve the following problem:

$$\hat F^*_\lambda = \arg\max_{F\in\mathcal{F}_d} \Bigg\{ \sum_{(i,j)\in E_F} \hat I(X_i; X_j) - \lambda \sum_{k\in V_F} \log \delta(F, k) \Bigg\}, \tag{6}$$

where $\delta(F, k)$ is the degree of the $k$th vertex of $F$, and $\lambda$ is a tuning parameter. In fact, this optimization problem can be motivated from a Bayesian perspective.

Bayesian motivation Consider a prior distribution on the spanning trees on $d$ vertices which satisfies

$$p(F) \propto \prod_{k\in V_F} \delta(F, k)^{-\alpha}$$

for $F \in \mathcal{F}_d$ and some $\alpha > 1$. This prior distribution favors spanning trees whose degrees follow a power law distribution, and thus reflects our prior beliefs. Given the data $X^{(1:n)} = X^{(1)}, \dots, X^{(n)}$ and assuming that the density $p$ is known and Markov to the spanning tree $F$, we can write the likelihood as

$$p(X^{(1:n)} \mid F) = \prod_{t=1}^n \Bigg[ \prod_{(i,j)\in E_F} \frac{p_{ij}(X_i^{(t)}, X_j^{(t)})}{p_i(X_i^{(t)})\,p_j(X_j^{(t)})} \prod_{k\in V_F} p_k(X_k^{(t)}) \Bigg].$$

Now the posterior probability of $F$ is

$$p(F \mid X^{(1:n)}) \propto p(X^{(1:n)} \mid F)\, p(F) \propto \prod_{t=1}^n \Bigg[ \prod_{(i,j)\in E_F} \frac{p_{ij}(X_i^{(t)}, X_j^{(t)})}{p_i(X_i^{(t)})\,p_j(X_j^{(t)})} \prod_{k\in V_F} p_k(X_k^{(t)}) \Bigg] \cdot \prod_{k\in V_F} \delta(F, k)^{-\alpha}.$$

The maximum a posteriori (MAP) estimate is given by

$$\hat F_{\mathrm{map}} = \arg\max_{F\in\mathcal{F}_d} \Bigg\{ \sum_{(i,j)\in E_F} \frac{1}{n} \sum_{t=1}^n \log \frac{p_{ij}(X_i^{(t)}, X_j^{(t)})}{p_i(X_i^{(t)})\,p_j(X_j^{(t)})} - \frac{\alpha}{n} \sum_{k\in V_F} \log \delta(F, k) \Bigg\}. \tag{7}$$

We can re-parameterize by setting $\lambda = \alpha/n$. Since the true density $p$ is unknown, we propose the following two approaches to approximate the MAP estimate. First we split the data set into two disjoint groups according to the index sets $D_1$ and $D_2$, such that $|D_i| = n_i$ and $D_1 \cup D_2 = \{1, \dots, n\}$.

Option 1. Use (4) to estimate the mutual information based on the data $\{X^{(t)} : t \in D_1\}$ and obtain $\hat F^*_\lambda$ in (6) as an approximation of $\hat F_{\mathrm{map}}$. In fact, $\hat F^*_\lambda$ is obtained by replacing the true marginal densities and the empirical distributions in (7) by their corresponding density estimates. The subset $\{X^{(t)} : t \in D_2\}$ is used to prune the resulting spanning tree to obtain a final estimate that maximizes the held-out likelihood.

Option 2. First form the bivariate and univariate marginal density estimates $\hat p_{ij}$ and $\hat p_k$ using $\{X^{(t)} : t \in D_1\}$ and then solve

$$\check F^*_\lambda = \arg\max_{F\in\mathcal{F}_d} \Bigg\{ \sum_{(i,j)\in E_F} \frac{1}{n_2} \sum_{t\in D_2} \log \frac{\hat p_{ij}(X_i^{(t)}, X_j^{(t)})}{\hat p_i(X_i^{(t)})\,\hat p_j(X_j^{(t)})} - \lambda \sum_{k\in V_F} \log \delta(F, k) \Bigg\}. \tag{8}$$

Restrict $\check F^*_\lambda$ to the edges with positive weights to obtain a final estimate.
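To get a sense of how strongly the log-degree penalty in (6) and (8) favors hub configurations, consider a small worked comparison (ours, not from the paper) between two spanning trees on $d$ vertices: a star, with one vertex of degree $d-1$ and $d-1$ leaves of degree 1, and a path, with $d-2$ internal vertices of degree 2:

$$\sum_{k\in V_F} \log \delta(F_{\mathrm{star}}, k) = \log(d-1), \qquad \sum_{k\in V_F} \log \delta(F_{\mathrm{path}}, k) = (d-2)\log 2.$$

For $d = 100$, the star incurs a penalty of about $4.6\lambda$ while the path incurs about $67.9\lambda$, so with comparable mutual-information terms the penalized objective strongly prefers trees dominated by hubs.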

Optimization To solve either of the optimization problems (6) and (8), we first rewrite the objective function as

$$f(Z) = \sum_{i<j} w_{ij} z_{ij} - \lambda \sum_{i=1}^d \log\Bigg( \sum_{j=1}^d z_{ij} \Bigg),$$

where $w_{ij}$ is the edge weight given in (6) or (8), and $Z = (z_{ij})_{d\times d}$ is the adjacency matrix of $F$. Note that we have the additional constraint that the graph corresponding to $Z$ is a spanning tree. To maximize $f(Z)$ under the constraint, we use a minorize-maximization approach [6]. Let $\tilde Z = (\tilde z_{ij})_{d\times d}$ be our current solution. We first lower bound $f(Z)$ by linearizing $\log(\cdot)$ at $\tilde Z$:

$$f(Z) \;\ge\; \sum_{i<j} w_{ij} z_{ij} - \lambda \sum_{i=1}^d \Bigg( \log \sum_{j=1}^d \tilde z_{ij} + \frac{\sum_{j=1}^d z_{ij} - \sum_{j=1}^d \tilde z_{ij}}{\sum_{j=1}^d \tilde z_{ij}} \Bigg) = \underbrace{\sum_{i<j} \Bigg( w_{ij} - \frac{\lambda}{\sum_{k=1}^d \tilde z_{ik}} - \frac{\lambda}{\sum_{k=1}^d \tilde z_{jk}} \Bigg) z_{ij}}_{f(Z;\, \tilde Z)} + C, \tag{9}$$

where $C$ is a constant that does not depend on $Z$. We then maximize this lower bound $f(Z; \tilde Z)$ as a function of $Z$ under the constraints. We iterate through this pair of minorization and maximization procedures. Since the objective function is always increasing, the algorithm is guaranteed to converge to a local maximum.

Weights updating rule Within each iteration of the minorize-maximization approach, we need to maximize the lower bound $f(Z; \tilde Z)$. By (9), maximizing $f(Z; \tilde Z)$ is equivalent to finding the maximum spanning tree with edge weights given by

$$\tilde w_{ij} = w_{ij} - \frac{\lambda}{\sum_{k=1}^d \tilde z_{ik}} - \frac{\lambda}{\sum_{k=1}^d \tilde z_{jk}}. \tag{10}$$

Algorithm 2 Minorize-maximization
Input: weight matrix $W = (w_{ij})_{d\times d}$ and tuning parameter $\lambda$
Initialize: $Z$ to be the adjacency matrix of the graph obtained by applying Algorithm 1 on $W$
do
    $\tilde Z \leftarrow Z$
    $\tilde w_{ij} \leftarrow w_{ij} - \lambda/\sum_{k=1}^d \tilde z_{ik} - \lambda/\sum_{k=1}^d \tilde z_{jk}$ and $\tilde W \leftarrow (\tilde w_{ij})_{d\times d}$
    $Z \leftarrow \arg\max_{Z'} f(Z'; \tilde Z)$ by applying Algorithm 1 on $\tilde W$
while $Z \neq \tilde Z$
Output: $Z$ and the corresponding graph $F$


Graph × Dist.      FDE    SF-FDE   Glasso             SFGlasso           HGlasso
                                   held-out  oracle   held-out  oracle   held-out  oracle
Scale-free × N     0.49   0.69     0.24      0.91     0.42      0.92     0.16      0.88
Stars × N          0.49   0.81     0.25      0.93     0.46      0.98     0.08      0.99
Scale-free × t     0.89   0.98     0.30      0.43     0.47      0.53     0.07      0.55
Stars × t          0.93   0.98     0.32      0.56     0.50      0.67     0.09      0.79

Table 1: F1 scores for graph estimation with four types of probability models (N denotes the Gaussian copula, t the t copula). The scores are averaged over 10 replicates. For Glasso, SFGlasso, and HGlasso, scores for both held-out tuning and oracle tuning are listed.


We see that the weights are updated at each iteration based on the current estimate of the graph. Each edge weight is penalized by two quantities that are inversely proportional to the degrees of the two endpoints of the edge. An edge weight is thus penalized less if its endpoints are already highly connected, and vice versa. This is a "rich get richer" procedure, and in this way it encourages some vertices to have high connectivity and hence the overall degree distribution to have a heavy tail. The method can thus be thought of as an iteration of Kruskal's algorithm with edges reweighted at each round. It is summarized in Algorithm 2.
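As a minimal Python sketch of Algorithm 2, the loop below alternates the reweighting rule (10) with Kruskal's algorithm, reusing the hypothetical kruskal_max_spanning_tree helper from the earlier sketch; the max_iter safeguard is our own addition:

```python
import numpy as np

def sf_fde_tree(W, lam, max_iter=100):
    """Iterate reweighting (10) and maximum-spanning-tree steps until the
    tree stops changing; W is the d x d matrix of edge weights w_ij."""
    d = W.shape[0]
    edges = kruskal_max_spanning_tree(W)      # initialize on raw weights
    for _ in range(max_iter):
        Z = np.zeros((d, d))
        for i, j in edges:                    # adjacency of current tree
            Z[i, j] = Z[j, i] = 1.0
        deg = Z.sum(axis=1)                   # node degrees (>= 1 in a tree)
        # Rule (10): penalize each edge by lambda over each endpoint degree.
        W_tilde = W - lam / deg[:, None] - lam / deg[None, :]
        new_edges = kruskal_max_spanning_tree(W_tilde)
        if set(map(frozenset, new_edges)) == set(map(frozenset, edges)):
            break                             # converged to a fixed point
        edges = new_edges
    return edges
```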

4 Experiments

4.1 Synthetic data

We evaluate the performance of the proposed method and other existing methods on synthetic data.

Graph structures We consider the following two types of graph structures with $d = 100$ vertices.

• Scale-free graph: We use a preferential attachment process to generate a scale-free graph [1]. Specifically, we start with a chain of 4 nodes (i.e., with edges 1–2, 2–3, and 3–4). New nodes are added one at a time, and each new node is connected to one of the existing nodes with probability $p_i \propto \delta_i^\alpha$, where $\delta_i$ is the current degree of the $i$th node, and $\alpha$ is a parameter, which we set to 1.5 in our experiments. A typical realization of such a network is shown in the top left corner of Figure 1(a).

• Stars: The graph has 5 stars of size 20. Specifically, each star is a tree with one internal node and 19 leaves. An illustration is shown in the top left corner of Figure 1(b).

Probability distributions Given a particular graph, we generate $n_1 = 200$ samples according to two types of probability distributions that are Markov to the graph: Gaussian copulas and t copulas [4]. The Gaussian copula (resp., the t copula) can be thought of as representing the dependence structure implicit in a multivariate Gaussian (resp., multivariate t) distribution, while each variable marginally follows a uniform distribution on [0, 1]. Since both graph structures we consider are trees or forests, we generate the data sequentially, first sampling an arbitrary node in a tree, and then drawing samples for the neighboring nodes according to the conditional distribution given by the copula, until all nodes in the tree have been visited. In our simulations, the degrees of freedom of the t copula is set to 1, and the correlation coefficients are chosen to be 0.4 for the Gaussian copula and 0.25 for the t copula.

[Figure 1: Typical realizations and estimated graphs with d = 100 nodes for (a) the scale-free graph and (b) stars, with panels Truth, FDE, SF-FDE, Glasso, SFGlasso, and HGlasso. n1 = 200 samples are generated according to a t copula; another n2 = 100 samples are used to prune the spanning trees. The tuning parameters of Glasso, SFGlasso, and HGlasso are chosen to maximize the F1 scores.]
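The preferential attachment process above is straightforward to simulate. The following is a minimal Python sketch of the tree generator under our own naming and seeding choices (the copula sampling step is omitted):

```python
import numpy as np

def preferential_attachment_tree(d, alpha=1.5, seed=0):
    """Grow a tree on d nodes: start from a 4-node chain and attach each
    new node to an existing node with probability proportional to
    degree**alpha."""
    rng = np.random.default_rng(seed)
    edges = [(0, 1), (1, 2), (2, 3)]   # initial chain of 4 nodes
    deg = np.zeros(d)
    deg[:4] = [1, 2, 2, 1]
    for new in range(4, d):
        w = deg[:new] ** alpha          # unnormalized attachment weights
        target = rng.choice(new, p=w / w.sum())
        edges.append((target, new))
        deg[target] += 1
        deg[new] = 1
    return edges
```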

Methods In the presentation that follows, the usual forest density estimation [8] will be referred to as FDE, and our method as SF-FDE. For both methods, we use a held-out set of size $n_2 = 100$ to prune the estimated spanning trees. The tuning parameter of SF-FDE is chosen to maximize the likelihood on the held-out data after pruning. We also evaluate the performance of three approaches for learning Gaussian graphical models: the graphical lasso [5] (Glasso), the scale-free network approach [9] (SFGlasso), and the graphical lasso for graphs with hubs [11] (HGlasso). To apply these methods, we first transform the data marginally to be approximately Gaussian. There are tuning parameters for all three methods. We choose them by searching through a fine grid and selecting those that maximize the likelihood on the held-out set. We refer to this as held-out tuning. The results obtained by held-out tuning reflect the performance of the methods in a fully data-driven way. In addition, we also consider what we call oracle tuning, where the tuning parameters are chosen to maximize the accuracy criterion (in particular, the F1 score, to be discussed below) of the estimated graph. This tuning method requires knowledge of the true graph, and it is not obvious that there would exist a data-driven way to achieve it. We include oracle tuning as a way to show the optimal performance achievable by the methods.

Results We carry out four sets of experiments, with data generated from the two types of graphs and the two types of distributions. For each set of experiments, we apply the five methods and repeat the simulations 10 times. We record the F1 scores of the estimated graphs for each method. An F1 score is the harmonic mean of a method's precision and recall and hence a measure of its accuracy. It is a number between 0 and 1; a higher score means better accuracy, and 1 means a perfect labelling. The average F1 scores are shown in Table 1. From the table, we see that SF-FDE always performs better than FDE on these particular graphs. SF-FDE and FDE perform better than the other three methods under held-out tuning, as the penalized likelihood methods tend to select more edges when tuning parameters are chosen to maximize the held-out likelihood. When the true copula is Gaussian, Glasso, SFGlasso, and HGlasso all have very high scores if oracle tuning is used; they fail to deliver good performance when the true copula is no longer Gaussian. On the other hand, SF-FDE is not affected much by the true distribution as long as the graph is scale-free or has dominating hubs. We also include in Figure 1 two sets of typical realizations and estimates. Note that the data are generated according to the t copula, and the tuning parameters for the three parametric methods are chosen to maximize the F1 scores, which is unrealistic in a data-driven setting. Even with the oracle tuning, we observe that SF-FDE significantly outperforms the three parametric methods.

[Figure 2: Estimated graphs via FDE and SF-FDE for the stock price data, shown side by side. The stocks are colored according to their Global Industry Classification Standard categories.]


[Figure 3: Estimated graphs via FDE and SF-FDE for the microarray data, shown side by side. Genes that have at least 10 connections are marked in red.]

4.2 Real data

Stock price data We test our method on stock price data from Yahoo! Finance. We collected the daily closing prices of $p = 416$ stocks that were consistently in the S&P 500 index from January 1, 2011 to December 31, 2014, excluding some stocks with anomalous price behavior. Let $S_k^{(t)}$ be the closing price of stock $k$ on day $t$, and let $X_k^{(t)} = \log(S_k^{(t)}/S_k^{(t-1)})$ be the log return. We marginally transform the log returns of each stock to a normal distribution. We treat the $X^{(t)}$'s as independent random vectors, though they form a time series. We use the data from 2011 to 2013 as the training data and the data from 2014 as the held-out part; the two sample sizes are 753 and 252. We apply FDE and SF-FDE to this data set. We notice that SF-FDE yields a larger likelihood on the held-out data, implying that a scale-free approximation is helpful in estimating and predicting the relationships. The estimated graphs are shown in Figure 2, where the stocks are colored according to their Global Industry Classification Standard categories. We see that stocks with the same color are somewhat more clustered in the graph estimated via SF-FDE.

Microarray data As a second example, we consider a microarray data set from [10]. The data set contains Affymetrix chip measured expression levels of 4238 genes for 295 normal subjects in the Centre d'Etude du Polymorphisme Humain and the International HapMap collections. For the purpose of presentation, we select a subset of 300 genes and apply FDE and SF-FDE. Once again, the held-out likelihood is higher for SF-FDE, indicating an improvement in predictive power from a scale-free estimation. The estimated graphs are shown in Figure 3. In the FDE estimated graph, only one node has degree at least 10 (colored in red), while in the SF-FDE graph, we see 8 highly connected nodes of degree no less than 10. Hopefully this can help identify important genes and inspire follow-up studies.
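As a sketch of the preprocessing used for the stock price data above: the fragment below computes log returns from a hypothetical T × p price matrix and applies a rank-based normal-scores transform, which is our reading of "marginally transform the log returns of each stock to a normal distribution", not a step specified in the paper:

```python
import numpy as np
from scipy.stats import norm, rankdata

def log_returns(prices):
    """prices: (T x p) array of daily closing prices."""
    return np.log(prices[1:] / prices[:-1])   # X_k^(t) = log(S_k^(t) / S_k^(t-1))

def marginal_gaussianize(X):
    """Map each column of X to approximate normality via its ranks."""
    n = X.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, X)   # column-wise ranks
    return norm.ppf(ranks / (n + 1))              # ranks -> normal quantiles
```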


5 Conclusion

In this paper, we introduce a nonparametric method for estimating scale-free graphical models. It arises from a Bayesian formulation as the MAP estimate. Compared with parametric methods that penalize the Gaussian likelihood, the proposed method leads to more accurate and robust estimation of the underlying scale-free networks.

References

[1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

[2] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[3] Aaron Defazio and Tiberio S. Caetano. A convex formulation for learning scale-free networks via submodular relaxation. In Advances in Neural Information Processing Systems, pages 1250–1258, 2012.

[4] Stefano Demarta and Alexander J. McNeil. The t copula and related copulas. International Statistical Review, 73(1):111–129, 2005.

[5] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[6] David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.

[7] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

[8] Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, and Larry Wasserman. Forest density estimation. The Journal of Machine Learning Research, 12:907–951, 2011.

[9] Qiang Liu and Alexander T. Ihler. Learning scale free networks by reweighted L1 regularization. In International Conference on Artificial Intelligence and Statistics, pages 40–48, 2011.

[10] Renuka R. Nayak, Michael Kearns, Richard S. Spielman, and Vivian G. Cheung. Coexpression network based on natural variation in human gene expression reveals gene interactions and functions. Genome Research, 19(11):1953–1962, 2009.

[11] Kean Ming Tan, Palma London, Karthik Mohan, Su-In Lee, Maryam Fazel, and Daniela Witten. Learning graphical models with hubs. The Journal of Machine Learning Research, 15(1):3297–3331, 2014.

[12] Qingming Tang, Siqi Sun, and Jinbo Xu. Learning scale-free networks by dynamic node specific degree prior. In Proceedings of the 32nd International Conference on Machine Learning, pages 1–9, 2015.

[13] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.