
Objective Bayesian Model Selection in Gaussian Graphical Models

By CARLOS M. CARVALHO
Graduate School of Business, University of Chicago, Chicago, Illinois 60637, U.S.A.
[email protected]

and JAMES G. SCOTT
Department of Statistical Science, Duke University, Durham, North Carolina 27708-0251, U.S.A.
[email protected]

Summary

This paper presents a default model-selection procedure for Gaussian graphical models that involves two new developments. First, we develop an objective version of the hyper-inverse Wishart prior for restricted covariance matrices, called the HIW g-prior, and show how it corresponds to the implied fractional prior for covariance selection using fractional Bayes factors. Second, we apply a class of priors that automatically handles the problem of multiple hypothesis testing implied by covariance selection. Numerical experiments show that these priors strongly control the number of false edges included in the model, thereby automatically rewarding sparsity. We demonstrate our methods on a variety of simulated examples, concluding with a real example analyzing covariation in mutual-fund returns. These studies reveal that the combined use of a multiplicity-correction prior on graphs with the hyper-inverse Wishart g-prior on covariance matrices yields better performance than conventional covariance selection methods.

Some key words: Gaussian graphical models; hyper-inverse Wishart distribution; fractional Bayes factors; objective Bayesian model selection; multiple hypothesis testing.


1 Introduction

Modeling pairwise conditional independencies using Gaussian graphical models offers many practical advantages in high-dimensional problems. It can make computing more efficient by alleviating the need to handle large matrices; it can yield better predictions by fitting a simpler model; and it can aid scientific understanding by breaking down a global model into a collection of local models that are easier to parse.

Yet often the graph itself must be inferred from the data, posing at least two difficulties. First, the quality of answers is highly sensitive to the prior distribution π(Σ) under a given graph—much more so than in an estimation problem. There is no common covariance matrix shared by all graphs, but rather an entire collection of covariance matrices {ΣG} indexed by all possible graphs, and different graphs imply different numbers of free elements in Σ. We therefore cannot use an improper prior for each ΣG as we might do for covariance estimation under a fixed graph, since this would leave the resulting model probabilities defined only up to an arbitrary constant. Instead, we must either elicit a subjective prior for each ΣG—clearly intractable in high dimensions—or we must choose some default proper prior that is neither too vague nor too precise. Regardless, we must proceed carefully with the understanding that our answers will depend on the priors chosen for the various ΣG's.

Second, graphical model selection poses an implicit problem in multiple hypothesis testing, where a null hypothesis is the exclusion of a single edge from the graph. If the number of falsely included edges is allowed to grow too rapidly as the number of tests for edge inclusion grows, we will quickly be overwhelmed by false positives; parsimony will be lost, computational burdens will accumulate, and predictive performance will suffer. This will stifle the gains to be had by fitting a graph in the first place, and so we must somehow assert control over the number of false positives by penalizing complex graphs.

This paper develops an objective Bayesian approach to graphical model selection that confronts these two issues, which we argue are truly a single issue under the objective framework. Indeed, we will show that objective Bayesian multiplicity correction in graphical models depends crucially upon the quality of information coming from the marginal likelihoods of graphs of differing size. Our method induces a feedback loop between the prior chosen on Σ under a given graph G and the prior probability of G itself. Accordingly, the marginal likelihood f(X | G) and the graph prior π(G) become entwined, thereby magnifying the choice of π(Σ) in importance beyond its usual role in determining the marginal likelihood alone.

After introducing the notation and basic theory of Gaussian graphical modeling in §2, we show how marginal likelihoods can be computed by fractional Bayes factors in §3. This procedure implies a specific fractional prior called the hyper-inverse Wishart g-prior, a name we have chosen due to the prior's intimate relationship with Zellner's g-prior. We explore this relationship in §4, along with the information consistency and asymptotic behavior of the resulting model probabilities.

In §5 the results given by our FBF-based marginal likelihoods are compared to results under conventional proper priors, and the FBF results are found to be consistently better. Then §6 shows how the theory of adaptive shrinkage priors for multiplicity correction, introduced by Scott & Berger (2006), can be applied to graphical models, giving excellent results at controlling false positives. In the final section, we illustrate our procedures in the context of a real example involving seven years of mutual-fund data.

2 Background

2.1 Notation and structure for graphs

An undirected graph is a pair G = (V, E) with vertex set V and edge set E = {eij} for pairs of distinct vertices i, j ∈ V. Nodes i and j are adjacent, or neighbors, if eij ∈ E; we often write the edge simply as (i, j). A graph is complete if eij ∈ E for every i, j ∈ V. A subgraph C ⊆ V is a clique if it is complete, and a maximal clique if it is not a proper subset of another clique. A decomposition of a graph G = (V, E) is a partition into subgraphs (A, S, B) such that V = A ∪ B, S = A ∩ B is complete, and any path from a node in A to a node in B must go through S (which is called a separator). The subgraphs that cannot be decomposed any further are the prime components of the graph, and decomposable graphs are those for which every prime component is complete.

A perfect ordering is an ordering of the prime components and separators of a graph (P1, S2, P2, . . . , Sk, Pk) with the running intersection property: for every i = 2, . . . , k, there is a j < i such that Si = Pi ∩ Hi−1 ⊆ Pj, where Hi−1 is the union over all previous prime components. A junction tree TG is a representation of such a perfect ordering in tree form, whereby prime components define the nodes of the tree and the separators define the edges between the nodes. The junction tree plays a crucial role in the computational aspects of graphical models, as we discuss below. We let C = (P, S) denote the set of subgraphs C implied by the junction tree TG, having a perfect ordering of prime components (P1, . . . , Pk) ∈ P and separators (S2, . . . , Sk) ∈ S.
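The running intersection property can be checked directly from an ordered list of prime components; the following sketch (in Python, with illustrative names of our own) returns the separators when a candidate ordering is perfect.

```python
def separators_if_perfect(components):
    """Return the separators S_2, ..., S_k if the ordered prime components satisfy
    the running intersection property, and None otherwise.

    components: list of sets (P_1, ..., P_k) in the candidate perfect ordering.
    """
    separators = []
    history = set(components[0])
    for i, P in enumerate(components[1:], start=1):
        S = P & history                               # S_i = P_i ∩ H_{i-1}
        if not any(S <= components[j] for j in range(i)):
            return None
        separators.append(S)
        history |= P
    return separators

# Example: [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}] is a perfect ordering with
# separators {2, 3} and {3, 4}.
```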


2.2 Conjugate priors in Gaussian graphical models

A Gaussian graphical model, or covariance selection model (Dempster, 1972), defines a set of pairwise conditional independence relationships on a p-dimensional normally distributed random quantity X ∼ N(0, Σ). With a nonsingular, positive-definite covariance matrix Σ giving precision matrix Ω = Σ−1 with entries ωij, the univariate elements xi and xj of X are conditionally independent if and only if ωij = 0. If G = (V, E) is an undirected graph representing the joint distribution of X, ωij = 0 for all pairs (i, j) ∉ E. The parameter Σ belongs to M(G), the set of all positive-definite symmetric matrices with elements in Σ−1 equal to zero for all (i, j) ∉ E. The density of X factorizes as

    p(X | Σ, G) = ∏_{P∈P} p(X_P | Σ_P) / ∏_{S∈S} p(X_S | Σ_S),        (1)

a ratio of products of densities where XP and XS indicate subsets of variables in prime components and separators respectively. Given G, this distribution is defined completely by the component-marginal covariance matrices ΣP, subject to the consistency condition that submatrices in the intersecting, i.e. separating, components are identical, as in Dawid & Lauritzen (1993); that is, if S = P1 ∩ P2 the elements of ΣS are common in ΣP1 and ΣP2.

The hyper-inverse Wishart distribution over a decomposable graph G is the unique strong hyper-Markov distribution for Σ ∈ M(G) with consistent clique marginals that are inverse Wishart distributed (Dawid & Lauritzen, 1993). For a decomposable graph G with junction tree set TG = (P, S), a degrees-of-freedom parameter b > 0, and a positive-definite location matrix D ∈ M(G), writing (Σ | G) ∼ HIW_G(b, D) means two things. First, the density of Σ decomposes as in Equation 1. Second, for each clique C ∈ C, ΣC ∼ IW(b, DC) with density:

    p(Σ_C | b, D_C) = K · |Σ_C|^{−(b/2 + |C|)} exp{ −(1/2) tr(Σ_C^{−1} D_C) }        (2)

Here, |C| represents clique size, and the normalizing constant K is given by:

    K = |D_C/2|^{(b+|C|−1)/2} / Γ_{|C|}((b+|C|−1)/2)        (3)

where Γd(·) is the multivariate gamma function in dimension d. Conjugacy implies that an individual clique having prior ΣC ∼ IW(b, DC) will, upon observing the submatrix XC, have posterior (ΣC | XC) ∼ IW(b + n, DC + XC'XC). The full covariance matrix then has posterior (Σ | X) ∼ HIW(b + n, D + H(X'X)), where H(X'X) is the hyper-Markov sum-of-squares matrix corresponding to the prime components and separators of G.

For a nondecomposable graph, the density still factorizes as in (1), but some of the prime components will be incomplete. These incomplete components have additional off-diagonal constraints in their covariance matrix, invalidating (2). Atay-Kayis & Massam (2005) give a Monte Carlo method for approximating the marginal likelihood of an incomplete prime component, but these methods become prohibitively time-consuming in high dimensions. Hence we restrict our attention to the much simpler decomposable case and denote by G the set of all decomposable graphs over a given number of nodes (which is understood from the context).
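As a small illustration of the conjugate update just described, one can work directly with the hyperparameters: because the hyper-inverse Wishart density only involves clique and separator margins, updating the full location matrix updates every clique correctly. The function name below is our own; this is a sketch, not code from any package.

```python
import numpy as np

def hiw_posterior_params(b, D, X):
    """Conjugate update to (Sigma | X) ~ HIW(b + n, D + H(X'X)).

    The HIW density only involves clique and separator margins, so updating the
    full location matrix is enough: on every clique C, (D + X'X)_C = D_C + X_C'X_C,
    which is exactly the inverse-Wishart update for that clique.
    """
    n = X.shape[0]
    return b + n, D + X.T @ X
```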

Using HIW priors for Σ, the marginal likelihood for G can be expressed as the ratio of the prior to the posterior normalizing constant (Jones et al., 2005):

    p(X | G) = (2π)^{−np/2} · h(G, b, D) / h(G, b*, D*)        (4)

where b is the prior degrees of freedom, D is the prior HIW location matrix, b* = b + n, and D* = D + H(X'X). The normalizing constant of a hyper-inverse Wishart distribution is itself a function of the normalizing constants of its prime components and separators:

    h(G, b, D) = [ ∏_{P∈P} |D_P/2|^{(b+|P|−1)/2} Γ_{|P|}((b+|P|−1)/2)^{−1} ] / [ ∏_{S∈S} |D_S/2|^{(b+|S|−1)/2} Γ_{|S|}((b+|S|−1)/2)^{−1} ]        (5)

The computational benefits of the junction tree representation are now clear; decomposing a covariance matrix into cliques and separators allows us to compute several small determinants instead of a single large one. The junction tree is also necessary to generate random draws from a hyper-inverse Wishart distribution (Carvalho et al., 2007), useful in a wide range of applied settings involving graphical models.
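To make the computation in (4)–(5) concrete, here is a sketch that evaluates log h(G, b, D) clique by clique and assembles the log marginal likelihood; the graph is passed as lists of prime-component and separator index sets, and all function names and the data layout are our own illustration.

```python
import numpy as np
from scipy.special import multigammaln

def log_h(cliques, separators, b, D):
    """log h(G, b, D) from (5): clique terms over separator terms.

    cliques, separators: lists of index lists (prime components and separators of G).
    b: degrees of freedom; D: p x p positive-definite location matrix.
    """
    def log_term(idx):
        d = len(idx)
        a = (b + d - 1) / 2.0
        _, logdet = np.linalg.slogdet(D[np.ix_(idx, idx)] / 2.0)
        return a * logdet - multigammaln(a, d)
    return sum(log_term(P) for P in cliques) - sum(log_term(S) for S in separators)

def log_marginal_likelihood(X, cliques, separators, b, D):
    """log p(X | G) from (4) under a HIW(b, D) prior on Sigma."""
    n, p = X.shape
    S = X.T @ X
    # h(.) only reads clique and separator blocks, on which D* = D + H(X'X)
    # agrees with D + X'X, so the full cross-product matrix can be used here.
    return (-0.5 * n * p * np.log(2.0 * np.pi)
            + log_h(cliques, separators, b, D)
            - log_h(cliques, separators, b + n, D + S))
```

For a single complete clique this reduces to the ordinary inverse-Wishart marginal likelihood; differences of these values across candidate graphs give log Bayes factors.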

3 Fractional Bayes factors and the HIW g-prior

We have two main goals in constructing a default prior π(Σ) appropriate for covariance selection. First, π(Σ) must be conjugate in order to make use of (4) for computing marginal likelihoods. This is a practical requirement rather than a theoretical one—graphical model spaces are far too large to explore exhaustively, and we must evaluate models very rapidly if we ever hope to find the good ones. We accept that this tethers us to a conjugate prior and seek marginal likelihoods that are as well-behaved as possible under this constraint. Second, the prior should correspond to an actual Bayesian procedure, and not merely use the machinery of Bayes' theorem to produce an answer. We must avoid what Berger (2006) calls "pseudo-Bayes" procedures: the casual use of constant priors, vague proper priors that are chosen to "span the range of the likelihood," or "priors with tuning parameters that are adjusted until the answer 'looks nice'." These procedures can give answers that still depend heavily upon the spread of the (arbitrarily chosen) prior, and their ad-hoc centering of the prior on the likelihood is a philosophically dubious reuse of the data.

We avoid these difficulties through the use of fractional Bayes factors, first proposed in O'Hagan (1995) as a default Bayesian model-selection technique for use when prior information is weak. The idea is to train an ensemble of noninformative priors using a small fraction g of the likelihood function. This is done simultaneously for all models being considered, converting all noninformative priors into proper priors that are then used to select a model with the remaining fraction 1 − g of the likelihood.

Formally: choose g ∈ (0, 1) and let πN(Σ, G) be a noninformative (typically improper) prior for Σ under a graph G. The fractional Bayes factor for graphs G1 and G2 is then FBF_g(G1, G2) = Q(X, g | G1) / Q(X, g | G2), where:

    Q(X, g | G) = ∫ πN(Σ, G) f(X | Σ, G) dΣ / ∫ πN(Σ, G) f(X | Σ, G)^g dΣ        (6)

This is equivalent to computing model probabilities by integrating the leftover fraction of the likelihood over π*(Σ, G) ∝ πN(Σ, G) f(X | Σ, G)^g, which is called the implied fractional prior corresponding to πN(Σ, G).

Equation 6 clearly depends upon the choice of a noninformative prior for Σ. Among the possibilities reviewed in Sun & Berger (2007), the obvious choice given our need for conjugacy turns out to be the fiducial prior, so called because it yields Fisher's fiducial distribution for marginal variances. This prior is π(Σ) ∝ |Σ|^{−p}, which is also the prior that yields exact frequentist matching for means and variances (Geisser & Cornfield, 1963). The properties of the determinant allow us to express this prior in a particularly handy form given a graph G with junction tree TG = (P, S). We call this form the component-wise matching prior:

    πN(Σ | G) ∝ ∏_{P∈P} |Σ_P|^{−|P|} / ∏_{S∈S} |Σ_S|^{−|S|}        (7)

After using a fraction g of the likelihood to train the prior in (7), we end up with:

    (Σ | G) ∼ HIW(gn, gH(X'X))        (8)

We call this prior the hyper-inverse Wishart g-prior by analogy to Zellner's g-prior in linear regression (Zellner, 1986). The similarity is more than superficial, and we detail the intimate connection between the two ideas in §4. We can now state the main result of this section:

Theorem 3.1. The HIW g-prior is the implied fractional prior for Σ corresponding to the component-wise matching prior, where g is the fraction of the likelihood used for training. The minimal training fraction is g = 1/n, and the fractional marginal likelihood can be computed as:

    Q(X, g | G) = (2π)^{−np/2} · h(G, gn, gH(X'X)) / h(G, n, H(X'X))        (9)

with h(G, b, D) defined as in (5).

Proof. The form of the implied fractional prior follows immediately by identifying the component-wise form of the fiducial prior as the limiting case of a hyper-inverse Wishart distribution. The rest of the theorem follows directly from the conjugacy of the HIW prior with the normal likelihood, and from Equation 4, in which gn = b = 1 is the smallest integer value yielding a proper prior.

We emphasize that this procedure does not merely specify a single prior distribution, but rather a whole cohort of objective prior distributions for all {ΣG : G ∈ G}. As a default choice, we recommend setting g = 1/n for two reasons. First, g must be O(1/n) in order for the implied fractional prior to correspond to the gold standard of a carefully elicited subjective prior distribution (Berger & Pericchi, 2001). Any other choice will lead to strange behavior as data accumulate. If g decreases too slowly, the prior will asymptotically overwhelm the likelihood; if it decreases too fast, the null model will eventually be favored with probability 1. Second, the product of g and n gives the degrees of freedom for the implied fractional prior. Choosing gn = 1 implies that the vector x is marginally Cauchy and that π(Σ) is heavy-tailed without being too vague. This choice dovetails with the advice given in, for example, Jeffreys (1961) and Liang et al. (2007), who favor heavy-tailed priors for regression variable selection.
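Combining Theorem 3.1 with the normalizing-constant identity (5) gives a direct recipe for the fractional marginal likelihood in (9). A minimal sketch follows, with the helper from §2 repeated so the block stands alone; all names are illustrative.

```python
import numpy as np
from scipy.special import multigammaln

def log_h(cliques, separators, b, D):
    # HIW normalising constant (5), as in the sketch in §2
    def log_term(idx):
        d = len(idx)
        a = (b + d - 1) / 2.0
        return a * np.linalg.slogdet(D[np.ix_(idx, idx)] / 2.0)[1] - multigammaln(a, d)
    return sum(log_term(P) for P in cliques) - sum(log_term(S) for S in separators)

def log_fractional_ml(X, cliques, separators, g=None):
    """log Q(X, g | G) from (9): the HIW g-prior trained with fraction g (default 1/n)."""
    n, p = X.shape
    g = 1.0 / n if g is None else g
    S = X.T @ X
    return (-0.5 * n * p * np.log(2.0 * np.pi)
            + log_h(cliques, separators, g * n, g * S)
            - log_h(cliques, separators, n, S))
```

Because the leading constant and the training fraction are shared across graphs, differences of these values are the log fractional Bayes factors used in the rest of the paper.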

4 Information consistency of the HIW g-prior

This section considers the information consistency, or finite-sample consistency, of the implied fractional prior introduced in §3. The term information consistency refers to whether the Bayes factor for comparing a complex model to the null model is bounded for fixed n. The classic example of an information-inconsistent procedure is model selection in linear regression using Zellner's g-prior (Berger & Pericchi, 2001; Liang et al., 2007). Imagine testing a specific regression model γ having k_γ terms against the null model γ0 having only an intercept term. If the usual F statistic for testing γ against γ0 goes to infinity for fixed n and k_γ < n − 1, the information against γ0 is overwhelming, and one would expect the Bayes factor to diverge. But under the standard g-prior with g fixed (usually at g = n), this Bayes factor instead converges to the fixed constant (1 + g)^{(n−k_γ−1)/2}. This places an intrinsic limitation, one that is not shared by the frequentist test statistic, on how strongly the Bayes factor may support the bigger model. Such behavior is intuitively unappealing, since the case of zero residual variation under a nonsaturated model seems to demand an unbounded Bayes factor with respect to the null model. It was for precisely this reason that Jeffreys recommended Cauchy priors for testing normal means (Jeffreys, 1961, Chapter 5). The question of a Bayesian procedure's information consistency is typically posed with respect to a particular frequentist test statistic whose value grows without bound as the model fits arbitrarily well. We show that our fractional prior is information consistent in at least two related senses, which we define in turn.


4.1 Tests against the null graph

We follow Proposition 5.14 in Lauritzen (1996) to give the standard frequentist test statistic for comparing graphs. Let G0 denote the null graph having no edges, and let GA denote the graph we wish to compare with the null. Further, let Ω̂_0 be the maximum-likelihood estimate for the precision matrix under G0, and let Ω̂_A be the MLE under GA. Then there is a nested sequence G0 ⊂ . . . ⊂ Gd = GA of decomposable graphs that differ only by a single edge. Let e_i denote the edge in G_i but not in G_{i−1}, and let C_i be the (unique) clique of G_i containing e_i. Then we have the following lemma:

Lemma 4.1. The test of GA against the null graph G0 can be performed by rejecting G0 for sufficiently small values of b = |Ω̂_0| / |Ω̂_A|. Under G0, b is distributed as the product of independent beta random variables B_1 · · · B_d, with:

    B_i ∼ Be((n − |C_i|)/2, 1/2)

We now give a precise statement of information consistency:

Definition. Let BF(G0, GA) be the Bayes factor for testing the null graph G0 against GA under the family of priors {π(Σ | G)}. If, for fixed n > max_{C∈C} |C|, BF(G0, GA) → 0 as b → 0, then {π(Σ | G)} is β-consistent.

Theorem 4.2. The family of fractional priors for g = 1/n corresponding to the component-wise matching prior is β-consistent.

Proof.

    BF(G0, GA) = K · [ ∏_{i=1}^p |g X_i'X_i / 2|^{gn/2} / ∏_{i=1}^p |X_i'X_i / 2|^{n/2} ]
                   · [ ∏_{S∈S} |g X_S'X_S / 2|^{(gn+|S|−1)/2} / ∏_{P∈P} |g X_P'X_P / 2|^{(gn+|P|−1)/2} ]
                   · [ ∏_{P∈P} |X_P'X_P / 2|^{(n+|P|−1)/2} / ∏_{S∈S} |X_S'X_S / 2|^{(n+|S|−1)/2} ]        (10)

This simplifies to:

    K · (1/g)^{(S_1 − gnp)/2} · (1/2)^{(n−gn)(S_2 − p)/2} · ∏_{i=1}^p |X_i'X_i|^{−(n−gn)/2} · [ ∏_{P∈P} |X_P'X_P|^{(n−gn)/2} / ∏_{S∈S} |X_S'X_S|^{(n−gn)/2} ]        (11)

where the leading constant K and the exponent terms S_1 and S_2 are given by:

    K = [ Γ(n/2) / Γ(gn/2) ]^p · [ ∏_{P∈P} Γ_{|P|}((gn+|P|−1)/2) · ∏_{S∈S} Γ_{|S|}((n+|S|−1)/2) ] / [ ∏_{P∈P} Γ_{|P|}((n+|P|−1)/2) · ∏_{S∈S} Γ_{|S|}((gn+|S|−1)/2) ]        (12)

    S_1 = ∑_{P∈P} |P| (gn + |P| − 1) − ∑_{S∈S} |S| (gn + |S| − 1)        (13)

    S_2 = ∑_{P∈P} |P| − ∑_{S∈S} |S|        (14)

We now apply Lauritzen's formula for the determinant of Ω̂_G under any graph with junction tree TG = (P, S):

    |Ω̂_G| = n^p ∏_{S∈S} |X_S'X_S| / ∏_{P∈P} |X_P'X_P|        (15)

which gives:

    BF(G0, GA) = C · ( |Ω̂_0| / |Ω̂_A| )^{(n−gn)/2}        (16)

where C is a fixed term involving g, p, n, and the structure of the graph G. The proof of information consistency now follows immediately by plugging the test statistic b into the above equation, and noticing that for fixed g < 1 and p < n, BF(G0, GA) → 0 as b → 0.

4.2 Tests for an implied conditional regression model

Information consistency is important in a second sense: nonzero entries in a precision matrix imply a set of nonzero conditional regression coefficients for each xi upon x−i. Covariance-selection procedures perform variable selection on all of these implied regressions simultaneously, and hence they should behave sensibly when considered strictly as a method for variable selection under an encompassing normal model. We call this notion F-consistency:

Definition. Let z be a univariate element of X, let γ be a binary vector with nonzero entries corresponding to the neighbors of z under G, and let γ0 be the vector of all zeros. Let BF(γ, γ0) be the Bayes factor for testing the regression models implied by γ and γ0 under the family of priors {π(Σ | G)}, and let F be the usual F-statistic for comparing the two regression models. If, for fixed n > max_{C∈C} |C|, BF(γ, γ0) → ∞ as F → ∞, then the family {π(Σ | G)} is F-consistent.

We now show that fractional Bayes factors are F-consistent due to the behavior of the hyper-inverse Wishart g-prior, beginning with a preliminary lemma:

Lemma 4.3. Let (Σ | G) ∼ HIW(b, D), and partition Σ as

    Σ = ( Σ_zz  Σ_zY ; Σ_Yz  Σ_YY ),

where z is (1 × 1). Then:

    (i) (Σ_zY Σ_YY^{−1} | Σ_{z|Y}) ∼ N( D_zY D_YY^{−1}, Σ_{z|Y} D_YY^{−1} )

    (ii) Σ_zz − Σ_zY Σ_YY^{−1} Σ_Yz = Σ_{z|Y}, with Σ_{z|Y}^{−1} ∼ Ga( (b + k)/2, D_{z|Y}/2 )

with k representing the number of neighbors of z. The lemma is proven in Appendix A; we now apply it to give the following theorem:

Theorem 4.4. The family of fractional priors for g = 1/n corresponding to the component-wise matching prior is F-consistent.

Proof. Assume X = (z, X*), where X* is n × (p − 1) and z is (n × 1). Let Xi ∼ N(0, Σ) and (Σ | G) ∼ HIW(gn, D), where D = gX'X, and denote by γz the binary vector indicating which elements of X* are neighbors of z in G. Assuming γz has k nonzero elements, we let Y be the n × k design matrix corresponding to the columns of X* indicated by γz. After applying Lemma 4.3, the HIW g-prior gives the following regression relationship:

    z = YF + ε,   ε ∼ N(0, φ^{−1} I)        (17)

    (F | g, φ, γz) ∼ N( F̂, (gφ)^{−1} (Y'Y)^{−1} )        (18)

    (φ | g, γz) ∼ Ga( (gn + k)/2, gr/2 )        (19)

where φ is the conditional precision Σ_{z|Y}^{−1}; I is the n × n identity matrix; F̂ = (Y'Y)^{−1} Y'z is the traditional least-squares estimate for F; and r = z'(I − M_Y)z with M_Y = Y(Y'Y)^{−1}Y' denoting the perpendicular projection matrix onto the column space of Y. Hence r is the residual sum of squares after regressing z upon Y.

We can compute the marginal distribution of (z | φ) using the familiar properties of the normal distribution, accounting for the fact that the likelihood in (17) must be raised to the power (1 − g) to avoid using the data twice. Then marginalizing over the conjugate gamma prior for φ given in (19) with α = (gn + k)/2 and β = gr/2 yields, after some straightforward algebra:

    P(z | γz) = π^{−n/2} g^{(gn+2k)/2} (1 − g)^{n/2} · [ Γ((n + gn + k)/2) / Γ((gn + k)/2) ] · r^{−n/2}        (20)

Let γ0 represent the null model, i.e. that z is independent of Y. For regression model γ, we can compute BF(γ, γ0) by recognizing the null model as a special case of (20) with k = 0 and r = z'z.

Choosing g = 1/n—the smallest integer choice of the degree-of-freedom parameter gn for which Σ has a proper prior density after training—yields an especially simple form for the Bayes factor:

    BF(γ, γ0) = [ β(n/2, k/2) / (2 β(n, k)) ] · n^{−k} · (1 − R_γ²)^{−n/2}        (21)

where β(·, ·) is the beta function, and where R_γ² is the usual coefficient of determination for model γ. As R_γ² goes to 1, the Bayes factor in favor of γ clearly goes to infinity, thus completing the proof of F-consistency.

The prior given by (17), (18), and (19) is immediately recognizable as a modified form of Zellner's g-prior for F_{z|Y}, the vector of conditional regression coefficients, and it is the reason we use the term "hyper-inverse Wishart g-prior" for the implied fractional prior from §3.
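Expression (21) depends on the data only through n, k, and R_γ², so the F-consistency behavior is easy to see numerically; a short sketch, with our own function name, follows.

```python
import numpy as np
from scipy.special import betaln

def log_bf_vs_null(R2, n, k):
    """log BF(gamma, gamma_0) from (21) under the HIW g-prior with g = 1/n."""
    return (betaln(n / 2.0, k / 2.0) - np.log(2.0) - betaln(n, k)
            - k * np.log(n) - (n / 2.0) * np.log1p(-R2))

# F-consistency in action: for fixed n and k the Bayes factor diverges as R2 -> 1.
for R2 in (0.5, 0.9, 0.99, 0.9999):
    print(R2, log_bf_vs_null(R2, n=50, k=3))
```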

5 Fractional marginal likelihoods: a simulation study

We now demonstrate the relevance of the fractional/HIW-g prior through a simulation study that compares models of differing complexity. By comparing the marginal likelihoods from the HIW-g prior with the traditional subjective alternative, the HIW(δ, τI) prior, we hope to illustrate how the casual use of conventional proper priors undermines our ability to choose between competing models, and that the objective procedure presented so far is well-behaved. The choice of Σ ∼ HIW(δ, τI) as an alternative to our prior follows the standard advice in the Gaussian graphical model-selection literature (Giudici & Green, 1999; Dobra et al., 2004; Jones et al., 2005), where the scale matrix D = τI reflects a shrinkage prior with τ chosen to match the marginal variance of the data (for which a common scale must be assumed).

In the example, the true model is a p = 50-dimensional stationary correlation matrix of an AR(10) process, which represents a Gaussian graphical model due to the band-diagonal form of the precision matrix. Hence the appropriate choice of the conventional proper prior is Σ ∼ HIW(3, I), where our choice of δ = 3 again reflects the standard advice to give π(Σ) a finite first moment. For multiple sample sizes (n = 20, 50, 100, 500, 1000), we simulated 1000 datasets from the true model and computed the associated marginal likelihood for candidate models of differing complexity. For illustrative purposes and to facilitate the interpretation of results, these candidates include only band-diagonal models from AR(0) to AR(20).

Figure 1 displays the distribution of marginal log-likelihoods for each candidate model under the two priors. The plot shows substantially better separation of models under the fractional marginal likelihoods. It is clear that, unlike the standard prior, our objective procedure penalizes complexity and preserves the Ockham's-razor effect of marginal likelihoods (Jefferys & Berger, 1992). In the absence of enough data to justify extra edges, it exhibits a clear preference for sparsity.

Yet as the sample size increases, the HIW-g prior favors more complex models, asymptotically approaching the true model and buttressing the results of §4. The conventional prior does not exhibit this tendency nearly as strongly.

For each simulated data set, we also chose the model with the highest marginal likelihood. This gives a Monte-Carlo estimate for the frequency distribution of the chosen band-diagonal size under the two priors, which is presented in Figure 2. It is clear that as n grows, the fractional Bayes factors favor the true model (bandwidth 10) with increasing accuracy. Meanwhile, the results from the HIW(3, I) prior are erratic, with an unintuitively high level of uncertainty associated with the choice of models and without a clear preference for sparsity. This flattening effect of the conventional prior also compromises the efficiency of any search algorithm that uses pairwise model comparisons to climb hills in model space.

One might imagine that the conventional prior, by shrinking the covariance structure toward the identity matrix with its strong pattern of off-diagonal zeros, would yield systematically smaller models. This expectation is not confirmed by our simulation study, indicating that intuitions about shrinkage gleaned from covariance estimation do not necessarily apply to covariance selection.

The results from this simulation are consistent with the theoretical results developed so far and strengthen our claim that the fractional approach is indeed an appropriate default procedure for Gaussian graphical model selection. We note that the conventional prior induces a set of ridge-regression priors on the complete conditional models considered in §4.2 (as can be shown through a straightforward application of Lemma 4.3). The problems with ridge-regression priors for variable selection are well understood, and give some intuition as to why the conventional HIW(δ, τI) prior is suboptimal for covariance selection.
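For the band-diagonal candidates used in this study, the prime components and separators have a simple closed form, so the fractional marginal likelihoods above can be evaluated directly; the following sketch (our own construction, with zero-based indices) builds those component lists.

```python
def band_graph_components(p, q):
    """Prime components and separators of the band-diagonal (AR(q)-type) graph on p nodes.

    The cliques are the consecutive windows {i, ..., i+q} and the separators are
    their overlaps; q = 0 gives the null graph with singleton cliques.
    """
    q = min(q, p - 1)
    if q == 0:
        return [[i] for i in range(p)], []
    cliques = [list(range(i, i + q + 1)) for i in range(p - q)]
    separators = [list(range(i + 1, i + q + 1)) for i in range(p - q - 1)]
    return cliques, separators

# Example: p = 5, q = 2 gives cliques {0,1,2}, {1,2,3}, {2,3,4}
# and separators {1,2}, {2,3}.
```

Feeding these lists, for q = 0, . . . , 20, into the fractional-marginal-likelihood sketch of §3 reproduces a comparison of the kind summarized in Figures 1 and 2.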

6 Multiplicity-correction priors over graphs

We now explicitly confront the problem of multiple hypothesis testing, where each null hypothesis claims that an edge between two specific nodes should be excluded from the graph. The standard method for avoiding false positives is to use an edge-selection prior that penalizes large models (Dobra et al., 2004; Jones et al., 2005). Specifically, all edge inclusions are modeled as independent Bernoulli events with prior success probability r, which is fixed and common to all edges. This induces the prior probability of a specific graph G with k included edges out of m possible ones:

    π(G) = r^k (1 − r)^{m−k}        (22)

If the expected fraction of included edges is known quite precisely, this framework may be attractive. Yet often this fraction is not known, making an arbitrary choice of r seem heavy-handed—and in any case, we would like a procedure that accommodates a weaker form of prior information.


In such cases, we recommend applying the adaptive-shrinkage approach of Scott & Berger (2006) to graphical models. The idea is to treat r as a model parameter to be estimated from the data, rather than as a fixed tuning constant. This shrinks the graph size to a data-determined value of r, which is estimated from the prevailing edge-inclusion fraction of models with favorable marginal likelihoods. This opens the feedback loop between π(Σ | G) and π(G) referred to in the introduction; multiplicity correction is the automatic result of this adaptive shrinkage.

Estimating r directly can prove challenging. George & Foster (2000) give an EM procedure that can be used to compute an empirical-Bayes estimate r̂ in linear regression models, but note that it can prove computationally intractable in high dimensions, making it even less suited to graphical model spaces. Yet this step can be avoided entirely, since r can be integrated out to induce a set of prior model probabilities that will automatically create the right multiplicity-correction behavior (Scott & Berger, 2007). By Bayes' theorem, we have:

    π(G | D) = ∫_0^1 π(G | D, r) π(r | D) dr = ∫_0^1 [ π(G | r) f(D | G, r) / f(D | r) ] · [ π(r) f(D | r) / f(D) ] dr        (23)

Conditional independence then allows:

    π(G | D) = (1/f(D)) ∫_0^1 π(G | r) f(D | G) π(r) dr = [ f(D | G) / f(D) ] ∫_0^1 π(G | r) π(r) dr        (24)

Outside the integral, the numerator f(D | G) is the marginal likelihood of the data under a specific graph, and the normalizing constant f(D) is the same for all models and can thus be ignored when computing Bayes factors. The term involving the integral must then be the prior probability of the graph, π(G). Assuming the conjugate prior r ∼ Be(a, b) gives:

    π(G) = ∫_0^1 π(G | r) π(r) dr ∝ β(a + k, b + m − k) / β(a, b)        (25)

where β(·, ·) is the beta function. For the default choice of a = b = 1, implying a uniform prior on r—and a marginal prior inclusion probability of 1/2 for all edges—this reduces to:

    π(G) = (m + 1)^{−1} (m choose k)^{−1} = k! (m − k)! / [ (m + 1) m! ]        (26)
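The prior in (25)–(26) is trivial to evaluate in log form, and a small sketch (illustrative names) also shows the multiplicity penalty at work: the log-prior cost of one additional edge grows with the number of candidate edges m.

```python
from scipy.special import betaln

def log_graph_prior(k, m, a=1.0, b=1.0):
    """log pi(G), up to an additive constant, for a graph with k of m candidate edges, as in (25).

    a = b = 1 corresponds to the uniform prior on r and hence to (26).
    """
    return betaln(a + k, b + m - k) - betaln(a, b)

# The log-prior cost of adding a 23rd edge to a 22-edge graph grows with m,
# which is the automatic multiplicity penalty:
for m in (60, 300, 1225):
    print(m, log_graph_prior(23, m) - log_graph_prior(22, m))
```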

These model probabilities induce exactly the sort of strong multiplicity control that we need, a fact we demonstrate through simulation studies of the graph in Figure 3. These studies consist of fixing the 10-node graph seen there, adding progressively more noise nodes that are unconnected both from the graph and from each other, and testing at each step to see how many edges are flagged. We perform three sets of tests, each under three different methods of assigning prior probabilities to models. The numbers of noise nodes chosen were 5, 15, and 40, which in addition to the 10 connected nodes of Figure 3 imply 60, 300, and 1225 separate hypothesis tests, respectively. In all cases the number of true hypotheses remains fixed at 22, one for each edge in the 10-node graph.

These results are summarized in Table 1. Here, "Fully Bayes" refers to the Scott–Berger multiplicity-correction regime outlined in this section, while "Oracle Bayes" refers to multiplicity correction where the known prior inclusion probability r was plugged into (22) to compute prior graph probabilities. This serves as a reasonable stand-in for an empirical-Bayes estimate of the prior inclusion probability (Cui & George, 2007), as an empirical-Bayes procedure could hardly do better than to plug in the true value.

Notice how the multiplicity-correction priors, especially the fully Bayesian regime, squelch false positives far more effectively than the indifference prior does. This phenomenon is well understood in linear models, but ours is the first demonstration of such an automatic-correction effect in graphical models. The difference between corrected and uncorrected searches is substantial; in the 50-node example, the uncorrected version allows 40 false positives to enter the median graph, whereas the corrected version does not allow a single one.

The automatic correction agrees qualitatively with the oracle-Bayes correction even with the default uniform prior on r, suggesting that the data need no help from a subjective prior in estimating r. The stronger correction of the fully Bayesian regime can be explained by the weakness of some of the edges in the true graph, which contribute to the oracle value of r but not to the estimated one. Yet in contrast to the plug-in prior, our approach does not depend strongly upon the choice of an arbitrary hyperparameter. And if subjective inputs are required, they can be accommodated through a different beta prior on r while still retaining closed-form answers.

Automatic multiplicity correction can also yield surprises that differ quite starkly from uncorrected results. Consider, for example, edges (5, 9) and (6, 7). Using the uncorrected prior, adding more noise nodes makes (6, 7) appear stronger and (5, 9) appear slightly weaker. Yet the opposite happens using the corrected priors: the addition of more noise nodes makes (6, 7) disappear entirely and yet retains (5, 9) at close to its original strength.

This behavior suggests that the multiplicity-correction priors are not merely shrinking inclusion probabilities to 0 as the number of hypothesis tests increases and the number of true hypotheses remains constant. Rather, they are differentially rewarding edges that participate in more parsimonious models.

This is exactly as an explicit preference for sparsity suggests should happen—we should especially prize edges that carry the additional advantage of rendering other mediocre edges unnecessary. This effect happens because of the feedback loop we have opened between the prior π(Σ | G) and the graph prior π(G), allowing us to exploit the Ockham's-razor effect of Bayesian marginal likelihoods. This strongly motivates a careful choice for π(Σ | G), giving the lessons of §5 even greater importance.

7 Example: mutual fund data

We now provide a comprehensive discussion of our methods in a predictive context using real data. A crucial input for many dynamic portfolio-selection problems is the estimated covariance matrix Σ for a collection of asset returns; see, for example, Quintana & West (1987). High-dimensional portfolios can often yield highly variable estimates for this matrix, and graphical models offer a potent tool for regularization and stabilization. As Carvalho & West (2007) show, graphically constrained estimates of Σ can yield portfolios that uniformly dominate their unconstrained counterparts along a variety of dimensions, including risk, transaction costs, and overall profitability. In deciding which graph (or set of graphs) to use for imposing such constraints, one must inevitably confront the issues of prior specification and multiplicity correction.

Our example involves a set of 86 monthly returns for 59 mutual funds from a variety of different sectors: 13 U.S. bond funds, 30 U.S. stock funds, 7 balanced funds investing in both U.S. stocks and bonds, and 9 international stock funds. The monthly returns were split into a 60-month training set and a 26-month prediction set. We studied the predictive performance of the MLE of the unrestricted covariance matrix along with three different graphical modeling regimes involving different combinations of π(Σ | G) and multiplicity-correction priors over graphs.

We also created one final candidate predictor by expanding the prior specification in §6 to include several different parameters corresponding to the prior inclusion probabilities for edges within each of these fund sectors, along with another set of parameters for the inclusion probabilities of edges between funds in different sectors. This reflects our prior understanding that different patterns of covariation might prevail in different sectors, and that these patterns might differ collectively from any between-sector crosstalk. Hence it might be suboptimal to force edges participating in these different covariational patterns to shrink towards some common prior inclusion probability. Instead, we allow block-by-block patterns of (possibly differing) sparsity to emerge, as depicted in Figure 4. Note that global shrinkage could still be accomplished via a hierarchical prior; here we have simply chosen to give each inclusion probability an independent uniform prior.

For each prediction regime, we searched for good models and computed posterior means {Σ̂} from the 60-month training set using feature-inclusion stochastic search, or FINCS (Scott & Carvalho, 2007).

We then used these estimates to predict observations in the 26-month validation set; in each month, we used 56 "observed" returns to predict the remaining 3 "missing" returns (which are known in reality) using the Σ̂'s derived from the 60-month training set. We repeated this process for all of the (59 choose 3) = 32509 combinations of 3 unobserved columns, and then computed the total squared error in imputing the missing values.

These results, given in Table 2, demonstrate that each of the methodological advancements considered here can improve predictive accuracy in covariance selection problems. It is especially interesting to note the strength of the sector-specific multiplicity correction, suggesting that the ease with which our prior specification accommodates structural information can be a real advantage in practice.
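The imputation step is a standard conditional-normal calculation under each estimate Σ̂; the sketch below (our own function names, assuming mean-zero returns and that the 59 fund columns are ordered as in the return matrix) shows one way to accumulate the squared errors reported in Table 2.

```python
import numpy as np
from itertools import combinations

def impute_missing(Sigma, x_obs, obs_idx, mis_idx):
    """Conditional mean of the missing returns given the observed ones under N(0, Sigma)."""
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    return S_mo @ np.linalg.solve(S_oo, x_obs)

def total_prediction_sse(Sigma, validation_returns):
    """Sum of squared imputation errors over every 3-column hold-out pattern."""
    p = validation_returns.shape[1]
    sse = 0.0
    for mis in combinations(range(p), 3):
        mis = list(mis)
        obs = [j for j in range(p) if j not in mis]
        for x in validation_returns:             # one row per validation month
            err = x[mis] - impute_missing(Sigma, x[obs], obs, mis)
            sse += float(err @ err)
    return sse
```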

8 Discussion

In summary, we have introduced a new method of covariance selection based upon objective Bayesian ideas. The strengths of our approach are the theoretical guarantees of §4, the intuitive behavior of fractional marginal likelihoods demonstrated in §5, and the strong control over false positives shown in §6. In particular, the results of §5 suggest that the conventional prior artificially flattens modes in model space, which will compromise the efficiency of any stochastic search procedure. Our prior specification also accommodates subjective information about the structure of the problem in a flexible and intuitive way, which may, as in the case of the mutual-fund example, improve predictive accuracy. Moreover, these theoretical developments ensure that such predictions rest upon the firm ground of a well-behaved set of posterior model probabilities.

Throughout this paper, we have considered moderate-dimensional problems for which the FINCS algorithm of Scott & Carvalho (2007) is very efficient. Of course, as dimension grows, any such search methodology will eventually flounder; in such cases, we refer the reader to Dobra et al. (2004), who outline a constructive method for model determination relying upon a family of complete conditional regressions. We note that the theoretical developments of this paper can fold seamlessly into their procedure, thereby extending our objective Bayesian approach to very high-dimensional problems.

References

Atay-Kayis, A. & Massam, H. (2005). The marginal likelihood for decomposable and non-decomposable graphical Gaussian models. Biometrika 92, 317–35.

Berger, J. O. (2006). The case for objective Bayesian analysis. Bayesian Anal. 1, 385–402.

Berger, J. O. & Pericchi, L. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of Institute of Mathematical Statistics Lecture Notes – Monograph Series. Beachwood, pp. 135–207.

Carvalho, C., Massam, H. & West, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika 94, 647–59.

Carvalho, C. & West, M. (2007). Dynamic matrix-variate graphical models. Bayesian Anal. 2, 69–96.

Cui, W. & George, E. I. (2007). Empirical Bayes vs. fully Bayes variable selection. J. Statist. Plann. Inference, to appear.

Dawid, A. P. & Lauritzen, S. L. (1993). Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272–317.

Dempster, A. (1972). Covariance selection. Biometrics 28, 157–75.

Dobra, A., Jones, B., Hans, C., Nevins, J. & West, M. (2004). Sparse graphical models for exploring gene expression data. J. Mult. Anal. 90, 196–212.

Geisser, S. & Cornfield, J. (1963). Posterior distributions for multivariate normal parameters. J. R. Stat. Soc. Ser. B Stat. Methodol. 25, 368–76.

George, E. I. & Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87, 731–47.

Giudici, P. & Green, P. J. (1999). Decomposable graphical Gaussian model determination. Biometrika 86, 785–801.

Jefferys, W. & Berger, J. (1992). Ockham's razor and Bayesian analysis. Am. Sci. 80, 64–72.

Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford University Press.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. & West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci. 20, 388–400.

Lauritzen, S. L. (1996). Graphical Models. Oxford: Clarendon Press.

Liang, F., Paulo, R., Molina, G., Clyde, M. & Berger, J. (2007). Mixtures of g-priors for Bayesian variable selection. J. Amer. Statist. Assoc., to appear.

O'Hagan, A. (1995). Fractional Bayes factors for model comparison. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 99–138.

Quintana, J. & West, M. (1987). Multivariate time series analysis: New techniques applied to international exchange rate data. Statistician 36, 275–81.

Roverato, A. (2000). Cholesky decomposition of a hyper-inverse Wishart matrix. Biometrika 87, 99–112.

Scott, J. G. & Berger, J. O. (2006). An exploration of aspects of Bayesian multiple testing. J. Statist. Plann. Inference 136, 2144–62.

Scott, J. G. & Berger, J. O. (2007). Multiple testing and stochastic search in Bayesian variable selection. Tech. rep., Duke University Department of Statistical Science.

Scott, J. G. & Carvalho, C. M. (2007). Feature-inclusion stochastic search for Gaussian graphical models. Tech. rep., Duke University Department of Statistical Science.

Sun, D. & Berger, J. O. (2007). Objective priors for the multivariate normal model. In Bayesian Statistics VIII, Eds. J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith & M. West. Oxford University Press, to appear.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Elsevier, pp. 233–43.

Appendix A: Proof of Lemma 4.3

Proof. We start by assuming, without loss of generality, that z is a neighbor of all k variables in Y. This can always be achieved by marginalizing over the non-neighbors of z, an operation that changes the conditional independence pattern of Y but maintains the decomposability of the graph and allows application of the results quoted below.

Let Ω = Σ^{−1} = Φ'Φ be partitioned as:

    ( Ω_zz  Ω_zY ; Ω_Yz  Ω_YY ) = ( Φ_zz  0 ; Φ_zY'  Φ_YY' ) ( Φ_zz  Φ_zY ; 0  Φ_YY ).

Recall that (Σ_zz − Σ_zY Σ_YY^{−1} Σ_Yz)^{−1} = Σ_{z|Y}^{−1} = Ω_zz = Φ_zz'Φ_zz, and that Σ_zY Σ_YY^{−1} = −Φ_zz^{−1} Φ_zY = −Ω_zz^{−1} Ω_zY.

If (Σ | G) ∼ HIW(b, D), properties of the Cholesky decomposition of the hyper-inverse Wishart (Roverato, 2000; Atay-Kayis & Massam, 2005) allow us to write Ψ = ΦT^{−1}, where D^{−1} = T'T, with

    Ψ_zz² ∼ Ga( (b + k)/2, 1/2 )   and   Ψ_zY ∼ N(0, I).        (27)

It is straightforward to see from (27) that

    Ω_zz = Φ_zz'Φ_zz = Φ_zz² = (Ψ_zz T_zz)² ∼ Ga( (b + k)/2, T_zz^{−2}/2 ),

so that

    Σ_{z|Y}^{−1} ∼ Ga( (b + k)/2, D_{z|Y}/2 ),

which proves part (ii) of the Lemma.

Turning to the form of Γ = Σ_zY Σ_YY^{−1} = −Φ_zz^{−1} Φ_zY and writing it as a function of Ψ and T, we get

    γ_i = −( T_{z y_i}/T_{zz} + (1/(Ψ_{zz} T_{zz})) ∑_{j=1}^{i} Ψ_{z y_j} T_{y_j y_i} )        (28)

Given Φ_zz, this is just a linear combination of independent standard normals, so that

    (Γ | Φ_zz) ∼ N( −T_zz^{−1} T_zY, Φ_zz^{−2} T_YY'T_YY )        (29)

    (Σ_zY Σ_YY^{−1} | Σ_{z|Y}) ∼ N( D_zY D_YY^{−1}, Σ_{z|Y} D_YY^{−1} )        (30)

proving part (i) of the Lemma.


[Figure 1 here: four panels — (a) HIW-g (n = 20); (b) HIW(3, I) (n = 20); (c) HIW-g (n = 1000); (d) HIW(3, I) (n = 1000).]

Figure 1: Simulation study: distribution of marginal likelihoods under the conventional proper HIW prior and the HIW-g prior for different band-diagonal models. The x-axis reflects the bandwidth of the precision matrix; the y-axis is marginal log-likelihood.


[Figure 2 here: two panels, (a) HIW-g and (b) HIW(3, I); x-axis AR(p) for p = 5, . . . , 15, rows n = 20, 50, 100, 500, 1000.]

Figure 2: The empirical distribution of the best AR(p) model for various sample sizes. The area of each circle represents the fraction out of 1000 independent data sets for which the best model (defined by the highest marginal likelihood under the given prior) corresponds to the given value of p.

[Figure 3 here: a 10-node undirected graph with nodes labeled 1–10.]

Figure 3: The true 10-node, decomposable graph used in the multiplicity-correction study. All noise nodes were completely unconnected both from this graph and from each other, meaning that any edges involving them are false positives.


                                Number of Noise Nodes
             No Correction          Oracle Bayes           Fully Bayes
Edge        5     15     40        5     15     40        5     15     40
(1,6)     .999   .999   .999     .997   .998   .998     .991   .994   .998
(3,4)     .155   .016   .001     .011   .003   .029     .014   .006   .006
(3,6)     .991   .998   .999     .997   .998   .970     .989   .993   .993
(3,8)     .183   .004   .001     .099   .017   .002     .055   .011   .002
(3,9)     .308   .030   .001     .003   .001   .001     .004   .001   .001
(4,6)     .999   .999   .999     .999   .999   .999     .999   .999   .999
(4,9)     .998   .999   .999     .999   .999   .999     .999   .999   .999
(5,6)     .155   .052   .003     .221   .212   .183     .245   .234   .218
(5,9)     .362   .142   .310     .309   .305   .341     .308   .312   .329
(6,7)     .581   .755   .735     .135   .024   .005     .083   .030   .005
(6,9)     .223   .027   .001     .008   .001   .001     .006   .001   .001
(8,9)     .433   .040   .071     .138   .023   .002     .069   .013   .002
(9,10)    .889   .950   .999     .713   .694   .763     .601   .599   .691
FPs:        6     11     40        1      1      1        1      1      0

Table 1: Estimated inclusion probabilities for selected edges in Figure 3 as the number of unconnected noise nodes, and hence the number of spurious edges eligible for inclusion, increases. The fully Bayesian correction used a uniform prior for r, the prior inclusion probability. Marginal likelihoods were computed using fractional Bayes factors for the n = 50 sample. The last line of the table shows the number of false-positive flags—that is, edges with ≥ 50% inclusion probability—among the other, non-enumerated edges. All probabilities were calculated using 5 million iterations of the FINCS algorithm (Scott & Carvalho, 2007).

            Method                              Predictive Squared Error
Likelihood        Multiplicity                 Top Model    Model Average
HIW-g             Sector-specific                554.654        542.342
HIW-g             Global                         623.550        611.214
HIW-g             No Correction                  642.874        636.176
HIW(3, τI)        No Correction                  679.643        661.077
Saturated Model                                1,749.072

Table 2: Sum of squared errors in predicting missing data in the mutual-fund example. All model searches used 3 million iterations of FINCS on the 60-month training set. Model-averaged predictions are averaged over the top 500 models discovered during the course of the search.


[Figure 4 here: two heat-map panels over the 59 funds, with a common colour scale from 0.0 to 1.0.]

Figure 4: The posterior mean of the prior inclusion probabilities (L) for edges in different sectors of the portfolio example, to which the various posterior edge inclusion probabilities (R) are adaptively shrunk.
