Variational Gaussian Copula Inference

arXiv:1506.05860v1 [stat.ML] 19 Jun 2015

Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin

June 22, 2015

Abstract

We utilize copulas to constitute a unified framework for constructing and optimizing variational proposals. The optimal variational posterior under Sklar's representation is found by minimizing the KL divergence to the true posterior. For models with continuous and non-Gaussian hidden variables, we propose a Gaussian copula family that preserves multivariate posterior dependence. We further employ a class of nonparametric transformations based on Bernstein polynomials that provide ample flexibility in characterizing the univariate marginal posteriors.

1 Introduction

A crucial component of Bayesian inference is approximating the posterior distribution, which represents the current state of knowledge about the latent variables x after data y have been observed. When intractable integrals are involved, variational inference methods find an approximation q(x) to the posterior distribution p(x|y) by minimizing the Kullback-Leibler (KL) divergence $\mathrm{KL}[q(x)\,\|\,p(x|y)] = \int q(x) \log \frac{q(x)}{p(x|y)}\, dx$, providing a lower bound for the marginal likelihood. To make inference tractable, mean-field variational Bayes (MFVB) methods (Jordan et al., 1999; Wainwright and Jordan, 2008) assume q(x) is factorized over a certain partition of the latent variables $x \equiv [x_1, \ldots, x_J]$, $q_{\mathrm{VB}}(x) = \prod_j q_{\mathrm{VB}}(x_j)$, with free-form marginal densities $q_{\mathrm{VB}}(x_j)$ and with correlations between partitions neglected. The structured mean-field approach (Saul and Jordan, 1996) preserves partial correlations but only applies to models with readily identified substructures. The variational Gaussian (VG) approximation (Barber and Bishop, 1998; Opper and Archambeau, 2009) allows incorporation of correlations by postulating a multivariate Gaussian parametric form $q_{\mathrm{VG}}(x) = \mathcal{N}(\mu, \Sigma)$ with continuous margins in unconstrained space. Hence, the VG approximation may not be suitable for variables that are inherently positive, strongly skewed, or heavy tailed. For multi-modal posteriors, a mixture of MFVB proposals (Jaakkola and Jordan, 1998) or a mixture of uniformly-weighted Gaussians (Gershman et al., 2012) may be employed, which usually requires a further lower bound on the average of the logarithm of the mixture distribution.

In order to address the limitations of current variational methods, which fail to simultaneously characterize posterior covariance broadly across the latent variables while allowing multimodality, skewness, and other characteristics, we propose a variational copula framework. Our new copula-based variational approach decouples the inference task into a multivariate dependence structure, captured by the copula function, and a number of univariate margins, which are allowed to take essentially any form. Following recent advances in automated (black-box) variational inference (Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014; Nguyen and Bonilla, 2014), we present a stochastic optimization algorithm for generic hierarchical Bayesian models with continuous variables, which (i) requires minimal model-specific derivations; (ii) reproduces peculiarities of the true marginal posteriors; and (iii) identifies interpretable dependence structure between latent variables.

2 Variational Copula Inference Framework

Sklar's theorem (Sklar, 1959) ensures that any multivariate joint distribution Q can be written in terms of univariate marginal distributions $F_j(x) = P(X_j \leq x)$, $j = 1, \ldots, p$, and a copula C which describes the dependence structure between the variables, such that

$$Q(x_1, \ldots, x_p) = C[F_1(x_1), \ldots, F_p(x_p)]. \quad (1)$$

Conversely, if C is a copula and $\{F_j\}_{j=1:p}$ are distribution functions, then the function Q defined by (1) is a p-dimensional joint distribution function with marginal distributions $F_1, F_2, \ldots, F_p$; this is termed the marginally closed property (Song, 2000). Assuming $Q(x_1, \ldots, x_p)$ has p-th order partial derivatives, the joint probability density function (PDF) is $q(x_1, \ldots, x_p) = c_\Theta[F_1(x_1), \ldots, F_p(x_p)] \prod_{j=1}^p f_j(x_j)$, where $f_j(x_j)$ is the PDF of the j-th variable and $F_j(x_j) = \int_{-\infty}^{x_j} f_j(t)\, dt$ is the corresponding cumulative distribution function (CDF). Sklar's theorem allows us to separate the marginal distributions $F_j(x_j)$ from the dependence structure, which is appropriately expressed in the copula function C. As a modeling tool, a specified copula function with parametric/nonparametric margins can be fitted directly to the observed data y (Liu et al., 2009; Wauthier and Jordan, 2010; Lopez-Paz et al., 2013), with parameters optimized via Bayesian or maximum likelihood estimators (see Smith (2013) and the references therein). In contrast, our goal is to use a copula as an inference engine for full posterior approximation.¹

¹Tran et al. (2014) is a very different approach: its goal is density estimation on observed data, and the proposed "copula-type" model is not strictly a copula. The joint dependence between transformed data is captured by multivariate mixture models, and only the mixture parameters are estimated via mean-field variational Bayes.

All the unknowns (variables/parameters) in the user-specified hierarchical model are encapsulated into a vector x, and the optimal variational approximation $q_{\mathrm{VC}}(x)$ to the true posterior p(x|y) is found under Sklar's representation. This approach provides the user with full freedom of modeling (it applies to general models) and does not require conditional conjugacy between latent variables. Within some tractable copula family $C \in \mathcal{C}$, and assuming $F(\cdot)$ and $C(\cdot)$ to be differentiable, we construct the variational proposal as $q_C(x) = c[F_1(x_1), \ldots, F_p(x_p)] \prod_{j=1}^p f_j(x_j)$, such that the optimal approximation satisfies

$$q_C^\star(x) = \arg\min_{q_C(x)} \mathrm{KL}[q_C(x)\,\|\,p(x|y)] = \arg\min_{q_C(x)} \big\{ \mathrm{KL}[q_C(x)\,\|\,p(x)] - \mathbb{E}_{q_C(x)}[\ln p(y|x)] \big\},$$

where p(y|x) is the likelihood and p(x) is the prior. The evidence lower bound (ELBO) of our variational copula (VC) approximation is

$$\mathcal{L}_{\mathrm{VC}}[q_C(x)] = H[c(u)] + \sum_{j=1}^p H[f_j(x_j)] + \int c[F(x)] \prod_{j=1}^p f_j(x_j)\, \ln p(y, x)\, dx, \quad (2)$$

where $u_j = F_j(x_j)$ and $H[f(x)] = -\int f(x) \ln f(x)\, dx$ denotes differential entropy. Classical methods, such as mean-field VB (MFVB) and the variational Gaussian (VG) approximation, are special cases of the proposed VC inference framework.

• Special case 1 (MFVB): The mean-field proposal corresponds to the independence copula $C_\Pi(u) = \prod_{j=1}^J u_j$ with free-form marginal densities $f_j(x_j)$. Given $c_\Pi(u) = 1$ and differential entropy $H[c_\Pi(u)] = 0$, we have $q_\Pi(x) = c_\Pi(u) \prod_j f_j(x_j) = \prod_j f_j(x_j) = q_{\mathrm{VB}}(x)$ and $\mathcal{L}_{\mathrm{VC}}[q_\Pi(x)] \equiv \mathcal{L}_{\mathrm{MFVB}}[q_{\mathrm{VB}}(x)]$. If MFVB is not fully factorized, i.e. J < p, then by the impossibility theorem (Nelsen, 2007) the independence copula is the only copula for which the marginally closed property holds.

• Special case 2 (VG): The multivariate Gaussian proposal $q_{\mathrm{VG}}(x) = \mathcal{N}(x; \mu, \Sigma)$ can be written as $q_{\mathrm{VG}}(x) = c_G(u|\Upsilon) \prod_{j=1}^p \phi_j(x_j; \mu_j, \sigma_j^2)$, where $c_G(u|\Upsilon)$ is the Gaussian copula density (Song, 2000) indexed by the parameter $\Upsilon = D^{-1/2} \Sigma D^{-1/2}$, $D = \mathrm{diag}(\Sigma)$, and $\{\phi_j(x_j; \mu_j, \sigma_j^2)\}_{j=1:p}$ are univariate Gaussian marginal densities. Since $H[c_G(u|\Upsilon)] = \frac{1}{2} \ln |\Upsilon|$, we have $\mathcal{L}_{\mathrm{VC}}[q_C(x)] \equiv \mathcal{L}_{\mathrm{VG}}[q_{\mathrm{VG}}(x)]$.

Concerning analytical tractability and simplicity, in the sequel we concentrate on variational proposals constructed via a Gaussian copula with continuous margins. Interestingly, for non-conjugate models in which the exact posterior cannot be written in closed form, a Sklar's representation of the approximate posterior is still provided, consisting of an analytical Gaussian copula with correlation matrix Υ and a number of univariate margins (summarized as univariate histograms if not available in closed form).

3 Variational Gaussian Copula Inference

A Gaussian copula function with p × p correlation matrix Υ is defined as $C_G(u_1, \ldots, u_p | \Upsilon) = \Phi_p(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p) | \Upsilon): [0, 1]^p \to [0, 1]$, where $\Phi(\cdot)$ is shorthand for the CDF of $\mathcal{N}(0, 1)$ and $\Phi_p(\cdot|\Upsilon)$ is the CDF of $\mathcal{N}_p(0, \Upsilon)$. The Gaussian copula density is

$$c_G(u_1, \ldots, u_p | \Upsilon) = (\det \Upsilon)^{-1/2} \exp\Big(-\frac{1}{2} z^T (\Upsilon^{-1} - I_p) z\Big), \qquad z = [\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p)]^T.$$

In the proposed variational Gaussian copula (VGC) approximation, we assume the variational proposal $q_{\mathrm{VGC}}(x)$ is constructed via a Gaussian copula with marginal CDFs $F_j(x_j)$, i.e., $q_{\mathrm{VGC}}(x) = c_G(u|\Upsilon) \prod_{j=1}^p f_j(x_j)$, where $u = [F_1(x_1), \ldots, F_p(x_p)]$. Below we offer a reinterpretation of MFVB and VG under the proposed variational copula inference framework.
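As a quick numerical illustration (a minimal sketch of our own, not code released with the paper, assuming NumPy/SciPy), the snippet below evaluates the Gaussian copula density above and checks the Special case 2 identity $\mathcal{N}(x; \mu, \Sigma) = c_G(u|\Upsilon) \prod_j \phi_j(x_j; \mu_j, \sigma_j^2)$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_logpdf(u, Upsilon):
    """log c_G(u | Upsilon) = -0.5*log|Upsilon| - 0.5 * z^T (Upsilon^{-1} - I) z."""
    z = norm.ppf(u)
    _, logdet = np.linalg.slogdet(Upsilon)
    quad = z @ (np.linalg.inv(Upsilon) - np.eye(len(u))) @ z
    return -0.5 * logdet - 0.5 * quad

# Check Special case 2: a multivariate Gaussian factors into its Gaussian
# copula (correlation Upsilon) times univariate Gaussian margins.
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
sd = np.sqrt(np.diag(Sigma))
Upsilon = Sigma / np.outer(sd, sd)          # D^{-1/2} Sigma D^{-1/2}

x = np.array([0.2, 0.3])
u = norm.cdf(x, loc=mu, scale=sd)           # u_j = F_j(x_j)
lhs = gaussian_copula_logpdf(u, Upsilon) + norm.logpdf(x, loc=mu, scale=sd).sum()
rhs = multivariate_normal.logpdf(x, mean=mu, cov=Sigma)
print(np.allclose(lhs, rhs))                # True
```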

3.1 KL Divergence under Sklar's Representation

Let the true posterior p(x|y) in Sklar's representation be $p(x|y) = c^\star(v) \prod_j f_j^\star(x_j)$, where $v = [F_1^\star(x_1), \ldots, F_p^\star(x_p)]$, and $c^\star(v)$ and $\{f_j^\star(x_j)\}_{j=1:p}$ are the true underlying copula density and marginal posterior densities, respectively. The KL divergence then decomposes into additive terms,

$$\mathrm{KL}[q_{\mathrm{VGC}}(x)\,\|\,p(x|y)] = \mathrm{KL}\big[c_G(u)\,\|\,c^\star[F^\star(F^{-1}(u))]\big] + \sum_j \mathrm{KL}[f_j(x_j)\,\|\,f_j^\star(x_j)]. \quad (3)$$

If we allow free-form margins and assume the true posterior margins are found, i.e. $F_j \equiv F_j^\star$ for all j, then $\mathrm{KL}[q_{\mathrm{VGC}}(x)\,\|\,p(x|y)] = \mathrm{KL}[c_G(u)\,\|\,c^\star(u)]$, which is the lowest achievable KL divergence in variational Gaussian copula (VGC) inference. In comparison, MFVB assumes an independence copula and only optimizes free-form margins,

$$\mathrm{KL}[q_{\mathrm{MFVB}}(x)\,\|\,p(x|y)] = \mathrm{KL}\big[c_\Pi(u)\,\|\,c^\star[F^\star(F^{-1}(u))]\big] + \sum_j \mathrm{KL}[f_j(x_j)\,\|\,f_j^\star(x_j)]. \quad (4)$$

Similarly, the lowest achievable KL divergence is $\mathrm{KL}[c_\Pi(u)\,\|\,c^\star(u)]$, the discrepancy between the independence copula and the true copula. As shown in (4), the objective function contains two terms. The goal of finding accurate margins is disturbed by the first term, which also involves the marginal CDFs $\{F_j\}_{j=1:p}$. This indicates why MFVB usually cannot find the true marginal posteriors in practice (e.g., variances can be underestimated (Neville et al., 2014)), even though it allows free-form margins.


In contrast, VG presumes a Gaussian copula function as a (hopefully) accurate approximation to the true copula, but is restricted to univariate Gaussian margins $\{\phi_j(x_j; \mu_j, \sigma_j^2)\}_{j=1:p}$,

$$\mathrm{KL}[q_{\mathrm{VG}}(x)\,\|\,p(x|y)] = \mathrm{KL}\big[c_G(u)\,\|\,c^\star[F^\star(\Phi^{-1}(u))]\big] + \sum_j \mathrm{KL}[\phi_j(x_j)\,\|\,f_j^\star(x_j)]. \quad (5)$$

We can see in (5) that even if the true underlying copula is a Gaussian copula, there is still a discrepancy $\sum_j \mathrm{KL}[\phi_j(x_j)\,\|\,f_j^\star(x_j)]$ between the margins. Comparing (4) and (5) with (3), our VGC method extends MFVB and VG, and improves upon both by allowing updates of both the copula parameter Υ and the marginal densities $\{f_j(x_j)\}_{j=1:p}$.

3.2 Variational Parameter Expansion

Directly optimizing the ELBO (2) leads to a non-trivial variational calculus problem. For computational convenience, we incorporate auxiliary variables by exploiting the latent variable representation of the Gaussian copula: $x_j = F_j^{-1}(u_j)$, $u_j = \Phi(z_j)$, $z \sim \mathcal{N}_p(0, \Upsilon)$. Let $g_j(\cdot) = F_j^{-1}(\Phi(\cdot))$ be bijective monotonic non-decreasing functions, so that $x_j = g_j(z_j)$ for all j. The Jacobian transformation gives

$$q_{\mathrm{VGC}}(x) = \int \prod_{j=1}^p \delta(x_j - g_j(z_j))\, q_G(z; 0, \Upsilon)\, dz = q_G(g^{-1}(x); 0, \Upsilon) \prod_{j=1}^p \Big| \frac{d}{dx_j} g_j^{-1}(x_j) \Big|.$$

It is not convenient to optimize the correlation matrix Υ of interest directly, since Υ is a positive semi-definite matrix with ones on the diagonal and off-diagonal elements in [−1, 1]. We adopt the parameter expansion (PX) technique (Liu et al., 1998; Liu and Wu, 1999), which has been applied to accelerating variational Bayes (Qi and Jaakkola, 2006) and to sampling a correlation matrix (Talhouk et al., 2012). Further considering $\tilde{z}_j = t_j^{-1}(z_j) = \mu_j + \sigma_{jj} z_j$, $\tilde{z} \sim \mathcal{N}_p(\mu, \Sigma)$, $\Sigma = D \Upsilon D^T$, $D = [\mathrm{diag}(\sigma_{jj})]_{j=1:p}$, we have $x_j = g_j(z_j) = g_j(t_j(\tilde{z}_j)) := h_j(\tilde{z}_j)$, where $h_j(\cdot) = g_j \circ t_j(\cdot)$ are also bijective monotonic non-decreasing functions, and the variational proposal can be further written as

$$q_{\mathrm{VGC}}(x) = \int \prod_{j=1}^p \delta(x_j - h_j(\tilde{z}_j))\, q_G(\tilde{z}; \mu, \Sigma)\, d\tilde{z} = q_G(h^{-1}(x); \mu, \Sigma) \prod_{j=1}^p \Big| \frac{d}{dx_j} h_j^{-1}(x_j) \Big|.$$

The ELBO under Sklar's representation (2) is therefore translated into the Jacobian representation

$$\mathcal{L}_{\mathrm{VGC}}[\mu, \Sigma; h_{1:p}] = H[\mathcal{N}(\tilde{z}; \mu, \Sigma)] + \sum_{j=1}^p \langle \ln h_j'(\tilde{z}_j) \rangle_{\mathcal{N}(\tilde{z}_j; \mu_j, \sigma_{jj}^2)} + \langle \ln p(y, h(\tilde{z})) \rangle_{\mathcal{N}(\tilde{z}; \mu, \Sigma)}.$$

Given the transformations $\{h_j\}_{j=1:p}$, $q_G(\tilde{z}; \mu, \Sigma)$ can be further reparameterized by the Cholesky decomposition $\Sigma = C C^T$, where C is a square lower triangular matrix. Detailed deterministic and stochastic updates are introduced in Challis and Barber (2013) and Titsias and Lázaro-Gredilla (2014). The monotonic transformations can be specified according to the desired parametric form of the marginal posteriors, $h_j(\cdot) = F_j^{-1}[\Phi(t(\cdot))]$, for example a log-normal distribution for positive variables. In this special case, the auxiliary parameters $\{\mu_j, \sigma_j^2\}_{j=1:p}$ also serve as marginal parameters, $h(\cdot)$ has no additional parameters to optimize, and $\ln h'(\cdot)$ takes a simple form. The following illustrative example shows the posterior dependence captured by VGC methods.
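To make the construction concrete, here is a minimal sketch (our own illustration, not the original implementation, assuming NumPy/SciPy) of the reparameterized proposal in the log-normal special case, where $h_j(\tilde{z}_j) = \exp(\tilde{z}_j)$: sampling uses $\tilde{z} = \mu + C\epsilon$ and the density follows from the Jacobian formula above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p = 2
mu = np.array([-1.0, 0.5])
C = np.array([[0.8, 0.0],
              [0.3, 0.6]])                   # lower-triangular Cholesky factor of Sigma

def sample_qvgc(n):
    z_tilde = mu + rng.standard_normal((n, p)) @ C.T   # z~ ~ N(mu, C C^T)
    return np.exp(z_tilde)                             # x = h(z~) > 0

def log_qvgc(x):
    z_tilde = np.log(x)                                # h^{-1}(x)
    log_gauss = multivariate_normal.logpdf(z_tilde, mean=mu, cov=C @ C.T)
    log_jac = -np.log(x).sum(axis=-1)                  # |d h^{-1}/dx_j| = 1/x_j
    return log_gauss + log_jac

x = sample_qvgc(5)
print(log_qvgc(x))
```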

3.3 An Illustrative Example

The hierarchical form of the normal inverse gamma-gamma (NIGG) model (also considered in Neville et al. (2014)) is $y|\tau \sim \mathcal{N}(0, \tau)$, $\tau|\gamma \sim \mathrm{IG}(0.5, \gamma)$, $\gamma \sim \mathrm{G}(0.5, 1)$. Here we assume y = 0.01 is the (single) observation. The full conditional posterior distributions are $p(\tau|y, \gamma) = \mathrm{IG}(\tau; 1, y^2/2 + \gamma)$ and $p(\gamma|\tau) = \mathrm{G}(\gamma; 1, 1/\tau + 1)$, so Gibbs sampling and MFVB are available in closed form. Denoting $x = (x_1, x_2) = (\tau, \gamma)$, we construct a variational Gaussian copula proposal with (1) a bivariate Gaussian copula, and (2) a fixed-form margin for both $x_1 = \tau \in (0, \infty)$ and $x_2 = \gamma \in (0, \infty)$; we employ $f_j(x_j; \mu_j, \sigma_{jj}^2) = \mathrm{LN}(x_j; \mu_j, \sigma_{jj}^2)$, $x_j = h_j(\tilde{z}_j) = \exp(\tilde{z}_j) = g(z_j) = \exp(\sigma_{jj} z_j + \mu_j)$, j = 1, 2. We implemented both deterministic (available in this special case) and stochastic variants of the VGC algorithm, and compared them with two baselines: (i) a Gibbs sampler ($5 \times 10^5$ samples), and (ii) mean-field variational Bayes (MFVB) (see the Appendix for full details).

[Figure 1: (a) Approximated joint and marginal posteriors (shown in log space for visualization purposes); (b) convergence of the ELBO. Notation: LN: log-normal margin; (D): deterministic implementation; (S): stochastic implementation; -full: full covariance matrix; -diag: diagonal covariance.]

From Figure 1(a), the VGC methods with a full correlation matrix are able to preserve the interdependence between the variables and alleviate the under-estimation of the posterior variance. Figure 1(b) shows that the VGC methods with log-normal margins lead to a higher ELBO than MFVB; the gain is lost under the factorized assumption Υ = I, in which case the Gaussian copula reduces to the independence copula. Since the Gaussian copula admits neither lower nor upper tail dependence, the posterior dependence it preserves can be restrictive. Exploring other copula families that allow more complex posterior dependencies in variational copula inference is a topic for future research.

Given the copula function C, we only need to find p one-dimensional margins. However, without knowing the characteristics of the latent variables, specifying an appropriate parametric form for the margins is difficult in general. First, the marginals might exhibit multi-modality, high kurtosis, or skewness, which particular parametric marginals struggle to capture. Second, a tractable inverse CDF with optimizable parameters is required but is unavailable in all but a handful of cases. Instead of using some arbitrary parametric form, we construct bijective transform functions via kernel mixtures, which lead to highly flexible (ideally free-form) marginal proposals.
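For concreteness, a hypothetical re-implementation of the Gibbs baseline (our own sketch, assuming NumPy; the paper does not provide code) uses the closed-form full conditionals quoted above:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 0.01
n_samples = 500_000
tau, gamma = 1.0, 1.0
log_samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # tau | y, gamma ~ IG(1, y^2/2 + gamma): inverse-gamma via reciprocal gamma draw.
    tau = (y**2 / 2 + gamma) / rng.gamma(shape=1.0)
    # gamma | tau ~ Gamma(shape=1, rate=1/tau + 1)  (numpy uses the scale parameter).
    gamma = rng.gamma(shape=1.0, scale=1.0 / (1.0 / tau + 1.0))
    log_samples[t] = np.log(tau), np.log(gamma)        # stored in log space as in Figure 1
```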

4 Continuous Margins Constructed via Bernstein Polynomials

The marginal densities in VGC can be recovered through the Jacobian transformation,

$$f_j(x_j) = q_G(h_j^{-1}(x_j); \mu_j, \sigma_{jj}^2)\, \Big|\frac{d}{dx_j} h_j^{-1}(x_j)\Big| = q_G(h_j^{-1}(x_j); \mu_j, \sigma_{jj}^2)\, \frac{1}{h_j'(h_j^{-1}(x_j))}, \quad (6)$$

where the $[h_j'(h_j^{-1}(x_j))]^{-1}$ term is interpreted as a marginal-correction term. To guarantee analytical tractability, we require $h(\cdot)$ to be (i) bijective; (ii) monotonic non-decreasing; (iii) of a range matching the (unbounded or constrained) support of the variable; (iv) differentiable with respect to both its argument and its parameters; and (v) sufficiently flexible. We propose a class of continuous and smooth transformations $h(\cdot)$, constructed via kernel mixtures, that automatically have these desirable properties.

4.1 Sandwich-Type Construction of Transformations

The Bernstein polynomials (BPs) have a uniform convergence property for continuous functions on the unit interval [0, 1] and have been used for nonparametric density estimation (Petrone, 1999). Building on this, a function $h(\tilde{z})$ mapping from $(-\infty, \infty)$ to some target range is constructed as²

$$h(\tilde{z}) = \Psi^{-1}[B(\Phi(\tilde{z}); k, \omega)], \qquad B(u; k, \omega) = \sum_{r=1}^k \omega_{r,k}\, I_u(r, k - r + 1), \quad (7)$$

where $I_u(r, k - r + 1)$ is the regularized incomplete beta function, $\Phi(\cdot)$ is the standard normal CDF mapping from $(-\infty, \infty)$ to [0, 1], and $\Psi^{-1}(\cdot)$ is some predefined inverse CDF with fixed parameters; for example, the inverse CDF of the exponential distribution maps from [0, 1] to $(0, \infty)$ for positive variables. The degree k is an unknown smoothing parameter, and ω denotes the unknown mixture weights on the probability simplex $\Delta_k = \{(\omega_1, \ldots, \omega_k): \omega_j \geq 0, \sum_j \omega_j = 1\}$. According to the chain rule, the first derivative of $h(\cdot)$ is

$$h'(\tilde{z}) = \frac{dh(\tilde{z})}{d\tilde{z}} = \frac{d\Psi^{-1}[B(\Phi(\tilde{z}); k, \omega)]}{dB(\Phi(\tilde{z}); k, \omega)} \cdot \frac{dB(\Phi(\tilde{z}); k, \omega)}{d\Phi(\tilde{z})} \cdot \frac{d\Phi(\tilde{z})}{d\tilde{z}} = \frac{b(\Phi(\tilde{z}); k, \omega)\, \phi(\tilde{z})}{\psi(h(\tilde{z}))}, \quad (8)$$

where $b(u; k, \omega) = \sum_{r=1}^k \omega_{r,k}\, \beta(u; r, k - r + 1)$ and $\beta(x; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1}$ is the beta density. Moreover, $\ln h'(\tilde{z})$ takes an analytical expression,

$$\ln h'(\tilde{z}) = \ln b(\Phi(\tilde{z}); k, \omega) + \ln \phi(\tilde{z}) - \ln \psi(h(\tilde{z})).$$

Therefore, the stochastic term

$$\ell_s(\tilde{z}, h_{1:p}, y) = \sum_{j=1}^p \ln h_j'(\tilde{z}_j) + \ln p(y, h(\tilde{z}))$$

in the ELBO $\mathcal{L}_{\mathrm{VGC}}[q_G(\tilde{z}; \mu, \Sigma); h_{1:p}] = (p/2) \ln(2\pi e) + \ln |C| + \langle \ell_s(\tilde{z}, h_{1:p}, y) \rangle_{q_G(\tilde{z})}$ is also analytic. The derivations of deterministic VGC updates are highly model-dependent. First, due to the cross terms often involved in the log likelihood/prior, the corresponding Gaussian expectations and their derivatives may not be analytically tractable. Second, owing to the non-convex nature of many problems, only locally optimal solutions can be guaranteed. In contrast, the stochastic implementation of VGC only requires evaluation of the log-likelihood and log-prior along with their respective derivatives, eliminating most model-specific derivations.

²The index j on $\tilde{z}$ is temporarily omitted for simplicity, and is added back when necessary.
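The sandwich construction is straightforward to evaluate with standard special functions. The sketch below (our own code, under the assumption that $\Psi$ is the $\mathrm{Exp}(\lambda_0)$ CDF used later in the case study) computes $h(\tilde{z})$ and $h'(\tilde{z})$ from (7)–(8) and verifies the derivative by finite differences.

```python
import numpy as np
from scipy.stats import norm, beta, expon

k = 20
omega = np.full(k, 1.0 / k)                 # mixture weights on the simplex
r = np.arange(1, k + 1)                     # basis index r = 1, ..., k
lam0 = 0.1                                  # Psi = CDF of Exp(lam0)

def B(u):
    # B(u; k, omega) = sum_r omega_{r,k} I_u(r, k-r+1); beta.cdf is I_u(a, b).
    return np.sum(omega * beta.cdf(u, r, k - r + 1))

def h(z_tilde):
    return expon.ppf(B(norm.cdf(z_tilde)), scale=1.0 / lam0)    # Psi^{-1}[B(Phi(z~))]

def h_prime(z_tilde):
    u = norm.cdf(z_tilde)
    b_u = np.sum(omega * beta.pdf(u, r, k - r + 1))             # b(u; k, omega)
    return b_u * norm.pdf(z_tilde) / expon.pdf(h(z_tilde), scale=1.0 / lam0)

z, eps = 0.3, 1e-5
print(h_prime(z), (h(z + eps) - h(z - eps)) / (2 * eps))        # should agree
```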

4.2 Stochastic Implementation of VGC

According to Titsias and Lázaro-Gredilla (2014), the gradients of the ELBO with respect to the variational parameters (μ, C) can be written as

$$\nabla_\mu \mathcal{L}_{\mathrm{VGC}} = \mathbb{E}_{q_G(\tilde{z})}\big[\nabla_{\tilde{z}}\, \ell_s(\tilde{z}, h, y)\big], \qquad \nabla_C \mathcal{L}_{\mathrm{VGC}} = \mathbb{E}_{q_G(\tilde{z})}\big[\nabla_{\tilde{z}}\, \ell_s(\tilde{z}, h, y)\, (\tilde{z} - \mu)^T C^{-T}\big] + \Delta_C,$$

where $\Delta_C = [\mathrm{diag}(1/C_{jj})]_{j=1:p}$ and the stochastic gradient terms are

$$\nabla_{\tilde{z}_j} \ell_s(\tilde{z}, h, y) = \nabla_{\tilde{z}_j} \ln p(y, h(\tilde{z})) + \nabla_{\tilde{z}_j} \ln h_j'(\tilde{z}_j) = \frac{\partial \ln p(y, x)}{\partial x_j}\, h_j'(\tilde{z}_j) + \nabla_{\tilde{z}_j} \ln h_j'(\tilde{z}_j). \quad (9)$$

For more general cases, there exist alternative ways to express the gradients as expectations, for example Paisley et al. (2012); see Titsias and Lázaro-Gredilla (2014) for a comparison. We already have $h_j'(\tilde{z}_j)$ in closed form (8). Since $\nabla_{\tilde{z}_j} \ln h_j'(\tilde{z}_j) = h_j''(\tilde{z}_j)/h_j'(\tilde{z}_j)$, we also need the second derivative of $h_j(\cdot)$,

$$h_j''(\tilde{z}_j) = \big[\rho_1'(\tilde{z}_j)\rho_2(\tilde{z}_j)\rho_3(\tilde{z}_j) + \rho_1(\tilde{z}_j)\rho_2'(\tilde{z}_j)\rho_3(\tilde{z}_j) - \rho_1(\tilde{z}_j)\rho_2(\tilde{z}_j)\rho_3'(\tilde{z}_j)\big] / [\rho_3(\tilde{z}_j)]^2,$$

where $\rho_1(\tilde{z}_j) = b(\Phi(\tilde{z}_j); k, \omega^{(j)})$, $\rho_1'(\tilde{z}_j) = \phi(\tilde{z}_j) \sum_{r=1}^k \omega_{r,k}^{(j)} \beta'(\Phi(\tilde{z}_j); r, k - r + 1)$, $\rho_2(\tilde{z}_j) = \phi(\tilde{z}_j)$, $\rho_2'(\tilde{z}_j) = -\tilde{z}_j \phi(\tilde{z}_j)$, $\rho_3(\tilde{z}_j) = \psi(h_j(\tilde{z}_j))$, $\rho_3'(\tilde{z}_j) = \psi'(h_j(\tilde{z}_j))\, h_j'(\tilde{z}_j)$; here $\phi(\cdot)$ is the PDF of $\mathcal{N}(0, 1)$, and $\psi(\cdot)$ and $\psi'(\cdot)$ are the predefined PDF and its derivative, respectively. Defining $\beta(x; 0, b) = \beta(x; a, 0) = 0$, the derivative of the beta density can be written as a combination of two beta densities of lower degree, $\beta'(x; a, b) = (a + b - 1)[\beta(x; a - 1, b) - \beta(x; a, b - 1)]$. Within computational limits, we prefer a higher degree k, as there is no over-fitting issue in this variational density approximation task. Given k, the basis functions are completely known, depending only on the index r. The only parameters left to optimize in the Bernstein polynomials are the mixture weights. This construction is much simpler than Gaussian mixture proposals (Gershman et al., 2012; Nguyen and Bonilla, 2014). Assuming permissibility of the interchange of integration and differentiation, we have $\nabla_{\omega^{(j)}} \mathcal{L} = \mathbb{E}_{q_G(\tilde{z})}[\nabla_{\omega^{(j)}} \ell_s(\tilde{z}, h, y)]$, with the stochastic gradients

$$\nabla_{\omega^{(j)}} \ell_s(\tilde{z}, h, y) = \nabla_{\omega^{(j)}} \ln p(y, h(\tilde{z})) + \nabla_{\omega^{(j)}} \ln h_j'(\tilde{z}_j) = \bigg[\frac{\partial \ln p(y, x)}{\partial x_j} \frac{\partial h_j(\tilde{z}_j)}{\partial \omega_{r,k}^{(j)}}\bigg]_{r=1:k} + \bigg[\frac{\partial \ln h_j'(\tilde{z}_j)}{\partial \omega_{r,k}^{(j)}}\bigg]_{r=1:k}, \quad (10)$$

where

$$\frac{\partial h_j(\tilde{z}_j)}{\partial \omega_{r,k}^{(j)}} = \frac{\partial \Psi^{-1}[B(\Phi(\tilde{z}_j); k, \omega^{(j)})]}{\partial \omega_{r,k}^{(j)}} = \frac{I_{\Phi(\tilde{z}_j)}(r, k - r + 1)}{\psi(\Psi^{-1}[B(\Phi(\tilde{z}_j); k, \omega^{(j)})])} = \frac{I_{\Phi(\tilde{z}_j)}(r, k - r + 1)}{\psi(h_j(\tilde{z}_j))},$$

$$\frac{\partial \ln h_j'(\tilde{z}_j)}{\partial \omega_{r,k}^{(j)}} = \frac{\beta(\Phi(\tilde{z}_j); r, k - r + 1)}{b(\Phi(\tilde{z}_j); k, \omega^{(j)})} - \frac{\psi'(h_j(\tilde{z}_j))}{\{\psi(h_j(\tilde{z}_j))\}^2}\, I_{\Phi(\tilde{z}_j)}(r, k - r + 1).$$

The gradients w.r.t. $\omega^{(j)}$ thus turn into expectations naturally, enabling stochastic optimization of the ELBO. To satisfy the constraint that $\omega^{(j)}$ lies on the probability simplex, we apply the gradient projection operation $\mathcal{P}$ introduced in Duchi et al. (2008), with complexity $O(k \log k)$. The above derivations related to BPs are all analytical and model-independent. The only two model-specific terms are the log likelihood and prior term $\ln p(y, x)$ and its derivatives $\partial \ln p(y, x)/\partial x$. The stochastic optimization algorithm is summarized in Algorithm 1. The stability and efficiency of the stochastic optimization can be further improved by embedding subroutines that adapt the direction (gradient) or the scale (learning rate) (Duchi et al., 2011). It might seem more natural to use kernel mixtures directly as the variational proposal for continuous margins; however, the difficulty lies in handling a term $f(F^{-1}(\cdot))$ involving the (non-analytical) inverse CDF of mixtures, and in the need for a further lower bound on the entropy of mixtures.

Algorithm 1: Stochastic variational Gaussian copula inference with Bernstein polynomials
  Input: observed data y, user-specified model $\ln p(y, x)$ and first-order derivatives $\nabla_x \ln p(y, x)$, Bernstein polynomial degree k.
  Initialize variational parameters $\Theta_0 = (\mu_0, C_0, \{\omega_0^{(j)}\}_{j=1:p})$, t = 0.
  repeat
    t = t + 1;
    Sample $\tilde{z} \sim q_G(\tilde{z}; \mu_{t-1}, C_{t-1} C_{t-1}^T)$;
    $\mu_t = \mu_{t-1} + \lambda_t \nabla_{\tilde{z}} \ell_s(\tilde{z}, h, y)$;
    $C_t = C_{t-1} + \eta_t \big(\nabla_{\tilde{z}} \ell_s(\tilde{z}, h, y)(\tilde{z} - \mu_{t-1})^T C_{t-1}^{-T} + \Delta_{C_{t-1}}\big)$, according to (9);
    for j = 1 to p do
      $\omega_t^{(j)} = \mathcal{P}\big(\omega_{t-1}^{(j)} + \xi_t \nabla_{\omega^{(j)}} \ell_s(\tilde{z}, h, y)\big)$, according to (10);
    end for
  until convergence criterion is satisfied
  Output: marginal parameters $(\{\omega^{(j)}\}_{j=1:p}, \mu, \sigma^2)$ and copula parameter Υ.
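Below is a schematic single iteration of Algorithm 1 (our own paraphrase in code, not a released implementation). The model- and margin-specific stochastic gradients from (9) and (10) are assumed to be supplied as callables, a single common learning rate stands in for the separate schedules $\lambda_t, \eta_t, \xi_t$, and the simplex projection $\mathcal{P}$ of Duchi et al. (2008) is implemented directly.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    k = len(v)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, k + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def vgc_step(mu, C, omegas, grad_ell_z, grad_ell_omega, lr=1e-3, rng=None):
    """One stochastic update of (mu, C, {omega^(j)}); gradients per (9)-(10) are user-supplied."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(mu)
    z_tilde = mu + C @ rng.standard_normal(p)               # z~ ~ N(mu, C C^T)
    g = grad_ell_z(z_tilde)                                 # stochastic gradient (9)
    mu_new = mu + lr * g
    delta_C = np.diag(1.0 / np.diag(C))
    grad_C = np.tril(np.outer(g, z_tilde - mu) @ np.linalg.inv(C).T) + delta_C
    C_new = C + lr * grad_C                                 # keep C lower triangular
    omegas_new = [project_to_simplex(w + lr * grad_ell_omega(z_tilde, j))   # (10) + projection
                  for j, w in enumerate(omegas)]
    return mu_new, C_new, omegas_new
```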

5 Case Study

The NIGG example considered in Section 3.3 demonstrates that VGC is able to preserve posterior dependence while the margins are constrained to a certain parametric form. This restriction is relaxed by the nonparametric construction based on BPs, with refinement of the mixture weights. Denoting $x = (x_1, x_2) = (\tau, \gamma)$, we construct the transforms via "sandwich-type" Bernstein polynomials, $h_j(\tilde{z}_j) = \Psi^{-1}[B(\Phi(\tilde{z}_j); k, \omega^{(j)})]$, j = 1, 2.

1. For the normal inverse gamma-gamma (NIGG) model,

$$\ln p(y, x_1, x_2) = c_0 - 2 \ln x_1 - \frac{y^2}{2 x_1} - \frac{x_2}{x_1} - x_2, \quad (11)$$

where $c_0 = -0.5 \ln(2\pi) - 2 \ln \Gamma(0.5)$. The derivatives are

$$\frac{\partial \ln p(y, x)}{\partial x_1} = -\frac{2}{x_1} + \frac{y^2}{2 x_1^2} + \frac{x_2}{x_1^2}, \qquad \frac{\partial \ln p(y, x)}{\partial x_2} = -\frac{1}{x_1} - 1.$$
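As a sketch of the two model-specific inputs required by Algorithm 1, (11) and its derivatives translate directly into code (our own transcription, assuming NumPy/SciPy):

```python
import numpy as np
from scipy.special import gammaln

c0 = -0.5 * np.log(2 * np.pi) - 2 * gammaln(0.5)

def log_joint(y, x):
    x1, x2 = x                               # x1 = tau > 0, x2 = gamma > 0
    return c0 - 2 * np.log(x1) - y**2 / (2 * x1) - x2 / x1 - x2

def grad_log_joint(y, x):
    x1, x2 = x
    return np.array([-2 / x1 + y**2 / (2 * x1**2) + x2 / x1**2,   # d/dx1
                     -1 / x1 - 1])                                 # d/dx2
```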

[Figure 2: Approximated joint and marginal posteriors (shown in log space for visualization purposes). Notation: LN: log-normal margin; BP: Bernstein polynomial margin (k = 20); (D): deterministic implementation; (S): stochastic implementation; -full: full covariance matrix; -diag: diagonal covariance.]

2. We choose $\Psi(\cdot)$ to be the CDF of the exponential distribution $\mathrm{Exp}(\lambda_0)$, $\lambda_0 = 0.1$, according to the support of both $x_1 = \tau \in (0, \infty)$ and $x_2 = \gamma \in (0, \infty)$:

$$\Psi(x) = [1 - \exp(-\lambda_0 x)] H(x), \qquad \Psi^{-1}(u) = -\ln(1 - u)/\lambda_0, \quad 0 \leq u < 1,$$
$$\psi(x) = \lambda_0 \exp(-\lambda_0 x) H(x), \qquad \psi'(x) = -\lambda_0^2 \exp(-\lambda_0 x) H(x),$$

where H(x) = 1 if x ≥ 0 and H(x) = 0 otherwise.

Figure 2 shows that this leads to more accurate margins. Future work will investigate more flexible marginal posteriors.

6 Discussions

We have presented a new variational copula inference framework for continuous posterior inference in hierarchical Bayesian models. To avoid the difficulty of specifying margins for hidden variables, a nonparametric procedure based on Bernstein polynomials indirectly induces highly flexible univariate margins. The copula function is limited to the Gaussian family for interpretability and tractability. The correlation matrix is estimated via a parameter expansion technique, which provides a concrete expression of pairwise posterior dependence. Extensions to discrete margins and to other parametric copulas allowing rich or nonstandard dependencies are interesting topics for future research. Copulas have been used extensively in statistical modeling, but have remained unexplored in variational inference. The proposed method is a general and automated approach to approximating full posteriors, and it also applies to non-conjugate models. The obtained posterior approximation could improve the efficiency of Metropolis-Hastings (MH) samplers by serving as a reasonable proposal in place of an MCMC pre-run (Schmidl et al., 2013).

Appendix: Derivations on the NIGG model

A1: Hierarchical Model and Gibbs Sampler

The hierarchical model is

$$y|\tau \sim \mathcal{N}(0, \tau), \qquad \tau|\gamma \sim \mathrm{IG}\Big(\frac{1}{2}, \gamma\Big), \qquad \gamma \sim \mathrm{G}\Big(\frac{1}{2}, 1\Big).$$

The full likelihood is

$$p(y, \tau, \gamma) = p(y|\tau)\, p(\tau|\gamma)\, p(\gamma) = \mathcal{N}(y; 0, \tau)\, \mathrm{IG}\Big(\tau; \frac{1}{2}, \gamma\Big)\, \mathrm{G}\Big(\gamma; \frac{1}{2}, 1\Big) = \frac{1}{\sqrt{2\pi\tau}} \exp\Big(-\frac{y^2}{2\tau}\Big) \times \frac{\gamma^{1/2}}{\Gamma(\frac{1}{2})}\, \tau^{-1/2-1} \exp\Big(-\frac{\gamma}{\tau}\Big) \times \frac{1}{\Gamma(\frac{1}{2})}\, \gamma^{1/2-1} \exp(-\gamma).$$

The full conditional distributions are

$$p(\tau|y, \gamma) = \mathrm{IG}\Big(\tau; 1, \frac{y^2}{2} + \gamma\Big), \qquad p(\gamma|\tau) = \mathrm{G}\Big(\gamma; 1, \frac{1}{\tau} + 1\Big).$$

A2: Mean-field Variational Bayes

The ELBO under MFVB is $\mathcal{L}_{\mathrm{VB}} = \mathbb{E}[\ln p(y, \tau, \gamma)] + H_1[q(\tau; \alpha_1, \beta_1)] + H_2[q(\gamma; \alpha_2, \beta_2)]$, where

$$\mathbb{E}[\ln p(y, \tau, \gamma)] = -\frac{1}{2}\ln(2\pi) - 2 \ln \Gamma\Big(\frac{1}{2}\Big) - 2\langle \ln \tau \rangle - \frac{y^2}{2}\Big\langle \frac{1}{\tau} \Big\rangle - \langle \gamma \rangle \Big\langle \frac{1}{\tau} \Big\rangle - \langle \gamma \rangle,$$
$$H_1[q(\tau; \alpha_1, \beta_1)] = \alpha_1 + \ln \beta_1 + \ln \Gamma(\alpha_1) - (1 + \alpha_1)\psi(\alpha_1),$$
$$H_2[q(\gamma; \alpha_2, \beta_2)] = \alpha_2 - \ln \beta_2 + \ln \Gamma(\alpha_2) + (1 - \alpha_2)\psi(\alpha_2),$$

with $\psi(\cdot)$ here denoting the digamma function. The variational distributions are

$$q(\tau) = \mathrm{IG}(\tau; \alpha_1, \beta_1) = \mathrm{IG}\Big(\tau; 1, \frac{y^2}{2} + \langle \gamma \rangle\Big), \qquad q(\gamma) = \mathrm{G}(\gamma; \alpha_2, \beta_2) = \mathrm{G}\Big(\gamma; 1, \Big\langle \frac{1}{\tau} \Big\rangle + 1\Big),$$

where

$$\langle \ln \tau \rangle = \ln \beta_1 - \psi(\alpha_1) = \ln\Big(\frac{y^2}{2} + \langle \gamma \rangle\Big) - \psi(1), \qquad \Big\langle \frac{1}{\tau} \Big\rangle = \frac{\alpha_1}{\beta_1} = \frac{1}{\frac{y^2}{2} + \langle \gamma \rangle}, \qquad \langle \gamma \rangle = \frac{\alpha_2}{\beta_2} = \frac{1}{\langle \frac{1}{\tau} \rangle + 1}.$$

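The A2 updates amount to a two-line fixed-point iteration; the following sketch (our own code, not from the paper) runs it for y = 0.01.

```python
import numpy as np

y = 0.01
E_gamma = 1.0                                # initialize <gamma>
for _ in range(100):
    alpha1, beta1 = 1.0, y**2 / 2 + E_gamma  # q(tau) = IG(alpha1, beta1)
    E_inv_tau = alpha1 / beta1               # <1/tau>
    alpha2, beta2 = 1.0, E_inv_tau + 1.0     # q(gamma) = Gamma(alpha2, rate beta2)
    E_gamma = alpha2 / beta2                 # <gamma>
print(beta1, beta2)                          # converged variational parameters
```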
A3: Deterministic Variational Copula Inference

The expected log likelihood plus log prior is

$$\mathbb{E}[\ln p(y, h_1(\tilde{z}_1), h_2(\tilde{z}_2))] = c_0 - 2\langle \ln h_1(\tilde{z}_1) \rangle - \frac{y^2}{2}\Big\langle \frac{1}{h_1(\tilde{z}_1)} \Big\rangle - \Big\langle \frac{h_2(\tilde{z}_2)}{h_1(\tilde{z}_1)} \Big\rangle - \langle h_2(\tilde{z}_2) \rangle = c_0 - 2\langle \tilde{z}_1 \rangle - \frac{y^2}{2}\langle \exp(-\tilde{z}_1) \rangle - \langle \exp(\tilde{z}_2 - \tilde{z}_1) \rangle - \langle \exp(\tilde{z}_2) \rangle,$$

where

$$\langle \tilde{z}_1 \rangle = \mu_1, \qquad \langle \exp(-\tilde{z}_1) \rangle = \exp\Big(-\mu_1 + \frac{\Sigma_{11}}{2}\Big), \qquad \langle \exp(\tilde{z}_2) \rangle = \exp\Big(\mu_2 + \frac{\Sigma_{22}}{2}\Big),$$
$$\langle \exp(\tilde{z}_2 - \tilde{z}_1) \rangle = \exp\Big((\mu_2 - \mu_1) + \frac{\Sigma_{11} - \Sigma_{21} - \Sigma_{12} + \Sigma_{22}}{2}\Big).$$

Combining with $H[q_G(\tilde{z})] = \ln(2\pi e) + \ln|C|$ and $\sum_{j=1}^2 \langle \ln h_j'(\tilde{z}_j) \rangle_{q_G(\tilde{z}_j)} = \mu_1 + \mu_2$, we have

$$\mathcal{L}_{\mathrm{VGC}}(\mu, C) = c_1 - \mu_1 + \mu_2 - \frac{y^2}{2} \exp\Big(-\mu_1 + \frac{C_{11}^2}{2}\Big) - \ell_1(\mu, C) - \exp\Big(\mu_2 + \frac{C_{21}^2 + C_{22}^2}{2}\Big) + \ln|C|,$$
$$\ell_1(\mu, C) = \exp\Big((\mu_2 - \mu_1) + \frac{C_{11}^2 - 2 C_{11} C_{21} + C_{21}^2 + C_{22}^2}{2}\Big),$$

where $c_1 = c_0 + \ln(2\pi e) = 0.5 \ln(2\pi) - 2 \ln \Gamma(0.5) + 1$. The gradients are

$$\frac{\partial \mathcal{L}_{\mathrm{VGC}}(\mu, C)}{\partial \mu_1} = -1 + \frac{y^2}{2} \exp\Big(-\mu_1 + \frac{C_{11}^2}{2}\Big) + \ell_1(\mu, C),$$
$$\frac{\partial \mathcal{L}_{\mathrm{VGC}}(\mu, C)}{\partial \mu_2} = 1 - \ell_1(\mu, C) - \exp\Big(\mu_2 + \frac{C_{21}^2 + C_{22}^2}{2}\Big),$$
$$\frac{\partial \mathcal{L}_{\mathrm{VGC}}(\mu, C)}{\partial C_{11}} = -\frac{y^2 C_{11}}{2} \exp\Big(-\mu_1 + \frac{C_{11}^2}{2}\Big) - (C_{11} - C_{21})\, \ell_1(\mu, C) + \frac{1}{C_{11}},$$
$$\frac{\partial \mathcal{L}_{\mathrm{VGC}}(\mu, C)}{\partial C_{21}} = (C_{11} - C_{21})\, \ell_1(\mu, C) - C_{21} \exp\Big(\mu_2 + \frac{C_{21}^2 + C_{22}^2}{2}\Big),$$
$$\frac{\partial \mathcal{L}_{\mathrm{VGC}}(\mu, C)}{\partial C_{22}} = -C_{22}\, \ell_1(\mu, C) - C_{22} \exp\Big(\mu_2 + \frac{C_{21}^2 + C_{22}^2}{2}\Big) + \frac{1}{C_{22}}.$$

A4: Stochastic Variational Copula Inference with Log-Normal Margins

The log likelihood plus log prior is

$$\ln p(y, h_1(\tilde{z}_1), h_2(\tilde{z}_2)) = c_0 - 2 \ln h_1(\tilde{z}_1) - \frac{y^2}{2 h_1(\tilde{z}_1)} - \frac{h_2(\tilde{z}_2)}{h_1(\tilde{z}_1)} - h_2(\tilde{z}_2) = c_0 - 2\tilde{z}_1 - \frac{y^2}{2} \exp(-\tilde{z}_1) - \exp(\tilde{z}_2 - \tilde{z}_1) - \exp(\tilde{z}_2).$$

The stochastic part of the ELBO is

$$\ell_s(\tilde{z}, h_{1:p}, y) = \tilde{z}_2 + c_0 - \tilde{z}_1 - \frac{y^2}{2} \exp(-\tilde{z}_1) - \exp(\tilde{z}_2 - \tilde{z}_1) - \exp(\tilde{z}_2),$$

with gradients

$$\nabla_{\tilde{z}_1} \ell_s(\tilde{z}, h_{1:p}, y) = -1 + \frac{y^2}{2} \exp(-\tilde{z}_1) + \exp(\tilde{z}_2 - \tilde{z}_1), \qquad \nabla_{\tilde{z}_2} \ell_s(\tilde{z}, h_{1:p}, y) = 1 - \exp(\tilde{z}_2 - \tilde{z}_1) - \exp(\tilde{z}_2).$$
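A quick finite-difference check of the A4 expressions (our own sanity test, not part of the paper):

```python
import numpy as np
from scipy.special import gammaln

y = 0.01
c0 = -0.5 * np.log(2 * np.pi) - 2 * gammaln(0.5)

def ell_s(z):
    z1, z2 = z
    return z2 + c0 - z1 - y**2 * np.exp(-z1) / 2 - np.exp(z2 - z1) - np.exp(z2)

def grad_ell_s(z):
    z1, z2 = z
    return np.array([-1 + y**2 * np.exp(-z1) / 2 + np.exp(z2 - z1),
                     1 - np.exp(z2 - z1) - np.exp(z2)])

z, eps = np.array([-0.7, 0.2]), 1e-6
fd = np.array([(ell_s(z + eps * e) - ell_s(z - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(fd, grad_ell_s(z), atol=1e-5))   # True
```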

References

D. Barber and C. M. Bishop. Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning, 168:215–238, 1998.
E. Challis and D. Barber. Gaussian Kullback-Leibler approximate inference. Journal of Machine Learning Research (JMLR), 14(1):2239–2286, 2013.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML), pages 272–279, 2008.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR), 12:2121–2159, 2011.
S. J. Gershman, M. D. Hoffman, and D. M. Blei. Nonparametric variational inference. In International Conference on Machine Learning (ICML), 2012.
T. S. Jaakkola and M. I. Jordan. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, 1998.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
C. Liu, D. B. Rubin, and Y. Wu. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85(4):755–770, 1998.
H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research (JMLR), 10:2295–2328, 2009.
J. S. Liu and Y. Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association (JASA), 94(448):1264–1274, 1999.
D. Lopez-Paz, J. M. Hernandez-Lobato, and Z. Ghahramani. Gaussian process vine copulas for multivariate dependence. In International Conference on Machine Learning (ICML), pages 10–18, 2013.
R. B. Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.
S. E. Neville, J. T. Ormerod, and M. Wand. Mean field variational Bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8:1113–1151, 2014.
T. V. Nguyen and E. V. Bonilla. Automated variational inference for Gaussian process models. In Advances in Neural Information Processing Systems (NIPS), pages 1404–1412, 2014.
M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.
J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning (ICML), 2012.
S. Petrone. Bayesian density estimation using Bernstein polynomials. Canadian Journal of Statistics, 27(1):105–126, 1999.
Y. Qi and T. S. Jaakkola. Parameter expanded variational Bayesian methods. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1104, 2006.
R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.
L. K. Saul and M. I. Jordan. Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems (NIPS), pages 486–492, 1996.
D. Schmidl, C. Czado, S. Hug, and F. J. Theis. A vine-copula based adaptive MCMC sampler for efficient inference of dynamical systems. Bayesian Analysis, 8(1):1–22, 2013.
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8, 1959.
M. S. Smith. Bayesian approaches to copula modelling. Bayesian Theory and Applications, page 336, 2013.
P. X. Song. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305–320, 2000.
A. Talhouk, A. Doucet, and K. Murphy. Efficient Bayesian inference for multivariate probit models with sparse inverse correlation matrices. Journal of Computational and Graphical Statistics, 21(3):739–757, 2012.
M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning (ICML), pages 1971–1979, 2014.
M. Tran, P. Giordani, X. Mun, R. Kohn, and M. K. Pitt. Copula-type estimators for flexible multivariate density modeling using mixtures. Journal of Computational and Graphical Statistics, 23(4):1163–1178, 2014.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
F. L. Wauthier and M. I. Jordan. Heavy-tailed process priors for selective shrinkage. In Advances in Neural Information Processing Systems (NIPS), pages 2406–2414, 2010.