Streamlined mean field variational Bayes for longitudinal and multilevel data analysis

By Cathy Yuen Yi Lee∗ and Matt P. Wand

School of Mathematical and Physical Sciences, University of Technology Sydney
638 Jones Street Broadway, Ultimo, New South Wales, 2007, Australia

4th August, 2015

∗ Corresponding author: e-mail: [email protected]

Summary

Streamlined mean field variational Bayes algorithms for efficient fitting and inference in large models for longitudinal and multilevel data analysis are obtained. The number of operations is linear in the number of groups in each level, which represents a two orders of magnitude improvement over the naïve approach. Storage requirements are also lessened considerably. We treat models for the Gaussian and binary response situations. Our algorithms allow the fastest ever approximate Bayesian analyses of arbitrarily large longitudinal and multilevel data-sets, with little degradation in accuracy compared with Markov chain Monte Carlo. The modularity of mean field variational Bayes allows relatively simple extension to more complicated scenarios.

Keywords: Bayesian computing; Longitudinal data; Matrix decomposition; Multilevel model; Variational approximations.

1  Introduction

We develop mean field variational Bayes (MFVB) algorithms for fitting and inference in large longitudinal and multilevel models that are streamlined in terms of number of operations and storage. The number of operations is linear in the number of groups in each level. This constitutes a major improvement over naïve implementations, which have cubic dependence on the number of groups. Our algorithms are also optimal in terms of storage. The inferential accuracy is shown to be very good to excellent. The upshot is the ability to perform high-quality Bayesian inference for arbitrarily large longitudinal and multilevel data-sets faster than ever before.

MFVB (e.g. Wainwright & Jordan, 2008) is a deterministic alternative to Markov chain Monte Carlo (MCMC) for approximate Bayesian inference. Its utility is speed for high-volume data and easy extendability to online processing for high-velocity data. This entails some inferential inaccuracy, although this is relatively minor for the models considered in the present article. The modularity of MFVB means that it can accommodate various complications, such as missing data and the need for robustness, within the attractive graphical models framework that has become popular for Bayesian inference (e.g. Bishop, 2008). The software package Infer.NET (Minka, Winn, Guiver & Knowles, 2013) has MFVB as one of its main inference engines. Infer.NET accommodates the longitudinal and multilevel models treated here, but its non-streamlined implementation becomes quite slow for large numbers of groups.

Variational methods (e.g. Bishop, 2006; Ormerod & Wand, 2010) have evolved mainly in areas such as Machine Learning and Pattern Recognition. However, since the early 2000s there has been an increasing recognition of their usefulness in mainstream Statistics (e.g. Titterington, 2004). An early contribution is Teschendorff et al. (2005), who applied MFVB to mixture modeling for gene expression data. Longitudinal and multilevel models (e.g. Diggle et al., 2002; Fitzmaurice et al., 2011; Gelman & Hill, 2007; Goldstein, 2010) represent a major branch of Statistics which, to date, has had relatively little overlap with the graphical models viewpoint and variational methods. Algorithm 3 of Ormerod & Wand (2010) and Algorithms 3 and 5 of Luts, Broderick & Wand (2014) support MFVB fitting of longitudinal and multilevel data but use naïve matrix inversion for the effects covariance matrix parameter updates.

The essence of our approach is to take advantage of the sparseness of the matrix requiring inversion and to streamline its inversion. This involves omitting correlations of estimated effects between groups, which are rarely of interest. Our resultant streamlined algorithms are shown, via simulation, to have inferential accuracy that is comparable with MCMC. It is likely that they provide the fastest ever means of approximate Bayesian analysis of very large longitudinal and multilevel data-sets. We restrict attention to two-level Gaussian and binary response models.

Recently, Tan and Nott (2013) introduced a different strategy for improving the efficiency of variational methods for generalized linear mixed models. The product restriction of their approach is such that the random effects for each group appear in a separate factor, and their partially noncentered parameterization strategy is aimed at improved accuracy in the face of such a restriction. We do not impose their restriction in our variational approximation for the random group effects. Fitting and inference proceed efficiently via a streamlined approach that takes advantage of the inherent block-diagonal structure of the effects covariance matrix. Our algorithms correspond to the algorithms of Tan and Nott (2013) in which their tuning parameter, which captures correlation between fixed and random effects, does not have to be estimated and, instead, is specified optimally. Our streamlined algorithms are simpler due to minimal product restrictions in the MFVB approximation and can be applied to a richer class of models.

The next section presents a brief overview of variational approximations. Section 3 treats Gaussian response longitudinal and multilevel models. Details on the direct implementation of variational algorithms and inference for these models are given. Section 4 gives the details of our streamlined approach to sparse covariance estimation via matrix permutation and decomposition. The binary response case is treated in Section 5. In Section 6 we provide numerical evidence of the efficacy of the new methodology, in terms of both computational speed and inferential accuracy. An illustration using data from a United States perinatal health study is presented in Section 7, and concluding remarks are given in Section 8.

2  A Brief Overview of Variational Approximations

Consider a generic Bayesian model with observed data vector y and parameter vector θ that is continuous over the parameter space Θ. Bayesian inference is based on the joint posterior distribution of the parameters of interest given the observed data,

    p(θ|y) = p(y, θ)/p(y) = p(y|θ) p(θ)/p(y).

Let q be an arbitrary density function over Θ. Then the logarithm of the marginal likelihood satisfies

    log p(y) = ∫_Θ q(θ) log{ p(y, θ)/q(θ) } dθ + ∫_Θ q(θ) log{ q(θ)/p(θ|y) } dθ
             = log p(y; q) + KL{ q(θ) ‖ p(θ|y) },

with KL{q(θ) ‖ p(θ|y)} ≥ 0 for all densities q, so that p(y) ≥ p(y; q). The gap between log p(y) and the lower bound log p(y; q) is the Kullback–Leibler divergence, and it is minimized by q_exact(θ) = p(θ|y),

the exact posterior density function. Typically, direct minimization of the Kullback–Leibler divergence is intractable, since it depends on the true posterior, whose normalizing constant is difficult to calculate. For many models of practical interest, exact inference is infeasible due to intractable posterior distributions, so approximate inference is required in general. Markov chain Monte Carlo has been the most widely used approximate inference method in this setting, but it can be computationally intensive and often suffers from poor mixing, which leads to slow convergence in large and complex statistical models. We therefore opt for a fast and analytical alternative approach, namely variational approximations. Variational approximations are a family of approximate inference techniques that recast the problem of computing posterior probabilities as a functional optimization problem: one finds the distribution, within a manageable class of distributions, that best matches the posterior according to some criterion. We provide here a brief overview of variational approximations and highlight the key concepts. Detailed expositions can be found in Bishop (2006, Sections 10.1–10.4) and Ormerod & Wand (2010). The central idea of variational methods is to approximate the true posterior density function p(θ|y) with a so-called approximating density function q(θ) through minimization of a measure of dissimilarity between the two density functions. A natural choice for the dissimilarity measure is the Kullback–Leibler divergence, leading to

    q*(θ) = argmin_{q ∈ Q} KL{ q(θ) ‖ p(θ|y) },

for some set Q of density functions. Tractability is achieved by restricting q(θ) to some product density form:

    Q = { q(θ) : q(θ) = ∏_{i=1}^{M} q_i(θ_i) for some partition {θ_1, . . . , θ_M} of θ },        (1)

known as the mean field restriction. The resultant q*(θ) is known as a mean field variational Bayes approximation to p(θ|y), and from now on we refer to it as an optimal q-density. The tractability afforded by (1) comes with a trade-off in accuracy, because it imposes posterior independence between partition elements θ_i that may not be present in the true posterior. Depending on the amount of true posterior dependence, MFVB approximations can range from excellent to poor. For example, the regression coefficients and the error variance in linear regression models tend to have weak posterior correlation, and as a result the assumption of posterior independence has a negligible effect. In contrast, in univariate asymmetric Laplace models the variance and auxiliary parameters tend to have strong posterior correlation, and so the assumption of posterior independence leads to inaccurate approximate inference (Wand et al., 2011). Under restriction (1), the optimal q-density functions can be shown to satisfy (Ormerod & Wand, 2010)

    q_i*(θ_i) ∝ exp[ E_{q(−θ_i)}{ log p(θ_i | rest) } ],    1 ≤ i ≤ M,        (2)

where E_{q(−θ_i)} denotes expectation with respect to the density ∏_{j≠i} q_j(θ_j), the notation rest ≡ {y, θ_1, . . . , θ_{i−1}, θ_{i+1}, . . . , θ_M} is the set containing all random vectors in the model except θ_i, and the distributions θ_i | rest, 1 ≤ i ≤ M, are known as the full conditionals in the MCMC literature. We proceed by initializing each of the optimal q-density factors q_i*(θ_i) and updating each factor in turn, replacing the current estimate with the revised estimate given in (2). Each update uniquely minimizes the Kullback–Leibler divergence, or equivalently maximizes the lower bound, with respect to the parameters of q_i*(θ_i). This forms an iterative coordinate ascent algorithm, Algorithm 1. Convergence of such an

algorithm to at least a local optimum is guaranteed based on convexity properties (Luenberger and Ye, 2008, p. 253). If all the parameters in a model are conditionally conjugate, then the optimal q-density functions in Algorithm 1 are available in closed form. Otherwise, numerical quadrature techniques are required to estimate the marginal likelihood, and these are computationally more demanding.

    Initialize: q_1*(θ_1), . . . , q_M*(θ_M).
    Cycle:
        q_1*(θ_1) ← exp[ E_{q(−θ_1)}{log p(y, θ)} ] / ∫ exp[ E_{q(−θ_1)}{log p(y, θ)} ] dθ_1
          ⋮
        q_M*(θ_M) ← exp[ E_{q(−θ_M)}{log p(y, θ)} ] / ∫ exp[ E_{q(−θ_M)}{log p(y, θ)} ] dθ_M
    until the increase in p(y; q) is negligible.

Algorithm 1: Iterative scheme for obtaining the optimal q-density functions under product restriction (1).
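To make the generic scheme in Algorithm 1 concrete, the Python sketch below (not taken from the paper; the model, priors and all numerical values are illustrative assumptions) runs the coordinate ascent updates for the toy model y_i ∼ N(μ, σ²) with priors μ ∼ N(0, σ_μ²) and σ² ∼ Inverse-Gamma(A, B), under the restriction q(μ, σ²) = q(μ) q(σ²).

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(1.0, 0.5, size=200)            # simulated data (assumption)
    n = y.size
    s2_mu, A, B = 1e5, 0.01, 0.01                 # hyperparameters (assumptions)

    B_q = 1.0                                     # q(sigma^2) = Inverse-Gamma(A + n/2, B_q)
    for it in range(100):
        E_recip_sig2 = (A + n/2) / B_q            # E_q(1/sigma^2)
        sigma2_q = 1.0 / (n*E_recip_sig2 + 1.0/s2_mu)        # q(mu) = N(mu_q, sigma2_q)
        mu_q = sigma2_q * E_recip_sig2 * y.sum()
        B_q_new = B + 0.5*(np.sum((y - mu_q)**2) + n*sigma2_q)
        if abs(B_q_new - B_q) < 1e-8 * abs(B_q):
            B_q = B_q_new
            break
        B_q = B_q_new

    print("E_q(mu) =", mu_q, "  E_q(sigma^2) =", B_q/(A + n/2 - 1))

Each pass through the loop is one cycle of Algorithm 1; here the change in the scale parameter B_q is used as a simple stand-in for monitoring the lower bound p(y; q).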

3  Gaussian Response Models

The Gaussian response models that we consider all take the following linear mixed model form:

    y | β, u, σ_ε² ∼ N(Xβ + Zu, σ_ε² I),      u | G ∼ N(0, G),        (3)

where X and Z are fixed effects and random effects design matrices, β and u are the so-called fixed effects and random effects vectors and G is the random effects covariance matrix. A common extension of (3) involves replacement of σ_ε² I by a general covariance matrix R (e.g. Robinson, 1991), but is not treated here. A very wide range of models can be obtained through various choices of Z and G. We focus on those corresponding to two-level hierarchical structures, which are often treated separately within the broad branches of Statistics known as longitudinal data analysis (e.g. Diggle et al., 2002; Fitzmaurice et al., 2011) and multilevel modeling (e.g. Gelman & Hill, 2007; Goldstein, 2010). Since around 2000, there has been a major interplay between longitudinal data analysis and semiparametric regression. An overview is given in Fitzmaurice, Davidian, Verbeke & Molenberghs (2008). Our algorithms accommodate many of the models of this type, using the general design matrix set-up of Zhao, Staudenmayer, Coull & Wand (2006).

3.1  Sample Size and Subscripting Notation

Many of the models used in longitudinal and multilevel data analysis are mathematically equivalent, but use different sample size and subscripting notation. It is prudent to delineate the various notational conventions commonly in use so that methodology developed using one set of notation can be transferred to models that use other notational systems. We achieve that in this section by way of a concrete example of (3).


The special case of (3) with

    Xβ = 0,    Z = [ 1 0 0
                     1 0 0
                     0 1 0
                     0 1 0
                     0 1 0
                     0 0 1
                     0 0 1 ]    and    G = σ_u² I        (4)

imposes the following covariance matrix on the response vector:

    Cov(y) = [ σ_u²+σ_ε²   σ_u²        0          0          0          0          0
               σ_u²        σ_u²+σ_ε²   0          0          0          0          0
               0           0           σ_u²+σ_ε²  σ_u²       σ_u²       0          0
               0           0           σ_u²       σ_u²+σ_ε²  σ_u²       0          0
               0           0           σ_u²       σ_u²       σ_u²+σ_ε²  0          0
               0           0           0          0          0          σ_u²+σ_ε²  σ_u²
               0           0           0          0          0          σ_u²       σ_u²+σ_ε² ].

This block covariance structure is in keeping with two hierarchical levels with 3 groups at the first level and sample sizes of 2, 3 and 2 within each group. There are at least two established notations for conveying the hierarchical structure imposed by (4). The first is

    E(y_ij | U_i) = U_i,  where  y_ij ≡ jth measurement in the ith group,
                                 U_i ≡ random effect in the ith group,
    for 1 ≤ i ≤ m, 1 ≤ j ≤ n_i.        (5)

In the current example, m = 3 and n_1 = 2, n_2 = 3, n_3 = 2. This notation is used by the prominent longitudinal data analysis textbooks Diggle et al. (2002) and Fitzmaurice, Laird & Ware (2011), as well as the semiparametric regression book Ruppert, Wand & Carroll (2003). An alternative notation is

    E(y_i | U_{j[i]}) = U_{j[i]},    1 ≤ i ≤ 7,    1 ≤ j ≤ J,

where, for this example, J = 3 and the j[i] mapping is j[1] = j[2] = 1, j[3] = j[4] = j[5] = 2, j[6] = j[7] = 3, and y_i is the ith entry of y. This is used in the multilevel model textbook by Gelman & Hill (2007). Goldstein (2010) uses y_ij ≡ ith measurement in the jth group, 1 ≤ j ≤ m, for two-level models with m groups. For the remainder of this article we will use the sample size and subscripting notation adopted by the abovementioned longitudinal data analysis textbooks. This involves subscripting notation corresponding to (5) and

    m ≡ number of groups    and    n_i ≡ number of response measurements in the ith group.

The mathematical equivalence between multilevel and longitudinal models means that our methodology applies equally to both areas. However, as illustrated above, some notational adjustment is required to match common multilevel model notations.
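As a small illustration of this bookkeeping (not from the paper), the Python fragment below builds the j[i] group-membership map of the second notation from the group sample sizes n_1, . . . , n_m of the first notation, for the m = 3, (n_1, n_2, n_3) = (2, 3, 2) example above, and recovers the indicator matrix Z displayed in (4).

    import numpy as np

    n = np.array([2, 3, 2])                        # within-group sample sizes n_1, ..., n_m
    m = n.size

    # j[i] for i = 1, ..., sum(n): group label of the ith stacked observation
    j_of_i = np.repeat(np.arange(1, m + 1), n)     # -> [1 1 2 2 2 3 3]

    # indicator random-effects design matrix of (4): Z[i-1, j-1] = 1 iff j[i] = j
    Z = (j_of_i[:, None] == np.arange(1, m + 1)[None, :]).astype(float)
    print(j_of_i)
    print(Z)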

3.2  Predictor Structure and Matrix Notation

Our predictor structure corresponds to the set-up notation of Section 2 of Zhao et al. (2006) with the spatial correlation structure omitted. We first partition β, X, u and Z into random group effects (superscript R) and general components (superscript G) as follows:

    β ≡ [ β^R          u ≡ [ u^R
          β^G ],             u^G ],      X ≡ [X^R  X^G]    and    Z ≡ [Z^R  Z^G],

which leads to

    Xβ + Zu = X^R β^R + X^G β^G + Z^R u^R + Z^G u^G.        (6)

The random group effects design matrices are of the form

    X^R ≡ [ X^R_1
            ⋮
            X^R_m ]      and      Z^R ≡ blockdiag_{1≤i≤m}(X^R_i),        (7)

where X^R_i, 1 ≤ i ≤ m, are n_i × q^R design matrices. The random group effects vector is such that

    u^R = [ u^R_1
            ⋮
            u^R_m ]      and      Cov(u^R) = I_m ⊗ Σ^R,        (8)

where Σ^R is an unstructured q^R × q^R covariance matrix and ⊗ denotes Kronecker product. Thus, X^R β^R + Z^R u^R corresponds to random intercepts and slopes for repeated measures data on m groups with sample sizes n_1, . . . , n_m. The matrices X^G and Z^G are general design matrices corresponding to the fixed effects vector β^G and random effects vector u^G respectively. Typically, X^G contains predictors that are not already included in X^R. As explained in Zhao et al. (2006), X^G may also contain polynomial basis functions of a continuous predictor that enters the model as a penalized spline. The Z^G matrix would then contain spline basis functions of the same predictor. Mixed model-based penalized spline fitting of (6) (e.g. Ruppert et al., 2003) involves modelling a smooth, but otherwise unspecified, function f according to

    f(x) = β_x x + Σ_{k=1}^{K} u^G_k z_k(x),      u^G_k ∼ N(0, σ_u²) independently,        (9)

where {z_k : 1 ≤ k ≤ K} is a spline basis of size K and σ_u² is the penalty parameter for the spline coefficients u^G_1, . . . , u^G_K. A common choice of the z_k are suitably linearly transformed cubic O'Sullivan splines, as described in Section 4 of Wand & Ormerod (2008). However, as illustrated by Example 3 of Zhao et al. (2006), Z^G may be used for other purposes such as handling crossed random effects. The Z^G u^G term may be further decomposed according to

    Z^G u^G ≡ Σ_{ℓ=1}^{L} Z^G_ℓ u^G_ℓ      with      Cov(u^G) = blockdiag_{1≤ℓ≤L}(σ_uℓ² I_{q^G_ℓ}).        (10)

Here the u^G_ℓ are q^G_ℓ × 1 random spline coefficient vectors for 1 ≤ ℓ ≤ L, so the blocks of Cov(u^G) correspond to the decomposition of u^G. The expression in (10) is in keeping with spline penalization in multi-predictor semiparametric regression models (e.g. Ruppert et al., 2003). The random effects covariance matrix, denoted by G in (3), takes the following form for the current model:

    G = [ I_m ⊗ Σ^R      0
          0              blockdiag_{1≤ℓ≤L}(σ_uℓ² I_{q^G_ℓ}) ].        (11)

The examples of Section 2 of Zhao et al. (2006) show the wide variety of models that are encompassed by (6), (7) and (10). Included are the random intercept and slope models commonly used in longitudinal data analysis and the multilevel studies in Chapter 2 of Goldstein (2010). However, various semiparametric regression extensions are also covered – including additive models, varying coefficient models and extensions involving interaction terms and higher dimensional smooth functions. Our presentation of streamlined MFVB algorithms in Section 4 for models with this predictor structure benefits from some additional notation. It is useful to define C^G ≡ [X  Z^G]. Lastly, we partition y, X, Z and C^G row-wise corresponding to the groups in X^R:

    y = [ y_1        X = [ X_1        Z = [ Z_1        and      C^G = [ C^G_1
          ⋮                ⋮                ⋮                            ⋮
          y_m ],           X_m ],           Z_m ]                        C^G_m ].        (12)

Here y_i ≡ [y_i1 ··· y_in_i]ᵀ denotes the n_i × 1 vector of responses for the ith group. The matrices X_i, Z_i and C^G_i are defined in the same fashion.
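For concreteness, the Python sketch below (an illustration with assumed dimensions, not code from the paper) assembles the design matrices of (7) for a random intercept and slope specification, X^R_i = [1  x_i] with q^R = 2, and stacks them in the two ways used throughout Sections 3 and 4.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(0)
    m = 4                                          # number of groups (assumption)
    n = rng.integers(3, 6, size=m)                 # within-group sample sizes n_i

    X_R = []                                       # X^R_i = [1  x_ij], so q^R = 2
    for ni in n:
        x = rng.uniform(size=ni)
        X_R.append(np.column_stack([np.ones(ni), x]))

    X_R_stacked = np.vstack(X_R)                   # X^R of (7): groups stacked row-wise
    Z_R = block_diag(*X_R)                         # Z^R of (7): blockdiag(X^R_1, ..., X^R_m)

    print(X_R_stacked.shape)                       # (sum(n), 2)
    print(Z_R.shape)                               # (sum(n), 2*m)

A full implementation would also append the general columns X^G and any spline basis columns Z^G to form C^G, partitioned row-wise by group as in (12).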

3.3  Prior Distributions

In this article, we take a Bayesian approach to model fitting and inference. The prior distribution on the fixed effects vector β is taken to be of the form

    β ∼ N(0, σ_β² I_P)    for some σ_β² > 0,        (13)

where P is the length of the vector β and σ_β² is a hyperparameter to be specified by the analyst. The extension to general mean and covariance matrices is relatively simple, but we use (13) in our algorithms for simplicity of exposition. The standard deviation parameters have independent Half-Cauchy priors:

    σ_uℓ  ind.∼  Half-Cauchy(A_uℓ),    1 ≤ ℓ ≤ L,        (14)

where A_uℓ > 0 are hyperparameters and ind.∼ stands for "independently distributed as". Here the Half-Cauchy(A) density function is p(x; A) = 2A/{π(x² + A²)}, x > 0. As explained in Gelman (2006), (14) is an attractive means by which weakly informative priors can be imposed on the σ_uℓ. Result 5 of Wand et al. (2011) leads to the following equivalent distributional statement:

    σ_uℓ²  ind.∼  Inverse-Gamma(½, 1/a_uℓ),      a_uℓ  ind.∼  Inverse-Gamma(½, 1/A_uℓ²),        (15)

where, for A, B > 0, the Inverse-Gamma(A, B) density function is

    p(x; A, B) = B^A Γ(A)^{−1} x^{−A−1} exp(−B/x),    x > 0.

We use representation (15) rather than (14) since it is more conducive to MFVB. Similar arguments lead to the prior distribution of σ_ε² being

    σ_ε² ∼ Inverse-Gamma(½, 1/a_ε),      a_ε ∼ Inverse-Gamma(½, 1/A_ε²)

for A_ε > 0. For Σ^R we use the following extension of (15) (Huang & Wand, 2013):

    Σ^R | a_R1, . . . , a_RqR ∼ Inverse-Wishart( ν + q^R − 1,  2ν diag(1/a_R1, . . . , 1/a_RqR) ),
    a_Rr  ind.∼  Inverse-Gamma(½, 1/A_Rr²),    1 ≤ r ≤ q^R,        (16)

where ν, A_Rr > 0 and the Inverse-Wishart(A, B) density function in the d × d argument X is

    p(X; A, B) = C_{d,A}^{−1} |B|^{A/2} |X|^{−(A+d+1)/2} exp{ −tr(B X^{−1}) },    X positive definite,

where A > 0 and the d × d matrix B is positive definite. The normalizing factor is

    C_{d,A} ≡ 2^{Ad/2} π^{d(d−1)/4} ∏_{i=1}^{d} Γ( (A + 1 − i)/2 ).        (17)

As shown in Huang & Wand (2013), (16) imposes prior distributions on the entries of ΣR that are analogous to (15). The choice ν = 2 corresponds to the standard deviation parameters in ΣR having Half-t distributions with 2 degrees of freedom and the correlation parameters having uniform distributions over (−1, 1).
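The equivalence in (15) is easy to check by simulation. The Python sketch below (illustrative only; the value of A and the sample size are assumptions) draws standard deviations from the auxiliary inverse-gamma representation and compares their quantiles with direct Half-Cauchy(A) draws.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    A, N = 1.5, 200_000

    # auxiliary representation (15): a ~ IG(1/2, 1/A^2), sigma^2 | a ~ IG(1/2, 1/a)
    a = stats.invgamma.rvs(0.5, scale=1/A**2, size=N, random_state=rng)
    sigma2 = stats.invgamma.rvs(0.5, scale=1/a, random_state=rng)
    sigma_aux = np.sqrt(sigma2)

    # direct Half-Cauchy(A) draws
    sigma_hc = stats.halfcauchy.rvs(scale=A, size=N, random_state=rng)

    qs = [0.25, 0.5, 0.75, 0.9]
    print(np.quantile(sigma_aux, qs))
    print(np.quantile(sigma_hc, qs))

The two sets of quantiles agree up to Monte Carlo error, in keeping with Result 5 of Wand et al. (2011).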

3.4  Full Bayesian Two-level Gaussian Response Model

Our Bayesian two-level Gaussian response model can now be stated in full:

    y | β, u, σ_ε² ∼ N(Xβ + Zu, σ_ε² I),        β ∼ N(0, σ_β² I_P),

    u | Σ^R, σ_u1², . . . , σ_uL² ∼ N( 0, [ I_m ⊗ Σ^R     0
                                            0             blockdiag_{1≤ℓ≤L}(σ_uℓ² I_{q^G_ℓ}) ] ),

    Σ^R | a_R1, . . . , a_RqR ∼ Inverse-Wishart( ν + q^R − 1, 2ν diag(1/a_R1, . . . , 1/a_RqR) ),

    a_Rr  ind.∼  Inverse-Gamma(½, A_Rr^{−2}),    1 ≤ r ≤ q^R,        (18)

    σ_ε² | a_ε ∼ Inverse-Gamma(½, 1/a_ε),        a_ε ∼ Inverse-Gamma(½, A_ε^{−2}),

    σ_uℓ² | a_uℓ  ind.∼  Inverse-Gamma(½, 1/a_uℓ),      a_uℓ  ind.∼  Inverse-Gamma(½, A_uℓ^{−2}),    1 ≤ ℓ ≤ L.

Figure 1 shows the directed acyclic graph that provides a graph theoretic representation of model (18). The advantages of such a graphical representation are twofold: firstly, it provides a visualization of the hierarchical structure of the corresponding Bayesian model; secondly, graph theoretic results can be used to determine probabilistic relationships between nodes. The node a_R corresponds to the random vector [a_R1 ··· a_RqR]ᵀ. The nodes σ²_u and a_u are defined analogously, i.e. σ²_u ≡ [σ_u1² ··· σ_uL²]ᵀ and a_u ≡ [a_u1 ··· a_uL]ᵀ respectively. The node u is separated into two nodes, u^R and u^G.

3.5  Mean Field Variational Bayesian Approximate Inference

Consider the problem of Bayesian inference for the parameters and random effects in model (18), depicted in Figure 1. The essence of our MFVB approach involves approximation of the full joint posterior density function of the form

    p(β, u^R, u^G, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε² | y) ≈ q(β, u^R, u^G, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε²)        (19)

subject to the q-density product restriction

    q(β, u^R, u^G, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε²) = q(β, u^R, u^G, a_R, a_u, a_ε) q(Σ^R, σ²_u, σ_ε²).        (20)

Restriction (20) is the minimal factorization in our MFVB approximation. Induced product results (Bishop, 2006, Section 10.2.5) lead to additional factorizations in the q-density product restriction form. Such induced factorizations can be easily detected using simple graphical tests such as d-separation theory (e.g. Bishop, 2006, Section 8.2). For example,

    Σ^R ⊥⊥ {σ²_u, σ_ε²} | {y, β, u^R, u^G},
where a ⊥⊥ b | c means that a and b are conditionally independent given c. Similar arguments lead to the q-densities having the product form

    q(β, u^R, u^G, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε²)
        = q(β, u^R, u^G) q(Σ^R) q(σ_ε²) q(a_ε) { ∏_{r=1}^{q^R} q(a_Rr) } { ∏_{ℓ=1}^{L} q(a_uℓ) } { ∏_{ℓ=1}^{L} q(σ_uℓ²) }.        (22)

Figure 1: Directed acyclic graph for model (18). The shaded node corresponds to the observed data vector. Random effects and auxiliary variables are referred to as hidden nodes.

As explained in Section 2, the optimal q-densities are chosen to minimize the Kullback–Leibler divergence between the right-hand side of (20) and the full joint posterior density function (e.g. Wainwright & Jordan, 2008). Calculations similar to those in Appendix B of Wand & Ormerod (2011) show that the optimal q-densities admit the following forms:

    q*(β, u) is the N(μ_{q(β,u)}, Σ_{q(β,u)}) density function,
    q*(σ_ε²) is the Inverse-Gamma( ½(Σ_{i=1}^m n_i + 1), B_{q(σε²)} ) density function,
    q*(a_ε) is the Inverse-Gamma(1, B_{q(aε)}) density function,
    q*(σ_uℓ²) is the Inverse-Gamma( ½(q^G_ℓ + 1), B_{q(σ²uℓ)} ) density function,        (23)
    q*(a_uℓ) is the Inverse-Gamma(1, B_{q(auℓ)}) density function,
    q*(a_Rr) is the Inverse-Gamma( ½(ν + q^R), B_{q(aRr)} ) density function and
    q*(Σ^R) is the Inverse-Wishart(ν + m + q^R − 1, B_{q(Σ^R)}) density function,

for parameters μ_{q(β,u)} and Σ_{q(β,u)}, the mean vector and covariance matrix of q*(β, u); B_{q(σε²)}, the scale parameter of q*(σ_ε²); B_{q(aε)}, the scale parameter of q*(a_ε); B_{q(σ²uℓ)}, the scale parameter of q*(σ_uℓ²); B_{q(auℓ)}, the scale parameter of q*(a_uℓ); B_{q(aRr)}, the scale parameter of q*(a_Rr); and B_{q(Σ^R)}, the scale matrix of q*(Σ^R). The q-density parameters are interrelated and their optimal values are obtained via coordinate ascent in Algorithm 2, with notation C ≡ [X Z]. The stopping criterion is based on the variational lower bound on the marginal likelihood (e.g. Ormerod & Wand, 2010), denoted here by p(y; q). Its logarithm can be obtained from

    log p(y; q) = E_q{ log p(y, β, u, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε²) − log q(β, u, a_R, a_u, a_ε, Σ^R, σ²_u, σ_ε²) },        (24)

and is presented in Appendix A.

Initialize: μ_{q(1/σε²)} > 0, μ_{q(1/aε)} > 0, μ_{q(1/σ²uℓ)} > 0, μ_{q(1/auℓ)} > 0, 1 ≤ ℓ ≤ L, μ_{q(1/aRr)} > 0, 1 ≤ r ≤ q^R, M_{q((Σ^R)^{−1})} positive definite.

Cycle through updates:

    Σ_{q(β,u)} ← [ μ_{q(1/σε²)} CᵀC + blockdiag{ σ_β^{−2} I_P,  I_m ⊗ M_{q((Σ^R)^{−1})},  blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}) } ]^{−1}

    μ_{q(β,u)} ← μ_{q(1/σε²)} Σ_{q(β,u)} Cᵀy

    B_{q(σε²)} ← μ_{q(1/aε)} + ½{ ‖y − Cμ_{q(β,u)}‖² + tr(CᵀC Σ_{q(β,u)}) }

    μ_{q(1/σε²)} ← ½( Σ_{i=1}^m n_i + 1 )/B_{q(σε²)} ;    μ_{q(1/aε)} ← 1/{ μ_{q(1/σε²)} + A_ε^{−2} }

    For r = 1, . . . , q^R:
        B_{q(aRr)} ← ν [M_{q((Σ^R)^{−1})}]_{rr} + A_{Rr}^{−2} ;    μ_{q(1/aRr)} ← ½(ν + q^R)/B_{q(aRr)}

    B_{q(Σ^R)} ← Σ_{i=1}^m { μ_{q(uR_i)} μ_{q(uR_i)}ᵀ + Σ_{q(uR_i)} } + 2ν diag( μ_{q(1/aR1)}, . . . , μ_{q(1/aRqR)} )

    M_{q((Σ^R)^{−1})} ← (ν + m + q^R − 1) B_{q(Σ^R)}^{−1}

    For ℓ = 1, . . . , L:
        μ_{q(1/auℓ)} ← 1/{ μ_{q(1/σ²uℓ)} + A_{uℓ}^{−2} }
        μ_{q(1/σ²uℓ)} ← (q^G_ℓ + 1)/{ 2 μ_{q(1/auℓ)} + ‖μ_{q(uG_ℓ)}‖² + tr(Σ_{q(uG_ℓ)}) }

until the increase in p(y; q) is negligible.

Here μ_{q(uR_i)}, Σ_{q(uR_i)}, μ_{q(uG_ℓ)} and Σ_{q(uG_ℓ)} denote the sub-vectors and diagonal sub-blocks of μ_{q(β,u)} and Σ_{q(β,u)} corresponding to u^R_i and u^G_ℓ.

Algorithm 2: Naïve mean field variational Bayes algorithm for the two-level Gaussian response model according to product density restriction (22).
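To convey why the Σ_{q(β,u)} update in Algorithm 2 is the bottleneck, the Python sketch below (an illustration with randomly generated design matrices and assumed dimensions, not code from the paper) forms the full matrix and inverts it directly; its dimension grows with q^R m, so storage grows quadratically and the inversion cost grows cubically in the number of groups m.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(3)
    m, qR, P = 200, 2, 4                      # groups, random effects per group, fixed effects
    n = rng.integers(10, 21, size=m)          # within-group sample sizes

    X_R = [np.column_stack([np.ones(ni), rng.uniform(size=ni)]) for ni in n]
    X = np.vstack([np.column_stack([Xi, rng.normal(size=(Xi.shape[0], P - qR))]) for Xi in X_R])
    Z_R = block_diag(*X_R)
    C = np.hstack([X, Z_R])                   # C = [X  Z^R]  (no spline block in this toy example)
    y = rng.normal(size=n.sum())

    mu_recip_sig2e = 1.0                      # current value of mu_q(1/sigma_eps^2)
    M_SigmaR_inv = np.eye(qR)                 # current value of M_q((Sigma^R)^{-1})
    sigma2_beta = 1e5

    prior_prec = block_diag(np.eye(P) / sigma2_beta, np.kron(np.eye(m), M_SigmaR_inv))
    Sigma_q = np.linalg.inv(mu_recip_sig2e * C.T @ C + prior_prec)   # (P + qR*m) x (P + qR*m)
    mu_q = mu_recip_sig2e * Sigma_q @ C.T @ y
    print(Sigma_q.shape)                      # (404, 404) here

With m = 12 500 and q^R = 2 the matrix already exceeds 25 000 × 25 000, which is precisely the storage and O(m³) problem that the streamlining of the next section removes.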


4  Streamlining Mean Field Variational Bayes Algorithms

A major computational problem in Algorithm 2 lies in the update for the covariance matrix of the coefficient estimates which, after permuting the coefficient vector to the ordering (β, u^G, u^R_1, . . . , u^R_m), i.e. writing C = [C^G  Z^R], reads

    Σ_{q(β,u)} ← [ μ_{q(1/σε²)} CᵀC + blockdiag{ σ_β^{−2} I_P,  blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}),  I_m ⊗ M_{q((Σ^R)^{−1})} } ]^{−1}.

This update expression requires storage and inversion of a large sparse matrix of dimension

    ( P + q^R m + Σ_{ℓ=1}^{L} q^G_ℓ ) × ( P + q^R m + Σ_{ℓ=1}^{L} q^G_ℓ ),

where P is the number of predictors, q^G_ℓ (1 ≤ ℓ ≤ L) is the size of the ℓth spline basis, typically in the range 15–40 regardless of sample size, and m is the number of groups, which can be arbitrarily large. For example, the Six Cities Study of Air Pollution and Health described in Fitzmaurice et al. (2011) has m = 13 379. Without doubt, the number of groups m dominates the dimension of Σ_{q(β,u)} in many practical situations. For instance, if P = 10, q^R = 2, m = 10 000, L = 1 and q^G_1 = 25, then the dimension of the matrix requiring storage and inversion would exceed 20 000 × 20 000. In addition, it is well known from numerical linear algebra that naïve computation of Σ_{q(β,u)} is O(m³), that is, cubic in the number of groups. Hence, direct implementation of Algorithm 2 can be very costly, or even infeasible, in large longitudinal and multilevel studies. Our aim in the next few sections is to streamline the MFVB algorithms by removing computational obstacles involving update expressions such as the one above for most practical values of m and, thus, maximise the benefits of variational methods for Bayesian hierarchical models.

To get around the computational problem, we exploit the fact that most of the matrix requiring inversion is block-diagonal. Hence we propose a streamlined approach based around block decomposition of a matrix:

    μ_{q(1/σε²)} CᵀC + blockdiag{ σ_β^{−2} I_P,  blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}),  I_m ⊗ M_{q((Σ^R)^{−1})} }

        = [ F      G_1       G_2       ···   G_m
            G_1ᵀ   H_1^{−1}  0         ···   0
            G_2ᵀ   0         H_2^{−1}  ···   0
            ⋮      ⋮         ⋮          ⋱    ⋮
            G_mᵀ   0         0         ···   H_m^{−1} ],        (25)

where the submatrices F, G_i and H_i are defined as follows:

    F ≡ μ_{q(1/σε²)} (C^G)ᵀC^G + blockdiag{ σ_β^{−2} I_P,  blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}) },

    G_i ≡ μ_{q(1/σε²)} (C^G_i)ᵀ X^R_i    and    H_i ≡ { μ_{q(1/σε²)} (X^R_i)ᵀ X^R_i + M_{q((Σ^R)^{−1})} }^{−1}.

The block-diagonal structure in the bottom right submatrix of (25), the contribution from the random group effects, is crucial as it enables us to reduce the inversion to O(m) operations.


A standard result on inversion of a block-partitioned matrix (e.g. Harville, 2008, Corollary 8.5.12; Smith, 2008) leads to the inverse of (25) equalling

    Σ_{q(β,u)} ≡ [ Σ_{q(β,uG)}                       Λ_{q(β,uG,uR_1 ··· uR_m)}
                   Λ_{q(β,uG,uR_1 ··· uR_m)}ᵀ        Σ_{q(uR)}                 ],        (26)

where

    Σ_{q(β,uG)} ≡ ( F − Σ_{i=1}^m G_i H_i G_iᵀ )^{−1}    and
    Λ_{q(β,uG,uR_1 ··· uR_m)} ≡ [ −Σ_{q(β,uG)} G_1 H_1   ···   −Σ_{q(β,uG)} G_m H_m ].

The dimension of the matrix Σ_{q(β,uG)} is (P + Σ_{ℓ=1}^{L} q^G_ℓ) × (P + Σ_{ℓ=1}^{L} q^G_ℓ) and it is relatively straightforward to compute; the summation term renders this whole process O(m). This matrix is all we require to obtain variability bands for the penalized regression splines. To find the variability estimates to accompany group-specific mean estimates, we need Σ_{q(uR)}, which is not a block-diagonal matrix. However, since the covariance between the fitted values of two different groups is rarely of interest, it suffices to compute and store the diagonal blocks

    Σ_{q(uR_i)} ≡ H_i + H_i G_iᵀ Σ_{q(β,uG)} G_i H_i,    1 ≤ i ≤ m.        (27)

Since Σ_{q(β,uG)}, G_i and H_i have dimensions much smaller than m, the complexity of the matrix calculations required for these submatrices does not increase as m increases. Therefore, the calculations with the highest order of complexity are those of the m relevant submatrices of Λ_{q(β,uG,uR_1 ··· uR_m)} and Σ_{q(uR_i)}. This renders the whole process O(m), representing an improvement of order m² over the naïve approach to matrix inversion. As the number of groups increases, the improvements due to streamlining become enormous.
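The computations in (26) and (27) are easily checked numerically. The Python sketch below (illustrative, with small assumed dimensions and a simplified prior precision block) forms only the ingredients F, G_i, H_i, the matrix Σ_{q(β,uG)} and the per-group blocks Σ_{q(uR_i)}, and verifies them against a direct inverse of the full matrix.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(4)
    m, qR, pG = 50, 2, 6                      # number of groups, dim(u_i^R), dim(beta, u^G)
    CG_i = [rng.normal(size=(5, pG)) for _ in range(m)]
    XR_i = [rng.normal(size=(5, qR)) for _ in range(m)]
    mu_recip, M_inv = 2.0, np.eye(qR)         # current mu_q(1/sigma_eps^2) and M_q((Sigma^R)^{-1})

    F = mu_recip * sum(C.T @ C for C in CG_i) + np.eye(pG) / 1e5
    G = [mu_recip * C.T @ X for C, X in zip(CG_i, XR_i)]
    H = [np.linalg.inv(mu_recip * X.T @ X + M_inv) for X in XR_i]

    Sigma_bG = np.linalg.inv(F - sum(Gi @ Hi @ Gi.T for Gi, Hi in zip(G, H)))   # (26)
    Sigma_uR = [Hi + Hi @ Gi.T @ Sigma_bG @ Gi @ Hi for Gi, Hi in zip(G, H)]    # (27)

    # check against a direct inverse of the full matrix
    C = np.hstack([np.vstack(CG_i), block_diag(*XR_i)])
    full = mu_recip * C.T @ C + block_diag(np.eye(pG) / 1e5, np.kron(np.eye(m), M_inv))
    Sigma_full = np.linalg.inv(full)
    print(np.allclose(Sigma_full[:pG, :pG], Sigma_bG))
    print(np.allclose(Sigma_full[pG:pG + qR, pG:pG + qR], Sigma_uR[0]))

Only small matrices are inverted: one of dimension P + Σ_ℓ q^G_ℓ and m of dimension q^R, which is the source of the O(m) cost.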

4.1  Streamlining Update Expressions Involving Σ_{q(β,u)}

We see in Algorithm 2 that the update expressions for the mean vector of the coefficient estimates and the scale parameter of the error variance involve the matrix Σ_{q(β,u)}. We now work towards streamlined alternatives to these updates. Recall that the naïve update for μ_{q(β,u)} is

    μ_{q(β,u)} ← μ_{q(1/σε²)} Σ_{q(β,u)} Cᵀy.

By replacing Σ_{q(β,u)} with (26) and partitioning C into [C^G  Z^R], we can separate the above update into two expressions corresponding to (β, u^G) and u^R respectively:

    [ μ_{q(β,uG)} ]                  [ Σ_{q(β,uG)}                    Λ_{q(β,uG,uR_1 ··· uR_m)} ] [ (C^G)ᵀy  ]
    [ μ_{q(uR)}   ]  ← μ_{q(1/σε²)} [ Λ_{q(β,uG,uR_1 ··· uR_m)}ᵀ     Σ_{q(uR)}                 ] [ (Z^R)ᵀy  ].

After some algebraic manipulations, we obtain the streamlined updates for μ_{q(β,uG)} and μ_{q(uR)} respectively:

    μ_{q(β,uG)} ← μ_{q(1/σε²)} Σ_{q(β,uG)} { (C^G)ᵀy − Σ_{i=1}^m G_i H_i (X^R_i)ᵀ y_i }        (28)

and

    μ_{q(uR)} ← μ_{q(1/σε²)} { Λ_{q(β,uG,uR_1 ··· uR_m)}ᵀ (C^G)ᵀy + Σ_{q(uR)} (Z^R)ᵀy }.

Using (26) and (27), we can further break down μ_{q(uR)} into components corresponding to the ith group only:

    μ_{q(uR_i)} ← H_i { μ_{q(1/σε²)} (X^R_i)ᵀ y_i − G_iᵀ μ_{q(β,uG)} }.
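In the same illustrative setting as before (assumed dimensions and a simplified prior precision block, not the paper's code), the streamlined mean updates (28) and the per-group update above can be verified in Python against the naive update of Algorithm 2.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(9)
    m, qR, pG, mu_recip = 40, 2, 5, 2.0
    CG_i = [rng.normal(size=(6, pG)) for _ in range(m)]
    XR_i = [rng.normal(size=(6, qR)) for _ in range(m)]
    y_i = [rng.normal(size=6) for _ in range(m)]
    M_inv = np.eye(qR)

    F = mu_recip * sum(C.T @ C for C in CG_i) + np.eye(pG) / 1e5
    G = [mu_recip * C.T @ X for C, X in zip(CG_i, XR_i)]
    H = [np.linalg.inv(mu_recip * X.T @ X + M_inv) for X in XR_i]
    Sigma_bG = np.linalg.inv(F - sum(Gi @ Hi @ Gi.T for Gi, Hi in zip(G, H)))

    # streamlined updates (28) and the per-group mean update
    rhs = sum(C.T @ y for C, y in zip(CG_i, y_i)) \
          - sum(Gi @ Hi @ X.T @ y for Gi, Hi, X, y in zip(G, H, XR_i, y_i))
    mu_bG = mu_recip * Sigma_bG @ rhs
    mu_uR = [Hi @ (mu_recip * X.T @ y - Gi.T @ mu_bG)
             for Gi, Hi, X, y in zip(G, H, XR_i, y_i)]

    # naive check
    C = np.hstack([np.vstack(CG_i), block_diag(*XR_i)])
    y = np.concatenate(y_i)
    prec = mu_recip * C.T @ C + block_diag(np.eye(pG) / 1e5, np.kron(np.eye(m), M_inv))
    mu_naive = mu_recip * np.linalg.solve(prec, C.T @ y)
    print(np.allclose(mu_naive[:pG], mu_bG), np.allclose(mu_naive[pG:pG + qR], mu_uR[0]))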

Similarly, recall that the naïve update for B_{q(σε²)} is

    B_{q(σε²)} ← μ_{q(1/aε)} + ½{ (y − Cμ_{q(β,u)})ᵀ(y − Cμ_{q(β,u)}) + tr(CᵀC Σ_{q(β,u)}) }.

By replacing μ_{q(β,u)} with (28) and partitioning C into [C^G  Z^R], we can rewrite Cμ_{q(β,u)} and tr(CᵀC Σ_{q(β,u)}) in the equivalent streamlined forms

    Cμ_{q(β,u)} = C^G μ_{q(β,uG)} + [ X^R_1 μ_{q(uR_1)}
                                      ⋮
                                      X^R_m μ_{q(uR_m)} ]

and

    tr(CᵀC Σ_{q(β,u)}) = tr{ (C^G)ᵀC^G Σ_{q(β,uG)} } − 2 μ_{q(1/σε²)}^{−1} Σ_{i=1}^m tr( G_i H_i G_iᵀ Σ_{q(β,uG)} )
                          + Σ_{i=1}^m tr{ (X^R_i)ᵀ X^R_i Σ_{q(uR_i)} },

and the streamlined update for B_{q(σε²)} then follows. We are now ready to present the streamlined algorithm, Algorithm 3, for fast MFVB-approximate fitting and inference in longitudinal and multilevel models with Gaussian response. We note that the q-density covariance matrix of the vector of coefficients for the ith group is

    Cov_q( [ β
             u^G
             u^R_i ] ) = [ Σ_{q(β,uG)}          Λ_{q(β,uG,uR_i)}
                           Λ_{q(β,uG,uR_i)}ᵀ    Σ_{q(uR_i)}      ].        (29)

This matrix is needed for variability estimates to accompany group-specific mean estimates. Figure 2 provides an illustration of the fast approximate inference produced by Algorithm 3. The data consist of mathematics test scores of 728 students from 48 schools in inner London, United Kingdom. The details of the data are described in Goldstein (2010). The figure shows the school-specific estimates and pointwise 95% credible sets for the mean 11-year old mathematics test score as a function of the eight-year old test score.

Algorithm 3 is an alternative to (restricted) maximum likelihood and best prediction-based fitting of large longitudinal and multilevel models, the latter being employed by popular software such as PROC MIXED in SAS (SAS Institute Inc., 2013) and lmer() in the R package lme4 (Bates et al., 2014). Leaving aside the fact that the latter are based on frequentist inference paradigms, whilst Algorithm 3 is intrinsically Bayesian, it is worth pointing out that Algorithm 3 is purely matrix algebraic. It does not require any numerical quadrature techniques. On the other hand, (restricted) maximum likelihood estimation of covariance matrices can be a challenging numerical problem. The matrix algebraic aspect of Algorithm 3 has advantages such as stability, speed, parallelizability and online fitting.


Initialize: μ_{q(1/σε²)} > 0, μ_{q(1/aε)} > 0, μ_{q(1/σ²uℓ)} > 0, μ_{q(1/auℓ)} > 0, 1 ≤ ℓ ≤ L, μ_{q(1/aRr)} > 0, 1 ≤ r ≤ q^R, M_{q((Σ^R)^{−1})} positive definite.

Cycle through updates:

    S ← 0 ;  s ← 0
    For i = 1, . . . , m:
        G_i ← μ_{q(1/σε²)} (C^G_i)ᵀ X^R_i ;    H_i ← { μ_{q(1/σε²)} (X^R_i)ᵀ X^R_i + M_{q((Σ^R)^{−1})} }^{−1}
        S ← S + G_i H_i G_iᵀ ;    s ← s + G_i H_i (X^R_i)ᵀ y_i

    Σ_{q(β,uG)} ← [ μ_{q(1/σε²)} (C^G)ᵀC^G + blockdiag{ σ_β^{−2} I_P, blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}) } − S ]^{−1}

    μ_{q(β,uG)} ← μ_{q(1/σε²)} Σ_{q(β,uG)} { (C^G)ᵀy − s }

    For i = 1, . . . , m:
        Σ_{q(uR_i)} ← H_i + H_i G_iᵀ Σ_{q(β,uG)} G_i H_i
        μ_{q(uR_i)} ← H_i { μ_{q(1/σε²)} (X^R_i)ᵀ y_i − G_iᵀ μ_{q(β,uG)} }

    B_{q(σε²)} ← μ_{q(1/aε)} + ½ [ ‖ y − C^G μ_{q(β,uG)} − [X^R_1 μ_{q(uR_1)} ; … ; X^R_m μ_{q(uR_m)}] ‖²
                  + tr{ (C^G)ᵀC^G Σ_{q(β,uG)} } + Σ_{i=1}^m tr{ (X^R_i)ᵀ X^R_i Σ_{q(uR_i)} }
                  − 2 μ_{q(1/σε²)}^{−1} Σ_{i=1}^m tr( G_i H_i G_iᵀ Σ_{q(β,uG)} ) ]

    μ_{q(1/σε²)} ← ½( Σ_{i=1}^m n_i + 1 )/B_{q(σε²)} ;    μ_{q(1/aε)} ← 1/{ μ_{q(1/σε²)} + A_ε^{−2} }

    For r = 1, . . . , q^R:
        B_{q(aRr)} ← ν [M_{q((Σ^R)^{−1})}]_{rr} + A_{Rr}^{−2} ;    μ_{q(1/aRr)} ← ½(ν + q^R)/B_{q(aRr)}

    B_{q(Σ^R)} ← Σ_{i=1}^m { μ_{q(uR_i)} μ_{q(uR_i)}ᵀ + Σ_{q(uR_i)} } + 2ν diag( μ_{q(1/aR1)}, . . . , μ_{q(1/aRqR)} )

    M_{q((Σ^R)^{−1})} ← (ν + m + q^R − 1) B_{q(Σ^R)}^{−1}

    For ℓ = 1, . . . , L:
        μ_{q(1/auℓ)} ← 1/{ μ_{q(1/σ²uℓ)} + A_{uℓ}^{−2} }
        μ_{q(1/σ²uℓ)} ← (q^G_ℓ + 1)/{ 2 μ_{q(1/auℓ)} + ‖μ_{q(uG_ℓ)}‖² + tr(Σ_{q(uG_ℓ)}) }

until the increase in p(y; q) is negligible.

For i = 1, . . . , m:

    Λ_{q(β,uG,uR_i)} ≡ E_q[ { (β, u^G) − μ_{q(β,uG)} }{ u^R_i − μ_{q(uR_i)} }ᵀ ] ← −Σ_{q(β,uG)} G_i H_i

Algorithm 3: Streamlined mean field variational Bayes algorithm for the two-level Gaussian response model according to product density restriction (22).

Figure 2: School-specific estimates (solid orange lines) and pointwise 95% credible sets (dotted orange lines) for the mean 11-year old mathematics test score by the eight-year old test score. The purple open circles are the raw data.

5  Binary Response Models

We now consider the case where the entries of y are binary and, instead of (3), the logistic mixed model

    y | β, u ∼ Bernoulli( logit^{−1}(Xβ + Zu) ),      u | G ∼ N(0, G)        (30)

is appropriate. The notation v ∼ Bernoulli(p) used in (30) is shorthand for the entries of the random vector v having independent Bernoulli distributions with parameters corresponding to the entries of p. The notational infrastructure described in Sections 3.1–3.3 for the Gaussian response model can be transferred to the binary response case. The only change is the response distribution specification and the omission of σ_ε² and a_ε. The full Bayesian two-level binary response model is then

    y | β, u ∼ Bernoulli( logit^{−1}(Xβ + Zu) ),      β ∼ N(0, σ_β² I_P),

    u | Σ^R, σ_u1², . . . , σ_uL² ∼ N( 0, [ I_m ⊗ Σ^R     0
                                            0             blockdiag_{1≤ℓ≤L}(σ_uℓ² I_{q^G_ℓ}) ] ),

    Σ^R | a_R1, . . . , a_RqR ∼ Inverse-Wishart( ν + q^R − 1, 2ν diag(1/a_R1, . . . , 1/a_RqR) ),        (31)

    a_Rr  ind.∼  Inverse-Gamma(½, A_Rr^{−2}),    1 ≤ r ≤ q^R,

    σ_uℓ² | a_uℓ  ind.∼  Inverse-Gamma(½, 1/a_uℓ),      a_uℓ  ind.∼  Inverse-Gamma(½, A_uℓ^{−2}),    1 ≤ ℓ ≤ L.

The journey towards a practical MFVB algorithm commences with the q-density product restriction

    q(β, u, Σ^R, a_u, a_R, σ²_u) = q(β, u, a_R, a_u) q(Σ^R, σ²_u),

analogous to that given by (20). As in the Gaussian case, induced product results lead to the q-densities satisfying

    q(β, u, a_R, a_u, Σ^R, σ²_u) = q(β, u) q(Σ^R) { ∏_{r=1}^{q^R} q(a_Rr) } { ∏_{ℓ=1}^{L} q(σ_uℓ²) } { ∏_{ℓ=1}^{L} q(a_uℓ) }.        (32)

The optimal q-densities for Σ^R, a_R, σ²_u and a_u take the same form as in the Gaussian response case, given in (23). However, the optimal q-density for (β, u) satisfies

    q*(β, u) ∝ exp{ yᵀ(Xβ + Zu) − 1ᵀ log(1 + e^{Xβ+Zu}) − (1/(2σ_β²)) ‖β‖² − ½ uᵀ G^{−1} u },        (33)

where G is given by (11). Unfortunately, (33) does not lend itself to convenient approximate inference for (β, u) since, for example, the mean and covariance matrix of q*(β, u) are not available in closed form. Instead, we replace (33) by a member of the following family of Multivariate Normal approximations: q*(β, u; ξ) ∼ N(μ_{q(β,u;ξ)}, Σ_{q(β,u;ξ)}), where

    Σ_{q(β,u;ξ)} ≡ [ 2Cᵀ diag{λ(ξ)} C + blockdiag{ σ_β^{−2} I_P, G^{−1} } ]^{−1},        (34)

C ≡ [X Z], ξ is a (Σ_{i=1}^m n_i) × 1 vector of positive variational parameters, λ(x) ≡ tanh(x/2)/(4x) and

    μ_{q(β,u;ξ)} ≡ Σ_{q(β,u;ξ)} Cᵀ(y − ½ 1).

Scalar functions applied to vectors are evaluated element-wise. The justification for this family of approximations is given in Tipping (1999b), Jaakkola & Jordan (2000) and Section 3.1 of Ormerod & Wand (2010). Jaakkola & Jordan (2000) also show that the optimal variational parameter vector can be obtained via the update

    ξ ← sqrt( diagonal[ C { Σ_{q(β,u;ξ)} + μ_{q(β,u;ξ)} μ_{q(β,u;ξ)}ᵀ } Cᵀ ] ),

where diagonal(M) ≡ vector containing the diagonal entries of M for any square matrix M. As in the Gaussian case, (34) becomes infeasible for very large longitudinal and multilevel problems, so there is a strong imperative for streamlining the corresponding update. Algorithm 4 achieves this and requires one additional piece of matrix notation:

    ξ_i = sub-vector of ξ corresponding to the ith group,    1 ≤ i ≤ m,

which is analogous to the y_i notation given by (12). Again, the stopping criterion for Algorithm 4 is based on the variational lower bound on the marginal likelihood, which is presented in Appendix B.
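Before streamlining, the updates (34), the μ update and the ξ update amount to only a few lines of linear algebra. The Python sketch below (illustrative only; the data, dimensions and the fixed prior precision are assumptions, and the Σ^R and variance-parameter updates are omitted) iterates them for a simple logistic model.

    import numpy as np

    rng = np.random.default_rng(5)
    N, d = 500, 3
    C = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])   # C = [X Z] stand-in
    beta_true = np.array([-0.5, 1.0, -1.0])
    y = rng.binomial(1, 1/(1 + np.exp(-C @ beta_true)))

    prior_prec = np.eye(d) / 1e2        # blockdiag(sigma_beta^{-2} I, G^{-1}) stand-in
    lam = lambda x: np.tanh(x/2) / (4*x)

    xi = np.ones(N)
    for it in range(50):
        Sigma = np.linalg.inv(2 * (C * lam(xi)[:, None]).T @ C + prior_prec)   # (34)
        mu = Sigma @ C.T @ (y - 0.5)
        xi = np.sqrt(np.einsum('ij,jk,ik->i', C, Sigma + np.outer(mu, mu), C)) # xi update

    print(mu)        # approximate posterior mean of (beta, u)

Algorithm 4 performs exactly these updates, but with the inversion and the diagonal extraction broken into the per-group pieces used in Section 4.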


Initialize: μ_{q(1/σ²uℓ)} > 0, μ_{q(1/auℓ)} > 0, 1 ≤ ℓ ≤ L, μ_{q(1/aRr)} > 0, 1 ≤ r ≤ q^R, M_{q((Σ^R)^{−1})} positive definite, ξ a (Σ_{i=1}^m n_i) × 1 vector of positive entries.

Cycle through updates:

    S ← 0 ;  s ← 0
    For i = 1, . . . , m:
        G_i ← 2 (C^G_i)ᵀ diag{λ(ξ_i)} X^R_i
        H_i ← { 2 (X^R_i)ᵀ diag{λ(ξ_i)} X^R_i + M_{q((Σ^R)^{−1})} }^{−1}
        S ← S + G_i H_i G_iᵀ ;    s ← s + G_i H_i (X^R_i)ᵀ (y_i − ½ 1)

    Σ_{q(β,uG;ξ)} ← [ 2 (C^G)ᵀ diag{λ(ξ)} C^G + blockdiag{ σ_β^{−2} I_P, blockdiag_{1≤ℓ≤L}(μ_{q(1/σ²uℓ)} I_{q^G_ℓ}) } − S ]^{−1}

    μ_{q(β,uG;ξ)} ← Σ_{q(β,uG;ξ)} { (C^G)ᵀ(y − ½ 1) − s }

    For i = 1, . . . , m:
        Σ_{q(uR_i;ξ)} ← H_i + H_i G_iᵀ Σ_{q(β,uG;ξ)} G_i H_i
        μ_{q(uR_i;ξ)} ← H_i { (X^R_i)ᵀ(y_i − ½ 1) − G_iᵀ μ_{q(β,uG;ξ)} }

    ξ² ← diagonal{ C^G ( Σ_{q(β,uG;ξ)} + μ_{q(β,uG;ξ)} μ_{q(β,uG;ξ)}ᵀ ) (C^G)ᵀ }
    For i = 1, . . . , m:
        ξ_i² ← ξ_i² + 2 diagonal{ C^G_i ( −Σ_{q(β,uG;ξ)} G_i H_i + μ_{q(β,uG;ξ)} μ_{q(uR_i;ξ)}ᵀ ) (X^R_i)ᵀ }
        ξ_i² ← ξ_i² + diagonal{ X^R_i ( Σ_{q(uR_i;ξ)} + μ_{q(uR_i;ξ)} μ_{q(uR_i;ξ)}ᵀ ) (X^R_i)ᵀ }

    For r = 1, . . . , q^R:
        B_{q(aRr)} ← ν [M_{q((Σ^R)^{−1})}]_{rr} + A_{Rr}^{−2} ;    μ_{q(1/aRr)} ← ½(ν + q^R)/B_{q(aRr)}

    B_{q(Σ^R)} ← Σ_{i=1}^m { μ_{q(uR_i;ξ)} μ_{q(uR_i;ξ)}ᵀ + Σ_{q(uR_i;ξ)} } + 2ν diag( μ_{q(1/aR1)}, . . . , μ_{q(1/aRqR)} )

    M_{q((Σ^R)^{−1})} ← (ν + m + q^R − 1) B_{q(Σ^R)}^{−1}

    For ℓ = 1, . . . , L:
        μ_{q(1/auℓ)} ← 1/{ μ_{q(1/σ²uℓ)} + A_{uℓ}^{−2} }
        μ_{q(1/σ²uℓ)} ← (q^G_ℓ + 1)/{ 2 μ_{q(1/auℓ)} + ‖μ_{q(uG_ℓ;ξ)}‖² + tr(Σ_{q(uG_ℓ;ξ)}) }

until the increase in p(y; q) is negligible.

For i = 1, . . . , m:

    Λ_{q(β,uG,uR_i;ξ)} ≡ E_q[ { (β, u^G) − μ_{q(β,uG;ξ)} }{ u^R_i − μ_{q(uR_i;ξ)} }ᵀ ] ← −Σ_{q(β,uG;ξ)} G_i H_i

Algorithm 4: Streamlined MFVB algorithm for the two-level binary response model according to product density restriction (32).

6  Numerical Evaluation

We conducted a comprehensive simulation study to assess the performance of Algorithm 3 in terms of Bayesian inferential accuracy and computational speed. We generated 100 data-sets according to the following simulation setting, which corresponds to the special case of a Bayesian semiparametric random intercept and slope model:

    y_ij | β_0, β_x, u^R_{0i}, u^R_{1i}, σ_ε²  ind.∼  N( β_0 + u^R_{0i} + (β_x + u^R_{1i}) x_ij + f(s_ij),  σ_ε² ),
    for 1 ≤ i ≤ m, 1 ≤ j ≤ n_i,        (35)

where

    f(s) = 1 − (13/(5√(2π))) e^{−(s−0.15)²/0.2} − (2.3 s − 0.07 s²) + 0.5 { 1 − Φ(s; 0.8, 0.07) }

and s_ij ind.∼ Uniform(0, 1), x_ij ind.∼ Uniform(0, 1) and [u^R_{0i}  u^R_{1i}]ᵀ | Σ^R ind.∼ N(0, Σ^R). The true parameter values are

    β_0 = 0.58,    β_x = 1.89,    σ_ε² = 0.04    and    Σ^R = [ 2.58  0.22
                                                                0.22  1.73 ].

The number of groups was varied as m ∈ {100, 500, 2500, 12 500}, with the within-group sample sizes n_i ranging between 10 and 20. Source code to reproduce the results is available as Supporting Information on the journal's web page (http://onlinelibrary.wiley.com/doi/xxx/suppinfo).
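A data-generation sketch for one replicate of (35) is given below in Python (illustrative; it relies on the reconstruction of f(s) above, with Φ(·; 0.8, 0.07) read as a normal distribution function with mean 0.8 and standard deviation 0.07).

    import numpy as np
    from scipy.stats import norm

    def f(s):
        return (1 - 13/(5*np.sqrt(2*np.pi)) * np.exp(-(s - 0.15)**2 / 0.2)
                - (2.3*s - 0.07*s**2) + 0.5*(1 - norm.cdf(s, loc=0.8, scale=0.07)))

    rng = np.random.default_rng(6)
    m = 100
    beta0, betax, sig2_eps = 0.58, 1.89, 0.04
    Sigma_R = np.array([[2.58, 0.22], [0.22, 1.73]])

    data = []
    for i in range(m):
        ni = rng.integers(10, 21)
        u0, u1 = rng.multivariate_normal(np.zeros(2), Sigma_R)
        x, s = rng.uniform(size=ni), rng.uniform(size=ni)
        y = beta0 + u0 + (betax + u1)*x + f(s) + rng.normal(0, np.sqrt(sig2_eps), size=ni)
        data.append((y, x, s))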

6.1  Assessment of Accuracy

We fitted each replicated data-set using both MFVB and MCMC. The MFVB fits were obtained using Algorithm 3, with the iterations terminated when the relative increase in log p(y; q) fell below 10^{−7}. The MCMC samples were obtained using Stan (Stan Development Team, 2014) with R (R Development Core Team, 2014), interfacing via the RStan package (Stan Development Team, 2014). Stan is a probabilistic programming language, written in C++, for implementing full Bayesian statistical inference. In each case, MCMC samples of size 10 000 were generated. The first 5000 values of each sample were discarded as burn-in and the remaining 5000 values were thinned by a factor of 5. The accuracy of the MFVB approximation for a generic parameter θ is assessed using the accuracy score defined and justified in Faes et al. (2011),

    accuracy( q*(θ) ) ≡ 100 { 1 − ½ ∫_{−∞}^{∞} | q*(θ) − p_MCMC(θ|y) | dθ } %,        (36)

where p_MCMC(θ|y) is an accurate MCMC-based approximation to p(θ|y). Here p_MCMC(θ|y) is based on a binned kernel density estimate with a direct plug-in bandwidth, obtained via the R package KernSmooth (Wand & Ripley, 2010). Figures 3 and 4 display approximate posterior density functions and side-by-side boxplots of the accuracy scores for the model parameters β_0, β_x, Σ^R_{11}, Σ^R_{22}, Σ^R_{12} and σ_ε², and for f(Q_k), 1 ≤ k ≤ 4, where the Q_k are the sample quintiles of the s_ij, with m = 100. The boxplots show that the majority of the accuracy scores exceed 95% and that they rarely drop below 90%. These very good accuracy results are consistent with the heuristic justification given in Section 3.1 of Menictas & Wand (2013). In addition, the top left panel of Figure 3 indicates that the MFVB algorithm converged quickly, after 14 iterations. The second panel indicates that the MFVB and MCMC penalized spline fits are virtually identical.
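The accuracy score (36) is straightforward to compute from MCMC draws and an approximating density. The Python sketch below is an illustration only: the paper uses a binned kernel density estimate from the R package KernSmooth, whereas here a Gaussian kernel density estimate from scipy is substituted and the integral is evaluated on a grid.

    import numpy as np
    from scipy.stats import gaussian_kde, norm

    def accuracy(q_pdf, mcmc_draws, grid_size=2001):
        # grid wide enough to cover essentially all of both densities
        lo, hi = mcmc_draws.min(), mcmc_draws.max()
        pad = 0.5*(hi - lo)
        theta = np.linspace(lo - pad, hi + pad, grid_size)
        p_mcmc = gaussian_kde(mcmc_draws)(theta)
        l1 = np.trapz(np.abs(q_pdf(theta) - p_mcmc), theta)
        return 100*(1 - 0.5*l1)

    # toy check: a q-density equal in distribution to the draws gives accuracy near 100%
    rng = np.random.default_rng(7)
    draws = rng.normal(1.2, 0.3, size=5000)
    print(accuracy(lambda t: norm.pdf(t, 1.2, 0.3), draws))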

Figure 3: Top left panel: successive values of log p(y; q), used to monitor the convergence of the mean field variational Bayes algorithm. Second panel: fitted function estimates and pointwise 95% credible sets for both mean field variational Bayes and Markov chain Monte Carlo. Remaining panels: approximate posterior density functions obtained via mean field variational Bayes and Markov chain Monte Carlo for a single replication of the simulation study described in the text with m = 100. Each pair of density functions corresponds to a model parameter in (35). The green lines represent the true parameter values. The accuracy scores at the top right of each plot show the accuracy of the mean field variational Bayes approximation compared against a Markov chain Monte Carlo benchmark.

Figure 4: Side-by-side boxplots of accuracy scores for mean field variational Bayes approximations compared against Markov chain Monte Carlo over 100 runs with m = 100. Each boxplot corresponds to a model parameter in (35).

6.2  Assessment of Speed

We now turn our attention to quantification of the speed gains afforded by the streamlined MFVB algorithm. Both the naïve Algorithm 2 and the streamlined Algorithm 3 were implemented in Fortran 77, and the simulation described in Section 6.1 was re-run using both versions of MFVB, with MCMC omitted. All computations were performed on a Mac OS X laptop with a 2.6 GHz Intel Core i5 processor and 8 GBytes of random access memory. Table 1 summarizes the average (standard error) computing times over 100 runs and shows the practical benefits of the streamlined MFVB approach. As m increases, the average computing time for the naïve approach increases rapidly from 0.2 seconds to almost 25 minutes, compared with about a quarter of a second to half a second for the streamlined approach. For m = 12 500, the naïve approach failed because the required storage of Σ_{q(β,u)} exceeded memory restrictions for typical 2014 personal computing environments. The streamlined approach, however, took just over 1.8 seconds on average. The impressive speed gains of the streamlined approach are clearly reflected in the ratios (naïve/streamlined) of average computing times. In situations where a data-set has a large number of groups, the streamlined approach is more than 3500 times faster than the naïve approach.

    m         Naïve                 Streamlined       Ratio
    100       0.198 (0.023)         0.027 (0.003)     7.4
    500       14.482 (1.704)        0.106 (0.021)     136.8
    2500      1488.757 (108.311)    0.419 (0.056)     3550.4
    12 500    Failed                1.766 (0.096)     N/A

Table 1: Average (standard error) elapsed computing times in seconds for the naïve Algorithm 2 and the streamlined Algorithm 3 in the simulation described in the text.

It is well established that MCMC is computationally intensive in situations where complex models are applied to large data-sets, leading to very slow run times. For m = 12 500 we expect the MCMC fitting to take hours to days, and therefore a similar timing comparison between MFVB and MCMC is not practical.

7  Application

We now provide an illustration of our streamlined MFVB methodology via an analysis of perinatal health data from the United States National Center for Health Statistics, which are presented in Abrevaya (2006). The full data are not publicly available; however, a 10% random sub-sample is provided by Rabe-Hesketh & Skrondal (2008) and our analysis is of this subset. The data have a two-level structure with 8604 births as units at level 1 and 3978 mothers as groups at level 2, giving an average of 2.2 births per mother. The variables are described in Table 2. In this example, study variables can either vary at the birth level (and therefore also at the mother level) or at the mother level only. For instance, smoke is a level-1 variable, as maternal smoking status can change from one pregnancy to the next, whereas black is a level-2 variable, as a mother's ethnicity remains unchanged between pregnancies. Motivated by a report from the United States Surgeon General stating that "Infants born to women who smoke during pregnancy have a lower average birthweight and are more likely to be small for gestational age than infants born to women who do not smoke ...", Abrevaya (2006) examined the effect of smoking on birthweight for women in the United States between 1990 and 1998 using a matched panel data approach. Here we analyze the data using a different approach via streamlined variational Bayes. The main study


    Variable    Description
    momid       mother identifier
    birwt       birthweight in grams
    gestat      infant's gestational age in weeks
    mage        mother's age at the birth of the infant in years
    smoke       indicator of mother smoking during pregnancy
    male        indicator of infant being male
    married     indicator of mother being married
    hsgrad      indicator of mother having graduated from high school
    somecoll    indicator of mother having some college education, but no degree
    collgrad    indicator of mother having graduated from college
    black       indicator of mother being black
    kessner2    indicator of Kessner index equalling 2
    kessner3    indicator of Kessner index equalling 3
    novisit     indicator of no prenatal care visit
    pretri2     indicator of first prenatal care visit having occurred in the second trimester
    pretri3     indicator of first prenatal care visit having occurred in the third trimester

Table 2: Description of the United States National Center for Health Statistics perinatal health data presented in Abrevaya (2006).

factor is maternal smoking status and the response of interest is the infant's birthweight. The following Bayesian Gaussian semiparametric regression model with a random intercept for each mother was fitted via Algorithm 3:

    birwt_ij | β, u, σ_ε²  ind.∼  N( β_0 + β_1 smoke_ij + f(gestat_ij) + β_3 mage_ij + β_4 male_ij
                                      + β_5 married_ij + β_6 hsgrad_ij + β_7 somecoll_ij + β_8 collgrad_ij
                                      + β_9 black_ij + β_10 kessner2_ij + β_11 kessner3_ij + β_12 novisit_ij
                                      + β_13 pretri2_ij + β_14 pretri3_ij + u^R_i,  σ_ε² ),

    u^R_i  ind.∼  N(0, σ_R²),        β ∼ N(0, σ_β² I_P),
    σ_R² ∼ Half-Cauchy(0, A_R),      σ_ε² ∼ Half-Cauchy(0, A_ε),
    for 1 ≤ i ≤ 3978, 1 ≤ j ≤ n_i,        (37)

where

    f(gestat) = β_2 gestat + Σ_{k=1}^{K} u^G_k z_k(gestat),      u^G_k  ind.∼  N(0, σ_u²)    and    σ_u² ∼ Half-Cauchy(0, A_u)

is a penalized spline function for gestational age. Here z_1(·), . . . , z_K(·) is a set of O'Sullivan spline basis functions and σ_u² represents the amount of penalization of the spline coefficients u^G_1, . . . , u^G_K, as described in (9). To fit this model, we standardize the response and the continuous predictors to have zero means and unit standard deviations. We use a normal prior for β and Half-Cauchy priors for σ_R², σ_ε² and σ_u², with the values of the hyperparameters all set to 10^5. Figures 5 and 6 allow visual assessment of the MFVB-based approximate posterior density functions against an MCMC benchmark for (37). The MCMC fit was obtained using Stan with a burn-in of size 5000 and a post burn-in sample of size 5000 with a thinning factor of 5. The accuracy of the approximate posterior density functions for the entries of β is excellent, ranging from 95% to 98%, while it is about 75% and 82% for the variance parameters σ_u² and σ_ε² respectively.

Figure 5: Approximate posterior density functions obtained via mean field variational Bayes (orange curves) and Markov chain Monte Carlo (blue curves) for fitting the Gaussian random intercept model corresponding to (37). Each pair of density functions corresponds to a model parameter in (37). The accuracy scores at the top right of each plot show the accuracy of the mean field variational Bayes approximation compared against a Markov chain Monte Carlo benchmark.

Inspection of Figure 6 shows that the MFVB fits and pointwise 95% credible sets are very close to those obtained using MCMC. The MCMC fits took 4.1 days, while the MFVB fits took 1.4 seconds. This represents more than a two hundred thousand-fold improvement in computing time.



Figure 6: Comparison of mean field variational Bayes and Markov chain Monte Carlo penalized spline fits for the Gaussian random intercept model corresponding to (37). The solid curves are approximate posterior means of the regression function and the dashed curves are pointwise 95% credible sets. The green open circles are the raw data.

8

Concluding Remarks

Mean field variational Bayes with streamlining, as exemplified by Algorithms 3 and 4, is a useful addition to the longitudinal and multilevel data analysis arsenal. It allows rapid and accurate approximate Bayesian inference for very large datasets, with computational times that increase only linearly in the number of groups. Extension of streamlined MFVB to higher level longitudinal and multilevel models is also of interest, although it requires a significant amount of additional mathematical analysis. Another extension is real-time processing, achieved in Luts, Broderick & Wand (2014) using online versions of MFVB. This article has helped open up a new branch of methodology for fitting and inference for grouped data in the high volume/velocity era.


A

The variational lower bound on the marginal log-likelihood for Algorithm 3

The marginal log-likelihood lower bound log p(y; q) can be obtained from (24) via straightforward, albeit long-winded, algebra. In addition, it uses a streamlined computation of log |Σ_{q(β,u)}|, which we now justify. It depends on the following result concerning determinants of partitioned matrices:
\[
\left|\begin{array}{cc} A & B \\ C & D \end{array}\right| = |D|\,\big|A - B D^{-1} C\big|, \qquad (38)
\]
where D is a square invertible matrix (e.g. Harville, 2008, Theorem 13.3.8). From (38) we then get
\[
\log|\Sigma_{q(\beta,u)}|
= -\log\left|\begin{array}{ccccc}
F & G_1 & G_2 & \cdots & G_m \\
G_1^T & H_1^{-1} & 0 & \cdots & 0 \\
G_2^T & 0 & H_2^{-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
G_m^T & 0 & 0 & \cdots & H_m^{-1}
\end{array}\right|
= -\sum_{i=1}^{m} \log|H_i^{-1}| - \log\big|\Sigma^{-1}_{q(\beta,u^G)}\big|.
\]
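As a numerical sanity check on this streamlined log-determinant computation, the following sketch builds a small precision matrix with the arrow-shaped block structure displayed above, using arbitrary randomly generated blocks, and confirms that the direct and streamlined evaluations of log |Σ_{q(β,u)}| agree.

## Numerical check of the streamlined log-determinant identity, using small
## arbitrary dimensions (p fixed/spline effects, m groups, qR random effects
## per group). All blocks are randomly generated for illustration only.
set.seed(1)
p <- 4; m <- 5; qR <- 2
crossprod_rand <- function(n) { A <- matrix(rnorm(n * n), n, n); crossprod(A) + diag(n) }
F_blk <- crossprod_rand(p)                                      # F block
H_inv <- replicate(m, crossprod_rand(qR), simplify = FALSE)     # H_i^{-1} blocks
G_blk <- replicate(m, matrix(rnorm(p * qR), p, qR) / 10, simplify = FALSE)

## Assemble the full (p + m*qR) x (p + m*qR) precision matrix.
prec <- matrix(0, p + m * qR, p + m * qR)
prec[1:p, 1:p] <- F_blk
for (i in 1:m) {
  idx <- p + (i - 1) * qR + 1:qR
  prec[1:p, idx] <- G_blk[[i]]
  prec[idx, 1:p] <- t(G_blk[[i]])
  prec[idx, idx] <- H_inv[[i]]
}

## Direct evaluation: log|Sigma| = -log|precision matrix|.
direct <- -determinant(prec, logarithm = TRUE)$modulus

## Streamlined evaluation via (38): F - sum_i G_i H_i G_i^T plays the role of
## the inverse of Sigma_{q(beta, u^G)}.
S_inv <- F_blk
for (i in 1:m) S_inv <- S_inv - G_blk[[i]] %*% solve(H_inv[[i]], t(G_blk[[i]]))
streamlined <- -sum(sapply(H_inv, function(H) determinant(H, logarithm = TRUE)$modulus)) -
  determinant(S_inv, logarithm = TRUE)$modulus
all.equal(as.numeric(direct), as.numeric(streamlined))   # should return TRUE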

Convergence in Algorithm 3 is assessed using the variational lower bound on the marginal log-likelihood, which admits the following explicit expression:
\[
\begin{aligned}
\log p(y;q) ={}& \tfrac{1}{2}\,q^R(\nu + q^R - 1)\log(2\nu)
 - \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^{m} n_i\Big)\log(2\pi)
 - \big(\tfrac{1}{2}q^R + L + 1\big)\log(\pi)
 + \tfrac{1}{2}\Big(\textstyle\sum_{\ell=1}^{L} q^G_\ell + P + m\Big) \\
& - \tfrac{1}{2}P\log(\sigma_\beta^2)
 - \tfrac{1}{2}\sigma_\beta^{-2}\big\{\|\mu_{q(\beta)}\|^2 + \mathrm{tr}(\Sigma_{q(\beta)})\big\} \\
& - \tfrac{1}{2}\sum_{i=1}^{m}\log\big|\mu_{q(1/\sigma_\varepsilon^2)}(X^R_i)^T X^R_i + M_{q((\Sigma^R)^{-1})}\big|
 - \tfrac{1}{2}\log\big|\Sigma^{-1}_{q(\beta,u^G)}\big| \\
& - \log(C_{q^R,\,\nu+q^R-1}) + \log(C_{q^R,\,\nu+m+q^R-1})
 - \tfrac{1}{2}(\nu + m + q^R - 1)\log|B_{q(\Sigma^R)}| \\
& + \sum_{\ell=1}^{L}\log\Gamma\big\{\tfrac{1}{2}(q^G_\ell + 1)\big\}
 - \tfrac{1}{2}\sum_{\ell=1}^{L}(q^G_\ell + 1)\log(B_{q(\sigma^2_{u_\ell})}) \\
& + \log\Gamma\big\{\tfrac{1}{2}\big(\textstyle\sum_{i=1}^{m} n_i + 1\big)\big\}
 - \tfrac{1}{2}\big(\textstyle\sum_{i=1}^{m} n_i + 1\big)\log(B_{q(\sigma_\varepsilon^2)}) \\
& - \sum_{r=1}^{q^R}\log(A^R_r) + q^R\log\Gamma\big\{\tfrac{1}{2}(\nu + q^R)\big\} \\
& + \sum_{r=1}^{q^R}\nu\big(M_{q((\Sigma^R)^{-1})}\big)_{rr}\,\mu_{q(1/a^R_r)}
 - \tfrac{1}{2}(\nu + q^R)\sum_{r=1}^{q^R}\log(B_{q(a^R_r)}) \\
& - \sum_{\ell=1}^{L}\log(A_{u_\ell}) - \sum_{\ell=1}^{L}\log(B_{q(a_{u_\ell})})
 + \sum_{\ell=1}^{L}\mu_{q(1/a_{u_\ell})}\,\mu_{q(1/\sigma^2_{u_\ell})} \\
& - \log(A_\varepsilon) - \log(B_{q(a_\varepsilon)}) + \mu_{q(1/a_\varepsilon)}\,\mu_{q(1/\sigma_\varepsilon^2)},
\end{aligned}
\]
where C_{q^R, ν+q^R−1} and C_{q^R, ν+m+q^R−1} are defined by (17).
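In practice this expression is evaluated after each cycle of coordinate ascent updates and iteration stops once its change is negligible. A minimal sketch of such a stopping rule follows; mfvb_update() and log_ml_lower_bound(), standing in for one cycle of Algorithm 3 and the expression above, are hypothetical placeholders, as is the parameter container q_params.

## Sketch of convergence monitoring via the marginal log-likelihood lower bound.
## 'mfvb_update', 'log_ml_lower_bound' and 'q_params' are hypothetical
## placeholders; q_params is assumed to be initialized beforehand.
tol <- 1e-10
log_p_old <- -Inf
for (iter in 1:500) {
  q_params  <- mfvb_update(q_params)
  log_p_new <- log_ml_lower_bound(q_params)
  if (abs((log_p_new - log_p_old) / log_p_new) < tol) break   # relative-change criterion
  log_p_old <- log_p_new
}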

B

The variational lower bound on the marginal log-likelihood for Algorithm 4

Convergence in Algorithm 4 is assessed using the variational lower bound on the marginal log-likelihood, which admits the following explicit expression:
\[
\begin{aligned}
\log p(y;q) ={}& \tfrac{1}{2}\,q^R(\nu + q^R - 1)\log(2\nu)
 - \big(\tfrac{1}{2}q^R + L\big)\log(\pi)
 - \lambda(\xi)^T(\xi^2) + 1^T\zeta(\xi) \\
& + \big(y - \tfrac{1}{2}1\big)^T\left\{ C^G\mu_{q(\beta,u^G;\xi)}
 + \left[\begin{array}{c} X^R_1\mu_{q(u^R_1;\xi)} \\ \vdots \\ X^R_m\mu_{q(u^R_m;\xi)} \end{array}\right]\right\} \\
& - \tfrac{1}{2}P\log(\sigma_\beta^2)
 - \tfrac{1}{2}\sigma_\beta^{-2}\big\{\|\mu_{q(\beta;\xi)}\|^2 + \mathrm{tr}(\Sigma_{q(\beta;\xi)})\big\}
 + \tfrac{1}{2}\Big(\textstyle\sum_{\ell=1}^{L} q^G_\ell + P + m\Big) \\
& - \tfrac{1}{2}\sum_{i=1}^{m}\log\big|(X^R_i)^T X^R_i + M_{q((\Sigma^R)^{-1})}\big|
 - \tfrac{1}{2}\log\big|\Sigma^{-1}_{q(\beta,u^G;\xi)}\big| \\
& - \log(C_{q^R,\,\nu+q^R-1}) + \log(C_{q^R,\,\nu+m+q^R-1})
 - \tfrac{1}{2}(\nu + m + q^R - 1)\log|B_{q(\Sigma^R)}| \\
& + \sum_{\ell=1}^{L}\log\Gamma\big\{\tfrac{1}{2}(q^G_\ell + 1)\big\}
 - \tfrac{1}{2}\sum_{\ell=1}^{L}(q^G_\ell + 1)\log(B_{q(\sigma^2_{u_\ell})})
 - \sum_{r=1}^{q^R}\log(A^R_r) \\
& + q^R\log\Gamma\big\{\tfrac{1}{2}(\nu + q^R)\big\}
 + \sum_{r=1}^{q^R}\nu\big(M_{q((\Sigma^R)^{-1})}\big)_{rr}\,\mu_{q(1/a^R_r)}
 - \tfrac{1}{2}(\nu + q^R)\sum_{r=1}^{q^R}\log(B_{q(a^R_r)}) \\
& - \sum_{\ell=1}^{L}\log(A_{u_\ell}) - \sum_{\ell=1}^{L}\log(B_{q(a_{u_\ell})})
 + \sum_{\ell=1}^{L}\mu_{q(1/a_{u_\ell})}\,\mu_{q(1/\sigma^2_{u_\ell})},
\end{aligned}
\]
where ζ(x) ≡ x/2 − log(1 + e^x) + x tanh(x/2)/4.
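The function ζ enters the bound only through elementwise evaluation at the entries of ξ. A direct transcription in R, with an overflow-safe form of log(1 + e^x) added purely as a numerical safeguard (not part of the definition above), is:

## zeta(x) = x/2 - log(1 + exp(x)) + x * tanh(x/2)/4, evaluated elementwise.
## pmax(x, 0) + log1p(exp(-abs(x))) is an overflow-safe rewriting of log(1 + exp(x)).
zeta <- function(x) {
  log1pex <- pmax(x, 0) + log1p(exp(-abs(x)))
  x / 2 - log1pex + x * tanh(x / 2) / 4
}
zeta(c(-2, 0, 2))   # example elementwise evaluation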

Acknowledgments We are grateful to Chris Oates, Louise Ryan and Brandon Stewart for their comments on this research. This research was partially supported by an Australian Postgraduate Award, a University of Technology Sydney Chancellor’s Research Award and Australian Research Council Discovery Project DP110100061.

Conflict of Interest The authors have declared no conflict of interest.

References
Abrevaya, J. (2006). Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21, 489–519.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Bishop, C. M. (2008). A new framework for machine learning. In J. M. Zurada et al., editors, World Congress on Computational Intelligence, 2008 Plenary/Invited Lectures, Lecture Notes in Computer Science 5050. Germany: Springer-Verlag.

Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
Diggle, P., Heagerty, P., Liang, K. L. & Zeger, S. (2002). Analysis of Longitudinal Data, 2nd Edition. Oxford: Oxford University Press.
Douglas, B., Maechler, M., Bolker, B. & Walker, S. (2014). lme4 1.1-5: Linear mixed-effects models using Eigen and S4. R package. URL http://CRAN.R-project.org/package=lme4.
Faes, C., Ormerod, J. & Wand, M. P. (2011). Variational Bayesian inference for parametric and non-parametric regression with missing predictor data. Journal of the American Statistical Association 106, 495.
Fitzmaurice, G. M., Laird, N. M. & Ware, J. H. (2011). Applied Longitudinal Analysis, 2nd Edition. New York: John Wiley & Sons.
Fitzmaurice, G., Davidian, M., Verbeke, G. & Molenberghs, G. (2008). Longitudinal Data Analysis: A Handbook of Modern Statistical Methods. Florida: Chapman & Hall/CRC.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 515–533.
Gelman, A. & Hill, J. (2007). Data Analysis using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.
Goldstein, H. (2010). Multilevel Statistical Models, 4th Edition. Chichester: Wiley.
Harville, D. A. (2008). Matrix Algebra from a Statistician's Perspective. New York: Springer.

Huang, A. & Wand, M. P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis 2, 439–452.
Jaakkola, T. S. & Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing 10, 25–37.
Knowles, D. A. & Minka, T. P. (2011). Non-conjugate variational message passing for multinomial and binary regression. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. Pereira and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, 1701–1709.
Luenberger, D. G. & Ye, Y. (2008). Linear and Nonlinear Programming, 3rd Edition. New York: Springer.
Luts, J., Broderick, T. & Wand, M. P. (2014). Real-time semiparametric regression. Journal of Computational and Graphical Statistics 23, 589–615.
Menictas, M. & Wand, M. P. (2013). Variational inference for marginal longitudinal semiparametric regression. Stat 2, 61–71.
Minka, T., Winn, J., Guiver, J. & Knowles, D. (2013). Infer.NET 2.5. Microsoft Research Cambridge.


Ormerod, J. T. & Wand, M. P. (2010). Explaining variational approximations. The American Statistician 64, 140–153.
R Development Core Team. (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. ISBN 3-900051-07-0.
Rabe-Hesketh, S. & Skrondal, A. (2008). Multilevel and longitudinal modeling using Stata. Texas: Stata Press.
Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15–51.
Ruppert, D., Wand, M. P. & Carroll, R. J. (2003). Semiparametric Regression. New York: Cambridge University Press.
SAS Institute Inc. (2013). SAS/STAT 13.1 User's Guide. Cary, North Carolina: SAS Institute Inc.
Smith, A. D. A. C. & Wand, M. P. (2008). Streamlined variance calculations for semiparametric mixed models. Statistics in Medicine 29, 435–448.
Stan Development Team. (2014). Stan: A C++ Library for Probability and Sampling, Version 2.2. URL http://mc-stan.org.
Tan, S. L. L. & Nott, D. J. (2013). Variational inference for generalized linear mixed models using partially noncentered parametrizations. Statistical Science 28, 167–188.
Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D. & Caldas, C. (2005). A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21, 3025–3033.
Tipping, M. E. (1999b). Probabilistic visualisation of high-dimensional binary data. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 592–598. MIT Press.
Titterington, D. M. (2004). Bayesian methods for neural networks and related models. Statistical Science 19, 128–139.
Wainwright, M. J. & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1–305.
Wand, M. P. & Ormerod, J. T. (2008). On semiparametric regression with O'Sullivan penalized splines. Australian and New Zealand Journal of Statistics 50, 179–198.
Wand, M. P. & Ormerod, J. T. (2011). Penalized wavelets: embedding wavelets into semiparametric regression. Electronic Journal of Statistics 5, 1654–1717.
Wand, M. P., Ormerod, J. T., Padoan, S. A. & Frühwirth, R. (2011). Mean field variational Bayes for elaborate distributions. Bayesian Analysis 6, 847–900.
Wand, M. P. & Ripley, B. D. (2009). KernSmooth 2.23: Functions for Kernel Smoothing Corresponding to the Book: Wand, M. P. & Jones, M. C. (1995), Kernel Smoothing. R package. URL http://cran.r-project.org.

Zhao, Y., Staudenmayer, J., Coull, B. A. & Wand, M. P. (2006). General design Bayesian generalized linear mixed models. Statistical Science 21, 35–51.
