Convergence of Proximal-Gradient Stochastic Variational Inference under Non-Decreasing Step-Size Sequence
Mohammad Emtiyaz Khan, École Polytechnique Fédérale de Lausanne, Switzerland
Reza Babanezhad, University of British Columbia, Canada
Wu Lin, University of Waterloo, Canada
Mark Schmidt, University of British Columbia, Canada
Masashi Sugiyama, University of Tokyo, Japan

Abstract

Stochastic approximation methods have recently gained popularity for variational inference, but many existing approaches treat them as "black-box" tools. Thus, they often do not take advantage of the geometry of the posterior and usually require a decreasing sequence of step-sizes (which converges slowly in practice). We introduce a new stochastic-approximation method that uses a proximal-gradient framework. Our method exploits the geometry and structure of the variational lower bound, and contains many existing methods, such as stochastic variational inference, as special cases. We establish the convergence of our method under a non-decreasing step-size schedule, which has both theoretical and practical advantages. We consider setting the step-size based on the continuity of the objective and the geometry of the posterior, and show that our method gives a faster rate of convergence for variational-Gaussian inference than existing stochastic methods.

1 Introduction

Stochastic methods have recently gained popularity for using variational inference to maximize lower bounds to marginal likelihoods [Hoffman et al., 2013]. Stochastic-approximation gradient-descent (SAGD) methods have now been extensively applied for variational inference in latent variable models [Salimans et al., 2013, Ranganath et al., 2013, Titsias and Lázaro-Gredilla, 2014, Kingma and Welling, 2013, Mnih and Gregor, 2014, Kucukelbir et al., 2014]. These methods scale well to large data and are widely applicable due to their simplicity. However, such
"black-box" approaches do not exploit the structure of the variational objective function and usually converge slowly. For example, the variational objective often consists of both convex and non-convex parts, and taking this structure into account could improve the rate of convergence. Another problem with existing black-box methods is that most of them ignore the geometry of the variational parameter space. One of the most popular methods, stochastic variational inference (SVI), does take the geometry into account by using natural gradients [Hoffman et al., 2013], but unfortunately it applies only to a limited class of models, such as conditionally-conjugate exponential-family distributions. Extending natural gradients to black-box approaches remains a challenging problem.

As a result, many existing stochastic methods lack a principled approach for step-size selection and usually suffer from slow convergence. In most cases, convergence is only guaranteed under a decreasing step-size sequence. Many approaches rely on automatic methods for step-size selection (e.g., AdaGrad [Duchi et al., 2011] and ADADELTA [Zeiler, 2012]) which, when used as a black-box, do not exploit the structure and geometry of the problem.

In this paper, we propose a stochastic-approximation variational method based on a proximal-gradient framework. The proximal-gradient framework splits the variational bound into convex and non-convex parts, thereby taking the structure of the bound into account. In this framework we can use a divergence function, such as the Kullback-Leibler (KL) divergence or another Bregman divergence, to incorporate the geometry of the posterior in the variational objective. By doing this, we show that methods such as SVI are special cases of our method. Most importantly, each step in our method corresponds to solving a simple problem in which the non-convex part is linearized and for which closed-form expressions often exist.

We show convergence of proximal-gradient stochastic variational methods under very general conditions. We prove
that in many cases these stochastic methods converge with a constant step-size, and thus do not require the decreasing step-size sequences that degrade practical performance. For example, when the posterior belongs to an exponential family, the step-size can be set to α∗/L. Here, L is the Lipschitz constant of the gradient of the non-convex part and α∗ is a constant related to the partition function of the exponential family.

We apply our method to variational-Gaussian (VG) inference and experimentally confirm that constant step-sizes lead to faster convergence than black-box approaches that do not exploit the structure or geometry of the lower bound.

2 Variational Inference

Consider a general latent variable model with a data vector y of length N and a latent vector z of length D. The joint distribution under the model is denoted by p(y, z|θ), with θ being the (known) hyper-parameters. Evidence lower bound optimization (ELBO) approximates the posterior p(z|y, θ) by a distribution q(z|λ) that maximizes a lower bound to the marginal likelihood. Here, λ is the variational parameter. As shown in (1), the lower bound is obtained by first multiplying and dividing by q(z|λ), and then applying Jensen's inequality using the concavity of log:

$$\log p(\mathbf{y}) = \log \int q(\mathbf{z}|\boldsymbol{\lambda}) \, \frac{p(\mathbf{y}, \mathbf{z}|\boldsymbol{\theta})}{q(\mathbf{z}|\boldsymbol{\lambda})} \, d\mathbf{z} \;\ge\; \max_{\boldsymbol{\lambda} \in \mathcal{S}} \; \mathbb{E}_{q(\mathbf{z}|\boldsymbol{\lambda})}\left[\log p(\mathbf{y}, \mathbf{z}|\boldsymbol{\theta})\right] - \mathbb{E}_{q(\mathbf{z}|\boldsymbol{\lambda})}\left[\log q(\mathbf{z}|\boldsymbol{\lambda})\right]. \quad (1)$$

Here, S is the set of valid variational parameters λ. We denote the term inside the max by L(λ) (dropping the dependence on θ in the notation). The approximate posterior¹ q(z|λ) is obtained by maximizing L(λ) with respect to λ. It is desirable to choose q(z|λ) such that the lower bound is easy to optimize; however, in general the lower bound is difficult to optimize even when q is chosen carefully. There are three main reasons why this could happen. First, some terms in the lower bound might be intractable or may admit a form that is not easy to optimize. Second, the data might be too large, making the optimization difficult (large-N problem). Finally, there might be too many latent variables in z (large-D problem). In this paper, we focus on large-N problems, but our method also simplifies large-D problems in some cases.

¹We assume that q is parameterized. Some authors refer to this as a "fixed-form" approximation [Salimans et al., 2013].

2.1 Stochastic Methods for Variational Inference

Stochastic methods are a promising recent approach to address large-N problems. A simple way to do this is to use random subsets (mini-batches) of the N examples [Hoffman et al., 2013]. We may also need to approximate the potentially-intractable expectations using Monte-Carlo methods that draw samples from q(z), leading to a doubly-stochastic approximation [Titsias and Lázaro-Gredilla, 2014]. Many black-box approaches using similar strategies have been proposed in recent years [Salimans et al., 2013, Ranganath et al., 2013].

Most of these approaches rely on the theory of stochastic approximation for convergence [Robbins and Monro, 1951]. Classically, these methods require that the approximation is unbiased and that the step-sizes βk satisfy

$$\sum_{k=1}^{\infty} \beta_k = \infty, \qquad \sum_{k=1}^{\infty} \beta_k^2 < \infty. \quad (2)$$

These conditions suggest a decreasing step-size sequence, but it is hard to find a good schedule of step-sizes in practice, and this typically leads to poor practical performance.

A second issue concerns the geometry of the posterior. The recent success of SVI suggests that incorporating the geometry of the posterior should accelerate convergence. Specifically, SVI uses natural-gradient steps that use a symmetric KL divergence D^sym_KL instead of the standard Euclidean distance to choose a step dλ,

$$\max_{d\boldsymbol{\lambda} \in \mathcal{S}} \; \mathcal{L}(\boldsymbol{\lambda} + d\boldsymbol{\lambda}), \quad \text{s.t.} \quad D_{KL}^{sym}(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda} + d\boldsymbol{\lambda}) \le \epsilon, \quad (3)$$

for some ε; we refer to (12) for the definition of the symmetric KL. This problem leads to a simple update that appears to converge quickly when the model and the posterior belong to a conditionally-conjugate exponential family. Exploiting Bregman divergences to adapt to the structure of problems is quite popular in the optimization community, where it is known as mirror descent [Beck and Teboulle, 2003]. However, only a small number of variational methods exploit the geometry of the problem in such a way (e.g., [Sato, 2001, Honkela et al., 2011]).

A third and final issue concerns the structure of the lower bound. As we review in the next section, the lower bound consists of both convex and non-convex terms. During optimization, it is desirable to treat the convex terms differently from the non-convex terms. However, most existing variational methods do not take this 'convex plus non-convex' structure into account. In this paper, we propose a method to address the three issues discussed above.
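To make the doubly-stochastic strategy of this subsection concrete, here is a minimal sketch (ours, not from the paper) of estimating the lower bound (1) with a mini-batch over the data and Monte-Carlo samples from a diagonal-Gaussian q(z|λ) = N(m, diag(v)); the callables `log_lik` and `log_prior` are hypothetical model-specific oracles.

```python
import numpy as np

def elbo_estimate(m, v, log_lik, log_prior, y, batch_size, num_mc, rng):
    """Doubly-stochastic estimate of the lower bound (1): a mini-batch
    approximates the sum over the N likelihood terms, and Monte-Carlo
    samples from q(z) = N(m, diag(v)) approximate the expectations."""
    N = len(y)
    idx = rng.choice(N, size=batch_size, replace=False)
    z = m + np.sqrt(v) * rng.standard_normal((num_mc, m.size))
    # Rescale the mini-batch likelihood so its expectation matches all N terms.
    lik = np.mean([(N / batch_size) * log_lik(zs, y[idx], idx) for zs in z])
    prior = np.mean([log_prior(zs) for zs in z])
    # The entropy of a diagonal Gaussian is available in closed form.
    entropy = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * v))
    return lik + prior + entropy
```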
3 Composite Structure of the Lower Bound

A function can always be expressed as the sum of convex and non-convex 'parts'. For the negative of the variational lower bound, we denote the convex part by h and the non-convex part by f, as shown below:

$$-\mathcal{L}(\boldsymbol{\lambda}) := \underbrace{f(\boldsymbol{\lambda})}_{\text{non-convex}} + \underbrace{h(\boldsymbol{\lambda})}_{\text{convex}}. \quad (4)$$

For the variational lower bound, such splits occur naturally, due to the presence of the second (entropy) term of (1). We give two examples below.

Variational Gaussian inference: Consider the variational Gaussian (VG) approximation, q(z|λ) := N(z|m, V), with mean m and covariance V. This is typically used for latent Gaussian models such as generalized linear models and Gaussian process (GP) models. We show the joint distribution for a GP model below, where we have N scalar outputs yn and latent function values zn sampled from GP(m(x), κ(x, x′)) with mean function m and covariance function κ:

$$p(\mathbf{y}, \mathbf{z}|\mathbf{X}, \boldsymbol{\theta}) = \prod_{n=1}^{N} p(y_n|z_n) \, \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad (5)$$

where Σ is the matrix with (i, j)'th entry κ(xi, xj) and µ is the vector with n'th entry m(xn), with xn being the D-length input vector corresponding to yn. In this case, the negative of the lower bound is given as follows:

$$-\mathcal{L}(\mathbf{m}, \mathbf{V}) := -\sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(\mathbf{z}|\mathbf{m},\mathbf{V})}\left[\log p(y_n|z_n)\right] + D_{KL}\left[\mathcal{N}(\mathbf{z}|\mathbf{m}, \mathbf{V}) \,\|\, \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\Sigma})\right], \quad (6)$$

where the KL divergence is given by (10). The KL divergence is jointly convex with respect to λ = {m, V}, while the first term is non-convex in general.

Inference with exponential-family distributions: We consider the set-up described in [Paquet, 2014] for conditionally-conjugate exponential-family distributions. Consider a Bayesian network with nodes zi and a joint distribution ∏i p(zi|pai), where pai are the parents of zi in the network. We assume that each factor follows an exponential-family distribution,

$$p(z_i|\text{pa}_i) := h_i(z_i) \exp\left[\boldsymbol{\eta}_i^T(\text{pa}_i) \, \mathbf{T}_i(z_i) - A_i(\boldsymbol{\eta}_i)\right], \quad (7)$$

where ηi are the natural parameters, Ti(zi) are the sufficient statistics, Ai(ηi) is the partition function, and hi(zi) is the base measure. We seek a factorized approximation, q(z) = ∏i qi(zi), where each zi belongs to the same exponential family as the joint distribution,

$$q_i(z_i) := h_i(z_i) \exp\left[\boldsymbol{\lambda}_i^T \mathbf{T}_i(z_i) - A_i(\boldsymbol{\lambda}_i)\right]. \quad (8)$$

Here, λi are the natural parameters of qi. We use λi as the parameters of this distribution to differentiate them from the joint-distribution parameters ηi. We also assume that the representation is minimal. The lower bound with respect to λi is shown below,

$$-\mathcal{L}(\boldsymbol{\lambda}_i) := (\boldsymbol{\lambda}_i^* - \boldsymbol{\lambda}_i)^T \nabla A_i(\boldsymbol{\lambda}_i) + A_i(\boldsymbol{\lambda}_i), \quad (9)$$

where λi∗ is the mean-field update for λi (see Appendix A.1 and A.2 in Paquet [2014] for a detailed derivation). The second term is convex, while the first may not be.

4 Geometry of the Posterior q(z|λ)

The geometry of the posterior distribution can be incorporated using divergence measures. For exponential-family distributions, there are natural alternatives to the squared Euclidean distance. For example, we could use the KL divergence, defined as follows:

$$D_{KL}\left(q(\mathbf{z}|\boldsymbol{\lambda}) \,\|\, q(\mathbf{z}|\boldsymbol{\lambda}')\right) := \mathbb{E}_{q(\mathbf{z}|\boldsymbol{\lambda})}\left[\log q(\mathbf{z}|\boldsymbol{\lambda}) - \log q(\mathbf{z}|\boldsymbol{\lambda}')\right] \quad (10)$$
$$= A(\boldsymbol{\lambda}') - A(\boldsymbol{\lambda}) - \nabla A(\boldsymbol{\lambda})^T(\boldsymbol{\lambda}' - \boldsymbol{\lambda}). \quad (11)$$

In general, for appropriately smooth families of distributions, the KL divergence is roughly quadratic, where the quadratic form is given by the Fisher information matrix [Nielsen and Garcia, 2009]. The Bregman divergence defines another class of divergence functions. For an exponential-family distribution, it is equal to the KL divergence with swapped natural parameters: DBreg(λ′ ‖ λ) = DKL(λ ‖ λ′). Finally, the symmetric-KL divergence used in SVI is equal to the sum of the above two divergences:

$$D_{KL}^{sym}(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}') = D_{KL}(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}') + D_{KL}(\boldsymbol{\lambda}' \,\|\, \boldsymbol{\lambda}). \quad (12)$$

Our proposed approach incorporates the geometry of the posterior using such divergence functions.
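As a concrete instance of (10)-(12), the following sketch (ours, not the paper's) evaluates the KL divergence of a Bernoulli distribution in its natural parameterization via the log-partition function A(λ) = log(1 + e^λ), and numerically checks both the Bregman form (11) and the swapped-parameter relation.

```python
import numpy as np

def A(lam):            # Bernoulli log-partition in natural parameters
    return np.log1p(np.exp(lam))

def dA(lam):           # its gradient: the mean parameter sigmoid(lam)
    return 1.0 / (1.0 + np.exp(-lam))

def kl_expfam(lam, lam2):
    """KL(q_lam || q_lam2) computed via the Bregman form (11)."""
    return A(lam2) - A(lam) - dA(lam) * (lam2 - lam)

def kl_direct(p, q):   # textbook Bernoulli KL for comparison
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

lam, lam2 = 0.3, -1.2
p, q = dA(lam), dA(lam2)
assert np.isclose(kl_expfam(lam, lam2), kl_direct(p, q))
# Bregman divergence = KL with swapped arguments, as stated in Section 4.
assert np.isclose(kl_expfam(lam2, lam), kl_direct(q, p))
```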
5 Our Approach: Proximal-Gradient SVI

We propose a stochastic variational-inference method based on a proximal-gradient framework. In this section, we first describe the computation of a stochastic approximation to the gradient and the specification of the divergence function, after which we present our approach.

Stochastic approximation: We compute a stochastic approximation to the gradient of the non-convex part f(λ), denoted by ĝ(λk, ξk), where λk is the value of λ at the k'th iteration and ξk is a random variable that represents the noise in the approximation. We assume that the approximation is unbiased and has bounded variance, as shown in (13) and (14), where σ > 0 is a constant. We also assume that the gradient of f is L-Lipschitz continuous for all λ, λ′ ∈ S, as shown in (15):

$$\text{A1.} \quad \mathbb{E}\left[\hat{\mathbf{g}}(\boldsymbol{\lambda}_k, \boldsymbol{\xi}_k)\right] = \nabla f(\boldsymbol{\lambda}_k) \quad (13)$$
$$\text{A2.} \quad \mathbb{E}\left[\|\hat{\mathbf{g}}(\boldsymbol{\lambda}_k, \boldsymbol{\xi}_k) - \nabla f(\boldsymbol{\lambda}_k)\|^2\right] \le \sigma^2 \quad (14)$$
$$\text{A3.} \quad \|\nabla f(\boldsymbol{\lambda}) - \nabla f(\boldsymbol{\lambda}')\| \le L \|\boldsymbol{\lambda} - \boldsymbol{\lambda}'\| \quad (15)$$

This notation contains the doubly-stochastic approximation [Titsias and Lázaro-Gredilla, 2014] as a special case. We may have two types of stochasticity: the first due to mini-batch selection and the second due to the Monte-Carlo (MC) approximation of intractable expectations. The latter can be easily computed using samples from the posterior. In particular, for our approach, we assume a mini-batch of size Mk. For the i'th data example, we compute an approximation ĝ^(i) to the gradient, and average these to obtain an approximation to the full gradient:

$$\hat{\mathbf{g}}_k := \frac{1}{M_k} \sum_{i=1}^{M_k} \hat{\mathbf{g}}^{(i)}(\boldsymbol{\lambda}_k, \boldsymbol{\xi}_k). \quad (16)$$
Geometry of the posterior: We use a divergence function between two distributions q(z|λ) and q(z|λ′) to incorporate the geometry of the posterior. We denote it by D(λ ‖ λ′) for notational simplicity. This divergence function could be any function that satisfies non-negativity, D(λ ‖ λ′) ≥ 0 for all λ, λ′, with equality holding only when λ = λ′. However, we will make one additional assumption, as we show in the next section.

The proposed algorithm: Our stochastic-approximation proximal-gradient variational inference starts with a value λ0 and, at every iteration k, makes the following update using the approximation ĝk and the divergence D(λ ‖ λk):

$$\boldsymbol{\lambda}_{k+1} = \arg\min_{\boldsymbol{\lambda} \in \mathcal{S}} \; \boldsymbol{\lambda}^T \hat{\mathbf{g}}_k + h(\boldsymbol{\lambda}) + \frac{1}{\beta_k} D(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}_k). \quad (17)$$

We refer to this algorithm as PG-SVI. We will show in Section 9 that this optimization problem has a closed-form solution for many problems, but first we prove convergence of our approach.
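For intuition, here is a minimal sketch of iteration (17) in the simplest setting, h ≡ 0 with D the squared Euclidean distance, where the subproblem solution reduces to a stochastic gradient step (Section 7 notes that this choice recovers gradient descent). The oracle `grad_hat` is a hypothetical stand-in for a per-example gradient estimator satisfying A1-A2.

```python
import numpy as np

def pg_svi(lam0, grad_hat, beta, num_iters, batch_size, N, rng):
    """Sketch of update (17) with h = 0 and D(lam || lam_k) =
    ||lam - lam_k||^2 / 2, whose subproblem solves in closed form as
    lam_{k+1} = lam_k - beta * g_hat.

    grad_hat(lam, i) is assumed to return an unbiased estimate, based
    on data example i, of the gradient of the non-convex part f."""
    lam = lam0.copy()
    for _ in range(num_iters):
        idx = rng.choice(N, size=batch_size, replace=False)
        g_hat = np.mean([grad_hat(lam, i) for i in idx], axis=0)  # eq. (16)
        lam = lam - beta * g_hat  # constant step-size, cf. Corollary 1
    return lam
```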
6 Convergence of Our Approach

In this section, we prove convergence of our algorithm. Our results show that the algorithm can converge even with a constant step-size. Our proof techniques are based on the work of Ghadimi et al. [2014]. For convergence, we assume that there exists a scalar α > 0 such that the following holds for every subproblem (17):

$$\text{A4.} \quad (\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k)^T \nabla_1 D(\boldsymbol{\lambda}_{k+1} \,\|\, \boldsymbol{\lambda}_k) \ge \alpha \|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2, \quad (18)$$

where ∇1 denotes the gradient with respect to the first argument. Note that this condition need not hold globally, but only at the solution of the subproblem. In the next section, we give a few examples where this condition holds. The following theorem gives us a bound on ‖λk+1 − λk‖; a detailed proof is given in the appendix.

Theorem 1. (Convergence of PG-SVI) Let α be the constant such that A4 is satisfied. Define α∗ = α − 1/(2c), where c is a constant such that c > 1/(2α). Suppose k = 1, 2, . . . , K with K being the total number of iterations, and let βk be such that 0 < βk ≤ 2α∗/L, with βk < 2α∗/L for at least one k. Suppose that we sample a discrete random variable R ∈ {1, 2, . . . , K} using the probability mass function

$$P_R(k) := \text{Prob}(R = k) = \frac{\alpha_* \beta_k - L\beta_k^2/2}{\sum_{k=1}^{K} \left(\alpha_* \beta_k - L\beta_k^2/2\right)}.$$

Then, under Assumptions A1-A4, we have

$$\frac{1}{\beta_R} \, \mathbb{E}\left(\|\boldsymbol{\lambda}_R - \boldsymbol{\lambda}_{R-1}\|^2\right) \le \frac{\mathcal{L}^* - \mathcal{L}(\boldsymbol{\lambda}_0) + \frac{1}{2} c\sigma^2 \sum_{k=1}^{K} (\beta_k/M_k)}{\sum_{k=1}^{K} \left(\alpha_* \beta_k - \frac{1}{2} L\beta_k^2\right)}, \quad (19)$$

where L∗ is the maximum of the lower bound L(λ). The bound depends on the noise variance σ², the mini-batch size Mk, the Lipschitz constant L, the constant α of Assumption A4, the step-size βk, and the gap between the maximum and the starting point, L∗ − L(λ0). In addition, the constant c must be chosen such that c > 1/(2α). When the step-size and mini-batch size are held constant, we get the following corollary.

Corollary 1. (Convergence under constant step-size) Let βk = α∗/L and Mk = M > 1 for all k; then E(‖λR − λR−1‖²)/βR is bounded by

$$\frac{2L}{K\alpha_*^2}\left[\mathcal{L}^* - \mathcal{L}(\boldsymbol{\lambda}_0)\right] + \frac{c\sigma^2}{M\alpha_*}. \quad (20)$$

We see that the bound gets tighter as the mini-batch size M and the number of iterations K are increased, as expected. More importantly, it shows that the bound also gets tighter as α∗ is increased, establishing the usefulness of adding the divergence in our update. Finally, we see that smaller values of the Lipschitz constant L and the variance σ² are preferred, again as expected. Most important of all, the corollary establishes that our algorithm converges under a constant step-size. The value of the step-size depends on the Lipschitz constant and on the geometry of the posterior through the divergence function.

The second term in the bound might not be tight. One can adapt the mini-batch size depending on the noise in the gradient; Ghadimi et al. [2014] discuss some strategies for this purpose. They also discuss a two-phase algorithm to get a linear convergence rate with a constant step-size. In the two-phase algorithm, we first call PG-SVI multiple times and store the returned results, and then pick the minimum value out of all runs. This algorithm can be shown to have linear convergence with high probability; we omit the proof for lack of space. In practice, if we run the algorithm long enough, we can subsample multiple runs from it and then pick the minimum value.
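The randomized output iterate R of Theorem 1 is straightforward to implement; the short sketch below (an illustration under the theorem's assumptions, not code from the paper) samples R for a given step-size schedule.

```python
import numpy as np

def sample_output_iterate(betas, alpha_star, L, rng):
    """Sample the (0-based) random index R of Theorem 1 from P_R(k),
    which weights iteration k by alpha_star*beta_k - L*beta_k**2/2."""
    betas = np.asarray(betas)
    w = alpha_star * betas - 0.5 * L * betas**2
    assert np.all(w >= 0) and w.sum() > 0  # needs beta_k <= 2*alpha_star/L
    return rng.choice(len(betas), p=w / w.sum())

rng = np.random.default_rng(0)
K, L, alpha_star = 100, 2.0, 1.0
betas = np.full(K, alpha_star / L)  # constant step-size of Corollary 1
R = sample_output_iterate(betas, alpha_star, L, rng)
```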
7 Existing Methods as Special Cases

We now show that many existing methods can be seen as special cases of our framework. When D is the squared Euclidean distance, we recover the gradient-descent update. We establish several other non-trivial connections in this section; the methods considered here differ mainly in the choice of the divergence function. For simplicity, throughout this section we assume that q belongs to an exponential family, although these relations are likely to hold for general probability distributions. We will show that a sufficient condition for Assumption A4 to hold is strong convexity of A(λ). This might be too restrictive in many cases; however, our convergence result applies as long as the eigenvalues of ∇²A(λ) are lower bounded at all λk that are solutions of the subproblem (17).

7.1 Stochastic Variational Inference (SVI)

As discussed in Section 2.1, SVI solves the constrained optimization problem (3) at each step, where the symmetric KL divergence is constrained. One typically uses a first-order stochastic approximation of L along with a second-order approximation to the divergence function, and then expresses the objective in Lagrangian form, as shown below [Pascanu and Bengio, 2013]:

$$\min_{\boldsymbol{\lambda}} \; \boldsymbol{\lambda}^T \hat{\mathbf{g}}_k + \frac{1}{\beta_k} (\boldsymbol{\lambda} - \boldsymbol{\lambda}_k)^T \nabla^2 A(\boldsymbol{\lambda}_k) (\boldsymbol{\lambda} - \boldsymbol{\lambda}_k). \quad (21)$$

This gives the natural-gradient step used in SVI. This step is equivalent to a step of our approach shown in (17) with h ≡ 0 and D(λ ‖ λk) replaced by the second term above. Assumptions A1 and A2 can be safely assumed for SVI as well, and assuming that L is Lipschitz continuous, Assumption A3 is also satisfied. Finally, Assumption A4 is equivalent to the following condition:

$$(\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k)^T \nabla^2 A(\boldsymbol{\lambda}_k) (\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k) \ge \alpha \|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2.$$

This condition holds when the eigenvalues of the Fisher information matrix are lower bounded at all λk; a sufficient condition is strong convexity of A(λ). Therefore, using Corollary 1, convergence of SVI under a constant step-size holds given a lower bound α on the eigenvalues of the Fisher information matrix. The step-size can be set using the Lipschitz constant L of L and the constant α. Note that if L contains some convex terms, we can improve convergence by not linearizing them and choosing L only for the non-convex parts (of course, this works only when (17) can be solved in closed form).
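Here is a minimal sketch of the resulting natural-gradient step obtained from (21) for an exponential-family q; the Fisher matrix is the Hessian of the log-partition function A, and `hess_A` is a hypothetical oracle for it. Any constant factor arising from the quadratic term is absorbed into the step-size β.

```python
import numpy as np

def natural_gradient_step(lam, g_hat, hess_A, beta):
    """One SVI-style step: a gradient step preconditioned by the Fisher
    information matrix grad^2 A(lam_k), i.e. the solution of (21) up to
    a constant factor absorbed into beta.

    hess_A(lam) is assumed to return the log-partition Hessian, which
    for a minimal exponential family equals the Fisher matrix."""
    F = hess_A(lam)
    return lam - beta * np.linalg.solve(F, g_hat)
```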
7.2 Methods using the Bregman Divergence

Several methods similar to ours that use Bregman divergences have been proposed [Ravikumar et al., 2010, Babagholami-Mohamadabadi et al., 2015]. These methods also benefit from our convergence results. For an exponential-family distribution, the Bregman divergence is equal to the KL divergence with swapped parameters:

$$D_{Breg}(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}_k) := A(\boldsymbol{\lambda}) - A(\boldsymbol{\lambda}_k) - \nabla A(\boldsymbol{\lambda}_k)^T(\boldsymbol{\lambda} - \boldsymbol{\lambda}_k).$$

Taking the derivative with respect to λ, the condition for Assumption A4 to hold can be written as follows:

$$(\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k)^T \left[\nabla A(\boldsymbol{\lambda}_{k+1}) - \nabla A(\boldsymbol{\lambda}_k)\right] \ge \alpha \|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2.$$

Again, a sufficient condition for this to hold is strong convexity of A(λ).

7.3 Methods using the KL Divergence

Recently, several methods have been proposed that use the KL divergence as the distance measure together with either a proximal-point or proximal-gradient method very similar to (17), e.g., Dai et al. [2015], Theis and Hoffman [2015], Khan et al. [2015]. All of these approaches are special cases of the approach in this paper. When D is the KL divergence, the condition for Assumption A4 to hold can be written as follows:

$$(\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k)^T \nabla^2 A(\boldsymbol{\lambda}_{k+1}) (\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k) \ge \alpha \|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2. \quad (22)$$

This differs from SVI only slightly: we need the eigenvalues of ∇²A(λk+1) to be bounded from below (instead of at λk). Again, a sufficient condition is the strong convexity of A.
8 Parameter α

We now give examples to illustrate the existence of α > 0; we show this for the KL divergence.

Bernoulli distribution: Given 0 < q < 1 and 0 < p < 1, the KL divergence for the Bernoulli distribution is

$$D_{KL}[q \,\|\, p] := q \log(q/p) + (1 - q) \log\left[(1 - q)/(1 - p)\right].$$

Taking the derivative with respect to q gives

$$\nabla_q D_{KL}[q \,\|\, p] = \log \frac{q(1 - p)}{p(1 - q)}. \quad (23)$$

Substituting into the left-hand side of Assumption A4 in (18), we get

$$(q - p) \nabla_q D_{KL}[q \,\|\, p] = (q - p) \log \frac{q(1 - p)}{p(1 - q)} \quad (24)$$
$$= (q - p)(\log q - \log p) + (q - p)\big(\log(1 - p) - \log(1 - q)\big) \quad (25)$$
$$\ge (q - p)^2 + (q - p)^2 = 2(q - p)^2, \quad (26)$$

where the last step follows from the mean-value theorem: log q − log p = (q − p)/ξ for some ξ strictly between p and q, so ξ ∈ (0, 1), and similarly for the second term. Therefore α = 2. This can be generalized to the multinomial distribution.

Multinomial distribution: Let q and p be vectors of length K with elements qi and pi that sum to 1. We therefore have K − 1 free variables, since qK = 1 − Σ_{i=1}^{K−1} qi. The KL divergence is

$$D_{KL}[\mathbf{q} \,\|\, \mathbf{p}] := \sum_{i=1}^{K} q_i \log(q_i) - q_i \log(p_i).$$

Taking the derivative with respect to qi gives

$$\nabla_{q_i} D_{KL}[\mathbf{q} \,\|\, \mathbf{p}] = \log \frac{q_i \, p_K}{p_i \, q_K}. \quad (27)$$

The Hessian of the KL divergence has diagonal entries 1/qi + 1/qK and off-diagonal entries 1/qK. Its minimum eigenvalue is therefore at least 1 for any valid q, so in the multinomial case α = 1.

Variational-Gaussian inference: We now derive α for the VG inference outlined in Section 3. For simplicity, we consider a mean-field approximation where the distribution is a product of univariate Gaussians N(z|m, v). Since the KL divergence is additive over the components of z, we focus on a single univariate distribution. The KL divergence is

$$D_{KL}\left[q(m, v) \,\|\, q(m_k, v_k)\right] := \tfrac{1}{2}\left[-\log(v/v_k) + v/v_k + (m - m_k)^2/v_k - 1\right], \quad (28)$$

with derivatives

$$\nabla_m D_{KL} = (m - m_k)/v_k, \qquad \nabla_v D_{KL} = \tfrac{1}{2}\left(v_k^{-1} - v^{-1}\right).$$

Substituting into the left-hand side of Assumption A4 in (18), and assuming that there exists τ such that 1/vk > τ for all k, we get the following bound:

$$(m_{k+1} - m_k) \nabla_m D_{KL} + (v_{k+1} - v_k) \nabla_v D_{KL} \quad (29)$$
$$= (m_{k+1} - m_k)^2/v_k + \tfrac{1}{2}(v_{k+1} - v_k)^2/(v_{k+1} v_k) \quad (30)$$
$$\ge \min(\tau, \tau^2/2)\left[(m_{k+1} - m_k)^2 + (v_{k+1} - v_k)^2\right]. \quad (31)$$

Therefore α = min(τ, τ²/2), and it only remains to find τ. For VG inference with a Gaussian prior, such as that outlined in Section 3, the optimal solution takes the following form:

$$\mathbf{v}_{k+1} = \text{diag}\left[\left(\boldsymbol{\Sigma}^{-1} + \mathbf{A}_k\right)^{-1}\right], \quad (32)$$

where Σ is the prior covariance matrix and Ak is a positive-definite matrix that depends on the data and the previous values of mk and vk. It is clear that 1/vk+1 is always bounded away from 0, with a bound that depends on the eigenvalues of Σ; for example, when Σ and Ak are both diagonal, 1/vk+1 is at least the reciprocal of the corresponding diagonal value of Σ, which yields a valid τ. Therefore, we conclude that τ exists. These arguments can be generalized to the multivariate Gaussian approximation by replacing the quadratic norm with the trace norm.
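As a quick numerical sanity check of the Bernoulli case above (ours, not from the paper), the following snippet verifies on a grid that the left-hand side of A4 is at least 2(q − p)², i.e., that α = 2 is valid.

```python
import numpy as np

# Check Assumption A4 for the Bernoulli KL divergence:
# (q - p) * d/dq KL[q || p] should be at least 2 * (q - p)**2.
grid = np.linspace(0.01, 0.99, 99)
for p in grid:
    for q in grid:
        lhs = (q - p) * np.log(q * (1 - p) / (p * (1 - q)))  # eq. (24)
        assert lhs >= 2 * (q - p) ** 2 - 1e-12
```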
9 Examples of PG-SVI

In this section, we give examples where each subproblem (17) has a closed-form solution. As discussed before, SVI is a special case of our approach and therefore leads to a closed-form update; e.g., for the conditionally-conjugate exponential-family distributions discussed in Section 3, the update takes the following form [Paquet, 2014]:

$$\boldsymbol{\lambda}_i^{(k+1)} = r_k \boldsymbol{\lambda}_i^{(k)} + (1 - r_k) \boldsymbol{\lambda}_i^*, \quad (33)$$

where rk = 1/(1 + βk), λi and λi∗ are as defined in Section 3, and λi^(k) denotes the vector at the k'th iteration.

Recently, Khan et al. [2015] proposed a proximal-gradient variational-inference method that uses a KL divergence function. This method is equivalent to (17) when D is the KL divergence and a full batch of the data is used (no stochasticity). We use a stochastic version of this method in our experiments.
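As an illustration of the closed-form update (33) above, here is a one-line sketch in code; `lam_star` stands for the mean-field update λi∗, which is model-specific and assumed given.

```python
import numpy as np

def conjugate_update(lam, lam_star, beta):
    """Update (33) for conditionally-conjugate exponential families:
    a convex combination of the current natural parameter and the
    mean-field update lam_star, with weight r_k = 1 / (1 + beta_k)."""
    r = 1.0 / (1.0 + beta)
    return r * np.asarray(lam) + (1.0 - r) * np.asarray(lam_star)
```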
[Figure 1: six panels, one column per dataset (sonar, ionosphere, usps3v5); the top row plots the negative lower bound and the bottom row the test log-loss, each against the number of passes through the data, for SGD, ADADELTA, and PG-SVI.]
Figure 1: Results for binary GP classification, comparing our approach PG-SVI with SGD and ADADELTA on three datasets. Each column shows results for one dataset. The top row shows the negative of the lower bound, and the bottom row shows the test log-loss. In each plot, the x-axis shows the number of passes made through the data. Our method always converges within 10 passes through the data, while the other methods take from 100 to 1,000 passes.
10 Experimental Results

In this section, we present empirical evidence for the convergence of our proposed method. We demonstrate that PG-SVI with a constant step-size leads to faster convergence than stochastic gradient descent (SGD) with a decreasing step-size and the ADADELTA method [Zeiler, 2012]. We show results on Gaussian process (GP) classification with a Bernoulli-logit likelihood. We use the zero mean function and a squared-exponential covariance function,

$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2l^2} (\mathbf{x} - \mathbf{x}')^T (\mathbf{x} - \mathbf{x}')\right), \quad (34)$$

where σ, l ∈ R are kernel hyperparameters. We set the values of the hyperparameters using cross-validation; the final values are shown in Table 1.

For PG-SVI, we use the updates discussed by Khan et al. [2015] with a constant step-size. We compare to SGD with a decreasing step-size to optimize the lower bound of (6), setting the step-size to 1/(1 + k)^τ with τ chosen to lie in (0.5, 1). We also compare to the ADADELTA method [Zeiler, 2012] with decay rate ρ and constant term ε. A summary of the algorithmic parameters is given in Table 1.

For all the methods, we use the doubly-stochastic approximation with a mini-batch of fixed size M. For each data example in the mini-batch, we compute the gradient using MC with S samples. PG-SVI works fine with few MC samples (say, 50), while SGD and ADADELTA converge slowly if the number of MC samples is small. In the experiments, all methods use the same number of MC samples. We compare the methods on three datasets; details are in Table 1. These datasets can be found at the UCI data repository.²

                         Sonar    Ionosphere   USPS
N                        208      351          1,781
D                        60       34           256
Ntr                      165      280          884
log l for k(·, ·)        -1       1            2.5
log σ for k(·, ·)        6        2.5          5
M (mini-batch size)      5        5            20
S (MC samples)           2000     500          2000
τ for SGD                0.80     0.51         0.85
1 − ρ for ADADELTA       10⁻⁹     10⁻⁶         10⁻¹⁰
βk × Ntr for PG-SVI      0.1      0.4          2.5

Table 1: Dataset description along with the model and algorithmic hyperparameters. Each column gives details for one dataset. Ntr is the number of data points used for training; σ and l are hyperparameters of the squared-exponential kernel. ε for ADADELTA is fixed to 10⁻⁸ in our experiments. τ and ρ are the decay factors for SGD and ADADELTA, respectively, while βk is the constant step-size for PG-SVI.

We compare the negative of the lower bound of (6) on the training data (a lower value is better). We also compare the test log-loss, defined as −Σn log p̂n / N∗, where p̂n = p(yn|σ, l, Dt) is the predictive probability of the test point yn given training data Dt, and N∗ is the total number of test pairs. A lower value is better, and a value of 1 is equivalent to random coin-flipping.

Results are shown in Figure 1. Each column shows results for one dataset: the top row shows the negative of the lower bound, and the bottom row shows the test log-loss; lower values are better for both. In each plot, the x-axis shows the number of passes made through the data. Note that we have added 1 to the number of passes to show iteration 0, so the first pass is simply the initialization. Our method converges much faster than the other methods: it always converges within 10 passes through the data, while the other methods take from 100 to 1,000 passes.

²https://archive.ics.uci.edu/ml/datasets.html
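A small sketch of the two quantities used in this section, assuming the inputs are stacked row-wise: the squared-exponential kernel of (34) and the average test log-loss. This is our illustration, not the authors' code.

```python
import numpy as np

def se_kernel(X1, X2, sigma, ell):
    """Squared-exponential kernel of eq. (34) between rows of X1 and X2."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma**2 * np.exp(-0.5 * sq / ell**2)

def test_log_loss(p_hat):
    """Average negative log predictive probability over the test points."""
    return -np.mean(np.log(p_hat))
```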
11 Discussion and Future Work

With the use of stochastic methods, variational inference can scale to large data and can be applied to a wide variety of models. Existing variational methods use stochastic approximations as black-box tools and do not exploit the structure and geometry of the variational lower bound. In this paper, we proposed a proximal-gradient framework that aims to fill this gap. Our method contains many existing methods as special cases; it exploits the structure of the problem and takes the geometry of the posterior into account. Most importantly, our convergence proof shows that stochastic methods can converge even when a constant step-size is used.

To find a good step-size, a good line-search method is required; in addition, an algorithm for adapting the batch size is also needed. In future work, we hope to explore these issues.
References

Behnam Babagholami-Mohamadabadi, Sejong Yoon, and Vladimir Pavlovic. D-MFVI: Distributed mean field variational inference using Bregman ADMM. arXiv preprint arXiv:1507.00824, 2015.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Bo Dai, Niao He, Hanjun Dai, and Le Song. Scalable Bayesian inference via particle mirror descent. arXiv preprint arXiv:1506.03101, 2015.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, pages 1–39, 2014.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. JMLR, 11:3235–3268, 2011.

Mohammad Emtiyaz Khan, Pierre Baqué, François Fleuret, and Pascal Fua. Kullback-Leibler proximal variational inference. In Advances in Neural Information Processing Systems, 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Fully automatic variational inference of differentiable probability models. In NIPS Workshop on Probabilistic Programming, 2014.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.

Ulrich Paquet. On the convergence of stochastic variational inference in Bayesian networks. NIPS Workshop on Variational Inference, 2014.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. arXiv preprint arXiv:1401.0118, 2013.

Pradeep Ravikumar, Alekh Agarwal, and Martin J. Wainwright. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. The Journal of Machine Learning Research, 11:1043–1080, 2010.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

Tim Salimans, David A. Knowles, et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

Masa-Aki Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.

Lucas Theis and Matthew D. Hoffman. A trust-region method for stochastic variational inference with applications to streaming data. arXiv preprint arXiv:1505.07649, 2015.

Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, 2014.

Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
A Proof of Theorem 1

We denote the proximal projection at λk with gradient g and step-size β by

$$P(\boldsymbol{\lambda}_k, \mathbf{g}, \beta) := \frac{1}{\beta}(\boldsymbol{\lambda}_k - \boldsymbol{\lambda}_{k+1}), \quad \text{where} \quad \boldsymbol{\lambda}_{k+1} = \arg\min_{\boldsymbol{\lambda} \in \mathcal{S}} \; \boldsymbol{\lambda}^T \mathbf{g} + h(\boldsymbol{\lambda}) + \frac{1}{\beta} D(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}_k). \quad (35)$$

The following lemma gives a bound on the norm of P(λk, g, β).

Lemma 1. The following holds for any λk ∈ S, any real-valued vector g, and any β > 0:

$$\mathbf{g}^T P(\boldsymbol{\lambda}_k, \mathbf{g}, \beta) \ge \alpha \|P(\boldsymbol{\lambda}_k, \mathbf{g}, \beta)\|^2 + \frac{1}{\beta}\left[h(\boldsymbol{\lambda}_{k+1}) - h(\boldsymbol{\lambda}_k)\right]. \quad (36)$$

Proof. The gradient of the objective on the right-hand side of (35) is

$$\mathbf{g} + \nabla h(\boldsymbol{\lambda}) + \frac{1}{\beta} \nabla_1 D(\boldsymbol{\lambda} \,\|\, \boldsymbol{\lambda}_k). \quad (37)$$

We use this to derive the optimality condition of (35): for any λ ∈ S,

$$(\boldsymbol{\lambda} - \boldsymbol{\lambda}_{k+1})^T \left[\mathbf{g} + \nabla h(\boldsymbol{\lambda}_{k+1}) + \frac{1}{\beta} \nabla_1 D(\boldsymbol{\lambda}_{k+1} \,\|\, \boldsymbol{\lambda}_k)\right] \ge 0. \quad (38)$$

Letting λ = λk,

$$(\boldsymbol{\lambda}_k - \boldsymbol{\lambda}_{k+1})^T \left[\mathbf{g} + \nabla h(\boldsymbol{\lambda}_{k+1}) + \frac{1}{\beta} \nabla_1 D(\boldsymbol{\lambda}_{k+1} \,\|\, \boldsymbol{\lambda}_k)\right] \ge 0, \quad (39)$$

which, using the convexity of h, implies

$$\mathbf{g}^T(\boldsymbol{\lambda}_k - \boldsymbol{\lambda}_{k+1}) \ge \frac{1}{\beta}(\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k)^T \nabla_1 D(\boldsymbol{\lambda}_{k+1} \,\|\, \boldsymbol{\lambda}_k) + h(\boldsymbol{\lambda}_{k+1}) - h(\boldsymbol{\lambda}_k) \quad (40)$$
$$\ge \frac{\alpha}{\beta} \|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2 + h(\boldsymbol{\lambda}_{k+1}) - h(\boldsymbol{\lambda}_k). \quad (41)$$

The last line follows from Assumption A4. Dividing both sides by β and using the definition of P gives (36).

We now prove Theorem 1.

Proof. Let g̃λ,k := P(λk, ĝk, βk) and δk := ĝk − ∇f(λk). Since f is L-smooth, for any k = 1, . . . , K we have

$$f(\boldsymbol{\lambda}_{k+1}) \le f(\boldsymbol{\lambda}_k) + \langle \nabla f(\boldsymbol{\lambda}_k), \boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k \rangle + \frac{L}{2}\|\boldsymbol{\lambda}_{k+1} - \boldsymbol{\lambda}_k\|^2 \quad (42)$$
$$= f(\boldsymbol{\lambda}_k) - \beta_k \langle \nabla f(\boldsymbol{\lambda}_k), \tilde{\mathbf{g}}_{\lambda,k} \rangle + \frac{L}{2}\beta_k^2 \|\tilde{\mathbf{g}}_{\lambda,k}\|^2 \quad (43)$$
$$= f(\boldsymbol{\lambda}_k) - \beta_k \langle \hat{\mathbf{g}}_k, \tilde{\mathbf{g}}_{\lambda,k} \rangle + \frac{L}{2}\beta_k^2 \|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \beta_k \langle \boldsymbol{\delta}_k, \tilde{\mathbf{g}}_{\lambda,k} \rangle, \quad (44)$$

where we have used the definitions of g̃λ,k and δk. Now, using Lemma 1 and Cauchy-Schwarz,

$$f(\boldsymbol{\lambda}_{k+1}) \le f(\boldsymbol{\lambda}_k) - \alpha\beta_k \|\tilde{\mathbf{g}}_{\lambda,k}\|^2 - \left[h(\boldsymbol{\lambda}_{k+1}) - h(\boldsymbol{\lambda}_k)\right] + \frac{L}{2}\beta_k^2 \|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \beta_k \|\boldsymbol{\delta}_k\| \|\tilde{\mathbf{g}}_{\lambda,k}\|. \quad (45)$$

By rearranging terms and using Young's inequality, ‖δk‖‖g̃λ,k‖ ≤ (c/2)‖δk‖² + 1/(2c)‖g̃λ,k‖² for some constant c > 0, we get

$$-\mathcal{L}(\boldsymbol{\lambda}_{k+1}) \le -\mathcal{L}(\boldsymbol{\lambda}_k) - \alpha\beta_k\|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \frac{L}{2}\beta_k^2\|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \frac{\beta_k}{2c}\|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \frac{c\beta_k}{2}\|\boldsymbol{\delta}_k\|^2 \quad (46)$$
$$= -\mathcal{L}(\boldsymbol{\lambda}_k) - \left[\left(\alpha - \frac{1}{2c}\right)\beta_k - \frac{L}{2}\beta_k^2\right]\|\tilde{\mathbf{g}}_{\lambda,k}\|^2 + \frac{c\beta_k}{2}\|\boldsymbol{\delta}_k\|^2. \quad (47)$$

Now, taking c > 1/(2α), α∗ = α − 1/(2c), and βk ≤ 2α∗/L, and summing both sides over k = 1, . . . , K, we obtain

$$\sum_{k=1}^{K} \left(\alpha_* \beta_k - \frac{L}{2}\beta_k^2\right)\|\tilde{\mathbf{g}}_{\lambda,k}\|^2 \le \mathcal{L}^* - \mathcal{L}(\boldsymbol{\lambda}_1) + \sum_{k=1}^{K} \frac{c\beta_k}{2}\|\boldsymbol{\delta}_k\|^2. \quad (48)$$

Taking the expectation with respect to ξ on both sides and using E[‖δk‖²] ≤ σ²/Mk (Assumption A2), we get

$$\sum_{k=1}^{K} \left(\alpha_* \beta_k - \frac{L}{2}\beta_k^2\right)\mathbb{E}\left[\|\tilde{\mathbf{g}}_{\lambda,k}\|^2\right] \le \mathcal{L}^* - \mathcal{L}(\boldsymbol{\lambda}_1) + \frac{c\sigma^2}{2}\sum_{k=1}^{K} \frac{\beta_k}{M_k}. \quad (49)$$

Noting that

$$\mathbb{E}\left[\|\tilde{\mathbf{g}}_{\lambda,R}\|^2\right] = \mathbb{E}_{R,\xi}\left[\|\tilde{\mathbf{g}}_{\lambda,R}\|^2\right] = \frac{\sum_{k=1}^{K}\left(\alpha_*\beta_k - \frac{L}{2}\beta_k^2\right)\mathbb{E}\left[\|\tilde{\mathbf{g}}_{\lambda,k}\|^2\right]}{\sum_{k=1}^{K}\left(\alpha_*\beta_k - \frac{L}{2}\beta_k^2\right)}, \quad (50)$$

and dividing both sides of (49) by Σ_{k=1}^K (α∗βk − (L/2)βk²), we get the required result, since g̃λ,k = (λk − λk+1)/βk.