Deterministic Annealing for Stochastic Variational Inference
Farhan Abrol Princeton University
Stephan Mandt Columbia University
Rajesh Ranganath Princeton University
David Blei Columbia University
{fabrol, rajeshr}@cs.princeton.edu {sm3976,david.blei}@columbia.edu
Abstract. Stochastic variational inference (SVI) maps posterior inference in latent variable models to nonconvex stochastic optimization. While variational inference methods enable approximate posterior inference for many otherwise intractable models, they suffer from local optima. We introduce deterministic annealing for SVI to overcome this issue. We introduce a temperature parameter that deterministically deforms the objective and then reduce this parameter over the course of the optimization. Initially, it encourages high-entropy variational distributions, which we find eases convergence to better optima. We test our method with Latent Dirichlet Allocation on three large corpora. Compared to SVI, we show improved predictive likelihoods on held-out data.
1 Introduction
Annealing is an ancient practice in metallurgy with thousands of years of history. When a metal was cooled down slowly, it was found to be more stable than when it was cooled abruptly. Today we understand the underlying mechanism: atoms arrange into a better configuration when slowly cooled than when suddenly frozen at much lower temperatures. In fact, this process can be interpreted as an optimization mechanism that yields a better outcome when a certain cooling schedule is followed. The physical process of annealing has found analogies in non-convex optimization, where the cooling process can be mimicked in different ways. In simulated annealing (Kirkpatrick et al., 1983), random noise is added to the objective. This noise is then slowly reduced over time, just as
the thermal noise in the example of atoms. Quantum annealing, a different approach, exploits the quantum tunnel effect for optimization, which has even been applied to text modeling (Sato et al., 2009). In this paper, we explore deterministic annealing. Deterministic annealing deterministically deforms the objective according to a time-dependent schedule. In all of these settings, our goal is to prevent the optimization process from getting stuck in shallow local optima. In this paper, we use deterministic annealing to solve the non-convex optimization problem that arises when using stochastic variational inference in a probabilistic model. Probabilistic models of hidden and observed variables are a powerful way to analyze data. Given data, we can use the conditional distribution of the hidden variables to understand its latent properties and to form predictions about the future (Bishop, 2006). For many models of interest, however, computing this posterior is intractable and one has to resort to approximate methods. Here we will use variational inference (VI), which maps the problem of estimating a conditional distribution to an optimization problem (Jordan et al., 1999). Variational inference has enabled posterior analysis in many otherwise intractable models. With massive data sets, such as millions of newspaper articles, images or user data, even variational methods have reached their limits. To this end, Hoffman et al. (2013) developed stochastic variational inference (SVI). In a nutshell, SVI uses stochastic optimization (Robbins and Monro, 1951) to scale variational inference. Both VI and SVI face the problem of non-convex optimization, and their objective (which is the same in both algorithms) has many local optima. Previous work on improving SVI has mostly focussed on techniques to reduce the noise of its stochastic gradient (Wang et al., 2013; Johnson and Zhang, 2013; Mandt and Blei, 2014). Here, we take a different route, keeping the stochastic gradients noisy but introducing deterministic annealing to better escape local optima. To be more precise, we monotonically penalize low entropies of the variational distribution, and then
slowly relax this constraint to give more and more weight to distributions that better fit the data. Fig. 3 shows the results of deterministic annealing for Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a probabilistic topic model that was one of the first models to which stochastic variational inference was applied (Hoffman et al., 2013). We applied LDA to three large corpora, each consisting of several hundred thousand documents. With annealed stochastic inference, we obtained better predictive likelihoods on held-out documents (as we explain below) than with the original SVI algorithm.
Related work. The roots of simulated annealing reach back to Metropolis et al. (1953) and to Kirkpatrick et al. (1983), who drew the connection to combinatorial optimization. Deterministic annealing was originally used for data clustering applications (Rose et al., 1990; Rose, 1998; Hofmann and Buhmann, 1997). It was later applied to latent variable models by Ueda and Nakano (1998), who suggest a deterministic annealing algorithm in the context of computing a maximum likelihood estimate for incomplete data. In Katahira et al. (2008), deterministic annealing for variational Bayes is introduced, applied to a hidden Markov model, and tested on simulated data. Yoshida and West (2010) describe a deterministic annealing strategy for variational inference applied to sparse factor models. Our method is distinct from previous work in combining deterministic annealing with stochastic optimization. This is a conceptually different problem due to the interplay of deterministic deformations of the objective and sampling noise. It is the combination of stochastic gradients and deterministic annealing that makes our method scalable and efficient on massive data sets.
2 Deterministic Annealing for Variational Inference
We consider observed data x = x_{1:n} to come from a hidden-variable model, where the hidden variables are z = z_{1:n} and β. Following Hoffman et al. (2013), the variables z_{1:n} are local variables, each one governing only its corresponding observation, and the variable β is a global variable, which participates in the distribution of all the data. This is illustrated as a graphical model in Figure 1.
Figure 1: Graphical model representing the class of models with global and local hidden variables considered in this paper (Hoffman et al., 2013).
Many models from the statistical machine learning literature fall into this large class. Examples include Bayesian mixture models, Bayesian hidden Markov models, hierarchical generalized linear models (Gelman and Hill, 2006), probabilistic principal components analysis (Tipping and Bishop, 1999), and latent Dirichlet allocation (LDA) (Blei et al., 2003). In our empirical study, we will use LDA to demonstrate our methods.
The main computational and algorithmic problem for probabilistic modeling is posterior inference. In posterior inference we compute p(β, z | x), the conditional distribution of the latent variables {β, z} given the observations x. For many models of interest, this calculation is intractable (Bishop, 2006), and practitioners must resort to approximate solutions such as Markov chain Monte Carlo (Neal, 1993) or variational methods (Jordan et al., 1999). Good approximate posterior inference is the subject of this paper. We focus on variational methods.
Variational methods. Variational methods posit a simple parameterized family of distributions over the hidden variables q(β, z | ν) and then try to find the member of that family that is closest to the posterior, where closeness is measured by KL divergence. Variational methods thus turn posterior inference into an optimization problem,

ν* = arg min_ν KL(q(β, z | ν) || p(β, z | x)).    (1)

If we can find the optimal variational parameter ν*, then we can use q(β, z | ν*) as a proxy for the true posterior. The KL divergence, however, is itself not computable. Rather, variational methods optimize the evidence lower bound (ELBO),

L(ν) = E_q[log p(β, z, x)] − E_q[log q(β, z | ν)].    (2)
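To make Equation 2 concrete, the following minimal Python sketch estimates the ELBO by Monte Carlo for a toy conjugate model (a Gaussian mean with a Gaussian prior), where the exact log evidence is available for comparison. The model, the variational family, and all variable names are our own illustration and are not part of the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy model: beta ~ N(0, 1), x_i | beta ~ N(beta, 1).
x = rng.normal(2.0, 1.0, size=20)
n = len(x)

def log_joint(beta):
    # log p(beta) + sum_i log p(x_i | beta), for each sampled beta.
    lp = -0.5 * beta**2 - 0.5 * np.log(2 * np.pi)
    ll = -0.5 * np.sum((x[None, :] - beta[:, None])**2, axis=1) \
         - 0.5 * n * np.log(2 * np.pi)
    return lp + ll

def elbo(m, s, num_samples=100_000):
    # Monte Carlo estimate of E_q[log p(beta, x)] - E_q[log q(beta)], Eq. (2),
    # with q(beta) = N(m, s^2).
    beta = rng.normal(m, s, size=num_samples)
    log_q = -0.5 * ((beta - m) / s)**2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_joint(beta) - log_q)

# Exact posterior N(sum(x)/(n+1), 1/(n+1)) and exact log evidence for reference.
post_mean, post_var = x.sum() / (n + 1), 1.0 / (n + 1)
log_evidence = (-0.5 * n * np.log(2 * np.pi)
                - 0.5 * np.log(n + 1.0)
                - 0.5 * (np.sum(x**2) - x.sum()**2 / (n + 1.0)))
print("ELBO at the exact posterior:", elbo(post_mean, np.sqrt(post_var)))
print("ELBO at a poor candidate:  ", elbo(-3.0, 1.0))
print("log evidence:              ", log_evidence)

The ELBO never exceeds the log evidence, and it matches it (up to sampling error) when q equals the exact posterior, which is the sense in which Equation 2 lower-bounds the evidence.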
Up to a constant, the ELBO is equivalent to the negative KL divergence (Jordan et al., 1999). Thus maximizing Equation 2 is equivalent to solving the optimization problem in Equation 1. The problem is that this optimization problem is nonconvex, which makes it difficult to find the global optimum. Simple gradient descent and coordinate descent schemes get stuck in poor local optima, and random restarts will reveal a variety of locally optimal variational parameters. As a solution to this problem, we suggest deterministic annealing.
Figure 2: T-ELBO as a function of the global variational parameters for a mixture model of two Gaussians. We show the contours in the (µ0, µ1) plane for temperatures T = 100, T = 30, and T = 1. The surface changes slowly from having one global optimum to the final ELBO with two optima as the temperature decreases.
Deterministic annealing. We will develop and study deterministic annealing in the context of stochastic variational inference. We introduce a "temperature" parameter T > 1 in a distribution over the hidden variables,

p_T(z, β) = p(z, β | x)^{1/T} / C_T.    (3)
We call this distribution the tempered posterior. The constant C_T = ∫ p(z, β | x)^{1/T} dz dβ is the normalization constant required when we raise the original posterior to the power of 1/T. In the tempered posterior, the temperature warps the original posterior distribution: it shrinks high values and amplifies low values, thus making the distribution more uniform. The main idea behind deterministic annealing is to optimize the variational distribution against a sequence of tempered posteriors that ends in the true posterior. At each stage, we optimize the variational distribution to be close to a tempered posterior. This relates closely to the original variational problem. Suppose the temperature T is fixed. The KL divergence between the variational distribution and the tempered posterior is

KL(q || p_T) = E_q[log q(z, β | ν)] − E_q[log p_T(z, β | x)].    (4)
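As a quick numerical illustration of Equation 3, the sketch below tempers an arbitrary discrete distribution and shows that its entropy grows with T. The particular probabilities are made up for illustration only.

import numpy as np

def temper(p, T):
    # Raise the distribution to the power 1/T and renormalize (Eq. 3).
    q = p ** (1.0 / T)
    return q / q.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

p = np.array([0.70, 0.20, 0.06, 0.03, 0.01])  # a peaked "posterior"
for T in [1, 2, 5, 100]:
    pT = temper(p, T)
    print(f"T={T:>3}: probs={np.round(pT, 3)}, entropy={entropy(pT):.3f}")
# As T grows, the tempered posterior approaches the uniform distribution
# (entropy -> log 5); this flattening is what annealing exploits.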
We can simplify the tempered KL divergence in Equation 4 as follows. First, write out the tempered posterior,

KL(q || p_T) = −(1/T) E_q[log p(z, β | x)] + E_q[log q(z, β | ν)] + log C_T.

Next, expand the posterior into the joint and the evidence,

KL(q || p_T) = −(1/T) E_q[log p(z, β, x)] + E_q[log q(z, β | ν)] + log C_T + (1/T) log p(x).

Finally, consider this expression multiplied by the temperature T. Up to an additive constant, it is the negative of the tempered ELBO,

L_T(q) = E_q[log p(z, β, x)] − T E_q[log q(z, β | ν)].    (5)

Maximizing the tempered ELBO (the T-ELBO) in Equation 5 is equivalent to minimizing the KL divergence to the tempered posterior in Equation 4. Deterministically annealed variational inference (DAVI) iteratively solves a sequence of tempered optimization problems. The temperature is cooled according to a schedule that ends at T = 1. Thus the final optimization problem is the original variational inference problem. We will demonstrate that this technique finds better local optima than existing methods for optimizing the variational objective.
Let us develop some intuition for the tempered ELBO in Equation 5 and for deterministic annealing. The first term is the expected log joint of the hidden variables and the data. It wants q to place its probability mass on configurations of the hidden variables that best explain the observations; this induces a rugged objective with many local optima. The second term is the entropy of the variational distribution, weighted by the temperature. The entropy is concave (it has one global optimum). In this context, it can be thought of as a regularizer that prefers the variational distribution to be spread across configurations of the hidden variables. In deterministic annealing, we initially upweight the entropy term, thus favoring smooth and entropic distributions. We then slowly reduce the temperature, gradually asking the variational distribution to put more and more weight on explaining the data points. The hope is that when T is large
we find a good region of the variational parameters. As T decreases, the landscape of the variational objective comes more into focus. We are then well positioned to find a good local optimum. This captures the central idea of deterministic annealing in variational inference.
As a remark, there is also an interesting connection to statistical physics. If we identify the first term in the tempered ELBO with a negative energy −U, and the last one with an entropy S, we have −L_T = U − T S, and hence we can identify the tempered ELBO with a negative free energy. Minimizing the free energy is a typical task in statistical physics. This is not surprising, as simulated annealing and variational inference have their origins in statistical physics (Kirkpatrick et al., 1983; Peterson and Anderson, 1987).
The mean-field family. Deterministic annealing can be applied to any variational inference problem, i.e., to any variational family. We will focus here on the mean-field variational family, where each hidden variable is independent and governed by its own variational parameter. This leads to a fully factorized family. Following the notation of Hoffman et al. (2013),

q(β, z | ν) = q(β | λ) ∏_{n=1}^{N} q(z_n | φ_n).    (6)

The parameters λ are global variational parameters; the parameters φ_n are local variational parameters.
A simple example. To illustrate deterministic annealing, we consider a simple mixture model of two univariate Gaussian distributions. (This example is also described in detail in Katahira et al. (2008).) One component is centered at µ0 = 4; the other at µ1 = −4. We assume that data x_n arise by first choosing one of the mixture components z_n ∈ {0, 1} and then choosing the value from a Gaussian centered at µ_{z_n}. Crucial to this example is that we assume that µ0 = 4 is selected (a priori) 30% of the time and µ1 = −4 is selected 70% of the time. We assume that we know these proportions, but do not know the mixture locations. Our goal is to estimate p(µ0, µ1 | x_1, . . . , x_N). (We do not need to consider local variational parameters in this case.) The joint distribution of the data and hidden variables is

p(µ, z, x) = ∏_{n=1}^{N} [0.7 N(x_n; µ1, 1)]^{z_n} [0.3 N(x_n; µ0, 1)]^{(1−z_n)}.

The mean-field variational distribution gives independent Gaussian factors q(µ0 | µ̂0) and q(µ1 | µ̂1). (We use µ̂k to denote the variational mean.) We generated 500 data points from this simple model. Figure 2 shows the tempered ELBO at different temperatures as a function of the two variational parameters. On the far right is the original variational ELBO. There are two local optima, one near {µ0 = 4, µ1 = −4} (the correct values) and the other near {µ0 = −4, µ1 = 4}. This second solution corresponds to the case where the assignments of the data points to the two cluster centers are interchanged. As the weights of the two clusters are not equal, those two solutions do not explain the data equally well: there is a true and a false optimum. Depending on initial conditions, traditional variational inference might eventually converge to the wrong local optimum. We prevent this situation if we use deterministic annealing. In fact, as we smoothly deform the objective, we converge to the true global optimum independently of the initial conditions (Katahira et al., 2008).
3 Deterministic annealing for SVI
We now introduce deterministic annealing for stochastic variational inference (SVI). The class of models that we consider consists of data points x_i associated with latent variables z_i and global variables β. The pairs (x_i, z_i) are conditionally independent of each other given the global variables,

p(x_{1:n}, z_{1:n}, β | η) = p(β | η) ∏_{i=1}^{n} p(x_i, z_i | β),    (7)

where η are hyperparameters. We consider the mean-field variational distribution in Equation 6. The T-ELBO is then given by

L_T(λ, φ_{1:n}) = E_q[log p(β | η)] − T E_q[log q(β | λ)] + Σ_{i=1}^{n} { E_q[log p(x_i, z_i | β)] − T E_q[log q(z_i | φ_i)] }.    (8)
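To see the effect of the temperature in the T-ELBO concretely, the following sketch evaluates it analytically for a toy setting (a single Gaussian likelihood with a Gaussian variational factor) and reports the variational standard deviation that maximizes it at several temperatures. The toy model and all names are our own and are not the mixture or LDA models of the paper.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(1.5, 1.0, size=10)
n = len(x)

def t_elbo(m, s, T):
    # Analytic T-ELBO (Eq. 5) for beta ~ N(0, 1), x_i | beta ~ N(beta, 1),
    # with variational factor q(beta) = N(m, s^2).
    exp_log_joint = (-0.5 * np.sum((x - m)**2 + s**2)   # likelihood terms
                     - 0.5 * (m**2 + s**2)               # prior term
                     - 0.5 * (n + 1) * np.log(2 * np.pi))
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return exp_log_joint + T * entropy

m_star = x.sum() / (n + 1)                  # optimal mean, independent of T
s_grid = np.linspace(0.05, 3.0, 2000)
for T in [1.0, 2.0, 5.0, 20.0]:
    s_star = s_grid[np.argmax([t_elbo(m_star, s, T) for s in s_grid])]
    print(f"T={T:>4}: optimal std {s_star:.3f} "
          f"(theory sqrt(T/(n+1)) = {np.sqrt(T/(n+1)):.3f})")
# Larger T rewards entropy more heavily, so the optimal q is broader;
# at T = 1 we recover the usual posterior standard deviation 1/sqrt(n+1).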
The T-ELBO can be considered a function of the global parameters alone once it has been optimized over the local ones,

L_T(λ) = max_{φ_{1:n}} L_T(λ, φ_{1:n}).    (9)
This objective has the same optima in λ as the full T-ELBO. We now turn our attention to the conditionally conjugate exponential family (CCEF). Annealing for variational inference applies more generally, but restricting to the conditionally conjugate exponential family allows us to analytically compute the expectations in the T-ELBO. In the CCEF, the prior and the local conditional are both in the exponential family and form a conjugate pair,

p(β | α) = h(β) exp{α^⊤ t(β) − a_g(α)},
p(z_i, x_i | β) = h(z_i, x_i) exp{β^⊤ t(z_i, x_i) − a_l(β)}.
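As a concrete instance of this exponential-family machinery, and of the identity ∇_λ a_g(λ) = E_q[t(β)] that the updates below rely on, the sketch checks the Dirichlet log-normalizer numerically. The Dirichlet example and the parameter values are our own, chosen because the Dirichlet reappears in the LDA section.

import numpy as np
from scipy.special import gammaln, digamma

def a_g(lam):
    # Log-normalizer of a Dirichlet written with natural parameter lam = alpha - 1.
    alpha = lam + 1.0
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def grad_a_g(lam):
    # Gradient of the log-normalizer equals E[t(beta)] = E[log beta].
    alpha = lam + 1.0
    return digamma(alpha) - digamma(np.sum(alpha))

lam = np.array([0.5, 2.0, 4.0])             # some natural parameters
alpha = lam + 1.0
rng = np.random.default_rng(2)
samples = rng.dirichlet(alpha, size=200_000)
print("grad a_g(lambda):       ", np.round(grad_a_g(lam), 4))
print("Monte Carlo E[log beta]:", np.round(np.log(samples).mean(axis=0), 4))
# The two agree up to sampling error: the gradient of the log-normalizer
# gives the expected sufficient statistics, which is what the coordinate
# and natural-gradient updates below compute.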
We set the variational distributions to be in the same family as the complete conditionals. For the global variational parameters, this corresponds to

p(β | x, z) = h(β) exp{η_g(x, z)^⊤ t(β) − a_g(η_g(x, z))},
q(β) = h(β) exp{λ^⊤ t(β) − a_g(λ)}.    (10)

Figure 3: The per-word predictive log likelihood for three large corpora, as specified in Section 4. Deterministic annealing consistently outperforms plain stochastic variational inference. We used an annealing length of t_A = 0.1 in units of effective traversals of the data set, defined in Eq. 25.
Following Hoffman et al. (2013), the analytic expectation for the one-parameter T-ELBO is

L_T(λ) = E_q[η_g(x, z, α)]^⊤ ∇_λ a_g(λ) − T λ^⊤ ∇_λ a_g(λ) + T a_g(λ) + const.    (11)

Next we take the natural gradient, i.e., the gradient premultiplied by the inverse of the Fisher metric ∇²_λ a_g(λ) (Amari, 1998; Hoffman et al., 2013). The natural gradient of the T-ELBO is

∇_λ L_T = E_q[η_g(x, z, α)] − T λ.    (12)

(In the following we only consider base measures h(β) = 1. This is the general case for the conditionally conjugate exponential family; in more general exponential families, the base measure can be absorbed into the dominating measure where appropriate.)
In batch variational inference, we alternate between updates of the global and the local variational parameters. Setting the natural gradient to zero yields the global parameter update

λ* = (1/T) E_q[η_g(x, z, α)].    (13)

As T > 1, deterministic annealing in batch variational inference implies that the global variational parameters are artificially kept smaller than usual. For exponential family distributions, this implies higher entropy. Similarly, the updates for the local variational parameters are

φ_{nj} = (1/T) E_q[η_l(x_n, z_{(n,−j)}, β)],    (14)

where η_l is the natural parameter of the corresponding exponential family distribution of the local variational parameters (Hoffman et al., 2013).
We now move on to stochastic variational inference (SVI). In contrast to the closed-form parameter update in batch VI, SVI uses a learning rate to update the global variational parameters based on a stochastic estimate of the natural gradient,

λ_{t+1} = λ_t + ρ_t ∇̂_λ L_T.    (15)

The stochastic natural gradient is an unbiased estimator of the full natural gradient,

E[∇̂_λ L_T(λ)] = ∇_λ L_T(λ).    (16)

Typically the estimate is formed based on a mini-batch, i.e., a subset of randomly sampled data points, or on a single data point. In the latter case, it is straightforward to show that the natural stochastic gradient in deterministic annealing is

∇̂_λ L_T = E_q[η_g(x_i^{(N)}, z_i^{(N)}, α)] − T λ,    (17)

where η_g(x_i^{(N)}, z_i^{(N)}, α) is the natural parameter when the ith data point is replicated N times. The gradient ascent scheme can also be expressed as the following two-step process,

λ̂_t = E_q[η_g(x_i^{(N)}, z_i^{(N)}, α)],
λ_{t+1} = (1 − ρ_t) λ_t + ρ_t λ̂_t − (T − 1) ρ_t λ_t,    (18)

where we first build an estimate λ̂_t based on the sampled data point, and then merge this estimate into the previous value λ_t. In contrast to SVI, we have an additional damping term that shrinks λ_t at a rate (T − 1) ρ_t. Algorithm 1 summarizes deterministic annealing for SVI (DASVI).

Algorithm 1 DASVI
1: Initialize λ^(0) randomly. Initialize T > 1.
2: Set the step-size schedule ρ_t.
3: repeat
4:   Sample a data point x_i uniformly.
5:   Compute its local variational parameters,
       φ_t = E_{λ_t}[η_l(x_i, β)] / T.
6:   Compute the intermediate global parameters as if x_i was replicated N times,
       λ̂_t = E_{φ_t}[η_g(x_i^{(N)}, z_i^{(N)})] / T.
7:   Update the current estimate of the global variational parameters,
       λ_{t+1} = (1 − ρ_t) λ_t + ρ_t λ̂_t − (T − 1) ρ_t λ_t.
8:   Reduce T if T > 1.
9: until forever
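The annealed update is easy to exercise on a toy conjugate model. The sketch below applies the stochastic natural-gradient step of Eqs. 15, 17, and 18 to Bayesian estimation of a Gaussian mean with known unit variance; the model, the exponential temperature decay, and all variable names are our own choices for illustration and are not the paper's LDA implementation.

import numpy as np

rng = np.random.default_rng(3)
N = 5000
data = rng.normal(3.0, 1.0, size=N)          # x_i | beta ~ N(beta, 1)
sigma0_sq = 100.0                            # prior beta ~ N(0, sigma0^2)

# Natural parameters of a Gaussian over beta: lam = (lam1, lam2),
# q(beta) proportional to exp(lam1 * beta + lam2 * beta^2).
alpha = np.array([0.0, -0.5 / sigma0_sq])    # prior natural parameters
lam = np.array([0.0, -0.5])                  # arbitrary initialization

T0, t_A, iters = 2.0, 0.1, 50_000            # initial temperature, annealing length
tau0, kappa = 1000.0, 0.7                    # Robbins-Monro schedule (Eq. 24)

for t in range(iters):
    rho = (t + tau0) ** (-kappa)
    # Exponential temperature decay that reaches T = 1 after t_A * iters steps.
    T = max(1.0, T0 ** (1.0 - t / (t_A * iters)))
    x_i = data[rng.integers(N)]
    # Intermediate estimate: natural parameter of the complete conditional
    # with x_i replicated N times (Eq. 18, first line).
    lam_hat = alpha + N * np.array([x_i, -0.5])
    # Annealed merge with the damping term (Eq. 18, second line).
    lam = (1 - rho) * lam + rho * lam_hat - (T - 1) * rho * lam

mean = -lam[0] / (2 * lam[1])
var = -1.0 / (2 * lam[1])
exact_var = 1.0 / (N + 1.0 / sigma0_sq)
exact_mean = data.sum() * exact_var
print(f"DASVI estimate:  mean={mean:.4f}, var={var:.2e}")
print(f"Exact posterior: mean={exact_mean:.4f}, var={exact_var:.2e}")

At T = 1 the update reduces to plain SVI, and the iterate settles near the exact conjugate posterior; the early high-temperature phase only broadens the intermediate variational distributions.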
3.1 Deterministic Annealing for Latent Dirichlet Allocation
LDA (Blei et al., 2003) has become one of the standard models used to test the efficiency of stochastic variational inference algorithms. We assume K topics, D documents, N_d words in document d, and a vocabulary of size V. The joint distribution for the LDA model is

p(β, Θ, z, w) = ∏_{k=1}^{K} p(β_k | η) × ∏_{d=1}^{D} p(Θ_d | α) ∏_{n=1}^{N_d} p(z_{dn} | Θ_d) p(w_{dn} | β_{1:K}, z_{dn}).

Above, w_{dn} is the nth word in document d. We approximate this distribution in terms of

q(β, Θ, z) = q(β | λ) ∏_{d=1}^{D} q(Θ_d | γ_d) ∏_{n=1}^{N_d} q(z_{dn} | φ_{dn}).    (19)

Above, β_{1:K} are the topics with Dirichlet parameters λ_{1:K}, the local topic proportions are θ_{1:D} with Dirichlet parameters γ_{1:D}, and the local per-word topic assignments z_{1:D,1:N} are multinomials with parameters φ_{1:D,1:N}.
The coordinate updates and natural gradients can be computed using Eq. 17 and Eq. 14. Let us now sample a document d from the corpus. For fixed n, the local multinomial parameter is φ_{dn}^{1:K}. Let us define the indicator vector W_{dn}^{1:V}, which satisfies W_{dn}^v = 1 if w_{dn} = v, and W_{dn}^v = 0 otherwise. As a slight modification of Hoffman et al. (2010), the stochastic natural gradient is

∇̂_λ L_T = η + D Σ_{n=1}^{N_d} φ_{dn} · W_{dn}^⊤ − T λ,    (20)

where φ_{dn} · W_{dn}^⊤ is a K × V matrix and therefore has the same dimension as λ. The parameters for the topic assignments are

φ_{dn}^k ∝ exp( (E_q[log β_{k,w_{dn}}] + E_q[log θ_{dk}]) / T ).    (21)

Note that φ normalizes to one as a multinomial parameter. Lastly, the updates for the topic proportions are

γ_d^k = (α + Σ_{n=1}^{N_d} φ_{dn}^k + T − 1) / T.    (22)

Below, we give the algorithm for annealed LDA that we use in our empirical study.

Algorithm 2 DASVI for LDA
1: Initialize λ^(0) randomly.
2: Set the step-size schedule ρ_t.
3: Initialize T > 1.
4: repeat
5:   Sample a document d uniformly.
6:   Initialize γ_d^k = 1 for k ∈ {1, · · · , K}.
7:   repeat
8:     For n ∈ {1, · · · , N_d}, k ∈ {1, · · · , K} set
         φ_{dn}^k ∝ exp( (E_q[log β_{k,w_{dn}}] + E_q[log θ_{dk}]) / T ),
         γ_d^k = (α + Σ_{n=1}^{N_d} φ_{dn}^k + T − 1) / T.
9:   until the local parameters φ and γ converge
10:  For k ∈ {1, · · · , K} set the intermediate topics
11:    λ̂_t^k = (η + D Σ_{n=1}^{N_d} φ_{dn}^k W_{dn}) / T.
12:  Set λ_{t+1} = (1 − ρ_t) λ_t + ρ_t λ̂_t − (T − 1) ρ_t λ_t.
13:  Reduce T if T > 1.
14: until forever

4 Empirical Study
We studied the empirical performance of deterministic annealing for SVI on the example of LDA. We found that
• Deterministic annealing for SVI is easy to implement at no extra computational cost.
• It reaches higher predictive likelihoods on held-out words.
• It is insensitive to annealing lengths and temperatures.
We furthermore find a universal annealing length of 0.1 effective traversals of the data set to work well, independent of the size of the corpus, as we explain below.
Data.
We ran the algorithm on three large text corpora:
1. Science: A corpus of 140,000 documents from the journal Science between 1880 and 2002. After processing, the vocabulary consisted of 5,855 words. 1,000 documents were set aside for the test set.
2. New York Times: This collection (Sandhaus, 2008) contains 1.8M documents and a vocabulary of 8,000 terms. 10,000 documents were set aside for testing.
3. Wikipedia: This corpus contains 3.6M documents and a vocabulary of 7,700 terms. 10,000 documents were set aside for testing.
We evaluate model fitness using a predictive distribution described in Hoffman et al. (2013). We separated a test set of documents D_te from the training corpus D_tr. Each document d in the test set is furthermore divided into two parts. One part, containing the words w_old, is used to fit the local variational parameters and obtain the topic proportions for the given test document. The other part, containing new words w_new, is used for prediction. The predictive distribution is defined as

p(w_new | w_old, D_tr) = ∫ p(w_new | Θ, β) p(β, Θ | w_old, D_tr) dΘ dβ.

Following Hoffman et al. (2013), it is approximated by

p(w_new | w_old, D_tr) ≈ Σ_{k=1}^{K} E_q[θ_k] E_q[β_{k,w_new}],    (23)

where θ_k is the proportion of topic k in the given document, and β_{k,w_new} is the weight of the corresponding word in the kth topic. We calculate the log probability of all held-out words under this predictive distribution.
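Here is a minimal sketch of this per-word predictive likelihood (Eq. 23), assuming the expected topic proportions and expected topic-word probabilities have already been extracted from a fitted variational posterior; the array names and toy dimensions are ours.

import numpy as np

def heldout_log_likelihood(E_theta, E_beta, new_word_ids):
    # Eq. (23): p(w_new | w_old, D_tr) ~= sum_k E_q[theta_k] E_q[beta_{k, w_new}].
    # E_theta: (K,) expected topic proportions for this test document.
    # E_beta:  (K, V) expected topic-word probabilities.
    # new_word_ids: vocabulary indices of the held-out words of the document.
    word_probs = E_theta @ E_beta[:, new_word_ids]   # shape (len(new_word_ids),)
    return np.sum(np.log(word_probs))

# Toy example with K = 3 topics and V = 5 vocabulary terms.
rng = np.random.default_rng(4)
E_theta = rng.dirichlet([1.0, 1.0, 1.0])
E_beta = rng.dirichlet(np.ones(5), size=3)
held_out = np.array([0, 2, 2, 4])
print("per-word predictive log likelihood:",
      heldout_log_likelihood(E_theta, E_beta, held_out) / len(held_out))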
Model Parameters. Stochastic variational inference for LDA requires setting the Dirichlet hyperparameters and the learning-rate schedule. We set the learning rates for the Wikipedia and New York Times corpora to the values that were found to be optimal for SVI in Ranganath et al. (2013) and Hoffman et al. (2013). We used α = 0.01, η = 1, and τ_0 = 1000, where α and η are the Dirichlet priors. The parameters τ_0 and κ determine the Robbins-Monro learning schedule,

ρ_t = 1 / (t + τ_0)^κ.    (24)

The value of κ = 0.8 was used for the Wikipedia data set and κ = 0.7 for the New York Times. For the smaller Science data set, τ_0 = 64 was used. We set the mini-batch size to 100 and used 100 topics.
In the following we study the sensitivity of the deterministic annealing algorithm to parameter changes. This includes the length of the annealing schedule and the initial temperature.
Annealing length and schedule. In order to test the sensitivity of the algorithm to different annealing lengths, we set the initial temperature to T_0 = 2 (a value that we later confirm as a good choice when comparing different values of T_0). We compare different annealing lengths that end at T = 1 at different iterations/times.
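To make the schedules concrete, here is a small helper that pairs the Robbins-Monro learning rate of Eq. 24 with a temperature schedule that starts at T_0 and reaches T = 1 after a chosen fraction t_A of effective traversals (the exponentially decaying form described in the next paragraph). The function names and the printed settings are our own illustration.

import numpy as np

def learning_rate(t, tau0=1000.0, kappa=0.7):
    # Robbins-Monro schedule, Eq. (24).
    return (t + tau0) ** (-kappa)

def temperature(t, T0=2.0, t_A=0.1, minibatch=100, corpus_size=1_800_000):
    # Effective traversals after t iterations, Eq. (25).
    eff = t * minibatch / corpus_size
    # Exponential decay from T0 to 1, reaching T = 1 at t_A effective traversals.
    return max(1.0, T0 ** (1.0 - eff / t_A))

for t in [0, 1_000, 5_000, 9_000, 18_000, 36_000]:
    print(f"iter {t:>6}: rho = {learning_rate(t):.5f}, T = {temperature(t):.3f}")
# With a mini-batch of 100 and 1.8M documents, t_A = 0.1 effective traversals
# corresponds to 1,800 iterations, after which the objective is the plain ELBO.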
In all cases, we choose the annealing schedule to be an exponentially decaying one that starts at T = T_0 and reaches T = 1 at the annealing length t_A (which uniquely determines the schedule). We measure the annealing length in units of effective traversals of the data set,

eff. traversals = (iterations × mini-batch size) / (documents in corpus).    (25)

Using this measure, we identified a consistent optimal annealing length of t_A = 0.1 effective traversals across all data sets. Also, the algorithm is seen to consistently converge at about 2-3 effective traversals for the given hyperparameters, demonstrating the use of this measure. These results are given in Fig. 4, which shows learning curves for different annealing lengths for all three data sets. Note that those data sets differ by a factor of 20 in size. The performance of DASVI is robust to differing annealing lengths as long as t_A < 0.25. For larger annealing lengths, the learning curves show kinks. This is because it takes longer until the tempered ELBO reaches the final ELBO, and the annealing process optimizes the tempered one. At those long times, the decreasing Robbins-Monro learning rates have already become so small that it takes a long time to adjust to the final ELBO (although convergence to a local optimum is still guaranteed by Robbins and Monro (1951)). To conclude, annealing should end before the stochastic gradient ascent ends. Due to the decreasing Robbins-Monro learning rates, annealing cannot be too slow; otherwise the algorithm might again get stuck in poor local optima.
Figure 4: Performance of deterministic annealing for different annealing lengths t_A, measured in units of effective traversals.
Initial temperature. We found that the initial temperature does not significantly affect the performance of the algorithm. Fig. 5 shows the dependence of the annealing algorithm on the initial temperature in a range of T = 1.5 to T = 5. We fixed the annealing length to t_A = 0.1 in units of effective traversals. Naturally, higher initial temperatures take longer to reach high likelihoods. If the temperature is set too low, the full effects of annealing will not be reached. We tried a variety of initial temperatures and empirically found an initial temperature of 2-3 to give the best results. We also found on the other corpora that an initial temperature of 2-3 and a linear annealing schedule of length t_A = 0.1 effective traversals performs well.
Figure 5: Sensitivity of deterministic annealing to varying initial temperatures. We show results on the New York Times data set (t_A = 0.1). We compare the initial temperatures T = 1.5, 2, 3, 5 against SVI.
5 Conclusions and outlook
We have introduced deterministic annealing for stochastic variational inference. Stochastic variational inference (SVI) relies on optimizing a lower bound on the evidence. The objective is non-convex, and hence SVI suffers from convergence to poor local optima. Deterministic annealing is a method that prevents the stochastic gradients from getting stuck in shallow local optima. In contrast to simulated annealing, where artificial noise is added to the gradient, in deterministic annealing we deform the objective deterministically. More and more weight is continuously transferred from the entropy term of the ELBO to the energy term, until we finally reach the final ELBO at temperature T = 1. We have shown on three large data sets that deterministic annealing leads to higher predictive accuracy on held-out data than SVI, and that it converges faster. We have studied different annealing schedules. We found that DASVI is not sensitive to fine-tuning these parameters, but rather consistently gives better results.
References
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Hoffman, M., Blei, D., and Bach, F. (2010). Online inference for latent Dirichlet allocation.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.
Hofmann, T. and Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
Katahira, K., Watanabe, K., and Okada, M. (2008). Deterministic annealing variant of variational Bayes method. Journal of Physics: Conference Series, 95(1):012015.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598):671–680.
Mandt, S. and Blei, D. (2014). Smoothed gradients for stochastic variational inference. arXiv:1406.3650.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods.
Peterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019.
Ranganath, R., Wang, C., Blei, D. M., and Xing, E. P. (2013). An adaptive learning rate for stochastic variational inference. In International Conference on Machine Learning.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239.
Rose, K., Gurewitz, E., and Fox, G. (1990). A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589–594.
Sandhaus, E. (2008). The New York Times Annotated Corpus LDC2008T19. Web Download. Philadelphia: Linguistic Data Consortium.
Sato, I., Kurihara, K., Tanaka, S., Nakagawa, H., and Miyashita, S. (2009). Quantum annealing for variational Bayes inference. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 479–486. AUAI Press.
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622.
Ueda, N. and Nakano, R. (1998). Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282.
Wang, C., Chen, X., Smola, A., and Xing, E. (2013). Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189.
Yoshida, R. and West, M. (2010). Bayesian learning in sparse graphical factor models via variational mean-field annealing. Journal of Machine Learning Research, 11:1771–1798.