Combinatorial Topic Models using Small-Variance Asymptotics


arXiv:1604.02027v1 [cs.LG] 7 Apr 2016

Ke Jiang (JIANG.454@OSU.EDU)
Dept of Computer Science and Engineering, Ohio State University, Columbus, OH, USA

Suvrit Sra (SUVRIT@MIT.EDU)
Laboratory for Information and Decision Systems, MIT, Cambridge, MA, USA

Brian Kulis (BKULIS@BU.EDU)
Dept of Electrical & Computer Engineering and Dept of Computer Science, Boston University, Boston, MA, USA

Abstract

Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and derive its objective function from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show the surprising result that our algorithm can outperform all major LDA-based topic modeling approaches, even when the data are sampled from an LDA model and the true hyperparameters are provided to these competitors. These results make a strong case that topic models need not be limited to a probabilistic view.

1. Introduction

Topic modeling has long been fundamental to unsupervised learning on large document collections. Though the roots of topic modeling date back to latent semantic indexing (Deerwester et al., 1990) and probabilistic latent semantic indexing (Hofmann, 1999), the arrival of Latent Dirichlet Allocation (LDA) (Blei et al., 2003) was a turning point that completely reshaped the community's thinking about topic modeling. LDA led to several follow-ups that address limitations of the original model (Blei & Lafferty, 2006; Wang & Grimson, 2007), and also helped pave the way for subsequent advances in Bayesian learning methods, including variational inference methods (Teh et al., 2006b), nonparametric Bayesian models (Blei et al., 2004; Teh et al., 2006a), and others.

The LDA family of topic models is almost exclusively cast in probabilistic terms. The vast majority of techniques developed for topic modeling—collapsed Gibbs sampling (Griffiths & Steyvers, 2004), variational methods (Blei et al., 2003; Teh et al., 2006b), and "factorization" approaches with theoretical guarantees (Anandkumar et al., 2012; Arora et al., 2012; Bansal et al., 2014)—are centered on performing inference for underlying probabilistic models. By limiting ourselves to a purely probabilistic viewpoint, we may be missing important opportunities grounded in combinatorial thinking. This leads us to the central question of this paper: can we obtain a combinatorial topic model that competes with LDA?

We answer this question in the affirmative. In particular, we propose a combinatorial optimization formulation for topic modeling, derived by applying small-variance asymptotics (SVA) to the LDA model. SVA produces limiting versions of various probabilistic learning models, which can then be solved as combinatorial optimization problems; analogously, k-means solves the combinatorial problem arising from letting the variances in a Gaussian mixture go to zero. SVA techniques have been quite successful recently, e.g., for cluster evolution (Campbell et al., 2013), hidden Markov models (Roychowdhury et al., 2013), feature learning (Broderick et al., 2013), supervised learning (Wang & Zhu, 2014), hierarchical clustering (Lee & Choi, 2015), and others (Huggins et al., 2015; Wang & Zhu, 2015). A common theme in these examples is that the computational advantages and good empirical performance of k-means carry over to richer SVA-based models. Indeed, in a compelling example, Campbell et al. (2013) demonstrate that a hard cluster-evolution algorithm obtained via SVA is orders of magnitude faster than competing sampling-based methods, while still being significantly more accurate than competing probabilistic inference algorithms on benchmark data.

But merely using SVA to obtain a combinatorial model does not suffice; we also need appropriate algorithms to optimize the resulting model. Unfortunately, a direct application of greedy combinatorial procedures to the LDA-based SVA model fails to compete with the usual probabilistic LDA methods, necessitating a new idea. Surprisingly, as we will see, a simple local refinement procedure transforms the SVA approach into a competitive topic modeling algorithm. In summary, our key contributions are the following:

• We perform SVA on the standard LDA model and obtain from it a combinatorial optimization problem (essentially a facility location problem).
• We develop a k-means-like optimization procedure for optimizing the derived combinatorial model.
• We derive and implement a powerful local refinement algorithm that performs incremental word and mini-topic assignments. Combined with the k-means-like procedure, this yields an effective topic modeling algorithm.

Since there is little to no labeled data for topic modeling, to evaluate our approach we adopt a commonly used benchmarking strategy: we quantitatively compare the topics and word assignments produced by various algorithms against ground-truth topics on data generated from the LDA model itself. Note that such an evaluation handicaps our combinatorial algorithms, as they do not directly optimize the original LDA objective (only its SVA limit). Nevertheless, we show that our methods are surprisingly strong compared to the best available inference techniques for LDA. For example, in 10 iterations our algorithm consistently matches or outperforms 3000 iterations of a collapsed Gibbs sampler (under various evaluation metrics), even when the sampler is provided the exact hyperparameters used to generate the data. We also show compelling quantitative comparisons to variational and anchor-based methods.

1.1. Related Work

Interest in topic models has grown so rapidly that we cannot hope to do full justice to related work. However, we summarize below some works from key related subareas.

LDA Algorithms. Many techniques have been developed for efficient inference in LDA. The most popular are perhaps MCMC-based methods, notably the collapsed Gibbs sampler (CGS) (Griffiths & Steyvers, 2004), and variational inference methods (Blei et al., 2003; Teh et al., 2006b). Among MCMC and variational techniques, CGS typically yields excellent results and is guaranteed to sample from the desired posterior given sufficiently many samples; however, it can be slow, and many samples may be required before convergence. Since topic models are often used on large document collections, significant effort has gone into scaling up LDA algorithms; one recent example is (Li et al., 2014), which presents a massively distributed implementation. Such methods are outside the focus of this paper, which centers on a new combinatorial model that can quantitatively match or outdo the probabilistic LDA model; ultimately, our model should also be amenable to scalable solvers, and given its strong empirical performance, obtaining faster algorithms for it is an important part of future work. A complementary line of work starts with (Arora et al., 2012; 2013), who consider certain separability assumptions on the input data to circumvent the NP-hardness of the basic LDA model. These works have shown performance competitive with Gibbs sampling in some scenarios while also featuring theoretical guarantees. Other important references include (Anandkumar et al., 2012; Podosinnikova et al., 2015; Bansal et al., 2014).

Small-Variance Asymptotics (SVA). As noted above, SVA has recently emerged as a powerful tool for obtaining scalable algorithms and objective functions by "hardening" probabilistic models. Similar connections are known, for instance, in dimensionality reduction (Roweis, 1997), multi-view learning, classification (Tong & Koller, 2000), and structured prediction (Samdani et al., 2014); Figure 1 lists some examples of SVA applied to these problems. Starting with Dirichlet process mixtures (Kulis & Jordan, 2012), one thread of research has applied SVA to richer Bayesian nonparametric models, with applications including clustering (Kulis & Jordan, 2012), feature learning (Broderick et al., 2013), evolutionary clustering (Campbell et al., 2013), infinite hidden Markov models (Roychowdhury et al., 2013), Markov jump processes (Huggins et al., 2015), infinite SVMs (Wang & Zhu, 2014), and hierarchical clustering methods (Lee & Choi, 2015). Another related thread of research considers how to apply SVA methods when the data likelihood is not Gaussian, which is precisely the scenario under which LDA falls. In (Jiang et al., 2012), it is shown how SVA may be applied as long as the likelihood is a member of the exponential family of distributions. That work considers topic modeling as a potential application, but omits the algorithmic tools without which SVA fails to succeed on topic models; the present paper fixes this by introducing local refinement.


Probabilistic Model                   | SVA Model              | Learning Task
Gaussian Mixture                      | k-means                | Clustering
Probabilistic PCA                     | PCA                    | Dimensionality Reduction
Probabilistic CCA                     | CCA                    | Multi-view Learning
Restricted Bayes Optimal Classifier   | Support Vector Machine | Classification
Hidden Variable CRF                   | Latent Structural SVM  | Structured Prediction

Figure 1. Examples of small-variance asymptotics applied to standard machine learning tasks. See text for further details.

Combinatorial Optimization. In developing effective algorithms for topic modeling, we borrow some ideas from the large literature on combinatorial optimization algorithms. In particular, in the k-means community, significant recent research has explored how to improve upon the basic k-means algorithm, which is known to be prone to local optima; these techniques include local search methods (Dhillon et al., 2002) and good initialization strategies (Arthur & Vassilvitskii, 2007). We also borrow ideas from approximation algorithms, most notably algorithms based on the facility location problem (Jain et al., 2003).

2. SVA for Latent Dirichlet Allocation

We now detail our combinatorial approach to topic modeling, starting with the derivation of the underlying objective function that is the basis of our work. This objective is directly derived from LDA by applying SVA, and contains two terms: the first is similar to the k-means clustering objective in that it seeks to assign words to topics that are, in a particular sense, "close"; the second term, arising from the Dirichlet prior on the per-document topic distributions, places a penalty on the number of topics per document.

Recall the standard LDA model:

• Choose θ_j ∼ Dir(α), where j ∈ {1, ..., M}.
• Choose ψ_i ∼ Dir(β), where i ∈ {1, ..., K}.
• For each word t in document j:
  – Choose a topic z_jt ∼ Cat(θ_j).
  – Choose a word w_jt ∼ Cat(ψ_{z_jt}).

Here α and β are scalars (i.e., we are using a symmetric Dirichlet distribution). Let W denote the vector of all words in all documents, Z the topic indicators of all words in all documents, θ the concatenation of all the θ_j variables, and ψ the concatenation of all the ψ_i variables. Also let N_j be the total number of word tokens in document j. The θ_j vectors are each of length K, the number of topics; the ψ_i vectors are each of length D, the size of the vocabulary. We can write down the full joint likelihood p(W, Z, θ, ψ | α, β) of the model in the factored form

    ∏_{i=1}^{K} p(ψ_i | β) ∏_{j=1}^{M} p(θ_j | α) ∏_{t=1}^{N_j} p(z_jt | θ_j) p(w_jt | ψ_{z_jt}),

where each of the probabilities is as specified by the LDA model. Following standard LDA manipulations, we can eliminate variables to simplify inference by integrating out θ to obtain

    p(Z, W, ψ | α, β) = ∫_θ p(W, Z, θ, ψ | α, β) dθ.   (1)

After simplification, (1) becomes

    [ ∏_{i=1}^{K} p(ψ_i | β) ∏_{j=1}^{M} ∏_{t=1}^{N_j} p(w_jt | ψ_{z_jt}) ] × [ ∏_{j=1}^{M} Γ(αK) / Γ(∑_{i=1}^{K} n^i_{j·} + αK) ∏_{i=1}^{K} Γ(n^i_{j·} + α) / Γ(α) ].   (2)

Here n^i_{j·} is the number of word tokens in document j assigned to topic i. Now, following (Broderick et al., 2013), we can obtain the SVA objective by taking the (negative) logarithm of this likelihood and letting the variance go to zero.

Consider the first bracketed term of (2). Taking logs yields a sum over terms of the form log p(ψ_i | β) and terms of the form log p(w_jt | ψ_{z_jt}). Noting that the latter is a multinomial distribution, and thus a member of the exponential family, we can appeal to the results in (Banerjee et al., 2005; Jiang et al., 2012) to introduce a new parameter for scaling the variance. In particular, we write p(w_jt | ψ_{z_jt}) in its Bregman divergence form exp(−KL(w̃_jt, ψ_{z_jt})), where KL denotes the discrete KL-divergence and w̃_jt is an indicator vector for the word at token w_jt. It is straightforward to verify that KL(w̃_jt, ψ_{z_jt}) = −log ψ_{z_jt, w_jt}, since the indicator places all of its mass on the single word w_jt. Next, we introduce a new parameter η that scales the variance appropriately, and write the resulting distribution as proportional to exp(−η · KL(w̃_jt, ψ_{z_jt})). As η → ∞, the expected value of the distribution remains fixed while the variance goes to zero, exactly what we require.

Next, consider the second bracketed term of (2). We scale α appropriately as well; this ensures that the hierarchical form of the model is retained asymptotically. In particular, we write α = exp(−λ · η). After some manipulation of this distribution, we can conclude that the negative log of the Dirichlet-multinomial term becomes asymptotically ηλ(K_j^+ − 1), where K_j^+ is the number of topics i in document j with n^i_{j·} > 0, i.e., the number of topics currently used by document j. (The maximum value of K_j^+ is K, the total number of topics.) To formalize, let f(x) ∼ g(x) denote that f(x)/g(x) → 1 as x → ∞. Then we have the following:

Lemma 1. Consider the likelihood

    p(Z | α) = ∏_{j=1}^{M} [ Γ(αK) / Γ(∑_{i=1}^{K} n^i_{j·} + αK) · ∏_{i=1}^{K} Γ(n^i_{j·} + α) / Γ(α) ].

If α = exp(−λ · η), then asymptotically as η → ∞ we have

    −log p(Z | α) ∼ ηλ ∑_{j=1}^{M} (K_j^+ − 1).
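As a quick numerical illustration of Lemma 1, the following sketch (assuming numpy and scipy; the toy counts n^i_{j·} are hypothetical) evaluates −log p(Z | α) at α = exp(−λη) and checks that its ratio to ηλ ∑_j (K_j^+ − 1) approaches 1 as η grows:

import numpy as np
from scipy.special import gammaln

def neg_log_p_Z(n, alpha):
    """-log p(Z | alpha) for a matrix n of per-document topic counts (shape M x K)."""
    K = n.shape[1]
    per_doc = (gammaln(alpha * K) - gammaln(n.sum(axis=1) + alpha * K)
               + (gammaln(n + alpha) - gammaln(alpha)).sum(axis=1))
    return -per_doc.sum()

lam = 1.0
n = np.array([[3., 0., 2.], [5., 0., 0.]])   # toy counts: M = 2 documents, K = 3 topics
K_plus = (n > 0).sum(axis=1)                 # number of topics actually used per document
for eta in [10.0, 100.0, 500.0]:
    alpha = np.exp(-lam * eta)
    ratio = neg_log_p_Z(n, alpha) / (eta * lam * (K_plus - 1).sum())
    print(eta, ratio)                        # ratio tends to 1 as eta grows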

Now we put the terms of the negative log-likelihood together. The −log p(ψ_i | β) terms vanish asymptotically since we are not scaling β (see the note below on scaling β). Thus, the remaining terms in the SVA objective are the ones arising from the word likelihoods and the Dirichlet-multinomial. Using the Bregman divergence representation with the additional η parameter, we conclude that the negative log-likelihood asymptotically yields the following:

    −log p(Z, W, ψ | α, β) ∼ η [ ∑_{j=1}^{M} ∑_{t=1}^{N_j} KL(w̃_jt, ψ_{z_jt}) + λ ∑_{j=1}^{M} (K_j^+ − 1) ],

which leads to our final objective function

    min_{Z, ψ}  ∑_{j=1}^{M} ∑_{t=1}^{N_j} KL(w̃_jt, ψ_{z_jt}) + λ ∑_{j=1}^{M} K_j^+.   (3)

We remind the reader that KL(w̃_jt, ψ_{z_jt}) = −log ψ_{z_jt, w_jt}. Thus we obtain a k-means-like term which says that all words in all documents should be "close" to their assigned topics in terms of KL-divergence, but that we should also not have too many topics represented in each document.

Note that we did not scale β, in order to obtain a simple objective with only one parameter (other than the total number of topics); still, let us say a few words about scaling β. A natural approach is to further integrate ψ out of the joint likelihood, as is done in the collapsed Gibbs sampler. One would obtain additional Dirichlet-multinomial distributions, and scaling them properly as discussed above would yield a simple objective that penalizes both the number of topics per document and the number of words in each topic; optimization would then be performed only with respect to the topic assignment matrix. Future work will consider the effectiveness of such an objective function for topic modeling.
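To make the objective concrete, the following minimal sketch (assuming numpy, with documents given as arrays of word ids, Z as aligned arrays of topic ids, and ψ as a K × D row-stochastic matrix; the function name is ours) evaluates (3):

import numpy as np

def sva_objective(docs, Z, psi, lam, eps=1e-12):
    """Objective (3): sum over tokens of -log psi[z, w], plus lam times the number
    of distinct topics used in each document."""
    kl_term, penalty = 0.0, 0.0
    for w_j, z_j in zip(docs, Z):
        kl_term += -np.log(psi[z_j, w_j] + eps).sum()   # KL(w_tilde_jt, psi_z) = -log psi[z, w]
        penalty += lam * len(np.unique(z_j))            # K_j^+: distinct topics in document j
    return kl_term + penalty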

3. Algorithms

With our combinatorial objective now in hand, we will develop algorithms to directly optimize the objective.

Algorithm 1 Basic Batch Algorithm
Input: Words W, number of topics K, topic penalty λ
Initialize Z and topic vectors ψ_1, ..., ψ_K.
Compute the initial objective function (3) using Z and ψ.
repeat
  // Update assignments:
  for every word token t in every document j do
    Compute the distance d(j, t, i) to topic i: −log(ψ_{i, w_jt}).
    If z_jt ≠ i for all tokens t in document j, add λ to d(j, t, i).
    Obtain assignments via z_jt = argmin_i d(j, t, i).
  end for
  // Update topic vectors:
  for every element ψ_iu do
    ψ_iu = (# occurrences of word u in topic i) / (total # of word tokens in topic i).
  end for
  Recompute the objective function (3) using the updated Z and ψ.
until no change in the objective function
Output: Assignments Z.

In particular, we discuss a basic locally convergent algorithm similar to k-means and the hard topic modeling algorithm of (Jiang et al., 2012), and then introduce two more powerful techniques: a word-level assignment method that arises from connections between our proposed objective function and the facility location problem, and an incremental topic refinement method inspired by local search methods developed for k-means.

3.1. The Basic Batch Algorithm

We now describe a basic iterative algorithm for optimizing the combinatorial hard-LDA objective derived in the previous section. The basic algorithm follows the k-means style: we perform alternating optimization by first minimizing with respect to the topic indicators for each word (the Z values) and then minimizing with respect to the topics (the ψ vectors). See Algorithm 1 for pseudo-code.

Consider first minimization with respect to ψ, with Z fixed. In this case, the penalty term on the number of topics per document is not relevant to the minimization, which can be performed in closed form by computing means based on the assignments, thanks to a convenient property of the KL-divergence; see Proposition 1 of (Banerjee et al., 2005). In our case, the topic vectors are computed as follows: the entry ψ_iu corresponding to topic i and word u is simply the number of occurrences of word u assigned to topic i, normalized by the total number of word tokens assigned to topic i.

Next consider minimization with respect to Z with ψ fixed. This is less straightforward due to the presence of the penalty terms on the number of topics per document; in fact, unlike the assignment step of k-means, this assignment step can be shown to be NP-hard. However, the assignment problem turns out to be an instance of a nonparametric clustering problem known as DP-means (Kulis & Jordan, 2012), which considers a k-means objective with a penalty on the number of clusters, and we can follow a similar assignment strategy here. In particular, we compute the KL-divergence between each word token w_jt and every topic i via −log(ψ_{i, w_jt}). Then, for any topic i that is not currently occupied by any word token in document j, i.e., z_jt′ ≠ i for all tokens t′ in document j, we add a penalty λ to the distance. We then obtain new assignments by re-assigning each word token to the topic with the smallest (possibly penalized) divergence. We continue this alternating strategy until convergence.

The running time of the batch algorithm is O(NK) per iteration, where N is the total number of word tokens and K is the number of topics: each word token must be compared to every topic, and each comparison takes constant time. Updating the topics is performed by maintaining a count of the number of occurrences of each word in each topic, which also runs in O(NK) time. Note that the collapsed Gibbs sampler also runs in O(NK) time per iteration, so the per-iteration costs are comparable.

One can show that this algorithm is guaranteed to converge to a local optimum. The argument follows along similar lines to k-means and DP-means, namely that each update step cannot increase the objective function. In particular, the update of the topic vectors cannot increase the objective, since the means are known to be the best representatives of the topics by the results of (Banerjee et al., 2005). The assignment step cannot increase the objective since we only re-assign a token if its (penalized) distance decreases; in particular, we only re-assign a token to a topic not currently used by its document if the distance to that topic is smaller than the distance to the current topic by more than λ, thus accounting for the additional λ that must be paid in the objective function.
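For concreteness, one iteration of Algorithm 1 could be sketched as follows (assuming numpy and the same data layout as the sketch above; this is our illustration, not the authors' implementation):

import numpy as np

def batch_iteration(docs, Z, psi, lam, eps=1e-12):
    """One pass of Algorithm 1: reassign every token, then re-estimate the topics."""
    K, D = psi.shape
    # Assignment step.
    for w_j, z_j in zip(docs, Z):
        used = set(z_j.tolist())                         # topics currently used by document j
        new_topic = np.array([i not in used for i in range(K)], dtype=float)
        for t, w in enumerate(w_j):
            d = -np.log(psi[:, w] + eps) + lam * new_topic   # penalize opening a new topic
            z_j[t] = int(np.argmin(d))
        # (for simplicity, 'used' is held fixed during the pass over document j)
    # Topic update step: normalized word counts per topic.
    counts = np.zeros((K, D))
    for w_j, z_j in zip(docs, Z):
        np.add.at(counts, (z_j, w_j), 1.0)
    psi = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    return Z, psi

Iterating this function while monitoring objective (3) until it stops decreasing reproduces the batch scheme.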

3.2. Improved Word Assignments for Z

The basic algorithm has the advantage that it achieves local convergence, but it is extremely sensitive to initialization. In fact, when initialized randomly, it has almost no control over the number of topics used by each document, due to its simple handling of the penalty λ. In this section, we discuss and analyze an alternative assignment technique for Z, which may be used as an initialization to the locally convergent basic algorithm or to replace it completely.

Algorithm 2 Improved Word Assignments for Z
Input: Words W, number of topics K, topic penalty λ, topics ψ
for every document j do
  Let f_i = λ for all topics i.
  Initialize all word tokens to be unmarked.
  while there are unmarked tokens do
    Pick the topic i and set of unmarked tokens T that minimize
        ( f_i + ∑_{t∈T} KL(w̃_jt, ψ_i) ) / |T|.
    Let f_i = 0 and mark all tokens in T.
    Assign z_jt = i for all t ∈ T.
  end while
end for
Output: Assignments Z.

Algorithm 2 details the alternate assignment strategy for tokens. The inspiration for this greedy algorithm arises from the fact that we can view the assignment problem for Z, given ψ, as an instance of the uncapacitated facility location (UFL) problem (Jain et al., 2003). Recall that the UFL problem aims to open a set of facilities from a set F of potential locations. Given a set of clients D, a distance function d : D × F → R+, and a cost function f : F → R+ on the set F, the UFL problem seeks a subset S of F that minimizes ∑_{i∈S} f_i + ∑_{j∈D} min_{i∈S} d_ij. To map UFL to the assignment problem in combinatorial topic modeling, consider the problem of assigning word tokens to topics for some fixed document j. The topics correspond to the facilities and the word tokens correspond to the clients. Let f_i = λ for each facility, and let the distances between clients and facilities be given by the corresponding KL-divergences as detailed earlier. Then the UFL objective corresponds exactly to the assignment problem for topic modeling.

The algorithm must select, among all topics i and all sets of unmarked tokens T, the minimizer of the expression

    ( f_i + ∑_{t∈T} KL(w̃_jt, ψ_i) ) / |T|.

This can be done efficiently by observing that, for a fixed size of T and a given i, the best choice of T is obtained by selecting the |T| tokens closest to ψ_i in terms of KL-divergence. Thus, we can find the global minimizer over all i and all sizes of T by appropriately sorting the KL-divergences of tokens to topics and searching over all sizes of T and topics i. Note that the sorting need only be performed once at the beginning of each iteration, and sorting all distances of tokens to topics across all documents can be performed in time O(NK); this is because the distances depend purely on counts of words within topics, so a linear-time (counting) sort can be used effectively in this scenario.
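The greedy selection for a single document can be sketched as follows (assuming numpy; variable names are ours, not from the authors' code). A full sort is used for simplicity; as noted above, a counting sort yields the stated O(NK) bound:

import numpy as np

def greedy_assign_document(w_j, psi, lam, eps=1e-12):
    """Algorithm 2 for one document: returns a topic id for every token in w_j."""
    K = psi.shape[0]
    n = len(w_j)
    f = np.full(K, lam)                          # facility (topic) opening costs
    z = -np.ones(n, dtype=int)
    unmarked = np.ones(n, dtype=bool)
    dist = -np.log(psi[:, w_j] + eps)            # (K, n): KL(w_tilde_t, psi_i) = -log psi[i, w_t]
    while unmarked.any():
        idx = np.where(unmarked)[0]
        best_cost, best_topic, best_T = np.inf, None, None
        for i in range(K):
            order = idx[np.argsort(dist[i, idx])]                 # closest unmarked tokens first
            ratios = (f[i] + np.cumsum(dist[i, order])) / np.arange(1, len(order) + 1)
            m = int(np.argmin(ratios))
            if ratios[m] < best_cost:
                best_cost, best_topic, best_T = ratios[m], i, order[:m + 1]
        z[best_T] = best_topic
        f[best_topic] = 0.0                      # the chosen topic is now open for this document
        unmarked[best_T] = False
    return z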

For particular instances of UFL, this greedy strategy is known to produce a provably approximate solution. For instance, if the distances between clients and facilities form a metric, then the greedy algorithm achieves an approximation factor of at most 1.861 (Jain et al., 2003). Unfortunately, the KL-divergence is not a metric, and therefore these guarantees do not directly carry over to our case. Future work will consider how to obtain explicit bounds on the performance of our proposed word assignment method.

3.3. Incremental Topic Refinement

Unlike traditional clustering problems, topic modeling is hierarchical: we have both word-level assignments and "mini-topics" (formed by word tokens in the same document that are assigned to the same topic). Explicitly refining the mini-topics should help achieve better word co-assignment within the same document. Inspired by local search techniques in the clustering literature (Dhillon et al., 2002), we take a similar approach here. However, traditional approaches (Dhillon & Guan, 2003) do not directly apply in our setting; we therefore adapt existing local search techniques from clustering to the topic modeling problem.

More specifically, we consider an incremental topic refinement scheme that works as follows. For a given document, we consider swapping all word tokens assigned to the same topic within that document to another topic, and we compute the change in the objective function that would occur if we both updated the topic assignments for those tokens and then updated the resulting topic vectors. Specifically, for document j and its mini-topic S formed by its word tokens assigned to topic i, the change in the objective function when moving S to another topic i′ can be computed as

    ∆(S, i, i′) = −(n^i_{··} − n^i_{j·}) φ(ψ_i^−) − (n^{i′}_{··} + n^i_{j·}) φ(ψ_{i′}^+) + n^i_{··} φ(ψ_i) + n^{i′}_{··} φ(ψ_{i′}) − λ I[i′ ∈ T_j],

where n^i_{j·} is the number of tokens in document j assigned to topic i, n^i_{··} is the total number of tokens assigned to topic i, ψ_i^− and ψ_{i′}^+ are the correspondingly updated topics, T_j is the set of all topics used in document j, and φ(ψ_i) = ∑_w ψ_{iw} log ψ_{iw}.

We accept the move if min_{i′≠i} ∆(S, i, i′) < 0 and update the topics ψ and assignments Z accordingly. We then continue to the next mini-topic, hence the term "incremental". Note that here we accept every move that improves the objective function instead of only the single best move, as in (Dhillon & Guan, 2003). Since ψ and Z are updated in every objective-decreasing move, we randomly permute the processing order of the documents in each iteration; this usually helps to obtain better results in practice. See Algorithm 3 for details.

Algorithm 3 Incremental Topic Refinements for Z
Input: Words W, number of topics K, topic penalty λ, assignments Z, topics ψ
Randomly permute the documents.
for every document j do
  for each mini-topic S, where z_js = i for all s ∈ S for some topic i do
    for every other topic i′ ≠ i do
      Compute ∆(S, i, i′), the change in the objective function when re-assigning z_js = i′ for all s ∈ S.
    end for
    Let i* = argmin_{i′} ∆(S, i, i′).
    Reassign the tokens in S to i* if this yields a smaller objective.
    Update the topics ψ and assignments Z.
  end for
end for
Output: Assignments Z and topics ψ.

At first glance, this incremental topic refinement strategy may appear computationally expensive. However, computing the change in objective function ∆(S, i, i′) can be performed in O(|S|) time if the topics are maintained as count matrices: only the counts involving the words in the mini-topic and the total counts are affected. Since we compute the change across all topics and across all mini-topics S, the total running time of the incremental topic refinement is O(NK), as in the basic batch algorithm.
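To make the refinement step concrete, here is a minimal sketch of the ∆(S, i, i′) computation from topic count matrices (assuming numpy; variable names are ours). For clarity it recomputes φ over the full vocabulary, whereas the O(|S|)-time variant described above touches only the entries affected by S:

import numpy as np

def phi(counts, eps=1e-12):
    """phi(psi_i) = sum_w psi_iw log psi_iw, with psi_i the normalized count vector of topic i."""
    total = counts.sum()
    if total == 0.0:
        return 0.0
    p = counts / total
    return float(np.sum(p * np.log(p + eps)))

def delta(S_counts, i, ip, topic_counts, topics_in_doc, lam):
    """Delta(S, i, i'): change in objective (3) when moving mini-topic S (a length-D vector of
    word counts currently assigned to topic i in document j) to topic i' (denoted ip)."""
    n_S = S_counts.sum()
    n_i = topic_counts[i].sum()
    n_ip = topic_counts[ip].sum()
    return (-(n_i - n_S) * phi(topic_counts[i] - S_counts)
            - (n_ip + n_S) * phi(topic_counts[ip] + S_counts)
            + n_i * phi(topic_counts[i])
            + n_ip * phi(topic_counts[ip])
            - lam * (ip in topics_in_doc))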

4. Experiments

In this section, we empirically highlight the benefits of the algorithms proposed in Section 3.

Methodology. Due to a lack of ground-truth data for topic modeling, we follow other work, such as (Arora et al., 2013), and benchmark on synthetic data. We train all algorithms on the following synthetic data sets: (A) documents sampled from an LDA model with α = 0.04, β = 0.05, 20 topics, and vocabulary size 2000, where each document has length 150; and (B) documents sampled from an LDA model with α = 0.02, β = 0.01, 50 topics, and vocabulary size 3000, where each document has length 200.

4.1. Synthetic Documents

We compare three versions of our algorithms—Basic Batch (Basic), Improved Word Assignment (Word), and Improved Word with Topic Refinement (Word+Refine)—to the collapsed Gibbs sampler (CGS) (Griffiths & Steyvers, 2004), the standard variational inference algorithm (VB) (Blei et al., 2003), and the recently proposed Anchor method (Arora et al., 2013) for LDA.
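For reference, synthetic corpora of this kind can be generated directly from the LDA process recalled in Section 2; a minimal sketch (assuming numpy; the constants mirror dataset (A) and the 5000-document setting used below):

import numpy as np

def sample_lda_corpus(M, doc_len, K, D, alpha, beta, seed=0):
    """Sample M documents of length doc_len from the LDA generative model."""
    rng = np.random.default_rng(seed)
    psi = rng.dirichlet(np.full(D, beta), size=K)         # topic-word distributions
    theta = rng.dirichlet(np.full(K, alpha), size=M)      # per-document topic distributions
    docs, Z = [], []
    for j in range(M):
        z_j = rng.choice(K, size=doc_len, p=theta[j])                # a topic for each token
        w_j = np.array([rng.choice(D, p=psi[z]) for z in z_j])       # a word for each token
        docs.append(w_j)
        Z.append(z_j)
    return docs, Z, psi

# Dataset (A): 20 topics, vocabulary size 2000, alpha = 0.04, beta = 0.05, documents of length 150.
docs, Z_true, psi_true = sample_lda_corpus(M=5000, doc_len=150, K=20, D=2000, alpha=0.04, beta=0.05)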

For the collapsed Gibbs sampler, we collect 10 samples with 30 iterations of thinning after 3000 burn-in iterations. The variational inference runs for 100 iterations. The Word algorithm replaces the basic word assignment with the improved word assignment step within the batch algorithm, and Word+Refine further alternates between batch and incremental topic refinement steps. Word and Word+Refine are run for 20 and 10 iterations, respectively. For Basic, Word, and Word+Refine, we run experiments with λ ∈ {6, 7, 8, 9, 10, 11, 12}, and the best results are presented unless stated otherwise. In contrast, the true α, β parameters are provided as input to the LDA algorithms whenever applicable. We note that this setup heavily handicaps our methods, since the LDA algorithms are designed specifically for data from the LDA model.

Table 1. Optimized combinatorial topic modeling objective function values for different algorithms with λ = 10.

objective value   SynthA        SynthB
Basic             5074939.616   5453889.128
Word              4055091.759   3790071.752
Word+Refine       3975536.098   3609980.107

Objective optimization. Table 1 shows the optimized objective function values for all three proposed algorithms. The Word algorithm significantly reduces the objective value compared with the Basic algorithm, and the Word+Refine algorithm reduces it further. As pointed out in (Yen et al., 2015) in the context of other SVA models, the Basic algorithm is very sensitive to initialization; this is not the case for the Word and Word+Refine algorithms, which are quite robust to initialization. Judging from the objective values alone, the improvement from Word to Word+Refine seems marginal, but we show in the following that incorporating the topic refinement is crucial for learning good topic models.

Assignment accuracy. Both the Gibbs sampler and our algorithms provide word-level topic assignments, so we can compare the training accuracy of these assignments, which is shown in Figure 2. The result of the Gibbs sampler is given by the highest-scoring of the selected samples. Accuracy is reported in terms of the normalized mutual information (NMI) score and the adjusted Rand index (ARand), both of which lie in [0, 1] and are standard evaluation metrics for clustering problems.

Figure 2. The NMI scores and adjusted Rand index for word assignments of our algorithms on both synthetic datasets with 5000 documents (top: SynthA, bottom: SynthB), as a function of λ. The dashed lines represent the results of the Gibbs sampler (best viewed in color).

The Basic algorithm performs poorly despite its similarity to the Gibbs sampler. Unlike Basic, the Word algorithm greatly boosts the assignment accuracy, which shows that the algorithm is already doing something reasonable. With further help from topic refinement, we match or marginally exceed the performance of the Gibbs sampler. From the plots, we can also see that the performance is comparable to the Gibbs sampler for a wide range of λ values.

Topic reconstruction error. Next we look at the reconstruction error between the true word-topic distributions and the learned distributions. In particular, given a learned topic matrix ψ̂ and the true matrix ψ, we use the Hungarian algorithm (Kuhn, 1955) to align topics, and then evaluate the ℓ1 distance between each pair of matched topics. Figure 3 presents the mean reconstruction error per topic of the different learning algorithms for varying numbers of documents. As a baseline, we also include results from the k-means algorithm with KL-divergence (Banerjee et al., 2005), where each document is assigned to a single topic. Among the three proposed algorithms, and similar to the situation above, the Basic algorithm performs the worst in all data settings, even worse than the k-means baseline. The topic refinement step provides a significant improvement, reducing the ℓ1 error by at least 60% relative to using the Word algorithm alone. The Gibbs sampler has the lowest ℓ1 error on the smaller corpora, with Word+Refine and Anchor coming next. Although it may be hard to read from the plots, Word+Refine achieves a slightly better reconstruction performance, with an average ℓ1 error of 0.111 (standard deviation 0.024) across all data settings, while Anchor is at 0.128 (standard deviation 0.023).
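The reconstruction error above can be computed by matching learned and true topics with the Hungarian algorithm; a minimal sketch (assuming numpy and scipy; the function name is ours):

import numpy as np
from scipy.optimize import linear_sum_assignment

def topic_l1_error(psi_hat, psi_true):
    """Mean per-topic l1 distance after aligning learned topics to true topics."""
    # cost[a, b] = l1 distance between learned topic a and true topic b
    cost = np.abs(psi_hat[:, None, :] - psi_true[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)              # Hungarian matching (Kuhn, 1955)
    return float(cost[rows, cols].mean())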

Figure 3. Topic reconstruction ℓ1 errors of different algorithms for learning LDA models, as a function of the number of documents (panels: SynthA and SynthB; methods: CGS, VB, Anchor, Basic, Word, Word+Refine, and k-means; best viewed in color).

However, for the larger corpora the sampler needs to run much longer to reach a lower ℓ1 error, and within 3000 iterations it cannot compete with Word+Refine and Anchor. We again emphasize that both the Gibbs sampler and the variational algorithm are given the true parameters as input. As observed above, the Gibbs sampler can easily become trapped in a local minimum and needs many iterations on large data sets, as can be seen from Figure 4. Since our algorithm outputs Z, we can use this assignment to initialize the sampler. In Figure 4, we also show the evolution of the topic reconstruction ℓ1 error when the sampler is initialized with the assignment produced by 3 iterations of Word+Refine, for varying values of λ. With these semi-optimized initializations, we observe more than a 5-fold speed-up compared to random initialization. We hypothesize that the chain would converge even faster if the Word+Refine optimization step were applied whenever the chain reaches a plateau.

Figure 4. The evolution of the topic reconstruction ℓ1 error of the Gibbs sampler under different initializations: "Random" means random initialization, and "lambda=6" means initializing with the assignment learned by the Word+Refine algorithm with λ = 6 (best viewed in color).

4.2. Real Documents

We consider two real-world data sets with different properties: the KOS blog entries (2445 documents, vocabulary size 1600) and a subset of the New York Times articles (10k documents, vocabulary size 6000). To compare with the Gibbs sampler, we compute the log-likelihood of the LDA model (Smola & Narayanamurthy, 2010) during training, which is the standard measure. To make the comparison fair, we tune λ so that the resulting number of topics per document is comparable to that of the sampler. Figure 5 shows the evolution of the log-likelihood of the trained models. Similar to what we observed earlier, the proposed algorithm quickly reaches a state that the sampler requires thousands of passes over the data set to reach. Again, the semi-optimized assignments could be used to initialize the sampler and speed up its convergence, which we leave for future research.

Figure 5. Total log-likelihood of the trained LDA model on the KOS (K = 50 topics) and NYTimes (K = 100 topics) datasets (best viewed in color).

5. Conclusions

The main goal of this paper has been to lay the groundwork for a combinatorial optimization view of topic modeling as an alternative to the standard probabilistic framework. Small-variance asymptotics provides a natural way to obtain an underlying objective function, using the k-means connection to Gaussian mixtures as an analogy. We saw that the basic batch algorithm, of the kind often utilized by researchers of small-variance techniques, performs poorly when compared quantitatively to probabilistic approaches. However, using ideas from facility location and local refinement, we designed an algorithm that is efficient, robust to initialization and parameter selection, and compares favorably to probabilistic methods. In particular, we can match or outperform the collapsed Gibbs sampler, the best competing method in our experiments, even when the sampler is given the true hyperparameters and is run for 3000 iterations.

We hope that this work inspires further progress in combinatorial approaches to topic modeling. Given the theoretical guarantees available for SVA-related methods such as k-means++, one option is to explore provable performance bounds. Another promising direction is to exploit the fact that our algorithms can easily be parallelized in order to design state-of-the-art topic modeling software for large-scale applications; our contention is that these methods can compete with or outperform the most scalable topic modeling algorithms, such as (Ahmed et al., 2012) or online variational methods (Hoffman et al., 2010).

References

Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 123–132. ACM, 2012.

Anandkumar, A., Liu, Y., Hsu, D. J., Foster, D. P., and Kakade, S. M. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), pp. 917–925, 2012.

Arora, S., Ge, R., and Moitra, A. Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 1–10. IEEE, 2012.

Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning (ICML), 2013.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

Bansal, T., Bhattacharyya, C., and Kannan, R. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems (NIPS), pp. 1997–2005, 2014.

Blei, D. M. and Lafferty, J. D. Correlated topic models. In Advances in Neural Information Processing Systems (NIPS), 2006.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Blei, D. M., Jordan, M. I., Griffiths, T. L., and Tenenbaum, J. B. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems (NIPS), 2004.

Broderick, T., Kulis, B., and Jordan, M. I. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In International Conference on Machine Learning (ICML), 2013.

Campbell, T., Liu, M., Kulis, B., How, J., and Carin, L. Dynamic clustering via asymptotics of the dependent Dirichlet process. In Advances in Neural Information Processing Systems (NIPS), 2013.

Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

Dhillon, I. S. and Guan, Y. Information theoretic clustering of sparse co-occurrence data. In Proc. IEEE International Conference on Data Mining (ICDM), 2003.

Dhillon, I. S., Guan, Y., and Kogan, J. Iterative clustering of high dimensional text data augmented by local search. In Proc. IEEE International Conference on Data Mining (ICDM), 2002.

Griffiths, T. L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

Hoffman, M. D., Blei, D. M., and Bach, F. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2010.

Hofmann, T. Probabilistic latent semantic indexing. In Proc. Twenty-Second International SIGIR Conference, 1999.

Huggins, J. H., Narasimhan, K., Saeedi, A., and Mansinghka, V. K. JUMP-means: Small-variance asymptotics for Markov jump processes. In International Conference on Machine Learning (ICML), 2015.

Jain, K., Mahdian, M., Markakis, E., Saberi, A., and Vazirani, V. V. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM, 50(6):795–824, 2003.

Jiang, K., Kulis, B., and Jordan, M. I. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems (NIPS), 2012.

Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

Kulis, B. and Jordan, M. I. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning (ICML), 2012.

Lee, J. and Choi, S. Bayesian hierarchical clustering with exponential family: Small-variance asymptotics and reducibility. In Artificial Intelligence and Statistics (AISTATS), 2015.

Li, A. Q., Ahmed, A., Ravi, S., and Smola, A. J. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 891–900. ACM, 2014.

Podosinnikova, A., Bach, F., and Lacoste-Julien, S. Rethinking LDA: Moment matching for discrete ICA. In Advances in Neural Information Processing Systems (NIPS), pp. 514–522, 2015.

Roweis, S. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems (NIPS), 1997.

Roychowdhury, A., Jiang, K., and Kulis, B. Small-variance asymptotics for hidden Markov models. In Advances in Neural Information Processing Systems (NIPS), 2013.

Samdani, R., Chang, K.-W., and Roth, D. A discriminative latent variable model for online clustering. In International Conference on Machine Learning (ICML), 2014.

Smola, A. and Narayanamurthy, S. An architecture for parallel topic models. In International Conference on Very Large Data Bases (VLDB), 2010.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association (JASA), 101(476):1566–1581, 2006a.

Teh, Y. W., Newman, D., and Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2006b.

Tong, S. and Koller, D. Restricted Bayes optimal classifiers. In Proc. 17th AAAI Conference on Artificial Intelligence (AAAI), 2000.

Wang, X. and Grimson, E. Spatial latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2007.

Wang, Y. and Zhu, J. Small-variance asymptotics for Dirichlet process mixtures of SVMs. In Proc. Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.

Wang, Y. and Zhu, J. DP-space: Bayesian nonparametric subspace clustering with small-variance asymptotics. In International Conference on Machine Learning (ICML), 2015.

Yen, I. E. H., Lin, X., Zhang, K., Ravikumar, P., and Dhillon, I. S. A convex exemplar-based approach to MAD-Bayes Dirichlet process mixture models. In International Conference on Machine Learning (ICML), 2015.