Multilevel Belief Propagation for Fast Inference on Markov Random Fields

Liang Xiong, Fei Wang, Changshui Zhang
Department of Automation, Tsinghua University, Beijing, China.
{xiongl, feiwang03}@mails.tsinghua.edu.cn, [email protected]

Abstract

Graph-based inference plays an important role in many mining and learning tasks. Among all the solvers for this problem, belief propagation (BP) provides a general and efficient way to derive approximate solutions. However, for large-scale graphs the computational cost of BP is still demanding. In this paper, we propose a multilevel algorithm to accelerate belief propagation on Markov Random Fields (MRF). First, we coarsen the original graph to get a smaller one. Then, BP is applied on the new graph to get a coarse result. Finally, the coarse solution is efficiently refined back to derive the original solution. Unlike traditional multiresolution approaches, our method features adaptive coarsening and efficient refinement. The above process can be applied recursively to reduce the computational cost remarkably. We theoretically justify the feasibility of our method on Gaussian MRFs, and empirically show that it is also effective on discrete MRFs. The effectiveness of our method is verified in experiments on various inference tasks.

1 Introduction

The recent years have witnessed a surge of interest in graph-based inference in the fields of data mining and machine learning. The basic setting of graph-based inference is: given a graph in which nodes represent random variables and edges represent the statistical dependencies between the nodes, we want to know the most probable configuration of the states of these variables provided some observations. Graph-based inference provides a general way to accomplish many learning and mining tasks, such as classification, regression and clustering. Its usage arises in a wide variety of problems. For example, random walks on graphs are applied to do classification [25] and measure similarity (e.g. [16]); Gaussian Markov random fields (Gaussian MRF) serve as the basis for several semi-supervised learning algorithms ([30, 18]); in computer vision, MRF is usually used as the underlying model to facilitate image recovery [9] and segmentation (e.g. [27]).

Useful as it is, acquisition of the exact solution to the general inference problem is notoriously hard. Therefore, several approaches have been developed to find approximate solutions, such as Monte Carlo methods [9], variational methods (mean fields [10]), graph cuts [3], and belief propagation, which is the focus of this paper.

Belief propagation (BP) solves inference problems by passing local messages. It was originally proposed for inference on singly connected Bayesian networks, where exact solutions are guaranteed [13]. In spite of this restriction, people have directly applied it to other graphical models (e.g. MRF and conditional random fields (CRF)) and to graphs with loops (loopy BP) and achieved many empirical successes [22]. Recently, several theoretical analyses ([21], [23], [12]) of the convergence and optimality of the solutions of loopy BP have been made, and the reasons for loopy BP's effectiveness are gradually being revealed. However, although considered an efficient inference engine, belief propagation is still often too slow to be practical on large-scale graphs, which are common in data mining and computer vision.

To address this problem, we propose a multilevel strategy to accelerate belief propagation. Our basic idea is intuitive: a large graph can be approximated by a small graph whose nodes are aggregations of the nodes in the original graph. One source of this idea is the way people perceive the world. According to the Gestalt laws in psychology, we perceive objects as well-organized, integral patterns, thus the grouping of raw data is very important. Studies of the human visual system also show that hierarchical processing seems to play a critical role in perception [7]. In our algorithm, a graph is first recursively coarsened to reduce the problem scale. Then we run BP on the new problem to obtain a coarse result. Finally, this result is refined back level by level to get the solution of the original problem. Both obtaining and refining the coarse result can be done rapidly, so the overall cost of inference is decreased. Unlike traditional wavelet-based multi-resolution approaches [24, 8], this algorithm is derived under the principles of adaptive aggregation and energy-preserving approximation.

These advantages enable us to significantly reduce the running time of belief propagation without compromising the quality of the solutions. A rigorous justification of our method is provided on Gaussian MRFs. We show that our method is closely related to the algebraic multigrid technique [17], which aims at solving large-scale linear systems efficiently. Thus, we name it multigrid belief propagation (MGBP). Experiments show that our approach is both efficient and effective for inference tasks.

The rest of this paper is organized as follows. In section 2 we briefly introduce graph-based inference and belief propagation. Then the algorithm details are described in section 3. We justify our method in section 4. In sections 5 and 6 we discuss our method and introduce related work. Experimental results are presented in section 7, and finally we draw our conclusions.

2 Background and Notations

2.1 Probabilistic Graphical Models

Probabilities play a central role in modern machine learning and data mining. Many problems can be formulated as the inference of the states of random variables. Although these inferences can be done in a purely algebraic way, it is usually highly advantageous to use graphs as diagrammatic representations to facilitate analysis and manipulation. Concretely, we can represent a probabilistic model by a graph G = (V, E), where each vertex v_i in V is associated with a random variable x_i and the edges in E represent the statistical dependencies between these variables. These graph representations are called probabilistic graphical models [2].

In data mining, machine learning and other related fields, the Markov random field (MRF [5]) is a commonly used graphical model. It is an undirected graphical model in which nodes represent variables and arcs represent compatibility relations between variables. We will focus on MRFs in this paper. Without much loss of generality, we only consider pairwise MRFs, and assume that every node has an observation node attached. Denote the nodes as X = [x_1, · · · , x_n]^T, the observations as Y = [y_1, · · · , y_n]^T, and the compatibility functions as ψ_ij(·, ·). According to the Hammersley-Clifford theorem [13], the joint distribution can be written as

$$P(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{i,j} \psi_{ij}(x_i, x_j) \prod_i \psi_{ii}(x_i, y_i) \qquad (1)$$

where Z is called the partition function which ensures that the joint distribution is properly normalized. Usually, the compatibility ψij is parameterized by the edge weight wij , which measures the similarity between xi and xj . If two nodes are similar, then their states should also be close to achieve a high compatibility in the network.

Figure 1. A lattice Markov random field for image modeling. Filled circles are observations, while others are their underlying nodes. The arcs represent spatial compatibility constraints between nodes.

The usage of MRF can be found in many problems in data mining and machine learning. For example, it can be used to describe the relations between different objects in probabilistic modeling (e.g. [6]). When constructed upon data samples, it provides a discrete characterization of the data's manifold structure, i.e., the prior distribution [1]. In computer vision, MRF is a natural model for the spatial constraints between pixels. Figure 1 illustrates a lattice MRF model that is commonly used in image processing.

2.2 Inference and Belief Propagation

There are mainly two types of inference on MRFs. The first one aims at achieving the maximum marginal probability at each node (MM assignment), i.e., x_i is determined by [22]

$$x_i = \arg\max_{x_i^*} P(x_i = x_i^* \mid \mathbf{Y}). \qquad (2)$$

The other derives the maximum a posteriori probability on the graph (MAP assignment), i.e., the configuration of X is determined by [22]

$$\mathbf{X} = \arg\max_{\mathbf{X}^*} P(\mathbf{X} = \mathbf{X}^* \mid \mathbf{Y}). \qquad (3)$$

These inference tasks on general MRFs are difficult mainly due to the calculation of the partition function Z. Therefore, several algorithms have been developed to provide feasible approximate solutions, one of which is belief propagation.

Belief propagation (BP) [13] utilizes the conditional independence properties in the network to derive efficient solutions. Corresponding to the MM and MAP inferences, there are two types of BP [22]. One is belief update (BU), a.k.a. the sum-product algorithm, for MM inference, and the other is belief revision (BR), a.k.a. the max-product algorithm, for MAP inference. The BP algorithm can be summarized as: 1) nodes deliver their distribution information (messages) to others through (and affected by) edges; 2) the distribution (belief) at a node is formed by combining the messages it receives. A detailed introduction to BP can be found in [22]. The BU rules are

$$m_{ij}(x_j) \leftarrow \alpha \sum_{x_i} \psi_{ij}(x_i, x_j)\, m_{ii}(x_i) \prod_{x_k \in N(x_i) \setminus x_j} m_{ki}(x_i) \qquad (4)$$

$$b_i(x_i) \leftarrow \beta\, m_{ii}(x_i) \prod_{x_k \in N(x_i)} m_{ki}(x_i) \qquad (5)$$

where m_ij is the message from x_i to x_j, m_ii is the message from y_i to x_i, and b_i is the belief at x_i. α and β are normalization constants, and N(x_i)\x_j means all the neighboring nodes of x_i except x_j. The BR rules can be obtained by replacing the sum over x_i with max_{x_i} in (4). A diagrammatic representation of BP is shown in figure 2.

Figure 2. A diagrammatic representation of belief propagation. m_ij is the message from x_i to x_j. (a) The calculation of the message m_ij. (b) The calculation of the belief at x_i.

The computational load of belief propagation is concentrated on calculating the messages, which has complexity O(ek^2 T), where e is the number of edges, k is the number of possible states (k = 1 for Gaussian MRFs), and T is the number of iterations, which usually equals the graph's diameter. It can be seen that belief propagation could be rather slow on densely connected graphs or graphs with large diameters. To decrease this cost, we could cut down the scale of the graph to reduce e and T, and further reduce T by providing a good starting point for belief propagation. This is the motivation of our multilevel approach.
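For concreteness, the following is a minimal Python/NumPy sketch of synchronous belief update (sum-product) on a discrete pairwise MRF. It is an illustration only, not the Matlab implementation used in the experiments below; the function and argument names (sum_product_bp, psi, phi, n_iters) are ours. Each sweep costs O(ek^2), in line with the complexity discussion above.

```python
import numpy as np

def sum_product_bp(edges, psi, phi, n_nodes, n_states, n_iters=50):
    """Synchronous sum-product BP on a discrete pairwise MRF.

    edges : list of undirected edges (i, j), each listed once
    psi   : dict mapping (i, j) -> (n_states, n_states) compatibility matrix psi_ij
    phi   : (n_nodes, n_states) array of local evidence, i.e. m_ii(x_i) = psi_ii(x_i, y_i)
    Returns an (n_nodes, n_states) array of normalized beliefs b_i(x_i).
    """
    # msgs[(i, j)] is the message from node i to node j, initialized uniformly
    msgs = {}
    neighbors = {v: [] for v in range(n_nodes)}
    for i, j in edges:
        msgs[(i, j)] = np.full(n_states, 1.0 / n_states)
        msgs[(j, i)] = np.full(n_states, 1.0 / n_states)
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(n_iters):                       # each sweep costs O(e k^2)
        new_msgs = {}
        for (i, j) in msgs:                        # update m_ij, Eq. (4)
            pot = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            prod = phi[i].copy()                   # m_ii(x_i)
            for k in neighbors[i]:
                if k != j:
                    prod *= msgs[(k, i)]           # product over N(x_i) \ x_j
            m = pot.T @ prod                       # sum over x_i
            new_msgs[(i, j)] = m / m.sum()         # alpha: normalization
        msgs = new_msgs

    beliefs = np.empty((n_nodes, n_states))
    for i in range(n_nodes):                       # combine messages, Eq. (5)
        b = phi[i].copy()
        for k in neighbors[i]:
            b *= msgs[(k, i)]
        beliefs[i] = b / b.sum()                   # beta: normalization
    return beliefs
```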

3 Multigrid Belief Propagation

In this section we introduce our multigrid belief propagation algorithm. We denote graphs as G, node sets as V, the cardinality of V as |V| = n, edge sets as E, and beliefs as B^T = [b_1, b_2, · · · , b_n], where b_i is the belief at node x_i. Superscripts are used to indicate the coarsening levels. This algorithm first recursively coarsens the original graph G^0 to get smaller graphs G^l. Then it solves the problem on G^l to derive coarse beliefs. Finally, it uses the coarse beliefs to initialize the belief propagation on the original graph G^0 so that convergence can be achieved more rapidly. The above procedure is summarized in Algorithm 1.

Algorithm 1: Multigrid Belief Propagation (MGBP)
Input: the original graph G^0 = {V^0, E^0}; the number of levels K.
Coarsening: for l = 1, 2, · · · , K
  1. Select the representative node set V^l from V^{l-1}.
  2. Calculate E^l and construct G^l = {V^l, E^l}.
end for
Initial Solution: run BP to derive the beliefs B^K for problem G^K.
Refining: for l = K − 1, · · · , 0
  1. Calculate the starting beliefs B̂^l from B^{l+1}.
  2. Use B̂^l to initialize the BP on G^l.
  3. Run BP on G^l to derive the beliefs B^l.
end for
Output: the beliefs B^0 for G^0.

There are two key points in Algorithm 1. First, the selected coarse nodes should be as representative as possible. We use an algebraic multigrid (AMG)-like coarsening method [17] to achieve this goal. Second, given the results on G^{l+1}, belief propagation on G^l should be efficiently initialized to a starting point that is close to the true solution. This is done by carefully constructing the edges E^{l+1} for G^{l+1} and initializing the belief propagation on G^l using interpolation. In the rest of this section, we will describe the details of this algorithm. Here we assume that G^0's edge weight matrix W^0 is given, whose (i,j)-th entry w_ij is the edge weight between nodes x_i and x_j (a brief description of constructing W^0 can be found in section 7).

3.1 Selecting Coarse Nodes

In this section, we show how to select a representative node set V^{l+1} from V^l. We need to split V^l into a coarse node set V^{l+1} and a fine node set F^l subject to V^{l+1} ∪ F^l = V^l and V^{l+1} ∩ F^l = ∅. Coarse nodes in V^{l+1} form the vertices of G^{l+1}, while fine nodes will not be present in the new graph. To ensure that the coarse nodes in V^{l+1} are indeed representative, the fine nodes in F^l should be closely related to the coarse nodes. Hence, we constrain V^{l+1} and F^l to satisfy the condition that each fine node is strongly influenced by coarse nodes, where "strongly influences" is defined as follows.

Definition 1 (strongly influences). A node x_i strongly influences node x_j if

$$w_{ij} \ge \beta \max_k w_{kj}, \qquad (6)$$

where β ∈ (0, 1) is a control parameter and w_ij is the edge weight between x_i and x_j.

The above settings are commonly used in fast, multiscale algebraic multigrid techniques [17]. Briefly, we iteratively select nodes that strongly influence others into V^{l+1}, and put the influenced ones into F^l. This process can be completed in time linear in the number of nodes [15]. The value of β in (6) is usually chosen within [0.25, 0.75]. If β is large, the quality of coarsening will be higher but more nodes will be selected into V^{l+1}. Typically, fewer than half of the nodes in V^l are selected into V^{l+1}. Therefore, the scale of the graph drops exponentially under recursive coarsening. A detailed description of this procedure can be found in [17]. An illustration of this coarsening process is shown in figure 3.

Figure 3. The coarsening process. Filled and hollow circles belong to different classes respectively. Darker lines indicate stronger connections. G^0 is the original graph and G^1 is the coarsened graph.
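The following Python sketch illustrates one greedy way to realize this selection rule: nodes are visited in order of decreasing total weight, and every node that is not yet strongly influenced, in the sense of Eq. (6), by an already chosen coarse node becomes coarse itself. It is a simplified rendering for illustration, not the exact procedure of [15, 17]; the name select_coarse_nodes is ours.

```python
import numpy as np

def select_coarse_nodes(W, beta=0.5):
    """Greedy AMG-style coarse-node selection.

    W    : (n, n) symmetric edge-weight matrix of the current level (zero diagonal)
    beta : strong-influence threshold of Eq. (6), typically in [0.25, 0.75]
    Returns the indices of the coarse nodes forming V^{l+1}.
    """
    n = W.shape[0]
    thresh = beta * W.max(axis=0)              # beta * max_k w_kj for every node j
    strong = W >= thresh[None, :]              # strong[i, j]: x_i strongly influences x_j
    np.fill_diagonal(strong, False)

    coarse = []
    covered = np.zeros(n, dtype=bool)
    for i in np.argsort(-W.sum(axis=1)):       # visit heavier nodes first
        if not covered[i]:
            coarse.append(i)                   # i becomes a coarse node
            covered[i] = True
            covered[strong[i]] = True          # nodes it strongly influences become fine
    return np.array(sorted(coarse))
```

Every fine node is then strongly influenced by at least one coarse node, which is what the interpolation rule in section 3.2 relies on.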

3.2 Belief Interpolation based Hierarchy

Before we go further and calculate the new edge weights, we need to establish a rule to relate the nodes from different levels to form a hierarchical structure. In this paper we choose the linear interpolation strategy, i.e., the beliefs for nodes in V^l can be estimated from those of V^{l+1} by

$$b_i^l = \sum_k p_{ik}^{l+1} b_k^{l+1}, \quad \text{i.e.,} \quad \hat{B}^l = P^{l+1} B^{l+1}, \qquad (7)$$

where k traverses all the coarse nodes in V^l. P^{l+1} is the n^l × n^{l+1} interpolation matrix and p_{ik}^{l+1} are its elements. For discrete MRFs, Eq. (7) interpolates the belief vectors; for Gaussian MRFs it interpolates the mean values. The value of P^{l+1} is calculated by the following rules. For nodes in V^{l+1}, their beliefs are kept unchanged after being interpolated. For nodes in F^l, their beliefs are interpolated from their neighboring coarse nodes, that is,

$$b_i^l = \Big( \sum_k w_{ik}^l b_k^{l+1} \Big) \Big/ \sum_k w_{ik}^l, \qquad (8)$$

which means that

$$p_{ik}^{l+1} = w_{ik}^l \Big/ \sum_k w_{ik}^l. \qquad (9)$$

3.3 Constructing New Edges

Having the new node set V^{l+1} and the interpolation-based hierarchy, we still need a new edge set E^{l+1} to construct G^{l+1}. Here we follow the iterated weighted aggregation (IWA) process to calculate the new edge weights. More concretely, the (k,l)-th entry of G^{l+1}'s edge weight matrix W^{l+1} is computed by

$$w_{kl}^{l+1} = \frac{1}{2} \sum_{i,j} p_{ik}^{l+1} w_{ij}^l p_{jl}^{l+1}, \qquad (10)$$

where p_{ij}^{l+1} is the (i,j)-th entry of P^{l+1}. (10) can be written in a more compact form as

$$W^{l+1} = \frac{1}{2} \left( P^{l+1} \right)^T W^l P^{l+1}. \qquad (11)$$

The meaning of (10) is clear: the connection between blocks is the aggregation of node connections weighted by their contributions in interpolation, as shown in figure 4.

Figure 4. Calculation of the edge weight between blocks. Nodes in the same box form a block. Filled circles are the selected coarse nodes (block representatives). The block connection is the aggregation of node connections weighted by their interpolation contributions.

Another issue that needs to be addressed is the impact of coarsening on the observations. After coarsening, some of the observation nodes are dropped, so the overall evidence on the coarse graph is "weakened". Assuming that evidence is provided uniformly on each observation node, we balance the drop of total evidence by strengthening the remaining evidence with the rate

$$r = |O^l| / |O^{l+1}|, \qquad (12)$$

where O^l is the set of observed nodes in V^l, and |O^l| is its cardinality. In practice, this balance is achieved alternatively by weakening the edge weights by 1/r. For semi-supervised learning tasks, we often force all the labeled nodes to be selected into the higher-level graphs, so the evidence need not be adjusted.
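Eqs. (8)-(11) translate directly into a few lines of linear algebra. The sketch below is an illustrative Python/NumPy version that assumes a dense weight matrix small enough to manipulate directly (sparse matrices would be used in practice) and that every fine node has at least one coarse neighbor, as guaranteed by the coarsening; the names build_interpolation and coarsen_graph are ours.

```python
import numpy as np

def build_interpolation(W, coarse):
    """Interpolation matrix P^{l+1} of Eqs. (8)-(9).

    W      : (n, n) edge-weight matrix of level l (zero diagonal)
    coarse : indices of the coarse nodes V^{l+1}
    """
    n, m = W.shape[0], len(coarse)
    P = np.zeros((n, m))
    P[coarse, np.arange(m)] = 1.0                    # coarse nodes keep their beliefs
    fine = np.setdiff1d(np.arange(n), coarse)
    Wfc = W[np.ix_(fine, coarse)]                    # weights from fine nodes to coarse nodes
    P[fine] = Wfc / Wfc.sum(axis=1, keepdims=True)   # p_ik = w_ik / sum_k w_ik, Eq. (9)
    return P

def coarsen_graph(W, P):
    """IWA edge weights of Eq. (11): W^{l+1} = (1/2) P^T W^l P."""
    return 0.5 * P.T @ W @ P

# Coarse beliefs (or means) are then mapped back by Eq. (7): B_hat = P @ B_coarse.
```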

3.4 Initial Solution and Refinement

As stated before, to solve the whole problem we first need to obtain an initial solution as the "seed", and then refine it recursively to get the solution on the original graph. By recursively coarsening the original graph, we obtain a coarse graph G^K at level K. We run BP on G^K to derive the beliefs B^K for the nodes in V^K. Since G^K is small, this BP converges rapidly. Then, we refine the initial solution B^K back level by level to the original graph G^0.

The refinement from B^{l+1} to B^l is a 3-step process. First, we calculate the estimated beliefs B̂^l based on B^{l+1}. As established in section 3.2, this is done by interpolating B^{l+1} using Eq. (7), which can be completed very fast. We claim that this estimate is close to the true solution, which will be justified on Gaussian MRFs in section 4. Then, we use the estimated beliefs B̂^l to set the starting point for the belief propagation on G^l. This is done by initializing the messages on G^l according to B̂^l. Considering the calculation of messages in (4) and beliefs in (5), we propose to initialize the messages using (4) with the term m_ii(x_i) ∏_{x_k ∈ N(x_i)\x_j} m_ki(x_i) replaced by b̂_i. In practice, we find that this approximation works well. Finally, we run BP on G^l using the above initial messages. Since its starting point is close to the true solution, this BP is expected to converge rapidly. The above steps are applied recursively until the beliefs on G^0 are obtained.
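Putting the pieces together, the coarsen-solve-refine loop of Algorithm 1 has the following shape. This skeleton reuses select_coarse_nodes, build_interpolation and coarsen_graph from the sketches above and assumes two caller-supplied routines, run_bp (a BP solver that can be warm-started) and coarsen_evidence (which restricts the observations and applies the rescaling of Eq. (12)); it is a sketch of the control flow, not the authors' code.

```python
def mgbp(W0, phi0, K, run_bp, coarsen_evidence, n_refine_iters=3):
    """Multigrid BP skeleton (Algorithm 1): coarsen K times, solve, refine back."""
    Ws, Ps, phis = [W0], [], [phi0]
    for _ in range(K):                                   # coarsening
        coarse = select_coarse_nodes(Ws[-1])
        P = build_interpolation(Ws[-1], coarse)
        Ps.append(P)                                     # Ps[l] maps level l+1 to level l
        Ws.append(coarsen_graph(Ws[-1], P))
        phis.append(coarsen_evidence(phis[-1], coarse))  # restrict/rescale observations
    # initial solution: run BP to convergence on the small coarsest graph
    B = run_bp(Ws[-1], phis[-1], init_beliefs=None, n_iters=None)
    for l in range(K - 1, -1, -1):                       # refinement, level by level
        B_hat = Ps[l] @ B                                # interpolate, Eq. (7)
        B = run_bp(Ws[l], phis[l], init_beliefs=B_hat, n_iters=n_refine_iters)
    return B
```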

4 Theoretical Justification

In the previous section, we described our multigrid belief propagation algorithm in detail. In this section, we show that on Gaussian MRFs this algorithm preserves the original objective function and solves the inference problem in an algebraic multigrid way. Therefore, its performance is guaranteed.

4.1 Gaussian MRFs

We first introduce some background. A Gaussian MRF is a type of MRF whose joint distribution is Gaussian [23], i.e.

$$P(\mathbf{M}) = e^{-G}/Z, \qquad (13)$$

where M = [μ_1, · · · , μ_n]^T is the mean value of X, Z is the partition function, and G is the free energy [26]

$$G = \sum_{i,j} [\mu_i, \mu_j]\, C_{ij}\, [\mu_i, \mu_j]^T + \sum_i [\mu_i, y_i]\, C_{ii}\, [\mu_i, y_i]^T, \qquad (14)$$

where C_ij is the 2 × 2 potential matrix between x_i and x_j. On Gaussian MRFs, the MM assignment and the MAP assignment are both achieved by the minimization of G. A commonly applied energy function for Gaussian MRFs is the quadratic form

$$G = \sum_{i,j} w_{ij} (\mu_i - \mu_j)^2 + \gamma \sum_i (\mu_i - y_i)^2, \qquad (15)$$

where w_ij is the edge weight. Usually, w_ij indicates the similarity between x_i and x_j. Thus, this energy function forces that 1) closely connected nodes have similar states and 2) nodes' states are close to their observations. The first term can be considered as the smoothness constraint and the second one is the fitting quality, with γ > 0 being the control coefficient. Eq. (15) can be vectorized as

$$G = \mathbf{M}^T L \mathbf{M} + \gamma \, \| \mathbf{M} - \mathbf{Y} \|^2, \qquad (16)$$

where L = D − W is the combinatorial Laplacian [1] of the graph and D is a diagonal matrix with its i-th diagonal entry d_ii = Σ_j w_ij. The above settings can be found in various applications such as semi-supervised learning [30] and computer vision [8]. In the rest of this section, we will focus on this form of free energy. It has been proved that upon convergence the assignments obtained by Gaussian BP are exact on networks with arbitrary topology [23].
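As a small numerical check of the notation (our own illustration, with an arbitrary random graph and γ), the vectorized energy of Eq. (16) agrees with the pairwise form of Eq. (15) when the pairwise sum counts each edge once:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.5
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # symmetric weights
L = np.diag(W.sum(axis=1)) - W                                      # combinatorial Laplacian
M = rng.random(n)                                                   # candidate means mu_i
Y = rng.random(n)                                                   # observations y_i

# Eq. (15): sum over edges (the 1/2 compensates for counting ordered pairs twice)
smooth = 0.5 * sum(W[i, j] * (M[i] - M[j]) ** 2 for i in range(n) for j in range(n))
fit = gamma * np.sum((M - Y) ** 2)
# Eq. (16): vectorized form
assert np.isclose(smooth + fit, M @ L @ M + gamma * np.sum((M - Y) ** 2))
```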

4.2 Energy-Preserving Coarsening

Now we show that the way we coarsen graphs in section 3 actually preserves the free energy of Gaussian MRFs within the framework of algebraic multigrid (AMG) coarsening and interpolation-based refinement. After the AMG coarsening, we obtain the coarse node set V^{l+1}, which is believed to be representative of V^l. The energy of G^{l+1} with assignment M^{l+1} is then

$$G^{l+1}(\mathbf{M}^{l+1}) = \sum_{k,l \in V^{l+1}} w_{kl}^{l+1} \big(\mu_k^{l+1} - \mu_l^{l+1}\big)^2 + \gamma^{l+1} \sum_{k \in V^{l+1}} \big(\mu_k^{l+1} - y_k^{l+1}\big)^2, \qquad (17)$$

where w_{kl}^{l+1} is the edge weight on graph G^{l+1} and γ^{l+1} is the new control coefficient. The energy of G^l using the interpolated assignment M̂^l = P^{l+1} M^{l+1} is

$$G^{l}(\mathbf{M}^{l+1}) = \sum_{i,j \in V^{l}} w_{ij}^{l} \Big( \sum_{k \in V^{l+1}} p_{ik}^{l+1} \mu_k^{l+1} - \sum_{l \in V^{l+1}} p_{jl}^{l+1} \mu_l^{l+1} \Big)^2 + \gamma^{l} \sum_{i \in V^{l}} \Big( \sum_{k \in V^{l+1}} p_{ik}^{l+1} \mu_k^{l+1} - y_i^{l} \Big)^2. \qquad (18)$$

Ideally, the parameters in G^{l+1} should satisfy the condition G^{l+1}(M^{l+1}) = G^l(M^{l+1}), so that the quality of the coarse solution M^{l+1} in G^{l+1} will reflect the quality of its interpolated solution M̂^l in G^l. To achieve this, we consider the first and the second terms separately.

For the first terms (smoothness), we force them to be identical. It can be proved that this constraint leads to the following edge weights on G^{l+1}:

$$w_{kl}^{l+1} = \frac{1}{2} \sum_{i,j} w_{ij}^{l} \left( p_{jl}^{l+1} - p_{il}^{l+1} \right) \left( p_{ik}^{l+1} - p_{jk}^{l+1} \right). \qquad (19)$$

It can be shown that the IWA weights from Eq. (10) usually provide a good approximation to Eq. (19), according to [15] and our own experiments. Therefore, Eq. (10) can approximately keep the smoothness term in the energy function unchanged. Although Eq. (10) could be replaced by Eq. (19), we still prefer the IWA weights because they are more intuitive.

For the second terms (fitting quality), we preserve them by adjusting γ^{l+1}. Assuming that after interpolation the fitting errors are uniformly distributed over all the nodes that have observations, the total fitting error Σ_k (μ_k − y_k)^2 is proportional to the number of terms in the sum. Therefore, we can adjust the value of γ^{l+1} to compensate for the drop of the total error using

$$\gamma^{l+1} = \gamma^{l} \, |O^{l}| / |O^{l+1}|, \qquad (20)$$

where O^l is the set of observed nodes in V^l, as in Eq. (12). This compensation coincides with the balancing measure taken in section 3.3. Thus, both the smoothness and the fitting quality of G^l are preserved.

To sum up, the multigrid belief propagation algorithm is able to preserve the energy function of Gaussian MRFs between levels, so a low-energy assignment of G^{l+1} will also yield a good solution M̂^l in G^l. Therefore, we can be confident that the interpolated solution M̂^l is close to the true solution M^l.
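The exactness behind Eq. (19) can be verified numerically: whenever each row of P^{l+1} sums to one (as it does under Eq. (9)), the coarse smoothness term built from the weights of Eq. (19) equals the fine smoothness term of the interpolated assignment. The snippet below is our own sanity check, not code from the paper:

```python
import numpy as np

def energy_preserving_weights(W, P):
    """Coarse edge weights of Eq. (19)."""
    m = P.shape[1]
    Wc = np.zeros((m, m))
    for k in range(m):
        for l in range(m):
            dk = P[:, k][:, None] - P[:, k][None, :]     # p_ik - p_jk
            dl = P[:, l][None, :] - P[:, l][:, None]     # p_jl - p_il
            Wc[k, l] = 0.5 * np.sum(W * dk * dl)
    return Wc

def smoothness(W, mu):
    """Sum over edges of w_ij (mu_i - mu_j)^2."""
    return 0.5 * np.sum(W * (mu[:, None] - mu[None, :]) ** 2)

rng = np.random.default_rng(1)
n, m = 8, 3
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
P = rng.random((n, m)); P /= P.sum(axis=1, keepdims=True)    # rows sum to 1, as in Eq. (9)
mu_c = rng.random(m)                                         # an arbitrary coarse assignment
assert np.isclose(smoothness(energy_preserving_weights(W, P), mu_c),
                  smoothness(W, P @ mu_c))
```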

4.3 Relation with Algebraic Multigrid

Our approach on Gaussian MRFs is closely related to algebraic multigrid (AMG) [17], which aims at solving large-scale linear systems efficiently. Considering the minimization of the energy (16), setting ∂G/∂M = 0 we get

$$(L + \gamma I)\mathbf{M} = \gamma \mathbf{Y}. \qquad (21)$$

Supposing that the compensation (20) is accurate, it is easy to verify that after coarsening and interpolation, Eq. (21) becomes

$$\left( P^{1} \right)^T (L + \gamma I)\, P^{1} \mathbf{M} = \gamma \left( P^{1} \right)^T \mathbf{Y}. \qquad (22)$$

If we consider graph G^0 as an algebraic grid [17], then Eq. (21) is a linear system on this grid. Further, if we use the AMG technique to seek an efficient solution of Eq. (21) based on the Galerkin principle [17], then the equation system Eq. (22) is obtained. So our method actually reduces the problem scale in the same way as AMG does. Therefore, we call our method Multigrid Belief Propagation (MGBP).
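The correspondence with AMG can be made concrete: restricting the fine system of Eq. (21) with an interpolation matrix P^1 yields the Galerkin coarse system of Eq. (22), which can be solved directly and interpolated back. The sketch below is our own illustration for a problem small enough for dense direct solves, reusing build_interpolation from the earlier sketch:

```python
import numpy as np

def galerkin_coarse_solve(W, Y, gamma, coarse):
    """One Galerkin coarsening step for the Gaussian MRF system (L + gamma*I) M = gamma*Y."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    A = L + gamma * np.eye(n)              # fine-level operator of Eq. (21)
    P = build_interpolation(W, coarse)
    A_c = P.T @ A @ P                      # coarse operator of Eq. (22)
    b_c = gamma * (P.T @ Y)
    M_c = np.linalg.solve(A_c, b_c)        # exact solve on the small coarse grid
    return P @ M_c                         # interpolated estimate M_hat, Eq. (7)
```

In MGBP this interpolated estimate is used only as a starting point for a few more BP sweeps on the fine graph, whereas a pure AMG-style solver such as [20] would stop here; this is exactly the distinction discussed in the next section.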

5 Discussions

Recall that the cost of BP is O(ek^2 T). Our multilevel algorithm can significantly accelerate BP in three aspects. First, as the graph is coarsened level by level, its scale drops exponentially, so the number of iterations T needed for convergence is reduced. Second, T is further decreased by giving the BP runs good starting points. The situation for the number of edges e is a bit more complicated. On densely connected graphs, the reduction of the graph scale will also reduce e. However, on some sparsely connected graphs e may increase after coarsening, and during the recursive coarsening process e may first rise and then drop. In this case, the coarse level of MGBP should be selected to achieve a balance between speed and accuracy.

We have shown that on Gaussian MRFs MGBP actually accelerates BP in an AMG way. Their key difference is that AMG only solves Eq. (22) to derive the solution, whose quality depends on how well the coarse node set V^1 can represent V^0. Sometimes this solution is overly smooth. On the other hand, MGBP only uses this solution as a starting point to accelerate belief propagation. Thus the solution obtained by MGBP is accurate upon convergence. We can also make a trade-off between quality and speed by controlling the number of BP iterations. If we refine the beliefs using only interpolation, then MGBP is equivalent to the AMG solver on Gaussian MRFs, which is the essence of [20].

The implementation of MGBP on Gaussian MRFs is straightforward. When implementing it on discrete MRFs, the difference is that instead of manipulating the scalar mean values, we handle the belief vectors, which can also be done by matrix multiplication (for a detailed introduction to discrete BP, readers are referred to [22, 8]). We are not able to justify the validity of MGBP on discrete MRFs as we do on Gaussian MRFs. Yet, we believe that MGBP has a reasonable motivation, and in practice its performance is found to be promising in various inference tasks, as shown in section 7.

6 Related Works

Multilevel techniques have long been applied to accelerate learning on graphs [24]. Usually, pyramidally organized trees are constructed to develop efficient algorithms. MGBP also follows this paradigm. However, traditional approaches often coarsen a graph by wavelets (e.g. the Haar wavelet), as shown in figure 5. These methods can only be applied to regular graphs such as lattices. They ignore the characteristics of individual graphs and perform uniform coarsening, which often compromises the graphs' details. Most importantly, we do not really know the relation between the coarse results and the true solutions. On the other hand, MGBP can be applied to graphs with various structures. It coarsens graphs adaptively by aggregating closely connected nodes together. Finally, in MGBP the coarse graphs are designed to approximate the original graph, so fewer refining operations are required to derive the true solutions.

Figure 5. Coarsening using the Haar wavelet. New nodes are created to represent blocks in G^0. The coarsening compromises the boundary.

[8] proposed a multi-scale strategy to run belief propagation efficiently on images. The framework is similar: first coarsen and then refine. They coarsen the graphs using the Haar wavelet. After running BP on the coarse graph G^{l+1}, they use the messages in G^{l+1} to initialize the messages on G^l so that convergence can be accelerated. This approach is quite similar to MGBP. However, their method can only be applied to lattice MRFs such as images. Besides, they ignore the characteristics of each image and perform uniform coarsening. On the contrary, by adaptive coarsening MGBP can better approximate the original graphs, so fewer BP iterations are needed for refinement in MGBP than in their method.

The AMG technique has also been used in other problems. For image segmentation, [15] uses a quadratic energy function to indicate the quality of a solution, and then recursively coarsens the image lattices to form a small graph where salient segments are easy to detect. During coarsening the energy is approximately retained, so the quality of the coarse result is guaranteed. Moreover, they propose to modify the coarse graphs' edges according to block-wise similarity, which is based on features that are not available at the pixel level. This approach may help improve the quality of the energy function, and can also be applied in MGBP for similar tasks.

7 Experiments

In this section, we test MGBP on inference problems including graph-based semi-supervised learning and low-level vision problems. An important issue to be addressed before applying MGBP is the construction of the original graph G^0. The topology of the graph is determined by the specific problem. Usually, we use kNN graphs, where two nodes are connected iff one of them is among the other's k nearest neighbors, to model data distributions, and use lattices to model images. The edge weights can be calculated by several approaches such as the heat kernel and LLE reconstruction [14, 19]. Generally, this problem is not fully solved yet. A comprehensive discussion on constructing the edges can be found in [29]. In our experiments, we choose to use the heat kernel, i.e., the edge weight is calculated by

$$w_{ij} = \exp\{ -d(x_i, x_j)/T \}, \qquad (23)$$

where d(x_i, x_j) is the distance between x_i and x_j and T is the temperature parameter. Details will be described in each experiment.

Usually in the refining stage of MGBP, a very small number of iterations is sufficient for the convergence of the initialized BP. Thus, a good way to use MGBP is to first run BP on the coarsest graph until convergence (this will not take long since the graph scale is small), and then run BP on the rest of the levels for 2 or 3 iterations. We use this strategy to run MGBP in this section unless indicated otherwise. For ordinary BP, we run it to convergence. In our implementation, Gaussian BP and MGBP are written in pure Matlab code. Discrete BP and MGBP are written in Matlab with some functions written in C. All experiments are run on the same computer. For MGBP, we report the total running time including coarsening and interpolation.
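As an illustration of the graph construction used throughout these experiments, the following Python/NumPy sketch builds a symmetric kNN graph with the heat-kernel weights of Eq. (23); the parameter names k and T follow the text, but the function itself is ours and not the Matlab code actually used:

```python
import numpy as np

def knn_heat_kernel_graph(X, k=5, T=380.0):
    """Symmetric kNN graph with heat-kernel edge weights, Eq. (23).

    X : (n, d) data matrix, one sample per row
    Returns an (n, n) edge-weight matrix W with zero diagonal.
    """
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))  # Euclidean d(x_i, x_j)
    np.fill_diagonal(D, np.inf)                 # no self-edges
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[:k]             # the k nearest neighbors of node i
        W[i, nbrs] = np.exp(-D[i, nbrs] / T)    # w_ij = exp(-d(x_i, x_j) / T)
    return np.maximum(W, W.T)                   # connect i and j if either is a kNN of the other
```

With k = 5 and T = 380 this matches the setting described for the USPS experiment below.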

7.1 Semi-supervised Classification

Semi-supervised learning (SSL) [4] has received a lot of attention in machine learning and data mining in recent years. It aims at solving problems where a large number of samples are available but only a few of them are labeled. Among existing approaches, graph-based transduction [29] is one of the most active. Here we use belief propagation to solve this problem. We use the data samples to form an MRF and infer the class labels of the unlabeled samples given the labeled ones. As a common setting, kNN graphs are constructed to model the relations between samples. When the edges are determined, their weights are calculated using the heat kernel.

Gaussian MGBP is applied to solve binary classification problems. We use the "soft label" assumption [30], in which the mean value of each node is an indicator μ ∈ [−1, +1] that implies how significantly a sample is positive or negative. Initially, the labeled positive samples have μ = +1 and the negative ones have μ = −1. After inference, an unlabeled sample is classified as positive if it has μ > 0, and otherwise it is negative. This is a common assumption in semi-supervised classification (e.g. [28, 30]). Further, discrete MGBP is applied to solve multi-class problems. In this case the modeling is more straightforward: the state of a node is the class it belongs to. After inference, unlabeled samples are assigned to their most probable classes. The potential function for discrete MGBP is the Potts model [3].

The nearest neighbor (NN) classifier is used as the baseline. We compare MGBP to ordinary BP and two popular graph-based SSL methods: learning with local and global consistency (LLGC) [28] and the harmonic Gaussian field (HGF) [30] (HGF is compared only in the binary cases). In each run, a small number of samples from each class are randomly selected and labeled, and the same graph is used for all the methods. The mean performance of 50 independent tests is reported.

7.1.1 Digits Recognition

In this experiment, we study the performance of MGBP in digit recognition tasks. The data we use here is USPS (http://www.kernel-machines.org/data.html), a set of hand-written digit images of size 16×16 with pixel values ranging from 0 to 255. We use digits "1", "2" to test two-class classification, and "1", "2", "3", "4" to test multi-class performance. The numbers of samples for these digits are 1269, 929, 824, and 852, respectively. The graph is constructed using 5 nearest neighbors. For the heat kernel, we use the Euclidean distance for d(x_i, x_j) and the value of T is 380, tuned by cross validation. For MGBP, the graph is coarsened to level 5.

The results are shown in figure 6. In the two-class problem, Gaussian MGBP has performance similar to BP and the other methods. In the 4-class problem, discrete MGBP achieves an impressive accuracy. The fact that MGBP may outperform BP is probably due to the noise-reduction effect of coarsening. Figure 6 (c) shows the running time and accuracy of discrete MGBP using different coarse levels in the above 4-class problem. The accuracy is stable during coarsening until the level goes too high. The rise of running time at level 3 is caused by the increase of edges, as analyzed in section 5. In this case, we can see that coarsening the graph to level 5 is a good choice.

Figure 6. Experimental results on USPS data. (a) 2-class classification using Gaussian MGBP. (b) 4-class classification using discrete MGBP. (c) The accuracy and running time of discrete MGBP using different coarse levels on the USPS data.

7.1.2 Text Classification

In this experiment, we test MGBP on a text classification task. The data set we adopt is a subset of 20-newsgroup (http://people.csail.mit.edu/jrennie/20Newsgroups/). We choose the rec topic, which contains autos, motorcycles, baseball, and hockey. The Rainbow package [11] is used to process the documents with the following options: passing all words through the Porter stemmer before counting them; tossing out any token that is on the stoplist of the SMART system; skipping any headers; and ignoring words that occur in fewer than 6 documents. Then we normalize the samples to the TFIDF representation. Finally, 3970 samples of dimension 8014 are obtained.

We solve this 4-class problem with discrete MGBP. The graph is constructed using 10 nearest neighbors. For the heat kernel, the value of T is set to 1. For MGBP, we coarsen the problem to level 3. Figure 7 shows the accuracies of the different methods. We can see that MGBP also achieves satisfactory accuracy in this task. Yet, the acceleration here is not obvious because the original graph's diameter is already small, and coarsening increases the number of edges.

Figure 7. 4-class classification results on the 20-newsgroup data. Discrete MGBP is used in this task.

7.2 Semi-Automatic Image Segmentation

MGBP can also be used to solve the MRF-based image segmentation problem. In this experiment, a few pixels in an image are first labeled by the user to indicate the classes (segments) they belong to, and then the algorithm classifies the rest of the pixels and forms the complete segmentation. Applications of this procedure can be found in fields such as medical image analysis.

We use Gaussian MGBP for two-way segmentations and discrete MGBP for the multi-way cases. Following the traditional setting in computer vision, we use lattice graphs to model the spatial relations between pixels. For the heat kernel, we use the Euclidean RGB-color distance, and T is roughly tuned for each image. All the images are re-scaled to size 200 × 200 and then coarsened to level 5. To demonstrate the accuracy of MGBP's energy-preserving coarsening, we refine the coarse result back using only interpolation.

Segmentation results are shown in figure 8. Labeled pixels are displayed as dots whose colors indicate class labels. Typically, segmentations using MGBP are done in less than 2 seconds, while it takes around 2 minutes for ordinary BP to finish the same tasks. We can see that MGBP achieves satisfactory results using much less time.

Figure 8. Segmentation results using MGBP. The 1st row shows the labeled images and the 2nd row shows the results. (a) and (b) are two-way segmentations using Gaussian MGBP. (c) and (d) are three-way segmentations using discrete MGBP. MGBP is about 60 times faster than ordinary BP in this task.

7.3 Image Restoration

We also apply MGBP to the image restoration problem [9]. Given a degraded image, restoration aims to recover the original image, i.e., infer the most probable color of each pixel given the degraded observations. The original image (179 × 122) is degraded by additive Gaussian noise, as shown in figure 9 (a) and (b). We model this image as a lattice Gaussian MRF, where a node's value indicates its true color and the degraded color serves as the observation. Pixel values are normalized between 0 and 1. For the heat kernel, we use the color difference as the distance and set T = 10 manually. These parameters are kept the same when we run ordinary BP and MGBP respectively. For MGBP, we coarsen the image to level 3, and run only one iteration of BP on each level. Ordinary BP is run until similar results are obtained. The results are shown in figure 9. It can be seen that MGBP reduces the computational cost significantly while achieving a good approximation to the original restoration.

Figure 9. Image restoration from Gaussian noise. (a) Original image. (b) Degraded image. (c) Restored by Gaussian BP (7 sec, 13 iterations). (d) Restored by Gaussian MGBP (0.9 sec, 3 coarse levels).

8 Conclusions

In this paper, we propose the multigrid belief propagation (MGBP) algorithm to perform approximate inference on large graphs efficiently. Our basic strategy is to first solve a small, approximate problem, and then refine the coarse solution back to the original problem. The acceleration is achieved in two ways. First, we significantly reduce the scale of the problem by recursively coarsening the graph. Second, we initialize the belief propagation to a good starting point so that the iteration converges rapidly. Using algebraic multigrid techniques and energy-preserving coarsening, our method is able to construct small graphs that approximate the original graph well. Consequently, the coarse result can be refined very efficiently. We provide a justification for MGBP on Gaussian MRFs and verify empirically that it is also effective on discrete MRFs. Experiments show that our method can remarkably reduce the running time of belief propagation while preserving the quality of the solutions.

Acknowledgment

Funded by the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList).

References

[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from examples. Journal of Machine Learning Research, to appear.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI, 23:1222–1239, 2001.
[4] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[5] R. Chellappa, editor. Markov Random Fields: Theory and Applications. New York: Academic, 1993.
[6] F. DiMaio and J. Shavlik. Belief propagation in large, highly connected graphs for 3D part-based object recognition. In ICDM-06, 2006.
[7] D. J. Felleman and D. C. V. Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1:1–47, 1991.
[8] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. International Journal of Computer Vision, 70:41–54, 2006.
[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6:721–741, 1984.
[10] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
[11] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[12] J. Mooij and H. Kappen. Sufficient conditions for convergence of loopy belief propagation. In UAI-05, pages 396–40, Arlington, Virginia, 2005. AUAI Press.
[13] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
[14] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[15] E. Sharon, A. Brandt, and R. Basri. Fast multiscale image segmentation. In CVPR-00, 2000.
[16] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM-06, 2006.
[17] U. Trottenberg, C. W. Oosterlee, and A. Schüller. Multigrid. With guest contributions by A. Brandt, P. Oswald, and K. Stüben. Academic Press, 2001.
[18] J. Verbeek and N. Vlassis. Gaussian fields for semi-supervised regression and correspondence learning. Pattern Recognition, 39(10):1864–1875, 2006.
[19] F. Wang and C. Zhang. Label propagation through linear neighborhoods. In ICML-06, 2006.
[20] F. Wang and C. Zhang. Fast multilevel transduction on graphs. In SDM-07, 2007.
[21] Y. Weiss and W. T. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans. Information Theory, 47:736–744, 2001.
[22] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.
[23] Y. Weiss. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13:2173–2200, 2001.
[24] A. S. Willsky. Multiresolution Markov models for signal and image processing. Proceedings of the IEEE, 90:1396–1458, 2002.
[25] Y. Xu, X. Yi, and C. Zhang. A random walks method for text classification. In SDM-06, 2006.
[26] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Technical report, Mitsubishi Electric Research Laboratories, 2001.
[27] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Medical Imaging, 20:45–57, 2001.
[28] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in NIPS-03, 2003.
[29] X. Zhu. Semi-supervised learning with graphs. PhD thesis, CMU, 2005.
[30] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML-03, 2003.