On the Convergence Properties of Contrastive Divergence

Ilya Sutskever and Tijmen Tieleman
University of Toronto

Keywords: Contrastive Divergence, Convergence

Abstract

Contrastive Divergence (CD) is a popular method for estimating the parameters of Markov Random Fields (MRFs) by efficiently approximating an intractable term in the gradient of the MRF’s log probability. Despite its empirical success, basic theoretical questions on its convergence properties are currently open. In this paper, we analyze the CD1 update rule for Restricted Boltzmann Machines (RBMs) with binary variables. We show that this update is not the gradient of any function, and we present an example of a somewhat contrived “regularization function” that causes the CD update to cycle indefinitely. On the other hand, we prove that the CD update always has at least one fixed point when used with L2 regularization.

1 Introduction

Markov Random Fields (MRFs) form an important class of probabilistic models that are used in a wide range of applications, such as image, speech, and general signal denoising, as well as prediction and density modelling (Cross & Jain, 1981; Malfait & Roose, 1997; Portilla et al., 2003; Roth & Black, 2005; Li, 1994); see references in (Wainwright, 2008). In addition, Restricted Boltzmann Machines (RBMs) (Hinton, 2002; Smolensky, 1986), a class of MRFs, are an essential building block of Deep Belief Networks (Hinton et al., 2006; Bengio et al., 2007; Hinton & Salakhutdinov, 2006).

Nearly every application of MRFs requires that their parameters be estimated from a collection of training examples. However, the natural maximum-likelihood parameter estimation is difficult because the log probability’s gradient, being the difference of two expectations of which one is intractable, cannot be computed efficiently. As a result, a number of approximate parameter estimation methods have been developed: Pseudolikelihood (Besag, 1977) and Score Matching (Hyvarinen, 2006) are tractable alternatives to the log probability loss function which are easier to optimize, and Loopy Belief Propagation and its variants (Wainwright, 2008) directly approximate the intractable expectation in the log probability’s gradient.

This paper focuses on Contrastive Divergence (CD) (Hinton, 2002), which is a method for approximating the hard expectation with a Monte Carlo estimate from an easy-to-sample distribution. The CD update is very easy to implement, so it is widely used (Hinton et al., 2006), but its convergence properties are currently unknown. Interestingly, CD has a persistent variant (Younes, 1988; Tieleman, 2008) that is known to converge to the maximum-likelihood parameter setting if the learning rate is annealed with a suitable schedule (Younes, 1988).
In this paper we gain a better understanding of the CD1 update rule for binary RBMs and report the following results:

• The CD update direction is not the gradient of any cost function.

• We construct an example of a nonconvex regularization function that causes the CD update to cycle indefinitely.

• The CD update, when combined with L2 regularization of arbitrarily small positive weight, always has at least one fixed point.

2 Preliminaries

2.1 Restricted Boltzmann Machines

A binary RBM defines a probability distribution over binary vectors V ∈ {0, 1}^n and H ∈ {0, 1}^m by the expression

P(V, H) = exp(−E(V, H)) / Z    (1)

where the energy E(V, H) is defined as

E(V, H) = −V⊤W H − V⊤bV − H⊤bH    (2)

and the partition function Z is

Z = Σ_{V ∈ {0,1}^n} Σ_{H ∈ {0,1}^m} exp(−E(V, H))    (3)

The marginal probability P(V) is Σ_{H ∈ {0,1}^m} P(V, H), and its logarithm log P(V) satisfies

log P(V) = E_{P(H|V)}[−E(V, H)] − log Z    (4)

A natural way of estimating the RBM’s parameters from a training set {V1, . . . , VN} is by finding the parameters that maximize the average log probability

L = E_{D(V)}[log P(V)]    (5)

where D(V) is the empirical data distribution of the training set (which is a uniform mixture of delta distributions, one for each training point). The parameters maximizing the average log probability L are typically found with a gradient-based optimization method. The gradient of L with respect to the weights of the RBM is given by

∂L/∂W = E_{P(H|V)D(V)}[V H⊤] − E_{P(V,H)}[V H⊤]    (6)

∂L/∂bV = E_{P(H|V)D(V)}[V] − E_{P(V,H)}[V]    (7)

∂L/∂bH = E_{P(H|V)D(V)}[H] − E_{P(V,H)}[H]    (8)


In each equation, the first expectation is easy to compute because the distribution P(H|V) is factorial: the definition of P (eq. 1) implies that P(H|V) = ∏_{j=1}^{m} P(Hj|V). In contrast, there is no efficient way of computing the second expectation of each equation exactly.
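The factorial structure of P(H|V) is what makes the first expectation cheap. As a rough illustration (a NumPy sketch of our own; the function names and batch conventions are not from the paper), the conditionals of a binary RBM are elementwise logistic sigmoids:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_conditionals(V, W, b_H):
    """P(H_j = 1 | V) for every hidden unit j.

    Because P(H|V) factorizes over the hidden units, this exact
    computation costs only one matrix-vector product."""
    return sigmoid(V @ W + b_H)

def visible_conditionals(H, W, b_V):
    """P(V_i = 1 | H), by the symmetric argument."""
    return sigmoid(W @ H + b_V)

# With all parameters zero, every unit is on with probability 1/2.
V = np.zeros(3)
W = np.zeros((3, 2))
b_H = np.zeros(2)
print(hidden_conditionals(V, W, b_H))  # [0.5 0.5]
```

The second expectation, by contrast, ranges over all 2^(n+m) joint states and admits no such factorization.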

2.2 Contrastive Divergence

Contrastive Divergence (CD) (Hinton, 2002) is an algorithmically efficient procedure for parameter estimation. The CD update modifies the gradient of the log probability by replacing the distribution P(V, H) with a distribution R(V, H) that is much easier to sample:

∆W CD(W) = E_{P(H|V)D(V)}[V H⊤] − E_{R(V,H)}[V H⊤]    (9)

∆bV CD(W) = E_{P(H|V)D(V)}[V] − E_{R(V,H)}[V]    (10)

∆bH CD(W) = E_{P(H|V)D(V)}[H] − E_{R(V,H)}[H]    (11)

where we denote the state of the weights and the biases by W. We can obtain a sample (V, H) ∼ R as follows:

1. Let V′ be a random training point.
2. Sample H′ from P(H′|V′).
3. Sample V from P(V|H′).
4. Sample H from P(H|V).
5. Return (V, H).

The reconstruction distribution R(V, H) replaces the model’s distribution P(V, H) in the expression for the gradient of the RBM’s log probability (eq. 6). The distribution R is obtained by starting at the data distribution and running the Gibbs sampling Markov chain for one step, so the term E_{R(V,H)}[V H⊤] reflects the kind of data the model “likes”, causing the difference to often point in a direction of improvement. See the discussion in (Hinton, 2002).
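The five sampling steps and the differences of eqs. 9–11 can be sketched in a few lines of NumPy. This is our own illustrative code, not the authors’; following common practice we keep the hidden probabilities rather than binary samples in the outer products, which reduces variance but is an implementation choice, not part of the definition above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(V_data, W, b_V, b_H):
    """One CD1 update for a batch of binary rows V_data (shape N x n).

    Returns (dW, db_V, db_H): the three differences of eqs. 9-11,
    averaged over the batch and not yet scaled by a learning rate."""
    # Positive phase: P(H|V) is factorial, so use exact probabilities.
    H_pos = sigmoid(V_data @ W + b_H)
    # Steps 2-4: one Gibbs step away from the data; the pair (V1, H_neg)
    # plays the role of a sample from the reconstruction distribution R.
    H1 = bernoulli(H_pos)
    V1 = bernoulli(sigmoid(H1 @ W.T + b_V))
    H_neg = sigmoid(V1 @ W + b_H)
    N = V_data.shape[0]
    dW = (V_data.T @ H_pos - V1.T @ H_neg) / N
    db_V = (V_data - V1).mean(axis=0)
    db_H = (H_pos - H_neg).mean(axis=0)
    return dW, db_V, db_H
```

A training loop would then repeat W += lr * dW (and likewise for the biases) with some learning rate, which is exactly the standard CD learning discussed next.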


In addition, CD has an “objective function”,

CD(W) = E_{P(H|V)D(V)}[−E(V, H)] − E_{R(V,H)}[−E(V, H)],    (12)

whose gradients ∇CD(W) are close, but not equal, to the CD update ∆CD(W) (Hinton, 2002). Standard CD learning proceeds by repeatedly changing the RBM parameters W according to the CD update ∆CD(W) with some learning rate.[1] The regularized CD update, ∆CDF(W), with the regularization function F, is defined by the equation

∆CDF(W) = ∆CD(W) + ∇F(W)    (13)

so, for example, if F(W) = −λ/2 · ‖W‖², then ∆CDF(W) is the CD update with L2 weight decay.

3 The CD update direction is not the gradient of any function[2]

Finding a function H(W) whose gradients are precisely equal to ∆CD would be useful, because it would shed light on the nature of the solutions that CD learning tends to find. Furthermore, the existence of such a function H would let us conclude that CD is convergent with every regularization function. However, in this section we show the surprising result that ∆CD is not the gradient of any function; even more surprisingly, we construct a regularization function that causes the CD update to cycle indefinitely around a circle.

3.1 Proofs

We present two proofs showing that the CD update is not the gradient of any function. Assume, in order to obtain a contradiction, that there is a function H such that ∇H = ∆CD. Consider an RBM with one visible unit and one hidden unit, and a single training point where the visible unit is in the zero state. The bias to the hidden unit is fixed to zero, so that there are only two parameters: w, the connection between the two units, and bV, the bias to the visible unit. The parameters are jointly written as W = (w, bV). Now, because ∆CD is differentiable and bounded (see section 4.1), basic calculus states that

∂²H(W)/∂w∂bV = ∂²H(W)/∂bV∂w
(∂/∂w)(∂H(W)/∂bV) = (∂/∂bV)(∂H(W)/∂w)
(∂/∂w)∆bV CD(W) = (∂/∂bV)∆w CD(W)    (14)

is valid for all W. We investigated whether the two sides of the last equation are equal and found, with bV = 0 and w = 1, that they are not. A straightforward but tedious derivation reveals that

(∂/∂w)∆bV CD − (∂/∂bV)∆w CD = e · (e − 1) · (e + 3) / (8(e + 1)³)    (15)

where e = exp(1). None of the terms in the numerator is zero, so these two partial derivatives are not equal, implying that no such function H exists.

Our second proof involves a simulation and is important for constructing a regularization function that causes CD to cycle. As before, suppose that ∆CD were the gradient of a function H. We can use the CD update to compute the change in the value of H as we travel along a path in the parameter space. If the path’s initial and final points are equal, then the total change in the value of H, as measured by ∆CD, its gradient, must be zero. Therefore, if we use the CD update to numerically evaluate the change in H along a closed loop, and find that the cumulative change in the value of this presumed H is nonzero, then there cannot be any function H of which ∆CD is the gradient.

Our example is the same 2-parameter RBM as before. The path we follow, γ(t), traverses a circle whose radius is 1/(2π) and whose center is (2, 0) as t traverses the interval [0, 1]; this radius ensures that the length of the path is 1. The path γ begins at (2, 1/(2π)) and proceeds in the counter-clockwise direction. In particular, γ(0) = γ(1).

[1] We abuse the notation ∆CD and “the CD update”: an actual update must involve a learning rate, which we omit.
[2] Some of the findings in this section were described earlier in (Tieleman, 2007).

Figure 1: The surface of the supposed function H implied by the CD update along the path γ described in the text (tilted curve). The figure shows that the altitude of the final point is slightly greater than the altitude of the initial point. Although the total elevation along this closed path is slight, its existence is sufficient to deduce that there is no function whose gradient is equal to the CD update. The horizontal curve is the projection of the tilted curve onto the z = 0 plane.

We computed the total change in the supposed function H along the path γ using the CD update, and it turned out to be 0.008242 (see fig. 1); if the CD update were the gradient of any function, this total change would be zero. The total change is computed by traversing the path with 10⁵ equally-spaced steps, where at each step the change in the function’s value is proportional to the length of the projection of the CD update onto the path’s direction.
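The loop computation is straightforward to reproduce with the closed-form ∆CD reconstructed above for the two-parameter RBM (repeated here so the snippet is self-contained; it is our derivation, not the authors’ code). Integrating the update’s tangential component around γ gives a circulation of roughly 0.0082, in line with the value reported above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_cd(w, b_v):
    # Closed-form CD1 update for the two-parameter RBM of section 3.1.
    q = 0.5 * (sigmoid(w + b_v) + sigmoid(b_v))
    return -sigmoid(w) * q, -q

def loop_integral(steps=20_000):
    """Line integral of the CD update around gamma: the circle of
    radius 1/(2*pi) centred at (2, 0), traversed counter-clockwise
    starting from (2, 1/(2*pi))."""
    r = 1.0 / (2.0 * math.pi)
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps                  # midpoint rule
        angle = 2.0 * math.pi * t + math.pi / 2.0
        w = 2.0 + r * math.cos(angle)
        b = r * math.sin(angle)
        # gamma has unit speed, so gamma'(t) is the unit tangent;
        # multiply by the step length 1/steps.
        tw = -math.sin(angle) / steps
        tb = math.cos(angle) / steps
        dw, db = delta_cd(w, b)
        total += dw * tw + db * tb
    return total

total = loop_integral()
print(total)
```

A nonzero result here is exactly the obstruction: a gradient field would integrate to zero around any closed loop.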

3.2 Convergence-Preventing Regularization

Using the above result, we construct a “regularization” function F for which the update[3]

∆CDF(W) = ∆CD(W) + ∇F(W)    (16)

does not converge for the previously-described RBM, and in fact causes the optimization to cycle indefinitely. It is not a function that could be used in practice for regularization, but it demonstrates how the fact that CD learning has no underlying objective function can cause the optimization to behave in unexpected ways.

[3] If ∆CD were the gradient of H and we wished to optimize F + H, then its gradient would be eq. 16.

We achieve this effect by choosing a function F satisfying the following criteria:

1. The function F causes ∆CDF(W) to seem like it is always “ascending” along the path γ.
2. The function F severely penalizes deviations from the path γ.

Condition 2 ensures that parameters that follow ∆CDF(W) tend to stay near the path γ, while condition 1 causes the parameters to cycle around it. As we will see, a function that satisfies condition 1 can be constructed only because the CD update “thinks” that traversing γ causes a total increase of 0.008 in H. We now define the function F more formally. Let G(t) be the change, as computed by ∆CD, between the values of the presumed function H at γ(0) and γ(t). Specifically, G(t) is defined by the integral

G(t) = ∫₀ᵗ ∆CD(γ(τ))⊤ γ′(τ) dτ    (17)

The main fact about G is that G(1) ≈ 0.008 > 0. Using G, we can define the values of the function F along the path γ with the equation

F(γ(t)) = −G(t) + G(1) · t    (18)

For this definition to be consistent, the above equation must have F(γ(0)) equal to F(γ(1)) (simply because γ(0) = γ(1)). And indeed,

F(γ(0)) = −G(0) + G(1) · 0 = 0 = −G(1) + G(1) · 1 = F(γ(1))    (19)

In fact, F is differentiable at each point of the circle γ, including γ(0) = γ(1), because

∂F(γ(0))/∂t = −G′(0) + G(1) = −G′(1) + G(1) = ∂F(γ(1))/∂t    (20)

where G′(0) = G′(1) by eq. 17.

because G′ (0) = G′ (1) by eq. 17. Finally, we extend F to R2 : F (x) is the value of F at the closest point to x on γ (and 0 at the circle’s center), so F is differentiable everywhere except in the circle’s center (which we set to zero). As a result, the update ∆CDF (W ) cancels out the irregular effect of CD and preserves only the elevation t · G(1), which is can be seen by computing the magnitude of 8

0.5

0

−0.5

−1

−1.5 0.4

2.6 2.4

0.2

2.2

0

2 −0.2

1.8 −0.4

1.6

Figure 2: The surface of the “regularization” function F that causes CD learning to cycle. The plot clearly shows the contribution from G which is reflected in slope of the circle, as well as the contribution from the regularization.

∆CDF ’s projection on the direction of γ: ∆CDF (γ(t))⊤ γ ′ (t) +

∂F (γ(t)) = ∆CD(γ(t))⊤ γ ′ (t) − G′ (t) + G(1) ∂t = G′ (t) − G′ (t) + G(1) = G(1) > 0

(21)

where we used the identity ∆CDF (W )⊤ γ ′ (t) = G′ (t) from the definition of G(t). This causes the parameters to move in a constant speed along the circle γ provided that the point stays near γ—which is easily arranged using regularization. For the regularization function, we added R(x) = −10 · kx − γk1.3 to F , where kx − γk is the distance from x to the nearest point on γ. This regularization was chosen to make the simulation work successfully. For the simulation, we implemented F as described above by sampling approximate values for F (γ(t)) at 20,000 equally-spaced points and using cubic splines to extend F (γ(t))’s definition to all 0 ≤ t ≤ 1. The gradient ∇F was computed with numerical differentiation; if the stepsize is set to 10−5 , 9

the update ∆CDF (W ) makes the parameters cycle around γ indefinitely. The resulting F is not differentiable at 0, which we fix by multiplication with a smooth function whose value is mostly 1 but is zero in a small neighborhood of the circle’s center. This is done for formality, because ∆CDF (W ) keeps the parameters near γ which is far from the circle’s center, so F ’s values at its neighborhood are irrelevant.

4 CD has fixed points

In this section we show that the CD update always has a fixed point when used with an arbitrarily small amount of L2 regularization. Specifically, the L2-regularized CD update,

∆CDλ(W) = ∆CD(W) − λW    (22)

satisfies the following theorem.

Theorem 1. For any binary RBM of any architecture and any 0 < λ < 1, there exists a setting of the parameters W∗ such that ∆CDλ(W∗) = 0.

When the distribution to be estimated, D(V), is exactly representable by the distribution of an RBM with parameters W∗ (say P(V); so D(V) = P(V)), then CD is consistent in the sense that ∆CD(W∗) = 0. This is because the distributions P(H|V)D(V) and R(V, H) are both equal to P(V, H), which follows from the equality P(V) = D(V) and the fact that the Gibbs sampling Markov chain leaves P(V, H) invariant. However, no such result is known in the general case, when D(V) cannot be expressed as the distribution of an RBM, which is the typical situation when working with finite training sets. Nevertheless, CD was suspected to have fixed points on the basis of empirical evidence (Carreira-Perpinan & Hinton, 2005).

The proof of Theorem 1 is general and nonconstructive, so we can only show that a fixed point exists. It applies to all bounded updates that are continuous functions of the parameters, a property satisfied by the CD update in binary RBMs (and, in fact, by the CDn update for every n for binary RBMs, but not for the real-valued Gaussian RBMs (Welling et al., 2005)). The proof idea is to apply Brouwer’s fixed point theorem (Henle, 1994), which states that any continuous function from the closed ball to itself has a fixed point. Because the CD update is bounded, it will eventually be overwhelmed by the L2 regularization, so there is a ball large enough that the regularized CD update never leaves it. This lets us apply Brouwer’s fixed point theorem to conclude that the regularized CD update has a fixed point.

Figure 3: An illustration of Theorem 2 and of the effect of L2 regularization on the CD update. Left: the update defined by CD without regularization. The update has no fixed point in the plotted region (although it may have a fixed point outside of it). Right: the update defined by CD with 0.2 L2 regularization. All the arrows on the boundary of the square point to its inside, and hence, according to Brouwer’s fixed point theorem, it must have a fixed point (which is found near (−1, −1)). In both figures, the axes correspond to the parameters of the RBM described in section 3.

4.1 The formal proof

The proof uses Brouwer’s fixed point theorem (e.g., (Henle, 1994)):

Theorem 2. Let B be any closed ball in R^k, and let f : B → B be any continuous function whose outputs are contained in B. Then f has a fixed point, namely, there is an x∗ ∈ B such that f(x∗) = x∗.

To prove Theorem 1, consider the function

f(W) = W + ∆CDλ(W) = W + ∆CD(W) − λW    (23)

We begin the proof by observing that ∆W CD(W) = E_{P(H|V)D(V)}[V H⊤] − E_{R(V,H)}[V H⊤] is bounded because both V and H are binary (0-1) vectors, so its norm is maximal when all of its components are equal to 1, giving the bound

‖∆CD(W)‖ ≤ U, where U = √((n + 1) · (m + 1))    (24)

In this equation, n is the number of visible units, m is the number of hidden units, and (n + 1)(m + 1) is an upper bound on the total number of parameters of the RBM (the +1 terms account for the biases). The update ∆CD(W) is also continuous because the RBM’s distribution in eq. 1 is a continuous function of its parameters.

Now, for all W such that ‖W‖ ≤ U/λ,

‖W − λW + ∆CD(W)‖ ≤ (1 − λ)‖W‖ + ‖∆CD(W)‖
                   ≤ (1 − λ) · U/λ + U
                   = U/λ    (25)

where we used the triangle inequality and the fact that 1 − λ > 0. Therefore, when R = U/λ and B is the closed ball {W : ‖W‖ ≤ R}, the function f satisfies f(B) ⊆ B; furthermore, f is continuous since both the CD update and the L2 regularization are. This lets us apply Brouwer’s fixed point theorem to f and conclude that there exists W∗ ∈ B such that f(W∗) = W∗, or equivalently, that ∆CDλ(W∗) = 0.

5 Conclusions

In this paper we showed that CD is not the gradient of any function, that it is possible to construct regularization functions causing it to fail to converge, and that CD nevertheless has fixed points when used with L2 regularization. It would be interesting to determine whether the fixed points of L2-regularized CD are robust to small perturbations.

6 Acknowledgement

This research was partially supported by the Ontario Graduate Scholarship.

References

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, (p. 153). The MIT Press.

Besag, J. (1977). Efficiency of pseudolikelihood estimation for simple Gaussian fields.

Carreira-Perpinan, M., & Hinton, G. (2005). On contrastive divergence learning. In Artificial Intelligence and Statistics.

Cross, G., & Jain, A. (1981). Markov random field texture models. In Conference on Pattern Recognition and Image Processing, Dallas, TX, (pp. 597–602).

Henle, M. (1994). A Combinatorial Introduction to Topology. Dover Publications.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

Hinton, G., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks.

Hyvarinen, A. (2006). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(1), 695.

Li, S. (1994). Markov random field models in computer vision. Lecture Notes in Computer Science, 801, 361–370.

Malfait, M., & Roose, D. (1997). Wavelet-based image denoising using a Markov random field a priori model. IEEE Transactions on Image Processing, 6(4), 549–565.

Portilla, J., Strela, V., Wainwright, M., & Simoncelli, E. (2003). Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11), 1338–1351.

Roth, S., & Black, M. (2005). Fields of experts: A framework for learning image priors. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 194–281.

Tieleman, T. (2007). Some investigations into energy-based models. Master’s thesis, University of Toronto.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, (pp. 1064–1071). ACM.

Wainwright, M. (2008). Graphical Models, Exponential Families, and Variational Inference. Now Publishers.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems, 17, 1481–1488.

Younes, L. (1988). Estimation and annealing for Gibbsian fields. Ann. Inst. H. Poincaré Probab. Statist., 24, 269–294.
