Deterministic Annealing for Semi-supervised Kernel Machines
Vikas Sindhwani (1), Sathiya Keerthi (2), Olivier Chapelle (3)
(1) University of Chicago, (2) Yahoo! Research, (3) Max Planck Institute, Tübingen
ICML 2006
V. Vapnik's Transductive SVM Idea

Suppose, for a binary classification problem, we have
- l labeled examples {x_i, y_i}_{i=1}^l, x_i ∈ X, y_i ∈ {−1, +1}
- u unlabeled examples {x'_j}_{j=1}^u

Denote y' = (y'_1, ..., y'_u) as the unknown labels. Train an SVM while optimizing the unknown labels.

Solve, over f ∈ H_K : X → R and y' ∈ {−1, +1}^u:

  min_{f, y'}  (λ/2) ||f||_K^2  +  (1/l) Σ_{i=1}^l V(y_i, f(x_i))  +  (λ'/u) Σ_{j=1}^u V(y'_j, f(x'_j))
               [regularizer]       [labeled loss: standard SVM]       [unlabeled loss]

subject to the positive class ratio constraint:

  (1/u) Σ_{j=1}^u max(0, y'_j) = r
Equivalent Continuous Optimization Problem

Optimization problem:

  min_{f, y'} J(f, y') = (λ/2) ||f||_K^2 + (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + (λ'/u) Σ_{j=1}^u V(y'_j, f(x'_j))

Minimizing over each discrete label y'_j explicitly gives an equivalent problem over f alone:

  min_f J(f) = (λ/2) ||f||_K^2 + (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + (λ'/u) Σ_{j=1}^u min[ V(+1, f(x'_j)), V(−1, f(x'_j)) ]
                                                                              [effective loss V'(f(x'_j))]
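To make the effective loss concrete, here is a minimal NumPy sketch (illustrative, not the authors' code) of V'(t) = min[V(+1, t), V(−1, t)] with the hinge loss V(y, t) = max(0, 1 − y t) as one choice of V:

import numpy as np

def hinge(y, f):
    """Hinge loss V(y, f(x)) = max(0, 1 - y*f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def effective_loss(f):
    """Effective loss on an unlabeled point: min over the two label choices.
    Non-convex in f(x), peaking at f(x) = 0 and vanishing for |f(x)| >= 1."""
    return np.minimum(hinge(+1.0, f), hinge(-1.0, f))

# Example: the loss is largest when the decision surface passes near the point.
f_vals = np.linspace(-2, 2, 9)
print(effective_loss(f_vals))  # hat-shaped profile, zero outside |f(x)| >= 1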
Effective Loss Function Over Unlabeled Examples
[Figure: the effective loss V'(f(x)) on unlabeled examples for (a) hinge loss, (b) quadratic hinge loss, (c) squared loss, (d) logistic loss; in every case the effective loss is non-convex in f(x).]
Penalty if decision surface gets too close to unlabeled examples.
This idea implements a common assumption for SSL...
Low-Density Separation Assumption
The true decision boundary passes through a region containing a low volume of data. This implements the prior knowledge/assumption that

  ∫_{B(f)} P(x) dx  is small,   where B(f) = {x : |f(x)| < 1}.

Cluster Assumption
Points in a data cluster belong to the same class.
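As a rough illustration of this assumption (not from the paper), the mass of P inside the margin region B(f) can be estimated empirically by the fraction of unlabeled points with |f(x)| < 1:

import numpy as np

def margin_mass(f_unlabeled):
    """Empirical estimate of P(B(f)): fraction of unlabeled points with
    |f(x)| < 1, i.e. lying inside the margin region."""
    f_unlabeled = np.asarray(f_unlabeled)
    return np.mean(np.abs(f_unlabeled) < 1.0)

# A boundary satisfying the low-density assumption gives a small value.
print(margin_mass([-2.3, 1.7, 0.2, -1.4, 2.8]))  # 0.2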
Solution Strategies
JTSVM [Joachims, 98]: Label the unlabeled data using a supervised SVM, then alternate: optimize f given the current y'; optimize y' by switching a pair of labels (see the sketch below).

∇TSVM [Chapelle and Zien, 05]: Use differentiable losses (a quadratic hinge loss over labeled examples and a Gaussian-shaped loss over unlabeled examples) and apply gradient descent.

Related work: [Bennett & Demiriz, 98], [Fung & Mangasarian, 01], [Collobert, Sinz, Weston, Bottou, 05], [Gartner, Le, Burton, Smola, Vishwanathan, 05]
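The sketch below illustrates the JTSVM switching step using the pair-swap criterion from Joachims' paper (a pair with opposite current labels is swapped when both hinge slacks are positive and their sum exceeds 2); the helper name and setup are illustrative, not the authors' implementation:

import numpy as np

def find_switchable_pair(y_unlab, f_unlab):
    """Return indices (i, j) of an unlabeled pair whose label swap would
    decrease the TSVM objective (Joachims-style test), or None."""
    y_unlab, f_unlab = np.asarray(y_unlab), np.asarray(f_unlab)
    slack = np.maximum(0.0, 1.0 - y_unlab * f_unlab)   # hinge slacks
    pos = np.where(y_unlab > 0)[0]
    neg = np.where(y_unlab < 0)[0]
    for i in pos:
        for j in neg:
            if slack[i] > 0 and slack[j] > 0 and slack[i] + slack[j] > 2:
                return int(i), int(j)
    return None

# Example: pair (0, 1) is switchable because both slacks are large.
print(find_switchable_pair([+1.0, -1.0, +1.0], [-0.5, 0.5, 2.0]))  # (0, 1)
# Usage sketch: inside the JTSVM loop, the SVM is retrained after each swap.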
Objective Function

Non-convexity can hurt empirical performance.

[Figure: a two-dimensional data set (left) and the TSVM objective as a function of a decision-boundary parameter θ (right); the objective has multiple local minima.]

Error rates on COIL6: SVM 21.9, JTSVM 21.2, ∇TSVM 21.6
Deterministic Annealing: Intuition

Question: What should the shape of the loss function be so that the decision boundary locally evolves in a desirable manner?

[Figure: a two-dimensional data set with the current decision boundary (left) and the corresponding loss over f(x) on unlabeled examples (right), shown as the loss is progressively deformed.]

Key Idea: Deform the loss function (objective) as the optimization proceeds... somehow!
Deterministic Annealing as a Homotopy Method
Work with a family of objective functions J_T. Smoothly deform an "easy" (convex) function J_{T_1} into the given "hard" function J_{T_2} = J by varying T. Track minimizers along the deformation path. DA is a specific implementation of this idea.
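A toy continuation sketch (unrelated to SVMs, purely illustrative) of tracking a minimizer while a convex surrogate is deformed into a non-convex target, using scipy:

import numpy as np
from scipy.optimize import minimize

def convex(x):       # "easy" end of the homotopy, playing the role of J_{T_1}
    return float((x[0] - 0.5) ** 2)

def target(x):       # "hard" non-convex end, playing the role of J_{T_2}
    return float(np.sin(3.0 * x[0]) + 0.1 * x[0] ** 2)

x = np.array([0.5])                       # minimizer of the easy problem
for t in np.linspace(0.0, 1.0, 21):       # smoothly deform easy -> hard
    x = minimize(lambda z, t=t: (1 - t) * convex(z) + t * target(z), x).x
print(x)  # a local minimizer of the hard objective reached by tracking the path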
Deterministic Annealing for Semi-supervised SVMs

Another equivalent continuous optimization problem: "relax" y' to p = (p_1, ..., p_u), where p_j plays the role of the probability that y'_j = +1.

  J(f, p) = E_p J(f, y') = (λ/2) ||f||_K^2 + (1/l) Σ_{i=1}^l V(y_i, f(x_i))
            + (λ'/u) Σ_{j=1}^u [ p_j V(+1, f(x'_j)) + (1 − p_j) V(−1, f(x'_j)) ]

Family of objective functions (average cost minus T times entropy):

  J_T(f, p) = E_p J(f, y') − T H(p),
  where H(p) = −(1/u) Σ_{j=1}^u [ p_j log p_j + (1 − p_j) log(1 − p_j) ]
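A small illustrative sketch of the entropy term that smooths the objective (per-point binary entropy averaged over the unlabeled set):

import numpy as np

def entropy(p, eps=1e-12):
    """H(p) = -(1/u) Σ_j [ p_j log p_j + (1 - p_j) log(1 - p_j) ].
    Zero when every p_j is 0 or 1 (hard labels), maximal at p_j = 1/2."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

print(entropy([0.5, 0.5]))   # ≈ 0.693 = log 2 (maximally soft labels)
print(entropy([1.0, 0.0]))   # ≈ 0 (hard labels: J_T reduces to J)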
Deterministic Annealing for Semi-supervised SVMs

Full optimization problem at temperature T:

  min_{f, p} J_T(f, p) = (λ/2) ||f||_K^2 + (1/l) Σ_{i=1}^l V(y_i, f(x_i))
      + (λ'/u) Σ_{j=1}^u [ p_j V(+1, f(x'_j)) + (1 − p_j) V(−1, f(x'_j)) ]
      + (T/u) Σ_{j=1}^u [ p_j log p_j + (1 − p_j) log(1 − p_j) ]
  subject to the balance constraint  (1/u) Σ_{j=1}^u p_j = r

Deformation: T controls the non-convexity of J_T(f, p). At T = 0 it reduces to the original non-convex objective J(f, p).

Optimization at T: (f*_T, p*_T) = argmin_{f,p} J_T(f, p)

Annealing: return f* = lim_{T→0} f*_T
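A schematic outer annealing loop, assuming hypothetical callables optimize_f, optimize_p, and entropy for the inner steps described on the next slide; the cooling mirrors the T = 10/1.5^i schedule used in the experiments later:

def deterministic_annealing(optimize_f, optimize_p, entropy, p0,
                            T0=10.0, rate=1.5, eps=1e-6,
                            max_outer=30, max_inner=50):
    """Schematic DA outer loop. optimize_f(p) -> f, optimize_p(f, T) -> p and
    entropy(p) -> H(p) are assumed callables implementing the inner solves."""
    T, p, f = T0, p0, None
    for _ in range(max_outer):
        for _ in range(max_inner):     # alternate the two convex solves at fixed T
            f = optimize_f(p)          # (a KL-based test could replace the fixed count)
            p = optimize_p(f, T)
        if entropy(p) < eps:           # soft labels have hardened: stop
            break
        T /= rate                      # cool, reseeding with the current p
    return f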
Alternating Convex Optimization

At any T, optimize f keeping p fixed:
- Representer theorem: f(x) = Σ_{i=1}^l α_i K(x, x_i) + Σ_{j=1}^u α'_j K(x, x'_j)
- Minimize the weighted regularized loss using standard tricks.

At any T, optimize p keeping f fixed:

  p*_j = 1 / (1 + e^{(g_j − ν)/T}),   where g_j = λ' [ V(f(x'_j)) − V(−f(x'_j)) ]

  Obtain ν by solving  (1/u) Σ_{j=1}^u 1 / (1 + e^{(g_j − ν)/T}) = r

Stopping conditions:
- At any T, alternate until KL(p_new || p_old) < ε. Obtain p*_T.
- Reduce T, seeding with the old p*_T, until H(p*_T) < ε.
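A minimal NumPy/SciPy sketch of the p-step (illustrative; it assumes the hinge loss for V and uses brentq to find the ν enforcing the balance constraint):

import numpy as np
from scipy.optimize import brentq

def update_p(f_unlab, T, r, lam_prime):
    """Closed-form p-step at temperature T with the balance constraint.
    g_j = λ' [V(f(x'_j)) - V(-f(x'_j))] with hinge loss V(t) = max(0, 1 - t);
    p_j = 1 / (1 + exp((g_j - ν)/T)), ν chosen so that mean(p) = r."""
    hinge = lambda t: np.maximum(0.0, 1.0 - t)
    g = lam_prime * (hinge(f_unlab) - hinge(-f_unlab))

    def balance(nu):                              # mean(p(ν)) - r, monotone in ν
        return np.mean(1.0 / (1.0 + np.exp((g - nu) / T))) - r

    lo, hi = g.min() - 10 * T, g.max() + 10 * T   # bracket the root
    nu = brentq(balance, lo, hi)
    return 1.0 / (1.0 + np.exp((g - nu) / T))

# Example with illustrative values of f on unlabeled points:
p = update_p(np.array([-1.2, 0.3, 2.0, -0.1]), T=1.0, r=0.5, lam_prime=1.0)
print(p, p.mean())   # mean(p) ≈ 0.5 by construction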
How the Effective Loss Deforms as a Function of T

[Figure: the effective loss over unlabeled examples at decreasing temperatures T, for (a) hinge loss, (b) quadratic hinge loss, (c) squared loss, (d) logistic loss; as T decreases, the smoothed loss approaches the non-convex effective loss.]
Effective Loss in JTSVM, ∇TSVM with respect to λ'

  J_{λ'}(f) = (λ/2) ||f||_K^2 + (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + (λ'/u) Σ_{j=1}^u V'(f(x'_j))

[Figure: the effective unlabeled loss for (a) JTSVM and (b) ∇TSVM as λ' increases.]

Unlabeled examples outside the margin do not influence the decision boundary!
Deterministic Annealing: Some Quick Comments

Smoothing: At high T, spurious and shallow local minima are smoothed away.

Simulated Annealing: Stochastic search allowing "uphill" moves depending on T. The associated Markov process converges slowly to the Gibbs distribution at equilibrium, which minimizes E_p J − T H(p) (the free energy). As T → 0 very slowly, the global solution is guaranteed (in probability). DA retains annealing but avoids stochastic search by directly optimizing E_p J − T H(p) over p.

Maximum Entropy: E_p J − T H(p) is the Lagrangian of max_p H(p) subject to E_p J = β.

Proven Heuristic: Very strong record of empirical success, including in clustering, classification, and compression problems. For SSL, it has been applied with EM in [Nigam, 2001].
First Experiments

Datasets: 2moons and 2circles. Number of successes in 10 trials, with l = 2 labeled examples:

  Algorithm  | 2MOONS | 2CIRCLES
  JTSVM (l2) |   0    |    1
  JTSVM (l1) |   0    |    1
  ∇TSVM      |   3    |    2
  DA (l2)    |   6    |    3
  DA (l1)    |  10    |   10

Used RBF kernels, optimal parameters.
Real-world Datasets

λ' = 1; λ, σ optimized; averaged over 10 random splits. Annealing schedule: T = 10 / 1.5^i, i = 0, 1, ...

Error rates on unlabeled data:

  Method  | USPS2 | COIL6 | PCMAC | ESET2
  SVM     |  7.5  | 21.5  | 18.9  | 19.4
  JTSVM   |  7.6  | 19.9  | 10.4  |  9.2
  ∇TSVM   |  6.9  | 21.4  |  5.4  |  8.7
  DA(l2)  |  6.4  | 13.6  |  5.3  |  8.1
  DA(sqr) |  5.7  | 13.8  |  5.4  |  9.0

Error rates on test data:

  Method  | USPS2 | COIL6 | PCMAC | ESET2
  SVM     |  7.8  | 21.9  | 17.9  | 19.7
  JTSVM   |  7.2  | 21.2  |  7.0  |  8.9
  ∇TSVM   |  7.1  | 21.6  |  4.5  |  9.1
  DA(l2)  |  6.3  | 15.0  |  4.8  |  8.5
  DA(sqr) |  6.3  | 15.2  |  4.7  |  9.4
Large Scale Text Categorization

UseNet articles from two discussion groups: Auto vs. Aviation. Used special primal routines for linear kernels [Keerthi and Decoste, 2005]. More results in [SK, SIGIR 06]. #features = 20707, #training = 35543, #test = 35587.

[Figure: minimum objective-function value achieved by DA and TSVM as the number of labeled examples grows from 45 to 2848.]
[Figure: test error of DA and TSVM as a function of the number of labeled examples (45 to 2848).]
[Figure: importance of annealing. Test error rate vs. number of labeled examples (45 to 2848) for full DA and for fixed temperatures T = 1, T = 0.1, and T = 0.001.]
Importance of Annealing

Error rates on unlabeled data:

  Method    | USPS2 | COIL6 | PCMAC | ESET2
  DA        |  6.4  | 13.6  |  5.3  |  8.1
  T = 0.1   |  6.6  | 20.0  |  5.7  |  7.8
  T = 0.01  |  7.6  | 20.1  |  7.1  |  8.1
  T = 0.001 |  7.9  | 20.3  |  9.1  |  8.8

Error rates on test data:

  Method    | USPS2 | COIL6 | PCMAC | ESET2
  DA        |  6.3  | 15.0  |  4.8  |  8.5
  T = 0.1   |  6.8  | 21.0  |  4.7  |  8.0
  T = 0.01  |  7.0  | 21.3  |  5.7  |  8.5
  T = 0.001 |  7.2  | 21.5  |  7.3  |  8.8
Summary and Open Questions
Summary
- A new optimization method that better approaches the global solution for TSVM-like SSL.
- An "easy" to "hard" approach.
- Can use off-the-shelf optimization subroutines.

Open Questions
- Intriguing connections between annealing behaviour, loss function and regularization.
- Annealing sequence?
- Detailed experimental studies.

Also see: A Continuation Method for Semi-supervised SVMs, O. Chapelle, M. Chi, A. Zien, ICML 2006.