Deterministic Annealing for Semi-supervised Kernel Machines

Vikas Sindhwani (1), Sathiya Keerthi (2), Olivier Chapelle (3)
(1) University of Chicago, (2) Yahoo! Research, (3) Max Planck Institute, Tübingen

ICML 2006

V. Vapnik's Transductive SVM idea

Suppose, for a binary classification problem, we have
- l labeled examples {(x_i, y_i)}_{i=1}^l, x_i ∈ X, y_i ∈ {−1, +1}
- u unlabeled examples {x'_j}_{j=1}^u

Denote y' = (y'_1, ..., y'_u) the unknown labels. Train an SVM while optimizing the unknown labels: solve, over f ∈ H_K : X → R and y' ∈ {−1, +1}^u,

  \min_{f, y'} \;
  \underbrace{\frac{\lambda}{2}\|f\|_K^2}_{\text{regularizer}}
  + \underbrace{\frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))}_{\text{labeled loss}}
  + \underbrace{\frac{\lambda'}{u}\sum_{j=1}^{u} V\big(y'_j, f(x'_j)\big)}_{\text{unlabeled loss}}

(the regularizer and labeled-loss terms together form the standard SVM objective), subject to

  \frac{1}{u}\sum_{j=1}^{u} \max(0, y'_j) = r \qquad \text{(positive class ratio)}
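To make the objective concrete, here is a minimal sketch (illustrative, not from the talk) of evaluating the TSVM objective for a candidate labeling y' of the unlabeled examples, assuming the hinge loss V(y, t) = max(0, 1 − y t); the function and variable names are made up for illustration.

```python
import numpy as np

def hinge(y, t):
    """Hinge loss V(y, t) = max(0, 1 - y*t), elementwise."""
    return np.maximum(0.0, 1.0 - y * t)

def tsvm_objective(f_lab, f_unlab, y_lab, y_unlab, norm_f_sq, lam, lam_prime):
    """TSVM objective for a fixed f and a candidate labeling y_unlab in {-1,+1}^u."""
    reg = 0.5 * lam * norm_f_sq                              # (lambda/2) ||f||_K^2
    labeled = hinge(y_lab, f_lab).mean()                     # (1/l) sum_i V(y_i, f(x_i))
    unlabeled = lam_prime * hinge(y_unlab, f_unlab).mean()   # (lambda'/u) sum_j V(y'_j, f(x'_j))
    return reg + labeled + unlabeled

def balance_ok(y_unlab, r, tol=1e-8):
    """Positive-class-ratio constraint: (1/u) sum_j max(0, y'_j) = r."""
    return abs(np.maximum(0, y_unlab).mean() - r) < tol
```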


Equivalent Continuous Optimization Problem

  \min_{f, y'} \; J(f, y') =
  \frac{\lambda}{2}\|f\|_K^2
  + \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))
  + \frac{\lambda'}{u}\sum_{j=1}^{u} V\big(y'_j, f(x'_j)\big)

Minimizing over the discrete labels y' first gives an equivalent problem over f alone:

  \min_{f} \; J(f) =
  \frac{\lambda}{2}\|f\|_K^2
  + \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))
  + \frac{\lambda'}{u}\sum_{j=1}^{u}
    \underbrace{\min\big[ V\big(+1, f(x'_j)\big),\, V\big(-1, f(x'_j)\big) \big]}_{\text{effective loss } V'(f(x'_j))}
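The effective loss is just the pointwise minimum over the two possible label assignments. A short sketch, assuming the hinge loss for illustration:

```python
import numpy as np

def hinge(y, t):
    return np.maximum(0.0, 1.0 - y * t)

def effective_loss(t, V=hinge):
    """V'(f(x')) = min( V(+1, f(x')), V(-1, f(x')) ).
    For the hinge loss this equals max(0, 1 - |t|): zero outside the margin
    and peaked at t = 0, hence non-convex."""
    return np.minimum(V(+1.0, t), V(-1.0, t))

# The loss is largest when f(x') is near 0, i.e. when the decision surface
# passes close to the unlabeled point.
print(effective_loss(np.linspace(-2.0, 2.0, 9)))
```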

Effective Loss Function Over Unlabeled Examples

[Figure: the effective loss V'(f(x)) plotted against f(x) for four base losses: (a) hinge loss, (b) quadratic hinge loss, (c) squared loss, (d) logistic loss. In each case the resulting effective loss is non-convex.]

Penalty if decision surface gets too close to unlabeled examples.

This idea implements a common assumption for SSL...

Low-Density Separation Assumption
The true decision boundary passes through a region containing a low volume of data. This implements the prior knowledge/assumption that

  \int_{B(f)} P(x)\, dx \ \ \text{is small}, \qquad \text{where } B(f) = \{x : |f(x)| < 1\}

(see the empirical sketch below).

Cluster Assumption
Points in a data cluster belong to the same class.
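One way to read the low-density condition empirically (an illustrative sketch, not part of the talk) is to approximate the integral by the fraction of sample points that fall inside the margin region B(f):

```python
import numpy as np

def margin_density(f_values):
    """Empirical proxy for the mass of B(f) = {x : |f(x)| < 1}: the fraction of
    points whose score lies strictly inside the margin."""
    return np.mean(np.abs(np.asarray(f_values)) < 1.0)

# Under the low-density separation assumption, a good boundary leaves few
# points with |f(x)| < 1, so this quantity should be small.
```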

Solution Strategies

JTSVM [Joachims, 1998]: Label the unlabeled data using a supervised SVM, then alternate: optimize f given the current y'; optimize y' by switching a pair of labels (see the simplified sketch below).

∇TSVM [Chapelle and Zien, 2005]: Use differentiable losses (quadratic hinge loss over labeled examples and a Gaussian loss over unlabeled examples) and apply gradient descent.

Related work: [Bennett & Demiriz, 1998], [Fung & Mangasarian, 2001], [Collobert, Sinz, Weston, Bottou, 2005], [Gärtner, Le, Burton, Smola, Vishwanathan, 2005]
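As a rough illustration of the label-switching step, the caricature below swaps one positive/negative pair of current unlabeled labels whenever the swap lowers the unlabeled hinge loss. Joachims' actual switching criterion is stated in terms of the SVM slack variables; this simplified version only conveys the idea. Swapping a +/− pair preserves the class-ratio constraint.

```python
import numpy as np

def switch_one_pair(y_unlab, f_unlab):
    """Find indices (i, j) with y'_i = +1, y'_j = -1 whose label swap lowers the
    total unlabeled hinge loss; swap them in place and return True, or return
    False if no improving pair exists. y_unlab: numpy int array in {-1, +1}."""
    hinge = lambda y, t: max(0.0, 1.0 - y * t)
    for i in np.where(y_unlab == +1)[0]:
        for j in np.where(y_unlab == -1)[0]:
            before = hinge(+1, f_unlab[i]) + hinge(-1, f_unlab[j])
            after = hinge(-1, f_unlab[i]) + hinge(+1, f_unlab[j])
            if after < before:
                y_unlab[i], y_unlab[j] = -1, +1
                return True
    return False
```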


Objective Function

Non-convexity can hurt empirical performance.

[Figure: a two-dimensional dataset (left) and the objective value plotted against a decision-boundary parameter θ (right).]

Error rates on COIL 6: SVM 21.9, JTSVM 21.2, ∇TSVM 21.6

Deterministic Annealing: Intuition

Question: What should the shape of the loss function be so that the decision boundary locally evolves in a desirable manner?

[Figure: a two-dimensional dataset (left) and a candidate loss over unlabeled examples plotted against f(x) (right); successive animation frames show the loss being deformed.]

Key Idea: Deform the loss function (objective) as the optimization proceeds... somehow!


Deterministic Annealing as a Homotopy Method

Work with a family of objective functions J_T. Smoothly deform an "easy" (convex) function J_{T_1} into the given "hard" function J_{T_2} = J by varying T. Track minimizers along the deformation path. DA is a specific implementation of this idea.
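In code, a homotopy/continuation scheme is just an outer loop over the deformation parameter that warm-starts each solve from the previous minimizer. A generic sketch (the solver and the schedule are placeholders, not the talk's implementation):

```python
def homotopy_minimize(solve_at, schedule, x_init):
    """Track minimizers of a deforming objective J_T along a schedule of T values.

    solve_at(T, x0) -> a (local) minimizer of J_T, warm-started at x0
    schedule        -> iterable of T values, from the 'easy' end to the 'hard' end
    """
    x = x_init
    for T in schedule:
        x = solve_at(T, x)   # the previous minimizer seeds the next solve
    return x
```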

Deterministic Annealing for Semi-supervised SVMs

Another Equivalent Continuous Optimization Problem
"Relax" y' to p = (p_1, ..., p_u), where p_j is (like) the probability that y'_j = 1.

  J(f, p) = E_p J(f, y') =
  \frac{\lambda}{2}\|f\|_K^2
  + \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))
  + \frac{\lambda'}{u}\sum_{j=1}^{u}
    \big[ p_j V\big(+1, f(x'_j)\big) + (1 - p_j) V\big(-1, f(x'_j)\big) \big]

Family of Objective Functions: Average Cost − T · Entropy

  J_T(f, p) = E_p J(f, y') - T H(p),
  \qquad
  T H(p) = -\frac{T}{u}\sum_{j=1}^{u} \big[ p_j \log p_j + (1 - p_j)\log(1 - p_j) \big]
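A minimal sketch of the temperature-indexed objective restricted to its unlabeled part, with the hinge loss assumed for illustration; the entropy term is the average binary entropy of the soft labels p.

```python
import numpy as np

def hinge(y, t):
    return np.maximum(0.0, 1.0 - y * t)

def binary_entropy(p, eps=1e-12):
    """Elementwise H(p_j) = -p_j log p_j - (1 - p_j) log (1 - p_j)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def unlabeled_part_of_JT(p, f_unlab, lam_prime, T):
    """Expected unlabeled loss under soft labels p minus T times their entropy:
    (lam'/u) sum_j [p_j V(+1, f_j) + (1-p_j) V(-1, f_j)] - (T/u) sum_j H(p_j)."""
    u = len(p)
    expected_loss = (lam_prime / u) * np.sum(
        p * hinge(+1.0, f_unlab) + (1.0 - p) * hinge(-1.0, f_unlab))
    return expected_loss - (T / u) * np.sum(binary_entropy(p))
```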

Deterministic Annealing for Semi-supervised SVMs

Full optimization problem at temperature T:

  \min_{f, p} J_T(f, p) =
  \frac{\lambda}{2}\|f\|_K^2
  + \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))
  + \frac{\lambda'}{u}\sum_{j=1}^{u}
    \big[ p_j V\big(+1, f(x'_j)\big) + (1 - p_j) V\big(-1, f(x'_j)\big) \big]
  + \frac{T}{u}\sum_{j=1}^{u} \big[ p_j \log p_j + (1 - p_j)\log(1 - p_j) \big]

  \text{s.t.} \quad \frac{1}{u}\sum_{j=1}^{u} p_j = r

Deformation: T controls the non-convexity of J_T(f, p). At T = 0, it reduces to the original non-convex objective J(f, p).
Optimization at T: (f*_T, p*_T) = argmin_{f,p} J_T(f, p)
Annealing: return f* = lim_{T→0} f*_T
Balance constraint: (1/u) Σ_{j=1}^{u} p_j = r

Alternating Convex Optimization

At any T, optimize f keeping p fixed:
  Representer theorem: f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) + \sum_{j=1}^{u} \alpha'_j K(x, x'_j)
  Minimize the weighted regularized loss using standard tricks.

At any T, optimize p keeping f fixed:
  p^*_j = \frac{1}{1 + e^{(g_j - \nu)/T}}, \qquad g_j = \lambda' \big[ V(f(x'_j)) - V(-f(x'_j)) \big]
  Obtain \nu by solving \frac{1}{u}\sum_{j=1}^{u} \frac{1}{1 + e^{(g_j - \nu)/T}} = r

Stopping Conditions
  At any T, alternate until KL(p_new || p_old) < ε; obtain p^*_T.
  Reduce T, seeding with the previous p^*_T, until H(p^*_T) < ε.
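A sketch of the p-step and the stopping check, with the offset ν found by bisection so the soft labels meet the balance constraint (the hinge loss, bracketing bounds, and tolerances are illustrative assumptions, not the talk's exact routine):

```python
import numpy as np

def p_update(f_unlab, lam_prime, T, r, V=lambda t: np.maximum(0.0, 1.0 - t)):
    """Optimal soft labels at temperature T:
    p_j = 1 / (1 + exp((g_j - nu)/T)),  g_j = lam' * (V(f_j) - V(-f_j)),
    with nu chosen so that (1/u) sum_j p_j = r (balance constraint)."""
    g = lam_prime * (V(f_unlab) - V(-f_unlab))

    def p_of(nu):
        z = np.clip((g - nu) / T, -50.0, 50.0)   # clip to avoid overflow in exp
        return 1.0 / (1.0 + np.exp(z))

    lo, hi = g.min() - 50.0 * T, g.max() + 50.0 * T   # p_of(.).mean() increases in nu
    for _ in range(100):                              # plain bisection on nu
        nu = 0.5 * (lo + hi)
        lo, hi = (nu, hi) if p_of(nu).mean() < r else (lo, nu)
    return p_of(nu)

def kl(p_new, p_old, eps=1e-12):
    """KL divergence between two vectors of Bernoulli soft labels (stopping test)."""
    p_new, p_old = np.clip(p_new, eps, 1 - eps), np.clip(p_old, eps, 1 - eps)
    return np.sum(p_new * np.log(p_new / p_old)
                  + (1 - p_new) * np.log((1 - p_new) / (1 - p_old)))
```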

How the Effective Loss Deforms as a Function of T

[Figure: the effective loss over unlabeled examples plotted against f(x) for (a) hinge loss, (b) quadratic hinge loss, (c) squared loss, and (d) logistic loss, at a sequence of decreasing temperatures T.]

Effective Loss in JTSVM and ∇TSVM with respect to λ'

  J_{\lambda'}(f) = \frac{\lambda}{2}\|f\|_K^2
  + \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(x_i))
  + \frac{\lambda'}{u}\sum_{j=1}^{u} V'\big(f(x'_j)\big)

[Figure: the effective loss over an unlabeled example, plotted against f(x), for (a) JTSVM and (b) ∇TSVM, at increasing values of λ'.]

Unlabeled examples outside the margin do not influence the decision boundary!

Deterministic Annealing: Some Quick Comments

Smoothing: At high T, spurious and shallow local minima are smoothed away.

Simulated Annealing: Stochastic search allowing "uphill" moves depending on T. The associated Markov process converges slowly to the Gibbs distribution at equilibrium, which minimizes E_p J − T H(p) (the free energy). As T → 0 very slowly, the global solution is guaranteed (in probability). DA retains the annealing but avoids stochastic search by directly optimizing E_p J − T H(p) over p.

Maximum Entropy: E_p J − T H(p) is the Lagrangian of: max_p H(p) subject to E_p J = β.

Proven Heuristic: Very strong record of empirical success, including in clustering, classification, and compression problems. For SSL, it has been applied with EM in [Nigam, 2001].


First Experiments

[Figure: the 2moons and 2circles toy datasets.]

Number of successes in 10 trials, l = 2:

Algorithm      2 MOONS   2 CIRCLES
JTSVM (l2)        0           1
JTSVM (l1)        0           1
∇TSVM             3           2
DA (l2)           6           3
DA (l1)          10          10

Used RBF kernels, optimal parameters.

Real-world Datasets

λ' = 1; λ, σ optimized; averaged over 10 random splits. Annealing schedule: T = 10/1.5^i, i = 0, 1, ...

Unlab          USPS 2   COIL 6   PCMAC   ESET 2
SVM              7.5     21.5     18.9    19.4
JTSVM            7.6     19.9     10.4     9.2
∇TSVM            6.9     21.4      5.4     8.7
DA (l2)          6.4     13.6      5.3     8.1
DA (sqr)         5.7     13.8      5.4     9.0

Test           USPS 2   COIL 6   PCMAC   ESET 2
SVM              7.8     21.9     17.9    19.7
JTSVM            7.2     21.2      7.0     8.9
∇TSVM            7.1     21.6      4.5     9.1
DA (l2)          6.3     15.0      4.8     8.5
DA (sqr)         6.3     15.2      4.7     9.4
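For reference, the annealing schedule quoted above is just a geometric sequence of temperatures; a one-line illustration (the cutoff of 12 steps is arbitrary):

```python
# T_i = 10 / 1.5**i, i = 0, 1, 2, ...  (geometric cooling)
schedule = [10.0 / 1.5 ** i for i in range(12)]
print(["%.3f" % T for T in schedule])   # 10.000, 6.667, 4.444, ...
```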

Large Scale Text Categorization

UseNet articles from two discussion groups: Auto vs. Aviation. Used special primal routines for linear kernels [Keerthi and Decoste, 2005]. More results in [SK, SIGIR 06]. #features = 20707, #training = 35543, #test = 35587.

[Figure: minimum objective value achieved, plotted against the number of labeled examples (45, 89, 178, 356, 712, 1424, 2848), for DA and TSVM.]

[Figure: test error plotted against the number of labeled examples, for DA and TSVM.]

[Figure: importance of annealing: test error rate plotted against the number of labeled examples for DA and for fixed temperatures T = 1, T = 0.1, and T = 0.001.]

Importance of Annealing

Unlab          USPS 2   COIL 6   PCMAC   ESET 2
DA               6.4     13.6      5.3     8.1
T = 0.1          6.6     20.0      5.7     7.8
T = 0.01         7.6     20.1      7.1     8.1
T = 0.001        7.9     20.3      9.1     8.8

Test           USPS 2   COIL 6   PCMAC   ESET 2
DA               6.3     15.0      4.8     8.5
T = 0.1          6.8     21.0      4.7     8.0
T = 0.01         7.0     21.3      5.7     8.5
T = 0.001        7.2     21.5      7.3     8.8

Summary and Open Questions

Summary
- New optimization method that better approaches the global solution for TSVM-like SSL.
- "Easy" to "hard" approach.
- Can use off-the-shelf optimization subroutines.

Open Questions
- Intriguing connections between annealing behaviour, loss function and regularization.
- Annealing sequence?
- Detailed experimental studies.

Also see: A Continuation Method for Semi-supervised SVMs, O. Chapelle, M. Chi, A. Zien, ICML 2006.