Support Vector Machines for Classification: A Statistical Portrait

Yoonkyung Lee
Department of Statistics, The Ohio State University

May 27, 2011
The Spring Conference of the Korean Statistical Society, KAIST, Daejeon, Korea

Handwritten digit recognition

Figure: 16 × 16 grayscale images scanned from postal envelopes, courtesy of Hastie, Tibshirani, & Friedman (2001). Cortes & Vapnik (1995) applied the SVM to these data and demonstrated its improved accuracy over decision trees and neural networks.



Classification

◮ Training data {(x_i, y_i), i = 1, . . . , n}

◮ x = (x_1, . . . , x_p) ∈ R^p

◮ y ∈ Y = {1, . . . , k}

◮ Learn a rule φ : R^p → Y from the training data which generalizes to novel cases.

[Figure: scatter plot of the training data in the (x_1, x_2) plane.]

The Bayes decision rule

◮ The 0-1 loss function: L(y, φ(x)) = I(y ≠ φ(x))

◮ (X, Y): a random sample from P(x, y), and p_j(x) = P(Y = j | X = x)

◮ The rule that minimizes the risk R(φ) = E L(Y, φ(X)) = P(Y ≠ φ(X)):
  φ_B(x) = arg max_{j ∈ Y} p_j(x)

◮ The Bayes error rate: R* = R(φ_B) = 1 − E(max_{j ∈ Y} p_j(X))
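As a minimal numerical sketch of these quantities (a hypothetical toy distribution, not from the talk), the code below computes the Bayes rule and Bayes error for a binary problem in which X takes three values:

```python
import numpy as np

# Hypothetical toy distribution: X takes three values; two classes Y in {1, 2}.
p_x = np.array([0.3, 0.4, 0.3])          # marginal P(X = x) for x = 0, 1, 2
p1 = np.array([0.9, 0.5, 0.2])           # p_1(x) = P(Y = 1 | X = x)
p_cond = np.vstack([p1, 1 - p1])         # rows: p_1(x) and p_2(x)

phi_bayes = p_cond.argmax(axis=0) + 1    # Bayes rule: arg max_j p_j(x)
bayes_error = 1 - np.sum(p_x * p_cond.max(axis=0))   # R* = 1 - E[max_j p_j(X)]

print(phi_bayes)     # [1 1 2]
print(bayes_error)   # 0.29
```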

Two approaches to classification

◮ Probability based plug-in rules (soft classification): φ̂(x) = arg max_{j ∈ Y} p̂_j(x)
  e.g. logistic regression, density estimation (LDA, QDA), ...
  R(φ̂) − R* ≤ 2 E max_{j ∈ Y} |p_j(X) − p̂_j(X)|

◮ Error minimization (hard classification): Find φ ∈ F minimizing
  R_n(φ) = (1/n) Σ_{i=1}^n L(y_i, φ(x_i)).
  e.g. large margin classifiers (support vector machine, boosting, ...)

Discriminant function



It is often much easier to find a real-valued discriminant function f(x) first and obtain a classification rule φ(x) through f.

For instance, in the binary setting:
◮ Y = {−1, +1} (symmetric labels)
◮ Classification rule: φ(x) = sign(f(x)) for a discriminant function f
◮ Classification boundary: {x | f(x) = 0}
◮ y f(x) > 0 indicates a correct decision for (x, y).

Linearly separable case

[Figure: scatter plot of two linearly separable classes in the (x_1, x_2) plane.]

Perceptron algorithm

Rosenblatt (1958), The perceptron: A probabilistic model for information storage and organization in the brain.

◮ Find a separating hyperplane by sequentially updating β and β_0 of a linear classifier, φ(x) = sign(β′x + β_0).

◮ Step 1. Initialize β^(0) = 0 and β_0^(0) = 0.
  Step 2. While there is a misclassified point, i.e. y_i(β^(m−1)′ x_i + β_0^(m−1)) ≤ 0, repeat for m = 1, 2, . . .:
  ◮ Choose a misclassified point (x_i, y_i).
  ◮ Update β^(m) = β^(m−1) + y_i x_i and β_0^(m) = β_0^(m−1) + y_i.

◮ (Novikoff) The algorithm terminates within ⌊(R² + 1)(b² + 1)/δ²⌋ iterations, where R = max_i ‖x_i‖ and δ = min_i y_i(w′x_i + b) > 0 for some w ∈ R^p with ‖w‖ = 1 and b ∈ R.
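A minimal Python sketch of these updates (assuming linearly separable inputs so that the loop terminates; the function name and iteration cap are illustrative):

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """X: (n, p) array; y: labels in {-1, +1}. Returns (beta, beta_0)."""
    n, p = X.shape
    beta, beta_0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        # indices of misclassified points: y_i (beta' x_i + beta_0) <= 0
        wrong = [i for i in range(n) if y[i] * (X[i] @ beta + beta_0) <= 0]
        if not wrong:
            break                      # separating hyperplane found
        i = wrong[0]                   # choose a misclassified point (x_i, y_i)
        beta = beta + y[i] * X[i]      # beta^(m) = beta^(m-1) + y_i x_i
        beta_0 = beta_0 + y[i]         # beta_0^(m) = beta_0^(m-1) + y_i
    return beta, beta_0
```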

Optimal separating hyperplane

[Figure: two classes in the (x_1, x_2) plane with the hyperplane β′x + β_0 = 0 and the margin hyperplanes β′x + β_0 = ±1; the margin between them has width 2/‖β‖.]

Support Vector Machines

Boser, Guyon, & Vapnik (1992), A training algorithm for optimal margin classifiers.
Vapnik (1995), The Nature of Statistical Learning Theory.

◮ Find "the separating hyperplane with the maximum margin":
  f(x) = β′x + β_0 minimizing ‖β‖² subject to y_i f(x_i) ≥ 1 for all i = 1, . . . , n

◮ Classification rule: φ(x) = sign(f(x))
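As a hedged illustration (not the original algorithm of the papers above), the sketch below approximates the maximum-margin hyperplane on separable toy data using scikit-learn's SVC with a very large cost parameter, which effectively forbids margin violations:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1
               rng.normal(+2, 0.5, (20, 2))])   # class +1
y = np.array([-1] * 20 + [+1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # huge C approximates the hard margin
beta, beta_0 = svm.coef_.ravel(), svm.intercept_[0]
print("margin width 2/||beta|| =", 2 / np.linalg.norm(beta))
print("phi(x) = sign(f(x)):", svm.predict([[0.0, 3.0]]))
```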

Why large margin?



◮ Vapnik's justification for large margin:
  - The complexity of separating hyperplanes is inversely related to the margin.
  - Algorithms that maximize the margin can be expected to produce lower test error rates.

◮ A form of regularization: cf. ridge regression, LASSO, smoothing splines, Tikhonov regularization

Non-separable case



◮ Relax the separability condition to y_i f(x_i) ≥ 1 − ξ_i by introducing slack variables ξ_i ≥ 0 (a common technique in constrained optimization).

◮ Take ξ_i (proportional to how far x_i falls on the wrong side of the margin y f(x) = 1) as a loss.

◮ Find f(x) = β′x + β_0 minimizing
  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + (λ/2) ‖β‖²

◮ Hinge loss: L(y, f(x)) = (1 − y f(x))_+, where (t)_+ = max(t, 0).
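A minimal sketch of the resulting objective as a Python function (the names are illustrative):

```python
import numpy as np

def svm_objective(beta, beta_0, X, y, lam):
    """(1/n) sum_i (1 - y_i f(x_i))_+ + (lam/2) ||beta||^2 for f(x) = beta'x + beta_0."""
    margins = y * (X @ beta + beta_0)          # y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)     # (1 - y_i f(x_i))_+ = slack xi_i
    return hinge.mean() + 0.5 * lam * beta @ beta
```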

Hinge loss

Figure: (1 − y f(x))_+ is a convex upper bound of the misclassification loss, I(y ≠ φ(x)) = [−y f(x)]_* ≤ (1 − y f(x))_+, where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}; both losses are plotted against t = y f(x).

Remarks on hinge loss



◮ Originates from the separability condition, stated as an inequality, and its relaxation.

◮ Taking it as a negative log-likelihood would imply a very unusual probability model.

◮ Yields a more robust method than logistic regression and boosting.

◮ The singularity at y f(x) = 1 leads to a sparse solution.

Computation: quadratic programming

◮ Primal problem: minimize w.r.t. β_0, β, and ξ_i
  (1/n) Σ_{i=1}^n ξ_i + (λ/2) ‖β‖²
  subject to y_i(β′x_i + β_0) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, . . . , n.

◮ Dual problem: maximize w.r.t. α_i (Lagrange multipliers)
  Σ_{i=1}^n α_i − (1/(2nλ)) Σ_{i,j} α_i α_j y_i y_j x_i′ x_j
  subject to 0 ≤ α_i ≤ 1 for i = 1, . . . , n and Σ_{i=1}^n α_i y_i = 0.

◮ β̂ = (1/(nλ)) Σ_{i=1}^n α̂_i y_i x_i (from the KKT conditions)

◮ Support vectors: data points with α̂_i > 0
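A hedged sketch of these objects in scikit-learn (whose cost parameterization differs from the λ used above, but the dual structure is the same): dual_coef_ stores α̂_i y_i for the support vectors, and the fitted coefficient vector is their weighted sum of x_i.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(svm.support_))    # points with alpha_hat_i > 0

# The fitted hyperplane is a weighted sum of the support vectors only.
beta_from_dual = svm.dual_coef_ @ svm.support_vectors_     # sum_i (alpha_i y_i) x_i
print(np.allclose(beta_from_dual, svm.coef_))              # True
```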

Operational properties



◮ The SVM classification rule depends on the support vectors only (sparsity).

◮ This sparsity leads to efficient data reduction and fast evaluation at the testing phase.

◮ Can handle high-dimensional data, even when p ≫ n, as the solution depends on the data only through the inner products x_i′ x_j in the dual formulation.

◮ Need to solve a quadratic programming problem of size n.

Nonlinear SVM

◮ Linear SVM solution: f(x) = Σ_{i=1}^n c_i (x_i′ x) + b

◮ Replace the Euclidean inner product x′t with K(x, t) = Φ(x)′Φ(t) for a mapping Φ from R^p to a higher dimensional 'feature space.'

◮ Nonlinear kernels: K(x, t) = (1 + x′t)^d, exp(−‖x − t‖²/(2σ²)), ...
  e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2) gives K(x, t) = (1 + x′t)².
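A minimal numerical check of the p = 2, d = 2 example above, verifying that the explicit feature map Φ reproduces the polynomial kernel:

```python
import numpy as np

def phi(v):
    """Explicit feature map for p = 2 whose inner product equals (1 + x't)^2."""
    v1, v2 = v
    return np.array([1.0, np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1 ** 2, v2 ** 2, np.sqrt(2) * v1 * v2])

x, t = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(t))       # feature-space inner product Phi(x)'Phi(t)
print((1 + x @ t) ** 2)      # kernel evaluation (1 + x't)^2 -- the same number
```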

Kernels

Aizerman, Braverman, and Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning.

◮ Kernel trick: replace the dot product in linear methods with a kernel.

◮ "Kernelize": kernel LDA, kernel PCA, the kernel k-means algorithm, ...

◮ K(x, t) = Φ(x)′Φ(t): non-negative definite

◮ Closely connected to reproducing kernels. This revelation came at the AMS-IMS-SIAM Summer Conference, Adaptive Selection of Statistical Models and Procedures, Mount Holyoke College, MA, June 1996 (G. Wahba's recollection).
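A minimal numerical illustration of non-negative definiteness (a toy check under assumed settings, not a proof): every Gram matrix of the Gaussian kernel should have non-negative eigenvalues, up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))        # Gram matrix [K(x_i, x_j)]
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # True: non-negative definite (numerically)
```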

Regularization in RKHS

Wahba (1990), Spline Models for Observational Data.

Find f(x) = Σ_{ν=1}^M d_ν φ_ν(x) + h(x) with h ∈ H_K minimizing
(1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ ‖h‖²_{H_K}.

◮ H_K: a reproducing kernel Hilbert space of functions defined on a domain, which can be arbitrary

◮ K(x, t): a reproducing kernel if
  i) K(x, ·) ∈ H_K for each x
  ii) f(x) = ⟨K(x, ·), f(·)⟩_{H_K} for all f ∈ H_K (the reproducing property)

◮ The null space is spanned by {φ_ν}_{ν=1}^M.

◮ J(f) = ‖h‖²_{H_K}: penalty

SVM in general

Find f(x) = b + h(x) with h ∈ H_K minimizing
(1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ ‖h‖²_{H_K}.

◮ The null space: M = 1 and φ_1(x) = 1

◮ Linear SVM: H_K = {h(x) = β′x | β ∈ R^p} with K(x, t) = x′t and ‖h‖²_{H_K} = ‖β′x‖²_{H_K} = ‖β‖²
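A hedged sketch of this general formulation in practice: fitting a Gaussian-kernel SVM on toy data with scikit-learn, whose cost parameter C corresponds to 1/(nλ) in the notation above and whose gamma corresponds to 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # circular class boundary

# Gaussian kernel K(x, t) = exp(-||x - t||^2 / (2 sigma^2)); gamma = 1 / (2 sigma^2)
svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("training accuracy:", svm.score(X, y))
print("f(0, 0) =", svm.decision_function([[0.0, 0.0]])[0])   # f_hat evaluated at the origin
```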

Representer Theorem

Kimeldorf and Wahba (1971), Some results on Tchebycheffian spline functions.

◮ The minimizer f = Σ_{ν=1}^M d_ν φ_ν + h with h ∈ H_K of
  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ ‖h‖²_{H_K}
  has a representation of the form
  f̂(x) = Σ_{ν=1}^M d̂_ν φ_ν(x) + Σ_{i=1}^n ĉ_i K(x_i, x),
  where the second sum is ĥ(x).

◮ ‖ĥ‖²_{H_K} = Σ_{i,j} ĉ_i ĉ_j K(x_i, x_j)
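A hedged numerical check of this representation: the decision function of a fitted kernel SVM can be rebuilt from its coefficients ĉ_i (stored for the support vectors only, since the remaining ĉ_i are zero) and the kernel evaluations K(x_i, x).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

gamma = 1.0
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([[0.5, -0.3]])
k_new = np.exp(-gamma * ((svm.support_vectors_ - x_new) ** 2).sum(axis=1))  # K(x_i, x_new)
f_by_hand = svm.dual_coef_.ravel() @ k_new + svm.intercept_[0]              # b + sum_i c_i K(x_i, x)
print(np.isclose(f_by_hand, svm.decision_function(x_new)[0]))               # True
```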

Implications of the general treatment



◮ Kernelized SVM is a special case of the RKHS method.

◮ The functions K(x_i, ·) form basis functions for f̂.

◮ There is no restriction on the input domain or on the form of the kernel function, as long as the kernel is non-negative definite (by the Moore-Aronszajn theorem).

◮ Kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond Euclidean vector spaces.
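As an illustrative (hypothetical) example of a kernel on a non-numerical domain, a simple k-mer "spectrum" kernel on DNA strings is just the inner product of k-mer count vectors, hence non-negative definite:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[kmer] * ct[kmer] for kmer in cs)

print(spectrum_kernel("ACGTACGT", "CGTACG"))   # 6
```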

Statistical properties

◮ Bayes risk consistent when the space generated by a kernel is sufficiently rich: Lin (2000), Zhang (AOS 2004), Bartlett et al. (JASA 2006)

◮ Population minimizer f* (limiting discriminant function) for
  ◮ Binomial deviance L(y, f(x)) = log(1 + exp(−y f(x))):
    f*(x) = log( p_1(x) / (1 − p_1(x)) )
  ◮ Hinge loss L(y, f(x)) = (1 − y f(x))_+:
    f*(x) = sign{p_1(x) − 1/2}
  (see the numerical sketch after this list)

◮ Designed for prediction only; in general, no probability estimates are available from f̂.

◮ Can be less efficient than probability modeling in reducing the error rate.
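A minimal numerical check of the two population minimizers above at a single hypothetical x with p_1(x) = 0.7 (an illustrative choice, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize_scalar

p1 = 0.7   # P(Y = +1 | X = x), assumed for illustration

# Conditional expected losses as functions of the value f = f(x)
deviance = lambda f: p1 * np.log(1 + np.exp(-f)) + (1 - p1) * np.log(1 + np.exp(f))
hinge = lambda f: p1 * max(0.0, 1 - f) + (1 - p1) * max(0.0, 1 + f)

print(minimize_scalar(deviance, bounds=(-5, 5), method="bounded").x)  # ~log(0.7/0.3) = 0.847
print(minimize_scalar(hinge, bounds=(-5, 5), method="bounded").x)     # ~1 = sign(p1 - 1/2)
```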

SVM vs logistic regression

Figure: Solid: 2p(x) − 1 (true probability); dotted: 2p̂_LR(x) − 1 (logistic regression); dashed: f̂_SVM(x) (SVM), plotted against x.
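A hedged sketch of a simulation in this spirit (the true probability model below is an illustrative assumption, not the one used for the figure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, 500)
p = 1 / (1 + np.exp(-3 * x))                    # assumed true P(Y = +1 | x)
y = np.where(rng.uniform(size=500) < p, 1, -1)
X = x.reshape(-1, 1)

lr = LogisticRegression().fit(X, y)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

grid = np.linspace(-2, 2, 5).reshape(-1, 1)
print(2 * lr.predict_proba(grid)[:, 1] - 1)     # 2*p_hat_LR(x) - 1, tracks 2*p(x) - 1
print(svm.decision_function(grid))              # f_hat_SVM(x): targets sign(p(x) - 1/2), not the probability
```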

Extensions and further developments



◮ Extensions to the multiclass case

◮ Feature selection: make the embedding through the kernel explicit.

◮ Kernel learning

◮ Efficient algorithms for large data sets when the penalty parameter λ is fixed

◮ Characterization of the entire solution path

◮ Beyond classification: regression, novelty detection, clustering, semi-supervised learning, ...

Reference



◮ This talk is based on a book chapter: Lee (2010), "Support Vector Machines for Classification: A Statistical Portrait", in Statistical Methods in Molecular Biology.

◮ See the references therein.

◮ A preliminary version of the manuscript is available on my webpage: http://www.stat.osu.edu/~yklee.