Support Vector Machines for Classification: A Statistical Portrait
Yoonkyung Lee
Department of Statistics, The Ohio State University
May 27, 2011
The Spring Conference of the Korean Statistical Society, KAIST, Daejeon, Korea
Handwritten digit recognition
Figure: 16 × 16 grayscale images scanned from postal envelopes, courtesy of Hastie, Tibshirani, & Friedman (2001). Cortes & Vapnik (1995) applied the SVM to these data and demonstrated its improved accuracy over decision trees and neural networks.
Classification
◮ x = (x_1, ..., x_p) ∈ R^p
◮ y ∈ Y = {1, ..., k}
◮ Training data {(x_i, y_i), i = 1, ..., n}
◮ Learn a rule φ : R^p → Y from the training data, which can be generalized to novel cases.
Figure: Example training data plotted in the (x_1, x_2) plane.
The Bayes decision rule
◮ The 0-1 loss function: L(y, φ(x)) = I(y ≠ φ(x))
◮ (X, Y): a random sample from P(x, y), and p_j(x) = P(Y = j | X = x)
◮ The rule that minimizes the risk R(φ) = E L(Y, φ(X)) = P(Y ≠ φ(X)):
  φ_B(x) = arg max_{j∈Y} p_j(x)
◮ The Bayes error rate: R* = R(φ_B) = 1 − E(max_{j∈Y} p_j(X))
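To make the Bayes rule concrete, here is a small Python sketch of my own (not from the talk) for a hypothetical two-class problem with Gaussian class-conditional densities; the priors and means are made-up illustrations.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical binary problem (illustrative assumptions, not from the talk):
# X | Y=1 ~ N(+1, 1), X | Y=2 ~ N(-1, 1), P(Y=1) = P(Y=2) = 1/2.
priors = {1: 0.5, 2: 0.5}
means = {1: 1.0, 2: -1.0}

def posterior(x):
    """p_j(x) = P(Y = j | X = x) via Bayes' theorem."""
    joint = {j: priors[j] * norm.pdf(x, loc=means[j], scale=1.0) for j in (1, 2)}
    total = sum(joint.values())
    return {j: joint[j] / total for j in (1, 2)}

def bayes_rule(x):
    """phi_B(x) = arg max_j p_j(x)."""
    p = posterior(x)
    return max(p, key=p.get)

# Monte Carlo estimate of R* = 1 - E[max_j p_j(X)], drawing X from its marginal.
rng = np.random.default_rng(0)
labels = rng.choice([1, 2], size=5000, p=[priors[1], priors[2]])
xs = rng.normal(loc=[means[j] for j in labels], scale=1.0)
r_star = 1 - np.mean([max(posterior(x).values()) for x in xs])
print(bayes_rule(0.3), round(r_star, 3))
```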
Two approaches to classification
◮ Probability-based plug-in rules (soft classification): φ̂(x) = arg max_{j∈Y} p̂_j(x)
  e.g. logistic regression, density estimation (LDA, QDA), ...
  R(φ̂) − R* ≤ 2 E max_{j∈Y} |p_j(X) − p̂_j(X)|
◮ Error minimization (hard classification): Find φ ∈ F minimizing
  R_n(φ) = (1/n) Σ_{i=1}^n L(y_i, φ(x_i)).
  e.g. large margin classifiers (support vector machine, boosting, ...)
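As a minimal illustration of the plug-in idea above (my own sketch, not part of the talk), multinomial logistic regression in scikit-learn estimates p̂_j(x) and the rule takes the arg max; the synthetic three-class data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (placeholder for any training set {(x_i, y_i)}).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([1, 2, 3], 50)

# Soft classification: estimate p_j(x) = P(Y = j | X = x), then take arg max.
model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba([[1.0, 1.0]])           # estimated class probabilities
phi_hat = model.classes_[np.argmax(p_hat, axis=1)]  # plug-in rule: arg max_j p_hat_j(x)
print(p_hat, phi_hat)
```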
Discriminant function
◮ It is often much easier to find a real-valued discriminant function f(x) first and obtain a classification rule φ(x) through f.
◮ For instance, in the binary setting:
  ◮ Y = {−1, +1} (symmetric labels)
  ◮ Classification rule: φ(x) = sign(f(x)) for a discriminant function f
  ◮ Classification boundary: {x | f(x) = 0}
  ◮ yf(x) > 0 indicates a correct decision for (x, y).
Linearly separable case
Figure: A two-class data set in the (x_1, x_2) plane that can be separated by a hyperplane.
Perceptron algorithm
Rosenblatt (1958), The perceptron: A probabilistic model for information storage and organization in the brain.
◮ Find a separating hyperplane by sequentially updating β and β_0 of a linear classifier, φ(x) = sign(β'x + β_0).
  Step 1. Initialize β^(0) = 0 and β_0^(0) = 0.
  Step 2. While there is a misclassified point such that y_i(β^(m−1)'x_i + β_0^(m−1)) ≤ 0 for m = 1, 2, ..., repeat
    ◮ Choose a misclassified point (x_i, y_i).
    ◮ Update β^(m) = β^(m−1) + y_i x_i and β_0^(m) = β_0^(m−1) + y_i.
◮ (Novikoff) The algorithm terminates within ⌊(R² + 1)(b² + 1)/δ²⌋ iterations, where R = max_i ‖x_i‖ and δ = min_i y_i(w'x_i + b) > 0 for some w ∈ R^p with ‖w‖ = 1 and b ∈ R.
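The update rule above translates directly into code. Below is a minimal perceptron sketch of my own, with made-up separable data.

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """Perceptron updates: beta <- beta + y_i x_i, beta0 <- beta0 + y_i
    whenever (x_i, y_i) with y_i in {-1, +1} is misclassified."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        margins = y * (X @ beta + beta0)
        misclassified = np.where(margins <= 0)[0]
        if misclassified.size == 0:          # all points on the correct side: done
            return beta, beta0
        i = misclassified[0]                 # pick any misclassified point
        beta += y[i] * X[i]
        beta0 += y[i]
    return beta, beta0                       # may not converge if the data are not separable

# Made-up linearly separable data for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
beta, beta0 = perceptron(X, y)
print("separating hyperplane:", beta, beta0)
```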
Optimal separating hyperplane
Figure: The separating hyperplane β'x + β_0 = 0 and the margin boundaries β'x + β_0 = ±1; the margin between them is 2/‖β‖.
Support Vector Machines
Boser, Guyon, & Vapnik (1992), A training algorithm for optimal margin classifiers.
Vapnik (1995), The Nature of Statistical Learning Theory.
◮ Find "the separating hyperplane with the maximum margin": f(x) = β'x + β_0 minimizing ‖β‖² subject to y_i f(x_i) ≥ 1 for all i = 1, ..., n
◮ Classification rule: φ(x) = sign(f(x))
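As a quick numerical check of the maximum-margin formulation (my own sketch, not from the talk), scikit-learn's SVC with a linear kernel and a very large cost parameter approximates the hard-margin solution; the data below are fabricated and separable.

```python
import numpy as np
from sklearn.svm import SVC

# Fabricated separable two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C emulates the hard-margin constraint y_i f(x_i) >= 1.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]
print("margin 2/||beta|| =", 2 / np.linalg.norm(beta))
print("min y_i f(x_i) =", np.min(y * (X @ beta + beta0)))  # should be close to 1
```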
Why large margin?
◮ Vapnik's justification for large margin:
  - The complexity of separating hyperplanes is inversely related to margin.
  - Algorithms that maximize the margin can be expected to produce lower test error rates.
◮ A form of regularization: e.g. ridge regression, LASSO, smoothing splines, Tikhonov regularization
Non-separable case
◮ Relax the separability condition to y_i f(x_i) ≥ 1 − ξ_i by introducing slack variables ξ_i ≥ 0 (a common technique in constrained optimization).
◮ Take ξ_i (proportional to the distance of x_i from yf(x) = 1) as a loss.
◮ Find f(x) = β'x + β_0 minimizing
  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + (λ/2)‖β‖²
◮ Hinge loss: L(y, f(x)) = (1 − yf(x))_+ where (t)_+ = max(t, 0).
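One way to see this objective in action is to minimize the regularized hinge loss directly by subgradient descent. The sketch below is my own illustration with fabricated data and an assumed step-size schedule; it is not the optimization used by standard SVM software.

```python
import numpy as np

def svm_primal_subgradient(X, y, lam=0.1, n_iter=2000):
    """Minimize (1/n) sum_i (1 - y_i (beta'x_i + beta0))_+ + (lam/2)||beta||^2
    by subgradient descent (illustrative, not a production solver)."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for t in range(1, n_iter + 1):
        margins = y * (X @ beta + beta0)
        active = margins < 1                       # points contributing to the hinge loss
        grad_beta = lam * beta - (y[active, None] * X[active]).sum(axis=0) / n
        grad_beta0 = -y[active].sum() / n
        step = 1.0 / (lam * t)                     # an assumed step-size schedule
        beta -= step * grad_beta
        beta0 -= step * grad_beta0
    return beta, beta0

# Fabricated overlapping two-class data (non-separable case).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
beta, beta0 = svm_primal_subgradient(X, y)
print("training error:", np.mean(np.sign(X @ beta + beta0) != y))
```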
Hinge loss
Figure: (1 − yf(x))_+ is a convex upper bound of the misclassification loss: I(y ≠ φ(x)) = [−yf(x)]_* ≤ (1 − yf(x))_+, where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}. The two losses are plotted against t = yf(x).
Remarks on hinge loss
◮ Originates from the separability condition as an inequality and its relaxation.
◮ Taking it as a negative log likelihood would imply a very unusual probability model.
◮ Yields a robust method compared to logistic regression and boosting.
◮ Singularity at 1 leads to a sparse solution.
Computation: quadratic programming
◮ Primal problem: minimize w.r.t. β_0, β, and ξ_i
  (1/n) Σ_{i=1}^n ξ_i + (λ/2)‖β‖²
  subject to y_i(β'x_i + β_0) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, ..., n.
◮ Dual problem: maximize w.r.t. α_i (Lagrange multipliers)
  Σ_{i=1}^n α_i − (1/(2nλ)) Σ_{i,j} α_i α_j y_i y_j x_i'x_j
  subject to 0 ≤ α_i ≤ 1 and Σ_{i=1}^n α_i y_i = 0 for i = 1, ..., n.
◮ β̂ = (1/(nλ)) Σ_{i=1}^n α̂_i y_i x_i (from the KKT conditions)
◮ Support vectors: data points with α̂_i > 0
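For concreteness, the dual above can be handed to a generic QP solver. The following sketch of mine uses cvxopt (assuming it is available) on fabricated data; specialized SVM solvers are far more efficient in practice.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes the cvxopt package is installed

solvers.options["show_progress"] = False

def svm_dual_qp(X, y, lam=0.1):
    """Solve the dual: max_a sum_i a_i - (1/(2 n lam)) sum_{i,j} a_i a_j y_i y_j x_i'x_j
    subject to 0 <= a_i <= 1 and sum_i a_i y_i = 0, using a generic QP solver."""
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)) / (n * lam))  # quadratic term
    q = matrix(-np.ones(n))                               # linear term (maximize sum a_i)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # -a_i <= 0 and a_i <= 1
    h = matrix(np.hstack([np.zeros(n), np.ones(n)]))
    A = matrix(y.reshape(1, -1))                          # equality constraint y'a = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    beta = (alpha * y) @ X / (n * lam)                    # KKT: beta = (1/(n lam)) sum a_i y_i x_i
    support = alpha > 1e-6                                # support vectors: alpha_i > 0 (tolerance)
    return alpha, beta, support

# Fabricated two-class data for illustration (y coded as +/-1 floats).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1, (30, 2)), rng.normal(-1.5, 1, (30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
alpha, beta, support = svm_dual_qp(X, y)
print("number of support vectors:", int(support.sum()))
```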
Operational properties
◮ The SVM classification rule depends on the support vectors only (sparsity).
◮ The sparsity leads to efficient data reduction and fast evaluation at the testing phase.
◮ Can handle high dimensional data even when p ≫ n, as the solution depends on x only through the inner products x_i'x_j in the dual formulation.
◮ Need to solve a quadratic programming problem of size n.
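The "inner products only" point can be demonstrated by fitting with a precomputed Gram matrix. This is my own sketch with fabricated p ≫ n data, not part of the talk.

```python
import numpy as np
from sklearn.svm import SVC

# Fabricated high dimensional data with p >> n.
rng = np.random.default_rng(0)
n, p = 40, 2000
X = rng.normal(size=(n, p))
X[: n // 2] += 0.2                       # small mean shift for the first class
y = np.repeat([1, -1], n // 2)

# The dual depends on the data only through the n x n Gram matrix of inner products.
gram = X @ X.T
svm_gram = SVC(kernel="precomputed", C=1.0).fit(gram, y)
svm_lin = SVC(kernel="linear", C=1.0).fit(X, y)

X_new = rng.normal(size=(3, p))
print(svm_gram.predict(X_new @ X.T))     # needs only inner products with the training points
print(svm_lin.predict(X_new))            # same predictions as the linear-kernel fit
```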
Nonlinear SVM
◮ Linear SVM solution: f(x) = Σ_{i=1}^n c_i (x_i'x) + b
◮ Replace the Euclidean inner product x't with K(x, t) = Φ(x)'Φ(t) for a mapping Φ from R^p to a higher dimensional 'feature space.'
◮ Nonlinear kernels: K(x, t) = (1 + x't)^d, exp(−‖x − t‖²/2σ²), ...
  e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2) gives K(x, t) = (1 + x't)².
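The feature-map identity for the quadratic kernel is easy to verify numerically; a tiny sketch of my own with arbitrary test points:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x't)^2 with p = 2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, t):
    """Polynomial kernel of degree 2."""
    return (1 + x @ t) ** 2

# Arbitrary points: the two ways of computing the kernel agree.
x, t = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(t), K(x, t))  # same value up to rounding
```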
Kernels
Aizerman, Braverman, and Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning.
◮ Kernel trick: replace the dot product in linear methods with a kernel.
◮ "Kernelize": kernel LDA, kernel PCA, kernel k-means algorithm, ...
◮ K(x, t) = Φ(x)'Φ(t): non-negative definite
◮ Closely connected to reproducing kernels. This revelation came at the AMS-IMS-SIAM Summer Conference, Adaptive Selection of Statistical Models and Procedures, Mount Holyoke College, MA, June 1996 (G. Wahba's recollection).
Regularization in RKHS
Wahba (1990), Spline Models for Observational Data.
Find f(x) = Σ_{ν=1}^M d_ν φ_ν(x) + h(x) with h ∈ H_K minimizing
  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}.
◮ H_K: a reproducing kernel Hilbert space of functions defined on a domain which can be arbitrary
◮ K(x, t): reproducing kernel if i) K(x, ·) ∈ H_K for each x, and ii) f(x) = <K(x, ·), f(·)>_{H_K} for all f ∈ H_K (the reproducing property)
◮ The null space is spanned by {φ_ν}_{ν=1}^M.
◮ J(f) = ‖h‖²_{H_K}: penalty
SVM in general
Find f(x) = b + h(x) with h ∈ H_K minimizing
  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖h‖²_{H_K}.
◮ The null space: M = 1 and φ_1(x) = 1
◮ Linear SVM: H_K = {h(x) = β'x | β ∈ R^p} with K(x, t) = x't and ‖h‖²_{H_K} = ‖β'x‖²_{H_K} = ‖β‖²
Representer Theorem
Kimeldorf and Wahba (1971), Some results on Tchebycheffian Spline Functions.
◮ The minimizer f = Σ_{ν=1}^M d_ν φ_ν + h with h ∈ H_K of
  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}
  has a representation of the form
  f̂(x) = Σ_{ν=1}^M d̂_ν φ_ν(x) + Σ_{i=1}^n ĉ_i K(x_i, x),
  where the second sum is ĥ(x).
◮ ‖h‖²_{H_K} = Σ_{i,j} ĉ_i ĉ_j K(x_i, x_j)
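As an illustration of this kernel-expansion form (my own sketch, not from the talk), a fitted kernel SVM in scikit-learn exposes the coefficients ĉ_i (dual_coef_, nonzero only at support vectors) and the offset, so the decision function can be rebuilt by hand:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Fabricated two-class data with a circular boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)

gamma = 0.5
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Decision function rebuilt from the representer-theorem form:
# f(x) = sum_i c_i K(x_i, x) + b, with nonzero c_i only at support vectors.
X_new = rng.normal(size=(5, 2))
K_new = rbf_kernel(X_new, svm.support_vectors_, gamma=gamma)
f_manual = K_new @ svm.dual_coef_.ravel() + svm.intercept_[0]
print(np.allclose(f_manual, svm.decision_function(X_new)))  # True
```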
Implications of the general treatment
◮ Kernelized SVM is a special case of the RKHS method.
◮ The K(x_i, ·) form basis functions for f.
◮ There is no restriction on the input domain or the form of the kernel function as long as the kernel is non-negative definite (by the Moore-Aronszajn theorem).
◮ Kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond Euclidean vector spaces.
Statistical properties
◮ Bayes risk consistent when the space generated by a kernel is sufficiently rich. Lin (2000), Zhang (AOS 2004), Bartlett et al. (JASA 2006)
◮ Population minimizer f* (limiting discriminant function) for
  ◮ Binomial deviance L(y, f(x)) = log(1 + exp(−yf(x))): f*(x) = log[p_1(x)/(1 − p_1(x))]
  ◮ Hinge loss L(y, f(x)) = (1 − yf(x))_+: f*(x) = sign{p_1(x) − 1/2}
◮ Designed for prediction only; no probability estimates are available from f̂ in general.
◮ Can be less efficient than probability modeling in reducing the error rate.
SVM vs logistic regression
Figure: Solid: 2p(x) − 1, dotted: 2p̂_LR(x) − 1, and dashed: f̂_SVM(x), plotted against x.
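A figure of this kind can be reproduced along the following lines (my own sketch; the true p(x), sample size, and tuning settings are made up): fit both methods to simulated data and compare f̂_SVM(x) with 2p̂_LR(x) − 1 and the true 2p(x) − 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Made-up true model: p(x) = P(Y = 1 | X = x) for a one-dimensional x.
def p(x):
    return 1 / (1 + np.exp(-3 * x))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=500)
y = np.where(rng.uniform(size=500) < p(x), 1, -1)
X = x.reshape(-1, 1)

lr = LogisticRegression().fit(X, y)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

grid = np.linspace(-2, 2, 9).reshape(-1, 1)
true_curve = 2 * p(grid.ravel()) - 1                # 2p(x) - 1
lr_curve = 2 * lr.predict_proba(grid)[:, 1] - 1     # 2p_hat_LR(x) - 1
svm_curve = svm.decision_function(grid)             # f_hat_SVM(x), roughly sign(p - 1/2) in shape
for g, t, l, s in zip(grid.ravel(), true_curve, lr_curve, svm_curve):
    print(f"x={g:+.1f}  2p-1={t:+.2f}  2p_LR-1={l:+.2f}  f_SVM={s:+.2f}")
```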
Extensions and further developments
◮ Extensions to the multiclass case
◮ Feature selection: Make the embedding through the kernel explicit.
◮ Kernel learning
◮ Efficient algorithms for large data sets when the penalty parameter λ is fixed
◮ Characterization of the entire solution path
◮ Beyond classification: regression, novelty detection, clustering, semi-supervised learning, ...
Reference
◮ This talk is based on a book chapter: Lee (2010), Support Vector Machines for Classification: A Statistical Portrait, in Statistical Methods in Molecular Biology.
◮ See references therein.
◮ A preliminary version of the manuscript is available on my webpage: http://www.stat.osu.edu/~yklee.