Support Vector Machines for Classification: A Statistical Portrait
Yoonkyung Lee
Department of Statistics, The Ohio State University
May 27, 2011
The Spring Conference of the Korean Statistical Society, KAIST, Daejeon, Korea
Handwritten digit recognition
Figure: 16 × 16 grayscale images scanned from postal envelopes, courtesy of Hastie, Tibshirani, & Friedman (2001). Cortes & Vapnik (1995) applied the SVM to these data and demonstrated its improved accuracy over decision trees and neural networks.
Classification
◮ x = (x_1, ..., x_p) ∈ R^p
◮ y ∈ Y = {1, ..., k}
◮ Training data {(x_i, y_i), i = 1, ..., n}
◮ Learn a rule φ : R^p → Y from the training data, which can be generalized to novel cases.
Figure: Example training data plotted in the (x_1, x_2) plane.
The Bayes decision rule
◮ The 0-1 loss function: L(y, φ(x)) = I(y ≠ φ(x))
◮ (X, Y): a random sample from P(x, y), and p_j(x) = P(Y = j | X = x)
◮ The rule that minimizes the risk R(φ) = E L(Y, φ(X)) = P(Y ≠ φ(X)):
  φ_B(x) = arg max_{j∈Y} p_j(x)
◮ The Bayes error rate: R* = R(φ_B) = 1 − E(max_{j∈Y} p_j(X))
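To make the Bayes rule concrete, here is a small Python sketch of my own (not from the talk) for a hypothetical two-class problem with Gaussian class-conditional densities; the priors and means are made-up illustrations.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical binary problem (illustrative assumptions, not from the talk):
# X | Y=1 ~ N(+1, 1), X | Y=2 ~ N(-1, 1), P(Y=1) = P(Y=2) = 1/2.
priors = {1: 0.5, 2: 0.5}
means = {1: 1.0, 2: -1.0}

def posterior(x):
    """p_j(x) = P(Y = j | X = x) via Bayes' theorem."""
    joint = {j: priors[j] * norm.pdf(x, loc=means[j], scale=1.0) for j in (1, 2)}
    total = sum(joint.values())
    return {j: joint[j] / total for j in (1, 2)}

def bayes_rule(x):
    """phi_B(x) = arg max_j p_j(x)."""
    p = posterior(x)
    return max(p, key=p.get)

# Monte Carlo estimate of R* = 1 - E[max_j p_j(X)], drawing X from its marginal.
rng = np.random.default_rng(0)
labels = rng.choice([1, 2], size=5000, p=[priors[1], priors[2]])
xs = rng.normal(loc=[means[j] for j in labels], scale=1.0)
r_star = 1 - np.mean([max(posterior(x).values()) for x in xs])
print(bayes_rule(0.3), round(r_star, 3))
```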
Two approaches to classification
◮ Probability-based plug-in rules (soft classification): φ̂(x) = arg max_{j∈Y} p̂_j(x)
  e.g. logistic regression, density estimation (LDA, QDA), ...
  R(φ̂) − R* ≤ 2 E max_{j∈Y} |p_j(X) − p̂_j(X)|
◮ Error minimization (hard classification): Find φ ∈ F minimizing
  R_n(φ) = (1/n) Σ_{i=1}^n L(y_i, φ(x_i)).
  e.g. large margin classifiers (support vector machine, boosting, ...)
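As a minimal illustration of the plug-in idea above (my own sketch, not part of the talk), multinomial logistic regression in scikit-learn estimates p̂_j(x) and the rule takes the arg max; the synthetic three-class data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (placeholder for any training set {(x_i, y_i)}).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([1, 2, 3], 50)

# Soft classification: estimate p_j(x) = P(Y = j | X = x), then take arg max.
model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba([[1.0, 1.0]])           # estimated class probabilities
phi_hat = model.classes_[np.argmax(p_hat, axis=1)]  # plug-in rule: arg max_j p_hat_j(x)
print(p_hat, phi_hat)
```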
Discriminant function
◮ It is often much easier to find a real-valued discriminant function f(x) first and obtain a classification rule φ(x) through f.
◮ For instance, in the binary setting:
  ◮ Y = {−1, +1} (symmetric labels)
  ◮ Classification rule: φ(x) = sign(f(x)) for a discriminant function f
  ◮ Classification boundary: {x | f(x) = 0}
  ◮ yf(x) > 0 indicates a correct decision for (x, y).
Linearly separable case
Figure: A two-class data set in the (x_1, x_2) plane that can be separated by a hyperplane.
Perceptron algorithm
Rosenblatt (1958), The perceptron: A probabilistic model for information storage and organization in the brain.
◮ Find a separating hyperplane by sequentially updating β and β_0 of a linear classifier, φ(x) = sign(β'x + β_0).
  Step 1. Initialize β^(0) = 0 and β_0^(0) = 0.
  Step 2. While there is a misclassified point such that y_i(β^(m−1)'x_i + β_0^(m−1)) ≤ 0 for m = 1, 2, ..., repeat
    ◮ Choose a misclassified point (x_i, y_i).
    ◮ Update β^(m) = β^(m−1) + y_i x_i and β_0^(m) = β_0^(m−1) + y_i.
◮ (Novikoff) The algorithm terminates within ⌊(R² + 1)(b² + 1)/δ²⌋ iterations, where R = max_i ‖x_i‖ and δ = min_i y_i(w'x_i + b) > 0 for some w ∈ R^p with ‖w‖ = 1 and b ∈ R.
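The update rule above translates directly into code. Below is a minimal perceptron sketch of my own, with made-up separable data.

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """Perceptron updates: beta <- beta + y_i x_i, beta0 <- beta0 + y_i
    whenever (x_i, y_i) with y_i in {-1, +1} is misclassified."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        margins = y * (X @ beta + beta0)
        misclassified = np.where(margins <= 0)[0]
        if misclassified.size == 0:          # all points on the correct side: done
            return beta, beta0
        i = misclassified[0]                 # pick any misclassified point
        beta += y[i] * X[i]
        beta0 += y[i]
    return beta, beta0                       # may not converge if the data are not separable

# Made-up linearly separable data for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
beta, beta0 = perceptron(X, y)
print("separating hyperplane:", beta, beta0)
```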
Optimal separating hyperplane
Figure: The separating hyperplane β'x + β_0 = 0 and the margin boundaries β'x + β_0 = ±1; the margin between them is 2/‖β‖.
Support Vector Machines
Boser, Guyon, & Vapnik (1992), A training algorithm for optimal margin classifiers.
Vapnik (1995), The Nature of Statistical Learning Theory.
◮ Find "the separating hyperplane with the maximum margin": f(x) = β'x + β_0 minimizing ‖β‖² subject to y_i f(x_i) ≥ 1 for all i = 1, ..., n
◮ Classification rule: φ(x) = sign(f(x))
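As a quick numerical check of the maximum-margin formulation (my own sketch, not from the talk), scikit-learn's SVC with a linear kernel and a very large cost parameter approximates the hard-margin solution; the data below are fabricated and separable.

```python
import numpy as np
from sklearn.svm import SVC

# Fabricated separable two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C emulates the hard-margin constraint y_i f(x_i) >= 1.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]
print("margin 2/||beta|| =", 2 / np.linalg.norm(beta))
print("min y_i f(x_i) =", np.min(y * (X @ beta + beta0)))  # should be close to 1
```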
Why large margin?
◮ Vapnik's justification for large margin:
  - The complexity of separating hyperplanes is inversely related to margin.
  - Algorithms that maximize the margin can be expected to produce lower test error rates.
◮ A form of regularization: e.g. ridge regression, LASSO, smoothing splines, Tikhonov regularization
Non-separable case
◮ Relax the separability condition to y_i f(x_i) ≥ 1 − ξ_i by introducing slack variables ξ_i ≥ 0 (a common technique in constrained optimization).
◮ Take ξ_i (proportional to the distance of x_i from yf(x) = 1) as a loss.
◮ Find f(x) = β'x + β_0 minimizing
  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + (λ/2)‖β‖²
◮ Hinge loss: L(y, f(x)) = (1 − yf(x))_+ where (t)_+ = max(t, 0).
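One way to see this objective in action is to minimize the regularized hinge loss directly by subgradient descent. The sketch below is my own illustration with fabricated data and an assumed step-size schedule; it is not the optimization used by standard SVM software.

```python
import numpy as np

def svm_primal_subgradient(X, y, lam=0.1, n_iter=2000):
    """Minimize (1/n) sum_i (1 - y_i (beta'x_i + beta0))_+ + (lam/2)||beta||^2
    by subgradient descent (illustrative, not a production solver)."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for t in range(1, n_iter + 1):
        margins = y * (X @ beta + beta0)
        active = margins < 1                       # points contributing to the hinge loss
        grad_beta = lam * beta - (y[active, None] * X[active]).sum(axis=0) / n
        grad_beta0 = -y[active].sum() / n
        step = 1.0 / (lam * t)                     # an assumed step-size schedule
        beta -= step * grad_beta
        beta0 -= step * grad_beta0
    return beta, beta0

# Fabricated overlapping two-class data (non-separable case).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
beta, beta0 = svm_primal_subgradient(X, y)
print("training error:", np.mean(np.sign(X @ beta + beta0) != y))
```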
Hinge loss
Figure: (1 − yf(x))_+ is a convex upper bound of the misclassification loss: I(y ≠ φ(x)) = [−yf(x)]_* ≤ (1 − yf(x))_+, where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}. The two losses are plotted against t = yf(x).
Remarks on hinge loss
◮ Originates from the separability condition as an inequality and its relaxation.
◮ Taking it as a negative log likelihood would imply a very unusual probability model.
◮ Yields a robust method compared to logistic regression and boosting.
◮ Singularity at 1 leads to a sparse solution.
Computation: quadratic programming
◮ Primal problem: minimize w.r.t. β_0, β, and ξ_i
  (1/n) Σ_{i=1}^n ξ_i + (λ/2)‖β‖²
  subject to y_i(β'x_i + β_0) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, ..., n.
◮ Dual problem: maximize w.r.t. α_i (Lagrange multipliers)
  Σ_{i=1}^n α_i − (1/(2nλ)) Σ_{i,j} α_i α_j y_i y_j x_i'x_j
  subject to 0 ≤ α_i ≤ 1 and Σ_{i=1}^n α_i y_i = 0 for i = 1, ..., n.
◮ β̂ = (1/(nλ)) Σ_{i=1}^n α̂_i y_i x_i (from the KKT conditions)
◮ Support vectors: data points with α̂_i > 0
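For concreteness, the dual above can be handed to a generic QP solver. The following sketch of mine uses cvxopt (assuming it is available) on fabricated data; specialized SVM solvers are far more efficient in practice.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes the cvxopt package is installed

solvers.options["show_progress"] = False

def svm_dual_qp(X, y, lam=0.1):
    """Solve the dual: max_a sum_i a_i - (1/(2 n lam)) sum_{i,j} a_i a_j y_i y_j x_i'x_j
    subject to 0 <= a_i <= 1 and sum_i a_i y_i = 0, using a generic QP solver."""
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)) / (n * lam))  # quadratic term
    q = matrix(-np.ones(n))                               # linear term (maximize sum a_i)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # -a_i <= 0 and a_i <= 1
    h = matrix(np.hstack([np.zeros(n), np.ones(n)]))
    A = matrix(y.reshape(1, -1))                          # equality constraint y'a = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    beta = (alpha * y) @ X / (n * lam)                    # KKT: beta = (1/(n lam)) sum a_i y_i x_i
    support = alpha > 1e-6                                # support vectors: alpha_i > 0 (tolerance)
    return alpha, beta, support

# Fabricated two-class data for illustration (y coded as +/-1 floats).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1, (30, 2)), rng.normal(-1.5, 1, (30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
alpha, beta, support = svm_dual_qp(X, y)
print("number of support vectors:", int(support.sum()))
```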
Operational properties
◮ The SVM classification rule depends on the support vectors only (sparsity).
◮ The sparsity leads to efficient data reduction and fast evaluation at the testing phase.
◮ Can handle high dimensional data even when p ≫ n, as the solution depends on x only through the inner products x_i'x_j in the dual formulation.
◮ Need to solve a quadratic programming problem of size n.
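The "inner products only" point can be demonstrated by fitting with a precomputed Gram matrix. This is my own sketch with fabricated p ≫ n data, not part of the talk.

```python
import numpy as np
from sklearn.svm import SVC

# Fabricated high dimensional data with p >> n.
rng = np.random.default_rng(0)
n, p = 40, 2000
X = rng.normal(size=(n, p))
X[: n // 2] += 0.2                       # small mean shift for the first class
y = np.repeat([1, -1], n // 2)

# The dual depends on the data only through the n x n Gram matrix of inner products.
gram = X @ X.T
svm_gram = SVC(kernel="precomputed", C=1.0).fit(gram, y)
svm_lin = SVC(kernel="linear", C=1.0).fit(X, y)

X_new = rng.normal(size=(3, p))
print(svm_gram.predict(X_new @ X.T))     # needs only inner products with the training points
print(svm_lin.predict(X_new))            # same predictions as the linear-kernel fit
```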
Nonlinear SVM
◮ Linear SVM solution: f(x) = Σ_{i=1}^n c_i (x_i'x) + b
◮ Replace the Euclidean inner product x't with K(x, t) = Φ(x)'Φ(t) for a mapping Φ from R^p to a higher dimensional 'feature space.'
◮ Nonlinear kernels: K(x, t) = (1 + x't)^d, exp(−‖x − t‖²/2σ²), ...
  e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2) gives K(x, t) = (1 + x't)².
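The feature-map identity for the quadratic kernel is easy to verify numerically; a tiny sketch of my own with arbitrary test points:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x't)^2 with p = 2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, t):
    """Polynomial kernel of degree 2."""
    return (1 + x @ t) ** 2

# Arbitrary points: the two ways of computing the kernel agree.
x, t = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(t), K(x, t))  # same value up to rounding
```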
Kernels
Aizerman, Braverman, and Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning.
◮ Kernel trick: replace the dot product in linear methods with a kernel.
◮ "Kernelize": kernel LDA, kernel PCA, kernel k-means algorithm, ...
◮ K(x, t) = Φ(x)'Φ(t): non-negative definite
◮ Closely connected to reproducing kernels. This revelation came at the AMS-IMS-SIAM Summer Conference, Adaptive Selection of Statistical Models and Procedures, Mount Holyoke College, MA, June 1996 (G. Wahba's recollection).
Regularization in RKHS
Wahba (1990), Spline Models for Observational Data.
Find f(x) = Σ_{ν=1}^M d_ν φ_ν(x) + h(x) with h ∈ H_K minimizing
  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}.
◮ H_K: a reproducing kernel Hilbert space of functions defined on a domain which can be arbitrary
◮ K(x, t): reproducing kernel if i) K(x, ·) ∈ H_K for each x, and ii) f(x) = <K(x, ·), f(·)>_{H_K} for all f ∈ H_K (the reproducing property)
◮ The null space is spanned by {φ_ν}_{ν=1}^M.
◮ J(f) = ‖h‖²_{H_K}: penalty
SVM in general
Find f(x) = b + h(x) with h ∈ H_K minimizing
  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖h‖²_{H_K}.
◮ The null space: M = 1 and φ_1(x) = 1
◮ Linear SVM: H_K = {h(x) = β'x | β ∈ R^p} with K(x, t) = x't and ‖h‖²_{H_K} = ‖β'x‖²_{H_K} = ‖β‖²
Representer Theorem
Kimeldorf and Wahba (1971), Some results on Tchebycheffian Spline Functions.
◮ The minimizer f = Σ_{ν=1}^M d_ν φ_ν + h with h ∈ H_K of
  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}
  has a representation of the form
  f̂(x) = Σ_{ν=1}^M d̂_ν φ_ν(x) + Σ_{i=1}^n ĉ_i K(x_i, x),
  where the second sum is ĥ(x).
◮ ‖h‖²_{H_K} = Σ_{i,j} ĉ_i ĉ_j K(x_i, x_j)
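As an illustration of this kernel-expansion form (my own sketch, not from the talk), a fitted kernel SVM in scikit-learn exposes the coefficients ĉ_i (dual_coef_, nonzero only at support vectors) and the offset, so the decision function can be rebuilt by hand:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Fabricated two-class data with a circular boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)

gamma = 0.5
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Decision function rebuilt from the representer-theorem form:
# f(x) = sum_i c_i K(x_i, x) + b, with nonzero c_i only at support vectors.
X_new = rng.normal(size=(5, 2))
K_new = rbf_kernel(X_new, svm.support_vectors_, gamma=gamma)
f_manual = K_new @ svm.dual_coef_.ravel() + svm.intercept_[0]
print(np.allclose(f_manual, svm.decision_function(X_new)))  # True
```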
Implications of the general treatment
◮ Kernelized SVM is a special case of the RKHS method.
◮ The K(x_i, ·) form basis functions for f.
◮ There is no restriction on the input domain or the form of the kernel function as long as the kernel is non-negative definite (by the Moore-Aronszajn theorem).
◮ Kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond Euclidean vector spaces.
Statistical properties
◮ Bayes risk consistent when the space generated by a kernel is sufficiently rich. Lin (2000), Zhang (AOS 2004), Bartlett et al. (JASA 2006)
◮ Population minimizer f* (limiting discriminant function) for
  ◮ Binomial deviance L(y, f(x)) = log(1 + exp(−yf(x))): f*(x) = log[p_1(x)/(1 − p_1(x))]
  ◮ Hinge loss L(y, f(x)) = (1 − yf(x))_+: f*(x) = sign{p_1(x) − 1/2}
◮ Designed for prediction only; no probability estimates are available from f̂ in general.
◮ Can be less efficient than probability modeling in reducing the error rate.
SVM vs logistic regression
Figure: Solid: 2p(x) − 1, dotted: 2p̂_LR(x) − 1, and dashed: f̂_SVM(x), plotted against x.
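A figure of this kind can be reproduced along the following lines (my own sketch; the true p(x), sample size, and tuning settings are made up): fit both methods to simulated data and compare f̂_SVM(x) with 2p̂_LR(x) − 1 and the true 2p(x) − 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Made-up true model: p(x) = P(Y = 1 | X = x) for a one-dimensional x.
def p(x):
    return 1 / (1 + np.exp(-3 * x))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=500)
y = np.where(rng.uniform(size=500) < p(x), 1, -1)
X = x.reshape(-1, 1)

lr = LogisticRegression().fit(X, y)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

grid = np.linspace(-2, 2, 9).reshape(-1, 1)
true_curve = 2 * p(grid.ravel()) - 1                # 2p(x) - 1
lr_curve = 2 * lr.predict_proba(grid)[:, 1] - 1     # 2p_hat_LR(x) - 1
svm_curve = svm.decision_function(grid)             # f_hat_SVM(x), roughly sign(p - 1/2) in shape
for g, t, l, s in zip(grid.ravel(), true_curve, lr_curve, svm_curve):
    print(f"x={g:+.1f}  2p-1={t:+.2f}  2p_LR-1={l:+.2f}  f_SVM={s:+.2f}")
```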
Extensions and further developments
◮ Extensions to the multiclass case
◮ Feature selection: Make the embedding through the kernel explicit.
◮ Kernel learning
◮ Efficient algorithms for large data sets when the penalty parameter λ is fixed
◮ Characterization of the entire solution path
◮ Beyond classification: regression, novelty detection, clustering, semi-supervised learning, ...
Reference
◮ This talk is based on a book chapter: Lee (2010), Support Vector Machines for Classification: A Statistical Portrait, in Statistical Methods in Molecular Biology.
◮ See references therein.
◮ A preliminary version of the manuscript is available on my webpage: http://www.stat.osu.edu/~yklee.