SUPPORT VECTOR MACHINE
Presented by: Ratnanjali Sood, M.Sc. Physics with specialization in Computer Science, Faculty of Science, Dayalbagh Educational Institute
Generalization
Supervised systems learn from a training set T = {Xk, Dk}, Xk ∈ ℜn, Dk ∈ ℜ.
Basic idea: use the system (network) in predictive mode, y_predicted = f(X_unseen).
In other words, we require that the machine be able to generalize successfully to unseen data.
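As a minimal illustration (added here, not part of the original slides), the sketch below learns from a toy training set T and is then used in predictive mode on unseen inputs; the data, and the choice of scikit-learn's SVC as the learning machine, are assumptions for illustration only:

# Minimal sketch: learn f from a training set T = {Xk, Dk}, then predict on unseen data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (20, 2)),     # class -1 samples (made-up data)
                     rng.normal(+2, 1, (20, 2))])    # class +1 samples
D_train = np.array([-1] * 20 + [+1] * 20)

model = SVC(kernel="linear").fit(X_train, D_train)   # learn from T
X_unseen = np.array([[-1.5, -2.0], [2.5, 1.0]])
y_predicted = model.predict(X_unseen)                # predictive mode: y_predicted = f(X_unseen)
print(y_predicted)                                   # expected: [-1  1]

Good generalization means that such predictions remain accurate on data the machine never saw during training.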
Occam’s Razor Principle
William of Occam, c. 1280–1349: "No more things should be presumed to exist than are absolutely necessary."
The generalization ability of a machine is closely related to:
- the capacity of the machine (the functions it can represent)
- the data set that is used for training
Statistical Learning Theory
Proposed by Vapnik. Essential idea: regularization.
- Given a finite set of training examples, the search for the best approximating function must be restricted to a small space of possible architectures.
- When the space of representative functions and their capacity is large and the data set is small, models tend to over-fit and generalize poorly.
Given a finite training data set, achieve the correct balance between:
- accuracy in training on the data set
- capacity of the machine to learn the data set without error
Vapnik–Chervonenkis Dimension
Consider a set of Q points and all 2^Q possible binary labellings of those points. If the set of indicator functions F can correctly classify each of the 2^Q labellings, we say the set of points is shattered by F. The VC-dimension h of a set of functions F is the size of the largest set of points that can be shattered by F.
Labellings in 2-d of 3 points
[Figure: eight panels (a)–(h), one for each possible labelling of three points in ℜ2]
Three points in ℜ2 can be labelled in eight different ways. A linear oriented decision boundary can shatter all eight labellings.
VC-Dimension of Linear Decision Functions in ℜ2 is 3
Labelling of four points in ℜ2 that cannot be correctly separated by a linear oriented decision boundary
A quadratic decision boundary can separate this labelling!
VC-Dimension of Linear Decision Functions in ℜn
At most n+1 points can be shattered by oriented hyperplanes in ℜn, so the VC-dimension of linear decision functions in ℜn is n+1.
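The sketch below is an added illustration (not from the slides): it tests shattering numerically using a linear-programming feasibility check for strict linear separability, confirming that three non-collinear points in ℜ2 can be shattered while a particular four-point configuration cannot. The point coordinates and the use of scipy are choices made here for illustration:

# Sketch: check whether every labelling of a point set is linearly separable (i.e. shattered).
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find w, b with y_i (w·x_i + b) >= 1 for all i.
    n, d = X.shape
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    return all(linearly_separable(X, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=len(X)))

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # three non-collinear points
X4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])     # XOR-like four points

print(shattered(X3))   # True  -> 3 points in R^2 can be shattered by hyperplanes
print(shattered(X4))   # False -> the XOR labelling of 4 points is not linearly separable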
Towards Complexity Control
It is necessary to ensure that the model chosen to represent the underlying function has a complexity (or capacity) that matches the data set in question.
Solution: structural risk minimization
Structural Risk Minimization
Structural Risk Minimization (SRM): minimize the combination of the empirical risk and the complexity of the hypothesis space.
The space of functions F is very large, so the focus of learning is restricted to a smaller space called the hypothesis space.
SRM therefore defines a nested sequence of hypothesis spaces of increasing complexity
F1 ⊂ F2 ⊂ … ⊂ Fn ⊂ …
with VC-dimensions h1 ≤ h2 ≤ … ≤ hn ≤ …
Nested Hypothesis Spaces form a Structure
[Figure: nested hypothesis spaces F1 ⊂ F2 ⊂ F3 ⊂ … ⊂ Fn ⊂ … with VC-dimensions h1 ≤ h2 ≤ … ≤ hn ≤ …]
A Trade-off
- Successive models have greater flexibility, so the empirical error can be pushed down further.
- Increasing i increases the VC-dimension hi.
- Goal: select the hypothesis space that matches the complexity of the training data to the capacity of the model. This gives the best generalization.
The key theorem of learning theory guarantees that the empirical risk converges to the true (expected) risk as the number of training patterns tends to infinity.
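As an illustrative sketch of this trade-off (not taken from the slides), the code below fits SVMs of increasing capacity, here an RBF kernel with growing gamma, to a small noisy data set; the data set, the kernel, and the parameter grid are assumptions. Training accuracy (empirical risk) keeps improving with capacity, while test accuracy typically peaks and then degrades:

# Sketch: empirical risk vs. generalization as model capacity grows (larger gamma = more flexible).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=200))   # noisy linear concept
y[y == 0] = 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for gamma in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", gamma=gamma, C=10).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}")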
Origins of SVM
- Support Vector Machines (SVMs) have a firm grounding in the VC theory of statistical learning.
- An SVM essentially implements structural risk minimization.
- SVMs originated in the work of Vapnik and co-workers at AT&T Bell Laboratories.
- Initial work focussed on optical character recognition and object recognition tasks.
Context
- Linear indicator functions (TLN hyperplane classifiers): the indicator function is the bipolar signum function.
- The data set is linearly separable: T = {Xk, dk}, Xk ∈ ℜn, dk ∈ {-1, 1}.
- Consider two sets of data points that are to be classified into one of two classes C1 (positive samples) and C2 (negative samples).
SVM Design Objective
Find the hyperplane that maximizes the margin: the distance to the closest points on either side of the hyperplane.
[Figure: Class 1 and Class 2 separated by candidate hyperplanes; the maximum-margin hyperplane is preferred]
Hypothesis Space
Our hypothesis space is the space of linear indicator functions f(X) = sgn(W·X + b).
We want to maximize the margin from the separating hyperplane to the nearest positive and negative data points: find the maximum margin hyperplane for the given training set.
Definition of Margin
The perpendicular distance to the closest positive sample (d+) or negative sample (d-) is called the margin.
[Figure: separating hyperplane with the closest Class 1 sample X+ at distance d+ and the closest Class 2 sample X- at distance d-]
Reformulation of Classification Criteria
Originally: W·Xk + b > 0 for dk = +1 and W·Xk + b < 0 for dk = -1.
Reformulated as: dk (W·Xk + b) > 0 for all k.
Introducing a margin ∆ so that the hyperplane satisfies dk (W·Xk + b) ≥ ∆.
Canonical Separating Hyperplanes
Satisfy the constraint ∆ = 1. Then we may write
W·Xk + b ≥ +1 for dk = +1
W·Xk + b ≤ -1 for dk = -1
or more compactly
dk (W·Xk + b) ≥ 1, k = 1, …, Q
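A small numeric sketch (added for illustration) of putting a separating hyperplane into canonical form: W and b are rescaled so that the training point closest to the hyperplane satisfies dk (W·Xk + b) = 1 exactly. The particular W, b, and data points are made up:

# Sketch: rescale (W, b) so that min_k d_k (W·X_k + b) = 1 (canonical separating hyperplane).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-3.0, -2.0]])   # toy points
d = np.array([+1, +1, -1, -1])                                        # labels
W, b = np.array([1.0, 1.0]), 0.5                                      # some separating hyperplane

margins = d * (X @ W + b)        # all positive, since the hyperplane separates the data
scale = margins.min()            # functional margin of the closest point
W_c, b_c = W / scale, b / scale  # canonical form
print(d * (X @ W_c + b_c))       # the closest point now gives exactly 1; all others are >= 1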
Notation
X+ is the data point from C1 closest to the hyperplane Π, and XΠ is the unique point on Π that is closest to X+.
Maximize d+, where d+ = ||X+ - XΠ||.
From the defining equation of the hyperplane Π, W·XΠ + b = 0.
[Figure: Class 1 point X+ at distance d+ from its projection XΠ on Π]
Expression for the Margin
The defining equations of the canonical hyperplane yield W·(X+ - XΠ) = 1.
Noting that X+ - XΠ is also perpendicular to Π (i.e., parallel to W), this eventually yields
d+ = ||X+ - XΠ|| = 1/||W||
and similarly d- = 1/||W||, so the total margin is d+ + d- = 2/||W||.
[Figure: canonical hyperplanes Π+ (W·X + b = +1), Π (W·X + b = 0), and Π- (W·X + b = -1), with the support vectors lying on Π+ and Π- and the total margin between them]
Vectors on the margin are the support vectors, and the total margin is 2/||W||.
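To connect this to software (an added sketch, not from the slides), a nearly hard-margin linear SVM can be fit with scikit-learn by using a very large C; the total margin is then recovered as 2/||W||, and the support vectors satisfy dk (W·Xk + b) ≈ 1. The toy data is assumed:

# Sketch: recover W, b and the total margin 2/||W|| from a (nearly) hard-margin linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
d = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)    # very large C approximates the hard margin
W, b = clf.coef_[0], clf.intercept_[0]

print("total margin 2/||W|| =", 2.0 / np.linalg.norm(W))
print("d_k (W·X_k + b) on the support vectors:",
      d[clf.support_] * (clf.support_vectors_ @ W + b))   # approximately 1 on the margin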
SVM and SRM
If all data points lie within an n-dimensional hypersphere of radius ρ, then the set of canonical hyperplane indicator functions with ||W|| ≤ A has a VC-dimension that satisfies the bound
h ≤ min(⌈ρ²A²⌉, n) + 1
The distance from the hyperplane to the closest point is 1/||W||. Constraining ||W|| ≤ A means that the distance from the hyperplane to the closest data point must be at least 1/A.
Therefore: minimize ||W|| to limit the capacity (VC-dimension) of the machine.
SVM Implements SRM
[Figure: data contained in a hypersphere of radius ρ; separating hyperplanes constrained to lie outside a hypersphere of radius 1/A]
An SVM implements SRM by constraining separating hyperplanes to lie outside hyperspheres of radius 1/A.
Objective of the Support Vector Machine
Given T = {Xk, dk}, Xk ∈ ℜn, dk ∈ {-1, 1}; C1: positive samples, C2: negative samples.
Attempt to classify the data using the smallest possible weight vector norm ||W||, i.e., maximize the margin 1/||W||.
Minimize
J(W) = (1/2) ||W||²
subject to the constraints
dk (W·Xk + b) ≥ 1, k = 1, …, Q
Method of Lagrange Multipliers
Used for two reasons:
- the constraints on the Lagrange multipliers are easier to handle than the original constraints;
- the training data appear only in the form of dot products in the final equations, a fact that we exploit extensively in the non-linear support vector machine.
Construction of the Lagrangian
Formulate the problem in the primal space:
Lp(W, b, Λ) = (1/2) ||W||² − Σk λk [dk (W·Xk + b) − 1]
where Λ = (λ1, …, λQ), λi ≥ 0, is the vector of Lagrange multipliers.
The saddle point of Lp is the solution to the problem: Lp is minimized with respect to W and b, and maximized with respect to Λ.
Shift to Dual Space
Shifting to the dual space makes the optimization problem cleaner, in the sense that it requires only maximization with respect to the λi.
Translation to the dual form is possible because the optimization problem is convex: the cost function is convex and the constraints are linear.
The Kuhn–Tucker conditions for the optimum of a constrained optimization problem are invoked to effect the translation of Lp to the dual form.
Shift to Dual Space
The partial derivatives of Lp with respect to the primal variables must vanish at the solution point:
∂Lp/∂W = 0  ⇒  W = Σk λk dk Xk
∂Lp/∂b = 0  ⇒  Σk λk dk = 0, i.e., ΛᵀD = 0
where D = (d1, …, dQ)ᵀ is the vector of desired values.
Kuhn–Tucker Complementarity Conditions
The complementarity condition λk [dk (W·Xk + b) − 1] = 0 requires that whenever λk > 0 the corresponding constraint dk (W·Xk + b) ≥ 1 must be satisfied with equality.
Substituting the optimality conditions back into Lp yields the dual formulation.
Final Dual Optimization Problem
Maximize
LD(Λ) = Σk λk − (1/2) Σk Σj λk λj dk dj (Xk·Xj)
with respect to the Lagrange multipliers, subject to the constraints
λk ≥ 0, k = 1, …, Q, and Σk λk dk = 0
Support Vectors
Numeric optimization yields the optimized Lagrange multipliers Λ̂ = (λ̂1, …, λ̂Q)ᵀ.
Observation: some Lagrange multipliers go to zero. Data vectors for which the Lagrange multipliers are greater than zero are called support vectors. For all other data points, which are not support vectors, λi = 0.
Optimal Weights and Bias
The optimal weight vector is a weighted sum of the support vectors:
Ŵ = Σs λ̂s ds Xs (sum over the support vectors)
where ns is the number of support vectors.
The optimal bias is computed from the complementarity conditions,
b̂ = (1/ns) Σs (ds − Ŵ·Xs)
usually averaged over all support vectors; the dot products required are those already assembled in the Hessian of the dual problem.
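As a numerical sketch (toy data; practical SVM software uses specialized QP or SMO solvers rather than a general-purpose routine), the dual problem can be solved with scipy's SLSQP and the optimal weights and bias recovered from the multipliers using the formulas above:

# Sketch: maximize L_D(Λ) = Σ λ_k − (1/2) ΣΣ λ_k λ_j d_k d_j (X_k·X_j)
# subject to λ_k >= 0 and Σ λ_k d_k = 0, then recover Ŵ and b̂.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])   # toy training set
d = np.array([+1.0, +1.0, -1.0, -1.0])
Q = len(d)

H = (d[:, None] * d[None, :]) * (X @ X.T)       # H_kj = d_k d_j (X_k · X_j)

def neg_dual(lam):                              # minimizing -L_D maximizes the dual
    return 0.5 * lam @ H @ lam - lam.sum()

res = minimize(neg_dual, x0=np.zeros(Q), method="SLSQP",
               bounds=[(0.0, None)] * Q,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ d}])
lam = res.x

sv = lam > 1e-6                                 # support vectors have λ_k > 0
W_hat = (lam[sv] * d[sv]) @ X[sv]               # Ŵ = Σ λ̂_k d_k X_k over the support vectors
b_hat = np.mean(d[sv] - X[sv] @ W_hat)          # b̂ from complementarity, averaged over SVs

print("multipliers:", np.round(lam, 4))         # non-support-vector multipliers are ~ 0
print("W_hat =", W_hat, " b_hat =", round(b_hat, 4))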
Classifying an Unknown Data Point
Use the linear indicator function f(X) = sgn(Ŵ·X + b̂).
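In library terms (a sketch assuming scikit-learn, whose fitted SVC exposes dual_coef_ = λk dk for the support vectors, together with support_vectors_ and intercept_), the same indicator function can be evaluated directly from a trained model:

# Sketch: f(X) = sgn( Σ_k λ_k d_k (X_k·X) + b̂ ) assembled from a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
d = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, d)

X_new = np.array([[2.5, 1.5], [-1.5, -0.5]])
# dual_coef_ holds λ_k d_k for the support vectors; for a linear kernel K(X_k, X) = X_k·X.
scores = clf.dual_coef_ @ (clf.support_vectors_ @ X_new.T) + clf.intercept_
print(np.sign(scores).ravel())    # output of the indicator function
print(clf.predict(X_new))         # matches the library's own prediction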
Soft Margin Hyperplane Classifier
- For non-linearly separable data the classes overlap, and the constraint dk (W·Xk + b) ≥ 1 cannot be satisfied for all data points.
- Solution: permit the algorithm to misclassify some of the data points, albeit at an increased cost.
- A soft margin is generated, within which all the misclassified data lie.
Soft Margin Classifier
[Figure: hyperplanes Π+ (d(X) = +1), Π (d(X) = 0), and Π- (d(X) = -1); margin-violating points X1 with d(X1) = 1 − ξ1 and X2 with d(X2) = -1 + ξ2 lie inside the soft margin between Class 1 and Class 2]
Slack Variables
Introduce Q slack variables ξk ≥ 0 and relax the constraints to dk (W·Xk + b) ≥ 1 − ξk, k = 1, …, Q.
A data point is misclassified if its corresponding slack variable exceeds unity.
Cost Function
The optimization problem is modified as follows.
Minimize
J(W, Ξ) = (1/2) ||W||² + C Σk ξk
subject to the constraints
dk (W·Xk + b) ≥ 1 − ξk and ξk ≥ 0, k = 1, …, Q
where C controls the trade-off between the width of the margin and the penalty paid for margin violations.
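A hedged sketch (toy overlapping data; scikit-learn's C-SVM used as the soft-margin implementation) of how the penalty C controls this trade-off: a small C tolerates more margin violations (larger slacks ξk) in exchange for a wider margin, while a large C approaches hard-margin behaviour:

# Sketch: soft-margin SVM on overlapping classes; slack ξ_k = max(0, 1 − d_k (W·X_k + b)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(+1.0, 1.2, (50, 2))])
d = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, d)
    W, b = clf.coef_[0], clf.intercept_[0]
    xi = np.maximum(0.0, 1.0 - d * (X @ W + b))    # slack variables
    print(f"C={C:>6}: margin={2 / np.linalg.norm(W):.2f}, "
          f"violations (ξ>0)={np.sum(xi > 0)}, misclassified (ξ>1)={np.sum(xi > 1)}")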
Image Classification Application
- The high-dimensional feature space leads to poor generalization performance of many image classification algorithms.
- Indexing and retrieval of image collections on the World Wide Web is a major challenge.
- Support vector machines show much promise in such applications.
- We now describe the application of support vector machines to the problem of image classification.
Description of Image Data Set
Corel Stock Photo collection: 200 classes with 20,000 images. Two databases were derived from the original collection:
- Corel14: 14 classes and 1400 images (air shows, bears, elephants, tigers, Arabian horses, polar bears, African specialty animals, cheetahs-leopards-jaguars, bald eagles, mountains, fields, deserts, sunrises-sunsets, night scenes)
- Corel7: 7 classes and 2670 images (airplanes, birds, boats, buildings, fish, people, vehicles)
[Figures: sample images from the Corel14 and Corel7 databases]
Selection of Kernel
Introducing Non-Gaussian Kernels
In addition to a linear SVM, the authors employed three kernels, distinguished by an exponent b: Gaussian (b = 2), Laplacian (b = 1), and sub-linear (b = 0.5).
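The sketch below shows how such non-Gaussian kernels could be plugged into an off-the-shelf SVM. It assumes a kernel family of the form K(X, Y) = exp(−ρ Σi |Xi − Yi|^b), which is one common reading of "Gaussian b = 2, Laplacian b = 1, sub-linear b = 0.5"; the exact kernel, feature representation, and parameters used by the authors may differ, and the data here is a made-up stand-in for histogram features:

# Sketch: generalized exponential kernel K(x, y) = exp(-rho * sum_i |x_i - y_i|**b),
# supplied to scikit-learn's SVC as a callable kernel.
import numpy as np
from sklearn.svm import SVC

def make_kernel(b, rho=1.0):
    def kernel(X, Y):
        D = np.abs(X[:, None, :] - Y[None, :, :]) ** b    # pairwise generalized distances
        return np.exp(-rho * D.sum(axis=2))               # Gram matrix
    return kernel

rng = np.random.default_rng(3)
X = np.abs(rng.normal(size=(60, 16)))                     # stand-in for histogram features
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)   # arbitrary labels for the sketch

for b in [2.0, 1.0, 0.5]:
    clf = SVC(kernel=make_kernel(b), C=10).fit(X, y)
    print(f"b={b}: training accuracy = {clf.score(X, y):.2f}")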
Conclusion
The Support Vector Machine is a classifier with good generalization ability, and it can also be applied to multi-class classification problems.
Reference
Vapnik, V. N., 'An Overview of Statistical Learning Theory', IEEE Transactions on Neural Networks, 1999.
Kumar, Satish, Neural Networks: A Classroom Approach.
Thank You