SUPPORT VECTOR MACHINE
Presented by: Ratnanjali Sood, M.Sc. Physics with specialization in Computer Science, Faculty of Science, Dayalbagh Educational Institute
Generalization
Supervised systems learn from a training set T = {Xk, Dk}, Xk ∈ ℜn, Dk ∈ ℜ.
Basic idea: use the system (network) in predictive mode, y_predicted = f(X_unseen).
In other words, we require that the machine be able to generalize successfully to unseen data.
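As a minimal illustration (added here, not part of the original slides), the sketch below learns from a toy training set T and is then used in predictive mode on unseen inputs; the data, and the choice of scikit-learn's SVC as the learning machine, are assumptions for illustration only:

# Minimal sketch: learn f from a training set T = {Xk, Dk}, then predict on unseen data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (20, 2)),     # class -1 samples (made-up data)
                     rng.normal(+2, 1, (20, 2))])    # class +1 samples
D_train = np.array([-1] * 20 + [+1] * 20)

model = SVC(kernel="linear").fit(X_train, D_train)   # learn from T
X_unseen = np.array([[-1.5, -2.0], [2.5, 1.0]])
y_predicted = model.predict(X_unseen)                # predictive mode: y_predicted = f(X_unseen)
print(y_predicted)                                   # expected: [-1  1]

Good generalization means that such predictions remain accurate on data the machine never saw during training.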
Occam’s Razor Principle
William of Occam, c. 1280–1349: "No more things should be presumed to exist than are absolutely necessary."
The generalization ability of a machine is closely related to:
- the capacity of the machine (the functions it can represent)
- the data set that is used for training
Statistical Learning Theory
Proposed by Vapnik. Essential idea: regularization.
- Given a finite set of training examples, the search for the best approximating function must be restricted to a small space of possible architectures.
- When the space of representative functions and their capacity is large and the data set is small, models tend to over-fit and generalize poorly.
Given a finite training data set, achieve the correct balance between:
- accuracy in training on the data set
- capacity of the machine to learn the data set without error
Vapnik–Chervonenkis Dimension
Consider a set of Q points and all 2^Q possible binary labellings of those points. If the set of indicator functions F can correctly classify each of the 2^Q labellings, we say the set of points is shattered by F. The VC-dimension h of a set of functions F is the size of the largest set of points that can be shattered by F.
Labellings in 2-d of 3 points
[Figure: eight panels (a)–(h), one for each possible labelling of three points in ℜ2]
Three points in ℜ2 can be labelled in eight different ways. A linear oriented decision boundary can shatter all eight labellings.
VC-Dimension of Linear Decision Functions in ℜ2 is 3
Labelling of four points in ℜ2 that cannot be correctly separated by a linear oriented decision boundary
A quadratic decision boundary can separate this labelling!
VC-Dimension of Linear Decision Functions in ℜn
At most n+1 points can be shattered by oriented hyperplanes in ℜn, so the VC-dimension of linear decision functions in ℜn is n+1.
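The sketch below is an added illustration (not from the slides): it tests shattering numerically using a linear-programming feasibility check for strict linear separability, confirming that three non-collinear points in ℜ2 can be shattered while a particular four-point configuration cannot. The point coordinates and the use of scipy are choices made here for illustration:

# Sketch: check whether every labelling of a point set is linearly separable (i.e. shattered).
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find w, b with y_i (w·x_i + b) >= 1 for all i.
    n, d = X.shape
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    return all(linearly_separable(X, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=len(X)))

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # three non-collinear points
X4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])     # XOR-like four points

print(shattered(X3))   # True  -> 3 points in R^2 can be shattered by hyperplanes
print(shattered(X4))   # False -> the XOR labelling of 4 points is not linearly separable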
Towards Complexity Control
It is necessary to ensure that the model chosen to represent the underlying function has a complexity (or capacity) that matches the data set in question.
Solution: structural risk minimization
Structural Risk Minimization
Structural Risk Minimization (SRM): minimize the combination of the empirical risk and the complexity of the hypothesis space.
The space of functions F is very large, so the focus of learning is restricted to a smaller space called the hypothesis space.
SRM therefore defines a nested sequence of hypothesis spaces of increasing complexity
F1 ⊂ F2 ⊂ … ⊂ Fn ⊂ …
with VC-dimensions h1 ≤ h2 ≤ … ≤ hn ≤ …
Nested Hypothesis Spaces form a Structure
[Figure: nested hypothesis spaces F1 ⊂ F2 ⊂ F3 ⊂ … ⊂ Fn ⊂ … with VC-dimensions h1 ≤ h2 ≤ … ≤ hn ≤ …]
A Trade-off
- Successive models have greater flexibility, so the empirical error can be pushed down further.
- Increasing i increases the VC-dimension hi.
- Goal: select the hypothesis space that matches the complexity of the training data to the capacity of the model. This gives the best generalization.
The key theorem of learning theory guarantees that the empirical risk converges to the true (expected) risk as the number of training patterns tends to infinity.
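As an illustrative sketch of this trade-off (not taken from the slides), the code below fits SVMs of increasing capacity, here an RBF kernel with growing gamma, to a small noisy data set; the data set, the kernel, and the parameter grid are assumptions. Training accuracy (empirical risk) keeps improving with capacity, while test accuracy typically peaks and then degrades:

# Sketch: empirical risk vs. generalization as model capacity grows (larger gamma = more flexible).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=200))   # noisy linear concept
y[y == 0] = 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for gamma in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", gamma=gamma, C=10).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}")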
Origins of SVM
- Support Vector Machines (SVMs) have a firm grounding in the VC theory of statistical learning.
- An SVM essentially implements structural risk minimization.
- SVMs originated in the work of Vapnik and co-workers at AT&T Bell Laboratories.
- Initial work focussed on optical character recognition and object recognition tasks.
Context
- Linear indicator functions (TLN hyperplane classifiers): the indicator function is the bipolar signum function.
- The data set is linearly separable: T = {Xk, dk}, Xk ∈ ℜn, dk ∈ {-1, 1}.
- Consider two sets of data points that are to be classified into one of two classes C1 (positive samples) and C2 (negative samples).
SVM Design Objective
Find the hyperplane that maximizes the margin: the distance to the closest points on either side of the hyperplane.
[Figure: Class 1 and Class 2 separated by candidate hyperplanes; the maximum-margin hyperplane is preferred]
Hypothesis Space
Our hypothesis space is the space of linear indicator functions f(X) = sgn(W·X + b).
We want to maximize the margin from the separating hyperplane to the nearest positive and negative data points: find the maximum margin hyperplane for the given training set.
Definition of Margin
The perpendicular distance to the closest positive sample (d+) or negative sample (d-) is called the margin.
[Figure: separating hyperplane with the closest Class 1 sample X+ at distance d+ and the closest Class 2 sample X- at distance d-]
Reformulation of Classification Criteria
Originally: W·Xk + b > 0 for dk = +1 and W·Xk + b < 0 for dk = -1.
Reformulated as: dk (W·Xk + b) > 0 for all k.
Introducing a margin ∆ so that the hyperplane satisfies dk (W·Xk + b) ≥ ∆.
Canonical Separating Hyperplanes
Satisfy the constraint ∆ = 1. Then we may write
W·Xk + b ≥ +1 for dk = +1
W·Xk + b ≤ -1 for dk = -1
or more compactly
dk (W·Xk + b) ≥ 1, k = 1, …, Q
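A small numeric sketch (added for illustration) of putting a separating hyperplane into canonical form: W and b are rescaled so that the training point closest to the hyperplane satisfies dk (W·Xk + b) = 1 exactly. The particular W, b, and data points are made up:

# Sketch: rescale (W, b) so that min_k d_k (W·X_k + b) = 1 (canonical separating hyperplane).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-3.0, -2.0]])   # toy points
d = np.array([+1, +1, -1, -1])                                        # labels
W, b = np.array([1.0, 1.0]), 0.5                                      # some separating hyperplane

margins = d * (X @ W + b)        # all positive, since the hyperplane separates the data
scale = margins.min()            # functional margin of the closest point
W_c, b_c = W / scale, b / scale  # canonical form
print(d * (X @ W_c + b_c))       # the closest point now gives exactly 1; all others are >= 1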
Notation
X+ is the data point from C1 closest to the hyperplane Π, and XΠ is the unique point on Π that is closest to X+.
Maximize d+, where d+ = ||X+ - XΠ||.
From the defining equation of the hyperplane Π, W·XΠ + b = 0.
[Figure: Class 1 point X+ at distance d+ from its projection XΠ on Π]
Expression for the Margin
The defining equations of the canonical hyperplane yield W·(X+ - XΠ) = 1.
Noting that X+ - XΠ is also perpendicular to Π (i.e., parallel to W), this eventually yields
d+ = ||X+ - XΠ|| = 1/||W||
and similarly d- = 1/||W||, so the total margin is d+ + d- = 2/||W||.
[Figure: canonical hyperplanes Π+ (W·X + b = +1), Π (W·X + b = 0), and Π- (W·X + b = -1), with the support vectors lying on Π+ and Π- and the total margin between them]
Vectors on the margin are the support vectors, and the total margin is 2/||W||.
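To connect this to software (an added sketch, not from the slides), a nearly hard-margin linear SVM can be fit with scikit-learn by using a very large C; the total margin is then recovered as 2/||W||, and the support vectors satisfy dk (W·Xk + b) ≈ 1. The toy data is assumed:

# Sketch: recover W, b and the total margin 2/||W|| from a (nearly) hard-margin linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
d = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)    # very large C approximates the hard margin
W, b = clf.coef_[0], clf.intercept_[0]

print("total margin 2/||W|| =", 2.0 / np.linalg.norm(W))
print("d_k (W·X_k + b) on the support vectors:",
      d[clf.support_] * (clf.support_vectors_ @ W + b))   # approximately 1 on the margin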
SVM and SRM
If all data points lie within an n-dimensional hypersphere of radius ρ, then the set of canonical hyperplane indicator functions with ||W|| ≤ A has a VC-dimension that satisfies the bound
h ≤ min(⌈ρ²A²⌉, n) + 1
The distance from the hyperplane to the closest point is 1/||W||. Constraining ||W|| ≤ A means that the distance from the hyperplane to the closest data point must be at least 1/A.
Therefore: minimize ||W|| to limit the capacity (VC-dimension) of the machine.
SVM Implements SRM
[Figure: data contained in a hypersphere of radius ρ; separating hyperplanes constrained to lie outside a hypersphere of radius 1/A]
An SVM implements SRM by constraining separating hyperplanes to lie outside hyperspheres of radius 1/A.
Objective of the Support Vector Machine
Given T = {Xk, dk}, Xk ∈ ℜn, dk ∈ {-1, 1}; C1: positive samples, C2: negative samples.
Attempt to classify the data using the smallest possible weight vector norm ||W||, i.e., maximize the margin 1/||W||.
Minimize
J(W) = (1/2) ||W||²
subject to the constraints
dk (W·Xk + b) ≥ 1, k = 1, …, Q
Method of Lagrange Multipliers
Used for two reasons:
- the constraints on the Lagrange multipliers are easier to handle than the original constraints;
- the training data appear only in the form of dot products in the final equations, a fact that we exploit extensively in the non-linear support vector machine.
Construction of the Lagrangian
Formulate the problem in the primal space:
Lp(W, b, Λ) = (1/2) ||W||² − Σk λk [dk (W·Xk + b) − 1]
where Λ = (λ1, …, λQ), λi ≥ 0, is the vector of Lagrange multipliers.
The saddle point of Lp is the solution to the problem: Lp is minimized with respect to W and b, and maximized with respect to Λ.
Shift to Dual Space
Shifting to the dual space makes the optimization problem cleaner, in the sense that it requires only maximization with respect to the λi.
Translation to the dual form is possible because the optimization problem is convex: the cost function is convex and the constraints are linear.
The Kuhn–Tucker conditions for the optimum of a constrained optimization problem are invoked to effect the translation of Lp to the dual form.
Shift to Dual Space
The partial derivatives of Lp with respect to the primal variables must vanish at the solution point:
∂Lp/∂W = 0  ⇒  W = Σk λk dk Xk
∂Lp/∂b = 0  ⇒  Σk λk dk = 0, i.e., ΛᵀD = 0
where D = (d1, …, dQ)ᵀ is the vector of desired values.
Kuhn–Tucker Complementarity Conditions
The complementarity condition λk [dk (W·Xk + b) − 1] = 0 requires that whenever λk > 0 the corresponding constraint dk (W·Xk + b) ≥ 1 must be satisfied with equality.
Substituting the optimality conditions back into Lp yields the dual formulation.
Final Dual Optimization Problem
Maximize
LD(Λ) = Σk λk − (1/2) Σk Σj λk λj dk dj (Xk·Xj)
with respect to the Lagrange multipliers, subject to the constraints
λk ≥ 0, k = 1, …, Q, and Σk λk dk = 0
Support Vectors
Numeric optimization yields the optimized Lagrange multipliers Λ̂ = (λ̂1, …, λ̂Q)ᵀ.
Observation: some Lagrange multipliers go to zero. Data vectors for which the Lagrange multipliers are greater than zero are called support vectors. For all other data points, which are not support vectors, λi = 0.
Optimal Weights and Bias
The optimal weight vector is a weighted sum of the support vectors:
Ŵ = Σs λ̂s ds Xs (sum over the support vectors)
where ns is the number of support vectors.
The optimal bias is computed from the complementarity conditions,
b̂ = (1/ns) Σs (ds − Ŵ·Xs)
usually averaged over all support vectors; the dot products required are those already assembled in the Hessian of the dual problem.
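As a numerical sketch (toy data; practical SVM software uses specialized QP or SMO solvers rather than a general-purpose routine), the dual problem can be solved with scipy's SLSQP and the optimal weights and bias recovered from the multipliers using the formulas above:

# Sketch: maximize L_D(Λ) = Σ λ_k − (1/2) ΣΣ λ_k λ_j d_k d_j (X_k·X_j)
# subject to λ_k >= 0 and Σ λ_k d_k = 0, then recover Ŵ and b̂.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])   # toy training set
d = np.array([+1.0, +1.0, -1.0, -1.0])
Q = len(d)

H = (d[:, None] * d[None, :]) * (X @ X.T)       # H_kj = d_k d_j (X_k · X_j)

def neg_dual(lam):                              # minimizing -L_D maximizes the dual
    return 0.5 * lam @ H @ lam - lam.sum()

res = minimize(neg_dual, x0=np.zeros(Q), method="SLSQP",
               bounds=[(0.0, None)] * Q,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ d}])
lam = res.x

sv = lam > 1e-6                                 # support vectors have λ_k > 0
W_hat = (lam[sv] * d[sv]) @ X[sv]               # Ŵ = Σ λ̂_k d_k X_k over the support vectors
b_hat = np.mean(d[sv] - X[sv] @ W_hat)          # b̂ from complementarity, averaged over SVs

print("multipliers:", np.round(lam, 4))         # non-support-vector multipliers are ~ 0
print("W_hat =", W_hat, " b_hat =", round(b_hat, 4))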
Classifying an Unknown Data Point
Use the linear indicator function f(X) = sgn(Ŵ·X + b̂).
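In library terms (a sketch assuming scikit-learn, whose fitted SVC exposes dual_coef_ = λk dk for the support vectors, together with support_vectors_ and intercept_), the same indicator function can be evaluated directly from a trained model:

# Sketch: f(X) = sgn( Σ_k λ_k d_k (X_k·X) + b̂ ) assembled from a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
d = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, d)

X_new = np.array([[2.5, 1.5], [-1.5, -0.5]])
# dual_coef_ holds λ_k d_k for the support vectors; for a linear kernel K(X_k, X) = X_k·X.
scores = clf.dual_coef_ @ (clf.support_vectors_ @ X_new.T) + clf.intercept_
print(np.sign(scores).ravel())    # output of the indicator function
print(clf.predict(X_new))         # matches the library's own prediction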
Soft Margin Hyperplane Classifier
- For non-linearly separable data the classes overlap, and the constraint dk (W·Xk + b) ≥ 1 cannot be satisfied for all data points.
- Solution: permit the algorithm to misclassify some of the data points, albeit at an increased cost.
- A soft margin is generated, within which all the misclassified data lie.
Soft Margin Classifier
[Figure: hyperplanes Π+ (d(X) = +1), Π (d(X) = 0), and Π- (d(X) = -1); margin-violating points X1 with d(X1) = 1 − ξ1 and X2 with d(X2) = -1 + ξ2 lie inside the soft margin between Class 1 and Class 2]
Slack Variables
Introduce Q slack variables ξk ≥ 0 and relax the constraints to dk (W·Xk + b) ≥ 1 − ξk, k = 1, …, Q.
A data point is misclassified if its corresponding slack variable exceeds unity.
Cost Function
The optimization problem is modified as follows.
Minimize
J(W, Ξ) = (1/2) ||W||² + C Σk ξk
subject to the constraints
dk (W·Xk + b) ≥ 1 − ξk and ξk ≥ 0, k = 1, …, Q
where C controls the trade-off between the width of the margin and the penalty paid for margin violations.
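A hedged sketch (toy overlapping data; scikit-learn's C-SVM used as the soft-margin implementation) of how the penalty C controls this trade-off: a small C tolerates more margin violations (larger slacks ξk) in exchange for a wider margin, while a large C approaches hard-margin behaviour:

# Sketch: soft-margin SVM on overlapping classes; slack ξ_k = max(0, 1 − d_k (W·X_k + b)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(+1.0, 1.2, (50, 2))])
d = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, d)
    W, b = clf.coef_[0], clf.intercept_[0]
    xi = np.maximum(0.0, 1.0 - d * (X @ W + b))    # slack variables
    print(f"C={C:>6}: margin={2 / np.linalg.norm(W):.2f}, "
          f"violations (ξ>0)={np.sum(xi > 0)}, misclassified (ξ>1)={np.sum(xi > 1)}")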
Image Classification Application
- The high-dimensional feature space leads to poor generalization performance of many image classification algorithms.
- Indexing and retrieval of image collections on the World Wide Web is a major challenge.
- Support vector machines show much promise in such applications.
- We now describe the application of support vector machines to the problem of image classification.
Description of Image Data Set
Corel Stock Photo collection: 200 classes with 20,000 images. Two databases were derived from the original collection:
- Corel14: 14 classes and 1400 images (air shows, bears, elephants, tigers, Arabian horses, polar bears, African specialty animals, cheetahs-leopards-jaguars, bald eagles, mountains, fields, deserts, sunrises-sunsets, night scenes)
- Corel7: 7 classes and 2670 images (airplanes, birds, boats, buildings, fish, people, vehicles)
[Figures: sample images from the Corel14 and Corel7 databases]
Selection of Kernel
Introducing Non-Gaussian Kernels
In addition to a linear SVM, the authors employed three kernels, distinguished by an exponent b: Gaussian (b = 2), Laplacian (b = 1), and sub-linear (b = 0.5).
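The sketch below shows how such non-Gaussian kernels could be plugged into an off-the-shelf SVM. It assumes a kernel family of the form K(X, Y) = exp(−ρ Σi |Xi − Yi|^b), which is one common reading of "Gaussian b = 2, Laplacian b = 1, sub-linear b = 0.5"; the exact kernel, feature representation, and parameters used by the authors may differ, and the data here is a made-up stand-in for histogram features:

# Sketch: generalized exponential kernel K(x, y) = exp(-rho * sum_i |x_i - y_i|**b),
# supplied to scikit-learn's SVC as a callable kernel.
import numpy as np
from sklearn.svm import SVC

def make_kernel(b, rho=1.0):
    def kernel(X, Y):
        D = np.abs(X[:, None, :] - Y[None, :, :]) ** b    # pairwise generalized distances
        return np.exp(-rho * D.sum(axis=2))               # Gram matrix
    return kernel

rng = np.random.default_rng(3)
X = np.abs(rng.normal(size=(60, 16)))                     # stand-in for histogram features
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)   # arbitrary labels for the sketch

for b in [2.0, 1.0, 0.5]:
    clf = SVC(kernel=make_kernel(b), C=10).fit(X, y)
    print(f"b={b}: training accuracy = {clf.score(X, y):.2f}")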
Conclusion
The Support Vector Machine is a classifier with good generalization ability, and it can also be applied to multi-class classification problems.
Reference
Vapnik, V. N., 'An Overview of Statistical Learning Theory', IEEE Transactions on Neural Networks, 1999.
Kumar, Satish, Neural Networks: A Classroom Approach.
Thank You