Large Margin Methods for Structured Classification: Exponentiated Gradient Algorithms and PAC-Bayesian Generalization Bounds
Peter L. Bartlett, Michael Collins, David McAllester, and Ben Taskar
Presented by: Yu Jin
March 30, 2006
Outline
▶ Models for structured classification
▶ Exponentiated gradient algorithm for the QP problem
▶ EG updates for structured objects
▶ Generalization bound
▶ Experimental results
Application: Image Classification
[Figure: image classification example]
General Setting
▶ We aim to learn a function f : X → Y.
▶ Loss function: L : X × Y × Y → R^+.
▶ Given a distribution D(x, y), our aim is to find a function with low expected loss, or risk: E_{(x,y)∼D} L(x, y, f(x)).
▶ Each input x has a set of candidate outputs, G(x).
▶ The prediction function is f_w(x) = argmax_{y ∈ G(x)} ⟨φ(x, y), w⟩.
▶ To find f_w(x), we formulate a large-margin optimization problem that minimizes the regularized empirical risk
  (1/2)‖w‖² + C Σ_i ( max_y ( L(x_i, y_i, y) − m_{i,y}(w) ) )_+,
  where m_{i,y}(w) = ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, y)⟩ is the "margin" of the correct output y_i over output y on example i (a sketch in code follows below).
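As a concrete reading of this objective, here is a minimal Python sketch, assuming each candidate set G(x_i) is small enough to enumerate explicitly; phi, G, and L are hypothetical callables standing in for the feature map, candidate generator, and loss.

import numpy as np

def predict(w, x, candidates, phi):
    # f_w(x) = argmax over y in G(x) of <phi(x, y), w>
    scores = [np.dot(phi(x, y), w) for y in candidates]
    return candidates[int(np.argmax(scores))]

def regularized_risk(w, data, G, phi, L, C):
    # (1/2)||w||^2 + C * sum_i ( max_y ( L(x_i, y_i, y) - m_{i,y}(w) ) )_+
    value = 0.5 * float(np.dot(w, w))
    for x, y_true in data:
        s_true = float(np.dot(phi(x, y_true), w))
        worst = max(L(x, y_true, y) - (s_true - float(np.dot(phi(x, y), w)))
                    for y in G(x))
        value += C * max(worst, 0.0)
    return value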
Primal and Dual Problems
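A sketch of the primal/dual pair for the objective above, assuming the paper's normalization in which each ᾱ_i lies on the probability simplex and w(ᾱ) = C Σ α Φ; signs and scaling conventions may differ from the slide's own statement. Here Φ_{i,y} = φ(x_i, y_i) − φ(x_i, y).

\[
\begin{aligned}
\text{Primal:}\quad & \min_{w}\; \tfrac{1}{2}\|w\|^2 + C \sum_i \Big( \max_{y \in G(x_i)} \big( L(x_i, y_i, y) - \langle w, \Phi_{i,y} \rangle \big) \Big)_+ \\
\text{Dual:}\quad & \max_{\bar\alpha}\; Q(\bar\alpha) = C \sum_{i,y} \alpha_{i,y}\, L(x_i, y_i, y) - \tfrac{C^2}{2} \Big\| \sum_{i,y} \alpha_{i,y}\, \Phi_{i,y} \Big\|^2 \\
& \text{s.t.}\;\; \alpha_{i,y} \ge 0, \quad \sum_y \alpha_{i,y} = 1 \;\; \forall i, \qquad w(\bar\alpha) = C \sum_{i,y} \alpha_{i,y}\, \Phi_{i,y}.
\end{aligned}
\]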
Exponentiated Gradient Updates for Large Margin Problems
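A minimal sketch of a single EG step on one example's dual variables, assuming grad_i holds the partial derivatives of the dual objective Q with respect to alpha_{i,y} for each candidate y; names are illustrative, not from the paper's code.

import numpy as np

def eg_step(alpha_i, grad_i, eta):
    # Multiplicative update followed by renormalization keeps alpha_i
    # on the probability simplex:
    #   alpha'_{i,y} proportional to alpha_{i,y} * exp(eta * grad_{i,y})
    unnorm = alpha_i * np.exp(eta * grad_i)
    return unnorm / unnorm.sum()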
Convergence of the Exponentiated Gradient QP Algorithm
Models for Structured Classification
▶ The structured labels have a natural decomposition into "parts".
▶ Assume some countable set of parts, R.
▶ A function R which maps each object (x, y) ∈ X × Y to a finite subset of R.
▶ R(x, y) is the set of parts belonging to a particular object.
▶ A feature-vector representation function φ : X × R → R^d. Thus Φ(x, y) = Σ_{r ∈ R(x,y)} φ(x, r).
▶ The loss decomposes over parts as well: L(x, y, ŷ) = Σ_{r ∈ R(x,ŷ)} l(x, y, r) (both decompositions are sketched in code below).
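A minimal sketch of both decompositions; parts(x, y), part_phi(x, r), and part_l(x, y, r) are hypothetical helpers standing in for R(x, y), φ(x, r), and l(x, y, r).

import numpy as np

def Phi(x, y, parts, part_phi, d):
    # Phi(x, y) = sum over r in R(x, y) of phi(x, r)
    total = np.zeros(d)
    for r in parts(x, y):
        total += part_phi(x, r)
    return total

def decomposed_loss(x, y, y_hat, parts, part_l):
    # L(x, y, y_hat) = sum over r in R(x, y_hat) of l(x, y, r)
    return sum(part_l(x, y, r) for r in parts(x, y_hat))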
Example: Markov Random Field in Image Classification
▶ Markov random fields replace the temporal dependency of Markov chains with spatial dependency.
▶ [Figure: image classification with an MRF]
Example (2)
▶ Graphical model for the 4-neighborhood system (an illustrative part set for this model is sketched below).
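As an illustration (not code from the paper), one natural part set for a 4-neighborhood MRF on an H × W pixel grid: a node part per pixel and an edge part per neighboring pixel pair, each tagged with its labels.

def grid_parts(y, H, W):
    # y[i][j] is the class label of pixel (i, j); the returned tuples
    # play the role of R(x, y) for this model.
    parts = []
    for i in range(H):
        for j in range(W):
            parts.append(("node", (i, j), y[i][j]))
            if i + 1 < H:   # vertical neighbor
                parts.append(("edge", (i, j), (i + 1, j), y[i][j], y[i + 1][j]))
            if j + 1 < W:   # horizontal neighbor
                parts.append(("edge", (i, j), (i, j + 1), y[i][j], y[i][j + 1]))
    return parts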
A New Dual
▶ New marginal variables μ_{i,r}(ᾱ): the expectation, under ᾱ, that part r is present. μ_{i,r}(ᾱ) = Σ_y α_{i,y} I(x_i, y, r), where I(x_i, y, r) is 1 if r ∈ R(x_i, y) and 0 otherwise.
▶ Because the loss and the features both decompose into parts, the dual objective can be rewritten as a function Q_m(μ) of these marginals alone (see the substitution below).
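To see why the dual depends on ᾱ only through the marginals, substitute the part decompositions of the loss and the features into the objective:

\[
\sum_y \alpha_{i,y}\, L(x_i, y_i, y) = \sum_r \mu_{i,r}(\bar\alpha)\, l(x_i, y_i, r),
\qquad
\sum_y \alpha_{i,y}\, \Phi(x_i, y) = \sum_r \mu_{i,r}(\bar\alpha)\, \phi(x_i, r),
\]

so every term of Q(ᾱ) can be written using μ alone, which defines Q_m(μ).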
Gibbs Distribution
▶ The number of dual variables α_{i,y} is exponential.
▶ This number of dual variables precludes computing the primal parameters w* = C Σ_{i,y} α*_{i,y} Φ_{i,y} directly.
▶ Each ᾱ_i is a distribution over the labelings y. By the Hammersley–Clifford theorem, Y (a labeling) is an MRF on S (the image) with respect to N (the neighborhood system) if and only if Y is a GRF on S with respect to N. Hence α_{i,y} takes the form of a Gibbs distribution: α_{i,y} ∝ exp(Σ_r I(x_i, y, r) θ_{i,r}).
▶ We have application-specific algorithms to compute μ_{i,r}(ᾱ) = Σ_y α_{i,y} I(x_i, y, r) efficiently (one such case is sketched below).
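For chain-structured labels (one canonical case; this sketch is illustrative, not the paper's code), the node marginals of a Gibbs distribution with node scores theta_n[t, a] and edge scores theta_e[t, a, b] follow from forward-backward in O(n · k²) time.

import numpy as np

def chain_node_marginals(theta_n, theta_e):
    # theta_n: (n, k) node scores; theta_e: (n-1, k, k) edge scores.
    # Returns mu[t, a] = P(y_t = a) under the Gibbs distribution
    # alpha_y proportional to exp(sum of node and edge scores along y).
    # (For clarity this works in the exp domain; a real implementation
    # would use log-sum-exp for numerical stability.)
    n, k = theta_n.shape
    fwd = np.zeros((n, k))
    bwd = np.ones((n, k))
    fwd[0] = np.exp(theta_n[0])
    for t in range(1, n):
        fwd[t] = np.exp(theta_n[t]) * (fwd[t - 1] @ np.exp(theta_e[t - 1]))
    for t in range(n - 2, -1, -1):
        bwd[t] = np.exp(theta_e[t]) @ (np.exp(theta_n[t + 1]) * bwd[t + 1])
    mu = fwd * bwd
    return mu / mu.sum(axis=1, keepdims=True)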
EG Updates for Structured Objects
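A sketch of why the update can be applied implicitly, assuming the dual gradient decomposes over parts as ∇_{i,y} = c_i + Σ_r I(x_i, y, r) s_{i,r}, with a per-example constant c_i that cancels in the normalization:

\[
\alpha'_{i,y} \;\propto\; \alpha_{i,y}\, \exp\!\big( \eta\, \nabla_{i,y} \big)
\quad\Longleftrightarrow\quad
\theta'_{i,r} = \theta_{i,r} + \eta\, s_{i,r} \;\; \text{for all } r,
\]

so one update touches only the part-indexed Gibbs parameters θ_{i,r} rather than the exponentially many α_{i,y}.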
A Primal Form Algorithm
▶ The previous algorithm requires a large amount of storage to store θ̄: one parameter θ_{i,r} per training example and part.
Generalization Bound
▶ Bound 1
▶ Bound 2
Experimental Results