Large Margin Methods for Structured Classification: Exponentiated Gradient Algorithms and PAC-Bayesian Generalization Bounds

Peter L. Bartlett, Michael Collins, David McAllester, and Ben Taskar

Presented by: Yu Jin

March 30, 2006

Outline

- Models for structured classification
- Exponentiated gradient algorithm for the QP problem
- EG updates for structured objects
- Generalization bound
- Experimental results
- Application: image classification

General Setting

- We try to learn a function f : X → Y.
- Loss function: L : X × Y × Y → R_+.
- Given some distribution D(x, y), our aim is to find a function with low expected loss, or risk: E_{(x,y)∼D} L(x, y, f(x)).
- Each input x has a set of candidate labels, G(x).
- The decision function is f_w(x) = arg max_{y ∈ G(x)} ⟨φ(x, y), w⟩.
- To find f_w(x), we formalize a large-margin optimization problem that minimizes the regularized empirical risk (see the sketch below):
  (1/2)‖w‖² + C Σ_i max_y (L(x_i, y_i, y) − m_{i,y}(w))_+,
  where m_{i,y}(w) = ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, y)⟩ is the "margin" of label y on example i.
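As a concrete reference, here is a minimal sketch of this objective for a candidate set G(x) small enough to enumerate; the names (phi, candidates, loss) are illustrative placeholders, not from the paper:

```python
import numpy as np

def margin(w, phi, x, y_true, y):
    """m_{i,y}(w) = <w, phi(x, y_true)> - <w, phi(x, y)>."""
    return w @ phi(x, y_true) - w @ phi(x, y)

def regularized_risk(w, data, phi, candidates, loss, C):
    """(1/2)||w||^2 + C * sum_i max_y (L(x_i, y_i, y) - m_{i,y}(w))_+ ."""
    hinge = 0.0
    for x, y_true in data:
        # Hinge term for example i: worst violation over candidate labels.
        hinge += max(
            max(loss(x, y_true, y) - margin(w, phi, x, y_true, y), 0.0)
            for y in candidates(x)
        )
    return 0.5 * float(w @ w) + C * hinge
```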

Primal and Dual Problems

Exponentiated Gradient Updates for Large Margin Problems
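For orientation, a minimal sketch of a standard exponentiated gradient step on the dual variables ᾱ, where each row of alpha is a distribution over one example's candidate labels; the gradient array and step size eta are assumptions, not the paper's exact formulation:

```python
import numpy as np

def eg_step(alpha, grad, eta):
    """One EG step: alpha'_{i,y} ∝ alpha_{i,y} * exp(-eta * grad_{i,y}).

    alpha: (n, k) array; row i is a distribution over example i's candidates.
    grad:  (n, k) array, the gradient of the dual objective Q at alpha.
    eta:   step size.
    """
    unnormalized = alpha * np.exp(-eta * grad)  # multiplicative update
    # Renormalize each row so it stays on the probability simplex.
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```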

Convergence of the Exponentiated Gradient QP Algorithm

Models for Structured Classification

- The structured labels have a natural decomposition into "parts".
- Assume some countable set of parts, R.
- A function R maps each object (x, y) ∈ X × Y to a finite subset of R; R(x, y) is the set of parts belonging to a particular object.
- A feature-vector representation function φ : X × R → R^d. Thus Φ(x, y) = Σ_{r ∈ R(x,y)} φ(x, r).
- Part-based loss: L(x, y, ŷ) = Σ_{r ∈ R(x,ŷ)} l(x, y, r). (A sketch of both decompositions follows this list.)
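A minimal sketch of the two part-based decompositions above, assuming hypothetical helpers parts(x, y) (returning R(x, y)), part_feature (φ), and part_loss (l):

```python
import numpy as np

def global_features(x, y, parts, part_feature, d):
    """Phi(x, y) = sum of phi(x, r) over r in R(x, y)."""
    total = np.zeros(d)
    for r in parts(x, y):            # R(x, y): the finite set of parts of (x, y)
        total += part_feature(x, r)  # phi(x, r) in R^d
    return total

def decomposed_loss(x, y, y_hat, parts, part_loss):
    """L(x, y, y_hat) = sum of l(x, y, r) over r in R(x, y_hat)."""
    return sum(part_loss(x, y, r) for r in parts(x, y_hat))
```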

Example: Markov Random Field in Image Classification

- Markov random fields replace the temporal dependency of Markov chains with spatial dependency.
- Image classification.

Example (2)

- Graphical model for the 4-neighborhood system (sketched below).
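Purely as an illustration (not from the talk), a sketch that enumerates the edges of the 4-neighborhood system on an H × W pixel grid:

```python
def four_neighborhood_edges(H, W):
    """Edges of the 4-neighborhood MRF on an H x W pixel grid."""
    edges = []
    for i in range(H):
        for j in range(W):
            if i + 1 < H:
                edges.append(((i, j), (i + 1, j)))  # vertical neighbor
            if j + 1 < W:
                edges.append(((i, j), (i, j + 1)))  # horizontal neighbor
    return edges
```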

A New Dual

- New marginals: µ_{i,r}(ᾱ), the expectation of ᾱ on each part: µ_{i,r}(ᾱ) = Σ_y α_{i,y} I(x_i, y, r), where I(x_i, y, r) indicates whether r ∈ R(x_i, y). (A brute-force sketch follows this list.)
- The dual objective is rewritten in terms of these marginals as Q_m(µ).
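A brute-force sketch of these marginals, viable only when the candidate set can be enumerated (in general the sum over y requires the application-specific algorithms mentioned on a later slide); alpha_i, candidates_i, and parts are hypothetical names:

```python
def marginal(alpha_i, candidates_i, parts, x_i, r):
    """mu_{i,r}(alpha) = sum_y alpha_{i,y} * I(x_i, y, r) by direct enumeration."""
    return sum(
        a_y for a_y, y in zip(alpha_i, candidates_i)
        if r in parts(x_i, y)  # I(x_i, y, r): does part r appear in R(x_i, y)?
    )
```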

Gibbs Distribution

- The number of dual variables α_{i,y} is exponential.
- The number of dual variables precludes computing the primal parameters w* = C Σ_{i,y} α*_{i,y} Φ_{i,y} directly.
- α_{i,y} is the probability of Φ(x_i, y). By the Hammersley–Clifford theorem (Y, a labeling, is an MRF on S, an image, with respect to a neighborhood system N if and only if Y is a GRF on S with respect to N), α_{i,y} takes the form of a Gibbs distribution (see the sketch below).
- We have application-specific algorithms to compute µ_{i,r}(ᾱ) = Σ_y α_{i,y} I(x_i, y, r) efficiently.
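A minimal sketch of the Gibbs parameterization, assuming a dictionary theta of per-part potentials and brute-force normalization over an enumerable candidate set (real instances use structure-specific inference for the partition function):

```python
import math

def gibbs_alpha(theta, candidates_i, parts, x_i):
    """alpha_{i,y} ∝ exp(sum of theta[r] over r in R(x_i, y)): a Gibbs distribution."""
    scores = [math.exp(sum(theta[r] for r in parts(x_i, y))) for y in candidates_i]
    Z = sum(scores)  # partition function
    return [s / Z for s in scores]
```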

EG Updates for Structured Objects

A Primal Form Algorithm

- The previous algorithm requires a large amount of storage to store θ̄.

Generalization Bound

- Bound 1
- Bound 2

Experimental Results