Smoothing Proximal Gradient Method for General Structured Sparse Regression
Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012)

Presented by Zoltán Szabó
Gatsby Unit, Tea Talk, October 25, 2013
Outline

Motivation: (structured) sparse coding.
Proximal operators, FISTA.
Solution: dual norm + smooth approximation.
Motivation: least squares, sparse coding

Given: x ∈ R^{d_x}, D ∈ R^{d_x × d_α}.

Least squares problem:
    J(α) = (1/2)‖x − Dα‖₂² → min_{α∈R^{d_α}}.    (1)

Sparse coding (JPEG; convex relaxation, Lasso, w > 0):
    J(α) = (1/2)‖x − Dα‖₂² + w‖α‖₁ → min_{α∈R^{d_α}}.    (2)
Motivation: structured sparse coding

Group Lasso (G: partition = non-overlapping blocks):
    J(α) = (1/2)‖x − Dα‖₂² + w Σ_{G∈G} ‖α_G‖₂ → min_{α∈R^{d_α}}.    (3)

Overlapping G: hierarchy, grid, total variation, graphs.
Many successful applications: gene analysis, facial expression recognition, ...
Non-overlapping group Lasso

FISTA objective:
    J(α) = f(α) + g(α) → min_{α∈R^{d_α}}.    (4)

Assumptions: f, g convex; f 'smooth' (Lipschitz continuous gradient, constant L).

Fast convergence:
    J(α_t) − J(α*) = O(1/t²).    (5)
FISTA

Ingredients:
    Gradient of the smooth term: ∇f.
    Lipschitz constant of ∇f: L.
    Proximal operator of the non-smooth term (p > 0):
        prox_{pg}(v) = argmin_y [ g(y) + (1/(2p))‖y − v‖₂² ].    (6)

Example: f(α) = (1/2)‖x − Dα‖₂², g(α) = w Σ_{G∈G} ‖α_G‖₂:
    ∇f(α) = Dᵀ(Dα − x),    (7)
    L = λ_max(DᵀD),    (8)
    prox_g: analytical (for a partition G).
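To make these ingredients concrete, here is a minimal NumPy sketch of FISTA for the non-overlapping group Lasso (3); the function and variable names are mine, the partition is assumed to be given as a list of index arrays, and the prox (6) decouples into blockwise group soft-thresholding:

```python
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Blockwise prox of tau * sum_G ||v_G||_2 for a partition `groups`."""
    out = np.zeros_like(v)
    for G in groups:
        norm = np.linalg.norm(v[G])
        if norm > tau:
            out[G] = (1.0 - tau / norm) * v[G]
    return out

def fista_group_lasso(x, D, groups, w, n_iter=500):
    """FISTA for 0.5*||x - D a||_2^2 + w * sum_G ||a_G||_2, non-overlapping G."""
    L = np.linalg.eigvalsh(D.T @ D).max()     # Lipschitz constant of grad f, eq. (8)
    alpha = np.zeros(D.shape[1])
    y, t = alpha.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)              # gradient of the smooth term, eq. (7)
        alpha_new = group_soft_threshold(y - grad / L, groups, w / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)  # momentum step
        alpha, t = alpha_new, t_new
    return alpha
```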
Goal

Objective (λ > 0; w_G > 0, ∀G ∈ G):
    J(α) = f(α) + Ω(α) + λ‖α‖₁ → min_{α∈R^{d_α}},    (9)
    Ω(α) = Σ_{G∈G} w_G ‖α_G‖₂.    (10)

Assumption: f convex (FISTA assumptions).
G: overlapping ⇒ no analytical formula for prox_{pg}.
Solution

The ℓ₂-norm is self-dual:
    ‖a‖₂ = max_{b:‖b‖₂≤1} bᵀa.    (11)

We rewrite Ω (β_G ∈ R^{|G|}: auxiliary variables, β = (β_G)_{G∈G} ∈ R^{Σ_{G∈G}|G|}):
    Ω(α) = Σ_{G∈G} w_G ‖α_G‖₂ = Σ_{G∈G} w_G max_{β_G:‖β_G‖₂≤1} β_Gᵀ α_G    (12)
         = max_{β∈Q} Σ_{G∈G} w_G β_Gᵀ α_G    (13)
         =: max_{β∈Q} βᵀCα,    (14)

where C is the matrix whose block row for group G is w_G times the selector of the coordinates in G (so that βᵀCα = Σ_{G∈G} w_G β_Gᵀ α_G), and
    Q = {β : ‖β_G‖₂ ≤ 1, ∀G ∈ G} (product of unit balls).
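A quick numerical sanity check of (12)-(14), with group sizes and weights chosen arbitrarily: the inner maximum over ‖β_G‖₂ ≤ 1 is attained at β_G = α_G/‖α_G‖₂, so the dual value recovers w_G‖α_G‖₂ group by group.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.normal(size=6)
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]  # illustrative partition
w = [1.0, 2.0, 0.5]

# Primal value: Omega(alpha) = sum_G w_G ||alpha_G||_2
omega = sum(wg * np.linalg.norm(alpha[G]) for wg, G in zip(w, groups))

# Dual value at the maximizer beta_G = alpha_G / ||alpha_G||_2
dual = sum(wg * (alpha[G] / np.linalg.norm(alpha[G])) @ alpha[G]
           for wg, G in zip(w, groups))

assert np.isclose(omega, dual)
```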
Solution - continued

Smooth approximation to Ω(α) (µ ≥ 0):
    Ω(α) = max_{β∈Q} βᵀCα ≈ max_{β∈Q} [βᵀCα − µs(β)] =: Ωµ(α),    (15)
    s(β) = (1/2)‖β‖₂² ≥ 0.

Maximum gap is µM:
    M = max_{β∈Q} s(β) = |G|/2,    (16)
(each of the |G| blocks of β has ℓ₂-norm at most 1, hence ‖β‖₂² ≤ |G|), and
    Ω(α) − µM ≤ Ωµ(α) ≤ Ω(α).    (17)
Solution: FISTA on the smooth approximation

Original objective (λ > 0):
    J(α) = f(α) + Ω(α) + λ‖α‖₁ → min_{α∈R^{d_α}}.    (18)

Smooth approximation (µ > 0, λ > 0):
    Jµ(α) = [f(α) + Ωµ(α)] + λ‖α‖₁ → min_{α∈R^{d_α}},    (19)
with FISTA applied to the smooth part f + Ωµ and the non-smooth part g = λ‖·‖₁.
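The non-smooth part g = λ‖·‖₁ has the well-known elementwise prox, soft-thresholding; a minimal sketch (used inside FISTA with step size 1/L, the threshold is tau = λ/L):

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```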
Result (= FISTA can be applied)

Ωµ(α) is convex with Lipschitz continuous gradient:
    ∇Ωµ(α) = Cᵀβ*,    (20)
    β* = argmax_{β∈Q} [βᵀCα − µs(β)]    (21)
       = [Π₂(w_G α_G / µ)]_{G∈G},    (22)
where Π₂ denotes the Euclidean projection onto the ℓ₂ unit ball.

Lipschitz constant: Lµ = (1/µ)‖C‖₂².
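In code, (20)-(22) amount to a blockwise projection followed by a scatter-add. A sketch, assuming `groups` is a list of index arrays (which may overlap) with per-group weights `w`:

```python
import numpy as np

def project_l2_ball(u):
    """Pi_2: Euclidean projection onto the unit l2 ball."""
    n = np.linalg.norm(u)
    return u if n <= 1.0 else u / n

def grad_omega_mu(alpha, groups, w, mu):
    """Gradient of the smoothed penalty: C^T beta*, beta*_G = Pi_2(w_G alpha_G / mu)."""
    grad = np.zeros_like(alpha)
    for wg, G in zip(w, groups):
        beta_G = project_l2_ball(wg * alpha[G] / mu)
        grad[G] += wg * beta_G   # block row of C for group G is w_G times a selector
    return grad
```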
Proof (intuition)

Convexity, smoothness of Ωµ:
    Ωµ(α) = max_{β∈Q} [βᵀCα − µs(β)] = µ max_{β∈Q} [βᵀ(Cα/µ) − s(β)] = µ d*(Cα/µ),    (23)
where d* is the conjugate of s + I_Q, hence convex.

Gradient ∇Ωµ: Danskin's theorem with
    h(α) = max_{β∈K compact} φ(β, α),    (24)
    ∇h(α) = ∇_α φ(β*, α).    (25)

Lipschitz constant Lµ: Nesterov '05.
Convergence rate: O(1/ε)

Given: ε (precision). We want J(α_t) − J(α*) ≤ ε.
Set µ = ε/(2M), where M = |G|/2.    (26)

Sufficient number of iterations:
    t = sqrt( (4‖α* − α₀‖₂²/ε) [ λ_max(DᵀD) + 2M‖C‖₂²/ε ] ) = O(1/ε).

Note (subgradient descent is much slower): O(1/ε²).
Summary

Task: overlapping group Lasso.
Difficulty: overlap ⇒ non-separability.
Proposed solution:
    Self-duality of the ℓ₂-norm: ‖·‖₂* = ‖·‖₂.
    Smooth approximation.
    |G| independent subproblems, analytical expressions → FISTA.
    Convergence rate: O(1/ε).
Thank you for your attention!
Analytical solution for β*

    β* = argmax_{β∈Q} [βᵀCα − (µ/2)‖β‖₂²]    (27)
       = argmax_{β∈Q} Σ_{G∈G} [w_G β_Gᵀ α_G − (µ/2)‖β_G‖₂²]    (28)
       = argmin_{β∈Q} Σ_{G∈G} ‖β_G − w_G α_G/µ‖₂².    (29)

Thus
    (β*)_G = Π₂(w_G α_G / µ).    (30)
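A small brute-force check of (30) in two dimensions (the values of a, w, µ below are arbitrary): maximize (27) for a single group over a dense grid of the unit disk and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
a, w, mu = rng.normal(size=2), 1.5, 0.3

# Closed form (30): beta* = Pi_2(w * a / mu)
u = w * a / mu
beta_star = u / max(1.0, np.linalg.norm(u))

# Brute force over the 2-D unit disk (dense polar grid), single group
theta = np.linspace(0, 2 * np.pi, 2000)
r = np.linspace(0, 1, 200)
R, T = np.meshgrid(r, theta)
B = np.stack([R * np.cos(T), R * np.sin(T)], axis=-1)     # candidate betas
vals = w * (B @ a) - 0.5 * mu * (B ** 2).sum(axis=-1)     # objective (27)
best = B.reshape(-1, 2)[vals.argmax()]

assert np.allclose(best, beta_star, atol=1e-2)
```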
Combination of Lipschitz constants
Let Lf (Lg ) be a Lipschitz constant of ∇f (∇g). Then Lf +g ≤ Lf + Lg , since k(∇f + ∇g)(x) − (∇f + ∇g)(y)k2
(31)
≤ k[∇f (x) + ∇g(x)] − [∇f (y) + ∇g(y)]k2
(32)
≤ k∇f (x) − ∇f (y)k2 + k∇g(y) − ∇g(y)k2
(33)
= Lf kx − yk2 + Lg kx − yk2
(34)
≤ (Lf + Lg ) kx − yk2 .
(35)
Zoltán Szabó
Smoothing Proximal Gradient Method
Rate of convergence for SPG

    J(α_t) − J(α*)
    = [J(α_t) − Jµ(α_t)] + [Jµ(α_t) − Jµ(α*)] + [Jµ(α*) − J(α*)]    (36)
    ≤ µM + 2Lµ‖α₀ − α*‖₂²/t² + 0    (37)
    ≤ µM + (2‖α₀ − α*‖₂²/t²) [ λ_max(DᵀD) + ‖C‖₂²/µ ].    (38)

Plug in µ = ε/(2M) and solve for t:
    J(α_t) − J(α*) ≤ ε/2 + (2‖α₀ − α*‖₂²/t²) [ λ_max(DᵀD) + 2M‖C‖₂²/ε ] ≤ ε.
Proximal operator

f: R^d → R ∪ {∞}: closed proper convex function, i.e.,
    epi(f) = {(y, t) ∈ R^d × R : f(y) ≤ t}    (39)
is nonempty, closed, and convex.

Proximal operator of f:
    prox_f(v) = argmin_y [ f(y) + (1/2)‖y − v‖₂² ].    (40)

Strictly convex r.h.s. of (40) ⇒ prox_f exists and is unique.
Proximal operator = generalization of projection

C: closed convex set. f = I_C: indicator function of C,
    I_C(y) = 0 if y ∈ C, ∞ otherwise.    (41)

Then prox_f is the Euclidean projection onto C:
    prox_{I_C}(v) = Π_C(v) = argmin_{y∈C} ‖v − y‖₂.    (42)
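For a concrete C, chosen here purely for illustration, take the box C = [0, 1]^d: the projection, and hence the prox of I_C, is coordinatewise clipping.

```python
import numpy as np

def prox_box_indicator(v):
    """prox of I_C for the box C = [0, 1]^d: projection = coordinatewise clipping."""
    return np.clip(v, 0.0, 1.0)

print(prox_box_indicator(np.array([-0.5, 0.3, 1.7])))  # [0.  0.3 1. ]
```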
Conjugate function

f: R^d → R, not necessarily convex. Conjugate of f:
    f*(v) = sup_y [vᵀy − f(y)].    (43)

Notes:
    f* is convex (pointwise sup of functions affine in v).
    If f is convex and closed: (f*)* = f.
    If f is differentiable: f* = Legendre transform of f.
Conjugate function: properties

If f = I_C, the indicator function of a unit ball, i.e.,
    C = B_{‖·‖} = {y ∈ R^d : ‖y‖ ≤ 1},    (44)
then f* is the dual norm:
    f*(v) = ‖v‖* = max_{y∈R^d:‖y‖≤1} vᵀy.    (45)

The dual norm of ‖·‖_p (p ≥ 1) is ‖·‖_{p′} with 1/p + 1/p′ = 1.
Similarly (G: partition):
    ‖u‖ = Σ_{G∈G} ‖u_G‖_p,    ‖u‖* = max_{G∈G} ‖u_G‖_{p′}.    (46)
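For example (a tiny numerical check, dimension chosen arbitrarily): with p = 1 the dual is the ℓ∞-norm, since the maximum of vᵀy over the ℓ1 ball is attained at plus or minus a coordinate vector.

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.normal(size=4)

# The l1 ball is the convex hull of +/- the coordinate vectors, so the
# maximum of v^T y over it is attained at a vertex: max_i |v_i| = ||v||_inf.
candidates = np.concatenate([np.eye(4), -np.eye(4)])
assert np.isclose((candidates @ v).max(), np.abs(v).max())
```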