Smoothing Proximal Gradient Method for General Structured Sparse Regression
Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012)

Presented by Zoltán Szabó
Gatsby Unit, Tea Talk, October 25, 2013
Outline

Motivation: (structured) sparse coding.
Proximal operators, FISTA.
Solution: dual norm + smooth approximation.
Motivation: least squares, sparse coding

Given: x ∈ R^{d_x}, D ∈ R^{d_x × d_α}.

Least squares problem:
    J(α) = (1/2)‖x − Dα‖₂² → min_{α∈R^{d_α}}.    (1)

Sparse coding (JPEG; convex relaxation, Lasso, w > 0):
    J(α) = (1/2)‖x − Dα‖₂² + w‖α‖₁ → min_{α∈R^{d_α}}.    (2)
Motivation: structured sparse coding

Group Lasso (G: partition = non-overlapping blocks):
    J(α) = (1/2)‖x − Dα‖₂² + w Σ_{G∈G} ‖α_G‖₂ → min_{α∈R^{d_α}}.    (3)

Overlapping G: hierarchy, grid, total variation, graphs.
Many successful applications: gene analysis, facial expression recognition, ...
Non-overlapping group Lasso

FISTA objective:
    J(α) = f(α) + g(α) → min_{α∈R^{d_α}}.    (4)

Assumptions: f, g convex; f 'smooth' (Lipschitz continuous gradient, constant L).

Fast convergence:
    J(α_t) − J(α*) = O(1/t²).    (5)
FISTA

Ingredients:
    Gradient of the smooth term: ∇f.
    Lipschitz constant of ∇f: L.
    Proximal operator of the non-smooth term (p > 0):
        prox_{pg}(v) = argmin_y [ g(y) + (1/(2p))‖y − v‖₂² ].    (6)

Example: f(α) = (1/2)‖x − Dα‖₂², g(α) = w Σ_{G∈G} ‖α_G‖₂:
    ∇f(α) = Dᵀ(Dα − x),    (7)
    L = λ_max(DᵀD),    (8)
    prox_g: analytical (for a partition G).
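To make these ingredients concrete, here is a minimal NumPy sketch of FISTA for the non-overlapping group Lasso (3); the function and variable names are mine, the partition is assumed to be given as a list of index arrays, and the prox (6) decouples into blockwise group soft-thresholding:

```python
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Blockwise prox of tau * sum_G ||v_G||_2 for a partition `groups`."""
    out = np.zeros_like(v)
    for G in groups:
        norm = np.linalg.norm(v[G])
        if norm > tau:
            out[G] = (1.0 - tau / norm) * v[G]
    return out

def fista_group_lasso(x, D, groups, w, n_iter=500):
    """FISTA for 0.5*||x - D a||_2^2 + w * sum_G ||a_G||_2, non-overlapping G."""
    L = np.linalg.eigvalsh(D.T @ D).max()     # Lipschitz constant of grad f, eq. (8)
    alpha = np.zeros(D.shape[1])
    y, t = alpha.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)              # gradient of the smooth term, eq. (7)
        alpha_new = group_soft_threshold(y - grad / L, groups, w / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)  # momentum step
        alpha, t = alpha_new, t_new
    return alpha
```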
Goal

Objective (λ > 0; w_G > 0, ∀G ∈ G):
    J(α) = f(α) + Ω(α) + λ‖α‖₁ → min_{α∈R^{d_α}},    (9)
    Ω(α) = Σ_{G∈G} w_G ‖α_G‖₂.    (10)

Assumption: f convex (FISTA assumptions).
G: overlapping ⇒ no analytical formula for prox_{pg}.
Solution

The ℓ₂-norm is self-dual:
    ‖a‖₂ = max_{b:‖b‖₂≤1} bᵀa.    (11)

We rewrite Ω (β_G ∈ R^{|G|}: auxiliary variables, β = (β_G)_{G∈G} ∈ R^{Σ_{G∈G}|G|}):
    Ω(α) = Σ_{G∈G} w_G ‖α_G‖₂ = Σ_{G∈G} w_G max_{β_G:‖β_G‖₂≤1} β_Gᵀ α_G    (12)
         = max_{β∈Q} Σ_{G∈G} w_G β_Gᵀ α_G    (13)
         =: max_{β∈Q} βᵀCα,    (14)

where C is the matrix whose block row for group G is w_G times the selector of the coordinates in G (so that βᵀCα = Σ_{G∈G} w_G β_Gᵀ α_G), and
    Q = {β : ‖β_G‖₂ ≤ 1, ∀G ∈ G} (product of unit balls).
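A quick numerical sanity check of (12)-(14), with group sizes and weights chosen arbitrarily: the inner maximum over ‖β_G‖₂ ≤ 1 is attained at β_G = α_G/‖α_G‖₂, so the dual value recovers w_G‖α_G‖₂ group by group.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.normal(size=6)
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]  # illustrative partition
w = [1.0, 2.0, 0.5]

# Primal value: Omega(alpha) = sum_G w_G ||alpha_G||_2
omega = sum(wg * np.linalg.norm(alpha[G]) for wg, G in zip(w, groups))

# Dual value at the maximizer beta_G = alpha_G / ||alpha_G||_2
dual = sum(wg * (alpha[G] / np.linalg.norm(alpha[G])) @ alpha[G]
           for wg, G in zip(w, groups))

assert np.isclose(omega, dual)
```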
Solution - continued

Smooth approximation to Ω(α) (µ ≥ 0):
    Ω(α) = max_{β∈Q} βᵀCα ≈ max_{β∈Q} [βᵀCα − µs(β)] =: Ωµ(α),    (15)
    s(β) = (1/2)‖β‖₂² ≥ 0.

Maximum gap is µM:
    M = max_{β∈Q} s(β) = |G|/2,    (16)
(each of the |G| blocks of β has ℓ₂-norm at most 1, hence ‖β‖₂² ≤ |G|), and
    Ω(α) − µM ≤ Ωµ(α) ≤ Ω(α).    (17)
Solution: FISTA on the smooth approximation

Original objective (λ > 0):
    J(α) = f(α) + Ω(α) + λ‖α‖₁ → min_{α∈R^{d_α}}.    (18)

Smooth approximation (µ > 0, λ > 0):
    Jµ(α) = [f(α) + Ωµ(α)] + λ‖α‖₁ → min_{α∈R^{d_α}},    (19)
with FISTA applied to the smooth part f + Ωµ and the non-smooth part g = λ‖·‖₁.
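The non-smooth part g = λ‖·‖₁ has the well-known elementwise prox, soft-thresholding; a minimal sketch (used inside FISTA with step size 1/L, the threshold is tau = λ/L):

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```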
Result (= FISTA can be applied)

Ωµ(α) is convex with Lipschitz continuous gradient:
    ∇Ωµ(α) = Cᵀβ*,    (20)
    β* = argmax_{β∈Q} [βᵀCα − µs(β)]    (21)
       = [Π₂(w_G α_G / µ)]_{G∈G},    (22)
where Π₂ denotes the Euclidean projection onto the ℓ₂ unit ball.

Lipschitz constant: Lµ = (1/µ)‖C‖₂².
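In code, (20)-(22) amount to a blockwise projection followed by a scatter-add. A sketch, assuming `groups` is a list of index arrays (which may overlap) with per-group weights `w`:

```python
import numpy as np

def project_l2_ball(u):
    """Pi_2: Euclidean projection onto the unit l2 ball."""
    n = np.linalg.norm(u)
    return u if n <= 1.0 else u / n

def grad_omega_mu(alpha, groups, w, mu):
    """Gradient of the smoothed penalty: C^T beta*, beta*_G = Pi_2(w_G alpha_G / mu)."""
    grad = np.zeros_like(alpha)
    for wg, G in zip(w, groups):
        beta_G = project_l2_ball(wg * alpha[G] / mu)
        grad[G] += wg * beta_G   # block row of C for group G is w_G times a selector
    return grad
```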
Proof (intuition)

Convexity, smoothness of Ωµ:
    Ωµ(α) = max_{β∈Q} [βᵀCα − µs(β)] = µ max_{β∈Q} [βᵀ(Cα/µ) − s(β)] = µ d*(Cα/µ),    (23)
where d* is the conjugate of s + I_Q, hence convex.

Gradient ∇Ωµ: Danskin's theorem with
    h(α) = max_{β∈K compact} φ(β, α),    (24)
    ∇h(α) = ∇_α φ(β*, α).    (25)

Lipschitz constant Lµ: Nesterov '05.
Convergence rate: O(1/ε)

Given: ε (precision). We want J(α_t) − J(α*) ≤ ε.
Set µ = ε/(2M), where M = |G|/2.    (26)

Sufficient number of iterations:
    t = sqrt( (4‖α* − α₀‖₂²/ε) [ λ_max(DᵀD) + 2M‖C‖₂²/ε ] ) = O(1/ε).

Note (subgradient descent is much slower): O(1/ε²).
Summary

Task: overlapping group Lasso.
Difficulty: overlap ⇒ non-separability.
Proposed solution:
    Self-duality of the ℓ₂-norm: ‖·‖₂* = ‖·‖₂.
    Smooth approximation.
    |G| independent subproblems, analytical expressions → FISTA.
    Convergence rate: O(1/ε).
Thank you for your attention!
Analytical solution for β*

    β* = argmax_{β∈Q} [βᵀCα − (µ/2)‖β‖₂²]    (27)
       = argmax_{β∈Q} Σ_{G∈G} [w_G β_Gᵀ α_G − (µ/2)‖β_G‖₂²]    (28)
       = argmin_{β∈Q} Σ_{G∈G} ‖β_G − w_G α_G/µ‖₂².    (29)

Thus
    (β*)_G = Π₂(w_G α_G / µ).    (30)
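A small brute-force check of (30) in two dimensions (the values of a, w, µ below are arbitrary): maximize (27) for a single group over a dense grid of the unit disk and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
a, w, mu = rng.normal(size=2), 1.5, 0.3

# Closed form (30): beta* = Pi_2(w * a / mu)
u = w * a / mu
beta_star = u / max(1.0, np.linalg.norm(u))

# Brute force over the 2-D unit disk (dense polar grid), single group
theta = np.linspace(0, 2 * np.pi, 2000)
r = np.linspace(0, 1, 200)
R, T = np.meshgrid(r, theta)
B = np.stack([R * np.cos(T), R * np.sin(T)], axis=-1)     # candidate betas
vals = w * (B @ a) - 0.5 * mu * (B ** 2).sum(axis=-1)     # objective (27)
best = B.reshape(-1, 2)[vals.argmax()]

assert np.allclose(best, beta_star, atol=1e-2)
```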
Combination of Lipschitz constants
Let Lf (Lg ) be a Lipschitz constant of ∇f (∇g). Then Lf +g ≤ Lf + Lg , since k(∇f + ∇g)(x) − (∇f + ∇g)(y)k2
(31)
≤ k[∇f (x) + ∇g(x)] − [∇f (y) + ∇g(y)]k2
(32)
≤ k∇f (x) − ∇f (y)k2 + k∇g(y) − ∇g(y)k2
(33)
= Lf kx − yk2 + Lg kx − yk2
(34)
≤ (Lf + Lg ) kx − yk2 .
(35)
Zoltán Szabó
Smoothing Proximal Gradient Method
Rate of convergence for SPG

    J(α_t) − J(α*)
    = [J(α_t) − Jµ(α_t)] + [Jµ(α_t) − Jµ(α*)] + [Jµ(α*) − J(α*)]    (36)
    ≤ µM + 2Lµ‖α₀ − α*‖₂²/t² + 0    (37)
    ≤ µM + (2‖α₀ − α*‖₂²/t²) [ λ_max(DᵀD) + ‖C‖₂²/µ ].    (38)

Plug in µ = ε/(2M) and solve for t:
    J(α_t) − J(α*) ≤ ε/2 + (2‖α₀ − α*‖₂²/t²) [ λ_max(DᵀD) + 2M‖C‖₂²/ε ] ≤ ε.
Proximal operator

f: R^d → R ∪ {∞}: closed proper convex function, i.e.,
    epi(f) = {(y, t) ∈ R^d × R : f(y) ≤ t}    (39)
is nonempty, closed, and convex.

Proximal operator of f:
    prox_f(v) = argmin_y [ f(y) + (1/2)‖y − v‖₂² ].    (40)

Strictly convex r.h.s. of (40) ⇒ prox_f exists and is unique.
Proximal operator = generalization of projection

C: closed convex set. f = I_C: indicator function of C,
    I_C(y) = 0 if y ∈ C, ∞ otherwise.    (41)

Then prox_f is the Euclidean projection onto C:
    prox_{I_C}(v) = Π_C(v) = argmin_{y∈C} ‖v − y‖₂.    (42)
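For a concrete C, chosen here purely for illustration, take the box C = [0, 1]^d: the projection, and hence the prox of I_C, is coordinatewise clipping.

```python
import numpy as np

def prox_box_indicator(v):
    """prox of I_C for the box C = [0, 1]^d: projection = coordinatewise clipping."""
    return np.clip(v, 0.0, 1.0)

print(prox_box_indicator(np.array([-0.5, 0.3, 1.7])))  # [0.  0.3 1. ]
```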
Conjugate function

f: R^d → R, not necessarily convex. Conjugate of f:
    f*(v) = sup_y [vᵀy − f(y)].    (43)

Notes:
    f* is convex (pointwise sup of functions affine in v).
    If f is convex and closed: (f*)* = f.
    If f is differentiable: f* = Legendre transform of f.
Conjugate function: properties

If f = I_C, the indicator function of a unit ball, i.e.,
    C = B_{‖·‖} = {y ∈ R^d : ‖y‖ ≤ 1},    (44)
then f* is the dual norm:
    f*(v) = ‖v‖* = max_{y∈R^d:‖y‖≤1} vᵀy.    (45)

The dual norm of ‖·‖_p (p ≥ 1) is ‖·‖_{p′} with 1/p + 1/p′ = 1.
Similarly (G: partition):
    ‖u‖ = Σ_{G∈G} ‖u_G‖_p,    ‖u‖* = max_{G∈G} ‖u_G‖_{p′}.    (46)
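For example (a tiny numerical check, dimension chosen arbitrarily): with p = 1 the dual is the ℓ∞-norm, since the maximum of vᵀy over the ℓ1 ball is attained at plus or minus a coordinate vector.

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.normal(size=4)

# The l1 ball is the convex hull of +/- the coordinate vectors, so the
# maximum of v^T y over it is attained at a vertex: max_i |v_i| = ||v||_inf.
candidates = np.concatenate([np.eye(4), -np.eye(4)])
assert np.isclose((candidates @ v).max(), np.abs(v).max())
```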