On Iterative Hard Thresholding Methods for High-dimensional M-Estimation



Prateek Jain∗, Ambuj Tewari†, and Purushottam Kar∗



∗Microsoft Research, India        †University of Michigan, Ann Arbor, USA

28th Annual Conference on Neural Information Processing Systems (NIPS 2014)

The Goal

Analyze a class of effective and scalable iterative methods for high-dimensional statistical estimation problems.

High-dimensional M-estimation

Example: sparse least squares regression
Given: n samples z_i = (x_i, y_i) with y_i ≈ ⟨θ̄, x_i⟩, where θ̄ is sparse
Task: recover a sparse θ^est ∈ R^p such that θ^est ≈ θ̄
Points to note:
• Severely under-specified problem: n ≪ p
• Model sparsity: ‖θ̄‖_0 = s* ≪ p
The good news:
• Consistent estimation possible with structural assumptions
  ◦ Sparsity, low rank
• Poly-time estimation routines assuming RSC/RSS
  ◦ Convex relaxations (LASSO), greedy methods
The not-so-good news:
• The above estimation routines do not scale at all!
  ◦ Convex relaxations: non-smooth ⇒ slow rates
  ◦ Greedy methods: incremental approach ⇒ slow progress
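To make the problem setup concrete, here is a minimal NumPy sketch of the data model above; the sizes n, p, s* and the noise level are illustrative choices of mine, not the poster's experimental settings. It also checks that the unconstrained least-squares problem is under-determined, which is why the sparsity constraint is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s_star, sigma = 100, 2000, 10, 0.1    # illustrative sizes with n << p

# Ground-truth sparse parameter theta_bar with s* nonzero entries
theta_bar = np.zeros(p)
support = rng.choice(p, size=s_star, replace=False)
theta_bar[support] = rng.standard_normal(s_star)

# Samples z_i = (x_i, y_i) with y_i = <theta_bar, x_i> + noise
X = rng.standard_normal((n, p))
y = X @ theta_bar + sigma * rng.standard_normal(n)

# The unconstrained LS problem is severely under-specified: rank(X) <= n << p,
# so infinitely many theta fit y exactly; the sparsity constraint restores identifiability.
print("n =", n, " p =", p, " rank(X) =", np.linalg.matrix_rank(X))
```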

Setting the Stage

Given data samples, sparse estimation can be formulated as

    θ^est = arg min_{‖θ‖_0 ≤ s*} f(θ) = L(θ; z_{1:n})                    (1)

Examples (label noise: ξ_i ∼ N(0, σ²)):
1. Sparse LS regression: y_i = ⟨θ̄, x_i⟩ + ξ_i, x_i ∼ N(0, Σ), ‖θ̄‖_0 ≤ s*
     L(θ; z_{1:n}) = (1/n) Σ_i (y_i − ⟨x_i, θ⟩)²
2. Regression with feature noise: the feature noise can be
     ◦ additive: x̃_i = x_i + w_i with w_i ∼ N(0, Σ_W)
     ◦ obliterative: x̃_i = x_i w.p. 1 − ν and ∗ otherwise
   Let Γ̂ = X̃ᵀX̃/n − Σ_W and γ̂ = X̃ᵀY/n
     L(θ; z_{1:n}) = ½ θᵀΓ̂θ − γ̂ᵀθ
   Note: the above is non-convex for n ≪ p
3. Low-rank matrix regression: y_i = tr(W̄ᵀX_i) + ξ_i, rank(W̄) = s*
     L(W; Z_{1:n}) = (1/n) Σ_i (y_i − tr(W X_iᵀ))²
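A minimal sketch of the first two objectives above, assuming additive feature noise with a known covariance Σ_W and an identity design covariance for simplicity; all sizes and noise levels are illustrative, not taken from the poster.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s_star, sigma = 200, 500, 10, 0.1

theta_bar = np.zeros(p)
theta_bar[rng.choice(p, s_star, replace=False)] = 1.0

X = rng.standard_normal((n, p))                    # x_i ~ N(0, I)
y = X @ theta_bar + sigma * rng.standard_normal(n)

# 1. Sparse LS loss  L(theta) = (1/n) sum_i (y_i - <x_i, theta>)^2
def ls_loss(theta):
    r = y - X @ theta
    return r @ r / n

# 2. Regression with additive feature noise: observe x~_i = x_i + w_i, w_i ~ N(0, Sigma_W)
tau = 0.2
Sigma_W = tau**2 * np.eye(p)
X_tilde = X + tau * rng.standard_normal((n, p))

Gamma_hat = X_tilde.T @ X_tilde / n - Sigma_W      # bias-corrected Gram matrix
gamma_hat = X_tilde.T @ y / n

def noisy_feature_loss(theta):
    return 0.5 * theta @ Gamma_hat @ theta - gamma_hat @ theta

# For n << p, Gamma_hat has negative eigenvalues, so this objective is non-convex.
print("smallest eigenvalue of Gamma_hat:", np.linalg.eigvalsh(Gamma_hat).min())
```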

The Big Question

Can we show provable recovery guarantees for popular IHT-style methods under statistical settings with high condition numbers?

Challenges

• Current analyses are deficient in analyzing statistical models: they require small restricted condition numbers, which practical settings violate (see the RSC/RSS discussion below)

Iterative Hard Thresholding-style Methods

• Family of projected gradient descent-style methods
• Take a gradient step along ∇_θ f(θ) and project onto the feasible set
  ◦ Sparsity: P_s(z) keeps the s largest elements of z by magnitude
  ◦ Low rank: P_Ms(W) keeps the top-s singular components of W
• Very popular, methods of choice for large-scale applications
  ◦ IHT, GraDeS, HTP, CoSaMP, SP, OMPR(ℓ), ...
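A minimal sketch of the two projection operators (hard thresholding onto s-sparse vectors and onto rank-s matrices); the function names P_s and P_Ms follow the poster's notation, the implementation details are mine.

```python
import numpy as np

def P_s(z, s):
    """Sparsity projection: keep the s largest-magnitude entries of z, zero the rest."""
    theta = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -s)[-s:]
    theta[keep] = z[keep]
    return theta

def P_Ms(W, s):
    """Low-rank projection: keep the top-s singular components of W."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :s] * d[:s]) @ Vt[:s, :]

z = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
print(P_s(z, 2))          # -> [ 0. -3.  0.  2.  0.]
```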

Iterative Hard-thresholding

• Includes algorithms such as IHT, GraDeS

Algorithm 1 (IHT)
1. while not converged
2.   θ^{t+1} ← P_s(θ^t − η ∇_θ f(θ^t))

Theorem: With a relaxed projection level s ≥ (L_2s/α_2s)² s*, IHT guarantees f(θ^τ) − f(θ*) ≤ ε for τ ≥ (L_2s/α_2s) · log(1/ε) (here α_2s, L_2s are the RSC/RSS constants defined below).
Proof Idea: Key idea is to use a relaxed projection: P_s(·) provides a strong contraction if s ≫ s*. If θ = P_s(z), then
    ‖θ − z‖_2² ≤ ((p − s)/(p − s*)) · ‖θ* − z‖_2²
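A minimal sketch of Algorithm 1 for the least-squares objective. The step size (set conservatively from the full spectral norm), the relaxed projection level, and the problem sizes are illustrative assumptions, not tuned values from the paper.

```python
import numpy as np

def hard_threshold(z, s):
    """P_s(z): keep the s largest-magnitude entries of z, zero out the rest."""
    theta = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -s)[-s:]
    theta[keep] = z[keep]
    return theta

def iht(X, y, s, iters=500):
    """Algorithm 1 (IHT) for f(theta) = (1/2n)||y - X theta||^2."""
    n, p = X.shape
    eta = n / np.linalg.norm(X, ord=2) ** 2      # conservative step size ~ 1/L
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n         # gradient of the LS objective
        theta = hard_threshold(theta - eta * grad, s)
    return theta

# Tiny demo on synthetic sparse data (sizes are illustrative)
rng = np.random.default_rng(2)
n, p, s_star = 250, 500, 10
theta_bar = np.zeros(p)
theta_bar[rng.choice(p, s_star, replace=False)] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta_bar + 0.01 * rng.standard_normal(n)

theta_hat = iht(X, y, s=4 * s_star)              # relaxed projection: s > s*
print("recovery error:", np.linalg.norm(theta_hat - theta_bar))
```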

Two-stage Hard-thresholding

• Includes algorithms such as CoSaMP, Subspace Pursuit
• Utilizes a fully corrective step

Algorithm 2 (TsHT)
1. while not converged
2.   g^t ← ∇_θ f(θ^t),  S^t ← supp(θ^t)
3.   β^t ← FC(f; S^t ∪ {indices of the ℓ largest elements of g^t outside S^t})
4.   z^t ← P_s(β^t)
5.   θ^{t+1} ← FC(f; supp(z^t))

where FC(f; S) = arg min_{supp(θ) ⊆ S} f(θ) is the fully corrective step.
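A minimal sketch of Algorithm 2 for the least-squares objective, where the fully corrective step FC(f; S) is a least-squares solve restricted to the support S. The parameter choices and sizes are illustrative assumptions.

```python
import numpy as np

def hard_threshold(z, s):
    theta = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -s)[-s:]
    theta[keep] = z[keep]
    return theta

def fully_corrective(X, y, support):
    """FC(f; S): minimize the LS objective over vectors supported on S."""
    p = X.shape[1]
    support = np.asarray(sorted(support))
    sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    theta = np.zeros(p)
    theta[support] = sol
    return theta

def tsht(X, y, s, ell, iters=20):
    """Algorithm 2 (TsHT): expand the support with ell gradient coordinates,
    fully correct, hard-threshold back to s coordinates, fully correct again."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        S = set(np.flatnonzero(theta).tolist())
        outside = np.array([j for j in range(p) if j not in S])
        top_ell = outside[np.argpartition(np.abs(grad[outside]), -ell)[-ell:]]
        beta = fully_corrective(X, y, S | set(top_ell.tolist()))
        z = hard_threshold(beta, s)
        theta = fully_corrective(X, y, np.flatnonzero(z).tolist())
    return theta

# Tiny demo (sizes are illustrative)
rng = np.random.default_rng(3)
n, p, s_star = 250, 500, 10
theta_bar = np.zeros(p)
theta_bar[rng.choice(p, s_star, replace=False)] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta_bar + 0.01 * rng.standard_normal(n)

theta_hat = tsht(X, y, s=2 * s_star, ell=s_star)
print("recovery error:", np.linalg.norm(theta_hat - theta_bar))
```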

• Similar convergence bounds as IHT, with better constants
• Key idea 1: a large distance from the optimum implies a large gradient
    ‖g^t_{S^t ∪ S*}‖_2² ≥ 2α_2s (f(θ^t) − f(θ*)) + α_2s² ‖θ^t_{S^t \ S*}‖_2²
• Key idea 2: the projection doesn't undo the progress made by FC(·)
    f(z^t) − f(β^t) ≤ (L_2s/α_2s) · (ℓ / (s + ℓ − s*)) · (f(β^t) − f(θ*))
• Partial Hard-thresholding methods such as OMPR(ℓ) are analyzed as well

Restricted Strong Convexity/Smoothness

A function f satisfies RSC/RSS with constants α_2s and L_2s if, for all θ¹, θ² such that ‖θ¹‖_0, ‖θ²‖_0 ≤ s, we have

    (α_2s/2) ‖θ² − θ¹‖_2² ≤ f(θ²) − f(θ¹) − ⟨θ² − θ¹, ∇_θ f(θ¹)⟩ ≤ (L_2s/2) ‖θ² − θ¹‖_2²

• All known bounds require κ = L_2s/α_2s < constant
  ◦ For the LS objective, this reduces to the RIP condition
  ◦ Best known constant κ < 3 (or δ_2s < 0.5), due to OMPR(ℓ)
  ◦ Completely silent otherwise
• Assumption untrue: practical settings exhibit large κ ≫ 1

    Σ_X = [ 1      1 − ε ]
          [ 1 − ε  1     ]

  Note: even with infinite samples, κ = Ω(1/ε)
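A small numeric check of the covariance example above (the values of ε are arbitrary illustrative choices): even on 2-sparse supports, the condition number of Σ_X scales as 1/ε.

```python
import numpy as np

for eps in [0.1, 0.01, 0.001]:
    Sigma_X = np.array([[1.0, 1.0 - eps],
                        [1.0 - eps, 1.0]])
    evals = np.linalg.eigvalsh(Sigma_X)          # eigenvalues are eps and 2 - eps
    kappa = evals.max() / evals.min()
    print(f"eps = {eps:6.3f}  ->  kappa = {kappa:10.1f}   (~ 2/eps)")
```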

Guarantees for High Dimensional Statistical Estimation

Theorem: If θ^est is an ε_opt-optimal solution to (1), then

    ‖θ^est − θ̄‖_2 ≤ √(s + s*) · ‖∇_θ L(θ̄; z_{1:n})‖_∞ / α_{s+s*} + √(ε_opt / α_{s+s*})

Proof Idea: IHT results, RSC/RSS and Hölder's inequality
• Results hold even for non-convex L(·)
  ◦ Only RSC and RSS need to hold
  ◦ Essential for noisy regression models
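To illustrate how the theorem is used, here is a small simulation that measures ‖∇_θ L(θ̄; z_{1:n})‖_∞ for the sparse LS model and plugs it into the error bound. The RSC constant α, the optimization error ε_opt, and all problem sizes are assumed illustrative values, not estimated quantities.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, s_star, sigma = 500, 2000, 10, 0.5
s = 4 * s_star                                   # projected sparsity used by the solver

theta_bar = np.zeros(p)
theta_bar[rng.choice(p, s_star, replace=False)] = 1.0
X = rng.standard_normal((n, p))                  # Sigma = I here, so sigma_min(Sigma) = 1
y = X @ theta_bar + sigma * rng.standard_normal(n)

# Gradient of the LS loss at the true parameter: grad L(theta_bar) = -X^T xi / n
grad_inf = np.abs(X.T @ (X @ theta_bar - y) / n).max()
print("||grad L(theta_bar)||_inf =", round(grad_inf, 4),
      " (compare sigma*sqrt(log p / n) =", round(sigma * np.sqrt(np.log(p) / n), 4), ")")

# Plug into the theorem with an assumed RSC constant alpha_{s+s*} and optimization error
alpha, eps_opt = 0.5, 1e-6                       # illustrative values, not estimated
bound = np.sqrt(s + s_star) * grad_inf / alpha + np.sqrt(eps_opt / alpha)
print("bound on ||theta_est - theta_bar||_2:", round(bound, 3))
```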

Instantiating the theorem for the two regression models gives the following rates (k denotes the sparsity level at which RSC/RSS are evaluated):

                      Sparse LS regression                          Regression with feature noise
RSC (α_k)             σ_min(Σ)/2 − k log p / n                      σ_min(Σ)/2 − k τ(p) / n
RSS (L_k)             3σ_max(Σ)/2 + k log p / n                     2σ_max(Σ) + k τ(p) / n
‖∇L(·)‖_∞             σ √(log p / n)                                σ̃ ‖θ̄‖_2 √(log p / n)
‖θ^est − θ̄‖_2         (κ(Σ) σ / σ_min(Σ)) √(s* log p / n)           (κ(Σ) σ̃ ‖θ̄‖_2 / σ_min(Σ)) √(s* log p / n)
                        + √(ε_opt / σ_min(Σ))                         + √(ε_opt / σ_min(Σ))

where τ(p) = log p · (‖Σ‖_2 + ‖Σ_W‖_2)² / σ_min(Σ)²  and  σ̃ = (‖Σ_W‖_2 + σ) √(‖Σ‖_2 + ‖Σ_W‖_2).

IHT Methods in Practice

• Give comparable recovery quality as L1 or greedy methods
• Much more scalable than L1 and greedy methods
• Figure (a), recovery quality under noise: support recovery error vs. noise level (σ) for HTP, GraDeS, L1, FoBa
• Figure (b), runtimes with large problem sizes: runtime (sec) vs. dimensionality (p) for HTP, GraDeS, L1, FoBa
• Figure (c): support recovery error vs. projected sparsity (s) for CoSaMP, HTP, GraDeS
  Note: on large κ = 50 problems, relaxed projection really helps

Full Paper: http://tinyurl.com/mk5jlr8
σ˜ = (kΣW k + σ) kΣk + kΣW k2 Full Paper : http://tinyurl.com/mk5jlr8