On Iterative Hard Thresholding Methods for High-dimensional M-Estimation

Prateek Jain (Microsoft Research, India), Ambuj Tewari (University of Michigan, Ann Arbor, USA), and Purushottam Kar (Microsoft Research, India)
The Goal
Analyze a class of effective and scalable iterative methods for high-dimensional statistical estimation problems.
High-dimensional M-estimation

Given data samples z_1:n, sparse estimation can be formulated as

    θ* = arg min_{||θ||_0 ≤ s*} f(θ),  where f(θ) = L(θ; z_1:n)        (1)

Examples (label noise: ξ_i ∼ N(0, σ²)):

1. Sparse LS regression: y_i = ⟨θ̄, x_i⟩ + ξ_i, x_i ∼ N(x̄, Σ), ||θ̄||_0 ≤ s* (sketched in code below)
   L(θ; z_1:n) = (1/n) Σ_i (y_i − ⟨x_i, θ⟩)²
2. Regression with feature noise: the feature noise can be
   ◦ additive: x̃_i = x_i + w_i with w_i ∼ N(0, Σ_W)
   ◦ obliterative: x̃_i = x_i w.p. 1 − ν and ∗ (missing) otherwise
   Let Γ̂ = X̃ᵀX̃/n − Σ_W and γ̂ = X̃ᵀY/n, and set
   L(θ; z_1:n) = (1/2) θᵀ Γ̂ θ − γ̂ᵀ θ
   Note: the above is non-convex for n ≪ p
3. Low-rank matrix regression: y_i = tr(W* X_iᵀ) + ξ_i, rank(W*) = s*
   L(W; Z_1:n) = (1/n) Σ_i (y_i − tr(W X_iᵀ))²
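As a concrete reference point, the following is a minimal sketch of the sparse LS regression example above (plain NumPy; the sample sizes, seed, and variable names are illustrative choices, not taken from the paper). It generates n ≪ p samples from an s*-sparse model and evaluates the loss L(θ; z_1:n) and its gradient, which are the only ingredients the IHT-style methods discussed below require.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, s_star, sigma = 200, 2000, 10, 0.1      # n << p, ||theta_bar||_0 = s_star

    # Sparse ground-truth parameter theta_bar
    theta_bar = np.zeros(p)
    support = rng.choice(p, size=s_star, replace=False)
    theta_bar[support] = rng.standard_normal(s_star)

    # Design matrix and noisy responses: y_i = <theta_bar, x_i> + xi_i
    X = rng.standard_normal((n, p))
    y = X @ theta_bar + sigma * rng.standard_normal(n)

    def loss(theta):
        """L(theta; z_1:n) = (1/n) * sum_i (y_i - <x_i, theta>)^2."""
        r = y - X @ theta
        return r @ r / n

    def grad(theta):
        """Gradient of the least-squares loss above."""
        return -2.0 / n * X.T @ (y - X @ theta)

The other losses above (e.g. the feature-noise loss built from Γ̂ and γ̂) would slot into the same interface by swapping out loss and grad.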
Setting the Stage

Example: Sparse least squares regression
Given: n samples z_i = (x_i, y_i) with y_i ≈ ⟨θ̄, x_i⟩, where θ̄ is sparse
Task: recover a sparse θ^est ∈ R^p such that θ^est ≈ θ̄
Points to note:
• Severely under-specified problem: n ≪ p
• Model sparsity: ||θ̄||_0 = s* ≪ p
The good news:
• Consistent estimation is possible under structural assumptions
  ◦ Sparsity, low rank
• Poly-time estimation routines exist assuming RSC/RSS
  ◦ Convex relaxations (LASSO), greedy methods
The not-so-good news:
• The above estimation routines do not scale at all!
  ◦ Convex relaxations: non-smooth ⇒ slow rates
  ◦ Greedy methods: incremental approach ⇒ slow progress
Iterative Hard Thresholding-style Methods

• Family of projected gradient descent-style methods
• Take a gradient step along ∇_θ f(θ) and project onto the feasible set (both projections are sketched in code below)
  ◦ Sparsity: P_s(z) keeps the s largest-magnitude entries of z
  ◦ Low rank: P_Ms(W) keeps the top-s singular components of W
• Very popular, the methods of choice for large-scale applications
  ◦ IHT, GraDeS, HTP, CoSaMP, SP, OMPR(ℓ), ...
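A minimal sketch of the two projection steps above (NumPy; the function names P_s and P_Ms simply mirror the poster's notation and are not library routines): hard thresholding onto s-sparse vectors, and projection onto rank-s matrices via a truncated SVD.

    import numpy as np

    def P_s(z, s):
        """Hard thresholding: keep the s largest-magnitude entries of z, zero out the rest."""
        theta = np.zeros_like(z)
        keep = np.argsort(np.abs(z))[-s:]      # indices of the s largest |z_i|
        theta[keep] = z[keep]
        return theta

    def P_Ms(W, s):
        """Low-rank projection: keep the top-s singular components of W."""
        U, sv, Vt = np.linalg.svd(W, full_matrices=False)
        return (U[:, :s] * sv[:s]) @ Vt[:s, :]

The sparsity projection is just a top-s selection, which is what keeps the per-iteration cost of these methods low compared to re-solving a convex relaxation at every step.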
Restricted Strong Convexity/Smoothness

A function f satisfies RSC/RSS with constants α_2s and L_2s if, for all θ¹, θ² such that ||θ¹||_0, ||θ²||_0 ≤ s, we have

    (α_2s/2) ||θ¹ − θ²||_2²  ≤  f(θ¹) − f(θ²) − ⟨θ¹ − θ², ∇_θ f(θ²)⟩  ≤  (L_2s/2) ||θ¹ − θ²||_2²

Challenges

• Current analyses are deficient for such statistical models
• All known bounds require κ = L_2s/α_2s < constant
  ◦ For the LS objective, this reduces to the RIP condition
  ◦ Best known constant: κ < 3 (or δ_2s < 0.5), due to OMPR(ℓ)
  ◦ Completely silent otherwise
• The assumption is untrue: practical settings exhibit large κ
  Note: for Σ_X = [[1, 1−ε], [1−ε, 1]], even with infinite samples, κ = Ω(1/ε) (see the snippet below)
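A quick numerical check of the note above (NumPy; the ε values are arbitrary): the condition number of this Σ_X, which governs the attainable L_2s/α_2s for the least-squares objective, blows up as ε → 0.

    import numpy as np

    for eps in [0.1, 0.01, 0.001]:
        Sigma_X = np.array([[1.0, 1.0 - eps],
                            [1.0 - eps, 1.0]])
        ev = np.linalg.eigvalsh(Sigma_X)                 # eigenvalues: eps and 2 - eps
        print(f"eps = {eps:5.3f}   kappa = {ev[-1] / ev[0]:8.1f}")   # grows like 2/eps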
The Big Question

Can we show provable recovery guarantees for popular IHT-style methods under statistical settings with high condition numbers?
Iterative Hard-thresholding

Includes algorithms such as IHT and GraDeS.

Algorithm 1 (IHT)
1. while not converged
2.   θ^{t+1} ← P_s(θ^t − η ∇_θ f(θ^t))

Theorem: IHT guarantees f(θ^τ) − f(θ*) ≤ ε after τ = O((L_2s/α_2s) · log(f(θ^0)/ε)) iterations, provided it is run with a relaxed projected sparsity s = Ω((L_2s/α_2s)² · s*).

Proof Idea: the key idea is to use a relaxed projected sparsity s ≫ s*. Crucial: P_s(·) provides a strong contraction when s ≫ s*: if θ = P_s(z), then for any θ* with ||θ*||_0 ≤ s*,

    ||θ − z||_2² ≤ ((p − s)/(p − s*)) · ||θ* − z||_2²
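A minimal sketch of Algorithm 1 (NumPy; the fixed iteration count and the caller-supplied step size are illustrative simplifications — in practice η would be tuned or set from an estimate of L_2s):

    import numpy as np

    def iht(grad, p, s, eta, iters=500):
        """Algorithm 1 (IHT): theta^{t+1} = P_s(theta^t - eta * grad(theta^t)).

        grad : callable returning the gradient of f at theta
        p    : ambient dimension, s : relaxed projected sparsity, eta : step size
        """
        theta = np.zeros(p)
        for _ in range(iters):
            z = theta - eta * grad(theta)          # gradient step
            keep = np.argsort(np.abs(z))[-s:]      # hard thresholding P_s
            theta = np.zeros(p)
            theta[keep] = z[keep]
        return theta

With the sparse LS sketch above, one would call iht(grad, p, s, eta) with s a constant factor larger than s*, mirroring the relaxed-projection condition in the theorem.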
Two-stage Hard-thresholding

Includes algorithms such as CoSaMP and Subspace Pursuit.
• Utilizes a fully corrective step FC(f; S) = arg min_{supp(θ) ⊆ S} f(θ)

Algorithm 2 (TsHT)
1. while not converged
2.   g^t ← ∇_θ f(θ^t), S^t ← supp(θ^t)
3.   β^t ← FC(f; S^t ∪ {largest ℓ elements of g^t outside S^t})
4.   z^t ← P_s(β^t)
5.   θ^{t+1} ← FC(f; supp(z^t))

• Similar convergence bounds as for IHT, with better constants
• Key idea 1: a large distance from the optimum implies a large gradient

    ||g^t_{S^t ∪ S*}||_2² ≥ 2 α_2s (f(θ^t) − f(θ*)) + α_2s² ||θ^t_{S^t \ S*}||_2²

• Key idea 2: the projection does not undo the progress made by FC(·)

    f(z^t) − f(β^t) ≤ (L_2s/α_2s) · (ℓ / (s + ℓ − s*)) · (f(β^t) − f(θ*))

• Partial Hard-thresholding methods such as OMPR(ℓ) are analyzed as well
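A minimal sketch of Algorithm 2, specialized to the least-squares loss so that the fully corrective step FC(f; S) becomes a restricted least-squares solve (NumPy; fc_ls, tsht, the iteration count, and the choice of ℓ are all illustrative, not from the paper):

    import numpy as np

    def fc_ls(X, y, S):
        """Fully corrective step for the LS loss: argmin over supp(theta) in S of ||y - X theta||^2 / n."""
        theta = np.zeros(X.shape[1])
        S = np.asarray(sorted(S), dtype=int)
        sol, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        theta[S] = sol
        return theta

    def tsht(X, y, s, ell, iters=50):
        """Algorithm 2 (TsHT): CoSaMP/Subspace Pursuit-style two-stage hard thresholding."""
        n, p = X.shape
        theta = np.zeros(p)
        for _ in range(iters):
            g = -2.0 / n * X.T @ (y - X @ theta)            # g^t = grad f(theta^t)
            S = set(np.flatnonzero(theta))                  # S^t = supp(theta^t)
            outside = [i for i in np.argsort(-np.abs(g)) if i not in S][:ell]
            beta = fc_ls(X, y, S | set(outside))            # beta^t
            keep = np.argsort(np.abs(beta))[-s:]            # supp(P_s(beta^t))
            theta = fc_ls(X, y, keep)                       # theta^{t+1}
        return theta

On data like that generated in the sparse LS sketch earlier, one would call, for example, tsht(X, y, s=2 * s_star, ell=s_star).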
Guarantees for High Dimensional Statistical Estimation
Theorem: If θ^est is an ε_opt-optimal solution to (1), then

    ||θ^est − θ̄||_2 ≤ √(s + s*) · ||∇_θ L(θ̄; z_1:n)||_∞ / α_{s+s*} + √(ε_opt / α_{s+s*})

Proof Idea: the IHT results above, RSC/RSS, and Hölder's inequality
• Results hold even for non-convex L(·)
  ◦ Only RSC and RSS need to hold
  ◦ Essential for the noisy regression models
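A small helper (NumPy; specialized to the least-squares loss, with α and ε_opt supplied by the caller since they depend on the statistical model at hand) showing how the two terms of the bound above are evaluated:

    import numpy as np

    def estimation_error_bound(X, y, theta_bar, s, s_star, alpha, eps_opt):
        """Evaluate sqrt(s + s*) * ||grad L(theta_bar)||_inf / alpha + sqrt(eps_opt / alpha)
        for the least-squares loss L(theta) = ||y - X theta||^2 / n."""
        n = X.shape[0]
        grad_inf = np.max(np.abs(-2.0 / n * X.T @ (y - X @ theta_bar)))
        return np.sqrt(s + s_star) * grad_inf / alpha + np.sqrt(eps_opt / alpha)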
Instantiating the theorem for the regression models above (k denotes the relevant sparsity level):

Sparse LS regression:
  RSC (α_k):           σ_min(Σ)/2 − k·log p / n
  RSS (L_k):           3σ_max(Σ)/2 + k·log p / n
  ||∇L(·)||_∞:         σ · √(log p / n)
  ||θ^est − θ̄||_2:     (κ(Σ)·σ/σ_min(Σ)) · √(s*·log p / n) + √(ε_opt/σ_min(Σ))

Regression with feature noise:
  RSC (α_k):           σ_min(Σ)/2 − k·τ(p) / n
  RSS (L_k):           2σ_max(Σ) + k·τ(p) / n
  ||∇L(·)||_∞:         σ̃ · ||θ̄||_2 · √(log p / n)
  ||θ^est − θ̄||_2:     (κ(Σ)·σ̃·||θ̄||_2/σ_min(Σ)) · √(s*·log p / n) + √(ε_opt/σ_min(Σ))

where τ(p) = log p · (||Σ||_2 + ||Σ_W||_2)² / σ_min(Σ) and σ̃ = (||Σ_W||_2 + σ) · √(||Σ||_2 + ||Σ_W||_2²)

IHT Methods in Practice

• Give comparable recovery quality to L1 and greedy methods
• Much more scalable than L1 and greedy methods

[Figures: (a) Recovery quality under noise — support recovery error vs. noise level (sigma); (b) Runtimes with large problem sizes — runtime (sec) vs. dimensionality (p); both comparing HTP, GraDeS, L1, and FoBa; (c) Support recovery error vs. projected sparsity (s) for CoSaMP, HTP, and GraDeS]

Note: on large κ = 50 problems, relaxed projection really helps

28th Annual Conference on Neural Information Processing Systems (NIPS 2014)
Full Paper: http://tinyurl.com/mk5jlr8