Distributed Optimization and Statistics via Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein
Source: http://www.stanford.edu/~boyd/papers/pdf/admm_talk.pdf
Discussion led by XianXing Zhang

September 6, 2013

Outline of today’s discussion

- Background
  - Dual Decomposition
  - Method of Multipliers
- Alternating Direction Method of Multipliers (ADMM)
- Examples
- Conclusions

Dual problem

- convex equality constrained optimization problem:
    minimize f(x)
    subject to Ax = b
- Lagrangian: L(x, y) = f(x) + y^T (Ax - b)
- dual function: g(y) = inf_x L(x, y)
- dual problem: y* = argmax_y g(y)
- recover x* = argmin_x L(x, y*)

Dual ascent

- recall the dual problem: y* = argmax_y g(y)
- gradient method for the dual problem: y^{k+1} = y^k + α^k ∇g(y^k)
- ∇g(y^k) = A x^k - b, where x^k = argmin_x L(x, y^k)
- dual ascent method (a sketch follows below):
    x^{k+1} := argmin_x L(x, y^k)            // x-minimization
    y^{k+1} := y^k + α^k (A x^{k+1} - b)     // dual update
- works, with lots of strong assumptions (e.g., strong convexity)
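
To make the iteration concrete, here is a minimal dual ascent sketch in Python/NumPy (not from the slides) for a strongly convex quadratic f(x) = (1/2) x^T P x + q^T x, so the x-minimization has a closed form; the problem data and the step size choice are hypothetical.

    import numpy as np

    # hypothetical problem data: minimize (1/2) x^T P x + q^T x  subject to  Ax = b
    rng = np.random.default_rng(0)
    n, m = 20, 5
    P = np.eye(n)                         # strongly convex quadratic objective
    q = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    y = np.zeros(m)                                # dual variable
    alpha = 1.0 / np.linalg.norm(A @ A.T, 2)       # step size small enough for this instance
    for k in range(500):
        # x-minimization: solve grad_x L = P x + q + A^T y = 0
        x = -np.linalg.solve(P, q + A.T @ y)
        # dual gradient step: grad g(y) = A x - b
        y = y + alpha * (A @ x - b)

    print("primal residual ||Ax - b|| =", np.linalg.norm(A @ x - b))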

Dual decomposition

- suppose f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), with x = (x_1, ..., x_N)
- then L(x, y) = f(x) + y^T (Ax - b) is separable in x:
    L(x, y) = L_1(x_1, y) + ... + L_N(x_N, y) - y^T b
    L_i(x_i, y) = f_i(x_i) + y^T A_i x_i
- x-minimization in dual ascent splits into N separate minimizations
    x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k)
  which can be carried out in parallel

Dual decomposition

- dual decomposition (Everett, Dantzig, Wolfe, Benders 1960-65):
    x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),   i = 1, ..., N
    y^{k+1} := y^k + α^k ( Σ_{i=1}^N A_i x_i^{k+1} - b )
- scatter y^k; update the x_i in parallel; gather the A_i x_i^{k+1} (a sketch follows below)
- solve a large problem
  - by iteratively solving subproblems (in parallel)
  - dual variable update provides coordination
- works, with lots of assumptions; often slow
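
The scatter/gather structure can be illustrated with a small sketch (hypothetical data; a serial loop stands in for the parallel block updates) in which each f_i(x_i) = (1/2)||x_i - c_i||_2^2, so every block minimizer is available in closed form.

    import numpy as np

    # hypothetical separable problem: f_i(x_i) = (1/2)||x_i - c_i||^2,  constraint  sum_i A_i x_i = b
    rng = np.random.default_rng(1)
    N, ni, m = 4, 10, 5                          # number of blocks, block size, constraint rows
    A_blocks = [rng.standard_normal((m, ni)) for _ in range(N)]
    c_blocks = [rng.standard_normal(ni) for _ in range(N)]
    b = rng.standard_normal(m)

    A_full = np.hstack(A_blocks)
    alpha = 1.0 / np.linalg.norm(A_full @ A_full.T, 2)   # conservative dual step size
    y = np.zeros(m)
    for k in range(500):
        # "scatter" y: each block minimizes L_i(x_i, y) = f_i(x_i) + y^T A_i x_i independently
        x_blocks = [c - A.T @ y for A, c in zip(A_blocks, c_blocks)]
        # "gather" the A_i x_i and take a dual gradient step
        resid = sum(A @ x for A, x in zip(A_blocks, x_blocks)) - b
        y = y + alpha * resid

    print("primal residual:", np.linalg.norm(resid))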

Method of multipliers

- a method to robustify dual ascent
- use the augmented Lagrangian (Hestenes, Powell 1969), with ρ > 0:
    L_ρ(x, y) = f(x) + y^T (Ax - b) + (ρ/2)||Ax - b||_2^2
- method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982), sketched below:
    x^{k+1} := argmin_x L_ρ(x, y^k)
    y^{k+1} := y^k + ρ(A x^{k+1} - b)
  (using ρ as the dual update step length makes each iterate (x^{k+1}, y^{k+1}) dual feasible, so the iteration only has to drive the primal residual A x^{k+1} - b to zero)
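
A minimal sketch (not the talk's code) of the method of multipliers on the same kind of equality-constrained quadratic used above; because of the quadratic penalty the x-update is a single linear solve, and the problem data and ρ are made-up placeholders.

    import numpy as np

    # hypothetical data: minimize (1/2) x^T P x + q^T x  subject to  Ax = b
    rng = np.random.default_rng(2)
    n, m = 20, 5
    P = np.eye(n)
    q = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    rho = 1.0
    y = np.zeros(m)
    for k in range(100):
        # x-update: minimize L_rho(x, y), i.e. solve  (P + rho A^T A) x = rho A^T b - A^T y - q
        x = np.linalg.solve(P + rho * A.T @ A, rho * A.T @ b - A.T @ y - q)
        # dual update with step length rho
        y = y + rho * (A @ x - b)

    print("primal residual ||Ax - b|| =", np.linalg.norm(A @ x - b))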

Method of multipliers

- good news: converges under much more relaxed conditions (f can be non-differentiable, can take the value +∞, ...)
- bad news: the quadratic penalty destroys the splitting of the x-update, so we can't do decomposition

Alternating direction method of multipliers (ADMM)

- a method
  - with the good robustness of the method of multipliers
  - which can support decomposition
- proposed by Gabay, Mercier, Glowinski, and Marrocco in 1976

Alternating direction method of multipliers

- ADMM problem form (with f, g convex):
    minimize f(x) + g(z)
    subject to Ax + Bz = c
  i.e., two sets of variables, with a separable objective
- augmented Lagrangian:
    L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (ρ/2)||Ax + Bz - c||_2^2
- ADMM:
    x^{k+1} := argmin_x L_ρ(x, z^k, y^k)               // x-minimization
    z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)           // z-minimization
    y^{k+1} := y^k + ρ(A x^{k+1} + B z^{k+1} - c)      // dual update

Alternating direction method of multipliers

- if we minimized over x and z jointly, this would reduce to the method of multipliers
- instead, we do one pass of a Gauss-Seidel method
- we get splitting since we minimize over x with z fixed, and vice versa

ADMM with scaled dual variables

- combine the linear and quadratic terms in the augmented Lagrangian:
    L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (ρ/2)||Ax + Bz - c||_2^2
                 = f(x) + g(z) + (ρ/2)||Ax + Bz - c + u||_2^2 + const.
  with the scaled dual variable u^k = (1/ρ) y^k
- ADMM (scaled dual form), see the skeleton below:
    x^{k+1} := argmin_x ( f(x) + (ρ/2)||Ax + Bz^k - c + u^k||_2^2 )
    z^{k+1} := argmin_z ( g(z) + (ρ/2)||A x^{k+1} + Bz - c + u^k||_2^2 )
    u^{k+1} := u^k + (A x^{k+1} + B z^{k+1} - c)
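
The scaled form maps directly onto code. Below is a generic skeleton (a sketch under assumptions, not code from the talk): the subproblem solvers x_update and z_update are supplied by the caller, and a fixed iteration count stands in for a proper stopping criterion based on primal and dual residuals.

    import numpy as np

    def admm_scaled(x_update, z_update, A, B, c, iters=100):
        # Generic scaled-form ADMM skeleton.
        # x_update(z, u) should return argmin_x f(x) + (rho/2)||Ax + Bz - c + u||^2,
        # z_update(x, u) should return argmin_z g(z) + (rho/2)||Ax + Bz - c + u||^2,
        # with rho baked into the two solvers.
        u = np.zeros(c.shape[0])          # scaled dual variable u = y / rho
        z = np.zeros(B.shape[1])
        for k in range(iters):
            x = x_update(z, u)            # x-minimization
            z = z_update(x, u)            # z-minimization
            u = u + A @ x + B @ z - c     # scaled dual update
        return x, z, u

The lasso updates on the next slides are exactly this skeleton with a closed-form x_update and soft thresholding as the z_update.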

Convergence

- assume (very little!)
  - f, g convex, closed, proper
  - L_0 has a saddle point
- then ADMM converges:
  - iterates approach feasibility: A x^k + B z^k - c → 0
  - objective approaches the optimal value: f(x^k) + g(z^k) → p*

Lasso

- lasso problem: minimize (1/2)||Ax - b||_2^2 + λ||x||_1
- ADMM form:
    minimize (1/2)||Ax - b||_2^2 + λ||z||_1
    subject to x - z = 0
- ADMM (scaled form; S_{λ/ρ} is elementwise soft thresholding), sketched below:
    x^{k+1} := (A^T A + ρI)^{-1} (A^T b + ρ(z^k - u^k))
    z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k)
    u^{k+1} := u^k + x^{k+1} - z^{k+1}
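
Here is a short Python/NumPy sketch of these updates (not the script whose timings appear on the next slide): the Cholesky factor of A^T A + ρI is computed once and reused every iteration, and the problem instance at the bottom is synthetic.

    import numpy as np

    def soft_threshold(v, kappa):
        # S_kappa(v) = sign(v) * max(|v| - kappa, 0), applied elementwise
        return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

    def lasso_admm(A, b, lam, rho=1.0, iters=100):
        # lasso via scaled-form ADMM (a sketch of the updates on this slide)
        n = A.shape[1]
        z = np.zeros(n)
        u = np.zeros(n)
        L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))   # factor once, reuse below
        Atb = A.T @ b
        for k in range(iters):
            x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
            z = soft_threshold(x + u, lam / rho)
            u = u + x - z
        return z

    # hypothetical small instance
    rng = np.random.default_rng(3)
    A = rng.standard_normal((150, 500))
    x_true = np.zeros(500)
    x_true[:10] = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(150)
    x_hat = lasso_admm(A, b, lam=0.1 * np.max(np.abs(A.T @ b)))
    print("nonzero coefficients in solution:", np.count_nonzero(x_hat))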

Lasso example

- example with a dense A ∈ R^{1500×5000}
- computation times:
  - factorization (same as for ridge regression): 1.3 s
  - each subsequent ADMM iteration: 0.03 s
  - lasso solve (about 50 ADMM iterations): 2.9 s
  - full regularization path (30 values of λ): 4.4 s
- not bad for a very short script

Consensus optimization

- want to solve a problem with N objective terms:
    minimize Σ_{i=1}^N f(x_i | θ) + g(θ)
  e.g., f(x_i | θ) is the loss on x_i, the ith block of training data
- ADMM form:
    minimize Σ_{i=1}^N f(x_i | θ_i) + g(θ)
    subject to θ_i - θ = 0,   i = 1, ..., N
- the θ_i are local variables, θ is the global variable
- θ_i - θ = 0 are the consistency or consensus constraints

Consensus optimization via ADMM

- augmented Lagrangian:
    L_ρ({θ_i}, θ, y) = g(θ) + Σ_{i=1}^N ( f(x_i | θ_i) + y_i^T (θ_i - θ) + (ρ/2)||θ_i - θ||_2^2 )
- ADMM:
    θ_i^{k+1} := argmin_{θ_i} ( f(x_i | θ_i) + y_i^{kT} (θ_i - θ^k) + (ρ/2)||θ_i - θ^k||_2^2 )
    θ^{k+1}  := argmin_θ ( g(θ) + Σ_i ( y_i^{kT} (θ_i^{k+1} - θ) + (ρ/2)||θ_i^{k+1} - θ||_2^2 ) )
    y_i^{k+1} := y_i^k + ρ(θ_i^{k+1} - θ^{k+1})
- the θ_i^{k+1} and y_i^{k+1} can be updated in parallel (a sketch follows below)
- θ is updated using the consensus of the local parameters
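
As an illustration (hypothetical data, and a serial loop standing in for the parallel workers), here is consensus ADMM where each local loss is the least-squares loss on a data block (X[i], b[i]) and g(θ) is a ridge penalty, so every update is a closed-form linear solve or an average.

    import numpy as np

    # hypothetical setup: local loss f(x_i | theta_i) = (1/2)||X_i theta_i - b_i||^2 on block i,
    # global regularizer g(theta) = (lam/2)||theta||^2
    rng = np.random.default_rng(4)
    N, d, rows = 5, 8, 40
    X = [rng.standard_normal((rows, d)) for _ in range(N)]
    b = [rng.standard_normal(rows) for _ in range(N)]
    lam, rho = 1.0, 1.0

    theta = np.zeros(d)                            # global (consensus) variable
    theta_i = [np.zeros(d) for _ in range(N)]      # local variables
    y = [np.zeros(d) for _ in range(N)]            # dual variables ("prices")

    for k in range(100):
        # local updates: independent, so they could run on separate workers
        for i in range(N):
            theta_i[i] = np.linalg.solve(X[i].T @ X[i] + rho * np.eye(d),
                                         X[i].T @ b[i] - y[i] + rho * theta)
        # global update: average the local parameters and prices against g(theta)
        theta = sum(rho * theta_i[i] + y[i] for i in range(N)) / (lam + N * rho)
        # dual updates
        for i in range(N):
            y[i] = y[i] + rho * (theta_i[i] - theta)

    print("max disagreement ||theta_i - theta||:", max(np.linalg.norm(t - theta) for t in theta_i))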

Statistical interpretation

- f(x_i | θ_i) is the negative log-likelihood of parameter θ_i given the ith data block
- θ_i^{k+1} is the MAP estimate under the Gaussian prior N(θ^k - (1/ρ) y_i^k, ρ^{-1} I)
- the prior mean is the previous iteration's consensus, shifted by the "price" of processor i disagreeing with that consensus
- local processors only need to support a Gaussian MAP method

Distributed ℓ1 regularized logistic regression example

- logistic loss l(u) = log(1 + e^{-u}), with ℓ1 regularization
- n = 10^4 features, N = 10^6 examples, sparse with ≈ 10 nonzero regressors in each example
- split the data into 100 blocks of 10^4 examples each

Big picture / conclusions

- scaling: scale algorithms to datasets of arbitrary size
- cloud computing: run algorithms in the cloud
  - each node handles a modest convex problem
  - decentralized data storage
- coordination: ADMM is a meta-algorithm that coordinates existing solvers to solve problems of arbitrary size
  - rather than designing specialized large-scale algorithms for specific problems