Distributed Optimization and Statistics via Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein
Source: http://www.stanford.edu/~boyd/papers/pdf/admm_talk.pdf
Discussion led by XianXing Zhang
September 6, 2013
Outline of today's discussion

- Background
- Dual Decomposition
- Method of Multipliers
- Alternating Direction Method of Multipliers (ADMM)
- Examples
- Conclusions
Dual problem

- convex equality constrained optimization problem:
      minimize f(x)  subject to  Ax = b
- Lagrangian: L(x, y) = f(x) + y^T (Ax − b)
- dual function: g(y) = inf_x L(x, y)
- dual problem: y* = argmax_y g(y)
- recover x* = argmin_x L(x, y*)
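As a quick worked example (not on the slides), take f(x) = (1/2)||x||_2^2. Then the x-minimization has a closed form and the dual function is an explicit concave quadratic:

      L(x, y) = (1/2)||x||_2^2 + y^T (Ax − b)
      ∇_x L = x + A^T y = 0   ⟹   x = −A^T y
      g(y) = −(1/2)||A^T y||_2^2 − b^T y

so the dual problem is unconstrained and smooth, and x* = −A^T y* recovers the primal solution.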
Dual ascent

- recall the dual problem: y* = argmax_y g(y)
- gradient method for the dual problem: y^{k+1} = y^k + α^k ∇g(y^k)
- ∇g(y^k) = Ax^k − b, where x^k = argmin_x L(x, y^k)
- dual ascent method:
      x^{k+1} := argmin_x L(x, y^k)              // x-minimization
      y^{k+1} := y^k + α^k (Ax^{k+1} − b)        // dual update
- works, with lots of strong assumptions (e.g., strong convexity)
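A minimal NumPy sketch of dual ascent, using the toy choice f(x) = (1/2)||x||_2^2 from the worked example above (the choice of f, step size, and iteration count are mine, not from the slides):

    import numpy as np

    def dual_ascent(A, b, alpha=0.3, iters=500):
        """Dual ascent for: minimize (1/2)||x||^2 subject to Ax = b."""
        m, n = A.shape
        y = np.zeros(m)
        for _ in range(iters):
            # x-minimization: argmin_x (1/2)||x||^2 + y^T (Ax - b) = -A^T y
            x = -A.T @ y
            # dual update: gradient ascent on g(y); alpha must be small enough
            y = y + alpha * (A @ x - b)
        return x, y

    A = np.array([[1.0, 2.0], [0.0, 1.0]])
    b = np.array([1.0, 1.0])
    x, y = dual_ascent(A, b)
    print("primal residual norm:", np.linalg.norm(A @ x - b))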
Dual decomposition

- suppose f is separable: f(x) = f_1(x_1) + · · · + f_N(x_N), with x = (x_1, . . . , x_N)
- then L = f(x) + y^T (Ax − b) is separable in x:
      L(x, y) = L_1(x_1, y) + · · · + L_N(x_N, y) − y^T b,
      L_i(x_i, y) = f_i(x_i) + y^T A_i x_i
- x-minimization in dual ascent splits into N separate minimizations,
      x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),
  which can be carried out in parallel
Dual decomposition

- dual decomposition (Everett, Dantzig, Wolfe, Benders 1960–65):
      x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),   i = 1, . . . , N
      y^{k+1} := y^k + α^k ( Σ_{i=1}^N A_i x_i^{k+1} − b )
- scatter y^k; update the x_i in parallel; gather the A_i x_i^{k+1}
- solve a large problem
  - by iteratively solving subproblems (in parallel)
  - dual variable update provides coordination
- works, with lots of assumptions; often slow
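A sketch of the scatter/update/gather loop, again with the toy blocks f_i(x_i) = (1/2)||x_i||^2 so each x_i-update has a closed form (my illustrative choice; in practice each block update would run on its own processor):

    import numpy as np

    def dual_decomposition(A_blocks, b, alpha=0.1, iters=500):
        """Dual decomposition: f(x) = sum_i (1/2)||x_i||^2, constraint sum_i A_i x_i = b."""
        y = np.zeros(b.shape[0])
        for _ in range(iters):
            # scatter y; each x_i-minimization is independent (parallelizable)
            xs = [-Ai.T @ y for Ai in A_blocks]
            # gather A_i x_i^{k+1} and take a dual gradient step
            residual = sum(Ai @ xi for Ai, xi in zip(A_blocks, xs)) - b
            y = y + alpha * residual
        return xs, y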
Method of multipliers

- a method to robustify dual ascent
- use the augmented Lagrangian (Hestenes, Powell 1969), with ρ > 0:
      L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2)||Ax − b||_2^2
- method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982):
      x^{k+1} := argmin_x L_ρ(x, y^k)
      y^{k+1} := y^k + ρ (Ax^{k+1} − b)
  (using ρ as the dual update step length makes (x^{k+1}, y^{k+1}) satisfy dual feasibility; primal feasibility is attained in the limit)
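A sketch of the method of multipliers for the same toy problem f(x) = (1/2)||x||^2, where minimizing the augmented Lagrangian in x reduces to a linear solve (the choice of f and ρ are mine, not from the slides):

    import numpy as np

    def method_of_multipliers(A, b, rho=1.0, iters=100):
        """Augmented Lagrangian method for: minimize (1/2)||x||^2 subject to Ax = b."""
        m, n = A.shape
        y = np.zeros(m)
        M = np.eye(n) + rho * A.T @ A          # system matrix for the x-update
        for _ in range(iters):
            # x-minimization of L_rho(x, y): (I + rho A^T A) x = A^T (rho b - y)
            x = np.linalg.solve(M, A.T @ (rho * b - y))
            # dual update with step length rho
            y = y + rho * (A @ x - b)
        return x, y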
Method of multipliers

- good news: converges under much more relaxed conditions (f can be non-differentiable, take on the value +∞, . . . )
- bad news: the quadratic penalty destroys the splitting of the x-update, so we can't do decomposition
Alternating direction method of multipliers (ADMM)

- a method
  - with the good robustness of the method of multipliers
  - which can support decomposition
- proposed by Gabay, Mercier, Glowinski, Marrocco in 1976
Alternating direction method of multipliers

- ADMM problem form (with f, g convex):
      minimize f(x) + g(z)  subject to  Ax + Bz = c
  - two sets of variables, with separable objective
- augmented Lagrangian:
      L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2)||Ax + Bz − c||_2^2
- ADMM:
      x^{k+1} := argmin_x L_ρ(x, z^k, y^k)                // x-minimization
      z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)            // z-minimization
      y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c)        // dual update
Alternating direction method of multipliers

- if we minimized over x and z jointly, this would reduce to the method of multipliers
- instead, we do one pass of a Gauss-Seidel method
- we get splitting since we minimize over x with z fixed, and vice versa
ADMM with scaled dual variables

- combine the linear and quadratic terms in the augmented Lagrangian:
      L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2)||Ax + Bz − c||_2^2
                   = f(x) + g(z) + (ρ/2)||Ax + Bz − c + u||_2^2 + const.
  with the scaled dual variable u^k = (1/ρ)y^k
- ADMM (scaled dual form):
      x^{k+1} := argmin_x ( f(x) + (ρ/2)||Ax + Bz^k − c + u^k||_2^2 )
      z^{k+1} := argmin_z ( g(z) + (ρ/2)||Ax^{k+1} + Bz − c + u^k||_2^2 )
      u^{k+1} := u^k + (Ax^{k+1} + Bz^{k+1} − c)
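A generic scaled-form ADMM loop, written with the two subproblem solvers passed in as functions; the x- and z-minimizations are problem-specific, so this is only a sketch of the iteration structure (the function names and signatures are my own, not a library API):

    import numpy as np

    def admm_scaled(x_update, z_update, A, B, c, z0, iters=100):
        """Scaled-form ADMM for: minimize f(x) + g(z) subject to Ax + Bz = c.

        x_update(z, u) should return argmin_x f(x) + (rho/2)||Ax + Bz - c + u||^2,
        z_update(x, u) should return argmin_z g(z) + (rho/2)||Ax + Bz - c + u||^2,
        with rho fixed inside the two solvers.
        """
        z = z0
        u = np.zeros(c.shape)
        for _ in range(iters):
            x = x_update(z, u)                 # x-minimization
            z = z_update(x, u)                 # z-minimization
            u = u + (A @ x + B @ z - c)        # scaled dual update
        return x, z, u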
Convergence

- assume (very little!)
  - f, g convex, closed, proper
  - L_0 has a saddle point
- then ADMM converges:
  - iterates approach feasibility: Ax^k + Bz^k − c → 0
  - objective approaches the optimal value: f(x^k) + g(z^k) → p*
Lasso

- lasso problem: minimize (1/2)||Ax − b||_2^2 + λ||x||_1
- ADMM form:
      minimize (1/2)||Ax − b||_2^2 + λ||z||_1  subject to  x − z = 0
- ADMM:
      x^{k+1} := (A^T A + ρI)^{−1} (A^T b + ρ(z^k − u^k))
      z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k)
      u^{k+1} := u^k + x^{k+1} − z^{k+1}
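A NumPy sketch of these lasso updates: S_{λ/ρ} is elementwise soft thresholding, and A^T A + ρI is factored once and reused in every iteration (ρ = 1 and the iteration count are illustrative defaults, not prescribed by the slides):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def soft_threshold(v, kappa):
        return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

    def lasso_admm(A, b, lam, rho=1.0, iters=50):
        """ADMM for: minimize (1/2)||Ax - b||^2 + lam*||x||_1, via the x - z = 0 splitting."""
        n = A.shape[1]
        z = np.zeros(n)
        u = np.zeros(n)
        chol = cho_factor(A.T @ A + rho * np.eye(n))   # factor once, reuse every iteration
        Atb = A.T @ b
        for _ in range(iters):
            x = cho_solve(chol, Atb + rho * (z - u))   # x-update: ridge-like solve
            z = soft_threshold(x + u, lam / rho)       # z-update: soft thresholding
            u = u + x - z                              # scaled dual update
        return z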
Lasso example

- example with dense A ∈ R^{1500×5000}
- computation times:
  - factorization (same as ridge regression): 1.3 s
  - subsequent ADMM iterations: 0.03 s each
  - lasso solve (about 50 ADMM iterations): 2.9 s
  - full regularization path (30 values of λ): 4.4 s
- not bad for a very short script
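The timings above rest on paying for the factorization once and reusing it across iterations and across λ values; a regularization-path loop might look like the sketch below, which also warm-starts each solve from the previous one (the warm-start detail is my assumption, not stated on the slide):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lasso_path(A, b, lambdas, rho=1.0, iters=50):
        """Solve the lasso for each lambda, reusing one factorization and warm starts."""
        n = A.shape[1]
        chol = cho_factor(A.T @ A + rho * np.eye(n))   # the expensive step, done once
        Atb = A.T @ b
        z = np.zeros(n)
        u = np.zeros(n)
        path = []
        for lam in lambdas:                            # warm-start from the previous solution
            for _ in range(iters):
                x = cho_solve(chol, Atb + rho * (z - u))
                z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)
                u = u + x - z
            path.append(z.copy())
        return path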
Consensus optimization

- want to solve a problem with N objective terms:
      minimize Σ_{i=1}^N f(x_i | θ) + g(θ)
  - e.g., f(x_i | θ) is the loss function for the ith block of training data x_i
- ADMM form:
      minimize Σ_{i=1}^N f(x_i | θ_i) + g(θ)
      subject to θ_i − θ = 0, i = 1, . . . , N
- the θ_i are local variables, θ is the global variable
- θ_i − θ = 0 are the consistency or consensus constraints
Consensus optimization via ADMM

- augmented Lagrangian:
      L_ρ(θ_1, . . . , θ_N, θ, y) = Σ_{i=1}^N ( f(x_i | θ_i) + y_i^T (θ_i − θ) + (ρ/2)||θ_i − θ||_2^2 )
- ADMM:
      θ_i^{k+1} := argmin_{θ_i} ( f(x_i | θ_i) + y_i^{kT} (θ_i − θ^k) + (ρ/2)||θ_i − θ^k||_2^2 )
      θ^{k+1}   := argmin_θ ( g(θ) + Σ_i ( y_i^{kT} (θ_i^{k+1} − θ) + (ρ/2)||θ_i^{k+1} − θ||_2^2 ) )
      y_i^{k+1} := y_i^k + ρ (θ_i^{k+1} − θ^{k+1})
- the θ_i^{k+1} and y_i^{k+1} can be updated in parallel
- θ is updated by the consensus of the local parameters
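A sketch of consensus ADMM for the simplest case g(θ) = 0 with quadratic local losses f(x_i | θ_i) = (1/2)||X_i θ_i − t_i||^2 (my illustrative choices, not from the slides); with g = 0 the θ-update reduces to averaging θ_i^{k+1} + (1/ρ) y_i^k over the blocks:

    import numpy as np

    def consensus_admm(data, rho=1.0, iters=100):
        """Consensus ADMM with f(x_i | theta_i) = (1/2)||X_i theta_i - t_i||^2 and g = 0.

        data is a list of (X_i, t_i) blocks; each theta_i / y_i update could run on its own node.
        """
        n = data[0][0].shape[1]
        N = len(data)
        theta = np.zeros(n)
        ys = [np.zeros(n) for _ in range(N)]
        for _ in range(iters):
            # local theta_i-updates (parallel across blocks)
            thetas = [np.linalg.solve(Xi.T @ Xi + rho * np.eye(n),
                                      Xi.T @ ti + rho * theta - yi)
                      for (Xi, ti), yi in zip(data, ys)]
            # global update: with g = 0, theta is the average of theta_i + y_i / rho
            theta = sum(th + yi / rho for th, yi in zip(thetas, ys)) / N
            # dual updates (parallel)
            ys = [yi + rho * (th - theta) for th, yi in zip(thetas, ys)]
        return theta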
Statistical interpretation

- f(x_i | θ_i) is the negative log-likelihood of parameter θ_i given the ith data block
- θ_i^{k+1} is the MAP estimate under the prior N(θ^k + (1/ρ)y_i^k, ρI)
- the prior mean is the previous iteration's consensus, shifted by the "price" of processor i disagreeing with that consensus
- local processors only need to support a Gaussian MAP method
Distributed ℓ1 regularized logistic regression example

- logistic loss l(u) = log(1 + e^{−u}), with ℓ1 regularization
- n = 10^4, N = 10^6, sparse with ≈ 10 nonzero regressors in each example
- split the data into 100 blocks with 10^4 examples each
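A sketch of one consensus iteration for this example: each node minimizes its block's logistic loss plus the consensus quadratic (here with a generic scipy solver), and with g(θ) = λ||θ||_1 the global update becomes averaging followed by soft thresholding at λ/(Nρ). The solver choice and function names are mine; only the update structure follows the consensus ADMM above:

    import numpy as np
    from scipy.optimize import minimize

    def local_update(Xi, bi, theta, yi, rho):
        """theta_i-update: logistic loss on block i plus the consensus quadratic term."""
        def obj(th):
            margins = bi * (Xi @ th)                       # labels bi in {-1, +1}
            loss = np.sum(np.logaddexp(0.0, -margins))     # sum of log(1 + exp(-u))
            return loss + yi @ (th - theta) + 0.5 * rho * np.sum((th - theta) ** 2)
        return minimize(obj, theta, method="L-BFGS-B").x

    def global_update(thetas, ys, rho, lam):
        """theta-update for g = lam*||theta||_1: average, then soft-threshold."""
        N = len(thetas)
        v = sum(th + yi / rho for th, yi in zip(thetas, ys)) / N
        return np.sign(v) * np.maximum(np.abs(v) - lam / (N * rho), 0.0)

Each local_update would run on the node holding block i; one outer iteration is the local updates (in parallel), the global update, and then the dual updates y_i := y_i + ρ(θ_i − θ).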
Big picture / conclusions

- scaling: scale algorithms to datasets of arbitrary size
- cloud computing: run algorithms in the cloud
  - each node handles a modest convex problem
  - decentralized data storage
- coordination: ADMM is a meta-algorithm that coordinates existing solvers to solve problems of arbitrary size
  - an alternative to designing specialized large-scale algorithms for specific problems