Modeling an augmented Lagrangian for blackbox constrained optimization
Robert B. Gramacy, The University of Chicago Booth School of Business (bobby.gramacy.com)
Joint with Genetha A. Gray, Sébastien Le Digabel, Herbert K.H. Lee, Pritam Ranjan, Garth Wells, Stefan Wild
CASD/George Mason University — Oct 2015
Blackbox constrained optimization

Consider constrained optimization problems of the form

min_x { f(x) : c(x) ≤ 0, x ∈ B },

where
- f : R^d → R is a scalar-valued objective function
- c : R^d → R^m is a vector of constraint functions
- B ⊂ R^d is known, bounded, and convex

This is a challenging problem when c is non-linear, and when evaluation of f and/or c requires expensive (blackbox) simulation.
Here is a toy problem to fix ideas.
- A linear objective in two variables:

  min_x { x1 + x2 : c1(x) ≤ 0, c2(x) ≤ 0, x ∈ [0, 1]² }

- where two non-linear constraints are given by

  c1(x) = 3/2 − x1 − 2x2 − (1/2) sin(2π(x1² − 2x2))
  c2(x) = x1² + x2² − 3/2

Even when treating f(x) = x1 + x2 as known, this is a hard problem when c(x) is treated as a blackbox.
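For concreteness, the toy problem is only a few lines of code. Below is a minimal sketch; the talk's own implementation is in R via the laGP package, so this Python version is purely illustrative.

```python
import numpy as np

def f(x):
    """Known linear objective: f(x) = x1 + x2."""
    return x[0] + x[1]

def c(x):
    """Two nonlinear blackbox constraints; x is feasible when both are <= 0."""
    c1 = 1.5 - x[0] - 2.0 * x[1] - 0.5 * np.sin(2.0 * np.pi * (x[0]**2 - 2.0 * x[1]))
    c2 = x[0]**2 + x[1]**2 - 1.5
    return np.array([c1, c2])

# e.g., near the best solution x^A from the next slide:
x = np.array([0.1954, 0.4044])
print(f(x), c(x))   # objective ~0.5998; c1 is (approximately) active, c2 is slack
```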
[Figure: contours of the toy problem on [0, 1]², shading the infeasible region and marking three local solutions A, B, C.]
- x^A ≈ [0.1954, 0.4044], f(x^A) ≈ 0.5998 (the global solution)
- x^B ≈ [0.7197, 0.1411], f(x^B) ≈ 0.8609
- x^C = [0, 0.75], f(x^C) = 0.75
- c2(x) may seem uninteresting, but it reminds us that solutions may not exist on every boundary
Solvers

Mathematical programming has efficient algorithms for non-linear (blackbox) optimization under constraints, with
- provable local convergence properties,
- lots of polished open source software.

Statistical approaches, e.g., EI (Jones et al., 1998),
- enjoy global convergence properties,
- excel when simulation is expensive, noisy, non-convex,
... but offer limited support for constraints.
(Schonlau et al., 1998; G & Lee, 2011; Williams et al., 2010)
A hybrid proposal

Combine (global) statistical objective-only optimization tools
a) response surface modeling/emulation: training a flexible model f^n on {x^(i), y^(i)}_{i=1}^n to guide the choice of x^(n+1) (e.g., Mockus et al., 1978; Booker et al., 1999)
b) expected improvement (EI) via Gaussian process (GP) emulation (Jones et al., 1998)
... with a tool from mathematical programming
c) augmented Lagrangian (AL): converting a problem with general constraints into a sequence of simply constrained ones (e.g., Bertsekas, 1982)
Gaussian process (GP) surrogate/regression models make popular emulators. As predictors, they
- are rarely beaten in out-of-sample tests,
- have appropriate coverage, and can interpolate.

Using data D = (X, Y), where X is an n × p design matrix, the n × 1 response vector Y has MVN likelihood

Y ∼ N_n(0, τ² K), where K_ij = K(x_i, x_j),

often with prior π(τ²) ∝ τ⁻² (Berger et al., 2001).
The predictive equations have mean and scale

μ^n(x | D, K) = k⊤(x) K⁻¹ Y,
σ²_n(x | D, K) = ψ [K(x, x) − k⊤(x) K⁻¹ k(x)] / n,

where k⊤(x) is the n-vector whose i-th component is K(x, x_i).

[Figure: GP predictive surface f-hat(x) interpolating a handful of observed points, with mean and uncertainty bands.]
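A minimal sketch of these predictive equations, assuming an isotropic squared-exponential correlation with a fixed lengthscale and ψ = Y⊤K⁻¹Y (both are assumptions of this sketch; the talk's implementation lives in the laGP R package).

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=0.2):
    """Squared-exponential correlation K(x, x') = exp(-||x - x'||^2 / lengthscale)."""
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(axis=-1)
    return np.exp(-d2 / lengthscale)

def gp_predict(X, Y, XX, jitter=1e-8):
    """Predictive mean and scale at rows of XX, with tau^2 integrated out."""
    n = len(Y)
    K = sq_exp_kernel(X, X) + jitter * np.eye(n)
    Ki = np.linalg.inv(K)
    k = sq_exp_kernel(XX, X)                      # rows are k(x)^T for each new x
    mu = k @ Ki @ Y                               # mu_n(x) = k(x)^T K^{-1} Y
    psi = Y @ Ki @ Y                              # psi = Y^T K^{-1} Y (assumed form)
    s2 = (psi / n) * (np.diag(sq_exp_kernel(XX, XX))
                      - np.einsum('ij,jk,ik->i', k, Ki, k))
    return mu, s2
```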
Expected Improvement

Suppose the predictive equations from f^n are conditionally normal, i.e., from a GP:

Y(x) ∼ N(μ^n(x), σ²_n(x))

Define the improvement as

I(x) = max{0, f^n_min − Y(x)}.

Then its expectation (EI) has a closed-form expression:

E{I(x)} = (f^n_min − μ^n(x)) Φ((f^n_min − μ^n(x)) / σ^n(x)) + σ^n(x) φ((f^n_min − μ^n(x)) / σ^n(x))
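A direct transcription of this closed form, using scipy's normal CDF and PDF for Φ and φ:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, fmin):
    """Closed-form EI for a Gaussian predictive distribution N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)      # guard against zero predictive spread
    z = (fmin - mu) / sigma
    return (fmin - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```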
[Figures from Jones, Schonlau, and Welch (1998), "Efficient Global Optimization": the DACE predictor μ^n(x), its standard error σ^n(x), and f_min. Uncertainty about the function's value at a point is treated as a realization of a normal random variable with that mean and standard deviation, and the resulting expected improvement criterion balances exploitation and exploration; a second panel shows how EI changes after adding a point.]
(Jones et al., 1998)
Augmented Lagrangian

AL methods for constrained nonlinear optimization have favorable theoretical properties for finding local solutions. The main tool is the AL:

L_A(x; λ, ρ) = f(x) + λ⊤ c(x) + (1/(2ρ)) Σ_{j=1}^m max(0, c_j(x))²

- ρ > 0 is a penalty parameter
- λ ∈ R^m_+ serves as a Lagrange multiplier; omitting this term leads to a so-called additive penalty method (APM)

AL-based methods transform a constrained problem into a sequence of simply constrained problems.
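The AL itself is a one-liner; a sketch following the slide's notation (`f` and `c` are the blackbox objective and constraint functions, `lam` and `rho` the current multiplier and penalty):

```python
import numpy as np

def aug_lagrangian(x, f, c, lam, rho):
    """Augmented Lagrangian L_A(x; lambda, rho) for inequality constraints c(x) <= 0."""
    cx = np.asarray(c(x))
    return f(x) + lam @ cx + np.sum(np.maximum(0.0, cx)**2) / (2.0 * rho)
```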
Given (ρ^{k−1}, λ^{k−1}),

1. approximately solve the subproblem

   x^k = argmin_x { L_A(x; λ^{k−1}, ρ^{k−1}) : x ∈ B }

2. update:
   - λ^k_j = max(0, λ^{k−1}_j + (1/ρ^{k−1}) c_j(x^k)), j = 1, ..., m
   - If c(x^k) ≤ 0, set ρ^k = ρ^{k−1}; otherwise, set ρ^k = ρ^{k−1}/2

... then repeat, incrementing k.

- Functions f and c are only evaluated when solving the subproblem(s), comprising an "inner loop".
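A sketch of this outer loop. Here `solve_subproblem` is a hypothetical hook standing in for whatever approximate minimizer of L_A over B is used in the inner loop, and a fixed number of outer iterations replaces a proper convergence test.

```python
import numpy as np

def al_outer_loop(f, c, x0, lam, rho, solve_subproblem, n_outer=10):
    """Outer AL loop; `solve_subproblem(obj, xstart)` approximately minimizes obj over B."""
    x = np.asarray(x0, dtype=float)
    lam = np.asarray(lam, dtype=float)
    for k in range(n_outer):
        def LA(z):  # augmented Lagrangian with the current (lam, rho)
            cz = np.asarray(c(z))
            return f(z) + lam @ cz + np.sum(np.maximum(0.0, cz)**2) / (2.0 * rho)
        x = solve_subproblem(LA, x)            # step 1: the "inner loop"
        cx = np.asarray(c(x))
        lam = np.maximum(0.0, lam + cx / rho)  # step 2a: multiplier update
        if np.any(cx > 0):                     # step 2b: tighten penalty if infeasible
            rho = rho / 2.0
    return x, lam, rho
```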
Statistical surrogate AL

AL methods are not designed for global optimization.
- Convergence results have a certain robustness,
- but only local solutions are guaranteed.

Hybridizing with surrogate models offers a potential remedy.
- Focus is on finding x^k in the "inner loop",
- using evaluations (x^(1), f^(1), c^(1)), ..., (x^(n), f^(n), c^(n)) collected over all "inner" and "outer" loops ℓ = 1, ..., k − 1.

There are several options for how exactly to proceed.
One option is easy to rule out. Let y^(i) = L_A(x^(i); λ^{k−1}, ρ^{k−1}) via f^(i) and c^(i), i.e.,

y^(i) = f(x^(i)) + (λ^{k−1})⊤ c(x^(i)) + (1/(2ρ^{k−1})) Σ_{j=1}^m max(0, c_j(x^(i)))²

- fit a GP emulator f^n to the n pairs {(x^(i), y^(i))}_{i=1}^n
- guide "inner loop" search by the predictive mean or EI

Benefits include:
- modular
- facilitates a global-local tradeoff
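As a sketch of this (ruled-out) option: stored blackbox outputs are mapped to AL values and handed to any GP/EI machinery. Array names `F` (objective evaluations f^(i)) and `C` (the n × m matrix of constraint evaluations c^(i)) are mine, for illustration only.

```python
import numpy as np

def al_values(F, C, lam, rho):
    """Map stored evaluations (f^(i), c^(i)) to y^(i) = L_A(x^(i); lam, rho)."""
    C = np.asarray(C)                       # shape (n, m)
    penalty = np.sum(np.maximum(0.0, C)**2, axis=1) / (2.0 * rho)
    return np.asarray(F) + C @ lam + penalty

# A GP would then be fit to (X, y), with x^(n+1) chosen by predictive mean or EI,
# e.g., via gp_predict / expected_improvement from the earlier sketches.
```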
But modeling this x-y relationship presents serious challenges.

y^(i) = f(x^(i)) + (λ^{k−1})⊤ c(x^(i)) + (1/(2ρ^{k−1})) Σ_{j=1}^m max(0, c_j(x^(i)))²

Inherently nonstationary.
- the square amplifies and the max creates kinks

Fails to exploit known structure.
- a quadratic form

Needlessly models a (potentially) known quantity.
- many interesting problems have linear f
Separated modeling

Shortcomings can be addressed by separately/independently modeling each component of the AL:
- f^n emitting Y_f^n(x)
- c^n = (c_1^n, ..., c_m^n) emitting Y_c^n(x) = (Y_c1^n(x), ..., Y_cm^n(x))

The distribution of the composite random variable

Y(x) = Y_f(x) + λ⊤ Y_c(x) + (1/(2ρ)) Σ_{j=1}^m max(0, Y_cj(x))²

can serve as a surrogate for L_A(x; λ, ρ).
- simplifications when f is known
The composite posterior mean is available in closed form, e.g., under GP priors:

E{Y(x)} = μ_f^n(x) + λ⊤ μ_c^n(x) + (1/(2ρ)) Σ_{j=1}^m E{max(0, Y_cj(x))²}

A result from generalized EI (Schonlau et al., 1998) gives

E{max(0, Y_cj(x))²} = E{I_{−Y_cj}(x)}² + Var[I_{−Y_cj}(x)]
                    = σ_cj^n(x)² [ (1 + z_j²) Φ(z_j) + z_j φ(z_j) ],  where z_j = μ_cj^n(x) / σ_cj^n(x).
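A sketch of this closed-form composite mean, assuming independent Gaussian surrogates for f and each c_j (vector inputs `mu_c`, `sigma_c` hold the constraint surrogates' predictive means and standard deviations at a single x):

```python
import numpy as np
from scipy.stats import norm

def expected_max_sq(mu, sigma):
    """E{max(0, Y)^2} for Y ~ N(mu, sigma^2), via the generalized-EI identity above."""
    sigma = np.maximum(sigma, 1e-12)
    z = mu / sigma
    return sigma**2 * ((1.0 + z**2) * norm.cdf(z) + z * norm.pdf(z))

def composite_mean(mu_f, mu_c, sigma_c, lam, rho):
    """E{Y(x)} = mu_f + lam' mu_c + (1/(2 rho)) sum_j E{max(0, Y_cj)^2}."""
    return mu_f + lam @ mu_c + np.sum(expected_max_sq(mu_c, sigma_c)) / (2.0 * rho)
```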
Expected improvement for AL

The simplest way to evaluate the EI is via Monte Carlo:
- take 100 samples Y_f^(i)(x) and Y_c^(i)(x),
- then EI(x) ≈ (1/100) Σ_{i=1}^{100} max{0, y^n_min − Y^(i)(x)} (see the sketch below).
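A minimal sketch of that Monte Carlo EI at a single x, assuming independent Gaussian predictive draws for Y_f(x) and each Y_cj(x):

```python
import numpy as np

def ei_al_monte_carlo(mu_f, s_f, mu_c, s_c, lam, rho, ymin, n_samples=100, rng=None):
    """MC estimate of EI for the composite AL surrogate at one candidate x."""
    rng = np.random.default_rng() if rng is None else rng
    Yf = rng.normal(mu_f, s_f, size=n_samples)                  # draws of Y_f(x)
    Yc = rng.normal(mu_c, s_c, size=(n_samples, len(mu_c)))     # draws of Y_c(x)
    Y = Yf + Yc @ lam + np.sum(np.maximum(0.0, Yc)**2, axis=1) / (2.0 * rho)
    return np.mean(np.maximum(0.0, ymin - Y))
```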
The "max" in the AL makes analytic calculation intractable. But you can remove the "max" by introducing slack variables,
- turning inequality into equality constraints,
- and making the AL composite Y(x) a simple quadratic.
- The EI then becomes a one-dimensional integral of non-central chi-squared quantities.
Results on toy data

[Figure: best valid objective (f) versus blackbox evaluations (n) on the toy problem, comparing EI, EY, SANN, MADS-AL, AsyEnt, MADS, and the model-based comparator.]
Best valid objective value (95% and 5% quantiles over repeated runs):

             n = 25          n = 50          n = 100
           95%     5%      95%     5%      95%     5%
EI        0.866   0.610   0.775   0.602   0.602   0.600
EY        1.052   0.607   0.854   0.601   0.603   0.600
SANN      1.013   0.648   0.940   0.630   0.855   0.612
MADS-AL   1.070   0.600   0.979   0.600   0.908   0.600
AsyEnt    0.825   0.610   0.761   0.601   0.758   0.600
MADS      1.056   0.608   0.886   0.600   0.863   0.599
model     1.064   0.600   0.861   0.599   0.750   0.599
Benchmark problem

Two contaminant plumes threaten a valuable water source: the Yellowstone River.

[Figure: plan view of the site, showing Plume A and Plume B, no-flow boundaries, a specified-head boundary, and the Yellowstone river.]

To prevent further expansion of these plumes, six pump-and-treat wells have been proposed.
Mayer et al. (2002) first posed the pump-and-treat problem as a constrained blackbox optimization. If x_j denotes the pumping rate for well j, then

min_x { f(x) = Σ_{j=1}^6 x_j : c1(x) ≤ 0, c2(x) ≤ 0, x ∈ [0, 2·10⁴]⁶ }.

- f is linear, describing costs to operate the wells
- c1 and c2 denote plume flow exiting the boundary: simulated via an analytic element method groundwater model
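A rough sketch of how this problem might be wrapped for a blackbox solver; the simulator hook `run_groundwater_model` is a hypothetical placeholder for the analytic element model, not the talk's actual interface.

```python
import numpy as np

def run_groundwater_model(x):
    """Placeholder for the analytic element groundwater simulator (blackbox)."""
    raise NotImplementedError("hook up the simulator here")

def objective(x):
    """Known linear objective: total pumping rate over the six wells."""
    return float(np.sum(x))

def constraints(x):
    """Blackbox constraints (c1, c2): plume flow exiting the boundary."""
    return run_groundwater_model(x)

bounds = [(0.0, 2e4)] * 6    # B = [0, 2*10^4]^6
```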
Matott et al. (2011) compared MATLAB and Python optimizers, treating constraints via APM.

[Figure: best valid objective (f), the pump-and-treat cost function, versus number of model evaluations (up to 1000) for BFGS, DDS, DE, Hooke-Jeeves, IFFCO, Nelder-Mead, Newton, Powell, PRGC, SA, TGA, and TNC; after Matott et al. (2011), Figure 4, "Convergence Behavior of the Selected Algorithms (as applied to the Pump-and-Treat problem)".]

- initialized at x_j^0 = 10⁴
[Figure: best valid objective (f) versus blackbox evaluations (n) on the pump-and-treat problem, comparing EI, EY, SANN, MADS-AL, AsyEnt, MADS, and the model-based comparator.]
Best valid objective value (95% and 5% quantiles over repeated runs):

             n = 100          n = 200          n = 500
            95%     5%       95%     5%       95%     5%
EI         37584   27054    28698   25119    25695   24196
EY         36554   25677    32770   24492    27362   24100
SANN       43928   28766    35456   27238    30920   26824
MADS-AL    60000   30776    49020   26358    32663   24102
AsyEnt     49030   37483    29079   26681    27445   25377
MADS       60000   30023    60000   26591    60000   23571
model      60000   25912    60000   25164    35730   24939
Summarizing

Nontrivial multiple blackbox constraints present serious challenges to optimization,
- even when the objective is simple/known.

The augmented Lagrangian method from mathematical programming is a nice framework for handling constraints,
- but only local convergence is guaranteed.

Statistical surrogate modeling and expected improvement hybridize nicely with the AL:
- implementation is straightforward (see laGP on CRAN).