Modeling an augmented Lagrangian for blackbox constrained optimization

Robert B. Gramacy
The University of Chicago Booth School of Business
bobby.gramacy.com

Joint with Genetha A. Gray, Sébastien Le Digabel, Herbert K.H. Lee, Pritam Ranjan, Garth Wells, Stefan Wild

CASD/George Mason University — Oct 2015

Blackbox constrained optimization

Consider constrained optimization problems of the form

  min_x { f(x) : c(x) ≤ 0, x ∈ B },

where
- f : R^d → R is a scalar-valued objective function
- c : R^d → R^m is a vector of constraint functions
- B ⊂ R^d is known, bounded, and convex

This is a challenging problem when c is non-linear, and when evaluation of f and/or c requires expensive (blackbox) simulation.

Here is a toy problem to fix ideas.
- A linear objective in two variables:

    min_x { x1 + x2 : c1(x) ≤ 0, c2(x) ≤ 0, x ∈ [0, 1]² }

- where two non-linear constraints are given by

    c1(x) = 3/2 − x1 − 2 x2 − (1/2) sin(2π(x1² − 2 x2))
    c2(x) = x1² + x2² − 3/2

Even when treating f(x) = x1 + x2 as known, this is a hard problem when c(x) is treated as a blackbox.
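As a rough illustration, here is a minimal sketch of the toy problem in Python; the function names and the feasibility check are mine, not from the talk's code.

```python
import numpy as np

def f(x):
    """Known linear objective: f(x) = x1 + x2."""
    return x[0] + x[1]

def c(x):
    """Two non-linear 'blackbox' constraints; x is feasible when both are <= 0."""
    c1 = 1.5 - x[0] - 2.0 * x[1] - 0.5 * np.sin(2.0 * np.pi * (x[0]**2 - 2.0 * x[1]))
    c2 = x[0]**2 + x[1]**2 - 1.5
    return np.array([c1, c2])

# Check the solution quoted on the next slide: objective ~0.5998,
# c1 roughly active (near 0) and c2 comfortably negative.
xA = np.array([0.1954, 0.4044])
print(f(xA), c(xA))
```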

[Figure: the feasible region of the toy problem on [0, 1]², with the infeasible region shaded and three local solutions marked A, B, and C.]

- x^A ≈ [0.1954, 0.4044], f(x^A) ≈ 0.5998
- x^B ≈ [0.7197, 0.1411], f(x^B) ≈ 0.8609
- x^C = [0, 0.75], f(x^C) = 0.75
- c2(x) may seem uninteresting, but it reminds us that solutions may not exist on every boundary.

Solvers

Mathematical programming has efficient algorithms for non-linear (blackbox) optimization (under constraints) with
- provable local convergence properties,
- lots of polished open-source software.

Statistical approaches, e.g., EI (Jones et al., 1998),
- enjoy global convergence properties,
- excel when simulation is expensive, noisy, non-convex,
... but offer limited support for constraints (Schonlau et al., 1998; G & Lee, 2011; Williams et al., 2010).

A hybrid proposal

Combine (global) statistical objective-only optimization tools
a) response surface modeling/emulation: training a flexible model f^n on {x^(i), y^(i)}, i = 1, ..., n, to guide choosing x^(n+1) (e.g., Mockus et al., 1978; Booker et al., 1999)
b) expected improvement (EI) via Gaussian process (GP) emulation (Jones et al., 1998)

... with a tool from mathematical programming
c) augmented Lagrangian (AL): converting a problem with general constraints into a sequence of simply constrained ones (e.g., Bertsekas, 1982)

Gaussian process (GP) surrogate/regression models make popular emulators. As predictors, they
- are rarely beaten in out-of-sample tests,
- have appropriate coverage, and can interpolate.

Using data D = (X, Y), where X is an n × p design matrix, the n × 1 response vector Y has an MVN likelihood:

  Y ∼ N_n(0, τ² K), where K_ij = K(x_i, x_j),

often with prior π(τ²) ∝ τ⁻² (Berger et al., 2001).

The predictive equations have mean and scale

  μ_n(x | D, K) = k⊤(x) K⁻¹ Y,
  σ²_n(x | D, K) = ψ [K(x, x) − k⊤(x) K⁻¹ k(x)] / n,

where k⊤(x) is the n-vector whose i-th component is K(x, x_i).

[Figure: a one-dimensional example of the GP fit f-hat(x), interpolating the observed data points with predictive uncertainty bands.]
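A minimal sketch of these predictive equations in Python, assuming an isotropic Gaussian correlation function with a fixed lengthscale and taking ψ = Y⊤ K⁻¹ Y (a common choice consistent with the reference prior above); these particular modeling choices are assumptions for illustration, not from the talk.

```python
import numpy as np

def gp_predict(X, Y, Xnew, theta=0.1, jitter=1e-8):
    """GP predictive mean and scale at the rows of Xnew, given data (X, Y)."""
    n = len(Y)
    D = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-D / theta) + jitter * np.eye(n)        # n x n correlation matrix
    Kinv = np.linalg.inv(K)
    d = np.sum((Xnew[:, None, :] - X[None, :, :])**2, axis=-1)
    k = np.exp(-d / theta)                             # rows are k(x)^T
    psi = Y @ Kinv @ Y                                 # plays the role of psi
    mu = k @ Kinv @ Y                                  # mu_n(x) = k(x)^T K^{-1} Y
    s2 = psi * (1.0 - np.sum((k @ Kinv) * k, axis=1)) / n   # sigma^2_n(x)
    return mu, s2
```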

Expected Improvement

Suppose the predictive equations from f^n are conditionally normal, i.e., from a GP: Y(x) ∼ N(μ_n(x), σ²_n(x)). Define the improvement as

  I(x) = max{0, f^n_min − Y(x)}.

Then its expectation (EI) has a closed-form expression:

  E{I(x)} = (f^n_min − μ_n(x)) Φ((f^n_min − μ_n(x)) / σ_n(x)) + σ_n(x) φ((f^n_min − μ_n(x)) / σ_n(x)).
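In code, the closed-form EI is a short calculation given predictive means and variances; this sketch uses scipy's normal distribution, and the argument names are mine.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s2, fmin):
    """Closed-form EI for minimization, given GP predictive mean mu and variance s2."""
    sd = np.sqrt(s2)
    z = (fmin - mu) / sd
    return (fmin - mu) * norm.cdf(z) + sd * norm.pdf(z)

# Example: rank candidate locations by EI using the gp_predict sketch above,
# e.g., mu, s2 = gp_predict(X, Y, Xcand); scores = expected_improvement(mu, s2, Y.min())
```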

- balancing exploitation and exploration

[Figures reproduced from Jones, Schonlau, and Welch (1998), "Efficient Global Optimization": the DACE predictor μ_n(x), its standard error σ_n(x), and f_min, illustrating that uncertainty about the function's value at a point (such as x = 8) can be treated as a normal random variable with that mean and standard deviation; and the expected improvement E{I(x)} after five sampled points, and again after adding a point at x = 2.8.]

(Jones et al., 1998)

Augmented Lagrangian

AL methods for constrained nonlinear optimization have favorable theoretical properties for finding local solutions. The main tool is the AL:

  L_A(x; λ, ρ) = f(x) + λ⊤ c(x) + (1 / 2ρ) Σ_{j=1}^m max(0, c_j(x))²

- ρ > 0 is a penalty parameter
- λ ∈ R^m_+ serves as a Lagrange multiplier; omitting this term leads to a so-called additive penalty method (APM)

AL-based methods transform a constrained problem into a sequence of simply constrained problems.
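A minimal sketch of evaluating the AL, reusing the toy f and c defined earlier; the function name is mine.

```python
import numpy as np

def aug_lagrangian(x, lam, rho, f, c):
    """L_A(x; lambda, rho) = f(x) + lambda'c(x) + (1/2rho) sum_j max(0, c_j(x))^2."""
    cx = c(x)
    return f(x) + lam @ cx + np.sum(np.maximum(0.0, cx)**2) / (2.0 * rho)
```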

Given (ρ^{k−1}, λ^{k−1}),

1. approximately solve the subproblem

     x^k = argmin_x { L_A(x; λ^{k−1}, ρ^{k−1}) : x ∈ B }

2. update:
   - λ_j^k = max(0, λ_j^{k−1} + c_j(x^k) / ρ^{k−1}), j = 1, ..., m
   - If c(x^k) ≤ 0, set ρ^k = ρ^{k−1}; otherwise, set ρ^k = (1/2) ρ^{k−1}

... then repeat, incrementing k.

- Functions f and c are only evaluated when solving the subproblem(s), comprising an "inner loop".
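A minimal sketch of this outer loop follows; here the inner-loop subproblem is solved with a generic multi-start L-BFGS-B over B, purely as a stand-in for whichever inner solver is used (the statistical surrogate comes later). The looping scheme and defaults are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def al_outer_loop(f, c, bounds, lam0, rho0, K=10, starts=20, seed=0):
    """Run K outer AL iterations; returns the last subproblem solution."""
    rng = np.random.default_rng(seed)
    lam, rho = np.asarray(lam0, dtype=float), float(rho0)
    lo, hi = np.array(bounds, dtype=float).T
    for _ in range(K):
        # 1. approximately solve the simply constrained subproblem over B
        best = None
        for x0 in rng.uniform(lo, hi, size=(starts, len(lo))):
            res = minimize(aug_lagrangian, x0, args=(lam, rho, f, c),
                           method="L-BFGS-B", bounds=bounds)
            if best is None or res.fun < best.fun:
                best = res
        xk, ck = best.x, c(best.x)
        # 2. update multipliers, and shrink the penalty parameter if infeasible
        lam = np.maximum(0.0, lam + ck / rho)
        if np.any(ck > 0):
            rho *= 0.5
    return xk, lam, rho

# Example on the toy problem (illustrative starting values):
# al_outer_loop(f, c, [(0, 1), (0, 1)], lam0=[0, 0], rho0=0.5)
```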

Statistical surrogate AL

AL methods are not designed for global optimization.
- Convergence results have a certain robustness,
- but only local solutions are guaranteed.

Hybridizing with surrogate models offers a potential remedy.
- Focus is on finding x^k in the "inner loop",
- using evaluations (x^(1), f^(1), c^(1)), ..., (x^(n), f^(n), c^(n)) collected over all "inner" and "outer" loops ℓ = 1, ..., k − 1.

There are several options for how exactly to proceed.

One option is easy to rule out. Let y^(i) = L_A(x^(i); λ^{k−1}, ρ^{k−1}) via f^(i) and c^(i), i.e.,

  y^(i) = f(x^(i)) + (λ^{k−1})⊤ c(x^(i)) + (1 / 2ρ^{k−1}) Σ_{j=1}^m max(0, c_j(x^(i)))²

- fit a GP emulator f^n to the n pairs {(x^(i), y^(i))}, i = 1, ..., n
- guide "inner loop" search by the predictive mean or EI

Benefits include:
- modular
- facilitates a global–local tradeoff

But modeling this x–y relationship presents serious challenges.

  y^(i) = f(x^(i)) + (λ^{k−1})⊤ c(x^(i)) + (1 / 2ρ^{k−1}) Σ_{j=1}^m max(0, c_j(x^(i)))²

It is inherently nonstationary:
- the square amplifies and the max creates kinks.

It fails to exploit known structure:
- a quadratic form.

It needlessly models a (potentially) known quantity:
- many interesting problems have a linear f.

Separated modeling

These shortcomings can be addressed by separately/independently modeling each component of the AL:
- f^n emitting Y_f^n(x)
- c^n = (c_1^n, ..., c_m^n) emitting Y_c^n(x) = (Y_{c_1}^n(x), ..., Y_{c_m}^n(x))

The distribution of the composite random variable

  Y(x) = Y_f(x) + λ⊤ Y_c(x) + (1 / 2ρ) Σ_{j=1}^m max(0, Y_{c_j}(x))²

can serve as a surrogate for L_A(x; λ, ρ).
- simplifications arise when f is known

The composite posterior mean is available in closed form, e.g., under GP priors:

  E{Y(x)} = μ_f^n(x) + λ⊤ μ_c^n(x) + (1 / 2ρ) Σ_{j=1}^m E{max(0, Y_{c_j}(x))²}

A result from generalized EI (Schonlau et al., 1998) gives

  E{max(0, Y_{c_j}(x))²} = E{I_{−Y_{c_j}}(x)}² + Var[I_{−Y_{c_j}}(x)]
                         = σ_{c_j}^n(x)² [ (1 + (μ_{c_j}^n(x)/σ_{c_j}^n(x))²) Φ(μ_{c_j}^n(x)/σ_{c_j}^n(x)) + (μ_{c_j}^n(x)/σ_{c_j}^n(x)) φ(μ_{c_j}^n(x)/σ_{c_j}^n(x)) ]
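A minimal sketch of that closed form, E{max(0, Y)²} for Y ∼ N(μ, σ²), together with the resulting composite mean; the argument names are mine.

```python
import numpy as np
from scipy.stats import norm

def expected_sq_hinge(mu, s2):
    """E{max(0, Y)^2} for Y ~ N(mu, s2), elementwise over arrays."""
    sd = np.sqrt(s2)
    z = mu / sd
    return s2 * ((1.0 + z**2) * norm.cdf(z) + z * norm.pdf(z))

def composite_mean(mu_f, mu_c, s2_c, lam, rho):
    """E{Y(x)}: composite posterior mean of the AL under separated GP surrogates."""
    return mu_f + lam @ mu_c + np.sum(expected_sq_hinge(mu_c, s2_c)) / (2.0 * rho)
```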

Expected improvement for AL

The simplest way to evaluate the EI is via Monte Carlo:
- take 100 samples Y_f^(i)(x) and Y_c^(i)(x),
- then EI(x) ≈ (1/100) Σ_{i=1}^{100} max{0, y^n_min − Y^(i)(x)} (see the sketch below).

The "max" in the AL makes analytic calculation intractable. But you can remove the "max" by introducing slack variables,
- turning inequality constraints into equality constraints,
- and making the AL composite Y(x) a simple quadratic.
- The EI then becomes a one-dimensional integral of non-central chi-squared quantities.
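A minimal sketch of the Monte Carlo EI just described, sampling Y_f(x) and Y_c(x) independently from their GP predictive distributions at a single candidate x; the argument names are mine.

```python
import numpy as np

def al_ei_mc(mu_f, s2_f, mu_c, s2_c, lam, rho, ymin, nsamp=100, seed=0):
    """Monte Carlo EI for the AL composite, averaged over nsamp predictive draws."""
    rng = np.random.default_rng(seed)
    yf = rng.normal(mu_f, np.sqrt(s2_f), size=nsamp)                 # draws of Y_f(x)
    yc = rng.normal(mu_c, np.sqrt(s2_c), size=(nsamp, len(mu_c)))    # draws of Y_c(x)
    y = yf + yc @ lam + np.sum(np.maximum(0.0, yc)**2, axis=1) / (2.0 * rho)
    return np.mean(np.maximum(0.0, ymin - y))
```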

Results on toy data

[Figure: best valid objective (f) versus blackbox evaluations (n), comparing EI, EY, SANN, MADS-AL, AsyEnt, MADS, and model-based comparators over 100 evaluations.]

Quantiles (95% and 5%) of the best valid objective value:

            n = 25          n = 50          n = 100
            95%     5%      95%     5%      95%     5%
  EI        0.866   0.610   0.775   0.602   0.602   0.600
  EY        1.052   0.607   0.854   0.601   0.603   0.600
  SANN      1.013   0.648   0.940   0.630   0.855   0.612
  MADS-AL   1.070   0.600   0.979   0.600   0.908   0.600
  AsyEnt    0.825   0.610   0.761   0.601   0.758   0.600
  MADS      1.056   0.608   0.886   0.600   0.863   0.599
  model     1.064   0.600   0.861   0.599   0.750   0.599

Benchmark problem

Two contaminant plumes threaten a valuable water source: the Yellowstone River.

[Figure: site schematic showing plume A, plume B, no-flow and specified-head boundaries, and the Yellowstone River.]

To prevent further expansion of these plumes, six pump-and-treat wells have been proposed.

Mayer et al. (2002) first posed the pump-and-treat problem as a constrained blackbox optimization. If x_j denotes the pumping rate for well j, then

  min_x { f(x) = Σ_{j=1}^6 x_j : c_1(x) ≤ 0, c_2(x) ≤ 0, x ∈ [0, 2·10⁴]⁶ }.

- f is linear, describing costs to operate the wells
- c_1 and c_2 denote plume flow exiting the boundary, simulated via an analytic element method groundwater model

Matott et al. (2011) compared MATLAB and Python optimizers, treating constraints via APM.

[Figure: convergence behavior of the selected algorithms on the pump-and-treat problem (best valid objective versus number of model evaluations), including BFGS, DDS, DE, Hooke-Jeeves, IFFCO, Nelder-Mead, Newton, OICs, Powell, PRGC, SA, TGA, and TNC; adapted from Figure 4 of Matott et al. (2011).]

- initialized at x_j^0 = 10⁴

[Figure: best valid objective (f) versus blackbox evaluations (n) on the pump-and-treat problem, comparing EI, EY, SANN, MADS-AL, AsyEnt, MADS, and model-based comparators over 500 evaluations.]

Quantiles (95% and 5%) of the best valid objective value:

            n = 100           n = 200           n = 500
            95%      5%       95%      5%       95%      5%
  EI        37584    27054    28698    25119    25695    24196
  EY        36554    25677    32770    24492    27362    24100
  SANN      43928    28766    35456    27238    30920    26824
  MADS-AL   60000    30776    49020    26358    32663    24102
  AsyEnt    49030    37483    29079    26681    27445    25377
  MADS      60000    30023    60000    26591    60000    23571
  model     60000    25912    60000    25164    35730    24939

Summarizing

Nontrivial multiple blackbox constraints present serious challenges to optimization,
- even when the objective is simple/known.

The augmented Lagrangian method from mathematical programming is a nice framework for handling constraints,
- but only local convergence is guaranteed.

Statistical surrogate modeling and expected improvement nicely hybridize with the AL:
- implementation is straightforward (see laGP on CRAN).