Proceedings of the 1994 Winter Simulation Conference
ed. J. D. Tew, S. Manivannan, D. A. Sadowski, and A. F. Seila

DEVELOPMENTS IN STOCHASTIC OPTIMIZATION ALGORITHMS WITH GRADIENT APPROXIMATIONS BASED ON FUNCTION MEASUREMENTS

James C. Spall
The Johns Hopkins University Applied Physics Laboratory
Laurel, Maryland 20723-6099
(E-mail: [email protected])

This work was partially supported by the JHU/APL IRAD Program and U.S. Navy Contract N00039-94-C-0001.

ABSTRACT

There has recently been much interest in recursive optimization algorithms that rely on measurements of only the objective function, not requiring measurements of the gradient (or higher derivatives) of the objective function. The algorithms are implemented by forming an approximation to the gradient at each iteration that is based on the function measurements. Such algorithms have the advantage of not requiring detailed modeling information describing the relationship between the parameters to be optimized and the objective function. To properly cope with the noise that generally occurs in the measurements, these algorithms are best placed within a stochastic approximation framework. This paper discusses some of the main contributions to this class of algorithms, beginning in the early 1950s and progressing until now.

Key words: Optimization, stochastic approximation, gradient estimation, recursive procedure

1. INTRODUCTION

In virtually all areas of engineering and the physical and social sciences, one encounters problems involving the optimization of some mathematical objective function (e.g., as in optimal control, system design and planning, model fitting, and performance evaluation from system test data). Typically, the solution to this optimization problem corresponds to a vector of parameters such that the gradient of the objective function (with respect to the system parameters being optimized) is zero. Over the last several years, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from (generally noisy) measurements of the objective function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, and the design of complex queueing and discrete-event systems.

Overall, such algorithms exhibit certain convergence properties of gradient-based algorithms while requiring only objective (say, loss) function measurements. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in problem areas such as nonlinear feedback controller design. Further, in areas such as Monte Carlo optimization or recursive statistical parameter estimation, there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. Because of the inherent randomness in the data and search algorithms here, all algorithms will be viewed from the perspective of stochastic approximation (SA).

Let us elaborate on the distinction between algorithms based on direct gradient measurements and algorithms based on gradient approximations from measurements of the loss function. Examples of the former include Robbins-Monro SA (Robbins and Monro, 1951), steepest descent and Newton-Raphson (Bazaraa and Shetty, 1979, Chap. 8), neural network back propagation (Rumelhart, et al., 1986), and infinitesimal perturbation analysis (IPA)-based optimization for discrete-event systems (Glasserman, 1991). Examples of approximation-based methods using loss function measurements are given below, but include as an early prototype the Kiefer-Wolfowitz finite-difference SA algorithm (Kiefer and Wolfowitz, 1952). The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient, since the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have knowledge of the underlying system input-output relationships in order to calculate the gradient estimate (using the chain rule) from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require knowledge of the system input-output relationships. Because of the fundamentally different information needed in implementing these two general types of algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those based on gradient approximations when speed is measured in number of iterations. Intuitively, this is not surprising given the additional information required for the gradient-based algorithms. In particular, the rate of convergence, measured in terms of the deviation of the parameter estimate from the true optimal parameter vector, is typically of order k^{-1/2} for the gradient-based algorithms and of order k^{-β} for some 0 < β < 1/2 for the algorithms based on gradient approximations, where k denotes the number of iterations.

2. GENERIC ALGORITHM FORM AND GRADIENT APPROXIMATIONS

Let L(θ) denote the loss function to be minimized with respect to the p-dimensional parameter vector θ, and let g(θ) denote the gradient of L(θ); per the above, a solution θ* satisfies g(θ*) = 0. The algorithms considered here have the generic recursive SA form

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k), \qquad (1)$$

where a_k > 0 represents a scalar gain coefficient, and ĝ_k(θ̂_k) represents an approximation of g(θ̂_k) based on the above-mentioned measurements of L(·) at values of θ that are perturbed from the nominal value θ̂_k. Under appropriate conditions, θ̂_k will converge (in some stochastic sense) to θ* (see, e.g., Kushner and Clark, 1978).
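As a concrete illustration, below is a minimal Python sketch of recursion (1) with a pluggable gradient approximation. The decaying-gain sequence a_k = a/(k+1+A)^α is one common practical choice and is an assumption of this sketch, not a prescription from the paper.

```python
import numpy as np

def sa_minimize(grad_approx, theta0, num_iter=1000, a=0.1, A=100.0, alpha=0.602):
    """Generic SA recursion (1): theta_{k+1} = theta_k - a_k * g_hat_k(theta_k).

    grad_approx(theta, k) should return a gradient approximation formed
    from (noisy) loss measurements y(.), e.g., the FD, RD, or SP forms below.
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(num_iter):
        a_k = a / (k + 1 + A) ** alpha  # decaying scalar gain a_k > 0 (illustrative choice)
        theta = theta - a_k * grad_approx(theta, k)
    return theta
```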


The most critical aspect of implementing (1) is the gradient approximation ĝ_k(θ̂_k) at each k. This review discusses the three general forms that appear to have attracted the most attention. In forming ĝ_k(θ̂_k) we let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and c_k be some (usually small) positive number (in general, a dot as a function argument represents a specific value of θ that we will not specify here in order to avoid excess notation). "One-sided" gradient approximations involve measurements y(θ̂_k) and y(θ̂_k + perturbation) for each component of ĝ_k(θ̂_k), while "two-sided" approximations involve the measurements y(θ̂_k ± perturbation). The three general forms are:

Finite difference (FD), where each component of θ̂_k is perturbed one-at-a-time and corresponding measurements y(·) are obtained; each component of ĝ_k(θ̂_k) is formed by differencing corresponding measurements of y(·) and then dividing by the difference interval. This is the "standard" approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as a limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector. Typically, the i-th component of ĝ_k(θ̂_k) (i = 1, 2, ..., p) for a two-sided FD approximation is given by

$$\hat{g}_{ki}(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k e_i) - y(\hat{\theta}_k - c_k e_i)}{2 c_k},$$

where e_i denotes a unit vector in the i-th direction (an obvious analogue holds for the one-sided version; likewise for the RD and SP forms below).
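In code, the two-sided FD approximation might look like the following sketch, where y is a user-supplied routine returning a noisy loss measurement (the function and argument names are illustrative assumptions):

```python
import numpy as np

def fd_gradient(y, theta, c_k):
    """Two-sided finite-difference (FD) gradient approximation.

    Perturbs one component of theta at a time, so 2p loss measurements
    y(.) are required for a p-dimensional parameter vector.
    """
    p = theta.size
    g_hat = np.empty(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # unit vector in the i-th direction
        g_hat[i] = (y(theta + c_k * e_i) - y(theta - c_k * e_i)) / (2.0 * c_k)
    return g_hat
```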

Random directions (RD), where all components of θ̂_k are randomly perturbed together in two separate directions to obtain two measurements y(·), and each component of ĝ_k(θ̂_k) is formed from the product of the corresponding component of the perturbation vector times the difference in the two measurements. For two-sided RD, we have

$$\hat{g}_{ki}(\hat{\theta}_k) = d_{ki} \, \frac{y(\hat{\theta}_k + c_k d_k) - y(\hat{\theta}_k - c_k d_k)}{2 c_k},$$

where d_k = (d_k1, ..., d_kp)^T is a vector of user-specified random variables satisfying certain conditions.
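A corresponding sketch of the two-sided RD form, under the assumption that the user supplies a routine draw_d yielding a direction vector d_k with the required distributional properties:

```python
def rd_gradient(y, theta, c_k, draw_d):
    """Two-sided random-directions (RD) gradient approximation.

    Only two loss measurements are needed regardless of the dimension p;
    the i-th component of the estimate is d_ki * (y+ - y-) / (2 c_k).
    """
    d_k = draw_d()  # user-specified random direction vector
    diff = y(theta + c_k * d_k) - y(theta - c_k * d_k)
    return d_k * (diff / (2.0 * c_k))
```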

Simultaneous perturbation (SP), which also has all elements of θ̂_k randomly perturbed together to obtain two measurements y(·), but each component of ĝ_k(θ̂_k) is formed from a ratio involving the individual components of the perturbation vector and the difference in the two corresponding measurements. For two-sided SP, we have

$$\hat{g}_{ki}(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{ki}},$$

where the distribution of the user-specified random perturbations for SP, Δ_k = (Δ_k1, ..., Δ_kp)^T, must satisfy conditions different from those of RD (although certain distributions satisfy both sets of conditions).
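A matching sketch of the two-sided SP form. Here the perturbations are drawn as symmetric Bernoulli (±1) variables, one admissible choice noted in the discussion that follows (and the case in which SPSA and RDSA coincide); note the division by each Δ_ki, where the RD form instead multiplies by d_ki:

```python
import numpy as np

def sp_gradient(y, theta, c_k, rng=None):
    """Two-sided simultaneous-perturbation (SP) gradient approximation.

    All components of theta are perturbed together, so only two loss
    measurements are needed independent of the dimension p.
    """
    rng = rng or np.random.default_rng()
    delta_k = rng.choice([-1.0, 1.0], size=theta.size)  # symmetric Bernoulli perturbations
    diff = y(theta + c_k * delta_k) - y(theta - c_k * delta_k)
    return diff / (2.0 * c_k * delta_k)  # divide by each Delta_ki componentwise
```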


Algorithm (1) with one of the gradient approximations will be referred to as FDSA, RDSA, or SPSA, as appropriate. Note that the number of loss function measurements y(·) needed in FD grows with p, while with RD and SP only two measurements are needed independent of p. This, of course, provides the potential for RDSA or SPSA to achieve a large savings (over FDSA) in the total number of measurements required to estimate θ when p is large. This potential is only realized if the number of iterations required for effective convergence to θ* does not increase in a way to cancel the measurement savings per gradient approximation. Some of the references in Section 3 address this issue (especially Spall (1992) and Chin (1993)), demonstrating when this potential can be realized. Although the RD and SP gradient approximation forms have certain similarities, the performance of the RDSA and SPSA algorithms will generally be quite different, as demonstrated in some of the references below. Note, however, that in one important special case the RDSA and SPSA algorithms coincide: namely, when the components of the perturbation vector are symmetric Bernoulli distributed (e.g., ±1 with each outcome having probability 1/2). In general, however, the SPSA formulation has been shown to be more efficient than RDSA, as discussed in Section 3. Theoretically, this follows from the fact that SPSA has a lower asymptotic mean-square error than RDSA for the same number of measurements y(·). Let us close with definitions of some terms used in Section 3. "Gradient averaging" refers to the averaging of several (say, q) gradient approximations at any given iteration; "gradient smoothing" refers to averaging gradients across iterations (analogous to "momentum" in the neural network parlance).
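These two devices can be layered on any of the gradient approximations above. A minimal sketch follows; the exponential smoothing weight rho and the class structure are illustrative assumptions, not specifications from the paper:

```python
class AveragedSmoothedGradient:
    """Gradient averaging (q approximations per iteration) combined with
    across-iteration gradient smoothing (akin to 'momentum')."""

    def __init__(self, grad_approx, q=4, rho=0.9):
        self.grad_approx = grad_approx  # e.g., a wrapper around sp_gradient
        self.q = q                      # approximations to average per iteration
        self.rho = rho                  # smoothing weight across iterations
        self.smoothed = None

    def __call__(self, theta, k):
        # "Gradient averaging": mean of q independent approximations at theta
        g_bar = sum(self.grad_approx(theta, k) for _ in range(self.q)) / self.q
        # "Gradient smoothing": exponentially weighted average across iterations
        if self.smoothed is None:
            self.smoothed = g_bar
        else:
            self.smoothed = self.rho * self.smoothed + (1.0 - self.rho) * g_bar
        return self.smoothed
```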


3. LIST OF KEY DEVELOPMENTS AND ASSOCIATED REFERENCES

| References | No. of Function Measurements y(·) per Iteration | Measurement Noise in y(·) Considered? | Comments |
|---|---|---|---|
| Kiefer and Wolfowitz (1952) | 2 | Yes | First FDSA algorithm. Limited to scalar setting. Convergence in probability shown. |
| Blum (1954) | p+1 | Yes | Multivariate extension of Kiefer and Wolfowitz (1952) FDSA. One-sided finite-difference gradient approximation. Shows almost sure (a.s.) convergence. |
| Sacks (1958) | 2p | Yes | Shows asymptotic normality of multivariate two-sided FDSA method. Normality result useful in quantifying accuracy of SA estimate. |
| Ermoliev (1969) | 2 | Yes | Apparently first paper to consider a special case of RDSA-type algorithm: one-sided RDSA form with uniformly distributed perturbations. Includes analysis of bias in gradient approximation. No convergence analysis or theoretical or numerical comparisons with standard FDSA algorithm. |
| Fabian (1967, 1971) | >2p | Yes | Papers present several different methods for accelerating convergence of FDSA-type algorithms. Methods are based on taking additional measurements to explore loss function surface in greater detail. The 1971 paper discusses stochastic analogue to second-order algorithms of generic Newton-Raphson form (this algorithm uses O(p^2) measurements of y(·)). |
| Polyak and Tsypkin (1973) | Depends on algorithm | Yes | Performs general analysis of FDSA- and RDSA-type algorithms, including a demonstration of a.s. convergence. |
| Kushner and Clark (1978) | 2 | Yes | Considers two-sided RDSA form with spherically uniform distributed perturbations. Theoretical analysis shows no improvement over FDSA; this finding appears to result from an error in choice of RD perturbation distribution, as discussed in Chin (1993). |
| Ermoliev (1983) | 2 | Yes | Extends Ermoliev (1969) to include constraints; special treatment for convex L(θ). No convergence theory or comparative numerical analysis. |
| Spall (1987) | 2 | No | Introduces SPSA (two-sided) form. Apparently first paper to consider general perturbation distributions (vs. uniform distributions for RDSA above). Analysis of bias in gradient approximation; numerical study for special case of symmetric Bernoulli (random binary) perturbations shows performance superior to FDSA. No theoretical convergence analysis. |
| Spall (1988) | 2q for any q ≥ 1 | Yes | Extends Spall (1987) to include measurement noise; also proves a.s. convergence of SPSA algorithm; numerical analysis of potential benefits of gradient averaging (q > 1). |
| Polyak and Tsybakov (1990, 1992) | 1 or 2 | Yes | Papers present approach similar to RDSA based on kernel functions. Show a.s. convergence in general noise settings. |
| Styblinski and Tang (1990); Chin (1994) | 2q for any q ≥ 1 | No | Styblinski and Tang uses modified version of one- and two-sided RDSA algorithms for global optimization. Considers Gaussian- and Cauchy-distributed perturbations. Both gradient averaging (q > 1) and across-iteration gradient smoothing considered. Extensive numerical analysis, including demonstration of superiority of RDSA to simulated annealing; no theoretical convergence analysis. Chin substitutes SPSA for RDSA in algorithm of Styblinski and Tang and numerically illustrates superior performance. |
| Spall (1992) | 2q for any q ≥ 1 | Yes | Extends SPSA theory in Spall (1987, 1988) to include asymptotic normality (a la Sacks (1958)). First paper to show theoretical advantage of two-measurement approaches (SPSA in particular, which includes RDSA with symmetric Bernoulli perturbations as special case) over classical FDSA approach. Also includes theoretical analysis on when gradient averaging (q > 1) is beneficial, and extensive numerical analysis. |
| Spall and Cristion (1992, 1994a,b) | 2q (for any q ≥ 1) | Yes | Papers show use of SPSA in closed-loop control problems (where, e.g., the loss function L(θ) changes over time). Allows for optimal control without constructing model of system dynamics. Convergence (a.s.) to optimal controller shown under certain conditions. Across-iteration gradient smoothing considered in 1994a paper. Numerical analysis and comparison with FDSA for closed-loop estimation. Considers polynomial and neural net examples where p ranges from 70 to 400. |
| Chin (1993) | Depends on algorithm | Yes | Extends theoretical and numerical comparison of SPSA and FDSA in Spall (1992) to include RDSA. Shows theoretical and numerical superiority of SPSA for general perturbation distributions. |
| Yakowitz (1993) | 2p | Yes | Alternate global optimization approach (vs. Styblinski and Tang (1990) and Chin (1994)) using two-sided FDSA algorithm. Shows both a.s. convergence and asymptotic normality. Extensions to RDSA and/or SPSA seem feasible. |
| Cauwenberghs (1993, 1994); Alspector, et al. (1993) | 2q for any q ≥ 1 | No | Focus on constant-gain (a_k = a) implementation of SPSA/RDSA algorithms with symmetric Bernoulli perturbation distribution (SPSA and RDSA equivalent in this case). Both open-loop identification and closed-loop control problems considered. Techniques for hardware implementation in feed-forward and recurrent neural networks. |
| Spall (1994) | 3q for any q ≥ 1 | Yes | Extends SPSA to include second-order effects for purposes of algorithm acceleration (in the spirit of Fabian (1971) above). Estimates both gradient and inverse Hessian at each iteration (with number of measurements independent of p, as indicated at left) to produce an SA analogue of the deterministic Newton-Raphson algorithm. |

REFERENCES

Alspector, J., R. Meir, B. Yuhas, and A. Jayakumar. 1993. A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks. Advances in Neural Information Processing Systems 5:836-844. San Mateo, California: Morgan Kaufmann.

Bazaraa, M. S. and C. M. Shetty. 1979. Nonlinear Programming. New York: Wiley.

Blum, J. R. 1954. Multidimensional Stochastic Approximation Methods. Annals of Mathematical Statistics 25:737-744.

Cauwenberghs, G. 1993. A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization. Advances in Neural Information Processing Systems 5:244-251. San Mateo, California: Morgan Kaufmann.

Cauwenberghs, G. 1994. Analog VLSI Autonomous Systems for Learning and Optimization. Ph.D. dissertation, California Institute of Technology.

Chin, D. C. 1993. Performance of Several Stochastic Approximation Algorithms in the Multivariate Kiefer-Wolfowitz Setting. Computing Science and Statistics: Proceedings of the 25th Symposium on the Interface, 289-295.

Chin, D. C. 1994. A More Efficient Global Optimization Algorithm Based on Styblinski and Tang. Neural Networks 7:573-574.

Ermoliev, Y. 1969. On the Method of Generalized Stochastic Gradients and Quasi-Fejer Sequences. Cybernetics 5:208-220.

Ermoliev, Y. 1983. Stochastic Quasigradient Methods and their Application to System Optimization. Stochastics 9:1-36.

Fabian, V. 1967. Stochastic Approximation of Minima with Improved Asymptotic Speed. Annals of Mathematical Statistics 38:191-200.

Fabian, V. 1971. Stochastic Approximation. In Optimizing Methods in Statistics, ed. J. S. Rustagi, 439-470. New York: Academic Press.

Fu, M. C. 1994. Optimization via Simulation: A Review. To appear in Annals of Operations Research.

Glasserman, P. 1991. Gradient Estimation via Perturbation Analysis. Boston: Kluwer.

Kiefer, J. and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function. Annals of Mathematical Statistics 23:462-466.

Kushner, H. J. and D. S. Clark. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

L'Ecuyer, P. 1991. An Overview of Derivative Estimation. In Proceedings of the Winter Simulation Conference, ed. B. L. Nelson, W. D. Kelton, and G. M. Clark, 207-217.

Polyak, B. T. and Y. Z. Tsypkin. 1973. Pseudogradient Adaptation and Training Algorithms. Automation and Remote Control 34:377-397.

Polyak, B. T. and A. B. Tsybakov. 1990. Optimal Rates of Convergence of Search Algorithms in Stochastic Optimization. Problemy Peredachi Informatsii 26:45-53; English transl. in Problems of Information Transmission 26:126-133, 1990.

Polyak, B. T. and A. B. Tsybakov. 1992. On Stochastic Approximation with Arbitrary Noise (the K-W Case). Advances in Soviet Mathematics 12:107-113.

Robbins, H. and S. Monro. 1951. A Stochastic Approximation Method. Annals of Mathematical Statistics 22:400-407.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing 1:318-362. Cambridge: MIT Press.

Sacks, J. 1958. Asymptotic Distribution of Stochastic Approximation Procedures. Annals of Mathematical Statistics 29:373-405.

Spall, J. C. 1987. A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates. Proceedings of the American Control Conference, 1161-1167.

Spall, J. C. 1988. A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer-Wolfowitz Setting. Proceedings of the IEEE Conference on Decision and Control, 1544-1548.

Spall, J. C. 1992. Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control 37:332-341.

Spall, J. C. 1994. A Second Order Stochastic Approximation Algorithm Using Only Function Measurements. Proceedings of the IEEE Conference on Decision and Control, to appear.

Spall, J. C. and J. A. Cristion. 1992. Direct Adaptive Control of Nonlinear Systems Using Neural Networks and Stochastic Approximation. Proceedings of the IEEE Conference on Decision and Control, 878-883.

Spall, J. C. and J. A. Cristion. 1994a. Nonlinear Adaptive Control Using Neural Networks: Estimation with a Smoothed Form of Simultaneous Perturbation Gradient Approximation. Statistica Sinica 4:1-27.

Spall, J. C. and J. A. Cristion. 1994b. Model-Free Control of General Discrete-Time Systems, submitted for publication (preliminary version in Proceedings of the IEEE Conference on Decision and Control, 1993, 2792-2797).

Styblinski, M. A. and T. S. Tang. 1990. Experiments in Nonconvex Optimization: Stochastic Approximation with Function Smoothing and Simulated Annealing. Neural Networks 3:467-483.

Yakowitz, S. 1993. A Globally Convergent Stochastic Approximation. SIAM Journal on Control and Optimization 31:30-40.

AUTHOR BIOGRAPHY

JAMES C. SPALL has been with the Johns Hopkins University Applied Physics Laboratory since 1983, where he is a project leader for several research efforts focusing on problems in statistical modeling and control. In 1991, he was appointed to the Principal Professional Staff of the laboratory. For the year 1990, he received the R. W. Hart Prize as principal investigator of the most outstanding Independent Research and Development project at JHU/APL. Dr. Spall has published numerous research papers in the areas of statistics and control engineering, including articles on such subjects as statistical parameter estimation, optimization, Kalman filtering, and neural networks. He also served as editor and coauthor of the book Bayesian Analysis of Time Series and Dynamic Models. He is a member of the IEEE, the American Statistical Association, Sigma Xi, and a fellow of the honor society Tau Beta Pi.