Proceedings of the 1994 Winter Simulation Conference
ed. J. D. Tew, S. Manivannan, D. A. Sadowski, and A. F. Seila

DEVELOPMENTS IN STOCHASTIC OPTIMIZATION ALGORITHMS WITH GRADIENT APPROXIMATIONS BASED ON FUNCTION MEASUREMENTS

James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
Laurel, Maryland 20723-6099
(E-mail: [email protected])

This work was partially supported by the JHU/APL IRAD Program and U.S. Navy Contract N00039-94-C-0001.

ABSTRACT
There has recently been much interest in recursive optimization algorithms that rely on measurements of only the objective function, not requiring measurements of the gradient (or higher derivatives) of the objective function. The algorithms are implemented by forming an approximation to the gradient at each iteration that is based on the function measurements. Such algorithms have the advantage of not requiring detailed modeling information describing the relationship between the parameters to be optimized and the objective function. To properly cope with the noise that generally occurs in the measurements, these algorithms are best placed within a stochastic approximation framework. This paper discusses some of the main contributions to this class of algorithms, beginning in the early 1950s and progressing until now.

Key Words: Optimization, stochastic approximation, gradient estimation, recursive procedure

1. INTRODUCTION

In virtually all areas of engineering and the physical and social sciences, one encounters problems involving the optimization of some mathematical objective function (e.g., as in optimal control, system design and planning, model fitting, and performance evaluation from system test data). Typically, the solution to this optimization problem corresponds to a vector of parameters such that the gradient of the objective function (with respect to the system parameters being optimized) is zero. Over the last several years, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from (generally noisy) measurements of the objective function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, and the design of complex queueing and discrete-event systems.

Overall, such algorithms exhibit certain convergence properties of gradient-based algorithms while requiring only objective (say, loss) function measurements. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in problem areas such as nonlinear feedback controller design. Further, in areas such as Monte Carlo optimization or recursive statistical parameter estimation, there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. Because of the inherent randomness in the data and search algorithms here, all algorithms will be viewed from the perspective of stochastic approximation (SA).
Let us elaborate on the distinction between algorithms based on direct gradient measurements and algorithms based on gradient approximations from measurements of the loss function. Examples of the former include Robbins-Monro SA (Robbins and Monro, 1951), steepest descent and Newton-Raphson (Bazaraa and Shetty, 1979, Chap. 8), neural network back-propagation (Rumelhart, et al., 1986), and infinitesimal perturbation analysis (IPA)-based optimization for discrete-event systems (Glasserman, 1991). Examples of approximation-based methods using loss function measurements are given below, but include as an early prototype the Kiefer-Wolfowitz finite-difference SA algorithm (Kiefer and Wolfowitz, 1952). The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the
gradient since the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have knowledge of the underlying system input-output relationships in order to calculate the gradient estimate (using the chain rule) from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require knowledge of the system input-output relationships. Because of the fundamentally different information needed in implementing these two general types of algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those based on gradient approximations when speed is measured in number of iterations. Intuitively, this is not surprising given the additional information required for the gradient-based algorithms. In particular, the rate of convergence (measured in terms of the deviation of the parameter estimate from the true optimal parameter vector) is typically of order $k^{-1/2}$ for the gradient-based algorithms and of order $k^{-\beta}$ for some $0 < \beta < 1/2$ for the algorithms based on gradient approximations, where $k$ denotes the number of iterations.

2. GENERAL FORM OF THE ALGORITHMS AND GRADIENT APPROXIMATIONS

Let $L(\theta)$ denote the loss function to be minimized over the $p$-dimensional parameter vector $\theta$, let $g(\theta) = \partial L / \partial \theta$ denote its gradient, and let $\theta^*$ denote the minimizing value of $\theta$ (so $g(\theta^*) = 0$). The algorithms considered here have the general SA form

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k), \qquad (1)$$

where $a_k > 0$ represents a scalar gain coefficient, and $\hat{g}_k(\hat{\theta}_k)$ represents an approximation of $g(\hat{\theta}_k)$ based on the above-mentioned measurements of $L(\cdot)$ at values of $\theta$ that are perturbed from the nominal value $\hat{\theta}_k$. Under appropriate conditions, $\hat{\theta}_k$ will converge (in some stochastic sense) to $\theta^*$ (see, e.g., Kushner and Clark, 1978).
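To make recursion (1) concrete, here is a minimal Python sketch of the generic SA loop (all names are illustrative, not from the paper; the gradient approximation is supplied by one of the FD, RD, or SP forms defined below):

```python
import numpy as np

def sa_minimize(theta0, grad_approx, gain, num_iter):
    """Generic SA recursion (1): theta_{k+1} = theta_k - a_k * ghat_k(theta_k).

    theta0:      initial p-dimensional parameter vector
    grad_approx: callable (theta, k) -> gradient approximation ghat_k(theta)
    gain:        callable k -> scalar gain a_k > 0 (e.g., lambda k: 0.1 / (k + 1))
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(num_iter):
        theta = theta - gain(k) * grad_approx(theta, k)  # recursion (1)
    return theta
```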
The most critical aspect of implementing (1) is the gradient approximation $\hat{g}_k(\hat{\theta}_k)$ at each $k$. This review discusses the three general forms that appear to have attracted the most attention. In forming $\hat{g}_k(\hat{\theta}_k)$ we let $y(\cdot)$ denote a measurement of $L(\cdot)$ at a design level represented by the dot (i.e., $y(\cdot) = L(\cdot) + \text{noise}$) and $c_k$ be some (usually small) positive number (in general, a dot as a function argument represents a specific value of $\theta$ that we will not specify here in order to avoid excess notation). "One-sided" gradient approximations involve measurements $y(\hat{\theta}_k)$ and $y(\hat{\theta}_k + \text{perturbation})$ for each component of $\hat{g}_k(\hat{\theta}_k)$, while "two-sided" approximations involve the measurements $y(\hat{\theta}_k \pm \text{perturbation})$. The three general forms are:

Finite difference (FD), where each component of $\hat{\theta}_k$ is perturbed one-at-a-time and corresponding measurements $y(\cdot)$ are obtained; each component of $\hat{g}_k(\hat{\theta}_k)$ is formed by differencing corresponding measurements of $y(\cdot)$ and then dividing by the difference interval. This is the "standard" approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of $p$ partial derivatives, each constructed as a limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector. Typically, the $i$th component of $\hat{g}_k(\hat{\theta}_k)$ ($i = 1, 2, \ldots, p$) for a two-sided FD approximation is given by

$$\hat{g}_{ki}(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k e_i) - y(\hat{\theta}_k - c_k e_i)}{2 c_k},$$

where $e_i$ denotes a unit vector in the $i$th direction (an obvious analogue holds for the one-sided version; likewise for the RD and SP forms below).
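As an illustrative sketch (names assumed, not from the paper), the two-sided FD approximation can be computed as follows; note that the loop takes $2p$ measurements of $y(\cdot)$ per gradient estimate, which is the cost that grows with $p$:

```python
import numpy as np

def fd_gradient(y, theta, ck):
    """Two-sided finite-difference (FD) gradient approximation.

    y:     callable returning a noisy loss measurement y(theta)
    theta: current parameter vector (p-dimensional)
    ck:    difference interval c_k > 0
    Uses 2p measurements of y per call.
    """
    p = len(theta)
    ghat = np.zeros(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # unit vector in the i-th direction
        ghat[i] = (y(theta + ck * e_i) - y(theta - ck * e_i)) / (2.0 * ck)
    return ghat
```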
Random directions (RD), where all components of $\hat{\theta}_k$ are randomly perturbed together in two separate directions to obtain two measurements $y(\cdot)$, and each component of $\hat{g}_k(\hat{\theta}_k)$ is formed from the product of the corresponding component of the perturbation vector times the difference in the two measurements. For two-sided RD, we have

$$\hat{g}_{ki}(\hat{\theta}_k) = d_{ki}\,\frac{y(\hat{\theta}_k + c_k d_k) - y(\hat{\theta}_k - c_k d_k)}{2 c_k},$$

where $d_k = (d_{k1}, \ldots, d_{kp})^T$ is a vector of user-specified random variables satisfying certain conditions.
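A corresponding sketch of the two-sided RD form (sample_d is an assumed helper that draws the user-specified perturbation vector $d_k$); only two measurements of $y(\cdot)$ are taken regardless of $p$, and each component of the estimate is the common difference quotient multiplied by the matching component of $d_k$:

```python
import numpy as np

def rd_gradient(y, theta, ck, sample_d):
    """Two-sided random-directions (RD) gradient approximation.

    sample_d: callable p -> perturbation vector d_k (user-specified distribution)
    Uses only 2 measurements of y, independent of p.
    """
    d = sample_d(len(theta))
    diff = y(theta + ck * d) - y(theta - ck * d)  # two loss measurements
    return d * diff / (2.0 * ck)  # multiply by the components of d_k
```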
Simultaneous perturbation (SP), which also has all elements of $\hat{\theta}_k$ randomly perturbed together to obtain two measurements $y(\cdot)$, but each component of $\hat{g}_k(\hat{\theta}_k)$ is formed from a ratio involving the individual components of the perturbation vector and the difference in the two corresponding measurements. For two-sided SP, we have

$$\hat{g}_{ki}(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{ki}},$$

where the distribution of the user-specified random perturbations for SP, $\Delta_k = (\Delta_{k1}, \ldots, \Delta_{kp})^T$, must satisfy conditions different from those of RD (although certain distributions satisfy both sets of conditions).
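In code, the only change from the RD sketch is that the difference quotient is divided elementwise by the perturbation components rather than multiplied by them (again an illustrative sketch; sample_delta must draw perturbations satisfying the SP conditions, in particular bounded away from zero so the division is well defined, e.g., symmetric Bernoulli $\pm 1$):

```python
import numpy as np

def sp_gradient(y, theta, ck, sample_delta):
    """Two-sided simultaneous-perturbation (SP) gradient approximation.

    sample_delta: callable p -> perturbation vector Delta_k (components must be
                  bounded away from zero, e.g., symmetric Bernoulli +/-1)
    Uses only 2 measurements of y, independent of p.
    """
    delta = sample_delta(len(theta))
    diff = y(theta + ck * delta) - y(theta - ck * delta)  # two loss measurements
    return diff / (2.0 * ck * delta)  # divide by the components Delta_ki
```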
Algorithm (1) with one of the gradient approximations will be referred to as FDSA, RDSA, or SPSA, as appropriate. Note that the number of loss function measurements $y(\cdot)$ needed in FD grows with $p$, while with RD and SP only two measurements are needed independent of $p$. This, of course, provides the potential for RDSA or SPSA to achieve a large savings (over FDSA) in the total number of measurements required to estimate $\theta$ when $p$ is large. This potential is only realized if the number of iterations required for effective convergence to $\theta^*$ does not increase in a way to cancel the measurement savings per gradient approximation. Some of the references in Section 3 address this issue (especially Spall (1992) and Chin (1993)), demonstrating when this potential can be realized. Although the RD and SP gradient approximation forms have certain similarities, the performance of the RDSA and SPSA algorithms will generally be quite different, as demonstrated in some of the references below. Note, however, that in one important special case the RDSA and SPSA algorithms coincide: namely, when the components of the perturbation vector are symmetric Bernoulli distributed (e.g., $\pm 1$ with each outcome having probability 1/2). In general, however, the SPSA formulation has been shown to be more efficient than RDSA, as discussed in Section 3. Theoretically, this follows from the fact that SPSA has a lower asymptotic mean-square error than RDSA for the same number of measurements $y(\cdot)$. Let us close with definitions of some terms used in Section 3. "Gradient averaging" refers to the averaging of several (say, $q$) gradient approximations at any given iteration; "gradient smoothing" refers to averaging gradients across iterations (analogous to "momentum" in the neural network parlance).
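Tying these pieces together, the following hedged end-to-end sketch runs SPSA (reusing sp_gradient from above) on a noisy quadratic loss with symmetric Bernoulli $\pm 1$ perturbations, the special case in which RDSA and SPSA coincide; the decaying gain sequences and the gradient-averaging parameter q are illustrative choices, not prescriptions from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p):
    """Symmetric Bernoulli perturbations: +/-1, each with probability 1/2."""
    return rng.choice([-1.0, 1.0], size=p)

def spsa(y, theta0, a=0.5, c=0.1, num_iter=2000, q=1):
    """Recursion (1) with the SP gradient approximation and optional
    gradient averaging (q approximations averaged at each iteration)."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(num_iter):
        ak = a / (k + 1)                 # illustrative decaying gain a_k
        ck = c / (k + 1) ** (1.0 / 6.0)  # illustrative difference interval c_k
        ghat = np.mean([sp_gradient(y, theta, ck, bernoulli) for _ in range(q)],
                       axis=0)           # gradient averaging when q > 1
        theta = theta - ak * ghat
    return theta

# Example: noisy quadratic loss with p = 10, minimized at theta* = 0
y = lambda t: float(t @ t) + rng.normal(scale=0.1)
print(spsa(y, theta0=np.ones(10), q=2))  # estimate close to the zero vector
```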
3. LIST OF KEY DEVELOPMENTS AND ASSOCIATED REFERENCES

(Each entry below lists: reference; no. of function measurements y(·) per iteration; whether measurement noise in y(·) is considered; comments.)

Kiefer and Wolfowitz (1952). Measurements per iteration: 2. Noise considered: Yes. First FDSA algorithm. Limited to scalar setting. Convergence in probability shown.

Blum (1954). Measurements per iteration: p+1. Noise considered: Yes. Multivariate extension of Kiefer and Wolfowitz (1952) FDSA. One-sided finite-difference gradient approximation. Shows almost sure (a.s.) convergence.

Sacks (1958). Measurements per iteration: 2p. Noise considered: Yes. Shows asymptotic normality of multivariate two-sided FDSA method. Normality result useful in quantifying accuracy of SA estimate.

Ermoliev (1969). Measurements per iteration: 2. Noise considered: Yes. Apparently first paper to consider a special case of RDSA-type algorithm: one-sided RDSA form with uniformly distributed perturbations. Includes analysis of bias in gradient approximation. No convergence analysis or theoretical or numerical comparisons with standard FDSA algorithm.

Fabian (1967, 1971). Measurements per iteration: ≥2p. Noise considered: Yes. Papers present several different methods for accelerating convergence of FDSA-type algorithms. Methods are based on taking additional measurements to explore the loss function surface in greater detail. The 1971 paper discusses a stochastic analogue to second-order algorithms of generic Newton-Raphson form (this algorithm uses O(p²) measurements of y(·)).

Polyak and Tsypkin (1973). Measurements per iteration: depends on algorithm. Noise considered: Yes. Performs general analysis of FDSA- and RDSA-type algorithms, including a demonstration of a.s. convergence.

Kushner and Clark (1978). Measurements per iteration: 2. Noise considered: Yes. Considers two-sided RDSA form with spherically uniform distributed perturbations. Theoretical analysis shows no improvement over FDSA; this finding appears to result from an error in the choice of RD perturbation distribution, as discussed in Chin (1993).
Ermoliev (1983). Measurements per iteration: 2. Noise considered: Yes. Extends Ermoliev (1969) to include constraints; special treatment for convex L(θ). No convergence theory or comparative numerical analysis.

Spall (1987). Measurements per iteration: 2. Noise considered: No. Introduces SPSA (two-sided) form. Apparently first paper to consider general perturbation distributions (vs. uniform distributions for RDSA above). Analysis of bias in gradient approximation; numerical study for special case of symmetric Bernoulli (random binary) perturbations shows performance superior to FDSA. No theoretical convergence analysis.

Spall (1988). Measurements per iteration: 2q for any q ≥ 1. Noise considered: Yes. Extends Spall (1987) to include measurement noise; also proves a.s. convergence of the SPSA algorithm; numerical analysis of potential benefits of gradient averaging (q > 1).

Polyak and Tsybakov (1990, 1992). Measurements per iteration: 1 or 2. Noise considered: Yes. Papers present approach similar to RDSA based on kernel functions. Show a.s. convergence in general noise settings.

Styblinski and Tang (1990); Chin (1994). Measurements per iteration: 2q for any q ≥ 1. Noise considered: No. Styblinski and Tang use modified versions of one- and two-sided RDSA algorithms for global optimization. Considers Gaussian- and Cauchy-distributed perturbations. Both across-iteration gradient smoothing and gradient averaging (q > 1) are considered. Extensive numerical analysis, including demonstration of superiority of RDSA to simulated annealing; no theoretical convergence analysis. Chin substitutes SPSA for RDSA in the algorithm of Styblinski and Tang and numerically illustrates superior performance.

Spall (1992). Measurements per iteration: 2q for any q ≥ 1. Noise considered: Yes. Extends SPSA theory in Spall (1987, 1988) to include asymptotic normality (à la Sacks (1958)). First paper to show a theoretical advantage of two-measurement approaches (SPSA in particular, which includes RDSA with symmetric Bernoulli perturbations as a special case) over the classical FDSA approach. Also includes theoretical analysis of when gradient averaging (q > 1) is beneficial, and extensive numerical analysis.
Spall and Cristion (1992, 1994a,b). Measurements per iteration: 2q for any q ≥ 1. Noise considered: Yes. Papers show use of SPSA in closed-loop control problems (where, e.g., the loss function L(θ) changes over time). Allows for optimal control without constructing a model of the system dynamics. Convergence (a.s.) to optimal controller shown under certain conditions. Across-iteration gradient smoothing considered in 1994a paper. Numerical analysis and comparison with FDSA for closed-loop estimation. Considers polynomial and neural net examples where p ranges from 70 to 400.

Chin (1993). Measurements per iteration: depends on algorithm. Noise considered: Yes. Extends the theoretical and numerical comparison of SPSA and FDSA in Spall (1992) to include RDSA. Shows theoretical and numerical superiority of SPSA for general perturbation distributions.

Yakowitz (1993). Measurements per iteration: 2p. Noise considered: Yes. Alternate global optimization approach (vs. Styblinski and Tang (1990) and Chin (1994)) using two-sided FDSA algorithm. Shows both a.s. convergence and asymptotic normality. Extensions to RDSA and/or SPSA seem feasible.

Cauwenberghs (1993, 1994); Alspector, et al. (1993). Measurements per iteration: 2q for any q ≥ 1. Noise considered: No. Focus on constant-gain (a_k = a) implementation of SPSA/RDSA algorithms with symmetric Bernoulli perturbation distribution (SPSA and RDSA equivalent in this case). Both open-loop identification and closed-loop control problems considered. Techniques for hardware implementation in feed-forward and recurrent neural networks.

Spall (1994). Measurements per iteration: 3q for any q ≥ 1. Noise considered: Yes. Extends SPSA to include second-order effects for purposes of algorithm acceleration (in the spirit of Fabian (1971) above). Estimates both the gradient and the inverse Hessian at each iteration (with the number of measurements independent of p) to produce an SA analogue of the deterministic Newton-Raphson algorithm.
REFERENCES
Alspector, J., R. Meir, B. Yuhas, and A. Jayakumar. 1993. A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks. Advances in Neural Information Processing Systems 5:836-844. San Mateo, California: Morgan Kaufmann.

Bazaraa, M. S. and C. M. Shetty. 1979. Nonlinear Programming. New York: Wiley.

Blum, J. R. 1954. Multidimensional Stochastic Approximation Methods. Annals of Mathematical Statistics 25:737-744.

Cauwenberghs, G. 1993. A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization. Advances in Neural Information Processing Systems 5:244-251. San Mateo, California: Morgan Kaufmann.

Cauwenberghs, G. 1994. Analog VLSI Autonomous Systems for Learning and Optimization. Ph.D. dissertation, California Institute of Technology.

Chin, D. C. 1993. Performance of Several Stochastic Approximation Algorithms in the Multivariate Kiefer-Wolfowitz Setting. Proceedings of the 25th Symposium on the Interface: Computing Science and Statistics, 289-295.

Chin, D. C. 1994. A More Efficient Global Optimization Algorithm Based on Styblinski and Tang. Neural Networks 7:573-574.

Ermoliev, Y. 1969. On the Method of Generalized Stochastic Gradients and Quasi-Fejer Sequences. Cybernetics 5:208-220.

Ermoliev, Y. 1983. Stochastic Quasigradient Methods and their Application to System Optimization. Stochastics 9:1-36.

Fabian, V. 1967. Stochastic Approximation of Minima with Improved Asymptotic Speed. Annals of Mathematical Statistics 38:191-200.

Fabian, V. 1971. Stochastic Approximation. In Optimizing Methods in Statistics, ed. J. S. Rustagi, 439-470. New York: Academic Press.

Fu, M. C. 1994. Optimization via Simulation: A Review. Annals of Operations Research, to appear.

Glasserman, P. 1991. Gradient Estimation via Perturbation Analysis. Boston: Kluwer.

Kiefer, J. and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function. Annals of Mathematical Statistics 23:462-466.

Kushner, H. J. and D. S. Clark. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

L'Ecuyer, P. 1991. An Overview of Derivative Estimation. In Proceedings of the Winter Simulation Conference, ed. B. L. Nelson, W. D. Kelton, and G. M. Clark, 207-217.

Polyak, B. T. and Y. Z. Tsypkin. 1973. Pseudogradient Adaptation and Training Algorithms. Automation and Remote Control 34:377-397.

Polyak, B. T. and A. B. Tsybakov. 1990. Optimal Rates of Convergence of Search Algorithms for the Global Stochastic Optimization. Problemy Peredachi Informatsii 26:45-53; English transl. in Problems of Information Transmission 26:126-133, 1990.

Polyak, B. T. and A. B. Tsybakov. 1992. On Stochastic Approximation with Arbitrary Noise (the K-W case). Advances in Soviet Mathematics 12:107-113.

Robbins, H. and S. Monro. 1951. A Stochastic Approximation Method. Annals of Mathematical Statistics 22:400-407.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing 1:318-362. Cambridge: MIT Press.

Sacks, J. 1958. Asymptotic Distribution of Stochastic Approximation Procedures. Annals of Mathematical Statistics 29:373-405.

Spall, J. C. 1987. A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates. Proceedings of the American Control Conference, 1161-1167.

Spall, J. C. 1988. A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer-Wolfowitz Setting. Proceedings of the IEEE Conference on Decision and Control, 1544-1548.

Spall, J. C. 1992. Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control 37:332-341.

Spall, J. C. 1994. A Second Order Stochastic Approximation Algorithm Using Only Function Measurements. Proceedings of the IEEE Conference on Decision and Control, to appear.

Spall, J. C. and J. A. Cristion. 1992. Direct Adaptive Control of Nonlinear Systems Using Neural Networks and Stochastic Approximation. Proceedings of the IEEE Conference on Decision and Control, 878-883.

Spall, J. C. and J. A. Cristion. 1994a. Nonlinear Adaptive Control Using Neural Networks: Estimation with a Smoothed Form of Simultaneous Perturbation Gradient Approximation. Statistica Sinica 4:1-27.

Spall, J. C. and J. A. Cristion. 1994b. Model-Free Control of General Discrete-Time Systems, submitted (preliminary form in Proceedings of the IEEE Conference on Decision and Control, 1993, 2792-2797).

Styblinski, M. A. and T. S. Tang. 1990. Experiments in Nonconvex Optimization: Stochastic Approximation with Function Smoothing and Simulated Annealing. Neural Networks 3:467-483.

Yakowitz, S. 1993. A Globally Convergent Stochastic Approximation. SIAM Journal on Control and Optimization 31:30-40.

AUTHOR BIOGRAPHY

JAMES C. SPALL has been with the Johns Hopkins University Applied Physics Laboratory since 1983, where he is a project leader for several research efforts focusing on problems in such areas as statistical modeling and control. For the year 1990, he received the R. W. Hart Prize as principal investigator of the most outstanding Independent Research and Development project at JHU/APL. In 1991, he was appointed to the Principal Professional Staff of the laboratory. Dr. Spall has published numerous research papers in the areas of statistics and engineering, including articles on the subjects of optimization, parameter estimation, Kalman filtering, and neural networks. He also served as editor and coauthor of the book Bayesian Analysis of Time Series and Dynamic Models. He is a member of IEEE, the American Statistical Association, Sigma Xi, and a fellow of the engineering honor society Tau Beta Pi.