Mirrored Orthogonal Sampling with Pairwise Selection in Evolution Strategies

Hao Wang, Michael Emmerich, Thomas Bäck
LIACS, Leiden University, Niels Bohrweg 1, Leiden, The Netherlands
h.wang@liacs.leidenuniv.nl, m.t.m.emmerich@liacs.leidenuniv.nl, t.h.w.baeck@liacs.leidenuniv.nl
ABSTRACT
In this paper, an improvement of the mirrored sampling method, called mirrored orthogonal sampling, is proposed. Its convergence rates on the sphere function are estimated, and the method is applied to the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). The resulting algorithm, termed (µ/µw, λom)-CMA-ES, is benchmarked on the Black-Box Optimization Benchmarking (BBOB) suite. The newly proposed technique is found to outperform both the standard (µ/µw, λ)-CMA-ES and its mirrored variant.

Categories and Subject Descriptors
I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search—heuristic methods; G.1.6 [Numerical Analysis]: Optimization—global optimization, unconstrained optimization

General Terms
Algorithms

Keywords
Evolution strategies, derandomization, orthogonal sampling, mirrored sampling, empirical study

1. INTRODUCTION
The recently proposed mirrored sampling technique is a simple and effective derandomized sampling method for evolution strategies, and its acceleration of convergence has been proven theoretically [1, 5]. The purpose of this paper is to introduce an improvement of the mirrored sampling technique, named mirrored orthogonal sampling, which generates more evenly distributed samples.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'14 March 24-28, 2014, Gyeongju, Korea. Copyright 2014 ACM 978-1-4503-2469-4/14/03 ...$15.00.

2. MIRRORED SAMPLING
The mirrored sampling technique was introduced by Auger et al. [5]. Instead of generating λ independent and identically distributed (i.i.d.) search points, only half of the i.i.d. samples are created, namely zi ∼ N(0, σ²C), 1 ≤ i ≤ λ/2, where σ is the global step-size and C is the covariance matrix. Each mutation vector zi is used to produce two offspring: the usual one x2i−1 = m + zi and the mirrored offspring x2i = m − zi. These two offspring are symmetric, or mirrored, with respect to the parental point m. The mirrored sampling method is described in Algorithm 1. Following the notation in [5], we denote the ES algorithm using mirrored sampling as (µ +, λm)-ES.

Algorithm 1 mirrored-sampling(m, σ, C, λ)
1: B, D ← eigen-decomposition(C)
2: for i = 1 → λ/2 do
3:   zi ← N(0, I)
4:   x2i−1 ← m + σBDzi
5:   x2i ← m − σBDzi
6: end for

3. PROPOSED ALGORITHM

3.1 Mirrored Orthogonal Sampling
It is proposed here to first generate half of the mutations as random orthogonal samples. Specifically, each pair of mutation vectors is ensured to be orthogonal while their directions remain random. The simplest way to realize orthogonal sampling is to orthonormalize a collection of i.i.d. Gaussian vectors; the Gram-Schmidt process [4] is chosen as the orthonormalization procedure. After orthonormalization, the lengths of the samples are restored to their values before the process. Note that the maximal number of mutually orthogonal samples equals the dimensionality N. If λ/2 > N, the remaining λ/2 − N samples are created as i.i.d. samples. The remaining half of the mutations is then created by mirroring. The resulting sampling algorithm is called mirrored orthogonal sampling and denoted as (µ +, λom)-ES here. The detailed procedure is described in Algorithm 2.
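The orthonormalization-with-length-restoration step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; it uses NumPy's QR factorization in place of classical Gram-Schmidt (the spans produced are the same, up to signs of the directions):

```python
import numpy as np

def orthogonalize_keep_lengths(S):
    """Orthogonalize the rows of S (p <= N Gaussian samples) while
    keeping each row's original Euclidean length, as in Section 3.1."""
    norms = np.linalg.norm(S, axis=1)   # lengths before orthonormalization
    Q, _ = np.linalg.qr(S.T)            # columns of Q: orthonormal directions
    return Q.T * norms[:, None]         # restore the original lengths
```

Each returned vector keeps the chi-distributed length of its original Gaussian sample, so only the directions are derandomized.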
3.2 Recombination and Pairwise Selection
When working with weighted recombination, mirroring-related methods cause an undesired reduction of the variance of the recombined mutation, leading to an undesired decrease of step-sizes [2, 5]. To counteract this effect, the pairwise selection heuristic introduced in [2] is adopted here, in which only the better one of each mirrored pair is eligible for recombination.

Algorithm 2 mirrored-orthogonal(m, σ, C, λ)
1: B, D ← eigen-decomposition(C)
2: p ← λ/2
3: for i = 1 → p do
4:   si ← N(0, I)
5: end for
6: if p ≤ N then
7:   [s′1, …, s′p] ← gram-schmidt(s1, …, sp)
8:   zi ← ‖si‖ · s′i, 1 ≤ i ≤ p
9: else
10:   [s′1, …, s′N] ← gram-schmidt(s1, …, sN)
11:   zi ← ‖si‖ · s′i, 1 ≤ i ≤ N
12:   [zN+1, …, zp] ← [sN+1, …, sp]
13: end if
14: for i = 1 → p do
15:   x2i−1 ← m + σBDzi
16:   x2i ← m − σBDzi
17: end for
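Putting the pieces together, Algorithm 2 and the pairwise selection of Section 3.2 can be sketched in NumPy as below. This is a sketch under stated assumptions, not the paper's code: function names are mine, and QR factorization stands in for the Gram-Schmidt process.

```python
import numpy as np

def mirrored_orthogonal_sampling(m, sigma, C, lam, rng):
    """Algorithm 2 (sketch): orthogonalize up to N of the lam/2 directions,
    restore their original lengths, then emit each direction with its mirror."""
    N = len(m)
    p = lam // 2
    eigvals, B = np.linalg.eigh(C)          # C = B diag(eigvals) B^T
    BD = B @ np.diag(np.sqrt(eigvals))      # maps N(0, I) samples to N(0, C)
    S = rng.standard_normal((p, N))
    k = min(p, N)                           # at most N mutually orthogonal samples
    Q, _ = np.linalg.qr(S[:k].T)            # QR in place of Gram-Schmidt
    Z = S.copy()                            # samples beyond N stay i.i.d.
    Z[:k] = Q.T * np.linalg.norm(S[:k], axis=1)[:, None]  # restore lengths
    X = np.empty((2 * p, N))
    X[0::2] = m + sigma * Z @ BD.T          # usual offspring x_{2i-1}
    X[1::2] = m - sigma * Z @ BD.T          # mirrored offspring x_{2i}
    return X

def pairwise_best(X, f):
    """Pairwise selection (Section 3.2): keep only the better of each mirrored pair."""
    return np.array([x1 if f(x1) <= f(x2) else x2
                     for x1, x2 in zip(X[0::2], X[1::2])])
```

Note that only the winner of each mirrored pair is passed to weighted recombination, which avoids the step-size bias discussed above.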
3.3 Application to the CMA-ES Algorithm
We apply the mirrored orthogonal sampling technique to the well-known CMA-ES (Covariance Matrix Adaptation Evolution Strategy) [6]. The damping factor dσ, which controls the update speed of the step-size, should be re-tuned for the new sampling method, because its default setting was developed for i.i.d. Gaussian mutations. Thus, after tuning on the sphere function (details are not shown here), we modified it to
dσ = 1.49 − 0.6314 · (√((µeff + 0.1572)/(N + 1.647)) + 0.869) + cσ.
See [6] for the original setting of dσ.
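For concreteness, the modified damping factor can be computed as below. Caveat: the placement of the square root is reconstructed from the paper's garbled formula and should be checked against the original source; the function name is mine.

```python
import math

def damping_factor(mu_eff, N, c_sigma):
    """Modified step-size damping d_sigma from Section 3.3 (reconstructed).
    mu_eff: variance-effective selection mass; N: dimension; c_sigma: cumulation rate."""
    return 1.49 - 0.6314 * (math.sqrt((mu_eff + 0.1572) / (N + 1.647)) + 0.869) + c_sigma
```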
3.4 Empirical Convergence Rates
Given the starting point X(0), the search point X(k) of generation k, and the global optimum X∗, the convergence rate of evolution strategies can be measured as [3]
(1/Tk) · ln( ‖X(k) − X∗‖ / ‖X(0) − X∗‖ ),
where Tk is the total number of function evaluations performed until generation k. On the sphere function, the convergence of (µ/µw, λom)-CMA-ES and other comparable ES variants is illustrated in Figures 1a and 1b. Figure 1a shows the details for 10-D. The (1+1)-ES uses 1/5-success-rule step-size control, while the (1+1)-ES "optimal" uses the optimal σ setting. "dσ = optimal" denotes (µ/µw, λom)-CMA-ES using the optimal dσ tuning on the sphere function. The mirrored orthogonal CMA-ES is significantly faster than the mirrored version, and this advantage persists in higher dimensions.
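The convergence-rate measure above translates directly into code; negative values indicate progress toward the optimum, and more negative is faster (function name is mine):

```python
import math

def convergence_rate(dist_k, dist_0, T_k):
    """Empirical convergence rate (1/T_k) * ln(||X_k - X*|| / ||X_0 - X*||).
    dist_k, dist_0: distances to the optimum at generation k and at the start;
    T_k: total function evaluations spent up to generation k."""
    return math.log(dist_k / dist_0) / T_k
```

For example, reducing the distance from 1 to 1e-8 within 1000 evaluations gives a rate of ln(1e-8)/1000 ≈ −0.0184.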
4. EXPERIMENTAL SETTING
The standard experimental procedure of BBOB is adopted. (µ/µw, λom)-CMA-ES, (µ/µw, λm)-CMA-ES, and (µ/µw, λ)-CMA-ES are benchmarked on BBOB. The initial global step-size σ is set to 1. The maximum number of function evaluations is set to 10^4 × N. The initial solution vector is uniformly distributed in the hyperbox [−4, 4]^N. The dimensions tested in the experiment are N ∈ {2, 3, 5, 10, 20, 40}.
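The experimental setup amounts to a few lines of initialization; a sketch (variable names are mine, and the seed is arbitrary):

```python
import numpy as np

N = 10                                  # problem dimension, N in {2, 3, 5, 10, 20, 40}
rng = np.random.default_rng(42)
x0 = rng.uniform(-4.0, 4.0, size=N)     # initial solution, uniform in [-4, 4]^N
sigma0 = 1.0                            # initial global step-size
budget = 10**4 * N                      # maximum number of function evaluations
```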
5. RESULTS
Small population. The results under the default population size setting (proportional to the logarithm of the dimensionality) are shown in Figures 1c and 1d. (µ/µw, λom)-CMA-ES is shown as dashed curves, while (µ/µw, λm)-CMA-ES is shown as solid curves. When N = 5, the performance leap of mirrored orthogonal sampling is significant. When N = 20, a small advantage remains.
Large population. When the population size grows linearly with the dimensionality, the comparison between mirrored orthogonal sampling and its mirrored counterpart is illustrated in Figures 1e and 1f. The improvement brought by mirrored orthogonal sampling is more significant than in the small-population case in 5-D, and it still holds in 20-D.
6. CONCLUSION
Mirrored orthogonal sampling improves upon mirrored sampling both theoretically and experimentally. On the sphere function, mirrored orthogonal sampling converges much faster than plain mirroring and only slightly slower than the (1+1)-ES with the 1/5 rule. The BBOB comparisons also reveal its advantages over the mirrored counterpart and the standard (µ/µw, λ)-CMA-ES, although these advantages appear to diminish in higher dimensions. The better BBOB results for the large population setting suggest that mirrored orthogonal sampling may be particularly suitable for increasing the exploration effect of a large population.
7. REFERENCES
[1] A. Auger, D. Brockhoff, and N. Hansen. Analyzing the Impact of Mirrored Sampling and Sequential Selection in Elitist Evolution Strategies. In H.-G. Beyer and W. B. Langdon, editors, Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA '11, pages 127–138. ACM, New York, 2011.
[2] A. Auger, D. Brockhoff, and N. Hansen. Mirrored Sampling in Evolution Strategies with Weighted Recombination. In N. Krasnogor and P. L. Lanzi, editors, Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 861–868. ACM, New York, 2011.
[3] A. Auger and N. Hansen. Theory of Evolution Strategies: A New Perspective. In A. Auger and B. Doerr, editors, Theory of Randomized Search Heuristics: Foundations and Recent Developments, chapter 10, pages 289–325. World Scientific Publishing Company, 2011.
[4] Å. Björck. Numerics of Gram-Schmidt Orthogonalization. Linear Algebra and its Applications, 197–198:297–316, 1994.
[5] D. Brockhoff, A. Auger, N. Hansen, D. V. Arnold, and T. Hohm. Mirrored Sampling and Sequential Selection for Evolution Strategies. In R. Schaefer, C. Cotta, J. Kołodziej, and G. Rudolph, editors, Proceedings of the 11th International Conference on Parallel Problem Solving from Nature: Part I, PPSN '10, pages 11–21. Springer-Verlag, Berlin, 2010.
[6] N. Hansen and A. Ostermeier. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2):159–195, June 2001.
[Figure 1 appears here. Panels: (a) convergence on the sphere function in 10-D, distance to the optimum vs. function evaluations; (b) convergence rate × dimension vs. dimension; (c)–(f) ECDFs of run lengths on f1–24, proportion of trials vs. log10 of FEvals/DIM. Legends in (a) and (b): (µ/µw, λ)-CMA-ES, (µ/µw, λm)-CMA-ES, (µ/µw, λom)-CMA-ES, dσ = optimal, (1+1)-ES, (1+1)-ES optimal.]
Figure 1: (a): Comparison of empirical convergence in 10-D, measured by the distance to the optimum. (b): Comparison of empirical convergence rates against the dimensionality. (c) and (d): Empirical cumulative distribution functions (ECDFs) of run lengths (the number of function evaluations divided by dimension) for (µ/µw, λom)-CMA-ES (solid lines) and (µ/µw, λm)-CMA-ES (dashed lines) needed to reach a target value fopt + ∆f with ∆f = 10^k, where k ∈ {1, −1, −4, −8} is given by the first value in the legend. The default population size setting is used here. The vertical black line indicates the maximum normalized run length. Light beige lines show the ECDF of run lengths for target value ∆f = 10^−8 of all algorithms benchmarked during BBOB-2009. The comparisons in two dimensions, N = 5 (c) and N = 20 (d), are shown; the ECDFs are estimated using the small population setting. (e) and (f): The same comparisons as (c) and (d), except that the large population setting is used; (e): N = 5, (f): N = 20.