
On the Design of Constraint Covariance Matrix Self-Adaptation Evolution Strategies Including a Cardinality Constraint Hans-Georg Beyer and Steffen Finck

Abstract—This paper describes the algorithm engineering of a covariance matrix self-adaptation Evolution Strategy (ES) for solving a mixed linear/nonlinear constrained optimization problem arising in portfolio optimization. While the feasible solution space is defined by the (probabilistic) simplex, the nonlinearity comes in through a cardinality constraint bounding the number of linear inequalities that may be violated. This gives rise to a nonconvex optimization problem. The design is based on the CMSA-ES and relies on three specific techniques to fulfill the different constraints. The resulting algorithm is then thoroughly tested on a data set derived from time series data of the Dow Jones Index.

Index Terms—Constrained Optimization, Covariance Matrix Self-Adaptation Evolution Strategy, Non-convex Optimization, Portfolio Optimization

Both authors are with the Vorarlberg University of Applied Sciences, Dornbirn, A-6850, Austria. E-mail: {Hans-Georg.Beyer, Steffen.Finck}@fhv.at.

I. INTRODUCTION

Covariance Matrix Adaptation (CMA) Evolution Strategies (ES) currently belong to the best performing direct search strategies for real-valued black-box optimization. This statement is based on the results of the 2009 and 2010 GECCO workshops on Black Box Optimization Benchmarking (BBOB) [1]. According to the design of this benchmark [2], the test problems are basically unconstrained. It might come as a surprise, however, that there are virtually no published results concerning the performance of CMA-ES on constrained optimization problems. For example, the CEC 2010 Competition on Constrained Real-Parameter Optimization (Barcelona, Spain, July 18–23, 2010) [3] does not contain any CMA-ES related algorithm. Instead, the currently best performing algorithms are from the Differential Evolution (DE) class, as e.g. [4]. Vetting publicly available implementations of CMA-ES, one finds that constraint handling has been designed (and published) so far for box constraints only [5]. Even this seemingly simple case appears to be a non-trivial one, requiring a penalty approach with adaptive weights calculated using information from the covariance matrix [5]. While one cannot conclude from the non-appearance of CMA-ES in constrained optimization that CMA-ES is not well suited for constrained optimization, there are clues that there are non-trivial problems concerning the step-size control when the optimizer is located on the boundary of the feasibility domain. There is also theoretical evidence [6], [7] that premature convergence can happen due to inadequate mutation strength control. This is especially observed when trying penalty-based approaches to

tackle linear inequality constraints. However, this can also happen in repair-based techniques, where one has the special problem of finding replacements for the cumulative step-size adaptation (CSA) rule [8]. The solution provided for the box constraints in [5] is a Baldwin-type ad hoc approach: the offspring's genome is allowed to violate the box constraints while the fitness is calculated at the edge of feasibility, enhanced with an adaptive penalty value depending on the quadratic distance of the evaluation point to the offspring. Using this trick, the standard CMA principles including the CSA adaptation can be applied to the selected offspring genomes. Apart from possible technical problems regarding the projection of infeasible solutions onto the feasibility boundary¹, this approach can in principle be applied to other inequality constraints as long as these constraints are directly related to the search space.

However, there are also types of constraints that are related to functionals of inequalities. An often recurring task in portfolio optimization is to ensure a minimum number of so-called value-at-risk constraints [9], forming in turn a cardinality constraint. From a mathematical point of view, one has to face a mixed integer problem, and finding the right combination of inequality constraints is NP-complete [10]. There seems to be no natural way to combine the box-violation penalty and the cardinality requirements in the framework of [5]. Furthermore, the optimization problem that gave rise to the investigations and developments presented here additionally contains an equality constraint. Even that case has not been considered in the ES literature and needs a treatment in conjunction with the box constraints. The aim of the paper is to provide a theoretically motivated approach to the design of covariance matrix adaptation ES in constrained search spaces.
The main objective is rather to present the algorithmic ideas leading to the final design than to only provide a ready-to-use algorithm for the specific portfolio optimization problem. That is why we do not present extended parameter studies. The algorithm design will be based on the arguably simplest covariance matrix self-adaptation strategy (CMSA) proposed in [11]. The CMSA-ES uses self-adaptation (SA) instead of the commonly used cumulative step-size adaptation (CSA) to control the mutation strength. In the next section, the optimization problem is introduced. Section III starts with a short review of the CMSA-ES.

¹Actually, projection itself can be regarded as a constrained optimization problem that can only be solved efficiently in special cases, see also Section III-B3.

Afterwards the CMSA-ES is


modified in Section III-B to handle the different kinds of constraints. Section IV provides experiments based on data taken from the Dow Jones Index. Low-dimensional cases of the search space are considered first in order to visually inspect the fitness landscape and the solutions generated. Finally the high-dimensional case is considered using a meta strategy that is based on restarts and exponentially increasing population sizes. The paper closes with a short summary and outlook.

II. THE OPTIMIZATION PROBLEM

The problem that triggered the development of the constrained CMSA-ES arises in the field of scenario-based portfolio optimization in finance. Let Y be an S × K matrix with elements Y_sk ∈ R representing random asset returns of K assets and S scenarios (given by time series data). It is the goal to find the K asset weights x_k that maximize the total expected return, Eq. (1). Of course, the weights must sum up to 1, being the equality constraint, Eq. (2), and the weights are confined to the box [0, 1]^K, Eq. (3). Up to this point, the problem described is a simple linear optimization problem. The nonlinearity (and non-convexity) comes in by demanding that at most a (1 − α) percentage of scenarios incur a loss bounded by κ. That is, at least ⌊αS⌋ scenarios should produce a return greater than or equal to κ. In order to model this aspect of the optimization problem, let #(A) denote the cardinality of a set A. Let α denote the confidence level, 0 < α < 1, expressing the percentage of inequality constraints fulfilled in a specific optimization problem. The constant κ ∈ R is also referred to as value-at-risk. Equation (5) counts the number η of inequalities with a return being greater than or equal to the value-at-risk κ. Inequality (4) demands that this number should be greater than or equal to the α percentage of scenarios. Thus, one obtains the following optimization problem for the decision variables (asset weights) x ∈ R^K, x = (x_1, …, x_K)^T:

    f(\mathbf{x}) := \sum_{s=1}^{S} \sum_{k=1}^{K} Y_{sk}\, x_k \;\longrightarrow\; \max_{\mathbf{x}}    (1)

    \text{s.t.} \quad \sum_{k=1}^{K} x_k = 1,    (2)

    \forall k = 1, \ldots, K: \; 0 \le x_k \le 1,    (3)

and

    \eta(\mathbf{x}) \ge \lfloor \alpha S \rfloor,    (4)

where

    \eta(\mathbf{x}) := \#\!\left( \left\{ s \in \{1, \ldots, S\} \;\middle|\; \sum_{k=1}^{K} Y_{sk}\, x_k \ge \kappa \right\} \right).    (5)

As already mentioned, considering only the first three equations (1)–(3), the maximization problem is a simple linear one. Its optimizer, denoted by x*, is (usually) from the set of the unit basis vectors

    \mathbf{e}_i := (\ldots, 0, 1, 0, \ldots)^{\mathrm{T}}    (6)

being the vertices of the (probabilistic) simplex X. Actually, it is e_{k*} with

    k^* = \arg\max_k \sum_{s=1}^{S} Y_{sk}    (7)

(or a manifold spanned by e_{k*} vectors if there is more than one k*) and the maximum is obtained as

    f^* = \max_{\mathbf{x} \in X} f(\mathbf{x}) = \sum_{s=1}^{S} Y_{sk^*}.    (8)
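As an illustration, the counting function η of Eq. (5) and the cardinality test of Eq. (4) can be sketched in plain Python; the return matrix Y below is made-up toy data, not from the paper's data set:

```python
import math

def eta(x, Y, kappa):
    """Eq. (5): number of scenarios s with sum_k Y[s][k]*x[k] >= kappa."""
    return sum(1 for row in Y if sum(y * w for y, w in zip(row, x)) >= kappa)

def is_feasible(x, Y, kappa, alpha):
    """Cardinality constraint (4): at least floor(alpha*S) scenarios comply."""
    return eta(x, Y, kappa) >= math.floor(alpha * len(Y))

# Toy example: S = 4 scenarios, K = 2 assets, equal weights.
Y = [[0.02, -0.01], [0.01, 0.03], [-0.05, 0.00], [0.04, 0.02]]
x = [0.5, 0.5]
print(eta(x, Y, 0.0), is_feasible(x, Y, 0.0, 0.75))  # -> 3 True
```

Portfolio returns per scenario are 0.005, 0.02, −0.025, 0.03, so three of the four scenarios reach the value-at-risk κ = 0, which meets the ⌊0.75 · 4⌋ = 3 threshold.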

Taking Eqs. (4), (5) additionally into account changes the problem class of the optimization problem. Equation (4) requires that a feasible solution x should fulfill at least ⌊αS⌋ of the S inequalities \sum_{k=1}^{K} Y_{sk} x_k \ge \kappa, s = 1, …, S. While for the linear problem (1)–(3) the optimizer lies on the boundary of the feasible solution set X (usually at one of the vertices of the simplex defined by (2), (3)), now the optimizer can also be an interior point of the simplex X. For later use, let us introduce the violation count function ν(x) ∈ {0, …, S}:

    \nu(\mathbf{x}) := \begin{cases} S - \eta(\mathbf{x}), & \text{if } \eta(\mathbf{x}) < \lfloor \alpha S \rfloor, \\ 0, & \text{otherwise.} \end{cases}    (9)

Considering (5), function (9) calculates the number of inequalities for which \sum_{k=1}^{K} Y_{sk} x_k < \kappa, s = 1, …, S; however, it returns 0 if (4) is fulfilled.

III. A CMSA-ES FOR CONSTRAINED OPTIMIZATION (cCMSA-ES)

A. A Short Review of the Standard CMSA-ES

The acronym CMSA-ES stands for "Covariance Matrix Self-Adaptation Evolution Strategy" [11]. It is a simple variant of the so-called covariance matrix adaptation (CMA) strategies [8], [12] with a reduced number of parameters and theoretically derived time constants. Consider the unconstrained optimization problem f(x) → opt!, x ∈ R^N; the pseudocode of the CMSA-ES is presented in Fig. 1. In each generation, λ offspring are generated from µ parental individuals. The parent individuals for the next generation g + 1 are selected from the offspring of generation g using (µ, λ)-selection, i.e., the µ best offspring individuals are selected (w.r.t. their f-values) to be the parents of the next generation. Note, the pseudocode in Fig. 1 presents only the actions taking place within a generation. For a fully operating strategy, an outer generation loop must be added with appropriate termination conditions and initialization. In what follows, a short explanation of the meaning of the lines in the generation loop is presented.
In the Evolution Strategies (ES) considered here, an individual a is defined as a structure comprising the individual's x vector (objective vector to be optimized), the mutation direction s, the mutation strength σ, and the individual's objective function value f(x), referred to as the individual's "fitness". Selection acts on the individual as a whole and is based on the fitness. The offspring individual l, denoted by ã_l (note, offspring are usually marked with a tilde on top of the respective symbols), is generated by mutating the parental state (σ, x) as follows: At first, the parental mutation strength σ (which is actually a recombinant


(µ/µ_I, λ)-CMSA-ES (one generation cycle)

For l ← 1 To λ
    σ̃_l ← σ e^{τ N_l(0,1)}                      (L1)
    s̃_l ← √C N_l(0, I)                          (L2)
    x̃_l ← x + σ̃_l s̃_l                          (L3)
    f̃_l ← f(x̃_l)                                (L4)
    ã_l ← (f̃_l, x̃_l, σ̃_l, s̃_l)                (L5)
End
RankOffspringPopulation(ã_1, …, ã_λ)             (L6)
x ← ⟨x̃⟩                                         (L7)
σ ← ⟨σ̃⟩                                         (L8)
C ← (1 − 1/τ_c) C + (1/τ_c) ⟨s̃ s̃^T⟩            (L9)

Fig. 1. The generation loop of the CMSA-ES. Recombination, denoted by the "⟨·⟩" notation, is by mean value calculation using the µ best offspring individuals. The covariance matrix C is initially chosen as the identity matrix, i.e. √C = C = I. The strategy parameters are τ = 1/√(2N) and τ_c = 1 + N(N + 1)/(2µ).
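To make the loop concrete, here is a minimal Python sketch of one generation, with the covariance matrix frozen at C = I for brevity (the actual strategy also performs the rank-µ update of line (L9)); all names are illustrative, and a sphere minimization serves as the toy objective:

```python
import math
import random

def cmsa_generation(f, x, sigma, mu, lam):
    """One (mu/mu_I, lam) generation with sigma self-adaptation, C = I."""
    N = len(x)
    tau = 1.0 / math.sqrt(2.0 * N)
    offspring = []
    for _ in range(lam):
        # (L1): log-normal mutation of the parental mutation strength
        sig = sigma * math.exp(tau * random.gauss(0, 1))
        # (L2)-(L3): mutation direction (isotropic here) and offspring vector
        s = [random.gauss(0, 1) for _ in range(N)]
        y = [xi + sig * si for xi, si in zip(x, s)]
        offspring.append((f(y), y, sig))              # (L4)-(L5)
    offspring.sort(key=lambda a: a[0])                # (L6), minimization
    best = offspring[:mu]
    # (L7)-(L8): intermediate multirecombination (centroid calculations)
    x_new = [sum(a[1][i] for a in best) / mu for i in range(N)]
    sigma_new = sum(a[2] for a in best) / mu
    return x_new, sigma_new

random.seed(1)
sphere = lambda v: sum(t * t for t in v)
x, s = [1.0, 1.0, 1.0], 1.0
for _ in range(200):
    x, s = cmsa_generation(sphere, x, s, mu=3, lam=12)
print(f"f(x) after 200 generations: {sphere(x):.3e}")
```

Even without covariance adaptation, the self-adaptive σ control drives the strategy to the optimum of the sphere at a linear convergence rate.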

of the values of selected offspring of the previous generation) is multiplied by a log-normally distributed random number in (L1). In line (L2), a (correlated) N(0, C) normally distributed N-dimensional vector s̃_l is produced, being the offspring's mutation direction. In (L3), this direction is scaled with the offspring's mutation strength and added to the parental object parameter vector x. Thus, the offspring's objective vector x̃ is obtained, and its fitness is evaluated in (L4). The offspring structure is completed in (L5). After procreation of the λ offspring, the µ best offspring are selected and become parents of the new generation (L6). These parents are needed for the mean value calculations in (L7)–(L9). These operations are also referred to as intermediate multirecombination in ES terminology. Centroid calculation of the x̃ vectors of the selected offspring yields the new parental x state (L7). A similar mean value calculation yields the new parental mutation strength σ (L8). Finally, the update of the covariance matrix C takes place (L9). After that, the process starts anew until a termination criterion is fulfilled.

B. The CMSA-ES for a Combination of Constraints of Type (2)–(4)

Typical "naive" approaches to constrained optimization problems are often based on penalty functions. The problem with such approaches is the control of the strength of the penalties in conjunction with the objective function f(x). Evolutionary Algorithms offer a larger repertoire to deal with such problems [13]. The trivial approach is to treat infeasible offspring/mutations as lethal individuals. That is, one simply generates the offspring anew, hoping that chance will also produce feasible offspring. However, this can be a very time consuming process if the feasible region is small compared to

the unconstrained search space. Actually, the method will fail totally if the feasible region is of zero measure. This happens if one wants to use this approach for the equality constraint (2). The strength of the evolutionary approach is, however, that one is not confined to using a single approach. Notwithstanding, all approaches found in the literature that have been tested do not seem to work for the optimization problem considered here. The main problem observed is due to an exponential decrease of the mutation strength in cases where the evolutionary search has reached the boundary of the feasible region. This results in premature convergence in situations where the evolutionary search path should proceed on the surface of the simplex defined by (2) and (3). The problem is reminiscent of the premature convergence of CMA-ES on the sharp ridge [14]–[16]. However, the kludge used to prevent premature convergence of CMA-ES on the sharp ridge by artificially keeping the mutation strength σ at a reasonable level does not work due to the box constraints (3).

In order to treat the constraints (2)–(4), a three-fold approach has been developed:

1) The linear equality constraint (2) can be treated by performing mutations that live exactly in the subspace of R^K fulfilling (2). That is, concerning the equality constraint, the candidate solutions generated will always be feasible. Similar ideas have already been published in [17] (and in references provided there).

2) The box constraints defined by (3) are treated by a mutation repair strategy that performs a minimal but yet evolvable change of the mutation fulfilling (3). This also requires a modified covariance matrix update.

3) The nonlinear constraint (4) is treated within a cascaded selection process applied to a 2-dimensional fitness vector. One component of this vector represents the violation count, Eq. (9); the other component is the objective function f(x), Eq. (1).
The individual ranking is then based on a lexicographic order. That is, the treatment of the nonlinear constraint (4) is selection-based, using fitness information. In contrast to penalty function approaches, it rather resembles the multi-objective optimization paradigm. However, it is not multi-objective optimization where a Pareto front is searched for. We will consider these three parts in separate sections below. First, however, the global structure of the constraint CMSA-ES will be discussed.

1) Outline of the Constraint CMSA-ES Algorithm: The constraint CMSA-ES (cCMSA-ES) is depicted in Fig. 2. An individual a consists of a fitness tuple (comprising the goal function value f and the violation count ν), the vector of the decision variables x, the mutation strength σ, and a (K − 1)-dimensional vector representing the mutation direction (to be discussed below). Since comma selection does not guarantee convergence to optima, this strategy keeps track of the best individual observed so far, the best-so-far (bsf) individual a_bsf. It is initialized in line (C2). The strategy also keeps track of the evolution of the parental centroid, which is why an explicit generation counter is needed (C3). The evolution loop is processed as long as none of the criteria in (C18) is fulfilled.


(µ/µ_I, λ)-cCMSA-ES

Initialize µ, λ, x^(0), σ, τ, τ_c, G, g_stop, σ_stop, ǫ, C = I          (C1)
a_bsf ← (f(x^(0)), ν(x^(0)), x^(0), σ, 0)                               (C2)
g ← 0                                                                   (C3)
Repeat
    For l ← 1 To λ
        σ̃_l ← σ e^{τ N_l(0,1)}                                         (C4)
        z_l ← det(C)^{−1/(2(K−1))} √C N_l(0, I)                         (C5)
        s_l ← M z_l                                                     (C6)
        s̃_l ← FulfillBoxConstraints(x^(g), s_l, σ̃_l)                   (C7)
        x̃_l ← x^(g) + σ̃_l s̃_l                                         (C8)
        ν̃_l ← ν(x̃_l)                                                   (C9)
        f̃_l ← f(x̃_l)                                                   (C10)
        ã_l ← (f̃_l, ν̃_l, x̃_l, σ̃_l, z̃_l = M^− s̃_l)                  (C11)
        a_bsf ← ã_l if ã_l ≻ a_bsf, else a_bsf                          (C12)
    End
    RankOffspringPopulation(ã_1, …, ã_λ)                                (C13)
    g ← g + 1                                                           (C14)
    x^(g) ← ⟨x̃⟩                                                        (C15)
    σ ← ⟨σ̃⟩ if ⟨σ̃⟩ ≤ 1/√K, else 1/√K                                  (C16)
    C ← (1 − 1/τ_c) C + (1/τ_c) ⟨z̃ z̃^T⟩                               (C17)
Until g > g_stop ∨ σ < σ_stop ∨ ‖x^(g) − x^(g−G)‖ < ǫ                   (C18)

Fig. 2. Pseudocode of the constraint CMSA-ES with elite conservation in the "best-so-far" individual a_bsf.

These criteria are the maximal number of generations g_stop and the minimal mutation strength σ_stop. Additionally, a step length over G generations being less than ǫ in decision parameter space serves as a termination criterion. The latter is measured as the distance between the parental centroids at generations g and g − G.²

In the evolutionary loop, first the λ offspring are generated. For each offspring individual ã, the mutation strength σ̃ is generated from the parental σ by multiplicative log-normal mutation. The learning parameter τ is chosen according to the recommendation given in [11], i.e., τ = 1/√(2K). The lines (C5)–(C7) produce the correlated mutation directions, where (C6) ensures the validity of the equality constraint and (C7) ensures the fulfillment of the box constraints. In (C8), the search direction s̃ is scaled by the mutation strength and added to the parental state. Thus, the offspring's decision parameter vector is generated. The evaluation of its fitness is performed in (C9), determining the offspring's violation count, and in (C10), calculating the objective function value. In line (C11), the individual is finalized by calculating z̃ by back transformation of the individual's search direction.

Line (C12) does not belong to standard (µ, λ)-selection strategies. It incorporates elitist tracking. Note, this does not influence the dynamics of the ES; it only ensures the conservation of the globally best offspring generated so far. The underlying order relation "≻" will be discussed in Section III-B4.

After the procreation of the offspring population, the ranking according to the order relation "≻" is performed in (C13) and the generation counter is incremented in (C14). The new parental decision vector is obtained by intermediate recombination of the x vectors of the best µ offspring individuals. Geometrically, this is just a simple centroid calculation that takes place in (C15). In line (C16), the new parental mutation strength is calculated as the mean value of the individual σ values of the best µ offspring. However, the parental mutation strength is additionally bounded by 1/√K. Mutation strengths greater than this bound are not useful: the steps generated would already be so large that a single mutation has an expected length of about the diameter of the feasible search domain. Note, however, this bound is related to the constraints (2) and (3). If the rhs of (2) is changed from 1 to c, the bound should be changed to c/√K, respectively.

Finally, the covariance matrix update is performed in line (C17). It is a kind of rank-µ update with exponentially fading memory of time constant τ_c = 1 + 0.5(K − 1)K/µ.

²Of course, this criterion can only be applied after the first G generations.
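As a rough sketch, the offspring generation lines (C4)–(C8) can be written down with the simple M matrix of Eq. (16); C is kept at the identity here (so the determinant normalization in (C5) is trivially 1), and the box-constraint repair of (C7) is stubbed out, since both are treated separately:

```python
import math
import random

def mutate(x, sigma, tau):
    """Lines (C4)-(C8) with C = I; the (C7) repair step is omitted here."""
    K = len(x)
    sig = sigma * math.exp(tau * random.gauss(0, 1))   # (C4)
    z = [random.gauss(0, 1) for _ in range(K - 1)]     # (C5), C = I
    s = z + [-sum(z)]                                  # (C6): s = M z, Eq. (16)
    # (C7) would repair s if x + sig*s left the box [0,1]^K (omitted)
    y = [xi + sig * si for xi, si in zip(x, s)]        # (C8)
    return y, sig, s

random.seed(0)
x = [0.25, 0.25, 0.25, 0.25]          # center of the simplex, Eq. (18)
tau = 1.0 / math.sqrt(2 * 3)          # tau = 1/sqrt(2(K-1)) for K = 4
y, sig, s = mutate(x, 0.1, tau)
print(abs(sum(s)) < 1e-12, abs(sum(y) - 1.0) < 1e-12)
```

By construction, the components of s sum to zero, so the offspring stays on the hyperplane of the equality constraint (2) regardless of the sampled z.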
It is important to notice that, unlike in the standard CMA-ES, the back-transformed mutation direction is used for the update, see the next section.

2) Ensuring the Equality Constraint (2): According to (1), the decision vector (to be optimized) is denoted by x and lives in R^K. However, according to (2), evolution takes place in a (K − 1)-dimensional manifold. Actually, it is just a hyperplane embedded in R^K. Therefore, it is necessary to generate mutations on this hyperplane such that for each offspring x̃_l

    \sum_{k=1}^{K} (\tilde{\mathbf{x}}_l)_k = 1    (10)

holds. Due to line (C8), this transfers to the s̃ vector:

    \sum_{k=1}^{K} \left( (\mathbf{x}^{(g)})_k + \tilde{\sigma}_l (\tilde{\mathbf{s}}_l)_k \right) = 1.    (11)

Provided that \sum_{k=1}^{K} (\mathbf{x}^{(g)})_k = 1, Eq. (11) is fulfilled if

    \sum_{k=1}^{K} (\tilde{\mathbf{s}}_l)_k = 0.    (12)

Provided that the box constraints (3) are not violated, (C7) does not change the s vector, s̃_l = s_l. That is, due to (C6), condition (12) takes the form

    \sum_{k=1}^{K} (\mathbf{M} \mathbf{z}_l)_k = 0,    (13)

where z_l ∈ R^{K−1} lives in a (K − 1)-dimensional unconstrained space. By means of (C6), the mapping R^{K−1} → R^K is performed. A transformation matrix M is needed in (C6) that ensures

    \sum_{k=1}^{K} \sum_{i=1}^{K-1} (\mathbf{M})_{ki} (\mathbf{z}_l)_i = 0    (14)

for arbitrary z_l vectors. Exchanging the summation order, one obtains

    \sum_{i=1}^{K-1} \left( \sum_{k=1}^{K} (\mathbf{M})_{ki} \right) (\mathbf{z}_l)_i = 0.    (15)

Equation (15) can be fulfilled if the inner sum is always zero. That is, each column in the K × (K − 1) matrix M must sum up to zero. This can be easily and non-trivially fulfilled by stacking a (K − 1) × (K − 1) identity matrix on top of a (K − 1)-dimensional row vector with components all being −1. Using such an M matrix results in a computationally very simple line (C6):

    \forall k = 1, \ldots, K-1: \; s_k = z_k, \quad \text{and} \quad s_K = -\sum_{k=1}^{K-1} z_k.    (16)

In order to have this simple transformation, the parental state x must fulfill (2) in each generation. In particular, the initial parental state x^(0) must fulfill (2):

    \sum_{k=1}^{K} x_k^{(0)} = 1.    (17)

This can be easily accomplished by choosing

    \forall k = 1, \ldots, K: \; x_k^{(0)} = \frac{1}{K}.    (18)

This initialization chooses the center of the simplex as the starting point. When using the cCMSA-ES in a multi-start setting (which is recommended in multi-modal optimization), random initialization must be performed. Sampling K random numbers r_k from the uniform distribution U(0, 1) on the interval (0, 1), a possible random initialization reads

    x_k^{(0)} = \frac{r_k}{\sum_{k'=1}^{K} r_{k'}}.    (19)

As one can easily check, (19) satisfies condition (17).

In line (C11), the inverse M transformation is needed since the unconstrained z vectors are used to update the covariance matrix in (C17). As one can easily show, M^− is a (K − 1) × (K − 1) identity matrix with an additional Kth column containing only zero entries. As a result, the components of the z̃ vector are simply the first (K − 1) components of the s̃ vector.

There is another – although technical – difference comparing line (C5) to (L2) in Fig. 1: By dividing the square root of the covariance matrix by the 2(K − 1)-th root of its determinant, the resulting transformation matrix is volume conserving. This means that the spherically symmetric N(0, I)

vectors forming a spiky ball are only distorted to an ellipsoid in an incompressible manner. In other words, the resulting (K − 1)-dimensional s̃ vectors always share the same length properties and the probability mass associated with them is constant. Therefore, scaling of the length of the mutations is the sole responsibility of the mutation strength σ. Note, due to the choice of the transformation matrix M, the operation in (C6) does not change this property: the determinant of the first (K − 1) rows of M is 1.

3) Ensuring the Box Constraints: Even though the search on the manifold \sum_{k=1}^{K} x_k = 1 can be easily accomplished by the linear transformation (C6), fulfilling the box constraints 0 ≤ x_k ≤ 1 cannot be realized by a linear scaling of the mutation direction s. While one can scale the s vector as long as the parent is in the simplex defined by Eqs. (2), (3), this no longer works when approaching the boundary of the simplex. This is especially critical in the case where the search has to be performed on the boundary of the feasible region. For example, this happens regularly when the optimizer is located on the boundary, as is almost always the case for linear fitness functions. Length scaling does not work in such situations because the mutation strength decreases exponentially fast. This is so because the component towards the local boundary constraint enforces a rapid decrease of the mutation strength σ̃. As a result, the remaining (K − 1) perpendicular components are reduced by the same factor. However, these components are usually vital search directions. Theoretically, one could argue that this problem can be tackled by the covariance matrix adaptation. However, as we already know from the simple sharp ridge problem, learning the covariance matrix is of order O(K²) while the mutation strength goes down exponentially fast on a time scale of O(K).
Even if one were able to learn the locally optimal covariance matrix (which would be a degenerate one when reaching a boundary of the simplex), the problem would still persist and would become even more pronounced when reaching one of the vertices of the simplex. For example, considering non-degenerate linear programs, the optimum is in one of the vertices of the simplex, and the probability of generating a feasible solution can be less than 2^{−K} in the vicinity of such solutions. One needs a radically different approach to handle such problems.

If there is only one simple fitness goal, Baldwin evolution with an augmented fitness function realizing a penalty for leaving the feasible region might be a remedy. Such an approach has been used in some implementations of CMA-ES, e.g. [18]. That is, the CMA-ES searches in an unconstrained R^K; however, it evaluates the objective function on the boundary of the feasible region. A distance penalty is added in the case when the offspring is not in the feasible region. Unfortunately, such an ad hoc approach only works flawlessly in cases of scalar fitness functions. Its extension to lexicographically ordered individuals may be problematic, especially in the case where the primary (i.e. dominating) component of the fitness is the violation count. There is no meaningful way to compare an individual with violated box constraint (3) and violation count ν = 0 with an individual fulfilling (3) but with ν ≥ 1. It is quite clear that the current CMA-ES cannot produce


feasible solutions in the vicinity of the simplex boundary. Therefore, a repair approach will be used. The question arises, however, how to make infeasible offspring feasible. The idea pursued here is to minimally change an infeasible offspring individual x† = x + σ̃s generated by the CMSA-ES through the lines (C4)–(C6). That is, one searches for a point x̆ having minimal distance to the infeasible x† while yet being feasible. Thus, the following constrained optimization problem must be solved:

    \breve{\mathbf{x}} = \arg\min_{\mathbf{x}'} \|\mathbf{x}' - \mathbf{x}^\dagger\|, \quad \text{s.t.} \;\; \sum_{k=1}^{K} x'_k = 1, \;\; x'_k \ge 0.    (20)

Since this is a convex optimization problem, the solution to (20), denoted by x̆, is on the boundary of the simplex in the case that x† violates the box constraints.³ Solving (20) is part of (C7). Using the order statistics notation (y)_{i:K}, indicating the component of a vector y ∈ R^K that is the ith smallest, (C7) performs the following operation:

    \tilde{\mathbf{s}} = \begin{cases} \mathbf{s}, & \text{if } (\mathbf{x} + \tilde{\sigma}\mathbf{s})_{1:K} \ge 0, \\ (\breve{\mathbf{x}} - \mathbf{x})/\tilde{\sigma}, & \text{otherwise.} \end{cases}    (21)

The first line in (21) is realized if no component of the offspring x + σ̃s is less than 0. That is, in that case the box constraints (3) are not violated. Note, it is only necessary to check for below-zero components because, due to the transformation (C6), Eq. (2) is fulfilled and components greater than 1 automatically imply the existence of at least one negative component. Therefore, by considering the smallest component, box violations are safely detected. If there is such a violation, (20) must be solved. The resulting x̆ is then used to recalculate the search direction s; this happens in the second line of (21).

It remains to explain the calculation of x̆. First note, the Euclidean distance used in (20) can be replaced by its square, resulting in a quadratic optimization problem that can be solved using Lagrangian multipliers. However, the resulting stationarity conditions require case inspections. An elegant way to avoid case differentiation has been found in [19]. The resulting pseudocode is provided in Fig. 3. A short inspection of the optimization algorithm in Fig. 3 reveals that its complexity is O(K log K) due to the sorting required. Sorting is also needed in (21).⁴ Therefore, the operations in (C7) can be further simplified by always calculating the d vector and replacing the condition in (21) by (d)_K ≥ 0.

4) Selection Based on Lexicographic Order: Lines (C9) and (C10) calculate the offspring's fitness, i.e., the violation count (9) and the objective function (1), respectively. This completes the offspring individual. As already mentioned, an offspring individual ã (the tilde is used to denote the offspring) is a tuple consisting of

    \tilde{a} := \left( f(\tilde{\mathbf{x}}), \nu(\tilde{\mathbf{x}}), \tilde{\mathbf{x}}, \tilde{\sigma}, \tilde{\mathbf{s}}, \tilde{\mathbf{z}} \right).    (22)

³If there is no box constraint violation, (20) simply yields x̆ = x†.
⁴Actually, determining (x + σ̃s)_{1:K} can be done in O(K), since this is simply finding a minimal element in a list of K elements. Therefore, sorting is not really needed in that step.

x† ← x + σ̃ s
d ← SortDescending(x†)
∆ ← 0
For i ← 1 To K
    ∆ ← ∆ + (d)_i
    If (d)_i − (∆ − 1)/i > 0 Then γ ← (∆ − 1)/i
End
For i ← 1 To K
    (x̆)_i ← max((x†)_i − γ, 0)
End

Fig. 3. Pseudocode for the projection of an exterior point x† onto the unit simplex.
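The projection of Fig. 3 transcribes almost line by line into Python; the three-component input point below is made-up illustration data:

```python
def project_simplex(x_dag):
    """Project an exterior point onto the unit simplex (Fig. 3), O(K log K)."""
    d = sorted(x_dag, reverse=True)       # d <- SortDescending(x_dag)
    delta, gamma = 0.0, 0.0
    for i, di in enumerate(d, start=1):   # accumulate the partial sums
        delta += di
        if di - (delta - 1.0) / i > 0:    # update the threshold gamma
            gamma = (delta - 1.0) / i
    # clip every component at the threshold
    return [max(xi - gamma, 0.0) for xi in x_dag]

p = project_simplex([0.9, 0.8, -0.4])
print(p)  # close to [0.55, 0.45, 0.0], summing to 1
```

For this input the threshold becomes γ = 0.35, so the negative component is clipped to zero and the remaining mass is redistributed such that the result sums to 1, as required by constraint (2).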

After completion of the generation of the λ offspring, selection is to be performed. This step is not explicitly displayed in the pseudocode of Fig. 2. It is hidden in the ⟨·⟩ notation, where the mean value calculation is performed over the µ best individuals from the offspring pool (a_{l=1,…,λ}). In order to perform this calculation, a ranking of the offspring individuals is needed in (C13). To this end, an order relation between individuals must be developed. If the maximization of the goal function f(x), Eq. (1), were the only objective, an individual a_l would have a higher rank than an individual a_m (l ≠ m) if f(x_l) > f(x_m). We denote this order relation by a_l ≻ a_m, expressing the superiority of the left individual a_l, i.e., individual a_l is fitter than individual a_m.

Extending the ≻ notation to the two-component fitness (ν(x̃), f(x̃)) in the individual (22), we have to decide on the importance of the different components. Considering the satisfaction of the violation count ν as the more severe objective, it clearly must have priority in the comparison of individuals. That is, ν(x̃_l) < ν(x̃_m) ⇒ a_l ≻ a_m. Only in the case that both individuals have the same violation count is the goal function f (according to (1), to be maximized) considered additionally: ν(x̃_l) = ν(x̃_m) ∧ f(x̃_l) > f(x̃_m) ⇒ a_l ≻ a_m. Putting both alternatives together, one finally obtains

    a_l \succ a_m \;\Leftrightarrow\; \nu(\tilde{\mathbf{x}}_l) < \nu(\tilde{\mathbf{x}}_m) \;\vee\; \left( \nu(\tilde{\mathbf{x}}_l) = \nu(\tilde{\mathbf{x}}_m) \wedge f(\tilde{\mathbf{x}}_l) > f(\tilde{\mathbf{x}}_m) \right).    (23)
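Relation (23) maps directly onto a composite sort key; a minimal sketch, with individuals reduced to illustrative (f, ν) pairs:

```python
def rank(offspring):
    """Sort best-first per Eq. (23): minimize nu, break ties by larger f."""
    return sorted(offspring, key=lambda a: (a[1], -a[0]))

pool = [(0.07, 2), (0.05, 0), (0.09, 0), (0.08, 1)]   # (f, nu) pairs
print(rank(pool))  # -> [(0.09, 0), (0.05, 0), (0.08, 1), (0.07, 2)]
```

Feasible individuals (ν = 0) come first, ordered by decreasing objective value; infeasible ones follow by increasing violation count.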

From a mathematical point of view, this establishes a so-called lexicographic order. Using relation (23) as the sorting key, standard sorting algorithms can be used to perform the ranking of the offspring individuals.

IV. EXPERIMENTS ON PORTFOLIO OPTIMIZATION

While the optimization problem (1)–(5) may be regarded as a generic example of a problem with equality and inequality constraints, it originates from a non-convex problem in value-at-risk portfolio optimization [9]. It actually arises when using real historical data (i.e., time series) instead of distributions (with parameters fitted to the observed data) to allow for a data-driven optimization. As shown in [10], the resulting optimization problem is NP-complete.


The experiments are based on data calculated from the Dow Jones Index (DJI), starting from 01/01/2000 and ending by 02/24/2010. Given the raw data P_k(t) (t = 1, ..., T) of the kth company's time series and the time index s, the components of the Y matrix are calculated as

Y_sk := P_k(s + 1)/P_k(s) − 1,   k = 1, ..., K,   s = 1, ..., S,   S = T − 1.   (24)
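The return matrix (24) can be sketched in a few lines of NumPy; the price matrix below is invented for illustration and is not the DJI data:

```python
import numpy as np

# One-step relative returns (24): Y[s, k] = P[s+1, k] / P[s, k] - 1,
# giving S = T - 1 rows for K price series.
def return_matrix(P):
    """P: array of shape (T, K), P[t, k] the k-th company's price at
    time t.  Returns Y of shape (T - 1, K)."""
    return P[1:] / P[:-1] - 1.0

P = np.array([[100.0, 50.0],
              [110.0, 45.0],
              [ 99.0, 54.0]])
Y = return_matrix(P)   # shape (2, 2)
```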

The matrix dimensions of the data provided are S = 2552 and K = 29.⁵

A. Having a Closer Look at the Optimization Problem for Small K

In order to get a certain "feeling" for the data, it is helpful to look at the K = 2- and K = 3-dimensional cases considering sub-matrices of Y. Let us start with the K = 2 case using S = 2552 data points. Let us further introduce the relative violation count function ρ(x) derived from (5),

ρ(x) := (S − η(x))/S.   (25)

This function measures the fraction of the inequalities Σ_{k=1}^{K} Y_sk x_k ≥ κ (s = 1, ..., S) not fulfilled by a given solution x. Figure 4 shows the dependency of ρ on x1 and on the choice of κ.

Fig. 4. Relative violation count as a function of κ and the first variable x1. Note, due to (2), the second component of x is determined by x1. In the contour plot rightward, the vertical axis displays κ and the horizontal axis x1. Note, these pictures present a rather global view; zooming in would reveal a very rugged landscape, as can be seen on the left of Fig. 5.

As one can see, aiming at a high significance level α requires choosing a sufficiently small (and negative) κ. For example, for α = 0.95, one has to choose κ ≈ −0.036 in order to reach the domain of relative violation count 1 − α = 0.05. The ρ-function and also the objective function are displayed in Fig. 5.

Fig. 5. Relative violation count ρ as a function of x1 for κ = −0.036 (left graph). In the graph on the rhs, the objective function is displayed depending on x1. Smaller x1 values are desirable since they increase the objective function.

Given the significance level α = 0.95, admissible solutions are on and below the x1 axis of the lhs graphics in Fig. 5. The leftmost x1 value among these admissible solutions represents the optimum solution. One also sees that there are plateaus of constant ρ-values. Such plateaus can be difficult to handle for gradient-based numerical optimization techniques. In order to reach a high significance level of, e.g., α = 0.95, one has to accept rather high loss values κ < 0. While this is an aspect that might have consequences for finance, another question arises. It concerns the behavior of the algorithm if for a given κ a targeted significance level is not reached. For example, considering κ = 0, the graphs of Fig. 5 change to those in Fig. 6 using the first thousand inequalities, i.e., S = 1000.

Fig. 6. Relative violation count ρ as a function of x1 for κ = 0 (left graph) considering a case with S = 1000. In the graph on the rhs, the objective function is displayed depending on x1. Larger x1 values are desirable since they increase the objective function.

As one can see, we are far off the target α = 0.95 (that corresponds to a ρ = 0.05). How should an optimization algorithm deal with such a situation? As assumed in this work, ensuring a high significance level is the more important goal than maximizing f. That is, the algorithm should yield the minimal ρ = 0.491, which appears to be at about x1 = 0.500. There is also a local minimizer at about x1 = 0.100 with a slightly larger ρ value (a difference of just 0.001, meaning one additional inequality is violated).

⁵One of the columns of the originally 30 has been removed from the DJI table since the company associated with that column has been exchanged during the time frame covered by the Y matrix.

Considering the K = 3 case (S = 2552), one needs about κ = −0.03 in order to reach a significance level of α = 0.95. Figure 7 shows the global landscape structure. The contour

Fig. 7. The fitness landscapes of the K = 3, S = 2552 case. On the left: relative violation count as a function of x1 and x2 . Note, due to (2), the third component of x is determined by x1 and x2 . On the right hand side: goal function f (x), Eq. (1). Note, feasible solutions are in the left triangle of the x1 -x2 -plane bounded by the line x2 = 1 − x1 .

plot displayed in Fig. 8 shows that the optimal solution is again inside the feasible region. The optimal goal function value f is determined by a line parallel to the straight falling f (x) = const. contour lines touching the ρ(x) = 0.05 contour at its leftmost point.
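The relative violation count (25), which drives all of the ρ-landscape plots in this section, is essentially a one-liner. A sketch with toy values for Y and x, not the DJI data:

```python
import numpy as np

# Relative violation count (25): the fraction of the S inequalities
# sum_k Y[s, k] * x[k] >= kappa that a portfolio x violates.
def rho(Y, x, kappa):
    """Y: (S, K) return matrix, x: portfolio weight vector,
    kappa: loss threshold.  Returns the violated fraction in [0, 1]."""
    return np.mean(Y @ x < kappa)

Y = np.array([[ 0.02, -0.01],
              [-0.03,  0.04],
              [-0.02, -0.02],
              [ 0.01,  0.00]])
x = np.array([0.5, 0.5])
r = rho(Y, x, kappa=0.0)   # one of four rows is violated -> 0.25
```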


Fig. 8. Contour plot of the violation count ρ(x) for κ = −0.03 corresponding to the lhs graphics in Fig. 7, horizontal axis: x1 , vertical axis: x2 . Left plot: The (seemingly) closed contour line belongs to ρ(x) = 0.05 and its interior to significance levels α > 0.95. The other contour lines belong to the values ρ(x) = 0.075, 0.1, 0.125. The straight falling lines are the f (x) = const. contours. Note, feasible solutions are in the left triangle bounded by the falling diagonal line x2 = 1 − x1 (not displayed). Right plot: magnification of the region where the global optimizer is located at about (0.0346, 0.2231)T .

Increasing the dimensionality K of the problem and considering large numbers S of inequalities to be fulfilled, the optimization problem can asymptotically reach (or be interpreted as) a chance constraint problem, which in turn can be a convex optimization problem. As a result, optimal solutions will be found on the boundary of the feasible region defined by (2, 3). This is also observed in the case considered here for large K and S. As a consequence, the cCMSA-ES must be able to efficiently search both within and on the boundary of the feasible domain.

B. First Evolution Experiments

1) Experimental Settings: Considering K = 2- to K = 4-dimensional cases offers the opportunity to visually determine the optimizer of (1–5) and to compare with the result of the cCMSA-ES. We start with the K = 2 examples introduced in the last section. The cCMSA-ES uses the following exogenous strategy parameters

τ_c = 1 + K(K − 1)/(2µ),   τ = 1/√(2K),   (26)

an offspring population size λ = 12, and a parental population size µ = λ/4 = 3. The matrix square root in Line (C5) of the cCMSA-ES algorithm of Fig. 2 has been obtained by spectral value decomposition (alternatively, one might as well use a Cholesky decomposition). However, unlike shown in the algorithm of Fig. 2, the calculation of det(C)^{−1/(2(K−1))} √C is done only every ⌊τ_c⌋ generations. This saves computation time and is admissible since the C matrix changes only slowly over time, with time constant τ_c. The parental mutation strength σ is used as a termination criterion: the run stops if σ falls below σ_stop = 10^−6. Additionally, the cCMSA-ES is stopped if the number of generations exceeds G_stop = 1000, or if the parental x centroid does not change significantly over the last G = 10 generations, using ε = 10^−9. The random initialization (19) is used, and the initial parental mutation strength is set to σ(0) = 1/√K.
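The exogenous parameters (26) and the population sizes can be collected in a small helper. This is a sketch following the formulas as printed above, not the authors' implementation:

```python
import math

# Exogenous strategy parameters (26) of the cCMSA-ES for problem
# dimension K, with default offspring population size lambda = 12.
def strategy_params(K, lam=12):
    mu = lam // 4                       # parental population size
    tau_c = 1 + K * (K - 1) / (2 * mu)  # covariance-matrix time constant
    tau = 1 / math.sqrt(2 * K)          # mutation-strength learning rate
    return mu, tau_c, tau

mu, tau_c, tau = strategy_params(K=2)   # mu = 3, tau_c = 4/3, tau = 0.5
```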


Fig. 9. On the evolutionary dynamics of the (3/3, 12)-cCMSA-ES for an instance of K = 2, S = 2552 with κ = −0.036 and α = 0.95, depicted in Fig. 5.

2) Evolution Experiments for K = 2: Let us consider the K = 2 instance introduced in Fig. 5. After about 114 generations, the cCMSA-ES terminates, yielding x = (0.27036, 0.72964)^T as the optimizer and f = 0.72186 as the optimum. Comparing with the left-hand graphics in Fig. 5, one sees that the global optimum has been reached. As is common practice, the dynamics of the evolution process are displayed in Fig. 9. The ES quickly reaches large fitness values within the first 30 generations. This is accompanied by an exponential drop of the mutation strength. The targeted violation count is stably reached after about 80 generations.

What happens if the desired significance level α cannot be reached? Setting κ = 0, one obtains the relative violation count graphics in Fig. 10.

Fig. 10. Relative violation count ρ as a function of x1 for κ = 0 (left graph) of the instance K = 2, S = 2552. The graph on the rhs is a magnification of the region where the minima are located. Note, as in Fig. 5, smaller x1 values are desirable since they increase the objective function f.

A closer look at the minima reveals that the global minimum w.r.t. the violation count is degenerate: there are four optimizers yielding the same global minimum value. However, due to the goal function f, the leftmost one is the value that should be returned by the optimization strategy. This solution represents the best one can reach under the given conditions. Running the cCMSA-ES on this test case yields the global optimizer x = (0.49759, 0.50241)^T after about 155 generations and f = 0.570 as the global optimum. The dynamics of the evolution process are displayed in Fig. 11. Repeated runs of the experiments presented so far do not show notable differences between the final population results and the results of the best-so-far individual a_bsf. This changes



Fig. 11. On the evolutionary dynamics of the (3/3, 12)-cCMSA-ES for the K = 2, S = 2552 instance of Fig. 10 with κ = 0 and α = 0.95.

when considering the K = 2 (S = 1000) instance displayed in Fig. 6. There are two basins of attraction far apart from each other. Being interested in the global ρ minimizer, a result at x1 = 0.500 would be desirable. Repeated runs using the same settings for the cCMSA-ES have been performed. The dynamics of a typical run are depicted in Fig. 12.

Fig. 12. On the evolutionary dynamics of the (3/3, 12)-cCMSA-ES for a K = 2 example with κ = 0, α = 0.95, and S = 1000, corresponding to the fitness landscape displayed in Fig. 6.

The final population result yields x = (0.136, 0.864)^T and the fitness values ρ = 0.493 and f = 0.283. In contrast, the best-so-far values obtained during the run are x = (0.500, 0.500)^T, ρ = 0.491, and f = 0.314. Such deviations between final population results and the best-so-far individual are not untypical under comma selection strategies. That is why one should always keep track of the best-so-far individual (cf. Line C12 in Fig. 2). So far a (3/3, 12)-cCMSA-ES has been considered; next we want to show an interesting behavior that can be observed for larger population sizes using the problem instance from Fig. 6. To this end, a (30/30, 100)-cCMSA-ES is considered. In the evolution run presented in Fig. 13, the evolution process does not converge. Instead one observes a steady state behavior fluctuating about certain values of the


Fig. 13. On the evolutionary dynamics of a cCMSA-ES for the same problem instance as in Fig. 12 (K = 2, S = 1000, κ = 0, α = 0.95), however, using an ES with population size µ = 30, λ = 100.

mutation strength and the left local minimizer in the left graph of Fig. 6. Inspecting the best-so-far individual, one obtains x_bsf = (0.5002, 0.4998)^T; however, the best in the last population is x(1000) = (0.1004, 0.8996)^T. Taking the average over the parental centroids of the last 800 generations yields x = (0.128, 0.872)^T. The stationary mutation strength has an average of σ = 0.038.

In the last example one again observes the difference between the best-so-far individual and the best individual of the final generation. Furthermore, the mutation strength does not exhibit the exponential decrease. These are actually two different and independent behaviors that appear under certain conditions. However, they are not observed in all runs (note, the Evolution Strategy used belongs to the probabilistic algorithms). The first observation concerns the ES's "choice" of evolving towards the local minimum attractor instead of the seemingly global minimizer (which, according to Fig. 6, lies in the vicinity of x1 = 0.500). Actually, this may be regarded as a "feature rather than a bug." Deep but narrow valleys are often not very robust w.r.t. parameter fluctuations. That is, such optimizers are often sensitive to parameter changes. Evolution Strategies sample the landscape and as such they rather "see" the global structure than the local fine structure of the landscape. This is similar to the smoothing effect observed when averaging over the local neighborhood of a state. Figure 14 shows the effect of locally smoothing the ρ-landscape of Fig. 6 using different local neighborhoods.

Fig. 14. Smoothed relative violation count ρ as a function of x1 for κ = 0 (cf. Fig. 6). The left graph uses an averaging neighborhood of ±0.025, while the graph on the rhs uses an averaging neighborhood of ±0.1.

As one can see, considering a certain neighborhood of the solutions, the structure of the landscape changes. The ρ landscape becomes less rugged; the former global minimum disappears, while the former local attractor region now contains the global optimizer. This is exactly the effect an Evolution Strategy causes when sampling the search space. However, the smoothing effect depends on the mutation strength. As a result, the smoothing is performed adaptively. Usually, one starts with an initially large mutation strength σ(0). During the evolution process, the ES gradually descends into the valleys (if minimization is the optimization goal). This process of gradually decreasing mutation strengths continues as long as the local topology allows for a continuous descent. If this is not possible due to a discontinuous change of the landscape structure, the dynamics can change. If the local topology turns into a flat landscape or a landscape with plateaus, the standard self-adaptive ES exhibits a (built-in) tendency of exponentially fast increasing the mutation strength. This is due to the intermediate σ-recombination operator used in Line (C16) of Fig. 2 (see [20] for an analysis). Thus, the ES starts to increase the mutation strength σ. However, this drives the offspring away from the current parental state, resulting in less fit offspring individuals. Consequently, individuals with small mutation strengths are again preferred, driving the population back to the plateau bottom line. As a result, steady state σ fluctuations appear, as can be seen in the σ graph of Fig. 13. While the observed steady state σ fluctuations can indicate the vicinity of a (local plateau) minimizer, the question arises how to cope with such a behavior. This concerns the question of detecting such states and finding a way to terminate the evolution process (apart from the trivial termination based on a maximal number of generations).
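The smoothing effect of Fig. 14 is easy to reproduce on a synthetic landscape. Here rho_toy is a made-up stand-in with one broad, shallow basin and one narrow, deep notch; it is not the actual ρ of the DJI data:

```python
import numpy as np

# Neighborhood averaging as an illustration of the smoothing that the
# ES mutation induces: after averaging over a sufficiently large
# neighborhood, the narrow global minimum loses against the broad basin.
def smooth(f, xs, radius, n=201):
    """Average f over the neighborhood [x - radius, x + radius]."""
    ds = np.linspace(-radius, radius, n)
    return np.array([np.mean(f(x + ds)) for x in xs])

def rho_toy(x):
    broad = 0.01 * np.exp(-((x - 0.2) ** 2) / 0.05)    # broad basin at 0.2
    narrow = 0.05 * np.exp(-((x - 0.5) ** 2) / 1e-4)   # narrow notch at 0.5
    return 0.5 - broad - narrow

xs = np.linspace(0.0, 1.0, 501)
raw = rho_toy(xs)                        # global minimum at the notch, 0.5
smoothed = smooth(rho_toy, xs, radius=0.1)   # minimum moves to the basin, 0.2
```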
As an alternative, one may change the σ recombination operator ⟨σ⟩, which is based on the arithmetic mean, to a geometric mean ⟨σ⟩_g. Using the same experimental settings as in Fig. 13, but with geometric recombination for the mutation strength, one obtains the evolution dynamics in Fig. 15. As one can see, now the evolution

0.296

0.498

0.294

f (x)

0.292

0.497

0.290

0.496

0.288

0.495

0.286

0.494

0.284 20

40

60

80

g

20

40

60

80

60

80

g

x2 x1

σ 0.1

0.8

0.01 0.6

0.001 10-4

0.4

10-5 20

40

60

80

g

20

40

g

Fig. 15. On the evolutionary dynamics of the (30/30, 100)-cCMSA-ES using geometric mean recombination for the mutation strength σ (K = 2 example with κ = 0 and S = 1000). Note the difference to the dynamics displayed in Fig. 13.

converges. This test case yields x = (0.128, 0.872)^T after about 83 generations, f = 0.282 for the goal function, and a relative violation count of ρ = 0.493. That is, the ES did not end up in the leftmost local minimizer of ρ, but in the next one located to the right, yielding a slightly larger goal function value. The best-so-far result is again at the global optimizer x_bsf = (0.5003, 0.4997)^T. This result might suggest exchanging the standard σ recombination operator. However, the steady state behavior has been observed only in cases of small K-dimensionalities in conjunction with large parental population sizes µ. Choosing µ depending on K, one can control the convergence behavior of the ES (see Eq. (27) and the recommendations at the end of this section).

3) Evolution Experiments for K = 3: We now consider the problem with α = 0.95 and κ = −0.03, the fitness landscape of which is displayed in Fig. 8. Using a (3/3, 12)-ES, the results are x = (0.0346201, 0.223136, 0.742244)^T and f = 1.130 for both the last population's best individual and the best-so-far individual. The dynamics are displayed in Fig. 16. By

f (x)

0.058

1.10

0.056

1.05

0.054

1.00

0.052

0.95 0.90

0.050 20

40

60

80

100

120

g

20

40

60

80

100

120

60

80

100

120

g

x3 x2 x1

σ 0.7

0.1

0.6

0.01

0.5

0.001

0.4

-4

10

0.3

10-5

0.2 20

40

60

80

100

120

g

20

40

g

Fig. 16. On the evolutionary dynamics of the (3/3, 12)-cCMSA-ES on a K = 3 instance displayed in Fig. 8.

decreasing κ slightly to −0.031, the first component of the optimal x shifts to the boundary x1 = 0. Such solutions are also evolved by the cCMSA-ES; however, we abstain from showing graphs of the evolution dynamics here.

4) Behavior of the cCMSA-ES for Larger K in the Linear Feasibility Range (α = 1): The investigations to follow are concerned with the full-blown problem setting having a dimensionality of K = 29, as given by the DJI data. When choosing κ sufficiently negative, α = 1 can be realized. That is, the optimization problem (1)–(5) becomes a linear optimization problem that can easily be treated by standard LP solvers. Actually, using an LP solver, one can find an upper feasibility bound κ̂ below which the LP has feasible solutions. For the data set considered, one finds κ̂ ≈ −0.0504 (e.g., by binary search checking for LP feasibility). Considering κ < κ̂, one may ask how near the ES gets to the results of the LP solver. This type of optimization task can be regarded, in some sense, as the "hardest" one for the ES optimizing the system (1)–(5): since the ES has only global information about the violation of constraints, it has to find the parameter vector x that fulfills all inequalities while still yielding the maximal

Fig. 18. Solution quality in the linear domain (i.e., there are feasible solutions for α = 1) depending on the choice of κ. The solid line represents the maxima obtained by an LP solver. Here, an increased population size (compared to Fig. 17) is used: the dots are the results of single (10/10, 30)-cCMSA-ES runs.

f-value possible. That is, for the ES it does not make a difference whether the problem is convex or not: the ES primarily tries to reduce the number of constraint violations. As we have already seen in plots like those presented in Fig. 8, the ρ-landscape is very rugged and not convex. Therefore, the evolution can drive the x vectors into locally optimal regions that finally fulfill the ρ = 0 condition (or not). Since the selection w.r.t. the f-values is subordinate, one may end up with a solution that is not the global maximum of f. This is actually observed for higher K-dimensionalities. In Fig. 17, results of runs of the cCMSA-ES are compared with the optima obtained by an LP solver. The solid curves in Fig. 17 have been obtained by an LP solver. There are no feasible solutions above κ̂ ≈ −0.0504. When decreasing κ below κ̂, the feasible region defined by the inequalities (5) grows steadily. Finally, it fully covers the simplex X, and the optimizer is determined by that vertex e_k* of the simplex that yields the highest f-value given by Eq. (8). That is why one observes the saturation for small κ-values. If one further decreases κ, the problem gets simpler for the cCMSA-ES since the inequalities are easier to fulfill. In the plots shown, we present the "interesting" κ-range where the cCMSA-ES can fail: as one can see in the left-hand plot, the cCMSA-ES does not reach the maximum in each experiment. A countermeasure is to use r > 1 independently initialized ES runs, i.e., restarts, taking the best of all runs. This is depicted in the plot on the rhs of Fig. 17. However, it seems that this does not fully solve the problem. Alternatively, one may consider increased population sizes. Figure 18 shows such a single-run experiment using a (10/10, 30)-cCMSA-ES. It suggests that increasing

1.0

æ

æ

Fig. 17. Solution quality for K = 29 in the linear domain (i.e., there are feasible solutions for α = 1) depending on the choice of κ. The solid line represents the maxima obtained by an LP solver. The dots are the results of cCMSA-ES runs. Left picture: results of single runs of the (3/3, 12)-cCMSA-ES; right picture: restart version of the (3/3, 12)-cCMSA-ES using r = 10 independent runs.


Fig. 19. Results of the population size vs. restarts balanced meta-strategy running rλ × (µ/µ, λ)-cCMSA-ESs. Left graph: Solution quality in the linear domain (i.e., there are feasible solutions for α = 1) depending on the choice of κ. The solid line represents the maxima obtained by an LP solver. The dots are the ES results. Right graph: Population size λ for which the best-so-far solution has been found. Note, max(λ) = 96 used in these experiments has never been the population size that provided the best solution.

the population size may help. However, it also suggests that there are cases where a single run of a small population can yield better results. This gives rise to a new meta-strategy that uses a generation-balanced mixing of restarts and exponentially increasing population sizes. The meta-strategy uses a fixed per-generation budget of population sizes λ and restarts r such that rλ = const. The rationale behind this meta-strategy is to have more restarts when using small population sizes and only a few (or even no) restarts for the large population size runs. This reflects the observations made above that there are κ-cases which are better treated by small populations, while there are also cases where a larger population might be beneficial. However, the latter case is computationally more expensive. Therefore, only a small number of restarts is allowed in those cases. As a result, the number of function evaluations needed within one generation is the same for all population sizes used in the meta-strategy. Needless to say, allocating resources in that way is by no means obvious, and it could be the case that concentrating all efforts in the restart of the largest population yields better best-so-far solutions. In Fig. 19 the results of the meta-strategy are presented. As one can see in the graph on the lhs, the cCMSA-ES was always able to find the optimizer for all κ-values tested. The meta-strategy uses a per-generation budget of rλ = 192 offspring evaluations, while the cCMSA-ES runs with exponentially increasing population sizes λ = 12, 24, 48, 96 (note, µ = λ/4). That is, the corresponding numbers of restarts are r = 16, 8, 4, 2, respectively. The maximum number of generations allowed was always g_stop = 1000. Inspecting the plot on the rhs of Fig. 19 supports our initial hypothesis that both parts of the meta-strategy are needed. There is virtually no κ-instance where the ES took advantage of the largest population size available (λ = 96).
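The fixed per-generation budget rλ = 192 with doubling population sizes can be sketched as follows; the function name and interface are illustrative:

```python
# Per-generation budget split of the meta-strategy: a fixed budget of
# r * lambda offspring per generation, with exponentially increasing
# population sizes and correspondingly fewer restarts.
def budget_schedule(budget=192, lam0=12, doublings=4):
    """Return (lambda, restarts) pairs with restarts * lambda == budget."""
    lams = [lam0 * 2 ** m for m in range(doublings)]
    return [(lam, budget // lam) for lam in lams]

schedule = budget_schedule()   # [(12, 16), (24, 8), (48, 4), (96, 2)]
```

Each pair costs the same number of offspring evaluations per generation, which is what makes the different population sizes comparable within one budget.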
There seem to be optimal population sizes in conjunction with minimal numbers of restarts. This is in accordance with the experiments presented in Figs. 17 and 18. Given the experiments done, there seems to be no reason to increase the population size above a value of about 4K. That is, our current recommendation for running the cCMSA-ES on the problem (1)–(5) is

λ ∈ {12, . . . , max(12, 4K)}   and   µ = ⌊λ/4⌋,   (27)

where λ is chosen from (27) in an exponentially increasing manner

λ = 12 · 2^m,   m ∈ {0, 1, . . . , M − 1},   M := ⌈log₂(K/3)⌉.   (28)

The corresponding numbers of restarts are given by

r = max(r₀, 2^M), . . . , 2^1.   (29)

Here r₀ > 1 is the minimum number of restarts used for the smallest population size. This minimum number is necessary to have (at least) r₀ restarts in the case of small K-dimensionalities.

5) Performance of the cCMSA-ES in the Large K Non-Convex Domain (α ≤ 1): The investigations to be presented are done for the problem dimension K = 29. Given an arbitrary α, κ combination, there is no guarantee that the optimization problem (1–5) has feasible solutions. This is so because the inequality (4) requires a number of inequalities in (5) to be fulfilled independent of the choice of κ. However, choosing κ too large, the inequalities in (5) cannot be fulfilled. While an LP solver would necessarily stop in such situations, the ES's primary goal is to minimize the violation count (10) before starting to maximize the goal function (1). That is, the cCMSA-ES can be used to first find the functional dependency of the relative violation count (25) on the choice of κ, i.e., ρ = ρ(κ). Thus, 1 − ρ can be used as a bound for the choice of α:

α ≤ 1 − ρ(κ).   (30)

To this end, the cCMSA-ES is run with α = 1 and fixed κ until convergence, and the best (i.e., smallest) ρ-value is used in the plot displayed in Fig. 20. Based on these results, the


Fig. 21. On the optimization quality of the cCMSA-ES for a confidence level α = 0.95. Left plot: The ×-markers represent the maxima obtained by the cCMSA-ES (f× := max_cCMSA-ES f) and the +-markers represent those obtained by the Lingo optimization software (f₊ := max_Lingo f). Right plot: relative error of the Lingo results f₊ compared with the cCMSA-ES result f×.

decision maker can choose a confidence level α, given a fixed κ, such that inequality (30) is fulfilled. Alternatively, one can also fix a confidence level α and find the corresponding κ below which feasible solutions do exist. The latter case will be considered in the next experiment, displayed in Fig. 21. Here a confidence level α = 0.95 has been chosen, and the optimization results (i.e., the maximal f-values) depending on admissible κ-values are displayed with ×-markers. Additionally, the results of the commercial optimization software LINGO [21] are displayed with +-markers. While both optimization algorithms reached the targeted α-level (not displayed here), the cCMSA-ES is able to yield better results, especially in the most important region of larger κ-values.

Fig. 20. The best obtained relative violation count (25) versus κ. The restart cCMSA-ES as described in Section IV-B4 has been used with α = 1 to obtain the data points. Note the logarithmic scale on the vertical axis.

As has been already pointed out, using multi-starts of the core cCMSA-ES is a necessary ingredient for finding good approximations to the global optimizer of the system (1)–(5). There is obviously a large number of local optima. In a test performed using the multi-start strategy, allowing population sizes λ up to 1536 and 510 restarts, the lhs of Fig. 22 shows the distribution of best-so-far f-values f_bsf obtained. The question arises how these f_bsf-values are related to the corresponding local optimizers x_bsf. Taking the local optimizer belonging to the maximum f_bsf-value, f̂ := max(f_bsf), denoted by x̂ := arg max(f_bsf), and displaying the differences ∆f := f_bsf − f̂ versus the corresponding distances ∆x := ‖x_bsf − x̂‖ of the single runs provides a measure of the fitness-distance relation of the best-so-far results. This is plotted in the rhs graph of Fig. 22.

Fig. 22. On the distribution of best-so-far values obtained in a multi-start strategy using a per-generation budget of rλ = 3072 offspring evaluations. As problem instance, K = 29, S = 2552 with κ = −0.017 and α = 0.9503 has been chosen. In the left-hand graphic, the relative frequency of best-so-far fitness values is displayed. In the rhs graph, the fitness-distance relation of the best-so-far results is presented.

While one observes a somewhat isolated local optimizer being the maximum best-so-far optimizer (which is by definition located at the origin), there are also results very close in f-values but considerably away from x̂. Given the tendency of the data points displayed in that plot, one might speculate whether the solution at the origin already represents the global optimizer or whether it is at least a good approximation of it. Of course, a definite answer cannot be given here. A single run of the core cCMSA-ES, Fig. 2, usually converges to a local optimum. This can be checked by observing the evolution of the mutation strength σ, the dynamics of which must finally drop exponentially fast towards zero. For the dimensionality K = 29 given, this usually happens within the given g_stop = 1000 generations. Fig. 23 shows the number of generations g_end needed by the single runs of the cCMSA-ES



Fig. 23. On the distribution of best-so-far values versus the number of generations until termination, obtained by a multi-start strategy using a pergeneration-budget of rλ = 3072 offspring evaluations.

and the corresponding best-so-far f-values. One can clearly see that the best fbsf-values are obtained in a range of end-generation values of about 400 to 700 generations. This raises the question of how the maximum number of generations gstop should be chosen. To this end, consider Fig. 24.

Fig. 24. On the distribution of best-so-far values obtained by a multi-start strategy using a per-generation budget of rλ = 3072 offspring evaluations. The left graphic displays the influence of the offspring population size λ on the solution quality; the right graphic relates the population size to the number of generations until termination.

There is a
striking drop of the best-so-far fitness results depending on the population size in the lhs of Fig. 24. This suggests that the experimental setting m = 2 in (28), resulting in λ = 48, is the best choice. The optimum is rather broad in Fig. 24; however, it seems that choosing λ > 4K is not advisable. This corresponds to the recommendation to choose the parental population size µ not greater than the search space dimensionality K. Unlike for other highly multimodal optimization problems, such as the Rastrigin test function [11], increasing the parental population size does not help. Instead, improved fitness results are obtained by restarting the cCMSA-ES multiple times with the right population size. Interestingly, the optimal choice of the population size also seems to correspond to a (roughly speaking) local minimum of the number of generations needed, as observed in the rhs picture of Fig. 24. From that graph one can infer that a maximum of 600 generations suffices under the experimental settings considered. From the experiments performed so far, the following recommendations concerning the strategy parameters of the cCMSA-ES might be drawn:
1) Use a multi-start cCMSA-ES with either a fixed population size λ or a fixed per-generation budget, but limit the population size to λ = max(12, ⌊1.5K⌋).
2) Choose µ = ⌊λ/4⌋.
3) As a rule of thumb, use gstop = 200 + 10K²/µ.

4) Use intermediate recombination for both the object parameter vectors x and the strategy parameter σ.
Recommendation 2) results from extensive experiments done for the publication [11]. Recommendation 4) is based on theoretical analyses done for ESs on sphere models with noise [20]. Using intermediate recombination for the mutation strength σ results in a bias toward increased exploration behavior. If a more local search behavior is desired, geometric recombination should be considered. In recommendation 1) we decreased the 4K, originally given in (27), to 1.5K without any noticeable change in the final results. However, recommendations 1) and 3) are clearly specific to the problem considered; other problem instances might result in somewhat different recommendations.

V. SUMMARY AND OUTLOOK

In this paper we engineered a covariance matrix adaptation Evolution Strategy which is able to cope with some specific types of constraints often encountered in finance applications. Treating constraints in CMA-ESs is a rather uncharted field, and the solutions proposed so far [5] are not suited for the combination of constraints considered here. It might come as a surprise, but even the simple linear constraints (2) and (3) already pose a non-trivial problem for the standard CMA-ES [12]: Eq. (2) requires mutations placed on a hypersurface embedded in R^K, and (3) requires the fulfillment of the box constraints. While this can be achieved in different ways (and one solution has been presented in this paper), the problem of how to adapt the mutation strength σ still remains.6 Furthermore, if the optimizer is located on the boundary of the feasible region, the mutation strength is reduced exponentially fast and the evolution on the boundary in higher-dimensional spaces ends in premature convergence.
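The strategy-parameter recommendations 1)–4) above can be collected in a small helper. The following is a minimal sketch; the function names are ours, the integer rounding of gstop is an assumption, and the two σ-recombination variants from recommendation 4) are contrasted for illustration:

```python
import math

def cmsa_defaults(K):
    """Rules of thumb from recommendations 1)-3).
    K is the search space dimensionality; names are illustrative."""
    lam = max(12, int(1.5 * K))        # population-size cap, rec. 1)
    mu = lam // 4                      # parental population size, rec. 2)
    g_stop = 200 + (10 * K**2) // mu   # generation budget, rec. 3), rounded down
    return lam, mu, g_stop

def intermediate_sigma(sigmas):
    """Recommendation 4): arithmetic mean of the parental mutation
    strengths; biased toward larger sigma, i.e. more exploration."""
    return sum(sigmas) / len(sigmas)

def geometric_sigma(sigmas):
    """Alternative for a more local search: geometric mean, which by
    the AM-GM inequality never exceeds the intermediate result."""
    return math.exp(sum(math.log(s) for s in sigmas) / len(sigmas))
```

For the K = 29 case studied here, `cmsa_defaults(29)` yields λ = 43, µ = 10, and gstop = 1041; and, e.g., `intermediate_sigma([1.0, 4.0])` gives 2.5 versus 2.0 for the geometric variant, illustrating the exploration bias noted above.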
The authors were not able to find a correction to the standard σ-CSA rule (cumulative step-size adaptation) used in state-of-the-art CMA-ES. Therefore, the new approach presented in this paper has been taken. The question arises whether the CMA-ES approach to the optimization problem (1)–(5) is the best evolutionary one. One might also consider differential evolution (DE) as a possible approach (see [23] for a survey). The ideas behind DE are very similar to those used in the real-coded GAs proposed in the early 1990s [24], [25]: Offspring are basically generated by some kind of affine or convex combination of parental individuals, respectively. This produces a natural bias toward the fulfillment of box or other convex set constraints. In ES-type algorithms, however, the mutations are point-symmetrically distributed. As a result, the probability of mutations into a high-dimensional cone vanishes exponentially fast with increasing search space dimensionality. Therefore, repair or Baldwin-type penalty approaches must be used.

While the principles used to build the new cCMSA-ES are not totally new in the field of evolutionary computation, their combination and implementation in Evolution Strategies are new and based on theoretical considerations. This may be regarded as a successful algorithm engineering approach. In order to achieve this goal, we had to start with the arguably simplest variant of CMA strategies: the covariance matrix self-adaptation (CMSA) ES [11] (thus circumventing the use of CSA). This strategy uses (mutative) self-adaptation to learn the mutation strength σ and needs only two time constants. This also reduces the complexity involved in the interplay between strategy-specific parameters as encountered in standard CMA-ES variants like UH-CMA-ES [5] and BiPOP-CMA [1]. The application range of the proposed cCMSA-ES is restricted neither to the specific optimization problem (1)–(5) nor to financial applications. The linear objective function (1) can be replaced by any nonlinear goal function f(x). The function η(x) used in the cardinality constraint (4), counting the number of linear inequalities fulfilled in (5), can be applied – of course – to any kind of system of nonlinear inequalities. That is, the cCMSA-ES can be applied to problems of mixture optimization bounded in a simplex of type (2), (3). If one is only interested in problems without the equality constraint (2), the cCMSA-ES algorithm can easily be modified by slightly changing Lines C5, C6, and C11 in Fig. 2. There are limitations, though. The approach can only be applied to linear equality constraints (2). There seems to be no generic way to replace (2) by general nonlinear equality constraints. However, this does not exclude the treatment of certain special cases. Such a case is currently under investigation. It arises in a (nonlinear) stochastic programming problem, i.e., the evolution on a tree structure where at each inner node a nonlinear equality constraint must be fulfilled.

6 Another approach to fulfill the constraint (3) is a nonlinear transformation of the search space using r_i = exp(w_i), where the ES evolves the unconstrained w_i. Constraint (2) can then be fulfilled, similarly to (19), by calculating x_i = r_i / Σ_{k=1}^{K} r_k. According to one of the anonymous reviewers of this paper, this approach has already been taken in [22].
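The exponential reparameterization mentioned in footnote 6 and the counting function η(x) can both be sketched in a few lines. This is an illustrative NumPy-based sketch, not the paper's implementation; in particular, the orientation A x ≤ b of the inequality system standing in for (5) is our assumption:

```python
import numpy as np

def simplex_from_unconstrained(w):
    """Map an unconstrained vector w onto the probabilistic simplex via
    x_i = exp(w_i) / sum_k exp(w_k), so that the components are
    nonnegative and sum to one by construction (footnote 6)."""
    r = np.exp(w - np.max(w))  # shift exponents for numerical stability
    return r / r.sum()

def eta(x, A, b):
    """Count how many inequalities of the system A @ x <= b the point x
    fulfills; the <=-orientation is an assumption for this sketch."""
    return int(np.count_nonzero(A @ x <= b))
```

A cardinality constraint of type (4) would then bound `eta(x, A, b)` from one side, or equivalently bound the number of violated inequalities from the other, depending on the orientation chosen.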
ACKNOWLEDGEMENTS

The authors thank Raimund Kovacevic for providing his solution to the optimization problem using the commercial global optimization solver LINGO. This work was supported by the Austrian Science Fund (FWF) as part of the Translational-Research-Program under grant L532-N18 “Multi-period risks in portfolio selection.”

REFERENCES

[1] N. Hansen, “Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed,” in Workshop Proceedings of the GECCO Genetic and Evolutionary Computation Conference. ACM, July 2009, pp. 2389–2395.
[2] S. Finck, N. Hansen, R. Ros, and A. Auger, “Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions,” Vorarlberg University of Applied Sciences (FHV), Research Center PPE, Tech. Rep. 2009/20, Dornbirn, Austria, 2009.
[3] R. Mallipeddi and P. Suganthan, “Problem Definitions and Evaluation Criteria for the CEC 2010 Competition on Constrained Real-Parameter Optimization,” Nanyang Technological University, Tech. Rep., 2010.
[4] T. Takahama and S. Sakai, “Constrained Differential Evolution with an Archive and Gradient-Based Mutation,” in CEC 2010 IEEE Congress on Evolutionary Computation, 2010, pp. 1680–1688.
[5] N. Hansen, A. Niederberger, L. Guzzella, and P. Koumoutsakos, “A Method for Handling Uncertainty in Evolutionary Optimization With an Application to Feedback Control of Combustion,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 1, pp. 180–197, 2009.

[6] D. Arnold and D. Brauer, “On the Behaviour of the (1+1)-ES for a Simple Constrained Problem,” in Parallel Problem Solving from Nature – PPSN X, G. Rudolph et al., Eds. Dortmund, Germany: Springer, Lecture Notes in Computer Science Vol. 5199, September 2008, pp. 1–10.
[7] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi, “Theoretical Analysis of Evolutionary Computation on Continuously Differentiable Functions,” in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO’2010). Portland, Oregon, USA: ACM, July 7–11, 2010, pp. 1401–1408.
[8] N. Hansen and A. Ostermeier, “Completely Derandomized Self-Adaptation in Evolution Strategies,” Evolutionary Computation, vol. 9, no. 2, pp. 159–195, 2001.
[9] N. Larsen, H. Mausser, and S. Uryasev, “Algorithms for Optimization of Value-at-Risk,” in Financial Engineering, e-Commerce and Supply Chain, P. Pardalos and V. Tsitsiringos, Eds. Berlin: Springer, 2002, pp. 19–46.
[10] S. Benati and R. Rizzi, “A Mixed Integer Linear Programming Formulation of the Optimal Mean/Value-at-Risk Portfolio Problem,” European Journal of Operational Research, vol. 176, no. 1, pp. 423–434, 2007.
[11] H.-G. Beyer and B. Sendhoff, “Covariance Matrix Adaptation Revisited – the CMSA Evolution Strategy,” in Parallel Problem Solving from Nature 10, G. Rudolph et al., Eds. Berlin: Springer, 2008, pp. 123–132.
[12] N. Hansen, S. Müller, and P. Koumoutsakos, “Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES),” Evolutionary Computation, vol. 11, no. 1, pp. 1–18, 2003.
[13] O. Kramer, “A Review of Constraint-Handling Techniques for Evolution Strategies,” Applied Computational Intelligence and Soft Computing, 11 pages, 2010.
[14] J. Poland and A. Zell, “Main Vector Adaptation: A CMA Variant with Linear Time and Space Complexity,” in GECCO’01: Proceedings of the Genetic and Evolutionary Computation Conference, L. S. et al., Eds.
San Francisco, CA: Morgan Kaufmann, 2001.
[15] S. Meyer-Nieberg, “Self-Adaptation in Evolution Strategies,” Ph.D. dissertation, University of Dortmund, CS Department, Dortmund, Germany, 2007.
[16] M. Lunacek and D. Whitley, “Searching for Balance: Understanding Self-adaptation on Ridge Functions,” in Parallel Problem Solving from Nature 9, T. Runarsson et al., Eds. Berlin: Springer, 2006, pp. 82–91.
[17] Z. Michalewicz and M. Schoenauer, “Evolutionary Algorithms for Constrained Parameter Optimization Problems,” Evolutionary Computation, vol. 4, no. 1, pp. 1–32, 1996.
[18] C. Igel, N. Hansen, and S. Roth, “Covariance Matrix Adaptation for Multi-Objective Optimization,” Evolutionary Computation, vol. 15, no. 1, pp. 1–28, 2007.
[19] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 272–279.
[20] H.-G. Beyer and S. Meyer-Nieberg, “Self-Adaptation of Evolution Strategies under Noisy Fitness Evaluations,” Genetic Programming and Evolvable Machines, vol. 7, no. 4, pp. 295–328, 2006.
[21] Lindo Systems, LINGO User’s Guide. Chicago: Lindo Systems Inc., 2011.
[22] B. Mersch, T. Glasmachers, P. Meinicke, and C. Igel, “Evolutionary optimization of sequence kernels for detection of bacterial gene starts,” International Journal of Neural Systems, vol. 17, no. 5, pp. 369–381, 2007.
[23] S. Das and P. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.
[24] L. J. Eshelman and J. D. Schaffer, “Real-coded genetic algorithms and interval schemata,” in Foundations of Genetic Algorithms, 2, L. D. Whitley, Ed. San Mateo, CA: Morgan Kaufmann, 1993, pp. 187–202.
[25] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous search space,” Complex Systems, vol. 9, pp. 115–148, 1995.