Theory of (1+1) ES on SPHERE Revisited

The theory of evolutionary algorithms on continuous spaces gravitates around the evolution strategy with one individual, adaptive mutation, and elitist selection, optimizing the symmetric, quadratic SPHERE function. The classic, normal-mutation theory consists of three main building blocks: 1) a two-term formula for the local (constant mutation) expected progress; 2) an exponential formula for the global behavior of the (adaptive mutation) algorithm; and 3) linear convergence time with respect to both the space dimension $n$ and the (logarithm of the) initial distance to optimum. We show that the three main results still hold if we replace the normal mutation with the uniform distribution inside the sphere, and also with the sum of two uniforms. That makes the case for an important conclusion: the linear convergence time of the algorithm is not a consequence of the normal mutation, but of the elitist selection and the 1/5 success rule. A simplified version of the 1/5-rule also allows for an intuitive representation of the algorithm, as a sequence of constant-mutation, independent, and identical (expected) length cycles.


I. INTRODUCTION
THE CONTINUOUS optimization problem discussed in this article consists in minimizing the SPHERE function with center O

$$\mathrm{SPHERE}(x) = \|x\|^2 = \sum_{i=1}^{n} x_i^2, \qquad x \in \mathbb{R}^n \qquad (1)$$

Despite its apparent simplicity, the convergence analysis of the evolution strategy (ES), also known as the continuous evolutionary algorithm, on the SPHERE proved to be difficult, making it one of the most studied problems in [13], [16], [22], and [23]. The fitness function and the algorithm analyzed in this article are defined in the following:

Algorithm 1: (1+1) ES With 1/5 Success Rule
1) Set t = 0, t_max, G, the initial point $x_0$, and the initial mutation parameter ρ.
2) repeat
• t := t + 1
• Mutation: generate a new point x in $\mathbb{R}^n$ using
A) the uniform distribution inside the sphere of radius ρ, or
B) x := (x_1 + x_2)/2, where $x_{1,2}$ are independent, uniformly distributed inside the sphere of radius ρ, or
C) the normal distribution with standard deviation σ.
• Selection: if the new point improves the fitness, it replaces the current one.
• Adaptation: every G iterations, if the success frequency is smaller than 1/5, then ρ := ρ/2.
3) until t = t_max

Algorithm 1 applies elitist selection, marked with "plus" in the ES notation; that is, the offspring replaces the parent only in case of a better fitness value.
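For concreteness, the following minimal Python sketch implements Algorithm 1 with mutation variant (A) and the simplified 1/5-rule (decrease only). The helper names (sphere, uniform_in_sphere, one_plus_one_es), the phase length G = n, and the starting point are illustrative assumptions, not the authors' reference code.

```python
import numpy as np

def sphere(x):
    # SPHERE fitness with center O: SPHERE(x) = ||x||^2, see (1).
    return np.dot(x, x)

def uniform_in_sphere(n, rho, rng):
    # Spherical sampling: direction uniform on the unit sphere,
    # radius rho * U^(1/n), i.e., the Beta_{n,1} power distribution.
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)
    return rho * rng.uniform() ** (1.0 / n) * g

def one_plus_one_es(n=30, t_max=20000, rho=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = 5.0                 # illustrative starting distance R_0 = 5
    G = n                      # phase length, as in Section IV
    successes = 0
    for t in range(1, t_max + 1):
        y = x + uniform_in_sphere(n, rho, rng)
        if sphere(y) < sphere(x):      # elitist ("plus") selection
            x = y
            successes += 1
        if t % G == 0:                 # simplified 1/5-rule: only decrease
            if successes / G < 0.2:
                rho /= 2.0
            successes = 0
    return x, rho

x, rho = one_plus_one_es()
print(np.linalg.norm(x), rho)          # distance to optimum, final step size
```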
Responsible for exploring the landscape is the mutation operator, defined at step 2 in three different forms. The classic ES applies the normal distribution, case (C) [13], [37], [41], but we formulate our analysis on two alternative distributions: 1) the uniform distribution inside the n-dimensional sphere, case (A), and 2) the sum of two uniforms, case (B).
Note that the latter variant (starting from one individual and generating two mutated parents, to be further averaged in order to produce one offspring) is formally identical to the crossover (intermediate recombination) used in multi-individual ES [5], [8], [15], [40]. However, in this case the operator is placed before, not after, selection, so it actually builds a different mutation distribution: no longer uniform, but still spherical. Convergence to the optimum is not achieved without progressively decreasing the mutation radius ρ as the ES approaches the optimum. That is done in Algorithm 1 by a simplified version of Rechenberg's 1/5-rule⁵ [6], [7], [13], [26], [37], [40], [41]. Alternative procedures exist, in the form of covariance matrix adaptation, but only for the ES with normal mutation, allowing for different σ-values on the independent components [32], [34].
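Variant (B) is a one-line addition to the sketch above (it reuses uniform_in_sphere and is equally illustrative):

```python
def sum_of_uniforms(n, rho, rng):
    # Mutation variant (B): average of two independent uniform points.
    # The result is spherical but no longer uniform; its radius
    # concentrates near rho/sqrt(2) for large n (Corollary 1).
    x1 = uniform_in_sphere(n, rho, rng)
    x2 = uniform_in_sphere(n, rho, rng)
    return (x1 + x2) / 2.0
```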
Let Z t be the random variable (r.v.) describing the ES' current position at iteration t. The sequence {Z t } t∈N defines a stochastic process, whose analysis as t → ∞ provides global convergence theorems and convergence times/rates. So far, there are only two ES theories fulfilling these goals. They are both mathematically sound, highly complex, exhibit generalization potential with respect to fitness functions and algorithmic design, yet they do not overlap. A comparison is possible, with respect to the prediction quality of the convergence time formulas, or to their explanatory power.
The drift analysis proves almost sure (a.s.) global convergence of the (1+1) ES with 1/5-rule to the optimum of the SPHERE, and also linear convergence time [7]. Since it relies only on lower/upper bounds and lacks explicit formulas for the expected progress, a numerical verification against the behavior of real algorithms is precluded; yet the theory applies to arbitrary space dimension n.
The classic ES theory was established by Rechenberg, Schwefel, Beyer, Rudolph, and Arnold on analytic estimates of the local expected progress [8], [13], [37], [40], [41]. Due to the limited tractability of the one-step transition kernel, the formulas are mostly asymptotic, yet their application to convergence time provides accurate predictions for the real ES behavior on moderate dimensions as well. As for the global convergence, the martingale analysis by Rudolph is enlightening [39], [40].
This study builds on the classic ES theory, so the remainder of this section is devoted to introducing concepts and results from the local analysis. That relies on the definition and tractability of two mathematical objects: one is geometrical, the success region; the other is stochastic, the transition kernel, which encapsulates the local probabilistic structure of the algorithm. Both concepts are conditional on $x \in \mathbb{R}^n$, the current position of the ES. If mutation is defined by a probability density function (pdf) $f$, the success region $R^S_x$ and the transition kernel $P_x(A)$ are defined, for any $A \subset \mathbb{R}^n$, by the following⁶:

$$R^S_x = \{\, y \in \mathbb{R}^n \mid \|y\| < \|x\| \,\} \qquad (2)$$

$$P_x(A) = \int_{A \cap R^S_x} f(y - x)\,dy + \delta_x(A)\left(1 - \int_{R^S_x} f(y - x)\,dy\right) \qquad (3)$$

For $A = R^S_x$, (3) reduces to its first term only, which defines the success probability.

⁵The original rule also increases the mutation step if the success frequency is larger than 1/5. Removing that line does not affect the order of convergence, but it has two effects: the algorithm will be slow when initialized with a too small step size, and the actual step size is biased, so on average it is smaller than desired, since the optimal step size is no longer a stable equilibrium.
⁶Note that the transition kernel is discontinuous, due to $\delta_x(A)$, the Dirac measure in $x$, defined as 1 if $x \in A$, and 0 otherwise.
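As a sanity check on these definitions, the success probability $P_x(R^S_x)$ can be estimated by plain Monte Carlo. A minimal sketch, using the spherical sampler of the uniform mutation; names and parameters are illustrative.

```python
import numpy as np

def success_probability(x, rho, trials=200000, seed=1):
    # Monte Carlo estimate of the first term of (3) with A = R^S_x:
    # the fraction of UNIFORM_rho mutants landing closer to O than x.
    rng = np.random.default_rng(seed)
    n = len(x)
    g = rng.standard_normal((trials, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    r = rho * rng.uniform(size=(trials, 1)) ** (1.0 / n)
    y = x + r * g
    return np.mean(np.sum(y * y, axis=1) < np.dot(x, x))

print(success_probability(np.eye(30)[0], 0.5))   # well below 1/2
```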
Assume the algorithm (at iteration t) and the center of the coordinate system are both in point P, and the distance to the optimum is $R_t = R = |OP|$, such that the progress is measured in the positive direction of the first axis, Fig. 1. Denote $x_1$ by $x$, the remaining $n-1$ components $(x_2, \ldots, x_n)$ by $h$, and let C be some random point generated by mutation.
The progress between iterations t and t+1 is a 1-D r.v. that corresponds to the difference in distance to the optimum between the current ES position and the next. Due to selection, the progress is non-negative. For a successful mutation, point C is inside region $R^S_x$, the gray area in Fig. 1. If one applies Pythagoras to OC, the progress receives an equivalent expression in terms of the two perpendicular components $x$ and $h$, with $u = \|h\|^2$ [13, p. 54]

$$\text{progress} = R - \sqrt{(R - x)^2 + u} \qquad (4)$$

The progress r.v. depends on $R_t = \|Z_t\|$ and is observable only through its conditional expected values

$$\varphi_t(R) = E(R_t - R_{t+1} \mid R_t = R) \qquad (5)$$
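A quick numerical confirmation of decomposition (4), under the coordinate convention of Fig. 1 (P at the origin, O at distance R along the first axis); the snippet is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, R, rho = 10, 4.0, 1.0
O = np.zeros(n)
O[0] = R                           # optimum at distance R on the first axis
g = rng.standard_normal(n)
c = rho * rng.uniform() ** (1.0 / n) * g / np.linalg.norm(g)   # mutant C
x, u = c[0], np.dot(c[1:], c[1:])  # first component and u = ||h||^2
direct = R - np.linalg.norm(c - O)
pythagoras = R - np.sqrt((R - x) ** 2 + u)       # formula (4)
print(np.isclose(direct, pythagoras))            # True
```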
Before reviewing the existent estimates of φ, let us take a look at the candidate mutation distributions, responsible for the pdf f. The normal and the uniform belong to the same class of spherical multivariate distributions [20].

1) Every spherical distribution has the stochastic representation

$$x = r \cdot u_n \qquad (6)$$

for some 1-D r.v. (radius) $r$ and the uniform distribution $u_n$ on the unit sphere. Moreover, $r$ and $u_n$ are independent, and $r = \|x\| \ge 0$.

2) If the spherical distribution has pdf $g$, then $g(x) = g(\|x\|)$ and there is a special connection between $g$ and $f_r$, the pdf of $r$

$$f_r(s) = \frac{2\pi^{n/2}}{\Gamma(n/2)}\, s^{n-1} g(s), \qquad s \ge 0 \qquad (7)$$

Historically, the normal distribution was discovered first, applied, and thoroughly analyzed in statistics, as the only spherical distribution with independent and identically distributed components (univariate marginals). In contrast, the components of the uniform distribution inside the sphere are neither independent nor uniform.
In the theory and practice of ES, the normal distribution played from the beginning the central role [12], [41]. Using an ingenious normalization of the mutation parameter σ (standard deviation) and of the expected progress

$$\sigma^* = \frac{\sigma n}{R}, \qquad \varphi^* = \frac{\varphi n}{R} \qquad (9)$$

and approximating an r.v. by its mean (expected) value, Rechenberg and Beyer proved an asymptotic formula for the normalized expected progress φ*, valid for large space dimension n [37], [13, pp. 32, 67]

$$\varphi^*(\sigma^*) = \frac{\sigma^*}{\sqrt{2\pi}}\, e^{-\sigma^{*2}/8} - \frac{\sigma^{*2}}{2}\,\Phi\!\left(-\frac{\sigma^*}{2}\right) \qquad (10)$$

The unimodal form of the function $\varphi^* = \varphi^*_t(\sigma^*)$ is essential to the classic ES theory. Assuming the 1/5-rule manages to keep φ* at its maximum (0.202), a lower bound on the global convergence of the algorithm is achieved [13, pp. 48-50]. Back-application of the normalization (9) provides

$$E(R_{t+1} \mid R_t) \ge R_t \left(1 - \frac{0.202}{n}\right) \qquad (11)$$

The difference equation corresponding to (11) can be approximated by a separable differential equation

$$R'(t) = -\frac{0.202}{n}\, R(t) \qquad (12)$$

which solves to (apply the inequality again)

$$R_t \ge R_0 \exp\left(-\frac{0.202\, t}{n}\right) \qquad (13)$$

If $R_t = \epsilon$ and $C = 1/0.202$, inequality (13) can be read as $t \ge C\, n \log(R_0/\epsilon)$, so the convergence time of the algorithm with normal mutation on SPHERE has lower linear bounds, with respect to the initial distance to optimum $R_0$ and the space dimension $n$.
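The quoted maximum of (10), φ* = 0.202 at σ* = 1.224, can be recovered numerically; a minimal check, assuming SciPy is available.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def phi_star(s):
    # Asymptotic normalized expected progress (10), normal mutation.
    return s / np.sqrt(2 * np.pi) * np.exp(-s ** 2 / 8) \
        - s ** 2 / 2 * norm.cdf(-s / 2)

res = minimize_scalar(lambda s: -phi_star(s), bounds=(0.1, 5), method="bounded")
print(res.x, -res.fun)   # ~1.224 and ~0.202
```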
Under a new mutation, uniform on the sphere of radius ρ, a different progress definition and normalization were used

$$\text{progress} = R_t^2 - R_{t+1}^2 \quad \text{(quality gain)} \qquad (14)$$

and Rudolph proved an expected progress formula similar to (10) [40, pp. 170-172]. Finally, Jägersküpper added superior computational complexity skills to the classic ES theory. Avoiding a direct estimation of the local progress, using only the decomposition (6) and the radius properties of the normal mutation $N(0, \sigma I_n)$, he proved the missing upper linear bounds on global convergence [25], [26], [27].
Noting the differences between local progress formulas (10)-(16), we revisit the classic (1+1) ES theory on SPHERE under two new mutation distributions: 1) uniform inside the sphere of radius ρ and 2) sum of two uniforms. Concerning global convergence, we gain more insight into the cyclic behavior of the algorithm under the 1/5-rule and show that Theorem 1 still holds, if normal mutation is replaced by uniform mutation. Moreover, we show that the local behavior analysis can be used, in conjunction with Theorem 1, to render a prediction formula similar to (13).
The multivariate distributions are introduced in Section II. The local behavior of the ES with uniform mutation is analyzed in Section III, while its global behavior is discussed in Section IV. The theoretical formulas are tested against real algorithms in Section V. The proofs, MATLAB files, and some experimental results are deferred to the supplemental material.

II. UNIFORM AND NORMAL DISTRIBUTIONS
The n-dimensional sphere of radius ρ is defined, together with its volume, as

$$S_\rho = \{\, x \in \mathbb{R}^n \mid \|x\| \le \rho \,\}, \qquad V_n(\rho) = \frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2}+1\right)}\, \rho^n \qquad (17)$$

The uniform distribution inside the sphere of radius ρ, denoted UNIFORM$_\rho$ in this article, is the simplest continuous multivariate, since it assumes a constant pdf

$$f(x) = \frac{1}{V_n(\rho)}, \qquad x \in S_\rho \qquad (18)$$

More insight is gained from [17] and [20].
Proposition 1: If $x \in \mathbb{R}^n$ is UNIFORM$_\rho$, then:

1) the radius (length of the mutation vector) $r$ is distributed Beta$_{n,1}$ (power distribution), with pdf

$$f_r(s) = \frac{n\, s^{n-1}}{\rho^n}, \qquad 0 \le s \le \rho \qquad (19)$$

2) the pdf of the first component $x$ is

$$f_1(x) = \frac{\Gamma\!\left(\frac{n}{2}+1\right)}{\sqrt{\pi}\,\Gamma\!\left(\frac{n+1}{2}\right)\rho} \left(1 - \frac{x^2}{\rho^2}\right)^{\frac{n-1}{2}}, \qquad |x| \le \rho \qquad (20)$$

The n-dimensional normal multivariate $N(0, \sigma I_n)$, with independent components, zero mean, and standard deviation parameter σ, is defined in [13, pp. 54, 62].
Proposition 2: If $x \in \mathbb{R}^n$ is $N(0, \sigma I_n)$, then:

1) The radius $r_\sigma$, as it is commonly denoted in the ES literature, is distributed σχ, with pdf

$$f_r(s) = \frac{s^{n-1}\, e^{-s^2/(2\sigma^2)}}{2^{n/2-1}\,\Gamma\!\left(\frac{n}{2}\right)\sigma^n}, \qquad s \ge 0 \qquad (21)$$

2) The pdf of the first component $x$ is

$$f_1(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)} \qquad (22)$$

3) The joint pdf of $h$ is

$$f_h(h) = \frac{1}{(\sigma\sqrt{2\pi})^{n-1}}\, e^{-\|h\|^2/(2\sigma^2)} \qquad (23)$$

The above properties are summarized in Table I.⁹ If we inspect the differences from the point of view of global optimization, besides the bounded/unbounded support and the dependent/independent components, a comparison of the two radii is also interesting. As $n \to \infty$, UNIFORM's radius mean converges to the parameter ρ, while its variance tends to zero. This multidimensional behavior lies at the core of ES theory. Intuitively, the points generated uniformly inside the n-sphere concentrate, for large n, toward the sphere's surface; see also [2].

⁹Radius mean/variance are approximations w.r.t. large n.
The variance of the normal radius is constant with respect to n. Yet, when initiating the ES theory, Rechenberg and Schwefel had the brilliant idea of scaling the parameter σ, such that the model fulfills the zero-limit-variance property of UNIFORM¹⁰ [37], [41]

$$\sigma = \frac{\rho}{\sqrt{n}} \quad \Longrightarrow \quad E(r_\sigma) \approx \rho, \qquad \mathrm{Var}(r_\sigma) \approx \frac{\rho^2}{2n} \longrightarrow 0$$

Under this normalization, the classic ES theory was constructed, from (1+1) up to (μ, λ) and (μ/μ, λ) algorithms [13], [14].
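The concentration of the two radii can be observed empirically; a short simulation under the σ = ρ/√n scaling, with arbitrary sample sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, m = 1.0, 100000
for n in (3, 30, 300):
    r_uniform = rho * rng.uniform(size=m) ** (1.0 / n)   # Beta_{n,1} radius
    sigma = rho / np.sqrt(n)                             # matched scaling
    r_normal = np.linalg.norm(sigma * rng.standard_normal((m, n)), axis=1)
    print(n, r_uniform.mean(), r_uniform.std(), r_normal.mean(), r_normal.std())
# Both radius means approach rho while the deviations vanish with n.
```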
Let us now consider two independent, uniformly generated points inside the sphere, $x_1$ and $x_2$. The distribution of the sum can be derived from the pdf of the distance between the points, since, due to the symmetry $x_2 \stackrel{d}{=} -x_2$,

$$\|x_1 + x_2\| \stackrel{d}{=} \|x_1 - x_2\|$$

The problem of finding the distribution of the distance inside a hypersphere was solved in [33] and recently extrapolated to other types of uniform distributions [31].
Since, for the ES analysis, we need a spherical distribution different from (18) but with the same support, the sum of $x_{1,2}$ must be halved, yielding a sum of two independent uniform distributions inside the ρ/2-sphere. Such a sum is also spherical, with stochastic representation (6) and radius¹¹ $r_s$ provided in [31] and [33].
The moments of the r.v. $r_s$ are also calculated in [31] and can be further approximated for large n.
Corollary 1: In the conditions of Theorem 2, as $n \to \infty$

$$E(r_s) \to \frac{\rho}{\sqrt{2}}, \qquad \mathrm{Var}(r_s) \to 0 \qquad (27)$$
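A Monte Carlo illustration of Corollary 1, using the ρ/2-sphere scaling of this section; the batch sampler is an illustrative helper.

```python
import numpy as np

def uniform_in_sphere_batch(m, n, rho, rng):
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return rho * rng.uniform(size=(m, 1)) ** (1.0 / n) * g

rng = np.random.default_rng(4)
rho, m = 1.0, 100000
for n in (10, 100, 1000):
    # Sum of two independent uniforms inside the rho/2-sphere.
    s = uniform_in_sphere_batch(m, n, rho / 2, rng) \
        + uniform_in_sphere_batch(m, n, rho / 2, rng)
    r_s = np.linalg.norm(s, axis=1)
    print(n, r_s.mean(), rho / np.sqrt(2), r_s.std())
# The radius mean approaches rho/sqrt(2) and its deviation vanishes.
```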
Using the pdf of the radius $r_s$ and the first component of the uniform on the sphere, one can also determine the first component of the sum of uniforms inside the sphere.

Proposition 3: In the conditions of Theorem 2, the pdf of the first component of $(x_1 + x_2)/2$ admits a closed-form expression for any $x \in [-\rho, \rho]$.

¹⁰The idea, one may assume, came not from the above theoretical properties of the uniform distribution, but rather from extensive numerical simulations with the normal distribution.
¹¹The expressions associated with the 'sum' of two uniform distributions are marked with 's'.

III. LOCAL BEHAVIOR
We apply UNIFORM$_\rho$ as the mutation operator of the algorithm minimizing the SPHERE. Since the radius ρ of the mutation sphere is assumed fixed within this section, and the uniform distribution (unlike the normal) has compact support, one should distinguish between two different cases: small and large step size. These cases correspond to the algorithm's dynamics: as the ES approaches the optimum under a constant mutation radius, the success region varies from the intersection of two spheres (small step size, with respect to the distance to optimum) to a full sphere (large step size). The latter case corresponds to the worst-case scenario, where mutation only rarely succeeds and the algorithm stagnates. Therefore, a practical ES like the one depicted in Algorithm 1 avoids that situation by progressively lowering the step size via the 1/5-rule.
Since the relevance of the large step size is purely mathematical (under uniform mutation, the success probability is the derivative, with respect to the distance to optimum, of the expected progress [3]), this article tackles only the small step case ρ < R, depicted in Fig. 1.
Section III-A provides estimates for the success probability and the expected progress of the (1+1) ES with uniform mutation. The finite-dimension formulas are proved to converge, as n → ∞, to the asymptotics of the (1+1) ES with normal mutation. Section III-B derives the expected progress of the (1/2+1) ES (with the sum of uniforms mutation) in a similar manner.

A. Expected Progress of (1+1) ES
The geometry of the success region (2) for the uniform operator is depicted in Fig. 1, under the small step assumption ρ < R. The success region is the union of two spherical caps [4]: one with center O, radius R, and height $\rho^2/(2R)$, denoted $A_1$, corresponding to the SPHERE fitness function; the other with center P, radius ρ, and height $\rho - \rho^2/(2R)$, denoted $A_2$, corresponding to the uniform mutation inside the sphere.
All integrals, and the expected progress accordingly, may be split into two parts, corresponding to $A_1$ and $A_2$. If we apply the dimension reduction to the last $n-1$ components, $A_{1,2}$ have the following 2-D representation with respect to $(x, u)$:

$$A_1 = \left\{ (x, u) \;\middle|\; 0 \le x \le \frac{\rho^2}{2R},\; 0 \le u \le 2Rx - x^2 \right\} \qquad (30)$$

$$A_2 = \left\{ (x, u) \;\middle|\; \frac{\rho^2}{2R} \le x \le \rho,\; 0 \le u \le \rho^2 - x^2 \right\} \qquad (31)$$

For estimating the performance of uniform mutation, one has to integrate only over $A_1 \cup A_2$. This makes the analysis easier than that of normal mutation, and actually gives a better picture of that case as well.

In order to compare two algorithms, one with uniform, the other with normal mutation, one should first match their most important parameters: the radius means. The inspection of Table I suggests

$$a = \frac{\rho}{\sqrt{n}} \qquad (32)$$

We also apply a second normalization, of both the new mutation parameter $a$ and the expected progress φ, similar to the normal case (9) [13, p. 32]

$$a^* = \frac{a\, n}{R}, \qquad \varphi^* = \frac{\varphi\, n}{R} \qquad (33)$$

Applied successively, (32) and (33) provide $\rho/R = a^*/\sqrt{n}$, and the small step assumption reads $a^* < \sqrt{n}$.

Proposition 4: Let a (1+1) ES with UNIFORM$_\rho$ mutation minimize the SPHERE and ρ < R. Then, the success probability is given approximately¹³ by a finite-dimension formula (34) involving the Beta distribution.

The success probability of the uniform case converges, as n tends to infinity, to the asymptotic of the normal case [13, p. 68]; see Fig. 4.

Proposition 5: The success probability of the (1+1) ES in Proposition 4 is, for large n, approximately

$$P_s \approx 1 - \Phi\!\left(\frac{a^*}{2}\right) \qquad (35)$$

The equality of the success probability asymptotics indicates that the two operators, 1) uniform and 2) normal, might perform similarly on SPHERE.
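Proposition 5 can be checked by direct simulation; the following sketch estimates the success probability for several $a^*$ values and compares it with $1 - \Phi(a^*/2)$. The geometry follows Fig. 1; parameters are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, R, m = 100, 1.0, 200000
for a_star in (0.5, 1.224, 2.0):
    rho = a_star * R / np.sqrt(n)          # rho/R = a*/sqrt(n), see (32)-(33)
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    c = rho * rng.uniform(size=(m, 1)) ** (1.0 / n) * g
    # Success iff the mutant is closer to O: (R - x)^2 + u < R^2.
    success = (R - c[:, 0]) ** 2 + np.sum(c[:, 1:] ** 2, axis=1) < R ** 2
    print(a_star, success.mean(), norm.cdf(-a_star / 2))   # MC vs. (35)
```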
Yet, the equality of probabilities does not imply the equality of expected values, which requires a different calculus. As before, the integration region will be confined to $A_2$.

Theorem 3: Let a (1+1) ES with UNIFORM$_\rho$ mutation minimize the SPHERE and ρ < R. Then, the normalized expected progress is given approximately by a formula (36), valid for arbitrary finite n, in which the Beta distribution plays the role of the normal.

Apart from an easier proof, Theorem 3 provides for the first time an expected progress formula on SPHERE valid for arbitrary, finite space dimension n.
The next result proves that asymptotically, for large n, the expected progress of the algorithm with uniform mutation inside the sphere is the same as the expected progress of the algorithm with normal mutation (10).
Theorem 4: The normalized expected progress of the (1+1) ES in Theorem 3 is, for large n, approximately

$$\varphi^*(a^*) = \frac{a^*}{\sqrt{2\pi}}\, e^{-a^{*2}/8} - \frac{a^{*2}}{2}\left[1 - \Phi\!\left(\frac{a^*}{2}\right)\right] \qquad (37)$$

Note that the square bracket in (37) is the success probability limit (35). The same decomposition of the expected progress into a sum of two terms, the first positive and linear in $a^*$, the second negative, quadratic in $a^*$ and multiplied by the success probability, holds for the finite dimension case, (34), (36).
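Similarly, the one-step expected progress under elitist selection can be simulated and set against (37); a sketch with arbitrary n and $a^* = 1.224$.

```python
import numpy as np
from scipy.stats import norm

def phi_limit(a):
    # Asymptotic normalized expected progress (37).
    return a / np.sqrt(2 * np.pi) * np.exp(-a ** 2 / 8) \
        - a ** 2 / 2 * (1 - norm.cdf(a / 2))

rng = np.random.default_rng(6)
n, R, m, a_star = 200, 1.0, 200000, 1.224
rho = a_star * R / np.sqrt(n)
g = rng.standard_normal((m, n))
g /= np.linalg.norm(g, axis=1, keepdims=True)
c = rho * rng.uniform(size=(m, 1)) ** (1.0 / n) * g
dist = np.sqrt((R - c[:, 0]) ** 2 + np.sum(c[:, 1:] ** 2, axis=1))
progress = np.maximum(R - dist, 0.0)          # elitist selection
print(progress.mean() * n / R, phi_limit(a_star))   # both near 0.202
```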
Proposition 6: Let $f_n$ be the expected progress formula from Theorem 3, let $f$ be the limit from Theorem 4, and let $M > 0$. Then $f_n \to f$ uniformly on $[0, M]$.

¹³The estimate yields from $A_2$; the term corresponding to $A_1$ vanishes for large n. $I(x, a, b)$ stands for the Beta distribution.

Fig. 2. Expected progress of the (1+1) ES: the finite-dimension analytical formula (36) (dashed lines) versus real data (dots) and the uniform-normal asymptotics (10)-(37) (solid).
The estimate (36) is depicted for different values of n, against the asymptotic (37) and real data, in Fig. 2. Similar to the classic ES theory [13, pp. 68-69], we observe a so-called evolution window, that is, an $a^*$-interval where the mutation parameter should be kept during the algorithm's evolution in order to gain a large expected progress φ*. A maximal progress rate φ* = 0.202 can be observed for the limit curve in Fig. 2, attained for the optimal parameter value $a^* = 1.224$ [37].
Besides proving the validity of the normal mutation asymptotic for a different operator, the above analysis shows that a similar formula, but with the Beta distribution instead of the normal, holds for the finite dimension case.

B. Expected Progress of (1/2+1) ES
The main conclusion of Section III-A is as follows.

Under a proper scaling, there is no difference between the normal and uniform mutation.
In order to test the statement on yet another multivariate, the uniform on the sphere would be the first candidate. Already studied by Rudolph, the uniform on the sphere shares the same radius mean (ρ) with the normal and the uniform inside the sphere, so one may expect a formula identical to (10) and (37). In search for another distribution, the sum of uniforms appeared as a natural choice. It is spherical, its radius is analytically tractable, and, under a convenient scaling, the radius r.v. has the same support as the uniform, but a different (asymptotic) mean: ρ/√2; see Section II. Tractable as it is, the sum's radius is less manageable than the uniform's. For that reason, we decided to replace the radius of the multivariate by its mean value (justified by Corollary 1) and to use only the first component of the r.v., provided by Proposition 3. A similar approach has been taken in [2], [13], and [40]. Since we follow this path, the arbitrary dimension case is precluded and only an asymptotic formula for large n shall be derived.

Fig. 3. Expected progress asymptotics (37), (38). Evolution windows of the (1+1) and (1/2+1) ES, for the threshold 0.188 corresponding to success probability 0.2. Once entered, the gray region is w.o.p. never left, due to the 1/5-rule, which leads to global convergence.
Theorem 5: Let a (1/2+1) ES with UNIFORM$_\rho$ mutation minimize the SPHERE. Then, the normalized expected progress is, for large n, approximately

$$\varphi_s^*(a^*) = \frac{a^*}{\sqrt{4\pi}}\, e^{-a^{*2}/16} - \frac{a^{*2}}{4}\left[1 - \Phi\!\left(\frac{a^*}{2\sqrt{2}}\right)\right] \qquad (38)$$

The asymptotic (38) is depicted versus that of the (1+1) ES in Fig. 3. We note that the two algorithms have comparable performance; they are also similar in terms of efficiency, as both require only one call of the fitness function per iteration. The curves exhibit the same maximal value, 0.202, attained at $a^* = 1.224$ by the (1+1) ES and at $a^* = 1.73$ by the (1/2+1) ES. Up to the crossing point of the curves, the (1+1) ES exhibits a larger progress; below that point, the (1/2+1) ES prevails and also exhibits a larger evolution window. For the global behavior, the implication is that the algorithm will converge faster in the first phase and slower in the second phase, as confirmed by the dynamics of the constant-mutation ES in Fig. 5.
A comparison of formulas (38) and (37) points out a remarkable property of elitist ES.
Corollary 2: In the conditions of Theorems 3 and 5, the following relation holds between the expected progress of the (1+1) and the (1/2+1) ES:

$$\varphi_s^*(a^*) = \varphi^*\!\left(\frac{a^*}{\sqrt{2}}\right) \qquad (39)$$

Note that $1/\sqrt{2}$ is the ratio of the (mean value) radii corresponding to the spherical distributions of the two algorithms, see Table I and (27), which leads to the following reading of Corollary 2: the one-step expected progress of the (1/2+1) ES is provided by the expected progress formula of the (1+1) ES, with the argument scaled by the ratio of the corresponding spherical mean radii. That makes the case for a stronger conjecture: the local behavior of an elitist ES with spherical mutation is determined by the expected progress formula of the (1+1) ES and the mean value of the mutation's spherical radius. In other words, the actual ES operators are not relevant, but only the mutation radius, that is, the expected Euclidean distance from the current point, acquired by the random algorithm in one step, prior to elitist selection. Considering the uniform inside the sphere as the basis for the (1+1) ES, the above conjecture holds for the SPHERE fitness landscape and the special mutation used in Algorithm 1-B (sum of uniforms), but also for the normal mutation.
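Corollary 2 is easy to verify numerically from the two asymptotics; a minimal check (the function names are ours).

```python
import numpy as np
from scipy.stats import norm

def phi_1p1(a):     # asymptotic (37)
    return a / np.sqrt(2 * np.pi) * np.exp(-a ** 2 / 8) \
        - a ** 2 / 2 * (1 - norm.cdf(a / 2))

def phi_half(a):    # asymptotic (38)
    return a / np.sqrt(4 * np.pi) * np.exp(-a ** 2 / 16) \
        - a ** 2 / 4 * (1 - norm.cdf(a / (2 * np.sqrt(2))))

a = np.linspace(0.1, 6.0, 60)
print(np.allclose(phi_half(a), phi_1p1(a / np.sqrt(2))))   # True: Corollary 2
```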
If we transfer the functional dependency from Corollary 2 onto the success probability, we obtain an analogous scaling relation between the success probabilities of the two algorithms.

IV. GLOBAL CONVERGENCE
Using spherical distributions and multivariate calculus, the local analysis showed that the uniform mutation(s) obey the same expected progress function as the normal mutation. Therefore, the best-case scenario corresponding to the maximal value φ* = 0.202 (see Fig. 3) applies as well, yielding inequality (13) and lower linear bounds on the convergence time. For upper bounds, the mathematical difficulty resides not in multivariate calculus, but in the computational complexity analysis of the 1/5 success rule.¹⁷ Following Jägersküpper, we regard the (1+1) ES as a sequence of n-length phases (G = n) with constant mutation rate $\rho = a\sqrt{n}$ in each phase, during which the success frequency of mutation is observed before the application of the 1/5-rule [25], [26], [27].
Contrary to Jägersküpper and the classic ES theory, the 1/5-rule from Algorithm 1 does not increase the mutation rate if the success frequency is larger than 1/5. Therefore, the constant-mutation phase extends to a number of phases ≥ 1, which will be called a cycle. The r.v. "length of a cycle" will play an important role in the convergence time derivation.

¹⁷A previous attempt at extrapolating Jägersküpper's work to uniform mutation yielded only lower bounds [29].
For the local behavior, Jägersküpper avoided the calculus from Section III and used instead the decomposition (6) of the normal mutation distribution $N(0, \sigma I_n)$ into the r.v.s uniform on the sphere $u_n$ and radius $r_\sigma$. We resume and adapt in the following the results from [25], [26], and [27] to the (1+1) and (1/2+1) ES with mutation distribution UNIFORM$_\rho$ on SPHERE, current distance to optimum R, success probability Prob, radius r, and parameter $a = \rho/\sqrt{n}$.

Lemma 1: Let a (1+1) ES with any spherical mutation minimizing the SPHERE be in current point P, |OP| = R. The mutant C is accepted with Prob ∈ [ε, 1/2 − ε], ε > 0, iff $|PC| = \Theta(R/\sqrt{n})$.

Lemma 2: If X is UNIFORM$_\rho$ (or the sum of two uniforms) and r is the corresponding r.v. radius, then w.o.p. $r = \Theta(\rho)$. If $X_1, \ldots, X_n$ are independent copies of X, then for any λ ∈ (0, 1) there exist $a_\lambda, b_\lambda > 0$ such that the r.v. cardinal number of $\{i \mid a_\lambda \rho \le r_i \le b_\lambda \rho\}$ is ≥ λn w.o.p.

Lemma 3: Let a (1+1) or a (1/2+1) ES with UNIFORM$_\rho$ mutation minimize the SPHERE. The following are equivalent.
Let i and i+1 denote the states of the algorithm at the beginning and the end of the ith phase, respectively.

Lemma 5:
1) W.o.p. $R_{i+1} \le c\, R_i$ for some constant $c < 1$; that is, w.o.p. the approximation error is reduced by a constant fraction in the ith phase.
2) If $\rho_i$ is not modified after the ith phase (success frequency larger than 1/5), then $\rho_i = O(R_i/\sqrt{n})$.
3) If $\rho_i$ is halved after the ith phase (success frequency less than 1/5), then $\rho_{i+1} = \Theta(R_{i+1}/\sqrt{n})$.

Lemma 6: If the 1/5-rule causes a (k+1)-sequence of phases, $1 \le k = n^{O(n)}$, such that in the first phase ρ is halved and in all the following ones it is left unchanged, or the other way around, then w.o.p. the distance from the optimum is k times reduced by a constant fraction in these phases.
Lemmas 5 and 6 provide Theorem 1, now for the algorithm with uniform mutation and simplified 1/5-rule.
Even if this main result does not state the convergence to zero of the r.v. $R_t$ as $t \to \infty$, either in expectation, a.s., or w.o.p., it accounts for a form of linear convergence, with respect to both t and n.

Remark 1: According to Jägersküpper [25]: "For other starting conditions, the number of steps until the theorem's assumption is met must be estimated before the theorem can be applied - a rather simple task when utilizing the strong results presented in Lemma 5." The empirical evidence for this fact is provided in the following.
Theorem 6 does not provide a formula like (13), able to predict the algorithm's behavior. To that end, we analyze the r.v. T = "length of a constant-mutation ES cycle," defined as a stopping time [42, pp. 97-98].
Definition 2: An r.v. T with state space {0, 1, ..., ∞} is said to be a stopping time for the sequence of r.v.s¹⁸ $\{X_t\}_{t \in \mathbb{N}}$ if one can decide whether the event {T = t} has occurred only by observing $X_1, \ldots, X_t$.
The most important stopping time is the first hitting time of some distance d > 0 from the initial point $X_0$

$$T_d = \min\{\, t \ge 1 \mid X_1 + \cdots + X_t \ge d \,\}$$

We set in our analysis $X_0 = c = R_0 > d$ and $X_{t+1} = R_t - R_{t+1}$, the ES progress between iterations t and t+1, for t ≥ 0; hence

$$T_d = \min\{\, t \ge 1 \mid R_0 - R_t \ge d \,\}$$

Note that under a different setting, $X_0 = d$ and $X_{t+1} = d - (R_0 - R_{t+1})$, Wald's inequality can be obtained from the (proof of the) Additive Drift Theorem by Lehre and Witt [30], which is an adaptation to continuous space of an original discrete-space drift result of He and Yao [21]. But their supplementary assumption $X_t \ge 0$ for all t, yielding $X_{T_d} = 0$, is unrealistic for $d < R_0$. On the other hand, if we set $d = R_0$, the lower bound δ > 0 does not exist (for all t) in the continuous case.²⁰ That is why we do not use Wald's inequality for global convergence, but only for the convergence time.
Consider in our ES model some fixed n, $d < R_0$, and two arbitrary points on the positive $a^*$-axis, $0 < a^*_1 < a^*_2$. The distance between $a^*_1$ and $a^*_2$ corresponds to the un-normalized distance $d = \gamma\, a n$, where γ, $\gamma_{1,2}$ are constants,²¹ while the un-normalized bounds $\delta_{l,u}$ on the one-step expected progress over $[a^*_1, a^*_2]$ are $\delta_l = \gamma_1 a$ and $\delta_u = \gamma_2 a$. Wald's inequality then bounds the expected hitting time by

$$\frac{d}{\delta_u} \le E(T_d) \le \frac{d}{\delta_l} \qquad (44)$$

Renaming the constants, inequality (44) reads

$$c_1\, n \le E(T_d) \le c_2\, n \qquad (45)$$

¹⁸Since the ES is a Markov chain, we do not assume the independence of $X_1, X_2, \ldots$, as in Wald's equation and renewal theory [38].
¹⁹For an r.v. X and P(A) > 0, define $E(X|A) = E(X \cdot 1_A)/P(A)$.
²⁰The existence of a strictly positive lower bound on either the success probability or the expected progress can be seen as a sufficient convergence condition for continuous space algorithms, see also [39], yet it is plausible only for the discrete space.
²¹See the proof of Theorem 7 for a numerical example.
The expected hitting time of $d < R_0$ is thus $\Theta(n)$ and does not depend on the current mutation rate ρ, provided ρ is constant until d is covered. Hence, the elitist ES on SPHERE can be seen as a sequence of constant-mutation, independent, and identical (expected) length cycles. Moreover, since the lower-order terms are o(n), (47) can be read as $E(T_d) = \Theta(n)$.

Two particular settings for d (corresponding to normalized distances d* on the $a^*$-axis) are discussed next, corresponding to the evolution windows in Fig. 3.

On the other hand, Theorem 6 states, for t = 1, that if T is the (r.v.) number of iterations required to halve the initial distance to optimum $R_0$, then E(T) is $\Theta(n)$. That is, there exist a, b > 0 such that, for n large enough

$$a\, n \le E(T) \le b\, n \qquad (50)$$

Compare (49) with (50): both place the expected length of a halving cycle at a constant multiple of n; the numerical example in the proof of Theorem 7 sets that multiple at 3.9.²² Hence

$$R_{t + 3.9 n} = \frac{R_t}{2} \qquad (51)$$

Iterate (51) s times and substitute $3.9\, n s = t$ to get

$$R_t = R_0 \cdot 2^{-t/(3.9\, n)} \qquad (52)$$

which yields, under $\log 2 / 3.9 = 0.178$

$$R_t = R_0\, e^{-0.178\, t/n} \qquad (53)$$

Using $R_t = \epsilon$ and a derivation similar to the one following (13), the global linear convergence is obtained, now in terms of both lower and upper bounds.
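The prediction (53) can be set against a direct run of Algorithm 1-A; a sketch with illustrative parameters (n = 30, R_0 = 20, G = n), expecting comparable decay rates rather than exact agreement.

```python
import numpy as np

rng = np.random.default_rng(7)
n, R0, rho, G = 30, 20.0, 1.0, 30
x = np.zeros(n)
x[0] = R0
successes = 0
T = 40 * n
for t in range(1, T + 1):
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)
    y = x + rho * rng.uniform() ** (1.0 / n) * g
    if np.dot(y, y) < np.dot(x, x):          # elitist selection
        x = y
        successes += 1
    if t % G == 0:                           # simplified 1/5-rule
        if successes / G < 0.2:
            rho /= 2.0
        successes = 0
print(np.linalg.norm(x), R0 * np.exp(-0.178 * T / n))
# Simulated distance and prediction (53) decay at comparable linear rates.
```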
Even if, locally, the two mutation distributions (uniform, sum of uniforms) have different radius means (ρ, ρ/√2) and evolution windows, they behave identically in the long run, Fig. 6. The explanation is that the time required to halve the distance to optimum (the length of a constant-mutation cycle) is the same for the two algorithms, as indicated by Theorem 7.

V. EXPERIMENTS

Apply the normalization (32) and (33) and reinsert n into the asymptotics (37) and (38), such that φ = φ(R, ρ, n). For an arbitrarily fixed initial value $R_0 = 5$, the difference equation

$$R_{t+1} = R_t - \varphi(R_t, \rho, n) \qquad (54)$$

is iterated numerically for 1000 steps, with the vector $R_t$ recorded as the prediction. The obtained functions are depicted in Fig. 5, versus simulations with the (1+1) and (1/2+1) ES using uniform mutation inside the sphere of constant radius ρ. Results are averaged over 100 independent runs in order to smooth the data.²³ Away from the optimum, the algorithms evolve down to a critical distance to optimum $R_{crit}$, from which point they begin to stagnate. Note that $R_{crit}$ depends on the radius ρ and the space dimension n, but also on the type of mutation used, Table II.

²²The empirical data suggest a slightly different mean value of $E(T_d)$.
²³The (1+1) ES with normal mutation and σ ∈ {0.31, 0.18, 0.1} has been tested with similar results.

Fig. 6. Log-distance toward the optimum, averaged over 100 independent runs of Algorithms 1-A and B, initial radius $\rho_0 = 1$ and space dimension n = 10, 30, 100. Theoretical predictions (53) (lines) are depicted versus real data (dots). Initial distance to optimum is $R_0 = 20, 5, 1$, from top to bottom.
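A sketch of the iteration of (54) for the (1+1) ES, using the asymptotic (37) with n reinserted; it reproduces the stagnation at $R_{crit}$ discussed above (parameters are illustrative).

```python
import numpy as np
from scipy.stats import norm

def phi(R, rho, n):
    # Asymptotic (37) with n reinserted: phi = phi*(a*) R/n, a* = rho sqrt(n)/R.
    a = rho * np.sqrt(n) / R
    p = a / np.sqrt(2 * np.pi) * np.exp(-a ** 2 / 8) \
        - a ** 2 / 2 * (1 - norm.cdf(a / 2))
    return max(p, 0.0) * R / n    # elitism: the progress is non-negative

R, rho, n = 5.0, 1.0, 30
for _ in range(1000):
    R = R - phi(R, rho, n)        # difference equation (54)
print(R, rho * np.sqrt(n) / 6)    # stagnation near R_crit = rho sqrt(n)/a*_crit
```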
The critical values of a* can be observed in Fig. 3, as the points where the φ*-graphs touch the x-axis: $a^*_{crit} \approx 6$ for the (1+1) ES and $a^*_{crit} \approx 9$ for the (1/2+1) ES. The critical distances are calculated as $R_{crit} = \rho\sqrt{n}/a^*_{crit}$, and their logarithmic values correspond to the asymptotes of the graphs in Fig. 5, as $t \to \infty$.
The differences between the global behavior of the (1+1) and (1/2+1) ES, apparent in Fig. 5, vanish when the mutation parameter ρ is adapted by the 1/5-rule, see Fig. 6. The theoretical prediction is now provided by (53). Different space dimensions n and starting points $R_0$ have been considered. Note that experiments with normal mutation led to identical results.

VI. CONCLUSION
Under normal mutation, the classic ES theory proved asymptotic formulas for the local expected progress and an exponential formula which accurately predicts the real behavior of the (1+1) ES with 1/5 success rule on SPHERE. As for the global linear convergence time, with respect to both time t and space dimension n, the classic theory assumed, optimistically, that the 1/5-rule somehow manages to always keep the local progress at its maximum. Removing that unrealistic assumption, Jägersküpper was able to prove linear convergence rigorously, in terms of lower and upper bounds. Yet, his analysis lacks an exact prediction formula.
Under two new mutation distributions, 1) uniform inside the sphere and 2) sum of uniforms, we showed that: 1) the local expected progress formula is the same as for normal mutation; globally, all mutations perform the same; 2) a simplified 1/5-rule (only decrease, never increase) works as well; under this rule, the ES can be regarded as a sequence of constant-mutation, independent, and identical (expected) length cycles; and 3) all of Jägersküpper's results hold for uniform mutation, and a prediction formula can be derived from the cyclic representation of the ES. A similar analysis of the (1+1) ES with uniform mutation on the RIDGE has already been carried out [2], while the case of (1, λ) strategies is the natural next step.