By Topic

SECTION I

EVOLUTIONARY algorithms (EAs) have been applied successfully to many optimization problems [24]. However, despite several decades of research, many fundamental questions about their behavior remain open. One of the central questions regarding EAs is to understand the interplay between the selection mechanism and the genetic operators. Several authors have suggested that EAs must find a balance between maintaining a sufficiently diverse population to *explore* new parts of the search space, and at the same time *exploit* the currently best found solutions by focusing the search in this direction [8], [9], [29]. In fact, the tradeoff between exploration and exploitation has been a common theme not only in evolutionary computation, but also in operations research and artificial intelligence in general. However, few theoretical studies actually exist that explain how to define such a tradeoff quantitatively and how to achieve it. Our paper can be regarded as one of the first rigorous runtime analyses of EAs that addresses the interaction between exploration, driven by mutation, and exploitation, driven by selection.

Much research has focused on finding measures to quantify the selection pressure in selection mechanisms—without taking into account the genetic operators—and subsequently on investigating how EA parameters influence these measures [1], [2], [3], [9], [25]. One such measure, called the *take-over time*, considers the behavior of an evolutionary process consisting only of the selection step, and no crossover or mutation operators [1], [9]. Subsequent populations are produced by selecting individuals from the previous generation, keeping at least one copy of the fittest individual. Hence, the population will after a certain number of generations only contain those individuals that were fittest in the initial population, and this time is called the take-over time. A short take-over time corresponds to a high selection pressure. Other measures of selection pressure consider properties of the distribution of fitness values in a population that is obtained by a single application of the selection mechanism to a population with normally distributed fitness values. One of these properties is the *selection intensity*, which is the difference between the average population fitness before and after selection [25]. Other properties are *loss of diversity* [2], [20] and higher order cumulants of the fitness distribution [3].
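The take-over time can be estimated empirically. The following simulation is an illustrative sketch, not from the paper: it assumes binary tournament selection on a population with distinct fitness values, keeps one copy of the fittest individual in each generation, and counts generations until the population is homogeneous.

```python
import random

def takeover_time(lam, rng):
    """Generations until a population evolved by selection alone
    (binary tournament, keeping one copy of the fittest) consists
    only of copies of the initially fittest individual."""
    pop = [rng.random() for _ in range(lam)]   # distinct fitness values
    best, gens = max(pop), 0
    while any(f != best for f in pop):
        nxt = [best]                           # keep one copy of the fittest
        while len(nxt) < lam:
            nxt.append(max(rng.choice(pop), rng.choice(pop)))
        pop = nxt
        gens += 1
    return gens

rng = random.Random(0)
avg = sum(takeover_time(100, rng) for _ in range(50)) / 50
```

A shorter average corresponds to a higher selection pressure; for binary tournament selection, the take-over time grows only logarithmically in the population size.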

To completely understand the role of selection mechanisms, it is necessary to also take into account their interplay with the genetic operators. There exist few rigorous studies of selection mechanisms when used in combination with genetic operators. Happ *et al.* considered fitness-proportional selection, which is one of the first selection mechanisms to be employed in evolutionary algorithms [11]. Early research in evolutionary computation pointed out that this selection mechanism suffers from various deficiencies, including population stagnation due to low selection pressure [29]. Indeed, the results by Happ *et al.* show that variants of RLS and the $(1{+}1)$ EA that use fitness-proportional selection have exponential runtime on the class of linear functions [11]. Their analysis was limited to single-individual based EAs. Neumann *et al.* showed that even with a population-based EA, the OneMax problem cannot be optimized in polynomial time with fitness-proportional selection [22]. However, they pointed out that polynomial runtime can be achieved by scaling the fitness function. Witt also studied a population-based algorithm with fitness-proportional selection, however with the objective of studying the role of populations [31]. Chen *et al.* analyzed the $(N{+}N)$ EA to compare its runtimes under truncation selection, linear ranking selection, and binary tournament selection on the LeadingOnes and OneMax problems [4]. They found the expected runtime on these fitness functions to be the same for all three selection mechanisms. None of the results above show how the balance between the selection pressure and the mutation rate impacts the runtime.

This paper rigorously analyses a non-elitist, population-based EA that uses linear ranking selection and bit-wise mutation. The main contributions are an analysis of situations where the mutation-selection balance has an exponentially large impact on the runtime, and new techniques based on branching processes for analyzing non-elitist, population-based EAs. This paper is based on preliminary work reported in [18], which contained the first rigorous runtime analysis of a non-elitist, population-based EA with stochastic selection. This paper significantly extends that early work. In addition to strengthening the main result, simplifying several proofs, and proving a conjecture, we have added a completely new section that introduces multi-type branching processes as an analytical tool for studying the runtime of EAs.

The following notation will be used in the rest of this paper. The length of a bitstring $x$ is denoted $\ell(x)$. The $i$th bit, $1\leq i\leq\ell(x)$, of a bitstring $x$ is denoted $x_{i}$. The concatenation of two bitstrings $x$ and $y$ is denoted by $x\cdot y$, or simply $xy$. Given a bitstring $x$, the notation $x[i,j]$, where $1\leq i<j\leq\ell(x)$, denotes the substring $x_{i}x_{i+1}\cdots x_{j}$. For any bitstring $x$, define $\Vert x\Vert:=\sum_{i=1}^{\ell(x)}x_{i}/\ell(x)$, i.e., the fraction of 1-bits in the bitstring. We say that an event holds with overwhelmingly high probability (w.o.p.) with respect to a parameter $n$ if the probability of the event is bounded from below by $1-e^{-\Omega(n)}$.

In contrast to classical algorithms, the runtime of EAs is usually measured in terms of the number of evaluations of the fitness function, and not the number of basic operations. For a given function and algorithm, the *expected runtime* is defined as the mean number of fitness function evaluations until the optimum is evaluated for the first time. The runtime on a class of fitness functions is defined as the supremum of the expected runtimes of the functions in the class [7]. The variable name $\tau$ will be used to denote the runtime in terms of number of generations of the EA. In the case of EAs that are initialized with a population of $\lambda$ individuals, and which in each generation produce $\lambda$ offspring, variable $\tau$ can be related to the runtime $T$ by $\lambda(\tau-1)\leq T\leq\lambda\tau$.

SECTION 1

In ranking selection, individuals are selected according to their fitness rank in the population. A ranking selection mechanism is uniquely defined by the probabilities $p_{i}$ of selecting an individual ranked $i$, for all ranks $i$ [2]. For mathematical convenience, an alternative definition due to Goldberg and Deb [9] is adopted, in which a function $\alpha:[0,1]\rightarrow\mathbb{R}$ is considered a ranking function if it is non-increasing and satisfies the following two conditions:

- $\alpha(x)\geq 0$;
- $\int_{0}^{1}\alpha(y)\,dy=1$.
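As a concrete example, the standard linear ranking function $\alpha(x)=\eta-2(\eta-1)x$ with selection pressure $1<\eta\leq 2$ satisfies both conditions. The following numerical check is an illustrative sketch; the exact definitions used in the paper appear in its equations (1) and (2), which are not reproduced in this excerpt.

```python
def alpha(x, eta):
    """Linear ranking function with selection pressure eta."""
    return eta - 2.0 * (eta - 1.0) * x

def beta(gamma, eta):
    """Closed-form integral of alpha over [0, gamma]."""
    return gamma * (eta - (eta - 1.0) * gamma)

eta = 1.5
xs = [i / 1000.0 for i in range(1001)]
# Condition 1: alpha is non-negative (and non-increasing) on [0, 1].
assert all(alpha(x, eta) >= 0.0 for x in xs)
# Condition 2: alpha integrates to 1 (trapezoidal rule is exact for a
# linear function, up to floating-point error).
integral = sum((alpha(xs[i], eta) + alpha(xs[i + 1], eta)) / 2.0
               for i in range(1000)) / 1000.0
assert abs(integral - 1.0) < 1e-9
assert abs(beta(1.0, eta) - 1.0) < 1e-12
```

The cumulative distribution $\beta$ computed here is the one a sampling mechanism would invert in practice.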

We consider a population-based non-elitist EA which uses linear ranking as selection mechanism. The crossover operator will not be considered in this paper. The pseudo-code of the algorithm is given above. After sampling the initial population $P_{0}$ at random in lines 1–5, the algorithm enters its main loop where the current population $P_{t}$ in generation $t$ is sorted according to fitness, then the next population $P_{t+1}$ is generated by independently selecting (line 9) and mutating (line 10) individuals from the previous population $P_{t}$. The analysis of the algorithm is based on the assumption that parameter $\chi$ is a constant with respect to $n$.

Linear ranking selection is indicated in line 9, where for a given selection pressure $\eta$, the cumulative probability of sampling individuals with rank less than $\gamma\lambda$ is $\beta(\gamma)$. It can be seen from the definitions of the functions $\alpha$ and $\beta$ that the upper bound $\beta(\gamma,\gamma+\delta)\leq\delta\cdot\alpha(\gamma)$ holds for any $\gamma,\delta>0$ with $\gamma+\delta\leq 1$. Hence, the expected number of times a uniformly chosen individual ranked between $\gamma\lambda$ and $(\gamma+\delta)\lambda$ is selected during one generation is upper bounded by $(\lambda/\delta\lambda)\cdot\beta(\gamma,\gamma+\delta)\leq\alpha(\gamma)$. We leave the implementation details of the sampling strategy unspecified, and assume that the EA has access to some sampling mechanism which draws samples perfectly according to $\beta$.
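Since the pseudo-code itself is not reproduced in this excerpt, the following Python sketch shows one possible implementation of the algorithm's structure. It is an assumption-laden illustration: the ranking distribution is taken to be $\beta(\gamma)=\gamma(\eta-(\eta-1)\gamma)$ (standard linear ranking with $\eta>1$), sampled here by inverting $\beta$, and all function and parameter names are ours.

```python
import math
import random

def linear_ranking_ea(f, opt, n, lam, eta, chi, max_gens, seed=0):
    """Non-elitist EA sketch: linear ranking selection (rank fraction
    sampled by inverting beta) and bit-wise mutation with rate chi/n."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    for gen in range(max_gens):
        pop.sort(key=f, reverse=True)               # rank 0 = fittest
        if f(pop[0]) >= opt:
            return gen, pop[0]
        nxt = []
        for _ in range(lam):
            u = rng.random()                        # solve beta(gamma) = u
            gamma = (eta - math.sqrt(eta * eta - 4.0 * (eta - 1.0) * u)) \
                    / (2.0 * (eta - 1.0))
            parent = pop[min(int(gamma * lam), lam - 1)]
            nxt.append([b ^ 1 if rng.random() < chi / n else b
                        for b in parent])            # bit-wise mutation
        pop = nxt
    return max_gens, max(pop, key=f)
```

For instance, `linear_ranking_ea(sum, n, n, 40, 2.0, 0.5, 500)` runs the sketch on OneMax with $\eta=2$ and $\chi=1/2$.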

For any constants $\sigma$, $\delta$, $0<\delta<\sigma<1-3\delta$, and integer $k\geq 1$, define the function
$${\rm SelPres}_{\sigma,\delta,k}(x):=\begin{cases}2n,&\text{if }x\in X_{\sigma}^{\ast}\text{, and}\\ \sum_{i=1}^{n}\prod_{j=1}^{i}x_{j},&\text{otherwise,}\end{cases}$$
where the set of optimal solutions $X_{\sigma}^{\ast}$ is defined to contain all bitstrings $x\in\{0,1\}^{n}$ satisfying
$$\begin{aligned}\Vert x[1,k+3]\Vert&=0,\\ \Vert x[k+4,(\sigma-\delta)n-1]\Vert&=1,\text{ and}\\ \Vert x[(\sigma+\delta)n,(\sigma+2\delta)n-1]\Vert&\leq 2/3.\end{aligned}$$

Except for the set of globally optimal solutions $X^{\ast}_{\sigma}$, the fitness function takes the same values as the well-known LeadingOnes fitness function, i.e., the number of leading 1-bits in the bitstring. The form of the optimal search points, which is illustrated in Fig. 1, depends on the three problem parameters $\sigma$, $k$, and $\delta$. The $\delta$-parameter is needed for technical reasons and can be set to any positive constant arbitrarily close to 0. Hence, the globally optimal solutions have approximately $\sigma n$ leading 1-bits, except for $k+3$ leading 0-bits. In addition, globally optimal search points must have a short interval after the first $\sigma n$ bits which does not contain too many 1-bits.
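A direct transcription of the definition may help. The sketch below uses 0-indexed Python lists (the definition above is 1-indexed), and the `int` rounding of the interval boundaries is our assumption.

```python
def selpres(x, sigma, delta, k):
    """SelPres_{sigma,delta,k}: 2n on the optimal set X*, and
    LeadingOnes (number of leading 1-bits) otherwise."""
    n = len(x)

    def frac(i, j):
        """Fraction of 1-bits in the 1-indexed interval [i, j]."""
        seg = x[i - 1:j]
        return sum(seg) / len(seg)

    optimal = (frac(1, k + 3) == 0
               and frac(k + 4, int((sigma - delta) * n) - 1) == 1
               and frac(int((sigma + delta) * n),
                        int((sigma + 2 * delta) * n) - 1) <= 2 / 3)
    if optimal:
        return 2 * n
    lo = 0                                  # LeadingOnes otherwise
    for bit in x:
        if bit == 0:
            break
        lo += 1
    return lo
```

For example, with $n=100$, $\sigma=1/2$, $\delta=1/8$, and $k=1$, a bitstring with $k+3=4$ leading 0-bits followed by a block of 1-bits up to position $(\sigma-\delta)n-1$ and 0-bits elsewhere is optimal.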

SECTION II

For any constant integer $k\geq 1$, let $T$ be the runtime of the linear ranking EA with population size $\lambda$, $n\leq\lambda\leq n^{k}$, constant selection pressure $\eta$, $1<\eta\leq 2$, and bit-wise mutation rate $\chi/n$, for a constant $\chi>0$, on the function ${\rm SelPres}_{\sigma,\delta,k}$ with parameters $\sigma$ and $\delta$, where $0<\delta<\sigma<1-3\delta$. Let $\epsilon>0$ be any constant.

- If $\eta<\exp(\chi(\sigma-\delta))-\epsilon$, then for some constant $c>0$, $\Pr(T\geq e^{cn})=1-e^{-\Omega(n)}$.
- If $\eta=\exp(\chi\sigma)$, then $\Pr(T\leq n^{k+4})=1-e^{-\Omega(n)}$.
- If $\eta>(2\exp(\chi(\sigma+3\delta))-1)/(1-\delta)$, then ${\bf E}[T]=e^{\Omega(n)}$.

The theorem follows from Theorems 4, 5, and Corollary 1. ▪

Theorem 1 describes how the runtime of the linear ranking EA on fitness function ${\rm SelPres}_{\sigma,\delta,k}$ depends on the main problem parameters $\sigma$ and $k$, the mutation rate $\chi$ and the selection pressure $\eta$. The theorem is illustrated in Fig. 2 for problem parameter $\sigma=1/2$. Each point in the grey area indicates that for the corresponding values of mutation rate $\chi$ and selection pressure $\eta$, the EA has either expected exponential runtime or exponential runtime with overwhelming probability (i.e., is highly inefficient). The thick line indicates values of $\chi$ and $\eta$ where the runtime of the EA is polynomial with overwhelmingly high probability (i.e., is efficient). The runtime in the white regions is not analyzed.

The theorem and the figure indicate that setting one of the two parameters of the algorithm (i.e., $\eta$ or $\chi$) independently of the other is insufficient to guarantee polynomial runtime. For example, setting the selection pressure parameter to $\eta:=3/2$ yields polynomial runtime only for certain settings of the mutation rate parameter $\chi$, while it leads to exponential runtime for other settings. Hence, it is the balance between the mutation rate $\chi$ and the selection pressure $\eta$, i.e., the *mutation-selection balance*, that determines the runtime of the linear ranking EA on this problem. More specifically, an overly high setting of the selection pressure parameter $\eta$ can be compensated for by increasing the mutation rate parameter $\chi$. Conversely, an overly low setting of the mutation rate parameter $\chi$ can be compensated for by decreasing the selection pressure parameter $\eta$. Furthermore, the theorem shows that the runtime can be highly sensitive to the parameter settings. Notice that the margins between the different runtime regimes are determined by the two parameters $\epsilon$ and $\delta$, which can be set to any constants arbitrarily close to 0. Hence, decreasing the selection pressure below $\exp(\chi\sigma)$ by any constant, or increasing the mutation rate above $\ln(\eta)/\sigma$ by any constant, will increase the runtime from polynomial to exponential. Finally, note that the optimal mutation-selection balance $\eta=\exp(\chi\sigma)$ depends on the problem parameter $\sigma$. Hence, there exists no problem-independent optimal balance between the selection pressure and the mutation rate.
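A small worked example of this balance, under the theorem's assumptions: with $\sigma=1/2$ and $\eta=3/2$, the efficient mutation rate is $\chi=\ln(\eta)/\sigma\approx 0.81$, and raising $\chi$ by any constant moves the pair $(\chi,\eta)$ into the regime of the first statement of Theorem 1.

```python
import math

sigma, eta = 0.5, 1.5
chi = math.log(eta) / sigma          # efficient balance: eta = exp(chi*sigma)
assert abs(eta - math.exp(chi * sigma)) < 1e-12

# Raising the mutation rate by the constant 0.1 satisfies the condition
# eta < exp(chi'*(sigma - delta)) - epsilon of Theorem 1, first statement,
# already for small constants delta and epsilon, so the runtime becomes
# exponential with overwhelming probability.
delta = eps = 0.01
assert eta < math.exp((chi + 0.1) * (sigma - delta)) - eps
```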

Before proving Theorem 1, we note that previous analyses have also shown that the runtime of randomized search heuristics can depend critically on the parameter settings. In the case of EAs, it is known that the population size is important [12], [15], [30]. In fact, even small changes to the population size can lead to an exponential increase in the runtime [27], [31]. Another example is the evaporation factor in ant colony optimization, where a small change can increase the runtime from polynomial to exponential [5], [6], [23]. A distinguishing aspect of the result in this paper is that the runtime is shown to depend critically on the relationship between *two* parameters of the algorithm.

SECTION III

This section gives the proof of Theorem 1. The analysis is conceptually divided into two parts. In Sections IV-A and IV-B, the behavior of the main “core” of the population is analyzed, showing that the population enters an equilibrium state. This analysis is sufficient to prove the polynomial upper bound in Theorem 1. Sections IV-C and IV-D analyze the behavior of the “stray” individuals that sometimes move away from the core of the population. This analysis is necessary to prove the exponential lower bound in Theorem 1.

As long as the global optimum has not been found, the population is evolving with respect to the number of leading 1-bits. In the following, we will prove that the population eventually reaches an equilibrium state in which the population makes no progress with respect to the number of leading 1-bits. The population equilibrium can be explained informally as follows. On one hand, the selection mechanism increases the number of individuals in the population that have a relatively high number of leading 1-bits. On the other hand, the mutation operator may flip one of the leading 1-bits, and the probability of doing so clearly increases with the number of leading 1-bits in the individual. Hence, the selection mechanism causes an influx of individuals with a high number of leading 1-bits, and the mutation causes an efflux of individuals with a high number of leading 1-bits. At a certain point, the influx and efflux reach a balance which is described in the field of population genetics as mutation-selection balance.

Our first goal will be to describe the population when it is in the equilibrium state. This is done rigorously by considering each generation as a sequence of $\lambda$ Bernoulli trials, where each trial consists of selecting an individual from the population and then mutating that individual. Each trial has a certain probability of being successful in a sense that will be described later, and the progress of the population depends on the sum of successful trials, i.e., the population progress is a function of a certain Bernoulli process.

We will associate a Bernoulli process with the selection step in any given generation of the non-elitist EA, similar to Chen *et al.* [4]. For notational convenience, the individual that has rank $\gamma\lambda$ in a given population will be called the $\gamma$-ranked individual of that population. For any constant $\gamma$, $0<\gamma<1$, assume that the $\gamma$-ranked individual has $f_{0}:=\xi n$ leading 1-bits for some constant $\xi$. As illustrated in Fig. 3, the population can be partitioned into three groups of individuals: $\lambda^{+}$-individuals with fitness higher than $f_{0}$, $\lambda^{0}$-individuals with fitness equal to $f_{0}$, and $\lambda^{-}$-individuals with fitness less than $f_{0}$. Clearly, $\lambda^{+}+\lambda^{0}+\lambda^{-}=\lambda$, and $0\leq\lambda^{+}<\gamma\lambda$.

The following theorem makes a precise statement about the position $\xi^{\ast}=\ln(\beta(\gamma)/\gamma)/\chi$ for a given rank $\gamma$, $0<\gamma<1$, at which the population equilibrium occurs. Informally, the theorem states that the number of leading 1-bits in the $\gamma$-ranked individual is unlikely to decrease when it is below $\xi^{\ast}n$, and unlikely to increase when it is above $\xi^{\ast}n$.
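Assuming the standard linear ranking distribution $\beta(\gamma)=\gamma(\eta-(\eta-1)\gamma)$ (the paper's (2), not reproduced in this excerpt), the equilibrium position has a simple closed form, sketched below. For $\gamma\rightarrow 0$ it approaches $\ln(\eta)/\chi$, matching the balance $\eta=\exp(\chi\sigma)$ of Theorem 1.

```python
import math

def xi_star(gamma, eta, chi):
    """Equilibrium position ln(beta(gamma)/gamma)/chi for linear
    ranking, where beta(gamma)/gamma = eta - (eta-1)*gamma."""
    return math.log(eta - (eta - 1.0) * gamma) / chi
```

For example, `xi_star(0.5, 2.0, 1.0)` gives $\ln(3/2)\approx 0.405$, i.e., the median individual equilibrates near $0.405\,n$ leading 1-bits; lower-ranked individuals equilibrate at lower positions.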

For any constant $\gamma$, $0<\gamma<1$, and any $t_{0}>0$, define for all $t\geq 1$ the random variable $L_{t}$ as the number of leading 1-bits in the $\gamma$-ranked individual in generation $t_{0}+t$. For any $t\leq e^{c\lambda}$, define $T^{\ast}:=\min\{t,T-t_{0}\}$, where $T$ is the number of generations until an optimal search point is found. Furthermore, for any constant mutation rate $\chi>0$, define $\xi^{\ast}:=\ln\left(\beta(\gamma)/\gamma\right)/\chi$, where the function $\beta(\gamma)$ is as given in (2). Then for any constant $\delta$, $0<\delta<\xi^{\ast}$, it holds that
$$\begin{aligned}\Pr\left(\min\{\xi_{0}n,(\xi^{\ast}-\delta)n\}>\min_{0\leq i\leq T^{\ast}}L_{i}\mid L_{0}\geq\xi_{0}n\right)&=e^{-\Omega(\lambda)},\\ \Pr\left(\max\{\xi_{0}n,(\xi^{\ast}+\delta)n\}<\max_{0\leq i\leq T^{\ast}}L_{i}\mid L_{0}\leq\xi_{0}n\right)&=e^{-\Omega(\lambda)},\end{aligned}$$
where $c>0$ is some constant.

For *the first statement*, define $\xi:=\min\{\xi_{0},\xi^{\ast}-\delta\}$. Consider the events ${\cal F}^{-}_{j}$ and ${\cal G}^{-}_{j}$, defined for $j$, $0\leq j<t$, by
$${\cal F}^{-}_{j}: L_{j+1}<\xi n\quad{\rm and}\quad{\cal G}^{-}_{j}:\min_{0\leq i\leq j}L_{i}\geq\xi n.$$

The first probability in the theorem can now be expressed as
$$\Pr\left(\bigcup_{0\leq j<T^{\ast}}{\cal F}_{j}^{-}\wedge{\cal G}_{j}^{-}\mid L_{0}\geq\xi_{0}n\right)\leq\sum_{j=0}^{t-1}\Pr\left({\cal F}_{j}^{-}\wedge{\cal G}_{j}^{-}\mid L_{0}\geq\xi_{0}n\right)\leq\sum_{j=0}^{t-1}\Pr\left({\cal F}_{j}^{-}\mid{\cal G}_{j}^{-}\wedge L_{0}\geq\xi_{0}n\right),$$
where the first inequality follows from the union bound. The second inequality follows from the definition of conditional probability, which is well-defined in this case because $\Pr\left({\cal G}_{j}^{-}\mid L_{0}\geq\xi_{0}n\right)>0$ clearly holds.

To prove the first statement of the theorem, it now suffices to choose a not too large constant $c$, and show that $\Pr\left({\cal F}_{j}^{-}\mid{\cal G}_{j}^{-}\wedge L_{0}\geq\xi_{0}n\right)=e^{-\Omega(\lambda)}$ for all $j$, $0\leq j<t$.

To show this, we consider each iteration of the selection mechanism in generation $j$ as a Bernoulli trial, where a trial is successful if the following event occurs.

${\cal E}^{+}_{1}$: An individual with at least $\xi n$ leading 1-bits is selected, and none of the initial $\xi n$ bits are flipped.

Let the random variable $X$ denote the number of successful trials. Notice that the event $X\geq\gamma\lambda$ implies that the $\gamma$-ranked individual in the next generation has at least $\xi n$ leading 1-bits, i.e., that event ${\cal F}_{j}^{-}$ does not occur. From the assumption that $\xi\leq\ln(\beta(\gamma)/\gamma)/\chi-\delta$, we get
$$\frac{1}{e^{\xi\chi}}\geq\frac{\gamma}{\beta(\gamma)}\cdot e^{\delta\chi}.$$
Hence, it follows that
$$\begin{aligned}{\bf E}\left[X\mid{\cal G}_{j}^{-}\wedge L_{0}\geq\xi_{0}n\right]&=\lambda\cdot\Pr\left({\cal E}^{+}_{1}\mid{\cal G}_{j}^{-}\wedge L_{0}\geq\xi_{0}n\right)\\&\geq\beta(\gamma)\lambda\cdot\left(1-\frac{\chi}{n}\right)\left(1-\frac{\chi}{n}\right)^{\xi n-1}\\&\geq\beta(\gamma)\lambda\cdot\left(1-\frac{\chi}{n}\right)\cdot e^{-\xi\chi}\\&\geq\gamma\lambda\cdot\left(1-\frac{\chi}{n}\right)\cdot e^{\delta\chi}\\&\geq\gamma\lambda\cdot(1+\delta\chi)\cdot\left(1-\frac{\chi}{n}\right).\end{aligned}$$
For sufficiently large $n$, a Chernoff bound [21] therefore implies that $\Pr\left(X<\gamma\lambda\mid{\cal G}_{j}^{-}\wedge L_{0}\geq\xi_{0}n\right)=e^{-\Omega(\lambda)}$.

For *the second statement*, define $\xi:=\max\{\xi_{0},\xi^{\ast}+\delta\}$. Consider the events ${\cal F}^{+}_{j}$ and ${\cal G}^{+}_{j}$, defined for $j$, $0\leq j<t$, by
$${\cal F}^{+}_{j}: L_{j+1}>\xi n\quad{\rm and}\quad{\cal G}^{+}_{j}:\max_{0\leq i\leq j}L_{i}\leq\xi n.$$
Similarly to the above, the second statement can be proved by showing that $\Pr\left({\cal F}_{j}^{+}\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)=e^{-\Omega(\lambda)}$ for all $j$, $0\leq j<t$. To show this, we call a trial in generation $j$ successful if one of the following two events occurs.

${\cal E}^{+}_{2}$: An individual with at least $\xi n+1$ leading 1-bits is selected, and none of the initial $\xi n+1$ bits are flipped.

${\cal E}^{-}_{2}$: An individual with less than $\xi n+1$ leading 1-bits is selected, and the mutation of this individual creates an individual with at least $\xi n+1$ leading 1-bits.

Let the random variable $Y$ denote the number of successful trials. Notice that the event $Y<\gamma\lambda$ implies that the $\gamma$-ranked individual in the next generation has no more than $\xi n$ leading 1-bits, i.e., that event ${\cal F}_{j}^{+}$ does not occur. Furthermore, since the $\gamma$-ranked individual in the current generation has no more than $\xi n$ leading 1-bits, less than $\gamma\lambda$ individuals have more than $\xi n$ leading 1-bits. Hence, the event ${\cal E}^{+}_{2}$ occurs with probability
$$\Pr\left({\cal E}^{+}_{2}\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)\leq\beta(\gamma)\left(1-\frac{\chi}{n}\right)^{\xi n+1}\leq\frac{\beta(\gamma)}{e^{\xi\chi}}.$$
If the selected individual has $k\geq 1$ 0-bits within the first $\xi n+1$ bit positions, then the probability of mutating this individual into an individual with at least $\xi n+1$ leading 1-bits, and hence also the probability of event ${\cal E}^{-}_{2}$, is bounded from above by
$$\Pr\left({\cal E}^{-}_{2}\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)\leq\left(\frac{\chi}{n}\right)^{k}\left(1-\frac{\chi}{n}\right)^{\xi n+1-k}\leq\frac{\chi}{ne^{\xi\chi}}.$$
From the assumption that $\xi\geq\ln(\beta(\gamma)/\gamma)/\chi+\delta$, we get
$$\frac{1}{e^{\xi\chi}}\leq\frac{\gamma}{\beta(\gamma)}\cdot e^{-\delta\chi}.$$
Hence, for any constant $\delta^{\prime}$, $0<\delta^{\prime}<1-e^{-\delta\chi}<1$, we have
$$\begin{aligned}{\bf E}\left[Y\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right]&=\lambda\cdot\Pr\left({\cal E}_{2}^{+}\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)+\lambda\cdot\Pr\left({\cal E}_{2}^{-}\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)\\&\leq\lambda\left(\beta(\gamma)+\frac{\chi}{n}\right)\cdot e^{-\xi\chi}\\&\leq\gamma\lambda\left(1+\frac{\chi}{n\beta(\gamma)}\right)\cdot e^{-\delta\chi}\\&\leq\gamma\lambda(1-\delta^{\prime})\left(1+\frac{\chi}{n\beta(\gamma)}\right).\end{aligned}$$
For sufficiently large $n$, a Chernoff bound therefore implies that $\Pr\left(Y\geq\gamma\lambda\mid{\cal G}_{j}^{+}\wedge L_{0}\leq\xi_{0}n\right)=e^{-\Omega(\lambda)}$.▪

In the following, we will say that the $\gamma$-ranked individual $x$ is in the *equilibrium position* with respect to a given constant $\delta>0$, if the number of leading 1-bits in individual $x$ is larger than $(\xi^{\ast}-\delta)n$, and smaller than $(\xi^{\ast}+\delta)n$, where $\xi^{\ast}=\ln(\beta(\gamma)/\gamma)/\chi$.

Theorem 2 states that when the population reaches a certain region of the search space, the progress of the population will halt and the EA enters an equilibrium state. Our next goal is to calculate the expected time until the EA enters the equilibrium state. More precisely, for any constants $\gamma$, $0<\gamma<1$ and $\delta>0$, we would like to bound the expected number of generations until the fitness $f_{0}$ of the $\gamma$-ranked individual becomes at least $(\ln(\beta(\gamma)/\gamma)/\chi-\delta)n$. Although the fitness $f_{0}$ will have a tendency to drift toward higher values, it is necessary to take into account that the fitness can in general both decrease and increase according to stochastic fluctuations.

Drift analysis has proven to be a powerful mathematical technique to analyze such stochastically fluctuating processes [13]. Given a distance measure (sometimes called potential function) from any search point to the optimum, one estimates the drift $\Delta$ toward the optimum in one generation, and bounds the expected time to overcome a distance of $b(n)$ by $b(n)/\Delta$.
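As a toy illustration of this technique (ours, not the paper's), consider a random walk on $\{0,\ldots,b\}$ that steps toward 0 with probability $p>1/2$: the one-step drift is $\Delta=2p-1$, so the additive drift bound predicts roughly $b/\Delta$ steps to reach 0.

```python
import random

def hitting_time(b, p, rng):
    """Steps for a walk started at b, biased toward 0 with
    probability p and reflected at b, to first reach 0."""
    x, t = b, 0
    while x > 0:
        x = x - 1 if rng.random() < p else min(x + 1, b)
        t += 1
    return t

rng = random.Random(42)
b, p = 50, 0.7
avg = sum(hitting_time(b, p, rng) for _ in range(200)) / 200
# The drift bound predicts roughly b/(2p - 1) = 125 steps.
```

The empirical average stays close to the prediction; the bound only requires the drift to hold in every state, which is exactly the property exploited in the proofs below.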

However, in our case, a direct application of drift analysis with respect to $f_{0}$ will give poor bounds, because the drift of $f_{0}$ depends on the value of a second variable $\lambda^{+}$. The probability of increasing the fitness of the $\gamma$-ranked individual is low when the number of individuals in the population with higher fitness, i.e., $\lambda^{+}$, is low. However, it is still likely that the sum $\lambda^{0}+\lambda^{+}$ will increase, thus increasing the number of good individuals in the population.

Several researchers have discussed this alternating behavior of population-based EAs [4], [30]. Witt showed that by taking into account replication of good individuals, one can improve trivial upper runtime bounds for the $(\mu+1)$ EA, e.g., from $O(\mu n^{2})$ on LeadingOnes to $O(\mu n\log n+n^{2})$ [30]. Chen *et al.* described a similar situation in the case of an elitist EA, which goes through a sequence of two-stage phases, where the first stage is characterized by accumulation of leading individuals, and the second stage is characterized by acquiring better individuals [4].

Generalized to the non-elitist EA described here, this corresponds to an initial accumulation of $\lambda^{+}$-individuals, until the population eventually gains more than $\gamma\lambda$ individuals with fitness higher than $f_{0}$. In the worst case, when $\lambda^{+}=0$, one expects that $f_{0}$ has a small positive drift. However, when $\lambda^{+}$ is high, there is a high drift. When the fitness is increased, the value of $\lambda^{+}$ is likely to decrease. To take into account this mutual dependency between $\lambda^{+}$ and $f_{0}$, we apply drift analysis in conceptually two dimensions, finding the drift of both $f_{0}$ and $\lambda^{+}$. Similar in vein to this 2-D drift analysis is the analysis of simulated annealing due to Wegener, in which a gambler's ruin argument is applied with respect to a potential function having two components [28].

The drift analysis applies the following simple property of function $\beta$ which follows from its definition in (2).

The function $\beta$ defined in (2) satisfies
$$\frac{\beta(\gamma/x)}{\beta(\gamma)}\geq\frac{1}{x}$$
for all $x\geq 1$ and $\gamma$, where $0<\gamma<1$.

The following theorem shows that if the $\gamma$-ranked individual in a given population is below the equilibrium position, then the equilibrium position will be reached within expected $O(\lambda n^{2})$ function evaluations.

Let $\gamma$ and $\delta$ be any constants with $0<\gamma<1$ and $\delta>0$. The expected number of function evaluations until the $\gamma$-ranked individual of the linear ranking EA with population size $\lambda\geq c\ln n$, for some constant $c>0$ that depends on $\gamma$, attains at least $n(\ln (\beta(\gamma)/\gamma)/\chi-\delta)$ leading 1-bits or the optimum is reached, is $O(\lambda n^{2})$.

Recall from the definition of the EA that $P_{t}$ is the population vector in generation $t\geq 0$. We consider the drift with respect to the potential function $h(P_{t}):=h_{y}(P_{t})+\lambda h_{x}(P_{t})$, which is composed of a horizontal component $h_{x}$ and a vertical component $h_{y}$, defined as
$$\begin{aligned}h_{x}(P_{t})&:=n-{\rm LeadingOnes}(x_{(\gamma)}),\\ h_{y}(P_{t})&:=\gamma\lambda-\vert\{y\in P_{t}\mid f(y)>f(x_{(\gamma)})\}\vert,\end{aligned}$$
where $x_{(\gamma)}$ is the $\gamma$-ranked individual in population $P_{t}$. The horizontal drift $\Delta_{x,t}$ and vertical drift $\Delta_{y,t}$ in generation $t$ are
$$\begin{aligned}\Delta_{x,t}(i)&:={\bf E}\left[h_{x}(P_{t})-h_{x}(P_{t+1})\mid h_{x}(P_{t})=i\right],\\ \Delta_{y,t}(i)&:={\bf E}\left[h_{y}(P_{t})-h_{y}(P_{t+1})\mid h_{y}(P_{t})=i\right].\end{aligned}$$
The horizontal and vertical drift will be bounded independently in the following two cases:

- Case 1: $0\leq\lambda^{+}_{t}\leq\gamma\lambda/l$;
- Case 2: $\gamma\lambda/l<\lambda^{+}_{t}$.

Assume that the $\gamma$-ranked individual has $\xi n$ leading 1-bits, where ${\xi<\ln (\beta(\gamma)/\gamma)/\chi-\delta}$. By the first statement of Theorem 2, the probability of reducing the number of leading 1-bits in the $\gamma$-ranked individual, i.e., of increasing the horizontal distance, is $e^{-\Omega (\lambda)}$. The horizontal distance cannot increase by more than $n$, so $\Delta_{x,t}\geq-ne^{-\Omega (\lambda)}$ holds in both cases.

We now bound the horizontal drift $\Delta_{x,t}$ for Case 2. Let the random variable $S_{t}$ be the number of selection steps in which an individual with fitness strictly higher than $f_{0}=f(x_{(\gamma)})$ is selected, and none of the leading $\xi n$ bits are flipped. Then
$$\begin{aligned}{\bf E}[S_{t}]&\geq\lambda\cdot\beta(\gamma/l)\cdot e^{-\xi\chi}\cdot\left(1-\frac{\chi}{n}\right)\\&\geq\gamma\lambda\cdot(1+\chi\delta)\cdot\frac{\beta(\gamma/l)}{\beta(\gamma)}\cdot\left(1-\frac{\chi}{n}\right)\\&\geq\gamma\lambda\cdot\frac{1+\chi\delta}{l}\cdot\left(1-\frac{\chi}{n}\right).\end{aligned}$$
By defining $l:=1+\chi\delta/2$, there exists a constant $\delta^{\prime}>0$ such that for sufficiently large $n$, we have ${\bf E}\left[S_{t}\right]\geq(1+\delta^{\prime})\cdot\gamma\lambda$. Hence, by a Chernoff bound, with probability $1-e^{-\Omega(\lambda)}$, the number $S_{t}$ of such selection steps is at least $\gamma\lambda$, in which case $\Delta_{x,t}\geq 1$. The unconditional horizontal drift in Case 2 therefore satisfies $\Delta_{x,t}\geq 1\cdot(1-e^{-\Omega(\lambda)})-n\cdot e^{-\Omega(\lambda)}$.

We now bound the vertical drift $\Delta_{y,t}$ for Case 1. In order to generate a $\lambda^{+}$-individual in a selection step, it is sufficient that a $\lambda^{+}$-individual is selected and none of the leading $\xi n+1$ 1-bits are flipped. We first show that the expected number of such events suffices to ensure a non-negative drift. If $\lambda_{t}^{+}=0$, then the vertical drift cannot be negative. Let us therefore assume that $0<\lambda_{t}^{+}=\gamma\lambda/m$ for some $m>1$ which is not necessarily constant. The expected number of times a new $\lambda^{+}$-individual is created is at least
$$\lambda\cdot\beta(\gamma/m)\cdot e^{-\xi\chi}\cdot\left(1-\frac{\chi}{n}\right)\geq\gamma\lambda\cdot\frac{\beta(\gamma/m)}{\beta(\gamma)}\cdot(1+\chi\delta)\cdot\left(1-\frac{\chi}{n}\right)\geq\frac{\gamma\lambda}{m}\cdot(1+\chi\delta)\cdot\left(1-\frac{\chi}{n}\right).$$
Hence, for sufficiently large $n$, this is at least $\lambda_{t}^{+}$, so the drift from these events alone is non-negative. In addition, a $\lambda^{+}$-individual can be created by selecting a $\lambda^{0}$-individual and flipping the first 0-bit and no other bits. The expected number of such events is at least $\lambda\cdot\beta(\gamma/l,\gamma)\cdot e^{-\xi\chi}\cdot\chi/n=\Omega(\lambda/n)$. Hence, the expected vertical drift in Case 1 is $\Omega(\lambda/n)$. Finally, for Case 2, we use the trivial lower bound $\Delta_{y,t}\geq-\gamma\lambda$.

The horizontal and vertical drifts are now combined into a single *combined drift* $\Delta_{t}:=\Delta_{y,t}+\lambda\Delta_{x,t}$, which is bounded from below in the two cases by:

- Case 1: $\Delta_{t}\geq\Omega (\lambda/n)-\lambda n e^{-\Omega (\lambda)}$;
- Case 2: $\Delta_{t}\geq-\gamma\lambda+\lambda (1-e^{-\Omega (\lambda)})-\lambda n e^{-\Omega (\lambda)}$.

Given a population size $\lambda\geq c\ln n$, for a sufficiently large constant $c$ with respect to $\gamma$, the combined drift $\Delta_{t}$ is therefore in both cases bounded from below by $\Omega (\lambda/n)$. The maximal distance is $b(n)\leq(n+\gamma)\cdot\lambda$, hence, the expected number of function evaluations $T$ until the $\gamma$-ranked individual attains at least $n(\ln (\beta(\gamma)/\gamma)/\chi-\delta)$ leading 1-bits is no more than ${\bf E}[T]\leq\lambda\cdot b(n)/\Delta_{t}=O(\lambda n^{2})$.▪
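The last step applies the additive drift theorem: a process with drift at least $\Delta_{t}$ toward a target at distance at most $b(n)$ reaches it within $b(n)/\Delta_{t}$ steps in expectation. A minimal Monte Carlo sketch of this bound, with hypothetical numbers unrelated to the EA above:

```python
import random

def hitting_time(b, rng, p=0.6):
    """Count steps until a process that gains +1 with probability p
    (drift Delta = p per step) has covered distance b."""
    dist, steps = 0, 0
    while dist < b:
        if rng.random() < p:
            dist += 1
        steps += 1
    return steps

rng = random.Random(1)
b, delta = 50, 0.6
avg = sum(hitting_time(b, rng) for _ in range(2000)) / 2000
# Additive drift theorem: E[T] <= b / delta = 83.33...; here the bound is
# tight since every step has drift exactly delta.
assert avg <= b / delta + 3.0
```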

In the previous section, it was shown that the population reaches an equilibrium state in $O(\lambda n^{2})$ function evaluations in expectation. Furthermore, the position of the equilibrium state is determined by the selection pressure $\eta$ and the mutation rate $\chi$. By choosing appropriate values for the parameters $\eta$ and $\chi$, one can ensure that the equilibrium position occurs close to the global optimum, which is given by the problem parameter $\sigma$. Theorem 7, which will be proved in Section IV-E, also implies that no individual will reach far beyond the equilibrium position. It is now straightforward to prove that an optimal solution will be found in polynomial time with overwhelmingly high probability.

The probability that the linear ranking EA with population size $n\leq\lambda\leq n^{k}$, for any constant integer $k\geq 1$, selection pressure $\eta$, and bit-wise mutation rate $\chi/n$ for a constant $\chi>0$ satisfying $\eta=\exp (\sigma\chi)$, finds the optimum of ${\rm SelPres}_{\sigma,\delta,k}$ within $n^{k+4}$ function evaluations is $1-e^{-\Omega (n)}$.

We divide the run into two phases. The first phase lasts the first $\lambda n^{3}$ function evaluations, and the second phase lasts the remaining $n^{k+4}-\lambda n^{3}$ function evaluations. We say that a failure occurs during the run, if within these two phases, there exists an individual that has more than $(\sigma+\delta)n$ leading 1-bits, or more than $2\delta n/3$ 1-bits in the interval from $(\sigma+\delta)n$ to $(\sigma+2\delta)n$. We first claim that the probability of this failure event is exponentially small. By Theorem 7, no individual reaches more than $(\sigma+\delta)n$ leading 1-bits within $cn^{k+4}$ function evaluations with probability $1-e^{-\Omega (n)}$. Hence, the bits after position $(\sigma+\delta)n$ will be uniformly distributed. By a Chernoff bound, and a union bound over all the individuals in the two phases, the probability that any individual during the two phases has more than $2\delta n/3$ 1-bits in the interval from $n(\sigma+\delta)$ to $n(\sigma+2\delta)$ is exponentially small. We have therefore proved the first claim.

Let $\gamma>0$ be a constant such that $\ln (\beta(\gamma)/\gamma)/\chi>\sigma-\delta$. We say that a failure occurs in the first phase, if by the end of this phase, there exists a non-optimal individual with rank between 0 and $\gamma$ that has less than $(\sigma-\delta)n$ leading 1-bits. We will prove the claim that the probability of this failure event is exponentially small. By Theorem 3, the expected number of function evaluations until the $\gamma$-ranked individual has obtained at least $(\sigma-\delta)n$ leading 1-bits is no more than $c\lambda n^{2}$, for some constant $c>0$. We divide the first phase into sub-phases, each of length $2c\lambda n^{2}$. By Markov's inequality, the probability that the $\gamma$-ranked individual has not obtained $(\sigma-\delta)n$ leading 1-bits within a given sub-phase is less than $1/2$. The probability that this number of leading 1-bits is not achieved within $n/2c$ such sub-phases, i.e., by the end of the first phase, is no more than $2^{-n/2c}$, and the second claim holds.
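The sub-phase argument is a standard restart trick: once Markov's inequality gives a success probability of at least $1/2$ per sub-phase, $s$ independent sub-phases all fail with probability at most $2^{-s}$. A small empirical sketch, with the per-sub-phase success probability set to the worst case $1/2$ by assumption:

```python
import random

# Assumed worst case: each sub-phase succeeds with probability exactly p = 1/2,
# independently. The chance that all s sub-phases fail is (1 - p)**s = 2**-s.
rng = random.Random(3)
p, s, trials = 0.5, 10, 200_000
all_fail = sum(
    all(rng.random() >= p for _ in range(s)) for _ in range(trials)
) / trials
print(all_fail, 2.0 ** -s)   # empirical rate vs. the 2**-s bound
assert all_fail <= 2.0 ** -s + 2e-3
```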

We say that a failure occurs in the second phase, if a non-optimal individual with rank better than $\gamma$ has less than $(\sigma-\delta)n$ leading 1-bits, or the optimum is not found by the end of the phase. We claim that the probability of this failure event is exponentially small. The first part of the claim follows from the first part of Theorem 2 with the parameters $\xi_{0}=\sigma-\delta$ and $t=n^{k+4}/\lambda-n^{3}$. Assuming no failure in the previous phase, it suffices to select an individual with rank between 0 and $\gamma$, and flip the leading $k+3$ 1-bits, and no other bits. The probability that this event happens during a single selection step, assuming that $n>2\chi-k-3$, i.e., $n-k-3<2n-2\chi$, is
$$\begin{aligned}r&=\beta(\gamma)\left(\frac{\chi}{n}\right)^{k+3}\left(1-\frac{\chi}{n}\right)^{n-k-3}\\ &\geq\beta(\gamma)\left(\frac{\chi}{n}\right)^{k+3}\left[\left(1-\frac{\chi}{n}\right)^{\frac{n}{\chi}-1}\right]^{2\chi}\geq\frac{\beta(\gamma)}{e^{2\chi}}\left(\frac{\chi}{n}\right)^{k+3}.\end{aligned}$$
The expected number of selection steps until the optimum is produced is $1/r\leq c^{\prime}n^{k+3}$ for some constant $c^{\prime}>0$. Similarly to the first phase, we consider sub-phases, each of length $2c^{\prime}n^{k+3}$. By Markov's inequality, the probability that the optimum has not been found within a given sub-phase is less than $1/2$. The probability that the optimum has not been found within $n/4c^{\prime}$ sub-phases, i.e., before the end of the second phase, is no more than $2^{-n/4c^{\prime}}$, and the third claim holds.
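The chain of inequalities bounding $r$ can be checked numerically. The sketch below uses illustrative parameter values (not taken from the theorem) and the identity $\beta(\gamma)/\gamma=\eta(1-\gamma)+\gamma$ from the proof of Proposition 3:

```python
import math

def beta(gamma, eta):
    # Cumulative linear ranking probability: beta(gamma)/gamma = eta*(1-gamma) + gamma.
    return gamma * (eta * (1.0 - gamma) + gamma)

def r_exact(gamma, eta, chi, n, k):
    return beta(gamma, eta) * (chi / n) ** (k + 3) * (1.0 - chi / n) ** (n - k - 3)

def r_lower(gamma, eta, chi, n, k):
    return (beta(gamma, eta) / math.exp(2.0 * chi)) * (chi / n) ** (k + 3)

gamma, eta, chi, k = 0.3, 1.5, 1.0, 1   # illustrative constants
for n in (100, 1_000, 10_000):
    assert n > 2 * chi - k - 3          # the assumption used in the proof
    assert r_exact(gamma, eta, chi, n, k) >= r_lower(gamma, eta, chi, n, k)
```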

If none of the failure events occurs, then the optimum has been found by the end of the second phase. The probability that any of the failure events occurs is $e^{-\Omega (n)}$, and the theorem then follows.▪

While Theorem 2 describes the equilibrium position of any $\gamma$-ranked individual for any positive constant $\gamma$, the theorem cannot be used to analyze the behavior of single “stray” individuals, including the position of the fittest individual (i.e., $\gamma=0$). This is because the tail inequalities obtained by the Chernoff bounds used in the proof of Theorem 2 are too weak for ranks of order $\gamma=o(1)$.

To analyze stray individuals, we will apply the technique of non-selective family trees introduced in [18]. This technique is different from, but related to, the *family tree* technique described by Witt [30]. A family tree has as its root a given individual $x$ in some generation $t$, and the nodes in each level $k$ correspond to the subset of the population in generation $t+k$ defined in the following way. An individual $y$ in generation $t+k$ is a member of the family tree if and only if it was generated by selection and mutation of an individual $z$ that belongs to level $k-1$ of the family tree. In this case, individual $z$ is the parent node of individual $y$. If there is a path from an individual $z$ at level $k$ to an individual $y$ at level $k^{\prime}>k$, then individual $y$ is said to be a *descendant* of individual $z$, and individual $z$ is an *ancestor* of individual $y$. A directed path in the family tree is called a *lineage*. A family tree is said to become *extinct* in generation $t+t(n)+1$ if none of the individuals in level $t(n)$ of the tree were selected. In this case, $t(n)$ is called the *extinction time* of the family tree.

The idea for proving that stray individuals do not reach a given part of the search space can be described informally using Fig. 4. One defines a certain subset of the search space called the *core* within which the majority of the population is confined with overwhelming probability. In our case, an appropriate core can be defined using Theorems 2 and 3. One then focuses on the family trees that are outside this core, but which have roots within the core. Note that some descendants of the root may re-enter the core. We therefore prune the family tree to those descendants which are always outside the core. More formally, the pruned family tree contains node $x$ if and only if $x$ belongs to the original family tree, and $x$ and all its ancestors are outside the core.

We would then like to analyze the positions of the individuals that belong to the pruned family tree. However, it is non-trivial to calculate the exact shape of this family tree. Let the random variable $\xi_{x}$ denote the number of offspring of individual $x$. Clearly, the distribution of $\xi_{x}$ depends on how $x$ is ranked within the population. Hence, different parts of the pruned family tree may grow at different rates, which can influence the position and shape of the family tree. To simplify the analysis, we embed the pruned family tree into a larger family tree which we call the *non-selective family tree*. This family tree has the same root as the real pruned family tree, however it grows through a modified selection process. In the real pruned family tree, the individuals have different numbers of offspring according to their rank in the population. In the non-selective family tree, the offspring distribution $\xi_{x}$ of all individuals $x$ is identical to the offspring distribution $\xi_{z}$ of an individual $z$ which is best ranked among individuals outside the core. We will call the expectation of this distribution $\xi_{z}$ the *reproductive rate* of the non-selective family tree. Hence, each individual in the non-selective family tree has at least as many offspring as in the real family tree. The real family tree will therefore occur as a sub-tree in the non-selective family tree. Furthermore, the probability that the real family tree reaches a given part of the search space is upper bounded by the probability that the non-selective family tree reaches this part of the search space. A related approach, where faster growing family trees are analyzed, was described by Jägersküpper and Witt [14].

Approximating the family tree by the non-selective family tree has three important consequences. The *first* consequence is that the non-selective family tree can grow faster than the real family tree, and in general beyond the population size $\lambda$ of the original process. The *second* consequence is that since all individuals in the family tree have the same offspring distribution, no individual in the family tree has any selective advantage, hence the name non-selective family tree. The behavior of the family tree is therefore independent of the fitness function, and each lineage fluctuates randomly in the search space according to the bits flipped by the mutation operator. Such mutation random walks are easier to analyze than the real search process. To bound the probability that such a mutation random walk enters a certain region of the search space, it is necessary to bound the extinction time $t(n)$ of the non-selective family tree. The *third* consequence is that the sequence of random variables $(Z_{t})_{t\geq 0}$ describing the number of elements in level $t$ of the non-selective family tree is a discrete time branching process [10]. We can therefore apply the techniques that have been developed to study branching processes to bound the extinction time $t(n)$.

Before introducing branching processes, we summarize the main steps in a typical application of non-selective family trees, assuming the goal is to prove that with overwhelming probability, an algorithm does not reach a given search point $x^{\ast}$ within $e^{cn}$ generations for some constant $c>0$. The first step is to define an appropriate core, which is a subset of the search space that is separated from $x^{\ast}$ by some distance. The second step is to prove that any non-selective family tree outside the core will become extinct in $t(n)$ generations with overwhelmingly high probability. This can be proved by applying results about branching processes, e.g., Lemmas 2 and 3 in this paper. The third step is to bound the number of different lineages that the family tree has within $t(n)$ generations. Again, results about branching processes can be applied. The fourth step involves bounding the probability that a given lineage, starting inside the core, reaches search point $x^{\ast}$ within $t(n)$ generations. This can be shown in various ways, depending on the application. The fifth and final step is to apply a union bound over all the different lineages that can exist within $e^{cn}$ generations.

In the second step, one should keep in mind that there are several causes of extinction. A reproductive rate less than 1 is perhaps the most evident cause of extinction. Such a low reproductive rate may occur when the fitness outside the core is lower than the fitness inside the core, as is the case for the family trees considered in Section IV-D. With a majority of the population inside the core, each individual outside the core is selected in expectation less than once per generation. However, a low reproductive rate is not the only cause of extinction. This is illustrated by the core definition in Section IV-E, where the fitness is generally higher outside the core than inside. While the family tree members may in general be selected more than once per generation, the critical factor here is that their offspring are in expectation closer to the core than their parents. Hence, the lineages outside the core will have a tendency to drift back into the core, where they are no longer considered part of the family tree due to the pruning process.

A single-type branching process is a Markov process $Z_{0}, Z_{1},\ldots$ on $\mathbb{N}_{0}$, which for all $t\geq 0$ is given by $Z_{t+1}:=\sum_{i=1}^{Z_{t}}\xi_{i}$, where $\xi_{i}\in\mathbb{N}_{0}$ are i.i.d. random variables having ${\bf E}[\xi]=:\rho$. A branching process can be thought of as a population of identical individuals, where each individual survives exactly one generation. Each individual produces $\xi$ offspring independently of the rest of the population during its lifetime, where $\xi$ is a random variable with expectation $\rho$. The random variable $Z_{t}$ denotes the population size in generation $t$. Clearly, if $Z_{t}=0$ for some $t$, then $Z_{t^{\prime}}=0$ for all $t^{\prime}\geq t$. The following lemma gives a simple bound on the size of the population after $t\geq 1$ generations.

Let $Z_{0}, Z_{1},\ldots$ be a single-type branching process with $Z_{0}:=1$ and mean number of offspring per individual $\rho$. Define the random variables $T:=\min\{t\geq 0\mid Z_{t}=0\}$, i.e., the extinction time, and $X_{t}$, the number of different lineages until generation $t$. Then for all $t, k\geq 1$,
$$\Pr(Z_{t}\geq k)\leq\frac{\rho^{t}}{k}\quad\text{and}\quad\Pr (T>t)\leq\rho^{t}.$$
Furthermore, if $\rho<1$, then
$$ {\bf E}[X_{t}]\leq\frac{\rho}{1-\rho}\quad\text{and}\quad\Pr (X_{t}\geq k)\leq\frac{\rho}{k(1-\rho)}.$$

By the law of total expectation, we have
$$ {\bf E}[Z_{t}]={\bf E}[{\bf E}[Z_{t}\mid Z_{t-1}]]=\rho\cdot{\bf E}[Z_{t-1}].$$
Repeating this $t$ times gives ${\bf E}\left[Z_{t}\right]=\rho^{t}\cdot{\bf E}\left[Z_{0}\right]=\rho^{t}$. The first part of the lemma now follows by Markov's inequality, that is,
$$\Pr (Z_{t}\geq k)\leq\frac{{\bf E}[Z_{t}]}{k}=\frac{\rho^{t}}{k}.$$
The second part of the lemma is the special case $k=1$, since the process survives beyond generation $t$ if and only if $Z_{t}\geq 1$, i.e., $\Pr\left(T>t\right)=\Pr\left(Z_{t}\geq 1\right)\leq\rho^{t}$. For the last two parts, note that since each lineage must contain at least one individual that is unique to that lineage, we have $X_{t}\leq Z_{1}+\cdots+Z_{t}$. By linearity of expectation and the previous inequalities, we can therefore conclude that
$$ {\bf E}[X_{t}]\leq\sum_{i=1}^{t}{\bf E}[Z_{i}]\leq\sum_{i=1}^{\infty}\rho^{i}=\frac{\rho}{1-\rho}.$$
Finally, it follows from Markov's inequality that
$$\Pr(X_{t}\geq k)\leq\frac{{\bf E}[X_{t}]}{k}\leq\frac{\rho}{k(1-\rho)}.$$▪
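The extinction-time tail bound of Lemma 2 can be illustrated by simulation. The sketch below assumes a two-point offspring law (two children with probability $\rho/2$, otherwise none), which has mean $\rho$; any offspring distribution with mean $\rho$ would serve:

```python
import random

def offspring(rho, rng):
    # Two-point offspring law with mean rho (illustrative choice):
    # 2 children with probability rho/2, otherwise 0.
    return 2 if rng.random() < rho / 2.0 else 0

def extinction_time(rho, rng):
    """Extinction time T = min{t : Z_t = 0} of a branching process with Z_0 = 1."""
    z, t = 1, 0
    while z > 0:
        z = sum(offspring(rho, rng) for _ in range(z))
        t += 1
    return t

rng = random.Random(7)
rho, t0, trials = 0.8, 10, 20_000
survive = sum(extinction_time(rho, rng) > t0 for _ in range(trials)) / trials
print(survive, rho ** t0)   # empirical Pr(T > t0) vs. the rho**t0 bound
assert survive <= rho ** t0
```

The empirical survival rate is typically far below $\rho^{t}$, since the bound comes from Markov's inequality and is not tight for this offspring law.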

From the preceding lemma, it is clear that the expected number of offspring $\rho$ is important for the fate of a branching process. For $\rho<1$, the process is called *sub-critical*, for ${\rho=1}$, the process is called *critical*, and for $\rho>1$, the process is called *super-critical*. In this paper, we will consider sub-critical processes.

In this section, it is proved that ${\rm SelPres}_{\sigma,\delta,k}$ is hard for the linear ranking EA when the ratio between the parameters $\eta$ and $\chi$ is sufficiently large. The overall proof idea is first to show that the population is likely to reach the equilibrium position before the optimum is reached (Proposition 2, Theorem 3). Once the equilibrium position is reached, a majority of the population will have significantly more than $(\sigma+\delta) n$ leading 1-bits, and individuals that are close to the optimum are therefore less likely to be selected (Proposition 3).

The proof of Proposition 2 builds on the result in Proposition 1, which states that the individuals with at least $k+3$ leading 1-bits will quickly dominate the population. Hence, family trees of individuals with less than $k+3$ leading 1-bits are likely to become extinct before they discover an optimal search point. Recall that optimal search points have $k+3$ leading 0-bits. In the following, individuals with at least $k+3$ leading 1-bits will be called $1^{k+3}$-individuals.

Let $\gamma^{\ast}$ be any constant $0<\gamma^{\ast}<1$, and $t(\lambda)={\rm poly}(\lambda)$. If the linear ranking EA with population size $\lambda$, $n\leq\lambda\leq n^{k}$, for any constant integer $k\geq 1$, and bit-wise mutation rate $\chi/n$ for a constant $\chi>0$, is applied to ${\rm SelPres}_{\sigma,\delta,k}$, then with probability $1-o(1)$, all the $\gamma^{\ast}$-ranked individuals from generation $\log\lambda$ to generation $T^{\ast}:=\min\{t(\lambda),T-1\}$ are $1^{k+3}$-individuals, where $T$ is the number of generations until the optimum has been found.

If the $\gamma^{\ast}$-ranked individual in some generation $t_{0}\leq\log\lambda$ is a $1^{k+3}$-individual, then by the first part of Theorem 2 with parameter $\xi_{0}:=(k+3)/n$, the $\gamma^{\ast}$-ranked individual remains so until generation $T^{\ast}$ with probability $1-e^{-\Omega (\lambda)}$. Otherwise, we consider the run a failure.

It remains to prove that the $\gamma^{\ast}$-ranked individual in one of the first $\log\lambda$ generations is a $1^{k+3}$-individual with probability $1-o(1)$. We apply the drift theorem with respect to the potential function $\log (\lambda^{+})$, where $\lambda^{+}$ is the number of $1^{k+3}$-individuals in the population.

A run is considered failed if the fraction of $1^{k+3}$-individuals in any of the first $T^{\ast}$ generations is less than $\gamma_{0}:=1/2^{k+4}$. The initial generation is sampled uniformly at random, so by a Chernoff bound, the probability that the fraction of $1^{k+3}$-individuals in the initial generation is less than $\gamma_{0}$, is $e^{-\Omega (\lambda)}$. Given that the initial fraction of $1^{k+3}$-individuals is at least $\gamma_{0}$, it follows again by the first part of Theorem 2 with parameter $\xi_{0}=(k+3)/n$ that this holds until generation $T^{\ast}$ with probability $1-e^{-\Omega (\lambda)}$. Hence, the probability of this failure event is $e^{-\Omega (\lambda)}$.

The $1^{k+3}$-individuals are fitter than any other non-optimal individuals. Assume that the fraction of $1^{k+3}$-individuals in a given generation is $\gamma$, with $\gamma_{0}\leq\gamma<\gamma^{\ast}$. In order to create a $1^{k+3}$-individual in a selection step, it suffices to select one of the best $\gamma\lambda$ individuals, and to not mutate any of the first $k+3$ bit positions. The expected number of $1^{k+3}$-individuals in the following generation is therefore at least $r(\gamma)\lambda$, where we define $r(\gamma):=\beta(\gamma)(1-\chi/n)^{k+3}$. The ratio $r(\gamma)/\gamma$ is linearly decreasing in $\gamma$, and for sufficiently large $n$, $r(\gamma^{\ast})/\gamma^{\ast}\geq 1+c$ for a constant $c>0$. Hence, for all $\gamma<\gamma^{\ast}$, it holds that
$$r(\gamma)=\gamma\cdot\frac{r(\gamma)}{\gamma}>\gamma\cdot\frac{r(\gamma^{\ast})}{\gamma^{\ast}}\geq\gamma (1+c).$$
The drift is therefore, for all $\gamma$ with $\gamma_{0}\leq\gamma<\gamma^{\ast}$,
$$\begin{aligned}\Delta&\geq\log (r(\gamma)\lambda)-\log (\gamma\lambda)\\ &\geq\log (\gamma (1+c)\lambda)-\log (\gamma\lambda)=\log (1+c).\end{aligned}$$
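The monotonicity claim can be made concrete: by the identity $\beta(\gamma)/\gamma=\eta(1-\gamma)+\gamma$ from the proof of Proposition 3, the ratio $r(\gamma)/\gamma=(\eta-(\eta-1)\gamma)(1-\chi/n)^{k+3}$ is a decreasing linear function of $\gamma$. A quick numerical sketch with illustrative parameters:

```python
def ratio(gamma, eta, chi, n, k):
    # r(gamma)/gamma = (beta(gamma)/gamma) * (1 - chi/n)**(k+3)
    #               = (eta - (eta - 1)*gamma) * (1 - chi/n)**(k+3)
    return (eta - (eta - 1.0) * gamma) * (1.0 - chi / n) ** (k + 3)

eta, chi, n, k = 1.8, 1.0, 10**6, 1    # illustrative constants
gamma_star = 0.5
vals = [ratio(0.05 * i, eta, chi, n, k) for i in range(1, 11)]
assert all(a > b for a, b in zip(vals, vals[1:]))   # strictly decreasing in gamma
# r(gamma)/gamma > r(gamma_star)/gamma_star > 1 for all gamma < gamma_star:
assert ratio(gamma_star, eta, chi, n, k) > 1.0
```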

Assuming no failure, the potential must increase by no more than $b(\lambda):=\log (\gamma^{\ast}\lambda)-\log (\gamma_{0}\lambda)=\log (\gamma^{\ast}/\gamma_{0})$. By the drift theorem, the expected number of generations until this occurs is $b(\lambda)/\Delta=O(1)$, and the probability that it does not occur within $\log\lambda$ generations is $O(1/\log\lambda)$ by Markov's inequality.

Taking into account all the failure probabilities, the proposition now follows.▪

For any constant $r>0$, the probability that the linear ranking EA with population size $\lambda$, $n\leq\lambda\leq n^{k}$, for some constant integer $k\geq 1$, and bit-wise mutation rate $\chi/n$ for a constant $\chi>0$, has not found the optimum of ${\rm SelPres}_{\sigma,\delta,k}$ within $\lambda rn^{2}$ function evaluations is $\Omega (1)$.

We consider the run a failure if at some point between generation $\log\lambda$ and generation $rn^{2}$, the $(1+\delta)/2$-ranked individual has less than $k+3$ leading 1-bits without the optimum having been found first. By Proposition 1, the probability of this failure event is $o(1)$.

Assuming that this failure event does not occur, we apply the method of non-selective family trees with the set of $1^{k+3}$-individuals as core. Recall that the family trees are pruned such that they only contain lineages outside the core. However, to simplify the analysis, the family trees will not be pruned before generation $\log\lambda$. Therefore, any family tree that is not rooted in a $1^{k+3}$-individual must be rooted in the initial population. The proof now considers the family trees with roots after and before generation $\log\lambda$ separately.

We first consider the at most $m:=\lambda rn^{2}\leq rn^{k+2}$ family trees with roots after generation $\log\lambda$. We begin by estimating the total number of lineages, and their extinction times. The mean number of offspring $\rho$ of an individual with rank $\gamma$ is no more than $\alpha (\gamma)$, as given in (1). Assuming no failure, any non-optimal individual outside the core has rank at least $\gamma:=(1+\delta)/2$. Hence for any selection pressure $\eta$, $1<\eta\leq 2$, the mean number of offspring of an individual in the family tree is $\rho\leq\alpha ((1+\delta)/2)=1-(\eta-1)\delta<1$. We consider the run a failure if any of the $m$ family trees survives longer than $t:=(k+3)\ln n/\ln (1/\rho)$ generations. By the union bound and Lemma 2, the probability of this failure event is no more than $m\rho^{t}=mn^{-k-3}=O(1/n)$.
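The value $\alpha((1+\delta)/2)=1-(\eta-1)\delta$ can be verified directly, assuming the linear ranking density has the form $\alpha(\gamma)=\eta-2(\eta-1)\gamma$ (an assumption consistent with the value used above; the exact definition is given by (1) earlier in the paper):

```python
def alpha(gamma, eta):
    # Assumed linear ranking density: alpha(gamma) = eta - 2*(eta - 1)*gamma.
    return eta - 2.0 * (eta - 1.0) * gamma

for eta in (1.1, 1.5, 2.0):
    for delta in (0.01, 0.1, 0.3):
        rho = alpha((1.0 + delta) / 2.0, eta)
        assert abs(rho - (1.0 - (eta - 1.0) * delta)) < 1e-12
        assert rho < 1.0   # sub-critical reproductive rate outside the core
```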

Let the random variable $P_{i}$ be the number of lineages in family tree $i$, $1\leq i\leq m$. The expected number of lineages in a given family tree is by Lemma 2 no more than $\rho/(1-\rho)$. We consider the run a failure if there are more than $2m\rho/(1-\rho)$ lineages in all these family trees. The probability of this failure event is by Markov's inequality no more than
$$\Pr\left(\sum_{i=1}^{m}P_{i}\geq\frac{2m\rho}{1-\rho}\right)\leq\frac{(1-\rho)\sum_{i=1}^{m}{\bf E}\left[P_{i}\right]}{2m\rho}\leq\frac{1}{2}.$$

We now bound the probability that any given lineage contains a $0^{k+3}$-individual, which is necessary to find an optimal search point. The probability of flipping a given bit during $t$ generations is by the union bound no more than $t\chi/n$, and the probability of flipping $k+3$ bits within $t$ generations is no more than $(t\chi/n)^{k+3}$. The probability that any of the at most $2m\rho/(1-\rho)$ lineages contains a $0^{k+3}$-individual is by the union bound no more than
$$\frac{(t\chi/n)^{k+3}\cdot 2m\rho}{1-\rho}=O\left(\ln^{k+3}n/n\right).$$
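Substituting $t=(k+3)\ln n/\ln(1/\rho)$ and $m\leq rn^{k+2}$ shows why the bound is $O(\ln^{k+3}n/n)$: the ratio of the bound to $\ln^{k+3}n/n$ is a constant independent of $n$. A numeric sketch with illustrative constants:

```python
import math

def failure_bound(n, k, chi, rho, r):
    t = (k + 3) * math.log(n) / math.log(1.0 / rho)   # extinction-time cutoff
    m = r * n ** (k + 2)                              # number of family trees
    return (t * chi / n) ** (k + 3) * 2.0 * m * rho / (1.0 - rho)

k, chi, rho, r = 1, 1.0, 0.9, 1.0   # illustrative constants
ratios = [failure_bound(n, k, chi, rho, r) / (math.log(n) ** (k + 3) / n)
          for n in (10**3, 10**4, 10**5)]
# The ratio equals ((k+3)*chi/ln(1/rho))**(k+3) * 2*r*rho/(1-rho) for every n.
assert max(ratios) / min(ratios) < 1.0 + 1e-9
```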

We now consider the family trees with roots before generation $\log\lambda$. In the analysis, we will not prune these family trees during the first $\log\lambda$ generations. However, after generation $\log\lambda$, the family trees will be pruned as usual. This will only overestimate the extinction time of the family trees. Furthermore, there will be exactly $\lambda$ such family trees, one family tree for each of the $\lambda$ randomly chosen individuals in the initial population.

We now bound the number of lineages in these family trees, and their extinction times. The mean number of offspring is no more than $\eta\leq 2$ during the first $\log\lambda$ generations. Because the family trees are pruned after generation $\log\lambda$, we can re-use the arguments from Case 1 above to show that the mean number of offspring after generation $\log\lambda$ is no more than $\rho$, for some constant $\rho<1$. Let the random variable $Z_{t}$ be the number of family tree members in generation $t$. Analogously to the proof of Lemma 2, we have ${\bf E}[Z_{t}]\leq 2^{t}$ if $t\leq\log\lambda$, and ${\bf E}[Z_{t}]\leq 2^{\log\lambda}\rho^{t-\log\lambda}=\lambda\rho^{t-\log\lambda}$ for $t\geq\log\lambda$. We consider the run a failure if any of the $\lambda$ family trees survives longer than $\sqrt{n}$ generations. By the union bound and Markov's inequality, the probability of this failure event is no more than $\lambda{\bf E}\left[Z_{\sqrt{n}}\right]=e^{-\Omega (\sqrt{n})}$.

Let the random variable $P_{i}$ be the number of lineages in family tree $i$, $1\leq i\leq\lambda$. Similarly to the proof of Lemma 2, the expected number of different lineages in the family tree is no more than
$$ {\bf E}[P_{i}]\leq\sum_{t=1}^{\log\lambda}{\bf E}[Z_{t}]+\sum_{t=\log\lambda+1}^{\infty}{\bf E}[Z_{t}]\leq 2\lambda+\frac{\rho\lambda}{1-\rho}=O(\lambda).$$
We consider the run a failure if there are more than $\lambda^{3}$ lineages in all family trees. By Markov's inequality, the probability of this failure event is no more than
$$\Pr\left(\sum_{i=1}^{\lambda}P_{i}\geq\lambda^{3}\right)\leq\sum_{i=1}^{\lambda}{\bf E}\left[P_{i}\right]/\lambda^{3}=O(1/\lambda).$$
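The two sums can be checked numerically: the first is a geometric series summing to less than $2\lambda$, and the second is a geometric tail summing to at most $\rho\lambda/(1-\rho)$. A sketch with illustrative values, taking $\lambda$ to be a power of two so that $\log\lambda$ is an integer:

```python
import math

lam, rho = 1024, 0.8     # illustrative values
L = int(math.log2(lam))
# Sum of the E[Z_t] bounds for t <= log(lam): 2 + 4 + ... + 2**L = 2**(L+1) - 2.
first = sum(2.0 ** t for t in range(1, L + 1))
# Tail for t > log(lam): lam * (rho + rho**2 + ...) -> lam * rho / (1 - rho).
tail = sum(lam * rho ** (t - L) for t in range(L + 1, L + 2000))
assert first <= 2 * lam
assert tail <= lam * rho / (1.0 - rho) + 1e-6
```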

We now bound the probability that a given lineage finds an optimal search point. Define $\sigma^{\prime}:=\sigma-\delta-(k+4)/n$. To find the optimum, it is necessary that all the bits in the interval of length $\sigma^{\prime}n$, starting from position $k+4$, are 1-bits. We consider the run a failure if any of the individuals in the initial population has less than $\sigma^{\prime}n/3$ 0-bits in this interval. By a Chernoff bound and the union bound, the probability of this failure event is no more than $\lambda e^{-\Omega (n)}=e^{-\Omega (n)}$.

The probability of flipping a given 0-bit within $\sqrt{n}$ generations is by the union bound no more than $\chi/\sqrt{n}$. Hence, the probability that all of the at least $\sigma^{\prime}n/3$ 0-bits have been flipped is less than $(\chi/\sqrt{n})^{\sigma^{\prime}n/3}=n^{-\Omega (n)}$. The probability that any of the at most $\lambda^{3}$ lineages finds the optimum within $\sqrt{n}$ generations is by the union bound no more than $\lambda^{3}n^{-\Omega (n)}=n^{-\Omega (n)}$.

If none of the failure events occurs, then no globally optimal search point has been found during the first $rn^{2}$ generations. The probability that any of the failure events occurs is by the union bound less than $1/2+o(1)$. The proposition therefore follows.▪

Once the equilibrium position has been reached, we will prove that it is hard to obtain the global optimum. We will rely on the fact that it is necessary to have at least $\delta n/3$ 0-bits in the interval from $(\sigma+\delta)n$ to $(\sigma+2\delta)n$, and that any individual with a 0-bit in this interval will be ranked worse than at least half of the population.

Let $\sigma$ and $\delta$ be any constants that satisfy $0<\delta<\sigma<1-3\delta$. If the linear ranking EA with population size $\lambda$, where $n\leq\lambda\leq n^{k}$, for any constant integer $k\geq 1$, with selection pressure $\eta$ and bit-wise mutation rate $\chi/n$ for a constant $\chi>0$ satisfying $\eta>(2e^{\chi (\sigma+3\delta)}-1)/(1-\delta)$, is applied to ${\rm SelPres}_{\sigma,\delta,k}$, and the $(1+\delta)/2$-ranked individual reaches at least $(\sigma+2\delta)n$ leading 1-bits before the optimum has been found, then the probability that the optimum is found within $e^{cn}$ function evaluations is $e^{-\Omega (n)}$, for some constant $c>0$.

Define $\gamma:=(1+\delta)/2$, and note that
$$\frac{\beta(\gamma)}{\gamma}=\eta (1-\gamma)+\gamma=\frac{\eta (1-\delta)+1+\delta}{2}>e^{\chi (\sigma+3\delta)}.$$
Hence, we have
$$\xi^{\ast}:=\ln (\beta(\gamma)/\gamma)/\chi>\sigma+3\delta.\tag{3}$$
Let $\xi_{0}:=\sigma+2\delta<\xi^{\ast}-\delta$. Again, we apply the technique of non-selective family trees and define the *core* as the set of search points with more than $\xi_{0}n$ leading 1-bits. By the first part of Theorem 2, the probability that the $\gamma$-ranked individual has less than $\xi_{0}n$ leading 1-bits within $e^{cn}$ generations is $e^{-\Omega (n)}$ for sufficiently small $c$. If this event does happen, we say that a *failure* has occurred. Assuming no failure, each family tree member is selected in expectation less than $\rho:=\alpha ((1+\delta)/2)=1-(\eta-1)\delta<1$ times per generation.
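Inequality (3) can be verified numerically for admissible constants. The sketch below picks illustrative values of $\sigma$, $\delta$, and $\chi$ with $0<\delta<\sigma<1-3\delta$, and a selection pressure just above the threshold of Proposition 3 (while still at most 2, as linear ranking requires):

```python
import math

sigma, delta, chi = 0.2, 0.04, 1.0   # illustrative: 0 < delta < sigma < 1 - 3*delta
threshold = (2.0 * math.exp(chi * (sigma + 3 * delta)) - 1.0) / (1.0 - delta)
eta = 1.01 * threshold               # just above the threshold of Proposition 3
assert 1.0 < eta <= 2.0              # admissible linear ranking selection pressure

beta_over_gamma = (eta * (1.0 - delta) + 1.0 + delta) / 2.0
assert beta_over_gamma > math.exp(chi * (sigma + 3 * delta))
xi_star = math.log(beta_over_gamma) / chi
assert xi_star > sigma + 3 * delta   # inequality (3)
```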

We first estimate the extinction time of each family tree, and the total number of lineages among the at most $m:=\lambda e^{cn}$ family trees. The reproductive rate is bounded from above by the constant $\rho<1$. Hence, by Lemma 2, the probability that a given family tree survives longer than $t:=2cn/\ln (1/\rho)$ generations is at most $\rho^{t}=e^{-2cn}$. By the union bound, the probability that any family tree survives longer than $t$ generations is less than $m\rho^{t}=\lambda e^{-cn}$, and we say that a failure has occurred if a family tree survives longer than $t$ generations. For each $i$, where $1\leq i\leq m$, let the random variable $P_{i}$ denote the number of lineages in family tree $i$. By Lemma 2 and Markov's inequality, the probability that the number of lineages in all the family trees exceeds $e^{2cn}\rho/(1-\rho)$ is
$$\Pr\left(\sum_{i=1}^{m}P_{i}\geq\frac{e^{2cn}\rho}{1-\rho}\right)\leq\frac{(1-\rho)\sum_{i=1}^{m}{\bf E}\left[P_{i}\right]}{\rho e^{2cn}}\leq\lambda e^{-cn}.$$
If this happens, we say that a failure has occurred.

We then bound the probability that any given member of the family tree is optimal. To be optimal, it is necessary that there are at least $\delta n/3$ 0-bits in the interval from 1 to $\xi_{0}n$. We therefore optimistically assume that this is the case for the family tree member in question. However, none of these 0-bits may occur in the interval from bit position $k+4$ to bit position $(\sigma-\delta)n$, otherwise the family tree member is not optimal. The length of this interval is $(\sigma-\delta-o(1))n=\Omega (n)$. Since the family tree is non-selective, the positions of these 0-bits are chosen uniformly at random among the $\xi_{0}n$ bit positions. In particular, the probability of choosing a 0-bit within this interval, assuming no such bit position has been chosen yet, is at least $\Omega (n)/\xi_{0}n>c^{\prime}$, for some constant $c^{\prime}>0$. The probability that none of the at least $\delta n/3$ 0-bits are chosen from this interval is therefore no more than $(1-c^{\prime})^{\delta n/3}=e^{-\Omega (n)}$.
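The placement argument can be illustrated by simulation, with hypothetical values for $\xi_{0}$, $\delta$, and the interval fraction $c^{\prime}$ (none taken from the proof): when marked positions are placed uniformly at random, the probability that all of them avoid a sub-interval containing a constant fraction $c^{\prime}$ of the positions is at most $(1-c^{\prime})^{b}$ for $b$ marked positions:

```python
import random

rng = random.Random(5)
n = 300
xi0, delta, cprime = 0.4, 0.06, 0.25   # hypothetical constants
slots = int(xi0 * n)                   # candidate positions for the 0-bits
bad = int(cprime * slots)              # forbidden sub-interval: positions 0..bad-1
b = int(delta * n / 3)                 # number of 0-bits to place
trials = 100_000
miss = sum(
    all(q >= bad for q in rng.sample(range(slots), b)) for _ in range(trials)
) / trials
print(miss, (1.0 - cprime) ** b)       # empirical rate vs. the (1 - c')**b bound
assert miss <= (1.0 - cprime) ** b + 0.01
```

Sampling without replacement only lowers the avoidance probability, so the independent-placement bound $(1-c^{\prime})^{b}$ still applies.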

There are at most $t$ family tree members per lineage. The probability that any of the $te^{2cn}\rho/(1-\rho)\leq e^{3cn}$ family tree members is optimal is by the union bound no more than $e^{3cn}e^{-\Omega (n)}=e^{-\Omega (n)}$, assuming that $c$ is a sufficiently small constant. Taking into account all the failure probabilities, the probability that the optimum is found within $e^{cn}$ generations is $e^{-\Omega (n)}$, for a sufficiently small constant $c>0$.▪

By combining the previous, intermediate results, we can finally prove the main result of this section.

*Theorem 5:* Let $\sigma$ and $\delta$ be any constants that satisfy $0<\delta<\sigma<1-3\delta$. The expected runtime of the linear ranking EA with population size $\lambda$, $n\leq\lambda\leq n^{k}$, for any integer $k\geq 1$, selection pressure $\eta$, and constant mutation rate $\chi>0$ satisfying $\eta>(2e^{\chi (\sigma+3\delta)}-1)/(1-\delta)$ is $e^{\Omega (n)}$.

*Proof:* Define $\gamma:=(1+\delta)/2$ and $\xi^{\ast}:=\ln (\beta(\gamma)/\gamma)/\chi$. By (3) in the proof of Proposition 3, it holds that $\xi^{\ast}-\delta>\sigma+2\delta$. By Theorem 3 and Markov's inequality, there is a constant probability that the $\gamma$-ranked individual has reached at least $(\xi^{\ast}-\delta)n>(\sigma+2\delta)n$ leading 0-bits within $rn^{2}$ generations, for some constant $r$. By Proposition 2, the probability that the optimum has not been found within the first $rn^{2}$ generations is $\Omega (1)$. If the optimum has not been found before the $\gamma$-ranked individual has $(\sigma+2\delta)n$ leading 0-bits, then by Proposition 3, the expected runtime is $e^{\Omega (n)}$. The unconditional expected runtime of the linear ranking EA is therefore $e^{\Omega (n)}$.▪

This section proves an analogue to Theorem 5 for parameter settings where the equilibrium position $n(\ln\eta)/\chi$ is below $(\sigma-\delta)n$; i.e., it is shown that ${\rm SelPres}_{\sigma,\delta,k}$ is also hard when the selection pressure is too low. To prove this, it suffices to show that with overwhelming probability, no individual reaches more than $n\ln (\eta\kappa\phi)/\chi$ leading 0-bits in exponential time, for appropriately chosen constants $\kappa$, $\phi>1$. Again, we will apply the technique of non-selective family trees, but with a different core than in the previous section. The core is here defined as the set of search points with prefix sum less than $n\ln (\eta\kappa)/\chi$, where the *prefix sum* is the number of 0-bits in the first $n\ln (\eta\kappa\phi)/\chi$ bit positions of the search point. Clearly, to obtain at least $n\ln (\eta\kappa\phi)/\chi$ leading 0-bits, it is necessary to have prefix sum exactly $n\ln (\eta\kappa\phi)/\chi$. We will consider individuals outside the core, i.e., the individuals with prefix sums in the interval from $n\ln (\eta\kappa)/\chi$ to $n\ln (\eta\kappa\phi)/\chi$. Note that choosing $\kappa$ and $\phi$ to be constants slightly larger than 1 implies that this interval begins slightly above the equilibrium position $n\ln (\eta)/\chi$ given by Theorem 2 (see Fig. 5).

Single-type branching processes are not directly applicable to analyze this drift process, because they have no way of representing how far each family tree member is from the core. Instead, we will consider a more detailed model based on *multi-type branching processes* (see Haccou *et al.* [10]). Such branching processes generalize single-type branching processes by having individuals of multiple types. In our application, the type of an individual corresponds to the prefix sum of the individual. Before defining and studying this particular process, we will describe some general aspects of multi-type branching processes.

*Definition 3:* A multi-type branching process with $d$ types is a Markov process $Z_{0}, Z_{1},\ldots$ on $\mathbb{N}^{d}_{0}$ which, for all $t\geq 0$, is given by
$$Z_{t+1}:=\sum_{j=1}^{d}\sum_{i=1}^{Z_{tj}}\xi_{i}^{(j)}$$ where for all $j$, $1\leq j\leq d$, $\xi_{i}^{(j)}\in\mathbb{N}_{0}^{d}$ are i.i.d. random vectors having expectation ${\bf E}\left[\xi^{(j)}\right]=:{(m_{j1},m_{j2},\ldots,m_{jd})}^{\mathsf{T}}$. The associated matrix $M:=(m_{jk})_{d\times d}$ is called the *mean matrix* of the process.

Definition 3 states that the population vector $Z_{t+1}$ for generation $t+1$ is defined as a sum of offspring vectors, one offspring vector for each of the individuals in generation $t$. In particular, the vector element $Z_{tj}$ denotes the number of individuals of type $j$, $1\leq j\leq d$, in generation $t$, and $\xi_{i}^{(j)}$ denotes the offspring vector for the $i$th individual, $1\leq i\leq Z_{tj}$, of type $j$. The $k$th element, $1\leq k\leq d$, of this offspring vector $\xi_{i}^{(j)}$ represents the number of offspring of type $k$ that this individual produced.
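The recursion in Definition 3 can be illustrated with a minimal Python sketch. We assume a hypothetical 3-type process with *degenerate* offspring distributions (each type always produces the same offspring vector, so the offspring vectors equal their expectations); in this special case the identity $Z_{t}^{\mathsf{T}}=Z_{0}^{\mathsf{T}}M^{t}$ holds exactly:

```python
import numpy as np

# Hypothetical 3-type process, illustration only: each type-j individual
# produces the fixed offspring vector M[j], so the mean matrix M fully
# determines the dynamics.
M = np.array([[1, 2, 0],
              [0, 1, 1],
              [1, 0, 1]])   # row j = E[xi^(j)], the offspring vector of type j

def step(z, offspring):
    """One generation of Definition 3: sum one offspring vector per individual."""
    z_next = np.zeros_like(z)
    for j, count in enumerate(z):
        for _ in range(count):
            z_next += offspring[j]
    return z_next

z = np.array([1, 0, 0])     # start with a single type-1 individual
for t in range(1, 5):
    z = step(z, M)
    # With degenerate offspring, Z_t^T = Z_0^T M^t exactly.
    assert np.array_equal(z, np.array([1, 0, 0]) @ np.linalg.matrix_power(M, t))
```

With random (non-degenerate) offspring vectors the same identity holds in expectation, which is the content of the next display.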

Analogously to the case of single-type branching processes, the expectation of a multi-type branching process $(Z_{t})_{t\geq 0}$ with mean matrix $M$ follows
$${{\bf E}\left[Z_{t}\right]}^{\mathsf{T}}={{\bf E}\left[{\bf E}\left[Z_{t}\mid Z_{t-1}\right]\right]}^{\mathsf{T}}={{\bf E}\left[Z_{t-1}\right]}^{\mathsf{T}}M={{\bf E}\left[Z_{0}\right]}^{\mathsf{T}}M^{t}.$$ Hence, the long-term behavior of the branching process depends on the matrix power $M^{t}$. Calculating matrix powers can in general be non-trivial. However, if the branching process has the property that for any pair of types $i$, $j$, it is possible that a type $j$-individual has an ancestor of type $i$, then the corresponding mean matrix is *irreducible* [26].

*Definition 4:* A $d\times d$ non-negative matrix $M$ is *irreducible* if for every pair $i$, $j$ of its index set, there exists a positive integer $t$ such that $m_{ij}^{(t)}>0$, where $m_{ij}^{(t)}$ are the elements of the $t$th matrix power $M^{t}$. If the mean matrix $M$ is irreducible, then Theorem 6 implies that the asymptotics of the matrix power $M^{t}$ depend on the largest eigenvalue of $M$.

*Theorem 6:* If $M$ is an irreducible matrix with non-negative elements, then it has a unique positive eigenvalue $\rho$, called the *Perron root* of $M$, that is greater in absolute value than any other eigenvalue. All elements of the left and right eigenvectors $u={(u_{1},\ldots,u_{d})}^{\mathsf{T}}$ and $v={(v_{1},\ldots,v_{d})}^{\mathsf{T}}$ that correspond to $\rho$ can be chosen positive and such that $\sum_{k=1}^{d}u_{k}=1$ and $\sum_{k=1}^{d}u_{k}v_{k}=1$. In addition
$$M^{n}=\rho^{n}\cdot A+B^{n}$$ where $A=(v_{i}u_{j})_{i,j=1}^{d}$ and $B$ are matrices that satisfy the following conditions.

- $AB=BA=0$.
- There are constants $\rho_{1}\in (0,\rho)$ and $C>0$ such that none of the elements of the matrix $B^{n}$ exceeds $C\rho_{1}^{n}$.
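As a numerical sanity check of Theorem 6 (not part of the original argument), one can verify the decomposition $M^{n}=\rho^{n}A+B^{n}$ on an arbitrarily chosen small positive matrix: for large $n$ the remainder $B^{n}$ is negligible, so $M^{n}\approx\rho^{n}A$ with $A=(v_{i}u_{j})$:

```python
import numpy as np

# Arbitrary 2x2 positive (hence irreducible) matrix, chosen for this sketch.
M = np.array([[0.2, 0.5],
              [0.3, 0.1]])

w, V = np.linalg.eig(M)
k = int(np.argmax(w.real))
rho = w[k].real                      # Perron root
v = np.abs(V[:, k].real)             # right Perron eigenvector, made positive

wl, U = np.linalg.eig(M.T)           # eigenvectors of M^T = left eigenvectors of M
kl = int(np.argmax(wl.real))
u = np.abs(U[:, kl].real)            # left Perron eigenvector, made positive
u /= u.sum()                         # normalize so that sum_k u_k = 1
v /= u @ v                           # ... and sum_k u_k v_k = 1

A = np.outer(v, u)                   # A = (v_i u_j)
n = 30
# B^n decays like rho_1^n with rho_1 < rho, so M^n is dominated by rho^n * A.
assert np.allclose(np.linalg.matrix_power(M, n), rho**n * A, atol=1e-12)
```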

A central attribute of a multi-type branching process is therefore the Perron root of its mean matrix $M$, denoted $\rho (M)$. A multi-type branching process with mean matrix $M$ is classified as *sub-critical* if $\rho (M)<1$, *critical* if $\rho (M)=1$, and *super-critical* if $\rho (M)>1$. Theorem 6 implies that any sub-critical multi-type branching process will eventually become extinct. However, to obtain good bounds on the probability of extinction within a given number of generations $t$ using Theorem 6, one also has to take into account the matrix $A$, which is defined in terms of both the left and right eigenvectors. Instead of directly applying Theorem 6, it will be more convenient to use the following lemma.

*Lemma 3:* Let $Z_{0}, Z_{1},\ldots$ be a multi-type branching process with irreducible mean matrix $M=(m_{ij})_{d\times d}$. If the process started with a single individual of type $h$, then for any $k>0$ and $t\geq 1$ $$\Pr\left(\sum_{j=1}^{d}Z_{tj}\geq k\mid Z_{0}=e_{h}\right)\leq{{\rho (M)^{t}}\over{k}}\cdot{{v_{h}}\over{v^{\ast}}}$$ where $e_{h}$, $1\leq h\leq d$, denote the standard basis vectors, $\rho (M)$ is the Perron root of $M$ with the corresponding right eigenvector $v$, and $v^{\ast}:=\min_{1\leq i\leq d}v_{i}$.

*Proof:* The proof follows [10, p. 122]. By Theorem 6, the matrix $M$ has a unique largest eigenvalue $\rho (M)$, and all the elements of the corresponding right eigenvector $v$ are positive, implying $v^{\ast}>0$. The probability that the process consists of at least $k$ individuals in generation $t$, conditional on the event that the process started with a single individual of type $h$, can be bounded as $$\begin{aligned}\Pr\left(\sum_{j=1}^{d}Z_{tj}\geq k\mid Z_{0}=e_{h}\right)&=\Pr\left(\sum_{j=1}^{d}Z_{tj}v^{\ast}\geq kv^{\ast}\mid Z_{0}=e_{h}\right)\\&\leq\Pr\left(\sum_{j=1}^{d}Z_{tj}v_{j}\geq kv^{\ast}\mid Z_{0}=e_{h}\right).\end{aligned}$$

Markov's inequality and linearity of expectation give $$\begin{aligned}\Pr\left(\sum_{j=1}^{d}Z_{tj}v_{j}\geq kv^{\ast}\mid Z_{0}=e_{h}\right)&\leq{\bf E}\left[\sum_{j=1}^{d}Z_{tj}v_{j}\mid Z_{0}=e_{h}\right]\cdot{{1}\over{kv^{\ast}}}\\&=\sum_{j=1}^{d}{\bf E}\left[Z_{tj}\mid Z_{0}=e_{h}\right]\cdot{{v_{j}}\over{kv^{\ast}}}.\end{aligned}$$

As seen above, the expectation on the right-hand side can be expressed as $${{\bf E}\left[Z_{t}\mid Z_{0}=e_{h}\right]}^{\mathsf{T}}={{\bf E}\left[Z_{0}\mid Z_{0}=e_{h}\right]}^{\mathsf{T}}M^{t}.$$

Additionally, by taking into account the starting conditions, $Z_{0h}=1$ and $Z_{0j}=0$ for all indices $j\ne h$, this simplifies further to $$\begin{aligned}\sum_{j=1}^{d}{\bf E}\left[Z_{tj}\mid Z_{0}=e_{h}\right]\cdot{{v_{j}}\over{kv^{\ast}}}&=\sum_{j=1}^{d}\sum_{i=1}^{d}{\bf E}\left[Z_{0i}\mid Z_{0}=e_{h}\right]\cdot m_{ij}^{(t)}\cdot{{v_{j}}\over{kv^{\ast}}}\\&=\sum_{j=1}^{d}m_{hj}^{(t)}\cdot{{v_{j}}\over{kv^{\ast}}}.\end{aligned}$$ Finally, by iterating $$M^{t}v=M^{t-1}(Mv)=\rho (M)\cdot M^{t-1}v$$ which in coordinate form gives $$\sum_{j=1}^{d}m_{hj}^{(t)}v_{j}=\rho (M)^{t}\cdot v_{h}$$ one obtains the final bound $$\Pr\left(\sum_{j=1}^{d}Z_{tj}\geq k\mid Z_{0}=e_{h}\right)\leq{{\rho (M)^{t}}\over{k}}\cdot{{v_{h}}\over{v^{\ast}}}.$$▪
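The coordinate identity $\sum_{j}m_{hj}^{(t)}v_{j}=\rho (M)^{t}v_{h}$ at the heart of this proof can be checked numerically; the $3\times 3$ matrix below is an arbitrary positive example chosen purely for illustration:

```python
import numpy as np

# Arbitrary positive (hence irreducible) matrix for the sketch.
M = np.array([[0.1, 0.4, 0.2],
              [0.3, 0.1, 0.1],
              [0.2, 0.2, 0.3]])

w, V = np.linalg.eig(M)
k = int(np.argmax(w.real))
rho = w[k].real                       # Perron root
v = np.abs(V[:, k].real)              # positive right Perron eigenvector

t = 7
Mt = np.linalg.matrix_power(M, t)
# Coordinate form of M^t v = rho^t v: sum_j m_hj^(t) v_j = rho^t v_h for all h.
assert np.allclose(Mt @ v, rho**t * v)
```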

We will now describe how to model a non-selective family tree outside the core as a multi-type branching process (see Fig. 5). Recall that the prefix sum of a search point is the number of 0-bits in the first $n\ln (\eta\kappa\phi)/\chi$ bit positions of the search point, and that the core is defined as the set of all search points with prefix sum less than $n\ln (\eta\kappa)/\chi$. The process has $n(\ln\phi)/\chi$ types. A family tree member has type $i$ if its prefix sum is $n\ln (\eta\kappa\phi)/\chi-i$. The element $a_{ij}$ of the mean matrix $A$ of this branching process represents the expected number of type $j$ offspring that a type $i$ individual produces per generation. Since we are looking for a lower bound on the extinction probability, we will over-estimate the matrix elements, which can only decrease the extinction probability. By the definition of linear ranking selection, the expected number of times any given individual is selected during one generation is no more than $\eta$. We will therefore use $a_{ij}=\eta\cdot p_{ij}$, where $p_{ij}$ is the probability that mutating a type $i$ individual creates a type $j$ individual. To simplify the proof of the second part of Lemma 4, we overestimate the probability $p_{ij}$ to $1/n^{2}$ for the indices $i$ and $j$ where $j-i\geq 2\log n+1$. Note that the probability that none of the first $n\ln (\eta\kappa)/\chi$ bits are flipped is less than $\exp (-\ln (\eta\kappa))=1/\eta\kappa$. In particular, this means that $\eta\cdot p_{ii}\leq\eta/\eta\kappa=1/\kappa=:a_{ii}$. The full definition of the mean matrix is as follows.

*Definition 5:* For any integer $n\geq 1$ and real numbers $\eta$, $\chi$, $\phi$, $\kappa$, $\varepsilon$ where $0<\chi$, $1\leq\eta$, and $1<\phi<\kappa\leq\varepsilon$, define the $n\ln (\phi)/\chi\times n\ln (\phi)/\chi$ matrix $A=(a_{ij})$ as $$a_{ij}=\begin{cases}\eta/n^{2}, & \text{if }2\log n+1\leq j-i,\\ \eta\cdot{n\ln (\eta\kappa\phi)/\chi\choose j-i}\cdot\left({{\chi}\over{n}}\right)^{j-i}, & \text{if }1\leq j-i\leq 2\log n,\\ 1/\kappa, & \text{if }i=j,\text{ and}\\ (1/\kappa)\cdot{i\choose i-j}\cdot\left({{\chi}\over{n}}\right)^{i-j}, & \text{if }i>j.\end{cases}$$
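For illustration only, the matrix of Definition 5 can be instantiated for small, hypothetical parameter values. These toy values need not satisfy the conditions of Lemma 4, so the resulting Perron root need not lie below 1, and for only $d=6$ types the $\eta/n^{2}$ band is empty; the sketch merely shows the banded structure of $A$:

```python
import numpy as np
from math import comb, log

# Hypothetical parameter values, chosen only to keep the matrix small.
n, eta, chi, kappa, phi = 64, 1.5, 1.0, 1.3, 1.1

d = int(n * log(phi) / chi)                   # number of types
N = int(n * log(eta * kappa * phi) / chi)     # prefix length n*ln(eta*kappa*phi)/chi
two_log_n = 2 * round(log(n, 2))

def a(i, j):
    """Entry a_ij of the mean matrix from Definition 5 (1-based indices)."""
    if j - i >= two_log_n + 1:
        return eta / n**2
    if 1 <= j - i <= two_log_n:
        return eta * comb(N, j - i) * (chi / n) ** (j - i)
    if i == j:
        return 1 / kappa
    return (1 / kappa) * comb(i, i - j) * (chi / n) ** (i - j)

A = np.array([[a(i, j) for j in range(1, d + 1)] for i in range(1, d + 1)])
rho = max(abs(np.linalg.eigvals(A)))          # Perron root = largest |eigenvalue|
print(d, float(rho))
```

All entries are positive, so the matrix is irreducible in the sense of Definition 4.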

In order to apply Lemma 3 to the mean matrix $A$ defined above, we first provide upper bounds on the Perron root of $A$ and on the maximal ratio between the elements of the corresponding right eigenvector.

*Lemma 4:* For any integer $n\geq 1$, and real numbers $\eta$, $1<\eta\leq 2$, $\chi>0$, and $\varepsilon>1$, there exist real numbers $\kappa$ and $\phi$, $1<\phi<\kappa\leq\varepsilon$, such that the matrix $A$ given by Definition 5 has Perron root bounded from above by $\rho (A)<c$ for some constant $c<1$. Furthermore, for any $h$, $1\leq h\leq n\ln (\phi)/\chi$, the corresponding right eigenvector $v$, where $v^{\ast}:=\min_{i}v_{i}$, satisfies $${{v_{h}}\over{v^{\ast}}}\leq 2^{n\ln (\phi)/\chi}\cdot\left({{n}\over{\chi}}\right)^{n\ln (\phi)/\chi-h}.$$

*Proof:* Set $\kappa:=\varepsilon$. Since $a_{ij}>0$ for all $i$, $j$, the matrix $A$ is by Definition 4 irreducible, and Theorem 6 applies to the matrix. Expressing the matrix as $A=(1/\kappa)\cdot I+B$, where $B:=A-(1/\kappa)\cdot I$, and $I$ is the identity matrix, the Perron root is $\rho (A)=1/\kappa+\rho (B)$.

The Frobenius bound for the Perron root of a non-negative matrix $M=(m_{ij})$ states that $\rho (M)\leq\max_{j}c_{j}(M)$ [16], where $c_{j}(M):=\sum_{i}m_{ij}$ is the $j$th column sum of $M$. However, when applied directly to our matrix, this bound is insufficient for our purposes. Instead, we can consider the transformation $SBS^{-1}$, for an invertible matrix $$S:={\rm diag}(x_{1},x_{2},\ldots,x_{n\ln(\phi)/\chi}).$$ To see why this transformation is helpful, note that for any square matrix $X$ with the same dimensions as $S$, we have $\det (SXS^{-1})=\det (X)$. So if $\rho$ is an eigenvalue of $B$, then $$0=\det (B-\rho I)=\det (S(B-\rho I)S^{-1})=\det (SBS^{-1}-\rho I)$$ and $\rho$ must also be an eigenvalue of $SBS^{-1}$. It follows that $\rho (B)=\rho (SBS^{-1})$. We will therefore apply the Frobenius bound to the matrix $SBS^{-1}$, which has off-diagonal elements $$(SBS^{-1})_{ij}=a_{ij}\cdot{{x_{i}}\over{x_{j}}}.$$ Define $x_{i}:=q^{i}$ where $$q:={{\ln (\eta\kappa\phi)}\over{\ln (1+1/r\eta)}}$$ for some constant $r>1/(\eta-1)\geq 1$ that will be specified later. Since $\eta=1+c$ for some $c>0$, the constant $q$ is bounded as $$q>{{\ln\eta}\over{\ln (1+{{1}\over{r\eta}})}}>{{\ln\eta}\over{\ln (2-{{1}\over{\eta}})}}={{\ln\eta}\over{\ln\eta+\ln ({{2}\over{\eta}}-{{1}\over{\eta^{2}}})}}>1.$$

The sum of any column $j$ can be bounded by the three sums $$\begin{aligned}\sum_{i=1}^{j-2\log n-1}a_{ij}\cdot{{x_{i}}\over{x_{j}}}&\leq n\cdot{{\eta}\over{n^{2}}}={{\eta}\over{n}}\\\sum_{i=j-2\log n}^{j-1}a_{ij}\cdot{{x_{i}}\over{x_{j}}}&\leq\eta\cdot\sum_{i=1}^{j-1}{n\ln (\eta\kappa\phi)/\chi\choose j-i}\cdot\left({{\chi}\over{n}}\right)^{j-i}\cdot q^{i-j}\\&\leq\eta\cdot\sum_{i=1}^{j-1}{{(\ln(\eta\kappa\phi)/q)^{j-i}}\over{(j-i)!}}\\&\leq\eta\cdot\sum_{k=1}^{\infty}{{(\ln (\eta\kappa\phi)/q)^{k}}\over{k!}}\\&=\eta\cdot (\exp (\ln (\eta\kappa\phi)/q)-1)\quad{\rm and}\\\sum_{i=j+1}^{n\ln (\phi)/\chi}a_{ij}\cdot{{x_{i}}\over{x_{j}}}&={{1}\over{\kappa}}\cdot\sum_{i=j+1}^{n\ln (\phi)/\chi}{i\choose i-j}\cdot\left({{\chi}\over{n}}\right)^{i-j}\cdot q^{i-j}\\&\leq{{1}\over{\kappa}}\cdot\sum_{i=j+1}^{n\ln (\phi)/\chi}{n\ln (\phi)/\chi\choose i-j}\cdot\left({{\chi}\over{n}}\right)^{i-j}\cdot q^{i-j}\\&\leq{{1}\over{\kappa}}\cdot\sum_{i=j+1}^{n\ln (\phi)/\chi}{{(q\ln\phi)^{i-j}}\over{(i-j)!}}\\&\leq{{1}\over{\kappa}}\cdot\sum_{k=1}^{\infty}{{(q\ln\phi)^{k}}\over{k!}}\\&={{1}\over{\kappa}}\cdot (\exp (q\ln\phi)-1).\end{aligned}$$ The Perron root of the matrix $A$ can now be bounded by $$\begin{aligned}\rho (A)&\leq{{1}\over{\kappa}}+\max_{j}c_{j}(SBS^{-1})={{1}\over{\kappa}}+\max_{j}\sum_{i\ne j}^{n\ln (\phi)/\chi}a_{ij}\cdot{{x_{i}}\over{x_{j}}}\\&\leq{{\eta}\over{n}}+\eta\cdot (\exp (\ln(\eta\kappa\phi)/q)-1)+{{1}\over{\kappa}}\cdot\exp (q\ln\phi)\\&={{\eta}\over{n}}+{{1}\over{r}}+{{\phi^{q}}\over{\kappa}}.\end{aligned}$$ Choosing $\phi$ sufficiently small, such that $1<\phi<\kappa^{1/2q}$, and defining the constant $r:={{2}\over{\eta-1}}\cdot{{\sqrt{\kappa}}\over{\sqrt{\kappa}-1}}>1/(\eta-1)$, we have $$\rho (A)\leq{{\eta}\over{n}}+{{1}\over{r}}+{{\phi^{q}}\over{\kappa}}\leq{{\eta}\over{n}}+{{\sqrt{\kappa}-1}\over{2\sqrt{\kappa}}}+{{1}\over{\sqrt{\kappa}}}={{\eta}\over{n}}+{{1}\over{2}}+{{1}\over{2\sqrt{\kappa}}}<1.$$
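The diagonal-scaling trick used above can be demonstrated on a small arbitrary matrix: similarity by $S={\rm diag}(x_{1},x_{2},\ldots)$ preserves the spectrum, while the Frobenius column-sum bound applied to $SBS^{-1}$ can be strictly sharper than the same bound applied to $B$ itself. The matrix $B$ and the weights $x_{i}$ below are illustrative choices, not the matrices from the proof:

```python
import numpy as np

# Illustrative 3x3 non-negative matrix (not the B from the proof).
B = np.array([[0.0, 0.6, 0.0],
              [0.1, 0.0, 0.6],
              [0.0, 0.1, 0.0]])
rho_B = max(abs(np.linalg.eigvals(B)))

x = np.array([1.0, 2.0, 4.0])         # x_i = q^i with q = 2, as in the proof's ansatz
S = np.diag(x)
T = S @ B @ np.linalg.inv(S)          # similar matrix, hence same eigenvalues

plain_bound = B.sum(axis=0).max()     # Frobenius bound on B: max column sum
scaled_bound = T.sum(axis=0).max()    # Frobenius bound on S B S^{-1}

assert np.isclose(rho_B, max(abs(np.linalg.eigvals(T))))  # spectrum preserved
assert rho_B <= scaled_bound <= plain_bound               # scaled bound sharper here
```

Here the plain bound is 0.7 while the scaled bound is 0.5, closer to the true Perron root of about 0.35.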

The second part of the lemma requires, for any $h$, an upper bound on the ratio $v_{h}/v^{\ast}$, where $v$ is the right eigenvector corresponding to the eigenvalue $\rho$. In the special case where the index $h$ corresponds to the eigenvector element with largest value, this ratio is called the *principal ratio*. By generalizing Minc's bound for the principal ratio [19], one obtains the upper bound
$${{v_{h}}\over{v^{\ast}}}=\max_{k}{{v_{h}}\over{v_{k}}}=\max_{k}{{\rho v_{h}}\over{\rho v_{k}}}=\max_{k}{{\sum_{j}a_{hj}\cdot v_{j}}\over{\sum_{j}a_{kj}\cdot v_{j}}}\leq\max_{k,j}{{a_{hj}}\over{a_{kj}}}.$$ It now suffices to prove that the matrix elements of $A$ satisfy
$$\forall h,j,k\quad{{a_{hj}}\over{a_{kj}}}\leq 2^{n\ln (\phi)/\chi}\cdot\left({{n}\over{\chi}}\right)^{n\ln (\phi)/\chi-h}.$$

To prove that these inequalities hold, we first find a lower bound $a^{\ast}_{j}$ on the minimal element along any column, i.e., $\min_{k}a_{kj}\geq a^{\ast}_{j}$, for any column index $j$. As illustrated in Fig. 6, the matrix elements of $A$ can be divided into six cases according to their column and row indices. For Cases 1a and 1b, where $2\log n+1\leq j-k\leq n\ln (\phi)/\chi$ $$a_{kj}>{{1}\over{n^{2}}}.$$ For Cases 2a and 2b, where $0<j-k\leq 2\log n$ $$a_{kj}\geq\left({{\chi}\over{n}}\right)^{j-k}\geq\left({{\chi}\over{n}}\right)^{2\log n}.$$ For Cases 3a and 3b, where $k\geq j$ $$a_{kj}\geq{{1}\over{\kappa}}\left({{\chi}\over{n}}\right)^{k-j}\geq{{1}\over{\kappa}}\left({{\chi}\over{n}}\right)^{n\ln (\phi)/\chi-j}.$$ Hence, we can use the lower bound $$a^{\ast}_{j}:=\begin{cases}{{1}\over{\kappa}}\left({{\chi}\over{n}}\right)^{n\ln (\phi)/\chi-j}, & \text{if }j\leq n\ln (\phi)/\chi-2\log n,\text{ and}\\ \left({{\chi}\over{n}}\right)^{2\log n}, & \text{otherwise.}\end{cases}$$

We then upper bound the ratio $a_{hj}/a^{\ast}_{j}$ for all column indices $j$. All elements of the matrix satisfy $a_{hj}\leq\eta$. Therefore, in Cases 1b, 2b, and 3b, where $j>n\ln (\phi)/\chi-2\log n$ $${{a_{hj}}\over{a^{\ast}_{j}}}\leq\eta\left({{n}\over{\chi}}\right)^{2\log n}.$$ In Cases 1a and 2a, where $h<j\leq n\ln (\phi)/\chi-2\log n$ $${{a_{hj}}\over{a^{\ast}_{j}}}\leq\kappa\eta\left({{n}\over{\chi}}\right)^{n\ln (\phi)/\chi-j}\leq\kappa\eta\left({{n}\over{\chi}}\right)^{n\ln (\phi)/\chi-h}.$$ Finally, in Case 3a, where $j\leq h$ and $j\leq n\ln (\phi)/\chi-2\log n$ $$\begin{aligned}{{a_{hj}}\over{a^{\ast}_{j}}}&\leq{{1}\over{\kappa}}{h\choose h-j}\cdot\left({{\chi}\over{n}}\right)^{h-j}\cdot\kappa\left({{n}\over{\chi}}\right)^{n\ln(\phi)/\chi-j}\\&\leq 2^{n\ln (\phi)/\chi}\cdot\left({{n}\over{\chi}}\right)^{n\ln (\phi)/\chi-h}.\end{aligned}$$ The second part of the lemma therefore holds. ▪

Having all the ingredients required to apply Lemma 3 to the mean matrix in Definition 5, we are now ready to prove the main technical result of this section. Note that this result implies that Conjecture 1 in [18] holds.

*Theorem 7:* For any constant $\epsilon>0$, there is a constant $c>0$ such that the probability that the linear ranking EA with population size $\lambda={\rm poly}(n)$, selection pressure $\eta$, and mutation rate $\chi/n$ produces, within $e^{cn}$ generations, any individual with at least $n((\ln\eta)/\chi+\epsilon)$ leading 0-bits is $e^{-\Omega (n)}$.

*Proof:* In the following, $\kappa$ and $\phi$ are two constants such that $(\ln\kappa+\ln\phi)/\chi=\epsilon$, where the relative magnitudes of $\kappa$ and $\phi$ are as given in the proof of Lemma 4.

Let the *prefix sum* of a search point be the number of 0-bits in the first $n\ln (\eta\kappa\phi)/\chi$ bits. We will apply the technique of non-selective family trees, where the core is defined as the set of search points with prefix sum less than $n\ln (\eta\kappa)/\chi$. Clearly, any non-optimal individual in the core has fitness lower than $n\ln (\eta\kappa)/\chi$.

To estimate the extinction time of a given family tree, we consider the multi-type branching process $Z_{0},Z_{1},\ldots$ having $n\ln (\phi)/\chi$ types, and where the mean matrix $A$ is given by Definition 5. Let the random variable $S_{t}:=\sum_{i=1}^{n\ln (\phi)/\chi}Z_{ti}$ be the family size in generation $t$. By Lemmas 3 and 4, it is clear that the extinction probability of the family tree depends on the type of the root of the family tree. The higher the prefix sum of the family root, the lower the extinction probability. The parent of the root of the family tree has prefix sum lower than $n\ln (\eta\kappa)/\chi$, hence the probability that the root of the family tree has type $h$ is $$\Pr(Z_{0}=e_{h})\leq{n\ln (\phi)/\chi\choose n\ln (\phi)/\chi-h}\cdot\left({{\chi}\over{n}}\right)^{n\ln (\phi)/\chi-h}.$$

By Lemmas 3 and 4, the probability that the family tree has at least $k$ members in generation $t$ is for sufficiently large $n$ and sufficiently small $\phi$ bounded by $$\begin{aligned}\Pr (S_{t}\geq k)&=\sum_{h=1}^{n\ln (\phi)/\chi}\Pr (Z_{0}=e_{h})\cdot\Pr\left(\sum_{j=1}^{n\ln (\phi)/\chi}Z_{tj}\geq k\mid Z_{0}=e_{h}\right)\\&\leq\sum_{h=1}^{n\ln (\phi)/\chi}{n\ln (\phi)/\chi\choose n\ln (\phi)/\chi-h}\cdot\left({{\chi}\over{n}}\right)^{n\ln (\phi)/\chi-h}\cdot{{\rho (A)^{t}}\over{k}}\cdot{{v_{h}}\over{v^{\ast}}}\\&\leq 2^{n\ln (\phi)/\chi}\cdot{{\rho(A)^{t}}\over{k}}\cdot\sum_{h=0}^{n\ln (\phi)/\chi}{n\ln(\phi)/\chi\choose h}\\&=2^{2n\ln (\phi)/\chi}\cdot{{\rho(A)^{t}}\over{k}}.\end{aligned}$$

By Lemma 4, the Perron root of matrix $A$ is bounded from above by a constant $\rho (A)<1$. Hence, for any constant $w>0$, the constant $\phi$ can be chosen sufficiently small such that for large $n$, the probability is bounded by $\Pr\left(S_{t}\geq k\right)\leq\rho (A)^{t-wn}/k$.

For $k=1$ and $w<1$, the probability that the non-selective family tree is not extinct in $n$ generations, i.e., that the *height* of the tree is larger than $n$, is $\rho (A)^{\Omega (n)}=e^{-\Omega (n)}$. Furthermore, the probability that the *width* of the non-selective family tree exceeds $k=\rho (A)^{-2wn}$ in any generation is by union bound less than $n\rho (A)^{wn}=e^{-\Omega (n)}$.

We now consider a phase of $e^{cn}$ generations. The number of family trees outside the core during this period is less than $\lambda e^{cn}$. The probability that any of these family trees survives longer than $n$ generations, or is wider than $\rho (A)^{-2wn}$, is by the union bound $\lambda e^{cn}\cdot (e^{-\Omega (n)}+e^{-\Omega (n)})=e^{-\Omega (n)}$ for a sufficiently small constant $c$. The number of paths from root to leaf within a single family tree is bounded by the product of the height and the width of the family tree. Hence, the expected number of different paths from root to leaf in all family trees is less than $\lambda e^{cn}n\rho (A)^{-2wn}$. The probability that it exceeds $e^{2cn}\rho (A)^{-2wn}$ is by Markov's inequality at most $\lambda e^{cn}ne^{-2cn}=e^{-\Omega (n)}$.

The parent of the root of each family tree has prefix sum no larger than $n\ln (\eta\kappa)/\chi$. In order to reach at least $n\ln (\eta\kappa\phi)/\chi$ leading 0-bits, it is therefore necessary to flip at least $n\ln (\phi)/\chi$ bits in the prefix within $n$ generations. The probability that a given bit is not flipped during $n$ generations is $(1-\chi/n)^{n}\geq p$ for some constant $p>0$. Hence, the probability that all of the $n\ln (\phi)/\chi$ bits are flipped at least once within $n$ generations is no more than $(1-p)^{n\ln (\phi)/\chi}=e^{-c^{\prime}n}$ for some constant $c^{\prime}>0$. Hence, by the union bound, the probability that any of the paths attains at least $n\ln (\eta\kappa\phi)/\chi$ leading 0-bits is less than $e^{2cn}\rho (A)^{-2wn}e^{-c^{\prime}n}=e^{-\Omega (n)}$ for sufficiently small $c$ and $w$. ▪

Using Theorem 7, it is now straightforward to prove that ${\rm SelPres}_{\sigma,\delta,k}$ is hard for the linear ranking EA when the ratio between the selection pressure $\eta$ and the mutation rate $\chi$ is too small.

*Corollary:* The probability that the linear ranking EA with population size $\lambda={\rm poly}(n)$, bit-wise mutation rate $\chi/n$, and selection pressure $\eta$ satisfying $\eta<\exp (\chi (\sigma-\delta))-\epsilon$ for any $\epsilon>0$, finds the optimum of ${\rm SelPres}_{\sigma,\delta,k}$ within $e^{cn}$ function evaluations is $e^{-\Omega (n)}$, for some constant $c>0$.

*Proof:* In order to reach the optimum, it is necessary to obtain an individual having at least $n(\sigma-\delta)$ leading 0-bits. However, by Theorem 7, the probability that this happens within $e^{cn}$ generations is $e^{-\Omega (n)}$ for some constant $c>0$. ▪

SECTION IV

The aim of this paper was to better understand the relationship between mutation and selection in EAs, and in particular to what degree this relationship can have an impact on the runtime. To this end, we rigorously analyzed the runtime of a non-elitist population-based EA that uses linear ranking selection and bit-wise mutation on a family of fitness functions. We focused on two parameters of the EA, $\eta$ which controls the selection pressure, and $\chi$ which controls the bit-wise mutation rate.

The theoretical results show that there exist fitness functions where the parameter settings of selection pressure $\eta$ and mutation rate $\chi$ have a dramatic impact on the runtime. To achieve polynomial runtime on the problem, the settings of these parameters need to be within a narrow critical region of the parameter space, as illustrated in Fig. 2. An arbitrarily small increase in the mutation rate, or decrease in the selection pressure, can increase the runtime of the EA from a small polynomial (i.e., highly efficient) to exponential (i.e., highly inefficient). The critical factor which determines whether the EA is efficient on the problem is not the individual parameter setting of $\eta$ or $\chi$, but rather the ratio between these two parameters. Too high a mutation rate $\chi$ can be balanced by increasing the selection pressure $\eta$, and too low a selection pressure $\eta$ can be balanced by decreasing the mutation rate $\chi$. Furthermore, the results showed that the EA will also have exponential runtime if the selection pressure becomes too high, or the mutation rate becomes too low. It was pointed out that the position of the critical region in the parameter space in which the EA is efficient is problem dependent. Hence, the EA may be efficient with a given mutation rate and selection pressure on one problem, but be highly inefficient with the same parameter settings on another problem. There is therefore no balance between selection and mutation that is good on all problems. The results shed some light on the possible reasons for the difficulty of parameter tuning in practical applications of EAs: the optimal parameter settings can be problem dependent, and very small changes in the parameter settings can have a large impact on the efficiency of the algorithm.

Informally, the results for the functions studied here can be explained by the occurrence of an equilibrium state into which the non-elitist population enters after a certain time. In this state, the EA makes no further progress, even though there is a fitness gradient in the search space. The position in the search space in which the equilibrium state occurs depends on the mutation rate and the selection pressure. When the number of new good individuals added to the population by selection equals the number of good individuals destroyed by mutation, then the population makes no further progress. If the equilibrium state occurs close to the global optimum, then the EA is efficient. If the equilibrium state occurs far from the global optimum, then the EA is inefficient. The results are theoretically significant because the impact of the selection-mutation interaction on the runtime of EAs has not previously been analyzed. Furthermore, there exist few results on the runtime of population-based EAs, in particular those that employ both a parent and an offspring population. Our analysis answers a challenge by Happ *et al.* [11], to analyze a population-based EA using a non-elitist selection mechanism. Although this paper analyses selection and mutation on the surface, it actually touches upon a far more fundamental issue: the tradeoff between exploration (driven by mutation) and exploitation (driven by selection). The analysis presented here could potentially be used to study rigorously the crucial issue of balancing exploration and exploitation in evolutionary search.

In addition to the theoretical results, this paper has also introduced some new analytical techniques to the analysis of evolutionary algorithms. In particular, the behavior of the main part of the population and of stray individuals is analyzed separately. The analysis of stray individuals is achieved using a concept which we call non-selective family trees, which are then analyzed as single-type and multi-type branching processes. Furthermore, we apply the drift theorem in two dimensions, which is not commonplace. As already demonstrated in [17], these new techniques are applicable to a wide range of EAs and fitness functions.

A challenge for future experimental work is to design and analyze strategies for dynamically adjusting the mutation rate and selection pressure. Can self-adaptive EAs be robust on problems like those that are described in this paper? For future theoretical work, it would be interesting to extend the analysis to other problem classes, to other selection mechanisms, and to EAs that use a crossover operator.

The authors would like to thank T. Chen for discussions about selection mechanisms in evolutionary algorithms, and R. Mathias, L. Kolotilina, and J. Rowe for discussions about Perron root bounding techniques.

This work was supported by EPSRC, under Grant EP/D052785/1, and by Deutsche Forschungsgemeinschaft, under Grant WI 3552/1-1. This work is based on earlier work in [18], available at http://doi.acm.org/10.1145/1527125.1527133.

P. K. Lehre is with the DTU Informatics, Technical University of Denmark, Kongens Lyngby 2800, Denmark (e-mail: pkle@imm.dtu.dk).

X. Yao is with the Center of Excellence for Research in Computational Intelligence and Applications, School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K. (e-mail: x.yao@cs.bham.ac.uk).
