SECTION I

## INTRODUCTION

Estimation of distribution algorithms (EDAs) [25], [28] are population-based stochastic algorithms that incorporate learning into optimization. Unlike evolutionary algorithms (EAs), which rely on variation operators to produce offspring, EDAs create offspring by sampling a probabilistic model that has been learned so far in the optimization process. The performance of an EDA therefore depends on how well the learned probabilistic model estimates the distribution of the optimal solutions. The general procedure of EDAs is summarized in Table I. In recent years, many variants of EDAs have been proposed. On one hand, they have been shown experimentally to outperform other existing algorithms on many benchmark test functions. On the other hand, there were also experimental observations showing that EDAs did not scale well to large problems. In spite of a large number of experimental studies, theoretical analyses of EDAs have been few, especially on the computational time complexity of EDAs.

The importance of the time complexity of EDAs was recognized by several researchers. Mühlenbein and Schlierkamp-Voosen [31] studied the convergence time of constant selection intensity algorithms on the OneMax function. Later, Mühlenbein [27] studied the response to selection equation of the univariate marginal distribution algorithm (UMDA) on the OneMax function through experiments as well as theoretical analysis. Pelikan et al. [32] studied the convergence time of the Bayesian optimization algorithm on the OneMax function. Rastegar and Meybodi [35] carried out a theoretical study of the global convergence time of a limit model of EDAs using drift analysis, but they did not investigate any relations between the problem size and the computation time of EDAs. In addition to convergence time, the time complexity of EDAs can be measured by the first hitting time (FHT), which is defined as the first time for a stochastic optimization algorithm to reach the global optimum. Although recent work pointed out the significance of studying the FHT of EDAs [29], [33], few results have been reported. Droste's results [8] on the compact genetic algorithm (cGA) are a rare example. He rigorously analyzed the FHT of cGA with population size 2 [14] on linear functions. The other example is González's doctoral dissertation [13], where she analyzed the FHT of EDAs on the pseudo-Boolean injective function using the analytical Markov chain framework proposed by He and Yao [17]. González [13] proved an important result that the worst-case mean FHT is exponential in the problem size for four commonly used EDAs. However, no specific problem was analyzed theoretically. Instead, González et al. [10] studied experimentally the mean FHT of three different types of EDAs, including the UMDA, on the Linear function, the LeadingOnes function [4], [7], [16], [37], and the UNIMAX (long-path) function [22].

TABLE I GENERAL PROCEDURE OF EDA

This paper concerns the theoretical analysis of the FHT of EDAs on optimization problems with a unique global optimum. First, we provide a classification of problem hardness based on the FHT of EDAs, so that we can relate the problem characteristics to EDAs. This is very important for investigating the principles of when to use which EDAs for a given problem. Given such a classification (with respect to an EDA), we then investigate the relationship between the probability conditions of EDAs and problem hardness. Specifically, the time complexity of a simple EDA, the UMDA with truncation selection, is analyzed on two unimodal problems. The first problem is the LeadingOnes problem [37], which has frequently been studied in the field of time complexity analysis of EAs [7], [16], [17], [18]. The other problem is a variant of LeadingOnes, namely BVLeadingOnes.

Our analysis can be briefly summarized from two aspects. First, we propose a general approach to time complexity analysis of EDAs with finite populations. In the domain of EDAs, many theoretical results are based on the infinite population assumption (e.g., [3], [11], [45]), while few consider the more realistic scenario that employs finite populations. Though we restrict our analysis to UMDA, our approach may also be useful for other EDAs. Second, both LeadingOnes and BVLeadingOnes are unimodal problems, and hence are usually expected to be easy for EDAs [11]. Our analysis confirms that LeadingOnes is easy for the UMDA studied. Interestingly, however, we find that BVLeadingOnes is hard for the UMDA. To deal with this issue, we relax the UMDA by the so-called margins, and prove that BVLeadingOnes becomes easy for this relaxed version of UMDA.

The rest of the paper is organized as follows. Section II discusses why FHT is more appropriate for time complexity analysis of EDAs and presents the classification of problem hardness and the corresponding probability conditions for EDAs. Section III presents the new approach to analyzing EDAs with finite populations and describes the UMDA studied in this paper. Then, UMDA is analyzed on the LeadingOnes and BVLeadingOnes problems in Sections IV and V, respectively. Section VI studies the relaxation form of the UMDA on the BVLeadingOnes problem. Finally, Section VII concludes the paper.

SECTION II

## TIME COMPLEXITY MEASURES FOR EDAS

### A. How to Measure the Time Complexity of EDAs

The concept of “convergence,” derived from the concept of convergence of random sequences [37], is often used to measure the limit behaviors of EAs, including EDAs. For EDAs, the following formal definition of “convergence” was given by Zhang and Mühlenbein [45]:

If $\lim _{t\to \infty }\bar {F}(t)=g^{\ast }$ holds for a given EDA, where $\bar {F}(t)$ is the average fitness of individuals in the $t$th generation and $g^{\ast }$ is the fitness of the global optimum, then we say that the EDA converges to the global optimum.

There has been some work concerning such convergence of EDAs [12], [30]. It is worth noting that the above definition of convergence requires all individuals of a population to reach the global optimum. If we assume that an EDA on a problem converges to the global optimum, we can then measure the EDA's time complexity using the minimal number of generations needed for it to converge. This concept is called the convergence time (CT), denoted by $T$ in this paper. For EDAs, the CT is formally defined by $$T \triangleq {{\min}}\left\{t;p\left(x^{\ast }\vert \xi _{t}^{(s)}\right)=1\right\}\eqno{\hbox{(1)}}$$ where $x^{\ast }$ is the global optimum of a given problem, and $\xi _{t}^{(s)}$ is the population after selection at the $t$th generation. $p\left(x^{\ast }\vert \xi _{t}^{(s)}\right)$ is the estimated probability (of generating $x^{\ast }$) by the EDA at the $t$th generation.

In addition to CT, the FHT is also a commonly used concept for measuring the time complexity of EAs [16], [17]. The FHT [16], [17], [43], denoted by $\tau$, is defined for the general procedure of EDA shown in Table I as $$\tau \triangleq {{\min}} \left\{t;x^{\ast } \in \xi _{t+1}\right\}\eqno{\hbox{(2)}}$$ where $\xi _{t+1}$ is the population generated at the end of the $t$th generation. In the domain of EAs, the FHT records the smallest number of generations needed to find the optimum, which is smaller by a factor of $N$ than another commonly used measure, the number of fitness evaluations, where $N$ is the number of fitness evaluations in every generation [9]. As González pointed out in [13], the FHT can also be used to measure the time complexity of EDAs.

Since EDAs are stochastic algorithms, both the CT $T$ and the FHT $\tau$ are random variables. Because the FHT measures the time at which the global optimum is found for the first time, the CT is no smaller than the FHT $$T\geq \tau\eqno{\hbox{(3)}}$$ which implies a natural way to bound the CT from below by the FHT or to bound the FHT from above by the CT.

In practical optimization, we are most interested in the time spent in finding the global optimum, not in waiting for the whole population to converge to the global optimum. Hence, the FHT is a better measure for analyzing the time complexity of the EDAs. It is worth noting that for a given EDA on a problem, it may have a small FHT but large CT. In other words, the population may take a long time (even infinite) to converge to the global optimum. In such cases, the analysis of FHT is still valid while the analysis of CT is rather uninteresting. It is possible that an EDA could find the global optimum efficiently (in polynomial time), but the population does not converge to the global optimum. We will discuss such an example in Section VI.

### B. Probability Conditions for EDA-Hardness

In order to understand better the relationship between problem characteristics and algorithmic features of an EDA, we introduce a problem classification for a given EDA. However, we should introduce some notations first.

Denote $Poly(n)$ as the polynomial function class of the problem size $n$ and $SuperPoly(n)$ as the super-polynomial function class of the problem size $n$. For a function $f(n)$ (where $f(n)>1$ always holds, and when $n\to \infty$, $f(n)\to \infty$), denote the following:

1. $f(n)\prec Poly(n)$ and $g(n)= {{1}\over {f(n)}} \succ {{1}\over {Poly(n)}}$ if and only if $\exists a,b\in \BBR ^{+}$, $n_{0}\in \BBN$: $\forall n>n_{0}$, $f(n)\leq an^{b}$;
2. $f(n)\succ SuperPoly(n)$ and $g(n)= {{1}\over {f(n)}} \prec {{1}\over {SuperPoly(n)}}$ if and only if $\forall a,b\in \BBR ^{+}$: $\exists n_{0}\in \BBN$: $\forall n>n_{0}$, $f(n)> an^{b}$.

Based on the above definitions, we know that “$\prec$” and “$\succ$” imply “$<$” and “$>$”, respectively, when $n$ is sufficiently large. $Poly(n)$ [$SuperPoly(n)$] implies that there exists a monotonically increasing function that is polynomial (super-polynomial) in the problem size $n$. Note that $g(n)= {{1}\over {f(n)}}\in (0,1)$, and its asymptotic form, $g(n)\succ {{1}\over {Poly(n)}}$ or $g(n)\prec {{1}\over {SuperPoly(n)}}$, can be used to measure the asymptotic order of a probability (e.g., the probability of generating a certain individual), since a probability always takes its value in the interval $[{0,1}]$. Then we provide the following problem classification for a given EDA.

1. EDA-easy Class. For a given EDA, a problem is EDA-easy if, and only if, with the probability of $1-1/SuperPoly(n)$, the FHT needed to reach the global optimum is polynomial in the problem size $n$.
2. EDA-hard Class. For a given EDA, a problem is EDA-hard if, and only if, with the probability of $1/Poly(n)$, the FHT needed to reach the global optimum is super-polynomial in the problem size $n$.

The above classification can be considered as a direct generalization of the following EA-hardness classification for EAs proposed by He and Yao [18].

1. EA-easy Class. For a given EA, a problem is EA-easy if, and only if, the mean FHT needed to reach the global optimum is polynomial in the problem size $n$.
2. EA-hard Class. For a given EA, a problem is EA-hard if, and only if, the mean FHT needed to reach the global optimum is super-polynomial in the problem size $n$.

We see that He and Yao's classification for EAs is based on mean FHT, while our classification for EDAs concerns more detailed characteristics of the probability distribution of FHT. Given a problem, if the FHT of an EDA is polynomial with a probability super-polynomially close to 1 (the probability will be called “an overwhelming probability” in the following parts of the paper), then we can say that in most independent runs, the EDA can find the optimum of the problem efficiently. On the other hand, if the FHT of an EDA is super-polynomial with a probability that is polynomially large, i.e., $1/Poly(n)$, then it is very likely that the EDA cannot find the optimum of the problem efficiently. A similar idea can be found in [42], which defined efficiency measures for randomized search heuristics.

From the definition of expectation in probability theory, we know that for an algorithm, the problems belonging to the EDA-hard class in our classification will still be hard under the classification based on mean FHT. But our classification defines EDA-easy differently from the classification based on mean FHT. In practice, it is possible that an EDA finds the optimum efficiently in most of the independent runs, while spending an extremely long time in the other runs. Problems of this kind will be considered “hard” cases if mean FHT is used for classification. However, in our classification, such problems are considered to be easy cases, which is more likely to fit the practitioners' point of view.
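The gap between the two classifications can be illustrated with a toy distribution. The sketch below is only an illustration: the distribution (FHT equal to $n$ with probability $1-2^{-n}$ and equal to $4^{n}$ otherwise) and all function names are hypothetical and do not come from any EDA analyzed in this paper. It shows a case where the FHT is polynomial with overwhelming probability while the mean FHT exceeds $2^{n}$.

```python
from fractions import Fraction


def toy_fht_distribution(n):
    """Hypothetical FHT law: tau = n w.p. 1 - 2^-n, tau = 4^n w.p. 2^-n."""
    p_slow = Fraction(1, 2 ** n)
    return {n: 1 - p_slow, 4 ** n: p_slow}


def mean_fht(dist):
    """Exact expectation of the FHT under the given distribution."""
    return sum(t * p for t, p in dist.items())


def prob_fht_at_most(dist, bound):
    """Exact probability that the FHT does not exceed `bound`."""
    return sum(p for t, p in dist.items() if t <= bound)


for n in (10, 20, 40):
    dist = toy_fht_distribution(n)
    # Probability of a polynomial FHT (here: tau <= n) and the mean FHT.
    print(n, float(prob_fht_at_most(dist, n)), float(mean_fht(dist)))
```

Under a mean-FHT classification this toy problem would count as hard, whereas under the EDA-easy definition above it would count as easy.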

We now establish conditions under which a problem is EDA-hard (or EDA-easy) for a given EDA. Let $\BBP (\tau =t) (t\in \BBN)$ be the probability distribution of the FHT, which is determined by the probabilistic model at the $t$th generation. An EDA can be regarded as a random process $K=\{K_{t}\colon t\in \BBN \}$, where $K_{t}$ is the probabilistic model (including the parameters) maintained at the $t$th generation. Obviously, $K_{t}$ implies the probability of generating the global optimum in one sampling at the $t$th generation, denoted by $P_{t}^{\ast }$ $$\forall t\in \BBN\colon K_{t}\vdash P_{t}^{\ast }.\eqno{\hbox{(4)}}$$

Meanwhile, to obtain the probability distribution of the FHT $\tau$, we let $P_{t}^{\prime}$ be the probability of generating the global optimum in one sampling at the $t$th generation, conditional on the event $\tau \geq t$ (i.e., the global optimum has not been generated before the $t$th generation). Consequently, we obtain the following lemma:

#### Lemma 1

The probability distribution of the FHT $\tau$ satisfies $$\forall t\geq 0\colon \BBP(\tau=t)=\left (1-\left(1-P_{t}^{\prime }\right)^{N}\right)\prod _{j=0}^{t-1}\left(1-P_{j}^{\prime }\right)^{N}.\eqno{\hbox{(5)}}$$

##### Proof

Let $x^{\ast }$ be the global optimum. As in Table I and (2), we let $\xi _{t+1}$ be the population generated at the end of the $t$th generation $(t\in \BBN)$. According to the FHT defined in (2), for any $t\in \BBN ^{+}$ we have $$\eqalignno{& \BBP(\tau=t)= \BBP\left(x^{\ast }\in \xi_{t+1},x^{\ast }\notin \xi _{t},\ldots,x^{\ast }\notin \xi_{2},x^{\ast }\notin \xi _{1}\right)\cr& = \BBP\left(x^{\ast }\in\xi _{t+1},x^{\ast }\notin \xi _{t},\ldots,x^{\ast }\notin \xi_{2}\mid x^{\ast }\notin \xi _{1}\right)\cr& \quad \cdot\BBP\left(x^{\ast }\notin \xi _{1}\right)\cr& = \BBP\left(x^{\ast}\in \xi _{t+1},x^{\ast }\notin \xi _{t},\ldots,x^{\ast }\notin\xi _{3}\mid x^{\ast }\notin \xi _{2},x^{\ast }\notin \xi_{1}\right)\cr& \quad \cdot \BBP\left(x^{\ast }\notin \xi_{2}\mid x^{\ast }\notin \xi _{1}\right) \BBP \left(x^{\ast}\notin \xi _{1}\right)\cr& = \BBP\left(x^{\ast }\in \xi_{t+1}\mid x^{\ast }\notin \xi _{t},\ldots,x^{\ast }\notin \xi_{1}\right) \BBP \left(x^{\ast }\notin \xi _{1}\right)\cr&\quad \cdot \prod _{j=1}^{t-1} \BBP \left(x^{\ast }\notin \xi_{j+1}\mid x^{\ast }\notin \xi _{j},\ldots,x^{\ast }\notin \xi_{1}\right)\cr& = \BBP\left(x^{\ast }\in \xi _{t+1}\mid \tau \geq t\right)\prod _{j=0}^{t-1} \BBP\left(x^{\ast }\notin \xi_{j+1}\mid \tau \geq j\right)\cr& =\left (1-\left(1-P_{t}^{\prime}\right)^{N}\right)\prod _{j=0}^{t-1}\left(1-P_{j}^{\prime}\right)^{N}}$$ where $N$ is the population size, the term $1-\left(1-P_{t}^{\prime}\right)^{N}$ is the probability that the optimum is found at the $t$th generation, conditional on the event $\tau \geq t$, and the term $\prod _{j=0}^{t-1}\left(1-P_{j}^{\prime}\right)^{N}$ is the probability that the optimum has not been found before the $t$th generation. Combining the above result with the fact $\BBP (\tau=0)=1-\left(1-P_{0}^{\prime}\right)^{N}$, we have proven the lemma. ■
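Formula (5) is straightforward to evaluate numerically once the conditional probabilities are known. The following is a minimal sketch, assuming the sequence $P_{0}^{\prime},P_{1}^{\prime},\ldots$ is supplied as a plain list; the function name `fht_distribution` and its arguments are illustrative and not part of the original analysis.

```python
def fht_distribution(p_cond, pop_size):
    """Return [P(tau = t) for t = 0, ..., len(p_cond) - 1] following (5).

    p_cond[t] plays the role of P'_t (probability of sampling the optimum in
    one draw at generation t, given tau >= t); pop_size is N.
    """
    dist = []
    survive = 1.0  # probability that the optimum was not sampled before generation t
    for p in p_cond:
        miss_all = (1.0 - p) ** pop_size  # all N samples of this generation miss
        dist.append((1.0 - miss_all) * survive)
        survive *= miss_all
    return dist


# Example with a constant (hypothetical) conditional probability.
dist = fht_distribution([0.01] * 50, pop_size=10)
print(dist[0], sum(dist))
```

With a constant $P_{t}^{\prime}$ the result is a geometric distribution, which gives a quick sanity check of the product form.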

Moreover, let us consider the following lemma:

#### Lemma 2

If $\BBP (\tau \prec Poly(n))\succ 1- {{1}\over {SuperPoly(n)}}$, then $\exists t^{\prime}\leq \lceil \BBE [\tau \mid \tau \prec Poly(n)]\rceil +1$ such that $$\BBP(\tau=t^{\prime })\succ {{1}\over {Poly(n)}}.$$

##### Proof

Assume that $\forall t\leq \lceil \BBE [\tau \mid \tau \prec Poly(n)]\rceil +1$, $\BBP (\tau =t)\prec {{1}\over {SuperPoly(n)}}$, then we know that $$\eqalignno{& \max \left\{\BBP(\tau=t);t\leq \lceil \BBE [\tau\mid\tau \prec Poly(n)]\rceil +1\right\}\cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \prec{{1}\over {SuperPoly(n)}}.}$$ Hence, we can obtain $$\eqalignno{& \BBP(\tau\leq \lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +1)\cr& =\sum _{t=0}^{\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +1} \BBP (\tau =t) \cr& \leq \left(\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +2\right)\cr& \quad \cdot \max \left\{\BBP(\tau=t);t\leq\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +1\right\}\cr& \prec {{Poly(n)}\over {SuperPoly(n)}}.}$$

Now we can estimate the expectation of the FHT $\tau$ $$\eqalignno{& \BBE [\tau\mid \tau \prec Poly(n)]=\sum_{t=0}^{+\infty }t \BBP (\tau =t\mid \tau \prec Poly(n))\cr& =\sum_{t=0}^{Poly(n)} {{t \BBP(\tau=t,\tau \prec Poly(n))}\over {\BBP(\tau\prec Poly(n))}}\cr& =\sum _{t=0}^{Poly(n)} {{t\BBP(\tau=t)}\over {\BBP(\tau\prec Poly(n))}}\geq \sum_{t=0}^{Poly(n)}t \BBP (\tau =t)\cr& =\sum _{t=0}^{\lceil \BBE[\tau\mid \tau \prec Poly(n)]\rceil +1}t \BBP (\tau =t)\cr& \quad+\sum _{t=\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +2}^{Poly(n)}t \BBP (\tau =t)\cr& >(\lceil \BBE [\tau\mid\tau \prec Poly(n)]\rceil +2)\cr& \quad \cdot\BBP\left(Poly(n)\succ \tau >\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +1\right)\cr& =(\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +2)\biggl(\BBP\left(\tau \prec Poly(n)\right)\cr&\quad - \BBP\left(\tau \leq\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +1\right)\biggr)\cr& =(\lceil \BBE [\tau\mid \tau\prec Poly(n)]\rceil +2)\cr& \quad \cdot \left (1- {{1}\over {SuperPoly(n)}}- {{Poly(n)}\over {SuperPoly(n)}}\right)\cr& \quad\succ (\lceil \BBE [\tau\mid \tau \prec Poly(n)]\rceil +2)- {{Poly(n)}\over {SuperPoly(n)}}\cr& \quad {\,} - {{Poly(n)Poly(n)}\over {SuperPoly(n)}}.}$$ As $n\to \infty$, ${{Poly(n)}\over {SuperPoly(n)}}\to 0$ and ${{Poly(n)Poly(n)}\over {SuperPoly(n)}}\to 0$. Hence, there exists a sufficiently large problem size $n$ such that $$\BBE [\tau\mid \tau \prec Poly(n)]>\lceil \BBE [\tau\mid\tau \prec Poly(n)]\rceil +1\eqno{\hbox{(6)}}$$ which is an obvious contradiction. So we have proven the lemma. ■

Formally, an optimization problem can be denoted by $I=(\Omega,f)$, where $\Omega$ is the search space and $f$ the fitness function. Following He et al. [19], we use ${\cal P}=(\Omega,f, {\cal A})$ to indicate an algorithm ${\cal A}$ on a fitness function $f$ in the search space $\Omega$. Let the FHT of ${\cal A}$ on $I$ be $\tau ({\cal P})$. The following theorem describes the relation between EDA-hardness and probability $P_{i}^{\ast }$.

#### Theorem 1

For a given ${\cal P}$, if the population size $N$ of the EDA ${\cal A}$ is polynomial in the problem size $n$, then:

1. if $I$ is EDA-easy for ${\cal A}$, then $\exists t^{\prime \prime}\leq \lceil \BBE [\tau ({\cal P})\mid \tau ({\cal P})\prec Poly(n)]\rceil +1$ such that $$P_{t^{\prime \prime}}^{\ast }\succ {{1}\over {Poly(n)}};$$
2. if $\forall t=t(n)\prec Poly(n)$, $P_{t}^{\ast }\prec {{1}\over {SuperPoly(n)}}$, then $I$ is EDA-hard for ${\cal A}$.

##### Proof

Note that the second part of this theorem is a corollary of the first part. We only need to prove the first part.

According to Lemma 1, we have $$\BBP(\tau({\cal P})=i)< 1-\left(1-P_{i}^{\prime }\right)^{N}.$$ On the other hand, according to Lemma 2, we know that $\exists t^{\prime}\leq \lceil \BBE [\tau ({\cal P})\mid \tau ({\cal P})\prec Poly(n)]\rceil +1$ such that $$\BBP(\tau({\cal P})=t^{\prime })\succ {{1}\over {Poly(n)}}.$$ Thus, we can define $t^{\prime \prime}$ as follows: $$\eqalignno{& t^{\prime \prime }=\min \left \{\vphantom {{1}\over {Poly(n)}} t^{\prime }; t^{\prime }\leq\lceil \BBE[\tau({\cal P})\mid \tau ({\cal P})\prec Poly(n)]\rceil +1,\right.\cr& \quad \quad \quad \quad \left. \BBP (\tau ({\cal P})=t^{\prime })\succ {{1}\over {Poly(n)}}\right \}.&{\hbox{(7)}}}$$ Since $\BBP (\tau ({\cal P})=t^{\prime \prime})\succ {{1}\over {Poly(n)}}$, we have $$1-\left(1-P_{t^{\prime \prime}}^{\prime }\right)^{N}\succ {{1}\over {Poly(n)}}.\eqno{\hbox{(8)}}$$ Let us assume that $P_{t^{\prime \prime}}^{\ast }\prec {{1}\over {SuperPoly(n)}}$. Here we let ${\cal E}$ represent the event “the global optimum is generated in one sampling at the $t^{\prime \prime}$-th generation,” then according to the definitions of $P_{t^{\prime \prime}}^{\ast }$ and $P_{t^{\prime \prime}}^{\prime}$ mentioned in Section II-B, we obtain the following inequality: $$\eqalignno{& P_{t^{\prime \prime}}^{\ast }= \BBP({\cal E})\geq \BBP({\cal E},\tau ({\cal P})\geq t^{\prime \prime }) \cr=&\, \BBP({\cal E}\mid \tau ({\cal P})\geq t^{\prime \prime }) \BBP(\tau ({\cal P})\geq t^{\prime \prime }) \cr=&\, P_{t^{\prime \prime}}^{\prime } \BBP(\tau ({\cal P})\geq t^{\prime \prime }).& {\hbox{(9)}}}$$ Meanwhile, (7) implies that $$\BBP(\tau({\cal P})\geq t^{\prime \prime })\geq\BBP(\tau({\cal P})= t^{\prime \prime })\succ {{1}\over {Poly(n)}}.\eqno{\hbox{(10)}}$$ Combining (9) and (10) together, we know that $P_{t^{\prime \prime}}^{\ast }\prec {{1}\over {SuperPoly(n)}}$ yields $P_{t^{\prime \prime}}^{\prime}\prec {{1}\over {SuperPoly(n)}}$.

Now $\forall f(n)\prec Poly(n)$, we estimate $$\lim _{n\to \infty } {{1-\left(1-{P_{t^{\prime \prime}}^{\prime}}\right)^{N}}\over {1/f(n)}}\eqno{\hbox{(11)}}$$ where $N=N(n)\prec Poly(n)$ is the population size of the EDA. Equation (11) can be calculated as follows: $$\eqalignno{& \lim _{n\to \infty } {{1-\left(1-{P_{t^{\prime\prime}}^{\prime}}\right)^{N(n)}}\over {1/f(n)}}\cr& =\lim_{n\to \infty } {{1-\left (\left(1-{P_{t^{\prime \prime}}^{\prime}}\right)^{{1}\over {P_{t^{\prime \prime}}^{\prime}}}\right)^{P_{t^{\prime \prime}}^{\prime }N(n)}}\over {1/f(n)}}\cr& =\lim _{n\to \infty }\left(f(n)-f(n)e^{-P_{t^{\prime\prime}}^{\prime }N(n)}\right)\cr& =\lim _{n\to \infty }\left(\vphantom {{\left(P_{t^{\prime \prime}}^{\prime}N(n)\right)^{2}}\over {2}}f(n)-f(n)\left(\vphantom {{\left(P_{t^{\prime \prime}}^{\prime }N(n)\right)^{2}}\over {2}}1-P_{t^{\prime \prime}}^{\prime }N(n)\right.\right.\cr&\quad\quad\quad \quad\left.\left.+ {{\left(P_{t^{\prime \prime}}^{\prime}N(n)\right)^{2}}\over {2}}+o\left(\left(P_{t^{\prime \prime}}^{\prime }N(n)\right)^{2}\right)\right)\right)\cr& =\lim _{n\to\infty }f(n)P_{t^{\prime \prime}}^{\prime }N(n)-\lim _{n\to\infty } {{f(n)\left(P_{t^{\prime \prime}}^{\prime}N(n)\right)^{2}}\over{2}}\cr& \quad -\lim _{n\to \infty }o\left(f(n)\left(P_{t^{\prime \prime}}^{\prime }N(n)\right)^{2}\right)\cr& \prec \lim _{n\to \infty }{{Poly^{2}(n)}\over {SuperPoly(n)}}-\lim _{n\to \infty }{{Poly^{3}(n)}\over {SuperPoly^{2}(n)}}\cr& \quad -\lim _{n\to \infty }o\left ({{Poly^{3}(n)}\over {SuperPoly^{2}(n)}}\right)=0.}$$ Hence, we know that $1-\left(1-P_{t^{\prime \prime}}^{\prime}\right)^{N}$ is smaller than ${{1}\over {f(n)}}\succ {{1}\over {Poly(n)}}$ when $n\to \infty$. In other words $$1-\left(1-P_{t^{\prime \prime}}^{\prime }\right)^{N}\prec {{1}\over {SuperPoly(n)}}$$ where we obtain a contradiction to (8).

So we have $$P_{t^{\prime \prime}}^{\ast }\succ {{1}\over {Poly(n)}}.$$ The theorem is proven. ■
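The limit in (11) can be checked numerically for concrete choices. In the sketch below, the super-polynomially small probability $2^{-n}$, the population size $n^{2}$, and the polynomial $f(n)=n^{3}$ are all hypothetical values picked for illustration. The code also relies on the elementary bound $1-(1-p)^{N}\leq Np$ (Bernoulli's inequality), which is a standard fact rather than a step taken from the proof.

```python
import math


def hit_probability(p, pop_size):
    """1 - (1 - p)^N, computed stably even when p is far below machine epsilon."""
    return -math.expm1(pop_size * math.log1p(-p))


def scaled_hit_probability(n):
    """f(n) * (1 - (1 - p)^N) for the illustrative choices described above."""
    p = 2.0 ** (-n)       # hypothetical super-polynomially small P'
    pop_size = n ** 2     # hypothetical polynomial population size N(n)
    f = n ** 3            # hypothetical polynomial f(n)
    return f * hit_probability(p, pop_size)


for n in (20, 40, 80, 160):
    print(n, scaled_hit_probability(n))
```

The printed values shrink rapidly as $n$ grows, in line with the limit of (11) being zero.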

The theorem above provides us with two simple probability conditions related to the problem classification in terms of EDA-hardness. Later, we will use this theorem to obtain more specific results related to EDA-hardness for the UMDA.

SECTION III

## TIME COMPLEXITY ANALYSIS OF EDAS WITH FINITE POPULATION SIZES

### A. A General Approach to Analyzing EDAs With Finite Population Sizes

In the domain of EAs, several different approaches have been proposed for analyzing theoretically the FHT, such as drift analysis [16], [18], analytical Markov chain [17], Chernoff bounds [7], [23], [24], and convergence rate [15], [43]. Some of them have been applied to EDAs as well. González used the analytical Markov chain to study the worst case exponential FHT of some EDAs [13]. Droste employed drift analysis and Chernoff bounds to analyze the time complexity of cGA (with a population size of two) on linear pseudo-Boolean functions [8]. However, those existing techniques might not be sufficient for time complexity analysis of EDAs, because EDAs do not use any variation operators (e.g., mutation and crossover) but rely on sampling successive probabilistic models. Hence, some new ideas are needed to deal with probabilistic models.

One of the main difficulties of analyzing probabilistic models is due to the errors brought by the random sampling processes. Such random errors may occur when a probabilistic model is updated via random sampling. An intuitive idea of handling the random errors is to assume infinite population sizes for EDAs. This assumption has been adopted in most of the existing literature, such as the well-known example of OneMax given by Mühlenbein and Schlierkamp-Voosen [31], and Zhang's convergence analysis of EDAs [45]. Two exceptions are the aforementioned results of Droste on cGA [8] and González's general worst case analysis of EDAs [13].

In this section, we will provide a general approach to analyzing theoretically EDAs with finite population sizes. The approach is closely related to Chernoff bounds and the discrete dynamic system model of population-based incremental learning (PBIL) [1]. PBIL is a more general version of UMDA and its discrete dynamic system model was first presented by González et al. [11], [12], [13]. Assume there is a function ${\cal G}\colon \BBR ^{n} \to \BBR ^{n}$, then $A(t+1)= {\cal G}(A(t)) (t=0,1,\ldots)$ is called a discrete dynamic system [39]. In [11], [12], [13], two discrete dynamic systems were discussed. The first one considered PBIL as a function ${\cal G}_{1}\colon [{0,1}]^{n} \to [{0,1}]^{n}$. ${\cal G}_{1}$ includes the random effects. Hence, even if the initial probability distribution and algorithm parameters of PBIL are fixed, the system is still stochastic. This is an exact model of PBIL, but hard to analyze directly. So the authors considered the second dynamic system with the function ${\cal G}_{2}\colon [{0,1}]^{n} \to [{0,1}]^{n}$, which removes the random effects by assuming an infinite population size and thereby becomes deterministic. Although the deviation (caused by the random sampling errors) between the two dynamic systems has been estimated, so as to study the fixed point of the first dynamic system by investigating that of the second system, their method does not relate the deviation to the computation time of PBIL. Hence, it is not applicable to time complexity analysis.

Although González et al. [11], [12], [13] did not analyze the time complexity of EDAs, their mathematical models (using the discrete dynamic systems) can be used to develop a feasible approach to analyzing the time complexity of EDAs. Such an approach can be summarized by two major steps.

1. Build an easy-to-analyze discrete dynamic system for the EDA. The idea is to de-randomize the EDA and build a deterministic dynamic system.
2. Analyze the deviations caused by de-randomization. Note that EDAs are stochastic algorithms. Concretely, tail probability techniques, such as Chernoff bounds, can be used to bound the deviations.

In this paper, we will use UMDA as an example of EDAs to illustrate the analysis of the time complexity of EDAs using the above approach. The analysis will show that our approach provides a feasible way of estimating the random errors brought by finite populations in UMDA, and thus sheds some light on analyzing other EDAs with finite populations. However, it should be noted that much work remains to be done to achieve such a goal.

### B. Univariate Marginal Distribution Algorithm

The UMDA was originally proposed as a discrete EDA [28], [44]. As one of the earliest and simplest EDAs, UMDA has attracted a lot of research attention. The UMDA studied in this paper adopts binary encoding and one of the most commonly used selection strategies—the truncation selection, which is described below.

Sort the ${\rm N}$ individuals in the population by their fitness from high to low. Then select the best ${\rm M}$ of them for estimating the probability distribution.

The general procedure of UMDA studied in our paper is shown in Table II, where ${\bf x}=(x_{1},x_{2},\ldots,x_{n})\in \{0,1\}^{n}$ represents an individual, $p_{t,i}(1) (p_{t,i}(0))$ is the estimated marginal probability of the $i$th bit of an individual to be 1 (0) at the $t$th generation. We can also define the indicators $\delta (x_{i}\vert 1)$ as follows: $$\delta (x_{i}\vert 1) \triangleq \cases{1, \hfill & x_{i}=1 \hfill \cr 0, \hfill & x_{i}=0.\hfill \cr }$$

TABLE II UNIVARIATE MARGINAL DISTRIBUTION ALGORITHM (UMDA) WITH TRUNCATION SELECTION

The marginal probabilities $p_{t,i}(1)$ and $p_{t,i}(0)$ are given by $$p_{t,i}(1) \triangleq {{\sum\limits_{{\bf x}\in\xi _{t}^{(s)}}\delta (x_{i}\vert 1)}\over {M}}, \quad p_{t,i}(0) \triangleq 1-p_{t,i}(1).$$ Let $${\bf P}_{t}({\bf x}) \triangleq \left(p_{t,1}(x_{1}),p_{t,2}(x_{2}),\ldots,p_{t,n}(x_{n})\right)$$ where ${\bf P}_{t}({\bf x})$ is a probability vector, which is made up of $n$ random variables (because UMDA is a stochastic algorithm). Then the probability of generating individual ${\bf x}$ in the $t$th generation is $$p_{t}({\bf x})=\prod _{i=1}^{n} p_{t,i}(x_{i}).$$
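The definitions above can be turned into a short program. The following is a minimal sketch of UMDA with truncation selection, written from the textual description only (Table II itself is not reproduced here), so the loop ordering, the generation indexing, and every function name are assumptions made for illustration. LeadingOnes is used as the fitness function in its standard form (the number of consecutive 1-bits at the start of the string), and the marginals start at the uniform value $1/2$.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of x (standard LeadingOnes)."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def sample_population(p_one, pop_size, rng):
    """Draw N individuals; bit i is 1 with marginal probability p_one[i]."""
    return [[1 if rng.random() < p else 0 for p in p_one] for _ in range(pop_size)]


def truncation_selection(population, fitness, num_selected):
    """Sort by fitness from high to low and keep the best M individuals."""
    return sorted(population, key=fitness, reverse=True)[:num_selected]


def estimate_marginals(selected):
    """p_{t,i}(1): fraction of selected individuals whose i-th bit equals 1."""
    m = len(selected)
    return [sum(x[i] for x in selected) / m for i in range(len(selected[0]))]


def prob_of_individual(p_one, x):
    """p_t(x): product of the per-bit marginal probabilities."""
    prob = 1.0
    for p, bit in zip(p_one, x):
        prob *= p if bit == 1 else 1.0 - p
    return prob


def umda(n, pop_size, num_selected, max_generations, seed=0):
    """Return (generation at which the all-ones optimum is first sampled, marginals).

    The generation is None if the optimum is not sampled within the budget.
    """
    rng = random.Random(seed)
    p_one = [0.5] * n          # uniform initial distribution
    optimum = [1] * n          # global optimum of LeadingOnes
    for t in range(max_generations):
        population = sample_population(p_one, pop_size, rng)
        if optimum in population:
            return t, p_one
        selected = truncation_selection(population, leading_ones, num_selected)
        p_one = estimate_marginals(selected)
    return None, p_one


hit, marginals = umda(n=10, pop_size=100, num_selected=50, max_generations=200)
print(hit, prob_of_individual(marginals, [1] * 10))
```

Because no margins are applied in this sketch, a marginal can reach 0 or 1 and stay there, so a run may end without sampling the optimum.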

### C. Analyzing Time Complexity of UMDA

The UMDA given in the previous section can be analyzed following the general idea presented in Section III-A. First, we define a function $\gamma \colon [{0,1}]^{n} \to [{0,1}]^{n}$ such that $\gamma = {\cal S}\circ {\cal D}$, where ${\cal S}\colon [{0,1}]^{n} \to [{0,1}]^{n}$ is the function that represents the effect of selection, and ${\cal D}\colon [{0,1}]^{n} \to [{0,1}]^{n}$ is the function that is used in eliminating the stochastic effects of the random sampling. Then we obtain a deterministic discrete dynamic system $\left\{{\mathhat {{\bf P}}}_{t}({\bf x}^{\ast });t=0,1,\ldots \right\}$ related to the marginal probabilities of generating the global optimum $$\eqalignno{{\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })=&\, {\bf P}_{0}({\bf x}^{\ast })&{\hbox{(12)}}\cr{\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast })=&\,\gamma \left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })\right)= {\cal S}\left({\cal D}\left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })\right)\right)&{\hbox{(13)}}\cr {\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=&\,\gamma ^{t}\left({\mathhat{{\bf P}}}_{0}({\bf x}^{\ast })\right)&{\hbox{(14)}}}$$ where ${\mathhat {{\bf P}}}_{t}({\bf x})=\left({\mathhat {p}}_{t,1}(x_{1}),\ldots, {\mathhat {p}}_{t,n}(x_{n})\right)$ is the marginal probability vector of the deterministic system for generating an individual ${\bf x}$, and ${\bf x}^{\ast }$ is the global optimum. Since UMDA is usually initialized with a uniform distribution, we consider ${\mathhat {{\bf P}}}_{0}({\bf x})= {\bf P}_{0}({\bf x})=\left({{1}\over {2}},\ldots, {{1}\over {2}}\right)$ in this paper. Correspondingly, the probability of generating an individual ${\bf x}$ is $${\mathhat {p}}_{t}({\bf x})=\prod _{i=1}^{n} {\mathhat {p}}_{t,i}(x_{i}).$$ Note that $p_{t}({\bf x})$ in the previous section corresponds to the original UMDA, while ${\mathhat {p}}_{t}({\bf x})$ is obtained from the deterministic dynamic system after de-randomization. Following the first step of our general approach, we need to estimate the time complexity of the de-randomized UMDA.

To relate the time complexity result obtained by the deterministic system to the original UMDA, we should estimate the deviation of the de-randomized UMDA from the original UMDA. Since the time complexity of the former depends entirely on $\left\{{\mathhat {{\bf P}}}_{t}({\bf x}^{\ast });t=0,1,\ldots \right\}$, such deviation arises from the difference between $\left\{{\mathhat {{\bf P}}}_{t}({\bf x}^{\ast });t=0,1,\ldots \right\}$ and $\left\{{\bf P}_{t}({\bf x}^{\ast });t=0,1,\ldots \right\}$. Ideally, we want to calculate exactly the difference between the two sequences of marginal probability vectors. However, this is a nontrivial task (if not an impossible one). Alternatively, we resort to estimating the probabilities that the deviations are smaller than some specific values. Two crucial lemmas for this task are given below.

#### Lemma 3 ([26]): Chernoff Bounds

Let $X_{1},X_{2},\ldots,X_{k} \in \{0,1\}$ be $k$ independent random variables (each taking the value of either 0 or 1) with the same distribution $$\forall i\ne j\colon \BBP(X_{i}=1)= \BBP (X_{j}=1)$$ where $i,j\in \{1,\ldots,k\}$. Let $X$ be the sum of those random variables, i.e., $X=\sum _{i=1}^{k} X_{i}$, then we have:

1. $\forall 0< \delta < 1$ $$\BBP\left(X< (1-\delta) \BBE [X]\right)< e^{- \BBE [X]\delta^{2}/2};$$
2. $\forall 0< \delta \leq 2e-1$ $$\BBP\left(X>(1+\delta) \BBE [X]\right)< e^{- \BBE [X]\delta^{2}/4}.$$

#### Lemma 4 ([21], [38])

Consider sampling without replacement from a finite population $(X_{1},\ldots,X_{N})\in \{0,1\}^{N}$. Let $(Y_{1},\ldots,Y_{M})\in \{0,1\}^{M}$ be a sample of size $M$ drawn randomly without replacement from the whole population, and let $Y^{(M)}$ and $X^{(N)}$ be the sums of the random variables in the sample and population, respectively, i.e., $Y^{(M)}=\sum _{i=1}^{M} Y_{i}$ and $X^{(N)}= \sum _{i=1}^{N} X_{i}$, then we have $$\eqalignno{\BBP\left(Y^{(M)}- {{MX^{(N)}}\over {N}}\geq M\delta \right)\leq &\, e^{- {{2M\delta ^{2}}\over {1-(M-1)/N}}}\cr <&\, e^{-2M\delta ^{2}}\cr \BBP\left(\left\vert Y^{(M)}- {{MX^{(N)}}\over {N}}\right\vert > M\delta\right)\leq&\, 2e^{- {{2M\delta ^{2}}\over {1-(M-1)/N}}}\cr <&\, 2e^{-2M\delta^{2}}}$$ where $\delta \in [{0,1}]$ is some constant.
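To get a feel for the magnitudes in Lemma 4, the following minimal Python sketch evaluates the one-sided bound in both of its forms. The function names and the sample values of $M$, $N$, and $\delta$ are illustrative assumptions, not part of the lemma.

```python
import math


def tail_bound_tight(m: int, n_pop: int, delta: float) -> float:
    """One-sided bound exp(-2*M*delta^2 / (1 - (M-1)/N)) from Lemma 4."""
    if not 1 <= m <= n_pop:
        raise ValueError("need 1 <= M <= N")
    return math.exp(-2.0 * m * delta ** 2 / (1.0 - (m - 1) / n_pop))


def tail_bound_simple(m: int, delta: float) -> float:
    """Looser one-sided bound exp(-2*M*delta^2) from Lemma 4."""
    return math.exp(-2.0 * m * delta ** 2)


# Hypothetical values chosen only for illustration.
m, n_pop, delta = 500, 1000, 0.1
print(tail_bound_tight(m, n_pop, delta), tail_bound_simple(m, delta))
```

The first value is never larger than the second, since the denominator $1-(M-1)/N$ is at most 1.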

Another issue that will be involved in our further analysis is to estimate the probability of the following events: $$\forall t\in \BBN_{0}\colon p_{t}({\bf x}^{\ast })\oplus {\mathhat {p}}_{t}({\bf x}^{\ast })\eqno{\hbox{(15)}}$$ where $\oplus \in \{\leq,\geq \}$. As we will show soon, they can be handled on the basis of estimation of the probabilities of deviations. Finally, before presenting the case studies in detail, it should be noted that we always consider finite population sizes throughout this paper. Although we will sometimes utilize a statement like “when the problem size becomes sufficiently large,” that does not mean that we assume infinite population sizes; it is merely used to obtain the asymptotic order of a function of the problem size $n$. The main difference is that the infinite population assumption implies infinite population sizes for all problem sizes (so that the random sampling errors are removed), while in our case the population size will be infinite only if the problem size has become infinite.

SECTION IV

## WORST CASE ANALYSIS OF UMDA ON THE LEADINGONES PROBLEM

The first maximization problem we investigate is called the LEADINGOnes problem, formally defined as follows: $${\rm L{\scriptstyle EADING}O{\scriptstyle NES}}({\bf x}) \triangleq \sum _{i=1}^{n}\prod _{j=1}^{i}x_{j},\quad x_{j}\in \{0,1\}.\eqno{\hbox{(16)}}$$

The global optimum of LEADINGOnes is ${\bf x}^{\ast }=(1,\ldots,1)$. The fitness of an individual is determined by the number of leading 1-bits in the individual, and it is not influenced by any bits to the right of the leftmost 0-bit. Consequently, the values of the bits to the right of the leftmost 0-bit do not influence the output of fitness-based selection operators in EAs. Due to this characteristic, a population will begin to converge to 1 at a bit only once the bits to its left have almost converged to 1's, and thus a sequential convergence phenomenon, namely Domino convergence [3], [36], [41], will happen.
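The definition in (16) can be illustrated with a minimal Python sketch. The function name `leading_ones` and the list-of-bits representation are assumptions made for illustration; the sum of prefix products in (16) equals the count of consecutive 1-bits at the start of the string.

```python
from typing import Sequence


def leading_ones(x: Sequence[int]) -> int:
    """Count the consecutive 1-bits at the start of x, as in (16)."""
    count = 0
    for bit in x:
        if bit != 1:
            break  # bits after the leftmost 0-bit do not matter
        count += 1
    return count


# Two individuals that differ only to the right of the leftmost 0-bit
# receive the same fitness.
print(leading_ones([1, 1, 0, 1, 1]))  # 2
print(leading_ones([1, 1, 0, 0, 0]))  # 2
print(leading_ones([1, 1, 1, 1, 1]))  # 5
```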

In the literature of EDAs, the LEADINGOnes problem has been investigated empirically [10], but no rigorous theoretical result exists. This section provides the first theoretical result that lays a sound foundation for the time complexity analysis of the UMDA on this problem.

First, we introduce the following concept.

#### Definition 1 ($b$-Promising Individual)

In the population that contains $N$ individuals, the $b$-promising individuals are those individuals with fitness no smaller than a threshold $b$.

Since the UMDA adopts truncation selection, we have the following lemma.

#### Lemma 5

For the UMDA with truncation selection, the proportion of the $b$-promising individuals after selection at the $t$th generation satisfies $$Q_{t,b}^{(s)}=\cases{{{Q_{t,b}N}\over {M}}, \hfill & Q_{t,b} \leq {{M}\over {N}} \hfill \cr\noalign{\vskip 6pt} \quad 1, \hfill & Q_{t,b} > {{M}\over {N}}\hfill \cr }\eqno {\hbox{(17)}}$$ where $Q_{t,b}\leq 1$ is the proportion of the $b$-promising individuals before the truncation selection.
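Lemma 5 can be checked on a small hand-made population. The sketch below is an illustration under assumed names (`promising_after_truncation`, `truncation_select`) and an assumed toy fitness list; it evaluates (17) and compares it with the proportion obtained by actually keeping the best $M$ of $N$ fitness values.

```python
from typing import List


def promising_after_truncation(q_before: float, n_pop: int, m_sel: int) -> float:
    """Evaluate (17): proportion of b-promising individuals after selection."""
    if not 0.0 <= q_before <= 1.0:
        raise ValueError("q_before must be a proportion in [0, 1]")
    # Q*N/M when Q <= M/N, and 1 otherwise.
    return min(1.0, q_before * n_pop / m_sel)


def truncation_select(fitnesses: List[int], m_sel: int) -> List[int]:
    """Keep the m_sel best fitness values (truncation selection)."""
    return sorted(fitnesses, reverse=True)[:m_sel]


# Toy population: 250 individuals with fitness 5 and 750 with fitness 1.
fitnesses = [5] * 250 + [1] * 750
b, n_pop, m_sel = 3, 1000, 500
selected = truncation_select(fitnesses, m_sel)
empirical = sum(f >= b for f in selected) / m_sel
print(empirical, promising_after_truncation(250 / n_pop, n_pop, m_sel))
```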

Define the $i$-convergence time $T_{i}$ to be the number of generations for a discrete EDA to converge to the globally optimal value on the $i$th bit of the solution. It is defined formally as $$T_{i} \triangleq \min\left\{t;p_{t,i}\left(x^{\ast }_{i}\right)=1\right\}.$$ Let $T_{0}=0$.

Moreover, in the following parts of the paper, we use the notation “$\omega$” to denote the relationship between the asymptotic orders of two functions [5], [24]. Given two positive functions of the problem size $n$, say $f=f(n)$ and $g=g(n)$, $f=\omega (g)$ holds if and only if $\lim _{n\to \infty }g(n)/f(n)=0$. Now we reach the following theorem.

#### Theorem 2

Given the population sizes $N=\omega (n^{2+\alpha }\log n)$, $M=\omega (n^{2+\alpha }\log n)$ (where $\alpha$ can be any positive constant) and $M=\beta N$ ($\beta \in (0,1)$ is some constant), for the UMDA with truncation selection on the LEADINGOnes problem, initialized with a uniform distribution, at least with the probability of $$\left(1- n^{-\omega (n^{2+\alpha })\delta ^{2}}\right)^{\bar {\tau}}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2(n-1)\bar{\tau}}$$ its FHT satisfies $$\tau < \bar {\tau }= {{n\left(\ln{{eM}\over {N}}-\ln (1-\delta)\right)}\over {\ln (1-\delta)+\ln \left({{N}\over{M}}\right)}}+2n$$ where $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a positive constant, and $\bar {\tau }$ represents an upper bound of the random variable $\tau$. In other words, the LEADINGOnes problem is EDA-easy for the UMDA.
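As a purely numerical illustration of the bound $\bar{\tau}$ in Theorem 2, the following Python sketch evaluates the expression for given $n$, $\beta = M/N$, and $\delta$. The function names and the sample parameter values are assumptions for illustration; the sketch only evaluates the formula and says nothing about the probability with which the bound holds.

```python
import math
from typing import Tuple


def delta_range(beta: float) -> Tuple[float, float]:
    """Open interval (max{0, 1 - 2M/N}, 1 - M/N) allowed for delta."""
    return max(0.0, 1.0 - 2.0 * beta), 1.0 - beta


def tau_bar(n: int, beta: float, delta: float) -> float:
    """Evaluate the upper bound on the FHT stated in Theorem 2."""
    lo, hi = delta_range(beta)
    if not lo < delta < hi:
        raise ValueError("delta is outside the interval required by Theorem 2")
    numerator = math.log(math.e * beta) - math.log(1.0 - delta)
    denominator = math.log(1.0 - delta) + math.log(1.0 / beta)
    return n * numerator / denominator + 2.0 * n


# Hypothetical setting: M = N/2 and delta = 0.25.
for n in (100, 200, 400):
    print(n, round(tau_bar(n, beta=0.5, delta=0.25), 1))
```

For fixed $\beta$ and $\delta$ the bound grows linearly in $n$, which is the sense in which the problem is EDA-easy here.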

##### Proof

The basic idea of the proof is based on the approach outlined in the former section. We first de-randomize the UMDA. Since the LEADINGOnes problem is associated with the domino convergence property, we can further divide the optimization process into $n$ stages. The $i$th stage starts when all bits at the left side of the $i$th bit have converged to 1's, and ends when the $i$th bit has converged. Suppose generation $t+1$ belongs to the $i$th stage, then the marginal probabilities at the generation are $$\eqalignno{& {\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast})=\gamma _{i}\left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast})\right)=\biggl({\mathhat {p}}_{t,1}\left(x_{1}^{\ast}\right),\ldots, {\mathhat {p}}_{t,i-1}\left(x_{i-1}^{\ast}\right),\cr& \quad \left[G {\mathhat {p}}_{t,i}\left(x_{i}^{\ast}\right)\right],R {\mathhat {p}}_{t,i+1}\left(x_{i+1}^{\ast}\right),\ldots,R {\mathhat{p}}_{t,n}\left(x_{n}^{\ast}\right)\biggr)}$$ where ${\bf x}^{\ast }=\left(x_{1}^{\ast },\ldots,x_{n}^{\ast }\right)=(1,\ldots,1)$ is the global optimum of the LEADINGOnes problem, $G=(1-\delta) {{N}\over {M}}$ ($\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a constant), and $R=(1-\eta)(1-\eta ^{\prime})$ ($\eta < 1$ and $\eta ^{\prime}< 1$ are positive functions of the problem size $n$). We consider three different cases in the above equation.

1. $j\in \{1,\ldots,i-1\}$. In the deterministic system above, the marginal probabilities ${\mathhat {p}}_{t,j}(x_{j}^{\ast })$ have converged to 1, thus at the next generation they will not change.
2. $j=i$. In the deterministic system above, the marginal probability ${\mathhat {p}}_{t,i}\left(x_{i}^{\ast }\right)$ is converging, and we use the factor $G=(1-\delta) {{N}\over {M}}$ to demonstrate the impact of selection pressure on this converging marginal probability, where ${{N}\over {M}}$ represents the influence of the selection operator (see Lemma 5).
3. $j\in \{i+1,\ldots,n\}$. The $j$th bits of individuals are not exposed to selection pressure, and we use the factor $R=(1-\eta)(1-\eta ^{\prime})$ to demonstrate the impact of genetic drift on these marginal probabilities.

In Case 3, we consider the $j$th marginal probability $p_{\cdot,j}\left(x_{j}^{\ast }\right) (j\in \{i+1,\ldots,n\})$ which is not affected by the selection pressure. This is rather pessimistic, because the UMDA tends to preserve the value of $x_{j}^{\ast }=1$ that leads to higher fitness, and thus tends to increase $p_{\cdot,j}\left(x_{j}^{\ast }\right)$. Utilizing the idea mentioned in (15), we will study the time complexity of the UMDA by studying the above deterministic system, and estimate the deviation between the deterministic system and the real UMDA in terms of the probability that the stochastic marginal probabilities of the UMDA are bounded by the corresponding deterministic marginal probabilities of the deterministic system. Before our analysis, we first provide the formal definition of the deterministic system.

With ${\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })=\left({{1}\over {2}},\ldots, {{1}\over {2}}\right)$, we have $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{i}^{t-T_{i-1}}\left({\mathhat{{\bf P}}}_{T_{i-1}}({\bf x}^{\ast })\right)$$ where $T_{i-1}< t\leq T_{i} (i=1,\ldots,n)$. Since $\{\gamma _{i}\}_{i=1}^{n}$ de-randomizes the whole optimization process, $\{T_{i}\}_{i=1}^{n}$ in the above equation are no longer random variables. For the sake of clarity, we rewrite the above equation as $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{i}^{t- {\mathhat {T}}_{i-1}}\left({\mathhat {{\bf P}}}_{{\mathhat {T}}_{i-1}}({\bf x}^{\ast })\right)$$ where ${\mathhat {T}}_{i-1}< t\leq {\mathhat {T}}_{i} (i=1,\ldots,n)$. As we will show immediately, ${\mathhat {T}}_{i} (1\leq i\leq n)$ is an upper bound of the random variable $T_{i}$ with some probability. Since $T_{n}\geq \tau$, our task finally becomes calculating ${\mathhat {T}}_{n}$ and the probability that ${\mathhat {T}}_{n}$ holds as an upper bound of $T_{n}$.

Now we present the proof in detail. First, we estimate ${\mathhat {T}}_{1}$ and $T_{1}$ for the UMDA, which is the first stage of our analysis. Consider the 1-promising individuals. Note that the first bits of the 1-promising individuals are 1's. The sampling procedure of the UMDA can be considered as a large number of events resulting in either 0 or 1. Hence, when $p_{t-1,1}(1)\leq {{M}\over {N(1-\delta)}}$, for the sampling procedure of the UMDA, by noting Lemma 5, we can apply Chernoff bounds to obtain the following: $$\eqalignno{& \BBP\left(Mp_{t,1}(1)\geq (1-\delta)p_{t-1,1}(1)N\mid p_{t-1,1}(1) \leq {{M}\over {N(1-\delta)}}\right)\cr& \quad >1- e^{- {{p_{t-1,1}(1)N}\over {2}}\delta ^{2}}}$$ where $N=\omega (n^{2}\log n)$, thus the probability above is super-polynomially close to 1, i.e., an overwhelming probability. An equivalent form of the equation above is $$\eqalignno{& \BBP\left(p_{t,1}(1)\geq (1-\delta) {{p_{t-1,1}(1)N}\over {M}}\mid p_{t-1,1}(1) \leq {{M}\over {N(1-\delta)}}\right)\cr& \quad \quad > 1- e^{- {{p_{t-1,1}(1)N}\over {2}}\delta ^{2}}}$$ which demonstrates that, with an overwhelming probability, the marginal probability $p_{t,1}(1)$ is lower bounded by $Gp_{t-1,1}(1)=(1-\delta) {{p_{t-1,1}(1)N}\over {M}}$. Furthermore, given ${\mathhat {p}}_{t,1}(1)=G^{t} {\mathhat {p}}_{0,1}(1)$ and $G>1$, we can obtain the inequality in Table III.

TABLE III CALCULATION OF PROBABILITY THAT $p_{t,1}(1)$ IS LOWER BOUNDED BY ${\mathhat {p}}_{t,1}(1)$

We now study the distribution of $T_{1}$. Consider the probability that $T_{1}$ is bounded by a value, say ${\mathhat {T}}_{1}$: given $T_{1}< {\mathhat {T}}_{1}$, then according to Lemma 5, at the $({\mathhat {T}}_{1}-1)$th generation, the marginal probability $p_{{\mathhat {T}}_{1}-1,1}(1)$ should be at least ${{M}\over {N(1-\delta)}}$. The above proposition is presented in Table IV, where in (19) the factor $\left(1-e^{- {{{\mathhat {p}}_{0,1}(1)N}\over {2}}\delta ^{2}}\right)$ is added since we apply Chernoff bounds once at the end of the $({\mathhat {T}}_{1}-1)$th generation and obtain the probability that ${\mathhat {p}}_{{\mathhat {T}}_{1},1}(1)=1$, under the condition ${\mathhat {p}}_{{\mathhat {T}}_{1}-1,1}(1)\geq {{M}\over {N(1-\delta)}}$. Now let us consider the following item. Noting that ${\mathhat {p}}_{{\mathhat {T}}_{1}-1,1}(1)$ is deterministic, we know $$\BBP\left({\mathhat {p}}_{{\mathhat {T}}_{1}-1,1}(1)> {{M}\over {N(1-\delta)}}\mid p_{0,1}(1)= {\mathhat {p}}_{0,1}(1)\right)\eqno {\hbox{(24)}}$$ must be either 0 or 1, and we need to find the value of ${\mathhat {T}}_{1}$ that makes the probability above 1. Given that ${\mathhat {p}}_{0,1}(1)= {{1}\over {2}}$, the condition that $\forall t< {\mathhat {T}}_{1}-1\colon {{M}\over {N(1-\delta)}}> {\mathhat {p}}_{t,1}(1)>(1-\delta) {{{\mathhat {p}}_{t-1,1}(1)N}\over {M}}$ and Lemma 5 together imply the following inequalities. $$\eqalignno{G^{{\mathhat {T}}_{1}-2} {\mathhat{p}}_{0,1}(1)=&\, (1-\delta)^{{\mathhat{T}}_{1}-2}\left(\displaystyle {{N}\over {M}}\right)^{{\mathhat {T}}_{1}-2} {\mathhat {p}}_{0,1}(1)\cr <&\, \displaystyle{{M}\over {N(1-\delta)}}\cr G^{{\mathhat {T}}_{1}-1}{\mathhat {p}}_{0,1}(1)=&\, (1-\delta)^{{\mathhat{T}}_{1}-1}\left(\displaystyle {{N}\over {M}}\right)^{{\mathhat {T}}_{1}-1} {\mathhat {p}}_{0,1}(1)\cr \geq&\,\displaystyle {{M}\over {N(1-\delta)}}.}$$

TABLE IV CALCULATION OF PROBABILITY THAT $T_{1}$ IS UPPER BOUNDED BY ${\mathhat {T}}_{1}$

Solving the inequalities above, we get $${\mathhat {T}}_{1}\leq {{\ln {{2M}\over {N}}-\ln(1-\delta)}\over {\ln (1-\delta)+\ln \left({{N}\over {M}}\right)}}+2$$ where $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a constant, and it is easy to show that ${\mathhat {T}}_{1}=\Theta (1)$. On the other hand, recalling the inequalities in Table III, we can continue to estimate the corresponding probability mentioned in (18) $$\eqalignno{& \quad \quad \BBP\left(T_{1}\leq {\mathhat {T}}_{1} \mid p_{0,1}(1)= {\mathhat {p}}_{0,1}(1)\right) \cr& > \BBP\left(p_{{\mathhat {T}}_{1}-1,1}(1)\geq {\mathhat {p}}_{{\mathhat {T}}_{1}-1,1}(1)\mid p_{0,1}(1)= {\mathhat {p}}_{0,1}(1)\right) \cr& \cdot \left(1-e^{- {{{\mathhat {p}}_{0,1}(1)N}\over {2}}\delta ^{2}}\right) \cr& >\left(1-e^{- {{{\mathhat {p}}_{0,1}(1)N}\over {2}}\delta ^{2}}\right)^{{\mathhat {T}}_{1}}.& {\hbox{(25)}}}$$ The analysis above tells us that the probability that the marginal probability $p_{t,1}(1)$ converges to 1 before the ${\mathhat {T}}_{1}$th generation $(T_{1}< {\mathhat {T}}_{1})$ is at least $\left(1- e^{- {{N}\over {4}}\delta ^{2}}\right)^{{\mathhat {T}}_{1}}$. Since $N=\omega (n^{2+\alpha }\log n)$, $M=\beta N$ ($\beta \in (0,1)$ is a constant) and ${\mathhat {T}}_{1}$ is polynomial in the problem size $n$, we know that the probability is overwhelming.
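The first-stage bound can be cross-checked numerically by iterating the deterministic recursion ${\mathhat {p}}_{t,1}(1)=G{\mathhat {p}}_{t-1,1}(1)$ directly. The Python sketch below is only an illustration: the function names and the two sample settings of $\beta=M/N$ and $\delta$ are assumptions, and the iteration implements the two defining inequalities for ${\mathhat {T}}_{1}$ given above.

```python
import math


def growth_factor(beta: float, delta: float) -> float:
    """G = (1 - delta) * N / M with beta = M / N; requires G > 1."""
    if not max(0.0, 1.0 - 2.0 * beta) < delta < 1.0 - beta:
        raise ValueError("delta is outside the interval used in the proof")
    return (1.0 - delta) / beta


def t_hat_1_closed_form(beta: float, delta: float) -> float:
    """Right-hand side of the bound on T-hat_1 obtained in the text."""
    numerator = math.log(2.0 * beta) - math.log(1.0 - delta)
    denominator = math.log(1.0 - delta) + math.log(1.0 / beta)
    return numerator / denominator + 2.0


def t_hat_1_by_iteration(beta: float, delta: float, p0: float = 0.5) -> int:
    """Smallest T with G^(T-1) * p0 >= M / (N (1 - delta)), found by iterating."""
    g = growth_factor(beta, delta)
    threshold = 1.0 / g  # equals M / (N (1 - delta))
    p, steps = p0, 0
    while p < threshold:
        p *= g  # one generation of the de-randomized first stage
        steps += 1
    return steps + 1


for beta, delta in ((0.5, 0.25), (0.5, 0.4)):
    print(beta, delta, t_hat_1_by_iteration(beta, delta),
          round(t_hat_1_closed_form(beta, delta), 3))
```

In both settings the iterated value stays below the closed-form bound and does not depend on $n$, consistent with ${\mathhat {T}}_{1}=\Theta (1)$.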

At every stage, the bits on the right-hand side of the currently converging bit are not exposed to selection pressure. However, we should still consider the errors brought by the repeated sampling procedures in UMDA, which are related to genetic drift [6], [41].

Take the first stage as an example. The $j$th bit $(j=2,\ldots,n)$ is affected by genetic drift. First, we utilize Chernoff bounds to study the deviations brought by the random sampling procedures of the UMDA $$\eqalignno{& \BBP\left(N_{t,j}\left(x_{j}^{\ast }\right)\geq (1-\eta)p_{t-1,j}\left(x_{j}^{\ast }\right)N\mid p_{t-1,j}\left(x_{j}^{\ast }\right)\right)\cr& \quad \quad \quad\quad\quad \quad\quad > 1- e^{- {{p_{t-1,j}(1)N}\over {2}}\eta ^{2}}}$$ where $\eta$ is a parameter that controls the size of deviation, and $N_{t,j}(x_{j})$ is the number of individuals that take the value $x_{j}$ in their $j$th bit in the population before selection, $\xi _{t}$. Here we set $\eta =\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}$, and obtain $$\eqalignno{& \BBP\left(N_{t,j}\left(x_{j}^{\ast }\right)\geq\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)p_{t-1,j}\left(x_{j}^{\ast }\right)N\mid p_{t-1,j}\left(x_{j}^{\ast }\right)\right)\cr&\quad\quad\quad\quad\quad\quad\quad > 1- e^{- {{p_{t-1,j}\left(x_{j}^{\ast }\right)\omega (\log n)}\over {2}}}= 1- n^{- {{p_{t-1,j}\left(x_{j}^{\ast }\right)\omega (1)}\over {2}}}.}$$

Second, we further consider the selection procedure, since it may also bring some deviations. In our worst case analysis, the $j$th bits of individuals are considered not to be exposed to the selection pressure; for these bits the selection procedure can then be regarded as drawing a simple random sample of $M$ individuals from a finite population with $N$ individuals [34]. More precisely, since one individual cannot be selected more than once by the truncation selection, this procedure is known as random sampling without replacement from a finite population [34] in the field of statistics. From Lemma 4, we can bound from below the probability that the number of individuals taking the value $x_{j}^{\ast }$ on their $j$th bits after selection [denoted by $N^{(s)}_{t,j}\left(x_{j}^{\ast }\right)$] is lower bounded, which is shown by the inequalities presented in Table V, where $\eta ^{\prime}$ is a parameter that controls the size of deviation, and $N^{(s)}_{t,j}\left(x_{j}^{\ast }\right)=p_{t,j}\left(x_{j}^{\ast }\right)M$. By setting $\eta '=\eta =\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}$, since $M=\omega (n^{2+\alpha }\log n)$ we obtain $$\eqalignno{& \BBP\left(p_{t,j}\left(x_{j}^{\ast }\right)\geq\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p_{t-1,j}\left(x_{j}^{\ast }\right)\mid p_{t-1,j}\left(x_{j}^{\ast }\right)\right)\cr&\quad\quad\quad\quad\quad > \left(1-n^{-p_{t-1,j}\left(x_{j}^{\ast }\right)\omega (1)}\right)\cr&\quad\quad\quad \quad\quad\cdot \left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p^{2}_{t-1,j}\left(x_{j}^{\ast }\right)\omega (1)}\right)\cr&\quad\quad\quad\quad\quad >\left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p^{2}_{t-1,j}\left(x_{j}^{\ast }\right)\omega(1)}\right)^{2}.}$$

TABLE V BOUNDING $N^{(s)}_{t,j}\left(x_{j}^{\ast }\right)$ FROM BELOW WITH AN OVERWHELMING PROBABILITY

Since the factor $R=\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}< 1$, for all $j=2,\ldots,n$ and $t=1,\ldots, {\mathhat {T}}_{1}$, similar to the analysis shown in Table III, we further obtain $$\eqalignno{& \quad \BBP \biggl(p_{t,j}\left(x_{j}^{\ast}\right)\geq\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha}\over {2}}}\right)^{2t}p_{0,j}\left(x_{j}^{\ast }\right) \cr&\quad \quad \quad \quad \quad \quad \quad \mid p_{0,j}\left(x_{j}^{\ast }\right)= {\mathhat{p}}_{0,j}\left(x_{j}^{\ast }\right)\biggr) \cr& > \left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over{2}}}\right)^{2} {\mathhat {p}}^{2}_{t-1,j}\left(x_{j}^{\ast}\right)\omega(1)}\right)^{2t}.& {\hbox{(26)}}}$$ Given any $t=O(n)$, according to the definition of the deterministic system, we know $${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)\geq \left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over{2}}}\right)^{O(n)} {\mathhat {p}}_{0,j}\left(x_{j}^{\ast }\right)>{{1}\over {e}}$$ holds. The above inequality implies that within the number of generations $t=O(n)$, the probability in (26) is an overwhelming one.
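The lower level ${{1}\over {e}}$ is an asymptotic statement, which a short numerical sketch makes concrete. The Python code below evaluates $R^{t}\,{\mathhat {p}}_{0,j}$ with $R=\left(1-n^{-(1+\alpha /2)}\right)^{2}$; the function name, the choice $\alpha =1$, and the horizon $t=4n$ are assumptions picked only for illustration.

```python
import math


def drift_floor(n: int, alpha: float, t: int, p0: float = 0.5) -> float:
    """Deterministic lower level R^t * p0 with R = (1 - n^-(1 + alpha/2))^2."""
    eta = n ** -(1.0 + alpha / 2.0)
    r = (1.0 - eta) ** 2
    return (r ** t) * p0


# Assumed illustration: alpha = 1 and a horizon of t = 4n generations.
for n in (100, 1_000, 10_000):
    level = drift_floor(n, alpha=1.0, t=4 * n)
    print(n, round(level, 4), level > 1.0 / math.e)
```

Under these assumed values the level is still below ${{1}\over {e}}$ for $n=100$ and exceeds it for $n=10^{4}$, in line with the remark that such statements concern sufficiently large problem sizes.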

To generalize the above analysis to other stages, suppose that the $i$th $(i\in \{2,\ldots,n\})$ stage is about to start. Due to the genetic drift, the marginal probability $p_{t,j}\left(x_{j}^{\ast }\right) (j\in \{i,\ldots,n\})$ has dropped to a lower level than the initial value ${{1}\over {2}}$ by multiplying the factor $R^{t}$. We are concerned with the value of $p_{t,i}\left(x_{i}^{\ast }\right)$. For any $t=O(n)$, similar to (26), the probability that $p_{t,i}\left(x_{i}^{\ast }\right)$ maintains a level of $$p_{t,i}\left(x_{i}^{\ast }\right)\geq \left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over{2}}}\right)^{O(n)} {\mathhat {p}}_{0,i}\left(x_{i}^{\ast }\right)>{{1}\over {e}} \eqno {\hbox{(27)}}$$ is super-polynomially close to 1 (an overwhelming probability).

According to (27), we know that $p_{t,i}\left(x_{i}^{\ast }\right)$ is above ${{1}\over {e}}$ with an overwhelming probability. Consequently, the joint probability that the first bit has converged to 1 and the genetic drift cannot reduce $p_{{\mathhat {T}}_{1},2}(1)$ to be smaller than ${{1}\over {e}}$ by the end of the first stage is $$\left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {2e}}\delta ^{2}}\right)^{{\mathhat {T}}_{1}}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{T}}_{1}} \eqno {\hbox{(28)}}$$ which is again an overwhelming probability. Now we have finished the analysis of the first stage.

As in the dynamic system described at the beginning of the proof, in the second stage, for ${\mathhat {T}}_{1} < t\leq {\mathhat {T}}_{2}$, we have $${\mathhat {p}}_{t,2}(1)=G {\mathhat {p}}_{t-{1},2}(1).$$ Given ${\mathhat {T}}_{1}$ and the corresponding marginal probabilities, we consider the joint probability that $T_{2}$ is bounded above by ${\mathhat {T}}_{2}$ by inequalities presented in Table VI.

TABLE VI CALCULATION OF THE JOINT PROBABILITY THAT $T_{2}$ IS BOUNDED ABOVE BY ${\mathhat {T}}_{2}$

Let us consider the following item of the probability estimated in Table VI: $$\eqalignno{& \BBP\biggl({\mathhat {p}}_{{\mathhat{T}}_{2}-1,2}(1)> {{M}\over {N(1-\delta)}}\mid p_{{\mathhat{T}}_{1},1}(1)=1,\cr& \quad \quad \quad \quad \quad \quad \quad\quad \quad \quad p_{{\mathhat {T}}_{1},2}(1)\geq {\mathhat{p}}_{{\mathhat {T}}_{1},2}(1)> {{1}\over {e}}\biggr)}$$ since $\{{\mathhat {p}}_{t,2}(1)\}_{t=0}^{\infty }$ is a deterministic sequence, the above item must be either 0 or 1. Noting that ${\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)> {{1}\over {e}}$, given the condition that $\forall t\colon {\mathhat {T}}_{1}< t< {\mathhat {T}}_{2}-1\colon {{M}\over {N(1-\delta)}}> {\mathhat {p}}_{t,2}(1)=(1-\delta) {{{\mathhat {p}}_{t-1,2}(1)N}\over {M}}$, we can solve the following inequalities to obtain ${\mathhat {T}}_{2}$ $$\eqalignno{& G^{{\mathhat {T}}_{2}- {\mathhat {T}}_{1}-2} {\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)\cr& \quad =\left ((1-\delta)\left({{N}\over {M}}\right)\right)^{{\mathhat {T}}_{2}- {\mathhat{T}}_{1}-2} {\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)< {{M}\over {N(1-\delta)}}\cr& G^{{\mathhat {T}}_{2}- {\mathhat {T}}_{1}-1} {\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)\cr& \quad =\left ((1-\delta)\left({{N}\over {M}}\right)\right)^{{\mathhat {T}}_{2}- {\mathhat{T}}_{1}-1} {\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)\geq {{M}\over {N(1-\delta)}}.}$$ Moreover, another item in (22) $$\eqalignno{& \BBP\Biggl(p_{{\mathhat {T}}_{2}-1,2}(1)\geq{\mathhat {p}}_{{\mathhat {T}}_{2}-1,2}(1)\mid p_{{\mathhat {T}}_{1},1}(1)=1,\cr& \quad p_{{\mathhat {T}}_{1},2}(1)\geq {\mathhat {p}}_{{\mathhat {T}}_{1},2}(1)> {{1}\over{e}},{\mathhat {p}}_{{\mathhat {T}}_{2}-1,2}(1)> {{M}\over {N(1-\delta)}}\Biggr)}$$ should be estimated. This can be done in the same way as in Table III. 
Then we obtain that $$T_{2}< {\mathhat {T}}_{2}\leq {{2\ln {{eM}\over {N}}-2\ln(1-\delta)}\over {\ln (1-\delta)+\ln \left({{N}\over {M}}\right)}}+4$$ holds with the probability [the product of the items mentioned in (22)] $$\left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {2e}}\delta ^{2}}\right)^{{\mathhat {T}}_{2}}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{T}}_{1}}.$$ The above analysis can be readily extended to other stages. To be specific, at the $i$th stage, the $i$-promising individuals are taken into account. We have $${\mathhat {p}}_{t,i}(1)=G {\mathhat {p}}_{t-1,i}(1).$$

For induction, assume that at the $(i-1)$th stage $$\eqalignno{T_{i-1}<&\, {\mathhat {T}}_{i-1}\leq {{(i-1)\ln{{eM}\over {N}}-(i-1)\ln(1-\delta)}\over {\ln (1-\delta)+\ln\left({{N}\over {M}}\right)}} \cr& +2(i-1)& {\hbox{(29)}}}$$ holds with the probability $$\eqalignno{& \left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {4}}\delta ^{2}}\right)^{{\mathhat {T}}_{i-1}}\cr & \quad \quad\quad\cdot \prod _{k=1}^{i-2}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{T}}_{k}}.}$$

To estimate ${\mathhat {T}}_{i}$, we solve the following inequalities: $$\eqalignno{& G^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-2} {\mathhat {p}}_{{\mathhat {T}}_{i-1},i}(1)\cr& \quad =(1-\delta)^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-2}\left ({{N}\over {M}}\right)^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-2} {\mathhat {p}}_{{\mathhat {T}}_{i-1},i}(1)\cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad < {{M}\over{N(1-\delta)}}\cr& G^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-1} {\mathhat {p}}_{{\mathhat {T}}_{i-1},i}(1)\cr& \quad =(1-\delta)^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-1}\left ({{N}\over {M}}\right)^{{\mathhat {T}}_{i}- {\mathhat {T}}_{i-1}-1} {\mathhat {p}}_{{\mathhat {T}}_{i-1},i}(1)\cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \geq {{M}\over{N(1-\delta)}}}$$ where ${\mathhat {p}}_{{\mathhat {T}}_{i-1},i}(1)> {{1}\over {e}}$ [similar to (27)], since ${\mathhat {T}}_{i-1}=O(n)$ [our assumption for induction in (29) shows that it is $O(n)$]. Similar to the discussion at the second stage, we can get that $$T_{i}< {\mathhat {T}}_{i}\leq {{i\ln {{eM}\over {N}}-i\ln(1-\delta)}\over {\ln (1-\delta)+\ln \left({{N}\over {M}}\right)}}+2i$$ holds with the probability $$\eqalignno{& \hskip 24pt \left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {2e}}\delta ^{2}}\right)^{{\mathhat {T}}_{i}}\cr& \cdot \prod _{k=1}^{i-1}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{T}}_{k}}.}$$

Finally, the FHT $\tau$ is upper bounded by $$\tau < {\mathhat {T}}_{n}= {{n\left(\ln{{eM}\over {N}}-\ln (1-\delta)\right)}\over {\ln (1-\delta)+\ln \left({{N}\over{M}}\right)}}+2n$$ with a probability of $$\eqalignno{& \quad \quad \left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {4}}\delta ^{2}}\right)^{{\mathhat {T}}_{n}}\cr& \cdot \prod _{k=1}^{n-1}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{T}}_{k}}\cr&\qquad >\left(1-n^{-\omega (n^{2+\alpha })\delta ^{2}}\right)^{{\mathhat {T}}_{n}}\cr& \cdot \left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2(n-1) {\mathhat{T}}_{n}}}$$ which is an overwhelming probability. ■

In the proof above, we have proven that a bound holds for the FHT with an overwhelming probability. Furthermore, the proof also shows the convergence of UMDA on LEADINGOnes: the UMDA will converge to the optimum with an overwhelming probability. The convergence property is ensured by using population sizes of $\omega (n^{2+\alpha }\log n)$, and considering all the random sampling errors in the pessimistic way.
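For readers who want to experiment, the following is a minimal Python sketch of a UMDA with truncation selection on LEADINGOnes. It is an illustration under stated assumptions, not the exact algorithm analyzed above: the function names, the tie-breaking rule (ties keep their sampled order), the convention of counting sampling rounds from 1, and the small parameter values are all choices made here, and such small populations are far from the $\omega (n^{2+\alpha }\log n)$ regime of Theorem 2.

```python
import random
from typing import List, Optional, Sequence, Tuple


def leading_ones(x: Sequence[int]) -> int:
    """Number of consecutive 1-bits at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def umda_leading_ones(n: int, n_pop: int, m_sel: int, max_gens: int,
                      rng: random.Random) -> Tuple[Optional[int], List[float]]:
    """Return (round in which the optimum was first sampled or None, marginals)."""
    p = [0.5] * n  # uniform initial marginal probabilities
    for t in range(1, max_gens + 1):
        # Sample N individuals from the product distribution.
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(n_pop)]
        if any(leading_ones(x) == n for x in pop):
            return t, p
        # Truncation selection: keep the M fittest individuals.
        pop.sort(key=leading_ones, reverse=True)
        selected = pop[:m_sel]
        # New marginals are the bit frequencies among the selected individuals.
        p = [sum(x[i] for x in selected) / m_sel for i in range(n)]
    return None, p


hit, marginals = umda_leading_ones(n=10, n_pop=400, m_sel=200,
                                   max_gens=50, rng=random.Random(0))
print(hit, [round(q, 2) for q in marginals])
```

Printing the marginals after each round (not done here) makes the sequential, left-to-right convergence visible on typical runs.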

SECTION V

## BEST CASE ANALYSIS OF UMDA ON THE BVLEADINGONES PROBLEM

The previous section has shown that the LEADINGOnes problem is EDA-easy for the UMDA. In this section, we will study another maximization problem that is unimodal but EDA-hard for the UMDA. The problem, which is called BVLeadingOnes (BVLO for short), can be regarded as the LEADINGOnes problem with one bit's variation. It is defined as follows: $${\rm BVLO}({\bf x})=\cases{{\rm LO}({\bf x})+n, \hfill & {\rm LO}({\bf x})\leq n-1, x_{n}=0\hfill \cr {\rm LO}({\bf x}), \hfill & {\rm LO}({\bf x})< n-1, x_{n}=1 \hfill \cr 3n, \hfill & {\rm LO}({\bf x})= n\hfill \cr }\eqno {\hbox{(30)}}$$ where $\forall i=1,\ldots,n\colon x_{i}\in \{0,1\}$ and LO stands for LEADINGOnes. The BVLeadingOnes is a unimodal function whose global optimum is ${\bf x}^{\ast }=\left(x_{1}^{\ast },\ldots,x_{n}^{\ast }\right)=(1,\ldots,1)$. In this section, we will prove that BVLeadingOnes is EDA-hard for the UMDA.
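A minimal Python sketch of (30) may help to see how the last bit changes the landscape. The function names and the list-of-bits representation are assumptions for illustration.

```python
from typing import Sequence


def leading_ones(x: Sequence[int]) -> int:
    """Number of consecutive 1-bits at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def bvlo(x: Sequence[int]) -> int:
    """BVLeadingOnes as defined in (30)."""
    n = len(x)
    lo = leading_ones(x)
    if lo == n:
        return 3 * n  # global optimum (1, ..., 1)
    if x[-1] == 0:
        return lo + n  # last bit 0: bonus of n
    return lo  # last bit 1 but not the optimum: no bonus


# With n = 4, any string ending in 0 beats any non-optimal string ending in 1.
print(bvlo([0, 0, 0, 0]))  # 4
print(bvlo([1, 1, 0, 1]))  # 2
print(bvlo([1, 1, 1, 0]))  # 7
print(bvlo([1, 1, 1, 1]))  # 12
```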

Let us look at (30) again. The $n$th bits of the individuals are exposed to the selection pressure from the very beginning. During the optimization process, an individual whose last bit is 0 always has higher fitness than any individual whose last bit is 1, unless the first $n-1$ bits of the latter are all 1's. In other words, the $n$th marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ starts converging to 1 from the beginning of optimization, where $\bar {x}_{n}^{\ast }=1-x_{n}^{\ast }=0$. Once $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ reaches 1, the UMDA will miss the global optimum forever. Therefore, we need to check whether an individual whose first $n-1$ bits are all 1's can be generated before $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ reaches 1.

We start by analyzing the convergence speed of the first $n-1$ bits of individuals, given polynomial population sizes $M=\omega (n^{2+\alpha }\log n)$, $N=\omega (n^{2+\alpha }\log n)$ (where $\alpha$ can be any positive constant), and $M=\beta N$ ($\beta \in (0, 1)$ is some constant) for the UMDA. These bits can be classified into two categories. The first category is exposed to the selection pressure, and the second one is affected by the genetic drift. Unlike the previous section, here we analyze from an optimistic viewpoint: all bits of the first category will converge in one generation, and the genetic drift will promote the marginal probabilities of generating the optimal value on the remaining bits. We first consider the genetic drift of a typical marginal probability, say $p_{\cdot,q}\left(x_{q}^{\ast }\right)$ (the $q$th bits belong to the second category). Using Chernoff bounds to study the deviations brought by the random sampling procedures, we have $$\eqalignno{& \BBP\left(N_{t,q}\left(x_{q}^{\ast }\right)\leq (1+\eta)p_{t-1,q}\left(x_{q}^{\ast }\right)N\mid p_{t-1,q}\left(x_{q}^{\ast }\right)\right)\cr& \quad \quad > 1- e^{- {{p_{t-1,q}\left(x_{q}^{\ast }\right)N}\over {4}}\eta ^{2}}}$$ where $\eta$ is a parameter that controls the size of deviation, and $N_{t,q}\left(x_{q}^{\ast }\right)$ is the number of individuals that take the value $x_{q}^{\ast }$ in their $q$th bit in the population before selection. Setting $\eta =\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}$, we obtain $$\eqalignno{& \quad \BBP\biggl(N_{t,q}\left(x_{q}^{\ast}\right)\leq\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha}\over {2}}}\right)p_{t-1,q}\left(x_{q}^{\ast }\right)N\cr& \quad\quad \quad \quad \quad \quad \quad \mid p_{t-1,q}\left(x_{q}^{\ast }\right)\biggr)\cr& > 1- e^{- {{p_{t-1,q}\left(x_{q}^{\ast }\right)\omega (\log n)}\over {4}}}=1-n^{- {{p_{t-1,q}\left(x_{q}^{\ast }\right)\omega (1)}\over {4}}}.}$$

The selection procedure may also bring some deviations. Since the $q$th bits of individuals are not exposed to the selection pressure, for these bits the selection procedure can be regarded as simple random sampling without replacement. Lemma 4 can be used to estimate the probability that the number of individuals taking the value $x_{q}^{\ast }$ on their $q$th bits after selection [denoted by $N^{(s)}_{t,q}\left(x_{q}^{\ast }\right)$] is bounded from above, which is lower bounded by $1- e^{-2(1+\eta)^{2}p^{2}_{t-1,q}\left(x_{q}^{\ast }\right)\eta ^{\prime 2}M}$ estimated by (23) in Table VII, where $\eta ^{\prime}$ is a parameter that controls the size of deviation, and $N^{(s)}_{t,q}\left(x_{q}^{\ast }\right)=p_{t,q}\left(x_{q}^{\ast }\right)M$. Let $\eta ^{\prime}=\eta =\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}$, since $M=\omega (n^{2+\alpha }\log n)$ we get $$\eqalignno{& \BBP\left(p_{t,q}\left(x_{q}^{\ast }\right)\leq\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p_{t-1,q}\left(x_{q}^{\ast }\right)\mid p_{t-1,q}\left(x_{q}^{\ast }\right)\right)\cr&\quad \quad \quad \quad \quad > \left(1-n^{-p_{t-1,q}\left(x_{q}^{\ast }\right)\omega (1)}\right)\cr&\quad \quad \qquad\quad \quad \cdot \left(1-n^{-\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p^{2}_{t-1,q}\left(x_{q}^{\ast }\right)\omega (1)}\right)\cr&\quad \quad \quad\quad \quad >\left(1-n^{-\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}p^{2}_{t-1,q}\left(x_{q}^{\ast }\right)\omega(1)}\right)^{2}.}$$

TABLE VII BOUNDING $N^{(s)}_{t,q}\left(x_{q}^{\ast }\right)$ FROM ABOVE WITH AN OVERWHELMING PROBABILITY

Since $R=\left (1+\left ({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}>1$ (thus we know that ${\mathhat {p}}_{t-1,q}\left(x_{q}^{\ast }\right)> {\mathhat {p}}_{0,q}\left(x_{q}^{\ast }\right)$ in the above inequality), similar to the analysis shown in Table III, we further have TeX Source \eqalignno{& \quad \BBP\biggl(p_{t,q}\left(x_{q}^{\ast}\right)\leq\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha}\over {2}}}\right)^{2t}p_{0,q}\left(x_{q}^{\ast }\right)\cr&\quad \quad \quad \quad \quad \quad \quad \mid p_{0,q}\left(x_{q}^{\ast }\right)= {\mathhat{p}}_{0,q}\left(x_{q}^{\ast }\right)\biggr)\cr& > \left(1-n^{-\left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha }\over{2}}}\right)^{2} {\mathhat {p}}^{2}_{0,q}\left(x_{q}^{\ast}\right)\omega(1)}\right)^{2t}.} Given any polynomial $t$, the above probability is an overwhelming one. Specifically, $\forall t=O(n)$, $p_{t,q}\left(x_{q}^{\ast }\right)$ is upper bounded as TeX Source \eqalignno{& \quad p_{t,q}\left(x_{q}^{\ast }\right)\leq \left(1+\left({{1}\over {n}}\right)^{1+ {{\alpha }\over{2}}}\right)^{O(n)} {\mathhat {p}}_{0,q}\left(x_{q}^{\ast }\right) \cr& = {{1}\over {2}}+\Theta \left({{1}\over {n^{\alpha /2}}}\right)+o\left({{1}\over{n^{\alpha /2}}}\right)< c< 1& {\hbox{(31)}}} with an overwhelming probability (where $c$ is some positive constant, and the $q$th bits are not exposed to the selection pressure).
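As a rough numerical illustration of (31), the sketch below evaluates the deterministic drift bound $\left(1+n^{-(1+\alpha /2)}\right)^{2t}\cdot {{1}\over {2}}$ at $t=n$ generations. The function name, the choice $t=n$ as a stand-in for $t=O(n)$, and the sample values of $n$ and $\alpha$ are assumptions made only for illustration.

```python
def drift_upper_bound(n: int, alpha: float, generations: int, p0: float = 0.5) -> float:
    """Deterministic bound R**t * p0 with R = (1 + n**-(1 + alpha/2))**2."""
    eta = n ** -(1.0 + alpha / 2.0)
    growth_per_generation = (1.0 + eta) ** 2
    return growth_per_generation ** generations * p0


if __name__ == "__main__":
    alpha = 1.0  # illustrative value; the text allows any positive constant
    for n in (10, 100, 1000, 10000):
        bound = drift_upper_bound(n, alpha, generations=n)
        # The excess over 1/2 shrinks as n grows, in line with the
        # Theta(n**(-alpha/2)) term in (31).
        print(n, bound, bound - 0.5)
```

Under these assumptions the bound stays strictly between ${{1}\over {2}}$ and 1 and approaches ${{1}\over {2}}$ as $n$ grows, which is the behavior that the constant $c<1$ in (31) captures.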

Another key issue of our analysis is the time $T_{n}^{\prime}$ for the $n$th marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ to converge to 1. We can prove the following lemma.

#### Lemma 6

The number of generations required by the marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ to converge to 1, i.e., $T_{n}^{\prime}$, is upper bounded by $$U= {{\ln{{2M}\over {N}}-\ln (1-\delta)}\over {\ln (1-\delta)+\ln \left({{N}\over {M}}\right)}}+2$$ with an overwhelming probability, if no global optimum is generated before the $U$th generation, where $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a positive constant.

The proof is provided in the Appendix. Given polynomial population sizes $M=\omega (n^{2+\alpha }\log n)$, $N=\omega (n^{2+\alpha }\log n)$ (where $\alpha$ can be any positive constant), and $M=\beta N$ ($\beta \in (0,1)$ is some constant), Lemma 6 implies that $U=\Theta (1)$. Now we reach the following theorem.
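To see why $U=\Theta (1)$, note that with $M=\beta N$ the bound of Lemma 6 depends only on the constants $\beta$ and $\delta$, not on $n$. The following minimal sketch evaluates it; the function name and the sample constants are hypothetical.

```python
import math


def lemma6_generation_bound(beta: float, delta: float) -> float:
    """Evaluate U from Lemma 6 with M = beta * N, so M/N = beta."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    lower = max(0.0, 1.0 - 2.0 * beta)
    upper = 1.0 - beta
    if not lower < delta < upper:
        raise ValueError("delta must lie in (max{0, 1 - 2*beta}, 1 - beta)")
    numerator = math.log(2.0 * beta) - math.log(1.0 - delta)
    denominator = math.log(1.0 - delta) + math.log(1.0 / beta)
    return numerator / denominator + 2.0


if __name__ == "__main__":
    # Illustrative constants only; the problem size n does not appear.
    print(lemma6_generation_bound(beta=0.5, delta=0.25))
```

The admissible interval for $\delta$ keeps the denominator positive, since $\delta < 1- {{M}\over {N}}$ is equivalent to $(1-\delta) {{N}\over {M}}>1$.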

#### Theorem 3

Given polynomial population sizes $M=\omega (n^{2+\alpha }\log n)$, $N=\omega (n^{2+\alpha }\log n)$ (where $\alpha$ can be any positive constant), and $M=\beta N$ ($\beta \in (0,1)$ is some constant), the FHT of the UMDA with truncation selection on the BVLeadingOnes problem is infinity with an overwhelming probability. In other words, the UMDA with truncation selection cannot find the optimum of the BVLeadingOnes problem with an overwhelming probability.

##### Proof

We have proven that the number of generations required for $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ to reach 1 (denoted by $T_{n}^{\prime}$) is upper bounded by a constant function $U$ with an overwhelming probability, under the condition that no global optimum is generated before the $U$th generation. We now further prove that the probability that no global optimum is generated before the $U$th generation is also overwhelming.

As mentioned before, we classify the first $n-1$ bits of individuals into two categories. The first category, which contains the bits exposed to selection, further contains two types of bits. The first type contains the bits that have already converged to the optimal values, and the second type contains the bits that are exposed to the selection pressure but have not yet converged to the optimal values. In our best-case analysis, for the bits of the second type, we consider that only one generation is needed for the corresponding marginal probabilities (of the optimal values) to converge. In other words, before the $U$th generation, the marginal probabilities (of the first $n-1$ bits of individuals) are either 1 or no more than the constant $c$. Noting that $U=\Theta (1)$, according to (31), $c\in \left({{1}\over {2}},1\right)$, which reflects the result of genetic drift within $O(n)$ generations. From an optimistic viewpoint, we further consider that in every generation, besides the marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$, at most $\log ^{2} n$ other marginal probabilities are also converging with an overwhelming probability. The bound $\log ^{2}n$ is used here because the joint probability of generating $\log ^{2}n$ consecutive 1's (so as to produce the selection pressure on the corresponding bits) by $\log ^{2}n$ non-converged marginal probabilities is no more than $c^{\log ^{2}n}$, which is super-polynomially small.
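The claim that $c^{\log ^{2}n}$ is super-polynomially small can be checked numerically by comparing logarithms, which avoids floating-point underflow. In the sketch below the natural logarithm is assumed for $\log$ (another base only rescales the constant), and the values of $c$ and of the polynomial degree are hypothetical.

```python
import math


def log_consecutive_ones_bound(n: int, c: float) -> float:
    """Natural log of c ** (ln n) ** 2, computed without underflow."""
    return (math.log(n) ** 2) * math.log(c)


def log_inverse_polynomial(n: int, degree: float) -> float:
    """Natural log of n ** (-degree)."""
    return -degree * math.log(n)


if __name__ == "__main__":
    c, degree = 0.6, 3.0  # illustrative values only
    for n in (10**2, 10**3, 10**6, 10**9):
        # Once ln(n) * |ln(c)| exceeds the degree, the first value
        # falls below the second and stays below it.
        print(n, log_consecutive_ones_bound(n, c), log_inverse_polynomial(n, degree))
```

For any fixed degree $k$, the inequality $c^{\ln ^{2}n}< n^{-k}$ holds as soon as $\ln n>k/\ln (1/c)$, so the bound eventually drops below every inverse polynomial.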

The above result implies that the probability of generating the global optimum in one generation is also super-polynomially small. Noting that $U=\Theta (1)$, then the probability of generating the optimum before the $U$th generation is also super-polynomially small. Combining this probability with the conditional probability mentioned in Lemma 6, we know that the joint probability that no global optimum is generated before the $U$th generation, and $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ converges to 1 no later than the $U$th generation, is super-polynomially close to 1, i.e., an overwhelming probability. Combining with the fact that once the $n$th marginal probability $p_{\cdot,n}\left(x_{n}^{\ast }\right)$ has already converged to 0, the probability of finding the optimum will drop to 0, we have proven the theorem.

According to Theorem 1, given polynomial population sizes $M=\omega (n^{2+\alpha }\log n)$ and $N=\omega (n^{2+\alpha }\log n)$ (where $M=\beta N$ and $\beta \in (0,1)$ is a constant), BVLeadingOnes is EDA-hard for the UMDA. ■

For the sake of consistency, we also provide the formal description of the deterministic dynamic system utilized in this section. Considering the $i$th stage $\left(i\leq \min \left\{T_{n}^{\prime}, {{n-1}\over {\log ^{2}n}}\right\}\right)$, which starts when all the marginal probabilities $p_{\cdot,k}\left(x_{k}^{\ast }\right)$ $(k\leq (i-1)\log ^{2} n)$ have just converged to 1 and ends when all the marginal probabilities $p_{\cdot,j}\left(x_{j}^{\ast }\right)$ $(j\leq i\log ^{2} n)$ have just converged to 1, we can obtain ${\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast })$ by defining $\gamma _{i}$ as follows: $$\eqalignno{& \quad \quad {\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast})=\gamma _{i}\left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast})\right)=\cr& \biggl({\mathhat {p}}_{t,1}\left(x_{1}^{\ast}\right),\ldots, {\mathhat {p}}_{t,(i-1)\log^{2}n}\left(x_{(i-1)\log ^{2} n}^{\ast }\right),1,\ldots,1,\cr&\quad R {\mathhat {p}}_{t,i\log ^{2} n+1}\left(x_{i\log^{2}n+1}^{\ast }\right),\ldots,R {\mathhat{p}}_{t,n-1}\left(x_{n-1}^{\ast }\right),\cr& \quad \quad \quad\quad \quad \quad \quad \quad 1-G\left(1- {\mathhat{p}}_{t,n}\left(x_{n}^{\ast }\right)\right)\biggr)}$$ where $R=(1+\eta)(1+\eta ^{\prime})$ ($\eta < 1$ and $\eta ^{\prime}< 1$ are positive functions of the problem size $n$), and $G=(1-\delta) {{N}\over {M}}$ ($\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a constant). In the above equation, we consider four different cases.

1. $j\in \{1,\ldots,(i-1)\log ^{2} n\}$. In the deterministic system above, the marginal probabilities ${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)$ have converged to 1, thus at the next generation they will not change.
2. $j\in \{(i-1)\log ^{2} n+1,\ldots,i\log ^{2} n\}$. In the deterministic system above, the marginal probabilities ${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)$ are converging to the optimum, and they will converge in one generation in the best case analysis.
3. $j\in \{i\log ^{2} n+1,\ldots,n-1\}$. The $j$th bits of individuals are not exposed to selection pressure, and we use the factor $R=(1+\eta)(1+\eta ^{\prime})$ to demonstrate the impact of genetic drift in the deterministic system above.
4. $j=n$. The marginal probability ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)=1- {\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)$ is converging, and we use the factor $G=(1-\delta) {{N}\over {M}}$ to demonstrate the impact of selection pressure on this converging marginal probability in the deterministic system above, which is a best case style for ${\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)$.

With ${\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })=\left({{1}\over {2}},\ldots, {{1}\over {2}}\right)$, noting that one stage actually refers to one generation (thus $i=t$), we have $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{t}\circ \gamma _{t-1}\circ \cdots \circ \gamma _{1}\left({\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })\right)$$ where $t\leq \min \left\{T_{n}^{\prime}, {{n-1}\over {\log ^{2} n}}\right\}$. Since $\{\gamma _{i}\}_{i=1}^{t}$ de-randomizes the whole optimization process, $T_{n}^{\prime}$ in the above equation is no longer a random variable. For the sake of clarity, we rewrite the above equation as $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{t}\circ \gamma _{t-1}\circ \cdots \circ \gamma _{1}\left({\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })\right)$$ where $t\leq \min \left\{{\mathhat {T}}_{n}^{\prime}, {{n-1}\over {\log ^{2} n}}\right\}\leq \min \left\{U, {{n-1}\over {\log ^{2} n}}\right\}$.

SECTION VI

## A MODIFIED UMDA: RELAXATION BY MARGINS

So far we have seen both EDA-easy and EDA-hard problems for the UMDA. This section analyzes in more depth the relationship between EDA-hardness and the algorithms. The BVLeadingOnes problem, which has been proven to be EDA-hard for the UMDA with finite populations, will be employed as the target problem in this section. We will show that a simple “relaxed” version of UMDA with truncation selection can solve the BVLeadingOnes problem efficiently. The “relaxation” is implemented by adding some “margins” to the marginal probabilities of the UMDA. That is, the highest level the marginal probabilities can reach is $1- {{1}\over {M}}$ and the lowest level the marginal probabilities can drop to is ${{1}\over {M}}$. Any marginal probabilities higher than $1- {{1}\over {M}}$ are set to be $1- {{1}\over {M}}$, and any marginal probabilities lower than ${{1}\over {M}}$ are set to be ${{1}\over {M}}$. We denote such a UMDA with margins as ${\rm UMDA}_{M}$. The margins here aim to avoid premature convergence, and are similar to the upper and lower bounds of the pheromone information in Max-Min Ant System [40] and to the Laplace correction [2]. It is noteworthy that we are not trying to propose a new algorithm here. Instead, by an example, we are trying to demonstrate theoretically that some approaches proposed to avoid premature convergence of EDAs can actually help to improve the performance of the algorithms.
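The margin rule described above can be sketched as one generation of a univariate EDA followed by clamping. This is only an illustrative sketch, not the implementation analyzed in this paper: the function names are hypothetical, the marginals here store the probability of sampling a 1 at each position, and a plain leading-ones count is used as a placeholder fitness in place of BVLeadingOnes.

```python
import random
from typing import Callable, List, Sequence


def clamp_to_margins(p: float, m_selected: int) -> float:
    """Clamp a marginal probability into [1/M, 1 - 1/M]."""
    low = 1.0 / m_selected
    return min(max(p, low), 1.0 - low)


def leading_ones(bits: Sequence[int]) -> int:
    """Placeholder fitness (assumption): number of leading 1-bits."""
    count = 0
    for b in bits:
        if b != 1:
            break
        count += 1
    return count


def umda_m_generation(
    probs: Sequence[float],
    n_samples: int,
    m_selected: int,
    fitness: Callable[[Sequence[int]], float],
    rng: random.Random,
) -> List[float]:
    """One generation: sample N, keep the best M, re-estimate, apply margins."""
    population = [
        [1 if rng.random() < p else 0 for p in probs] for _ in range(n_samples)
    ]
    # Truncation selection: keep the M fittest individuals.
    selected = sorted(population, key=fitness, reverse=True)[:m_selected]
    new_probs = []
    for j in range(len(probs)):
        frequency = sum(ind[j] for ind in selected) / m_selected
        new_probs.append(clamp_to_margins(frequency, m_selected))
    return new_probs


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5] * 8
    for _ in range(20):
        probs = umda_m_generation(probs, 200, 100, leading_ones, rng)
    print(probs)
```

Because every marginal stays within $\left[{{1}\over {M}},1- {{1}\over {M}}\right]$, each bit value keeps a positive sampling probability in every generation, which is the property the analysis of ${\rm UMDA}_{M}$ relies on.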

We have seen in the previous section that the original UMDA cannot solve BVLeadingOnes efficiently. Interestingly, by adding the margins, the ${\rm UMDA}_{M}$ can solve BVLeadingOnes efficiently. The following theorem summarizes the main result.

#### Theorem 4

Given polynomial population sizes $N=\omega (n^{2+\alpha }\log n)$, $M=\omega (n^{2+\alpha }\log n)$ (where $\alpha$ can be any positive constant) and $M=\beta N$ ($\beta \in (0,1)$ is some constant), then for any constant $\delta$ that satisfies $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1-e^{{1}\over {\epsilon (n)}} {{M}\over {N}}\right)$ (where $\epsilon (n)= {{M}\over {n}}$), the first hitting time $\tau$ of the ${\rm UMDA}_{M}$ with truncation selection (initialized with a uniform distribution) satisfies TeX Source \eqalignno{& \tau < \bar {\tau }= {{\left(\ln{{e(M-1)}\over {N}}-\ln (1-\delta)\right)n\epsilon (n)+n}\over {\epsilon (n)\ln (1-\delta)+\epsilon (n)\ln \left({{N}\over {M}}\right)-1}}\cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad + {{M}\over{N}}\ln ^{2}n+2n} with the overwhelming probability TeX Source \eqalignno{& \left(1-n^{-e^{-1/\epsilon (n)}\omega (n^{2+\alpha })\delta ^{2}/2e}\right)^{2\bar {\tau}}\cr& \quad \cdot \left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2(n-1)\bar {\tau}}\cr& \quad \cdot \left (1-\left({{1}\over {e}}\right)^{\omega (\ln n)}\right).}
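For concreteness, the bound $\bar {\tau }$ can be evaluated numerically. The sketch below does so and also evaluates the right-hand side of (40) at $i=n-1$, to which $\bar {\tau }$ adds the term ${{M}\over {N}}\ln ^{2}n$. The function names and the sample parameters are hypothetical; the sample population sizes are finite stand-ins and are not claimed to satisfy the asymptotic conditions of the theorem.

```python
import math


def theorem4_tau_bar(n: int, n_samples: int, m_selected: int, delta: float) -> float:
    """Evaluate the upper bound on the first hitting time from Theorem 4."""
    eps = m_selected / n  # epsilon(n) = M / n
    beta = m_selected / n_samples
    lower = max(0.0, 1.0 - 2.0 * beta)
    upper = 1.0 - math.exp(1.0 / eps) * beta
    if not lower < delta < upper:
        raise ValueError("delta is outside the interval required by Theorem 4")
    log_term = math.log(math.e * (m_selected - 1) / n_samples) - math.log(1.0 - delta)
    numerator = log_term * n * eps + n
    denominator = eps * math.log(1.0 - delta) + eps * math.log(1.0 / beta) - 1.0
    return numerator / denominator + beta * math.log(n) ** 2 + 2.0 * n


def stage_time_bound(i: int, n: int, n_samples: int, m_selected: int, delta: float) -> float:
    """Right-hand side of (40): bound on the end time of the stage for bit i."""
    eps = m_selected / n
    log_term = math.log(math.e * (m_selected - 1) / n_samples) - math.log(1.0 - delta)
    numerator = (i + 1) * (log_term + 1.0 / eps)
    denominator = math.log(1.0 - delta) + math.log(n_samples / m_selected) - 1.0 / eps
    return numerator / denominator + 2.0 * (i + 1)


if __name__ == "__main__":
    n, big_n, m, delta = 50, 200_000, 100_000, 0.25  # illustrative values only
    print(theorem4_tau_bar(n, big_n, m, delta))
    print(stage_time_bound(n - 1, n, big_n, m, delta) + (m / big_n) * math.log(n) ** 2)
```

Multiplying the numerator and denominator of (40) by $\epsilon (n)$ at $i=n-1$ gives the first fraction of $\bar {\tau }$, so the two printed values agree up to rounding.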

##### Proof

In order to prove the above theorem, we define $n+1$ random variables $t_{0}$ and $t_{i}$ $(i=1,\ldots,n)$ as follows: $$\eqalignno{& t_{0} \triangleq \min\left\{t;p_{t,n}\left(\bar {x}^{\ast }_{n}\right)=1- {{1}\over {M}}\right\}\cr& t_{i} \triangleq \min\left\{t;p_{t,i}\left(x^{\ast }_{i}\right)=1- {{1}\over {M}}\right\}.}$$ The proof follows our basic idea introduced in Section III-A, and thus is similar to the proof of Theorem 2. However, the maximal value that a marginal probability can reach drops to $1- {{1}\over {M}}$, and the minimal value that a marginal probability can reach increases to ${{1}\over {M}}$. We will then de-randomize the ${\rm UMDA}_{M}$.

In the analysis, we ignore the possibility that the optimum is found before the $t_{0}$th generation (which would only make the FHT smaller), and we divide the optimization process into $(n+1)$ stages. The 1st stage begins when the optimization begins, and ends when the marginal probability ${\mathhat {p}}_{\cdot,n}\left(\bar {x}^{\ast }_{n}\right)$ reaches $1- {{1}\over {M}}$ for the first time. The 2nd stage follows the 1st stage, and ends when the marginal probability ${\mathhat {p}}_{\cdot,1}\left(x^{\ast }_{1}\right)$ reaches $1- {{1}\over {M}}$ for the first time. The $q$th stage $(q\in \{2,\ldots,n\})$ begins when the marginal probability ${\mathhat {p}}_{\cdot,q-2}\left(x^{\ast }_{q-2}\right)$ reaches $1- {{1}\over {M}}$ for the first time, and ends when the marginal probability ${\mathhat {p}}_{\cdot,q-1}\left(x^{\ast }_{q-1}\right)$ reaches $1- {{1}\over {M}}$ for the first time.

Let us consider the deterministic system. Suppose generation $t+1$ belongs to the $i$th stage $(i\in \{1,\ldots,n+1\})$, then the marginal probabilities at this generation are updated from the marginal probabilities at generation $t$ by $\gamma _{i}$. When $i=1$, we have TeX Source \eqalignno{& {\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast})=\gamma _{1}\left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast})\right)=\cr& \biggl(R {\mathhat {p}}_{t,1}\left(x_{1}^{\ast}\right),\ldots,R {\mathhat {p}}_{t,n-1}\left(x_{n-1}^{\ast}\right),\cr& \quad \quad \quad \quad \quad \quad \quad\quad1-G_{1}\left(1- {\mathhat {p}}_{t,n}\left(x_{n}^{\ast}\right)\right)\biggr)} where $R=(1-\eta)(1-\eta ^{\prime})$($\eta < 1$ and $\eta ^{\prime}< 1$ are positive functions of the problem size $n$), and $G_{1}=(1-\delta) {{N}\over {M}}$($\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1-e^{{1}\over {\epsilon (n)}} {{M}\over {N}}\right)$ is a constant). In the above equation, we consider two different cases.

1. $j\in \{1,\ldots,n-1\}$. In the deterministic system above, the $j$th bits of individuals are not exposed to selection pressure, and we use the factor $R=(1-\eta)(1-\eta ^{\prime})$ to demonstrate the impact of genetic drift on these marginal probabilities.
2. $j=n$. In the deterministic system above, the marginal probability ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)=1- {\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)$ is increasing, and we use the factor $G_{1}=(1-\delta) {{N}\over {M}}$ to demonstrate the impact of selection pressure on the increasing marginal probability ${\mathhat {p}}_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$(${\mathhat {p}}_{t+1,n}\left(\bar {x}_{n}^{\ast }\right)=G_{1} {\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)$, thus ${\mathhat {p}}_{t+1,n}\left(x_{n}^{\ast }\right)=1-G_{1} {\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)=1-G_{1}(1- {\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right))$ holds).

When $i\in \{2,\ldots,n\}$, we have TeX Source \eqalignno{& \quad {\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast})=\gamma _{i}\left({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast})\right)\cr& =\biggl({\mathhat {p}}_{t,1}\left(x_{1}^{\ast}\right),\ldots, {\mathhat {p}}_{t,i-2}\left(x_{i-2}^{\ast}\right),\cr& \quad \quad \quad G_{2} {\mathhat{p}}_{t,i-1}\left(x_{i-1}^{\ast }\right),R {\mathhat{p}}_{t,i}\left(x_{i}^{\ast }\right),\ldots,\cr& \quad \quad\quad \quad \quad \quad R {\mathhat{p}}_{t,n-1}\left(x_{n-1}^{\ast }\right), {\mathhat{p}}_{t,n}\left(x_{n}^{\ast }\right)\biggr)} where $G_{2}=(1-\delta)\left(1- {{1}\over {M}}\right)^{n} {{N}\over {M}}$ ($\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1-e^{{1}\over {\epsilon (n)}} {{M}\over {N}}\right)$ is a constant), and $R=(1-\eta)(1-\eta ^{\prime})$ ($\eta < 1$ and $\eta ^{\prime}< 1$ are positive functions of the problem size $n$). In the above equation, we consider four different cases for the deterministic system above.

1. $j\leq i-2$, $j\in \BBN ^{+}$. The marginal probabilities ${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)$ have reached $1- {{1}\over {M}}$, and at the next generation they will not change (we will soon prove this).
2. $j=i-1$. The marginal probability ${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)$ is increasing, and we use the factor $G_{2}=(1-\delta)\left(1- {{1}\over {M}}\right)^{n} {{N}\over {M}}$ to demonstrate the impact of selection pressure on this increasing marginal probability.
3. $j\in \{i,\ldots,n-1\}$. The $j$th bits of individuals are not exposed to selection pressure, and we use the factor $R=(1-\eta)(1-\eta ^{\prime})$ to demonstrate the impact of genetic drift on these marginal probabilities.
4. $j=n$. The marginal probabilities ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)$ and ${\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)$ have reached $1- {{1}\over {M}}$ and ${{1}\over {M}}$ respectively, and at the next generation they will not change (we will soon prove this).

For the $(n+1)$th stage, we have $$\eqalignno{& \quad {\mathhat {{\bf P}}}_{t+1}({\bf x}^{\ast })=\gamma _{n+1}({\mathhat {{\bf P}}}_{t}({\bf x}^{\ast }))\cr& =\left({\mathhat {p}}_{t,1}\left(x_{1}^{\ast }\right),\ldots, {\mathhat {p}}_{t,n-1}\left(x_{n-1}^{\ast }\right), {\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)\right)}$$ where we consider two different cases for this deterministic system.

1. $j\in \{1,\ldots,n-1\}$. The marginal probabilities ${\mathhat {p}}_{t,j}\left(x_{j}^{\ast }\right)$ have reached $1- {{1}\over {M}}$, and at the next generation they will not change (we will soon prove this).
2. $j=n$. The marginal probability ${\mathhat {p}}_{t,n}\left(x_{n}^{\ast }\right)$ is always no smaller than ${{1}\over {M}}$.

With ${\mathhat {{\bf P}}}_{0}({\bf x}^{\ast })=\left({{1}\over {2}},\ldots, {{1}\over {2}}\right)$, we have $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{i}^{t-t_{i-2}}\left({\mathhat {{\bf P}}}_{t_{i-2}}({\bf x}^{\ast })\right)$$ where $t_{i-2}< t\leq t_{i-1}$ $(i=1,\ldots,n+1)$, and we let $t_{-1}=0$ represent the beginning of the optimization process. Since $\{\gamma _{i}\}_{i=1}^{n+1}$ de-randomizes the whole optimization process, $\{t_{i}\}_{i=0}^{n}$ in the above equation are no longer random variables. For the sake of clarity, we rewrite the above equation as $${\mathhat {{\bf P}}}_{t}({\bf x}^{\ast })=\gamma _{i}^{t- {\mathhat {t}}_{i-2}}\left({\mathhat {{\bf P}}}_{{\mathhat {t}}_{i-2}}({\bf x}^{\ast })\right)$$ where ${\mathhat {t}}_{i-2}< t\leq {\mathhat {t}}_{i-1}$ $(i=1,\ldots,n+1)$. As we will show immediately, ${\mathhat {t}}_{i}$ $(0\leq i\leq n)$ is an upper bound of the random variable $t_{i}$ with some probability. Once all ${\mathhat {t}}_{i}$ can be estimated, and all the marginal probabilities $p_{t,j}\left(x_{j}^{\ast }\right)$ $(j=1,\ldots,n)$ have reached $1- {{1}\over {M}}$, the optimum might already be found, or it will take only a few steps to generate the optimum. Thus, if we can prove that once the marginal probabilities $p_{t,j}\left(x_{j}^{\ast }\right)$ $(j=1,\ldots,n-1)$ have reached $1- {{1}\over {M}}$ they will never decrease again, our task finally becomes calculating ${\mathhat {t}}_{n}$ and the probability that ${\mathhat {t}}_{n}$ holds as an upper bound of $t_{n}$.

We now provide the formal proof stage by stage. At the 1st stage, we analyze the case with the $n$th bit. At the $t$th generation (which belongs to the 1st stage), according to Lemma 5 and Chernoff bounds, we have TeX Source \eqalignno{& \quad \BBP\biggl(p_{t,n}\left(\bar {x}_{n}^{\ast}\right)\geq (1-\delta) {{p_{t-1,n}\left(\bar {x}_{n}^{\ast}\right)N}\over {M}}\cr& \quad \quad \quad \quad \quad \quad\quad \mid p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right) \leq {{M-1}\over{N(1-\delta)}}\biggr)\cr& > 1- e^{-p_{t-1,n}\left(\bar{x}_{n}^{\ast }\right)N\delta ^{2}/2}} where $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1-e^{{1}\over {\epsilon (n)}} {{M}\over {N}}\right)$ is a positive constant, and $p_{t,n}\left(\bar {x}_{n}^{\ast }\right)\leq 1- {{1}\over {M}}$ (since the UMDA adopts margins) yields the condition that $p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right) \leq {{M-1}\over {N(1-\delta)}}$. Similar to Table III in the proof of Theorem 2 we can obtain TeX Source \eqalignno{& \BBP\left(p_{t,n}\left(\bar {x}_{n}^{\ast }\right)\geq G_{1}^{t}p_{0,n}\left(\bar {x}_{n}^{\ast }\right)\mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right) \cr& \quad \quad >\left(1-e^{-p_{0,n}\left(\bar {x}_{n}^{\ast }\right)N\delta ^{2}/2}\right)^{t}.& {\hbox{(36)}}}

Consider the probability that $t_{0}$ is upper bounded by some value, say ${\mathhat {t}}_{0}$, we obtain the inequalities estimated in Table VIII, where in (33) the factor $\left(1-e^{-{{\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)N}\delta ^{2}/2}\right)$ is added since we apply Chernoff bounds at the end of the $({\mathhat {t}}_{0}-1)$th generation. Now we consider the following item: TeX Source \eqalignno{& \BBP\left({\mathhat {p}}_{{\mathhat {t}}_{0}-1,n}\left(\bar {x}_{n}^{\ast }\right)> {{M-1}\over {N(1-\delta)}}\mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right) \cr& = \BBP\left({\mathhat {p}}_{{\mathhat {t}}_{0}-1,n}\left(\bar {x}_{n}^{\ast }\right)> {{M-1}\over {N(1-\delta)}}\right).& {\hbox{(37)}}} Since $\left\{{\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)\right\}_{t=0}^{\infty }$ is a deterministic sequence, the probability above must be either 0 or 1. We need to find the value of ${\mathhat {t}}_{0}$ that makes the above probability 1. 
Given that ${\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {{1}\over {2}}$, the definition of ${\mathhat {t}}_{0}$ (it is an upper bound of $t_{0}$ defined at the beginning of the proof) and the condition that $\forall t< {\mathhat {t}}_{0}-1\colon {{M-1}\over {N(1-\delta)}}> {\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)>(1-\delta) {{{\mathhat {p}}_{t-1,n}\left(\bar {x}_{n}^{\ast }\right)N}\over {M}}$ together imply TeX Source \eqalignno{&\quad \quad \quad G_{1}^{{\mathhat {t}}_{0}-2} {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\cr& \quad =\left ((1-\delta)\left({{N}\over {M}}\right)\right)^{{\mathhat {t}}_{0}-2} {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)< {{M-1}\over {N(1-\delta)}}\cr& G_{1}^{{\mathhat {t}}_{0}-1} {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\cr& \quad =\left ((1-\delta)\left({{N}\over {M}}\right)\right)^{{\mathhat {t}}_{0}-1} {\mathhat{p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\geq {{M-1}\over {N(1-\delta)}}.} Hence, we obtain the value of ${\mathhat {t}}_{0}$ TeX Source $${\mathhat {t}}_{0}\leq {{\ln{{2M-2}\over {N}}-\ln (1-\delta)}\over {\ln (1-\delta)+\ln \left({{N}\over {M}}\right)}}+2.$$ Now we can continue to estimate the probability mentioned in (32), which can provide us the probability that $t_{0}$ is upper bounded by ${\mathhat {t}}_{0}$. Similar to (25) in the proof of Theorem 2, according to (36), we can obtain that the probability is at least $\left(1-e^{-p_{0,n}\left(\bar {x}_{n}^{\ast }\right)N\delta ^{2}/2}\right)^{{\mathhat {t}}_{0}}$.
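The bound on ${\mathhat {t}}_{0}$ can be sanity-checked by iterating the deterministic first-stage update ${\mathhat {p}}_{t+1,n}\left(\bar {x}_{n}^{\ast }\right)=\min \left\{G_{1} {\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right),1- {{1}\over {M}}\right\}$ from ${{1}\over {2}}$ and counting generations. This is a minimal sketch; the function names and sample parameters are assumptions, and the explicit $\min$ with the upper margin is this sketch's way of modeling the margin rule of the ${\rm UMDA}_{M}$.

```python
import math


def generations_to_reach_margin(
    n_samples: int, m_selected: int, delta: float, p0: float = 0.5
) -> int:
    """Count generations until the deterministic sequence hits 1 - 1/M."""
    g1 = (1.0 - delta) * n_samples / m_selected
    if g1 <= 1.0:
        raise ValueError("G_1 must exceed 1, i.e. delta < 1 - M/N")
    cap = 1.0 - 1.0 / m_selected
    p, t = p0, 0
    while p < cap:
        p = min(g1 * p, cap)  # growth by G_1, clamped at the upper margin
        t += 1
    return t


def t0_hat_bound(n_samples: int, m_selected: int, delta: float) -> float:
    """Closed-form upper bound on t0-hat derived in the text."""
    numerator = math.log((2 * m_selected - 2) / n_samples) - math.log(1.0 - delta)
    denominator = math.log(1.0 - delta) + math.log(n_samples / m_selected)
    return numerator / denominator + 2.0


if __name__ == "__main__":
    big_n, m = 200_000, 100_000  # illustrative values only
    for delta in (0.25, 0.4):
        print(delta, generations_to_reach_margin(big_n, m, delta), t0_hat_bound(big_n, m, delta))
```

In both sample settings the iterated count stays below the closed-form bound, as expected from the two inequalities that define ${\mathhat {t}}_{0}$.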

TABLE VIII CALCULATION OF PROBABILITY THAT $t_{0}$ IS UPPER BOUNDED BY ${\mathhat {t}}_{0}$

On the other hand, we can deal with the genetic drift in the same way as we did for Theorem 2: since ${\mathhat {t}}_{0}=\Theta (1)$, when $t= {\mathhat {t}}_{0}$, the marginal probabilities of the other bits remain at a level of at least ${{1}\over {e}}$ with the overwhelming probability of $$\left(1- e^{- {{\omega (n^{2+\alpha }\log n)}\over {2e}}\delta ^{2}}\right)^{{\mathhat {t}}_{0}}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{t}}_{0}}$$ where the second factor $\left(1-n^{-\left (1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat {t}}_{0}}$ comes from the analysis of genetic drift (please refer to (26) for details). The proof is very similar to that of Theorem 2, so we omit the details for the sake of brevity. Now we have finished the analysis of the 1st stage.

After the marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ has reached $1- {{1}\over {M}}$, i.e., $t\geq {\mathhat {t}}_{0}$, $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ will not drop to a level that is smaller than $1- {{1}\over {M}}$ again unless the algorithm has found the optimum. In fact, a similar property also holds for the other marginal probabilities. In order to prove it, let us consider the $(i+1)$th stage $(1\leq i< n)$, and we use the factor $G_{2}$ to demonstrate the impact of selection, by which the interactions among bits are taken into account. For the $i$th bit, at the $k$th generation, we can investigate the following situation: $$\eqalignno{& p_{k,i}\left(x_{i}^{\ast }\right)< 1- {{1}\over {M}},\cr& \forall j\leq i-1\colon p_{k,j}\left(x_{j}^{\ast }\right)=1- {{1}\over {M}}.}$$ We will then prove that once $p_{\cdot,j}\left(x_{j}^{\ast }\right)$ reaches $1- {{1}\over {M}}$ for all $1\leq j\leq i-1$, none of them will decrease again with an overwhelming probability. Let $r_{k+1}\left((1^{i-1}\ast \ast \cdots \ast 1)\right)$ be the proportion of individuals $(1^{i-1}\ast \ast \cdots \ast 1)$ before selection at the $(k+1)$th generation, where $\ast$ must be either 0 or 1. According to Chernoff bounds, and with $N>M=\epsilon (n)n$, we have $$\eqalignno{& \BBP\Biggl(r_{k+1}\left((1^{i-1}\ast \ast \cdots\ast 1)\right)>(1-\delta)\left (1- {{1}\over {M}}\right)^{i}\cr& \mid p_{k,n}\left(\bar {x}_{n}^{\ast }\right)=1- {{1}\over {M}},\forall j\leq i-1\colon p_{k,j}\left(x_{j}^{\ast }\right)=1- {{1}\over {M}} \Biggr)\cr& >1-e^{-\left(1- {{1}\over {M}}\right)^{i}N\delta ^{2}/2}>1-e^{-\left(1- {{1}\over {M}}\right)^{n}N\delta^{2}/2}\cr& >1-e^{-\left(1- {{1}\over {\epsilon (n)n}}\right)^{n}\epsilon (n)n\delta ^{2}/2}\cr& \to 1-e^{-e^{-1/\epsilon (n)}\epsilon (n)n\delta ^{2}/2}}$$ which is an overwhelming probability when $n\to \infty$.
Since $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1-e^{{1}\over {\epsilon (n)}} {{M}\over {N}}\right)$, we know that TeX Source \eqalignno{& r_{k+1}\left((1^{i-1}\ast \ast \cdots\ast 1)\right)\cr& >(1-\delta)\left (1- {{1}\over {M}}\right)^{i}\cr& > (1-\delta)\left (1- {{1}\over {M}}\right)^{n}> {{M}\over {N}}} holds with an overwhelming probability $1-e^{-e^{-1/\epsilon (n)}\epsilon (n)n\delta ^{2}/2}$. At the same time, it is obvious that the individuals $(1^{i-1}\ast \ast \cdots \ast 1)$ have the highest fitness in the population. After truncation selection, according to Lemma 5, we obtain that (note that we use margins for the marginal probabilities) TeX Source \eqalignno{& \BBP\Biggl(\forall j\leq i-1\colon p_{k+1,j}\left(x_{j}^{\ast }\right)=1- {{1}\over {M}}\mid p_{k,n}\left(\bar {x}_{n}^{\ast }\right)=1- {{1}\over {M}}, \cr& \quad \quad \forall j\leq i-1\colon p_{k,j}\left(x_{j}^{\ast }\right)=1- {{1}\over {M}} \Biggr) \cr& >1-e^{-e^{-1/\epsilon (n)}\epsilon (n)n\delta ^{2}/2}& \hbox{(38)}} which means with an overwhelming probability, the marginal probabilities $p_{\cdot,j}\left(x_{j}^{\ast }\right) (\forall j\leq i-1)$ will no longer change once they reach $1- {{1}\over {M}}$.

Now we consider the $(i+1)$th stage $(i\leq n-1)$, at which the $i$th bits of individuals are of our interest. Similar to the case of the 1st stage, in which the marginal probability ${\mathhat {p}}_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ is investigated, we can estimate the time that ${\mathhat {p}}_{\cdot,i}\left(x_{i}^{\ast }\right)$ reaches $1- {{1}\over {M}}$, i.e., ${\mathhat {t}}_{i} (1\leq i< n)$. As presented in Table IX, it is not hard to obtain (34) and (35).

In order to obtain ${\mathhat {t}}_{i}$, we need to know ${\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)$ so as to solve (34) and (35). It is worth noting that ${\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)$ is related to the genetic drift. Similar to what we did in Section IV, when the bits are not exposed to selection pressure, given that ${\mathhat {t}}_{i-1}=O(n)$, the marginal probability ${\mathhat {p}}_{\cdot,i}\left(x_{i}^{\ast }\right)$ will remain at least ${{1}\over {e}}$. Hence, $p_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)> {{1}\over {e}}$ holds with the overwhelming probability of $$\prod _{k=0}^{i-1}\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{t}}_{k}} \eqno {\hbox{(39)}}$$ where the term $$\left (1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{t}}_{k}}$$ represents the probability that the $(k+1)$th marginal probability is at least ${{1}\over {e}}$ after genetic drift. Detailed analysis can be found in the proof of Theorem 2.

TABLE IX CALCULATION OF (34) and (35)

Now we can solve the equations given in (34) and (35), and get $$\eqalignno{& \quad {\mathhat {t}}_{i}= {\mathhat {t}}_{0}+\displaystyle \sum _{k=1}^{i}({\mathhat {t}}_{k}- {\mathhat{t}}_{k-1}) \cr& < {{(i+1)\left (\ln{{e(M-1)}\over {N}}-\ln (1-\delta)+ {{1}\over {\epsilon (n)}}\right)}\over {\ln(1-\delta)+\ln \left({{N}\over {M}}\right)- {{1}\over {\epsilon (n)}}}} \cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad +2(i+1)& \hbox{(40)}}$$ where $i\leq n-1$.

Next, we need to estimate the joint probability that the random variable $t_{i}$ is upper bounded by ${\mathhat {t}}_{i}$. Since similar work has been done in (32) and (33), and (20) in the proof of Theorem 2, we only informally describe it here for the sake of brevity. This joint probability contains four parts.

1. The probability that $\forall k\in \{1,\ldots,i-1\}\colon t_{k}< {\mathhat {t}}_{k}$. (It can be obtained by induction. For more details, please refer to (20).)
2. The probability that after genetic drift of the $i$th bit, the marginal probability $p_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)$ is larger than ${{1}\over {e}}$. (We have already mentioned it in (39).)
3. The probability that after the marginal probabilities $p_{\cdot,j}\left(x_{j}^{\ast }\right) (j\ne n)$ have reached $1- {{1}\over {M}}$, they will never drop to a lower level again. (We can utilize the result given in (38).)
4. The probability that $p_{t,i}\left(x_{i}^{\ast }\right)$ is lower bounded by ${\mathhat {p}}_{t,i}\left(x_{i}^{\ast }\right) ({\mathhat {t}}_{i-1}< t\leq {\mathhat {t}}_{i})$, given the condition that $p_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)\geq {\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)$.

Now we briefly estimate the probability mentioned in Item 4 (a more detailed example can be found in Table III in the proof of Theorem 2). As the first step, we consider the relation between $p_{t,i}\left(x_{i}^{\ast }\right)$ and $p_{t-1,i}\left(x_{i}^{\ast }\right)$ $({\mathhat {t}}_{i-1}< t\leq {\mathhat {t}}_{i})$ by applying Chernoff bounds twice. As a result, we obtain the inequalities presented in Table X, where we utilize “$\min$” to take into account the situation in which $(1-\delta) {{N}\over {M}}p_{t-1,i}\left(x_{i}^{\ast }\right)p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right)\prod _{j=1}^{i-1}p_{t-1,j}\left(x_{j}^{\ast }\right)>1- {{1}\over {M}}$ holds. In this case, noting that the UMDA has adopted margins, the marginal probability $p_{t,i}\left(x_{i}^{\ast }\right)$ is set to be $1- {{1}\over {M}}$. By setting the condition of the above probability as $p_{t-1,i}\left(x_{i}^{\ast }\right)\geq {\mathhat {p}}_{t-1,i}\left(x_{i}^{\ast }\right)= G_{2}^{t- {\mathhat {t}}_{i-1}-1} {\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)$, the above inequality further implies that $$\eqalignno{& \quad \BBP\biggl(p_{t,i}\left(x_{i}^{\ast }\right)\geq\min \left\{G_{2}p_{t-1,i}\left(x_{i}^{\ast }\right),1- {{1}\over {M}}\right\}\cr& \quad \quad \quad \mid p_{t-1,i}\left(x_{i}^{\ast }\right)\geq G_{2}^{t- {\mathhat {t}}_{i-1}-1} {\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)\biggr) \cr& >1-e^{-\left(1- {{1}\over {M}}\right)^{n}G_{2}^{t- {\mathhat {t}}_{i-1}-1} {\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)N\delta^{2}/2}\cr& >1-e^{-\left(1- {{1}\over {M}}\right)^{n} {\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)N\delta ^{2}/2}\cr& >1-e^{-\left(1- {{1}\over {M}}\right)^{n}N\delta ^{2}/2e}}$$ holds, where we utilize the facts that ${\mathhat {p}}_{{\mathhat {t}}_{i-1},i}\left(x_{i}^{\ast }\right)> {{1}\over {e}}$ holds with an overwhelming probability (the consequence of genetic drift; the original analysis can be found before (27)), and that $G_{2}>1$ (which ensures that ${\mathhat {p}}_{t,i}\left(x_{i}^{\ast }\right)$ is monotonically increasing when the time index $t$ satisfies ${\mathhat {t}}_{i-1}< t\leq {\mathhat {t}}_{i}$). As a consequence of the above inequality, similar to Table III in the proof of Theorem 2, we obtain the probability mentioned in Item 4: $$\eqalignno{& \qquad \qquad \left(1-e^{-\left(1- {{1}\over {M}}\right)^{n}N\delta ^{2}/2e}\right)^{{\mathhat {t}}_{i}- {\mathhat {t}}_{i-1}}\cr& =\left(1-e^{-e^{-1/\epsilon (n)}\omega (n^{2+\alpha }\log n)\delta ^{2}/2e}\right)^{{\mathhat {t}}_{i}- {\mathhat {t}}_{i-1}}. }$$ Now combining the probabilities mentioned in Items 1, 2, 3 and 4 together, we can obtain that $t_{i}$ is upper bounded by ${\mathhat {t}}_{i}$ at least with the probability of $$\eqalignno{& \left(1-n^{-e^{-1/\epsilon (n)}\omega (n^{2+\alpha })\delta ^{2}/2e}\right)^{2 {\mathhat {t}}_{i}}\cr& \quad \cdot \prod _{k=0}^{i-1}\left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{t}}_{k}}.}$$ As a result, $t_{n-1}$ is bounded by ${\mathhat {t}}_{n-1}$ with the overwhelming probability of $$\left(1-n^{-e^{-1/\epsilon (n)}\omega (n^{2+\alpha })\delta ^{2}/2e}\right)^{2 {\mathhat {t}}_{n-1}}\cdot\prod _{k=0}^{n-2}\left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2 {\mathhat{t}}_{k}}.$$

TABLE X BOUNDING $p_{t,i}\left(x_{i}^{\ast }\right)$ FROM BELOW WITH AN OVERWHELMING PROBABILITY

When all the marginal probabilities $p_{\cdot,i}\left(x_{i}^{\ast }\right) (i\ne n)$ have reached $1- {{1}\over {M}}$, the marginal probability $p_{\cdot,n}\left(\bar {x}_{n}^{\ast }\right)$ will become smaller and smaller and the probability of finding the optimum becomes larger and larger.

Now we consider the $(n+1)$th stage, in which two events hold: 1) ${\mathhat {p}}_{{\mathhat {t}}_{n-1},n}\left(x_{n}^{\ast }\right)\geq {{1}\over {M}}$ holds; 2) $\forall t> {\mathhat {t}}_{n-1}$, $t\prec Poly(n)$, $\forall j\leq n-1\colon p_{t,j}\left(x_{j}^{\ast }\right)=1- {{1}\over {M}}$ holds with an overwhelming probability (38). Thus, there is no genetic drift to be taken into account. Meanwhile, the probability of generating the optimum in one sampling of a generation, conditional on the above two events, is at least $\left(1- {{1}\over {M}}\right)^{n-1} {{1}\over {M}}=e^{-(n-1)/n\epsilon (n)} {{1}\over {M}}$, which implies that if the above two events both happen (which is true in the $(n+1)$th stage), then the optimum will be found within $M\ln ^{2}n$ extra samplings (which generates $M\ln ^{2}n$ new individuals) with the overwhelming probability $1-\left({{1}\over {e}}\right)^{\omega (\ln n)}$. Consequently, after the first $n$ stages, at most ${{M}\over {N}}\ln ^{2}n$ generations can guarantee the emergence of the optimum with an overwhelming probability.
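As a rough numerical illustration of this final-stage argument, the following Python sketch evaluates the per-sample success probability $\left(1- {{1}\over {M}}\right)^{n-1} {{1}\over {M}}$ and the resulting bound on the probability that none of the $M\ln ^{2}n$ extra samplings hits the optimum. The function name and the parameter values are hypothetical and chosen only for illustration; the samplings are treated as independent, as in the argument above.

```python
import math

def final_stage_failure_bound(n: int, M: int) -> float:
    """Bound on the probability that ceil(M * ln(n)^2) independent samplings
    all miss the optimum, assuming the first n-1 marginals sit at 1 - 1/M
    and the last marginal is at least 1/M (the two events of the text)."""
    # Lower bound on the probability that one sampling produces the optimum.
    q = (1.0 - 1.0 / M) ** (n - 1) / M
    # Number of extra samplings considered in the (n+1)th stage.
    samples = math.ceil(M * math.log(n) ** 2)
    # (1 - q)^samples, evaluated in log space for numerical stability.
    return math.exp(samples * math.log1p(-q))

if __name__ == "__main__":
    # Illustrative (hypothetical) problem sizes and selected-population sizes.
    for n, M in [(50, 5_000), (100, 50_000), (200, 500_000)]:
        print(n, M, final_stage_failure_bound(n, M))
```

The bound shrinks as $n$ grows in these examples, which is the behavior expressed by the term $\left({{1}\over {e}}\right)^{\omega (\ln n)}$.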

Hence, the first hitting time $\tau$ is upper bounded by a deterministic value $\bar {\tau }$
$$\eqalignno{& \tau < \bar {\tau }= {{\left(\ln{{e(M-1)}\over {N}}-\ln (1-\delta)\right)n\epsilon (n)+n}\over {\epsilon (n)\ln (1-\delta)+\epsilon (n)\ln \left({{N}\over {M}}\right)-1}}\cr& \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad + {{M}\over{N}}\ln ^{2}n+2n}$$
with the overwhelming probability at least
$$\eqalignno{& \left(1-n^{-e^{-1/\epsilon (n)}\omega (n^{2+\alpha })\delta ^{2}/2e}\right)^{2\bar {\tau}}\cr& \quad \cdot \left(1-n^{-\left(1-\left({{1}\over {n}}\right)^{1+ {{\alpha }\over {2}}}\right)^{2}\omega (1)}\right)^{2(n-1)\bar {\tau}}\cr& \quad \cdot \left (1-\left({{1}\over {e}}\right)^{\omega (\ln n)}\right).}$$
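To get a feeling for the size of $\bar {\tau }$, the bound can be evaluated numerically. The Python sketch below is only an illustration: the function name and the sample values are hypothetical, and it assumes that $\epsilon (n)$ is recovered from the identity $\left(1- {{1}\over {M}}\right)^{n}=e^{-1/\epsilon (n)}$ used in the probability estimates above.

```python
import math

def fht_upper_bound(n: int, N: int, M: int, delta: float) -> float:
    """Evaluate the deterministic bound tau_bar for given n, N, M and delta.

    Assumption: epsilon(n) is obtained from (1 - 1/M)^n = exp(-1/epsilon(n)).
    """
    eps = -1.0 / (n * math.log1p(-1.0 / M))
    # ln((1 - delta) * N / M); must be positive for the bound to make sense.
    growth = math.log(1.0 - delta) + math.log(N / M)
    denom = eps * growth - 1.0
    if denom <= 0:
        raise ValueError("the denominator of the bound must be positive")
    numer = (math.log(math.e * (M - 1) / N) - math.log(1.0 - delta)) * n * eps + n
    # First n stages, plus the final sampling stage, plus the 2n term.
    return numer / denom + (M / N) * math.log(n) ** 2 + 2 * n

if __name__ == "__main__":
    # Hypothetical setting: M = N / 2 and delta inside (0, 1 - M/N).
    print(fht_upper_bound(n=100, N=1_000_000, M=500_000, delta=0.25))
```

For the hypothetical values shown, the bound is a small multiple of $n$, which agrees with the polynomial FHT discussed in this section.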

The results in this section show that margins can avoid misleading convergence and leave ${\rm UMDA}_{M}$ a chance of finding the global optimum. However, ${\rm UMDA}_{M}$ can no longer converge completely to the global optimum, i.e., the CT becomes infinite. This is an interesting case where the FHT is bounded polynomially in the problem size but the CT is infinite, and it demonstrates that the FHT is a more appropriate measure of the time complexity of EDAs than the CT. It is noteworthy that the idea of margins is quite similar to the Laplace correction [2], which was also proposed to prevent the marginal probabilities from premature convergence. However, since our purpose here is to demonstrate the influence of forbidding a marginal probability to be 0 or 1, the slight difference between relaxation and Laplace correction is not investigated.
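The effect of margins can be sketched in a few lines of Python. This is a minimal illustration and not the exact algorithm analyzed above: it assumes truncation selection of the best $M$ out of $N$ individuals, uses LeadingOnes as a stand-in fitness function, and all function names are hypothetical. The only point it shows is that every marginal probability is kept inside $\left[{{1}\over {M}},1- {{1}\over {M}}\right]$ after each update.

```python
import random

def leading_ones(x):
    """Number of consecutive 1-bits at the start of x (LeadingOnes fitness)."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def clamp_to_margins(p, M):
    """Keep a marginal probability inside [1/M, 1 - 1/M]."""
    return min(max(p, 1.0 / M), 1.0 - 1.0 / M)

def umda_margin_step(population, M, N, rng, fitness=leading_ones):
    """One generation: select, estimate marginals, apply margins, sample."""
    # Assumed truncation selection: keep the M fittest individuals.
    selected = sorted(population, key=fitness, reverse=True)[:M]
    n = len(selected[0])
    # Frequency of 1s per bit position, clamped so it never reaches 0 or 1.
    marginals = [clamp_to_margins(sum(x[i] for x in selected) / M, M)
                 for i in range(n)]
    # Sample N new individuals bit by bit from the univariate model.
    offspring = [[1 if rng.random() < p else 0 for p in marginals]
                 for _ in range(N)]
    return marginals, offspring

if __name__ == "__main__":
    rng = random.Random(0)
    n, N, M = 8, 40, 20  # hypothetical toy sizes
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
    for _ in range(30):
        marginals, pop = umda_margin_step(pop, M, N, rng)
    print([round(p, 2) for p in marginals])
```

Because no marginal can reach 0 or 1, every bit string keeps a positive sampling probability in every generation, which is why the model never converges completely.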

SECTION VII

## CONCLUSION

In this paper, we utilized the FHT to measure the time complexity of EDAs. Based on the FHT measure, we proposed a classification of problem hardness for EDAs and the corresponding probability conditions. This is the first time the general issues related to the time complexity of EDAs were discussed theoretically. After that, a new approach to analyzing the FHT for EDAs with finite population was introduced. Using this approach, we investigated the time complexity of UMDAs as examples. In this paper, UMDAs were analyzed in depth on two problems: LEADINGOnes [37] and BVLeadingOnes. Both of the problems are unimodal. The latter was derived from the former, and inherited the domino convergence property of the former. For the original UMDA, LEADINGOnes is shown to be EDA-easy, and BVLeadingOnes is shown to be EDA-hard. Comparing these theoretical results with those for EAs, the first result is similar, i.e., LEADINGOnes is easy for both, but this agreement does not hold in general. That is, a problem that is easy for EAs could be hard for EDAs, e.g., the BVLeadingOnes problem. However, it is still an open issue to analyze problems that are hard for EAs but easy for EDAs.

If the UMDA is further relaxed by margins, BVLeadingOnes will no longer be EDA-hard. Our analysis shows that the margin is helpful for UMDA to avoid wrong convergence and thus significantly increases the performance of UMDA on BVLeadingOnes. This is the first rigorous time complexity evidence that supports the efficacy of relaxations (corrections) of EDAs.

Finally, although we only analyzed UMDAs, our approach has the potential for analyzing other EDAs with finite populations. The general idea is to find a way to simplify the EDAs and then estimate the probability that this simplification holds. However, since different EDAs may have different characteristics, more work needs to be done to generalize our approach.

## APPENDIX

##### Proof of Lemma 6

According to Chernoff bounds, we have
$$\eqalignno{& \quad \BBP\Biggl(p_{t,n}\left(\bar {x}_{n}^{\ast }\right)\geq(1-\delta) {{p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right)N}\over {M}}\cr& \quad \quad \quad \quad \quad \quad \mid p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right) \leq{{M}\over {N(1-\delta)}}\Biggr)\cr& > 1- e^{-p_{t-1,n}\left(\bar {x}_{n}^{\ast }\right)N\delta ^{2}/2}, \forall t\leq U}$$
where $\delta \in \left(\max \left\{0,1- {{2M}\over {N}}\right\},1- {{M}\over {N}}\right)$ is a positive constant. Since no global optimum is generated before the $U$th generation, we have
$${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)=G^{t}p_{0,n}\left(\bar {x}_{n}^{\ast }\right),\quad \forall t\leq U$$
where $G=(1-\delta) {{N}\over {M}}$, and ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)$ is deterministic given the initial value $p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {{1}\over {2}}$. Furthermore, setting $t=U$ in the above equation, by calculation we obtain that
$${\mathhat {p}}_{U,n}\left(\bar {x}_{n}^{\ast }\right)=1.$$
Let ${\mathhat {T}}_{n}^{\prime}$ denote the minimal $t$ for ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)$ to reach 1, then the above equation implies ${\mathhat {T}}_{n}^{\prime}\leq U$. We study the probability that the random variable $p_{t,n}\left(\bar {x}_{n}^{\ast }\right)$ is larger than ${\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)$. Similar to Table III, $\forall t\leq {\mathhat {T}}_{n}^{\prime}$ we obtain
$$\eqalignno{& \BBP\left(p_{t,n}\left(\bar {x}_{n}^{\ast }\right)\geq{\mathhat {p}}_{t,n}\left(\bar {x}_{n}^{\ast }\right)\mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right)\cr& \quad \quad >\left(1- e^{-p_{0,n}\left(\bar {x}_{n}^{\ast }\right)N\delta ^{2}/2}\right)^{t}.}$$

By the inequalities in Table XI, we estimate the probability that $T_{n}^{\prime}$ is bounded by ${\mathhat {T}}_{n}^{\prime}$, where in (42) the factor $\left(1- e^{-{p_{0,n}\left(\bar {x}_{n}^{\ast }\right)N}\delta ^{2}/2}\right)$ is added since we apply Chernoff bounds at the end of the $\left({\mathhat {T}}_{n}^{\prime}-1\right)$th generation. We then consider the following item:
$$\eqalignno{& \BBP\left({\mathhat {p}}_{{\mathhat {T}}_{n}^{\prime }-1,n}\left(\bar {x}_{n}^{\ast }\right)> {{M}\over {N(1-\delta)}}\mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right)\cr& \quad \quad = \BBP\left({\mathhat {p}}_{{\mathhat {T}}_{n}^{\prime }-1,n}\left(\bar {x}_{n}^{\ast }\right)> {{M}\over {N(1-\delta)}}\right).}$$
According to the definition of ${\mathhat {T}}_{n}^{\prime}$, and noting that ${\mathhat {p}}_{{\mathhat {T}}_{n}^{\prime}-1,n}\left(\bar {x}_{n}^{\ast }\right)> {{M}\over {N(1-\delta)}}$ is deterministic, we know the probability above is 1. Thus, we continue to estimate the corresponding probability mentioned in (41):
$$\eqalignno{& \quad \BBP\left(T_{n}^{\prime }\leq {\mathhat{T}}_{n}^{\prime } \mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)={\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right)\cr& >\BBP\biggl(p_{{\mathhat {T}}_{n}^{\prime }-1,n}\left(\bar{x}_{n}^{\ast }\right)\geq {\mathhat {p}}_{{\mathhat{T}}_{n}^{\prime }-1,n}\left(\bar {x}_{n}^{\ast }\right)\cr& \mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat{p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\biggr)\left(1-e^{- {{{\mathhat {p}}_{0,n}(1)N}\over {2}}\delta ^{2}}\right)\cr&>\left(1-e^{- {{{\mathhat {p}}_{0,n}(1)N}\over {2}}\delta^{2}}\right)^{{\mathhat {T}}_{n}^{\prime}}.}$$
Since ${\mathhat {T}}_{n}^{\prime}\leq U$, we further get
$$\eqalignno{& \BBP\left(T_{n}^{\prime }\leq U \mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right)\cr& > \BBP\left(T_{n}^{\prime }\leq {\mathhat {T}}_{n}^{\prime } \mid p_{0,n}\left(\bar {x}_{n}^{\ast }\right)= {\mathhat {p}}_{0,n}\left(\bar {x}_{n}^{\ast }\right)\right)\cr& >\left(1- e^{- {{{\mathhat {p}}_{0,n}(1)N}\over {2}}\delta ^{2}}\right)^{U}.}$$
The analysis above shows that the probability that the marginal probability converges before the $U$th generation $(T_{n}< U)$ is at least $\left(1- e^{- {{N}\over {4}}\delta ^{2}}\right)^{U}$. Since $N=\omega (n^{2+\alpha }\log n)$, $M=\beta N$ ($\beta \in (0,1)$ is a constant), and $U$ is polynomial in the problem size $n$, this probability is overwhelming. Hence, we have proven the lemma.

TABLE XI CALCULATION OF PROBABILITY THAT $T_{n}^{\prime}$ IS UPPER BOUNDED BY ${\mathhat {T}}_{n}^{\prime}$
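The two quantities used in the proof of Lemma 6 can be checked numerically. The following Python sketch is illustrative only (the function names and sample values are hypothetical): it iterates the deterministic recursion ${\mathhat {p}}_{t,n}=G^{t}p_{0,n}$, capped at 1, to find ${\mathhat {T}}_{n}^{\prime}$, and evaluates the lower bound $\left(1- e^{-p_{0,n}N\delta ^{2}/2}\right)^{U}$ with $p_{0,n}= {{1}\over {2}}$.

```python
import math

def deterministic_hitting_generation(N: int, M: int, delta: float,
                                     p0: float = 0.5) -> int:
    """Smallest t such that min(1, G^t * p0) reaches 1, with G = (1-delta)*N/M."""
    G = (1.0 - delta) * N / M
    if G <= 1.0:
        raise ValueError("need G > 1, i.e. delta < 1 - M/N")
    t, p = 0, p0
    while p < 1.0:
        p = min(1.0, G * p)  # probabilities are capped at 1
        t += 1
    return t

def lemma6_probability_bound(N: int, delta: float, U: int,
                             p0: float = 0.5) -> float:
    """Lower bound (1 - exp(-p0 * N * delta^2 / 2))^U from the proof."""
    per_generation_failure = math.exp(-p0 * N * delta ** 2 / 2.0)
    # Evaluate the U-th power in log space to avoid rounding to exactly 1.
    return math.exp(U * math.log1p(-per_generation_failure))

if __name__ == "__main__":
    # Hypothetical values with M = N / 2, so delta must lie in (0, 0.5).
    print(deterministic_hitting_generation(N=1000, M=500, delta=0.25))
    print(lemma6_probability_bound(N=1000, delta=0.25, U=100))
```

Even for the modest hypothetical population size used here, the bound is already very close to 1, and it approaches 1 further as $N$ grows.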

### ACKNOWLEDGMENT

The authors are grateful to Prof. J. A. Lozano for his constructive comments. T. Chen would like to thank Dr. J. He for his kind help and suggestions over the years.

## Footnotes

This work was supported in part by the National Natural Science Foundation of China under Grants 60533020 and U0835002, the Fund for Foreign Scholars in the University Research and Teaching Programs (111 Project) in China under Grant B07033, and an Engineering and Physical Science Research Council Grant EP/C520696/1 in the U.K.

X. Yao is with the Nature Inspired Computation and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China, and also with the Center of Excellence for Research in Computational Intelligence and Applications, School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K. (e-mail: x.yao@cs.bham.ac.uk).

T. Chen, K. Tang and G. Chen are with the Nature Inspired Computation and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China (e-mail: cetacy@mail.ustc.edu.cn, ketang@ustc.edu.cn, glchen@ustc.edu.cn).

1. For $g(n)\in [{0,1}]$, there are more detailed asymptotic orders in the interval $[{0,1}]$:

1) $g(n)\prec {{1}\over {SuperPoly(n)}}$;

2) ${{1}\over {Poly(n)}}\prec g(n)\prec 1- {{1}\over {Poly(n)}}$ [if and only if $\exists a_{1},b_{1},a_{2},b_{2}\in \BBR ^{+}$, $n_{0},n_{1}\in \BBN$: $\forall n>\max \{n_{0},n_{1}\}$, $1/\left(a_{1}n^{b_{1}}\right)\leq g(n)\leq 1-1/\left(a_{2}n^{b_{2}}\right)$];

3) $g(n)\succ 1- {{1}\over {SuperPoly(n)}}$ [if and only if $\forall a,b\in \BBR ^{+}$: $\exists n_{0}\in \BBN$: $\forall n>n_{0}$, $g(n)\geq 1-1/(an^{b})$].

If necessary, these detailed asymptotic orders can be obtained by considering the regions $c\pm {{1}\over {Poly(n)}}$ and $c\pm {{1}\over {SuperPoly(n)}}$, where $0< c< 1$.

2. In our discussions, “deterministic” is always in the sense that we have fixed the initial values of all the parameters of the non-self-adaptive EDA.

3. The first inequality can be found in [38, Corollary 1.1], or a similar form can be found in [21], and the second inequality is in [38, (3.3)].

4. Given the values of the population sizes and the constant $\delta$, the value of $\bar {\tau }$ is then determined by the problem size $n$. Thus, $\bar {\tau }$ is not a random variable.

5. The notation “$[\quad ]$” can be interpreted as follows: given $a>1$, $[a]=1$; given $a\in (0,1)$, $[a]=a$. For the sake of brevity, we will omit this notation but implicitly restrict the value of a probability not to exceed 1 in the following parts of the paper.

6. When there is no selection pressure, the proportion of alleles in a population with finite genes will fluctuate due to the errors brought by random sampling. For more details, one can refer to [6], [41].

7. For the sake of brevity, we assume that $\log ^{2} n$ is an integer and thus omit the notation “$\lceil \quad \rceil$.”

8. For the sake of brevity, we write the results of different stages together. It is noteworthy that the proof here contains no loop, since we can prove the result for different values of $i$ ($i=1,\ldots,n-1$ is the index of bits) one after another as we have done in Theorem 2. Similar to the case of Theorem 2, since $\forall i=1,\ldots,n-1$, ${\mathhat {t}}_{i}- {\mathhat {t}}_{i-1}=\Theta (1)$, the sum of at most $i$ such items [see (40)] is always $O(n)$, and the impact of genetic drift can be estimated as we have done in Theorem 2 for the $(i+1)$th bit: at least a level of $1/e$ can be maintained with an overwhelming probability.
