SECTION I

## INTRODUCTION

CO-EVOLUTIONARY learning refers to a broad class of population-based, stochastic search algorithms that involve the simultaneous evolution of competing solutions with coupled fitness [1]. The co-evolutionary search process is characterized by the adaptation of solutions in some form of representation involving repeated applications of variation and selection [2]. Co-evolutionary learning offers an attractive alternative approach for problem solving in cases where obtaining an absolute quality measurement to guide the search for solutions is difficult or not possible. One such problem is game-playing [3], [4], [5], [6], [7], [8]. Unlike classical machine learning, which requires an absolute quality measurement, the search process in co-evolutionary learning can be guided by strategic interactions between competing solutions (learners). Early studies [9], [10], [11] have further argued that the co-evolutionary search may benefit from these strategic interactions from one generation to the next, resulting in an arms race of increasingly innovative solutions.

Generalization is one of the main research issues in co-evolutionary learning. Recently, we have formulated a theoretical framework for generalization in co-evolutionary learning [12]. Other past studies such as [13] have investigated an approach to analyze performance in co-evolving populations through non-local adaptation. A general framework for statistical comparison of performance of evolutionary algorithms has been recently formulated [14]. In line with these studies, the generalization framework offers a rigorous approach to performance analysis of co-evolutionary learning, whether for individual co-evolved solutions, or for the population of co-evolved solutions in any generation.

We have demonstrated the generalization framework in the context of game-playing. Generalization performance of a strategy (solution) is estimated using a collection of random test strategies (test cases) by taking the average game outcomes, with confidence bounds provided by Chebyshev's theorem [15]. Chebyshev's bounds have the advantage that they hold for any distribution of game outcomes. However, such a distribution-free framework leads to unnecessarily loose confidence bounds. In this paper, we have taken advantage of the near-Gaussian nature of average game outcomes through the central limit theorem [16] and provided tighter bounds based on parametric testing. Furthermore, we can strictly control the condition (i.e., sample size under a given precision) under which the distribution of average game outcomes (generalization performance estimates) converges to a Gaussian through the Berry-Esseen theorem [17].

These improvements to the generalization framework now provide the means with which we can develop a general and principled approach to improve generalization performance in co-evolutionary learning that can be implemented efficiently. Ideally, if we could compute the true generalization performance of any co-evolving solution and directly use it as the fitness measure, co-evolutionary learning would lead to the search for solutions with higher generalization performance. However, direct estimation of the true generalization performance using the distribution-free framework can be computationally expensive [12]. Our new theoretical contributions that exploit the near-Gaussian nature of generalization estimates allow us: 1) to determine in a principled manner the required number of test cases for robust estimation (given a controlled level of precision) of generalization performance, and 2) to subsequently use small samples of random test cases, sufficient for robust estimation, to compute generalization estimates of solutions directly as the fitness measure to guide and improve co-evolutionary learning.

Early studies [18], [19], [20] have shown that the classical co-evolutionary learning approach that uses relative fitness (i.e., fitness evaluation that depends on other competing solutions in the population) does not necessarily lead to solutions with increasingly higher generalization performance. Others have investigated various approaches to improve performance in co-evolutionary learning, e.g., by exploiting diversity in the population [21], [22], [23], using other notions of fitness such as Pareto dominance [24], and using archives of test cases [25], [26], [27]. The study in [28] investigated how performance can be improved in a cooperative co-evolutionary learning framework (where a population member represents only part of a complete solution), in contrast to most other studies, which consider the competitive co-evolutionary learning framework (where a population member represents a complete solution).

Unlike these past studies, we demonstrate a principled approach to improving generalization performance that can be implemented as an efficient algorithm (e.g., using small samples of test cases) and verified in an equally principled manner. Our approach directly uses generalization estimates as the fitness measure in a competitive co-evolutionary learning setting. A series of empirical studies involving the iterated prisoner's dilemma (IPD) and the more complex Othello game demonstrates how the new approach improves on the classical one: evolved strategies with increasingly higher generalization performance are obtained using relatively small samples of test strategies, without the large performance fluctuations typical of the classical approach. The new approach also leads to a faster co-evolutionary search in which we can strictly control the condition (sample sizes) under which the speedup is achieved, without weakening the precision of the estimates. It is also faster than the distribution-free framework (requiring an order of magnitude fewer test strategies) in achieving similarly high generalization performance.

More importantly, our approach does not depend on the complexity of the game. For some games that are more complex (under some measures of game complexity), more test strategies may be required to estimate the generalization performance of a strategy for a given level of precision. However, this will come out automatically and in a principled manner from our analysis. The necessary sample size for robust estimations can then be set and subsequently, generalization estimates can be computed and directly used as the fitness measure to guide co-evolutionary search of strategies with higher generalization performance.

We note that this paper is a first step toward understanding and developing theoretically motivated frameworks of co-evolutionary learning that can lead to improvements in the generalization performance of solutions. Although our generalization framework makes no assumption on the underlying distribution of test cases, we demonstrate one application where the underlying distribution in the generalization measure is fixed and known a priori. Generalization estimates are directly used as the fitness measure to improve generalization performance in co-evolutionary learning. This has the effect of reformulating co-evolutionary learning as an evolutionary learning approach, but with the advantage of a principled and efficient methodology that has the potential to outperform the classical co-evolutionary approach on difficult learning problems such as games. Further studies may involve extending the generalization framework to formulate co-evolutionary learning systems where the population acting as test samples can adapt to approximate a particular distribution that solutions should generalize to.

The rest of this paper is organized as follows. Section II presents the theoretical framework for statistical estimation of generalization performance, and improvements made to provide tighter bounds through the central limit and Berry-Esseen theorems. We mention two kinds of parametric testing: 1) making statistical claims on the hypothesized performance of a strategy, and 2) comparing performance differences of a pair of strategies. Section III demonstrates how one can find out and set the required number of test strategies for robust estimation of generalization performance in a principled manner, using the IPD game for illustration. It is shown that a small number of test strategies is sufficient to estimate generalization performance with good accuracy. Section IV demonstrates how generalization estimates can be used directly as the fitness measure to improve co-evolutionary learning. We first illustrate the new co-evolutionary approach using the IPD game and later consider the more complex Othello game. Finally, Section V concludes the paper with some remarks for future studies.

SECTION II

## STATISTICAL ESTIMATION OF GENERALIZATION PERFORMANCE IN CO-EVOLUTIONARY LEARNING

### A. Games

In co-evolutionary learning, the quality of a solution is determined relative to other competing solutions in the population through interactions. This can be framed in the context of game-playing, i.e., an interaction is a game played between two strategies (solutions) [12]. We assume that there is a potentially vast but finite set of possible strategies that can be involved in playing the game. At each time step, a strategy can select a move from a finite set of possible moves to play the game. Endowing strategies with memory of their own and opponents' moves results in an exponential explosion in the number of such strategies.

Consider a game and a set ${\cal S}$ of $M$ distinct strategies, ${\cal S}=\{1,2,\ldots,M\}$. Denote the game outcome of strategy $i$ playing against the opponent strategy $j$ by $G_{i}(j)$. Different definitions of $G_{i}(j)$ (different notions of game outcomes) for the generalization performance indicate different measures of quality [12]. The win-lose function for $G_{i}(j)$ is given by $$G_{\rm W}(i,j)=\cases{C_{\rm WIN}, &if g(i,j)>g(j,i)\cr C_{\rm LOSE},&{\rm otherwise}}\eqno{\hbox{(1)}}$$ where $g(i,j)$ and $g(j,i)$ are payoffs for strategies $i$ and $j$ at the end of the game, respectively, and $C_{\rm WIN}$ and $C_{\rm LOSE}$ are arbitrary constants with $C_{\rm WIN}>C_{\rm LOSE}$. We use $C_{\rm WIN}=1$ and $C_{\rm LOSE}=0$ and arbitrarily choose a stricter form of $G_{\rm W}(i,j)$ (a loss is awarded to both sides in the case of a tie). This choice of $G_{\rm W}(i,j)$ simplifies the analysis in the examples we present later. For the IPD game, the "all defect" strategy that plays full defection is known to be the only one with the maximum generalization performance when $G_{\rm W}(i,j)$ is used, irrespective of how test strategies are distributed in ${\cal S}$. This is not necessarily true for other definitions, such as the average-payoff function.

### B. Estimating Generalization Performance

A priori, some strategies may be favored over others, or all strategies may be considered with equal probability. Let the selection of individual test strategies from ${\cal S}$ be represented by a random variable $J$ taking on values $j\in{\cal S}$ with probability $P_{\cal S}(j)$. The true generalization performance $G_{i}$ of a strategy $i$ is defined as the mean performance (game outcome) against all possible test strategies $j$ $$G_{i}=\sum_{j=1}^{M}P_{\cal S}(j) G_{i}(j).\eqno{\hbox{(2)}}$$

In other words, $G_{i}$ is the mean of the random variable $G_{i}(J)$ $$G_{i}=E_{P_{\cal S}}[G_{i}(J)].$$

In particular, when all strategies are equally likely to be selected as test strategies, i.e., when $P_{\cal S}$ is uniform, we have $$G_{i}={{1}\over{M}}\sum_{j=1}^{M}G_{i}(j).\eqno{\hbox{(3)}}$$

The size $M$ of the strategy space ${\cal S}$ can be very large, making direct estimation of $G_{i}$, $i\in{\cal S}$, through (2) infeasible. In practice, one can estimate $G_{i}$ through a random sample $S_{N}$ of $N$ test strategies drawn i.i.d. from ${\cal S}$ with probability $P_{\cal S}$. The estimated generalization performance of strategy $i$ is denoted by ${\mathhat{G_{i}}}(S_{N})$ and given as follows: $${\mathhat{G_{i}}}(S_{N})={{1}\over{N}}\sum_{j\in S_{N}}G_{i}(j).\eqno{\hbox{(4)}}$$

If the game outcome $G_{i}(J)$ varies within a finite interval $[G_{\rm MIN},G_{\rm MAX}]$ of size $R$, the variance of $G_{i}(J)$ is upper-bounded by $\sigma^{2}_{\rm MAX}=(G_{\rm MAX}-G_{\rm MIN})^{2}/4=R^{2}/4$. Using Chebyshev's theorem [15], we obtain $$P(\vert{\mathhat{G_{i}}}-G_{i}\vert\geq\epsilon)\leq{{R^{2}}\over{4 N\cdot{\epsilon}^{2}}}\eqno{\hbox{(5)}}$$ for any $\epsilon>0$. Note that Chebyshev's bounds (5) are distribution-free, i.e., no particular form of distribution of $G_{i}(J)$ is assumed. One can make statistical claims of how confident one is on the accuracy of an estimate given a random test sample of a known size $N$ using Chebyshev's bounds [12].
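The sampling estimator (4) and the Chebyshev bound (5) can be sketched as follows. This is a minimal illustration, not code from the paper; the function names and the uniform sampling pool are our own assumptions.

```python
import random

def estimate_generalization(game_outcome, strategy_pool, n, rng=random):
    """Estimate G_i as in (4): average outcome against n test strategies
    drawn i.i.d. (here uniformly) from a pool of strategies."""
    sample = [rng.choice(strategy_pool) for _ in range(n)]
    return sum(game_outcome(j) for j in sample) / n

def chebyshev_bound(outcome_range, n, epsilon):
    """Distribution-free Chebyshev bound (5) on P(|Ghat_i - G_i| >= epsilon),
    where outcome_range is R, the size of the game-outcome interval."""
    return outcome_range ** 2 / (4.0 * n * epsilon ** 2)

# Win-lose outcomes lie in [0, 1] (R = 1); with N = 2000 test strategies,
# the probability of an estimation error of 0.05 or more is bounded by
# 1 / (4 * 2000 * 0.05**2) = 0.05.
```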

### C. Error Estimations for Gaussian-Distributed Generalization Estimates

Selection of the sample $S_{N}$ of test strategies can be formalized through a random variable ${\cal S}_{N}$ on ${\cal S}^{N}$ endowed with the product measure induced by $P_{\cal S}$. Estimates of the generalization performance of strategy $i$ can be viewed as realizations of the random variable ${\mathhat{G_{i}}}({\cal S}_{N})$. Since game outcomes $G_{i}(J)$ have finite mean and variance, by the central limit theorem, for large enough $N$, ${\mathhat{G_{i}}}({\cal S}_{N})$ is approximately Gaussian-distributed. Claims regarding the "speed of convergence" of the (cumulative) distribution of ${\mathhat{G_{i}}}({\cal S}_{N})$ to the (cumulative) distribution of a Gaussian can be made quantitative using the Berry-Esseen theorem [17].

First, normalize $G_{i}(J)$ to zero mean $$X_{i}(J)={G_{i}}(J)-G_{i}.\eqno{\hbox{(6)}}$$

Denote the variance of $G_{i}(J)$ [and hence the variance of $X_{i}(J)$] by $\sigma_{i}^{2}$. Since $G_{i}(J)$ can take on values in a finite domain, the third absolute moment $$\rho_{i}=E_{P_{\cal S}}[\vert X_{i}(J)\vert^{3}]$$ of $X_{i}$ is finite.

Second, normalize ${\mathhat{G_{i}}}({\cal S}_{N})$ to zero mean $$Y_{i}({\cal S}_{N})={\mathhat{G_{i}}}({\cal S}_{N})-G_{i}={{1}\over{N}}\sum_{j\in{\cal S}_{N}}X_{i}(j).\eqno{\hbox{(7)}}$$

Third, normalize ${\mathhat{G_{i}}}({\cal S}_{N})$ to unit standard deviation $$Z_{i}({\cal S}_{N})={{Y_{i}({\cal S}_{N})}\over{{\sigma_{i}}\over{\sqrt{N}}}}.\eqno{\hbox{(8)}}$$

The Berry-Esseen theorem states that the cumulative distribution function (CDF) $F_{i}$ of $Z_{i}({\cal S}_{N})$ converges (pointwise) to the CDF $\Phi$ of the standard normal distribution $N(0,1)$. For any $x\in{\mathbb R}$ $$\left\vert F_{i}(x)-\Phi (x)\right\vert\leq{{0.7975}\over{\sqrt{N}}}{{\rho_{i}}\over{\sigma_{i}^{3}}}.\eqno{\hbox{(9)}}$$

It is noted that only information on $\sigma_{i}$ and $\rho_{i}$ is required to make an estimate on the pointwise difference between CDFs of $Z_{i}({\cal S}_{N})$ and $N(0,1)$. In practice, since the (theoretical) moments $\sigma_{i}$ and $\rho_{i}$ are unknown, we use their empirical estimates. To ensure that the CDFs of $Z_{i}({\cal S}_{N})$ and $N(0,1)$ do not differ pointwise by more than $\epsilon>0$, we need at least $$N_{\rm CDF}(\epsilon)={{0.7975^{2}}\over{\epsilon^{2}}}{{\rho_{i}^{2}}\over{(\sigma_{i}^{2})^{3}}}\eqno{\hbox{(10)}}$$ test strategies.
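The sample-size requirement (10) is a one-line computation once $\sigma_{i}^{2}$ and $\rho_{i}$ (or their empirical estimates) are available. A minimal sketch, with an illustrative function name of our own choosing:

```python
import math

def berry_esseen_n(epsilon, rho, sigma_sq):
    """Minimum sample size N_CDF(epsilon) from (10) so that the CDF of the
    normalized estimate differs pointwise from the standard normal CDF
    by at most epsilon. rho is the third absolute moment, sigma_sq the variance."""
    return math.ceil((0.7975 ** 2 / epsilon ** 2) * (rho ** 2 / sigma_sq ** 3))

# Example: for win-lose outcomes with G_i = 0.5, the centered outcome X_i(J)
# is +/- 0.5, so sigma_i^2 = 0.25 and rho_i = 0.125, giving rho_i / sigma_i^3 = 1;
# a pointwise CDF error of at most 0.1 then needs ceil(0.7975^2 / 0.01) samples.
```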

Let us now assume that the generalization estimates are Gaussian-distributed (e.g., using the analysis above, we gather enough test points to make the means almost Gaussian-distributed). Denote by $z_{\alpha/2}$ the upper $\alpha/2$ point of $N(0,1)$, i.e., the area under the standard normal density for $(z_{\alpha/2},\infty)$ is $\alpha/2$, and for $[-z_{\alpha/2},z_{\alpha/2}]$ it is $(1-\alpha)$. For large strategy samples $S_{N}$, the estimated generalization performance ${\mathhat{G_{i}}}(S_{N})$ of strategy $i\in{\cal S}$ has standard error of $\sigma_{i}/\sqrt{N}$. Since $\sigma_{i}$ is generally unknown, the standard error can be estimated as $${{{\mathhat\sigma}_{i}(S_{N})}\over{\sqrt{N}}}=\sqrt{{\sum\nolimits_{j\in S_{N}}(G_{i}(j)-{\mathhat{G_{i}}}(S_{N}))^{2}}\over{N(N-1)}}\eqno{\hbox{(11)}}$$ and the $100(1-\alpha)\%$ error margin of ${\mathhat{G_{i}}}(S_{N})$ is $z_{\alpha/2}\sigma_{i}/\sqrt{N}$, or, if $\sigma_{i}$ is unknown $$\Upsilon_{i}(\alpha,N)=z_{\alpha/2}\sqrt{{\sum\nolimits_{j\in S_{N}}(G_{i}(j)-{\mathhat{G_{i}}}(S_{N}))^{2}}\over{N(N-1)}}.\eqno{\hbox{(12)}}$$

Requiring that the error margin be at most $\delta>0$ leads to samples of at least $$N_{\rm em}(\delta)={{z_{\alpha/2}^{2} \sigma_{i}^{2}}\over{\delta^{2}}}\eqno{\hbox{(13)}}$$ test strategies.

In other words, to be $100(1-\alpha)\%$ sure that the estimation error $\vert{\mathhat{G_{i}}}(S_{N})-G_{i}\vert$ will not exceed $\delta$, we need $N_{\rm em}(\delta)$ test strategies. Stated in terms of a confidence interval, a $100(1-\alpha)\%$ confidence interval for the true generalization performance of strategy $i$ is $$({\mathhat{G_{i}}}(S_{N})-\Upsilon_{i}(\alpha,N),{\mathhat{G_{i}}}(S_{N})+\Upsilon_{i}(\alpha,N)).\eqno{\hbox{(14)}}$$
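The error margin (12) and the required sample size (13) can be sketched as follows. This is an illustrative implementation under our own naming; the worst-case $\sigma_i = R/2$ used in the example follows from the variance bound stated before (5).

```python
import math

def error_margin(outcomes, z_half_alpha):
    """100(1-alpha)% error margin (12) of the generalization estimate,
    using the estimated standard error (11). outcomes are the G_i(j)
    over the test sample; z_half_alpha is the upper alpha/2 normal point."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    se = math.sqrt(sum((g - mean) ** 2 for g in outcomes) / (n * (n - 1)))
    return z_half_alpha * se

def required_sample_size(z_half_alpha, sigma, delta):
    """Minimum N_em(delta) from (13) so the error margin is at most delta."""
    return math.ceil((z_half_alpha * sigma / delta) ** 2)

# For 95% confidence (z = 1.96) and the worst-case sigma = R/2 = 0.5 of
# win-lose outcomes, an error margin of 0.05 needs at most 385 test strategies.
```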

### D. Statistical Testing for Comparison of Strategies

One can also make statistical claims regarding hypothesized performance of the studied strategies. For example, one may be only interested in strategies with true generalization performance greater than some threshold ${\mathtilde G}$. In this case, we can test whether $i$ is a "bad" strategy by testing for $G_{i}<{\mathtilde G}$. The hypothesis $H_{1}$ that $G_{i}<{\mathtilde G}$ is substantiated at significance level of $\alpha\%$ (against the null hypothesis $H_{0}$ that $G_{i}={\mathtilde G}$) if the test statistic $$Z^{\prime}_{i}(S_{N},{\mathtilde G})={{{\mathhat{G_{i}}}(S_{N})-{\mathtilde G}}\over{\sqrt{{\sum\nolimits_{j\in S_{N}}(G_{i}(j)-{\mathhat{G_{i}}}(S_{N}))^{2}}\over{N(N-1)}}}}\eqno{\hbox{(15)}}$$ falls below $-z_{\alpha}$, i.e., if $Z^{\prime}_{i}(S_{N},{\mathtilde G})\leq-z_{\alpha}$. Alternatively, the hypothesis that strategy $i$ is an acceptable strategy, i.e., $G_{i}\,{>}\,{\mathtilde G}$, is accepted (against $H_{0}$) at significance level of $\alpha\%$, if $Z^{\prime}_{i}(S_{N},{\mathtilde G})\geq z_{\alpha}$. We can also simply test for $G_{i}\ne{\mathtilde G}$, in which case we require $\vert Z^{\prime}_{i}(S_{N},{\mathtilde G})\vert\geq z_{\alpha/2}$.

Crucially, we can compare two strategies $i$, $j\in{\cal S}$ for their relative performance. This can be important in the evolutionary or co-evolutionary learning setting when constructing a new generation of strategies. Assume that both strategies $i$ and $j$ play against the same set of $N$ test strategies $S_{N}=\{t_{1}, t_{2},\ldots, t_{N}\}$. Statistical tests regarding the relation between the true generalization performances of $i$ and $j$ can be made using paired tests. One computes a series of performance differences on $S_{N}$ $$D(n)=G_{i}(t_{n})-G_{j}(t_{n}),\quad n=1,2,\ldots,N.$$

The performance differences are then analyzed as a single sample. At significance level of $\alpha\%$, strategy $i$ appears to be better than strategy $j$ by more than a margin ${\mathtilde D}$ (against the null hypothesis that $i$ beats $j$ exactly by the margin ${\mathtilde D}$), if $Z^{\prime\prime}_{i}(S_{N},{\mathtilde D})\geq z_{\alpha}$, where $$Z^{\prime\prime}_{i}(S_{N},{\mathtilde D})={{{\mathhat{D}}(S_{N})-{\mathtilde D}}\over{\sqrt{{\sum\nolimits_{n=1}^{N}(D(n)-{\mathhat{D}}(S_{N}))^{2}}\over{N(N-1)}}}}\eqno{\hbox{(16)}}$$ and $${\mathhat{D}}(S_{N})={{1}\over{N}}\sum_{n=1}^{N}D(n).\eqno{\hbox{(17)}}$$

For simply testing whether strategy $i$ outperforms strategy $j$ we set the margin to 0, i.e., ${\mathtilde D}=0$. Analogously, strategy $i$ appears to be worse than strategy $j$ at significance level of $\alpha\%$, provided $Z^{\prime\prime}_{i}(S_{N},0)\leq-z_{\alpha}$.

Finally, strategies $i$ and $j$ appear to be different at significance level of $\alpha\%$, if $\vert Z^{\prime\prime}_{i}(S_{N},0)\vert\geq z_{\alpha/2}$. We stress that the comparison of strategies $i$, $j\in{\cal S}$ is done through a set of test strategies in $S_{N}$ and not through a game of strategy $i$ playing against strategy $j$. Although one may want to compare one strategy with another directly by having them compete against each other, specific properties of games such as intransitivity may lead to misleading results.
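The paired statistic (16)-(17) can be sketched as below. The function name and the form of the inputs (two lists of game outcomes against the same test strategies) are illustrative assumptions.

```python
import math

def paired_z_statistic(outcomes_i, outcomes_j, margin=0.0):
    """Paired Z-statistic (16): compares strategies i and j through their
    game outcomes against the same test sample S_N, with difference margin D~."""
    n = len(outcomes_i)
    diffs = [gi - gj for gi, gj in zip(outcomes_i, outcomes_j)]  # D(n)
    d_bar = sum(diffs) / n                                       # Dhat(S_N), (17)
    se = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n * (n - 1)))
    return (d_bar - margin) / se

# With margin = 0, a value >= z_alpha suggests i outperforms j at level alpha;
# a value <= -z_alpha suggests the opposite.
```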

For small samples $S_{N}$ of test strategies, we would need to use the $t$-statistic instead of the normally distributed $Z$-statistic employed here. The $t$-statistic follows Student's $t$-distribution with $N-1$ degrees of freedom. However, for the sample sizes $N\geq 50$ used in this paper, Student's $t$-distribution can be conveniently replaced by the standard normal distribution $N(0,1)$.

### E. Properties of Gaussian-Distributed Generalization Estimates

It is common to use, instead of the true standard deviation $\sigma_{i}$ of game outcomes for strategy $i$, its sample estimate [see (11)] $${\mathhat\sigma}_{i}(S_{N})=\sqrt{{\sum\nolimits_{j\in S_{N}}(G_{i}(j)-{\mathhat{G_{i}}}(S_{N}))^{2}}\over{N-1}}.\eqno{\hbox{(18)}}$$

If we generate $n$ i.i.d. test strategy samples $S_{N}^{r}$, $r=1,2,\ldots,n$, each of size $N$, then the generalization performance estimates $${\mathhat{G_{i}}}(S^{r}_{N})={{1}\over{N}}\sum_{j\in S^{r}_{N}}G_{i}(j)$$ are close to being Gaussian-distributed with mean $G_{i}$ and standard deviation $\sigma_{i}/\sqrt{N}$ (for large enough $N$). Such generalization estimates can be used to estimate the confidence interval for $\sigma_{i}$ as follows.

The sample variance of the estimates ${\mathhat{G_{i}}}(S^{r}_{N})$ is $$V^{2}_{n}={{\sum\nolimits_{r=1}^{n}({\mathhat{G_{i}}}(S^{r}_{N})-\Gamma_{i})^{2}}\over{n-1}}\eqno{\hbox{(19)}}$$ where $$\Gamma_{i}={{1}\over{n}}\sum_{r=1}^{n}{\mathhat{G_{i}}}(S^{r}_{N}).\eqno{\hbox{(20)}}$$

The normalized sample variance of a Gaussian-distributed ${\mathhat{G_{i}}}(S^{r}_{N})$ $$U^{2}_{n}={{(n-1) V^{2}_{n}}\over{{\sigma_{i}^{2}}\over{N}}}\eqno{\hbox{(21)}}$$ is known to be $\chi^{2}$-distributed with $n-1$ degrees of freedom.

The $100(1-\alpha)\%$ confidence interval for $\sigma_{i}/\sqrt{N}$ is $$\left(V_{n}\sqrt{{n-1}\over{\chi^{2}_{\alpha/2}}},V_{n}\sqrt{{n-1}\over{\chi^{2}_{1-\alpha/2}}}\right)\eqno{\hbox{(22)}}$$ where $\chi^{2}_{\beta}$ is the value such that the area to the right of $\chi^{2}_{\beta}$ under the $\chi^{2}$ distribution with $n-1$ degrees of freedom is $\beta$. It follows that the $100(1-\alpha)\%$ confidence interval for $\sigma_{i}$ is $$\left(V_{n}\sqrt{{N(n-1)}\over{\chi^{2}_{\alpha/2}}},V_{n}\sqrt{{N(n-1)}\over{\chi^{2}_{1-\alpha/2}}}\right)\eqno{\hbox{(23)}}$$ which can be rewritten as $$\left(\sqrt{{N\cdot\sum\nolimits_{r=1}^{n}({\mathhat{G_{i}}}(S^{r}_{N})-\Gamma_{i})^{2}}\over{\chi^{2}_{\alpha/2}}},\sqrt{{N\cdot\sum\nolimits_{r=1}^{n}({\mathhat{G_{i}}}(S^{r}_{N})-\Gamma_{i})^{2}}\over{\chi^{2}_{1-\alpha/2}}}\right).\eqno{\hbox{(24)}}$$
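The interval (24) can be sketched as below. The function name is our own; to stay dependency-free, the $\chi^{2}$ quantiles with $n-1$ degrees of freedom are passed in as arguments (in practice they could come from tables or from `scipy.stats.chi2.ppf`).

```python
import math

def sigma_confidence_interval(estimates, n_tests, chi2_upper, chi2_lower):
    """Confidence interval (24) for sigma_i, from n independent generalization
    estimates Ghat_i(S_N^r), each computed over N = n_tests test strategies.
    chi2_upper and chi2_lower are the chi^2_{alpha/2} and chi^2_{1-alpha/2}
    quantiles with n-1 degrees of freedom."""
    n = len(estimates)
    gamma = sum(estimates) / n                      # Gamma_i from (20)
    ss = sum((g - gamma) ** 2 for g in estimates)   # sum of squared deviations
    return (math.sqrt(n_tests * ss / chi2_upper),
            math.sqrt(n_tests * ss / chi2_lower))

# Example: n = 3 estimates over N = 100 test strategies each, 95% confidence,
# chi^2 quantiles with 2 degrees of freedom (approx. 7.378 and 0.0506).
```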

### F. Ramifications of Statistical Estimation of Generalization Performance in Co-Evolutionary Learning

This framework provides a computationally feasible approach to estimate generalization performance in co-evolutionary learning. A small sample of test strategies may be sufficient to estimate the generalization performance of strategies, even though the strategy space is huge. Furthermore, the framework has potential application in developing efficient algorithms to improve co-evolutionary search. Our theoretical framework allows us to develop a methodology to find the number of test strategies required for the robust estimation of generalization performance. Subsequently, generalization estimates obtained using a small sample of test strategies (compared to the case of direct estimation of the true generalization) can lead to the co-evolutionary search of strategies with increasingly higher generalization performance, since the selection of evolved strategies is based on their estimated generalization performance.

SECTION III

## EXAMPLES OF STATISTICAL ESTIMATION OF GENERALIZATION PERFORMANCE IN CO-EVOLUTIONARY LEARNING

We first illustrate several examples of statistical estimation of generalization performance in co-evolutionary learning based on our theoretical framework in Section II. We consider the three-choice IPD game with deterministic and reactive, memory-one strategies since we can compute the true generalization performance (for simplicity, we assume that test strategies are randomly sampled from ${\cal S}$ with a uniform distribution). We demonstrate how one can find and set the required number of random test strategies for robust estimation (given a controlled level of precision) of generalization performance for subsequent use of generalization estimates directly as the fitness measure in co-evolutionary learning. Our results show that a smaller number of test strategies than previously predicted in [12] is sufficient for robust estimation of generalization performance.

### A. Iterated Prisoner's Dilemma Game

In the classical, two-player IPD game, each player is given two choices to play, cooperate or defect [30]. The game is formulated with the predefined payoff matrix specifying the payoff a player receives given the joint move it made with the opponent. Both players receive $R$ (reward) units of payoff if both cooperate. They both receive $P$ (punishment) units of payoff if they both defect. However, when one player cooperates while the other defects, the cooperator receives $S$ (sucker) units of payoff while the defector receives $T$ (temptation) units of payoff. The values $R$, $S$, $T$, and $P$ must satisfy the constraints: $T>R>P>S$ and $R>(S+T)/2$. Any set of values can be used as long as they satisfy the IPD constraints (we use $T=5$, $R=4$, $P=1$, and $S=0$). The game is played when both players choose between the two alternative choices over a series of moves (repeated interactions).

The classical IPD game has been extended to more complex versions, e.g., the IPD with multiple, discrete levels of cooperation [31], [32], [33], [34], [35]. The $n$-choice IPD game can be formulated using payoffs obtained through the following linear interpolation: $$p_{\rm A}=2.5-0.5c_{\rm A}+2c_{\rm B},\qquad -1\leq c_{\rm A},c_{\rm B}\leq 1\eqno{\hbox{(25)}}$$ where $p_{\rm A}$ is the payoff to player A, given that $c_{\rm A}$ and $c_{\rm B}$ are the cooperation levels of the choices that players A and B make, respectively. The payoff matrix for the three-choice IPD game is given in Fig. 1 [12].

Fig. 1. Payoff matrix for the two-player three-choice IPD game [12]. Each element of the matrix gives the payoff for player A.

The payoff matrix for any $n$-choice IPD game must satisfy the following conditions [32]:

1. for $c_{\rm A}<c^{\prime}_{\rm A}$ and constant $c_{\rm B}: p_{\rm A}(c_{\rm A},c_{\rm B})>p_{\rm A}(c^{\prime}_{\rm A},c_{\rm B})$;
2. for $c_{\rm A}\leq c^{\prime}_{\rm A}$ and $c_{\rm B}<c^{\prime}_{\rm B}: p_{\rm A}(c_{\rm A},c_{\rm B})<p_{\rm A}(c^{\prime}_{\rm A},c^{\prime}_{\rm B})$;
3. for $c_{\rm A}<c^{\prime}_{\rm A}$ and $c_{\rm B}<c^{\prime}_{\rm B}: p_{\rm A}(c^{\prime}_{\rm A},c^{\prime}_{\rm B})>(1/2)(p_{\rm A}(c_{\rm A},c^{\prime}_{\rm B})+p_{\rm A}(c^{\prime}_{\rm A},c_{\rm B}))$.

These conditions are analogous to those for the classical IPD: 1) defection always pays more; 2) mutual cooperation has a higher payoff than mutual defection; and 3) alternating between cooperation and defection pays less on average than mutual cooperation.
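The interpolation (25) and the three conditions above can be checked mechanically; the sketch below does so for the three-choice levels $\{-1, 0, 1\}$ (the function name is illustrative).

```python
def ipd_payoff(c_a, c_b):
    """Interpolated payoff (25) to player A for cooperation levels in [-1, 1]."""
    return 2.5 - 0.5 * c_a + 2.0 * c_b

# Full defection/cooperation recovers the classical payoffs T=5, R=4, P=1, S=0:
assert ipd_payoff(-1, 1) == 5.0   # T: A defects, B cooperates
assert ipd_payoff(1, 1) == 4.0    # R: mutual cooperation
assert ipd_payoff(-1, -1) == 1.0  # P: mutual defection
assert ipd_payoff(1, -1) == 0.0   # S: A cooperates, B defects

# The three n-choice conditions on the three-choice levels {-1, 0, 1}:
levels = [-1.0, 0.0, 1.0]
# 1) lowering one's own cooperation level always pays more, for fixed c_B:
assert all(ipd_payoff(-1, cb) > ipd_payoff(0, cb) > ipd_payoff(1, cb)
           for cb in levels)
# 2) mutual cooperation pays more than mutual defection:
assert ipd_payoff(1, 1) > ipd_payoff(-1, -1)
# 3) mutual cooperation beats alternating cooperation and defection on average:
assert ipd_payoff(1, 1) > 0.5 * (ipd_payoff(-1, 1) + ipd_payoff(1, -1))
```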

### B. What is the Required Number of Test Strategies?

We would like to find out and set the required number $N$ of random test strategies drawn i.i.d. from ${\cal S}$ for robust estimation of generalization performance for a game. Instead of making some assumptions about the complexity of a game and the impact on the required number of random test strategies, we demonstrate a principled approach based on our theoretical framework in Section II. Our approach exploits the near-Gaussian nature of generalization estimates and finds out the rate at which the distribution of generalization estimates converges to a Gaussian as the number of random test strategies to compute generalization estimates grows.

We illustrate our approach for the three-choice IPD game. We first collect a sample of 50 base strategies $i$, which we obtain by randomly sampling from ${\cal S}$ with uniform distribution. We also collect 1000 independent samples $S_{N}$ to compute 1000 estimates ${\mathhat{G}}_{i}(S_{N})$. Each random sample $S_{N}$ consists of $N$ test strategies drawn i.i.d. from ${\cal S}$ with uniform distribution.

For each base strategy $i$, we directly estimate the true generalization performance $G_{i}$ from (2) and normalize $G_{i}(J)$ by taking $X_{i}(J)=G_{i}(J)-G_{i}$. We can then compute estimates of the variance and the third absolute moment of $X_{i}(J)$ with respect to $S_{N}$, i.e., for each strategy $i$, we have a 1000-sample estimate of ${\mathhat{\sigma}}_{i}^{2}$ and another 1000-sample estimate of ${\mathhat{\rho}}_{i}$.

From the Berry-Esseen theorem (10) [17], we can compute for each base strategy $i$ the deviation from the Gaussian $$\epsilon={{0.7975\cdot{\mathhat{\rho}}_{i}}\over{\sqrt{N}\cdot{\mathhat{\sigma}}_{i}^{3}}}\eqno{\hbox{(26)}}$$ given different sample sizes of $S_{N}$. By systematically computing the error $\epsilon$ for $S_{N}$ with $N=\{50, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10\,000, 50\,000\}$, we can observe how fast the distribution of generalization estimates is converging to a Gaussian.

Since we do not know the true value of $\epsilon$, we take a pessimistic estimate of $\epsilon$. Both 1000-sample estimates of ${\mathhat{\sigma}}_{i}^{2}$ and ${\mathhat{\rho}}_{i}$ are first rank-ordered in ascending order. A pessimistic estimate of $\epsilon$ then combines a small value (2.5%-tile) of ${\mathhat{\sigma}}_{i}^{2}$ with a large value (97.5%-tile) of ${\mathhat{\rho}}_{i}$.
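This pessimistic combination can be sketched as below. Assumptions of ours: the function name, and simple rank-based quantiles (take the value at the rank closest to the desired percentile) rather than an interpolated quantile estimator.

```python
import math

def pessimistic_epsilon(sigma_sq_samples, rho_samples, n_tests):
    """Pessimistic Berry-Esseen error (26): pair a low (2.5%-tile) estimate
    of sigma_i^2 with a high (97.5%-tile) estimate of rho_i, both taken from
    rank-ordered repeated-sample estimates, for sample size N = n_tests."""
    s = sorted(sigma_sq_samples)
    r = sorted(rho_samples)
    sigma_sq = s[int(0.025 * len(s))]   # small variance estimate
    rho = r[int(0.975 * len(r))]        # large third-absolute-moment estimate
    return 0.7975 * rho / (math.sqrt(n_tests) * sigma_sq ** 1.5)
```

For fixed moment estimates, $\epsilon$ decays as $N^{-1/2}$, which is the leveling-off behavior discussed below for Fig. 2.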

Although we can directly compute the quantile intervals for ${\mathhat{\sigma}}_{i}^{2}$ from the $\chi^{2}$-distribution, we have only loose bounds for ${\mathhat{\rho}}_{i}$ (based on the inequality from [29, p. 210]), which would result in unnecessarily larger values in our pessimistic estimate of $\epsilon$. Our comparison of quantiles from a 1000-sample estimate of ${\mathhat{\sigma}}_{i}^{2}$ between empirical estimates and estimates obtained directly from the $\chi^{2}$-distribution (24) indicates an absolute difference of around 0.03 when $N=50$ (the smallest sample size we consider), and smaller differences on average for larger values of $N$. Given the small absolute difference, and since we already compute pessimistic estimates of $\epsilon$, we use empirically obtained quantiles for ${\mathhat{\sigma}}_{i}^{2}$ and ${\mathhat{\rho}}_{i}$ in subsequent experiments.

Fig. 2 plots the results for the 50 strategies $i$, showing $\epsilon$ against $N$. Table I lists $\epsilon$ for different values of $N$ for ten strategies $i$. Naturally, increasing the sample size $N$ leads to decreasing values of $\epsilon$. However, there is a tradeoff between more robust estimation of generalization performance and increasing computational cost. Fig. 2 shows that $\epsilon$ decreases rapidly as $N$ increases from 50 to 1000, but starts to level off from around $N=1000$ onward. Table I suggests that at $N=2000$, $S_{N}$ would provide a sufficiently robust estimate of generalization performance for a reasonable computational cost, since one would need a five-fold increase of $N$ to $10\,000$ to reduce $\epsilon$ by half. Furthermore, since for non-pathological strategies ${\mathhat{\sigma}}_{i}^{2}$ and ${\mathhat{\rho}}_{i}$ in (26) are finite moments bounded away from 0, for larger $N$, $\epsilon$ is dominated by the term $N^{-1/2}$. In our experiments, this implies that the tradeoff between more robust estimation of generalization and computational cost is roughly the same for most of the strategies.

Fig. 2. Pessimistic estimate of $\epsilon$ as a function of sample size $N$ of test strategies for 50 random base strategies $i$.
TABLE I PESSIMISTIC ESTIMATES OF $\epsilon$ FROM (26) FOR TEN STRATEGIES $i$

Leaving the previous analysis aside for a moment and assuming that estimates $\hat{G}_{i}({\cal S}_{N})$ for a base strategy $i$ are Gaussian-distributed, from (13), we obtain the error $\delta$ $$\delta=\frac{z_{\alpha/2}\,\hat{\sigma}_{i}}{\sqrt{N}}.\eqno{(27)}$$

We compute the pessimistic estimate of $\delta$ by taking the 97.5th percentile of $\hat{\sigma}_{i}^{2}$ from the rank-ordered 1000-sample estimates. Our results for $\delta$ also indicate a tradeoff between more robust estimation and increasing computational cost, and suggest that $S_{N}$ at $N=2000$ would provide a sufficiently robust estimate of generalization performance at a reasonable computational expense. The results in Table II show that the absolute difference between $\epsilon$ and $\delta$ becomes smaller for larger sample sizes (e.g., at $N>1000$ the absolute difference is less than 0.01).
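A minimal sketch of the pessimistic $\delta$ computation from (27); the variance values and the percentile indexing below are illustrative stand-ins for the rank-ordered 1000-sample estimates.

```python
import math

def gaussian_error_bound(sigma_hat, n, z=1.96):
    """Error delta of (27): delta = z_{alpha/2} * sigma_hat / sqrt(N).
    z = 1.96 corresponds to the usual alpha = 0.05."""
    return z * sigma_hat / math.sqrt(n)

# Pessimistic delta: take the 97.5th percentile of rank-ordered
# variance estimates (tiny stand-in list shown here).
variance_estimates = sorted([0.21, 0.24, 0.26, 0.30])
sigma2_pess = variance_estimates[int(0.975 * (len(variance_estimates) - 1))]
delta = gaussian_error_bound(math.sqrt(sigma2_pess), n=2000)
```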

TABLE II STATISTICS OF $\{\vert\epsilon-\delta\vert\}_{i}$ FOR 50 STRATEGIES $i$

We also illustrate how two strategies can be compared with respect to their generalization performances through a test of statistical significance based on the normally distributed $Z$-statistic. For example, Table III shows numerical results for the $p$-values obtained from $Z$-tests directly using (16) to determine whether one strategy outperforms another with respect to a sample of random test strategies of size $N$. In this case, since the two strategies differ substantially in performance, a small sample of test strategies suffices to establish statistical significance (at around $N=400$, the $p$-values fall below the significance level of 0.05). Our experiments with other pairs of strategies with smaller performance differences indicate the need for larger samples of test strategies to establish statistical significance.
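The comparison above can be sketched as a one-sided $Z$-test. We use a paired form over a shared test sample, which is our assumption about the shape of (16), and the outcome distributions below are synthetic.

```python
import math
import random
import statistics

def z_test_outperforms(outcomes_i, outcomes_j):
    """One-sided Z-test of whether strategy i outperforms strategy j,
    given both strategies' game outcomes against the same N test
    strategies (a paired formulation; (16) in the paper may differ)."""
    diffs = [a - b for a, b in zip(outcomes_i, outcomes_j)]
    n = len(diffs)
    z = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value
    return z, p

random.seed(0)
n = 400  # sample size at which the paper reports p < 0.05
gi = [random.gauss(0.70, 0.1) for _ in range(n)]  # strategy i: genuinely stronger
gj = [random.gauss(0.50, 0.1) for _ in range(n)]
z, p = z_test_outperforms(gi, gj)
```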

TABLE III COMPUTED $p$-VALUES OF $Z$-TESTS TO DETERMINE WHETHER A STRATEGY $i$ OUTPERFORMS A STRATEGY $j$

We have illustrated examples of statistical estimation of generalization performance in co-evolutionary learning. Our studies indicate that the number of test strategies required for robust estimation (at a controlled level of precision) of generalization performance is smaller than predicted earlier using the distribution-free (Chebyshev's) framework [12]. This has an obvious impact on the use of generalization estimates as a fitness measure in co-evolutionary learning, since estimations have to be repeated throughout the evolutionary process. Although we use the IPD game as an example, the theoretical framework presented in Section II can be applied to other, more complex problems or scenarios. The information needed to find and set the required number of test strategies for robust estimation involves only the second-order (variance) and third-order moments, which can themselves be estimated. In Section IV, we illustrate how the framework can be applied to the co-evolutionary learning of the more complex game of Othello.

SECTION IV

## USING THE NOTION OF STATISTICAL ESTIMATION OF GENERALIZATION PERFORMANCE AS FITNESS MEASURE IN CO-EVOLUTIONARY LEARNING

We will investigate the notion of directly using generalization estimates as a form of fitness measure in co-evolutionary learning. Ideally, we would like to estimate the true generalization performance of each evolved strategy directly. In that case, co-evolutionary learning would lead to the search for strategies with increasingly higher generalization performance, since the selection of evolved strategies is based on their generalization performances. However, such direct estimations can be computationally expensive. Instead, we investigate the use of relatively small samples of test strategies to guide and improve the co-evolutionary search, following our earlier studies on the number of test strategies required for robust estimation. We first study this new approach of directly using generalization estimates as the fitness measure for the co-evolutionary learning of the IPD game before applying it to the more complex game of Othello.

### A. Co-Evolutionary Learning of IPD

#### 1) Strategy Representation

Various strategy representations for the co-evolutionary learning of IPD have been studied in the past, e.g., the look-up table with bit-string encoding [3], finite state machines [37], [38], and neural networks [32], [35], [39], [40]. The study in [41] has further investigated other forms of representation, such as cellular representations for finite state machines and Markov chains, among others, and their impact on the evolution of cooperation. We use the direct look-up table strategy representation [32], which directly represents IPD strategy behaviors through a one-to-one mapping between the genotype space (strategy representation) and the phenotype space (behaviors). The main advantage of this representation is that the search space given by the strategy representation and the strategy space are the same (assuming a uniform strategy distribution in ${\cal S}$), which simplifies and allows direct investigation of the co-evolutionary search for strategies with higher generalization performance [12].

For a deterministic and reactive, memory-one $n$-choice IPD strategy, the direct look-up table representation takes the form of table elements $m_{ij}$, $i,j=1,2,\ldots,n$, that specify the choice to be made given the inputs $i$ (player's own previous choice) and $j$ (opponent's previous choice). The first move $m_{\rm fm}$ is specified independently rather than using pre-game inputs (two for memory-one strategies). $m_{ij}$ and $m_{\rm fm}$ can take any of the $n$ values (choices) used to produce the payoffs in the payoff matrix through linear interpolation. Fig. 3 illustrates the direct look-up table representation for the three-choice IPD strategy [32], where each table element can take $+1$, 0, or $-1$.
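A minimal Python sketch of the direct look-up table representation described above (the class and method names are ours, not from the paper):

```python
import random

CHOICES = (+1, 0, -1)  # the three choices of the three-choice IPD

class LookupStrategy:
    """Deterministic, reactive, memory-one n-choice IPD strategy as a
    direct look-up table: one entry m_ij per pair (own previous choice,
    opponent's previous choice), plus an independent first move m_fm."""

    def __init__(self, table, first_move):
        self.table = table          # dict: (own_prev, opp_prev) -> choice
        self.first_move = first_move

    def move(self, own_prev=None, opp_prev=None):
        if own_prev is None:        # first iteration: no previous choices
            return self.first_move
        return self.table[(own_prev, opp_prev)]

    @classmethod
    def random_strategy(cls, choices=CHOICES):
        table = {(a, b): random.choice(choices)
                 for a in choices for b in choices}
        return cls(table, random.choice(choices))
```

The one-to-one genotype-phenotype mapping is visible here: the genotype is exactly the $n^{2}+1$ table entries, each of which is a behavior.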

Fig. 3. Direct look-up table representation for the deterministic and reactive memory-one IPD strategy that considers three choices (also includes $m_{\rm fm}$ for the first move, which is not shown in the figure).

Mutation is used to generate an offspring from a parent strategy when using the direct look-up table representation [32]. Mutation replaces the original choice of an element in the direct look-up table with one of the remaining $n-1$ possible choices, each with equal probability $1/(n-1)$. Each element ($m_{ij}$ and $m_{\rm fm}$) has a fixed probability $p_{\rm m}$ of being replaced. This mutation provides sufficient variation in strategy behaviors directly through the direct look-up table representation (even for more complex IPD games with more choices) [32].
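A sketch of the mutation operator just described, continuing the look-up table encoding as a dict from choice pairs to choices (function names are ours):

```python
import random

def mutate_table(table, first_move, choices=(+1, 0, -1), p_m=0.05):
    """Each element (including the first move m_fm) is replaced with
    probability p_m by one of the remaining n-1 choices, each picked
    with probability 1/(n-1)."""
    def maybe_flip(choice):
        if random.random() < p_m:
            return random.choice([c for c in choices if c != choice])
        return choice

    child = {key: maybe_flip(choice) for key, choice in table.items()}
    return child, maybe_flip(first_move)
```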

#### 2) Co-Evolutionary Learning Procedure

The following describes the classical co-evolutionary learning procedure [12], [32].

1. Generation step, $t=1$. Initialize $\vert{\rm POP}\vert/2$ parent strategies $i=1,2,\ldots,\vert{\rm POP}\vert/2$ randomly.
2. Generate $\vert{\rm POP}\vert/2$ offspring strategies $i=\vert{\rm POP}\vert/2+1,\vert{\rm POP}\vert/2+2,\ldots,\vert{\rm POP}\vert$ from $\vert{\rm POP}\vert/2$ parent strategies through a mutation operator with $p_{\rm m}=0.05$.
3. All pairs of strategies in the population POP compete, including the pair where a strategy plays itself (round-robin tournament). For $\vert{\rm POP}\vert$ strategies, every strategy competes a total of $\vert{\rm POP}\vert$ games. The fitness of a strategy $i$ is ${{1}\over{\vert{\rm POP}\vert}}\sum_{j\in{\rm POP}}G_{i}(j)$.
4. Select the best $\vert{\rm POP}\vert/2$ strategies based on fitness. Increment generation step, $t\leftarrow t+1$.
5. Steps 2–4 are repeated until the termination criterion (a fixed number of generations) is met.
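The five steps above can be sketched as a single loop. `random_strategy`, `mutate`, and `play` are problem-specific stand-ins (exercised below with toy numeric "strategies"), not functions from the paper:

```python
import random

def classical_coevolution(random_strategy, mutate, play,
                          pop_size=50, generations=300):
    """Classical co-evolutionary learning (CCL): relative fitness from a
    full round-robin (each strategy also plays itself), followed by
    truncation selection of the best half. play(i, j) returns the game
    outcome to i."""
    parents = [random_strategy() for _ in range(pop_size // 2)]
    for _ in range(generations):
        pop = parents + [mutate(p) for p in parents]
        fitness = [sum(play(i, j) for j in pop) / len(pop) for i in pop]
        ranked = sorted(range(len(pop)), key=lambda k: fitness[k], reverse=True)
        parents = [pop[k] for k in ranked[: pop_size // 2]]
    return parents
```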

All IPD games involve a fixed game length of 150 iterations. A fixed and sufficiently long duration for the evolutionary process $(t=300)$ is used. As in [12], we observe how the generalization performance of co-evolutionary learning (we measure the generalization performance of the top performing evolved strategy) changes during the evolutionary process. All experiments are repeated in 30 independent runs to allow for statistical analysis.

The classical co-evolutionary learning (CCL) is used as a baseline for comparison with the improved co-evolutionary learning (ICL) that directly uses generalization performance estimates as the fitness measure. We first study a simple implementation of this approach where the estimate $\hat{G}_{i}(S_{N})$ is directly used as the fitness of the evolved strategy $i$. The procedure is otherwise identical to the baseline except for Step 3), to allow a more direct comparison. We investigate the approach with $\vert{\rm POP_{ICL}}\vert=20$ and $S_{N}$ with different sample sizes $N=\{50, 500, 1000, 2000, 10\,000, 50\,000\}$. The sample $S_{N}$ is generated anew every generation. We use a sample size of $N=50\,000$ to provide a “ground truth” estimate close to the true generalization performance (based on the distribution-free framework) with which to compare results for cases where much smaller samples are used. For a more direct comparison, the baseline CCL uses $\vert{\rm POP_{CCL}}\vert=50$, since the ICL experiments start with a test sample of size $N=50$.
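In code, the only change from the classical procedure is the fitness in Step 3): a fresh sample $S_{N}$ of random test strategies each generation, with fitness the average game outcome against it (function names are ours):

```python
def icl_fitness(pop, random_test_strategy, play, n_tests=50):
    """ICL fitness: the generalization estimate G_hat_i(S_N) used
    directly as fitness. S_N is drawn anew each generation and shared
    by the whole population; play(i, j) returns the outcome to i."""
    tests = [random_test_strategy() for _ in range(n_tests)]
    return [sum(play(i, j) for j in tests) / n_tests for i in pop]
```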

### B. Results and Discussion

Fig. 4 shows results for our experiments. Each graph plots the true generalization performance $G_{i}$ of the top performing strategy of the population throughout the evolutionary process for all 30 independent runs. In particular, Fig. 4(a) shows that when co-evolutionary learning uses a fitness measure based on relative performance between competing strategies in the population, the search process can exhibit large fluctuations in the generalization performance of strategies throughout co-evolution. This is consistent with observations from previous studies such as [21], [22], and [32], where it has been shown that fluctuations in the generalization performance during co-evolution are due to overspecialization of the population to a specific strategy that is replaced by other strategies that can exploit it. Results from our baseline experiment show that the use of relative fitness measure does not necessarily lead to the co-evolutionary learning of strategies with increasingly higher generalization performance.

Fig. 4. Comparison of CCL and different ICLs for the three-choice IPD game. (a) CCL. (b) ICL-N50. (c) ICL-N500. (d) ICL-N2000. (e) ICL-N10000. (f) ICL-N50000. Shown are plots of the true generalization performance $G_{i}$ of the top performing strategy of the population throughout the evolutionary process for all 30 independent runs.

However, for all ICL experiments where estimates $\hat{G}_{i}(S_{N})$ are directly used as the fitness measure, no evolutionary run is observed to exhibit large fluctuations in generalization performance (Fig. 4). This is in contrast to the case of CCL, where runs exhibit large fluctuations during co-evolution [Fig. 4(a)]. Starting with the case of a small sample of size 50, the search process of ICL-N50 $(S_{N}, N=50)$ exhibits only small fluctuations in the generalization performance during co-evolution. These fluctuations are a result of sampling errors from using a small sample of test strategies to estimate $\hat{G}_{i}(S_{N})$, which can affect the ranking of strategies $i$ in the co-evolving population for selection.

Results from Fig. 4 suggest that when generalization estimates are directly used as the fitness measure, co-evolutionary learning converges to higher generalization performance. For example, when a sample of 500 test strategies is used to estimate $\hat{G}_{i}(S_{N})$, more evolutionary runs converge to higher generalization performance without fluctuations compared to the case when 50 test strategies are used. However, we do not observe significant differences at the end of the evolutionary runs for ICLs when the sample size is increased further, i.e., between ICL-N2000 and ICL-N50000 (Fig. 4). Closer inspection of the evolved strategies reveals that they play nearly “all defect” or are in fact “all defect” strategies. This observation is expected since the “all defect” strategy has the maximum generalization performance for the game outcome defined by (1).

We have also collected various statistics on $G_{i}$ measurements of CCL and ICLs using different sample sizes in Table IV. The table shows that, starting from a small sample of 50 test strategies, the increase in the generalization performance of ICL is statistically significant in comparison to CCL. The generalization performance of ICL appears to have settled, with no significant increase when the sample size $N$ is increased from 500 to $50\,000$ (which is the sample size based on the distribution-free framework and close in number to all possible strategies for the three-choice IPD). The estimates $\hat{G}_{i}(S_{N})$ appear to be robust at small sample sizes of $S_{N}$ to guide and improve co-evolutionary search toward strategies with high generalization performance. The co-evolutionary learning is also much faster, since significantly smaller sample sizes (around an order of magnitude fewer test strategies) are sufficient to achieve similarly high generalization performance.

TABLE IV SUMMARY OF RESULTS FOR DIFFERENT CO-EVOLUTIONARY LEARNING APPROACHES FOR THE THREE-CHOICE IPD TAKEN AT THE FINAL GENERATION

At this point, we have compared only the generalization performance of the co-evolutionary learning that directly uses $\hat{G}_{i}(S_{N})$ with that of the classical co-evolutionary learning that uses a relative fitness measure. However, it is of interest to investigate co-evolutionary learning with a fitness measure consisting of a mixture of the two fitness values, to determine the impact on generalization performance. We consider the simple implementation of a weighted sum of fitness measures $${\rm fitness}_{i}=\eta\left(\frac{1}{N}\sum_{j\in S_{N}}G_{i}(j)\right)+(1-\eta)\left(\frac{1}{\vert{\rm POP}\vert}\sum_{k\in{\rm POP}}G_{i}(k)\right)\eqno{(28)}$$ where higher $\eta$ values give more weight to the contribution of the estimate $\hat{G}_{i}(S_{N})$ in the selection of evolved strategies. We investigate this approach where the estimate $\hat{G}_{i}(S_{N})$ is computed with $N=10\,000$ test strategies (to ensure a reasonable tradeoff between accuracy and computational expense) and $\eta$ at 0.25 (MCL25-N10000), 0.50 (MCL50-N10000), and 0.75 (MCL75-N10000).
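The weighted-sum fitness of (28) in code form; `play(i, j)` returning the game outcome to `i` is a stand-in, and the numbers in the usage note are illustrative:

```python
def mixed_fitness(i, pop, tests, play, eta):
    """Weighted-sum fitness of (28): eta weights the generalization
    estimate over the test sample S_N; (1 - eta) weights the relative
    fitness from round-robin play within the population POP."""
    g_hat = sum(play(i, j) for j in tests) / len(tests)
    relative = sum(play(i, k) for k in pop) / len(pop)
    return eta * g_hat + (1 - eta) * relative
```

For example, with `play = lambda a, b: a - b`, `i = 1.0`, `pop = [0.0, 1.0]`, `tests = [0.0]`, and `eta = 0.25`, the fitness is $0.25\cdot 1.0 + 0.75\cdot 0.5 = 0.625$.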

Results show that co-evolutionary learning is able to search for strategies with high generalization performance (Fig. 5). However, the inclusion of relative fitness leads to fluctuations in the generalization performance of co-evolutionary learning. The fluctuations are smaller and localized around a high generalization performance when the contribution of relative fitness is reduced [Fig. 5(b)]. Our results suggest that the co-evolutionary search for strategies with high generalization performance is driven by the estimate $\hat{G}_{i}(S_{N})$ contributing to the fitness measure. There is no positive impact on the generalization performance of co-evolutionary learning from including relative fitness.

Fig. 5. Different MCL-N10000s for the three-choice IPD game. (a) MCL25-N10000 $(\eta = 0.25)$. (b) MCL75-N10000 $(\eta = 0.75)$. Shown are plots of the true generalization performance $G_{i}$ of the top performing strategy of the population throughout the evolutionary process for all 30 independent runs.

### C. Co-Evolutionary Learning of Othello

In this section, we demonstrate our new approach on more complex problems. As an example, we show that ICL also improves co-evolutionary learning for the more complex game of Othello. We can achieve similarly high generalization performance using estimates requiring an order of magnitude fewer test strategies than the distribution-free framework. We do not necessarily need larger samples of test strategies when applying the new approach to more complex games. Instead, we can determine in advance the required number of test strategies for robust estimation before applying ICL to the game of Othello.

#### 1) Othello

Othello is a deterministic, perfect information, zero-sum board game played by two players (black and white) that alternately place their (same-colored) pieces on an eight-by-eight board. The game starts with each player having two pieces already on the board as shown in Fig. 6. In Othello, the black player starts the game by making the first move. A legal move is one where the new piece is placed adjacent horizontally, vertically, or diagonally to an opponent's existing piece [e.g., Fig. 6(b)] such that at least one of the opponent's pieces lies between the player's new piece and existing pieces [e.g., Fig. 6(c)]. The move is completed when the opponent's surrounded pieces are flipped over to become the player's pieces [e.g., Fig. 6(d)]. A player that cannot make a legal move forfeits the turn and passes the move to the opponent. The game ends when all the squares of the board are filled with pieces, or when neither player is able to make a legal move [7].
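The move rules above can be sketched as follows, using the same board encoding as the WPC representation later in this section ($+1$ black, $-1$ white, 0 empty); the function names are ours:

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def flips(board, r, c, player):
    """Opponent pieces flipped if `player` (+1 black, -1 white) places a
    piece at (r, c) on an 8x8 board; an empty result means the move is
    illegal."""
    if board[r][c] != 0:
        return []
    flipped = []
    for dr, dc in DIRECTIONS:
        line, rr, cc = [], r + dr, c + dc
        # Walk along a contiguous run of opponent pieces...
        while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
            line.append((rr, cc))
            rr, cc = rr + dr, cc + dc
        # ...that must be closed off by one of the player's own pieces.
        if line and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
            flipped.extend(line)
    return flipped

def apply_move(board, r, c, player):
    """Place the piece and flip all surrounded opponent pieces."""
    for rr, cc in flips(board, r, c, player) + [(r, c)]:
        board[rr][cc] = player
```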

Fig. 6. Figure illustrates basic Othello moves. (a) Positions of respective players' pieces at the start of the game. (b) Possible legal moves (which are indicated by black, crossed circles) at a later point of the game. (c) Black player selecting a legal move. (d) Black move is completed where surrounded white pieces are flipped over to become black pieces [7].

#### 2) Strategy Representation

Among the strategy representations that have been studied for the co-evolutionary learning of Othello strategies (in the form of a board evaluation function) are weighted piece counters [42] and neural networks [7], [43]. We consider the simple strategy representation of a weighted piece counter in the following empirical study. This is to allow a more direct investigation of the impact of fitness evaluation in the co-evolutionary search of Othello strategies with higher generalization performance.

A weighted piece counter (WPC) representing the board evaluation function of an Othello game strategy can take the form of a vector of 64 weights, indexed as $w_{rc}$, $r=1,\ldots,8$, $c=1,\ldots,8$, where $r$ and $c$ represent the position indexes for rows and columns of an eight-by-eight Othello board, respectively. Let the Othello board state be the vector of 64 pieces, indexed as $x_{rc}$, $r=1,\ldots,8$, $c=1,\ldots,8$, where $r$ and $c$ represent the position indexes for rows and columns. $x_{rc}$ takes the value of ${+}{1}$, ${-}{1}$, or 0 for black piece, white piece, and empty piece, respectively [42].

The WPC takes the Othello board state as input and outputs a value that gives the worth of the board state. This value is computed as $${\rm WPC}(\mathbf{x})=\sum_{r=1}^{8}\sum_{c=1}^{8}w_{rc}\,x_{rc}\eqno{(29)}$$ where a more positive value of ${\rm WPC}(\mathbf{x})$ indicates the WPC's interpretation that the board state $\mathbf{x}$ is more favorable to it as a black player, and a more negative value indicates its interpretation that the board state is more favorable to it as a white player [42].
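Equation (29) in code form (the function name is ours):

```python
def wpc_value(weights, board):
    """WPC(x) = sum_{r,c} w_rc * x_rc over the 8x8 board, with x_rc in
    {+1 (black), -1 (white), 0 (empty)}. More positive favors the WPC
    as a black player; more negative favors it as a white player."""
    return sum(w * x
               for w_row, x_row in zip(weights, board)
               for w, x in zip(w_row, x_row))
```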

We consider a simple mutation operator, where the WPC weight of the offspring $w^{\prime}_{rc}$ is obtained by adding a small random value to the corresponding WPC weight of the parent $w_{rc}$ $$w^{\prime}_{rc}=w_{rc}+k\cdot F_{rc},\quad r=1,\ldots,8,\quad c=1,\ldots,8\eqno{(30)}$$ where $k$ is a scaling constant $(k=0.1)$ and $F_{rc}$ is a real number drawn uniformly at random from $[-1,1]$ and resampled for every combination of $r$ and $c$ (a total of 64 weights). For the experiments, we consider the space of Othello strategies given by the WPC representation with $w_{rc}\in[-10,10]$. This simple mutation operator provides sufficient variation to the Othello game strategy represented in the form of a WPC evaluation function. We note that the choices for the various parameters are not optimized. The main emphasis of our study is to investigate the impact of generalization performance estimates used as a fitness measure in improving the generalization performance of co-evolutionary learning.
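Equation (30) as code. Clamping offspring weights to the strategy space $[-10,10]$ is our assumption; the paper states the space bounds but not how the operator handles them:

```python
import random

def mutate_wpc(weights, k=0.1, lo=-10.0, hi=10.0):
    """w'_rc = w_rc + k * F_rc, with F_rc ~ U[-1, 1] resampled
    independently for each of the 64 weights; results are clamped to
    [lo, hi] (our assumption about the strategy-space bounds)."""
    return [[max(lo, min(hi, w + k * random.uniform(-1.0, 1.0)))
             for w in row] for row in weights]
```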

#### 3) Measuring Generalization Performance on Othello Strategies

Unlike the IPD game, which is symmetric, the Othello game is not necessarily symmetric, i.e., the black and white players may not have the same sets of available strategies [23]. In this case, we consider two estimates of generalization performance. We estimate the generalization performance of a black WPC through Othello game-plays against a random test sample of white WPCs and, conversely, that of a white WPC through game-plays against a random test sample of black WPCs. A random test WPC is obtained by sampling each $w_{rc}$ uniformly from $[-10,10]$, resampled for every combination of $r$ and $c$. We use a random sample of $50\,000$ test WPCs (of the opposite color) to directly estimate the generalization performance of an evolved WPC, since we cannot compute the true generalization performance.

#### 4) Co-Evolutionary Learning Procedure

Given the approach we use to measure the generalization performance of evolved Othello WPC, we repeat all experiment settings twice: one for black WPC and one for white WPC. For example, the CCL of black WPC is described as follows.

1. Generation step, $t=1$. Initialize $\vert{\rm POP}\vert/2$ parent strategies $i=1,2,\ldots,\vert{\rm POP}\vert/2$ randomly. For a ${\rm WPC}_{i}$, each $w_{rc}^{i}$ is a real number sampled uniformly from $[-0.2,0.2]$, resampled for every combination of $r$ and $c$.
2. Generate $\vert{\rm POP}\vert/2$ offspring strategies $i=\vert{\rm POP}\vert/2+1,\vert{\rm POP}\vert/2+2,\ldots,\vert{\rm POP}\vert$ from $\vert{\rm POP}\vert/2$ parent strategies through a mutation operator given by (30).
3. All pairs of strategies in the population POP compete, including the pair where a strategy plays itself (round-robin tournament). For $\vert{\rm POP}\vert$ strategies, every strategy competes a total of $\vert{\rm POP}\vert$ games. The fitness of a black WPC strategy $i$ is ${{1}\over{\vert{\rm POP}\vert}}\sum_{j\in{\rm POP}}G_{i}(j)$, where $G_{i}(j)$ is the game outcome to $i$ for an Othello game played by $i$ (black) and $j$ (white).
4. Select the best $\vert{\rm POP}\vert/2$ strategies based on fitness. Increment generation step, $t\leftarrow t+1$.
5. Steps 2–4 are repeated until the termination criterion (i.e., a fixed number of generations) is met.

For the co-evolutionary learning of Othello, we consider a shorter evolutionary duration of 200 generations compared with the co-evolutionary learning of IPD, owing to the greater computational expense of a single Othello game compared to a single IPD game. All experiments are repeated in 30 independent runs to allow for statistical analysis.

ICLs with $\vert{\rm POP_{ICL}}\vert=20$ and different sample sizes $N=\{50, 500, 1000, 5000, 10\,000, 50\,000\}$ to estimate $\hat{G}_{i}(S_{N})$ are considered, while CCL with $\vert{\rm POP_{CCL}}\vert=50$ is used as a baseline for more direct comparison. The sample size of $N=50\,000$ is used to provide a “ground truth” estimate close to the true generalization performance with which to compare results for cases where much smaller samples are used. Note that different samples $S_{N}$ are used to estimate $\hat{G}_{i}(S_{N})$ as the fitness measure in ICL and to estimate the generalization performance of ICL for analysis.

#### 5) Results and Discussion

Fig. 7 shows results for our experiments. Each graph plots the estimated generalization performance $\hat{G}_{i}(S_{N})$ ($N=50\,000$) of the top performing strategy of the population throughout the evolutionary process for all 30 independent runs. As with the CCL of the simpler IPD game [Fig. 4(a)], results for the CCL of black and white WPCs indicate a search process with large fluctuations in the generalization performance of strategies throughout co-evolution [Fig. 7(a) and (b)]. Our results suggest that co-evolutionary learning does not necessarily lead to Othello WPC strategies with increasingly higher generalization performance when a relative fitness measure is used.

Fig. 7. Comparison of CCL and different ICLs for the Othello game. (a) CCL black WPC. (b) CCL white WPC. (c) ICL-N500 black WPC. (d) ICL-N500 white WPC. (e) ICL-N50000 black WPC. (f) ICL-N50000 white WPC. Shown are plots of the estimated generalization performance $\hat{G}_{i}(S_{N})$ (with $N=50\,000$) of the top performing strategy of the population throughout the evolutionary process for all 30 independent runs.

When estimates $\hat{G}_{i}(S_{N})$ are directly used as the fitness measure in co-evolutionary learning, fluctuations in the generalization performance are reduced and the co-evolutionary search converges to higher generalization performance compared to the case of CCL for the Othello game (Fig. 7). We observe that ICL can search for WPCs with higher generalization performance, although small fluctuations can be seen during co-evolution when estimates $\hat{G}_{i}(S_{N})$ are computed using a small sample of 50 test strategies. Further increases in sample size lead to further improvements in the generalization performance of ICL, e.g., when $N=500$ [Fig. 7(c) and (d)].

There is a point beyond which a significant increase in the sample size does not bring about a significant increase in the generalization performance of ICL. For example, results for ICL-N5000 are similar to those of ICL-N50000 [Fig. 7(e) and (f)]. This is consistent with results from experiments to find the required number of test strategies for robust estimation of generalization performance (Fig. 8). The figure suggests a tradeoff at $N=5000$, at which $S_{N}$ provides sufficiently robust estimation for a reasonable computational cost, since substantially increasing $N$ to $50\,000$ would not lead to a significant decrease in the error $\epsilon$ for the Othello game.

Fig. 8. Pessimistic estimate of $\epsilon$ as a function of sample size $N$ of test strategies for 50 random base strategies $i$ for the Othello game. (a) Black WPC. (b) White WPC.

Tables V and VI compare the generalization performance of CCL with ICLs at the end of the generational runs for black and white WPCs, respectively. They show that there is a positive and significant impact on the generalization performance of co-evolutionary learning when estimates $\hat{G}_{i}(S_{N})$ are directly used as the fitness measure. The means over 30 runs are higher while the standard errors (at the 95% confidence level) are lower when comparing results between ICLs and CCL. In addition, results of controlled experiments for co-evolutionary learning with the fitness measure being a mixture of the estimate $\hat{G}_{i}(S_{N})$ and relative fitness (28) indicate that when the contribution of the relative fitness is reduced while that of the estimate $\hat{G}_{i}(S_{N})$ is increased, higher generalization performance can be obtained with smaller fluctuations throughout co-evolution. These results further support our previous observation from Fig. 7 that co-evolution converges to higher generalization performance without large fluctuations as a result of directly using the generalization estimate $\hat{G}_{i}(S_{N})$ as the fitness measure.

TABLE V SUMMARY OF RESULTS FOR DIFFERENT CO-EVOLUTIONARY LEARNING APPROACHES FOR BLACK OTHELLO WPC TAKEN AT THE FINAL GENERATION
TABLE VI SUMMARY OF RESULTS FOR DIFFERENT CO-EVOLUTIONARY LEARNING APPROACHES FOR WHITE OTHELLO WPC TAKEN AT THE FINAL GENERATION

Our empirical studies indicate that the use of generalization estimates directly as the fitness measure can have a positive and significant impact on the generalization performance of co-evolutionary learning for both the IPD and Othello games. The new approach (ICL) can obtain strategies with higher generalization performance without large performance fluctuations and is faster compared to the case when a distribution-free framework is used, requiring an order of magnitude smaller number of test strategies to achieve similarly high generalization performance. More importantly, it is not necessary to use larger samples of test strategies when applying ICL to more complex games. One can observe the similarity in the rate at which the error $\epsilon$ decreases for increasing sample size $N$ for both the IPD and Othello games (Figs. 2 and 8), and subsequently the similarity of the impact of using $\hat{G}_{i}(S_{N})$ on the generalization performance of ICL (Figs. 4 and 7). We stress that one can use our approach to find and set the required number of test strategies for robust estimation in a principled manner before applying ICL to a new game.

We do note that there are many issues related to the design of co-evolutionary learning systems for high performance. For example, design issues can be problem-specific and involve representation, variation, and selection operators [4], [7], as well as more sophisticated development of systems incorporating domain knowledge [44], which have the potential to provide superior solutions compared with other learning approaches [8]. We address only the issue of selection (generalization estimates used to guide co-evolutionary search), in a principled manner that can also be implemented practically. Although fine-tuning parameters such as mutation rate and population size can affect our numerical results, our general observations would hold. In addition, various selection and variation approaches can have different impacts on generalization performance in co-evolutionary learning for different real-world problems (and games in particular). Here, it is of interest to use common tools for rigorous quantitative analysis, such as the generalization measures we have formulated in [12]. As an example, we have previously studied both generalization estimates using unbiased samples of random test strategies (obtained through uniform sampling of ${\cal S}$) and biased samples of random test strategies that are superior in game-play and more likely to be encountered in a competitive setting (obtained through a multiple partial enumerative search). We have also recently started a preliminary investigation of the impact of diversity on the generalization performance of co-evolutionary learning [45].

SECTION V

## CONCLUSION

We have addressed the issue of loose confidence bounds associated with the distribution-free (Chebyshev's) framework we formulated earlier for the estimation of generalization performance in co-evolutionary learning, demonstrated in the context of game-playing. Although Chebyshev's bounds hold for any distribution of game outcomes, they impose high computational requirements, i.e., a large sample of random test strategies is needed to estimate the generalization performance of a strategy as the average game outcome against test strategies. In this paper, we take advantage of the near-Gaussian nature of average game outcomes (generalization performance estimates) through the central limit theorem and provide tighter bounds based on parametric testing. Furthermore, we can strictly control the condition (sample size under a given precision) under which the distribution of average game outcomes converges to a Gaussian through the Berry-Esseen theorem.

These improvements to our generalization framework provide the means to develop a general and principled approach for improving generalization performance in co-evolutionary learning that can be implemented as an efficient algorithm. Ideally, co-evolutionary learning using the true generalization performance directly as the fitness measure would search for solutions with higher generalization performance. However, direct estimation of the true generalization performance using the distribution-free framework can be computationally expensive. Our new theoretical contributions, which exploit the near-Gaussian nature of generalization estimates, provide the means with which we can now: 1) determine in a principled manner the number of test cases required for robust estimation of generalization performance, and 2) subsequently use that small sample of random test cases to compute generalization estimates of solutions directly as the fitness measure to guide and improve co-evolutionary learning.
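Point 1) amounts to inverting the Berry-Esseen bound for the sample size. A minimal sketch, assuming outcome standard deviation $\sigma$, third absolute central moment $\rho = E|X-\mu|^3$, and one published value of the Berry-Esseen constant ($C \approx 0.4748$; the classical constant is larger, and the value has been tightened over the years):

```python
import math

def berry_esseen_sample_size(sigma, rho, epsilon, C=0.4748):
    """Smallest n such that the Berry-Esseen bound
        sup_x |F_n(x) - Phi(x)| <= C * rho / (sigma**3 * sqrt(n))
    on the deviation of the standardized sample mean's distribution
    from the Gaussian is at most epsilon."""
    return math.ceil((C * rho / (sigma ** 3 * epsilon)) ** 2)

# Example: symmetric win/loss outcomes in {0, 1} with p = 0.5,
# so sigma = 0.5 and rho = E|X - 0.5|^3 = 0.125.
n = berry_esseen_sample_size(sigma=0.5, rho=0.125, epsilon=0.05)
```

Because the bound depends only on moments of the outcome distribution, the resulting sample size does not grow with the complexity of the game itself.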

We have demonstrated our approach on the co-evolutionary learning of the IPD and the more complex Othello game. Our new approach is shown to improve on the classical approach in that it obtains increasingly higher generalization performance using relatively small samples of test strategies and without the large performance fluctuations typical of the classical approach. Our new approach also leads to a faster co-evolutionary search in which we can strictly control the condition (sample sizes) under which the speedup is achieved (not at the cost of weakened precision in the estimates). It is much faster than the distribution-free framework, requiring an order of magnitude fewer test strategies to achieve similarly high generalization performance for both the IPD and the Othello game. Note that our approach does not depend on the complexity of the game. That is, no assumption needs to be made about the complexity of the game and how it may affect the number of test strategies required for robust estimation of generalization performance.

This paper is a first step toward understanding and developing theoretically motivated frameworks of co-evolutionary learning that can lead to improvements in the generalization performance of solutions. There are other research issues relating to generalization performance in co-evolutionary learning that need to be addressed. Although our generalization framework makes no assumption about the underlying distribution of test cases $(P_{\cal S})$, we have demonstrated one application where $P_{\cal S}$ in the generalization measure is fixed and known a priori. In this paper, generalization estimates are used directly as the fitness measure to improve the generalization performance of co-evolutionary learning (in effect, reformulating the approach as evolutionary learning). There are problems where such an assumption must be relaxed, and it is of interest for future studies to formulate, naturally and precisely, co-evolutionary learning systems in which the population acting as test samples can adapt to approximate a particular distribution that solutions should generalize to.

### ACKNOWLEDGMENT

The authors would like to thank Prof. S. Lucas and Prof. T. Runarsson for providing access to their Othello game engine, which was used for the experiments in this paper.

## Footnotes

This work was supported in part by the Engineering and Physical Sciences Research Council, under Grant GR/T10671/01 on “Market Based Control of Complex Computational Systems.”

S. Y. Chong is with the School of Computer Science, University of Nottingham, Semenyih 43500, Malaysia. He is also with the Automated Scheduling, Optimization and Planning Research Group, School of Computer Science, University of Nottingham, Nottingham NG8 1BB, U.K. (e-mail: siang-yew.chong@nottingham.edu.my).

P. Tiňo is with the School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K. (e-mail: p.tino@cs.bham.ac.uk).

D. C. Ku is with the Faculty of Information Technology, Multimedia University, Cyberjaya 63100, Malaysia (e-mail: dcku@mmu.edu.my).

X. Yao is with the Center of Excellence for Research in Computational Intelligence and Applications, School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K. (e-mail: x.yao@cs.bham.ac.uk).

1 There is a trivial inequality for the third moment [29, p. 210] that, however, leads to rather broad bounds.

2 The non-parametric quantile estimation is performed in the usual manner on ordered samples. Uniform approximation of the true distribution function by an empirical distribution function based on sample values is guaranteed, e.g., by the Glivenko-Cantelli theorem [29], [36].

3 Note that because we take a pessimistic estimate of $\epsilon$, it is possible that the computed value of $\epsilon$ is greater than the real one, especially for small sample sizes (e.g., strategy #7 at $N=50$).

4 By pathological strategies we mean strategies with very little variation of game outcomes when playing against a wide variety of opponent test strategies.

5 In the same way, the selection process in Pareto co-evolution [24] is based on the Pareto-dominance relationship, although establishing such a relationship requires competing solutions to interact with (solve) a sample of test cases.
