Theoretical Analysis of Accuracy-Based Fitness on Learning Classifier Systems

Like most evolutionary algorithms, accuracy-based learning classifier systems (XCSs) use a fitness metric to recognize the superiority of rules, under the principle that a higher-quality rule has a higher fitness. However, XCS must learn the fitness values under a reinforcement learning scheme. This introduces uncertainty and asynchrony into the fitness estimation, yet no theoretical work formally guarantees that this basic principle holds. The goal of this paper is to fill this fundamental gap in the reliability of XCS by mathematically analyzing its fitness-update scheme. Our main assumption is that the fitness is updated with the absolute accuracy instead of the relative accuracy, in order to sidestep the unpredictable dynamics of XCS. Our theoretical conclusion is that the superiority of rules can be correctly recognized through the accuracy-based fitness within a finite number of updates. We further show that recognizing the superiority among low-quality rules is a costly procedure that increases the number of necessary rule-trainings. This drawback indicates that XCS may struggle to identify good parent rules in early generations, degrading the efficiency of evolutionary propagation.


I. INTRODUCTION
Learning Classifier Systems (LCSs) [1] are a paradigm of evolutionary rule-based learning approaches, aiming to produce accurate and maximally general rules that capture the complexity behind a given problem [2]. LCSs seek to build an ensemble model of simple condition-action rules, wherein each rule is responsible for a sub-input space divided optimally by evolutionary algorithms. During the last decade, LCSs have been extended in various directions in response to modern interests in the machine learning field, for example, ensemble modeling of neural networks [3]-[6], knowledge extraction with deep neural network models [7]-[9], and automation of LCSs [10].
Accuracy-based LCSs (XCSs) [11] are the most common basis of modern LCSs designed for various problem domains, such as classification, recognition, and online control [12]-[14]. XCS optimizes rules with a genetic algorithm (GA) and evaluates their quality through a reinforcement learning scheme. Its main feature is the utilization of an accuracy-based fitness, in which the rule-quality is considered as an estimation accuracy of the expected rewards to be received by the rule. In early works, the impact of the accuracy-based fitness was actively studied, as fitness design is crucial to boosting evolutionary propagation. For instance, Kovacs's optimality hypothesis suggested that XCS can develop minimal representations of optimal solutions [15]-[17]. Butz discussed how specific XCS components act to produce the optimal rules while utilizing the fitness [18], [19]. For the sake of simplicity, hereinafter, the accuracy-based fitness is denoted as fitness, if not stated otherwise.
Like most evolutionary algorithms, the rule-discovery process of XCS, that is, GA, optimizes rules relying on a basic principle that a higher-quality rule has a higher fitness. However, XCS must estimate the fitness from a limited resource of reward signals [20], causing uncertainty and asynchrony in the fitness estimation. Specifically, the fitness of rules is asynchronously trained because each rule covers a different sub-input space [21]. Furthermore, the fitness value may differ, even for the same rule-quality, owing to some probabilistic factors, for example, the input distribution. Thus, we can consider that XCS seeks to solve an uncertain (i.e., noisy) and asynchronous optimization problem, where the fitness may be unreliable. XCS may thus fail to correctly recognize the superiority of rules.
In this regard, the insights collected so far are briefly summarized below.
• Handling uncertainty. Using a small learning rate β can improve the estimation accuracy of the rule-variables, including the fitness [22], and thus can relax the uncertainty in the fitness estimation [23]. Although this strategy deteriorates the convergence speed of the rule-variables, it is effective especially for large-scale and/or noisy problems, as shown in [24]. To this end, several works have attempted to adaptively tune the learning rate during a run [25]-[27]. Belief function theory [28] and an estimation technique for reward distributions [29] have been integrated into XCS to manage the uncertainty in classification problems.
• Handling asynchrony. A population size bound is theoretically discussed in [22], [30], [31], given that high-quality rules can have reproductive events before they are deleted. This condition, known as the reproductive opportunity bound, is a reasonable explanation to manage the asynchrony in fitness estimation. Frequency tuning of GA [32], [33] and a theoretical condition to safely delete rules [21] have also been considered as additional perspectives.
• Other perspectives. Certain works investigated the affinity of the fitness with possible options of the XCS framework. The impacts of fitness sharing [34]-[36], niche GAs [37], and parent selection methods [38], [39] have been analyzed. For instance, tournament selection tends to improve the stability of XCS performance [40].
However, some fundamental insights related to the fitness have remained theoretically unclear so far. Specifically, it has not yet been proved whether XCS can correctly recognize the superiority of any rules through their fitness, nor how many times each rule should be trained before its fitness represents its quality. The former issue is related to understanding the stability of XCS against the uncertainty in fitness estimation. The latter corresponds to quantifying a negative impact of the asynchrony in the fitness-update scheme. Considering these issues, this work presents a theoretical analysis of the fitness-update scheme. Our goal is to theoretically conclude that XCS can correctly recognize the superiority of any rules through their fitness under finite training times (i.e., a feasible condition). Thus, our main contribution is to provide the first proof that the basic principle, that is, a higher-quality rule has a higher fitness, holds, filling a fundamental gap in the reliability of XCS. We thereby demonstrate that, under our assumptions, the XCS framework is designed to manage uncertainty and asynchrony when optimizing rules online.
The mathematical analysis of the fitness value is generally non-trivial because of the complexity in deriving its estimation equation. In fact, the fitness is updated with four variables: a prediction, a prediction error, and absolute/relative accuracies (see Section II for more detail). The absolute accuracy is determined with a discontinuous function, and estimations of the prediction and prediction error are further required to analyze the fitness. Furthermore, the relative accuracy is used to realize the fitness sharing approach, but this has an unpredictable impact on the fitness estimation.
To relax the above difficulties, we build on recent theoretical insights and assumptions. Specifically, the estimation equations of the prediction and prediction error have been derived in the learning optimality theory [24]; we show that the discontinuity of the absolute accuracy can be reasonably decomposed by utilizing these equations. Because of the assumptions used in this theory, our analysis is conducted for classification tasks. Furthermore, we assume that the fitness is updated with the absolute accuracy instead of the relative accuracy, to avoid the latter's unpredictable property. While this assumption creates a gap from the standard XCS framework, it is a necessary condition to proceed with our analysis. Such simplifications of the fitness calculation have been adopted by some LCS variants [34], [41]-[43].
This paper is organized as follows. Section II introduces the XCS framework and the derivations of the learning optimality theory. Section III describes our analysis and introduces its assumptions and definitions. In Section IV, we further analyze the negative impact of asynchrony in the fitness-update scheme. Conclusions and future work are then summarized in Section V. It is worth noting that, in classification tasks, the estimation accuracy of expected rewards corresponds to a classification accuracy, and XCS systematically defines the worst rule-quality as a 50% classification accuracy. This is known as the fitness dilemma, and a bilateral accuracy has been introduced to sidestep this issue [30]. However, this paper focuses on the standard XCS and follows this definition of the worst quality.

II. XCS AND LEARNING OPTIMALITY THEORY
We describe the XCS framework for classification tasks with binary inputs, wherein the rule-condition is represented with a ternary alphabet coding [11]. Subsequently, the learning optimality theory is introduced. Note that this theory does not suppose any particular coding, and thus, can be satisfied for XCSR [44] and XCSI [45], designed for real-valued and integer inputs, respectively.

A. XCS FRAMEWORK
To apply XCS to classification tasks, a binary reward scheme is frequently employed [11], [30], in which the system receives r_max if it predicts the correct class for a given input and r_min otherwise. This paper sets r_min = 0 and r_max = 1000.

1) RULE FORMAT
A rule has a condition C and an action A. The condition determines the set of inputs (i.e., a sub-input space) to which the rule applies. The action corresponds to a predicted class. For a binary input, the condition is represented as C = [c_1, ..., c_d] with c_i ∈ {0, 1, #}, where the don't-care symbol # matches any value. Specifically, a rule matches an input x = [x_1, ..., x_d] if c_i is equal to x_i or # for all i ∈ {1, 2, ..., d}, which is simply denoted as x ∈ C in this paper. A rule has the following main rule-variables:
• p_n: a prediction, which is an estimated reward calculated from given rewards;
• ε_n: a prediction error, which estimates the absolute error of p_n to given rewards;
• κ_n: an absolute accuracy, which is calculated from ε_n;
• F_n: an accuracy-based fitness, which is calculated from a relative accuracy based on κ_n;
• num: a numerosity, which is the number of subsumed rules.
The notations p_n, ε_n, κ_n, and F_n indicate values that have been updated n times. Their initial values, except for κ_n, are denoted as p_I, ε_I, and F_I for n = 0; κ_n is directly calculated from ε_n. Note that each rule has its own update time n, and thus the rule-variables are asynchronously updated. All rules are contained in a population P with maximum size N.

2) MECHANISM
To begin with, P is initialized as an empty set. Given an input x = [x_1, ..., x_d], XCS builds a match set M containing the rules matched to x, that is, M = {cl ∈ P | x ∈ cl.C}. If M does not cover all possible classes, the covering operator is activated to produce rules that have the missing actions and conditions matching x. The initial rule-variables of such rules are set to sufficiently small values [20], for example, p_I, ε_I, F_I = 0.01. XCS executes an action a* selected with a defined strategy, and a reward r is then returned to the system.
Next, XCS builds an action set A consisting of the rules having a*, that is, A = {cl ∈ M | cl.A = a*}, and it updates the rule-variables of the rules in A. First, the update time n is increased by one. Then, p_n, ε_n, and κ_n are updated as follows:

p_n = p_{n-1} + β(r - p_{n-1}),   (1)
ε_n = ε_{n-1} + β(|r - p_{n-1}| - ε_{n-1}),   (2)
κ_n = 1 if ε_n < ε_0; otherwise α(ε_n/ε_0)^{-ν},   (3)

where β is a learning rate; ε_0 is an error tolerance; and α, ν control the exponential curve of κ_n. The fitness F_n is updated as

F_n = F_{n-1} + β(κ'_n - F_{n-1}),   (4)

where κ'_n is a relative accuracy, given by

κ'_n = κ_n · num / Σ_{cl∈A} cl.κ · cl.num.   (5)

Finally, GA is applied to A depending on a hyper-parameter θ_GA (see [20] for more details). XCS selects two parent rules from A and duplicates them as two offspring rules. Then, it applies the crossover and mutation operators to the offspring rules with probabilities χ and µ, respectively. If crossover is applied, the rule-variables of the offspring rules are set to the average values of the parent rules. Typically, the initial fitness is forcibly reduced with a fitness-reduction rate δ_F ∈ (0, 1]. The offspring rules are inserted into P, and two rules are deleted from P if its size exceeds N. The subsumption operator may be applied to the rules in A after updating them and to the offspring rules after GA. A rule cl_1 can be subsumed by a more general rule cl_2 if cl_2 is maximally accurate and sufficiently updated, which is formalized as the following condition:

cl_2.ε_n < ε_0 and cl_2.n > θ_sub.   (6)

If cl_1 is subsumed by cl_2, cl_2.num is increased by cl_1.num and cl_1 is eliminated from P. After completing the above procedure, the next input is sent to XCS.
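As a concrete illustration, the update scheme above can be sketched in code. This is a minimal sketch under the standard Widrow-Hoff interpretation of the update rules; the hyper-parameter values and function names are illustrative, not those of any particular implementation.

```python
# Minimal sketch of the XCS rule-variable update described above.
# Hyper-parameter values are illustrative only.
ALPHA, NU, EPS0, BETA = 0.1, 5.0, 10.0, 0.2

def absolute_accuracy(eps):
    # Eq. (3): accuracy is maximal below the error tolerance eps_0,
    # and decays as a power law above it.
    return 1.0 if eps < EPS0 else ALPHA * (eps / EPS0) ** -NU

def relative_accuracies(kappas, nums):
    # Eq. (5): fitness sharing over the action set, weighted by numerosity.
    total = sum(k * m for k, m in zip(kappas, nums))
    return [k * m / total for k, m in zip(kappas, nums)]

def update_rule(p, eps, F, kappa_rel, r):
    # One update step for a rule in the action set.
    eps_new = eps + BETA * (abs(r - p) - eps)  # Eq. (2), uses the old prediction
    p_new = p + BETA * (r - p)                 # Eq. (1)
    F_new = F + BETA * (kappa_rel - F)         # Eq. (4)
    return p_new, eps_new, F_new
```

For example, a rule with p = ε = F = 0 receiving r = 1000 and a relative accuracy of 1 moves its prediction and error to 200 and its fitness to 0.2 under this learning rate.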

B. LEARNING OPTIMALITY THEORY
In classification tasks, the maximally accurate rules defined in (6) should have an optimal classification accuracy. However, this optimality had not been proved theoretically. The learning optimality theory guarantees that it holds under some assumptions while identifying a theoretical setting of β and ε_0. For readability, following [24], we denote the maximally accurate rules and the other rules as accurate and inaccurate ones, respectively.

1) DEFINITION AND GOAL
Let φ ∈ [0, 1] be the true classification accuracy of a rule, that is, the rate of correct classifications over all inputs that it matches. An accurate rule is defined as having φ* ∈ Φ* = {0, 1} because it can correctly predict a given reward. An inaccurate rule is defined as having φ' ∈ Φ = [1 - φ_max, φ_max], where the boundary of the inaccurate rules can be controlled with φ_max ∈ [0.5, 1). Note that Φ can also be defined as [φ_min, φ_max]; however, φ_min is identified as 1 - φ_max once φ_max is defined, because ε_n follows a symmetric function [24].

Let ε_n(φ) be the prediction error of a rule having φ. Note that ε_n(φ) may change even for the same value of φ because it depends on p_I and ε_I. Furthermore, let max ε_n(Φ) and min ε_n(Φ) be the maximum and minimum prediction errors of any rules having φ ∈ Φ, taken over φ and the initial values p_I, ε_I. From (6), to correctly distinguish any accurate rules from any inaccurate ones, the following condition must be satisfied:

max ε_{n*}(Φ*) < ε_0 ≤ min ε_n(Φ)  for any n*, n > θ_sub.   (9)

The goal of the theory is to derive the optimal setting of β, ε_0 that satisfies the above condition. Note that, in the theory, θ_sub remains a hyper-parameter, which defines the minimum update time required to identify the accurate rules. This paper uses an ideal setting of θ_sub such that any inaccurate rule receives at least one pair of r_max/r_min rewards [24], that is,

θ_sub = ⌈φ_max / (1 - φ_max)⌉.   (10)

2) DERIVATION
The theory derives estimation equations of p_n and ε_n, and then identifies max ε_{n*}(Φ*) and min ε_n(Φ). Let E_r(φ) be the expected reward that a rule having φ receives at any update time n, given by

E_r(φ) = φ r_max + (1 - φ) r_min.   (11)

By recursively expanding (1) with r = E_r(φ), an estimation equation of p_n for φ, that is, p_n(φ), is derived as

p_n(φ) = E_r(φ) + (1 - β)^n (p_I - E_r(φ)).   (12)

Subsequently, let E_{ε,n}(φ) be the expected value of |r - p_{n-1}| in (2). Because p_{n-1} is now replaced with p_{n-1}(φ),

E_{ε,n}(φ) = φ |r_max - p_{n-1}(φ)| + (1 - φ) |r_min - p_{n-1}(φ)|.   (13)

By recursively expanding (2) with |r - p_{n-1}| = E_{ε,n}(φ), the estimation equation of ε_n(φ) is derived as

ε_n(φ) = (1 - β)^n ε_I + β Σ_{i=1}^{n} (1 - β)^{n-i} E_{ε,i}(φ),   (14)

where the converged value ε_∞(φ) is given by

ε_∞(φ) = 2φ(1 - φ)(r_max - r_min).

For n → ∞, ε_n(φ) eventually converges to ε_∞(φ). Consequently, the following lemma holds.

Lemma 1: Assume that two rules have φ and 1 - φ, respectively, wherein φ ∈ [0.5, 1], and that their prediction errors can be approximated by (14). When their prediction errors have been updated for the same n update times, the following two equalities hold:

max ε_n(φ) = max ε_n(1 - φ),   (15)
min ε_n(φ) = min ε_n(1 - φ),   (16)

where max ε_n(φ) and min ε_n(φ) are the prediction errors maximized and minimized in terms of the initial values (i.e., p_I, ε_I), respectively.

Proof: From (14), ε_n(φ) and ε_n(1 - φ) are maximized with {p_I, ε_I} = {0, 1000} and {1000, 1000}, respectively, for any value of β; both follow the same equation, denoted as max ε_n(φ) (17). Similarly, they are minimized with {p_I, ε_I} = {1000, 0} and {0, 0}, respectively, and both follow the same equation, denoted as min ε_n(φ) (18). Note that (17) and (18) are available for φ ≥ 0.5.
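The converged error above can be checked numerically. The sketch below, under the binary reward scheme of Section II-A (r_min = 0, r_max = 1000), verifies that the expected one-step absolute error around the converged prediction E_r(φ) reduces to 2φ(1 - φ)(r_max - r_min); the function names are illustrative.

```python
# Numerical check of the converged error eps_inf(phi) = 2*phi*(1-phi)*(r_max - r_min):
# once the prediction has converged to E_r(phi), the expected |r - p| over the
# binary reward scheme equals this closed form (r_min = 0, r_max = 1000).

R_MIN, R_MAX = 0.0, 1000.0

def expected_reward(phi):
    # Eq. (11): expected reward of a rule with true accuracy phi.
    return phi * R_MAX + (1 - phi) * R_MIN

def expected_abs_error(phi):
    # Expected |r - p| with p at its converged value E_r(phi).
    p = expected_reward(phi)
    return phi * abs(R_MAX - p) + (1 - phi) * abs(R_MIN - p)

def eps_inf(phi):
    # Converged prediction error stated in the derivation above.
    return 2 * phi * (1 - phi) * (R_MAX - R_MIN)

# Both expressions agree for accurate (phi in {0, 1}) and inaccurate rules,
# and are symmetric in phi <-> 1 - phi.
for phi in [0.0, 0.3, 0.5, 0.7, 0.9, 1.0]:
    assert abs(expected_abs_error(phi) - eps_inf(phi)) < 1e-9
    assert abs(eps_inf(phi) - eps_inf(1 - phi)) < 1e-9
```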
Recall that φ* is either 0 or 1; hence, from (17), max ε_{n*}(Φ*) follows the maximized equation with ε_∞(φ*) = 0 (19). For the inaccurate rules, min ε_n(Φ) is obtained by minimizing (18) over φ' ∈ Φ. Specifically, it is minimized when φ' is φ_max or 1 - φ_max. Thus, any possible value of ε_n(φ') is always greater than or equal to min ε_n(Φ), given by min ε_n(Φ) = min ε_n(φ_max) (20). From the above derivations, (9) always holds, as proven in the following theorem.
Theorem 1: Assume that φ_max and θ_sub are given in advance and that the prediction error of any rule can be approximated by (14). Let β be a solution of the following boundary equation at n*, n = θ_sub:

max ε_{n*}(Φ*) = min ε_n(Φ),   (21)

and let ε_0 be the boundary value calculated with the solution β, that is, ε_0 = max ε_{θ_sub}(Φ*) = min ε_{θ_sub}(Φ). Then, with the above setting of β, ε_0, the condition (9) holds; that is, XCS always correctly distinguishes the accurate rules from the inaccurate ones when they have been updated more than θ_sub times.

Proof: From (19) and (20), max ε_{n*}(Φ*) > min ε_n(Φ) holds for n*, n = 0, whereas max ε_{n*}(Φ*) < min ε_n(Φ) eventually holds because ε_∞(φ_max) > 0. Thus, there is an intersection point of max ε_{n*}(Φ*) and min ε_n(Φ), and therefore the solution β of (21) always exists. Although (21) cannot be solved algebraically for β, the solution can be obtained with, for example, Newton's method.
Here, max ε_{n*}(Φ*) and min ε_n(Φ) are monotonically decreasing and increasing for n*, n > 0, respectively (see [24] for the formal proof). Because θ_sub > 0, max ε_{n*}(Φ*) and min ε_n(Φ) are always smaller and greater than the boundary value ε_0 for n*, n > θ_sub, respectively. That is, (9) always holds when the prediction and prediction error of any rule are updated with the solution β.
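Since the boundary equation has no algebraic solution, it can be solved numerically. The sketch below iterates the expected-update recursions for p_n and ε_n directly (in place of the closed form (14)) and bisects on β; the initial values follow Lemma 1. This is an illustrative sketch under those assumptions, not the authors' procedure.

```python
# Sketch: solving the boundary equation (21) numerically by bisection.
# The expected-update recursions for p_n and eps_n are iterated directly;
# worst-case initial values follow Lemma 1.

R_MIN, R_MAX = 0.0, 1000.0

def eps_curve(phi, p0, e0, beta, n):
    """Expected prediction error after n updates for a rule with accuracy phi."""
    p, e = p0, e0
    for _ in range(n):
        exp_abs = phi * abs(R_MAX - p) + (1 - phi) * abs(R_MIN - p)
        e = e + beta * (exp_abs - e)                          # Eq. (2) in expectation
        p = p + beta * ((phi * R_MAX + (1 - phi) * R_MIN) - p)  # Eq. (1) in expectation
    return e

def boundary_gap(beta, phi_max, theta_sub):
    # max eps over accurate rules ({p_I, eps_I} = {0, 1000}) minus
    # min eps over inaccurate rules ({p_I, eps_I} = {1000, 0}) at n = theta_sub.
    acc = eps_curve(1.0, 0.0, 1000.0, beta, theta_sub)
    inacc = eps_curve(phi_max, 1000.0, 0.0, beta, theta_sub)
    return acc - inacc

def solve_beta(phi_max, theta_sub, lo=1e-4, hi=0.999, iters=60):
    # The gap is positive for small beta and negative for beta near 1,
    # so a sign change is bracketed and bisection converges to a root of (21).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if boundary_gap(mid, phi_max, theta_sub) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```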
Accordingly, the above theorem reveals the theoretical setting of β and ε_0.

III. ANALYSIS
Our analysis inherits theoretical insights proven in the learning optimality theory. We start by introducing assumptions, preparation, and the goal. Subsequently, the detailed procedures are described. Note that our conclusion is graphically explained in Section III-C.

A. PREPARATION

1) ASSUMPTIONS
We use the following assumptions.
1) The prediction error can be approximated by (14), and thus Lemma 1 and Theorem 1 hold. 2) The fitness F_n is updated with the absolute accuracy κ_n instead of the relative accuracy κ'_n.
3) The initial fitness F_I is always sampled from the numerical range of κ_n. The second assumption is necessary to advance our analysis because analyzing the relative accuracy is non-trivial due to the unpredictable value of num. Note that β and ε_0 are fixed values not dependent on our analysis, as they are uniquely determined once φ_max is defined. The third assumption is used to identify the minimum value of F_n (see Section III-A2 for more details).

2) BASIC INSIGHT
Let κ_n(φ) be the absolute accuracy of a rule having φ at update time n, and let F_n(φ) be its fitness. From the first assumption and (3), κ_n(φ) is rewritten as

κ_n(φ) = 1 if ε_n(φ) < ε_0; otherwise α(ε_n(φ)/ε_0)^{-ν},   (22)

where its numerical range can be identified as

κ_n(φ) ∈ [α(1000/ε_0)^{-ν}, 1],   (23)

because ε_n(φ) ∈ [0, 1000]. From the second assumption, the update equation of F_n(φ) is rewritten as

F_n(φ) = F_{n-1}(φ) + β(κ_n(φ) - F_{n-1}(φ))   (24)
       = (1 - β)^n F_I + β Σ_{i=1}^{n} (1 - β)^{n-i} κ_i(φ),   (25)

which can be proved by mathematical induction [24]. The above equation, that is, the Widrow-Hoff learning rule, is designed not to exceed the numerical range of κ_n(φ) if F_I is sampled from this range. Thus, with our third assumption, the minimum value of F_n appears, which corresponds to that of κ_n(φ). Note that the minimum value of F_n is otherwise indeterminate because F_I may be reduced recursively with the reduction bias δ_F. Accordingly, the following lemma holds.

Lemma 2: Assume two rules having φ and 1 - φ, respectively, wherein φ ∈ [0.5, 1]. When their fitness has been updated for the same n update times, the following equalities hold:

max F_n(φ) = max F_n(1 - φ),  min F_n(φ) = min F_n(1 - φ).   (26)

Proof: From (25), max F_n(φ) and min F_n(φ) are obtained by maximizing and minimizing F_I and κ_i(φ), respectively, as follows:

max F_n(φ) = (1 - β)^n max F_I + β Σ_{i=1}^{n} (1 - β)^{n-i} max κ_i(φ),   (27)
min F_n(φ) = (1 - β)^n min F_I + β Σ_{i=1}^{n} (1 - β)^{n-i} min κ_i(φ).   (28)

From (23), min κ_i(φ) and max κ_i(φ) are obtained when ε_i(φ) is equal to max ε_i(φ) and min ε_i(φ), respectively. From Lemma 1, min κ_n(φ) = min κ_n(1 - φ) and max κ_n(φ) = max κ_n(1 - φ) hold, and thus (26) also holds.
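The equivalence of the recursive and expanded forms of the Widrow-Hoff rule used above can be confirmed numerically, as in the sketch below; the learning rate and accuracy sequence are illustrative.

```python
# Check that recursively applying the Widrow-Hoff rule yields its expanded
# form: F_n = (1 - beta)^n * F_I + beta * sum_i (1 - beta)^(n - i) * kappa_i.

BETA = 0.2
F_I = 0.01
kappas = [0.3, 0.5, 0.8, 1.0, 1.0]  # illustrative absolute-accuracy sequence

# Recursive form.
F = F_I
for k in kappas:
    F = F + BETA * (k - F)

# Expanded form.
n = len(kappas)
F_exp = (1 - BETA) ** n * F_I + BETA * sum(
    (1 - BETA) ** (n - i) * k for i, k in enumerate(kappas, start=1)
)

assert abs(F - F_exp) < 1e-12
```

Because F_n is a convex combination of F_I and the κ_i values, it stays within their joint range, which is the property the third assumption exploits.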
Importantly, Lemma 2 reveals that we only need to consider φ ≥ 0.5 in analyzing the fitness. This theoretically valid simplification forms the basis of our analysis.

3) GOAL
Suppose that two rules have φ_1, φ_2 ∈ [0.5, 1], respectively, where φ_1 > φ_2 holds; thus, the rule with φ_1 is better than the one with φ_2. The optimality of the fitness estimation argued here is that the rule having φ_1 obtains a higher fitness than the rule having φ_2 at a finite update time. Because their fitness changes depend on the variables F_I and ε_n(φ), this optimality must hold for any possible values of these variables, which is strictly formalized as

min F_{n_1}(φ_1) > max F_{n_2}(φ_2)  s.t. n_1, n_2 < ∞.   (29)

The above condition is general because it is valid for both accurate and inaccurate rules. However, max F_n(φ) and min F_n(φ) are formalized differently depending on whether φ belongs to Φ* or Φ, owing to the discontinuity in calculating κ_n(φ). Thus, to proceed with the analysis, (29) is expressly divided into the following two specific cases:

• Superiority of accurate rules against inaccurate rules. The first case is defined such that the fitness of any accurate rule is greater than that of any inaccurate rule, which is conditioned as

min F_{n*}(Φ*) > max F_n(Φ),   (30)

where min F_{n*}(Φ*) and max F_n(Φ) are the minimum and maximum fitness of any accurate and inaccurate rules, respectively.

• Superiority among inaccurate rules. The second case is defined such that, for any two inaccurate rules with φ_1 > φ_2 ∈ Φ, the rule having φ_1 obtains a higher fitness, which is conditioned as

min F_{n_1}(φ_1) > max F_{n_2}(φ_2).   (31)

B. DERIVATION
To begin with, we reveal the monotonicity of min F_n(φ) and max F_n(φ) with respect to the update time n.

Lemma 3: Assume that min κ_n(φ) is an increasing function for any update time n. Then, min F_n(φ) is also an increasing function for n.

Proof: From (25), min F_{n-1}(φ) is a weighted average of min F_I and min κ_1(φ), ..., min κ_{n-1}(φ), none of which exceeds min κ_n(φ) under the third assumption and the monotonicity of min κ_n(φ). Hence min F_n(φ) - min F_{n-1}(φ) = β(min κ_n(φ) - min F_{n-1}(φ)) ≥ 0, and min F_n(φ) is increasing.

Lemma 4: Assume that max κ_n(φ) is a decreasing function for any update time n. Then, max F_n(φ) is also a decreasing function for n.

Proof: The proof is given by reversing the inequality signs of Lemma 3.

1) SUPERIORITY OF ACCURATE RULES AGAINST INACCURATE RULES
Here we show that (30) holds, concluding that any accurate rule eventually has a higher fitness than any inaccurate rule. We start by deriving specific equations of min F_{n*}(Φ*) and max F_n(Φ).
As revealed in (28), min F_{n*}(Φ*) is determined with the minimum absolute accuracy of any possible accurate rules, denoted as min κ_{n*}(Φ*). From Theorem 1, min κ_{n*}(Φ*) is identified as

min κ_{n*}(Φ*) = 1 if n* > θ_sub; otherwise α(max ε_{n*}(Φ*)/ε_0)^{-ν},

because max ε_{n*}(Φ*) is smaller than ε_0 if and only if n* > θ_sub. Thus, from (28), min F_{n*}(Φ*) is determined as

min F_{n*}(Φ*) = (1 - β)^{n*} min F_I + β Σ_{i=1}^{n*} (1 - β)^{n*-i} min κ_i(Φ*),

where its converged value for n* → ∞, that is, min F_∞(Φ*), is determined as min F_∞(Φ*) = 1.

Meanwhile, max F_n(Φ) is obtained with max κ_n(Φ), that is, the maximum absolute accuracy of any inaccurate rules. For any update time n, max κ_n(Φ) is given by

max κ_n(Φ) = 1 if n < θ_sub; otherwise α(min ε_n(Φ)/ε_0)^{-ν},

because min ε_n(Φ) < ε_0 holds if and only if n < θ_sub, as proven in Theorem 1. Thus, max F_n(Φ) is specified as

max F_n(Φ) = (1 - β)^n max F_I + β Σ_{i=1}^{n} (1 - β)^{n-i} max κ_i(Φ).   (39)

To identify max F_n(Φ) for n → ∞, we assume that min ε_i(Φ) of (39) is approximated by its fully-converged value, that is,

max F_∞(Φ) = α(ε_∞(φ_max)/ε_0)^{-ν}.   (40)

Note that max F_∞(Φ) is always smaller than min F_∞(Φ*). From the above discussion, min F_{n*}(Φ*) and max F_n(Φ) now include only one variable, that is, n* and n, respectively.
Theorem 2: Assume that the rule-variables are updated with the theoretical setting of β and ε_0 adjusted for φ_max. Let θ_F be the update time that first satisfies the following inequality:

min F_{n*}(Φ*) ≥ max F_n(Φ) at n*, n = θ_F.   (41)

Then, any possible accurate rule has a higher fitness than any inaccurate rule when they have been updated for n*, n ≥ θ_F.

Proof: A proof can be obtained with a procedure similar to Theorem 1. First, min F_{n*}(Φ*) < max F_n(Φ) holds for n*, n = 0, whereas min F_∞(Φ*) is greater than max F_∞(Φ). Thus, there is a finite solution θ_F that satisfies (41).
Herein, min κ_{n*}(Φ*) and max κ_n(Φ) are increasing and decreasing functions, respectively. From Lemmas 3 and 4, min F_{n*}(Φ*) and max F_n(Φ) are also increasing and decreasing functions, respectively. Accordingly, min F_{n*}(Φ*) is always greater than max F_n(Φ) for n*, n ≥ θ_F once (41) is satisfied at θ_F update times. Note that θ_F cannot be described algebraically but can be obtained with, for example, a bisection method.
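The search for θ_F can be sketched numerically as follows: the bounding error curves are iterated as expected-update recursions (in place of the closed form (14)), converted to accuracy bounds, and fed through the Widrow-Hoff rule until the worst-case accurate fitness overtakes the best-case inaccurate fitness. The constants are the Case1 values from Section III-C; this is an illustrative sketch under those assumptions.

```python
# Sketch of computing the boundary update time theta_F of Theorem 2.
# Case1 constants from Section III-C; alpha, nu are default values.
R_MIN, R_MAX = 0.0, 1000.0
ALPHA, NU = 0.1, 5.0
BETA, EPS0, THETA_SUB = 0.319347, 163.7633, 9

def eps_seq(phi, p0, e0, n_max):
    """Expected prediction-error curve for a rule with accuracy phi."""
    out, p, e = [], p0, e0
    for _ in range(n_max):
        exp_abs = phi * abs(R_MAX - p) + (1 - phi) * abs(R_MIN - p)
        e = e + BETA * (exp_abs - e)
        p = p + BETA * (phi * R_MAX + (1 - phi) * R_MIN - p)
        out.append(e)
    return out

def kappa(eps):
    # Absolute accuracy as a function of the prediction error.
    return 1.0 if eps < EPS0 else ALPHA * (eps / EPS0) ** -NU

def theta_F(phi_max, n_max=200):
    # Bounding error curves follow Lemma 1: accurate rules worst case
    # {p_I, eps_I} = {0, 1000}; inaccurate rules best case {1000, 0}.
    eps_acc = eps_seq(1.0, 0.0, 1000.0, n_max)
    eps_in = eps_seq(phi_max, 1000.0, 0.0, n_max)
    kmin = ALPHA * (1000.0 / EPS0) ** -NU   # lower end of kappa's range
    F_acc, F_in = kmin, 1.0                 # worst-case initial fitness values
    for n in range(1, n_max + 1):
        F_acc += BETA * (kappa(eps_acc[n - 1]) - F_acc)
        F_in += BETA * (kappa(eps_in[n - 1]) - F_in)
        if F_acc >= F_in:
            return n   # first n satisfying the inequality of Theorem 2
    return None
```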

2) SUPERIORITY AMONG INACCURATE RULES
The goal condition (31) can be proved with almost the same procedure as in Section III-B1. We first determine the estimation equations of min F_{n_1}(φ_1) and max F_{n_2}(φ_2).
From (28), for any update time n_1, min κ_{n_1}(φ_1) involved in min F_{n_1}(φ_1) is obtained as

min κ_{n_1}(φ_1) = α(max ε_{n_1}(φ_1)/ε_0)^{-ν},   (42)

because max ε_{n_1}(φ_1) is always greater than ε_0, as proven in Theorem 1. Thus, min F_{n_1}(φ_1) is identified as

min F_{n_1}(φ_1) = (1 - β)^{n_1} min F_I + β Σ_{i=1}^{n_1} (1 - β)^{n_1-i} min κ_i(φ_1).   (43)

Because min κ_{n_1}(φ_1) is an increasing function, min F_{n_1}(φ_1) is also increasing for n_1, as proven in Lemma 3.
Next, we consider max F_{n_2}(φ_2). Here, min ε_{n_2}(φ_2) may exceed ε_0 earlier than θ_sub update times because it may be larger than min ε_n(Φ). Let θ be the update time at which min ε_{n_2}(φ_2) reaches ε_0. Thus, max κ_{n_2}(φ_2) is 1 only for n_2 < θ. Consequently, max F_{n_2}(φ_2) is derived as

max F_{n_2}(φ_2) = (1 - β)^{n_2} max F_I + β Σ_{i=1}^{n_2} (1 - β)^{n_2-i} max κ_i(φ_2),   (44)

which is a decreasing function, as discussed in Theorem 2. Note that θ can be calculated with Newton's method by solving the equation min ε_θ(φ_2) = ε_0. Herein, with the same procedure as in (40), the converged values of min F_{n_1}(φ_1) and max F_{n_2}(φ_2) can be obtained as α(ε_∞(φ_1)/ε_0)^{-ν} and α(ε_∞(φ_2)/ε_0)^{-ν}, respectively. Because φ_1 > φ_2, the following inequality holds:

α(ε_∞(φ_1)/ε_0)^{-ν} > α(ε_∞(φ_2)/ε_0)^{-ν}.   (45)

From the above discussion, we can prove the goal condition (31) with almost the same procedure as in Theorem 2, as summarized in the formal proof below.
Theorem 3: Assume two rules having φ_1 and φ_2, respectively, whose rule-variables are updated with the theoretical setting of β and ε_0 adjusted for φ_max. Let θ_F be the update time that first satisfies the following inequality:

min F_{n_1}(φ_1) ≥ max F_{n_2}(φ_2) at n_1, n_2 = θ_F.   (46)

Then, the rule having φ_1 has a higher fitness than the rule having φ_2 when they have been updated for any n_1, n_2 ≥ θ_F.

Proof: From (43) and (44), min F_{n_1}(φ_1) < max F_{n_2}(φ_2) holds for n_1, n_2 = 0, whereas min F_{n_1}(φ_1) eventually becomes greater than max F_{n_2}(φ_2), as shown in (45). Thus, there is a finite solution θ_F of (46). Because min F_{n_1}(φ_1) and max F_{n_2}(φ_2) are increasing and decreasing functions, respectively, min F_{n_1}(φ_1) is always greater than max F_{n_2}(φ_2) once the equality of (46) is satisfied at θ_F update times.

C. GRAPHICAL EXPLANATION
To graphically demonstrate the conclusions of our analysis, we use a particular theoretical setting, Case1, which is derived for φ_max = 0.9 (see Table 1). Specifically, with φ_max = 0.9, we get θ_sub = 9 from (10). By solving the boundary equation (21), its solution is obtained as β = 0.319347, and thus ε_0 = 163.7633. Note that α and ν are set to their default values [20].
We first revisit the condition (30); that is, any accurate rule will have a higher fitness than any inaccurate rule. Figure 1 (a) shows the theoretical curves of max ε_{n*}(Φ*) and min ε_n(Φ), which explain the conclusion of the learning optimality theory. As proven in Theorem 1, max ε_{n*}(Φ*) is equal to min ε_n(Φ) at θ_sub update times; after that, they are smaller and greater than ε_0, respectively, and therefore (9) is satisfied. Correspondingly, as shown in Figure 1 (b), min κ_{n*}(Φ*) and max κ_n(Φ) are equal to 1 for n* > θ_sub and n < θ_sub, respectively. Figure 1 (c) shows min F_{n*}(Φ*) and max F_n(Φ). Note that θ_F = 11 is mathematically determined for Case1 by solving (41). From this figure, the following insights can be confirmed.
• As proven in Theorem 2, min F_{n*}(Φ*) is greater than max F_n(Φ) when they have been updated for at least θ_F update times.

Furthermore, Figure 2 shows the plots of max F_n(φ') with different values of φ' ∈ Φ. Although Φ can be any set sampled from [0.1, 0.9], we set Φ = {0.1, 0.2, ..., 0.9} for demonstration purposes. Note that max F_n(φ' = 0.9) is equal to max F_n(Φ). As proven in Lemma 2, for φ' ≥ 0.5, max F_n(φ') and max F_n(1 - φ') follow the same values. Moreover, it is confirmed that max F_n(Φ) is always greater than the curves for the other values of φ'. Thus, any accurate rule will have a higher fitness than any inaccurate rule when they have been updated for n ≥ θ_F.

Next, to demonstrate that the condition (31) holds, we set φ_1 = 0.9 and φ_2 = 0.7 as an example, where θ = 1 and θ_F = 13 are obtained. Figure 3 (a) shows the plots of max ε_{n_1}(φ_1) and min ε_{n_2}(φ_2). As noted in (42), max ε_{n_1}(φ_1) is always greater than ε_0, and min ε_{n_2}(φ_2) exceeds ε_0 earlier than θ_sub. Consequently, min κ_{n_1}(φ_1) is always smaller than 1, as shown in Figure 3 (b). Figure 3 (c) shows the plots of min F_{n_1}(φ_1) and max F_{n_2}(φ_2); min F_{n_1}(φ_1) is always greater than max F_{n_2}(φ_2) when they have been updated for n ≥ θ_F, which is consistent with Theorem 3.

IV. DISCUSSION
It is a known fact that XCS, as well as other LCSs, struggles to solve high-dimensional problems [46], [47]. For instance, XCS completely fails to improve the rule-quality on the 135-bit multiplexer problem [24], [46]. A common explanation is that discovering better rules is hindered by the huge search space of rules. Thus far, certain works have improved the rule-discovery process of XCS [48]-[52]. However, the rule-discovery process of XCS is still expected to improve the rule-quality in a stepwise manner, say, from φ = 0.5 to 0.55, and then to 0.6. Thus, the question is why such stepwise propagation may be hindered in high-dimensional problems. In this section, by utilizing our theoretical results, we present an insight in this regard.
The boundary update time, θ_F, indicates the necessary number of rule-trainings to recognize the superiority between two rules having φ_1 and φ_2. Importantly, calculating θ_F corresponds to quantifying a negative impact of the asynchrony in the fitness estimation. We numerically investigate this impact by analyzing θ_F adapted for different pairs of φ_1 and φ_2. Specifically, we set φ_max = 0.9, 0.99, 0.999, respectively, and their parameter settings are summarized as Case1, 2, 3 of Table 1. Note that the theoretical value of β decreases when φ_max increases; the dependency of θ_F on β is also discussed below. Figure 4 (a) shows specific values of θ_F for Case1, where φ_1 and φ_2 are sampled from {0, 0.1, 0.15, ..., 0.85, 0.9, 1}. The vertical and horizontal axes indicate the values of φ_1 and φ_2, respectively. Each cell is colored according to the obtained θ_F. Note that a blank (i.e., white) cell indicates that no valid value of θ_F exists because the pair of φ_1 and φ_2 is outside the definition (see Section III-A3). As explained in the previous section, accurate rules will have fitness values higher than those of any possible inaccurate rules after n ≥ θ_F = 11. Recalling θ_sub = 9 in Case1, there is almost no significant difference between θ_F and θ_sub. However, θ_F gradually increases as φ_1 and φ_2 decrease toward 0.5. For instance, θ_F increases to 31 when φ_1 = 0.55 and φ_2 = 0.5. In this sense, an additional 20 update times are required to recognize the superiority of rules having φ_1 = 0.55, compared to recognizing that of the accurate rules. This trend is further highlighted when β decreases. Figures 4 (b) and (c) show θ_F for Case2 and Case3, where θ_F for the pair of φ_1 = 0.55 and φ_2 = 0.5 increases to 361 and 3881, respectively.
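The growth of θ_F near φ = 0.5 can be related to the converged fitness bounds of (45): as the rule pair approaches the worst quality, the gap α(ε_∞(φ_1)/ε_0)^{-ν} - α(ε_∞(φ_2)/ε_0)^{-ν} that the fitness update must resolve shrinks sharply. A minimal numerical sketch with the Case1 constants (α, ν at default values):

```python
# Illustration of why theta_F grows near phi = 0.5: the converged fitness
# bounds alpha*(eps_inf(phi)/eps_0)^(-nu) of two rules get closer as the
# pair approaches the worst quality. Case1 constants; r_max - r_min = 1000.

ALPHA, NU, EPS0 = 0.1, 5.0, 163.7633

def eps_inf(phi):
    # Converged prediction error of a rule with true accuracy phi.
    return 2 * phi * (1 - phi) * 1000.0

def F_limit(phi):
    # Converged fitness bound for an inaccurate rule with accuracy phi.
    return ALPHA * (eps_inf(phi) / EPS0) ** -NU

gap_high = F_limit(0.90) - F_limit(0.85)  # a higher-quality pair
gap_low = F_limit(0.55) - F_limit(0.50)   # a pair close to the worst quality
assert gap_high > gap_low > 0
```

A smaller fitness gap takes more Widrow-Hoff steps to separate, which is consistent with the growth of θ_F observed in Figure 4.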
In summary, recognizing the superiority among low-quality rules is costly, as it requires more update times. As aforementioned, the rule-discovery process is expected to improve the rule-quality in a stepwise manner. However, our observations reveal that such stepwise propagation is systematically hindered because the number of rule-trainings required to recognize the superiority among low-quality rules increases. Consequently, XCS may struggle to identify good parent rules in early generations. Moreover, this inefficiency is further highlighted when decreasing β to a sufficiently small value. As mentioned in Section I, using a small value of β is a reasonable strategy to manage the uncertainty in the fitness estimation, but this conflicts with relaxing the negative impact of the asynchrony. In this sense, our result indicates that recognizing the superiority of low-quality rules with as few trainings as possible is crucial to improving the scalability of XCS.

V. CONCLUSION
In this paper, we theoretically provided the first proof that XCS can correctly recognize the superiority of any rules through the accuracy-based fitness under finite update times. Although this insight is limited by the assumption that the fitness is updated with the absolute accuracy, it provides an insight into the rationality of the XCS fitness-update scheme. However, our analysis further revealed that additional training of rules is required to recognize the superiority among low-quality rules. This fact indicates that XCS suffers in identifying good parent rules in GA, which hinders evolutionary propagation at early generations, where the population is filled mostly with low-quality rules.
Additionally, we provided a mathematical form for analyzing the XCS fitness-update scheme while linking it to recent theoretical insight. The presented equations and proofs should be applicable to XCS variants, e.g., XCSR and XCSI, if they employ the same learning scheme as XCS. Thus, further advanced insights are expected by investigating the impact of the relative accuracy and the fitness reduction. Moreover, we focused on the classification task with a binary reward scheme because of the assumptions of the learning optimality theory; however, a recent work has attempted to extend the theory to a multiple-reward scheme used in regression tasks [53]. Our analysis should be applicable to this scheme through the extended theory, as it utilizes the theory without any modifications.
As a different direction, our result points out the importance of reducing θ_F. Considering that the fitness value depends on the hyper-parameters α and ν, there may exist an adequate setting that reduces θ_F. Thus, we will investigate this direction further, toward a universal setting-up guide for α and ν. Moreover, we will mathematically analyze the variation of selection/deletion probabilities in GA, aiming to boost evolutionary propagation during early generations.
RUI SUGAWARA received the B.E. degree from Yokohama National University, Japan, in 2019. He is currently a graduate student at the Graduate School of Engineering Science, Yokohama National University. His research interests include evolutionary rule-based learning and its theoretical analysis.
MASAYA NAKATA (Member, IEEE) received the Ph.D. degree in informatics from the University of Electro-Communications, Japan, in 2016. He is currently an Associate Professor at the Faculty of Engineering, Yokohama National University, Japan. He has been mainly working on evolutionary machine learning, data mining, and more specifically, theoretical analysis of evolutionary rule-based learning. Since 2019, he has focused his research on surrogate-assisted evolutionary algorithms. His contributions have been published through more than 15 journal articles and 35 conference papers, such as the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, GECCO, and PPSN.