How Does the Number of Objective Function Evaluations Impact Our Understanding of Metaheuristics Behavior?

Comparing various metaheuristics based on an equal number of objective function evaluations has become standard practice. Many contemporary publications adopt the number of objective function evaluations prescribed by the benchmark set definitions. Furthermore, many publications deal with the recurrent theme of late stagnation, which may lead to the impression that continuing the optimization process could be a waste of computational resources. But is it? Recently, many challenges, issues, and questions have been raised regarding fair comparisons and recommendations towards good practices for benchmarking metaheuristic algorithms. The aim of this work is not to compare the performance of several well-known algorithms but to investigate the issues that can appear in benchmarking and comparisons of metaheuristics performance (no matter what the problem is). This article studies the impact of a higher number of evaluations on a selection of metaheuristic algorithms. We examine the effect of a raised evaluation budget on the overall performance, mean convergence, and population diversity of selected swarm algorithms and IEEE CEC competition winners. Even though the final impact varies with the selected algorithms, it may significantly affect the final verdict of a metaheuristics comparison. This work has picked an important benchmarking issue and made an extensive analysis, resulting in conclusions and possible recommendations for users working with real engineering optimization problems or researching metaheuristic algorithms. Especially nowadays, when metaheuristic algorithms are used for increasingly complex optimization problems and meet machine learning in AutoML frameworks, we conclude that the objective function evaluation budget should be considered another vital optimization input variable.


I. INTRODUCTION
Metaheuristic optimization has become a trendy topic over the last few decades. It concerns a wide range of applications, from continuous single-objective problems to discrete optimization tasks such as the Travelling Salesman Problem, engineering applications, circuit design, or scheduling [1]-[3].
There is a close connection between metaheuristics and benchmark testing. Metaheuristics typically prove their efficiency on a set of test problems, which should be diverse and unbiased [4]. In this context, the official benchmarking testbeds (e.g., the IEEE CEC benchmarks) provide many benefits. They include problems with various characteristics. The problem selection is predefined, not shaped by the strengths of the promoted algorithm. And finally, they offer performance measures and evaluation rules. As a result, the influence of benchmark testbeds goes beyond the competitions, as they are becoming the reference standard of current optimization practice.
The associate editor coordinating the review of this manuscript and approving it for publication was Yanbo Chen.
One of the significant metaheuristic struggles is stagnation, in which the algorithms can no longer create a better solution to the solved problem. This phenomenon may appear anytime but usually happens at the end of the convergence curves, depicting the mean solutions' quality based on the number of objective function evaluations (FEs) or corresponding iterations (see, e.g., [5]- [17]).
In theory, a moment should arise in which most metaheuristics will not improve the best-found solution. Therefore, all metaheuristics may be doomed to end their optimization process in stagnation, though hopefully in the desirable global optimum.
However, since many convergence curves end in such a stagnation manner (and given the often employed benchmark FEs recommendation practice), it may lead to the impression that further optimization with more evaluations may be pointless. But is it?

A. MOTIVATION
During our research, we came upon several cases where some algorithms broke through apparent stagnation when the computation budget increased. Figure 1 presents an instance of such a case. The figure depicts the mean convergence of 51 runs on the standard number of FEs and the respective raised budget. Despite the stagnation that lasted for more than 700,000 FEs, some algorithms managed to invert the trend. Since this was not an isolated case, we decided to investigate the influence of the FEs budget on the performance and inner dynamics of selected metaheuristics, namely:
• the overall performance of selected optimization algorithms,
• the success (or failure) of particular optimization algorithms, especially when compared to other optimization techniques,
• their convergence curve,
• and the population diversity of the algorithms.

B. RESEARCH IMPORTANCE EXPLANATION
The goal of this article is neither to compare several selected algorithms nor to present a comprehensive performance comparison of well-known algorithms, but to investigate the influence of the FEs budget on the final interpretation of the results. The aim is to examine the issues that can appear in benchmarking and comparisons of metaheuristics performance (no matter what the problem is).
Recently, many challenges, issues, and questions have been raised regarding fair comparisons, good practices for benchmarking of metaheuristic algorithms [18], more in-depth insights into results, statistics, analysis of behavioral patterns, and resulting recommendations [19], [20]. Best practices in benchmarking represent a significant problem nowadays, when metaheuristic algorithms are used for increasingly complex optimization problems and for the evolution of deep learning architectures in AutoML frameworks [21], [22]. Therefore, professional organizations, like IEEE, and researchers have recently established a benchmarking taskforce and networks. This article has picked an important benchmarking issue and made an extensive analysis, resulting in conclusions and possible recommendations for users. Our hope and ambition for this work are to benefit and inspire both real-world applications and the future direction of benchmark profiling.
The paper is structured as follows: Section II sums up the current evaluation practice, investigates the most used number of FEs, and examines alternative performance measures. Section III briefly introduces the metaheuristic algorithms further examined in the experiments and describes the parameter tuning process. Section IV presents and partially discusses the results of the experiment. It compares the performance, convergence, and population diversity under two FEs budget scenarios, first on a selection of swarm-based algorithms, and then analyses the impact on the effective optimizers that won the examined IEEE CEC benchmark competitions. Section V discusses the results from a broader perspective. Finally, Section VI summarizes what our findings mean for future research and practice.

II. IS THERE ANY STANDARD EVALUATIONS PRACTICE?
Many metaheuristic publications examine the proposals or modifications of existing algorithms, which leads to an inevitable comparison between the algorithms' performance or their respective versions. The requirement of an equal number of objective function evaluations has become the standard practice. It offers several advantages:
• A predefined number of FEs defines a clear and straightforward termination condition
• The majority of publications use this approach
• It may provide a fairer condition than the alternative iteration-based comparison
The last point revolves around the changeable nature of metaheuristic designs. While some algorithms evaluate the population only once per iteration, others (for instance, hybrid algorithms that merge more metaheuristics into one) may do the same several times during a single iteration loop. In such a case, comparing these algorithms based on an equal number of iterations may be biased [23], [24]. Still, an iteration-based comparison is a standard practice of many publications (see, e.g., [5]-[9]).
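The difference between an iteration-based and an FE-based budget can be made concrete with a small sketch. The wrapper below charges every objective call against a fixed budget; `CountedObjective`, `BudgetExhausted`, `run_iterations`, and the toy `sphere` function are hypothetical names introduced only for this illustration and are not taken from the article's code.

```python
class BudgetExhausted(Exception):
    """Raised when the evaluation budget is used up."""


class CountedObjective:
    """Wrap an objective so every call counts as one function evaluation (FE)."""

    def __init__(self, func, max_fes):
        self.func = func
        self.max_fes = max_fes
        self.fes = 0

    def __call__(self, x):
        if self.fes >= self.max_fes:
            raise BudgetExhausted(f"budget of {self.max_fes} FEs exhausted")
        self.fes += 1
        return self.func(x)


def sphere(x):
    """Toy objective (hypothetical stand-in for a benchmark function)."""
    return sum(v * v for v in x)


def run_iterations(objective, evals_per_iteration, population):
    """Evaluate the population `evals_per_iteration` times per iteration.

    Returns the number of iterations completed before the budget ran out.
    """
    iterations = 0
    try:
        while True:
            for _ in range(evals_per_iteration):
                for x in population:
                    objective(x)
            iterations += 1
    except BudgetExhausted:
        return iterations


population = [[1.0, 2.0], [0.5, -0.5], [3.0, 0.0], [0.0, 0.0]]
simple = CountedObjective(sphere, max_fes=100)   # evaluates once per iteration
hybrid = CountedObjective(sphere, max_fes=100)   # evaluates three times per iteration
iters_simple = run_iterations(simple, 1, population)
iters_hybrid = run_iterations(hybrid, 3, population)
print(iters_simple, iters_hybrid)
```

With the same 100 FEs, the single-evaluation loop completes 25 iterations and the triple-evaluation loop only 8, so an equal iteration count would grant the second design roughly three times as many evaluations.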
Equal FEs alone do not ensure a fair comparison. There is a general assumption that evaluating the objective function is the most computationally expensive step in the metaheuristic optimization process. That is why the number of FEs serves as a basis for algorithmic comparison and for approximating the computational complexity [25]. However, this assumption cannot be guaranteed. Some optimization methods may include operations that are even more expensive than the objective function of the solved problem [24], [26].
Another important objection is that FEs equality is not the only necessary condition for a fair algorithmic comparison. A wide range of input variables affects the optimization process as well as the prospects of an objective comparison, including the parameter settings, computational capabilities, the programming language used, or the population size, as pointed out in [24], [26].
Although the use of an evaluation budget is a common practice (e.g., in [10]-[15]), to the best of our knowledge, there is no rule limiting the ''standard'' maximum of FEs. The evaluation limits in use range from 250 FEs [27], [28] through 10,000 × D [29] to no limitation at all [30]. However, the benchmark test suites may serve as a nonnegligible source of inspiration, and many publications adopt the recommended evaluation budgets from the benchmark definitions (see, e.g., [10]-[14], [31]). Table 1 provides an overview of the IEEE CEC optimization benchmark testbeds and the corresponding limits of objective function evaluations. The FEs limits of the IEEE CEC benchmarks range from 50 × D to 10,000,000 evaluations [32], [33]. The difference lies in the problem domain specification. The high complexity of some real-world problems indirectly boosted the development of optimization methods designed to work well with a small number of FEs [27], [34]-[36]. Computationally expensive problems may hence be limited by a lower evaluation budget than easily solvable functions.
The comparison of metaheuristics comes hand in hand with various performance measures. One of the essential foundations of the experiments is the definition of the termination condition for every algorithm. The primary termination conditions include the maximum number of iterations, the maximum number of evaluations, or a predefined execution time. These conditions can be further extended to limit the maximum number of iterations without improvement or to stop upon reaching the desired objective function value [54]. Apart from the budget limitation, some benchmark recommendations also suggest an additional termination condition of an error lower than a predefined threshold, for example, 1E−8 [55].
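The termination conditions listed above can be combined in one check. The following is a minimal sketch, assuming the optimizer tracks its own FE counter, best error, and the number of FEs since the last improvement; the function name `should_stop`, its parameters, and the returned labels are hypothetical.

```python
def should_stop(fes, max_fes, best_error, fes_since_improvement,
                error_threshold=1e-8, max_stagnation_fes=None):
    """Return the reason to stop, or None to continue.

    fes                    -- objective function evaluations used so far
    max_fes                -- evaluation budget
    best_error             -- best error found so far
    fes_since_improvement  -- FEs spent since the best error last improved
    error_threshold        -- stop once the error drops below this value
    max_stagnation_fes     -- optional limit on FEs without improvement
    """
    if best_error < error_threshold:
        return "target_reached"
    if fes >= max_fes:
        return "budget_exhausted"
    if max_stagnation_fes is not None and fes_since_improvement >= max_stagnation_fes:
        return "stagnation"
    return None


print(should_stop(fes=50_000, max_fes=100_000, best_error=3.2,
                  fes_since_improvement=40_000, max_stagnation_fes=30_000))
```

The stagnation rule is left optional on purpose: a limit on FEs without improvement stops a run exactly in the situation where, according to the motivation of this article, some algorithms later resume convergence.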
The performance measures often include the mean statistical error of the best, worst, mean and median solutions, and their standard deviations. However, since the No Free Lunch theorem [55] states that there is no "universal" best performing algorithm for every possible problem, solely performance-oriented experiments usually cannot lead to general conclusions. A clear win of one algorithm on one set of problems does not mean that the algorithm would be usable on a different set. The comparison experiments need to provide further insights into the algorithms' inner dynamics to support their future usability. Such measures investigate, e.g., the population diversity, the algorithms' robustness, or the convergence curves.
A vast number of convergence curves indicate later stagnation [5]- [17], which may lead to a (possibly false) impression that raising the FEs budget would lead to a redundant computational effort. Yet several studies imply that the maximum number of FEs affects the overall performance of the algorithms [19], [26], [56], [57]. In the Particle Swarm Optimization Evaluations study [26], Engelbrecht suggests that a large number of evaluations may not be beneficial to small populations due to the premature stagnation.
In the proposal of the Passing Vehicle Search algorithm [56], the authors (Savsani and Savsani, 2016) compare the proposed algorithm on 13 engineering applications. However, since the results of other algorithms were taken from broad literature, all the compared algorithms used different FEs limit. To provide a fair comparison, the authors executed the Passing Vehicle Search algorithm with the corresponding FEs budget to many of the respective scenarios. This experiment revealed a slight improvement of the mean and worst solutions with a higher evaluation budget.
In the Conceptual Comparison of Several Metaheuristics [57], the author (Ezugwu, 2020) shows the mean best solution error with a variable FEs limit. The results differ based both on the solved problems and the solving algorithm. Another noteworthy publication [19] (LaTorre, 2020) shows that the average ranks of the compared algorithms differ significantly depending on the number of FEs.
This article investigates five swarm algorithms and two benchmark competition winners on two sets of IEEE CEC benchmark testbeds with two evaluation budgets. We aim to uncover the effect of the FEs limit on population diversity, convergence, and the algorithms' overall performance.

III. BRIEF DESCRIPTION OF COMPARED ALGORITHMS
To analyze the impact of a higher number of objective function evaluations, we compared the optimization performance of selected algorithms solving two benchmark sets of problems: IEEE CEC 2015 and 2017. The examined swarm algorithms were the Particle Swarm Optimization [58], the Cuckoo Search [59], the Bat Algorithm [60], the Firefly Algorithm [61], and the Bison Algorithm [62]. Further, we analysed the impact of the FEs budget on the winners of the CEC benchmark competitions: the L-SHADE with Eigenvector Crossover and Successful Parent-Selecting Framework [63], and the Effective Butterfly Optimizer with Covariance Matrix [64]. The following section briefly introduces the examined algorithms.

A. PARTICLE SWARM OPTIMIZATION
The Particle Swarm Optimization algorithm (PSO) was proposed by Kennedy and Eberhart in 1995 [58]. It is by far the most popular swarm metaheuristic, with more than 55,000 publications in the Web of Science database. The algorithm found inspiration in the emergent behavior of bird flocks and fish swarms. There are two basic versions of the algorithm: local best and global best, distinguished by the solution's neighborhood topology [65]. This article studies PSO with the global best topology.
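For orientation, a textbook global-best PSO that terminates on an FE budget can be sketched as follows. This is a generic illustration, not the implementation used in the experiments: the inertia weight `w`, the acceleration constants `c1` and `c2`, the population size, the bounds, and the `sphere` test function are all assumed values chosen for the example, not the tuned settings of Table 2.

```python
import random


def sphere(x):
    """Toy objective (hypothetical stand-in for a benchmark function)."""
    return sum(v * v for v in x)


def pso_gbest(func, dim, max_fes, pop_size=20, w=0.7, c1=1.5, c2=1.5,
              bounds=(-100.0, 100.0), seed=0):
    """Global-best PSO that stops after exactly max_fes objective evaluations."""
    if max_fes < pop_size:
        raise ValueError("budget must cover the initial population")
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    vel = [[0.0] * dim for _ in range(pop_size)]
    pbest = [p[:] for p in pos]
    pbest_val = [func(p) for p in pos]      # initial evaluations count too
    fes = pop_size
    g = min(range(pop_size), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    while fes < max_fes:
        for i in range(pop_size):
            if fes >= max_fes:
                break
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to personal best + pull to global best.
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                # Keep the particle inside the search bounds.
                pos[i][j] = min(hi, max(lo, pos[i][j] + vel[i][j]))
            val = func(pos[i])
            fes += 1
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val, fes


short_run = pso_gbest(sphere, dim=5, max_fes=200)
long_run = pso_gbest(sphere, dim=5, max_fes=2000)
print(short_run[1], long_run[1])
```

Because both runs share the same seed, the longer run retraces the shorter one for its first 200 FEs, so its final error can only be equal or lower; whether the extra evaluations actually help is the question the experiments address.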

B. CUCKOO SEARCH OPTIMIZATION
The Cuckoo Search (CS) was developed by Yang and Deb in 2009 [59]. The algorithm simulates the egg-laying patterns of cuckoos. It offers the advantage of only two configurable parameters and has proved to be a powerful optimization tool.

C. BAT ALGORITHM
The Bat Algorithm (BAT) simulates the echolocation ability of microbats [60]. The algorithm employs a frequency tuning mechanism, which, by extension, acts as a mutation factor, as it affects mainly local solutions [66].

D. FIREFLY ALGORITHM
The Firefly Algorithm (FFA) was designed by Yang in 2010 [61]. The algorithm simulates the courtship of fireflies. In the courting ritual, each firefly looks around, and when it finds a firefly shining brighter than itself, it moves closer. The most glowing firefly finally performs a movement in a random direction.

E. BISON ALGORITHM
The Bison Algorithm (BIA) (Kazikova, 2018) models bison herds' protecting and running behavior [67]. It divides the population into two groups. The first group exploits known solutions by moving them closer to the center of several fittest solutions. In contrast, the second group explores the search space to avoid the trap of local optima.

F. SPS L SHADE EIG
The L-SHADE with Eigenvector-Based Crossover and Successful Parent-Selecting Framework (SPS L SHADE EIG) was proposed by Guo et al. in 2015 [63]. The acronym stands for a Success-History based Adaptive Differential Evolution with a linear decrease of the population size (L-SHADE), Eigenvector-Based Crossover (EIG), and Successful Parent-Selecting Framework (SPS). It is a variant of a Differential Evolution, which implements the memories of previously successful parameters. The Successful Parent-Selecting Framework provides a way to overcome stagnation. The algorithm won the IEEE CEC 2015 competition [49].

G. EBO WITH CMAR
The Effective Butterfly Optimizer with Covariance Matrix Adapted Retreat Phase (EBO with CMAR) was proposed by Kumar et al. in 2017 [64]. It is a hybrid self-adaptive algorithm, which combines the features of global and local optimizers. The original Effective Butterfly Optimizer algorithm is enhanced with success-history based adaptation and linear population size reduction, while the Covariance Matrix Adapted Retreat Phase improves the local search. The algorithm won the IEEE CEC 2017 competition [29].

1) PARAMETER TUNING
To determine the appropriate parameters for the solved problems, we tested the algorithms in 10 and 30 dimensions on the IEEE CEC 2017 benchmark with a selection of recommended configurations from the following literature: [61], [68]- [76]. We investigated the statistical significance of the results (p < 0.05) with the Wilcoxon Rank-Sum test and the Friedman Rank test. Further experiments use the winning parameter configurations defined in Table 2. The Parameter Tuning Experiment was extended in [77].
The algorithm implementations were derived from the EvoloPy library [70] and [78]. All codes of the swarm optimizers are freely available at TBU A.I.Lab's GitHub repository. The codes of the competition winners, SPS L SHADE EIG and EBO with CMAR, were adopted from the official CEC Benchmarking Github with the included tuned parameters.

IV. COMPARING ALGORITHMS WITH TWO OBJECTIVE FUNCTION EVALUATION BUDGETS
We compared the algorithms on two benchmark sets of IEEE CEC 2015 and 2017 [29], [49] in 10, 30, 50, and 100 dimensions in 51 runs. We examined two evaluation budgets: the standard evals scenario of 10,000 × D FEs, as recommended in the Problem Definitions and Evaluation Criteria for both testbeds, and the 7 evals scenario of 70,000 × D FEs.
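The absolute budgets implied by the two scenarios follow directly from the dimension. The snippet below only performs that arithmetic; the names `DIMENSIONS` and `budgets` are hypothetical.

```python
DIMENSIONS = (10, 30, 50, 100)   # dimensions examined in the experiment


def budgets(dim, multiplier=7):
    """Return (standard, raised) FE budgets for a D-dimensional problem.

    standard = 10,000 x D, raised = multiplier x standard (70,000 x D for 7).
    """
    standard = 10_000 * dim
    return standard, multiplier * standard


for d in DIMENSIONS:
    std, raised = budgets(d)
    print(f"D={d:>3}: standard evals = {std:>9,} FEs, 7 evals = {raised:>9,} FEs")
```

For D = 100 the raised budget is 7,000,000 FEs, of which 6,000,000 lie beyond the standard threshold.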
Our experiment first analyzed the impact on the swarm algorithms separately and then added the competition winning optimizers. Separating these two analyses allowed for an easier recognition of the FEs limit effect, as the influence of the higher budget is more visible on algorithms of comparable performance, and it is less complicated to detect a renewed convergence on a smaller range of error values.
We studied the influence of the FEs budget on the algorithms' inner dynamics. Therefore, we assumed that the choice of the algorithms should not be decisive, supposing that what works on elementary optimizers may also apply to more advanced optimization techniques. To confirm (or disprove) this hypothesis, we subsequently analyzed the competition testbeds' winning algorithms with a higher FEs budget.
All the swarm algorithms started with a randomly generated initial population. They were programmed in the same environment, using the same programming language, with parameters based on the Parameter Tuning Experiment.
The following sections first examine the swarm algorithms: the mean solution errors with statistical tests in Section IV-A, the convergence curves in Section IV-B, and the population diversities in Section IV-C. Finally, Section IV-D examines the impact on the competition winners. The experiments first compare the tested scenarios separately and then investigate the difference that the bigger evaluation budget caused.

A. OVERALL PERFORMANCE (MEAN SOLUTION ERROR)
This section investigates the algorithms' overall performance by comparing the solution errors with the Wilcoxon Rank-Sum tests (p < 0.05). The test determines on how many problems an algorithm significantly outperformed the others.
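Counting significant wins per problem can be sketched as below. This is a plain-Python two-sided rank-sum test using the normal approximation without tie or continuity correction, which is an assumption about the exact variant; the article does not state which implementation was used. The helper names and the toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p); a negative z means sample `a` tends to hold smaller values.
    """
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    r1 = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (r1 - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def count_wins(errors_a, errors_b, alpha=0.05):
    """Count problems where A (resp. B) has significantly lower errors."""
    wins_a = wins_b = 0
    for a, b in zip(errors_a, errors_b):
        z, p = rank_sum_test(a, b)
        if p < alpha:
            if z < 0:
                wins_a += 1
            else:
                wins_b += 1
    return wins_a, wins_b


# Hypothetical final errors of two algorithms on three problems (5 runs each).
errors_a = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [6, 7, 8, 9, 10]]
errors_b = [[6, 7, 8, 9, 10], [2, 4, 6, 8, 10], [1, 2, 3, 4, 5]]
wins = count_wins(errors_a, errors_b)
z_first, p_first = rank_sum_test(errors_a[0], errors_b[0])
print(wins, round(z_first, 3))
```

In this toy data each algorithm wins one problem and the interleaved second problem shows no significant difference. In the experiments themselves each sample would hold the 51 final errors of one algorithm on one problem.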
We compared the results across the five swarm algorithms within each scenario: standard evaluations in Tables 3 and 4, and 7 evals in Tables 5 and 6. Then we investigated the impact of the different evaluation budgets. Table 7 and Table 8 examine the difference in the interpretation of the previous results. Figure 2 shows the ranks computed by the Friedman Rank test (p < 0.05), which compared all the swarm algorithms in both scenarios on the IEEE CEC 2015 testbed. The third column presents the difference in ranks for each algorithm.
Finally, we evaluated the individual impact of the bigger evaluation budget on every tested algorithm separately. The Wilcoxon pair-wise Rank-Sum tests (p < 0.05) comparing the 7 evals and standard evals scenarios of one particular algorithm are in Tables 9 and 10. The benefit is computed by Eq. 1, which sums the differences between the 7 evals wins and the standard evals wins from the Wilcoxon Rank-Sum tests shown in Tables 16 and 17. The percentual benefit presented on the last line gives the percentage of problems positively affected by the higher evaluation budget (Eqs. 1, 2).

$$B_D = \sum_{i=1}^{F_{max}} (W_i - w_i) \qquad (1)$$

$$Benefit = \frac{B_{10D} + B_{30D} + B_{50D} + B_{100D}}{4 \cdot F_{max}} \cdot 100\% \qquad (2)$$

where:
• $B_D$ represents the benefit of the 7 evals scenario against the standard evaluations scenario,
• $D$ presents the examined dimension,
• $F_{max}$ is the number of problems in the benchmark testbed,
• $W_i$ and $w_i$ stand for the number of wins from the Wilcoxon Rank-Sum test from the 7 evals and standard evals scenarios respectively (from Tables 16 and 17),
• $Benefit$ represents the percentage of problems positively affected by the higher evaluation budget,
• and $B_{10D}$ presents the benefit computed by Eq. 1 on 10-dimensional benchmark problems.
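The benefit computation can be sketched in a few lines. The code assumes that Eq. 1 sums the per-problem differences between the 7 evals wins and the standard evals wins, and that Eq. 2 normalizes the summed benefits by the number of problems times the number of examined dimensions; this is a reading of the definitions above, and the function names and the sample win counts are hypothetical.

```python
def benefit_per_dimension(wins_7evals, wins_std):
    """Eq. 1 as read here: sum over problems of (W_i - w_i).

    wins_7evals[i] / wins_std[i] are the Wilcoxon wins on problem i
    for the 7 evals and the standard evals scenario, respectively.
    """
    if len(wins_7evals) != len(wins_std):
        raise ValueError("both scenarios must cover the same problems")
    return sum(W - w for W, w in zip(wins_7evals, wins_std))


def percentual_benefit(benefit_by_dim, f_max):
    """Eq. 2 as read here: benefits summed over dimensions, as a percentage
    of all (problem, dimension) pairs."""
    return 100.0 * sum(benefit_by_dim.values()) / (len(benefit_by_dim) * f_max)


F_MAX = 15  # number of problems in the IEEE CEC 2015 testbed

# Hypothetical case: the raised budget wins 9 of 15 problems, never loses.
b_30 = benefit_per_dimension([1] * 9 + [0] * 6, [0] * 15)
single_dim = percentual_benefit({30: b_30}, F_MAX)

# Hypothetical benefits across all four dimensions.
overall = percentual_benefit({10: 0, 30: 9, 50: 6, 100: 3}, F_MAX)
print(b_30, single_dim, overall)
```

Under this reading, 9 significantly improved problems out of 15 in one dimension correspond to a 60% benefit, and a negative total arises whenever the standard budget wins more often than the raised one.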
Similarly, Figure 3 shows the pair-wise Friedman Rank test (p < 0.05), which compared the evaluation scenarios on every swarm algorithm individually. All the difference-oriented experiments were computed by subtracting the standard evals results from the 7 evals results.
It is important to note that the previous results compare only the final interpretations of the results with the statistical tests. The overall impact on each algorithm on its own was investigated in Tables 9 and 10. These tests compared the final results of both tested scenarios on a single algorithm to estimate the benefit of higher evaluations for every individual algorithm separately. Despite the previous findings, the effect of the 7 evals budget was beneficial for most of the tested algorithms: mostly for the Firefly Algorithm (with a 93% percentual benefit on CEC 2015) and the Cuckoo Search algorithm. This discovery underlines the results in Tables 5 and 6, since almost all of the algorithms performed significantly better than in the standard evaluation scenario.
Unexpectedly, higher evaluations slightly harmed the Bat Algorithm, especially in the lower dimensions, where its percentual benefit ended up in negative numbers. The Friedman Rank test in Figure 3 confirmed these findings. In the pair-wise comparison, every algorithm, with the exception of BAT, significantly improved its rank in the higher evaluation scenario.
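The ranks behind a Friedman test can be reproduced with a short sketch: rank the algorithms on each problem, average the ranks, and compute the Friedman chi-square statistic. The p-value is omitted because it needs a chi-square distribution function. The function names and the error table are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks (lower value = better); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def friedman_mean_ranks(errors):
    """errors[p][a] is the mean error of algorithm a on problem p.

    Returns (mean rank per algorithm, Friedman chi-square statistic).
    """
    n = len(errors)          # number of problems
    k = len(errors[0])       # number of compared algorithms or scenarios
    totals = [0.0] * k
    for row in errors:
        for a, r in enumerate(average_ranks(row)):
            totals[a] += r
    mean_ranks = [t / n for t in totals]
    stat = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0)
    return mean_ranks, stat


# Hypothetical mean errors of three algorithms on four problems.
errors = [
    [1.0, 2.0, 3.0],
    [1.0, 3.0, 2.0],
    [2.0, 1.0, 3.0],
    [1.0, 2.0, 3.0],
]
mean_ranks, stat = friedman_mean_ranks(errors)
print(mean_ranks, stat)
```

For the pair-wise comparison of the two scenarios of one algorithm, each row would hold just two values, the standard evals and the 7 evals mean error on that problem.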

B. CONVERGENCE
The convergence experiment investigated the development of the mean solution errors based on a varying number of evaluations. Figure 4 depicts the convergence curves of the 15 100-dimensional problems of the IEEE CEC 2015 testbed with a higher evaluation budget. It also shows the standard evaluation threshold, which would be the last stop of the optimization if it followed the benchmark recommendations [29], [49]. Figure 5 shows a selection of interesting cases in which the convergence of the standard evaluation budget ends in stagnation. Surprisingly, further evaluations provided an unprecedented improvement in the mean error.
It should be noted that the convergence curves (Figs. 4-6) were approximated from 14 error values according to the IEEE CEC 2017 benchmark set recommendation [29].
Figure 4 shows a substantial drop of the mean error value even after the standard evaluation threshold in most of the displayed cases. The only exception is Function 11, which stagnated for more than 6.5E6 FEs. Figure 5 illustrates the potential benefit of higher evaluations, with some remarkable convergence twists. This figure emphasizes that even apparent stagnation does not necessarily mean that further optimization is pointless. This notably concerns the Cuckoo Search optimization, but also other algorithms such as the Firefly Algorithm, the Bison Algorithm, or the Particle Swarm Optimization.
VOLUME 9, 2021
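A mean convergence curve built from 14 recorded error values, and a check for renewed convergence after a threshold, can be sketched as follows. The checkpoint fractions are an assumption based on our recollection of the CEC 2017 recording rule and should be verified against the benchmark definition [29]; the function names and the two sample runs are hypothetical.

```python
# Assumed recording points as fractions of the maximum FEs (14 values).
CHECKPOINT_FRACTIONS = (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3,
                        0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)


def checkpoints(max_fes):
    """FE counts at which the error of a run is recorded."""
    return [round(f * max_fes) for f in CHECKPOINT_FRACTIONS]


def mean_convergence(runs):
    """runs[r][c] is the best error of run r at checkpoint c."""
    n = len(runs)
    return [sum(run[c] for run in runs) / n for c in range(len(runs[0]))]


def first_improvement_after(mean_errors, fes_points, threshold_fes):
    """Return the first checkpoint beyond threshold_fes at which the mean error
    drops below the best value recorded up to threshold_fes, else None."""
    best_before = min(e for e, f in zip(mean_errors, fes_points)
                      if f <= threshold_fes)
    for e, f in zip(mean_errors, fes_points):
        if f > threshold_fes and e < best_before:
            return f
    return None


# Two hypothetical runs on a 10-dimensional problem with the raised budget.
fes_points = checkpoints(700_000)
runs = [
    [110, 90, 70, 60, 50, 50, 50, 50, 50, 50, 40, 30, 20, 10],
    [90, 70, 50, 40, 30, 30, 30, 30, 30, 30, 20, 10, 0, 0],
]
curve = mean_convergence(runs)
renewed_at = first_improvement_after(curve, fes_points, threshold_fes=100_000)
print(curve, renewed_at)
```

Note that with these fractions the standard threshold (one seventh of the raised budget) falls between two recorded points, so a curve drawn from 14 values locates the onset of renewed convergence only to within one checkpoint interval.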

C. POPULATION DIVERSITY
Finally, we investigated the impact of a higher evaluation budget on the population diversity of the swarm algorithms. The population diversity measure is described in Eqs. 3, 4 [80], and the data are presented as a relative percentage of a theoretical maximum of the diversity value.

$$\bar{x}_j = \frac{1}{NP} \sum_{i=1}^{NP} x_{i,j} \qquad (3)$$

$$PD = \sqrt{\frac{1}{NP} \sum_{i=1}^{NP} \sum_{j=1}^{D} \left(x_{i,j} - \bar{x}_j\right)^2} \qquad (4)$$

where:
• $PD$ is the population diversity,
• $NP$ is the population size,
• $D$ presents the dimensionality of the problem,
• $i$ and $j$ are the population and dimension iterators, respectively,
• $x_{i,j}$ represents the vector value of the solution at the given dimension,
• and $\bar{x}_j$ presents the corresponding mean of the solutions.
The population diversities are reported for the standard evals scenario (Tables 11, 18) and the 7 evals scenario (Tables 12, 19). Since the population diversities are very similar for both of the tested benchmarks, the partial diversity results for the IEEE CEC 2017 benchmark are listed separately (Tables 18, 19).
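A diversity measure consistent with the symbols defined above (the root of the mean summed squared distance of the solutions from the population centroid) can be computed as sketched below. The article does not state how its theoretical maximum was obtained; the sketch assumes a box-constrained search space and an even population size, for which the largest possible value is sqrt(D) times half the range, reached when half of the population sits in each of two opposite corners. The function names and the bounds of [-100, 100] are illustrative assumptions.

```python
import math


def population_diversity(population):
    """sqrt( (1/NP) * sum_i sum_j (x_ij - mean_j)^2 )."""
    np_size = len(population)
    dim = len(population[0])
    means = [sum(ind[j] for ind in population) / np_size for j in range(dim)]
    squared = sum((ind[j] - means[j]) ** 2
                  for ind in population for j in range(dim))
    return math.sqrt(squared / np_size)


def max_diversity(dim, lower, upper):
    """Assumed theoretical maximum for a box [lower, upper]^dim."""
    return math.sqrt(dim) * (upper - lower) / 2.0


def relative_diversity(population, lower, upper):
    """Diversity as a percentage of the assumed theoretical maximum."""
    dim = len(population[0])
    return 100.0 * population_diversity(population) / max_diversity(dim, lower, upper)


spread = [[-100.0, -100.0], [100.0, 100.0]]   # two opposite corners
collapsed = [[3.0, 4.0], [3.0, 4.0]]          # population merged into one point
print(relative_diversity(spread, -100.0, 100.0),
      relative_diversity(collapsed, -100.0, 100.0))
```

A value near 0% signals that the population has merged into a single location, while values that stay well above zero indicate that exploration is still possible.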
The final difference between the examined diversities (Table 12 − Table 11 and Table 19 − Table 18) on both tested benchmark testbeds is calculated in Table 13 and Table 14. Again, a positive value in these last two tables would mean a higher diversity of the 7 evals budget.

1) POPULATION DIVERSITY DISCUSSION
The effect of the higher evaluation budget on the overall population diversity depends on both the algorithm and the problem. While the population diversity mostly decreases when solving the IEEE CEC 2015 benchmark (Table 13), the effect on the IEEE CEC 2017 population diversity was negligible (Table 14). Table 13 revealed that for most of the algorithms, the higher evaluation budget lowered the population diversity on the IEEE CEC 2015 benchmark testbed. The adverse effect concerned foremost the Cuckoo Search optimization, especially in the lower dimensions, but even so, the diversity rate still stayed relatively high (see Table 12). The effect on other algorithms' diversities was either low (BIA, PSO, BAT) or none (FFA).
The extremely low population diversities of some algorithms (as can be seen at the end of some algorithms in Fig. 6) hint that the population merged into nearly a single location. This affects the exploration ability of the algorithms and points to a possible entrapment in a local optimum. On the other hand, the general progress of the population distribution (Figure 6, Tables 11 and 12) showed a steady diversity rate of the Bison Algorithm on the whole range of tested problems.

D. IMPACT OF RAISED FEs BUDGET ON THE CEC COMPETITIONS WINNERS
So far, we have focused on the selected five swarm algorithms available from the TBU AILab's Github. However, we were also interested in whether the raised budget would impact even advanced optimizers such as the winners of the examined CEC competitions. We used the public codes of the competing algorithms available at the official CEC Benchmarking Github 7,8 with a raised FEs budget of the SPS L SHADE EIG on the CEC 2015 test set and EBO with CMAR on the CEC 2017 learning-based test set. We followed the benchmark directions during this experiment, with 51 optimization runs per problem and max FEs of 10,000 × D and 70,000 × D in the std evals and 7 evals scenarios, respectively. The parameters of the optimizers were already tuned in the original source codes provided in the official CEC Benchmarking Github.
The following section discusses the results comparing the algorithm in the two evaluation scenarios. Table 15 presents the impact of the higher evaluation budget on the CEC 2015 winner and the CEC 2017 winner on their respective winning pools. The table shows the number of problems in which the raised budget produced significant improvement against the standard evaluation budget according to the Wilcoxon Rank-Sum test (α = 0.05) and the corresponding percentual benefit (see Eqs. 1,2). Similarly, Figure 7 presents the Friedman Rank test comparing the two evaluation scenarios on the examined algorithm on its respective winning test pool.
These results imply that the raised budget helps even successful versions of SHADE and EBO. It might appear that the higher budget provided a lower percentual benefit for the SPS L SHADE EIG algorithm when compared to other algorithms (see Table 9). However, it is important to note that the CEC 2015 winner sometimes found the exact optimum already within the standard evaluation budget. In these cases, raising the budget does not contribute to the algorithm's performance, as can be seen in the 10 dimensions in Table 15.
7 https://github.com/P-N-Suganthan/CEC2017-BoundContrained
8 https://github.com/P-N-Suganthan/CEC2015-Learning-Based
On the other hand, in 30 dimensions, the higher evaluation budget significantly improved 9 out of 15 problems. These conclusions were confirmed by the pair-wise Friedman rank test in Figure 7, in which the 7 evals scenario outperformed the standard budget scenario significantly.
The improvements appeared even in the convergence analysis of the winners. Figure 8 shows the mean convergences of selected cases, in which the algorithms overcame an apparent stagnation when allowed for the raised FEs budget. The previous section discussed the impact of a higher FEs budget on the comparison experiment's final interpretation. However, when we added the benchmark winners to the optimizers, the statistical tests favored the winning algorithm regardless of the budget scenario. In several cases, the higher budget neutralized the lead; where the winner had significantly better results in the standard FEs budget, it lost the lead's significance with a higher budget. Figures 9 and 10 show the Friedman Rank test comparing the algorithms on their respective winning test pools.

V. DISCUSSION
Heuristic optimization may resemble any other race: even this year's Formula 1 champion is not a guaranteed winner on a different track, riding a backup vehicle, having one liter of fuel less, or racing other competitors. Similarly, in metaheuristics comparison, investigating the inner principles is preferable to ranking the algorithms purely by performance measures. In this article, we focused not only on the performance measures but also on the algorithms' convergence patterns and population diversities.
We investigated the mean error convergences and found that some algorithms improved significantly even after a long stagnation. We discovered that the effect of the evaluation budget on population diversity is problem-dependent. While more evaluations diminished the population diversity on the IEEE CEC 2015 benchmark's problems, the impact on IEEE CEC 2017 was negligible. Finally, we found that the evaluation budget significantly affected six of the tested algorithms and considerably improved their performance.
The most critical effect concerned the interpretation of the statistical results when comparing the swarm algorithms separately. While in the standard evaluation scenarios (the last rows of Tables 3, 4), the sum of Wilcoxon Rank-Sum test wins was in favor of the Bison Algorithm, more evaluations (Tables 5, 6) promoted the Cuckoo Search optimization, and therefore flipped the final winner of this particular race in the swarm optimizers comparison. The effect was damped when we added the competition winners to the list of optimizers, as the Friedman Rank test in Figures 9 and 10 ranked the SPS L SHADE EIG and EBO with CMAR first in both of the tested scenarios.
We also examined the impact of a higher evaluation budget on successful optimizers that won the IEEE CEC competitions. The raised budget allowed for a significantly better performance of these algorithms.
Interestingly, the overall percentage benefit of a higher evaluation budget was considerably lower for the CEC 2015 winner (compare Tables 9 and 15). The reason may be that the SPS L SHADE EIG sometimes found the exact optimum already within the standard evaluation budget, hence the 0% benefit for 10-dimensional problems. However, in higher dimensions, the benefit reappeared (up to 60% in 30 dimensions). Therefore, we can recommend raising the evaluation budget even for effective optimizers when solving problems whose optimum is unknown.
Finally, we would like to highlight that many applications, such as computationally expensive problems, do not allow for numerous evaluations. Furthermore, this article did not address the algorithms' complexity, which does not change substantially with the FEs but can greatly restrict the choice of evaluation budget under limited resources or time constraints. The presented experiments took from minutes to days, depending on the algorithm.

VI. CONCLUSION
The function evaluation limit impacts both the performance and behavior of metaheuristics. Therefore, when one solves a nontrivial problem without a pressing deadline, it may be reasonable to use a higher evaluation budget to obtain better results.
It might seem that our paper states the obvious: that a longer optimization process sometimes produces better results. However, there is more to the story. Our primary goal was to point out the importance of focusing on good practice in benchmarking and transferring this knowledge into engineering optimization practice and metaheuristic development. The most important findings can be summarised as follows. We discovered that some metaheuristics conceal a hidden asset that only a raised number of objective function evaluations may reveal: the ability to renew the convergence after apparent stagnation. This feature may be crucial for specific real-world applications. Also, not every algorithm has this ability; some algorithms are more affected by the evaluation budget than others. In this context, the Cuckoo Search optimization delivered impressive results with more evaluations, as it repeatedly recovered convergence even after periods of stagnation.
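One way to make "renewed convergence after apparent stagnation" operational is sketched below. The definition is an assumption made for illustration, not the criterion used in this study: a stagnation is taken to be at least `window` consecutive records of the best-so-far error without an improvement larger than `tol`, and a renewal is the first improvement that follows such a period. The function name and parameters are hypothetical.

```python
def convergence_renewals(best_errors, window, tol=0.0):
    """Return indices where the best error improves after a stagnation period.

    `best_errors` is the best-so-far error recorded at regular intervals.
    A stagnation is `window` or more consecutive records without an
    improvement larger than `tol` (an assumed, illustrative definition).
    """
    renewals = []
    stalled = 0  # records elapsed since the last improvement
    for i in range(1, len(best_errors)):
        if best_errors[i - 1] - best_errors[i] > tol:
            if stalled >= window:
                renewals.append(i)
            stalled = 0
        else:
            stalled += 1
    return renewals


# Illustrative convergence curve: a plateau at 8 followed by a new descent.
curve = [10, 8, 8, 8, 8, 5, 5, 4]
renewal_points = convergence_renewals(curve, window=3)
```

Cutting such a curve off at the end of the plateau would hide the later descent, which is why the chosen budget can change how an algorithm's behavior is judged.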
The impact of the function evaluation budget on population diversity was no less interesting. While one set of problems was unaffected by the change in the evaluation budget, there was a significant drop in the population diversity on the other one. However, lower population diversity in later iterations raises the chance of trapping the entire population in a local optimum.
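For readers who want to track diversity in their own runs, the following minimal sketch uses one common measure, the mean Euclidean distance of the individuals from the population centroid. This measure is an assumption for illustration and is not necessarily the one used in our experiments; the function name is hypothetical.

```python
import math


def population_diversity(population):
    """Mean Euclidean distance of the individuals from the population centroid.

    `population` is a non-empty list of equally long coordinate lists.
    """
    n = len(population)
    dim = len(population[0])
    centroid = [sum(ind[d] for ind in population) / n for d in range(dim)]
    return sum(math.dist(ind, centroid) for ind in population) / n


# A spread-out population versus one collapsed onto a single point.
spread = population_diversity([[0.0, 0.0], [2.0, 0.0]])
collapsed = population_diversity([[1.0, 1.0], [1.0, 1.0]])
```

A value near zero late in the run indicates that the population has collapsed to a single region, which is the situation in which escaping a local optimum becomes unlikely.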
The world of metaheuristic optimization is currently ruled by the No Free Lunch theorem, which states that there is no universal best-to-solve-it-all algorithm. According to this theorem, various algorithms are more (or less) suitable for solving different kinds of problems. At the same time, the success of a metaheuristic depends on a whole list of conditions, including dimensionality, solving algorithm selection, parameter configuration, adopted border strategies, problem dynamics, and the objective function. To this list of conditions affecting optimization, we recommend adding another input variable: the function evaluation limit.
So far, the benchmarking testbeds have focused mostly on solving problems fast, which corresponds with most of the objective function evaluation budget limitations. However, as our paper revealed, some algorithms can overcome the potential stagnation with an increased FEs budget. Hence, it might be useful if future benchmark testing considered increased FEs budgets besides the currently adopted limits. Uncovering this feature, and investigating why some algorithms can renew the convergence even after apparent stagnation, might shape the next generation of benchmark profiling.
In this article, we have discovered that some algorithms are able to renew the convergence within the extended FEs budget. It is possible that with an even larger FEs budget, other algorithms would be able to renew the convergence as well, despite failing to do so within our experiments. However, for practical purposes, the number of function evaluations must be limited to a finite value. Therefore, we believe that instead of increasing the FEs limit further, an in-depth study of the algorithms is necessary to uncover the key features behind the convergence renewal ability, and we hope our work will encourage such a study.