An Improved Equilibrium Optimizer Algorithm for Features Selection: Methods and Analysis

In the last decade, data generated by digital devices have posed a remarkable challenge for data representation and analysis. Because of high-dimensional datasets and the rapid growth of data volume, many challenges have been encountered in fields such as data mining and data science. Conventional machine learning classifiers have limited ability to handle the problems of high dimensionality, which include memory limitations, computational cost, and low accuracy. Consequently, there is a need to reduce the dimension of datasets by choosing the most significant features, which represent the data efficiently with minimum volume. This study proposes an improved binary version of the equilibrium optimizer algorithm (IBEO) to address the features selection problem. Two main enhancements are added to the original equilibrium optimizer (EO) to strengthen its performance. Opposition-based learning is the first enhancement, added to the initialization stage of EO to increase the diversity of the population in the search space. A local search algorithm is the second enhancement, added to improve the exploitation of EO. Wrapper approaches can offer high-quality solutions. Thus, we used the k-nearest neighbour and support vector machine classifiers, which are among the most popular choices in wrapper methods. Moreover, dealing with over-fitting is an essential task, which motivates applying k-fold cross-validation to split each dataset into training and testing data. Comparative tests with well-known algorithms such as grey wolf optimization, grasshopper optimization, particle swarm optimization, whale optimization, dragonfly, and improved salp swarm algorithms are conducted. The proposed algorithm is applied to the most commonly used datasets in the field to validate its performance. Statistical analysis demonstrates the effectiveness of IBEO.


I. INTRODUCTION
The technological evolution in many fields such as finance, biomedicine, bioinformatics, and telecommunication has produced an exponentially growing volume of pervasive data. This rapid growth has produced datasets with thousands of features (also known as attributes) comprising diverse data, resulting in many challenges for data science applications. Typically, high-dimensional datasets are accompanied by redundant, irrelevant, and noisy records, which hurt the accuracy of machine learning (ML) classification and raise computational costs [1]. Conventional ML classifiers are incapable of dealing with such a large number of features, and are frequently trapped in local optima [2]. Accordingly, a preprocessing procedure such as features selection becomes a necessity to address the high-dimensional data problem and filter out the unnecessary or redundant features. Features selection in ML and statistics is also referred to as attribute selection, variable selection, or variable subset selection. It is a part of data preprocessing, which is considered the most time-consuming stage of any ML pipeline. The features selection process can help build robust models by removing redundant and irrelevant features and choosing the most informative ones.

The associate editor coordinating the review of this manuscript and approving it for publication was Hiu Yung Wong.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Features selection methods are mainly categorized into wrapper methods and filter methods. The feature subset evaluation step in wrapper approaches is based on the performance of the classification algorithm. A wrapper employs the classification algorithm as a ''black box'' to evaluate the quality of the chosen subset via the classification performance [3].
A filter approach is not based on any learning model; the features are picked and ranked in principle based on statistical measures [4]. Filter algorithms are frequently more general and less computationally expensive than wrapper methods [5]. Filters, however, ignore how the classification algorithm performs on the selected features, whereas wrappers depend heavily on the learning algorithm used to evaluate the feature subsets. In terms of classification accuracy, wrappers beat filter methods [6]. In many classification problems, wrapper methods do not need a large number of selected attributes to attain high classification accuracy. Wrappers are more accurate because they take into account the relationships between features [7]. Moreover, this approach can use various ML and knowledge extraction methods such as k-nearest neighbor (kNN) [8], discriminant analysis [9], artificial neural networks (ANN) [10], and support vector machines (SVMs) [11]. Some researchers divide features selection approaches into three categories: wrapper, filter, and embedded approaches [5]. Embedded techniques combine feature selection and classifier learning into a single process [12].
The interactions and relationships among the features, as well as the extensive search space, have made features selection one of the most difficult data mining and classification processes. Feature interaction can be two-way, three-way, or even involve several features. A feature may not have a clear impact on the target when it is used individually; however, its effect can be amplified when it is combined with other features. In addition, a feature that is useful on its own can become redundant when combined with others. The size of the search space, $2^n$ where $n$ is the total number of features, is another challenge. In other words, exhaustively searching all possible solutions is not feasible in most situations. As a result, a variety of search algorithms have been proposed to locate a sufficiently good subset.
According to A. Jovic [13], there are three types of search techniques: exponential, sequential, and random selection strategies. The number of evaluated feature subsets in the exponential methods increases exponentially with the number of features. Although this approach produces reliable results, it is impractical to use due to its high computational cost. Examples of exponential algorithms are exhaustive search and branch and bound methods [14]. Sequential methods add or delete features sequentially. The main problem, which leads to local optima, is that it is not possible to make modifications once a feature has been added to or deleted from the selected subset. Sequential forward selection (SFS) is a sequential search strategy that works best when the optimal subset has a limited number of features. The key drawback of SFS is that it cannot remove features that become obsolete when new ones are added [15]. Sequential backward elimination (SBE) is another example of sequential algorithms, but it works in the opposite direction of SFS. When the feature subset has a large number of features, SBE performs best. SBE's key flaw is its inability to reevaluate the utility of a feature after it has been removed [16]. With some backtracking capabilities, plus-L minus-R selection (LRS) tries to compensate for the flaws of SFS and SBE. The lack of theory to estimate the optimum values of L and R is its primary weakness [17]. To overcome this limitation, sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) have been proposed [18]. These approaches have been shown to be superior to static sequential approaches. The common issues with most of these approaches are premature convergence, high computing cost, and enormous complexity. Random algorithms, also known as population-based techniques, use randomization in the search process to avoid becoming trapped in local optima.
Metaheuristics are examples of random search strategies that have drawn researchers' attention for dealing with these kinds of problems.
Wrapper approaches based on metaheuristic algorithms have shown their productivity and efficiency in solving the features selection problem. The idea behind solving features selection problems with metaheuristics is that they can deliver a solution that is close to the optimal solution in a reasonable amount of time [19]. Metaheuristic algorithms have a stochastic behavior, as they begin their optimization process by producing random solutions to efficiently explore the search space. It is not necessary to compute derivatives over the search space. Because of their simple principles and easy implementation, metaheuristics are easily adapted to a particular problem. A key feature of these algorithms is their ability to avoid premature convergence. Metaheuristic algorithms maintain a balance between their two important aspects, exploration and exploitation [20].
Metaheuristic methods are mainly divided into four categories based on the source of inspiration: (i) swarm intelligence [21], (ii) human-based methods [22], (iii) physics-based methods [23], [24], and (iv) evolutionary algorithms [25]. Swarm intelligence methods are inspired by the way in which animals behave in swarms, as individual information is shared throughout the optimization process. Particle swarm optimization (PSO), one of the most well-known contributions in this class, has a number of advantages, including its simplicity and high convergence rate [26], [27]. To deal with the problem of picking the minimal subset of features for high-dimensional data, a modification strategy for the PSO method was presented in [28]. One of the promising approaches is the grey wolf optimizer (GWO), which is an efficient nature-inspired population-based metaheuristic algorithm [29]. Emary et al. [30] employed the sigmoid transfer function to obtain the first binary version of GWO (BGWO). The kNN classifier was used to calculate the classification accuracy, and the method was applied to eighteen distinct UCI datasets.
Levy flight GWO was proposed by Pathak et al. [31] and was utilized to pick significant features from the original datasets. The findings demonstrated its excellent convergence performance. Al-Tashi et al. [32] utilized the GWO algorithm to choose the best features, with SVM as a classifier, to diagnose cardiovascular disease. In addition, the authors created a binary version (BMOGW-S) based on a sigmoidal function to solve multi-objective features selection problems, with an artificial neural network for the classification process [33]. On fifteen benchmark datasets, BMOGW-S outperformed MOGWO, which used the tanh transfer function, in both classification accuracy and feature selection. The grasshopper optimization algorithm (GOA), another new population-based optimizer, was introduced by Saremi et al. to simulate the swarm behavior of grasshoppers in nature [34]. A new transfer function was introduced by Hichem et al. [35]. The Hamming distance was used to transform continuous variables into binary ones to suit the nature of the features selection problem. This new version of GOA (NBGOA) was tested on 20 standard datasets and compared to previous GOA versions. The reported data demonstrated NBGOA's potential to deliver excellent results. Mafarja et al. used a mutation operator with sigmoid and V-shaped transfer functions to improve the exploration quality of BGOA [36]. To solve binary optimization problems, Eid et al. [37] added transformation functions to the conventional whale optimization algorithm (WOA). The exploitation capability of WOA was improved by Mafarja et al. [38], who integrated simulated annealing into WOA in each iteration step to enhance the best solution. Sayed et al. extended the WOA algorithm with chaotic search to avoid the slow convergence rate and fight local optima stagnation in feature selection problems [39]. Mirjalili et al. investigated the behavior of dragonflies, bringing out a new metaheuristic algorithm called the dragonfly algorithm (DA) [40]. Medjahed et al. [41] suggested a comprehensive cancer diagnosis method based on the binary dragonfly algorithm (BDA) and SVM. SVM-RFE (SVM-recursive feature elimination) was used to select the genes from the datasets, with BDA being utilized to improve SVM-RFE performance. Six micro-array datasets were used to test the proposed technique, which yielded high accuracy results. Mafarja et al. developed a binary version of the dragonfly method that uses time-varying transfer functions to strike a good balance between exploration and exploitation [42].
Physics-based methods are derived from natural physics laws. The big bang-big crunch theory gave rise to an optimization method called the big bang-big crunch (BB-BC) algorithm [43]. In addition, a novel optimization method called charged system search (CSS) was inspired by Newtonian mechanics and Coulomb's law in electrostatics [24]. Other instances of this class include simulated annealing (SA) [44], multi-verse optimizer (MVO) [45], Henry gas solubility optimization (HGSO) [46], lightning search algorithm (LSA) [47], chemical reaction optimization (CRO) [48], gravitational search algorithm (GSA) [49], and electromagnetic field optimization (EFO) [50]. Evolutionary methods are based on Darwinian evolutionary theory and simulate the laws of natural evolution. Genetic algorithms (GA) are an instance of evolutionary methods [51]. They have a lot of potential for addressing complex optimization problems. Sayed et al. utilized a nested-GA algorithm for the features selection of cancer micro-array datasets. Nested-GA contains two GA algorithms, an inner and an outer GA, which work on two different kinds of datasets [52]. The results of nested-GA showed a small optimal subset of features with the highest classification performance. The authors in [53] used a GA feature selection algorithm with chaos optimization for text categorization. Differential evolution algorithms (DE) are another instance of evolutionary methods [54].
Human behavior and human interaction in society inspire human-based methods. Agrawal [55] introduced the first binary version of the gaining sharing knowledge based algorithm (GSK) for the feature selection problem (FS-NBGSK). The authors introduced binary junior and senior gaining and sharing stages. With the kNN classifier, the FS-NBGSK method was tested over 23 benchmark datasets from the UCI repository. In terms of classification accuracy and the smallest number of selected features, the proposed method outperformed the other compared algorithms. Representative algorithms of human-based methods include the cultural evolution algorithm (CEA) [56], imperial competition algorithms (ICA) [57], teaching-learning-based optimization (TLBO) [22], and the volleyball premier league (VPL) [58]. Recently, interesting new methods have been developed, taking inspiration from different fields. For instance, painting, one of the most important fields of art, has inspired a new optimization method [59]. Other new instances can be found in [60]-[63]. Finally, hybridization of multiple algorithms is a popular approach in features selection to take advantage of the strengths of different algorithms [64]-[69].
Opposition-Based Learning (OBL) is a field of study that has been widely employed to enhance the search process in several algorithms [70], [71]. Tizhoosh et al. [72] were the first to use OBL to speed up the convergence rate and enhance the quality of solutions suggested by metaheuristics. Its primary principle is to consider a solution and its opposite at the same time. The logic behind this concept is that an opposite solution has a larger chance of being closer to the optimal solution than another random guess, given that the current solution is not close to the optimal solution [73]. Type-I opposition and type-II opposition are two types of search that use the OBL idea [74]. Type-I opposition is a function that maps each solution in the search space to its opposite, allowing the search to proceed with the better solution. Type-II opposition establishes a relationship between the original solution and its opposite based on evaluating their qualities. The most common OBL type in the literature is type-I opposition [74], which was employed in this study. Several opposition-based metaheuristics based on various OBL techniques have been proposed in the literature. This is because the distance between the initial solutions and the unknown optimal solution is crucially related to the convergence speed of metaheuristics to the global optimum. If the optimal solution is far away from the initial solutions, the convergence to the global optimum could be slow. As a result, well-diversified initial solutions can speed up metaheuristic convergence [75]. Some metaheuristics based on various OBL techniques are an advanced charged system search (ACSS) [76], opposition-based differential evolution (ODE) [77], [78], opposition-based harmony search (OHS) [79], opposition-based sine cosine algorithm (OBSCA) [80], and improved salp swarm algorithm (ISSA) [81].
A recent algorithm called the equilibrium optimizer algorithm (EO) was developed by Faramarzi et al. to predict equilibrium states, where physical dynamics and sink models are the main inspiration for the algorithm [82]. EO belongs to the physics-based group of optimization algorithms, as it is based on physical laws found in nature. As discussed above, metaheuristic algorithms have shown positive effects on features selection problems over recent decades. Despite all the research in this direction, most metaheuristic algorithms still face several challenges that need to be addressed, for instance entrapment in local optima, lack of diversity, and imbalance between the explorative and exploitative abilities of the algorithm. More optimization techniques are still needed to obtain further improved results. To the best of the authors' knowledge, there are only a few studies in the literature on a binary version of EO [83], [84]. This has motivated us to propose a new binary version of EO and test its benefit in features selection problems as a binary optimization algorithm. The main contributions of this paper are summarized as follows:
1) IBEO, a new modified version of EO, is introduced. Two different transfer functions are used to deal with the binary problem.
2) OBL is applied to enhance the population diversity of EO at the initialization stage.
3) A local search algorithm (LSA) is applied to enhance the exploitation capability of EO. At the end of each EO iteration, LSA is invoked to prevent EO from getting stuck in local optima.
4) The proposed algorithm is tested by comparing its results with six state-of-the-art features selection methods.
The remainder of this paper is organized as follows: Section 2 describes the original EO. The proposed algorithm is discussed in detail in section 3, while the experimental results are given in section 4. Finally, section 5 concludes the paper.

II. EQUILIBRIUM OPTIMIZER ALGORITHM: AN OVERVIEW
The equilibrium optimizer algorithm (EO) is a recent algorithm applied to continuous optimization problems [82]. EO belongs to the physics-based methods. Particles and concentrations in EO are analogous to particles and their positions in the particle swarm optimization algorithm (PSO). According to the best solutions, called the equilibrium candidates, the concentrations of the particles (search agents) are randomly updated until the equilibrium state (the optimal result) is reached. One of the advantages of EO is its ability to update the solution randomly while keeping a good balance between exploration and exploitation. Eq. (1) represents the update pattern for the concentration of a solution in the EO algorithm:

$$C = C_{eq} + (C - C_{eq})\,F + \frac{G}{\lambda V}\,(1 - F) \tag{1}$$

where $C_{eq}$ is the equilibrium concentration, $F$ is the exponential term, $G$ is the generation rate, $\lambda$ is a random vector, and $V$ is the control volume, taken as unity in the original EO [82]. The concentration of each particle is updated through three terms. The equilibrium concentration is the first term, and it is picked randomly from the equilibrium pool, a vector that contains five equilibrium candidates (the best solutions). The second term serves as a direct exploration mechanism, and it reflects the difference between the concentration of a solution and the concentration of an equilibrium state. This term acts as an explorer, encouraging particles to explore the domain globally. Moreover, the presence of F in the second term helps EO achieve a sensible balance between exploration and exploitation. The generation rate is the third term. A mathematical explanation of these terms is given in the next subsections.

1) CANDIDATES AND EQUILIBRIUM POOL
The convergence state of the algorithm is called the equilibrium state and is supposed to be the global optimum. There is no information about the equilibrium concentration at the beginning of the optimization process. Thus, equilibrium candidates are useful in providing a search pattern for the particles. As mentioned before, the EO algorithm creates an equilibrium pool, a vector that contains five equilibrium candidates. Four candidates are the best concentrations found so far during the optimization process, and the fifth is their average. The four best candidates support EO in exploration, while the average particle supports better exploitation capability. It is worth noting that the number of candidates is optional and depends on the nature of the optimization problem. The equilibrium pool vector can be expressed as:

$$C_{eq,pool} = \{C_{eq(1)}, C_{eq(2)}, C_{eq(3)}, C_{eq(4)}, C_{eq(ave)}\} \tag{2}$$

In each iteration, the concentration of each particle is updated using a candidate selected at random from the pool, where all candidates have the same probability of being selected.
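As a minimal sketch of the pool described above, the following Python snippet builds the five candidates from a population and its fitness values and then draws one uniformly at random. The function names and the assumption that lower fitness is better are illustrative choices, not taken from the paper; a full implementation would also keep the best concentrations found across iterations rather than only those of the current population.

```python
import random


def build_equilibrium_pool(population, fitness_values):
    """Return the four best particles plus their element-wise average."""
    # Assumption: minimization, so a lower fitness value is better.
    order = sorted(range(len(population)), key=lambda i: fitness_values[i])
    best_four = [list(population[i]) for i in order[:4]]
    dim = len(best_four[0])
    average = [sum(p[k] for p in best_four) / len(best_four) for k in range(dim)]
    return best_four + [average]


def pick_candidate(pool, rng=random):
    """Choose one equilibrium candidate; each has the same probability."""
    return rng.choice(pool)
```

With a population of five particles, the pool holds the four fittest ones followed by their average, and `pick_candidate` returns any of the five with probability 1/5.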
2) EXPONENTIAL TERM: F
As mentioned previously, the exponential term F in Eq. (3) greatly assists EO in balancing exploitation and exploration. The exponential term F can be expressed as:

$$F = e^{-\lambda (t - t_0)} \tag{3}$$

Here, $\lambda$ is a random vector in $[0, 1]$, $t$ is a function of the iteration that decreases with the number of iterations, and $t_0$ slows down the search speed to assure convergence:

$$t = \left(1 - \frac{Iter}{Max\_iter}\right)^{\left(a_2 \frac{Iter}{Max\_iter}\right)} \tag{4}$$

$$t_0 = \frac{1}{\lambda} \ln\left(-a_1\,\mathrm{sign}(r - 0.5)\left[1 - e^{-\lambda t}\right]\right) + t \tag{5}$$

Here, Iter and Max_iter represent the current and the maximum number of iterations, respectively, $r$ is a random vector in $[0, 1]$, $\mathrm{sign}(r - 0.5)$ affects the exploration and exploitation directions, and $a_1$ and $a_2$ are constants. In the related work, $a_1 = 2$ and $a_2 = 1$. A higher value of $a_1$ gives better exploration performance and worse exploitation ability. A higher value of $a_2$ gives better exploitation performance and worse exploration ability. By substituting Eq. (5) in Eq. (3), the final form of the exponential term is as follows:

$$F = a_1\,\mathrm{sign}(r - 0.5)\left[e^{-\lambda t} - 1\right] \tag{6}$$
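The behaviour of the exponential term can be illustrated with a short Python sketch. It assumes the forms commonly stated for EO [82], namely t = (1 - Iter/Max_iter)^(a2 * Iter/Max_iter) and F = a1 * sign(r - 0.5) * (exp(-lambda * t) - 1). The function names are hypothetical, and lambda is drawn from (0, 1] rather than [0, 1] so that it can later be used as a divisor.

```python
import math
import random


def time_factor(iteration, max_iter, a2=1.0):
    """t falls from 1 to 0 as the run progresses."""
    ratio = iteration / max_iter
    return (1.0 - ratio) ** (a2 * ratio)


def exponential_term(t, dim, a1=2.0, rng=random):
    """Return one (lambda, F) pair per dimension."""
    lam, F = [], []
    for _ in range(dim):
        l = 1.0 - rng.random()                       # lambda in (0, 1] (assumption)
        sign = 1.0 if rng.random() >= 0.5 else -1.0  # sign(r - 0.5)
        lam.append(l)
        F.append(a1 * sign * (math.exp(-l * t) - 1.0))
    return lam, F
```

Because lambda * t never exceeds 1, the magnitude of each F entry stays below a1 * (1 - 1/e), and it shrinks to zero at the final iteration, which is consistent with the shift from exploration to exploitation described above.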

3) GENERATION RATE (G)
The third term in Eq. (1) contains the generation rate (G), which plays a vital role in EO by improving the exploitation process:

$$G = G_0\,e^{-\lambda (t - t_0)} = G_0\,F \tag{7}$$

$$G_0 = GCP\,(C_{eq} - \lambda C) \tag{8}$$

$$GCP = \begin{cases} 0.5\,r_1, & r_2 \geq GP \\ 0, & r_2 < GP \end{cases} \tag{9}$$

where $r_1$ and $r_2$ are random numbers in the interval $[0, 1]$. GCP is an abbreviation for generation rate control parameter; it controls whether the generation term contributes to the updating rule. GP is an abbreviation for generation probability; it determines how many particles update their concentration using the generation term. GP = 0.5 guarantees a fair balance between exploitation and exploration.
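Putting the three terms together, one concentration update can be sketched in Python as follows. The sketch assumes the forms commonly stated for EO [82]: GCP = 0.5 * r1 when r2 >= GP and 0 otherwise, G0 = GCP * (C_eq - lambda * C), G = G0 * F, and C_new = C_eq + (C - C_eq) * F + G / (lambda * V) * (1 - F) with V = 1. The function name and argument layout are illustrative, and every entry of `lam` is assumed to be nonzero.

```python
import random


def eo_update(C, C_eq, F, lam, GP=0.5, V=1.0, rng=random):
    """Return the updated concentration of one particle."""
    r1, r2 = rng.random(), rng.random()
    GCP = 0.5 * r1 if r2 >= GP else 0.0  # generation rate control parameter
    new_C = []
    for c, ceq, f, l in zip(C, C_eq, F, lam):
        G0 = GCP * (ceq - l * c)         # initial generation rate
        G = G0 * f                       # generation rate
        # equilibrium term + exploration term + generation term
        new_C.append(ceq + (c - ceq) * f + G / (l * V) * (1.0 - f))
    return new_C
```

When F is zero in every dimension, the second and third terms vanish and the particle lands exactly on the chosen equilibrium candidate, which matches the role of F as the switch between exploration and exploitation.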

III. THE PROPOSED ALGORITHM (IBEO)
In this section, a wrapper-based method is proposed for tackling the problem of features selection. Initialization, transformation function, local search algorithm (LSA), and evaluation are the essential steps of IBEO. A detailed description of each step will be presented in the next subsections.

A. INITIALIZATION
Several studies have raised the quality of their initial population by using an optimization technique called opposition-based learning (OBL). The OBL strategy works by diversifying the solutions to give a better probability of discovering promising regions. It searches in two directions of the search space. The first direction is the primary solution, and the other direction is the opposite solution. It is reasonable to assume that if the current solutions are far from the unknown optimal solution, computing the opposite solutions will lead in the opposite direction, toward the unknown optimal solution. The purpose of the OBL approach is to take the fittest solutions from all solutions. Let $c$ be a real number that belongs to the interval $[lb, ub]$. The opposite number of $c$ is denoted by $\bar{c}$:

$$\bar{c} = lb + ub - c \tag{10}$$

Eq. (10) can be generalized for use in features selection problems. The following equations express the concentration of each particle and its other direction (opposite solution) in EO:

$$C = [c_1, c_2, \ldots, c_d] \tag{11}$$

$$\bar{C} = [\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_d] \tag{12}$$

All element values in $\bar{C}$ can be defined by:

$$\bar{c}_j = lb_j + ub_j - c_j, \quad j = 1, 2, \ldots, d \tag{13}$$

where $lb_j$ and $ub_j$ are the lower and upper bounds of each value in the current solution, respectively. In this strategy, the bounds are used to calculate the opposite of the current solution.
Then, two populations, the set of the opposite solutions OC and its initial form C, are combined and the fittest solutions are selected from C ∪ OC as the new initial population.
In the IBEO initialization stage, a number of particles are randomly generated according to the population size. Each particle serves as a candidate solution with dimension d, where d is the number of features in the original dataset. OBL is applied only in the initialization step to obtain the opposite of every generated solution. The new initial population of IBEO is created by choosing the best solutions from the union of the initial solutions and their opposites. IBEO determines the fitness values based on the kNN or SVM classifier.
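The initialization stage described above can be sketched in a few lines of Python. The sketch assumes type-I opposition with common bounds lb and ub for all dimensions, so that the opposite of c is lb + ub - c, and a fitness callable to be minimized. The function names and the toy fitness used in the usage note are hypothetical; in IBEO the fitness would come from the kNN or SVM classifier.

```python
import random


def opposite(solution, lb, ub):
    """Type-I opposite of a solution: lb + ub - c for every element."""
    return [lb + ub - c for c in solution]


def obl_initialize(n, dim, fitness, lb=0.0, ub=1.0, rng=random):
    """Generate n random particles, add their opposites, keep the n fittest."""
    C = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n)]
    OC = [opposite(c, lb, ub) for c in C]
    merged = C + OC                # C union OC
    merged.sort(key=fitness)       # assumption: lower fitness is better
    return merged[:n]
```

For example, `obl_initialize(5, 3, fitness=sum)` returns the five candidates with the smallest element sum out of five random particles and their five opposites.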

B. TRANSFORMATION FUNCTION
Several optimization problems, such as feature selection problems, have been modeled as binary problems. As mentioned before, the particle concentration produced by the original EO has continuous values. Thus, the conversion from the continuous space of the original EO to a binary search space requires a transformation function. The concentrations of the particles in feature subset selection problems can only be 0 or 1. In IBEO, the binary solution space is represented by a matrix of size n × N, where n is the population size and N is the number of features. The values 1 and 0 indicate that the related feature is selected or unselected, respectively. Fig. 1 represents the binary search space in IBEO. This study uses two different transformation functions to convert the continuous space into a binary search space [85]. Later, in the results and analysis section, the performance of these functions will be examined. The sigmoidal function, an instance of the S-shaped family of transfer functions, is given in Eq. (14), where $c_i^k(t)$ is the continuous value of the $i$th particle in the $k$th dimension at the $t$th iteration, calculated by Eq. (1). The output of the S-shaped function in Eq. (14) is still continuous. To obtain a binary value, the $i$th particle concentration is updated according to Eq. (15), where rand is a random number in the interval [0, 1]. The second function is an instance of the V-shaped family, given in Eq. (16). The output of the V-shaped function in Eq. (16) is also continuous. To obtain a binary value, the $i$th particle concentration is updated according to Eq. (17). S-shaped and V-shaped functions transform a continuous space into a binary search space through the following procedure:
1) Passing the continuous value $c_i^k(t)$ to the transfer function.

C. LOCAL SEARCH ALGORITHM
The existing best solution B is passed to a local search algorithm (LSA) in each iteration with the aim of discovering a better solution. LSA randomly selects three features in each iteration from the existing best solution. The algorithm switches the value of these features from 1 to 0 and vice versa. Afterward, the fitness value of the new solution is calculated. LSA updates B only if the new solution gives a better fitness value than the current solution (Algorithm 1).
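The binarization and local search steps described above can be sketched in Python as follows. Several details are assumptions rather than the paper's exact definitions: the sigmoid 1 / (1 + exp(-x)) is used as the S-shaped function, |tanh(x)| is used as one common V-shaped function (the paper's own V-shaped form may differ), the S-shaped threshold rule sets a bit to 1 when a uniform random number falls below the transfer value, and the local search makes a single three-bit flip attempt per call. All function names are hypothetical.

```python
import math
import random


def s_shaped(x):
    """Sigmoid transfer function (S-shaped family)."""
    x = max(-60.0, min(60.0, x))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x):
    """|tanh(x)|: an ASSUMED V-shaped member, not necessarily the paper's."""
    return abs(math.tanh(x))


def binarize_s(concentration, rng=random):
    """ASSUMED rule: bit is 1 when a uniform draw is below s_shaped(c)."""
    return [1 if rng.random() < s_shaped(c) else 0 for c in concentration]


def local_search(best, fitness, n_flips=3, rng=random):
    """Flip n_flips random bits of best; keep them only if fitness improves."""
    candidate = list(best)
    for k in rng.sample(range(len(candidate)), min(n_flips, len(candidate))):
        candidate[k] = 1 - candidate[k]  # switch 1 to 0 and vice versa
    return candidate if fitness(candidate) < fitness(best) else list(best)
```

Accepting the flipped solution only on strict improvement means the best solution B can never get worse, so the local search adds exploitation without undoing earlier progress.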

D. EVALUATION
Scalability of ML algorithms becomes a serious issue when dealing with datasets that have a large number of features. One of the concerns with adding more features to the data is that redundant and irrelevant features have a negative influence on the performance of the classifier in many ways [86]. Moreover, more instances will be needed, which will cause the classifier to take a longer time to learn. In such cases, the dimensionality of the data must be reduced. Feature selection improves the learning time and the accuracy of a given classifier by eliminating unnecessary and irrelevant features. In this regard, we need to choose only the most relevant features and simultaneously raise the classification performance.
The features selection problem is known as a multi-objective optimization problem (MOP) since it needs to achieve the following:
• Reduce the number of selected features.
• Improve the classifier performance by maximizing the accuracy value.
In order to balance the two objectives, the following fitness function is used in IBEO to evaluate the solutions:

$$Fitness = \alpha \cdot error(D) + \beta \cdot \frac{|M|}{|N|} \tag{18}$$

where $error(D)$ is the classification error rate calculated using the kNN or SVM classifier, and $\alpha$ and $\beta$ are weight parameters with $\alpha \in [0, 1]$ and $\beta = 1 - \alpha$. The two parameters $\alpha$ and $\beta$ reflect the importance of the classification accuracy and of the length of the selected feature subset, respectively. $|N|$ is the number of original features and $|M|$ is the number of selected features. kNN or SVM is employed in IBEO as a classifier [11], [87]. We prefer to use the SVM classifier if a dataset has two classes. Otherwise, the kNN classifier is used. Fig. 2 represents the features selection process of IBEO. The steps of IBEO are shown in Algorithm 2.
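As a small worked illustration of this weighted fitness, the Python sketch below combines a classification error rate with the fraction of selected features, assuming the usual form alpha * error + (1 - alpha) * |M| / |N|. The function name is hypothetical, the error rate is passed in as a number between 0 and 1 rather than computed by a classifier, and the value of alpha must be supplied by the caller because it is not stated in this section.

```python
def subset_fitness(error_rate, mask, alpha):
    """Weighted sum of classification error and relative subset size."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    selected_ratio = sum(mask) / len(mask)  # |M| / |N|
    return alpha * error_rate + beta * selected_ratio
```

For instance, with an error rate of 0.1, two of four features selected, and alpha = 0.5, the fitness is 0.5 * 0.1 + 0.5 * 0.5 = 0.3. A value of alpha close to 1 makes accuracy dominate, while a smaller alpha rewards shorter feature subsets more strongly.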

IV. EXPERIMENTAL RESULTS AND ANALYSIS
All algorithms were tested in MATLAB (ver. R2016a) on a machine running Microsoft Windows 10 (64-bit edition) with an Intel Core i7-3630QM processor at 2.40 GHz and 8 GB of RAM.

A. DATASETS
Twenty-five datasets taken from https://www.openml.org were used to verify and evaluate the performance of IBEO compared to other algorithms. Table 1 gives a brief overview of the datasets, which contain different numbers of classes (from 2 to 10), different numbers of instances (from 47 to 6435), and different numbers of attributes (from 10 to 7129).

B. PARAMETER CONFIGURATION
The performance of IBEO is compared to different state-of-the-art features selection methods. Each algorithm is run 20 times independently. For all experiments, the population size is 5 and the maximum number of iterations is 30. The classification process is responsible for classifying new incoming instances whose class label is unknown. kNN and SVM are the preferred classifiers in the present study. To produce the optimal subset, the 5-NN classifier is preferred for datasets that contain more than two classes. Several trials and runs were performed on various datasets to determine the best K value in kNN. For both kNN and SVM, 10-fold cross-validation is used to reduce over-fitting. k-fold cross-validation splits the dataset into k folds (subsets) of almost equal size. Of the k subsets, k − 1 are used to train the classifier, and the remaining subset is treated as testing data for predicting the class label of each instance. The classification error rate is then determined as the percentage of incorrect class label predictions. The parameters α and β of the fitness function are set based on domain knowledge. The rest of the parameters are set by trial and error. The parameters of IBEO are shown in Table 2.

Algorithm 2
Initialize the parameters a_1 = 2, a_2 = 1, GP = 0.5.
Calculate the particles' opposite population OC using Eq. (13).
Select the n fittest particles from {C ∪ OC}, which represent the initial IBEO population.
Define t = 0
while t < max_iterations do
    Construct the equilibrium pool using Eq. (2).
    for each particle do
        Update concentrations C using Eq. (1).
    end for
    Convert the concentrations of the particles into binary ones.
    Evaluate each particle in the population using the kNN or SVM classifier.
    Calculate the fitness of all the particles in the population using Eq. (18).
    B = best solution
    Apply LSA on B to find if there is a better solution.
    t = t + 1
end while
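The k-fold procedure described in this subsection can be sketched in Python with the standard library only. The function names are hypothetical, and `count_errors` is a placeholder callback standing in for training a kNN or SVM classifier on the training indices and counting the wrong predictions on the test indices.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # sizes differ by at most one
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cv_error_rate(n_samples, count_errors, k=10, seed=0):
    """Percentage of misclassified instances over all k test folds.

    count_errors(train, test) is a hypothetical callback that trains a
    classifier on `train` and returns the number of wrong predictions on `test`.
    """
    wrong = sum(count_errors(tr, te) for tr, te in k_fold_splits(n_samples, k, seed))
    return 100.0 * wrong / n_samples
```

Each instance appears in exactly one test fold, so the resulting percentage is computed over the whole dataset while every prediction is made by a classifier that did not see that instance during training.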

C. RESULTS AND ANALYSIS
The first experiment studies the effect of two different transformation functions on the original EO. The second experiment compares the proposed IBEO with the original EO and with another binary version of EO, called BEO, that was published in 2020 [83]. The third experiment compares IBEO with other algorithms, namely PSO, GOA, GWO, WOA, DA, and ISSA. This last comparison uses three measures:
• Fitness value
• Classification accuracy
• Number of selected features

1) COMPARISON BETWEEN PERFORMANCES OF TWO TRANSFORMATION FUNCTIONS ON EO
A binary version of EO is executed using an instance of the S-shaped family of transfer functions (EO-S) and an instance of the V-shaped family of transfer functions (EO-V). EO-S and EO-V are compared on the fitness function value using three performance measures:
• The minimum fitness value (i.e., the best fitness value) reached after running the two algorithms 20 times.
• The mean fitness value reached after running the two algorithms 20 times.
• The maximum fitness value (i.e., the worst fitness value) reached after running the two algorithms 20 times.
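To illustrate the difference between the two families, the sketch below uses the sigmoid as the S-shaped instance and |tanh(x)| as the V-shaped instance. The choice of |tanh(x)|, the function names, and the two binarization rules (set the bit for S-shaped, flip the bit for V-shaped) are common conventions assumed here, not details confirmed by this section.

```python
import math


def s_shaped(x):
    """Sigmoid: maps a continuous concentration to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x):
    """|tanh(x)|: large magnitudes of x give probabilities close to 1."""
    return abs(math.tanh(x))


def binarize_s(x, r):
    """S-shaped rule (assumed): the bit becomes 1 when r < S(x), else 0."""
    return 1 if r < s_shaped(x) else 0


def binarize_v(current_bit, x, r):
    """V-shaped rule (assumed): flip the current bit when r < V(x)."""
    return 1 - current_bit if r < v_shaped(x) else current_bit
```

Here `r` is a uniform random number in [0, 1) drawn per dimension, for example with `random.random()`.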
As seen in Table 3, EO-S performs better than EO-V on most of the datasets. The best results are indicated in bold. Moreover, Fig. 3 shows the performance of the two algorithms in terms of the total improvement percentage (IP). IP measures the positive change achieved by each algorithm relative to using all features. In the IP formula, f_full is the fitness value obtained when all of a dataset's original features are chosen, f_alg is the fitness value of the worst, mean, or best case of EO-S and EO-V, and m is the number of datasets (m = 25). Fig. 3 shows that both algorithms achieve an improvement, and that performance is considerably better than when all features are chosen in every dataset. In addition, EO-S has a better IP than EO-V for all datasets.

2) COMPARISON OF IBEO WITH EO AND BEO
The previous experiment tested EO-S and EO-V to decide which one is better, and the results showed that EO-S performs better than EO-V. The next step is to investigate IBEO and determine the impact of incorporating the OBL strategy and the LSA algorithm into EO-S. Thus, the proposed IBEO is compared with the original EO and with BEO. K. Ghosh proposed BEO as the first binary version of EO, which improved EO's exploitative abilities by using the simulated annealing algorithm [83]. The average number of selected features, the classification accuracy, and the fitness value are the most important comparative performance measures in our experiment. Fig. 4 shows a comparison among EO, BEO, and IBEO based on the total average of the fitness value, classification accuracy, and number of selected features over all the datasets. IBEO exceeds both algorithms with a total average fitness value of 0.1563 and a total average classification accuracy of 0.8569. Furthermore, the average number of selected features over all the datasets is reduced from 142.68 to 136.03. The OBL strategy and the LSA algorithm clearly help the original algorithm explore different parts of the search space and achieve better results. Fig. 5 shows the average time for running over all the datasets 20 times.
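The OBL initialization can be sketched as follows. The sketch assumes the standard opposition rule, in which the opposite of x in [lower, upper] is lower + upper − x, and a fitness that is minimized; the function names are hypothetical, and Eq. (13) of the paper remains the authoritative definition.

```python
def opposite(particle, lower, upper):
    """Opposite point of a particle within the bounds [lower, upper]."""
    return [lower + upper - x for x in particle]


def obl_initial_population(population, fitness, lower, upper):
    """Merge the population C with its opposite population OC and keep
    the n fittest particles (lowest fitness), where n = len(population)."""
    n = len(population)
    opposites = [opposite(p, lower, upper) for p in population]
    merged = population + opposites
    return sorted(merged, key=fitness)[:n]
```

For example, with bounds [0, 10] and the sum of coordinates as a toy fitness, the population [[9, 9], [2, 3]] has opposites [[1, 1], [8, 7]], and the two fittest particles kept are [1, 1] and [2, 3].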

3) COMPARISON BETWEEN IBEO AND OTHER METAHEURISTIC ALGORITHMS
The previous experiment compared IBEO with the original EO. IBEO outperformed the standard EO algorithm because of its ability to balance exploration and exploitation, to escape from local optima, and to improve population diversity. To confirm these results, the third experiment compares the proposed algorithm with six other well-known algorithms (PSO, GOA, GWO, WOA, DA, ISSA). The following measures are used in the comparative performance evaluation: number of selected features, mean fitness, classification accuracy, and standard deviation. Table 5 shows the results of the seven algorithms in terms of the average fitness values. IBEO attains the minimum fitness value on most of the datasets: it works best in 23 out of 25 datasets (92%) and is second-best on Satellite and Leukemia. Furthermore, Fig. 6 presents the total mean fitness values over all the datasets. The average fitness value of IBEO is 0.1564, which is the lowest among the compared algorithms.
In Table 6, the algorithms are compared in terms of the classification accuracy obtained by each. In most datasets, IBEO outperforms the other algorithms in classification accuracy: according to Table 6, IBEO works best in 22 out of 25 datasets (88%). IBEO is second-best on Robot-failures-lp1. For Leukemia, it is fifth-best, after PSO, ISSA, DA, and GOA, in this order. Fig. 7 reports the average classification accuracy obtained by each algorithm over all datasets. IBEO reaches 85.6% classification accuracy, the best result among the compared algorithms. Moreover, Fig. 8 gives the average number of selected features over all datasets. IBEO selects 136.102 features on average, which ranks it second. Although PSO achieves the smallest average number of selected features, IBEO appears more useful because classification accuracy should be given more weight than the number of features. Additionally, the comparisons show that there is no large difference between PSO and IBEO and that the fitness values are influenced more by the classification accuracy than by the number of selected features. The advantage of IBEO comes from its ability to balance exploration and exploitation over the iterations of the search. In Fig. 9, IBEO ranks first in terms of the total average standard deviation of the mean fitness values, indicating that the spread of its fitness values is smaller than that of the other algorithms. As mentioned above, IBEO also has the smallest mean fitness value. Hence, it is concluded that its best and worst fitness values are close to the mean value.
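The observation that fitness is driven more by accuracy than by subset size can be illustrated with a weighted fitness of the form commonly used in wrapper feature selection. The form below and the weights alpha = 0.99 and beta = 0.01 are assumptions chosen for illustration; the paper's own definition is Eq. (18), and its parameter values are those listed in Table 2.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    """Assumed weighted fitness (lower is better): alpha weights the
    classification error rate, beta weights the fraction of selected features."""
    return alpha * error_rate + beta * (n_selected / n_total)


# Candidate A: more accurate but keeps 50 of 100 features.
fit_a = fitness(error_rate=0.10, n_selected=50, n_total=100)
# Candidate B: keeps only 10 of 100 features but is less accurate.
fit_b = fitness(error_rate=0.15, n_selected=10, n_total=100)
```

Under these assumed weights, candidate A has the lower fitness even though it keeps five times as many features, because the error term dominates.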
In the IBEO configuration, the maximum number of iterations in all the experiments is set to 30 and the population size is 5. LSA updates the current best solution only if the new solution gives a better fitness value than the current solution. As a result, the evaluation function for the current best solution is called twice in each iteration. Thus, to ensure a fair assessment of the proposed algorithm, the algorithms that do not use iterated local search (PSO, GOA, GWO, WOA, DA) are tested again with the maximum number of iterations increased to 60. Tables 7 and 8 give comparisons in terms of the average fitness value and the classification accuracy, respectively. The results show that the proposed algorithm continues to outperform the other algorithms on most of the datasets in both tables.
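The acceptance rule of LSA can be sketched as a greedy bit-flip search. Only the rule "accept the new solution if its fitness is better" comes from the text; the single-bit-flip neighbourhood, the `n_trials` parameter, and the function name are hypothetical choices made for this sketch.

```python
import random


def local_search(best, fitness, n_trials=10, seed=0):
    """Greedy bit-flip search: a neighbour replaces the current solution
    only if it has a strictly lower (better) fitness."""
    rng = random.Random(seed)
    current = list(best)
    current_fit = fitness(current)
    for _ in range(n_trials):
        candidate = list(current)
        j = rng.randrange(len(candidate))
        candidate[j] = 1 - candidate[j]      # add or drop one feature
        candidate_fit = fitness(candidate)   # one extra evaluation per trial
        if candidate_fit < current_fit:
            current, current_fit = candidate, candidate_fit
    return current, current_fit
```

Each trial costs one additional fitness evaluation, which is why the competing algorithms are given a larger iteration budget in the comparison above.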
Based on the results shown in Fig. 10, IBEO converges better than all the other algorithms on most of the datasets. The convergence curves in Fig. 10 also show that IBEO avoids premature convergence on most of the datasets by improving population diversity and balancing exploitation and exploration. Fig. 11 presents boxplots for all datasets to compare the average performance of the algorithms visually. The boxplots represent the classification accuracy values and are plotted after executing each algorithm 20 times. Each boxplot gives information about five components of the data: maximum, median, minimum, first quartile, and third quartile. The lower whisker represents the minimum value, the bottom of the rectangle represents the first quartile, the top of the rectangle represents the third quartile, the line that divides the rectangle is the median, and the upper whisker is the maximum value. Additionally, outliers may be plotted as individual points. IBEO has higher boxplots, characterized by higher median values, than the other algorithms on most of the datasets.
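The five components of a boxplot can be computed directly from the 20 per-run accuracy values. The sketch below uses Python's `statistics.quantiles` with the inclusive method; plotting libraries may use slightly different quartile conventions, so this is an illustration rather than a reproduction of Fig. 11.

```python
import statistics


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }
```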
Furthermore, to determine whether there is a statistical difference between IBEO's results and those of the other comparative methods, the nonparametric Wilcoxon rank-sum test, based on the fitness function, is conducted for 10 randomly selected datasets (Fri_c1_1000, Australian, Lymphography, Credit, Kc1, Parkinsons, Dermatology, SonarEW, Robot-failures-lp2, and CNAE) [88]. The significance level of the Wilcoxon test is set at 0.05, and the results are presented in Table 9. No. R+ is the number of positive ranks, in which IBEO outranks the comparative algorithm. No. R− is the number of negative ranks, in which IBEO fails to outrank the comparative algorithm. The number of ties is the number of equal ranks between IBEO and the comparative algorithm. Sum_R+ and Sum_R− are the totals of the positive and negative ranks, respectively. According to Table 9, IBEO outperforms PSO, GOA, GWO, WOA, DA, and ISSA in terms of No. R+ in all 10 of the randomly selected datasets. For instance, on the Fri_c1_1000 dataset, IBEO outperforms PSO in 19 out of 20 runs. On the Robot-failures-lp2 dataset, IBEO outperforms GWO in 16 out of 20 runs, fails to outrank GWO in 3 runs, and shows a similar performance in one run. Interestingly, IBEO obtains the top performance in all 20 runs for the datasets Kc1, Parkinsons, and SonarEW. For the datasets used in this test, Sum_R+ is greater than Sum_R−. The p-values in Table 9 indicate whether there is a significant difference between the proposed algorithm and the compared algorithm. The evidence becomes stronger as the p-value becomes smaller. Statistical significance is defined as a p-value of less than 0.05, which means there is strong evidence against the null hypothesis.
From Table 9, we observe that IBEO outperformed GOA, GWO, WOA, and DA on all 10 datasets, with a significant difference (p-value less than 0.05) on each of them. Compared with PSO and ISSA, IBEO significantly outperformed them in 9 and 7 out of 10 datasets, respectively. Finally, the p-values confirm that the results of the proposed approach are significantly different from those of the state-of-the-art algorithms on the majority of the datasets used.
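The rank quantities reported in Table 9 (No. R+, No. R−, ties, Sum_R+, Sum_R−) can be illustrated with a small sketch. It assumes that the runs of the two algorithms are paired by run index and that the absolute fitness differences are ranked, as in a signed-rank style comparison. The pairing, the function name, and the normal approximation used for the two-sided p-value are assumptions, not details taken from the paper.

```python
import math


def signed_rank_summary(ibeo, other):
    """Compare paired per-run fitness values (lower fitness is better)."""
    # A positive difference means IBEO reached the lower (better) fitness.
    diffs = [o - i for i, o in zip(ibeo, other)]
    ties = sum(1 for d in diffs if d == 0)
    nonzero = [d for d in diffs if d != 0]

    # Rank the absolute differences; equal magnitudes share their average rank.
    order = sorted(range(len(nonzero)), key=lambda k: abs(nonzero[k]))
    ranks = [0.0] * len(nonzero)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg_rank
        pos = end + 1

    sum_pos = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    sum_neg = sum(r for r, d in zip(ranks, nonzero) if d < 0)

    # Two-sided p-value from a normal approximation (no tie correction).
    n = len(nonzero)
    if n == 0:
        p_value = 1.0
    else:
        mean = n * (n + 1) / 4
        sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
        z = (min(sum_pos, sum_neg) - mean) / sd
        p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "n_pos": sum(1 for d in nonzero if d > 0),
        "n_neg": sum(1 for d in nonzero if d < 0),
        "ties": ties,
        "sum_pos": sum_pos,
        "sum_neg": sum_neg,
        "p_value": p_value,
    }
```

With only 20 runs per dataset the normal approximation is rough; an exact test from a statistics library would be preferable for reported results.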

V. CONCLUSION
The comparative experiments and the results above indicate the advantage of IBEO over other optimization algorithms. We evaluated the IBEO algorithm across 25 datasets and compared it with other well-known optimization algorithms (PSO, GOA, GWO, WOA, DA, ISSA) using three measures: fitness, classification accuracy, and number of selected features. The advantage in IBEO performance results from using the sigmoid function to deal with the binary problem and from two enhancements added to the original EO algorithm: the OBL strategy and the LSA algorithm. OBL is applied at the initialization phase to improve the population diversity of EO, and LSA is used at the end of each EO iteration to prevent the search from getting stuck in a local optimum. The kNN or SVM classifier gives high-quality solutions with the IBEO algorithm, and it can effectively learn from the training data. k-fold cross-validation is an effective way to mitigate the over-fitting problem. In the future, the ability of the proposed algorithm to select fewer features could be increased by using new selection strategies. Hybridization could also improve the exploration of the algorithm. Furthermore, different classifiers such as neural networks can be used to examine the performance of IBEO.