Enhanced Binary Cuckoo Search With Frequent Values and Rough Set Theory for Feature Selection

Redundant and irrelevant features in datasets decrease classification accuracy and increase the computational time of classification algorithms, the risk of overfitting, and the complexity of the underlying classification model. Feature selection is a preprocessing technique used with classification algorithms to select the relevant features. Several approaches that combine Rough Set Theory (RST) with Nature Inspired Algorithms (NIAs) have been used successfully for feature selection. However, due to the inherent limitations of RST for some data types and the inefficient convergence of NIAs for high dimensional datasets, these approaches have mainly focused on specific types of low dimensional nominal datasets. This paper proposes a new filter feature selection approach based on Binary Cuckoo Search (BCS) and RST, which is more efficient for low and high dimensional nominal, mixed and numerical datasets. It enhances BCS with new initialization and global update mechanisms that increase the efficiency of convergence for high dimensional datasets, and it develops a more efficient objective function for numerical, mixed and nominal datasets. The proposed approach was validated on 16 benchmark datasets, 4 nominal, 4 mixed and 8 numerical, drawn from the UCI repository. It was also evaluated against standard BCS; five NIAs combined with fuzzy RST approaches; two popular traditional FS approaches; and multi-objective evolutionary, Genetic, and Particle Swarm Optimization (PSO) algorithms. Decision tree and naive Bayes algorithms were used to measure the classification performance of the proposed approach. The results show that the proposed approach achieved improved classification accuracy while minimizing the number of features compared to other state-of-the-art methods. The code is available at https://github.com/abualia4/EBCS.

algorithms [8], [10], [11]. In other words, to address these issues, these approaches aim to remove both irrelevant and redundant features.
In general, efficient FS approaches aim to select the minimum number of needed features without significantly reducing classification accuracy or substantially increasing computational time [8], [12], [13].
A typical FS approach involves three main steps. The first is subset generation, which is responsible for generating candidate feature subsets; the second is subset evaluation, which is responsible for evaluating the generated subsets [14]. Subset generation uses one of three strategies to search for candidate feature subsets: complete, random, and heuristic search [8]. Complete search is computationally expensive because it covers all combinations of features. Random search generates feature subsets randomly. Heuristic search is faster than complete search and more efficient than random search because it makes informed choices to select a near-optimal subset without searching all combinations of features. Result validation is the last step, which is used to validate and subsequently determine the performance of the FS approach [14].
Meta-heuristic search requires fewer assumptions to find a near-optimal feature subset than heuristic search [15]. FS is an optimization problem [16], and one of the most popular and efficient families of methods used to generate candidate solutions for optimization problems is Nature Inspired Algorithms (NIAs) [17]-[19]. NIAs, developed from characteristics of biological systems, are a class of population-based meta-heuristics that evaluate multiple candidate solutions concurrently [20]-[22]. Ant Colony Optimization (ACO) [23], Particle Swarm Optimization (PSO) [24], Artificial Bee Colony (ABC) [25], Cuckoo Search (CS) [26] and Binary CS (BCS) [27] are examples of popular NIAs. NIAs are also widely used for FS [28], [29].
The objective function is responsible for determining the relevancy of the generated candidate feature subsets for the classification algorithms [30]. Based on the objective function, FS approaches can be classified into two groups: filter approaches and wrapper approaches [31], [32]. Filter approaches use statistical methods in their objective function to evaluate feature characteristics such as the dependency degree [33] or an information measure [34]. Wrapper approaches, on the other hand, use classification algorithms in their objective function [35]. Filter approaches are often considered more general, have a lower probability of overfitting, and are much faster than wrapper approaches [8]. Rough Set Theory (RST), and in particular its Dependency Degree (RSTDD), is widely used in the objective functions of filter FS approaches to measure the dependency between feature subsets and class labels [36]. RSTDD has shown several advantages over other methods, including greater efficiency, faster computation, and not needing any preliminary or additional information about the data [36]. These approaches usually work well for nominal datasets; however, they suffer from several inefficiencies when applied to mixed and numerical as well as high dimensional datasets. Therefore, this paper proposes a new, more efficient filter feature selection approach for classification, named Enhanced Binary Cuckoo Search (EBCS), for three types of datasets: nominal, mixed and numerical. EBCS uses RSTDD to develop a new objective function and employs frequent values and Rough Set Theory for feature selection.

A. THE INNOVATION
Building on the strengths of RSTDD and BCS, the proposed approach enhances BCS and develops a new objective function, resulting in a more efficient filter FS for classification on nominal, mixed and numerical datasets. Section II describes the main limitations of the current hybrid NIA and RSTDD approaches in more detail, which include inefficiency for mixed and numerical datasets [37], [38] and weak convergence for high dimensional datasets [39]. EBCS, on the other hand, achieves a reduced number of selected features in less computational time and without a significant reduction in classification accuracy for mixed, nominal and numerical datasets with varying numbers of features, objects and classes. The original contributions of this paper are, firstly, the development of a new, improved filter feature selection approach for classification (EBCS), with a new initialization mechanism that covers most of the search space and a new global updating mechanism with more efficient convergence; and secondly, the development of a new EBCS objective function that produces a reduced feature subset achieving maximum classification accuracy and a minimum number of features for nominal, mixed and numerical datasets by balancing RST, the number of features and frequent values.
The remainder of this paper is organized as follows. Section II reviews related work of FS in classification, which used NIA and RST. Section III discusses briefly the basic concepts of RST, and introduces CS and BCS. In section IV the new proposed approach is described. The evaluation methodology is presented in section V. Section VI discusses experiments' results. The conclusion is included in section VII.

II. RELATED WORK
In the literature, there are many studies on filter FS for classification, each using different search methods and objective functions to find the most relevant features. Three traditional filter FS approaches were proposed in [40], [41]. In [40], Karthikeyan et al. proposed two filter FS approaches: the first combined a best-first search method with Correlation-based Feature Selection (CFS), while the second combined a greedy search method with CFS. CONsistency-based feature selection (CON) with different traditional search methods, such as forward selection, was presented in [41]. To improve the search strategy in filter FS, many researchers have used one or more NIAs to explore the search space. Recently, Zhang et al. developed two new filter FS approaches based on NIAs: ABC was enhanced and used in the first approach [42], while Differential Evolution was improved and employed in the second approach [43]. The authors of [44] integrated a Genetic Algorithm (GA) with CFS to address the filter FS problem, and PSO and CFS were combined in [45] to develop a new filter FS. Hybrid filter FS approaches based on NIAs and RST have achieved significant results with low computational time, especially because RST is both efficient and fast [38], [46]-[48]. In this section, we focus on reviewing this type of hybrid approach.
ACO is an NIA which simulates the behavior of real ants to find the shortest path between their nest and a food source [23]. ACOAR [46], ACOFS [49], AnTRSAR [50] and ARRST [51] employed ACO and RST in filter FS approaches. ACOAR, ACOFS and AnTRSAR were developed for low dimensional nominal datasets, while ARRST was developed for low dimensional numerical datasets; ARRST used RST's principle of indiscernibility to discretize the numerical values (i.e., to convert numerical datasets into nominal datasets). ACOAR achieved the maximum classification accuracy with the minimum feature subset compared to the other approaches, because it bounds the values of the pheromone and heuristic information, which prevents the algorithm from drifting away from the optimal feature subset. AnTRSAR, however, is more expensive than ACOFS and ACOAR: it uses entropy in its heuristic information update, which is a more costly operation than the RSTDD used in ACOFS and ACOAR. ARRST was shown to be the slowest, because discretizing numerical data is computationally expensive and takes significant time. On the other hand, experimental results showed that ARRST maximizes the classification accuracy on four of the seven tested datasets, although most of the features in these datasets are nominal. All these approaches used RSTDD in their objective functions, but ACOAR used two user-configurable parameters to control the relative importance of the number of features and their quality (i.e. classification accuracy), whereas AnTRSAR balanced the number of features and the classification accuracy automatically.
In general, ACO uses a graph to represent its search space and employs several configuration variables. These approaches are therefore more complex, more expensive, slower, and have weak convergence [52], [53], making them inefficient for high dimensional datasets.
PSO is another NIA, which simulates the movement of flocks of birds around food sources [24]. Several filter FS approaches in the literature combined PSO with RSTDD, such as PSORSFS [47], PSO-RR [48], PSO-QR [54], PSO-RS [55] and EPSORSNA [56]. PSORSFS was applied to low dimensional, general datasets; PSO-RR, PSO-QR, and PSO-RS were applied to low dimensional medical datasets; and EPSORSNA used PSO, RST and a quick reduct algorithm on general structured datasets. However, all of these datasets consist of features with nominal characteristics. The main difference between these approaches is the objective function. PSORSFS developed a flexible objective function based on RSTDD with two simple parameters to control the relative importance of classification accuracy and the number of selected features, while PSO-RR and PSO-QR used relative dependency and RSTDD respectively and focused on classification accuracy only. The objective function in PSO-RS used RSTDD, but it also used cardinality and two variables to select the minimum feature subset with maximum classification accuracy, which makes PSO-RS computationally inefficient. EPSORSNA, on the other hand, combined RSTDD with a quick reduct algorithm and a group of parameters to minimize the number of selected features while keeping at least the same classification accuracy as the original nominal datasets.
In general, PSO is easier to implement, requires fewer parameters, and is less expensive than ACO. However, because it relies on the global search mechanism and poor initial solutions, it has weak convergence [57]. Similarly, PSORSFS, PSO-RR, PSO-QR, and PSO-RS have weak convergence for high dimensional datasets [47], [54], [55].
ABC is another popular NIA inspired by the natural foraging behavior of honey bees [25]. Several approaches combined it with RST concepts to develop filter FS, including NDABC [58], BeeRSAR [59], BeelQR [60], and ABC-FTSBPSD [61]. NDABC has been applied to a low dimensional nominal dataset, while both BeeRSAR and BeelQR have been applied to five low dimensional nominal medical datasets. ABC-FTSBPSD, however, has been applied to low dimensional mixed datasets with nominal and numerical features; in its initial step, it discretizes the numerical features, which can be very complex and inefficient and thus results in low performance compared to other approaches [61]. NDABC, on the other hand, used a meta-heuristic search and a greedy selection algorithm to avoid falling into a local optimum and to speed up convergence. For its objective function, NDABC uses RSTDD and one user-configurable parameter to control the balance between the number of features and the classification accuracy, while the other approaches used the principle of indiscernibility in their objective functions with no user-balancing configuration, making them easier to use.
In general, ABC approaches are more efficient than those based on ACO and PSO, but they have relatively weak convergence, are relatively slow, and may fall into a local optimum, especially when applied to high dimensional datasets [52].
Ibrahim et al. [62] combined rough sets and neighborhood rough sets with the runner-root algorithm (RRA) to select the relevant features in nominal (discrete), mixed and numerical (continuous) structured datasets. This approach used rough sets and neighborhood rough sets separately as objective functions to work with different types of datasets, while the improved runner-root algorithm applied many operators to decrease the probability of being trapped in a local optimum. The approach thus tried to solve FS on different types of structured datasets by using two objective functions, one based on RST and the other combining RST with the Euclidean distance, which resulted in a high computational cost. Das et al. [63] proposed a group incremental feature selection for classification using rough set theory based on a genetic algorithm. In this approach, RST was used to remove imprecise, vague and inconsistent data during feature selection, and the objective function applied the positive region and the previously generated reduct to select the relevant features in small nominal datasets. In [64], the authors proposed an ant lion optimizer based mainly on the hybridization of rough sets and conditional entropy. This work tried to improve the quality of the initial population, and eventually of the final optimal solution, but the conditional entropy and the different techniques that were used increased the computational time; in addition, the approach did not address numerical datasets. One of the goals of [65] was to propose a new FS approach that employed RST and a binary version of the water wave optimization algorithm. Its objective function used two parameters with RST to decrease the number of selected features and to increase the RST value. The approach was evaluated on sixteen datasets with fewer than 69 features, most of them with nominal features, and RST's dependency degree is in particular inefficient for numerical features. Moreover, the binary version of water wave optimization requires many computational operations; the approach is therefore efficient for small nominal datasets, but the number of computational operations increases its computational time. The authors of [66] proposed a filter FS approach by combining the Master River Multiple Creeks Intelligent Water Drops (MRMC-IWD) model with different objective functions, such as fuzzy RST [67], to solve FS. MRMC-IWD with fuzzy RST produced the minimum feature subset, but not the maximum classification performance. Similarly, Diao and Shen [68] developed a novel filter FS approach based on the Harmony Search (HS) algorithm and several objective functions, fuzzy RST being one of them. To improve the performance of HS, the authors introduced methods to select the initial parameters dynamically, and the search process was modified to select small feature subsets while keeping the classification performance. In their experiments, the authors combined three NIAs, namely genetic algorithms (GA) [69], PSO [47] and hill climbing (HC) [70], with fuzzy RST to evaluate HS based on fuzzy RST; the results showed that it achieved the best feature reduction, but not the best classification accuracy.
Cuckoo Search (CS) is an efficient NIA developed by Yang and Deb [26]. It mimics the reproduction strategy of cuckoo birds and has been used in several different domains due to its efficiency [71]-[73]. It employs a more efficient search mechanism and requires fewer user-configurable variables compared to ACO, PSO, GA, RRA, HS, HC and ABC. Generally, CS is considered relatively easy to implement and has fast and efficient convergence. Binary CS (BCS) is a binary version of CS that uses a binary representation [27].
Recently, a number of hybrid filter FS approaches based on BCS and RST principles have been proposed, including the modified cuckoo search algorithm with rough sets for feature selection (MCSRS) [37], feature selection based on hybrid binary cuckoo search and rough set theory in classification for nominal datasets (FS-BCS) [38], and a hybrid rough set feature selection using cuckoo search optimization (RSFSCSO) [74]. Both MCSRS and FS-BCS have been applied to low dimensional nominal datasets, while RSFSCSO has been applied to a malware classification dataset built from header field values of Portable Executable files (69 features). MCSRS and FS-BCS used RSTDD in their objective functions with parameters to balance the number of selected features and the classification accuracy. MCSRS improves on FS-BCS in two ways: it changed α from a constant to a variable that decreases as the number of iterations increases, to avoid drifting away from the best feature subset, and it used parallel techniques to update the population of candidate feature subsets. In other words, FS-BCS combined the standard BCS with RSTDD without any modification to address low dimensional nominal datasets, while MCSRS used RSTDD and BCS with the α modification for the same type of datasets. RSFSCSO, on the other hand, did not use any parameter to balance the number of selected features against the classification accuracy, which means it tries to find the solution with the highest RSTDD regardless of the number of selected features. All these approaches suffer from two main drawbacks. Firstly, they have weak convergence when applied to high dimensional datasets [39], because BCS's initialization and global update mechanisms do not cover the large search space; this is described in more detail in subsection IV-B. Secondly, they are inefficient for numerical and mixed datasets, because their objective functions, which use RSTDD, are only efficient for nominal datasets [37], [38].
To overcome the above limitations, this paper proposes a more efficient filter FS by enhancing BCS, and developing a new objective function based on frequent values and RST. To increase the convergence efficiency of BCS, the paper developed a new initialization mechanism that covers most of the search space and a new global updating mechanism that has more efficient convergence for both low and high dimensional datasets. This also makes the proposed approach more efficient for nominal, mixed and numerical datasets. The paper also developed a new EBCS objective function to produce a reduced feature subset that achieves maximum classification accuracy and minimum number of features by balancing RST, number of features and frequent values. This approach aims to reduce the number of selected features with improved computational time and improved or similar classification accuracy for mixed, nominal, and numerical high dimensional datasets. The following section describes some of the preliminary concepts used in the proposed approach.

III. PRELIMINARY CONCEPTS
This section provides a general discussion of the background concepts used in the rest of the paper.

A. ROUGH SET THEORY (RST)
Zdzislaw Pawlak developed Rough Set Theory (RST) in 1982 [75]. RST is a mathematical tool for data analysis and data mining. RST has been applied in many domains for several reasons: it provides an efficient approach for discovering useful patterns in data, it is easy to implement and understand, and it depends on the input data alone [76]. Basic concepts of RST are discussed below.

1) INFORMATION TABLE OR INFORMATION SYSTEM
The dataset is called an information table or an information system. Let I = (U, A), where I is the information table, U is a nonempty set of finite objects and A is a nonempty set of features. Table 1 shows an example of an information table.

3) POSITIVE REGION (POS)
Let P and Q be equivalence relations over U; then the positive region contains all objects of U that can be classified into classes of U/Q using the information in the features of P [50]. In other words, objects that have the same values in their features are classified into the same class. For example (Table 1), let P = {a, b, c} and Q = {e}; we need to find the objects that have the same values in features a, b and c and, at the same time, belong to the same class. Table 2 shows the result.

4) DEPENDENCY DEGREE
Discovering the dependencies between attributes is a very important issue in data analysis. For P, Q ⊂ A, if all attribute values from Q are uniquely determined by the values of the attributes from P, then Q depends totally on P. If some of the values of Q partially depend on P, then Q depends on P with a degree K (0 ≤ K ≤ 1); if K = 0, then Q does not depend on P. The dependency degree is defined by equation 1 [76]:

γ_P(Q) = |POS_P(Q)| / |U|    (1)

where |U| is the total number of objects, |POS_P(Q)| is the number of objects in the positive region, and γ_P(Q) is the dependency between the feature subset P and the classes Q (Table 2 gives an example). Many filter FS approaches for classification use the dependency degree to build their objective function, which guides the search algorithm toward the optimal or near-optimal solution by measuring the dependencies between the feature subsets and the class labels [58], [77]. A frequently used objective function that balances the dependency degree against the number of selected features is equation 2 [50]:

Fitness(R) = γ_R(Q) × (|C| − |R|) / |C|    (2)

where R is the selected feature subset, |C| is the total number of features and |R| is the number of selected features.
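To make the positive-region and dependency-degree computation concrete, the following is a minimal Python sketch (not the paper's PHP implementation); the small information table, the feature names a, b, c and the class e are hypothetical values chosen only for illustration.

```python
from collections import defaultdict

def dependency_degree(table, P, Q):
    """Rough-set dependency degree gamma_P(Q) = |POS_P(Q)| / |U|.

    table: list of dicts mapping feature name -> value (one dict per object)
    P: condition features, Q: decision features (class labels)
    """
    # Group objects into equivalence classes induced by the features in P.
    blocks = defaultdict(list)
    for obj in table:
        blocks[tuple(obj[f] for f in P)].append(obj)

    # An equivalence class belongs to the positive region only if all of its
    # objects share the same decision values (i.e., the same class in Q).
    pos = 0
    for objs in blocks.values():
        decisions = {tuple(o[f] for f in Q) for o in objs}
        if len(decisions) == 1:
            pos += len(objs)

    return pos / len(table)

# Hypothetical information table with condition features a, b, c and class e.
U = [
    {"a": 1, "b": 0, "c": 2, "e": "yes"},
    {"a": 1, "b": 0, "c": 2, "e": "yes"},
    {"a": 0, "b": 1, "c": 2, "e": "no"},
    {"a": 0, "b": 1, "c": 2, "e": "yes"},  # conflicts with the previous object
    {"a": 1, "b": 1, "c": 0, "e": "no"},
]

print(dependency_degree(U, P=["a", "b", "c"], Q=["e"]))  # 3/5 = 0.6
```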

B. CUCKOO SEARCH
This section briefly reviews the cuckoo birds, Lévy flight of birds, CS algorithm and BCS algorithm.

1) CUCKOO BREEDING BEHAVIOR AND LÉVY FLIGHT

a: CUCKOO BREEDING BEHAVIOUR
The reproduction strategy of cuckoo birds is aggressive.
Cuckoos lay their eggs in the nests of other host birds and rely on those birds to host the eggs. Sometimes the host birds discover the strange eggs and either throw them out or abandon the nest and build a new one. Cuckoo eggs mimic the pattern and color of the native eggs to reduce the probability of being discovered. If the cuckoo egg hatches first, the cuckoo chick destroys all the other eggs in the nest to get all the food provided by its host bird [26], [78].

b: LÉVY FLIGHT
In nature, many animals and insects search for food by moving to the next location based on the current location. This search behavior is called a Lévy flight, a special case of a random walk in which the step sizes follow a Lévy probability distribution, which improves convergence [79].
The step size is the main factor determining the efficiency of a Lévy flight search. A Lévy flight is modelled by equations 3 and 4:

X^(t+1) = X^t + α ⊕ Lévy(λ)    (3)

where X^(t+1) represents the next solution, X^t is the current solution, t is the iteration, and α is the step-size scaling factor of the problem (α > 0); in most cases α = 1 can be used. ⊕ denotes entry-wise multiplication, and λ is the Lévy distribution coefficient (0 < λ ≤ 3). The random step length Lévy(λ) is drawn from the power law in equation 4:

Lévy(λ) ∼ u = t^(−λ)    (4)
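For intuition, a minimal Python sketch of the update in equation 3 is given below. The heavy-tailed step is drawn with NumPy's Pareto sampler as a lightweight stand-in for a true Lévy-stable generator (production implementations often use Mantegna's algorithm), and the values of α, λ and the dimensionality are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def levy_step(dim, lam=1.5, rng=np.random.default_rng()):
    """Heavy-tailed random step lengths with random signs.

    Uses a Pareto draw (power-law tail) as a simple approximation of a
    Levy(lambda) step; this is an assumption for illustration only.
    """
    lengths = rng.pareto(lam, size=dim)        # power-law distributed step lengths
    signs = rng.choice([-1.0, 1.0], size=dim)  # random direction per dimension
    return lengths * signs

def levy_flight_update(x_t, alpha=1.0, lam=1.5):
    """One position update: x_(t+1) = x_t + alpha (entry-wise) Levy(lambda)."""
    return x_t + alpha * levy_step(len(x_t), lam)

x = np.zeros(10)              # current (continuous) solution
x_next = levy_flight_update(x)
```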

2) CUCKOO SEARCH AND BINARY CUCKOO SEARCH ALGORITHMS
Cuckoo Search (CS) is an efficient NIA algorithm that has been developed by Yang and Deb [26]. CS depends on three rules [26]: 1) Each cuckoo lays one egg at a time, and dumps it in a randomly chosen nest.
2) The best nests with high quality eggs (solutions) will carry over to the next generations.
3) The number of available host nests is fixed, and the egg laid by a cuckoo is discovered by the host bird with a probability p_a ∈ [0, 1].
In more detail, each nest represents a solution, and CS aims to replace the "not so good" solutions (nests) with better ones. CS starts by generating the population of nests (solutions) randomly; then, in each iteration, it uses the threshold value p_a ∈ [0, 1] to find the nests with the lowest quality, which are updated using Lévy flights. The population update is repeated until the maximum number of user-defined iterations is reached [26].
Binary Cuckoo Search (BCS) [27] is a binary version of CS in which the search space is modelled as an n-dimensional binary vector. For FS, BCS represents each nest as a candidate feature subset and each egg as a feature, where 1 corresponds to a selected feature and 0 to a non-selected feature [27].
Recall that each solution in BCS must be a binary vector, but the Lévy flight does not return binary bits. Therefore, BCS uses equations 5 and 6 to create a binary vector for each new candidate solution [27]:

S(x_i^j(t)) = 1 / (1 + e^(−x_i^j(t)))    (5)

x_i^j(t+1) = 1 if S(x_i^j(t)) > σ, otherwise 0    (6)

where x_i^j(t) is the j-th bit of the i-th nest at iteration t and σ ∈ U(0, 1). Algorithm 1 presents the main steps of the standard BCS.
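A minimal sketch of this binarization step is shown below: the continuous value produced by the Lévy flight update is squashed with the sigmoid of equation 5 and compared against a uniform random threshold σ as in equation 6. The function name and the example vector are illustrative.

```python
import numpy as np

def binarize(x_continuous, rng=np.random.default_rng()):
    """Map a continuous BCS position to a binary feature-selection vector.

    Equation 5: S(x) = 1 / (1 + exp(-x))
    Equation 6: bit = 1 if S(x) > sigma, else 0, with sigma ~ U(0, 1).
    """
    s = 1.0 / (1.0 + np.exp(-x_continuous))      # eq. 5: sigmoid transfer function
    sigma = rng.uniform(0.0, 1.0, size=s.shape)  # eq. 6: per-bit random threshold
    return (s > sigma).astype(int)               # 1 = feature selected, 0 = not selected

# Example: a candidate nest over 8 features after a Levy flight update.
nest = binarize(np.array([2.1, -0.3, 0.0, 1.7, -2.5, 0.4, 3.0, -1.1]))
```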

IV. ENHANCED BINARY CUCKOO SEARCH (EBCS) WITH FREQUENT VALUES AND ROUGH SET THEORY FOR FEATURE SELECTION
This section describes the proposed approach by enhancing the binary cuckoo search algorithm with a new objective function. EBCS improves the binary cuckoo search algorithm by developing a new initialization mechanism, a new updating mechanism, and a new objective function. The new objective function aims to find the minimum features subset without significant reduction in classification performance for nominal, mixed, and numerical datasets with varying number of features. Subsection IV-A describes the newly developed objective function and subsections IV-B and IV-C describe the enhanced BCS.

A. NEW OBJECTIVE FUNCTION
In general, using a feature subset that has high relevancy to the class labels and highly frequent values in its features increases classification performance in many classification algorithms [80], [81]. Highly frequent values, which are used to build the future classification model, correspond to a high reuse probability. Frequent values alone, however, are not a sufficient indicator of classification performance, because the dependency between the feature subset and the class labels may be low. RSTDD provides an efficient measure of the dependency between the feature subset and the class labels. However, it alone is not sufficient to provide a scalable indicator of classification performance for all types of datasets, especially datasets with low and varying value frequencies (e.g. mixed and numerical datasets). Therefore, it is necessary to develop a new, more efficient objective function for nominal, mixed and numerical datasets by balancing frequent values, dependency degree, and the number of selected features.
1) PERCENTAGE OF DISTINCT VALUES
The percentage of distinct values of a feature subset is calculated by equation 7:

Distinct(R)% = DistinctValues(R) / Objects    (7)

where R is a subset of features, DistinctValues(R) is the average number of distinct values over the features of R, and Objects is the total number of objects in the dataset.

2) DEPENDENCY DEGREE
We use RSTDD (equation 1) to measure the dependency between the feature subset and class labels.

3) BALANCING BETWEEN THE PERCENTAGE OF DISTINCT VALUES AND DEPENDENCY DEGREE
In general, high dependency degree is a good indicator to measure relevancy between the feature subset and class labels. Low distinct percentage for the feature subset is a good indicator for high frequent values. In other words, a feature subset, which has maximum dependency degree and minimum distinct values percentage, is desirable for classification algorithms; dependency degree is close to one (∼100%) and percentage of distinct values is close to zero. Equation 8 balances between the two; it provides high quality when the feature subset is more desirable for classification algorithms.
where R is a subset of features, and γ R (Q) is the dependency between feature subset R and classes Q.

4) BALANCING BETWEEN THE QUALITY AND NUMBER OF SELECTED FEATURES
FS is a multi-objective problem where maximum classification performance and minimum number of selected features are goals of FS [81]. Equation 8 achieves the first objective only. To achieve the two objectives, equation 8 is modified to equation 9.
FinalQuality(R)% = Quality(R) × (|C| − |R|) / |C|    (9)

where Quality(R) is calculated from equation 8, |C| is the number of available features (total features), and |R| is the number of selected features. Equation 9 multiplies the quality of the feature subset by the factor (|C| − |R|)/|C| to achieve both objectives; in other words, it scales equation 8 so as to guide EBCS toward the feature subset that achieves the minimum number of selected features and the maximum classification performance.
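The sketch below pieces together the ingredients described above: the distinct-value percentage of equation 7, the dependency degree of equation 1 (passed in as a precomputed value, for example from the dependency_degree sketch in section III-A), and the feature-count factor of equation 9. Because the exact form of equation 8 is not reproduced here, Quality(R) is modelled as a simple average of the dependency degree and one minus the distinct-value percentage; that combination is an assumption for illustration only, not the paper's exact formula.

```python
def distinct_percentage(table, R):
    """Average number of distinct values over the features in R, divided by
    the total number of objects (cf. equation 7). table is a list of dicts."""
    avg_distinct = sum(len({obj[f] for obj in table}) for f in R) / len(R)
    return avg_distinct / len(table)

def final_quality(table, R, gamma, total_features):
    """Equation 9: Quality(R) scaled by (|C| - |R|) / |C|.

    gamma is the rough-set dependency degree of R with respect to the class
    labels (equation 1). ASSUMPTION: Quality(R) is taken here as the mean of
    gamma and (1 - distinct percentage); the paper's equation 8 may combine
    these terms differently.
    """
    quality = (gamma + (1.0 - distinct_percentage(table, R))) / 2.0  # assumed form of eq. 8
    return quality * (total_features - len(R)) / total_features
```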

B. NEW INITIALIZATION MECHANISM
In BCS [27], the initial solution (or nest) is randomly initialized as shown in algorithm 2.

Given this binary selection over the search space, each feature has a 50% probability of being selected [82]. A good initialization mechanism is one capable of generating initial solutions with many different numbers of features, covering as much of the search space as possible.
For example, assume the total number of features in a search space is 50. The possible number of features to select for the search ideally could be from 0 to 50. A good initialization strategy is one that generates solutions uniformly between 0 and 50. However, algorithm 2 does not guarantee uniform generation across the search space.
To understand the reason, consider the probability theory behind algorithm 2. The number of successes in a sequence of n (the total number of features) independent selected/not-selected experiments follows the binomial distribution [82], where the success probability (p = 50%) and the number of features n determine how likely each number of successes is. The probability of getting exactly k successes in n trials is given by the probability mass function P(k) = C(n, k) p^k (1 − p)^(n−k). Subfigures 1 (A-D) show examples of the probability mass function for different values of n. The consequence of the binomial distribution is that the probability of generating feature subsets near the end points decreases as n increases; for large n, the probability is near 0 in the regions below 25% and above 75% of the total number of features.
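This concentration can be checked numerically: with p = 0.5, the binomial probability mass outside the 25%-75% band shrinks rapidly as n grows. The short Python check below uses SciPy's binomial distribution; the listed values of n are illustrative.

```python
from scipy.stats import binom

p = 0.5
for n in (10, 20, 50, 100, 500):
    # Probability that a random 0/1 initialization selects fewer than 25%
    # or more than 75% of the n available features.
    low = binom.cdf(int(0.25 * n) - 1, n, p)   # P(X < 0.25 * n)
    high = binom.sf(int(0.75 * n), n, p)       # P(X > 0.75 * n)
    print(f"n={n:4d}: P(outside 25%-75%) = {low + high:.4f}")
```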
However, this strategy does not help BCS cover most of the search space (i.e. most of the possible numbers of selected features), and thus causes slow and weak convergence. Some approaches used different initialization methods to improve the initialization strategy [83], [84]. To improve the initialization strategy in BCS for FS, a new mechanism that divides the initialization space equally into three parts is developed; see figure 2 and algorithm 3 (Lines 7-34). This is described in more detail below.
Small Part: aims to generate feature subsets that select around 25% of the available features. This helps to find optimal solutions with small numbers of features. Small initialization consists of three steps: first, start from an empty set; second, select a random number s between one and half the number of available features; third, randomly select s features from all available features and add them to the set. Thus, the possible numbers of selected features cluster around a quarter of the available features, which therefore have a greater chance of being generated than others. See algorithm 3 (Lines 7-14).
Medium Part: aims to reach and search the region of feature subsets with medium numbers of features. This helps to find optimal solutions within the medium size range. This mechanism also starts from an empty set, then randomly selects features from all available features and adds them to the set. This initialization focuses on feature subsets with around half the number of available features, and is the same as the traditional BCS initialization strategy. Figure 2 shows that the possible numbers of selected features around half the number of available features have a greater chance of being generated than others. See algorithm 3 (Lines 15-22).
Large Part: this mechanism searches for feature subsets that select close to the full number of available features. This helps to find optimal solutions with large numbers of selected features (see algorithm 3). A sketch of the three-part initialization mechanism is given below.
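Based on the description of the three parts, the following is a minimal Python sketch of how such a population could be initialized, with roughly one third of the nests drawn from each region (small ≈ 25%, medium ≈ 50%, large ≈ 75% of the features selected). It is a reconstruction from the prose above rather than the authors' algorithm 3 or their PHP code; in particular, the large part is assumed to mirror the small part by starting from the full feature set and removing features.

```python
import numpy as np

rng = np.random.default_rng()

def init_small(n_features):
    """Select s features, with s drawn from [1, n/2]; subsets cluster near 25%."""
    s = rng.integers(1, n_features // 2 + 1)
    nest = np.zeros(n_features, dtype=int)
    nest[rng.choice(n_features, size=s, replace=False)] = 1
    return nest

def init_medium(n_features):
    """Standard BCS initialization: each feature selected with probability 0.5."""
    return rng.integers(0, 2, size=n_features)

def init_large(n_features):
    """ASSUMED mirror of the small part: start from all features and drop s of them."""
    s = rng.integers(1, n_features // 2 + 1)
    nest = np.ones(n_features, dtype=int)
    nest[rng.choice(n_features, size=s, replace=False)] = 0
    return nest

def initialize_population(pop_size, n_features):
    """Roughly one third of the nests from each part of the search space."""
    parts = [init_small, init_medium, init_large]
    return np.array([parts[i % 3](n_features) for i in range(pop_size)])

population = initialize_population(pop_size=20, n_features=50)
```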

C. NEW GLOBAL SEARCH MECHANISM
The goal of the global search is to cover as much of the search space as possible to guarantee global convergence and low computational time. Because the global search in BCS relies on the traditional initialization mechanism, it is not guaranteed to converge when the search space contains more than 20 features. BCS's global search is therefore redeveloped using the same strategy introduced for the new initialization mechanism, so that it covers as much of the search space as possible. This is achieved by algorithm 3, lines 35-61.

V. EVALUATION METHODOLOGY
This section describes the followed evaluation methodology and the datasets used in the experiments.

A. DATASETS
In order to evaluate the performance of EBCS, a group of experiments was run on sixteen datasets selected from the University of California at Irvine (UCI) machine learning repository [85]. UCI divides classification datasets into three groups according to their feature types: firstly, the "nominal" group, which contains 28 datasets; secondly, the "numerical" group, which contains 137 datasets; and thirdly, the "mixed" group, which includes 37 datasets. Our approach aims to achieve FS for nominal, mixed, and numerical datasets with different characteristics, including different numbers of features, objects, and class values. To evaluate this, sixteen datasets possessing these characteristics were selected randomly as follows: four datasets from the "nominal" group, four datasets from the "mixed" group, and eight datasets from the "numerical" group. Table 3 shows these datasets and their characteristics.
A common way to conduct FS experiments for classification is to randomly divide each dataset into two sub-datasets: a training set, and a learning and test set. The training set contains around 70% of the dataset's objects, and the remaining 30% form the learning and test set [84], [86]. The training set is used by the FS approaches to perform feature reduction, while the learning and test set is used to build the classification model and estimate the classification performance.
In this work, K-fold cross-validation [87] was used to split the learning and test set into disjoint sets to build the classification model and estimate the classification performance. When an object belongs to the test set, its class label is hidden from the classification model, which is built on the learning set only. In particular, K-fold cross-validation splits the learning and test set into K subsets; the learning set is formed from K−1 subsets, and the test set is the remaining subset. The process is repeated over the K partitions to calculate the classification performance.

B. EVALUATION METHOD
In order to evaluate the performance of EBCS, the paper takes an indirect approach [12]. Two comparisons are commonly used in the indirect approach to evaluate FS. The first is a before-and-after comparison, which measures the classification performance with all available features and with the selected features. The second compares the efficiency of a specific FS approach to other FS approaches applied to the same 16 selected datasets. The developed approach (EBCS) is compared with the baseline FS-BCS algorithm (subsection III-B2), which uses the basic binary cuckoo search as the search strategy together with the objective function of equation 2, both taken from the literature [38]. In addition, EBCS is compared to ten FS approaches as follows. In section VI-E, EBCS is compared to the experiments published in [68] over five datasets taken from UCI [88]; these experiments compare five NIAs (HS, GA, PSO, HC and MRMC-IWD) combined with fuzzy RST. Comparisons between EBCS, a genetic algorithm [89] with CFS [90], a multi-objective evolutionary algorithm [91] with CFS [90], and PSO [92] with CFS [90] are discussed in section VI-F. The experimental results comparing EBCS with best-first search with CFS [90], [93] and with linear forward selection with the Consistency Subset Evaluator (CON) [41], [94] are included in section VI-G. Two classifiers (DT [95] and NB [96]) are used to measure the classification performance.

C. BENCHMARKING AND EXPERIMENT DESIGN
All experiments are run on a personal computer running Windows 10 with (i7) 3 GHz processor and 16 GB RAM. EBCS and FS-BCS are implemented using PHP. Other approaches in the experiments have standard implementations in the Weka tool [97]. The standard Weka tool implementation of K-fold cross-validation is also used [97].
The parameter settings are based on the default settings in the Weka tool, the original BCS, and test experiments. For the Weka tool, default values were used for all parameters except the population size (= 20), the maximum number of iterations (= 20), and k (= 10) in the K-fold cross-validation.
Test experiments were run initially to find suitable values. The authors found that running the experiments five times was sufficient to obtain consistent results between EBCS and FS-BCS; running them more than five times did not improve the comparison, and thus results for additional runs are not reported. The number of selected features (feature subset) and the computational time are recorded for each run, and the run that achieved the best classification accuracy is selected. Similarly, the PSO and genetic approaches are run for each training set until a good FS is achieved, and the number of selected features is recorded for each run.
The DT "J48" [95] and NB [96] classification algorithms implemented in the Weka tool are used to measure the classification performance of all approaches used in our experiments. They are applied to each reduced learning set to build the classification model, which is then applied to the test set to measure the classification performance. Classification performance was also measured for all datasets before and after FS. NB and DT were selected because they are commonly used to evaluate FS and are considered two of the top ten data mining algorithms that do not require complex initial parameter settings [12], [98].
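As a rough analogue of this Weka-based evaluation step, the sketch below scores a reduced feature subset with decision tree and naive Bayes classifiers under 10-fold cross-validation using scikit-learn; the paper itself uses Weka's J48 and NB implementations, so this is only an equivalent illustration, and X, y and selected are placeholder inputs.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, selected, k=10):
    """Return k-fold CV accuracy of DT and NB on the selected features.

    X: 2-D array of feature values, y: class labels,
    selected: binary vector (1 = feature kept by the FS approach).
    """
    X_reduced = X[:, np.asarray(selected, dtype=bool)]
    dt_acc = cross_val_score(DecisionTreeClassifier(), X_reduced, y,
                             cv=k, scoring="accuracy").mean()
    nb_acc = cross_val_score(GaussianNB(), X_reduced, y,
                             cv=k, scoring="accuracy").mean()
    return dt_acc, nb_acc
```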

VI. RESULTS AND DISCUSSIONS
This section presents and discusses the results of EBCS compared to the baseline FS-BCS approach, ten known filter FS approaches, and all available features (before FS).
Differences in accuracy are considered significant when they are greater than 5% [99], and the accuracy of different methods is considered equal when the difference is less than 1% [84]. In all results, no significant difference was noted between the average precision and the average recall, so accuracy was sufficient to evaluate the classification performance for the datasets used in the experiments. In addition, the statistical significance of the classification accuracy was tested using the sign test [100] on the results of the 10 evaluated approaches. The results were found to be statistically significant for the evaluated approaches, with a p-value less than or equal to 0.05.
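The sign test referred to above can be carried out, for example, as a two-sided binomial test on the number of datasets where one approach beats the other, discarding ties. The sketch below uses SciPy; the per-dataset accuracies are hypothetical and the helper name is illustrative.

```python
from scipy.stats import binomtest

def sign_test(acc_a, acc_b, alpha=0.05):
    """Two-sided sign test on paired per-dataset accuracies (ties discarded)."""
    wins_a = sum(a > b for a, b in zip(acc_a, acc_b))
    wins_b = sum(b > a for a, b in zip(acc_a, acc_b))
    result = binomtest(wins_a, n=wins_a + wins_b, p=0.5)
    return result.pvalue, result.pvalue <= alpha

# Hypothetical accuracies of two FS approaches on a set of datasets.
p_value, significant = sign_test(
    [0.95, 0.88, 0.91, 0.84, 0.90, 0.93, 0.87, 0.89],
    [0.90, 0.85, 0.91, 0.80, 0.86, 0.90, 0.83, 0.85],
)
```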

A. COMPARISONS BETWEEN EBCS AND FS-BCS
Experimental results of EBCS and FS-BCS for the sixteen datasets are shown in Table 4. The following discusses the results for each type of dataset.

1) NOMINAL DATASETS
Table 4 and subfigures 3 (A,B) show that EBCS and FS-BCS achieve the same size reduction, DT accuracy, and NB accuracy for the Congressional Voting Records dataset. For the Mushroom, Soybean, and Lung Cancer datasets, both EBCS and FS-BCS achieve better or roughly the same DT and NB accuracy, but EBCS achieves significant size reduction compared to FS-BCS. These results show that both the developed objective function and the traditional objective function [50] have roughly the same efficiency for nominal datasets. However, EBCS has better convergence for datasets that have more than 20 features and when the size reduction is less than one quarter of the available features; see section IV-B.

2) MIXED DATASETS
Table 4 and subfigures 3 (C,D) show that our approach (EBCS) achieves better size reduction on all mixed datasets compared to FS-BCS. For the Zoo dataset, EBCS achieves less size reduction but improves the classification accuracy (DT accuracy = 13%, NB accuracy = 26.7%). For the Hepatitis and German Credit Data datasets, EBCS achieves the same size reduction and significantly better classification accuracy (both DT and NB) than FS-BCS. For the Dermatology dataset, EBCS achieves better size reduction and classification accuracy than FS-BCS. These results show that the developed objective function performs more efficiently than the standard objective function on mixed datasets, where the numbers of distinct feature values differ. Similarly, they show that EBCS is more efficient than FS-BCS when a dataset (e.g. Dermatology) has more than 20 features.

3) NUMERICAL DATASETS
According to table 4 and subfigures 3 (E,F), EBCS and FS-BCS achieve the same FS for the Breast Cancer Wisconsin (Original) dataset, because the number of available features is 9 (a low dimensional dataset) and the numbers of distinct values of its features are the same. For the Wine dataset, EBCS achieves less size reduction but a significant improvement in classification accuracy (DT and NB) compared to FS-BCS. This improvement is caused by the newly developed objective function, which is more efficient than the standard objective function for this dataset because its features have different numbers of distinct values. For the ISOLET dataset, FS-BCS fails to achieve FS; it selects around half of the available features, whereas EBCS selects 8% of the available features without a significant reduction in classification accuracy. This is caused by the high number of distinct values (nearly equal to the total number of objects) in this dataset and the weak convergence of FS-BCS. The high number of distinct values and the number of selected features cause RSTDD to produce full dependency (100%) for the feature subsets, which subsequently forces FS-BCS to use local search only to update the population of solutions. This indicates that FS-BCS is trapped in a local optimum.
For the remaining datasets, EBCS achieves significantly better size reduction than FS-BCS (from 15.7% to 38.1%), with a significant improvement in classification accuracy (except for the Sonar dataset, where there is no significant reduction in NB classification accuracy). EBCS achieves more efficient convergence than FS-BCS, especially when the datasets have more than 20 available features. The developed objective function is also more efficient than the standard objective function when the features of a dataset have different numbers of distinct values. Table 5 shows the computational time, in seconds, of FS-BCS and EBCS for the 16 studied datasets. As expected, most of the computational time was spent in the objective function, which is run 20 times (the population size) per iteration. The time complexity of each iteration is thus O(pf + pn log n), where p is the population size, f is the number of features and n is the number of objects. In the experiments, p is constant (20), while f and n vary up to 617 and 8124 respectively. The effect of f on the time complexity is linear, while that of n is linearithmic. The per-iteration time complexity is therefore the same for FS-BCS and EBCS, although EBCS took less time than FS-BCS on fourteen datasets (see figure 4), mainly because EBCS runs fewer iterations to reach the best feature subset, converging much faster to the optimal subset than FS-BCS. The overall time complexity of both algorithms cannot be determined theoretically because they are non-deterministic algorithms that produce different outputs, with different numbers of iterations, across runs on the same input. Therefore, the paper measures the computational time of both algorithms empirically; see [101] for details.

B. ANALYSIS OF COMPUTATIONAL TIME AND NUMBER OF ITERATIONS
The Isolet-test dataset has numerical features with a very high number of distinct values (low value frequencies), and FS-BCS selects nearly half of the available features. This implies that the value of the standard objective function is 100%, which means FS-BCS uses the local update only to generate new candidate feature subsets after the first iteration (i.e., FS-BCS is trapped in a local optimum). Table 5 and figure 4 (A) show the number of iterations needed to find the best feature subset in both approaches. EBCS needs only 56% of the iterations FS-BCS needs to find the best feature subset on these datasets. Figure 4 (B) shows the percentage time difference for convergence of the two approaches. EBCS converges in less time than FS-BCS on 14 datasets, because it runs significantly fewer iterations to explore the search space for most datasets, except for the Dermatology dataset, where the difference in iterations (and thus in computational time) is minor. For the Isolet-test dataset, FS-BCS failed to achieve feature reduction.
To explain the effect of varying the number of objects (n) and the number of features (f) on the computational time of EBCS and FS-BCS, both algorithms were run multiple times on the Musk (version 1) dataset, which has a suitable number of features and objects. Subfigures 5 (A,B) show that, for a varying number of objects (n) and a constant number of features, the time complexity of both approaches is linearithmic in all cases, and EBCS is faster than FS-BCS (subfigures 5 (C,D)). Table 7 reports the corresponding results when the number of features is varied.

C. CLASSIFICATION PERFORMANCE BEFORE AND AFTER EBCS
According to table 4 and figure 7, EBCS achieves significant size reduction (79.4% on average) without significantly reducing the classification accuracy for all datasets using NB and DT (except Soybean, due to its small size).

D. ANALYSIS OF NEW OBJECTIVE FUNCTION
To evaluate the efficiency of the newly developed objective function, the standard objective function [50] was used in EBCS instead; this variant is named TEBCS to differentiate it. It was implemented in PHP and run five times for each training set, and the classification performance was then measured using NB and DT. Results of EBCS and TEBCS are shown in table 8. According to table 8 and subfigures 8 (A,B), the new objective function and the standard objective function [50] have the same efficiency when applied to nominal datasets whose features have roughly the same number of distinct values.
As shown, EBCS and TEBCS achieve the same accuracy values and the same SR% values on these datasets, so only two curves (EBCS accuracy, EBCS SR%) are visible in figure 8.
Results in table 8 and subfigures 9 (A,B) show that the new objective function is more efficient than the standard objective function for mixed and numerical datasets. The new objective function enables EBCS to achieve size reduction without a significant reduction in classification accuracy, whereas the standard objective function enables EBCS to achieve significant size reduction only at the cost of a significant reduction in classification accuracy. For datasets whose features have the same number of distinct values, e.g. Breast Cancer Wisconsin, both objective functions show the same efficiency. As shown, however, the newly developed objective function is more efficient for mixed and numerical datasets whose features have different numbers of distinct values.

E. COMPARISON BETWEEN EBCS AND HYBRID NIA WITH FUZZY RST APPROACHES
This section measures the efficiency of the EBCS approach by comparing it to five hybrid NIA and fuzzy RST approaches using the experimental results reported for the Master River Multiple Creeks Intelligent Water Drops (MRMC-IWD) model [66]. That experimental study compared MRMC-IWD to four state-of-the-art approaches (HS, GA, PSO, and HC) published in [68] over five datasets taken from UCI [88]. Table 9 summarizes the main characteristics of the datasets.
The datasets have different characteristics, such as different numbers of features, objects and classes, and different types of data. DT was used to measure the classification accuracy, and all approaches used in the experiment, as well as DT, were implemented in the Weka tool [97]. For a fair comparison with the experimental results in [66], EBCS was run on the same five datasets and DT was used to measure its classification accuracy.
The results in table 10 and subfigures 10 (A,B) show that EBCS achieves a significant improvement in classification accuracy with the best or the same size reduction compared to the other approaches on the Ionosphere and Water datasets. On the Sonar dataset, all approaches reduce the number of selected features more than EBCS, but EBCS improves the classification accuracy significantly compared to the others. On the Libras dataset, EBCS and HS achieve the same, and the best, results compared to the others, while GA increases the classification accuracy slightly (by less than 5%) but also the number of selected features significantly (by more than 100%) compared to EBCS. Finally, on Arrhythmia, EBCS achieves the best size reduction and the highest classification accuracy compared to the PSO, GA, HS, and MRMC-IWD approaches, and roughly the same classification accuracy and size reduction as the HC approach.
In general, EBCS achieves the best FS compared to PSO, GA, and MRMC-IWD on all five datasets, while compared to the HC and HS approaches, EBCS improves the FS on four datasets and achieves the same or roughly the same FS on one dataset.

F. COMPARISON BETWEEN EBCS, MULTI OBJECTIVES EVOLUTIONARY, PSO WITH CFS, AND GENETIC WITH CFS
In general, subfigure 11 (A) shows that EBCS achieves better size reduction and classification accuracy (in DT and NB) compared to PSO, the multi-objective evolutionary algorithm, and the genetic algorithm. EBCS removes nearly 79% of all features while improving the classification accuracy (DT and NB). PSO, on the other hand, removes only 59.8% of the features with a significant reduction in DT classification accuracy (−8.1%) and in NB classification accuracy (−7.2%).
Similarly, the genetic algorithm removes only 58.8% of the features with a significant reduction in DT classification accuracy (−5.4%) and in NB classification accuracy (−5.3%), and the multi-objective evolutionary algorithm removes 66% of the features with a reduction in DT classification accuracy (−4.4%) and in NB classification accuracy (−2.9%). Table 11, subfigures 12 (A,B) and figure 11 (B) show that EBCS achieves the best size reduction while improving the classification accuracy. For the Spectf dataset, all approaches achieve the same size reduction, while EBCS achieves the same or better classification accuracy compared to the others. On the Mushroom dataset, EBCS improves the size reduction with the same classification accuracy compared to the PSO and genetic approaches, and achieves the same size reduction and classification accuracy as the multi-objective evolutionary approach.
On the Vote dataset, PSO, the genetic algorithm and the multi-objective evolutionary algorithm achieve better FS than EBCS. For the other datasets, PSO, the genetic algorithm and the multi-objective evolutionary algorithm reduce the number of selected features but also the classification accuracy more than EBCS, while EBCS reduces the number of selected features without a significant reduction in classification accuracy using both the DT and NB classifiers.
Considering the results of the four approaches on all tested datasets, EBCS achieves better FS than PSO and the genetic algorithm on 15 out of the 16 datasets, and better FS than the multi-objective evolutionary algorithm on 14 datasets.

G. COMPARISON EBCS WITH POPULAR TRADITIONAL FS APPROACHES
To evaluate the performance of the EBCS approach, it is compared with two popular traditional FS approaches implemented in Weka for nominal and numerical features [97]. The first approach is best-first search with CFS [90], and the second is linear forward selection with the Consistency Subset Evaluator (CON) [41]. (In table 12, Size denotes the number of selected features, DT the decision tree accuracy and NB the naive Bayes accuracy; highlighted entries correspond to the best approach(es) based on the balance between classification accuracy and number of selected features.)
In general, table 12, figure 13 and subfigures 14 (A,B) show that EBCS achieves the best classification accuracy with both the DT and NB classifiers, with significant size reduction compared to the CFS and CON approaches. EBCS, CFS, and CON achieve 95.52%, 79.1%, and 69.8% respectively with the DT classifier, and 91.3%, 81.7% and 71.3% respectively with the NB classifier. EBCS reduces the number of features by 79.4% from the original datasets, while CFS and CON achieve size reductions of 63.5% and 70.2% respectively.
On the Zoo, Hepatitis, Dermatology, Soybean (small), Breast-W, Wine, Segment, Spectf, MoveLib, Musk, and Isolet datasets, EBCS reduces the number of features significantly and increases the classification accuracy with both the DT and NB classifiers compared to CFS. On the Mushroom dataset, EBCS and CFS achieve roughly the same FS. On the Credit, Lung and Sonar datasets, CFS achieves better size reduction but fails to improve the classification accuracy with either the DT or NB classifier compared to EBCS. Only on the Vote dataset does CFS achieve better FS than EBCS.
When EBCS is compared to CON, EBCS increases the classification accuracy with both DT and NB and reduces the feature subset significantly more than CON on the Credit, Hepatitis, Vote, Soybean (small), Breast-W, Wine, Segment, Spectf, and MoveLib datasets. CON fails to produce any result on the Zoo and Musk datasets. On the Mushroom dataset, EBCS reduces the number of features with roughly the same classification accuracy (DT and NB) as CON. On the remaining datasets, EBCS achieves better FS than CON, which reduces both the number of features and the classification accuracy for DT and NB.
Finally, EBCS achieves better FS in 15 out of the 16 tested datasets compared to CFS, and on all datasets compared to CON.

VII. CONCLUSION AND FUTURE WORK
FS is an important process for classification that aims to improve the accuracy and reduce the complexity of the classification model by selecting the relevant features. Hybrid approaches using NIAs (such as ACO, ABC, and PSO) and RST are widely used to solve FS; however, they are inefficient for high dimensional, mixed, and numerical datasets.
BCS, on the other hand, is fast and efficient, less complex, easy to implement, has fewer parameters, and has a more efficient search strategy compared to other NIAs including ACO, PSO, and ABC. The main drawback of these algorithms, however, is that the efficiency of convergence decreases as the number of features increases. To improve convergence, RSTDD is used as the objective function in many filter FS approaches for classification. It offers several advantages, including ease of implementation and no need for preliminary or additional information about the data, but it performs inefficiently for mixed and numerical datasets.
To address these drawbacks, a new filter FS approach for classification (EBCS) is proposed. EBCS solves FS for nominal, mixed, and numerical datasets with different numbers of features, with improved computational time and classification accuracy. To achieve this, EBCS develops a new initialization and search strategy, and a new, more efficient objective function using RSTDD and frequent values.
EBCS was evaluated on 16 datasets: 4 nominal, 4 mixed and 8 numerical. The results show that EBCS achieved better feature selection on 14 datasets, and the same feature selection on 2 datasets, compared to the standard BCS (FS-BCS). It took less computational time than the standard BCS on all datasets, and it achieved significantly improved feature selection without a significant reduction in classification accuracy compared to using all features on all tested datasets. It also achieved better feature selection on 15 datasets compared to the genetic, PSO, CFS, and CON approaches. In addition, it improved the FS on 14 datasets and achieved the same FS on one dataset compared to the multi-objective evolutionary approach. Finally, EBCS achieved better FS than the experimental results of MRMC-IWD on four out of the five datasets used in those experiments [66].
As future work, we plan to improve the developed approach in two directions. First, the approach uses distinct values in the new objective function, but unique-valued features and large differences between the numbers of distinct values of individual features decrease its performance; we therefore need to investigate how to address these two limitations of the proposed objective function. Second, the maximum number of features used in this work is 617, but some datasets have even larger numbers of features. To investigate the capability of our work on datasets with more than 1000 features, alternative initialization and global search mechanisms would need to be investigated, for example a more dynamic approach that automatically subdivides the dataset into search groups providing optimal, efficient computation.