IBDA: Improved Binary Dragonfly Algorithm With Evolutionary Population Dynamics and Adaptive Crossover for Feature Selection

Feature selection is an effective method to eliminate irrelevant, redundant and noisy features, which improves classification performance and reduces the computational burden in machine learning. In this paper, an improved binary dragonfly algorithm (IBDA), which extends the conventional dragonfly algorithm (DA), is proposed as the search strategy of a wrapper-based feature selection method. First, a novel evolutionary population dynamics (EPD) strategy is introduced in IBDA to enhance the exploitation ability while preserving the population diversity of the algorithm. Second, IBDA uses a novel crossover operator that ties the crossover rates to the iteration count, so that the algorithm can adjust the crossover rates of solutions dynamically, thereby balancing exploitation and exploration. Finally, a binary mechanism is proposed to make the algorithm suitable for binary feature selection problems. Simulations are conducted on 27 classical datasets from the UC Irvine Machine Learning Repository, and the results demonstrate that the proposed IBDA performs better than several comparison algorithms. Moreover, the effectiveness of the proposed improved factors is evaluated by tests.


I. INTRODUCTION
The availability of large-scale datasets has boosted the applications of machine learning in many fields, such as active matter [1], molecular and materials science [2], and biomedicine [3]. With the increasing complexity of machine learning models, more and more datasets with high-dimensional feature spaces are generated. However, some of the features are irrelevant or redundant, which may reduce the classification accuracy and waste computing resources.
Feature selection is an effective method to overcome the abovementioned issues [4]. The main idea of feature selection is to select an informative subset from a high-dimensional feature space [5], so that the number of features is reduced and worthless features are removed, thereby saving computing resources and increasing the classification accuracy of machine learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
Feature selection methods are mainly divided into two categories: filter and wrapper approaches. The filter-based method assigns a relevance score to each feature by using a statistical measure. Then, it ranks the features according to the calculated scores and selects a subset of features depending on a user-defined criterion [6]. The wrapper-based approach utilizes a classifier to guide the feature selection, and its accuracy is usually better than that of the filter-based method. Thus, the wrapper-based approach is an effective method for feature selection. However, it may take more computing resources and it is closely tied to the learning algorithm [7].
Feature selection can be considered a global combinatorial optimization problem in nature. Thus, it can be solved by swarm intelligence algorithms, especially in the wrapper-based setting. Some researchers have adopted swarm intelligence algorithms such as genetic algorithm (GA) [8], particle swarm optimization (PSO) [9], ant colony optimization (ACO) [10], bat algorithm (BA) [11], gray wolf optimization (GWO) [12], and variants of these algorithms for feature selection problems.
DA is a novel swarm intelligence algorithm proposed by Mirjalili [13] in 2016 for solving continuous optimization problems, and it performs better than some other approaches [14] owing to its effectiveness and accessibility. However, according to the no free lunch (NFL) theorem, no algorithm can solve all optimization problems well. Moreover, the conventional DA cannot be directly used for feature selection problems since these problems have binary solution spaces. In addition, DA has certain shortcomings, e.g., its exploitation ability relies on the sub-swarm mechanism, which is insufficient for exploiting the high-dimensional solution spaces of feature selection. Thus, it is necessary to improve conventional DA to make it more suitable for feature selection.
The main contributions of this paper are summarized as follows:
1) We formulate an optimization problem to jointly reduce the number of selected features and improve the classification accuracy.
2) We propose an improved binary DA (IBDA) to solve the formulated joint feature selection problem. IBDA introduces a novel evolutionary population dynamics (EPD) mechanism, an adaptive crossover (AC) factor and a binary strategy to improve the performance of conventional DA and make it more suitable for feature selection.
3) Experiments based on the UC Irvine Machine Learning Repository are conducted to evaluate the performance of the proposed IBDA for feature selection, and the results are compared to those of some other algorithms. Moreover, the effectiveness of the proposed improved factors is verified.
The rest of this paper is organized as follows. Section II reviews the related work. Section III formulates the joint feature selection problem. Section IV proposes the IBDA. Section V shows the experiment results and Section VI presents a summary of findings and conclusions.

II. RELATED WORK
Heuristic algorithms can solve various optimization problems, including feature selection. Due to their effectiveness and simplicity, many heuristic algorithms have been proposed for solving feature selection problems, e.g., GA [15], PSO [16], GWO [17], flower pollination algorithm (FPA) [18], artificial bee colony (ABC) [19], bacterial foraging optimization (BFO) [20], BA [21], cuckoo search (CS) [22], firefly algorithm (FA) [23], whale optimization algorithm (WOA) [24], and grasshopper optimization algorithm (GOA) [25]. Recently, more and more heuristic algorithms have been proposed to deal with many kinds of optimization problems. For instance, inspired by the navigation and foraging behaviors of salps, Mirjalili et al. [26] propose a salp swarm algorithm (SSA) for airfoil and marine propeller design problems. Moreover, Dhiman and Kumar [27] propose a novel nature-inspired algorithm called emperor penguin optimizer (EPO), which mimics the huddling behavior of emperor penguins. Seagull optimization algorithm (SOA) is another heuristic algorithm, which is inspired by the migration and attacking behaviors of seagulls [28]. In addition, Anita and Yadav [29] propose a novel artificial electric field algorithm (AEFA), which is inspired by Coulomb's law of electrostatic force. More heuristic algorithms can be found in [30], [31] and [32].
Many heuristic optimization algorithms have been used as the search strategies in wrapper-based feature selection methods, and some representative approaches are summarized and reviewed as follows.
García-Dominguez et al. [33] utilize GA for feature selection to reduce the original size of the environmental sound data. Liu and Shang [34] propose a fast wrapper feature subset selection algorithm based on PSO, which employs the domain knowledge of feature subset selection problems. Reference [35] uses a binary BA for feature selection to reduce the size of stego and cover images data. Devanathan et al. [36] exploit a binary GWO (BGWO)-based feature selection method and the bag-of-keypoint features (BoKF) model to distinguish nucleolar and centromere staining patterns. Rodrigues et al. [37] use the CS algorithm to solve the feature selection problems in two datasets obtained from a Brazilian electrical power company. In [38], a feature selection algorithm based on the moth-flame optimization (MFO) is proposed. Moreover, Emary et al. [12] propose a novel binary version of GWO and use it to select the optimal feature subset for classification purposes.
Recently, some improved approaches that combine different swarm intelligence algorithms or introduce enhanced factors have been proposed for feature selection problems. For example, the authors in [39] propose a new algorithm by combining the differential evolution (DE) and ABC algorithms for feature selection. Reference [40] proposes a hybrid algorithm called ACO-ABC to solve feature selection problems. Al-Tashi et al. [41] propose a binary version of the hybrid GWO-PSO algorithm for selecting features. Bharti and Bharti [42] select an informative subset of features by employing a crossbreed approach of binary PSO (BPSO) and sine cosine algorithm (SCA). The authors in [43] propose an enhanced hybrid metaheuristic approach by combining GWO and WOA to develop a wrapper-based feature selection method. Among the algorithms with improved factors, Zhang et al. [44] propose a return-cost-based binary firefly algorithm (Rc-BBFA) for feature selection problems. The authors in [45] use eight transfer functions and the crossover operator to enhance the exploratory behavior of SSA for feature selection. Reference [46] proposes an improved version of the gravitational search algorithm (GSA) that uses the concept of global memory and the definition of exponential K best to solve feature selection problems. Tumar et al. [47] propose an enhanced binary moth flame optimization (EBMFO) with adaptive synthetic sampling (ADASYN) to predict the best feature combination in software faults. More improved algorithms for feature selection can be found in [48]-[54].
There are also several previous works that focus on DA-based methods and their applications in feature selection. For example, Mafarja et al. [55] propose a wrapper feature selection algorithm based on the binary DA (BDA). Elhariri et al. [56] solve the problem of electromyography (EMG) signal classification with optimal feature subset selection by using DA and support vector machine (SVM) classification. Tawhid and Dsouza [57] combine the BDA and enhanced PSO to propose a hybrid BDA-enhanced PSO (HBDESPO) algorithm for feature selection. In [58], a combination of wavelet packet-based features and an improved binary dragonfly optimization-based feature selection method is proposed to classify different types of infant cry signals. Moreover, Sayed et al. [59] propose a chaotic DA (CDA) in which chaotic maps are embedded in the searching iterations of the algorithm for feature selection.
The above methods can solve the problem of feature selection in various applications. However, an optimization algorithm may perform differently in different applications. Thus, the existing methods cannot solve all feature selection problems properly, and this motivates us to propose an IBDA with suitable improved factors to deal with more feature selection problems in this work.

III. PROBLEM FORMULATION
In this work, we aim to select a subset of features from the whole dataset so as to achieve the following two objectives: (1) reducing the number of selected features, and (2) enhancing the classification accuracy. Thus, to simultaneously achieve these two goals, we design a joint fitness function as follows:
$$f_{Fitness} = \alpha \cdot E_r + \beta \cdot \frac{F_s}{F_a} \tag{1}$$
where F_s and F_a are the number of selected features and the total number of features, respectively. Moreover, α ∈ [0, 1] and β = (1 − α) are the weights of the two objectives, and E_r indicates the classification error rate of a certain classifier. As can be seen, the fitness function consists of two parts: the first part is used to guarantee the classification accuracy while the second part is used to reduce the number of selected features, and they are combined by using the linear weighting method. The two weights can be adjusted according to the feature selection problem at hand to obtain results with different biases. Specifically, if α is increased while β is decreased, then the approach that utilizes this fitness function tends to select more features to obtain higher accuracy. Conversely, if α is decreased and β is increased, the algorithm tends to sacrifice some accuracy to reduce the number of selected features.
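As a minimal sketch of the linearly weighted fitness described above, the function below combines an error rate with the fraction of selected features. The function name `joint_fitness` and the list-of-bits representation of a feature subset are illustrative assumptions, and the default α = 0.99 is the value reported later in the experiment setup.

```python
def joint_fitness(error_rate, mask, alpha=0.99):
    """Weighted sum of classification error and selected-feature ratio (lower is better)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_selected = sum(mask)   # F_s: number of selected features
    n_total = len(mask)      # F_a: total number of features
    beta = 1.0 - alpha       # weight of the feature-count term
    return alpha * error_rate + beta * n_selected / n_total


# Two hypothetical subsets with the same error rate: the smaller subset scores lower.
small = joint_fitness(0.10, [1, 0, 0, 1])
large = joint_fitness(0.10, [1, 1, 1, 1])
```

With such a large α the error term dominates, so the feature-count term mostly acts as a tie-breaker between subsets of similar accuracy.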
Note that using a simple and relatively cheap classification algorithm in a wrapper approach, such as k-nearest neighbor (KNN) or decision tree (DT), can yield a good feature subset that is also suitable for more complex classification algorithms [60]. Thus, the KNN [61] method is adopted as the classifier in this paper since it is effective and easy to implement.

IV. ALGORITHM
The solution space of the formulated feature selection problem is discrete and very large, especially when the datasets have large numbers of features, which makes the problem difficult to solve. Thus, we propose an IBDA with the EPD mechanism, crossover strategy and binary scheme to make it more suitable for feature selection problems.
A. CONVENTIONAL DA
DA is inspired by the dynamic and static swarming behaviors of dragonflies in nature. In the static swarm, dragonflies create sub-swarms and fly over small areas to hunt for food. In dynamic swarms, dragonflies form a swarm to migrate in one direction over long distances. By their nature, the static and dynamic behaviors can represent the exploitation and exploration phases of optimization, respectively.
Three primitive principles of swarming behavior are used to simulate the behaviors of dragonflies: separation (S), alignment (A) and cohesion (C), which are individual behaviors affected by the sub-swarm (a certain area around each dragonfly). Moreover, the main objective of any swarm is survival, so all of the individuals should be attracted towards food sources (F) and distracted away from enemies (E).
The main factors related to the solution update method in DA are introduced as follows.
1) The separation is expressed as:
$$S_i = -\sum_{m=1}^{N_{sub}} (X - X_m) \tag{2}$$
where S_i indicates the separation of the i-th individual, X is the individual's current position, X_m is the position of the m-th individual of the sub-swarm and N_sub denotes the number of individuals of the sub-swarm.
2) The alignment is calculated as follows:
$$A_i = \frac{1}{N_{sub}} \sum_{m=1}^{N_{sub}} V_m \tag{3}$$
where V_m is the velocity of the m-th individual of the sub-swarm.
3) The cohesion is calculated as follows:
$$C_i = \frac{1}{N_{sub}} \sum_{m=1}^{N_{sub}} X_m - X \tag{4}$$
4) The attraction towards a food source in DA is expressed as follows:
$$F_i = X^{+} - X \tag{5}$$
where X^+ is the position of the food source.
5) The distraction away from an enemy is calculated as follows:
$$E_i = X^{-} + X \tag{6}$$
where X^− is the position of the enemy.
Accordingly, the step vector that combines the abovementioned five behaviors for updating the positions of dragonflies is defined as follows:
$$\Delta X_{t+1} = (s S_i + a A_i + c C_i + f F_i + e E_i) + w \Delta X_t \tag{7}$$
where s, a, c, f, e, and w indicate the separation weight, alignment weight, cohesion weight, food factor, enemy factor, and inertia weight, respectively. Moreover, t is the iteration counter.
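The five behaviors and the step vector can be sketched as follows. This is an illustrative sketch that follows the standard DA definitions of [13] and assumes, for simplicity, that every other dragonfly belongs to the sub-swarm (so at least two dragonflies are required); the function name `step_vector` and the list-of-lists data layout are hypothetical.

```python
def step_vector(i, positions, steps, food, enemy, s, a, c, f, e, w):
    """Next step vector of dragonfly i, treating all other dragonflies as neighbours."""
    x = positions[i]
    others = [m for m in range(len(positions)) if m != i]
    n_sub = len(others)  # assumed to be at least 1
    new_step = []
    for d in range(len(x)):
        sep = -sum(x[d] - positions[m][d] for m in others)          # separation
        ali = sum(steps[m][d] for m in others) / n_sub               # alignment (mean velocity)
        coh = sum(positions[m][d] for m in others) / n_sub - x[d]    # cohesion
        att = food[d] - x[d]                                         # attraction to food
        dis = enemy[d] + x[d]                                        # distraction from enemy
        new_step.append(s * sep + a * ali + c * coh + f * att + e * dis + w * steps[i][d])
    return new_step


# Hypothetical three-dragonfly swarm in two dimensions.
positions = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
steps = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]
new = step_vector(0, positions, steps, food=[1.0, 1.0], enemy=[-1.0, -1.0],
                  s=0.1, a=0.1, c=0.1, f=0.1, e=0.1, w=0.9)
```

In the continuous algorithm the new position is the old position plus this step vector.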
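Then, the individual position vector based on the step vector is defined as follows:
$$X_{t+1} = X_t + \Delta X_{t+1} \tag{8}$$
Moreover, DA uses the Lévy flight mechanism as the random walk factor to enhance the stochastic behaviour and exploration ability of dragonflies. The position update method of dragonflies by using Lévy flight is expressed as follows:
$$X_{t+1} = X_t + \text{Lévy} \times X_t \tag{9}$$
The Lévy flight is calculated as follows:
$$\text{Lévy} = \frac{\mu}{|\nu|^{1/\eta}} \tag{10}$$
where µ ∼ N(0, σ²), ν ∼ N(0, 1), η is a constant (equal to 1.5 in this work), and σ is calculated as follows:
$$\sigma = \left( \frac{\Gamma(1+\eta) \sin(\pi \eta / 2)}{\Gamma\left(\frac{1+\eta}{2}\right) \eta \, 2^{(\eta-1)/2}} \right)^{1/\eta} \tag{11}$$
where Γ(y) = (y − 1)!. Accordingly, the pseudo-code of the conventional DA is shown in Algorithm 1.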
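The Lévy-flight random walk can be sketched as below. The sketch assumes the common Mantegna form, in which the step is µ/|ν|^(1/η) with µ ∼ N(0, σ²) and ν ∼ N(0, 1); any additional scaling constant used by a particular implementation is not modelled, and the function names are hypothetical.

```python
import math
import random


def mantegna_sigma(eta=1.5):
    """Standard deviation of mu in Mantegna's algorithm for Levy-stable steps."""
    num = math.gamma(1 + eta) * math.sin(math.pi * eta / 2)
    den = math.gamma((1 + eta) / 2) * eta * 2 ** ((eta - 1) / 2)
    return (num / den) ** (1 / eta)


def levy_step(eta=1.5, rng=random):
    """Draw one heavy-tailed step mu / |nu|^(1/eta)."""
    mu = rng.gauss(0.0, mantegna_sigma(eta))
    nu = rng.gauss(0.0, 1.0)
    while nu == 0.0:  # guard against a (practically impossible) division by zero
        nu = rng.gauss(0.0, 1.0)
    return mu / abs(nu) ** (1 / eta)


def random_walk(x, eta=1.5, rng=random):
    """Move every dimension of x by an independent Levy step scaled by the current value."""
    return [xd + levy_step(eta, rng) * xd for xd in x]
```

Most steps are small, but the heavy tail occasionally produces a long jump, which is what helps an isolated dragonfly escape its current region.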

Algorithm 1 Conventional DA
Define and initialize the related parameters: swarm size N_swarm, solution dimension N_dim, maximum iteration N_max_iter, and fitness function, etc.;
for t = 1 to N_max_iter do
    Calculate the fitness function values of dragonflies;
    Update the food source and enemy, namely, food_position and enemy_position;
    Update w, s, a, c, f, and e;
    for i = 1 to N_swarm do
        if No other dragonflies in the sub-swarm then
            Perform random walk by using Eq. (9);

It has been demonstrated that conventional DA has certain shortcomings for some optimization problems. For example, it may lack exploitation capability for optimization problems with huge solution spaces. Moreover, conventional DA was originally proposed for continuous optimization problems, whereas feature selection is a discrete optimization problem since its solution space is binary. Thus, an IBDA is proposed for solving the formulated feature selection problem. IBDA introduces a novel EPD mechanism, a crossover strategy and a binary mechanism to improve the performance of the conventional DA and make it suitable for feature selection.
The main steps of IBDA are shown in Algorithm 2. Note that since the distance between dragonflies cannot be determined clearly in a binary space, IBDA considers all of the dragonflies as one sub-swarm. The details of the proposed factors are introduced as follows.

1) EPD
EPD is an evolutionary operator based on the theory of self-organized criticality (SOC) [62]. The purpose of EPD is to eliminate the worst individuals in the swarm by repositioning them around the best solutions, thereby improving the exploitation and local search abilities. Moreover, EPD is a simple but effective mechanism that can be embedded in different optimizers. Thus, we adopt EPD to enhance the searching ability for feature selection.

Algorithm 2 IBDA
Define and initialize the related parameters: swarm size N_swarm, solution dimension N_dim, maximum iteration N_max_iter, and fitness function, etc.;
for t = 1 to N_max_iter do
    Calculate the fitness function values of solutions;
    Sort these solutions in ascending order according to their fitness function values;
    Update the food source and enemy, namely, food_position and enemy_position;
    Reinitialize the individuals from the last half of the ordered swarm by using Algorithm 3;
    Update w, s, a, c, f, and e;
// food_position is the best solution obtained by the algorithm
The core idea of EPD is to eliminate the worst solutions, improve the median fitness of the whole swarm, and relocate the removed solutions according to the best solutions. Note that the worst and best solutions are the solutions with the worst and best fitness function values. To combine DA with EPD, the dragonfly swarm is divided into two parts according to the fitness function values. Then, the half of the swarm with the worst fitness function values dies out and is reinitialized based on the EPD mechanism. Several EPD mechanisms have been proposed and used in some algorithms, and the popular ones as well as their main principles are introduced as follows.
Basic EPD: For each solution in the worst half of the swarm, EPD selects the best three solutions and generates a new solution by using these selected three solutions. Then, it randomly selects a solution from these four solutions for relocating the original worst solution.
EPD_CM: In this mechanism, the mutation and crossover operators are introduced into the basic EPD. A solution selected by basic EPD is first mutated by using a mutation operator, then the mutated solution is crossed with the original solution by using a crossover operator. Thus, the exploration tendency of the algorithm may be improved by the introduced mutation and crossover operators.
EPD_Tour: In this version, the tournament selection (TS) operator is introduced to select solutions from the best solutions of the swarm. First, N_t solutions are picked out randomly from the best half of the swarm, and the best solution X_best is selected among the N_t solutions. Then, X_best is operated on by using the same mutation and crossover operations as in EPD_CM. Compared to EPD_CM, more valuable solutions are retained by the TS operator, thereby making it easier for the algorithm to obtain the optimal solution.
EPD_RWS: Different from the above EPD mechanisms, EPD_RWS utilizes the roulette wheel selection (RWS) operator to select an individual from the best half of the swarm. The basic idea of RWS is that the probability of each individual being selected is proportional to its fitness function value. Then, the selected solution is handled by the same mutation and crossover operations as used in EPD_CM. As can be seen, RWS never ignores any individuals in the swarm, such that more regions of the solution space are explored.
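The two selection operators used by EPD_Tour and EPD_RWS can be sketched as follows. This is an illustrative sketch rather than the implementation of [63] or [64]: the function names are hypothetical, a lower fitness value is assumed to be better in the tournament, and the roulette wheel takes caller-supplied non-negative weights so that the mapping from fitness to selection weight stays an explicit assumption of the caller.

```python
import random


def tournament_select(best_half, fitness, n_t, rng=random):
    """Sample n_t distinct candidates from the best half and return the fittest one."""
    candidates = rng.sample(range(len(best_half)), n_t)
    winner = min(candidates, key=lambda idx: fitness[idx])  # lower fitness is better
    return best_half[winner]


def roulette_select(best_half, weights, rng=random):
    """Return one solution drawn with probability proportional to its weight."""
    return rng.choices(best_half, weights=weights, k=1)[0]


# Hypothetical best half with three binary solutions and their fitness values.
best_half = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
fitness = [0.12, 0.08, 0.20]
picked_by_tournament = tournament_select(best_half, fitness, n_t=3)
picked_by_roulette = roulette_select(best_half, weights=[0.0, 0.0, 1.0])
```

With a tournament of size N_t, the worst N_t − 1 candidates can never win, whereas the roulette wheel gives every candidate with a positive weight some chance of being picked.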

2) EPD WITH LINEAR RANKING SELECTION (LRS)
It has been demonstrated that EPD_Tour and EPD_RWS may achieve better performance than other EPD versions [63], [64]. However, they still have some limitations. In EPD_Tour, the worst (N_t − 1) individuals in the best half of the swarm will never be selected, which may reduce the population diversity. In EPD_RWS, if the fitness function value of an individual differs by an order of magnitude from the others' fitness function values, then the individual will rarely be selected, which may cause the algorithm to converge prematurely. These conditions cause some valuable solutions to be discarded, which makes the algorithm prone to falling into local optima. Therefore, to overcome these drawbacks, we introduce the linear ranking selection (LRS) method as the selection operator in the EPD mechanism and propose a novel EPD_LRS approach to further improve the population diversity.

Algorithm 3
[...] to N_swarm do
    Select a solution X_EPD_LRS from the best half according to the probabilities;
    Mutate X_EPD_LRS and cross it with X_i by using Algorithm 4;
    Relocate X_i by using X_EPD_LRS;
end
Return Array;
// Array is the array of relocated solutions

As shown in Algorithm 3, in EPD_LRS, each individual of the best half is first ranked according to its fitness function value. Then, the mechanism determines which solution is operated on according to the selection probability defined in Eq. (12), in which P_max and P_min are the probabilities of selecting the highest and lowest ranked individuals, respectively, P_k is the probability that the k-th solution at an intermediate rank is selected, and N_rank is the number of individuals participating in the ranking. The schematic diagram of this method is shown in Fig. 1. P_max and P_min are designed as in Eqs. (13) and (14), in which f_Fitness^max and f_Fitness^min are the fitness function values of the highest and lowest ranked individuals, respectively.
By using these methods, a connection between the fitness function values and the selection probabilities is established, so that the algorithm obtains reasonable operation probabilities. Finally, the mutation and crossover operators are used to relocate the obtained solutions so that they explore wider areas of the solution space. Note that a solution of the formulated feature selection problem may consist of many dimensions, and the mutation and crossover operators are performed on each dimension of the solution. In our mechanism, a widely used mutation operator is adopted [63]; it is defined in Eq. (15), in which x represents a dimension of a solution, N_rand is a random number within the range [0, 1], and ϕ indicates the linear mutation rate, which is calculated by Eq. (16).
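The linear ranking selection step of this subsection can be sketched as below. The sketch assumes the textbook linear ranking rule, in which the selection probability falls linearly from P_max for the best-ranked solution to P_min for the worst-ranked one and is then normalised; the paper's own definitions of P_k, P_max and P_min are not reproduced, so `p_max` and `p_min` are plain caller-supplied parameters and the function names are hypothetical.

```python
import random


def linear_ranking_probs(n_rank, p_max, p_min):
    """Selection probabilities for ranks 1..n_rank (rank 1 is the best solution)."""
    if n_rank == 1:
        return [1.0]
    raw = [p_min + (p_max - p_min) * (n_rank - k) / (n_rank - 1)
           for k in range(1, n_rank + 1)]
    total = sum(raw)
    return [p / total for p in raw]  # normalise so the probabilities sum to 1


def lrs_select(ranked_best_half, p_max, p_min, rng=random):
    """Draw one solution from a best half that is already sorted from best to worst."""
    probs = linear_ranking_probs(len(ranked_best_half), p_max, p_min)
    return rng.choices(ranked_best_half, weights=probs, k=1)[0]


probs = linear_ranking_probs(4, p_max=0.4, p_min=0.1)
```

Because the probabilities depend only on rank, an individual whose fitness differs from the others by an order of magnitude neither dominates the selection nor disappears from it, and every ranked solution keeps a non-zero chance as long as `p_min` is positive.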

3) AC OPERATOR
The crossover operator generates a new solution by crossing the solution generated by the EPD mechanism with the original solution generated by DA, thereby increasing the population diversity of the algorithm. However, in the existing crossover operators the probabilities of retaining the two solutions are equal and constant, which makes it difficult for the algorithm to achieve a good convergence rate. To solve this problem, we propose an AC operator that balances exploration and exploitation at different iteration stages of the algorithm, so as to facilitate the transition from exploration to exploitation of the search space. Specifically, in the earlier iterations, the AC operation should tend to retain the original solutions generated by DA so that global exploration can be achieved. In the later iterations, more solutions generated by the EPD mechanism should be kept to give the algorithm a faster convergence rate. To achieve this purpose, we design the AC operator as in Eq. (17), in which x_DA is a dimension of the solution generated by DA and x_EPD is a dimension of the solution generated by EPD. The bias rate of the crossover operator is calculated by Eq. (18), in which θ is the maximum bias of the crossover operator. A typical curve of the bias rate over the iterations is shown in Fig. 2.
As can be seen, the bias rate can be regarded as a threshold that gives the crossover operator different tendencies in retaining the two solutions at different iterations, thereby balancing the exploitation and exploration capabilities at different iteration stages of the algorithm. Accordingly, the mutation and crossover mechanisms are shown in Algorithm 4. By using these mechanisms, the population diversity is increased and the exploitation and exploration capabilities are balanced, thereby improving the performance of the proposed IBDA for the formulated feature selection problem.
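The behaviour of an iteration-dependent crossover can be illustrated with the sketch below. It is not the paper's AC operator: the linear schedule in `epd_keep_probability`, the default `theta=0.4` and the function names are assumptions made purely for illustration, chosen only so that early iterations favour the DA solution and late iterations favour the EPD solution.

```python
import random


def epd_keep_probability(t, n_max_iter, theta):
    """Hypothetical linear schedule rising from 0.5 - theta (t = 0) to 0.5 + theta (t = n_max_iter)."""
    return 0.5 + theta * (2.0 * t / n_max_iter - 1.0)


def adaptive_crossover(x_da, x_epd, t, n_max_iter, theta=0.4, rng=random):
    """Per dimension, keep the EPD bit with the scheduled probability, otherwise the DA bit."""
    p_epd = epd_keep_probability(t, n_max_iter, theta)
    return [e if rng.random() < p_epd else d for d, e in zip(x_da, x_epd)]


x_da, x_epd = [0, 0, 0, 0], [1, 1, 1, 1]
early = adaptive_crossover(x_da, x_epd, t=0, n_max_iter=100, theta=0.5)
late = adaptive_crossover(x_da, x_epd, t=100, n_max_iter=100, theta=0.5)
```

With the extreme setting `theta=0.5`, the first iteration returns the DA solution unchanged and the last iteration returns the EPD solution unchanged; smaller values of `theta` mix the two throughout the run.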

Algorithm 4 Mutation and Crossover Mechanisms
Define and initialize the related parameters: the original solution X_DA, the solution generated by EPD mechanisms X_EPD, and solution dimension N_dim, etc.;
Calculate ϕ and the bias rate by using Eqs. (16) and (18), respectively;
for j = 1 to N_dim do
    Mutate the j-th dimension of X_EPD by using Eq. (15);
    Reinitialize the j-th dimension of X_EPD by crossing the j-th dimensions of X_DA and X_EPD according to Eq. (17);
end
Return X_EPD;
// X_EPD is the relocated solution
VOLUME 8, 2020

4) BINARY MECHANISM
The solutions in conventional DA are continuous and they can be updated directly with the step vectors via Eq. (8). However, the solution space of the formulated feature selection problem is discrete, which cannot be handled by conventional DA. Thus, a binary mechanism shown in Algorithm 5 is introduced to map the solutions from the continuous space to the discrete space, thereby making the algorithm suitable for feature selection problems.
In this work, the v-shaped transfer function is first utilized to calculate the probability of changing position for each dimension of all solutions. The v-shaped transfer function is described as follows:
$$P(\Delta x) = \left| \frac{\Delta x}{\sqrt{\Delta x^2 + 1}} \right| \tag{19}$$
where Δx is a step vector calculated by Eq. (7). The functional relationship between Δx and P(Δx) is shown in Fig. 3. As can be seen, the v-shaped transfer function tends to change the variables of the search dragonflies more frequently, which boosts exploration in the huge solution space of the formulated feature selection problem. Then, a dimension of the binary solution is updated as follows:
$$x_{t+1} = \begin{cases} \neg x_t, & N_{rand} < P(\Delta x_{t+1}) \\ x_t, & \text{otherwise} \end{cases} \tag{20}$$
where x_t is a dimension of a binary solution at the t-th iteration and N_rand is a random number within the range [0, 1]. By using the binary mechanism above, the continuous solution space of conventional DA is effectively mapped to a discrete space, which makes the algorithm suitable for the feature selection problem.
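The binary mechanism can be sketched as below, assuming the v-shaped transfer function commonly used with binary DA, |Δx / sqrt(Δx² + 1)|, and a bit flip whenever a uniform random number falls below the resulting probability. The function names are hypothetical.

```python
import math
import random


def v_transfer(delta_x):
    """V-shaped transfer function: maps a step value to a flip probability in [0, 1)."""
    return abs(delta_x / math.sqrt(delta_x ** 2 + 1.0))


def binary_update(x_t, step, rng=random):
    """Flip each bit with the probability given by its step value, otherwise keep it."""
    return [1 - bit if rng.random() < v_transfer(dx) else bit
            for bit, dx in zip(x_t, step)]


unchanged = binary_update([1, 0, 1], [0.0, 0.0, 0.0])  # zero steps never flip a bit
```

A large step in either direction gives a flip probability close to 1, while a zero step leaves the bit as it is, so the sign of the step does not matter, only its magnitude.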

C. COMPLEXITY ANALYSIS OF IBDA
In this section, the complexity of the proposed IBDA is analyzed. The most time-consuming step in feature selection should be the calculation of fitness function value, which is several orders of magnitude more complex than the other steps. Thus, other calculation steps can be ignored.

Algorithm 5 Binary Mechanism
Define and initialize the related parameters: solution from previous iteration X_t, and solution dimension N_dim, etc.;
for j = 1 to N_dim do
    Calculate the step vector of the j-th dimension by using Eq. (7);
    Calculate the changing probability of the j-th dimension of X_t by using Eq. (19);
    Update the j-th dimension of X_{t+1} by using Eq. (20);
end
Return X_{t+1};
// X_{t+1} is the updated binary solution

We suppose that the maximum number of iterations and the population size are N_max_iter and N_swarm, respectively. Then the complexity of IBDA is O(N_max_iter · N_swarm), because the fitness function is evaluated N_max_iter · N_swarm times in the algorithm, which is the same as for the conventional DA. However, IBDA may consume more computing time than conventional DA in practice, because the introduced improved factors add computation that is difficult to predict. Thus, to assess the computational cost of the algorithm more comprehensively, the running time of the proposed algorithm is evaluated in Section V.

D. FEATURE SELECTION WITH IBDA
To solve the feature selection problem with the proposed method, we consider each dragonfly of the swarm as a solution to the problem. A dragonfly is therefore a one-dimensional vector in which the value of each bit is 1 or 0 (1 and 0 indicate that the feature is chosen or not, respectively). Thus, a dragonfly can be expressed as follows:
$$X = [x_1, x_2, \ldots, x_{N_{dim}}], \quad x_j \in \{0, 1\} \tag{21}$$
where N_dim represents the number of features. Moreover, the swarm of IBDA is expressed as follows:
$$Swarm = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_{N_{dim}}^1 \\ x_1^2 & x_2^2 & \cdots & x_{N_{dim}}^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{N_{swarm}} & x_2^{N_{swarm}} & \cdots & x_{N_{dim}}^{N_{swarm}} \end{bmatrix} \tag{22}$$
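A minimal sketch of this encoding is given below: a swarm is a list of bit vectors, and applying a bit vector to a data row keeps only the columns whose bit is 1. The helper names `random_swarm` and `apply_mask` and the sample row are hypothetical.

```python
import random


def random_swarm(n_swarm, n_dim, rng=random):
    """Create n_swarm random binary solutions, each with n_dim bits."""
    return [[rng.randint(0, 1) for _ in range(n_dim)] for _ in range(n_swarm)]


def apply_mask(row, mask):
    """Keep only the feature values whose corresponding bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]


swarm = random_swarm(n_swarm=24, n_dim=4, rng=random.Random(0))
reduced = apply_mask([5.1, 3.5, 1.4, 0.2], [1, 0, 0, 1])
```

The swarm size of 24 matches the value used in the experiments; the number of bits always equals the number of features of the dataset at hand.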

V. EXPERIMENTS AND ANALYSIS
In this section, tests are conducted to verify the performance of the proposed IBDA for the feature selection problem. First, the datasets and setups used in the experiments are introduced. Then, the test results obtained by IBDA and several comparison algorithms are presented and analyzed. Finally, the effectiveness of the introduced improved factors is evaluated.

A. DATASETS AND SETUPS
1) BENCHMARK DATASETS
In this work, we use 27 datasets selected from the UC Irvine Machine Learning Repository to perform the experiments. The main information of these datasets is shown in Table 1.

2) PARAMETER TUNING
As mentioned above, IBDA regards all the dragonflies as one sub-swarm, and it simulates the exploration and exploitation processes of the algorithm by adaptively tuning the swarming factors s, a, c, f, and e and the inertia weight w. However, it is difficult to tune all six parameters directly. Thus, following the experiments in [13], we use two alternative parameters, τ and ζ, to represent the original parameters of IBDA (see Eqs. (23)-(29)), so that the parameters can be tuned in a reasonable way. In these equations, w is calculated from τ, a parameter that controls the size of w, and s, a, c, f, and e are calculated from ξ, which is in turn calculated from ζ, a parameter that controls the size of s, a, c, f, and e. Theoretically, according to the no-free-lunch theorem, τ and ζ need to be tuned separately for different optimization problems to achieve the best performance for each problem [65], [66].
In this work, each dataset comes from a practical optimization problem, which means that it would be better to tune the main parameters for each dataset. However, this would require a great deal of work since there are 27 datasets in our experiments. For the sake of simplicity, following [67], we select one dataset with the median dimension size, namely Lung-Cancer, to tune the key parameters of the proposed IBDA. Accordingly, we jointly tune τ and ζ by using the simple generate-evaluate method [68]. Specifically, we use the proposed IBDA with different combinations of τ and ζ to solve the feature selection problem on the Lung-Cancer dataset. Moreover, τ ranges from 0.9 to 0.5 with a step size of 0.1, and ζ ranges from 0.4 to 0.1 with a step size of 0.1. Thus, there are 16 different combinations of these parameters. Each combination is independently run 30 times, and the average results are recorded and shown in Table 2.

3) EXPERIMENT SETUPS
The experiments are run on a computer with an Intel(R) Xeon(R) E5-2630 v4 CPU and 32 GB of RAM. We implement the experiments in Python 3.7 and use a KNN classifier (k = 5) based on the Euclidean distance. Moreover, α and β in the fitness function are set to 0.99 and 0.01, respectively.
In this paper, BPSO [69], BGWO [12], BDA [70], BBA [9] and improved BPSO (IBPSO) [53] are used as the comparison algorithms. Note that BDA is a binary version of the conventional DA and it has the same binary mechanism as IBDA. Table 3 shows the key parameter settings of these algorithms. These parameters are assigned the values that perform well in the literature for feature selection, so that the corresponding algorithms provide a reasonable comparison. Moreover, the proposed IBDA and the comparison algorithms are all metaheuristic algorithms, and they are directly affected by the population size and the number of iterations. Thus, to ensure a fair comparison, the same swarm size and number of iterations are used for each algorithm. In this work, they are set to 24 and 100, respectively. Furthermore, as indicated by the central limit theorem, each algorithm is independently run 30 times on the selected datasets to avoid random bias in the experiments. In addition, 80% of the instances are used for training, and the remaining instances are used for testing, which is a common practice in previous works.
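To make the evaluation setup concrete, the following minimal sketch scores a binary feature mask with a KNN classifier (k = 5, Euclidean distance) on an 80/20 split, using α = 0.99 and β = 0.01. The weighted-sum form α · error rate + β · (selected features / total features) is the form commonly used in wrapper-based feature selection and is an assumption here. The helper names, the toy data and the rule that an empty mask receives the worst score are likewise hypothetical.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, mask, k=5):
    """Majority vote among the k nearest training points (Euclidean, selected features only)."""
    idx = [j for j, bit in enumerate(mask) if bit]
    dists = sorted(
        (math.sqrt(sum((row[j] - x[j]) ** 2 for j in idx)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def fitness(mask, train, test, alpha=0.99, beta=0.01, k=5):
    """Assumed weighted sum: alpha * error rate + beta * (selected / total features)."""
    if not any(mask):
        return 1.0  # assumption: an empty subset receives the worst possible score
    (train_X, train_y), (test_X, test_y) = train, test
    errors = sum(
        knn_predict(train_X, train_y, x, mask, k) != y
        for x, y in zip(test_X, test_y)
    )
    return alpha * errors / len(test_y) + beta * sum(mask) / len(mask)

def split_80_20(X, y, rng):
    """Shuffle the instances and use 80% for training and the rest for testing."""
    order = list(range(len(y)))
    rng.shuffle(order)
    cut = int(0.8 * len(order))
    tr, te = order[:cut], order[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr]), ([X[i] for i in te], [y[i] for i in te])

# Toy data: feature 0 separates the two classes, features 1 and 2 are noise.
rng = random.Random(0)
X = [[c * 5 + rng.random(), rng.random(), rng.random()] for c in (0, 1) for _ in range(20)]
y = [c for c in (0, 1) for _ in range(20)]
train, test = split_80_20(X, y, rng)

fit_one = fitness([1, 0, 0], train, test)  # only the informative feature
fit_all = fitness([1, 1, 1], train, test)  # all three features
```

On this toy data both masks classify the test instances correctly, so the smaller subset obtains the lower fitness purely through the β term, which illustrates how the small weight on subset size breaks ties between equally accurate subsets.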

B. FEATURE SELECTION RESULTS
In this section, the feature selection results obtained by the different algorithms are presented in terms of the fitness function value, convergence rate, accuracy and CPU time. Moreover, the best values obtained by a certain approach are highlighted in bold font. Table 4 shows the numerical statistical results of the average fitness function values and standard deviations (Stds) of the different algorithms for each dataset. As can be seen, BPSO and BBA achieve better fitness function values than BDA on the datasets with fewer features. However, BDA performs better on the datasets with larger numbers of features. This supports our conjecture that BDA may have a good exploration ability but lacks exploitation performance. By introducing the improved factors into BDA, the proposed IBDA obtains the best average fitness function values on 18 datasets, which means that the introduced improved factors are effective. Note that the effectiveness of the different improved factors is further verified and discussed in Section V-C. Overall, the proposed IBDA performs better than the other comparison algorithms in solving the formulated feature selection problem.
Figs. 4 and 5 show the convergence rates of the different algorithms during the optimization processes. Note that the curves are split across two figures due to space limitations, and each curve is taken from the 15th test. It can be seen from these figures that the proposed IBDA exhibits the best curves on 19 datasets, which indicates that it has the best convergence ability among all the comparison methods.
The average CPU time and Stds obtained by the different algorithms for each dataset are presented in Table 5. As can be seen, although IBDA consumes more CPU time than BDA, the CPU time of IBDA is not significantly different from that of the other algorithms, which illustrates that the overhead of IBDA is similar to that of the comparison algorithms. Moreover, IBDA obtains the best Std of CPU time on 16 datasets, which means that it has better computing stability in terms of CPU time.
Table 6 presents the feature selection accuracies obtained by the different algorithms. As can be seen, the proposed IBDA obtains the best average accuracy on 19 datasets. Therefore, IBDA has the best performance in terms of feature selection accuracy for the selected datasets. Moreover, Table 7 shows the numbers of selected features obtained by the different algorithms for each dataset. It can be seen from the table that IBDA selects more features than BDA. However, the Std of IBDA is better than that of BDA, which indicates that it is more stable in selecting features. Note that accuracy and the number of selected features are conflicting objectives, which means that it may be very difficult to achieve the best results in both objectives for each dataset. Thus, considering the results above comprehensively, we may say that the proposed IBDA employs fewer features to achieve better performance in terms of accuracy and stability. The reason may be that the introduced improved factors enhance the exploitation capability, thereby improving the performance of the algorithm.

C. EFFECTIVENESS OF THE PROPOSED IMPROVED FACTORS
In this section, we conduct test cases to verify the effectiveness of the proposed EPD_LRS mechanism and AC operator. In the tests, we use BDA with different EPD mechanisms to solve the formulated feature selection problem for each selected dataset, and the details of these mechanisms are shown in Table 8. Similar to the previous results, the performance indexes in terms of the fitness function value, convergence rate, CPU time, accuracy and number of selected features are presented. Table 9 shows the numerical statistical results of the average fitness function values and Stds of the different approaches for each dataset, and the convergence rates during the optimization processes are shown in Figs. 6 and 7. As can be seen, BDA_EPD_LRS achieves the best average fitness function values among all the EPD mechanisms, which means that the EPD_LRS mechanism is more compatible with DA than the other EPD mechanisms. Moreover, BDA_EPD_LRS_AC achieves the best average fitness function values on 15 datasets, which is a significant improvement over BDA_EPD_LRS. The reason may be that the AC operator facilitates the transition from exploration to exploitation of the search space. For reference, Table 10, Table 11 and Table 12 present the summary statistical results of the CPU time, feature selection accuracies and numbers of selected features, respectively. It can be observed from these results that the introduced improved factors are effective in enhancing the performance of the conventional DA for feature selection.

D. COMPARISON WITH OTHER FEATURE SELECTION SCHEMES
In this section, the classification performance of IBDA is compared with that of five well-known filter-based methods, namely correlation-based feature selection (CFS), fast correlation-based filter (FCBF), Fisher score (F-score), information gain (IG) and wavelet power spectrum (Spectrum). Specifically, IG, Spectrum and F-score belong to the univariate strategy, which does not reflect the dependencies among the features in the assessment measure. In contrast, CFS and FCBF belong to the multivariate strategy, which can exploit the dependencies among the features. Note that the results of these methods are taken from [63], and the comparison with IBDA on the same datasets is shown in Table 13. It can be observed from the table that IBDA outperforms the other methods on most datasets. The reason may be that a wrapper-based feature selection method can provide superior accuracy compared with filter-based methods since it can utilize both the labels and the correlations among the features.
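As an illustration of why a univariate filter ignores dependencies among features, the following minimal sketch computes the Fisher score of each feature independently. It uses the commonly cited definition, namely the between-class scatter of the feature divided by its within-class scatter, with both weighted by class size. The function name `fisher_scores` and the toy data are hypothetical, and the sketch is not claimed to match the implementation used in [63].

```python
from collections import defaultdict

def fisher_scores(X, y):
    """Score each feature on its own: between-class scatter / within-class scatter."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = within = 0.0
        for rows in by_class.values():
            values = [row[j] for row in rows]
            mean_c = sum(values) / len(values)
            var_c = sum((v - mean_c) ** 2 for v in values) / len(values)
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class scatter: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores

# Toy data: feature 0 separates the classes, feature 1 has identical class means.
X = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Each score depends on a single column of the data only, so two features that are individually weak but jointly informative would both be ranked low, which is the limitation of the univariate strategy noted above.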

E. LIMITATIONS OF IBDA
In this section, the limitations of the proposed IBDA are analyzed. Although the proposed IBDA outperforms some comparison algorithms according to the simulation results, it still has some limitations. First, the proposed IBDA, like the conventional DA, introduces many more main parameters than other algorithms, which may make the parameters difficult to tune. Second, IBDA may take more CPU time than the conventional DA because the introduced improved factors need extra computing time. For instance, the proposed EPD_LRS operator needs to relocate the worst individuals of the swarm, so the algorithm takes more CPU time. Moreover, the proposed AC operator also requires additional computation in each loop. Finally, the Stds of the fitness function values obtained by IBDA are higher than those of some comparison algorithms, as shown in Table 4. This means that IBDA may lack stability in some scenarios, and the reason may be that the mutation mechanism of the proposed EPD_LRS increases the uncertainty of the algorithm.

VI. CONCLUSION
In this paper, we investigate the feature selection problem for enhancing the classification performance in machine learning. First, a feature selection optimization problem is formulated to jointly improve the classification accuracy and reduce the number of selected features. Then, an IBDA is proposed to solve the formulated problem. IBDA introduces the EPD_LRS mechanism, the AC operator and a V-shaped binary scheme to improve the performance of the conventional DA and make it suitable for the formulated feature selection problem. By using these improved factors, the exploitation and exploration abilities of the algorithm can be balanced while the population diversity is maintained. Experiments are conducted to test the effectiveness of the proposed IBDA, and the results demonstrate that it has the best overall performance on 27 well-known scientific datasets compared with BPSO, BBA, BDA, BGWO and IBPSO. Moreover, the effectiveness of the introduced improved factors is evaluated, and the results verify that they are useful for enhancing the performance of the conventional DA for feature selection. In the future, we intend to propose more EPD strategies and combine them with other swarm intelligence algorithms to solve more optimization problems.