Bio-Inspired Feature Selection: An Improved Binary Particle Swarm Optimization Approach

Feature selection is an effective approach to reduce the number of features of data, which enhances the performance of classification in machine learning. In this paper, we formulate a joint feature selection problem to reduce the number of selected features while enhancing the accuracy. An improved binary particle swarm optimization (IBPSO) algorithm is proposed to solve the formulated problem. IBPSO introduces a local search factor based on Lévy flight, a global search factor based on a weighting inertia coefficient, a population diversity improvement factor based on a mutation mechanism, and a binary mechanism to improve the performance of conventional PSO and to make it suitable for binary feature selection problems. Experiments on 16 classical datasets are conducted to test the effectiveness of the proposed IBPSO algorithm, and the results demonstrate that IBPSO performs better than several comparison algorithms.


I. INTRODUCTION
Machine learning has been widely applied in many practical applications such as data mining, text processing, pattern recognition and medical image analysis, and these fields often rely on datasets with a large amount of data [1]. However, some of the features may be irrelevant or even misleading for machine learning algorithms, which increases the computational overhead and reduces the classification accuracy, especially for high-dimensional datasets [2], [3]. Thus, it is necessary to conduct feature selection.
The main principle of feature selection is to find an optimal, discriminating subset of features from the full dataset, and the selected subset should maintain or even enhance the classification performance of the original dataset [4]. Feature selection methods are useful because they can eliminate redundant noise from the datasets, making machine learning algorithms run faster and more efficiently. In other words, by using feature selection, machine learning approaches may perform better while saving costs [5].

The associate editor coordinating the review of this manuscript and approving it for publication was Jihwan P. Choi.
According to their principles, feature selection methods mainly fall into three categories: filter, wrapper and embedded approaches [6]. The filter approaches use a statistical measure to assign a relevance score to each feature and rank the features according to the computed scores [7]. The wrapper methods adopt classifiers to evaluate the subsets obtained by selection algorithms, and use the feedback of the classifiers to guide the feature selection. Thus, the accuracies of the wrapper methods are better than those of the filter methods [8]. However, they may consume more computing resources. Moreover, the embedded methods are actually special cases of the wrapper methods, since the feature selection is regarded as a part of the training phase in machine learning [9].
Generally, feature selection can be regarded as an optimization problem in which a subset of the original dataset is represented by a solution to the optimization problem, and such problems can be solved by exhaustive and heuristic search approaches [10]. However, the computation costs of the exhaustive search methods are usually unacceptable, especially for datasets with high dimensions. Therefore, heuristic approaches may be more reasonable methods for solving feature selection problems. Swarm intelligence algorithms are efficient heuristic search methods for wrapper-based feature selection problems [11]. Examples include several classical swarm intelligence algorithms such as the genetic algorithm (GA) [12], ant colony optimization (ACO) [13], differential evolution (DE) [14], grey wolf optimization (GWO) [15] and the dragon algorithm (DA) [16]. Among these kinds of algorithms, the particle swarm optimization (PSO) algorithm has been demonstrated to be an effective method for optimization problems since it is powerful while easy to implement [17]. However, the conventional PSO algorithm may have certain drawbacks, such as a lack of exploitation ability for some problems. Moreover, according to the no-free-lunch (NFL) theory, no algorithm is suitable for solving all optimization problems [18]. In addition, the conventional PSO is proposed for continuous optimization problems, and cannot be used directly for feature selection problems with a binary solution space. Thus, the conditions above motivate us to improve the conventional PSO to make it more suitable for feature selection.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The main contributions of this paper are summarized as follows:
• We formulate a joint feature selection problem to simultaneously reduce the number of selected features and enhance the accuracy.
• We propose an improved binary PSO (IBPSO) algorithm to solve the formulated feature selection problem. IBPSO introduces a local search operator, a global search operator, a population diversity improvement factor and a binary mechanism to improve the performance of conventional PSO and to make it suitable for the binary feature selection problem.
• The performance of the proposed IBPSO is verified on 16 classical datasets, and several other algorithms are selected for comparison.

A. ROADMAP
The rest of this paper is organized as follows. Section II reviews the related work. Section III formulates the joint feature selection problem. Section IV proposes the IBPSO algorithm. Section V shows the experiment results and Section VI presents a summary of findings and conclusions.

II. RELATED WORK
In this section, previous works related to feature selection based on various methods are reviewed. Several works that are not based on swarm intelligence or evolutionary algorithms have been proposed for feature selection, e.g., the works in [19], [20] and [21]. Recently, swarm intelligence and evolutionary algorithms may be more popular approaches for feature selection [11]. Genetic algorithms (GAs) may be the first techniques that were widely adopted in feature selection [22], and many GA-based methods have been proposed for these problems [23]. For example, Sayed et al. [24] propose a nested GA for feature selection in high-dimensional cancer microarray datasets. Liu et al. propose a hybrid GA with a wrapper-embedded approach to select features from several datasets. In addition to GA, other swarm intelligence algorithms are also used for feature selection. Mistry et al. [25] propose a micro-GA embedded PSO feature selection method for intelligent facial emotion recognition problems. Zhu et al. [26] propose an improved gravitational search algorithm to solve feature selection problems, and the effectiveness of the proposed method is evaluated on several widely used datasets. Taradeh et al. [27] propose another gravitational search-based algorithm for feature selection. Emary et al. [28] use the firefly algorithm (FA) to select features from several datasets. In [29], the authors propose an ant colony optimization (ACO) with support vector machine (SVM) strategy for wrapper feature selection in face recognition. O'Boyle et al. [30] use ACO to select features and optimize the parameters of an SVM system, in which a weighting method is adopted to improve the performance of the algorithm. Moreover, Rodrigues et al. [31] use the cuckoo search (CS) algorithm to solve feature selection problems.
Recently, the enhanced versions of swarm intelligence and evolutionary algorithms are more popular methods for feature selections. Dong et al. [32] propose an improved binary GA with feature granulation to select the significant features in datasets. Hou et al. [18] use a novel binary improved fruit fly optimization (FFO) algorithm to improve the performance of conventional FFO algorithm for feature selections. The authors in [33] combine the differential evolution (DE) with ACO for feature selection problems, and DE is adopted to further search for optimal feature subset based on the solutions generated by ACO. Aziz and Hassanien [34] propose a modified CS method which combines rough sets for feature selections. Zhang et al. [35] propose a new version of FFO algorithm to solve the feature selection problem. In this approach, the Gaussian mutation operator and chaotic local search method are introduced to enhance the performance of the original algorithm. Anter et al. [36] use a chaotic binary grey wolf optimization (GWO) approach as the feature selection model that attempts to reduce the number of features without loss of significant information for classification. Abdel-Basset et al. [37] propose a GWO algorithm integrated with a two-phase mutation strategy to solve the feature selection for classification problems based on the wrapper methods, and the two-phase mutation enhances the exploitation capability of the algorithm. Reference [38] proposes a brain storm optimization with a new individual clustering technology and two individual updating mechanisms for developing novel feature selection algorithms with the purpose of maximizing the classification performance. More improved algorithms for feature selections can be found in [39]- [42], [43] and [44].
For feature selection based on the PSO approach, Zhang et al. [45] propose an improved multi-objective PSO for multi-label feature selection. In their method, two new operators, namely adaptive uniform mutation and a local learning strategy, are introduced to enhance the performance of the algorithm. Qi et al. [46] use PSO with a mutation mechanism and the SVM for feature selection in hyperspectral classification. Chhikara et al. [47] propose a feature selection approach based on an improved PSO algorithm and filter approaches to enhance the classification accuracy and reduce the computational complexity in image steganalysis. The authors in [48] utilize a filter-based feature selection technique which uses an information theoretic-PSO approach to determine the optimal feature combination in biomedical entity extraction. Chen et al. [49] propose a hybrid PSO with a spiral-shaped mechanism for selecting the optimal feature subset for classification via a wrapper-based approach. Ding et al. [50] add the crossover and mutation operators of GA to the competitive swarm optimization, which is an extended version of PSO, for feature selection, so as to improve the generation speed of new individuals in the algorithm and prevent premature convergence of the population. Moreover, Sakri et al. [51] use a PSO-based feature selection algorithm for data mining in predicting breast cancer recurrence.
The abovementioned approaches can solve feature selection problems in various applications. However, according to the NFL theorem, none of these algorithms is able to solve all feature selection problems. Moreover, an optimization algorithm may perform differently in different feature selection applications. Thus, in this work, a novel IBPSO that balances exploration and exploitation is proposed to deal with more feature selection problems.

III. PROBLEM FORMULATION
The goal of feature selection in this work is to reduce the number of selected features while maximizing the classification accuracy, which can be regarded as a multi-objective optimization problem. To consider these two objectives, we design the fitness function based on the linear weighting method as follows:

$$f = a \cdot E + b \cdot \frac{R}{C} \tag{1}$$

where E is the classification error rate of a certain classifier, and R and C represent the number of selected features and the total number of features, respectively. Moreover, a and b are the weights which are used to balance these two objectives.
Note that the k-nearest neighbor (KNN) method is applied to implement the wrapper method and evaluate the classification accuracy in this work. KNN is a classical and popular machine learning algorithm, which keeps all the training data for classification. The details of KNN can be found in [52]. Moreover, it should also be noted that the classifier is not a contribution of this work, and we only adopt the classification results as indicators.
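The wrapper evaluation described in this section can be sketched in Python as follows. The sketch assumes the linear weighting form f = a·E + b·R/C, with E the KNN error rate, R the number of selected features and C the total number of features. The weight values `a = 0.99` and `b = 0.01`, the neighborhood size `k = 5`, the infinite penalty for an empty subset and the function names are illustrative assumptions rather than settings reported in this work.

```python
import math
from collections import Counter


def knn_error_rate(train_X, train_y, test_X, test_y, mask, k=5):
    """Error rate of a plain k-NN classifier using only features where mask == 1."""
    idx = [d for d, bit in enumerate(mask) if bit == 1]
    errors = 0
    for x, y in zip(test_X, test_y):
        # Rank training samples by Euclidean distance over the selected features.
        nearest = sorted(
            range(len(train_X)),
            key=lambda j: math.dist([x[d] for d in idx],
                                    [train_X[j][d] for d in idx]),
        )[:k]
        # Majority vote among the k nearest neighbors.
        predicted = Counter(train_y[j] for j in nearest).most_common(1)[0][0]
        errors += predicted != y
    return errors / len(test_X)


def fitness(mask, train_X, train_y, test_X, test_y, a=0.99, b=0.01, k=5):
    """Weighted sum of error rate E and feature ratio R / C (lower is better)."""
    selected = sum(mask)
    if selected == 0:
        # Assumption: an empty feature subset is treated as infeasible.
        return float("inf")
    error = knn_error_rate(train_X, train_y, test_X, test_y, mask, k)
    return a * error + b * selected / len(mask)
```

With such a function, a subset that keeps the accuracy while dropping a feature always receives a lower (better) fitness value, which is the trade-off the two weights are meant to control.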

IV. PROPOSED IBPSO
The search space in the formulated feature selection problem is nonlinear and discrete, and there may be a large number of local minimum points. Thus, we propose an IBPSO algorithm for solving the feature selection problems and the details of this algorithm are as follows.

A. CONVENTIONAL PSO
The conventional PSO is a bio-inspired evolutionary algorithm which is inspired by the food searching behaviors of fish or birds in nature. In this algorithm, each candidate solution of the optimization problem is represented by a particle, and each particle has two main properties, namely position and velocity. For an optimization problem with n dimensions, the ith particle moves with a certain velocity $v_i$ and the position of the particle is expressed as $x_i$. Thus, the position and velocity of a solution in the PSO algorithm can be written as:

$$x_i = (x_i^1, x_i^2, \ldots, x_i^n), \quad v_i = (v_i^1, v_i^2, \ldots, v_i^n)$$

In PSO, the quality of a solution is evaluated by the fitness function, and a particle with a better fitness function value is regarded as a better solution. In each iteration, the position and velocity of each particle in the population are updated to generate new solutions, and the solution update method is as follows [17]:

$$v_i^{t+1} = \omega v_i^t + c_1 r_1 \left(P_{best}^t - x_i^t\right) + c_2 r_2 \left(G_{best} - x_i^t\right) \tag{2}$$

$$x_i^{t+1} = x_i^t + v_i^{t+1} \tag{3}$$

where $P_{best}^t$ is the personal best solution, i.e., the solution with the best fitness function value found by the particle up to the tth iteration, and $G_{best}$ is the global best solution, i.e., the solution with the best fitness function value found so far. $\omega$ is the inertia coefficient, which ranges from 0 to 1, and $c_1$ and $c_2$ are the accelerating coefficients. Moreover, $r_1$ and $r_2$ are random numbers drawn from uniform distributions over [0, 1].
Accordingly, the main steps of conventional PSO are shown in Fig. 1, and the details are further explained as follows:
Step 1: Define the key parameters such as the population size $N_{pop}$ and the fitness function $f(x)$, and initialize the positions and velocities of the particles randomly.
Step 2: Calculate the fitness function value of each particle and sort the solutions according to these values.
Step 3: Update $P_{best}$ and $G_{best}$ based on the fitness function values.
Step 4: Update the position and velocity of each particle by Eqs. (2) and (3).
Step 5: If the stopping criterion is reached, the algorithm finishes. Otherwise, go to Step 2.
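The five steps above can be sketched as a short Python routine for a continuous minimization problem. This is a minimal illustration, not the implementation used in this work: the function name `pso`, the search bounds, the velocity initialization range and the parameter values `w = 0.7` and `c1 = c2 = 1.5` are assumptions chosen so that the example converges on a simple test function, and the explicit sorting of Step 2 is replaced by direct comparisons.

```python
import random


def pso(f, n_dim, n_pop=20, t_max=200, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, seed=0):
    """Minimize f over [lo, hi]^n_dim with a basic PSO loop."""
    rng = random.Random(seed)
    # Step 1: random initial positions and velocities.
    x = [[rng.uniform(lo, hi) for _ in range(n_dim)] for _ in range(n_pop)]
    v = [[rng.uniform(-1.0, 1.0) for _ in range(n_dim)] for _ in range(n_pop)]
    # Steps 2 and 3: evaluate the particles, set personal and global bests.
    p_best = [xi[:] for xi in x]
    p_val = [f(xi) for xi in x]
    g = min(range(n_pop), key=p_val.__getitem__)
    g_best, g_val = p_best[g][:], p_val[g]
    for _ in range(t_max):  # Step 5: stop after t_max iterations.
        for i in range(n_pop):
            for d in range(n_dim):
                r1, r2 = rng.random(), rng.random()
                # Step 4: velocity update, then position update (clamped).
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                x[i][d] = min(hi, max(lo, x[i][d] + v[i][d]))
            val = f(x[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = x[i][:], val
                if val < g_val:
                    g_best, g_val = x[i][:], val
    return g_best, g_val


if __name__ == "__main__":
    best, value = pso(lambda x: sum(c * c for c in x), n_dim=3)
    print(best, value)
```

Note that this sketch updates the global best as soon as a particle improves on it, which is one common variant; updating it once per iteration is equally valid.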

B. IBPSO
It has been reported that the conventional PSO algorithm has some drawbacks for certain optimization problems. For example, it may have a high probability of falling into local optima. Moreover, feature selection is a binary optimization problem by nature, which means that the conventional PSO algorithm cannot be used for this problem directly since it is proposed for continuous problems. Thus, we propose an IBPSO algorithm for solving the formulated feature selection problem. IBPSO introduces the local search, global search and population diversity improvement factors to improve the performance of conventional PSO, and uses a binary mechanism to make the algorithm suitable for feature selection.
Note that the basic principle of IBPSO is to let the searching agents of the continuous algorithm move around the search space continuously, and then to map the obtained continuous solutions into a binary space. The pseudo code of the proposed IBPSO is shown in Algorithm 1, and the details of the introduced improved factors are as follows.

1) LOCAL SEARCH BASED ON LÉVY FLIGHT
In conventional PSO, each solution is updated under the guidance of $G_{best}$, so that it may move toward a better position. However, if $G_{best}$ is at a position that is far away from the optimal solution, the algorithm may fall into local optima, which reduces the performance of the algorithm. Thus, to further enhance the exploitation ability of conventional PSO, a local search operator based on the Lévy flight mechanism is proposed for $G_{best}$. Lévy flight is a random walk method which follows a heavy-tailed distribution. In this method, short-distance and occasional long-distance searches appear alternately, which expands the search scope and enhances the local search performance.
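As an illustration of the heavy-tailed steps discussed above, the following Python sketch draws Lévy-distributed step lengths with Mantegna's algorithm and uses them to perturb a global best solution. Mantegna's algorithm, the stability index `beta = 1.5`, the step factor `alpha = 0.01` and the greedy acceptance rule are assumptions made for this example, and the function names `levy_step` and `levy_local_search` are hypothetical; the operator used in IBPSO may differ in these details.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length using Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0) or 1e-12  # guard against an exact zero draw
    return u / abs(v) ** (1 / beta)


def levy_local_search(g_best, f, rng, alpha=0.01, beta=1.5):
    """Perturb g_best with Levy steps and keep the candidate only if it is better."""
    candidate = [g + alpha * levy_step(rng, beta) for g in g_best]
    # Assumption: greedy acceptance, so the global best never gets worse.
    return candidate if f(candidate) < f(g_best) else g_best
```

Most draws from `levy_step` are small, which refines the neighborhood of the current global best, while the occasional large draw lets the search jump away from it.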
Accordingly, the proposed Lévy flight-based local search factor, which follows [53], is given by Eq. (4), where $\alpha$ is the step factor, whose value depends on the application. Moreover, the random step value of the Lévy flight is drawn from the Lévy distribution.

Algorithm 1 IBPSO
  Define and initialize the related parameters: population size $N_{pop}$, solution dimension $N_{dim}$, maximum iteration $t_{max}$, fitness function, etc.;
  Map the searching agents into binary space by using Eqs. (11) and (12);
  Calculate the fitness function values of the mapped particles and sort these particles in ascending order;
  for t = 1 to $t_{max}$ do
    for i = 1 to $N_{pop}$ do
      Update each searching agent by using Eq. (6);
      Apply the local search operator to update $G_{best}$ by using Eqs. (4) ...
      ...
    Mutate solutions by using Algorithm 2;
    Map the continuous searching agents into binary space by using Eqs. (11) and (12);
    Calculate the fitness function values of the mapped particles;
  end
  Return $x_{best}$  // $x_{best}$ is the best solution obtained by the algorithm

2) GLOBAL SEARCH BASED ON WEIGHTING INERTIA COEFFICIENT
In conventional PSO, the inertia coefficient is an important parameter that affects the searching performance of the algorithm, since it determines the searching scope. However, the value of the inertia coefficient is fixed during the iterative process, which may not be suitable since the required searching steps of the initial and end stages are different. Thus, to overcome this shortcoming, we introduce a weighting inertia coefficient into PSO, which is bounded by $\omega_{max}$ and $\omega_{min}$, the maximum and minimum values of the inertia coefficient, respectively.
Then, the position update uses the weighting inertia coefficient in place of the fixed one. By using the weighting inertia coefficient, the searching step can be dynamically adjusted according to the iteration of the algorithm; that is, in the initial stage it may be better to have a longer searching step, since the current solutions may be far away from the optimal location, and vice versa. Thus, by using the weighting inertia coefficient, the global search ability of the algorithm may be improved.
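One simple way to realize an iteration-dependent inertia coefficient is a linear decrease from the maximum to the minimum value, sketched below in Python. The linear form, the function name `inertia_weight` and the default bounds `w_max = 0.9` and `w_min = 0.4` are assumptions for illustration; the weighting scheme used in IBPSO may take a different form.

```python
def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    """Linearly decrease the inertia coefficient from w_max (t = 0) to w_min (t = t_max)."""
    return w_max - (w_max - w_min) * t / t_max


if __name__ == "__main__":
    # Large steps early (exploration), small steps late (exploitation).
    for t in (0, 50, 100, 150, 200):
        print(t, inertia_weight(t, 200))
```

In a PSO loop, the value returned for the current iteration would simply replace the fixed inertia coefficient in the velocity update.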

3) POPULATION DIVERSITY IMPROVEMENT BASED ON MUTATION MECHANISM
Feature selection problems usually have high solution dimensions, which causes the algorithm to have a huge population. Moreover, the solutions in conventional PSO are updated under the guidance of the corresponding $P_{best}$, which may cause the solutions with poor $P_{best}$ to be over-exploited in a low-fitness solution space, thereby reducing the exploitation efficiency. However, if the weight of $G_{best}$ is increased to solve this issue, then all solutions are updated mainly under the guidance of $G_{best}$, so the updated solutions may be similar to each other, which reduces the population diversity. Therefore, it may be difficult to maintain both exploitation efficiency and population diversity in conventional PSO.
To overcome this issue, we introduce a mutation mechanism to mutate some of the solutions in the population and their corresponding $P_{best}$ values, thereby enhancing the exploitation efficiency while ensuring the population diversity. In this work, the half of the solutions with the worst fitness function values and their corresponding $P_{best}$ are selected to be mutated by using Eq. (8), where $P^t_{best_{m_1}}$, $P^t_{best_{m_2}}$ and $P^t_{best_{m_3}}$ are randomly selected personal best solutions from the first half of the population, which has better fitness function values, $m_1 \neq m_2 \neq m_3$, and M is an adjustment factor. By using this mutation operator, the solutions with the worst fitness function values can be further guided by the personal best solutions with better fitness function values, which may improve the exploitation efficiency.
Moreover, the solutions in the second half of the population, which have worse fitness function values, should also be mutated to further improve the population diversity. Thus, we propose to use a threshold $\phi$, which depends on the current iteration number t, to determine which dimensions of a solution are updated. In our scheme, if a randomly generated number rand is less than $\phi$, then the corresponding dimension of the solution is mutated to its inverse value by using Eq. (10), where $x_d$ represents the dth dimension of a solution. The main steps of the mutation mechanism are shown in Algorithm 2.

Algorithm 2
  ...
    Select $P_{best_{m_1}}$, $P_{best_{m_2}}$ and $P_{best_{m_3}}$ from the first half of the population with better fitness function values;
    Generate a new $P_{best_i}$ by using Eq. (8);
  end
  Map the continuous searching agents into binary space by using Eqs. (11) and (12);
  Calculate the fitness function values of the mapped particles;
  for j = 1 to $N_{dim}$ do
    if rand ≤ $\phi$ then
      Update the jth dimension of $x_i$ by using Eq. (10);
    end
  end
  Return $x_i$
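The two parts of the mutation mechanism can be sketched in Python as follows. The sketch assumes a differential-evolution-style combination of three distinct donors, $P_{best_{m_1}} + M (P_{best_{m_2}} - P_{best_{m_3}})$, for the personal bests of the worse half, and a bit flip $x_d \to 1 - x_d$ for the inverse-value mutation. Both forms, the value `M = 0.5`, the function names and the choice to pass the threshold `phi` in as an argument are assumptions for illustration only.

```python
import random


def mutate_p_best(p_best, p_val, M=0.5, rng=random):
    """Rebuild the personal bests of the worse half from donors in the better half."""
    order = sorted(range(len(p_best)), key=p_val.__getitem__)  # best first
    half = len(order) // 2
    better, worse = order[:half], order[half:]
    new_p_best = [p[:] for p in p_best]
    for i in worse:
        # Three distinct donors; requires at least three particles in the better half.
        m1, m2, m3 = rng.sample(better, 3)
        new_p_best[i] = [p_best[m1][d] + M * (p_best[m2][d] - p_best[m3][d])
                         for d in range(len(p_best[i]))]
    return new_p_best


def flip_bits(x_binary, phi, rng=random):
    """Flip each dimension of a binary solution with probability phi."""
    return [1 - bit if rng.random() < phi else bit for bit in x_binary]
```

The first function pulls poorly performing particles toward regions already found to be good, while the second injects random changes that keep the population from collapsing onto a single solution.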

4) BINARY MECHANISM
A binary mechanism is introduced to map the solutions from the continuous space to the discrete space, so as to make the algorithm suitable for feature selection problems. In this work, the widely used Sigmoid function is adopted in IBPSO for the solution mapping, and the details of this function are as follows [54]:

$$x_{sig} = \frac{1}{1 + e^{-x}} \tag{11}$$

$$x_{binary} = \begin{cases} 1, & N_{random} \le x_{sig} \\ 0, & N_{random} > x_{sig} \end{cases} \tag{12}$$

where $x_{binary}$ is the converted binary solution of the feature selection problem, and $N_{random}$ is a random number which is used as the threshold.
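The Sigmoid-based mapping can be sketched in Python as follows. The function name `to_binary`, the clamping of the input to avoid floating-point overflow and the use of a uniform random number in [0, 1) as the threshold are assumptions of this sketch.

```python
import math
import random


def to_binary(x, rng=random):
    """Map a continuous solution to a binary one with a Sigmoid transfer function."""
    binary = []
    for value in x:
        # Clamp to avoid overflow in exp() for very large magnitudes.
        clamped = max(-60.0, min(60.0, value))
        x_sig = 1.0 / (1.0 + math.exp(-clamped))
        # The feature is selected when the random threshold does not exceed x_sig.
        binary.append(1 if rng.random() <= x_sig else 0)
    return binary


if __name__ == "__main__":
    print(to_binary([2.5, -2.5, 0.0, 4.0], random.Random(0)))
```

Large positive values are therefore almost always mapped to 1 (feature selected), large negative values almost always to 0, and values near zero are selected about half of the time.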

C. FEATURE SELECTION BASED ON IBPSO
To solve the formulated feature selection problem by using the proposed IBPSO, each solution is regarded as a particle and is expressed as a binary vector with n elements, where n represents the number of features. Correspondingly, the population of IBPSO is composed of $N_{pop}$ such solutions.

D. COMPUTATIONAL COMPLEXITY
The complexity of the proposed IBPSO is analyzed in this section. Suppose that the maximum number of iterations and the population size are $t_{max}$ and $N_{pop}$, respectively. Then the complexity of the conventional PSO algorithm is $O(t_{max} \cdot N_{pop})$ because there is only one inner loop in the algorithm. Since the structure of the proposed IBPSO is similar to that of PSO, the complexity of IBPSO is also $O(t_{max} \cdot N_{pop})$. However, IBPSO may consume more computing time than conventional binary PSO (BPSO) for solving a certain feature selection problem even though they have the same computational complexity. The reason may be that the introduced improved factors lead to additional computing time, and this will be further evaluated and discussed in the following section.

V. RESULTS AND ANALYSIS
In this section, we conduct tests to evaluate the performance of the proposed IBPSO algorithm for feature selection. Moreover, several other algorithms are selected for comparison.

A. BENCHMARK DATASETS AND EXPERIMENT SETUPS
The benchmark datasets used in the evaluations and parameter setups of different algorithms are introduced.

1) BENCHMARK DATASETS
In this work, we select 16 datasets from the widely used UC Irvine Machine Learning Repository [55], and the main information of these selected datasets is shown in Table 1.

2) PARAMETER TUNING
As is known, according to the NFL theory, it is difficult for a metaheuristic algorithm to perform excellently on all optimization problems, especially when using the same parameter setups. Thus, it is better to tune the key parameters for each optimization problem separately to achieve the best performance. In this work, the numbers of features (solution dimensions) of the datasets are quite different, which means that the formulated feature selection problem for each dataset may be regarded as an independent optimization problem. Thus, it would be better to tune the key parameters of IBPSO for each dataset. However, this would be a huge amount of work since there are 16 datasets in our work. Therefore, referring to [56], we select the BreastEW dataset to tune the key parameters of the proposed IBPSO, since this dataset has the median size compared to the other datasets.
In the tuning test, the accelerating coefficients $c_1$ and $c_2$, which are the key parameters of PSO as well as IBPSO, are jointly tuned. Specifically, we use the proposed IBPSO with different combinations of $c_1$ and $c_2$ to select the features on the BreastEW dataset, and the ranges of these two parameters are both 1 to 2 with the step size of 0.25, resulting in 16 different combinations of these parameters. Moreover, each combination is independently run 30 times to avoid random bias, and the average results are presented. The parameter tuning results are shown in Table 2. According to the results, when $c_1 = 1.75$ and $c_2 = 2.0$, IBPSO achieves the best optimization results. Thus, we use these parameter values for all the datasets.

3) EXPERIMENT SETUPS
In the feature selection tests, the genetic algorithm (GA), binary firefly algorithm (BFA), binary cuckoo search (BCS), BPSO, and binary bat algorithm (BBA) are introduced as the comparison algorithms. Moreover, the key parameter setups of these comparison algorithms as well as the proposed IBPSO are listed in Table 3. In addition, the maximum number of iterations for each algorithm is set as 200, the population size (number of searching agents) is 20, and the dimension of a solution is equal to the number of features of each dataset. Note that the performance of a metaheuristic algorithm is directly affected by the population size and the number of iterations. Specifically, an algorithm with a large population size may achieve better optimization performance than one with a small population size. Moreover, an algorithm with more iterations may obtain better results than one with fewer iterations. Thus, we use the same population size and the same number of iterations for each algorithm to make a fair comparison between different algorithms.
Each algorithm is independently run 30 times to solve the feature selection problems of the selected datasets, and the numerical statistics results are presented. Moreover, in each test, we use 80% of the instances for training and the rest for testing, which is a common setting adopted by several previous works.
The computer used for the tests has an Intel(R) Xeon(R) E5-2630 v4 CPU and 32 GB of RAM. Moreover, the abovementioned algorithms for feature selection are implemented in Python.
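The evaluation protocol described above (an 80/20 split of the instances and statistics over repeated independent runs) can be sketched in Python as follows. The function names `split_80_20` and `summarize`, the random shuffling before the split and the use of the sample standard deviation are assumptions of this sketch, and "best" is taken as the minimum because fitness values are minimized.

```python
import random
import statistics


def split_80_20(n_samples, rng):
    """Return shuffled index lists holding 80% and 20% of the instances."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = (n_samples * 4) // 5
    return idx[:cut], idx[cut:]


def summarize(values):
    """Best, worst, mean and standard deviation of a metric over independent runs."""
    return {
        "best": min(values),
        "worst": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # assumption: sample standard deviation
    }


if __name__ == "__main__":
    train_idx, test_idx = split_80_20(569, random.Random(0))
    print(len(train_idx), len(test_idx))
    print(summarize([0.031, 0.029, 0.035, 0.030]))
```

For a metric that is maximized, such as classification accuracy, the roles of `min` and `max` in `summarize` would be swapped.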

B. FEATURE SELECTION RESULTS
In this section, the feature selection results achieved by different algorithms are presented.

1) PERFORMANCE EVALUATIONS OF DIFFERENT ALGORITHMS
In this section, the fitness function values obtained by different algorithms are presented to show the performances of these approaches directly. Tables 4 and 5 show the numerical statistics results in terms of best value, worst value, standard deviation (SD), average value and CPU time of different algorithms for each dataset, and the best values obtained by a certain algorithm are highlighted in bold font for a clear presentation. Due to limited page space, the results of the selected 16 datasets are shown in two separate tables. It can be seen from the tables that the proposed IBPSO algorithm achieves the best average fitness function values on 12 datasets, which means it performs better than the other comparison algorithms. However, IBPSO consumes more CPU time for solving the feature selection problems on most datasets. This is because the introduced improved factors need extra operations on the solutions, which increases the experiment time.
In addition, the convergence rates of different algorithms during the processes of solving the fitness functions are shown in Fig. 2. Note that these curves are taken from the 15th test, which is the median one. As can be seen, the proposed IBPSO exhibits the best curves on 12 datasets, which indicates a better convergence ability.

2) FEATURE SELECTION ACCURACIES
The feature selection accuracies obtained by different algorithms are presented in Tables 6 and 7. Similarly, the numerical statistics results of different algorithms for each dataset are presented in these tables. As can be seen, the IBPSO algorithm achieves the best average accuracies on 10 datasets and the best accuracy results on 13 datasets. Thus, the IBPSO algorithm has the best performance in terms of feature selection accuracy on the selected datasets compared to the other algorithms. The reason may be that the introduced improved factors can balance the exploration and exploitation abilities, thereby enhancing the performance of the algorithm.
Tables 8 and 9 show the numbers of selected features of the datasets obtained by different algorithms. Similar to the accuracy results, these tables also present the numerical statistics results. It can be seen from the tables that IBPSO obtains the best average number of selected features on half of the datasets (8 of 16), which can be regarded as the best result in the tests compared to the other algorithms. Note that there is a tradeoff between the accuracy and the number of selected features, which means it may be very difficult to achieve the best results in both objectives for each dataset. Thus, we may say that the proposed IBPSO has the overall best performance for feature selection on the selected datasets compared to the other algorithms.

C. SOLUTION DISTRIBUTIONS
In this section, the exploration and exploitation changes of the proposed IBPSO are plotted such that the performance of the algorithm can be shown in a more intuitive way.
However, the feature selection problem usually has high solution dimensions, which means that it is very difficult to show the solution distributions directly. Thus, we use a popular data dimensionality reduction technique called t-SNE to visualize the high-dimensional data by giving each datapoint (solution) a location in a two-dimensional map. The details and principle of t-SNE can be found in [57]. Without loss of generality and for conciseness, we select four of the sixteen datasets to present the visualization results of exploration and exploitation. The selected datasets are Breastcancer, CongressEW, BreastEW and SonarEW, since they have different numbers of features (solution dimensions) and are therefore representative. Moreover, the specific process of the population distribution visualization by using t-SNE is as follows. First, the solution distribution data of the population in each iteration are obtained and recorded. Second, these data are processed and mapped to two-dimensional space by using t-SNE. Finally, we plot the mapped data of each selected dataset to show how exploration and exploitation change during the iterative process. Fig. 3 shows the solution distributions in the population obtained by the proposed IBPSO. As can be seen, the solutions obtained by IBPSO alternate between dense and sparse distributions in the four selected datasets during the iterative processes, which means that IBPSO can balance the exploration and exploitation abilities in solving the feature selection problems.

D. EFFECTIVENESS OF THE IMPROVED FACTORS
In this section, we conduct tests to evaluate the effectiveness of the improved factors in IBPSO. In the tests, we use conventional BPSO, BPSO with the local search factor (BPSO-LSF), BPSO with the global search factor (BPSO-GSF) and BPSO with the population diversity improvement factor (BPSO-PDIF) to solve the formulated feature selection problem, in order to observe whether these factors can enhance the performance of conventional PSO. The tests are also conducted on the four selected datasets, namely Breastcancer, CongressEW, BreastEW and SonarEW, and the numerical results obtained by these approaches are listed in Table 10. On the whole, all approaches achieve the same results on the Breastcancer dataset, which may be because this dataset has the lowest solution dimension and is therefore easy to solve. The remaining results are discussed in detail as follows.

1) LÉVY FLIGHT-BASED LOCAL SEARCH FACTOR
It can be seen from Table 10 that the fitness function value, classification accuracy and number of features obtained by BPSO-LSF are better than those of the conventional BPSO algorithm. This is because the Lévy flight-based local search operator can extend the search area around the global best solution, which provides a better exploitation ability for finding a better global best solution, so that the algorithm can achieve a more accurate solution than BPSO.

2) WEIGHTING INERTIA COEFFICIENT-BASED GLOBAL SEARCH FACTOR
Table 10 shows that the accuracies and fitness function values obtained by BPSO-GSF are better than those of BPSO, especially on medium-dimensional datasets. This is because the proposed weighting inertia coefficient mechanism can adaptively adjust the search scope of the algorithm, thereby improving its exploration ability. Note that this improved factor may weaken the exploitation capability of the algorithm because it excessively retains the solutions of the previous iteration at the end stage, which reduces the performance of the algorithm on datasets with larger solution dimensions. Thus, it is necessary to introduce the third improved factor to overcome this shortcoming.
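To make the roles of the two factors above more concrete, the following is a minimal sketch under stated assumptions, and it should not be read as the exact operators of IBPSO. It assumes that Lévy-distributed step lengths are drawn with Mantegna's algorithm, that the inertia weight decreases linearly over the iterations, and that a sigmoid transfer function maps the velocity to a bit. The parameter values (`beta`, `w_max`, `w_min`, `c1`, `c2`, `v_max`) and all function names are illustrative choices only.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one Levy-distributed step length using Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    """Linearly decreasing inertia weight (an assumed schedule)."""
    return w_max - (w_max - w_min) * t / t_max


def sigmoid(x):
    """Transfer function that maps a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def update_bit(x, v, pbest, gbest, w, c1=2.0, c2=2.0, v_max=6.0, rng=random):
    """One binary PSO update for a single bit; returns (new_bit, new_velocity)."""
    v_new = (w * v
             + c1 * rng.random() * (pbest - x)
             + c2 * rng.random() * (gbest - x))
    v_new = max(-v_max, min(v_max, v_new))  # clamp to keep sigmoid well-behaved
    bit = 1 if rng.random() < sigmoid(v_new) else 0
    return bit, v_new
```

A large inertia weight keeps more of the previous velocity and thus favours exploration, while the heavy tail of the Lévy distribution produces occasional long jumps among many short ones, which is what makes it attractive for searching around the global best solution.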

3) MUTATION MECHANISM-BASED POPULATION DIVERSITY IMPROVEMENT FACTOR
As can be seen, the BPSO-PDIF approach achieves a better fitness function value on the SonarEW dataset, which has the largest solution dimension. Moreover, Fig. 4 shows the convergence rates of BPSO and BPSO-PDIF on the four selected datasets. It can be seen from the figure that the proposed BPSO-PDIF scheme improves the convergence ability compared to the conventional BPSO algorithm. This is because the mutation mechanism-based population diversity improvement factor makes the algorithm tend to exploit around high-quality solutions, thereby enhancing both the exploitation ability and the convergence rate of the algorithm.
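For readers unfamiliar with mutation in binary search spaces, the following minimal sketch shows a generic bit-flip mutation. It is an assumption made for illustration and not the specific mutation mechanism of IBPSO; the function name `mutate` and the default `rate` are hypothetical.

```python
import random


def mutate(solution, rate=0.05, rng=random):
    """Return a copy of a bit vector with each bit flipped with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in solution]


# Illustrative feature mask: 1 means the feature is selected.
particle = [1, 0, 1, 1, 0]
unchanged = mutate(particle, rate=0.0)  # no bit is flipped
flipped = mutate(particle, rate=1.0)    # every bit is flipped
```

With a small mutation rate, the mutated solution stays close to the original one, so applying it to high-quality solutions concentrates the search in their neighbourhood.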
In summary, the three introduced improved factors are effective in improving the performance of conventional BPSO. Moreover, they are complementary. For example, if the mutation mechanism is used on small datasets, the algorithm may easily fall into local optima. Thus, it is necessary to adopt the Lévy flight-based local search factor to handle this issue.

E. LIMITATIONS OF IBPSO
Although the proposed IBPSO achieves better performance than some other algorithms on the selected datasets, it still has some limitations. We think that the main limitations of this algorithm are as follows. First, IBPSO has more parameters than some other approaches such as the conventional PSO algorithm, since the introduced improved factors bring additional parameters. Thus, it is difficult to tune all the parameters of IBPSO for different applications, e.g., for solving other binary optimization problems. Second, IBPSO requires more running time than other algorithms even though they have the same computational complexity, as reflected in Tables 4 and 5. This is because the introduced improved factors need extra computing time compared to the conventional PSO algorithm. For example, IBPSO has a Lévy flight-based local search operator, so the algorithm takes more CPU time for calculation in each loop. Moreover, the proposed mutation mechanism also causes the algorithm to spend more CPU time. Finally, the performance of IBPSO on feature selection problems with lower solution dimensions is not as good, which can also be observed in Tables 4 and 5. The reason may be that the introduced mutation mechanism-based population diversity improvement factor divides the solutions into different parts and lets the algorithm handle them separately, which is better suited to larger population sizes with higher solution dimensions.

VI. CONCLUSION
In this paper, the feature selection problem is investigated. First, a joint feature selection problem is formulated, and then we propose an efficient algorithm called IBPSO to solve the formulated problem. In IBPSO, we first introduce the Lévy flight mechanism to improve the local search performance of the algorithm. Second, a weighting inertia coefficient operator is proposed to enhance the global search ability. Moreover, we use the mutation mechanism to improve the population diversity of the algorithm. Finally, a binary method is adopted to make the continuous algorithm suitable for the binary feature selection problem. Experiments are conducted on several classical datasets to evaluate the proposed algorithm, and the results show that IBPSO outperforms GA, BFA, BCS, BPSO and BBA overall in solving the feature selection problem. In our future work, more test datasets will be considered to further evaluate the proposed algorithm.

YINZHE XIAO received the B.S. degree in naval architecture and ocean engineering from the Dalian University of Technology, in 2019. He is currently pursuing the master's degree in computer science and technology with Jilin University. His research interests include data mining, feature engineering, and machine learning.