Bi-Directional Feature Fixation-Based Particle Swarm Optimization for Large-Scale Feature Selection

Feature selection, which aims to improve classification accuracy and reduce the size of the selected feature subset, is an important but challenging optimization problem in data mining. Particle swarm optimization (PSO) has shown promising performance in tackling feature selection problems but still faces challenges with large-scale feature selection in Big Data environments because of the large search space. Hence, this article proposes a bi-directional feature fixation (BDFF) framework for PSO and provides a novel idea for reducing the search space in large-scale feature selection. BDFF uses two opposite search directions to guide particles to adequately search for feature subsets of different sizes. Based on the two search directions, BDFF can fix the selection states of some features and focus on the others when updating particles, thus narrowing the large search space. Besides, a self-adaptive strategy is designed to help the swarm concentrate on a more promising direction at different stages of evolution and achieve a balance between exploration and exploitation. Experimental results on 12 widely used public datasets show that BDFF can improve the performance of PSO on large-scale feature selection and obtain smaller feature subsets with higher classification accuracy.


INTRODUCTION
Feature selection is the process of selecting a portion of features relevant to the labels in classification problems and removing redundant or noisy features from the entire feature set. The goal of feature selection is to select as few features as possible while maximizing the discriminative capability of the selected features. As feature selection can reduce the data dimensionality and the difficulty of analyzing and solving problems, it has become an effective data preprocessing method in many fields [1], [2].
Feature selection is essentially a binary discrete optimization problem and has been proved to be NP-hard [3], [4]. Most existing feature selection methods can be roughly classified into three categories: filter methods, wrapper methods, and embedded methods [5], [6]. Filter methods analyze the features through some statistical or informatics methods and then select the features with high scores. Wrapper methods search for the optimal feature subset by evaluating all candidate subsets on a specific problem. Embedded methods usually embed feature selection into the training process of machine learning through regularization [7], [8]. In general, wrapper methods perform better than filter methods in classification accuracy and have a larger scope of application than embedded methods [9]. Evolutionary computation (EC) techniques are powerful in solving variants of NP-hard optimization problems [10], [11], [12], [13] and they have been widely applied as wrapper methods for feature selection due to their excellent global search ability [14]. Particle swarm optimization (PSO) is a representative EC technique first proposed by Kennedy and Eberhart [15]. Compared with other EC methods, PSO has the advantage of simple implementation and fast convergence [16], thus becoming an effective method for feature selection [9], [14]. Therefore, this paper focuses on PSO-based methods to solve feature selection problems.
With the development of the Big Data era, data from various application fields grows rapidly in the number of features. In general, the potential search space for feature selection with D features is 2^D, which grows exponentially as the number of features increases. Under this circumstance, PSO methods face two challenges. Firstly, they require more computational resources to evaluate candidate feature subsets. Secondly, their search ability drops sharply because of "the curse of dimensionality" [17], [18], [19], [20], [21].
Therefore, many PSO variants have been proposed for large-scale feature selection recently. They can be roughly divided into two categories. In the first category, PSO-based algorithms aim to design a more effective and reasonable evolutionary mechanism for particles to improve their performance on large-scale feature selection problems [22], [23], [24]. In the second category, PSO-based algorithms use correlation measures to indicate the importance of each feature and focus on searching for solutions formed by the more important features in the subsequent search process [25], [26], [27], [28], [29], [30]. These correlation-based algorithms usually perform better because they can narrow the search space according to the correlation measures. However, they still face the following three challenges. Firstly, the implementation of these algorithms is more complicated than those in the first category because of the additional correlation measurement methods or correlation-based control mechanisms. Secondly, the correlation information of features is sometimes difficult to obtain. For example, the cost of correlation computing is expensive in high-dimensional data, and the commonly used entropy-based correlation measures are difficult to estimate accurately on continuously distributed observations [31]. Thirdly, the correlation analysis is usually incomplete because it is time-consuming to calculate correlation values of all feature combinations and it is difficult to calculate a proper correlation value of multiple variables. Usually, only correlations between two single variables, e.g., between two single features or between a single feature and the label, are considered. In other words, the correlation information of feature subsets composed of multiple features is almost always missing, which undermines the correctness of the feature-importance analysis.
Thus, over-reliance on incomplete correlation analysis results may mislead particles and prevent them from finding optimal solutions. Nevertheless, correlation-based algorithms are effective for large-scale feature selection because they focus specifically on a portion of important features rather than on all features equally, even though they face the challenges mentioned above. To overcome these challenges, an algorithm that can narrow the search space without over-relying on correlation measures is highly desirable. Therefore, this paper proposes a novel framework named bi-directional feature fixation (BDFF) for PSO. The main novelties and contributions of BDFF can be summarized as follows.
(1) Using bi-directional guidance when updating particles to fully search for solutions with different numbers of selected features. Many existing feature selection algorithms consider the number of selected features in their fitness evaluation to find a solution with fewer features [26], [28], [32], [33], [34], [35], [36]. Different from those algorithms, BDFF does not need to consider the number of selected features in its fitness function, which simplifies the fitness evaluation. Instead, it uses the information about the number of selected features to update particles, which can also help particles find solutions with fewer features.
(2) Narrowing the search space by the feature fixation strategy to improve search efficiency. By keeping the selection states of some features unchanged when updating particles, BDFF can not only reduce the search space but also reduce the difficulty of searching for a better solution.
(3) Proposing a self-adaptive direction change (SADC) strategy for particles so that they can change the search direction adaptively with the information provided by the swarm. With the SADC strategy, particles can switch the state of each feature between being fixed and not being fixed dynamically, which increases the flexibility of feature fixation and further improves their search ability.
(4) Having advantages in simple implementation and fine-grained control. BDFF is simple to implement and can be easily applied to many existing PSO-based feature selection algorithms. Besides, BDFF has a small control granularity because it takes the neighborhood containing only several features as the basic unit for fixation.
The rest of this paper is organized as follows. In Section 2, the related work of applying PSO to feature selection is presented. Section 3 introduces the detailed implementation of the proposed BDFF for PSO. The experimental results and analysis of BDFF are given in Section 4. Finally, Section 5 concludes this paper.

Feature Selection Problem
The feature selection problem is a binary discrete optimization problem, which means that the optional values for each dimension are "0" or "1". Assuming that there are D features in the data, the solution for the feature selection problem can be represented by a D-dimension vector x. The binary value of "1" or "0" at the dth dimension x_d represents that the dth feature is selected or not selected, respectively. Then, the goal of the feature selection optimization problem is to select a feature subset from the D features to maximize the discriminative capability f(x) of the data, as shown in Eq. (1):

max f(x), s.t. x ∈ {0, 1}^D    (1)

For example, in classification problems, f(x) can be the classification accuracy of the selected feature subset, and the goal of feature selection is to maximize f(x).
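As a minimal illustration of this binary encoding, the sketch below decodes a selection vector x and scores it with a fitness function. The `toy_accuracy` function is a hypothetical stand-in for a real classifier-based evaluation, not part of the paper.

```python
def fitness(x, evaluate_subset):
    """Discriminative capability f(x) of the subset encoded by binary vector x."""
    selected = [d for d, bit in enumerate(x) if bit == 1]
    return evaluate_subset(selected)

# Hypothetical stand-in for a classifier: features 1 and 3 are the relevant ones.
def toy_accuracy(selected):
    relevant = {1, 3}
    hit = len(relevant & set(selected))      # relevant features that were picked
    noise = len(set(selected) - relevant)    # irrelevant features that were picked
    return hit / (len(relevant) + noise)

x = [0, 1, 0, 1, 0]              # D = 5, features 1 and 3 selected
print(fitness(x, toy_accuracy))  # → 1.0
```

Selecting all five features would score only 0.4 here, which mirrors the goal stated above: keep the discriminative features while dropping the noisy ones.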

PSO-Based Algorithms for Large-scale Feature Selection
Most existing PSO-based algorithms for feature selection can be divided into two categories: one category tries to use different mechanisms to help particles search effectively, while the other utilizes correlation information to gain further improvement. The first category of algorithms focuses on designing different evolutionary mechanisms for feature selection to improve their performance. For example, the earliest PSO-based algorithm that can be used for feature selection is binary PSO (BPSO) proposed by Kennedy and Eberhart [37], which used the velocity of particles to represent the probability of a feature being selected. After BPSO, many PSO-based algorithms attempted to improve the encoding representation, the initialization strategy, the updating mechanism, and the evaluation function of PSO to obtain better performance on feature selection and 0/1 problems. Xue et al. [38] proposed a PSO-based approach with novel initialization strategies and updating mechanisms for feature selection. Shen et al. [39] developed a bi-velocity discrete PSO (BVDPSO) using two velocities of particles to respectively represent the possibilities of being 1 and 0. Gu et al. [23] discretized the competitive swarm optimizer (CSO) [40] and applied CSO to large-scale feature selection problems. Later, the potential PSO proposed by Tran et al. [41] used potential entropy-based cut-points to discretize the values of each feature and then encoded the particles with those discrete values. In the self-adaptive PSO introduced by Xue et al. [24], multiple candidate solution generation strategies were applied simultaneously by a self-adaptive mechanism to increase the diversity of the swarm.
In the first category, although many different mechanisms have been proposed to enhance the search ability of particles, few of them can distinguish which features are worth further searching. Therefore, they always treat all features equally and keep searching for solutions by considering the entire feature set, which wastes computational resources and makes it difficult to find a better solution.
The second category of PSO-based algorithms uses correlation measures as an auxiliary tool to estimate the importance of each feature, which shows more potential for large-scale feature selection and has attracted great attention in recent years. Commonly used correlation measures are based on similarity or information theory [31], [42], including Relief-F [43], mutual information (MI) [44], symmetric uncertainty (SU) [45], etc. Chuang et al. [46] first used correlation-based feature selection as a filter method to select the important features and then used BPSO with chaotic theory to search for the final optimal solution on those important features. After sorting features with the SU measure, Tran et al. [25] designed the variable-length PSO with local search (VLPSO-LS) to dynamically shorten the length of particles, thus narrowing the search space. Chen et al. [26] introduced the evolutionary multitasking framework into PSO, which identified a promising feature subset with high Relief-F values and generated two related tasks on the promising feature subset and the whole feature set, respectively. Song et al. [27] proposed the variable-size cooperative coevolutionary PSO, which employed a space division strategy based on the SU measure and allocated a larger subswarm to those features that were more relevant to the label. Besides, the PSO variant developed by Chen et al. [28] generated new particles with a correlation-guided updating strategy, where features with higher correlation were more likely to be selected. To reduce the computational cost, Song et al. [29] proposed a hybrid feature selection algorithm named HFS-C-P that used the SU measure to discard low-correlation features and to cluster features so that it could search in a small solution space. In [47], the SU measure was also used to distinguish relevant and redundant features in the local search strategy and to affect the mutation probability in the adaptive flip mutation strategy.
In the second category, the correlation-based PSO algorithms can narrow the search space because they can point out promising features and then focus on those promising features rather than all features. However, they still face the following challenges. First, the calculation of correlations is time-consuming. To obtain the correlations between D features and the label, the time complexity is O(D). If the correlations among features are also required, the time complexity rises dramatically to O(D^2), which matters greatly in large-scale feature selection problems in Big Data environments. Second, due to the high computational consumption, correlations between groups of multiple features are rarely considered, so the correlation information used to assist PSO is incomplete in most cases. Third, entropy-based correlation measures such as MI and SU can only be applied to datasets with continuous numerical features through discretization [31], [42] or non-parametric estimation methods [48], which increases the difficulty of their application and weakens the stability of their performance on different data.

Bare Bones PSO
Bare bones PSO (BBPSO) proposed by Kennedy [49] is a simple but efficient PSO variant. It drops the velocity part of the standard PSO and updates the position of each particle using only the historical optimal position found by that particle and the global optimal position found by the swarm so far, as shown in Eq. (2):

x_i,d = { N((pbest_i,d + gbest_d) / 2, |pbest_i,d - gbest_d|),  if r < 0.5
        { pbest_i,d,                                            otherwise    (2)

where x_i,d is the dth dimension of the position of the ith particle, pbest_i,d is the dth dimension of the historical optimal position found by the ith particle, gbest_d is the dth dimension of the global optimal position found by the swarm so far, N(m, s) is a Gaussian distribution with mean m and variance s, and r is a random value uniformly sampled within [0, 1]. To help particles escape from local attractors and perform better in feature selection, Qiu [22] introduced an adaptive chaotic jump (ACJ) strategy into BBPSO and proposed BBPSO-ACJ. Chaos is a non-linear and unpredictable system. When BBPSO is combined with a chaotic system, stagnated particles can change their positions greatly and jump out of the trapped local optima. Therefore, swarm diversity is promoted and the global search ability is greatly enhanced. The position of each particle in BBPSO-ACJ is updated by Eq. (3):

x_i,d = { z_k,                    if r < P_cj,i
        { updated by Eq. (2),     otherwise    (3)

where P_cj,i is the probability of the ith particle performing the chaotic jump, and z_k ∈ (0, 1) is a value of the chaotic sequence generated with the logistic map in Eq. (4) every time it is used:

z_(k+1) = 4 z_k (1 - z_k)    (4)
To balance convergence speed and swarm diversity, BBPSO-ACJ makes P_cj,i depend on the number of stagnant generations s_i, i.e., the number of consecutive generations without fitness improvement of the ith particle, so that particles stagnating for more generations are more likely to perform the chaotic jump. Moreover, BBPSO-ACJ employs a method to decode the position of a particle into the representation of the selected feature subset: if the value of x_i,d is greater than 0.5, the dth feature is selected into the feature subset; otherwise, the dth feature is discarded.
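The bare-bones update and the 0.5-threshold decoding above can be sketched as follows. We assume the standard bare-bones sampling, with mean (pbest_i,d + gbest_d)/2 and spread |pbest_i,d - gbest_d| passed to `random.gauss` as the standard deviation; the chaotic-jump branch is omitted.

```python
import random

def bbpso_update(pbest, gbest):
    """Velocity-free update: each dimension is sampled from a Gaussian centered
    between pbest and gbest, or simply copied from pbest (a sketch)."""
    new = []
    for p, g in zip(pbest, gbest):
        if random.random() < 0.5:
            new.append(random.gauss((p + g) / 2.0, abs(p - g)))
        else:
            new.append(p)
    return new

def decode(x, threshold=0.5):
    """A dimension value above the threshold means the feature is selected."""
    return [1 if v > threshold else 0 for v in x]

x = bbpso_update([0.2, 0.9, 0.4], [0.7, 0.8, 0.1])
subset = decode(x)   # binary selection vector of the same length as x
```

Note that the position stays continuous during the search; only the decoding step maps it to a concrete feature subset.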

BI-DIRECTIONAL FEATURE FIXATION FRAMEWORK
In this section, the proposed BDFF for PSO to solve large-scale feature selection problems is introduced. First, the main idea of BDFF is given to illustrate how BDFF works with different search directions. After introducing the search direction initialization and the feature neighborhood representation, the details about feature fixation and the self-adaptive direction change strategy are discussed. Finally, the overall framework of BDFF is presented.

Main Idea of Design
A common initialization method for the position x of a particle in PSO-based feature selection algorithms is to randomly select some features, as shown in Eq. (6):

x_i,d = { 1,  if r < 0.5
        { 0,  otherwise    (6)

where x_i,d is the dth dimension of the ith particle and r is a random value uniformly sampled within [0, 1]. Since each feature has an equal probability of being selected or not selected, the particles in the initial swarm have a mathematical expectation of the number of selected features equal to one half of the total feature size. Focusing on the fact that the number of selected features in the global optimal solution can be greater or less than this expectation, the BDFF framework is proposed. The main idea of BDFF is to guide some particles to search for solutions with more features and the other particles to search for solutions with fewer features after initialization. As the particle swarm evolves and acquires new information, BDFF then changes the search directions of some particles adaptively to help the swarm approach the global optimal solution in terms of both the feature number and the fitness value.
Without loss of generality, BDFF also adopts Eq. (6) for particle initialization. Assume that there is a dataset with D features for selection and that the global optimal solution contains fewer than D/2 selected features. Herein, we give an example of the evolution process guided by BDFF, as shown in Fig. 1. After initialization, all particles in the swarm S select around D/2 features, and their fitness values are poor. Then, the particles are divided evenly into two subswarms with different initial search directions. One subswarm SS_u (composed of particles with upward arrows in Fig. 1) searches in the solution space with more than D/2 features. The other subswarm SS_l (composed of particles with downward arrows in Fig. 1) searches in the solution space with fewer than D/2 features. Each particle can judge whether it is in the correct direction, i.e., the direction that is more likely to guide it to the global optimal solution, according to the information from the whole swarm. For example, in Fig. 1, the particles in SS_u with upward arrows are probably in the wrong direction, because their fitness values are worse than those of the particles searching in the opposite direction (i.e., particles with downward arrows). Therefore, it is necessary to adjust the search direction of some particles in SS_u in time to avoid useless search, while the remaining particles in SS_u are still reserved for exploration. Eventually, most of the particles gather in one of the subswarms, with fitness values and feature numbers similar to those of the global optimal solution.
In a feature selection problem, an optimal solution should contain as few features as possible under the premise of optimal fitness (i.e., discriminative capability as in Eq. (1)). To approach the optimal solution, most existing EC-based feature selection algorithms consider reducing the number of selected features and optimizing fitness at the same time in their evaluation. However, reducing the number of selected features is not the core goal of feature selection, so these EC-based algorithms may be misled into focusing on finding a feature subset with fewer features while ignoring the importance of discriminative capability. In contrast, the proposed BDFF framework adopts a novel technique named feature fixation to approach the optimal solution. With feature fixation, BDFF only needs to consider optimizing discriminative capability in its evaluation, and the size of the selected feature subset is optimized automatically in the update process. For example, if classification accuracy is adopted as the fitness function in a classification problem, then the only goal of BDFF is to maximize the classification accuracy.

Initialization of Search Direction
There are two search directions for particles: the direction of searching for solutions with more features and the direction of searching for solutions with fewer features. Supposing that there are N particles in the swarm, the search direction of the ith particle p_i is initialized by Eq. (7):

dir(p_i) = { dir_l,  if i ≤ N/2
           { dir_u,  otherwise    (7)

where dir(p_i) returns the search direction of p_i, and dir_l and dir_u are search directions representing that p_i searches for solutions with fewer features and with more features, respectively. After initialization, particles with direction dir_l form the subswarm SS_l, and the other particles with direction dir_u form the subswarm SS_u.
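The initialization of positions (Eq. (6)) and the even split of directions can be sketched as follows; the exact split rule (first half dir_l, second half dir_u) is our assumption, consistent with the even division described above.

```python
import random

def init_swarm(N, D):
    """Each feature is selected with probability 0.5; the first half of the
    swarm starts with direction dir_l, the second half with dir_u (a sketch)."""
    swarm = []
    for i in range(N):
        x = [1 if random.random() < 0.5 else 0 for _ in range(D)]  # Eq. (6)
        direction = 'dir_l' if i < N // 2 else 'dir_u'             # Eq. (7)
        swarm.append({'x': x, 'dir': direction})
    return swarm

swarm = init_swarm(N=10, D=100)
print(sum(1 for p in swarm if p['dir'] == 'dir_l'))  # → 5
```

With this split, the expected number of selected features per particle is D/2, matching the expectation argument made in the main idea of BDFF.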

Representation of Feature Neighborhood
To ensure that the subsequent feature fixation can be carried out, we introduce the representation of the feature neighborhood here, which serves two purposes. First, a feature neighborhood provides common information among adjacent features, and considering the features in the same neighborhood as a whole makes the search more efficient. Second, the use of feature neighborhoods enables fine-grained control of feature fixation and avoids fixing too many features at a time.
Assuming that there are D features in the data and each feature neighborhood consists of T adjacent features, all features can be divided into R feature neighborhoods, where R is calculated by Eq. (8):

R = ⌈D / T⌉    (8)

where ⌈·⌉ represents the ceiling function. If the number of features in the last neighborhood is less than T, extra features with random selection states will be added to the neighborhood as stuff-bits. An example of neighborhood division with T = 3 is shown in Fig. 2, where the value 1 indicates that the feature is selected, and the value 0 indicates that it is not.
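The division into R = ⌈D/T⌉ neighborhoods with random stuff-bits can be sketched as:

```python
import math
import random

def divide_neighborhoods(x, T):
    """Split a binary position into neighborhoods of T adjacent features,
    padding the last neighborhood with random stuff-bits if needed."""
    D = len(x)
    R = math.ceil(D / T)                                  # Eq. (8)
    padded = x + [random.randint(0, 1) for _ in range(R * T - D)]
    return [padded[k * T:(k + 1) * T] for k in range(R)]

nbhds = divide_neighborhoods([1, 0, 1, 1, 0, 0, 1], T=3)
print(len(nbhds))  # → 3, since ceil(7 / 3) = 3
```

Here the last neighborhood holds the seventh feature plus two random stuff-bits, so every neighborhood can be treated uniformly as a block of T features.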
To strengthen the correlation between features in the same neighborhood, SU values between features and the label of the dataset are used to sort all features before the neighborhood division. After being sorted by SU, features in the same neighborhood have similar correlations with the label, so they are more integrated and can be regarded as a whole in the feature fixation stage. The SU value between a feature F and the class label C can be calculated by Eq. (9):

SU(F, C) = 2 × (H(F) - H(F|C)) / (H(F) + H(C))    (9)

where H(F) and H(C) are the entropies of F and C, and H(F|C) is the conditional entropy of F when C is given. Notice that SU is only used for feature sorting at the beginning of BDFF and does not play a dominant role in feature fixation. More importantly, BDFF still works well without the assistance of feature sorting by SU, which is verified in Section 4.6.
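For discrete observations, SU can be computed directly from empirical entropies; a minimal sketch (assuming the standard factor-2 normalization in Eq. (9)):

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (base 2) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(feature, label):
    """SU(F, C): twice the information gain normalized by H(F) + H(C)."""
    h_f, h_c = entropy(feature), entropy(label)
    n = len(label)
    # Conditional entropy H(F|C): entropy of F within each class, weighted.
    h_f_given_c = sum(
        (cnt / n) * entropy([f for f, l in zip(feature, label) if l == c])
        for c, cnt in Counter(label).items()
    )
    denom = h_f + h_c
    return 2 * (h_f - h_f_given_c) / denom if denom else 0.0

# A feature that determines the label perfectly has SU = 1.
print(symmetric_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # → 1.0
```

SU lies in [0, 1], so features can be sorted by it directly; for continuous features a discretization step would be needed first, as noted in Section 2.2.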

Algorithm 1. Particle Position Update with Feature Fixation
Input: The ith particle p_i to be updated, the total number of features D, the number of features T in a feature neighborhood, the historical optimal position pbest_i of p_i, the position x_i of p_i
Output: The updated particle p_i
BEGIN
1: FOR each dimension j from 1 to D DO
2:   k ← the index of the neighborhood containing feature j;
3:   IF dir(p_i) == dir_l THEN
4:     IF no feature in neighborhood k is selected in pbest_i THEN
5:       x_i,j ← pbest_i,j;
6:       CONTINUE;
7:     END IF
8:   ELSE
9:     IF all features in neighborhood k are selected in pbest_i THEN
10:      x_i,j ← pbest_i,j;
11:      CONTINUE;
12:    END IF
13:  END IF
14:  Use any chosen position update mechanism to update x_i,j;
15: END FOR
16: RETURN p_i;
END

Feature Fixation Guided By Search Directions
Feature fixation, which is the core part of the BDFF framework, aims to fix some features and keep their selection states unchanged while the particle is being updated. The feature fixation process of each particle is guided by its current search direction and uses the feature neighborhood as the basic unit. When the condition of feature fixation is met, all features in the same neighborhood will be fixed as a whole until the search direction changes.
Supposing that there is a particle p i in the subswarm SS l , its condition of feature fixation can be described as follows. If none of the features in a feature neighborhood are selected in the historical optimal position of p i (i.e., the pbest i ), all these features in the neighborhood will be fixed as a whole. Then in the newly generated position x i of p i , the fixed features will be kept unselected and have the same selection states as what they have in the pbest i . As more feature neighborhoods are fixed with the guidance of the search direction dir l , the number of selected features in the new x i will become less. Thus, p i can search for feature subsets with fewer features. On the contrary, if p i is in the other subswarm SS u , the features in the same neighborhood will be fixed and kept selected if all of these features are selected in the pbest i . Then more selected features will be contained in the new x i , and p i can search for a feature subset with more features.
The pseudo-code of feature fixation in the particle update procedure is given in Algorithm 1. If a feature neighborhood is fixed, its corresponding dimension values of x_i are copied from pbest_i. Otherwise, the dimension values of x_i are updated by the update mechanism of any chosen PSO-based algorithm from the first category mentioned in Section 2.2.
An example of the feature fixation procedure carried out by a particle is shown in Fig. 3. In the (k-1)th generation, the search direction of the particle is dir_l, i.e., searching for solutions with fewer selected features. Therefore, only feature neighborhoods with no features selected (i.e., with T zeros) in pbest^(k-1) are fixed. Then, in the kth generation, the particle turns to the opposite direction dir_u, i.e., searching for solutions with more selected features. Therefore, the previously fixed neighborhoods are released from fixation and can be updated, while the neighborhoods with all features selected (i.e., with T ones) are fixed and kept unchanged in the following generations until the search direction changes again.
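The fixation rule of Algorithm 1 can be sketched as follows; a fresh random bit stands in for the "any chosen position update mechanism" step, which in the paper would be the update of a concrete PSO variant.

```python
import random

def update_with_fixation(x, pbest, direction, T):
    """Fix whole neighborhoods whose pbest bits are all 0 (for dir_l) or
    all 1 (for dir_u); update the remaining dimensions freely (a sketch)."""
    new = list(x)
    D = len(x)
    for start in range(0, D, T):
        nb = pbest[start:start + T]
        fixed = (direction == 'dir_l' and sum(nb) == 0) or \
                (direction == 'dir_u' and sum(nb) == len(nb))
        for j in range(start, min(start + T, D)):
            # Fixed dimensions copy pbest; the rest use a stand-in random update.
            new[j] = pbest[j] if fixed else random.randint(0, 1)
    return new

pbest = [0, 0, 0, 1, 1, 1]
child = update_with_fixation([1, 0, 1, 0, 1, 0], pbest, 'dir_l', T=3)
print(child[:3])  # → [0, 0, 0]: the all-zero neighborhood stays unselected
```

Under dir_l, only the all-zero neighborhood is frozen, so repeated updates can only shrink the subset there; under dir_u, the all-one neighborhood would be frozen instead.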
Besides guiding particles to search for solutions with different numbers of features, feature fixation has three extra advantages. Firstly, no matter which search direction the particle is currently in, feature fixation can reduce the number of features that each particle needs to search, thus narrowing the search space. Secondly, the information of the fixed features is still retained in the swarm, so other particles can learn from the information when being updated. Thirdly, different features are fixed in different particles and they can be unfixed when the search direction of the particle is changed, which improves the diversity of the swarm.

Self-Adaptive Direction Change Strategy
Changing the search direction of each particle in time is the key to making full use of all particles in the swarm. It is expected that all particles search in the correct direction so that none of them do a meaningless search in the wrong direction. However, the swarm does not know which direction is correct at the beginning of the search. Therefore, the SADC strategy is proposed to help particles determine the correct direction according to the information gained during the evolution process. Two metrics, i.e., average improvement (AI) and average fitness (AF), are designed in the SADC strategy to balance the abilities of exploitation and exploration of the swarm.

Algorithm 2. SADC Strategy
Input: The swarm S whose search directions may be changed
Output: The swarm S after changing the search direction
BEGIN
1: Count the numbers of particles n_l and n_u in SS_l and SS_u;
2: IF n_l == 0 OR n_u == 0 THEN
3:   RETURN S;
4: END IF
5: Calculate AI_l and AI_u of SS_l and SS_u with Eq. (10);
6: IF AI_l < AI_u THEN
7:   Randomly select a particle p_r with dir(p_r) == dir_l;
8:   dir(p_r) ← dir_u;
9: END IF
10: IF AI_l > AI_u THEN
11:   Randomly select a particle p_r with dir(p_r) == dir_u;
12:   dir(p_r) ← dir_l;
13: END IF
14: IF AI_l == AI_u THEN
15:   Calculate AF_l and AF_u of SS_l and SS_u with Eq. (11);
16:   IF AF_l < AF_u THEN
17:     Randomly select a particle p_r with dir(p_r) == dir_l;
18:     dir(p_r) ← dir_u;
19:   END IF
20:   IF AF_l > AF_u THEN
21:     Randomly select a particle p_r with dir(p_r) == dir_u;
22:     dir(p_r) ← dir_l;
23:   END IF
24: END IF
25: RETURN S;
END

The AI of a subswarm is the average fitness improvement of the particles in the subswarm within several generations. In the kth generation, the value of AI from generation (k - W) to k is calculated by Eq. (10):

AI(SS) = (1 / |SS|) × Σ_(p_i ∈ SS) (f(pbest_i^k) - f(pbest_i^(k-W)))    (10)

where SS is the subswarm SS_l or SS_u, |SS| is the size of SS, p_i is a particle in SS, pbest_i^k is the historical optimal position of p_i in the kth generation, W is the generation window for the AI calculation, and f is the fitness function. A larger AI value means that the search direction is more promising and the particles searching in this direction are more likely to find a better solution.
The AF of a subswarm is the average fitness value of all pbests in the subswarm, which can be calculated by Eq. (11):

AF(SS) = (1 / |SS|) × Σ_(p_i ∈ SS) f(pbest_i^k)    (11)

The larger the value of AF, the better the particles in the subswarm perform, and the more likely the optimal solution is to be found in this search direction.
The procedure of the SADC strategy is described in Algorithm 2. First, the numbers of particles in the two opposite directions are counted. If there is no particle in one of the directions, all particles have the same direction and the correct direction has already been determined by the swarm, so no particle changes its search direction. Otherwise, a random particle in the direction with the smaller AI value changes its direction. Furthermore, if the AI values of the two directions are the same, a random particle in the direction with the smaller AF value changes its direction. Notice that if both the AI values and the AF values of the two directions are the same, all particles keep their directions unchanged. The SADC strategy can be divided into two stages: early exploration and late exploitation. In the early stage, the SADC strategy prefers the search direction corresponding to the subswarm with the greater AI value even if its AF value is worse, because the AF values of both SS_l and SS_u are still at a poor level and a subswarm with better AF may not indicate that its direction is correct. Instead, a more promising direction with better AI is worth greater search effort, which enhances the exploration ability of the swarm S. In the late stage, the AI values of both SS_l and SS_u are likely to be the same and equal to 0. Then the SADC strategy encourages S to search in the direction corresponding to the subswarm with better AF, which enhances the exploitation ability of S. With the two metrics to assess the search directions of particles, the SADC strategy makes a trade-off between exploration and exploitation in different stages.

Overall Framework
The overall framework is described in Algorithm 3. First, each particle is initialized and assigned a search direction. Then, the features are sorted by SU and divided into feature neighborhoods. In each generation, particles fix some feature neighborhoods according to their search directions and only update the features that are not fixed, thus reducing the combinations of features to search and narrowing the search space. Every W generations, BDFF adjusts the search directions of particles adaptively to balance the exploration and exploitation abilities of the swarm. Note that the BDFF framework specifies no particular position update mechanism. Therefore, most PSO-based algorithms that mainly focus on designing different evolutionary mechanisms and have no mechanism of their own to narrow the search space can be adopted in the BDFF framework. In this paper, we adopt the update mechanism of BBPSO-ACJ mentioned in Section 2.3.
The time complexity of BDFF is O(MAX_GEN × N × (D_AU + T_E)), where MAX_GEN is the maximum number of generations, N is the size of the swarm, D_AU is the average number of unfixed features over the whole swarm, and T_E is the time complexity of evaluating a particle. Usually, D_AU is less than the total number of features D in the BDFF framework. In the worst case, the time complexity of BDFF is O(MAX_GEN × N × (D + T_E)).

EXPERIMENTAL RESULTS AND ANALYSIS
In this section, experiments are carried out to evaluate the performance of BDFF on twelve public large-scale feature selection datasets, in comparison with other PSO-based algorithms.

Datasets
We used twelve public datasets for feature selection in the experiments and the detailed information is listed in Table 1, where symbol "#" means the number of corresponding items. All the used datasets are for classification problems and can be accessed from [26] and [31]. A common characteristic of the twelve datasets is that they have a small number of samples but a large number of features, which makes it difficult to solve the classification problems.

Algorithms for Comparison and Parameter Settings
All the used algorithms and their parameter settings are listed in Table 2. The proposed BDFF using the update mechanism of BBPSO-ACJ is named BBPSO-ACJ-BDFF. Six PSO-based algorithms applicable to feature selection were used for comparison. BPSO [37], BVDPSO [39], BBPSO-ACJ [22], and CSO [23] belong to the first category mentioned in Section 2.2. BPSO and BVDPSO are two typical PSOs for binary optimization problems and are treated as the baseline methods.

TABLE 1: Detailed Information of the Twelve Datasets

Dataset        | #Samples | #Features | #Classes | Data Type
Colon          | 62       | 2000      | 2        | discrete
WarpAR10P      | 130      | 2400      | 10       | continuous
GLIOMA         | 50       | 4434      | 4        | continuous
Leukemia_1     | 72       | 5327      | 3        | discrete
9_Tumor        | 60       | 5726      | 9        | continuous
TOX_171        | 171      | 5748      | 4        | continuous
Brain_Tumor_1  | 90       | 5920      | 5        | continuous
Nci9           | 60       | 9712      | 9        | discrete
Arcene         | 200      | 10000     | 2        | continuous
CLL_SUB_111    | 111      | 11340     | 3        | continuous
Lung_Cancer    | 203      | 12600     | 5        | continuous
SMK_CAN_187    | 187      | 19993     | 2        | continuous

CSO and BBPSO-ACJ are recent algorithms specifically proposed for large-scale feature selection. VLPSO-LS [25] and HFS-C-P [29], both representative and recent algorithms in the second category, use correlation information to estimate the importance of features and then focus on features with higher importance to narrow the search space. The parameter settings of the compared algorithms are the same as those in their corresponding papers. Since the swarm size N differs across algorithms, we limited the maximum number of fitness evaluations (MAX_FE) of all algorithms to 5000. For the classification problems, we chose the k-nearest neighbor (k-NN) method as the classifier because k-NN has stable classification performance on different datasets [50]; the parameter k was set to 5 in the experiments. First, 70% of the samples in each dataset were randomly selected as the training set and the remaining 30% were reserved as the test set.
Then the classification accuracy obtained by k-NN with 5-fold cross-validation on the training set was adopted as the evaluation function for all algorithms during training. After training, the best feature subset found by each algorithm was tested on the test set with the k-NN classifier to obtain its classification accuracy. Each algorithm was run 20 times on each dataset independently with different random seeds to reduce random statistical errors in the results. To verify the significance of differences between algorithms, we employed the Wilcoxon rank-sum test [51] on the experimental results with a significance level of 0.05. Three symbols are used to indicate the test results: "+" and "-" indicate that our proposed algorithm is significantly superior or inferior to the compared algorithm, respectively, while "=" indicates that there is no significant difference between the two.
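The evaluation function described above can be sketched as follows, using a plain NumPy k-NN in place of whichever k-NN implementation the authors used; the interleaved fold split is an illustrative simplification of 5-fold cross-validation, and all function names are ours.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k=5):
    """Plain k-NN (Euclidean distance, majority vote), the wrapper classifier."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    nn = np.argsort(d, axis=1)[:, :k]                         # k nearest neighbors
    votes = y_tr[nn]
    return np.array([np.bincount(v).argmax() for v in votes])

def fitness(mask, X, y, k=5, folds=5):
    """Evaluation function used during training: mean k-fold CV accuracy
    of 5-NN on the features selected by the 0/1 mask."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                          # empty subset: worst possible fitness
    Xs = X[:, cols]
    idx = np.arange(len(y))
    accs = []
    for f in range(folds):
        te = idx[f::folds]                  # simple interleaved fold split
        tr = np.setdiff1d(idx, te)
        pred = knn_predict(Xs[tr], y[tr], Xs[te], k)
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))
```

In a full run, `fitness` is what each particle's position (feature mask) is scored with on the training set; the test set is touched only once, after the search ends.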
All algorithms were implemented in C++ with an open-source library named Feature Selection Toolbox 3 [52]. In terms of the hardware environment, experiments were carried out on a platform with an Intel Core i7-10700F CPU @ 2.90 GHz and a total memory of 8 GB.

Comparison Results and Discussion
The average classification accuracy, number of selected features, and running time obtained by the seven algorithms over 20 independent runs on the 12 datasets are compared in Tables 3, 4, and 5, respectively, where the value in bold represents the best result among all algorithms.
Compared with BPSO, BVDPSO, CSO, and BBPSO-ACJ, our BBPSO-ACJ-BDFF performs better on most datasets, obtaining higher or similar classification accuracy with a much smaller subset of selected features. The classification accuracy of BBPSO-ACJ-BDFF is superior or similar to those of the four algorithms on all datasets except TOX_171 and Brain_Tumor_1. In terms of the number of selected features, BBPSO-ACJ-BDFF performs significantly better than BPSO, BVDPSO, and BBPSO-ACJ on most datasets. On 6 of the 12 datasets, BBPSO-ACJ-BDFF requires fewer features than CSO while achieving better or similar accuracy. In addition, the average running time of BBPSO-ACJ-BDFF is less than that of BPSO, BVDPSO, CSO, and BBPSO-ACJ on most datasets. Both the feature fixation strategy and the smaller feature subsets found by BDFF help BBPSO-ACJ-BDFF spend less time searching for the best solution.
Compared with VLPSO-LS, BBPSO-ACJ-BDFF obtains higher classification accuracy on 3 datasets and similar accuracy on 8 datasets. Over all datasets, BBPSO-ACJ-BDFF, with a rank sum of 38, outperforms VLPSO-LS, with a rank sum of 45, in classification accuracy. Though VLPSO-LS can find a smaller feature subset on some datasets, it always spends much more time searching for solutions than BBPSO-ACJ-BDFF, as shown in Table 5. This is because the local search strategy of VLPSO-LS requires the correlation between each pair of features, which is time-consuming, especially on datasets with many features and samples. In contrast, BBPSO-ACJ-BDFF only requires the correlation between each feature and the class label, so it spends less time than VLPSO-LS.
Compared with HFS-C-P, BBPSO-ACJ-BDFF has significantly better classification accuracy on 3 datasets and similar accuracy on 8 datasets. BBPSO-ACJ-BDFF also obtains a smaller feature subset than HFS-C-P on 6 datasets. Overall, BBPSO-ACJ-BDFF performs more consistently than HFS-C-P across datasets. For example, HFS-C-P achieves the highest accuracy on datasets WarpAR10P and 9_Tumor but the lowest on datasets Leukemia_1 and Arcene. A possible explanation for the poor performance of HFS-C-P on some datasets is that it relies heavily on correlation to filter irrelevant features and cluster relevant ones, which greatly affects the final result. When the correlation measures cannot accurately indicate the importance of features, particles may be misled by such information and end up with poor solutions. In contrast, the proposed BDFF framework reduces the search space through the feature fixation mechanism rather than correlation measures, so it adapts better to different datasets.
First, we introduce an indicator named feature fixation rate (FFR) for further analysis, which can be calculated by:

FFR = D_F / D_H,

where D_F is the total number of fixed dimensions that have been ignored in the update phase so far and D_H is the total number of dimensions that have been handled (whether ignored or updated) in the update phase so far. The higher the FFR value, the more the search space is reduced. Therefore, the value of D_AU mentioned in Section 3.6 equals (1 − FFR) × D.
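The indicator and its relation to D_AU can be written directly in code (the function names are ours):

```python
def ffr(d_fixed, d_handled):
    """Feature fixation rate: the share of handled dimensions that were
    skipped (fixed) in the update phase so far."""
    return d_fixed / d_handled

def d_au(d_fixed, d_handled, D):
    """Average number of unfixed features, D_AU = (1 - FFR) * D."""
    return (1.0 - ffr(d_fixed, d_handled)) * D
```

For example, a swarm that has so far skipped 40 of every 100 handled dimensions on a 2000-feature dataset has FFR = 0.4 and effectively updates only 1200 dimensions on average.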
As BDFF is a general framework and can be applied to PSO variants other than BBPSO-ACJ, we also adopted the update mechanisms of BPSO, BVDPSO, and CSO in BDFF with the same parameters as BBPSO-ACJ-BDFF, so that the analysis is more general and convincing. For the four PSO-based algorithms with BDFF, i.e., BPSO-BDFF, BVDPSO-BDFF, CSO-BDFF, and BBPSO-ACJ-BDFF, we recorded the total FFR values of all particles during the search and drew the change curves for further analysis, as shown in Fig. 4.
In the beginning, the four PSOs with BDFF have similar FFR values of about (1/2)^3 because the parameter T of each algorithm is set to 3. If a particle is guided by the same search direction and keeps updating its historical optimal position, it can usually fix more features, so its FFR value increases. Among all algorithms, CSO-BDFF obtains the highest FFR value on 10 datasets; its FFR value is always over 50% and exceeds 70% on datasets Colon and Arcene. BBPSO-ACJ-BDFF ranks second among all algorithms and also achieves an FFR value of over 40% on all datasets except GLIOMA; on datasets WarpAR10P, Nci9, and CLL_SUB_111, its FFR value exceeds 60%. Though BPSO-BDFF and BVDPSO-BDFF often have lower FFR values than the other two algorithms, they still ignore over 20% of the dimensions in their search, narrowing their search space. Note that the FFR values of BPSO-BDFF and BVDPSO-BDFF do not always increase
with time. If the search direction of a particle changes frequently, the previously fixed dimensions become flexible again and can be updated, so the FFR value will probably decrease.

From the Perspective of Feature Subset Size
We recorded the average number of selected features of the global best particle in the swarm throughout the search and drew the change curves of the four PSOs and the four PSOs with BDFF to show how this number varies as the evaluation count increases, as shown in Fig. 5. At the beginning, the best particle of every algorithm contains about half of the total features because all algorithms adopt the same initialization method in Eq. (6). However, the change curves of the PSOs with BDFF differ from those of the original PSOs and finally reach a smaller number of selected features. The results in Tables 3 and 4 show that solutions with the highest accuracy use fewer than half of the features on all datasets. BPSO and BVDPSO improve their accuracy as the number of evaluations increases, but cannot effectively reduce the number of selected features. From this perspective, BPSO and BVDPSO probably persist in searching in the wrong direction, so the accuracy they finally achieve is usually inferior to that of the other algorithms. After being combined with the BDFF framework, BPSO-BDFF and BVDPSO-BDFF can search in the correct direction and finally achieve higher classification accuracy with fewer features. CSO and BBPSO-ACJ can reduce the number of features of the best particle while improving accuracy on most datasets as evaluations increase. However, they cannot exploit the region of the search space with fewer features and usually end with solutions containing more features than those of CSO-BDFF and BBPSO-ACJ-BDFF. Therefore, the BDFF framework helps particles explore and exploit more efficiently, in terms of both fitness and the number of selected features, and guides them toward the global optimal solution.

Influence of Correlation Information
To investigate the effect of the correlation information used in BDFF, we removed the SU-based feature sorting step and recorded the comparison results between BBPSO-ACJ-BDFF and this variant in Table 6. BBPSO-ACJ-BDFF-w/o-SU performs similarly to BBPSO-ACJ-BDFF in both classification accuracy and feature subset size, with no significant difference between the two algorithms on most datasets. Even without the help of correlation information, BBPSO-ACJ-BDFF-w/o-SU finds solutions with high accuracy and a small number of features. However, BBPSO-ACJ-BDFF performs slightly better, finding solutions with higher accuracy and fewer features on some datasets after using the SU measure for feature sorting. An important reason why BDFF can work well without correlation information is that it only ranks features by that information and discards none of them. Though the features in the same neighborhood are no longer strongly correlated without feature sorting, they are retained and still have the opportunity to be selected. Moreover, the small size of the feature neighborhood reduces the correlation requirement for features in the same neighborhood because the feature fixation operation is fine-grained. However, some redundant features will be mixed into the same neighborhood when feature sorting is missing, which makes it slightly more difficult to select a better feature subset with fewer features.
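The SU measure used for feature sorting follows the standard definition SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)); a sketch for discrete variables (continuous features would first need discretization, which is not shown here):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]:
    1 means X fully determines Y (and vice versa), 0 means independence."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))     # joint entropy H(X, Y)
    mi = hx + hy - hxy                 # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0
```

Ranking features by `symmetric_uncertainty(feature, labels)` in descending order yields the SU-sorted order that BDFF groups into neighborhoods.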

Influence of the Size of Feature Neighborhood
The parameter T specifies that each feature neighborhood contains T features. To investigate the influence of T, variants of BBPSO-ACJ-BDFF with different values of T are compared and the results are presented in Table 7. Accuracy is best among all variants when T is 3. As T increases, the number of selected features also increases in most cases. A small T makes it easy to fix features and prune most of the search space, so a particle is likely to find a small feature subset but may miss the optimal solution. Conversely, a large T makes it hard to fix features and can retain the optimal solution in the search space, but it also increases the difficulty of finding that solution because the search space remains too large. Therefore, the recommended value of T is 3.
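The granularity trade-off above can be illustrated with simple arithmetic (a back-of-the-envelope sketch, not part of the original analysis; function names are ours):

```python
import math

def neighborhoods(D, T):
    """Number of feature neighborhoods when each holds T features."""
    return math.ceil(D / T)

def search_space(unfixed_dims):
    """Number of candidate feature subsets left when only `unfixed_dims`
    of the dimensions are still being updated (the rest are fixed)."""
    return 2 ** unfixed_dims
```

For instance, on the Colon dataset (D = 2000), T = 3 gives 667 neighborhoods, and fixing half of the dimensions shrinks the candidate-subset count from 2^2000 to 2^1000; a larger T coarsens the neighborhoods, making each one harder to fix and leaving more of that space to search.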

CONCLUSION
This paper proposes the BDFF framework for PSO to solve large-scale feature selection problems. In the BDFF framework, each particle owns a search direction and searches for solutions with different numbers of features. According to its search direction, each particle can fix some features and ignore them in the update stage via the feature fixation mechanism, thus narrowing its search space. BDFF also adopts the feature neighborhood representation, dividing all features into small neighborhoods; it then uses the feature neighborhood as the basic unit for fixing features, refining the granularity of the operation. Moreover, BDFF employs the SADC strategy to change the search directions of particles adaptively, trading off exploration and exploitation across different search stages. Experimental results on 12 public feature selection datasets show that the proposed BDFF framework can help particles approach the optimal solution in terms of both fitness value and the number of features. Compared with correlation-based algorithms, BDFF reduces the search space effectively without over-relying on correlation information and performs consistently on most classification problems. Besides, BDFF can be treated as a general framework and combined with PSO-based feature selection algorithms to further improve their performance. For future work, some challenges remain for PSO-based feature selection algorithms, such as reducing evaluation time and dealing with complicated real-world problems. Therefore, promising techniques designed for expensive optimization, such as data-driven methods [53], [54], scale-adaptive fitness evaluation methods [55], [56], and parallel/distributed computing methods [57], [58], [59], [60], can be combined with BDFF to further improve its performance.