Classification Based on Brain Storm Optimization With Feature Selection

Classification is one of the most classic problems in machine learning. Due to their global optimization ability, evolutionary computation (EC) techniques have been successfully applied to many problems, and the evolutionary classification model is one of the methods used to solve classification problems. Recently, some evolutionary algorithms (EAs), such as the fireworks algorithm (FWA) and the brain storm optimization (BSO) algorithm, have been employed to implement the evolutionary classification model and have achieved the desired results. This means that it is feasible to use EC techniques to solve the classification problem directly. However, the existing evolutionary classification model still has some disadvantages. The limited datasets used in previous experiments make the results not convincing enough, and, more importantly, the structure of the evolutionary classification model is closely tied to the dimension of the dataset, which may lead to poor classification performance, especially on large-scale datasets. Therefore, this paper aims at improving the structure of the evolutionary classification model in order to improve its classification performance. Feature selection is an effective method for dealing with large datasets. Firstly, we introduce the concept of feature selection and use different feature subsets to construct the structure of the evolutionary classification model. Then, the BSO algorithm is employed to implement the evolutionary classification model and to search for the optimal structure by searching for the optimal feature subset. Moreover, the optimal weight parameters corresponding to the different structures are also searched by the BSO algorithm while searching for the optimal feature subset. To verify the classification effectiveness of the proposed method, 11 different datasets are selected for experiments. The results show that it is feasible to optimize the structure of the evolutionary classification model by introducing feature selection.
Moreover, the new method has better classification performance than the original method, especially on large-scale or high-dimensional datasets.


I. INTRODUCTION
Classification problems have been widely studied in machine learning [1]. Many different classification methods have been proposed and widely used in practical applications, for instance, k-nearest neighbor (KNN) [2], naive Bayesian classification (NBC) [3], support vector machine (SVM) [4], decision tree (DT) [5], artificial neural network (ANN) [6], etc. However, many of these methods are structurally deterministic, which makes them prone to falling into locally optimal solutions.
Evolutionary computation (EC) techniques, such as ant colony optimization (ACO) [7], particle swarm optimization (PSO) [8], the genetic algorithm (GA) [9], brain storm optimization (BSO) [10], differential evolution (DE) [11], the fireworks algorithm (FWA) [12], artificial bee colony (ABC) [13], extremal optimization (EO) [14], etc., are well known for their global optimization capabilities and have received continuous attention and improvement from researchers. For example, El-Abd [15] introduced a new population initialization scheme and a global-best version guided by parameter thresholds, combined with fitness-based grouping and per-variable updates, to improve the performance of the BSO algorithm; the proposed GBSO method achieves good results on different problem scales and an extensive set of classical functions. Duan and Li [16] proposed a quantum-behaved BSO (QBSO) algorithm that successfully solved an optimization model of Loney's solenoid problem. In addition, Liang et al. [17] combined the advantages of ACO and BSO and proposed a new feature selection algorithm, which achieved good results in accuracy, precision rate, recall rate, etc. Thus, considering the advantages of EC techniques in solving various problems, they may also be a better way to solve the classification problem.
In recent years, many researchers have introduced EC techniques to solve the classification problem. Si and Dutta [18] proposed a method based on the PSO algorithm and a feed-forward neural network (FFNN) for classification tasks in medical machine learning. Nugraha et al. [19] proposed a PSO-SVM algorithm for journal rank classification, and their research shows that PSO can improve the classification performance of SVM. Besides, Rajathi and Radhamani [20] combined the KNN algorithm with ACO to identify the presence of heart disease, and their method achieved good performance in accuracy and error rate. However, all of the above methods use EC techniques to optimize a classifier and do not attempt to solve the classification problem directly with EC techniques, so the global optimization ability of EC techniques may not be fully exploited. Xue et al. proposed an evolutionary classification model that solves the classification problem with the BSO algorithm [21] and the FWA [22]. In the proposed model, linear equations are constructed based on the training set, and the objective function is constructed based on these equations. However, neither [21] nor [22] considers the relationship between the complexity of the model structure and the size of the dataset. Besides, not enough datasets were selected for the experiments in [21], [22], especially large-scale datasets. Therefore, optimizing the structure of the evolutionary classification model is a key point for improving its classification performance.
Feature selection is a classical data preprocessing technique, which is often used to handle high-dimensional or large-scale datasets. Selecting a good feature subset can not only improve the classification accuracy but also simplify the classifier structure. Therefore, the concept of feature selection has been introduced into more and more classification problems in recent years. Due to their good global optimization ability, EC techniques often achieve good results in finding the optimal feature subset. For instance, Xue et al. [23] proposed a new initialization strategy and a new personal-best and global-best update mechanism based on the traditional PSO algorithm to solve the feature selection problem. Their experiments consider two aspects: 1) maximizing the classification performance; and 2) considering the number of features and the classification performance at the same time. Compared with the algorithm that only considers classification accuracy, the algorithm that considers both the number of features and the classification accuracy shows better classification performance in the experiments. In addition, Tajik et al. [24] used a GA for feature selection to improve the efficiency of computer-aided diagnosis; their research also shows that combining the number of features with the classification accuracy performs well in classification. Inspired by this, we introduce the feature selection technique into the evolutionary classification model to optimize the structure of the model and improve the classification performance.
In this paper, feature selection is introduced to improve the structure of the evolutionary classification model, which is implemented by the BSO algorithm. Firstly, since the structure of the model depends on the dimension of the dataset, the BSO algorithm is used to initialize many different feature subsets to construct different structures of the model. In addition, the BSO algorithm is also used to initialize the weight parameters corresponding to the different structures. Then, taking the error value as the objective function, the BSO algorithm searches for the optimal structure and its weight parameters. Finally, the solution size (the number of selected features) and the classification accuracy are used as the final evaluation criteria for the performance of the model. Different from previous evaluation criteria, we not only compare the classification accuracy but also evaluate the solution size. Therefore, the new evolutionary optimization classification model can significantly reduce the computational complexity. In addition, to verify the effectiveness of the proposed method, 11 different datasets from the University of California Irvine (UCI) Machine Learning Repository [25] are used in the experiments.
The rest of this article is arranged as follows. Section II summarizes the BSO algorithm and the evolutionary classification model. Section III introduces our method and describes the evolutionary classification model with feature selection based on the BSO algorithm in detail. Section IV gives the information of the datasets used in the experiments and the parameter values used in the algorithm. Then, the experimental results are given and analyzed in Section V. Finally, Section VI concludes the paper and points out future research directions.

A. EVOLUTIONARY CLASSIFICATION MODEL
In [21], [22], a basic evolutionary classification model was proposed as follows. Given a training set

T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}    (1)

where x_i = (x_i1, x_i2, ..., x_id) represents the i-th instance and y_i is the class label of the i-th instance, d is the dimension of the training set and n is the number of instances. First of all, the classification problem can be regarded as an optimization problem of solving linear equations by introducing a weight vector W = {w_1, w_2, ..., w_d}, as in Eq. (2):

w_1 x_11 + w_2 x_12 + ... + w_d x_1d = y_1
w_1 x_21 + w_2 x_22 + ... + w_d x_2d = y_2
...
w_1 x_n1 + w_2 x_n2 + ... + w_d x_nd = y_n    (2)

where W is the solution set that satisfies the equation set.
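As a minimal NumPy sketch of this linear-system formulation (the data and names such as true_W are purely illustrative, not from the paper's datasets), the coefficient matrix X and label vector Y can be assembled and a candidate weight vector checked against them:

```python
import numpy as np

# Toy training set: n instances with d features each, and numeric labels y_i.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))          # coefficient matrix: row i is instance x_i
true_W = np.array([1.0, -2.0, 0.5])  # hypothetical weight vector (illustrative)
Y = X @ true_W                        # constant column vector of labels

# Eq. (2) in matrix form: X . W^T = Y. Here the toy system is consistent,
# so a least-squares fit recovers an exact solution W.
W = np.linalg.lstsq(X, Y, rcond=None)[0]
residual = np.abs(X @ W - Y).sum()
```

On real data with n much larger than d the system is generally inconsistent, so only an approximate W minimizing the residual exists, which is exactly the optimization problem handed to the EA.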
Let the coefficient matrix of the linear equations be X and the constant column vector be Y, defined as:

X = ( x_11 x_12 ... x_1d
      x_21 x_22 ... x_2d
      ...
      x_n1 x_n2 ... x_nd ),   Y = (y_1, y_2, ..., y_n)^T    (3)

Therefore, Eq. (2) can also be written as X · W^T = Y [26]. The rank of matrix X is R(X) and the rank of the augmented matrix (X, Y) is R(X, Y). In the majority of cases, the number of instances n is far larger than d, so it is highly probable that these equations are inconsistent and no solution satisfies all of them. Fortunately, to solve the classification problem, we only need to find an approximate solution of:

X · W^T ≈ Y    (4)

With this set of approximate solutions, we can calculate the class label of instance x_i: the instance is assigned the class label y satisfying

|x_i · W^T − y| ≤ δ    (5)

where δ is a threshold, which means that the error of the calculated label is allowed to be within this range. In addition, the objective function used in the model is defined as the accumulated error over the training set:

f(W) = Σ_{i=1}^{n} |x_i · W^T − y_i|    (6)

B. BRAIN STORM OPTIMIZATION ALGORITHM
In recent years, many EAs have emerged, and the BSO algorithm, proposed by Professor Shi [27] in 2011, is one of the popular ones. The BSO algorithm uses the clustering idea to search for local optimal values and obtains the global optimal value by comparing all the local optimal values, which makes it suitable for solving multi-peak, high-dimensional function problems; therefore, the BSO algorithm is applied to the evolutionary classification model. In BSO, the individuals represent potential solutions of the problem, and the solutions converge to several clusters. New individuals are generated through the evolution and fusion of individuals in the clusters. The new individuals are compared with the original individuals, and the individuals with better fitness values are stored and updated iteratively. The process of BSO is as follows:
Step 1 Initialization: N initial individuals (initial solutions) are generated randomly, and their fitness values are calculated.
Step 2 Population regeneration: While a good enough solution has not been found and the preset maximum number of iterations has not been reached, repeat the following steps: 1) Solution clustering: the N individuals are clustered into m clusters by the k-means algorithm; 2) Generating new solutions: new individuals are generated from randomly selected individuals in the clusters; 3) New solution selection: the fitness values of the new solution and the original solution are compared, and the solution with the better fitness value is kept for the next iteration.
Step 3 Optimal individual: the global optimal individual is obtained by comparing the central individuals of the m clusters. Figure 1 depicts the execution flow chart of the BSO algorithm.

C. FEATURE SELECTION
The number of instances and features determines the size of a dataset. In the era of big data, dealing with large-scale or high-dimensional datasets has become an inevitable part of solving classification problems. However, such datasets often make the classifier more complex, resulting in performance degradation. In practice, many features of a dataset are redundant or irrelevant. They not only increase the training time and computational complexity, but also harm the classification performance, a phenomenon known as the ''curse of dimensionality''. Feature selection is an effective way to solve such problems, as it can curb the exponential growth of training time and computational complexity caused by the increase of dimension.
In a d-dimensional dataset, there are 2^d − 1 possible non-empty feature subsets. This means that datasets with smaller dimensions have fewer feature subsets, so the enumeration method can be used to find the optimal feature subset. However, if d is large, it is impossible to enumerate all feature subsets, and identifying the effective features of the original dataset becomes a difficult problem. Therefore, it is necessary to find a method that obtains an effective feature subset efficiently.
The basic feature selection process must determine the four aspects shown in Figure 2: 1) Generating process: generate a candidate feature subset; 2) Evaluation criterion: evaluate the quality of the candidate subset; 3) Stopping criterion: decide when to stop; 4) Result verification: verify that the expected results are achieved.
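The four aspects above can be sketched as a generic search loop; the callables here (generate, evaluate, stop, validate) are placeholders for any concrete strategy, and the toy scoring function below is purely illustrative:

```python
import random

def feature_selection(all_features, generate, evaluate, stop, validate):
    """Generic sketch of the four-aspect process in Figure 2."""
    best_subset, best_score = None, float("-inf")
    while not stop():                        # 3) stopping criterion
        subset = generate(all_features)      # 1) generating process
        score = evaluate(subset)             # 2) evaluation criterion
        if score > best_score:
            best_subset, best_score = subset, score
    validate(best_subset)                    # 4) result verification
    return best_subset

# Toy usage: prefer subsets that contain feature 2 and are small.
rng = random.Random(0)
budget = iter(range(50))                     # stop after 50 candidates
stop = lambda: next(budget, None) is None
generate = lambda fs: tuple(sorted(rng.sample(fs, rng.randint(1, len(fs)))))
evaluate = lambda sub: (2 in sub) * 10 - len(sub)
validate = lambda sub: None
best = feature_selection([0, 1, 2, 3], generate, evaluate, stop, validate)
```

In FS-CBSO, the generating process is taken over by the BSO evolution operators and the evaluation criterion by the error-based fitness, rather than by random sampling as in this sketch.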

III. PROPOSED METHOD
In the evolutionary classification model with feature selection based on the BSO algorithm, the structure size of the evolutionary optimization classification model depends on the dimension of the training set, and optimizing the model structure requires excluding the irrelevant and redundant features of the training set. We use the evolutionary mechanism of the BSO algorithm to generate a new structure and its corresponding weight parameters, which mainly involves two aspects, feature subset optimization and weight parameter optimization, and three steps: population initialization, generation of new individuals, and selection of individuals.
Firstly, the feature subset population and weight parameter population are initialized respectively.
The purpose of feature selection is to find the most effective features among all features. Each feature has two possibilities: selected or unselected. To represent a feature subset conveniently, we define 1 to mean that a feature is selected and 0 to mean that it is not. In this paper, N groups of d random numbers in the range [0, 1] are generated by the BSO algorithm, corresponding to the d features. If a random number is ≥ 0.6, its corresponding feature is selected; otherwise (< 0.6), the feature is unselected. Thus, a feature subset can be represented as a binary code. For example, in Figure 3, five random numbers between 0 and 1 correspond to five features. Among them, 0.33, 0.42, and 0.17 are all less than the threshold value 0.6, which means that the first, third, and fourth features are not selected, while 0.65 and 0.94 are greater than 0.6, which means that the second and fifth features are selected. For the initialization of the weight parameter population, we use Eq. (7) to estimate the upper and lower boundaries of the solution in order to reduce the search space, where N represents the number of individuals in the population, d represents the dimension of the training set, and σ is a control parameter used to adjust the boundary range. Controlling the search space within a certain range is effective, but how to determine the search space well is also a direction of our future research.
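The threshold decoding described above can be shown in a few lines; the array below reproduces the five random keys of the Figure 3 example:

```python
import numpy as np

# Decode one BSO individual into a binary feature mask using the threshold
# of 0.6: a random key >= 0.6 selects the feature, < 0.6 leaves it out.
individual = np.array([0.33, 0.65, 0.42, 0.17, 0.94])  # the Figure 3 example
mask = individual >= 0.6          # binary coding: [0, 1, 0, 0, 1]
selected = np.flatnonzero(mask)   # indices of the selected features
```

So the second and fifth features (indices 1 and 4) are selected, matching the worked example in the text.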
Then, the feature subset population is combined with the weight parameter population, and the fitness value of each individual is calculated. The BSO algorithm takes the minimum error value (min f(W)) as the optimization objective, and the fitness is calculated by Eq. (8),
where n is the number of samples in the training set. Secondly, the N individuals are clustered into m clusters by the k-means algorithm, and the individual with the best fitness value in each cluster is selected as the central individual. Finally, the optimal individual is obtained by comparing the central individuals of the clusters. If the current optimal individual is not satisfactory, new individuals are generated from the existing population.
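As a hedged sketch of this fitness evaluation, assuming the error of Eq. (8) is the absolute error of the masked linear system averaged over the n training samples (the exact form in the paper may differ), one could write:

```python
import numpy as np

def fitness(mask, W, X, Y):
    """Error-based fitness of an individual: apply the feature mask,
    then measure how well the weights W solve the reduced system."""
    Xs = X[:, mask]                # keep only the selected features
    err = np.abs(Xs @ W - Y)       # per-sample error of x_i . W^T vs. y_i
    return err.mean()              # averaged over the n samples

# Tiny worked example: features 1 and 3 with weights (1, 1) reproduce Y
# exactly, so the fitness (error) is zero for that mask and weight vector.
X = np.array([[1.0, 5.0, 2.0], [0.0, -3.0, 1.0], [2.0, 7.0, 0.0]])
Y = np.array([3.0, 1.0, 2.0])
mask = np.array([True, False, True])
W = np.array([1.0, 1.0])
```

Since lower error means better fitness, minimizing this quantity drives the search toward feature subsets and weights that jointly satisfy the linear system.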
Through the clustering operation, the search area can be refined. However, after a large number of iterations, all solutions converge to a small search area with high probability. In order to avoid being trapped in a local optimum or premature convergence, a cluster-disruption operation is introduced into the BSO algorithm. The disruption operation is controlled by a probability value P: a random number in the range (0, 1) is generated, and if it is greater than P, a randomly generated individual (including feature subset and weight parameters) replaces a randomly chosen central individual. Otherwise, new individuals are generated directly.
The algorithm randomly selects clusters to generate new individuals, and a new individual can be generated from one or more existing individuals. In this paper, if the probability value P_cluster is greater than 0.8, new individuals are generated from one existing individual; otherwise, they are generated from a combination of several existing individuals. If new individuals are generated within one cluster, a single search area is refined and the algorithm focuses on its exploitation capability. Accordingly, if new individuals are generated from several clusters, they may lie far away from the original cluster centers, in which case the algorithm focuses more on its exploration capability.
The probability values P_one and P_two control whether a central individual or a non-central random individual is selected, in one or two clusters respectively, to generate new individuals. When one cluster is selected, a random value is generated: if it is greater than P_one, the central individual of the cluster is chosen to generate the new individual; otherwise, a non-central random individual of the cluster is chosen. The generation of new individuals is described by Eq. (9) and Eq. (10):

x_new^i = x_selected^i + ξ(t) × N(μ, σ²)    (9)
ξ(t) = logsig((0.5 × max_iteration − t) / k) × rand()    (10)

where x_new^i represents the i-th new individual with feature subset and weight parameters; x_selected^i represents the i-th individual to be updated with feature subset and weight parameters; N(μ, σ²) is a random number drawn from a normal distribution; rand() is a function that generates random numbers in the range (0, 1); max_iteration represents the maximum number of iterations, while t represents the current iteration; and k is a coefficient controlling the logsig() function, used to change the search step size ξ(t) and thus balance the convergence speed of the algorithm. The transfer function logsig() is defined in Eq. (11):

logsig(a) = 1 / (1 + exp(−a))    (11)

When two clusters are selected, the new individual is generated from two individuals. A random value is generated and compared with the probability value P_two: if the random value is less than P_two, the central individuals of the two clusters are randomly selected to generate the new individual; otherwise, common (non-central) individuals of the two clusters are randomly selected. A new individual is generated from two existing individuals x_selected1 and x_selected2, where x_selected1 represents the selected individual with feature subset and weight parameters in the first cluster and x_selected2 represents that in the second cluster.
In this case, the combined individual x_selected can be written as:

x_selected = tem × x_selected1 + (1 − tem) × x_selected2

where tem is a random number in the range (0, 1) generated by rand().
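These update rules can be sketched directly from Eqs. (9)-(11); the value k = 20 is an assumed setting for illustration, not necessarily the paper's:

```python
import numpy as np

def logsig(a):
    # Transfer function of Eq. (11): logsig(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def step_size(t, max_iteration, k=20.0, rng=None):
    # Eq. (10): xi(t) = logsig((0.5 * max_iteration - t) / k) * rand()
    if rng is None:
        rng = np.random.default_rng()
    return logsig((0.5 * max_iteration - t) / k) * rng.random()

def new_from_one(x_selected, t, max_iteration, rng):
    # Eq. (9): perturb the selected individual with Gaussian noise N(mu, sigma^2)
    noise = rng.normal(size=x_selected.shape)
    return x_selected + step_size(t, max_iteration, rng=rng) * noise

def combine_two(x_selected1, x_selected2, rng):
    # Two-cluster case: x_selected = tem * x1 + (1 - tem) * x2, tem in (0, 1)
    tem = rng.random()
    return tem * x_selected1 + (1.0 - tem) * x_selected2
```

Note how ξ(t) shrinks as t approaches max_iteration: logsig's argument goes from positive to negative, so early iterations take large exploratory steps and later iterations take small exploitative ones.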
Finally, new individual selection keeps the individuals with better fitness in the population. After new individuals are generated, if the stopping criteria are not met, the individuals with better fitness values are carried to the next iteration by the selection strategy. The strategy of clustering, generating, and selecting new individuals introduces new individuals into the population and ensures its diversity. The evolutionary classification model with feature selection based on the BSO algorithm (FS-CBSO) is described in detail in Algorithm 1.
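To tie the pieces together, here is a heavily condensed sketch of an FS-CBSO-style loop. It is a simplified stand-in for Algorithm 1, not the paper's exact procedure: clustering is reduced to picking the m best individuals as "centers", the error objective is assumed to be a mean absolute error, and all parameter values are illustrative.

```python
import numpy as np

def fs_cbso(X, Y, n_ind=30, m=3, iters=200, theta=0.6, p_disrupt=0.2, seed=0):
    """Simplified FS-CBSO sketch: each individual carries d random keys in
    [0, 1] (decoded by threshold theta into a feature mask) plus d weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    keys = rng.random((n_ind, d))              # feature-subset population
    W = rng.uniform(-1.0, 1.0, (n_ind, d))     # weight-parameter population

    def fitness(i):
        mask = keys[i] >= theta
        if not mask.any():
            return np.inf                      # empty subset is invalid
        return np.abs(X[:, mask] @ W[i, mask] - Y).mean()

    fit = np.array([fitness(i) for i in range(n_ind)])
    for t in range(iters):
        centers = np.argsort(fit)[:m]          # simplified "cluster centers"
        if rng.random() < p_disrupt:           # cluster-disruption operation
            j = rng.choice(centers)
            keys[j], W[j] = rng.random(d), rng.uniform(-1.0, 1.0, d)
            fit[j] = fitness(j)
        for i in range(n_ind):                 # generate new, keep the better
            base = rng.choice(centers)
            step = np.exp(-3.0 * t / iters)    # shrinking step size (sketch)
            nk = np.clip(keys[base] + step * rng.normal(size=d), 0.0, 1.0)
            nw = W[base] + step * rng.normal(size=d)
            old_k, old_w, old_f = keys[i].copy(), W[i].copy(), fit[i]
            keys[i], W[i] = nk, nw
            nf = fitness(i)
            if nf < old_f:
                fit[i] = nf                    # selection: keep the new one
            else:
                keys[i], W[i], fit[i] = old_k, old_w, old_f
    best = int(np.argmin(fit))
    # Returned weights are full-length; apply the mask to read off the model.
    return keys[best] >= theta, W[best], fit[best]

# Toy usage: only features 1 and 3 actually determine the labels.
rng = np.random.default_rng(42)
Xd = rng.normal(size=(50, 4))
Yd = 2.0 * Xd[:, 0] - Xd[:, 2]
mask, w, err = fs_cbso(Xd, Yd)
```

Even in this stripped-down form, the loop exhibits the two searches described above, over feature subsets (the keys) and over weight parameters, driven by a single error-based fitness.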

IV. EXPERIMENTAL DESIGN
A. DATASETS
Eleven datasets (DS1-DS11) used in the experiment are selected from the UCI Machine Learning Repository. In addition, in order to verify the performance of the algorithm on large-scale datasets, DS4, DS7, and DS8 are chosen, where the data dimension of DS7 and DS8 is more than 500. Table 1 details the information of all datasets, where ''DSn'' is the n-th dataset, ''NoE'' is the number of instances, ''NoF'' is the dimension of the dataset, and ''NoC'' is the number of classes. All datasets selected in this paper have two classes.

Algorithm 1 The framework of the FS-CBSO algorithm
Input: Set parameter values including the population size (N), the number of clusters (m), the maximum number of iterations (max_iteration), and the probability values (P, P_cluster, P_one, P_two); Output: the optimal individual with its feature subset, corresponding weight parameters, and classification accuracy.

B. PARAMETER SETTINGS
The parameter settings of the two algorithms are listed in Table 2 [10], [28].
Among them, the threshold value θ for determining whether a feature is selected is set to 0.6, which was validated in [28]. In the experiment, the maximum number of evaluations (max_iteration) of the two algorithms is 100,000. The BSO algorithm is used to find the W that minimizes the objective function on each dataset. Then, W is used to compute the classification label of each instance. Finally, the classification accuracy and the solution size on each dataset are calculated, and the experiment on each dataset is repeated 30 times to obtain more general results.

V. RESULTS AND ANALYSIS
In the experiment, the performance of the FS-CBSO method and the evolutionary classification model based on the BSO (CBSO) method is compared. Table 3 and Table 4 give the experimental results of the two algorithms: Table 3 shows the classification accuracy of the two algorithms on the 11 testing sets, and Table 4 shows their solution sizes. ''Mean'' is the average value of the classification accuracy or solution size obtained by each algorithm, and ''Std'' represents the standard deviation. After the significance test, ''+'' indicates that the FS-CBSO results are better than the CBSO results, ''-'' indicates that they are worse, and ''='' indicates no significant difference between the two algorithms. In addition, the best average for each dataset is shown in bold.

A. RESULT ANALYSIS OF CLASSIFICATION ACCURACY
As can be seen from Table 3, FS-CBSO achieves the highest classification accuracy on most datasets. Specifically, the classification accuracy of FS-CBSO is significantly higher than that of CBSO on eight datasets (DS2, DS3, DS4, DS6, DS7, DS9, DS10, and DS11). In addition, the classification accuracy of FS-CBSO is similar to that of CBSO on DS5 and DS8, while FS-CBSO is inferior to CBSO only on DS1. Besides, on the large-scale dataset DS4 and the high-dimensional dataset DS9, FS-CBSO is superior to CBSO in classification performance.

B. RESULT ANALYSIS OF SOLUTION SIZE
Table 4 shows that FS-CBSO can reduce the number of features by 63% to 79% on eight datasets. For DS4, DS5, and DS7, more than 85% of the features can be removed. Compared with CBSO, FS-CBSO can effectively reduce the number of features and achieves the expected effect of optimizing the structure of the classification model. Figure 4 intuitively shows the robustness and classification accuracy of the two algorithms on the 11 datasets. The stability of the FS-CBSO method is better than that of the CBSO method on 6 datasets and worse on 5 datasets. However, the box plots show that the classification accuracy of the FS-CBSO method is better than that of the CBSO method on eight datasets and worse on only one dataset. Generally speaking, the FS-CBSO method is superior to CBSO in classification accuracy and stability. To sum up, we can conclude that FS-CBSO has a smaller solution size than CBSO while maintaining classification accuracy. Introducing feature selection into the evolutionary optimization classification model can solve the problem of the model's complex structure, especially on large-scale and high-dimensional datasets.

VI. CONCLUSION
In this paper, based on the original evolutionary classification model, feature selection is introduced to optimize the structure of the model. The BSO algorithm then regards the optimization problem as a search problem and searches for the optimal value in the solution space: the optimal structure and its corresponding weight parameters. In addition, in order to verify the effectiveness of the proposed method, 11 different datasets were selected to test the classification performance. The experimental results show that FS-CBSO achieves better classification accuracy and a smaller solution size than CBSO on most of the 11 datasets, especially on large-scale or high-dimensional datasets, which strongly supports the feasibility of the proposed method.
However, the existing work has the limitation of insufficient comparative experiments. For example, the problems solved are limited to binary classification, and the classification performance has not yet reached the ideal level. Therefore, in future work, we will compare the FS-CBSO algorithm with other algorithms. In addition, there are some limitations in using the minimum error value as the evaluation criterion, and we will consider other, more suitable objective functions. Besides, how to set the boundary of the solution space is also a subject worthy of study.