Binary Drone Squadron Optimization Approaches for Feature Selection

Large volumes of data are now being generated and stored in massive datasets that contain many irrelevant attributes unrelated to the target concept. Feature selection identifies the most pertinent features, which also helps to increase classification accuracy. Feature selection is viewed as a multiobjective optimization problem with two goals: improving the classification accuracy and reducing the number of features used. Drone Squadron Optimization (DSO) is one of the most recent artifact-inspired optimization algorithms. It has two key components: semi-autonomous drones that hover over a terrain and a command center that manages the drones. In this paper, two binary variants of DSO are proposed to deal with the feature selection problem. The proposed binary algorithms are applied to 21 benchmark datasets and compared with five state-of-the-art algorithms, i.e., Grey Wolf Optimizer (GWO), Particle Swarm Optimization (PSO), Flower Pollination Algorithm (FPA), Genetic Algorithm (GA) and Ant Lion Optimization (ALO). Different assessment indicators are used to assess the diversification and intensification of the optimization algorithms. Compared with current state-of-the-art wrapper-based algorithms, the proposed binary techniques search the feature space more efficiently and pick the most useful features for classification tasks, resulting in the lowest classification error rate.


I. INTRODUCTION
A dataset may include significant, insignificant, or superfluous features, and a large number of features critically affects the performance of classification [1]. Choosing the important attributes of the data is a difficult problem. Feature selection is a technique that aims to eliminate superfluous variables from a dataset to better understand the data. The goal of feature selection approaches is to improve classifier performance and achieve a classification error rate that is almost comparable to, if not identical to, that of using the entire feature set [2]. Wrappers and filters are the two basic categories of feature selection methods [3]. The filter-based algorithms use statistical methods [4] to evaluate a feature subset, whereas wrapper-based algorithms use learning algorithms to search through the space of potential solutions to find the best feature subset [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.
Filter-based methods are generally faster than wrapper-based methods since they calculate the distance between features, information gain and feature dependency, which is computationally less expensive than estimating the precision of a classifier [3]. Nevertheless, wrappers have been shown to be useful in locating the best feature subsets for a given classifier, hence they are commonly investigated for classification error rate [6]. When using a wrapper feature selection technique, three factors should be decided: the classifier, the feature subset evaluation criteria, and a search algorithm to find the best subset of features [4]. Finding a near-optimal subset of the original set is a difficult task. Moreover, the size of the search space increases exponentially as the number of dimensions in a dataset grows. In practice, exhaustive search techniques are unable to reach the desired optimal solution and still suffer from stagnation in local optima [7]. In recent decades, metaheuristic algorithms inspired by nature have grown in prominence because of their ability to deal with complex real-world situations in a powerful and effective manner [8]. These algorithms can use the population's relevant knowledge to identify the best solutions. These resilient and efficient methods are used to solve a wide range of optimization problems, including financial time series prediction [9], economic dispatch optimization [10], neural networks [11], [12], wireless sensor networks [13] and engineering design problems [14], [15]. Many researchers have proposed metaheuristic-based wrapper techniques for the feature selection problem, in which metaheuristics are used to generate subsets of features [16].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In practice, for a dataset with n attributes, the number of possible subsets is $2^n$, which makes it impossible for exhaustive search methods to evaluate all possible combinations in a reasonable time [17]-[19]. Therefore, metaheuristic-based wrapper techniques have shown better performance than exhaustive search methods [1].
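As a small illustration of this growth, the following sketch counts the candidate subsets for a few feature counts. The function name `subset_count` and the chosen values of n are illustrative assumptions, not taken from the paper.

```python
def subset_count(n_features: int) -> int:
    """Number of candidate feature subsets (empty subset included) for n features."""
    return 2 ** n_features


# The count doubles with every additional feature.
for n in (10, 20, 30, 60):
    print(f"n = {n:>2}: {subset_count(n):,} candidate subsets")
```

Since every subset would require training and testing a classifier, an exhaustive wrapper search quickly becomes impractical as n grows.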
The feature subset generation stage of the metaheuristic algorithm first produces random solutions. The fitness of each random solution is then assessed by a fitness function in a subset assessment stage, for which the dataset is partitioned into training and testing sets. The classification algorithm is trained on the training set and evaluated on the testing set, and the fitness is calculated to obtain the best solution in each iteration. The best solution is in turn used in subsequent iterations to update the solutions. The training and testing sets are divided using a fivefold cross-validation procedure. This basic feature selection process is identical for all metaheuristic algorithms. Figure 1 shows the feature selection process. The remaining sections of the paper are organized as follows: Section II presents the related work, while the DSO algorithm is described in Section III. Section IV discusses how DSO is used for feature selection, and Section V reports the findings and comments. Section VI concludes with findings and recommendations for further work.
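The wrapper process described above (generate a candidate subset, evaluate it with a classifier under fivefold cross-validation, keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions: a plain random search stands in for the metaheuristic update, the k-NN classifier is a simple pure-Python version with k = 5, and the toy data and all function names are hypothetical rather than taken from the paper.

```python
import random


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def project(X, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, bit in zip(row, mask) if bit] for row in X]


def cv_error(X, y, mask, folds=5, k=5):
    """Average k-NN error rate of the masked features (assumes len(X) >= folds)."""
    if not any(mask):
        return 1.0  # assumption: an empty subset is treated as the worst case
    Xs = project(X, mask)
    idx = list(range(len(Xs)))
    errors = []
    for f in range(folds):
        test_idx = idx[f::folds]          # every folds-th sample forms the test fold
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        wrong = sum(knn_predict(train_X, train_y, Xs[i], k) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / folds


def random_wrapper_search(X, y, iterations=20, seed=0):
    """Random search over binary masks; a stand-in for the metaheuristic update step."""
    rng = random.Random(seed)
    n = len(X[0])
    best_mask, best_err = None, float("inf")
    for _ in range(iterations):
        mask = [rng.randint(0, 1) for _ in range(n)]
        err = cv_error(X, y, mask)
        if err < best_err:
            best_mask, best_err = mask, err
    return best_mask, best_err


# Toy data: feature 0 separates the two classes, feature 1 is pure noise.
data_rng = random.Random(1)
X = [[float(i % 2) + data_rng.uniform(-0.1, 0.1), data_rng.uniform(0, 1)] for i in range(40)]
y = [i % 2 for i in range(40)]
best_mask, best_err = random_wrapper_search(X, y)
print(best_mask, best_err)
```

In an actual metaheuristic wrapper, the random mask generation inside the loop would be replaced by the algorithm's own position update driven by the best solution found so far.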

II. RELATED WORK
Emary et al. [20] used the Grey Wolf Optimization (GWO) strategy to tackle the feature selection problem. To determine the location of the grey wolf, the approach employed a stochastic crossover. Recently, Ant Lion Optimizer (ALO) [21] and Flower Pollination Algorithm (FPA) based techniques were employed to tackle the feature selection challenge [22]. In the realm of medical diagnostics, an enhanced GWO method with ELM has recently been presented to discover the best feature subset [23]. In the realm of text classification, a genetic algorithm combined with chaos theory for dimension reduction has been developed [24]. To determine the optimal feature subset, Xue et al. [25] presented novel starting strategies in Particle Swarm Optimization. Gu et al. [26] addressed the feature selection problem by employing a competitive swarm optimizer, a variation of PSO. Recently, a hybrid approach based on ABC and the differential evolution technique has been developed for picking an optimal collection of features for the classification problem [27].
Simulated annealing has recently been combined with a whale optimization approach to improve the exploitation of simulated annealing [28]. To increase the efficiency of the native crow search for feature selection, Sayed et al. [29] added chaotic maps into the crow search algorithm. As a method of feature selection for classification, a binary ant lion optimizer has also been presented in [21]. In ant lion optimization, Zawbaa et al. [30] used chaotic maps to minimize the attributes in high-dimensional data sets. In wrapper mode, another solution based on the whale optimization algorithm was developed to handle the feature selection problem [18]. A strategy for finding the best feature set was successfully implemented using a grasshopper optimization algorithm [31] and evolutionary population dynamics [32] in [33].
Hans and Kaur [34] proposed a binary multi-verse optimization method for selecting the optimal feature set by incorporating various transfer functions for position updating. Hu et al. [35] proposed an improved grey wolf optimizer so that the problem of discretization can easily be solved. Zhang et al. [36] integrated the salp swarm algorithm (SSA) into the Harris Hawk Optimizer so that a balance between exploration and exploitation can be maintained. Neggaz et al. [37] proposed the use of the sine cosine algorithm boosted by the salp swarm algorithm. The method improves exploration and avoids local minima. Emine and Ülker [38] proposed the use of the social spider algorithm with a transfer function that maps from a continuous to a binary search space. Agrawal et al. [39] proposed the use of quantum concepts and the whale optimization algorithm to increase exploration and exploitation. The individual solutions are represented in the form of bits, and modified mutation and crossover operators have been introduced.
Despite the many metaheuristic methodologies suggested for feature selection problems, many difficulties remain unresolved, such as reduced accuracy on datasets with a large number of attributes, high processing time, and solution subsets with inappropriate attributes. To address these issues, this study explores the use of the DSO algorithm for feature selection, as it is claimed to be robust, adaptive, and flexible and to have powerful exploration and exploitation capabilities. In this context, this study proposes two variants of the binary DSO algorithm for selecting a minimal number of features while achieving a classification error rate comparable to, or lower than, that of traditional feature selection methods.

III. PRELIMINARIES
Drone Squadron Optimization is a non-nature-inspired evolutionary metaheuristic that imitates the behaviour of drones that fly over a geographical area to explore it and are controlled by a command center [40]. Figure 2 shows the important terms used in DSO and Figure 3 shows the characteristics of DSO.

A. DRONE SQUADRON OPTIMIZATION
1) INSPIRATION
DSO is inspired by the movement of drones, which are entities that move to explore a geographical search area, to locate something, or to complete a particular task. There are different teams of drones, which are controlled by the command center. A drone tries to move closer to the destination location. The command center performs two tasks: it partially manages the search operations of the drones, and it controls the drones by producing new firmware that contains the modules used to explore the search space. The drones are divided into teams, and each team is controlled by its own firmware, which governs the movement of its drones. Perturbation functions are used to create the trial solution locations for the movement of the drones. The perturbation scheme is used to generate the new firmware using Equations (1) and (2), where departure is a coordinate, offset() represents the perturbation movement, and TC denotes the trial coordinates.

2) MOVEMENT OF DRONES
The drones move to the target positions calculated by the perturbation step and return the information to the command center. To calculate the target locations, DSO may use any of the methods used in various optimization techniques. For example, consider two perturbations, one for team1 and one for team2, in which $G(0, 1)$ denotes values generated from a Gaussian distribution, $C_1$ is a user-defined constant, and $U(0, 1)^{D}$ represents an array generated from a uniform distribution ranging from 0 to 1. Here, $\overrightarrow{GBC}$ represents the global best coordinates and $\overrightarrow{CBC}_{drone}$ represents the current best coordinates.
Each drone performs recombination with the best solution coordinates after the perturbation step, where all recombination methods have an equal probability of being selected. The drones are permitted to move only within a certain region of the search space. If the calculated position of a drone falls outside this region, it is considered a violation that needs to be corrected. A correction equation is applied to such violations, in which N represents the number of drones in each team, D represents the dimension, UB is the array of upper bounds, and LB is the array of lower bounds. This calculation takes care of the violation cases where $TmC_{team,drone,j} > UB_j$ or $TmC_{team,drone,j} < LB_j$. The symbol $TmC_{team,drone}$ denotes the target position of each drone of the team. The quality of the teams is measured using the number of solutions generated out of bounds and the distance from the objective function. The corrections of the teams are performed at every iteration. The firmware is updated when a criterion set at the beginning is reached; in that process, the firmware of the team with the worst results is replaced by the firmware of the best performing team. The command center and drones are depicted in Figure 4. Figure 5 shows the pseudocode of DSO.

IV. THE PROPOSED BINARY DRONE SQUADRON OPTIMIZATION (BDSO)
A. BINARY DRONE SQUADRON OPTIMIZATION - METHOD 1 (S-BDSO)
Because the new drone locations are continuous values, they must be translated into corresponding binary values. A sigmoidal (S-shaped) transfer function is used to squash the continuous solution in each dimension [41], which forces the drones to move in a discrete space. The S-shaped function, as seen in Eq. (6) and Figure 6, is a typical transfer function:
$$S\left(P_i^k(t)\right) = \frac{1}{1 + e^{-P_i^k(t)}} \tag{6}$$
where $P_i^k(t)$ is the continuous-valued perturbation of the ith drone in the kth dimension at iteration t.
Eq. (7) is used to convert the output of the S-shaped function into a binary value, where $x_i^k(t)$ and $P_i^k(t)$ indicate the position and perturbation of the ith drone at iteration t in the kth dimension.

B. BINARY DRONE SQUADRON OPTIMIZATION - METHOD 2 (V-BDSO)
This approach uses a V-shaped transfer function rather than an S-shaped transfer function, and Eqs. (8) and (10) are used to achieve this. The transfer functions that force drones to move in a discrete search space are shown in Figure 7.
Eq. (9) can be rewritten in a form in which $x_i^k(t)$ and $P_i^k(t)$ specify the location and perturbation of the ith drone in the kth dimension at iteration t, and $\left(x_i^k(t)\right)^{-1}$ represents the complement of $x_i^k(t)$. A large $P_i^k(t)$ indicates a high likelihood of changing position: the search agent is far from the best solution, so it should change its position vector to reach promising regions of the search space. Smaller values of $P_i^k(t)$, on the other hand, imply that a search agent should stay where it is and search nearby [44].
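The two binarization schemes can be sketched as follows. This is only an illustration under stated assumptions: the S-shaped rule sets a bit to 1 with probability S(P), which is one common convention and may differ from the paper's Eq. (7); the V-shaped function is taken to be |tanh(P)|, one frequently used choice that may differ from the paper's Eq. (8); and all function names are hypothetical.

```python
import math
import random


def s_shaped(p: float) -> float:
    """Sigmoid transfer function, written in a numerically stable form."""
    if p >= 0:
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)


def v_shaped(p: float) -> float:
    """Assumed V-shaped transfer function |tanh(p)|; symmetric around p = 0."""
    return abs(math.tanh(p))


def s_update(p: float, rng: random.Random) -> int:
    """S-shaped rule (assumed convention): the bit becomes 1 with probability S(p)."""
    return 1 if rng.random() < s_shaped(p) else 0


def v_update(x: int, p: float, rng: random.Random) -> int:
    """V-shaped rule: flip the current bit with probability V(p), otherwise keep it."""
    return 1 - x if rng.random() < v_shaped(p) else x


rng = random.Random(42)
position = [1, 0, 1, 0]
perturbation = [0.05, -3.0, 2.5, 0.0]
print([s_update(p, rng) for p in perturbation])
print([v_update(x, p, rng) for x, p in zip(position, perturbation)])
```

The main design difference is that the S-shaped rule ignores the current bit, whereas the V-shaped rule keeps it unless the perturbation magnitude is large, which matches the description of small perturbations leaving a search agent in place.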

C. FITNESS FUNCTION
As mentioned earlier, the DSO algorithm is employed to search through the large space of feature subsets to find the best subset, which should have a minimal number of features and maximum accuracy. To evaluate a subset of features, the following fitness function is used:
$$\mathrm{Fitness} = \alpha\,\gamma_R(D) + \beta\,\frac{|R|}{|N|}$$
where $\gamma_R(D)$ is the KNN classifier's classification error rate. In addition, $|R|$ denotes the cardinality of the chosen feature subset, whereas $|N|$ is the total number of features in the original dataset. The parameters $\alpha \in [0, 1]$ and $\beta = (1 - \alpha)$ are taken from [21], [28] and relate to the relevance of classification quality and subset length, respectively.
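A minimal sketch of such a weighted fitness, combining the classification error rate with the fraction of selected features, is given below. The function name and the default value alpha = 0.99 are assumptions made for illustration; the value actually used in the experiments is the one listed in the paper's parameter table.

```python
def fitness(error_rate: float, n_selected: int, n_total: int, alpha: float = 0.99) -> float:
    """Weighted sum of error rate and selected-feature ratio; lower is better.

    alpha weights classification quality and beta = 1 - alpha weights subset length.
    The default alpha is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)


# Two subsets with the same error rate: the smaller one gets the better (lower) fitness.
print(fitness(0.10, 5, 20), fitness(0.10, 15, 20))
```

With alpha close to 1, the error rate dominates and the subset length mainly acts as a tie-breaker between subsets of similar accuracy.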

A. DATA DESCRIPTION
The experimental analysis has been performed on the twenty-one datasets shown in Table 1, which are retrieved from the UCI repository [42]. The datasets were chosen to cover a wide range of numbers of features and tuples on which the proposed technique must be evaluated [20], [21]. In particular, the chosen datasets offer a large search space, allowing a proper test of an optimization method. Every dataset is partitioned in the same way as in cross-validation procedures [43]. As in K-fold cross-validation, training is done using K-1 folds, while validation is done on the remaining fold. For each dataset, the evaluation is repeated K × M times. The training portion of the dataset is used to train the classifier, and its performance is then evaluated using the validation portion. Finally, the testing data are used to evaluate the selected features. During training, each drone is relocated to choose a feature subset. PSO, FPA, GA, ALO, and GWO were used for comparison with the proposed feature selection methods.
Before employing these methods for feature selection, certain parameters must be preset. Table 2 shows the parameter values used in this investigation. These parameter values were chosen based on the values found in the literature [20], [21].

B. EVALUATION CRITERIA
In each run of the individual optimization method, the following metrics are applied to the data:

1) CLASSIFICATION ACCURACY
This indicator is reported as a classification error rate: it averages, over N runs, the fraction of tuples that are erroneously categorized, so a lower value corresponds to a higher classification accuracy.
$$\frac{1}{N}\sum_{j=1}^{N}\frac{1}{M}\sum_{i=1}^{M} \mathrm{match}(C_i, L_i)$$
The number of optimization algorithm runs is N, and the number of tuples in the testing data set is M. $C_i$ is the classifier's output label for tuple i, and $L_i$ is the reference class label of tuple i. When the two labels are the same, match returns 0; when they are different, match returns 1.
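The averaging described above can be sketched as follows, with match returning 0 for equal labels and 1 for different ones. The data layout (one pair of predicted and reference label lists per run) and the function names are assumptions made for illustration.

```python
def match(predicted, reference) -> int:
    """Return 0 when the two labels agree and 1 when they differ."""
    return 0 if predicted == reference else 1


def mean_error_rate(runs) -> float:
    """Average, over all runs, of the fraction of misclassified test tuples.

    runs: list of (predicted_labels, reference_labels) pairs, one pair per run.
    """
    per_run = [
        sum(match(c, l) for c, l in zip(predicted, reference)) / len(reference)
        for predicted, reference in runs
    ]
    return sum(per_run) / len(per_run)


example_runs = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # one of four tuples misclassified
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # no errors
]
print(mean_error_rate(example_runs))
```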

2) STATISTICAL MEAN
This value represents the average of the fitness values obtained when all runs of the optimization method are exhausted, as shown in the equation below.
$$\mathrm{Mean} = \frac{1}{N}\sum_{i=1}^{N} g_i^{*}$$
Here, $g_i^{*}$ is the best value obtained during the ith run.

3) STATISTICAL BEST
This value represents the smallest of all best fitness values obtained across the N runs. The formula can be stated as:
$$\mathrm{Best} = \min_{1 \le i \le N} g_i^{*}$$

4) STATISTICAL WORST
It is the maximum (worst) of all best fitness values obtained across the N runs, as shown in the following equation:
$$\mathrm{Worst} = \max_{1 \le i \le N} g_i^{*}$$

5) STATISTICAL STANDARD DEVIATION
The standard deviation indicates the variation between the best answers found throughout the model's major iterations. It describes the stability and robustness of an optimization technique and may be expressed as follows.
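The four statistical indicators (mean, best, worst, and standard deviation of the best fitness per run) can be computed with a short sketch such as the one below. The use of the sample standard deviation (division by N - 1) is an assumption, and the function name and example values are illustrative only.

```python
import statistics


def summarize(best_per_run):
    """Summary statistics over the best fitness value found in each run (lower is better)."""
    return {
        "mean": statistics.fmean(best_per_run),
        "best": min(best_per_run),    # smallest fitness over all runs
        "worst": max(best_per_run),   # largest fitness over all runs
        "std": statistics.stdev(best_per_run),  # sample standard deviation (assumed)
    }


print(summarize([0.12, 0.10, 0.15, 0.11]))
```

A small standard deviation indicates that the optimizer reaches similar solutions across runs, which is how the text interprets stability and robustness.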

6) AVERAGE SELECTION SIZE
It is calculated by dividing the total number of features by the average number of picked features after each run. The formula is given as follows: Here, size() denotes the number of features that were chosen in the testing data set.

7) PROCESSING TIME
It is the running time of the optimization algorithm averaged over all runs. The processing time can be determined using:
$$T_o = \frac{1}{M}\sum_{i=1}^{M} \mathrm{RunTime}_{o,i}$$
where M is the number of runs of optimization method o, and $\mathrm{RunTime}_{o,i}$ is the actual computation time of optimization algorithm o at run number i.

8) NON-PARAMETRIC TESTING
This study uses Wilcoxon's signed rank test, which seeks to discover significant differences between two paired samples. The test produces a p-value, which is used to check whether the difference between two algorithms is significant.

Table 3 compares the proposed binary DSO approaches with the native DSO in terms of classification error rate. It is evident from Table 3 that the V-BDSO method outperforms the native one. Over all datasets used in this investigation, the native DSO does not outperform either binary DSO, and S-BDSO also outperforms the native DSO across all datasets. As shown in Table 4, the proposed V-BDSO surpassed the native DSO in terms of average selection size on all datasets, and the performance of S-BDSO is likewise competitive with V-BDSO. V-BDSO performed statistically better than S-BDSO on almost all datasets; the exception is the penglungEW dataset, where S-BDSO outperformed V-BDSO with an average selection size of 163.73 against 167.60 for V-BDSO. According to Table 5, DSO demonstrated superior performance on 7 datasets on the mean fitness measure, while V-BDSO outperformed DSO and S-BDSO on 14 datasets. As may be seen from the statistical best fitness results in Table 6, DSO has the best results on 13 datasets, whereas V-BDSO performed better on 8 datasets. The results on the statistical worst fitness measure are shown in Table 7, where V-BDSO performed significantly better than S-BDSO and DSO on 13 datasets, whereas S-BDSO outperformed the other algorithms on only 6 datasets. Table 8 illustrates the statistical standard deviation fitness measure. As can be seen from these results, V-BDSO outperformed S-BDSO and DSO on 16 datasets, whereas S-BDSO performed better on only 2 datasets. Table 9 shows the average computing time of the DSO techniques, which is the time taken to arrive at a near-optimal solution. The results in this table demonstrate that the computational speed of S-BDSO is competitive with that of DSO, whereas V-BDSO takes longer to compute than S-BDSO.

2) COMPARATIVE ANALYSIS OF RESULTS WITH STANDARD METHODOLOGIES
From the last section, it can be observed that V-BDSO provided the lowest classification error rate as well as the lowest average selection size in comparison to DSO and S-BDSO. In this section, the performance of the best strategy, V-BDSO, is compared to the performance of several state-of-the-art methods that are frequently used to tackle the feature selection problem in the literature [18], [21], [28]. Table 10 shows the classification error rates of V-BDSO, ALO, FPA, GA, GWO, and PSO. As may be observed in this table, V-BDSO outperforms ALO, FPA, PSO, GWO, and GA in terms of classification error rate on all datasets. This higher performance demonstrates the ability of the proposed approach to efficiently discover the optima of the search space. Table 11 shows the average selection size using V-BDSO and the other approaches. In comparison to the other techniques used in this study, V-BDSO performs substantially better by picking a smaller number of features. According to the findings in this table, V-BDSO performed better on all datasets except penglungEW and spectEW, where GA showed better performance by selecting fewer attributes. The strength of V-BDSO resides in its increased exploration and exploitation capability, which enables it to remove redundant features before intensely searching the high-performance regions of the feature space. Tables 12-15 show the statistical measurements acquired from several runs of the algorithms on all datasets; these results are normalized to the range 0 to 1. It can be observed from Tables 12 and 14 that V-BDSO outperformed ALO, FPA, PSO, GWO, and GA on the mean and worst fitness measures on all datasets, whereas on the best fitness criterion, V-BDSO outperformed the other algorithms on most (eighteen out of twenty-one) of the datasets, as shown in Table 13. Table 15 shows an overview of the statistical standard deviation measure obtained for all datasets. As seen in this table, the performance of V-BDSO is competitive in comparison to the state-of-the-art techniques (ALO, FPA, GA, GWO, and PSO). The results of V-BDSO and the other algorithms in terms of computing time are shown in Table 16. On thirteen datasets, FPA has the fastest computational time, whereas GA and GWO each have the fastest computational time on four datasets. Compared to the other techniques, V-BDSO needs much more processing time because of the V-shaped transfer function, which allows V-BDSO to map continuous values to binary values and requires considerable computational time. Table 17 shows the p-values at the 5% significance level; the results are significant at p ≤ 0.05 in the vast majority of cases. Figure 8 shows the average number of features selected from all datasets. It can be clearly observed that V-BDSO selects the lowest number of features. Figure 9 depicts the ratio of accuracy to selected features on the individual datasets; the results indicate that the proposed V-BDSO algorithm achieved a higher ratio on all datasets.

VI. CONCLUSION
In this research, binary variants of Drone Squadron Optimization (DSO) are proposed and used to select features in wrapper mode. The continuous version of DSO is transformed into its binary variants using V-shaped and S-shaped transfer functions. The new binary techniques were tested against well-known nature-inspired algorithms such as Ant Lion Optimization (ALO), Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Flower Pollination Algorithm (FPA), and Grey Wolf Optimizer (GWO) on 21 benchmark datasets from the UCI repository. To examine different aspects of performance, the evaluation uses a set of assessment criteria. The presented findings suggest that the proposed binary DSO technique can explore the feature space effectively and converge to the best solution faster than the other algorithms. Moreover, the results show that the binary algorithm based on the V-shaped function achieves an average classification accuracy of 91.5% and an average selection size of 26.4, whereas the binary algorithm based on the S-shaped function achieves an average classification accuracy of 88.5% and an average selection size of 32.4. Hence, the V-shaped function performs better than the S-shaped one.
It might be interesting to combine the DSO method with another population-based metaheuristic algorithm in future investigations. Furthermore, examining the performance of the DSO approach when used on considerably higher-dimensional datasets will be a great contribution.