Late Acceptance Hill Climbing Based Social Ski Driver Algorithm for Feature Selection

Feature selection (FS) is mainly used as a pre-processing tool to reduce dimensionality by eliminating irrelevant or redundant features before the data are passed to a machine learning or data mining algorithm. In this paper, we introduce a binary variant of a recently proposed meta-heuristic algorithm called Social Ski Driver (SSD) optimization. To the best of our knowledge, SSD has not yet been used in the domain of FS. Two binary variants of SSD are proposed using S-shaped and V-shaped transfer functions. In addition, the exploitation ability of SSD is improved by using a local search method called Late Acceptance Hill Climbing (LAHC). The hybrid meta-heuristic is then converted to a binary version using the said transfer functions. The proposed methods are applied to 18 standard UCI datasets and compared with 15 state-of-the-art FS methods. To check the robustness of the proposed methods, we have also applied them to 3 high dimensional microarray datasets and compared them with 6 state-of-the-art methods. The results achieved confirm the superiority of the proposed methods over the other meta-heuristic wrapper-based FS methods considered here. Source code of this work is available at https://github.com/consigliere19/SSD-LAHC.


I. INTRODUCTION
With the recent advances in technology, a huge amount of data has become available in different domains such as image processing, pattern recognition, and disease diagnosis systems [1]. As a consequence, data dimensionality has a huge impact on the performance of various machine learning and data mining tasks, both in terms of time and storage needs of the computing devices. In this context, it can be noted that there may be some redundancy in the data itself. For example, not all the features developed to represent a pattern or an image are important for its classification or analysis. This is where a feature selection (FS) technique comes in. FS is a data pre-processing step which attempts to eliminate all the irrelevant and redundant features [2] from the underlying dataset or feature vector, and thereby reduces the required processing time and storage space. Due to the elimination of the non-informative features, an FS technique also helps in enhancing the performance of the corresponding machine learning tasks [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.
Based on the usage of a learning algorithm, FS techniques can broadly be divided into two categories [4]: filter and wrapper. Filter methods do not use any learning technique during elimination (selection) of the irrelevant (important) features; rather, they use different scoring criteria [5]-[8] to rank the features in order of importance. Wrapper methods use learning techniques (such as classifiers) as a part of the selection and evaluate the subset of the selected features [9]-[14]. Filter methods are faster but wrapper methods, in general, perform much better [4]. Three factors must be considered while using a wrapper-based FS model: the choice of classifier, the evaluation criteria for feature subsets (such as accuracy), and a search (or optimization) technique to find the best subset of features [15].
FS is considered to be an NP-complete combinatorial optimization problem. Generating all possible subsets and evaluating them is not suitable for large datasets. This is because for a dataset containing n features, $2^n$ feature subsets will be generated, and evaluating all of them requires a great computational cost. The search for the best performing feature subset can also be carried out randomly. A heuristic search strategy performs a guided search which does not always find the best solution but tries to obtain a near-optimal solution in computationally reasonable time. Heuristic approaches are classified into two types: specific heuristics, which are designed for a particular problem, and general-purpose meta-heuristics, which are designed to solve a wide range of problems [16].
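The growth of the search space can be illustrated with a few lines of Python. This is only a sketch: the helper names `all_feature_subsets` and `count_subsets` are hypothetical, and the enumeration includes the empty subset, which a practical FS method would normally exclude.

```python
from itertools import product


def all_feature_subsets(n_features):
    """Yield every binary mask over n_features (1 = feature selected)."""
    return product((0, 1), repeat=n_features)


def count_subsets(n_features):
    """Number of candidate subsets an exhaustive search would have to evaluate."""
    return 2 ** n_features


if __name__ == "__main__":
    # Enumeration is feasible only for very small n.
    for mask in all_feature_subsets(3):
        print(mask)
    # The count doubles with every additional feature.
    for n in (10, 20, 30, 60):
        print(n, count_subsets(n))
```

Since each candidate mask requires training and testing a classifier in a wrapper setting, the doubling per feature is what makes exhaustive evaluation impractical.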
In this paper, we have made an attempt to propose an FS method following a recently proposed meta-heuristic algorithm called the Social Ski Driver (SSD) optimization algorithm [17], which has produced significantly good results as compared to other optimization algorithms in the literature. To enhance its exploitation ability, we have embedded a local search algorithm called Late Acceptance Hill Climbing (LAHC) into the SSD algorithm. The contributions of this paper are highlighted below:
• Application of SSD to the FS problem for the first time, to the best of our knowledge.
• Enhancing the exploitation ability of SSD by using LAHC.
• Use of two different transfer functions, S-shaped and V-shaped, for proposing the FS methods.
• Validation of the proposed model on 18 standard UCI datasets and comparison with 15 state-of-the-art FS methods, and on 3 standard microarray datasets and comparison with 6 state-of-the-art FS methods.
The rest of the paper is organized as follows: Section II provides a brief review of past FS methods. Section III provides a detailed description of the proposed FS methods. The results obtained by the FS versions of SSD are explained in Section IV. Section V provides the comparison of the proposed model with 15 state-of-the-art FS methods. Section VI shows the applicability of the proposed methods to high dimensional microarray datasets. Lastly, Section VII concludes our work and provides directions for future extension of this work.

II. LITERATURE SURVEY
FS is an optimization problem where the aim is to simultaneously maximize the classification accuracy and minimize the number of selected features. The role of FS is crucial because it helps us to gauge the performance of a machine learning technique. There are several research articles published in the literature which have tried to solve the FS problem, and some of those are described here.
Nature-inspired algorithms are popular because of a number of factors: they are easy to adopt, flexible, do not require very complex mathematical derivation, and are able to avoid local optima. Genetic Algorithm (GA), the oldest algorithm in this category, was used in [18] for the selection of features in an automatic pattern classifier. It was further used in [9] and in [19]. Hybrid versions of the GA were subsequently utilized in [20]-[22] and [4]. In [23], a Histogram-Based Multi Objective GA (HMOGA) was proposed for finding informative features from higher-dimensional data. This idea was applied to two previously proposed feature sets for the handwritten Devanagari numeral recognition problem. In [21], two stages of optimization were involved: the outer optimization stage completed the global search for the best subset of features in a wrapper way, while the inner optimization performed the local search in a filter manner. A tribe competition-based GA (TCbGA) to solve FS problems in pattern classification was proposed in [24]. A new Deluge based GA for FS was also proposed recently in [25].
Particle Swarm Optimization (PSO) is a powerful optimization technique introduced in 1995 and further used in [10] for optimization of non-linear functions. A binary version of this algorithm was subsequently used in [26]. Six new transfer functions for binary PSO were introduced in [27]. A hybrid version of the PSO algorithm was used in [28] and [29]. The authors of [30] used a new algorithm called Sentiment Fitness Sum Binary PSO (SCO-FS-BPSO). This work overcame the drawbacks of the binary PSO and was used for sentiment classification.
Simulated Annealing (SA) for optimization problems was introduced in [31], and then used in [32]. The concept of SA was to mimic the annealing of solids. This algorithm was subsequently used by the authors of [33] which incorporated the Firefly Algorithm (FA) with SA to escape from the local optima and increase the quality of the solution. Another optimization technique called Binary Coordinate Ascent (BCA) was proposed in [15]. This method also used two popular feature subset selection (FSS) meta-heuristics namely, Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS). The authors of [34] introduced a new algorithm called Grasshopper Optimization Algorithm (GOA) for solving optimization problems. This algorithm was mathematically modeled and was inspired by the behavior of grasshopper swarms in nature. The work reported in [35] proposed a hybrid method based on the GOA and used it to optimize the parameters of the SVM classifier and simultaneously find the best feature subset. In [36], the authors have proposed Harmony Search Algorithm (HSA) following the musical performance process. It has three operations: random search, pitch adjustment, and harmony memory. HSA has been applied to different domains [37], including FS [38].
The authors of [14] used the Ant Lion Optimizer (ALO) to solve the FS problem. This algorithm mimicked the hunting behavior of ant lions in nature. Then a hybrid binary ALO was used in [16]. This work used two incremental hill-climbing techniques which are QuickReduct and CEBARKCC. Different versions of the Grey Wolf Optimizer (GWO) were used in [11] and [39]. Authors in [39] used a new algorithm called BGWOPSO which was a binary version of a hybrid form of the GWO and PSO. This technique proved to be relatively better in comparison to its peers. After that, authors in [40] proposed a new GWO algorithm integrated on a two phase mutation to solve the FS problem based on a wrapper method. The work reported in [41] used the Whale Optimization Algorithm (WOA) which had hitherto not been used for FS problems. This algorithm was inspired by the social behavior of humpback whales. This was further used in [16]. In [42], authors proposed a hybrid version of the WOA with SA for FS, where SA was used to improve the best solution found after each iteration of WOA. Subsequently in [43], the authors proposed an enhanced meta-heuristic approach using the GWO and WOA to develop a wrapper-based FS. This method successfully removed the drawbacks of the GWO and WOA and hence was superior compared to both of them. A wrapper-based FS algorithm was designed and substantiated based on the binary variant of the Dragonfly Algorithm (BDA) in [44]. This algorithm mimicked the behavior of dragonflies in nature. The authors of [45] introduced the Ant Colony Optimization (ACO) algorithm as a new meta-heuristic. This algorithm was inspired by the foraging behavior of ant colonies. The work in [46] subsequently used a wrapper-filter FS technique based on the ACO algorithm. In [47], the authors introduced a binary variant of the Grasshopper Optimization Algorithm (BGOA) to solve the FS problem. 
This work proposed two approaches: the first approach used a V-shaped function and the sigmoid function as transfer functions, while the second approach used the mutation operator to exploit the BGOA.
The authors in [48] proposed a method called Correlation based Feature Selection (CFS). This algorithm coupled the evaluation formula with an appropriate correlation measure and a heuristic search strategy. A correlation based Memetic framework was used in [49]. The work used a combination of GA and local search. The Spotted Hyena Optimization (SHO) algorithm was used in [2]. It used two different hybrid models of this algorithm. In the first model, SA was embedded in SHO to improve the best solution after each iteration, while the second model used SA to enhance the final solution obtained by the SHO algorithm. A novel technique called the Gravitational Search Algorithm (GSA) was used in [50]. This algorithm was based on the law of gravity and mass interactions. In the given algorithm, the searcher agents are a collection of masses which interact with each other based on Newtonian gravity and the laws of motion. In [51], a binary version of the GSA was used. In [52], the authors used a GSA-based algorithm with evolutionary crossover and mutation operators to solve the FS task. The work reported in [53] used the Salp Swarm Algorithm (SSA) and Multiple SSA (MSSA). The inspiration of these algorithms came from the swarming behavior of salps when navigating and foraging in oceans. In [1], the authors proposed a wrapper FS method which combined a time-varying number of leaders and followed a binary form of the SSA called TVBSSA, combined with a Random Weight Network (RWN). In [54], the authors proposed the Improved Salp Swarm Algorithm (ISSA) where Opposition Based Learning (OBL) was used to improve the population diversity in the search space and local search was used to enhance the exploitation ability of the algorithm. The work reported in [13] used the Moth Flame Optimization (MFO) algorithm. It was inspired by the navigation method of moths in nature called transverse orientation. 
Authors of [55] proposed a local search technique called LAHC through Redundancy and Relevancy (LAHCRR) and combined it with Memetic Algorithm (MA) to form Late Hill Climbing based MA (LHCMA) for FS, which reduced the feature dimension to a significant extent. The work reported in [56] proposed a new optimization algorithm called Equilibrium Optimizer (EO) inspired by control volume mass balance models used to estimate both dynamic and equilibrium states. A novel bio-inspired optimization algorithm called the Barnacles Mating Optimizer (BMO) algorithm was proposed in [57]. The inspiration of this algorithm came from the mating behaviour of barnacles in nature and the authors of this work used the Hardy-Weinberg principle for the generation of a new offspring. This principle states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences. In a recent work [58], a new optimization method called the Black Widow Optimization (BWO) algorithm was introduced which was inspired by the mating behavior of black widow spiders. The proposed method includes an exclusive stage, namely, cannibalism where species with inappropriate fitness were omitted, thus leading to early convergence.
The existence of so many meta-heuristic and hybrid meta-heuristic FS methods naturally raises the question of why another hybrid meta-heuristic FS method should be introduced. However, according to the No Free Lunch (NFL) theorem [59] for optimization, no single algorithm can solve all optimization problems. With every new algorithm that follows some natural phenomenon, researchers mainly aim to give a new dimension to the algorithm in which exploration and exploitation have a better trade-off, so that it eventually escapes local optima and reaches the global optimum. But achieving these goals is not easy, specifically if one wants to propose an algorithm applicable to different domains. This motivates researchers to come up with better methods than those of the past which, in turn, keeps the research alive in this domain. For our particular problem, the NFL theorem should remind us that, in order to find the best algorithm, we need to focus on the particular problem at hand, the assumptions, the priors (extra information), the data, and the cost. If we consider an optimization problem, multi-modal functions have a large number of dimensions, and finding an optimal value for all those dimensions simultaneously is next to impossible. That is why researchers try to solve these types of problems using meta-heuristic methods, where the aim is to get a near-optimal solution within a reasonable amount of time. Now FS is an optimization problem [42], and there may exist multiple optimal subsets, i.e., subsets with the same dimension and the same classification accuracy. Here too, the aim is to find an optimal set of features keeping in mind the storage space and computation time along with the performance of the machine learning algorithm. Therefore, researchers are striving to provide algorithms which can meet these requirements. This motivated us to propose a hybrid meta-heuristic FS method based on the SSD [17] algorithm.
VOLUME 8, 2020
The concept of the SSD algorithm was inspired by various evolutionary algorithms such as PSO [26], GWO [60], and Sine Cosine Algorithm (SCA) [12]. The SSD algorithm has been previously used for parameter optimization of SVM classifier in [17].

III. PRESENT WORK
A. SOCIAL SKI DRIVER OPTIMIZATION: AN OVERVIEW
The SSD algorithm is a novel optimization algorithm recently proposed in [17]. It mimics the paths taken by ski-drivers going downhill. The components of SSD are briefly described as follows:
• Position of the agents ($X_i \in \mathbb{R}^n$): The position of an agent is used to calculate the objective function at that location in the n-dimensional search space.
• Personal best position ($P_i$): The fitness value of each agent is compared with the previously obtained best fitness value, and the corresponding best position is stored as the personal best position of the agent.
• Mean global best position ($M_i$): The agents move towards the global best position, which represents the mean of the best three solutions as follows:
$$M_i^t = \frac{X_\alpha^t + X_\beta^t + X_\gamma^t}{3} \qquad (1)$$
where $X_\alpha$, $X_\beta$ and $X_\gamma$ are the three best solutions obtained so far.
• Velocity of the agents ($V_i$): The positions of the agents are updated using the equations:
$$X_i^{t+1} = X_i^t + V_i^t \qquad (2)$$
$$V_i^{t+1} = \begin{cases} c \sin(r_1)\,(P_i^t - X_i^t) + \sin(r_1)\,(M_i^t - X_i^t), & r_2 \le 0.5 \\ c \cos(r_1)\,(P_i^t - X_i^t) + \cos(r_1)\,(M_i^t - X_i^t), & r_2 > 0.5 \end{cases} \qquad (3)$$
where $V_i$ is the velocity of $X_i$, $r_1$ and $r_2$ are uniformly generated random numbers in the range [0, 1], $P_i$ is the personal best position of the i-th agent, $M_i$ is the mean global best position for the entire population, and c is a parameter which is used to balance exploration and exploitation. It is calculated as:
$$c^{t+1} = \alpha\, c^t \qquad (4)$$
where t is the current iteration and α is used to reduce the value of c. Equation 3 indicates that the moving directions of the agents are not straightforward because of the sine and cosine functions, which gives the algorithm a better exploration capability and diversifies the search, but in a guided manner.
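The update described above can be sketched in Python for a single agent. This is a minimal illustration, not the authors' implementation: it assumes the update rules of the original SSD formulation [17] (position advanced by the current velocity, then the velocity recomputed with a sine or cosine branch selected by r2, and c decayed as c ← α·c). The function names, and the choice of drawing one r1 and one r2 per agent rather than per dimension, are assumptions.

```python
import math
import random


def mean_global_best(x_alpha, x_beta, x_gamma):
    """Mean of the three best positions found so far."""
    return [(a + b + g) / 3.0 for a, b, g in zip(x_alpha, x_beta, x_gamma)]


def ssd_step(x, v, p_best, m_best, c, rng=random):
    """One SSD update for a single agent; returns (new_x, new_v).

    Assumption: a single r1 and r2 are drawn per agent and shared by all
    dimensions.
    """
    # Position update: move along the current velocity.
    new_x = [xi + vi for xi, vi in zip(x, v)]
    r1, r2 = rng.random(), rng.random()
    # The sine or cosine branch is chosen by r2.
    trig = math.sin(r1) if r2 <= 0.5 else math.cos(r1)
    # Velocity update: pull towards the personal best (scaled by c)
    # and towards the mean global best.
    new_v = [c * trig * (p - xi) + trig * (m - xi)
             for xi, p, m in zip(x, p_best, m_best)]
    return new_x, new_v


def decay_c(c, alpha):
    """Shrink c over the iterations to shift from exploration to exploitation."""
    return alpha * c


if __name__ == "__main__":
    rng = random.Random(0)
    m = mean_global_best([0.0, 3.0], [3.0, 3.0], [6.0, 3.0])
    x, v = ssd_step([1.0, 1.0], [0.1, -0.1], [2.0, 2.0], m, c=100.0, rng=rng)
    print(m, x, v, decay_c(100.0, 0.9))
```

When an agent already sits on both its personal best and the mean global best, both difference terms vanish and the new velocity is zero, which is a quick sanity check on the update.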

B. LATE ACCEPTANCE HILL CLIMBING
Hill climbing is a simple local search method where a new solution is selected only if it is better. Because of this greedy approach, hill climbing often gets stuck in local optima. To overcome this, LAHC [61] was proposed. LAHC has the ability to choose a lower performing solution in order to escape a local optimum. Here, a solution is immediately chosen if it has better performance, and λ solutions with worse performance are stored to allow for the acceptance of slightly poorer solutions in order to improve them into better ones. A neighbor is obtained via mutation. If the performance of the neighbor is better than the performance of the original solution, then the neighbor is chosen as the best solution. Otherwise, we temporarily consider the neighbor which gives a poorer result. This consideration is performed only if the performance of the neighbor is worse than the performance of the best solution by no more than a tolerance factor of δ per cent. After that, mutation is again performed on the neighbor to find its own neighbor. If the performance of this newly found neighbor of the neighbor is better than that of the best solution, then it is taken as the best solution; otherwise the obtained neighbor is discarded.
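For orientation, the commonly described history-list form of LAHC can be sketched as follows. This is not the authors' Algorithm 1: it omits the δ tolerance described above and instead uses the generic acceptance rule, in which a candidate is accepted if it is no worse than the current solution or no worse than the cost recorded a fixed number of iterations earlier. The single-bit-flip mutation, the function names, and the mapping of the history length to λ are assumptions.

```python
import random


def flip_one_bit(solution, rng):
    """Neighbor obtained by flipping one randomly chosen bit (assumed mutation)."""
    neighbor = list(solution)
    idx = rng.randrange(len(neighbor))
    neighbor[idx] = 1 - neighbor[idx]
    return neighbor


def lahc(initial, cost, history_length=15, max_iters=200, seed=0):
    """Generic late acceptance hill climbing for a cost to be minimized."""
    rng = random.Random(seed)
    current, current_cost = list(initial), cost(initial)
    best, best_cost = list(current), current_cost
    # History of costs; history_length plays the role of lambda (assumption).
    history = [current_cost] * history_length
    for it in range(max_iters):
        candidate = flip_one_bit(current, rng)
        candidate_cost = cost(candidate)
        slot = it % history_length
        # Late acceptance: compare with the cost stored history_length steps ago.
        if candidate_cost <= current_cost or candidate_cost <= history[slot]:
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        history[slot] = current_cost
    return best, best_cost


if __name__ == "__main__":
    # Toy cost: number of selected bits (fewer is better).
    print(lahc([1, 1, 0, 1, 1, 0, 1], cost=sum))
```

Because a worse candidate may still be accepted when it beats an older entry of the history, the search can step off a local optimum while the separately tracked best solution is never lost.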

C. PROPOSED BINARY SSD
In FS, searching for the best feature subset is a challenging problem, especially in wrapper-based methods. This is because the selected subset needs to be evaluated by the learning algorithm at each iteration. FS is a binary optimization problem [42], where the solutions are limited to the binary values {0, 1}. A solution or 'agent' is represented using a binary vector where 1 indicates that the corresponding feature is selected and 0 indicates that it is not selected. The size of the solution vector is equal to the number of features in the dataset under consideration.
SSD was originally proposed for solving continuous optimization problems. To suit the FS problem, the position of an agent must be converted to a binary vector. This binary conversion is performed by using two different transfer functions: S-shaped (Figure 1a) and V-shaped (Figure 1b). A transfer function gives the probability of updating the position of an agent from 0 to 1 and vice versa. An S-shaped function, depicted in Figure 1a, is given by Equation 5.
The agent position is updated as per Equation 6, where $X_d^{t+1}$ is the updated position of the agent. The V-shaped function, depicted in Figure 1b, is defined by Equation 7.
Using the V-shaped function, the agent position is updated as per Equation 8, where $X^{t+1}$ is the agent's updated position, $X^t$ is the position of the agent at that instant of time, and rand ∈ [0, 1]. After updating an agent's position in each iteration, a local search is performed using LAHC to optimize the position of the agent in order to obtain a better fitness value. The local search using LAHC increases the exploitation ability of the SSD algorithm, and by doing so it helps to escape local optima, as described in Section III-B.
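A possible realization of the two binarization schemes is sketched below. The specific formulas are assumptions, since Equations 5 to 8 are not reproduced in this text: the S-shaped function is taken to be the logistic sigmoid with the rule "set the bit to 1 when rand falls below the transfer value", and the V-shaped function is taken to be |x / sqrt(1 + x²)| with the rule "flip the current bit when rand falls below the transfer value". Both are widely used choices in binary meta-heuristics, but the paper's exact variants may differ, as may the quantity fed to the transfer function.

```python
import math
import random


def s_shaped(x):
    """Logistic sigmoid, written to avoid overflow for large |x| (assumed form)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def v_shaped(x):
    """|x / sqrt(1 + x^2)|, one common V-shaped transfer function (assumed form)."""
    return abs(x / math.sqrt(1.0 + x * x))


def binarize_s(values, rng=random):
    """S-shaped rule: bit becomes 1 with probability s_shaped(value)."""
    return [1 if rng.random() < s_shaped(v) else 0 for v in values]


def binarize_v(bits, values, rng=random):
    """V-shaped rule: flip the current bit with probability v_shaped(value)."""
    return [1 - b if rng.random() < v_shaped(v) else b
            for b, v in zip(bits, values)]


if __name__ == "__main__":
    rng = random.Random(0)
    continuous = [-2.0, -0.1, 0.0, 0.3, 4.0]
    print(binarize_s(continuous, rng))
    print(binarize_v([1, 0, 1, 0, 1], continuous, rng))
```

The practical difference is that the S-shaped rule ignores the previous bit, whereas the V-shaped rule keeps the previous bit unless the continuous value is large in magnitude.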
The analysis of our algorithm shows that its time complexity is $O(iter \cdot psize \cdot \lambda^2 \cdot (t_{fitness} + dim))$, where iter is the maximum number of iterations, psize is the population size, λ is the parameter of LAHC, $t_{fitness}$ is the complexity of calculating the fitness of a particular agent using the classifier, and dim is the dimension of the dataset. Algorithm 2 represents the pseudocode for the proposed method. The pseudocode of LAHC is provided in Algorithm 1.

D. FITNESS FUNCTION
In this part, we discuss how to assess the quality of a solution. Since the above algorithm is a wrapper-based method, a learning algorithm is required; we use the KNN [62] classifier to evaluate the classification accuracy, following the works of [11], [42], [43]. Our proposed fitness function has two components: classification accuracy and number of features. Since these two components are opposing in nature, i.e., we want to achieve higher classification accuracy with a lower number of features, we have decided to use the classification error instead. This is because a lower error, i.e., a higher accuracy, implies a lower fitness value, and so does a lower number of features. Equation 9 shows the fitness function which evaluates a given feature subset.
[Algorithm 1: Pseudocode for LAHC]
where |S| is the number of features in the selected feature subset, |D| is the total number of features of the dataset, η is the classification error of the feature subset, and ω ∈ [0, 1] [11] indicates the relative weight of the number of features and the classification error.
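A weighted-sum fitness consistent with this description can be sketched as follows. The exact form is an assumption inferred from the stated roles of the symbols and from the remark in Section IV-B that a larger ω puts more focus on reducing the number of features: fitness = ω·|S|/|D| + (1 − ω)·η. The function name and the default ω = 0.2 (the value reported in Section IV-B) are illustrative.

```python
def fitness(error_rate, n_selected, n_total, omega=0.2):
    """Weighted sum of feature ratio and classification error (lower is better).

    Assumed form: omega * |S| / |D| + (1 - omega) * eta.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    feature_ratio = n_selected / n_total
    return omega * feature_ratio + (1.0 - omega) * error_rate


if __name__ == "__main__":
    # Same error, fewer features: the smaller subset gets the lower fitness.
    print(fitness(error_rate=0.05, n_selected=10, n_total=30))
    print(fitness(error_rate=0.05, n_selected=5, n_total=30))
```

In a wrapper setting, `error_rate` would come from training and testing the classifier on the columns selected by the agent's binary vector.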

IV. EXPERIMENTAL RESULTS
This section reports the experimental observations which help to demonstrate the applicability of the binary variants of SSD in the FS domain. We have used the KNN [62] classifier to measure classification accuracy. As per the recommendations in [11], [43], [44], we have set K = 5. For each dataset, 80% of the instances are used for training the classification model and the remaining 20% are used for testing. The proposed methods are implemented using Python3 [63] and the graphs are plotted using Matplotlib [64].

A. DATASET DESCRIPTION
In this work, 18 different benchmark datasets from the UCI data repository [65] are used to assess the performance of the proposed methods. Table 1 shows the details of the used datasets. There are 15 bi-class and 3 multi-class datasets.

B. PARAMETER TUNING
We have done various experiments to know the effect of population size on the performance of the proposed methods (in terms of classification accuracy). We have evaluated the proposed binary SSD methods for population sizes of [5,10,20,30,50]. Figure 2 shows the effect of changing population size on the classification accuracy of the algorithm for all datasets. Two effects have been noticed: first, increasing the population size does not always improve the classification accuracy, and second, for most of the datasets, not much variation has been seen in the classification accuracy for different population sizes. On the basis of the outcomes shown in Figure 2, we have set the population size to 30 and tried to improve the classification accuracy for the datasets used here.
In this work, our aim is to minimize the fitness function, i.e., to minimize the number of features as well as the classification error. If the value of ω is increased, the algorithm focuses more on reducing the number of features; if ω is given a lower value, more focus is placed on decreasing the classification error. After experimenting with several values of ω, we fixed ω at 0.2. Thus the algorithm assigns greater importance to minimizing the classification error, i.e., maximizing the accuracy. During the course of the work, we also experimented with different values of c and α. In agreement with [17], our experiments show that the classification accuracy improves as the value of c is increased. The classification accuracy improves when the value of α is slowly reduced, but decreases again after reaching a maximum. This happens due to overfitting when the value of α is too low. Eventually, the maximum classification accuracy is achieved when c and α are set to 100 and 0.9, respectively. We have also experimented with different values of λ, which controls how many solutions of slightly poorer performance are stored. The trade-off is between improving the classification accuracy and reducing the execution time. Since both cannot be improved simultaneously, the value of λ is fixed at 15, which is the upper limit on the acceptable number of worse solutions. Based on Figure 2 and Figure 3, we have decided to use a population size of 30 and a maximum of 50 iterations.

C. DISCUSSION
In this section, we have provided the results obtained by the proposed approaches over the datasets mentioned in Section IV-A and analyzed them. From Table 2, it is clear that the proposed methods are able to perform FS efficiently. For all the datasets, classification accuracy has improved after applying the FS methods (with huge margins in multiple cases).
SSDs and SSDv have both achieved an accuracy of 90% or more for 14 datasets, whereas for SSDs+LAHC and SSDv+LAHC the count is 15. Moreover, SSDs and SSDv have each achieved 100% accuracy for three datasets, while SSDs+LAHC and SSDv+LAHC have each achieved 100% accuracy for seven datasets, which is quite remarkable.
If we set aside the classification accuracy and focus on the number of features used to obtain these high accuracies, the proposed methods still perform significantly well. On almost all the datasets, all the proposed variants have used only 50% or fewer of the features for classification.
For Exactly, Sonar, SpectEW, and Tic-tac-toe, accuracies have improved considerably after using LAHC. For many datasets, such as Lymphography, PenglungEW, SpectEW, Wine, and Zoo, the number of selected features has reduced after using LAHC. Observing the results in Table 2 and Table 3, it is quite obvious that the use of LAHC is significantly helping the algorithm to explore different parts of the search space and achieve better results. Now, let us compare SSDs and SSDv in terms of achieved classification accuracy. For BreastEW, WineEW and Zoo, SSDs and SSDv have achieved the same classification accuracy. SSDs has performed better than SSDv in 4 cases: HeartEW, IonosphereEW, PenglungEW and SonarEW. For PenglungEW, the accuracy difference is quite significant. For the remaining 11 cases, SSDv has achieved higher classification accuracy. For HeartEW, M-of-n and Zoo, SSDs and SSDv have selected the same number of features. SSDs has selected more features in 4 cases: Exactly, KrvskpEW, PenglungEW, and SonarEW. For the remaining 11 cases, SSDv has selected fewer features than SSDs. Comparing SSDs+LAHC and SSDv+LAHC in terms of achieved classification accuracy, both methods have achieved the same classification accuracy for 8 datasets. In 5 cases SSDs+LAHC has performed better and in 5 cases SSDv+LAHC has performed better. For Exactly, HeartEW, SonarEW, and Tic-tac-toe, SSDs+LAHC and SSDv+LAHC have selected the same number of features. SSDs+LAHC has selected fewer features in 4 cases: Breastcancer, BreastEW, SpectEW, and Zoo. For the remaining 10 cases, SSDv+LAHC has selected fewer features.

V. COMPARISON AND ANALYSIS
To establish the superiority of the proposed FS methods, we have compared these methods with 15 state-of-the-art methods: 8 popular meta-heuristic FS methods (GA, PSO, ALO, GSA, GWO, GOA, DA, and SSA) and 7 hybrid meta-heuristic FS methods. WOASAT-2 [44] is a hybrid of WOA and SA. BGWOPSO [39] is developed by hybridizing GWO and PSO. GWO and WOA are hybridized following three different strategies [66]: serial grey-whale optimizer (HSGW), random switching grey-whale optimizer (RSGW), and adaptive switching grey-whale optimizer (ASGW). In WOA-CM [16], the performance of WOA is enhanced by using crossover and mutation. We have obtained the results from the mentioned articles. The authors of the corresponding articles obtained their results with the best parameters found through exhaustive experiments, and we have chosen the best results from those articles. Figure 4 shows the performance of SSDs + LAHC and SSDv + LAHC in terms of achieved classification accuracy. From Figure 4, it can be observed that SSDs + LAHC performs best for 10 datasets: CongressEW, Exactly, M-of-n, PenglungEW, SpectEW, Tic-tac-toe, Vote, WaveformEW, WineEW, and Zoo. SSDv + LAHC performs best in 9 cases: Breastcancer, CongressEW, Exactly, HeartEW, M-of-n, PenglungEW, Vote, WineEW and Zoo. For the CongressEW, Exactly, M-of-n, PenglungEW, Vote, WineEW, and Zoo datasets, SSDs + LAHC and SSDv + LAHC have jointly achieved the best accuracy. For the Lymphography dataset, both SSDs + LAHC and SSDv + LAHC have achieved the second best classification accuracy. For the Breastcancer dataset, SSDs + LAHC has achieved the second highest accuracy. For BreastEW, Exactly2, and HeartEW, SSDs + LAHC has achieved the third highest accuracy. For Tic-tac-toe, SSDv + LAHC has achieved the third highest accuracy. Both SSDs + LAHC and SSDv + LAHC have achieved the same accuracy as GA for the Exactly and M-of-n datasets. For the remaining 16 datasets, both SSDs + LAHC and SSDv + LAHC have performed better than GA.
The same can be stated if we compare SSDs + LAHC and SSDv + LAHC with PSO. Both SSDs + LAHC and SSDv + LAHC outperform ALO, GSA and GWO on all 18 datasets. SSDs + LAHC has achieved the same accuracy as DA for 5 datasets: Exactly, M-of-n, Penglung, WineEW, and Zoo. DA outperforms SSDs + LAHC in 5 cases: Breastcancer, IonosphereEW, KrvskpEW, Lymphography, and SonarEW. SSDs + LAHC wins in the remaining 8 cases. Against DA, SSDv + LAHC has 8 wins, 6 ties, and 4 losses. Both SSDs + LAHC and SSDv + LAHC outperform SSA on every dataset except Zoo, where both have achieved the same accuracy. Figure 5 shows the performance of SSDs + LAHC and SSDv + LAHC in terms of the number of selected features. From Figure 5, it can be observed that SSDs + LAHC performs best, i.e., selects the lowest number of features, for the Breastcancer dataset. SSDs + LAHC performs second best for the BreastEW, HeartEW, M-of-n, and WineEW datasets. SSDv + LAHC performs best for WineEW. It performs second best for Breastcancer, HeartEW, and Lymphography. Against GA, both SSDs + LAHC and SSDv + LAHC have 3 wins and 2 ties. Against PSO, SSDs + LAHC has 3 ties and 3 wins, and SSDv + LAHC has 2 ties and 2 wins. Against ALO, SSDs + LAHC has 9 wins and 1 tie, and SSDv + LAHC has 11 wins and no tie. Against GSA, SSDs + LAHC has 5 wins and 1 tie, and SSDv + LAHC has 6 wins and 2 ties. Against GWO, SSDs + LAHC has 9 wins and no tie, and SSDv + LAHC has 8 wins and 2 ties. Against GOA, SSDs + LAHC has 9 wins and 1 tie, and SSDv + LAHC has 11 wins and no tie. Figure 6 and Figure 7 show, respectively, the average accuracies achieved and the average number of features selected over the 18 UCI datasets using the proposed methods and the 15 state-of-the-art methods considered for comparison.
To determine the statistical significance of the results obtained by the proposed methods, the Wilcoxon test has been performed. It is a non-parametric statistical test in which pairwise comparison is performed [67]. Here, the null hypothesis is that the two sets of results have the same distribution. We have performed the test at the 0.05 (i.e., 5%) significance level, so if the distributions of the two sets of results are statistically different, the p-value generated by the test will be < 0.05, resulting in the rejection of the null hypothesis. From the test results, provided in Table 4, we can conclude that the improvements obtained by the proposed FS methods are statistically significant.
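To make the testing procedure concrete, the following is a minimal Python sketch of an exact two-sided Wilcoxon test on paired results. It assumes the signed-rank variant applied to paired per-dataset accuracies, which is an assumption on our part. The function name and the sample values are hypothetical and do not come from Table 4.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p) with W = min(W+, W-). Zero differences are dropped.
    Enumerates all 2^n sign patterns, so it is meant for small n only.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis each rank is positive or negative with probability 1/2.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w + 1e-12:
            extreme += 1
    return w, extreme / 2 ** n


# Placeholder accuracies (in %) on eight hypothetical datasets, NOT values from the paper.
method_a = [95, 100, 88, 91, 97, 93, 90, 99]
method_b = [93, 99, 80, 85, 92, 96, 80, 95]
w, p = wilcoxon_signed_rank(method_a, method_b)
print(w, p)  # 3.0 0.0390625
```

With these placeholder values the p-value is below 0.05, so the null hypothesis of identical distributions would be rejected at the 5% level.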
The main characteristic of the proposed FS method that separates it from other meta-heuristic algorithms is that the search direction is not straightforward because of the sine and cosine functions. This diversifies the search, while the parameter c in Equation 3 maintains a balance between exploration and exploitation, guiding the search to finally converge to a solution. Furthermore, LAHC allows the algorithm to refine the solution, overcoming local optima in the process and leading to improved results.
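The late-acceptance idea behind this refinement can be illustrated with a generic sketch in Python. This is the textbook form of LAHC applied to a binary feature mask, not necessarily the exact variant or parameter setting used in this work. The toy cost function, the set `RELEVANT`, and all parameter values are hypothetical.

```python
import random


def lahc(initial, cost, history_length=5, max_iters=200, seed=0):
    """Generic Late Acceptance Hill Climbing on a binary mask (minimization)."""
    rng = random.Random(seed)
    current = list(initial)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    # History of costs; initially filled with the cost of the starting solution.
    history = [current_cost] * history_length
    for it in range(max_iters):
        # Neighbour: flip one randomly chosen bit (select or drop one feature).
        candidate = list(current)
        flip = rng.randrange(len(candidate))
        candidate[flip] = 1 - candidate[flip]
        cand_cost = cost(candidate)
        v = it % history_length
        # Accept if no worse than the current cost OR than the cost recorded
        # history_length iterations ago; the latter lets worse moves through.
        if cand_cost <= current_cost or cand_cost <= history[v]:
            current, current_cost = candidate, cand_cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
        history[v] = current_cost
    return best, best_cost


# Hypothetical toy problem: features 0 and 2 are "relevant"; the cost penalizes
# every missed relevant feature and, more lightly, every selected feature.
RELEVANT = {0, 2}


def toy_cost(mask):
    missed = sum(1 for i in RELEVANT if not mask[i])
    return missed + 0.1 * sum(mask)


best, best_cost = lahc([1] * 6, toy_cost)
print(best, best_cost)
```

Because a candidate is compared against an older cost as well as the current one, the search can temporarily accept a worse mask, which is what helps it move away from local optima.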

VI. EXPERIMENTS ON HIGH DIMENSIONAL DATA
To show the robustness of the proposed FS methods, we have also applied them to high dimensional microarray datasets. It is to be noted that microarray datasets [68] are important medical diagnostic tools used for identifying or classifying different diseases, including cancer. The main challenge of working with these datasets is that they tend to have a small number of samples and a large number of features. We have applied the proposed FS methods to three standard and publicly available microarray datasets and compared the obtained results with 6 state-of-the-art methods: GA, PSO, ALO, GSA, SSA, and Harris Hawks Optimizer (HHO) [68]. Table 5 shows the details of the microarray datasets used in this work. These datasets are high dimensional, with more than 1000 features each. Figure 8 shows the accuracies achieved for the 3 microarray datasets using the proposed methods and the mentioned state-of-the-art methods. Figure 8d shows the average accuracy achieved over the 3 microarray datasets using the proposed methods and the state-of-the-art methods. Figure 9 shows the number of features selected using the proposed methods and the state-of-the-art methods. Figure 9d shows the average number of selected features over the 3 datasets. From the results, it is evident that the proposed approaches are able to achieve higher classification accuracy with a lower number of features compared to the mentioned state-of-the-art methods. This demonstrates the robustness and applicability of the proposed methods.

VII. CONCLUSION
In this work, we have introduced new meta-heuristic FS algorithms based on the recently proposed SSD optimization algorithm. To convert the continuous search space of SSD to binary, we have used S-shaped and V-shaped transfer functions. To enhance the exploitation ability of SSD, we have applied a local search algorithm, namely LAHC. The FS problem is formulated as a multi-objective optimization task with a fitness function that aims to achieve high classification accuracy with a low number of selected features. The obtained results show that the incorporation of LAHC with SSD has significantly improved the results, both in terms of classification accuracy and number of selected features. We have applied the proposed methods to 18 well known UCI datasets and 3 microarray datasets, and compared the achieved results with some state-of-the-art FS methods. The comparison shows that the proposed methods are able to produce better results. Hence, it can be said that the proposed methods are able to effectively search the feature space and converge to a (near-)optimal solution better than the other methods considered here for comparison. Moreover, between the S-shaped and V-shaped transfer functions, both SSD and hybrid SSD perform marginally better with the V-shaped transfer function. The proper functioning of the proposed method depends on the parameter settings of SSD and LAHC. The optimal values of these parameters may differ from one problem to another and must be determined experimentally, which is considered a limitation of this method. Also, having the same stochastic nature as other swarm-intelligence algorithms, binary SSD is not guaranteed to produce outstanding results for all FS problems, as per the No Free Lunch theorem [59]. In future studies, binary SSD can be applied to other standard datasets and real-world problems, and it can also be employed with more classifiers such as neural networks, random forests, and SVMs.
A comparative study can be made of the obtained results for microarray datasets with other hybrid meta-heuristics present in the literature. It would be interesting to hybridize SSD with other meta-heuristics or other local search algorithms.
BITANU CHATTERJEE is currently an Undergraduate Student with the Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. His areas of interest include machine learning, optimization, and graph theory.
TRINAV BHATTACHARYYA is currently an Undergraduate Student with the Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. His areas of interest include machine learning, optimization, and graph theory.
KUSHAL KANTI GHOSH is currently a senior year Undergraduate Student with the Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. His areas of interest include machine learning, optimization, game theory, and image processing.
ZONG WOO GEEM received the B.Eng. degree from Chung-Ang University, the M.Sc. degree from Johns Hopkins University, and the Ph.D. degree from Korea University. He has researched at Virginia Tech, the University of Maryland at College Park, and Johns Hopkins University. He is currently an Associate Professor with the Department of Energy IT, Gachon University, South Korea. He invented a music-inspired optimization algorithm, harmony search, which has been applied to various scientific and engineering problems. His research interests include phenomenon-mimicking algorithms and their applications to energy, environment, and water fields. He has served various journals as an Editor (an Associate Editor for Engineering Optimization and a Guest Editor for Swarm and Evolutionary Computation, the International Journal of Bio-Inspired Computation, the Journal of Applied Mathematics, Applied Sciences, and Sustainability).