A Comprehensive Survey of Recent Hybrid Feature Selection Methods in Cancer Microarray Gene Expression Data

Cancer classification is a vital issue in the diagnosis and treatment of cancer. Gene selection is needed to address the high dimensionality, small sample size, and noise of microarray data. The best way to classify cancer is to select the most informative genes, a process that contributes significantly to classification performance on microarrays. In this survey, we comprehensively study hybrid gene selection methods proposed since 2017, which may serve as a basis for comparison with other gene selection algorithms proposed for cancer classification, and we examine the open challenges that future authors need to address.


I. INTRODUCTION
Cancer was the second leading cause of death worldwide in 2020, causing nearly 10 million deaths. Researchers fear the rates will increase by 50% to 15 million new cases [1] [2]. Cancer starts when abnormal cells grow in organs or tissues of the human body, spread to their surroundings, and, in advanced cases, expand into other organs. Early detection of cancer can significantly increase survival rates. For the patient to receive appropriate treatment, the kind of cancer must be determined as precisely as possible. The traditional method relies on microscopic observation of different types of biopsy samples, but it is time-consuming and not cost-efficient in advanced cases, and it can produce false negative results. For that reason, using DNA microarrays and selecting the right number of features (genes) is essential for finding more predictive and effective genes for cancer classification.
Typically, gene expression data contains a large number of genes, which necessitates analysis techniques that can extract meaningful information [3]. The advent of gene expression technologies has made microarray data increasingly popular in cancer classification research because of the massive amount of gene expression information (features/genes) that can be used to find common patterns within a set of samples. Microarrays are a prominent method for identifying cancer cells by analyzing the expression of the genes. Microarray data is organized into a matrix called the gene expression matrix, in which each row represents a specific gene and each column indicates an experimental condition [4].
The use of microarray technology can yield useful insights into disease-gene correlations. However, the dimensionality problem and the presence of irrelevant genes complicate data analysis and cancer classification. To remove unnecessary genes from microarrays and retrieve useful information, a feature selection method and a classification algorithm are applied to classify the cancer accurately [5].
Feature selection methods are divided into several categories: filter, wrapper, and embedded. In recent years, hybrid methods have been introduced as part of the general framework for feature selection. The main idea behind feature selection is to choose the most informative and significant genes for the classification problem. This is achieved by removing irrelevant genes and noisy data to maximize the number of correct predictions in cancer classification [4]. Hybrid methods combine the benefits of the filter and wrapper techniques. Several hybrid approaches, primarily a merger of a filter and a wrapper method or of two wrapper methods, have been developed over the last few years to identify the genes useful for correct diagnosis [4].
The goal of this survey is to find contributions to the development of hybrid feature selection methods for cancer classification in recent years.

II. DNA MICROARRAY GENE EXPRESSION PROFILE
DNA microarrays are a technical alliance of biology and computing that allows for the genome-wide analysis of gene expression in human tissues [3]. DNA microarray technology has been widely used in cancer research for cancer classification. It has also advanced the understanding of the causes of cancer by making it possible to inspect the expression levels of a large number of genes at the same time [6]. As the technology becomes more widely used and standardized, its price and complexity decrease, and the massive amount of gene expression information (features) it produces can be used to find common patterns within a set of samples.
The expression level of a gene is represented by the number of gene cells. Gene expression profiling typically yields thousands of genes and a small number of samples. This issue in microarrays is called high dimensionality. Gene expression data also has many useless and superfluous features, and only a few of the evaluated genes may have a significant impact on cancer classification. Genes are coding sections that construct essential building blocks inside the cell and direct proteins to perform a range of functions. The expression values in the microarray dataset are structured as an M × N matrix, in which each row corresponds to a feature (also called a gene) and each column represents a sample [4], as shown in Figure 1.
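As a rough illustration of this layout, the following minimal sketch stores a toy expression matrix with genes as rows and samples as columns. The gene names, labels, and values are invented for the example and do not come from any dataset discussed in this survey.

```python
# Toy gene expression matrix: one row per gene, one column per sample.
# All names and numbers below are hypothetical, chosen only for illustration.
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]
sample_labels = ["tumor", "tumor", "normal"]   # one class label per column

expression = [
    [2.1, 1.9, 0.3],   # gene_a measured in the three samples
    [0.5, 0.4, 0.6],   # gene_b
    [1.2, 1.1, 1.3],   # gene_c
    [0.1, 0.2, 2.4],   # gene_d
]

n_genes = len(expression)        # M, the number of rows
n_samples = len(expression[0])   # N, the number of columns


def sample_profile(matrix, j):
    """Return column j: the expression of every gene in sample j."""
    return [row[j] for row in matrix]


profile_of_last_sample = sample_profile(expression, n_samples - 1)
```

In real microarray data the number of rows is in the thousands while the number of columns is small, which is the high-dimensionality issue described above.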

III. FEATURE (GENE) SELECTION
The main objective of feature selection is to choose the most informative and significant genes for the classification problem. This is achieved by removing noisy data and irrelevant genes that add dimensionality, so that relevant features and patterns in genes that may help cancer classification can be found. Feature selection offers several advantages [5]:
• Helping researchers visualize, understand, and gain knowledge about the data.
• Reducing data and scaling down the storage requirements.
• Generating a simpler model that allows for greater speed and simplicity.
• Improving the performance of the machine learning algorithm.
There are three main feature selection methods used to reduce the feature space and help the model perform efficiently: filters, wrappers, and embedded methods. Each method has its own use and way of interacting with the genes. In addition, two newer methods have been added: ensemble and hybrid methods [7]. Many researchers have applied these newer methods to their classification models to generate new feature selection methods. Each of these five methods has distinct characteristics. However, we explain only the methods most relevant to this survey: filter, wrapper, embedded, and hybrid.

A. FILTER FEATURE SELECTION METHODS
Filter methods (Figure 2) are commonly employed as a preprocessing phase, the earliest step in feature selection, to reduce dimensionality. They typically calculate a relevance score for each feature/gene, rank the features/genes by score, and omit the low-scoring ones [8]. Filter methods have many advantages, the most important being that they generalize well at low computational cost; their computation is straightforward and fast, which makes them suitable for high-dimensional spaces [8].

FIGURE 2. Gene Filter Method Flowchart
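The score, rank, and omit steps of a filter can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the labels are binary (0 and 1), and the relevance score is a deliberately simple stand-in (the absolute difference of the class means) rather than any of the published filter criteria.

```python
def mean(values):
    return sum(values) / len(values)


def mean_difference_score(values, labels):
    """Stand-in relevance score: |mean of class 1 - mean of class 0|."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def filter_select(matrix, labels, k, score_fn=mean_difference_score):
    """Score every gene (row), rank genes by score, keep the k best row indices."""
    scores = [score_fn(row, labels) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data: rows are genes, columns are samples (values are invented).
expression = [
    [2.0, 2.2, 0.1, 0.3],   # gene 0: differs strongly between the classes
    [1.0, 1.1, 0.9, 1.0],   # gene 1: nearly constant
    [0.2, 0.1, 1.5, 1.7],   # gene 2: differs strongly between the classes
]
labels = [1, 1, 0, 0]

selected, scores = filter_select(expression, labels, k=2)
```

Because the score is passed in as a function, any of the criteria introduced below could be substituted for the stand-in without changing the ranking and truncation steps.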
The following is an introduction to some of the commonly used filter methods.
• Information gain (IG) is a feature selection method based on entropy. It measures how much information a feature contributes to the class prediction [9]. Specifically, entropy measures the amount of information in a random variable [10].
• Mutual information (MI) measures nonlinear relationships between two random variables X and Y by measuring their level of similarity and correlation; it shows how much information about one random variable can be obtained by observing the other [11]. In other words, MI is a method for identifying features that are highly dependent on all the other features in the same class.
• Conditional mutual information maximization (CMIM) is an approach that selects features based on an approximation of that criterion by attempting to reduce [...]
VOLUME 4, 2016. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
• Minimum redundancy maximum relevance (mRMR) [...] [13], and it gained popularity in 2019 through its adoption at Uber [14]. mRMR aims to find the maximum relevance between the features and the target as well as the minimum redundancy among the features themselves, both of which can be measured with mutual information [13]. The aim of maximum relevance is to find the features most correlated with the target. The maximum relevance criterion alone can lead to many redundant features. Therefore, the minimum redundancy step finds a better subset representation of the whole feature set by removing similar features.
• Random forest ranking (RFR) is an algorithm that uses decision trees to merge predictions from a collection of random trees [15] by applying accuracy-based ranking. It is based on the correctness of a single tree from a previous random forest evaluation [16].
• Fast Correlation-Based Filter (FCBF) is a feature selection method developed by Yu and Liu [17] for handling multivariate criteria; it uses symmetrical uncertainty (SU) to eliminate noise while retaining the more significant data. The FCBF algorithm relies on several concepts, including predominant correlation and heuristic search. It identifies a set of features that are highly correlated with the classes and then sorts them using predominant correlation. A heuristic algorithm is used to remove redundant features while keeping the more relevant ones [18].
• F-score, also known as the Fisher score, is a selection strategy that takes the F-distribution into account when examining how individual descriptive features relate to the target feature; each feature is selected independently on the basis of its score [19]. The F-score is considered a simple feature selection technique, as described by Chen and Lin [20]. The feature subset is selected so that the distance between data points of the same target is small (minimum intraclass distance) and the distance between data points of different targets is large (maximum interclass distance).
• Relief algorithm considers the correlation between features, and feature weights are used to select the features for classification. Although the Relief technique calculates classification weights easily, its results can be influenced by noise, which can lead to mistakes in the selected feature subset [21]. Relief was originally proposed by Kira and Rendell [22].
• Pareto optimization, or multiobjective optimization, aims to develop and present a set of acceptable solutions to the decision maker, who then chooses a solution from it. In some cases, the decision maker can provide additional constraints or criteria either before or after the search to help guide, refine, and narrow it, but here we consider the generic scenario in which there is no prior information from the decision maker [23].
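The entropy-based criteria in this list can be made concrete with a short sketch. It assumes discretized (categorical) feature values and uses a common greedy formulation of mRMR in which the redundancy term is the mean mutual information with the already selected features; the function names and the toy data are hypothetical, and the cited papers may define the criterion differently.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mrmr(features, target, k):
    """Greedy mRMR sketch: add the feature maximizing relevance minus mean redundancy."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def criterion(i):
            relevance = mutual_information(features[i], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[i], features[j])
                             for j in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy discretized data (invented): feature 1 is an exact copy of feature 0,
# while feature 2 is equally relevant but disagrees with the target elsewhere.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]
selected = mrmr(features, target, k=2)
```

In this toy case the duplicated feature is never chosen alongside its copy, because its redundancy with the copy cancels out its relevance; a pure relevance ranking would not make that distinction.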

B. WRAPPER FEATURE SELECTION METHODS
Wrapper methods use a classifier along with a learning algorithm to find an optimal subset of features. They search the space of the original features and select a subset of them. They are known for high computing costs and are therefore not well suited to high-dimensional datasets [24]. However, they are more effective than feature-ranking algorithms because they take the classifier hypothesis into account [25]. Figure 3 shows the wrapper steps. The most commonly used wrapper methods are described below, divided by their meta-heuristic categories based on [26]:
1) Evolutionary-based: inspired by evolutionary processes found in nature.
• Genetic Algorithm (GA) was proposed in the 1960s by John Holland. It has been used for many scientific and engineering problems and models, such as optimization, automatic programming, machine learning, economics, immune systems, ecology, population genetics, evolution and learning, and social systems [27]. GA is a heuristic search algorithm inspired by the process of natural evolution and natural selection. The algorithm has three operations: selection, crossover, and mutation. It starts with a selection operation that chooses the fittest individuals (genes), discards those that are not well suited to the present problem, and passes the chosen ones to the next generation. This is followed by the crossover operation, in which new individuals are formed by combining previously selected individuals: two randomly selected individuals exchange genes to produce offspring, from which the fittest are selected. Finally, mutation applies small random changes to the new individuals (solutions) [28].
• Flower Pollination Algorithm (FPA) [...] consists of the three main characteristics of pollination: biotic pollination, abiotic pollination, and flower constancy [29]. The FPA includes a global pollinator and a local pollinator. The feasible search space is initialized with random vectors, and each pollen item is handled as a solution [30].
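The selection, crossover, and mutation operations of a GA can be sketched for feature selection as a search over binary masks. This is a minimal sketch, not the procedure of any cited paper: the function name, the tournament selection, the elitism step, and all parameter values are assumptions, and the toy fitness below merely stands in for the classifier accuracy that a real wrapper would compute.

```python
import random


def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask with high fitness (n_features must be >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        def tournament():
            # Selection: the fitter of two randomly drawn individuals survives.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b

        children = [best[:]]   # elitism: the best mask is always carried over
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)   # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Hypothetical fitness: reward three "informative" indices, penalize subset size.
informative = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in informative)
    return hits - 0.1 * sum(mask)


best_mask = genetic_feature_search(10, toy_fitness)
```

In a wrapper setting, `toy_fitness` would be replaced by a function that trains and validates a classifier on the genes marked 1 in the mask, which is where the high computing cost of wrapper methods comes from.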
2) Swarm-based: inspired by the social behavior of animals.
• Artificial Bee Colony Algorithm (ABC) was proposed in 2005 by Karaboga [31] as a simulation of the foraging behavior of honeybees. There are three types of bees in this algorithm: employed bees, onlookers, and scouts. The number of employed bees is the same as the number of food sources. When the food source found by an employed bee becomes exhausted, this bee becomes a scout. Three steps are repeated in each cycle: (a) the employed bees and the onlookers move to the food sources, (b) the nectar amounts of the food sources are calculated, and (c) the scout bees are recruited and directed to other possible food sources [31].
• Cuckoo Search (CS) [...] It is based on the behavior of some cuckoo species, such as ani and guira, which engage in obligate brood parasitism by laying their eggs in the nests of other bird species; they may even remove other birds' eggs to increase the chance that their own will hatch [24]. The cuckoo lays one egg at a time and places it in a random nest. The nest with the highest egg quality is then carried over to the next generation. CS has a fixed number of nests, and the probability that the host bird discovers the egg is [...]. The bird can then either abandon the nest or get rid of the egg [32].
• Dragonfly Algorithm (DF) is inspired by the dynamic and static swarming behavior of dragonflies. The main concept of DF can be understood as a way of estimating the global optimum of an optimization problem [33]. In a static swarm, small groups of dragonflies hunt other insects over a small area, and the swarming behavior is characterized by local movements and abrupt changes. In dynamic swarming, by contrast, a large number of dragonflies congregate into one swarm and fly for a considerable distance in one direction [34].
• Moth Flame Algorithm (MFA) was developed in [35] as a nature-inspired algorithm. It is primarily inspired by the transverse orientation that moths use to navigate in nature. To travel long distances in a straight line, moths maintain a fixed angle with the moon while flying at night. This method works with the moon, which is far away, but it does not work with a nearby flame.
• Particle swarm optimization (PSO) is an algorithm proposed by Kennedy and Eberhart and modeled after the social behavior of bird flocks. It resembles birds migrating in flocks toward a common destination, where intelligence and efficiency come from the cooperation of the flock [36]. PSO uses particles moving in an n-dimensional space to solve an n-variable optimization problem. The particles have fitness values, which are evaluated by the fitness function to be optimized, and velocities, which control their flight. The particles travel through the problem space by following the best solutions found so far. PSO starts with a random set of particles (candidate solutions) and then searches for optimum solutions by updating them in each generation [37]. PSO is considered one of the better feature selection algorithms because it can search huge areas at low computational cost. Moreover, it is easier to build and requires fewer parameters [36].
• Firefly Algorithm (FA) uses swarm intelligence and updates based on a metaheuristic search [38]. Its major strength is solving complex optimization problems. FA simulates the behavior of real fireflies, which is based on the attraction between fireflies, which in turn depends on their brightness. A firefly algorithm must follow the three laws of firefly behavior in a real space [39].
• Bat Algorithm (BA) is based on the echolocation behavior of microbats [40]. By using echolocation, microbats can find their prey and distinguish different types of insects in the dark. Bats emit short, powerful sound waves to hunt at night and listen for the echo reflected from a barrier or prey. A bat's particular hearing apparatus can help it determine the size and location of an object [41].
• Ant Colony Optimization (ACO) is a heuristic algorithm inspired by the way ants cooperate to find food sources. Each agent in ACO simulates the real-world behavior of ants as they move from the nest to the food source [42]. The ants move in random directions, depositing a chemical called a pheromone on the ground. When the ants arrive at a path junction, the decision about which path to follow depends on the amount of pheromone on each path. If the paths are new, each is equally likely to be chosen. However, if ants have previously chosen one of the crossing paths, the probability that new ants will follow that path increases. The intensity of the pheromone decreases over time (evaporation), while the amount of pheromone increases with each ant that passes along the path (amplification) [43].
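As one concrete instance of the swarm-based family, the PSO description above (positions, velocities, and particles following the best solutions found so far) can be sketched as follows. The function name and the inertia and acceleration constants are assumptions chosen for illustration, and the objective is a simple test function rather than a classifier. For feature selection, one common adaptation (not shown here) thresholds each position coordinate to decide whether a gene is kept.

```python
import random


def pso_minimize(objective, dim, n_particles=15, iterations=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a dim-dimensional space with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position of each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # best position of the swarm

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple test objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_position, best_value = pso_minimize(sphere, dim=3)
```

The small number of tunable constants in this sketch reflects the point made above that PSO requires few parameters.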
3) Human-based: takes inspiration from human behavior and activities.
• Teaching-Learning-Based Optimization (TLBO) [...] teacher, and in the second, called the learner phase, the student learns through interaction with other learners. The student with the highest grade in the population is chosen to be the teacher during the teacher phase. The teacher is in charge of teaching the learners and raising the class's average grade. During the learner phase, each student shares knowledge with other, randomly chosen learners to improve their own knowledge. If the other learner has more knowledge than the student, the student picks up new information; otherwise, the student does not. The ultimate aim of this stage is to raise the class's mean grade. The algorithm can tackle multidimensional, linear, and nonlinear problems with a high degree of efficiency by simulating the teaching-learning process in a classroom: every feature/gene strives to learn from other features/genes to enhance itself [44].
• Learning Automata (LA) algorithm was originally designed as an imitation of the learning behavior of biological tissues, which can learn the erratic behavior of their surroundings by frequently interacting with them, thus optimizing long-term benefits. Action, feedback, learning automata, and a random environment are the four components of the LA learning framework [45].
4) Physics-based: inspired by physical laws.
• Black Hole Algorithm (BHA) was introduced by Abdolreza Hatamlou in 2013. BHA is a population-based algorithm inspired by the behavior of black holes, which attract everything around them. A black hole is a region of space with so much mass concentrated in it that no neighboring object can escape its gravitational pull; light, like everything else that falls into a black hole, cannot escape. The BHA begins by selecting a random population of possible solutions, called stars. After this initialization step, the fitness values of the population are evaluated, and the candidate with the best fitness value is chosen as the black hole. The remaining solutions are treated as stars and move toward the black hole, depending on their position and a random number. A star that crosses the event horizon is swallowed by the black hole; the search then continues with a new candidate star generated at random and placed in the search space [46]. If a moving star reaches a position whose cost is lower than that of the black hole, that star becomes the black hole, and the iteration starts again.
• Gravitational Search Algorithm (GSA) originally comes from the laws of Newtonian mechanics, which are based on an isolated system of masses and their interactions [47]. The GSA takes into consideration gravity and how it attracts other masses [48].
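The black hole procedure described above can be sketched in a few lines. This follows one common formulation, taken here as an assumption rather than as the exact method of [46]: each star moves toward the black hole by a random fraction of the distance, and the event horizon radius is the black hole's cost divided by the sum of all costs. The function name, bounds, and parameter values are hypothetical, and the cost function is assumed to be non-negative.

```python
import math
import random


def black_hole_minimize(cost, dim, n_stars=20, iterations=50,
                        low=-5.0, high=5.0, seed=0):
    """Minimize a non-negative cost function with a basic black hole search."""
    rng = random.Random(seed)

    def random_star():
        return [rng.uniform(low, high) for _ in range(dim)]

    stars = [random_star() for _ in range(n_stars)]
    costs = [cost(s) for s in stars]

    for _ in range(iterations):
        bh = min(range(n_stars), key=lambda i: costs[i])   # best star is the black hole
        for i in range(n_stars):
            if i == bh:
                continue
            r = rng.random()
            # Move the star toward the black hole by a random fraction.
            stars[i] = [x + r * (b - x) for x, b in zip(stars[i], stars[bh])]
            costs[i] = cost(stars[i])
            if costs[i] < costs[bh]:
                bh = i   # a star with lower cost becomes the new black hole
        total = sum(costs)
        radius = costs[bh] / total if total > 0 else 0.0   # event horizon
        for i in range(n_stars):
            if i != bh and math.dist(stars[i], stars[bh]) < radius:
                stars[i] = random_star()   # swallowed star is replaced at random
                costs[i] = cost(stars[i])

    bh = min(range(n_stars), key=lambda i: costs[i])
    return stars[bh], costs[bh]


def sphere(x):
    """Simple non-negative test objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_star, best_cost = black_hole_minimize(sphere, dim=2)
```

Replacing swallowed stars with random ones is what keeps the search exploring after most stars have collapsed toward the current black hole.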

5) Music-based: inspired by musical instruments.
• Harmony Search (HS) is a metaheuristic optimization algorithm based on music, developed in 2001 by Zong Woo Geem et al. [49]. It was inspired by the observation that music is the pursuit of a perfect state of harmony; finding harmony in music is analogous to finding the optimum in an optimization process.

6) No inspiration
• Crossover operation passes the properties of parents on to their offspring. A random position (the crossover point) is selected that separates each parent into two portions. Two new offspring are produced when the two portions of the parents are swapped. This crossover operation is used to develop new, better candidate solutions [50].
• Stacked Autoencoder (SA) is a deep neural network in which autoencoders, each with three layers (input, hidden, and output), are stacked to form an unsupervised pretraining stage, with the encoder layer of one autoencoder used as the input to the following autoencoder [51]. An autoencoder has two parts: an encoder and a decoder. The encoder converts input data into a hidden representation, and the decoder reconstructs the input data from the hidden representation [52].

IV. CLASSIFICATION
Classification is used to determine which category the input data belongs to. As its name implies, classification in machine learning divides data into multiple categories [53]. In classification predictive modeling, the performance of different algorithms is compared on the basis of their results. Classification accuracy is an important metric for assessing how well a model performs across the predicted classes [54].
Below is an overview of the most common classification models used so far.
• Random Forest (RF) [55] is an ensemble classification model composed of a set of closely connected decision trees. RF trees are constructed through bagging and random variable selection. Its construction principle is identical to that of decision trees, which is based on recursive partitioning. Each decision tree votes for a class based on its own criteria and variable set, and the class with the most votes is taken as the consensus.
• Support Vector Machine (SVM) classifies data by identifying a linear or nonlinear separating surface in the input space. The separating surface depends only on a subset of the original data, known as the support vectors. In a high-dimensional space, the SVM constructs a hyperplane or set of hyperplanes that can be used for classification. A good separation is achieved by the hyperplane with the greatest distance to the nearest training data points of any class, known as the functional margin. When this functional margin is large, the generalization error of the classifier is small. SVM models are based on a kernel function that transforms the input data into an n-dimensional space in which a hyperplane can be constructed to partition the data [56]. Support vector classifiers are used for classification, while support vector regression is used for regression problems [55] [57].
• Genetic Programming (GP) is an evolutionary technique for creating computer programs that represent approximate or exact solutions to a problem. GP can be regarded as a subset of GA, with the key difference being the structure of the individuals: individuals in GA are string structured, whereas those in GP are tree structured [58]. GP is based on the evolution of a population in which each individual represents a solution to the problem being solved. GP seeks the best solution using a process based on the theory of evolution: starting from an initial population of random individuals, new individuals emerge from old ones over successive generations through crossover, selection, and mutation. Strong individuals have a better chance of surviving into the next generation because of natural selection. Thus, after several generations, the best individual is determined, which corresponds to the final solution of the problem [59].
• K-Nearest Neighbors (KNN) is a nonparametric, nonlinear, and relatively simple classifier [60]. It classifies a new sample by measuring the "distance" to a set of samples held in memory. The class assigned to the new sample is that of the stored samples most similar to it (i.e., those with the smallest distance to it). The distance function commonly used in the KNN classifier is the Euclidean distance. A majority vote among the K nearest neighbors is usually performed to select the class. The parameter K must be chosen before the classifier is run [59].
• Naïve Bayes (NB) is a probabilistic algorithm that employs Bayes' theorem [61]. In probability theory, Bayes' theorem relates the conditional and marginal probabilities of two random events. It is used to calculate the posterior probabilities of given observations. A Naïve Bayes classifier assumes that features are conditionally independent given the class, meaning that, within a class, the value of one feature is unrelated to the value of any other feature.

• Artificial Neural Networks (ANN) were developed by McCulloch and Pitts in 1943 [62]. An ANN mimics the interaction between nerve cells in the brain using mathematical and computational techniques, mapping inputs to outputs to simulate real-world scenarios.
• Fuzzy classification is a rule-based classifier that offers substantial benefits in terms of functionality, analysis, and design. It assigns to the feature vector of an object one label from a set of class labels. A fuzzy classifier has the advantage that its classification rules are easier to interpret than those of traditional classifiers based on other principles. Its classification accuracy is widely used as a metric of efficiency [63].
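Of the classifiers listed above, KNN is simple enough to sketch in full. The example below uses the Euclidean distance and a majority vote, and estimates accuracy with leave-one-out cross-validation. The function names and the two-dimensional toy samples are invented for illustration; real microarray samples would have one coordinate per selected gene.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))   # Euclidean distance
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, labels, k=3):
    """Leave-one-out cross-validation: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, samples[i], k) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy samples (invented): two well-separated groups of three samples each.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

prediction = knn_predict(samples, labels, [0.1, 0.1], k=3)
accuracy = loocv_accuracy(samples, labels, k=3)
```

Leave-one-out evaluation suits the small sample sizes typical of microarray datasets, since every sample is used for testing exactly once.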

V. HYBRID FEATURE SELECTION METHODS
Hybrid feature selection methods typically combine, in sequence, two or more feature selection algorithms that use different search strategies. They aim to take advantage of both filter and wrapper techniques to overcome the disadvantages of the individual techniques and to reduce the complexity of selecting relevant features from the dataset by reducing the selection time [64]. Hybrid methods developed since 2017 include the following:
• Intelligent dynamic genetic algorithm (IDGA) [65]. Developed by Dashtban and Balafar, this is built on the concepts of genetic algorithms, artificial intelligence, random restart hill-climbing, and reinforcement learning. It comprises two steps. First, the dataset is filtered to choose the top N statistically significant genes for the next step; two alternative scoring techniques, the Fisher score and the Laplacian score, are applied. Second, the IDGA method is used to examine the significant gene subset using an SVM classifier [66]. In addition, it provides the required crossover and mutation probabilities as well as faster convergence in recognizing predictive genes [67].
• Genetic Bee Colony Algorithm (GBC) [68]. Alshamlan et al. proposed a new hybrid meta-heuristic feature selection method built from the ABC and GA algorithms, both of which are bio-inspired. The goal is to choose the most significant genes in an attempt to optimize the classifier's accuracy. The authors of GBC combine GA operators with the ABC algorithm to produce a controlled optimization approach based on the modified ABC algorithm.
• RFR-IDGA-RF [66]. Proposed by Pashaei and Pashaei, RFR-IDGA-RF is a new hybrid approach that employs random forest ranking (RFR) as a filter method together with the intelligent dynamic genetic algorithm (IDGA). RFR, which accommodates high-dimensional genomic data, is used to pick only the important variables (genes) and eliminate unwanted genes, producing a new subset. IDGA is then used to find the most informative subset within the subset produced by the filter. The RF classifier, with leave-one-out cross-validation (LOOCV), is used both in the fitness function and in the classification of the final top genes, since it performs better than the SVM classifier for microarray classification. The method was evaluated on two cancer datasets (colon and leukemia) to select the most meaningful genes. The Fisher score ranking method was used to compare the results, since the number of genes needed to reach higher accuracy is ambiguous. The difference from the Fisher score was not significant for the leukemia dataset, but it was significant for the colon dataset. The final experimental results showed an accuracy of 100% for leukemia and 95.16% for colon cancer; the authors argue that, compared with recently published work, their model is highly accurate and selects fewer genes.
• mRMR-BBHA [69]. Proposed by Pashaei and Pashaei, mRMR-BBHA is a hybrid method that combines minimum redundancy maximum relevance (mRMR) with the binary black hole optimization algorithm (BBHA) to filter out noisy data and select highly discriminative genes. It was used with the SVM classification model to diagnose cancer accurately. mRMR finds the most suitable attributes based on their relevance to the class labels while minimizing the redundancy between attributes. BBHA is employed as a search algorithm that mimics the behavior of a black hole. The method has been applied to two benchmark cancer datasets, colon and breast, showing higher accuracy than SVM with mRMR alone while using small gene subsets. For example, on the breast dataset an average of 22.5 selected genes achieved 94.48% accuracy, while on the colon dataset a classification accuracy of 98.87% was achieved with an average of only 10.33 selected genes.
• MIMAGA-Selection [70]. Lu et al. proposed MIMAGA-Selection, a new hybrid feature selection algorithm that merges mutual information maximization (MIM), which identifies genes that are heavily reliant on other genes in the same category, with the adaptive genetic algorithm (AGA), which seeks the best possible results by determining the most appropriate crossover probability and mutation probability values. MIMAGA-Selection's primary purpose is to reduce the dimension of the gene expression profile and eliminate duplicated genes. The proposed algorithm was tested on six benchmark gene datasets: leukemia, colon, prostate, lung, breast, and small-round-blue-cell tumor (SRBCT). The SVM was selected as the classifier, and the classification process was repeated 30 times. To compare accuracy, the authors ran three existing algorithms (sequential forward selection (SFS), ReliefF, and MIM) with the SVM classifier on the same datasets with the same target number of genes. According to the results, MIMAGA-Selection's accuracy was higher than that of the existing feature selection algorithms. Moreover, the authors used four different classifiers to classify the genes selected by the MIMAGA-Selection algorithm. The accuracy of all four classifiers was greater than 80%.
• Hybrid SVM-RFE and BDF [71]. In the second stage, the feature subset resulting from the first stage is used to identify and select the final optimal set by applying the GA. The feature subset from the two-stage method was evaluated using SVM-based classification on datasets for different types of cancer (colon, lung, and ovarian); the highest classification accuracy, 96.77%, was obtained on the colon cancer dataset with only 10 extracted genes.
• GALA [73]. Motieghader et al. proposed a hybrid algorithm for cancer classification that combines two wrappers, the GA and the LA. The GA was first used to assign a score to each chromosome according to its gene locations. The chromosome with the highest score, that is, the maximum fitness function value, was then located. The authors chose the SVM as the classification model for predicting cancer. The proposed approach was applied to six binary and multiclass microarray cancer datasets (colon, ALL_AML, SRBCT, MLL, tumors_9, and tumors_11) and performed well. Its mean classification accuracy on the colon dataset with 8 genes was 99.46%; the ALL_AML, SRBCT, and MLL datasets had mean classification accuracies of 100%, 97.35%, and 93.96% when selecting 2, 4, and 3 genes, respectively; and the mean classification accuracies on the tumors_9 and tumors_11 datasets with 10 genes were 86.52% and 84.38%, respectively.
• FCBF-GA and FCBF-PSO [18]. Djellali et al. proposed two new hybridized filter and meta-heuristic methods for optimal feature selection. The first one combines the fast correlation-based filter (FCBF), known to be powerful in removing unneeded and irrelevant features, with the GA. This hybrid method has two steps.
The first method, FCBF-GA, uses the FCBF to eliminate unneeded features; the GA then combines the features that the FCBF has selected with other features, since two features may be complementary and yield greater accuracy when used together. The second method, FCBF-PSO, combines the FCBF, which removes features that are not necessary or useful, with particle swarm optimization (PSO), which is well known for its global optimization ability in large search spaces and has relatively low computational complexity. The experiments were conducted on five cancer microarray datasets (Wisconsin Diagnostic Breast Cancer, colon, hepatitis, diffuse large B-cell lymphoma, and lung) using the SVM as the classifier. The results showed that the second method, FCBF-PSO, surpasses FCBF-GA and other existing methods when both accuracy and the number of selected genes are considered.
• Hybrid EGS + F-score with AGA [74]. Shukla et al.
were driven to develop a new hybrid gene selection strategy that helps decrease false positives and categorize cancers correctly in a short time. The proposed hybrid method, EGS + F-score with AGA, consists of two phases. The first phase combines the external guide sequence (EGS) method, which uses a multi-layered approach, with the F-score approach to filter noisy and redundant genes from the dataset. In the second phase, an adaptive genetic algorithm (AGA) acts as a wrapper and identifies important gene subsets from the reduced datasets produced by the EGS to help detect cancer or tumors. The developed model was tested on six cancer gene datasets (colon, breast, diffuse large B-cell lymphoma, SRBCT, lung, and leukemia), and the experiments showed accuracy greater than 98% on all of them.
• rMRMR-MBA [75]. Al-Betar, Alomari, and Abu-Romman addressed an issue facing many selection methods, namely finding the most important and dependable genes, and proposed rMRMR-MBA, a hybrid approach with a filter stage and a wrapper stage. At the filter stage, robust minimum redundancy maximum relevancy (rMRMR) selects the most promising genes by assigning a score to each gene. At the wrapper stage, a modified bat algorithm (MBA) acts as the search algorithm and sorts the genes by their scores to identify a small set of informative features. rMRMR-MBA was evaluated on 10 cancer gene expression datasets (breast, MLL, colon, ALLAML, ALLAML-3C, ALLAML-4C, lymphoma, CNS, ovarian, and SRBCT); accuracy was 100% for 8 datasets (MLL, ALLAML, ALLAML-3C, ALLAML-4C, lymphoma, CNS, ovarian, and SRBCT), 97.65% for the colon dataset, and 95.4% for the breast dataset. What makes the proposed hybrid method promising is that fewer than 10 genes were selected for each of the 8 datasets that reached 100% accuracy.
• GAABC [76]. Ge et al. wanted to solve the dimensionality problem of microarray data classification.
Their hybrid method, GAABC, merges the artificial bee colony (ABC) algorithm with the GA to enhance the GA's ability to move from local to global search and to increase the diversity of bee populations. The ABC has known defects of premature convergence and an early fall into local extrema, which is why it was combined with the GA. The experimental results showed 80% accuracy, lower than that of other proposed hybrid methods.
• CSC [50]. Sampathkumar et al. proposed a new hybrid bio-inspired algorithm combining two wrapper methods: cuckoo search, which is used to find the significant cancer-causing genes, and a crossover operator, a useful technique for luring populations away from local minima, which addresses a known weakness of cuckoo search. The authors applied the cuckoo search with crossover (CSC) selection method with the KNN classifier. Five cancer gene expression datasets (prostate, colon, leukemia, lung, and lymphoma) were used in the experiment. The results showed that the CSC surpasses other well-known methods, yielding a classification accuracy of 99% for the prostate, lung, and lymphoma datasets, 96.98% for leukemia, and 98.54% for colon.
• FFF [77]. Almugren and Alshamlan improved a previously developed wrapper to generate a new hybrid selection method called fuzzy firefly (FFF), which consists of a filtering phase and a gene selection phase.
In the filtering phase, the F-score is used to reduce the dimensionality of the data and the complexity of the search space. In the gene selection phase, a wrapper method called FF is applied to locate the more informative genes. The experiments were carried out on five microarray cancer datasets (leukemia_2, SRBCT, lung, leukemia_1, and colon) with both binary and multiclass labels. Experimental results show that FFF-SVM has 100% accuracy for the lung, leukemia_1, and SRBCT datasets, with fewer than 10 selected genes. The accuracy was 97.8% for the leukemia_2 dataset and 94.3% for colon.
• CMIM + BGA [78]. The CMIM + BGA algorithm was evaluated using a number of classifiers on five biological datasets and five University of California at Irvine datasets of different dimensionalities and numbers of instances. The authors ran filter-wrapper feature selection with four different classifiers (SVM, DT, KNN, and NB). According to the findings of the evaluation, the proposed method provides adequate support for major feature reduction and outperforms existing methods. Among the classifiers, KNN had the lowest classification accuracy, 40.04%, and the SVM the highest, 99.32%; among the datasets, SRBCT had the lowest classification accuracy, 61.24%, and diffuse large B-cell lymphoma (DLBCL) the best, 99.32%.
• TLBOGS [79].
Shukla, Singh, and Vardhan wanted to lessen the dimensionality issue in microarray gene data and increase the interpretability of discriminative gene data, so they produced the TLBOGS hybrid wrapper method, which integrates the properties of teaching-learning-based optimization (TLBO) and the gravitational search algorithm (GSA). The TLBO, introduced in 2011, offers great potential for identifying gene subsets with near-optimal properties in high-dimensional spaces, but it has a few limitations, such as premature and slow convergence. To solve these issues, the GSA is used, since it has excellent global search capability.
The results of the experiments show that the proposed method has higher classification accuracy and selects smaller feature sets than current approaches. The proposed method achieves greater than 98% accuracy on six datasets (leukemia_2, colon, DLBCL, SRBCT, lung, and prostate), with the greatest accuracy, 99.62%, on the DLBCL dataset.
• ABC-PSO and ABC-GA [80]. Djellali et al. proposed two hybrid methods based on the artificial bee colony (ABC). The first method, ABC-PSO, combines the ABC with particle swarm optimization (PSO) to improve the search capability of the ABC bees when they find no food source and to give greater stability between exploration and exploitation. The second method, ABC-GA, combines the ABC with the genetic algorithm (GA) to find a good balance between exploration and exploitation; each chromosome is a possible solution, and the collection of chromosomes forms a population. GA mutation operators are used in the onlooker and scout phases. The experimental results indicate that the proposed hybrid ABC-GA method is competitive with existing methods and outperforms ABC-PSO in identifying and classifying the Wisconsin Diagnostic Breast Cancer, colon, hepatitis, and DLBCL cancer datasets with the smallest number of features. The experiments illustrate that mutation operators are effective in terms of accuracy, and particle swarm in terms of smaller feature sets. Although ABC-PSO has the lowest accuracy, this hybrid method tends to produce fewer gene features.
• MRMR-FPA and MRMR-GA [81]. Alomari et al.
wanted to solve the gene selection issue, since the sheer number of genes and the small number of patient samples make it difficult for classifiers to produce appropriate classification results. The majority of these genes are redundant and unnecessary, which may impair classification. The authors proposed MRMR-FPA, which uses minimum redundancy maximum relevancy (MRMR) as the filter method and the flower pollination algorithm (FPA) as the wrapper method to determine the most informative gene subset. To evaluate MRMR-FPA, the authors developed another method, MRMR-GA, which uses MRMR as the filter method and the GA as the wrapper method. The experiments were conducted on three microarray cancer gene datasets (colon, ovarian, and breast). The performance of MRMR-FPA and MRMR-GA was similar on the ovarian and breast datasets. However, MRMR-GA had a higher classification accuracy on the colon dataset. The comparison revealed that MRMR-FPA was able to achieve similar classification accuracy with fewer selected genes. This gives MRMR-FPA the potential to overcome the gene selection issue.
• ReliefF [82]. Kilicarslan, Adem, and Celik developed a hybrid method called ReliefF for dimension reduction and classification that combines the Relief method with the stacked autoencoder. Relief compresses the data to save storage space and reduces computing complexity, but it often results in data loss; this loss can be addressed by applying the stacked autoencoder as a wrapper to acquire new characteristics from the outputs of the hidden layers. ReliefF was then used with convolutional neural networks (CNNs) for classification. The developed method was tested on the ovarian, leukemia, and CNS microarray datasets, on which it had classification accuracies of 98.6%, 99.86%, and 83.95%, respectively.
• MOBBA-LS [83].
The authors proposed MOBBA-LS, a novel bio-inspired multi-objective algorithm for identifying informative genes. It employs the multi-objective binary bat algorithm (MOBBA) together with specific local searches based on the BA and a Fisher criterion. MOBBA-LS uses the fast non-dominated sort algorithm to locate the leader bats, which are the final solutions and, in theory, members of the first front of the multi-objective outcome. The proposed method was tested on three microarray cancer datasets: leukemia, SRBCT, and prostate. It achieved the best accuracy on the prostate cancer dataset while using a much smaller number of genes.
• Hybrid stem cell (HSC) [84]. For constructing fuzzy classification systems, Vijay and Ganeshkumar developed a novel hybrid stem cell (HSC) algorithm that combines ant colony optimization (ACO) and the stem cell algorithm with mutual information (MI), a strategy for extracting the most informative genes from a large microarray dataset. MI is used first to reduce the gene dimensionality. Using the ACO algorithm, the HSC rule set is represented by integer values. The simulated performance results of the proposed approach were validated using several microarray datasets. These findings show that the proposed HSC algorithm generates a more precise fuzzy system than existing methods.
• MIM-mMFA [85]. Dabba et al. combined the modified moth flame algorithm (mMFA) and mutual information maximization (MIM) to build MIM-mMFA for gene selection in microarray data classification. MIM is used as a prefilter to determine the significance of the genes and remove duplicated genes, and the mMFA is used to select gene subsets and score them with a fitness function computed by an SVM with LOOCV.
• MI-IBGSA [86]. Yan et al. used MI to rank and select features for the wrapper method's population based on their significance.
The gravitational search algorithm (GSA) is then employed to find an optimal feature subset based on its efficiency. While the GSA has limitations in terms of its search speed and premature convergence, it remains a powerful optimization algorithm.
• ICA + ABC + NB [87]. Musheer et al. developed a novel feature selection methodology that consists of two steps: the independent component analysis (ICA) extraction method and the artificial bee colony (ABC) wrapper approach, with naive Bayes (NB) as the classifier. A major advantage of ICA is that the number of extracted features is always equal to the number of samples in the dataset. ICA has one issue, however: it does not indicate which subset of the extracted features is the best.
To solve this issue, the authors used the ABC as a wrapper method to select the best subset of features.
• rMRMR-MGWO [88]. rMRMR-MGWO is a new gene selection method for microarray data classification composed of two phases, a filter and a wrapper. Robust minimum redundancy maximum relevancy (rMRMR) is applied as the filter, and the modified gray wolf optimizer (MGWO) with an SVM classifier is used as the wrapper to find the optimal subset of genes. The authors enhanced the GWO with a TRIZ optimization mechanism to improve the exploration ability of the wolves, so that the important genes are selected and classification performance increases.
• Pareto Optimization + AHS [89]. As a solution to the high-dimensional dataset issue, Dash proposed a two-stage hybrid feature selection method. In the first stage, an AHS-based probability distribution factor was used to determine the optimal gene ranking. In the second stage, Pareto optimization was applied as a feature selection method to select a minimum number of top-ranked genes. To evaluate the proposed method and check which classifier gives the best results, three classifiers (KNN, NB, and SVM) were used. The results show that the SVM classifier performs better than the other classifiers, reaching 100% accuracy on most datasets.
• ICA + ABC + ANN and ICA + GBC + ANN [90].
In this paper, the authors examine artificial neural networks (ANNs) with two different hybrid algorithms. These combine independent component analysis (ICA), an algorithm commonly used for filtering, with two bio-inspired approaches: the artificial bee colony (ABC) and the genetic bee colony (GBC). Dataset dimensionality was reduced using ICA. Five different datasets were analyzed to test the proposed algorithms' performance.
According to the findings, ICA + GBC produced higher accuracy and selected fewer genes for the microarray datasets.
• mRMR-SARA [91]. Santos Kumar Baliarsingh combines simulated annealing (SA) and the Rao algorithm (RA). In the hybrid technique, SA handles local search and RA handles global optimization. To select relevant gene subsets from the microarray dataset, the proposed method uses minimum redundancy maximum relevance (mRMR). The method was tested on five binary-class and multiclass datasets. The authors found that adding RA optimization to the training method increased the accuracy of the model.
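
Most of the methods listed above share a two-stage structure: a filter ranks the genes, and a wrapper then searches the reduced set using classifier accuracy as its fitness. The Python sketch below illustrates that structure only; it does not reproduce any specific surveyed method. The F-score filter, the small GA (truncation selection, one-point crossover, bit-flip mutation), the 1-nearest-neighbour classifier standing in for the SVM or RF, the 0.01 penalty per selected gene, the toy data, and all function names are assumptions made for illustration.

```python
import random


def f_score(column, labels):
    """Two-class F-score of one gene: between-class spread over within-class variance."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    mean = sum(column) / len(column)
    m_pos, m_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    var_pos = sum((v - m_pos) ** 2 for v in pos) / (len(pos) - 1)
    var_neg = sum((v - m_neg) ** 2 for v in neg) / (len(neg) - 1)
    return ((m_pos - mean) ** 2 + (m_neg - mean) ** 2) / (var_pos + var_neg + 1e-12)


def filter_stage(X, y, k):
    """Filter: keep the indices of the k genes with the highest F-score."""
    n_genes = len(X[0])
    scores = [f_score([row[g] for row in X], y) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def loocv_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `genes`."""
    if not genes:
        return 0.0
    correct = 0
    for i in range(len(X)):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((X[i][g] - X[j][g]) ** 2 for g in genes))
        correct += y[nearest] == y[i]
    return correct / len(X)


def ga_wrapper(X, y, candidates, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Wrapper: a small GA over bit masks of the filtered candidate genes."""
    rng = random.Random(seed)

    def decode(mask):
        return [g for g, bit in zip(candidates, mask) if bit]

    def fitness(mask):
        # Assumed trade-off: LOOCV accuracy minus a small penalty per selected gene.
        genes = decode(mask)
        return loocv_accuracy(X, y, genes) - 0.01 * len(genes)

    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]                # truncation selection (elitist)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(candidates))  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - bit if rng.random() < p_mut else bit for bit in child])
        pop = parents + children
    return decode(max(pop, key=fitness))


def make_toy_data(n_samples=20, n_genes=30, seed=1):
    """Synthetic matrix (rows = samples, columns = genes); only gene 0 is informative."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
    for row, label in zip(X, y):
        row[0] += 10 * label
    return X, y


X, y = make_toy_data()
candidates = filter_stage(X, y, k=5)     # filter stage: 30 genes -> 5 candidates
selected = ga_wrapper(X, y, candidates)  # wrapper stage: search the 5 candidates
print("filtered:", candidates, "selected:", selected)
```

The filter keeps the wrapper's search space small (here 2^5 masks instead of 2^30), which is the usual motivation for placing a filter in front of a meta-heuristic. In the surveyed papers, the fitness is typically computed with an SVM, KNN, or RF rather than the 1-NN stand-in used here.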

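Several of the surveyed hybrids (mRMR-BBHA, rMRMR-MBA, MRMR-FPA, rMRMR-MGWO, and mRMR-SARA) use minimum redundancy maximum relevance as the filter. As a rough illustration of the criterion only, the sketch below greedily adds the gene whose mutual information with the class labels, minus its mean mutual information with the genes already chosen, is largest. It assumes that expression values have already been discretized; the difference form of the score, the function names, and the toy data are assumptions made here, not details taken from the cited papers.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


def mrmr(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized expression levels for eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 0],  # strongly related to the labels
    "g2": [0, 0, 0, 0, 1, 1, 1, 0],  # exact duplicate of g1 (fully redundant)
    "g3": [0, 0, 1, 1, 1, 1, 1, 1],  # weaker, but adds information beyond g1
}
top_two = mrmr(features, labels, k=2)
print(top_two)
```

In this toy case g2 is as relevant as g1 but is skipped in the second step because it is fully redundant with g1, so the weaker but complementary g3 is chosen instead.
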
VI. ANALYSIS AND DISCUSSION
The purpose of this overview of previous studies is to investigate current research in hybrid gene feature selection approaches and to learn about the research community's tendencies over the five-year period from 2017 to 2021. When using the table of recent studies to compare results, we must keep in mind that only studies with similar datasets, feature selection methods, and classifiers are directly comparable, and the previous studies show few such similarities. Nevertheless, we conclude the following:
1) Many cancer datasets are used to predict from genes whether a person could develop cancer, and the most-used cancer dataset types are, in order, colon [90], SRBCT [88], colon [88], as shown in Figure 4.
2) Figure 5 shows that most of the studies listed combined filter and wrapper feature selection methods into new hybrid methods. The filter method was used to improve the performance of the wrapper method, except in four studies, which used only wrapper methods.
3) Two studies, [72] and [73], achieved 100% accuracy with the colon [92] dataset with 20, 8, 9, and 10 genes. However, when the proposed methods were tested with other datasets, the accuracy was less than 100%. Also, when the same colon [92] dataset was used in other studies, the accuracy was 83% in [70], 97% in [71], 94% in [77], and 60%-83% in [78]. This might be due to the large number of selected genes, as [70] and [71] select 202 and 510 genes, respectively, while [77] selects 15 and [78] selects 21. The lower accuracy might also be attributable to the chosen feature selection methods, since all studies that used colon [92] use the SVM classifier.
4) Among all wrapper methods, the genetic algorithm is the most widely used, as presented in Figure 6. With a small number of selected genes, the genetic algorithm obtains the best accuracy.
5) The most-used classifier in the literature reviewed is the SVM, followed by KNN and NB, as shown in Figure 7. The classifier that gave the highest accuracy while selecting a low number of genes was the SVM. However, the SVM gives the worst accuracy in MOBBA-LS [83]. Moreover, only one study [82] uses a CNN as the classifier, and it did not specify the number of selected genes.
6) Hybrid SVM-RFE and BDF gives great accuracy, but the number of selected genes is higher than 1000, the largest of any hybrid method.
7) Figure 8 presents the meta-heuristic categories of the most common wrapper methods. The methods used are swarm-based and evolution-based, since these can produce high accuracy while selecting fewer genes.
8) Looking at [71], we can see that even though the authors used BDF, which is considered a good bio-inspired method that has shown promising results in other papers, it selected a huge number of genes. We will try to improve the dragonfly method to select fewer genes.
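
Observation 3 compares studies on two criteria at once: accuracy and the number of selected genes. One way to make such a comparison explicit is a Pareto dominance check, sketched below in Python on the colon-dataset figures quoted in that observation for [70], [71], and [77] ([78] is left out because only an accuracy range is given). The function names are illustrative, and the check says nothing about differences in experimental protocol between the studies.

```python
# (accuracy in %, number of selected genes) on the colon dataset, as quoted in observation 3.
colon_results = {
    "[70]": (83.0, 202),
    "[71]": (97.0, 510),
    "[77]": (94.0, 15),
}


def dominates(a, b):
    """True if a is at least as good as b on both criteria and strictly better on one."""
    acc_a, genes_a = a
    acc_b, genes_b = b
    no_worse = acc_a >= acc_b and genes_a <= genes_b
    strictly_better = acc_a > acc_b or genes_a < genes_b
    return no_worse and strictly_better


def pareto_front(results):
    """Names of the results that no other result dominates."""
    return sorted(name for name, r in results.items()
                  if not any(dominates(other, r) for other in results.values()))


front = pareto_front(colon_results)
print(front)
```

On these figures, [77] dominates [70] (higher accuracy with fewer genes), while [71] and [77] are mutually non-dominated: [71] is more accurate, and [77] uses far fewer genes.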

VII. CONCLUSION
Hybrid methods have grown in popularity in recent years. This paper presented and described recently proposed hybrid feature selection methods (filter/wrapper and wrapper/wrapper) and compared their performance with respect to accuracy and the number of selected genes. While various models have been proposed to solve the dimensionality issue of microarray gene expression profiles, we only looked at papers published between 2017 and 2021 that proposed a hybrid feature selection method, in order to maintain an appropriate number of papers and focus on recent years. We found that the GA is the most commonly used wrapper method. In addition, we looked at which datasets are most commonly used to evaluate the developed methods; these are, in order, colon [90], SRBCT [88], colon [88]. The classifier most commonly used for cancer classification is the support vector machine (SVM).

She is currently an Assistant Professor in the Information Technology Department, College of Computer and Information Sciences, King Saud University (KSU). She is interested in data science and big data analytics. She developed many novel algorithms that discover cancer biomarkers from genomic data. All these algorithms have been published in high-impact journals. Her research interests include bioinformatics, especially how artificial intelligence techniques and machine learning approaches can be applied to the analysis of biological data.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3185226