An Efficient SVM-Based Feature Selection Model for Cancer Classification Using High-Dimensional Microarray Data

Feature selection is critical in analyzing microarray data, which has many features (genes) or dimensions. However, with only a few samples available, the large search space and the time consumed by the search make selecting relevant and informative genes that improve classification performance a complex task. This paper proposes a hybrid gene selection model, SVM-mRMRe, which provides a framework for combining filter-based, ensemble, and embedded methods to select the most relevant and informative genes from high-dimensional microarray data by fusing embedded SVM coefficients (feature ranking) with ensemble mRMRe. Eight of the most commonly used microarray datasets for various types of cancer were used to evaluate the model. The selected feature subset is evaluated by four different types of classifiers: random forest (RF), multilayer perceptron (MLP), k-nearest neighbors (k-NN), and support vector machine (SVM). The experimental results show that the proposed model reduces time consumption and dimensionality and improves the differentiation of cancer tissues from benign tissues. Furthermore, the genes selected for the brain cancer dataset are biologically interpreted; they agree with the findings of relevant biomedical studies and play an important role in patient prognosis.


I. INTRODUCTION
One of the leading causes of death worldwide is cancer [1]. Microarray-based gene expression profiling has proven to be an effective technique for cancer diagnosis, prognosis, and treatment [2]. DNA microarray technology is a significant tool that enables researchers to track the level of gene expression in an organism [3]. Microarrays measure the interactions of thousands of genes simultaneously and create a global picture of cellular function [4]. However, analyzing DNA microarray data is difficult for a variety of reasons. First, DNA microarray experiments usually produce many features for a small number of patients, resulting in a high-dimensional dataset: a small number of samples contains several hundred or even thousands of genes (features). Second, the classification of microarray data is computationally complex and so requires efficient and fast classification algorithms. Third, gene expression data is highly complex; genes are directly or indirectly correlated with one another, making classification a difficult task that typically necessitates the use of a powerful and accurate feature selection technique. To overcome these limitations, gene recognition or disease diagnosis using DNA microarray data requires a robust feature selection method and adequate classifiers.
The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The goal of gene (feature) selection is to reduce the complexity of the feature space while also identifying a small subset of distinct genes from a larger set, yielding not only better classification accuracy but also biologically meaningful insights [5]. The main aim of this study is to select a subset of informative and relevant genes that agree with the findings of related biomedical research, while eliminating irrelevant or redundant genes and improving the classification performance on high-dimensional microarray data.
This paper presents a hybrid feature selection model called SVM-mRMRe that combines two methods: ensemble minimum redundancy-maximum relevancy (mRMRe) feature selection [6] and the support vector machine (SVM) technique as an embedded method. To evaluate the proposed SVM-mRMRe model, eight of the most frequently used microarray datasets for various types of cancer are used. The model is evaluated using four different classifiers: SVM [7], random forest (RF) [8], k-nearest neighbor (KNN) [9], and multilayer perceptron (MLP) [10]. According to the experimental results, the proposed method outperforms the existing standard algorithms in classification accuracy and execution time. Furthermore, the genes selected from the brain cancer dataset are biologically interpreted and match the results of related biomedical studies. This paper's main contributions are as follows:
• The proposed SVM-mRMRe model selects the most informative and relevant genes (features).
• The proposed model is compared to the current SVM-RFE method. The findings show that feature selection with SVM-RFE takes a long time. Our proposed model addresses this problem through the following stages:
• In the first stage, a linear SVM is used as a feature (gene) selector that takes feature interaction into account. In the second stage, the feature subset output by the SVM is fed into a support vector machine that performs recursive feature elimination with cross-validation (SVM_RBF_CV). As a result, a preliminary list of informative features is generated.
• The ensemble mRMRe selects non-redundant genes that are relevant to the biological context, leading to more detailed biological interpretations. Its output gene subset is then combined with that of SVM_RBF_CV, and a voting process is applied to obtain the unique, informative genes with high relevance and minimum redundancy.
• The feature subset selected for brain cancer is biologically interpreted, and it agrees with the outcome of relevant biomedical studies.
• We also present a comprehensive review of various filter and classification methods related to this work, particularly for cancer microarray data analysis.
The paper is organized as follows. Section 2 reviews related work. The methods used are described in detail in Section 3. The proposed model is presented in Section 4. Section 5 discusses the experimental findings on publicly available cancer microarray datasets. Section 6 concludes the paper.

II. RELATED WORK
In recent years, significant research effort has gone into the classification of cancer microarray data using different feature selection techniques, with feature selection playing an important role in cancer classification. We provide an overview of this work as context for the research discussed in this paper. Table 1 summarizes some of the previous research methods for microarray cancer classification.
Cancer classification accuracy is considered in all these previous studies without disclosing biological information on the cancer classification process. The SVM-mRMRe model aims to close the gap between the classification and biological interpretation of cancer by improving accuracy and selecting significant genes that agree with pertinent biomedical studies.

A. FEATURE SUBSET SELECTION
Feature subset selection, also known as gene subset selection, removes irrelevant or redundant features. In certain instances, this is an NP-hard (nondeterministic polynomial-time hard) problem [19]. The chosen subset of features should obey Occam's razor and have the best value in terms of some objective function. There are three different kinds of feature selection algorithms [20]:
a) Filters extract features from the data without prior information and without considering the classifier. Therefore, they are highly efficient in terms of computation. They are split into two categories: univariate and multivariate. Multivariate techniques can discover relationships between features; mRMR is one such multivariate approach.
b) Wrappers use machine learning techniques to evaluate which features are useful. Wrappers tend to select features well because they train and evaluate a model over the feature space, taking the model hypothesis into account. As a result, their main drawback is computational inefficiency.
c) Embedded approaches combine the steps of feature selection and classifier construction. In terms of computational efficiency, embedded approaches outperform wrappers, but they make classifier-specific decisions that may not suit any other classifier. SVM recursive feature elimination (SVM-RFE) is an embedded method.

B. SUPPORT VECTOR MACHINES
The support vector machine (SVM) is a classification algorithm based on statistical learning theory [21], [22]. SVM has long been praised for its superior classification performance and intrinsic feature selection ability, so SVMs may be used to select features as well as to classify samples. In each round, features that do not contribute to classification are omitted until no further change in classification is feasible [23]. In our model, we use a linear SVM as a simple gene (feature) selector because of the high dimensionality of microarray data.
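For a linear kernel SVM with kernel
$$K(x, y) = x^{T} y, \tag{1}$$
where x and y are points in a d-dimensional Euclidean space, the margin width can be determined using (2)-(3):
$$w = \sum_{i=1}^{N_s} \alpha_i y_i x_i, \tag{2}$$
$$\text{margin width} = \frac{2}{\lVert w \rVert}, \tag{3}$$
where $N_s$ denotes the number of support vectors, which are the training samples with $0 < \alpha_i \le C$, and $y_i$ in (2) is the class label of support vector $x_i$.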
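The margin computation described above can be illustrated with a short sketch. It assumes the standard linear-SVM relations, namely a weight vector w = sum of alpha_i * y_i * x_i over the support vectors and a margin width of 2 / ||w||. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import math


def weight_vector(alphas, labels, support_vectors):
    """Assumed standard form: w = sum_i alpha_i * y_i * x_i over support vectors."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, support_vectors):
        for k in range(dim):
            w[k] += alpha * y * x[k]
    return w


def margin_width(w):
    """Margin width of a linear SVM: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(v * v for v in w))


# Hypothetical toy problem: two support vectors on the first axis.
svs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [+1, -1]
alphas = [0.5, 0.5]

w = weight_vector(alphas, ys, svs)
width = margin_width(w)
print(w, width)
```

In this toy case the two support vectors lie two units apart, so the computed width equals the geometric gap between the classes.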

C. SUPPORT VECTOR MACHINE RECURSIVE FEATURE ELIMINATION AND CROSS-VALIDATION (SVM-RFE-CV)
Guyon et al. proposed SVM-RFE for ranking genes from gene expression data for cancer classification [24]. The SVM-RFE algorithm computes a ranking coefficient from the SVM's weight vector during training and eliminates the feature with the smallest ranking coefficient in each iteration, until all features have been ranked in decreasing order of importance. Small variations in the training set may cause the feature elimination process to fail: features selected on the training set may not perform well on an independent testing set. Zhang et al. [25] used a leave-one-out cross-validation approach to enhance the reliability and robustness of SVM-RFE. The SVM recursive feature elimination approach based on cross-validation (SVM-RFE-CV) can be summarized as follows.
Input: the training samples {xi, yi}, yi ∈ {−1, +1}.
Output: the feature ordering set R.
1) Initialization: the feature ordering set R = ∅ and the initial feature set S = {1, 2, . . . , D}.
2) Repeat the following until S = ∅:
a) Obtain the training set restricted to the candidate feature set S.
b) Train the SVM classifier to get ω.
c) Compute the ranking criterion for each feature k in S: $c_k = \omega_k^2$.
d) Find the feature with the smallest ranking criterion: $p = \arg\min_k c_k$.
e) Update the feature ordering set: R = {p} ∪ R.
f) Delete this feature from S: S = S \ {p}.
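The elimination loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ranking criterion is assumed to be the squared weight, the cross-validation wrapper is left out, and `fit_weights` is a hypothetical callable standing in for "train a linear SVM and return one weight per feature". The demo plugs in a simple class-mean difference, which is explicitly not an SVM and is used only so that the sketch runs without third-party libraries.

```python
def rfe_ranking(X, y, fit_weights):
    """Recursive feature elimination; returns feature indices, best first."""
    remaining = list(range(len(X[0])))  # S: indices of surviving features
    ranking = []                        # R: ranked list, most important first
    while remaining:
        # Restrict the training data to the surviving features.
        sub = [[row[j] for j in remaining] for row in X]
        w = fit_weights(sub, y)
        scores = [wk * wk for wk in w]  # assumed criterion: c_k = w_k^2
        worst = min(range(len(scores)), key=scores.__getitem__)
        # The feature removed last ends up at the front of the ranking.
        ranking.insert(0, remaining.pop(worst))
    return ranking


def mean_difference_weights(X, y):
    """Stand-in for SVM training (NOT an SVM): per-feature class-mean gap."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    return [
        sum(r[k] for r in pos) / len(pos) - sum(r[k] for r in neg) / len(neg)
        for k in range(len(X[0]))
    ]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1, 1.0], [1.8, -0.1, 0.0], [-2.0, 0.1, 0.0], [-1.9, -0.1, -1.0]]
y = [1, 1, -1, -1]
order = rfe_ranking(X, y, mean_difference_weights)
print(order)
```

With a real linear SVM the weights are refitted after every removal, so the ranking can differ from a one-shot ordering of the initial weights; that refit is the point of the loop. Where scikit-learn is available, its `RFECV` class with a linear-kernel SVM as the estimator provides a cross-validated version of this procedure.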

D. ENSEMBLE MINIMUM REDUNDANCY-MAXIMUM RELEVANCE (mRMRe) FEATURE SELECTION
mRMR is a filter-type feature selection approach that maximizes the correlation between the features and the class variable while minimizing the correlation among the features, in order to obtain the best feature set. The issue is that, like all feature selection algorithms in a low sample-to-dimensionality ratio setting, mRMR produces results that are difficult to interpret: small variations in the sample data frequently result in radically different sets of chosen features, so the results are highly unstable. Ensemble mRMR (mRMRe) feature selection is a variation of mRMR that creates several feature sets rather than a single feature list. The mRMRe package also provides a function for calculating a mutual information matrix (MIM) using the appropriate estimator for each variable type. The mRMR methodology [6], as implemented in the classic mRMR function, allows rapid detection of relevant and non-redundant features [26]. Given the set S of already selected genes, the next gene i to be added is the one that is most relevant to the class and least redundant with the genes in S.
Two ensemble methods are used in ensemble feature selection (mRMRe): exhaustive and bootstrap ensemble mRMR. The exhaustive variant extends the mRMR heuristic by starting several feature selection procedures from the k > 1 most relevant features. In this way, k mRMR solutions are created in parallel, with the first feature of each guaranteed to be different. The bootstrap variant resamples the original dataset (with replacement) to produce k bootstraps and then performs classical mRMR feature selection on each bootstrapped dataset in parallel, yielding k mRMR solutions.
The proposed model applies SVM-RFE-CV to the gene subset output by the SVM and mRMRe to the original data set. It then combines (shuffles) the output gene subsets of both algorithms and applies a voting process to the resulting gene (feature) subset to obtain the final informative feature subset with high relevance, high importance, and minimal redundancy.
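As an illustration of the classical greedy mRMR step described above, the following sketch scores each candidate gene by its relevance to the class minus its mean redundancy with the genes already selected. Two assumptions are made here that the text does not spell out: the difference form of the criterion, and a correlation-based mutual information estimate, I = -0.5 ln(1 - rho^2), which is exact only for jointly Gaussian variables. Gene names and values are hypothetical, and the ensemble variants (exhaustive, bootstrap) are not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def mutual_info(a, b):
    """Assumed Gaussian estimate of mutual information from correlation."""
    rho2 = min(pearson(a, b) ** 2, 1.0 - 1e-12)  # clip so the log stays finite
    return -0.5 * math.log(1.0 - rho2)


def mrmr(features, target, n_select):
    """Greedy mRMR; `features` maps gene name -> list of expression values."""
    relevance = {g: mutual_info(col, target) for g, col in features.items()}
    selected = [max(relevance, key=relevance.get)]  # most relevant gene first
    while len(selected) < n_select:
        def score(g):
            redundancy = sum(
                mutual_info(features[g], features[s]) for s in selected
            ) / len(selected)
            return relevance[g] - redundancy
        candidates = [g for g in features if g not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical data: g_b nearly duplicates g_a; g_c is weaker but different.
target = [-1, -1, -1, -1, 1, 1, 1, 1]
features = {
    "g_a": [-1.0, -1.1, -0.9, -1.0, 1.0, 1.1, 0.9, 1.0],
    "g_b": [-1.0, -1.2, -0.8, -1.0, 1.0, 1.2, 0.8, 1.0],
    "g_c": [-1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0],
}
chosen = mrmr(features, target, 2)
print(chosen)
```

In this toy run the near-duplicate gene is passed over in favor of the weaker but non-redundant one, which is the behavior the redundancy term is meant to produce.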

IV. THE PROPOSED MODEL (SVM-mRMRe)
The proposed SVM-mRMRe model for identifying informative genes from high-dimensional microarray data is defined in this section and shown in Fig. 1. The model has two stages. In the first stage (inner election), the data are partitioned using k-fold cross-validation (k = 8) to avoid overfitting. The training folds are used to train the SVM classifier, and the testing part is used to evaluate the final model. In this stage, the SVM classifier is used as a feature selector through the following steps:
A. During training, the SVM generates a coefficient (weight) for each gene; these coefficients are used to predict the class targets of unseen (test) data.
B. This yields eight different weight vectors, since the weight of the same gene differs from one fold to another. To overcome the Cutpoint Partition Problem, that is, the choice of a threshold value, we calculate the mean of the coefficient (weight) vector for each fold separately and use it as the threshold (cutpoint). In each fold, genes with a coefficient below the mean are removed, and genes with a coefficient above the mean are considered important. A new mean is then calculated over the filtered genes in each fold, and the genes with coefficients above the new mean are selected as the important genes.
C. Step B produces eight gene lists, one per fold, with genes repeated across the folds. The inner election over these important genes has two parts. First, the genes of all eight folds are merged and the unique genes are kept. Second, each unique gene is ranked by a voting process: for example, gene A receives one vote if it is selected in fold 0, a second vote if it is also selected in fold 1, and so on for all folds and all genes.
D. Step C yields a vector of gene names and their ranks. As a new threshold, we keep the genes that satisfy gene_importance > 7 (that is, genes represented in all N folds); these form final_svm_genes. One advantage of the SVM is that it considers the interaction between genes during training, but it does not account for irrelevant or redundant genes. To overcome that, we use the filter method mRMRe in the second stage.
E. In the second stage, mRMRe is applied to the original microarray data. Choosing a threshold for mRMRe is a challenge, so we suggest setting threshold = final_svm_genes from the previous step, as in Figure 1.
F. Steps D and E yield two gene lists. To select the genes most relevant to the target class, we merge the two lists (final_svm_genes, mRMRe-genes) through an arbitration process with two stages: 1. Merge the two lists and select the unique genes. 2. Apply a voting process in which a gene A receives one vote for each list in which it appears, so a gene found in both final_svm_genes and mRMRe-genes receives two votes. As the new threshold, we keep the genes with a voting value >= 1.
G. SVM-RFE-CV is applied to select the final gene subset with high performance. The final subset consists of genes that are important and informative, with high relevance and minimum redundancy. A detailed description of the SVM-mRMRe implementation is given in Algorithm 1.
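The thresholding and voting steps of the model can be sketched in a few lines. This is an illustrative reading of the procedure rather than Algorithm 1 itself: gene importance is assumed to be a non-negative value per gene (for example, an absolute SVM coefficient), three folds are used instead of eight to keep the example short, and all gene names, values, and function names are hypothetical. The final SVM-RFE-CV step is not shown.

```python
from collections import Counter


def mean_filter_twice(importance):
    """Keep genes above the fold mean, then above the mean of the survivors."""
    kept = dict(importance)
    for _ in range(2):
        if not kept:  # nothing left to filter
            break
        cut = sum(kept.values()) / len(kept)
        kept = {g: v for g, v in kept.items() if v > cut}
    return set(kept)


def elect(gene_lists, min_votes):
    """One vote per list containing the gene; keep genes reaching min_votes."""
    votes = Counter()
    for genes in gene_lists:
        votes.update(set(genes))
    return {g for g, v in votes.items() if v >= min_votes}


# Hypothetical per-fold importance values (3 folds here; the paper uses 8).
folds = [
    {"A": 0.90, "B": 0.88, "C": 0.50, "D": 0.05, "E": 0.02, "F": 0.01},
    {"A": 0.95, "B": 0.50, "C": 0.90, "D": 0.05, "E": 0.02, "F": 0.01},
    {"A": 0.90, "B": 0.92, "C": 0.45, "D": 0.03, "E": 0.02, "F": 0.01},
]

per_fold = [mean_filter_twice(f) for f in folds]

# Inner election: keep genes selected in every fold.
final_svm_genes = elect(per_fold, min_votes=len(folds))

# Arbitration with a hypothetical mRMRe output: threshold of one vote.
mrmre_genes = {"A", "E"}
final_genes = elect([final_svm_genes, mrmre_genes], min_votes=1)
print(per_fold, final_svm_genes, final_genes)
```

With a threshold of one vote the arbitration reduces to the union of the two lists; raising `min_votes` to 2 would keep only the genes on which both methods agree.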

V. SIMULATIONS AND PERFORMANCE EVALUATION
The experimental validation of the model proposed in this paper is the focus of this section. A PC with the following

B. EVALUATION METHOD
To assess the efficiency of the proposed model, an SVM (first stage) is first used as a feature (gene) selector: genes are ranked according to their weight (ω), so that each gene has an importance value, and genes whose importance is below the mean are removed. The unique surviving genes are then fed to SVM-RBF-CV (second stage), which is well suited to analyzing noisy high-throughput microarray data; it outperforms SVM-RFE in terms of noise robustness and ability to recover informative features, and it can boost prediction performance (area under the curve) on the testing data set. Using ensemble mRMRe, the optimal output features of SVM-RBF-CV were shuffled with the features (genes) selected from the original data. Ensemble mRMRe outperforms the traditional mRMR method in terms of prediction accuracy and can contribute to richer biological explanations by recognizing genes that are more important to the biological context. After the voting method is applied to the shuffled list of features (genes), the final optimal list in each data set is evaluated with four classifiers, SVM, KNN (k = 3), RF, and MLP, using eight-fold cross-validation. The final optimal list of features (genes) in each data set is of high importance and informativeness, with minimum redundancy and highest relevance.
Our evaluation included the following four measurements: (1) Accuracy (ACC), the proportion of correctly predicted samples, is the most commonly used evaluation standard, but using it alone is usually insufficient. (2) Sensitivity (also known as recall) is the proportion of actual positive samples that are correctly identified. (3) Specificity is the proportion of actual negative samples that are correctly identified. (4) The area under the ROC curve (AUC) can be read as the probability that the classifier ranks a randomly chosen positive sample above a randomly chosen negative one; the larger the AUC, the better. In these measures, TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. Based on the confusion matrix, we evaluated the performance of the proposed method and rival gene selection methods.
Two statistical tests are also used to evaluate the performance of our model. ANOVA (analysis of variance) [34] tests whether two or more means are equal. The Friedman test [35] is applied to data with three or more correlated or repeated outcomes with a non-normal distribution; its null hypothesis states that the distribution remains constant across repeated measurements.
The number of selected genes and the runtime are summarized in Table 3 and Figs. 2 and 3, and the evaluation of the proposed method using RF, KNN, MLP, and SVM is summarized in Table 4 and Figs. 4, 5, 6, and 7. Four performance metrics were chosen for result estimation: ACC, AUC, sensitivity, and specificity. We performed a statistical p-value test to determine the significance of the results.
TABLE 5. Gene accession numbers and gene descriptions of the brain cancer genes selected by the proposed model.
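The four measurements above can be computed as in the following sketch, which uses the standard confusion-matrix definitions: ACC = (TP + TN) / (TP + FP + TN + FN), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). AUC is computed here through its pairwise-ranking interpretation, with ties counted as one half. The counts and scores are hypothetical and are not results from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical confusion counts and classifier scores.
m = confusion_metrics(tp=8, fp=1, tn=9, fn=2)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3])
print(m, auc)
```

Reporting sensitivity and specificity next to accuracy matters on microarray datasets with unbalanced classes, where a classifier can reach high accuracy while missing most samples of the minority class.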
To reduce the computational complexity of the problem at hand and select the most informative genes, SVM-mRMRe, as described before, uses more than one stage (embedded SVM, SVM-RBF-CV, SVM-RBF-CV-mRMRe) to select optimal features. We ran SVM-mRMRe against each dataset and obtained the number of optimal selected features shown in Table 3 and Figure 3; Table 3 lists the number of genes chosen by SVM-mRMRe for each microarray dataset. SVM-mRMRe provides an ordered list of genes (features), ranked by importance, relevance, and informativeness. SVM-mRMRe achieves the highest level of dimensionality reduction by selecting the fewest informative genes. The highest-dimensional dataset is ovarian cancer, with 15155 features (genes) and 253 samples; the optimal subset selected by SVM-mRMRe contains six of these 15155 genes, namely the genes ranked highest in importance, informativeness, and relevance.
It is also observed that the SVM-mRMRe model incurs a lower computational cost on all data sets, as shown in Figure 2. The lowest runtime is 1 s, for the brain cancer data set with 1070 features and 16 selected genes, while the highest runtime is 863 s, for the ovarian dataset, which has the highest dimensionality. Supplementary C displays heat maps of the genes chosen by the SVM-mRMRe model.
The classic learning algorithms SVM, KNN, RF, and MLP are used to evaluate the classification accuracy obtained with the optimal genes selected by the SVM-mRMRe model. The learning algorithms are applied to the reduced dataset, which includes only the best genes, and the overall accuracy is calculated. Table 4 and Figure 4 report the accuracy of the four classifiers on the various feature sets. SVM-mRMRe increases the accuracy of the SVM, KNN, RF, and MLP classifiers on most datasets, with accuracy weighted over all data sets; among the classifiers, SVM achieves the highest classification accuracy. However, as previously stated, a single classifier such as SVM is not accurate enough when applied directly to gene microarray classification, which typically faces several challenges such as the curse of dimensionality, small sample sizes, and a large amount of noise and uncertainty. For example, the accuracy of SVM as an embedded method on the original CNS dataset is 0.67 ± 0.15, as shown in Table 4. Because accuracy alone is insufficient for model evaluation, we use three other evaluation metrics: specificity, sensitivity, and AUC, as shown in Table 4 and Figures 5, 6, and 7.
From Fig. 5, it is observed that SVM-mRMRe improves the sensitivity of the SVM, KNN, RF, and MLP classifiers. SVM has the best sensitivity on most datasets, followed by MLP; for example, SVM has a sensitivity of 1.00 ± 0.00 on the breast dataset after gene selection, compared with 0.69 ± 0.13 on the original breast dataset.
From Fig. 6, it is observed that SVM-mRMRe improves the specificity for most classifiers; SVM has a specificity of 1.00 ± 0.00 on the CNS dataset after gene selection, compared with 0.42 ± 0.26 on the original CNS dataset.
From Fig. 7, it is observed that SVM-mRMRe improves the AUC; the AUC is best with both SVM and MLP on most datasets.
Supplementary A and B show the confusion matrix and the recall, precision, f1-score, and support of SVM, KNN, RF, and MLP for all datasets in detail.

1) BIOLOGICAL INTERPRETATION OF BRAIN CANCER
The leading cause of cancer mortality in children is brain cancer, which is also the second leading cause of cancer death in general [36]. According to studies, brain tumors are highly heterogeneous, which poses the main challenge for brain tumor classification and segmentation, and thus for diagnosis and prognosis [37]. A subset of genes (features) from the brain cancer data set is biologically interpreted to demonstrate the efficacy of the proposed model in improving two critical items: classification accuracy and the selection of genes with important biological backgrounds. After applying SVM-mRMRe, only a few classes of important genes derived from microarray technologies are used for the diagnosis and prognosis of brain cancer. The aim of SVM-mRMRe is to identify crucial gene subsets with the highest predictive accuracy to support the treatment of brain cancer patients. In this section, the selected group of probe sets is studied using the web tool DAVID (Database for Annotation, Visualization, and Integrated Discovery), https://david.ncifcrf.gov/list.jsp [38], [39]. Table 5 shows the gene name and gene ID from the Entrez probe set. The GO research tools Ncbi.nlm.nih.gov/geoprofiles and https://david.ncifcrf.gov/list.jsp are generally considered the most inclusive and fastest-growing public repositories for grouping functionally related genes. These results indicate that the proposed approach is an effective way to select a group of genes for brain cancer pathway detection and prognosis.

VI. CONCLUSION AND FUTURE WORK
Limited sample size, high dimensionality, and high complexity are the key characteristics of microarray data, as well as the main obstacles for researchers performing microarray data analysis. To address this issue, this paper proposes SVM-mRMRe, an efficient SVM-based feature selection model for identifying informative features from high-dimensional microarray data. SVM-mRMRe combines a filter, an embedded method, and an ensemble method to select the most informative genes with the least redundancy and the highest relevance. Experimental results on eight microarray datasets, obtained by evaluating the proposed method with four different classifiers, validated our findings. On most test datasets, the proposed model outperformed the others in terms of classification error. Extensive testing revealed that the proposed model has three distinguishing features: (1) high classification accuracy, (2) reduced time complexity, and (3) effective informative gene selection, with the biological interpretation of the genes selected for the brain cancer dataset agreeing with the results of relevant biomedical studies. In future work, bioinformatics gene network analysis will be used to study the functions of the selected genes in order to predict cancer prognosis. This may also reveal new relationships between the selected genes and other regulated genes, helping to foresee possible functional interactions among them in cancer disease pathways. A comparison between the proposed approach and a hybrid technique based on GA and PSO will also be investigated.

APPENDIX A
See Table 6.

APPENDIX B
See Table 7.

APPENDIX C
See Table 8.