IDriveGenes: Cancer Driver Genes Prediction Using Machine Learning

The development of high-throughput sequencing technologies, i.e., Next Generation Sequencing (NGS), is revolutionizing the exploration of cancer. Sequence datasets are highly complex, and mutations can occur randomly in DNA or RNA sequences, making cells sicker or less fit. The unusual growth and behavior of genes in cells cause cancer. Cancer-driver gene cells grow when mutation occurs. Identification of cancer driver genes is a critical and challenging issue for researchers. In the proposed work, robust features are first extracted from the sequence dataset through the Position Relative Incidence Matrix (PRIM) integrated with Accumulative Absolute Position Incidence Vector (AAPIV) generation. PRIM and AAPIV convert the one-dimensional sequence data into two-dimensional numeric data. Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) classifiers are trained on these features. The proposed model is validated with different validation methods, i.e., independent testing, k-fold cross-validation, self-consistency, and jackknife testing. The proposed model predicts whether a given primary structure corresponds to a cancer driver gene or not. Analysis of the results shows 95%, 92%, and 69% accuracy for RF, ANN, and SVM, respectively. A comparative analysis with existing state-of-the-art models, i.e., 20/20+ and the Multimodal Deep Neural Network by integrating Multi-dimensional Data (MDNNMD), shows that the proposed model outperforms the existing techniques.


I. INTRODUCTION
Next Generation Sequencing (NGS) technologies, the massively parallel and high-throughput sequencing platforms, place growing demands on statistical models and bioinformatics applications to manage and analyze the intensive data produced [1]. The bioinformatics domain includes sequence analysis, gene annotation, gene expression, protein analysis, protein structure prediction, high sequence image analysis, mutation analysis in cancer, etc. [2]. Mutation in genomic data usually causes cancer. Mutation can occur through Single Nucleotide Variants (SNVs), structural variants, and insertion or deletion of genes. A mutation is a change in the DNA sequence that occurs in the bases of the sequence. Mutations arise during DNA replication, and environmental factors such as smoking, radiation, and sunlight also affect DNA sequences. This change carries over to protein sequences and can be bad or good for health. The mutation that occurs in inheritance usually has a positive effect. However, mutation disturbs normal genes and causes diseases like cancer. A somatic cell is any cell of a living body that arises after conception. Somatic mutation is variation in any cell other than reproductive cells such as gamete, gametocyte, or germ cell [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
Oncogenes (OG) are genes that help mutated cells to grow. When a mutation occurs in any cell, the OG is activated and starts growing. A large number of copies become bad genes when activated. When this happens, the cell grows rapidly and can progress to cancer. Tumor Suppressor Genes (TSG), or anti-oncogenes, are common genes that prevent the division of cells and repair DNA errors [4]. When a mutation occurs in a cell, the TSG activates and prevents the growth of the cancer mutation. When a TSG is deactivated or does not work properly, the mutated cell can grow extensively, which can cause cancer. Driver genes containing driver mutations can be discovered from cancer mutation data with or without prior knowledge of pathways or further information on genetics or protein interactions [5], [6]. This technique works when driver and passenger mutations observe the same frequencies. Additionally, sub-networks have been found able to recognize cancer driver genes with low recurrence. NGS technologies produce a massive amount of genetic data that needs powerful computational devices, high-capacity memory, and specialized software and hardware to address a particular problem, i.e., driver gene prediction [7]. Cancer is a disease with multipart interactions among the environmental and genetic factors that drive carcinogenesis [8], and it affects millions of people around the world. The study presented in [9] reports that about 8.9 million deaths were counted in 2015, and this number is expected to increase to 14.6 million by 2035. Cancer is coordinated by the disturbance of regular cellular function. Mutations that affect the cell genetically deregulate the pathways that control the fundamental processes of cells [10].
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The initial stage of cancer is triggered by the accumulation of several genetic mutations, which are major causes of the deregulation of signaling pathways that affect cell growth, DNA repair, and apoptosis [11]. Once these pathways have been deregulated, cancer cells start to grow without any normal restriction. Therefore, a technique is needed that identifies cancer-mutated genes efficiently and helps to find the structure of the genes involved in cancer growth. The techniques used for this purpose must be accurate as well as efficient. The need for more accurate and efficient techniques motivated the development of the proposed model. In this article, we work on the identification and prediction of driver genes whose mutations cause tumors or cancer. This identification could help to find a structure of cancer that is valuable for the development of novel drugs. The remainder of the paper is organized as follows. Section II covers related work, including a detailed discussion of the methods used, datasets, etc. Materials and methods are presented in Section III, with subsections for each component of the proposed model. Prediction algorithms with the necessary details are presented in Section IV. Section V presents the results and discussion. Validation methods, covering the details of each method, are presented in Section VI. A comparative analysis of the prediction models is presented in Section VII, covering the detailed results of each model. A comparison of the proposed model with existing methods is presented in Section VIII. Section IX presents the conclusion and future research directions.

II. LITERATURE REVIEW
Detection of cancer driver gene mutations in NGS data has gained the attention of researchers. NGS is a massively parallel sequencing technique that generates a huge amount of data, which needs efficient models and frameworks for managing and analyzing the intensive data. Machine learning is used in many applications for prediction and detection, including cancer driver genes [12]. Review articles exist that cover the details of different methods along with their pros and cons [13], [14], [15]. This section presents the methods most closely related to the proposed model. Research article [7] presents a MapReduce paradigm for processing NGS data in parallel by distributing a repository of DNA segments created on similar features. In research article [16], the authors presented a dimensionality reduction approach using a feature selection taxonomy of labeled, unlabeled, and partially labeled gene expression microarray data for better prediction, scalability, understandability, and simplification of the classifier. Feature selection in gene expression data supports cancer classification with better performance. Dimensionality reduction helps to reduce storage and computational complexity, but feature selection has complex stages that are generally expensive [17]. Cancer is a heterogeneous and complex disease, and various factors of environmental and genetic nature contribute to its cause. With the development of sequencing technology, a huge amount of data on cancer genomics has been produced through various platforms such as NGS, The Cancer Genome Atlas (TCGA) [18], the Cancer Cell Line Encyclopedia (CCLE) [19], and the International Cancer Genome Consortium (ICGC). These sequencing data allow researchers to understand the molecular mechanisms and the pathogenic basis of cancer [5]. A major challenge is the detection and distinction of the driver genes that are responsible for cancer development.
The earliest attempts aimed to identify individual driver genes that have recurring mutations [20]. However, such methods cannot account for the complex mutation heterogeneity of cancer genomes, which carry various gene mutations. For this reason, a great amount of attention has been given to assessing mutation recurrence in genomic pathways built from known protein-protein interactions or known networks [21], [22]. These pathways are often shared within tumor cells and may confer carcinogenic properties, e.g., metastasis, angiogenesis, or cell proliferation [23]. A key issue is that knowledge of the interactions between biological pathways and the human protein network is incomplete. It is therefore of great importance to investigate innovative approaches that do not depend on prior knowledge to find pathways or novel groups of mutated driver genes.
A lot of research work exists on cancer cell oncological research that indicates a low number of acquired biochemical, molecular, and cellular features. These are the main causes of alteration of key pathways that might look like a strong generalization [10]. There are 100+ distinctive types of cancer not counting additional subtypes of malignancies that have been acknowledged. But, certain directions and rules handled the transformation of human cells to cancerous ones [24]. Cancer types need to identify specific mutations in the human genome, that can be found in many cancer types. Several researchers also attempted to classify genes that are critical to carcinogenesis into different classes based on malignant phenotypes in various experimental prototypes. There are two types of gene classes i.e., tumor suppressor and oncogenes. Cancer is caused by the activation of oncogenes as well as the inactivation of tumor suppressor genes, which later on deliberate the irregular function that specifies cancer sickness [25].
In the early stages, oncogenes called proto-oncogenes are altered by presiding mutations which gain the proliferation to a regular cell. These genetic components are known as oncogenes in their altered form and increase the proliferative ability of cells. In contrast, tumor-suppressor genes are changed and inhabited. Central cellular processes are transformed in cancer cells due to alterations in one as well as the other classes of genes e.g. metabolism, proliferation, growth, and death [7]. The mutations of these pathways give malignant and cancerous cells that can grow in huge numbers and form tumors at the local site [8]. The uncontrollable growth of these cells can occur by sidestepping the regulatory effects of the numerous mechanism which exist in a cell, controlled by key proto-oncogenes and tumor suppressor genes [9]. Homegrown cancers develop into carcinomas when they fold away and attack external tissues in the body. Authors in [26] present a Multimodal Deep Neural Network by integrating Multi-dimensional Data (MDNNMD) for diagnosing breast cancer from multidimensional data i.e., gene expression profile data and Copy Number Alteration (CNA) profile data. The small sample size or high dimensionality data may cause bad results [27]. Initially, they select effective features from gene expression profile data that include approximately 24,000 genes and CNA profile data that include approximately 26,000 genes using mRMR method [28]. The mRMR feature selection method reduces the dimensionality of data effectively without the loss of important information. The said method selects 400 genes and 200 genes from gene expression profile data and CNA profile data respectively to fit the Deep Neural Network (DNN) prediction model. DNN is applied to extract information effectively from data by greedily training each DNN on each sub-data.
Research paper [29] presents the deepDriver framework, which predicts cancer driver genes using a Convolutional Neural Network (CNN) based on somatic mutations. Support Vector Machine (SVM) and 20/20+ [30] methods were also applied to rank unknown genes. These methods combined with CNN improve the accuracy of the proposed approach. The experiment was conducted on Lung Adenocarcinoma (LUAD), Colon Adenocarcinoma (COAD), and Breast Invasive Carcinoma (BRCA) genome data and obtained significant scores. For further investigation, the said approaches can be applied to state-of-the-art data such as pan-cancer data to predict driver genes with better accuracy. The Genomic Analysis of Mutations Extracted by Sequencing (GAMES) tool presented in [31] identifies and annotates mutations using NGS technology. GAMES reduces the complexity of huge DNA sequence data. This tool allows a detailed investigation of genetic data and the mining of functional mutations across different NGS platforms. GAMES helps to extract information about divergence and makes genome annotation available, integrated with a genomic database. DriverML, presented in [32], uses a machine learning approach that incorporates a supervised learning algorithm and a weighted scoring test to identify cancer driver genes. The supervised approach scores the functional consequences of DNA sequence alterations and integrates them across the various mutation types in somatic cells. The weighted score statistics that link all mutations can universally test each protein sequence across the genome and quantify the functional impact of various mutation forms on the protein. DriverML was applied to 31 cancer mutation datasets from TCGA and compared with 20 other common tools as a benchmark. The research article [33] presents the IMaxDriver framework for driver gene prediction. It is a network-based tool that applies an influence maximization algorithm to the human transcriptional regulatory network (TRN). Initially, weights are assigned and the TRN is pruned using tumor-specific genes.
Each gene's impact is then found using the influence maximization approach. The top genes with the maximum influence rate are selected as likely driver genes. IMaxDriver identifies 408 driver genes collectively, including new driver genes. Mutation impact can be predicted from additional biological information about the sequence and structure of the protein encoded by the mutated gene. MutationAssessor, presented in [34], combines protein area information with an evolutionary conservation model to identify the functional impact of somatic mutations. OncodriveFM [35] and TransFIC [36] use ML algorithms trained on known cancer mutations and focus on potential driver mutations. OncodriveFM recognizes genes by high functional mutations. A machine learning approach called 20/20+, proposed in [30], differentiates driver genes from passenger mutations in cancer. It utilizes a random forest trained on known cancer driver genes to recognize cohort-level cutoffs that fit the said type of identification. This method needs prior molecular knowledge of the alteration. The work in [37] presents LOTUS, a machine learning-based method to predict cancer driver genes by combining mutation frequency, functional impact, and pathway-based features. The COSMIC CGCv86 dataset is used for training, and the method is tested on the complete COSMIC database, which contains 19,320 genes. Identifying new driver genes is still a major problem and challenge for researchers, although several approaches have been applied and many methods have been developed to recognize them. A number of tools such as MutSigCV [38], 20/20+ [30], MuSic [39], TUSON [40], OncodriveFML [35], FunSeq [41], ANNOVAR [42], IntOGen [43], and CHASM [44] have been proposed to identify gene mutations in sequenced data. Driver genes having driver mutations can be predicted using cancer mutation data. This technique works when driver and passenger mutations observe the same frequencies.
Moreover, sub-networks have been found able to recognize cancer driver genes with low recurrence. Furthermore, biological information about the protein structure and sequence encoded by the mutated gene can predict the functional impact of a mutation [21], [45]. These techniques are applied to non-silent SNVs that change the amino acid sequence of the corresponding proteins.

III. METHODS AND MATERIAL
The details of the proposed model are presented in this section. The dataset of cancer driver genes is used to validate the proposed model. Pre-processing was applied first to remove redundancies. Feature vectors were then generated from the preprocessed dataset and used to train various classifiers. Different validation techniques were used to test each classifier. In the proposed system, extracting a robust feature vector from sequence data is the core step for fitting the machine learning prediction model. Feature extraction means converting or transforming the input datasets into feature vector form. Not all attributes and features in the datasets are useful, so the features that play a better role in prediction are extracted from the dataset. The n-dimensional feature vector contains numerical values as features of an object. Several feature extraction techniques such as Position Relative Incidence Matrix (PRIM), Reverse Position Relative Incidence Matrix (RPRIM), Accumulative Absolute Position Incidence Vector (AAPIV), and Reverse Accumulative Absolute Position Incidence Vector (RAAPIV) generation have been used to extract novel features. Statistical moments such as raw, central, and Hahn moments are then applied to find further significant properties of the massive data. These feature vectors are further used to train various classifiers, i.e., Random Forest (RF), Artificial Neural Network (ANN), and SVM. The ANN has interconnected layers of neurons, is based on a feed-forward network, and uses the back-propagation algorithm to reduce errors. The RF classifier is another prediction algorithm used in this research for predicting cancer driver genes and non-cancer driver genes. Finally, SVM is also used to predict cancer driver genes. Figure 1 presents the workflow of the proposed model IDriveGenes.

A. DATASET COLLECTION
The dataset was obtained from NCBI, a free repository containing genetic data with biological functions [46]. The advanced search option was used to download positive samples for both cancer and non-cancer genes. Negative notation was used to search for negative data samples. After collecting the data from NCBI, the CD-hit suite [47] was used for clustering. Analogous clusters were generated for both samples with a sequence identity parameter of 60%, which left 763 positive and 1805 negative cancer driver gene clusters. To balance the data, random oversampling is used, a common oversampling method for NGS nucleotide data. To balance the class distribution, this method duplicates randomly chosen instances of the minority class in the dataset. This is useful when the minority class is under-represented in the dataset. Equation 1 gives the complete dataset as the union of the positive and negative data samples.
$$S = S^{+} \cup S^{-} \tag{1}$$
where $S^{+}$ represents the positive data samples, $S^{-}$ represents the negative data samples, and $\cup$ denotes the union operation.
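The random oversampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `random_oversample`, the fixed seed, and the placeholder sample identifiers are assumptions introduced here, and only the class sizes (763 and 1805) come from the text.

```python
import random

def random_oversample(positives, negatives, seed=0):
    """Duplicate random minority-class samples until both classes have equal size."""
    pos, neg = list(positives), list(negatives)
    if not pos or not neg:
        raise ValueError("both classes need at least one sample")
    rng = random.Random(seed)
    # Draw duplicates only from the original samples of each class.
    source_pos, source_neg = tuple(pos), tuple(neg)
    while len(pos) < len(neg):
        pos.append(rng.choice(source_pos))
    while len(neg) < len(pos):
        neg.append(rng.choice(source_neg))
    return pos, neg

# Hypothetical stand-ins for the 763 positive and 1805 negative clusters.
positive_ids = [f"pos_{i}" for i in range(763)]
negative_ids = [f"neg_{i}" for i in range(1805)]
balanced_pos, balanced_neg = random_oversample(positive_ids, negative_ids)

# Labeled union of both classes (label 1 = driver, 0 = non-driver).
dataset = [(s, 1) for s in balanced_pos] + [(s, 0) for s in balanced_neg]
```

Oversampling is normally applied to the training portion only, so that duplicated samples do not leak into the held-out test data.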

B. FEATURE EXTRACTION
In feature extraction, the one-dimensional input datasets are converted into two-dimensional features in vector form. The following steps are performed in sequence to obtain robust feature vectors.

1) GENE MAPPING
The feature vector contains numerical values as features and consists of n dimensions. Combinations of nucleotides build up genes and also make up DNA and RNA, which carry the genetic code of species. In gene expression, the DNA is first transcribed into mRNA. To perform specific protein functions, the RNA may act directly or be the starting material for protein synthesis. In the process of phenotypic trait inheritance, genes are passed on to offspring. The genotype of an organism is responsible for the appearance of the phenotype, along with several developmental and environmental factors. Despite all the complexities, the polygenic associations between the genes and the external environment control the biological traits. Some characteristics are immediately visible, such as skin color or the number of limbs, while others are not, such as susceptibility to illnesses, blood type, or the thousands of essential biochemical processes that make up life. So, for feature extraction, a mathematical model is required that considers the K-tuple nucleotide composition and position in the gene sequence [48]. The K-tuple alphabet ends with the trinucleotides {…, GTT, TAA, TAC, TAG, TAT, TCA, TCG, TCT, TCC, TGA, TGC, TGT, TGG, TTA, TTC, TTG, TTT}. The dataset contains an array of these symbols, and RF can be applied to this format because it is a good representation for input. When the input is well-formed, RF can better learn the relationships in the data to predict unseen sequences.
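As a rough illustration of the K-tuple mapping, the sketch below enumerates the 4, 16, or 64 possible K-tuples and encodes a sequence as a list of K-tuple indices. The alphabetical A, C, G, T ordering, the overlapping sliding window, and the function names are assumptions made for this example; the paper does not specify them.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed ordering of the four bases

def ktuple_alphabet(k):
    """All 4**k K-tuples, e.g. 64 trinucleotides for k = 3."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]

def encode_sequence(seq, k):
    """Map each overlapping K-tuple of seq to its index in the alphabet.

    Windows containing symbols outside A, C, G, T (e.g. N) are skipped.
    """
    index = {t: i for i, t in enumerate(ktuple_alphabet(k))}
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [index[w] for w in windows if w in index]
```

For example, `encode_sequence("ACGT", 2)` yields the indices of AC, CG, and GT in the 16-element dinucleotide alphabet.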

2) POSITION RELATIVE INCIDENCE MATRIX (PRIM)
Since the position of the amino acid in the polypeptide chain is very important, PRIM indicates the relative position of the K-tuple of the nucleotide chain [49]. Here we draw a relative probability matrix up to 64 × 64 positions to make the system more efficient. PRIM is an n × n matrix as shown in Equation 2.
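A possible PRIM computation is sketched below. The element definition is an assumption, since Equation 2 is not reproduced in this text: entry (i, j) is taken to be the sum of the positions of residue j measured relative to the first occurrence of residue i, counting only positions at or after that first occurrence. The function name `prim` is hypothetical.

```python
def prim(tokens, alphabet="ACGT"):
    """Position Relative Incidence Matrix under the assumption stated above.

    tokens may be a string of nucleotides or a list of K-tuples;
    alphabet lists the n distinct symbols, giving an n x n matrix.
    """
    idx = {symbol: i for i, symbol in enumerate(alphabet)}
    n = len(idx)
    # Position of the first occurrence of every symbol that appears.
    first = {}
    for pos, symbol in enumerate(tokens):
        if symbol in idx and symbol not in first:
            first[symbol] = pos
    matrix = [[0] * n for _ in range(n)]
    for a, start in first.items():
        for pos in range(start, len(tokens)):
            b = tokens[pos]
            if b in idx:
                # Offset of symbol b relative to the first occurrence of a.
                matrix[idx[a]][idx[b]] += pos - start
    return matrix
```

With the 64 trinucleotides as the alphabet and a list of trinucleotide tokens as input, the same function produces the 64 × 64 matrix mentioned in the text.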

4) DETERMINING FREQUENCY MATRIX
A genetic model indicates the number of times a nucleotide is present in a DNA sequence. Therefore, it is important and beneficial to count the occurrences of each nucleotide. The frequency matrix, denoted ξ, which records the number of occurrences of each K-tuple produced by the nucleotides, is computed using Equation 4 [49].
The number of elements is 4 for nucleotides, 16 for dinucleotides, and 64 for trinucleotides. The frequency matrix helps us to know how the sequence is composed through the frequency of each nucleotide in the series.
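The frequency counts can be sketched as below. This is an illustrative example that assumes overlapping windows and an alphabetical A, C, G, T ordering of the K-tuples; `frequency_vector` is a name chosen here, not one from the paper.

```python
from itertools import product

def frequency_vector(seq, k=1):
    """Count how often each of the 4**k K-tuples occurs in seq."""
    tuples = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # ignore windows with non-ACGT symbols
            counts[window] += 1
    return [counts[t] for t in tuples]
```

The result has 4, 16, or 64 elements for k = 1, 2, or 3, matching the vector sizes given in the text.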

5) ACCUMULATIVE ABSOLUTE POSITION INCIDENCE VECTOR GENERATION (AAPIV)
A frequency matrix indicates the frequency of each residue and tells us how the sequence is composed. It does not, however, provide the relative positions of the residues, which can help us to find information about the nucleotide composition of genes. Because the aggregated frequency matrix carries no position information, a position-aware vector called the Accumulative Absolute Position Incidence Vector (AAPIV) is created to obtain the required information.
AAPIV is a 4-element vector for nucleotides (16 elements for dinucleotides and 64 for trinucleotides) in which each element accumulates the positions at which the corresponding residue appears, as given by Equation 5 [49].
where the i-th element of AAPIV4, AAPIV16, and AAPIV64 is computed using Equation 6.
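A minimal sketch of AAPIV follows, assuming that each element is the sum of the 1-based positions at which the corresponding symbol occurs. That reading of Equation 6 is an assumption, as the equation itself is not reproduced here, and the function name is hypothetical.

```python
def aapiv(tokens, alphabet="ACGT"):
    """Sum of 1-based positions of each alphabet symbol in tokens."""
    idx = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = [0] * len(idx)
    for position, symbol in enumerate(tokens, start=1):
        if symbol in idx:
            vector[idx[symbol]] += position
    return vector
```

In "ACAG", for instance, A occurs at positions 1 and 3, so its element is 4. Passing dinucleotide or trinucleotide tokens with the matching alphabet gives the 16- and 64-element variants.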

6) REVERSE ACCUMULATIVE ABSOLUTE POSITION INCIDENCE VECTOR GENERATION (RAAPIV)
The AAPIV method extracts hidden and interesting patterns, and RAAPIV is used to accomplish the same task on the reversed sequence: it extracts useful and hidden information about the relative position of residues in the sequence [49]. RAAPIV is obtained by reversing the primary DNA sequence and then generating the AAPIV of the reversed sequence. RAAPIV is a 4-element vector for nucleotides, a 16-element vector for dinucleotides, and a 64-element vector for trinucleotides, as presented by Equation 7, where $\eta_i$, the i-th element of the vector, is computed using Equation 8.
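Under the same assumption used for AAPIV (each element sums the 1-based positions of a symbol), RAAPIV can be sketched by applying the accumulation to the reversed sequence. The function name is hypothetical.

```python
def raapiv(tokens, alphabet="ACGT"):
    """AAPIV-style position sums computed on the reversed sequence."""
    idx = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = [0] * len(idx)
    for position, symbol in enumerate(reversed(tokens), start=1):
        if symbol in idx:
            vector[idx[symbol]] += position
    return vector
```

For "ACAG" the reversed sequence is "GACA", so A contributes positions 2 and 4, C position 3, and G position 1.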
The statistical moment is a quantitative measure that describes the shape of the data distribution. There are many kinds of moments, and each describes certain attributes of the data. Some moments describe the size of the data, while others describe its orientation. In this research, Hahn, central, and raw moments are used to address the problem [48]. The raw moments are used to estimate the mean and standard deviation of the data; they are neither location-invariant nor scale-invariant. Scale invariance means that a feature is not affected by rescaling the data, and location invariance means that it is not affected by shifting the data values. The central moments provide similar information to the raw moments but are calculated about the centroid of the data, which makes them location-invariant; they remain scale-variant. Hahn moments use Hahn polynomial values as their moment scores. They are neither location-invariant nor scale-invariant. In this research work, non-scale-invariant moments are used. Each moment is computed from data in a two-dimensional format, so the one-dimensional data is converted into a two-dimensional format. Suppose we have a genetic sequence P as presented by Equation 9.
$$P = a_1, a_2, a_3, \ldots, a_k \tag{9}$$
where k is the length of the series of residues and $a_i$ is the i-th K-tuple nucleotide in the sequence. As a result, an n × n dimensional matrix is formed to hold all the components, as shown in Equation 10.
where P′ is used for moment computation. For raw moments computation, Equation 11 is used. Hahn moments need a two-dimensional square matrix as input. The Hahn polynomial can be expressed as given in Equation 13.
For the calculation of the Pochhammer symbol, the protocol described in [50] is used. The orthogonal normalized Hahn moments are then computed; the normalized Hahn moments for the 2-D matrix are given by Equation 14. Each classifier is rigorously tested with well-known validation techniques such as self-consistency, cross-validation, jackknife testing, and independent testing.
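The conversion to a square matrix and the raw and central moments can be sketched as follows. Several details are assumptions for illustration: the matrix side is taken as the ceiling of the square root of the sequence length, unused cells are zero-padded, indices start at 0, and the standard two-dimensional moment definitions are used. Hahn moments are omitted from this sketch.

```python
import math

def to_square_matrix(values):
    """Arrange a 1-D list row by row into an n x n matrix, zero-padded."""
    n = math.ceil(math.sqrt(len(values)))
    padded = list(values) + [0] * (n * n - len(values))
    return [padded[r * n:(r + 1) * n] for r in range(n)]

def raw_moment(m, i, j):
    """Raw moment of order (i, j): sum of p**i * q**j * m[p][q]."""
    n = len(m)
    return sum((p ** i) * (q ** j) * m[p][q] for p in range(n) for q in range(n))

def central_moment(m, i, j):
    """Central moment of order (i, j), taken about the centroid of m."""
    total = raw_moment(m, 0, 0)
    if total == 0:
        return 0.0
    x_bar = raw_moment(m, 1, 0) / total
    y_bar = raw_moment(m, 0, 1) / total
    n = len(m)
    return sum(((p - x_bar) ** i) * ((q - y_bar) ** j) * m[p][q]
               for p in range(n) for q in range(n))
```

The first-order central moments vanish by construction, which is a convenient sanity check that the centroid is computed correctly.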

IV. PREDICTION ALGORITHMS
In this article, we use three machine learning models for prediction i.e., ANN, RF, and SVM to predict cancer driver genes and non-cancer driver genes as discussed in the following subsections.

A. ARTIFICIAL NEURAL NETWORK (ANN)
ANN consists of interconnected layers of neurons and can be used for the prediction of cancer driver genes [51]. The architecture of the back-propagation network is shown in Figure 2. The ANN used here is based on a feed-forward network and uses the back-propagation algorithm to reduce the number of errors. The input layer corresponds to the feature vector, and the hidden layer takes its inputs from the neurons of the input layer and forms the processing section of the entire network. The ANN output section collects the accumulated values, together with nonlinear values, in a three-stage process comprising forward flow and backward error propagation [52], as given by Equation 15.
The input and hidden layers consist of k and h neurons, respectively. Each output neuron computes the output denoted by $O_m$. For any node with input $I_a$, the weight of the edge connecting an arbitrary node x to node y is denoted by $W_{xy}$, whereas $W_{ym}$ represents the weight of the edge connecting node y to an arbitrary output-layer neuron m. The function f in the equation is the classic sigmoid function, which models neuron activation as described in Equation 16.
In every training pattern, the generated output units are compared with the target outputs. E denotes the error, which can be calculated using Equation 17.
where $O_i$ is the target output and $P_i$ is the current output of the network. Gradient descent is used to reduce the error. The error at the output layer is propagated back through the network. The weights of each layer are represented by a vector V. The update process chooses a different vector V to reduce the error. This continues iteratively until convergence, as described in Equation 18.
The change in weight at time t+1 is computed using Equation 19, where η is a positive constant that represents the learning rate, with a value between 0 and 1.
Equation 20 is used to express the change in weights.
Here, $V_{u,v}$ denotes the weight between the u-th and v-th neurons that minimizes E in the i-th iteration. This procedure is applied during both the forward and the backward pass. It is a lightweight scheme with low memory consumption used for training the ANN. The target in such networks is usually to minimize the Mean Square Error (MSE), as given by Equation 21.
where P and O denote the actual output and the output neurons, respectively; in Equation 21, $O_{mn}$ and $P_{mn}$ represent the predicted and the observed values, respectively. In this article, the dataset includes 763 positive (cancer driver) and 1805 negative genes. The Feature Input Matrix (FIM) is constructed for the driver genes. Each FIM row represents a data sample. Likewise, the Expected Output Matrix (EOM) is formed to record the class of the corresponding FIM row as positive or negative. The ANN is trained using the input matrix FIM and the expected output matrix EOM. The FIM is provided as the input to the training module, whereas the EOM is used to calculate the error for back-propagation.
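The feed-forward and back-propagation steps described in this subsection can be illustrated with a very small network. The sketch below is not the authors' model: the class name `TinyANN`, the single hidden layer with three neurons, the bias terms, the learning rate, and the toy OR dataset standing in for the FIM and EOM are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyANN:
    """One hidden layer, one sigmoid output, trained by gradient descent."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in every row is a bias term (an added assumption).
        self.w_xy = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                     for _ in range(n_hidden)]
        self.w_ym = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w_xy]
        out = sigmoid(sum(w * v for w, v in zip(self.w_ym, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, eta):
        """One weight update for the squared error E = (out - target)**2 / 2."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_ym[y] * h * (1.0 - h)
                        for y, h in enumerate(hidden)]
        for y, h in enumerate(hidden + [1.0]):
            self.w_ym[y] -= eta * delta_out * h
        inputs = list(x) + [1.0]
        for y, d in enumerate(delta_hidden):
            for i, v in enumerate(inputs):
                self.w_xy[y][i] -= eta * d * v

def mse(net, rows, targets):
    return sum((net.forward(x)[1] - t) ** 2
               for x, t in zip(rows, targets)) / len(rows)

# Toy stand-ins for the FIM (rows) and EOM (targets): the logical OR function.
fim = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
eom = [0.0, 1.0, 1.0, 1.0]
net = TinyANN(n_in=2, n_hidden=3)
error_before = mse(net, fim, eom)
for _ in range(2000):
    for x, t in zip(fim, eom):
        net.train_step(x, t, eta=0.5)
error_after = mse(net, fim, eom)
```

The learning rate `eta` plays the role of η in the weight update, and `mse` corresponds to the Mean Square Error objective.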

B. RANDOM FOREST (RF)
RF is used for regression and classification problems [53]. Here, the RF classifier is used to predict both cancer-driver and non-cancer-driver genes. In the first step, decision trees are built from the data [54]. Each tree predicts a class using the classifier. The generated feature input matrix of the two data samples is used by the RF algorithm. The model is trained on this data for prediction and the accuracy is calculated. The class with the highest number of votes becomes the prediction of the model, as shown in Figure 3.
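The bootstrap-and-vote idea behind RF can be sketched with a deliberately simplified forest. Here every tree is a one-level decision stump on a single randomly chosen feature, which is a simplification assumed for brevity; a real random forest grows deeper trees. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature: (feature, thr, left, right)."""
    values = sorted(set(r[feature] for r in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [l for r, l in zip(rows, labels) if r[feature] <= thr]
        right = [l for r, l in zip(rows, labels) if r[feature] > thr]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, thr, left_label, right_label)
    if best is None:  # the feature has a single value in this sample
        label = Counter(labels).most_common(1)[0][0]
        return (feature, 0.0, label, label)
    return (feature,) + best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest

def predict(forest, row):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(left if row[f] <= thr else right
                    for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]

# Toy data: both features separate the two classes.
rows = [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2], [0.4, 1.3],
        [0.6, 2.0], [0.7, 2.1], [0.8, 2.2], [0.9, 2.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

An odd number of trees avoids tied votes in the two-class case.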

C. SUPPORT VECTOR MACHINE (SVM)
SVM is a machine learning classifier used for classification [55] and regression problems [56]. The primary objective of SVM is to identify a hyperplane in N-dimensional space, where N is the number of features, that can be used to categorize a data point. The hyperplane is a decision boundary used for classifying the data points [57]. Points on opposite sides of the hyperplane belong to different classes, i.e., class A and class B, as shown in Figure 4.
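The separating hyperplane can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a sketch under assumptions: the paper does not state the kernel or the solver, and the function names, hyperparameters, and toy data below are chosen for this example only.

```python
def train_linear_svm(rows, labels, lam=0.01, eta=0.1, epochs=200):
    """Fit w, b of the hyperplane w.x + b = 0; labels must be +1 or -1."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks the weights at every step.
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:  # point is inside the margin or misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Class is decided by the side of the hyperplane on which x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy data: class A (+1) in the upper right, class B (-1) in the lower left.
rows = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
        [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(rows, labels)
```

The sign of `w.x + b` corresponds to the two sides of the decision boundary described above.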

V. RESULTS AND DISCUSSION
This section presents the experimental evaluation of the proposed framework. The experiments were run on an Intel(R) Core i7-7500 with two CPUs @ 2.70 GHz, 8 GB of memory, and the 64-bit Ubuntu 18.04 operating system. The experimental datasets are obtained from the NCBI web portal [46]. The dataset is freely available with verified statistics. Python 3.7 with NumPy and scikit-learn (sklearn) is used for the implementation of the proposed model. Initially, pre-processing was performed on the sequencing data, and the novel features were extracted and saved in a CSV file. The CSV file is then used as input to the classification models for prediction. Finally, the generated results are stored in the output file and plotted.

A. MODEL EVALUATION
Classification performance is measured in terms of specificity, accuracy, and sensitivity. The performance of a model can be measured statistically using these parameters. Sensitivity and specificity measure performance on the different classes in a given dataset. When the model detects a driver gene, the detection is either correct, referred to as True Positive (TP), or incorrect, referred to as False Positive (FP). When the model does not detect a driver gene, the result is either correct, referred to as True Negative (TN), or incorrect, referred to as False Negative (FN).
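The three measures can be computed from the confusion-matrix counts using their standard definitions, as sketched below. The function names and the convention that label 1 marks a driver gene are assumptions for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (TP rate), and specificity (TN rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```

Sensitivity is the fraction of true driver genes that are detected, and specificity is the fraction of non-driver genes that are correctly rejected.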

VI. VALIDATION METHOD
Model testing is an important factor for the validation of the predicting model [58]. The proposed model is validated with four tests, i.e., jackknife testing, cross-validation, independent testing, and self-consistency. The following subsections present the detailed results.

A. SELF-CONSISTENCY
One of the simplest and most obvious tests is the self-consistency test. A trained model is simply evaluated using the training set of data. It serves as a simple yet effective benchmark for assessing the learning ability of a model. The self-consistency test was performed on both positive and negative genes using the datasets used for training the model. Table 1 presents the results of the self-consistency test and Figure 5 shows the ROC graph of the same test. The analysis of the results shows that the RF classifier performs better than the ANN and SVM classifiers.

B. JACKKNIFE TESTING
Jackknife testing is the strictest testing method. In each iteration, one sample is removed and the algorithm is trained on the remaining samples; the model is then assessed on the held-out sample. This process is repeated for every data sample, so for a dataset of N samples it is carried out N times. Each cycle's test sample is unique; therefore, each sample is tested exactly once. Although this process is the strictest, it also takes the longest time [59]. The confusion matrix obtained after training and testing contains the TP, FP, TN, and FN values that are used to compute the accuracy $\rho_j$ for a particular instance $j$. The mean accuracy over all instances is computed as given in Equation 26.

$$\bar{\rho} = \frac{1}{N}\sum_{j=1}^{N}\rho_j \tag{26}$$

where $\bar{\rho}$ represents the cumulative accuracy of the proposed model. Because each sample is tested exactly once, the cumulative accuracy obtained for a given dataset is unique [60], [61]; the results therefore remain invariant, and the test is considered more credible. Table 1 shows the results obtained from jackknife testing using the different classifiers. The results show that the accuracy of RF, ANN, and SVM is 91.3%, 88.4%, and 69.2%, respectively. The ROC curve for jackknife testing is shown in Figure 6.
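The jackknife procedure described above can be sketched with scikit-learn's `LeaveOneOut` splitter. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated with `make_classification`) rather than the extracted feature vectors, and the Random Forest settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in for the feature CSV (assumption: the real features are not used here).
X, y = make_classification(n_samples=40, n_features=8, random_state=0)

per_sample_accuracy = []  # one entry per held-out sample: 1.0 if correct, else 0.0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train on the N-1 remaining samples, then test on the single held-out sample.
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    prediction = model.predict(X[test_idx])[0]
    per_sample_accuracy.append(float(prediction == y[test_idx][0]))

# Mean over all N iterations, i.e. the cumulative jackknife accuracy.
mean_accuracy = float(np.mean(per_sample_accuracy))
print(f"jackknife accuracy over {len(per_sample_accuracy)} samples: {mean_accuracy:.3f}")
```

Because the model is retrained N times, the cost grows linearly with the dataset size, which is why this test takes the longest of the four.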

C. CROSS-VALIDATION
When testing requires unseen data but none is readily available, the cross-validation technique is used. The dataset is randomly divided into multiple segments, which leads to rigorous testing [62]. In this technique, each partition is disjoint from all other partitions. In each iteration, one partition is selected, training is performed on the remaining data, and the selected partition is then used to test the model. The accuracy of the model is calculated at each iteration, and the mean of all the obtained results is reported as the cross-validation result. In the evaluation of the proposed model, 5-fold and 10-fold cross-validation were performed on the benchmark dataset. Table 1 shows the results of the cross-validation test. Figures 7 and 8 illustrate the ROC curves for 5-fold and 10-fold cross-validation, respectively.
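A minimal sketch of 5-fold and 10-fold cross-validation with scikit-learn is given below. It assumes synthetic data in place of the benchmark dataset, and the classifier settings and random seeds are illustrative choices, not those of the reported experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the benchmark feature vectors (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

results = {}
for k in (5, 10):
    # Shuffle once, then split into k disjoint partitions (folds).
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    # Each fold is held out in turn while the model is trained on the other k-1 folds.
    fold_accuracy = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    results[k] = (fold_accuracy, float(fold_accuracy.mean()))
    print(f"{k}-fold mean accuracy: {results[k][1]:.3f}")
```

`StratifiedKFold` can be substituted for `KFold` when the class proportions should be preserved in every fold.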

D. INDEPENDENT TESTING
After training the model, independent testing is carried out using test data. To avoid ambiguity in the results, the training and testing data are kept with similar ratios. Table 1 shows the results of the proposed model for independent testing in terms of accuracy on the different classifiers, i.e., RF, ANN, and SVM. The accuracy achieved by RF, ANN, and SVM is 95.4%, 93.0%, and 66.6%, respectively. The ROC curve for independent testing is shown in Figure 9.
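The sketch below illustrates independent testing on a held-out split and, for contrast, the self-consistency accuracy on the training data. It rests on several assumptions: the data are synthetic, the 80/20 split is an arbitrary choice, and `stratify=y` is one possible reading of keeping the training and testing data "with similar ratios" (here, similar class ratios).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted feature vectors (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out 20% of the samples; stratify so both splits keep similar class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Self-consistency: evaluate on the same data used for training.
self_consistency_acc = accuracy_score(y_train, model.predict(X_train))
# Independent testing: evaluate on data the model has never seen.
independent_acc = accuracy_score(y_test, model.predict(X_test))

print(f"self-consistency accuracy: {self_consistency_acc:.3f}")
print(f"independent test accuracy: {independent_acc:.3f}")
```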

VII. COMPARATIVE ANALYSIS OF CLASSIFIERS
Comparative results of RF, ANN, and SVM are shown in Table 2, and Figure 10 shows them graphically. The analysis shows that in both cases, i.e., cancer and non-cancer driver genes, RF achieves the highest accuracy, approximately 95.8%, compared with the ANN and SVM classifiers. The ANN classifier shows 93.0% accuracy, which is better than SVM, while SVM shows 69.2% accuracy for the cross-validation and self-consistency tests.

The performance of the model is measured using classification scores. On an artificial dataset with heavy class imbalance, such scores are less informative and this type of validation becomes less effective. In such cases, the Area Under the Curve (AUC) of the ROC curve is also a critical parameter in evaluating the performance of classification models [63]. The ROC curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR), and it indicates how well the model can distinguish between the classes: a higher AUC means that the model distinguishes the classes more accurately. The performance of the model is therefore assessed by looking at the AUC in a plot of TPR against FPR [64]. If the model distinguishes the classes accurately, the AUC tends toward 1; the less accurately the classes are distinguished, the lower the AUC, with 0.5 corresponding to random guessing. To validate the performance of the proposed model, the ROC curves of the different classifiers are presented: Figures 11, 12, and 13 show the results of the RF, ANN, and SVM classifiers, respectively. The results show that the AUC of RF is close to 1, which indicates that RF separates the classes better than the other classifiers.
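To make the relation between the ROC curve and the AUC concrete, the following dependency-free Python sketch sweeps the decision threshold over the predicted scores, collects the resulting (FPR, TPR) points, and integrates them with the trapezoidal rule. The helper names `roc_points` and `auc`, as well as the labels and scores, are hypothetical and serve only as an illustration.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Samples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = driver gene) and classifier scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(f"AUC = {auc(roc_points(labels, scores)):.4f}")
```

In this example one negative sample is scored above one positive sample, so 8 of the 9 positive-negative pairs are ranked correctly and the AUC is 8/9; a classifier that ranks every positive above every negative reaches an AUC of 1.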

VIII. COMPARISON WITH EXISTING METHODS
The proposed model, IDriveGenes, is compared with the state-of-the-art models 20/20+ and MDNNMD. The 20/20+ model was proposed to differentiate driver genes from passenger mutations in cancer [30]. It uses an RF trained on known cancer driver genes to recognize cohort-level cut-offs suited to this type of identification. Such a method requires prior molecular knowledge of the alterations. MDNNMD [26] was proposed for breast cancer diagnosis from multi-dimensional data, including gene expression profile data and CNA profile data. The method selects features by applying the mRMR [17] feature selection method, which effectively reduces data dimensionality without losing important information; it selects 400 genes and 200 genes from the gene expression profile data and the CNA profile data, respectively.

To show the significance of the proposed model, various predictive performance measures are computed: accuracy, precision, SP, SN (recall), F1-score, and MCC. IDriveGenes yields 95.8% accuracy, i.e., the proportion of correct identifications over the total dataset. The two existing methods achieve 86% and 82% accuracy, respectively, which is less than the proposed model. The proposed model obtains 92.2% SP and 97.3% SN, which is better than the existing models. The precision of IDriveGenes, i.e., the number of correct positive predictions out of the total number of positive predictions, is 96.6%. The F1-score of IDriveGenes is almost 97%, and its MCC is 90%, which is higher than that of the existing techniques. Table 3 shows the detailed comparison, and Figure 14 illustrates the comparison of IDriveGenes with the existing methods in graphical form.

TABLE 3. Comparative results analysis of the proposed model with existing methods.

It can be concluded that the results of IDriveGenes are better than those of 20/20+ and MDNNMD. Moreover, the proposed model takes less time and consumes less memory for training than the existing approaches; however, it needs more time to extract robust features from the NGS dataset. The training time of the RF model is less than that of the ANN, and RF is also less susceptible to overfitting. The accuracy of a model is strongly influenced by the robustness of the feature extraction technique. During feature extraction, features that are more relevant to the composition and sequence of the primary structures are extracted, and correct feature extraction leads to better results for a model. The proposed model yields results in less time and at lower cost from a given sequence. The effectiveness of the model in identifying driver genes is evident from the results.

IX. CONCLUSION
Cancer is caused by cell proliferation, and cancer genes proliferate following mutation. Identifying such genes can help with treatment and, in some cases, even with curing the disease. This study proposes a reliable in-silico method that uses classifiers to find cancer driver genes. To extract features from the reference dataset, an optimal feature extraction approach was applied. The resulting feature vectors were used to build classifiers, including ANN, SVM, and RF. Once the models were fully trained, they were rigorously tested using jackknife testing, self-consistency, k-fold cross-validation, and independent testing. For cancer driver genes, the Random Forest classifier achieved 95% accuracy, the ANN 92% accuracy, and the SVM 69% accuracy. Moreover, the proposed model outperformed the existing approaches on all the reported measures.