A novel machine learning model for identifying patient-specific cancer driver genes

The identification of patient-specific cancer driver genes plays a crucial role in the development of personalized cancer treatment and drug development. Several computational methods have been proposed for identifying patient-specific cancer driver genes, most of which rank driver genes ac-cording to scores calculated from various gene or protein network information. In this paper, we propose a machine learning model for more accurate identification of patient-specific cancer driver genes. The training data for the proposed model is composed of the gene vectors, which indicate the impacts that one gene can have on or receive from all the genes. The gene vector is patient-specific, in other words, one gene can have many gene vectors from many cancer patients. To make gene vectors, first a patient-specific gene network is built using the gene expression data of each cancer patient and gene regulatory network, then modified PageRank is applied to the patient-specific gene network to make the impact matrix, from which gene vectors can be extracted. We used the Random Forest model to train gene vectors to find and discriminate patterns that show how known driver genes affect, or are affected by, other genes. The proposed model was tested through cross validations and independent tests using different sets of known cancer driver genes and six cancer types from The Cancer Genome Atlas (TCGA) data, and showed higher F-scores than existing patient-specific driver gene identification algorithms. The majority of predicted driver genes were rare, and F-scores calculated with these rare genes are higher than or comparable to those of frequently identified driver genes.


I. INTRODUCTION
Identification of cancer driver genes is important because it enables us to have a deeper understanding of cancer, leading to development of superior anti-cancer drugs or therapies. With accumulation of high-throughput genomic and transcriptomic data, many computational methods have been proposed to identify cancer driver genes or mutations, which can be roughly divided into three categories. The first group of methods identifies driver genes or mutations by their frequency [1], [2]. The main disadvantage of these methods is that they cannot find rare driver genes or mutations unless massive amount of data is provided. The second group is based on machine learning models, which learn genetic or transcriptomic patterns of known driver mutations or genes [3]- [6]. While machine learning based approaches have shown to exhibit high accuracy from recent studies, they are limited by their small number of available training data due to a limited number of known cancer driver mutations or genes. The third group adopts various network searching algorithms to the network of genes, such as gene regulatory network or protein-protein interaction network, to identify cancer driver genes [7], [8].
A majority of the above mentioned methods are focused on identifying drivers from cancer cohort studies. However, it is highly likely that individual cancer patients with the same cancer type have heterogeneous cancer drivers [9], [10]. A small fraction of these heterogeneous drivers have high them are well studied, however, most of them are rare and hard to identify [11], [12]. Several methods have been developed to find rare and patient-specific drivers from an individual cancer patient data, and most of the methods are based on network search (third group). DawnRank [13] modified PageRank [14] to apply variable damping factor which is calculated based on the number of incoming edges. Genes are scored using modified PageRank and individual somatic mutation information, and highly ranked genes with somatic mutations are selected as driver genes. Personalized Network Control (PNC) [15] applies Maximum Matching Set algorithm to a patient-specific gene network to find the minimum number of driver genes that affect whole network. The paired Single Sample Network (SSN) constructs a patient-specific gene network whose edges show significant changes in correlation of gene expression between normal and tumor states. Singlesample Controller Strategy (SCS) [16] identifies the minimum number of mutated genes needed to control differentially expressed genes using a method named CTC (Constrained Target Control), which results in several gene modules composed of one driver gene and other downregulated genes. PRODIGY [17] ranks driver genes by calculating the impact of mutated genes on deregulated pathways. These mutated genes are significantly enriched in differentially expressed genes when performing a hypergeometric test.
In this paper, we propose a novel machine learning model named MPD (Machine learning model for Patient-specific Driver gene identification) for identifying patient-specific cancer driver genes. The training data for MPD is composed of the gene vectors. The gene vector of gene G represents the impacts that G can have on, or receives from, all the genes. Gene vectors are patient-specific, which means that the vectors for each gene are made for all cancer patients, and gene vectors of the same gene can be different.
To make the gene vectors, we first construct a gene network for each cancer patient, i.e., a patient-specific gene network. Weights of the patient-specific gene network are calculated using DNA mutations and gene expressions. Then we apply modified PageRank for each gene of the network to make an impact matrix, of which element eij implies an impact from genej to genei. A gene vector is a concatenation of a column and transpose of a row of a gene in the impact matrix.
Gene vectors for all the patients comprise the training and test data. Because gene vectors are patient-specific, gene vectors of the same gene can be divided into training and test data. This means that known driver genes can also be predicted as cancer driver genes, even if they were already used for training.
Gene vectors of known cancer driver genes with somatic mutation (SM), copy number alteration (CNA) or DNA methylation (DNAm) are assumed to be patient-specific driver genes, and labeled as T. The same number of gene vectors not known to be cancer drivers that do not feature SM, CNA, or DNAm are randomly selected and labeled as F. We used CNA and DNAm in addition to SM data, because they are often associated with cancer driver genes [18], [19]. The machine learning models are trained to classify the gene vectors into T and F. The unlabeled gene vector of gene G is classified by the trained model, and G is predicted to be a patient-specific cancer driver gene if it is classified as T. Because a gene vector is the impacts that a gene gives to and receives from all the genes, classification modeling can find and discriminate patterns that show how a known patient-specific driver affects other genes.
We performed five-fold cross validation and independent tests using known driver genes from Intogen [20], Cancer Gene Census (CGC) [21] and The Network of Cancer Genes (NCG) [22], and SM, CNA and DNAm data of six cancer types (breast, colon, liver, pancreatic, and stomach cancer) in TCGA [23]. Through cross validation, we found that Random Forest (RF) [24] was effective for our purpose, and that all omics data types showed best results for colon and liver cancer, while the rest of the cancer types were shown best by SM and DNAm data. Through independent tests, we confirmed that MPD shows higher F1 and F0.5 scores than existing methods for identifying patient-specific cancer driver genes. The majority of predicted driver genes are found to be rare, and F1 and F0.5 scores calculated with these rare genes are higher than or comparable to those of frequently identified driver genes.

A. OVERVIEW
To describe how the proposed model, MPD, works, we first explain how to obtain gene vectors in Section 2.2, and explain how to train gene vectors using the machine learning methods and how to predict patient-specific cancer driver genes in Section 2.3.

1) BUILDING THE PATIENT-SPECIFIC GENE NETWORK
The first step to make a gene vector is to construct the patientspecific gene network. It is built using gene expression data and integrated gene networks made with directed edges of FI networks from Reactom [25] and the gene regulatory networks from the RegNetwork [26] and TRRUST [27]. The patient specific gene network can be represented as a weighted adjacency matrix for each patient. is calculated using the product of two matrices, and as in (1).
In (1), is an identity matrix, and is a diagonal matrix. Each element of represents differences in gene expression between a tumor sample and a group of normal samples, which is calculated as t-statistics using one sample t-test. Tstatistics are then normalized to a range from 0.1 to 0.9 by the min-max scaling method. The minimal value should not be zero, because the node with value 0 will not retain its own value, and the maximal value should not be one, because the node with value 1 will not receive any impact from the network. The elements of are weights of self-loops of a patient-specific gene network.
The matrix is used for calculate weights of non-self-loops, and defined as the elementwise multiplication (⨂) of four matrices, an adjacency matrix , , and , as in (2).
is an adjacency matrix where = if i-th and j-th genes are connected in the integrated gene networks, and Aij = 0 otherwise. Aij = 2 if i-th gene has SM, CNA or DNAm, and Aij = 1 otherwise. WC and WD are matrices of which elements are calculated using Pearson's Correlation Coefficient (PCC) as in (3) and (4), respectively.
where , and are matrices of gene expression data of cancer and normal samples, respectively. The value of [ , ] is close to zero if there are similar patterns between cancer and normal samples, and otherwise its value increases up to the maximum of 1.
shows which interactions are particularly crucial in a patient, compared to the other patients. A value of [ , ] is derived from calculation of PCC and gets close to 1 if a patient has similar linear correlation pattern between the i-th and j-th genes compared to PCC of the cancer sample group. In (5), [ , ] is the value of the -th gene expression level of the k-th patient; and are the mean and standard deviation of the expression level of the -th gene in the cancer samples, respectively; is a function that returns the sign of an input value, 1 or -1. The sigmoid function is used to give weight of zero to edge ( , ) (i.e. to remove edge), if i-th and jth genes are not correlated in a specific sample, compared to cancer sample group.

2) GENERATION OF IMPACT MATRIX
Once the patient-specific gene networks are made, we apply modified PageRank to the network to make the impact matrix for each patient. The impact matrix, or IM, is composed of n feature vectors that correspond to the dimension and the number of genes. A feature vector of a specific gene implies the impacts that it has on all the genes, and element eij of IM implies an impact from genej to genei. Feature vectors calculated by the modified PageRank are used to make gene vectors, which act as the input to the machine learning methods. To apply the modified PageRank algorithm, a stochastic matrix ̃ of the patient-specific gene weight matrix is first calculated as (6).
In (6), is a diagonal matrix where the i-th diagonal entry is equal to the sum of elements on the i-th column in , and −1 is the inverse matrix of . One of the properties of stochastic matrix ̃ is that the sum of elements on each column is 1 and an element ̃[ , ] can be interpreted as the probability with which we proceed to j-th gene from i-th gene. The modified PageRank algorithm computes feature vectors iteratively using the patient-specific stochastic matrices, as illustrated in Fig.1. An initial feature vector of the i-th gene 0 [ ] is a one-hot vector of which the dimension is equal to the number of genes. At the initial time, the i-th entry of the feature vector has value 1 and the other entries have zero, which means that the initial impact matrix IM0 is an identity matrix. By iteratively multiplying the patient-specific stochastic matrix by the feature vector (7), the positive value of i-th entry spreads to other entries and this process repeats until the feature vector reaches the steady state. Equation (7) takes a different approach compared to standard PageRank. First, while the damping factor of PageRank has a fixed value, MPD has a dynamic damping factorweights of self-loops (Ψ in (1)) can be seen as (1-damping factor). The weights of self-loops are correlated with gene expression differences between cancer samples and normal samples, assuming that genes which have a greater expression difference tend to be affected more by other genes. This suggests that nodes with greater self-loop weights should have smaller impacts on neighboring genes, corresponding to a smaller damping factor. Second, while initial node values are multiplied by (1damping factor) for every time point in PageRank, node values at time point t are multiplied by (1-damping factor) to calculate node values of time point + 1 in MPD. The reason for this modification is that the initial node values are the same at time 0 in MPD, and we assumed that node's impact from other nodes in the network is accumulated as time flows. The reason why we used both row and column of the impact matrix is because driver genes are not necessarily associated with upstream genes in the whole gene network. The example for generation of a gene vector is illustrated in Fig.1.

C. PREDICTION OF PATIENT-SPECIFIC CANCER DRIVER GENE
The gene vector of gene G indicates the impacts that G can have on, or receive from, all the genes, and all the cancer patients have a different gene vector for gene G. In this section, we explain how a machine learning model learns latent information about cancer drivers from gene vectors and determines which gene acts as a cancer driver. We first get positive and negative gene vectors. The gene vectors are labeled as positive if a gene is known driver in Intogen, CGC, and NCG, and has SM, CNA, or DNAm for each patient. A positive set comprises the positively labeled gene vectors for all the patients. For each patient, we randomly select gene vectors from genes that are not known to be driver gene and do not have SM, CNA, and DNAm, and label them as negative. Negatively labeled gene vectors for all the patients compose the negative set.
To build a classification model, we performed five-fold cross validation and two independent tests as in Fig.2. The goal of cross validation is to select a machine learning algorithm, patient-specific gene network construction method, and type of omics data, that will give the best results. The goal of independent tests is to compare the performance of patientspecific driver gene identification with those of existing methods. For the cross validation, a positive set of each fold consists of gene vectors of one-fifth of the known driver genes. A negative set of each fold consists of gene vectors that are randomly selected genes not known to be drivers. Note that we used only positive set I for cross validation, because driver genes in NCG are not cancer type specific. For independent tests, all the positive gene vectors not used for training are testedif positive set I is used for training, positive set II is used for test, and vice versa. Because randomly selected negative set can affect training performance, we made five negative sets and averaged recall and precision. For both cross validation and independent tests, the same number of gene vectors for the negative and positive set were selected for training. We used RF, Naïve Bayesian classifier (NB) [28] and Deep Neural Networks (DNNs) to train a classification model. Note that known driver genes can also be predicted as cancer driver genes, even if they were already used for training. This is because gene vectors of known driver genes are patientspecific, which means gene vectors of the same gene are different for different cancer patients. So, even if gene vectors of same genes are used for both training and test, they do not overlap as long as they are not from the same cancer patients.

A. DATASETS
Four types of omics data (mRNA expression, SM, CNA, and DNAm data) from six cancer types (BRCA, COAD, LIHC, LUAD, PAAD, and STAD) were downloaded from the TCGA data portal. For DNAm data, the top 5% methylation levels of each sample were replaced by 1 and the rest by 0. For gene expression data, genes with zero FPKM value in more than 80% of samples were excluded. The known driver gene information was downloaded from Intogen, CGC, and NCG. Driver genes provided by CGC were divided into two tiers, and only Tier 1 genes were used because driver genes corresponding to Tier 1 had more evidence for cancer occurrence than those in Tier 2. The described datasets are summarized in the Table 1.
We also downloaded the only directed edges of FI network provided by the Reactom and the gene regulatory networks from the RegNetwork and TRRUST, and integrated them. The details about the integrated network are given in the Table 2.

B. FIVE-FOLD CROSS VALIDATION RESULTS
To choose the best 1) machine learning method, 2) patientspecific gene network construction method, and 3) combination of omics data, we performed a five-fold cross validation as shown in Fig.2.

1) COMPARISON ON DIFFERENT MACHINE LEARNING METHODS
We compared three machine learning models, RF, NB, and DNN in order to find the machine learning method that can best learn the gene vectors we created. We found optimal parameters for each method through iterative experiments, and used the parameters with n_estimator of 50 for RF, and four hidden layers with size 5,000, 1,000 and 100 were used for DNN. We calculated recall, precision, F1 score and F0.5 score using the genes selected as driver genes in 0% to 99% of all patient samples for each of six cancer type and averaged them as shown in Fig.3. S1 Fig show F1 score, F0.5 score, precision and recall for each cancer type. Note that 0% means that we use the union of selected driver genes of all the samples, and 100% is not shown because no genes were selected as driver gene in all the samples. We can see that RF shows much higher precision which leads to high F1 and F0.5 scores, so RF was used for the rest of the experiments. The proposed patientspecific gene network construction method was used to build gene networks and all the omics data were used for Fig.3 and S1 Fig.

2) COMPARISON ON METHODS FOR PATIENT-SPECIFIC NETWORK CONSTRUCTION
The differences between patients are represented by patientspecific networks, so accurately constructed patient-specific net-works are very important for accurate identification of the patient-specific driver genes. To show the performance of the proposed patient-specific network construction method, driver genes were found using the proposed network Single Sample Network (SSN) [16], and a random weight network, and F1 score, F0.5 score, precision and recall of each case were compared. Fig.4 shows averaged results, and S2 Fig shows F1 score, F0.5 score, precision and recall for each cancer type. We can see in Fig.4 that the proposed patient-specific network had higher precision than others except 99% of samples, which lead to higher F0.5 score and higher F1 score for less than 70% of samples. All the omics data were used for Fig.4 and S2 Fig.

3) COMPARISON ON METHODS OF DIFFERENT OMICS DATA
We compared the F1 score, F0.5 score, precision and recall when used 1) only SM data, 2) SM and CNV data, 3) SM and DNAm data, and 4) used all data types. F0.5 score and precision for each cancer type are shown in Fig.5, and F1 score, F0.5 score, precision and recall for each cancer type are provided in S3 Fig. But for COAD and LIHC, because using  all the data together shows slightly better F0.5 score, we use all types of omics data for independent tests for COAD and LIHC.

C. INDEPENDENT TEST RESULTS
We performed two independent tests as explained in Fig.2. In independent tests, we compare the F1 score, F0.5 score, precision and recall of MPD, and those of Dawnrank, PNC and PRODIGY. Note that in one other recently published method, SCS was not used in the comparison because we were not able to get its code or executable file. For PNC and PRODIGY, only paired samples were used, because PNC showed better performance for paired samples, and PRODIGY required too much time to get through all the samples. As mentioned in section B, we used all the omics data types for COAD and LIHC, and used SM and DNAm for the rest of the cancer types. Fig.6 show averaged results for independent test I, and S4 Fig  shows F1 score, F0.5 score, precision and recall for each cancer type. We can see in Fig.6 that precision of MPD is higher to or comparable to DawnRank but recall is higher, which leads to highest F1 score when less than 20% of samples are used, and always highest F0.5 score. PNC has highest recall but low precision. Fig.7 show averaged results for independent test II, and S5 Fig  shows F1 score, F0.5 score, precision and recall for each cancer type. Fig.7 also shows that precision of MPD is higher than or comparable to DawnRank, but recall is higher. MPD showed the highest average F1 and F0.5 score when percentage of samples exceeded 20 and 30, respectively. In S5  Fig, we can see that MPD showed good performance for BRCA, LIHC, LUAD and STAD, but has low F1 score, F0.5 . Results of independent test II score and precision when smaller samples were used. This was especially true for COAD and PAAD, which was likely due to the small number of COAD and PAAD samples, resulting in small number of training gene vectors. Next, we counted the number of driver genes found for each cancer type to calculate the ratio of rare driver genes. Fig.8 shows the ratio of genes identified as driver genes in less than 1%, 1% to 5%, 5% to 10%, 10% to 50%, and 50% to 100% of samples. We can see that about 20~40% of genes were identified in ~1% of samples and 40-50% of genes were identified in ~5% of samples, which means the majority of genes identified as drivers were rare. In Figu. 6 we can see that MPD shows higher F1 and F0.5 scores as less samples were used. In Fig.7, unlike Fig.6, F1 and F0.5 scores increased as more samples were used, and MPD continued to increase relative to others when more than 10% of samples are used. Collectively, these results tell us that MPD successfully identifies rare driver genes with strong accuracy. The differences between patients are represented by patientspecific networks, so accurately constructed patient-specific networks are very important for accurate identification of the patient-specific driver genes. To show the performance of the proposed patient-specific network construction method, driver genes were found using the proposed network Single Sample Network (SSN), and a random weight network, and the F1 scores of each case were compared. Fig. 3 shows that the proposed patient-specific network greatly outperforms others except 99% of BRCA. Supplementary fig. S13~S16 show F1 score, F0.5 score, precision and recall, respectively, and we can see that proposed method generally shows good precision.

IV. DISCUSSION
While existing methods of identifying patient-specific cancer driver genes are usually based on various kinds of network searching algorithms, we proposed a machine learning based method named MPD to reveal patient-specific cancer driver genes. We showed that MPD generally produced higher F1 and F0.5 scores in comparison with existing methods. Compared to DawnRank which frequently shows good performance among existing methods, MPD showed higher or comparable precision but with higher recall. Expected reasons for good performance of MPD are 1) accurately constructed patient-specific networks, 2) ability of gene vectors to characterize the latent roles of genes in cancer genome, and 3) intrinsic ability of machine learning techniques to find hidden patterns. Machine learning based search can be expected to show high performance in many cases due to recent advances in machine learning methods, but often times, the number of samples is too small. Our work solves this problem of having too few samples by creating a sufficient number of gene vectors for each known driver gene. Despite its strong results, MPD has some limitations and additional work remains. The main disadvantage of MPD is that it still requires a number of tumor and normal samples to create accurate patient-specific gene networks. Research to overcome this barrier could be the subject of a future study. MPD has another limitation: it does not tell us why a gene is identified as a driver gene because most machine learning models (including Random Forest) are black box models. Interpreting the trained model could be the subject of another future study. In addition, as a classification model, MPD requires negatively labeled samples that are hard to optimize. Because good negative samples can be important for better prediction performance, we are planning to develop a better way to select negative samples.