Analysis for Disease Gene Association Using Machine Learning

To understand the basis of a disease, it is essential to determine its underlying genes. Understanding the association between underlying genes and genetic disease is a fundamental problem in human health. Identifying genes and associating them with a disease requires time-consuming and expensive experiments on a large number of potential candidate genes. Therefore, inexpensive and rapid computational methods have been proposed to identify candidate genes associated with a disease. Most of these methods rely either on phenotypic similarity, on the grounds that genes causing the same or similar diseases show less variation in their sequences, or on network properties of protein-protein interactions, on the premise that genes causing the same or similar diseases lie close to each other in the protein interaction network. However, these methods use only basic network properties (topological features) and gene sequence information (biological features) as prior knowledge for identifying gene-disease associations, which restricts the identification process to a single gene-disease association. In this study, we propose and analyze several novel computational methods for identifying genes associated with diseases. Advanced topological and biological features that are currently overlooked are introduced for identifying candidate genes. We evaluate different computational methods on disease-gene association data from DisGeNET using 10-fold cross-validation, based on TP rate, FP rate, precision, recall, F-measure, and ROC curve evaluation parameters. The results reveal that various computational methods with the advanced feature set outperform previous state-of-the-art techniques, achieving precision up to 93.8%, recall up to 93.1%, and F-measure up to 92.9%. We apply our methods to four major diseases: Thalassemia, Diabetes, Malaria, and Asthma.
Simulation results show that the proposed Deep Extreme Learning Machine (DELM) gives more accurate results as compared to previously published approaches.


I. INTRODUCTION
A gene is the basic physical and functional unit of heredity and is responsible for different biological processes in an organism. A mutation in a single gene sequence may alter a biological process and lead to a certain disease. The genes in the human body are not isolated; they interact with one another. Therefore, a mutation in a single gene may affect its interacting genes, which may in turn alter different biological processes and cause different diseases. Consequently, understanding these biological mechanisms and using them to discover the relationships between diseases and genes is a serious challenge in modern biology and medicine.
The associate editor coordinating the review of this manuscript and approving it for publication was Kin Fong Lei.
Understanding the association between causal genes and their genetic diseases is a fundamental problem in human health [1]. Technology is involved in the detection and monitoring of various human diseases such as Parkinson's disease [2], and the Internet of Medical Things (IoMT) is also a focus for addressing human health [3]. Different experimental methods have been proposed to associate genes with a disease, but these methods are expensive in terms of cost and time [4]. For that reason, alternative computational approaches are gaining popularity for disease-gene association. These approaches identify or prioritize genes associated with a disease based on gene sequences and gene interactions. When the genes involved in a disease are not known, prioritization is based on features such as the similarity between known disease genes and a given gene [5], as shown in Figure 1. The figure illustrates the diseases (D1-D4) and the known causal genes (G1-G4), with connecting lines representing the associations between genes and diseases. Features (F1-F5) are evaluated using gene information, with connecting lines representing the interactions with known causal genes. When a gene is prioritized as a candidate, a threshold is used to determine its likelihood of being involved in a disease. Various approaches have been proposed for prioritizing candidate genes: some used phenotypic similarity measures, some focused on heterogeneous networks, and some relied on comprehensive and accurate protein-protein interaction data together with known disease gene data [6]. PROSPECTR, presented in [7], uses sequence-based features to identify genes involved in Mendelian and oligogenic disorders. Similarly, [5] used random walk analysis, which captures similarity in protein-protein interactions, together with an artificial linkage interval. This interval contains the first 100 genes located nearest to the disease gene, based on genetic distance on the same chromosome. For ranking the genes, they also used PROSPECTR [5]. A candidate gene prioritization method based exclusively on the protein-protein interaction network (PPIN) is defined in [8]. Beyond functional annotation and protein interaction data, another gene selection method, proposed in [9], uses a 1D discrete wavelet transform. This technique assigns scores to genes for distinguishing samples between two classes. It decomposes the gene expression signal using various levels of the wavelet transform. The genes with the maximum scores are selected to form a feature set for sample classification.
A global network-based technique for prioritizing disease genes and inferring protein complex associations is provided in [10]. The authors developed a method named PRINCE, based on constraints on the prioritization function that relate to its smoothness over the network and its use of prior information. The technique predicts not only gene associations but also protein complex associations with the disease of interest [10]. Gene prioritization is used not only to identify disease genes but also to identify promising candidates from various studies that generate gene lists [11]. Cofunction networks (CFNs) are used, based on mutual functional similarity, to prioritize clusters of candidate genes from multiple disease-associated loci [12]. Recently, a new type of technique for prioritizing candidate genes associated with a certain disease has been proposed. It is a knowledge-based methodology that learns gene-gene association tendencies in diseases from known gene-disease associations. Mutual information is used to quantify the strength of gene-gene association in a certain disease [13].
Most of the above-mentioned techniques emphasize prioritizing independent genes, although in many cases mutations at different loci can lead to the same disease [10]. Although these approaches perform well, they still have limitations [6]. For PROSPECTR, a reliable and sizeable set of genes involved in complex traits could not be obtained, which meant that the performance of the algorithm could not be verified on the components of complex diseases [7]. The method in [5] can be used only for genes whose protein-protein interactions are known or predicted. Similarly, the technique in [8] is applicable only to known disease-related (seed) genes. The wavelet-based method in [9] consistently outperforms other methods only when more than 40 genes are selected. PRINCE depends on prior phenotype knowledge, which limits its application to diseases that are phenotypically related to diseases with known causal genes. In addition, PRINCE does not consider other relevant data, such as genes that are differentially expressed in the disease state; it uses only known disease-gene associations [10]. The knowledge-based approach adopts an alternative technique, Know-GENE, without using mutual information or the network propagation technique. It does not perform well for genes that do not appear in any of the diseases used to derive the mutual information [13]. Therefore, in this study, we propose and analyze several novel computational methods for identifying genes associated with diseases based on advanced biological features. Biological features are calculated from gene sequence information. Furthermore, we also test our data on advanced topological features. Topological features are calculated from protein complexes. Most of these biological and topological features were not used in previous studies.
We extract the biological feature set by applying the discrete wavelet transform to the EIIP values of the amino acids of each gene. Although [9] used a 1D DWT for gene selection, it did not use the EIIP values of the amino acid sequence. Similarly, [14] used only degree connectivity and betweenness centrality in a computational pipeline for prioritizing disease-associated genes. It did not use the other topological features that we propose and with which we achieve remarkable results. Second, we use supervised learning to identify genes associated with diseases, which has not been done before. Most previously proposed methods used unsupervised prioritization techniques.
To analyze different computational techniques, the known disease genes are downloaded from DisGeNET, the gene sequences from UniProt, the binary protein interactions from HPRD (Human Protein Reference Database), and the true human protein complexes from the Comprehensive Resource of Mammalian protein complexes (CORUM). Using these data resources, we extract different biological and topological features. The extracted feature set is passed to different computational methods for classification using 10-fold cross-validation. We analyze the power of the proposed methods on four major diseases: Malaria, Asthma, Thalassemia, and Diabetes. Because more than one gene can be responsible for a single disease, our approach identifies and associates each of these genes with the disease using the advanced features. With the advanced feature set, most of the proposed computational methods perform well, achieving the highest accuracy, as demonstrated in Section IV. Furthermore, we compared our proposed methods with previous methods and found that the proposed methods, using advanced topological and biological features, outperform them.
Although numerous computational methods have been proposed for disease-gene association, the proposed analysis introduces a new feature set alongside the existing one and gives insight into the behavior of various machine learning models that have not been tested previously. Instead of exploring the behavior of a single machine learning model, this study investigates the prediction power of different machine learning models with the newly introduced feature set. With the advanced feature set, these models are shown to be more accurate and precise than previous models. Furthermore, this study brings together the behavior of different machine learning models for disease-gene association in a single manuscript, which will be helpful for future studies and for the health community.
The remainder of this article is organized as follows. Section II explains the material and methods. Section III describes the evaluation parameters. Section IV presents the results obtained with the proposed methodology, Sections V to VII provide comparative analyses, and Section VIII concludes the paper.

II. MATERIAL AND METHODS
The flow chart of the research methodology is shown in Figure 2. First, datasets are downloaded from different sources. Second, different biological and topological features are extracted from the datasets. Finally, different computational models are tested using the feature set, and the behavior of these models is analyzed.

A. DATASETS AND SOURCES
The diseases and their respective genes are downloaded from DisGeNET (http://www.disgenet.org/), and the sequences of the genes involved in the given diseases are downloaded from UniProt (http://www.uniprot.org/). DisGeNET is one of the largest databases of genes involved in human diseases. UniProt is a database of protein sequences. The binary protein interactions are downloaded from HPRD (http://www.hprd.org/), and the true human protein complexes are downloaded from CORUM (http://mips.helmholtz-muenchen.de/corum/). All of these are publicly available datasets. Weka (http://www.cs.waikato.ac.nz/ml/weka/), a data mining tool, is used for analysis and evaluation. Table 1 gives the statistics of our extracted dataset.

B. FEATURE EXTRACTION
After the datasets are downloaded, useful biological and topological information is extracted and used as a feature set. The feature sets are specified as follows.
a) Biological Features: The biological features are extracted from the amino acid sequence information of the gene. Amino acids play a vital role and act as the building blocks of genes. A mutation in an amino acid sequence may lead to a certain disease, so genes causing certain diseases may have a similar amino acid sequence structure. This is the motivation for computing biological features. This feature set consists of length, entropy, and discrete wavelet features. Length and entropy were previously used for protein complex identification [19], [20] but not for disease-gene association. To the best of our knowledge, discrete wavelet features derived from EIIP values have not been used in any previous study. We are the first to use the EIIP values of amino acids for feature extraction by applying a discrete wavelet transform.
b) Length: A gene sequence is a combination of different amino acids, and different sequences have different lengths. The length is the total number of amino acids that appear in a sequence.
c) Entropy: The entropy is estimated from the probabilities of the distinct amino acids in a sequence and is defined in equation 1.
$$H = -\sum_{n} p_n \log p_n \tag{1}$$
where $p_n$ is the probability of amino acid $n$ in a sequence.
d) Discrete Wavelet Transform (DWT): The DWT is a tool that extracts useful information from a data source at high speed, without losing data, while distinguishing between important and unimportant components [15]. In this study, it is used to extract useful information from gene sequences. To compute the discrete wavelet features, we first replace each amino acid in a sequence with its electron-ion interaction potential (EIIP) value. The EIIP value describes the average energy state of all the valence electrons in the specific amino acid [16]. We then apply the discrete wavelet transform to these values, which returns approximation coefficients and detail coefficients for the EIIP values of a sequence. These approximation and detail coefficients are used as a feature set to identify the underlying genes associated with a disease. Each amino acid has its own EIIP value. The list of amino acids, codes, and EIIP values is given in a separate table.
When a gene interacts with other genes, it forms a complex that can be represented as a graph. A graph consists of vertices and edges and can be directed or undirected. The vertices are connected to each other by edges. If the edges have a direction from one vertex to another, the graph is directed; otherwise it is undirected. Because a gene interacts with other genes and forms a complex, genes can be represented as vertices and binary interactions between genes as edges. Here, we treat the gene interactions as undirected. The protein-protein interaction (PPI) data available for the human genome provide a new opportunity for discovering hereditary disease genes using topological features of the PPI network [17]. It is important to analyze genes to prevent genetic problems [18], [23]. Moreover, genes causing the same or similar diseases may lie closer to each other.
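Returning to the biological features defined above (length, entropy, and DWT coefficients of the EIIP signal), the following is a minimal sketch of how they might be computed. It makes several assumptions that the text does not confirm: the logarithm in the entropy is taken in base 2, a single-level Haar wavelet is used, and the `EIIP` dictionary holds the commonly cited EIIP values, which should be checked against the table used in this study. The function names are hypothetical.

```python
import math
from collections import Counter

# Commonly cited EIIP values (in Rydberg) for the 20 standard amino acids.
# Assumption: verify these against the EIIP table used in the study.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def sequence_length(seq):
    """Length feature: the number of amino acids in the sequence."""
    return len(seq)


def shannon_entropy(seq):
    """Entropy feature: -sum(p_n * log2(p_n)) over the distinct amino acids.

    Base 2 is an assumption; the text does not state the base.
    """
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def haar_dwt(signal):
    """One level of the Haar DWT; returns (approximation, detail) coefficients.

    The Haar wavelet is an assumption; the text does not name the wavelet.
    """
    values = list(signal)
    if len(values) % 2:  # pad odd-length signals by repeating the last sample
        values.append(values[-1])
    root2 = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / root2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / root2 for i in range(0, len(values), 2)]
    return approx, detail


def biological_features(seq):
    """Collect length, entropy, and DWT coefficients of the EIIP-encoded sequence."""
    eiip_signal = [EIIP[aa] for aa in seq]
    approx, detail = haar_dwt(eiip_signal)
    return {
        "length": sequence_length(seq),
        "entropy": shannon_entropy(seq),
        "approximation": approx,
        "detail": detail,
    }


# Hypothetical short peptide, used only for illustration.
features = biological_features("MVHLTPEEK")
```

Because sequences differ in length, the number of coefficients also differs between genes; a real pipeline would need to summarize or truncate them to obtain a fixed-size feature vector, a step the text does not describe.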
We therefore use the topological structure of the gene network to compute a feature set known as topological features. The topological feature set consists of degree, eccentricity, neighborhood connectivity, average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, radiality, topological coefficient, and stress centrality [19]. This feature set was previously used for protein complex identification [19], [20] but not for disease-gene association. The results reveal that, with this advanced topological feature set, various computational methods perform significantly better.
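As an illustration of the topological features listed above, the sketch below computes a subset of them (degree, neighborhood connectivity, clustering coefficient, average shortest path length, eccentricity, and closeness centrality) on an undirected graph. The definitions follow common graph-theory conventions, which are assumed here and may differ in detail from the tool used in the study; the gene names and edges are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of binary interactions."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def topological_features(graph, node):
    """Compute a few topological features for one node (assumed definitions)."""
    neighbors = graph[node]
    degree = len(neighbors)
    # Clustering coefficient: fraction of neighbor pairs that are themselves linked.
    links = sum(1 for u in neighbors for v in neighbors if u < v and v in graph[u])
    possible = degree * (degree - 1) / 2
    clustering = links / possible if possible else 0.0
    # Neighborhood connectivity: average degree of the node's neighbors.
    nbr_conn = sum(len(graph[u]) for u in neighbors) / degree if degree else 0.0
    # Path-based features over the nodes reachable from this node.
    distances = [d for n, d in shortest_path_lengths(graph, node).items() if n != node]
    avg_spl = sum(distances) / len(distances) if distances else 0.0
    return {
        "degree": degree,
        "neighborhood_connectivity": nbr_conn,
        "clustering_coefficient": clustering,
        "average_shortest_path_length": avg_spl,
        "eccentricity": max(distances) if distances else 0,
        "closeness_centrality": 1.0 / avg_spl if avg_spl else 0.0,
    }


# Hypothetical interaction network, used only for illustration.
ppi = build_graph([("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4")])
g3 = topological_features(ppi, "G3")
g4 = topological_features(ppi, "G4")
```

Betweenness, stress, radiality, and the topological coefficient are omitted to keep the sketch short; they require all-pairs shortest-path counting rather than a single breadth-first search.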
After computing the feature set for the different genes, we trained various computational models to classify these genes, based on the different features, into four disease classes: Thalassemia, Diabetes, Malaria, and Asthma. The prediction power of these computational models is analyzed using the evaluation parameters discussed in Section III, and their performance is presented in Section IV (Results and Discussion).
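The models in this study are evaluated with 10-fold cross-validation. The sketch below shows one way the four-class gene labels might be divided into folds. It assumes stratified folds (each fold keeps roughly the class proportions of the full dataset), which the text does not state explicitly; the function names and the label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test_indices in enumerate(folds):
        train_indices = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_indices, test_indices


# Hypothetical label list: 20 genes per disease class.
labels = ["Thalassemia"] * 20 + ["Diabetes"] * 20 + ["Malaria"] * 20 + ["Asthma"] * 20
splits = list(cross_validation_splits(labels, k=10))
```

Each model would be trained on the training indices of a split and scored on the held-out fold, with the evaluation parameters averaged over the ten folds.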

C. PROPOSED DEEP EXTREME LEARNING MACHINE FRAMEWORK
The deep extreme learning machine (DELM) has been applied to forecasting health conditions, forecasting electricity use, transport and traffic control, and other tasks. Because DELM learns efficiently, it can be used in a wide range of contexts for classification and regression. An extreme learning machine is a feedforward neural network in which data move in one direction across the layers. In the proposed model, however, back-propagation is used during training, so that errors flow back through the network. The weights of the network are held constant during validation, where we load the trained model and predict on the actual data. The DELM model comprises an input layer, several hidden layers, and one output layer.
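In the DELM method, the $n$-th input node, the $i$-th hidden node, and the $p$-th output node are denoted $a_n$, $m_i$, and $\ddot{u}_p$, respectively, while all $N$ input nodes, $l$ hidden nodes, and $P$ output nodes are collected as
$$a = [a_1, a_2, \ldots, a_N]^T \in \mathbb{R}^{N}, \quad m = [m_1, m_2, \ldots, m_l]^T \in \mathbb{R}^{l}, \quad \ddot{u} = [\ddot{u}_1, \ddot{u}_2, \ldots, \ddot{u}_P]^T \in \mathbb{R}^{P}.$$
The DELM framework can therefore be written compactly as
$$m = f(Ba + c) \tag{2}$$
and
$$\ddot{u} = Qm \tag{3}$$
where $B = [b_{in}] \in \mathbb{R}^{l \times N}$, $c = [c_1, c_2, \ldots, c_l]^T \in \mathbb{R}^{l}$, $Q = [q_{pi}] \in \mathbb{R}^{P \times l}$, and the activation function $f(\cdot)$ can be a sigmoid, linear, or Gaussian function, among others.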
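Assume there are $V$ distinct training records, and let $a_v \in \mathbb{R}^{N}$ and $\ddot{u}_v \in \mathbb{R}^{P}$ denote the $v$-th training input and the corresponding $v$-th training output, respectively, where $v = 1, 2, \ldots, V$. The input sequence and the output sequence of the training dataset can then be written as
$$A = [a_1, a_2, \ldots, a_V] \in \mathbb{R}^{N \times V} \tag{4}$$
and
$$\ddot{U} = [\ddot{u}_1, \ddot{u}_2, \ldots, \ddot{u}_V] \in \mathbb{R}^{P \times V} \tag{5}$$
respectively. Substituting (4) into (2) gives
$$M = f(BA + \mathbf{1}^T \otimes c) \tag{6}$$
where $M = [m_1, m_2, \ldots, m_V] \in \mathbb{R}^{l \times V}$ is the value sequence of all $l$ hidden nodes, $\mathbf{1}$ is the all-ones vector of length $V$, and $\otimes$ denotes the Kronecker product. Substituting (6) and (5) into (3) then gives the actual training output sequence
$$\ddot{U} = QM \tag{7}$$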
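The batch forward pass described above, in which the hidden values are obtained by applying the activation to $BA$ plus the bias $c$ added to every column and the outputs are obtained by multiplying with $Q$, can be sketched in plain Python as follows. The sigmoid activation, the uniform random initialization range, and all function names are assumptions made for illustration only.

```python
import math
import random


def sigmoid(x):
    """Logistic activation; adequate for the small inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in X]


def random_matrix(rows, cols, rng):
    """Random weights in [-1, 1]; the range is an assumption."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def hidden_output(B, c, A):
    """M = f(B A + 1^T (x) c): column v of M is f(B a_v + c)."""
    BA = matmul(B, A)
    return [[sigmoid(BA[i][v] + c[i]) for v in range(len(BA[i]))] for i in range(len(BA))]


def network_output(Q, M):
    """Output sequence: the product Q M."""
    return matmul(Q, M)


# Hypothetical sizes: N = 3 inputs, l = 5 hidden nodes, P = 4 outputs, V = 6 records.
rng = random.Random(0)
A = random_matrix(3, 6, rng)            # N x V input sequence
B = random_matrix(5, 3, rng)            # l x N input weights (randomly fixed)
c = [rng.uniform(-1.0, 1.0) for _ in range(5)]  # hidden biases (randomly fixed)
Q = random_matrix(4, 5, rng)            # P x l output weights
M = hidden_output(B, c, A)              # l x V hidden values
U = network_output(Q, M)                # P x V outputs
```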
In DELM, only the output weight $Q$ is adjustable, while $B$ (the input weights) and $c$ (the biases of the hidden nodes) are randomly fixed. Denote the desired output as $Y$. DELM then minimizes the estimation error
$$E = Y - \ddot{U} = Y - QM \tag{8}$$
by finding the least-squares solution $Q$ of the problem
$$\min_{Q} \| Y - QM \|_F^2 \tag{9}$$
where $\| \cdot \|_F$ denotes the Frobenius norm. For problem (9), the unique minimum-norm least-squares solution is
$$Q = Y M^T (M M^T)^{-1} \tag{10}$$
To avoid overfitting, Tikhonov regularization can be employed to modify Eq. (10) into
$$Q = Y M^T (M M^T + v_0^2 I)^{-1} \tag{11}$$
where $v_0^2 > 0$ denotes the regularization factor. Evidently, Eq. (10) is the special case of Eq. (11) with $v_0^2 = 0$. Consequently, we consider only Eq. (11), the Tikhonov-regularized form, for the DELM.
In machine learning, a common strategy is to increase the number of hidden layers gradually until the desired accuracy is reached. When this strategy is applied directly in DELM, however, the matrix inversion in Eq. (11) must be repeated whenever one or a few hidden nodes are added, which makes the algorithm computationally prohibitive. The back-propagation process comprises weight initialization, feedforward propagation, back-propagation of the error, and updating of the weights and biases. Each neuron in the hidden layer applies an activation function such as the sigmoid, $g(x) = 1/(1 + e^{-x})$. Eq. (12) gives the back-propagation error, computed as half the sum of the squared differences between the expected outcome and the network output. The weights must be changed to reduce the overall error. The rate of weight change for the output layer is presented in Eq. (13). Eq. (14) is then expanded using the chain rule, and substituting the resulting values into Eq. (14) gives the weight change presented in Eq. (15).
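As a concrete illustration of the Tikhonov-regularized least-squares step discussed above (output weights obtained from the hidden values $M$, the desired outputs $Y$, and a regularization factor), the following sketch solves for $Q$ in plain Python. The Gauss-Jordan solver, the default regularization value, and the function names are assumptions for illustration; a numerical library would normally be used instead.

```python
def transpose(X):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*X)]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in X]


def solve(G, R):
    """Solve G X = R by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    aug = [list(G[i]) + list(R[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; increase the regularization")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def output_weights(M, Y, reg=1e-3):
    """Q = Y M^T (M M^T + reg * I)^(-1), with reg in the role of v_0^2."""
    Mt = transpose(M)
    G = matmul(M, Mt)                 # l x l Gram matrix of the hidden values
    for i in range(len(G)):
        G[i][i] += reg                # Tikhonov term on the diagonal
    R = matmul(Y, Mt)                 # P x l
    # G is symmetric, so Q^T = G^(-1) R^T; solve instead of inverting explicitly.
    return transpose(solve(G, transpose(R)))


# Hypothetical hidden values (l = 2, V = 3) and targets generated by Q = [[2, -1]].
M = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Y = [[2.0, -1.0, 1.0]]
Q_plain = output_weights(M, Y, reg=0.0)
Q_ridge = output_weights(M, Y, reg=1.0)
```

With `reg=0.0` the sketch recovers the weights that generated `Y`, while a positive `reg` shrinks the weights toward zero, which is the intended effect of the regularization.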
The corresponding weight update for the hidden weights is derived in the next step. This is more complicated, because through its weighted connections a hidden node can contribute to the error at every output node. The update of the weights and biases between the output layer and the hidden layer is presented in Eq. (16e).
Eq. (17) shows how the weights and biases between the input layer and the hidden layer are updated.
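The back-propagation steps described above (feedforward pass, half-squared error, chain-rule error terms, and weight and bias updates for both layers) can be sketched for a network with a single hidden layer as follows. This is a generic textbook formulation under stated assumptions, not a reproduction of Eqs. (12)-(17): the sigmoid activation on both layers, the learning rate, the initialization range, and all names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W1, b1, W2, b2):
    """Feedforward pass: input -> hidden -> output, sigmoid on both layers."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b) for row, b in zip(W2, b2)]
    return hidden, output


def half_squared_error(target, output):
    """Error: half the sum of squared differences between target and output."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """One gradient-descent update in place; returns the error before the update."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # Output-layer error term from the chain rule: (t - o) * o * (1 - o).
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
    # Hidden-layer error term: each hidden node collects error from every output node.
    delta_hid = [
        h * (1.0 - h) * sum(delta_out[k] * W2[k][j] for k in range(len(W2)))
        for j, h in enumerate(hidden)
    ]
    # Update weights and biases between the hidden and output layers.
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += lr * delta_out[k] * hidden[j]
        b2[k] += lr * delta_out[k]
    # Update weights and biases between the input and hidden layers.
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += lr * delta_hid[j] * x[i]
        b1[j] += lr * delta_hid[j]
    return half_squared_error(target, output)


# Hypothetical example: 3 input features, 4 hidden nodes, 2 outputs, one record.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
b2 = [0.0] * 2
x, target = [0.2, 0.7, 0.1], [1.0, 0.0]
losses = [train_step(x, target, W1, b1, W2, b2) for _ in range(200)]
```

The hidden-layer error terms are computed before the output weights are modified, so that both updates use the same forward pass.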

III. EVALUATION PARAMETERS
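To evaluate the performance of the various computational models, TP rate, FP rate, precision, recall, F-measure, and ROC area are used as evaluation parameters.
a) TP Rate:
$$\text{TP rate} = \frac{TP}{TP + FN} \tag{18}$$
b) FP Rate:
$$\text{FP rate} = \frac{FP}{FP + TN} \tag{19}$$
c) Precision:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{20}$$
d) Recall:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{21}$$
where TP is the number of disease genes that are correctly classified as disease genes, FN is the number of disease genes classified as non-disease genes, TN is the number of non-disease genes classified as non-disease genes, and FP is the number of non-disease genes classified as disease genes.
e) F Score: It is the harmonic mean of precision and recall.
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{22}$$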
f) ROC Area: A ROC curve plots the true positive rate against the false positive rate at distinct cut-off points of a variable. The ROC area is the area under this curve.
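The evaluation parameters above can be computed from the four confusion counts and from ranked prediction scores. The sketch below is a minimal illustration for the binary case (disease gene versus non-disease gene); the function names, the counts, and the scores are hypothetical, and the ROC area is obtained with the trapezoidal rule.

```python
def classification_metrics(tp, fp, tn, fn):
    """TP rate, FP rate, precision, recall, and F score from confusion counts."""
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp_rate  # recall and TP rate share the same definition
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom else 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


def roc_curve(scores, labels):
    """(FP rate, TP rate) points obtained by sweeping the cut-off over the scores.

    labels are 1 for disease genes and 0 for non-disease genes; both classes
    must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical counts and scores, used only for illustration.
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
area = roc_area(roc_curve([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

For the four-class problem studied here, such metrics would be computed per class (one disease against the rest) and then averaged; the averaging scheme is not specified in the text.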

IV. RESULTS AND DISCUSSION
We analyze several novel computational methods for disease-gene association using advanced biological and topological features. To evaluate the performance of these methods, we used the FP rate (false positive), TP rate (true positive), recall, precision, F-measure, and ROC curve. To validate our evaluation, we use 10-fold cross-validation. First, we test our data using only biological features; the results are shown in Table 3. With biological features, we achieved the highest accuracy, up to 90%-99%, while the FP rate is quite low. We then tested our data using only topological features and investigated the behavior. Table 4 shows that the FP rate increases significantly but the TP rate decreases. Moreover, with topological features, Random Forest, PART, LogitBoost, and the proposed DELM outperform the other methods. Random Forest has a TP rate up to 64.6%, an FP rate up to 48.3%, and precision up to 60.8%. Furthermore, to achieve good TP and FP rates together, we combine the biological features with the topological features and test the data. The results are shown in Table 5. It is evident from Table 5 that the TP rate increases significantly compared with the results for topological features only, while the FP rate varies by a small margin compared with the results for biological features only. Compared with the results for topological features only, however, the FP rate decreases. Moreover, with combined biological and topological features, Random Forest, PART, Multiclass Classifier, and the proposed DELM outperform the other methods, as shown in Table 5. All these algorithms make classification decisions based on regression analysis between attributes, so it can be concluded that algorithms that use regression analysis give the best results for disease-gene association.
However, Random Forest performs best in all the experimental setups, owing to its resistance to overfitting, its identification of important attributes for classification, and its use of multiple trees with multiple levels.
Moreover, with only biological features the FP rate is quite low compared with the TP rate, as shown in Table 3. With only topological features the FP rate increases significantly, as shown in Table 4, but the TP rate decreases compared with the biological feature results. Therefore, to achieve good TP and FP rates together, the biological and topological features are combined. This increases the TP rate significantly compared with the topological feature results, while the FP rate varies only by a small margin compared with the biological feature results in Table 3. We therefore conclude that computational approaches with combined biological and topological features achieve the best performance.

V. COMPARATIVE ANALYSIS OF NOVEL COMPUTATIONAL METHODS WITH RESPECT TO TRUE POSITIVE RATE AT DIFFERENT THRESHOLDS
Furthermore, to demonstrate the robustness of the different novel computational methods with advanced feature sets, we analyze them at different thresholds with respect to the True Positive Rate (TPR). Figures 3a, 3b, and 3c show that almost all the computational approaches correctly associate more diseases with genes at threshold (TPR) values greater than 0.70. Only a few methods have a TPR less than 0.70. Thus, we conclude that the use of the advanced biological and topological feature set significantly influences the performance of the different computational methods.

VI. COMPARATIVE ANALYSIS OF COMPUTATIONAL COST OF NOVEL COMPUTATIONAL METHODS
We also analyze the computational time taken by each method to build a model for classifying genes associated with the diseases. Figure 4 shows the computational cost analysis. Conjunctive Rule takes the least time to associate genes with diseases, and Multiclass Classifier takes more time than all the other methods. However, accuracy matters more than time in biological problems, and Multiclass Classifier yields more accurate results than Conjunctive Rule. Conjunctive Rule is less accurate than the other methods, as shown in Table 3, Table 4, and Table 5, whereas Random Forest surpasses all the other methods in accuracy, as shown in the same tables. To conclude, Random Forest is the most effective with respect to accuracy, while Conjunctive Rule is the most efficient with respect to time.

VII. COMPARISON WITH OTHER METHODS
We also compared our findings with previous methods and concluded that various computational methods with advanced biological and topological features surpass the previous methods. We analyzed the performance of the various methods based on several parameters: FP rate, TP rate, recall, precision, F-measure, and ROC area. The previous methods, however, use only one or two evaluation parameters. For example, the PageRank algorithm [8] was evaluated using ROC area, which was 50%. PROSPECTR [7] achieved 77% accuracy with two folds, 32% accuracy with five folds, and 11% accuracy with twenty folds. The PRINCE [10] algorithm was evaluated based on precision and recall; the precision was 61.8% and the recall was 26.3%. OPEN [21] achieved precision up to 20%, recall of 2-6%, and ROC area up to 91%. NETWORK PROPAGATION [22] has an Area Under the Curve (AUC) up to 60%, and CIPHER [23] achieved 24.7% precision. Know-GENE [13] has 65% recall, 40% AUPR, and 96% AUC. Comparing all these performances with our analyzed methods using advanced biological and topological features, it is evident that our analyzed methods perform better, with an average accuracy of 90%. The comparison of the analyzed computational methods with previous methods, using biological features and using combined biological and topological features, is shown in Figure 5 and Figure 6, respectively.

VIII. CONCLUSION AND FUTURE WORK
Understanding the association between underlying genes and genetic disease is a fundamental problem in human health. Different computational methods have been devised to tackle this problem, but these methods use limited information to identify genes associated with a disease, which restricts their performance. Therefore, to improve the performance of computational approaches, this study uses advanced biological and topological features and analyzes different computational methods. The advanced biological features are extracted from gene sequence information, and the advanced topological features are extracted from the network among the genes. The results show that, with advanced biological and topological features, the performance of various computational approaches increases significantly compared with previous methods. Moreover, the DELM method gives more accurate results than previously published approaches.
In the future, instead of the EIIP values of amino acids, biological properties such as polarization, ionization, and hydrophobicity could be used. Similarly, weighted features of gene interaction networks could be used to increase the accuracy of computational approaches. Further improvement could be achieved by integrating more verified features. Moreover, the privacy of the biological and topological features could be considered in the future, to protect the features from external tampering.
Hardware, Availability, and Performance: The computational experiments were executed on a Haier PC running Windows 8.1, with an Intel(R) Core(TM) i3-4010 processor at 1.70 GHz, a 64-bit operating system, and an x64-based processor. The runtime for inferring protein complexes or completing the cross-validation iterations was at most 5 minutes and at least 2 to 3 minutes. The datasets and code described herein are available upon request.