An Iteration Method for Identifying Yeast Essential Proteins From Weighted PPI Network Based on Topological and Functional Features of Proteins

Accumulating studies have indicated that essential proteins play critical roles in numerous biological processes. With the rapid development of high-throughput technologies, a large amount of Protein-Protein Interaction (PPI) data has been collected for Saccharomyces cerevisiae, which facilitates the construction of PPI networks. A series of computational methods for predicting essential proteins from PPI networks have been proposed, but their prediction accuracy is still not satisfactory. In this paper, a novel prediction method called CVIM is proposed to infer potential essential proteins. In CVIM, the original PPI network is first transformed into a weighted PPI network by applying the Pearson Correlation Coefficient (PCC) to gene expression data. Then, based on the weighted PPI network and information on orthologous proteins, critical network topological features and protein functional features are extracted for each protein in the weighted PPI network. Finally, based on these topological and functional features, an iterative algorithm is designed to predict essential proteins. To evaluate the identification performance of CVIM, we compared it with 13 state-of-the-art prediction methods. Experimental results show that CVIM achieves prediction accuracies of 92%, 80% and 71% among the top 1%, 5% and 10% of candidate proteins respectively, which significantly outperforms the accuracies achieved by those state-of-the-art methods. These results demonstrate that the prediction accuracy of essential proteins can be effectively improved by integrating the functional and network topological characteristics of proteins, which suggests that CVIM may be a useful addition to future protein research.


I. INTRODUCTION
A growing body of evidence has shown that essential proteins are critical to the development and survival of organisms, and that the absence of these proteins leads to the loss of biological function of protein complexes and to the death of organisms.
The associate editor coordinating the review of this manuscript and approving it for publication was Haipeng Yao.
Prediction of essential proteins plays a crucial role in bioinformatics research; it is not only of great significance to the study of life sciences, but also of great application value in drug design and the treatment of diseases. In recent years, a number of computational methods for essential protein prediction have been proposed. However, the identification accuracy of essential proteins is still not high. Hence, it is an important and challenging task to design walk based method to identify protein complexes by integrating Tandem Affinity Purification/Mass Spectrometry Data with PPI networks [20]. Jiawei L et al. put forward a computational method for detecting essential proteins by integrating local interaction density and protein complexes [21]. B.H. Zhao et al. adopted gene expression data and network topology attributes to construct a reliable weighted network, based on which a novel computational method called POEM was designed to predict essential proteins from overlapping essential modules [22]. Yijia Zhang et al. constructed a dynamic PPI network by integrating dynamic active information into high-throughput PPI data, and on this basis proposed a novel method for predicting protein complexes from dynamic PPI networks using the core-attachment structural feature [23]. Ma CY et al. presented a novel algorithm called NEOComplex to infer protein complexes by integrating PPI networks with functional orthology information obtained from different types of multiple network alignment approaches [24]. Lei X et al. proposed a method called IFPA for protein complex detection in multi-relation reconstructed dynamic protein networks by adopting the flower pollination mechanism [25]. Peng W et al. put forward an iterative method called ION to reveal essential proteins by integrating homologous information and PPI networks [26]. Luo J et al. designed a new algorithm to discover essential proteins by combining protein complex co-expression information with the edge clustering coefficient [27]. Xu B et al. developed a machine learning based method to identify protein complexes by integrating protein-protein interaction evidence from 6 different sources [28], and a computational method called GANE to predict protein complexes based on GO attributed network embedding [29]. Lei X et al. proposed a computational method called NABCAM to discover protein complexes from dynamic PPI networks [30]. Ou-Yang L et al. presented a multi-network clustering method to infer protein complexes from multiple heterogeneous networks [31]. Srihari S et al. proposed a refinement of MCL that incorporates core-attachment structure to predict yeast complexes from weighted PPI networks [32].
Unlike the first category of centrality-based approaches, and in order to reduce the negative impact of incomplete protein interaction data and of inherent PPI network topological characteristics on essential protein prediction, we combine multi-source biological data, namely gene expression data and orthologous information. Although gene expression data is also used by the second category of methods above, most of these methods simply combine gene expression with network topological data and ignore the essential difference in meaning between biological data and network topological data. For example, PEC directly uses the product PCC × ECC as the final result. Therefore, this paper proposes a new iterative method, called CVIM, which detects essential proteins by combining protein function and network topology. In CVIM, considering that the current PPI data sets are incomplete, we first apply PCC (Pearson Correlation Coefficient) [33] to gene expression data to convert the original PPI network into a weighted PPI network. Then, based on the weighted PPI network and the information on orthologous proteins, we extract key network topological features and protein functional features for each protein in the weighted PPI network, and generate a new protein interaction matrix (network) from the network topological features. Finally, based on this newly acquired protein interaction matrix and the functional features, we construct an iterative method called CVIM to predict essential proteins. In order to estimate the identification performance of CVIM, intensive experiments were carried out. Experimental results show that CVIM achieves prediction accuracies of 92%, 80% and 71% in the top 1%, 5% and 10% of proteins respectively, which are much better than those achieved by 13 state-of-the-art competing methods: DC [10], SC [11], BC [12], EC [13], IC [14], CC [15], NC [16], LAC [5], RWHN [1], PEC [17], CoEWC [18], POEM [22] and ION [26].

II. METHOD
As illustrated in Fig. 1, the procedure of CVIM consists of the following three major steps:
Step 1: First, we apply PCC to gene expression data to assign weights to the edges between protein nodes in the original PPI network, which transforms the original PPI network into a weighted PPI network.
Step 2: Next, based on the weighted PPI network and information on orthologous proteins, critical network topological features and protein functional features are extracted for each protein in the weighted PPI network.
Step 3: Finally, based on the topological and functional features of proteins, an iterative algorithm is designed to identify essential proteins.

A. CONSTRUCTION OF THE WEIGHTED PPI NETWORK
Let $G = (V, E)$ denote an original PPI network constructed from the dataset of known PPIs downloaded from a public database $D$. Here, $V = \{p_1, p_2, \ldots, p_N\}$ represents the set of different proteins in $D$, and $E$ represents the set of edges between proteins in $V$. For a pair of proteins $p$ and $q$ in $V$, there is an edge $e(p, q)$ between them if and only if there is a known interaction between $p$ and $q$ in $D$. Based on the original PPI network $G$, we can obtain an adjacency matrix $A = (a_{ij})_{N \times N}$, where $a_{ij} = 1$ if and only if there is an edge $e(p_i, p_j)$ between $p_i$ and $p_j$, and $a_{ij} = 0$ otherwise.
PCC measures the linear correlation between two vectors. Gene expression is the process of using gene information to synthesize functional gene products, which are usually proteins. We assume that essential proteins tend to show similar expression patterns over time, that is, the gene expression vectors of two essential proteins are likely to be strongly linearly correlated. Moreover, Horyu et al. [33] found that the Pearson correlation coefficient is well suited as a similarity measure for gene expression profiles. Therefore, we use PCC to measure the co-expression intensity of two genes, and use it to transform the two original PPI networks into two weighted PPI networks, as follows.

For a given protein $p$, its gene expression at different times can be expressed by a vector $Exp(p) = \{Exp(p, 1), Exp(p, 2), \ldots, Exp(p, n)\}$, where $Exp(p, i)$ is the expression level of protein $p$ at the $i$th time point. In an original PPI network, the weight between two proteins $p$ and $q$ is calculated as the Pearson Correlation Coefficient $PCC(p, q)$ of $Exp(p)$ and $Exp(q)$ (formula (1)). Here, $\overline{Exp}(p)$ denotes the average expression of protein $p$ over all time points, and $\sigma(p)$ is the standard deviation of the expression of protein $p$ over all time points. A positive value of $PCC(p, q)$ indicates a positive correlation between the two proteins $p$ and $q$, and a negative value indicates a negative correlation.

Based on formula (1), an original PPI network can easily be transformed into a weighted PPI network.
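The weighting step can be sketched in Python as follows. This is only an illustration under stated assumptions: formula (1) is taken to be the standard sample PCC, $PCC(p, q) = \frac{1}{n-1} \sum_{i=1}^{n} \frac{Exp(p, i) - \overline{Exp}(p)}{\sigma(p)} \cdot \frac{Exp(q, i) - \overline{Exp}(q)}{\sigma(q)}$, the function name `pcc_weights` and the dictionary layout are hypothetical, and assigning weight 0 to a constant (for example all-zero) profile is a choice made here because the correlation is undefined in that case.

```python
import numpy as np

def pcc_weights(edges, expression):
    """Return {(p, q): PCC(p, q)} for every edge of an unweighted PPI network.

    edges: iterable of (p, q) protein-name pairs.
    expression: dict mapping protein name -> expression levels over n time points.
    Proteins missing from `expression` get an all-zero profile.
    """
    n_times = len(next(iter(expression.values())))
    weights = {}
    for p, q in edges:
        x = np.asarray(expression.get(p, np.zeros(n_times)), dtype=float)
        y = np.asarray(expression.get(q, np.zeros(n_times)), dtype=float)
        sx, sy = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations
        if sx == 0.0 or sy == 0.0:
            # Assumption: a constant profile carries no co-expression signal.
            weights[(p, q)] = 0.0
            continue
        zx = (x - x.mean()) / sx
        zy = (y - y.mean()) / sy
        weights[(p, q)] = float(np.dot(zx, zy) / (len(x) - 1))
    return weights
```

For instance, two profiles that rise together receive a weight close to 1, while a rising and a falling profile receive a weight close to -1.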

B. EXTRACTION OF TOPOLOGICAL AND FUNCTIONAL FEATURES FOR PROTEINS
For a given protein $p$ in an original PPI network $G = (V, E)$, let $NG(p)$ denote the set of neighboring nodes that have known interactions with $p$ in $G$, that is, $NG(p) = \{q \mid e(p, q) \in E\}$.

Much research has been conducted on identifying essential proteins by analyzing the network structure formed by protein interactions, and some good results have been achieved, such as the LAC [4] method. Hart et al. [43] and Dezso et al. [44] found that in many cases essentiality is not a property of individual proteins but of protein complexes. Considering that the triangle is the most stable geometric structure, the triangle structure in the PPI network is a natural local measure of protein essentiality, in keeping with the modular nature of essentiality. Therefore, the number of triangles formed by the connections between proteins constitutes a feature of our algorithm. In this section, based on the weighted PPI network constructed above, we first calculate the number of triangles for each protein $p$ in the PPI network $G = (V, E)$ according to formulas (3) and (4), where $|NG(p) \cap NG(q)|$ denotes the number of elements in the set $NG(p) \cap NG(q)$.

Based on formulas (3) and (4), we extract the first network topological feature $TF_1$ for protein $p$ (formula (5)), where $|NG(p)|$ denotes the number of elements in the set $NG(p)$.
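The triangle counts that this feature builds on can be sketched as below. The exact forms of formulas (3) to (5) are not reproduced here, so this is only an illustration of counting triangles through common neighbors $|NG(p) \cap NG(q)|$; the function names are hypothetical, and halving the sum (so that each triangle through $p$ is counted once) is an assumption of this sketch rather than something taken from the method.

```python
def neighbor_sets(edges):
    """Build NG(p) for every protein from an undirected edge list."""
    nbrs = {}
    for p, q in edges:
        if p == q:
            continue  # self-interactions are filtered out
        nbrs.setdefault(p, set()).add(q)
        nbrs.setdefault(q, set()).add(p)
    return nbrs

def triangles_per_protein(nbrs):
    """Count the triangles that pass through each protein."""
    counts = {}
    for p, ng_p in nbrs.items():
        # For every neighbor q, the common neighbors of p and q close a triangle.
        shared = sum(len(ng_p & nbrs[q]) for q in ng_p)
        # Each triangle (p, q, r) is seen twice: once via q and once via r.
        counts[p] = shared // 2
    return counts
```

A protein embedded in a dense module obtains a high count, whereas a protein attached by a single edge obtains zero.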
Next, Li et al. [17] observed that essential proteins tend to form tightly connected clusters, so the neighbors of an essential protein also lie in a closely connected cluster. Based on this view, we assume that if protein $p$ is an essential protein, then its neighbors may also be essential proteins. For each protein $p$, we therefore extract a second network topological feature $TF_2$ (formula (6)), where $NG_e(p)$ denotes the number of edges of all nodes in $NG(p)$ and $NG_{Tris}(p)$ denotes the number of triangles of all nodes in $NG(p)$; both are calculated according to formulas (7) and (8).

Moreover, Peng et al. [26] showed that essential proteins are relatively conserved. By checking whether each protein has orthologs in 99 reference organisms, an orthologous score is obtained for each protein, which indicates its degree of conservation. For each protein $p$ in an original PPI network $G = (V, E)$, supposing that its orthologous score is $I(p)$, we extract its first functional feature $FF_1(p)$ from the information on orthologous proteins (formula (9)).

Finally, based on the weighted PPI network, for each protein $p$ we extract a second functional feature $FF_2(p)$ (formula (10)), where $\sum_{q \in NG(p)} weight(p, q)$ represents the sum of the co-expression degrees between protein $p$ and all its neighbor nodes, and the ratio of this sum to the number of neighbor nodes represents the average co-expression degree of protein $p$ in the whole PPI network.
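The average co-expression degree described for the second functional feature can be illustrated with a short Python sketch. It computes only the ratio $\sum_{q \in NG(p)} weight(p, q) / |NG(p)|$; whether formula (10) contains further terms is not assumed here, and the function name and data layout are hypothetical.

```python
def average_coexpression(nbrs, weights):
    """Mean edge weight between each protein and its neighbors.

    nbrs: dict mapping protein -> set of neighboring proteins, NG(p).
    weights: dict mapping an edge (p, q), stored in either order, to its weight.
    """
    result = {}
    for p, ng_p in nbrs.items():
        if not ng_p:
            result[p] = 0.0  # assumption: isolated proteins score zero
            continue
        total = sum(weights.get((p, q), weights.get((q, p), 0.0)) for q in ng_p)
        result[p] = total / len(ng_p)
    return result
```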

C. CONSTRUCTION OF CVIM
Based on the above descriptions, let $\{TF_{i1}, TF_{i2}, \ldots, TF_{iM}\}$ denote the topological features (such as $TF_1$ and $TF_2$) extracted for protein $p_i$ from the PPI network. We can then obtain an $N \times M$ characteristic matrix $TF$ for all $N$ proteins in the PPI network (formula (11)). After normalizing the matrix $TF$, we obtain a transformation matrix $B$ (formula (12)). Based on formula (12), for the $j$th network topological feature we obtain its entropy $e_j$, which represents the stability of the $j$th feature (formula (13)). Based on formula (13), we calculate the weight of the $j$th feature among all $M$ network topological features (formula (14)). Thereafter, based on formula (14), for a given protein $p_i$ we calculate its score of network topological features (formula (15)). Based on formula (15), for all $N$ proteins in the PPI network we construct a protein interaction matrix $H$ (formula (16)).

Based on formulas (9) and (10), for a given protein $p_i$ we define its total score of protein functional features, $FFscore(i)$ (formula (17)). For all $N$ proteins $\{p_1, p_2, \ldots, p_N\}$ in the weighted PPI network, we then obtain their initial scores as follows:

$$T(0) = (FFscore(1), FFscore(2), \ldots, FFscore(N)) \qquad (18)$$

Finally, we adopt formula (19) to compute the criticality scores of all proteins iteratively. Here, the parameter $\alpha$ $(0 \le \alpha \le 1)$ adjusts the proportion between the initial score $T(0)$ and the score $T(t)$ of the last iteration. The complete procedure is summarized in Algorithm CVIM.
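The entropy weighting and the iteration can be sketched in Python under explicit assumptions. The entropy step below follows the standard entropy-weight method (column normalization, $e_j = -\frac{1}{\ln N} \sum_i b_{ij} \ln b_{ij}$, weights proportional to $1 - e_j$), which may differ in detail from formulas (12) to (14). The update rule $T(t) = \alpha H T(t-1) + (1 - \alpha) T(0)$ and the stopping test $\|T(t) - T(t-1)\| / \sqrt{N} < \varepsilon$ are likewise assumptions about formula (19) and Step 7, and the construction of $H$ from the topological scores (formula (16)) is not attempted. All function names are hypothetical.

```python
import numpy as np

def entropy_weights(tf):
    """Entropy-weight method on an N x M non-negative feature matrix (N > 1)."""
    tf = np.asarray(tf, dtype=float)
    n, m = tf.shape
    col_sums = tf.sum(axis=0)
    col_sums[col_sums == 0.0] = 1.0          # avoid dividing by zero
    b = tf / col_sums                        # column-normalized matrix B
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(b > 0.0, b * np.log(b), 0.0)
    e = -plogp.sum(axis=0) / np.log(n)       # entropy e_j of each feature
    d = 1.0 - e                              # low entropy -> more informative
    w = d / d.sum() if d.sum() > 0 else np.full(m, 1.0 / m)
    return b, e, w

def iterate_scores(h, t0, alpha=0.8, eps=1e-6, max_iter=1000):
    """Assumed update: T(t) = alpha * H @ T(t-1) + (1 - alpha) * T(0)."""
    h = np.asarray(h, dtype=float)
    t0 = np.asarray(t0, dtype=float)
    t = t0.copy()
    for _ in range(max_iter):
        t_next = alpha * (h @ t) + (1.0 - alpha) * t0
        if np.linalg.norm(t_next - t) / np.sqrt(len(t0)) < eps:
            return t_next
        t = t_next
    return t
```

With this form of update, a larger $\alpha$ gives more influence to scores propagated through $H$ and less to the initial functional scores $T(0)$.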

Algorithm CVIM
Input: Original PPI network $G = (V, E)$, orthologous data and gene expression data, the parameters $\varepsilon$ and $K$
Output: Top $K$ percent of proteins sorted by the vector $T$ in descending order
Step 1: Generate the weighted network according to formula (1);
Step 2: For each protein $p$, extract its network topological features $TF_1$ and $TF_2$ from the weighted PPI network according to formulas (5) and (6) respectively;
Step 3: For each protein $p$, extract its functional features $FF_1$ and $FF_2$ from the weighted PPI network, orthologous data and gene expression data according to formulas (9) and (10) respectively;
Step 4: Obtain the protein interaction matrix $H$ according to formula (16);
Step 5: Let $t = 0$; compute $T(t)$ according to formula (18);
Step 6: Let $t = t + 1$; compute $T(t)$ according to formula (19);
Step 7: Repeat Step 6 until $\|T(t) - T(t-1)\| / \sqrt{N} < \varepsilon$;
Step 8: Sort the proteins by the value of $T$ in descending order;
Step 9: Output the top $K$ percent of the sorted proteins.

A. EXPERIMENTAL DATA
In order to evaluate the performance of CVIM, we compare it with the 13 representative methods listed in Table 1, based on datasets downloaded from two databases, DIP [34] and GAVIN [35]. After filtering out self-interactions and repeated interactions, we obtained 5093 proteins and 24743 interactions, including 1167 essential proteins, from the DIP database, and 1855 proteins and 7669 interactions, including 714 essential proteins, from the GAVIN database. Based on these two datasets, two original PPI networks, a DIP-based PPI network and a GAVIN-based PPI network, can be constructed.
Moreover, information on orthologous proteins was downloaded from the InParanoid database (Version 7) [36], which consists of a collection of pairwise comparisons between 100 whole genomes. The gene expression data of yeast was taken from the dataset provided by Tu et al. [37]. The gene expression data covers over 95% of the proteins in both the DIP-based PPI network and the GAVIN-based PPI network. For proteins without corresponding gene expression data, we set their gene expression values to zero.
Finally, a benchmark set of 1285 essential genes of Saccharomyces cerevisiae was compiled from four databases: MIPS [39], SGDP [42], DEG [40] and SGD [41]. By comparing the essential proteins selected by CVIM with these 1285 known essential proteins, the recognition rates of CVIM on the DIP and GAVIN datasets were obtained. We present the experimental results on the DIP-based PPI network in detail, and briefly present the results on the GAVIN-based PPI network.

B. EFFECTS OF THE PARAMETER α
In CVIM, we introduced a user-defined parameter α with a value between 0 and 1. The prediction results obtained with different values of α on the DIP-based PPI network and the GAVIN-based PPI network are given in Table 2 and Table 3 respectively.
As shown in Table 2, we sorted the final scores of the proteins in descending order and selected the top 1%, 5%, 10%, 15%, 20% and 25% of the potentially essential proteins identified by CVIM, with α set to 0.1, 0.2, ..., 0.8 and 0.9. The prediction accuracy of CVIM changes with α: overall, as α increases, the accuracy steadily increases. The recognition rate in the top 20% and 25% drops to 0.47-0.55. This is because essential proteins account for only about 20% of all proteins in the dataset, so the class distribution is highly uneven; beyond the top 20%, the proportion of non-essential proteins among the selected candidates rises markedly and the accuracy falls quickly. Since our research is aimed at identifying essential proteins, we mainly consider the recognition performance within the top 20%. Therefore, we consider α = 0.8 the most appropriate setting for comparative experiments on the DIP-based PPI network.
As shown in Table 3, when α is increased to 0.7, the top 5%, 15%, 20% and 25% of the potential essential proteins identified by CVIM all reach their best prediction accuracy, whereas the top 1% and 10% reach their best prediction accuracy when α is set to 0.5. Considering both results, α = 0.7 is the most suitable setting for comparative experiments on the GAVIN-based PPI network.
Although the best results are obtained at α = 0.8 for the DIP dataset and at α = 0.7 for the GAVIN dataset, using different parameter values for different datasets risks over-fitting. Therefore, taking the two databases together, we chose α = 0.8 in the following experiments. We also tested CVIM on the Krogan [45] and BioGRID datasets. The Krogan dataset consists of 3672 proteins and 14317 interactions. The BioGRID yeast dataset used in [4] contains 5616 proteins and 52833 distinct interactions, and is denser than the other three datasets. We found that the best value of α does not change much across datasets and has little effect on the experimental results.

C. COMPARISONS BETWEEN CVIM AND 13 REPRESENTATIVE METHODS
First, we use the dataset downloaded from the DIP database to compare CVIM with the 13 representative methods in Table 1. The experimental results are illustrated in Fig. 2.
From Fig. 2, it is easy to see that among the top 1% (51), 5% (255) and 10% (510) potential essential proteins detected by CVIM, there are 47, 204 and 359 true essential proteins respectively, which means that the recognition rates of CVIM reach 92%, 80% and 71% in the top 1%, 5% and 10% of newly identified candidate proteins. In particular, compared with the 8 representative prediction methods based on PPI network topology in Table 1, CVIM achieves the highest predictive accuracy in all top percentages. Moreover, compared with the five representative prediction methods based on network topology and related biological data in Table 1, CVIM outperforms PEC, POEM, CoEWC and ION in every top-percentage interval. In the top 1%, 10%, 15% and 20% of candidate proteins, CVIM also achieves better performance than RWHN. However, in the top 5% and 25% of candidate proteins, the predictive performance of CVIM is slightly lower than that of RWHN. This may be because the RWHN method uses different parameter settings for different data. Thus, we conclude that, overall, CVIM achieves a higher recognition rate for essential proteins than these 13 state-of-the-art methods.

FIGURE 2. Number of true essential proteins discovered by CVIM and the methods in Table 1. The proteins in the PPI network are sorted from high to low, and the top 1%, 5%, 10%, 15%, 20% and 25% ranked proteins are selected as candidate essential proteins. By comparison with the known essential proteins, performance is judged by the number of true essential proteins identified by each method. The total number of ranked proteins is 5093; the digits in brackets indicate the number of proteins ranked in each top percentage.
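The top-percentage evaluation used in this comparison can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and rounding the cutoff up is an assumption, chosen because it reproduces the bracketed counts 51, 255 and 510 for 5093 ranked proteins.

```python
def top_count(n_proteins, percent):
    """Number of proteins in the top `percent` of a ranking, rounded up."""
    return (n_proteins * percent + 99) // 100

def top_percent_precision(scores, essential, percents=(1, 5, 10, 15, 20, 25)):
    """Return {percent: (cutoff, true_essential_hits, hit_rate)}.

    scores: dict mapping protein -> final score (higher means more essential).
    essential: set of known essential proteins (the benchmark set).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    out = {}
    for pct in percents:
        k = top_count(len(ranked), pct)
        hits = sum(1 for p in ranked[:k] if p in essential)
        out[pct] = (k, hits, hits / k if k else 0.0)
    return out
```

With the counts reported above, 47 hits among 51 candidates corresponds to a rate of about 0.92, and 204 among 255 to 0.80.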

IV. ROC CURVE VERIFICATION
The receiver operating characteristic (ROC) curve was also used to evaluate the performance of the CVIM method. The larger the area under the ROC curve (AUC), the better the model's performance; AUC = 0.5 corresponds to random performance. At FPR = 0.2, TPR = 0.58, and the area under the curve up to this point is 0.083 for CVIM and 0.081 for RWHN. Therefore, for FPR ≤ 0.2, the performance of CVIM is the best among all the algorithms. As FPR grows further, the AUC of CVIM becomes slightly smaller than that of RWHN.
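An area restricted to a low-FPR range, such as FPR ≤ 0.2, can be computed with the trapezoidal rule, as in the following sketch. The function names are hypothetical, and how the reported values were actually computed is not stated, so this only illustrates the general idea of a partial AUC.

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve points (FPR, TPR) obtained by sweeping down the ranking.

    labels: 1 for essential proteins, 0 otherwise (both classes must occur).
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels, dtype=int)[order]
    tp = np.cumsum(y)
    fp = np.cumsum(1 - y)
    tpr = np.concatenate(([0.0], tp / y.sum()))
    fpr = np.concatenate(([0.0], fp / (len(y) - y.sum())))
    return fpr, tpr

def partial_auc(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    stop = int(np.searchsorted(fpr, max_fpr, side="right"))
    x, y = fpr[:stop], tpr[:stop]
    if stop < len(fpr) and x[-1] < max_fpr:
        # Interpolate the curve at max_fpr between the two enclosing points.
        x0, x1 = fpr[stop - 1], fpr[stop]
        y0, y1 = tpr[stop - 1], tpr[stop]
        y_end = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
        x, y = np.append(x, max_fpr), np.append(y, y_end)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))
```

Note that the partial area over FPR ≤ 0.2 can be at most 0.2, so values of this kind are not comparable with a full AUC.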

A. VALIDATION BY JACKKNIFE METHODOLOGY
The jackknife methodology [42] is commonly used to evaluate the strengths and weaknesses of algorithms for identifying essential proteins. In order to evaluate CVIM more comprehensively, in this section we apply the jackknife methodology to the top 1000 candidate essential proteins predicted by CVIM and by the 13 representative methods. The comparison results are shown in Fig. 4. From Fig. 4(a), Fig. 4(b) and Fig. 4(d), it is easy to see that CVIM achieves better predictive performance than IC, EC, BC, NC, DC, SC, CC, PEC, LAC and CoEWC. Moreover, Fig. 4(c) shows that CVIM outperforms ION and POEM, while the curves of CVIM and RWHN intersect each other. On closer inspection, however, once the number of candidate essential proteins increases to 500, the curve of RWHN falls below that of CVIM. That is to say, as the number of predicted proteins increases, the predictive performance of CVIM gradually exceeds that of RWHN. Hence, on the whole, the prediction performance of CVIM is better than that of these 13 representative methods.
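A jackknife curve of this kind plots, for each number of top-ranked candidates, the cumulative number of true essential proteins among them. A minimal Python sketch, with a hypothetical function name, is:

```python
from itertools import accumulate

def jackknife_curve(ranked_proteins, essential, top_n=1000):
    """Cumulative count of true essential proteins among the top-k candidates.

    ranked_proteins: proteins sorted by predicted score, best first.
    essential: set of known essential proteins.
    Returns a list whose entry k-1 is the number of hits in the top k.
    """
    flags = (1 if p in essential else 0 for p in ranked_proteins[:top_n])
    return list(accumulate(flags))
```

Two methods can then be compared by plotting their curves on the same axes; a curve that lies higher at a given k has found more true essential proteins within its top k candidates.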

B. DIFFERENCES BETWEEN CVIM AND 13 REPRESENTATIVE METHODS
In order to analyze the differences between CVIM and the 13 state-of-the-art prediction methods in Table 1, we compared CVIM with each of them on the basis of the top 200 ranked proteins. The comparison results are shown in Table 4 and Fig. 5, in which $M_i$ denotes one of these 13 methods and $|CVIM \cap M_i|$ indicates the number of essential proteins identified by both CVIM and $M_i$. From Table 4 and Fig. 5, it is clear that the percentage of essential proteins among the top 200 ranked proteins discovered by CVIM but not by a given competing method is much higher than the percentage of essential proteins among the top 200 ranked proteins discovered by that competing method but not by CVIM. That is to say, compared with state-of-the-art methods, CVIM can detect more true essential proteins and is better able to eliminate noisy data.
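The overlap and difference statistics in this comparison reduce to set operations on the two top-200 lists. The sketch below is illustrative only; the function name and the returned keys are hypothetical.

```python
def compare_top_sets(top_first, top_second, essential):
    """Overlap and difference statistics between two top-ranked lists.

    top_first, top_second: iterables of proteins (for example two top-200 lists).
    essential: set of known essential proteins.
    """
    a, b = set(top_first), set(top_second)
    only_a, only_b = a - b, b - a

    def essential_fraction(group):
        # Share of known essential proteins within a group of proteins.
        return len(group & essential) / len(group) if group else 0.0

    return {
        "overlap": len(a & b),
        "only_first": len(only_a),
        "only_second": len(only_b),
        "essential_fraction_only_first": essential_fraction(only_a),
        "essential_fraction_only_second": essential_fraction(only_b),
    }
```

A method whose exclusive predictions contain a higher fraction of essential proteins contributes more true positives that the other method misses.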

C. PREDICTION PERFORMANCE OF CVIM BASED ON THE GAVIN DATASET
In order to verify the general applicability of CVIM, in this section we use the GAVIN dataset to compare the predictive performance of CVIM with that of the 13 previous methods. The comparison results are illustrated in Table 5 and Fig. 6.
As shown in Table 5, among the top 1% (19) ranked proteins, the number of true essential proteins discovered by CVIM is 16, which is higher than that of EC, SC, BC, CC, NC, PEC and LAC, equal to that of IC and CoEWC, and slightly smaller than that of POEM, RWHN and ION. Although the prediction performance of CVIM in the top 1% ranked proteins is not the best, its prediction performance in the top 5% to 25% ranked proteins is better than that of all 13 competing methods.
From Fig. 6(a) and Fig. 6(b), it is clear that the curves of CVIM are higher than those of DC, EC, SC, BC, IC and NC, which indicates that CVIM outperforms these methods. From Fig. 6(c) and Fig. 6(d), we can also see that the gaps between the curve of CVIM and the curves of PEC, POEM, CoEWC, LAC, RWHN and ION gradually increase with the number of ranked proteins, which demonstrates that, as the number of ranked proteins increases, the advantage of CVIM over PEC, POEM, CoEWC, LAC, RWHN and ION grows. Therefore, we believe that CVIM is a leading method for predicting potential essential proteins.

V. DISCUSSION
Essential proteins are indispensable for sustaining life activities. Because identifying essential proteins by traditional biological experiments is costly, the recognition of essential proteins by computational techniques has become a hotspot in protein research. Developing stable and accurate computational algorithms that can identify essential proteins in place of biomedical experiments is an important and challenging task. More and more researchers are combining PPI networks with biological data to build effective prediction models. Inspired by them, we designed a novel prediction model in this manuscript that integrates the topological features of the weighted PPI network and the functional features of proteins to determine the importance of proteins. Experimental results show that the method achieves excellent prediction results, which provides a good reference for future research.

VI. CONCLUSION
In this manuscript, a novel prediction method called CVIM is proposed to discover potential essential proteins by integrating the PPI network and relevant biological data. In CVIM, a weighted PPI network is first constructed by applying the PCC scheme to the original PPI network. Then, based on the weighted PPI network, the homologous data of proteins and the real-time expression data of genes, network topological features and functional features are extracted for each protein in the weighted PPI network. Finally, based on these different kinds of features, an iterative method is adopted to obtain the final scores of proteins. Intensive experiments were carried out on the DIP2010 and GAVIN yeast PPI networks. Experimental results demonstrate that CVIM outperforms 13 competing representative prediction methods, which shows that CVIM is an effective prediction method.