An Iterative Method for Identifying Essential Proteins Based on Non-Negative Matrix Factorization

In recent years, with the development of high-throughput technologies, lots of computational methods for predicting essential proteins based on protein-protein interaction (PPI) networks and biological information of proteins have been proposed successively. However, due to the incompleteness of PPI networks, the prediction accuracy achieved by these methods is still unsatisfactory, and it remains to be a challenging work to design effective computational models to identify essential proteins. In this manuscript, a novel Prediction Model based on the Non-negative Matrix Factorization (PMNMF for abbreviation) is proposed. In PMNMF, an original PPI network will be constructed first based on PPIs downloaded from any given benchmark database. And then, based on topological features of protein nodes, the original PPI network will be further converted to a weighted PPI network. Moreover, in order to overcome the incompleteness of PPI networks, the NMF (Non-negative Matrix Factorization) method will be implemented on the weighted PPI network to obtain a transition probability matrix. And then, by integrating biological information including the gene expression information, homologous information and subcellular localization information of proteins, a unique initial score will be calculated and assigned to each protein node in the weighed PPI network, based on which, an improved Page-Rank algorithm will be designed to infer potential essential proteins. Finally, in order to evaluate the performance of PMNMF, it will be compared with 14 state-of-the-art prediction models, and experimental results show that PMNMF can achieve the best identification accuracy.


I. INTRODUCTION
Essential proteins are found in large numbers in protein complexes, and their absence will lead to the loss of functions of related protein complexes, and make it impossible for organisms to survive or develop. Identifying essential proteins is important for the understanding of the process of cell growth and regulation, and can provide valuable information to the researches of disease analysis and drug design etc. In recent years, with the rapid development of high-throughput techniques, more and more protein-protein interactions (PPIs) have been detected successively, based on which, PPI networks are established and applied widely The associate editor coordinating the review of this manuscript and approving it for publication was Vishal Srivastava. in designing computational models for inferring essential proteins. For instance, based on the topological characteristic of centrality [1], [2] of PPI networks, a series of calculation models including CC(Closeness Centrality) [3], DC(Degree Centrality) [4], BC(Betweenness Centrality) [5], SC[Subgraph Centrality] [6], NC(Neighbor Centrality) [7] have been proposed to discover basic proteins. Besides, Li M et al [8] designed an identification model named LAC to identify key proteins based on the Local Average Connectivity of protein nodes in PPI networks [9]. Qi Yi et al [10] designed a prediction model to infer basic proteins based on the Local Interaction Density (LID) of protein nodes in PPI networks. Chen B et al [11] proposed an essential protein identification method based on multiple topological structures of PPI networks. In all these methods mentioned VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ above, it is only considerate the topological properties of PPI networks, thus, due to the incompleteness of current PPI networks, the prediction accuracy of these methods is still not satisfactory. Then in order to improve the prediction accuracy of computational models, some new identification models have been proposed for the past few years by combining the topological characteristics of PPI networks and the biological information of proteins. For example, through integrating PPI networks with the gene expression data of proteins, M Li et al [12] and Xiwei Tang et al [13] proposed two prediction models called Pec and WDC respectively. W Peng et al designed one prediction model based on the orthologous information of proteins and PPI networks [14], and another prediction model based on the domain information of proteins and PPI networks [15] to infer essential proteins respectively. X Zhang et al [16] introduced an identification method called CoEWC by combining topological features of PPI networks with the co-expression properties of proteins. BH Zhao et al [17] designed a prediction model called POEM by integrating gene expression data of proteins with topological features of PPI networks. J Luo et al [18] put forward a computational method for essential protein prediction based on the local interaction density of PPI networks and biological features of protein complexes. Seketoulie Keretsu et al [19] presented an identification model of protein complexes based on weighted edge by clustering and the gene expression profiles of proteins. M Li et al [20], [21] proposed two necessary protein identification methods by integrating PPI networks with subcellular localization information and complex centrality of proteins separately. J Luo et al [22] introduced a method to detect essential proteins based on protein complex co-expression data and ECC (edge clustering coefficient) of PPI networks. Bihai Zhao et al proposed a model based on Multiplex Biological Networks [23] and a model based on Diffusion Distance Networks [24] to predict essential proteins separately. S. Li et al [25] proposed one iteration method called CVIM to predict essential protein, based on topological and functional features. Lei X et al presented one essential protein prediction methods called AFSOEP [26] to infer protein complexes by AFSO (Artificial Fish Swarm Optimization). Bihai Zhao et al [27] designed an iterative method for identifying potential key proteins from heterogeneous PPI networks. Dai W et al [28] proposed a method to discover essential genes based on protein-protein interaction network embedding. Fengyu Zhang et al [29] introduced a model called FDP to predict essential Genes by fusing dynamic PPI networks. Chen Z et al [30] proposed a prediction model called NPRI based on one heterogeneous network, the heterogeneous Protein-Domain network are established in accordance with initial PPI network, Protein-Domain network and gene expression data. All these above mentioned methods have demonstrated that it can improve the prediction accuracy of calculative models by combining the biological information of proteins with the topological features of PPI networks.
In general, these existing essential protein prediction methods are mainly designed by combining the topological characteristics of PPI networks with biological features of proteins. However, due to the incompleteness of PPI networks, the prediction accuracy of these methods is still not very satisfactory. Hence, inspired by the ideas of existing state-of-the-art models, in this paper, a novel prediction model called PMNMF is designed to infer essential proteins. In PMNMF, an original PPI network will be constructed first based on known PPIs downloaded from benchmark databases. And then, based on topological features of protein nodes, the original PPI network will be transformed to a weighted PPI network. Next, through adopting the Non-negative Matrix factorization (NMF) method, a transition probability matrix will be obtained. Finally, by combining the gene expression information, orthologous information and subcellular localization information of proteins, an iterative algorithm will be designed and implemented on the weighted PPI network to detect potential essential proteins. Moreover, in order to evaluate the performance of PMNMF, it will be compared with some competitive methods. Experimental results show that PMNMF can achieve reliable identification accuracies of 98.04%, 85.10%, 69.74%, 60.10%, 55.05% and 51.22% in top 1%, top 5%, top10%, top15%, top20% and top 25% of predicted potential essential proteins respectively, which predictive performance is better than all these state-of-the-art competing models.

II. MATHOD
As shown in Fig.1, the process of PMNMF consists of the following 3 main steps: Step 1: First, based on the dataset of known PPIs downloaded from any given benchmark database, an original PPI network will be constructed. And then, based on the topological features of protein nodes, the original PPI network will be further converted to a weighted PPI network.
Step 2: Next, based on the gene expression formation, homologous information and subcellular location information of proteins, a unique initial score will be calculated and assigned to each protein node in the weighted PPI network.
Step 3: Finally, based on the initial scores of proteins and the transition probability matrix obtained by adopting the non-negative matrix factorization method, an improved page-rank algorithm will be designed to calculate a final score for each protein, which can be utilized to evaluate the essentiality of the protein effectively.

A. CONSTRUCTION OF THE WEIGHTED PPI NETWORK
Let denote the dataset of known PPIs downloaded from any given benchmark database, N P = {p 1 , p 2 . . . .p O } be the set of all these different proteins in . For any two given proteins p i and p j in N P , we define that there is an edge e(p i , p j ) between them, if and only if there is a known interaction between p i and p j in . And for convenience, let E P represent the set consisting of all these edges between proteins in . Then, it is apparent that we can obtain an original PPI network OppiN = {N P , E P }, and based on which, we can further obtain an O×O dimensional adjacency matrix OppiM as follows: for any two given proteins p i and p j in N P , there is OppiM (i, j) = 1, if and only if there is a known interaction between them in , otherwise there is OppiM (i, j) = 0. In addition, let NB(p i ) represent the set of nodes neighboring to p i in OppiN , i.e., there are edges between these nodes and p i in OppiN . Let |NB (p i ) | denote the number of different nodes in NB(p i ), NB(p i ) ∩ NB(p j ) be the set of nodes neighboring to both p i and p j in OppiN , and |NB (p i ) ∩ NB p j | represent the number of different nodes in NB(p i ) ∩ NB(p j ), then based on the assumption that for any two given proteins, if they interact with one or more other common proteins at the same time, the interaction between these two proteins will be more reliable [31], we can define the Edge Aggregation Coefficient between p i and p j as follows: From observing above formula (1), it is easy to see that, for any two given proteins p i and p j in N P , the more common neighboring nodes between them, the bigger the value of EAC p i , p j will be. Hence, to some degree, the Edge Aggregation Coefficient between p i and p j can reflect the degree of interaction between them effectively.
Moreover, it is reasonable to assume that if a protein node has known interactions with more proteins, then it will be more reliable. Hence, for any given protein p i in N P , let ENB (p i ) denote the number of known interactions between it and all the other proteins in N P , then we can define the Point Aggregation Coefficient of p i as follows: Based on above two formulas, for any two given proteins p i and p j in N P , it is reasonable to assume that the potential interaction between them varies directly with both the value of the Edge Aggregation Coefficient between them and the values of their Point Aggregation Coefficients. Hence, we can define the Degree of Potential Interaction between p i and p j as follows: Obviously, based on above formula (3), an O × O dimensional interaction matrix DPI can be obtained. However, through considering the limited number of known interactions between proteins and the definition of the Edge Aggregation Coefficient between proteins illustrated in above formula (1), it is easy to know that DPI will be a sparse matrix. Hence, we can adopt the Non-negative Matrix Factorization (NMF) method [32][33][34] to predict unknown weights, convert it to the product of two non-negative matrixes W ∈ R O×k and H ∈ R k×O (k O) as follows: Here, the matrixes W and H satisfy the following target function: Here, ||.|| E represents the Euclid paradigms. From observing above two formulas, it is easy to see that NMF aims to find two non-negative matrixes W and H, whose product WH can provide the optimal approximation to the original matrix DPI. As for above target function illustrated in formula (5), by adopting the iterative update algorithm proposed by Lee et al [35], the matrixes W and H can be iteratively obtained according to the following formulas:

B. CALCULATION OF INITIAL SCORES FOR PROTEINs
In this section, we will combine the gene expression information, subcellular localization information and homologous information of proteins to calculate a unique initial score for each protein in the weighted PPI network as follows: Firstly, for any given protein 2), . . . , GE(p i , n)} denote the gene expression data of p i at n different time points, where GE(p i , t) represent the level of gene expression of p i at the time point t. Then, based on the method of PCC (Pearson Correlation Coefficient) [36], we can calculate a PCC-based initial score for p i as follows: Here, Here, GE(p i ) denotes the average expression level of p i at all these n time points, σ (p i ) is the standard variance of gene expression levels of p i at all these n time points, and PCC p i , p j represents the Pearson Correlation Coefficient between p i and p j .
Next, based on the homologous information of proteins, for any given protein p i , let O (p i ) denote the homologous information of p i , then we can obtain another homologous information based initial score for p i as follows: Moreover, based on the subcellular location information of proteins, we can calculate the third subcellular location based initial score for p i as follows: Here, Pro_s(p i ) denotes the set of all subcellular locations, in which the proteinp i is located, Sub_p (s i ) is the number of proteins in the i-th subcellular localization, and m is the total number of all subcellular localizations.
Finally, based on above formulas, for any given protein p i , we can define a unique initial score for it as follows: Here, β ∈ [0, 1], γ ∈ [0, 1] and δ ∈ [0, 1] are the weights of the GScore(p i ), OScore(p i ) and SScore(p i ) separately, and in addition, there is β + γ + δ =1. During simulation, in order to obtain the appropriate combination of these parameters, all possible values of these three parameters will be tried to obtain different initial scores for proteins, among which, the combination corresponding to the highest prediction accuracy of essential proteins will be selected as the final values of these three parameters.Here, β = 0.55, γ = 0.25, δ = 0.2.

C. CONSTRUCTION OF THE PREDICTION MODEL PMNMF
First, based on the following formula (15), we will transform the matrix DPI * to a symmetrical transition probability matrix NTP as follow: Here, Next, based on the transition probability matrix NTP, let PScore(0)=PScore0, then we can iteratively obtain the final scores for all proteins in the weighted PPI network as follows: Here, the parameter α is used to adjust the ratio of the initial score to the score of the latest iteration, and PScore(t) is the scores of all proteins in the t-th round of iteration.
Finally, according to above descriptions, as shown in algorithm 1, the novel prediction method PMNMF can be presented as follows: Algorithm 1 PMNMF Input: Downloaded dataset of known PPIs, downloaded dataset of orthologous information, gene expression information, and subcellular location information of proteins, the iteration condition parameter ε, the dimensionality parameter K, the max iteration times T, and the proportional adjustment parameters α, β, γ and δ. Output: Top K percent of proteins sorted by values in PScore in descending order Step1: Generating the original and weighted PPI networks according to formulas (1)-(3); Step2: Obtaining the non-negative matrices W and H according to formulas (4)-(7),Repeating (6)-(7) until the iteration times exceeds T; Step3: Calculating the initial scores for proteins according to formulas (8)- (14); Step4:Obtaining the transition probability matrix according to formulas (15)-(16); Step5: Let t = t + 1, calculating PScore(t + 1) according to the formula (17) iteratively; Step6: Repeating Step 5 until ||PScore(t+1) -PScore(t) || < ε; Step7:Sorting proteins by values in PScore in the descending order; Step8: Outputing top K percent of sorted proteins.

A. EXPERIMENTAL DATA
In order to evaluate the predictive performance of PMNMF, in this section, we will compare it with 14 representative basic protein prediction methods including IC [1], CC [3], DC [4], BC [5], SC [6], NC [7], PeC [12], ION [14], CoEWC [16]   and POEM [17], CVIM [25], NPRI [30], TEGS [20] and RWHN [27] simultaneously. During experiments, we will first download datasets of known PPIs from different benchmark databases including DIP [37], Gavin [38] and Krogan [39] respectively. After pre-processing, we obtain a dataset consisting of 24743 interactions between 5093 proteins from the DIP database, a dataset consisting of 7,669 interactions between 1,855 proteins from the Gavin database, and a dataset consisting of 14317 interactions between 3672 proteins from the Krogan database finally. In addition, according to the databases such as MIPS [40], SGD [41], DEG [42] and SGDP [43] etc., a dataset consisting of 1285 essential proteins can be further obtained, and based on which, 1,167, 714 and 929 essential proteins have been picked out from the databases of DIP, Gavin and Krogan separately. Moreover, based on the dataset provided by Tu BP et al [44], we obtain a dataset consisting of gene expression data of 6,776 proteins, which represent the gene expression levels of proteins over consecutive metabolic cycles. Additionally, the orthologous information of proteins will be downloaded from the Inparanoid database (Version7) that includes a collection of pair wise comparisons between 100 whole genomes [45]. After that, the number of times that proteins have orthologous information in reference organisms will be calculated to quantify the homologous information of proteins. Finally, based on the dataset downloaded from the COMPART-MENTS database [46] (downloaded at April 20, 2014), we can obtain a dataset consisting of the subcellular location information of proteins, in which, we will only keep 11 categories of subcellular localization data closely related to essential proteins such as the Endoplasmic, Cytoskeleton, Golgi, Cytosol, Vacuole, Mitochondrion, Endosome, Plasma, Nucleus, Peroxisome and Extracellular etc.

B. EFFECTS OF PARAMETER α ON PERFORMANCE OF PMNMF
In PMNMF, we set a user-defined parameter α with value between 0 and 1 to adjust the ratio of the initial protein fraction to the latest score during iterations. By setting different values to α, we can obtain different prediction accuracies of PMNMF. During simulation, we will choose the number of true essential proteins identified by PMNMF in top 1%, top 5%, top 10%, top 15%, top 20% and top 25% of predicted potential essential proteins when α is set to 0.1, 0.2, 0,3, . . . and 0.9 as the final results. And in detail, Table 1, Table 2 and Table 3 illustrate these results based on the databases of DIP, Gavin and Krogan respectively. From observing Table 1, it is easy to see that with the increasing VOLUME 8, 2020 And then, the top 1%, 5%, 10%, 15%, 20%, and 25% of ranked proteins will be selected as candidate essential proteins. Finally, through comparing with the downloaded dataset of known essential proteins, the number of true essential proteins identified by each method will be calculated and shown in the table, which will be adopted to evaluate the predictive ability of each method. The numbers in parentheses indicate the number of proteins ranked in each interval.
of the value of α from 0.1 to 0.9, the prediction accuracy of PMNMF will increase as well, however, when α exceeds 0.4, the prediction accuracy of PMNMF in the top 1% and 5% of predicted potential essential proteins will decrease gradually. Therefore, based on the DIP database, it will be appropriate to set α to 0.4. From observing Table.2, it is easy to see that PMNMF can achieve the best prediction results while α is set to 0.9 based on the Gavin database. From observing Table.3, it is obvious that 0.6 is a turning point. Therefore, based on the Gavin database, we consider that it will be appropriate to set α to 0.6. According to above analysis, it is easy to known that the value of the parameter α will have obvious effect on the prediction performance of PMNMF. Based on the overall performance on the three datasets, we set α to 0.4.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
In this section, we will compare PMNMF (while α = 0.4) with 14 state-of-the-art competing methods to evaluate its prediction performance based on the DIP database. And as shown in Fig.2, we can see that the prediction performance of PMNMF is better than that of all these 14 competitive methods. Especially, as for the top 1%, top 5%, 10% and top 15% of predicted candidate proteins, PMNMF can achieve reliable predictive accuracies of 98%, 85%, 70% and 60% separately, which are 50%, 24%, 22%, 24%, 32%, 24%, 22%, 20%, 14%, 18%, 8%, 12%, 16% and 8% higher than the predictive accuracy achieved by BC, CC, DC, IC, NC, SC, Pec, POEM, CoEWC ION,CVIM,NPRI,TEGS and RWHN respectively. Next, we further adopt the ROC (Receiver Operating Characteristic) curve and the AUCs (the area under the ROC curve) to compare the prediction performance of PMNMF with these 10 competing methods. The comparison results between PMNMF and BC, CC, DC, IC, NC and SC are illustrated in Fig.3(a), and the comparison results between PMNMF and CoEWC, PeC, POEM and ION are shown in Fig.3(b) respectively. From observing these two figures, it is easy to see that the prediction performance of PMNMF is higher than that of all these 10 competitive methods. And in addition, as shown in Table 4, from observing the AUCs achieved by PMNMF and 10 competing methods based on the DIP database, it is obvious that PMNMF can achieve the highest AUC value of 0.77, which is better than that achieved by all these 10 competitive methods as well.

D. VALIDATION BY JACKKNIFE METHOLOGY
In this section, the jackknife methodology [47] will be implemented on top 1000 candidate essential proteins predicted by PMNMF and 10 competitive models to compare the performances between them based on the DIP database. The comparison results are illustrated in the following Fig.4, in which, the X-axis shows the number of predicted potential essential proteins in descending order according to the predicted scores of proteins, while the Y-axis denotes the cumulative count of the truly proven essential proteins. Especially, Fig.4(a) shows the comparison results among PMNMF, BC, CC, DCIC, NC and SC, from which, it can be seen that the predictive performance of PMNMF is much higher than that of all these 6 competing methods. Moreover, from observing Fig.4(a), it is easy to see that with the increasing of the number of ranked proteins, the performance gap between PMNMF   and these competitive methods will increase significantly. Fig.4(b) illustrates the comparison results among PMNMF, Pec, CoEWC, POEM and ION, from which, it can be seen that the prediction performance of PMNMF is to some degree higher than that of all these 4 competing methods as well. However, from observing Fig.4(b), it can be seen that with the increasing of the number of ranked proteins, the performance gap between PMNMF and these competitive methods will increase gradually.

E. DIFFERENCE BETWEEN PMNMF AND 10 COMPETITIVE PREDICTION METHODS
In this section, we will select top 200 proteins predicted by PMNMF and 10 competitive methods based on the DIP database to analyze the difference and commonality between them. Comparison results between PMNMF and 10 competitive methods are illustrated in the following Table.5, in which, Mi indicates one of these 10 methods.
|PMNMF ∩Mi| denotes the number of common key proteins VOLUME 8, 2020 FIGURE 5. The figure shows the comparison results of the number of true essential proteins inferred by PMNMF and 13 competing identification models based on the Gavin database and the Krogan database respectively. During experiment, proteins will be first sorted in descending order based on their scores calculated by predictive methods such as PMNMF, BC, CC, DC, IC, NC, SC, PeC, CoEWC, POEM, ION, CVIM, NPRI, TEGS or RWHN separately. And then, the top 1%, 5%, 10%, 15%, 20%, and 25% of ranked proteins will be selected as candidate essential proteins. Finally, through comparing with the downloaded dataset of known essential proteins, the number of true essential proteins identified by each method will be calculated and shown in the table, which will be adopted to evaluate the predictive ability of each method. The numbers in parentheses indicate the number of proteins ranked in each interval.   Table.5, it is easy to know that among these top 200 proteins, the proportions of true essential proteins predicted by PMNMF but not by competing methods are more than 80%, which indicate that PMNMF can achieve much higher identification accuracy and better prediction performance than all these 10 competitive methods

F. RECOGNITION PERFORMANCE OF PMNMF BASED ON THE GAVIN DATABASE AND KROGAN DATABASE
In order to demonstrate the universal applicability of PMNMF method, in this section, we further adopt the Gavin and Krogan databases to compare the prediction performance between PMNMF and some competitive prediction methods. The comparison results are shown in Fig.5 and Fig.6. From observing Fig.5(a), it is clear that based on the Gavin database, the prediction accuracies of PMNMF exceed 89% in the top 1%, top 5% and top 10% of ranked candidate essential proteins. From observing Fig.5(b), it is clear that based on the Krogan database, the prediction accuracies of PMNMF exceed 83% in the top 1% and top 5% of ranked candidate essential proteins. And based on both of these two databases, the prediction accuracies achieved by PMNMF are higher than all other competitive prediction methods. In addition, from observing Fig.6(a) and Fig.6(b), we can find that the prediction performance of PMNMF is higher than all these 10 methods based on the Gavin database, and from observing Fig.6(c) and Fig.6(d), we can find that the prediction performance of PMNMF is higher than all these 10 methods based on the Krogan database as well.

IV. DISCUSSION
Essential proteins are important for cell growth and regulation processes. In recent years, accumulating computational methods have been proposed to identify essential proteins. However, due to the effects of false positives and false negatives in original PPI data obtained by high-throughput techniques, it is still a challenging work to develop a stable and accurate essential protein prediction model. Inspired by the fact that it can improve the prediction performance of computational models by integrating PPI networks with multiple biological information of proteins, a novel prediction model called PMNMF based on the Non-negative Matrix Factorization is designed in this manuscript. In PMNMF, a weighted PPI network is first constructed by extracting the topological information of proteins from the original PPI network, and then, by applying the NMF method on the weighted PPI network and combining with biological information of proteins, an improved Page-Rank algorithm is introduced to calculate the importance scores for proteins. Experimental results show that PMNMF can achieve superior prediction results than state-of-the-art prediction models, which demonstrates that PMNMF is an effective prediction method for key protein prediction. Of course, there are still some shortcomings in current version of PMNMF. For example, more biological information of proteins being considered, the prediction performance of PMNMF may become better.

V. CONCLUSION
The main contributions of this manuscript can be summarized as follows: (1) A weighted PPI network is established based on the topological information of proteins in the original PPI network. (2) The Non-negative Matrix Factorization is introduced to obtain the transition probability matrix.
(3) An improved Page-Rank algorithm is designed to estimate the critical scores of proteins. However, there are still some limitations in current version of PMNMF. For example, since a random algorithm is adopted to initialize these two non-negative factorization matrices, the result obtained by the NMF algorithm has a certain degree of randomness as well, we improve the stability of results through multiple iterations, more effective methods is worthy to be explored in the future researches.
JIN LIU is currently pursuing the bachelor's degree in information and computing science with Changsha University. Her current research interest is bioinformatics.
XIANGYI WANG is currently pursuing the bachelor's degree in information and computing science with Changsha University. Her current research interest is bioinformatics.
ZHIPING CHEN received the B.S. degree in computer science and technology from Xiangtan University, in 1994, and the M.S. and Ph.D. degrees in computer science and technology from Hunan University, in 1997 and 2003, respectively. From 1997 to 2009, he has taught in Hunan University. He is currently a Professor with Changsha University. His current research area includes bioinformatics mainly.
YIHONG TAN received the Ph.D. degree in computer science and technology from Hunan University, in 2012. He is currently a Full Professor with Changsha University. His current research area is mainly bioinformatics. He has published more than 100 peer-reviewed articles. His main research areas include bioinformatics and the Internet of Things.