A New Scheme for Essential Protein Identification Based on Uncertain Networks

Identifying essential proteins is important not only for understanding cellular activity but also for detecting human disease genes. A series of centrality measures have been proposed to identify essential proteins based on the protein-protein interaction (PPI) network. Although existing studies have focused on the topological features of the PPI network and the intrinsic characteristics of biological attributes, it is still a big challenge to further improve the prediction accuracy of essential proteins. Moreover, there are substantial amounts of false-positive data in PPI networks; thus, a PPI network should be modelled as an uncertain network. How to identify essential proteins more accurately and conveniently has become a research hotspot. In this paper, we propose a new essential protein discovery method called ETB-UPPI on uncertain PPI networks. The algorithm detects essential proteins by integrating topological features with biological information. Experimental results on four Saccharomyces cerevisiae datasets show that ETB-UPPI improves the prediction accuracy and outperforms other prediction methods, including the most commonly used centrality measures (DC, SC, BC, IC, EC, and NC), topology-based methods (LAC) and biological-data-integrating methods (PeC, WDC, UDONC, LBCC, TEGS, and RSG).


I. INTRODUCTION
Proteins are biological macromolecular compounds formed by the aggregation of many amino acids. They are one of the most basic substances in life and are widely found in various biological tissue cells. Among them, essential proteins are indispensable in cellular organisms. The deletion of these essential proteins causes the loss of protein complex function and leaves organisms unable to function [1]. Therefore, the discovery of essential proteins plays a significant role in disease detection and the manufacture of new drugs.
Traditional experimental techniques, including single-gene knockouts [2], RNA interference [3], and conditional knockouts [4], have been used to discover essential proteins. However, these experimental techniques are expensive. Therefore, it is almost impossible to use experimental techniques to detect essential proteins for a whole organism. Now, due to the development of high-throughput experimental technologies, many high-quality and large-scale PPI datasets have been accumulated [5], which provide relatively reliable and sufficient data for computational methods to detect essential proteins, functional modules and protein complexes. These computational methods in the PPI network can also be applied to many other fields [6]-[8].
The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Conti.
Jeong et al. [9] proposed the centrality-lethality rule, which demonstrated that proteins that are closely related to each other have greater potential to be the indispensable proteins in protein-protein interaction networks. After that, a series of centrality methods were proposed to identify essential proteins based on the topological features of PPI networks, such as the degree centrality (DC) [10], eigenvector centrality (EC) [10], betweenness centrality (BC) [11], closeness centrality (CC) [12], subgraph centrality (SC) [13], information centrality (IC) [14], edge clustering coefficient centrality (NC) [15], and local average connectivity centrality (LAC) [16]. The experimental results have verified that these centrality methods are more efficient and convenient for finding essential proteins than traditional experimental methods. However, there are still shortcomings to these methods. On the one hand, PPI data collected from high-throughput experimental techniques are always noisy and include false positives, which have a strong influence on the prediction accuracy of essential proteins. On the other hand, the centrality-based methods ignore the biological information of proteins, which leads to low prediction accuracy. To overcome the shortcomings of these algorithms, substantial amounts of biological information have been applied to identify essential proteins more accurately. For example, Acencio and Lemke [17] introduced a supervised machine learning method that combines network topology and biological characteristics, where the biological characteristics include subcellular localization information and biological process information. Li et al. [18] proposed PeC, and Tang et al. introduced a modified prediction method called WDC [19], which integrates gene expression data and topological features. Considering the domain information, Peng et al. [20] proposed a method named UDoNC. Zhibei et al. [21] proposed a method named GOS to predict essential proteins based on gene expression data, orthology, and subcellular localization information. Zhang et al. [22] proposed a method named OGN by integrating orthology, gene expression data and PPI networks. Meanwhile, clustering-based methods have also been used to predict protein complexes [23]. Li et al. [24] proposed a method called SPP to detect essential proteins. It first partitioned the PPI network into sub-networks by subcellular location information and then assigned a priority to each sub-network. Finally, it employed the priority and the connections between two nodes to efficiently identify essential proteins. In [25], the authors presented a novel method called TEGS to detect essential proteins based on gene expression data, GO annotation information and protein subcellular localization data. Lei et al. [26] constructed weighted PPI networks and efficiently discovered essential proteins based on RNA-Seq, GO annotation and subcellular localization information. Li et al. [27] constructed a refined PPI network called TS-PIN by integrating gene expression information and subcellular location information to efficiently identify essential proteins. Afterwards, they adopted a deep learning method [28] to detect essential proteins without any prior knowledge. In addition, in recent years some scholars have addressed the discovery of essential proteins from the perspective of building new models, such as the RWHN [29] and E-POC [30] models, and have achieved good results. Compared with network-based centrality methods, these methods of integrating biological information have greatly improved the prediction accuracy.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Other studies have shown that proteins that frequently appear in different protein complexes are more likely to be essential proteins [31]. Therefore, in recent years, many detection methods for essential proteins incorporating protein complexes have been presented. Li et al. [32] developed a new essential protein discovery method called UC that used protein complexes to increase the prediction accuracy. Chao et al. [33] presented a new algorithm called LBCC for integrating the network topological features and protein complex information. The results showed that LBCC achieved good performance. Lei et al. [34] introduced a new method called PCSD, which integrated the subgraph density and participation degree in the protein complex to predict essential proteins.
Since the interactions between proteins are dynamic, many scholars have focused on the identification of essential proteins on dynamic PPI networks in recent years. A set of time-series temporary sub-networks have been constructed by integrating the time-series gene expression profiles and a deterministic PPI network. For example, Luo et al. [35] proposed a new method CDLC to identify essential proteins by integrating dynamic network topologies with protein complexes. Xiao et al. [36] presented a new essential protein detection method that constructed dynamic networks using gene expression data. In addition, they proved that most of the essential proteins were active. Shang et al. [37] built a new dynamic PPI network using RNA-Seq datasets to identify essential proteins. The experimental results showed that the proposed scheme could not only significantly improve the prediction accuracy but also be extended to many existing methods.
However, there are still some tough challenges in detecting essential proteins in PPI networks. On the one hand, interactions within PPI networks change continuously over time. On the other hand, PPI networks are inaccurate and incomplete due to the limitations of experimental methods. Therefore, deterministic PPI networks cannot truly reveal the connections between proteins in living organisms. From this point of view, a PPI network should be modelled as an uncertain graph. In this paper, we propose a new method called ETB-UPPI based on network topological features and biological information on uncertain protein-protein interaction networks. First, we convert the Simrank problem in an uncertain network into a Simrank calculation of a deterministic network. Then, we employ the Simrank method to calculate the protein similarity in a PPI network. Finally, we perform feature selection [38] and integrate multivariate biological data to measure the essentiality of the protein. To evaluate the performance of our proposed algorithm, ETB-UPPI, we conduct experiments on four datasets from different sources and compare it with other methods, i.e., DC, EC, BC, SC, IC, NC, LAC, PeC, WDC, UDONC, LBCC, TEGS and RSG. The experimental results show that ETB-UPPI can improve the prediction accuracy and is superior to other prediction methods.

II. METHODS
In this section, we first convert the Simrank problem in an uncertain network into a Simrank calculation of a deterministic network. Then, we employ the Simrank method to calculate the protein similarity in a PPI network. Afterwards, we integrate multivariate biological data such as gene ontology (GO) similarity, the Pearson correlation coefficient, and subcellular localization information. Finally, we give a scoring function to measure the essentiality of the protein.
The basic idea of our method is mainly based on the following premises: (1) A protein with a high similarity score is more likely to be an essential protein. (2) Essential proteins in the same cluster are more likely to be co-expressed. (3) The more often a protein appears in the nucleus, the more likely it is to be an essential protein. To describe the proposed method more clearly, we give the following definitions.

A. PROBLEM DEFINITIONS AND CONCEPTS
Definition 1 (Uncertain PPI Network): An uncertain PPI network can be represented as an undirected graph G = (V, E, P, A), where V denotes the set of protein nodes, E denotes the set of interactions between pairs of proteins, P represents the probability matrix of G, and A represents the adjacency matrix of the undirected graph G.

1) SIMRANK SIMILARITY SCORE
In this study, we detect future interactions between pairs of proteins in an uncertain PPI network based on a similarity method. According to the Simrank similarity score, we obtain the similarity measure.
Simrank was proposed by Jeh and Widom [39]. It is an intuitive graph-theoretic model and uses an iterative approach to calculate the similarity between nodes in a network. The algorithm measures the similarity between nodes completely based on the network topology. Simrank assumes that if two objects are referenced by similar objects, the two objects are also similar. In addition, Simrank can be used to reflect direct and indirect connections between vertices in the network.
Due to uncertain interactions in a PPI network, we employed the Simrank method to calculate the similarity between proteins. The method converts the Simrank problem in the uncertain PPI network [40] into a Simrank calculation in a deterministic PPI network. To avoid enumerating all possible worlds of an uncertain network, we constructed an adjacency subgraph of each node, which allowed us to perform transformation matrix calculations within the subgraph.
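Definition 2 (Simrank Similarity): Given an uncertain graph G = (V, E, P, A), for each node v ∈ V, let I(v) be the set of neighbours of vertex v in the protein-protein interaction network, let |I(v)| be the number of vertices in I(v), and let $I_i(v)$ denote the i-th vertex of I(v). The Simrank similarity between nodes u and v, which is denoted as s(u, v), can be defined as follows:
$$ s(u, v) = \begin{cases} 1, & u = v \\ \dfrac{c}{|I(u)|\,|I(v)|} \sum_{i=1}^{|I(u)|} \sum_{j=1}^{|I(v)|} s\big(I_i(u), I_j(v)\big), & u \neq v \end{cases} \qquad (1) $$
where c ∈ (0, 1) is the damping factor. There is a special case to be noted, that is, when $I(u) = \emptyset$ or $I(v) = \emptyset$, s(u, v) = 0. Let $A = [a_{ij}]$ be the n × n adjacency matrix of G, i.e., $a_{ij} = 1$ if nodes i and j are joined by an edge in E and $a_{ij} = 0$ otherwise. Equation (1) then has an equivalent matrix form, equation (2), in which $W = [w_{ij}]$ is the transformation matrix whose elements are defined as $w_{ij} = a_{ij} / \sum_{k=1}^{n} a_{kj}$. In our method, c is set to 0.8.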
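The iterative computation described in Definition 2 can be illustrated with a minimal Python sketch on a deterministic graph. It assumes the classical Jeh-Widom iteration (average the similarities of neighbour pairs, scale by c, and keep s(u, u) = 1); the function name `simrank` and the fixed iteration count are illustrative choices, not part of ETB-UPPI itself.

```python
import numpy as np

def simrank(adj: np.ndarray, c: float = 0.8, n_iter: int = 50) -> np.ndarray:
    """Iterate the Jeh-Widom SimRank recursion on a 0/1 adjacency matrix."""
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    col_sums = a.sum(axis=0)
    # w_ij = a_ij / sum_k a_kj; columns of isolated nodes stay all-zero,
    # which yields s(u, v) = 0 whenever u or v has no neighbours.
    w = np.divide(a, col_sums, out=np.zeros_like(a), where=col_sums > 0)
    s = np.eye(n)
    for _ in range(n_iter):
        # (W^T S W)[u, v] is the average similarity over neighbour pairs.
        s = c * (w.T @ s @ w)
        np.fill_diagonal(s, 1.0)  # s(u, u) = 1 by definition
    return s

# Toy path graph 0 - 1 - 2: nodes 0 and 2 share their only neighbour.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
sim = simrank(path)
```

On the toy path graph, nodes 0 and 2 have the same single neighbour, so their similarity equals the damping factor c.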
To detect the interactions between proteins in an uncertain PPI network G = (V, E, P, A) based on the Simrank similarity, we first convert the uncertain protein network into a deterministic network $\bar{G} = (V, \bar{E}, \bar{W})$ by calculating the transformation matrix $\bar{W}$.
Definition 3 (Possible World): Let G = (V, E, P, A) denote an uncertain graph, where P represents the probability matrix. Let $G' = (V, E')$ with $E' \subseteq E$ denote a subgraph of G whose edges are present according to the probabilities in P. Such a subgraph $G'$ is considered a possible world of G.
The probability of a possible world $G'$ is expressed as $\Pr[G']$:
$$ \Pr[G'] = \prod_{e \in E'} p(e) \prod_{e \in E \setminus E'} \big(1 - p(e)\big) $$
where p(e) is the probability that edge e is included in $G'$.
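To make the possible-world semantics concrete, the following Python sketch enumerates every possible world of a tiny uncertain graph and computes its probability as a product over present and absent edges, assuming edges are independent. The toy edge probabilities and the name `possible_worlds` are hypothetical and used only for illustration.

```python
from itertools import product

def possible_worlds(edge_probs):
    """Yield (edge_set, probability) for every possible world.

    edge_probs maps each edge to its existence probability p(e);
    edges are assumed to be independent.
    """
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        for edge, present in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if present else (1.0 - p)
        world = frozenset(e for e, present in zip(edges, mask) if present)
        yield world, prob

# Hypothetical three-edge uncertain graph.
toy = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
worlds = list(possible_worlds(toy))
```

A graph with |E| edges has 2^|E| possible worlds, and half of them contain any given edge, which is why exhaustive enumeration quickly becomes infeasible on real PPI networks.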

Definition 4 (The Transformation Matrix): Given an uncertain PPI network G and the deterministic network $\bar{G} = (V, \bar{E}, \bar{W})$, where $\bar{W}$ denotes the transfer matrix of $\bar{G}$, $\bar{E} = E \cup S$, and $S = \{(u, u) \mid u \in V\}$, the transfer matrix $\bar{W}$ is defined by equation (3), in which (u, v) denotes an edge between proteins u and v, A represents the adjacency matrix, and the definition involves the set of all possible worlds of G that contain the edge (u, v). By using the transformation matrix, we can calculate the Simrank similarity in an uncertain PPI network, which can again be written in matrix form. Since the calculation of equation (3) needs to enumerate all possible worlds, the complexity of the computation is $O(2^{|E|-1})$. Therefore, we construct a neighbour subgraph for each node to solve the problem.
Definition 5 (Neighbour Subgraph of Node): For each node u in $\bar{G} = (V, \bar{E}, \bar{W})$, the set of neighbours of node u is $\{v \mid (u, v) \in \bar{E}\}$, and the neighbour subgraph of node u is denoted G(u). The set of edges outside G(u) is defined as $\bar{E}(u)$. Assume that the number of possible worlds of G(u) is m, i.e., $G_1(u), G_2(u), \ldots, G_m(u)$. For each possible world $G_i(u)$, let $E_i(u)$ be the set of edges of G(u) that are in $G_i(u)$; the remaining edges of G(u) are not in $G_i(u)$.
According to $G_i(u)$, a possible world of $\bar{G}$ is denoted $\bar{G}_{ik}(u)$. For the possible world $\bar{G}_{ik}(u)$, the set of its edges outside G(u) can be represented as $\bar{E}_{ik}(u) = \{e \mid e \in \bar{E}(u),\ e \in \bar{G}_{ik}(u)\}$. The sum of the probabilities of all possible worlds over the outside edges $\bar{E}(u) = \{e \mid e \notin G(u),\ e \in \bar{E}\}$ is 1.
Thus, the transfer weight from node u to its neighbour v in G can be expressed through $\bar{W}_u$, the transfer matrix of G(u), as given in formula (6). From formula (6), we can find that $\bar{W}(u, v)$ can be computed within the subgraph G(u), which greatly reduces the time complexity.
Through the above-mentioned method, we can calculate the Simrank similarity score between nodes u and v, where u and v interact in the neighbour subgraph G(u) of node u.
Definition 6 (Simrank Similarity Score): For a protein u, its Simrank similarity score is defined as the sum of the similarity scores between protein u and all its neighbours. The calculation of the Simrank similarity score is shown as follows:
$$ S\_score(u) = \sum_{v \in I(u)} S(u, v) \qquad (8) $$
where S is the protein similarity score matrix and I(u) is the set of neighbours of protein u.
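Definition 6 amounts to a masked row sum. The sketch below assumes a precomputed similarity matrix and a 0/1 adjacency matrix whose nonzero entries mark the neighbours I(u); the names `s_score`, `sim_matrix` and `adj_matrix` are illustrative only.

```python
import numpy as np

def s_score(sim: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Return, for each protein u, the sum of sim[u, v] over its neighbours v."""
    mask = np.asarray(adj) != 0
    # Keep only entries that correspond to an edge, then sum each row.
    return np.where(mask, np.asarray(sim, dtype=float), 0.0).sum(axis=1)

# Hypothetical 3-protein example on the path 0 - 1 - 2.
sim_matrix = np.array([[1.0, 0.5, 0.2],
                       [0.5, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])
adj_matrix = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
scores = s_score(sim_matrix, adj_matrix)
```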

2) BIOLOGICAL INFORMATION SOURCES
Our proposed algorithm, ETB-UPPI, integrates three biological information sources: gene ontology data, gene expression data, and subcellular localization data. Gene ontology (GO) is a description of the function of a particular gene and provides a comprehensive source of information about the gene function. It is the basis for the computational analysis of large-scale molecular biology and genetics experiments in biomedical research. Usually, we believe that two essential proteins linked by an edge have a greater probability of participating in the same biological process. It is well known that gene expression refers to the process of protein synthesis under gene guidance. Therefore, we employ the Pearson correlation coefficient (PCC) to measure the intensity of co-expression of two interacting proteins. Moreover, subcellular localization refers to the specific site of a certain protein or expression product in a cell, for example, in the nucleus, in the cytoplasm or on the cell membrane. Previous studies have shown that in a certain compartment, such as the cytoplasm, the wider the distribution of proteins is, the greater the possibility of becoming an essential protein [41].

a: GENE ONTOLOGY (GO) SIMILARITY
Gene ontology (GO) is an ontology widely used in the field of biological information. It mainly consists of three branches: biological processes, molecular functions, and cellular components. GO not only provides valuable information and convenient methods to study the functional similarity of genes but also improves the accuracy of constructing networks [42]. We introduce the functional similarity method proposed by Wang et al. [43] to calculate the functional similarity between genes.
In this measure, which is computed by function (9) and denoted $Sim_{match}$, Term($G_1$) and Term($G_2$) represent the sets of GO terms that annotate genes $G_1$ and $G_2$, respectively.

b: PEARSON CORRELATION COEFFICIENT (PCC)
Studies have shown that co-expressed genes are more likely to encode interacting protein pairs [44]. Therefore, we use the Pearson correlation coefficient to measure how strongly two interacting proteins are co-expressed. A high PCC between two interacting proteins indicates that the interaction between them is highly reliable. The Pearson correlation coefficient between genes X and Y is calculated as follows:
$$ PCC(X, Y) = \frac{1}{m-1} \sum_{i=1}^{m} \left( \frac{g(X, i) - \bar{g}(X)}{\sigma(X)} \right) \left( \frac{g(Y, i) - \bar{g}(Y)}{\sigma(Y)} \right) \qquad (10) $$
where m is the number of samples in the gene expression data, g(X, i) and g(Y, i) represent the expression levels of genes X and Y in sample i, respectively, $\bar{g}(X)$ and $\bar{g}(Y)$ represent the mean expression levels of genes X and Y, respectively, and σ(X) and σ(Y) represent the standard deviations of the expression levels of gene X and gene Y, respectively.
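As an illustration of the PCC computation, the following Python sketch standardises two expression profiles and averages their products. It assumes the sample standard deviation (divisor m - 1) and returns 0 for constant profiles, which is a convention chosen here rather than one stated in the paper; the function name `pcc` is hypothetical.

```python
import numpy as np

def pcc(x, y) -> float:
    """Pearson correlation of two expression profiles of equal length m."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("profiles must be 1-D and of equal length")
    m = x.size
    if m < 2:
        return 0.0  # assumption: correlation is undefined, report 0
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile carries no co-expression signal
    zx = (x - x.mean()) / sx
    zy = (y - y.mean()) / sy
    return float(np.sum(zx * zy) / (m - 1))
```

For non-constant profiles the result agrees with `np.corrcoef(x, y)[0, 1]`.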

c: SUBCELLULAR LOCALIZATION SCORE
Subcellular localization refers to the specific site of a certain protein or expression product in a cell, which is divided into different compartments [45]. As shown in Figures 1 and 2, by counting the proteins and essential proteins in seven different compartments, we found that the numbers of proteins and essential proteins in the nucleus were the highest. This indicates that protein localization plays a crucial role in the essentiality of proteins. Let |u| be the number of times protein u appears in the subcellular localization of the nucleus, and let $P_{max}$ be the corresponding count for the protein that occurs most frequently in the nucleus. NSL(u) is defined as the subcellular localization score of protein u, which is calculated as follows:
$$ NSL(u) = \frac{|u|}{P_{max}} \qquad (11) $$
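A possible reading of the localization score is a simple normalisation: each protein's nucleus count divided by the largest count observed. The sketch below implements that reading under the assumption that NSL(u) = |u| / P_max; the input dictionary and the name `nsl_scores` are hypothetical.

```python
def nsl_scores(nucleus_counts):
    """Normalise per-protein nucleus counts by the maximum count.

    nucleus_counts maps a protein identifier to the number of times it is
    annotated to the nucleus (an assumed input format).
    """
    p_max = max(nucleus_counts.values(), default=0)
    if p_max == 0:
        # No protein is annotated to the nucleus: every score is 0.
        return {u: 0.0 for u in nucleus_counts}
    return {u: count / p_max for u, count in nucleus_counts.items()}

# Hypothetical counts for three proteins.
example_nsl = nsl_scores({"P1": 4, "P2": 1, "P3": 0})
```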

d: THE IMPORTANCE SCORE
Although the Simrank similarity considers the uncertainty of the protein-protein interaction network, it does not fully determine the essentiality of the protein. Biological information plays an important role in the identification process of essential proteins. Therefore, we use a method that integrates the Simrank similarity and biological properties to identify essential proteins. In ETB-UPPI, the essentiality of a protein is measured by the importance score. The importance score of a protein consists of the Simrank similarity score, the gene ontology (GO) similarity, the Pearson correlation coefficient (PCC) and the subcellular localization score (NSL). For a protein u, the importance score (E_score) of protein u is defined by function (12), in which I(u) is the set of all neighbour nodes of protein u and α ∈ (0, 1) is a parameter used to adjust the relative contribution of the Simrank similarity and the biological property similarities.

B. ALGORITHM ETB-UPPI
1) MAIN STEPS OF THE ALGORITHM
Fig. 3 shows the workflow of our ETB-UPPI method. The method is composed of four main steps:
1) Converting the Simrank problem in an uncertain network into a Simrank calculation of a deterministic network.
2) Employing the Simrank method to calculate the protein similarity in a PPI network.
3) Integrating multivariate biological data, such as the gene ontology (GO) similarity, the Pearson correlation coefficient, and subcellular localization information.
4) Giving a scoring function to measure the essentiality of the protein based on the Simrank similarity, gene ontology (GO) similarity, Pearson correlation coefficient and subcellular localization information.
The framework of the proposed ETB-UPPI algorithm (essential protein identification based on topological features and biological information on uncertain protein-protein interaction networks) is shown in Algorithm 1.

III. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATASET
In this section, to evaluate the performance of our proposed ETB-UPPI algorithm, we conduct experiments on Saccharomyces cerevisiae, the species whose data are the most reliable and complete.
The PPI datasets were collected from four different sources: DIP [46], MIPS [47], Gavin [45], and Krogan [48]. All self-interactions and repeated interactions were filtered out. Detailed information on the four datasets (DIP, MIPS, Gavin, and Krogan) is given in Table 1. The essential proteins are collected from the following four databases: MIPS [47], SGD [49], DEG [50], and SGDP (http://www.sequence.stanford.edu/group/). Gene expression data are downloaded from the Gene Expression Omnibus (GEO) database [51] (http://www.ncbi.nlm.nih.gov/geo/) with accession number GSE3431. They include three metabolic cycles with a total of 36 time points, and the gene expression dataset includes 9336 genes. The GO data applied in our method are extracted from the GO Consortium [52]. The subcellular localization information is obtained from the COMPARTMENTS database [53].

B. THE EFFECT OF PARAMETER ON PERFORMANCE
In our method, the importance score of the proteins consists of two parts: the Simrank similarity score and the biological information. We adjusted them using a parameter α, which ranges from 0 to 1. Tables 2-5 list the identification results of ETB-UPPI on four different networks obtained with different values of α. When α = 1, the E_score only considers the topology information, and when α = 0, the E_score only considers the biological information. As shown in Tables 2-4, we can find that when α = 0.6, the ETB-UPPI algorithm can achieve better performance.

Algorithm 1 ETB-UPPI
Input: Uncertain protein-protein interaction (PPI) network; GO data; gene expression data; subcellular localization information; k: the number of essential proteins; parameter α
Output: the set of k identified essential proteins
1. Compute the Simrank similarity score (S_score) by function (8) in an uncertain PPI network;
2. Compute the gene ontology similarity ($Sim_{match}$) by function (9);
3. Compute the Pearson correlation coefficient (PCC) by function (10);
4. Compute the subcellular localization score (NSL) by function (11);
5. Compute the importance score of each node according to the S_score, $Sim_{match}$, PCC and NSL;
6. The importance score E_score of each protein is obtained by function (12);
7. Sort nodes in descending order by their E_score value;
8. Output the top k ranked essential proteins.
To further reflect the impact of α on our algorithm, we use Figure 4 to show how the F-measure of the ETB-UPPI method fluctuates under different values of α. From Figure 4(a), (b), and (d), we can clearly observe that the F-measure is maximized when α = 0.6. It can be noticed from Figure 4(c) that the F-measure obtains the best result when α = 0.5. Thus, we set α = 0.6 in our experiments.

C. COMPARING ETB-UPPI WITH OTHER METHODS
To evaluate the performance of ETB-UPPI, we compare our ETB-UPPI method with other prediction methods (DC [10], EC [10], BC [11], SC [13], IC [14], NC [15], LAC [16], PeC [18], WDC [19], UDONC [20], LBCC [33], TEGS [25], and RSG [26]) on four different datasets. For convenience, UDONC and LBCC are only applied to the DIP dataset. We calculated the importance scores of the proteins with each prediction method and ranked the proteins in descending order of importance score. Then, we chose six levels (the top 1, top 5, top 10, top 15, top 20, and top 25 percent of all proteins) as candidate essential proteins. Finally, we compared the known essential proteins with the candidate essential proteins to determine the number of truly essential proteins among the candidates.
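The evaluation protocol above can be sketched in a few lines of Python: rank proteins by score, take the top p percent as candidates, and count how many are known essential proteins. The rounding rule (ceiling) and the names `hits_at_top_percent`, `demo_scores` and `demo_essential` are assumptions made for this illustration.

```python
import math

def hits_at_top_percent(scores, essential, percents=(1, 5, 10, 15, 20, 25)):
    """Count known essential proteins among the top p% ranked proteins.

    scores maps protein -> importance score; essential is a set of
    known essential proteins.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = {}
    for p in percents:
        k = math.ceil(len(ranked) * p / 100)  # assumption: round up
        hits[p] = sum(1 for prot in ranked[:k] if prot in essential)
    return hits

# Hypothetical ranking of 100 proteins, p0 scoring highest.
demo_scores = {f"p{i}": 100 - i for i in range(100)}
demo_essential = {"p0", "p2", "p10"}
demo_hits = hits_at_top_percent(demo_scores, demo_essential)
```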
The prediction results at six levels on the DIP dataset are shown in Figure 5. It can be seen that our ETB-UPPI method is superior to the other methods (DC, EC, BC, LAC, PeC, WDC, UDONC, LBCC) at all six levels from the top 1% to the top 25% of ranked proteins. From Figure 5, we can see that the total numbers of essential proteins correctly predicted by our method ETB-UPPI were 42, 206, 332, 436, 548, and 621 at the six levels. For the top 1% of identified essential proteins, the ETB-UPPI algorithm performs much better than DC, EC, BC, LAC, PeC, WDC, UDONC, LBCC, and TEGS (by 90, 75, 75, 44, 10, 31, 13, 13 and 2 percent, respectively). For the top 10% and top 15% of identified essential proteins, our algorithm is slightly inferior to TEGS and RSG. However, in most cases, our algorithm still outperforms TEGS and RSG. Therefore, it can be concluded that the ETB-UPPI algorithm outperforms the ten other methods.
The prediction results at six levels on the MIPS dataset are shown in Figure 6. Clearly, our ETB-UPPI method outperforms all the other methods (DC, EC, SC, IC, NC, LAC, PeC, WDC, TEGS and RSG) at five levels. From Figure 6, we can see that the total numbers of essential proteins correctly predicted by ETB-UPPI were 19, 92, 191, 257, 338, and 405 at the six levels. ETB-UPPI achieves the best performance for the top 1%, top 10%, top 15%, top 20%, and top 25%, and PeC obtains the best performance at the top 5%. For the top 5% of identified essential proteins, PeC detected 93 truly essential proteins, while ETB-UPPI correctly identified 92 essential proteins. When the number of identified essential proteins is large, our ETB-UPPI method outperforms PeC. For example, when predicting the top 25% of essential proteins, ETB-UPPI achieves an approximately 25% increase over PeC.
The prediction results at six levels on the Gavin dataset are shown in Figure 7. Our ETB-UPPI method performs better than all the other methods at six levels from the top 1% to top 25% of sorted proteins. Figure 7 shows that the numbers of essential proteins correctly identified by our ETB-UPPI method are 13, 59, 117, 170, 213, and 254 at the six levels. ETB-UPPI achieves the best performance at the top 5%, top 10%, top 15%, top 20%, and top 25%. LAC and PeC obtain the best performance at the top 1%. For the top 25% of identified essential proteins, LAC and ETB-UPPI achieve the same performance. However, compared with DC, EC, SC, IC, NC, PeC, WDC, TEGS and RSG, the performance of our ETB-UPPI algorithm is increased by 14%, 23%, 35%, 17%, 1%, 9%, 3%, 0.4% and 2%, respectively.
The prediction results at six levels on the Krogan dataset are shown in Figure 8. From Figure 8, we can see that the total numbers of essential proteins predicted correctly by our ETB-UPPI method are 22, 102, 179, 242, 292, and 339 at the six levels. Note that the performance of the ETB-UPPI algorithm is inferior only to the RSG algorithm and superior to the other nine algorithms at the three levels from the top 1% to top 10% of ranked proteins. When predicting the top 1% of essential proteins, the numbers of essential proteins correctly predicted by NC and ETB-UPPI are the same. Moreover, the ETB-UPPI algorithm performs slightly worse than RSG and TEGS, whereas it outperforms the other eight algorithms at the three levels from the top 15% to top 25% of sorted proteins. For the top 25% of identified essential proteins, compared with six centrality methods (DC, EC, SC, IC, NC, and LAC), PeC, and WDC, the performance of our ETB-UPPI algorithm is increased by 6%, 33%, 24%, 7%, 4%, 4%, 6% and 6%, respectively.
Based on the comprehensive analysis of Figures 5-8, it can be seen that our ETB-UPPI method performs best on the DIP, MIPS and Gavin datasets. On the Krogan dataset, ETB-UPPI identified slightly fewer essential proteins than the RSG and TEGS methods. It is obvious that the ETB-UPPI method has a better identification efficiency and strong stability. The experimental results show the correctness as well as the effectiveness of applying ETB-UPPI to essential protein discovery on PPI networks.

D. VALIDATION USING SIX STATISTICAL MEASURES
To further demonstrate the performance of our method, we apply six statistical measures, sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F), and accuracy (ACC), to compare the ETB-UPPI method with other methods. The statistical measures are defined as follows.
True positive (TP) denotes the total number of true positives (correctly identified essential proteins). False negative (FN) denotes the total number of truly essential proteins minus TP. True negative (TN) denotes the number of true negatives (correctly identified non-essential proteins). False positive (FP) denotes the total number of non-essential proteins incorrectly predicted as essential proteins. We rank the proteins in descending order based on the importance scores obtained by different algorithms and select the top 25% as essential proteins. These statistical measures of the prediction result are calculated as follows:
$$ SN = \frac{TP}{TP + FN}, \quad SP = \frac{TN}{TN + FP}, \quad PPV = \frac{TP}{TP + FP}, $$
$$ NPV = \frac{TN}{TN + FN}, \quad F = \frac{2 \times SN \times PPV}{SN + PPV}, \quad ACC = \frac{TP + TN}{TP + TN + FP + FN}. $$
A high value of the six above statistical measures suggests good performance overall. The results of the statistical measures are shown in Tables 6-9. From Table 9, it can be seen that ETB-UPPI performs slightly worse than TEGS and RSG on the Krogan dataset. However, from Tables 6-8, it can be clearly observed that the six statistical measures of ETB-UPPI are higher than those of any other method on the three datasets, which further demonstrates that our ETB-UPPI method can predict essential proteins more accurately.
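The six measures follow directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the function name `classification_measures`, the zero-denominator convention and the example counts are illustrative assumptions.

```python
def classification_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute SN, SP, PPV, NPV, F-measure and ACC from confusion counts."""
    def ratio(num, den):
        # Assumption: report 0.0 when a denominator is empty.
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)    # sensitivity
    sp = ratio(tn, tn + fp)    # specificity
    ppv = ratio(tp, tp + fp)   # positive predictive value
    npv = ratio(tn, tn + fn)   # negative predictive value
    f = ratio(2 * sn * ppv, sn + ppv)
    acc = ratio(tp + tn, tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}

# Hypothetical counts, not taken from the paper's tables.
demo_measures = classification_measures(tp=40, fp=10, tn=30, fn=20)
```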

E. FOLD ENRICHMENT OF ETB-UPPI
Fold enrichment [37] is used to describe how the gold standard essential proteins are enriched in the set of identified essential proteins. Let N denote the set of identified essential proteins, and $N_e$ denote the number of essential proteins in N. $TPR = N_e / |N|$ denotes the percentage of true positives. Suppose that V denotes the set of all proteins in the PPI network and that P denotes the set of all gold standard essential proteins in V. $EPR = |P| / |V|$ denotes the percentage of all essential proteins in V. Therefore, the fold enrichment of the identified proteins is calculated as follows:
$$ \text{Fold enrichment} = \frac{TPR}{EPR} $$
We also test and compare the fold enrichment of the ETB-UPPI and WDC algorithms on the four datasets. Figure 9 shows that the fold enrichment of our ETB-UPPI method on the four PPI networks (DIP, MIPS, Gavin, and Krogan) is significantly higher than that of the WDC method.
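A short Python sketch of the fold enrichment ratio, using sets for N, P and V as defined above. The function name `fold_enrichment` and the toy sets are hypothetical.

```python
def fold_enrichment(identified, gold_essential, all_proteins) -> float:
    """Return TPR / EPR for a set of identified proteins."""
    identified = set(identified)
    all_proteins = set(all_proteins)
    gold = set(gold_essential) & all_proteins  # P is restricted to V
    if not identified or not gold:
        raise ValueError("identified set and gold standard must be non-empty")
    tpr = len(identified & gold) / len(identified)  # N_e / |N|
    epr = len(gold) / len(all_proteins)             # |P| / |V|
    return tpr / epr

# Hypothetical network of 100 proteins with 20 gold-standard essentials.
universe = [f"p{i}" for i in range(100)]
gold_set = set(universe[:20])
picked = universe[:6] + universe[50:54]  # 10 picks, 6 of them essential
demo_fe = fold_enrichment(picked, gold_set, universe)
```

A value above 1 means the identified set is richer in essential proteins than a random draw from the network would be.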

F. VALIDATION WITH JACKKNIFE METHODOLOGY
In this section, we apply the jackknife methodology to further assess the performance of our ETB-UPPI method as well as the other prediction methods. The horizontal axis of the jackknife curves denotes the number of proteins in the PPI networks sorted in descending order based on their importance scores calculated by the corresponding methods. The vertical axis denotes the number of essential proteins. We analysed the performance of ETB-UPPI and the other methods by selecting the top 1000 proteins on the four different datasets. The comparison results of the jackknife curves for different algorithms are shown in Figures 10-13. Clearly, ETB-UPPI performs best on three of the datasets while performing only slightly worse than TEGS and RSG on the Krogan dataset. This indicates that ETB-UPPI can detect more essential proteins.
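The jackknife curve described here is a cumulative count of essential proteins along the ranking. A minimal sketch, assuming scores are stored in a dictionary and using the hypothetical name `jackknife_curve`:

```python
import numpy as np

def jackknife_curve(scores, essential, top_n: int = 1000) -> np.ndarray:
    """Cumulative number of essential proteins among the top-ranked proteins.

    Element i of the result is the number of known essential proteins
    among the i + 1 highest-scoring proteins.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    flags = np.array([prot in essential for prot in ranked], dtype=int)
    return np.cumsum(flags)

# Hypothetical four-protein example.
demo_curve = jackknife_curve({"a": 4, "b": 3, "c": 2, "d": 1}, {"a", "c"})
```

Plotting the returned array against its index for each method gives curves of the kind compared in the jackknife figures; a higher curve means more essential proteins recovered at the same rank cutoff.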

IV. CONCLUSION
In the post-genomic era, it is important to predict essential proteins and protein complexes. To overcome the shortcomings of traditional experimental methods and achieve higher prediction accuracy, many identification methods have been developed. In this study, we developed a new essential protein identification method called ETB-UPPI, which combines network-based topological features with biological information. First, we converted the Simrank problem on an uncertain network into a Simrank calculation on a deterministic network. Then, we used the Simrank method to calculate protein similarity in the PPI network. After that, we combined the topological features of the PPI network with three sources of biological information about proteins. Finally, the importance score of each protein was obtained as the sum of the GO similarity, Pearson correlation coefficient, subcellular localization information, edge clustering coefficient, and Simrank similarity score. We compared ETB-UPPI with other methods, i.e., DC, EC, SC, IC, BC, NC, LAC, PeC, WDC, UDONC, LBCC, TEGS, and RSG, in terms of the number of essential proteins identified, six statistical measures, and the jackknife curves. The experimental results on four different datasets show that ETB-UPPI achieves the best performance on three of them and performs only slightly worse than TEGS and RSG on the Krogan dataset.
LIANGYU MA was born in Nantong, Jiangsu, China, in 1994. She is currently pursuing the master's degree with the College of Information Engineering, Yangzhou University, Yangzhou, China. Her research topic is bioinformatics.
LING CHEN (Member, IEEE) was born in 1951. He graduated from the Mathematics Department, Yangzhou Teachers' College. He is currently a Professor of computer science with the Information Technology College, Yangzhou University. He has published more than 200 articles in journals and conferences. He has also authored/coauthored six books. He has 15 research projects supported by the Chinese Natural Science Foundation and other organizations. His research interests include artificial intelligence, data mining, system optimization, and complex network analysis. He is a member of ACM. He received five Awards of Progress in Science and Technology from the Government of Anhui and Jiangsu Province. He was awarded the Government Special Allowance by the State Council.
BYEUNGWOO JEON (Member, IEEE) received the B.S. degree (magna cum laude) and the M.S. degree in electronics engineering from Seoul National University, Seoul, South Korea, in 1985 and 1987, respectively, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, in 1992. From 1993 to 1997, he was with the Signal Processing Laboratory, Samsung Electronics, South Korea, where he conducted research and development on video compression algorithms, the design of digital broadcasting satellite receivers, and other MPEG-related research for multimedia applications. Since September 1997, he has been on the faculty of the School of Electronic and Electrical Engineering, Sungkyunkwan University, South Korea, where he is currently a Full Professor. He served as the Project Manager of digital TV and broadcasting at the Korean Ministry of Information and Communications, from March 2004 to February 2006, where he supervised all digital TV-related R&D in South Korea. He has authored many articles in the areas of video compression, preprocessing/postprocessing, and pattern recognition. His research interests include multimedia signal processing, video compression, statistical pattern recognition, and remote sensing. Dr. Jeon is a member of Tau Beta Pi and Eta Kappa Nu. He is also a member of SPIE, IEEK, KICS, and KSOBE. He was a recipient of the 2005 IEEK Haedong Paper Award from the Signal Processing Society, South Korea. VOLUME 8, 2020