Using Alias Sampling Strategy Based on Network Embeddings to Detect Protein Complexes

Detecting protein complexes from available protein-protein interaction (PPI) data will help to deeply understand the mechanism of the biological activities. In recent years, various computational methods have been developed for identifying protein complexes from PPI networks. Almost all the basic computational methods mainly depend on the association of topological analysis of PPI networks. However, most of them fail to satisfactorily capture the global and local topological structures of the PPI networks, as well as the diversity of connectivity patterns between individual nodes at the same time. To solve this problem, in this work we propose a node embedding based alias sampling extension method to detect protein complexes. More specifically, for a given set of seed nodes, it first uses the alias sampling strategy based on protein node embedding similarities to select potential addable nodes. Then it makes use of a new conductance measure, which could better quantify the likelihood of a subgraph being a protein complex, to decide whether to extend the current candidate subgraph in order to find protein complexes. Evaluated on six real yeast PPI networks, our method outperforms state-of-the-art methods in detecting protein complexes. Furthermore, the experimental results demonstrate the protein complexes predicted by our method have higher biological significance.

In recent years, computational prediction of protein complexes from PPI networks has attracted considerable attention from bioinformatics researchers. The main line of approaches for identifying protein complexes from PPI networks is based on the observed inherent topological structures of protein complexes [4], [5]. Consequently, identifying protein complexes can be formulated as searching for subgraphs that are densely connected internally and well separated from the rest of the network. Building on this basic idea, detection methods for protein complexes based on machine learning and data mining have grown rapidly and become useful ways to identify protein complexes. In addition, several studies have shown that combining extra well-selected biological information can improve the performance of protein complex detection [6], [7]. Here, we discuss only methods that use the topological characteristics of the network alone, since biological information could be added to most of these methods to improve their performance.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Existing protein complex detection approaches can be roughly divided into two groups: supervised and unsupervised learning methods. The supervised learning methods rely on the topological properties of the available known complexes. Most of them detect complexes from PPI networks in three steps: extract features from known complexes, train a supervised classification model on the extracted features to distinguish true protein complexes, and search the PPI network for protein complexes according to the classification model. For example, SLPC [8] is a supervised method that trains a regression model to score subgraphs and detect new complexes in a PPI network. NN [9] trains a neural network model to guide the search process for protein complex detection. ClusterEPs [10] measures the likelihood of a subgraph being a complex using an integrative score of emerging patterns (EPs). However, supervised learning methods need a set of informative, discriminating, and independent features, and obtaining them requires tedious feature engineering. Moreover, they require training data, but in reality the number of known complexes is quite small. The majority of unsupervised methods are based on the concept that a pair of proteins that interact with each other has a higher probability of sharing the same functions than two proteins that do not interact with each other. The Markov clustering method (MCL) [11] mimics a Markovian random walk on graphs: it iteratively applies "Expand" and "Inflation" operations to detect complexes. However, MCL generates only non-overlapping clusters.
Both the molecular complex detection (MCODE) [12] and the repeated random walks (RRW) [13] methods grow complexes from a single node by iteratively adding similar nodes according to different similarity measures. The clustering based on maximal cliques (CMC) [14] method is a clique-based method that finds complexes by removing or merging cliques based on the connectivity of the clusters. However, the quality measures used in these methods consider only the internal connectivity of protein complexes. The clustering with overlapping neighborhood expansion (ClusterONE) [4] method adopts cohesiveness as the criterion for measuring the quality of a subgraph as a protein complex. It considers both the internal and external connectivity of complexes but ignores the interaction density within complexes. More recently, Zaki et al. proposed a method named PEWCC to detect protein complexes from PPI networks [15]. This method assesses the reliability of protein interactions and then predicts protein complexes based on weighted clustering coefficients. Afterward, Hanna et al. introduced a method (ProRank+) for detecting protein complexes based on a protein ranking algorithm that sorts proteins according to their importance in forming the corresponding subgraphs [1]. Most of the above-mentioned computational approaches rely mainly on topological analysis of the PPI network to find subgraphs that have the potential to be protein complexes [16]. However, existing computational methods fail to capture the global and local topological structures of the PPI network simultaneously. Moreover, protein nodes in the PPI network exhibit various connectivity patterns besides direct connections: sharing common neighbors, belonging to common communities, or having the possibility of connecting with each other.
Therefore, reflecting the diversity of connectivity patterns of protein nodes is important for better detecting protein complexes from PPI networks.
To solve these problems, in this paper we propose a new seed-extension method based on node embedding similarities to detect protein complexes from PPI networks. Given a set of seeds, our method first uses an alias sampling strategy based on node embedding similarities to select potential addable nodes from the network. Node embedding similarities are used because they not only capture the relations between nodes but also preserve the structure of the network and the diversity of connectivity patterns observed in large-scale networks. Moreover, PPI networks may contain many false negative interactions due to the limitations of high-throughput experiments [14], whereas node embedding similarities can reflect the possibility that two nodes have a potential connection. Then, our method uses a new conductance score as a quality measure to decide whether to add a new node to, or remove a node from, the current candidate subgraph; this score better captures the topological structure of protein complexes by considering both the internal and external connectivity as well as the density of protein complexes. Finally, our method merges the highly overlapped subgraphs. We compared our method with six state-of-the-art protein complex detection methods. Experimental results on six yeast PPI networks from different publicly accessible databases indicate that our method outperforms all competing algorithms according to different evaluation criteria. Furthermore, the gene ontology (GO) enrichment analysis of the results of all competing algorithms shows that our method finds more biologically meaningful complexes.

II. METHODS
In this section, we present the details of our proposed method. The main steps of our method are shown in Fig 1. First, a network representation method is used to obtain the node embeddings of the network, and the network is weighted by the node embedding similarities between nodes. Second, an alias table is generated based on the node embedding similarities, and alias sampling is used to expand the seeds. Third, the expansion step is repeated, adding new nodes to or removing boundary nodes from the current subgraph depending on whether the conductance score of the candidate subgraph increases, to obtain the final candidate subgraph. Finally, all the highly overlapped candidate subgraphs are merged.
… different edge weights in $W$. $W$ is the weighted adjacency matrix of $G$ with $W_{ij} = W_{ji}$, and $W_{ij}$ denotes the edge weight between node $i$ and node $j$. Here, we use the pairwise clustering coefficient score $C_{uv}$ as the weight between nodes $u$ and $v$, based on the concept that two nodes with more common neighbors have a higher possibility of connecting with each other. $C_{uv}$ is defined as: … where $N(*)$ denotes the number of neighbors of node $*$. The weighted degree matrix $D$ of $G$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$, which is the total weight of the interactions connecting to protein $i$.
Motivated by the conductance defined in [17], we present a new conductance as a quality measure for identifying well-separated subgraphs in a given graph. For a set of proteins denoted as $S$, the new conductance of $S$ in $G$ is defined as: … where $f(\mathrm{density})$ is the exponential function of the subgraph density, $W_S$ is the weighted adjacency matrix of the subgraph constructed from the nodes in set $S$, and $D_S$ is the weighted degree matrix for the nodes in $S$, with $D_{ii} = \sum_j W_{ij}$ for $i \in S$. This new conductance measure assigns a higher score to subgraphs that have a large number of interactions, high density, and good separation from the rest of the network, and a lower score otherwise.
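The quantities that the description above says the new conductance takes into account (weighted degrees, internal and boundary edge weight, and density) can be computed as in the following minimal sketch. It does not reproduce the paper's conductance formula. The dictionary-based graph layout, the function names, and the unweighted density $2|E_S| / (|S|(|S|-1))$ are assumptions made for illustration; the base 1.1 of the exponential function is the value the paper reports using.

```python
def weighted_degree(W, i):
    """D_ii = sum_j W_ij: total weight of the interactions attached to node i."""
    return sum(W[i].values())


def subgraph_stats(W, S):
    """Return (internal weight, boundary weight, density) of node set S.

    W is a symmetric adjacency map (assumed layout): W[u][v] == W[v][u].
    """
    S = set(S)
    internal = 0.0  # weight of edges with both endpoints in S
    boundary = 0.0  # weight of edges leaving S
    edge_visits = 0  # internal edges, each counted once per endpoint
    for u in S:
        for v, w in W[u].items():
            if v in S:
                internal += w / 2.0  # every internal edge is visited twice
                edge_visits += 1
            else:
                boundary += w
    edges_in = edge_visits // 2
    n = len(S)
    # Assumed density definition: fraction of possible node pairs that are linked.
    density = 2.0 * edges_in / (n * (n - 1)) if n > 1 else 0.0
    return internal, boundary, density


def density_factor(density, base=1.1):
    """Exponential function of the density, f(density) = base ** density."""
    return base ** density


# Toy network (hypothetical): a triangle a-b-c plus one extra edge c-d.
W = {
    "a": {"b": 1.0, "c": 0.5},
    "b": {"a": 1.0, "c": 0.5},
    "c": {"a": 0.5, "b": 0.5, "d": 0.25},
    "d": {"c": 0.25},
}
internal, boundary, density = subgraph_stats(W, {"a", "b", "c"})
```

For the triangle, the internal weight is 2.0, the boundary weight is 0.25, and the density is 1.0, so a dense, well-separated candidate yields a large internal term, a small boundary term, and the largest possible density factor.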

2) NODE EMBEDDING SIMILARITIES
Recent advances in deep neural networks have shown that deep learning based methods have powerful representation abilities and generate very useful representations for graph-structured data [18]. For example, Sang et al. [19] used a deep learning architecture to convert graph nodes into embeddings for extracting semantic relations between nodes. More precisely, node embedding algorithms [20]-[24], which aim to automatically extract features from graph-structured data and learn representations for the nodes in a graph, have been successfully applied to multi-label classification, link prediction, and visualization tasks. In particular, Node2vec [23] can capture the diversity of connectivity patterns observed in large-scale networks and automatically learns a mapping of nodes to a low-dimensional feature space. Specifically, two nodes that are "close" in the graph, because they either belong to common communities or have similar structural roles, obtain similar vector representations from Node2vec. The resulting node embedding vectors are not only continuous feature representations of the nodes in the network but also reflect the diversity of similarities between nodes.
Intuitively, for protein complex detection tasks, proteins that are highly interconnected and belong to a common subgraph should have higher similarities. In our method, Node2vec is used as the default method to obtain the node embeddings of a given PPI network, since it not only embeds similar nodes close together but also preserves the network structure and captures the diversity of similarities between nodes. We also compared the performance obtained with different node embedding methods, as shown in the Comparison of Different Embedding Methods section. Given a graph $G = (V, E, W)$, Node2vec learns a node embedding $\phi_i \in \mathbb{R}^d$ for each $v_i \in V$. Therefore, given a protein pair $v_i$ and $v_j$, we can calculate the similarity between them from their embedding vectors. If two nodes are "close" to each other or share a common subgraph, the node embedding similarity between them will be higher than that of other pairs.
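As a concrete illustration of scoring a protein pair from its embedding vectors, the sketch below uses cosine similarity. The text does not state which similarity function is applied to the Node2vec vectors, so cosine similarity, the toy three-dimensional vectors, and the protein names are assumptions for illustration only.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical embeddings phi_i in R^d with d = 3 (the paper's default is d = 32).
embeddings = {
    "p1": [1.0, 0.0, 1.0],
    "p2": [2.0, 0.0, 2.0],  # same direction as p1
    "p3": [0.0, 1.0, 0.0],  # orthogonal to p1
}

sim_close = cosine_similarity(embeddings["p1"], embeddings["p2"])
sim_far = cosine_similarity(embeddings["p1"], embeddings["p3"])
```

Here `sim_close` is approximately 1 and `sim_far` is 0, which mirrors the intuition that proteins embedded in the same direction should be preferred when a subgraph is extended.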
In any seed-and-extension algorithm, there is an implicit bias due to the choice of the extension direction. Since we search for potential protein complexes through the entire network for a given seed, we offset this bias by using alias sampling [25], driven by the node embedding similarities, to select a node with which to extend the current candidate subgraph. In each sampling step, a node that is closer to the current seed or subgraph in the embedding space has a higher probability of being selected. Moreover, because the node embedding similarities can be precomputed, nodes can be sampled efficiently in O(1) time using alias sampling.

B. THE ALGORITHM
Given a graph, the embedding vector of each node is first generated, and the similarities between pairs of nodes are calculated based on the embedding vectors and the pairwise clustering coefficient. Second, the nodes whose degree is above the average degree of the nodes in the graph are selected as initial seeds, and the algorithm uses the alias sampling strategy based on node embedding similarities to select addable nodes. The algorithm then decides whether to add the new node to, or remove a node from, the current candidate subgraph based on the new conductance value. Whenever the growth process of the initial seed set is finished, the algorithm selects the next seed set by taking the nodes with degree equal to or larger than the average degree from the new network, which is constructed by removing all seeds that have been processed so far. The entire procedure terminates when no proteins remain. In our method, the new conductance is used as the quality measure of the connectivity of subgraphs during the extension process, and the exponential function of the subgraph density is set to $f(\mathrm{density}) = 1.1^{\mathrm{density}}$, because a subgraph that has high interconnectivity and high density and is well separated from the rest of the network will increase the new conductance value. Moreover, the growth process allows the removal of any boundary node from the group being grown in order to obtain a subgraph with a higher new conductance value. Third, highly overlapping subgraph pairs are merged. In our benchmarks, we merged pairs of subgraphs with an overlapping score larger than a threshold θ. The overlap score, known as the neighborhood affinity score, of two protein sets $A$ and $B$ is defined as follows:
$$NA(A, B) = \frac{|A \cap B|^2}{|A| \times |B|} \qquad (3)$$
Finally, subgraphs that contain fewer than three nodes are discarded, and the remaining subgraphs are output as the final protein complexes. See Algorithm 1 for more details.

Algorithm 1
…
Initialize seeds_set to Empty
for node v_i in G
    if degree(v_i) > G.average_degree do
        seeds_set.add(v_i)
    End if
End for
for node v_i in seeds_set do
    Initialize subgraph to v_i
    while subgraph is changing do
        s = AliasSample(subgraph, G, π)
        subgraph' = Append s to subgraph
        if con(subgraph') > con(subgraph) do
            subgraph = subgraph'
        End if
        V_boundary = GetBoundaries(subgraph, G)
        …
        if con(subgraph') > con(subgraph) do
            subgraph = subgraph'
        End if
    End while
    Append subgraph to complexes
End for
…
while G != Empty
Calculate overlapping score of every subgraph pairs
Merge subgraph pairs whose overlapping score ≥ θ
return complexes

Algorithm 2 AliasSample Algorithm
Require: Current Subgraph subgraph, Network G, Network weight π, which determines the probabilities of the distribution for sampling
Ensure: a sampled node, denoted as i
l = GetAllNeighbors(G, subgraph)
Initial L = H = ∅ and A = [ ]
weightall = SumAllWeight(subgraph, l, π)
for node i in l do
    node_weight = SumWeight(node_i, subgraph, π)
    Pr_i = node_weight / weightall
end for
for i = 1 to len(l) do
    if Pr_i ≤ l^{-1} then
    …
if Pr_i > Random(0, 1) then
    return node h
else
    return node i
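Two helper steps of the procedure described above can be sketched as follows: choosing seeds whose degree exceeds the average degree, and scoring the overlap of two candidate subgraphs with the standard neighborhood affinity $|A \cap B|^2 / (|A| \times |B|)$. The adjacency-set graph representation and the function names are assumptions for illustration, not the authors' implementation; the merge test uses "≥ θ" as written in the Algorithm 1 listing.

```python
def select_seeds(adj):
    """Return the nodes whose degree is strictly above the average degree."""
    if not adj:
        return []
    avg_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
    return sorted(v for v, nbrs in adj.items() if len(nbrs) > avg_degree)


def neighborhood_affinity(a, b):
    """NA(A, B) = |A & B|^2 / (|A| * |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def should_merge(a, b, theta):
    """Merge rule from the Algorithm 1 listing: overlap score >= theta."""
    return neighborhood_affinity(a, b) >= theta


# Toy network (hypothetical): degrees are a=3, b=2, c=2, d=1, average 2.0.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
seeds = select_seeds(adj)

# Two candidate subgraphs sharing two of their three proteins: NA = 4 / 9.
overlap = neighborhood_affinity({"a", "b", "c"}, {"b", "c", "d"})
```

With θ = 1.0, the default reported later in the paper, `should_merge` is true only for identical node sets, so merging then acts as de-duplication.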
The alias sampling method used in Algorithm 1 is shown in Algorithm 2. First, a table $A$ containing triples $(i, h, Pr_h)$ is generated from the probability distribution over the $l$ neighbors of the current subgraph to be expanded. Here, the probabilities are calculated from the node embedding similarities between each candidate neighbor node and the subgraph to be expanded. In the example of Fig 1 (B), $A$ contains three triples: $(p_2, \emptyset, 1)$, $(p_3, p_2, 0.5625)$, and $(p_4, p_2, 0.9375)$. Building $A$ takes O(l) time for each subgraph, and sampling a node to be added takes O(1) time using alias sampling.
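The table construction and constant-time draw can be illustrated with the sketch below. It uses Vose's formulation of the alias method, which is an assumption, since Algorithm 2 is only partly legible in this text; the function names are hypothetical. The input probabilities (0.5, 0.1875, 0.3125) for $p_2$, $p_3$, $p_4$ are not given in the paper: they are back-calculated here so that the resulting table matches the three triples of the Fig 1 (B) example.

```python
import random


def build_alias_table(probs):
    """Build the alias table in O(l) time for l outcome probabilities.

    Returns (prob, alias): a draw lands in slot i uniformly, keeps outcome i
    with probability prob[i], and otherwise returns alias[i].
    """
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [None] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        g = large.pop()
        prob[s] = scaled[s]
        alias[s] = g  # slot s is topped up with mass taken from g
        scaled[g] = scaled[g] + scaled[s] - 1.0
        (small if scaled[g] < 1.0 else large).append(g)
    for i in large + small:  # leftovers fill their own slot completely
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one outcome index in O(1) time."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# Indices 0, 1, 2 stand for the neighbors p2, p3, p4 of the example.
prob, alias = build_alias_table([0.5, 0.1875, 0.3125])
# prob  == [1.0, 0.5625, 0.9375]
# alias == [None, 0, 0], i.e. (p2, none, 1), (p3, p2, 0.5625), (p4, p2, 0.9375)

rng = random.Random(0)
picked = alias_sample(prob, alias, rng)
```

Because the table depends only on the current neighbor probabilities, it is rebuilt once per subgraph extension, and each subsequent draw costs one uniform slot choice and one comparison.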

III. RESULTS
In this section, we introduce the evaluation metrics and compare our method against six well-known complex detection approaches on six yeast PPI networks.

A. DATASETS
We carried out all the experiments on six real-world yeast PPI networks: four weighted datasets (Gavin [26], Collins [27], Krogan core and Krogan extended [28]) and two unweighted datasets (BioGRID [29] and DIP [30]). The details of these datasets are shown in Table 1. To compare the results with reference complexes, we constructed a gold standard complex set by selecting all the protein complexes that had at least three proteins from MIPS [31], CYC2008 [32], SGD [33], Aloy [34] and TAP06 [26], and removing duplicated complexes. Consequently, there were a total of 789 protein complexes in the reference set.

B. EVALUATION METRICS
To formally evaluate the performance of our method, we used four statistical measures that are widely used in the literature for complex detection tasks: precision, recall, F-score, and MMR (maximum matching ratio). The neighborhood affinity score defined in Equation 3 was used to evaluate the similarity between a predicted complex and a real complex. A predicted complex is considered to match a known complex if the neighborhood affinity score between them is not less than 0.25, as suggested by previous studies [4], [35]. Precision is the fraction of predicted complexes that are matched, and recall is the fraction of reference complexes that are matched. The F-score is the harmonic mean of precision and recall and shows the overall performance. The maximum matching ratio (MMR) [4] was also used to evaluate the performance of complex detection methods. This measure is based on a maximal one-to-one mapping between reference complexes and predicted complexes.
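The matching-based metrics above can be sketched as follows. This is a minimal illustration under stated assumptions: complexes are represented as Python sets, the function names are hypothetical, and a complex counts as matched when at least one complex on the other side reaches a neighborhood affinity of 0.25. MMR is not covered, because it additionally requires a maximum-weight one-to-one matching.

```python
def neighborhood_affinity(a, b):
    """NA(A, B) = |A & B|^2 / (|A| * |B|) for two non-empty protein sets."""
    common = len(a & b)
    return common * common / (len(a) * len(b))


def evaluate(predicted, reference, threshold=0.25):
    """Return (precision, recall, F-score) for lists of protein sets."""
    matched_pred = sum(
        1 for p in predicted
        if any(neighborhood_affinity(p, r) >= threshold for r in reference)
    )
    matched_ref = sum(
        1 for r in reference
        if any(neighborhood_affinity(p, r) >= threshold for p in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical): one of two predictions matches one of three references.
predicted = [{"a", "b", "c"}, {"x", "y", "z"}]
reference = [{"a", "b", "c", "d"}, {"m", "n", "o"}, {"e", "f", "g"}]
precision, recall, f_score = evaluate(predicted, reference)
```

In this toy case the first prediction overlaps the first reference complex with NA = 9/12 = 0.75, giving a precision of 1/2, a recall of 1/3, and an F-score of 0.4.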

C. PARAMETER SENSITIVITY
We investigated the performance of our proposed method with regard to two parameters: the embedding vector dimension d and the overlapping threshold θ. For all experiments, we set the parameters of Node2vec to their defaults [23]. Figure 2 reports the F-score and MMR of our method with regard to the dimension d on all datasets. A dimension value of zero means that a random sampling strategy is used to select the next node. We can see that the performance of our method is sensitive to the dimension parameter. However, with a proper dimension value, our method achieves remarkable performance: for example, on the Gavin dataset the highest F-score, obtained with d = 192, is 0.654 (with an MMR of 0.365), and on the DIP dataset the highest MMR, obtained with d = 256, is 0.365 (with an F-score of 0.584). As our method achieves relatively good performance on all datasets with a smaller dimension value (d = 32), we used 32 as the default dimension in the experiments. Figure 3 shows the results of our method with different overlapping thresholds. From the figure, we can see that both the F-score and MMR increase gradually with the overlapping threshold. This is because the highest overlapping score between two subgraphs is 1.0, and a higher merge cutoff means that two subgraphs are merged only if their overlapping score is larger than the cutoff threshold. Although merging highly overlapped predicted complexes could reduce the result size with little sacrifice in performance, the experimental results demonstrate that almost every predicted protein complex obtained by our method has a high possibility of being a unique potential true protein complex. In our experiments, we set the default overlapping threshold to 1.0 based on the F-score and MMR.

D. PERFORMANCE COMPARISON
We compared our method with six protein complex identification methods: CMC [14], MCL [11], RRW [13], ClusterONE [4], PEWCC [15] and ProRank+ [36]. MCL, CMC, RRW and ClusterONE are designed for weighted networks, so the unweighted networks were treated as weighted networks with every edge weight set to 1.0. We referred to the previous studies [4], [15], [36] and followed their recommended settings. Also, for a fair comparison, we pruned all clusters of size less than or equal to 2 from the final clustering results of all algorithms. Table 2 shows the comparison results on the six PPI networks. In the table, we compare the number of predicted complexes, the number of matched complexes, precision, recall, F-score, MMR, and composite score. The overall results, represented by the composite scores of F-score and MMR, demonstrate that our method clearly outperforms the other algorithms on all six networks. Here, we use composite scores, as in previous works [4], [37], to show the overall performance, because both the F-score and MMR have their own limitations. The F-score penalizes predicted complexes that do not match any of the reference complexes, but the gold standard sets of protein complexes are often incomplete [38]; therefore, optimizing for the F-score might prevent us from detecting novel complexes in a PPI dataset. MMR neglects the number of predicted complexes and sums the maximum matching ratio of every predicted complex regardless of whether it matches a known complex (that is, whether the neighborhood affinity score is above 0.25). We therefore use the composite score of the two to compare the overall performance of all the competing methods. Table 2 also shows detailed information regarding the other evaluation metrics. With respect to MMR, ClusterONE is the best on Krogan core and Krogan extended, but our method performs the best on the rest.
Moreover, on all six networks, our method predicts more matched gold standard protein complexes than the other methods, except that PEWCC predicts more matched complexes on BioGRID than our method. Furthermore, the performance of our method remains stable in terms of F-score across all six networks. Specifically, our method attains the best F-score on all the networks. Additionally, the number of complexes output by our method for BioGRID and DIP, which are the largest networks and are unweighted, is at least twice as large as that of the other methods, except that PEWCC output 2781 and 1230 protein complexes for BioGRID and DIP, respectively, and CMC predicted 1503 protein complexes for BioGRID. However, compared with our method (869 out of 1929), only small percentages of the complexes predicted from BioGRID by PEWCC (1013 out of 2781) and CMC (247 out of 1503) match known complexes. Our method consistently achieves the best performance on all the unweighted networks, suggesting that it is capable of detecting protein complexes from large-scale unweighted networks.
As shown in Table 2, ClusterONE performs better than the other six methods in terms of MMR. Note that, like our method, ClusterONE also uses a seed-and-extension approach to generate predicted protein complexes. The two seed-and-extension algorithms differ in how they extend a seed and how they quantify the likelihood that a cluster of proteins is a protein complex. ClusterONE uses greedy search to extend a seed and a cohesiveness score to describe the within-connectivity of a protein complex. By contrast, in our study, we developed a new conductance measure to quantify a protein complex and used the alias sampling strategy to extend seeds. According to the experimental results, although ClusterONE achieves a higher MMR score than our method on the Krogan core and Krogan extended networks, our method predicted far more protein complexes that matched the reference dataset than ClusterONE (precision of each method on Krogan core: 0.785 vs. 0.312; Krogan extended: 0.688 vs. 0.396). Moreover, our method outperforms ClusterONE on all the datasets according to the other metrics. We further compare our method with ClusterONE with respect to GO analysis in a later subsection.

E. COMPARISON OF DIFFERENT EMBEDDING METHODS
We used Node2vec as the default method to obtain the node embeddings of the network. In this section, we compare the performance of our method when different network representation methods are used to obtain the node embeddings. Besides Node2vec, we applied seven network representation methods: DeepWalk [21], SDNE [20], LINE [22], GraRep [39], GF (short for GraphFactorization) [40], HOPE (short for High-Order Proximity preserved Embedding) [41], and LAP (short for LaplacianEigenmaps) [42]. For a fair comparison, we set the dimension to 32 and used the default parameter settings for all the methods. The parameters of each method are shown in Table 3.
The performance of protein complex detection using the different methods is shown in Table 4. As we can see from the table, Node2vec outperforms the others on five out of six PPI networks; in addition, it achieves the best performance on all the networks in terms of MMR. All the results obtained using node embeddings are much better than those obtained by randomly selecting nodes, which are shown in Figure 2 with the dimension equal to 0. This comparison indicates the effectiveness of integrating network embedding into the seed-extension method to detect protein complexes. In addition, the results of the different network embedding methods did not vary much on most networks except BioGRID and Collins, which suggests that capturing the topological characteristics with a network embedding method helps to guide protein complex detection regardless of the method chosen. Also, the differences between the methods on different networks indicate that the embedding methods behave differently on different networks, and that selecting a proper embedding method could further improve the performance of protein complex detection.

F. COMPARISON ON GO ENRICHMENT ANALYSIS
To examine the statistical significance of the complexes predicted by each algorithm, we compared the p-values of the predicted complexes under GO biological process terms, with the enrichment analysis performed using BINGO [43]. Figure 4 shows the comparison of the statistical significance of the complexes detected by all methods. In Figure 4, the y-axis represents the negative log p-value computed from the lowest p-value of each predicted protein complex, while the x-axis is the list of all predicted complexes with at least one p-value, ordered by their lowest p-values, for each competing method. As a complex with significant biological relevance has a lower p-value, a higher negative log p-value in Figure 4 indicates a higher quality of the detected complex. From Figure 4, we can see that, for all yeast PPI networks, our method identifies more protein complexes with p-values than the other methods except PEWCC. In addition, our method outperforms all the competing methods with regard to GO enrichment analysis, because the curves of our method on the six PPI networks are consistently above the others. This further demonstrates that the new conductance as a quality measure can better depict complexes of biological significance and that our method provides an effective way to predict complexes through this new quality measure. Additionally, Table 5 lists, for each network, six complexes predicted by our method whose maximum neighborhood affinity scores with the reference complexes are equal to zero. The min p-value represents the minimum p-value of the matched GO terms, and the corresponding GO descriptions, as well as the number of enriched GO terms, are shown in Table 5. Besides the listed ones, the highest min p-value of all the predicted complexes on all PPI networks is 0.0030327, which is less than 0.05, suggesting that most of the predicted complexes are statistically significant in the biological sense.
From the above GO analysis results, although the unmatched predicted protein complexes have not yet been characterized as complexes, they have a high possibility to be complexes in the biological sense.
The comparison results demonstrate that our method achieves the best performance according to the evaluation metrics. Moreover, the GO analysis results show that our method predicted more biologically meaningful complexes.

IV. CONCLUSION
We propose a new approach for identifying protein complexes in protein-protein interaction networks. We found that our method outperforms six other state-of-the-art algorithms in identifying protein complexes. The experimental results show that our method can better preserve the diversity of connectivity patterns of nodes and the structures of the networks by using an alias sampling strategy based on node embedding similarities to select potential addable nodes, and can better capture the topological structure of a protein complex by using a new conductance as the complex quality measure. Furthermore, the GO enrichment analysis of the results of all competing algorithms shows that our method finds more biologically meaningful complexes, within which proteins tend to have similar functions. We hope our work may help bioinformatics researchers to explore more undiscovered protein complexes.
In this study, we focus on using topological features of PPI networks to identify protein complexes. However, because PPI networks may contain false positive and false negative links, using only topological information would probably limit the performance of our proposed method. In the future, we will explore how to integrate other biological resources, e.g., GO annotations, into the initial PPI networks to improve the reliability of PPI data and thereby enhance the performance of our proposed method.
XIAOXU WANG is currently pursuing the bachelor's degree with the Information Science and Technology College, Dalian Maritime University. He will receive the master's degree in software engineering in 2021. His research interests include bioinformatics and data mining.