Robust Hierarchical Overlapping Community Detection With Personalized PageRank

Community detection is a fundamental task in graph mining. Although most existing community detection methods are devoted to finding disjoint community structure, communities in many real-world networks often overlap with each other and are recursively organized in a hierarchical structure. Finding hierarchical overlapping community structure also has significant implications for many real-world applications. Some of the few existing attempts suffer from the problem that the obtained community structure is sensitive to network changes, as they rely heavily on one-hop node proximity to detect communities. To tackle this problem, we propose a robust hierarchical overlapping community detection method with Personalized PageRank (PPR), which is a prevalent metric for measuring node proximity globally. Specifically, motivated by the agglomerative hierarchical clustering method, we present an effective and efficient mechanism to merge small communities and form hierarchically organized overlapping communities. Experimental results on both synthetic and real-world networks corroborate the effectiveness and robustness of the proposed framework. In addition, we show how to use the detected community structure to perform various node proximity queries, such as the top-$k$ structural hole spanner query and the top-$k$ heterogeneous node query, which can help us gain more insight into the underlying network.


I. INTRODUCTION
Recent years have seen the spread of complex networks in a variety of high-impact domains, ranging from social media, bioinformatics, transportation, and e-commerce to online education. Community detection plays an essential role in understanding and probing the structures of these complex networks by revealing the hidden roles of nodes [1]. Specifically, it attempts to find a set of cohesive groups such that members within the same group interact more frequently than those outside the group. In this regard, finding communities among nodes provides insights on the formation of the network. Meanwhile, the resultant community structure eases the visualization and analysis of the underlying networks and could advance many real-world applications, such as relational learning [2], viral marketing [3], social behavior analysis [4] and disease intervention [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Fatih Emre Boran.
Existing community detection algorithms can be broadly categorized into disjoint community detection [6] and overlapping community detection [7]. Disjoint community detection methods employ different measures and objectives to partition the whole network such that each node belongs to only one community; typical methods in this category employ standard techniques like spectral clustering [8], modularity maximization [9], and random walks [10]. In contrast, overlapping community detection assumes that each node in the network is associated with one or more communities. In fact, the overlapping phenomenon is widely observed in social networks, where online users may join multiple social groups, and in biological networks, where each gene may have multiple functionalities [11]. Therefore, overlapping community detection has received increasing research attention in recent years [12]-[16].
The methods discussed above overwhelmingly attempt to model and detect communities at a single scale. Recent studies indicate that many real-world networks are hierarchically organized, such that communities are recursively divided into smaller sub-communities [17]-[20]. For example, employees in a large company often exhibit a hierarchical community structure: colleagues in the same group tend to form large communities, which may be further divided into smaller sub-communities consisting of low-level employees [21]. As communities in real-world networks often overlap and are recursively grouped into a hierarchical structure, finding both the overlapping and the hierarchical community structure in networks has attracted a surge of research interest [22]-[25].
Lancichinetti et al. [24] made one of the first attempts to find hierarchical overlapping communities in complex networks based on the local optimization of a fitness function. However, one critical problem of their method is that it cannot guarantee a desired community structure due to the randomness of seed nodes. Later on, researchers tackled the same problem by using clique-based, link-similarity-based, and label-propagation-based methods [22], [23], [25]. Despite their effectiveness, these methods are vulnerable to perturbations of the underlying network structure, such that a slight change of the network topology may result in totally different node community memberships. For example, the clique-based methods [25] are suitable for networks with dense subgraphs. However, small perturbations (e.g., the addition or deletion of edges) may convert a clique into a non-clique and vice versa, which may significantly affect the resultant community structure. In addition, link-similarity-based methods [22] and label-propagation-based methods [23], [26] are also very sensitive to network changes. We illustrate this with a toy example that focuses on overlapping communities, since the overlapping community is often regarded as a building block for the hierarchical community structure in hierarchical overlapping community detection.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In Fig. 1, we change the network from A to C by deleting one edge at a time. However, two dense subgraphs of A, i.e., D(1) and D(2), never change over the whole process. A label-propagation-based method [26] and a link-similarity-based method [22] are applied to these three networks (from A to C). We denote these two methods as BMLPA and LC, respectively. The changes in the communities detected by BMLPA are shown in Fig. 2. It can be observed that even though the network structure changes only slightly, the detected community structures in the three phases differ sharply. We make similar observations on the community membership changes of the LC method in Fig. 3, where the community {3,11} is considered a trivial community in [22] because it contains fewer than 3 nodes.
The major reason that the aforementioned community detection methods are sensitive to network structure changes is that they rely heavily on one-hop node proximity (i.e., direct neighbors) to detect communities, either by measuring node proximity or by performing label propagation. As one-hop methods can only capture direct node interactions, they are often regarded as sensitive to network perturbations [27]. Therefore, the key part of a robust hierarchical overlapping community detection method is to find a metric that can capture node proximity globally. Fortunately, Personalized PageRank (PPR) is widely used as a metric for this kind of node proximity and has been shown to be effective theoretically as well as empirically [28]-[30]. Specifically, PPR not only measures direct interactions (i.e., one-hop distance) among nodes in the network but also captures indirect interactions to obtain long-distance node proximity. It can be viewed as an ego-centric equivalent of PageRank [31].
Concretely, for a query node q, its PPR vector measures its proximity to all other nodes in the network via short random walks from q. In other words, the PPR vector of q denotes its personalized view about the network structure. In this paper, we define a PPR-based distance metric for all pairs of nodes in the network such that, given two nodes u and v, it measures to what extent these two nodes agree with each other in terms of their proximity to other nodes in the network. As the distance metric not only depends on the two involved nodes but is also related to all other nodes in the network, it is robust to small network perturbations and could ensure the stability of the detected community structure. The community structures detected by our proposed method are consistent even with small perturbations from Phase A to C (as shown in Fig. 4).
It should be noted that PPR is a successful building block for many graph mining tasks, such as social recommendation [28] and link prediction [29]. Employing PPR for community detection, especially for hierarchical overlapping community detection, is still a fertile area that needs further investigation. As hierarchical overlapping communities are discovered by the presented PPR-based distance metric, one byproduct of our method is that we can do further analysis based on the detected community structure. For example, in an academic collaboration network, a researcher from the field of data mining may be interested in extending their social circle to include scholars from other communities. This type of proximity query can be easily tackled by our proposed community detection method, as we focus on the most similar researchers that connect multiple communities. These nodes are often referred to as structural hole spanners [32]-[35]. We illustrate this through a toy example in Fig. 5, where the network is divided into four different communities and node 9 belongs to multiple communities. Using the conventional PPR node proximity measure, the top four most similar nodes for node 0 are node 5, node 6, node 4 and node 2. However, with the detected communities in hand, node 6 stands out because it is a structural hole spanner, which plays an essential role in information diffusion across multiple communities. Node 0 can therefore pay more attention to node 6 if it seeks more diversified information. Also, we infer that the most similar node to node 0 in community D is node 19. This form of proximity query can help find potential collaborators across different communities in academic and business collaborations.
In this paper, we study the problem of hierarchical overlapping community detection with Personalized PageRank and make the following contributions:
• We introduce a distance metric for community detection with Personalized PageRank.
• We propose a novel robust hierarchical overlapping community detection framework that is not sensitive to network perturbations.
• We provide theoretical analysis of the proposed hierarchical overlapping community detection method.
• We validate the effectiveness of the proposed method on both synthetic and real-world networks.
• We perform further analysis to show that the detected community structure can be leveraged to perform various query tasks.
The rest of the paper is structured as follows: Section 2 introduces the preliminary work of PPR and then develops a PPR-based node proximity metric for community detection. Section 3 presents the proposed robust hierarchical overlapping community detection method in detail. Section 4 introduces a novel query: community-aware similar node query. Section 5 presents the experimental evaluation of the proposed method on synthetic and real-world networks. Section 6 introduces the related work. Section 7 concludes the paper and discusses future work.

II. A ROBUST DISTANCE METRIC BASED ON PERSONALIZED PAGERANK
In this section, we first present some preliminaries about Personalized PageRank (PPR) which is a popular node proximity measure. Next, we present a robust distance metric based on PPR for community detection. We also give a fundamental mathematical analysis of the presented distance metric, which has a strong connection with spectral theory.

A. PRELIMINARY
We first summarize the notations used in this paper. Following standard notations, we use bold uppercase characters for matrices (e.g., $\mathbf{A}$), bold lowercase characters for vectors (e.g., $\mathbf{b}$), normal characters for scalars (e.g., $P$ and $c$), and calligraphic fonts for sets (e.g., $\mathcal{F}$). We represent the $i$-th row of $\mathbf{A}$ as $\mathbf{A}_{i*}$, the $j$-th column of $\mathbf{A}$ as $\mathbf{A}_{*j}$, and the $(i, j)$-th entry as $A_{ij}$. We denote the transpose of matrix $\mathbf{A}$ as $\mathbf{A}^T$ and, if $\mathbf{A}$ is a square matrix, its trace as $\mathrm{tr}(\mathbf{A})$. The $n$-th power of a matrix $\mathbf{A}$ is written as $\mathbf{A}^n$. The $\ell_2$-norm of the vector $\mathbf{a}$ is denoted as $\|\mathbf{a}\|_2$ and the Frobenius norm of the matrix $\mathbf{A}$ is denoted as $\|\mathbf{A}\|_F$. Given an undirected network $G = (\mathcal{V}, \mathcal{E})$, we use $\mathcal{V}$ to denote the set of nodes and $\mathcal{E}$ to represent the set of edges. The total numbers of nodes and edges in the network are $n = |\mathcal{V}|$ and $m = |\mathcal{E}|$, respectively. For any node $v \in \mathcal{V}$, we use $d(v)$ to denote the degree of $v$.
Both PageRank and Personalized PageRank (PPR) can be used to measure the importance of nodes in the network. However, the two measures differ: PageRank considers the global importance of nodes, while PPR emphasizes an individual's preference and gives a personalized view of node importance from a query node's perspective [36]. In particular, given a query node $q$ and a target node $v$, the PPR score $r_{qv}$ measures the proximity between $q$ and $v$ from $q$'s point of view. The proximity score between $q$ and $v$ is defined in Eq. (1) over the paths from $q$ to $v$, where $\tau$ is one particular path from $q$ to $v$, $l(\tau)$ is the length of $\tau$, and each path contributes according to the probability that a walk starting from node $q$ traverses $\tau$ to reach node $v$. In this paper, we follow the suggestion of [37] and set the parameter $c = 0.9$ empirically. It can be easily observed from Eq. (1) that if we consider all possible paths from node $q$ to node $v$, the overall computational cost of obtaining $r_{qv}$ is very high. Hence, in practice, we specify a threshold $L$ ($L = 10$ in this paper) and only use paths whose length is less than $L$ to estimate the proximity score, which gives the truncated score $r^L_{qv}$ of Eq. (2). In this work, we use $r^L_{qv}$ to obtain all proximity scores between node $q$ and the other nodes in the network.
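The truncation to short paths can be illustrated with a small sketch. The closed form used here, $r^L_{q*} = (1 - c) \sum_{t=0}^{L-1} c^t \, \mathbf{e}_q^T \mathbf{P}^t$ with $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$, is an assumption based on a common PPR formulation and is not confirmed by the text; the function name `truncated_ppr` and the toy graph are hypothetical.

```python
def truncated_ppr(adj, q, c=0.9, L=10):
    """Approximate PPR scores from q using walks of length < L.

    Assumed form (not confirmed by the paper):
        r_q = (1 - c) * sum_{t=0}^{L-1} c**t * (e_q P**t),  P = D^{-1} A.
    `adj` maps each node to a list of neighbors (no isolated nodes).
    """
    walk = {q: 1.0}                      # walker distribution after t steps
    scores = {v: 0.0 for v in adj}
    weight = 1.0 - c                     # (1 - c) * c**t, starting at t = 0
    for _ in range(L):
        for v, p in walk.items():
            scores[v] += weight * p
        nxt = {}
        for v, p in walk.items():        # one uniform random-walk step
            share = p / len(adj[v])
            for u in adj[v]:
                nxt[u] = nxt.get(u, 0.0) + share
        walk = nxt
        weight *= c
    return scores


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
r0 = truncated_ppr(adj, q=0)
```

Under this assumed form the scores of one query node sum to $1 - c^L$, which gives a quick sanity check on an implementation.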

B. THE PROPOSED DISTANCE METRIC
Personalized PageRank performs random walks on the network and captures node proximity in a way that is robust to network changes. However, the proximity scores are not necessarily symmetric, i.e., $r_{ij}$ may not equal $r_{ji}$, which prevents PPR from serving directly as an effective distance metric for community detection. From another perspective, as suggested by social science theories such as the principle of homophily and social influence [38], [39], nodes in the same community often exhibit similar characteristics, and nodes with similar characteristics are more likely to reside in the same community. In other words, if two individuals agree with each other in terms of their personalized views about the network, they are more likely to be similar to each other and in the same community, and vice versa. Based on this, we present a new distance metric $d_{ij}$ between a pair of nodes $i$ and $j$ (Eq. (3)), defined on the vectors $\mathbf{r}^L_{i*} = [r^L_{i1}, \ldots, r^L_{in}]^T$ and $\mathbf{r}^L_{j*} = [r^L_{j1}, \ldots, r^L_{jn}]^T$, where $\mathbf{D}$ is a diagonal matrix with the degree $d(v)$ of each node on the diagonal. The presented distance metric is clearly symmetric, and it reflects to what extent nodes $i$ and $j$ agree with each other on the network structure from their personalized views. Intuitively, the smaller the value of $d_{ij}$, the more likely these two nodes are in the same community.
Now suppose there is a community $C$. The opinion of the community $C$ on node $j$ is defined as

$$p_{Cj} = \frac{1}{|C|} \sum_{i \in C} r^L_{ij}, \tag{4}$$

that is, the opinion of $C$ on node $j$ is the average of the views of all members of $C$ on node $j$. We further define the distance between two communities. Suppose there are two communities $C_1$ and $C_2$; the distance $d_{C_1 C_2}$ between them is defined in Eq. (5) in terms of their opinions. The definition states that if the two communities' opinions about all the nodes are similar, the distance between the two communities should be small. We also observe that the distance metric $d_{ij}$ between two nodes $i$ and $j$ is a special case of the distance between two communities, $d_{\{i\}\{j\}}$, where $\{i\}$ and $\{j\}$ are two communities with a single node.
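As a rough illustration of the community opinion and the distance built on it, the sketch below averages the members' proximity vectors and compares two communities. It assumes a degree-normalized Euclidean form, $d_{C_1 C_2} = \sqrt{\sum_k (p_{C_1 k} - p_{C_2 k})^2 / d(k)}$, which is suggested by the role of $\mathbf{D}$ but not confirmed by the text; the function names and the proximity values are hypothetical.

```python
import math


def community_opinion(ppr, community):
    """Average the proximity vectors of the members of a community."""
    members = list(community)
    return {j: sum(ppr[i][j] for i in members) / len(members)
            for j in ppr[members[0]]}


def community_distance(ppr, degree, c1, c2):
    """Distance between two communities.

    Assumption: degree-normalized Euclidean distance between opinions,
    sqrt(sum_k (p1[k] - p2[k])**2 / d(k)).
    """
    p1 = community_opinion(ppr, c1)
    p2 = community_opinion(ppr, c2)
    return math.sqrt(sum((p1[k] - p2[k]) ** 2 / degree[k] for k in degree))


# Hypothetical proximity scores on the path 0 - 1 - 2 (not from the paper).
ppr = {0: {0: 0.5, 1: 0.3, 2: 0.2},
       1: {0: 0.3, 1: 0.5, 2: 0.2},
       2: {0: 0.1, 1: 0.1, 2: 0.8}}
degree = {0: 1, 1: 2, 2: 1}

d01 = community_distance(ppr, degree, {0}, {1})   # node-level special case
d02 = community_distance(ppr, degree, {0}, {2})
d_c = community_distance(ppr, degree, {0, 1}, {2})
```

Passing single-node communities recovers the node-to-node distance, which mirrors the special case $d_{ij} = d_{\{i\}\{j\}}$ noted above.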

C. THEORETICAL ANALYSIS OF THE PROPOSED PPR-BASED DISTANCE METRIC
We assume that the matrix $\mathbf{A}$ is the adjacency matrix of the network $G$ such that $A_{ij} = 1$ if there is an edge between node $i$ and node $j$, and $A_{ij} = 0$ otherwise. The transition matrix $\mathbf{P}$ of random walks on $G$ is then defined as $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$. The following theorem relates the presented distance metric to the spectral properties of the transition matrix $\mathbf{P}$.
Theorem 1: The presented PPR-based distance metric $d_{ij}$ between a pair of nodes $i$ and $j$ is related to the spectral properties of the matrix $\mathbf{P}$, where $v_\alpha$ and $\lambda_\alpha$ are the right eigenvectors and eigenvalues of $\mathbf{P}$.
Proof 1: Before proving Theorem 1, we introduce the following lemma.
Based on Lemma 1, the transition matrix $\mathbf{P}$ can be reformulated by a spectral decomposition, so that the transition probability vector $\mathbf{P}^t_{i*}$ can be written in terms of the eigenvalues and eigenvectors. Since $P^t_{ij}$ denotes the probability of traversing all paths $\tau$ from node $i$ to node $j$ with $l(\tau) = t$, the node proximity measure in Eq. (2) can be reformulated accordingly, and so can the distance between node $i$ and node $j$ in Eq. (3). This completes the proof, because the Pythagorean theorem applies to the orthonormal family of vectors $(s_\alpha)_{1 \le \alpha \le n}$. Moreover, as the vector $v_1$ is constant, the case $\alpha = 1$ is removed from the summation.
The theorem states the connection between the presented distance metric and the spectral properties of the transition matrix $\mathbf{P}$. Spectral clustering methods, which build on such properties, have been successfully used in many applications [40]. In contrast to our method, however, spectral clustering methods fail to detect hierarchical overlapping communities.
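The role of the orthonormal family in this argument can be spelled out in a short derivation. It is only a sketch under two assumptions that the text does not confirm: that the truncated proximity is a weighted sum of powers of the transition matrix, $\mathbf{r}^L_{i*} = \sum_t w_t (\mathbf{P}^t_{i*})^T$ with hypothetical weights $w_t$ (for instance $w_t = (1-c)c^t$), and that $d_{ij}$ is the degree-normalized 2-norm of the difference of two such vectors.

```latex
% Assumptions: s_\alpha are orthonormal eigenvectors of D^{-1/2} A D^{-1/2}
% with eigenvalues \lambda_\alpha; v_\alpha = D^{-1/2} s_\alpha and
% u_\alpha = D^{1/2} s_\alpha are right and left eigenvectors of P = D^{-1} A.
\begin{align*}
\mathbf{P}^t &= \sum_{\alpha=1}^{n} \lambda_\alpha^t \, v_\alpha u_\alpha^T
  \quad\Longrightarrow\quad
  \mathbf{D}^{-1/2} (\mathbf{P}^t_{i*})^T
  = \sum_{\alpha=1}^{n} \lambda_\alpha^t \, v_\alpha(i) \, s_\alpha, \\
\mathbf{D}^{-1/2} \mathbf{r}^L_{i*}
  &= \sum_{\alpha=1}^{n} \Big( \sum_{t} w_t \lambda_\alpha^t \Big) v_\alpha(i) \, s_\alpha, \\
d_{ij}^2
  &= \big\| \mathbf{D}^{-1/2} \mathbf{r}^L_{i*} - \mathbf{D}^{-1/2} \mathbf{r}^L_{j*} \big\|_2^2
  = \sum_{\alpha=2}^{n} \Big( \sum_{t} w_t \lambda_\alpha^t \Big)^2
    \big( v_\alpha(i) - v_\alpha(j) \big)^2 .
\end{align*}
```

The last equality uses the orthonormality of the $s_\alpha$ (the Pythagorean step), and the term $\alpha = 1$ drops out because $v_1$ is constant, so $v_1(i) - v_1(j) = 0$.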

III. THE PROPOSED HIERARCHICAL OVERLAPPING COMMUNITY DETECTION
In this section, we show in detail how to build a robust hierarchical overlapping community detection framework with the presented Personalized PageRank based distance metric.
In the following, we use $\mathcal{C}_i$ to denote a set of communities, and $C_i \in \mathcal{C}_i$ represents a specific community in $\mathcal{C}_i$. Our proposed method is based on the classical agglomerative method [41], which enables the detection of hierarchically organized community structure. Specifically, it starts from an initial state in which each single node represents a community and then repeatedly joins small communities in pairs according to a certain criterion $Q$. Thus, the key problem is how to define the criterion $Q$ that determines whether two communities should be merged. We first introduce an effective criterion for merging communities with the PPR-based distance metric and then introduce the proposed community detection framework.

A. CRITERION TO MERGE COMMUNITIES
We use Ward's method [42] to merge different communities. Specifically, at the $k$-th step, we choose the merging, i.e., $\mathcal{C}_k$, that has the smallest objective function value, where $O_i$ is the number of communities node $i$ belongs to and $\mathcal{C}_k$ is the community set after merging in the $k$-th step. This optimization problem is NP-hard and difficult to solve exactly. Therefore, greedy approaches are often used in practice [43], [44].
In particular, we adopt the following strategy. First, two small communities are considered for merging only if the community obtained by merging them is connected. Second, for each pair of communities that can be merged, we compute the variation $\sigma(C_1, C_2)$ generated by merging $C_1$ and $C_2$ into a new community $C_3 = C_1 \cup C_2$ (Eq. (7)). Finally, we merge the two communities whose merging results in the lowest value of $\sigma$.

B. THE PROPOSED COMMUNITY DETECTION FRAMEWORK
Our proposed hierarchical overlapping community detection framework is based on the agglomerative hierarchical clustering method [41]. However, the classical agglomerative method cannot handle overlaps between two communities. To tackle this problem, we introduce a novel concept called ''virtual community''. A ''virtual community'' is a single node which belongs to at least one community whose size is larger than 1 before merging. It is formally defined as follows.
Definition 1: For any $v \in \mathcal{V}$, the node $v$ is a ''virtual community'' if there exists a community $C$ such that $v \in C$ and $|C| > 1$.
Next, we show that the ''virtual community'' plays an essential role in the proposed hierarchical overlapping community detection method. The workflow of our proposed algorithm is stated as follows, where we use $\{v_i, v_j, \ldots\}$ to denote a community $C$ and use $\{C_i, C_j, \ldots\}$ to denote a set of communities $\mathcal{C}_k$.
1) Each node is considered as a community.
3) During the merging step, there are two options:
a) Merge two communities $C_i$ and $C_j$ into a new community if the merge of $C_i$ and $C_j$ results in the lowest value of $\sigma$;
b) Merge community $C_i$ and a ''virtual community'' if the merge of $C_i$ and $\{v_j\}$ results in the lowest value of $\sigma$.
5) Update the $\sigma$ value between the newly emerged community and its adjacent communities.
6) The algorithm repeats steps 3 and 4 until all the nodes are in the same big community, i.e., $\mathcal{C}_N = \{\mathcal{V}\}$.
In the above algorithm, step 3(b) guarantees that the proposed algorithm can detect overlapping communities, because the ''virtual community'' $v_i$ may belong to several communities simultaneously. In a special case, after executing step 3(b), the new community set …
A simple illustration showing how the proposed hierarchical overlapping community detection framework works is presented in Fig. 6.
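The effect of merging with a ''virtual community'' can be sketched on a cover represented as a list of node sets. The helper names `virtual_communities` and `merge_with_virtual` are hypothetical, and the selection of the pair with the lowest $\sigma$ is deliberately left out; the sketch only shows why the merged node ends up in more than one community.

```python
def virtual_communities(cover):
    """Nodes that already belong to some community of size > 1."""
    return {v for community in cover if len(community) > 1 for v in community}


def merge_with_virtual(cover, i, v):
    """Merge community i with the virtual community {v}.

    The node v is added to community i while it stays in the communities
    it already belongs to, which is what creates the overlap.
    """
    if v not in virtual_communities(cover):
        raise ValueError("node is not a virtual community")
    new_cover = [set(c) for c in cover]   # leave the input cover untouched
    new_cover[i].add(v)
    return new_cover


cover = [{0, 1, 2}, {3, 4}]
overlapping = merge_with_virtual(cover, 1, 2)   # node 2 joins {3, 4} as well
```

After the call, node 2 belongs to both communities, whereas merging the two communities as a whole would have produced a single disjoint community.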
In the presented community detection framework, there are two key problems we need to solve: (1) How to calculate the σ value efficiently which can help us determine which two different communities will be merged; and (2) How to assess if the obtained community set C i in step i is optimal. The following two subsections will focus on addressing these problems.

C. CALCULATING σ FOR MERGING COMMUNITIES
Using Eq. (7) to compute $\sigma$ often leads to heavy computational overhead, because we need to update not only the $\sigma$ value between the newly formed community and its adjacent communities, but also the $\sigma$ values between the communities whose memberships $O_i$ have changed and their adjacent communities. On the other hand, since most nodes belong to only one community in the network, we can simplify the formulation of $\sigma$. When $C_1 \cap C_2 = \emptyset$, $\sigma$ can be efficiently computed by the following theorem.
Theorem 2: The variation $\sigma$ of merging two communities $C_1$ and $C_2$ is directly related to the distance $d_{C_1 C_2}$ between these two communities (Eq. (9)).
Proof 2: Starting from Eq. (4), we obtain Eq. (10) and Eq. (11). Combining Eq. (10) and Eq. (11), and applying the same argument to the symmetric term, yields the stated relation, which completes the proof.
Theorem 2 discusses the case when $C_1 \cap C_2 = \emptyset$. On the other hand, when $C_1 \cap C_2 \neq \emptyset$, we have the following theorem.
Theorem 3: When two communities overlap, i.e., $C_1 \cap C_2 \neq \emptyset$, the variation $\sigma$ after the merging of $C_1$ and $C_2$ is given by a corresponding expression that accounts for the nodes $i \in C_1 \cap C_2$.
Proof 3: The proof starts from an observation on the overlapping nodes and then applies Eq. (10) to each of the resulting terms, which completes the proof.
Moreover, $\sigma(C_1, C_2)$ can be calculated easily when $p_{C_1 j}$ and $p_{C_2 j}$ are given, and $p_{C_3 j}$ can also be computed efficiently from $p_{C_1 j}$ and $p_{C_2 j}$. When $C_1 \subseteq C_2$ (or $C_2 \subseteq C_1$), it is easy to verify that $\sigma(C_1, C_2)$ is zero. On the other hand, when $C_1 \cap C_2 = \emptyset$ and $C \cap (C_1 \cup C_2) = \emptyset$, we can use the Lance-Williams-Jambu formula [45] to obtain $\sigma$.
Theorem 4 (Lance-Williams-Jambu Formula [45]): If $C_1$ and $C_2$ are merged into $C_3 = C_1 \cup C_2$, then Eq. (13) holds for any other community $C$.
With the above theorems, the variation value $\sigma$ used for community merging can be obtained more efficiently.
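The Lance-Williams-Jambu update can be checked numerically for Ward's criterion. The sketch below uses plain Euclidean points as stand-ins for the degree-normalized community opinions and omits any constant factor in $\sigma$; both are simplifying assumptions, and the function names are hypothetical. It compares the variation computed directly on the merged cluster with the value obtained from the update formula for disjoint clusters.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ward_variation(a, b):
    """Ward's merge cost for disjoint clusters a and b:
    |a||b| / (|a| + |b|) * squared distance between centroids."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def lance_williams(s1c, s2c, s12, n1, n2, nc):
    """Update sigma(C1 u C2, C) from the three pairwise values."""
    return ((n1 + nc) * s1c + (n2 + nc) * s2c - nc * s12) / (n1 + n2 + nc)


# Three pairwise disjoint toy clusters in the plane.
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[4.0, 1.0]]
c = [[0.0, 5.0], [2.0, 6.0], [1.0, 7.0]]

direct = ward_variation(c1 + c2, c)
updated = lance_williams(ward_variation(c1, c), ward_variation(c2, c),
                         ward_variation(c1, c2), len(c1), len(c2), len(c))
```

The two values agree, so after a merge the new $\sigma$ values can be obtained from quantities that are already stored, without recomputing any centroid.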

D. EXTENDED MODULARITY TO DETERMINE THE OPTIMAL COMMUNITY STRUCTURE
In this section, we will discuss the second research question on evaluating whether the obtained community structure C i is optimal or not.
In conventional community detection problems, modularity [9] is a de facto standard to gauge the quality of the modules obtained from community detection algorithms [46]:

$$Q = \frac{1}{2m} \sum_{C \in \mathcal{C}} \sum_{i, j \in C} \left[ A_{ij} - \frac{d(i)\, d(j)}{2m} \right], \tag{14}$$

where $m$ is the number of edges in the network. Despite its effectiveness in measuring an obtained community structure, modularity is only applicable to disjoint communities. To adapt the modularity measure to overlapping communities, Shen et al. [25] proposed an extended version of the modularity measure:

$$Q_e = \frac{1}{2m} \sum_{C \in \mathcal{C}} \sum_{i, j \in C} \frac{1}{O_i O_j} \left[ A_{ij} - \frac{d(i)\, d(j)}{2m} \right], \tag{15}$$

where $O_i$ and $O_j$ are the numbers of communities that nodes $i$ and $j$ belong to, respectively. When each node belongs to exactly one community, Eq. (15) reduces to Eq. (14). In this paper, we use the extended modularity measure to evaluate the quality of the obtained community structure. Specifically, each time a new set of communities $\mathcal{C}_k$ is obtained (by merging two small communities), the corresponding modularity is calculated. Finally, we choose the set of communities $\mathcal{C}_i$ with the largest extended modularity value as the optimal community structure. The whole process of the proposed robust hierarchical overlapping community detection is illustrated in Algorithm 1.
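A small sketch of the extended modularity is given below. It assumes the commonly cited form of the measure of Shen et al., $Q_e = \frac{1}{2m} \sum_{C} \sum_{i,j \in C} \frac{1}{O_i O_j} \big[ A_{ij} - \frac{d(i)d(j)}{2m} \big]$; the function name and the toy graph (two triangles joined by one edge) are hypothetical.

```python
def extended_modularity(adj, cover):
    """Extended modularity of a cover (list of node sets).

    `adj` maps each node to the set of its neighbors in an undirected,
    unweighted graph. For a disjoint cover this equals plain modularity.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2      # number of edges
    membership = {v: 0 for v in adj}                      # O_i per node
    for community in cover:
        for v in community:
            membership[v] += 1
    total = 0.0
    for community in cover:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_disjoint = extended_modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_overlap = extended_modularity(adj, [{0, 1, 2, 3}, {3, 4, 5}])
q_single = extended_modularity(adj, [set(adj)])
```

In an agglomerative run, this value would be recomputed after each merge and the cover with the largest value kept as the output.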

Algorithm 1
Input: Network $G = (\mathcal{V}, \mathcal{E})$;
Output: The best cover $\mathcal{C}$; // If needed, we can output each node's $\mathbf{r}_{q*}$ for community-aware node proximity query;
for each $q \in \mathcal{V}$ do
    calculate $\mathbf{r}_{q*}$ using Eq. (2);
end for
for each edge $(u, v) \in \mathcal{E}$ do
    calculate $\sigma(\{u\}, \{v\})$ using Eq. (9) and Eq. (5), and then insert $\sigma(\{u\}, \{v\})$ into a min heap $H$;
end for
$Q_e = 0$ and $\mathcal{C} = \emptyset$;
while $H \neq \emptyset$ and $|C_{i+j}| < |\mathcal{V}|$ do
    $\sigma(i, j) = H$.front() and merge communities: $C_{i+j} = C_i \cup C_j$;
    update $\sigma$ between $C_{i+j}$ and its adjacent communities (including virtual communities) using the strategies introduced in subsection 3.3, then insert $\sigma$ into $H$;
    calculate the extended modularity $Q_e^{i+j}$ of the current community structure $\mathcal{C}_{i+j}$;
    if $Q_e < Q_e^{i+j}$ then
        …

First, the PPR vectors of all nodes are computed in $O(nmL)$. Then, at each merging step, the main computational cost lies in updating the $\sigma$ values between the newly emerged community and its adjacent communities. The algorithm updates $\sigma$ using the distance $d$. Therefore, the number of distances computed is the dominating factor of the algorithm's complexity, where each $d$ is calculated in $O(n)$ (the details are described in subsections 2.2 and 3.3). Let $H$ denote the height of the dendrogram. For each height $1 \le h \le H$, the number of neighboring communities is less than $m$, because each edge defines one neighborhood relation. Therefore, the complexity of the algorithm is $O(mn(H + L))$.

IV. COMMUNITY-AWARE SIMILAR NODE QUERY
$$\sigma(C_3, C) = \frac{(|C_1| + |C|)\,\sigma(C_1, C) + (|C_2| + |C|)\,\sigma(C_2, C) - |C|\,\sigma(C_1, C_2)}{|C_1| + |C_2| + |C|} \tag{13}$$

The proposed hierarchical overlapping community detection method leverages the PPR-based distance metric to find optimal communities. Thus, one byproduct of the proposed method is that we can leverage the obtained optimal community structure to further analyze the underlying network. In many cases, nodes connecting multiple communities play an essential role in information diffusion and help nodes obtain more diversified information. These nodes are often referred to as structural hole spanners in network science [34], [35], [47], [48]. Given a node $q$, finding the top-$k$ similar structural hole spanners has significant implications, such as helping researchers and businesspeople extend their social circles. Meanwhile, finding the top-$k$ similar nodes from different communities is also important in some cases. For example, such queries can help find potential collaborators across different areas in academic collaboration networks. Therefore, in this section we propose two novel community-aware queries based on PPR: finding the top-$k$ similar structural hole spanners of $q$, and finding the top-$k$ similar heterogeneous nodes (i.e., nodes from a different community than $q$). We believe these two kinds of queries can help us discover more useful information in certain cases.
In this paper, we use the definition in [48] to identify structural hole spanners.
Definition 2 (Structural Hole Spanner): For any node $v_i \in C_p$, if it has a neighboring node $v_j \in C_q$ ($p \neq q$), then node $v_i$ is a structural hole spanner.
According to the above definition, nodes that connect multiple communities are regarded as structural hole spanners. The query for the top-$k$ similar structural hole spanners of a node $q$ is then defined as follows.
Definition 3: Given a query node $q$ and a number $k$, the top-$k$ similar structural hole spanners of $q$ are $T_k(q) = \{t_1, \ldots, t_k\}$, where each $t_i$ ($1 \le i \le k$) is a structural hole spanner and $r^L_{q t_i} > r^L_{qp}$ for any structural hole spanner $p \notin T_k$.
Next, we define the query for the top-$k$ similar heterogeneous nodes (i.e., nodes from a different community than $q$).
Definition 4: Given a query node $q$ ($q \in C_j$) and a number $k$, the top-$k$ similar heterogeneous nodes of $q$ are $T_k(q) = \{t_1, \ldots, t_k\}$, where $t_i \notin C_j$ ($1 \le i \le k$) and $r^L_{q t_i} > r^L_{qp}$ ($p$, $q$, and $t_i$ are not structural hole spanners; $p \notin C_j$ and $p \notin T_k$).
Unlike the top-$k$ similar structural hole spanner query, the top-$k$ similar heterogeneous node query finds the $k$ nodes most similar to $q$ that are not in the community of $q$. From Eq. (2), we can conclude that the shorter the distance between $q$ and $p$, the larger the value $r(q, p)$ becomes. The top-$k$ similar structural hole spanner query therefore proceeds as follows. Starting at the query node $q$, we perform a breadth-first search. When we visit a node $n_i$, we check whether $n_i$ is a structural hole spanner. If it is, and there exists $t_i \in T_k$ such that $r(q, t_i) \le r(q, t_j)$ for all $t_j \in T_k$ and $r(q, t_i) < r(q, n_i)$, we replace $t_i$ in $T_k$ by node $n_i$. We continue the search until no node satisfies the above conditions.
Likewise, we can apply the same mechanism to query the top-k similar heterogeneous nodes. Here we omit the detailed process.
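The two queries can be illustrated with a direct ranking over precomputed proximity scores. This sketch replaces the breadth-first search with pruning by a plain top-$k$ selection, so it shows the result of the queries rather than the search procedure; the function names, the toy graph, and the proximity values are hypothetical, and for overlapping covers a spanner is interpreted here as a node with a neighbor in a community that the node itself does not belong to.

```python
import heapq


def memberships(adj, cover):
    """Map each node to the set of indices of the communities it is in."""
    member_of = {v: set() for v in adj}
    for idx, community in enumerate(cover):
        for v in community:
            member_of[v].add(idx)
    return member_of


def structural_hole_spanners(adj, cover):
    """Nodes with a neighbor in a community they do not belong to."""
    member_of = memberships(adj, cover)
    return {v for v in adj if any(member_of[u] - member_of[v] for u in adj[v])}


def top_k_spanners(ppr_q, adj, cover, q, k):
    """Top-k structural hole spanners ranked by proximity to q."""
    candidates = structural_hole_spanners(adj, cover) - {q}
    return heapq.nlargest(k, candidates, key=lambda v: ppr_q[v])


def top_k_heterogeneous(ppr_q, adj, cover, q, k):
    """Top-k non-spanner nodes that share no community with q."""
    member_of = memberships(adj, cover)
    spanners = structural_hole_spanners(adj, cover)
    candidates = [v for v in adj
                  if not (member_of[v] & member_of[q]) and v not in spanners]
    return heapq.nlargest(k, candidates, key=lambda v: ppr_q[v])


# Two triangles joined by the edge (2, 3); scores of query node 0 are made up.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
cover = [{0, 1, 2}, {3, 4, 5}]
ppr_q = {0: 0.40, 1: 0.20, 2: 0.18, 3: 0.12, 4: 0.06, 5: 0.04}

spanners = top_k_spanners(ppr_q, adj, cover, q=0, k=2)
outsiders = top_k_heterogeneous(ppr_q, adj, cover, q=0, k=1)
```

On large networks the full ranking would be costly, which is where the pruned breadth-first search described in the text becomes relevant.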

V. EXPERIMENT
In this section, we first perform experiments to verify the effectiveness of the proposed robust hierarchical overlapping community detection method. Then we analyze the presented two community-aware similar node queries using real examples. Before introducing the detailed experimental results, we first present the experimental settings and the used datasets.

A. EXPERIMENTAL SETTINGS
To corroborate the effectiveness of the proposed community detection method, we compare it with the following state-of-the-art representative overlapping community detection algorithms:
1) BMLPA [26] is an overlapping community detection algorithm with label propagation.
2) GCE [49] detects overlapping community structure by greedy clique expansion.
3) LC [22] is a hierarchical overlapping community detection method which regards communities as a group of links rather than a group of nodes.
4) SA [50] finds overlapping communities based on a principled statistical approach using generative network models.
5) LPANNI [7] is an improved overlapping community detection algorithm based on LPA.
In the proposed algorithm, denoted C_PPR, $L$ is the maximum path length for estimating PPR. A preliminary experiment shows that C_PPR already performs well at $L = 10$; increasing $L$ beyond 10 improves effectiveness only slightly while adding considerable overhead. Therefore, we set $L = 10$ in the experiments.
The proposed method and the aforementioned baseline methods are compared over both synthetic networks and real-world networks with ground truth community structure.
Following a common setting for assessing overlapping community detection methods, we use the Omega Index [51] to compare the community structure detected by each method with the ground truth community structure:

$$\Omega(C_1, C_2) = \frac{\omega_u(C_1, C_2) - \omega_e(C_1, C_2)}{1 - \omega_e(C_1, C_2)}$$

where $\omega_u(C_1, C_2)$ is the fraction of pairs that occur together in the same number of communities in both partitions, and $\omega_e(C_1, C_2)$ is the expected value of this fraction in the null model [51]. Normally, the higher the value of the Omega Index, the better the detected community structure is. As mentioned previously, we can also use extended modularity to assess the community structure detected by different methods. Furthermore, we consider the following two metrics to evaluate the quality of the detected communities [52]:

$$\mathrm{Edges}(S) = m_S, \qquad \mathrm{Conductance}(S) = \frac{c_S}{2 m_S + c_S}$$

where $S$ is a set of nodes, $m_S = |\{(u, v) \in E : u \in S, v \in S\}|$ and $c_S = |\{(u, v) \in E : u \in S, v \notin S\}|$. These two metrics rely on the intuition that a community is a set of nodes with many edges inside it and few edges leaving it [53]. Usually, a larger score for Edges and a lower value for Conductance imply better community structure.
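As a rough illustration of these measures, the sketch below assumes the usual form of the Omega Index, (ω_u − ω_e) / (1 − ω_e), takes Edges to be m_S and takes Conductance to be c_S / (2 m_S + c_S). The function names are hypothetical, covers are given as lists of node sets, the graph is given as an undirected edge list, and at least two nodes are assumed.

```python
from collections import Counter
from itertools import combinations


def pair_counts(cover):
    """Map each unordered node pair to the number of communities that contain both nodes."""
    counts = Counter()
    for community in cover:
        for u, v in combinations(sorted(community), 2):
            counts[(u, v)] += 1
    return counts


def omega_index(cover_a, cover_b, nodes):
    """Omega Index of two covers defined over the same node set."""
    total_pairs = len(nodes) * (len(nodes) - 1) // 2
    a, b = pair_counts(cover_a), pair_counts(cover_b)
    # Pairs that share no community in either cover agree at level 0.
    agree = total_pairs - len(set(a) | set(b))
    # Pairs that co-occur in the same positive number of communities in both covers.
    agree += sum(1 for pair, j in a.items() if b.get(pair, 0) == j)
    omega_u = agree / total_pairs
    # Expected agreement: product of per-level pair counts, summed over levels.
    hist_a, hist_b = Counter(a.values()), Counter(b.values())
    hist_a[0] = total_pairs - len(a)
    hist_b[0] = total_pairs - len(b)
    omega_e = sum(hist_a[j] * hist_b[j] for j in hist_a) / total_pairs ** 2
    if omega_e == 1.0:
        return 1.0
    return (omega_u - omega_e) / (1.0 - omega_e)


def edges_and_conductance(edges, community):
    """Return (m_S, conductance) for a node set S, given an undirected edge list."""
    s = set(community)
    m_s = sum(1 for u, v in edges if u in s and v in s)
    c_s = sum(1 for u, v in edges if (u in s) != (v in s))
    volume = 2 * m_s + c_s
    return m_s, (c_s / volume if volume else 0.0)
```

For example, a triangle with one extra edge leaving it has m_S = 3 and c_S = 1, which gives a conductance of 1/7.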

B. RESULTS ON SYNTHETIC AND REAL-WORLD DATASETS
First, we evaluate the performance of the community detection algorithms on synthetic networks generated with the LFR benchmark introduced in [54]. The parameters of LFR are specified as follows. The number of nodes n is 2000. The average node degree k is 25. The number of overlapping nodes On is either 1% or 2% of the total number of nodes. The community size ranges over [10, 100]. The mixing parameter µ is set to 0.1 or 0.2. The number of memberships of the overlapping nodes Om is set to 2 or 3. Finally, the average clustering coefficient C is set to 0.6 or 0.7. The experimental results are reported in Tables 1 to 6. As can be observed, our algorithm outperforms the baseline algorithms in most cases. The observations are consistent across different evaluation metrics.
Then we assess the quality of the detected communities on real-world networks. We construct two coauthor networks from DBLP as follows (denoted as network A and network B). We choose five conferences (SIGMOD, KDD, SIGIR, STOC and AAAI) and extract coauthor relationships from 2014 to 2016. In particular, we randomly choose two seed authors who have published papers in at least three of these conferences, and perform a breadth-first traversal from each seed to obtain two small networks of size 100, where conferences are considered as ground truth communities. The third real network is a high school friendship network [11] where the ground truth is explicitly given. The ground truth consists of 6 communities together with two subgroups of students (white and black students) in grade 9 [11]. The comparison results on these three real-world networks are shown in Tables 7 to 9. It can also be seen that the proposed robust hierarchical overlapping community detection method outperforms the baseline methods in most cases. This observation, along with the experimental results on synthetic datasets, validates the effectiveness of the proposed method C_PPR.
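The breadth-first sampling step used to build the small coauthor networks can be sketched as below. This is an illustration under assumptions rather than the authors' code: the function name `bfs_sample` and the adjacency-dictionary input are hypothetical, and the sketch simply stops once `max_nodes` nodes have been collected and returns the subgraph induced on them.

```python
from collections import deque


def bfs_sample(adj, seed, max_nodes):
    """Collect up to max_nodes nodes by breadth-first traversal from seed.

    adj maps each node to an iterable of neighbors. The result is the adjacency
    dictionary of the subgraph induced on the collected nodes.
    """
    order = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        u = queue.popleft()
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            order.append(v)
            queue.append(v)
            if len(order) == max_nodes:
                break
    keep = set(order)
    return {u: [v for v in adj[u] if v in keep] for u in order}
```

With a coauthor graph loaded into `adj`, a call such as `bfs_sample(adj, seed_author, 100)` would yield a network of at most 100 authors around the chosen seed.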

C. ROBUSTNESS OF THE PROPOSED METHOD
Our proposed hierarchical overlapping community detection method is designed to be robust to network perturbations. In this subsection, we test its robustness by comparing it with LC, BMLPA, GCE, SA and LPANNI on three real-world networks. Specifically, each time we obtain a new network by deleting one edge, as in the toy example in Fig. 2, Fig. 3 and Fig. 4. We apply these methods to the perturbed network and compute the corresponding Omega Index value. The experimental results are shown in Fig. 7 to Fig. 9. We can observe that, as the underlying network structure is perturbed, our proposed community detection method is the most robust one: its Omega Index values remain rather stable. In contrast, the Omega Index values of the other methods fluctuate to a large degree as edges are deleted.
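The single-edge perturbation protocol can be sketched generically as follows. The helper names are hypothetical, `detect` stands for any community detection routine that takes an edge list, and `score` stands for an evaluation function such as the Omega Index against the ground truth. Neither is tied to a specific implementation here.

```python
def single_edge_deletions(edges):
    """Yield (removed_edge, perturbed_edge_list) for every single-edge deletion."""
    for i, removed in enumerate(edges):
        yield removed, edges[:i] + edges[i + 1:]


def robustness_profile(edges, detect, score):
    """Run detect on each perturbed network and record the resulting score."""
    return [(removed, score(detect(perturbed)))
            for removed, perturbed in single_edge_deletions(edges)]


def fluctuation(profile):
    """Spread of the scores over all perturbations; a smaller spread means a more stable method."""
    values = [value for _, value in profile]
    return max(values) - min(values)
```

A method whose profile has a small `fluctuation` keeps nearly the same score regardless of which edge is removed, which is the behavior the robustness experiment looks for.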

D. CASE STUDY OF NODE PROXIMITY QUERIES
After we obtain the desired community structure, we now study if the obtained communities can help us perform node proximity queries. Given a query node q, we focus on discussing the results of top-k structural hole spanner queries and top-k heterogeneous node queries.
We first perform a case study on a network of coauthorships between 379 authors [55], as shown in Fig. 10. Our algorithm finds 35 communities whose size is greater than or equal to 3. The traditional top-5 similar node query for author Kertesz returns Kaski, Onnela, Chakraborti, Kanto and Jarisaramaki. As can be observed in Fig. 10, these five authors belong to the same community as Kertesz. When we perform the top-5 similar structural hole spanner query w.r.t. Kertesz, the 5 returned nodes are Holyst, Stauffer, Dafontouracosta, Aharony and Adler. These structural hole spanners span three different communities. Among them, the author Stauffer is special because he connects 4 different communities and plays a key role in communicating with other nodes in the network. On the other hand, we find that the top-5 similar heterogeneous authors w.r.t. Kertesz are Stauffer, Bernardes, Costa, Szabo and Alava. These authors are from the communities adjacent to that of Kertesz. The results of these two types of queries are entirely different, and they can help us better understand the structure of the network.
We then perform a similar case study on a coauthorship network which is extracted from the following conferences
Experimental results on both synthetic and real-world networks corroborate the effectiveness and robustness of the proposed framework. As shown in the real examples, the two presented community-aware similar node queries can help us gain more insights into the underlying network. However, as discussed in Section 3.5, the complexity of our algorithm is O(mn(H + L)), which is not competitive with that of the baselines. The proposed algorithm took 102 s to detect communities on a synthetic network of 5000 nodes. Reducing the time complexity is left as future work.

VI. RELATED WORK
In this section, we introduce the related work from four different aspects: node proximity measures, random-walk based community detection, local based overlapping community detection and structural hole spanner detection.

A. NODE PROXIMITY MEASURES
Measuring node proximity (or closeness) is a fundamental task in graph or network mining. Recent years have witnessed the development of many successful node proximity measures, such as common neighbors [56], Jaccard coefficient [57], Adamic/Adar [58], Personalized PageRank [59], HITS [60] and SimRank [61]. Among them, common neighbors, Jaccard coefficient, and Adamic/Adar only consider the one-hop neighborhoods of a pair of nodes (i.e., the query node and the target node), that is, they only look at the ego-networks of the two nodes. As a result, these measures often do not perform well if the query node and the target node are far away from each other in the network. In addition, community structure detected with one-hop node proximity measures is sensitive to network changes, because these measures are themselves very sensitive to network perturbations. In contrast, global node proximity measures such as Personalized PageRank, HITS and SimRank perform random walks on the network so that they can capture the proximity of indirectly connected nodes. Among them, PPR has become a particularly popular proximity measure because of its effectiveness in many real-world applications and its theoretical guarantees [59]. For example, PPR is widely used in Twitter's user recommendation service Who To Follow [28].
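The contrast between one-hop and global measures can be illustrated with a small sketch. It uses the textbook definitions of common neighbors, Jaccard coefficient and Adamic/Adar, together with a plain power-iteration form of Personalized PageRank. The restart probability `alpha`, the iteration count and the function names are assumptions made for illustration, and this is not the truncated PPR estimator used by C_PPR.

```python
import math


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # For distinct u and v, a common neighbor has degree >= 2, so the logarithm is positive.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def personalized_pagerank(adj, source, alpha=0.15, iterations=10):
    """Power iteration for a random walk that restarts at source with probability alpha."""
    scores = {node: 0.0 for node in adj}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for u, mass in scores.items():
            if not adj[u]:
                # A node without neighbors sends its walking mass back to the source.
                nxt[source] += (1.0 - alpha) * mass
                continue
            share = (1.0 - alpha) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        nxt[source] += alpha  # restart mass (the scores always sum to 1)
        scores = nxt
    return scores
```

On the path 0-1-2-3 (adjacency given as a dictionary of sets), nodes 0 and 3 share no neighbor, so all three one-hop measures are zero, while the PPR score of node 3 with respect to node 0 is positive after a few iterations.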

B. RANDOM WALK BASED COMMUNITY DETECTION
Detecting the inherent community structure of a network is of great importance in a variety of high-impact domains, ranging from social networks, bioinformatics, and epidemiology to e-commerce. Over the past decades, a great number of dedicated community detection methods have been proposed [1], [11], [62]. Here we focus on reviewing the community detection methods based on random walks. Random walks have been extensively employed to find local communities in the network [63]- [66]. Rather than splitting the entire network into multiple communities, these methods look for a small number of local groups within the network based on the analysis of local connection patterns. Our proposed robust hierarchical overlapping community detection method is most closely related to the method in [63], which is a local graph partitioning algorithm for local community detection; in contrast, our method attempts to detect all possible communities in the network. There are also several other algorithms which perform global community detection based on random walks. Examples include WalkTrap [10], Markov clusters [67], Random Walks [68], NISE [69] and CLAGO [70]. Our proposed method is distinct from these methods, as we study how to detect hierarchical overlapping communities with Personalized PageRank, while these methods are limited to either disjoint community detection or overlapping community detection.

C. LOCAL BASED OVERLAPPING COMMUNITY DETECTION
Local methods identify communities based on local structure information and can reveal local community characteristics efficiently. Recently, several excellent local based methods [7], [71]- [75] have been proposed. Guo et al. [71] present a local community detection algorithm based on the internal force between nodes. Sun et al. [73] provide an overlapping community detection method based on seed selection and expansion. Reference [72] investigates the parallelization of the procedures of LPA and proposes a fully parallel LPA. Guo et al. [74] focus on discovering communities in large-scale social networks using the MapReduce model. Liu et al. [75] study an algorithm based on a dynamic distance mechanism to reduce the number of outlier communities. Lu et al. [7] propose an overlapping community detection algorithm based on label propagation, which can be used to detect community structures in large-scale complex networks.
In contrast to the aforementioned algorithms, our algorithm detects hierarchical overlapping communities, which provide community structures at different levels of granularity. Moreover, the community-aware similar node query, one byproduct of the proposed method, can help us gain more insights into the underlying network.

D. STRUCTURAL HOLE SPANNER DETECTION
Structural hole spanners often lie on the boundary of multiple communities, and they can control how information flows between different communities. Finding structural hole spanners has practical implications in various real-world applications such as disease intervention and virus attack prevention. Recently, many approaches [34], [35], [47], [48] have been proposed to mine structural hole spanners from the network. Specifically, Lou and Tang [34] proposed to detect the top-k structural hole spanners in the network by minimal cut. Rezvani et al. [35] made use of the betweenness proximity measure to find structural hole spanners, which correspond to nodes that lie on many shortest paths between other nodes. It should also be noted that the methods in [34], [35], [48] detect the global top-k structural hole spanners over the network, while we focus on the top-k structural hole spanners for a particular node in the network after community detection. The reason is that these structural hole spanners play a key role in information diffusion between the query node and nodes in other communities.
Deng et al. [76] propose a Measuring Influence (MIF) model to capture social influence on heterogeneous networks, which consist of two or more types of nodes, whereas our algorithm is designed for homogeneous networks, which consist of nodes of a single type. In this paper, a heterogeneous node of q instead means a node that belongs to a different community from q.

VII. CONCLUSIONS
Conventional community detection methods mainly focus on finding disjoint communities at a single scale in the network. However, communities in many real-world networks often overlap and are recursively grouped into a hierarchical structure. A vast majority of existing attempts are vulnerable to network perturbations, such that even a small disturbance may result in a completely different community structure. In this paper, we propose a novel robust hierarchical overlapping community detection framework with Personalized PageRank (PPR). In particular, we first present a PPR-based distance metric to measure the distance between two communities and show its correlation with spectral theory. Then, with the presented distance metric, we introduce an effective and efficient mechanism to merge small communities to enable the detection of hierarchically organized overlapping communities. Finally, we verify the effectiveness of the proposed framework on both synthetic and real-world networks. Further study shows that the proposed method is indeed robust to perturbations of the underlying network structure. As the proposed community detection framework makes use of the Personalized PageRank measure, one of its byproducts is that we can perform various types of node proximity queries based on the detected community structure.
In conclusion, the advantages of our algorithm framework are:
• We give a fundamental mathematical analysis of the presented distance metric, which has a strong connection with spectral theory.
• We propose a novel robust and effective hierarchical overlapping community detection framework.
• We introduce how to make use of the detected community structure to perform various node proximity queries such as the top-k structural hole spanner query and the top-k heterogeneous node query, which can help us gain more insights on the underlying network.