TNS-LPA: An Improved Label Propagation Algorithm for Community Detection Based on Two-Level Neighbourhood Similarity

Community detection not only helps people understand the organizational structure and function of complex networks, but also contributes to many potential applications, including targeted advertising and customer relationship management. Owing to its low time complexity, the label propagation algorithm is widely used, but there is still room to improve community quality and detection stability. Inspired by resource allocation and local path similarity, we first give a new two-level neighbourhood similarity measure called TNS, and on this basis we propose an improved label propagation algorithm for community detection. In this new algorithm, the minimum distance and a local centrality index are used to select the initial community centers, ensuring that they are both important and far away from each other. In forming the initial communities, we employ the new similarity measure and an optimization strategy that updates labels asynchronously according to node importance. To further improve the accuracy of community division, we introduce a label influence based on the new similarity measure. Experimental results on both the artificial network and ten real-world networks show that the proposed algorithm has better comprehensive performance than several existing algorithms in terms of modularity, normalized mutual information and adjusted Rand index.


I. INTRODUCTION
Since Newman's original work [1], community detection in complex networks has attracted considerable attention over the past two decades [2]-[8]. Mining the community structure of social networks helps us analyze network topology and function, and thus understand, control and predict social networks. Most social networks have obvious community structures. Community detection based on social media data can be employed in various applications, including epidemic control, crisis response, and predictive policing [9]-[13]. In Internet finance, community detection can be used in targeted advertising, customer relationship management and fraud detection [14], [15]. In biological networks, community detection may help to recognize different functional modules of proteins [16]. In academia and scientometrics, community detection can be used for recommendation systems and for data dimension reduction in pattern recognition [17], [18].
The associate editor coordinating the review of this manuscript and approving it for publication was Yichuan Jiang.
Over the past years, scholars have devoted a great deal of effort to studying the community properties of networks and have developed various community detection algorithms [19]-[22]. The GN algorithm [23] was first proposed by Girvan and Newman; its key idea is the iterative removal of edges with high betweenness centrality. Because of its high computational complexity, Newman [24] shortly afterwards put forward a fast modularity maximization (FMM) algorithm, which optimizes modularity by iteratively merging and updating communities. Community detection is essentially the clustering of nodes in a network, so various clustering algorithms have been widely applied to community division. Clauset, Newman and Moore [25] proposed a hierarchical agglomerative algorithm that detects meaningful communities with a greedy optimization strategy and is therefore suitable for large-scale networks; it is now known as the CNM algorithm. Blondel et al. [26] proposed an efficient heuristic algorithm called BGLL for finding high-modularity partitions of large networks, which can also be applied to weighted networks. To address the limitations of the classical K-means algorithm, Rodriguez and Laio [27] proposed a density-based clustering algorithm for community division, which assumes that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any point with a higher local density. Recently, Bai et al. [28] generalized K-prototypes-type clustering to community detection and proposed a new algorithm called ISCD+, which considers both the local importance of a node in a community and the ''importance concentration'' of the node over all communities. The ISCD+ algorithm needs predefined parameters such as the number of communities. Considering the binary and triadic relations among vertices, Zhang et al. [29] provided a spectral k-way partition algorithm for discovering community structures, which performs better than the normalized-cut graph partition algorithm.
The label propagation algorithm (LPA), first proposed by Raghavan et al., is a fast unsupervised learning algorithm for community detection [30], whose basic idea is to use the information of labelled nodes to predict the labels of the remaining nodes. The main advantage of LPA is its near-linear time complexity, and it has therefore attracted a great deal of attention. Barber and Clark reformulated LPA as an equivalent optimization problem and put forward the LPAm algorithm [31] by modifying the objective function, which is applicable to both bipartite and unipartite networks. Based on this work, Liu et al. presented LPAm+ [32], which avoids local optima by using a multi-step greedy agglomerative strategy. On the basis of the LPA algorithm, Liu et al. proposed a novel evolutionary clustering approach that can detect overlapping and non-overlapping communities in dynamic networks [33]. To overcome the community annexation problem, Li et al. [34] established the Stepping-LPA-S algorithm, in which a new evaluation function is introduced to guide the merging of small communities while the community division is optimized. Ding et al. [35] proposed a novel community detection algorithm called DCN, in which potential community centers are chosen by means of the Chebyshev inequality, and label propagation makes good use of the neighbors of each node and adopts a multiple-strategy label update. Very recently, Wang et al. proposed a novel label propagation algorithm based on node importance [36], in which the important nodes are determined by integrating the neighborhood Jaccard distance, the K-shell value and the signal propagation amount.
However, there is still considerable room to improve the original LPA algorithm and its various extensions, particularly in terms of stability and accuracy. For this purpose, we present an improved label propagation algorithm for community detection based on a two-level neighbourhood similarity measure. The rest of this article is organized as follows. In Section 2, based on resource allocation and local paths of networks, we propose a new local similarity measure between node pairs. In Section 3, an improved label propagation algorithm called TNS-LPA is proposed to improve the stability and accuracy of community detection. Section 4 is devoted to the data sets and evaluation metrics. In Section 5, the performance of the proposed algorithm is tested on both an artificial network and real-world networks of different scales. Finally, in Section 6, we discuss and summarize our results.

II. TNS: A NEW SIMILARITY MEASURE
For an unweighted and undirected network $G = (V, E)$, $V$ represents the set of nodes and $E$ denotes the set of edges, where $|V| = N$ and $|E| = M$. $A = (a_{ij})_{N \times N}$ stands for the adjacency matrix, where $a_{ij} = 1$ if there exists a link between node $v_i$ and node $v_j$, and $a_{ij} = 0$ otherwise.
According to the definition of community, the greater the similarity between two nodes, the more likely they are to belong to the same community. Therefore, how to define the similarity of nodes is an essential issue, and many scholars have made great contributions to it [37]-[42]. From the viewpoint of network topology, Liben-Nowell and Kleinberg [38] proposed a number of node-based and path-based similarity measures. Among the local indices that only consider neighbor information, the Adamic-Adar index [41] has the best performance. In Ref. [42], Zhou et al. compared nine known neighborhood-based similarity measures for link prediction on six different networks, including the Salton, Jaccard, Sørensen and Hub Promoted indices.
Based on these classical indices, Zhou and Lü [42] proposed two new indices, namely the resource allocation (RA) index and the local path (LP) index. The similarity $S_{ij}$ between node $v_i$ and $v_j$ based on the RA index can be defined as

$$S_{ij}^{RA} = \sum_{z \in \Gamma_1(i) \cap \Gamma_1(j)} \frac{1}{k(z)}, \tag{1}$$

where $\Gamma_1(i)$ and $\Gamma_1(j)$ represent the first-order neighbor sets of node $v_i$ and node $v_j$, and $k(z)$ is the degree of node $v_z$. RA is found to be efficient in community detection and applicable to weighted networks [43]. From the viewpoint of resource allocation, Li et al. established an improved LPA algorithm called ''Stepping-LPA-S'' [34]. Besides, the LP index is given by [42]

$$S^{LP} = A^2 + \rho A^3, \tag{2}$$

which only considers the two-step and three-step paths in networks, and $\rho$ is a tunable parameter.

Both RA and LP are found to have better prediction ability than the nine known local indices, including the Adamic-Adar index [42]. Moreover, LP has prediction ability comparable to that of some global measures, such as the Katz index [44], [45], especially for networks with a small average shortest path. A local index depends only on the local information of the network, and its computational cost is far lower than that of indices based on global information, especially for sparse and large-scale networks. Christakis et al. argued that, for most social networks, the behavior of a node is highly related to its first-order, second-order and up to third-order neighbors, namely, the so-called three degrees of separation [46]. Recently, Kovács et al. defined a new similarity on the basis of two-step and three-step paths and established a new link prediction algorithm, which greatly outperforms the existing link prediction algorithms in protein interaction networks [47].
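As a minimal sketch of how the RA and LP indices can be evaluated on a small graph, the following pure-Python code uses their standard definitions from [42]: RA sums $1/k(z)$ over common neighbors, and LP adds the number of two-step walks to $\rho$ times the number of three-step walks. The function names, the toy edge list and the default value of `rho` are illustrative assumptions, not taken from the article.

```python
def build_adjacency(edges):
    """Return {node: set of neighbours} for an undirected, unweighted edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def ra_index(adj, i, j):
    """RA index: sum of 1/k(z) over the common neighbours z of i and j."""
    return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])


def lp_index(adj, i, j, rho=0.01):
    """LP index: (A^2)_ij + rho * (A^3)_ij, obtained by counting walks.

    rho=0.01 is only a placeholder value (assumption); the article treats
    rho as a tunable parameter.
    """
    walks2 = len(adj[i] & adj[j])  # (A^2)_ij
    walks3 = sum(1 for x in adj[i] for y in adj[x] if j in adj[y])  # (A^3)_ij
    return walks2 + rho * walks3


# Hypothetical toy graph, for illustration only.
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
toy_adj = build_adjacency(toy_edges)
print(ra_index(toy_adj, 1, 4))  # common neighbours 2 and 3, both of degree 3
print(lp_index(toy_adj, 1, 4))  # two 2-step walks plus rho times two 3-step walks
```

Note that $(A^3)_{ij}$ counts walks, which may revisit nodes, so it can exceed the number of simple three-step paths between the two nodes.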
Inspired by the above work, this article puts forward a new similarity measure, called two-level neighborhood similarity (TNS). The similarity index $S_{ij}$ between node $v_i$ and $v_j$ is defined in Eq. (3), where $\Gamma_1(i)$ and $\Gamma_1(j)$ represent the first-order neighbor sets of node $v_i$ and node $v_j$, and $k(x)$, $k(y)$, $k(z)$ represent the degrees of nodes $v_x$, $v_y$ and $v_z$. The new index consists of two parts: the former evaluates the contribution of network paths of length two, and the latter measures the contribution of network paths of length three. This new similarity measure is illustrated by the karate network shown in Fig. 1(a). The components related to nodes 1 and 6 are extracted and shown in Fig. 1(b). There are two two-step paths and three three-step paths between node 1 and node 6, so the similarity of this pair is obtained from Eq. (3) by summing the contributions of these five paths.
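The path counts quoted for nodes 1 and 6 can be checked with a short enumeration. The sketch below is only an illustration: the edge list is copied by hand from the commonly distributed Zachary karate club data for the nodes around 1 and 6 (an assumption about what Fig. 1(b) shows), and it counts simple paths only; it does not evaluate the TNS weights of Eq. (3).

```python
def build_adjacency(edges):
    """Return {node: set of neighbours} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def two_step_paths(adj, s, t):
    """All simple paths s - z - t."""
    return [(s, z, t) for z in sorted(adj[s] & adj[t])]


def three_step_paths(adj, s, t):
    """All simple paths s - x - y - t (no node is visited twice)."""
    paths = []
    for x in sorted(adj[s]):
        if x == t:
            continue
        for y in sorted(adj[x]):
            if y not in (s, t) and t in adj[y]:
                paths.append((s, x, y, t))
    return paths


# Sub-network around nodes 1 and 6 (1-indexed node ids); assumed edge subset.
karate_sub = [(1, 5), (1, 6), (1, 7), (1, 11), (5, 7), (5, 11),
              (6, 7), (6, 11), (6, 17), (7, 17)]
adj = build_adjacency(karate_sub)
two = two_step_paths(adj, 1, 6)
three = three_step_paths(adj, 1, 6)
print(two)
print(three)
```

Under this edge list the enumeration yields the paths 1-7-6 and 1-11-6 of length two, and 1-5-7-6, 1-5-11-6 and 1-7-17-6 of length three.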

III. PROPOSED ALGORITHM
In this article, we propose an improved label propagation algorithm called TNS-LPA, which consists of three phases. First, the initial centers of communities are chosen based on local centrality and minimum distance. Second, we generate the prototype of the community partition based on the proposed TNS index. Finally, the community partition is further optimized by combining the label influence of nodes with traditional label propagation. The community detection process is shown schematically in Fig. 2.

A. SELECTING INITIAL COMMUNITY KERNELS
A limitation of traditional LPA is that the node update order is random, which produces unstable partition results. Inspired by DPC [27] and FCC [48], we propose a new strategy for selecting the initial community kernels, which takes distance and importance into account simultaneously [49]. If the centers are dispersed as much as possible, they have fewer overlapping effects and yield a more stable community structure, so we look for community centers that are relatively far from each other. Rodriguez and Laio proposed an innovative index called minimum distance, which measures the proximity of a node to more important nodes [27]. The importance of a node is measured by the number of its first-order and second-order neighbors.
The importance degree of node $v_i$ is defined as

$$IP(i) = C_{Nei}(i) \cdot \theta(i), \qquad \theta(i) = \min_{j:\, C_{Nei}(j) > C_{Nei}(i)} d_{ij}, \tag{4}$$

where $\theta(i)$ is the minimum distance of node $v_i$ and $d_{ij}$ is the shortest path length between node $v_i$ and $v_j$. $C_{Nei}(i)$ denotes the local centrality of node $v_i$, which is defined as the number of the nearest neighbors and the next nearest neighbors. $C_{Nei}(i)$ can be calculated as

$$C_{Nei}(i) = |\Gamma_1(i)| + |\Gamma_2(i)|, \tag{5}$$

where $\Gamma_1(i)$ and $\Gamma_2(i)$ represent the first-order and second-order neighbor sets of node $v_i$. The larger $IP(i)$ is, the higher the probability that $v_i$ is the center of some community. Once the importance degree of each node is obtained, the nodes are sorted by $IP$ in descending order and the top-$s$ nodes are chosen as the initial centers of communities.
The number of initial communities can be obtained from the decision graph [27], whose horizontal and vertical axes denote the importance and the minimum distance of nodes, respectively. The kernels lie at the top right of the decision graph, far away from the other nodes, because they have both relatively high importance and a large minimum distance.
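The kernel selection step can be sketched as follows, reading the importance degree as local centrality times minimum distance (as in line 5 of Algorithm 1). Two points are assumptions rather than statements from the article: the minimum distance is taken over nodes with strictly larger local centrality, and a node with no such node (the most central one) receives its largest shortest-path distance, following the convention of DPC [27]. All function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def local_centrality(adj, i):
    """C_Nei(i): number of first-order plus second-order neighbours of i."""
    first = adj[i]
    second = set().union(*(adj[x] for x in first)) - first - {i} if first else set()
    return len(first) + len(second)


def importance(adj):
    """Return (C_Nei, IP) dictionaries, with IP(i) = C_Nei(i) * theta(i)."""
    c = {i: local_centrality(adj, i) for i in adj}
    ip = {}
    for i in adj:
        dist = bfs_distances(adj, i)
        higher = [dist[j] for j in dist if c[j] > c[i]]
        # Assumed DPC-style fallback for the most central node(s).
        theta = min(higher) if higher else max(dist.values())
        ip[i] = c[i] * theta
    return c, ip


# Hypothetical toy graph: two small stars (centres 0 and 10) bridged by node 5.
toy_adj = {0: {1, 2, 3, 5}, 1: {0}, 2: {0}, 3: {0}, 5: {0, 10},
           10: {5, 11, 12, 13}, 11: {10}, 12: {10}, 13: {10}}
c_nei, ip = importance(toy_adj)
s = 2  # number of initial centres, read off the decision graph in the article
centers = sorted(ip, key=ip.get, reverse=True)[:s]
print(c_nei, ip, centers)
```

This sketch runs one breadth-first search per node, which is adequate for small examples but not tuned for large networks.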

B. GENERATING THE PROTOTYPE OF COMMUNITY PARTITION
The goal of this stage is to generate the initial community partition. The original LPA updates the label of a node according to the majority label of its neighbors. Li et al. [34] use the maximum similarity instead of the majority of neighbors to update the label, which outputs a rough community partition. Motivated by this idea, we update the labels based on the proposed TNS similarity,

$$l_i = l_{j^*}, \qquad j^* = \arg\max_{j \in \Gamma_1(i)} S_{ij}, \tag{6}$$

until every node has the same label as its most similar neighbor. In Eq. (6), $\Gamma_1(i)$ is the first-order neighbor set of node $v_i$, and $S_{ij}$ is given by Eq. (3). When the label propagation process is completed, nodes with the same label are assigned to the same community, which gives the initial community division. Note that the labels of all initial community centers remain unchanged. In order to form a relatively stable community division and avoid oscillation during label propagation, the update order of the node labels is determined by Eq. (4). The initial prototype of the community partition for the Karate network is shown in Fig. 3. It can be clearly seen that the initial division is unreasonable and needs to be improved.
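A minimal sketch of this max-similarity update is given below. Because the TNS formula is not reproduced here, the RA index is used as a stand-in similarity (an assumption); ties are broken by the smallest node id, and the update order is passed in as a list, both of which are illustrative choices.

```python
def ra_similarity(adj):
    """Return a stand-in similarity function (RA index, not TNS)."""
    def sim(i, j):
        return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])
    return sim


def propagate_by_max_similarity(adj, sim, centers, order, max_iter=100):
    """Each non-centre node copies the label of its most similar neighbour."""
    labels = {v: v for v in adj}  # every node starts with a unique label
    for _ in range(max_iter):
        changed = False
        for i in order:  # asynchronous update in the given order
            if i in centers or not adj[i]:
                continue  # centre labels stay fixed
            # max() returns the first maximum, i.e. the smallest id on ties
            best = max(sorted(adj[i]), key=lambda j: sim(i, j))
            if labels[i] != labels[best]:
                labels[i] = labels[best]
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = propagate_by_max_similarity(
    toy_adj, ra_similarity(toy_adj), centers={2, 3}, order=[2, 3, 0, 1, 4, 5])
print(labels)
```

With nodes 2 and 3 as fixed centers, each triangle adopts the label of its own center in this toy case.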

C. OPTIMIZING COMMUNITY PARTITION
So far, our algorithm only outputs the prototype of the community partition, in which communities are often divided excessively and some nodes are assigned incorrectly. In the above stage, each node label is determined by the most similar node in its neighborhood. In fact, the label of a node is affected not only by its most similar neighbor, but also by the remaining neighbors with lower similarity. Motivated by this idea and by label influence [50], we introduce the label influence to further optimize the community division of networks. A node is assigned the label that has the greatest influence on it. Specifically, the influence of the label $l_x$ on node $v_i$ is defined as

$$IF_x(i) = \sum_{j \in \Gamma_1(i)} S_{ij}\, \delta_j^x, \tag{7}$$

where $\Gamma_1(i)$ is the neighbor set of node $v_i$, and $S_{ij}$ denotes the similarity between node $v_i$ and node $v_j$, which can be calculated by Eq. (3). The matrix $(\delta_j^x)_{m \times N}$ is a confusion matrix: $\delta_j^x$ indicates whether the label of node $v_j$ ($j = 1, 2, \cdots, N$) is consistent with the label of the $x$-th category. If they are consistent, $\delta_j^x = 1$; otherwise it is 0. Here, $L = \{l_1, \cdots, l_x, \cdots, l_m\}$ is the label set of the community partition, and $m = |L|$ denotes the number of communities. In other words, the influence of a label on a node is the sum of the similarities between the node and its neighbors carrying that label.
With the label influence in hand, the label propagation process is repeated until every node is assigned the label that has the greatest influence on it. Finally, the nodes with the same label are assigned to the same community, and the final community division is generated. The optimized community partition for the Karate network is depicted in Fig. 4. It is easily seen that the community structure is more obvious and reasonable.
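The difference between the max-similarity rule and the label-influence rule can be seen in a small sketch. The similarity values below are made-up numbers for illustration (they are not TNS values), and keeping the current label on ties is an assumption, as are all function names.

```python
def label_influence(adj, sim, labels, i):
    """Influence of each label on node i: summed similarity of i to the
    neighbours carrying that label."""
    influence = {}
    for j in adj[i]:
        influence[labels[j]] = influence.get(labels[j], 0.0) + sim(i, j)
    return influence


def refine_by_label_influence(adj, sim, labels, order, max_iter=100):
    """Repeat until every node carries the label with the greatest influence."""
    labels = dict(labels)
    for _ in range(max_iter):
        changed = False
        for i in order:
            inf = label_influence(adj, sim, labels, i)
            if not inf:
                continue
            best = max(inf.values())
            if inf.get(labels[i], float("-inf")) >= best:
                continue  # current label is already (jointly) the strongest
            labels[i] = min(x for x, v in inf.items() if v == best)
            changed = True
        if not changed:
            break
    return labels


# Hypothetical star: node 0 has one strong neighbour labelled 'a' and two
# weaker neighbours labelled 'b'.
toy_adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
sim_table = {frozenset((0, 1)): 0.5, frozenset((0, 2)): 0.3, frozenset((0, 3)): 0.3}
sim = lambda i, j: sim_table[frozenset((i, j))]
start = {0: "a", 1: "a", 2: "b", 3: "b"}
influence_on_0 = label_influence(toy_adj, sim, start, 0)
final = refine_by_label_influence(toy_adj, sim, start, order=[0, 1, 2, 3])
print(influence_on_0, final)
```

Here the most similar neighbor of node 0 carries label 'a', but the combined influence of label 'b' is larger, so the refinement moves node 0 to 'b'.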

D. THE PSEUDO CODE AND COMPLEXITY OF THE TNS-LPA ALGORITHM
The proposed algorithm consists of the above three phases, and the pseudo code is given in Algorithm 1.
For a given undirected and unweighted network $G = (V, E)$ with $N$ nodes and $M$ edges, the TNS-LPA algorithm has two key parts. The first part is the initial division of the community structure, for which we need to compute the TNS similarity and the minimum distance. Computing TNS requires the nearest and second nearest neighbors, with a complexity of $O(N \langle k \rangle^2)$, where $\langle k \rangle$ denotes the average degree of the network. The minimum distance can be obtained through the shortest paths, whose complexity can be $O(M + N \log N)$ [51]. The second part resolves the excessive community division of the previous stage through the LPA algorithm, with a complexity of $O(M)$. To sum up, the complexity of TNS-LPA is $O(M + N \log N)$, far less than $O(N^2)$.

IV. EXPERIMENTAL SETUP
Before proceeding with the discussion of experimental results, we devote this section to the datasets and evaluation metrics.

A. DATASETS
1) ARTIFICIAL NETWORK
In order to test various community detection algorithms, Girvan and Newman first introduced an artificial network [1], called the GN benchmark. Owing to its simple structure, most community detection algorithms perform very well on the GN benchmark. Subsequently, the more general Lancichinetti-Fortunato-Radicchi (LFR) benchmark was introduced, in which both the node degrees and the community sizes obey power-law distributions [52].
Several parameters are involved in the LFR benchmark. Among them, $N$ is the total number of nodes, and $\langle k \rangle$ and $k_{max}$ are the average degree and maximum degree, respectively. $m(min)$ and $m(max)$ denote the minimum and maximum community size. The mixing parameter $\mu$ represents the fraction of each node's links that connect to nodes outside its own community. As $\mu$ increases, the community structure of the LFR network becomes less distinct. The details of these parameters are listed in Table 1.

B. EVALUATION METRICS
In this work, we compare the proposed algorithm with five popular algorithms according to three evaluation metrics, including modularity, normalized mutual information (NMI) and adjusted rand index (ARI).
• Modularity: For a given unweighted network $G(V, E)$ with $M$ edges, the modularity of a partition can be defined as [23]

$$Q = \frac{1}{2M} \sum_{i,j} \left( a_{ij} - \frac{k_i k_j}{2M} \right) \delta(C_i, C_j),$$

where $a_{ij}$ is the element of the adjacency matrix, $k_i$ and $k_j$ represent the degrees of nodes $v_i$ and $v_j$, respectively, and $C_i$ is the community of node $v_i$. The Kronecker function $\delta(C_i, C_j)$ has the value 1 if its arguments are equal and 0 otherwise. Generally speaking, the greater the value of the modularity $Q$, the more obvious the community structure of the network.
• Normalized mutual information (NMI): For a given network $G(V, E)$ with $N$ nodes, the NMI value between two divisions $X = \{X_1, X_2, \cdots, X_{m(X)}\}$ and $Y = \{Y_1, Y_2, \cdots, Y_{m(Y)}\}$ can be defined as [53]:

$$NMI(X, Y) = \frac{-2 \sum_{i=1}^{m(X)} \sum_{j=1}^{m(Y)} n_{ij} \log\left( \frac{n_{ij} N}{n_i^X n_j^Y} \right)}{\sum_{i=1}^{m(X)} n_i^X \log\left( \frac{n_i^X}{N} \right) + \sum_{j=1}^{m(Y)} n_j^Y \log\left( \frac{n_j^Y}{N} \right)}.$$

In the above equation, $m(X)$ and $m(Y)$ denote the community numbers of partitions $X$ and $Y$, respectively, and $n_{ij}$ is the number of common nodes in communities $X_i$ and $Y_j$. For the variables $W = \{n_1^X, n_2^X, \cdots, n_{m(X)}^X\}$ and $Z = \{n_1^Y, n_2^Y, \cdots, n_{m(Y)}^Y\}$, $n_i^X$ and $n_j^Y$ represent the numbers of nodes in $X_i$ and $Y_j$. The denominator of NMI is just the sum of the entropies of $W$ and $Z$. Note that the value of NMI is in the range [0, 1] and equals 1 only when the two community divisions are exactly consistent.

Algorithm 1 TNS-LPA
    ...
        IP(i) = C_Nei(i) * θ(i)    // calculate nodes' importance using Eq. (4)
    end for
    ...    // the set of initial centers; s is the number of initial centers
    ... {..., v*_N}
    calculate the similarity S_ij using Eq. (3)    // i = 1 → N − 1, j = i + 1 → N
    assign node v_i (i = 1, ..., N) a unique label l_i
    do
        for ...
            ... S_ij    // update the labels of all non-community centers
        end for
    while (I) there exists a node whose label still changes and (II) iter ≤ maxiter
    L = {l_1, l_2, ..., l_m}    // L is the label set of the initial community partition, and m = |L| denotes the number of communities
    compute the label influence IF_x(i) using Eq. (7)
        ... {..., m}    // update the labels based on nodes' importance
    end for
    while (I) there exists a node whose label still changes and (II) iter ≤ maxiter
    Output: the final community partition C = {C_1, C_2, ..., C_m}
End
• Adjusted Rand index (ARI): Based on pair counting, the ARI is computed as follows [54]

$$ARI(X, Y) = \frac{\sum_{i,j} \binom{n_{ij}}{2} - E}{\frac{1}{2}\left[ \sum_i \binom{n_i^X}{2} + \sum_j \binom{n_j^Y}{2} \right] - E},$$

where $E$ is given by

$$E = \frac{\sum_i \binom{n_i^X}{2} \sum_j \binom{n_j^Y}{2}}{\binom{N}{2}}.$$

Modularity reflects how tightly a community is connected internally, through the difference between the density of edges inside the actual communities and the density expected in the network under a random division. NMI and ARI indicate the accuracy of community detection, mainly by comparing the consistency between the detected communities and the ''true'' community division. The larger the NMI and ARI values, the better the community detection result.
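The three metrics can be sketched in pure Python as follows. Partitions are represented as dictionaries mapping each node to a community label; modularity follows the double-sum definition with the Kronecker delta, NMI follows the count-based form described above, and ARI uses the standard pair-counting form of Hubert and Arabie. The function names, the toy graph and the conventions for degenerate cases (returning 1.0 when a denominator vanishes) are assumptions made for illustration.

```python
from collections import Counter
from math import comb, log


def modularity(adj, community):
    """Q = (1/2M) * sum_ij (a_ij - k_i k_j / 2M) * delta(C_i, C_j)."""
    two_m = sum(len(nb) for nb in adj.values())  # sum of degrees = 2M
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


def nmi(x, y):
    """Normalized mutual information between two node -> label dictionaries."""
    n = len(x)
    nx, ny = Counter(x.values()), Counter(y.values())
    nxy = Counter((x[v], y[v]) for v in x)
    num = -2.0 * sum(c * log(c * n / (nx[a] * ny[b])) for (a, b), c in nxy.items())
    den = (sum(c * log(c / n) for c in nx.values())
           + sum(c * log(c / n) for c in ny.values()))
    return num / den if den != 0 else 1.0  # assumed convention: both trivial


def ari(x, y):
    """Adjusted Rand index based on pair counting."""
    n = len(x)
    sum_ij = sum(comb(c, 2) for c in Counter((x[v], y[v]) for v in x).values())
    sum_x = sum(comb(c, 2) for c in Counter(x.values()).values())
    sum_y = sum(comb(c, 2) for c in Counter(y.values()).values())
    expected = sum_x * sum_y / comb(n, 2)
    max_index = 0.5 * (sum_x + sum_y)
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
truth = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
found = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
print(modularity(toy_adj, truth), nmi(truth, found), ari(truth, found))
```

The quadratic loop in `modularity` mirrors the formula directly and is only meant for small examples; summing per community is the usual choice for large networks.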

V. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the proposed algorithm is compared with five popular algorithms on both artificial networks and ten real networks. The comparative algorithms are FMM [24], ISCD+ [28], LPA [30], Stepping-LPA-S [34] and NI-LPA [36]. In what follows, we analyze the results on the artificial networks and the ten real networks given in Section 4.1.

A. RESULTS IN ARTIFICIAL NETWORKS
Comparative experiments with the six algorithms have been conducted on the LFR benchmark with two network sizes, $N = 1000$ and $N = 4000$, and the other four parameters are taken as $\langle k \rangle = 20$, $k_{max} = 50$, $\gamma = 2$, $\beta = 1$. Figure 5 shows the comparison of the NMI values on the LFR network. The larger the NMI value is, the closer the partition result is to the real community structure. As shown in Fig. 5, the NMI value of TNS-LPA is larger than those of the other five algorithms, which indicates that TNS-LPA outperforms the other five methods on this network. For the sake of simplicity, we take Fig. 5(a) as an example. The FMM algorithm does not perform well over the whole range of the mixing parameter $\mu$. When $\mu > 0.35$, the performance of the ISCD+ and Stepping-LPA-S algorithms decreases significantly. As $\mu$ increases, especially when $\mu > 0.45$, the NMI value of LPA declines markedly, even to zero. To summarize, the proposed algorithm (TNS-LPA) in general has better comprehensive performance than the five contrast algorithms. Figure 6 gives the ARI values of the six community detection algorithms. When $\mu$ is less than 0.5, the ARI values obtained by the TNS-LPA algorithm are very close to 1. As $\mu$ increases, the proposed algorithm remains superior to the other comparative algorithms for the four sets of parametric choices, as depicted in Fig. 5. When the network size increases to 4000, the TNS-LPA algorithm is still better than the other algorithms with respect to NMI and ARI. Therefore, the proposed algorithm shows good stability, reliability and scalability on artificial networks.

B. RESULTS IN REAL NETWORKS
In the following, we compare TNS-LPA with the five algorithms on ten real networks. Among them, four networks, namely Karate, Polbooks, Football and Polblogs, have a known community division. Table 3 lists the modularity, NMI and ARI values of the six algorithms on these four real networks. In addition, Table 4 shows the comparison between the TNS-LPA algorithm and the contrast algorithms on six real networks without a known community division. These real networks of different scales are Lesmis, Netscience, Email, Yeast, PairsFSG and Cond2003. Figure 4 shows the community structure of the Karate network obtained by the TNS-LPA algorithm. From Table 3, it can be found that the original LPA algorithm performs the worst according to the three evaluation metrics. The modularity value of TNS-LPA is slightly worse than that of FMM. On the other hand, the NMI value of TNS-LPA is much larger than that of FMM. ISCD+ has the same performance as our proposed algorithm on the Karate network. Therefore, among the six algorithms, ISCD+ and TNS-LPA are the best at detecting the communities of the Karate network. Figure 7 depicts the community structure of the Polbooks network detected by the TNS-LPA algorithm. The Polbooks network contains three types of books related to American politics purchased by users. If different books are purchased by the same user, there is an edge between the corresponding book nodes. Because of the implicit buyers of the ''middle group'', the structure of the middle group community is not obvious, so the NMI and ARI values of the compared algorithms are not so large. It is easily seen that the TNS-LPA algorithm has better performance because the detected communities are more consistent with the original division. In addition, small communities are not merged into large communities, owing to the new update strategy in TNS-LPA.
Figure 8 gives the community detection results of the Football network obtained by the TNS-LPA algorithm. Compared with the five contrast algorithms, TNS-LPA divides communities more accurately and obtains a greater modularity value. Figure 9 illustrates the community division of the Polblogs network detected by our proposed algorithm. The circle on the left represents liberals and the circle on the right represents conservatives. Because of their ambiguous political attitude, some nodes are assigned to the opposing community. For this network, the modularity value of TNS-LPA is slightly less than that of FMM, but its ARI value is much larger than those of the other comparative algorithms.
In order to further verify TNS-LPA, we also consider six real networks without a known community division, whose modularity values obtained by the six algorithms are listed in Table 4. The TNS-LPA algorithm achieves the maximum modularity on the Lesmis, Email and PairsFSG networks. For the Netscience and Yeast networks, the modularity value of TNS-LPA is slightly less than that of FMM. As shown in the comparative analysis of the LFR network and the four real networks with known divisions, FMM can achieve a larger modularity value but does not perform very well in terms of the NMI and ARI metrics. The traditional LPA performs worst on the Lesmis, Email and PairsFSG networks. Note that TNS-LPA has good performance on large-scale networks such as Cond2003.
To sum up, the TNS-LPA algorithm performs competitively on the given networks. The proposed algorithm not only produces community partitions with a larger modularity value, but also outputs more accurate communities that are more consistent with the original partitions. Thus, among the six algorithms, the proposed algorithm is the most suitable one for community detection on these networks.

VI. DISCUSSION AND CONCLUSION
Label propagation is an efficient algorithm for community detection in complex networks due to its low time complexity. However, the uncertainty and randomness in the propagation of labels always affect its accuracy and stability. The node similarity measurement and the strategy for updating labels have a profound impact on the accuracy of community partition. For this purpose, this article proposes the TNS-LPA algorithm based on label propagation.

A. SIMILARITY COMPARISON
How to measure the similarity between nodes or links is an essential issue in community detection of complex networks. As mentioned in Section II, from the viewpoint of resource allocation and local topological structure, we present the TNS similarity measure by using the information of the two-level neighborhood in networks. The new similarity is used in the latter two phases of TNS-LPA. To illustrate the advantage of TNS, we replace the similarity measure in TNS-LPA with 11 other similarity indices [42], namely, common neighbours (CN), Salton index (Salton), Jaccard index (Jaccard), Sørensen index (Sørensen), Hub Promoted index (HPI), Hub Depressed index (HDI), Leicht-Holme-Newman index (LHN), Preferential Attachment (PA), Adamic-Adar index (AA), Resource Allocation (RA) and Local Path (LP). Table 5 gives the Q, NMI and ARI values on four real networks based on the 12 different similarity measures, from which it is easily seen that the new TNS index performs the best on most of the networks.

B. CONCLUSION
This article proposes an improved label propagation algorithm for community detection based on two-level neighborhood similarity (TNS-LPA), which improves the LPA algorithm by using influential nodes and a new community merging strategy. TNS-LPA consists of three phases. First, we choose the initial community centers by jointly considering minimum distance and local centrality. Then the label of each node is updated by employing a new label update strategy. Last, to avoid excessive and inaccurate division, we introduce the label influence based on the proposed similarity to further optimize the community division of networks. The effectiveness of TNS-LPA is illustrated through a series of experiments on both the artificial network and ten real networks. Compared with the five popular algorithms, namely FMM, LPA, Stepping-LPA-S, ISCD+ and NI-LPA, our proposed algorithm has better comprehensive performance. In fact, real complex networks exhibit complex and diverse community situations, such as a large number of small-scale communities, an unbalanced distribution of community sizes, fewer connections within small communities than between communities, and dynamic network structure, which bring great challenges to the research of community detection. Community detection for complex networks with these special structural characteristics deserves further research in future work.