Overlapping Community Detection Method Based on Network Representation Learning and Density Peaks

Research on complex social networks has attracted extensive attention from scholars, and community detection is an important research direction in the study of network structure. Network data is often high-dimensional and very large, which makes it difficult to process. Representing the network structure with low-dimensional vectors is therefore of great significance for community detection. In addition, many real-world social networks contain overlapping communities. In this paper, we propose an overlapping community detection method based on network representation learning and density peaks, called NRLDP. First, it uses network representation learning to represent an unweighted or weighted network with low-dimensional vectors. Then, it applies the density peaks clustering algorithm to overlapping community detection, uses cosine similarity to calculate the distance between nodes, and improves the local density calculation method. Finally, it selects the core nodes according to relative distance and local density, and allocates the remaining nodes to achieve overlapping community detection on unweighted or weighted networks. Comparisons with related community detection methods on real-world social networks and synthetic LFR benchmark networks show that the proposed approach is effective and accurate.


I. INTRODUCTION
In the late 1990s, Duncan J. Watts and Steven H. Strogatz published ''Collective dynamics of 'small-world' networks'' in Nature [1], followed by A.-L. Barabási et al., who published ''Emergence of Scaling in Random Networks'' in Science [2]. These two articles mark the birth of small-world and scale-free network models, which are closer to real-world networks, and opened a new era of complex network research. A social network is essentially a special kind of complex network. The nodes in the network represent individual users or groups, and the edges represent the relationships that arise between nodes through interaction. Mining useful information from the network is very important for scientific research and applications. Early researchers proposed that studying social groups is more meaningful for discovering hidden regularities in the network than studying individual users alone. Researchers have used community detection algorithms to find homogeneous community structure in complex social networks [3]. Individuals in a social group are often homogeneous; for example, individuals in groups with the same interests have a high probability of becoming friends. In addition, many complex social networks exhibit a strong social effect: they form a variety of closely connected groups, in which contacts between individuals inside a group are relatively frequent and contacts with individuals outside the group are much rarer. If an individual can be assigned to multiple groups, the task is overlapping community detection, and such individuals are overlapping nodes. Real networks often contain overlapping nodes, so overlapping community detection has important research significance.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhan Bu.
With the development of social networks, network data has become high-dimensional, very large and complex, which makes it difficult to process. Traditional community detection methods mainly obtain community information from the adjacency matrix representation of the network, but the adjacency matrix can only represent the direct connections between nodes and often suffers from the curse of dimensionality and excessive complexity. Network Representation Learning (NRL) can represent the information in the network with low-dimensional vectors, which both expresses the network structure and reduces the computational complexity.
In this paper, we propose an overlapping community detection method (NRLDP) based on network representation learning and density peaks, which not only considers the problem of network data representation, but also considers the problem of irregular community structure in the actual network.
The rest of this paper is arranged as follows. Section II summarizes the related work, and Section III describes the proposed NRLDP algorithm in detail. Section IV presents the experimental results of the NRLDP algorithm on synthetic and real network data sets and compares them with those of other algorithms. Conclusions and suggestions for future work are presented in Section V.

II. RELATED WORK
A. CLASSICAL OVERLAPPING COMMUNITY DETECTION METHODS
Classical overlapping community detection methods can be roughly divided into the following categories: clique percolation based methods, graph partitioning based methods, local expansion and optimization based methods, and label propagation based methods.
In 2005, Palla et al. proposed the Clique Percolation Method (CPM) [4] to detect overlapping communities. The main idea is to detect overlapping communities based on k-cliques, where a k-clique is a fully connected subgraph with k nodes. If two k-cliques share k-1 nodes, they are called adjacent k-cliques. The sets of adjacent k-cliques form the overlapping community structure detected by the CPM algorithm. In 2009, Shen et al. proposed the EAGLE algorithm [5], which combines the ideas of hierarchical clustering and maximal cliques. It also introduces a modularity function, EQ, to evaluate the quality of the detected overlapping communities. EAGLE considers not only the overlapping community structure of the network but also the hierarchical structure between communities. In addition, Farkas et al. [6] extended the CPM algorithm to weighted networks and proposed the CPMw algorithm; however, the algorithm stipulates that only k-cliques whose internal density exceeds a given threshold can become part of a community, which is a limitation.
There are two main families of methods based on graph partitioning. The first uses non-negative matrix factorization (NMF). Zhang et al. [7] were the first to use the NMF model for overlapping community detection, but it requires the number of communities to be specified in advance. Subsequently, many scholars proposed improved methods, such as SNMF [8]. In addition, NMFOSC [9] solves this problem through feature matrix preprocessing and ranking optimization, thereby discovering the structure of networks with an unknown number of communities. In 2018, Li et al. [10] proposed an overlapping community detection algorithm based on semi-supervised matrix factorization and random walk.
The second is spectral clustering, which performs graph partitioning mainly based on the eigenvectors of the adjacency matrix of the network. Some researchers prioritize local information [11], [12], but the optimization is still carried out over the entire feature space.
Researchers have also considered the local structure of the network. The local expansion algorithm [13] is mainly based on the idea of community growth: a community seed is locally expanded and optimized through a custom local expansion function until it becomes the community with the greatest benefit, which yields the community structure. Lancichinetti et al. [14] proposed the LFM algorithm, which randomly selects a node as the initial seed and then expands from the seed node to build a community until the fitness function reaches a local optimum. Liu et al. [15] proposed a locally optimal extended hierarchical clustering algorithm. Yu et al. [16] proposed SEOCO, a seed expansion overlapping community detection algorithm based on random walk. Bhatia [17] proposed a hierarchical method based on an autoencoder to initialize candidate seed nodes and to determine the number of communities by considering the network structure. The disadvantage of these methods is that the quality of the result depends on the selection of seeds, which leads to unstable partitions.
Methods based on label propagation also consider the local structure of the network. The idea is to initialize a label for each node in the network and then update the labels according to the labels of neighboring nodes until they no longer change. COPRA [18] and SLPA [19] are two classic algorithms; both extend the LPA algorithm [20] to detect overlapping communities. COPRA is based on multiple labels and is the first fuzzy overlapping community detection algorithm based on label propagation. It assigns a label sequence to each node and uses the parameter v to control the length of the sequence, that is, the maximum number of labels that each node can hold and hence the maximum number of communities to which a node can belong. SLPA is a speaker-listener label propagation algorithm, which propagates labels between nodes according to speaker-listener interaction rules. Label propagation methods are simple and efficient, but the community structures they detect are highly uncertain.
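The label propagation idea described above can be illustrated with a minimal sketch of plain, non-overlapping label propagation in the spirit of LPA [20]. It is not COPRA or SLPA; the function name `label_propagation`, the tie-breaking rule, and the toy graph are assumptions made for illustration only.

```python
import random
from collections import Counter


def label_propagation(adj, max_sweeps=100, seed=0):
    """Plain (non-overlapping) label propagation on an adjacency dict."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # every node starts with its own label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # asynchronous updates in random order
        changed = False
        for node in nodes:
            if not adj[node]:
                continue  # an isolated node keeps its label
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = sorted(lab for lab, c in counts.items() if c == top)
            if labels[node] in best:
                continue  # current label is already a most frequent one
            labels[node] = rng.choice(best)  # ties are broken at random
            changed = True
        if not changed:
            break  # no label changed during a full sweep
    return labels


# Two disconnected triangles: each should end up sharing a single label.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
labels = label_propagation(adj)
```

Because ties are broken at random, different seeds can give different labelings on less clear-cut graphs, which is one source of the uncertainty noted above.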
In 2014, Rodriguez et al. [21] proposed a density peak clustering algorithm. Compared with other clustering algorithms, this algorithm can cluster non-spherical data sets well. In real complex networks, the relationships between nodes are often intricate and the community structure is irregular. Therefore, an improved density peak algorithm applied to community detection can effectively divide irregular community structures. Deng et al. [22] improved the density peak algorithm for community detection. Their method combines Jaccard similarity and the shortest path into a composite similarity and derives the distance used as the input of the algorithm from the node similarity values, but it cannot detect overlapping communities.
VOLUME 8, 2020

B. NETWORK REPRESENTATION LEARNING
The main task of network representation learning is to represent the structural features of each node in the network with low-dimensional vectors for better data mining. Recent research on word embedding in the field of natural language processing has provided new ideas for the feature representation of network nodes. Perozzi et al. introduced word embedding techniques into network representation learning and proposed Deepwalk [23]. This result triggered a wave of research on network representation learning.
Deepwalk combines two models from different fields. One is the random walk, which is used to generate a large number of node sequences; these sequences play the role of the sentences in a natural language processing corpus. The other is the Skip-gram model proposed by Mikolov et al. in word2vec, which takes the generated node sequences as input and learns the vector representations of the nodes in the network. The framework of Deepwalk is shown in Figure 1.
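The random-walk half of this pipeline can be sketched as follows. This is a minimal illustration, not the Deepwalk reference implementation; the function name `random_walks`, its parameters, and the toy graph are assumptions.

```python
import random


def random_walks(adj, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated random walks; each walk is a list of node ids."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            # Extend the walk by moving to a uniformly chosen neighbour.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


# Hypothetical toy graph given as an adjacency dict.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adj)
```

Each walk then plays the role of a sentence. The walks could be passed, with node ids converted to strings, to any Skip-gram implementation (for example gensim's `Word2Vec` with `sg=1`) to obtain the node vectors.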

C. DENSITY PEAKS CLUSTERING
Density Peak Clustering (DPC) is a density-based clustering algorithm proposed by Rodriguez et al. in Science [21]. Based on the characteristics of the data distribution, DPC assumes that cluster centers have a large local density and that the relative distance between different cluster centers is large. Given the center points, it then allocates each remaining sample point to the cluster of a nearby point with greater local density. The algorithm consists of the following steps.
First, the DPC algorithm defines the local density ρ_i of node i as in Equations (1) and (2):
$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c) \tag{1}$$
$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \tag{2}$$
where ρ_i is the local density of node i, d_ij is the distance between node i and node j, and d_c is the cutoff distance. Second, the relative distance δ_i of node i is defined as in Equation (3). Then, the points with relatively large ρ_i and δ_i are selected as the cluster centers. Finally, each remaining sample point is assigned to the cluster of its nearest point with a higher local density.
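A minimal sketch of the two DPC quantities follows, using the cutoff-kernel density and the conventions of the original DPC formulation [21]. The function name `density_peaks` and the one-dimensional toy data are assumptions made for illustration.

```python
def density_peaks(dist, d_c):
    """Return (rho, delta) for a symmetric distance matrix given as nested lists."""
    n = len(dist)
    # Local density: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))  # distance to the nearest denser point
        else:
            # No denser point exists: use the largest distance, as in [21].
            delta.append(max(dist[i]))
    return rho, delta


# Five points on a line: a tight group {0, 1, 2} and a pair {10, 11}.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
rho, delta = density_peaks(dist, d_c=1.5)
```

In this toy example the middle point of the tight group has both the largest density and the largest relative distance, so it would stand out as a cluster center.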

III. NRLDP ALGORITHM
This section details the framework of the proposed NRLDP algorithm. The method consists of four main steps. First, we use Deepwalk to represent the structural information of each node in the network with a low-dimensional continuous vector, and calculate the distance between nodes from the similarity of their vector representations. Second, we use the degree of the node and the Local Clustering Coefficient (LCC) [24] to measure the local density of each node, and calculate the relative distance of each node. Then, we choose the points with higher local density and relatively large relative distance as the core points of the communities. Finally, we allocate the remaining nodes according to their degree of belonging to detect overlapping communities. The flow chart of the NRLDP algorithm is shown in Figure 2.

A. RELATIVE DISTANCE CALCULATION
The relationships between nodes in a complex social network are often represented by an adjacency matrix. Each element of the adjacency matrix of an unweighted network is 0 or 1: 0 means that there is no connection between the two nodes, and 1 means that the nodes are directly connected. In a weighted network, connected nodes are represented by their weights. In a real-world social network, most individuals have very few direct connections to other individuals, so only a small amount of connection information can be obtained and the relationships between nodes cannot be well represented. To solve this problem, this paper uses the Deepwalk algorithm to preprocess complex social network data, so that each node is represented by a low-dimensional vector. The node vector represents the structural information of the node in the network. Therefore, in the vector space, the greater the similarity of two node vectors, the smaller the distance between the nodes; conversely, the smaller the similarity, the larger the distance. Usually a network with n nodes can be regarded as a graph G = (V, E), where V = {v_1, v_2, ..., v_n} is the node set and E = {e_1, e_2, ..., e_m} is the edge set. The set of node vector representations obtained by network representation learning is X = {x_{v_1}, x_{v_2}, ..., x_{v_n}}, which can represent the nodes of either an unweighted or a weighted network.
According to the vector representations of nodes i and j, cosine similarity is used to calculate the similarity S(i, j) between them, as shown in Equation (4).
$$S(i,j) = \frac{\sum_{k=1}^{n} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{jk}^{2}}} \tag{4}$$
where x_ik is the k-th component of the vector of node i, and n is the dimension of the node vectors.
Then, according to the similarity between nodes i and j, we calculate the distance D(i, j) from node i to node j as shown in Equation (5).
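The similarity and distance computations can be sketched as below. Cosine similarity is standard, but Equation (5) is not reproduced in this text, so the distance used here, one minus the similarity, is only an assumed stand-in; the function names and the toy embeddings are also hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def distance(x, y):
    """Assumed distance: 1 - similarity (hypothetical stand-in for Equation (5))."""
    return 1.0 - cosine_similarity(x, y)


# Hypothetical two-dimensional node embeddings.
emb = {"u": [1.0, 0.0], "v": [2.0, 0.0], "w": [0.0, 3.0]}
d_uv = distance(emb["u"], emb["v"])  # parallel vectors: similarity 1, distance 0
d_uw = distance(emb["u"], emb["w"])  # orthogonal vectors: similarity 0, distance 1
```

Any monotonically decreasing function of the similarity would preserve the ordering of node pairs; the choice matters mainly for the scale of the relative distance δ.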

B. LOCAL DENSITY CALCULATION
In the density peak clustering algorithm, according to Equations (1) and (2), the local density of node i is the number of nodes within the cutoff distance of node i. However, in a network topology, nodes do not exist independently; they are connected by the edges of the network. A direct connection between two nodes means a closer relationship, while an indirect connection means a weaker one. Therefore, the local density of a node cannot be calculated simply from the number of nodes around it. In both Figure 3(a) and Figure 3(d), node A has 3 neighbors. However, it can be seen from the figure that the local density of node A in (a) is obviously greater than that of node A in (d). Therefore, when calculating the node density, both the number of neighbor nodes and the tightness of the connections around the node must be considered. This paper therefore uses the degree and the local clustering coefficient to calculate the local density of nodes, as shown in Equation (6).
where k_i represents the degree of node i. Using Equation (6) to calculate the local density of node A in the different network structures, the local density of node A in graph (a) is 3 + ...
According to Equation (3), if node i is the point of maximum local density, the relative distance δ_i of node i is the distance between node i and the node j that is closest to node i. If node i is not the point of maximum local density, the relative distance δ_i of node i is the distance between node i and the closest node j whose local density is higher than that of node i.
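The following sketch illustrates a density built from the degree and the local clustering coefficient. Equation (6) is not reproduced in this text, so the additive form used here (degree plus LCC) is an assumption, chosen because it is consistent with the partial value "3 + ..." above; the function names and the two toy graphs are hypothetical and are not the graphs of Figure 3.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def local_density(adj, node):
    """Assumed form of Equation (6): degree plus local clustering coefficient."""
    return len(adj[node]) + local_clustering(adj, node)


# Node A has three neighbours in both graphs; only the links among them differ.
dense = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"},
         "C": {"A", "B", "D"}, "D": {"A", "B", "C"}}
sparse = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
rho_dense = local_density(dense, "A")
rho_sparse = local_density(sparse, "A")
```

With the same degree, node A gets a higher density when its neighbours are also connected to each other, which is the behaviour the paragraph above asks for.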

C. SELECTION OF THE CORES
According to the previous section, the local density ρ and relative distance δ of each node in the network can be obtained. In the DPC algorithm, the key step is to select the points with larger ρ and δ as cluster centers according to the decision graph. However, only a few points have clearly larger ρ and δ, and in real networks community sizes vary widely, so the centers of some communities have a relatively small density. Such points do not stand out in the decision graph. Therefore, two different situations may arise: either both ρ and δ are large, or one is relatively large and the other is relatively small. To select the community center points more accurately, this paper selects them in three steps.
First, according to the decision graph, the points whose local density ρ and relative distance δ are both significantly larger are selected as center points.
Next, to handle the second case, the NRLDP algorithm selects community center points by calculating the product of ρ and δ; this product, denoted γ, is the center value of the node. Before calculating γ, ρ and δ need to be normalized so that they lie in the same range. The calculation formulas are shown in Equations (7) and (8).
Then, we use Equation (9) to calculate the center value of node i and arrange the center values in ascending order, which shows how the node center values change from small to large. The larger the center value, the more likely node i is to become a center point.
Finally, clusters in the DPC algorithm are disjoint, whereas networks in community detection often contain overlapping nodes, which are generally located at the edges of communities. The relative distance between an overlapping node and its nearest node with higher local density is therefore relatively large; if the overlapping node belongs to multiple communities, it may also have a high local density, so it is easily selected as a center point in the center list. The neighbor nodes of an overlapping node are closer and more tightly connected to one particular community, and their local density is larger than that of the overlapping node. To exclude overlapping nodes from the center points, this paper computes the average local density of each candidate's neighbor nodes and deletes from the center list the nodes whose local density is less than that average.
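The last two selection steps can be sketched as follows. Equations (7) to (9) are not reproduced in this text, so min-max normalization and the product γ = ρ* · δ* are assumptions; the cut-off `n_candidates`, the function names, and the numbers in the toy example are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; assumed form of Equations (7) and (8)."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return {node: (v - lo) / span for node, v in values.items()}


def select_cores(adj, rho, delta, n_candidates):
    """Rank nodes by gamma = rho* times delta*, then drop likely overlapping nodes."""
    rho_n, delta_n = min_max(rho), min_max(delta)
    gamma = {node: rho_n[node] * delta_n[node] for node in adj}
    candidates = sorted(adj, key=gamma.get, reverse=True)[:n_candidates]
    cores = []
    for node in candidates:
        neigh_rho = [rho[nb] for nb in adj[node]]
        mean_neigh = sum(neigh_rho) / len(neigh_rho) if neigh_rho else 0.0
        if rho[node] >= mean_neigh:  # keep nodes at least as dense as their neighbours
            cores.append(node)
    return cores, gamma


# Two triangles bridged by node "o", with made-up density and distance values.
adj = {"a": ["b", "c", "o"], "b": ["a", "c"], "c": ["a", "b"],
       "x": ["y", "z", "o"], "y": ["x", "z"], "z": ["x", "y"], "o": ["a", "x"]}
rho = {"a": 5.0, "b": 3.0, "c": 3.0, "x": 5.0, "y": 3.0, "z": 3.0, "o": 3.5}
delta = {"a": 0.9, "b": 0.1, "c": 0.1, "x": 0.8, "y": 0.1, "z": 0.1, "o": 0.7}
cores, gamma = select_cores(adj, rho, delta, n_candidates=3)
```

In this toy example the bridging node "o" enters the candidate list because of its large relative distance, but it is removed because its neighbours "a" and "x" are denser on average.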

D. OVERLAPPING ALLOCATION OF REMAINING NODES
Based on the center points selected by the NRLDP algorithm in the previous section, the remaining nodes are allocated to detect overlapping communities. In the density peak clustering algorithm, each remaining sample point is allocated to the cluster of its nearest point with greater local density, so each remaining sample point belongs to only one cluster. Therefore, this paper improves the method to realize the overlapping allocation of the remaining nodes. The specific allocation steps are as follows. First, we assume that the number of community center points obtained is m and set the community labels as c = {c_1, c_2, ..., c_m}.
Second, we calculate the degree of belonging of node i, represented by p_{i,c} = {p_{i,c_1}, p_{i,c_2}, ..., p_{i,c_m}}, where each value lies in the range from 0 to 1. The greater the degree of belonging of a node to a certain community, the greater the probability that the node is assigned to that community. We first set the degree of belonging of each center point to its own community to 1 and its degree of belonging to all other communities to 0. For example, if the community label corresponding to center point 3 is c_2, then p_{3,c_2} is 1 and the rest are 0, that is, p_{3,c} = {0, 1, 0, ..., 0}. Then, we use Equation (10) to calculate the degree of belonging of the remaining nodes.
where S(i, j) is the similarity between nodes i and j, and neigh is the set of the N neighbor nodes with greater local density than node i. Most people in social networks are influenced by their friends and tend to be with them, so a person is more likely to be in the same community as his or her friends. When calculating the degree of belonging of node i, this paper therefore considers both the similarity between node i and its N neighbor nodes with greater local density and the degrees of belonging of those N neighbor nodes. Equation (10) means that the greater the similarity between a neighbor node j and node i, and the greater the degree of belonging of j to a certain community, the greater the degree of belonging of node i to that community.
Third, if the ratio of the degree of belonging of node i to community c_r to its degree of belonging to community c_t is no less than the threshold σ, that is, p_{i,c_r} / p_{i,c_t} ≥ σ, then node i is allocated to communities c_r and c_t at the same time. In other words, when the degrees of belonging of node i to c_r and c_t differ little, node i is an overlapping node that belongs to both community c_r and community c_t.
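The allocation steps can be sketched as below. Equation (10) is not reproduced in this text, so the similarity-weighted average of the neighbours' degrees of belonging is an assumed form; comparing every community against the strongest one is one reading of the ratio test; the function name `allocate` and all numbers in the toy example are hypothetical.

```python
def allocate(adj, rho, sim, cores, sigma=0.9):
    """Assign every node to one or more communities, densest nodes first."""
    m = len(cores)
    belong = {}
    for idx, core in enumerate(cores):
        # A center point belongs fully to its own community.
        belong[core] = [1.0 if c == idx else 0.0 for c in range(m)]
    for node in sorted(adj, key=rho.get, reverse=True):
        if node in belong:
            continue
        # Denser neighbours have already received a membership vector.
        neigh = [nb for nb in adj[node] if nb in belong and rho[nb] > rho[node]]
        total = sum(sim[node][nb] for nb in neigh)
        if total == 0:
            belong[node] = [0.0] * m
            continue
        # Assumed Equation (10): similarity-weighted average over denser neighbours.
        belong[node] = [
            sum(sim[node][nb] * belong[nb][c] for nb in neigh) / total
            for c in range(m)
        ]
    communities = {}
    for node, p in belong.items():
        top = max(p)
        # Keep every community whose ratio to the strongest one reaches sigma.
        communities[node] = [c for c in range(m) if top > 0 and p[c] / top >= sigma]
    return belong, communities


# Two triangles bridged by node "o"; edge values are made-up similarities.
edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.8,
         ("x", "y"): 0.9, ("x", "z"): 0.9, ("y", "z"): 0.8,
         ("o", "a"): 0.6, ("o", "x"): 0.6}
adj, sim = {}, {}
for (u, v), s in edges.items():
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)
    sim.setdefault(u, {})[v] = s
    sim.setdefault(v, {})[u] = s
rho = {"a": 5.0, "x": 5.0, "o": 3.5, "b": 3.0, "c": 3.0, "y": 3.0, "z": 3.0}
belong, communities = allocate(adj, rho, sim, cores=["a", "x"])
```

Here the bridging node "o" is equally similar to both center points, so its two degrees of belonging are equal and it is placed in both communities.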

IV. EXPERIMENTAL RESULTS AND ANALYSIS
To verify the feasibility and effectiveness of the proposed method, we compare it with algorithms from recent years on real network data sets and synthetic network data sets. The experiments were run on a PC (Windows 10 64-bit, Intel(R) Core(TM) i5-7400 CPU @ 3.00 GHz, 8 GB RAM).

A. EXPERIMENTAL DATASETS
1) REAL WORLD NETWORK
Karate Club Network [25]. This data set describes the friend relationship between members of a karate club in a certain university in the United States.
Dolphin Network [26]. This data set describes the relationships between 62 bottlenose dolphins in New Zealand. An edge in the network indicates that two dolphins often move together.
Football is an American college football game network [27]. This data set describes the games played between 115 college teams, divided into 12 leagues, in a football league in the United States.
Polbooks is a network of American political books [28]. This data set describes the sales network of American political books on Amazon during the 2004 US presidential election.
Lesmis is a network of characters in the novel [29]. This data set describes the network of characters in Hugo's famous novel ''Les Miserables''.
Polblogs is an American blog political orientation network [30]. This dataset describes the citations of blogs of different political orientations during the 2004 US presidential election.
The specific information of the above real public datasets is shown in Table 1.

2) SYNTHETIC NETWORK
The experiments in this paper use the LFR Benchmark [31], an artificial benchmark network program that has been widely used in recent years, to generate synthetic network data sets. The networks generated by this program exhibit a clear community structure and also simulate real networks well. The parameters of the LFR benchmark program are adjusted to generate synthetic networks of different scales and connection strengths. The specific parameters are described in Table 2. Among them, the value of µ ranges from 0 to 1. The larger the value of µ, the weaker the connections within communities, the more complex the network structure, and the more difficult it is to detect the community structure in the network.
This paper uses the LFR benchmark program to generate 5 groups of artificial networks of different scales. The basic parameter settings of the experiment are shown in Table 3.

B. EXPERIMENTAL METHOD AND EVALUATION INDICATOR
1) EXPERIMENTAL METHOD
In order to verify the effectiveness and feasibility of our proposed method, we compare NRLDP with related and classic algorithms from recent years. The selected comparison algorithms are the OCDRDD [32], DCN [33], CDRS [34], LDC [35], Multiscale [36], and COPRA [18] algorithms. We compare these algorithms on the real and synthetic network data sets using different evaluation indicators to assess the accuracy of the NRLDP algorithm.

2) EVALUATION INDICATOR
The evaluation indicators used in this paper are EQ [37] and Normalized Mutual Information [38] (NMI) which are commonly used in overlapping community detection algorithms.
The overlapping modularity (EQ) is used to evaluate the quality of the overlapping community structure. The closer the EQ value is to 1, the better the quality of the overlapping community structure found by the algorithm. The definition of EQ is as follows.
$$EQ = \frac{1}{2m} \sum_{l} \sum_{i \in c_l,\, j \in c_l} \frac{1}{O(i)\,O(j)} \left[ A(i,j) - \frac{k(i)\,k(j)}{2m} \right]$$
where m represents the number of edges in the network, and c_l represents the l-th community. O(i) and O(j) represent the number of communities to which nodes i and j belong, respectively. A(i, j) represents the adjacency matrix of the network. k(i) and k(j) represent the degrees of nodes i and j, respectively.
Normalized mutual information (NMI) is an information-theoretic measure of the difference between two sets. It is used to evaluate the difference between the partition of the network produced by the algorithm and the real partition. The closer the NMI value is to 1, the closer the result is to the real partition. The definition of NMI is as follows.
$$NMI(C_A, C_B) = \frac{-2 \sum_{i} \sum_{j} N_{ij} \log\left(\frac{N_{ij}\, N}{N_i\, N_j}\right)}{\sum_{i} N_i \log\left(\frac{N_i}{N}\right) + \sum_{j} N_j \log\left(\frac{N_j}{N}\right)}$$
where C_A is the standard partition, C_B is the partition produced by the algorithm, N_ij represents the number of nodes shared by the i-th community in C_A and the j-th community in C_B, N is the total number of nodes in the network, N_i represents the number of nodes in the i-th community in C_A, and N_j represents the number of nodes in the j-th community in C_B.
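As a concrete illustration of the overlapping modularity, the sketch below evaluates EQ for a cover given as a list of node sets, following the definition of Shen et al. in terms of m, O(i), A(i, j), and k(i). The function name `overlapping_modularity` and the toy graph are assumptions made for illustration.

```python
def overlapping_modularity(adj, communities):
    """EQ of a cover; `communities` is a list of node sets that may overlap."""
    m = sum(len(nbs) for nbs in adj.values()) / 2  # number of undirected edges
    membership = {node: 0 for node in adj}
    for com in communities:
        for node in com:
            membership[node] += 1  # O(i): number of communities containing i
    total = 0.0
    for com in communities:
        for i in com:
            for j in com:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)


# Two triangles joined by the single edge c-x.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
       "x": {"c", "y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
eq_split = overlapping_modularity(adj, [{"a", "b", "c"}, {"x", "y", "z"}])
eq_single = overlapping_modularity(adj, [set(adj)])
```

When no node is shared between communities, every O(i) equals 1 and EQ reduces to the ordinary modularity; placing all nodes in a single community gives a value of zero.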

C. PARAMETER EXPERIMENT
1) THRESHOLD IN NRLDP ALGORITHM
In order to verify the influence of the threshold σ, which the NRLDP algorithm uses when assigning nodes, on the detection results for different networks, we carried out experiments on 5 different networks, N_1, N_2, N_3, N_4, and N_5. As shown in Figure 4, for different σ values, the EQ and NMI values differ little on data sets of different scales. It can be seen that the threshold σ has little effect on the network detection results. Therefore, the threshold is set to σ = 0.9 in the following experiments.

2) PARAMETERS IN SYNTHETIC NETWORK
Different network structures may lead to different algorithm performance. To test which network parameters affect the NRLDP algorithm, this experiment is carried out from the following aspects. First, we test the influence of the node degree, the network scale, and the community scale on the algorithm. The parameter k is the average node degree, and the other parameters are held fixed. k is set to 5 and 10, with maxk set to 20 and 50, respectively. The community size is set to (20, 100) for large communities and (10, 50) for small communities. Experiments were carried out on 5 networks of different sizes, N_1, N_2, N_3, N_4, and N_5. As shown in Figure 5(a), across the different network sizes the NMI value is generally larger when the average node degree is 10. Because the average degree of most large-scale real social networks is around 10, the experimental results are in line with the actual situation. According to Figure 4 and Figure 5(a) and (b), most of the curves change little as the network scale increases, so the network size has little effect on the NRLDP algorithm. Therefore, the next experiment uses the network size N_1 = 1000 to test the influence of the internal connection strength and the overlap on the algorithm. The parameter µ is the internal connection strength coefficient of the community and is set to 0.1 and 0.3. On is set to 100 and 500, and Om is set to 2, 4, 6, and 8. In Figure 5(c), the larger the µ value, the more complex the community structure in the network and the more difficult it is to detect. In Figure 5(d), the degree of community overlap also has a certain impact on the algorithm: the greater the degree of overlap, the lower the algorithm performance.

D. EXPERIMENTAL RESULTS ON REAL WORLD NETWORK
In order to verify the method proposed in this paper, this section reports experiments on 6 real network data sets, with EQ as the measurement index, and compares NRLDP with 6 community detection algorithms. The results are shown in Table 4.
According to the EQ values of each algorithm on the real-world network data sets, the EQ value of the NRLDP algorithm on all 6 data sets is greater than that of the DCN, LDC, and Multiscale algorithms. Its EQ value is slightly smaller than that of the OCDRDD algorithm on the Polbooks data set and larger on all the others. On the Football and Lesmis data sets, the CDRS and COPRA algorithms achieve better results, but overall the average EQ value of the NRLDP algorithm is greater than that of the 6 comparison algorithms. Therefore, on the real-world network data sets, the NRLDP algorithm performs better on average than the OCDRDD, DCN, CDRS, LDC, Multiscale, and COPRA algorithms.

E. EXPERIMENTAL RESULTS ON SYNTHETIC NETWORK
To further verify the effectiveness and feasibility of the NRLDP algorithm, the experiments in this section are carried out on the same synthetic network, with 1000 nodes, k = 10, µ = 0.1, community sizes from 20 to 100, and On = 10%, and normalized mutual information (NMI) is selected as the measurement indicator. The selected comparison algorithms are the DCN, CDRS, LDC, Multiscale, and COPRA algorithms, and the results are shown in Figure 6. It can be seen from Figure 6 that, although the NMI value of the NRLDP algorithm decreases as the number of communities Om to which overlapping nodes belong increases, its NMI value for Om below 4 is greater than that of the other algorithms. In most real-world cases, the number of communities to which an overlapping node belongs does not exceed 5. Therefore, the NRLDP algorithm has certain advantages in this setting.

V. CONCLUSION
This paper proposes an overlapping community detection method (NRLDP) based on network representation learning and density peaks. In order to deal with high-dimensional and complex network data, NRLDP first uses network representation learning to represent unweighted or weighted networks with low-dimensional vectors. Then it uses cosine similarity to calculate the distance between nodes and improves the local density calculation method. The core nodes are selected according to the relative distance and local density, and the remaining nodes are finally allocated to detect the overlapping community structure of the unweighted or weighted network. The experimental results on both synthetic and real networks demonstrate that the proposed method is accurate and effective for overlapping community detection.
For future work, we will consider using low-dimensional vectors to represent the network with node and edge attribute information, and further enhance the algorithm to detect more complex network community structures.