Local Peaks-Based Clustering Algorithm in Symmetric Neighborhood Graph

Density-based clustering methods have found many applications in data mining, yet most of them still perform poorly on data sets with extremely uneven distributions, such as manifold or ring data. This paper proposes a novel clustering method based on local peaks in the symmetric neighborhood. Local peaks are points with maximum density at the local level. While searching for local peaks, all data except outliers can be divided into a number of small clusters according to the local peak in each point’s neighborhood. A graph-based scheme then merges these small clusters based on their similarity in the symmetric neighborhood graph, after which each outlier is assigned to the closest cluster. The proposed method has been tested on a variety of artificial and real data sets and on a real building data set, and compared against other popular density-based methods and other algorithms.


I. INTRODUCTION
Clustering is an indispensable and fundamental method for data mining, which attempts to classify data objects into categories or clusters on the basis of their similarity. Clustering has found tremendous applications in many fields, such as business intelligence, pattern recognition and cloud computing [1]. Up to now, various algorithms have been proposed, including partitioning methods [2], [3], density-based clustering [4]-[6], spectral clustering [7], [8], hierarchical clustering [9], [10], and distribution-based methods [11].
Partitioning methods first partition a data set into k clusters and then use an iterative control strategy to optimize an objective function. k-means [12] and k-medoids [13] are the major representatives; however, these methods cannot identify non-spherical clusters. Hierarchical clustering can correctly cluster non-spherical data sets. It comes in two kinds, the top-down method and the down-top (bottom-up) method. The top-down method regards the whole data set as one cluster and then divides that cluster into the given number of groups according to a rule. The down-top method does just the opposite.
The associate editor coordinating the review of this manuscript and approving it for publication was Emre Celebi.
Chameleon [9] is a representative algorithm that integrates the top-down and down-top methods. Chameleon constructs the k-nearest neighbor graph of the data set and divides the graph into a large number of smaller subgraphs through a graph-based partition algorithm. Each subgraph represents an initial cluster. Finally, Chameleon repeatedly merges initial clusters based on the similarity between clusters. However, Chameleon is susceptible to outliers.
Density-based clustering aims at detecting clusters of arbitrary shapes and spotting outliers, inspired by the observation that the center of each cluster in a data set usually has a higher density than the points around it. A well-known density-based method is DBSCAN [4], which measures cluster centers in terms of local density. The local density of a point is computed as the number of points within a given radius, whereby a point whose neighborhood is sufficiently populated may become a (local) core. DBSCAN then merges two initial clusters if they are density reachable. To achieve a parameter-free clustering technique, kNN-DBSCAN [5] introduces the k-nearest neighbors graph into DBSCAN, where k > 0. Another typical density-based method is the density peak clustering (DPC [14]) algorithm, which uses the local density and the distance from points of higher density to decide whether a point is a cluster center. After DPC locates the cluster centers through a decision graph constructed from these two measures, each remaining point is assigned to the same cluster as its nearest neighbor with higher density. Because it is difficult to determine a proper cut-off distance, which is crucial to cluster center selection, several researchers [15]-[21] adopted k-nearest neighbors (kNN), mutual k-nearest neighbors (MkNN) and natural neighbors (NN) to estimate the local density of each point. Some studies also consider residual error [22], [23] and gravitation [24] to improve the algorithm. However, DBSCAN, DPC and some of the improved algorithms may not successfully cluster data sets with various densities and overlapping levels.
To efficiently discover clusters with different densities, we recently proposed an improved algorithm based on DPC named DPC-SNR [25]. The algorithm uses reverse k-nearest neighbors to calculate the local density of each point, finds the global peak from the local density and the distance from points of higher density, and clusters outward from the global peak. This yields several big clusters and some tiny clusters that include outliers. The tiny clusters are then assigned to big ones according to the number of common edges. However, apart from outliers, DPC-SNR may not identify small clusters with fewer than k points, and it sometimes does not obtain the real number of clusters. Moreover, DPC-SNR does not perform well on data sets of different types.
Existing density-based algorithms such as DBSCAN and DPC therefore do not handle manifold data with different densities well. To solve this problem, we propose a new algorithm based on local peaks in the symmetric neighborhood graph, called LP-SNG. As shown for DPC-SNR, using reverse k-nearest neighbors to compute the local density of each point increases the difference between core points and normal points, so LP-SNG also adopts this approach. The symmetric neighborhood is also used in LP-SNG, which can strengthen the gap between clusters. We obtain the local peaks, which are points with a local density maximum in the symmetric neighborhood. The points in the symmetric neighborhood of each local peak belong to the same cluster as that local peak. This produces many small clusters, not including outliers. We then obtain the final clusters by repeatedly merging the clusters that have the maximum similarity in the symmetric neighborhood graph. Finally, the remaining points are assigned to the same clusters as their nearest points. The experimental results on synthetic data sets, real data sets and a building data set show that the proposed algorithm is more effective. The main contribution of this article is a new density-based algorithm that fits manifold data with different densities, built on a new perspective: local peaks in symmetric neighbors rather than the global peaks used in DPC and DPC-SNR.
The rest of this paper is organized as follows. Section II outlines the local density and symmetric neighborhood relationship. Section III describes our new clustering algorithm. Section IV performs experiments on a number of synthetic and real data sets, and analyzes the efficiency of our clustering method. This paper finishes with conclusions in Section V.

II. RELATED WORKS
In this section, we review the process of DPC, introduce the studies that DPC has inspired, and introduce the symmetric neighborhood relationship.

A. LOCAL DENSITY
Local density often appears in density-based clustering, especially in DBSCAN [4] and DPC [14]. Both relate the local density of a point to the number of neighbors within a certain range. DPC uses the cutoff distance $d_c$ to measure the local density $\rho_i$ of each point i:

$$\rho_i = \sum_{j \neq i} \chi\big(dist(i, j) - d_c\big) \tag{1}$$

where $\chi(a) = 1$ if $a < 0$ and $\chi(a) = 0$ otherwise, and dist(i, j) denotes the Euclidean distance between point i and point j. DPC-KNN [21] also uses this formula to obtain the local density. Rodriguez and Laio also use a Gaussian kernel function to express $\rho_i$:

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{dist(i, j)}{d_c}\right)^2\right) \tag{2}$$

where point j is eligible when dist(i, j) is less than $d_c$. $d_c$ is the only parameter that influences the two formulas above. However, we find that DPC performs badly on manifold data sets with different densities. If there are two manifold clusters with different densities, the two cluster centers are both still located in the high-density manifold as the value of $d_c$ changes.
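As a minimal sketch of the two DPC density estimates described above, the following Python snippet computes a cutoff-count density and a Gaussian-kernel density for a handful of 2-D points. The function names, the toy points and the value of `d_c` are illustrative assumptions, and the Gaussian version restricts the sum to points closer than `d_c`, following the eligibility condition stated in the text.

```python
import math

def cutoff_density(points, d_c):
    """Cutoff density: number of other points strictly closer than d_c."""
    n = len(points)
    return [
        sum(1 for j in range(n)
            if j != i and math.dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]

def gaussian_density(points, d_c):
    """Gaussian-kernel density over the other points closer than d_c."""
    densities = []
    for i, p in enumerate(points):
        total = 0.0
        for j, q in enumerate(points):
            d = math.dist(p, q)
            if j != i and d < d_c:  # eligibility condition from the text
                total += math.exp(-(d / d_c) ** 2)
        densities.append(total)
    return densities

# Toy data (hypothetical): three close points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
rho_cut = cutoff_density(points, d_c=0.5)
rho_gauss = gaussian_density(points, d_c=0.5)
```

Both estimates depend only on `d_c`, which is why the choice of cutoff distance matters so much for DPC; the isolated point receives zero density under either estimate.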
There are also other formulas for local density. Some studies keep the form of Eq. (2) and introduce nearest neighbors as the scope that limits the local density, which plays the same role as $d_c$ in Eq. (2). ADPC-kNN [19] uses kNN to compute $\rho_i$, while $d_c$ is a constant determined from $\mu$, the mean value of $d_i$ over all points, with $d_i = \max_{j \in kNN(i)} dist(i, j)$. Similarly, CDP [20] introduces the mutual k-nearest neighbors to obtain the local density $\rho_i$; however, $d_c$ remains a parameter there. FkNN-DPC [16] also uses kNN to compute the local density, but replaces the squared distance with the plain distance and has no parameter $d_c$. DPC-kNN-PCA [17] obtains the local density from kNN in a different form, in which the parameter p is a percentage of the number of data points. Some studies take the local density to be the reciprocal of the average distance over a range. STClu [15] therefore uses kNN as the range to obtain $\rho_i$, where k is an input parameter and kNN(i) refers to the k nearest neighbors of point i. NaNDP [18] uses a similar formula to obtain the local density, but uses natural neighbors to find an appropriate value of k. Comparing these formulas for $\rho_i$, we can see that they differ in the region of influence. Moreover, we think that the difference between core points and non-core points should be increased. Therefore, we choose reverse kNN as the region of influence, which reflects how much a point is valued by the other points. Local peaks should be surrounded by dense points, so their degree of value will be higher. Traditional kernel density estimation methods may be affected by parameter settings and cannot work well on some complex data sets, so Yan et al. [26] used a novel potential-based method to estimate density values.
VOLUME 8, 2020

B. SYMMETRIC NEIGHBORHOOD RELATIONSHIP
kNN was initially used for classification and has attracted the interest of many researchers [27]. It was then applied to clustering [15], [16], [28], with neighbors obtained by the Euclidean distance. Inspired by kNN, many studies consider other kinds of nearest neighbors, such as natural neighbors, mutual k-nearest neighbors [20], [33] and shared nearest neighbors [29].
kNN and reverse kNN together form the symmetric neighborhood relationship [30], which is the main idea behind mutual k-nearest neighbors and natural neighbors. Let D be a database, i and j be objects in D, and k be a positive integer. dist(i, j) denotes the Euclidean distance between objects i and j.
The k-distance of i, denoted as $k_{dist}(i)$, is the distance dist(i, o) between point i and a point o in D such that at least k points are no farther from i than o. The kNN of i is then the set of points within that distance:

$$kNN(i) = \{\, j \in D \setminus \{i\} \mid dist(i, j) \le k_{dist}(i) \,\}$$

Likewise, point i is regarded as a reverse kNN of j when j is one of the kNN of i, and the set of all such points i composes the reverse kNN of j, denoted as RkNN(j):

$$RkNN(j) = \{\, i \in D \mid j \in kNN(i) \,\}$$

The intersection of the kNN and the reverse kNN is used to estimate the density distribution around i. It is called the symmetric neighborhood of i, denoted as $SN_k(i)$; the idea is that two people are true friends only when each acknowledges the other:

$$SN_k(i) = kNN(i) \cap RkNN(i)$$
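The neighborhood relationships described above can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions: the function names and the toy points are hypothetical, and ties in distance are broken by index so that every point gets exactly k neighbors, whereas a definition based on the k-distance may return more than k points when distances tie.

```python
import math

def knn(points, k):
    """For each point index, return the indices of its k nearest other points."""
    n = len(points)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        result.append(others[:k])
    return result

def reverse_knn(neighbors):
    """RkNN(j): every i that lists j among its k nearest neighbors."""
    rknn = [[] for _ in neighbors]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            rknn[j].append(i)
    return rknn

def symmetric_neighborhood(neighbors, rknn):
    """SN_k(i): points that are both in kNN(i) and in RkNN(i)."""
    return [sorted(set(a) & set(b)) for a, b in zip(neighbors, rknn)]

# Toy data (hypothetical): three points on a line plus one far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
nbrs = knn(points, k=2)
rk = reverse_knn(nbrs)
sn = symmetric_neighborhood(nbrs, rk)
```

In this toy example the distant point still has two nearest neighbors, but no point chooses it in return, so both its reverse kNN and its symmetric neighborhood are empty. The pairwise search costs quadratic time in the number of points; a spatial index would be the usual choice for larger data.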

III. CLUSTERING ALGORITHM BASED ON SYMMETRIC NEIGHBORHOOD GRAPH
This section presents the details of the proposed clustering algorithm and analyzes its complexity. The algorithm is a new density-based clustering algorithm that uses the symmetric neighborhood relationship. The basic idea is as follows. First, find the symmetric neighborhood of each point. Then calculate the local density of each point using reverse kNN, and find the representative of each point to obtain the local peaks, which are the points with maximum local density in their symmetric neighborhood. In the accompanying figure, orange stars denote representatives and orange lines show the scope of each representative. Next, the remaining points are assigned to the cluster that their nearest local peak belongs to, which yields the initial clusters. Finally, the clusters with the maximum similarity in the symmetric neighborhood graph are merged repeatedly. Note that outliers do not belong to these initial clusters, and that the merging process ends early if the similarity between clusters is zero. After that, each outlier is assigned to the cluster its nearest neighbor belongs to. The whole process is similar to hierarchical clustering, while also incorporating the characteristics of density-based clustering. The clustering process of LP-SNG is illustrated in Fig. 1 and the details of the LP-SNG algorithm are given in Algorithm 1. The flow chart in Fig. 2 lays out the whole process of the algorithm. In particular, if the symmetric neighborhood of a point contains fewer than two points, the point is regarded as an outlier.
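The outlier rule and the final assignment step described above can be illustrated with a short Python sketch. The function names, the label `-1` for unclustered points and the toy inputs are assumptions made for illustration; the sketch also assumes that an outlier takes the label of its nearest non-outlier point and that at least one non-outlier exists.

```python
import math

def find_outliers(sn):
    """Flag points whose symmetric neighborhood holds fewer than two points."""
    return [len(neighbors) < 2 for neighbors in sn]

def assign_outliers(points, labels, is_outlier):
    """Give each outlier the label of its nearest non-outlier point."""
    final = list(labels)
    inliers = [j for j, flag in enumerate(is_outlier) if not flag]
    for i, flag in enumerate(is_outlier):
        if flag:
            nearest = min(inliers,
                          key=lambda j: math.dist(points[i], points[j]))
            final[i] = labels[nearest]
    return final

# Toy inputs (hypothetical): symmetric neighborhoods for k = 2 and the
# labels produced by clustering the non-outliers; -1 marks "not clustered".
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
sn = [[1, 2], [0, 2], [0, 1], []]
labels = [0, 0, 0, -1]
flags = find_outliers(sn)
final_labels = assign_outliers(points, labels, flags)
```

Setting outliers aside before merging and labeling them only at the end keeps them from acting as bridges between clusters during the merging phase.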
In Algorithm 1, more than one pair of clusters may have the maximum similarity at the same time, and these pairs should be merged at the same time. Such pairs may share a cluster, in which case all the clusters involved end up in the same merged cluster. For example, if (c 1 , c 3 ) and (c 1 , c 5 ) are two pairs of clusters with the maximum similarity, then c 1 , c 3 and c 5 are combined into one cluster. Likewise, c 2 , c 4 and c 6 belong to the same cluster if (c 2 , c 4 ) and (c 6 , c 4 ) have the maximum similarity at the same time.
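One way to realize this simultaneous merging is to treat the tied pairs as edges and merge every connected group of clusters. The Python sketch below uses a small union-find structure for that purpose; this is an illustrative choice rather than the procedure of Algorithm 1, and the function name and cluster names are hypothetical.

```python
def merge_tied_pairs(pairs):
    """Group clusters that are linked through any chain of tied pairs."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for c in list(parent):
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())

# The two situations from the example in the text.
pairs = [("c1", "c3"), ("c1", "c5"), ("c2", "c4"), ("c6", "c4")]
merged = merge_tied_pairs(pairs)
```

For the pairs in the text's example, this groups c1, c3 and c5 together and c2, c4 and c6 together, regardless of the order in which the tied pairs are listed.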
The similarity between clusters is based on the weights of the edges and the number of edges between them. Each edge $v_{ij}$ between point i and point j carries a weight, and the similarity between cluster $c_a$ and cluster $c_b$ is computed from these weights and from $V_{c_a c_b}$, which means the number of edges between cluster $c_a$ and cluster $c_b$ in the symmetric neighborhood graph.

A. LOCAL PEAKS IN SYMMETRIC NEIGHBORHOOD GRAPH
In general, searching for the kNN of point i returns at least k results, while searching for its RkNN may return zero, one or many results. Since the influence of neighbors on a point should be expanded, the reverse kNN is more suitable for calculating the local density, which makes local peaks easier to identify. The local density is defined over RkNN(i), the reverse kNN of point i, with the same formula as in [25].
The closing steps of Algorithm 1 read:
            Merge all clusters in Q, keep the result in the cluster represented by the first position in Q, and delete the other clusters;
          end if
        end for
      end if
      Update the similarity Sim between cluster C i and the other clusters;
    end while
The way local peaks are searched is inspired by [10]. After obtaining the local density of each point, we find the local peaks in the symmetric neighborhood.
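The search for local peaks can be sketched as follows in Python. Two points are assumptions rather than details taken from the text: the density used here is simply the size of RkNN(i), a stand-in for the reverse-kNN density formula of [25], and a point is kept as a peak when no symmetric neighbor has a strictly higher density, so tied neighbors can both be peaks. The function name and the toy inputs are hypothetical.

```python
def local_peaks(density, sn):
    """Indices whose density is maximal within their symmetric neighborhood."""
    peaks = []
    for i, neighbors in enumerate(sn):
        if len(neighbors) < 2:
            continue  # fewer than two symmetric neighbors: treated as an outlier
        if all(density[i] >= density[j] for j in neighbors):
            peaks.append(i)
    return peaks

# Toy inputs (hypothetical): reverse kNN lists and symmetric neighborhoods
# of four points for k = 2.
rknn = [[1, 2], [0, 2, 3], [0, 1, 3], []]
sn = [[1, 2], [0, 2], [0, 1], []]
density = [len(r) for r in rknn]  # stand-in density: size of RkNN(i)
peaks = local_peaks(density, sn)
```

Because each comparison is restricted to the symmetric neighborhood, a peak only needs to dominate its mutual neighbors, which is what lets low-density clusters keep peaks of their own.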

B. THE TIME COMPLEXITY ANALYSIS
The proposed algorithm mainly contains three steps: obtaining the symmetric neighborhood, partitioning the data set into initial clusters by searching for local peaks in the symmetric neighborhood, and merging these initial clusters. The first step contains getting  Assume that the number of initial clusters is $N_c$ ($N_c \ll N$). Since the similarity between two clusters is stored in a matrix beforehand, the time complexity of computing the similarity is $O(l^2 N_c^2)$. If only one pair has the maximum similarity each time, the time complexity of merging these clusters in each iteration is $O(l^2 N_c^2)$, which is the worst case. In practice, we find that more than one pair ($N_p$ pairs) has the maximum similarity each time, so the time complexity of merging these clusters in each iteration is O(l 2 N 2 c 3N 2 p ) and the number of iterations is less than the difference between $N_c$ and the true number of clusters. Therefore, the overall time complexity of the proposed algorithm is $O(N^2)$.

IV. EXPERIMENTS
To demonstrate the efficiency of LP-SNG, we compare the proposed method with the DPC-SNR, DBSCAN, DPC, AP [31] and k-means algorithms on synthetic and real data sets, including building data. The code of the DPC algorithm was provided by its authors, and DBSCAN, AP and k-means were implemented with MATLAB R2017b. We use the parameters mentioned in the original papers to conduct these experiments: the number of nearest neighbors k in DPC-SNR; the percent p in DPC; the damping coefficient lam, the iteration invariant number convits and the maximum number of iterations maxits in AP; and the distance Eps and the number of points within that distance MinPts in DBSCAN. In particular, the DPC implementation by its authors [39] takes as $d_c$ the distance value at p percent of the data set when the distance values are sorted in descending order. Therefore, for simplicity, we consider the parameter p rather than $d_c$ in this paper. k-means has no parameter other than k, the number of clusters. We ran multiple experiments and report the best result of each method.
Two indexes, clustering accuracy (Acc) [10], [17] and Normalized Mutual Information (NMI) [10], [32], are used to evaluate the clustering performance in these experiments; both are positive indexes. Acc is defined in terms of $r_i$, the real cluster label, $s_i$, the serial number obtained by clustering, and the indicator $\delta$, where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise. A larger value of Acc means better clustering performance. NMI is defined from $MI(X, Y)$, the mutual information between two random variables X and Y, and $H(Z)$, the entropy of a random variable Z. A larger value of NMI likewise means better clustering performance.
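For readers who want to reproduce such scores, the following Python sketch computes both indexes under common conventions, which are assumptions here and not necessarily the exact definitions used in the experiments: Acc maps predicted labels to true labels by the best one-to-one assignment (found by brute force, so only practical for a handful of clusters), and NMI normalizes the mutual information by the geometric mean of the two entropies. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def accuracy(true_labels, pred_labels):
    """Acc under the best one-to-one mapping of predicted to true labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for r, s in zip(true_labels, pred_labels)
                   if mapping[s] == r)
        best = max(best, hits)
    return best / len(true_labels)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def nmi(x, y):
    """MI normalized by sqrt(H(X) * H(Y)); assumed normalization."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # convention chosen here for a single-cluster labeling
    return mutual_information(x, y) / math.sqrt(hx * hy)

true = [0, 0, 1, 1]
pred = [1, 1, 0, 0]   # same partition, labels swapped
acc_score = accuracy(true, pred)
nmi_score = nmi(true, pred)
```

Both scores are invariant to how the clusters are numbered, which is why a prediction that swaps the two label names still scores as a perfect match.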

A. CLUSTERING ON ARTIFICIAL DATA SETS
We choose eight artificial data sets to demonstrate the efficiency of LP-SNG; their details are shown in Table 1. Data set 1 and Data set 4 consist of several irregular shapes. Data set 2 contains four rings with the same density, which also appear in Data set 3 together with two spherical clusters of different densities. Data set 7 has a ring with different densities and a spherical cluster with the same density. The remaining data sets are dominated by manifold data with different densities. The Acc and NMI scores of the algorithms are shown in Table 2, and the running times are listed in Table 5. Note that nothing is done about outliers in DBSCAN. From Table 5, we can see that LP-SNG always has a higher running time, because calculating the similarity between clusters takes up three-quarters of the total time. Moreover, more local peaks mean more iterations, which greatly increases the running time.
The clustering results of the proposed algorithm can be seen in Fig. 3. Symmetric neighbors can weaken the connection between clusters [38], and LP-SNG divides the data set into many small initial clusters with the help of local peaks in symmetric neighbors, which makes LP-SNG perform well on manifold and ring data. Moreover, LP-SNG removes outliers temporarily to make the boundaries of clusters clearer. Table 2 shows that LP-SNG obtains higher Acc and NMI scores, which indicates that LP-SNG is robust to outliers and more powerful than the other clustering algorithms on data sets with extremely uneven distributions and irregular shapes.
DPC-SNR also performs well on manifold and ring data with different densities and overlapping levels (Fig. 4). However, DPC-SNR cannot identify tiny clusters on Data set 7, because it initially treats a tiny cluster as an individual cluster but later must attach it to a large cluster. Although DPC-SNR can recognize the clusters on Data set 4, it mistakes two clusters that are close together for a single cluster on Data set 1.
For the DPC algorithm, we adopt the Gaussian function to obtain the local density. The results in Fig. 5 show that DPC cannot select the right cluster centers on manifold and ring data with extremely uneven distributions or on ring data with even distributions. DPC is also ineffective on the data set with different shapes, and it even identifies two close clusters as one cluster.
As Table 2 shows, DBSCAN has a certain capacity to cluster ring data and data with different shapes. DBSCAN treats clusters with three points as outliers, so it identifies only two clusters on Data set 6, just like DPC-SNR. However, DBSCAN performs badly on spherical and manifold data with different densities and overlapping levels, and it cannot even identify the correct number of clusters on spherical data (Fig. 6). DBSCAN also splits a manifold cluster into four clusters with a few outliers on Data set 4.
AP performs badly on manifold and ring data with different densities and overlapping levels and on data with different shapes (Fig. 7), and it cannot identify the true clusters on ring data with the same density. Furthermore, AP often identifies the wrong number of clusters. AP can, however, correctly identify some spherical clusters.
Although k-means has the shortest running time, it is also not suitable for manifold and ring data (Fig. 8). Since the number of clusters in k-means is given, the problem of an incorrect number of clusters does not arise. The algorithm specifies the cluster centers randomly, while the location of the cluster centers largely determines the clustering accuracy. Under the principle of nearby assignment, a cluster absorbs points that should belong to other clusters, especially on Data set 6.

B. CLUSTERING ON REAL DATA SETS
To further demonstrate the efficiency of the proposed algorithm, we compare our algorithm with the five algorithms mentioned above on several benchmark real data sets from UCI [41], which are often used in clustering or classification. The detailed information about these data sets is shown in Table 3; the data sets are preprocessed to better serve the clustering algorithms. The comparison of Acc and NMI scores is listed in Table 4. The best results are shown in bold. The results show that the Acc and NMI scores of LP-SNG are higher than those of the AP, DBSCAN, DPC and k-means algorithms on most data sets. Moreover, LP-SNG obtains an improvement of almost fifteen percent on the Wisconsin_PBC, Haberman_S, Dermatology, Spect_H, Wisconsin_DBC, Car_E and Ecoli data sets in terms of Acc. LP-SNG performs especially well on the Wisconsin_PBC, Liver_D, Dermatology, Wisconsin_DBC and Ecoli data sets in terms of both Acc and NMI. DBSCAN gets the best result on Haberman_S and Hayes_R in terms of NMI, and AP gets the best Acc on Glass and Sonar and the best NMI on Spect_H. Moreover, k-means performs well on the Ionosphere data set. However, DPC-SNR, DBSCAN and AP sometimes cannot identify the true number of clusters.
The running time of each algorithm can be seen in Table 6. When the data set is small, there is not much difference in running time between LP-SNG and DPC-SNR. However, when the data set is big, the running time of LP-SNG is longer than that of DPC-SNR, because LP-SNG needs to spend more time repeatedly calculating the similarity between clusters and finding the clusters with maximum similarity.

C. CLUSTERING ON BUILDING DATA SETS
We collected data from 10 communities in Chongqing, China, each with a concentration of buildings. The data include the number of building floors, the ground floor area, the building perimeter, the building height, and the X and Y coordinates of the building. The center of each building is taken as the point that gives the building's coordinates. An overview of these buildings is shown in Fig. 9 and the data set is available in [42]. The clustering results and running times of the six algorithms are shown in Table 7. LP-SNG obtains higher values than the other five algorithms. Moreover, only LP-SNG, DPC and k-means obtain the true number of clusters.

D. STUDYING ON DIFFERENT VALUES OF k
We study the sensitivity of parameter k on the artificial data sets; the results are shown in Fig. 10. We find that within some interval of k, changing k has no effect on the results. The interval is large for data set 6 and small for data set 7. Complex data sets, such as data set 1, have only one optimal k value. We also ran several experiments on different data sets with different values of k. The results can be roughly divided into two categories, as shown in Fig. 11: in one the number of clusters is constant (Fig. 11(b)), and in the other the number of clusters fluctuates and then tends to stay the same (Fig. 11(a)). For the first category, different values of k have little effect on the values of NMI.

V. CONCLUSION
Most density-based algorithms remain unsatisfactory for clustering data sets with various densities and overlapping levels. This paper proposed a new clustering method, called LP-SNG, which introduces reverse k-nearest neighbors to estimate the local density of each data point. First, the data set is divided into many initial clusters by finding local peaks in the symmetric neighborhood. Local peaks are points with maximum local density in the symmetric neighborhood. Outliers are removed temporarily to make the boundaries between clusters clearer and to make the algorithm insensitive to noise. After that, we merge the initial clusters according to the connections between clusters. Finally, outliers are assigned to the clusters their nearest neighbors belong to. Experiments on various artificial and real data sets demonstrated that LP-SNG can successfully identify clusters regardless of their densities and overlapping levels, whereby it may outperform the DPC-SNR, AP, DPC and DBSCAN methods. As the parameter k may affect the efficiency of our method, how to estimate an optimal value for this parameter is left for future study.