A New Density Peak Clustering Algorithm Based on Cluster Fusion Strategy

When the density peak clustering algorithm deals with complex datasets, and in particular with multiple density peaks in the same cluster, the subjectively selected cluster centers are not accurate enough, and the allocation of non-center points is prone to cascading errors. To solve these problems, we propose a new density peak clustering algorithm based on a cluster fusion strategy. First, the algorithm screens out candidate cluster centers by setting two new thresholds, avoiding the influence of noise points and outliers. Second, the remaining data points are allocated as in the density peak clustering algorithm to obtain the initial clusters. Third, considering the structural characteristics and spatial distribution of datasets, new definitions of boundary points, inter-cluster intersection density and inter-cluster boundary density are provided. To correctly handle clustering problems with multiple density peaks in the same cluster, a new cluster fusion strategy is proposed, which not only corrects the cascading errors in the allocation of data points but also correctly selects the cluster centers. Finally, to test its effectiveness, the proposed clustering algorithm is compared with DPC-KNN, DPC, K-means and DBSCAN on nine synthetic datasets and six real datasets. The experimental results demonstrate that the clustering performance of the proposed algorithm outperforms that of the other algorithms.

With the in-depth study of clustering algorithms, scholars have adopted different processing methods for data and have successively proposed many excellent clustering algorithms.

(1) Partition-based clustering, which is divided into hard clustering and soft clustering. Hard clustering is represented by K-means [2], which initializes the cluster centers, establishes an objective function over the distances between each data point and the cluster centers, and assigns each data point to the cluster of its nearest cluster center. Soft clustering, also known as fuzzy clustering, introduces ideas from fuzzy mathematics into cluster analysis and is represented by fuzzy C-means clustering (FCM) [3], [4], in which data are classified by membership degree through optimizing an objective function. The FCM algorithm has been studied in increasing depth, and improved FCM algorithms are widely used in the field of image segmentation. Lei et al. [5] proposed a fast FCM clustering algorithm based on superpixels: a histogram is obtained by counting the pixels of the superpixel image, and the FCM algorithm is then applied to the combination of the superpixel image and the histogram. Superpixel images adapt better to irregular local spatial regions and help improve segmentation.

In recent years, density-based clustering algorithms have attracted more and more attention from scholars. In 2014, Rodriguez et al. [13] proposed the density peak clustering algorithm (DPC), which has since been widely applied [14], for example in face recognition [15], community personalized recommendation [16] and medicine [17], and plays an increasingly important supporting role. However, the DPC algorithm still has some shortcomings: selecting cluster centers manually from decision graphs is subjective; the algorithm is sensitive to the choice of parameters; knock-on (cascading) errors occur easily while the remaining data points are being allocated; and the clustering effect is poor on datasets with large distribution differences.

The main motivation of our work is to avoid the inaccuracy of subjective, manual selection of cluster centers, to reduce allocation errors for non-center points, and to improve the clustering performance on datasets with complex shapes and on clusters containing multiple density peaks. We therefore propose a new density peak clustering algorithm based on a cluster fusion strategy (CFDPC). The main innovations and contributions are summarized below.

(1) Candidate cluster centers are screened out by setting two new thresholds, and initial clusters are formed following density peak clustering, avoiding the influence of manual selection of cluster centers.

(2) New definitions of boundary points, inter-cluster intersection density and inter-cluster boundary density are proposed. They are suitable for datasets with complex shapes and lay the foundation for the proposed cluster fusion strategy, which corrects the cascading errors made during data point allocation and refines the initial clustering results.

(3) The proposed cluster fusion strategy correctly selects the cluster centers, especially when a cluster contains multiple density peaks, a case that some improved DPC algorithms ignore.

(4) The clustering performance of the CFDPC algorithm is tested on synthetic and real datasets and compared with four other advanced clustering algorithms. The experiments show that the CFDPC algorithm has higher clustering accuracy and robustness.

The rest of this paper is structured as follows. Section II reviews related improvements to the DPC algorithm. Section III introduces the principle of the DPC algorithm. Section IV describes the CFDPC algorithm in detail. Section V verifies the effectiveness of the algorithm and compares the clustering evaluation results of the CFDPC algorithm and the four other algorithms on datasets of different shapes and structures. Section VI concludes the paper.

Scholars have improved the DPC algorithm from different research angles, targeting its various shortcomings.

The first aspect is the improvement of the local density and the cut-off distance. Du et al. [18] proposed a new way to calculate the local density (DPC-KNN), using k-nearest neighbors instead of $d_c$ to make the parameters easier to tune. Liu et al. [19] redefined the local density and relative distance based on shared-nearest-neighbor similarity and took the neighbor information of the data points into account in the density and distance calculations, overcoming the one-sidedness of the DPC algorithm's data-correlation calculation. Liu et al. [20] proposed a mixed density peak clustering algorithm (DDNFC) that obtains two density estimates for the data points, one from the local spatial position deviation and one from the reverse k-nearest-neighbor technique. Chen et al. [21] proposed a domain-adaptive density clustering algorithm.

The second aspect is reducing the amount of computation. Xu et al. [23] proposed a fast sparse-search density peak clustering algorithm (FSDPC), which uses random third-party data points to find nearest neighbors for the sparse search, thereby improving the efficiency of the DPC algorithm. Xu et al. [24] proposed two density screening strategies, grid division (GDPC) and circle division (CDPC), which improve the efficiency of the DPC algorithm. Shan et al. [25] proposed a density peak clustering algorithm (SKTDPC) based on sparse search and a k-d tree.

The algorithm uses the k-d tree to accelerate the search for nearest neighbors.

The third aspect is determining the cluster centers. Flores et al. [26] automatically determined the cluster centers.

The fourth aspect is improving the allocation of the remaining data points. The authors of [30] proposed density peak clustering based on a weighted local density sequence and nearest-neighbor assignment (DPCSA), in which the error propagation of cluster labels is overcome by introducing the weighted local density sequence and a two-stage assignment strategy. Yang et al.
[31] proposed a generalized density peak clustering algorithm (GDPC) based on a new order similarity, which uses the Euclidean distance between samples to calculate the order similarity and adopts a two-step assignment to weaken the propagation of data errors. Yu et al. [32] proposed a three-way density peak clustering method (3W-DPET) based on evidence theory, which constructs and collects the information of k-nearest neighbors in order to assign ungrouped objects to the most suitable clusters and can effectively alleviate the error propagation of cluster labels. Although the above algorithms reduce the probability of erroneous data point assignments to a certain extent, they are not effective on datasets whose clusters are closely distributed and cross-overlapping.

The fifth aspect is the method of merging clusters. Fang et al.
[33] proposed a density peak clustering algorithm (CFDPC) based on adaptive kernel fusion. The algorithm automatically finds initial clusters based on density peaks, uses an adaptive search method to find core points, and finally obtains the final clusters through a core fusion strategy based on intra-class similarity. Sun et al. [34] proposed density peak clustering based on k-nearest neighbors and self-recommendation (DPC-MC), which selects initial cluster centers through a self-recommendation strategy and aggregates clusters by the degree of association between them. Liu et al. [35] proposed an adaptive density peak clustering and aggregation strategy based on k-nearest neighbors (ADPC-KNN), which merges clusters according to a density-reachability strategy. Yuan et al. [36] used k-nearest neighbors to divide the data points and a cluster merging strategy to automatically aggregate over-segmented clusters. The above algorithms improve the DPC algorithm from the perspective of multi-cluster fusion, but the design requirements of each step of the fusion strategy are very strict; otherwise, over-fusion or no fusion occurs between clusters, sacrificing clustering precision.

In this section, we review the principle of the traditional DPC algorithm.

The density peak clustering algorithm (DPC) is built on two basic assumptions: first, the local density of a data point that is a cluster center is higher than that of the surrounding data points; second, the relative distance between two data points that are both cluster centers is large. Local density and relative distance are therefore the two key variables in the implementation of the density peak clustering algorithm.

The local density $\rho_i$ of data point $i$ is defined as
$$\rho_i = \sum_{j \neq i} \chi(d_{i,j} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0, \end{cases}$$
where $d_{i,j}$ represents the Euclidean distance between data points $i$ and $j$, and $d_c$ is the cut-off distance, obtained by sorting all pairwise distances in ascending order and taking the distance ranked at the first $p\%$.

The relative distance $\delta_i$ of data point $i$ is defined as
$$\delta_i = \begin{cases} \max_j d_{i,j}, & \text{if } \rho_i = \max_k \rho_k, \\ \min_{j:\, \rho_j > \rho_i} d_{i,j}, & \text{otherwise.} \end{cases}$$
That is, when the local density of data point $i$ is the maximum, data point $i$ is a cluster center and its relative distance is the maximum distance from any other data point to $i$; otherwise, the relative distance of data point $i$ is the shortest distance to $i$ from all data points $j$ whose density is greater than that of $i$.

After calculating $\rho_i$ and $\delta_i$ for each data point, a decision graph is constructed with local density on the horizontal axis and relative distance on the vertical axis, as shown in Fig. 1.
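To make the two quantities concrete, the following minimal Python sketch (Python is also what the experiments in this paper use) computes $\rho_i$ with the cut-off kernel and $\delta_i$ together with each point's nearest denser neighbor. The function name and the choice of the $p$-quantile of all pairwise distances for $d_c$ are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dpc_rho_delta(X, p=0.02):
    """Sketch of the two DPC quantities; p ~ 2% is a common heuristic for d_c."""
    n = len(X)
    # pairwise Euclidean distance matrix T (n x n)
    T = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # cut-off distance d_c: the p-quantile of all pairwise distances
    d_c = np.quantile(T[np.triu_indices(n, k=1)], p)
    # local density rho_i: number of points closer than d_c (cut-off kernel)
    rho = (T < d_c).sum(axis=1) - 1          # subtract 1 to drop the point itself
    # relative distance delta_i and index of the nearest denser neighbor
    delta = np.zeros(n)
    nneigh = np.full(n, -1)
    order = np.argsort(-rho)                 # indices in decreasing density
    for rank, i in enumerate(order):
        if rank == 0:                        # global density peak
            delta[i] = T[i].max()
        else:
            higher = order[:rank]            # all points denser than i
            j = higher[np.argmin(T[i, higher])]
            delta[i] = T[i, j]
            nneigh[i] = j
    return rho, delta, nneigh, T, d_c
```

The nearest-denser-neighbor indices returned here are exactly what the allocation step described next follows.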

Data points with both large local density and large relative distance are selected from the decision graph as cluster centers. Finally, each remaining data point is assigned to the cluster of the nearest data point whose density is greater than its own. This constitutes the basic procedure of the DPC algorithm.

In our algorithm, candidate cluster centers are screened with two thresholds, where $\mu(\rho)$ is the mean of the local density of the data points and the screening yields the set $O_i$ of expected cluster centers with noise points excluded, and where $\mu(\delta)$ represents the mean value of the relative distance.

Definition 1 (Boundary Points): The data points in cluster $C_a$ are sorted in ascending order of local density, and the data points within the first 5%-20% of the local density are taken as the boundary points of cluster $C_a$.
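As a sketch of this screening and allocation, the snippet below selects candidate centers whose density and relative distance both exceed the respective means $\mu(\rho)$ and $\mu(\delta)$ (a plausible reading of the two thresholds, whose exact equations are garbled in the source) and then assigns every remaining point to the cluster of its nearest denser neighbor, as in DPC.

```python
import numpy as np

def initial_clusters(rho, delta, nneigh):
    """Assumed threshold rule: candidate centers exceed both mu(rho) and mu(delta)."""
    centers = np.where((rho > rho.mean()) & (delta > delta.mean()))[0]
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(len(centers))
    # assign remaining points in decreasing density order to the cluster of
    # their nearest denser neighbor (the standard DPC allocation rule)
    for i in np.argsort(-rho):
        if labels[i] == -1:
            labels[i] = labels[nneigh[i]]
    return labels, centers
```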

Definition 2 (Inter-Cluster Boundary Density): Let the two clusters to be merged be $C_u$ and $C_v$. The inter-cluster boundary density $B_\rho$ is defined in terms of $B_\rho^u$, the mean density of the boundary points of cluster $C_u$, and $B_\rho^v$, the mean density of the boundary points of cluster $C_v$.

Definition 3 (Inter-Cluster Intersection Density): Let the two clusters to be merged be $C_u$ and $C_v$, and let $h_u$ and $h_v$ be the unions of the sets of data points lying within circular neighborhoods of radius $d_c$ around each data point of clusters $C_u$ and $C_v$, respectively. The inter-cluster intersection density is denoted by $A_{u,v}$.

Definition 4 (Cluster Fusion Condition): Let the two clusters to be merged be $C_u$ and $C_v$. If $A_{u,v} > B_\rho$, then cluster $C_u$ and cluster $C_v$ are fused.
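Because the equation bodies of Definitions 1-3 are lost to extraction, the helpers below encode one plausible reading, flagged here as an assumption: boundary points are the lowest-density $q$ fraction of a cluster, $B_\rho$ averages the two clusters' mean boundary-point densities, and $A_{u,v}$ is the mean density over the overlap $h_u \cap h_v$ of the clusters' $d_c$-neighborhood point sets.

```python
import numpy as np

def boundary_points(members, rho, q=0.10):
    """Definition 1: lowest-density q fraction (the paper's 5%-20%) of a cluster."""
    members = np.asarray(members)
    k = max(1, int(np.ceil(q * len(members))))
    return members[np.argsort(rho[members])[:k]]

def inter_cluster_boundary_density(bu, bv, rho):
    """Definition 2, read as the average of the two mean boundary densities."""
    return 0.5 * (rho[bu].mean() + rho[bv].mean())

def inter_cluster_intersection_density(cu, cv, rho, T, d_c):
    """Definition 3, read as the mean density over the overlap of h_u and h_v."""
    h_u = set(np.where((T[np.asarray(cu)] <= d_c).any(axis=0))[0])
    h_v = set(np.where((T[np.asarray(cv)] <= d_c).any(axis=0))[0])
    overlap = np.array(sorted(h_u & h_v), dtype=int)
    return rho[overlap].mean() if len(overlap) else 0.0

def should_fuse(A_uv, B_rho):
    """Definition 4 / Step 2: merge the pair when A_{u,v} > B_rho."""
    return A_uv > B_rho
```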

The main ideas of the cluster fusion strategy are as follows:

Step 1: Calculate the boundary points according to (6) and (7), and calculate the inter-cluster intersection density and inter-cluster boundary density according to (8) and (9) to obtain the inter-cluster intersection density matrix $A_{b\times b}$ and the inter-cluster boundary density matrix $B_{b\times b}$, where $B_{i,j}$ represents the boundary density between the $i$-th cluster and the $j$-th cluster, $A_{i,j}$ represents the intersection density between the $i$-th cluster and the $j$-th cluster, and $A_{b\times b}$ and $B_{b\times b}$ are symmetric matrices.

Step 2: The clusters are indexed in descending order of the local density of their cluster centers, that is, indexed in the order $i = 1, 2, \ldots, b$, $j = i+1, i+2, \ldots, b$. If $A_{i,j} > B_{i,j}$, merge the $i$-th and $j$-th clusters and update the cluster center set $H$: delete the $j$-th cluster center from $H$ and take the center of the $i$-th cluster as the center of the new cluster. Update the cluster set $N$: merge the data points of the $j$-th cluster into the $i$-th cluster, delete the $j$-th cluster from $N$, and sort the data points of the $i$-th cluster in ascending order of local density. Update the label set $M$: change the label of the merged $j$-th cluster to the label of the $i$-th cluster and delete the label of the $j$-th cluster from $M$. Finally, set $b = b - 1$.

Step 3: Return to Step 1, calculate the new inter-cluster intersection density matrix $A_{b\times b}$ and inter-cluster boundary density matrix $B_{b\times b}$, and judge whether the cluster fusion condition is met; if so, return to Step 2. When $A_{i,j} < B_{i,j}$ for all remaining pairs, stop the cluster fusion and output the clustering result.
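Putting Steps 1-3 together, a hedged sketch of the fusion loop follows, reusing the helpers above. Rather than storing the full $b \times b$ matrices $A$ and $B$, it computes each pair's densities on demand, and the restart after every merge mirrors Step 3's return to Step 1.

```python
import numpy as np

def cluster_fusion(labels, centers, rho, T, d_c, q=0.10):
    """Greedy fusion of initial clusters while A_{i,j} > B_{i,j} (Steps 1-3)."""
    clusters = [list(np.where(labels == c)[0]) for c in range(len(centers))]
    centers = list(centers)
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        # Step 2: visit clusters in descending density of their centers
        order = sorted(range(len(clusters)), key=lambda c: -rho[centers[c]])
        for a in range(len(order)):
            for b in range(a + 1, len(order)):
                i, j = order[a], order[b]
                B = inter_cluster_boundary_density(
                    boundary_points(clusters[i], rho, q),
                    boundary_points(clusters[j], rho, q), rho)
                A = inter_cluster_intersection_density(
                    clusters[i], clusters[j], rho, T, d_c)
                if should_fuse(A, B):
                    clusters[i] += clusters[j]   # merge cluster j into cluster i
                    del clusters[j], centers[j]  # keep center i for the new cluster
                    merged = True
                    break                        # Step 3: recompute and restart
            if merged:
                break
    # rebuild the label vector from the surviving clusters
    out = np.empty_like(labels)
    for c, members in enumerate(clusters):
        out[members] = c
    return out, centers
```

Chained after the earlier sketches, one would call `rho, delta, nneigh, T, d_c = dpc_rho_delta(X)` and then `cluster_fusion(*initial_clusters(rho, delta, nneigh), rho, T, d_c)` to merge the initial clusters until no pair satisfies the fusion condition.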

(2) Index the first cluster $N_1$ and the second cluster $N_2$ in the set $N$, and calculate the inter-cluster intersection density and inter-cluster boundary density according to (8) and (9) to obtain the inter-cluster intersection density matrix $A_{2\times 2}$ and the inter-cluster boundary density matrix $B_{2\times 2}$, as shown in Fig. 4.

According to experience, it is known that cluster fusion cannot continue, and there is no need to judge the cluster fusion conditions.

ii) If the fusion condition is not met, index the second cluster $N_2$ and the third cluster $N_3$ in the set $N$, and judge whether $A_{2,3}$ and $B_{2,3}$ meet the merging condition. If $A_{2,3} > B_{2,3}$, then $H_2$ and $H_3$ are merged and $H_3$ is deleted from $H$, so that $H$ is updated to $H = \{H_1, H_2\}$ with $H_2$ as the center of the new cluster. $N_3$ is then merged into $N_2$, the new cluster is still recorded as $N_2$, giving $N = \{N_1, N_2\}$, and the data points of the merged second cluster $N_2$ are sorted in ascending order of density. Similarly, $M = \{M_1, M_2\}$ is obtained by updating the labels in $M_2$: the merged cluster label $M_3$ is updated to the label in $M_2$. The boundary points are then recalculated according to (6) and (7), and the inter-cluster intersection density and inter-cluster boundary density are calculated according to (8) and (9) to obtain the inter-cluster intersection density matrix $\tilde{A}_{2\times 2}$ and the inter-cluster boundary density matrix $\tilde{B}_{2\times 2}$, as shown in Fig. 5.

According to experience, it is known that cluster fusion cannot continue, and there is no need to judge the cluster fusion conditions further.

The computational complexity of the CFDPC algorithm in this paper is mainly composed of the following five parts: (1) the complexity of calculating the distance matrix $T_{n\times n}$ is $O(n^2)$; (2) the complexity of calculating the local density of the data points is $O(n)$; (3) the complexity of calculating the relative distance is $O(n^2)$; (4) the complexity of selecting candidate cluster centers and assigning the remaining points is $O(n)$; (5) performing the cluster fusion, which includes (a) computing the boundary points of the clusters, with complexity $O(n)$, since the number of boundary points of each cluster is in theory less than $n$; (b) computing the inter-cluster intersection density matrix and the inter-cluster boundary density matrix, with complexity much less than $O(n^2)$, since the number of samples in the intersections between clusters is much less than $n$; and (c) processing the cluster fusion, with complexity $O(n)$. In summary, the computational complexity of the CFDPC algorithm is $O(n^2)$, the same as that of the DPC algorithm. The computational complexities of the four compared algorithms are as follows: DPC-KNN, $O(n^2)$; DPC, $O(n^2)$; K-means, $O(n)$; and DBSCAN, $O(n^2)$.

All algorithms in this study are implemented using Python tools. To evaluate the clustering performance of the proposed CFDPC algorithm, it is compared with four advanced algorithms, DPC-KNN [18], DPC [13], K-means [2] and DBSCAN, on different datasets. These four algorithms are reproduced in Python by referring to the original literature. The synthetic and real datasets used in the experiments are described in Table 1 and Table 2.

The parameter settings of each algorithm affect its clustering performance. The percentage parameter $p$ of the CFDPC algorithm in this study is not fixed to a specified value, and the boundary point parameter $q$ is generally selected within 5-20 according to experience. The optimal value of the parameter $k$ in the DPC-KNN algorithm is selected in the range 2-35; the value range of the parameter $p$ in the DPC algorithm is not specified; the K-means algorithm iterates 100 times to obtain the optimal result once the number of clusters $K$ is determined; and the parameters required by the DBSCAN algorithm, the neighborhood radius $\varepsilon$ and the minimum number of samples in a neighborhood, MinPts, are set to the best values based on experience.

In this section, we also analyze and discuss the effect of different settings of the boundary point parameter $q$ on the ACC of the proposed CFDPC algorithm. The boundary point parameter $q$ is a hyper-parameter of the CFDPC algorithm, and its setting has an important impact on the clustering effect. Datasets differ in size. For a dataset with a small number of samples, when multiple initial clusters are generated, each initial cluster will contain few samples because the whole dataset is small. So that the cluster merging conditions can be met, and indeed so that cluster merging remains meaningful at all, each cluster should have boundary points as far as possible. Multiple experiments on the selected datasets verify that setting the lower limit of the boundary parameter $q$ to 5 is reasonable. For datasets with a large number of samples, each initial cluster will contain more samples, and the numbers of samples and boundary points in the new clusters obtained by merging will also increase. To ensure that cluster merging can continue, the number of boundary points in each cluster should not exceed half the size of the cluster to which they belong. Similarly, multiple experiments on the selected datasets verify that setting the upper limit of the boundary point parameter $q$ to 20 is reliable, although this is not absolute, as shown in Fig. 6.

With the parameter $p$ set as in Table 3, the compared algorithms largely cluster the synthetic datasets correctly. However, it can be seen from the renderings that the DBSCAN algorithm identifies some data points as noise points, which leads to low clustering performance.

By analyzing the four evaluation index values of each algorithm in Table 5 on the different datasets, it can be observed that both the CFDPC and DPC algorithms achieve the best results when processing the Vertebral dataset, while the performance of CFDPC on the Ecoli dataset is slightly worse than that of the DPC-KNN algorithm.
This is because the density distribution of the Ecoli dataset is not uniform, and there are cross-entanglement and overlaps among clusters, which results in a deviation between the data selected by the cut-off parameter $p$ and the data selected by the k-nearest neighbors, so that the CFDPC algorithm does not reach the optimum on these four evaluation indexes. Its performance is nevertheless still better than that of the DPC algorithm. Some index values of the DBSCAN algorithm on the Wdbc and Yeast datasets are negative, mainly because the distribution of the Wdbc and Yeast data points is uneven and sparse, which makes it difficult to tune the DBSCAN parameters; in addition, some data points are marked as noise points, resulting in a large deviation in the clustering results.

Owing to its own properties, the K-means algorithm is not suitable for dealing with non-linearly separable datasets, whereas the proposed CFDPC algorithm is not only suitable for processing arbitrarily shaped datasets but also has optimal clustering performance.

In future work, we will explore and solve the following problems. First, the algorithm does not set its two parameters automatically; the focus of future research is to explore non-parametric variants to make the algorithm more intelligent. Second, the CFDPC algorithm has high complexity when calculating the distance matrix; we will continue to explore a new sparse-search distance calculation method to reduce the computational complexity of the algorithm.