A density peaks clustering algorithm with sparse search and K-d tree

Density peaks clustering has become a nova of clustering algorithm because of its simplicity and practicality. However, there is one main drawback: it is time-consuming due to its high computational complexity. Herein, a density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem. Firstly, a sparse distance matrix is calculated by using K-d tree to replace the original full rank distance matrix, so as to accelerate the calculation of local density. Secondly, a sparse search strategy is proposed to accelerate the computation of relative-separation with the intersection between the set of $k$ nearest neighbors and the set consisting of the data points with larger local density for any data point. Furthermore, a second-order difference method for decision values is adopted to determine the cluster centers adaptively. Finally, experiments are carried out on datasets with different distribution characteristics, by comparing with other six state-of-the-art clustering algorithms. It is proved that the algorithm can effectively reduce the computational complexity of the original DPC from $O(n^2K)$ to $O(n(n^{1-1/K}+k))$. Especially for larger datasets, the efficiency is elevated more remarkably. Moreover, the clustering accuracy is also improved to a certain extent. Therefore, it can be concluded that the overall performance of the newly proposed algorithm is excellent.


I. INTRODUCTION
C LUSTER analysis as an important exploration technol- ogy of data mining, is committed to reveal the inherent attributes and laws hidden behind the seemingly disorganized unknown data [1]- [3].It provides support for decisionmaking and has been successfully applied in many fields such as image pattern recognition, social network mining, market statistical analysis, medical research and engineering systems [4]- [13].With the extremely strong penetration and rapid development of the Internet, many professional fields are faced with explosive growth of data storage.This leads to high computational complexity and difficulty in mining valuable information.In 2014, Science published clustering by fast search and find of density peaks (DPC) [14].Due to its novel design idea and robust performance, DPC in-stantly became the topic center of scholars in related fields.Compared with classical clustering algorithms [15], [16], DPC possesses several advantages.Firstly, the cluster centers can be identified directly through the decision graph, which consists of local density and relative-separation for all data.Secondly, it can handle non-convex datasets well and no iterative process is required.Furthermore, it is insensitive to outliers.However, there are still some shortages for DPC to be further improved, including the sensitivity to cut-off distance, the high computational complexity as well as the determination of cluster centers manually.
For the original DPC algorithm published by Science [14], the local structure of data [17] is not considered and it is sensitive to the cut-off distance.Some scholars try to solve this problem by integrating the k nearest neighbors with DPC in various ways.Du et al. [18] introduced the idea of k nearest neighbors into DPC and redefined the calculation method of local density.Thus, the so-called DPC-KNN algorithm was successfully established.Xie et al. [19] also gave another measure of local density using k nearest neighbors and designed a new allocation strategy for noncenter points.In order to elevate the possibility of selecting the correct cluster centers, Liu et al. [20] modified the distance calculation method by considering the distance factors and neighbor information simultaneously.In addition, a twostep allocation strategy was also adopted to allocate noncenter points.Recently, Liu et al. [21] have suggested a mixed density clustering method by defining two types of local density.One is based on k nearest neighbors, while the other is determined by local spatial position deviation.Local density description derived from k nearest neighbors supplies a novel way for DPC algorithm.Under this condition, the sensitivity to the hyper-parameter k, i.e. the number of nearest neighbors, is obviously depressed, compared with the sensitivity to cut-off distance used in the original DPC algorithm [14].However, for these DPC algorithms based on k nearest neighbors, mentioned-above, searching neighbors mainly resorts to violent means.Therefore, these algorithms still have the same high computational complexity as original DPC.
For DPC algorithms, the high computational complexity commonly comes from the distance calculation, which is related with any two points among all data points.With the increase of dataset size, as groundwork, calculating distance between any two points is time-consuming and even impossible for DPC to be implemented.In order to reduce the computational complexity and lift the clustering efficiency, some improvement strategies have been proposed.Gong et al. developed the efficient distributed density peaks clustering algorithm (EDDPC) [22], which eliminates unnecessary distance calculation and data shuffling by Voronoi diagram, data replication and data filtering.Bai et al. [23] tried to save a part of the computational effort required for distance by combining DPC and K-means.Xu et al. introduced the idea of replacing all data points with non-empty grid into DPC [24], [25].Two prescreening strategies were proposed to determine cluster centers by screening points with the feature of higher local density.Xu et al. [26] also proposed a fast density peaks clustering algorithm with sparse search (FSDPC), which introduced the idea of third-party random points to measure the distance.These optimization algorithms have shown good ability in reducing complexity and improve clustering efficiency.However, they either sacrifice accuracy, resulting in poor clustering effect, or decrease the clustering stability, making the clustering result for each run vary obviously.Therefore, it is not satisfying for these algorithms to get reliable and reasonable clustering results.
As for determining cluster centers, it is difficult for the original DPC algorithm to identify the correct cluster centers manually, if the dividing line between the center points and the non-center points on the decision graph is not very clear.
Tong et al. [27] obtained the initial clusters by a designed preclustering method, and then gave the final clusters according to the Scale Space Theory.Lv et al. [28] proposed a method to determine the cluster centers automatically according to the decision value defined by the product of local density and relative-separation.Flores et al. [29] proposed a strategy to find cluster centers adaptively by searching the gaps among data points on the one-dimensional decision graph mapped.Lin et al. [30] introduced a hyper-parameter, neighbor radius, to select a group of possible density peaks, as preliminary clustering results.Then the final clustering results were formed by merging the clusters with single-bond clustering method.However, some of these algorithms reached the goal of adaptive determination of cluster centers, at the cost of introducing an additional hyper-parameter.Meanwhile, some algorithms became more time-consuming and inefficient, especially for large or complex dataset.
To sum up, various DPC algorithms have been proposed, but very few of these algorithms could achieve satisfactory clustering accuracy, computational complexity as well as the adaptive determination of cluster centers simultaneously.Therefore, the main purpose of the present work is to develop an improved DPC algorithm, in which the computational complexity is reduced significantly and determining cluster centers could be carried out automatically, under the condition that the clustering accuracy is guaranteed without any additional hyper-parameter introduced.As a result, a density peaks clustering algorithm with sparse search and K-d tree (SKTDPC) was proposed.As a type of data structure, Kd tree [31] was adopted to find k nearest neighbors and thus the computational complexity of local density could be firstly reduced remarkably.Secondly, a strategy of sparse search was proposed to accelerate the calculation of relativeseparation significantly.Furthermore, the method for automatic determination of cluster centers was described based on second-order difference of the decision value.Besides the main work, the applicability of SKTDPC algorithm is analyzed briefly on higher dimensional datasets.
The rest of this paper is organized as follows.Section II reviews the works related with DPC algorithms.Section III depicts the SKTDPC algorithm newly developed, including fast search for k nearest neighbors, dual acceleration for local density and relative-separation, and determining cluster centers adaptively etc. Section IV verifies the validity of the SKTDPC algorithm and makes a comparison with other state-of-the-art clustering algorithms thoroughly on various datasets.Finally, Section V is the concluding remarks of the paper.

II. RELATED WORKS
Due to the simple and clear principle of DPC algorithm, it shows great performance on dataset with any shape or dimension.DPC has become a new favorite in the field of clustering algorithm research.The currently proposed SKT-DPC algorithm is designed on the basis of the original DPC, the idea of k nearest neighbors and sparse search.Therefore, this section briefly reviews the original DPC algorithm.The role of the idea for k nearest neighbors and sparse search in DPC are also analyzed.

A. THE ORIGINAL DPC ALGORITHM
In original DPC algorithm [14], there are two important variables, the local density ρ i and the relative-separation δ i for any data point x i in dataset labeled by S = {x 1 , x 2 , . . ., x n } with n data.
One of the important variables, the local density ρ i , is defined as follows: where d c is the cut-off distance, which is regarded as a hyper-parameter for the algorithm, and d i,j is the Euclidean distance between data points x i and x j .Another important variable, the relative-separation δ i , could be determined by searching for the nearest data point x j , which has a relatively larger local density compared with x i : This is the condition that the point x i is a non-maximum local density point.Especially, for the data point with the maximum local density, the relative-separation δ i is marked specifically by δ max and is given as: DPC selects cluster centers based on the core idea, that the cluster centers are surrounded by these data points with relatively lower ρ i as well as the cluster centers possess relatively larger distance from points with local density higher than the cluster centers.That is, the data points x i with relatively large ρ i and δ i have a higher probability of being identified as cluster centers.This is the main characteristic of the original DPC algorithm different from other density clustering algorithms.
Based on the density ρ i and distance δ i , two critical tasks are performed by the DPC.Firstly, cluster centers are determined.By drawing decision graph in the two-dimensional coordinate system with ρ i and δ i as the abscissa and ordinate, respectively, each data point corresponds to a position in the graph and these points in the upper right corner of the decision graph are selected as cluster centers [32]- [34].Secondly, non-cluster center points are assigned [35], [36].The allocation principle is that each non-center point and its nearest point with a higher local density have the same cluster.For the determination of cluster centers, if the boundary between the center points and the non-center points on the decision graph is obvious, the cluster centers could be identified quickly.Otherwise, it is not easy to distinguish the ideal cluster centers manually.This is the problem for determining cluster centers adaptively, which will be solved in the Section III.D.
For the computational complexity of the original DPC, it depends on the Euclidean distance calculation between any two data points in dataset S [37].These distances construct the symmetric full-rank matrix D: This is groundwork for further determine both the local density ρ i and the relative-separation δ i of each data point x i .For the dataset S = {x 1 , x 2 , . . ., x n } with n data, the computational complexity of DPC is mainly determined by two parts together.One is the computational complexity required to compute the distance matrix, which is O(n 2 ), while another is the complexity required to allocate the remaining points, which approximately is O(n).Then, the overall computational complexity of DPC is O(n 2 ).

B. THE IDEA OF K NEAREST NEIGHBORS AND SPARSE SEARCH
From Equation (1), we can see that for the data point x i the local density ρ i is defined by the number of the nearest neighbors which are located in the circular area with the cutoff distance d c as the radius.Thus, Equation (1) can also be reformulated by searching for the set NN(x i ) which consists of the nearest neighbors of the data point x i : where |NN(x i )| indicates the number of elements in the set NN(x i ).
In order to achieve better clustering results, for the original DPC algorithm to be implemented, the cut-off distance d c should be adjusted as a hyper-parameter.This method defining the local density ρ i ignores the local distance information of data points and makes the algorithm more sensitive to hyper-parameter.For solving this problem, inspired by the idea of k nearest neighbors, the hyper-parameter d c could be replaced by the hyper-parameter k, the number of the nearest neighbors.Some scholars have proposed various forms to define the local density ρ i with k nearest neighbors.In this way, not only the algorithmic sensitivity to hyper-parameter can be depressed, but also the local distance information can be integrated into the local density definition to describe the data density more reasonably.More specifically, d c has a wide range of adjustment and has a great impact on algorithm performance.On the contrary, the value of k increases with a step size of 1, and the algorithm can achieve a satisfactory state relatively when k does not exceed 20 generally.Therefore, this strategy to define the local density is still maintained by the present work, but a type of data structure, the K-d tree is adopted to accelerate search for the k nearest neighbors.This will be depicted in Sections III.A and III.B.
It can be found that the calculation of the distance between any two points among all data points is time-consuming at the beginning for DPC algorithms to be implemented.Thus, DPC algorithms have high computational complexity, VOLUME 4, 2016 no matter how to define ρ i , with the hyper-parameter d c or the hyper-parameter k, discussed in previous section.In view of this, some scholars have introduced the idea of sparse search into DPC algorithm to reduce the complexity.As the name implies, sparse search refers to a class of methods that can reduce internal friction effectively by replacing a large number of calculations with as little computation as possible.
[22]- [26] proposed different sparse methods to speed up the calculation process of DPC.Inspired by this idea, we design a new and effective sparse search strategy to reduce the complexity of DPC.
It should be stressed that the present analysis on computational complexity does not take into account the influence of feature dimension of data, which will be discussed in the Section III.F.In the next section, an algorithm called SKTDPC will be developed, which can significantly reduce the computational complexity by the important strategies of sparse search and K-d tree.

III. THE DENSITY PEAKS CLUSTERING ALGORITHM WITH SPARSE SEARCH AND K-D TREE
In this section, the density peaks clustering algorithm with sparse search and K-d tree (SKTDPC) is proposed.Aiming to reduce the computational complexity, the present algorithm adopts K-d tree to find k nearest neighbors quickly and uses a strategy of sparse search to accelerate the calculation of relative-separation significantly.Moreover, a method for automatic determination of cluster centers is developed.Finally, the process of the algorithm and the complexity are depicted and discussed, respectively.

A. FAST SEARCH OF K NEAREST NEIGHBORS BASED ON K-D TREE
As discussed above, introducing the hyper-parameter k, the number of the nearest neighbors, to replace the hyperparameter d c , the cut-off distance, could reduce the algorithmic sensitivity to hyper-parameter obviously.However, for the k nearest neighbors of each data point to be found out, the distance between any two points in dataset S should also be calculated.Therefore, there is still the problem of high computational complexity.To accelerate the search for k nearest neighbors and save a large part of the computational effort required for distance, as a type of data structure, the classical K-d tree is adopted in the present work.
K-d tree is a typical binary tree, which stores data points in K dimensional space for quick retrieval.It represents the division of K dimensional space and constitutes a series of K dimensional hyperrectangular regions.The fast determination of k nearest neighbors based on K-d tree is divided into two steps.The first part is the K-d tree construction process, the second part is the process of search for k nearest neighbors with K-d tree, the specific process described by Algorithm 1 is as follows.
Note that in order to reduce the cost of backtracking and achieve the best balance of data segmentation, this algorithm selects the dimension with the largest variance and the me-dian in this dimension as dividing criterion each time.The larger the variance is, the more scattered the data points are in this dimension.In the whole paper, the dataset S takes the form of S = {x 1 , x 2 , . . ., x n } ∈ R n×K , where n and K are the number and spatial dimension of data points respectively.In addition, k represents the number of the nearest neighbors for any data point x i .
The computational complexity of K-d tree construction is O(nKlogn), and the computational complexity of searching k nearest neighbors of n data points is O(n(n 1−1/K + k)) in Algorithm 1.Therefore, the overall computational complexity of searching k nearest neighbors by K-d tree is the larger one from O(nKlogn) and O(n(n Thus, a lot of unnecessary calculations for distance are saved.This will further result in the acceleration for calculation of the local density ρ i (Section III.B).

B. THE ACCELERATION FOR CALCULATION OF THE LOCAL DENSITY BY K-D TREE
Searching for k nearest neighbors is the groundwork for DPC algorithms to be implemented.By using the K-d tree method proposed in Section III.A, the computational complexity could be reduced essentially.As a result, the distance matrix D used in the original DPC (Section II.A), a symmetric full-rank matrix, is changed to the symmetric sparse matrix D. For example, assuming there is a dataset S = {x 1 , x 2 , . . ., x 12 } containing 12 elements, the form of the sparse matrix D may be as follows: where d i,j represents the Euclidean distance between the ith data point and the j-th data point calculated by K-d tree in the process of searching the k nearest neighbors; "[]" denotes the distance that does not need to be calculated in this process (also known as invalid distance).In this matrix, the first row implies that the algorithm can find out the k nearest neighbors of the data point x 1 only by calculating the distance between x 1 and x 2 , x 5 , x 6 , x 10 , instead of calculating all the distances.Taking the number k of the nearest neighbors for any data point x i as a hyper-parameter, the set of the k nearest neighbors for x i , calculated by K-d tree, is marked by NN k (x i ), which could be described as: where d(x i , x j ) is the distance between x i and x j , which is equivalent to d ij , and N k (x i ) is the k-th nearest neighbor for point x i .
Algorithm 1: The K-d tree determines for k nearest neighbors.

Input:
The dataset S = {x 1 , x 2 , . . ., x n }, where The number of the nearest neighbors k Output: K-d tree; The set Q of k nearest neighbors of x target (here x target is any point x i in dataset S) build the K-d tree; while any leaf node has non-unique data do for each dimension h(h = 1, 2, . . ., K) get its variance; choose h according to the maximum variance; sort the data values in dimension h and get its median m as the segmentation point; store the data point of segmentation on the corresponding node; other data are divided by m into two subsets (The left sub-node corresponds to the subset of the coordinates x (h) which is equal to or less than the segmentation point.The right sub-node corresponds to the subset of coordinates x (h) greater than the segmentation point); Thus, the following formula is proposed to describe the local density ρ i more reasonably: It is indicated that the local density ρ i is not required to calculate on the basis of the symmetric full-rank matrix D, which includes all distances between any two data in dataset S. Therefore, by introducing K-d tree method, the calculation of the local density ρ i is accelerated significantly with the sparse matrix D. This is the first acceleration for SKTDPC algorithm.

C. THE ACCELERATION FOR CALCULATION OF THE RELATIVE-SEPARATION BY SPARSE SEARCH
In Section III.B, ρ i is obtained based on the symmetric sparse matrix D. According to the definition of relativeseparation, Equation ( 2), δ i of the non-maximum density point is determined by searching for the nearest data point x j , which possesses relatively larger local density ρ j compared with x i (ρ j > ρ i ).In other words, the distance information in D is not enough to support the acquisition of all δ i , and it is necessary to complete the rest of the distance calculation.At this point, the computational complexity of getting δ i will reach O(n 2 ) in the worst case.Thus, it would be insignificant to save a large part of the distance calculation for ρ i by Kd tree in Section III.B, if all of distance calculations were carried out between any two points in dataset S. For dealing with this problem, a sparse search strategy is proposed to accelerate the calculation of the relative-separation δ i .For expressing more clearly, the local density ρ i for all of data points obtained in Section III.B are sorted in descending order to generate a new sequence ρ i * (i = 1, 2, . . ., n).For any point x i , set B(x i ) is introduced to represent the set including all points x j with greater local density ρ j compared with x i (ρ j > ρ i ).That is to say, consists of these data points corresponding to ρ 1 * to ρ i−1 * .Thus, the set B(x i ) is explicit.In addition, for any point x i , the set of k nearest neighbors, NN k (x i ), has also been determined in Section III.B.Then, for x i with non-maximum density, based on NN k (x i ) and B(x i ), if ∃j s.t.ρ j > ρ i , the sparse search strategy for δ i is defined as: When x i is the point with maximum density, δ max has the same expression as that defined by the original DPC, Equation (3).
Equation (8) indicates that both the sets NN k (x i ) and B(x i ) are important for determining δ i quickly.One is the set of k nearest neighbors, while the other is the set of these points with greater local density than ρ i .When the two sets have an intersection, it means that δ i must be the minimum of distances of x i from the point x j in the intersection.All these distance values are known information determined by looking for k nearest neighbors with K-d tree.They are stored in the sparse matrix D. In this case, δ i can be obtained directly, without any additional calculation for distance.Under the condition that there is no intersection, the unknown distances between x i and points with local density greater than ρ i need to be calculated to find the nearest distance.Therefore, whether the intersection is non-empty or not plays an important role in determining the computational complexity.
Normally, most of δ i could be obtained by the non-empty intersection directly.This can be explained as follows.Here all data points in dataset S are divided into four parts: points with large ρ i and large δ i (part 1), points with large ρ i and small δ i (part 2), points with small ρ i and small δ i (part 3) as well as points with small ρ i and large δ i (part 4).Firstly, there are very few data points with large ρ i and large δ i , compared with total data.These points are potential cluster centers.Secondly, for points with large ρ i , including part 1 and part 2, the total amount of elements in the set B(x i ) is relatively very small, since B(x i ) is defined by these points with local density greater than ρ i and ρ i itself is also large.In addition, for the small amount of distance calculation, some of the known distances have already been stored in the symmetric sparse matrix D, except for the distances between x i and the k nearest neighbors.Thus, even though the intersection is empty, the distance calculation, which needs to be supplemented, is not time-consuming at all.Thirdly, the points in part 3 take a vast proportion of total data.They are non-center points with low local density ρ i .Small ρ i implies that the averaged distance between x i and the k nearest neighbors is obviously large, i.e. the k nearest neighbors are not concentrated around x i , but are decentralized.Further, the k-th nearest neighbor of x i is relatively far away from x i .Meanwhile, small δ i indicates that the shortest distance from x i is small for the points x j with larger local density ρ j .Thus, for any point x i , there is a high possibility that the point x j , which determines the value of δ i , (with larger local density ρ j and minimum of distance from x i ) is located in the nearest neighbors set NN k (x i ).That is to say the intersection of NN k (x i ) and B(x i ) is non-empty very likely.Finally, the points in part 4 are commonly regarded as outliers, which are also very few.Corresponding calculation required for distance are even negligible actually.
In order to show the process of obtaining δ i more clearly and intuitively by the proposed sparse search strategy, an example is used to confirm the fact that the intersection of NN k (x i ) and B(x i ) is not empty in most cases, as shown in Fig. 1.There is a dataset with 16 data points that can be seen in Fig. 1.Firstly, the 3 nearest neighbors set NN 3 (x i ) of any sample point x i and the distance d(x i , x j ), x j ∈ NN 3 (x i ) are obtained by the sparse calculation method of K-d tree in Section III.A.Then, the ρ i is obtained by using (7), and the larger density set B(x i ) of the sample point x i can be obtained.Finally, the δ i is obtained from the NN 3 (x i )∩B(x i ) shown in the last column in Fig. 1.From Fig. 1, we can clearly see that the NN 3 (x i ) ∩ B(x i ) of sample points 0, 1, 2, 4, 5, 6, 8, 10, 11, 12, 13, 14 and 15 is not empty.This means that the δ i of these data points can be obtained directly from the NN 3 (x i ) ∩ B(x i ).That is, the nearest distance from x i in the NN 3 (x i ) ∩ B(x i ) is δ i , without any additional calculation.Only the NN 3 (x i ) ∩ B(x i ) of 3, 7 and 9 three points is empty, so we need to add a little bit of calculation.It is worth noting that although these three points require supplementary calculation, 3 and 9 only needs to be calculated with a very small number of points in the set {7}, {3, 7} respectively.Only point 7 with maximum density needs to calculate the distance from the remaining points.It can be seen that the proposed sparse search strategy effectively avoids the distance calculation of low-density points, which account for most of the data volume.For a very small proportion of high-density points, even if a little additional calculation is needed, the amount of calculation is very small.For the whole algorithm, it is almost negligible.
On the whole, a very small amount of extra distance needs to be calculated to obtain δ i of all data points.The sparse search strategy captures the crux of these low density points which occupy the main amount of computation, and obtains δ i through ingenious intersection strategy directly, so that a lot of distance calculation is saved.At this point, we have succeeded in reducing the computational complexity of this step to much less than O(n(n 1−1/K + k)).The second acceleration of the SKTDPC algorithm has been achieved.Therefore, the problem of high computational complexity is solved fundamentally, for the original DPC algorithm as well as a series of extended DPC algorithms based on k nearest neighbors, by a simpler and more efficient strategy.

D. ADAPTIVE DETERMINATION OF CLUSTER CENTERS
For the determination of cluster centers, it is difficult for the original DPC algorithm to distinguish center points and noncenter points on the decision graph manually, if the boundary is not very clear.A reasonable adaptive way can deal with this problem.In this section, a simple and efficient method is proposed.
As an important judgment basis, the variable, called decision value γ, is also introduced, which is defined by the product of ρ i and δ i : By re-arrangement in descending order of γ value, the newly generated sequence of decision value is marked by γ * .As analyzed above, the data points with both large ρ i and large δ i are potential to become cluster centers.Under this condition, the decision value is also large.Thus, it is crucial to determine adaptively the boundary between the center points and the non-center points in γ * .Due to the essential distinction between the center points and the non-center points in the degree of change of decision value, the location of the mutation-point is determined adaptively according to this key feature to lock the cluster centers.For γ * sequence, there is the relatively large fluctuation, for the center points with relatively large decision value.In contrast, the fluctuation of γ * value is not obvious for the non-center points with relatively small decision value.The second-order difference for γ * value is used to describe this fluctuation.
When searching the cluster centers, it is necessary to narrow the search appropriately.That is to say, the points in the front position of γ * with relatively large γ value that may become the cluster centers needs to be focused on.Because the points with small γ value does not have the characteristics of becoming the cluster centers, either ρ i or δ i is small, or both are small.Here, the search range is locked in [1, √ n ] preliminarily, where n is the number of data.Previous studies have shown that this search range [1, √ n ] is appropriate.In this way, not only the search efficiency can be improved, but also the removal of irrelevant data points will play a positive role in reducing disruption for determining the cluster centers.
Through the analysis of γ * value of a large number of datasets with different distributions, it is found that there is a general rule that the γ value of the point with the most potential to be the cluster centers, γ * 1 , is much larger than γ * 2 and the value behind it generally.If the γ * 1 value is placed within the search range to determine the location of the mutationpoint, the difference value of γ * 1 will be very large, which may lead to the wrong mutation-point being found, thereby affecting the determination of the cluster centers.Thus, the  search range [2, √ n ] is finally adopted in the present work.This treatment does not affect becoming a cluster center for the data point corresponding γ * 1 .Furthermore, it could avoid possible mistake to distinguish the center points and the noncenter points.
Based on the second-order difference of γ * , the mutationpoint is described as: where the function arg max is used to determine the set of variable points i which maximizes , M p is the mutation-point that takes the largest i value when arg max returns multiple i value, ξ i is the second-order difference for γ * i , which is defined by Equation ( 11), γ * min and γ * max represent the minimum and maximum values of where µ i is the first-order difference between two adjacent terms γ * i .The adjustment coefficient ( i+1 i ) 2 is to make bigger when γ * i is larger, and smaller when γ * i is smaller.The distinction between the center points and the non-center points is more prominent through the adjustment coefficient.M p is the mutation-point with the largest ordering value when the second-order difference of γ * is mutated.Then all data points with γ * i larger than γ * Mp , including γ * Mp , are the candidate cluster centers of searching, namely the corresponding points from γ * 1 to γ * Mp .Finally, the pseudo-center points with large ρ i small δ i or small ρ i large δ i were removed from these candidate centers as the final cluster centers.Although the γ i values of these pseudo-center points are large, they do not possess the characteristics of the center points, that is, ρ i and δ i are both large.Specifically, the points and in the candidate center points are retained, and the remaining pseudo-center points are removed to get the final cluster centers.
Next, a two-dimensional synthetic dataset SS2 is adopted as an example to clearly demonstrate the process of determining M p .This dataset has a total of 300 data points (n = 300).In other words, the subsequent calculation of second-order difference only needs to focus on the 16 largest γ * in the range of [2,17], where √ n = 17.Based on this information, 15 µ i from the 16 largest γ * are firstly calculated according to Equation (12); Then, 14 ξ i are calculated from 15 µ i by Equation (11); Finally, the mutation-point M p is determined by Equation (10), which is M p = 2. Here, the two candidate center points meet the above conditions.Therefore, the final cluster centers are the two data points corresponding to γ * 1 and γ * 2 .The distribution of data points in dataset SS2 is shown in Fig. 2(a), and the yellow points indicated by arrows in Fig. 2(b) are mutation-point determined by the proposed secondorder difference method.In other words, the mutation-point and the blue point (center 1 and center 2) were identified as the cluster centers of dataset SS2, and the black dots below were the ordinary non-center points.

E. PROCESS OF SKTDPC
The overall process of SKTDPC algorithm includes three main parts: calculation of dual acceleration for ρ i and δ i based on K-d tree and sparse search strategy, adaptive determination of cluster centers by second-order difference method for γ, and allocation of non-cluster center points.
The implementation of the first part has been described in Sections III.B and III.C.The second part has been described in Section III.D thoroughly.The third part for allocating noncenter points follows the original DPC depicted in Section II.A. Therefore, the specific process of SKTDPC algorithm is summarized as following Algorithm 2.

Input
conditions, is the final cluster centers.7: Assign the non-center points, according to the allocation principle, that each non-center point and its nearest point with a higher local density have the same cluster.8: Return the clustering result Y .Algorithm 2 summarizes the entire process of SKTDPC.On the one hand, SKTDPC replaces the calculation of distance for symmetric full-rank distance matrix D with the symmetric sparse distance matrix D. The K-d tree method and the sparse search strategy by the intersection between sets NN k (x i ) and B(x i ) are developed to accelerate the calculation of both ρ i and δ i .On the other hand, the secondorder difference method is used to find the boundary between the center points and the non-center points.Therefore, adaptively determining the cluster centers is achieved quickly and successfully.Finally, the present algorithm can effectively reduce the computational complexity while maintaining or even improving the clustering accuracy.

F. ANALYSIS OF COMPLEXITY
It should be stressed that the analysis of computational complexity of DPC in some literature only focus on the variable, data number n, then the complexity is O(n 2 ) [18]- [30].Different from these literature, the present work considers the influence of the two variables, n and K, simultaneously.Thus, the computational complexity of the original DPC can be regarded as O(n 2 K).
The computational complexity of SKTDPC algorithm is mainly determined by the four parts: (1) Calculation process of local density ρ i based on K-d tree.The computational complexity of this part mainly depends on the process of searching k nearest neighbors of data points.The process includes construction of the tree and the search of k nearest neighbors.The complexity of these two parts is O(nKlogn) and O(n(n 1−1/K + k)) respectively.Thus, the complexity of the calculation process of local density ρ i based on K-d tree is O(nKlogn)+O(n(n 1−1/K +k)).
Normally, the dimension K is much smaller than the data number n.Therefore, the overall computational complexity of the SKTDPC algorithm is the largest one among the four parts.That is, the complexity is O(n(n 1−1/K + k)).In order to make it easier to analyze and compare with the complexity , where 1/ K √ n, k/n take values in range (0, 1).In other words, the computational complexity of SKTDPC algorithm is much lower than the complexity O(n 2 K) of DPC.

IV. EXPERIMENTS AND RESULTS
In this section, the present SKTDPC algorithm is compared with the six state-of-the-art and typical clustering algorithms, including FSDPC [26], the original DPC [14], DGDPC [38], DPC-KNN [18], DBSCAN [16] and the K-means algorithm [15].For the seven algorithms, the first five algorithms belong to the DPC series.FSDPC, the original DPC and DGDPC are the algorithms taking cut-off distance d c as a hyperparameter, and DGDPC has an extra merging threshold l hyper-parameter.However, SKTDPC and DPC-KNN adopt the hyper-parameter k, the number of nearest neighbors.DB-SCAN is another kind of density-based clustering algorithm, for which two hyper-parameters, are used.In contrast, the Kmeans algorithm is a fast clustering algorithm, in which the allocation of non-centers points is determined by the nearest distance from cluster centers.

A. INTRODUCTION TO DATASETS AND METRICS
To verify the clustering effect and efficiency of the SKT-DPC algorithm, 15 commonly used clustering datasets [38]- [45] are adopted in the present work.These include the eight synthetic datasets (http://cs.joensuu.fi/sipu/data-sets/)shown in Table 1 and the seven UCI real datasets [46] (http://archive.ics.uci.edu/ml) which are shown in Table 2.
In the two tables, the basic information such as number of data, number of class clusters, and data dimension is listed.
Here, in order to visualize the clustering effect clearly and intuitively, in Table I all the eight synthetic datasets are twodimensional.In contrast, the UCI real datasets are highdimensional.They are discussed in Sections IV.B, IV.C and IV.D, respectively.The algorithmic clustering performance could be effectively evaluated by using running time and five well-known indexes, including accuracy (Acc) [47], adjusted mutual information (AMI) [48], adjusted rand index (ARI) [48], normalized mutual information (NMI) [49] and Fowlers-Mallows index (FMI) [50].The five indexes are powerful tools to measure the clustering results, and the maximum values are 1 for all the indexes.The closer to 1 they are, the better the clustering effect is.In addition, running time is an important performance indicator used to measure the clustering efficiency.The smaller the value is, the higher the clustering efficiency is, vice versa.
Among the other six clustering algorithms compared with SKTDPC, the code for original DPC and DPC-KNN are obtained by retranslating the MATLAB source code, provided by the authors of original DPC and DPC-KNN, into Python language form.The FSDPC and DGDPC codes are written in accordance with the original reference.The algorithms for DBSCAN and K-means are programmed by the sklearn.clusterlibrary in Python.In order to make a fair comparison, for each algorithm experiments are conducted with its optimal hyper-parameters.At the same time, the values of all the indicators below are average by the 20 times independently repeated experiments for each algorithm, to avoid the occurrence of contingency.In addition, all algorithms are implemented by Python 3.8.0.The experiments are carried out in a computer environment with a core i7 2.3 GHz processor, Windows 10 operating system and 16GB RAM.The relevant code for this article is published in https://github.com/Nutshe/code.

B. EXPERIMENTS ON SYNTHETIC DATASETS
Among the eight synthetic datasets with different distributions, the four classic datasets Flame, Spiral, Aggregation and S3 were used in the reference on original DPC [14].The 15 class clusters of dataset R15 are distributed in the ring and have similar Gaussian distribution.The datasets S1, A1 and A3 are the three commonly used clustering datasets with different characteristics of overlapping, complexity and number of class clusters.Applying to the eight two-dimensional synthetic datasets, the clustering results for the seven algorithms are shown in Figs.3-10, visually.Meanwhile, the five clustering evaluation indexes are shown in Tables 3-5, in which DPC refers to the original DPC [14].In addition, the values of optimal hyper-parameter, abbreviated by Par, suitable for each algorithm are also given in Table 3.
For the Acc, AMI, ARI, NMI and FMI index values of the seven algorithms shown in Tables 3-5 and the clustering effect shown in Figs.3-10, we can see that the SKTDPC algorithm shows the best performance on all of the eight synthetic datasets.This is because SKTDPC introduces the idea of k nearest neighbors to define the local density and obtains the correct cluster centers adaptively by second-order difference method.Through these two methods, SKTDPC can capture more comprehensive information for data distribution and avoid the wrong selection of cluster centers, which also plays an important role in improving the cluster effect.By further comparing SKTDPC with other algorithms, it is demonstrated that there are relative weaknesses with different degree for these six algorithms.Firstly, FSDPC, the original DPC and DGDPC are slightly inferior to SKTDPC, even though they perform well on most datasets.This is because these two algorithms only consider the global structure of data, resulting in partial information loss.By contrast, the local density of SKTDPC is calculated based on the distances from k nearest neighbors, which can deal with local data information well.In addition, FSDPC and the original DPC need to determine the clustering center manually.Thus the possible wrong selection of cluster centers would also lead to poor clustering effect.Secondly, for DPC-KNN and DB-SCAN, both of them show obviously poor clustering effect on S3, A1 and A3 datasets, as shown in Figs.8-10.This implies that the processing ability of the two algorithms is obviously weak for datasets with a high degree of overlap relatively.Moreover, the accuracy of DBSCAN is slightly lower than that of other algorithms on Flame, Aggregation, R15 and S1 datasets.This is caused by its recognition of noise points, as shown in related figures.Finally, the clus-    tering effect of K-means algorithm is very poor on Flame, Spiral and Aggregation datasets, as shown in Figs.3-5.This is attributed to the defects of K-means itself, which cannot capture the structural characteristics of non-convex dataset.In contrast, the other six density-based algorithms show good performance on these datasets.
The running time of SKTDPC algorithm and the other six algorithms is shown in Table 6 to evaluate the algorithmic clustering efficiency.It can be found that among the five algorithms belonging to the DPC series, including SKTDPC, FSDPC, the original DPC, DGDPC and DPC-KNN, the clustering efficiency of SKTDPC algorithm newly proposed is much higher than that for the other four algorithms.Especially when the dataset size is large enough, the acceleration effect is relatively more obvious.This is precisely because SKTDPC accelerates the calculation of local density and relative-separation by K-d tree and the sparse search strategy, and further requires less distance calculation and storage space compared with the other four algorithms.It can also be found that the running speed of K-means and DBSCAN is relatively faster compared with the DPC series.As discussed above, however, K-means is only applicable  to globular clusters.For non-convex clusters, the clustering effect is very poor, even though the efficiency is very high.
As for DBSCAN, it is sensitive to the setting of two hyperparameters and is difficult to use when the data density is not evenly distributed.In addition, as shown above, the processing ability of this algorithm is weak obviously for datasets with a high degree of overlap, such as S3, A1 and A3.

C. EXPERIMENTS ON REAL-WORLD DATASETS
In order to further verify the clustering effect and efficiency of SKTDPC algorithm, a comparative analysis is made on six UCI real datasets (Seeds, Iris, Banknote, Wine, Ecoli, Parking Birmingham) with different dataset size, number of class clusters and data dimension.In contrast with the eight synthetic datasets, which are two-dimensional, the UCI real datasets are relatively high-dimensional and complex.
The performance of SKTDPC and other six algorithms is compared.The index values and parameter values of all algorithms are shown in Tables 7-10, in which "-" indicates that the value for running time at this position is very large or even cannot be obtained.
From the Acc, AMI, ARI, NMI and FMI values shown in Tables 7-9 it can be found that: (1) SKTDPC algorithm  (3) FSDPC and DGDPC algorithms can maintain similar clustering accuracy as the original DPC, but there is still distinction in some extent for these two algorithms compared with SKTDPC algorithm.This is because FSDPC, DPC and DGDPC ignore local data information.In addition, SKTDPC can obtain the location of mutation-point adaptively in the clustering process to determine the cluster centers without interrupting the algorithm.Furthermore, due to the complexity and sparsity of the Banknote and Parking datasets, the clustering results of the seven algorithms is not satisfactory.For this kind of dataset, it is necessary to further study its inherent data structure characteristics, to find a more effective clustering method.By analyzing Table 10, it can be found that the proposed SKTDPC algorithm obviously improves the clustering efficiency.In small datasets, the clustering efficiencies of the seven algorithms are roughly the same, but the distinction becomes more obvious with the increase of data volume.For the dataset Parking Birmingham with the largest data volume, the clustering efficiency of SKTDPC is 1.92 times and more times higher than that of FSDPC, DPC, DGDPC and DPC-KNN, respectively.At this time, DPC, DGDPC and DPC-KNN are on the verge of collapse and almost lost their execution ability because they require huge amount of computation and storage space.In addition, the running time of these two algorithms is seriously time-consuming, which is already meaningless.The high efficiency for SKTDPC is mainly attributed to the dual acceleration strategies to deal with large datasets, as analyzed in Section III.F.One is the acceleration for calculation of the local density by K-d tree.Thus a sparse distance matrix D is calculated instead of a full-rank matrix D to find the k nearest neighbors.Another is the acceleration for calculation of the relative-separation by the sparse search strategy with intersection between NN k (x i ) and B(x i ).Therefore, the dual accelerations strategies play an important role in reducing the algorithmic complexity, which makes SKTDPC algorithm show more prominent on large datasets.It should also be noted that although the running speed of DBSCAN and K-means algorithms is fast, due to their own limitations, the universality of the algorithms are too weak to produce an ideal clustering results.

D. ROBUSTNESS TO NUMBER OF NEAREST NEIGHBORS K
In this section, this paper will analyze and discuss the influence of the proposed algorithm SKTDPC on the Acc under different parameter k settings.The number of nearest neighbors k is the only hyper-parameter of SKTDPC, so it is crucial to analyze the change of clustering effect under different settings.The adjustment of parameters is to accommodate datasets with different distribution characteristics.At the same time, within the scope of the regulation of as small as possible to avoid algorithm into local optimum and can achieve ideal clustering result is what we are after.If  the parameters of the algorithm need to be optimized in a wide range, it will not only affect the search efficiency of the optimal parameters, but also make the robustness and stability of the algorithm worse and it is difficult to achieve the ideal clustering effect.Therefore, we focus on analyzing the change of Acc within a reasonable range of k = [2, 10] for the SKTDPC algorithm, as shown in Fig. 11.
We can clearly see that most of the datasets of SKTDPC show stable and ideal clustering effect relatively in a small range of parameters, which shows the excellent robustness of the algorithm.On Spiral, Iris, Wine and Ecoli datasets, although Acc shows relatively large fluctuations, the algorithm can also quickly achieve ideal clustering effect in a small range of parameter adjustment, which is not easy.According to our analysis, the reason for this large fluctuation may be that the algorithm automatically identifies incorrect cluster centers under the parameter k with low Acc value.Therefore, the proposed SKTDPC algorithm can achieve the ideal clustering effect and robustness suitable for most datasets within a small range of parameter adjustment.

E. ADDITIONAL DISCUSSION
The focus of the present work is on dealing with datasets with arbitrary shape and large size.As for higher dimensional datasets, the improvement for clustering efficiency of SKT-DPC is not obvious.However, it is not a vexing problem.In this section, the applicability of SKTDPC on higher dimensional dataset is briefly discussed and some reasonable suggestions are given.Many effective dimension reduction methods can be used to handle this case, such as principal component analysis (PCA), locally linear embedding (LLE), laplacian eigenmaps (LE), etc.
The following is a simple example of processing 16dimensional dataset Pendigits using PCA as a dimension reduction method.Fig. 12 shows the heat map of Acc, AMI, ARI, NMI, FMI and running time obtained by PCA+SKTDPC algorithm by different combinations of k and d, where d is the dimension after dimension reduction.The darker the color of the rectangular block is, the better the indicator value is.We can see that Acc, AMI, ARI, NMI and FMI are all optimal with the set of parameters k = 7 and d = 4.However, the corresponding running time of 39.51s is not the minimum in the range considered.Meanwhile, there are non-unique parameter combinations with values of running time less than 39.51s.In this case, for determining parameter combination the principle should be satisfied, that is the highest efficiency is achieved on the premise of guaranteeing the ideal clustering effect, instead of pursuing speed blindly.In accord with this principle, the final parameter combination, k = 7 and d = 4, is adopted.Under this parameters, the running time results of PCA+SKTDPC and SKTDPC are shown in Table 11.

V. CONCLUSIONS
An extended DPC algorithm, called SKTDPC, is successfully proposed by K-d tree, sparse search and second-order difference methods.Applying to eight synthetic datasets with two dimensions and six real datasets, comparisons have been carried out between SKTDPC and the six typical clustering algorithms, including FSDPC, the original DPC, DGDPC, DPC-KNN, DBSCAN and K-means algorithms.The main conclusions can be summarized as follows.
Firstly, the algorithmic complexity is obviously reduced by dual accelerations.One is the acceleration for calculation of the local density with a sparse distance matrix, which is attributed to fast search of k nearest neighbors by K-d tree.Another is the acceleration for calculation of the relativeseparation by a sparse search strategy with the intersection between the set of k nearest neighbors and the set consisting of the data points with larger local density for any data point.Experimental validation demonstrates that compared with the DPC series algorithms, SKTDPC algorithm can achieve higher clustering efficiency on all datasets.The larger the dataset, the greater the advantage of SKTDPC.Secondly, experiments indicate that SKTDPC algorithm can realize the best clustering effect in general, compared with the other algorithms.Furthermore, it is indicated that compared with K-means and DBSCAN algorithm, SKT-DPC algorithm has a relatively stronger general applicability for datasets with arbitrary distribution characteristics, even though they have better clustering efficiency in some cases.
Finally, the second-order difference method for decision values is adopted to determine the location of the mutationpoint adaptively, which avoids the trouble of selecting the cluster centers manually.It is also verified by experiments that the present method can produce the correct number of cluster centers automatically.
For future work, datasets with insufficient target data, high complexity or high sparsity will be further explored and studied to enhance the application ability of clustering algorithm.

FIGURE 1 .
FIGURE 1. Example demonstration of sparse search strategy for obtaining δi.
(a) Distribution of data points in dataset SS2.(b) The data points are arranged in order γ * from the largest to the smallest.

FIGURE 2 .
FIGURE 2. A simple two-dimensional synthetic dataset as the example.

( 2 )
Acquisition process of relative-separation δ i based on the sparse search strategy with intersection between NN k (x i ) and B(x i ).The complexity of this part is far less than O(n(n 1−1/K + k)), as analyzed in Section III.C in detail.(3) Adaptive determination of cluster centers.The complexity of this part is determined by descending order, second-order difference, and determining the center points that satisfy the mean value condition.The complexity of the three steps is O(nlogn), O( √ n − 2) and O(M p ), respectively.Thus, the computational complexity of the whole part is O(nlogn)+O( √ n −2)+O(M p ). (4) Allocation of non-center points.The computational complexity of this part is O(n).The total computational complexity of SKTDPC algorithm is O

FIGURE 5 .FIGURE 6 .
FIGURE 5. Clustering results of the seven algorithms on Aggregation dataset.

FIGURE 7 .FIGURE 8 .
FIGURE 7. Clustering results of the seven algorithms on S1 dataset.

FIGURE 9 .FIGURE 10 .
FIGURE 9. Clustering results of the seven algorithms on A1 dataset.
(a) Acc of SKTDPC on synthetic datasets.(b) Acc of SKTDPC on UCI datasets.

FIGURE 11 .
FIGURE 11.Acc of SKTDPC under different parameters k.

FIGURE 12 .
FIGURE 12. Results of six indexes obtained by SKTDPC algorithm for Pendigits dataset under different parameter combinations.
YUXIN CUI was born in 1995.She is currently pursuing the Ph.D. degree in the School of Science, Harbin University of Science and Technology, Harbin, China.Her current research interests include machine learning, stochastic logic system analysis.SHUAI LI was born in 1998.He is currently pursuing the Ph.D. degree in the School of Materials Science and Chemical Engineering, Harbin University of Science and Technology, Harbin, China.His current research interests include machine learning, material research, high entropy alloy.MING ZHOU was born in 1997.She is currently pursuing the M.S. degree in the School of Science, Harbin University of Science and Technology, Harbin, China.Her current research interests include machine learning, mathematics of computation, nonlinear numerical analysis.XIANG LI was born in 1997.He is currently pursuing the M.S.degree in the School of Science, Harbin University of Science and Technology, Harbin,China.His current research interests include machine learning, data mining.
Search k nearest neighbors based on constructed K-d tree; Build an empty set Q; Recursive visit down K-d tree until leaf node is reached.At this point, the leaf node is the current node (x current ) and it is marked as visited; loop calculate the distance d(x target , x current ); add the x current to Q; if |Q| < k, |Q| indicates the number of elements in the set Q then if there is another child node in the current node then recursive visit down K-d tree until leaf node is reached and marked it as visited; update the leaf node to the current node; else find a parent node which has not been visited and mark it as visited; update the parent node to the current node; else break loop; end; if the current node is leaf node then while the current node has parent node do do Block A:; find a parent node which has not been visited and mark it as visited; update the parent node to the current node; calculate the distance d(x target , x current ); if d(x target , x current ) < Q max , Q max is the farthest distance from x target in Q then replace the point farthest from x target in Q with the current node; do Block B:; if there is another child node in the current node then calculate the distance d(x target , l cur-tan ), l cur-tan is the tangent line of the current node; if d(x target , x current ) < Q max then recursive visit down K-d tree until leaf node is reached and marked it as visited; update the leaf node to the current node; calculate the distance d(x target , x current ); if d(x target , x current ) < Q max then replace the point farthest from x target in Q with the current node; else while any leaf node has non-unique data do do block B; do block A;

:
The dataset S = {x 1 , x 2 , . . ., x n } ∈ R n×K ; The number of the nearest neighbors k Output: The clustering result Y 1: Construct a K-d tree for dataset S, by Algorithm 1. 2: Search the k nearest neighbors of the point x i based on the constructed K-d tree, by Algorithm 1.Meanwhile, the symmetric sparse distance matrix D is obtained.3: According to Equation (7) by the k nearest neighbors, the first acceleration calculation is carried out to obtain ρ i , and the descending order processing is made for ρ i to generate ρ * (10) = 1, 2, ..., n).5:According to Equation (9), calculate γ i and further obtain γ * i by descending order processing.6:According to Equation(10), calculate the mutation point M p to determine the candidate cluster centers as the data points with decision value γ * i (i = 1, 2, . . ., M p ).The candidate centers x i (i ∈ [1, M p ]), which satisfies both ρ i >