F-DPC: Fuzzy Neighborhood-Based Density Peak Algorithm

Clustering is a fundamental technique in data mining that divides a data set into classes or clusters according to a given criterion, so that data objects within the same cluster are as similar as possible. Clustering by fast search and find of density peaks (DPC) is a novel density-based clustering algorithm. It is simple, requires few parameters, needs no iterative solution, scales well, and can detect clusters of arbitrary shape. However, DPC still has defects: because it uses crisp neighborhood relations to compute local density, it cannot distinguish the different neighborhood membership degrees of points at different distances, and it cannot accurately cluster data with multiple density peaks. To address these shortcomings, a fuzzy neighborhood density peak clustering algorithm (F-DPC) is proposed: a novel local density is defined through a fuzzy neighborhood relationship. Fuzzy set theory makes the fuzzy neighborhood function of the local density more sensitive, so that clustering of data sets with various shapes and densities is more robust. Experiments show that the algorithm has high accuracy and robustness.

given dataset, and selects appropriate eigenvectors to cluster the data samples by solving for the eigenvalues and eigenvectors of that matrix. Spectral clustering is in essence a graph-based clustering algorithm: it transforms the clustering problem into an optimization problem on a graph. It has a solid theoretical basis, is easy to implement, overcomes the shortcomings of some classical clustering algorithms, and ensures that the final result converges to the global optimum. The AP algorithm [15] is a clustering algorithm published in Science that does not need the number of clusters to be estimated in advance. Instead, every sample is treated as a potential cluster center at the start of the algorithm, and the optimal cluster centers are found by iteration so that the similarity of each data sample to its nearest representative point is maximized. The AP algorithm is simple and efficient, and its results are relatively stable over multiple runs.

II. RELATED WORK
Recently, Science published an original clustering algorithm [16]: clustering by fast search and find of density peaks (DPC) [27]-[29]. The algorithm needs few parameters, can detect clusters of arbitrary shape and dimension, and is not sensitive to noise. It proceeds in two stages. In the first stage, the algorithm uses user-supplied parameters to compute the local density and distance of each sample, finds the so-called density peaks, and selects appropriate cluster centers among the samples from a decision graph. In the second stage, each remaining sample is assigned to the cluster of its nearest neighbor of higher density; the whole clustering process is simple and efficient.
The core idea of the density peak clustering algorithm [30], [31] is to find the best cluster centers. An ideal cluster center should satisfy two conditions: (1) the density of the cluster center itself is large, i.e., it is surrounded only by neighbors of lower density; (2) different cluster centers are relatively far from each other. The DPC algorithm therefore introduces two quantities, the local density ρ and the distance δ, corresponding to these two conditions. For any sample i, the local density ρ_i and the distance δ_i are defined by expressions (1) and (2):

ρ_i = Σ_{j≠i} χ(d_ij − d_c),   (1)

δ_i = min_{j: ρ_j > ρ_i} (d_ij).   (2)
Here, d_ij is the distance between samples i and j, and d_c is the cutoff distance, which the user must specify in advance. In the function χ(x), χ(x) = 1 when the argument x < 0; otherwise χ(x) = 0.
For the sample i with the maximum local density ρ, the distance is defined as δ_i = max_j(d_ij). In addition, Rodriguez and Laio give another way to calculate the local density, using a Gaussian kernel, as shown in formula (3):

ρ_i = Σ_{j≠i} exp(−(d_ij/d_c)²).   (3)

In formula (1), ρ_i is the number of samples in the dataset whose distance from sample i is less than d_c; the more data points within distance d_c of x_i, the larger the value of ρ_i. In formula (3), ρ_i is the sum of Gaussian-weighted contributions of the distances between sample i and the other samples; in theory, when the local density is computed with formula (3), the probability that different samples share the same local density is smaller, so it is preferable for datasets with few samples. In formula (2), the distance δ_i is the distance between sample i and its nearest neighbor of higher local density ρ. In the original algorithm, the local density ρ and the distance δ are used to construct a decision graph, the samples with large values of both ρ and δ are selected as cluster centers, and each cluster center defines one cluster. Each remaining sample j is then assigned to the cluster of its nearest neighbor with higher local density. As formula (2) shows, when x_i has the largest local density, δ_i is the distance between x_i and the data point farthest from x_i in the dataset; otherwise, δ_i is the distance between x_i and the nearest data point among those with local density greater than that of x_i.
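For concreteness, the following sketch (ours, not part of the original paper) computes ρ under both the cut-off kernel of formula (1) and the Gaussian kernel of formula (3), together with δ as defined in formula (2); the function names and the NumPy formulation are our own assumptions.

```python
# Hedged sketch of the DPC quantities in formulas (1)-(3), using NumPy only.
import numpy as np

def dpc_rho_delta(X, d_c, gaussian=False):
    """Return local density rho and distance delta for each row of X."""
    n = X.shape[0]
    # Pairwise Euclidean distances d_ij.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if gaussian:
        # Formula (3): Gaussian-weighted density; subtract the self term exp(0)=1.
        rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    else:
        # Formula (1): count of neighbors closer than d_c (the chi kernel),
        # excluding the point itself.
        rho = (d < d_c).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:
            # The densest point: delta is its distance to the farthest point.
            delta[i] = d[i].max()
        else:
            # Formula (2): distance to the nearest point of higher density.
            delta[i] = d[i, higher].min()
    return rho, delta
```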
Although the calculation of δ is easy to understand, it can cause problems in some special cases. For example, suppose ρ_i = ρ_j = max_{k∈I_s}{ρ_k}, and the data points x_i and x_j happen to belong to the same class and lie close together. By the definition, δ_i = max_{j∈I_s}{d_ij} and δ_j = max_{k∈I_s}{d_jk}. Since both ρ_i and ρ_j are large, both points are selected as cluster centers by the center-selection rule, and one class is split into two.
The DPC algorithm can identify clusters of arbitrary shape and dimension [32]-[35], but some of its shortcomings affect the final clustering result. In particular, it is a hard clustering algorithm: once one sample is assigned wrongly, the samples that depend on it are misassigned as well. In Figure 1, panels a and b compute the local density with formulas (1) and (3), respectively; sample F is a cluster with multiple density peaks. Observation shows that the final clustering result is not ideal because the sample is wrongly assigned to the nearest cluster center B.
In view of the above shortcomings, it is found that the DPC algorithm is very effective for clustering convex data. However, when the data points between classes are unevenly distributed, the number of points within the neighborhood radius of a center is not sufficient to determine its membership degree. As shown in Figure 1, points x1 and x2 have the same number of points in their neighborhoods, the same neighborhood radius ε, and a neighborhood radius smaller than the maximum neighborhood radius. In the classical (crisp) case, there is no difference in membership degree among points within the same neighborhood radius of a core point, that is, x1 and x2 have the same neighborhood membership degree. However, Figure 2 shows that there is a clear gap between x1 and x2. Xie et al. [17], [18] proposed the improved KNN-DPC and FKNN-DPC algorithms based on DPC; they combine the K-nearest-neighbor idea and remedy the defect of DPC in measuring sample density. Wu et al. proposed an effective clustering method based on density peaks with symmetric neighborhood relations [36]. Parmar et al. [37] proposed a feasible residual-based density peak clustering algorithm with a fragment merging strategy. Yan et al. [38] proposed using a statistical outlier detection method to automatically identify cluster centers from decision graphs. Parmar et al. [39] proposed a residual-error-based density peak clustering algorithm named REDPC to better handle datasets with various data distribution patterns, and later a density peak clustering algorithm based on squared residual error [40]. Zhang and Li [19] addressed the failure to recognize low-density clusters by using the CHAMELEON algorithm. Mehmood et al. proposed rapid searching of gene local density peaks with merge clustering for microarray expression data, which solves clustering problems of different shapes. However, the DPC algorithm still has shortcomings: (1) in the selection of cluster centers, DPC gives no explicit criterion of membership degree, and the correct membership degree cannot be obtained from the distance measure alone; (2) if a cluster in the dataset contains multiple density peaks, or the clusters are densely packed, the algorithm cannot achieve the desired clustering results.
In view of this, a density peak clustering algorithm based on fuzzy neighborhoods (F-DPC) is proposed in this paper. The algorithm improves DPC in two respects: (1) it uses fuzzy set theory to define a new fuzzy neighborhood relation, which gives a definite membership-degree criterion and identifies the correct membership degrees of points within the same neighborhood radius; (2) it uses the fuzzy neighborhood strategy to obtain a new local density, which improves the accuracy of cluster-center selection, is more robust to data of various shapes and densities, and remedies the poor clustering when a class contains multiple peaks.

III. THE PROPOSED ALGORITHM F-DPC
According to the above analysis, determining the membership degree of a center simply from the number of points within its neighborhood radius is limited, especially for unbalanced data. In classical (crisp) clustering, the boundaries between clusters are sharp and each pattern is assigned to exactly one class. In real data, however, the boundaries between clusters cannot be defined precisely, so a pattern may belong to several clusters at once. In such cases, fuzzy clustering achieves better results. Fuzzy set theory is widely used to solve such problems, so this paper proposes a new method based on fuzzy set theory to solve the problems in the original algorithm.
The density peak algorithm is described as follows:
Step 1: Initialization and preprocessing.
1.1 Give the parameter t ∈ (0, 1) used to determine the cutoff distance d_c.
1.2 Calculate the distances d_ij and determine d_c from t.
Step 2: Determine the cluster centers {m_j}, j = 1, ..., n_c, and initialize the class label of each data point: x_i is labeled k if it is the cluster center of class k, and −1 otherwise.
Step 3: Classify the data points that are not cluster centers.
Step 4: If n_c > 1, further divide the data points of each class into cluster core and cluster halo. Initialize the mark h_i = 0, i ∈ I_s, generate an average local-density upper bound for each class, and identify the cluster halo.
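A hedged end-to-end sketch of Steps 1-3 follows, reusing dpc_rho_delta from the previous listing; the automatic choice of the n_c points with the largest ρ·δ products stands in for the manual decision-graph inspection, and the core/halo split of Step 4 is omitted for brevity.

```python
# Sketch of the DPC procedure above; assumes dpc_rho_delta is in scope.
import numpy as np

def dpc_cluster(X, d_c, n_centers):
    rho, delta = dpc_rho_delta(X, d_c)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: take the points with the largest rho * delta as cluster centers.
    centers = np.argsort(rho * delta)[::-1][:n_centers]
    labels = np.full(X.shape[0], -1)
    labels[centers] = np.arange(n_centers)
    # Step 3: visit points in order of decreasing density; every non-center
    # point inherits the label of its nearest neighbor of higher density,
    # which has already been labeled by the time it is visited.
    for i in np.argsort(rho)[::-1]:
        if labels[i] == -1:
            higher = np.where(rho > rho[i])[0]
            if higher.size == 0:
                # The densest point was not chosen as a center: fall back to
                # the nearest center (a guard for this sketch, not in the paper).
                labels[i] = labels[centers[np.argmin(d[i, centers])]]
            else:
                labels[i] = labels[higher[np.argmin(d[i, higher])]]
    return labels
```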

Definition 1 (The Basic Concept of Fuzzy Joint Point Method):
The fuzzy joint point method is based on a level-based view for explaining vagueness. The level expresses how many elements should be considered together when constructing homogeneous groups. Obviously, the more closely the elements are examined, the greater the differences between them; the fuzzier the view, the more similar the elements appear to each other. In this setting, the fuzzy neighborhood can indicate the types of attributes considered in more detail. Since all elements differ from each other when the ambiguity is at its minimum of zero, each element can be treated in the same manner when classes are formed. Similar elements will belong to one class, and elements that differ from each other will belong to different classes, so they carry different membership degrees.
Let F(ℝ^m) denote the set of m-dimensional fuzzy sets over ℝ^m, let μ_A : ℝ^m → [0, 1] denote the membership function of a fuzzy set, and let A ∈ F(ℝ^m). In the conical fuzzy space shown in Figure 3, the conical fuzzy point A = (a, R) ∈ F(ℝ^m) is the fuzzy set whose membership function is defined as in (4):

μ_A(x) = 1 − d(a, x)/R if d(a, x) ≤ R; μ_A(x) = 0 otherwise,   (4)

where a ∈ ℝ^m is the center of the fuzzy point A and R is the support radius of A, calculated as shown in formula (5). The α-level set of the conical fuzzy point A = (a, R) follows from (4) as formula (6):

A_α = {x ∈ ℝ^m : μ_A(x) ≥ α} = {x : d(a, x) ≤ (1 − α)R}.   (6)

Definition 2 (Distance Between Fuzzy Points): Let A = (a, R) and B = (b, R) be fuzzy points of the fuzzy set X ⊂ F(E^p). The neighborhood fuzzy relation T : X × X → [0, 1] on the fuzzy set X is given by formula (7):

T(A, B) = max{0, 1 − d(a, b)/(2R)},   (7)

where d(a, b) is the distance between a and b, and R is the neighborhood radius. When a ∈ E^p and b ∈ E^p, the distance between the fuzzy points A and B is that between their centers, as shown in formula (8); that is, d(a, b) represents the distance between the central points of the fuzzy points A and B. The mapping T therefore satisfies ∀A ∈ X : T(A, A) = 1.
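A minimal sketch of these notions follows, under our reading of formulas (4), (6), and (7): the conical fuzzy point A = (a, R) has membership falling linearly from 1 at its center to 0 at the support radius R, and two fuzzy points with the same radius R are related through their center distance. All function names are our own.

```python
# Hedged sketch of the conical fuzzy point and the neighborhood relation T.
import numpy as np

def cone_membership(a, R, x):
    """mu_A(x) for the conical fuzzy point A = (a, R), as in formula (4)."""
    return max(0.0, 1.0 - np.linalg.norm(np.asarray(x) - np.asarray(a)) / R)

def alpha_level_set_radius(R, alpha):
    """Radius of the alpha-level set A_alpha = {x : d(a, x) <= (1 - alpha) R}."""
    return (1.0 - alpha) * R

def neighborhood_relation(a, b, R):
    """Fuzzy neighborhood relation T of formula (7) between (a, R) and (b, R)."""
    return max(0.0, 1.0 - np.linalg.norm(np.asarray(a) - np.asarray(b)) / (2.0 * R))

# A and B are alpha-neighborhood fuzzy points exactly when
# neighborhood_relation(a, b, R) >= alpha, i.e. d(a, b) <= 2R(1 - alpha).
```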
Let the fuzzy points A = (a, R) and B = (b, R) be α-neighborhood fuzzy points, i.e., T(a, b) ≥ α (Definition 3). From T(a, b) ≥ α and formula (7), formula (9) is obtained:

d(a, b) ≤ 2R(1 − α).   (9)

Indeed, T(a, b) ≥ α gives α ≤ 1 − d(a, b)/(2R), which proves that Definition 3 is consistent.
Definition 4 (α-Joint Fuzzy Points): If, for a fixed α ∈ (0, 1], there is a series of α-neighborhood fuzzy points C_1, ..., C_k, k ≥ 0, linking A to B, that is, formula (10) holds:

T(A, C_1) ≥ α, T(C_1, C_2) ≥ α, ..., T(C_k, B) ≥ α,   (10)

then point A and point B are called α-joint fuzzy points, as shown in Figure 4.
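The α-joint relation is ordinary graph reachability over the thresholded relation T ≥ α, which the following sketch makes explicit; the breadth-first formulation and all names are our own assumptions.

```python
# Sketch of the alpha-joint check of Definition 4 as graph reachability.
import numpy as np
from collections import deque

def alpha_joint(points, R, alpha, i, j):
    """True if fuzzy points centered at points[i] and points[j] are alpha-joint."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    adj = (1.0 - d / (2.0 * R)) >= alpha       # T(a, b) >= alpha, formula (7)
    seen, queue = {i}, deque([i])
    while queue:                               # breadth-first search over chains
        u = queue.popleft()
        if u == j:
            return True
        for v in np.where(adj[u])[0]:
            if v not in seen:
                seen.add(v)
                queue.append(int(v))
    return False
```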

Definition 5 (α-Joint Fuzzy Point Distance):
Let X ⊂ F(E^p) be a set of fuzzy points. If A and B are fuzzy points of the α-joint neighborhood, with α ∈ (0, 1] and ∀A, B ∈ X, then X is called an α-joint neighborhood fuzzy set. The quantity d(A_α, B_α) denotes the α-level distance between A_α and B_α on the dataset.
Lemma 2: If the fuzzy point A and the fuzzy point B are α-neighborhood fuzzy points, then A_α ∩ B_α ≠ ∅.
Proof: Let the fuzzy points A and B be α-neighborhood fuzzy points, so that T(a, b) ≥ α. First, assume that A_α ∩ B_α ≠ ∅ is false, that is, A_α ∩ B_α = ∅. Then for the centers a ∈ E^p and b ∈ E^p there is a point x ∈ E^p with x ∉ A_α and x ∉ B_α, from which formula (11) is obtained. According to the inequality given by Lemma 1 and the α-neighborhood of points A and B, this contradicts formula (9); therefore ∃x : x ∈ A_α and x ∈ B_α, which gives formulas (13) and (14), and hence A_α ∩ B_α ≠ ∅.

Definition 6 (Distance and Local Density): Let X = {x_1, x_2, ..., x_n} denote the n data objects in the dataset, where each object x_i, 1 ≤ i ≤ n, has m attributes, and let x_{i,j} denote the j-th attribute of x_i. The distance between points x_i, x_j ∈ X is the Euclidean distance

d(x_i, x_j) = ( Σ_{k=1}^{m} (x_{i,k} − x_{j,k})² )^{1/2}.

To form a fuzzy relation μ : X × X → [0, 1], a fuzzy membership grade is defined whose only input parameter is the neighborhood radius, determined by a percentage called the cutoff distance: the neighborhood radius is ε = d_{⌈N_d · d_c⌉}, where ⌈·⌉ denotes the ceiling, the pairwise distances are sorted in ascending order so that d_max = d_{N_d} = max d(x_i, x_j), and the input parameter d_c is a percentage. A point x_i ∈ X with parameter ε has the neighborhood set N(x_i, ε) = {x_j ∈ X | d(x_i, x_j) < ε}. From formula (4), the local density is ρ_i = Σ_j μ_{x_i}(x_j). Through formula (8), the distance between any two fuzzy points in the dataset can be computed, giving the fuzzy distance formula (15); then, from the definition of the local density, the fuzzy neighborhood density function is obtained as formula (17):

ρ_i = Σ_j μ_{x_i}(d(x_i, x_j) − d_c),   (17)

where d_c denotes the cutoff distance.
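A minimal sketch of this fuzzy neighborhood density follows, under our reading of Definition 6: each neighbor contributes its membership grade instead of a hard 0/1 count, and the radius ε is taken as the ⌈N_d · d_c⌉-th smallest pairwise distance. The linear-decay (conical) membership and all names are our assumptions.

```python
# Hedged sketch of the fuzzy neighborhood density of Definition 6.
import numpy as np

def fuzzy_neighborhood_density(X, d_c):
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # eps = the ceil(N_d * d_c)-th smallest of the N_d off-diagonal distances.
    pair = np.sort(d[np.triu_indices(n, k=1)])
    eps = pair[max(int(np.ceil(pair.size * d_c)) - 1, 0)]
    # Membership of x_j in the fuzzy neighborhood of x_i: 1 at distance 0,
    # decaying linearly to 0 at distance eps (a conical membership).
    mu = np.clip(1.0 - d / eps, 0.0, 1.0)
    np.fill_diagonal(mu, 0.0)        # a point does not count toward itself
    return mu.sum(axis=1)            # rho_i = sum_j mu_{x_i}(x_j)
```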
The fuzzy neighborhood density peak algorithm is described as follows:
Step 1: Input the dataset X ∈ ℝ^{n×m} and the parameter d_c.
Step 2: Calculate the distance matrix by formula (14).
Step 3: Calculate the fuzzy neighborhood density ρ_i of each point x_i by the formula ρ_i = Σ_j μ_{x_i}(d(x_i, x_j) − d_c).
Step 4: Calculate the distance δ_i of each point x_i by formula (15).
Step 5: Draw the decision graph and select the cluster centers.
Step 6: Assign the remaining points to clusters using the allocation strategy described in the related work.
Step 7: Output the vector of cluster index labels.
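Putting Steps 1-7 together, a hedged end-to-end sketch might look as follows; it reuses fuzzy_neighborhood_density from the previous listing and replaces the manual decision-graph inspection of Step 5 with an automatic top-n_c choice of the points maximizing ρ·δ, which is our simplification rather than the paper's procedure.

```python
# Sketch of the F-DPC pipeline; assumes fuzzy_neighborhood_density is in scope.
import numpy as np

def f_dpc(X, d_c, n_centers):
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Step 2
    rho = fuzzy_neighborhood_density(X, d_c)                    # Step 3
    delta = np.empty(n)                                         # Step 4
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    centers = np.argsort(rho * delta)[::-1][:n_centers]         # Step 5
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in np.argsort(rho)[::-1]:                             # Step 6
        if labels[i] == -1:
            higher = np.where(rho > rho[i])[0]
            pool = centers if higher.size == 0 else higher
            labels[i] = labels[pool[np.argmin(d[i, pool])]]
    return labels                                               # Step 7
```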
The algorithm flow chart is shown in Figure 5. The F-DPC algorithm requires few parameters to achieve good clustering results: it contains only one optional parameter, t. In fact, this parameter is used to determine d_c, so the essential parameter is the cutoff distance.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
To test the clustering results of the algorithm, this paper conducts experiments on classical artificial datasets and real datasets from UCI. To further compare the F-DPC algorithm with the original algorithm based on crisp neighborhood relations, we compare it with DBSCAN, X-means, AP, FCM, BIRCH, KNN-DPC, FKNN-DPC, and the original DPC algorithm, all of which are representative clustering algorithms. The experimental environment is a Win10 64-bit operating system with Matlab, 4 GB of memory, and an Intel(R) Core(TM) i5-3210M CPU @ 2.50 GHz. In this section, part A briefly describes the experimental dataset information and evaluation criteria, part B analyzes the clustering results on the artificial datasets, and part C gives the detailed experimental results.

A. EXPERIMENTAL DATA SETS AND EVALUATION CRITERIA
The information on the experimental datasets is shown in Table 1: the first seven are artificial datasets [20]-[24], and the remaining fourteen are real datasets from UCI. They are commonly used test datasets and vary in the number of attributes and the number of clusters. Since the attributes of some datasets are not of the same order of magnitude, they are normalized.
The evaluation criteria for the experimental results are accuracy (ACC), recall (Re), and normalized mutual information (NMI). Accuracy and recall are two metrics widely used in information retrieval and statistics to evaluate the quality of clustering results, and mutual information measures the interdependence between variables. All three evaluation indexes range from 0 to 1, and larger values indicate better clustering. For ACC and Re, k is the number of clusters, a_i is the number of samples correctly assigned to cluster C_i, U is the full sample set, and c_i is the number of samples belonging to cluster C_i that are misclassified into other clusters.
For NMI, X and Y represent random variables, I(X, Y) represents the mutual information of the two variables, and H(X) represents the entropy of X, so that NMI(X, Y) = I(X, Y) / √(H(X) · H(Y)).

B. CLUSTERING RESULTS AND ANALYSIS ON ARTIFICIAL DATASETS
The datasets are large datasets with different numbers of clusters, and the experimental results show that the F-DPC algorithm is robust to the number of clusters. As shown in panels (d), (e), and (f), the algorithm performs well on the three datasets of different sizes and shapes. The datasets of panels (a), (b), and (c) are two-dimensional sets of different spatial complexity with different degrees of overlap, and the performance of the proposed F-DPC algorithm on these datasets of varying complexity is excellent. These experiments show that the proposed algorithm is very effective at finding clusters of arbitrary shape, density, distribution, and number, with very good results. The algorithm overcomes the defects of the original algorithm and can correctly calculate the membership grade of each class before clustering.
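As a concrete illustration of the evaluation indexes defined in Section IV-A, the sketch below computes clustering accuracy via the usual Hungarian best match between cluster labels and true class labels, and NMI via scikit-learn; the Hungarian-matching convention is our assumption, not a detail given in the paper.

```python
# Hedged sketch of ACC and NMI for evaluating a clustering result.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correct under the best cluster-to-class map."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # cost[i, j] = minus the number of samples cluster i shares with class j.
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)     # Hungarian algorithm
    return -cost[row, col].sum() / y_true.size

# NMI as used in Table 4 (NMI = I(X, Y) / sqrt(H(X) H(Y))):
# nmi = normalized_mutual_info_score(y_true, y_pred)
```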

C. EXPERIMENTAL RESULTS AND ANALYSIS ON VARIOUS PERFORMANCE INDEXES
The results are shown in the following tables: Table 2 gives the accuracy (ACC) of the clustering results of each algorithm, Table 3 the recall (Re), and Table 4 the normalized mutual information (NMI) [25], [26]. From the three evaluation indicators, we can see that the clustering results of the F-DPC algorithm are generally better than those of the other clustering algorithms. For the Simle, S1, Jain, and Flame datasets, the results of the proposed algorithm are the same as those of the density peak clustering algorithm, and their accuracy, recall, and mutual information are all higher than those of the other algorithms. Except for the Iris and Ionosphere datasets, the indexes of the proposed algorithm are higher than those of the other algorithms on all datasets. On Iris and Ionosphere, the accuracy and recall of the proposed algorithm are higher than those of the other algorithms, but the mutual information is lower than that of the density peak clustering algorithm; this is because there are two clusters of the same density, so errors occur in center selection. Although the result there is not ideal, the other indexes remain higher than those of the other algorithms. On the unbalanced datasets Art, Compound, Ecoli, Waveform, Image Segmentation, Ionosphere, and Libras Movement, the clustering effect of the F-DPC algorithm is better than that of the other algorithms. The experimental results show that the proposed algorithm achieves the desired clustering results and overcomes the inability of the KNN-DPC and FKNN-DPC algorithms to accurately cluster unbalanced data. Table 5 takes the UCI datasets as an example to compare the time performance of the proposed algorithm with the original DPC algorithm and the classic AP, DBSCAN, FCM, X-means, and BIRCH clustering algorithms. From the results, we can see that the proposed algorithm inherits the time advantage of the DPC algorithm and runs faster than the other algorithms, so it has advantages in both running time and memory consumption.

V. CONCLUDING REMARKS
The density peak clustering algorithm does not handle neighborhood-boundary clustering well. To solve this problem, a new fuzzy neighborhood density function is introduced and the F-DPC algorithm is proposed. Comparative experiments show that our algorithm is very effective at finding clusters of arbitrary shape, density, distribution, and number. The proposed algorithm is not only more robust than the original algorithm on datasets with various shapes and densities, but also improves the clustering effect significantly compared with other improved algorithms. The experimental results show that the proposed algorithm effectively remedies the shortcomings of the density peak clustering algorithm and obtains the desired clustering results. However, the algorithm takes considerable time when the dataset is large; improving the efficiency of the F-DPC algorithm will be our next focus.