A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm

Density peaks clustering detects modes as points with high density and large distance to points of higher density. Each non-mode point is assigned to the same cluster as its nearest neighbor of higher density. Density peaks clustering has proved capable in applications, yet little work has been done to understand its theoretical properties or the characteristics of the clusterings it produces. Here, we prove that it consistently estimates the modes of the underlying density and correctly clusters the data with high probability. However, noise in the density estimates can lead to erroneous modes and incoherent cluster assignments. A novel clustering algorithm, Component-wise Peak-Finding (CPF), is proposed to remedy these issues. The improvements are twofold: 1) the assignment methodology is improved by applying the density peaks methodology within level sets of the estimated density; 2) the algorithm is not affected by spurious maxima of the density and hence is competent at automatically deciding the correct number of clusters. We present novel theoretical results, proving the consistency of CPF, as well as extensive experimental results demonstrating its exceptional performance. Finally, a semi-supervised version of CPF is presented, integrating clustering constraints to achieve excellent performance for an important problem in computer vision.

A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm Joshua Tobin and Mimi Zhang Abstract-Density peaks clustering detects modes as points with high density and large distance to points of higher density.Each non-mode point is assigned to the same cluster as its nearest neighbor of higher density.Density peaks clustering has proved capable in applications, yet little work has been done to understand its theoretical properties or the characteristics of the clusterings it produces.Here, we prove that it consistently estimates the modes of the underlying density and correctly clusters the data with high probability.However, noise in the density estimates can lead to erroneous modes and incoherent cluster assignments.A novel clustering algorithm, Component-wise Peak-Finding (CPF), is proposed to remedy these issues.The improvements are twofold: 1) the assignment methodology is improved by applying the density peaks methodology within level sets of the estimated density; 2) the algorithm is not affected by spurious maxima of the density and hence is competent at automatically deciding the correct number of clusters.We present novel theoretical results, proving the consistency of CPF, as well as extensive experimental results demonstrating its exceptional performance.Finally, a semi-supervised version of CPF is presented, integrating clustering constraints to achieve excellent performance for an important problem in computer vision.

I. INTRODUCTION
D ENSITY-BASED clustering methods relate the notion of clusters to high-density contiguous regions of the underlying density function.Hartigan [1] proposed the concept of density-based clusters as "regions ... where the densities are high surrounded by regions where the densities are low".The concept is attractive for several reasons: 1) the clusters are free to assume any shape, in contrast to model-based clustering methods; 2) the clustering method is associated with density but without requiring strong assumptions on the density function; 3) the number of clusters is linked to density peaks and can be determined as part of the estimation procedure.Density-based clustering methods can be broadly classified into two categories: level set methods and mode-seeking methods.
Level set methods detect clusters as connected components of the density level sets {x : f (x) ≥ λ}, where f is the density function and λ is a cutting threshold.The density f is unknown, and hence the level sets are required to be estimated from the data.Nearest-neighbor graphs have been widely used for this purpose [2], [3].Taking the instances to be the vertices of a graph, k-NN graphs add edges between a vertex and all its k nearest neighbors.Mutual k-NN graphs add an edge between two vertices only if they are k nearest neighbors of each other.It has been shown that any density level set of a given dataset can be approximated by the connected components of the mutual k-NN graph [2], [3], and further work attempted to develop an understanding of the optimal choice of k [2], [4].
Mode-seeking methods aim to directly locate the modes in the density and then associate each instance in the observed data with a relevant mode.Such approaches begin with a density estimate f and then move each point x i towards a mode of f by ascending the density.Mean shift, introduced in [5] and further developed in [6] and [7], is a popular mode-seeking method that associates an instance to a mode along the path of steepest ascent of the density estimate.To circumvent the costly run time of mean shift, the authors in [8] proposed a fast sample-based method, termed quick shift.Quick shift simply associates each instance to its nearest neighbor of higher empirical density.To return a partition of the data, a segmentation parameter τ is required such that an instance will not be associated to its nearest neighbor of higher density if the distance between them is greater than τ .Quick shift is shown in [9] to consistently estimate the non-trivial modes of the underlying density and to correctly assign instances to their associated mode.However, appropriate tuning of τ requires a knowledge of the distances between modes.Furthermore, determining modes by only the distances between instances and their nearest neighbor of higher density can cause outlying points to be erroneously selected as modes.
The density peaks clustering (DPC) method introduced in [10] offers a potential remedy to these issues, providing an intuitive method for sample-based mode detection.The true modes of the density are estimated using a decision graph, a scatter plot of the local density against the distance to the nearest neighbor of higher density.The modes are estimated as the extreme instances on the decision graph.DPC assigns the remaining instances to the detected modes using the same methodology as quick shift.The partition of the data is extracted by grouping together instances that are assigned to the same mode.The decision graph and the resulting clustering for a toy dataset are shown in Fig. 1.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.While many papers have demonstrated the ability of the DPC method to provide high-quality clusterings in applications [11], [12], there is, to the best of our knowledge, only one publication on the theoretical analysis of the DPC method.In [13], the authors derive a theoretically grounded rule for selecting modes from the decision graph, using a robust linear regression of log of the density estimates log f (x) against the log of the distances to neighbors of higher density.
In this work, we seek to deepen our understanding of the DPC method and propose a new density-based clustering technique that improves DPC both theoretically and computationally.By adapting results from related works, we provide theoretical guarantees that DPC consistently estimates the modes of the underlying density and can correctly cluster the data with high probability.We also demonstrate the deficiencies of the DPC methodology in the presence of noisy density estimates.Motivated by the deficiencies, we introduce a novel clustering algorithm: Component-wise Peak-Finding (CPF).CPF improves DPC in two ways: first, CPF partitions the data into regions mutually separated by areas of low density before clustering, thus ensuring the correct assignment of instances to their respective clusters; second, the peak-finding criterion is directed to seek modal-sets rather than point modes in the data, reducing the sensitivity of the clustering result to fluctuations in the density estimate.We provide theoretical guarantees for our new algorithm, extending the theoretical properties available for the DPC method.In particular, we prove that CPF recovers unique and consistent estimates of the high-density regions in the data, and correctly determines the true number of clusters.Furthermore, the complexity of our algorithm is of the order O(nk log(n)), near linear in k and n.Finally, to demonstrate the adaptability of CPF, we present a modified version of the method, CPF-Match, designed for multi-image matching, an application in computer vision.We show that CPF-Match achieves state-of-the-art performance for this task.
The remainder of the paper is organized as follows.In Section III, we formalize the DPC method, provide a theoretical analysis of its performance, and demonstrate its deficiencies via illustrative examples.In Section IV, the CPF algorithm is explained in detail, and its consistency properties are provided in Section V.In Section VI, we assess the clustering quality of CPF on a range of simulated and real-world datasets and show that CPF outperforms DPC and other peer clustering methods.Section VII introduces CPF-Match, an adapted method for multi-image matching.Section VIII concludes the paper.

II. RELATED WORK
Adaptations of the DPC method have proliferated in recent years.One strand of works focuses on improving the density estimator (see [14], [15]), and another strand of works focuses on automating the selection of modes from the decision graph (see [16], [17]).The authors of DPC have introduced a recent approach [18], which applies a density estimator based on the intrinsic dimension of the data, and a pruning mechanism for false modes.
A robust way of modelling high-density regions in the data space is proposed in [19].Modal-sets generalize the concept of a point mode to a local support of the density peak.An illustrative example is given in Fig. 2. The related clustering procedure, termed MCores, estimates the modal-sets using connected components of k-NN graphs at different levels of the empirical density.The authors provide consistency guarantees on the recovery of true modal-sets in the data.A subsequent work presents QuickShift++ [20] improving on the MCores procedure by adopting the same allocation procedure as quick shift and DPC.Recently, in [21], DPC was adapted to detect modal-sets.The method, termed DCF, was shown to detect modal-sets more efficiently than QuickShift++.
While [19], [20], [21] use classical non-parametric density estimators, recent literature has proposed density estimators using neural networks.A prominent approach uses energy-based models, defining an unnormalized density that is the exponential of the negative energy function, parameterized by a neural network.The estimation procedure involves either computing maximum likelihood estimates ( [22], [23]), variational approximation to an unnormalized target ( [24], [25]), or methods combining both approaches ( [26]).The density estimates produced by these methods are computationally expensive to compute, and consistency guarantees are not currently available making theoretical analysis challenging.Nevertheless, they can naturally be integrated into the clustering method discussed in this work.

A. The Method
The peak-finding method in [10] requires two inputs: 1) a density estimate at each data point, and 2) the distance from each point to its nearest neighbor of higher density.We consider a dataset X consisting of n data points in R p drawn from an unknown density f with compact support X .We use a k-NN density estimator as it is computationally fast and guarantees on its quality are well understood.For a data point x ∈ X, let r k (x) be the distance between x and its k-th nearest neighbor.The density estimate used is a simple functional of the distance r k (x).
Definition 1: For every x ∈ R p , let r k (x) denote the distance from x to its k-th nearest neighbor in X.The density estimate is given as where v p is the volume of the unit sphere in R p .
Note that this estimator is different from the estimate of the empirical density used in [10], which counts the number of data points within a threshold distance of a given instance.It is replaced as no guarantees of its consistency are possible.As well as a density estimate, the peak-finding criterion requires the distance from each point to it's nearest neighbor of higher density: Definition 2: For the point x = argmax x∈X fk (x), we define the quantity For the remaining points, let b(x) = argmin x ∈X { x − x : fk (x) < fk (x )}, i.e., the nearest neighbor of x with higher density.Define the distance to the nearest neighbor of higher local density as Also of interest is the product of the estimated density fk (x) and the distance quantity ω(x).This is termed the peak-finding criterion: Definition 3: Taking fk (x) and ω(x) as defined above, we define the peak-finding criterion γ(x) as Following [10], the decision graph is the scatter plot of {( fk (x), ω(x)) : x ∈ X}.To generate a set of mode estimates M = {x j } m j=1 , threshold values for the density fk (x) and the distance ω(x) need to be set: the modes are the data points with the two metric values both above the thresholds, i.e., The algorithm used for density peaks clustering in this formulation is described in Algorithm 1.The algorithm takes as input the dataset X and uses the parameter k to return the final set of clusters C. Initially, the set of estimated modes M = ∅ and the cluster assignment graph G(X, E) is initialized with vertices as the points of X and no edges.DPC produces the decision graph (Lines 2-3).
DPC requests the user to select estimated modes using this plot as reference.The estimated modes {x j } m j=1 are then added to M (Line 4).After the set of estimated modes has been returned, edges are added to the graph G(X, E) from each non-modal Algorithm 1: Density Peaks Clustering.
Input: Neighborhood parameter k.
Output: A set of clusters C 1: Initialisation: M = ∅, G(X, E), a directed graph with X as vertices and no edges, E = ∅.2: Create the decision graph {( fk (x), ω(x)) : x ∈ X}. 3: Select the estimated modes using the thresholds l and τ , i.e., {x ∈ X : fk (x) ≥ l, ω(x) ≥ τ } 4: Add the estimated modes {x j } m j=1 to M. 5: for each x in X\ M do 6: Add a directed edge from x to b(x).7: end for 8: for each estimated mode x ∈ M do 9: Let C be the collection of the points connected by any directed path in G(X, E) that terminates at x. 10: Add C ∪ x to C. 11: end for 12: return C point x to b(x) (Lines 5-7).The estimated mode together with all the vertices that have paths terminating at it form a cluster that is added to C (Lines 8-11).Proceeding in this way, each sample point will be assigned to a unique cluster.

B. Theoretical Analysis
The quality of the clusterings provided by DPC has been thoroughly demonstrated in practice, as discussed in Section I. Yet, no previous work has provided guarantees on the ability of DPC to recover modes consistently.Through drawing an analogy to the quick shift method, we show that DPC can recover the modes and the associated cluster assignments with strong consistency guarantees.
Quick shift, as described in Section I, is a fast non-parametric density-based method that produces clusterings with kernel density estimates.A directed graph is built with the observed instances as vertices, and edges added from each instance to its nearest neighbor of higher estimated density.The final clusters are extracted as the connected components of the graph once edges with length longer than a segmentation parameter τ are removed.The formulation of DPC introduced above is similar to quick shift in all but two ways: 1) a k-NN estimate of the density is used in place of a kernel density estimate, and 2) a second threshold value l is defined, used to flag low-density instances as outliers.As such, in this section we present the main results adapted from the consistency analysis of the k-NN density estimator [27] and the consistency analysis of quick shift [9].The primary contributions involve drawing the analogy to the quick shift approach, and the extension of the analysis to include the density threshold l used in the mode selection step of DPC.An extended analysis, including proofs of the theorems, is given in the supplementary material, available online.
We assume that f is α-Hölder continuous and lower bounded on X .Furthermore, it is assumed that the level sets of f are continuous with respect to the density level, and the modes of f have negative definite Hessian.Following [9], we now define a stronger notion of mode that allows for a clearer analysis of DPC.
Definition 4: r,θ,ν ⊆ M denote the set of (r, θ, ν) + -modes of f .An illustration of the (r, θ, ν) + -mode and the (r, θ, ν) −modes is given in Fig. 3. Recall that the DPC algorithm requires two cutting-off thresholds, one for cutting the value of the density estimate fk (x) and the other for cutting the distance to the nearest neighbor of higher estimated density, ω(x).Taking the thresholds as τ and l for the density and distance values respectively, our first theorem shows that M contains unique and consistent estimates of the (τ + , θ, l) + -modes of f , for θ, > 0.
Theorem 1 (Mode Estimation -adapted from Theorem 2 of [9]): For every where C is a constant depending on p, n, ζ, and f , Theorem 1 proves that DPC recovers the modes of an α-Hölder continuous density f consistently.For n large enough, with high probability, M contains unique estimates for all the true modes of f .As such, there is an injection between the set of true modes and the set of estimated modes.
The procedure used to assign points to their respective modes is the same as that used in quick shift.As such, theoretical guarantees developed for a variant of quick shift in [20] can be applied directly to DPC.We provide the relevant results below.
First, we define the attraction region of a mode.The attraction region of a particular mode covers all points that flow towards the mode along the direction of the gradient of the underlying density.
Definition 5: Let path ν x : R → R p satisfy ν x (0) = x and ν x (t) = ∇f (ν x (t)).For a mode x * ∈ M, its attraction region It is shown that DPC can cluster sample points in the (r, δ)interior of an attraction region.The parameters r > 0 and δ > 0 hold simultaneously across all modes of the density and can be chosen arbitrarily small.Definition 6: The (r, δ)-interior of an attraction region x * , is the set of points x 1 ∈ A x * such that a path P from x 1 to any point Points in the interior of an attraction region must satisfy the property that any path leaving the attraction region must significantly decrease in density at some point.An illustrative example is given in Fig. 4.
The main result (Theorem 2) states that, as long as the modes are well-estimated, the assignment method of DPC will correctly cluster the (r, δ)-interiors of the attraction regions with high probability.Suppose that x * ∈ M is estimated by x ∈ M such that x − x * ≤ r.Then, with high probability, for x ∈ A x * ∩ X, density peaks clustering clusters x to the cluster belonging to x * .
Theorem 2 (Cluster Assignment -adapted from Theorem 2 of [20]): Suppose that x * ∈ M is estimated by x ∈ M such that x − x * ≤ r.Then, for n sufficiently large, depending on f , δ, ζ and r, with high probability, for any x ∈ A (r,δ) x * ∩ X, DPC clusters x to the cluster belonging to x * .

C. Limitations
The theoretical analysis of Section III-B is based on the assumption of a sample size large enough that the error of the density estimator can be bounded.In this section, we provide an analysis of the density peaks clustering algorithm through three illustrative datasets, with n = 1500, from the scikit-learn clustering demonstration . 1 Taken together, the three datasets provide an understanding of the density peaks clustering algorithm and the type of clusters it returns; see Fig. 5.
First, the density estimation appears to recover population density for each dataset.The second feature of the density peaks clustering algorithm analyzed is the decision graph, provided to enable the estimation of the modes from the dataset.The method of selecting mode estimates from the decision graph is seen to perform well when the density of the cluster is concentrated near the mode and decays as the distance from the mode increases, such as for the Unequal Variance Gaussian dataset.Each of the remaining datasets contains areas of relatively uniform density.This poses challenges for the density peaks clustering method as noise in the density estimate leads to erroneous modes being selected.Finally, the assignment method of density peaks clustering is assessed.The assignment strategy is shown to perform well for the Unequal Variance Gaussian dataset.The allocation of instances to clusters for the Noisy Circles runs contrary to geometric intuition about the clusters.In this case, the allocation assigns instances to clusters across areas of very low density in the dataset.
In sum, the density peaks clustering framework performs well for datasets containing clusters with clear point modes around which the density decays, such as Gaussian components.The framework struggles when the high density regions of the data are relatively uniform.In this case, both the mode selection method and the assignment strategy are shown to be susceptible to errors caused by noise in the density estimate.

IV. THE PROPOSED CPF ALGORITHM
In this section, the improvements to the density peaks clustering method that constitute the CPF algorithm are introduced, together with a detailed analysis of the clustering algorithm.

A. Peak-Finding on Level Sets
In this section, we explain the component set notation and the peak-finding criterion.We denote the mutual k-NN graph G(X, E).The structure of the mutual k-NN graph can help detect outlier data points in X.In particular, for x ∈ X, x is an outlier if its vertex in the graph G(X, E) has very few or no edges.We denote the set of outliers by O.
Definition 7: For every x ∈ R p , let r k (x) denote the distance from x to its k-th nearest neighbor in X as before.The mutual k-NN graph G(X, E) consists of the vertex set X and the edge set E. There is an edge between two vertices x i and x j , denoted by {x i , x j } ∈ E, if and only if That is, an edge exists between the vertices x i and x j , only if they are a k-nearest neighbor of each other.
Next, we formalize the notation of connected components of the mutual k-NN graph G(X, E), beginning with the definition of connectedness.
Definition 8: A path of length m from x i to x j , denoted by is a sequence of distinct edges in E, starting at vertex v 0 = x i and ending at vertex v m = x j , such that {v r−1 , v r } ∈ E for all r = 1, . . ., m.We say that the two data points x i and x j are connected, if there is a path from x i to x j in the graph G(X, E).
The definition of connected components and component sets follows.
Definition 9: A connected component of G(X, E), denoted by G(S, E(S)), is a subgraph of G(X, E), where any two vertices in S are connected to each other by paths, and the edge set induced by S is a subset of E: E(S) = {{x i , x j } ∈ E : x i ∈ S, x j ∈ S}.The vertex set S of the component graph G(S, E(S)) is a subset of X and here is termed a component set of X.
From the definition of component, we know that the connected components of G(X, E) reveal certain underlying patterns of the data.In particular, the data X can be partitioned into disjoint component sets.Here, we denote the set of component sets S = {S 1 , . . ., S n S }, where n S = |S| is the number of component sets, and Theoretical results regarding the ability of connected components of G(X, E) to estimate the level sets of f are given in [2], [3].If two points belong to two different component sets, it is highly likely that they are separated by a region of low density.

B. Modelling High-Density Regions
We now explain the mode selection mechanism.The definitions for the peak-finding technique used are the same as those given in Section III-A.The definitions below are given in terms of one S ∈ S, and are equivalent for each.
Data points in S are placed in descending order of the peakfinding criterion, and the modal-set associated with the instance having maximal value of the peak-finding criterion is automatically accepted.To decide whether or not to select modal-sets associated with the subsequent instances, we here utilize an idea similar to the methods of [21], [28], [29].A candidate modal-set Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.M associated with an instance x * is accepted only when it is well separated from the others.
Definition 10: Let 0 < ρ < 1.For an instance x * ∈ S, define a graph G(V x * , E(V x * )) with The estimated modal-set M is the connected component of the graph G(V x * , E(V x * )) containing the vertex x * .M is accepted only if it does not intersect any previously selected modal-set.Note that the k-th nearest neighbour of x * in the distance r k (x * ) is a point from the component set S, not from the original dataset X.This approach allows the graph to better reflect the scale of the data contained in the component set.The component sets obtained from the graph G(V x * , E(V x * )) are assessed, and if the component set containing x * , i.e., M , does not intersect previously selected candidate modal-sets, then M is accepted.
Varying the parameter ρ determines the number of clusters for each component set S. The threshold relates directly to the estimated density for each of the instances.For example, if fk (x 1 ) < ρ fk (x 2 ) then r k (x 1 ) > ρ − 1 p r k (x 2 ).For low values of ρ, fewer vertices will be removed, and it is less likely that a proposed modal-set will be disconnected from existing ones.For larger values of ρ, more vertices and their edges will be removed from the graph, and the probability of the proposed modal-set being disconnected will increase.It is not required to have different ρ values for different component sets, because the threshold ρ − 1 p r k (x * ) adapts naturally to the density level of the component set being assessed.It is seen that modal-sets associated with spurious modes of the density estimate fk will not be accepted by CPF, as the modal-sets are not disconnected from previously accepted modal-sets.

C. The CPF Algorithm
The CPF ALGORITHM is explained with reference to Algorithm 2 and the illustrative example in Fig. 6.
The algorithm takes as input the dataset X and uses pa- Sort the x's according to their γ values.

5:
Let Let M ⊆ S be the component set of the graph Let C be the collection of the points connected by any directed path in G(S, E) that terminates at x.

23:
Add C ∪ x to C.
For each component set S ∈ S, CPF computes the peakfinding criterion for each point and selects the instance x * with maximal value (Lines 4-5).In Fig. 6(b), the higher estimated Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
density of the instances is represented by darker colors, and the magnitude of the peak-finding criterion for each instance is represented using the size of the points.
The subgraph G(V x * , E(V x * )) is extracted, and the component set of G(V x * , E(V x * )) containing x * is denoted by M .The modal-set M is automatically accepted, and the set of true modal-sets for the component set S is initialised as M = { M } (Lines 6-8).Following the comutation of the points x * for each components, the purple and green modal-sets are automatically selected in Fig. 6(c).
Next, the instance with maximal value of the peak-finding criterion yet to be assessed is selected and denoted by x * .The subgraph G(V x * , E(V x * )) is extracted, and the component set of G(V x * , E(V x * )) containing x * is denoted by M (Lines 10-12).If M is disjoint from all selected modal-sets in M, then M is added to M (Lines 13-14).For the top component set in Fig. 6(c), no further modal-sets are detected.For the bottom component set, a second modal-set, in yellow, is detected.
Once the center-selection loop is complete, non-center points are allocated to their clusters.For each non-center point x, a directed edge is added from x to b(x), its nearest neighbor of higher density (Lines 17-20).All vertices that have paths terminating at the same cluster center are assigned to the same cluster, and the cluster is subsequently added to C (Lines 21-23).The process is repeated for each component set to return the final set of clusters C. The clusters corresponding to each modal-set are shown in Fig. 6(d).Furthermore, a sample assignment path for an instance in the purple cluster is shown in gold.

A. Theoretical Analysis
In this section, we show that CPF extends the theoretical guarantees available to the DPC method in Section III-B.We demonstrate that CPF can, with high probability, estimate each modal-set of the underlying probability density bijectively.
The notion of modal-sets can also be understood as a method for pruning spurious estimates from the set of estimated modes, in a similar way to the method of [30].There, the authors prune spurious modes arising due to sampling variability by assessing the level sets at nearby levels of the density.Using nearest neighbor graphs, [27] translates this framework for mode detection, showing that the pruning method allows for bijective estimation of the true modes that with density above a certain level.The analogy to modal-sets is easily drawn.The CPF procedure will only retain an estimated modal-set, say M , if it is contained in a separate component set of the graph.The correspondence allows for the following result, given previously in [20], stating that the modal-set estimates returned by CPF estimate the modal-sets of f bijectively and consistently.
Theorem 3 (Modal-Set Estimation -adapted from Theorem 1 of [20]): Let 0 < ρ < 1 and , ζ > 0. Let M 1 , . . ., M m be the modal-sets of f .The following holds with probability 1 − ζ.For n sufficiently large depending on f, ζ, and ρ, CPF returns m modal-set estimates M 1 , . . ., M m such that The result proving the quality of the cluster assignment of Section III-B can also be applied to each component set, with suitable adjustments made to the number of observations in each component set.

B. Complexity Analysis
The most computation-intensive task is creating the mutual k-NN graph which requires O(nk log(n)) operations on average.The connected components are extracted with O(n) operations.Another major computational burden is finding, for each point, its nearest neighbor of higher density in a component set.For the points which do not have a point of higher density in their neighbors, this requires O(|S|) operations, where |S| is the number of instances in the component set.Experimental results for the proportion of instances without a point of higher density in their neighbors are presented in Fig. 7

C. Limitations
While the CPF algorithm remedies the mode estimation and assignment issues of the DPC algorithm, potential limitations of the method exist.CPF, for simplicity, takes as input only one neighborhood parameter k, used to compute the mutual knearest neighbor graph and the density estimate fk .For datasets with small number of instances, often the optimal value of k for these tasks is different, with too small k leads to oversegmentation of the data, but too large k causes an oversmoothing of the density estimate and poor detection of the modes.This issue would be compounded if the data contained both low-and high-density clusters.In such a case, it is possible to define k 1 and k 2 for graph estimation and density estimation respectively.

TABLE I CHARACTERISTICS OF THE REAL-WORLD DATASETS
We present the clustering with the highest combined value of the ARI and AMI across the range of parameter values assessed.The results are presented in Fig. 8, where different colours indicate different clusters.The datasets are henceforth referred to as Unequal Variance, Noisy Circles, Noisy Moons, and Large m, following Fig. 5. Considering the Unequal Variance dataset, the mode-seeking methods are seen to extract the correct cluster structure, while DBS fails to detect clusters at different densities.The DBS method performs well for the Noisy Moons dataset as the level set approach can detect clusters that are separated by regions of low density.DPC and MNS are seen to select multiple modes for the high-density cluster for the Noisy Circles dataset.CPF is seen to exactly recover the cluster structure for each dataset, combining the benefits of level-set and mode-seeking methods.

C. Real-World Datasets
We assess CPF on a pool of ten real-world datasets.Details of the datasets can be found in Table I  In terms of the ARI, the methods with the three next highest rank is KMS, ODP and CDP.In terms of the AMI, the DPC method, as formulated in Section III-A is also among the best performing approaches.Taken together, this makes a strong case for the ability of the peak-finding criterion to detect meaningful clusters in the data.
Considering the competitor approaches that determine the number of clusters automatically, the performance is significantly worse than CPF.The peak-finding method DPA exhibits inconsistent quality, achieving the best results for the Seeds but not detecting meaningful clusterings for the Ecoli, Glass and Vertebral datasets.The level set methods DBS and HDB perform poorly.The poor performance in both metrics indicates that these methods fail to capture the classes present in the data.MNS achieves the optimal clustering for two datasets, Ecoli and Page Blocks, but does not regularly return high quality clusterings.QKS also does not return high quality clusterings, particularly when assessed using ARI.As ARI significantly penalizes false positive clusters, it can again be concluded that quick shift is not adequately detecting the true number of clusters in the data.Considering the significant similarities between the methodology of QKS and that of the DPC methods, the poor results are likely the result of difficulty in finding the optimal value of the parameter h.
Following the guidance given in [39], the results are also subjected to a statistical analysis using non-parametric tests.We apply the Wilcoxon signed-rank test for pairwise comparisons, using the Benjamini-Hochberg correction to control the false-discovery rate [40], [41].The p-values for the associated comparison are shown in Table III.The results indicate a strong level of statistical significance for the improved clustering quality for the CPF method.CPF significantly outperforms all but one of the methods assessed and is not outperformed by any of the competitor approaches.The average run time, in seconds, for each method is presented in Table IV.For small datasets, DBS and HDB achieve the fastest run time, however the magnitude of difference with CPF is unlikely to hinder their use in applications.This reflects their implementation in C++.For larger datasets, CPF remains competitive with the fastest methods and achieve near the fastest run time for Letter Recognition, the dataset with the largest number of instances assessed.Further context is provided in Table V, detailing the computational complexity of the algorithms.It is concluded that CPF, as well as achieving high quality clusterings, gracefully scales to larger datasets.

D. Analysis of the Parameter Space
CPF achieves superb results across the datasets when optimal values for the parameters are applied.This performance is exhibited across datasets of all sizes, with optimal results achieved for datasets with the fewest and most number of samples and for datasets with low and high numbers of dimensions.The consistency of the performance of the approach is now demonstrated for a wide range of parameter values.CPF has two parameters: 1) k, the number of neighbors computed for each point when constructing the k-NN graph and computing the the k-NN density estimator, and 2) ρ, the amount of variation in the density used to assess potential cluster centers.The parameters of the competitor methods are detailed in Section VI-A.In Fig. 9 we present the clustering quality in terms of the ARI and the AMI over a broad range of parameter values, for four datasets, with the remainder included in the supplementary material, available online.

TABLE V COMPUTATIONAL COMPLEXITY FOR THE COMPETITOR ALGORITHMS
CPF is relatively robust to the choice of k and ρ for all the datasets apart from the Vertebral dataset, for which the choice of k appears important to the clustering quality.The results indicate that, for general application, it is recommended to assess k = 0.9 √ n .This value is near the optimal for all of the datasets apart from the Letter Recognition dataset.The quality of the clusterings remains consistent as the variation parameter used to assess potential cluster centers varies from ρ = 0.1 to ρ = 0.9.For general application, it is recommended to first assess ρ = 0.6 as competitive results are achieved for all datasets, except Page Blocks.The performance of CPF for values of the parameter ρ is not affected by the number of samples in the data.Users can intuitively tune the parameter ρ for alternate clusterings, increasing ρ if more clusters are desired and decreasing ρ if fewer clusters are desired.Considering the competitor methods, it is noted that ADP, CDP, DPA, and QKS also achieve consistent results as the values of their respective parameters increase.Each of these methods, as well CPF, allocate instances to the same cluster as their nearest neighbor of higher local density.An additional benefit of CPF is that the parameters do not depend on the scale of the data.This is illustrated in the large range of k, relative to the size of the datasets, for which CPF achieves excellent results.

VII. MULTI-IMAGE MATCHING WITH CPF
In this section, we introduce an adapted version of CPF for multi-image matching.Multi-image matching is an important application in computer vision, notably in the reconstruction of 3-D scenes form 2-D images.We can consider the problem as extending clustering from an unsupervised task to a semisupervised task.For multi-image matching, the only supervision information provided is the images from which each point is created.No two instances from the same image can be grouped together in the final clustering.
Quick shift forms the basis of the first successful application of density-based clustering to the problem of multi-image matching.QuickMatch [42] modifies quick shift by moving a point to its nearest neighbor with higher empirical density, only if the neighbor does not belong to an image already contained in the cluster.We adapt the CPF method introduced in Algorithm 2 to accommodate supervision information.Denote the image label of an instance x by I(x) ∈ {1, . . ., n I }, where n I is the number of images assessed.As such, we present CPF-Match by updating the allocation phase of Algorithm 2, substituting lines 19-22 with Algorithm 3. CPF-Match modifies the allocation procedure of  To demonstrate the ability of CPF-Match to perform multiimage matching, we apply it to the Graffiti dataset. 3The dataset contains six image groups (bark, bikes, boat, graffiti, Leuven, and UBC), each containing six different images of the same scene.Features are extracted from each image using SIFT, roughly 500 for each image [43].Examples for a pair of images from two of the six image groups are presented in Fig. 10.
For evaluation, we apply the same approach as in [42].For a test point in an image, we calculate the distance between its estimated correspondence and the true correspondence in another image.If the distance is smaller than a threshold, we consider the match to be correct.We plot the percentage of testing points with correct matches versus the threshold values to obtain a curve which can be interpreted in a manner similar to a precision-recall curve.As homography matrices are provided relating the first image with the remaining images in each group, we use all detected feature points in the first image as test points and evaluate the matches in the other five images.The

VIII. CONCLUSION AND FUTURE WORK
In this work, we provided the first theoretical analysis of the popular DPC algorithm.DPC was proven to consistently estimate the modes of the underlying density, and correctly assign instances to clusters with high probability.We also demonstrated issues with the DPC framework.This analysis motivated a new clustering technique, the CPF algorithm.CPF combines the benefits of both density-level set and mode-seeking density-based clustering methods.CPF offers extended theoretical guarantees, compared to DPC, and exhibits improved clustering performance on a range of synthetic and real-world datasets.Finally, we introduced CPF-Match, an adaptation of CPF for an important semi-supervised computer vision application.In future, we envisage the extension of CPF and CPF-Match to incorporate other forms of supervision, including geometric information for the multi-image matching problem, using node-attributed mutual k-NN graphs.

Fig. 1 .
Fig. 1.Left: The decision graph of the DPC method.The three extreme points are detected as the modes.Right: The assignment of the instances to the modes.

Fig. 2 .
Fig. 2. Left: Noise in the density estimate leads to errors when seeking point modes.Right: Modal-set methods are robust to noise and recover the true cluster structure.

Fig. 4 .
Fig. 4. Illustrative example of the (r, δ)-interior of an attraction region A x * , denoted A (r,δ) x * , associated with a mode x * .

Fig. 5 .
Fig. 5. Density peaks clustering of illustrative datasets.The k-NN density estimator is used here with k = 40.Left: Density estimates for the dataset.Darker regions indicate higher density.Center: The decision plot, with thresholds set to estimate approximately the correct number of clusters.Right: The final clustering assignment.The color of the shaded regions indicate the attraction region for each cluster.

Fig. 6 .
Fig. 6.Illustration of stages of the proposed CPF algorithm.

Algorithm 2 :
rameters k and ρ to return the final set of clusters C. Initially, the set of estimated clusters is C = ∅.The undirected mutual k-nearest neighbor graph G(X, E) is constructed.Vertices that have few to no edges are marked as outliers and removed.The remaining data is partitioned into disjoint component sets according to the graph G(X, E) yielding S = {S 1 , . . ., S n S } The Component-Wise Peak Finding Algorithm.Input: Neighborhood parameter k, fluctuation parameter ρ.Initialisation: C = ∅.Output: A set of clusters C. 1: Compute G(X, E), the mutual k-nearest neighbor graph.2: Extract S, the set of component sets from G(X, E). 3: for each S ∈ S do 4:

Fig. 7 .
Fig. 7. Analysis of the proportion of instances that do not have a point of higher density in their k nearest neighbors.Data was simulated from a mixture of Gaussian components with n = 40000, with the number of components and all component parameters chosen randomly to ensure variety.CPF was run with k = 100.The points in black are (|S|, p) for a given component set with the green dashed line showing the function 0.2 log(|S|)/|S|.
. The green line in the figure is 0.2 log(|S|)/|S|.As the proportion of such instances present in S appears of order O(log(|S|)/|S|), nearest neighbors of higher density are found in O(|S| log(|S|) time.Assessing each cluster center requires O(|S|k) operations.The assignment mechanism requires O(|S|) operations.As such, we see that the complexity of CPF is O(nk log(n)), near linear in n and k.

Fig. 8 .
Fig. 8. Results of the clustering algorithms on synthetic datasets.The ARI and the AMI for each clustering is given in the lower left corner.

Fig.
Fig.For each dataset and clustering algorithm, we show the clustering quality as a function of the input parameters.The ARI is shown in blue, and the AMI in pink.Note that for CPF, we present the clustering quality as a function of both k and ρ.

Fig. 10 .
Fig. 10.One pair of images from the Bikes and Boat image groups.Lines between each pair of images indicate a match detected by CPF-Match.

Algorithm 3 :
CPF-Match.17: Initialise G(S, E), a directed graph with S as vertices and no edges, E = ∅.18: Sort the vertices x ∈ S\ M in ascending order of the distance from x to b(x).19: for each x do 20: if I(x) = I(b(x)) then 21: Add a directed edge from x to b(x).22: end if 23: end for A directed graph G(S, E) is initialized as before (Line 17).Next, CPF-Match sorts the points of S not in modal-sets according to the distance x − b(x) , from smallest to largest (Line 18).Processing the non-center points in turn, a directed edge from x to b(x) is added if x and b(x) are not from the same image, i.e., I(x) = I(b(x)) (Lines 19-23).

Fig. 11 .
Fig. 11.Performance curves for the CPF-Match (black) and QuickMatch (blue) multi-image matching methods.For all datasets, k = 10, ρ = 0.5 and the threshold parameter of QuickMatch is set to 4.
. Instances with missing values are removed before clustering.Results are presented in TableII.For each method we present the clustering with the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE II QUALITY
OF THE CLUSTERINGS FOR THE REAL-WORLD DATASETS highest combined value of the ARI and AMI across the range of parameter values assessed.CPF achieves the best clustering, in terms of the ARI and AMI, for six of the datasets assessed, significantly outperforming all of the competitor methods.Also presented are the mean rankings the quality of the clusterings returned by each of the methods for both metrics.Here, CPF is seen to have the best performance overall, indicating that the clustering results are generally of high quality.

TABLE III P
-VALUES FOR WILCOXON SIGNED-RANK TESTS WITH THE BENJAMINI-HOCHBERG CORRECTION, COMPARING THE ARI AND AMI VALUES OF CPF WITH THE COMPETITOR METHODS

TABLE IV AVERAGE
RUN TIME FOR THE REAL-WORLD DATASETS