A Novel Mean-Shift Algorithm for Data Clustering

We propose a novel Mean-Shift method for data clustering, called Robust Mean-Shift (RMS). A new update equation for the point iterates is proposed, blending those of the standard Mean-Shift (MS) and the Blurring Mean-Shift (BMS). Despite its simplicity, the proposed method has not been studied so far. RMS can be set up in both a kernel-based and a nearest-neighbor (NN)-based fashion. Since the update rule of RMS is closer to that of BMS, the convergence of the point iterates is conjectured based on Chen's BMS convergence theorem. Experimental results on synthetic and real datasets show that RMS in several cases outperforms MS and BMS in the clustering task. In addition, RMS exhibits larger attraction basins than MS and BMS for identical parametrization; consequently, its kernel variant requires a lower aperture of the kernel function, and its NN variant a lower number of nearest neighbors, than MS or BMS to achieve optimal clustering results. Finally, the NN version of RMS does not need a convergence threshold to stop the iterations, contrarily to the NN-BMS algorithm.


I. INTRODUCTION
Data clustering is a type of unsupervised learning which consists of automatically grouping data points with similar characteristics into identified clusters, without training sample points. It is a central task in various application fields such as medicine, genomics, content-based image and video indexing, and Big Data mining, to name a few. Clustering is also increasingly relevant in the context of Artificial Intelligence, to unveil underlying complex structures in datasets [2], especially in applications where little or no training data is available. Despite several decades of research, clustering remains a challenging task for many applications because of the increasing size (number of data points) and dimensionality (number of features) of modern datasets. This is particularly true for applications that require on-the-fly data partitioning [3].
From a general viewpoint, clustering remains an ill-posed problem [21] because, depending on the partitioning process, several legitimate solutions can be obtained which are all acceptable [22], [23]. In fact, most popular methods claimed as unsupervised require significant prior knowledge about the data structure, e.g. the number of clusters to be found. This is particularly true for centroid clustering, mixture resolving, and spectral clustering in their baseline implementations. While such parameters are necessary, they can be difficult to tune, and several other approaches do not require specifying the number of clusters: among them are hierarchical methods, as well as DBSCAN [9], AP [18], convex clustering [20], nearest-neighbor density-based (NN-DB) methods [24] and Mean-Shift-based methods. In this work, we focus on the latter.
Mean-Shift (MS) was originally proposed by Fukunaga and Hostetler in 1975 [11], essentially as a means to find the modes of an unknown probability density function (p.d.f.). MS relies on kernel density estimation (KDE), a non-parametric way to estimate a p.d.f. from data samples [25], [26]. In MS, each point of the dataset is moved iteratively by a small amount (the so-called mean shift) until convergence to some stationary point, i.e. a local mode of the estimated p.d.f. MS was first used as an unsupervised data clustering method, in which the local modes retained after convergence of the point iterates serve as cluster representatives (or exemplars). A connected-component post-processing stage [26] is therefore necessary after convergence to assign a cluster label to each of the original data points. A number of studies have followed the seminal work of Fukunaga and Hostetler [12], [25], [27]-[31], and several proofs pertaining to convergence and p.d.f. estimation have been proposed [1], [12], [25]-[28], [32]-[34]. In [26], Carreira-Perpiñán provides a comprehensive review of MS-based methods and their application to data clustering and data denoising. Mean-Shift has also been successfully applied to image filtering and segmentation in [25].
In the present work, we propose a novel approach to the classical Mean-Shift algorithm focusing on data clustering; the KDE problem is not investigated herein. To the best of our knowledge, the proposed approach has not been published before. Despite relying on a modification of the original MS algorithm, we demonstrate that our method exhibits significantly different behavior. This method, which we name Robust Mean-Shift (RMS), is a hybridization of the standard Mean-Shift (MS) algorithm [25] and of the so-called Blurring Mean-Shift (BMS) method [27]. It is worth recalling that BMS was actually first proposed in [11], as pointed out in [26]. Our algorithm is based on iteratively moving updates of the initial data points, similarly to MS and BMS, but the update equation of RMS fundamentally differs from both methods.
The proposed RMS approach has several valuable properties:
• We find experimentally that RMS requires a lower bandwidth parameter (in the kernel-based variant) and a lower number of nearest neighbors (in the NN variant) than MS and BMS to achieve comparable or even better clustering results; this is especially interesting to speed up the computation of point iterates and, in the case of NN-RMS, to reduce the size of the NN graph compared to the ones required by the NN variants of MS and BMS;
• Compared to MS and BMS, RMS generally performs better, as evidenced experimentally through the analysis of various datasets;
• In most experiments, RMS converges faster than BMS, the latter being proved to converge faster than MS [35];
• The (classical) kernel-based RMS can be easily turned into a K-nearest neighbor (KNN) algorithm, similarly to MS and BMS [30], [31];
• The NN variant of RMS does not require a termination threshold, unlike the NN variant of BMS.
The paper is organized as follows. Section II provides a brief overview of related works, including the kernel-based and KNN-based Mean-Shift (MS) approaches published so far. In Section III, we introduce the proposed clustering method and explain how it relates to MS and BMS. The convergence of RMS in the kernel-based framework is then discussed in Section IV. Section V describes the NN-based variant of RMS. An experimental study of RMS and its comparison with other similar clustering approaches on various datasets is provided in Section VI. Conclusions and perspectives of this work are given in Section VII.

II. NOTATIONS AND RELATION TO PRIOR WORKS
Let X = {x_i}, x_i ∈ R^n, i = 1, …, N, be the set of data points to cluster. Let f : R^n × R^n → R^+ be a kernel function such that f(u, v) ≥ 0 and f(u, v) decreases monotonically with the distance ‖u − v‖. Kernel functions are generally tuned with a bandwidth parameter, and can follow several models: flat, Epanechnikov, biweight, and especially Gaussian kernels are commonly used [27]. However, it is well established that the shape of the kernel has very little effect on the results in comparison to the bandwidth parameter. With that in mind, Gaussian kernels are most commonly used for their convenience.
MS and BMS aim at estimating a p.d.f. and finding its local modes from the observations X. This is done by moving the initial data points {x_i}_{i=1,…,N} iteratively until convergence to stationary points, which are the estimated local modes of the true p.d.f. Let y_i^{(t)} denote the position of the i-th moved point at iteration t, and assume y_i^{(0)} = x_i ∀i. We briefly recall below the original MS and BMS update rules, i.e. the operation applied to the data points at each iteration.

A. MEAN-SHIFT (MS)
MS can be used to partition a dataset by assigning each data point a label corresponding to the unique point it converges to after some (expected) finite number of iterations of an update equation. More precisely, with the above notations, the update equation of MS writes, ∀i:

y_i^{(t+1)} = Σ_{k=1}^{N} f(y_i^{(t)}, x_k) x_k / Σ_{k=1}^{N} f(y_i^{(t)}, x_k). (1)

With the above assumption on the kernel function f, the updated points {y_i^{(t+1)}}_{i=1,…,N} are obtained as a convex linear combination (i.e. with non-negative coefficients summing up to unity) of the initial data points {x_i}_{i=1,…,N}.
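As an illustration, a minimal NumPy sketch of one MS iteration of Eq. (1) with a Gaussian kernel could look as follows (the function names are ours, for illustration only):

```python
import numpy as np

def gaussian_kernel(U, V, sigma):
    # Gaussian kernel: f(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((U - V) ** 2, axis=-1) / (2.0 * sigma ** 2))

def ms_step(Y, X, sigma):
    # One MS iteration (Eq. (1)): each update is a convex combination of
    # the ORIGINAL data points X, with weights f(y_i, x_k).
    W = gaussian_kernel(Y[:, None, :], X[None, :, :], sigma)  # (N, N) weights
    return (W @ X) / W.sum(axis=1, keepdims=True)
```

Iterating `ms_step` from `Y = X` moves each point toward a local mode of the KDE, while the weights always refer back to the static set X.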

B. BLURRING MEAN-SHIFT (BMS)
The update rule for BMS is different from MS since it is based on a convex linear combination of the previously moved data points (hence the so-called blurring effect, as coined by Cheng in [27]), i.e. ∀i:

y_i^{(t+1)} = Σ_{k=1}^{N} f(y_i^{(t)}, y_k^{(t)}) y_k^{(t)} / Σ_{k=1}^{N} f(y_i^{(t)}, y_k^{(t)}). (2)

It can be noticed that, over the iterations, this update rule progressively 'forgets' the initial data points in X, contrarily to MS.
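For comparison, a minimal NumPy sketch of one BMS iteration of Eq. (2) (an illustrative implementation, not the authors' code); here both the weights and the averaged points are the current iterates:

```python
import numpy as np

def bms_step(Y, sigma):
    # One BMS iteration (Eq. (2)): weights AND averaged points are the
    # CURRENT iterates; the original data are progressively 'forgotten'.
    W = np.exp(-np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
               / (2.0 * sigma ** 2))
    return (W @ Y) / W.sum(axis=1, keepdims=True)
```

Because the whole point set contracts at each step, nearby iterates merge very quickly, which is consistent with the cubic convergence rate of Gaussian BMS discussed below.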

1) MAIN ADVANTAGES AND DRAWBACKS
As pointed out in [26], MS and BMS algorithms have several advantages compared to other clustering techniques, among which:
• their parametrization is limited to the choice of an appropriate kernel function and a single bandwidth (or aperture) parameter for this kernel;
• they can discover non-convex clusters;
• they can automatically determine the number of clusters, depending on the chosen bandwidth parameter;
• the algorithms are fully deterministic.
These advantages make a significant difference with respect to classical clustering methods like k-means or fuzzy c-means, for which none of the last three items above is ensured. However, MS-based methods have two main drawbacks: a lack of scalability to large datasets, and a high sensitivity of the clustering performance to the bandwidth parameter for high-dimensional datasets [26]. The latter is a consequence of the so-called curse of dimensionality [36] and the fact that distances tend to be less meaningful in high dimensions.

2) CONVERGENCE OF MS-BASED METHODS
The convergence of MS and BMS has been studied extensively, from the early works on MS-based clustering until very recently. A comprehensive study of the convergence of both algorithms is summarized in [26], while specific results were provided for MS in [25], [35], and for BMS in [1], [27].
For MS, the convergence of the moved points to local modes is ensured, theoretically and practically, in the general case. However, the number of iterations required for convergence varies from one kernel to another: convergence is proven to occur in a finite number of steps for the Epanechnikov kernel [25], but requires infinitely many iterations for the Gaussian kernel, with a linear convergence rate [37].
For BMS, convergence depends on the kernel aperture: for large apertures encompassing the whole dataset X, convergence to a unique mode is ensured, whereas for narrower apertures with finite support, the BMS update rule lets the iterates converge quickly to well-separated distinct modes during the first steps. However, pursuing the iterations can eventually lead to the merging of these modes into a single one, so BMS must be stopped before this occurs [29]. The convergence rate of Gaussian BMS has been proven to be cubic [35], hence much faster than Gaussian MS.

3) K NN-MS BASED METHODS
Choosing the optimal set of hyperparameters (kernel function and bandwidth) for a specific task can become very challenging. This is why, since the early works on Mean-Shift-based density estimation and data clustering [11], [12], many researchers have proposed the use of nearest neighbors as an alternative to the standard kernel approach. Indeed, the NN approach (i) does not require specifying an underlying parametric function, so that only one parameter K is needed; and (ii) the KNN principle makes it possible to maintain the relationship between data points located far from each other, especially on the external border of clusters. In this sense, the NN-based framework is data-adaptive, contrarily to the kernel-based one. In [11], a mean-shift estimate calculated from the K nearest neighbors was proposed as a natural way to automatically adapt the density estimation to its local variations. The nearest-neighbor paradigm can also be used to estimate the bandwidth parameter of kernel-based MS, for instance as the average distance of each point to its K NNs [26].
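As a concrete illustration of the last point, the KNN-based bandwidth rule can be sketched as follows (a naive O(N²) implementation for small datasets; the exact averaging convention may differ from the one in [26]):

```python
import numpy as np

def knn_bandwidth(X, K):
    # Data-driven bandwidth: average distance of each point to its
    # K nearest neighbors (self excluded), as suggested in [26].
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)              # column 0 is each point's distance to itself (0)
    return D[:, 1:K + 1].mean()
```

The resulting value can then be plugged in as the sigma parameter of a Gaussian kernel-based MS.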
Koontz et al. [12] adopted a graph-based clustering approach which enables the assignment of any data point (or node) to a parent node, using the number of neighbors found within a constant-radius ball centered on it. The clustering result is then produced via a directed tree traversal [38]. In [29], Grillenzoni proposed a Gaussian BMS based on KNNs, which provides a data-driven technique to select the bandwidth and is shown to have low sensitivity to K. Duong et al. [30] also provided a data-driven closed-form solution to estimate the optimal number of NNs in NN-MS. Recently, Beck et al. [31] extended the Nearest Neighbor Gradient Ascent (NNGA) method (which is in fact a KNN version of MS) by incorporating Locality-Sensitive Hashing (LSH) to approximate the nearest neighbors and an ε-proximity cluster labeling rule, and renamed this method NNGA+. NNGA and NNGA+ have the advantage of being scalable to large datasets.

4) IMPLEMENTATION ISSUES
An important issue of MS-based methods relates to their practical implementation, and there is a major difference between MS and BMS in this regard. Indeed, Eq. (1) shows that each point can be treated independently of the others because, once the kernel evaluations are computed, the update y_i^{(t+1)} is a linear combination of the original data points. In contrast, BMS in Eq. (2) requires the whole set of current iterates to calculate y_i^{(t+1)}. Therefore, BMS cannot be parallelized, whereas MS can be parallelized efficiently. For both approaches and for arbitrary data, the complexity is quadratic in the number of points [26].

III. ROBUST MEAN-SHIFT
In this section, we propose another MS-like clustering approach, which we name Robust Mean-Shift (RMS). The rationale behind RMS is to combine MS and BMS, so that the next iterate remains a convex linear combination of the current ones, like BMS in Eq. (2), whereas the kernel weighting is kept identical to the MS update in Eq. (1). Therefore, the proposed update equation writes, ∀i:

y_i^{(t+1)} = Σ_{k=1}^{N} f(y_i^{(t)}, x_k) y_k^{(t)} / Σ_{k=1}^{N} f(y_i^{(t)}, x_k). (3)

The main idea of RMS is based on the following expectations:
• Since the next iterate is a combination of the current ones, a behavior similar to BMS can be anticipated, especially in terms of faster convergence with respect to MS;
• Since the kernel weights remain dependent on the initial data points {x_k}_{k=1,…,N}, the iterates are expected to remain 'bound' to the initial data points, thereby avoiding convergence to a unique mode for broad kernels, which is a known drawback of BMS [35];
• The kernel-based update rule in Eq. (3) can be easily modified to involve nearest neighbors, similarly to NNGA [31] and the graph-theoretic approach in [12].
Surprisingly, to the best of our knowledge, this update equation has not been reported in the literature so far. Figure 1 illustrates the differences between MS, BMS and RMS on a simple example. For the three methods, the same kernel parametrization is used, namely the Gaussian kernel

f(u, v) = exp(−‖u − v‖² / (2σ²)), (4)

with the same bandwidth σ.
One can see that the modes of the underlying distributions are hardly distinguishable in Figure 1-(a). Figures 1-(b-d) display the evolution of each data point toward its corresponding mode for the three methods.
In this example, RMS is able to recover the three components of the original distribution, as well as tiny modes or single outliers, whereas MS and BMS identify a higher number of local modes after convergence. Moreover, RMS requires fewer iterations to converge (9 iterations) than MS (96 iterations) and BMS (33 iterations).
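To make the RMS behavior concrete, here is a minimal NumPy sketch of one RMS iteration of Eq. (3) (illustrative code, not the authors' implementation). On two well-separated pairs of points, the iterates of each pair collapse onto a common mode while the two modes stay apart:

```python
import numpy as np

def rms_step(Y, X, sigma):
    # One RMS iteration (Eq. (3)): MS-style weights f(y_i, x_k), computed
    # against the ORIGINAL points X, applied to the CURRENT iterates Y
    # (BMS-style mixing).
    W = np.exp(-np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
               / (2.0 * sigma ** 2))
    return (W @ Y) / W.sum(axis=1, keepdims=True)
```

Note that only the points being averaged differ from the MS update: the kernel weights stay anchored to X, which is what keeps the iterates 'bound' to the data.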

IV. CONVERGENCE OF RMS
In this section, we discuss the convergence property of RMS. Though it is not fully established theoretically here, the convergence of RMS is conjectured, based on an adaptation of the BMS convergence theorem of Chen [1]. This theorem states that there exist {y*_1, …, y*_N} s.t. lim_{t→∞} y_i^{(t)} = y*_i. It relies on three lemmas:
• The convex hulls of the updated data points along the iterations are nested and converge to a limiting convex hull (Lemma 1);
• For each vertex of the limiting convex hull, at least one sequence of data points converges to this vertex (Lemma 2);
• The influence of the vertices of one converging convex hull on the data points outside this convex hull vanishes along the iterations (Lemma 3).
Based on these partial results, we discuss below the convergence of RMS under the same assumptions on f.
First, it is easy to show that Lemma 1 in [1] still holds for RMS: since each RMS update is a convex linear combination of the current iterates, denoting C^{(t)} the convex hull of {y_i^{(t)}}_{i=1,…,N}, one has C^{(t+1)} ⊆ C^{(t)} ∀t.
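This nesting property can be checked numerically: each RMS update being a convex combination of the current iterates, the coordinate-wise bounding box of the iterates (which contains their convex hull) can only shrink. A small self-contained check, with a Gaussian kernel and our own illustrative rms_step:

```python
import numpy as np

def rms_step(Y, X, sigma):
    # RMS update of Eq. (3): MS-style weights, applied to current iterates
    W = np.exp(-np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
               / (2.0 * sigma ** 2))
    return (W @ Y) / W.sum(axis=1, keepdims=True)

# Track the total size of the coordinate-wise bounding box of the iterates;
# it must be non-increasing over the iterations (Lemma 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X.copy()
box_sizes = []
for _ in range(20):
    box_sizes.append(float((Y.max(axis=0) - Y.min(axis=0)).sum()))
    Y = rms_step(Y, X, sigma=0.5)
```

The sequence `box_sizes` is monotonically non-increasing, in line with the nested-hull argument.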
The proof of Lemma 2 in [1] requires showing that, for a large enough t, no exchange can happen at iteration (t + 1) between points converging to distinct vertices of the limiting convex hull. The transposition of this lemma to the RMS update equation remains to be formally proven, but our computer simulations indicate that the same lemma can be conjectured.
Chen's Lemma 3 can also be transposed to RMS, by adapting Eq. (6) in [1] to the RMS update rule. Two cases then arise: in the first case, y_i^{(t)} is no longer influenced by x_k, and y_k^{(t)} converges to a different limit point; in the second case, y_k^{(t)} converges to the same limit point as y_i^{(t)}. Note that this only sketches the proof of the third lemma.
In summary, convergence of RMS can be conjectured at this point, but further investigations would be necessary to thoroughly prove it. In addition, why RMS provides larger attraction basins than MS and BMS remains an open question.

V. NEAREST-NEIGHBOR RMS
Similarly to MS and BMS, the kernel-based RMS algorithm can be cast as a nearest-neighbor-based algorithm, by setting f as a variable-radius flat kernel based on the KNNs:

f(y_i^{(t)}, x_k) = 1 if x_k ∈ N_K(y_i^{(t)}), and 0 otherwise, (5)

where N_K(u) denotes the set of K nearest neighbors of u within X. This yields the following nearest-neighbor Robust Mean-Shift (NN-RMS) update rule:

y_i^{(t+1)} = (1/K) Σ_{k : x_k ∈ N_K(y_i^{(t)})} y_k^{(t)}. (6)

The corresponding algorithm is detailed in Algorithm 1.
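A minimal sketch of NN-RMS following the update rule above is given below; it is our own naive O(N²) reading of the algorithm, with a strict fixed-point stopping test and a maximum iteration count as a safeguard:

```python
import numpy as np

def nn_rms(X, K, max_iter=500):
    # NN-RMS sketch: each iterate moves to the mean of the CURRENT iterates
    # y_k whose originals x_k are the K NNs of y_i within the STATIC set X.
    # The loop stops at strict fixed-point convergence (iterates exactly
    # unchanged), or after max_iter as a safeguard -- no epsilon threshold.
    Y = X.copy()
    for _ in range(max_iter):
        D = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=-1)
        idx = np.argsort(D, axis=1)[:, :K]     # K NNs of each iterate in X
        Y_next = np.stack([Y[row].mean(axis=0) for row in idx])
        if np.array_equal(Y_next, Y):          # exact fixed point reached
            break
        Y = Y_next
    return Y
```

On well-separated groups of points, each group collapses onto a single mode while distinct modes stay apart, and no tolerance parameter is needed.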
Notice that, contrarily to the kernel-based RMS approach, this one is point-wise (local): the radius around a given point y_i is equal to the distance to its K-th NN in X. By doing so, one can expect NN-RMS to better capture the local complexity of the data distribution, compared to the kernel-based version. Contrarily to NN-BMS, NN-RMS stops updating after a small number of iterations. This can be explained easily: the KNN search in NN-RMS finds the NNs of the current updates within the set of original data points, which remains static. On the one hand, this differs from NN-BMS, in which the KNN search is performed within the set of current updates, which continuously move along the iterations. On the other hand, the KNN search is similar to NN-MS (or NNGA [31]), but since NN-RMS is essentially a BMS algorithm (because the next updates are weighted sums of the current moving points), its convergence is faster than that of NN-MS. Indeed, NN-RMS stops once all the moving points share the same NNs within the original dataset, even if the last iterates are close to each other but distinct. Figure 2 exemplifies the differences between three NN-based MS algorithms for mode seeking, namely Nearest-Neighbor Mean-Shift (NN-MS), Nearest-Neighbor Blurring Mean-Shift (NN-BMS), and NN-RMS. The dataset is randomly drawn from three 2-D normal distributions with identical diagonal covariance matrices, centered at [0, 0], [0, 1] and [1, 1]. This dataset is challenging for the clustering task, the modes of the mixture distribution being hard to distinguish in Figure 2-(a). Figures 2-(b-d) display the evolution of each data point toward its corresponding mode for the three methods. Notice that NN-RMS is run until strict fixed-point convergence, whereas NN-MS and NN-BMS must be stopped as soon as the mean squared difference between successive iterates Y^{(t)} and Y^{(t−1)} falls below some threshold ε.
In this experiment, we set ε = 10^−8. It can be seen that NN-RMS again creates larger attraction basins than its MS and BMS counterparts. NN-RMS is able to recover the exact number of components of the actual distribution, whereas NN-MS and NN-BMS identify a much higher number of local modes after convergence. Moreover, NN-RMS requires fewer iterations to converge (14 iterations) than NN-MS (16 iterations) and NN-BMS (55 iterations), as shown in Figure 3.

VI. EXPERIMENTS
In this section, we provide experimental results obtained with several datasets, both synthetic and real, in order to assess the performance of the proposed RMS approach for clustering, and to compare it with the state-of-the-art MS and BMS, in both configurations, i.e. kernel-based and KNN-based.

A. DATASETS 1) SYNTHETIC DATASETS
To perform the experiments and compare our approach with other clustering algorithms, we have selected a number of publicly available synthetic and real datasets. The synthetic datasets are displayed in Figure 4, with their actual (ground-truth) labels shown in specific colors. These datasets cover diverse configurations, from well-separated to highly overlapping clusters, from convex to non-convex and highly intricate clusters, and from balanced to unbalanced clusters.

2) REAL DATASETS
The real datasets used in the experiments are summarized in Table 1. They show different configurations, from low to moderate dimensionality, and various numbers of instances and clusters. All these datasets have been used without any pre-processing, except the AttFace dataset, for which the original dimension n = 4096 has been reduced to n = 20 by means of principal component analysis (PCA).
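For reproducibility, this PCA reduction can be sketched with a plain SVD-based implementation (a generic sketch; the exact PCA settings used for AttFace are not specified here):

```python
import numpy as np

def pca_reduce(X, n_components):
    # PCA via SVD of the centered data matrix: project the centered data
    # onto the top principal axes (rows of Vt, sorted by decreasing
    # singular value).  Used here to go from n = 4096 to n = 20.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

The projected components are zero-mean and ordered by decreasing variance.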

B. SELECTED METHODS
With regard to similar methods based on the MS principle and their NN variants, we have compared both the NN-based and kernel-based variants of the proposed RMS method to their MS and BMS counterparts. For comparison, we also selected the MedoidShift method proposed in [38], adapted to the KNN case as suggested by the authors. We further compared our approach with two nearest-neighbor density-based clustering methods, namely kNN-DPC and GWENN. These methods were recently improved and compared in [24] for their applicability to pixel clustering in hyperspectral images. They were chosen because they require the same input parameter K as the NN-based MS methods. Also, due to the specific convergence behavior of NN-BMS illustrated in Figure 3, an additional threshold parameter ε was used to stop the algorithm; in all our experiments, we set ε = 10^−6. For kernel-based MS methods, we have chosen the Gaussian kernel detailed in Eq. (4).

C. SELECTED VALIDATION CRITERIA
To allow the comparison of different results, we used the following cluster validation criteria:
• Since all the datasets include a ground-truth labeling as external data for cluster assessment, the overall accuracy (OA), average accuracy (AA) and kappa index can be obtained after optimal pairing of the cluster labels with the actual ground-truth classes by means of the Munkres assignment algorithm [39];
• The purity and normalized mutual information (NMI) indices [40] are also based on the relationship between the ground truth and the predicted labels, but do not require label reassignment;
• The consistency violation ratio (CVR) [41] is a clustering index based on an information-theoretic concept, which is also suited to non-convex clusters; a lower CVR indicates a better clustering result. Contrarily to the previous criteria, the CVR index does not require the ground-truth labels, which makes it useful to assess the results of unsupervised classification.
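As an illustration of the label-pairing step, the OA after optimal pairing can be computed as follows; this sketch brute-forces the pairing over permutations instead of the Munkres algorithm used in the paper, which is equivalent for small label sets:

```python
from itertools import permutations

import numpy as np

def overall_accuracy(y_true, y_pred):
    # Overall accuracy (OA) after optimal pairing of predicted cluster
    # labels with ground-truth classes.  Brute-force over permutations,
    # practical only for small label sets; assumes no more predicted
    # clusters than ground-truth classes.
    true_ids = np.unique(y_true)
    pred_ids = np.unique(y_pred)
    best = 0.0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        remapped = np.array([mapping[p] for p in y_pred])
        best = max(best, float(np.mean(remapped == y_true)))
    return best
```

For larger label sets, the Munkres (Hungarian) algorithm computes the same optimal pairing in polynomial time.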
1) SYNTHETIC DATA
The results obtained on the synthetic datasets are given in Table 2 and illustrated in Figure 5.
In addition to the cluster indices mentioned above, we also report the number of output clusters, the computation time and the number of iterations for each method used for comparison. For each dataset, the clustering task was performed with two groups of methods, namely NN-based and kernel-based methods. The values displayed in Table 2 correspond to the specific values of K (for NN-based methods) or σ (for kernel-based methods) providing the best kappa index.
Concerning the NN-based methods, it can be seen that NN-RMS provided the best kappa indices on five of the nine datasets (Aggregation, DataS1, S4, Unbalance and Worms2d), and the second-best kappa on three others (Flame, Spiral and Birch1). One important outcome of this comparison is that these best kappa results for NN-RMS were obtained with the lowest number of NNs among all the compared methods on eight of the nine datasets. This result can be considered significant, since most of the optimal values of K for the other methods are often more than twice the optimal values for NN-RMS. It is also noticeable that NN-RMS found the correct number of clusters on seven of the nine datasets, and that this number is very close to the actual one in the remaining cases. With regard to kernel-based Mean-Shift methods, G-RMS again outperforms G-MS and G-BMS on six of the nine datasets (Flame, Spiral, Aggregation, DataS1, S4 and Worms2d). Also, G-RMS never requires a higher Gaussian kernel aperture than G-MS and G-BMS to achieve its best kappa indices, and for most datasets the optimal aperture is well below the ones required by G-MS and G-BMS. This finding is in accordance with the results of the NN-based MS methods.
Note that for the R15 dataset, all methods (NN- or kernel-based) perform equally well, except NN-MS and NN-MedShift, which yield lower kappa, NMI and purity, and higher CVR.

2) REAL DATA
Table 3 provides the clustering results obtained on the real datasets described above. Here, only NN-based MS methods were considered, as well as kNN-DPC, GWENN and NN-MedShift. Again, these results correspond to the optimal parameter K in terms of output kappa index. The variety in data size (from N = 150 to 2310), dimensionality (up to n = 20) and cluster shape makes it difficult to draw clear conclusions from these observations. Over the six datasets, in terms of kappa index, NN-RMS performs better than the other methods in two cases (Banknote and Segment), whereas NN-BMS is better in three others (Iris, Ecoli and AttFace). However, NN-RMS was able to provide the correct number of clusters in all cases, except for the Ecoli dataset, for which all the methods failed. This is particularly remarkable for the AttFace dataset, given its relatively high dimensionality (n = 20) and sparsely populated classes (only 10 data points per class), which make it a challenging clustering problem. Finally, as for the synthetic datasets, the best results in terms of kappa index reported for NN-RMS correspond in all cases to lower values of K than those of the other methods.

3) APPLICATION TO PIXEL CLUSTERING IN HYPERSPECTRAL IMAGES
We provide here early experimental results on the application of NN-RMS to hyperspectral image pixel clustering. Hyperspectral images are composed of hundreds of spectral bands covering a specific spectral range, generally including the visible range, the near-infrared range and sometimes the short-wave infrared range. Each pixel can be viewed as a high-dimensional vector of spectral radiances (or reflectances) from which a large amount of valuable information can be extracted to remotely identify objects or land cover types when the hyperspectral camera is operated on board an aerial platform (aircraft or UAV). We have selected a publicly available hyperspectral image [42]. It comprises 86 × 83 pixels, has 204 spectral bands (n = 204) and includes six classes of vegetation cover. Figures 6-(a) and (b) show respectively a color composite of the hyperspectral image and the corresponding ground-truth map used for clustering assessment. In this experiment, we applied the same protocol as above, i.e. we ran several nearest-neighbor-based clustering methods with K varying from 100 to 400 by steps of 20. Four methods were compared: kNN-DPC and GWENN as density-based clustering methods, and NN-MedShift and NN-RMS as Mean-Shift-type methods. For each method, we retained the best result as the one maximizing the kappa index, because this index combines the OA and the AA derived from the confusion matrix (after label reassignment) and generally better represents the clustering quality. Figures 6-(c)-(f) display the corresponding clustering maps, with an effort to keep the same color scale as the ground truth. It is interesting to notice that all four methods discover two additional clusters with respect to the available ground truth.
These clusters are visually coherent both from the spectral viewpoint as can be seen from the composite image, and from the spatial viewpoint since the additional segments in the maroon and light blue regions of the ground truth map follow the same spatial structure as the labeled ones. Despite the high dimensionality of the dataset, here again, NN-RMS provides the best overall clustering result among the four methods, still with the lowest number of nearest neighbors K .

VII. CONCLUSION AND PERSPECTIVES
In this paper, we have proposed a novel Mean-Shift-like method for data clustering, called Robust Mean-Shift (RMS). This approach differs from the standard Mean-Shift (MS) and Blurring Mean-Shift (BMS) by its update equation. More precisely, RMS uses a linear combination of the current point iterates (similarly to BMS), with weights depending on the similarity (or distance) of these iterates to the original data points (similarly to MS). Surprisingly, the proposed method does not seem to have been studied so far, despite its simplicity.
The RMS update equation has been set up in both a kernel-based and a nearest-neighbor-based version. In the kernel-based case, the convergence of point iterates has been conjectured based on the BMS convergence theorem of Chen [1]. RMS has several advantages over MS and BMS:
• For the same kernel bandwidth (in the kernel-based implementation) or number of NNs (in the nearest-neighbor-based implementation), RMS shows larger basins of attraction than MS and BMS. One consequence is that the size of the NN graph required to achieve the same number of clusters is smaller for NN-RMS than for NN-MS and NN-BMS.
• Experimental results on synthetic and real datasets show that RMS in most cases outperforms MS and BMS in the clustering task.
• Though the RMS update equation is closer in spirit to BMS, their NN-based versions behave differently when the iterates get close to their fixed-point limit: whereas NN-BMS iterates continue to evolve until the mean squared error between the current and previous iterates reaches a specified small value, NN-RMS stops as soon as all iterates share the same set of NNs within the original dataset. This property also holds for NN-MS (or NNGA).
Perspectives of this work are twofold, and will concern (i) the optimization of the RMS parametrization (kernel definition, kernel aperture, and number of NNs), and (ii) theoretical proofs of the RMS convergence properties, especially with regard to the convergence rate and the larger attraction basins with respect to the standard MS and BMS clustering methods.