Clustering by Constructing Hyper-Planes

As a ubiquitous method in machine learning, clustering attracts a great deal of attention. Because only basic information can be utilized, grouping data points into correct categories is a critical task, especially when the number of clusters is unknown. This paper presents an algorithm that finds the number of clusters automatically. It first constructs hyper-planes based on the margins between sample points. Then an adjacency relationship between data points is defined, from which connective components are derived. According to a validity index proposed in this paper, high-quality connective components are selected as cluster centers; meanwhile, the number of clusters is determined. Another contribution of this paper is that all of the algorithm's parameters can be set automatically. To evaluate its robustness, experiments are carried out on different kinds of benchmark datasets. They show that its performance is even better than other methods' best results, which were selected manually.

Clustering algorithms reveal the intrinsic pattern of data by grouping data points into different categories. Because they typically handle datasets without pre-existing labels, only basic information such as distances between points, density of points, or the points' distribution can be used. Based on this information, different objective functions are constructed to determine the clusters' key properties, such as the cluster centers (1,2), the affinity matrix (3), or even the number of clusters (4-6).
Many algorithms require that the number of clusters be preset, such as K-means (1), K-medoids (2), and hierarchy-based algorithms (7). When the number is not given in advance, it is obviously more difficult for a clustering algorithm to derive a good result. Some algorithms use the density of points to determine the number and centers of clusters, such as density-based spatial clustering of applications with noise (DBSCAN) (4) and clustering by fast search and find of density peaks (CFSFDP) (5). Their problem lies in that large differences between densities result in low-quality performance, and the results are highly sensitive to the parameters. The Affinity Propagation algorithm (6) derives the clusters' centers and number by exchanging information between data points, which is a rather novel idea. However, it is also sensitive to its parameters, and the value of the preference parameter is related to the number of clusters. Some algorithms group points based on distributions (8). Their performance relies on the distributions' capability to represent the data points; therefore they cannot represent different kinds of manifolds flexibly.
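As a rough illustration of the sensitivities discussed above, the baselines can be run side by side with scikit-learn on a toy two-moons dataset (an illustrative sketch, not part of the proposed method; all parameter values here are arbitrary examples):

```python
# Illustrative comparison of the baselines discussed above on a toy
# two-moons dataset; parameter values are arbitrary examples.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-means requires the number of clusters to be preset.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# DBSCAN infers the number of clusters but is sensitive to eps.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Affinity Propagation infers the number by message passing, but the
# result depends strongly on the preference parameter.
ap = AffinityPropagation(random_state=0).fit(X)

for name, labels in [("k-means", km.labels_),
                     ("DBSCAN", db.labels_),
                     ("affinity propagation", ap.labels_)]:
    print(name, adjusted_rand_score(y, labels))
```

On this dataset, K-means cannot follow the non-convex moon shapes regardless of initialization, while the other two methods succeed or fail depending entirely on their parameter values.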
Here, we present a clustering method based on constructing hyper-planes. It rests on the assumption that a group can be divided into subgroups whose points lie on a locally linear manifold. Appropriate hyper-planes are therefore found to distinguish points in different subgroups. By combining these hyper-planes, connective components can be derived, and the clusters' number and centers can then be determined from these connective components. One of the method's main contributions is that it finds the number of clusters automatically. The other is that it depends only on the marginal space between points in different clusters, so it can be applied to data points distributed on complex manifolds. To select the proper hyper-planes automatically from the candidate set L, L is grouped into two categories based on the hyper-planes' margin values w using the K-means method.
The category with the smaller w, denoted L', contains the hyper-planes that distinguish the points well, like the eight lines in Fig. 1A. For the more complex manifolds shown in Fig. 1B and Fig. 1C, this operation is repeated until L' no longer varies; the lines in Fig. 1B and Fig. 1C show the resulting L'. These three distributions in Fig. 1 are among the most common ones for benchmarking clustering algorithms.
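The selection step can be sketched as follows, assuming each candidate hyper-plane is summarized by a scalar margin value w; the construction of the candidates themselves is not shown, and both `select_hyperplanes` and the margin values are illustrative:

```python
# A sketch of selecting the small-margin category of hyper-planes with
# K-means, as described above; the margin values w are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

def select_hyperplanes(w):
    """Split the candidate hyper-planes into two categories by their
    margin value w and keep the category with the smaller mean."""
    w = np.asarray(w, dtype=float)
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(w.reshape(-1, 1))
    small = int(np.argmin([w[labels == k].mean() for k in (0, 1)]))
    return np.where(labels == small)[0]

# eight well-separating candidates (small w) plus three poor ones
w = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.12, 0.10, 0.90, 1.10, 0.95]
print(select_hyperplanes(w))  # indices 0..7
```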
Then an affinity matrix can be determined based on the selected hyper-plane set: two points separated by some hyper-plane are not adjacent, and otherwise they are adjacent. Although the connective components derived from this matrix are sometimes perfect clusters, as shown in Fig. 1, the result is not always so satisfactory, so the connective components are not viewed as clusters directly. Before finding the clusters, we first compute, for each connective component, its mean intra-connective-component distance and its nearest-connective-component distance, where the mean intra-connective-component distance is computed from the geodesic distances between the component's points.
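A minimal sketch of deriving connective components from such an adjacency relation, assuming two points are adjacent exactly when every selected hyper-plane leaves them on the same side (`components_from_hyperplanes`, the toy points, and the single plane are all illustrative):

```python
# A sketch of deriving connective components, assuming two points are
# adjacent exactly when every selected hyper-plane leaves them on the
# same side; the toy points and the single plane are illustrative.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def components_from_hyperplanes(X, planes):
    """planes: list of (w, b) with hyper-plane w.x + b = 0."""
    # sign pattern of each point with respect to every hyper-plane
    signs = np.stack([np.sign(X @ w + b) for w, b in planes], axis=1)
    # adjacent iff the sign patterns agree on all hyper-planes
    adjacency = (signs[:, None, :] == signs[None, :, :]).all(axis=2)
    n, labels = connected_components(csr_matrix(adjacency), directed=False)
    return n, labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.1, 0.0]])
planes = [(np.array([1.0, 0.0]), -1.0)]  # the vertical line x = 1
n, labels = components_from_hyperplanes(X, planes)
print(n, labels)  # two components: {0, 1} and {2, 3}
```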

A threshold is set to filter off small connective components. Points in a connective component with a larger M are more likely to belong to the same cluster.
Otherwise, a connective component is called a hillside connective component (HCC). A PCC tends to be the best connective component among all its adjacent ones, so it acts like a center. To make the result more accurate, we adopt as cluster centers only those PCCs whose M is larger than the M of at least half of the connective components.
For benchmarking, we applied our algorithm to several real-world datasets. All data points are centralized by subtracting their mean before processing. We also present a simple method to determine the algorithm's parameter automatically. Because the number of points assigned to clusters depends on the parameter's value, we can draw a line chart of this relationship, as shown in Fig. S: the X-axis is the parameter value and the Y-axis is the number of points put into clusters. We then find the first peak whose point count is larger than half of the total number of points; if there is no such peak, we use the highest peak in the chart. The X-axis value of the point just before this peak is used as the parameter value for clustering.
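The first-peak heuristic can be sketched as follows; `pick_parameter` is an illustrative name, and the candidate parameter values and point counts are hypothetical:

```python
# A sketch of the first-peak heuristic described above; the candidate
# parameter values and point counts are hypothetical.
import numpy as np

def pick_parameter(candidates, counts, n_total):
    """Return the candidate value just before the first peak of the
    count curve that exceeds half of the total number of points."""
    counts = np.asarray(counts)
    peaks = [i for i in range(1, len(counts) - 1)
             if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]]
    big = [i for i in peaks if counts[i] > n_total / 2]
    if big:
        peak = big[0]                                  # first large peak
    elif peaks:
        peak = max(peaks, key=lambda i: counts[i])     # highest peak
    else:
        peak = int(np.argmax(counts))                  # fallback
    return candidates[max(peak - 1, 0)]

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
counts = [120, 300, 900, 850, 870]   # points put into clusters; n_total = 1000
print(pick_parameter(candidates, counts, 1000))  # 0.2
```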
The experiments' results as shown in Fig. S indicate that this method is effective although it doesn't obtain the best results always.After  is determined, the algorithm is repeated 5 times to derive the best result.
The first four benchmark datasets are obtained from the University of California Irvine Machine Learning Repository (10). The data in each dataset are viewed as a matrix whose row vectors represent data points. Because different columns have different meanings, the values of each column are normalized using min-max normalization. Table 1 contains a summary of these datasets. Because DBSCAN and Affinity Propagation are sensitive to their parameters' values, the algorithm is also compared with K-means (best result over 20 runs), agglomerative clustering, and balanced iterative reducing and clustering using hierarchies (BIRCH). These five algorithms are implemented using the Scikit-learn library in Python with default parameter values.
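Column-wise min-max normalization, as applied to the UCI datasets, is the standard rescaling (a generic sketch, not the authors' code):

```python
# Column-wise min-max normalization as used for the UCI datasets above
# (a generic sketch, not the authors' code).
import numpy as np

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (X - lo) / span

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(min_max_normalize(X))  # each column rescaled to [0, 1]
```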
The performances are measured by two benchmark measures, the Adjusted Rand Index (ARI) (11) and Normalized Mutual Information (NMI) (12), as shown in Table 2 and Table 3. Because we put only part of the points into clusters, the results restricted to those points are also presented in Table 4 and Table 5.
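Both measures are available in scikit-learn; as a quick illustration, a predicted partition identical to the ground truth up to label renaming scores 1.0 on each:

```python
# The two evaluation measures, computed with scikit-learn; both are
# invariant to renaming cluster labels.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same partition, different label names

print(adjusted_rand_score(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```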
Rodriguez and Laio have pointed out that the Olivetti Face Database (13) poses a serious challenge for algorithms that find the number of clusters automatically, because the "ideal" number of clusters is comparable with the number of elements in the dataset (namely, 10 different images for each subject) (6). So we applied the algorithm to it. Neither CFSFDP nor our algorithm produced a cluster containing images of two different persons. Our algorithm selected 26 clusters correctly, as shown in Fig. 2A, while CFSFDP got 22 correct clusters (6). Because another benchmark dataset, the Yale Face Dataset (YFD) (14), also contains a number of clusters comparable with the number of images per person, we applied our algorithm to it as well, as shown in Fig. 2B. There, two clusters contain images of different persons: cluster 6 and cluster 19. A cluster's number is the rank of its center with respect to the measure M. The performances on both datasets measured by ARI and NMI are also included in Table 2. To find data's intrinsic patterns, one commonly used idea is to approximate nonlinear models with linear ones. Our algorithm is based on this idea: by combining linear structures, the final result can approximate nonlinear manifolds more flexibly and accurately.

Fig. 1. Results of connective components for synthetic point distributions. Different colors represent different connective components; the colored lines are the selected hyper-planes. The number of points is 1600. (A) Distribution of two concentric circles.

Fig. 2. Pictorial clustering results on the Olivetti Face Dataset (A) and the Yale Face Dataset (B). Faces with the same color and the same number in the bottom-right corner belong to the same cluster, while the gray images are not in any cluster. The numbers in the bottom-right corner are the ranks of the cluster centers sorted by their M.
Fig. S. The performance of the algorithm on the benchmark datasets as a function of the parameter: ARI (orange bars), NMI (gray bars), and number of points in clusters (blue line). The red spot marks the parameter value selected for the clustering algorithm. The numbers of cluster centers from (A) to (F) are 3, 6, 4, 5, 34, and 11, respectively.

Table 1.
Descriptions of 6 Benchmark Datasets.

Table 2.
Clustering ARI on the Real-world Datasets

Table 3.
Clustering NMI on the Real-world Datasets

Table 4.
Clustering ARI on Points in Clusters of Real-world Datasets

Table 5.
Clustering NMI on Points in Clusters of Real-world Datasets