Unsupervised K-Means Clustering Algorithm

The k-means algorithm is generally the best-known and most widely used clustering method. Various extensions of k-means have been proposed in the literature. Although it is regarded as an unsupervised learning approach to clustering in pattern recognition and machine learning, the k-means algorithm and its extensions are always influenced by initializations and require the number of clusters to be given a priori. That is, the k-means algorithm is not exactly an unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initializations and parameter selection and can simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed. The proposed U-k-means is compared with other existing methods. Experimental results and comparisons demonstrate these good aspects of the proposed U-k-means clustering algorithm.


I. INTRODUCTION
Clustering is a useful tool in data science. It is a method for finding cluster structure in a data set that is characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. Hierarchical clustering was the earliest clustering method used by biologists and social scientists, whereas cluster analysis became a branch of statistical multivariate analysis [1], [2]. It is also an unsupervised learning approach in machine learning. From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. The probability model-based approaches assume that the data points come from a mixture probability model, so that a mixture likelihood approach to clustering is used [3]. In model-based approaches, the expectation-maximization (EM) algorithm is the most used [4], [5]. For nonparametric approaches, clustering methods are mostly based on an objective function of similarity or dissimilarity measures, and these can be divided into hierarchical and partitional methods, of which partitional methods are the most used [2], [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Noor Zaman.
In general, partitional methods suppose that the data set can be represented by a finite number of cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between a point and a cluster prototype is essential for partitional methods. The k-means algorithm is known to be the oldest and most popular partitional method [1], [8]. The k-means clustering has been widely studied with various extensions in the literature and applied in a variety of substantive areas [9], [10], [11], [12]. However, these k-means clustering algorithms are usually affected by initializations and need to be given a number of clusters a priori. In general, the cluster number is unknown. In this case, validity indices, which are supposed to be independent of clustering algorithms, can be used to find a cluster number [13]. Many cluster validity indices for the k-means clustering algorithm have been proposed in the literature, such as the Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [15], Dunn's index [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22].
To estimate the number of clusters, Pelleg and Moore [23] extended k-means to an algorithm called X-means, which makes local decisions in each iteration of k-means about which cluster centers should split themselves to obtain a better clustering. Users need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC or AIC, is used to carry out the splitting process. Although approaches such as cluster validity indices and X-means can find the number of clusters, they use extra iteration steps outside the clustering algorithm. To our knowledge, no work in the literature on k-means is free of initializations and parameter selection while also simultaneously finding the number of clusters. We suppose that this is due to the difficulty of constructing this kind of k-means algorithm.
In this paper, we first construct a learning procedure for the k-means clustering algorithm. This learning procedure can automatically find the number of clusters without any initialization or parameter selection. We first consider an entropy penalty term for adjusting bias, and then create a learning schema for finding the number of clusters. The organization of this paper is as follows. In Section II, we review some related works. In Section III, we first construct the learning schema and then propose the unsupervised k-means clustering (U-k-means) algorithm, which automatically finds the number of clusters. The computational complexity of the proposed U-k-means algorithm is also analyzed. In Section IV, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed U-k-means clustering algorithm. Finally, conclusions are stated in Section V.

II. RELATED WORKS
In this section, we review several works that are closely related to ours. The k-means algorithm is one of the most popular unsupervised learning algorithms for solving the well-known clustering problem. Let $X = \{x_1, \ldots, x_n\}$ be a data set in a $d$-dimensional Euclidean space $\mathbb{R}^d$. Let $A = \{a_1, \ldots, a_c\}$ be the $c$ cluster centers. Let $z = [z_{ik}]_{n \times c}$, where $z_{ik}$ is a binary variable (i.e., $z_{ik} \in \{0, 1\}$) indicating whether the data point $x_i$ belongs to the $k$-th cluster, $k = 1, \ldots, c$. The k-means objective function is
$$J(z, A) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \|x_i - a_k\|^2 .$$
The k-means algorithm iterates through the necessary conditions for minimizing the k-means objective function $J(z, A)$, with the updating equations for cluster centers and memberships, respectively, as
$$a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}} \quad \text{and} \quad z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 = \min_{1 \le s \le c} \|x_i - a_s\|^2, \\ 0 & \text{otherwise,} \end{cases}$$
where $\|x_i - a_k\|$ is the Euclidean distance between the data point $x_i$ and the cluster center $a_k$. A difficulty of k-means is that the number of clusters must be given a priori. However, the number of clusters is generally unknown in real applications. Another problem is that the k-means algorithm is always affected by initializations. To resolve the issue of finding the number $c$ of clusters, cluster validity indices have received much attention.
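The alternation between the membership update and the center update can be illustrated with a short sketch. The following is a minimal pure-Python illustration, not the authors' implementation; the function names `squared_distance` and `kmeans`, the stopping rule (centers unchanged), and the handling of empty clusters are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Plain k-means from user-supplied initial centers (hypothetical helper)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Membership update: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Center update: each center becomes the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    # Value of the objective J(z, A) for the final memberships and centers.
    objective = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, objective


points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centers, objective = kmeans(points, centers=[(0, 0), (10, 9)])
print(labels, centers, objective)
```

Both the number of centers and their starting positions have to be supplied by the caller here, which is exactly the dependence on initialization and on a given number of clusters that the remainder of the paper aims to remove.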
There are several clustering validity indices available for estimating the number $c$ of clusters. Clustering validity indices can be grouped into two major categories: external and internal [24]. External indices evaluate clustering results by comparing the cluster memberships assigned by a clustering algorithm with previously known knowledge, such as externally supplied class labels [25], [26]. Internal indices, in contrast, evaluate the goodness of a cluster structure by focusing on the intrinsic information of the data itself [27], and so we consider only internal indices. In this paper, the most widely used internal indices, namely the original Dunn's index (DNo) [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22], are chosen for finding the number of clusters and are then compared with our proposed U-k-means clustering algorithm.
The DNo [16], DNg [21], and DNs [22] indices are regarded as the simplest internal validity indices; they compare the size of clusters with the distance between clusters. They are computed as the ratio between the minimum distance between two clusters and the size of the largest cluster, and so we look for the maximum index value. The Davies-Bouldin index (DB) [17] measures the average similarity between each cluster and its most similar one. The DB validity index attempts to maximize the between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The silhouette value [18] is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered and wrongly clustered, respectively. Objects with an SW value around zero are considered not to be clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based upon a statistical hypothesis test. It works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of $c$. The optimal number of clusters is the smallest $c$ for which $\mathrm{Gap}(c) \ge \mathrm{Gap}(c+1) - s_{c+1}$, where $s_{c+1}$ accounts for the simulation error of the reference dispersion at $c+1$.
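As an illustration of how an internal index behaves, the silhouette width can be computed directly from pairwise distances. The sketch below is a minimal pure-Python version under stated assumptions: the function name `mean_silhouette` is hypothetical, Euclidean distance is used, and objects in singleton clusters are given a silhouette of zero by convention.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width over all points (hypothetical helper)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons or a single cluster
            continue
        # a: mean distance to the other members of the point's own cluster (cohesion).
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster (separation).
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # well-separated grouping
print(mean_silhouette(points, [0, 1, 0, 1]))  # grouping that mixes the two groups
```

To estimate the number of clusters with such an index, the clustering is repeated for each candidate $c$ and the value of $c$ with the best index is kept, which is the extra outer loop that the proposed method avoids.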
Among efficient methods for finding the number of clusters, X-means, proposed by Pelleg and Moore [23], is probably the most well-known and widely used in the literature, for example in Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions in each iteration of k-means about which cluster centers should split themselves to obtain a better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC, is used to carry out the splitting process. Although X-means has been the most used method for clustering without a given number of clusters, it still requires a range of cluster numbers to be specified based on a criterion such as BIC. Moreover, it is still influenced by the initialization of the algorithm. On the other hand, Rodriguez and Laio [30] proposed an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities, which they called clustering by fast search and find of density peaks (C-FS). To identify the cluster centers, C-FS uses the heuristic approach of a decision graph. However, the performance of C-FS highly depends on two factors, i.e., local density $\rho_i$ and cutoff distance $\delta_i$.

III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
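A long-standing difficulty of the k-means algorithm and its extensions is that they are affected by initializations and require the number of clusters to be given a priori. We mentioned that the X-means algorithm has been used for clustering without a given number of clusters, but it still requires a range for the number of clusters to be specified based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the concept of entropy. We borrow the idea from the EM algorithm by Yang et al. [31]. We first consider proportions $\alpha_k$, where $\alpha_k$ is seen as the probability that a data point belongs to the $k$th class. Hence, we use $-\ln \alpha_k$ as the information in the occurrence of a data point belonging to the $k$th class, and so $-\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ becomes the average information. In fact, the term $-\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ is the entropy over the proportions $\alpha_k$. When $\alpha_k = 1/c$, $\forall k = 1, 2, \ldots, c$, we say that there is no information about $\alpha_k$. At this point, the entropy achieves its maximum value. Therefore, we add this term to the k-means objective function $J(z, A)$ as a penalty. We then construct a schema to estimate $\alpha_k$ by minimizing the entropy to get the most information for $\alpha_k$. Minimizing $-\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ is equivalent to maximizing $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$. For this reason, we use $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ as a penalty term for the k-means objective function $J(z, A)$. Thus, we propose a novel objective function as follows:
$$J(z, A, \alpha) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k, \quad \beta \ge 0. \tag{1}$$
In order to determine the number of clusters, we next consider another entropy term. We combine the membership variables $z_{ik}$ and the proportions $\alpha_k$. On the basis of entropy theory, we suggest a new term of the form $z_{ik} \ln \alpha_k$. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:
$$J_{U\text{-}k\text{-}means}(z, A, \alpha) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k - \gamma \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln \alpha_k. \tag{2}$$
When $\beta$ and $\gamma$ in Eq. (2) are zero, it becomes the original k-means objective function. The Lagrangian of Eq. (2) is
$$\tilde{J}(z, A, \alpha, \lambda) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k - \gamma \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln \alpha_k - \lambda \left( \sum_{k=1}^{c} \alpha_k - 1 \right). \tag{3}$$
We first take the partial derivative of the Lagrangian (3) with respect to $z_{ik}$ and set it to zero. Thus, the updating equation for $z_{ik}$ is obtained as follows:
$$z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 - \gamma \ln \alpha_k = \min_{1 \le s \le c} \left( \|x_i - a_s\|^2 - \gamma \ln \alpha_s \right), \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$
The updating equation for the cluster center $a_k$ is as follows:
$$a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}. \tag{5}$$
We next take the partial derivative of the Lagrangian with respect to $\alpha_k$. We obtain
$$\frac{\partial \tilde{J}}{\partial \alpha_k} = -\beta n (\ln \alpha_k + 1) - \gamma \sum_{i=1}^{n} \frac{z_{ik}}{\alpha_k} - \lambda = 0,$$
and then we get the updating equation for $\alpha_k$ as follows:
$$\alpha_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{\beta}{\gamma} \alpha_k^{(t)} \left( \ln \alpha_k^{(t)} - \sum_{s=1}^{c} \alpha_s^{(t)} \ln \alpha_s^{(t)} \right), \tag{6}$$
where $t$ denotes the iteration number in the algorithm. We should mention that Eq. (6) is important for our proposed U-k-means clustering method. In Eq. (6), $\sum_{s=1}^{c} \alpha_s \ln \alpha_s$ is the weighted mean of $\ln \alpha_k$ with the weights $\alpha_1, \ldots, \alpha_c$. If $\ln \alpha_k^{(t)}$ for the $k$th mixing proportion is less than this weighted mean, then the new mixing proportion $\alpha_k^{(t+1)}$ will become smaller than the old $\alpha_k^{(t)}$. That is, the smaller proportions will decrease and the bigger proportions will increase in the next iteration, and then competition will occur. This situation is similar to the formula in Figueiredo and Jain [32]. If $\alpha_k \le 0$ or $\alpha_k < 1/n$ for some $1 \le k \le c^{(t)}$, these are considered to be illegitimate proportions. In this situation, we discard those clusters and then update the cluster number $c^{(t)}$ to be
$$c^{(t+1)} = c^{(t)} - \left| \left\{ \alpha_k^{(t+1)} \,\middle|\, \alpha_k^{(t+1)} < 1/n, \ k = 1, 2, \ldots, c^{(t)} \right\} \right|,$$
where $|\{\}|$ denotes the cardinality of the set $\{\}$. After updating the number of clusters $c$, the remaining mixing proportions $\alpha_k^*$ and the corresponding $z_{ik}^*$ need to be re-normalized by
$$\alpha_k^* = \frac{\alpha_k^*}{\sum_{s=1}^{c^{(t+1)}} \alpha_s^*} \quad \text{and} \quad z_{ik}^* = \frac{z_{ik}^*}{\sum_{s=1}^{c^{(t+1)}} z_{is}^*}.$$
We next consider the parameter learning of $\gamma$ and $\beta$ for the two terms $\sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln \alpha_k$ and $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$. Consider the candidate learning rates $e^{-c^{(t)}/100}$, $e^{-c^{(t)}/250}$, $e^{-c^{(t)}/500}$, $e^{-c^{(t)}/750}$, and $e^{-c^{(t)}/1000}$, which increase as the cluster number decreases. As functions of $c^{(t)}$, it is seen that $e^{-c^{(t)}/100}$ decreases faster, whereas $e^{-c^{(t)}/500}$, $e^{-c^{(t)}/750}$ and $e^{-c^{(t)}/1000}$ decrease more slowly.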
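The candidate learning rates can be compared numerically. The short sketch below simply tabulates $e^{-c/\tau}$ for the five scales named in the text; the function name `learning_rate` and the sample values of $c$ are assumptions chosen for illustration.

```python
import math


def learning_rate(c, scale):
    """Candidate learning rate exp(-c / scale) for a current cluster number c."""
    return math.exp(-c / scale)


scales = (100, 250, 500, 750, 1000)
print("c      " + "  ".join(f"1/{s:<5d}" for s in scales))
for c in (1000, 500, 100, 10):  # illustrative cluster numbers, largest first
    row = "  ".join(f"{learning_rate(c, s):.4f} " for s in scales)
    print(f"{c:<6d} {row}")
```

For a fixed $c$, the rate with scale 100 is always the smallest and the rate with scale 1000 the largest, so the choice of scale controls how strongly the $z_{ik} \ln \alpha_k$ term is weighted while many clusters remain.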
We suppose that the parameter $\gamma$ should decrease neither too slowly nor too fast, and so we set the parameter $\gamma$ as
$$\gamma^{(t)} = e^{-c^{(t)}/250}.$$
Under this competition schema, the algorithm can automatically reduce the number of clusters while simultaneously obtaining the estimates of the parameters. Furthermore, the parameter $\beta$ can help us control the competition. We discuss the parameter $\beta$ as follows. We first apply … Thus, we obtain … Under the constraint $\sum_{k=1}^{c} \alpha_k = 1$, and only when $\alpha_k < 1/2$, we can have that $\left( \ln \alpha_k - \sum_{s=1}^{c} \alpha_s \ln \alpha_s \right) < 0$. To avoid the situation where all $\alpha_k \le 0$, the left-hand side of inequality (14) must be larger than $-\max\{\alpha_k \mid \alpha_k < 1/2, \ k = 1, 2, \ldots, c\}$. We now have an elementary condition on $\beta$ as follows: $-e^{-1} \beta > -\max\{\alpha_k \mid \alpha_k < 1/2, \ k = 1, 2, \ldots, c\}$. Thus, we have $\beta < \max\{\alpha_k e \mid \alpha_k < 1/2, \ k = 1, 2, \ldots, c\} < e/2$. Therefore, to prevent $\beta$ from being too big, we can use $\beta \in [0, 1]$. Furthermore, if the difference between $\alpha_k^{(t+1)}$ and $\alpha_k^{(t)}$ is small, then $\beta$ must become large in order to enhance the competition. If the difference between $\alpha_k^{(t+1)}$ and $\alpha_k^{(t)}$ is large, then $\beta$ will become small to maintain stability. Thus, we define an updating equation for $\beta$ with … where $\eta = \min\{1, 1/t^{\lfloor d/2 - 1 \rfloor}\}$, $\lfloor a \rfloor$ represents the largest integer that is no more than $a$, and $t$ denotes the iteration number in the algorithm.
On the other hand, we consider the inequalities … Then the restriction $\max_{1 \le k \le c} \alpha_k^{(t+1)} \le 1$ holds, and we obtain … According to Eqs. (12) and (13), we can get … Because $\beta$ can jump at any time, we let $\beta = 0$ when the cluster number $c$ is stable, that is, when $c$ is no longer decreasing. In our setting, we use all data points as initial means with $a_k = x_k$, i.e., $c_{initial} = n$, and we use $\alpha_k = 1/c_{initial}$, $\forall k = 1, 2, \ldots, c_{initial}$, as initial mixing proportions. Thus, the proposed U-k-means clustering algorithm can be summarized as follows:
U-k-means clustering algorithm
Step 1: Fix $\varepsilon > 0$. Give initial $c^{(0)} = n$, $\alpha_k^{(0)} = 1/n$, $a_k^{(0)} = x_k$, and initial learning rates $\gamma^{(0)} = \beta^{(0)} = 1$. Set $t = 0$.
Step 7: Update a … ELSE $t = t + 1$ and return to Step 2.
Before we analyze the computational complexity of the proposed U-k-means algorithm, we give a brief review of another clustering algorithm that also used the idea from the EM algorithm by Yang et al. [31]. This is the robust-learning fuzzy c-means (RL-FCM) algorithm proposed by Yang and Nataliani [33]. Yang and Nataliani [33] gave the RL-FCM objective function $J(U, \alpha, A)$, whose surviving terms in this text are $\cdots \sum_{k=1}^{c} \mu_{ik} \ln \mu_{ik} - r_3 n \sum_{k=1}^{c} \alpha_k \ln \alpha_k$, where the $\mu_{ik}$ are not binary variables but fuzzy c-memberships with $0 \le \mu_{ik} \le 1$ and $\sum_{k=1}^{c} \mu_{ik} = 1$, indicating the fuzzy membership of the data point $x_i$ in the $k$-th cluster. If we compare the proposed U-k-means objective function $J_{U\text{-}k\text{-}means}(z, A, \alpha)$ with the RL-FCM objective function $J(U, \alpha, A)$, we find that, apart from the different membership representations $\mu_{ik}$ and $z_{ik}$, the RL-FCM objective function in Yang and Nataliani [33] has more terms and parameters, and so the RL-FCM algorithm is more complicated than the proposed U-k-means algorithm and requires more running time. In the experimental results and comparisons of the next section, we make further comparisons of the proposed U-k-means algorithm with the RL-FCM algorithm. We also analyze the computational complexity of the U-k-means algorithm. In fact, the U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition $z_{ik}$ with $O(ncd)$; (2) compute the mixing proportion $\alpha_k$ with $O(nc)$; (3) update the cluster center $a_k$ with $O(n)$. The total computational complexity of the U-k-means algorithm is $O(ncd)$, where $n$ is the number of data points, $c$ is the number of clusters, and $d$ is the dimension of the data points. In comparison, the RL-FCM algorithm [33] has a total computational complexity of $O(nc^2 d)$.
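The competition among mixing proportions and the discarding of illegitimate clusters can be illustrated on a small example. The sketch below is not the authors' implementation: it assumes that the proportion update referred to as Eq. (6) takes the form $\alpha_k \leftarrow n_k/n + (\beta/\gamma)\,\alpha_k(\ln \alpha_k - \sum_s \alpha_s \ln \alpha_s)$, where $n_k$ is the number of points assigned to cluster $k$, as follows from setting the derivative of the Lagrangian to zero. The function names, the fixed value of `beta`, and the example counts are hypothetical.

```python
import math


def update_proportions(counts, alpha, beta, gamma):
    """One assumed proportion update: cluster share plus an entropy-driven competition term."""
    n = sum(counts)
    mean_log = sum(a * math.log(a) for a in alpha)  # weighted mean of ln(alpha_k)
    return [
        nk / n + (beta / gamma) * a * (math.log(a) - mean_log)
        for nk, a in zip(counts, alpha)
    ]


def discard_and_renormalize(alpha, n):
    """Drop clusters whose proportion is below 1/n and rescale the rest to sum to one."""
    keep = [k for k, a in enumerate(alpha) if a >= 1.0 / n]
    total = sum(alpha[k] for k in keep)
    return keep, [alpha[k] / total for k in keep]


counts = [50, 30, 19, 1]          # hypothetical numbers of points per cluster, n = 100
alpha = [0.5, 0.3, 0.19, 0.01]    # current mixing proportions
gamma = math.exp(-len(alpha) / 250)
new_alpha = update_proportions(counts, alpha, beta=1.0, gamma=gamma)
keep, renormalized = discard_and_renormalize(new_alpha, n=sum(counts))
print(new_alpha)
print(keep, renormalized)
```

In this example the largest proportion grows, the smaller ones shrink, and the smallest falls below $1/n$ and is discarded, which mirrors the competition described in this section. Because the competition terms sum to zero, the updated proportions still sum to one before any cluster is removed. Memberships of points that belonged to a discarded cluster are assumed to be reassigned at the next membership update.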

IV. EXPERIMENTAL RESULTS AND COMPARISONS
In this section, we give some examples with numerical and real data sets to demonstrate the performance of the proposed U-k-means algorithm. We show the unsupervised learning behavior by which the U-k-means algorithm obtains the best number $c^*$ of clusters. Generally, most clustering algorithms, including k-means, are run with different numbers of clusters and their associated cluster memberships, and these clustering results are then evaluated by multiple validity measures to determine the most practically plausible clustering result and the estimated number of clusters [13]. Thus, we first compare the U-k-means algorithm with the seven validity indices DNo [16], DNg [21], DNs [22], Gap statistic (Gap-stat) [20], DB [17], SW [18] and CH [19]. Furthermore, comparisons of the proposed U-k-means with k-means [8], robust EM (R-EM) [31], clustering by fast search (C-FS) [30], X-means [23], and RL-FCM [33] are also made. For measuring clustering performance, we use the accuracy rate (AR), $AR = \sum_{k=1}^{c} n(c_k)/n$, where $n(c_k)$ is the number of data points that are correctly clustered in cluster $k$ and $n$ is the total number of data points. The larger the AR, the better the clustering performance.
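The accuracy rate can be computed once each cluster has been matched to a class. The text does not state how clusters are matched to true classes, so the sketch below assumes a majority-vote matching, in which $n(c_k)$ is the size of the most frequent true class inside cluster $k$; the function name `accuracy_rate` is hypothetical.

```python
from collections import Counter


def accuracy_rate(true_labels, cluster_labels):
    """AR = sum_k n(c_k) / n under an assumed majority-vote matching of clusters to classes."""
    n = len(true_labels)
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    # n(c_k): number of points in cluster k that carry the cluster's majority class.
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return correct / n


true_labels = [0, 0, 0, 1, 1, 1]
print(accuracy_rate(true_labels, [1, 1, 1, 0, 0, 0]))  # same partition, permuted cluster ids
print(accuracy_rate(true_labels, [1, 1, 0, 0, 0, 0]))  # one point placed in the wrong cluster
```

Under this matching the score does not depend on how cluster ids are numbered. It can, however, be inflated when far more clusters than classes are produced, so it is most meaningful when the estimated number of clusters is close to the true one.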
Example 1: In this example, we use a data set of 400 data points generated from a 2-variate 6-component Gaussian mixture with parameters $\alpha_k = 1/6$, $\forall k$, $\mu_1 = (5 \ \ 2)^T$, $\mu_2 = (3 \ \ 4)^T$, $\mu_3 = (8 \ \ 4)^T$, $\mu_4 = (6 \ \ 6)^T$, $\mu_5 = (10 \ \ 8)^T$, $\mu_6 = (7 \ \ 10)^T$, and $\Sigma_1 = \cdots = \Sigma_6 = \begin{pmatrix} 0.4 & 0 \\ 0 & 0.4 \end{pmatrix}$, with 2 dimensions and 6 clusters, as shown in Fig. 1(a). We run the proposed U-k-means clustering algorithm on the data set of Fig. 1(a), and it obtains the correct number $c^* = 6$ of clusters with AR=1.00 after 11 iterations, as shown in Fig. 1(f). The validity indices CH, SW, DB, Gap statistic, DNo, DNg, and DNs are shown in Table 1. All indices give the correct number $c^* = 6$ of clusters, except DNg. Moreover, we consider the data set with noisy points to show the performance of the proposed U-k-means algorithm in a noisy environment. We add 50 uniformly distributed noisy points to the data set of Fig. 1(a), as shown in Fig. 2(a). On the noisy data set of Fig. 2(a), the U-k-means algorithm still obtains the correct number $c^* = 6$ of clusters after 28 iterations with AR=1.00, as shown in Fig. 2(b). The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs for the noisy data set of Fig. 2(a) are shown in Table 2. The five validity indices CH, DB, Gap-stat, DNo and DNs give the correct number of clusters, but SW and DNg give incorrect numbers of clusters.
Example 2: In this example, we consider a data set of 800 data points generated from a 3-variate 14-component Gaussian mixture, with 3 dimensions and 14 clusters, as shown in Fig. 3(a). To estimate the number $c$ of clusters, we use CH, SW, DB, Gap-stat, DNo, DNg, and DNs. To create the results of the seven validity indices, we consider the k-means algorithm with 25 different initializations. The estimated numbers of clusters from CH, SW, DB, Gap statistic, DNo, DNg, and DNs, with percentages, are shown in Table 3.
It is seen that all validity indices can give the correct number $c^* = 14$ of clusters, except DNg, and that the Gap-stat index gives the highest percentage of the correct number $c^* = 14$ of clusters with 64%. We also run the proposed U-k-means on the data set, and then compare it with the R-EM, C-FS, k-means with the true number of clusters, X-means, and RL-FCM clustering algorithms. We mention that U-k-means, R-EM, and RL-FCM are free of parameter selection, but the others depend on parameter selection for finding the number of clusters. Table 4 shows the comparison results of the U-k-means, R-EM, C-FS, k-means with the true cluster number $c = 14$, X-means, and RL-FCM algorithms. Note that the C-FS, k-means with the true number of clusters, and X-means algorithms depend on initial values or parameter selection, and so we consider their average AR (AV-AR) under different initial values or parameter selections. From Table 4, it is seen that the proposed U-k-means, R-EM, and RL-FCM clustering algorithms are able to find the correct number of clusters $c^* = 14$ with AR=1.00, while C-FS obtained the correct $c^* = 14$ in 96% of the runs with AV-AR=0.9772. The k-means with the true $c$ gave AV-AR=0.8160. The X-means obtained the correct $c^* = 14$ in 76% of the runs with AV-AR=1.00. Note that the numbers in parentheses indicate the percentage of runs in which a clustering algorithm obtains the correct number of clusters under 25 different initial values.
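Synthetic data sets of this kind are samples from Gaussian mixtures. As a minimal sketch, a data set with the stated parameters of Example 1 (six components, equal proportions, covariance $0.4 I$) could be generated as follows; the function name `sample_mixture`, the random seed, and the use of Python's `random` module are assumptions, and the resulting points will not coincide with those shown in Fig. 1(a).

```python
import math
import random

MEANS = [(5, 2), (3, 4), (8, 4), (6, 6), (10, 8), (7, 10)]  # component means of Example 1
VARIANCE = 0.4  # every component has covariance 0.4 * I


def sample_mixture(n_points=400, seed=0):
    """Draw points from the equal-proportion spherical Gaussian mixture (hypothetical helper)."""
    rng = random.Random(seed)
    std = math.sqrt(VARIANCE)
    data, labels = [], []
    for _ in range(n_points):
        k = rng.randrange(len(MEANS))  # equal mixing proportions 1/6
        mx, my = MEANS[k]
        # A diagonal covariance lets each coordinate be drawn independently.
        data.append((rng.gauss(mx, std), rng.gauss(my, std)))
        labels.append(k)
    return data, labels


data, labels = sample_mixture()
print(len(data), sorted(set(labels)))
```

The component labels returned alongside the points play the role of the ground truth against which the accuracy rate is computed.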
Example 3: To examine the effectiveness of the proposed U-k-means for finding the number of clusters, we generate a data set of 900 data points from a 20-variate 6-component Gaussian mixture model. The mixing proportions, mean values and covariance matrices of the Gaussian mixture model are listed in Table 5. The validity indices CH, SW, DB, Gap-stat, DNo, DNg, and DNs are used to estimate the number $c$ of clusters. The k-means algorithm with 25 different initializations is used to create the results of the seven validity indices. The estimated numbers of clusters from the seven validity indices, with percentages, are shown in Table 6, where the values in parentheses indicate the percentage of runs in which a validity index gives the correct number of clusters under 25 different initial values. It is seen that CH, SW, and Gap-stat give the correct number $c^* = 6$ of clusters with the highest percentage. We also run the U-k-means and compare it with the R-EM, C-FS, k-means with the true number $c$, X-means, and RL-FCM algorithms. The obtained numbers of clusters and ARs of these algorithms are shown in Table 7. As can be seen, the proposed U-k-means, C-FS and X-means correctly find the number of clusters for the data set. The R-EM and RL-FCM underestimate the number of clusters for the data set. Both U-k-means and X-means get the best AR.
TABLE 4. Results of U-k-means, R-EM, C-FS, k-means with the true c, X-means, and RL-FCM for the data set of Fig. 3(a).
Example 4: In this example, we consider a synthetic data set of non-spherical shape with 3000 data points, as shown in Fig. 4(a). The U-k-means is run on this data set, with the clustering results shown in Figs. 4(b)-4(f). The U-k-means algorithm decreases the number of clusters from 3000 to 2132 after the first iteration. From Figs. 4(b)-4(f), it is seen that the U-k-means algorithm decreases the number of clusters quickly. After 11 iterations, the U-k-means algorithm converges with $c^* = 9$ and AR=1.00, as shown in Fig. 4(f). We next compare the proposed U-k-means algorithm with R-EM, C-FS, k-means with the true $c$, X-means, and RL-FCM. All the experiments are performed 25 times with parameter selection, and the average AR results under the correct number of clusters are reported in Table 8. As shown in Table 8, U-k-means gives the correct number $c^* = 9$ of clusters with AR=1.00, followed by k-means with the true $c = 9$, which achieves an average AR=0.9190, and C-FS with $c^* = 9$ (96%), which achieves an average AR=0.7641. R-EM overestimates the number of clusters with $c^* = 12$, whereas X-means and RL-FCM underestimate the number of clusters with $c^* = 2$.
We next consider real data sets. These data sets are from the UCI Machine Learning Repository [34].
Example 5: In this example, we use eight real data sets from the UCI Machine Learning Repository [34], known as Iris, Seeds, Australian credit approval, Flowmeter D, Sonar, Wine, Horse, and Waveform (version 1). Detailed information on these data sets, such as feature characteristics, the number $c$ of classes, the number $n$ of instances and the number $d$ of features, is listed in Table 9. Since the data features in Seeds, Flowmeter D, Wine and Waveform (version 1) are distributed over different ranges and the data features in Australian (credit approval) are of mixed types, we first preprocess the data matrices using a matrix factorization technique [35]. This preprocessing technique puts the data in a uniform form, which helps to obtain good-quality clusters and improves the accuracy rates of clustering algorithms. Clustering results from the U-k-means, R-EM, C-FS, k-means with the true $c$, k-means+Gap-stat, X-means, and RL-FCM algorithms for the different real data sets are shown in Table 10, where the best results are presented in boldface. It is seen that the proposed U-k-means gives the best results among them in estimating the number $c$ of clusters and in accuracy rate, except for the Australian data. The C-FS algorithm gives the correct numbers of clusters for the Iris, Seeds, Australian, Flowmeter D, Sonar, Wine, and Horse data sets, while it underestimates the number of clusters for the Waveform data set with $c^* = 2$. The X-means algorithm only obtains the correct number of clusters for the Seeds, Wine and Horse data sets. The R-EM obtains the correct number of clusters for the Iris and Seeds data sets. The k-means+Gap-stat only obtains the correct number of clusters for the Seeds data set. The RL-FCM algorithm obtains the correct number of clusters for the Iris, Seeds and Waveform (version 1) data sets. Note that the results in parentheses are the percentages of runs in which an algorithm obtains the correct number $c$ of clusters.
Example 6: In this example, we use six medical data sets from the UCI Machine Learning Repository [34], known as SPECT, Parkinsons, WPBC, Colon, Lung and Nci9. Detailed descriptions of these data sets, with feature characteristics, the number $c$ of classes, the number $n$ of instances and the number $d$ of features, are listed in Table 11. In this experiment, we first preprocess the SPECT, Parkinsons, WPBC, Colon, and Lung data sets using the matrix factorization technique. We also conduct experiments to compare the proposed U-k-means with R-EM, C-FS, k-means with the true $c$, k-means+Gap-stat, X-means, and RL-FCM. For C-FS, k-means with the true $c$, k-means+Gap-stat and X-means, we run experiments with 25 different initializations and report the average AR (AV-AR) and the percentages of runs in which an algorithm obtains the correct number $c$ of clusters, as shown in Table 12. It is seen that the proposed U-k-means gets the correct number of clusters for SPECT, Parkinsons, WPBC, Colon, and Lung. For the Nci9 data set, the U-k-means algorithm gets $c^* = 8$ clusters, which is very close to the true $c = 9$. In terms of AR, the U-k-means algorithm performs much better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT. However, it underestimates the number of clusters on Parkinsons and overestimates the number of clusters on WPBC. The results of R-EM on the Colon, Lung and Nci9 data sets are missing because the probabilities of a data point belonging to the $k$th class on these data sets become illegitimate proportions at the first iteration.
Results of U-k-means, R-EM, C-FS, k-means with the true c, X-means, and RL-FCM for the 100 images sample of the CIFAR-10 data set.
The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters $c$ for the SPECT, Parkinsons, and WPBC data sets, while it overestimates the number of clusters on Colon, Lung and Nci9 with $c^* = 62$, $c^* = 9$, and $c^* = 60$, respectively.
Example 7: In this example, we apply the U-k-means clustering algorithm to the Yale Face 32 × 32 data set, as shown in Fig. 5. It has 165 grayscale images in GIF format of 15 individuals [36]. There are 11 images per subject with different facial expressions or configurations: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink. In the experiment, we use 135 of the 165 grayscale images. The results from the different algorithms are shown in Table 13. From Table 13, although U-k-means cannot correctly estimate the true number $c = 15$ of clusters for the Yale Face data set, it gives $c^* = 16$ clusters, which is close to the true $c = 15$. The result of the R-EM algorithm is missing because the probabilities of a data point belonging to the $k$th class on this data set become illegitimate proportions at the first iteration. The C-FS gives $c^* = 12$ and X-means gives $c^* = 2$ or 3. The k-means clustering with the true $c = 15$ gives AV-AR=0.34, while RL-FCM gives $c^* = 2$.
Example 8: In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32 × 32 color images in 10 classes, where each pixel is an RGB triplet of unsigned bytes between 0 and 255. There are 50000 training images and 10000 test images. Each of the red, green, and blue channels contains 1024 entries. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) and training 40 multi-way from the CIFAR-10 60K images data set for our experiment. The remaining 59900 images serve as the retrieval database. Fig. 6 shows the 100 images sample from the CIFAR-10 images data set. The results for the number of clusters and AR are given in Table 14. From Table 14, it is seen that the proposed U-k-means and k-means with the true $c = 10$ give better results on the 100 images sample of the CIFAR-10 data set. The U-k-means obtains the correct number $c^* = 10$ of clusters in 42.5% of the runs with AV-AR=0.28, and k-means with $c = 10$ gives the same AV-AR=0.28. For the C-FS, the percentage of runs with the correct number $c^* = 10$ of clusters is only 16.7%, with AV-AR=0.24. X-means underestimates the number of clusters with $c^* = 2$. The results from R-EM and RL-FCM on this data set are missing because the probabilities of a data point belonging to the $k$th class on this data set become illegitimate proportions at the first iteration.
We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing the average running times over 25 runs of these algorithms, as shown in Table 15. All algorithms are implemented in MATLAB 2017b. From Table 15, it is seen that the proposed U-k-means is the fastest among these algorithms for all data sets, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, in Section III, we mentioned that the proposed U-k-means objective function is simpler than the RL-FCM objective function, which saves running time. From Table 15, it is seen that the proposed U-k-means algorithm indeed runs faster than the RL-FCM algorithm.

V. CONCLUSION
In this paper, we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of data points as the initial number of clusters to solve the initialization problem. During the iterations, the U-k-means algorithm discards extra clusters, and an optimal number of clusters can then be found automatically according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameters and that it is robust to different cluster volumes and shapes while automatically finding the number of clusters. The proposed U-k-means algorithm was run on several synthetic and real data sets and compared with most existing algorithms, such as the R-EM, C-FS, k-means with the true number $c$, k-means+gap, and X-means algorithms. The results demonstrate the superiority of the U-k-means clustering algorithm.