Finding the Number of Clusters Using a Small Training Sequence

In clustering a training sequence (TS), the K-means algorithm searches for empirically optimal representative vectors, which achieve the empirical minimum, in order to inductively design optimal representative vectors yielding the true optimum for the underlying distribution. In this paper, the convergence rates of the clustering errors are first observed as functions of β −α , where β is the training ratio that relates the training sequence size to the number of representative vectors, and α is a non-negative constant. From the convergence rates, we can observe the training performance for a finite TS size. If the TS size is relatively small, errors occur in finding the number of clusters. In order to reduce the errors from small TS sizes, a compensation constant (1 − β −α ) −1 for the empirical errors is devised based on the rate analyses, and a novel algorithm for finding the number of clusters is proposed. The compensation constant can be applied to other clustering applications, especially when the TS size is relatively small.


I. INTRODUCTION
The clustering of samples is a method of grouping or segmenting samples into subsets or clusters to efficiently represent the samples [1], [2], [3] and to improve deep-learning performance. Clustering samples can be a scheme for self-organizing or unsupervised learning [4], [5]. We assume that a sequence of samples, which is called the training sequence (TS), is realized from a random vector with the underlying distribution. The K-means algorithm can efficiently conduct clustering by iteratively decreasing an empirical error computed from the TS [6]. The algorithm divides the TS into a finite number of disjoint clusters, whose centroids are the representative vectors, based on the nearest neighbor search. Here, we call the finite set of these vectors the codebook. The K-means algorithm can asymptotically design optimal codebooks as the TS size increases in an inductive approach [7], [8]. Hence, the K-means algorithm can be an efficient approach to reduce both linear and nonlinear correlations [9], and it can be combined with conventional linear decorrelation methods, such as the principal component analysis, the Karhunen-Loeve transform, and the discrete cosine transform for autoregressive signals [10].
(The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.)
Based on the consistency property of a sequence of trained codebooks as the TS size increases [8], [11], the K-means algorithm tries to find empirically optimal codebooks, which achieve the empirical minimum for the given TS, to inductively design an optimal codebook for the underlying distribution [12]. However, the K-means algorithm often yields locally optimal codebooks depending on initial guesses. Locally optimal codebooks do not guarantee convergence to an optimal codebook even as the TS size increases. Hence, it is necessary to devise an algorithm that can achieve the global optimum. Vaisey and Gersho [13] utilize the simulated annealing approach to alleviate the local minimum problem. In inductively training codebooks, the training ratio β, which is defined by the ratio of the TS size to the codebook size, is a more important parameter than the TS size itself [9]. Furthermore, depending on search structures, β can be different [14], [15], [16].
In this paper, in order to formulate the convergence rates of the errors yielded by the trained codebooks as functions of β, previous research regarding the effects of a finite TS size on the clustering performance is first surveyed. The main contributions of this paper are as follows. (1) It is shown that the convergence rate in clustering a TS is a function of β −α , where 1/2 ≤ α ≤ 1. (2) In testing the trained codebook, a similar rate is observed and a convergence rate in the minimax sense is derived; the trained codebook shows a rate of β −1 in the minimax sense. (3) Based on the convergence rate analyses with β, a novel algorithm to find the number of clusters is proposed, in which a compensation for the K-means algorithm is considered, especially for small training ratios β. Note that, as the TS size becomes relatively small, more errors occur in finding the correct number of clusters.
This paper is organized in the following way. In Section II, several definitions in training codebooks are shown. The convergence rates of β in training codebooks are shown in Section III and the rates in testing the trained codebooks are shown in Section IV. In Section V, a novel algorithm to find the number of clusters is proposed. The conclusion is given in the last section.

II. PRELIMINARY
In early works on statistical learning, the Vapnik-Chervonenkis (VC) dimension [17] was introduced for a set of indicator functions, which can be employed in pattern recognition [2]. Cohn et al. [18] framed clustering as a classification problem and proposed a bound by adopting the VC dimension. In this section, several definitions on trained codebooks are introduced and the convergence rates with β are observed by surveying the previous research.
Let F denote the underlying distribution and ∥·∥ denote the Euclidean norm on R k as a dissimilarity measure. The codebook design problem for F is to find a set C that minimizes the distortion defined by the mean square error D(C) := E{min y∈C ∥X − y∥ 2 } (1) over all possible choices of C ⊂ R k , in which the size of C is less than or equal to a positive integer n. We call C the codebook and its elements the representative vectors. Let C * denote an optimal codebook if D(C * ) = inf C D(C). We call D(C * ) the optimum and simply denote the optimum by D * , which is assumed to be nonzero (D * ̸= 0). From the source coding theorem, D * converges to the Shannon lower bound as the vector dimension k increases [9], [19]. To find C * , Max [20] and Lloyd [21] proposed algorithms for a given distribution F.
Assume that X, X 1 , · · · , X m are independent and identically distributed random vectors taking values in R k with distribution F. Let X 1 , · · · , X m denote a finite TS and the positive integer m denote the TS size. For a codebook C, we define the empirical distortion, which is an empirical mean square error, as D m (C) := (1/m) Σ m j=1 min y∈C ∥X j − y∥ 2 , (2) where we suppose that E{∥X∥ 2 } < ∞. Note that D m is a random variable defined on the underlying sample space. There exists a codebook that achieves inf C D m (C) for a given TS [12]. Let C * m denote such a codebook that minimizes the empirical distortion D m . We call D m (C * m ) (= min C D m (C)) the empirical minimum. In order to find C * for an unknown F, an inductive method that minimizes D m is usually considered in the traditional codebook design approaches because, for a sequence of C * m , D(C * m ) and D m (C * m ) converge to D * almost surely (a.s.) under several conditions [8], [11]. To find an empirical optimum C * m , Linder et al. [7] proposed an iterative algorithm that uses the TS [9, p. 366]. This algorithm is equivalent to a clustering algorithm, which is called the isodata or K-means algorithm in the area of pattern recognition [22, p. 482]. K-means algorithms usually search for the cluster centers of C * m by minimizing D m (C). Because the empirical minimum D m (C * m ) is readily available in training the codebook, D m (C * m ) can be employed as a performance measure for the trained codebooks, rather than the mean distortion D(C * m ) [23], [24]. However, because D m (C * m ) is a biased estimate of D(C * m ), we can more efficiently estimate D(C) using a separate test or validation sequence of relatively small size to evaluate the performance of C [25].
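For concreteness, the empirical distortion D m (C) can be computed directly from the nearest neighbor rule. The following is a minimal Python sketch with illustrative data (not from the paper):

```python
import numpy as np

def empirical_distortion(ts, codebook):
    """D_m(C): the average, over the m training vectors, of the squared
    Euclidean distance to the nearest representative vector in C."""
    # pairwise squared distances, shape (m, n)
    d2 = ((ts[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # nearest-neighbor rule: each sample contributes its minimum distance
    return d2.min(axis=1).mean()

# Illustrative 1-D TS with two well-separated clusters (m = 4, n = 2),
# so the training ratio is beta = m/n = 2.
ts = np.array([[-1.0], [1.0], [9.0], [11.0]])
codebook = np.array([[0.0], [10.0]])
print(empirical_distortion(ts, codebook))  # 1.0
```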
Let the training ratio β be defined as β := m/n, which is the TS size normalized by the codebook size [9, p. 364]. The distortion difference between the optimum and that of the trained codebook depends on the ratio β [16], [24], [26]. Hence, we may estimate the achievable performance of trained codebooks by observing β. In the special case when n = 1 and β ≥ 1, we obtain the explicit relations D * − E{D m (C * m )} = E{D(C * m )} − D * = cβ −1 , where c = D * is equal to the trace of the covariance matrix of X [24].
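The n = 1 relations above can be checked by simulation: with a single representative vector, the empirically optimal codebook is the sample mean, so the empirical minimum is the biased sample variance and D * − E{D m (C * m )} = D * β −1 with β = m. A minimal Monte Carlo sketch, assuming a standard normal source (so that D * , the trace of the covariance matrix, equals k):

```python
import numpy as np

# n = 1: the optimal trained codebook is the sample mean, so D_m(C*_m) is
# the biased sample variance and D* - E{D_m(C*_m)} = D* / beta, beta = m.
rng = np.random.default_rng(0)
k, m, trials = 3, 10, 20000
d_star = float(k)  # for a standard normal source, D* = trace(Cov) = k
dm = np.empty(trials)
for t in range(trials):
    x = rng.standard_normal((m, k))
    dm[t] = ((x - x.mean(axis=0)) ** 2).sum(axis=1).mean()
print(d_star - dm.mean())  # close to d_star / m = 0.3
```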
For the more general case of n ≥ 2, no explicit derivation of D m (C * m ) or D(C * m ) is shown in the literature except for several special cases; for example, D m (C * m ) = 0 if β ≤ 1. Instead of providing such exact values, several bounds in some senses have been derived in order to observe the rate as a function of m or β. By constructing a discrete distribution, Linder [23] derived a lower bound on D * − E{D m (C * m )} as a function of m −1/2 . Earlier results also provide upper bounds of the same order. From the bounds, Linder [23] showed that D * − E{D m (C * m )} has a rate of β −1/2 by introducing worst case bounds. Based on a theorem of [27], Chou [28] studied the mean distortion of C * m and showed that D(C * m ) − D * has the rate m −1 in the sense of distribution. Linder et al. [29] proposed an upper bound on E{D(C * m )} − D * , and Bartlett et al. [30] suggested the rate β −1/2 in the minimax sense. In comparison to the worst case bound of Linder [23], Kim and Bell [24] derived a pointwise lower bound on D * − E{D m (C * m )} that holds for any distribution F. They also derived a lower bound that has the rate m −1 .
In experimental research, we find several observations regarding the rate, especially for image signals. Cosman et al. [31] numerically investigated the performance of the trained codebooks and suggested an algebraic decay of the form m −α for a positive α. Cohn et al. [18] also observed the trained codebook performance based on a discrete source model for images. Collura and Tremain [26] numerically observed the appropriate values of β for designing a full-search codebook based on spectral data and showed that constrained-search structures have a better training performance than the full-search case. Note that β can be different depending on the search structures. Kim [16] observed such training ratios for different search structures, and compared their performances in terms of training. Kim and Bell [24] also showed several numerical results for the uniform, Gaussian, and Laplacian densities of F with fitted curves of the form β −α . Based on a constrained-search structure, the product quantization [14], [15] can provide good characteristics both in terms of training performance and search efficiency [32].

III. RATES ON THE EMPIRICAL MINIMUM
In this section, the convergence rates of the empirical minimum D m (C * m ) of (2) for the empirical optimum C * m are observed.

A. EMPIRICAL MINIMUM OF RATE β −α
Let us consider n points y 1 , · · · , y n as a codebook {y i }, and a corresponding partition {S i } of R k . The representative vector of the region S i is y i , and the codebook size is n. Note that the partition is a finite, disjoint class {S i } whose union is R k (or includes the support of a density function of F). Let P i denote the probability that X belongs to S i , i.e., P i := ∫ S i dF(x), and let I be the index set I = {i : P i ̸= 0, i = 1, · · · , n}. Define the ith partial distortion δ i as δ i := ∫ S i ∥x − y i ∥ 2 dF(x), for i ∈ I. The summation of the partial distortions, Σ i∈I δ i , is equal to the mean distortion of the codebook {y i } with the partition {S i } for F. Here, we assume that Σ i∈I δ i ̸= 0. For the underlying codebook {y i } and partition {S i }, let us consider the random vector Y o i defined as the sample centroid of the training vectors falling in S i , i.e., Y o i := (1/m i ) Σ X j ∈S i X j if m i ̸= 0, and Y o i := y i otherwise, (3) where m i is a random variable defined as the number of training vectors X j that belong to S i . For any distribution F and the underlying {y i } and {S i }, the inequality (4) holds [24], where k, m, n ≥ 1. If each y i is the centroid of S i , i.e., y i = ∫ S i x dF(x) / ∫ S i dF(x), then the equality in (4) holds and yields the relationship (5). For a fixed n ≥ 1 and F, suppose that D * − E{D m (C * m )} = c 1 β −α 1 , (6) where c 1 , α 1 > 0, and the sequence (c 1 ) m is bounded.
From (5), it is clear that α 1 ≤ 1. In (6), c 1 can be a function of C * , k, m, n, and F, and can contribute to making the difference in (6) zero as m increases. For the n = 1 case, c 1 = D * and α 1 = 1 hold. However, for the case of n ≥ 2, we can only guess the constant c 1 from c * 0 in (5). From an upper bound on (6), which is derived by Linder [23], we notice that the minimum of α 1 is 1/2 for distributions supported on a given bounded region. Here, we add the condition lim inf m c 1 > 0 on the constant c 1 . Hence, for the empirical minimum case of (6), we can obtain a range of α 1 as Empirical Minimum: 1/2 ≤ α 1 ≤ 1 (7) for distributions supported on a given bounded region. The fastest rate β −1 of (7) is usually achieved when n = 1, independently of the distribution type and the vector dimension. Note that fast rates are attractive because we can obtain a good codebook using a relatively small TS [26]. However, for the case of n ≥ 2, the rate can be different depending on the distribution, as illustrated in the numerical results of [24]. In fact, the codebook size n also affects the rate.
Linder [23] constructed a discrete distribution as a function of the codebook size and showed the minimum rate β −1/2 . For the worst case of distributions, Linder [23] also derived both upper and lower bounds, and showed that the difference is proportional to β −1/2 . From (5), we can also derive a worst case bound (8), but as a function of β −1 , where the supremum is taken over all distributions on a ball in R k . As Kim and Bell [24] showed, even though the rate is faster than β −1/2 , the worst case bound obtained from (8) can be better than that of Linder for practically small training ratios.

IV. RATES ON THE MEAN DISTORTION
In this section, the convergence rates of the mean distortion D(C * m ) of (1) for the empirical optimum C * m are first observed. A lower bound in the minimax sense is then derived.

A. MEAN DISTORTION OF RATE β −α
In a manner similar to (6), we assume a relationship E{D(C * m )} − D * = c 2 β −α 2 , (9) where c 2 , α 2 > 0, and the sequence (c 2 ) m is bounded with lim inf m c 2 > 0. Linder et al. [29] proposed an upper bound on E{D(C * m )} − D * as a function of (m/ log m) −1/2 for any distribution F. Bartlett et al. [30] sharpened the upper bound to a function of m −1/2 . Hence, the minimum of α 2 is given by 1/2 for a fixed codebook size n.
Bartlett et al. [30] constructed a distribution and derived a lower bound for it as a function of β −1/2 . However, we can obtain a lower bound based on the result of Chou [28] for a more general class of distributions [27]. If C * is unique for the distribution, then from [28], m(D(C * m ) − D * ) converges in distribution to a random variable w, where w is a sum of squares of zero-mean normal random variables with a certain covariance matrix. Hence, D(C * m ) − D * has the rate m −1 in distribution for a fixed n. Therefore, we can obtain a lower bound as in (10). If we confine the input distributions to those of [27], then, from (10), the maximum rate is given by β −1 . In other words, for a class of distributions, we can obtain a range of α 2 for (9) as Mean Distortion: 1/2 ≤ α 2 ≤ 1. (11) It is clear that the distribution constructed by Bartlett et al. [30] has the rate β −1/2 . The rate β −1 is always achieved when n = 1. Therefore, from (7) and (11), we notice that both distortion differences have a range of rates from β −1/2 to β −1 . In the following section, it will be shown that β −1 can be a rate for the mean distortion case in the minimax sense.

B. RATE β −1 IN THE MINIMAX SENSE
In order to investigate the rate of the distortion difference E{D(C * m )} − D * , Bartlett et al. [30] introduced the notion of the minimax expected distortion redundancy, which expresses the minimal worst case excess distortion that an empirically trained codebook can have. The main result of [30] is that the difference E{D(C * m )} − D * is not a function of β −1 in the minimax sense, contrary to Chou's conjectured rate β −1 [28]. They instead proposed the rate β −1/2 for the difference in the minimax sense. In this section, however, it is shown that the difference E{D(C * m )} − D * can have the rate β −1 even in the minimax sense.
Deriving the rate β −1 in the minimax sense is performed by obtaining an upper bound on the mean distortion of C o m , which is introduced in (3). In a manner similar to D m , define a mean distortion over the fixed partition {S i } as D̄ := Σ i∈I ∫ S i ∥x − Y o i ∥ 2 dF(x). (12) Then E{D̄} = Σ i∈I δ i + c 3 β −1 (13) holds, where k, m, n ≥ 1. Here, the positive constant c 3 is given by c 3 = (m/n) Σ i∈I δ i E{ξ i }, (14) where the random variable ξ i is defined as ξ i := 1/m i if m i ̸= 0, and ξ i := 0 otherwise, and c 3 converges to c ∞ as m → ∞ at a rate that is equal to or faster than that of β −1 → 0. The proof of (13) is shown in the Appendix.
Assume that {y i } and {S i } in (13) are equal to an optimal codebook C * := {y * i } and the corresponding Voronoi partition {S * i }, respectively. We then obtain the relationship (15), where k, m, n ≥ 1, and the positive constant c * 3 can be obtained from (14). From (15) and the fact that D(C o m ) is upper bounded by the distortion computed over the fixed partition {S * i }, the bound E{D(C o m )} − D * ≤ c * 3 β −1 holds for any F. Hence, we have an upper bound in the minimax sense as in (16), where the infimum is taken over all k-dimensional n-point codebooks trained on m samples, and the supremum is taken over all distributions on a ball in R k . Bartlett et al. [30] derived a lower bound of rate β −1 . Therefore, from (16), in the minimax sense, the difference E{D(C * m )} − D * has a rate of β −1 .

V. NUMBER OF CLUSTERS
In this section, a distortion compensation for clustering a relatively small TS is first introduced based on the convergence rate analysis with the training ratio β. As an example of the compensation, a novel algorithm to find the number of clusters, especially for relatively small TS sizes, is then proposed.

A. β-COMPENSATIONS FOR SMALL TRAINING RATIOS
In order to find better clusters for a relatively small TS, a data-driven dissimilarity measure can be used [34]. Instead of such a special measure, a simple compensation of the Euclidean norm is introduced to deal with the problem caused by small training ratios, based on the convergence rate analyses of the previous sections.
We adopt the empirical optimum assumption, namely that the empirically optimal codebook C * m is obtained from the K-means clustering of a TS. If the TS size is relatively small, then the empirical minimum D m (C * m ) can have a considerable bias due to small β, as mentioned in (6). The distances between the centers and samples are reduced as the training ratio β decreases. If β ≤ 1, then all distances eventually become zero, since D m (C * m ) = 0. In other words, the clustering region shrinks as β decreases. Hence, it is necessary to compensate the empirical distortions for the finite TS size. We call this compensation with the training ratio β the β-compensation.
Based on the relationship of (6), assume that the empirical minimum satisfies the approximation D m (C * m ) ≈ D * (1 − β −α ) = D * /ψ(β), where ψ(β) := (1 − β −α ) −1 , (17) for a finite training ratio β, where 1/2 ≤ α ≤ 1 from (7). As mentioned for (7), the coefficient of β −α depends on k and F. However, to simplify the compensation, we set the coefficient c 1 ≈ D * and thus obtain ψ of (17). For the n = 1 case, we use α = 1 for the β-compensation. If the clusters are well separated, like fixed bins [35], then α ≈ 1 from (4). For a strong compensation, we can set α = 1/2.
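The constant ψ(β) is elementary to compute. A minimal sketch, assuming β > 1 (for β ≤ 1 the empirical minimum is zero and cannot be compensated):

```python
def psi(beta, alpha=1.0):
    """beta-compensation constant psi(beta) = (1 - beta**(-alpha))**(-1)
    of (17); defined only for training ratios beta > 1."""
    if beta <= 1.0:
        raise ValueError("compensation is defined only for beta > 1")
    return 1.0 / (1.0 - beta ** (-alpha))

# With alpha = 1, D_m(C*_m) * psi(beta) compensates the empirical minimum:
# D* is approximated by D_m / (1 - 1/beta).
print(psi(100.0))  # ~1.01: a large training ratio needs little compensation
print(psi(5.0))    # 1.25: a small training ratio needs a strong correction
```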
A β-compensation to obtain D * can be conducted as D m (C * m )ψ(β). We can also consider a β-compensated squared norm ∥ · ∥ 2 ψ(β). This compensation is important especially when the training ratio β is too small to accurately estimate D * . For simulation examples, two cases of TS generated from a Gaussian mixture distribution with 4 clusters are illustrated in Fig. 1. The TS of Fig. 1(a) has a sufficient size for clustering, as β = 100. However, the TS of Fig. 1(b) has a relatively small size, with β = 5. A β-compensation example for the Gaussian mixture distribution of Fig. 1 is shown in Fig. 2. We can observe that the distortions obtained from the β-compensation are nearly constant even for small training ratios.

B. FINDING THE NUMBER OF CLUSTERS
We now consider an algorithm to find the number of clusters based on the K-means algorithm for a given finite TS. Let us consider a distribution F having several separate clusters, as in Fig. 1. In order to find the separate clusters, we introduce the underlying assumption: if the number of clusters is equal to the codebook size n, then the normalized optimum n 2/k D * is lower than for any other codebook size. Here, the optimum is normalized by multiplying by n 2/k to alleviate the influence of the codebook size [36]. Hence, for a set of candidates for the number of clusters, we can test n 2/k D * or n 2/k D m (C * m ) if β is large enough. However, for small training ratios, the β-compensation as n 2/k D m (C * m )ψ(β) is required. In order to find the number of clusters, a novel algorithm, which employs the β-compensation constant ψ, is proposed and summarized as follows.

1) NUMBER OF CLUSTERS WITH THE β-COMPENSATION
0) Consider a set N of candidates for the number of clusters.
1) Calculate the β-compensation constant ψ(β) from (17) for each n ∈ N .
2) Conduct K-means clustering to obtain the empirical minima D m (C * m ) for n ∈ N .
3) Select the codebook size n that minimizes n 2/k D m (C * m )ψ(β) as the number of clusters in F.

Examples of the proposed algorithm are demonstrated in Fig. 3. If the training ratio is large enough, as with β = 100 (n = 4) in Fig. 1(a), then the compensated empirical minima n 2/k D m (C * m )ψ(β) (''β-comp. n 2/k D m ψ(β)'') with respect to several n are very close to the mean distortions n 2/k D(C * m ) (''Testing''), as shown in Fig. 3(a). Because the minimum of the compensated distortions is achieved at n = 4 in Fig. 3(a), we can conclude that the number of clusters is 4; the clustering result is also depicted in Fig. 1(a). It is clear that, if the TS size is large enough, we can find the correct number of clusters. On the other hand, if the TS size is relatively small, as shown in Fig. 1(b), then simply checking the empirical minimum n 2/k D m (C * m ) can lead to a wrong result, as ''n 2/k D m '' of Fig. 3(b) shows. Note that the training ratio β of Fig. 3(b) varies from 20 to 20/9 ≈ 2.22 for n = 1, . . . , 9. Hence, we can observe that the empirical minimum decreases and the test distortion increases as n increases in Fig. 3(b). For this relatively small TS size, compensating the empirical minimum is important in finding the number of clusters, as demonstrated by ''β-comp. n 2/k D m ψ(β)'' of Fig. 3(b). Hence, the proposed algorithm can correctly find the number of clusters from the β-compensation, especially for small training ratios. In the example of Fig. 3, α = 1/2 was set for the β-compensation. However, for the n = 4 case, the convergence rate is closer to β −1 , i.e., α ≈ 1. Hence, setting α = 1 for larger candidate sizes n can be recommended.
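The steps above can be sketched end to end in Python. This is an illustrative sketch, not the authors' implementation: the K-means step uses a plain Lloyd iteration with a deterministic farthest-point initialization, and the synthetic Gaussian-mixture TS merely mimics the setting of Fig. 1.

```python
import numpy as np

def kmeans(ts, n, iters=100):
    """Plain Lloyd/K-means with a deterministic farthest-point
    initialization; returns the codebook and the empirical minimum D_m."""
    centers = [ts[0]]
    for _ in range(n - 1):  # farthest-point initialization
        d2 = ((ts[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2)
        centers.append(ts[d2.min(axis=1).argmax()])
    codebook = np.array(centers, dtype=float)
    for _ in range(iters):  # Lloyd iterations
        d2 = ((ts[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(n):
            if np.any(labels == i):
                codebook[i] = ts[labels == i].mean(axis=0)
    d2 = ((ts[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, d2.min(axis=1).mean()

def number_of_clusters(ts, candidates, alpha=0.5):
    """Steps 0)-3): pick the n in N that minimizes the normalized,
    compensated empirical minimum n**(2/k) * D_m * psi(beta), beta = m/n."""
    m, k = ts.shape
    scores = {}
    for n in candidates:
        beta = m / n
        psi = 1.0 / (1.0 - beta ** (-alpha))  # compensation constant of (17)
        _, dm = kmeans(ts, n)
        scores[n] = n ** (2.0 / k) * dm * psi
    return min(scores, key=scores.get)

# Synthetic TS: 4 well-separated Gaussian clusters in 2-D, m = 100.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [0.0, 20.0], [20.0, 0.0], [20.0, 20.0]])
ts = np.vstack([mu + 0.5 * rng.standard_normal((25, 2)) for mu in means])
print(number_of_clusters(ts, range(2, 7)))  # 4 for this well-separated TS
```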
In order to find the number of clusters, various methods have been developed [2], [22], [37], [38], [39], [40], [41], [42]. Among these methods, we can apply the β-compensation whenever the method utilizes the empirical distortion D m or the Euclidean norm. Observing the elbow curve of D m can provide an appropriate number of clusters, as shown in Fig. 4 for the TS of Fig. 1(b) [43], [44]. From n = 5, the changes in D m are not significant; hence, we can guess that the number of clusters is 4. Applying the β-compensation to D m can help us find the elbow point more clearly, especially when the TS size is relatively small. Sugar [40] proposed an information-theoretic method to find the clusters. The β-compensation can also be applied to the empirical minima, and an example is shown in Fig. 5 for the TS of Fig. 1(b). Without the compensation, the distortion change curve yielded a wrong result of 8. However, applying the β-compensation provided the correct value, in a similar manner to the proposed algorithm.
In Fig. 6, the silhouette [38] and gap statistic [2], [39] methods are discussed with regard to the β-compensation. For the silhouette values, if the squared Euclidean norm is used to calculate distances, the β-compensation constant can also be considered. However, the silhouette values are calculated with a distance normalization, and thus the β-compensation is intrinsically not required, in contrast to the methods of [22]. As we can observe in Fig. 6, the silhouette method can find the correct number of clusters even for the relatively small TS of Fig. 1(b). However, the gap statistic method shows an incorrect optimal value of 1, and the β-compensation cannot be applied to it. Note that, as β increases, the gap statistic method can yield the correct number of clusters.

VI. CONCLUSION
In this paper, the clustering performance of the K-means algorithm was analyzed through both theoretical and numerical observations under the empirical optimum assumption. If this assumption holds, then the analysis can be applied to other clustering algorithms. The clustering performance is
• dominantly dependent on the training ratio β := m/n,
• less sensitive to the vector dimension k, and
• different depending on the source distribution F.
The convergence rate has the form cβ −α , where α depends on F and 1/2 ≤ α ≤ 1. The positive constant c is a function of k and F, and inf β,k c > 0. When clustering a TS by minimizing the empirical distortion, using the β-compensation constant ψ(β) = (1 − β −α ) −1 is proposed, especially for small TS sizes or small β. By applying this β-compensation to a clustering algorithm, the number of clusters can be correctly found even for small values of β.

APPENDIX: PROOF OF (13)
In order to prove (13), the expectation of the distortion on the left-hand side of (13) is first derived.
Because y i = ∫ S i x dF(x) / ∫ S i dF(x) for all i by assumption, the distortion can be rewritten as in (A1). We now derive its mean. In order to simplify the derivation, let |I| = n; the proof in the case |I| < n is similar. Let B ν ⊂ R km be an m-fold Cartesian product of the S i 's, i.e., B ν = S i ν,1 × · · · × S i ν,m , where ν = (i ν,1 − 1) + (i ν,2 − 1)n + · · · + (i ν,m − 1)n m−1 and i ν,j ∈ {1, 2, · · · , n}. Consider the m-fold product measure F × · · · × F. Then the second term on the right-hand side of (A1) satisfies (A2), where ψ i,ν := ∫ B ν ∥Y o i − y i ∥ 2 dF(x 1 ) · · · dF(x m ), and Y o i is a function of the x ℓ 's as in (3). Let m i,ν denote the number of occurrences of i in {i ν,j } m j=1 . In the case when m i,ν ̸= 0, by rearranging the parameters x ℓ , ψ i,ν can be expanded as in (A5). It is clear that the second term on the right-hand side of (A5) goes to 0 as m → ∞ because m(1 − P i ) m → 0. We now consider the term mE{ξ i } in (A5). A lower bound on ξ i is given by γ i /(m i + 1) ≤ ξ i , where γ i = 1 if m i ̸= 0, and γ i = 0 otherwise. Hence, E{γ i /(m i + 1)} ≤ E{ξ i } holds. Because (m 1 , · · · , m n ) has a multinomial distribution, (A6) holds. An upper bound on ξ i is given by ξ i ≤ γ i /(m i + 1) + 3γ i /((m i + 1)(m i + 2)) [45]. Hence, (A7) holds.
The second term on the right-hand side of (A7) can be expanded as in (A8). Consequently, from (A6) and (A8), mE{ξ i } → 1/P i for i ∈ I. Therefore, c 3 → Σ i∈I δ i /(nP i ), which is equal to c ∞ = lim m→∞ c 0 . □
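The key limit in this proof, mE{ξ i } → 1/P i , can be checked numerically by simulating the cell count m i , whose marginal is binomial; the cell probability P i below is an arbitrary illustrative choice.

```python
import numpy as np

# xi_i := 1/m_i if m_i != 0, else 0, with m_i ~ Binomial(m, P_i);
# the appendix shows that m * E{xi_i} -> 1/P_i as m grows.
rng = np.random.default_rng(1)
p_i = 0.2  # illustrative cell probability
for m in (10, 100, 1000):
    m_i = rng.binomial(m, p_i, size=200000)
    # guard the division: np.maximum avoids 1/0; np.where zeroes those terms
    xi = np.where(m_i != 0, 1.0 / np.maximum(m_i, 1), 0.0)
    print(m, m * xi.mean())  # approaches 1 / p_i = 5.0
```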