The Cluster Structure Function

For each partition of a data set into a given number of parts there is a partition such that every part is as good a model (an "algorithmic sufficient statistic") as possible for the data in that part. Since this can be done for every number of parts between one and the cardinality of the data set, the result is a function, the cluster structure function. It maps the number of parts of a partition to values related to the deficiencies of the parts as good models. Such a function starts at a value of at least zero for the unpartitioned data set and descends to zero for the partition of the data set into singleton parts. The optimal clustering is selected by analyzing the cluster structure function. The theory behind the method is expressed in algorithmic information theory (Kolmogorov complexity). In practice the Kolmogorov complexities involved are approximated by a concrete compressor. We give examples using real data sets: the MNIST handwritten digits and the segmentation of real cells as used in stem cell research.


I. INTRODUCTION
The aim of this work is to introduce the cluster structure function and apply it to propose a method for finding the number of clusters in a given dataset that is unsupervised, feasible, justifiable in terms of its theory, and more accurate than previous methods for this task. Clustering is a fundamental task in unsupervised learning, partitioning a set of objects into groups called clusters such that objects in the same cluster are more similar to each other than to those in other groups [26]. Every object in a computer is represented by a finite sequence of 0's and 1's: a finite binary string (abbreviated to "string" in the sequel). There are many methods and algorithms for clustering and for determining the number of clusters in data, as for example surveyed in [2], [14], [16], [26]. We explore a new method for determining the number of clusters based on Kolmogorov's notion of algorithmic sufficient statistic [24], [8], which is expressed in terms of Kolmogorov complexity [17]. For technical reasons we use prefix Kolmogorov complexity [19]. In the sequel we also use K for the number of clusters in the data, agreeing with customary use. Confusion is avoided by the context.
A brief overview of the needed notions is given here. Details and proofs can be found in the textbook [21]. A prefix Turing machine is a Turing machine (we use a binary alphabet) such that the set of input programs for which the machine halts is a prefix code (no input program is a proper prefix of another one). The prefix Turing machines can be computably enumerated T_1, T_2, . . . and this list has a universal prefix Turing machine U such that U(i, p) = T_i(p) for all integers i and halting programs p for T_i. Formally, the conditional prefix Kolmogorov complexity K(x|y) is the length of a shortest input string z such that the reference universal prefix Turing machine U on input z with auxiliary information y outputs x. The unconditional prefix Kolmogorov complexity K(x) is defined as K(x|ε) where ε is the empty string. The quantity K(x) is the length of a shortest binary string x* from which x can be effectively reconstructed. If there is more than one candidate for x* we use the first one in the enumeration. The string x* accounts for every effective regularity in x. In these definitions both x and y can consist of strings into which finite multisets of finite binary strings are encoded.
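The prefix-code requirement and Kraft's inequality can be illustrated concretely. The sketch below uses illustrative codewords of our own choosing, not tied to any particular prefix Turing machine:

```python
from fractions import Fraction

def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

def kraft_sum(codewords):
    """Sum of 2^-|c| over all codewords; at most 1 for any prefix code."""
    return sum(Fraction(1, 2 ** len(c)) for c in codewords)

codes = ["0", "10", "110", "111"]  # illustrative codewords
assert is_prefix_free(codes)
assert kraft_sum(codes) == 1       # this particular prefix code is complete
```

Dropping the prefix-free requirement (e.g. adding the codeword "01") would allow the Kraft sum to exceed 1, which is what blocks the semiprobability interpretation used later in Section III.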
Informally, a finite set A of strings containing x is an algorithmic sufficient statistic for x iff K(A) + log |A| = K(x). That is, the encoding of x by giving A (a model) and the index of x in A is as short as a shortest computer program for x (sometimes a small additive value is allowed). This means that A is a good model for x [31]. As we show in Lemma 1 it is impossible that A is such a good model for all y ∈ A. Therefore we have to relax the condition of sufficiency. If the equality above holds up to some additive term then this term is called the optimality deficiency. We propose to group the elements of a data set (a multiset) into clusters (submultisets) such that the optimality deficiencies in every cluster are minimal in some sense. This seems to require a specification of the number of clusters. However, the aim is to find the number of clusters. To solve this conundrum the proposed method proceeds as follows. The cluster structure function has the number of clusters as argument and a quantity involving the optimality deficiencies as value. Such a function decreases to 0 when the number of clusters grows to the cardinality of the data set. The optimal number of clusters can then be selected by analyzing the cluster structure function.
We give the definitions and the ideal method of application in Section II. Proofs are deferred to Section II-B. An explanation of the probability relations of members of a cluster is given in Section III. A brief survey of related literature is given in Section IV. Finally, Section V shows examples of real applications including estimating the number of unique digits in a set of MNIST handwritten digits and an ensemble segmentation approach to human stem cell nuclear segmentation.

II. THEORY OF THE CLUSTER STRUCTURE FUNCTION
The aim is to partition a multiset into submultisets such that each submultiset constitutes a cluster. In probabilistic statistics the relevant notion is the "sufficient statistic" due to R.A. Fisher [13], [8]. According to Fisher: "The statistic chosen should summarise the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency . . . In the case of the normal curve of distribution it is evident that the second moment is a sufficient statistic for estimating the standard deviation." This type of sufficient statistic pertains to probability distributions. In the problem at hand the data are individual strings. Therefore the probabilistic notion is not appropriate. For individual strings the analogous notion is the "algorithmic sufficient statistic." For convenience we drop the adjective "algorithmic" in the sequel (the probabilistic sufficient statistic does not occur in the sequel). We equate a multiset being a cluster with the multiset being, as close as possible according to a given criterion, an (algorithmic) sufficient statistic for the members of the cluster. The new method partitions S such that each resulting part is, as close as possible according to the given criterion, a sufficient statistic for all of its members. Therefore the parts are good models for their members [31]. This is different from existing methods, which use some metric that does not say much about this aspect.

DEFINITION 1. A multiset A of strings is an algorithmic sufficient statistic, abbreviated sufficient statistic, for an element x ∈ A if K(A) + log |A| = K(x). Here A is a model and the log |A| term allows us to pinpoint x in A.
Therefore, every y ∈ A satisfies K(A) + log |A| ≥ K(y). Reference [11] characterizes when a multiset A is a sufficient statistic for a string it contains. This is akin to the minimum description length (MDL) principle in statistics [12]. To illustrate, if the length of a binary string x is n and K(x) = n + K(n) (the maximum), which means that x is random, then A = {x} is a sufficient statistic of x (the minimal one) and A = {0, 1}^n is also a sufficient statistic of x (the maximal one). There is a tradeoff between the cardinality of a sufficient statistic A of a string x and the amount of effective regularities in the string x it represents. The greater the cardinality of A, the smaller is K(A), which is the amount of effective regularities it represents. The multiset A accounts for as many effective regularities in x as is possible for a set of the cardinality of A. This means that A is the model of best fit, which we call the best model, for x [31, Section IV-B]. Thus, if A has the property that for every y ∈ A it is as much as possible a sufficient statistic, then all members of A share as many effective regularities as is possible. All the y ∈ A are similar in the sense of [20], [4]. We cluster the data according to this criterion.
If A contains elements y such that K(A) + log |A| > K(y) (trivially < is impossible), then A need not be determined by such a y: K(A|y) can be large. Let us look closer at what this implies and consider A containing only elements of length n. By the symmetry of information [10] we have K(A|x) = K(A) + K(x|A) − K(x) up to a logarithmic additive term. For example, let A be the set containing all integers in an interval with complex endpoints, say K(A) = Ω(n), and let x be an integer in this interval of low complexity, say K(x) = o(n). Then K(x|A) ≤ K(x) + O(1) = o(n), and therefore K(A|x) = Ω(n). That is, A is not at all determined by x.

DEFINITION 2. The optimality deficiency of a finite multiset A for an element x ∈ A is

δ(A, x) = K(A) + log |A| − K(x). (II.1)

The mean of the optimality deficiencies of a multiset A is µ_A = (1/|A|) Σ_{x∈A} δ(A, x). Here δ(A, x) ≥ 0 up to an additive term, with equality for a proper sufficient statistic. If µ_A = 0 then δ(A, x) = 0 for all x ∈ A, that is, A is a sufficient statistic for all of its elements. But this is not possible for |A| ≥ 2 by the following lemma.

LEMMA 1. Let A be a finite multiset of strings of length n with |A| ≥ 2.
(i) There is a y ∈ A such that δ(A, y) > 0, and µ_A > 0.
(ii) There exist A and x ∈ A such that δ(A, x) < 0, and for such A no y ∈ A satisfies δ(A, y) = 0.

REMARK 1. The optimality deficiency should not be confused with the randomness deficiency of x ∈ A with respect to A, defined as δ(x|A) = log |A| − K(x|A). By the symmetry of information law K(A) + K(x|A) = K(x) + K(A|x) up to a logarithmic additive term O(log K(A)). Therefore δ(x|A) + K(A|x) = log |A| + K(A) − K(x) + O(log K(A)) and hence δ(A, x) = δ(x|A) + K(A|x) + O(log K(A)).

For clustering we would ideally want the model to be a sufficient statistic for all elements in it. But we have to deal with optimality deficiencies that are greater than 0, and with real data typically all of them are greater than 0. There are many ways to combine the optimality deficiencies (or other aspects) to obtain criteria for selection. This is formulated in the criterion function as follows.
Let N denote the natural numbers and let S = {x_1, . . ., x_n} be a finite nonempty multiset of strings. Consider a partition π of S into k nonempty submultisets S_1, . . ., S_k such that ∪_{i=1}^k S_i = S and S_i ∩ S_j = ∅ for i ≠ j. Denote the set of partitions of S into k submultisets by Π_k and the set of all partitions by Π. The criterion function f : Π → N takes as argument a partition π ∈ Π of S and as value a natural number computed from the optimality deficiencies involved in the partition, subject to the following: (i) the value of f(π) does not increase if one or more optimality deficiencies are changed to 0; and (ii) f(π) = 0 if all optimality deficiencies are 0. (One can use other aspects as well.)

DEFINITION 3. The Cluster Structure Function (CSF)^1 for a multiset S of n strings is defined by H^f_S(k) = min_{π∈Π_k} f(π) for each k with 1 ≤ k ≤ n, where f is the criterion function. The graph of this function is called the CSF curve. If f is understood we may write H_S for the CSF.

LEMMA 2. Let S = {x_1, . . ., x_n} with n ≥ 1. For every f we have H^f_S(n) = 0 and H^f_S is monotonic non-increasing with increasing arguments on its domain [1, n].
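Definition 3 can be made concrete by brute force for a tiny multiset. The sketch below uses an illustrative criterion f (the sum of the nonnegative deficiencies) and a toy stand-in for K (plain string length); both choices are assumptions for illustration only, not the paper's method:

```python
import math

def partitions_k(items, k):
    """Yield all partitions of the list `items` into k nonempty parts."""
    if k == 1:
        yield [list(items)]
        return
    if len(items) < k:
        return
    first, rest = items[0], items[1:]
    # `first` forms its own part, or joins a part of a partition of `rest`
    for p in partitions_k(rest, k - 1):
        yield [[first]] + p
    for p in partitions_k(rest, k):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def c(s):
    """Toy stand-in for Kolmogorov complexity: plain length (illustrative)."""
    return len(s)

def f(partition):
    """Illustrative criterion: sum of the nonnegative optimality deficiencies."""
    total = 0.0
    for part in partition:
        c_part = c("".join(sorted(part)))
        for x in part:
            total += max(c_part + math.log2(len(part)) - c(x), 0.0)
    return total

def H(S, k):
    """Cluster structure function H_S^f(k) by exhaustive minimization."""
    return min(f(p) for p in partitions_k(S, k))

S = ["aa", "ab", "ba", "bb"]
assert H(S, len(S)) == 0.0  # singleton parts each have deficiency 0
```

The exhaustive minimum over Π_k is only feasible for toy inputs; Section V replaces it with a spectral-clustering heuristic.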
We give a lower bound on H^f_S for some datasets S.
Lemma 3 (proved below) establishes that there are sets S of n elements such that H^f_S stays at a high value for arguments 1, . . ., n − 1 and drops suddenly to 0 for argument n.
In practice we may use only the optimality deficiencies within one standard deviation around the mean to determine the criterion function f(π) for a partition π ∈ Π_k of S into parts S_1, . . ., S_k. This is a more refined method since it eliminates the outliers, counting only the central items (68.2% if they are normally distributed) of the optimality deficiencies in each part S_i. The mean of the δ(S, x) of a multiset S is µ_S = (1/|S|) Σ_{x∈S} δ(S, x) and their standard deviation is σ_S = ((1/|S|) Σ_{x∈S} (δ(S, x) − µ_S)^2)^{1/2}. Denote the resulting criterion function by f_σ; then H^{f_σ}_S(k) is the minimum of f_σ over all partitions of S into k parts. It possibly clusters better since, by Section III, the conditional probabilities between most members of a part of a witness partition may then be larger but never smaller.

^1 The cluster structure function is named in analogy with the Kolmogorov structure function h_x : N → N defined by h_x(k) = min_{S⊆{0,1}*} {log |S| : x ∈ S, K(S) ≤ k} associated with a finite binary string x [31].

B. Proofs

Proof of Lemma 1. Ad (i): Let y be the first element of A; then K(y) ≤ K(A) + O(1). Hence δ(A, y) = K(A) + log |A| − K(y) > 0 and µ_A > 0. Ad (ii): There are A and x ∈ A such that δ(A, x) < 0. For example, let A be a sufficiently long interval of integers (represented by n-bit strings) of length O(2^n) with endpoints of O(log n) Kolmogorov complexity, and let x ∈ A be a random string in that interval, which means K(x) = Ω(n). Then δ(A, x) < 0, and by Item (i) there is no y ∈ A such that δ(A, y) = 0.

Proof of Lemma 2. The graph of H^f_S starts with the partition of S into 1 part (no partition). If n = 1 the only optimality deficiency involved is 0 and by Definition 3 we have H^f_S(1) = 0. By Item (i) in the definition of the criterion function, if π ∈ Π_{k+1} and we change one of the optimality deficiencies of the elements to 0 then the criterion function f does not increase. Hence the minimum of f for a partition in Π_k is not larger than the minimum of f for a partition in Π_{k+1}. Therefore H^f_S is monotonic non-increasing.
For k = n the multiset S is partitioned into singleton sets, which all have optimality deficiency 0. Hence H^f_S(n) = 0.

Proof of Lemma 3. Choose x ∈ {0, 1}^m and S with |S| = n such that S = {y : |y| = m and y equals x with the i-th bit flipped, 1 ≤ i ≤ n}. We can describe every partition π ∈ Π_k of S by giving S in K(S) bits, the integer k in O(log n) bits, and an O(1) program. This program does the following: given k and S it generates all finitely many partitions π ∈ Π_k. A partition π ∈ Π_k of S divides it into, say, S_1, . . ., S_k, and by the symmetry of information law [10] the complexities of the parts and their members can be related to K(S). (This is possible since all n members of S are strings of length m and they can have complexity varying between close to 0 and at least m.) Since for each finite multiset A and x ∈ A we have δ(A, x) = K(A) + log |A| − K(x), the deficiencies in the parts are bounded from below. For a k-partition of S at least one S_i in the partition has cardinality at least n/k. Therefore, if n/k > 1 then H^f_S(k) > m/n. This holds for k = 1, . . ., n − 1. For k = n all parts S_i in the partition are singleton sets and hence H^f_S(k) = 0.

Proof of Lemma 5. Similar to the proof of Lemma 4.
Proof of Lemma 6. Similar to the proof of Lemma 3.

Proof of Lemma 7. Similar to the proof of Lemma 4.
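The refined criterion of Section II-A, which restricts attention to the deficiencies within one standard deviation of the mean, can be sketched as follows. The helper name is hypothetical, and `pstdev` is the population standard deviation, matching the σ_S formula with the 1/|S| divisor:

```python
from statistics import mean, pstdev

def central_mean(deltas):
    """Mean of the deficiencies within one standard deviation of the mean,
    i.e. the central items only; falls back to the plain mean if the
    central band is empty."""
    mu, sd = mean(deltas), pstdev(deltas)
    central = [d for d in deltas if abs(d - mu) <= sd]
    return mean(central) if central else mu

# A single large outlier is discarded before averaging.
assert central_mean([1.0, 1.0, 1.0, 100.0]) == 1.0
```

This is the per-part computation; a criterion function f_σ would combine these central means over the parts S_1, . . ., S_k of a partition.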

C. Computing the number of clusters
To determine the number K of clusters in data S we compare the cluster structure function on S with the same cluster structure function on a reference set of |S| data distributed uniformly. We make this comparison as the logarithm of a ratio. Using the cluster structure function H^f_S on the data set S, the number K of clusters in S is the k where the log-ratio D^f(k) = log H^f_U(k) − log H^f_S(k) is greatest, where U is the reference placement: the uniform distribution of |S| data samples over the range spanned by S. For example, if S is a set of numbers then its range is the interval I = [min S, max S]. Note that every set S is represented in computer memory as a finite set of finite strings of 0's and 1's and that therefore min S and max S are well defined. Divide I into n equal parts I_1, . . ., I_n with ∪_{i=1}^n I_i = I and place one reference sample in each part. To deal with the incomputability of the function K we approximate K from above by a good compressor Z. If x is a string then Z(x) is the length of the version of x compressed by Z. The function Z is by construction a computable function, even a feasibly computable one (for example Z is bzip2 or some other compressor). Because K is incomputable there are strings x such that K(x) ≪ Z(x) and the difference Z(x) − K(x) is incomputable. However, we assume that natural data encode no universal computer or problematic mathematical constants like the ratio of the circumference of a circle to its diameter, 3.14 . . . . We assume that for the natural data we encounter the compression by Z has a length close to the prefix Kolmogorov complexity. The same holds for a multiset A of strings, which we represent by an encoding of its members. For a partition π ∈ Π_k of S (|S| = n) we compute the δ(S_i, x)'s by computing Z(S_i) (1 ≤ i ≤ k) and Z(x) for all x ∈ S. This requires at most k + n compressions. We write "at most" since a member of a multiset S can occur more than once.
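The compressor-based approximation of the deficiencies can be sketched as follows. Here `bz2` stands in for "a good compressor Z" (the paper mentions bzip2; any compressor could be substituted), and caching distinct elements keeps the count at k + n compressions as stated above:

```python
import bz2
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def Z(data: bytes) -> int:
    """Length in bits of the bz2-compressed data (computable proxy for K)."""
    return 8 * len(bz2.compress(data))

def deficiencies(partition):
    """Approximate delta(S_i, x) = Z(S_i) + log|S_i| - Z(x) for every element.

    `partition` is a list of parts, each part a list of byte strings.
    One compression per part plus one per distinct element (cached),
    so at most k + n compressions in total, as in the text.
    """
    out = []
    for part in partition:
        z_part = Z(b"".join(part))
        for x in part:
            out.append(z_part + math.log2(len(part)) - Z(x))
    return out

# Two illustrative parts: a repetitive pair and a less regular singleton.
parts = [[b"0101" * 64, b"0110" * 64], [bytes(range(256))]]
ds = deficiencies(parts)
```

A singleton part always has deficiency exactly 0 under this approximation, mirroring the k = n case of Lemma 2.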

III. PROBABILITIES AMONG MEMBERS OF CLUSTERS
By Lemma 1 a part A, with two or more members, of a witness partition of S cannot be a sufficient statistic for all of its elements. The members of a cluster typically share some but not all characteristics. It turns out that in an appropriate sense the members of a cluster are nonetheless probabilistically close.
We define a conditional probability of n-bit strings following [22]. We start with the unconditional probability. Let a finite set A of n-bit strings be chosen randomly with probability m(A) = 2^{−K(A)}, and subsequently let x ∈ A be chosen with uniform probability from A, that is, x is chosen with probability m(A)/|A|. (Since K(A) is the length of a prefix code we have by Kraft's inequality [8] that Σ_A 2^{−K(A)} ≤ 1. Hence m is a semiprobability: just like a probability but it may sum to less than 1. The particular semiprobability m is called universal since it is the largest lower semicomputable semiprobability [19]. In absence of any information about A we can assign m(A) as its probability. Properties are discussed in the textbook [21].)

DEFINITION 5. For x, y ∈ A we define the conditional probability p(y|x) as the probability that a chosen set contains y, given that it contains x:

p(y|x) = (Σ_{A: x,y∈A} m(A)/|A|) / (Σ_{A: x∈A} m(A)/|A|).

We show below that all pairs of strings in a part of a witness partition of a multiset S of n strings have an expected conditional p-probability with respect to each other of at least 2^{−H^f_S(k)} for some k ≤ n. Hence the smaller H^f_S(k) is, the larger the conditional probabilities that the strings in a part of a witness partition of H^f_S(k) have with respect to each other: they form a cluster.

THEOREM 1. Let S ⊆ {0, 1}^n (consider only n-length strings) and let a witness k-partition of S for H^f_S(k) divide S into parts S_1, . . ., S_k. The expectation of p(y|x) over pairs x, y ∈ S_i for some i (1 ≤ i ≤ k) is at least 2^{−H^f_S(k)−O(log n)}.

Proof. The parts of a witness to H^f_S(k) form clusters because intuitively, if the conditional probabilities in Definition 5 of different strings in a part of the witness partition are large, then the conditional Kolmogorov complexities are small:

CLAIM 1. p(y|x) = 2^{−K(y|x)±O(log n)}.
Proof. Start from Definition 5. The first step holds by the following reasoning: Σ_{A: x∈A} m(A)/|A| = Θ(m(x)), because the lefthand side is a lower semicomputable function of x and hence it is O(m(x)); moreover, if A = {x} then the lefthand side contributes m(x). The same argument applies to the pair {x, y}, giving Σ_{A: x,y∈A} m(A)/|A| = Θ(m(x, y)), so that p(y|x) = Θ(m(x, y)/m(x)). The second step uses the coding theorem [19], which states m(x) = 2^{−K(x)+O(1)}, and the symmetry of information law [10], which shows both the trivial K(x, y) ≤ K(x) + K(y|x) + O(1) and K(x, y) ≥ K(x) + K(y|x) − O(log n). The Θ order of magnitude is an O(1) term in the exponent and is absorbed in the O(log n) term.
The conditional probabilities of pairs of strings in a part of a k-partition of S which is a witness to H^f_S(k) satisfy the following. By [22, Theorem 5], if x, y ∈ S_i for a particular i (1 ≤ i ≤ k) and δ(S_i, x) ≤ d then p(y|x) ≥ 2^{−d−O(log n)}, while if p(y|x) ≥ 2^{−d} then δ(S_i, x) ≤ d + O(log n). Hence p(y|x) = 2^{−δ(S_i,x)±O(log n)}. By the inequality of arithmetic and geometric means, the expectation of p(y|x) over S_i is at least 2 raised to minus the mean optimality deficiency in S_i, up to the O(log n) term in the exponent. Applying the same inequality once more over the parts, if x, y ∈ S_i for some i (1 ≤ i ≤ k) then the expectation of p(y|x) over all S_i in a witness partition of S is at least 2^{−H^f_S(k)−O(log n)}, and it goes to 1 for k → n.

REMARK 4. Roughly, the smaller H_S becomes, the larger the conditional probabilities of the elements in a part of the witness partition become.
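The arithmetic–geometric mean step can be written out explicitly for a single part S_i (a sketch; the O(log n) factors in the exponent are suppressed):

```latex
\frac{1}{|S_i|}\sum_{x \in S_i} 2^{-\delta(S_i,x)}
\;\ge\;
\Bigl(\prod_{x \in S_i} 2^{-\delta(S_i,x)}\Bigr)^{1/|S_i|}
\;=\;
2^{-\frac{1}{|S_i|}\sum_{x \in S_i} \delta(S_i,x)}
```

That is, the average conditional probability within a part is at least 2 to the minus mean deficiency of that part; averaging once more over the parts yields the bound of Theorem 1.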

IV. RELATED LITERATURE
This paper extends previous work in the field of algorithmic statistics [11], [31], [29]. The applications build particularly on the field of semi-supervised spectral learning [15], [23], [6]. Most previous approaches to estimating the number of clusters in a dataset use probabilistic statistical modeling of the data. The Bayesian and Akaike information criteria both formulate the question w.r.t. underlying distributions estimated either parametrically or empirically [26], [37], [36], [38]. Bayesian methods are well suited when the likelihood function and prior probabilities are known. In comparison, the algorithmic statistics approach proposed here works with the particular dataset rather than probabilities across a hypothesized population. Recently, alternative approaches based on characteristics of the specific data set in question, rather than a population-level model, have been considered [40], [39]. These include in particular the widely used gap statistic [27], which is very similar in spirit to the implementation described here. The connection between the gap statistic and the field of algorithmic statistics was one of the key motivators for this work [5]. The gap statistic compares the spatial characteristics of the data being clustered to those of a randomly generated reference distribution. Our approach is similar in both theory and practice to the gap statistic, and many of the advantages of the two approaches are shared. Both are effective when K = 1, that is, when there are no meaningful clusters among the data. Both are reasonably efficient to compute. DBSCAN combines the clustering and K estimation into a single task, and provides parameters for fine-tuned control [9]. In theory the cluster structure function might be used in an automated parameter search with such an algorithm.
In the area of computational biological microscopy image analysis we build on previous work for optimally partitioning connected components of foreground pixels into elliptical regions [35]. A key advantage of the cluster structure function compared to other approaches is the very broad and powerful theoretical structure of Kolmogorov complexity and algorithmic statistics. The techniques are generally parameter-free, beyond the selection of a suitable compression algorithm. In theory it is possible to automatically identify the optimal compression by considering ensembles of algorithms and choosing the best results among them via the structure function.

V. EXAMPLE APPLICATIONS
A. How many different digits are in a MNIST digit set?
Here we apply the optimality deficiency to estimating the number of different digits in a set of digits sampled from the MNIST handwritten digits dataset. Classification of the MNIST digits using supervised learning techniques is well studied, but there has been little application of unsupervised learning to this problem. One key challenge is establishing a ground-truth number of different classes. Different styles of handwriting were taught at different times in different locations. These differences are likely reflected in the underlying data as distinct categories, even within digits of the same class label. Another challenge is the difficulty of unsupervised classification of digits even when the correct number of classes is known. While supervised solutions for MNIST digit classification are extremely accurate, unsupervised clustering of MNIST digits is still a difficult problem. The MNIST data has been normalized to 28x28 8-bit grayscale (0, . . ., 255) images. The MNIST database contains a total of 70,000 handwritten digits consisting of 60,000 training examples and 10,000 test examples. Example input digits are shown in Figure 1.
We apply the cluster structure function to the question of estimating how many different MNIST digits are represented in a large set. We configure a sampler to choose a set of 100 digits at a time randomly, given a fixed K value. In each digit set, the cardinality of each digit is 100/K. For K = 3 there would be 33 each of the digits [0, 1, 2]. For K = 10, there would be 10 each of the digits [0..9]. Spectral clustering [23] is used to cluster the MNIST digit sets [7]. The spectral clustering approach starts with the matrix of pairwise normalized compression distances (NCD), as in [4], among all pairs of digits. We used the free lossless image format (FLIF) compressor [25] for the MNIST digits, and found it to significantly outperform the previously used BZIP and JPEG2000 compressors. After computing the NCD matrix between digit pairs, the classic spectral clustering algorithm [23] is applied. Table I shows the results of spectral clustering for all ten digits. Clustering accuracy for all ten digit types [0..9] was 46% with 95% confidence intervals of [.4622, .4657] obtained via bootstrapping [26].
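The NCD matrix that feeds spectral clustering can be sketched as follows, using the standard NCD formula of [4]. Here `bz2` stands in for the FLIF compressor used on the MNIST images, and the byte strings are placeholders for serialized digit images:

```python
import bz2

def Z(data: bytes) -> int:
    """Compressed length in bytes (bz2 standing in for FLIF here)."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    zx, zy, zxy = Z(x), Z(y), Z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

def ncd_matrix(items):
    """Pairwise NCD matrix, the input to spectral clustering."""
    return [[ncd(a, b) for b in items] for a in items]

# Placeholder "images": three 784-byte buffers standing in for 28x28 digits.
digits = [b"\x00" * 784, b"\x00" * 392 + b"\xff" * 392, b"\xff" * 784]
M = ncd_matrix(digits)
```

For image data, a compressor that exploits 2-D structure (such as FLIF) gives a more faithful approximation of K than a general-purpose byte compressor, which is the reason for the performance gap reported above.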
To compute the CSF following Section II-C we proceed as follows. For each digit set S we generate Cluster Structure Function (CSF) curves. The digit set S is clustered at different values of K. Random samples of the data forming subsets S′ ⊆ S are chosen iteratively, and statistics of the pointwise CSF curves are formed from the random samples S′. The results here were generated using 1000 random samples of each digit set as follows. For each digit set S compute the pairwise NCD matrix D between all elements of S using the FLIF image compression. For each K in [1, . . ., K_max] use spectral clustering to partition the elements of S into K groups. Each cluster (part) is labeled A_p = {x_1, x_2, . . ., x_{|A_p|}}, where |A_p| is the number of points in cluster A_p. After the points have been clustered for a particular value of K, pick subsets at random from S to form S′ with |S′| = 5K. Using the cluster assignments for each of the randomly selected points, compute the optimality deficiency for each random sample across each of the clusters:

δ(A_p, x_i) = Z(A_p) + log |A_p| − Z(x_i), (V.1)

where Z(A_p) is the size in bytes of the FLIF-compressed image formed by concatenating all of the digit images in S′ belonging to cluster A_p and Z(x_i) is the size in bytes of the compressed image corresponding to digit x_i. We write δ(A_p) = {δ(A_p, x_1), δ(A_p, x_2), . . ., δ(A_p, x_{|A_p|})} to denote the set of optimality deficiencies for each x_i ∈ A_p. After computing δ(A_p) for each cluster from the digit set subsample S′, the results of (V.1) are combined into the related cluster structure function (V.2) by taking, for each cluster, the mean of the deficiencies within one standard deviation of the cluster mean and averaging over the clusters. The final cluster structure function (CSF) curve is then generated using the mean and standard deviation of H_{S′}(K) across all random subsamples S′ and K values. The optimal value of K for such a CSF curve is chosen using the technique proposed in [27], as the first value of K where the CSF curve decreases more than one standard deviation from the previous value. A robust estimator for standard deviation may be useful in identifying the minimum value of the cluster structure function for some applications. Figure 2 shows two example CSF curves. In the left panel, the correct value is obtained at K_pred = K_true. In the right panel of Figure 2, the selected value is obtained at K_pred = 5 and does not match the correct value K_true = 9, although there is a minor decrease at nine for that example.

Fig. 1: Example MNIST handwritten digits.
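The selection rule borrowed from the gap statistic [27] can be sketched as follows. The convention of comparing against the standard deviation at the previous K is an assumption of this sketch, and the curve values are hypothetical:

```python
def select_k(means, stds):
    """First K whose mean CSF value drops more than one standard deviation
    below the mean at K - 1; means[i], stds[i] correspond to K = i + 1."""
    for i in range(1, len(means)):
        if means[i] < means[i - 1] - stds[i - 1]:
            return i + 1          # 1-based number of clusters
    return len(means)             # no sharp decrease found

# Hypothetical curve: a sharp drop between K = 2 and K = 3.
means = [10.0, 9.6, 4.0, 3.9, 3.8]
stds  = [0.5, 0.5, 0.4, 0.4, 0.4]
assert select_k(means, stds) == 3
```

A robust spread estimator (e.g. the median absolute deviation) could be substituted for `stds` without changing the structure of the rule.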
The minimum of the empirical CSF curve is not in itself significant, since the ideal theoretical CSF curve is monotonic non-increasing and the minimum is always 0, attained at k = n (Lemma 2). What makes the minimum possibly meaningful is when it occurs just after a sharp decrease in the CSF curve. The empirical CSF curves of Figure 2 may seem to contradict Lemma 2: the curves roughly follow it, but they are the results of several heuristics, so they may not be perfectly monotonic non-increasing. The heuristics are, among others: approximation from above of the incomputable Kolmogorov complexity, the spectral heuristic of finding the number of clusters rather than inspecting all subsets of the data, and repeated random sampling of subsets S′ ⊆ S, computing the CSF curve of each S′ and taking the average. To identify the number of clusters in the data one takes the number following the sharp decrease of the CSF curve. There the criterion to select the clusters is optimally satisfied.
Table II shows the averages and standard deviations from subsampling digit sets of varying K_true.

Fig. 2: Curves showing mean and standard deviation of the cluster structure function (CSF) for two different digit sets (K_true = 10, left; K_true = 9, right). Subsets are chosen repeatedly from each digit set and clustered into K groups. The value of K is chosen as the first K that is one standard deviation smaller than the previous value. The left curve selects the value K = K_true, correctly identifying the number of different digits in the set. The right curve incorrectly selects K = 5.

Digit sets with K_true = 1 have a higher K_pred value and a much higher standard deviation compared to digit sets with other values of K_true. Omitting digit sets with K_true = 1 significantly increases the correlation between the selected point on the CSF curve and K_true. For the CSF, the correlation r between K_true and K_pred for K_true > 1 is r = 0.93, with a p-value p = 3e−4. For the gap statistic, r = −0.84 (p = 5e−3). Based on that observation, a shallow feedforward neural network was used to map the CSF curves to a predicted value K_pred.
The approach is now to use the 20-element vector composed of the means and standard deviations of the CSF curves evaluated at the numbers of clusters K = [1..10] as a feature vector to identify the optimal value of K. We use one thousand examples each of digit sets from K = [1..10] as training data (ten thousand total digit sets). Using the MATLAB patternnet() classifier with all default parameters, a shallow feedforward neural network with 20 input-layer nodes, 10 hidden-layer nodes and 10 output-layer nodes is trained on these ten thousand digit sets. We classify 100 unknown digit sets. When the classification confidence is low, we repeat the sampling, selecting a new S′ up to 10 times, and average the results to form the prediction. Table III shows the resulting predictions. The vertical axis of the table represents K_true, the horizontal axis represents K_pred. Elements on the diagonal represent correct classifications. Overall accuracy is measured as the percentage of nonzero results that fall on the diagonal of the confusion matrix.
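The construction of the 20-element feature vector can be sketched as follows. The curve values are hypothetical, and this sketch covers only the feature extraction, not the MATLAB patternnet classifier itself:

```python
from statistics import mean, stdev

def csf_features(curves):
    """20-element feature vector: the per-K means followed by the per-K
    standard deviations of CSF curves evaluated at K = 1..10."""
    cols = list(zip(*curves))          # one column per value of K
    return [mean(c) for c in cols] + [stdev(c) for c in cols]

# Two illustrative CSF curves for the same digit set (K = 1..10).
curves = [
    [10.0, 9.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 0.0],
    [11.0, 9.0, 5.0, 4.0, 3.0, 3.0, 3.0, 3.0, 2.0, 0.0],
]
v = csf_features(curves)
assert len(v) == 20
```

Any shallow classifier over this fixed-length vector (e.g. an MLP with 20 inputs and 10 outputs, matching the architecture described above) could then map the curve statistics to K_pred.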

B. Cell Segmentation
Cell segmentation is the identification of individual cells in microscopy images. The identification of cell nuclei in microscopy images is an important problem. Human stem cells (HSCs) are particularly challenging to segment as the cells are highly adherent, naturally forming densely packed colonies. HSC colonies, or groups of touching cells, consist of dividing and differentiating cells that present a wide variety of sizes and shapes. The large morphological variation arises both from the presence of cells in different developmental states and from the mechanical interaction among adjacent cells deforming their shape, texture, and behavior [32], [33]. Time-lapse microscopy of living cells further complicates the problem, requiring reduced imaging energy to lessen phototoxicity and introducing temporal variations due to imaging as well as cell and colony appearance variability. It is much easier to segment cells that all have a similar appearance, for example in shape and size. Here we present a technique for combining multiple simultaneous segmentations of the same image, each with varying underlying segmentation parameters. We refer to the collective set of segmentation results as an ensemble. The segmentations in the ensemble are combined by using the optimality deficiency to select among overlapping segmentations. We use a previously described unsupervised underlying segmentation [32], [34], [33] that takes a single parameter of cell size in µm. The method works as follows. The segmentations are run across a range of expected radius values. The results are combined, with cells that overlap each other placed in common "buckets". The question is then to choose the optimal number of cells K in each bucket. Every segmentation is given a score based on its appearance and how well it captures the underlying pixels.
Here we apply the approach to the question of identifying elliptical cells or nuclei. Rather than using compression-based similarity, the score is built on an appearance model.
Quantitative validation of the ensemble segmentation approach was done using ground truth data from the cell tracking challenge [28] reference datasets. Twelve time-lapse datasets in 2-D and 3-D of live cells were processed using the ensemble segmentation with an empirically selected range of radius parameters. Ground truth scores were obtained for each radius parameter setting run separately and also for the ensemble segmentation. We consider the detection (DET) score here, as our concern is not primarily the accuracy of the pixels assigned to each segmentation, but rather that we detect the correct number of cells in each frame. We use the training movies for validation because our method is unsupervised and training is not required. Our results on these movies are competitive with the supervised algorithms evaluated on the testing challenge datasets. In each of the 12 movies, the ensemble segmentation outperformed the best result selected from segmentations run separately. The results for the optimality-deficiency-based ensemble segmentation were statistically significantly better than the best score obtained from the single-radius segmentation data for both the detection (DET) (p = 5e−4, Wilcoxon paired sign-rank test) and tracking (TRA) (p = 2e−3) scores. This is significant because the best radius result varied even within pairs of movies from the same application type, showing the value of the ensemble segmentation approach. Table IV shows the results for the ensemble classification as well as the best and worst performing individual segmentation for each of the datasets processed here.

C. Synthetic Dataset
We evaluate the performance of the cluster structure function using synthetic data generated as random points from K = 3 different 2-D standard normal distributions, each with covariance Σ = [1, 0; 0, 1]. The K = 3 clusters are positioned along the x-axis at x = [0, r, 2r], with cluster spacing r ∈ [0.5 : 0.25 : 1.5]. In each of 100 trials, 1e4 points are generated from each of the K = 3 distributions. Supplementary Figure 1 shows a histogram of an example synthetic dataset with cluster spacing r = 1.0. To evaluate the cluster structure function, we approximate K(A) − K(x), as in Eqn. II.1, using the Euclidean distance between point x and the centroid of cluster A. As in the examples above, we include only the points that fall within one standard deviation of the centroid for each cluster and then average this result across each cluster. We estimate the value of K using the cluster structure function and compare to results from the Gap Statistic, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) [26]. The CSF performed significantly better than all three alternatives, with the AIC the next closest. The AIC was the only alternative that was competitive with the CSF for this application. Fig. 4 shows results for the CSF and AIC. The good performance of the cluster structure function here follows from the optimality of the Euclidean distance used to estimate K(A) − K(x) as in Eqn. II.1.
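A minimal sketch of this synthetic benchmark and of the Euclidean approximation to K(A) − K(x) (assuming unit-variance clusters, so "within one standard deviation of the centroid" means distance ≤ 1; function names are hypothetical):

```python
import math
import random

def make_synthetic(r, n=1000, k=3, seed=0):
    """Generate n points from each of k unit-variance 2-D normals
    centered along the x-axis at x = 0, r, 2r, ... (the text uses
    k = 3 and n = 1e4 per cluster)."""
    rng = random.Random(seed)
    pts = []
    for i in range(k):
        cx = i * r
        pts += [(rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    return pts

def cluster_score(points, centroid, std=1.0):
    """Approximate K(A) - K(x) for one cluster by the mean Euclidean
    distance to the centroid, restricted to points within one
    standard deviation (distance <= std) of the centroid."""
    dists = [math.dist(p, centroid) for p in points]
    inside = [d for d in dists if d <= std]
    return sum(inside) / len(inside) if inside else 0.0
```

Averaging `cluster_score` across the parts of a candidate partition gives the per-K value of the cluster structure function used in this comparison.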

VI. SOURCE CODE AVAILABILITY
All of the source code used to generate the results in this paper is available open source at https://git-bioimage.coe.drexel.edu/opensource/ncd. This includes MATLAB implementations of the NCD and clustering algorithms. There is also limited support for a Python implementation, with development ongoing. The ensemble segmentation algorithms are available at https://leverjs.net/git.

LEMMA 4. There exists a set S ⊆ {0, 1}^m with |S| = n, where m is a sufficiently large multiple of n, such that H

TABLE I: Confusion matrix for spectral clustering sets of random digits. Each digit set contains 5 of each digit [0,9]. The digit sets are clustered into 10 clusters. For evaluation, each cluster is labeled with the mode (most common element) of the true digit values in that cluster. This was repeated ten thousand times. Overall accuracy of clustering the ten digit classes is 46%.
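The evaluation procedure of Table I can be sketched as follows (a hypothetical helper, not the paper's code): each cluster is labeled with the mode of its members' true digits, and accuracy is the fraction of members matching their cluster's label.

```python
from collections import Counter

def mode_label_accuracy(clusters, true_labels):
    """Score a clustering as in Table I: label each cluster with the
    mode (most common element) of its members' true digit values,
    then count the members that match that label.

    clusters:    list of clusters, each a list of item indices
    true_labels: list mapping item index -> true digit value
    """
    correct = 0
    for members in clusters:
        labels = [true_labels[i] for i in members]
        mode = Counter(labels).most_common(1)[0][0]
        correct += labels.count(mode)
    return correct / len(true_labels)
```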

e_convex(C) = |C| / |C_convex|, where |C| is the area (volume) of segmentation C and |C_convex| is the area of the convex hull of C. The boundary efficiency is computed from the normalized ([0,1]) image pixel values, defined as e_boundary(C) = 1 − mean(R(β(C)) − T(β(C))), where R(β(C)) is the maximal intensity in the region surrounding the boundary voxels β(C), and T(β(C)) is the mean adaptive threshold value for voxels along the boundary. The background efficiency is defined as e_background(C) = mean(I(C) − T(C)) / mean(I(Ĉ) − T(Ĉ)), where I(C) is the source image, T(C) is the adaptive threshold image of segmentation C, and Ĉ represents the image background. The final segmentation score is the sum of the three scores, e_C = e_convex(C) + e_boundary(C) + e_background(C). (V.3) After each cell has been scored, the goal is to select the set of non-overlapping segmentations from the ensemble that maximizes the sum of the individual segmentation scores.
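A minimal sketch of a greedy selection over the scored, overlapping candidates (identifiers and data layout hypothetical; the text's optimization is described as greedy, and this is one straightforward realization):

```python
def greedy_select(scores, conflicts):
    """Greedily pick non-overlapping segmentations: repeatedly take
    the highest-scoring remaining candidate (its e_C from Eqn. V.3)
    and discard every candidate that overlaps it.

    scores:    dict candidate-id -> segmentation score e_C
    conflicts: dict candidate-id -> set of overlapping candidate ids
    """
    chosen = []
    remaining = dict(scores)
    while remaining:
        best = max(remaining, key=remaining.get)
        chosen.append(best)
        # Drop the winner and everything overlapping it.
        for c in conflicts.get(best, set()) | {best}:
            remaining.pop(c, None)
    return chosen
```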

Fig. 3: The ensemble segmentation combines results from different segmentation algorithms using the optimality deficiency to select the best results for overlapping segmentations. Frame segmentations are run at each of a range of different parameter values. The resulting segmentations are each treated as a possible clustering of the underlying pixels into objects. An example is shown here for a single image frame taken from a 1200-frame movie showing the development of live human stem cells (HSCs). The top row shows a raw image (a), the final segmentation results (b), and the overlapping ensemble regions (c). The bottom two rows show different possible combinations of segmentation results from the region shown in the rectangle in (a) and (c). The segmentation results are scored from worst (lowest score) to best (highest score). The optimal set of segmentation results is selected using a greedy optimization to maximize the scores in each overlapping region. Segmentation scores are generated from the convexity, boundary, and background efficiencies.
H_S(k) is based on the partition π ∈ Π_k that minimizes the sum of the bandwidths of the parts in a k-partition of S.

TABLE II: Unsupervised cluster structure function (CSF) (left) and Gap Statistic (right) estimates of the number of unique digits K in an MNIST digit set. Both CSF and Gap Statistic predictions K_pred are correlated with K_true except in the case K = 1 (where both exhibit much higher standard deviation). Omitting K_true = 1, the CSF correlation is 0.93 (p = 3e−4) and the Gap Statistic correlation is −0.84 (p = 5e−3). We used the same procedure on the mean and standard deviation values obtained from the Gap Statistic (as in Table II) and obtained an accuracy of 54% [0.51, 0.57].

TABLE III: Supervised cluster structure function (CSF) (left) and Gap Statistic (right) estimates of the number of unique digits K in an MNIST digit set. Each digit set contains 100 digits, split equally among the K digit classes. The algorithm is given a digit-set sampler that can pull repeatedly from the same distribution (K value), with the goal of estimating K. The results here were generated by classifying one hundred digit sets for each K_true ∈ [1..10]. A 20-element vector consisting of the means and standard deviations of the CSF (respectively the Gap Statistic) was the input to a shallow feed-forward neural network. Overall accuracy was 86% [0.84, 0.88] for the CSF and 54% [0.51, 0.57] for the Gap Statistic.
This is equivalent to selecting the H_S(k) from Equation II.2, where the δ(A, x) in Equation II.1 are approximated by the individual cell segmentation scores. Figure