Generalized Identifiability Bounds for Mixture Models With Grouped Samples

Recent work has shown that finite mixture models with $m$ components are identifiable, while making no assumptions on the mixture components, so long as one has access to groups of samples of size $2m-1$ which are known to come from the same mixture component. In this work we generalize that result and show that, if every subset of $k$ mixture components of a mixture model is linearly independent, then that mixture model is identifiable with only $(2m-1)/(k-1)$ samples per group. We further show that this value cannot be improved. We prove an analogous result for a stronger form of identifiability known as "determinedness," along with a corresponding lower bound. This independence assumption almost surely holds if mixture components are chosen randomly from a $k$-dimensional space. We describe some implications of our results for multinomial mixture models and topic modeling.


Introduction
Finite mixture models have seen extensive use in statistics and machine learning. In a finite mixture model one assumes that samples are drawn according to a two-step process. First an unobserved mixture component, $\mu$, is randomly selected according to a probability measure over probability measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ ($\delta$ is the Dirac measure). Next an observed sample $X$ is drawn from $\mu$, $X \sim \mu$. A central question in mixture modeling theory is that of identifiability (Teicher, 1963): whether $\mathcal{P}$ is uniquely determined from the distribution of $X$. From the law of total probability it follows that $X$ is distributed according to $\sum_{i=1}^m a_i \mu_i$. Excepting trivial cases, a mixture model is not identifiable unless one makes additional assumptions about the mixture components. A standard assumption is that the mixture components $\mu_1, \ldots, \mu_m$ are elements of some parametric class of densities. A common choice for this class is the set of multivariate Gaussian distributions, which yields the well-known and frequently used Gaussian mixture model. This model is indeed known to be identifiable (Anderson et al., 2014; Bruni and Koch, 1985; Yakowitz and Spragins, 1968). A natural question to ask is whether it is possible for a mixture model to be identifiable without such parametric assumptions.
In Vandermeulen and Scott (2019) the authors consider an alternative setting for mixture modeling where no assumptions are made on the mixture components $\mu_1, \ldots, \mu_m$ and, instead of having access to one sample from each unobserved mixture component $\mu \sim \mathcal{P}$, one has access to groups of $n$ samples, $X = (X_1, \ldots, X_n)$, which are known to be independently sampled from $\mu$, i.e. $X_1, \ldots, X_n \overset{iid}{\sim} \mu$. In that work the authors develop several fundamental bounds relating the identifiability of $\mathcal{P}$ to the number of samples per group $n$ and the number of mixture components $m$. These bounds consider extremal cases where there are either no assumptions on the mixture components or they are assumed to be linearly independent. If the mixture components lie in a finite-dimensional space, such as when the sample space is finite, then it is reasonable to assume that the collection of all mixture components is linearly dependent while sufficiently small subsets of the mixture components are linearly independent.
In this paper we prove two fundamental bounds relating the identifiability of $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ to the number of mixture components $m$, the number of samples per group $n$, and a value $k$ which describes the degree of linear independence of the mixture components. We show that if every subset of $k$ measures in $\mu_1, \ldots, \mu_m$ is linearly independent, then $\mathcal{P}$ is the simplest mixture, in terms of the number of mixture components, yielding the distribution on $X$ if $2m - 1 \le (k-1)n$.
If $n$ is even-valued and $2m - 2 \le (k-1)(n-1)$ then $\mathcal{P}$ is the only mixture, with any number of components, yielding the distribution on $X$. We furthermore show that the first bound is tight and that the second bound is nearly tight. Most of the bounds in Vandermeulen and Scott (2019) are special cases of the bounds presented in this paper. We also show that this linear independence assumption occurs naturally, similarly to results in Kargas et al. (2018), and describe some practical implications of our results.
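As a quick sanity check, the two bounds above can be turned into a small calculator for the minimal group size. This is our own illustrative helper, not code from the paper; `min_group_size_determined` also rounds up to the even values of $n$ covered by the determinedness result.

```python
import math

def min_group_size_identifiable(m: int, k: int) -> int:
    # Smallest n with 2m - 1 <= (k - 1) * n.
    assert m >= k >= 2
    return math.ceil((2 * m - 1) / (k - 1))

def min_group_size_determined(m: int, k: int) -> int:
    # Smallest even n with 2m - 2 <= (k - 1) * (n - 1).
    assert m >= k >= 2
    n = math.ceil((2 * m - 2) / (k - 1)) + 1
    return n if n % 2 == 0 else n + 1
```

Setting $k = 2$ recovers the classical $2m - 1$ samples-per-group requirement, while $k = m$ (linearly independent components) gives $n = 3$.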

Background
We introduce the mathematical setting used in the rest of the paper before reviewing existing results.

Problem Setting
The setting described here is drawn from Vandermeulen and Scott (2019) and is highly general. Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mathcal{D}$ be the space of probability measures on $(\Omega, \mathcal{F})$. Note that $\mathcal{D}$ is contained in the vector space of finite signed measures on $(\Omega, \mathcal{F})$, a fact which we will use often. For an element $\gamma$, let $\delta_\gamma$ denote the Dirac measure at $\gamma$. We equip $\mathcal{D}$ with the power set $\sigma$-algebra. We call a measure $\mathcal{P}$ on $\mathcal{D}$ a mixture of measures if it is a measure on $\mathcal{D}$ of the form $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ with $a_i > 0$, $\sum_{i=1}^m a_i = 1$, and $m < \infty$. We will always assume that the representation of a mixture of measures has minimal $m$, i.e. there are no repeated $\mu_i$ in the summands. For a full technical treatment of the concept of minimal representation see Vandermeulen and Scott (2019). We refer to the measures $\mu_1, \ldots, \mu_m$ as mixture components. We now introduce the model we wish to investigate in this paper, which is termed the grouped sample setting in Vandermeulen and Scott (2019). If we let $\mu \sim \mathcal{P}$ and $X_1, \ldots, X_n \overset{iid}{\sim} \mu$ then the probability distribution for $X = (X_1, \ldots, X_n)$ is $\sum_{i=1}^m a_i \mu_i^{\times n}$, (1) where $\mu^{\times n}$ denotes the $n$-fold product measure. With this in mind we introduce the following operator: $V_n(\mathcal{P}) = \sum_{i=1}^m a_i \mu_i^{\times n}$. To give some concreteness to this setting it can be helpful to consider the application of topic modeling with a finite number of topics. Here $\mu_1, \ldots, \mu_m$ are topics, which are simply distributions over words. The measure $\mathcal{P}$ designates topic $\mu_i$ being chosen with probability $a_i$. The group of samples $(X_1, \ldots, X_n)$ represents a document containing $n$ words as a bag of words. A collection of documents $X_1, X_2, \ldots$ are then iid samples of $V_n(\mathcal{P})$. In this setting we are interested in the number of words necessary per document to recover the true topic model $\mathcal{P}$.
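The two-step generative process just described can be sketched in a few lines; the toy topics and weights below are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy topic model: m = 2 topics over a vocabulary of d = 3 words.
topics = np.array([[0.6, 0.3, 0.1],    # mu_1
                   [0.1, 0.2, 0.7]])   # mu_2
weights = np.array([0.5, 0.5])         # mixing weights a_i

def sample_document(n: int) -> np.ndarray:
    """One group (X_1, ..., X_n): draw a topic mu ~ P, then n iid words from mu."""
    i = rng.choice(len(weights), p=weights)
    return rng.choice(topics.shape[1], size=n, p=topics[i])

# A corpus of iid samples from V_n(P), with n = 5 words per document.
docs = [sample_document(5) for _ in range(1000)]
```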
We will be investigating two forms of identifiability: $n$-identifiability, where $\mathcal{P}$ is the simplest mixture of measures, in terms of the number of mixture components, yielding the distribution on $(X_1, \ldots, X_n)$, and $n$-determinedness, where $\mathcal{P}$ is the only mixture of measures yielding the distribution on $(X_1, \ldots, X_n)$. We finish this section with the following two definitions, which capture these two notions of identifiability.
Definition 2.1 A mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ is $n$-identifiable if there exists no mixture of measures $\mathcal{P}' \neq \mathcal{P}$ with $m$ or fewer components such that $V_n(\mathcal{P}') = V_n(\mathcal{P})$.
Definition 2.2 A mixture of measures $\mathcal{P}$ is $n$-determined if there exists no mixture of measures $\mathcal{P}' \neq \mathcal{P}$ such that $V_n(\mathcal{P}') = V_n(\mathcal{P})$.

Previous Results
Here we recall several results from Vandermeulen and Scott (2019). In that paper the authors prove five bounds relating identifiability or determinedness to the geometry of the mixture components, the number of mixture components $m$, and the number of samples per group $n$. For brevity we have summarized these bounds in Table 1. Vandermeulen and Scott (2019) showed that none of these bounds are improvable via matching lower bounds. For clarity we include an example of a precise statement of an entry in this table.
Theorem 1 (Table 1 Row Four or Vandermeulen and Scott (2019) Theorem 4.6) Let $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ be a mixture of measures where $\mu_1, \ldots, \mu_m$ are linearly independent. Then $\mathcal{P}$ is 4-determined.

The last row of Table 1 contains a property known as joint irreducibility, which was introduced in Blanchard and Scott (2014). A collection of probability measures $\mu_1, \ldots, \mu_m$ is jointly irreducible when all probability measures in the linear span of $\mu_1, \ldots, \mu_m$ lie in the convex hull of $\mu_1, \ldots, \mu_m$. We do not use joint irreducibility anywhere else in this paper; however, we note that it is a property that is stronger than linear independence.
For completeness we also include the following lemmas from Vandermeulen and Scott (2019) that demonstrate the unsurprising fact that $n$-identifiability and $n$-determinedness are, in some sense, monotonic. Each lemma encapsulates two statements, one concerning identifiability and one concerning determinedness, which we have combined for brevity.
Lemma 2.1 If a mixture of measures is n-identifiable (determined) then it is q-identifiable (determined) for all q > n.
Lemma 2.2 If a mixture of measures is not n-identifiable (determined) then it is not q-identifiable (determined) for any q < n.
Finally, Vandermeulen and Scott (2019) Lemma 7.1 showed that if the sample space $\Omega$ is finite, the grouped sample setting is equivalent to a multinomial mixture model where $n$ is the number of trials and $\mu_1, \ldots, \mu_m$ are the categorical distributions for each component. One may therefore consider the grouped sample setting to be a generalized version of the multinomial mixture model.
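Since the $n$ samples in a group are iid given the component, only their counts matter when $\Omega$ is finite; collapsing a group to its count vector gives exactly a draw from an $n$-trial multinomial mixture. A minimal sketch of this equivalence (the helper name is ours):

```python
import numpy as np

def group_to_counts(group, d: int) -> np.ndarray:
    """Collapse a group (X_1, ..., X_n) over a sample space of size d into the
    count vector of an n-trial multinomial observation."""
    return np.bincount(np.asarray(group), minlength=d)

counts = group_to_counts([0, 2, 2, 1, 0], d=3)   # n = 5 trials over {0, 1, 2}
```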

Related Work
A significant amount of work regarding the grouped sample setting has focused on the setting where $\Omega$ is finite, which is equivalent to a multinomial mixture model. Some of the earliest work on identifiability was done on binomial mixture models, with Teicher (1963) demonstrating that binomial mixture models are identifiable if the number of trials $n$ and the number of mixture components $m$ satisfy $n \ge 2m - 1$. These results were extended to multinomial mixture models in Kim (1984) and Elmore and Wang (2003). Rabani et al. (2014) and Vandermeulen and Scott (2019) introduced estimators for the multinomial components when the $n \ge 2m - 1$ bound is met. Turning to the continuous setting, the paper Ritchie et al. (2020) introduces a method for recovering mixture components in the grouped sample setting when the components are densities on some Euclidean space. That method is furthermore guaranteed to asymptotically recover the components whenever the mixture model is identifiable. In Wei and Nguyen (2020) the authors consider the grouped sample setting where the mixture components come from some parametric class of densities and provide results for identifiability and rates of convergence. Other works have investigated continuous nonparametric mixture models, without assuming the grouped sample setting, by assuming a clustering structure (Dan et al., 2018; Aragam et al., 2020; Vankadara et al., 2021; Aragam and Yang, 2021; Aragam and Tai, 2022; Kivva et al., 2022).
Identifiability with linearly independent components given n ≥ 3 was first established in Allman et al. (2009) by way of Kruskal's (Factorization) Theorem (Kruskal, 1977).A spectral algorithm for the estimation of models with linearly independent components can be found in Anandkumar et al. (2014).This algorithm also has a nonparametric adaptation (Song et al., 2014).
The grouped sample setting can be considered as a special case of a finite exchangeable sequence (Kallenberg, 2005). A sequence of random variables $\xi_1, \xi_2, \ldots$ is exchangeable if $(\xi_1, \ldots, \xi_m)$ is equal in distribution to $(\xi_{k_1}, \ldots, \xi_{k_m})$ for every distinct subsequence $\xi_{k_1}, \ldots, \xi_{k_m}$. For an infinite sequence de Finetti's Theorem tells us that (for Borel spaces) one can always decompose the distribution of the sequence to be independent, conditioned on some other random variable, in a way akin to (1), though this random variable is not necessarily discrete as in (1). This theorem does not extend to finite sequences, but there have been works investigating the grouped sample setting for continuous mixtures. In Vinayak et al. (2019) the authors present rates for estimating a continuous version of $\mathcal{P}$ in the context of binomial mixture models.
A generalization of Kruskal's Theorem for $d$-way arrays can be found in Sidiropoulos and Bro (2000) and is related to the techniques we use in the proof of Theorem 4.1. The proof technique we use in Theorem 4.2 is completely novel, so far as we know. As far as the contributions of this paper are concerned, Theorems 4.1, 4.3, and 4.4 are natural extensions of the grouped sample results, using techniques from Vandermeulen and Scott (2019), to the independence setting considered in Sidiropoulos and Bro (2000). The determinedness result in Theorem 4.2 required the development of a new proof strategy.

Main Results
In this section we present the main results of this paper. They are related to a property which we call k-independence.
The concept of k-independence is simply a generalization of Kruskal rank (Kruskal, 1977) to vector spaces: a sequence of vectors $x_1, \ldots, x_m$ is $k$-independent if every subsequence of $k$ of the vectors is linearly independent. We define k-independence using a sequence rather than a set of vectors so as to relate it to a matrix rank, since a matrix can have repeated columns. When $x_1, \ldots, x_m$ are distinct (as will be the case in our main theorems) we can define k-independence simply using sets and subsets. We now present the main results of this paper.
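For finite-dimensional vectors, k-independence is exactly the Kruskal rank of the matrix having those vectors as columns, which can be computed by brute force. The checker below is an illustration of the definition, not an algorithm from the paper.

```python
import itertools
import numpy as np

def kruskal_rank(M: np.ndarray) -> int:
    """Largest k such that every set of k columns of M is linearly independent."""
    m = M.shape[1]
    k = 0
    for j in range(1, m + 1):
        if all(np.linalg.matrix_rank(M[:, list(c)]) == j
               for c in itertools.combinations(range(m), j)):
            k = j
        else:
            break
    return k

# Pairwise-independent but coplanar columns: 2-independent, not 3-independent.
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
```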
For Theorem 4.1 we must omit the case where $m = 1$ since this would imply that $k = 1$, which results in the inequality "$1 \le 0$" in the theorem statement. Note that any mixture containing only one component is trivially 1-identifiable, which is accounted for by row one in Table 1.
The following theorem demonstrates that Theorem 4.1 cannot be improved for any values of $m$, $n$, or $k$ not satisfying $2m - 1 \le (k-1)n$.
Theorem 4.3 For all $m \ge k \ge 2$ and $n$ with $2m - 1 > (k-1)n$ there exists a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ where $\mu_1, \ldots, \mu_m$ are $k$-independent and $\mathcal{P}$ is not $n$-identifiable.

For determinedness we have a similar bound, but it is a bit loose.
Theorem 4.4 For all $m \ge k \ge 2$ and $n$ with $2m > (k-1)n$ there exists a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ where $\mu_1, \ldots, \mu_m$ are $k$-independent and $\mathcal{P}$ is not $n$-determined.

Comparison to Previous Results
The results in Section 4 are quite general and contain four of the five bounds from Vandermeulen and Scott (2019) as special cases. Since no pair of distinct probability measures is collinear, the components of any mixture of measures with at least two mixture components are 2-independent. Setting $k = 2$ in Theorems 4.1 and 4.2 gives us the first two bounds from Table 1 (noting that $n$ is always even for the determinedness result). If a collection of $m$ vectors is linearly independent then it is $m$-independent. Setting $k = m$ in Theorem 4.1 we have $2m - 1 \le (m-1)n$, with the minimal $n$ satisfying this being 3, which yields row 3 in Table 1. The analogous determinedness bound on row 4 can similarly be derived from Theorem 4.2, and the smallest even-valued $n$ satisfying this bound is 4. As a final point we remark that, in contrast to previous results, Theorems 4.1 and 4.2 imply that, when $n = 3$ or $n = 4$ respectively, it is possible to have identifiability/determinedness without linearly independent components; for example by setting $n = 4$, $k = 7$, and $m = 10$ for the determinedness case.
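The closing example can be checked by direct arithmetic; the helper below simply encodes the determinedness condition of Theorem 4.2 as stated above.

```python
def determinedness_bound_holds(m: int, k: int, n: int) -> bool:
    # Theorem 4.2's condition: n even and 2m - 2 <= (k - 1)(n - 1).
    return n % 2 == 0 and 2 * m - 2 <= (k - 1) * (n - 1)

# n = 4, k = 7, m = 10: determined even though m > k, i.e. the ten
# components need not be linearly independent.
example_ok = determinedness_bound_holds(m=10, k=7, n=4) and 10 > 7
```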

Applications
The grouped sample setting occurs naturally in many problem settings including group anomaly detection (Muandet and Schölkopf, 2013), transfer learning (Blanchard et al., 2011), and distribution regression/classification (Póczos et al., 2013; Szabó et al., 2016). In these settings one has access to groups of samples $X_1, \ldots, X_N$ with $X_i = (X_{i,1}, \ldots, X_{i,n})$. Mathematical analyses of techniques in this setting typically assume $n \to \infty$. The study of such problems for fixed $n$ is less explored, and the results here can help give some intuition for the learnability of this setting.
Our results are particularly relevant when the samples $X_{i,j}$ lie in a finite sample space, $|\Omega| = d < \infty$. When $|\Omega| = \infty$ one could convert samples to a finite sample space by assigning them to a set of $d$ prototypes. In this setting it is natural to assume that the mixture components are $d$-independent due to the following proposition. This fact is particularly relevant for topic modeling where $d$, the number of words in a vocabulary, can be large and estimation can be difficult. A straightforward way to fix this is to assign words to $d' < d$ clusters, perhaps using a vector word embedding (Mikolov et al., 2013), thereby coarsening the event space. To recover $m$ topics we should have $d' \le m$ and satisfy $2m - 1 \le (d' - 1)n$, where $n$ is the number of words per document. To test whether a corpus could potentially contain more topics than a proposed topic model with $m$ topics, we would need that $2m - 2 \le (d' - 1)(n - 1)$. These results are also useful for other discrete clustering problems (Portela, 2008).
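A hedged sketch of the cluster-count calculation described above: given $m$ topics and $n$ words per document, it searches for the smallest number of word clusters $d'$ satisfying $d' \le m$ and $2m - 1 \le (d' - 1)n$. The function is ours, for illustration.

```python
def min_vocab_clusters(m: int, n: int):
    """Smallest d' with 2 <= d' <= m and 2m - 1 <= (d' - 1) * n, or None if no
    d' in that range satisfies the identifiability bound."""
    for d_prime in range(2, m + 1):
        if 2 * m - 1 <= (d_prime - 1) * n:
            return d_prime
    return None
```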

Proofs
This section contains proofs of the results in Section 4 and supporting lemmas. Proofs omitted in this section can be found in Appendix A. The symbol $\otimes$ represents the tensor product when applied to elements of a Hilbert space. For a natural number $N$, $[N]$ is defined to be $\{1, 2, \ldots, N\}$. To streamline the presentation of our main theorems we first introduce the mathematical tools we will be using.
The following lemma is not particularly novel, but we will be using it quite extensively without reference so we include a statement of it here.
Lemma 5.1 Let $x_1, \ldots, x_m$ be nonzero vectors in an inner product space. Then $x_1, \ldots, x_m$ are linearly independent iff there exist vectors $z_1, \ldots, z_m$ such that $\langle x_i, z_i \rangle \neq 0$ for all $i$ and $\langle x_i, z_j \rangle = 0$ for all $i \neq j$.
The next lemma serves as something of a workhorse in our proofs.
Lemma 5.2 Let $x_1, \ldots, x_m$ be vectors in a Hilbert space which are $k$-independent with $k \ge 2$. Then $x_1^{\otimes n}, \ldots, x_m^{\otimes n}$ are $\min(n(k-1)+1, m)$-independent.

Proof We will first consider the case where $n(k-1) + 1 = m$. We can relabel the vectors $x_1, \ldots, x_{n(k-1)+1}$ as $x$ and $x_{i,j}$ where $(i,j) \in [n] \times [k-1]$. By $k$-independence, for all $i$ there exists a vector $z_i$ such that $\langle z_i, x \rangle = 1$ and $\langle z_i, x_{i,j} \rangle = 0$ for all $j$. From this we have that $\langle x^{\otimes n}, z_1 \otimes \cdots \otimes z_n \rangle = \prod_{i=1}^n \langle x, z_i \rangle = 1$ and $\langle x_{i,j}^{\otimes n}, z_1 \otimes \cdots \otimes z_n \rangle = \prod_{l=1}^n \langle x_{i,j}, z_l \rangle = 0$ for all $i, j$, since the factor with $l = i$ vanishes. Because the relabeling is arbitrary it follows that for all $i \in [n(k-1)+1]$ there exists $z_i'$ such that $\langle x_i^{\otimes n}, z_i' \rangle = 1$ and $z_i' \perp x_j^{\otimes n}$ for all $j \neq i$. Thus we have that $x_1^{\otimes n}, \ldots, x_m^{\otimes n}$ are $m$-independent. We will now consider two other cases for the value of $m$. For $m < n(k-1)+1$ we can show that $x_1^{\otimes n}, \ldots, x_m^{\otimes n}$ are linearly independent by the same argument. If $m > n(k-1)+1$ then it follows from the $m = n(k-1)+1$ case that every subsequence of length $n(k-1)+1$ of $x_1^{\otimes n}, \ldots, x_m^{\otimes n}$ is linearly independent, so it follows that $x_1^{\otimes n}, \ldots, x_m^{\otimes n}$ is $(n(k-1)+1)$-independent.
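Lemma 5.2 can be illustrated numerically in a small case (our own sketch, with $k = 2$, $n = 3$, $m = 5$): generic vectors in $\mathbb{R}^2$ are 2-independent, and the lemma predicts their 3-fold tensor powers are $\min(3(2-1)+1, 5) = 4$-independent. They cannot be 5-independent, since the cubes live in the 4-dimensional space of symmetric 3-tensors over $\mathbb{R}^2$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

k, n, m = 2, 3, 5
X = rng.standard_normal((k, m))     # generic columns in R^2: 2-independent a.s.

# n-fold tensor powers x^{(x) n}, flattened to columns of length k**n = 8.
T = np.stack([np.multiply.outer(np.multiply.outer(x, x), x).ravel()
              for x in X.T], axis=1)

target = min(n * (k - 1) + 1, m)    # Lemma 5.2's predicted independence level
every_target_subset_independent = all(
    np.linalg.matrix_rank(T[:, list(c)]) == target
    for c in itertools.combinations(range(m), target))
```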
The following lemma's proof is very similar to the proof of Lemma 5.2, but we defer it to Appendix A due to its length. Note that it recovers Lemma 5.2 by setting $k' = k$.
To prove Theorem 4.1 we will use the following slight adaptation of Kruskal's Theorem.
Theorem 5.1 (Hilbert space extension of Kruskal (1977)) Let $x_1, \ldots, x_r$, $y_1, \ldots, y_r$, and $z_1, \ldots, z_r$ be elements of three Hilbert spaces $H_x$, $H_y$, and $H_z$ such that $x_1, \ldots, x_r$ are $r_x$-independent, with $r_y$ and $r_z$ defined similarly. Further suppose that $r_x + r_y + r_z \ge 2r + 2$. If $a_1, \ldots, a_l \in H_x$, $b_1, \ldots, b_l \in H_y$, and $c_1, \ldots, c_l \in H_z$ with $r \ge l$ are such that $\sum_{i=1}^r x_i \otimes y_i \otimes z_i = \sum_{i=1}^l a_i \otimes b_i \otimes c_i$, then $l = r$ and there exists a permutation $\sigma: [r] \to [r]$ and scalars $\lambda_i, \eta_i, \nu_i$ with $\lambda_i \eta_i \nu_i = 1$ such that $a_{\sigma(i)} = \lambda_i x_i$, $b_{\sigma(i)} = \eta_i y_i$, and $c_{\sigma(i)} = \nu_i z_i$ for all $i$.

The following three lemmas allow us to embed general measures in Hilbert spaces and will allow us to use tools from Hilbert space theory (Kadison and Ringrose, 1983).
Lemma 5.4 (Vandermeulen and Scott (2019) Lemma 6.2) Let $\gamma_1, \ldots, \gamma_n$ be finite measures on a measurable space $(\Psi, \mathcal{G})$. There exists a finite measure $\pi$ and nonnegative functions $f_1, \ldots, f_n \in L^1(\Psi, \mathcal{G}, \pi) \cap L^2(\Psi, \mathcal{G}, \pi)$ such that, for all $i$ and all $B \in \mathcal{G}$, $\gamma_i(B) = \int_B f_i \, d\pi$.

The last lemma will be used in particular to embed collections of probability measures in a joint measure space as pdfs.
Lemma 5.5 (Vandermeulen and Scott (2019) Lemma 6.3) Let $(\Psi, \mathcal{G})$ be a measurable space, $\gamma$ and $\pi$ a pair of finite measures on that space, and $f$ a nonnegative function in $L^1(\Psi, \mathcal{G}, \pi)$ such that, for all $A \in \mathcal{G}$, $\gamma(A) = \int_A f \, d\pi$. Then for all $n$ and all $B \in \mathcal{G}^{\times n}$ we have $\gamma^{\times n}(B) = \int_B f^{\otimes n} \, d\pi^{\times n}$.

Lemma 5.6 (Vandermeulen and Scott (2019) Lemma 5.2) Let $(\Psi, \mathcal{G}, \gamma)$ be a measure space. There exists a unitary transform $U: \bigotimes_{i=1}^n L^2(\Psi, \mathcal{G}, \gamma) \to L^2(\Psi^n, \mathcal{G}^n, \gamma^{\times n})$ with $U(f_1 \otimes \cdots \otimes f_n)(x_1, \ldots, x_n) = \prod_{i=1}^n f_i(x_i)$. Here and elsewhere note that powers of a $\sigma$-algebra utilize the standard product $\sigma$-algebra. Finally we remind the reader of the following standard result from real analysis.

Lemma 5.7 Let $f, g \in L^1(\Psi, \mathcal{G}, \pi)$. If $\int_B f \, d\pi = \int_B g \, d\pi$ for all $B \in \mathcal{G}$, then $f = g$ $\pi$-almost everywhere.
For the rest of the paper we will leave the "almost everywhere" qualifier implicit. We can now prove the main theorems in Section 4.

Proof of Theorem 4.1 Let $\mathcal{Q} = \sum_{i=1}^l b_i \delta_{\nu_i}$ be a mixture of measures with $l \le m$ such that $V_n(\mathcal{P}) = V_n(\mathcal{Q})$, i.e. $\sum_{i=1}^m a_i \mu_i^{\times n} = \sum_{j=1}^l b_j \nu_j^{\times n}$. From Lemma 5.4 there exists a measure $\xi$ and nonnegative functions $p_1, \ldots, p_m, q_1, \ldots, q_l \in L^1(\xi) \cap L^2(\xi)$ such that, for all measurable $A$ and all $i$, $\mu_i(A) = \int_A p_i \, d\xi$ and $\nu_i(A) = \int_A q_i \, d\xi$. From Lemmas 5.5 and 5.7 we have that $\sum_{i=1}^m a_i p_i^{\otimes n} = \sum_{j=1}^l b_j q_j^{\otimes n}$. (2) From the theorem hypothesis we know that $m \ge 2$ and, trivially, $k \le m$, so we have $n \ge \frac{2m-1}{k-1} \ge \frac{2m-1}{m-1} > 2$, i.e. $n \ge 3$. Because $n \ge 3$ it is always possible to decompose $n = n_1 + n_2 + n_3$ where the $n_i$ are all positive integers.
We will now prove the following claim, which we will denote "†": if the sequences $p_1^{\otimes n_i}, \ldots, p_m^{\otimes n_i}$ (for all $i \in [3]$) are $k_i$-independent respectively and $k_1 + k_2 + k_3 \ge 2m + 2$, then the theorem conclusion follows. From (2) we have that $\sum_{i=1}^m \left(a_i p_i^{\otimes n_1}\right) \otimes p_i^{\otimes n_2} \otimes p_i^{\otimes n_3} = \sum_{j=1}^l \left(b_j q_j^{\otimes n_1}\right) \otimes q_j^{\otimes n_2} \otimes q_j^{\otimes n_3}$. (3) From Theorem 5.1 we have that $l = m$ and there exist $D_1, D_2, D_3 \in \mathbb{R}^m$ and a permutation $\sigma: [m] \to [m]$ such that $b_{\sigma(i)} q_{\sigma(i)}^{\otimes n_1} = D_{1,i} a_i p_i^{\otimes n_1}$, $q_{\sigma(i)}^{\otimes n_2} = D_{2,i} p_i^{\otimes n_2}$, and $q_{\sigma(i)}^{\otimes n_3} = D_{3,i} p_i^{\otimes n_3}$, where $D_{1,i} D_{2,i} D_{3,i} = 1$ for all $i$. Applying Lemma 5.6 to (3) we have that, for all $i$, $\nu_{\sigma(i)}^{\times n_2} = D_{2,i} \mu_i^{\times n_2}$; evaluating both sides on the whole space gives $1 = D_{2,i}$, so $D_2$ is a vector of ones, and so is $D_3$ by the same argument. We have that $D_1$ is also a vector of ones since $D_{1,i} D_{2,i} D_{3,i} = 1$ for all $i$. Thus we have that $p_i = q_{\sigma(i)}$ for all $i$. Assuming that $\sigma$ is the identity mapping, it follows that $a_i = b_i$, and we have shown †.

Now that † has been demonstrated, to finish the proof we will show that we can decompose $n = n_1 + n_2 + n_3$ such that $p_1^{\otimes n_i}, \ldots, p_m^{\otimes n_i}$ are $k_i$-independent for each $i$ with $k_1 + k_2 + k_3 \ge 2m + 2$, which will finish our proof. To continue we will split into the cases where $n \bmod 3$ is 0, 1, or 2.

Case 0: We have that $n = 3n'$ for some positive integer $n'$ and we can let $n_1 = n_2 = n_3 = n'$. From Lemma 5.2 we know that $p_1^{\otimes n'}, \ldots, p_m^{\otimes n'}$ are $\min((k-1)n' + 1, m)$-independent, so $k_1 = k_2 = k_3 = \min((k-1)n' + 1, m)$. If this minimum is $m$ then $k_1 + k_2 + k_3 = 3m \ge 2m + 2$ since $m \ge 2$; otherwise $k_1 + k_2 + k_3 = 3((k-1)n' + 1) = (k-1)n + 3 \ge (2m-1) + 3 = 2m + 2$ by the theorem hypothesis.

Case 1: Here we have that $n = 3n' + 1$ and we let $n_1 = n_2 = n'$ and $n_3 = n' + 1$. From Lemma 5.2 we have that $k_1 = k_2 = \min(m, (k-1)n' + 1)$ and $k_3 = \min(m, (k-1)(n'+1) + 1)$. If we have that $\min(m, (k-1)n' + 1) = m$ then it follows that $\min(m, (k-1)(n'+1) + 1) = m$ and $k_1 + k_2 + k_3 = 3m \ge 2m + 2$. If we have that $\min(m, (k-1)n' + 1) = (k-1)n' + 1$ and $\min(m, (k-1)(n'+1) + 1) = (k-1)(n'+1) + 1$ then we have that $k_1 + k_2 + k_3 = (k-1)(3n' + 1) + 3 = (k-1)n + 3 \ge 2m + 2$ by the theorem hypothesis.

Case 2: Here $n = 3n' + 2$ and we let $n_1 = n'$ and $n_2 = n_3 = n' + 1$; the argument is analogous to Case 1.
Proof Sketch of Theorems 4.3 and 4.4 Theorem 4.2 in Vandermeulen and Scott (2019) states that, for all $m \ge 2$, there exists a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ which is not $(2m-2)$-identifiable. Therefore there exists a mixture of measures $\mathcal{Q} = \sum_{i=1}^l b_i \delta_{\nu_i}$ with $\mathcal{Q} \neq \mathcal{P}$ and $l \le m$ such that $V_{2m-2}(\mathcal{P}) = V_{2m-2}(\mathcal{Q})$. Since any pair of distinct probability measures is linearly independent, it follows that any collection of distinct probability measures is 2-independent. Using this fact we can adapt Lemma 5.2 to show that $\mu_1^{\times(k-1)}, \ldots, \mu_m^{\times(k-1)}$ are $k$-independent. Letting $\mathcal{P}' = \sum_{i=1}^m a_i \delta_{\mu_i^{\times(k-1)}}$ and $\mathcal{Q}' = \sum_{i=1}^l b_i \delta_{\nu_i^{\times(k-1)}}$, we have that $V_n(\mathcal{P}') = V_n(\mathcal{Q}')$ and we are done. The proof of Theorem 4.4 is virtually identical and follows from Vandermeulen and Scott (2019) Theorem 4.4.

Proof of Theorem 4.2 Since the theorem hypothesis assumes $m \ge 2$, the theorem is vacuously true for the $n = 2$ case. To finish the proof we will show that the theorem holds for $n = 4$ and then proceed by induction.

Base Step: Let $n = 4$. We will proceed by contradiction and assume that there exist $k \ge 2$ and a collection of $k$-independent probability measures $\mu_1, \ldots, \mu_m$ such that there exist a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ and a mixture of measures $\mathcal{Q} = \sum_{i=1}^l b_i \delta_{\nu_i}$ with $\mathcal{P} \neq \mathcal{Q}$ and $V_4(\mathcal{P}) = V_4(\mathcal{Q})$. From the theorem hypothesis that $m \ge 2$ it follows that $k \ge 2$ and $k - 2 \ge 0$. Applying this bound to our theorem hypothesis with $n = 4$ we have that $2m - 1 = (2m - 2) + 1 \le 3(k-1) + (k-1) = 4(k-1)$. Since we have that $2m - 1 \le (k-1)n$ we can apply Theorem 4.1, and thus $l > m$. We will proceed analogously to the proof of Theorem 4.1 and embed the measures in a Hilbert space as before, giving $\sum_{i=1}^m a_i p_i^{\otimes 4} = \sum_{j=1}^l b_j q_j^{\otimes 4}$. (4) Because $l > m$ there exists $i$ such that $q_i \neq p_j$ for all $j$. We will assume without loss of generality that $q_1$ satisfies this. Let $k'$ be the largest value such that $q_1, p_1, \ldots, p_m$ are $k'$-independent. We will now show that $k' < k$. To see this suppose that $k' \ge k$, which would imply that $q_1, p_1, \ldots, p_m$ are $k$-independent. Observe that $m > 2$ (thus $m \ge 3$; we use this at (5)). Were this not the case then the components of $\mathcal{P}$, $\mu_1$ and $\mu_2$, would be linearly independent and $\mathcal{P}$ would be 4-determined from Table 1 row 4, thereby violating the contradiction hypothesis. With the base case $n = 4$ in our theorem hypothesis, and the fact that $k$ and $m$ are positive integers, we get $m + 1 \le 2(k-1) + 1$. (6) From an application of Lemma 5.2 it follows that $q_1^{\otimes 2}, p_1^{\otimes 2}, \ldots, p_m^{\otimes 2}$ are $\min(2(k-1)+1, m+1)$-independent, and from (6) it follows that they are linearly independent. Now we have that there exists $z$ such that $z \perp p_i^{\otimes 2}$ for all $i$ but $\langle z, q_1^{\otimes 2} \rangle = 1$, and thus a contradiction. So $k' < k$.
Because $k' < k$ there must exist a collection of $k'$ elements of $p_1, \ldots, p_m$, which we denote $p_{i_1}, \ldots, p_{i_{k'}}$, such that $q_1, p_{i_1}, \ldots, p_{i_{k'}}$ are linearly dependent; for convenience we will assume without loss of generality that these elements are $p_1, \ldots, p_{k'}$. Because $p_1, \ldots, p_{k'}$ are linearly independent but $q_1, p_1, \ldots, p_{k'}$ are linearly dependent, we have that $q_1 = \sum_{i=1}^{k'} \alpha_i p_i$ for some $\alpha_1, \ldots, \alpha_{k'}$. We will now show that $k' \ge 2k - m + 1$. Suppose this were not the case and $k' \le 2k - m$, or equivalently $m \le 2k - k'$. By $k$-independence there exists a vector $z$ such that $\langle z, p_{k'} \rangle = 1$ with $z \perp p_1, \ldots, p_{k'-1}, p_{k'+1}, \ldots, p_k$, and another vector $z'$ such that $\langle z', p_1 \rangle = 1$ and $z' \perp p_2, \ldots, p_{k'}, p_{k+1}, \ldots, p_m$. We know $z'$ exists since $k' \le 2k - m$, so the cardinality of $\{p_2, \ldots, p_{k'}, p_{k+1}, \ldots, p_m\}$ satisfies $(k' - 1) + (m - k) \le (2k - m - 1) + (m - k) = k - 1$. On the other hand, applying $z$ and $z'$ to both sides of (4) yields a contradiction. Thus we have that $k' \ge 2k - m + 1$.
We are now going to show that $q_1^{\otimes 2}, p_1^{\otimes 2}, \ldots, p_m^{\otimes 2}$ are linearly independent via Lemma 5.3. Once $q_1^{\otimes 2}, p_1^{\otimes 2}, \ldots, p_m^{\otimes 2}$ are linearly independent we can finish our contradiction using the same argument as in (7).

Induction Step: We will now proceed by induction along $n$ in increments of 2, since the theorem statement holds for even-valued $n$. For our inductive hypothesis assume that, for even-valued $n \ge 4$ and all $k$, $m$ with $2m - 2 \le (n-1)(k-1)$, any mixture of measures with $m$ components which are $k$-independent is $n$-determined. Consider some mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ with $k$-independent components and $2m - 2 \le (k-1)((n+2) - 1)$. If $k = m$ then we have that the components are linearly independent, so $\mathcal{P}$ is $(n+2)$-determined by Lemma 2.1 and Table 1 row four. Now suppose that $m > k$. Let $\mathcal{Q} = \sum_{i=1}^l b_i \delta_{\nu_i}$ be a mixture of measures with $l$ components such that $V_{n+2}(\mathcal{P}) = V_{n+2}(\mathcal{Q})$. Embedding $V_{n+2}(\mathcal{P})$ and $V_{n+2}(\mathcal{Q})$ as before we have that $\sum_{i=1}^m a_i p_i^{\otimes(n+2)} = \sum_{j=1}^l b_j q_j^{\otimes(n+2)}$. By $k$-independence there exists $z$ such that $\langle z, p_1 \rangle \neq 0$ and $z \perp p_{m-(k-1)+1}, \ldots, p_m$. Now we have that $\sum_{i=1}^m a_i \langle p_i, z \rangle^2 p_i^{\otimes n} = \sum_{j=1}^l b_j \langle q_j, z \rangle^2 q_j^{\otimes n}$. (8) The implication (8) follows from the equivalence between tensor products and Hilbert-Schmidt operators (see Kadison and Ringrose (1983) Proposition 2.6.9). Let $\lambda = \sum_{i=1}^{m-(k-1)} a_i \langle p_i, z \rangle^2$, $a_i' = a_i \langle p_i, z \rangle^2 / \lambda$, and $b_i' = b_i \langle q_i, z \rangle^2 / \lambda$. Without loss of generality we will assume that $\langle q_i, z \rangle \neq 0$ for $i \in [l']$ and $\langle q_i, z \rangle = 0$ for $i > l'$, with $l'$ potentially equaling $l$. Note that $a_i' \ge 0$ and $\sum_{i=1}^{m-(k-1)} a_i' = 1$, and likewise for the $b_i'$, since the right-hand side of (9) is a convex combination of pdfs that itself must be equal to a pdf. So now we have $\sum_{i=1}^{m-(k-1)} a_i' p_i^{\otimes n} = \sum_{i=1}^{l'} b_i' q_i^{\otimes n}$, (9) and $2(m-(k-1)) - 2 \le (k-1)(n+1) - 2(k-1) = (k-1)(n-1)$, so by the induction hypothesis we have that the mixture of measures $\mathcal{P}' = \sum_{i=1}^{m-(k-1)} a_i' \delta_{\mu_i}$ is $n$-determined and hence equal to $\mathcal{Q}' = \sum_{i=1}^{l'} b_i' \delta_{\nu_i}$. Without loss of generality we will assume that $\mu_1 = \nu_1$ and $a_1' = b_1'$. It follows that $p_1 = q_1$ and thus $a_1 = b_1$. By the same argument it follows that $\nu_i = \mu_i$ and $a_i = b_i$ for all $i$ and, because $\sum_{i=1}^l b_i = 1$, that $m = l$; thus $\mathcal{P} = \mathcal{Q}$ and $\mathcal{P}$ is $(n+2)$-determined, which finishes our proof.
Proof of Theorem 5.1 Note that two Hilbert spaces with the same finite dimension are isometrically isomorphic. Suppose that $a_1, \ldots, a_l$, $b_1, \ldots, b_l$, and $c_1, \ldots, c_l$ with $l \le r$ are such that $\sum_{i=1}^r x_i \otimes y_i \otimes z_i = \sum_{i=1}^l a_i \otimes b_i \otimes c_i$. Let $H_x' = \operatorname{span}(\{x_1, \ldots, x_r, a_1, \ldots, a_l\})$, with $H_y'$ and $H_z'$ defined similarly. Because these spaces are finite dimensional, the theorem follows from direct application of Kruskal's Theorem; see the following.
Definition 2 A matrix $M$ has Kruskal rank $k$ if every collection of $k$ columns of $M$ is linearly independent.
Theorem A.1 (Kruskal's Theorem) For a matrix $M$ let $M_i$ be its $i$th column vector. Let $A$, $B$, and $C$ be matrices of dimensions $d_A \times r$, $d_B \times r$, and $d_C \times r$, with Kruskal ranks $k_A$, $k_B$, and $k_C$ respectively, and let $k_A + k_B + k_C \ge 2r + 2$. Suppose that $\bar{A}$, $\bar{B}$, and $\bar{C}$ are matrices with $r$ columns satisfying $\sum_{i=1}^r A_i \otimes B_i \otimes C_i = \sum_{i=1}^r \bar{A}_i \otimes \bar{B}_i \otimes \bar{C}_i$. Then there exists a permutation matrix $P$ and invertible diagonal matrices $D_A$, $D_B$, $D_C$ with $D_A D_B D_C = I$ such that $\bar{A} = A P D_A$, $\bar{B} = B P D_B$, and $\bar{C} = C P D_C$.

Proposition 4.1 Let $\Psi$ be a measure which is absolutely continuous with respect to the uniform measure on the probability simplex $\Delta^{d-1}$ and let $\Gamma_1, \ldots, \Gamma_d \overset{iid}{\sim} \Psi$. Then $\Gamma_1, \ldots, \Gamma_d$ are linearly independent with probability one.

Proof of Proposition 4.1 Let $\Gamma_1, \ldots, \Gamma_d \overset{iid}{\sim} \Psi$. We will proceed by contradiction and assume that $\Gamma_1, \ldots, \Gamma_d$ are linearly dependent with nonzero probability. It follows that $\Gamma_1 = \sum_{i=2}^d \alpha_i \Gamma_i$ for some $\alpha_2, \ldots, \alpha_d$ with nonzero probability. Let $\mathbf{1}_d$ be the $d$-dimensional vector containing all ones. Because $\Gamma_1, \ldots, \Gamma_d$ are all probability vectors it follows that $1 = \mathbf{1}_d^T \Gamma_1 = \sum_{i=2}^d \alpha_i \mathbf{1}_d^T \Gamma_i = \sum_{i=2}^d \alpha_i$. The probability simplex lies in an affine subspace of dimension $d - 1$ which we will call $S$. Because of this there exists an affine operator $f$ which is a bijection between $S$ and a closed subset of $\mathbb{R}^{d-1}$, with $f(x) = Mx + b$ for some matrix $M$ and vector $b$. Let $\Gamma_i' = f(\Gamma_i)$. We have that $\Gamma_i'$ is distributed according to the measure $\Psi(f^{-1}(\cdot))$ on $\mathbb{R}^{d-1}$, which is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{d-1}$, and thus $\Gamma_1', \ldots, \Gamma_d'$ lie in general position with probability one (see Devroye et al. (1996) Section 4.5). Note that, since $\sum_{i=2}^d \alpha_i = 1$ and $f$ is affine, $\Gamma_1' = \sum_{i=2}^d \alpha_i \Gamma_i'$. Since $\Gamma_2', \ldots, \Gamma_d'$ trivially lie in a $(d-2)$-dimensional affine subspace, there exists a vector $v \neq 0_{d-1}$ and $r$ such that $v^T \Gamma_i' = r$ for $i \ge 2$. Now we have $v^T \Gamma_1' = \sum_{i=2}^d \alpha_i v^T \Gamma_i' = \left(\sum_{i=2}^d \alpha_i\right) r = r$, and thus, with nonzero probability, $\Gamma_1', \ldots, \Gamma_d'$ do not lie in general position, a contradiction.

Proof of Theorem 4.3 From Vandermeulen and Scott (2019) Theorem 4.2, for all $m \ge 2$ there exists a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ which is not $(2m-2)$-identifiable; thus there exists a mixture of measures $\mathcal{Q} = \sum_{i=1}^{m'} b_i \delta_{\nu_i} \neq \mathcal{P}$ with $m' \le m$ and $V_{2m-2}(\mathcal{P}) = V_{2m-2}(\mathcal{Q})$. Since $(k-1)n \le 2m-2$ we have that either $(k-1)n = 2m-2$, and it directly follows that $V_{(k-1)n}(\mathcal{P}) = V_{(k-1)n}(\mathcal{Q})$, or $(k-1)n < 2m-2$, and we have $V_{(k-1)n}(\mathcal{P}) = V_{(k-1)n}(\mathcal{Q})$ since these are marginals of $V_{2m-2}(\mathcal{P})$ and $V_{2m-2}(\mathcal{Q})$. If we let $\mathcal{P}' = \sum_{i=1}^m a_i \delta_{\mu_i^{\times(k-1)}}$ and $\mathcal{Q}' = \sum_{i=1}^{m'} b_i \delta_{\nu_i^{\times(k-1)}}$ then we have that $V_n(\mathcal{P}') = V_n(\mathcal{Q}')$. To finish the proof we will show that $\mu_1^{\times(k-1)}, \ldots, \mu_m^{\times(k-1)}$ are $k$-independent and we are done. To do this we will proceed by contradiction: suppose that they are not $k$-independent, so that there exists a nontrivial linear combination of $k$ elements of $\mu_1^{\times(k-1)}, \ldots, \mu_m^{\times(k-1)}$ which is equal to zero. We will assume without loss of generality that $\mu_1, \ldots, \mu_k$ satisfy this, so $\sum_{i=1}^k \alpha_i \mu_i^{\times(k-1)} = 0$ for some $\alpha_1, \ldots, \alpha_k$ with some $\alpha_i \neq 0$. Embedding these measures as was done in the proof of Theorem 4.1 we have that $\sum_{i=1}^k \alpha_i p_i^{\otimes(k-1)} = 0$, but since any pair of distinct $p_i$, $p_j$ is 2-independent, applying Lemma 5.2 gives us that $p_1^{\otimes(k-1)}, \ldots, p_k^{\otimes(k-1)}$ are linearly independent, a contradiction.

Proof of Theorem 4.4 From Vandermeulen and Scott (2019) Theorem 4.4, for all $m \ge 1$ there exists a mixture of measures $\mathcal{P} = \sum_{i=1}^m a_i \delta_{\mu_i}$ which is not $(2m-1)$-determined. From here this proof proceeds exactly as the proof of Theorem 4.3.
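Proposition 4.1 is easy to probe numerically (a sketch under the assumption that $\Psi$ is itself the uniform measure on the simplex, i.e. the Dirichlet$(1, \ldots, 1)$ distribution): $d$ iid draws form a full-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 6
# d iid draws from the uniform (Dirichlet(1, ..., 1)) measure on the simplex.
gammas = rng.dirichlet(np.ones(d), size=d)

rank = np.linalg.matrix_rank(gammas)   # Proposition 4.1 predicts full rank d
```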