Learning Mixtures of Low-Rank Models

We study the problem of learning mixtures of low-rank models, i.e. reconstructing multiple low-rank matrices from unlabelled linear measurements of each. This problem enriches two widely studied settings — low-rank matrix sensing and mixed linear regression — by bringing latent variables (i.e. unknown labels) and structural priors (i.e. low-rank structures) into consideration. To cope with the non-convexity issues arising from unlabelled heterogeneous data and low-complexity structure, we develop a three-stage meta-algorithm that is guaranteed to recover the unknown matrices with near-optimal sample and computational complexities under Gaussian designs. In addition, the proposed algorithm is provably stable against random noise. We complement the theoretical studies with empirical evidence that confirms the efficacy of our algorithm.

This paper explores a mixture of low-rank models with latent variables, which seeks to reconstruct a collection of low-rank matrices M⋆_k ∈ R^{n1×n2} (1 ≤ k ≤ K) from unlabeled linear measurements of each. More specifically, what we have available is a collection of N linear measurements {y_i}_{1≤i≤N} taking the following form:

    y_i = ⟨A_i, M⋆_k⟩,  if i ∈ Ω⋆_k,    (1)

where {A_i}_{1≤i≤N} are the sampling/design matrices, ⟨•, •⟩ denotes the matrix inner product, and {Ω⋆_k}_{1≤k≤K} represents an unknown partition of the index set {1, . . ., N}. The aim is to design an algorithm that is guaranteed to recover {M⋆_k} efficiently and faithfully, despite the absence of knowledge of {Ω⋆_k}_{1≤k≤K}. This problem of learning mixtures of low-rank models enriches two widely studied settings: (1) it generalizes classical low-rank matrix recovery [RFP10, CLC19] by incorporating heterogeneous data and latent variables (i.e. the labels indicating which low-rank matrices are being measured), and (2) it expands the studies of mixed linear regression [QR78, YCS14] by integrating low-complexity structural priors (i.e. low-rank structures). In addition to the prior work [YC15] that has studied this setting, we single out two broader scenarios that bear relevance to and motivate the investigation of mixtures of low-rank models.
• Mixed matrix completion. If each measurement y_i only reveals a single entry of one of the unknown matrices {M⋆_k}, then the problem is commonly referred to as mixed matrix completion (namely, completing several low-rank matrices from a mixture of unlabeled observations of their entries) [PA18]. One motivating application arises from computer vision, where several problems like joint shape matching can be posed as structured matrix completion [CGH14, CC18a]. When the objects to be matched exhibit certain geometric symmetry, there might exist multiple plausible maps (and hence multiple ground-truth matrices), and the provided observations might become intrinsically unlabeled due to symmetric ambiguities [SLHH18]. Other applications include network topology inference and metagenomics given mixed DNA samples; see [PA18] for details.
• Multi-task learning and meta-learning. The model (1) can be viewed as an instance of multi-task learning or meta-learning [Bax00, MPRP16, KSS+20], where the tasks follow a discrete prior distribution supported on a set of K meta parameters, and each training data point (A_i, y_i) is a realization of one task that comes with a single sample. While it is typically assumed in meta-learning that even light tasks have more than one sample, understanding this single-sample model is essential towards tackling more general settings. Additionally, in comparison to meta-learning for mixed linear regression [KSS+20, KSKO20], the model (1) imposes a further structural prior on the unknown meta parameters, thereby allowing for potential reduction of sample complexities.
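To make the measurement model (1) concrete, the following minimal sketch generates synthetic data under the Gaussian design studied later in the paper. This is an illustration rather than code from the paper; the function and parameter names are our own.

```python
import numpy as np

def generate_mixed_sensing_data(n1, n2, ranks, p, N, rng):
    """Generate unlabeled linear measurements from K low-rank matrices,
    following y_i = <A_i, M*_k> for i in the (hidden) index set of component k."""
    K = len(ranks)
    # Ground-truth low-rank matrices M*_k of rank r_k
    M_star = []
    for r in ranks:
        L = rng.standard_normal((n1, r))
        R = rng.standard_normal((n2, r))
        M_star.append(L @ R.T / np.sqrt(r))
    labels = rng.choice(K, size=N, p=p)    # hidden labels, unknown to the learner
    A = rng.standard_normal((N, n1, n2))   # Gaussian design matrices
    # y_i = <A_i, M*_k> where k is the hidden label of sample i
    y = np.array([np.sum(A[i] * M_star[labels[i]]) for i in range(N)])
    return A, y, M_star, labels
```

Only (A, y) would be revealed to a learning algorithm; the labels are returned here solely for evaluation purposes.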
The challenge for learning mixtures of low-rank models primarily stems from the non-convexity issues.
While the low-rank structure alone already leads to non-convex optimization landscapes, the presence of heterogeneous data and discrete hidden variables further complicates matters significantly.

Main contributions
This paper takes a step towards learning mixtures of low-rank models, focusing on the tractable Gaussian design where the A_i's have i.i.d. Gaussian entries; in light of this, we shall also call the problem mixed matrix sensing, to be consistent with the terminology used in recent literature [BNS16, CLC19]. In particular, we propose a meta-algorithm comprising the following three stages:
1. Estimate the joint column and row spaces of {M⋆_k}_{1≤k≤K};
2. Transform mixed matrix sensing into low-dimensional mixed linear regression using the above subspace estimates, and invoke a mixed linear regression solver to obtain initial estimates of {M⋆_k}_{1≤k≤K};
3. Successively refine the estimates via a non-convex low-rank matrix factorization algorithm (more specifically, an algorithm called scaled truncated gradient descent, to be described in Algorithm 3).
The details of each stage will be spelled out and elucidated in Section 2. Encouragingly, the proposed algorithm is guaranteed to succeed under mild conditions (to be specified in Section 3.1). Informally, our contributions are three-fold.
• Exact recovery in the noiseless case. In the absence of noise, our algorithm enables exact recovery of {M⋆_k} modulo global permutation. The sample complexity required to achieve this scales linearly (up to some log factor) in the dimension max{n_1, n_2} and polynomially in other salient parameters.
• Stability vis-à-vis random noise. The proposed algorithm is provably stable against Gaussian noise, in the sense that the estimation accuracy degrades gracefully as the signal-to-noise ratio decreases.
• Computational efficiency. When the number K of components and the maximum rank of the unknown matrices are both constants, the computational cost of our algorithm scales nearly linearly in N n_1 n_2, with N the number of samples; this is proportional to the time taken to read all design matrices.
The precise theorem statements are postponed to Section 3. Empirical evidence will also be provided in Section 3 to corroborate the efficacy of our algorithm.

Notation
Before we proceed, let us collect some notation that will be frequently used. Throughout this paper, we reserve boldfaced symbols for vectors (lower case) and matrices (upper case). For a vector x, ∥x∥_2 denotes its ℓ_2 norm. For a matrix X, ∥X∥ (resp. ∥X∥_F) denotes its spectral (resp. Frobenius) norm, σ_k(X) denotes its k-th largest singular value, and col{X} (resp. row{X}) denotes its column (resp. row) space. If U is a matrix with orthonormal columns, we also use the same notation U to represent its column space, and vice versa. For any matrices A, B ∈ R^{n1×n2}, let ⟨A, B⟩ := Σ_{i=1}^{n1} Σ_{j=1}^{n2} A_{ij} B_{ij} stand for the matrix inner product. I_n represents the n×n identity matrix. vec(•) denotes vectorization of a matrix, and mat(•) denotes the inverse operation (the corresponding matrix dimensions should often be clear from the context).
We use both a_n ≲ b_n and a_n = O(b_n) to indicate that a_n ≤ C_0 b_n for some universal constant C_0 > 0; in addition, a_n ≳ b_n is equivalent to b_n ≲ a_n, and a_n ≍ b_n means both a_n ≲ b_n and b_n ≲ a_n hold true. For a finite set Ω, we denote by |Ω| its cardinality. Finally, for a number α ∈ [0, 1] and a random variable X following some distribution on R, we let Q_α(X) denote the α-quantile function, namely

    Q_α(X) := inf{τ ∈ R : P(X ≤ τ) ≥ α}.    (2)

For a finite set D of real numbers, with slight abuse of notation, we let Q_α(D) be the α-quantile of D; more precisely, we define Q_α(D) := Q_α(X_D), where X_D denotes a random variable uniformly drawn from D.
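As a quick illustration of the finite-set quantile Q_α(D), which is used later to define the truncation set in Stage 3, a small helper might look as follows. This is our own sketch of the convention in (2), not code from the paper.

```python
import numpy as np

def quantile_finite_set(values, alpha):
    """Q_alpha(D): smallest tau in D such that P(X_D <= tau) >= alpha,
    where X_D is uniform over the finite set D."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    idx = max(int(np.ceil(alpha * n)) - 1, 0)  # smallest i with (i+1)/n >= alpha
    return v[idx]
```

For instance, under this convention the 0.5-quantile of {1, 2, 3, 4} is 2.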

Algorithm
This section formalizes our algorithm design by specifying each stage of our meta-algorithm with a concrete procedure (namely, Algorithms 1, 2, 3 for Stages 1, 2, 3, respectively). It is worth noting that these are definitely not the only choices; in fact, an advantage of our meta-algorithm is its flexibility and modularity, in the sense that one can plug in different sub-routines to address various models and assumptions. Before continuing, we introduce more notation that will be used throughout. For any 1 ≤ k ≤ K, define

    p_k := |Ω⋆_k|/N  and  r_k := rank(M⋆_k),

which represent the fraction of samples associated with the k-th component and the rank of the k-th ground-truth matrix M⋆_k, respectively. In addition, let the compact singular value decomposition (SVD) of {M⋆_k} be

    M⋆_k = U⋆_k Σ⋆_k V⋆_k^⊤  (1 ≤ k ≤ K),

where U⋆_k ∈ R^{n1×r_k} and V⋆_k ∈ R^{n2×r_k} consist of orthonormal columns, and Σ⋆_k is an r_k×r_k diagonal matrix containing the nonzero singular values of M⋆_k.

Stage 1: subspace estimation via a spectral method
Procedure. We propose to estimate the following joint column and row spaces

    U⋆ := col{[U⋆_1, . . ., U⋆_K]}  and  V⋆ := col{[V⋆_1, . . ., V⋆_K]}    (5)

by means of a spectral method. More specifically, we start by forming a data matrix

    Y := (1/N) Σ_{i=1}^N y_i A_i,

and set U ∈ R^{n1×R} (resp. V ∈ R^{n2×R}) to be a matrix whose columns consist of the top-R left (resp. right) singular vectors of Y, where R := Σ_{k=1}^K r_k. This method is summarized in Algorithm 1.
Rationale. To see why this might work, note that if {A_i} consist of i.i.d. standard Gaussian entries, then

    E[Y] = Σ_{k=1}^K p_k M⋆_k.

Recalling the definitions of U⋆ and V⋆ in (5), we have col{E[Y]} ⊆ U⋆ and row{E[Y]} ⊆ V⋆, with equality under some mild conditions (detailed in Section 3). This motivates the development of Algorithm 1.
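Stage 1 can be sketched in a few lines. The following is an illustrative implementation consistent with the description above; the names are ours, not the paper's.

```python
import numpy as np

def subspace_estimation(A, y, R):
    """Stage 1 (spectral method): form Y = (1/N) sum_i y_i A_i, whose
    expectation under the Gaussian design is sum_k p_k M*_k, and return
    its top-R left and right singular vectors."""
    N = len(y)
    Y = np.tensordot(y, A, axes=(0, 0)) / N
    U, _, Vt = np.linalg.svd(Y)
    return U[:, :R], Vt[:R, :].T
```

With enough samples, the columns of the returned U and V approximately span the joint column and row spaces of the ground-truth matrices.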

Stage 2: initialization via low-dimensional mixed linear regression
Algorithm 1: Subspace estimation via a spectral method
1 Input: samples {A_i, y_i}_{1≤i≤N}.
2 Compute the data matrix Y = (1/N) Σ_{i=1}^N y_i A_i.
3 Let U (resp. V) be the matrix consisting of the top-R left (resp. right) singular vectors of Y.

Key observations. Suppose that there is an oracle informing us of the subspaces U⋆ and V⋆ defined in (5). Recognizing the basic relation

    M⋆_k = U⋆ U⋆^⊤ M⋆_k V⋆ V⋆^⊤ = U⋆ S⋆_k V⋆^⊤,  where S⋆_k := U⋆^⊤ M⋆_k V⋆ ∈ R^{R×R},
we can rewrite the measurements in hand as follows:

    y_i = ⟨A_i, M⋆_k⟩ = ⟨U⋆^⊤ A_i V⋆, S⋆_k⟩,  if i ∈ Ω⋆_k.

In other words, the presence of the oracle effectively reduces the original problem into a mixed linear regression problem in lower dimensions; that is, the problem of recovering {S⋆_k} from mixed linear measurements. If {S⋆_k} can be reliably estimated, then one can hope to recover {M⋆_k} via the following relation: M⋆_k = U⋆ S⋆_k V⋆^⊤.

Procedure. While we certainly have no access to the aforementioned oracle in reality, Stage 1 described above provides us with subspace estimates U and V. Treating these as surrogates of U⋆ and V⋆, we can view the measurements as

    y_i ≈ ⟨U^⊤ A_i V, S_k⟩ = a_i^⊤ β_k,  if i ∈ Ω⋆_k,

which are mixed linear measurements about the following vectors/matrices:

    S_k := U^⊤ M⋆_k V ∈ R^{R×R}  and  β_k := vec(S_k).

Here, the equivalent sensing vectors are defined to be a_i := vec(U^⊤ A_i V). All this motivates us to resort to mixed linear regression algorithms for recovering {β_k}. The proposed algorithm thus entails the following steps, with the precise procedure summarized in Algorithm 2.

Algorithm 2: Initialization via low-dimensional mixed linear regression
{β̂_k}_{1≤k≤K} ← the output of a black-box mixed linear regression solver (i.e. Algorithm 5) on {a_i, y_i}_{1≤i≤N}.
• Invoke any mixed linear regression algorithm to obtain estimates { β k } 1≤k≤K for {β k } 1≤k≤K (up to global permutation).For concreteness, the current paper applies the tensor method (Algorithm 5) originally proposed in [YCS16]; this is a polynomial-time algorithm, with details deferred to Appendix A.
To simplify presentation, let us assume here that the global permutation happens to be an identity map, so that β̂_k is indeed a faithful estimate of β_k (1 ≤ k ≤ K). By simple matricization, β̂_k leads to a reliable estimate Ŝ_k = mat(β̂_k) of S_k.
• Given the observation that U Ŝ_k V^⊤ ≈ U S_k V^⊤ ≈ M⋆_k, we propose to compute the rank-r_k SVD of U Ŝ_k V^⊤, denoted by Ū_k Σ̄_k V̄_k^⊤. This in turn leads to our initial estimates for the low-rank factors: L^0_k := Ū_k Σ̄_k^{1/2} and R^0_k := V̄_k Σ̄_k^{1/2}.

Stage 3: local refinement via scaled truncated gradient descent (ScaledTGD)

Suppose that an initial point L^0(R^0)^⊤ lies within a reasonably small neighborhood of M⋆_k for some 1 ≤ k ≤ K. Stage 3 serves to locally refine this initial estimate, moving it closer to our target M⋆_k. Towards this end, we propose to deploy the following update rule, termed scaled truncated gradient descent (ScaledTGD):

    L^{t+1} = L^t − (η/N) Σ_{i∈Ω_t} (⟨A_i, L^t(R^t)^⊤⟩ − y_i) A_i R^t ((R^t)^⊤R^t)^{−1},
    R^{t+1} = R^t − (η/N) Σ_{i∈Ω_t} (⟨A_i, L^t(R^t)^⊤⟩ − y_i) A_i^⊤ L^t ((L^t)^⊤L^t)^{−1},

where η > 0 denotes the step size. Here, Ω_t ⊆ {1, 2, · · ·, N} is an adaptive and iteration-varying index set designed to mimic the index set Ω⋆_k. Indeed, if Ω_t = Ω⋆_k, the aforementioned update rule reduces to the ScaledGD method developed for vanilla low-rank matrix sensing (see [TMC20]), which is guaranteed to converge to M⋆_k in the presence of a suitable initialization. Here, the rescaling matrix ((R^t)^⊤R^t)^{−1} (resp. ((L^t)^⊤L^t)^{−1}) acts as a pre-conditioner of the conventional gradient ∇_L f(L^t, R^t) (resp. ∇_R f(L^t, R^t)), with f(L, R) := (1/2N) Σ_{i=1}^N (⟨A_i, LR^⊤⟩ − y_i)² the sample loss, which effectively accelerates convergence when M⋆_k is ill-conditioned. See [TMC20] for more intuitions and justifications of this rescaling strategy.
Viewed in this light, the key to ensuring the effectiveness of ScaledTGD lies in the design of the index set Ω_t. The key observation is that the empirical loss |⟨A_i, L^t(R^t)^⊤⟩ − y_i| typically has a smaller scale for a sample i ∈ Ω⋆_k when compared with those i ∉ Ω⋆_k. This motivates us to include in Ω_t a certain fraction (denoted by 0 < α < 1) of samples enjoying the smallest empirical loss, namely

    Ω_t := { 1 ≤ i ≤ N : |⟨A_i, L^t(R^t)^⊤⟩ − y_i| ≤ Q_α({|⟨A_j, L^t(R^t)^⊤⟩ − y_j|}_{1≤j≤N}) }.

Intuitively, the fraction α should not be too large, in which case Ω_t is likely to contain samples outside Ω⋆_k; on the other hand, α should not be chosen too small in order not to waste information. As it turns out, choosing 0.6p_k ≤ α ≤ 0.8p_k strikes a suitable balance and works well for our purpose. See Algorithm 3 for a precise description.
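A single ScaledTGD iteration for one component might be sketched as follows. This is our own rendering of the update rule and truncation described above; the normalization and step-size conventions are assumptions, not the paper's exact specification.

```python
import numpy as np

def scaled_tgd_step(L, R, A, y, eta, alpha):
    """One ScaledTGD iteration: keep the alpha-fraction of samples with the
    smallest empirical loss |<A_i, L R^T> - y_i|, then take gradient steps
    preconditioned by (R^T R)^{-1} and (L^T L)^{-1}."""
    N = len(y)
    resid = np.einsum('nij,ij->n', A, L @ R.T) - y
    thresh = np.quantile(np.abs(resid), alpha)   # empirical alpha-quantile
    keep = np.abs(resid) <= thresh               # truncation set Omega_t
    G = np.tensordot(resid[keep], A[keep], axes=(0, 0)) / N
    L_new = L - eta * G @ R @ np.linalg.inv(R.T @ R)
    R_new = R - eta * G.T @ L @ np.linalg.inv(L.T @ L)
    return L_new, R_new
```

Both factors are updated from the same (old) iterates, mirroring the simultaneous update in the display above.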

The full algorithm
With the three stages fully described, we can specify the whole algorithm in Algorithm 4, with the choices of algorithmic parameters listed in Table 1. Note that the discussion of Stage 3 above focuses on estimating a single component; in order to recover all K components {M⋆_k}_{1≤k≤K}, we simply need to run Algorithm 3 K times (which can be executed in parallel). In addition, Algorithm 4 is built upon sample splitting: while Stages 1 and 3 employ the same set of samples {A_i, y_i}_{1≤i≤N}, Stage 2 (i.e. Line 3 of Algorithm 4) operates upon an independent set of samples {A′_i, y′_i}_{1≤i≤N_MLR} (where "MLR" stands for "mixed linear regression"), thus resulting in a total sample complexity of N + N_MLR. The main purpose of sample splitting is to decouple statistical dependency across stages and facilitate analysis. Finally, the interested reader is referred to Appendix D for a discussion regarding how to estimate certain parameters in Algorithm 4 if they are not known a priori.
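To illustrate how the stages fit together, here is a sketch of the Stage 2 reduction and initialization around a black-box mixed linear regression solver (the solver itself, e.g. Algorithm 5, is omitted). All names are illustrative and the matricization convention is an assumption.

```python
import numpy as np

def reduce_to_mlr(A, U, V):
    """Stage 2 reduction: equivalent sensing vectors a_i = vec(U^T A_i V)."""
    return np.einsum('pi,npq,qj->nij', U, A, V).reshape(len(A), -1)

def lift_and_truncate(beta_hat, U, V, r):
    """Matricize beta_hat, lift back to ambient dimensions, and truncate
    to rank r to obtain balanced initial factors (L0, R0) for Stage 3."""
    R_dim = U.shape[1]
    S_hat = beta_hat.reshape(R_dim, R_dim)
    M_hat = U @ S_hat @ V.T
    Um, s, Vmt = np.linalg.svd(M_hat, full_matrices=False)
    L0 = Um[:, :r] * np.sqrt(s[:r])
    R0 = Vmt[:r, :].T * np.sqrt(s[:r])
    return L0, R0
```

With exact subspaces, a_i^⊤ vec(S_k) reproduces y_i exactly, and the lifted rank-r_k truncation recovers M⋆_k; with estimated subspaces, these steps hold approximately.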

Models and assumptions
For notational convenience, let us define the parameters r, κ and Γ as in (17), where κ(M⋆_k) := σ_1(M⋆_k)/σ_{r_k}(M⋆_k) stands for the condition number of M⋆_k. This paper focuses on the Gaussian design, where the entries of each design matrix A_i are independently drawn from the standard Gaussian distribution. In addition, we assume that the samples drawn from the K components are reasonably well-balanced, in the sense that

    p_k = |Ω⋆_k|/N ≳ 1/K  for all 1 ≤ k ≤ K,

where Ω⋆_k is the index set for the k-th component (see (1)). We assume that this well-balancedness assumption holds for both sets of samples {A_i, y_i}_{1≤i≤N} and {A′_i, y′_i}_{1≤i≤N_MLR}. Next, we introduce an incoherence parameter that plays a crucial role in our theoretical development.
Definition 1. The incoherence parameter µ ≥ 0 is the smallest quantity such that

    ∥U⋆_i^⊤ U⋆_j∥_F ≤ µ √(r_i r_j / n_1)  and  ∥V⋆_i^⊤ V⋆_j∥_F ≤ µ √(r_i r_j / n_2)  for all 1 ≤ i ≠ j ≤ K.

The incoherence parameter µ takes value on [0, √(n/r)]. As an example, if the column (resp. row) spaces of {M⋆_k} are independent random subspaces, then ∥U⋆_i^⊤ U⋆_j∥_F (resp. ∥V⋆_i^⊤ V⋆_j∥_F) is on the order of √(r_i r_j / n_1) (resp. √(r_i r_j / n_2)), which is further upper bounded by r/√n_1 (resp. r/√n_2). This observation motivates our definition of the incoherence parameter. One of our main technical assumptions is that the column (resp. row) spaces of the ground-truth matrices are mutually weakly correlated, as quantified by the parameter µ, which covers a broad range of settings.
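The random-subspace example above is easy to verify numerically. The sketch below (our own) estimates the cross-correlation ∥U_i^⊤ U_j∥_F between two independent random subspaces; for such subspaces one can check that E∥U_1^⊤ U_2∥_F² = r_1 r_2 / n, matching the √(r_i r_j / n) scaling.

```python
import numpy as np

def cross_coherence(n, r1, r2, rng):
    """||U1^T U2||_F for two independent uniformly random subspaces of R^n."""
    U1, _ = np.linalg.qr(rng.standard_normal((n, r1)))
    U2, _ = np.linalg.qr(rng.standard_normal((n, r2)))
    return np.linalg.norm(U1.T @ U2, 'fro')
```

For n = 2000 and r_1 = r_2 = 3, the typical value is around √(9/2000) ≈ 0.067, i.e. random subspaces are highly incoherent.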
Assumption 1.The incoherence parameter µ is upper bounded by

Theoretical guarantees
Exact recovery in the absence of noise.
Our first main result uncovers that, in the noiseless case, Algorithm 4 achieves exact recovery efficiently, in terms of both sample and computational complexities.
The proof can be found in Section 5. Two implications are in order.
• Suppose that the parameters K, r, κ, Γ = O(1). In order to achieve exact recovery, the sample size N in (21) only needs to scale as O(n log n), while N_MLR only needs to exceed the order of log⁴ n.
• By setting the step size η_k = c_1/p_k for some constant 0 < c_1 ≤ 1.3, we see that the third stage (i.e. ScaledTGD) achieves linear convergence with a constant contraction rate, which is independent of the condition number κ(M⋆_k) of the matrix M⋆_k.
Stability vis-à-vis noise. Moving on to the more realistic case with noise, we consider the following set of samples {A_i, y_i}_{1≤i≤N}:

    y_i = ⟨A_i, M⋆_k⟩ + ζ_i,  if i ∈ Ω⋆_k,  with ζ_i i.i.d. ∼ N(0, σ²).    (23)

The set {A′_i, y′_i}_{1≤i≤N_MLR} is independently generated in a similar manner. Our next result reveals that the proposed algorithm is stable against Gaussian noise. The proof is postponed to Section 5.
Theorem 2 (Stable recovery). Consider the noisy model (23) under the assumptions of Section 3.1. Suppose that the sample sizes satisfy (21), and that the noise level σ satisfies condition (24) for some sufficiently small constant c > 0. Then with probability at least 1 − o(1), there exists some permutation π : {1, . . ., K} → {1, . . ., K} such that the outputs of Algorithm 4 obey the error bound (25) for all 1 ≤ k ≤ K, where 0 < c_0 < 1/4 and C_0 > 0 are some universal constants, and T_0 is the number of iterations used in Algorithm 3.

Theorem 2 asserts that, when initialized using the proposed schemes, the ScaledTGD algorithm converges linearly until an error floor is hit. To interpret the statistical guarantees (25), we find it helpful to define the signal-to-noise ratio (SNR) w.r.t. M⋆_k as follows:

    SNR_k := ∥M⋆_k∥_F² / σ².

This together with (25) leads to the simple consequence (27) as long as the iteration number T_0 is sufficiently large. Here, the first term on the right-hand side of (27) matches the minimax lower bound for low-rank matrix sensing [CP11, Theorem 2.5] (the case with K = 1) up to a factor of K√(log N). In contrast, the second term on the right-hand side of (27), which becomes very small as SNR_k grows, is not a function of the sample size N and does not vanish as N → ∞. This term arises since, even at the population level, the point (L, R) satisfying LR^⊤ = M⋆_k is not a fixed point of the ScaledTGD update rule, due to the presence of mislabeled samples.

Numerical experiments
To validate our theoretical findings, we conduct a series of numerical experiments. To match practice, we do not deploy sample splitting (given that it is merely introduced to simplify analysis), and reuse the same dataset of size N for all three stages. Throughout the experiments, we set n_1 = n_2 = n = 120, r = 2, and K = 3. For each k, we let p_k = 1/K and Σ⋆_k = I_r, and generate U⋆_k and V⋆_k as random r-dimensional subspaces in R^n. We fix the sample size to be N = 90nrK. The algorithmic parameters are chosen according to our recommendations in Table 1. For instance, for each run of ScaledTGD, we set the step size as η = 1.3K and the truncation fraction as α = 0.8/K.
Linear convergence of ScaledTGD. Our first series of experiments aims at verifying the linear convergence of ScaledTGD towards the ground-truth matrices {M⋆_k} when initialized using the outputs of Stage 2. We consider both the noiseless case (i.e. σ = 0) and the noisy case σ = 10⁻⁵. Figures 1(a) and 1(b) plot the relative Euclidean error ∥L^t(R^t)^⊤ − M⋆_k∥_F/∥M⋆_k∥_F versus the iteration count t. It can be seen from Figures 1(a) and 1(b) that ScaledTGD, when seeded with the outputs from Stage 2, converges linearly to the ground-truth matrices {M⋆_k} in the absence of noise, and to within a small neighborhood of {M⋆_k} in the noisy setting.
Estimation error in the presence of random noise. The second series of experiments investigates the stability of the three-stage algorithm in the presence of random noise. We vary the noise level σ, and plot the relative estimation error of the outputs {M̂_k} of Algorithm 4 versus the noise level σ, showing that the recovery error is indeed linear in σ, as predicted by our theory.

Prior work
Low-rank matrix recovery. There exists a vast literature on low-rank matrix recovery (e.g. [CR09, KMO10, BNS16, CC17, MWCC20, CCFM19, SL16, CLS15, JNS13, CCF+ar, CFMY19, SQW18, CLL20, DC20, NNS+14, CCG15, CCD+19, ACHL19, ZQW20, ZWYG18, LMZ18, PKCS17]); we refer the readers to [CC18b, CLC19] for an overview of this extensively studied topic. Most related to our work is the problem of matrix sensing (or low-rank matrix recovery from linear measurements). While convex relaxation [CR09, RFP10, CP11] enjoys optimal statistical performance, two-stage non-convex approaches [ZL15, TBS+16, TMC20] have received growing attention in recent years, due to their ability to achieve statistical and computational efficiency at once. Our three-stage algorithm is partially inspired by the two-stage approach along this line. It is worth mentioning that the non-convex loss function associated with low-rank matrix sensing enjoys a benign landscape, which in turn enables tractable global convergence of simple first-order methods [BNS16, GJZ17, ZLTW18, LMZ18, LZT19].

Mixed linear regression. Being a classical problem in statistics dating back to [QR78], mixed linear regression has attracted much attention due to its broad applications in music perception [DV89, VT02], health care [DH00], trajectory clustering [GS99], plant science [Tur00], and neuroscience [YPCR18], to name a few. While computationally intractable in the worst case [YCS14], mixed linear regression can be solved efficiently under certain statistical models on the design matrix. Take the two-component case for instance: efficient methods include alternating minimization with initialization via grid search [YCS14], EM with random initialization [KYB19, KQC+19], and convex reformulations [CYC14, HJ18]; EM further achieves minimax estimation guarantees in the presence of Gaussian noise [KHC20]. Mixed linear regression becomes substantially more challenging when the number K of components is allowed to grow with n. The state-of-the-art method, namely the Fourier moment method [CLS20], achieves sub-exponential sample and computational complexities w.r.t. K, whereas other methods (e.g. the method of moments [LL18] and grid search over K-dimensional subspaces [SS19a]) all have exponential dependence on K. It turns out that by restricting the ground-truth vectors to be in "general position" (e.g. linearly independent), tensor methods [YCS16, CL13, SJA16, ZJD16] solve mixed linear regression with polynomial sample and computational complexities in K. It is worth noting that most of the prior work focused on the Gaussian design for theoretical analysis, with a few exceptions [CYC14, HJ18, SS19a]. Another line of work [KC07, SBVDG10, YPCR18, KMMP19, MP20] considered mixed linear regression with sparsity, which is beyond the scope of the current paper.

Mixed low-rank matrix estimation. Moving beyond mixed linear regression, there are a few papers that tackle mixtures of low-rank models. For example, [YC15] proposed a regularized EM algorithm and applied it to mixed matrix sensing with two symmetric components; however, only local convergence was investigated therein. Additionally, [PA18] was the first to systematically study mixed matrix completion, investigating the identifiability conditions and sample complexities of this problem; however, the heuristic algorithm proposed therein comes without provable guarantees.

Iterative truncated loss minimization.
Least trimmed squares [Rou84] is a classical method for robust linear regression. Combining the idea of trimming (i.e. selecting a subset of "good" samples) with iterative optimization algorithms (e.g. gradient descent and its variants) leads to a general paradigm of iterative truncated loss minimization, a principled method for improving robustness w.r.t. heavy-tailed data, adversarial outliers, etc. [SS19b, SWS20]. Successful applications of this kind include linear regression [BJK15], mixed linear regression [SS19a], phase retrieval [CC17, ZCL18], matrix sensing [LCZL20], and learning entangled single-sample distributions [YL20], among others.

Multi-task learning and meta-learning.
The aim of multi-task learning [Car97, Bax00, BDS03, AZ05, EMP05, AEP07, JSRR10, PLW+20, PL14, MPRP16] is to simultaneously learn models that connect multiple related tasks. Exploiting the similarity across tasks enables improved performance for learning each individual task, and leads to enhanced generalization capabilities for unseen but related tasks with limited samples. This paradigm (or its variants) is also referred to in the literature as meta-learning [FAL17, TJJ20] (i.e. learning-to-learn), transfer learning [PY09], and few-shot learning [SSZ17, DHK+20], depending on the specific scenarios of interest. Our study on learning mixtures of models is related to the probabilistic approach taken in multi-task learning and meta-learning, in which all the tasks (both the training and the testing ones) are independently sampled from a common environment, i.e. a prior distribution of tasks [Bax00]. See [KSS+20, KSKO20] for recent efforts that make explicit the connection between mixed linear regression and meta-learning.

Analysis
In this section, we present the proofs of Theorems 1 and 2. Our analysis is modular, in the sense that we deliver performance guarantees for the three stages separately, with each guarantee independent of the others. For instance, one can replace the tensor method in Stage 2 by any other mixed linear regression solver with provable guarantees, without affecting Stages 1 and 3.

Stage 1.
The first result confirms that, given enough samples, Algorithm 1 outputs reasonable estimates of the subspaces (U⋆, V⋆) (cf. (5)). The proof is deferred to Appendix B.1.
Theorem 3. Consider the model (23) under the assumptions in Section 3.1. Recall the definitions of κ and Γ in (17). For any 0 < δ < 1, the estimates U and V returned by Algorithm 1 satisfy the error bound (28) with probability at least 1 − Ce^{−cn} for some universal constants C, c > 0, provided that the sample size obeys condition (29) for some sufficiently large constant C_0 > 0.
Next, we demonstrate that the tensor method employed in Algorithm 2 reliably solves the intermediate mixed linear regression problem defined in (12). The proof is postponed to Appendix B.2.

Theorem 4. Consider the model (23) under the assumptions in Section 3.1. Suppose that the subspace estimates U and V are independent of {A_i, y_i}_{1≤i≤N}, and that their estimation error is bounded above by some sufficiently small constant c_1 > 0. Let {β̂_k}_{1≤k≤K} be the estimates returned by Line 3 of Algorithm 2. Given any 0 < ǫ ≤ c_2/K, there exists a permutation π(•) : {1, . . ., K} → {1, . . ., K} such that the advertised estimation error bound holds with probability at least 1 − O(1/log n), provided that the sample size N_MLR is sufficiently large. Here, c_2 > 0 (resp. C > 0) is some sufficiently small (resp. large) constant.
From now on, we shall assume without loss of generality that π(•) is an identity map (i.e. π(k) = k) to simplify the presentation. Our next result transfers the estimation error bounds for U, V and {β̂_k} to bounds for {L_k R_k^⊤}, thus concluding the analysis of Stage 2; see Appendix B.3 for a proof.
Proposition 1. Under the assumptions above, the initial estimates satisfy the advertised error bounds for all 1 ≤ k ≤ K.
The last result guarantees that Algorithm 3, when suitably initialized, converges linearly towards M⋆_k up to a certain error floor. Here M⋆_k is the closest among {M⋆_j}_{1≤j≤K} to the point L^0(R^0)^⊤. The proof can be found in Appendix B.4.
Theorem 5. Consider the model (23) under the assumptions in Section 3.1. Suppose that the noise level obeys (24). Choose the step size η and truncating fraction α as recommended in Table 1 (e.g. η = c_1/p_k with 0 < c_1 ≤ 1.3, and 0.6p_k ≤ α ≤ 0.8p_k). Then with probability at least 1 − Ce^{−cn}, the iterates of Algorithm 3 satisfy the claimed linear-convergence bound for all t ≥ 0, provided that the sample size exceeds N ≥ C_0 (nrK/δ²) log N. Here, 0 < c_2 < 1/4 and C, c, C_2 > 0 are some universal constants, and c_0, c_1 > 0 (resp. C_0 > 0) are some sufficiently small (resp. large) constants.
Putting pieces together: proof of Theorems 1 and 2. With the above performance guarantees in place, we are ready to establish the main theorems. Note that due to sample splitting in Algorithm 4, we shall apply Theorems 3 and 5 to the dataset {A_i, y_i}_{1≤i≤N}, and Theorem 4 to the dataset {A′_i, y′_i}_{1≤i≤N_MLR}, with the accuracy levels δ and ǫ in Theorems 3 and 4 set to some sufficiently small constants c_3, c_4 > 0. These choices, in conjunction with our assumption on σ in Theorem 2 as well as Proposition 1, guarantee that the initialization L^0(R^0)^⊤ lies in the neighborhood of M⋆_k as required by (33). This allows us to invoke Theorem 5 to conclude the proof of Theorem 2. Finally, Theorem 1 follows by simply setting the noise level σ = 0 in Theorem 2.

Discussion
This paper develops a three-stage algorithm for the mixed low-rank matrix sensing problem, which is provably efficient in terms of both sample and computational complexities. Having said this, there are numerous directions that are worthy of further investigation; we single out a few in the following.
To begin with, while our required sample complexity scales linearly (and optimally) w.r.t. the matrix dimension max{n_1, n_2}, its dependency on other salient parameters (e.g. the number K of components, and the ranks {r_k} of the ground-truth matrices {M⋆_k}) is likely sub-optimal. Improving the sample efficiency in these aspects is certainly an interesting direction to explore. In addition, in the presence of random noise, the performance of ScaledTGD saturates after the number of samples exceeds a certain threshold. It would be helpful to investigate other algorithms like expectation-maximization to see whether there is any performance gain one can harvest. Furthermore, our current theory builds upon the Gaussian designs {A_i}, which often do not capture practical scenarios. It is of great practical importance to develop efficient algorithms that can accommodate a wider range of design matrices {A_i}, for instance, the case of mixed low-rank matrix completion. Last but not least, it would be of interest to study more general meta-learning settings in the presence of both light and heavy tasks (beyond the current single-sample setting) [KSS+20], and to see how sample complexities can be reduced (compared to meta-learning for mixed regression) by exploiting such low-complexity structural priors.
Algorithm 5: The tensor method for mixed linear regression [YCS16, Algorithm 1]
1 Input: {a_i, y_i}_{1≤i≤N}.
2 Randomly split the samples into two disjoint sets {a_i, y_i}_{1≤i≤N_1}, {a′_i, y′_i}_{1≤i≤N_2} such that N = N_1 + N_2, by assigning each sample to either dataset with probability 0.5.
, where T is defined in (36).
5 Denote the rank-K SVD of M_2 as U_2 Σ_2 V_2^⊤, and compute the whitening matrix W := U_2 Σ_2^{−1/2}.
7 Run the robust tensor power method [YCS16, Algorithm 2] on the whitened tensor M_3(W, W, W).

Acknowledgments. This work was supported in part by a Princeton Schmidt Data-X Research Award. We would like to thank Qixing Huang, who taught us the symmetry synchronization problem in computer vision that largely inspired this research.

A The tensor method for mixed linear regression
This section reviews the tensor method proposed in [YCS16] for solving mixed linear regression. For simplicity of exposition, we consider the noiseless case where we have access to the samples {a_i, y_i}_{1≤i≤N} obeying

    y_i = a_i^⊤ β⋆_k,  if i ∈ Ω⋆_k.    (35)

Our goal is to recover the ground truths {β⋆_k}_{1≤k≤K}.

Notation for tensors. For two matrices A and B, denote by A⊗B their Kronecker product, and let A^{⊗3} represent A ⊗ A ⊗ A. For a symmetric tensor T ∈ R^{d×d×d} and matrices A, B, C with d rows, let T(A, B, C) denote the multi-linear matrix multiplication such that

    [T(A, B, C)]_{i,j,k} := Σ_{a,b,c} T_{a,b,c} A_{a,i} B_{b,j} C_{c,k}.

In addition, let ∥T∥ stand for the operator norm of T, namely, ∥T∥ := sup_{x:∥x∥_2=1} |T(x, x, x)|.
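The multi-linear multiplication, together with a crude check of the operator norm, can be sketched via einsum. This is our own illustration; the random-search routine only produces a lower estimate of ∥T∥.

```python
import numpy as np

def multilinear(T, A, B, C):
    """[T(A,B,C)]_{ijk} = sum_{a,b,c} T_{abc} A_{ai} B_{bj} C_{ck}."""
    return np.einsum('abc,ai,bj,ck->ijk', T, A, B, C)

def op_norm_lower_bound(T, n_trials=2000, seed=0):
    """Random-search lower bound on ||T|| = sup_{||x||_2 = 1} |T(x,x,x)|."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_trials):
        x = rng.standard_normal(T.shape[0])
        x /= np.linalg.norm(x)
        best = max(best, abs(float(np.einsum('abc,a,b,c->', T, x, x, x))))
    return best
```

For a rank-one tensor v ⊗ v ⊗ v with a unit vector v, the operator norm equals 1, and multiplying by identity matrices returns the tensor unchanged.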
The tensor method: algorithm and rationale.
We summarize the tensor method in Algorithm 5, which is mostly the same as [YCS16, Algorithm 1] and included here for completeness.
In the following, we explain the intuitions behind its algorithmic design. Given data {a_i, y_i}_{1≤i≤N} generated according to (35), we compute the empirical moments m_0, M_1, M_2 and M_3; here, letting {e_i}_{1≤i≤d} be the canonical basis of R^d, we define the operator T(•) : R^d → R^{d×d×d} as

    T(v) := Σ_{i=1}^d (v ⊗ e_i ⊗ e_i + e_i ⊗ v ⊗ e_i + e_i ⊗ e_i ⊗ v).    (36)

The key observation is that, under the Gaussian design (i.e. a_i i.i.d. ∼ N(0, I_d)), M_2 and M_3 reveal crucial second-order and third-order moments of {β⋆_k} (cf. [YCS16, Lemma 1]), where we recall p_k = |Ω⋆_k|/N. This motivates one to apply tensor decomposition [AGH+14] on M_2 and M_3 in order to estimate {β⋆_k} and {p_k}. Indeed, the estimates {β_k} and {ω_k} returned by Algorithm 5 serve as our estimates of {β⋆_k} and {p_k}, respectively.

Remark 1 (Sample splitting). Similar to [YCS16], we assume that m_0 and M_2 are computed using one set of data, while M_1 and M_3 are obtained based on another independent set of samples. This sample splitting strategy ensures that the whitening matrix W is independent of M_3, thus simplifying theoretical analysis.
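The correction operator T(·) admits a direct implementation. Below is our own sketch, along with a sanity check of a few entries implied by the definition.

```python
import numpy as np

def T_operator(v):
    """T(v) = sum_i (v (x) e_i (x) e_i + e_i (x) v (x) e_i + e_i (x) e_i (x) v)."""
    d = len(v)
    out = np.zeros((d, d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0
        out += np.einsum('a,b,c->abc', v, e, e)
        out += np.einsum('a,b,c->abc', e, v, e)
        out += np.einsum('a,b,c->abc', e, e, v)
    return out
```

For example, with v = e_1 in R², the diagonal entry T(v)[0,0,0] picks up a contribution of v_0 from each of the three terms, giving 3.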

B Proofs for Section 5
For notational simplicity, we use dist U ,V throughout to denote the following subspace estimation error:

B.1 Proof of Theorem 3
The proof is decomposed into two steps: we first develop an upper bound on ∥Y − E[Y]∥ (where Y is as defined in Algorithm 1), and then combine this with Wedin's theorem to control the subspace distance dist_{U,V}.
Step 1: controlling ∥Y − E[Y]∥. Lemma 1 asserts that, with probability at least 1 − Ce^{−cn} for some universal constants C, c > 0, the stated concentration bound holds as long as the sample size N satisfies (29); this together with the triangle inequality further implies the desired control of ∥Y − E[Y]∥. In addition, [CP11, Lemma 1.1] supplies a complementary bound that holds with probability at least 1 − Ce^{−cn} for some constants C, c > 0.

Step 2: controlling dist_{U,V}.
Before embarking on the control of dist_{U,V}, we make the following claim.

Claim 1. Under the assumptions of Theorem 3, the bounds (39a) and (39b) hold.

With this claim in place, we are ready to apply Wedin's theorem [Wed72] to obtain (40), with the proviso that the perturbation ∆ defined in (38) is sufficiently small. On the other hand, if this proviso fails, then we claim that (40) trivially holds; this can be seen by observing that dist_{U,V} ≤ 1, while the right-hand side of (40) is greater than 1 in that case. Substituting this relation into (40) immediately leads to the advertised bound (28) in Theorem 3.
Proof of Claim 1. Recall the decomposition of E[Y]; with the shorthand introduced there, this together with Assumption 1 yields the first bound, which completes the proof of (39a). Next, we turn attention to (39b). Denote the SVD of E[Y], where diag({p_k Σ⋆_k}_{1≤k≤K}) is an R×R full-rank diagonal matrix with blocks p_1 Σ⋆_1, . . ., p_K Σ⋆_K on the diagonal. This implies a lower bound on σ_R(E[Y]), where the last inequality uses the assumption that p_k ≳ 1/K. This establishes (39b).
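For concreteness, the Stage-1 spectral step analyzed in this subsection can be illustrated numerically. The sketch below (illustrative sizes and tolerance, not the theorem's constants) forms the data matrix Y = (1/N) Σ_i y_i A_i, whose expectation under Gaussian designs is Σ_k p_k M⋆_k, and compares the top-R singular subspace of Y with that of the population matrix.

```python
# Minimal sketch of the spectral estimator of Algorithm 1: the top-R singular
# subspaces of Y = (1/N) sum_i y_i A_i approximate those of E[Y] = sum_k p_k M_k*.
import numpy as np

rng = np.random.default_rng(1)
n1, n2, N, K, R = 20, 20, 20_000, 2, 2
M = [np.outer(u, v) for u, v in
     [(rng.standard_normal(n1), rng.standard_normal(n2)) for _ in range(K)]]
M = [Mk / np.linalg.norm(Mk) for Mk in M]            # unit Frobenius norm

labels = rng.integers(0, K, size=N)
A = rng.standard_normal((N, n1, n2))
y = np.array([np.sum(A[i] * M[labels[i]]) for i in range(N)])

Y = np.einsum("i,ijk->jk", y, A) / N                 # data matrix of Algorithm 1
U = np.linalg.svd(Y)[0][:, :R]

EY = 0.5 * (M[0] + M[1])                             # population counterpart
Ustar = np.linalg.svd(EY)[0][:, :R]
dist = np.linalg.norm(U @ U.T - Ustar @ Ustar.T, 2)
print(f"subspace distance: {dist:.3f}")
```

The distance shrinks as N grows, in line with the N^{-1/2}-type concentration used in Step 1.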

B.2 Proof of Theorem 4
Step 1: basic properties of the auxiliary mixed linear regression problem. We begin by formally characterizing the intermediate mixed linear regression problem in Stage 2. It is easily seen from Section 2.2 that, for i ∈ Ω⋆_k, one has y_i = ⟨a_i, β_k⟩ + ξ_i, where the additional term ξ_i accounts for the subspace estimation error. In words, the observations {y_i} can be equivalently written in the mixed linear regression form, where {β_k} constitute the underlying parameters, {a_i} the measurement vectors, and {ξ_i} the measurement noise. We then focus on characterizing the properties of a_i and ξ_i. Recall from Algorithm 2 that a_i = vec(U^⊤ A_i V). In view of the independence between {A_i} and U, V, one can deduce that a_i ∼ N(0, I_d), where d := R^2. Again, leveraging the independence between {A_i, ζ_i} and U, V, we can compute the variance of ξ_i; for notational convenience, we shall denote this variance by σ^2. More importantly, the measurement vectors {a_i} are independent of the measurement noise {ξ_i}. To see this, one computes the cross-correlation; here the second equality follows from the independence between ζ_i and A_i, U, V, whereas the last line utilizes the independence between A_i and U, V and the isotropic property of A_i.
In conclusion, in Line 3 of Algorithm 2 we are equivalently faced with a d-dimensional mixed linear regression problem with data {a_i, y_i}_{1≤i≤N}, which satisfies y_i = ⟨a_i, β_k⟩ + ξ_i for i ∈ Ω⋆_k, with ξ_i independent of a_i.
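The distributional claim a_i = vec(U^⊤ A_i V) ∼ N(0, I_d) can be verified empirically. The sketch below (illustrative sizes) draws Gaussian design matrices, projects them through orthonormal U and V, and checks that the resulting vectors have identity covariance.

```python
# If U (n1 x R) and V (n2 x R) have orthonormal columns and A has i.i.d. N(0,1)
# entries, then vec(U^T A V) is N(0, I_{R^2}); we check the covariance empirically.
import numpy as np

rng = np.random.default_rng(2)
n1, n2, R, N = 15, 12, 2, 20_000
U, _ = np.linalg.qr(rng.standard_normal((n1, R)))
V, _ = np.linalg.qr(rng.standard_normal((n2, R)))

a = np.array([(U.T @ rng.standard_normal((n1, n2)) @ V).ravel() for _ in range(N)])
cov = a.T @ a / N                                    # should be close to I_{R^2}
err = np.linalg.norm(cov - np.eye(R * R), 2)
print(f"covariance error: {err:.3f}")
```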
Step 2: performance of the tensor method.
Next, we characterize the performance of the tensor method for solving the above mixed linear regression problem. Our proof follows closely that of [YCS16, Theorem 1], with minor modifications to accommodate the noise {ξ_i}; therefore we only provide a sketch here.
Recall that in Algorithm 5, we randomly split the input data {a_i, y_i}_{1≤i≤N} into two sets {a_i, y_i}_{1≤i≤N_1} and {a′_i, y′_i}_{1≤i≤N_2} (with slight abuse of notation). This sample splitting strategy is adopted merely to decouple statistical dependence and facilitate the analysis. The high-level idea of the proof of [YCS16, Theorem 1] is simple to state: if the quantities in (45) are sufficiently small, then the tensor method returns reliable estimates of {β_k}; see [YCS16, Eq. (24) in Section 5.4.1]. Here, the empirical moments M_2, M_3 and the whitening matrix W are defined in Algorithm 5. With this connection in place, it suffices to control the quantities in (45). While the analysis in [YCS16, Section 5.4.2] only applies to the noiseless mixed linear regression problem, we can easily modify it to accommodate our noisy case (44). The trick is to augment {β_k} and {a_i} as follows:

β_k^{aug} := [β_k; σ] ∈ R^{d+1} and a_i^{aug} := [a_i; ξ_i/σ] ∈ R^{d+1}.

The advantage is clear: the noisy mixed linear regression problem (44) can be equivalently phrased as a noiseless one, namely y_i = ⟨a_i^{aug}, β_k^{aug}⟩ for all i ∈ Ω⋆_k, with a_i^{aug} ∼ N(0, I_{d+1}). Similarly, we can define a_i′^{aug} analogously, and introduce the augmented versions of the empirical moments, where T^{aug}(·) is defined analogously to (36). By virtue of the augmentation procedure, M_2 (resp. M_3) is a sub-matrix (resp. sub-tensor) of M_2^{aug} (resp. M_3^{aug}). Consequently, the quantities in (45) are bounded by their augmented counterparts. With the above augmented vectors/matrices/tensors in place, one can follow the analysis in [YCS16, Section 5.4.2] to upper bound these quantities. One subtle issue is that our sampling scheme is slightly different from the one in [YCS16], where each sample carries an i.i.d. label; nevertheless, it is easy to check that this difference is minor and does not affect the result of the analysis. Indeed, repeating the analysis in [YCS16, Section 5.4] yields the conclusion that, in order to achieve ǫ errors (30) with probability at least 1 − γ, it suffices to require the sample complexities to exceed the threshold below (analogous to [YCS16, Eq.
(13)]). Here C_1, C_2 > 0 are some sufficiently large constants, and the simplifications (i) and (ii) hold due to the bounds d = R^2 ≤ K^2 r^2 and log(K log n) ≲ log n, together with the following claim (in particular, (51) and (52) therein).
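The augmentation trick above can be sanity-checked numerically: a noisy measurement y = ⟨a, β⟩ + σg with g standard normal is exactly a noiseless measurement with the augmented vectors a^{aug} = [a; g] and β^{aug} = [β; σ].

```python
# Numerical check of the noise-augmentation identity used in the proof sketch.
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 6, 0.3
a, g = rng.standard_normal(d), rng.standard_normal()
beta = rng.standard_normal(d)

y_noisy = a @ beta + sigma * g                       # noisy measurement
a_aug = np.concatenate([a, [g]])                     # a^aug = [a; g]
beta_aug = np.concatenate([beta, [sigma]])           # beta^aug = [beta; sigma]
assert np.isclose(y_noisy, a_aug @ beta_aug)         # noiseless in augmented form
print("augmented identity holds")
```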
Claim 2. Instate the assumptions of Theorem 4. Then the following hold.

1. The ground-truth matrices {M⋆_k}_{1≤k≤K} satisfy (50) for all 1 ≤ i, j ≤ K with i ≠ j.
2. In addition, the parameters {β_k}_{1≤k≤K} obey (51) for all 1 ≤ k, i, j ≤ K with i ≠ j.
3. Finally, the bound (52) holds.

Armed with (49b) and (49d), we can plug in the bounds d = R^2 ≤ K^2 r^2 and log(K log n) ≲ log n to complete the proof of Theorem 4.
Proof of Claim 2. With regards to the first part of (50), the bound follows from a direct computation, where the second line utilizes Assumption 1. The second part of (50) follows immediately from the first part and some elementary calculations. Next, we turn to proving (51). Recall the relevant definitions as well as the lower bound, where the last inequality uses the assumption that dist_{U,V} ≤ c_1/(KΓ^2) ≤ 0.05; this justifies the first part of (51). To prove the second part of (51), we start with a decomposition which, together with the triangle inequality, yields the desired bound. In light of the first part of (50), the first part of (51), and our assumption on dist_{U,V}, this establishes the second part of (51).
Finally, it remains to prove (52). In view of the assumption on the noise level, it suffices to bound the quantity of interest by σ times the relevant Frobenius norm. Towards this, we find it helpful to define an auxiliary matrix; the last inequality then follows from the definition Γ = max_k ∥M⋆_k∥_F / min_k ∥M⋆_k∥_F and the first part of (51). This together with Weyl's inequality implies a bound on the singular values. As a result, we arrive at the desired estimate, which in conjunction with (53) completes the proof of (52).

B.3 Proof of Proposition 1
To begin with, the triangle inequality gives the decomposition (55). Regarding the first term on the right-hand side of (55), we plug in the definition (13) of S_k. With regards to the second term on the right-hand side of (55), we observe that: (i) L_k R_k^⊤ is the best rank-r_k approximation of U Ŝ_k V^⊤, while U S_k V^⊤ is also of rank r_k; and (ii) β̂_k = vec(Ŝ_k) and β_k = vec(S_k). Substitution into (55) establishes (32).
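The rounding step analyzed in this subsection can be sketched in a few lines: reshape the estimated vector into S_k, lift it to U S_k V^⊤, and take the best rank-r_k approximation via a truncated SVD. The following sketch (illustrative dimensions) also confirms the Eckart–Young optimality used in step (i).

```python
# Rounding sketch: form U S_k V^T and keep its best rank-r approximation L_k R_k^T.
import numpy as np

rng = np.random.default_rng(4)
n1, n2, R, r = 15, 12, 4, 1
U, _ = np.linalg.qr(rng.standard_normal((n1, R)))
V, _ = np.linalg.qr(rng.standard_normal((n2, R)))

S_k = rng.standard_normal((R, R))        # stand-in for the estimated S_k
X = U @ S_k @ V.T                        # lifted estimate U S_k V^T
Uu, s, Vt = np.linalg.svd(X)
M_hat = (Uu[:, :r] * s[:r]) @ Vt[:r]     # L_k R_k^T: best rank-r approximation

# Eckart-Young: the truncation error equals the tail of the singular values
err = np.linalg.norm(X - M_hat)
assert np.isclose(err, np.sqrt(np.sum(s[r:] ** 2)))
```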

B.4 Proof of Theorem 5
We shall only prove the local convergence w.r.t. the matrix M⋆_1; the proof for the other components is identical and hence omitted. Our proof is decomposed into three steps.
1. Study the ScaledTGD dynamics (particularly the population-level dynamics), and control the effects of mislabeling and finite-sample errors.
2. Show that if the estimation error is larger than the error floor (namely, the last term in (34)), then one step of the ScaledTGD update contracts the error by a constant factor.
3. Show that, once the estimation error gets smaller than this error floor, then the estimation errors remain small in subsequent iterations.
Before continuing, we note that Condition (33) with k = 1 implies the existence of some constant c_1 > 0 such that (L_0, R_0) ∈ B, where B is defined in (56). This follows from elementary inequalities involving ∥M⋆_1∥_F together with Assumption 1. We isolate Condition (56) since it is more convenient to work with in the analysis.
Notation. To simplify the presentation, we shall often let (L, R) denote an iterate lying within B (cf. (56)), and define the corresponding estimation errors accordingly. The truncation level for a prescribed truncation fraction α is denoted by τ, where Q_α is the α-quantile defined in Section 1.2. We also define several auxiliary functions and quantities, where φ stands for the probability density function of a standard Gaussian random variable.
Step 1: characterizing the ScaledTGD dynamics. The above notation allows one to express the ScaledTGD update rule (16) compactly as (61). Recall also the form of y_i for any i ∈ Ω⋆_k. The following result makes apparent a useful decomposition of the ScaledTGD update rule.
Claim 3. Recall the notation (59) and (60). The ScaledTGD update rule (61) can be written as (63), where (L^+_pop, R^+_pop) represents the population-level update (64), and the residual components ∆_mis and ∆_fs are given in (65).

Before moving on, we note that it is crucial to control the sizes of ∆_mis and ∆_fs, where "mis" stands for "mislabeling" and "fs" stands for "finite sample". Regarding ∆_mis, Fact 1 supplies a bound for all k ≠ 1.
Here, the last inequality holds due to the assumption that (L, R) ∈ B defined in (56). Consequently, we obtain the desired control of ∆_mis. Next, we turn to the term ∆_fs. Note that rank(∆_k) ≤ 2r. Therefore, Lemmas 1 and 2 (see Remark 2) imply that, with probability at least 1 − Ce^{−cn} for some constants c, C > 0, the following holds simultaneously for all (L, R) ∈ B (cf. (56)):

1. the truncation level τ obeys 0.54 < τ;
2. for any real matrix W with n_2 rows and rank at most r, the TRIP-type bound holds.

The above bounds will play a useful role in subsequent steps.
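The ScaledTGD iteration studied here can be sketched in code. The following is a schematic single-component loop, not the paper's exact Algorithm 3: the truncation fraction, step size, and problem sizes are illustrative, and the residual-quantile truncation plays the role of the level τ above.

```python
# Schematic ScaledTGD sketch: truncate samples with large residuals (likely from
# other mixture components), then take a scaled gradient step on the kept set.
import numpy as np

rng = np.random.default_rng(5)
n, r, N = 10, 1, 4000
M1 = np.outer(*rng.standard_normal((2, n))); M1 /= np.linalg.norm(M1)
M2 = np.outer(*rng.standard_normal((2, n))); M2 /= np.linalg.norm(M2)
labels = rng.integers(0, 2, size=N)
A = rng.standard_normal((N, n, n))
y = np.einsum("ijk,ijk->i", A, np.stack([M1 if l == 0 else M2 for l in labels]))

# initialize near M1 (local convergence regime)
X0 = M1 + 0.1 * rng.standard_normal((n, n)) / n
U, s, Vt = np.linalg.svd(X0)
L, R = U[:, :r] * np.sqrt(s[:r]), Vt[:r].T * np.sqrt(s[:r])

eta, keep = 0.5, 0.4                                  # step size, kept fraction
err0 = np.linalg.norm(L @ R.T - M1)
for _ in range(20):
    res = np.einsum("ijk,jk->i", A, L @ R.T) - y      # residuals
    tau = np.quantile(np.abs(res), keep)              # truncation level
    S = np.abs(res) <= tau                            # trusted samples
    G = np.einsum("i,ijk->jk", res[S], A[S]) / S.sum()
    L, R = (L - eta * G @ R @ np.linalg.inv(R.T @ R),
            R - eta * G.T @ L @ np.linalg.inv(L.T @ L))
err = np.linalg.norm(L @ R.T - M1)
print(f"error: {err0:.3f} -> {err:.3f}")
```

Samples from the other component have residuals on the order of ∥M⋆_1 − M⋆_2∥_F and are filtered out by the quantile truncation, so the kept set behaves like clean measurements of M⋆_1, mirroring the decomposition into population, mislabeling, and finite-sample terms.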
Step 2: per-iteration improvement above the error floor (34). Let us examine the Euclidean error of the next iterate. Since L^+_pop and R^+_pop in (64) coincide with the update rule of scaled gradient descent for low-rank matrix factorization, [TMC20, Theorem 3] tells us that if 0 < ηp_1 w_1 ≤ 2/3 (which holds true under our choices η ≤ 1.3/p_1 and α ≤ 0.8p_1), then the population-level error contracts. It remains to control the perturbation terms in (69b), accomplished as follows.
Claim 4. With the quantity B defined in (71), the perturbation bounds (72) hold.

Putting (70) and (72) back into (69), it remains to control B. First, the relations δ ≤ c_0/K and ∥∆_1∥_F ≥ C_2 Kδσ (for some sufficiently large constant C_2 > 0) imply (74) for some sufficiently small constant c_3 > 0. Moreover, a direct calculation gives (75) for some sufficiently small constant c_4 > 0. Putting (74) and (75) back into (71), and invoking ∥∆_1∥_F ≤ c_1 σ_{r_1}(M⋆_1), we deduce the existence of some small constant c_5 > 0 such that (76) holds. Substituting this into (73), we arrive at the desired bound for some constant c_2 > 0; this is because in (73) we have p_1 ≍ 1/K by assumption and w_1 ≳ 1 according to (67).
Step 3: no blow-up below the error floor (34). Suppose that the estimation error satisfies the bound (77). We intend to show that, in this case, the estimation error ∥∆^+_1∥_F of the next iterate satisfies the same upper bound (77); if this claim were true, then combining it with our results in Step 2 would complete the convergence analysis of ScaledTGD.
Note that (73) remains valid when ∥∆_1∥_F is below the error floor, which implies the corresponding bound on ∥∆^+_1∥_F. Recalling the definition of B in (71), one can bound B accordingly.
By the assumption that δ ≲ 1/K and σ ≲ min_k ∥M⋆_k∥_F / K, we have ∥∆_1∥_F ≲ σ according to (77), and thus KB/σ_{r_1}(M⋆_1) ≲ σ/σ_{r_1}(M⋆_1) ≲ 1. Consequently, the right-hand side of (78) obeys the claimed bound.

Proof of Claim 4. We shall only prove the first part of (72), concerning ∥E_L (R^+_pop)^⊤∥_F; the analysis for ∥L^+_pop (E_R)^⊤∥_F is essentially the same. By the triangle inequality, we obtain the decomposition (79). We utilize (66) and (68) from Step 1 to control the terms above. The first term of (79) is handled by recognizing the low-rank structure of the factors involved. Regarding the second term of (79), we invoke the definition of B (cf. (56)) as well as the following fact (80), which will be proved at the end of this section. Combining these with (79) establishes that ∥E_L (R^+_pop)^⊤∥_F ≤ 2B, which is the first part of (72). Finally, for the second part of (72), we can apply similar techniques.

Proof of (80). Weyl's inequality tells us that both L and R have full column rank r_1. Consequently, we may write the SVDs of L and R. With these SVD representations in place, it is easy to check the desired identity; in addition, the orthonormality of V_L and V_R implies the requisite bound. Combining this with (81) completes the proof.

C Technical lemmas
This section collects several technical lemmas that are helpful for our analysis (particularly for the analysis of Stage 3). For notational convenience, we define the set R_r of matrices with rank at most r as in (82). We remind the reader of the definitions of the indicator 𝟙(a; τ) := 𝟙{|a| ≤ τ} and the α-quantile Q_α(·).

Variants of matrix RIP. We recall the standard notion of the restricted isometry property (RIP) from the matrix sensing literature, and introduce a variant called the truncated RIP (TRIP).
1. We say that {A_i}_{1≤i≤m} satisfy (r, δ)-RIP if

|(1/m) Σ_{i=1}^m ⟨A_i, X⟩⟨A_i, Z⟩ − ⟨X, Z⟩| ≤ δ ∥X∥_F ∥Z∥_F (83)

holds simultaneously for all X, Z ∈ R_r.

2. We say that {A_i}_{1≤i≤m} satisfy (r, δ)-TRIP if

|(1/m) Σ_{i=1}^m ⟨A_i, X⟩ 𝟙(⟨A_i, X⟩; τ∥X∥_F) ⟨A_i, Z⟩ − w(τ)⟨X, Z⟩| ≤ δτ ∥X∥_F ∥Z∥_F (84)

holds simultaneously for all X, Z ∈ R_r and for all 0 ≤ τ ≤ 1.35.
As it turns out, the Gaussian design satisfies the above notion of RIP and TRIP, as formalized below.
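As a quick numerical illustration, the pointwise version of the RIP-type identity (for fixed X and Z, before any uniform covering argument) can be checked directly: under Gaussian designs, E[⟨A, X⟩⟨A, Z⟩] = ⟨X, Z⟩, and the empirical average deviates by O(m^{-1/2}).

```python
# Pointwise RIP-type concentration for a Gaussian design: the bilinear empirical
# average (1/m) sum <A_i,X><A_i,Z> concentrates around <X,Z> for fixed X, Z.
import numpy as np

rng = np.random.default_rng(6)
n, m = 12, 20_000
A = rng.standard_normal((m, n, n))

X = np.outer(*rng.standard_normal((2, n))); X /= np.linalg.norm(X)  # rank-1, unit
Z = np.outer(*rng.standard_normal((2, n))); Z /= np.linalg.norm(Z)  # rank-1, unit

ax = np.einsum("ijk,jk->i", A, X)
az = np.einsum("ijk,jk->i", A, Z)
dev = abs(np.mean(ax * az) - np.sum(X * Z))
print(f"RIP-type deviation: {dev:.4f}")              # O(1/sqrt(m)) for fixed X, Z
```

Upgrading such pointwise bounds to hold uniformly over all low-rank X, Z is exactly the role of the covering arguments in the proofs below.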
Empirical quantiles. Our next technical lemma is a uniform concentration result for empirical quantiles. Given the design matrices {A_i}_{1≤i≤N}, the index sets {Ω⋆_k}_{1≤k≤K} and low-rank matrices {X_k}_{1≤k≤K}, we define several sets of residuals (cf. (85)). In addition, let us introduce the set T_1 of low-rank matrix tuples as in (86), where c_0 > 0 is some sufficiently small constant. Recall that Q_α(D) denotes the α-quantile of D, as defined in (2).
Lemma 2. Let {A_i}_{1≤i≤N} be random matrices in R^{n_1×n_2} with i.i.d. N(0, 1) entries. Set n = max{n_1, n_2}, and suppose the index sets {Ω⋆_k}_{1≤k≤K} are disjoint and satisfy condition (18). If 0.6p_1 ≤ α ≤ 0.8p_1 and N ≥ C_0 nrK^3 log N for some sufficiently large constant C_0 > 0, then there exist universal constants C, c > 0 such that, with probability at least 1 − Ce^{−cn}, the stated quantile bounds hold simultaneously for all (X_1, . . ., X_K) ∈ T_1, where D is defined in (85).

Remark 2. We can further incorporate additional Gaussian noise {ζ_i} into Lemmas 1 and 2, where ζ_i i.i.d. ∼ N(0, σ^2). For example, we claim that, with the same sample complexity m as in Lemma 1, the noisy version (87) of (r, δ)-TRIP (84) holds. To see this, let us define augmented matrices A_i^{aug}, X^{aug} and Z^{aug}, where * stands for some auxiliary i.i.d. N(0, 1) entries. Observe that {A_i^{aug}}_{1≤i≤N} are random matrices with i.i.d. N(0, 1) entries; in addition, rank(X^{aug}) = rank(X) + 1, rank(Z^{aug}) = rank(Z), and ∥X^{aug}∥_F^2 = ∥X∥_F^2 + σ^2; finally, ⟨A_i, X⟩ − ζ_i = ⟨A_i^{aug}, X^{aug}⟩, ⟨A_i, Z⟩ = ⟨A_i^{aug}, Z^{aug}⟩, and ⟨X, Z⟩ = ⟨X^{aug}, Z^{aug}⟩. Therefore, the left-hand side of (87) can be equivalently written in the noiseless form (84) in terms of these augmented matrices, thus allowing us to apply Lemma 1 to prove (87). This augmentation trick can be applied to Lemma 2 as well, which we omit here for brevity.
One miscellaneous result.Further, we record below a basic property concerning the function w(•).
Fact 1. The function w(·) defined in (60) satisfies w(x)/x^2 ≤ w(y)/y^2 for all 0 < x ≤ y ≤ 1.35.

Proof. The claim is equivalent to saying that g(x) := w(x)/x^2 is nondecreasing over (0, 1.35], or equivalently, that g′(x) ≥ 0 on this interval. This can be verified numerically (see Figure 2), which completes the proof.
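The numerical verification can be reproduced in a few lines. Since the definition (60) of w(·) is not restated here, the sketch below assumes the natural candidate w(τ) = E[Z^2 𝟙{|Z| ≤ τ}] for Z ∼ N(0, 1) (an assumption, labeled as such); under that choice, g(x) = w(x)/x^2 is indeed nondecreasing on (0, 1.35].

```python
# Numerical monotonicity check for g(x) = w(x)/x^2 on (0, 1.35], ASSUMING
# w(t) = E[Z^2 1{|Z| <= t}] = erf(t/sqrt(2)) - 2*t*phi(t); this closed form is
# our assumption, not a quotation of the paper's definition (60).
import math

def phi(t):                                           # standard normal pdf
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def w(t):                                             # assumed truncated 2nd moment
    return math.erf(t / math.sqrt(2)) - 2 * t * phi(t)

grid = [0.01 * k for k in range(1, 136)]              # (0, 1.35] with step 0.01
g = [w(t) / t**2 for t in grid]
assert all(g[i] <= g[i + 1] + 1e-12 for i in range(len(g) - 1))
print(f"g is nondecreasing; g(1.35) = {g[-1]:.3f}")
```

Interestingly, under this candidate w the derivative condition becomes tight near x = 1.35, which is consistent with 1.35 appearing as the endpoint in the statement.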
The rest of this section is devoted to proving Lemmas 1 and 2. We use standard notions (e.g. the subgaussian norm ∥·∥_{ψ_2}) and properties of subgaussian random variables (cf. [Ver18, Section 2]). For notational convenience, we define R_r^{norm}, the normalized version of R_r from (82), as the set of matrices in R_r with unit Frobenius norm (cf. (90)). Before moving on, we record two results that will be useful throughout the proof.
Let a_i i.i.d. ∼ N(0, I_d), 1 ≤ i ≤ m. There exist universal constants C, c > 0 such that, with probability at least 1 − Ce^{−cd}, the stated bound holds.

Proof. This result follows from [Ver18, Corollary 7.3.3] and the union bound.

C.1 Proof of Lemma 1
The first result, on RIP, has been established in the literature (e.g. [CP11, Theorem 2.3]), and hence we only need to prove the second result on TRIP. We first restrict attention to the case τ ≥ m^{−100}; at the end of this subsection, we will prove TRIP for the case 0 ≤ τ < m^{−100} separately. By homogeneity, it suffices to show that (91) holds simultaneously for all (X, Z, τ) ∈ T_TRIP, where T_TRIP := R_r^{norm} × R_r^{norm} × [m^{−100}, 1.35]. The proof consists of two steps: (1) we replace the discontinuous function 𝟙 by a Lipschitz continuous surrogate χ and establish a uniform concentration result for χ; (2) we show that the discrepancy incurred by replacing 𝟙 with χ is uniformly small. Our argument is conditioned on the high-probability event that {A_i}_{i=1}^m satisfy (2r, δ)-RIP.
Step 1: uniform concentration for the surrogate χ. For notational convenience, define the empirical average F_m(X, Z, τ) and its expectation F(X, Z, τ) as in (94), where the expectation is taken w.r.t. {A_i} while treating (X, Z, τ) as fixed. With these preparations in place, we set out to prove that, if m ≥ C_0 nrδ^{−2} log m, then with probability at least 1 − Ce^{−cn}, (95) holds simultaneously for all (X, Z, τ) ∈ T_TRIP; here C_0 > 0 is some sufficiently large constant, and C, c > 0 are universal constants. First, consider any fixed point (X, Z, τ) ∈ T_TRIP. Note that |⟨A_i, X⟩ χ(⟨A_i, X⟩; τ)| ≤ τ is bounded, and that the subgaussian norm of ⟨A_i, Z⟩ obeys ∥⟨A_i, Z⟩∥_{ψ_2} = ∥N(0, 1)∥_{ψ_2} ≲ 1. As a result, invoking [Ver18, Theorem 2.6.2] yields a Bernstein-type tail bound for all t ≥ 0, with some constant c_1 > 0. Next, we construct an ǫ-net to cover T_TRIP. In view of [CP11, Lemma 3.1], the set R_r^{norm} defined in (90) has an ǫ-net (in the ∥·∥_F distance) of cardinality at most (9/ǫ)^{3nr}. In addition, we can cover the interval [m^{−100}, 1.35] with precision ǫ_τ using no more than 2/ǫ_τ equidistant points. Putting all this together, we can construct a set M_TRIP ⊆ R_r^{norm} × R_r^{norm} × [0, 1.35] of cardinality at most (9/ǫ)^{6nr}(2/ǫ_τ) such that, for any (X, Z, τ) ∈ T_TRIP, there exists some point (X_0, Z_0, τ_0) ∈ M_TRIP satisfying (96). The union bound then implies that, with probability at least 1 − 2 exp(−c_1 mt^2)(9/ǫ)^{6nr}(2/ǫ_τ), the bound (97) holds uniformly over M_TRIP. In what follows, we choose the parameters so as to achieve a uniformly small error tτ = δτ/4 in (97) with probability at least 1 − 2 exp(−c_3 mδ^2) for some universal constant c_3 > 0. Now, for any (X, Z, τ) ∈ T_TRIP, let (X_0, Z_0, τ_0) ∈ M_TRIP be the point satisfying (96). The approximation error then decomposes into two terms (A) and (B); here (A) is already bounded by δτ/4 by construction, and we can control (B) via a further decomposition.
Step 2: controlling the errors incurred by using the surrogate χ. Similar to (94), we define the analogous quantities for the indicator 𝟙, where the expectation is taken assuming independence between A_i and (X, Z, τ). In this step, we aim to show that, if m ≥ C_0 nrδ^{−2} log m, then with probability at least 1 − Ce^{−cn}, (100) holds simultaneously for all (X, Z, τ) ∈ T_TRIP. If this were true, then combining it with (95) would immediately conclude the proof of Lemma 1.
Towards establishing (100), we start with a decomposition into two terms, where we abuse the notation (A) and (B). In the sequel, we control (A) and (B) separately.

• We first bound (A). The key point is that χ(·; τ) and 𝟙(·; τ) differ only on an interval of length c_χ τ, over which the probability density function of ⟨A_i, X⟩ ∼ N(0, 1) is upper bounded by some constant. By our choice of c_χ in (93), we have (A) ≤ δτ/4.
• We then move on to (B). For notational convenience, given any τ > 0, we let τ′ be a slightly inflated level, which clearly satisfies χ(a; τ) ≤ 𝟙(a; τ) ≤ χ(a; τ′). In addition, defining the corresponding surrogate averages yields a decomposition of (B) into two terms. Next, we demonstrate how to analyze the first term (C); the analysis for the other term is essentially the same. For notational simplicity, define the empirical average F_m^+ and its expectation F^+ associated with χ(·; τ′), where the expectation is again taken assuming that A_i is independent of (X, Z, τ). The error then decomposes into two terms. Regarding the first term on the right-hand side, we follow an argument similar to our previous analysis for (A) to obtain a bound with some sufficiently small constant 0 < c_2 < 1/8. Thus, it remains to show that (103) holds simultaneously for all (X, Z, τ) ∈ T_TRIP. Note that, by definition, F_m^+(X, Z, τ) is the empirical average of a Lipschitz continuous function. Therefore, we can prove (103) by a standard covering argument similar to that in Step 1; we omit the details for brevity. Putting the above bounds together, we establish that (B) ≤ δτ/4.
Proof for the case 0 ≤ τ < m^{−100}. It remains to prove that (91) holds simultaneously for all X, Z ∈ R_r^{norm} (cf. (90)) and all 0 ≤ τ < m^{−100}. We start with the decomposition (104). The second term on the right-hand side of (104) can be bounded directly, where (i) can be seen from the definition of w(·) in (60), and (ii) relies on the observation that our assumption m ≥ C_0 nrδ^{−2} log m implies δ ≳ m^{−1/2}. It thus remains to show that the first term on the right-hand side of (104) is bounded by 0.9δτ. In view of (2r, δ)-RIP, the Cauchy–Schwarz inequality, and the observation that |⟨A_i, X⟩ 𝟙(⟨A_i, X⟩; τ)| ≤ τ, we obtain (105), where the last inequality uses the assumption that τ < m^{−100}. We can then invoke Lemma 3 to deduce that the requisite bound holds simultaneously for all X ∈ R_r^{norm}, provided that m ≥ C_0 nrδ^{−2} log m; here, Z_0 denotes a random variable having the same distribution as |N(0, 1)|. Plugging this into (105) confirms that the first term on the right-hand side of (104) is bounded by 0.9δτ, thus concluding the proof for the case 0 ≤ τ < m^{−100}.
C.2 Proof of Lemma 2

Step 2: lower bounding Q_α(D). For notational convenience, we denote q := 0.7α/p_1 ∈ [0.42, 0.56], and define B_N accordingly. Clearly, by the definition of B_N, one has Q_α(D) ≥ Q_q(Z)/1.01 on the event {B_N ≤ α}, where it can be verified numerically that Q_q(Z)/1.01 ≥ 0.54. Therefore, it suffices to upper bound the probability P(B_N > α). To accomplish this, we first upper bound B_N; here, the last line follows from the assumption that 1 = ∥X_1∥_F ≤ (c_0/K) min_{k≠1} ∥X_k∥_F (see the definition of T_1^{norm} in (106)). Note that X_1 ∈ R_r^{norm}, and for all k ≠ 1 we also have X_k/∥X_k∥_F ∈ R_r^{norm}. Therefore, we can invoke Lemma 3 with m = N_1 = |Ω⋆_1|, τ = Q_q(Z)/1.01, t = 0.15α and ǫ = N_1^{−10} to obtain that, with probability at least 1 − Ce^{−cn} (provided that m ≥ C_0 nrK^2 log m), the following holds simultaneously for all X_1 ∈ R_r^{norm}: the empirical fraction at level Q_q(Z)/1.01 is at most P(Z ≤ Q_q(Z)) + t + 200ǫ/τ = q + 0.15α + 202 N_1^{−10} / Q_q(Z).
Similarly, for all k ≠ 1, one can apply Lemma 3 with m = N_k := |Ω⋆_k|, τ = c_0 Q_q(Z)/(1.01K), t = 0.15α and ǫ = N_k^{−10} to show that, with probability at least 1 − Ce^{−cn} (provided m ≥ C_0 nrK^2 log m), an analogous bound holds simultaneously for all X_k/∥X_k∥_F ∈ R_r^{norm}, where the last inequality relies on the properties of Z. Combining the above two bounds with (109), recalling that p_1 q = 0.7α and α ≍ p_1 ≍ 1/K, and observing that the remaining terms are at most 0.05α as long as N_k ≳ K for all k, we conclude that B_N ≤ α as desired, which further implies Q_α(D) ≥ Q_q(Z)/1.01 ≥ 0.54.

Combining this lower bound with the upper bound in Step 1 completes our proof of Lemma 2.
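The quantile concentration at the heart of Lemma 2 is easy to illustrate in its simplest form: the empirical α-quantile of i.i.d. draws of |N(0, 1)| concentrates around the population quantile at rate N^{-1/2}.

```python
# Empirical-quantile concentration for |N(0,1)|: the sample median should be
# close to the population value Q_0.5(|N(0,1)|) ~ 0.6745.
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
Z = np.abs(rng.standard_normal(N))                   # |N(0,1)| samples
q_emp = np.quantile(Z, 0.5)
print(f"empirical median of |N(0,1)|: {q_emp:.4f}")  # population value ~0.6745
```

Lemma 2 upgrades this pointwise statement to hold uniformly over all low-rank matrix tuples in T_1 via the covering arguments above.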

C.3 Proof of Lemma 3
Throughout the proof, we assume that the ensemble {A_i} obeys (2r, 1/4)-RIP. In view of Lemma 1, this happens with probability at least 1 − C_2 e^{−c_2 n} for some constants c_2, C_2 > 0, as long as m ≥ Cnr. Recall the definition of χ from Appendix C.1, and set the parameter c_χ = 0.01/1.01. One then has 𝟙(a; τ) ≤ χ(a; 1.01τ). In the sequel, we invoke the standard covering argument to upper bound (1/m) Σ_{i=1}^m χ(⟨A_i, X⟩; 1.01τ). First, consider a fixed X ∈ R_r^{norm} independent of {A_i}. In this case we can bound the expectation by P(Z ≤ 1.01τ), where we recall that Z follows the same distribution as |N(0, 1)|. In addition, note that (1/m) Σ_{i=1}^m χ(⟨A_i, X⟩; 1.01τ) is the empirical average of m independent random variables, each lying within [0, 1] and having variance bounded by 2τ. Therefore, for all t ≥ 0, one sees from Bernstein's inequality [Ver18, Theorem 2.8.4] that the deviation probability is at most exp(−c_1 mt^2/(τ + t)), where c_0, c_1 > 0 are universal constants. Let M ⊆ R_r^{norm} be an ǫ-net of R_r^{norm}, whose cardinality is at most (9/ǫ)^{3nr}. The union bound reveals that, with probability at least 1 − (9/ǫ)^{3nr} exp(−c_1 mt^2/(τ + t)), one has sup_{X∈M} (1/m) Σ_{i=1}^m χ(⟨A_i, X⟩; 1.01τ) ≤ P(Z ≤ 1.01τ) + t.
Next, we move on to account for an arbitrary X ∈ R_r^{norm} (which is not necessarily independent of {A_i}). Let X_0 be a point in M obeying ∥X − X_0∥_F ≤ ǫ, and consider the difference (1/m) Σ_i [χ(⟨A_i, X⟩; 1.01τ) − χ(⟨A_i, X_0⟩; 1.01τ)]. Here, inequality (i) holds since χ(·; 1.01τ) is Lipschitz with Lipschitz constant 1/(1.01 c_χ τ) = 100/τ, relation (ii) results from the Cauchy–Schwarz inequality, and (iii) follows since {A_i} obeys (2r, 1/4)-RIP.
Combine the above two inequalities to finish the proof.

D Estimating unknown parameters in Algorithm 4
Throughout the paper, we have assumed knowledge of several problem-specific parameters, e.g. the proportion p_k of the k-th component, the rank r_k of the low-rank matrix M⋆_k, and the rank R = rank(E[Y]). In the sequel, we specify where we need them and discuss how to estimate them in practice.
• In Line 2 of Algorithm 4, when running Algorithm 1, we need to know R = rank(E[Y ]), which can be estimated faithfully by examining the singular values of the data matrix Y .
• In Line 3 of Algorithm 4, when running Algorithm 2, we need to know {r_k}_{1≤k≤K}, where r_k = rank(M⋆_k). Recall from (14) that U S_k V^⊤ ≈ M⋆_k; therefore, r_k can be estimated accurately by examining the singular values of S_k.
• In Line 5 of Algorithm 4, when running Algorithm 3, we need to know p k to set η k and α k appropriately.
It turns out that the outputs {ω k } of the tensor method (see Algorithm 5) satisfy ω k ≈ p k , 1 ≤ k ≤ K.
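The rank-estimation heuristics above amount to a singular-value gap test. The following sketch (synthetic sizes and a hand-picked threshold, for illustration only) recovers R from a noisy observation of a rank-R matrix by counting singular values above the noise level.

```python
# Rank estimation by singular-value thresholding: with noise of spectral norm
# about 2*sigma*sqrt(n), any threshold between that and the smallest signal
# singular value recovers the true rank.
import numpy as np

rng = np.random.default_rng(8)
n, sigma = 30, 0.005
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
svals = np.zeros(n)
svals[:3] = [1.0, 0.7, 0.5]                        # true rank R = 3
Y = (U * svals) @ V.T + sigma * rng.standard_normal((n, n))

s = np.linalg.svd(Y, compute_uv=False)
R_hat = int(np.sum(s > 0.15))                      # threshold above the noise level
print(f"estimated rank: {R_hat}")
```

In practice the threshold can be set by looking for the largest multiplicative gap in the singular-value sequence rather than a fixed constant.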

Figure 1: (a) The relative Euclidean error vs. the iteration count of ScaledTGD in Stage 3 of Algorithm 4 for each of the three components, in the noiseless case. (b) Convergence of ScaledTGD in the noisy case σ = 10^{−5}. (c) The largest relative Euclidean error (at convergence) of ScaledTGD in Algorithm 4, vs. the noise level σ. Each data point is an average over 10 independent trials.
Recall the matrix B with columns β_1, . . ., β_K, and decompose B^⊤B as B^⊤B = D + O. Here, D stands for the diagonal part of B^⊤B with D_kk = ∥β_k∥_2^2, while O is the off-diagonal part of B^⊤B.