Uncertainty quantification for nonconvex tensor completion: Confidence intervals, heteroscedasticity and optimality

We study the distribution and uncertainty of nonconvex optimization for noisy tensor completion -- the problem of estimating a low-rank tensor given incomplete and corrupted observations of its entries. Focusing on a two-stage estimation algorithm proposed by Cai et al. (2019), we characterize the distribution of this nonconvex estimator down to fine scales. This distributional theory in turn allows one to construct valid and short confidence intervals for both the unseen tensor entries and the unknown tensor factors. The proposed inferential procedure enjoys several important features: (1) it is fully adaptive to noise heteroscedasticity, and (2) it is data-driven and automatically adapts to unknown noise distributions. Furthermore, our findings unveil the statistical optimality of nonconvex tensor completion: it attains un-improvable $\ell_{2}$ accuracy -- including both the rates and the pre-constants -- when estimating both the unknown tensor and the underlying tensor factors.


Noisy low-rank tensor completion
Tensor data are routinely employed in data and information sciences to model (structured) multi-dimensional objects (Kolda & Bader, 2009; Anandkumar et al., 2014; Sidiropoulos et al., 2017; Zhang, 2019). In many practical scenarios of interest, however, we do not have full access to a large-dimensional tensor of interest, as only a sample of its entries is revealed to us; yet we would still wish to reliably infer all missing data. This task, commonly referred to as tensor completion, finds applications in numerous domains including medical imaging (Semerci et al., 2014), visual data analysis (Liu et al., 2013), and seismic data reconstruction (Kreimer et al., 2013), to name just a few. In order to make meaningful inference about the unseen entries, additional information about the unknown tensor plays a pivotal role (otherwise one would face fewer equations than unknowns). A common type of such prior information is low-rank structure, which hypothesizes that the unknown tensor is decomposable into the superposition of a small number of rank-one tensors. Substantial attempts have been made in the past few years to tackle this low-rank tensor completion problem.

Electrical Engineering, Princeton University. Correspondence to: Yuxin Chen <yuxin.chen@princeton.edu>. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
To set the stage for a formal discussion, suppose we are interested in reconstructing a third-order tensor $T = [T_{i,j,k}]_{1 \le i,j,k \le d} \in \mathbb{R}^{d \times d \times d}$, which is a priori known to have low canonical polyadic (CP) rank. This means that $T$ admits the following CP decomposition
$$T = \sum_{l=1}^{r} u_l \otimes u_l \otimes u_l,$$
where $u_l \in \mathbb{R}^d$ ($1 \le l \le r$) represents an unknown tensor factor, and the rank $r$ is considerably smaller than the ambient dimension $d$. What we have obtained is a highly incomplete collection of noisy observations about the entries of $T \in \mathbb{R}^{d \times d \times d}$; more precisely, suppose we observe
$$T^{\mathrm{obs}}_{i,j,k} = T_{i,j,k} + E_{i,j,k}, \qquad (i,j,k) \in \Omega,$$
where $\Omega \subseteq [d]^3$ (with $[d] := \{1, \cdots, d\}$) is a subset of entries, $T^{\mathrm{obs}}_{i,j,k}$ denotes the observed entry in the $(i,j,k)$-th position, and $E_{i,j,k}$ represents the noise, in an attempt to model more realistic scenarios. The presence of missing data and noise, as well as the "notorious" tensor structure (which is not as computationally friendly as its matrix analog (Hillar & Lim, 2013)), poses severe computational and statistical challenges for reliable tensor reconstruction.
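For concreteness, the sampling model above is easy to simulate. The sketch below builds a symmetric rank-$r$ tensor and reveals a random subset of noisy entries; the dimensions, sampling rate, and homoscedastic noise level are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, p, sigma = 30, 3, 0.2, 0.1

# Ground-truth factors u_1, ..., u_r and the symmetric rank-r tensor
# T = sum_l u_l (outer) u_l (outer) u_l.
U = rng.standard_normal((d, r))
T = np.einsum("il,jl,kl->ijk", U, U, U)

# Reveal each entry independently with probability p (uniform random
# sampling), corrupted by independent additive noise E_{i,j,k}.
mask = rng.random((d, d, d)) < p             # indicator of Omega
E = sigma * rng.standard_normal((d, d, d))   # homoscedastic for simplicity
T_obs = np.where(mask, T + E, 0.0)           # unobserved entries set to zero
```

Note that the symmetric CP construction makes `T` invariant under any permutation of its three indices, which the heteroscedastic model in the paper preserves as well (only the noise variances vary by location).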

Review: a nonconvex optimization approach
A natural reconstruction strategy based on the partial data at hand is to resort to the following least-squares problem:
$$\operatorname*{minimize}_{U \in \mathbb{R}^{d \times r}} \; f(U) := \sum_{(i,j,k) \in \Omega} \Big( \sum_{l=1}^{r} u_{l,i}\, u_{l,j}\, u_{l,k} - T^{\mathrm{obs}}_{i,j,k} \Big)^2. \tag{3}$$
Here and in the sequel, we define $U := [u_1, \cdots, u_r]$. Unfortunately, owing to its highly nonconvex nature, the optimization problem (3) is, in general, daunting to solve.
To alleviate computational intractability, a number of polynomial-time algorithms have been proposed; representative examples include convex relaxation (Gandy et al., 2011; Romera-Paredes & Pontil, 2013; Huang et al., 2015), spectral methods (Montanari & Sun, 2018; Cai et al., 2020a), the sum-of-squares hierarchy (Barak & Moitra, 2016; Potechin & Steurer, 2017), and alternating minimization (Jain & Oh, 2014; Liu & Moitra, 2020). Nevertheless, most of these algorithms either remain computationally prohibitive for large-scale problems, or do not come with optimal statistical guarantees; see Section 4. To address the computational and statistical challenges at once, the recent work (Cai et al., 2019) proposed a two-stage nonconvex paradigm that guarantees efficient yet reliable solutions. In a nutshell, this algorithm starts by computing a rough (but reasonable) initial guess $U^0 = [u^0_1, \cdots, u^0_r]$ for all tensor factors, and iteratively refines the estimate by means of the gradient descent (GD) update rule
$$U^{t+1} = U^t - \eta_t \nabla f(U^t), \qquad t = 0, 1, \ldots$$
See Algorithm 1 (the initialization scheme is more complex to describe, and is hence postponed to Appendix A.1). Despite the nonconvex optimization landscape, theoretical guarantees have been developed for Algorithm 1 under random sampling and random noise. Take the noiseless case for instance: this approach converges linearly to the ground truth under near-minimal sample complexity; furthermore, the algorithm enjoys intriguing $\ell_2$ and $\ell_\infty$ statistical guarantees under a broad family of noise models.
Algorithm 1 A nonconvex algorithm for tensor completion.
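The iterative stage of Algorithm 1 is plain gradient descent on the least-squares objective (3). The sketch below runs it on a small synthetic instance; the warm start near the truth (standing in for the spectral initialization of Algorithm 2) and the constant learning rate `eta` are illustrative assumptions, not the paper's parameter choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, p, sigma, eta, iters = 15, 2, 0.3, 0.01, 1e-4, 200

# Synthetic symmetric rank-r ground truth and noisy partial observations.
U_true = rng.standard_normal((d, r))
T = np.einsum("il,jl,kl->ijk", U_true, U_true, U_true)
mask = rng.random((d, d, d)) < p
T_obs = np.where(mask, T + sigma * rng.standard_normal((d, d, d)), 0.0)

# Warm start near the truth (a stand-in for the spectral initialization).
U = U_true + 0.05 * rng.standard_normal((d, r))

losses = []
for _ in range(iters):
    # Residual on the observed set Omega.
    R = mask * (np.einsum("il,jl,kl->ijk", U, U, U) - T_obs)
    losses.append(float(np.sum(R ** 2)))
    # Gradient of f(U) = sum_Omega (sum_l u_{l,i} u_{l,j} u_{l,k} - T_obs)^2;
    # three einsum terms because each of the three indices involves U.
    grad = 2 * (np.einsum("ijk,jl,kl->il", R, U, U)
                + np.einsum("jik,jl,kl->il", R, U, U)
                + np.einsum("jki,jl,kl->il", R, U, U))
    U = U - eta * grad   # the GD update U^{t+1} = U^t - eta_t * grad f(U^t)
```

With a sufficiently small step size the loss decreases steadily toward the noise floor; the paper's theory concerns exactly this iterate sequence (with the spectral initialization and the learning rates of Appendix A.2).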

Uncertainty quantification for tensor completion
In various decision making scenarios (e.g. medical imaging), it is crucial not only to provide the users with the reconstruction outcome, but also to inform them of the uncertainty or risk underlying the reconstruction. The latter task, often termed uncertainty quantification, can be accomplished by characterizing the distribution of our reconstruction, which can be further employed to construct valid confidence intervals for the unknowns. Two questions deserve particular attention: given an estimate returned by the above nonconvex algorithm, how to identify a confidence interval when predicting an unseen entry, and how to produce a confidence region that is likely to contain the tensor factors of interest?
Unfortunately, classical distributional theory available in the statistics literature, which typically operates in the large-sample regime (with a fixed number of unknowns and a sample size tending to infinity), is not applicable to assess the uncertainty of the above nonconvex algorithm in high dimension. In fact, due to the nonconvex nature of the algorithm, it becomes remarkably challenging to track the distribution of the solution returned by Algorithm 1. The absence of distributional characterization prevents us from offering a trustworthy uncertainty estimate to the users. While the statistical performance of Algorithm 1 has been investigated in (Cai et al., 2019), existing statistical guarantees -- which hide the (potentially huge) pre-constants -- can only yield confidence intervals that are overly wide and, as a result, practically uninformative. In principle, we should aim for valid confidence intervals that are as short as possible.
Further, an ideal uncertainty quantification procedure should be adaptive to unknown noise distributions. Accomplishing this goal becomes particularly challenging when the noise variance is not only unknown but also location-varying -a scenario commonly referred to as heteroscedasticity. In fact, there is no shortage of realistic scenarios in which the data heteroscedasticity makes it infeasible to pre-estimate local variability in a uniformly reliable manner. Addressing this challenge calls for the design of model-agnostic data-driven procedures that are fully adaptive to noise heteroscedasticity.

Main contributions and insights
We now give an informal overview of the main contributions and insights of this paper. To the best of our knowledge, results of this kind were previously unavailable.
A distributional theory for nonconvex tensor completion. Despite its nonconvex nature, a distributional representation of the estimate returned by Algorithm 1 can be established down to quite fine scales (i.e. down to the entrywise level). Under mild conditions, (1) the resulting estimates of both the unknown tensor factors and the tensor entries are nearly unbiased, and (2) the associated uncertainty of these estimates is nearly zero-mean Gaussian, whose (co)variance can be accurately determined in a data-driven manner.
Entrywise confidence intervals. Our distributional theory leads to the construction of entrywise confidence intervals for both the unknown tensor and the associated tensor factors. Our inferential procedure is fully data-driven: it does not require prior knowledge about the noise distributions, and it automatically adapts to the local variability of the noise.
Optimality. We develop lower bounds under i.i.d. Gaussian noise, showing that the proposed entrywise confidence intervals are the shortest possible. Our results also reveal that nonconvex optimization achieves un-improvable $\ell_2$ estimation accuracy (including both the rates and pre-constants).
All in all, our results shed light on the unreasonable effectiveness of nonconvex optimization in noisy low-rank tensor completion, which enables optimal estimation and uncertainty quantification all at once.

Notation
For any matrix $M$, let $\|M\|$ (resp. $\|M\|_{\mathrm{F}}$) denote the spectral (resp. Frobenius) norm of $M$, denote by $\|M\|_{2,\infty} := \max_l \|M_{l,:}\|_2$ (resp. $\|M\|_\infty := \max_{i,j} |M_{i,j}|$) the $\ell_{2,\infty}$ norm (resp. entrywise $\ell_\infty$ norm) of $M$, and let $M_{i,:}$ (resp. $M_{:,i}$) be the $i$-th row (resp. column). For any tensor $T \in \mathbb{R}^{d \times d \times d}$, the Frobenius (resp. entrywise $\ell_\infty$) norm of $T$ is defined as $\|T\|_{\mathrm{F}} := \big( \sum_{i,j,k} T_{i,j,k}^2 \big)^{1/2}$ (resp. $\|T\|_\infty := \max_{i,j,k} |T_{i,j,k}|$). We use $u_{l,i}$ (resp. $\hat{u}_{l,i}$) to denote the $i$-th entry of $u_l \in \mathbb{R}^d$ (resp. $\hat{u}_l \in \mathbb{R}^d$). We shall often let $(i,j)$ denote $(i-1)d + j$ whenever it is clear from the context.
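As a quick concrete check of these norm conventions (the matrix is an arbitrary illustrative example):

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [4.0, 0.0]])

spec = np.linalg.norm(M, 2)                    # spectral norm ||M||
fro = np.linalg.norm(M, "fro")                 # Frobenius norm ||M||_F
two_inf = np.linalg.norm(M, axis=1).max()      # ||M||_{2,inf}: largest row-wise l2 norm
entry_inf = np.abs(M).max()                    # entrywise sup-norm ||M||_inf
```

For this rank-one matrix the spectral and Frobenius norms coincide at 5, while the $\ell_{2,\infty}$ and entrywise $\ell_\infty$ norms both equal 4 (the largest row has norm 4 and the largest entry is 4).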

Models and assumptions
In this paper, we shall consider a setting with random sampling and independent random noise as follows.
Next, we introduce additional parameters about the unknown tensor of interest. To begin with, we define $\lambda_{\min} := \min_{1 \le l \le r} \lambda_l$, where $\lambda_l$ denotes the strength of the $l$-th rank-one component. To enable reliable tensor completion, we impose further assumptions on the tensor factors $\{u_l\}$ as follows.
Suppose that $T$ satisfies certain incoherence conditions, parameterized by the quantities $\mu_0$, $\mu_1$ and $\mu_2$ discussed below. Further, assume that $T$ is well-conditioned in the sense that $\kappa$ (cf. (7)) obeys $\kappa = O(1)$.
Informally, when both $\mu_0$ and $\mu_1$ are small, the $\ell_2$ energy of both $T$ and $u_l$ ($1 \le l \le r$) is dispersed more or less evenly across their entries. In addition, a small $\mu_2$ implies that every pair of the tensor factors of interest is nearly orthogonal to (and hence incoherent with) each other. Finally, the well-conditionedness assumption guarantees that no single tensor component has significantly higher energy than the rest. For the sake of notational simplicity, we shall combine them into a single incoherence parameter $\mu := \max\{\mu_0, \mu_1, \mu_2\}$. The focal point of this paper lies in distributional characterization of, and uncertainty assessment for, the nonconvex estimate (i.e. the solution $\hat{U}$ returned by Algorithm 1) in a strong entrywise sense. In particular, we set out to (1) establish a distributional representation of the estimate $\hat{U}$, and (2) construct short yet valid confidence intervals for each entry of the tensor factors $\{u_l\}_{1 \le l \le r}$ as well as each entry of the unknown tensor $T$. To phrase the latter task more precisely: given any target coverage level $0 < 1 - \alpha < 1$, any $1 \le l \le r$ and any $1 \le i, j, k \le d$, the aim is to compute intervals $[c_{1,T}, c_{2,T}]$ and $[c_{1,u}, c_{2,u}]$ such that, up to global permutation,
$$\mathbb{P}\big( T_{i,j,k} \in [c_{1,T}, c_{2,T}] \big) \approx 1 - \alpha \qquad \text{and} \qquad \mathbb{P}\big( u_{l,k} \in [c_{1,u}, c_{2,u}] \big) \approx 1 - \alpha.$$
Ideally, this should be accomplished in a data-driven manner without requiring prior knowledge about the noise distributions.

Main results
This section presents our distributional theory for nonconvex noisy tensor completion, and demonstrates how to conduct uncertainty quantification in a data-driven and optimal manner. In the sequel, we denote by $\hat{U} = [\hat{u}_1, \cdots, \hat{u}_r] \in \mathbb{R}^{d \times r}$ the estimate returned by Algorithm 1, and let
$$\hat{T} := \sum_{l=1}^{r} \hat{u}_l \otimes \hat{u}_l \otimes \hat{u}_l \in \mathbb{R}^{d \times d \times d} \tag{10}$$
denote the resulting tensor estimate. Given that one can only hope to recover $U$ up to global permutation, we introduce the permutation matrix
$$\Pi := \operatorname*{argmin}_{P \in \mathsf{perm}_r} \|\hat{U} P - U\|_{\mathrm{F}},$$
where $\mathsf{perm}_r$ is the set of permutation matrices in $\mathbb{R}^{r \times r}$. In order to guarantee reliable convergence of Algorithm 1, several algorithmic parameters (e.g. the learning rates) need to be properly chosen; we shall adopt the choices suggested by (Cai et al., 2019) throughout. Given that our theory can be presented regardless of the reader's understanding of these choices, we defer the specification of these algorithmic parameters to Appendix A.2 to avoid distraction. All proofs are deferred to (Cai et al., 2020b).

Distributional guarantees for nonconvex estimates
We now establish distributional guarantees for the nonconvex estimate. For notational convenience, we introduce an auxiliary lifted matrix $U \in \mathbb{R}^{d^2 \times r}$ with rows $U_{(i,j),:} := [u_{l,i}\, u_{l,j}]_{1 \le l \le r}$, as well as a collection of diagonal matrices $D_k \in \mathbb{R}^{d^2 \times d^2}$ ($1 \le k \le d$); here, we abuse the notation $(i,j)$ to denote $(i-1)d + j$. In words, $U$ lifts the tensor factors to a higher order, and $D_k$ records the noise variances in the $k$-th slice of $E$. To simplify presentation, we begin with the Gaussian noise case.
Theorem 1 (Distributional guarantees for tensor factor estimates (Gaussian noise)). Suppose that the $E_{i,j,k}$'s are Gaussian, and that Assumptions 1-3 hold. Assume that $\mu, \kappa, r = O(1)$ and that $t_0 = c_0 \log d$, for some sufficiently large (resp. small) constants $c_0, c_1, c_2 > 0$ (resp. $c_3 > 0$). Then with probability at least $1 - o(1)$, one has the decomposition
$$\hat{U} \Pi - U = Z + W,$$
where $Z$ is a Gaussian component and $W$ is a negligible residual term.

Remark 1. As an interpretation of Condition (14): (i) the sample complexity $pd^3$ is $O\big(d^{3/2}\, \mathrm{poly}\log(d)\big)$, which is widely conjectured to be computationally optimal (up to some log factor) (Barak & Moitra, 2016); (ii) the typical size of each noise component (as captured by $\{\sigma_{i,j,k}\}$) is allowed to be substantially larger than the maximum magnitude of the entries of $T$ under our sample size assumption.
In words, Theorem 1 reveals that the estimation error of $\hat{U}$ can be decomposed into a Gaussian component $Z$ and a residual term $W$. Encouragingly, the residual term $W$ is, in some sense, dominated by the Gaussian term and can be safely neglected. To see this, recall that $\sigma_{i,j,k} \ge \sigma_{\min}$, which yields a lower bound on the size of $Z$: the typical $\ell_2$ norm of each row $Z_{k,:}$ exceeds the order of $\sigma_{\min}$ (suitably normalized), and is hence much larger than $\|W\|_{2,\infty}$ (by virtue of Theorem 1). To conclude, the nonconvex estimate $\hat{U}$ is, up to global permutation, a nearly unbiased estimate of the true tensor factors $U$, with estimation errors being approximately Gaussian.
As it turns out, this distributional characterization can be extended to accommodate a much broader class of noise beyond Gaussian noise, as stated below.

Theorem 2 (Distributional guarantees for tensor factor estimates (general noise)). Suppose that the $\{E_{i,j,k}\}$ are not necessarily Gaussian but satisfy Assumption 2. Then the decomposition in Theorem 1 continues to hold, except that $Z$ is not necessarily Gaussian but instead obeys an approximate Gaussian characterization with row covariance matrices $\{\Sigma_k\}$ defined in (15).

Before continuing, there is another important observation worth pointing out (not stated in Theorems 1-2 but made precise in the analysis): for any three distinct indices $i, j, k$, the corresponding errors $Z_{i,:}$, $Z_{j,:}$ and $Z_{k,:}$ are "nearly" statistically independent. This is a crucial observation that immediately leads to entrywise distributional characterizations for the resulting tensor estimate $\hat{T}$, as summarized below.

Theorem 3 (Distributional guarantees for tensor entry estimates). Instate the assumptions of Theorem 2, and consider any $1 \le i, j, k \le d$ obeying Condition (16). Then the estimate $\hat{T}$ defined in (10) obeys
$$\sup_{t \in \mathbb{R}} \bigg| \mathbb{P}\bigg( \frac{\hat{T}_{i,j,k} - T_{i,j,k}}{\sqrt{v_{i,j,k}}} \le t \bigg) - \Phi(t) \bigg| = o(1),$$
where $\Phi(\cdot)$ is the CDF of a standard Gaussian random variable. Here, the variance parameters $\{v_{i,j,k}\}$ are defined such that, for any three distinct indices $i, j, k$,
$$v_{i,j,k} := U_{(j,k),:}\, \Sigma_i\, U_{(j,k),:}^\top + U_{(i,k),:}\, \Sigma_j\, U_{(i,k),:}^\top + U_{(i,j),:}\, \Sigma_k\, U_{(i,j),:}^\top,$$
where $\Sigma_k$ is defined in (15).
In short, the above theorem indicates that if the "strength" of a tensor entry $T_{i,j,k}$ is not exceedingly small, then our nonconvex estimate of this entry is nearly unbiased, and its estimation error is approximately zero-mean Gaussian with variance $v_{i,j,k}$ (which admits a closed-form expression). To see this, note that when (14) holds, the right-hand side of Condition (16) is at most $O\big(d^{-1/4}/\sqrt{\log d}\big)$, which is vanishingly small. In other words, the Gaussian approximation is nearly tight unless the energy $\|U_{(j,k),:}\|_2 + \|U_{(i,k),:}\|_2 + \|U_{(i,j),:}\|_2$ is vanishingly small compared to the average size. This entrywise distributional theory allows one to accommodate a broad family of noise models.

Confidence intervals
The preceding distributional guarantees pave the way for uncertainty quantification. To achieve this, it remains to compute the unknown covariance matrices $\{\Sigma_k\}$ and the variance parameters $\{v_{i,j,k}\}$, which are functions of both the ground truth $\{u_l\}$ and the noise variances $\{\sigma^2_{i,j,k}\}$, and are not known a priori. In particular, in the heteroscedastic case where the $\{\sigma^2_{i,j,k}\}$ are location-varying, it might be infeasible to estimate all variance parameters reliably.
Variance and covariance estimation. Fortunately, despite the absence of prior knowledge about the truth and the noise parameters, we are still able to faithfully estimate these important parameters from the data at hand, using simple plug-in rules. Specifically:

1. Rather than estimating all $\{\sigma_{i,j,k}\}$ directly, we turn attention to estimating the noise components $\{E_{i,j,k}\}$, with the assistance of our tensor estimate $\hat{T}$:
$$\hat{E}_{i,j,k} := T^{\mathrm{obs}}_{i,j,k} - \hat{T}_{i,j,k}, \qquad (i,j,k) \in \Omega.$$
2. We then construct a diagonal matrix $\hat{D}_k \in \mathbb{R}^{d^2 \times d^2}$ from the estimated noise components $\{\hat{E}_{i,j,k}\}$. Note that $\hat{D}_k$ ($1 \le k \le d$) is not really a faithful estimate of the $D_k$ defined in (13), but it suffices for our purpose.
3. Substitute the above estimators into the expressions of the (co)variance parameters to yield our estimates. Specifically, for any $1 \le k \le d$, we compute an estimate $\hat{\Sigma}_k$ of $\Sigma_k$ by plugging $\hat{U}$ and $\hat{D}_k$ into (15). We also produce estimates of $\{v_{i,j,k}\}$ as follows: for any three distinct indices $1 \le i, j, k \le d$,
$$\hat{v}_{i,j,k} := \hat{U}_{(j,k),:}\, \hat{\Sigma}_i\, \hat{U}_{(j,k),:}^\top + \hat{U}_{(i,k),:}\, \hat{\Sigma}_j\, \hat{U}_{(i,k),:}^\top + \hat{U}_{(i,j),:}\, \hat{\Sigma}_k\, \hat{U}_{(i,j),:}^\top. \tag{22}$$

Confidence intervals. With the above variance and covariance estimates in place, we are positioned to introduce our uncertainty quantification procedure. This is accomplished by constructing entrywise confidence intervals for both the tensor factors and the unknown tensor as follows.
• For each $1 \le i, j, k \le d$, we construct a $(1-\alpha)$-confidence interval for the $(i,j,k)$-th entry of $T$:
$$\mathrm{CI}^{1-\alpha}_{i,j,k} := \Big[\, \hat{T}_{i,j,k} \pm \Phi^{-1}(1 - \alpha/2)\, \sqrt{\hat{v}_{i,j,k}}\, \Big],$$
where $\hat{v}_{i,j,k}$ is constructed in (22).
• For each $1 \le l \le r$ and $1 \le k \le d$, we similarly construct a $(1-\alpha)$-confidence interval $\mathrm{CI}^{1-\alpha}_{l,k}$ for the factor entry $u_{l,k}$, based on the corresponding diagonal entry of the covariance estimate $\hat{\Sigma}_k$ (cf. (23)).
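As an illustration, the three-term plug-in variance and the resulting entrywise interval can be sketched as follows (a hypothetical, 0-indexed implementation; `lift` mirrors the lifting convention $U_{(i,j),l} = u_{l,i}\, u_{l,j}$, and the covariance estimates $\hat{\Sigma}_k$ are taken as given inputs rather than computed):

```python
import math
from statistics import NormalDist

import numpy as np

def lift(U):
    """Row-wise self-product: lifted row (i, j) equals U[i, :] * U[j, :]."""
    d, r = U.shape
    return (U[:, None, :] * U[None, :, :]).reshape(d * d, r)

def v_hat(U_hat, Sigma_hat, i, j, k):
    """Three-term plug-in variance estimate, cf. (22) (0-indexed i, j, k)."""
    d, _ = U_hat.shape
    L = lift(U_hat)
    row = lambda a, b: L[a * d + b]
    return (row(j, k) @ Sigma_hat[i] @ row(j, k)
            + row(i, k) @ Sigma_hat[j] @ row(i, k)
            + row(i, j) @ Sigma_hat[k] @ row(i, j))

def entry_ci(T_hat_ijk, v, alpha=0.05):
    """(1 - alpha) interval: T_hat +/- Phi^{-1}(1 - alpha/2) * sqrt(v)."""
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(v)
    return (T_hat_ijk - half, T_hat_ijk + half)
```

For instance, with $d = 2$, $r = 1$, $\hat{u}_1 = (1, 2)$ and all $\hat{\Sigma}_k$ equal to the identity, the three terms of $\hat{v}_{1,1,2}$ evaluate to $4 + 4 + 1 = 9$.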
As it turns out, the proposed (entrywise) confidence intervals are nearly accurate, as revealed by the following theorem.

Theorem 4 (Validity of confidence intervals). Instate the assumptions of Theorem 2. There is a permutation $\pi(\cdot): [r] \to [r]$ such that for any $0 < \alpha < 1$, the confidence interval constructed in (23) obeys
$$\mathbb{P}\big( u_{\pi(l),k} \in \mathrm{CI}^{1-\alpha}_{l,k} \big) = 1 - \alpha + o(1).$$
In addition, for any $1 \le i, j, k \le d$ obeying (16) and any $0 < \alpha < 1$, the confidence interval (24) obeys
$$\mathbb{P}\big( T_{i,j,k} \in \mathrm{CI}^{1-\alpha}_{i,j,k} \big) = 1 - \alpha + o(1).$$

This theorem justifies the validity of the inferential procedure we propose. Several important features are worth emphasizing:

• "Fine-grained" entrywise uncertainty quantification.
Our results enable trustworthy uncertainty quantification down to quite fine scales: we are capable of reliably assessing the uncertainty at the entrywise level for both the tensor factors and the tensor of interest. To the best of our knowledge, accurate entrywise uncertainty characterization for noisy tensor completion was previously unavailable.
• Adaptivity to heterogeneous and unknown noise distributions. The proposed confidence intervals do not require prior knowledge about the noise distributions, and automatically adapt to noise heteroscedasticity (i.e. the case when the noise variance varies across entries). Such model-free and adaptive features are of eminently practical value.
• No need of sample splitting. The whole procedure studied here -including both estimation and uncertainty quantification -does not rely on any sort of data splitting, thus effectively preventing unnecessary information loss due to sample splitting.
Lower bounds. One might naturally wonder whether the proposed confidence intervals can be further improved; concretely, is it possible to identify a shorter confidence interval that remains valid? As it turns out, our procedures are, in some sense, statistically optimal under Gaussian noise, as confirmed by the following fundamental lower bound.
Theorem 5 (Fundamental lower bounds). Consider any unbiased estimator $\hat{u}_l$ of $u_l$ ($1 \le l \le r$) and any unbiased estimator $\hat{T}$ of $T$. Suppose the $\{E_{i,j,k}\}$ are i.i.d. Gaussian. Under the assumptions of Theorem 2, one necessarily has entrywise variance lower bounds matching the (co)variances attained in Theorems 1-3.

Taken collectively with Theorems 2 and 3, the above result reveals that our nonconvex estimators $\{\hat{u}_l\}$ and $\hat{T}$ achieve minimal mean squared estimation errors in a very sharp manner at the entrywise level. Recognizing that the proposed confidence intervals allow for accurate assessment of the uncertainty (by virtue of Theorem 4), we conclude that the proposed entrywise inferential procedures are, in some sense, un-improvable under i.i.d. Gaussian noise (including both the rates and the pre-constants).

Back to estimation: $\ell_2$ optimality of nonconvex estimates
Thus far, we have established the distributions of the estimators $\hat{u}_{l,k}$ ($1 \le l \le r$, $1 \le k \le d$) and $\hat{T}_{i,j,k}$ (for those $i, j, k$ obeying (16)). These results taken together allow one to pin down the $\ell_2$ risk of the nonconvex optimization approach in a sharp manner, as follows.

Theorem 6 (Sharp $\ell_2$ estimation risk). Instate the assumptions of Theorem 2. With probability $1 - o(1)$, the estimates returned by Algorithm 1 obey the risk characterizations (26a) and (26b).

Here, the characterization of the $\ell_2$ risk (26a) for $\hat{u}_l$ is a straightforward consequence of Theorems 1-2. In comparison, establishing the $\ell_2$ risk (26b) for $\hat{T}$ requires more work, as Theorem 3 is valid only for the subset of entries obeying (16). Fortunately, the majority of the entries of $T$ satisfy (16), thus allowing for a nearly accurate approximation of the Euclidean risk of $\hat{T}$. Theorem 6 taken together with Theorem 5 delivers encouraging news: when the noise is i.i.d. Gaussian, nonconvex optimization is information-theoretically optimal in a sharp manner when estimating both the unknown tensor and its underlying tensor factors.

Numerical experiments
To validate our theory and demonstrate the practical applicability of our inferential procedures, we perform a series of numerical experiments in a variety of settings. Specifically, we set $d = 100$, $p = 0.2$, and generate the ground-truth tensor $T = \sum_{l=1}^{r} (u_l)^{\otimes 3}$ in a random fashion with $u_l \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$. Regarding the algorithmic parameters for nonconvex optimization (i.e. Algorithm 1 and Algorithm 2 in Supplementary Material A.1), we choose $L = r^2$, pruning threshold $\mathsf{th} = 0.4$, $\eta_t \equiv 3 \times 10^{-5}/p$, and $t_0 = 100$. The noise components $E_{i,j,k}$ are independently drawn from Gaussian distributions $\mathcal{N}(0, \sigma^2_{i,j,k})$, with the variances $\{\sigma^2_{i,j,k}\}$ generated randomly in a manner governed by a parameter $\beta$ that dictates the degree of heteroscedasticity. The noise becomes more heteroscedastic as $\beta$ increases, and setting $\beta = 0$ reduces to the homoscedastic case where the noise variances are identical across all entries. In what follows, we set $\beta = 5$.

Tensor factor entries. We begin with inference for the entries of the tensor factors of interest. Consider the construction of 95% confidence intervals (i.e. $\alpha = 0.05$), and define the normalized estimation error $R^U_{l,k}$ as the factor estimation error $\hat{u}_{\pi(l),k} - u_{l,k}$ divided by its estimated standard deviation. For each $1 \le l \le r$ and $1 \le k \le d$, we denote by $\mathrm{CR}_{l,k}$ the empirical coverage rate for the tensor factor entry $u_{l,k}$ over 100 independent trials. Let $\mathrm{Mean}(\mathrm{CR})$ (resp. $\mathrm{Std}(\mathrm{CR})$) denote the average (resp. standard deviation) of $\{\mathrm{CR}_{l,k}\}$ over all tensor factor entries. Figure 1 displays the Q-Q (quantile-quantile) plots of $R^U_{1,1}$, $R^U_{1,2}$ and $R^U_{1,3}$ against a standard Gaussian random variable, and Table 1 summarizes the numerical results for varying $r$ and $\sigma$. Encouragingly, the empirical coverage rates are all very close to 95%, and the empirical distributions of the normalized estimation errors are all well approximated by a standard Gaussian distribution.
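As a sanity check on this coverage logic (not a reproduction of the paper's experiment), one can verify that a standard-Gaussian normalized error lands inside the nominal 95% band about 95% of the time:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
z = NormalDist().inv_cdf(0.975)   # two-sided 95% quantile, about 1.96

# Stand-in for the paper's Monte Carlo: if the normalized error R is
# (approximately) standard Gaussian, then the event |R| <= z should
# occur with empirical frequency close to the nominal 95% level.
trials = 10_000
R = rng.standard_normal(trials)
coverage = float(np.mean(np.abs(R) <= z))
```

With 10,000 draws the empirical coverage concentrates within a fraction of a percent of 0.95, mirroring the behavior reported in Table 1 for the actual tensor-completion intervals.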
Tensor entries. Next, we turn to inference for tensor entries. Similar to the above case, we construct 95% confidence intervals, and define the normalized estimation error $R^T_{i,j,k}$ analogously. For each $1 \le i \le j \le k \le d$, we record the empirical coverage rate $\mathrm{CR}_{i,j,k}$ for the tensor entry $T_{i,j,k}$ over 100 Monte Carlo trials. Denote by $\mathrm{Mean}(\mathrm{CR})$ (resp. $\mathrm{Std}(\mathrm{CR})$) the average (resp. standard deviation) of $\{\mathrm{CR}_{i,j,k}\}$ over entries $1 \le i \le j \le k \le d$. Figure 2 depicts the Q-Q (quantile-quantile) plots of $R^T_{1,1,1}$, $R^T_{1,1,2}$ and $R^T_{1,2,3}$ against a standard Gaussian random variable.

Other related work
One alternative approach is to matricize the tensor and invoke matrix completion machinery; unfolding a $d \times d \times d$ tensor, however, discards much of its multilinear structure, thereby resulting in sub-optimal sample complexity when directly invoking matrix completion theory. To address this issue, a recent line of work (Barak & Moitra, 2016; Potechin & Steurer, 2017) suggested the use of the sum-of-squares (SOS) hierarchy, which performs convex relaxation after lifting the data into higher dimensions. The SOS-based algorithms achieve a sample complexity on the order of $rd^{3/2}$ for third-order tensors, which is widely conjectured to be optimal among all polynomial-time algorithms. However, despite their polynomial-time complexity, the SOS-based methods remain too expensive for solving large-scale practical problems, primarily due to the lifting operation.
None of the above results, however, suggested how to evaluate the uncertainty of the resulting estimates in a meaningful way. Despite a large body of work on statistical estimation for noisy tensor completion, it remains completely unclear how to exploit existing results to construct valid yet short confidence intervals for the unknown tensor. Perhaps the work closest to the current paper is inference and uncertainty quantification for noisy matrix completion and matrix denoising (Chen et al., 2019d; Xia & Yuan, 2019b; Cheng et al., 2020), which enables optimal construction of confidence intervals on the basis of nonconvex matrix completion algorithms. Inference for singular subspaces has also been investigated under both low-rank matrix regression and denoising settings (Xia, 2018). While these results might potentially be applicable here by first matricizing the data, the resulting sample complexity, as discussed above, could be pessimistic. Finally, construction of confidence intervals has been extensively studied in a variety of high-dimensional sparse estimation settings (Zhang & Zhang, 2014; van de Geer et al., 2014; Javanmard & Montanari, 2014; Ren et al., 2015; Cai et al., 2016; Ning & Liu, 2017; Cai & Guo, 2017; Sur et al., 2019; Janková & van de Geer, 2018; Miolane & Montanari, 2018). Both the inferential approaches and analysis techniques therein, however, are drastically different from the ones employed for inference in either tensor completion or matrix completion.

Discussions
This paper has explored the problem of uncertainty quantification for nonconvex tensor completion. The main contributions lie in establishing (nearly) precise distributional guarantees for the nonconvex estimates down to the entrywise level. Our distributional representation enables data-driven construction of confidence intervals for both the unknown tensor and its underlying tensor factors. Our inferential procedure and the accompanying theory are model-agnostic: they do not require prior knowledge about the noise distributions, and they are fully adaptive to location-varying noise levels. Our results uncover the unreasonable effectiveness of nonconvex optimization, which is statistically optimal for both estimation and confidence interval construction.
The findings of the current paper further suggest numerous possible extensions that are worth pursuing. To begin with, our current results are only optimal when both the rank r and the condition number κ are constants independent of the ambient dimension d. Can we further refine the analysis to enable optimal inference for more general settings? It would also be interesting to go beyond uniform random sampling by considering the type of sampling patterns with a heterogeneous missingness mechanism.
A. More details about Algorithm 1

A.1. The initialization scheme
For self-completeness, we record in this section the detailed initialization procedure employed in the two-stage nonconvex algorithm proposed in (Cai et al., 2019) (namely, Algorithm 1). This is summarized in Algorithm 2, with auxiliary procedures detailed in Algorithm 3. As a high-level interpretation, Algorithm 2 estimates the subspace spanned by the tensor factors $\{u_l\}_{1 \le l \le r}$ via a spectral method (similar to PCA-type methods (Montanari & Sun, 2018; Zhang et al., 2018; Cai et al., 2020a)), whereas Algorithm 3 attempts to retrieve estimates of the individual tensor factors from this subspace estimate $U_{\mathrm{space}}$. Here and throughout, we denote $T^{\mathrm{obs}} := [T^{\mathrm{obs}}_{i,j,k}]_{1 \le i,j,k \le d}$, where we set $T^{\mathrm{obs}}_{i,j,k} = 0$ for any $(i,j,k) \notin \Omega$. In addition, for any vector $u \in \mathbb{R}^d$, we define the vector product of a tensor $T \in \mathbb{R}^{d \times d \times d}$, denoted by $T \times_3 u \in \mathbb{R}^{d \times d}$, such that
$$(T \times_3 u)_{i,j} := \sum_{1 \le k \le d} T_{i,j,k}\, u_k, \qquad \forall i, j \in [d].$$
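The mode-3 vector product just defined is a one-line contraction in practice; a minimal sketch:

```python
import numpy as np

def mode3_product(T, u):
    """Compute (T x_3 u)_{i,j} = sum_k T[i, j, k] * u[k]."""
    return np.einsum("ijk,k->ij", T, u)
```

For example, applying it to the tensor with entries $T_{i,j,k} = 4i + 2j + k$ (0-indexed) and $u = (1, 2)$ contracts the last index, leaving a $d \times d$ matrix.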
Algorithm 2 Spectral initialization for nonconvex tensor completion
1: Input: tensor $T^{\mathrm{obs}}$.
2: Let $U_{\mathrm{space}} \Lambda U_{\mathrm{space}}^\top$ be the rank-$r$ eigendecomposition of $\mathcal{P}_{\mathrm{off\text{-}diag}}(AA^\top)$, where $A = \mathrm{unfold}(T^{\mathrm{obs}})$ is the mode-1 matricization of $T^{\mathrm{obs}}$, and $\mathcal{P}_{\mathrm{off\text{-}diag}}(Z)$ extracts the off-diagonal entries of $Z$.
3: Output: an initial estimate $U^0 \in \mathbb{R}^{d \times r}$ computed on the basis of $U_{\mathrm{space}} \in \mathbb{R}^{d \times r}$ using Algorithm 3.
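A minimal numerical sketch of step 2 of Algorithm 2 (selecting the rank-$r$ eigenvectors by eigenvalue magnitude is an implementation choice on our part; the off-diagonal projection removes the diagonal bias introduced by sampling and noise):

```python
import numpy as np

def spectral_subspace(T_obs, r):
    """Estimate the span of the tensor factors from (zero-filled) observations."""
    d = T_obs.shape[0]
    A = T_obs.reshape(d, d * d)        # mode-1 matricization unfold(T_obs)
    G = A @ A.T                        # d x d Gram matrix A A^T
    np.fill_diagonal(G, 0.0)           # P_offdiag: keep only off-diagonal entries
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(np.abs(vals))[::-1][:r]   # rank-r eigencomponents
    return vecs[:, top]
```

As a quick check, for the fully observed rank-one tensor of all ones, the off-diagonal Gram matrix is a multiple of $\mathbf{1}\mathbf{1}^\top - I$, whose leading eigenvector is exactly the (normalized) all-ones direction.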

A.2. Choices of algorithmic parameters
To guarantee fast convergence of Algorithm 1, a couple of algorithmic parameters (namely, the number of restart attempts $L$, the pruning threshold $\mathsf{th}$ in Algorithm 3, and the learning rates $\eta_t$) need to be properly chosen. Unless otherwise noted, this paper adopts the choices suggested by (Cai et al., 2019), in which $c_4 > 0$ is some sufficiently large constant and $c_5, c_6 > 0$ are some sufficiently small constants. The interested reader is referred to (Cai et al., 2019) for justification.
Algorithm 3 Retrieval of low-rank tensor factors from a given subspace estimate.
1: Input: number of restarts $L$, pruning threshold $\mathsf{th}$, subspace estimate $U_{\mathrm{space}} \in \mathbb{R}^{d \times r}$ given by Algorithm 2.
2: for $\tau = 1, \ldots, L$ do
3:   Generate an independent vector $g^\tau \sim \mathcal{N}(0, I_d)$.