CP Degeneracy in Tensor Regression

Tensor linear regression is an important tool for analyzing tensor data. To deal with high dimensionality, CANDECOMP/PARAFAC (CP) low-rank constraints are often imposed on the coefficient tensor parameter in (penalized) $M$-estimation. However, we show that the corresponding optimization may not be attainable, and when this happens, the estimator is not well-defined. This is closely related to a phenomenon, called CP degeneracy, in low-rank tensor approximation problems. In this article, we provide useful results on CP degeneracy in tensor regression problems. In addition, we provide a general penalized strategy as a solution to overcome CP degeneracy. The asymptotic properties of the resulting estimator are also studied. Numerical experiments are conducted to illustrate our findings.


Introduction
Tensor linear regression (Raskutti et al., 2019; Guo et al., 2011; Zhou et al., 2013; Hoff, 2015; Sun and Li, 2017; Lock, 2018; Miranda et al., 2018; Chen et al., 2019) has recently received an increasing amount of attention among statistical researchers. In most existing work, low-rank constraints are imposed to overcome the curse of dimensionality (Chen et al., 2019; Guo et al., 2011; Zhou et al., 2013; Lock, 2018; Li et al., 2018; Raskutti et al., 2019; Zhang et al., 2020). One popular low-rank modeling approach is based on the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). However, the corresponding low-rank space may not be compact (De Silva and Lim, 2008). This raises questions about the well-posedness of the corresponding $M$-estimators, as the solutions of their optimizations may not be attainable. Similar issues, known as CP degeneracy, have been discussed in the context of low-rank approximation problems (De Silva and Lim, 2008; Krijnen et al., 2008), which can be treated as a special case of tensor linear regression (see Example 1 in Section 3.1). However, such fundamental and important issues are seldom discussed in the statistics literature. In this article, we aim to provide corresponding understanding and discussion in the context of tensor linear regression, from a statistical perspective. In particular, we discuss the existence and characteristics of CP degeneracy in tensor linear regression. We also provide a solution to CP degeneracy and investigate the properties of the resulting estimation procedure.
The contribution of this article is threefold. First, we construct a set of examples to illustrate the possibility of CP degeneracy in tensor regression settings. Second, we theoretically derive some important characteristics of CP degeneracy. A crucial characteristic is that the magnitude of the CP parameter will diverge if an iterative algorithm is adopted to obtain an $M$-estimator of the CP parameter. Third, we propose a general penalized method with theoretical guarantees to surmount the barrier of degeneracy. In particular, we provide an asymptotic analysis of the penalized estimation in the high-dimensional regime. To the best of our knowledge, this is the first analysis that does not assume a well-defined best low-rank approximation. That is, in our analysis, the best low-rank approximation of the true coefficient tensor does not have to exist.
The rest of this article is organized as follows. Section 2 provides the background of CP decomposition and degeneracy, as well as tensor linear regression. Section 3 discusses the existence of CP degeneracy, its characteristics, and a strategy to overcome it. Section 4 provides corresponding numerical experiments. Technical details are deferred to Section S.1.

Tensor basics and CP degeneracy
In this article, we focus on tensors that are real-valued multidimensional arrays. A tensor $A = (A_{i_1,\ldots,i_D}) \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ with $p_d \in \mathbb{Z}_+$ for $d = 1, \ldots, D$ has $D$ modes, where $D$ is often referred to as the order of the tensor. Vectors and matrices can be regarded as first-order and second-order tensors respectively. Often, tensors of order 3 or above are referred to as higher-order tensors. It is now well-known that higher-order tensors behave very differently from their lower-order counterparts, e.g., matrices. Indeed, many tensor problems are NP-hard (Hillar and Lim, 2013). Also, many matrix-related concepts cannot be directly generalized to higher-order tensors. The CP degeneracy problem (with details given below), the major subject of this article, is an example. To avoid clutter, we will simply refer to a higher-order tensor as a tensor, and use vector and matrix specifically for first-order and second-order tensors.
One of the most important matrix tools is the singular value decomposition (SVD). As for tensors, there are two commonly used tensor decompositions, the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970) and the Tucker decomposition (Tucker, 1966), both of which can be regarded as higher-order generalizations of SVD. The major interest of this article centers around the CP decomposition. More specifically, a CP decomposition of a tensor $A$ takes the form
$$A = \sum_{r=1}^{R} \beta_{1,r} \circ \beta_{2,r} \circ \cdots \circ \beta_{D,r}, \qquad (1)$$
where $\beta_{d,r} \in \mathbb{R}^{p_d}$, $d = 1, \ldots, D$, and $\circ$ represents the outer product. We refer to the entries of the $\beta_{d,r}$'s as CP parameters. The rank of a tensor $A$, denoted by $\mathrm{rank}(A)$, is the smallest $R$ such that the CP decomposition (1) holds. In other words, it is the minimal number of rank-1 tensors (outer products of $D$ vectors) that can represent the tensor. Despite its high similarity to the definition of matrix rank, tensor rank possesses surprisingly different properties. First, the rank of a tensor may depend on whether the entries are real-valued or complex-valued (i.e., on the field). To date, this seems to be relatively less interesting to the statistical community, as most statistical work related to tensors focuses on real-valued tensors. Second, rank determination is NP-hard (Johan, 1990). Third, the space of low-rank tensors may not be closed. More precisely, it has been shown by De Silva and Lim (2008) that the space is not closed regardless of the choice of norm, for $R = 2, \ldots, \min\{p_1, \ldots, p_D\}$ and $D \geq 3$. As a result, the best low-rank approximation of a higher-order tensor may not exist. More specifically, a best rank-$R$ approximation of $A$ is defined as a solution to the problem
$$\arg\min_{A' : \, \mathrm{rank}(A') \leq R} \|A - A'\|_F. \qquad (2)$$
Here $\|\cdot\|_F$ denotes the Frobenius norm, also called the Hilbert-Schmidt norm, defined by $\|A\|_F = (\sum_{i_1,\ldots,i_D} A_{i_1,\ldots,i_D}^2)^{1/2}$. The non-closedness of the feasible set in (2) has a significant impact on the existence of a solution. When the solution of (2) does not exist, this is generally referred to as CP degeneracy (Kolda and Bader, 2009). Note that degeneracy never happens for matrices (i.e., $D = 2$), where a best rank-$R$ approximation can be obtained efficiently from the $R$ leading factors of the SVD (Eckart and Young, 1936). Indeed, the possible non-existence of a best low-rank approximation has a significant influence on the well-posedness of low-rank tensor modeling in statistical problems. The goal of this article is to investigate this in low-rank tensor regression problems, and to provide solutions that avoid CP degeneracy in the regression setting.
To better understand this non-existence of a best rank-$R$ approximation for a higher-order tensor, the border rank was introduced in Bini et al. (1980). The border rank of a tensor $A$ is defined as $\mathrm{rank}_B(A) := \min\{R : \text{for any } \epsilon > 0, \text{ there exists a tensor } C \text{ such that } \|C\|_F < \epsilon \text{ and } \mathrm{rank}(A + C) = R\}$.
When $\mathrm{rank}_B(A) < \mathrm{rank}(A)$, $A$ does not have a best rank-$R$ approximation for $\mathrm{rank}_B(A) \leq R < \mathrm{rank}(A)$. Many examples of tensors with border rank smaller than the tensor rank can be found in the literature (e.g., Bini et al., 1980; Paatero, 2000; Stegeman, 2006, 2007; De Silva and Lim, 2008). For example, let
$$G = v_1 \circ w_2 \circ w_3 + w_1 \circ v_2 \circ w_3 + w_1 \circ w_2 \circ v_3,$$
where the pair $w_d, v_d \in \mathbb{R}^{p_d}$ are linearly independent for each $d = 1, 2, 3$. As shown in De Silva and Lim (2008), $\mathrm{rank}_B(G) = 2$ while $\mathrm{rank}(G) = 3$. One can construct the rank-2 tensor sequence $\{G_\gamma, \gamma = 1, 2, \ldots\}$:
$$G_\gamma = \gamma\Big(w_1 + \frac{1}{\gamma}v_1\Big) \circ \Big(w_2 + \frac{1}{\gamma}v_2\Big) \circ \Big(w_3 + \frac{1}{\gamma}v_3\Big) - \gamma\, w_1 \circ w_2 \circ w_3, \qquad (3)$$
for which $\|G_\gamma - G\|_F \to 0$ as $\gamma \to \infty$. In other words, this sequence of rank-2 tensors converges to a rank-3 tensor. Suppose, on the contrary, that there exists a solution $G_0$ to the best rank-2 approximation problem (2). It follows from this assumption that $\mathrm{rank}(G_0) \leq 2$ and $\|G_0 - G\|_F > 0$. However, by (3), we can always find a $\gamma$ such that $\|G_\gamma - G\|_F < \|G_0 - G\|_F$, which leads to a contradiction. Therefore, (2) does not have a solution in this example. One interesting phenomenon here is that, as $G_\gamma$ converges to $G$, some CP parameters of $G_\gamma$ diverge. Indeed, this example is not a special case: Krijnen et al. (2008) show that if the low-rank approximation problem does not have a solution, then some CP parameters of any sequence whose objective value converges to the infimum will diverge. It is worth noting that for matrices (i.e., $D = 2$), rank and border rank always coincide. For higher-order tensors (i.e., $D \geq 3$), there exists a positive volume of tensors that do not have a best low-rank approximation (De Silva and Lim, 2008). Therefore, numerical problems may arise in practice if one blindly computes a low-rank approximation.
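The sequence (3) can be checked numerically. The sketch below is illustrative only: the mode dimension, the random seed, and the draws of $w_d, v_d$ are our own choices, not from the paper. It builds $G$ and $G_\gamma$ and shows the Frobenius error shrinking while the CP scale $\gamma$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3  # illustrative mode dimension
w = [rng.standard_normal(p) for _ in range(3)]
v = [rng.standard_normal(p) for _ in range(3)]
# With probability 1, each pair (w_d, v_d) is linearly independent.

def outer3(a, b, c):
    """Outer product a ∘ b ∘ c as a third-order tensor."""
    return np.einsum('i,j,k->ijk', a, b, c)

# Rank-3 tensor G with border rank 2.
G = outer3(v[0], w[1], w[2]) + outer3(w[0], v[1], w[2]) + outer3(w[0], w[1], v[2])

def G_gamma(g):
    """The rank-2 sequence (3) converging to G as γ → ∞."""
    return g * outer3(w[0] + v[0] / g, w[1] + v[1] / g, w[2] + v[2] / g) \
        - g * outer3(w[0], w[1], w[2])

for g in [1.0, 10.0, 100.0, 1000.0]:
    print(g, np.linalg.norm(G_gamma(g) - G))
# The Frobenius error decays like O(1/γ) while the factors scale like γ:
# the approximating CP parameters diverge, exactly as CP degeneracy predicts.
```

The growing factor scale with vanishing approximation error is the numerical signature of degeneracy discussed throughout the article.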
In recent years, research on tensor regression has gained popularity. Due to its empirical successes, the low-rank assumption is among the most popular modeling strategies used to overcome the "large-$p$-small-$n$" problem in the underlying tensor parameter estimation (e.g., Zhou et al., 2013; Rabusseau and Kadri, 2016; Guhaniyogi et al., 2017; Hao et al., 2019; Lock, 2018; Raskutti et al., 2019). The low-rank assumption is often just a working assumption, and may only hold approximately. In this case, the well-posedness of the low-rank modeling hinges on the existence of a best low-rank approximation. However, the related CP degeneracy problem in tensor regression is rarely mentioned. To the best of the authors' knowledge, there is virtually no work on understanding this issue in the context of tensor regression, which we hope to address in this article. We will also provide strategies to avoid CP degeneracy in this statistical setup.

Tensor linear regression
We briefly review the tensor linear regression model (e.g., Guo et al., 2011; Zhou et al., 2013; Hoff, 2015; Suzuki, 2015; Yu and Liu, 2016; Sun and Li, 2017; Li and Zhang, 2017; Guhaniyogi et al., 2017; Li et al., 2018; Lock, 2018; Kang et al., 2018; Raskutti et al., 2019; Chen et al., 2019; Zhang et al., 2020). Suppose we have a data set $\{(y_i, X_i) : i = 1, \ldots, n\}$, where $y_i \in \mathbb{R}$ is a response variable and $X_i \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is a tensor covariate. The tensor linear model assumes
$$y_i = \langle A_0, X_i \rangle + \epsilon_i, \qquad (4)$$
where $\langle \cdot, \cdot \rangle$ is the entry-wise inner product, $A_0 \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is an unknown fixed coefficient tensor, and $\epsilon_i \in \mathbb{R}$ is a random error with mean zero. The commonly used squared loss function leads us to
$$F(A) := \sum_{i=1}^{n} (y_i - \langle A, X_i \rangle)^2. \qquad (5)$$
In some applications, it is reasonable to assume $A_0$ admits a low-rank structure (Guo et al., 2011; Zhou et al., 2013). One can directly impose the rank restriction $\mathrm{rank}(A) \leq R$ in (5) and estimate the coefficient tensor by solving
$$\arg\min_{A : \, \mathrm{rank}(A) \leq R} F(A). \qquad (6)$$
Another way to integrate the rank-$R$ assumption is to reparametrize $A$ using the CP decomposition form (1), i.e., $A(\theta) = \sum_{r=1}^{R} \beta_{1,r} \circ \cdots \circ \beta_{D,r}$. Let $\theta$ and $B_d$ be the collections of all CP parameters and those in the $d$-th dimension respectively. In other words, we write
$$B_d = (\beta_{d,1}, \ldots, \beta_{d,R}) \in \mathbb{R}^{p_d \times R}, \qquad \theta = (\mathrm{vec}(B_1)^\top, \ldots, \mathrm{vec}(B_D)^\top)^\top. \qquad (7)$$
The squared loss with respect to $\theta$ can then be written as
$$f(\theta) := F(A(\theta)), \qquad (8)$$
and the corresponding estimator is obtained by solving
$$\arg\min_{\theta} f(\theta). \qquad (9)$$
From (6) and (9), we can see there are two ways to parametrize the low-rank coefficient tensors: (i) the tensor $A$ itself with the rank restriction; (ii) the CP parameters $\theta$ defined as in (7). Without further penalization, these two parametrizations are equivalent, in the sense that if one of the corresponding optimizations cannot attain its infimum, neither can the other. However, as will be seen in Subsections 3.1 and 3.2, using the CP parameters $\theta$ helps characterize the problem of CP degeneracy. Moreover, we will show (in Subsections 3.3 and 3.4) that adding a correct form of penalization on the CP parameters $\theta$ solves the degeneracy issue, but this is not necessarily true when the penalization is directly applied to the coefficient tensor. Although we mainly focus on the least squares problem in this article, the results in the following sections can be generalized to other commonly used convex objective functions.
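For concreteness, the squared loss under the CP parametrization can be sketched as follows for $D = 3$; the function names and data sizes are illustrative, not from the paper.

```python
import numpy as np

def cp_to_tensor(B):
    """A(θ) = Σ_{r=1}^R β_{1,r} ∘ β_{2,r} ∘ β_{3,r} from factor matrices B[d] of shape (p_d, R)."""
    return np.einsum('ir,jr,kr->ijk', B[0], B[1], B[2])

def squared_loss(B, X, y):
    """f(θ) = F(A(θ)) = Σ_i (y_i − ⟨A(θ), X_i⟩)², with X of shape (n, p1, p2, p3)."""
    A = cp_to_tensor(B)
    pred = np.einsum('nijk,ijk->n', X, A)  # entry-wise inner products ⟨A, X_i⟩
    return float(np.sum((y - pred) ** 2))

# Tiny usage check on synthetic data (sizes are illustrative).
rng = np.random.default_rng(0)
n, dims, R = 30, (4, 3, 2), 2
B0 = [rng.standard_normal((p, R)) for p in dims]
X = rng.standard_normal((n, *dims))
y = np.einsum('nijk,ijk->n', X, cp_to_tensor(B0))  # noiseless model (4)
print(squared_loss(B0, X, y))  # zero at the true parameters
```

The loss is zero at the generating parameters in the noiseless case, matching the equivalence of the two parametrizations before penalization.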
Examples, theory and solution

Existence of degeneracy
Obviously, the first and most natural question to ask is whether degeneracy can occur in tensor linear regression. Compared with the best low-rank approximation problem (2), tensor linear regression involves an additional component, namely the predictors. Whether degeneracy occurs is thus also affected by the values of the predictors. Indeed, it is easy to see that the classical degeneracy in (2) is a special case of degeneracy in the regression setting, as follows.
Example 1. Consider the noiseless case in tensor linear regression, i.e., $\epsilon_i = 0$ in (4). Suppose $n = \prod_d p_d$ and the set of covariate tensors $\{X_i\}$ is chosen as
$$\{X_i\} = \{e_{1,i_1} \circ e_{2,i_2} \circ \cdots \circ e_{D,i_D} : 1 \leq i_d \leq p_d,\ d = 1, \ldots, D\},$$
where $e_{d,i_d}$ is the standard basis vector of the Euclidean space $\mathbb{R}^{p_d}$ whose $i_d$-th component equals 1 and whose remaining components are 0. If the true coefficient tensor $A_0$ in (4) does not have a best rank-$R$ approximation, then the corresponding least squares problem (6) does not have a solution.
Example 1 shows that when $\{X_i\}$ form the full standard basis of $\mathbb{R}^{p_1 \times \cdots \times p_D}$, the optimization (6) can be written as
$$\arg\min_{A : \, \mathrm{rank}(A) \leq R} \|A_0 - A\|_F^2, \qquad (10)$$
which recovers the low-rank approximation form (2). If $A_0$ does not have a best rank-$R$ approximation, then (10) has no solution and thus the original least squares problem does not have a solution. Although we only consider the noiseless case in Example 1, it can also be shown that degeneracy happens with positive probability in noisy cases; see Theorem 8.4 of De Silva and Lim (2008) for the details.
Next, we will give more interesting examples of degeneracy in unpenalized tensor linear regression. We will also consider penalized regressions with, e.g., the ridge penalty, in Section 3.4. For simplicity, we denote
$$Z = (\mathrm{vec}(X_1), \ldots, \mathrm{vec}(X_n))^\top \quad \text{and} \quad y = (y_1, \ldots, y_n)^\top, \qquad (11)$$
where $\mathrm{vec}(\cdot)$ is the vectorization operator. Using the notations defined in (11), the squared loss function $F(A)$ in (5) can be rewritten as $F(A) = \|y - Z\,\mathrm{vec}(A)\|^2$, where $\|\cdot\|$ is the Euclidean norm. Without the rank restriction, the classical theory of least squares estimation shows that the set of solutions minimizing $F(A)$ is
$$S = \big\{A : \mathrm{vec}(A) = Z^{+} y + (I - Z^{+} Z)\, b,\ b \in \mathbb{R}^{\prod_d p_d}\big\}, \qquad (12)$$
where $(\cdot)^{+}$ denotes a generalized inverse of a matrix and $I$ is an identity matrix of compatible dimension.
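The solution set (12) can be illustrated numerically. The sketch below uses hypothetical sizes and takes the Moore-Penrose pseudoinverse as the generalized inverse; it verifies that two distinct members of $S$ attain the same loss.

```python
import numpy as np

# Illustrative sizes; rows of Z are vec(X_i) as in (11).
rng = np.random.default_rng(1)
n, dims = 20, (3, 3, 3)
X = rng.standard_normal((n, *dims))
y = rng.standard_normal(n)
Z = X.reshape(n, -1)

# One member of the solution set S in (12): vec(A) = Z^+ y.
vecA = np.linalg.pinv(Z) @ y
# Any other member adds a null-space component (I − Z^+ Z) b.
b = rng.standard_normal(Z.shape[1])
vecA_other = vecA + (np.eye(Z.shape[1]) - np.linalg.pinv(Z) @ Z) @ b

# Both attain the same minimal value of F(A) = ||y − Z vec(A)||².
F1 = float(np.sum((y - Z @ vecA) ** 2))
F2 = float(np.sum((y - Z @ vecA_other) ** 2))
print(F1, F2)
```

With $n$ smaller than $\prod_d p_d$, as here, $Z$ has a nontrivial null space and $S$ is a whole affine set, which is the setting of Example 3 below where $Z^\top Z$ is not invertible.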
Lemma 1. Suppose there exists $A_m \in S$ with border rank $R_b := \mathrm{rank}_B(A_m)$, and let $R_m := \min_{A \in S} \mathrm{rank}(A)$, where $S$ is the set of solutions of the tensor linear model as defined in (12). If $R_b \leq R < R_m$, then the optimizations (6) and (9) do not have a solution.
The proof of Lemma 1 can be found in Section S.1.2. Using Lemma 1, we can give two concrete examples (Examples 2 and 3) that illustrate the existence of degeneracy.
Example 2. Suppose one observes data such that the solution set is $S = \{G_b\}$, where $G_b$ is constructed as the tensor $G$ in Section 2 from pairs of linearly independent vectors $w_d, v_d \in \mathbb{R}^{p_d}$, $d = 1, 2, 3$. Then there is no solution to (6) with $R = 2$. To see this, we only need to confirm that the setting satisfies the condition of Lemma 1. By definition, we have $S = \{G_b\}$. By Corollary 5.12 of De Silva and Lim (2008), $\mathrm{rank}(G_b) = 3$ and $\mathrm{rank}_B(G_b) = 2$. Thus, this setting satisfies the condition of Lemma 1.
Example 2 illustrates a case where the corresponding design matrix $Z$ in (11) has full column rank. Next, we give a more general example that does not require full column rank of $Z$, so that $Z^\top Z$ is not necessarily invertible. Note that, when $j = (i_1 - 1)(p_2 p_3) + (i_2 - 1)p_3 + i_3$, the $j$-th column of $Z$ consists of the predictor values at the $(i_1, i_2, i_3)$-th entry of the covariate tensor $X \in \mathbb{R}^{p_1 \times p_2 \times p_3}$.
Example 3. Suppose one observes data $(X_i \in \mathbb{R}^{p_1 \times p_2 \times p_3}, y_i \in \mathbb{R})$, $i = 1, \ldots, n$, such that the columns of $Z$ corresponding to certain entry positions of the predictors, together with the remaining columns, form a pair of linearly independent vectors for each $d = 1, 2, 3$. Then there is no solution when one uses (6) with $R = 2$ to fit the low-rank tensor linear regression. The proof of this result is given in Section S.1.1.
Having demonstrated the possibility of CP degeneracy in tensor linear regression, we now seek to understand this phenomenon so as to avoid it. The following subsection provides some helpful characteristics of CP degeneracy.

Characterization of degeneracy
Next, we provide some useful and important characteristics of CP degeneracy in tensor regression. First, Theorem 1 below shows that if degeneracy occurs, the CP parameters will diverge. Note that there is a scaling indeterminacy in the CP decomposition, e.g., $\beta_{1,r} \circ \beta_{2,r} \circ \cdots \circ \beta_{D,r} = (s\beta_{1,r}) \circ (s^{-1}\beta_{2,r}) \circ \cdots \circ \beta_{D,r}$ for any $s \neq 0$, which affects the magnitude of the CP parameters. To avoid the effect of scaling indeterminacy, we use
$$M(\theta) := \sum_{r=1}^{R} \prod_{d=1}^{D} \|\beta_{d,r}\|_2$$
to quantify the magnitude of the CP parameters in (7).
Theorem 1. Let $\{\theta_t : t = 1, 2, \ldots\}$ be a sequence such that $f(\theta_t) \to \inf f$. If the infimum of $f$ is not attainable, then $M(\theta_t) \to \infty$.

The proof of Theorem 1 can be found in Section S.1.3. From a practical perspective, this theorem can best be linked with the situation where an iterative algorithm is (blindly) adopted to minimize $f$. Indeed, most existing methods for solving the optimization over a low-CP-rank space are iterative algorithms (e.g., Guo et al., 2011; Tan et al., 2012; Zhou et al., 2013; Lock, 2018; Zhang et al., 2018; Hao et al., 2019; Zhou et al., 2020). Let $\theta_t$ represent the output value of $\theta$ at the $t$-th iteration of the algorithm. For such an algorithm, a reasonable expectation is that $f(\theta_t) \to \inf f$. As shown in Theorem 1, a characteristic of CP degeneracy is that $M(\theta_t) \to \infty$. This can indeed be checked in practice, as illustrated in Section 4. In the case of CP degeneracy, these algorithms may produce overflow problems if one does not limit the maximum number of iterations. Note that terminating the algorithm after some number of iterations without convergence is an early stopping strategy, which can be regarded as a form of regularization. To be precise, the resulting estimator is then no longer the one defined by (9), and would require a separate analysis of its statistical behavior.
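The magnitude measure can be sketched in code, taking $M(\theta)$ as the sum over rank-1 terms of the products of the factor Euclidean norms (our scale-invariant reading of the paper's measure; an assumption on our part).

```python
import numpy as np

def cp_magnitude(B):
    """M(θ) read as Σ_{r=1}^R Π_{d=1}^D ||β_{d,r}||_2 (our assumption about
    the paper's exact display).  B[d] is the factor matrix of shape (p_d, R)."""
    norms = np.stack([np.linalg.norm(Bd, axis=0) for Bd in B])  # shape (D, R)
    return float(np.sum(np.prod(norms, axis=0)))

rng = np.random.default_rng(2)
B = [rng.standard_normal((4, 2)) for _ in range(3)]
# Scaling indeterminacy: rescaling β_{1,r} → s β_{1,r}, β_{2,r} → β_{2,r}/s
# changes neither the tensor A(θ) nor this magnitude measure.
s = 7.0
B_rescaled = [B[0] * s, B[1] / s, B[2]]
print(cp_magnitude(B), cp_magnitude(B_rescaled))
```

Tracking this quantity along the iterates $\theta_t$ is the diagnostic used in Section 4: a divergent $M(\theta_t)$ signals CP degeneracy.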
Next, we provide Theorems 2 and 3, which present upper bounds on the minimum eigenvalues associated with the CP parameters. These bounds indicate approximate linear dependency among the CP parameters when degeneracy occurs. We use $C$, with or without subscripts, to denote a positive constant which may change value from line to line.
Assumption 1. There exists a constant $S_1$ such that

Assumption 1 can be regarded as a restricted eigenvalue condition with respect to the design $\{X_i\}$.
For notational simplicity, we write

In the above, $\otimes$ denotes the Kronecker product and $\beta_{dr,t} \in \mathbb{R}^{p_d}$ is the component of $\theta_t$ defined in Theorem 1. We use $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ to denote the minimum and maximum eigenvalues of a matrix respectively.
Theorem 2. Under Assumption 1, there exists a constant $t_0$, depending on the sequence $\{\theta_t\}$, such that if $t \geq t_0$, then

and

The proof of Theorem 2 is presented in Section S.1.4. By Theorem 1, $M(\theta_t)$ diverges if degeneracy occurs. Therefore, in degenerate cases, the CP component matrices are nearly rank-deficient.
We next show in Lemma S.1 that Assumption 1 holds with high probability in random design settings (Assumption 2). Consequently, an analogous non-asymptotic result is shown in Theorem 3. We begin by presenting the definitions of sub-Gaussian random variables and vectors.
Definition 1 (Sub-Gaussian random variables (Vershynin, 2018)). The sub-Gaussian norm of a random variable $X$, denoted by $\|X\|_{\psi_2}$, is defined as the smallest $V > 0$ such that
$$\mathbb{E}\exp(X^2/V^2) \leq 2. \qquad (14)$$
We say $\|X\|_{\psi_2} = \infty$ when there is no positive $V$ satisfying (14).

Definition 2 (Sub-Gaussian random vectors (Vershynin, 2018)). A random vector $x$ in $\mathbb{R}^p$ is called sub-Gaussian if the one-dimensional marginals $\langle x, a \rangle$ are sub-Gaussian random variables for all $a \in \mathbb{R}^p$. The sub-Gaussian norm of $x$ is defined to be $\|x\|_{\psi_2} := \sup_{\|a\|_2 = 1} \|\langle x, a \rangle\|_{\psi_2}$.
Assumption 2. Assume the $X_i$ are i.i.d. with $\mathbb{E}\{\mathrm{vec}(X_i)\mathrm{vec}(X_i)^\top\} = \Sigma$, which satisfies $S_{\min} \leq \lambda_{\min}(\Sigma) \leq \lambda_{\max}(\Sigma) \leq S_{\max}$ and $\|\Sigma^{-1/2}\mathrm{vec}(X_i)\|_{\psi_2} \leq \kappa$, where $S_{\min}$, $S_{\max}$ and $\kappa$ are constants.
Theorem 3. Under Assumption 2, if $n > C_1 w^2(\mathcal{P})$, the two statements in Theorem 2 hold with probability at least $1 - 2\exp\{-C_2 w^2(\mathcal{P})\}$, where $\mathcal{P}$ is the low-rank space defined in (13).
Here we provide an upper bound on $w(\mathcal{P})$ as follows.
Theorem 4. We have

The proofs of Theorems 3 and 4 are presented in Sections S.1.5 and S.1.6, respectively. Overall, the characteristics demonstrated in Theorems 1-3 can be used in practice to detect degeneracy. In fact, they lead to solutions that prevent degeneracy in both theory and practice. We will discuss such a solution rigorously in the next subsection.

Strategy to overcome the problem
As suggested by Theorem 1, to avoid the degeneracy problem in tensor linear regression, we should restrict the magnitude measure $M(\theta)$. A simple solution is to impose a penalty:
$$\arg\min_{\theta} f(\theta) + \lambda g(\theta), \qquad (15)$$
where $\lambda > 0$ is the penalty parameter and $g$ is the penalty function. In this article, we let $g(\theta)$ be a continuous function that satisfies the following assumption.
Note that the above assumption on $g(\cdot)$ is mild and is satisfied by many commonly used penalty functions of $\theta$. Examples include the LASSO penalty (Tibshirani, 1996) $g(\theta) = \sum_{d,r,l_d} |\beta_{d,r,l_d}|$ and the ridge penalty
$$g(\theta) = \|\theta\|^2 = \sum_{d,r,l_d} \beta_{d,r,l_d}^2, \qquad (16)$$
where $\beta_{d,r,l_d}$ is the $l_d$-th element of $\beta_{d,r}$. In fact, these penalties have been used in previous work (Guo et al., 2011; Zhou et al., 2013; Hao et al., 2019), but without any consideration of CP degeneracy. With Assumption 3, the infimum of $f(\theta) + \lambda g(\theta)$ is attainable for any positive $\lambda$. Formally, the following Corollary 1 shows that CP degeneracy does not occur in (15); its proof is given in Section S.1.7.
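As an illustration, the penalized criterion (15) with the ridge penalty (16) can be sketched as below. Flattening $\theta$ for use with generic optimizers is our own convention, not the paper's.

```python
import numpy as np

def ridge_objective(theta_flat, shapes, X, y, lam):
    """The penalized criterion (15) with ridge penalty g(θ) = ||θ||² as in (16):
    f(θ) + λ ||θ||², for D = 3.  θ is passed flattened (our own convention),
    together with the factor shapes [(p_d, R)]."""
    B, k = [], 0
    for p, R in shapes:
        B.append(theta_flat[k:k + p * R].reshape(p, R))
        k += p * R
    A = np.einsum('ir,jr,kr->ijk', *B)        # A(θ) via the CP form (1)
    pred = np.einsum('nijk,ijk->n', X, A)     # ⟨A(θ), X_i⟩
    return float(np.sum((y - pred) ** 2) + lam * np.sum(theta_flat ** 2))
```

The additive $\lambda\|\theta\|^2$ term makes the objective coercive in $\theta$, which is exactly why its infimum is attained and degeneracy is ruled out.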
As mentioned before, the best low-rank approximation of a higher-order tensor may not exist, and the least squares estimator of tensor linear regression restricted to a low-rank space may not be attainable. Thus, when degeneracy occurs, the numerical results are unstable over the iterations of any iterative algorithm and the solution of the optimization is not well-defined. Borrowing the idea of Corollary 1, the following Theorem 5 bypasses the rank-$R$ assumption and presents the asymptotic properties of the estimator (15). We note that such an assumption is required in most existing convergence analyses for tensor linear regression methods (e.g., Zhou et al., 2013; Suzuki, 2015; Lock, 2018). To the best of our knowledge, ours is the first result that does not require such a condition, and it works even when the best low-rank approximation does not exist. Before presenting the theorem, we need an assumption on the observational errors, which are intrinsic in many statistical applications.
By Example 1, the low-rank approximation problem can be treated as a special case of the tensor linear regression model without observational noise. Thus, the proof of Corollary 1 also guarantees the existence of the restricted low-rank approximation. In particular, for given $R > 0$ and $G > 0$, denote the restricted low-rank approximation error by
$$\delta := \min_{\theta : \, g(\theta) \leq G} \|A(\theta) - A_0\|_F.$$

Theorem 5. Suppose $\hat{A}$ is the coefficient tensor reconstructed from a solution $\hat{\theta}$ of (15). Under Assumptions 2-4, we have, with high probability,
$$\|\hat{A} - A_0\|_F^2 \leq C\Big\{\frac{R^D + R\sum_{d=1}^{D} p_d}{n} + \delta^2 + \frac{\lambda G}{n}\Big\}. \qquad (17)$$

The proof of Theorem 5 is given in Section S.1.8. The first term on the right hand side (RHS) of (17) is due to the estimation error. In many real applications, $D = 3, 4$, and $R$ is very small, e.g., $R = 3$ (Zhou et al., 2013). Thus, $(R^D + R\sum_{d=1}^{D} p_d)/n$ is very close to $R\sum_{d=1}^{D} p_d/n$ in terms of order, which is the minimax lower bound shown in Suzuki (2015). The second term on the RHS of (17) is due to the approximation error. Note that we do not make any assumption on the relationship between $R$ and the true rank $R_0$. That is, the theorem allows $R_0 - R$ to be arbitrarily large. If $R \geq R_0$ and $G = g(\theta_0)$, where $\theta_0$ contains the CP parameters of $A_0$, we have $\delta^2 = 0$. The third term on the RHS of (17) is the bias due to penalization. Using a $\lambda$ satisfying $\lambda \leq C\max\{(R^D + R\sum_{d=1}^{D} p_d)/G,\ n\delta^2/G\}$, the bias term is dominated by the first two terms on the RHS.
In addition to the tensor $\hat{A}$, we are also interested in the asymptotic behavior of the corresponding CP parameters. For a given data set, Assumption 3 can be used to show that $\hat{\theta}$ is bounded whenever $\lambda > 0$. In the asymptotic framework, however, we consider a sequence of data sets and penalties $\lambda$ as $n \to \infty$. Therefore, the result does not follow straightforwardly from Corollary 1. Since a generic penalty function $g$ is considered in this article, we derive the rate of convergence of $\hat{G}$ to obtain a general result, where $\hat{G} := g(\hat{\theta})$ for an attainable $\hat{\theta}$.
Corollary 2. Under the same assumptions as Theorem 5 and taking

The proof of Corollary 2 is in Section S.1.9.

Not every penalty works
Assumption 3 gives a sufficient condition on the penalty function to prevent CP degeneracy. The crux is to control the magnitude of the estimator. In this subsection, we give an example showing that the choice of parametrization whose magnitude is controlled is crucial. More specifically, we focus on the original tensor parametrization $A$ instead of the CP parametrization $\theta$. A natural way to control its magnitude is to impose an element-wise ridge ($l_2$) penalty on the regression coefficient tensor $A$ (e.g., Guo et al., 2011). In other words, consider
$$\arg\min_{A : \, \mathrm{rank}(A) \leq R} F(A) + \alpha \|A\|_F^2. \qquad (18)$$

Proposition 1. The optimization (18) is not guaranteed to have a solution.
The following Example 4 serves as a proof of Proposition 1. We also compare the estimators based on (9) and (15) in the simulation study of Section 4.
Example 4. If $Z$ and $y$ satisfy the conditions of Example 2, then there does not exist any solution when one uses (18) with $R = 2$ to estimate the low-rank coefficient tensor in the tensor linear model (4). The proof of this result is given in Section S.1.1.
If the optimization (18) cannot attain its infimum then, similar to Theorem 1, the magnitude of the CP parameters may diverge. This phenomenon is confirmed by our simulation study (Table 1).
In practice, most algorithms impose an upper bound on the number of iterations, which prevents various numerical issues due to CP degeneracy. The resulting estimators could still perform well in practice. However, the effect of such an early stopping strategy requires more understanding.

Numerical experiments
To further understand CP degeneracy in tensor linear regression, we performed numerical experiments on synthetic data generated from a variety of settings. Following our discussion in Section 3, we compared three methods: tensor linear regression without penalization, with an $l_2$ penalty on the CP parameters $\theta$, and with an $l_2$ penalty on the coefficient tensor $A$. All of the corresponding optimization problems were solved by a block updating algorithm (Guo et al., 2011; Zhou et al., 2013; Lock, 2018); the details are stated in Algorithm 1 with the following notations. Recall the definition of $f(\theta)$ in (8) and the notation $B_d$ in (7). For notational simplicity, when the tuning parameters $\lambda$ and $\alpha$ are given, we respectively rewrite the least squares method (9), the CP parameters penalized method (15) with $g$ specified as the ridge penalty (16), and the coefficient tensor penalized method (18) as

We randomly generated the initial values $B_d^0$ and then ran Algorithm 1 to obtain an estimate from each method. Since all of the optimizations are non-convex, we used 5 random initial values and, for each method, chose the one that minimized the corresponding objective value.
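A minimal block-updating sketch in the spirit of Algorithm 1 is given below for $D = 3$ with the ridge penalty on $\theta$. This is an illustrative re-implementation under our own notation and initialization scheme, not the paper's exact algorithm.

```python
import numpy as np

def als_ridge(X, y, R, lam, iters=30, seed=0):
    """Block updating sketch for min_θ Σ_i (y_i − ⟨A(θ), X_i⟩)² + λ||θ||² (D = 3).
    Each block B_d is updated by an exact ridge least squares step given the
    other two blocks, so the objective is non-increasing over updates."""
    rng = np.random.default_rng(seed)
    n, p1, p2, p3 = X.shape
    B = [0.1 * rng.standard_normal((p, R)) for p in (p1, p2, p3)]
    paths = ['nijk,jr,kr->nir', 'nijk,ir,kr->njr', 'nijk,ir,jr->nkr']
    for _ in range(iters):
        for d in range(3):
            others = [B[e] for e in range(3) if e != d]
            # Contract X with the fixed blocks: the design for vec(B_d).
            U = np.einsum(paths[d], X, *others).reshape(n, -1)
            coef = np.linalg.solve(U.T @ U + lam * np.eye(U.shape[1]), U.T @ y)
            B[d] = coef.reshape(B[d].shape)
    return B

def penalized_loss(B, X, y, lam):
    """Objective value at the current blocks."""
    A = np.einsum('ir,jr,kr->ijk', *B)
    pred = np.einsum('nijk,ijk->n', X, A)
    return float(np.sum((y - pred) ** 2) + lam * sum(np.sum(Bd ** 2) for Bd in B))
```

Since each block update solves its subproblem exactly, monitoring `penalized_loss` (and the magnitude $M(\theta_t)$) across iterations mirrors the diagnostics used in this section.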
In this study, each simulated dataset $\{(y_i, X_i), i = 1, \ldots, n\}$ was generated from one of the following four cases, covering both noiseless and noisy settings: Case 1a: $y_i = \langle A_0^a, X_i \rangle$; Case 1b: $y_i = \langle A_0^b, X_i \rangle$; Case 2a: $y_i = \langle A_0^a, X_i \rangle + \epsilon_i$; Case 2b: $y_i = \langle A_0^b, X_i \rangle + \epsilon_i$. In the above, $X \in \mathbb{R}^{p_0 \times p_0 \times p_0}$ was randomly generated with independent $\mathrm{Unif}(0, 1)$ entries, and the error $\epsilon$ was an independent $N(0, \sigma^2)$ random variable. The variance $\sigma^2$ was chosen such that the signal-to-noise ratio $\mathrm{SNR} = \mathrm{Var}\{m(X)\}/\sigma^2 = 4$, where $m$ is the regression function (i.e., $\langle A_0^a, X \rangle$ or $\langle A_0^b, X \rangle$). We set $R_0 = 3$, and the true coefficient tensors $A_0^a$ and $A_0^b$ were constructed from vectors $w_d, u_d, v_d \in \mathbb{R}^{p_0}$ with $p_0 = 5$, whose entries were independently drawn from $\mathrm{Unif}(-5, 5)$, $d = 1, 2, 3$, for every simulated dataset. It can be seen from the data generating process that $w_d, u_d, v_d$ are linearly independent, for $d = 1, 2, 3$, with probability 1. Although both $A_0^a$ and $A_0^b$ have CP rank 3, they behave differently in terms of border rank: the border rank of $A_0^a$ is strictly less than its CP rank, so it is a degenerate tensor (De Silva and Lim, 2008). We considered two sample sizes, $n = 100$ and $200$, for each case. In the experiments, Algorithm 1 was applied to the simulated datasets with $R = 2$ and $3$ for each of the three methods. We varied the tuning parameters $\lambda$ and $\alpha$, with values chosen from $\{0.001, 0.01, 0.1\}$. For each setting, the experiments were replicated 50 times.
We investigated CP degeneracy in the experiments by checking whether the magnitude of the CP parameters diverges as the iteration number goes to infinity. We note that this is not a perfect rule, since the underlying optimizations are non-convex and the iteration sequences may not converge to a global optimum even with multiple initializations. However, we will see in the results below that our empirical rule for determining CP degeneracy is fairly accurate and does not affect the overall conclusions of our experiments.
Next we construct this empirical rule. In practice, we only have access to a finite number of iterations due to limited computational resources. In our experiments, we let the algorithm run for 100000 iterations. In order to numerically identify CP degeneracy, we calculated $M(\theta_t)$ to measure the magnitude of the CP parameters (see also Theorem 1), which gives us $\{t, M(\theta_t)\}$, $t = 1, \ldots, 100000$.
Two examples, for Case 1a with $(n, p, R, R_0) = (200, 5, 2, 3)$ and Case 2a with $(n, p, R, R_0) = (100, 5, 3, 3)$, are depicted in the first row of Figure 1. Determining whether the underlying iteration sequence of (19) is divergent can be regarded as a classification problem. We propose a simple classification rule to conservatively identify the divergent curves. First, as shown in Figure 1, the initial iterations usually exhibit different patterns and are clearly not crucial in determining divergence, so we focus on $t \geq 50000$. If $M(\theta_{100000}) \leq M(\theta_{50000})$, the curve is simply classified as non-divergent. The difficult situation is that $M(\theta_{100000}) > M(\theta_{50000})$, where our rule is based on estimating the gradient of a continuous surrogate. Consider a differentiable function defined on $[50000, \infty)$ that passes through the points $\{(t, M(\theta_t))\}_{t=50000}^{100000}$, with its gradient approximated by the exponential-type function
$$h_{a,b,c}(t) = a\, t^b e^{ct}, \qquad (20)$$
where $a$, $b$ and $c$ are parameters crucial for determining divergence. These parameters are estimated by applying least squares to the numerical proxies of the gradient:
$$\Big\{\Big(t,\ \frac{M(\theta_{t+100}) - M(\theta_t)}{100}\Big) : t = 50000, 50100, \ldots, 99900\Big\}.$$
Theoretically, when the gradient function has the form of (20), one can use the values of $a$, $b$ and $c$ to determine whether $\int_{50000}^{\infty} h_{a,b,c}(t)\,dt = \infty$ or not. It is known that either when $\{a > 0, c > 0\}$ or $\{a > 0, c = 0, b > -1\}$, the integral is infinite. Let $\hat{a}$, $\hat{b}$ and $\hat{c}$ be the fitted values of $a$, $b$ and $c$, respectively. Based on the above discussion, we use the corresponding rule to quantitatively identify whether the magnitude curve $M(\theta_t)$ diverges to infinity, where $\gamma_c, \eta_c > 0$ and $\gamma_b > -1$ are pre-chosen cutoffs. One example of such a fit is depicted in the bottom panels of Figure 2. Since there exist estimation and approximation errors due to fitting an exponential model, we set $(\gamma_b, \eta_c, \gamma_c) = (-0.5, 0, 0.0015)$ to obtain a conservative classification rule for divergence.
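The classification rule can be sketched as follows, assuming the gradient surrogate $h_{a,b,c}(t) = a\,t^b e^{ct}$ and fitting it by least squares in log scale (a simplification of the proxy-based fit described above; the cutoff defaults follow this section).

```python
import numpy as np

def classify_divergence(t, M, gamma_b=-0.5, eta_c=0.0, gamma_c=0.0015):
    """Conservative divergence rule for a magnitude curve M(θ_t).
    Fits the gradient proxy ΔM/Δt by the surrogate h(t) = a t^b exp(c t)
    via least squares in log scale (a simplification of the paper's fit),
    then flags divergence when the fitted ∫ h dt is judged infinite:
    ĉ > γ_c, or |ĉ| ≤ η_c with b̂ > γ_b."""
    t = np.asarray(t, dtype=float)
    M = np.asarray(M, dtype=float)
    if M[-1] <= M[0]:            # mirrors the M(θ_100000) ≤ M(θ_50000) pre-check
        return False
    g = np.diff(M) / np.diff(t)  # numerical gradient proxies
    tm = t[:-1]
    keep = g > 0                 # the log-scale fit needs positive proxies
    if keep.sum() < 3:
        return False
    design = np.column_stack([np.ones(keep.sum()), np.log(tm[keep]), tm[keep]])
    log_a, b, c = np.linalg.lstsq(design, np.log(g[keep]), rcond=None)[0]
    a = np.exp(log_a)            # always positive by construction
    return bool(a > 0 and (c > gamma_c or (abs(c) <= eta_c and b > gamma_b)))
```

On a curve growing like $e^{0.02 t}$ the rule flags divergence, while a decreasing curve is returned as non-divergent by the pre-check.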
The numbers of diverging curves identified using the aforementioned rule are summarized in Table 1. For the noiseless cases (Cases 1a and 1b), when the coefficient tensor is degenerate and a misspecified rank is used in the optimization ($R = 2$ in Case 1a), the methods (9) and (18) are identified as divergent in 50 out of 50 replications. This result is consistent with our discussion in Subsection 3.1. For the CP parameters penalized method (15) with $g$ specified as the ridge penalty (16), there is only one case identified as divergent, for small $\lambda$ ($\lambda = 0.001$), which confirms that CP penalization is able to overcome the problem of degeneracy, as discussed in Subsection 3.3. For the noisy situations, i.e., Cases 2a and 2b, although the CP parameters penalized method (15) has some rare cases identified as divergent, their number is significantly smaller than for the other two methods. Again, we note that the empirical rule for identifying divergence is not perfect, and so the very small number of identified divergences does not necessarily contradict our theory in Section 3.3. Figure 2 shows an example where the least squares and the coefficient penalized methods produce divergent iteration sequences, while the results of the CP parameters penalized method are identified as convergent. Overall, the simulation study demonstrates that divergence in tensor regression can occur in both noiseless and noisy cases if one directly uses the least squares method or the coefficient penalized method to estimate the unknown coefficient tensor. One way to overcome this challenge is to use the CP parameters penalized method, as discussed in Subsection 3.3 and confirmed in our numerical experiments.

Figure 2: The results with respect to (9) and (18) with $\alpha = 0.001, 0.01, 0.1$ are grouped together and presented in the first column using different colors and point shapes. The results with respect to (15) with $\lambda = 0.001, 0.01, 0.1$ are depicted in the second column using different colors and point shapes. The dotted lines with the same color in the bottom panels are the corresponding fitted curves using (20).
Table 1: The number of curves identified as diverging cases. Results are based on 50 data replications for each setting. The columns LS, $\lambda$'s, and $\alpha$'s correspond to the least squares method (9), the CP-parameters-penalized method (15) with $g$ specified as the ridge penalty (16), and the coefficient-penalized method (18), respectively.
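The paper's divergence rule is based on the fitted curves in (20), which are not reproduced here. As a hypothetical stand-in that conveys the idea, the sketch below flags a run as divergent when the magnitude sequence $M(\theta_t)$ still has a positive linear trend over its final iterations; the function name, window, and threshold are illustrative assumptions, not the paper's rule.

```python
import numpy as np

def flag_divergence(magnitudes, window=1000, tol=1e-6):
    """Flag a run as divergent if M(theta_t) still has a positive linear
    trend over the last `window` iterations.  The window and threshold
    are illustrative choices, not the paper's fitted-curve rule (20)."""
    tail = np.asarray(magnitudes[-window:], dtype=float)
    slope = np.polyfit(np.arange(tail.size), tail, 1)[0]
    return slope > tol

t = np.arange(5000)
converged = 3.0 + np.exp(-t / 200.0)  # magnitude flattens out
diverging = 3.0 + 0.01 * t            # magnitude grows without bound
print(flag_divergence(converged), flag_divergence(diverging))  # False True
```

Any rule of this kind is sensitive to the choice of window and threshold, which is one reason the empirical identification discussed above cannot be perfect.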

S.1.2 Proof of Lemma 1
Proof. Recall the definition of $F$; it is trivial to see that $\inf f = \inf F$. Let $A_t$ be the tensor corresponding to $\theta_t$. We will show two facts, namely (S.9) and (S.10), which together imply that the optimization (6) does not have a solution. Since $R < R_m$, it follows from (S.6) and (S.7) that (S.9) holds. Now let us prove (S.10).

It is trivial to obtain
for all $A$ satisfying $\operatorname{rank}(A) \le R$. Since $R \ge R_{bm}$, there exists a sequence $\{A_t\}$ satisfying $\operatorname{rank}(A_t) \le R$ and $\|A_m - A_t\| \to 0$. (S.12) Combining (S.9), (S.11), and (S.12) finishes the proof of (S.10). This completes the proof.
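The non-attainment established in Lemma 1 mirrors the classical construction of De Silva and Lim (2008): a rank-3 tensor can be approximated arbitrarily well by rank-2 tensors whose CP factors blow up. The following numerical sketch uses that standard construction (it is not taken from the paper's proofs) to illustrate the mechanism.

```python
import numpy as np

def outer3(a, b, c):
    """Third-order outer product a ∘ b ∘ c."""
    return np.einsum('i,j,k->ijk', a, b, c)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

# Rank-3 tensor with no best rank-2 approximation (De Silva & Lim, 2008).
A = outer3(x, x, y) + outer3(x, y, x) + outer3(y, x, x)

for t in (1e1, 1e3, 1e5):
    u = x + y / t
    A_t = t * outer3(u, u, u) - t * outer3(x, x, x)  # rank(A_t) <= 2
    # Approximation error shrinks while the CP weights (±t) blow up.
    print(f"t = {t:8.0e}   ||A_t - A|| = {np.linalg.norm(A_t - A):.3e}")
```

The error decreases like $\sqrt{3}/t$, yet the CP weights of $A_t$ grow like $t$, so the infimum over rank-2 tensors is approached but never attained — exactly the behavior of the sequences $\{A_t\}$ in the proof.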

S.1.3 Proof of Theorem 1
Proof. Without loss of generality, we can fix one representation of the CP decomposition. Denote the rescaled parameters $\tilde\theta_t$ accordingly for $d = 1, \ldots, D-1$, so that (S.13) holds. The following proof is similar to that of Lemma 1 of Krijnen et al. (2008). Assume $\{\theta_t\}$ is a sequence such that $f(\theta_t) \to \inf f$ and $M(\theta_t) < \infty$. By (S.13), $\{\tilde\theta_t\}$ is a bounded sequence, so by the Bolzano--Weierstrass theorem there exists a convergent subsequence $\{\tilde\theta_{t_j}\}$. It follows from the continuity of $f$ that $\tilde\theta = \lim_{j\to\infty} \tilde\theta_{t_j}$ attains the infimum of $f$. By the assumptions, this leads to a contradiction. Thus, any subsequence of $\{\tilde\theta_t\}$ is unbounded, which implies $M(\theta_t) = M(\tilde\theta_t) \to \infty$.

S.1.4 Proof of Theorem 2
Proof. Recall the definitions above. There exists a vectorization rule $\operatorname{vec}(\cdot)$ for a tensor such that $\operatorname{vec}(A_t) = D_t h_t$, (S.14) where $A_t$ is the tensor corresponding to $\theta_t$ and $D_t$ is defined accordingly. For simplicity, we denote $Z = (\operatorname{vec}(X_1), \ldots, \operatorname{vec}(X_n))^\top$. (S.16) Using the specific rearrangement rule (S.16), (5) at iteration $t$ can be written in vectorized form. We now examine the behavior of $D_t$ when the infimum is not attained. First, we note that there exists a constant $t_0$ such that the displayed bound holds for all $t \ge t_0$. Second, we establish the final result: using Assumption 1 and the definitions, we obtain the stated chain of inequalities. The claim for $D_t$ then follows by noting that $M(\theta_t) = \|h_t\|_1 \le \sqrt{R}\,\|h_t\|$, where $\|\cdot\|_1$ is the $\ell_1$-norm of a vector. The claim for $B_{d,t}$ follows from the proof of Corollary 2 in Krijnen et al. (2008).
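In (S.14), the matrix $D_t$ maps the CP weights $h_t$ to the vectorized coefficient tensor. For a third-order CP tensor, one standard choice of such a vectorization — the Khatri–Rao (column-wise Kronecker) product of the factor matrices, in the usual column-major convention — satisfies exactly an identity of this form. The check below assumes this standard convention; the paper's exact rearrangement rule may differ.

```python
import numpy as np

def khatri_rao(C, B, A):
    """Column-wise Kronecker product C ⊙ B ⊙ A."""
    R = A.shape[1]
    return np.column_stack(
        [np.kron(C[:, r], np.kron(B[:, r], A[:, r])) for r in range(R)]
    )

rng = np.random.default_rng(0)
p1, p2, p3, R = 3, 4, 2, 2           # hypothetical dimensions
A_f = rng.standard_normal((p1, R))   # factor matrices
B_f = rng.standard_normal((p2, R))
C_f = rng.standard_normal((p3, R))
h = rng.standard_normal(R)           # CP weights, playing the role of h_t

# CP tensor sum_r h_r a_r ∘ b_r ∘ c_r, built explicitly ...
X = np.einsum('r,ir,jr,kr->ijk', h, A_f, B_f, C_f)

# ... equals D h after column-major vectorization, with D = C ⊙ B ⊙ A.
D = khatri_rao(C_f, B_f, A_f)
print(np.allclose(X.flatten(order='F'), D @ h))  # True
```

With this structure, keeping the factor matrices normalized and letting $h_t$ carry the scale is what makes the magnitude $M(\theta_t) = \|h_t\|_1$ a natural measure of degeneracy.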

S.1.5 Proof of Theorem 3
We first present the argument about restricted eigenvalue.
Lemma S.1. Under Assumptions 2 and 4, if $n > Cw^2(P)$, then the stated bound holds with probability at least $1 - 2\exp(-C_3 w^2(P))$. Moreover, $w(P)$ in the lemma can be replaced by any $V$ with $V \ge w(P)$.
Proof of Lemma S.1. The result follows from Theorems 10 and 12 of Banerjee et al. (2015) together with Assumption 2. We now turn to the proof of Theorem 3.
Proof. Using Lemma S.1 and the same arguments as in the proof of Theorem 2, we obtain, with probability at least $1 - 2\exp\{-C_3 w^2(P)\}$, the stated bound, which implies $\lambda_{\min}(D_t^\top D_t) \le C n \|h_t\|^2$ with the same probability. The proof is then completed by following the arguments in the proof of Theorem 2.
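Lemma S.1 is stated in terms of the Gaussian width $w(P) = \mathbb{E}\sup_{u \in P}\langle g, u\rangle$ with $g \sim N(0, I)$. As a small illustration of this quantity (the set $P$ itself is defined elsewhere in the paper; the unit sphere below is a stand-in), a Monte Carlo estimate recovers the familiar value $w(S^{p-1}) = \mathbb{E}\|g\| \approx \sqrt{p}$.

```python
import numpy as np

def gaussian_width_sphere(p, n_draws=20000, seed=0):
    """Monte Carlo estimate of w(S^{p-1}) = E sup_{||u||=1} <g, u> = E ||g||,
    with g ~ N(0, I_p)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_draws, p))
    return np.linalg.norm(g, axis=1).mean()

p = 100
print(gaussian_width_sphere(p), np.sqrt(p))  # both approximately 10
```

The replacement of $w(P)$ by any upper bound $V \ge w(P)$ in Lemma S.1 is convenient precisely because such Monte Carlo or analytic upper bounds are often easier to obtain than $w(P)$ itself.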
where $Z$ and $y$ are defined in (11). If $\operatorname{rank}(Z) = \prod_{d=1}^{3} p_d$ and the tensor $G_b$ can be represented as

Figure 2: An example of a numerical experiment for Case 1a with $(n, p, R, R_0) = (200, 5, 2, 3)$. The top panels plot the magnitude $M(\theta_t)$ versus the iteration number $t$ for $t \ge 50000$. The bottom panels plot $\{(M(\theta_{t+100}) - M(\theta_t))/100\}$ and the fitted values $h_{\hat a, \hat b, \hat c}(t)$ versus $t$. The results for (9) and (18) with $\alpha = 0.001, 0.01, 0.1$ are grouped together and presented in the first column using different colors and point shapes. The results for (15) with $\lambda = 0.001, 0.01, 0.1$ are depicted in the second column using different colors and point shapes. The dotted lines of the same color in the bottom panels are the corresponding fitted curves using (20).