Minimizing Low-Rank Models of High-Order Tensors: Hardness, Span, Tight Relaxation, and Applications

We consider the problem of finding the smallest or largest entry of a tensor of order <inline-formula><tex-math notation="LaTeX">$N$</tex-math></inline-formula> that is specified via its rank decomposition. Stated in a different way, we are given <inline-formula><tex-math notation="LaTeX">$N$</tex-math></inline-formula> sets of <inline-formula><tex-math notation="LaTeX">$R$</tex-math></inline-formula>-dimensional vectors and we wish to select one vector from each set such that the sum of the Hadamard product of the selected vectors is minimized or maximized. We show that this fundamental tensor problem is NP-hard for any tensor rank higher than one, and polynomial-time solvable in the rank-one case. We also propose a continuous relaxation and prove that it is tight for any rank. For low-enough ranks, the proposed continuous reformulation is amenable to low-complexity gradient-based optimization, and we propose a suite of gradient-based optimization algorithms drawing from projected gradient descent, Frank-Wolfe, or explicit parametrization of the relaxed constraints. We also show that our core results remain valid no matter what kind of polyadic tensor model is used to represent the tensor of interest, including Tucker, HOSVD/MLSVD, tensor train, or tensor ring. Next, we consider the class of problems that can be posed as special instances of the problem of interest. We show that this class includes the partition problem (and thus all NP-complete problems via polynomial-time transformation), integer least squares, integer linear programming, integer quadratic programming, sign retrieval (a special kind of mixed integer programming / restricted version of phase retrieval), and maximum likelihood decoding of parity check codes. We demonstrate promising experimental results on a number of hard problems, including state-of-art performance in decoding low density parity check codes and general parity check codes.


I. INTRODUCTION
Finding the smallest or largest entry of a matrix or tensor specified via its low-rank factors is a fundamental problem with numerous applications in science and engineering.The matrix version of the problem has received some attention [1], [2], motivated by applications in graph mining (e.g., most significant missing link prediction, nodes that share many neighbors), text/speech/audio similarity search and retrieval (e.g., using text embeddings), and recommender systems (e.g., finding the best item-context combination to recommend to a given user).
The tensor version of the problem is considerably more powerful, as it allows going beyond bipartite matching / prediction which can be broadly useful in knowledge discovery.As an example, given embeddings of patients, conditions, drugs, clinical trials, therapies, we may be interested in finding the best match, as measured by a function of the pointwise product of the respective embeddings.Tensors can also be used to model high-dimensional probability distributions, wherein low-rank tensor models are used to break the curse of dimensionality while allowing easy marginalization and computation of conditional probabilities, which are crucial for prediction [3].In this context, finding the largest element corresponds to finding the mode / maximum likelihood or maximum aposteriori estimate of missing variables, and it is not unusual to encounter very high-order probability tensors indexed by hundreds of categorical input variables.Despite the obvious importance of the tensor version of this problem, there is scant literature about it -we found only [4], which essentially extends the approach in [2] to the low-order tensor case.

A. Prior work
The matrix case was considered in [1], which proposed using a power-method type algorithm that works directly with the low-rank factors.In [2], the authors developed a randomized "diamond sampling" approach for computing the maximum element of the product of two matrices (which could be, e.g., two low-rank factors) in what they called the MAD (Maximum All-pairs Dot-product) search.Their algorithm comes with probabilistic performance guarantees and was demonstrated to work well in practice using a variety of datasets.As already mentioned above, [4] extends the randomized diamond sampling approach in [2] to "star sampling" for the tensor case.These randomized algorithms do not scale well to high-order tensors, owing to the curse of dimensionality.
A very different approach for the higher-order tensor version of the problem has been proposed in the computational physics / numerical algebra literature [5].The basic idea of [5] (see also references therein) is as follows.By vectorizing the tensor and putting the resulting (very long) vector on the diagonal of a matrix, the tensor elements become eigenvalues corresponding to coordinate basis eigenvectors.This suggests that the maximum element of the tensor can be computed through a power iteration involving this very large matrix.Of course power iterations implemented naively will have prohibitive complexity (as tensor vectorization produces a very long vector and thus a huge matrix).The idea is therefore to employ a tensor factorization model to ease the matrix-vector multiplication of the diagonal matrix and the interim solution vector.This multiplication can be computed by summing the elements of the pointwise (Hadamard) product of the two vectors -the vectorized tensor on the matrix diagonal and the interim solution vector.This product can be efficiently computed for various tensor models, but the pointwise multiplication of two tensors of rank R generally has rank up to R 2 , necessitating conversion back to a rank-R tensor.The natural way to do this is via rank-R least-squares approximation of the higher-rank product.This is a hard and generally ill-posed computational problem, because the best approximation may not even exist.Thus the method cannot be used with a tensor rank decomposition model (known as Canonical Polyadic Decomposition or CPD).The pointwise multiplication and least-squares approximation are easier with a so-called tensor train (TT) model.The Hadamard product of two TT models is another TT model whose wagon (factor) matrices are the Kronecker products of the respective factor matrices of the two TT models.This implies that pointwise multiplication of two TT models has complexity of order N IR 4 for an I N tensor of TT rank R for each wagon.Moreover, rank reduction for TT models is SVD-based and has complexity order N IR 3 .Hence, reducing the rank of the pointwise multiplication (which is R 2 due to the Kronecker product) induces a complexity of order N I(R 2 ) 3 = N IR 6 , and as a result, only small TT ranks can be supported in practice.Another limitation of [5] is that it is hard to pin down if and when the iteration will converge, because of the rank reduction step.The numerical experiments in [5] are interesting but limited in terms of exploring the accuracy-speedup trade-offs that can be achieved.
Unlike [5], our emphasis is on the tensor rank decomposition (CPD) model, in good part because many of the problems that we consider herein admit closed-form tensor rank decomposition formulations, thus bypassing the need for computational model fitting and/or rank reduction.Another issue is that, as we show for the first time in this paper, the core problem is NP-hard even with a TT model, so there is no intrinsic computational benefit to using one tensor decomposition model over the other.

B. Contributions
Our contributions relative to the prior art are as follows: • We focus on the higher-order tensor version of the problem, and we analyze its computational complexity.We show that the problem is easy when the tensor rank is equal to one, but NP-hard otherwise -even if the rank is as small as two.• We consider optimization-based (as opposed to algebraic or randomized) algorithms that can be used to compute good approximate solutions in the high-order tensor case.
We provide an equivalent "fluid" (continuous-variable) reformulation of what is inherently a multi-way selection problem, and a suite of relatively lightweight gradientbased approximation algorithms.The continuous reformulation is derived by showing that a certain continuous relaxation involving a probability distribution instead of hard selection for each mode is actually tight.This is true for any rank, i.e., even for full-rank (unstructured) tensor models.The proposed algorithms take advantage of various optimization frameworks, from projected gradient descent to Franke-Wolfe and various ways of explicitly parametrizing the probability simplex constraints.For low-enough ranks, the associated gradient computations are computationally lightweight, even for very high-order tensors.
• We show that our main results remain valid no matter what kind of polyadic tensor model is used to represent the tensor of interest, including Tucker, HOSVD/MLSVD, tensor train, or tensor ring.• We explore the "span" of the core problem considered, i.e., the class of problems that can be posed as special instances of computing the minimum or maximum element of a tensor from its rank decomposition.We show that this class includes the partition problem (and thus all NPcomplete problems via polynomial-time transformation), as well as integer least squares, integer linear programming, integer quadratic programming, sign retrieval (a special kind of mixed integer programming / restricted version of phase retrieval), maximum likelihood decoding of parity check codes, and finding the least inconsistent solution of an overdetermined system of linear equations over a Galois field.• Finally, we demonstrate promising experimental results on a number of hard problems, including better than stateof-art performance in decoding both low density parity check codes and general parity check codes, and sign retrieval.

Consider a matrix
Three other ways to write the same relation are M = R r=1 A 1 (:, r)A 2 (:, r) T , where A(:, r) stands for the r-th column of A; M = R r=1 A 1 (: , r) • A 2 (:, r), where • stands for the vector outer product; and denotes an element of matrix M, with obvious notation.From the latter, notice that element (i 1 , i 2 ) of matrix M is the inner product of two row vectors, namely row A 1 (i 1 , :) of matrix A 1 , and row A 2 (i 2 , :) of matrix A 2 .
We are interested in finding the smallest or largest element of matrix M from the factors A 1 and A 2 .From the latter relation, we see that seeking the smallest (or largest) element of M corresponds to seeking a pair of row vectors, one drawn from A 1 and the other from A 2 , which have the smallest (or largest) possible inner product.One obvious way to do this is to generate all inner products, i.e., the elements of the product A 1 A T 2 , and find the smallest or largest.This entails complexity I 1 I 2 R, which can be high, even when R is small -e.g., consider the case where I 1 and I 2 are in the order of 10 5 or more.Is there a way to avoid performing order I 1 I 2 computations?When R = min(I 1 , I 2 ), such complexity is unavoidable, because then the elements of M can be freely chosen independently of each other (e.g., from an i.i.d.Gaussian distribution).When R is small however, there seems to be hope.
Generalizing, consider a tensor T of order N and size I 1 × I 2 ×• • •×I N , for which we are given a low-rank factorization, namely where A n is I n × R, and R (when minimal for this decomposition to hold) is the rank of T , in which case the above is known as the canonical polyadic decomposition (CPD) of T [6].When the elements of T are non-negative and the elements of the factor matrices A n are constrained to be non-negative, then the smallest R for which such a decomposition exists is called the non-negative rank of T , which can be higher than the (unconstrained) rank of T .In the sequel, we do not assume that R is minimal; any polyadic decomposition will do for our purposes.In scalar form, we get that which reveals that every element of T is an inner product of N row vectors, one drawn from each of the matrices A 1 , • • • , A N .An alternative way to write this is using the Hadamard (element-wise) product, denoted by * , as where 1 is the R × 1 vector of all 1's, used to sum up the R components of the Hadamard product.We see that finding the smallest or largest element of T is tantamount to finding the smallest or largest inner product of N row vectors, one drawn from each A n matrix.There is an obvious way to do this, by generating all I 1 ×I 2 ×• • •×I N elements of T , at a complexity cost of R(N − 1) flops each, but this exhaustive search is much worse than it is in the matrix case.It quickly becomes intractable even for moderate N and I N , e.g., N = 20 with I n = 10, ∀n ∈ {1, • • • , N }.Is there a more efficient way to do this?III.THEORY When R = 1 and all the A n matrices (column vectors a n in this case) only have non-negative entries, it is easy to see that the smallest element of T is T ( ǐ1 , • • • , ǐN ) with ǐn ∈ arg min in a n (i n ), and likewise the largest is T ( î1 , • • • , îN ) with în ∈ arg max in a n (i n ).For R > 1 however, the answer is no, in the worst case -even if R = 2 and the elements of all the A n matrices are non-negative.We have the following result.
Theorem 1. Finding the smallest element of a tensor from its rank factorization is NP-hard, even when the tensor rank is as small as R = 2 and the rank-one factors are all non-negative.This means that for large N , the worst-case complexity is exponential in N .
Proof.We show that an arbitrary instance of the partition problem, which is known to be NP-complete [7], can be converted to a specific instance of the decision version of the problem of interest.This means that if we could efficiently solve any instance of the problem of interest, we would be in position to efficiently solve an arbitrary instance of the partition problem, which is not possible according to the current scientific consensus, unless P=NP.
Recall the partition problem, which is as follows.We are given N positive integers w 1 , • • • , w N (repetitions are allowed), and we wish to know whether there is a subset of indices S, such that Consider binary variables {i n ∈ {0, 1}} N n=1 , with i n designating whether n ∈ S or not, i.e., i n = 1 if n ∈ S, else i n = 0. Deciding whether a suitable S exists is equivalent to deciding whether > 0.
We will instead use a reformulation of the above which is more conducive for our purposes, namely > 0.
Expanding the square and noting that the product term is a constant, we obtain Each of the exponential terms is separable, i.e., a rank-one tensor comprising non-negative factors.For example, the first term is It follows that the function to be minimized is a tensor of size 2 × • • • × 2 = 2 N and rank R = 2, with the following 2 × 2 matrix factors Hence, the decision version of our problem with R = 2 and non-negative factors contains every instance of the partition problem as a special case.It follows that the decision version of our problem is NP-hard, and thus its optimization version is NP-hard as well.
Theorem 2. Finding the smallest element of a tensor from its Tucker, MLSVD, TT, or TR factorization is NP-hard.
Proof.All these models are outer product decompositions that are related.In particular, a rank two CPD is equivalent to a Tucker model with a diagonal 2 × 2 × • • • × 2 core matrix.
Thus the closed-form rank-two model of the partition problem is also a low multilinear rank (= 2) Tucker model.We may orthonormalize the loading matrices A n if we want to obtain an MLSVD model that features a dense core, obtained by absorbing the inverse transformations into the core.Thus if we could find the minimum element of every Tucker or MLSVD model of multilinear rank (2, 2, • • • , 2) efficiently, we would be in position to solve all instances of the partition problem.
For the TT decomposition, [12] has shown that for low CPD ranks (smaller than or equal to any of the tensor outer size dimensions), it is possible to explicitly construct an equivalent TT model whose lower-order cores exhibit the same low rank as the original high-order CPD model.It follows that we can express our CPD model of the partition problem as a TT model with core ranks equal to two.Finally, the TR model is a generalization of the TT model.In all cases, if we could find the corresponding minimum element efficiently, we would be in position to efficiently solve every instance of the partition problem.
Remark 1.The decision version of the partition problem is NP-complete, which means that every other NP-complete problem can be transformed in polynomial time to an instance of our problem of interest, i.e., computing the minimum element of a tensor of non-negative rank two.Theorem 1 thus speaks not only for the hardness, but also for the span of problems that can be considered within our present framework.
Remark 2. Theorem 1 adds a new problem to a list of tensor problems that are known to be NP-hard [13].Notwithstanding, it is interesting that the problem of finding the smallest or largest entry is NP-hard even for tensors of rank as low as R = 2. Remark 3. Multi-way partitioning [14] is a generalization that seeks to split N numbers to M optimally balanced groups.This can also be cast as finding the minimum element of a tensor from its low-rank factors.Let S = N n=1 w n , and i n ∈ {1, • • • , M } be a variable indicating to which of the M groups w n is assigned to.Consider minimizing the loss function Expanding the square we obtain two rank-one terms and a constant that is irrelevant to optimization.The overall loss is therefore a tensor of rank R = 2M , and it is easy to write out the associated factor matrices, similar to what we did for the basic partition problem.
When the low-rank factors comprise positive and negative values, then even the R = 1 case can be challenging, as the minimum or the maximum element of each a n can be involved in generating the overall minimum (or maximum), due to the presence of signs.In principle, this seemingly entails explicit or implicit enumeration over 2 N possibilities.(If a zero exists in any a n , then zero should also be considered as a candidate at the very end.)The good news is that there is structure to this problem: one can invoke the principle of optimality of dynamic programming (DP).The key observation is that Thinking of α as the vector having as elements the minimum and the maximum element products over the first k modes and β as a k+1 , we can compute the minimum and maximum up to the k + 1-th mode using the formulas above.It follows that Proposition 1. Finding the smallest or largest element of a rank-one tensor from a n ∈ R In×1 N n=1 can be accomplished via DP at complexity that is linear in N .
When R > 1, the problem of finding the minimum (or maximum) element of a CPD-factored tensor can be described as follows.One has N buckets of R-dimensional vectors, with the n-th bucket having I n vectors -the rows of A n .Finding the minimum element of the CPD-factored tensor is equivalent to selecting a single row vector from each bucket A n such that the inner product of the N resulting vectors is minimized.This is inherently a discrete optimization problem that is NPhard per Theorem 1.A possible solution strategy is to employ coordinate descent: fixing all indices except i n , we are looking to minimize or maximize over i n the inner product which only requires computing the matrix-vector product Ad T −n and finding its smallest or largest element.Such discrete coordinate descent can be extended to optimizing over small (and possibly randomly chosen) blocks of variables at a time.This is akin to what is known as alternating (block coordinate) optimization in the context of tensor factorization, and it can work quite well in practice for small to moderate N .However, such an approach does not seem to scale well with higher N , and it is not particularly elegant.We would like to have the option of updating all variables at once, as we do in continuous optimization -but the problem at hand is discrete and does not appear amenable to tools from continuous optimization.Thankfully, appearances can be deceiving.We have the following result.Theorem 3. Finding the smallest element of a N -way tensor in CPD form is equivalent to the following continuous relaxation involving probability distributions Proof.Let {p n } N n=1 denote an optimal solution to (1), and let q−n := A n * m̸ =n A T m pm .Note that the minimum of ( 1) is then equal to pT n q−n .The minimum of the inner product pT n q−n is clearly attained when pn is a Kronecker delta that selects a minimum element of q−n (there might be multiple minimal elements in q−n that happen to be exactly equal).By virtue of optimality, pn cannot combine non-minimal elements of q−n , because that would clearly increase the cost, contradicting optimality.Thus pn generates a convex combination of the possibly multiple equivalent minimal elements of q−n .But because the latter are exactly equal, the same cost is produced by putting all the mass in only one of them.
Thus, given an optimal solution {p n } N n=1 of (1), we can always round p1 so that it is integral, without loss of optimality.This yields another optimal solution of (1) to which we can apply the same argument to round p2 this time, and so on.The proof is therefore complete.Theorem 4. Finding the smallest element of a N -way tensor in Tucker, MLSVD, TT, or TR form is likewise equivalent to the corresponding continuous relaxation.
Proof.All these decomposition models are fundamentally sums of outer products, i.e., rank-one tensors.Thus they can always be put in the form of multilinear (polyadic) decomposition, albeit such decomposition will not necessarily be canonical, i.e., of minimal rank, nor will it be unique.Notice however that our proof of the previous Theorem does not assume anything about uniqueness or the number of components in the decomposition.Hence it applies to all these models.As a special case, it applies to full-rank tensors; but then there is no way to avoid the curse of dimensionality, i.e., exponential complexity in gradient computations, see below.Thus it is low-rankness that saves the day.

IV. METHODS
Theorem 3 opens the door to derivative-based optimization.The gradient of the cost function, This is very easy to compute.When p T n A n contains no zeros, we only have to compute * m p T m A m once, at a cost of R flops, and then divide each time by the leave-one-out factor to compute all the Hadamard products needed.This is followed by a final matrix-vector multiplication for each n.The total cost is thus 2 To understand what happens when zeros appear, it suffices to consider each column (latent dimension) separately, as multiplications and division happen at the column (rank-one factor) level.If the first element of one and only one of the p T n A n is zero, then we should also compute the nonzero leave-one-out product of all other first elements and use that only for the gradient update of the p n which generated the said zero element.If the first element of more than one p T n A n is zero, then all leave-one-out products are zero, hence there is no need to even consider the first dimension in the gradient update, because the corresponding component of the gradient is zero.This implies that for each of the latent dimensions, r ∈ {1, • • • , R}, we only need to detect if there is a single zero and compute the product of the nonzero elements.Hence, even in the presence of zeros, the worst-case complexity is linear in N , as stated above.
We need to enforce the probability simplex constraints, and projected gradient descent (PGD) a natural choice for this purpose; but we have several other choices.Among them, the Frank-Wolfe algorithm has certain advantages.In our particular context, computing the minimum inner product of the gradient over the feasible set separates across modes, and for each mode it boils down to finding the minimum element of the corresponding part of the gradient vector.Thus Frank-Wolfe bypasses projection onto the simplex, which requires an iterative algorithm.Frank-Wolfe applied to nonconvex cost functions with convex constraints enjoys nice convergence guarantees [15] -note that our multilinear cost function is nonconvex, but the probability simplex constraints are convex.
Another way to bypass the projection onto the simplex is to introduce what is sometimes referred to as the Hadamard parametrization of a probability distribution p(i) = (s(i)) 2 , where p is the Hadamard product of s with itself [16]; or the amplitude parametrization p(i) = |s(i)| 2 of the quantum literature, where s is complex and p is the Hadamard product of s and the conjugate of s.With these parametrizations p always has non-negative elements and sums up to one if and only if ||s|| 2 = 1.Thus the simplex constraint is transformed to a unit sphere constraint.One advantage of this is that projection of any vector on the unit sphere only involves normalization, i.e., s = s ||s||2 .A drawback is that the unit sphere is not a convex set, which complicates convergence analysis.The third option when it comes to parametrizing the simplex constraint is to use p(i) = e v(i) I j=1 e v(j) , where vector v is unconstrained.This parametrization emerges from mirror descent using negative entropy, but it can also be motivated from the viewpoint of turning simplex-constrained optimization into an unconstrained problem on which gradient descent can be applied.Following the latter viewpoint and applying the chain rule where g n := ∇ pn f .The aforementioned simplex parametrization approaches can model any finite distribution.In our particular context, we know that the sought distribution can be restricted to be unimodal -after all, the final solution can be "rounded" to a Kronecker delta without loss of optimality per the proof of Theorem 3. Towards this end, we can use as potential map v(i) a discrete analog of the exponent of a Gaussian, namely where µ ∈ R is a location parameter and σ ∈ R controls the spread of the distribution.Using the chain rule again, the derivatives for updating these two "root" parameters are Frank-Wolfe, PGD, and the exponential parametrization work well for random instances of the partition problem, subject to suitable tuning of the step-size related parameters.Pseudo-code listings of these four algorithms are provided as Algorithms 1, 2, 3, 4 for the Frank-Wolfe, PGD, exponential, and "discrete Gaussian" parametrization, respectively.For the partition problem, the Frank-Wolfe Algorithm 1 appears to offer the best performance and the lowest complexity in our proof-of-concept experiments.Each of these algorithms is useful in different application contexts, as we will see.Algorithm 4 uses the most compact parametrization of all, and is often the most efficient in terms of iterations needed for convergence.
For the Frank-Wolfe method, for which there is proof that the algorithm attains a stationary point at a rate of O( 1 √ t ), where t is the number of iterations, computing the associated curvature constant is non-trivial.In the matrix case (N = 2), the Lipschitz constant of the gradient of the cost function for our problem, which can be used to bound the curvature constant in Frank-Wolfe [15] is determined by the principal singular value of A 1 A T 2 .To see this, note that for N = 2 the cost function can be written as the gradient of which is given by , This is unfortunate, for computing σ max A 1 A T 2 is more costly than finding the smallest element of the product A 1 A T 2 by brute-force.In the N -way case, the situation is more complicated.Note that for N = 2 the difference of the gradients is linear in the difference of the mode distributions.This is not true when N > 2.

A. Complexity
Assuming I n = I, ∀n for brevity, the cost of computing all gradients with respect to the mode distributions p n is of order N IR, as we have seen.The Frank-Wolfe algorithm maintains this complexity order per iteration.When employing the exponential parametrization of the mode distributions or the "discrete Gaussian" parametrization, the extra gradient backpropagation steps have complexity N I, thus again maintaining overall per-iteration complexity of order N IR.The actual complexity of the overall algorithm depends on a number of critical parameters, including the choice of gradient stepsize, the maximum number of iterations allowed ("hard-stop"), and the initialization used -intelligent / application-specific or random.We are using these algorithms to tackle NP-hard problems, so initialization does matter.In certain applications, such as parity check decoding, there is a natural initialization that we can use (the channel output bits), but in others, like the sign retrieval application that we will consider in some detail, there is no "natural" initialization that we can use.Through experimentation, we have found that one initialization that works well in many cases is to use the DP-based algorithm to compute an optimal solution for each rank-one factor separately, and then pick among those the one that is best for the higher-rank tensor minimization problem.This often gives a good "universal" (application-agnostic) initialization the complexity of which is of order Tuning the gradient stepsize is not difficult in our experience, but it is application-dependent.The maximum number of gradient iterations is set between 300 and 3, 000 in all our experiments, even for high-order problems (high N ).Thus complexity is always polynomial of order N IR, but there is no guarantee that the optimal solution will be found.Notwithstanding, as we show in our experiments, the optimization performance attained is often state-of-art.

V. HOW EXPRESSIVE IS THE CLASS OF PROBLEMS CONSIDERED?
We have seen that the class of problems that can be viewed as special instances of finding the minimum element of a tensor from its rank-one factors is broad -it contains the partition problem, and thus any NP-complete problem can be transformed in polynomial time to our problem of interest.Still, such transformation may not be obvious, and one wonders whether there exist broadly useful classes of NP-hard problems that are directly amenable, or easily transformable, to instances of the problem of interest.As we will see next, the answer is affirmative for various important optimization models that are frequently used in engineering.

A. Integer Linear Programming
Consider the Integer Linear Programming (ILP) problem where and S N is a finite subset of R N .S N will typically be a finite lattice, i.e., the Cartesian product of finite subsets of R. In the supplementary material, we show that it is possible to transform this ILP problem to minimizing where λ m := e −tρb(m) , for sufficiently large (problemspecific) ρ and t.Every exponential inside the brackets is a rank-one tensor (a separable function of the variables in x), and thus the above is a tensor of order N and rank at most M + 1, which is very low compared to the maximal possible rank (I N −1 when S N = I N ).Here I N is the Cartesian product of N copies of I (note that S N need not be a Cartesian product in general -we are using slightly overloaded notation).

B. Integer Least Squares
Consider the integer least squares (ILS) problem where Then, it is easy to see that Note that a term of the form 1 is separable (rank-one), and so is γ(x(n)) 2 .It follows that the cost function in ( 4) has rank at most N (N −1)

2
+ N , which is again very low compared to the maximal possible rank (I N −1 when S N = I N ).Note that here we have used the symmetry of G := H T H to reduce the required number of rank-one terms.

C. Integer Quadratic Programming
Consider the integer indefinite quadratic problem min x∈S N x T Qx, where Q is not necessarily positive semidefinite, or even symmetric.Note that we can always symmetrize without loss of generality, as x T Qx = (x T Qx) T = x T Q T x, and thus It follows that integer quadratic programming corresponds to finding a minimum element of a CPD model of rank N (N −1) 2 .

D. Mixed integer programming: Sign retrieval
So far we have considered optimization problems with purely categorical ("integer") variables.As an example of a mixed integer problem that falls under our framework, we next consider sign retrieval (e.g., see [17], [18]).This is a special case of phase retrieval [19], [20], and both have important applications in a broad range of disciplines, from optical imaging [19] to wireless communication -where it can be used for channel estimation from coarse channel quality measurements [21].The starting point of the sign retrieval problem is the measurement model takes the absolute value of its argument, and y ∈ R M ×1 + is the sign-less vector of measurements.Treating s = sign(y) and x as deterministic unknowns, maximum likelihood estimation amounts to where D(s) is a diagonal matrix holding the elements of s on its diagonal.The problem is separable with respect to the continuous parameters x.That is, solving for x as a function of s, x = (A T A) −1 A T D(y)s and substituting the result back into the cost function, we obtain where we have used D(s)y = D(y)s.Expanding and using the idempotence of (I − A(A T A) −1 A T ), we can further rewrite the problem as It follows from the preceding subsection on integer quadratic programming that the sign retrieval problem corresponds to finding a minimum element of a CPD model of rank M (M −1) 2 .

E. Maximum likelihood / minimum distance decoding of parity check codes over Galois Fields
Consider a parity check code [22] over GF (2) with M × N parity check matrix C, where M is the number of parity checks, N is the codeword length, and the code rate is N is a valid codeword if and only if Cx = 0 M ×1 in GF(2) modulo 2 arithmetic, i.e., mod(Cx, 2) = 0 M ×1 in real arithmetic.Assuming that the coded bits are transmitted over a memoryless binary symmetric channel (BSC) with cross-over probability p < 0.5, or over an additive white Gaussian noise (AWGN) channel, maximum likelihood decoding reduces to minimum Hamming or Euclidean distance decoding, respectively.Noting that for {0, 1}-encoding of the channel input x and output y, Euclidean distance squared is equal to Hamming distance, we can write both using real arithmetic as Optimal decoding is an NP-hard problem for most "irregular" codes of current interest2 , including low-density parity check (LDPC) codes which can come close to attaining the Shannon limit.These are decoded using an iterative message passing technique known as belief propagation, which performs well for code graphs that are free of short loops [23], [24].We will next show that the same optimal decoding problem can be approached in a very different way: as the problem of computing the minimum element of a low-rank tensor.
Consider the following problem Note that (−1) C(m,:)x = (−1) , which is a rank-one tensor.Likewise, is separable, and thus a rank-one tensor.The overall cost function in ( 6) is therefore a tensor of rank (at most) M + 1.
We have the following result.
Proof.Note that each term of the sum on the left is equal to 1 when the corresponding parity equation is satisfied, or −1 otherwise.With the minus sign up front of the sum, minimization of this term will produce a valid codeword that satisfies all parity checks and yield an overall value −M .Any constraint that is violated will increase this term by 2. The term on the right is monotonically increasing with the squared loss The maximum value of the latter is N , while 1 + 1 N N is upper bounded by its limit as N → ∞, which is the base of the natural logarithm, e.It follows that the term on the right is strictly less than 1 (and lower bounded by 1 e ).Since the all-zero codeword satisfies all parity checks, it follows that the optimum solution to (6) should have cost less than or equal to −M + 1.Moreover, this value cannot be attained by any x which does not satisfy all parity checks, because such a x would incur a cost of at least −M + 2, even discarding the second term in (6).Hence, any solution of ( 6) must satisfy all parity checks.Among all x that satisfy all parity checks, those that minimize N n=1 (y(n) − x(n)) 2 yield the least overall cost in (6), and thus the proof is complete.Remark 4. When y is real-valued (AWGN channel) the only difference is the scaling of the second term, which should be

F. Codes over higher-order Galois Fields
Our framework can also handle the decoding of codes over higher-order Galois fields.For example, let L = 2 ℓ and consider a system of parity equations over GF(L).Such systems are of form Cx = q, where both C and q are given, and the equality is modulo L = 2 ℓ , i.e., Cx − q is a vector of integer multiples of L. We can handle this type of equation by bringing in a familiar signal processing tool, namely, the complex roots of unity.That is, we seek to minimize over where j := √ −1.The same logic applies for appropriately choosing c.

G. Least inconsistent solution of overdetermined linear system of equations over GF(2)
So far, we have been dealing with underdetermined linear equations over a GF, and looking at minimum distance solutions relative to an "anchor".This is similar to the minimum norm solution of underdetermined linear equations in the real or complex field.The overdetermined version of the problem is also of interest, and has many applications -e.g., in cryptanalysis [25]- [27].It turns out that our framework can deal with this problem as well.We are given a set of M > N linear equations in N variables Gx = y ⇔ Gx + y = 0 over GF(2).The system is usually inconsistent, so we seek a x that minimizes the number of violated constraints, i.e., a solution that is least inconsistent.This can be simply posed as H. Underdetermined or overdetermined?
Every linear code can be generated as a linear combination of the columns of a code generating matrix G.When G is tall (M × N with M > N ) and full column rank, the valid code words live in a subspace of dimension N and they are orthogonal to the rows of a parity check matrix C of size (M − N ) × M .Given a noisy code word, we may pose the optimal decoding problem in two equivalent ways: • One is in code space, i.e., find a valid code word that is closest to the given noisy code word, and this is the formulation in ( 6), wherein the unknown vector x is the sought clean code word.After solving (6), we need to solve a consistent system of linear equations over GF(2) (via Gaussian elimination) to obtain the latent information sequence -unless the code is systematic / in standard form wherein the information sequence appears as the prefix of the code sequence.
• The other option is to pose the problem in the lowerdimensional space of the information sequence, i.e., find an information sequence that produces a valid code word that is closest to the given noisy code word; this is the formulation in (8), wherein the unknown vector x is the sought information sequence 3 .When we take this route, there is no need for the additional Gaussian elimination step at the end, as we directly recover the information sequence.Note that the number of optimization variables is smaller and the rank of the CPD model is higher this way, but the complexity of our algorithms is linear in the number of unknowns and the rank, so this does not really affect the complexity of our approach.• The difference between the two ways of approaching the problem lies in the initialization.If we go via (6) there is a natural initialization for x -the received noisy code word.Notice that this works for any parity check matrix.
If we choose to solve (8) on the other hand, there is no obvious way to initialize the information sequence x, unless the code is systematic -in which case we read out a noisy version of x from the noisy code word itself.
It is therefore preferable to use (6) unless the code is systematic.The code is not systematic in cryptography applications for example.

I. Reprise
As we conclude this section, it is useful to reflect on what we learned.The take-home point is that there are many important problems which can be posed as instances of lowrank tensor minimization, for which it is not even necessary to perform tensor factorization -the low-rank factors can be readily derived analytically, in closed-form.These clean-cut mappings of classic hard problems to instances of low-rank tensor minimization reinforce our hopes that, even in cases where the problem is more complicated and the cost function is not fully known (i.e., only examples / samples of the cost function are given), it may be possible to model those "blind" optimization problems using low-rank tensor factorization and minimization.
Another important observation which stems from uniqueness of tensor completion, is that under certain conditions it is not even necessary to completely specify an instance of ILS or ILP in the traditional sense of providing its input parameters H, b, c in order to solve it.It is enough to specify the cost at certain (randomly chosen or systematic) points, and let tensor completion fill out the "rest of the problem".This possibility is certainly intriguing, and the direct result of uniqueness of low-rank tensor completion and our problem reformulation.

VI. EXPERIMENTS A. Partition
Since we used the partition problem to establish the hardness of our problem in the worst case, let us compare one of  --------------------------------------- the proposed continuous optimization algorithms for minCPD to an established approximation algorithm for the partition problem.For this purpose, we will use greedy partitioning algorithm [14] which first sorts the given numbers and then parses the sorted list from largest to smallest, assigning each to the bucket with the smallest running sum.This algorithm comes with a 7/6 approximation guarantee in terms of the larger sum it outputs divided by the larger sum of an optimal partition.For smaller N we also use enumeration to compute the optimal partition as another baseline.
The results obtained using the greedy algorithm and minCPD via Frank-Wolfe with adaptive stepsize (Algorithm 1) are summarized in Fig. 1 for N = 20 (with enumeration as another baseline) and Fig. 2 for N = 30 (without enumeration), for 100 Monte-Carlo trials each.In each trial, N random integers in {1, • • • , 100} are first drawn and then normalized to sum to 1.For Frank-Wolfe, we used C = 5, a hard-stop at a maximum of 1000 iterations, and 5 random initializations.
It is clear that Algorithm 1 outperforms the well-established greedy algorithm, and in many cases attains the optimal solution (zero subset imbalance, or the optimal subset imbalance obtained via enumeration, see Figs. 1 and 2).

B. Sign retrieval
Our second set of experiments considers the application of our framework to the problem of sign retrieval.Towards this end, we use Algorithm 4 with stepsize parameter fixed to 0.1 and a hard limit of 10 3 gradient iterations.For initialization, we use the rank-one DP algorithm of Proposition 1, which is used to efficiently determine the minimum of each rankone factor.The best of these rank-one factor minima (the one that minimizes the full-rank cost) is then used to initialize Algorithm 4. Ten additional random initializations are also used in case the DP initialization is not good enough.As a baseline, we use enumeration over all possible vectors of sign variables.
For our first experiment, we choose N = 6, M = 12, and σ v = 0.5.Note that for this application the order of the tensor used in the CPD model is M and its rank is M (M −1)

2
. Fig. 3 shows the cost function value attained for 100 randomly drawn problem instances, constructed as specified above.Note that CPD comes very close to the optimal cost of enumeration, with a few minor spikes, for all instances considered.For this application, what is perhaps more important than the value of the cost function is the squared error between the groundtruth x and the x estimated by a given algorithm.The values of squared error attained by CPD and enumeration are shown in Fig. 4. Notice that enumeration is by definition optimal in terms of the cost function, but not necessarily optimal in terms of instantaneous or even mean squared error (MSE)as the cost function is only a surrogate for MSE.Indeed, there are a few instances where CPD is better than enumeration in terms of squared error, and vice-versa.Overall though, the two approaches are very close in this experiment.
Monte-Carlo averages of the optimization cost and the squared error are depicted in Figs. 5 and 6 as a function of σ and N (with M = 2N ), respectively.We again observe the excellent performance of the proposed CPD approach in this set of experiments.

C. Decoding parity check codes
In our last set of experiments, we consider decoding five different rate-1 2 parity check codes, for different code densities and lengths N .For the first three scenarios N = 32, while for the last two N = 96.In the first scenario we use a custom LDPC code design4 [28].In the remaining four scenarios, we use a systematic parity check matrix whose non-identity block is randomly generated from an i.i.d.{0, 1}-Bernoulli distribution of density either 0.2 or 0.8.Each code is used  for encoding i.i.d.sequences of N/2 information bits, and the coded sequences are transmitted over BSCs with cross-over probability p ∈ 10 −2.5 , 10 −2 , 10 −1.5 , 10 −1 , 10 −0.5 .For each scenario and cross-over probability, we conduct 10, 000 Monte-Carlo runs and compare the decoding performance of our method to two baselines: i) enumeration (applicable only for N = 32 due to its exponential complexity) and ii) belief propagation (BP) based decoding.Regarding our method, we use Algorithm 4 with step size equal to 0.05.The initialization of Algorithm 4 in terms of {σ n } N n=1 and {µ n } N n=1 is σ n = 0.5, ∀n, while µ n is set to the received noisy (possibly flipped) value of the corresponding code bit.Algorithm 4 is terminated when the number of iterations exceeds the limit of 2, 000 iterations or when the relative change of the objective drops below 10 −9 .As for the BP baseline, we use the ldpcDecode function of MATLAB, to which the log-likelihood ratios based on the corresponding cross-over probabilities are provided as initialization, while the upper limit of iterations is set to 100 (we did not observe any improvement beyond that).
In Figs. ( 7)-( 11), we report the average Bit Error Rate (BER) for all the methods.We can observe that, in general, Algorithm 4 achieves better or comparable performance than BP.At low BSC cross-over probabibilities, we can see that Algorithm 4 outperforms BP for the custom LDPC code design of [28], and the high density parity check codes.In the rest of the cases the two methods attain comparable performance.BP has a small advantage for the longer lowdensity code in Fig. (10), as expected; BP works best with low-density long codes.We note that, unlike BP, Algorithm 4 does not (need to) use the BSC cross-over probability, which is non-trivial to estimate for time-varying channels.We also note that BP exhibits a degradation in performance for the longer and denser code at low BSC error rates, see Fig. (11).This is repeatable (not an artifact of insufficient Monte-Carlo averaging) and likely due to the existence of many short loops in this case.On the other hand, BP is significantly faster than Algorithm 4. Overall though, given that Algorithm 4 is a completely new and application-agnostic take on a wellstudied problem, the fact that it outperforms in terms of BER a proven MATLAB implementation of a custom-designed and widely used algorithm is satisfying.Matlab programs that can be used to reproduce the results in this subsection can be found in the companion supplementary material in IEEExplore.

VII. CONCLUSIONS
We have considered a fundamental tensor problem and showed that it is NP-hard.While most tensor problems are NP-hard [13], it is surprising to see that our particular problem is NP-hard for rank as small as two, but not for rank equal to one.We note here that the best rank-one least squares approximation problem for tensors is already NP-hard.
Given the discrete / combinatorial nature of the problem considered, it is also unexpected to see that it admits an equivalent continuous reformulation.While this reformulation is itself NP-hard by virtue of equivalence, it opens the door for gradient-based approaches from the nonconvex continuous optimization literature.
We have shown that an impressive variety of hard optimization problems that are widely used in engineering can be posed as special instances of our problem of interest.These include integer least squares, integer linear and quadratic programming, certain mixed integer programming problems, and solving systems of underdetermined and overdetermined  linear equations over Galois fields.For all these problems, the low-rank factorization needed to set up the optimization problem is available analytically, in simple closed-form.There is no need for tensor factorization.
As tangible signal processing and communications engineering applications, we delved into sign retrieval and the decoding of linear parity check codes.We have shown that the perfor-mance of the proposed suite of gradient-based approaches is surprisingly good in many of these applications, and under certain conditions it can beat tried-and-proven application-specific algorithms that come with certain performance guarantees.This success is sometimes dependent on using a suitable initialization, either application-specific ("dirty" code bits in the case of parity decoding) or "universal" (the DP algorithm used to initialize the gradient iterations for sign retrieval) in other cases.For other problems, like the partition, a few random initializations seem to work well.
Our main results (hardness, equivalence of continuous reformulation) are also applicable to all popular tensor models beyond CPD, including Tucker/HOSVD, TT, and TR.
Note that for the families of problems and applications considered in this paper, R is a small constant or linear / very low-order polynomial function of N .For so-called black box optimization problems out in the wild, the worst-case R can be exponential in N , and we would have to use low-rank approximation to keep complexity at bay.This is not an issue though for the various problems considered in this paper.
We are currently working on further improving scalability and speed, establishing convergence for some of the algorithmic variants, and considering applications in a variety of settings.Performance analysis is naturally of interest, but is likely to be application-specific.We are also considering top-k / bottom-k extensions which are appropriate for top-k recommendation and other applications.We hope to report on these directions in forthcoming work.

Fig. 1 .
Fig. 1.Partition gap for 100 problem instances with N = 20 random integers in {1, • • • , 100}, normalized to sum to 1.Greedy, optimal (enumerationbased), and proposed minimum CPD algorithm using Frank-Wolfe.Each point on the x axis corresponds to a problem instance, and the instances are sorted in order of increasing partition gap for the optimal enumeration-based solution.

Fig. 3 .Fig. 4 .Fig. 5 .
Fig. 3. Sign retrieval: Cost attained by the enumeration-based solution and the proposed CPD-based approach, for M = 12, N = 6, σ = 0.5.See text for details.Each point on the x axis corresponds to a problem instance, and the instances are sorted in order of increasing cost for the optimal enumerationbased solution.

Fig. 6 .
Fig. 6.Sign retrieval: Mean optimization cost and MSE attained by the enumeration-based solution and the proposed CPD-based approach as a function of N , for M = 2N and σ = 0.5.