Capacity Performance of Tensor Multi-Domain Communication Systems With Discrete Signalling Constellations

Modern communication systems employ multi-domain modulation and coding techniques to effectively exploit all available resources. Hence, in such systems, the transmit and receive signals have an inherent multi-domain structure which can be represented using tensors. This work considers the capacity of higher order tensor channels associated with such multi-domain communication systems when the elements of the input tensor are constrained to be drawn from discrete signalling constellations. We establish a relationship between the tensor gradient of the mutual information and the error covariance tensor associated with the minimum mean squared error estimator at the receiver. This relation is used to iteratively find a multi-linear precoder at the input which achieves the capacity of the tensor channel under the signalling constellation constraints. Through numerical examples, we show the convergence behavior of the proposed precoder and compare the capacity achieved under different constellations with the capacity when the input is Gaussian. Further, we exploit the tensor formulation of the problem to find the channel capacity under a variety of power constraints spanning several domains. At high SNR, the constellation constraints saturate the capacity, while at low SNR the constellation constraints are less relevant and the power constraints dominate and limit the performance. The capacity saturation level depends on the input order and distribution.


I. INTRODUCTION
Most modern communication systems require signal transmission and reception over multiple domains such as space, time, frequency, users, code, and propagation delay. Tensors, which are multi-way arrays, are a natural choice for mathematical modelling of such multi-domain communication systems [1]. Tensors are heavily employed in several signal processing areas including but not limited to computer vision [2], [3], machine learning [4], [5], data mining [6], big data [7], [8], image processing [9], [10], speech processing [11], and biomedical engineering [12], [13]. In the field of communications, tensor decompositions have been extensively employed from a signal processing perspective wherein the multi-domain nature of signals is exploited for blind source separation and for developing equalization techniques [14]. Several key technologies intended for beyond 5G and 6G communication systems can be modelled using tensor representations, such as Intelligent Reflecting Surfaces (IRS) [15], massive Multiple-Input Multiple-Output (MIMO) [16], [17], relay systems [18], and millimeter wave [19], [20]. The use of tensors to model wireless communication systems allows a consolidated representation of signals and systems spanning multiple domains.
The notion of employing tensors to represent signals in multi-domain communication systems has been well recognized in the professional community. For instance, [21] represents the signals in Code Division Multiple Access (CDMA) and over-sampled systems using third order tensors, and uses tensor decompositions for blind multi-user separation and equalization. A space-time-frequency coding structure for MIMO Orthogonal Frequency Division Multiplexing (OFDM) CDMA systems is developed in [22], where the input is considered an order 5 tensor. The signals in a MIMO Generalized Frequency Division Multiplexing (GFDM) system are modelled as third order tensors in [23]. The received signal is modelled as a fourth order tensor for MIMO OFDM in [24]. Similar applications of tensors for signal representation in several multi-domain systems, such as Filter bank-based Multi-carrier (FBMC) [25], cellular networks [26], MIMO multiple access and interference channels [27], and Non-orthogonal Multiple Access (NOMA) systems [28], have also been investigated.
A unified mathematical framework capable of modelling several such multi-domain systems has recently been proposed in [1] using the tensor contracted convolution and contracted product. In [1], the input and output signals are defined as order N and order M tensors respectively, and the channel as an order M + N tensor, with several examples. Most future communication systems will rely on the multi-domain nature of signals for improved performance, by incorporating and utilizing the distinct features of all the domains being considered. The tensor model retains the domain interpretability and the mutual effects between domains, thereby providing a structured mathematical framework for such multi-domain systems. In [27], a discrete version of the tensor modelled multi-domain communication system from [1] is presented using the Einstein product. Further, [27] considers the Shannon capacity of higher order tensor channels under a family of power constraints with unconstrained input signals. However, in practical systems, the input is often constrained to be drawn from discrete constellations. Thus, this paper considers the capacity of higher order tensor channels under such discrete signalling constellation constraints.
The problem of maximizing the mutual information (MI) with discrete inputs is often handled by exploiting the relationship between the derivative of the MI and the minimum mean squared error (MMSE), known as the I-MMSE relationship, proposed in [29]. In [30], a power allocation policy was proposed that maximizes the MI for any arbitrary input distribution over diagonal, i.e., non-interfering, MIMO Gaussian channels. For the case of interfering channels, a linear precoder was proposed in [31] using the relation between the gradient of the MI with respect to the channel and the MMSE matrix derived in [32]. Further, [33] presents iterative algorithms using a quadratic function of the precoder matrix to compute a linear precoder which maximizes the MI for finite alphabet inputs. An iterative method to compute the optimum input covariance has been presented and characterized in the high and low signal-to-noise ratio (SNR) regimes in [34]. More recently, [35] considers the power allocation problem for parallel Gaussian channels with input distributions which are close to Gaussian in the Kullback-Leibler divergence, leading to a robust water-filling. A strategy to jointly optimize the MIMO precoder and the input distribution is presented in [36]. Also, MIMO precoder optimization with discrete inputs in the absence of channel state information at the transmitter is considered in [37]. A more detailed summary of results pertaining to the capacity limits of MIMO channels with discrete input constellations can be found in [38].
So far, the problem of finding channel capacity with discrete inputs has only been addressed in the context of scalar or vector inputs, where only a single domain (space/antenna) is taken into account. To the best of our knowledge, there has not been any significant effort in finding the capacity of systems spanning multiple domains with discrete input constellation constraints. Moreover, within MIMO systems, most of the work in the literature assumes only a sum power constraint. However, per antenna power constraints are practically more relevant than sum power constraints because in several transmitters, individual antennas are connected to separate RF chains, creating individual constraints [39]. Also, per antenna constraints are common in distributed MIMO systems where antennas do not share the same power source [40]. In a multi-domain system, power constraints can span beyond the antenna domain, such as per user power constraints in a multi-user MIMO system [27]. In order to mathematically represent such domain specific power constraints, conventional matrix approaches fall short, since any lower order representation of the inherently higher order channel loses the domain distinction. In this paper, we consider the channel capacity of multi-domain communication systems with discrete inputs using a tensor framework, under a family of power constraints which includes the per domain and per element constraints. This paper extends the I-MMSE relation, well known in the literature, from a vector to a tensor setting. The notion of error covariance as a higher order tensor has recently been proposed in [41], which can be linked to the tensor gradient of the MI, establishing a tensor I-MMSE relationship. We derive such a relationship and use it to find a multi-linear precoder at the input which achieves the capacity of the tensor channel under a family of power constraints with discrete input signalling constellations.
It is to be noted that a tensor channel of order 2 is equivalent to the conventional MIMO matrix channel. The present paper on one hand generalizes the results from matrix channels to higher order tensor channels under sum power constraints, and on the other hand provides a novel approach to solve the problem in a more realistic set up with domain specific power constraints.
The rest of the paper is organized as follows: Section II presents the system model and the tensor I-MMSE relationship. Section III poses the problem of finding the channel capacity as an optimization problem and solves it to find an optimal input precoder which achieves capacity. Section IV presents numerical examples and discusses the implications of different power and constellation constraints. The paper is concluded in Section V. A detailed proof of the tensor I-MMSE theorem along with some tensor derivative and integral rules have been included in the Appendices.

II. THE TENSOR I-MMSE RELATION
A tensor is a multi-way array where each element is indexed by multiple indices [42]. This paper uses several tensor algebra results such as the Einstein product, outer product, tensor inverse, Hermitian. A detailed review of all such relevant tensor algebra results can be found in [27], [43].
Notations: We denote tensors using calligraphic fonts. For instance, X ∈ C^(I1×···×IN) represents an order-N complex tensor. A random tensor is denoted using bold font, such as X. The individual elements of a tensor are denoted using subscript indices, for instance X_{i1,...,iN}. An all zero tensor is denoted by 0_T, and an identity tensor of order 2N by I_N. The Einstein product over N modes is denoted by *N, and the outer product by •. The Hermitian transpose of a tensor A is denoted by A^H. Since a tensor has multiple modes, its Hermitian transpose can be defined with respect to a separation after any number of modes [44]. To keep the notation simple, in this paper whenever we express a tensor explicitly as order N + M or order 2N, the tensor Hermitian transpose is defined with respect to the separation after N modes. A Hermitian positive semi-definite tensor is denoted by A ⪰ 0.

A. SYSTEM MODEL
Consider a multi-domain communication system where an order-N input X ∈ C^(I1×···×IN) and an order-M output Y ∈ C^(J1×···×JM) are connected via a multi-linear channel H ∈ C^(J1×···×JM×I1×···×IN). The input passes through a precoding operator P ∈ C^(I1×···×IN×I1×···×IN), which modifies the input via a multi-linear transformation as P *N X to generate the transmit tensor. The system model can be specified as:

Y = H *N (P *N X) + N = H̄ *N X + N,   (1)

where N ∈ C^(J1×···×JM) denotes the received noise tensor and H̄ = (H *N P) ∈ C^(J1×···×JM×I1×···×IN) can be seen as the equivalent channel. If we assume the input tensor X is zero mean with covariance Q̃ = E[X • X*], then the transmit covariance is given by

Q = E[(P *N X) • (P *N X)*] = P *N Q̃ *N P^H.   (2)

If the input consists of i.i.d. zero mean unit variance elements such that Q̃ = I_N, then the transmit covariance reduces to Q = P *N P^H. Note that, in a more general case, the precoder can be a non-square tensor as well, in which case the input tensor and the transmit tensor have different dimensions. In this work, we consider a square precoder, since the precoder is aimed at performing power allocation and ensuring that the transmitted signal has a specific optimum covariance without any change in its dimensions. The system model for a multi-domain communication system from [27] is a special case of (1) when P is an identity tensor, and thus H̄ = H. Several practical systems, such as multi-user MIMO OFDM and MIMO GFDM, can be modelled using (1) as shown in [27]. We assume that N is circularly symmetric complex Gaussian distributed with independent zero mean and unit variance components, so that its covariance is an identity tensor.
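As an illustrative sketch (not part of the paper's framework), the contracted system model (1) can be realized numerically with NumPy, where the Einstein product *N maps to `np.tensordot` over N modes. The dimensions, the helper name `ein`, and the BPSK example input are our own choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
I_dims, J_dims, N = (2, 2), (2, 2), 2   # order-2 input and output, so the channel is order 4

def ein(A, B, n):
    """Einstein product A *n B: contract the last n modes of A with the first n of B."""
    return np.tensordot(A, B, axes=n)

H = rng.standard_normal(J_dims + I_dims) + 1j * rng.standard_normal(J_dims + I_dims)
P = rng.standard_normal(I_dims + I_dims) + 1j * rng.standard_normal(I_dims + I_dims)
X = rng.choice([-1.0, 1.0], size=I_dims)           # e.g. a BPSK input tensor
# circularly symmetric complex Gaussian noise, unit variance per element
noise = (rng.standard_normal(J_dims) + 1j * rng.standard_normal(J_dims)) / np.sqrt(2)

H_eq = ein(H, P, N)             # equivalent channel H-bar = H *N P
Y = ein(H_eq, X, N) + noise     # received tensor, eq. (1)
```

Note that the contraction is associative, so precoding the input first and then applying the channel gives the same received tensor.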

B. GRADIENT OF THE MUTUAL INFORMATION
The best MMSE estimator, given by the conditional mean, has been derived for the tensor setting in [41]. The corresponding error covariance, denoted using an order-2N tensor Q_E ∈ C^(I1×···×IN×I1×···×IN), can be written as [41]:

Q_E = E[(X − E[X|Y]) • (X − E[X|Y])*].   (3)

The MMSE tensor Q_E associated with the conditional mean estimator can be linked with the gradient of the MI as shown in the following theorem.

Theorem 1: In the system model (1), if N is circularly symmetric complex Gaussian with zero mean and identity covariance tensor, then the gradient of the input-output mutual information with respect to H̄ is given by

∇_H̄ I(X; Y) = H̄ *N Q_E,   (4)

where Q_E is the MMSE tensor defined in (3).

Proof: The proof of this theorem for the case where the input and output are vectors, and the channel is thus a matrix, is provided in [32]. The proof can be extended to the tensor setting using tensor gradient and integral rules, as shown in Appendix B.
From Theorem 1 we know that ∇_H̄ I(X; Y) = H̄ *N Q_E. Since H̄ = H *N P, using the chain rule of differentiation (see Lemma 3 in Appendix A), (4) gives us:

∇_P I(X; Y) = H^H *M H *N P *N Q_E.   (6)

Also, based on Q defined in (2), we can apply the chain rule of derivatives (see Lemma 4 in Appendix A) to relate ∇_Q I(X; Y) to ∇_P I(X; Y), giving (7). Substituting (6) into (7) yields (8). Equations (6) and (8) specify the gradient of the MI with respect to the input precoder and the transmit covariance in terms of the MMSE tensor. These gradient results can be exploited to optimize the MI, as shown in the next section.
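The precoder gradient (6) is a chain of Einstein products and can be checked numerically. In the sketch below (our own illustration, with a random placeholder standing in for the MMSE tensor Q_E), the tensor expression is verified against the equivalent matrix computation H^H H P E from [32] obtained by unfolding each order-4 tensor into a matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
I_dims, J_dims, N, M = (2, 2), (2, 2), 2, 2

H = rng.standard_normal(J_dims + I_dims)
P = rng.standard_normal(I_dims + I_dims)
QE = rng.standard_normal(I_dims + I_dims)      # placeholder for the MMSE tensor

def ein(A, B, n):
    """Einstein product: contract the last n modes of A with the first n of B."""
    return np.tensordot(A, B, axes=n)

def herm(A, n):
    """Tensor Hermitian transpose w.r.t. a separation after n modes."""
    return np.conj(np.transpose(A, list(range(n, A.ndim)) + list(range(n))))

# grad_P I = H^H *M H *N P *N Q_E, cf. eq. (6)
grad_P = ein(herm(H, M), ein(H, ein(P, QE, N), N), M)
```

Unfolding the order-4 tensors into 4x4 matrices (row-major reshape) reduces each Einstein product to an ordinary matrix product, which provides an independent check of the expression.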

III. CAPACITY WITH DISCRETE INPUTS
In this section, we consider the optimization problem of maximizing the MI under a family of power constraints when the input is drawn from discrete constellations with a fixed distribution. The family of input power constraints includes cases such as the sum power constraint and the per domain or per element power constraints [27]. For instance, in a system with input spanning the space, time, and frequency domains, a power constraint can be imposed across the entire tensor input symbol, or it could be per antenna, or per antenna and per frequency bin, or per antenna per frequency bin and per time slot, which is essentially per element. Any such case can be represented mathematically using the tensor framework. In order to describe the family of power constraints, we use the following notations: i = (i_1, ..., i_N) denotes the full sequence of indices, i_c denotes the sub-sequence of constrained indices, and i_r denotes the remaining indices. Now consider the system model from (1) with P = I_N, such that H̄ = H and Q = Q̃. Assuming that the input elements are drawn from a zero mean signal constellation, the objective is to determine the transmit covariance Q that maximizes the MI, I(X; Y), subject to a given power constraint. Hence we can define the optimization problem as:

maximize over Q:  I(X; Y)   (9)
subject to:  Σ_{i_r} Q_{i,i} ≤ P_{i_c} for all i_c,   (10)
Q ⪰ 0,   (11)
where P_{i_c} denotes the power budgets of the constrained elements, such that Σ_{i_c} P_{i_c} = P is the total power budget. We define the SNR as P/σ², where σ² is the noise variance. For different choices of i_c and i_r, (10) corresponds to different types of power constraints, ranging from the per element constraint (when i_r is empty and i_c = i) to the sum power constraint (when i_c is empty and i_r = i). The optimization problem in (9)-(11) has been considered in [27] for the case where the input is circularly symmetric complex Gaussian distributed. However, unlike the Gaussian input case, a closed form expression of the MI for an arbitrary constellation may not be known. Also, the problem at hand may not be a convex optimization problem. However, the Karush-Kuhn-Tucker (KKT) conditions provide a set of necessary conditions for the optimal input covariance to be a critical point of any optimization problem [45], [46]. We now derive the necessary conditions for the optimal covariance using the KKT conditions.
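The family of constraints in (10) amounts to partial traces of the diagonal of Q over the unconstrained indices i_r. A small sketch (our own, assuming an order-2 input whose two modes are, say, antenna and frequency) shows how the per element, per antenna, and sum power quantities are all read off the same diagonal of Q.

```python
import numpy as np

rng = np.random.default_rng(2)
I_dims, N = (2, 2), 2

# a valid transmit covariance Q = P *N P^H, built from a random (real) precoder P
P = rng.standard_normal(I_dims + I_dims)
PH = np.transpose(P, (2, 3, 0, 1))          # Hermitian transpose (real case)
Q = np.tensordot(P, PH, axes=N)             # Q = P *N P^H

per_element = np.einsum('abab->ab', Q)      # diagonal entries Q_{i,i}: per element powers
per_antenna = per_element.sum(axis=1)       # i_c = antenna index, i_r = frequency index
total_power = per_element.sum()             # i_c empty: sum power constraint
```

A per antenna budget would then be checked as `per_antenna <= P_ic` elementwise, and the sum power budget as `total_power <= P`.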

A. CONDITIONS FOR OPTIMAL INPUT COVARIANCE
Let M ⪰ 0 be the Lagrange multiplier tensor of size I_1 × ··· × I_N × I_1 × ··· × I_N for the positive semi-definite constraint in (11), and let μ_{i_c} ≥ 0 for all i_c be the Lagrange multipliers corresponding to the linear constraints in (10). Then the Lagrangian functional can be defined as in (12). Let us define a pseudo-diagonal tensor B, of the same size as the input covariance, whose non-zero entries are the multipliers μ_{i_c}, as in (13). Based on (13), the Lagrangian from (12) can be written compactly as (14), and its derivative with respect to Q as (15). The first KKT condition is obtained by setting ∇_Q L from (15) to zero, giving (16). The complementary slackness KKT conditions are obtained by setting every function in the Lagrangian from (12), except the objective function, to zero; thus we can write the complementary slackness equations as (17). Since M ⪰ 0 and Q ⪰ 0, (17) can be written as (18). Also notice that all the entries of B, i.e., the μ_{i_c}, will be strictly greater than 0 at the optimum, because the power inequality constraints must be met with equality at the optimum, giving (20). Since μ_{i_c} > 0, (18) can be written as (19). Now, using the relation between the gradient of the MI and the MMSE tensor from (8) (with P = I_N, and thus Q = Q̃ in (8)), we can write (21). Rewriting (16) as (23) and equating (21) and (23) yields (24), which together with (19) gives (25).
Thus we obtain the necessary condition (26), where the elements of B are found so as to satisfy the power constraints in (20). Equation (26) can be evaluated when closed form expressions for the MI [27] and the MMSE tensor [41] in terms of the input covariance tensor are known.

B. CAPACITY ACHIEVING INPUT PRECODER
Since the MMSE tensor Q_E in (26) itself depends on the transmit covariance tensor, (26) does not provide a direct equation for the covariance. In order to solve for the optimal transmit covariance with discrete inputs, we adopt an iterative approach using a gradient ascent method. Consider the system model from (1), with input X containing i.i.d. zero mean unit variance elements. The tensor P can be seen as a multi-linear precoder which also performs power allocation given the specific power constraints. The transmit covariance can be written as Q = P *N P^H (based on (2)). Thus the optimization problem in (9) can be recast using the multi-linear precoder P as the optimizing variable: the objective is to find a P which maximizes the MI, I(X; Y), subject to the power constraints, as described in (28). Let 𝒫 denote the set of all tensors which satisfy the constraints in (28), known as the feasible set. We use a projected gradient ascent approach to find the optimal P via the iterative equation

P^(k+1) = [P^(k) + ν ∇_P I]_𝒫,   (29)

where [·]_𝒫 denotes the projection onto the feasible set 𝒫, described mathematically as

[P^(0)]_𝒫 = arg min over P ∈ 𝒫 of ||P − P^(0)||².

The choice of Frobenius norm minimization as the projection operation ensures that the projected point is closest in Euclidean distance to the point being projected [46]. The variable k denotes the iteration index, and the constant ν denotes the step size of the gradient ascent. Substituting ∇_P I from (6) into (29), we get

P^(k+1) = [P^(k) + ν H^H *M H *N P^(k) *N Q_E^(k)]_𝒫.   (30)

Equation (30) gives an iterative procedure to find the precoder. For low SNRs, any initial guess of P^(0) which satisfies the power constraints with equality can be made. For instance, under sum power constraints, the initial guess P^(0) could be a pseudo-diagonal tensor corresponding to uniform power allocation at the input.
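The projected gradient ascent (29)-(30) can be sketched compactly. To keep the example self-contained and fast, the sketch below (our own construction) uses a matrix (order-2 tensor) channel and a Gaussian-input surrogate, for which the MMSE matrix has the closed form E = (I + P^H H^H H P)^(-1) and the MI is log det(I + HPP^H H^H); for discrete inputs, E would instead be estimated by Monte Carlo as described later. The projection is onto the sum power sphere.

```python
import numpy as np

rng = np.random.default_rng(3)
n, Ptot, nu = 4, 4.0, 0.05
H = rng.standard_normal((n, n)) / np.sqrt(n)

def project(P):
    # sum-power projection: rescale onto the sphere ||P||_F^2 = Ptot
    return P * np.sqrt(Ptot) / np.linalg.norm(P)

def mi(P):
    # Gaussian-input MI surrogate, log det(I + HP(HP)^H)
    G = H @ P
    return np.linalg.slogdet(np.eye(n) + G @ G.T)[1]

P = project(np.eye(n))                  # uniform power allocation start
history = [mi(P)]
for _ in range(200):
    E = np.linalg.inv(np.eye(n) + P.T @ H.T @ H @ P)   # Gaussian MMSE matrix
    P = project(P + nu * H.T @ H @ P @ E)              # eq. (30)-style update
    history.append(mi(P))
```

The ascent direction H^H H P E is exactly the precoder gradient form of (6); swapping in a Monte Carlo estimate of the MMSE tensor and the Einstein product recovers the discrete-input tensor algorithm.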
Note that since the MI may not be a concave function for several discrete input distributions, the iterative method from (30) in general provides only a necessary condition for the optimal input precoder. One approach in such cases is to run the gradient ascent iteration from several starting guesses of P^(0), and then select the stationary point precoder which offers the maximum MI, as done in [31]. However, an alternate approach can be devised using the fact that the MI for discrete inputs is a concave function at low enough SNR [34]. Thus, in order to obtain an optimal P for any given SNR, we use an approach similar to that presented in [34]. First, we determine the unique globally optimal input precoder P using (30) at a low enough SNR with an arbitrary starting guess; the low SNR ensures that the optimization problem is concave, hence the solution is indeed globally optimal. We then increase the SNR gradually, where the optimal precoder for each higher SNR is computed by taking the precoder corresponding to the lower SNR as the starting point. It is assumed that the optimization landscape does not change drastically with a small change in SNR, so that the globally optimal precoder at a low SNR remains in the proximity of the global optimum for the slightly higher SNR. Such a continuation technique is sometimes used for non-concave function optimization [34], [47].

C. DETERMINING THE PROJECTION ONTO THE FEASIBLE SET
The projection of the iterative precoder onto the set of feasible precoders satisfying the sum power constraint for MIMO channels has been cast as a convex optimization problem in [32]. Similarly, the projection of a tensor P^(0) onto the feasible set 𝒫 satisfying the family of power constraints, denoted by [·]_𝒫, can be cast as the convex optimization problem given by (31)-(32). Such a projection problem can be solved using the KKT conditions [46], or using software tools such as CVX [48].
Here we present the solution using the KKT conditions.
Let l_{i_c} ≥ 0 for all i_c be the Lagrange multipliers corresponding to the linear constraints in (32). Then the Lagrangian functional corresponding to (31)-(32) can be defined as in (33). We arrange the values {l_{i_c}} in a pseudo-diagonal tensor F, of the same size as the input precoder, whose non-zero entries are given in (34). Based on (34), the Lagrangian from (33) can be rewritten as (35). To find the derivative of the Lagrangian with respect to P, we note that ∇_P tr(P *N P^H) = P. Since ||P − P^(0)||² is the same as tr((P − P^(0)) *N (P − P^(0))^H), we have ∇_P ||P − P^(0)||² = P − P^(0). Also, based on the chain rule of derivatives (see Lemma 4 in Appendix A), we have ∇_P tr(F *N P *N P^H) = F *N P. Thus the derivative of the Lagrangian with respect to P can be written as in (36). The KKT condition is obtained by setting (36) to the all zero tensor, which gives

(F + I_N) *N P = P^(0).   (37)

Since (F + I_N) is a pseudo-diagonal tensor with non-zero elements, we denote its inverse by Z = (F + I_N)^(-1), whose pseudo-diagonal elements are given in (38). The tensor P can then be written using Z as

P = Z *N P^(0),   (39)

i.e., the projection is given by the Einstein product between the tensors Z and P^(0). In order to find the elements of the pseudo-diagonal tensor Z, we use the power constraints from (32). Substituting P from (39) into (32) gives (40). Note that Z is a pseudo-diagonal tensor with only real elements, so Z^H = Z, and Z_{i,j} is non-zero only when i = j. Thus we get (41), where the last equality follows from the pseudo-diagonality of Z. On substituting (41) into (40) we get (42), and using (38) and (42) we obtain (43), which gives all the elements of the pseudo-diagonal tensor Z. To better understand this result, consider the case of the sum power constraint, for which i_c is empty, P_{i_c} = P represents the total power budget, and i_r = (i_1, ..., i_N). Since there is a single power constraint, we have a single Lagrange multiplier, and thus Z is a scaled identity tensor z · I_N.
Thus, under the sum power constraint, (44) reduces to

z = √P / ||P^(0)||,   (45)

i.e., the solution to (31)-(32) is proportional to P^(0), up to the scaling factor given by (45). This result, when applied to a MIMO matrix channel under a sum power constraint, is consistent with the corresponding MIMO solution presented in [32].
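The sum power projection can be verified numerically: the radial rescaling z·P^(0) is feasible, and no other feasible point lies closer to P^(0) in Frobenius norm. The sketch below (our own sanity check) compares the closed form against randomly sampled feasible precoders.

```python
import numpy as np

rng = np.random.default_rng(4)
Ptot = 2.0
P0 = rng.standard_normal((2, 2, 2, 2))            # point to be projected

z = np.sqrt(Ptot) / np.linalg.norm(P0)            # scaling factor, cf. (45)
P_proj = z * P0                                   # projected precoder

# distance of the closed-form projection vs. random feasible precoders
d0 = np.linalg.norm(P_proj - P0)
worst = min(
    np.linalg.norm(Q * np.sqrt(Ptot) / np.linalg.norm(Q) - P0)
    for Q in rng.standard_normal((200, 2, 2, 2, 2))
)
```

Every randomly sampled point on the power sphere is at least as far from P^(0) as the radial projection, as expected from the Frobenius norm minimization in (31).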

D. CALCULATION OF THE MMSE TENSOR
The MMSE tensor has a closed form expression in the case of Gaussian input. For the numerical computation of the MMSE tensor for discrete inputs, the usual approach is to use Monte Carlo simulations [31], [32]. If the input constellation size is S, then there are a total of K = S^(I_1···I_N) possible input tensors X ∈ C^(I1×···×IN). We denote the realizations of the input tensor as X^(k) for k = 1, ..., K. Finding the MMSE tensor entails averaging over all such input tensors. For a fixed tensor channel, corresponding to a given input tensor X^(k), we generate L noise tensors drawn from a circularly symmetric complex Gaussian distribution, compute the corresponding output tensors using (1), and denote them as Y^(k,l) for l = 1, ..., L. For large L, the MMSE tensor can be approximated using Monte Carlo runs as in (47). If the input distribution conditioned on the output is denoted by the function p_{X|Y}(X|Y), the conditional expectation in (47) can be computed as in (48). The function p_X(·) denotes the input probability distribution, and p_{Y|X}(·), given in (49), denotes the output distribution conditioned on the input in the presence of circularly symmetric Gaussian noise.
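The Monte Carlo estimate of the MMSE tensor can be sketched for a small BPSK example (our own illustrative sizes; the posterior weights follow Bayes' rule for equiprobable inputs with Gaussian noise, in the spirit of (47)-(49)).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
I_dims, J_dims, N, L = (2, 2), (2, 2), 2, 100
H = rng.standard_normal(J_dims + I_dims) / 2.0

# enumerate all K = 2^(I1*I2) BPSK input tensors X^(k)
Xs = np.array([np.reshape(b, I_dims) for b in product([-1.0, 1.0], repeat=4)])
K = len(Xs)
HX = np.einsum('jkab,nab->njk', H, Xs)            # H *N X^(n) for every input

def cond_mean(Y):
    # posterior mean E[X | Y] for equiprobable inputs, cf. (48)-(49)
    logp = -np.sum(np.abs(Y - HX) ** 2, axis=(1, 2))
    w = np.exp(logp - logp.max())
    w /= w.sum()
    return np.tensordot(w, Xs, axes=1)

QE = np.zeros(I_dims + I_dims)
for k in range(K):                                 # average over inputs ...
    for _ in range(L):                             # ... and noise realizations
        Nz = (rng.standard_normal(J_dims) + 1j * rng.standard_normal(J_dims)) / np.sqrt(2)
        Y = HX[k] + Nz
        err = Xs[k] - cond_mean(Y)
        QE += np.multiply.outer(err, err)
QE /= K * L
```

The diagonal of the resulting order-4 tensor gives the per element MMSE values, each bounded by the unit prior variance up to Monte Carlo fluctuation.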

E. CALCULATION OF MI FOR A GIVEN PRECODER
In the system model (1), since the noise tensor N is independent of the input tensor X, the MI is given as in (50). Assuming the noise tensor to be zero mean circularly symmetric Gaussian with identity covariance I_M, we get (51) [27]. The output entropy can be calculated using Monte Carlo simulations to approximate the MI. For inputs with equiprobable elements, the MI can be calculated as follows.

Lemma 1: In the system model (1), assuming the input tensor elements are drawn from a set of equiprobable discrete constellation points, and given a precoder tensor P, the MI can be numerically evaluated using (52), where the expectation over the noise can be carried out using Monte Carlo runs.
Proof: When the input consists of equiprobable i.i.d. symbols, the output distribution can be specified as in (53). Using (53), the output entropy can be calculated as in (54)-(55). Since we assume that X is independent of N, we get (56). On substituting (56) and (51) into (50), we get (57), which can be simplified to give (52).
At high transmit power, the MI from (52) converges to a fixed value, as shown in the following lemma.
Lemma 2: As the transmit power increases, i.e., when P → ∞, the MI from (52) satisfies I(X; Y) → I_1···I_N · log(S), where S denotes the constellation size. The expression (52) referenced in Lemma 1 is given explicitly by:

I(X; Y) = I_1···I_N · log(S) − J_1···J_M · log(e) − (1/K) Σ_{k=1}^{K} E_N[ log( Σ_{m=1}^{K} exp(−||H *N P *N (X^(m) − X^(k)) + N||²) ) ].   (52)

Proof: Consider the expression inside exp(·) in (52). As the transmit power becomes large, i.e., P → ∞, we can write (58). Note that when m = k, for any given noise realization, the function ||H *N P *N (X^(m) − X^(k)) + N||² equals ||N||². Also, for m ≠ k, as P → ∞ there is one realization of N that diverges from the limit in (58) (namely when N equals −H *N P *N (X^(m) − X^(k))), but the probability that N takes this specific realization is 0. Hence the limit in (58) holds almost surely. Thus as P → ∞, we have (59). Since the noise tensor contains i.i.d. unit variance elements, using (59) the expectation term in (52) can be written as (60). Thus (52) becomes:

I(X; Y) → I_1···I_N · log(S) − J_1···J_M · log(e) + J_1···J_M · log(e) = I_1···I_N · log(S).   (61)

F. COMPLEXITY OF FINDING MMSE AND MI
In order to compute the MMSE tensor or the MI for a given channel H and precoder P, the complexity increases exponentially with the input tensor size. Let I = I_1 · I_2 ··· I_N and J = J_1 · J_2 ··· J_M. Based on the complexity of computing the Einstein product as discussed in [27], calculating Y − H *N P *N X requires O(I² · J) operations, and further calculating the squared norm in (49) requires J multiplications and J − 1 additions. Thus calculating the conditional pdf using (49) requires O(I² · J) operations. The conditional pdf is substituted into (48) to find the conditional expectation for a given Y, which requires summing over all possible input tensors X. Thus calculating (48) requires O(K · I² · J) operations, where K = S^I denotes the number of possible input tensors with I elements and constellation size S. This conditional expectation is then substituted into (47), which computes a double summation, where within each summation a tensor subtraction and an outer product are calculated. Since an outer product between tensors of the same size (in this case I_1 × ··· × I_N) requires each element of the first tensor to be multiplied with every element of the second tensor, its computational cost is O(I²) operations, while the cost of the tensor subtraction is O(I). Thus the total number of operations required to compute the expression inside the summations in (47) is O(K · I² · J · I²) = O(K · I⁴ · J). Further, this summation runs over all L values of l and K values of k. Thus the overall complexity of calculating the MMSE tensor for a fixed channel and precoder is O(K² · L · I⁴ · J), which is the same as O(S^(2I) · L · I⁴ · J).

Similarly, for finding the MI using (52), computation of the expression inside exp(·) requires O(I² · J) operations. Then, since we average over L noise realizations, and (52) involves a double summation over k and m, the overall complexity of finding the MI is O(K² · L · I² · J). Thus the complexity of using the Monte Carlo method to find the MMSE tensor and the MI increases exponentially with the input tensor size I = I_1 ··· I_N. The reliance on Monte Carlo methods often comes with such drawbacks, which makes these schemes unsuitable for online deployment. However, it should be noted that the high complexity here is not due to the framework being considered, but is inherently associated with the size of the input: the complexity increases exponentially with the input signal size irrespective of whether the input is represented as a vector or a tensor. The high complexity issue has been a concern in the vector and matrix settings as well, and several methods have been proposed in the literature to perform such simulations faster and more efficiently. For instance, [32] suggests updating the MMSE matrix recursively using a weighted average of the matrix from the previous iteration and a new matrix computed from very few random samples. A similar procedure to approximate the MMSE matrix has also been used in [49]. A message passing based algorithm is proposed in [50] as a low complexity method to compute the MMSE matrix and the MI. Such methods can also be employed in the tensor case to reduce the complexity of the Monte Carlo methods. A detailed investigation of such approximations is left as future work.
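As a rough, back-of-the-envelope illustration of these scalings (our own helper functions; constants and hidden factors are ignored), the dominant operation counts can be tabulated to see the exponential blow-up with the input size I.

```python
def mmse_ops(S, I, J, L):
    # dominant MMSE-tensor cost: O(K^2 * L * I^4 * J), with K = S^I
    return (S ** I) ** 2 * L * I ** 4 * J

def mi_ops(S, I, J, L):
    # dominant MI cost: O(K^2 * L * I^2 * J)
    return (S ** I) ** 2 * L * I ** 2 * J

for I in (2, 4, 8):
    print(I, mmse_ops(2, I, 4, 100), mi_ops(2, I, 4, 100))
```

Doubling the number of input elements squares the K² factor, which is why the higher order examples in Section IV use smaller averaging counts.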

IV. NUMERICAL RESULTS
In this section, we present numerical examples with Gaussian channels to find the optimum input precoders and the capacity of channels under discrete input constellation constraints. We first summarize the steps required to find the optimal input precoder that maximizes the MI when the input is drawn from discrete constellations:

Step 1: Set n = 0 and initialize P^(0) to a choice which satisfies the power constraints, such as a precoder performing uniform power allocation.
Step 2: For the given precoder, calculate the MMSE tensor Q_E^(n) using (47). Note that calculating the MMSE tensor using (47) requires averaging over all possible input tensors and several noise realizations. Corresponding to each possible input tensor X^(k), we generate output tensors Y^(k,l) for l = 1, ..., L, where L denotes the number of noise realizations. For any given input-output pair, the conditional pdf is calculated using (49) for the given precoder and channel, then (48) is used to find the conditional expectation, which is substituted into (47) to find the MMSE tensor.
Step 3: Based on the MMSE tensor calculated in step 2, update the precoder using (30).
Step 4: Set n = n + 1, go to step 2, and repeat until a desired convergence criterion is met.
Step 5: Calculate the MI for the updated precoder using (52). Note that the MI can be calculated after each precoder update to check convergence. However, the MI is not needed for updating the precoder itself, and the convergence criterion for the iterative equation can also be set using the mean squared error, as we will illustrate through examples.
For the numerical results, the channel elements are generated using a circularly symmetric complex Gaussian distribution with zero mean and unit variance. The input symbol elements are drawn equiprobably from the constellation. The input constellations we consider are Binary Phase Shift Keying (BPSK) and Quadrature Phase Shift Keying (QPSK). For the gradient ascent, we use a step size of ν = 0.01. The total transmit power constraint is denoted by P, and the noise tensor contains i.i.d. circularly symmetric complex Gaussian entries with zero mean and unit variance. All the results are presented by averaging over S channel realizations. For a given channel and precoder, the calculation of the MMSE tensor and the MI is carried out using L noise realizations. The values of S and L directly determine the computational complexity. Thus, depending on the size of the channel tensor and the input constellation, we use different values for S and L, as listed in Table 1. The reason for choosing smaller values for larger channel orders is to save on simulation time, since for higher order cases, running the simulations for a single channel realization can take up to weeks. For the order-7 and order-8 BPSK cases, we average over only 5 channel realizations. However, our results indicate that for such higher order channels, the deviations in capacity values from one channel realization to the next are small enough, due to channel hardening, that a general trend can be analyzed based on averages over such small numbers.

A. CAPACITY FOR SELECTED INPUT CONSTELLATIONS
In this section, we numerically analyze the capacity of tensor channels with sum power constraints when the input is constrained to be drawn from a fixed discrete constellation. We first consider the BPSK constellation. Consider an order-4 tensor channel of size 2 × 2 × 2 × 2, corresponding to an order-2 input of size 2 × 2 and an order-2 output of size 2 × 2. Under different power budgets P, we iteratively update the input precoder using (30) and plot, in Fig. 1, the MI computed through (52) at each iteration. As

FIGURE 1. Convergence of Mutual Information for order 4 channel with BPSK input.
can be seen in Fig. 1, the MI increases with the number of iterations and saturates beyond a certain point. The saturation level increases with increasing P. Note that the MI is always upper bounded by the source entropy, which can be achieved at high enough transmit power (as per Lemma 2). Thus the saturation level at P = 5 dB approaches 4 bits but never exceeds it. Hence, even though closed form MI expressions are not known, this saturated level can be considered the maximum MI, i.e., the capacity under BPSK constrained input for the given power budget.
Note that we present the convergence of the MI against the number of iterations since the MI is the entity of interest for our purpose and can be used as a criterion for terminating the iterations. However, we do not actually need to calculate the MI at each iteration. For updating the precoder, we only need an updated MMSE tensor Q_E. Hence, a possible termination criterion for the iterative equation could also be established by observing the MMSE tensor. In Fig. 2, we present the normalized mean square error (NMSE), defined in (62) as the ratio of the trace of the error covariance tensor to the total transmit power. Both the NMSE and the MI are shown in Fig. 2 to illustrate the relative behavior of the two entities with increasing number of iterations at two different power levels. The left y-axis shows the NMSE and the right y-axis shows the corresponding MI. It can be seen in Fig. 2 that as the number of iterations increases, the NMSE decreases and the MI increases, and both entities reach a saturation. Both entities saturate at almost the same number of iterations: after a certain number of iterations, the drop in the NMSE is small enough that it does not significantly impact the MI. Thus the convergence criterion for the precoder update can also be set by considering the saturation of the NMSE. Since computing the MI is computationally expensive and is not required for updating the precoder, it is preferable to determine convergence using the NMSE, which can be calculated simply using (62) at each iteration.
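An NMSE-based stopping rule can be sketched as follows, assuming (62) is the trace of the error covariance divided by the total transmit power; the tolerance value and function names are illustrative choices:

```python
import numpy as np

def nmse(Q_err, power):
    # NMSE = tr(Q_E) / total transmit power, cf. the definition in (62)
    return np.trace(Q_err).real / power

def has_converged(nmse_history, tol=1e-3):
    # stop when successive precoder updates change the NMSE by less than tol
    return len(nmse_history) >= 2 and abs(nmse_history[-1] - nmse_history[-2]) < tol
```

This avoids the expensive MI evaluation inside the loop: the MMSE tensor is already computed at every iteration for the precoder update, so its trace comes essentially for free.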
Next, in addition to the constraint on the input signal constellation, we also consider two different types of power constraints: a sum power constraint, and per domain power constraints where one of the two domains has individual power budgets defined by P 1 = x · P and P 2 = (1 − x) · P with x = 0.9. For both constraints, we plot the capacity when the input is constrained to be drawn from BPSK and QPSK constellations. We also plot the capacity under the same power constraints but with Gaussian input signalling, using results from [27]. Fig. 3 compares these capacities and thus illustrates the loss of capacity due to the additional constraints on the input constellation. Notice that the use of BPSK and QPSK constellations limits the capacity, which for high values of P saturates at around 4 bits/channel-use for BPSK and 8 bits/channel-use for QPSK, as seen in Fig. 3. However, the saturation occurs at a higher value of P for QPSK than for BPSK. The gap between the capacity with BPSK/QPSK inputs and the channel capacity under no input constellation constraint (in which case we have Gaussian signalling) is significant at higher transmit powers. At lower P, however, the capacity is almost the same whether or not we have the additional signalling constraints on the input. Also, with per domain power constraints the capacity is lower than that achieved with the sum power constraint for both Gaussian and discrete inputs. Fig. 3 thus highlights the dominant behaviour of the different constraints. If the total power budget P is low, then a constraint on the input constellation does not affect the capacity much. Hence at low SNRs, the limited power budgets are the dominant constraints, whereas the signalling constellation constraints are not too relevant. However, at high SNRs, the input signalling constellation constraint severely limits the capacity and thus acts as the dominant constraint.
Increasing the power budget P, or imposing individual domain power constraints, does not change the capacity beyond a point under discrete input constellation constraints, since the capacity saturates, as also indicated by Lemma 2.
Note that [27] provides an algorithm to approximate the optimal input covariance. However, the optimal covariance derived in [27] maximizes the MI only when the input is Gaussian distributed. If the result corresponding to Gaussian input is employed for discrete inputs, the performance can be significantly sub-optimal. We present such a comparison in Fig. 4, where we plot the MI against increasing P under sum power constraints and per domain power constraints (as defined in the previous example) for two different types of precoders with BPSK input. The precoder obtained using (30) is referred to as precoder 1. The square root tensor of the optimal covariance obtained from [27] is referred to as precoder 2, which can be seen as the optimal precoder under Gaussian signalling. It is clearly observed in Fig. 4 that for BPSK inputs, using precoder 2, which assumes a Gaussian input, leads to much lower MI than precoder 1, which accounts for the discrete input. Also, the difference in the performance of the two precoders is more pronounced in the case of per domain power constraints. For instance, at P = 2 dB, the difference in MI is around 0.5 bits under the sum power constraint, whereas it is around 1.0 bits under the per domain power constraint. This suggests that under such domain specific power constraints, the input constellation constraints can become even more significant.

B. CAPACITY UNDER PER DOMAIN AND PER ELEMENT POWER CONSTRAINTS
In Fig. 3, we considered a specific per domain power constraint corresponding to x = 0.9. To analyze the capacity for a larger variety of power constraints, in Fig. 5 we plot the tensor channel capacity for different values of x for both BPSK and Gaussian inputs. Note that different values of x with 0 < x < 1 generate different per domain power constraints. Fig. 5 shows the capacity behaviour for several power budgets (P = 2, 0, −2 dB), where it can be seen that the capacity with both the Gaussian input and the BPSK input changes as the per domain power constraint parameter x changes. The gap between the curves for Gaussian and BPSK inputs increases as P grows. Changing the parameter x changes the capacity for Gaussian inputs at any power level. With BPSK input, at lower power levels the capacity changes with x, but for higher P, we observe that the capacity is not much affected by changing the parameter x. As P increases, the capacity with BPSK input tends to a straight line when plotted against x, whereas with Gaussian input the capacity shows an arch-like curve against x. This further establishes that for high transmit powers under discrete signalling constraints, the allocation of power across domains has little effect on the capacity. Note that the curves in Fig. 5 tend to peak at around x = 0.5, which corresponds to a uniform power allocation constraint. However, such behavior is observed only on average over several different channel realizations. For a specific channel, the peak can occur at other values of x, depending on the channel elements. To see this, we plot the capacity against x for both Gaussian and BPSK inputs for a specific channel realization in Fig. 6, where the peak occurs at around x = 0.7 for Gaussian inputs and x = 0.8 for BPSK inputs.

C. CAPACITY FOR DIFFERENT TENSOR CHANNEL ORDER
The previous sections considered examples of order-4 tensor channels. Now we consider the capacity for various orders of tensor channels with discrete inputs under a sum power constraint. Fig. 7 illustrates the behavior of the MI as the precoder is updated iteratively using (30) at P = 0 dB for BPSK input. The MI convergence is plotted for different orders of tensor channels. Corresponding to order 3, 4, 5, and 6 tensor channels, the orders of the input-output pairs are taken as (2, 1), (2, 2), (3, 2), and (3, 3) respectively. The dimension of each individual channel domain is kept as 2. It can be seen in Fig. 7 that the MI increases and saturates after several iterations for every channel order. This saturation level indicates the capacity of the tensor channel. For the same P, the saturation level is higher for higher order channels. Fig. 8 illustrates the behavior of the capacity against increasing order of the tensor channels for different power budgets P. We consider two different configurations of input and output tensors for odd order (2N + 1) tensor channels: case 1) input order N + 1, output order N; and case 2) output order N + 1, input order N. Case 1 is represented by dashed lines in the figure and case 2 by solid (or, for the P = 5 dB case, dotted) lines. For even order channels, the two cases are the same, thus the points coincide. With the input drawn from a discrete constellation of size M, the capacity can be specified as C ≤ I log2(M) bits/channel-use, (63) where I = I 1 · · · I N is the number of input elements. Thus the capacity of a channel is always upper bounded by the maximum of the input entropy, and this bound can be achieved with high enough power, as established in Lemma 2. In Fig. 8, the upper bound from (63) is indicated using a square marker for case 1 and a hexagram marker for case 2. Since for even order channels the two bounds are exactly the same, for clarity of representation we mark the bound for even order cases with a square marker only. As can be seen in Fig. 8, the capacity with case 1 is higher than that of case 2 because of the larger number of input elements in case 1. Also note that the capacity increases with increasing P for both cases; however, for odd order tensor channels, the difference between the capacities of case 1 and case 2 also increases with P. This is because as P increases, the capacity with BPSK input tends to saturate, and the saturation level depends on the size of the input constellation and the number of input elements, as specified by (63). Hence for a given channel order, a higher input order configuration leads to higher capacity. For discrete inputs, the capacity bound from (63) is reached with increasing P. This can also be observed in Fig. 8, since the square and hexagram markers lie almost on top of the P = 10 dB points, which shows that the capacity saturation is reached at this power level. On comparing channel orders 4 and 5 at high values of P for case 2 (input order less than output order), we see that the capacity does not increase much and stays almost at 4 bits/channel-use, because the order of the input tensor (order 2) did not change in going from channel order 4 to 5 in this case. However, for case 1 the order of the input tensor goes from 2 to 3 in moving from channel order 4 to 5, which causes a significant increase in the capacity. This increase becomes more substantial as P increases. The curves for 5 dB and 10 dB almost overlap for case 2 in Fig. 8 because of the capacity saturation with increasing power budget. For case 1 as well, the curves for 5 dB and 10 dB are close, but they do not overlap, as the saturation value of the capacity is higher because of the higher input order.
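The saturation levels discussed above follow from the input entropy bound: a discrete input of I = I 1 · · · I N elements drawn from an M-ary constellation carries at most I log2 M bits. A small helper (the function name is ours) that evaluates this bound:

```python
from math import log2, prod

def capacity_upper_bound(input_dims, M):
    # C <= I * log2(M) bits/channel-use, with I the number of input elements
    return prod(input_dims) * log2(M)
```

For a 2 × 2 input this gives 4 bits for BPSK (M = 2) and 8 bits for QPSK (M = 4), matching the saturation levels seen in Fig. 3; an order-3 input of size 2 × 2 × 2 with BPSK gives 8 bits, which explains the jump from case 2 to case 1 at channel order 5.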
It is to be noted that the results presented for order-7 and order-8 tensor channels are averaged over a much smaller number of channel realizations. However, they still provide a good approximation of the average taken over a larger number of samples. To establish this, in Table 2 we present the capacity calculated for 5 different channel realizations. For each power level and each channel order, the table lists the capacity found for 5 specific realizations of random channels. Note that for odd order channels, the values in the table correspond to case 1 (input order N + 1, output order N), since case 1 has the higher complexity due to its larger input size. It can be observed from the table that the values of C obtained for different channel realizations are quite close to each other for higher orders such as 7 and 8. To analyze the dispersion of the data points, we use the relative standard deviation and the relative range as measures. For a set of 5 values {C 1 , C 2 , C 3 , C 4 , C 5 } with average C avg = (1/5) · Σ 5 i=1 C i , the relative standard deviation is the sample standard deviation normalized by C avg , and the relative range is (max i C i − min i C i )/C avg . Based on both dispersion measures, as indicated in the table, it can be seen that for lower orders, such as 3 and 4, the dispersion is much higher than for orders 7 and 8. This justifies our choice of the Monte Carlo simulation parameters listed in Table 1. Since the number of elements in higher order tensor channels is very large (2 7 and 2 8 for orders 7 and 8 respectively), due to channel hardening the variation across results for different random realizations is small enough that a general trend can be observed by averaging over only a small number of realizations.
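Assuming the standard normalizations (sample standard deviation and range, each divided by the mean), the two dispersion measures can be computed as:

```python
import numpy as np

def dispersion_measures(C):
    """Relative standard deviation and relative range of capacity values
    obtained over a handful of random channel realizations."""
    C = np.asarray(C, dtype=float)
    avg = C.mean()
    rel_std = C.std() / avg                 # relative standard deviation
    rel_range = (C.max() - C.min()) / avg   # relative range
    return rel_std, rel_range
```

Both measures are scale-free, which makes them comparable across power levels where the absolute capacity values differ.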

V. CONCLUSION
This paper employed the tensor system model based on the Einstein product to formulate the problem of maximizing the MI in a multi-domain communication system with discrete inputs. Since the tensor framework retains the domain interpretability, it allows the problem to be formulated under a family of interpretable power constraints. We extended the relation between the derivative of the MI and the error covariance associated with the MMSE estimator from a vector to a tensor setting in the presence of circularly symmetric Gaussian noise. The tensor I-MMSE relation was then used to find a precoder that achieves the tensor channel capacity when the input is drawn from discrete signalling constellations under a family of power constraints. It was shown that the MI saturates at high transmit power, where the saturation value is a function of the input tensor size, which increases exponentially with the tensor order. Through numerical examples, it was concluded that at lower SNRs the power constraints play a significant role while the signalling constellation constraint does not affect the capacity much; the capacity with discrete inputs is almost the same as the capacity with Gaussian input at low enough SNRs. At higher SNRs, however, the capacity is significantly affected by the signalling constellation constraints, and individual power constraints on input domains or elements do not significantly affect the capacity. Thus, at low SNRs the power constraints are dominant, while at higher SNRs the signalling constellation constraints are dominant.

APPENDIX A TENSOR DERIVATIVES AND INTEGRALS
The derivative of a scalar function f with respect to a tensor, denoted ∇_X f = ∂f/∂X*, is defined in [27]. The chain rule of derivatives can also be extended from a matrix set-up [32], [51] to a tensor setting. Consider a scalar function of tensors, denoted g(h(X)), where X ∈ C^{I 1 ×···×I N} and U = h(X) ∈ C^{J 1 ×···×J M} are order-N and order-M tensors respectively. Then the elements of the gradient follow the complex chain rule, summing over all elements of U and their conjugates:
[∇_X g]_{i 1 ,...,i N} = Σ_{j 1 ,...,j M} [ (∂g/∂U_{j 1 ,...,j M})(∂U_{j 1 ,...,j M}/∂X*_{i 1 ,...,i N}) + (∂g/∂U*_{j 1 ,...,j M})(∂U*_{j 1 ,...,j M}/∂X*_{i 1 ,...,i N}) ]. (66)
We now present two lemmas based on the chain rule of tensor derivatives which were used for deriving (6) and (8).
Lemma 3: For a real valued scalar function h(D), where the tensor D ∈ C^{I 1 ×···×I P ×T 1 ×···×T Q} depends on the tensor B ∈ C^{J 1 ×···×J M ×K 1 ×···×K N} as D = A *_M B *_N C, with tensors A ∈ C^{I 1 ×···×I P ×J 1 ×···×J M} and C ∈ C^{K 1 ×···×K N ×T 1 ×···×T Q}, we have
∇_B h = A^H *_P (∇_D h) *_Q C^H. (67)
Lemma 4: An analogous result holds for a real valued scalar function h(Q), where a Hermitian tensor Q ∈ C^{I 1 ×···×I P ×I 1 ×···×I P} depends on the tensor B. Lemmas 3 and 4 can be proven using the chain rule (66); details are omitted here for brevity.
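As a sanity check, the matrix special case of Lemma 3 (all domain orders equal to one) reduces to a standard complex matrix derivative identity; a brief sketch, following the conventions of [51], with the elementwise verification spelled out:

```latex
% Matrix special case of Lemma 3 (P = M = N = Q = 1):
% for a real-valued h and D = A B C,
\frac{\partial h}{\partial B^{*}}
  = A^{H}\,\frac{\partial h}{\partial D^{*}}\,C^{H},
% since D is linear in B (no conjugates of B appear), so only the
% conjugate branch of the chain rule contributes:
\left[\frac{\partial h}{\partial B^{*}}\right]_{jk}
  = \sum_{i,t} \frac{\partial h}{\partial D_{it}^{*}}
               \frac{\partial D_{it}^{*}}{\partial B_{jk}^{*}}
  = \sum_{i,t} A_{ij}^{*}\,\frac{\partial h}{\partial D_{it}^{*}}\,C_{kt}^{*}
  = \left[A^{H}\,\frac{\partial h}{\partial D^{*}}\,C^{H}\right]_{jk},
% using D_{it} = \sum_{j,k} A_{ij} B_{jk} C_{kt}.
```

The tensor version replaces the matrix products by Einstein products over the corresponding groups of indices, with the Hermitian transpose taken over those groups.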
Tensor Integrals: We represent the integration of a scalar function f(A) over the tensor variable A ∈ C^{I 1 ×···×I N} as ∫ f(A) dA, where the integration runs over every element of A. (69) For a tensor valued function g of the tensor A such that g(A) ∈ C^{J 1 ×···×J M}, the integral over the tensor variable, denoted ∫ g(A) dA, is a tensor of the same size as g(A) with each element given as
[∫ g(A) dA]_{j 1 ,...,j M} = ∫ [g(A)]_{j 1 ,...,j M} dA. (70)
If h(Y) and g(Y) are scalar functions of a tensor Y ∈ C^{J 1 ×···×J M}, then from the product rule of the gradient we can write [52]
∇_Y (h(Y) g(Y)) = g(Y) ∇_Y h(Y) + h(Y) ∇_Y g(Y), (71)
where ∇_Y (h(Y) g(Y)) is a tensor of the same size as Y. Rearranging (71) and integrating, a form of integration by parts can be written as
∫ g(Y) ∇_Y h(Y) dY = ∫ ∇_Y (h(Y) g(Y)) dY − ∫ h(Y) ∇_Y g(Y) dY. (72)
It is important to note that each term in (72) is an integral as defined in (70) and is a tensor, not a scalar.
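The reason integration by parts is useful here is that the boundary term vanishes for decaying integrands, as exploited in Appendix B. This can be checked numerically in the simplest scalar real case; the grid size and the particular decaying functions below are arbitrary choices:

```python
import numpy as np

# Scalar real analogue: for h and g decaying at infinity, the boundary
# term vanishes, so  ∫ g h' dy = -∫ h g' dy  (1-D shadow of the tensor
# integration-by-parts identity).
y = np.linspace(-10.0, 10.0, 200_001)
dy = y[1] - y[0]
h = np.exp(-y**2)              # decays at infinity, like a density p_Y(y)
g = y * np.exp(-y**2 / 2.0)    # also decaying
lhs = np.sum(g * np.gradient(h, y)) * dy
rhs = -np.sum(h * np.gradient(g, y)) * dy
```

Replacing g by a function that does not decay (e.g. g = y) makes the two sides differ by exactly the non-vanishing boundary term.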

APPENDIX B DERIVATION OF THE TENSOR I-MMSE RELATION
In order to prove Theorem 1, we first present some results which are required at several intermediate steps in the proof.
Corollary 1: Let U = Y − H *_N X, where Y ∈ C^{J 1 ×···×J M}, X ∈ C^{I 1 ×···×I N}, and H ∈ C^{J 1 ×···×J M ×I 1 ×···×I N}, and let h(U) = ||U||² = ||Y − H *_N X||². Then we have ∇_U h = U, which directly leads to ∇_Y (||Y − H *_N X||²) = (Y − H *_N X) and hence proves (73). Now, to prove (74), we invoke Lemma 3 from Appendix A. In the statement of Lemma 3, replace A with the scaled identity tensor (−1) · I_M of order 2M, B with the tensor H of order M + N, C with the tensor X of order N, and the function h with ||Y − H *_N X||². Then (67) yields the result.
The integral in (92) can be written using (72) as the sum of two terms, A and B, in (93). The term B in (93) can be written as in (94), and A can be written elementwise, using (70), as in (95). Note that (log(p_Y(Y)) + 1) · p_{Y|X}(Y|X) is a real valued scalar function of Y. Let us denote it by a(Y) = (log(p_Y(Y)) + 1) · p_{Y|X}(Y|X), which can also be seen as a function of the real and imaginary components of Y. As |Y_{j 1 ,...,j M}| → ∞, both p_Y(Y) and p_{Y|X}(Y|X) tend to 0, hence a(Y) → 0. Thus the integral inside the square brackets in (95), when evaluated from −∞ to +∞ over both the real and imaginary parts of Y_{j 1 ,...,j M}, tends to 0, which makes the square bracket term in (95) zero and subsequently A_{j 1 ,...,j M} = 0. This implies that A is an all-zero tensor. Substituting A = 0 and B from (94) into (93) gives (96). On substituting (96) into (92), we get (97). The expectation term inside the integral in (97) can be written as in (98), and the term ∇_Y p_Y(Y) inside the integral in (97) can be written, using p_Y(Y) from (80), as in (99). On substituting ∇_Y p_{Y|X}(Y|X) from (89) into (99), we get (100). Substituting (98) and (100) into (97) completes the derivation, where the last equality follows from (3).
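Although the tensor relation requires the machinery above, its scalar real-valued ancestor (dI/dγ = mmse(γ)/2 in nats, for y = √γ x + n with x ∈ {±1} and n ∼ N(0, 1)) is easy to verify numerically with common random numbers. The closed forms for I and the MMSE below are the standard BPSK expressions for the scalar real channel, not taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(400_000)   # common random numbers for all evaluations

def mi_nats(g):
    # I(γ) = γ - E[ln cosh(γ + √γ Z)]  for BPSK over the real AWGN channel (nats)
    return g - np.mean(np.log(np.cosh(g + np.sqrt(g) * Z)))

def mmse(g):
    # mmse(γ) = 1 - E[tanh(γ + √γ Z)]
    return 1.0 - np.mean(np.tanh(g + np.sqrt(g) * Z))

gamma, d = 1.0, 1e-3
deriv = (mi_nats(gamma + d) - mi_nats(gamma - d)) / (2.0 * d)  # numerical dI/dγ
gap = abs(deriv - 0.5 * mmse(gamma))   # real channel: the I-MMSE slope is mmse/2
```

For the circularly symmetric complex channel used throughout the paper, the factor of 1/2 disappears and the slope equals the MMSE itself.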