Asymptotic Performance of Box-RLS Decoders under Imperfect CSI with Optimized Resource Allocation

This paper considers the problem of symbol detection in massive multiple-input multiple-output (MIMO) wireless communication systems. We consider hard-thresholding preceded by two variants of the regularized least squares (RLS) decoder, namely the unconstrained RLS and the RLS with a box constraint. For all schemes, we focus on the evaluation of the mean squared error (MSE) and the symbol error probability (SEP) for M-ary pulse amplitude modulation (M-PAM) symbols transmitted over a massive MIMO system when the channel is estimated using the linear minimum mean squared error (LMMSE) estimator. Under such circumstances, the channel estimation error is Gaussian, which allows for the use of the convex Gaussian min-max theorem (CGMT) to derive asymptotic approximations for the MSE and SEP when the system dimensions and the coherence duration grow large at the same pace. The obtained expressions are then leveraged to derive the optimal power distribution between pilot and data symbols under a total transmit energy constraint. In addition, we derive an asymptotic approximation of the goodput for all schemes, which is then used to jointly optimize the number of training symbols and their associated power. Numerical results are presented to support the accuracy of the theoretical results.


I. INTRODUCTION
The use of multiple-input multiple-output (MIMO) systems has been recognized as an efficient technology to meet the ever-increasing demand for spectral efficiency. It has indeed been known since the early works of Telatar [2] and Foschini [3] that the mutual information scales with the minimum of the numbers of transmit and receive antennas. In practice, however, the spectral efficiency of a wireless link depends not only on how many antennas are deployed at the transmit and receive sides, but also on the channel estimation accuracy, the detection procedure and the distribution of the power resources, all of which have a direct bearing on the end-to-end signal-to-noise ratio. At the receiver side, accurate channel estimation and symbol detection are crucial to reap the gains promised by the additional degrees of freedom offered by MIMO systems. Channel estimation is performed by allocating a training period during which the transmitter sends known pilot symbols, allowing the receiver to acquire an estimate of the channel state information (CSI). During the data transmission phase, this estimate is leveraged by the receiver to equalize the channel and recover the data symbols. Acquiring accurate channel estimates is important for the success of the data recovery step: otherwise, the channel estimation error propagates to the symbol recovery, crippling the overall performance even under state-of-the-art detection strategies. Resource allocation in terms of power and time is also an essential part of the design of wireless systems. Increasing the duration of the transmitted pilot sequence enhances the channel estimation quality, but at the cost of spectral efficiency losses, since less time is spent transmitting useful data. Moreover, as the total power allocated to data and training transmission is fixed, we cannot increase the power allocated to data without affecting the channel estimation accuracy, and vice versa.
The problem of finding the optimal power allocation between pilot and data has received a lot of attention over the last decades and has been applied in different contexts with the aim of realizing various communication objectives. In [4], [5] and [6], the authors proposed power designs that optimize bounds on the average channel capacity. A different line of work in [7]-[10] considered the post-equalization signal-to-interference-plus-noise ratio as the target metric for determining the optimal power allocation. Depending on the application at hand, other metrics such as the bit error rate (BER) [11], [12], the symbol error rate (SER) [13], mean squared error (MSE)-related indexes [14], [15], max-min fairness utilities [16] or bounds on the received signal-to-noise ratio [17] have been used in several existing papers. These works span a wide range of contexts, including classical MIMO systems [4], multi-carrier systems [18]-[23], amplify-and-forward relaying [11], cognitive radio systems [24] and, very recently, massive MIMO systems in both single-cell [25] and multi-user multi-cell settings [14]. To summarize, the main contributions of this work can be listed as follows: 1) We derive sharp characterizations of the MSE, SEP and goodput expressions for the RLS, LS and Box-RLS decoders under imperfect CSI. Our expressions shed light on interesting relationships between the MSE and SEP for M-PAM modulation under imperfect CSI.
2) We determine the optimal power allocation between training and data symbols when SEP or MSE are used as target criteria.
3) We optimize the power and the number of pilot symbols to maximize the goodput for all studied decoders.
To the best of our knowledge, none of the above was previously derived for the M -PAM case in the presence of imperfect CSI.

A. Paper Organization
This paper is organized as follows. The system model is presented in Section II. In Section III, we discuss channel estimation, the properties of the estimator and the channel estimation error, as well as symbol estimation. The MSE/SEP of symbol estimation is derived in Section IV by applying the CGMT.
These expressions are then validated through an assortment of numerical results and leveraged to find optimal power allocation strategies in Section V. The key ingredient of the analysis, the CGMT, is reviewed in Appendix A. The proofs of the derived MSE and SEP expressions are presented in Appendix B and Appendix C for the Box-RLS and the un-boxed RLS, respectively.

B. Notations
Scalars are denoted by lower-case letters (e.g., α), column vectors are represented by boldface lower-case letters (e.g., x), whereas matrices are denoted by boldface upper-case letters (e.g., X). The notations (·)^T and (·)^{−1} denote the transpose and inversion operators, respectively. The j-th element of vector x is denoted by x_j. The symbol I_N represents the identity matrix of dimension N × N. We use the standard notation P[·] and E[·] to denote probability and expectation. We write X ∼ p_X to denote that a random variable X has a probability density function (pdf) p_X. In particular, G ∼ N(µ, σ²) implies that G has a Gaussian (normal) distribution with mean µ and variance σ². p(x) = (1/√(2π)) e^{−x²/2} and Q(x) = ∫_x^∞ p(t) dt denote the pdf of a standard normal distribution and its associated Q-function, respectively. Finally, ∥·∥ indicates the Euclidean norm (i.e., the ℓ₂-norm) of a vector and ∥·∥_∞ represents its ℓ_∞-norm, i.e., ∥x∥_∞ = max_j |x_j|.

II. SYSTEM MODEL
We consider a flat block-fading massive MIMO system with K transmit antennas and N receive antennas. The transmission consists of T symbols that occur in a time interval within which the channel is assumed to be static. A number T_p of pilot symbols (for channel estimation) occupy the first part of the transmission interval with power ρ_p. The remaining part is devoted to transmitting T_d = T − T_p data symbols with power ρ_d. Figure 1 illustrates the system model. Conservation of time and energy implies that ρ_p T_p + ρ_d T_d = ρT, where ρ is the expected average power. Alternatively, we have ρ_d T_d = αρT, where α ∈ (0, 1) is the ratio of the power allocated to the data, so that ρ_p T_p = (1 − α)ρT. The received signal model for the data transmission phase is given by y = √(ρ_d/K) H x_0 + z, where y ∈ R^N is the received data symbol vector, x_0 ∈ R^K is the transmitted data symbol vector, H ∈ R^{N×K} is a channel matrix with i.i.d. Gaussian elements h_ij ∼ N(0, 1), and z ∈ R^N stands for the additive Gaussian noise at the receiver with i.i.d. elements of mean 0 and variance 1. It is assumed that E[x_0 x_0^T] = I_K, such that each transmit antenna sends a data symbol x_{0,j} that takes values (with equal probability 1/M) in the unit-variance M-PAM set {±1/√E, ±3/√E, …, ±(M−1)/√E}, where E = (M²−1)/3 is the average power of the non-normalized M-PAM signal, M = 2^b being the modulation order and b the number of bits carried by each symbol.
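As an illustration (added here for concreteness, not part of the paper), the following Python snippet generates the unit-variance M-PAM source and the data-phase observation using the signal model y = √(ρ_d/K) H x_0 + z as reconstructed above; the numerical values are arbitrary.

```python
# Sketch of the data-phase model of Section II (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
K, N, M, rho_d = 64, 128, 4, 10.0        # example values, not from the paper

E = (M**2 - 1) / 3.0                     # average power of the non-normalized M-PAM signal
constellation = (2 * np.arange(1, M + 1) - 1 - M) / np.sqrt(E)  # unit-variance M-PAM points

x0 = rng.choice(constellation, size=K)   # equiprobable data symbols
H = rng.standard_normal((N, K))          # i.i.d. N(0, 1) channel matrix
z = rng.standard_normal(N)               # unit-variance receiver noise
y = np.sqrt(rho_d / K) * H @ x0 + z      # received vector during the data phase
```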
As the channel matrix H is unknown to the receiver, a training phase, during which the transmitter sends T_p ≥ K pilot symbols, is dedicated to its estimation. The received signal corresponding to this phase can be modeled as Y_p = √(ρ_p/K) H X_p + Z_p, where Y_p ∈ R^{N×T_p} is the received signal matrix, X_p ∈ R^{K×T_p} is the matrix of transmitted pilot symbols, and Z_p ∈ R^{N×T_p} stands for the additive Gaussian noise with E[Z_p Z_p^T] = T_p I_N. For the reader's convenience, we summarize in Table I the notation used for the parameters of this paper.

A. LMMSE Channel Estimation
Based on the knowledge of Y_p from (5), the LMMSE channel estimate Ĥ is computed as in [36], yielding the decomposition H = Ĥ + ∆, where ∆ is the zero-mean channel estimation error matrix, which is independent of Ĥ, as per the orthogonality principle of LMMSE estimation [4], [36]. For MIMO channels with i.i.d. entries, it has been proved that the optimal X_p that minimizes the estimation mean squared error under a total power constraint satisfies [4] X_p X_p^T = T_p I_K.
For the above condition to hold, the number of training symbols should be greater than or equal to K.
Moreover, under (7), the channel estimate Ĥ has i.i.d. zero-mean Gaussian entries with variance σ²_Ĥ = 1 − σ²_∆, with σ²_∆ being the variance of each element of ∆. It appears from (8) that the channel estimation error decreases with the pilot energy ρ_p T_p.
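As a hedged illustration of the training phase, the sketch below implements the LMMSE estimate for the pilot model Y_p = √(ρ_p/K) H X_p + Z_p with orthogonal pilots. Since (8) is not displayed above, the error variance σ²_∆ = 1/(1 + ρ_p T_p/K) used here is the standard i.i.d.-Rayleigh result and should be read as our assumption.

```python
# Sketch of LMMSE channel estimation under (7); sigma2_D below is an assumed
# reconstruction of eq. (8), consistent with the orthogonality relation
# sigma2_Hh = 1 - sigma2_D stated in the text.
import numpy as np

rng = np.random.default_rng(1)
K, N, rho_p = 64, 128, 10.0
T_p = K                                              # minimum admissible training length
Q = np.linalg.qr(rng.standard_normal((T_p, K)))[0]   # orthonormal columns
X_p = np.sqrt(T_p) * Q.T                             # pilots with X_p X_p^T = T_p I_K

H = rng.standard_normal((N, K))
Z_p = rng.standard_normal((N, T_p))
Y_p = np.sqrt(rho_p / K) * H @ X_p + Z_p

# With orthogonal pilots, the LMMSE estimate reduces to a scaled correlator.
c = np.sqrt(rho_p / K) / (1 + rho_p * T_p / K)
H_hat = c * Y_p @ X_p.T

sigma2_D = 1 / (1 + rho_p * T_p / K)   # assumed per-entry error variance (our reconstruction)
sigma2_Hh = 1 - sigma2_D               # per-entry variance of H_hat (orthogonality principle)
print(np.var(H - H_hat), sigma2_D)     # the two should be close for large N*K
```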

B. Symbol Detection under Imperfect Channel Estimation
With the channel estimate Ĥ at hand, the receiver can proceed to the recovery of the transmitted symbols. The optimal decoder, which minimizes the probability of error under perfect channel knowledge, is the maximum likelihood (ML) decoder, given by x̂_ML = arg min_{x ∈ C^K} ∥y − √(ρ_d/K) H x∥², where C denotes the M-PAM constellation. As can be seen, the ML decoder involves a combinatorial optimization problem and thus presents a prohibitively high computational complexity, especially when the system dimensions become large, as envisioned in current communication systems. To overcome this issue, suboptimal strategies that require lower computational complexity are generally used. They often proceed in two steps. First, a real-valued approximation of the transmitted symbol vector is obtained. This estimate is then hard-thresholded in a second step to produce the final estimate. In this work, the focus is on the regularized least squares (RLS) and box-regularized least squares (Box-RLS) decoders.
The RLS decoder is based on regularizing the cost in (9) and relaxing the finite-alphabet constraint, thus leading to (10), where λ ≥ 0 is the regularization coefficient, A = √(ρ_d/K) Ĥ, and B is a normalization constant whose value will be suggested by our analysis so as to remove the bias of the decoder [37]. Note that the optimization in (10c) simply selects the symbol value s that is closest to x_j/B among the M possible choices. As can be seen from (10b), the elements of the solution x may take large values in high-noise conditions or poor channel estimation scenarios. This motivates the Box-RLS decoder of (11) [31], [38], [39], which is based on relaxing the finite-alphabet constraint to the convex constraint x ∈ [−t, t]^K, where t is a fixed threshold that can be optimally tailored to the propagation scenario.
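A minimal sketch of the two-step decoders follows. The regularizer scaling and the normalization B = 1 are our assumptions, since (10) and (11) are not displayed; the box-constrained step uses scipy's bounded least-squares solver on an augmented system.

```python
# Sketch of the RLS / Box-RLS decoders followed by hard thresholding.
# Assumptions (flagged): regularized cost ||y - A x||^2 + lam ||x||^2 and B = 1.
import numpy as np
from scipy.optimize import lsq_linear

def rls_decode(y, H_hat, rho_d, lam, constellation, box_t=None):
    N, K = H_hat.shape
    A = np.sqrt(rho_d / K) * H_hat
    # Augmented LS system: minimizing ||A_aug x - y_aug||^2 is equivalent to
    # minimizing ||y - A x||^2 + lam ||x||^2.
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(K)])
    y_aug = np.concatenate([y, np.zeros(K)])
    if box_t is None:
        x = np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]          # unconstrained RLS
    else:
        x = lsq_linear(A_aug, y_aug, bounds=(-box_t, box_t)).x    # Box-RLS, |x_j| <= t
    # Hard-threshold each entry to the closest constellation point (B = 1 assumed).
    idx = np.argmin(np.abs(x[:, None] - constellation[None, :]), axis=1)
    return constellation[idx]
```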

C. Performance Metrics
This work considers the performance evaluation of the RLS and Box-RLS decoders in terms of three performance metrics. Mean Squared Error: A natural and heavily used measure of performance is the reconstruction mean squared error (MSE), which measures the deviation of x from the true signal x_0 and thus assesses the performance of the first step of the decoding algorithms. Formally, the MSE is defined as MSE := (1/K) ∥x − x_0∥². (12) Symbol Error Probability: The symbol error rate (SER) characterizes the performance of the detection process and is defined as SER := (1/K) Σ_{j=1}^{K} 1{x⋆_j ≠ x_{0,j}}, (13) where 1{·} denotes the indicator function.
In relation to the SER is the symbol error probability (SEP), defined as the expectation of the SER averaged over the noise, the channel and the constellation: SEP := E[SER]. (14) Goodput: The goodput is a performance measure that accounts for the amount of useful data transmitted, divided by the time it takes to successfully transmit it. The amount of data considered excludes protocol overhead bits as well as retransmitted data packets [40]. In our context, it can be defined as G := (T_d/T) b (1 − SER). (15) Goodput and throughput are connected performance parameters in that the throughput can be obtained by simply dividing the goodput by the data transmission rate.
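For concreteness, empirical counterparts of the three metrics can be computed as below; the goodput form (T_d/T)·b·(1 − SER) is our reconstruction of (15), consistent with its use in Section V-C.

```python
# Empirical counterparts of the metrics of Section II-C (illustrative sketch).
import numpy as np

def mse(x_hat, x0):
    return np.mean((x_hat - x0) ** 2)        # (1/K) ||x_hat - x0||^2, as in (12)

def ser(x_star, x0):
    return np.mean(x_star != x0)             # fraction of symbol errors, as in (13)

def goodput(x_star, x0, T_d, T, M):
    b = np.log2(M)                           # bits per symbol
    return (T_d / T) * b * (1 - ser(x_star, x0))   # assumed form of (15)
```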

IV. ANALYSIS OF THE MEAN SQUARED ERROR (MSE) AND SYMBOL-ERROR PROBABILITY (SEP)
In this section, we derive asymptotic expressions of the MSE and SEP for the RLS and Box-RLS decoders. Particularly, we show that these metrics can be approximated by deterministic quantities that involve the power and time devoted to data and training transmissions. Our analysis builds upon the CGMT framework. For the RLS decoder, the same results could have been obtained using tools from random matrix theory, as the decoder possesses a closed-form expression. However, since the CGMT framework is better suited to the Box-RLS decoder, whose solution cannot be expressed in closed form, we rely on the CGMT for both decoders for the sake of a unified presentation.
Prior to stating our main results, we introduce the following assumptions, which describe the considered growth-rate regime.

A. Technical Assumptions

Assumption 1. We consider the asymptotic regime in which the system dimensions K and N grow simultaneously to infinity at a fixed ratio N/K → δ > 0.

Assumption 2. We assume a fixed normalized coherence interval τ := T/K and that the numbers of pilot and data symbols grow proportionally with K, where τ_p := T_p/K and τ_d := T_d/K are fixed and denote the normalized numbers of pilot and data symbols, respectively.
In the sequel, we leverage the statistical distribution of the channel and the channel estimate, as well as the asymptotic regime specified in Assumptions 1 and 2, to provide asymptotic approximations of the MSE and SEP for RLS and Box-RLS. We use the standard notation plim_{n→∞} X_n = X to denote that a sequence of random variables X_n converges in probability towards a constant X.

B. MSE and SEP Analysis for RLS
We provide herein asymptotic approximations of the MSE and SEP for RLS under imperfect channel state information. The closed-form expressions are given in Theorems 1 and 2, while the proofs are given in Appendix C.

Theorem 1. (MSE of RLS): Define θ⋆ as the deterministic parameter appearing in (16). Then, under Assumptions 1 and 2, the MSE converges in probability to the limit in (16).

Proof. The proof of Theorem 1 is given in Appendix C. ■

It is worth mentioning that the above formula is not restricted to x_0 belonging to an M-PAM constellation; it is valid for x_0 drawn from any distribution, provided that x_0 is normalized to have unit variance. However, assuming that x_0 is drawn from an M-PAM constellation, the SEP can be approximated as in the following theorem.

Theorem 2. (SEP of RLS): Under Assumptions 1 and 2, the SEP converges to the limit in (17), where θ⋆ is defined in Theorem 1.

Proof.
A sketch of the proof is provided in Appendix C. ■ Before proceeding further, we validate the approximations provided in Theorem 1 and Theorem 2.
To this end, we report in Figure 2 a comparison of the asymptotic predictions with their empirical counterparts.

Corollary 1. (Optimal regularization coefficient for RLS in the MSE and SEP senses): Let λ⋆ denote the optimal regularization coefficient that minimizes the limit in (16) or in (17). Then, λ⋆ = 1/ρ_d + σ²_∆. (19)

Proof. Note that in both (16) and (17), the regularization coefficient λ appears through θ⋆ only. Then, λ⋆ = arg min_{λ≥0} θ⋆. Taking the derivative of θ⋆ with respect to λ, setting it to zero and solving completes the proof of the corollary. ■

Remark 1. It is worth mentioning that the optimal regularization coefficient in (19) minimizes both the MSE and the SEP. Moreover, it can be written in terms of the so-called effective SNR of the system [4] as λ⋆ = σ²_Ĥ/ρ_eff, with ρ_eff := ρ_d σ²_Ĥ/(1 + ρ_d σ²_∆). (20)

Remark 2. In Appendix H, we show that the RLS detector with the optimal regularization coefficient is equivalent to the LMMSE detector. The latter is known by definition to minimize the MSE, but it turns out, according to Corollary 1, that it also minimizes the asymptotic SEP among all other choices of λ.
In the perfect CSI case, σ²_∆ = 0, hence the optimal regularization coefficient becomes λ⋆ = 1/ρ_d, which corresponds exactly to the LMMSE decoder. This shows that in both the perfect and imperfect CSI settings, the RLS with optimal regularization coefficient turns out to be the LMMSE detector. Such a finding is appealing due to the fundamental importance of the LMMSE decoder in many applications.
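The optimal regularizer and the effective SNR are straightforward to evaluate numerically. The sketch below uses λ⋆ = 1/ρ_d + σ²_∆ (stated in Appendix H) and ρ_eff from (20), and checks that the two forms of λ⋆ agree.

```python
# Optimal RLS regularizer and effective SNR (illustrative sketch).
def effective_snr(rho_d, sigma2_Hh, sigma2_D):
    return rho_d * sigma2_Hh / (1 + rho_d * sigma2_D)   # rho_eff as in (20)

def lambda_star(rho_d, sigma2_D):
    # Equals sigma2_Hh / rho_eff; reduces to 1/rho_d when sigma2_D = 0 (perfect CSI).
    return 1 / rho_d + sigma2_D

rho_d, sigma2_D = 10.0, 0.05
sigma2_Hh = 1 - sigma2_D
print(lambda_star(rho_d, sigma2_D),
      sigma2_Hh / effective_snr(rho_d, sigma2_Hh, sigma2_D))   # identical values
```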

C. MSE and SEP Analysis for Box-RLS
In this subsection, we study the asymptotic performance of the Box-RLS decoder in terms of the MSE and SEP. We first present the MSE results in the following theorem.

Theorem 3. (MSE of Box-RLS):
Fix λ > 0, δ > 0, and let x be a minimizer of the Box-RLS problem in (11a). Let β⋆ and θ⋆ be the unique solutions in β and θ to the max-min problem in (21). Then, under Assumptions 1 and 2, the MSE converges in probability to a deterministic limit expressed in terms of (β⋆, θ⋆). Proof. The proof of this theorem is deferred to Appendix B. ■ Remark 3. It is worth mentioning that, contrary to previous works based on the CGMT framework, the optimization problem in (21), which results from the asymptotic analysis, is not convex-concave in the variables θ and β. Indeed, it is concave in β but not convex in θ. This poses a major challenge in proving the uniqueness of the solution in θ of (21), which is a crucial step required to ensure the convergence results. The reader can refer to Appendix B for details of the technical arguments developed to show the uniqueness of the solutions of (21).
Remark 4. If the optimal values θ⋆ and β⋆ are strictly positive, then they satisfy first-order stationarity conditions, obtained by setting the partial derivatives of the objective of (21) to zero, which can be exploited in practice to facilitate their numerical evaluation.
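Since (21) must in general be solved numerically, the following sketch illustrates one way to locate the saddle point (θ⋆, β⋆). The objective D below is a toy stand-in (the actual expression of (21) is given in Theorem 3 and not reproduced here), so only the structure of the computation should be taken from this example.

```python
# Hedged numerical sketch for a max-min saddle point as in (21).
import numpy as np
from scipy.optimize import minimize_scalar

def D(theta, beta):
    # Hypothetical placeholder, NOT the objective of (21); it only mimics the
    # structure: strictly concave in beta (note the -beta^2/4 term) and
    # blowing up as theta -> 0+ or theta -> infinity.
    return beta * theta + 1.0 / theta - beta ** 2 / 4.0

def inner_min(beta, lo=1e-6, hi=1e3):
    res = minimize_scalar(lambda th: D(th, beta), bounds=(lo, hi), method="bounded")
    return res.fun, res.x

def saddle_point():
    # The outer problem is concave in beta (cf. Appendix B), so a bounded scalar
    # maximization of the inner minimum is well behaved.
    res = minimize_scalar(lambda b: -inner_min(b)[0], bounds=(1e-6, 1e3), method="bounded")
    beta_star = res.x
    return inner_min(beta_star)[1], beta_star

print(saddle_point())   # (theta*, beta*) for the toy objective
```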
The following theorem provides the asymptotic expression of the SEP for the Box-RLS decoder.

Theorem 4. (SEP of Box-RLS): Let β⋆ be the solution to (21) in β, and assume that t ∉ {i/√E, i = 1, 3, …, M−1}. Then, the SER converges in probability to SEP_Box-RLS, given in (24).

Remark 6. It should be noted that when t ≥ (M−1)/√E, the MSE and SEP expressions take the same form as in the RLS case, with the single difference that θ⋆ and β⋆ do not admit closed-form expressions. Moreover, as for the RLS, the optimal regularization coefficient for Box-RLS is given by λ⋆ = arg min_{λ≥0} θ⋆, since λ appears in the expressions of the MSE and SEP only through θ⋆. However, in contrast to the RLS case, the optimal regularization coefficient cannot be obtained in closed form, but it can be retrieved by a bisection algorithm. Moreover, as opposed to the RLS, its value depends on M.
Remark 7. Figure 4 plots the optimal regularization coefficient, computed using a bisection algorithm, as a function of ρ_d for RLS and Box-RLS and for different values of M. As a first observation, we note that the optimal regularization coefficient for Box-RLS becomes zero starting from moderate values of ρ_d.
Moreover, the Box-RLS needs less regularization, owing to the improvement it achieves over the RLS. On the other hand, in low-SNR regions corresponding to low ρ_d values, the optimal regularization coefficients for both RLS and Box-RLS are higher than 1/ρ_d, which is the optimal regularization coefficient in the perfect CSI case. This can be explained by the fact that, under imperfect CSI, more regularization is needed in low-SNR regions because of the degradation caused by channel estimation errors. Remark 8. Similar to the regularization coefficient, we can set the threshold t to the optimal value that minimizes the MSE and SEP expressions, that is, t⋆ = arg min_{t>0} θ⋆. Figure 5 shows the optimal box threshold as a function of ρ_d when the regularization coefficient is also optimized. As can be seen, for practical SNR regions, the optimal threshold coincides with (M−1)/√E, which is the maximum value of x_0. For this reason, we use this value for the threshold t in the subsequent simulations.

V. OPTIMAL DATA POWER ALLOCATION AND OPTIMAL TRAINING DURATION ALLOCATION
In this section, we leverage the asymptotic expressions of the MSE and SEP derived thus far for the RLS and Box-RLS to determine the optimal power distribution between the training and data symbols.
Particularly, we show that for all considered decoders, the optimal allocation scheme boils down to maximizing the effective SNR of the system ρ_eff defined in (20). Additionally, we derive for each decoder the asymptotic expression of the goodput and determine the pair (τ_p, α), i.e., the training duration together with the fraction of power allocated to data transmission, that maximizes it. In this respect, we illustrate that, while the optimal power allocation remains the one that maximizes the effective SNR, the optimal number of training symbols coincides with the number of transmit antennas K, which is also the minimum number of training symbols needed to satisfy orthogonality between pilot sequences.
A. Asymptotic MSE and SEP in Terms of the Effective SNR

1) LS decoder: For the LS decoder (λ = 0), the MSE limit in (16) reduces to MSE_LS = 1/((δ−1)ρ_eff), for δ > 1, (26) where ρ_eff is the effective SNR defined in (20). The result in (26) recovers the well-known formula of the MSE of LS, with the difference that ρ_d, which stands for the SNR in the perfect CSI case, is replaced by ρ_eff. Similarly, for δ > 1, and from (17), the SEP of the LS decoder can also be expressed in terms of ρ_eff as SEP_LS = 2(1 − 1/M) Q(√((δ−1)ρ_eff/E)). (27) Again, (27) parallels the well-known result for the LS and BPSK signaling under perfect CSI [41] (in which case the BER converges in probability to BER_LS = Q(√((δ−1)ρ_d))), in that it takes the same form with ρ_eff replacing ρ_d. Hence, our result generalizes [41] to encompass M-PAM modulation and imperfect CSI scenarios.
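The LS expressions above are easy to evaluate numerically; the sketch below implements (26) and (27) in the forms reconstructed here, which should therefore be read as our assumptions.

```python
# Evaluating the reconstructed LS limits (26)-(27) for delta > 1.
import numpy as np
from scipy.stats import norm

def Q(x):
    return norm.sf(x)                        # Gaussian tail (Q-) function

def ls_mse(delta, rho_eff):
    return 1 / ((delta - 1) * rho_eff)       # assumed form of (26)

def ls_sep(delta, rho_eff, M):
    E = (M**2 - 1) / 3                       # non-normalized M-PAM power
    return 2 * (1 - 1 / M) * Q(np.sqrt((delta - 1) * rho_eff / E))  # assumed (27)

print(ls_sep(2.0, 10.0, 2))                  # BPSK case: reduces to Q(sqrt(rho_eff))
```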
2) RLS decoder: We proceed now with the RLS decoder. The MSE expression in (16) can also be written in terms of ρ_eff, from which an interesting relationship between the MSE and SEP of the RLS decoder follows.
Such an expression holds for any λ > 0, and not only for λ⋆. But when λ = λ⋆, with λ⋆ = σ²_Ĥ/ρ_eff, we obtain, after some algebraic manipulations, the expression for the MSE in (30). Note that in the perfect CSI case, for which the optimal regularization coefficient is 1/ρ_d, the right-hand side of (30) is exactly the minimum mean squared error (MMSE) (see [42, Theorem 8]). 3) Box-RLS decoder: In a similar way, for the Box-RLS decoder, we have the same asymptotic relationships between the MSE and SEP and, for t ≥ (M−1)/√E, the same form as for the RLS, which again reveals that minimizing the MSE is equivalent to minimizing the SEP.

B. Optimal Power Allocation in MSE and SEP Sense
For the RLS decoder, we prove in Appendix E that both MSE_RLS and SEP_RLS are monotonically increasing functions of 1/ρ_eff. Hence, minimizing the MSE or the SEP is equivalent to maximizing ρ_eff. The same can easily be seen to hold for the LS decoder. For the Box-RLS decoder, however, such a statement could not be verified analytically, as θ⋆ does not possess a closed-form expression. Based on extensive simulations, we conjecture that both MSE_Box-RLS and SEP_Box-RLS also increase with 1/ρ_eff. All these considerations suggest that the optimal power allocation is the one that maximizes ρ_eff over α, i.e., α⋆ = arg max_{α∈(0,1)} ρ_eff. (33) Recall that ρ_eff = ρ_d σ²_Ĥ/(1 + ρ_d σ²_∆). Substituting the expressions for σ²_Ĥ and σ²_∆ gives ρ_eff = ρ_d ρ_p τ_p/(1 + ρ_p τ_p + ρ_d). Further, upon using ρ_p = (1−α)ρτ/τ_p and ρ_d = αρτ/τ_d, the effective SNR becomes ρ_eff = [α(1−α)(ρτ)²/τ_d] / [(1+ρτ) − αρτ(τ_d−1)/τ_d]. (34) With this expression at hand, we determine in the following theorem the optimal power allocation that maximizes the effective SNR. Theorem 5. (Optimal Power Allocation): For τ_d > 1, the optimal power allocation α⋆ that maximizes the effective SNR in a training-based system is given by α⋆ = ϑ − √(ϑ(ϑ−1)), (35) where ϑ = (1+ρτ)τ_d/(ρτ(τ_d−1)). Proof. The proof of this theorem is given in Appendix F. ■ It is worth mentioning that this power allocation has already been proposed in the early work of [4] as the one that maximizes a lower bound on the capacity. Interestingly, we retrieve the same power allocation scheme, which we prove to be optimal in the MSE/SEP sense for the RLS and LS decoders, and conjecture to be optimal for the Box-RLS decoder as well.
• At low SNR (ρ ≪ 1), ϑ ≈ τ_d/(ρτ(τ_d−1)) and α⋆ ≈ 1/2. This means that, at low SNR, half of the transmit energy should be devoted to training and the other half to data transmission.
Remark 10 (Numerical Illustration). The asymptotic predictions of the MSE and the SEP are plotted as functions of the data power ratio α in Figures 6 and 7 for δ = 2, K = 256, T = 1000, T_p = 256, M = 2 and ρ = 15 dB. As can be seen, the optimal power allocation α⋆ is the same in the MSE and SEP senses for the different decoders considered here, namely LS, RLS and Box-RLS. The same conclusion has been observed in other settings, supporting the conjecture that, for Box-RLS, the optimal power allocation is also obtained by maximizing the effective SNR.
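As a quick sanity check of Theorem 5 (with ϑ and (34) in the forms reconstructed above, which are our assumptions), the sketch below compares the closed-form α⋆ against a brute-force maximization of ρ_eff(α); the parameter values mirror Remark 10.

```python
# Closed-form alpha* of (35) versus grid maximization of rho_eff(alpha) in (34).
import numpy as np

def rho_eff_of_alpha(alpha, rho, tau, tau_d):
    num = alpha * (1 - alpha) * (rho * tau) ** 2 / tau_d
    den = (1 + rho * tau) - alpha * rho * tau * (tau_d - 1) / tau_d
    return num / den

def alpha_star(rho, tau, tau_d):
    # Requires tau_d > 1, as in Theorem 5.
    theta = (1 + rho * tau) * tau_d / (rho * tau * (tau_d - 1))
    return theta - np.sqrt(theta * (theta - 1))

rho = 10 ** (15 / 10)                       # 15 dB, as in Remark 10
tau, tau_d = 1000 / 256, (1000 - 256) / 256
grid = np.linspace(1e-3, 1 - 1e-3, 100000)
print(alpha_star(rho, tau, tau_d),
      grid[np.argmax(rho_eff_of_alpha(grid, rho, tau, tau_d))])  # should match
```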

C. Joint Optimization of Power Allocation and Training Duration in the Goodput Sense
We now consider the goodput metric for the joint optimization of the power allocation and the training duration. From its definition in (15), its asymptotic value can be written as G_∞ = (τ_d/τ) b (1 − SEP_∞), (36) where the SEP limit on the right-hand side is given by (17) for RLS and by (24) for Box-RLS. The above expression can be used to find the optimal pair (τ⋆_p, α⋆) that maximizes the goodput limit in (36). The result is summarized below. Proposition 1. (Joint optimization in the goodput sense): The optimal pair (τ⋆_p, α⋆) that maximizes the goodput limit in (36) is given by τ⋆_p = 1 (or, equivalently, T⋆_p = K), and α⋆ is the same as in (35), for all ρ and τ (or T).
Proof. The proof of this proposition is given in Appendix G. ■ Remark 11. A major outcome of the above result is that the optimal number of training symbols that maximizes the goodput is the minimum required number of training symbols, that is, the number of transmit antennas K. This differs from the finding of [4], in which it was proven that, in the case of an equal distribution of power between training and data, the optimal number of training symbols may be larger than the number of transmit antennas.
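The sketch below mirrors the joint optimization behind Proposition 1 by a simple grid search. The goodput limit and the error-variance model are the reconstructions assumed earlier, and sep_fn stands for the decoder's asymptotic SEP as a function of ρ_eff (e.g., ls_sep above for LS).

```python
# Grid search over (tau_p, alpha) for the reconstructed goodput limit (36).
import numpy as np

def goodput_limit(tau_p, alpha, rho, tau, M, sep_fn):
    tau_d = tau - tau_p
    rho_p = (1 - alpha) * rho * tau / tau_p
    rho_d = alpha * rho * tau / tau_d
    s2_D = 1 / (1 + rho_p * tau_p)                      # assumed error variance
    rho_eff = rho_d * (1 - s2_D) / (1 + rho_d * s2_D)   # effective SNR, (20)
    return (tau_d / tau) * np.log2(M) * (1 - sep_fn(rho_eff))

def joint_opt(rho, tau, M, sep_fn, n=200):
    taus = np.linspace(1.0, tau - 1.0, n)               # tau_p >= 1, i.e. T_p >= K (needs tau > 2)
    alphas = np.linspace(1e-3, 1 - 1e-3, n)
    G = np.array([[goodput_limit(tp, a, rho, tau, M, sep_fn)
                   for a in alphas] for tp in taus])
    i, j = np.unravel_index(np.argmax(G), G.shape)
    return taus[i], alphas[j]                           # expect tau_p* = 1 per Proposition 1
```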

VI. CONCLUSIONS
Based on the CGMT framework, this work carries out a large-system performance analysis of the regularized least squares (RLS) and box-regularized least squares (Box-RLS) decoders used to recover signals from M-ary constellations when the channel matrix, modeled with i.i.d. real Gaussian entries, is estimated using the LMMSE estimator. Although our analysis relies on asymptotic growth assumptions, numerical results demonstrated the accuracy of the theoretical predictions even for moderate system dimensions.
Compared to previous related works, the main feature of the present work is the consideration of imperfect CSI, which allowed us to derive the optimal power allocation between training and data. Handling imperfect CSI, however, posed several technical challenges and led us to develop novel technical tools to establish the convergence results. We believe that these tools can be leveraged in the future to further facilitate rigorous analyses based on the CGMT framework.

ACKNOWLEDGMENT
We would like to thank Prof. Ahmed-Sultan Salem for the very helpful comments, discussions and suggestions. We also would like to thank Houssem Sifaou for fruitful discussions.

APPENDIX A GAUSSIAN MIN-MAX THEOREM
The key ingredient of the analysis is the Convex Gaussian Min-max Theorem (CGMT); a concrete formulation of it can be found in [33]. The CGMT is a tool that allows analyzing the behavior of solutions of stochastic optimization problems that can be cast into the following form: Φ(G) = min_{w∈S_w} max_{u∈S_u} u^T G w + ψ(w, u), (37) where G ∈ R^{N×K} has i.i.d. standard normal entries, S_w and S_u are subsets of R^K and R^N, respectively, and ψ : R^K × R^N → R. The problem in (37) is referred to as the Primary Optimization (PO) problem, and its analysis is in general not tractable. The CGMT associates with it the Auxiliary Optimization (AO) problem given by ϕ(g, s) = min_{w∈S_w} max_{u∈S_u} ∥w∥ g^T u + ∥u∥ s^T w + ψ(w, u), (38) where g ∈ R^N and s ∈ R^K have i.i.d. standard Gaussian entries. The initial formulation of the CGMT establishes that the (AO) has the same asymptotic behavior as the (PO) in the regime in which N and K grow simultaneously at the same pace, under the condition that the sets S_w and S_u are convex and compact. Particularly, if for some ν ∈ R the optimal cost of the (AO) concentrates around ν, in the sense that ϕ(g, s) converges to ν in probability, then the optimal cost of the (PO) also concentrates around ν, satisfying Φ(G) → ν in probability. Recently, in [43], it was shown that the compactness of S_u can be relaxed, provided that the order of the min-max in (38) can be inverted, that is, ϕ(g, s) is also given by: ϕ(g, s) = max_{u∈S_u} min_{w∈S_w} ∥w∥ g^T u + ∥u∥ s^T w + ψ(w, u). (39) More formally, we have the following result. Theorem 6 (CGMT [33]). Let S be an arbitrary open subset of S_w, and S^c = S_w \ S. Denote by ϕ_{S^c}(g, s) the optimal cost of the optimization in (38) when the minimization over w is constrained to w ∈ S^c.
Assume that S_u is convex while S_w is convex and compact. Assume also that (39) holds true. Consider the regime K, N → ∞ with N/K → δ, which we denote by K → ∞. Suppose that there exist constants φ̄ and η > 0 such that, in the limit as K → ∞, it holds with probability approaching one that: (i) ϕ(g, s) ≤ φ̄ + η, and (ii) ϕ_{S^c}(g, s) ≥ φ̄ + 2η. Let w_Φ and w_ϕ denote the solutions in w to the (PO) and the (AO), respectively. Then, lim_{K→∞} P[w_ϕ ∈ S] = 1 and lim_{K→∞} P[w_Φ ∈ S] = 1.
Remark 12. It is worth mentioning that the result in Theorem 6 goes beyond the asymptotic equivalence between the costs of the (AO) and the (PO) to the localization of the (PO) and (AO) solutions. More specifically, one can easily see that conditions (i) and (ii) in Theorem 6 imply that the solution of the (AO) lies in the set S with probability approaching 1. Theorem 6 allows us to carry over this property to the solution of the (PO), that is w Φ is in S with probability approaching 1.
Remark 13. To satisfy (i) and (ii) in Theorem 6, one can prove that ϕ(g, s) converges to a constant φ̄, while ϕ_{S^c}(g, s) is lower-bounded by a quantity that converges to a constant φ̄_{S^c} with φ̄ < φ̄_{S^c}. (40) In practice, it is usually the case that φ̄ and φ̄_{S^c} represent optimal costs of the same optimization problem, with the solution of the latter being constrained to be away from the optimal solution of the former.
Under this setting, showing that the optimization problem whose optimal cost is φ̄ admits a unique solution directly implies (40).
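As an independent, hedged illustration of the theorem (not taken from the paper), the toy experiment below compares the (PO) and (AO) optimal costs for a choice of ψ for which both inner maximizations can be carried out in closed form; the two normalized costs should be close for large dimensions.

```python
# Toy numerical check that the (PO) and (AO) costs concentrate around the same
# value, with psi(w, u) = u^T z - ||u||^2/2 + lam*||w||^2 (our choice).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
K, N, lam = 200, 400, 0.5
G = rng.standard_normal((N, K))
z = rng.standard_normal(N)
g = rng.standard_normal(N)
s = rng.standard_normal(K)

# (PO): maximizing over u in closed form gives min_w 0.5*||G w + z||^2 + lam*||w||^2,
# i.e. ridge least squares, solvable exactly.
w_po = -np.linalg.solve(G.T @ G + 2 * lam * np.eye(K), G.T @ z)
po = 0.5 * np.sum((G @ w_po + z) ** 2) + lam * np.sum(w_po ** 2)

# (AO): optimizing the directions of u and w in closed form leaves a scalar
# problem in gamma = ||w||:
def ao_cost(gamma):
    a = np.linalg.norm(gamma * g + z) - gamma * np.linalg.norm(s)
    return 0.5 * max(0.0, a) ** 2 + lam * gamma ** 2

ao = minimize_scalar(ao_cost, bounds=(0.0, 100.0), method="bounded").fun
print(po / K, ao / K)   # approximately equal for large K, N, as the CGMT predicts
```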

APPENDIX B PROOFS OF BOX-RLS
In this appendix we prove Theorem 3 and Theorem 4. For simplicity, we will divide the steps of the proof into subsections.

A. Identifying the (PO) and the (AO)
For convenience, we consider the error vector w := x − x_0, together with the box set shifted accordingly. With this notation, the problem in (11a) can be reformulated as (42). To bring the problem in (42) to the form of (37) required by the CGMT, we express the loss function of (42) in its dual form through its Fenchel conjugate. Hence, the problem in (42) is equivalent to (43). To reach the desired (PO) form, we introduce the change of variables in (44), where H and ∆ are N × K independent matrices with i.i.d. standard normal entries. Using these variables, and after normalization by 1/K, the above problem can be written as (44), where the objective is defined therein. We note that (44) is in the form of the (PO), and we associate with it the (AO) in (45), where q ∈ R^{2K} and g ∈ R^N are independent standard normal vectors. Note that, for the moment, we relate the (PO) to the unbounded (AO), as S_u = R^N is not compact. In the sequel, we check that (39) holds true, which justifies considering the unbounded (AO) according to Theorem 6.

B. Scalarizing the (AO)
The next step is to simplify the (AO) in (45) into an optimization problem involving only scalar variables. Since the vectors g and z are independent with i.i.d. Gaussian entries, ∥v∥g − √K z has i.i.d. N(0, ∥v∥² + K) entries. Hence, for our purposes, and with some abuse of notation so that g continues to denote a vector with i.i.d. standard normal entries, the corresponding terms in (45) can be combined into √(∥v∥² + K) g^T u. Therefore, (45) is equivalent to (46). Expressing the above problem in terms of the original variable w yields (47), where q_1, q_2 ∈ R^K are independent standard normal vectors.
Fixing the normalized norm of u to β := ∥u∥/√K, it is easy to see that its optimal direction should be aligned with g. Working with x instead of w results in the optimization problem (48). Then, based on [45, Lemma 8], and setting u = √K β ũ, we can prove that (49) is the same as (48) with the order of the min-max inverted. This completes the proof of (39), which, as aforementioned, allows us to extend the scope of the CGMT to optimization problems in which the variable u is constrained to lie in a non-compact set. Now, getting back to the optimization problem in (48) and flipping the order of min_x max_β results in the optimization problem (50). Prior to proceeding further, we first check that the optimum over β in the above problem is not achieved in the limit β → 0; more specifically, there exists δ̃ > 0 such that taking the supremum over β > δ̃ instead of β > 0 would almost surely not change the optimal cost of (50). Towards this goal, first note that the corresponding normalized term converges in probability to 2√δ. To conclude, it suffices to prove that there exists β_0 such that, with probability approaching one, min_{−t≤x_j≤t, j=1,…,K} Ĥ(β, x) is bounded below as in (52), where Θ is some positive constant; indeed, if (52) is satisfied, then almost surely the supremum over β is not approached near zero. We are now ready to proceed to the optimization of the (AO). Let χ := (ρ_d/K)(σ²_Ĥ∥x∥² − 2σ²_Ĥ x_0^T x + ∥x_0∥²) + 1. To make the above optimization problem separable, we express the square-root term in a variational form using the identity √χ = min_{r>0} (1/(2r) + rχ/2). Note that, at the optimum, r⋆ = 1/√χ. Using the fact that χ ≥ 1, and as such larger than any small positive constant, we also have √χ = min_{0<r≤C′} (1/(2r) + rχ/2), where C′ is any constant greater than 1. Similarly, as χ is almost surely bounded by some constant, we can restrict r to r ≥ ϵ′, where ϵ′ is a sufficiently small positive constant. Using this relation, the optimization problem (49) becomes (53). For ease of notation, define θ̃ := r∥g∥/√K. As ∥g∥/√K is almost surely bounded above and below, the variable θ̃ is almost surely bounded above by a constant C, which we shall take as large as needed, and bounded below by some positive constant ϵ. Introducing this notation leads to (54). For β > 0, the optimal solution in the variables x_j, j = 1, …, K, of (54) is given by (55), with the quantities defined therein. To simplify notation, define ξ := √ρ_d σ_Ĥ; then x_0^−(θ̃, β, x_{0,j}) = −tξθ̃ + 2λρ_d/(ξβ) − ξx_{0,j}θ̃ and x_0^+(θ̃, β, x_{0,j}) = tξθ̃ + 2λρ_d/(ξβ) − ξx_{0,j}θ̃. With these notations at hand, the above optimization problem reduces to the scalar optimization (SO) problem max_{β>0} min_{ϵ<θ̃<C} D(θ̃, β, g, q_1).

C. Asymptotic analysis of the SO problem
After simplifying the (AO) as in (58), we are now in a position to analyze its limiting behavior, relying on the weak law of large numbers (WLLN). To analyze the behavior of the summand, recall that each x_{0,j} takes values in {±1/√E, ±3/√E, …, ±(M−1)/√E}. Hence, it can be shown that, for all θ̃ > 0 and β > 0, the normalized sum over the K coordinates converges in probability (we write →_P to denote convergence in probability as K → ∞), and that the convergence is uniform on the compact set [ϵ, C]. Combining both results yields that θ̃ ↦ D(θ̃, β, g, q_1) converges uniformly to θ̃ ↦ D(θ̃, β) on [ϵ, C]. As a consequence, min_{ϵ<θ̃<C} D(θ̃, β, g, q_1) →_P min_{ϵ<θ̃<C} D(θ̃, β). We now need to prove that the supremum over β converges to the supremum of the right-hand side of this relation. To this end, note that the function β ↦ min_{ϵ<θ̃<C} D(θ̃, β, g, q_1) is concave in β. Based on the above convergence, it follows from the CGMT that the optimal cost of the (PO) converges to the asymptotic limit of the (AO), which is given by sup_{β>0} min_{ϵ≤θ̃<C} D(θ̃, β). (62) However, our interest does not directly concern the characterization of the asymptotic limit of the (PO), but rather that of functionals of the vector w = x − x_0 that can be linked to important metrics such as the MSE or SEP. As explained in Remark 13, proving that the max-min problem in (62) admits a unique solution (β⋆, θ̃⋆) would allow us to transfer any property of the solution of the (AO) to that of the (PO). Unfortunately, the objective of the max-min problem (62) is not convex in θ̃, and hence the approach pursued in [33] cannot be used here. A new approach to handle this problem is thus proposed. To begin with, we notice that since β ↦ −β²/4 is strictly concave, β ↦ min_{ϵ<θ̃<C} D(θ̃, β) is strictly concave in β. It thus has a unique maximizer, since it also satisfies lim_{β→∞} min_{ϵ<θ̃<C} D(θ̃, β) = −∞. Denote by β⋆ this maximizer. Let us prove that there exists a unique θ̃⋆ minimizing the function h : θ̃ ↦ D(θ̃, β⋆). The proof of this result is carried out in the following steps: 1) First, we prove that the minimum lies in the interior of (ϵ, C), for C sufficiently large and ϵ sufficiently small.
2) Next, we establish that Y(θ̃, β⋆) satisfies the curvature property (63). 3) Starting from the observation that θ̃⋆ lies in the interior of the optimization set, and based on the previously established results, we prove that h admits a unique minimum.
We start by establishing the first statement. It is obvious that the optimum cannot be attained when θ̃ is in the vicinity of zero, since lim_{θ̃→0⁺} D(θ̃, β⋆) = ∞. Similarly, to prove that the minimum is not attained as θ̃ grows to infinity, it suffices to check that lim_{θ̃→∞} D(θ̃, β⋆) = ∞. Simple calculations lead to the large-θ̃ approximation in (64). Using this approximation, and since it is easy to check that 3(M−1)/(M(M+1)) ≤ 1/2 for M ≥ 2 and that σ²_Ĥ < 1, we thus have lim_{θ̃→∞} D(θ̃, β⋆) = ∞. To prove (63), we need to compute the first three derivatives of the function θ̃ ↦ Y(θ̃, β⋆). After simple calculations, we can establish the expression of the first derivative, based on which we compute the second and third derivatives of Y(θ̃, β⋆), as in (65) and (66). Leveraging (65) and (66), it is easy to check the sign relation from which we deduce (63). With this result at hand, we are now ready to prove that the function h admits a unique minimum. We already proved that any minimum lies in the interior of (ϵ, C). Assume that there exist two minimizers of h, which we denote by θ̃⋆,1 and θ̃⋆,2 with θ̃⋆,1 < θ̃⋆,2. The first-order and second-order optimality conditions imply (67) and (68). Hence, there exists θ̃₃ ∈ (θ̃⋆,1, θ̃⋆,2) at which a third relation holds. We will prove that this leads to a contradiction unless θ̃⋆,1 = θ̃⋆,2. To this end, we first notice the corresponding identities; hence, from (67) and (68) we obtain the relations (69), (70) and (71). Consider the function k : θ̃ ↦ β⋆δ + θ̃³ ∂²Y(θ̃, β⋆)/∂θ̃². From (63), k′(θ̃) < 0, and as such k is decreasing. Hence, the relations (69), (70) and (71) cannot simultaneously hold, so θ̃⋆,1 = θ̃⋆,2, and as a consequence h admits a unique minimizer, which we denote by θ̃⋆. Now, since for any β > 0, θ̃ ↦ D(θ̃, β) goes to infinity as θ̃ approaches zero or grows to infinity, the minimum over (ϵ, C) coincides with the minimum over (0, ∞). The above relation holds for all β > 0, and we already proved in Section B that the optimal solution in β is almost surely bounded away from zero; the convergence of the optimal cost thus follows.

E. Asymptotic behavior of metrics depending on the solution of the (PO)
So far, we have proved that the (PO) cost converges to the asymptotic cost of the (AO). We now prove that the uniqueness of the minimizer θ̃⋆ allows us to carry this convergence over to metrics depending on the solution of the (PO). The recipe is as follows. Let η > 0, define the perturbed set accordingly, and consider the "perturbed" version of the (AO) in (45), whose optimal cost we denote by ϕ_η. Following the same analysis carried out previously, we lower-bound ϕ_η. Since 1 − ∥g∥/(√K √δ) converges to zero almost surely, there exists η̃ > 0 such that, with probability approaching 1, the lower bound holds. Following the same asymptotic analysis as in Section C, we can prove similarly that ϕ_η converges to a deterministic limit. As θ̃⋆ is the unique minimizer of inf_{θ̃>0} D(θ̃, β⋆), based on the CGMT in Theorem 6 and recalling Remark 13, we thus have the convergence in (74), where v_PO is the solution in v of the (PO) in (44), or equivalently (75), where θ⋆ = 1/θ̃⋆.

F. Convergence of Lipschitz functions of the estimated vector x
The objective here is to study the asymptotic behavior of Lipschitz functions of the solution to the (PO), which we denote by x̂. As will be seen in the next section, such a result is fundamental for the asymptotic analysis of the symbol error rate and can be of independent interest for analyzing other performance metrics. Let β⋆ and θ̃⋆ be the unique solutions of the optimization problem sup_{β>0} inf_{θ̃>0} D(θ̃, β). Recall that x_0 represents the transmitted vector, whose elements are drawn with equal probability from the M-PAM constellation.
Proof. To lighten the notation, we drop x_{0,j} from κ_j(θ̃, β, x_{0,j}), x_0^−(θ̃, β, x_{0,j}) and x_0^+(θ̃, β, x_{0,j}), as it plays no role in the proof. To prove (76), we consider the set S_ϵ defined accordingly. Then, in view of the CGMT, for (76) to hold true, it suffices to show that, with probability approaching 1, the cost constrained to S_ϵ exceeds the unconstrained one, where we recall that Ĥ(β, x) is the objective of the optimization problem in (49). The proof of (77) thus boils down to proving (78). A key step towards showing (78) is to analyze the asymptotic behavior of the optimization problem in (79). Particularly, the following statements will be shown in the sequel: (i) the convergence in (80) holds true; (ii) letting x̃ = [x̃_1, …, x̃_K]^T denote the solution to the optimization problem in (79), it holds with probability approaching 1 that (81) is satisfied; (iii) for any x = [x_1, …, x_K]^T ∈ [−t, t]^K, there exists a constant C such that (82) holds. Prior to proving the above statements, let us see how they lead to the desired inequality (78). Putting together (80) and (82) shows that, for any ℓ > 0, with probability approaching 1, (83) holds. From (81), we have (84) for any x ∈ [−t, t]^K \ S_ϵ. Now, setting ℓ = Cϵ²/(16L) in (83), and using (84), we get (85). The above inequality holds for any x ∈ [−t, t]^K \ S_ϵ, thus proving (78).
Proof of (80). Following the same calculations used in the analysis of the (AO), we can prove that, for sufficiently small ϵ and a large constant C, (86) holds with probability approaching 1. Based on an asymptotic analysis similar to that carried out in Section B, we can prove that θ̃ ↦ D(θ̃, β⋆, g, q_1) converges uniformly to θ̃ ↦ D(θ̃, β⋆), and hence (80) follows. Proof of (81). Let θ̂ ∈ arg min_{ϵ≤θ̃≤C} D(θ̃, β⋆, g, q_1). In view of (55), for j = 1, …, K, x̃_j = κ_j(q_{1,j}, θ̂, β⋆) is a solution to (79). On the other hand, as the minimizer of θ̃ ↦ D(θ̃, β⋆) is unique, we conclude that θ̂ converges in probability to θ̃⋆. From this convergence, we argue that (87) holds. Prior to showing (87), let us explain how it leads to (81). Indeed, from the Lipschitz assumption, (88) holds. Moreover, from the law of large numbers, we obtain a companion convergence which, combined with (88), yields the desired control. Hence, for ϵ > 0, with probability approaching one, the deviation is bounded. By the definition of the set S_ϵ and the triangle inequality, it holds with probability approaching 1 that the corresponding bound applies for all relevant x. Then, based on the Lipschitz property of ψ, (81) follows. Now, to prove (87), note that since θ̂ →_P θ̃⋆, for any η > 0, with probability approaching one, θ̂ lies within η of θ̃⋆, from which we deduce the bounds in (89), with C = √ρ_d σ_Ĥ (max_{1≤j≤K} |x_{0,j}| + t). Using the fact that x_0^−(θ̃⋆, β⋆) < x_0^+(θ̃⋆, β⋆), and choosing η sufficiently small, we conclude that the terms associated with indices j for which q_{1,j} falls outside the corresponding interval coincide. Each term on the right-hand side of (90) can then be bounded by a linear function of η using (89), thereby proving (87).
Proof of (82). It can be checked that x ↦ Ĥ(β⋆, x) is strongly convex, and its Hessian satisfies the lower bound leading to (82).

G. From Lipschitz to the indicator function of solutions to the (PO)
Lemma 2. Let x̂ be the solution to the (PO), and let c ∈ R with c ∉ {−t, t}. Then, the corresponding convergence of the empirical indicator average holds, where q is drawn from a standard normal distribution.

H. Applying the CGMT: MSE of Box-RLS
Let x be the solution of (11a). Recall that the MSE is given by (12), which can also be written in terms of the vector v. As (1/K)∥v∥² →_P δθ⋆² − 1 and (1/K)∥x_0∥² →_P 1, we obtain the MSE expression of Theorem 3.

I. Applying the CGMT: SEP for Box-RLS
In this subsection, we study the limiting behavior of the SEP defined in (14). For j = 1, …, K, consider the j-th output of the (PO) problem, which we denote by x̂_j, and recall the expression of the SER in (13). For PAM constellations, we distinguish the inner symbols, for which x_{0,j} belongs to {±1/√E, …, ±(M−3)/√E}, from the outer symbols ±(M−1)/√E. Based on Corollary 2, the SER converges in probability to SEP_Box-RLS, which follows after some tedious but straightforward calculations and is given in (99).

APPENDIX C UN-BOXED RLS PROOFS
The analysis of the un-boxed RLS scheme is similar to that of the Box-RLS; below, we provide a brief sketch of the proof. Following the same analysis as before, we identify the same (PO) and (AO), with the single difference that the constraints on {x_j}_{j=1}^K are now removed. Particularly, the (AO) associated with the RLS writes as in (54), with the difference that the optimization over x_j is over the whole real axis.
Optimizing over the variables x_j, j = 1, …, K, and using the same approach as for the Box-RLS, we can prove that ϕ converges to its deterministic limit.

APPENDIX E OPTIMALITY OF MAXIMIZING THE EFFECTIVE SNR

We showed that, for the LS case, optimizing the power allocation in the MSE sense is equivalent to optimizing it in the SEP sense, and that both boil down to maximizing ρ_eff.
In this appendix, we show that this also holds true for the RLS decoder employing the optimal regularization coefficient. Towards this goal, we proceed with the change of variables J = 1/ρ_eff, c₁ = 2(1 + δ), and c₂ = (1 − δ)². Then, the MSE and SEP write as in (107) and (108), from which it appears that both are increasing functions of J.

APPENDIX F PROOF OF THEOREM 5

Here, we derive the optimal power allocation given in Theorem 5. First, rewrite ρ_eff as in (34).

APPENDIX G OPTIMAL POWER AND TRAINING TIME ALLOCATION DERIVATION BASED ON GOODPUT
In this section, we determine the optimal power and training time allocation that maximizes the asymptotic value of the goodput for the LS decoder, which we denote here by G_LS.
The proof above is carried out for the LS decoder, but the same conclusions hold for the RLS and also for the Box-RLS, under the conjecture that for Box-RLS the optimal α is the one that maximizes ρ_eff. We omit the details for brevity.

APPENDIX H COMPARISON WITH THE LMMSE DECODER
In this appendix, we show that the LMMSE estimator of x_0 is equivalent to an RLS estimator with the optimal regularizer λ⋆ = 1/ρ_d + σ²_∆. The LMMSE estimate of x_0 is given by [36] x̂_LMMSE = C_xy C_yy^{-1} y. To find C_yy, let us write y as y = √(ρ_d/K) Ĥ x_0 + z̃, where z̃ ≜ √(ρ_d/K) ∆ x_0 + z is a zero-mean vector with C_z̃z̃ = (1 + ρ_d σ²_∆) I_N.