Trainable Projected Gradient Detector for Massive Overloaded MIMO Channels: Data-driven Tuning Approach

The paper presents a deep learning-aided iterative detection algorithm for massive overloaded multiple-input multiple-output (MIMO) systems, in which the number of transmit antennas $n(\gg 1)$ is larger than the number of receive antennas $m$. Since the proposed algorithm is based on the projected gradient descent method with trainable parameters, it is named the trainable projected gradient detector (TPG-detector). Its trainable internal parameters, such as the step size parameters, can be optimized with standard deep learning techniques such as back propagation and stochastic gradient descent. This approach, referred to as data-driven tuning, brings notable advantages to the proposed scheme, such as fast convergence. The main iterative process of the TPG-detector consists of matrix-vector products that require $O(m n)$ time per iterative step. In addition, the number of trainable parameters in the TPG-detector is independent of the numbers of antennas $n$ and $m$. These features lead to a fast and stable training process and reasonable scalability to large systems. Numerical simulations show that the proposed detector achieves detection performance comparable to that of known algorithms for massive overloaded MIMO channels, e.g., the state-of-the-art IW-SOAV detector, with lower computation cost.


I. INTRODUCTION
Multiple-input multiple-output (MIMO) systems have attracted great interest because they potentially achieve high spectral efficiency in wireless communications. Recently, as a consequence of the rapid growth of mobile data traffic, massive MIMO is regarded as a key technology in the 5th generation (5G) wireless network standard [1]. In massive MIMO systems, tens or hundreds of antennas are used at the transmitter and the receiver. This complicates the detection problem for MIMO channels because the computational complexity of a MIMO detector, in general, increases as the numbers of antennas grow. A practical massive MIMO detection algorithm should possess low computational complexity in addition to reasonable bit error rate (BER) performance.
In a down-link massive MIMO channel with mobile terminals, a transmitter in a base station can have many antennas, but a mobile terminal cannot have such a large number of receive antennas because of restrictions on cost, space, and power consumption. This scenario is known as the overloaded (or underdetermined) MIMO. The data collection by a base station from Internet of Things (IoT) nodes can also be regarded as an (up-link) overloaded MIMO system, because the number of IoT nodes is typically greater than the number of antennas at the base station. (A part of this work was submitted to IEEE ICC 2019.)
Developing an overloaded MIMO detector with computational efficiency and reasonable BER performance is a highly challenging problem: conventional naive MIMO detectors such as the zero-forcing detector or the minimum mean square error (MMSE) detector [2] exhibit poor BER performance for overloaded MIMO channels, and optimal detection based on the maximum likelihood (ML) criterion with exhaustive search is evidently computationally intractable.
Several search-based detection algorithms such as slab-sphere decoding [3] and enhanced reactive tabu search (ERTS) [4] have been proposed for overloaded MIMO channels. Though these schemes show excellent detection performance, they are still computationally demanding, which may prevent their implementation in a practical massive overloaded MIMO system. As a computationally efficient approach based on ℓ1-regularized minimization, Fadlallah et al. proposed a detector using a convex optimization solver [5]. Recently, Hayakawa and Hayashi [6] proposed an iterative detection algorithm with practical computational complexity called iterative weighted sum-of-absolute-values (IW-SOAV) optimization (see also [7]). The algorithm is based on the SOAV optimization [8] for sparse discrete signal recovery. In addition, the algorithm includes a re-weighting process based on the log-likelihood ratio, which significantly improves the detection performance. The IW-SOAV provides the state-of-the-art BER performance among overloaded MIMO detection algorithms with low computational complexity.
The use of deep neural networks has spread to numerous fields such as image recognition [9], [10] and speech recognition [11] with the progress of computational resources. It has also had a great impact on the design of algorithms for wireless communications and signal processing [12], [13]. In [13], for instance, it is shown that a belief propagation decoder with trainable parameters provides excellent BER performance for several short codes such as BCH codes. Gregor and LeCun first proposed the learned iterative shrinkage-thresholding algorithm (LISTA) [14], which exhibits better recovery performance than the original ISTA [15] for sparse signal recovery problems. Recently, some of the authors proposed the trainable ISTA (TISTA) [16], which yields significantly faster convergence than ISTA and LISTA.
TISTA includes several trainable internal parameters, and these parameters are tuned with standard deep learning techniques such as back propagation and stochastic gradient descent (SGD) algorithms. Through our research on TISTA [16] and several additional experiments (an example will be presented in Section II-B), we encountered a phenomenon in which convergence to the minimum value is accelerated by embedding appropriate parameters into several numerical optimization algorithms, such as the projected gradient descent method and the proximal gradient method. We call this phenomenon data-driven acceleration of convergence.
Most known acceleration techniques for gradient descent algorithms, such as momentum methods, do not take into account the statistical nature of the problem. On the other hand, data-driven acceleration is obtained by learning the statistical nature of the problem, i.e., stochastic variations in the landscape of the objective functions. The internal parameters controlling the behavior of the algorithm are adjusted to match the typical objective function via training processes. Data-driven acceleration is especially advantageous in implementations of detection algorithms because it reduces the number of iterations without sacrificing detection performance. This makes the algorithm faster and more power efficient.
The goal of this paper is to propose a novel detection algorithm for massive overloaded MIMO systems, called the Trainable Projected Gradient Detector (TPG-detector). The proposed algorithm is based on the projected gradient descent method with trainable parameters. We have confirmed that data-driven acceleration improves both the detection performance and the convergence speed.
Though deep learning architectures for MIMO systems have recently been proposed, e.g., the deep MIMO detectors (DMDs) in [17], [18] and a TISTA-based MIMO detection algorithm in [19], no deep learning-aided iterative detectors for massive overloaded MIMO channels have been proposed as far as the authors are aware. Furthermore, an application of data-driven tuning to MIMO detectors has not yet been studied in the related literature.
This paper is organized as follows. In Section II, we introduce the concept of data-driven tuning for iterative algorithms and demonstrate it with a simple example. In Section III, we describe the settings of massive overloaded MIMO systems. Section IV is the main part of this paper; it proposes the TPG-detector for massive overloaded MIMO systems and presents its detection performance in comparison with other detection algorithms such as the IW-SOAV. The last section is devoted to a summary of this paper. The Appendix presents a brief review of the IW-SOAV.

II. DATA-DRIVEN TUNING

In this section, we first introduce our key design principle, called data-driven tuning, for numerical optimization algorithms. A simple example of data-driven tuning is then presented. In the numerical results, we will observe the phenomenon of data-driven acceleration of convergence for a projected gradient descent algorithm. The trainable algorithm shown in the example will be used as the basis of the TPG-detector in Section IV.

A. Basic concept
We here introduce the concept of data-driven tuning of numerical optimization algorithms, whose origin dates back to the work of Gregor and LeCun [14]. To improve the performance of an iterative numerical optimization algorithm, several trainable parameters can be embedded in the algorithm. By unfolding its iterative process, we obtain a multilayer signal-flow graph that is similar to a deep neural network. If each component of the signal-flow graph is differentiable, these trainable parameters can be adjusted by standard deep learning techniques. It is crucial to have sufficient training data for this approach; note that, for communication problems, training data can be randomly generated according to a channel model. Figure 1(a) illustrates a signal-flow diagram of an iterative numerical optimization algorithm in which Processes A, B, and C have input/output relationships expressed by differentiable functions. By unfolding the signal-flow diagram, we obtain a signal-flow graph similar to a multilayer neural network (Fig. 1(b)).
Each process contains trainable parameters, represented by the black circles in Fig. 1(b), which control the behavior of Processes A, B, and C. Appending a loss function, e.g., the squared loss, at the end of the unfolded signal-flow graph, we are ready to feed randomly generated training data into the graph. We can then apply back propagation and an SGD-type parameter update (SGD, RMSprop, Adam, etc.) to optimize the parameters.

B. Example of data-driven tuning
We now discuss a simple example of the data-driven tuning in more detail. As a toy model closely related to the MIMO channel, we consider a quadratic programming problem with binary variables.
1) Problem Setting: Let us consider the simple quadratic optimization problem

x̂ = arg min_{x ∈ {−1,+1}^n} (1/2) ||y − Ax||_2^2,   (1)

where A ∈ R^{n×n} is a given matrix and ||·||_2 represents the Euclidean norm. We assume that y is stochastically generated as y = Ax + w ∈ R^n, where x is a vector sampled from {−1, +1}^n uniformly at random and w ∈ R^n consists of i.i.d. Gaussian random variables with zero mean and variance σ². The optimization problem (1) is essentially the same as the ML estimation rule for the Gaussian linear vector channel. Since solving this problem is NP-hard in general, we need to solve it approximately. We here exploit a variant of the projected gradient (PG) algorithm to solve (1) approximately for large systems. The PG algorithm is described by the recursive formulas

r_t = s_t + γ A^T (y − A s_t),   (2)
s_{t+1} = tanh(ξ r_t),   (3)

where t = 1, . . . , T and tanh(·) is applied element-wise. The initial value is set to s_1 = 0. The PG algorithm consists of two computational steps per iteration. In the gradient descent step (2), a search point moves in the direction opposite to the gradient of the objective function, i.e., ∇(1/2)||Ax − y||_2^2 = −A^T(y − Ax). The parameter γ controls the step size, which critically influences the convergence behavior. In the projection step (3), a soft projection based on the hyperbolic tangent function is applied to the search point to obtain a new search point nearly rounded to binary values. Precisely speaking, the projection step is not the projection onto the binary symbols {−1, +1}; the true projection onto discrete values results in insufficient convergence behavior in a minimization process (see also the discussion in Section II-B4). The parameter ξ controls the softness of the soft projection. Note that this type of nonlinear projection has been commonly used in several iterative multiuser detection algorithms such as the soft parallel interference canceller [20].
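A minimal NumPy sketch may clarify the two steps of the recursion (2)-(3); the dimension, noise level, and parameter values below are illustrative choices, not those used in the experiments of this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                        # problem dimension (illustrative)
gamma, xi, T = 1e-3, 6.0, 50   # step size, softness, iterations (illustrative)

A = rng.standard_normal((n, n))
x_true = rng.choice([-1.0, 1.0], size=n)       # binary source vector
y = A @ x_true + 0.1 * rng.standard_normal(n)  # noisy linear observation

s = np.zeros(n)
for _ in range(T):
    r = s + gamma * A.T @ (y - A @ s)  # gradient descent step (2)
    s = np.tanh(xi * r)                # soft projection step (3)
```

Rounding the final search point s to ±1 with the sign function then yields a binary estimate of the source vector.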
2) Trainable PG algorithm: According to the data-driven tuning framework, we can embed trainable parameters into the PG algorithm. The trainable PG (TPG) algorithm is based on the recursion

r_t = s_t + γ_t A^T (y − A s_t),   (4)
s_{t+1} = tanh(ξ r_t),   (5)

with s_1 = 0. The trainable parameters {γ_t}_{t=1}^T play a key role in the gradient descent step by adjusting its step size adaptively. In the following discussion, the parameter ξ is treated as a fixed hyperparameter in Section II-B3 and as a trainable parameter in Section II-B4.
As described above, the parameters {γ_t}_{t=1}^T are optimized by standard mini-batch training. The ith training datum x_i ∈ {−1, +1}^n is generated uniformly at random, and the corresponding y_i is then generated according to y_i = A x_i + w_i. In the following experiment, the matrix A is randomly generated for each mini-batch. Each element of A follows the Gaussian distribution with mean 0 and variance 1.
For each round of a training process, we feed these mini-batches to the TPG algorithm to minimize the squared loss function

L(Θ_t) = (1/D) Σ_{i=1}^{D} ||x_i − x̂_t(y_i)||_2^2,   (6)

where D denotes the mini-batch size, x̂_t(y) := s_{t+1} is the output of the TPG algorithm after t iterations, and Θ_t := {γ_1, . . . , γ_t} (or Θ_t := {γ_1, . . . , γ_t} ∪ {ξ}) is the set of trainable parameters up to the tth round. A back propagation process evaluates the gradient ∇L(Θ_t), which is used for updating the set of parameters Θ_t by an SGD-type algorithm such as the Adam optimizer [21]. It should be remarked that a simple single-shot training of the whole process by letting t = T does not work well because the vanishing gradient phenomenon prevents appropriate parameter updates: the derivative of the soft projection function (5) becomes nearly zero almost everywhere. Figure 2 shows the gradients of the trainable parameters {γ_t}_{t=1}^T in the TPG algorithm. We find that the gradient vanishes as the iteration index t becomes small, which results in insufficient training under single-shot training (see Fig. 3). In order to avoid the vanishing gradient phenomenon, we use an alternative approach, incremental training, as in TISTA [16]. In incremental training, the parameters {γ_t}_{t=1}^T are sequentially trained from Θ_1 to Θ_T in an incremental manner.
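The near-zero derivative of the soft projection can be checked directly. In the small NumPy illustration below, the value ξ = 8.0 matches the experiment in Section II-B3, while the probe points are arbitrary:

```python
import numpy as np

xi = 8.0
r = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # probe points for the input of tanh(xi * r)
grad = xi * (1.0 - np.tanh(xi * r) ** 2)   # d/dr tanh(xi * r)
# the derivative equals xi at r = 0 but is practically zero once |r| approaches 1,
# so gradients propagated through many soft-projection layers tend to vanish
print(grad)
```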
The details of incremental training are as follows. At first, Θ_1 is trained by minimizing L(Θ_1). After the training of Θ_1 finishes, the values of the trainable parameters in Θ_1 are copied to the corresponding parameters in Θ_2; in other words, the results of the training for Θ_1 are carried over to Θ_2 as initial values. For each round of incremental training, called a generation, K mini-batches are processed.
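The schedule can be sketched as a self-contained toy version. The problem size, batch counts, and learning rate below are illustrative, and back propagation with Adam is replaced by a crude finite-difference gradient with plain SGD so that the sketch needs no deep learning library; it illustrates the training schedule, not the tuned performance reported in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, xi = 16, 4, 6.0        # toy sizes (illustrative)
K, D, lr = 20, 20, 1e-9      # mini-batches per generation, batch size, learning rate

def tpg_run(A, Y, gammas):
    """TPG recursion (4)-(5) applied row-wise to a mini-batch Y of shape (D, n)."""
    S = np.zeros_like(Y)
    for g in gammas:
        S = np.tanh(xi * (S + g * (Y - S @ A.T) @ A))
    return S

def loss(A, X, Y, gammas):
    """Squared loss averaged over the mini-batch."""
    return np.mean(np.sum((X - tpg_run(A, Y, gammas)) ** 2, axis=1))

def make_batch():
    A = rng.standard_normal((n, n))          # fresh channel matrix per mini-batch
    X = rng.choice([-1.0, 1.0], size=(D, n))
    Y = X @ A.T + 0.5 * rng.standard_normal((D, n))
    return A, X, Y

gammas = []
for t in range(T):           # generation t: previously trained values are kept,
    gammas.append(1e-4)      # the newly added gamma starts from the initial value
    for _ in range(K):
        A, X, Y = make_batch()
        base = loss(A, X, Y, gammas)
        grad = np.zeros(len(gammas))
        for i in range(len(gammas)):         # finite differences instead of back propagation
            pert = list(gammas)
            pert[i] += 1e-6
            grad[i] = (loss(A, X, Y, pert) - base) / 1e-6
        gammas = [g - lr * dg for g, dg in zip(gammas, grad)]
```

In an actual implementation, the same schedule is driven by automatic differentiation and an adaptive optimizer such as Adam, which is far more robust than this fixed-step sketch.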
3) Data-driven acceleration: We now show a numerical demonstration of the TPG algorithm. Here, the step size parameters {γ_t}_{t=1}^T are treated as trainable parameters while ξ is treated as a hyperparameter. In the experiment, the dimension of the matrix A is set to n = 1000 and the noise variance is fixed to σ² = 4.0. The number of iterations of the TPG algorithm is T = 20. We performed two types of training processes for the TPG algorithm to measure the effect of incremental training. In the training process of TPG with incremental training, we used K = 100 mini-batches per generation. The mini-batch size was set to D = 200, and the Adam optimizer with learning rate 0.0002 was used for the parameter updates. In the training process of TPG without incremental training (labeled "TPG-noINC" in Fig. 3), we used K = 2000, D = 200, and the Adam optimizer with learning rate 0.002. The initial values of the trainable parameters were γ_t = 1.0 × 10⁻⁴ (t = 1, . . . , T). In this experiment, the softness parameter ξ was fixed to 8.0 for the TPG algorithm. Figure 3 shows the mean squared error (MSE) as a function of the iteration step for the plain PG algorithm based on (2), (3) (ξ = 6.0, γ = 6.5 × 10⁻⁴) and for the TPG algorithms based on (4), (5) (with/without incremental training). The MSE after t iterations is defined by 10 log_10(E[||x − x̂_t(y)||_2^2]/n) (dB) and was estimated from 10⁴ random samples of A, y, and x. The parameter γ = 6.5 × 10⁻⁴ of the plain PG algorithm is the optimal value for T = 20 (see also Fig. 4).
From Fig. 3, we observe that the TPG algorithm provides much smaller MSEs than the plain PG algorithm. The MSE of TPG reaches −80 dB at t = 8, whereas the plain PG algorithm does not attain a comparable MSE until t = 19. Namely, TPG shows much faster convergence, which implies that the parameter tuning drastically improves the convergence speed. This is an example of data-driven acceleration of convergence achieved by data-driven tuning. The effect of incremental training can be confirmed by comparing the MSEs of the TPG algorithms with and without it. The MSE curve of TPG-noINC is almost flat, indicating that the parameter tuning fails due to the vanishing gradient when incremental training is not used.
In Fig. 4, we show the relationship between the step size parameter γ and the MSE performance of the plain PG algorithm with ξ = 6.0. It can be observed that γ must be selected carefully to obtain appropriate convergence. In other words, the sweet spot of γ is relatively narrow: only a close neighborhood of 6.5 × 10⁻⁴ allows achieving −100 dB at T = 200. This means that optimizing the step size is critical even for the plain PG algorithm. In addition, the TPG algorithm achieves a lower MSE (around −130 dB) that cannot be achieved by the plain PG algorithm. This fact implies that embedding an independent step size parameter into each iteration step provides a substantial improvement in the quality of the solution.
4) Effect of softness parameter: We further discuss the effect of the softness parameter ξ in (5) on the MSE performance. Figure 5 shows the MSE curves of TPG algorithms with different fixed values of ξ and of the TPG algorithm with trainable ξ. The setting of the experiment is the same as the previous one. As described above, the projection in the TPG algorithm is the soft projection function rather than the hard projection corresponding to the ξ → ∞ limit. The results show that a large fixed ξ is inappropriate in terms of MSE performance. On the other hand, the TPG algorithm with a small fixed ξ also shows poor MSE performance. It is thus crucial to tune not only the step size parameters {γ_t}_{t=1}^T but also the softness parameter ξ to bring out the full performance of the TPG algorithm. The curve labeled "TPG (ξ trained)" represents the MSE of TPG with trainable ξ. In this experiment, we used K = 10000 due to the slow convergence of ξ. In Fig. 5, we can see that it outperforms the TPG algorithms with fixed ξ.

III. OVERLOADED MIMO CHANNELS
This section describes the MIMO channel model and introduces several definitions and notations. The numbers of transmit and receive antennas are denoted by n and m, respectively. In this paper, we mainly consider the overloaded MIMO scenario in which m < n holds. It is also assumed that the transmitter does not use precoding and that the receiver perfectly knows the channel state information, i.e., the channel matrix.
The received signal ỹ ∈ C^m is given by

ỹ = H̃ x̃ + w̃,   (7)

where x̃ ∈ S̃^n is a transmitted signal vector and w̃ ∈ C^m consists of complex Gaussian random variables with zero mean and covariance σ_w² I. The matrix H̃ ∈ C^{m×n} is a channel matrix whose (i, j) entry h̃_{i,j} represents the path gain from the jth transmit antenna to the ith receive antenna. Each entry of H̃ independently follows the complex circular Gaussian distribution with zero mean and unit variance. For the following discussion, it is convenient to derive an equivalent channel model defined over R, i.e.,

y = Hx + w ∈ R^M,   (8)

where x ∈ S^N and (N, M) := (2n, 2m). The signal set S is the real counterpart of S̃. The matrix H ∈ R^{M×N} is converted from H̃. Similarly, the noise vector w consists of i.i.d. random variables following the Gaussian distribution with zero mean and variance σ_w²/2. The signal-to-noise ratio (SNR) per receive antenna is then represented by

SNR = E_s/N_0,   (9)

where E_s := E[||Hx||_2^2]/m stands for the signal power per receive antenna and N_0 := σ_w² stands for the noise power per receive antenna. Throughout the paper, we assume the QPSK modulation format, i.e., S̃ = {1 + j, 1 − j, −1 + j, −1 − j}.
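The complex-to-real conversion used here is one standard real decomposition; the paper may order the real and imaginary parts differently, but the resulting models are equivalent. A small NumPy check (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4                                  # transmit / receive antennas (illustrative)
Hc = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
xc = rng.choice([-1, 1], n) + 1j * rng.choice([-1, 1], n)     # QPSK symbols
wc = 0.1 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
yc = Hc @ xc + wc                            # complex channel model

# equivalent real-valued model with (N, M) = (2n, 2m)
H = np.block([[Hc.real, -Hc.imag], [Hc.imag, Hc.real]])
x = np.concatenate([xc.real, xc.imag])       # x now lies in {-1, +1}^N
w = np.concatenate([wc.real, wc.imag])
y = H @ x + w

assert np.allclose(y, np.concatenate([yc.real, yc.imag]))
```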

IV. TRAINABLE PROJECTED GRADIENT (TPG)-DETECTOR
The proposed algorithm, called the TPG-detector, is based on the TPG algorithm introduced in Section II-B. We first describe the details of the TPG-detector and discuss its time complexity. Then, we show numerical results that mainly compare the detection performance of the TPG-detector with that of other baseline algorithms for massive overloaded MIMO systems. In addition, the detection performance for non-massive non-overloaded MIMO systems will be presented, which indicates that the TPG-detector is a promising MIMO detector not only for massive overloaded systems but also for non-massive non-overloaded systems, with reasonably low computational complexity.

A. Details of TPG-detector
The ML estimation rule for the MIMO channel defined above is given by

x̂ = arg min_{x ∈ S^N} (1/2) ||y − Hx||_2^2.   (11)

This is a non-convex problem, and finding the global minimum is computationally intractable for large-scale problems. The TPG-detector is based on the TPG algorithm and solves this non-convex problem approximately. The process of the TPG-detector is described by the following recursive formulas:

r_t = s_t + γ_t W (y − H s_t),   (12)
s_{t+1} = tanh(r_t / |θ_t|),   (13)

where t(= 1, . . . , T) represents the index of an iteration step (or layer), and we set s_1 = 0 as the initial value. The algorithm estimates the transmitted signal x from the received signal y and outputs the estimate x̂ = s_{T+1} after T iteration steps. The steps (12) and (13) correspond to the gradient descent step and the projection step, respectively, as described in Section II-B. The matrix W in the gradient step (12) is the linear MMSE (LMMSE)-like matrix defined by

W := H^T (H H^T + α I)^{−1},   (14)

where α ∈ R is a trainable parameter. The matrix (14) also appears in the solution of a linear regression problem with a quadratic regularization term. Precisely speaking, the matrix W should be H^T, as in (4), to realize the gradient descent process for the quadratic objective function in (11). However, we adopt the modification inspired by [22] because it improves the BER performance of the proposed scheme (experimental evidence will be shown in Section V-B1). This modification is especially effective when the matrix H is ill-conditioned, i.e., when the condition number of H is large: the LMMSE-like matrix (14) turns an ill-conditioned matrix into a reasonably well-conditioned one. Since optimizing α is critical for achieving reasonable detection performance, the parameter α is optimized in the training process. As in (5), we use the hyperbolic tangent function for the soft projection.
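A NumPy sketch of one forward pass of the TPG-detector recursion may help. The parameter values below are fixed placeholders (a trained detector would use learned γ_t, θ_t, and α), and we deliberately use an easy M > N instance, unlike the overloaded regime of the paper, so that the untrained placeholders still converge:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, T = 12, 16, 20            # easy M > N instance for illustration only
gammas = np.full(T, 0.05)       # placeholder step sizes (trained in practice)
thetas = np.full(T, 0.2)        # placeholder softness parameters (trained in practice)
alpha = 1.0                     # placeholder for the trainable parameter alpha

H = rng.standard_normal((M, N))
x = rng.choice([-1.0, 1.0], N)
y = H @ x + 0.1 * rng.standard_normal(M)

# LMMSE-like matrix (14): computed once per channel realization
W = H.T @ np.linalg.inv(H @ H.T + alpha * np.eye(M))

s = np.zeros(N)
for t in range(T):
    r = s + gammas[t] * W @ (y - H @ s)   # gradient step (12), matrix-vector products only
    s = np.tanh(r / np.abs(thetas[t]))    # soft projection step (13)

x_hat = np.where(s <= 0, -1.0, 1.0)       # element-wise sign of the final estimate
```

Note that W is built once outside the loop; each iteration performs only matrix-vector products.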

B. Trainable parameters
The trainable parameters of the TPG-detector are the 2T + 1 real scalar variables α, {γ_t}_{t=1}^T, and {θ_t}_{t=1}^T. The parameters {γ_t}_{t=1}^T in the gradient step control the step size of a move of the search point. To achieve fast convergence, appropriate setting of these step size parameters is of critical importance, as described in Section II-B. It should be remarked that similar trainable parameters are also introduced in the structure of TISTA [16]. The parameters {θ_t}_{t=1}^T control the softness of the soft projection in (13). In contrast to the TPG algorithm, these trainable parameters depend on the iteration index t, which increases the degrees of freedom of the soft projection functions. The parameter α adjusts the degree of compensation for an ill-conditioned matrix H. Unlike {γ_t}_{t=1}^T and {θ_t}_{t=1}^T, the TPG-detector uses the same parameter α through all the iteration steps; this choice is crucial for reducing the computational cost of executing the TPG-detector, as will be discussed in the next subsection.
One of the advantages of the TPG-detector is that the number of trainable parameters is small, i.e., O(T), which leads to fast and stable training. The number of trainable parameters of the TPG-detector is independent of the numbers of antennas n and m, whereas the DMD [17] contains O(n²T) parameters in T layers.

C. Time complexity
The computational complexity of the TPG-detector per iteration is O(mn) because calculating the matrix-vector products Hs_t and W(y − Hs_t) takes O(mn) computational steps. Computing the LMMSE-like matrix W takes O(m³) steps because it involves a matrix inversion. However, W needs to be evaluated only when H changes, i.e., the matrix inversion is not needed in each iteration of the TPG-detector if H is constant during the process. The TISTA-based MIMO detection algorithm proposed in [19] also uses an LMMSE matrix as a linear estimator. It should be remarked, however, that in the TISTA-based MIMO detector [19] the LMMSE matrix must be recalculated in each iteration, which requires O(m³) computational steps per iteration. This is one of the critical differences between our algorithm and the TISTA-based MIMO detector.

D. Training process
The TPG-detector is trained based on the incremental training described in Section II-B2. The training data is generated randomly according to the channel model (8) with fixed variance σ 2 w corresponding to a given SNR. As described in Section III, we assume a practical situation in which a channel matrix H is a random variable. According to this assumption, a matrix H is randomly generated for each mini-batch in a training process of the TPG-detector.

V. NUMERICAL RESULTS
In this section, we show the detection performance of the TPG-detector and compare it to that of other algorithms such as the IW-SOAV, which is known as one of the most efficient iterative algorithms for massive overloaded MIMO systems.

A. Experimental setup
A transmitted vector x is generated uniformly at random. The BER is then evaluated for a given SNR. We use randomly generated channel matrices for BER estimation.
The TPG-detector was implemented in PyTorch 0.4.0 [23]. In this paper, the training process is executed over T = 50 rounds of incremental training using the Adam optimizer [21]. To calculate the BER of the TPG-detector, the sign function sgn(z), which takes −1 if z ≤ 0 and 1 otherwise, is applied element-wise to the final estimate s_{T+1}.
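For concreteness, the sign mapping and the resulting BER evaluation can be written as follows; the function names are ours, not from the paper's implementation:

```python
import numpy as np

def sgn(z):
    """Element-wise sign as defined above: -1 if z <= 0, and 1 otherwise."""
    return np.where(z <= 0.0, -1.0, 1.0)

def bit_error_rate(x_true, s_final):
    """Fraction of sign errors between the estimate sgn(s_final) and x_true."""
    return float(np.mean(sgn(s_final) != x_true))

x_true = np.array([1.0, -1.0, 1.0, 1.0])
s_final = np.array([0.9, -0.2, 0.0, 0.7])  # a hypothetical final estimate s_{T+1}
print(bit_error_rate(x_true, s_final))     # -> 0.25 (the entry 0.0 maps to -1)
```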
As baselines for detection performance, we use the ERTS [4], the IW-SOAV [6], and the standard MMSE detector. The ERTS is a heuristic algorithm based on a tabu search for overloaded MIMO systems; its parameters are based on [4]. The IW-SOAV is a double-loop algorithm for massive overloaded MIMO systems (see the Appendix for a brief review). Its inner loop is the W-SOAV optimization, which recovers a signal using a proximal operator. Each round of the W-SOAV takes O(mn) computational steps, comparable to an iteration of the TPG-detector. After an execution of the inner loop with K_itr iterations, several parameters are updated in a re-weighting process based on a tentatively recovered signal. This procedure is repeated L times in the outer loop; the total number of iterations of the IW-SOAV is thus K_itr L. In the following, we use the simulation results in [6] with K_itr = 50.

B. Main Results
1) Selection of matrix W in gradient step: As described above, we can choose the matrix W in the gradient step (12). The selection of this matrix affects the detection performance, as shown for the orthogonal AMP [22]. Before showing the detection performance of the proposed TPG-detector, we numerically test the effect of the selection of W. Figure 7 shows the detection performance of the TPG-detector with different choices of W. We examined three types of matrices: the matched filter matrix (MF), the pseudo-inverse matrix (PINV), and the LMMSE-like matrix (LMMSE). We find that the LMMSE-like matrix outperforms the other choices of W over a wide range of SNR. We thus select the LMMSE-like matrix as W to handle an ill-conditioned matrix H in the proposed TPG-detector. This selection is also effective in the non-massive non-overloaded case, as shown in Section V-C.
2) BER performance: For (n, m) = (50, 32) in Fig. 8, ERTS outperforms the other detection algorithms by a large margin when the SNR is larger than 10 dB. It should be remarked, however, that ERTS requires several orders of magnitude more computation time than the IW-SOAV (see Fig. 7 in [6]). Comparing the TPG-detector with the IW-SOAV, we find that the TPG-detector clearly outperforms the IW-SOAV (L = 1) and shows BER performance close to that of the IW-SOAV (L = 5) when the SNR is below 20 dB. Note that the computational cost of executing the TPG-detector with T = 50 is almost comparable to that of the IW-SOAV (L = 1). The IW-SOAV (L = 5) requires 250 iterations, five times the number of iterations required by the TPG-detector with T = 50. This implies that the TPG-detector achieves considerably good detection performance at relatively small computational cost.
For (n, m) = (100, 64) in Fig. 9, the ERTS detector shows the best BER performance in the middle SNR region between 10 dB and 18 dB, but its BER curve saturates after 20 dB. The TPG-detector and the IW-SOAV (L = 1 and L = 5) outperform ERTS in the high SNR regime. The TPG-detector exhibits BER performance superior to that of the IW-SOAV (L = 1) over the entire range of SNR; e.g., the TPG-detector achieves approximately a 5 dB gain over the IW-SOAV (L = 1) at BER = 10⁻⁴. More interestingly, the BER performance of the TPG-detector is fairly close to that of the IW-SOAV (L = 5). For example, at SNR = 20 dB, the BER estimate of the TPG-detector is 6.8 × 10⁻⁵ whereas that of the IW-SOAV (L = 5) is 4.3 × 10⁻⁵. Figure 10 shows the BER performance for (n, m) = (150, 96). In this case, ERTS shows poor BER performance and cannot achieve a BER smaller than 10⁻³ for any SNR. The TPG-detector successfully recovers transmitted signals with a lower BER than that of the IW-SOAV (L = 1); it again achieves about a 5 dB gain over the IW-SOAV (L = 1) at BER = 10⁻⁵. Although the IW-SOAV (L = 5) shows considerable performance improvement in this case, the gap between the curves of the TPG-detector and the IW-SOAV (L = 5) is only 2 dB at BER = 10⁻⁵.
3) System-size dependency: In Fig. 11, we show the BER performance of the TPG-detector and the IW-SOAV (L = 1) as a function of the number of antennas n with the ratio m/n = 0.6 fixed. The gap between their BER performances is especially large for SNR = 20 dB. We also find that the gain of the TPG-detector increases as n grows, even though these algorithms have the same computational cost. This confirms that the TPG-detector outperforms low-complexity algorithms especially in massive overloaded MIMO channels.

Figure 12 displays the learned parameters {γ_t}_{t=1}^T and {|θ_t|}_{t=1}^T of the TPG-detector after a training process, as a function of the iteration index t(= 1, . . . , T). We find that they exhibit a zigzag shape with damping amplitude similar to that observed in TISTA [16]. The parameter γ_t, the step size of the linear estimator, is expected to accelerate the convergence of the signal recovery. A theoretical treatment providing a reasonable interpretation of these characteristic shapes of the learned parameters is an interesting open problem.

4) Trained parameters:
The trained values of α for different SNRs are shown in Fig. 13. We find that the parameter α is tuned depending on the value of the SNR. In particular, the trained value decreases as the SNR grows when SNR ≥ 7.5 dB. This tendency is similar to that of a related parameter in the IW-SOAV [6], which also uses the LMMSE-like matrix. On the other hand, unlike in the IW-SOAV, the trained value is non-monotonic, i.e., it increases when SNR < 7.5 dB. The parameter corresponding to α in the IW-SOAV must be chosen in advance by numerical simulations, whereas the learning process of the TPG-detector easily tunes α along with the other trainable parameters.

5) Computation time:
We finally discuss the scalability of the TPG-detector by showing the computation time required for training. The empirical execution time of the training process of the TPG-detector is measured on a PC with an NVIDIA GeForce GTX 1080 GPU and an Intel Core i7-6700K CPU (4.0 GHz, 8 cores). Table I presents the execution time of the training processes for different n. Even for the case (n, m) = (150, 96), only 20 minutes are needed for training the TPG-detector, and this result indicates that the training process of the TPG-detector is reasonably practical for fairly large systems.

C. Performance for non-massive non-overloaded MIMO systems
Although the main target of the TPG-detector is massive overloaded MIMO systems, the proposed algorithm can be applied to non-massive non-overloaded MIMO systems as well. In this subsection, we study the detection performance of the TPG-detector for such a MIMO channel with (n, m) = (5, 5). Figure 14 presents BER performance curves for several detection algorithms. In Fig. 14, the label "ML" stands for the ML detection algorithm, which provides the smallest BER. The labels "MMSE" and "GIGD" represent the MMSE detector and a belief propagation-based detection algorithm [24], respectively. In addition to these detectors, the BER curves of the IW-SOAV (L = 1 and L = 5) are also included in Fig. 14.
The MMSE detector requires the least computational complexity but gives the largest BER from SNR = 0 to 20 dB. Although the BER performance of GIGD is competitive with that of the TPG-detector and the IW-SOAV (L = 5) up to 12.5 dB, the BER of GIGD saturates to a constant value when SNR > 12.5 dB. Comparing the TPG-detector with the IW-SOAV (L = 1), we find that the TPG-detector shows superior BER performance, as observed in the massive overloaded MIMO systems. The gap between the BER curves of the TPG-detector and the IW-SOAV (L = 1) is about 6 dB at BER $= 10^{-3}$. It can also be observed that the TPG-detector and the IW-SOAV (L = 5) show almost the same BER performance, which is the best except for the ML detection. Note that the IW-SOAV (L = 5) requires much larger computational complexity than the TPG-detector. This experimental result indicates that the TPG-detector is also a promising iterative algorithm for non-massive non-overloaded MIMO systems.

VI. CONCLUSION
In this paper, we proposed the TPG-detector, a deep learning-aided iterative detector for massive overloaded MIMO channels. It is based on the concept of data-driven tuning with standard deep-learning techniques. The TPG-detector contains two trainable parameters for each layer: $\gamma_t$, controlling the step size of the gradient descent step, and $\theta_t$, controlling the softness of the soft projection. In addition, the parameter α in the LMMSE-like matrix W (14) is also optimized in a training process. The total number of trainable parameters in T layers is thus 2T + 1, which is significantly smaller than those used in previous studies such as [17], [18]. This fact promotes fast and stable training processes for the TPG-detector.
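To make the per-layer structure concrete, the following NumPy sketch implements one TPG-detector layer under the assumptions stated in the comments: a gradient step with trainable step size $\gamma_t$ using an LMMSE-like matrix, followed by a tanh-based soft projection with trainable softness $|\theta_t|$. The exact update rule and the form of W are defined by Eq. (14) and the algorithm description in the body of the paper; the function names and the assumed form $W = H^{\mathsf{T}}(HH^{\mathsf{T}} + \alpha I)^{-1}$ here are illustrative only.

```python
import numpy as np

def lmmse_like_matrix(H, alpha):
    """Assumed form of the LMMSE-like matrix: W = H^T (H H^T + alpha I)^{-1}.
    The inverse is m x m, so W can be precomputed in O(m^3) once."""
    m = H.shape[0]
    return H.T @ np.linalg.inv(H @ H.T + alpha * np.eye(m))

def tpg_iteration(s, y, H, W, gamma_t, theta_t):
    """One TPG-detector layer (sketch): gradient step + soft projection.
    Each step costs O(mn) since it only uses matrix-vector products."""
    r = s + gamma_t * (W @ (y - H @ s))  # linear estimation with step size gamma_t
    return np.tanh(r / abs(theta_t))     # soft projection toward the symbols +-1
```

Because the layer output is produced by tanh, every estimate stays inside the symbol box [-1, 1], and only $\gamma_t$, $\theta_t$ (per layer) and α (shared) would be exposed to the optimizer in a trainable implementation.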
The computational complexity of the TPG-detector is O(mn) per iteration because no matrix inversion is required in each iteration. A TISTA-based MIMO detector [19] needs a matrix inversion in each iteration, which takes $O(m^3)$ time. In terms of time complexity, the TPG-detector is therefore more scalable for massive MIMO systems.
The numerical simulations show that the TPG-detector outperforms the state-of-the-art IW-SOAV (L = 1) by a large margin and achieves detection performance comparable to the IW-SOAV (L = 5). The TPG-detector can therefore be seen as a promising iterative detector for massive overloaded MIMO channels, providing an excellent balance between low computational cost and reasonable detection performance.

APPENDIX
BRIEF REVIEW OF IW-SOAV

Here, we give a brief review of the IW-SOAV detector. The IW-SOAV is an effective iterative detection algorithm for massive overloaded MIMO systems proposed in [6], [7]. It is based on a variant of the Douglas-Rachford algorithm [25], which solves the following weighted SOAV (W-SOAV) optimization problem:
$$\hat{z} = \arg\min_{z \in \mathbb{R}^{2n}} \sum_{j=1}^{2n} \left( w_j^+ |z_j - 1| + w_j^- |z_j + 1| \right) + \frac{\alpha}{2} \| y - Hz \|_2^2, \qquad (15)$$
where $z_j$ $(j = 1, \ldots, 2n)$ is the jth element of z and α (> 0) is a constant. Here, we assume that each symbol $x_j$ in the transmitted signal x is an independent random variable which takes 1 w.p. $w_j^+$ and −1 w.p. $w_j^- = 1 - w_j^+$. The IW-SOAV repeats the following procedures: (i) estimation of $w_j^+$ based on the detected signal and (ii) detection of the transmitted signal by solving the W-SOAV optimization (15). The IW-SOAV is thus a double-loop algorithm.
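The W-SOAV cost described above can be evaluated directly. The following NumPy sketch assumes the objective is the weighted sum of absolute deviations from the candidate symbols ±1 plus a quadratic data-fidelity term weighted by α/2, as in [6]; the function name is illustrative and the exact form should be checked against the W-SOAV problem statement.

```python
import numpy as np

def w_soav_objective(z, y, H, w_plus, alpha):
    """Weighted SOAV objective (sketch, reconstructed from [6]):
    sum_j ( w_j^+ |z_j - 1| + w_j^- |z_j + 1| ) + (alpha/2) ||y - H z||^2,
    with w_minus = 1 - w_plus."""
    w_minus = 1.0 - w_plus
    soav = np.sum(w_plus * np.abs(z - 1.0) + w_minus * np.abs(z + 1.0))
    fidelity = 0.5 * alpha * np.sum((y - H @ z) ** 2)
    return soav + fidelity
```

With perfect prior weights ($w_j^+ = 1$ exactly where $x_j = 1$) and noiseless observations, the objective vanishes at the true signal, which illustrates why sharpening the weights in the outer loop improves detection.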
In the outer loop corresponding to procedure (i), the algorithm approximately estimates the weight $w_j^+$ of each transmitted symbol. The estimation is based on the approximate log likelihood ratio, which reads
$$\hat{\Lambda}_j = \frac{2}{\sigma^2} \sum_{i=1}^{2m} h_{i,j} \left( \hat{y}_i + h_{i,j} \hat{s}'_j \right),$$
where $h_{i,j}$ is the (i, j) element of the matrix H and $\hat{s}'$ represents a clipped signal of $\hat{s}$, i.e., $\hat{s}'_j$ $(j = 1, \ldots, 2n)$ takes −1 if $\hat{s}_j < -1$, 1 if $\hat{s}_j > 1$, and $\hat{s}_j$ otherwise. In addition, we define
$$\hat{y}_i = y_i - \sum_{j=1}^{2n} h_{i,j} \hat{s}'_j$$
for $i = 1, \ldots, 2m$. Then, the weight $w_j^+$ is calculated by
$$w_j^+ = \frac{1}{1 + e^{-\hat{\Lambda}_j}}.$$
In the inner loop corresponding to procedure (ii), the algorithm solves the W-SOAV optimization problem with an iterative process defined by the following recursive formulas:
$$z_{t+1} = \left( I + \gamma \alpha H^{\mathsf{T}} H \right)^{-1} \left( r_t + \gamma \alpha H^{\mathsf{T}} y \right),$$
$$r_{t+1} = r_t + \theta_t \left( \phi_\gamma (2 z_{t+1} - r_t) - z_{t+1} \right),$$
where $t (= 0, 1, \ldots, K_{\mathrm{itr}})$ is the index of the iteration step, $\theta_t \in [\epsilon, 2-\epsilon]$ is a constant, and $\phi_\gamma : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ is a component-wise function whose jth element $[\phi_\gamma(z)]_j$ is defined by
$$[\phi_\gamma(z)]_j = \begin{cases} z_j + \gamma & (z_j < -1-\gamma) \\ -1 & (-1-\gamma \le z_j < -1-\gamma d_j) \\ z_j + \gamma d_j & (-1-\gamma d_j \le z_j \le 1-\gamma d_j) \\ 1 & (1-\gamma d_j < z_j \le 1+\gamma) \\ z_j - \gamma & (z_j > 1+\gamma) \end{cases}$$
with $d_j = w_j^+ - w_j^-$. The parameters γ > 0, $\epsilon \in (0, 1)$, and the initial value $r_0 \in \mathbb{R}^{2n}$ can be set arbitrarily. In this W-SOAV optimizer, the transmitted symbol is detected as $\hat{x} = z_{K_{\mathrm{itr}}+1}$ after $K_{\mathrm{itr}}$ iteration steps. The IW-SOAV starts with $\hat{s} = 0$ and repeats L outer loops, each with $K_{\mathrm{itr}}$ inner loops. When all loops are finished, the sign function sgn(·) is applied to the output $\hat{x}$ in an element-wise manner. The parameter α has to be fixed appropriately depending on the SNR. In the numerical experiments in Section V, we used $r_0 = 0$, $\epsilon = 0$, γ = 1, and $\theta_t = 1.9$ $(t = 1, \ldots, K_{\mathrm{itr}})$, and chose α as in [6].
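The inner-loop machinery above can be sketched in NumPy. The five-branch piecewise-linear map below is the proximity operator of the weighted absolute-value term, and the recursion is the Douglas-Rachford update with the quadratic-term proximity computed through a precomputable matrix inverse; both are reconstructed from [6] and should be checked against the original definitions before reuse.

```python
import numpy as np

def phi_gamma(u, w_plus, gamma):
    """Component-wise prox of sum_j (w_j^+ |z_j-1| + w_j^- |z_j+1|) (sketch).
    Uses w_j^+ + w_j^- = 1 and d_j = w_j^+ - w_j^-."""
    d = 2.0 * w_plus - 1.0
    out = np.empty_like(u)
    for j, (uj, dj) in enumerate(zip(u, d)):
        if uj < -1.0 - gamma:             # outer-left branch
            out[j] = uj + gamma
        elif uj < -1.0 - gamma * dj:      # clipped at the symbol -1
            out[j] = -1.0
        elif uj <= 1.0 - gamma * dj:      # middle branch, shifted by gamma*d_j
            out[j] = uj + gamma * dj
        elif uj <= 1.0 + gamma:           # clipped at the symbol +1
            out[j] = 1.0
        else:                             # outer-right branch
            out[j] = uj - gamma
    return out

def w_soav_dr(y, H, w_plus, alpha, gamma=1.0, theta=1.9, K=200):
    """Douglas-Rachford inner loop for the W-SOAV problem (sketch).
    The matrix inverse is computed once, before the iterations start."""
    two_n = H.shape[1]
    A = np.linalg.inv(np.eye(two_n) + gamma * alpha * H.T @ H)
    b = gamma * alpha * H.T @ y
    r = np.zeros(two_n)                   # r_0 = 0, as in the experiments
    for _ in range(K):
        z = A @ (r + b)                   # prox of the quadratic fidelity term
        r = r + theta * (phi_gamma(2.0 * z - r, w_plus, gamma) - z)
    return A @ (r + b)                    # final estimate before sgn(.)
```

Note how the per-iteration cost is dominated by matrix-vector products, consistent with the O(mn) cost per inner iteration stated below.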
The computational cost of each iteration of the IW-SOAV is O(mn). Although the algorithm contains a matrix inversion, which takes $O(m^3)$ computational steps, it can be computed in advance. Since the total number of inner and outer loop iterations is $K_{\mathrm{itr}} L$, the total computational cost of the IW-SOAV is $O(K_{\mathrm{itr}} L m n)$.