Deep Learning for CSI Feedback Based on Superimposed Coding

Massive multiple-input multiple-output (MIMO) with frequency division duplex (FDD) mode is a promising approach to increasing system capacity and link robustness for fifth generation (5G) wireless cellular systems. These advantages rely on accurate downlink channel state information (CSI) fed back from the user equipment. However, conventional feedback methods have difficulty reducing the feedback overhead because of the large number of base station (BS) antennas in massive MIMO systems. Recently, deep learning (DL)-based CSI feedback has overcome many of these difficulties, yet it still does not sufficiently reduce the occupation of uplink bandwidth resources. In this paper, to solve this issue, we combine DL and superimposed coding (SC) for CSI feedback, in which the downlink CSI is spread and then superimposed on uplink user data sequences (UL-US) toward the BS. Then, a multi-task neural network (NN) architecture is proposed at the BS to recover the downlink CSI and UL-US by unfolding two iterations of the minimum mean-squared error (MMSE) criterion-based interference reduction. In addition, for network training, a subnet-by-subnet approach is exploited to facilitate parameter tuning and expedite convergence. Compared with a standalone SC-based CSI scheme, our multi-task NN, trained at a single signal-to-noise ratio (SNR) and power proportional coefficient (PPC), consistently improves the estimation of downlink CSI with similar or better UL-US detection under varying SNR and PPC.


I. INTRODUCTION
As one of the key technologies in the fifth generation (5G) wireless communication system, massive multiple-input multiple-output (MIMO) has attracted growing research interest [1]. In massive MIMO systems, hundreds of antenna elements are deployed at the base station (BS). Combined with a pre-coding scheme, such as minimum mean-squared error (MMSE), these antennas provide an effective way to exploit the spatial degrees of freedom, which significantly enhances system performance, e.g., system capacity, energy efficiency, and link robustness [2]-[8].
In massive MIMO systems, accurate channel state information (CSI) is required by BSs for downlink beamforming and user selection [9]. In the time division duplex (TDD) mode, the downlink CSI can be estimated from the uplink channel owing to channel reciprocity [10]. However, in the frequency division duplex (FDD) mode, the reciprocity-based CSI is not available. Thus, the downlink CSI should be estimated by users and fed back to the BS. This CSI feedback incurs significant overhead in massive MIMO systems due to the large number of antennas. Since FDD mode is pervasively deployed for delay-sensitive and traffic-symmetric applications, it is of great importance to reduce the CSI feedback overhead in FDD mode.
Codebook-based CSI feedback has been widely applied [11]. In FDD massive MIMO systems, however, the large number of antennas requires a correspondingly expanded codebook to guarantee acceptable CSI accuracy [12]. Subject to the curse of dimensionality, the overhead of codebook-based feedback becomes substantial for massive MIMO systems [13]-[15]. To address these problems, compressive sensing (CS)-based CSI feedback approaches have been proposed to reduce the channel dimension by exploiting the sparse structures of CSI [12], [14]-[16] (e.g., the temporal correlation of CSI [12], the spatial correlation of CSI [14]-[16], and a sparsity-enhancing basis for CSI [14]). It is well known that the sparsity of CSI holds only approximately and only for specific models [3], [4]; beyond these models, the general assumption of channel sparsity cannot be guaranteed. Thus, existing CS-based algorithms may have practical issues in case of model mismatch.
Recently, deep learning (DL)-based physical-layer techniques have shown promising prospects in wireless communication systems [3]-[9], [17]-[21], and comprehensive overviews can be found in [18]-[20]. Compared with CS-based CSI feedback, DL-based methods (e.g., [3] and [4]) outperform many existing CS schemes in feedback reduction. Despite all this, an efficient DL-based CSI feedback that further reduces the occupation of the uplink bandwidth resource is still highly desired.

A. RELATED WORKS
The literature on DL-based CSI feedback for FDD massive MIMO systems mainly concentrates on feedback reduction [3]-[6]. In [3], a deep neural network (DNN) called CsiNet was developed for CSI feedback. CsiNet is based on a DNN autoencoder, where the encoder learns to compress the original channel matrices into codewords and the decoder learns the inverse transformation from the compressed codewords through training data. Compared to the CS-based algorithms, CsiNet was more effective in reducing the CSI dimensionality. However, the CSI is reconstructed independently in CsiNet, which ignores time correlation and makes it unsuitable for practical application in time-varying channels. To remedy this defect, a CsiNet-long short-term memory (CsiNet-LSTM) network was proposed in [4] to enhance the recovery quality of CSI by learning the spatial structures and time correlation of time-varying massive MIMO channels. However, the investigation in [5] indicated that both [3] and [4] (i.e., CsiNet and CsiNet-LSTM) are not sufficient for tracking the temporal correlations because they employ linear fully-connected networks (FCNs) for CSI compression. By incorporating an LSTM module and an FCN in a neural network (NN) architecture, recurrent compression and uncompression modules were formed in [5] to effectively capture the temporal and frequency correlations of wireless channels. Considering feedback error and feedback delay, a deep autoencoder-based CSI feedback was proposed in [6]. Although the DL-based CSI feedback methods in [3]-[6] exhibit excellent performance in feedback reduction, the uplink bandwidth resources are still occupied to some extent.
Without any occupation of uplink bandwidth resources, [7] and [8] estimated the downlink CSI from the uplink CSI by using a DL approach. In [7], the core idea was that, since the uplink and downlink channels share the same propagation environment, the environment information extracted from the uplink channel response could be applied to the downlink channel. Similar to [7], an NN-based scheme for extrapolating downlink CSI from observed uplink CSI was proposed in [8], where the underlying physical relation between the downlink and uplink frequency bands was exploited to construct the learning architecture. It should be noted that the method in [7] usually needs to retrain the NN when the environment information changes significantly. For example, for well-trained equipment, the environment information extracted from one city (e.g., the shapes of buildings, streets, and mountains, and the materials that objects are made of) would no longer be applicable to another. The method in [8] suffers from poor CSI recovery performance when the interval between the downlink and uplink frequency bands is wide.
Besides the DL-based CSI feedback approaches, superimposed coding (SC), which is similar to the non-orthogonal multiple access scheme [21], has also been proposed for CSI feedback to avoid the occupation of uplink bandwidth resources [22]. This is accomplished by spreading the downlink CSI and superimposing it on the uplink user data sequences (UL-US) for feedback to the BS [22]. Still, this method is challenged by the difficulty of cancelling the interference between the CSI and the UL-US.
As a whole, the DL-based and SC-based CSI feedback methods still face significant challenges, which can be summarized as follows:
• Concentrating on feedback reduction, the DL-based CSI feedback methods, e.g., the methods in [3]-[6], inevitably occupy uplink bandwidth resources.
• Although the occupation of uplink bandwidth resources can be avoided, the methods that estimate downlink CSI from uplink CSI in [7] and [8] are usually of limited use in mobile or wide frequency-band interval environments.
• The SC-based CSI feedback [22] can also avoid the occupation of uplink bandwidth resources, but it faces the challenge of cancelling the interference between the downlink CSI and the UL-US, for which previous works lack good solutions.
Motivated by DL-based CSI feedback methods, we combine the DL and SC techniques for CSI feedback to overcome the challenges mentioned above.

B. CONTRIBUTIONS
In this paper, we combine the DL and SC techniques for CSI feedback. The main contributions of our work are summarized as follows:
• The SC-based CSI feedback (e.g., [22]) is introduced in the user equipment. Therefore, the occupation of uplink bandwidth resources is avoided, which is different from the DL-based methods in [3]-[6]. In particular, the DL-based methods that estimate downlink CSI from uplink CSI in [7] and [8] are not adopted, so that the scheme remains applicable in mobile or wide frequency-band interval environments.
• A multi-layer NN (i.e., a DNN) is constructed at the BS by using the unfolding idea from [23]-[25]. Compared to the SC-based CSI feedback [22] with perfectly known noise variance, this multi-layer NN improves the performance of downlink CSI recovery without an obvious change in the bit error rate (BER) of the UL-US. Note that the iterative algorithm based on the minimum mean-squared error (MMSE) criterion in [22] requires knowledge of the noise variance, whereas our unfolded iteration works well without any knowledge of the link noise. That is, because estimation errors of the noise variance are inevitable in practice, both the recovery of the downlink CSI and the BER of the UL-US are actually improved compared to the SC-based CSI feedback in [22].
• A subnet-by-subnet method, inspired by the layer-by-layer training in [26], is exploited to train the designed DNN. This method facilitates parameter tuning and expedites convergence.
The remainder of this paper is structured as follows. In Section II, we present the SC-based CSI feedback to formulate a learning problem. The proposed method, i.e., deep learning for CSI feedback, is presented in Section III, and the numerical results are given in Section IV. Finally, Section V concludes our work.
Notations: Boldface letters are used to denote matrices and column vectors; $(\cdot)^T$, $(\cdot)^H$, $(\cdot)^\dagger$, and $E\{\cdot\}$ denote the transpose, conjugate transpose, matrix pseudo-inverse, and statistical expectation, respectively; $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary parts of a complex number, complex vector, or complex matrix; $\mathbf{I}_P$ is the identity matrix of size $P \times P$; $\mathrm{BN}(\cdot)$ denotes the operation of batch normalization; $\|\cdot\|_2$ is the Euclidean norm; and $\mathbf{0}$ is the matrix or vector with all zero elements.

II. PROBLEM FORMULATION
In this section, the SC-based CSI feedback is first elaborated in II-A, and an SC-baseline is formed for ease of comparison and description. Then, in II-B, based on this baseline, we formulate a multi-task learning problem for SC-based CSI feedback.

A. SC-BASED CSI FEEDBACK
In [22], the MIMO system consists of a BS with $N$ antennas and $U$ single-antenna users. The transmitted signal $\mathbf{X}_u$ of user-$u$, $u = 1, 2, \cdots, U$, is given in (1), where $\rho \in [0, 1]$ stands for the power proportional coefficient (PPC). For each user-$u$, $E_u$ represents the transmitting power; $\mathbf{H}_u$ is the $1 \times N$ downlink CSI from the BS to user-$u$, whose elements are independent and identically distributed (i.i.d.) complex Gaussian variables with zero mean and variance $1/N$; $\mathbf{P}_u \in \mathbb{R}^{M \times N}$ is a spreading matrix satisfying $\mathbf{P}_u^T \mathbf{P}_u = M \mathbf{I}_N$; $\mathbf{D}_u \in \mathbb{C}^{1 \times M}$ denotes the UL-US; and $M$ is the frame length (or UL-US length).
The received signal at the BS from user-$u$, denoted as $\mathbf{r}_u$, is given in (2) [22], where $\mathbf{r}_u$ is the $N \times M$ signal block captured from the $N$ BS antennas; $\mathbf{G}_u \in \mathbb{C}^{N \times 1}$ is the uplink channel vector, i.e., the uplink CSI; and the feedback link noise is denoted by $\mathbf{N}_u$, which is an $N \times M$ complex matrix. Each element of $\mathbf{N}_u$ is modeled as i.i.d. complex additive white Gaussian noise (AWGN) with zero mean and variance $\sigma_u^2$. Assuming that perfect synchronization, perfect uplink channel estimation (i.e., $\mathbf{G}_u$ is known), and perfect noise variance estimation (i.e., $\sigma_u^2$ is known) are available at the BS, we form an "SC-baseline" for DL-based CSI feedback. Referring to [22], the iteration procedure of the SC-baseline, which recovers the downlink CSI and UL-US on the basis of the MMSE criterion, is as follows:
1) Initialize the iteration index $k$.
2) Despread the received signal, and then estimate the downlink CSI according to the MMSE criterion, i.e., (3).
3) Eliminate the interference of the downlink CSI, i.e., (4).
4) Detect the UL-US according to the MMSE criterion, i.e., (5).
5) Cancel the interference of the UL-US, i.e., (6).
6) Set $k = k + 1$ and return to step 2) if $k$ is within the iteration limit.
It should be noted that maximum likelihood detection of the UL-US and maximum likelihood estimation of the downlink CSI are impractical as a comparison baseline because of their extremely high computational complexity in a massive MIMO system. Therefore, the MMSE criterion is considered here for the SC-baseline. After several iterations, the MMSE estimation of the downlink CSI and the MMSE detection of the UL-US converge.
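For orientation, one form of the transmitted and received signals that is consistent with the dimensions stated in this subsection can be written as below. This is only a sketch: the power weights $\sqrt{\rho E_u / N}$ and $\sqrt{(1-\rho) E_u}$ are an assumption made for illustration (the text only states that $\rho$ is the PPC and $E_u$ the transmitting power), and the exact normalization used in [22] may differ.

```latex
\begin{align*}
% Assumed form of the superimposed transmit signal (1 x M):
\mathbf{X}_u &= \sqrt{\rho E_u / N}\,\mathbf{H}_u \mathbf{P}_u^{T}
              + \sqrt{(1-\rho) E_u}\,\mathbf{D}_u , \\
% Assumed form of the received block (N x M) implied by the stated dimensions:
\mathbf{r}_u &= \mathbf{G}_u \mathbf{X}_u + \mathbf{N}_u .
\end{align*}
```

Under this assumed form, $\rho = 0$ sends only the UL-US and $\rho = 1$ sends only the spread CSI, which matches the role of $\rho$ as a power split between the two components.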

B. LEARNING TASK FOR SC-BASED CSI FEEDBACK
To further improve the SC-based CSI feedback, we combine DL and SC for CSI feedback by exploiting the advantages of both techniques. The whole system model is given in Fig. 1. For user-$u$, the downlink CSI (i.e., $\mathbf{H}_u$) is first spread. Then the weighted downlink CSI and UL-US are superimposed to form the signal $\mathbf{X}_u$, as given in (1). After passing through the uplink channel $\mathbf{G}_u$ and being corrupted by the link noise $\mathbf{N}_u$, the transmitted $\mathbf{X}_u$ from user-$u$ is received at the BS. After radio frequency (RF) front-end processing, the received signal $\mathbf{r}_u$ is expressed in (2). With the received signal $\mathbf{r}_u$, the main task of the BS is to recover the downlink CSI and detect the UL-US by using the DL technique.
Similar to the assumption of [22] and [24], the uplink channel $\mathbf{G}_u$ (i.e., the uplink CSI) is known to the BS in advance. In [24], the knowledge of CSI is used to form a maximum likelihood optimization for the DL-based MIMO detection problem. However, the complicated NN architecture (e.g., 30 layers in [24]), long training time (e.g., 3 days in [24]), and difficult parameter tuning make it hard to apply in different scenarios. Besides the detection of the UL-US (i.e., $\mathbf{D}_u$), the estimation of the downlink CSI (i.e., $\mathbf{H}_u$) is also needed at the BS. This is a typical multi-task problem in NNs [27], which is more difficult than single-task detection (e.g., [24]). Therefore, to reduce implementation complexity, a multi-task NN architecture is constructed by unfolding the iterations of the SC-baseline under the MMSE criterion. Naturally, other baselines and the corresponding NN architectures formed by the same approach can also be considered, which will not affect the fairness of the comparison.
Although the known uplink CSI $\mathbf{G}_u$ is exploited in the SC-baseline under the MMSE criterion, we aim to develop a multi-task NN that does not take the uplink CSI as input but still outperforms the SC-baseline. Thus, a coarse estimation of $\mathbf{X}_u$ is employed so that the NN does not need the explicit uplink CSI $\mathbf{G}_u$. In this way, the NN architecture is simplified, which accelerates network convergence. Then, the estimated $\tilde{\mathbf{X}}_u$ passes through a multi-layer NN (i.e., a DNN) to solve the multi-task problem, i.e., to recover the downlink CSI (denoted as $\hat{\mathbf{H}}_u$) and to detect the UL-US (denoted as $\hat{\mathbf{D}}_u$). This is elaborated in the next section.

III. DEEP LEARNING FOR CSI FEEDBACK
In traditional SC-based CSI feedback [22], the main task of the BS is to recover the downlink CSI and detect the UL-US. This is also the main task at the BS in our proposed DL-based solution. As described in II-B, a coarse estimation is employed to simplify the designed DNN and accelerate its convergence. In this section, the coarse estimation is described first, followed by our multi-layer NN design, in which the downlink CSI recovery and UL-US detection are addressed by solving a multi-task problem.

A. COARSE ESTIMATION
The benefit of the coarse estimation is that it removes the effect of the uplink channel. When the uplink CSI is not used as a network input, the NN architecture can be simplified, which improves the convergence rate of offline training. According to the received signal $\mathbf{r}_u$ at the BS, the coarse estimate $\tilde{\mathbf{X}}_u$ is given in (7). Then, the estimated $\tilde{\mathbf{X}}_u$ is delivered to a multi-layer NN, which solves the multi-task problem described in the next subsection.
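As an illustration of how such a coarse estimate could be computed, the following minimal sketch applies the pseudo-inverse of the uplink channel vector to the received block. It assumes the linear model r_u = G_u X_u + N_u and a pseudo-inverse-based estimator (the notation section defines the pseudo-inverse, and the dimensions match); the function name `coarse_estimate` and the toy sizes are hypothetical.

```python
import numpy as np

def coarse_estimate(r_u, g_u):
    """Estimate the 1 x M transmitted block from the N x M received block.

    Assumption: r_u = g_u @ x_u + noise, with g_u of shape (N, 1).
    """
    return np.linalg.pinv(g_u) @ r_u  # (1, N) @ (N, M) -> (1, M)

rng = np.random.default_rng(0)
N, M = 8, 16  # toy sizes, much smaller than in the simulations
g_u = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) / np.sqrt(2 * N)
x_u = rng.standard_normal((1, M)) + 1j * rng.standard_normal((1, M))
r_u = g_u @ x_u                       # noise-free received block
x_tilde = coarse_estimate(r_u, g_u)   # equals x_u when there is no noise
```

In the noise-free case the estimate reproduces the transmitted block, so the uplink channel no longer appears in the input passed to the network.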

B. MULTI-TASK DL NETWORK
To solve our multi-task problem (i.e., to recover the downlink CSI $\mathbf{H}_u$ and to detect the UL-US $\mathbf{D}_u$), a multi-layer NN is constructed by unfolding the iterations of the SC-baseline in II-A. In [22], simulations show that the SC-based feedback algorithm nearly converges within three iterations. In our design and experiments, we observed that unfolding two iterations is enough: unfolding more iterations does not significantly improve the recovery of the downlink CSI and UL-US but merely increases the complexity of the NN. Thus, unless otherwise stated, the unfolding in the rest of this paper is applied to a two-iteration SC-baseline, which forms a four-subnet NN. It should be noted that this subnet structure can readily be extended to three or more unfolded iterations. The designed multi-layer NN is illustrated in Fig. 2.

1) NETWORK FUNCTION SUMMARY
For ease of description, we denote the four subnets as CSI-NET1, DET-NET1, CSI-NET2, and DET-NET2, respectively. The functionality of the network components is summarized as follows:
• CSI-NET$i$ corresponds to the MMSE estimation of the downlink CSI (i.e., (3) in the SC-baseline), where $i = 1, 2$ denotes the first and second iteration, respectively.
• DET-NET1 and DET-NET2 detect the UL-US (i.e., (5) in the SC-baseline) in the first and second iteration, respectively.
• Some known parameters and the iteration procedure, corresponding to (4) and (6) in the SC-baseline, are exploited as expert knowledge to implement interference reduction. In addition, this expert knowledge is also utilized to improve network performance, e.g., to accelerate convergence [28].

2) NETWORK ARCHITECTURE
In Fig. 2, each of the four subnets consists of an input layer, a hidden layer, and an output layer in a fully connected (FC) mode. These subnets look straightforward, but they are very conducive to the parameter tuning in III-C. The architecture is given as follows:
• CSI-NET1, DET-NET1, CSI-NET2, and DET-NET2 are successively cascaded to form a multi-task network. In addition, some expert knowledge is inserted between two cascaded subnets to implement interference reduction.
• For CSI-NET1 or CSI-NET2 (DET-NET1 or DET-NET2), the numbers of neurons in the input layer, hidden layer, and output layer are $2N$ ($2M$), $16N$ ($16M$), and $2N$ ($2M$), respectively.
• For each subnet, batch normalization (BN), which is used to accelerate convergence and prevent overfitting [29], is employed to normalize the input layer and the hidden layer, so that the inputs of these layers have zero mean and unit variance.
• For each subnet, the hidden layer adopts the activation function "swish", defined as $\mathrm{swish}(x) = x/(1 + e^{-x})$, which usually performs well [30], [31]. Linear activation is employed for all other layers.
• The outputs of CSI-NET2 and DET-NET2 are the estimated downlink CSI $\hat{\mathbf{H}}_u$ and the detected UL-US $\hat{\mathbf{D}}_u$, respectively.
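The subnet structure listed above can be sketched as a small forward pass. This is an illustrative sketch rather than the authors' implementation: the exact placement of BN relative to the FC layers, the absence of learned BN scale and shift, and the row-vector convention (samples as rows, so weights are stored as inputs x outputs) are assumptions, and `subnet_forward` is a hypothetical name.

```python
import numpy as np

def swish(x):
    """swish(x) = x / (1 + exp(-x))."""
    return x / (1.0 + np.exp(-x))

def batch_norm(x, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch axis."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def subnet_forward(x, w1, b1, w2, b2):
    """Map a (batch, 2K) input to a (batch, 2K) output through a 16K-unit hidden layer."""
    hidden = swish(batch_norm(x) @ w1 + b1)   # BN on the input, swish hidden layer
    return batch_norm(hidden) @ w2 + b2       # BN on the hidden output, linear output layer

rng = np.random.default_rng(0)
n, batch = 4, 32                               # toy size: 2N = 8 inputs, 16N = 64 hidden units
w1 = rng.normal(0.0, 0.1, size=(2 * n, 16 * n))
w2 = rng.normal(0.0, 0.1, size=(16 * n, 2 * n))
out = subnet_forward(rng.standard_normal((batch, 2 * n)),
                     w1, np.zeros(16 * n), w2, np.zeros(2 * n))
```

The same function serves a DET-NET subnet by replacing N with M in the layer sizes.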

3) NETWORK PROCESSING
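• Data Preprocessing: In the common framework of machine learning, the data set has to be real valued. However, signals in wireless systems are complex valued. Thus, to make the NN architecture in Fig. 2 work, data preprocessing is applied first. The complex vectors of the downlink CSI $\mathbf{H}_u \in \mathbb{C}^{1 \times N}$, the UL-US $\mathbf{D}_u \in \mathbb{C}^{1 \times M}$, and the estimated $\tilde{\mathbf{X}}_u \in \mathbb{C}^{1 \times M}$ (see the coarse estimation in III-A) are reshaped as the real-valued vectors $\bar{\mathbf{H}}_u \in \mathbb{R}^{2N \times 1}$, $\bar{\mathbf{D}}_u \in \mathbb{R}^{2M \times 1}$, and $\bar{\mathbf{X}}_u \in \mathbb{R}^{2M \times 1}$, respectively, as given in (8)-(10). To match the real-valued vector operations, we also transform the spreading matrix $\mathbf{P}_u \in \mathbb{R}^{M \times N}$ into $\bar{\mathbf{P}}_u$ according to (11). Then, the reshaped real-valued vector $\bar{\mathbf{X}}_u$ is used as the input of the process in TABLE 1, whose outputs are $\hat{\mathbf{H}}_u = \hat{\mathbf{H}}_u^{(2)}$ and $\hat{\mathbf{D}}_u = \hat{\mathbf{D}}_u^{(2)}$.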
• Processing Procedure: The procedure of the proposed NN is given in TABLE 1, and some steps are explained as follows. For convenience, we use $\mathbf{W}_{X1}$ ($\mathbf{b}_{X1}$) to denote the weight matrix (bias vector) of the hidden layer and $\mathbf{W}_{X2}$ ($\mathbf{b}_{X2}$) to denote that of the output layer, where $X = \mathrm{C}i$ or $X = \mathrm{D}i$ represents CSI-NET$i$ or DET-NET$i$, $i = 1, 2$, respectively.
Despreading: With the mapped real-valued vector $\bar{\mathbf{X}}_u$, a despreading operation (see (0-1) in TABLE 1) is employed to reduce the UL-US interference. The despreading at the BS is expressed in (12), where $\bar{\mathbf{P}}_u^T$ is obtained by transforming $\mathbf{P}_u$ according to (11). This step corresponds to the despreading in (3).
Estimation of downlink CSI: Steps (1-1) and (2-1) in TABLE 1 estimate the downlink CSI with CSI-NET1 and CSI-NET2, respectively. These estimates, denoted as $\hat{\mathbf{H}}_u^{(i)}$, are given in (13). The operations in (13) correspond to the MMSE estimation of the downlink CSI in the $i$th iteration of (3).
Reduction of downlink CSI interference: Steps (1-2) and (2-2) in TABLE 1 reduce the downlink CSI interference. According to $\hat{\mathbf{H}}_u^{(i)}$, $\bar{\mathbf{X}}_u$, and the expert knowledge, the interference reduction is given in (14), where the known $\bar{\mathbf{P}}_u$, $E_u$, $\rho$, $N$, and the structure of the interference reduction are viewed as expert knowledge. These interference reductions correspond to the $i$th iteration of (4).

Detection of UL-US: The UL-US detections are given in steps (1-3) and (2-3) based on DET-NET1 and DET-NET2, respectively. The detection is expressed in (15) and corresponds to the MMSE detection of the UL-US in the $i$th iteration of (5).
UL-US interference reduction: In TABLE 1, step (1-4) reduces the UL-US interference, where $E_u$, $\rho$, and the structure of the interference reduction are known as expert knowledge. This step corresponds to the interference reduction in (6). At the end of our multi-task network, $\hat{\mathbf{H}}_u = \hat{\mathbf{H}}_u^{(2)}$ and $\hat{\mathbf{D}}_u = \hat{\mathbf{D}}_u^{(2)}$, i.e., the outputs of CSI-NET2 and DET-NET2, are the final outputs of the downlink CSI estimation and UL-US detection, respectively.
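The preprocessing and processing steps described above can be sketched end to end as follows. This is a hedged illustration, not the procedure of TABLE 1 itself: the stacking order [Re; Im], the block-diagonal form of the real-valued spreading matrix, the power weights a and b, and the despreading scale 1/(M a) are all assumptions, and the subnets are passed in as plain callables. The names `to_real`, `real_spreading`, and `unfolded_forward` are hypothetical.

```python
import numpy as np

def to_real(v):
    """Stack real and imaginary parts of a 1 x K complex row into a 2K x 1 real column."""
    v = np.asarray(v).reshape(-1)
    return np.concatenate([v.real, v.imag]).reshape(-1, 1)

def real_spreading(p):
    """Assumed real-valued counterpart of a real M x N spreading matrix: block-diag (2M x 2N)."""
    return np.kron(np.eye(2), p)

def unfolded_forward(x_bar, p_bar, e_u, rho, n, csi_nets, det_nets):
    """Cascade CSI/DET subnets with interference reduction between them."""
    m = p_bar.shape[0] // 2
    a = np.sqrt(rho * e_u / n)          # assumed weight of the spread CSI
    b = np.sqrt((1.0 - rho) * e_u)      # assumed weight of the UL-US

    def despread(y):
        return p_bar.T @ y / (m * a)    # uses P^T P = M I

    h_tilde = despread(x_bar)                          # step (0-1)
    last = len(csi_nets) - 1
    for i, (csi_net, det_net) in enumerate(zip(csi_nets, det_nets)):
        h_hat = csi_net(h_tilde)                       # steps (1-1), (2-1)
        d_tilde = (x_bar - a * p_bar @ h_hat) / b      # steps (1-2), (2-2)
        d_hat = det_net(d_tilde)                       # steps (1-3), (2-3)
        if i < last:
            h_tilde = despread(x_bar - b * d_hat)      # step (1-4)
    return h_hat, d_hat

# Noise-free toy example with stand-in subnets (identity CSI net, oracle detector).
rng = np.random.default_rng(1)
n, m, e_u, rho = 4, 8, 1.0, 0.2
h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
p = np.kron(np.kron(h2, h2), h2)[:, :n]                # 8 x 4 Walsh-type codes, p.T @ p = 8 I
p_bar = real_spreading(p)
h = (rng.standard_normal((1, n)) + 1j * rng.standard_normal((1, n))) / np.sqrt(2 * n)
d = (rng.choice([-1.0, 1.0], (1, m)) + 1j * rng.choice([-1.0, 1.0], (1, m))) / np.sqrt(2)
x = np.sqrt(rho * e_u / n) * h @ p.T + np.sqrt((1 - rho) * e_u) * d
h_hat, d_hat = unfolded_forward(to_real(x), p_bar, e_u, rho, n,
                                [lambda t: t] * 2, [lambda t: to_real(d)] * 2)
```

With an oracle detector and no noise, the UL-US interference is removed exactly after the first iteration, so the second CSI estimate equals the true real-valued CSI. Trained subnets would replace the two stand-in callables.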

C. MODEL TRAINING SPECIFICATION
Training a multi-task deep network is usually challenged by vanishing gradient, initialization sensitivity, activation saturation, and model over-fitting [24], [32], [33], [34], etc. To overcome these challenges, the common method is to solve an optimization problem by using the gradients of each task to update the shared parameters [33]. However, the task imbalances impede proper training [34], and result in enormous difficulties for parameter tuning.

1) SUBNET-BY-SUBNET TRAINING
To address the challenge of parameter tuning, we adopt a subnet-by-subnet training pattern inspired by the layer-by-layer training in [26]. Specifically, CSI-NET1 is first trained independently until it converges. Then the weight matrices and bias vectors of CSI-NET1 are fixed, and the subsequent subnets, i.e., DET-NET1, CSI-NET2, and DET-NET2, are trained in sequence in the same manner. The detailed training procedure is given in TABLE 2.
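TABLE 2 (training procedure):
1) Train CSI-NET1, and obtain the weight matrices ($\mathbf{W}_{C11}$ and $\mathbf{W}_{C12}$) and bias vectors ($\mathbf{b}_{C11}$ and $\mathbf{b}_{C12}$).
2) Keeping the parameters of CSI-NET1 unchanged, train DET-NET1, and obtain the weight matrices ($\mathbf{W}_{D11}$ and $\mathbf{W}_{D12}$) and bias vectors ($\mathbf{b}_{D11}$ and $\mathbf{b}_{D12}$).
3) Keeping the parameters of CSI-NET1 and DET-NET1 (up to $\mathbf{b}_{D12}$) unchanged, train CSI-NET2 to acquire the weight matrices ($\mathbf{W}_{C21}$ and $\mathbf{W}_{C22}$) and bias vectors ($\mathbf{b}_{C21}$ and $\mathbf{b}_{C22}$).
4) Keeping the parameters of the three trained subnets unchanged, train DET-NET2 to obtain the weight matrices ($\mathbf{W}_{D21}$ and $\mathbf{W}_{D22}$) and bias vectors ($\mathbf{b}_{D21}$ and $\mathbf{b}_{D22}$).
5) Save the trained parameters of all subnets for testing.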
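The freeze-then-train pattern of this procedure can be illustrated with a deliberately small toy. In the sketch below each "subnet" is replaced by a single linear map fitted by plain gradient descent on the MSE (a swapped-in simplification, not the Adam-trained subnets of this paper), and each stage is fed directly by the frozen output of the previous one, without the expert-knowledge steps in between. All names are hypothetical.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def train_stage(x, y, steps=500):
    """Fit y ~ x @ w by gradient descent on the MSE (toy stand-in for one subnet)."""
    w = np.zeros((x.shape[1], y.shape[1]))
    lr = x.shape[0] / (2.0 * np.linalg.norm(x, 2) ** 2)   # step size 1/L for this quadratic loss
    for _ in range(steps):
        w -= lr * 2.0 * x.T @ (x @ w - y) / x.shape[0]
    return w

def train_subnet_by_subnet(x, labels):
    """Train stages in order; earlier stages stay frozen and only provide inputs."""
    frozen = []
    stage_input = x
    for y in labels:
        w = train_stage(stage_input, y)   # only the current stage is updated
        frozen.append(w)
        stage_input = stage_input @ w     # frozen output feeds the next stage
    return frozen

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 4))
y1 = x @ rng.standard_normal((4, 3))      # labels for the first stage
y2 = y1 @ rng.standard_normal((3, 2))     # labels for the second stage
w_first, w_second = train_subnet_by_subnet(x, [y1, y2])
```

Because every stage solves its own small problem with its own labels, its hyper-parameters can be tuned in isolation, which is the practical benefit the subnet-by-subnet procedure aims for.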
In the following paragraphs, we first give the loss functions involved in training. Then, the initialization of the weight matrices and bias vectors is presented. Finally, we explain how the training data are prepared.

2) LOSS FUNCTIONS
To train each subnet, the criterion of minimizing the mean squared error (MSE) is used. In the loss function for CSI-NET$i$, $T_{1,i}$ is the total number of samples in the training set of CSI-NET$i$, and $\bar{\mathbf{H}}_u$ is the real representation of the complex vector $\mathbf{H}_u$ (see (8)). Similarly, in the loss function for DET-NET$i$, $T_{2,i}$ is the total number of samples in the training set of DET-NET$i$.
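For concreteness, a plain MSE loss consistent with this description could take the following form. This is an assumed form given for illustration only: the sample index $t$, the averaging over the training set without further normalization, and the use of the real-valued labels are assumptions.

```latex
\begin{align*}
% Assumed MSE loss for CSI-NETi over its T_{1,i} training samples:
\mathcal{L}_{\mathrm{C}i} &= \frac{1}{T_{1,i}} \sum_{t=1}^{T_{1,i}}
  \left\| \hat{\mathbf{H}}_{u,t}^{(i)} - \bar{\mathbf{H}}_{u,t} \right\|_2^2 , \\
% Assumed MSE loss for DET-NETi over its T_{2,i} training samples:
\mathcal{L}_{\mathrm{D}i} &= \frac{1}{T_{2,i}} \sum_{t=1}^{T_{2,i}}
  \left\| \hat{\mathbf{D}}_{u,t}^{(i)} - \bar{\mathbf{D}}_{u,t} \right\|_2^2 .
\end{align*}
```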

3) WEIGHT AND BIAS INITIALIZATION
Appropriate initialization can effectively avoid gradient exploding or vanishing problem [35]. Thus, the initialization of weight matrices and bias vectors should be carefully considered. In this paper, we initialize weight matrices on the basis of the method in [35].
For the training of CSI-NET$i$ ($i = 1, 2$), the elements of $\mathbf{W}_{Ci1}$ and $\mathbf{W}_{Ci2}$ are drawn i.i.d. from Gaussian distributions with zero mean and variances $1/(8N)$ and $1/N$, respectively. Similarly, for the training of DET-NET$i$, the elements of $\mathbf{W}_{Di1}$ and $\mathbf{W}_{Di2}$ are drawn i.i.d. from Gaussian distributions with zero mean and variances $1/(8M)$ and $1/M$, respectively. The elements of all bias vectors (i.e., $\mathbf{b}_{Ci1}$, $\mathbf{b}_{Ci2}$, $\mathbf{b}_{Di1}$, and $\mathbf{b}_{Di2}$) are initialized to zero.
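The initialization just described can be sketched as follows, using the variances stated above with K = N for a CSI-NET subnet or K = M for a DET-NET subnet. The function name `init_subnet` and the storage convention (weights as inputs x outputs, i.e., transposed relative to a column-vector notation) are assumptions.

```python
import numpy as np

def init_subnet(k, rng):
    """Initial parameters for a subnet with 2k inputs, 16k hidden units, and 2k outputs."""
    w1 = rng.normal(0.0, np.sqrt(1.0 / (8 * k)), size=(2 * k, 16 * k))  # hidden layer, variance 1/(8k)
    b1 = np.zeros(16 * k)
    w2 = rng.normal(0.0, np.sqrt(1.0 / k), size=(16 * k, 2 * k))        # output layer, variance 1/k
    b2 = np.zeros(2 * k)
    return w1, b1, w2, b2

k = 16
w1, b1, w2, b2 = init_subnet(k, np.random.default_rng(0))
```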

4) DATA PREPARATION FOR TRAINING
The training set is acquired by simulation, in which a large number of data samples are generated to train the DNN. Specifically, these data samples are generated as follows.
$\mathbf{P}_u$ consists of $N$ Walsh codes of length $M$, satisfying $\mathbf{P}_u^T \mathbf{P}_u = M \mathbf{I}_N$, and $\bar{\mathbf{P}}_u$ is obtained from $\mathbf{P}_u$ according to (11). $\mathbf{H}_u$ and $\mathbf{G}_u$ are randomly generated according to the distribution $\mathcal{CN}(\mathbf{0}, (1/N)\mathbf{I}_N)$. Then the complex-valued $\mathbf{H}_u$ is converted to the real-valued $\bar{\mathbf{H}}_u$ by using (8). The uplink and downlink channels (i.e., $\mathbf{G}_u$ and $\mathbf{H}_u$) are assumed to be constant during one frame but to vary from one frame to another [36], [37]. The elements of the link noise $\mathbf{N}_u$ follow the distribution $\mathcal{CN}(0, \sigma_u^2)$. $\{\mathbf{D}_u\}$ consists of quadrature phase-shift keying (QPSK) symbols generated by modulating a Bernoulli sequence $\{s_j\}$, which are then mapped to the real-valued $\bar{\mathbf{D}}_u$ according to (9). By using $\{\mathbf{H}_u\}$, $\{\mathbf{D}_u\}$, $\{\mathbf{G}_u\}$, and $\{\mathbf{N}_u\}$, we derive the training data set $\{\bar{\mathbf{X}}_u\}$ according to (1), (2), (7), and (10). The training labels for estimating the downlink CSI in CSI-NET1 and CSI-NET2 are set to $\bar{\mathbf{H}}_u$. To detect the UL-US, the labels used for training DET-NET1 and DET-NET2 are set to $\bar{\mathbf{D}}_u$.
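A minimal sketch of how one such training sample could be simulated is given below. It follows the ingredients listed above (Walsh codes, complex Gaussian channels with variance 1/N, QPSK symbols from a Bernoulli bit sequence, AWGN), but several details are assumptions: the Walsh codes are taken as the first N columns of a Sylvester-Hadamard matrix, the bit-to-symbol mapping is one common QPSK mapping, and the superposition weights follow an assumed normalization. All function names are hypothetical.

```python
import numpy as np

def walsh_matrix(m, n):
    """First n columns of the m x m Sylvester-Hadamard matrix (m must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < m:
        h = np.block([[h, h], [h, -h]])
    return h[:, :n]

def qpsk_from_bits(bits):
    """Map consecutive bit pairs to unit-power QPSK symbols, returned as a 1 x M row."""
    b = 1.0 - 2.0 * np.asarray(bits, dtype=float).reshape(-1, 2)   # 0 -> +1, 1 -> -1
    return ((b[:, 0] + 1j * b[:, 1]) / np.sqrt(2.0)).reshape(1, -1)

def complex_gaussian(shape, variance, rng):
    """i.i.d. circularly symmetric complex Gaussian samples with the given variance."""
    scale = np.sqrt(variance / 2.0)
    return scale * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

def make_sample(n, m, rho, e_u, sigma2, rng):
    p = walsh_matrix(m, n)                                  # M x N, p.T @ p = M I
    h = complex_gaussian((1, n), 1.0 / n, rng)              # downlink CSI
    g = complex_gaussian((n, 1), 1.0 / n, rng)              # uplink CSI
    d = qpsk_from_bits(rng.integers(0, 2, size=2 * m))      # UL-US
    x = np.sqrt(rho * e_u / n) * h @ p.T + np.sqrt((1 - rho) * e_u) * d  # assumed weights
    r = g @ x + complex_gaussian((n, m), sigma2, rng)       # received N x M block
    return p, h, g, d, r

n, m = 16, 64
p, h, g, d, r = make_sample(n, m, rho=0.2, e_u=1.0, sigma2=0.1, rng=np.random.default_rng(0))
```

The coarse estimation and the real-valued reshaping would then be applied to `r` to obtain the network input, with the reshaped `h` and `d` serving as labels.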

IV. SIMULATION RESULTS
In this section, the proposed DL-based scheme is compared with the SC-baseline [22] (presented in II-A) under different conditions. Some definitions involved in the simulations are given first. The signal-to-noise ratio (SNR) is measured in decibels (dB) on the received signal from user-$u$ at the BS. The normalized MSE (NMSE) is used to evaluate the recovery of the downlink CSI. In the NN training phase, the PPC $\rho$ and the frame length (or UL-US length) $M$ are set to $\rho = 0.2$ and $M = 512$, respectively. The training set $\{\bar{\mathbf{X}}_u\}$ has 200,000 samples, and the batch size is 200 samples. During training, the SNR is set to 5 dB. We use the Adam optimizer as the training optimization algorithm [38] with parameters $\beta_1 = 0.99$ and $\beta_2 = 0.999$ [39]. The learning rate is set to 0.0001. The maximum number of iterations is 15,000. For each subnet training, the ...
The training and testing of the proposed method are carried out on a server with an NVIDIA TITAN RTX GPU and an Intel Xeon(R) E5-2620 CPU 2.1GHz×16, and the results of the SC-baseline are obtained by Matlab simulation on the server CPU due to the lack of a GPU solution. With subnet-by-subnet training, each subnet in a network model (e.g., the model of $N = 64$) converges after 10,000 iterations. In total, it takes no more than 80 minutes to train a whole network model (including four subnets), which is significantly faster than the case in [24] (about 3 days).
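The two evaluation metrics used in this section can be computed as in the short sketch below. The NMSE is written in its commonly used form (squared error normalized by the energy of the true CSI), which is assumed here to match the definition used in the simulations; the function names are hypothetical.

```python
import numpy as np

def nmse(h_true, h_est):
    """Squared estimation error normalized by the energy of the true vector."""
    h_true, h_est = np.asarray(h_true), np.asarray(h_est)
    return float(np.sum(np.abs(h_true - h_est) ** 2) / np.sum(np.abs(h_true) ** 2))

def ber(bits_true, bits_est):
    """Fraction of bit positions that differ."""
    return float(np.mean(np.asarray(bits_true) != np.asarray(bits_est)))

h_true = np.array([1.0 + 1.0j, -1.0 + 0.5j])
nmse_perfect = nmse(h_true, h_true)              # perfect recovery
nmse_zero = nmse(h_true, np.zeros_like(h_true))  # all-zero estimate
ber_example = ber([0, 1, 1, 0], [0, 1, 0, 0])    # one of four bits in error
```

In practice both metrics would be averaged over many test frames at each SNR point.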
To verify the effectiveness of the trained NN for the case where the test PPC and frame length are the same as those of the training phase (i.e., $\rho = 0.2$ and $M = 512$), we first test the NMSE and BER performance and compare them against the SC-baseline. The performance curves are given in Fig. 3 and Fig. 4, respectively. Fig. 3 shows that each model (i.e., $N = 16$, $N = 32$, and $N = 64$) achieves a lower NMSE than the SC-baseline, especially at high SNR. Although SNR = 5 dB is adopted in the training phase, the three trained network models work well over the entire SNR span from 0 dB to 14 dB. Thus, the designed and trained subnets (i.e., CSI-NET1 and CSI-NET2) generalize well in terms of NMSE improvement. In Fig. 4, the trained NNs and the SC-baseline obtain almost identical BER when the SNR is not greater than 10 dB. For the case where $N = 64$ and SNR ≥ 12 dB, the BER of the SC-baseline is slightly better than that of our trained NN. One reason is that a bigger $N$ results in a smaller spreading gain, which deteriorates the NN's learning ability. Another likely reason is that the testing SNR (14 dB) is far from the training SNR (5 dB). This is supported by the observation that, without changing the testing process, the NN trained at SNR = 14 dB obtains a testing BER similar to that of the SC-baseline at 14 dB. To resolve this kind of generalization degradation, the method in [24] that obtains training data from multiple SNRs can be used. Although a similar BER cannot be obtained when $N = 64$ and SNR ≥ 12 dB, the BER performance in Fig. 4 is only slightly degraded. In particular, only one SNR (i.e., SNR = 5 dB) is employed in our NN training, which is a practical benefit because it avoids the difficulty of capturing multi-SNR data.
To demonstrate the impact of the PPC $\rho$ on the trained NNs, the BER and NMSE performances are given in Fig. 5 to Fig. 10. Note that, in Fig. 5 to Fig. 10, the NN training adopts $\rho = 0.2$, while $\rho = 0.05$, $\rho = 0.10$, and $\rho = 0.15$ are employed for testing. We use these simulations to illustrate that our NN generalizes well and is robust against changes in the PPC.
Given downlink CSI lengths $N = 64$, 32, and 16, Fig. 5, Fig. 7, and Fig. 9 illustrate the NMSE performance with the SNR varying from 0 dB to 14 dB. Especially for relatively high SNR, e.g., SNR ≥ 4 dB, the trained NNs clearly improve the NMSE compared to the SC-baseline. In the low SNR regime (e.g., SNR ≤ 2 dB) in Fig. 5 and Fig. 7, however, the NMSE of the trained NNs is slightly inferior to that of the SC-baseline. For example, in Fig. 7, the NMSE curve of the proposed method is slightly higher than the baseline curve when $\rho = 0.05$ and SNR ≤ 2 dB. This situation is similar to that in Fig. 4, where the decrease of the spreading gain is a cause of the degradation of the NN's learning ability. Although slightly inferior to the SC-baseline in certain low SNR regimes, our NN still shows prominent improvement in the majority of SNR regimes. Considering its training requirements (only one training PPC and one training SNR) and the fact that it needs no knowledge of the noise variance, the DL-based CSI feedback is still attractive. To validate the generalization and robustness of the BER against changes in the PPC, the BER performance is given in Fig. 6, Fig. 8, and Fig. 10 with $N = 64$, $N = 32$, and $N = 16$, respectively. These figures show that, compared with the SC-baseline, our trained NN achieves a similar or better BER performance. In particular, in the high SNR regime (e.g., SNR ≥ 10 dB), Fig. 6 shows a BER improvement for the cases where $\rho = 0.05$ and $\rho = 0.10$. A slight BER improvement is also observed in Fig. 8. The likely reason is that a small PPC produces only small superimposed interference from the downlink CSI, which avoids the generalization deterioration of the BER performance. It is worth noting that the training PPC and SNR are fixed at $\rho = 0.2$ and SNR = 5 dB, while the testing PPC and SNR vary, e.g., $\rho = 0.05$, 0.10, or 0.15, and the SNR varies from 0 dB to 14 dB.
To sum up, Fig. 3 to Fig. 10 show that, compared to the SC-baseline, the designed and trained multi-task network can improve the NMSE performance while keeping a comparable (or better) BER performance. From Fig. 9 and Fig. 10, we can see that, with a similar BER, our NN improves the NMSE for the case where $N = 16$. As $N$ increases, it is observed from Fig. 5 and Fig. 6 (or Fig. 7 and Fig. 8) that, when $N = 64$ (or $N = 32$), both the BER and the NMSE improve on those of the baseline, and a smaller PPC obtains greater improvements. Since the three models are trained only under the conditions SNR = 5 dB, $\rho = 0.2$, and $M = 512$, these results indicate that the designed NNs generalize well to different SNRs and PPCs. In addition, the trained NN does not need any knowledge of the noise variance, which is another advantage over the SC-baseline.

V. CONCLUSIONS
The accuracy of the downlink CSI is a prerequisite for system capacity and link robustness. In this work, a CSI feedback method combining the SC and DL approaches is developed to improve the estimation of CSI in 5G wireless communication systems without occupying uplink bandwidth resources. We propose a multi-task neural network with a subnet-by-subnet training method to facilitate parameter tuning and expedite convergence. The effectiveness of the proposed technique is confirmed by simulation results showing NMSE and BER that are comparable to or better than those of the baseline. The performance of the trained NN is also robust to varying SNR and PPC.