LSRN: A Recurrent Residual Learning Framework for Continuous Wireless Channel Estimation Using Super-Resolution Concept

As only a small part of the wireless resources can be used for pilot transmission, channel estimation, especially the interpolation process, has often been recognized as a challenging ill-posed reconstruction problem. To deal with this task, we formulate it as a typical image super-resolution problem and propose a recurrent residual learning framework named LSRN. The proposed scheme jointly exploits the advantages of recurrent and residual structures from the machine learning area to approximate the non-linear interpolation relations between the reference signals and the surrounding resource elements. In addition, we propose a low complexity implementation scheme called LSRN-L to address the stringent processing delay requirement of channel estimation tasks. Through numerical examples as well as prototype verification, the proposed LSRN/LSRN-L outperforms the conventional GI plus DFT based interpolation scheme by 10 dB in terms of normalized mean square error. Meanwhile, the low complexity LSRN-L maintains the processing delay within one millisecond.


I. INTRODUCTION
Enhanced mobile broadband (eMBB) transmission has been identified as one of the most important scenarios in the fifth generation (5G) communication systems [1]. To guarantee ultra-high throughput in the eMBB scenario, orthogonal frequency division multiplexing (OFDM) transmission with coherent detection is often regarded as the main approach in the frequency selective and time varying wireless environment, and high-order modulation, such as 64 quadrature amplitude modulation (QAM), is widely adopted to boost the throughput [2]. In order to accurately detect the high-order modulated symbols, efficient yet fast channel estimation is often regarded as the most important stage and has been investigated for more than twenty years [3].
As only a small part of the resources can be used for pilot transmission in modern wireless communications, channel estimation has been recognized as a challenging ill-posed reconstruction problem with only a small number of observations (pilots) [4]. In order to solve this type of complicated problem within the channel coherence time, a standard channel estimation process usually contains a channel state recovery stage and a low complexity interpolation stage, as illustrated in [5]. Given the additive white Gaussian noise (AWGN) assumption, the conventional channel state information (CSI) recovery schemes at the pilot locations usually apply the least square (LS) or the minimum mean squared error (MMSE) algorithm, which have been proved to be efficient in practice [6]. However, the low complexity interpolation mechanisms are under-investigated, and traditional schemes, such as linear interpolation (LI) [5], Gaussian interpolation (GI) [7], or discrete Fourier transform interpolation (DFTI) [8], are still used in recent communication systems.
The associate editor coordinating the review of this manuscript and approving it for publication was Fang Yang.
Since the interpolation process that recovers the whole channel frequency response from a few pilot observations can be regarded as a typical image super-resolution (SR) task [9], it is of great importance to review the history of classical SR methods and their potential applications in the channel estimation procedure. For example, by assuming deterministic linear/non-linear relations between the low resolution (LR) and the high resolution (HR) images, bilinear or bicubic interpolation schemes have been utilized to improve the quality of the SR process in [10], and reconstruction-based schemes, such as [11] and [12], have been proposed to further compensate for the remaining high-frequency components. Instead of predicting the possible linear/non-linear relations between LR and HR image pairs through mathematical formulas, another approach is to directly learn the relations from pre-known data sets. Typical learning frameworks include dictionary learning [13], local linear regression [14], and the random forest algorithm [15]. Although the above schemes have been applied to practical systems in [16]-[18], the resultant performance gain in terms of estimation accuracy and complexity reduction is not satisfactory.
Recently, with the development of computing and numerical calculation technologies, deep neural networks [19] have been widely applied in the learning field and have achieved significant progress in general SR tasks. As summarized in [20], these tasks can be fulfilled by several kinds of advanced network architectures, including recursive [21] and progressive reconstruction [22] designs. However, the above network designs often suffer from high computational complexity and are rarely deployed on low-cost computing devices, such as mobile terminals. Instead, linear [23]-[25] and residual networks [26] are more popular due to their simple network structures. For example, the super-resolution convolutional neural network (SRCNN) and the very deep super-resolution network (VDSR) have been proposed in [23] and [24], respectively, which up-sample the LR images before the feature extraction process, while fast SRCNN [25] performs the up-sampling afterwards to reduce the processing delay of the entire SR task. All of them are based on a linear neural network topology and are able to achieve a peak signal-to-noise ratio (PSNR) of more than 30 dB on the public DIV2K data set. In [26], the enhanced deep super-resolution (EDSR) scheme has been proposed to apply the residual learning block and achieves a 2-4 dB PSNR gain on the same data set. In addition to the above single image SR tasks, multi-frame SR tasks have been widely studied as well, including detail-revealing deep video SR [27] and temporally coherent generative adversarial network based video SR [28]. Since the interpolation process in wireless channel estimation is quite similar to the image SR task, there are some initial research efforts on SR based channel estimation, which provide superior performance gain in terms of the normalized mean square error (NMSE) according to [9] and [29]. Nevertheless, proposing a suitable SR based learning framework is not straightforward for the following reasons.
• Stringent Processing Delay: Different from conventional image SR, wireless channel estimation usually has stringent requirements on the processing delay. For example, in commercial long term evolution (LTE) networks, the channel estimation needs to be performed on a sub-frame (1 millisecond duration) basis, and the corresponding delay budget is less than one millisecond in general. Hence, how to strike a balance between the processing delay and the channel estimation performance using the SR learning framework needs to be carefully investigated.
• Domain Knowledge Exploration: Specific prior knowledge has been proved to be very effective for some tailored SR applications. For example, the spectral sparsity and the correlation information among consecutive frames have been explored for the hyper-spectral image SR and the video streaming SR tasks in [30] and [27], respectively, which show significant performance gain in terms of PSNR. Since the wireless channel matrix has some slow-varying properties in the time domain, as we show later, how to explore this type of domain knowledge to further improve the estimation accuracy and reduce the processing delay based on the SR learning framework is still open.
• High Generalization Ability: In the practical communication system, adjusting the learning framework to different channel fading environments is quite challenging, and a more reasonable solution is to design a more robust network that only needs offline training. Therefore, the desired SR based learning framework should provide high generalization ability as well.
In our previous work [9], to address the above challenges, we proposed a novel channel estimation scheme using the super-resolution image recovery concept, which achieves significant performance improvement based on numerical evaluations. However, one drawback of this scheme is its significant processing delay. Our current work directly addresses this issue, and its novelty can be summarized as follows.
First, we propose to use the recurrent architecture together with traditional residual learning to balance the achieved NMSE performance and the processing delay, which is a novel concept based on our literature survey. The proposed recurrent residual learning framework, named LSRN (Long-Short term memory based Residual Network), jointly exploits the advantages of recurrent and residual architectures on top of the traditional image SR tasks. Specifically, we apply the residual network model to approximate the non-linear interpolation relations of real-time CSI between the reference signals (RSs) and the neighboring resource elements (REs) to improve the estimation accuracy, and utilize the recurrent structure to learn the slow-varying time domain correlation among consecutive OFDM symbols. In order to control the processing delay, we simplify the original structure of the residual blocks (ResBs), and reduce the number of cascaded blocks and convolutional filters simultaneously. We denote the corresponding low complexity implementation as LSRN-L.
Second, we jointly use the classical channel model and prototype measurement results to demonstrate the effectiveness of the proposed scheme, which is much more valuable for practical implementations. We train the LSRN-L offline using the standard COST 2100 channel model [31] and deploy it online in a commercial WiFi system using OFDM transmissions. Based on numerical simulations and prototype verification, we show that the proposed LSRN-L provides a 10 dB to 11 dB NMSE improvement compared with the conventional GI plus DFTI based interpolation scheme while keeping the processing delay within one millisecond.
The rest of this article is organized as follows. In Section II, we provide some preliminary information on the channel estimation and the machine learning empowered SR schemes. The proposed LSRN framework is discussed in Section III and the corresponding low-complexity implementation strategy LSRN-L is proposed in Section IV. The corresponding numerical and empirical examples are provided in Section V, and in Section VI, we conclude the whole paper.

II. PRELIMINARIES
In this section, we briefly introduce the basic concept of interpolation in the channel estimation process and the existing SR schemes.

A. INTERPOLATION FOR CHANNEL ESTIMATION
Consider an OFDM transmission system with a single transmit antenna and $N_r$ receive antennas in the wireless fading environment.¹ With the AWGN assumption, the received symbols at the $(f_{RS}, t_{RS})$-th RE of the $k$-th resource block (RB), $\mathbf{y}^k(f_{RS}, t_{RS}) \in \mathbb{C}^{N_r \times 1}$, are given by
$$\mathbf{y}^k(f_{RS}, t_{RS}) = \mathbf{h}^k(f_{RS}, t_{RS})\, x^k(f_{RS}, t_{RS}) + \mathbf{n}^k(f_{RS}, t_{RS}),$$
where $x^k(f_{RS}, t_{RS})$ and $\mathbf{h}^k(f_{RS}, t_{RS})$ are the predefined RS and the equivalent channel responses, respectively, and $\mathbf{n}^k(f_{RS}, t_{RS}) \in \mathbb{C}^{N_r \times 1}$ denotes the additive complex Gaussian noise with mean $\mathbf{0}_{N_r}$ and covariance $\sigma^2 \mathbf{I}_{N_r}$. Based on the well-known MMSE criterion, the estimated channel state, $\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})$, is computed from the received symbols using the channel correlation matrix and the noise covariance.² In the practical system, to minimize the channel estimation overhead, an interpolation process is usually performed on an RB basis (in accordance with the coherence time and coherence bandwidth) with $N_t$ time slots and $N_f$ sub-carriers as shown in Figure 1, which maps the estimated channel states of the RSs onto the entire RB. Denoting $\mathbf{h}^k$ as the aggregated CSI of the $k$-th RB, the interpolation process can be described as
$$\mathbf{h}^k = F\big(\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}\big), \quad \forall (f_{RS}, t_{RS}) \in \Omega_{RS}, \tag{1}$$
where $\Omega_{RS}$ represents the collection of all the possible RS positions in each RB, and $F(\cdot)$ denotes the general interpolation function mapping from $|\Omega_{RS}|$ RSs³ to $N_t \times N_f$ REs. In the existing literature, the interpolation processes rely on some heuristic algorithms, and a standard approach is to jointly utilize the DFTI process in the frequency domain and the Gaussian approximation in the time domain as defined in [7], [8].
¹ For illustration purposes, we use boldface upper-case and lower-case letters to denote matrices and vectors, respectively. $(\cdot)^T$, $(\cdot)^*$, $(\cdot)^H$, and $(\cdot)^{-1}$ denote the matrix/vector transpose, conjugate, Hermitian, and inversion operations, respectively. The identity matrix with dimension $N$ is denoted by $\mathbf{I}_N$, and we use $\mathbf{0}_N$ and $\mathbf{1}_N$ to denote the all-zero and all-one vectors with dimension $N$.
² In this paper, we assume the channel correlation matrix and the noise covariances are pre-known based on previous long-term observations.
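The pilot-position estimation step described above can be illustrated with a minimal NumPy sketch. It uses the textbook LS estimate followed by a linear MMSE refinement, which is an assumption here and not necessarily the exact expression used in this paper; the function names `ls_estimate` and `lmmse_estimate` are hypothetical, and `R_hh` and `sigma2` stand for the pre-known channel correlation matrix and noise variance.

```python
import numpy as np

def ls_estimate(y, x):
    """Least-squares channel estimate at one pilot RE.

    y: received vector of shape (N_r,); x: known scalar pilot symbol.
    """
    return y * np.conj(x) / (np.abs(x) ** 2)

def lmmse_estimate(y, x, R_hh, sigma2):
    """Textbook linear MMSE refinement of the LS estimate (assumed form)."""
    h_ls = ls_estimate(y, x)
    n_r = h_ls.shape[0]
    # Effective noise covariance after dividing by the pilot symbol.
    noise_cov = (sigma2 / np.abs(x) ** 2) * np.eye(n_r)
    # Solve (R_hh + noise_cov) z = h_ls instead of forming an explicit inverse.
    z = np.linalg.solve(R_hh + noise_cov, h_ls)
    return R_hh @ z
```

With an identity correlation matrix and a noise variance equal to the pilot power, this refinement simply halves the LS estimate, which shows how the MMSE step shrinks noisy observations toward zero.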
Based on (1), the above interpolation process can be rewritten as
$$\mathbf{h}^k = f_{GI}\Big(g_{DFTI}\big(\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}\big)\Big),$$
where $f_{GI}(\cdot)$ and $g_{DFTI}(\cdot)$ denote the Gaussian approximation and DFTI processes as illustrated in [32], respectively. As shown above, the traditional method approximates the original function $F(\cdot)$ by cascading several deterministic linear functions, e.g., $g_{DFTI}(\cdot)$ and $f_{GI}(\cdot)$, while the inherent relations between $\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}$ and $\mathbf{h}^k$ are in general non-linear due to the additive random noise and the non-deterministic relations among the CSI of different REs.
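As a rough illustration of the frequency-domain step $g_{DFTI}(\cdot)$, the following sketch interpolates pilot estimates across sub-carriers by zero-padding in the delay domain. It is a simplified sketch under stated assumptions, not the implementation of [8]: pilots are assumed to sit on a uniform comb starting at sub-carrier 0, the pilot spacing is assumed to divide the number of sub-carriers, the channel impulse response is assumed to be no longer than the number of pilots, and noise is ignored. The name `dft_interpolate` is hypothetical, and the time-domain Gaussian step is not sketched.

```python
import numpy as np

def dft_interpolate(h_pilot, n_subcarriers):
    """Interpolate uniformly spaced pilot estimates to all sub-carriers."""
    # Delay-domain taps implied by the pilot samples.
    taps = np.fft.ifft(h_pilot)
    # Zero-pad the taps to the full size and return to the frequency domain.
    return np.fft.fft(taps, n=n_subcarriers)
```

Under these assumptions the interpolation is exact; with noise or with delay spreads longer than the pilot count, the recovered response is only an approximation, which is one motivation for a learned interpolator.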

B. EXISTING SR SCHEMES
In order to solve the above ill-conditioned problem as defined in (1), a well known approach as mentioned before is to apply the single image SR as shown in Figure 2. Based on the existing literature, three types of CNN-based SR schemes are commonly adopted, which are summarized as below.
• SRCNN [23]: In the first stage, the original LR images are expanded to the target size, and then a three-layer neural network is used to approximate the non-linear relation between the LR images and the HR images.
• VDSR [24]: The first stage of VDSR is similar to that of SRCNN. In the feature extraction stage, VDSR applies the global residual learning network to progressively predict the SR images.
• EDSR [26]: Different from SRCNN and VDSR, EDSR takes the LR image directly as input and then applies the global and local residual learning network [33] to progressively predict the SR images from the LR images.
Since the existing SR schemes are designed for image processing, there are still many challenges and opportunities to explore when they are applied to channel estimation tasks. Besides the stringent processing delay and the generalization requirement mentioned above, the time and frequency domain correlations among neighboring REs might be exploited as well. Computing the time and frequency domain correlation coefficients,⁴ we plot the relation between the correlation coefficients and the number of intervals in Figure 3. The red dots in the figure represent the average values, and the maximum and minimum values are indicated by dashed lines. As can be seen from Figure 3, the correlation of the channel in the time domain is stronger than that in the frequency domain, and the channel changes are subtle and gradual over a period of time. Using this type of domain knowledge, we can further improve the estimation accuracy and reduce the processing delay by improving the existing network modules and adding related modules. In particular, since the time-domain correlation of the channel is much stronger than its frequency-domain correlation, we consider it necessary to introduce recurrent learning for channel estimation.
³ $|\cdot|$ represents the cardinality of the inner set.
⁴ We choose the time and frequency intervals according to the IEEE 802.11n standard, which are 10 ms and 312.5 kHz, respectively.
FIGURE 2. The network architecture of three typical SR neural networks, where subfigures (a), (b), and (c) show the abstracted network configurations of SRCNN [23], VDSR [24], and EDSR [26], respectively. Different from the traditional interpolation scheme, they approximate the non-linear interpolation via neural networks.
38100 VOLUME 8, 2020
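The kind of correlation analysis summarized in Figure 3 can be sketched as follows. This is only an illustrative sketch: the exact correlation definition used for Figure 3 is not reproduced here, so a normalized correlation magnitude between REs separated by a given lag is assumed, and the function name `lag_correlation` is hypothetical.

```python
import numpy as np

def lag_correlation(H, lag, axis):
    """Normalized correlation magnitude between REs separated by `lag`.

    H: complex array of shape (N_f, N_t); axis=0 is frequency, axis=1 is time.
    """
    n = H.shape[axis]
    a = np.take(H, np.arange(0, n - lag), axis=axis)
    b = np.take(H, np.arange(lag, n), axis=axis)
    num = np.abs(np.sum(a * np.conj(b)))
    den = np.sqrt(np.sum(np.abs(a) ** 2) * np.sum(np.abs(b) ** 2))
    return num / den
```

Evaluating this function for increasing lags along each axis gives one curve per domain; a channel that is nearly static in time yields values close to one along the time axis, while frequency selectivity lowers the values along the frequency axis.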

III. LSRN FRAMEWORK
Since SR networks have a strong ability to model the non-linear function $F(\cdot)$, the interpolation for channel estimation can be treated as a single image SR reconstruction problem. In this section, we provide an overview of the proposed LSRN framework and then introduce it block by block to demonstrate the effectiveness of the proposed scheme.

A. OVERVIEW
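The primary target of using SR technology is to find a suitable interpolation function $F_{SR}$ by minimizing the MSE between the estimated CSI, $\hat{\mathbf{h}}^k_{SR}$, and the aggregated CSI, $\mathbf{h}^k$. Mathematically, the optimal SR reconstruction for CSI interpolation, $F_{SR}$, can be modeled via the following problem,
$$\min_{F_{SR}} \ \frac{1}{N_{RB}} \sum_{k=1}^{N_{RB}} \big\| \hat{\mathbf{h}}^k_{SR} - \mathbf{h}^k \big\|_F^2, \quad \text{s.t.} \ \hat{\mathbf{h}}^k_{SR} = F_{SR}\big(\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}\big),$$
where $\|\mathbf{A}\|_F = \sqrt{\mathrm{Tr}(\mathbf{A} \cdot \mathbf{A}^H)}$ denotes the Frobenius norm and $N_{RB}$ denotes the total number of RBs. In the practical system, the true value of the channel responses, $\mathbf{h}^k$, is difficult to observe in general due to the noisy environment, and a more feasible solution is to approximate the original $F_{SR}$ by solving the following problem,
$$\min_{F_{SR}} \ \frac{1}{N_{RB}} \sum_{k=1}^{N_{RB}} \big\| \hat{\mathbf{h}}^k_{SR} - \hat{\mathbf{h}}^k_{\mathrm{MMSE}} \big\|_F^2, \quad \text{s.t.} \ \hat{\mathbf{h}}^k_{SR} = F_{SR}\big(\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}\big).$$
Note that in the above formulation, the aggregated CSI, $\mathbf{h}^k$, is replaced by the CSI estimated with the conventional MMSE scheme, i.e., $\hat{\mathbf{h}}^k_{\mathrm{MMSE}}$, which is much easier to implement in practice. Since $F_{SR}$ is an ill-conditioned mapping from the RSs to the $N_t \times N_f$ REs, the above optimization problem is in general difficult to solve.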
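The error measure behind the objective above, and the NMSE figures quoted throughout the paper, can be computed as in the following sketch. It assumes the usual definition of NMSE as the squared error energy divided by the reference energy; the function names `nmse` and `nmse_db` are hypothetical.

```python
import numpy as np

def nmse(h_est, h_ref):
    """Normalized mean square error between an estimate and a reference."""
    err = np.sum(np.abs(h_est - h_ref) ** 2)
    return err / np.sum(np.abs(h_ref) ** 2)

def nmse_db(h_est, h_ref):
    """NMSE expressed in decibels."""
    return 10.0 * np.log10(nmse(h_est, h_ref))
```

On this scale, a 10 dB improvement corresponds to a tenfold reduction of the normalized error energy.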
Inspired by the EDSR scheme illustrated before, we divide the entire process into two steps, where the first step focuses on the feature extraction and fusion processes to obtain several $N_{fc}$-dimensional features, and the second step scales up the $N_{fc}$-dimensional features into $N_t \times N_f$ REs by up-sampling and maps the high dimensional features back into the complex valued channel responses. As illustrated in Section II-B, the time domain correlations are much stronger than the frequency domain correlations, which can be regarded as the domain knowledge for SR based wireless channel estimation tasks. To exploit this effect, we propose the LSRN framework as shown in Figure 4, where a recurrent learning block is applied to extract the time domain correlations. Mathematically, the proposed LSRN framework can be expressed as follows,
$$\hat{\mathbf{h}}^k_{LSRN} = F_{LSRN}\big(\{\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})\}; \theta_{LSRN}\big),$$
where $F_{LSRN}(\cdot; \theta_{LSRN})$ and $\hat{\mathbf{h}}^k_{LSRN}$ denote the proposed LSRN with parameters $\theta_{LSRN}$ and the corresponding estimated CSI, respectively.⁵ Denote $F_{RL}(\cdot; \theta_{RL})$, $F_{LL}(\cdot; \theta_{LL})$, and $F_{GL}(\cdot; \theta_{GL})$ to be the recurrent learning, the local residual learning, and the global residual learning blocks, respectively. The extracted and fused $N_{fc}$-dimensional feature tensors, $\{\hat{\mathbf{h}}^k_{FE}(f_{RS}, t_{RS})\}$, are obtained by fusing the outputs of these three blocks. Following the pixelShuffle network proposed in [34], $F_{PS}(\cdot; \theta_{PS})$, and the feature mapping operation, $F_{FM}(\cdot; \theta_{FM})$, the estimated CSI of the proposed LSRN is finally given by
$$\hat{\mathbf{h}}^k_{LSRN} = F_{FM}\Big(F_{PS}\big(\{\hat{\mathbf{h}}^k_{FE}(f_{RS}, t_{RS})\}; \theta_{PS}\big); \theta_{FM}\Big).$$

B. RECURRENT LEARNING AIDED FEATURE EXTRACTION
Feature extraction has been proved to be a key procedure in image analysis tasks, including SR, de-blurring [35], and image semantic segmentation [36]. With modern machine learning technology, the intrinsic features are automatically learned by feeding "labelled" data sets, and the residual learning approach, such as the residual network (ResNet), is reported to achieve state-of-the-art performance for feature extraction in image recognition tasks [37]. A similar phenomenon has been observed in the image SR task as well, where the residual learning based EDSR network achieves the state-of-the-art PSNR result. However, as a general-purpose SR network, the design of EDSR does not rely on any specific correlation model between LR and HR images, and the imbalanced correlation nature in the channel estimation tasks, as shown in Figure 1, is not considered. In order to take advantage of this domain knowledge, we propose to use recurrent learning on top of the original EDSR framework. Typical recurrent learning algorithms include long short term memory (LSTM) [38], gated recurrent unit (GRU) [39], and convolutional LSTM (ConvLSTM) [40]. Since the LSTM and GRU architectures are specifically designed for one-dimensional fixed-length vector sequences, the two-dimensional correlation property of channel responses shown in Section II-B cannot be fully represented. To overcome this obstacle, we propose to use the ConvLSTM structure, and the corresponding mathematical model to calculate the recurrent features $\hat{\mathbf{h}}^k_{RL}(f_{RS}, t_{RS})$ follows the ConvLSTM formulation, where $\sigma_1(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ is the hyperbolic tangent activation function and $\sigma_2(x) = \frac{1}{1 + e^{-x}}$ is the logistic sigmoid activation function. $(t_{RS} - 1)$ represents the previous time slot before $t_{RS}$. The convolution parameters $\theta_{RL} = [\theta^1_{RL}, \ldots, \theta^{12}_{RL}]$ are optimized during the training process, and the cell state $C_{t_{RS}-1}$ is iteratively updated at each time slot.
To demonstrate the effectiveness of the proposed recurrent learning architecture, we perform the channel estimation tasks using the traditional EDSR network with three recurrent learning architectures, i.e., EDSR with the LSTM, GRU, and ConvLSTM schemes, respectively. To make a fair comparison, the above schemes are trained and tested under the well-known COST 2100 channel model [31]. As shown in Figure 5(a), the EDSR with ConvLSTM scheme outperforms the other three schemes under different received signal-to-noise ratio (SNR) regions. In addition, to obtain a better understanding of the proposed scheme, we compare the channel estimation results with the traditional EDSR network under different numbers of residual blocks. As verified in Figure 5(b), the proposed scheme shows consistent NMSE improvement over all the tested network configurations.
FIGURE 5. NMSE comparison of different residual learning structures and parameters used in EDSR. Subfigure (a) shows the effects of LSTM [38], GRU [39], and ConvLSTM [40], and subfigure (b) shows the effects of different network parameters under the COST 2100 model.
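To make the recurrent branch more concrete, the following NumPy sketch implements one update of a basic ConvLSTM cell. It is a minimal sketch under assumptions and not the paper's exact model: the gate structure is the common four-gate form without peephole connections, each gate owns an input kernel, a hidden-state kernel, and a bias (twelve parameter tensors in total, which is consistent with the count $\theta^1_{RL}, \ldots, \theta^{12}_{RL}$ although the exact assignment is assumed), and the names `conv2d_same` and `convlstm_step` are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' 2-D convolution (cross-correlation form).

    x: (C_in, H, W) real feature map; w: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    height, width = x.shape[1:]
    out = np.zeros((c_out, height, width))
    for i in range(k):
        for j in range(k):
            # Add the contribution of kernel tap (i, j) for all channels at once.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j],
                             xp[:, i:i + height, j:j + width])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, params):
    """One ConvLSTM update for a single time slot.

    params maps each gate name in "ifog" to a tuple (w_x, w_h, bias).
    """
    pre = {}
    for gate, (w_x, w_h, bias) in params.items():
        pre[gate] = (conv2d_same(x, w_x) + conv2d_same(h_prev, w_h)
                     + bias[:, None, None])
    i_gate = sigmoid(pre["i"])   # input gate, uses sigma_2
    f_gate = sigmoid(pre["f"])   # forget gate, uses sigma_2
    o_gate = sigmoid(pre["o"])   # output gate, uses sigma_2
    cand = np.tanh(pre["g"])     # candidate state, uses sigma_1
    c_new = f_gate * c_prev + i_gate * cand
    h_new = o_gate * np.tanh(c_new)
    return h_new, c_new
```

Calling `convlstm_step` once per OFDM time slot, feeding back the returned hidden and cell states, carries the slowly varying time-domain information from slot $(t_{RS} - 1)$ to slot $t_{RS}$ while the convolutions keep the two-dimensional structure of the input.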

C. FEATURE RECONSTRUCTION
According to (6), the extracted and fused $N_{fc}$-dimensional feature tensors, $\{\hat{\mathbf{h}}^k_{FE}(f_{RS}, t_{RS})\}$, combine three components: the features extracted by the recurrent learning branch, $\hat{\mathbf{h}}^k_{RL}(f_{RS}, t_{RS})$, as defined in (7), the output of the local residual learning block, and the output of the global residual learning block, which is simply a direct connection to the initial features.
For local residual learning, we use a 16-layer stacked residual block structure to model $F_{LL}\big(\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS}); \theta_{LL}\big)$, which has been proved to be effective in SR cases [26]. After the feature extraction process, the pixelShuffle procedure up-scales the feature tensors $\{\hat{\mathbf{h}}^k_{FE}(f_{RS}, t_{RS})\}$ to the entire time-frequency resources. Typical reconstruction methods, as shown in Figure 6, gradually up-scale the feature tensors via the pixel re-arrangement or the deconvolution approach. Inspired by a direct up-scaling operation with a factor of 4, we extend this operation to more general cases with a factor of $s^2$. Specifically, we rearrange the feature tensor to the size of $N_f \times N_t$, where $s$ is the period shuffle factor and the corresponding number of feature channels per pixel after up-scaling is $N_{fc}/s^2$.
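The pixel re-arrangement described above can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the commonly used channel-to-space ordering of pixel shuffle layers; the function name `pixel_shuffle` is hypothetical, and the feature tensor is assumed to be stored as (channels, height, width).

```python
import numpy as np

def pixel_shuffle(x, s):
    """Rearrange (C * s**2, H, W) features into (C, H * s, W * s)."""
    c_total, height, width = x.shape
    if c_total % (s * s) != 0:
        raise ValueError("channel count must be divisible by s**2")
    c = c_total // (s * s)
    # Split the channel axis into (C, s, s), then interleave with the spatial axes.
    x = x.reshape(c, s, s, height, width)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, s, W, s)
    return x.reshape(c, height * s, width * s)
```

The operation involves no arithmetic, only a re-indexing of memory, which is consistent with its low processing delay: $N_{fc}$ input channels become $N_{fc}/s^2$ channels on a grid that is $s$ times larger in each dimension.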
We compared the estimation performance and the time delay of the three reconstruction models shown in Figure 6, and the results of the three reconstruction methods are relatively close. Due to its simpler structure, we choose pixelShuffle (4), which can effectively reduce the time delay.
The last step is a simple feature mapping process which maps the $N_{fc}/s^2$ feature tensors into the final channel responses, and the corresponding realization $F_{FM}(\cdot; \theta_{FM})$ is performed by a two-layer convolutional neural network. The detailed network configuration of the proposed LSRN structure is summarized in Table 1.

IV. LOW COMPLEXITY IMPLEMENTATION
In the previous section, we designed a recurrent learning based LSRN scheme to recover the channel states. However, as channel estimation is typically a delay critical task, the processing delay of the proposed scheme needs to be carefully studied as well. As shown in Figure 7, more than 60% of the processing time is occupied by the residual learning process. Therefore, in this section, we focus on a low complexity implementation strategy to satisfy the potential delay requirement.

A. SIMPLIFIED RESB STRUCTURE
As shown in Figure 4, the three parallel learning paths determine the processing delay of the feature extraction stage, and according to the existing literature [26], the local residual learning path has the longest critical path, which dominates the entire delay chain. To solve this issue, a straightforward approach is to simplify the basic building block of the original ResNet architecture. Inspired by [41] and [26], we propose to remove one activation layer and one convolutional layer as shown in Figure 9 and summarize the theoretical explanation in the following lemma.
Lemma 1: For a common residual learning network architecture with $N$ cascaded residual blocks, two different residual block realization schemes, e.g., with $M_1$ and $M_2$ convolutional layers, respectively, can be trained to provide similar reconstruction results if the residual learning network is deep enough, i.e., $N \to +\infty$.
Proof: Please refer to the Appendix for the proof.
Based on Lemma 1, we can reduce the number of convolutional layers in each ResB when the residual learning network architecture is deep enough. However, in practical systems, we need to balance the processing delay and the implementation complexity against the achievable performance. In our case, the number of ResBs, $N$, is chosen to be 16, and we reduce the number of convolutional layers on top of the ResB_Lim structure. As shown in Figure 8, although the number of ResBs, $N$, is only 16, we can still reduce the number of convolutional layers of each ResB with marginal NMSE losses (e.g., 7.14%), and the overall processing delay is reduced on average from 27 ms to 20 ms, a reduction of 25.9%.
⁶ The inputs of the neural network are the amplitude and phase of the estimated channel responses at the pilot locations. Hence, the inputs to the two channels of the network are real values and are handled in one network.

B. REDUCED NUMBER OF RESB AND CONVOLUTIONAL KERNELS
Another scheme to reduce the complexity of the local residual learning path is to reduce the number of ResBs and convolutional kernels. In the conventional EDSR scheme, reducing the number of ResBs and kernels may greatly affect the image SR performance. However, in this specific task, the performance degradation caused by reducing the number of ResBs and convolutional kernels is controllable for the following reasons. First, since the correlations between different REs are much stronger than those between the pixels in the conventional image SR task, the differences of the extracted feature maps among different ResBs are much smaller, which makes it possible to reduce the number of ResBs with marginal performance loss. Second, with the proposed LSRN architecture, part of the feature losses due to the smaller number of ResBs and convolutional kernels can be compensated by the recurrent learning branch.
FIGURE 8. NMSE and processing delay comparison of three different ResBs. The blue, green, and yellow histograms and scatter charts represent ResB_Ledig [42], ResB_Lim [26], and ours, respectively.
To demonstrate the above two effects, we perform the following two experiments. In the first experiment, we extract the feature map of each layer in the residual learning chain with 16 ResBs as shown in Figure 9(a), and in the second experiment, we combine the recurrent learning and the global learning chains to obtain the overall feature map variations as shown in Figure 9(b).
The extracted feature map of the residual learning chain is depicted in Figure 10(a), where the feature values begin to converge after the fourth ResB. Based on this observation, we collect 10000 samples from the COST 2100 channel model and pass them into the residual learning chain.
The resulting average NMSE is around 0.0104, which provides another possibility to reduce the processing complexity by reducing the number of ResBs from 16 to 4. As mentioned before, we combine the residual learning and the recurrent learning as the second approach to reduce the processing delay, and the corresponding simulation results are shown in Figure 10(b).
In this experiment, we directly use 4 ResBs rather than the entire 16 ResBs to see the combined effects. By averaging over 10000 examples, we compare the NMSE of the proposed LSRN architecture with four ResBs against the other three possibilities. Since the NMSE values are quite close for the different cases, we choose the proposed LSRN architecture with only one ResB to reduce the potential complexity.⁷ As shown in Figure 11, the overall processing delay for the 14 × 14 pilot configuration can be reduced on average from 20 ms to about 0.9 ms, a reduction of 95.5%. As discussed in [43], for both classification and generation tasks, even-sized kernels result in severe performance degradation in general. For odd-sized kernels, increasing the kernel size does not provide a monotonic improvement in terms of Top-1 accuracy on the ImageNet dataset [44]. Therefore, in the following numerical evaluations, we choose different odd kernel sizes and compare the NMSE results of the LSRN-L schemes. As shown in Figure 12, we observe similar behavior for the NMSE performance, and the neural networks with kernel size 5 × 5 achieve the best NMSE performance. However, to balance the achieved NMSE and the processing delay, we choose the kernel size to be 3 × 3, since it achieves similar NMSE performance in the high SNR regime and reduces the processing delay by more than 20% compared with the 5 × 5 case.
In short, we propose two different approaches to simplify the network architecture, i.e., simplifying the structure of the ResB and reducing the number of ResBs by measuring the feature variations. According to the preliminary simulation results shown in Figure 7, the overall processing delay for the 14 × 14 pilot configuration can be reduced from 27 ms to 20 ms and then to about 0.9 ms, which corresponds to reductions of 25.9% and 95.5%, respectively.
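The two percentages quoted above follow directly from the reported delays, as the short check below shows. The helper name `relative_reduction` is hypothetical, and the overall figure is derived here from the reported 27 ms and 0.9 ms endpoints rather than stated in the text.

```python
def relative_reduction(before_ms, after_ms):
    """Relative delay reduction in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms

step1 = relative_reduction(27.0, 20.0)    # simplified ResB structure
step2 = relative_reduction(20.0, 0.9)     # fewer ResBs and convolutional kernels
overall = relative_reduction(27.0, 0.9)   # both steps combined

print(round(step1, 1), round(step2, 1), round(overall, 1))  # 25.9 95.5 96.7
```

Note that the two step-wise percentages are measured against different baselines, so they do not add up; the combined reduction relative to the original 27 ms is about 96.7%.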

V. EXPERIMENTS & RESULTS
In this section, we introduce the generation of data sets and simulation environments, for both model generated case and prototype sampled case. In what follows, we present the corresponding evaluation results for the proposed recurrent learning framework.

A. DATA SETS AND IMPLEMENTATION DETAILS
In the following evaluation, we apply two different approaches to generate the real-time channel state, namely "Model Generated" and "Prototype Sampled". The "Model Generated" data set is directly collected from the well-known COST 2100 channel model [31], while the "Prototype Sampled" data set is estimated from the practical WiFi prototype system as shown in Figure 4. The detailed configuration of the prototype system is listed in Table 2.
After we generate the original high resolution CSI $\mathbf{h}^k(f_{RE}, t_{RE})$, we follow the conventional MMSE channel estimation method to generate the low resolution CSI at the RS positions, $\hat{\mathbf{h}}^k_{\mathrm{MMSE}}(f_{RS}, t_{RS})$. Based on different configurations of the RSs and the LOS/NLOS conditions, we construct different data sets for training and testing as shown in Table 3. The sizes of the training and testing data sets are chosen to be $5.6 \times 10^5$ and $1.12 \times 10^5$, respectively, and the other related learning parameters are listed in Table 2 as well.

B. COMPARISON WITH STATE-OF-THE-ART
In the following experiments, we compare the proposed LSRN scheme and the corresponding low complexity implementation (denoted as LSRN-L) with the following baseline systems. Baseline 1 is the conventional GI plus DFTI scheme as elaborated in Section II-A. Baseline 2 and Baseline 3 apply the image SR technique for the interpolation process using SRCNN and EDSR networks as explained in Section II-B, respectively.
In the first experiment, we compare the channel estimation performance in terms of NMSE, as well as the processing delay, versus SNR under different pilot configurations, where we rely on the COST 2100 model to generate the training and testing data sets. All the neural network based solutions are pre-trained using data set $A$ and tested on data set $\tilde{A}$. The corresponding numerical results are depicted in Figure 13. As we can see from this figure, under the 14 × 14 pilot configuration, the proposed LSRN and LSRN-L schemes outperform the baseline schemes by 14 dB to 15 dB and 10 dB to 11 dB in NMSE, respectively. Meanwhile, in terms of the processing delay, LSRN-L performs much better than the LSRN scheme and stays within the 1 ms delay budget.
In the second experiment, we compare the channel estimation performance of the traditional GI plus DFTI scheme and our low complexity LSRN-L scheme. For our network, we use the model generated training data set $A$ and the sampled training data sets $B$ and $C$ for training, and test on the sampled testing data sets $\tilde{B}$ and $\tilde{C}$, respectively. As the numerical results in Figure 14 show, our LSRN-L scheme outperforms the baseline GI plus DFTI scheme in terms of NMSE in all circumstances. By comparing the test results of models trained on different data sets, we can conclude that the network trained on model generated data is also applicable to channel estimation in the WiFi environment under both LOS and NLOS conditions, which indicates that our network is robust in practical application scenarios. Last but not least, this experiment shows that the complex correlation characteristics between pilots can be learned through network training, so that a network trained on model generated data can adapt to different channel fading environments in actual communication systems.
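For reference, NMSE values such as those quoted above can be computed as in the following minimal sketch. It assumes the common definition (total squared estimation error normalized by the total channel power, expressed in dB); the function name `nmse_db` is hypothetical, and the exact averaging convention used in the paper may differ.

```python
import numpy as np

def nmse_db(h_est, h_true):
    """Normalized mean square error between two CSI arrays, in dB."""
    err = np.sum(np.abs(h_est - h_true) ** 2)   # total squared error
    ref = np.sum(np.abs(h_true) ** 2)           # total channel power
    return 10 * np.log10(err / ref)
```

Under this definition, the gain of one scheme over another is the difference of their NMSE values in dB, so a 10 dB gain corresponds to a tenfold reduction in normalized error power.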

C. EXTENSIVE STUDY
From the results of the previous experiment, we can see that a network trained on the theoretical model can already handle the channel estimation problem of the real communication system. To observe whether a small amount of real sampled data further helps channel estimation in the real communication system, we augment the training data by mixing data set $B$ or $C$ into data set $A$. According to the experimental results shown in Figure 15, a mixed training data set that includes 20% sampled data already obtains the majority of the gain, around 70%. In another experiment, with a fixed number of pilots, we discuss the influence of different pilot designs on the results. Two additional pilot designs are considered, with training data sets $A_{7\times 28}$ and $A_{28\times 7}$ and testing data sets $\tilde{A}_{7\times 28}$ and $\tilde{A}_{28\times 7}$.
The design of these pilot arrangements is motivated by the different correlation of pilots in the time and frequency domains. Data sets $A_{7\times 28}$ and $\tilde{A}_{7\times 28}$ are generated by concentrating more pilots on different time slots than on different sub-carriers. Data sets $A_{28\times 7}$ and $\tilde{A}_{28\times 7}$ are generated by concentrating more pilots on different sub-carriers than on different time slots. The experimental results are shown in Figure 15. We can see that the estimation results of the 7 × 28 pilot configuration vary considerably across SNRs, and its overall estimation performance is not as good as that of the 28 × 7 pilot configuration. Moreover, the 28 × 7 pilot configuration, despite using fewer pilots, achieves estimation performance close to that of the 28 × 28 pilot configuration, with an NMSE gap of 1 dB to 2 dB. These results suggest that exploring further pilot design methods is a promising way to improve our network in the future.
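The mixed training set discussed at the beginning of this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `mix_datasets` is hypothetical, each sample is stored as one row of an array, and the total training set size is held fixed while a given fraction of model generated samples is replaced by sampled ones. The paper does not state whether the sampled data replace or are appended to the model generated data.

```python
import numpy as np

def mix_datasets(model_data, sampled_data, sampled_fraction, rng):
    """Return a shuffled training set of len(model_data) samples in which
    `sampled_fraction` of the samples are drawn from `sampled_data`.

    Keeping the total size fixed is an assumption made for illustration.
    """
    n_total = len(model_data)
    n_sampled = int(round(sampled_fraction * n_total))
    if n_sampled > len(sampled_data):
        raise ValueError("not enough sampled data for the requested fraction")

    # Draw both parts without replacement, then concatenate and shuffle.
    model_idx = rng.choice(len(model_data), n_total - n_sampled, replace=False)
    sampled_idx = rng.choice(len(sampled_data), n_sampled, replace=False)
    mixed = np.concatenate([model_data[model_idx], sampled_data[sampled_idx]])
    rng.shuffle(mixed)  # in-place shuffle along the first axis
    return mixed
```

With `sampled_fraction=0.2`, one in five training samples comes from the prototype measurements, which matches the 20% mixing ratio reported above.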

VI. CONCLUSION
In this paper, we propose an LSRN architecture that jointly utilizes recurrent and residual learning for channel estimation in wireless transmission. By exploiting the slow-varying time domain correlation and the non-linear interpolation relations among different RSs and REs, the proposed low complexity LSRN (LSRN-L) provides a 10 dB to 11 dB NMSE improvement over the conventional GI plus DFTI based scheme, while keeping the processing delay below one millisecond.
VOLUME 8, 2020