Deep Learning for Super-Resolution Channel Estimation in Reconfigurable Intelligent Surface Aided Systems

Reconfigurable intelligent surface (RIS) enables the configuration of the propagation environment. Channel estimation is an essential task in realizing RIS-aided communication systems. A RIS-aided multi-user multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) system involves high-dimensional cascaded channels with sophisticated statistics. Thus, implementing the optimal minimum mean square error (MMSE) estimator, which requires multidimensional integration, is infeasible in practice. To estimate the channels accurately in a RIS-aided multi-user MIMO-OFDM system, we model channel state information (CSI) estimation as an image super-resolution (SR) problem that recovers and denoises the channel matrix. In particular, we propose a convolutional neural network, named SRDnNet, that combines a super-resolution convolutional neural network (SRCNN) with a denoising convolutional neural network (DnCNN). By treating the estimated channels at pilot positions as a low-resolution image, the enhanced SRCNN fully exploits the features of its inputs to learn a suitable interpolation method and generates a coarse estimate of the channel matrix. The DnCNN, with its element-wise subtraction structure, then exploits the features of the additive noise to recover the channel coefficients from the coarse channel matrix. Simulation results demonstrate the effectiveness and excellent performance of the proposed SRDnNet.


I. INTRODUCTION
The rapid growth of communication devices demands peak data rates of up to 1 terabit per second (Tb/s), ultra-low latency of less than 1 ms, a connectivity density of more than 10^7 devices/km^2, ten times the spectral efficiency of fifth-generation (5G) mobile networks, and improvements in other indices such as energy efficiency, mobility and area traffic capacity [2]. To meet these requirements, device-centric architectures, millimeter wave (mmWave), massive multiple-input multiple-output (MIMO) and other wireless technologies have been proposed, bringing disruptive changes to both architectural and component design [3]. However, demanding hardware requirements, excessive energy cost and complex signal processing hinder the practical implementation of these technologies [4]. Thanks to advances in materials, the novel concept of the reconfigurable intelligent surface (RIS) has recently been proposed as a potential solution to these difficulties. By intelligently reflecting incident signals through its reconfigurable elements, a RIS can shape the wireless propagation environment, and it is considered a promising technology for the future smart radio environment [5], [6], [7]. Typically, a RIS consists of numerous nearly passive reflecting elements, each of which can be separately controlled by a smart controller to adjust the phase shift it applies to incident signals and thereby alter their propagation direction. By adjusting the phase shifts of reflected signals, the RIS can program and reshape the wireless propagation environment to strengthen the received power at intended receivers [8]. In this way, a RIS can improve throughput, coverage, energy efficiency (EE) and spectral efficiency (SE) [9].
Therefore, RIS-aided communication systems and related studies, e.g., the throughput maximization with different reflection patterns [10], the EE and SE trade-off [9] and the RIS-aided physical layer security [11], [12], have attracted extensive interest from academia.
In RIS-aided wireless communication systems, reaping the promised performance gain requires precise channel state information (CSI) for beamforming. However, the studies mentioned above assume perfect CSI, which is usually unavailable. In fact, CSI acquisition is a crucial task in realizing RIS-aided communication systems. Unlike the nodes of conventional systems, the RIS can only reflect signals: it can neither transmit nor receive pilots, nor process signals. Consequently, the individual RIS-user and RIS-base station (BS) channels cannot be acquired with pilot-based channel estimation methods. Only the cascaded BS-RIS-user channel can be estimated, by properly designing the transmission protocol and the reflection pattern of the RIS.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, estimating the cascaded channel is challenging for the following reasons:
1) The cascaded channel does not follow the complex Gaussian distribution. In this case, it is infeasible to derive the optimal minimum mean square error (MMSE) estimator, whose practical implementation requires multiple integrals.
2) The achievable linear MMSE (LMMSE) and least squares (LS) estimators exhibit a significant performance gap relative to the optimal MMSE estimator. Consequently, the achievable channel estimation accuracy in RIS-aided communication systems is limited.
3) The RIS generally consists of hundreds or even thousands of reflecting elements, leading to a high-dimensional cascaded channel. Therefore, the pilot overhead is excessively high, and analytical channel estimation approaches, such as the LS and LMMSE estimators, impose a heavy computational workload.
To address the difficulties of channel estimation in RIS-aided communication systems, extensive work has been carried out [13], [14], [15], [16], [17]. Mishra et al. [13] proposed a low-complexity optimal channel estimation protocol for the single-user system, in which only one RIS reflecting element is switched on in each time slot while the rest are switched off. In this case, the BS receives pilots without inter-element interference from the RIS, and the cascaded channels can be estimated element by element. However, the signal-to-noise ratio (SNR) obtained at the BS is limited, as only one active element reflects the pilots in each time slot. Moreover, the massive number of reflecting elements introduces an exceedingly long latency. Jensen et al. [14] activated all RIS elements to enhance the received SNR and shorten the latency. In particular, they designed a RIS activation pattern that achieves the minimum variance unbiased estimate of the cascaded channel. The pilot cost in [13], [14] is determined by the product of the numbers of antennas, RIS elements and users, which could hinder the application of RISs with massive numbers of reflecting elements. On the other hand, some works have been dedicated to lowering the pilot overhead by designing efficient schemes. He et al. [15] first formulated channel estimation in RIS-aided MIMO systems as the recovery of a sparse channel matrix, exploiting the sparsity of the spatial channel matrix through compressed sensing to reduce the pilot overhead.
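To make the contrast between the ON/OFF scheme of [13] and the all-elements-on scheme of [14] concrete, the following Python sketch compares the LS estimation error of the two reflection patterns under a deliberately simplified model, which is our assumption and not the exact model of [13], [14]: one BS antenna, one subcarrier, no direct link, a unit pilot in each of N_M time slots, and cascaded coefficients g_m ~ CN(0, 1). For the all-on case it uses a DFT reflection matrix as one common unit-modulus choice; we do not claim it is the exact pattern of [14]. The helpers `cn`, `ls_cascaded` and `empirical_mse` are hypothetical names.

```python
import numpy as np

def cn(rng, size, var=1.0):
    """Circularly-symmetric complex Gaussian samples CN(0, var)."""
    return np.sqrt(var / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))

def ls_cascaded(phi, y):
    """LS estimate of g from y = phi.T @ g + n, with a square invertible phi (N_M x N_M).

    phi[m, t] is the reflection coefficient of element m in time slot t.
    """
    return np.linalg.solve(phi.T, y)

def empirical_mse(phi, noise_var, trials, rng):
    """Average per-element squared error of the LS estimate over random channels and noise."""
    n_m = phi.shape[0]
    total = 0.0
    for _ in range(trials):
        g = cn(rng, n_m)                           # cascaded coefficient of each RIS element
        y = phi.T @ g + cn(rng, n_m, noise_var)    # one unit pilot per time slot
        total += np.mean(np.abs(ls_cascaded(phi, y) - g) ** 2)
    return total / trials

n_m, noise_var = 16, 0.1
rng = np.random.default_rng(0)
# ON/OFF: only element t is switched on in slot t, so phi is the identity matrix.
phi_onoff = np.eye(n_m)
# All elements on: slot t uses the t-th column of an N_M-point DFT matrix (unit-modulus phases).
idx = np.arange(n_m)
phi_dft = np.exp(-2j * np.pi * np.outer(idx, idx) / n_m)

mse_onoff = empirical_mse(phi_onoff, noise_var, 2000, rng)
mse_dft = empirical_mse(phi_dft, noise_var, 2000, rng)
print(f"ON/OFF MSE: {mse_onoff:.4f}, DFT-pattern MSE: {mse_dft:.5f}")
```

Because the DFT matrix satisfies Φ^H Φ = N_M I, the per-element LS error variance drops from σ² to roughly σ²/N_M, which is the SNR gain that motivates activating all elements; both patterns still need N_M time slots, consistent with the pilot-cost remark above.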
To reduce the pilot overhead introduced by massive numbers of RIS reflecting elements, [16], [17], [18], [19], [20], [21] proposed CSI estimation frameworks that group adjacent RIS elements into several sub-surfaces. Within each sub-surface, the RIS elements adopt the same configuration, so only one effective channel per sub-surface needs to be estimated. Therefore, the pilot overhead and computational cost of channel estimation can be dramatically reduced. Zheng et al. [16] proposed a transmission protocol for channel estimation, but it is not energy-efficient, since all RIS elements are always active. Moreover, we note that the pre-designed phase shift for each transmitted symbol must be configured when that symbol arrives at the RIS; otherwise, the propagation environment differs from the pre-designed one. The above works contribute significantly to channel estimation for RIS-aided communication scenarios. However, the limited channel estimation performance for cascaded channels has not been well addressed. Thus, a practical and efficient channel estimation scheme that estimates the RIS channels accurately is highly desirable.
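The effect of grouping adjacent elements into sub-surfaces can be checked with a few lines of Python. The sketch below is a hypothetical illustration (the sizes, the group size and the helper names `group_channels` and `expand_phases` are assumptions): when all elements in a sub-surface share one reflection coefficient, the noise-free received pilot depends only on the sum of their cascaded coefficients, so N_M/group_size effective channels replace N_M unknowns.

```python
import numpy as np

def group_channels(g, group_size):
    """Sum the cascaded coefficients of adjacent elements that share one reflection coefficient."""
    n_m = g.shape[0]
    assert n_m % group_size == 0, "N_M must be divisible by the sub-surface size"
    return g.reshape(n_m // group_size, group_size).sum(axis=1)

def expand_phases(phi_group, group_size):
    """Give every element of a sub-surface the same reflection coefficient."""
    return np.repeat(phi_group, group_size)

rng = np.random.default_rng(1)
n_m, group_size = 64, 4
g = rng.standard_normal(n_m) + 1j * rng.standard_normal(n_m)        # per-element cascaded channel
phi_group = np.exp(1j * rng.uniform(0, 2 * np.pi, n_m // group_size))  # one phase per sub-surface

# The noise-free received pilot is identical whether computed per element or per sub-surface.
y_elements = np.sum(g * expand_phases(phi_group, group_size))
y_groups = np.sum(group_channels(g, group_size) * phi_group)
print(n_m // group_size, "effective unknowns instead of", n_m)
```

The price of this reduction is that elements in the same sub-surface can no longer be configured individually, so the subsequent reflection design becomes coarser.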
Deep learning (DL) [22] has been widely used in RIS-aided communication systems [23], [24], [25], e.g., for deep reinforcement learning (DRL)-based SNR maximization, DRL-based secure RIS-aided multi-user communication, and DL-based active or passive beamforming design. Note that these promising DL-based approaches still require perfect CSI, which is hard to obtain in practice. To date, DL-based channel estimation methods have been widely investigated in various conventional communication systems [26], [27], [28]. However, DL-based techniques have not been thoroughly investigated for channel estimation in RIS-aided wireless communication scenarios, although some works have designed DL-based frameworks to improve channel estimation performance [29], [30], [31], [32].
These DL-based algorithms aim to overcome the limited channel estimation performance in RIS-aided communication systems. Liu et al. [29] developed a convolutional neural network (CNN)-based deep residual network (CDRN) to improve channel estimation performance in RIS-aided multi-user communication systems. Although the estimation performance of CDRN is promising, its inputs are generated by the LS estimator, which is computationally costly because of the high-dimensional cascaded channel. Kundu et al. [30] adopted two CNN-based methods, i.e., the denoising CNN (DnCNN) and the fast and flexible denoising network (FFDNet), to perform channel denoising in a RIS-aided single-user multiple-input single-output (MISO) system. The noise map (the standard deviation of the additive noise in the LS estimate) is concatenated with the LS channel matrix as the input to FFDNet, which provides flexibility in handling different noise levels. However, prior knowledge of the additive noise is usually unavailable in practice, and the inputs of both CNN-based methods are generated by the LS estimator. Liu et al. [31] proposed a complex-valued denoising convolutional neural network (CV-DnCNN) to enhance compressive sensing-based channel estimation in a RIS-aided mmWave massive MIMO system. Jin et al. [32] proposed two residual neural networks, i.e., the single-scale enhanced deep residual (EDSR) and multi-scale enhanced deep residual (MDSR) networks, to recover the channel matrix from sparse RIS channels in a time division duplex (TDD) RIS-aided mmWave communication system. In particular, the RISs in [31] and [32] are hybrid: a few reflecting elements are active and connected to the baseband unit (i.e., a radio frequency (RF) chain) for signal processing. In addition, the RIS channel must first be estimated at the receiver to generate the inputs of CV-DnCNN, EDSR and MDSR.
Note that the methods in [29], [30], [31], [32] require estimated channels of the RIS-aided system as the inputs of their neural networks; obtaining these inputs is computationally intensive and strongly affects the performance of the networks. Pilot-based channel estimation is the most commonly used method to acquire CSI in OFDM systems. The channels at pilot positions are first estimated using the pilots known at both the transmitter and the receiver, and the channels at data positions are then computed from these estimates by interpolation, e.g., spline or linear interpolation. Interestingly, image super-resolution (SR), which aims at recovering a high-resolution (HR) image from a low-resolution (LR) and noisy image, is analogous to this channel estimation procedure. Various works have estimated channels in non-RIS communication scenarios based on image SR algorithms [33], [34], [35]. Ouyang et al. [33] proposed the channel super-resolution neural network (CSRNet) for channel estimation in underwater acoustic (UWA) OFDM communications. Soltani et al. [34] developed a two-stage CNN-based channel estimation scheme (ChannelNet) to recover the channel matrix from initial estimates of the channels at pilot positions. ChannelNet is the concatenation of a super-resolution convolutional neural network (SRCNN) and a denoising convolutional neural network (DnCNN). Li et al. [35] developed a deep residual channel estimation network (ReEsNet) to perform CSI estimation in downlink OFDM systems.
Note that CSRNet and ChannelNet generate the complete channel matrix by spline and bicubic interpolation, which are fixed and cannot be trained. ReEsNet adopts a transposed convolution layer as the up-sampling layer to upscale the LR channel matrix to the desired size. However, this up-sampling layer is the penultimate layer, and only one convolutional layer follows it to generate the output, without denoising. Moreover, ChannelNet is not an end-to-end model, and its estimation performance needs further improvement. Motivated by this, we formulate channel estimation as an image SR problem and propose a revised SRCNN with a trainable interpolation method that generates a coarse channel matrix, which is then fed to a DnCNN that denoises the channel matrix and further improves the channel estimation performance. In this work, we concentrate on channel estimation in RIS-aided multi-user MIMO-OFDM systems with the proposed DL-based scheme, SRDnNet. In our previous work [36], we modeled channel estimation as a denoising problem and proposed a deep residual learning-based channel estimation method for downlink RIS-aided OFDM systems. Unlike this work, in [36] each user collects its own dataset to train a local model and transmits only the model weights to the BS, which updates the global model. The updated global model is then broadcast to all users for the next training round until convergence.
The main contributions of this paper are summarized as follows:
1) To estimate the direct and cascaded channels separately, we propose a three-stage transmission protocol in which the pre-designed phase shift in a time slot corresponds to the symbols transmitted in the same time slot. With this protocol, mismatches between pre-designed phase shifts and symbols are avoided. Moreover, the RIS works energy-efficiently, since it activates its elements only when receiving signals from users.
2) Unlike related works, we formulate the channel estimation problem in a RIS-aided multi-user MIMO-OFDM communication system as an image SR problem. Based on this, we develop a deep CNN-based algorithm, SRDnNet, that accurately recovers the channel matrix from the estimated channels at pilot positions. Owing to the interpolation and denoising capabilities of the enhanced SRCNN and DnCNN, SRDnNet further improves the channel estimation accuracy.
3) Comprehensive simulations validate the effectiveness of SRDnNet under different SNRs and channel dimensions. The results show that SRDnNet outperforms the benchmarks without prior knowledge of the direct and cascaded channels.
The rest of this paper is organized as follows. Section II presents the system model of RIS-aided multi-user MIMO-OFDM communication systems and the transmission protocol. Section III presents two conventional CSI estimators as benchmarks, followed by the proposed SRDnNet. Section IV provides simulation results for the proposed method in RIS-aided wireless communication systems. Finally, Section V concludes this work.
In this paper, $\mathbf{I}_M$ denotes the $M \times M$ identity matrix. $(\cdot)^{-1}$ denotes the matrix inverse. $(\cdot)^T$ and $(\cdot)^H$ stand for the transpose and Hermitian transpose, respectively. $\mathbb{E}(\cdot)$ denotes statistical expectation. $\|\cdot\|_F$ represents the Frobenius norm of a matrix. $\odot$ denotes the Hadamard (element-wise) product. $./$ denotes right element-wise division, which divides each element of the left operand by the corresponding element of the right operand. $\mathbb{C}$ denotes the set of complex numbers, so that $\mathbb{C}^{M\times N}$ is the space of $M \times N$ complex matrices. In addition, $\mathcal{CN}(\mu, \sigma^2)$ denotes the complex Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

II. SYSTEM MODEL AND TRANSMISSION PROTOCOL
As shown in Fig. 1, we consider an uplink RIS-aided multi-user MIMO-OFDM system in which a RIS assists the communication between one BS and K users. The BS is equipped with an $N_{BS}$-antenna array, and the RIS comprises $N_M$ reflecting elements serving the BS and the K single-antenna users. Both the BS antennas and the RIS reflecting elements are arranged as uniform linear arrays (ULAs).
The BS instructs the RIS controller to reconfigure the propagation environment. As shown in Fig. 1, $\mathbf{h}_1$, $\mathbf{h}_{2,k}$ and $\mathbf{d}_k$, $k \in \{1, 2, \ldots, K\}$, denote the RIS-BS, $U_k$-RIS and $U_k$-BS channels, respectively. Since plenty of scatterers are distributed between the BS and the users, the line-of-sight (LoS) path may not exist [37]; we therefore adopt the Rayleigh fading model for $\mathbf{d}_k \in \mathbb{C}^{N_{BS}\times 1}$. The channels $\mathbf{h}_1$ and $\mathbf{h}_{2,k}$ are modeled as Rician fading channels.
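For readers who want to generate channels following this model, the Python sketch below draws a Rician RIS-BS channel, a Rician user-RIS channel and a Rayleigh user-BS channel. It assumes the standard ULA steering vector with half-wavelength spacing (d/λ = 0.5), an outer-product LoS component for the RIS-BS link, and arbitrary angles and Rician factors; these choices, together with the helper names `ula_response` and `rician`, are illustrative assumptions rather than the simulation settings of this paper.

```python
import numpy as np

def ula_response(n, angle, d_over_lambda=0.5):
    """Steering vector [1, e^{j2*pi*d*sin(phi)/lambda}, ..., e^{j2*pi*(n-1)*d*sin(phi)/lambda}]^T."""
    return np.exp(1j * 2 * np.pi * d_over_lambda * np.arange(n) * np.sin(angle))

def rician(los, beta, rng):
    """Mix a deterministic LoS component with CN(0, 1) NLoS scattering using Rician factor beta."""
    nlos = (rng.standard_normal(los.shape) + 1j * rng.standard_normal(los.shape)) / np.sqrt(2)
    return np.sqrt(beta / (1 + beta)) * los + np.sqrt(1 / (1 + beta)) * nlos

rng = np.random.default_rng(2)
n_bs, n_m = 4, 32
# RIS -> BS: outer product of the BS (AoA) and RIS (AoD) steering vectors, N_BS x N_M.
h1_los = np.outer(ula_response(n_bs, np.deg2rad(30)), ula_response(n_m, np.deg2rad(-20)).conj())
h1 = rician(h1_los, beta=10.0, rng=rng)
# User k -> RIS: RIS steering vector at the user's AoA, length N_M.
h2k = rician(ula_response(n_m, np.deg2rad(45)), beta=10.0, rng=rng)
# User k -> BS: no LoS path, so Rayleigh fading with CN(0, 1) entries, length N_BS.
dk = (rng.standard_normal(n_bs) + 1j * rng.standard_normal(n_bs)) / np.sqrt(2)
```

Since the LoS entries have unit modulus and the NLoS entries have unit variance, the weights √(β/(1+β)) and √(1/(1+β)) keep the average power of each channel entry at one for any Rician factor.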

A. System Model
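We model the RIS-BS channel as
$$\mathbf{h}_1 = \sqrt{\frac{\beta_1}{1+\beta_1}}\,\bar{\mathbf{h}}_1 + \sqrt{\frac{1}{1+\beta_1}}\,\tilde{\mathbf{h}}_1,$$
where $\beta_1$ is the Rician factor of $\mathbf{h}_1$, $\bar{\mathbf{h}}_1$ is the LoS component, and the elements of the non-LoS (NLoS) component $\tilde{\mathbf{h}}_1$ follow the complex Gaussian distribution $\mathcal{CN}(0, 1)$. In the same way, we denote the channel of the k-th user as
$$\mathbf{h}_{2,k} = \sqrt{\frac{\beta_2}{1+\beta_2}}\,\bar{\mathbf{h}}_{2,k} + \sqrt{\frac{1}{1+\beta_2}}\,\tilde{\mathbf{h}}_{2,k}.$$
The LoS components can be expressed by the array response of an $N_M$-element ULA:
$$\mathbf{a}_{N_M}(\phi) = \left[1, e^{j2\pi d\sin\phi/\lambda}, \ldots, e^{j2\pi (N_M-1) d\sin\phi/\lambda}\right]^T,$$
where $\phi$ is the angle of departure (AoD) or angle of arrival (AoA) of a signal, $d$ is the antenna spacing, and $\lambda$ denotes the wavelength of the transmitted signal; the $N_{BS}$-element response $\mathbf{a}_{N_{BS}}(\phi)$ is defined analogously. Under this condition, the LoS component of $\mathbf{h}_1$ is expressed as
$$\bar{\mathbf{h}}_1 = \mathbf{a}_{N_{BS}}(\phi_{AoA,1})\,\mathbf{a}_{N_M}^H(\phi_{AoD,1}),$$
where $\phi_{AoD,1}$ is the AoD from the RIS reflecting elements and $\phi_{AoA,1}$ is the AoA at the BS. We express the LoS component of $\mathbf{h}_{2,k}$ as
$$\bar{\mathbf{h}}_{2,k} = \mathbf{a}_{N_M}(\theta_{AoA,k}),$$
where $\theta_{AoA,k}$ is the AoA from the k-th user to the ULA at the RIS.
The usable bandwidth is evenly partitioned into N subcarriers in the uplink RIS-aided multi-user MIMO-OFDM system. At the k-th user, the OFDM symbol $\mathbf{x}_k \triangleq [x_{1,k}, x_{2,k}, \ldots, x_{N,k}]^T$ is first transformed into the time domain by an N-point inverse discrete Fourier transform (IDFT). A cyclic prefix (CP) of length $L_{cp}$, which is longer than the delay spread of both the cascaded and direct channels, is then added to the transformed symbols. At the BS, we remove the CP of the signals received from the k-th user and perform an N-point DFT to transform them into the frequency domain. The received signal at each antenna of the BS is then
$$\mathbf{y}_k = \mathbf{X}_k\left(P_{L,r}^{(k)}\sum_{m=1}^{N_M}\left(\mathbf{H}_{2,m}^{(k)}\odot\mathbf{H}_{1,m}\right)\phi_{m,k} + P_{L,d}^{(k)}\mathbf{d}_k\right) + \mathbf{v}_k, \qquad (6)$$
where $N_M$ is the number of RIS elements, $\mathbf{y}_k \in \mathbb{C}^{N\times 1}$ and $\mathbf{X}_k = \mathrm{diag}(\mathbf{x}_k) \in \mathbb{C}^{N\times N}$ are the received and transmitted OFDM symbols, $\mathbf{H}_{1,m} \in \mathbb{C}^{N\times 1}$ is the channel frequency response (CFR) of the BS-RIS channel for the m-th element, $\mathbf{H}_{2,m}^{(k)} \in \mathbb{C}^{N\times 1}$ is the CFR of the channel between the RIS and the k-th user for the m-th element, $\phi_{m,k} = e^{j\varphi_{m,k}}$ is the phase shift introduced by the m-th element when the k-th user transmits, and $\mathbf{v}_k$ is the additive noise. To make the model more practical, we consider path loss [38]. $P_{L,r}^{(k)}$ is the path loss of the cascaded channel, given in (7).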
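The role of the CP can be verified numerically. The Python sketch below (hypothetical sizes, QPSK symbols, a single antenna, no noise) passes one OFDM symbol through a multipath channel whose delay spread is shorter than the CP and checks that, after CP removal and the DFT, each subcarrier simply sees the channel frequency response, i.e., the per-subcarrier model y = diag(x)H that underlies (6). The same holds for the superimposed direct-plus-cascaded channel, whose taps would simply replace `h_time`.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sc, l_cp, n_taps = 64, 16, 8        # N subcarriers, CP length, channel taps (requires L_cp >= n_taps - 1)
x = np.exp(1j * np.pi / 2 * rng.integers(0, 4, n_sc))      # QPSK symbol on each subcarrier
h_time = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2 * n_taps)

tx = np.fft.ifft(x)                               # N-point IDFT at the user
tx_cp = np.concatenate([tx[-l_cp:], tx])          # prepend the cyclic prefix
rx = np.convolve(tx_cp, h_time)[: l_cp + n_sc]    # multipath channel as a linear convolution, noise-free
y = np.fft.fft(rx[l_cp:])                         # remove the CP, then N-point DFT at the BS

H = np.fft.fft(h_time, n_sc)                      # channel frequency response on the N subcarriers
# Because the CP covers the delay spread, the linear convolution acts as a circular one,
# so every subcarrier sees a flat gain: y = diag(x) H.
assert np.allclose(y, x * H)
```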
We consider the scenario in which the distances between the RIS and the BS and between the users and the RIS are much larger than the size of the RIS. The path loss of the cascaded channel can then be simplified as in (8). Details of (7) and (8) can be found in [36].
We denote the path loss of the direct channel by $P_{L,d}^{(k)}$; it depends on the distance $d_k$ between the BS and the k-th user and on the path loss exponent $\gamma$.
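Using $\mathbf{H}_{k,m} \triangleq \mathbf{H}_{2,m}^{(k)} \odot \mathbf{H}_{1,m}$ to express the equivalent CFR of the k-th user-RIS-BS channel associated with the m-th element, and defining $\mathbf{H}_k \triangleq [\mathbf{H}_{k,1}, \ldots, \mathbf{H}_{k,N_M}] \in \mathbb{C}^{N\times N_M}$, (6) can be rewritten as
$$\mathbf{y}_k = \mathbf{X}_k\left(P_{L,r}^{(k)}\mathbf{H}_k\boldsymbol{\phi}_k + P_{L,d}^{(k)}\mathbf{d}_k\right) + \mathbf{v}_k,$$
where $\boldsymbol{\phi}_k \triangleq [\phi_{1,k}, \phi_{2,k}, \ldots, \phi_{N_M,k}]^T$ denotes the phase-shift vector. Therefore, the main task is to estimate the superimposed CFR of the whole channel, $\mathbf{h}_k = P_{L,r}^{(k)}\mathbf{H}_k\boldsymbol{\phi}_k + P_{L,d}^{(k)}\mathbf{d}_k = [h_{k,0}, h_{k,1}, \ldots, h_{k,N-1}]^T$. By stacking the received signals of all antennas at the BS, we obtain the concise form
$$\mathbf{Y}_k = \mathbf{X}_k\mathbf{G}_k + \mathbf{V}_k,$$
where $\mathbf{Y}_k \triangleq [\mathbf{y}_{k,1}, \mathbf{y}_{k,2}, \ldots, \mathbf{y}_{k,N_{BS}}] \in \mathbb{C}^{N\times N_{BS}}$ denotes the signal received at the BS from the k-th user, $\mathbf{V}_k$ is the stacked additive noise, and $\mathbf{G}_k \in \mathbb{C}^{N\times N_{BS}}$ denotes the superimposed CFR of the whole channel between the k-th user and the BS. To estimate the cascaded and direct channels in practice, we design a novel three-stage transmission protocol, which is described in the following subsection.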
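The equivalence between the per-element form of the received-signal model and its matrix form can be checked directly. In the Python sketch below (a single BS antenna, random CFRs, and hypothetical path-loss scalars `p_r` and `p_d` standing in for the path-loss terms), the columns of `Hk` are the Hadamard products of the user-RIS and BS-RIS CFRs, and the superimposed CFR computed element by element matches the matrix-vector form.

```python
import numpy as np

rng = np.random.default_rng(4)
n_sc, n_m = 16, 8

def cn(*shape):
    """CN(0, 1) samples of the given shape."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H1 = cn(n_sc, n_m)      # column m: CFR of the BS-RIS link for element m (one BS antenna)
H2 = cn(n_sc, n_m)      # column m: CFR of the user k-RIS link for element m
d = cn(n_sc)            # CFR of the direct user k-BS link
phi = np.exp(1j * rng.uniform(0, 2 * np.pi, n_m))   # unit-modulus RIS reflection coefficients
p_r, p_d = 1e-2, 1e-1   # hypothetical path-loss scalars for the cascaded and direct links

# Per-element form: sum over m of (H2[:, m] * H1[:, m]) * phi_m, plus the direct link.
h_sum = p_r * sum((H2[:, m] * H1[:, m]) * phi[m] for m in range(n_m)) + p_d * d
# Matrix form: column-wise Hadamard product gives H_k, then a matrix-vector product with phi.
Hk = H2 * H1
h_mat = p_r * Hk @ phi + p_d * d
assert np.allclose(h_sum, h_mat)
```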

B. Transmission Protocol
To implement the RIS in practical wireless communication systems, we present a novel transmission protocol that estimates the direct and cascaded channels separately. The reflecting elements of the RIS are passive, so they cannot process the received signals to estimate the CSI. The RIS needs to know when the user transmits signals over the uplink so that it can prepare the proper phase shift $\phi_{m,k} = e^{j\varphi_{m,k}}$. Otherwise, the BS cannot estimate the channel correctly because the propagation environment differs from the pre-designed one, e.g., the phase shift pre-designed for the m-th symbol is applied to the (m + 1)-th or (m − 1)-th symbol.
The proposed transmission protocol with three sub-frames is illustrated in Fig. 2. The first sub-frame is used to estimate the channel $P_{L,d}^{(k)}\mathbf{d}_k$ between the k-th user and the BS while the RIS is turned off. Based on the feedback information from the first sub-frame, the RIS controller learns that the k-th user will establish a link through the RIS, and the RIS is then turned on for the upcoming signal from that user. Hence, the RIS does not need to be active all the time, which makes this an energy-efficient working mode. During the second sub-frame, which is used to estimate the channel $P_{L,r}^{(k)}\mathbf{H}_k\boldsymbol{\phi}_k + P_{L,d}^{(k)}\mathbf{d}_k$, each pilot symbol is paired with a pre-designed phase shift of the RIS, both of which are known at the BS [16]. After removing the effect of the direct channel, the BS can estimate the cascaded channel $P_{L,r}^{(k)}\mathbf{H}_k$ from the received pilot symbols and the phase shifts. With the estimated cascaded and direct channels, a RIS reflection pattern that maximizes the system throughput can be designed; this is left for future work. Based on the feedback information from the second sub-frame, the RIS controller adjusts the phase shift of each element to reconfigure the propagation environment for data transmission in the third sub-frame.
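The first two sub-frames can be sketched end to end in Python. Under simplifying assumptions (noise-free reception, one BS antenna, path loss absorbed into `Hk` and `d`, QPSK pilots, and a DFT reflection pattern with M = N_M pilot symbols in the second sub-frame), the BS first estimates the direct channel with the RIS off, then removes it from each second-sub-frame symbol and inverts the known phase-shift matrix to recover the cascaded CFR matrix. The DFT pattern and all variable names are illustrative choices, not the protocol parameters of this paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n_sc, n_m = 32, 8

def cn(*shape):
    """CN(0, 1) samples of the given shape."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def qpsk():
    """One OFDM symbol of unit-modulus QPSK pilots."""
    return np.exp(1j * np.pi / 4 * (2 * rng.integers(0, 4, n_sc) + 1))

Hk = cn(n_sc, n_m)   # cascaded CFR matrix (path loss absorbed), unknown to the BS
d = cn(n_sc)         # direct CFR (path loss absorbed), unknown to the BS

# Sub-frame 1: RIS off, one pilot symbol -> estimate the direct channel by y0 ./ x0.
x0 = qpsk()
y0 = x0 * d
d_hat = y0 / x0

# Sub-frame 2: RIS on, M = N_M pilot symbols; symbol t is paired with phase vector Phi[:, t].
idx = np.arange(n_m)
Phi = np.exp(-2j * np.pi * np.outer(idx, idx) / n_m)   # hypothetical DFT reflection pattern
Z = np.empty((n_sc, n_m), dtype=complex)
for t in range(n_m):
    xt = qpsk()
    yt = xt * (Hk @ Phi[:, t] + d)     # received symbol t
    Z[:, t] = yt / xt - d_hat          # remove the pilot and the direct-channel contribution
Hk_hat = Z @ np.linalg.inv(Phi)        # Z = Hk @ Phi, hence Hk = Z @ Phi^{-1}
```

With noise the same steps apply; the estimation error then depends on the conditioning of the phase-shift matrix, and an orthogonal unit-modulus pattern such as the DFT keeps Φ^{-1} = Φ^H/N_M perfectly conditioned.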
The pilot overhead of this transmission protocol scales with the number of pilot symbols and the number of pilots per symbol, $N_P$. With too few pilots, the CSI accuracy cannot meet the requirements of phase shift and precoding design, while too many pilots lead to low SE and data rates. Thus, there is a trade-off between channel estimation accuracy and pilot overhead. To estimate the channel accurately with relatively low pilot overhead in RIS-aided multi-user MIMO-OFDM systems, we propose a data-driven framework, SRDnNet, in the next section.

III. SUPER-RESOLUTION-BASED CHANNEL ESTIMATION
As shown in Fig. 2, the first sub-frame is used to estimate the direct channel while the RIS is turned off, which yields the CSI of the direct link $\mathbf{d}_k$. A comb-type pilot arrangement is adopted to estimate the cascaded channel. For each user, the received OFDM symbol at pilot positions can be written as
$$\mathbf{Y}_p = \mathbf{X}_p\mathbf{G}_p + \mathbf{v}_p,$$
where $\mathbf{G}_p \in \mathbb{C}^{N_P\times N_{BS}}$ denotes the superimposed CFRs of the whole channel at the pilot positions, $\mathbf{X}_p = \mathrm{diag}(\mathbf{x}_p) \in \mathbb{C}^{N_P\times N_P}$ is the diagonal matrix of the pilot sequence $\mathbf{x}_p$, and $N_P$ is the number of pilots in each OFDM symbol. $\mathbf{Y}_p \in \mathbb{C}^{N_P\times N_{BS}}$ and $\mathbf{v}_p \in \mathbb{C}^{N_P\times N_{BS}}$ denote the received signals and the additive noise at pilot positions. With the pilot observations $\mathbf{Y}_p$, we can employ analytical methods, e.g., LS or LMMSE, both of which are derived below as benchmarks, to acquire $\mathbf{G}$. After eliminating the effect of $\mathbf{d}_k$, we obtain the cascaded channel $\mathbf{H}$. For simplicity, we only explain the estimation of $\mathbf{G}$ in the following.
A. Conventional Channel State Information Estimator
1) LS Estimator: The LS channel estimator finds the $\hat{\mathbf{G}}$ that minimizes the cost function $\|\mathbf{Y}_p - \mathbf{X}_p\mathbf{G}_p\|_F^2$. On pilot tones, it is given by
$$\hat{\mathbf{G}}_p^{LS} = \mathbf{X}_p^{-1}\mathbf{Y}_p, \qquad (14)$$
where $\hat{\mathbf{G}}_p^{LS}$ is the LS estimate of the channel at the pilot positions of each OFDM symbol (a column of length $N_P$ per BS antenna). To obtain the channel coefficients at other positions, interpolation methods such as linear interpolation and cubic spline interpolation are applied [39], [40]. Because this method needs no prior knowledge of the channel, the LS estimator is widely used.
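A minimal Python sketch of comb-type LS estimation followed by linear interpolation is given below. It assumes one BS antenna, QPSK pilots, a pilot on every fourth subcarrier and a smooth two-tap channel; the helper names are hypothetical. `np.interp` is applied separately to the real and imaginary parts and holds the last pilot value constant beyond the final pilot, which is one simple edge-handling choice among several.

```python
import numpy as np

def ls_at_pilots(y, x, pilot_idx):
    """LS estimate on pilot tones: element-wise division of received by transmitted pilots, as in (14)."""
    return y[pilot_idx] / x[pilot_idx]

def interpolate_channel(g_pilot, pilot_idx, n_sc):
    """Linearly interpolate the real and imaginary parts onto all N subcarriers."""
    k = np.arange(n_sc)
    return np.interp(k, pilot_idx, g_pilot.real) + 1j * np.interp(k, pilot_idx, g_pilot.imag)

n_sc, spacing = 64, 4
pilot_idx = np.arange(0, n_sc, spacing)          # comb-type: every 4th subcarrier carries a pilot
rng = np.random.default_rng(6)
x = np.exp(1j * np.pi / 4 * (2 * rng.integers(0, 4, n_sc) + 1))   # unit-modulus QPSK symbols
# Smooth hypothetical CFR: two taps give a slowly varying response across subcarriers.
g = np.fft.fft(np.array([0.8, 0.3 + 0.2j]), n_sc)
noise = 0.05 * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
y = x * g + noise

g_hat = interpolate_channel(ls_at_pilots(y, x, pilot_idx), pilot_idx, n_sc)
print("MSE:", np.mean(np.abs(g_hat - g) ** 2))
```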
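We then employ the discrete Fourier transform (DFT)-based channel estimation technique to further improve the channel estimation performance [41]. Let $\hat{G}[k]$ denote the estimated channel at the k-th subcarrier. The IDFT of the channel estimate is
$$\hat{g}[n] = \frac{1}{N}\sum_{k=0}^{N-1}\hat{G}[k]\,e^{j2\pi kn/N}, \quad n = 0, 1, \ldots, N-1.$$
Keeping only the first $L$ taps, which contain the channel energy, and setting the remaining taps to zero suppresses the noise; these $L$ elements are then transformed back to the frequency domain as
$$\hat{G}_{DFT}[k] = \sum_{n=0}^{L-1}\hat{g}[n]\,e^{-j2\pi kn/N}, \quad k = 0, 1, \ldots, N-1.$$
The maximum delay $L$ must be known before the DFT-based LS estimator can be applied.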
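The DFT-based refinement can be sketched as follows. The example assumes pilots on every subcarrier (so the LS estimate covers all N subcarriers), a known maximum delay L equal to the number of channel taps, and unit-modulus pilots so that the LS error is simply the additive noise; the function name `dft_denoise` is hypothetical. Zeroing the taps beyond L keeps only L of the N noise components, so the noise power falls by roughly L/N.

```python
import numpy as np

def dft_denoise(g_ls, n_taps):
    """Keep only the first L taps of the IDFT of the LS estimate, then return to the frequency domain."""
    g_time = np.fft.ifft(g_ls)     # g[n] = (1/N) sum_k G[k] e^{j2*pi*k*n/N}
    g_time[n_taps:] = 0            # taps beyond the maximum delay L are treated as noise
    return np.fft.fft(g_time)      # G_DFT[k] = sum_{n<L} g[n] e^{-j2*pi*k*n/N}

rng = np.random.default_rng(7)
n_sc, n_taps, noise_var = 64, 6, 0.1
h = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2 * n_taps)
g = np.fft.fft(h, n_sc)                                     # true CFR on all N subcarriers
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
g_ls = g + noise                                            # LS estimate with unit-modulus pilots
g_dft = dft_denoise(g_ls, n_taps)
print("LS MSE:", np.mean(np.abs(g_ls - g) ** 2), "DFT MSE:", np.mean(np.abs(g_dft - g) ** 2))
```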
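2) MMSE Estimator: Unlike the LS estimator, which needs no prior knowledge of the channel, the MMSE estimator assumes that the prior probability density function (PDF) $p(\mathbf{G})$ of the channel is known. Following the Bayesian approach, this prior knowledge further improves the estimation accuracy, and the estimate is
$$\hat{\mathbf{G}}_{MMSE} = \mathbb{E}\{\mathbf{G}\,|\,\mathbf{X}\} = \int \mathbf{G}\,p(\mathbf{G}\,|\,\mathbf{X})\,d\mathbf{G},$$
where
$$p(\mathbf{G}\,|\,\mathbf{X}) = \frac{p(\mathbf{X}\,|\,\mathbf{G})\,p(\mathbf{G})}{\int p(\mathbf{X}\,|\,\mathbf{G})\,p(\mathbf{G})\,d\mathbf{G}}$$
is the posterior PDF of $\mathbf{G}$ given $\mathbf{X}$ and $p(\mathbf{X}\,|\,\mathbf{G})$ is the conditional PDF.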
3) LMMSE Estimator: The MMSE estimator is infeasible to implement because it requires the PDF of the channel, which is usually impossible to obtain. The LMMSE estimator, which only needs second-order statistics, is therefore commonly used in practice. Starting from the LS solution (14), the LMMSE estimator applies a filtering matrix $\mathbf{W}$:
$$\hat{\mathbf{G}}_{LMMSE} = \mathbf{W}\hat{\mathbf{G}}_{LS}.$$
The LMMSE estimator finds the $\mathbf{W}$ that minimizes the mean square error (MSE)
$$J(\mathbf{W}) = \mathbb{E}\left\{\left\|\mathbf{G} - \mathbf{W}\hat{\mathbf{G}}_{LS}\right\|_F^2\right\}, \qquad (20)$$
and solving (20) yields
$$\mathbf{W}_{LMMSE} = \mathbf{R}_{G\hat{G}_{LS}}\mathbf{R}_{\hat{G}_{LS}\hat{G}_{LS}}^{-1}, \qquad \hat{\mathbf{G}}_{LMMSE} = \mathbf{R}_{G\hat{G}_{LS}}\mathbf{R}_{\hat{G}_{LS}\hat{G}_{LS}}^{-1}\hat{\mathbf{G}}_{LS},$$
where $\hat{\mathbf{G}}_{LS}$ is the LS estimate on the pilot tones, $\mathbf{R}_{G\hat{G}_{LS}} = \mathbb{E}\{\mathbf{G}\hat{\mathbf{G}}_{LS}^H\}$ is the cross-correlation matrix between the true channel and the LS estimate in the frequency domain, and $\mathbf{R}_{\hat{G}_{LS}\hat{G}_{LS}} = \mathbb{E}\{\hat{\mathbf{G}}_{LS}\hat{\mathbf{G}}_{LS}^H\}$ is the autocorrelation matrix of the LS estimate. The elements of these matrices can be computed from an exponentially decreasing power delay profile. By exploiting the correlation between the channels on different subcarriers, the LMMSE estimator finds a better linear estimate of the channel matrix.
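The LMMSE filter can be built from an exponentially decreasing power delay profile, as described above. The Python sketch below assumes independent channel taps with an exponential PDP, unit-modulus pilots on all subcarriers (so that R_{GĜ_LS} = R_GG and R_{Ĝ_LSĜ_LS} = R_GG + σ²I), and a known noise variance; these assumptions and the helper names `exp_pdp`, `freq_correlation` and `lmmse_filter` are illustrative.

```python
import numpy as np

def exp_pdp(n_taps, decay):
    """Exponentially decreasing power delay profile, normalized to unit total power."""
    p = np.exp(-np.arange(n_taps) / decay)
    return p / p.sum()

def freq_correlation(pdp, n_sc):
    """R_GG[k, l] = sum_n p_n e^{-j2*pi*(k-l)*n/N} for independent taps with powers p_n."""
    F = np.exp(-2j * np.pi * np.outer(np.arange(n_sc), np.arange(len(pdp))) / n_sc)
    return F @ np.diag(pdp) @ F.conj().T

def lmmse_filter(r_gg, noise_var):
    """W = R_{G,G_LS} R_{G_LS,G_LS}^{-1} = R_GG (R_GG + sigma^2 I)^{-1} for unit-modulus pilots."""
    n = r_gg.shape[0]
    return r_gg @ np.linalg.inv(r_gg + noise_var * np.eye(n))

rng = np.random.default_rng(8)
n_sc, n_taps, noise_var, trials = 64, 8, 0.1, 500
pdp = exp_pdp(n_taps, decay=2.0)
W = lmmse_filter(freq_correlation(pdp, n_sc), noise_var)

err_ls = err_lmmse = 0.0
for _ in range(trials):
    h = np.sqrt(pdp / 2) * (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps))
    g = np.fft.fft(h, n_sc)                  # true CFR, consistent with freq_correlation
    g_ls = g + np.sqrt(noise_var / 2) * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
    err_ls += np.mean(np.abs(g_ls - g) ** 2) / trials
    err_lmmse += np.mean(np.abs(W @ g_ls - g) ** 2) / trials
print(f"LS MSE: {err_ls:.4f}, LMMSE MSE: {err_lmmse:.4f}")
```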
Note that when the channel is Gaussian-distributed, the optimal MMSE estimator and the LMMSE estimator coincide [42]; both improve the channel estimation accuracy by using second-order statistics. However, the cascaded channel in RIS-aided communication systems usually does not follow a Gaussian distribution, since each cascaded coefficient is the product of the BS-RIS and RIS-user coefficients, scaled by the phase shift introduced by the RIS. Thus, the optimal MMSE estimator is computationally costly, as it requires the posterior PDF and an integration over the true channel distribution, which is usually unknown in practice. Moreover, the LMMSE estimator exhibits a performance gap relative to the optimal MMSE estimator when the channels are not Gaussian-distributed [42]. To address these problems, in the following subsection we adopt a data-driven approach to design a suitable model, SRDnNet.
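Why the optimal MMSE estimator is costly for cascaded channels can be illustrated with a scalar toy model in Python. We assume one cascaded coefficient g = a·b with a, b ~ CN(0, 1), a product of two Gaussian links and hence not Gaussian, observed as y = g + n after removing a unit pilot. The LMMSE estimator needs only E|g|² = 1, whereas the posterior mean has no simple closed form; the sketch approximates the MMSE integral by importance sampling from the prior, which costs n_test × n_prior likelihood evaluations. The model, the sizes and the approximation method are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)

def cn(size, var=1.0):
    """Circularly-symmetric complex Gaussian samples CN(0, var)."""
    return np.sqrt(var / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))

noise_var, n_test, n_prior = 0.5, 500, 5000
g = cn(n_test) * cn(n_test)          # product of two CN(0, 1) links: E|g|^2 = 1, but not Gaussian
y = g + cn(n_test, noise_var)        # noisy observation

g_lmmse = y / (1 + noise_var)        # LMMSE: uses only E|g|^2 and the noise variance

# Posterior mean  int g p(y|g) p(g) dg / int p(y|g) p(g) dg, approximated with prior samples g_i
# and Gaussian likelihood weights exp(-|y - g_i|^2 / sigma^2).
g_prior = cn(n_prior) * cn(n_prior)
log_w = -np.abs(y[:, None] - g_prior[None, :]) ** 2 / noise_var    # n_test x n_prior
w = np.exp(log_w - log_w.max(axis=1, keepdims=True))                # stabilize before normalizing
g_mmse = (w @ g_prior) / w.sum(axis=1)

mse_ls = np.mean(np.abs(y - g) ** 2)
mse_lmmse = np.mean(np.abs(g_lmmse - g) ** 2)
mse_mmse = np.mean(np.abs(g_mmse - g) ** 2)
print(f"LS: {mse_ls:.3f}, LMMSE: {mse_lmmse:.3f}, Monte Carlo MMSE: {mse_mmse:.3f}")
```

Even for one scalar coefficient, the approximation needs thousands of likelihood evaluations per observation; for a high-dimensional cascaded channel matrix the same integral becomes impractical, which is the motivation for the data-driven approach.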

B. Image Super-Resolution and Channel Estimation
The objective of this work is to estimate RIS channels from received pilots. For pilot-based channel estimation, the ideal input and output of a DL model would be the received pilots and the estimated channels, respectively, since no operation on the pilots would be needed and the time and computation consumption would be minimal. In our experiments, however, this approach proved difficult to implement: the mapping to be learned is too complex, and training never converged. An alternative is therefore a model whose input-output pairs are the estimated channels at pilot positions and the full channel matrix, which requires only a single element-wise division of the received pilots by the transmitted pilots known at both the BS and the user sides.
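The single division that turns received pilots into the network input can be sketched in Python. The sketch assumes one transmitted pilot value per pilot position, broadcast across the N_BS antennas, and splits the complex result into real and imaginary channels to form a 2 × Q1 × N_BS real array; the function name `pilot_input` and the sizes are hypothetical.

```python
import numpy as np

def pilot_input(y_p, x_p):
    """Build the two-channel real network input from one element-wise division.

    y_p: complex received pilots, shape (Q1, N_BS), with Q1 = N_P * M.
    x_p: complex transmitted pilots, shape (Q1,), broadcast across the N_BS antennas.
    Returns a real array of shape (2, Q1, N_BS): [real part, imaginary part].
    """
    g_p = y_p / x_p[:, None]
    return np.stack([g_p.real, g_p.imag], axis=0)

rng = np.random.default_rng(10)
n_p, m_sym, n_bs = 8, 4, 16
q1 = n_p * m_sym
x_p = np.exp(1j * np.pi / 4 * (2 * rng.integers(0, 4, q1) + 1))      # QPSK pilots
g_true = rng.standard_normal((q1, n_bs)) + 1j * rng.standard_normal((q1, n_bs))
net_input = pilot_input(x_p[:, None] * g_true, x_p)                  # noise-free received pilots
```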
Recently, SRCNN, which transforms low-resolution images into high-resolution images, has been proposed for the image SR problem [43]. The image SR problem can be formulated as
$$\hat{\mathbf{I}}_h = F(\mathbf{I}_l; \theta), \qquad (22)$$
where $\hat{\mathbf{I}}_h$ is the recovered high-resolution image, $\mathbf{I}_l$ is the low-resolution and noisy image, and $F$ is the super-resolution technique parameterized by $\theta$.
Note that the input of pilot-based channel estimation in RIS-aided multi-user OFDM communication systems is the initial estimate of the channels at pilot positions, and the output is the estimated channel. This mirrors the formulation (22), so we formulate the problem as an image SR problem. We consider the estimates of the channels at pilot positions, $\hat{\mathbf{G}}_{p,2}$, in the second sub-frame, which contains M symbols:
$$\hat{\mathbf{G}}_{p,2} = \mathbf{Y}_{p,2} \,./\, \mathbf{X}_{p,2},$$
where $\hat{\mathbf{G}}_{p,2} \in \mathbb{R}^{2\times Q_1\times N_{BS}}$ is the low-resolution 2D input image with two channels (the real and imaginary parts), $Q_1 = N_P \times M$ and $N_{BS}$ are the length and width of the 2D image, and M is the number of symbols in the second sub-frame, as shown in Fig. 2. $\mathbf{Y}_{p,2}, \mathbf{X}_{p,2} \in \mathbb{C}^{Q_1\times N_{BS}}$ are the received and transmitted pilots in the second sub-frame, and the result of the division is split into its real and imaginary parts. The problem of estimating the whole channel $\hat{\mathbf{G}} \in \mathbb{R}^{2\times Q_2\times N_{BS}}$ can be written as
$$\hat{\mathbf{G}} = F(\hat{\mathbf{G}}_{p,2}; \theta), \qquad (24)$$
where $Q_2 = N \times M$ and $N_{BS}$ are the length and width of the output 2D image. Note that $\hat{\mathbf{G}}$ contains the cascaded channel, for which a closed-form MMSE estimator is difficult to derive. We therefore develop a data-driven model, named SRDnNet, to estimate the channel in the RIS-aided multi-user MIMO-OFDM system. The input $\hat{\mathbf{G}}_{p,2}$ can be very large because of the large numbers of antennas and symbols, and the complexity of the neural network grows directly with the input size. Dimension reduction techniques, e.g., principal component analysis (PCA), can reduce the dimension of the dataset, the computation cost and the data transmission overhead [44]. PCA is a popular multivariate approach that transforms a high-dimensional dataset into a low-dimensional one by keeping only the first few principal components, which retain the most important information in the dataset. We summarize the PCA method in Algorithm 1, where $P = 2 \times Q_1 \times N_{BS}$, $\dot{\mathbf{D}} = \{\dot{\mathbf{G}}_{p,2}^{(1)}, \ldots, \dot{\mathbf{G}}_{p,2}^{(N_s)}\}^T \in \mathbb{C}^{N_s\times P}$ (each sample flattened into one row), and n is the number of dimensions of the output low-dimensional dataset. The output is the compressed dataset $\mathbf{D}_n$.
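The steps of Algorithm 1 map onto a few NumPy calls. The sketch below treats each flattened real-valued input as one row of the N_s × P data matrix, forms the P × P covariance from the centered data (Ḋ^T Ḋ / N_s), and keeps the eigenvectors of the n largest eigenvalues; for complex-valued data the transpose would be replaced by the conjugate transpose. The function name `pca_compress` is hypothetical, and the synthetic rank-n dataset only checks that the projection is lossless when the data truly lie in an n-dimensional subspace.

```python
import numpy as np

def pca_compress(samples, n):
    """Sketch of Algorithm 1: samples has shape (N_s, P); returns (N_s, n) scores, the P x n basis and the mean."""
    mean = samples.mean(axis=0)                  # step 1: mean of all samples
    centered = samples - mean                    # step 2: subtract the mean from every sample
    s = centered.T @ centered / samples.shape[0] # step 3: P x P covariance matrix
    eigvals, eigvecs = np.linalg.eigh(s)         # step 4: eigh returns eigenvalues in ascending order,
    order = np.argsort(eigvals)[::-1]            #         so reorder them in descending order
    u_n = eigvecs[:, order[:n]]                  # step 5: eigenvectors of the n largest eigenvalues
    return centered @ u_n, u_n, mean

rng = np.random.default_rng(11)
n_s, p, n = 200, 2 * 8 * 4, 3                    # P = 2 x Q1 x N_BS for a tiny hypothetical input
latent = rng.standard_normal((n_s, n))
basis = np.linalg.qr(rng.standard_normal((p, n)))[0]   # orthonormal P x n basis
data = latent @ basis.T + 5.0                    # rank-n data plus a constant offset
scores, u_n, mean = pca_compress(data, n)
reconstruction = scores @ u_n.T + mean
```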

C. The Proposed SR-Based Channel Estimation Scheme
Based on the image SR model, ChannelNet [34] was developed for channel estimation, but it has the following limitations: 1) the choice of interpolation method strongly affects the final performance of the network; 2) after the initial interpolation, the input becomes larger than the raw input $\hat{\mathbf{G}}_{p,2}$, which significantly increases the computational complexity; 3) ChannelNet is not an end-to-end model and is trained in two stages.
To avoid these drawbacks of ChannelNet, we propose an end-to-end model named SRDnNet, illustrated in Fig. 3. SRDnNet is a concatenation of SRCNN and DnCNN. To learn the features of the complex input, we adopt two channels at the input layer. The first layer of SRCNN is a feature extraction and representation layer that extracts features of the low-resolution image and represents them in a high-dimensional space. The second layer non-linearly maps each feature map to another high-dimensional feature map representing a high-resolution patch. The third layer aggregates these patches into a high-resolution representation; since no interpolation is applied at the beginning, this representation differs in size from the ground-truth channel. We then use a dense layer and a reshape layer, rather than the usual up-sampling layer, to generate an initial estimate $\mathbf{G}_{init}$ of the same size as the ground-truth channel. Unlike a fixed, interpolation-based up-sampling layer, the dense layer is trainable and can learn how to aggregate all features to estimate the ground truth. Channels in different transmitted symbols are not independent, because the data transmission takes place within a coherence time, and each pixel of the dense-layer output depends on the high-dimensional representation of all estimated channels at pilot positions. This is why SRDnNet is expected to outperform ChannelNet. In short, SRCNN extracts high-dimensional features from the low-resolution image $\hat{\mathbf{G}}_{p,2}$ and aggregates these features to recover a high-resolution image $\mathbf{G}_{init}$. Note that ChannelNet effectively acts only as a denoising model, because its initial interpolation is not trainable. In contrast, SRCNN can learn an interpolation method beyond linear or polynomial constraints and thereby improve the interpolation accuracy. To further enhance performance, we cascade SRCNN with DnCNN, a well-established and efficient denoising model. DnCNN learns the features of the coarse channel matrix generated by SRCNN and of the additive noise; the element-wise subtraction before its output layer removes the predicted noise and yields an accurate denoised estimate $\hat{\mathbf{G}}$.
Algorithm 1: PCA-based dimension reduction
Input: $\dot{\mathbf{D}} = \{\dot{\mathbf{G}}_{p,2}^{(1)}, \ldots, \dot{\mathbf{G}}_{p,2}^{(N_s)}\}^T \in \mathbb{C}^{N_s\times P}$; n, the number of dimensions of the low-dimensional dataset.
Steps:
1: Calculate the mean of all samples.
2: Subtract the mean from each sample (row) of $\dot{\mathbf{D}}$.
3: Calculate the covariance matrix $\mathbf{S} = \dot{\mathbf{D}}^T\dot{\mathbf{D}} \in \mathbb{C}^{P\times P}$.
4: Calculate the eigenvalues and eigenvectors of $\mathbf{S}$ and arrange them in descending order of eigenvalue.
5: Select the n eigenvectors with the n largest eigenvalues to construct $\mathbf{U}_n \in \mathbb{C}^{P\times n}$.
Output: The compressed dataset $\mathbf{D}_n = \dot{\mathbf{D}}\mathbf{U}_n \in \mathbb{C}^{N_s\times n}$.
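To make the data flow of SRDnNet concrete, the following NumPy sketch runs an untrained forward pass: three SRCNN convolutions (9×9, 1×1 and 5×5 kernels, as in the original SRCNN design), a dense layer plus reshape that maps the Q1 × N_BS features to a Q2 × N_BS initial estimate G_init, and a shortened DnCNN branch whose predicted noise is subtracted element-wise. The filter counts, the three-layer DnCNN without batch normalization, the initialization and the tiny sizes are our assumptions for illustration and do not reproduce Table I; in practice the model would be built and trained with a deep learning framework. The convolution uses `numpy.lib.stride_tricks.sliding_window_view`, available in NumPy 1.20 and later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, w, b):
    """Zero-padded 'same' 2D convolution (cross-correlation, as in CNN libraries).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; b: (C_out,). Returns (C_out, H, W).
    """
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    patches = sliding_window_view(xp, (k, k), axis=(1, 2))      # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", patches, w) + b[:, None, None]

def relu(z):
    return np.maximum(z, 0.0)

def init_conv(rng, c_out, c_in, k):
    """He-style random weights and zero biases for one convolutional layer (hypothetical init)."""
    return rng.standard_normal((c_out, c_in, k, k)) * np.sqrt(2.0 / (c_in * k * k)), np.zeros(c_out)

def srdnnet_forward(x, params, out_shape):
    """x: (2, Q1, N_BS) pilot input -> (2, Q2, N_BS) channel estimate."""
    # SRCNN: feature extraction (9x9), non-linear mapping (1x1), reconstruction (5x5).
    f = relu(conv2d_same(x, *params["sr1"]))
    f = relu(conv2d_same(f, *params["sr2"]))
    f = conv2d_same(f, *params["sr3"])
    # Dense + reshape instead of a fixed interpolation: every output pixel sees all pilot features.
    w_d, b_d = params["dense"]
    g_init = (w_d @ f.ravel() + b_d).reshape(out_shape)
    # DnCNN-style residual branch: predict the noise, then subtract it element-wise.
    r = relu(conv2d_same(g_init, *params["dn1"]))
    r = relu(conv2d_same(r, *params["dn2"]))
    noise = conv2d_same(r, *params["dn3"])
    return g_init - noise

rng = np.random.default_rng(12)
q1, q2, n_bs = 8, 32, 4              # tiny hypothetical sizes: Q1 = N_P * M, Q2 = N * M
out_shape = (2, q2, n_bs)
n_out = int(np.prod(out_shape))
params = {
    "sr1": init_conv(rng, 16, 2, 9),
    "sr2": init_conv(rng, 8, 16, 1),
    "sr3": init_conv(rng, 2, 8, 5),
    "dense": (rng.standard_normal((n_out, 2 * q1 * n_bs)) * 0.05, np.zeros(n_out)),
    "dn1": init_conv(rng, 8, 2, 3),
    "dn2": init_conv(rng, 8, 8, 3),
    "dn3": init_conv(rng, 2, 8, 3),
}
g_hat = srdnnet_forward(rng.standard_normal((2, q1, n_bs)), params, out_shape)
```

Because the dense layer connects every output pixel to all pilot-position features, it plays the role of the learnable interpolation described above; its weight matrix, however, grows with Q1 × Q2 × N_BS², which is one reason the input size matters.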
The hyperparameters of SRDnNet are listed in Table I. All the hyperparameters follow the default settings in [43] and [45].
1) Offline Training: We define $(\mathbf{X}, \mathbf{G}) = \{(\mathbf{X}^{(1)}, \mathbf{G}^{(1)}), (\mathbf{X}^{(2)}, \mathbf{G}^{(2)}), \dots, (\mathbf{X}^{(N_s)}, \mathbf{G}^{(N_s)})\}$ as the training examples, where $\mathbf{X}^{(i)}$ and $\mathbf{G}^{(i)}$ are the received pilots and the true channel in the $i$-th frame, respectively. To obtain the initially estimated channels at pilot positions, as defined in the image super-resolution formulation (24), we send $\mathbf{X}$ to the LS estimator in (14) to generate the estimated channels at pilot positions $\tilde{\mathbf{X}}$. The training data set is then $(\tilde{\mathbf{X}}, \mathbf{G}) = \{(\tilde{\mathbf{X}}^{(1)}, \mathbf{G}^{(1)}), (\tilde{\mathbf{X}}^{(2)}, \mathbf{G}^{(2)}), \dots, (\tilde{\mathbf{X}}^{(N_s)}, \mathbf{G}^{(N_s)})\}$, where $\tilde{\mathbf{X}}^{(i)}$ and $\mathbf{G}^{(i)}$ form the $i$-th training sample pair. We denote by $f(\cdot)$ the function of SRDnNet, whose output is $\mathbf{O} = f(\mathbf{I}, \Theta)$, where $\mathbf{I}$ and $\mathbf{O}$ are the input and output of SRDnNet and $\Theta$ denotes its trainable parameters. We adopt the MSE loss function $\ell(\hat{\mathbf{G}}, \mathbf{G})$ to train the model, where $\hat{\mathbf{G}}$ and $\mathbf{G}$ denote the estimated and ground-truth channels, respectively. Formally, the training objective is the average loss between the estimated channels and the ground truth over all training samples, $\frac{1}{N_s}\sum_{i=1}^{N_s} \ell\big(f(\tilde{\mathbf{X}}^{(i)}, \Theta), \mathbf{G}^{(i)}\big)$. Based on this, SRDnNet updates its trainable parameters with the backpropagation (BP) algorithm to obtain a well-trained model $f(\mathbf{I}, \Theta^*)$, where $\Theta^*$ denotes the trained parameters.
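A minimal sketch of the offline-training objective, assuming flattened complex channels and a generic callable `model` in place of $f(\cdot, \Theta)$. The per-sample MSE is normalized by the number of entries, which is one common convention and an assumption here. Because backpropagation through a CNN cannot be shown in a few lines, `fit_scalar_gain` is a hypothetical one-parameter stand-in that minimizes the same average loss by plain gradient descent.

```python
def mse(g_hat, g):
    """Per-sample MSE between estimated and true channels (flattened complex lists)."""
    return sum(abs(a - b) ** 2 for a, b in zip(g_hat, g)) / len(g)


def average_loss(model, dataset):
    """(1/N_s) * sum_i mse(model(x_i), g_i) over (x_tilde, g) training pairs."""
    return sum(mse(model(x), g) for x, g in dataset) / len(dataset)


def fit_scalar_gain(dataset, lr=0.1, epochs=200):
    """Toy stand-in for BP training: learn a real gain theta in
    model(x) = theta * x by gradient descent on average_loss."""
    theta = 0.0
    for _ in range(epochs):
        grad = 0.0
        for x, g in dataset:
            # d/dtheta |theta*a - b|^2 = 2 * Re(conj(a) * (theta*a - b))
            grad += sum(2 * (complex(a).conjugate() * (theta * a - b)).real
                        for a, b in zip(x, g)) / len(g)
        theta -= lr * grad / len(dataset)
    return theta


# Usage: targets are exactly twice the inputs, so the learned gain approaches 2.
theta = fit_scalar_gain([([1 + 1j, 2], [2 + 2j, 4])])
```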
2) Online Prediction: We denote $(\overset{\frown}{\mathbf{X}}, \mathbf{G}) = \{(\overset{\frown}{\mathbf{X}}^{(1)}, \mathbf{G}^{(1)}), (\overset{\frown}{\mathbf{X}}^{(2)}, \mathbf{G}^{(2)}), \dots, (\overset{\frown}{\mathbf{X}}^{(N_t)}, \mathbf{G}^{(N_t)})\}$ as the test data set, where $\overset{\frown}{\mathbf{X}}^{(i)}$ and $\mathbf{G}^{(i)}$ are the estimated channels at pilot positions and the ground truth in the $i$-th test sample, respectively. We send $\overset{\frown}{\mathbf{X}}$ to $f(\mathbf{I}, \Theta^*)$, and the estimated channel is $\hat{\mathbf{G}} = f(\overset{\frown}{\mathbf{X}}, \Theta^*)$. We summarize the SRDnNet-based channel estimation approach in Algorithm 2, where $i$ denotes the loop index and epoch is the maximum number of loops. Compared with existing channel estimation methods [26], [31], [46] that take derived channel estimates as inputs, the proposed SRDnNet accepts rawer input, obtained with only one division of the received pilots by the transmitted pilots, and thus requires less time and computation. With its capability of non-linear interpolation and denoising, the SRDnNet can estimate the cascaded channel, which does not follow a Gaussian distribution, without any prior knowledge of the channel. Inheriting the feature extraction and denoising abilities of SRCNN and DnCNN, SRDnNet estimates channels accurately, as validated in the following section.
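The "one division" preprocessing can be sketched as follows, assuming the per-sub-carrier pilot model $Y = HX + \text{noise}$, so that the LS estimate at each pilot position is the received pilot divided by the transmitted pilot. The data layout (one list per BS antenna) and the function names are hypothetical, and (14) remains the authoritative definition of the LS estimator.

```python
def ls_at_pilots(y_pilots, x_pilots):
    """LS estimate at pilot positions: H_p[k] = Y_p[k] / X_p[k] (one division per pilot)."""
    if len(y_pilots) != len(x_pilots):
        raise ValueError("received and transmitted pilot vectors must have equal length")
    if any(x == 0 for x in x_pilots):
        raise ValueError("transmitted pilots must be non-zero")
    return [y / x for y, x in zip(y_pilots, x_pilots)]


def ls_grid(y_grid, x_pilots):
    """Apply the LS division to every row (e.g., one row per BS antenna)."""
    return [ls_at_pilots(row, x_pilots) for row in y_grid]


# Usage: in the noiseless case the LS estimate recovers the true pilot-position channel.
h = [0.5 + 0.2j, -0.1 + 1j, 0.3 - 0.4j]   # true channel at three pilot positions
x = [1 + 0j, -1 + 0j, 1j]                  # known transmitted pilots
y = [hk * xk for hk, xk in zip(h, x)]      # noiseless received pilots
h_ls = ls_at_pilots(y, x)
```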

IV. NUMERICAL RESULTS
In the simulations, the system consists of an $N_{BS}$-antenna BS, an $N_M$-element RIS and $K$ single-antenna users, as shown in Fig. 1. Unless otherwise stated, the RIS can continuously adjust its phase shift, and the default settings are $N_{BS} = 16$, $N_M = 144$ and $K = 5$. The practical path loss model in (7) is used in the simulation to better model the channels of the considered system, with the parameter settings for small RISs adopted from Tang et al. [38]. In addition, Rician fading is modeled for the cascaded channel as in (1) and (2), and a Rayleigh fading channel model is used for the direct channel. Note that (1) and (2) … Table II. Additionally, the second transmission sub-frame consists of 16 OFDM symbols, each of which consists of $N = 64$ sub-carriers appended with a CP of length $L_{cp} = 8$. The maximum delay spread of both the direct and cascaded channels is $L = 5$. We set $\gamma$ to 3.5 because of the long distance and the variety of obstructions in the considered propagation environment.
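The default settings above can be collected in a small configuration object. This is a sketch: the class and field names are hypothetical, and the cyclic-prefix check assumes that $L$ counts channel taps, in which case an $L$-tap channel causes no inter-symbol interference when $L_{cp} \ge L - 1$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Default simulation settings from the text (class and field names are hypothetical)."""
    n_bs: int = 16        # BS antennas N_BS
    n_ris: int = 144      # RIS elements N_M
    n_users: int = 5      # single-antenna users K
    n_symbols: int = 16   # OFDM symbols in the second sub-frame
    n_sub: int = 64       # sub-carriers per OFDM symbol N
    cp_len: int = 8       # cyclic prefix length L_cp
    max_taps: int = 5     # maximum delay spread L, assumed to count channel taps
    gamma: float = 3.5    # the parameter gamma of the path loss model (7)

    def cp_covers_delay_spread(self) -> bool:
        # An L-tap channel causes no inter-symbol interference if L_cp >= L - 1.
        return self.cp_len >= self.max_taps - 1

    def cp_overhead(self) -> float:
        # Fraction of each OFDM symbol spent on the cyclic prefix.
        return self.cp_len / (self.n_sub + self.cp_len)

    def grid_size(self) -> int:
        # Resource elements (sub-carriers x symbols) per antenna in the second sub-frame.
        return self.n_sub * self.n_symbols


cfg = SimConfig()
```

With the default values, the CP (8 samples) comfortably exceeds the 5-tap delay spread, at a CP overhead of 8/72, about 11%.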
To evaluate the performance of the SRDnNet, we consider the following estimation methods as benchmarks: the LMMSE channel estimator, the LS estimator, the DFT LS channel estimator and the ChannelNet [34]. The proposed SRDnNet uses the Adam optimizer to update the network parameters with a learning rate of 0.001 and a batch size of 100 samples. In total, 10,000 channel realizations for different users are collected at the BS, of which 6000, 2000, and 2000 are used for training, validation, and testing, respectively. The SNR is defined as $\mathrm{SNR} = E_b/N_0$, where $E_b$ is the energy of each user data bit and $N_0$ is the noise spectral density. In the following, we illustrate the performance of the SRDnNet in terms of the normalized MSE (NMSE), defined as $\mathrm{NMSE} = \mathbb{E}\big[\|\hat{\mathbf{G}}_{\text{SRDnNet}} - \mathbf{G}\|_F^2 / \|\mathbf{G}\|_F^2\big]$, where $\hat{\mathbf{G}}_{\text{SRDnNet}}$ and $\mathbf{G}$ denote the channel estimated by a well-trained SRDnNet and the ground-truth channel, respectively.
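A minimal sketch of the NMSE metric (reported in dB) and of the 6000/2000/2000 split, assuming each channel realization is stored as a flattened list of complex entries and the NMSE expectation is approximated by the sample mean. The function names and the shuffling seed are hypothetical.

```python
import math
import random


def nmse_db(g_hat_list, g_list):
    """NMSE in dB: mean over samples of ||G_hat - G||_F^2 / ||G||_F^2."""
    ratios = []
    for g_hat, g in zip(g_hat_list, g_list):
        err = sum(abs(a - b) ** 2 for a, b in zip(g_hat, g))
        ref = sum(abs(b) ** 2 for b in g)
        ratios.append(err / ref)
    return 10 * math.log10(sum(ratios) / len(ratios))


def split_dataset(samples, n_train=6000, n_val=2000, n_test=2000, seed=0):
    """Shuffle realizations and split them into training/validation/test sets."""
    if n_train + n_val + n_test > len(samples):
        raise ValueError("not enough samples for the requested split")
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)

    def pick(lo, hi):
        return [samples[i] for i in idx[lo:hi]]

    return (pick(0, n_train),
            pick(n_train, n_train + n_val),
            pick(n_train + n_val, n_train + n_val + n_test))
```

For example, estimating a two-entry channel `[1, 1]` as `[1, 0]` leaves half of the channel energy as error, i.e. an NMSE of about −3 dB.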

A. NMSE Versus SNR
As shown in Fig. 2, we train two models for the first and second sub-frames to obtain the direct channel and the whole channel, respectively, where the whole channel is the superposition of the direct and cascaded channels. We then calculate the cascaded channel from the estimated direct and whole channels. Simulation results under different SNRs for the whole channel and the direct channel are presented in Fig. 4 and Fig. 5(a), respectively, and Fig. 5(b) shows the NMSEs of the calculated cascaded channel. Since the cascaded channel is not Gaussian distributed, it is difficult to obtain the posterior PDF needed for a closed-form MMSE estimator. Only the LMMSE estimator (21) can be derived, based on the exponentially decreasing power delay profile. As shown in Fig. 4, Fig. 5(a), and Fig. 5(b), the NMSE of all estimators decreases as the SNR increases, since the impact of noise is mitigated by higher transmit power. In the following, we analyze the simulation results of the conventional and DL-based methods.
First, we investigate the estimation performance of the LS, DFT LS, and LMMSE estimators in terms of NMSE. The LMMSE estimator significantly outperforms the LS method, because the LMMSE estimator exploits second-order statistical knowledge of the channel, whereas the LS estimator treats the channel as an unknown deterministic constant and uses no prior knowledge. The DFT LS estimator reduces the noise effect relative to the LS by removing the channel coefficients beyond the maximum channel length, which contain only noise. As a result, the DFT LS estimator significantly outperforms the LS method but remains worse than the LMMSE estimator.
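The DFT LS step can be sketched as follows, assuming an LS estimate on all N sub-carriers and a channel of at most L taps: transform to the delay domain, zero the taps beyond L (which contain only noise), and transform back. This is an orthogonal projection onto the L-tap subspace, so it can only remove noise energy. A naive O(N²) DFT keeps the sketch self-contained (an FFT would be used in practice), and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Unnormalized DFT: X[k] = sum_t x[t] exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(x):
    """Inverse of dft()."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def dft_ls(h_ls, max_taps):
    """Keep only the first max_taps delay-domain coefficients of the LS estimate."""
    h_delay = idft(h_ls)
    h_delay = [c if t < max_taps else 0j for t, c in enumerate(h_delay)]
    return dft(h_delay)


# Usage: a 3-tap channel on 16 sub-carriers, observed with a deterministic perturbation.
n = 16
taps = [0.8 + 0.1j, 0.3 - 0.2j, 0.1j] + [0j] * (n - 3)
h_freq = dft(taps)
noise = [0.05 * cmath.exp(1.7j * k) for k in range(n)]
h_ls = [a + b for a, b in zip(h_freq, noise)]
h_dft = dft_ls(h_ls, 3)
```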
Second, we discuss the performance of the DL-based methods, i.e., ChannelNet and SRDnNet. The ChannelNet performs comparably to the LMMSE and DFT LS estimators because, like these two estimators, it essentially denoises the LS estimate. Another reason is that the ChannelNet is not trained in an end-to-end manner. In contrast, the SRDnNet learns a non-linear interpolation method that improves channel estimation accuracy. For example, in Fig. 4, the SRDnNet achieves a performance gain of 13 dB at 10 dB SNR over the LMMSE estimator and 12 dB over the ChannelNet. The reason is that the SRDnNet can non-linearly map a low-resolution image into a high-resolution image and learn the distinguishable features of the additive noise.
Simulation results for the direct channel are shown in Fig. 5(a), and the cascaded channel calculated from the estimated direct and whole channels is shown in Fig. 5(b). Fig. 5(a) shows that SRDnNet achieves the best channel estimation performance among all compared methods. In Fig. 5(b), the estimation errors of the direct channel and the whole channel accumulate in the calculation of the cascaded channel. For example, at 10 dB SNR, the NMSEs of SRDnNet in Fig. 5(a) and Fig. 4 are −38 dB and −38.1 dB, respectively, whereas the NMSE of the cascaded channel in Fig. 5(b) is −35 dB. Nevertheless, the performance gap between the SRDnNet and LMMSE remains 13 dB, which confirms the advantage of SRDnNet.
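A minimal sketch, with hypothetical names and flattened complex channels, of how the cascaded channel follows from the two sub-frame estimates and why their errors carry over. Because the subtraction is linear, the cascaded error equals the whole-channel error minus the direct-channel error; when these two errors are uncorrelated (or, as in the toy example below, orthogonal), their energies add.

```python
def cascaded_estimate(whole_hat, direct_hat):
    """Cascaded channel estimate = estimated whole channel - estimated direct channel."""
    return [w - d for w, d in zip(whole_hat, direct_hat)]


def error_energy(est, truth):
    """Squared error ||est - truth||^2 over all entries."""
    return sum(abs(a - b) ** 2 for a, b in zip(est, truth))


# Toy example with orthogonal estimation errors in the two sub-frames.
direct = [1 + 1j, 0.5]
cascaded = [0.2, -0.1j]
whole = [d + c for d, c in zip(direct, cascaded)]
whole_hat = [whole[0] + 0.01, whole[1]]        # whole-channel error on entry 0 only
direct_hat = [direct[0], direct[1] + 0.02j]    # direct-channel error on entry 1 only
cascaded_hat = cascaded_estimate(whole_hat, direct_hat)
```

Here the cascaded error energy equals the sum of the two stage errors (1e-4 + 4e-4 = 5e-4), while the cascaded channel itself carries less energy than the whole channel, so its NMSE is higher.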

B. NMSE Versus Numbers of Elements, Antennas and Pilots
The numbers of RIS elements, antennas, and pilots in each OFDM symbol are also important parameters that affect channel estimation accuracy. To further verify the scalability of SRDnNet in different communication scenarios, we evaluate the estimation performance under different system settings. For simplicity, we only show the NMSE of the whole-channel estimate in the following; the accumulation of the whole- and direct-channel estimation errors in the cascaded channel remains similar to that in Fig. 4 and Fig. 5. As shown in Fig. 6, the NMSE performance is not sensitive to the number of elements, denoted by $M$. The reason is that, in our system model (6), the dimensionality of the channels to be estimated is constant regardless of $M$, so the size of the input pairs in (26) remains the same. Consequently, the NMSE of each considered method changes only slightly as $M$ increases. In addition, the proposed SRDnNet still achieves the best performance because of its superiority in non-linear interpolation and denoising. The NMSE gap between SRDnNet and ChannelNet is 13.6 dB with 128 elements.
Besides $M$, the number of antennas $N_{BS}$, which determines the channel dimension, also affects the channel estimation performance. We therefore investigate the NMSE as $N_{BS}$ increases, as shown in Fig. 7. The NMSEs of the analytical channel estimators, i.e., the LS, DFT LS, and LMMSE estimators, do not change with $N_{BS}$, while the NMSEs of the DL-based estimators, i.e., the ChannelNet and SRDnNet, decrease. This is because, in the simulation, the analytical estimators estimate the channel of each antenna separately and their performance is averaged over antennas; since the received signal power of each antenna does not change, their NMSEs remain almost constant. For the DL-based estimators, the input data contain the received signals from all antennas, so both the input data size and the received signal power grow with $N_{BS}$. Thus, the NMSEs of the DL-based estimators decrease as more antennas are added to the BS, and the SRDnNet achieves the best NMSE performance.
Moreover, the pilot number directly determines the input data size of the DL-based estimators and consequently affects their channel estimation performance. As illustrated in Fig. 8, the NMSE of all considered algorithms decreases as the number of pilots increases, as expected. For the analytical estimators, i.e., the LS, DFT LS and LMMSE, the interpolation accuracy increases with more received pilots. For the DL-based estimators, i.e., the ChannelNet and SRDnNet, the convolutional neural networks can exploit more spatial features from larger input data. The curve labeled PCA compressed data uses PCA to generate the compressed input of the SRDnNet with a compression ratio of fifty percent, which significantly reduces the complexity order. However, its performance is worse than that of the SRDnNet with uncompressed input, because PCA, while extracting a low-dimensional representation, also discards part of the useful information in the input dataset. Fig. 8 thus shows the superiority of SRDnNet in non-linear interpolation and denoising. For example, with only four pilots in each OFDM symbol, SRDnNet still achieves an NMSE of −27.5 dB, compared with −15.4 dB for the ChannelNet and −11.3 dB for the LMMSE estimator. This result shows that the proposed DL-based estimator can achieve high channel estimation accuracy even with few pilots at a relatively low SNR.

C. Complexity Analysis
We further investigate the computational complexity of the SRDnNet. Table I summarizes the parameters of each convolutional layer. The complexity of a convolutional layer is $\mathcal{O}(W_x W_y F^2 N_I N_O)$, where $W_x$ and $W_y$ are the length and width of the output feature maps, $F$ is the side length of the filters, and $N_I$ and $N_O$ are the numbers of input and output feature maps, respectively. Thus, the complexity of the SRCNN (without the dense layer) is $C_{SR} = \mathcal{O}\big(2 N_{BS} N_P M (81 \cdot 2 \cdot 64 + 64 \cdot 32 + 25 \cdot 32 \cdot 2)\big) = \mathcal{O}(128 \cdot 219 \cdot N_{BS} N_P M)$, where $N_P$ is the number of pilots in each OFDM symbol and $M$ is the number of symbols in the second sub-frame. The complexity of the DnCNN and the output layer is approximately $C_{Dn} = \mathcal{O}\big(2 N_{BS} N M (3^2 \cdot 2 \cdot 64 \cdot 2 + 3^2 \cdot 64 \cdot 64 \cdot 18)\big) \approx \mathcal{O}\big((6 \cdot 192)^2 N_{BS} N M\big)$. The time complexity of a fully connected layer is $\mathcal{O}(I_x I_y N_{FC})$, where $I_x$ and $I_y$ are the input dimensions and $N_{FC}$ is the number of units of the fully connected layer. Therefore, the complexity of the fully connected layers is approximately $C_{FC} \approx \mathcal{O}(2048 \cdot N_{BS} N M)$, and the total time complexity of the SRDnNet is approximately $C = C_{SR} + C_{Dn} + C_{FC} = \mathcal{O}\big(((6 \cdot 192)^2 + 2048) N_{BS} N M + 128 \cdot 219 \cdot N_{BS} N_P M\big)$.
The bicubic interpolation method can be formulated as a convolutional layer, whose time complexity is $C_B = \mathcal{O}\big(2 N_{BS} N_P M (4^2 \cdot 2 \cdot 64)\big) = \mathcal{O}(4096 \cdot N_{BS} N_P M)$.
The total time complexity of the ChannelNet, which starts with the bicubic interpolation, includes $C_B$. Because $C_B > C_{FC}$, the total time complexity of the SRDnNet is lower than that of the ChannelNet. The time complexities of the LMMSE and LS estimators are $\mathcal{O}(N_{BS} (N M)^3)$ and $\mathcal{O}(N_{BS} N M)$, respectively [47]. Fig. 9 shows the time complexity of the proposed SRDnNet, LMMSE and LS with respect to $N_{BS} N$. The SRDnNet has higher complexity than the LS estimator. As the channel dimension $N_{BS} N$ increases, the complexity of the LMMSE estimator approaches that of the SRDnNet and exceeds it when $N_{BS} N$ is larger than approximately 1200. Moreover, although its complexity is comparable to that of the LMMSE estimator, the SRDnNet can run efficiently on graphics processing units (GPUs), which significantly reduces the testing time [48]. In contrast, the LMMSE and LS estimators are not straightforwardly supported by GPUs and require an application-specific processor. Therefore, although the proposed SRDnNet involves higher computational complexity than the LS estimator, and than the LMMSE estimator for small channel dimensions, its running time can be reduced by the parallel computing of a GPU. This conclusion is verified in Table III, where each method runs on a desktop computer with an i9-9940X 3.30 GHz central processing unit (CPU) and an Nvidia GeForce RTX 2080Ti GPU.
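The operation counts above can be checked with a few lines of Python. This sketch recomputes the per-pixel coefficients of $C_{SR}$, $C_{Dn}$ and $C_B$ from the convolution formula $W_x W_y F^2 N_I N_O$ with the pixel counts factored out. The reading of the $3^2 \cdot 2 \cdot 64 \cdot 2$ term as the DnCNN input and output layers is an assumption, the coefficient 2048 for $C_{FC}$ is read off from the total expression, and the function names are hypothetical. These are multiplication counts that ignore additions and activations, so they are not meant to reproduce Fig. 9 exactly.

```python
def conv_cost(wx, wy, f, n_in, n_out):
    """Multiplications of one conv layer, O(W_x * W_y * F^2 * N_I * N_O)."""
    return wx * wy * f * f * n_in * n_out


def srdnnet_cost_coefficients():
    """Per-pixel coefficients of C_SR, C_Dn and C_B (W_x = W_y = 1)."""
    # SRCNN: 9x9 (2->64), 1x1 (64->32), 5x5 (32->2); leading factor 2 as in C_SR.
    c_sr = 2 * (conv_cost(1, 1, 9, 2, 64) + conv_cost(1, 1, 1, 64, 32)
                + conv_cost(1, 1, 5, 32, 2))
    # DnCNN: two 3x3 layers between 2 and 64 maps (assumed input/output layers)
    # plus 18 hidden 3x3 (64->64) layers.
    c_dn = 2 * (2 * conv_cost(1, 1, 3, 2, 64) + 18 * conv_cost(1, 1, 3, 64, 64))
    # Bicubic interpolation written as a 4x4 conv (2->64), as in C_B.
    c_b = 2 * conv_cost(1, 1, 4, 2, 64)
    return c_sr, c_dn, c_b


def srdnnet_total(n_bs, n_sub, n_pilot, m, c_fc=2048):
    """C = (C_Dn + C_FC) terms scaling with N_BS*N*M plus C_SR scaling with N_BS*N_P*M."""
    c_sr, c_dn, _ = srdnnet_cost_coefficients()
    return (c_dn + c_fc) * n_bs * n_sub * m + c_sr * n_bs * n_pilot * m


def lmmse_cost(n_bs, n_sub, m):
    return n_bs * (n_sub * m) ** 3


def ls_cost(n_bs, n_sub, m):
    return n_bs * n_sub * m


c_sr, c_dn, c_b = srdnnet_cost_coefficients()
```

The exact DnCNN coefficient is 1,331,712, within about 0.4% of the $(6 \cdot 192)^2 = 1{,}327{,}104$ approximation used in the text, and the SRCNN and bicubic coefficients match $128 \cdot 219$ and 4096.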

V. CONCLUSION
This paper proposed a deep learning (DL)-based method to accurately perform channel estimation in a reconfigurable intelligent surface (RIS)-aided multi-user multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) system. We formulated channel estimation as an image super-resolution (SR) problem and proposed an image super-resolution network, named SRDnNet, to recover the channel matrix from the coarse estimation of channels. By inheriting the feature extraction and denoising abilities of the revised super-resolution convolutional neural network (SRCNN) and the denoising convolutional neural network (DnCNN), the SRDnNet further improves the channel estimation performance. We evaluated the proposed approach under different signal-to-noise ratios (SNRs) and different numbers of RIS elements, antennas, and pilots. Simulation results demonstrated that the proposed SRDnNet outperforms the analytical channel estimators and the DL-based benchmark with more than 10 dB performance gain in terms of the normalized mean square error (NMSE). We showed that at least 8 pilots were required for reliable channel estimation performance under 5 dB SNR. Furthermore, the proposed SRDnNet can reduce the size of input data and the computation cost for channel estimation in complicated scenarios.