Deep Learning Gated Recurrent Neural Network-Based Channel State Estimator for OFDM Wireless Communication Systems

In this work, deep learning is used for pilot-based channel estimation in orthogonal frequency division multiplexing (OFDM) systems. Specifically, a new channel estimation framework based on deep learning gated recurrent unit (GRU) neural networks is presented. The network is first trained offline on generated data sets and is then used online to track the channel parameters, after which the transmitted data can be recovered. To assess its performance, the proposed estimator is trained with three alternative deep learning optimization techniques. It is also compared with the commonly used least squares (LS) and minimum mean square error (MMSE) estimators and with two existing models. The deep learning GRU neural network-based channel state estimator, which can learn and generalize rapidly, is shown to outperform the compared estimators when only a few pilots are available. In addition, it needs no prior knowledge of channel statistics. The proposed estimator therefore appears promising for estimating channel states in OFDM communication systems.


I. INTRODUCTION
Modern wireless networks are built to provide high data rates for users to accommodate the rapidly increasing volume of mobile internet traffic. Because of its bandwidth efficiency and resistance to frequency-selective fading, orthogonal frequency division multiplexing (OFDM) is a crucial building block in present 4G wireless networks and will continue to be in future 5G networks. The ability to gather channel state information (CSI) quickly and accurately is critical in today's rapidly evolving wireless environment.
In OFDM systems, CSI is frequently acquired using pilot-based channel estimation. A known symbol called a "pilot" is transmitted, and the receiver determines the channel information by comparing the received symbol with the transmitted one. LS and MMSE are two of the most common estimation techniques. Despite its simplicity, the LS method's accuracy is often unsatisfactory. MMSE requires second-order channel statistics and the noise variance as prior knowledge and has high computational complexity.

A. Related Work
Because of their great nonlinear mapping ability, artificial neural networks have recently attracted increasing attention and are commonly utilized in classification or recognition [1]. There are hundreds of different topologies for neural network models, each tailored to the demands of a particular application, but their working principle can be outlined as follows: as training data are fed into the network, its connection weights and biases are adjusted, and the learning process is complete once the network reaches a stable state [2], [3]. Various network structures have been coupled with traditional cellular communication networks to handle challenges such as CSI estimation, symbol detection, channel coding, dynamic spectrum allocation, resource management, energy optimization, and network fault identification [4], [5].
With more layers and neurons than the typical three-layer neural network, the deep neural network (DNN) has a greater ability to learn and generalize when large datasets are used [6]-[8] and can represent more complex feature mappings [9].
A DNN model exists for the automatic classification of modulation in communication systems [9], and DNN-based encoder structures have been implemented to reduce the difficulties of peak-to-average power ratio and of symbol identification under Doppler frequency shift [10], [11].
For OFDM systems with frequency selective channels, the authors in [12] suggested a feed-forward neural network (FFNN)-based combined channel estimation and symbol detection technique. Proposed algorithms outperform conventional estimators when imperfect communication systems are considered. Online feedforward deep learning (DL)-based estimators for doubly selective channels were proposed by the authors in [13]. The proposed algorithm exhibits superiority over standard linear MMSE estimators in all examination conditions. In [14], a 1D-convolutional neural network (1D-CNN) DL model was developed to estimate the channel and retrieve equalized data. It was also examined in terms of bit error rate and mean square error at various modulation approaches to see how the 1D-CNN compared to LS, MMSE, and FFNN. LS, MMSE, and FFNN estimators are all found to be inferior to 1D-CNN. By treating the channel as an image, the authors in [15] introduced a deep residual channel estimation network (ReEsNet) for channel estimation with high performance and low computation cost. In [16], two architectures of five-layer DNN models are proposed in an underwater acoustic (UWA) system to alleviate the effect of environmental variations while estimating the channel parameters.

B. Motivation and Contribution
Deep learning approaches have a significant advantage for channel estimation because they can automatically extract the features of a specific problem without the need for extensive prior knowledge. Training difficulties and computational complexity long kept recurrent neural networks (RNNs) from becoming a standard network model, but RNNs have recently entered a period of rapid development as deep learning theory has matured. Handwriting recognition [17] and speech recognition [18] are two areas where RNNs have already been used successfully. The major characteristic of an RNN is a hidden layer that can remember previously processed information, which gives it a structural advantage for processing time-series information. An RNN can therefore be used as a channel estimator to enhance the learning and performance of CSI estimation. However, an RNN with long short-term memory (LSTM) [19] has a relatively complex structure. To address this issue, we propose DL gated recurrent unit (GRU) neural networks for channel estimation at the data subcarriers, an approach in which the computation requirements can be significantly reduced. The neural network approach is more flexible than standard non-neural methods in that it does not need to model the channel details and can be deployed for any channel estimation task. By not estimating specific CSI parameters explicitly, it is more direct, simple, and adaptive. The most significant contributions are listed below.
1) The DL approach is integrated into the OFDM system to estimate the channel. Specifically, we use the DNN, which is viewed as a black box in which the various network layers handle certain tasks. Thanks to the DNN's superior identification and representation capabilities, the training procedure can increase the accuracy of CSI estimation at the data subcarriers.
2) The proposed DNN for channel estimation is trained offline because of the long training period and the high number of weights and other variables that must be updated during the learning process. The trained DNN is then deployed online to retrieve the transmitted data.
3) We examine the performance of the proposed channel estimation framework in a variety of scenarios. Specifically, the symbol error rate (SER) is simulated to evaluate the accuracy of the channel estimation. The simulations and comparisons show that the proposed framework is both efficient and robust when fewer pilots are available.
4) The performance of the proposed framework is compared with that of the LS and MMSE estimators, as well as with the ReEsNet model [15] and the five-layer DNN model [16]. In addition, three different optimization algorithms are used to train the proposed estimator on simulated datasets to obtain the most efficient model with the lowest number of pilots.

C. Paper Organization
The rest of this paper is organized as follows. Section II describes the OFDM communication system and conventional methods for channel estimation. Section III presents the novel CSI estimation method, which is based on DL GRU neural networks. Simulation results for the proposed framework are given in Section IV. Section V concludes the paper.

II. OFDM COMMUNICATION SYSTEM AND CONVENTIONAL CHANNEL ESTIMATION
The standard OFDM communication system and conventional methods for channel estimation are introduced briefly in the next subsections.

A. OFDM Communication Systems' Model
The system model of the standard OFDM communication system is depicted in Figure 1, where $X(k)$ is the information multiplexed on the $N_{SC}$ subcarriers of one OFDM symbol in the frequency domain. The indices $n$ and $k$ represent the discrete time and discrete frequency components of the OFDM symbol at a certain subcarrier, respectively. At the receiver, the OFDM system's data symbol can be given as follows:
$$y(n) = x(n) \oplus h(n) + w(n),$$
where $\oplus$ denotes circular convolution, $h(n)$ is the channel coefficients in the time domain, and $w(n)$ represents the additive white Gaussian noise (AWGN). Using the discrete Fourier transform (DFT) to transform the signals from the time domain to the frequency domain, the resultant signals can be represented as follows:
$$Y(k) = X(k) H(k) + W(k),$$
where $Y(k)$, $X(k)$, $H(k)$, and $W(k)$ are the DFTs of $y(n)$, $x(n)$, $h(n)$, and $w(n)$, respectively. These DFTs are computed after removing the cyclic prefix.
Then, the received OFDM symbols $\mathbf{Y} \in \mathbb{C}^{N_{SC} \times N_{SY}}$ can be expressed as
$$\mathbf{Y} = \mathbf{X} \circ \mathbf{H} + \mathbf{W},$$
where $\mathbf{X} \in \mathbb{C}^{N_{SC} \times N_{SY}}$ is the transmitted OFDM symbols, $\mathbf{H} \in \mathbb{C}^{N_{SC} \times N_{SY}}$ contains the channel coefficients in the frequency domain, $\circ$ is the Hadamard product (element-wise product), and $\mathbf{W} \in \mathbb{C}^{N_{SC} \times N_{SY}}$ represents the noise matrix.
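The equivalence between circular convolution in the time domain and element-wise multiplication in the frequency domain can be checked numerically. The following is a minimal sketch only: the subcarrier count, cyclic prefix length, number of channel taps, and QPSK mapping are illustrative assumptions, not the settings of this paper, and noise is omitted so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sc = 64     # number of subcarriers (assumed)
n_cp = 16     # cyclic prefix length (assumed, at least n_taps - 1)
n_taps = 4    # number of channel taps (assumed)

# Random QPSK symbols on every subcarrier (frequency domain)
bits = rng.integers(0, 2, size=(n_sc, 2))
X = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)

# Random complex channel impulse response (time domain)
h = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2 * n_taps)

# Transmitter: IDFT, then prepend the cyclic prefix
x = np.fft.ifft(X)
x_cp = np.concatenate([x[-n_cp:], x])

# Channel: linear convolution; the prefix makes it look circular to the receiver
y_cp = np.convolve(x_cp, h)

# Receiver: remove the cyclic prefix, keep one symbol length, then DFT
y = y_cp[n_cp:n_cp + n_sc]
Y = np.fft.fft(y)

# Frequency-domain channel coefficients (zero-padded DFT of the taps)
H = np.fft.fft(h, n_sc)

# Largest deviation between the received symbols and the element-wise product
max_err = np.max(np.abs(Y - X * H))
print(max_err)
```

The printed deviation is at the level of floating-point rounding, which illustrates why each subcarrier can be treated as a flat channel once the cyclic prefix is removed.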

B. Conventional Channel Estimation
In traditional OFDM systems, pilots embedded in the transmitted data are used to estimate the channel. At the receiving end, the channel parameters can be derived from the relationship between the received signal and the pilot information. The accuracy of the channel information, however, is highly dependent on the density of pilots [20]. Both LS and MMSE are pilot-based. The LS algorithm is the most widely used approach for channel estimation and frequently serves as a performance benchmark for CSI estimation [21]. The LS estimate can be described as
$$\hat{\mathbf{h}}_{LS} = \mathbf{y} \circ \mathbf{x}^{\circ -1},$$
where $\mathbf{y}, \mathbf{x}, \hat{\mathbf{h}}_{LS} \in \mathbb{C}^{N_{SC} N_{SY} \times 1}$ are the vectorized $\mathbf{Y}$, $\mathbf{X}$, $\hat{\mathbf{H}}$, and $\hat{\mathbf{H}}$ is the estimated $\mathbf{H}$. The operator $(\cdot)^{\circ -1}$ represents the Hadamard inverse (element-wise inverse). The MMSE algorithm aims to minimize the mean square error between the estimated channel information and the real channel information. Its objective function can be given by
$$J = E\{\|\mathbf{h} - \hat{\mathbf{h}}\|^2\}.$$
To get the closed-form formula, we take the partial derivative with respect to $\hat{\mathbf{h}}$ and set the result to 0, which gives
$$\hat{\mathbf{h}}_{MMSE} = \mathbf{R}_{h\hat{h}_{LS}} \mathbf{R}_{\hat{h}_{LS}\hat{h}_{LS}}^{-1} \hat{\mathbf{h}}_{LS},$$
where $\mathbf{R}_{h\hat{h}_{LS}} = \mathbf{R}_{hh}$, $\mathbf{R}_{\hat{h}_{LS}\hat{h}_{LS}} = \mathbf{R}_{hh} + \sigma_n^2 \mathbf{I}$, $\mathbf{R}_{hh}$ is the channel autocorrelation matrix, and $\sigma_n^2$ denotes the noise variance. Although the MMSE technique takes into account the impact of Gaussian noise on the CSI performance, its computational complexity is significantly higher than that of the LS approach.
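The difference between the two conventional estimators can be illustrated with a short Monte Carlo sketch. It assumes unit-power pilots on every subcarrier, for which the MMSE estimate reduces to smoothing the LS estimate with the matrix R_hh (R_hh + sigma^2 I)^(-1). The channel statistics (four independent taps of equal power), the noise variance, and the number of trials are assumptions chosen for illustration and are not taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n_sc = 64         # subcarriers, all carrying pilots here (assumed)
n_taps = 4        # independent channel taps of equal average power (assumed)
noise_var = 0.1   # noise variance (assumed)
n_trials = 2000   # Monte Carlo trials (assumed)

# Partial DFT matrix mapping the time-domain taps to H(k): H = F @ h
k = np.arange(n_sc)[:, None]
n = np.arange(n_taps)[None, :]
F = np.exp(-2j * np.pi * k * n / n_sc)

# Second-order channel statistics R_hh = E[H H^H] for taps of power 1 / n_taps
R_hh = F @ F.conj().T / n_taps

# MMSE smoothing matrix applied to the LS estimate (unit-power pilots)
W_mmse = R_hh @ np.linalg.inv(R_hh + noise_var * np.eye(n_sc))

mse_ls = 0.0
mse_mmse = 0.0
for _ in range(n_trials):
    h = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2 * n_taps)
    H = F @ h
    x = np.exp(1j * np.pi / 2 * rng.integers(0, 4, n_sc))   # unit-modulus pilot symbols
    w = np.sqrt(noise_var / 2) * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
    y = x * H + w
    h_ls = y / x               # LS: element-wise (Hadamard) division by the pilots
    h_mmse = W_mmse @ h_ls     # MMSE: needs R_hh and the noise variance
    mse_ls += np.mean(np.abs(h_ls - H) ** 2) / n_trials
    mse_mmse += np.mean(np.abs(h_mmse - H) ** 2) / n_trials

print(mse_ls, mse_mmse)
```

Under these assumptions the LS error stays close to the noise variance, while the MMSE error is markedly lower, at the cost of a matrix inversion and of knowing R_hh and the noise variance in advance.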

III. DL GRU NEURAL NETWORK-BASED CSI ESTIMATION

A. Preliminary to the Deep Neural Network (DNN)
With DNN, the results of the prediction are derived at the output layer via linear and nonlinear operations at numerous hidden layers, as in the classical neural network model. With its outstanding learning and representation capabilities, it excels at handling extremely complex and nonlinear situations.
The learning process of a DNN comprises two phases: training and testing. Before the network model can be used to estimate channel parameters effectively, it must be trained in three steps. The first step is to determine the input data samples. In the second step, the gradient descent technique is used: the partial derivatives of the cost function, which measures the error between the network output and the true values, are computed, and each parameter is adjusted in the direction of the negative gradient of the error function to reduce this error. The third step involves the validation set.
The weight of the $j$th neuron in the $l$th layer is depicted in Figure 2 by the symbol $w_{ji}^{(l)}$, where $i$ indexes the neurons of the preceding layer. The preactivation of the layer is then provided by
$$z_j^{(l)} = \sum_{i} w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}.$$
The output activation of each neuron can be expressed as
$$a_j^{(l)} = f(z_j^{(l)}),$$
where $f(\cdot)$ is the activation function and $b_j^{(l)}$ is the bias. The training data is fed to the DNN whose structure is depicted in Figure 2. Through the network's hidden layers, data features are extracted, and then classification results are created. In matrix form, the network output can be expressed as follows:
$$\mathbf{y} = f(\mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}),$$
where $\mathbf{W}^{(L)}$ is the output layer's connection weight matrix, $\mathbf{b}^{(L)}$ is the output layer's bias vector, and $\mathbf{y}$ denotes the output vector of the network.
The basic procedure of the backpropagation algorithm is to determine the partial derivatives of the cost function. Its steps are as follows.
First, the network parameters, including the weights and the associated parameters, are initialized. Second, the forward propagation formula is used to compute the state and the activation values of each layer:
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f(\mathbf{z}^{(l)}),$$
where $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ represent the weight matrix from the $(l-1)$th layer to the $l$th layer and the bias vector of the $l$th layer, respectively.
The output layer parameter can be given by
$$\boldsymbol{\delta}^{(L)} = -(\mathbf{t} - \mathbf{y}) \circ f'(\mathbf{z}^{(L)}),$$
where $\mathbf{t}$ and $\mathbf{y}$ denote the expected and the actual outputs that the neural network generates for the training data, respectively. The symbol $f'(\cdot)$ represents the derivative of the underlying function.
$\boldsymbol{\delta}^{(l)}$ of the hidden layers can be determined from the $(L-1)$th layer down to the 2nd layer as follows:
$$\delta_i^{(l)} = \Big( \sum_{j=1}^{s_{l+1}} w_{ji}^{(l+1)} \delta_j^{(l+1)} \Big) f'(z_i^{(l)}),$$
where $s_l$ is the number of neurons at the $l$th layer. The crossentropyex loss function is used in this paper as the cost function, which can be defined as follows:
$$E = -\frac{1}{N} \sum_{n=1}^{N} \sum_{r} t_{nr} \ln y_{nr},$$
where $N$ is the total number of samples used in the training phase, $t_{nr}$ indicates whether the $n$th sample belongs to class $r$, and $y_{nr}$ is the corresponding network output.
Third, the backpropagation algorithm is used to determine the difference between the DNN's output value and its true value.
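The forward and backward passes described above can be sketched for a small fully connected network. This is an illustrative sketch, not the network of this paper: the layer sizes and the sigmoid activation are assumptions, and a squared-error cost is used for simplicity because it yields an output-layer term of the form -(t - y) times the activation derivative. The analytic gradient is compared with a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Return the pre-activations and activations of every layer."""
    a, zs, acts = x, [], [x]
    for W, b in params:
        z = W @ a + b          # state (pre-activation) of the layer
        a = sigmoid(z)         # activation of the layer
        zs.append(z)
        acts.append(a)
    return zs, acts

def backward(params, x, t):
    """Backpropagation for the cost E = 0.5 * ||t - y||^2 (assumed cost)."""
    _, acts = forward(params, x)
    y = acts[-1]
    delta = -(t - y) * y * (1 - y)       # output layer; sigmoid'(z) = y * (1 - y)
    grads = []
    for l in range(len(params) - 1, -1, -1):
        grads.append((np.outer(delta, acts[l]), delta))   # (dE/dW, dE/db) of layer l
        if l > 0:
            W = params[l][0]
            a_prev = acts[l]
            delta = (W.T @ delta) * a_prev * (1 - a_prev)  # propagate to the previous layer
    return grads[::-1]

rng = np.random.default_rng(2)
sizes = [3, 5, 2]   # input, hidden, and output sizes (assumed)
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(3)
t = np.array([1.0, 0.0])

def cost(params):
    return 0.5 * np.sum((t - forward(params, x)[1][-1]) ** 2)

analytic = backward(params, x, t)[0][0][0, 0]

# Central finite difference on the same weight
eps = 1e-6
W0 = params[0][0]
W0[0, 0] += eps
c_plus = cost(params)
W0[0, 0] -= 2 * eps
c_minus = cost(params)
W0[0, 0] += eps
numeric = (c_plus - c_minus) / (2 * eps)

print(analytic, numeric)
```

The two printed values agree closely, which indicates that the layer-by-layer recursion produces the partial derivatives of the cost.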

B. Optimization Techniques
The loss function is most commonly minimized via gradient descent. In the gradient descent method, the weights and biases are changed in incremental steps in the direction of the negative gradient of the loss:
$$\theta_{\ell+1} = \theta_\ell - \alpha \nabla E(\theta_\ell),$$
where $\ell$ is the iteration number, $\alpha$ represents the learning rate, $\theta$ is the parameter vector, and $E(\theta)$ is the loss function. In the typical gradient descent approach, the gradient of the loss function is evaluated using the complete training set at once. Stochastic gradient descent instead evaluates the gradient and updates the parameters at each iteration using a distinct subset of the data, referred to as a "mini-batch." During one epoch, the training algorithm runs through all of the mini-batches of the training data. The parameter update computed from a mini-batch is a noisy estimate of the update that would result from the whole data set. Stochastic gradient descent can therefore oscillate along the path of steepest descent towards the optimum. Incorporating a momentum term into the parameter update [22] can help reduce this oscillation. Stochastic gradient descent with momentum (SGDm) updates the weights and biases as follows:
$$\theta_{\ell+1} = \theta_\ell - \alpha \nabla E(\theta_\ell) + \gamma (\theta_\ell - \theta_{\ell-1}),$$
where $\gamma$ determines the contribution of the previous gradient step to the current iteration. SGDm uses one learning rate for all parameters. Network training can be improved by learning rates that vary by parameter and adapt automatically to the loss function being optimized. One such technique is root mean square propagation (RMSProp). It keeps a moving average of the element-wise squares of the parameter gradients, which is given by
$$v_\ell = \beta_2 v_{\ell-1} + (1 - \beta_2) [\nabla E(\theta_\ell)]^2,$$
where $\beta_2$ is the decay rate of the moving average. Commonly, the decay rate is 0.9, 0.99, or 0.999. The corresponding averaging lengths of the squared gradients are $1/(1 - \beta_2)$, that is, 10, 100, or 1000 parameter updates, respectively. RMSProp uses this moving average to normalize the updates of the weight and bias parameters as follows:
$$\theta_{\ell+1} = \theta_\ell - \frac{\alpha \nabla E(\theta_\ell)}{\sqrt{v_\ell} + \epsilon},$$
where the division is performed element-wise. Because of this, RMSProp reduces the learning rate for parameters with large gradients and increases it for parameters with small gradients. The small constant $\epsilon$ is added to avoid division by zero.
Adam [23] (derived from adaptive moment estimation) uses a parameter update similar to that of RMSProp, but with an added momentum term. The algorithm maintains an element-wise moving average of both the parameter gradients and their squared values. The moving average of the parameter gradients can be given as follows:
$$m_\ell = \beta_1 m_{\ell-1} + (1 - \beta_1) \nabla E(\theta_\ell),$$
where $\beta_1$ is the decay rate of the gradient moving average, and the moving average of the squared gradients $v_\ell$ is computed as in RMSProp. Adam updates the network parameters using the moving averages as follows:
$$\theta_{\ell+1} = \theta_\ell - \frac{\alpha m_\ell}{\sqrt{v_\ell} + \epsilon},$$
where the division is performed element-wise. If the gradients are similar over several iterations, the moving average of the gradient allows the parameter updates to gain momentum in a specific direction. If the gradients contain mostly noise, the moving average of the gradient is smaller, so the parameter updates are smaller as well.
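The three update rules can be sketched as single-step functions and tried on a toy problem. In this sketch SGDm adds a fraction of the previous parameter change to the gradient step, RMSProp divides the gradient by the root of a moving average of squared gradients, and Adam replaces the gradient with its moving average. The quadratic test loss, the learning rate, the decay rates, and the step count are assumptions for illustration, and the Adam step omits the bias-correction terms of the original Adam algorithm because the text does not describe them.

```python
import numpy as np

def sgdm_step(theta, grad, prev_theta, lr=0.01, momentum=0.9):
    # Gradient step plus a fraction of the previous parameter change
    return theta - lr * grad + momentum * (theta - prev_theta)

def rmsprop_step(theta, grad, v, lr=0.01, beta2=0.9, eps=1e-8):
    # Moving average of squared gradients normalizes each parameter's step
    v = beta2 * v + (1 - beta2) * grad ** 2
    return theta - lr * grad / (np.sqrt(v) + eps), v

def adam_step(theta, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient and of its square (no bias correction)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return theta - lr * m / (np.sqrt(v) + eps), m, v

def loss_and_grad(theta):
    # Toy quadratic loss with different curvature per parameter (assumed)
    a = np.array([1.0, 10.0])
    return 0.5 * np.sum(a * theta ** 2), a * theta

def run(optimizer, n_steps=500):
    theta = np.array([1.0, 1.0])
    prev = theta.copy()
    m = np.zeros(2)
    v = np.zeros(2)
    for _ in range(n_steps):
        _, g = loss_and_grad(theta)
        if optimizer == "sgdm":
            theta, prev = sgdm_step(theta, g, prev), theta
        elif optimizer == "rmsprop":
            theta, v = rmsprop_step(theta, g, v)
        else:
            theta, m, v = adam_step(theta, g, m, v)
    return loss_and_grad(theta)[0]

initial_loss = loss_and_grad(np.array([1.0, 1.0]))[0]
final_losses = {name: run(name) for name in ("sgdm", "rmsprop", "adam")}
print(initial_loss, final_losses)
```

On this toy loss all three rules reduce the loss substantially. Which rule works best on a real network depends on the model and data, which is why the optimizers are compared experimentally in this paper.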

C. Gated Recurrent Unit (GRU)
The gated recurrent unit (GRU) is a memory cell that has been shown to be useful for various tasks [24]. It can be viewed as a simplified version of the LSTM [19]. In both the LSTM and the GRU, the "cell state" is the central concept. Information flow to the cell state is controlled by structures known as "gates," which are built from a sigmoid layer and an element-wise multiplication. The sigmoid layer outputs values between zero and one, with zero meaning "no quantity can pass" and one meaning "any amount can pass." The LSTM has three gates that preserve and govern the cell state in order to capture long-term dependencies: the forget gate, which determines what information should be discarded from the cell state; the input gate, which determines what information should be stored in the cell state; and the output gate, which determines what information should be output. The GRU is widely regarded as an LSTM variant that uses the same gate control mechanism, which alleviates the vanishing gradient problem [25]. There are, however, some differences. The LSTM's forget gate and input gate are combined into a single update gate, and the cell state and hidden state are merged. Thus, the GRU has only two gates: an update gate and a reset gate. As a result, the GRU's computation requirements are significantly lower than those of the LSTM. Figure 3 depicts the general layout of the GRU. The update gate $z(t)$ regulates how much information from the previous moment is carried into the current state. The reset gate $r(t)$ specifies how much information about the preceding moment should be ignored.
For an input vector $x(t)$, the GRU hidden unit applies the following transformations:
$$z(t) = \sigma(W_z x(t) + R_z h(t-1) + b_z),$$
$$r(t) = \sigma(W_r x(t) + R_r h(t-1) + b_r),$$
$$\hat{h}(t) = \tanh(W_h x(t) + R_h (r(t) \circ h(t-1)) + b_h),$$
where $W_z$, $W_r$ denote the input weight matrices of the $z(t)$ and $r(t)$ gates, respectively, and $b_z$, $b_r$ are the corresponding bias terms. $W_h$ denotes the weight matrix of the output state and $b_h$ is the corresponding bias term. $h(t-1)$ is the output state at time $t-1$, which is fed back as an input at time $t$, and $\hat{h}(t)$ and $h(t)$ denote the candidate state and the output state at time $t$. The output state $h(t)$ is an element-wise combination of $h(t-1)$ and $\hat{h}(t)$ weighted by the update gate $z(t)$. $\sigma$ and $\tanh$ are the activation functions of the gates and of the candidate state, respectively, and $R_z$, $R_r$, $R_h$ are the recurrent weight matrices.
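A single GRU time step can be sketched in a few lines. The input size of 256 and the 16 hidden units match the network described in this paper, whereas the random weights, the sequence length, and the parameter names are illustrative assumptions. The way the update gate blends the previous state with the candidate state follows one common convention; some libraries swap the roles of z and 1 - z.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU time step; p holds input weights W, recurrent weights R, biases b."""
    z = sigmoid(p["Wz"] @ x + p["Rz"] @ h_prev + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Rr"] @ h_prev + p["br"])             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Rh"] @ (r * h_prev) + p["bh"])  # candidate state
    # Assumed convention: z close to 1 favors the candidate state
    return (1 - z) * h_prev + z * h_cand

rng = np.random.default_rng(3)
n_in, n_hidden = 256, 16

# Small random weights stand in for trained parameters (assumed)
params = {}
for gate in "zrh":
    params["W" + gate] = 0.1 * rng.standard_normal((n_hidden, n_in))
    params["R" + gate] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
    params["b" + gate] = np.zeros(n_hidden)

# Run a short input sequence (length assumed) and keep the last output state
sequence = rng.standard_normal((3, n_in))
h = np.zeros(n_hidden)
for x in sequence:
    h = gru_step(x, h, params)

n_params = sum(p.size for p in params.values())
print(h.shape, n_params)
```

With these sizes the GRU layer holds 3 x (16 x 256 + 16 x 16 + 16) = 13104 parameters. An LSTM layer of the same size would need four such weight sets instead of three, which is one way to see the reduced computation requirements.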

D. Overall Architecture
The DL GRU neural network for channel estimation is built from the following five layers: a sequence input layer with an input size of 256, which is the number of input features; a GRU layer with 16 hidden units that outputs the last element of the sequence; a fully connected layer of size 4, corresponding to 4 classes; a softmax layer; and a classification output layer. Deeper GRU networks can be obtained by inserting additional GRU layers. Figure 4 shows the structure of the proposed channel estimator.
A softmax layer applies the softmax function to convert the output of the last fully connected layer into normalized prediction probabilities in the interval (0,1) as follows:
$$y_r = \frac{\exp(a_r)}{\sum_{j=1}^{k} \exp(a_j)},$$
where $a_r$ is the $r$th output of the fully connected layer and $k$ is the number of classes. In classification tasks, the classification layer usually follows the softmax layer. Using the defined loss function, this layer assigns the values from the softmax function to one of the mutually exclusive classes.
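The last stages of the network can be sketched as follows: a fully connected layer maps the 16 GRU outputs to 4 scores, the softmax turns the scores into probabilities, and the class with the largest probability is selected. The random weights, the stand-in GRU output, and the true label used for the loss are assumptions for illustration. Subtracting the maximum score before exponentiation is a standard numerical safeguard that does not change the result.

```python
import numpy as np

def softmax(a):
    # Shift by the maximum for numerical stability; the ratios are unchanged
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(probs, label):
    # Loss of a single sample whose true class index is `label`
    return -np.log(probs[label])

rng = np.random.default_rng(4)
h_last = np.tanh(rng.standard_normal(16))   # stand-in for the last GRU output (assumed)
W_fc = 0.1 * rng.standard_normal((4, 16))   # fully connected weights (assumed)
b_fc = np.zeros(4)

probs = softmax(W_fc @ h_last + b_fc)
predicted_class = int(np.argmax(probs))
loss = cross_entropy(probs, label=2)        # true label chosen arbitrarily

print(probs, predicted_class, loss)
```

Because the four probabilities sum to one, the predicted class can be read directly as the detected symbol index when detection is treated as a classification problem.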
In this work, we study the end-to-end performance of the receiver by implicitly estimating the channel parameters and then detecting the transmitted message from the received signal. Moreover, treating the channel estimation along with the signal detection as a classification problem can reduce the computational resources for online feature comparison due to the limited number of classes, which is required for the practical implementation of OFDM communication systems.
The proposed DL GRU neural network is trained with three optimization strategies to minimize the loss function: Adam, SGDm, and RMSProp. Because the number of pilots is limited, the aim is to find the most accurate and robust estimator. Training stops when the stopping criterion is met, that is, when the maximum number of iterations is reached or the error difference between two successive iterations is very small.
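The stopping rule can be sketched as a generic training loop. The function names, the tolerance, the iteration limit, and the toy one-parameter problem below are hypothetical and serve only to show the two exit conditions.

```python
def train_until_converged(step_fn, max_iters=1000, tol=1e-6):
    """Call step_fn until the loss change is below tol or max_iters is reached."""
    prev_loss = float("inf")
    loss = prev_loss
    for iteration in range(1, max_iters + 1):
        loss = step_fn()                     # one parameter update, returns the loss
        if abs(prev_loss - loss) < tol:      # change between successive iterations
            return iteration, loss, "converged"
        prev_loss = loss
    return max_iters, loss, "max_iters"

def make_quadratic_step(lr=0.1):
    # Toy problem (assumed): minimize (theta - 3)^2 by gradient descent
    state = {"theta": 0.0}
    def step():
        grad = 2.0 * (state["theta"] - 3.0)
        state["theta"] -= lr * grad
        return (state["theta"] - 3.0) ** 2
    return step

n_iter, final_loss, reason = train_until_converged(make_quadratic_step())
print(n_iter, final_loss, reason)
```

In practice the tolerance trades training time against accuracy: a loose tolerance stops early, while a very tight one usually means the iteration limit decides when training ends.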
Researchers have recently established channel models that accurately represent the CSI statistics of physical channels, and training data can be generated using these models. The 5G channel model is used in this investigation. This model, specified in TR 38.901 [26], [27], is used to simulate a variety of impairments that degrade channel estimation performance. Narrowband Rayleigh fading channels and doubly selective fading channels [28] can also be employed.
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited.
In offline training, the OFDM frame, which consists of pilot and transmission symbols, is constructed from random data. The selected channel model is used to generate the CSI data. Channel distortion and noise are taken into account when determining the received OFDM signal. The offline training data sets are made up of both the transmitted and the received signals. Figure 5 displays the creation of the training symbols and the offline procedure for producing the proposed GRU-based channel estimator.
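The generation of one training example can be sketched as follows. Every numerical setting here is an assumption: 64 subcarriers and QPSK are chosen only because they are one way to arrive at 256 real-valued input features (real and imaginary parts of one pilot symbol and one data symbol) and 4 classes, and a simple random multipath channel stands in for the 5G channel model used in this paper. The helper names `ofdm_receive` and `make_packet` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

n_sc, n_cp, n_taps = 64, 16, 4   # subcarriers, cyclic prefix, channel taps (all assumed)
snr_db = 20.0                    # signal-to-noise ratio (assumed)
target_sc = 0                    # subcarrier whose data symbol is the label (assumed)

qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))   # unit-power QPSK alphabet
pilot_idx = rng.integers(0, 4, n_sc)                         # fixed, known pilot symbol

def ofdm_receive(X, h, noise_std):
    """Send one OFDM symbol X through channel h and return the received subcarriers."""
    x = np.fft.ifft(X) * np.sqrt(n_sc)               # scaled IDFT, unit average power
    x_cp = np.concatenate([x[-n_cp:], x])            # add the cyclic prefix
    y = np.convolve(x_cp, h)[: n_cp + n_sc]          # multipath channel distortion
    noise = rng.standard_normal(y.size) + 1j * rng.standard_normal(y.size)
    y = y + noise_std * noise / np.sqrt(2)           # additive white Gaussian noise
    return np.fft.fft(y[n_cp:]) / np.sqrt(n_sc)      # remove the prefix, scaled DFT

def make_packet():
    """Return (features, label) for one pilot symbol plus one data symbol."""
    h = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2 * n_taps)
    data_idx = rng.integers(0, 4, n_sc)              # random transmitted data
    noise_std = 10 ** (-snr_db / 20)
    rx = np.concatenate([ofdm_receive(qpsk[pilot_idx], h, noise_std),
                         ofdm_receive(qpsk[data_idx], h, noise_std)])
    features = np.concatenate([rx.real, rx.imag])    # 2 symbols x 64 x 2 = 256 values
    return features, int(data_idx[target_sc])

packets = [make_packet() for _ in range(100)]
dataset_x = np.stack([f for f, _ in packets])
dataset_y = np.array([label for _, label in packets])
print(dataset_x.shape, dataset_y[:10])
```

Each row pairs the received pilot and data symbols with the class index of the symbol that was actually transmitted, so a classifier trained on such rows never needs an explicit channel estimate.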

IV. SIMULATION RESULTS
In this section, various experiments are carried out to illustrate the effectiveness of the DL GRU-based channel state estimator. The estimator is trained offline using the simulated data sets, and its SER is then compared with that of other estimators under varied signal-to-noise ratios (SNRs). Data for one subcarrier is used in the training dataset. A single OFDM pilot symbol and one OFDM transmission symbol are sent from the transmitter to the receiver in each OFDM packet. Some of the data symbols may be interleaved in the pilot sequence. A total of 10,000 OFDM packets are created, of which 80% are used for training and 20% for validation. Table I presents the parameters and training options for the GRU-based channel estimator, and Table II lists the channel parameters used in the OFDM system. The proposed estimator is trained using several optimization techniques so that its performance under different learning approaches can be observed. The SGDm, Adam, and RMSProp optimization algorithms are used.
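The 80/20 division of the 10,000 packets and the SER metric can be sketched as follows. The random shuffling before the split, the seed, and the helper names are assumptions, since the text does not state how the packets are assigned to the two sets.

```python
import numpy as np

def split_packets(n_packets, train_fraction=0.8, seed=0):
    """Return index arrays for the training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_packets)            # random assignment (assumed)
    n_train = int(round(train_fraction * n_packets))
    return order[:n_train], order[n_train:]

def symbol_error_rate(sent, detected):
    """Fraction of symbols whose detected class differs from the sent class."""
    sent = np.asarray(sent)
    detected = np.asarray(detected)
    return float(np.mean(sent != detected))

train_idx, val_idx = split_packets(10_000)
print(len(train_idx), len(val_idx))

# One wrong symbol out of four gives an SER of 0.25
print(symbol_error_rate([0, 1, 2, 3], [0, 1, 2, 0]))
```

In a simulation, the SER would be evaluated on detected symbols at each SNR point to produce curves like those discussed in this section.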

A. Effect of different number of pilots on system performance
In this section, the performance of the proposed estimator is compared with that of the conventional LS and MMSE estimators. In addition, the results are compared with the ReEsNet model studied in [15] and the five-layer DNN model used in [16]. All models are tested under the same channel conditions. The five estimators are tested with 4, 8, and 64 pilots, and the Adam optimizer is used in this simulation. As depicted in Figure 6, the proposed estimator performs significantly better than the LS estimator at SNRs ranging from 0 to 19 dB, and it performs similarly to the MMSE estimator at SNRs ranging from 0 to 11 dB. Moreover, the proposed estimator outperforms the ReEsNet and DNN models, especially at high SNR levels. For all SNR levels, the MMSE estimator outperforms the LS estimator. The reason is that the MMSE technique takes into account the effect of Gaussian noise on estimation performance and uses the second-order channel statistics. The LS estimator, on the other hand, does not make use of prior channel statistics.
Figures 7 and 8 show that the DL GRU-based estimator outperforms both the LS and MMSE conventional estimators when the number of pilots decreases to 8 and 4. Figure 7 shows that when only 8 pilots are used, the proposed estimator outperforms the LS, MMSE, and ReEsNet estimators by a gain of 2 to 4 dB in the low SNR range and 4 to 6 dB in the high SNR range. The proposed estimator also outperforms the DNN model at SNRs ranging from 15 to 20 dB. According to Figure 8, however, when only four pilots are used, the conventional estimators and the ReEsNet estimator lose their ability to work at zero dB. Conversely, the DL GRU-based estimator and the DNN model still improve the SER as the SNR increases, with the proposed estimator performing better than the DNN model at all SNR levels. Its improvement over the DNN model is about 4 dB in the high SNR range. This indicates that the DL GRU-based estimator is resilient to the restricted number of pilots available for CSI estimation, owing to the structural advantage of the GRU layer, which can remember previously processed information. Figure 9 shows the proposed estimator's performance with 64, 8, and 4 pilots. The DL GRU-based channel estimator outperforms both LS and MMSE at all three pilot counts, and it also performs better than the compared models. In addition, the proposed estimator's performance can be improved by increasing the number of pilots.

B. Effect of different optimization algorithms on system performance
Choosing the best optimization strategy for a given problem can be difficult. A poorly chosen strategy can leave the network stuck in a local minimum throughout training, which does not improve the learning process. It is therefore important to examine how different optimizers perform for the model and dataset at hand in order to obtain the best performance from the DL GRU-based channel estimator.
This section presents an experimental comparison of three optimization techniques to determine the most appropriate one for the channel estimation problem: Adam, which was examined in the previous subsection, SGDm, and RMSProp. To obtain a more accurate DL GRU-based channel estimator, we examine how well the learning processes of RMSProp and SGDm perform. Figures 10 and 11 show that the Adam and RMSProp models outperform the SGDm model with 64 and 8 pilots. In addition, the Adam model outperforms the RMSProp model at higher SNR levels. With only four pilots, the RMSProp model outperforms the other two in terms of SER, as illustrated in Figure 12. The performance of a given optimizer therefore varies with the number of pilots. Finally, the SGDm model performs worst across all pilot numbers considered. The reason is that SGDm uses a single learning rate for all parameters. Figure 13 shows the robustness of the proposed estimator to a restricted number of pilots, as well as the relevance of exploring alternative optimization techniques when training the proposed DL GRU estimator. For communication systems, it is therefore advisable to employ the proposed estimator with 8 or 4 pilots, which allows OFDM wireless communication systems to transmit data at higher rates. In addition, in the low SNR range, the proposed estimator with 4 pilots performs comparably to the estimator with 8 and 64 pilots.
It is useful to monitor the training process while it is taking place. Plotting the loss during training shows how the training is proceeding. Figures 14-16 show that the SGDm optimization technique attains the highest loss compared with the Adam and RMSProp techniques, which is consistent with Figures 10-12, where the DL GRU-based CSI estimator trained with SGDm has the worst SER performance. In addition, the losses of the RMSProp and Adam optimization techniques in Figures 14-16 support the conclusions drawn from Figures 10-12.

V. CONCLUSION
A new approach to OFDM channel estimation based on DL GRU neural networks has been developed. The proposed estimator is trained offline and then used online in a communication system to track the channel statistics, so that the CSI parameters can be estimated and the transmitted symbols can be reconstructed. Its performance has been examined with three pilot counts: 64, 8, and 4. The estimator has also been trained with three DL optimization techniques, namely SGDm, RMSProp, and Adam, to see how well it performs with each. The results show that the Adam and RMSProp models outperform the SGDm model with 64 and 8 pilots, while the RMSProp model outperforms the others in terms of SER when only four pilots are used. When only a small number of pilots is available, the proposed DL GRU-based CSI estimator outperforms the LS and MMSE estimators as well as the existing ReEsNet and DNN models in terms of SER. Because of the excellent learning and generalization capabilities of DL GRU neural networks, the proposed approach, which needs no prior information about the channel, is promising for CSI estimation in OFDM communication systems, particularly when the number of pilots is restricted. It also has the potential to improve communication systems such as 5G and beyond.