AI-Based Channel Prediction in D2D Links: An Empirical Validation

Device-to-Device (D2D) communication propelled by artificial intelligence (AI) will be an allied technology that will improve system performance and support new services in advanced wireless networks (5G, 6G and beyond). In this paper, AI-based deep learning techniques are applied to D2D links operating at 5.8 GHz with the aim of providing potential answers to the following questions concerning the prediction of the received signal strength variations: i) how effective is the prediction as a function of the coherence time of the channel? and ii) what is the minimum number of input samples required for a target prediction performance? To this end, a variety of measurement environments and scenarios are considered, including an indoor open-office area, an outdoor open space, line of sight (LOS), non-LOS (NLOS), and mobile scenarios. Four deep learning models are explored, namely long short-term memory networks (LSTMs), gated recurrent units (GRUs), convolutional neural networks (CNNs), and dense or feedforward networks (FFNs). Linear regression is used as a baseline model. It is observed that GRUs and LSTMs present equivalent performance, and both are superior when compared to CNNs, FFNs and linear regression. This indicates that GRUs and LSTMs are able to better account for temporal dependencies in the D2D data sets. We also provide recommendations on the minimum input lengths that yield the required performance given the channel coherence time. For instance, to predict 17 and 23 ms into the future, in indoor and outdoor LOS environments, respectively, an input length of 25 ms is recommended. This indicates that the bulk of the learning is done within the coherence time of the channel, and that large input lengths may not always be beneficial.


I. INTRODUCTION
Advanced wireless networks (5G, 6G, and beyond) aim to provide a multitude of new services, many of them supporting low-latency communications. One way to deliver low-latency communications is to enable wireless terminals to communicate with each other through direct, usually short, device-to-device (D2D) links. These links can also be conveniently used to provide additional features such as increased reliability and/or extended radio coverage [1].
The characteristics of signal propagation in D2D links can often be very different to those encountered in traditional wireless communications. In the latter, the base station is fixed with antennas elevated above rooftops, and the link is therefore usually free of local scattering. D2D communications occur in an infrastructure-less network, with fixed or mobile terminals at low elevations, immersed in rich scattering environments. Furthermore, mobile terminals may be in close proximity to the human body, whose motion is bound to cause stochastic shadowing on the radio link [2], [3]. Shadowing and scattering may be caused by a number of different radio obstructions present in the local environment, such as vehicles and buildings (outdoor), internal walls and furniture (indoor), and pedestrians (indoor and outdoor) [4].
VOLUME 4, 2016; arXiv:2206.08346v1 [eess.SP] 16 Jun 2022
This paper is concerned with the problem of predicting radio channel conditions encountered by D2D links. Given the difficulty in modelling D2D links, we explore the use of artificial intelligence (AI) tools and rely on extensive experiments.

A. RELATED WORK
Several works have previously addressed the wireless channel prediction problem using AI techniques [5]-[8]. In [5], a hybrid deep learning model for spatiotemporal prediction in cellular networks was presented. It was shown that the proposed model, which included an autoencoder-based deep network for spatial modeling and a long short-term memory network (LSTM) for temporal modeling, significantly improved prediction accuracy when compared to two commonly used baseline methods, namely autoregressive integrated moving average and support vector regression. Channel prediction using LSTMs and autoregressive methods was also applied to vehicular measurements in [6]. Unlike in [5], where it was observed that 'learning more' (i.e., increasing the number of stacked layers and hidden units in each layer) helped improve prediction performance, it was found in [6] that LSTMs with just a small number of hidden units performed better than LSTMs with a larger number of hidden units in each layer. Reference [6] only compared a single deep learning model with a single baseline model.
An initial study on channel prediction in body area networks (BANs) was carried out in [7]. It was shown that an LSTM based framework performed better on BAN measurements when compared with existing approaches such as moving average and adaptive prediction. However, these studies did not include comprehensive experiments using real data from a mobile device. Wireless channel quality prediction was also studied in [8] where an encoder-decoder based sequence-to-sequence deep learning model was used, and its performance was compared with linear regression for multiple networks and communication standards. It was observed that sequence lengths of size 20 captured most of the useful information in the data, and sequences of greater lengths did not improve prediction performance.
Concerning D2D communications, reference [9] focused on deep learning approaches for content caching in cache-enabled D2D networks. Two recurrent neural network approaches, namely echo state networks and LSTMs, were employed to predict users' mobility and content popularity, so as to determine which content to cache and where to cache it. However, these results were not tested on real-world channel measurements. In [10], a deep learning approach was proposed to predict D2D channel gains from independent cellular channel gains in order to solve various problems related to radio resource management. All predictions were based on the assumption of the channel being Gaussian.

B. MAIN CONTRIBUTIONS
This paper explores AI-based deep learning techniques on D2D links with the aim of providing potential answers to the following questions concerning the prediction of the received signal strength variations: 1) How effective is the prediction as a function of the coherence time of the channel? 2) What is the minimum number of input samples required for a target prediction performance? To address these questions, four deep learning models are explored. These include dense or feedforward networks (FFNs), convolutional neural networks (CNNs), gated recurrent units (GRUs), and LSTMs. It is worth mentioning that CNNs are commonly applied to analyse image data, but they also find application in predicting time series data [11]. Furthermore, recurrent neural networks (RNNs) are powerful in discovering the dependency in sequential data. Specifically, GRUs and LSTMs work well on sequential data with long-term dependencies [12]-[14] due to their internal memory mechanisms.
Accordingly, unlike prior work, we focus on understanding empirically the relationship between channel coherence time and number of samples used in the prediction models, as well as the minimum input length required to achieve a target prediction performance for a given coherence time.
To this end, we compare and validate the prediction performances of the deep learning models on real-world D2D field measurements conducted for a variety of environments and scenarios at 5.8 GHz. These include an indoor open office area environment, an outdoor open space environment, LOS, non-LOS (NLOS) and mobile scenarios.
The remainder of this paper is organized as follows. Section II describes the D2D channel measurements conducted at 5.8 GHz. Section III discusses the data preprocessing steps applied. Section IV discusses the deep learning models used for prediction and their implementation details. Section V presents the experimental results. Lastly, Section VI finishes the paper with some concluding remarks.

II. D2D CHANNEL MEASUREMENTS
The wireless channel measurement system used in this study was based on the ML5805 transceivers, manufactured by RFMD (Qorvo). The transceiver boards were interfaced with a PIC32MX which acted as a baseband controller and allowed the analog received signal strength (RSS) to be sampled with a 10-bit quantization depth. A D2D link was formed between two persons, namely person A (an adult female of height 1.65 m and weight 53 kg) and person B (an adult male of height 1.83 m and weight 73 kg). The user equipment (UE) positioned on person A acted as the transmitter and was configured to output a continuous wave signal with a power level of +17.6 dBm at 5.8 GHz. The UE positioned on person B acted as the receiver and sampled the channel at a rate of 10 kHz.
The RSS data was downsampled by averaging 10 consecutive samples to improve the signal to noise ratio (SNR) performance, thus giving an effective sampling rate of 1 kHz after downsampling. Furthermore, the antennas used by the transmitter and receivers were +2.3 dBi sleeve dipole antennas (Mobile Mark model PSKN3-24/55S). The antennas were housed in a compact acrylonitrile butadiene styrene (ABS) enclosure (107 × 55 × 20 mm). This setup was representative of the form factor of a smart phone which allowed the user to hold the device as they normally would to make a voice call. Each antenna was securely fixed to the inside of the enclosure using a small strip of Velcro®.
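The block-averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `downsample_by_averaging` is hypothetical, and whether the averaging was done on linear or logarithmic RSS values is not stated in the text, so the sketch simply averages whatever values it is given.

```python
import numpy as np

def downsample_by_averaging(rss, factor=10):
    """Average each run of `factor` consecutive samples into one sample."""
    rss = np.asarray(rss, dtype=float)
    n_blocks = len(rss) // factor  # an incomplete trailing block is dropped
    return rss[:n_blocks * factor].reshape(n_blocks, factor).mean(axis=1)

# Stand-in for 4 ms of RSS sampled at 10 kHz (40 samples).
raw = np.arange(40, dtype=float)
rss_1khz = downsample_by_averaging(raw, factor=10)  # 4 samples at 1 kHz
```

Averaging 10 samples at 10 kHz gives one sample per millisecond, which is why each sample later in the paper corresponds to 1 ms.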
The D2D channel measurements were obtained within an indoor open office area and an outdoor open space environment, as shown in Fig. 1. The indoor open office was located on the first floor of the Institute of Electronics, Communications and Information Technology (ECIT) building at Queen's University Belfast in the United Kingdom. The building mainly consists of metal studded dry walls with metal tiled floors covered with polypropylene-fiber, rubber backed carpet tiles, a metal ceiling with mineral fiber tiles and recessed louvered luminaires suspended 2.7 m above floor level. The office contained a number of chairs, metal storage spaces, doors and desks constructed from medium density fibreboard. These desks were vertically separated by soft wooden partitions. During the measurements, the office area was unoccupied in order to facilitate pedestrian free D2D channel measurements. The outdoor D2D measurements were conducted in an outdoor car parking area adjacent to the ECIT building.
As shown in Fig. 1, during the D2D measurements [15], persons A and B held their UEs at their left ears to imitate making a voice call. For the LOS D2D measurements, person B was positioned directly in front of person A, whilst for the NLOS D2D measurements, person B was positioned around an adjacent corner. It is worth noting that both test subjects were initially stationary, after which they were instructed to walk around randomly within a circle of radius 0.5 m from their starting points. For the LOS D2D measurements in both environments, while there may have been a direct LOS between the two persons' bodies during the trials, the link between the hypothetical UEs would in fact have been subject to quasi-LOS conditions due to the random movements undertaken. For the NLOS case, person B was always positioned around an adjacent corner to ensure that the NLOS conditions (i.e. no direct signal path between persons A and B) were maintained irrespective of the random movements.

III. DATA PREPROCESSING
Once the RSS measurements were obtained, the small-scale fading data was extracted for analysis. Specifically, the large-scale fading component was removed by applying a low-pass filter to the raw RSS data in linear scale. To determine the window size for extraction of the local mean signal, the raw data was visually inspected and overlaid with the local mean signal for differing window sizes. A smoothing window of 50 samples was then used. Fig. 2 shows the small-scale fading variations for a 4 s window observed in the indoor LOS, indoor NLOS, outdoor LOS and outdoor NLOS environments, respectively. The overall measurement data set consisted of approximately 62300 samples, equivalent to 62.3 s in time. This included the measurement data for all scenarios.
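One way to picture the local-mean removal is the sketch below. It assumes that the low-pass filter is a simple moving average over 50 samples and that the large-scale component is removed by dividing the linear-scale RSS by this local mean; the paper does not state either detail, so both are assumptions, as is the function name `extract_small_scale_fading`.

```python
import numpy as np

def extract_small_scale_fading(rss_linear, window=50):
    """Return (small_scale, local_mean) for a linear-scale RSS trace."""
    rss_linear = np.asarray(rss_linear, dtype=float)
    kernel = np.ones(window) / window
    # Edge-pad so that the local mean has the same length as the input.
    pad_left = window // 2
    pad_right = window - 1 - pad_left
    padded = np.pad(rss_linear, (pad_left, pad_right), mode="edge")
    local_mean = np.convolve(padded, kernel, mode="valid")
    # Assumed normalisation: divide out the slowly varying local mean.
    return rss_linear / local_mean, local_mean

# Synthetic stand-in: slow drift multiplied by a fast ripple.
t = np.arange(1000)
trace = (1.0 + 0.001 * t) * (1.0 + 0.2 * np.sin(2 * np.pi * t / 10))
small_scale, local_mean = extract_small_scale_fading(trace, window=50)
```

In the logarithmic domain the same operation corresponds to subtracting the local mean in dB.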
Data scaling was then applied to transform the data into ranges and forms that are appropriate for modeling. It is well known that models trained on scaled data perform significantly better when compared to models trained on unscaled data [16]. In addition, gradient descent converges much faster with scaled data than without it [17]. In this paper, the data sets were scaled using min-max normalisation (which performs a linear transformation on the original data) [18] before being input to the model for training. Let $x_{\min}$ and $x_{\max}$ be the minimum and maximum values for attribute $X$. Min-max normalization maps a value $v$ of $X$ to $v'$ in the range $[\mathrm{new}\,x_{\min}, \mathrm{new}\,x_{\max}]$ using (1), as follows:
$$v' = \frac{v - x_{\min}}{x_{\max} - x_{\min}} \left(\mathrm{new}\,x_{\max} - \mathrm{new}\,x_{\min}\right) + \mathrm{new}\,x_{\min} \tag{1}$$
Note that the normalization output was customised to be in the range [-1, 1] by rewriting (1), as follows:
$$v' = 2\,\frac{v - x_{\min}}{x_{\max} - x_{\min}} - 1 \tag{2}$$
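The min-max mapping of (1), with the target range set to [-1, 1] as in (2), can be sketched as below. The function name and its optional `x_min`/`x_max` arguments are illustrative; passing statistics computed on the training split only is a common practice that the paper does not confirm.

```python
import numpy as np

def min_max_scale(v, new_min=-1.0, new_max=1.0, x_min=None, x_max=None):
    """Linearly map values from [x_min, x_max] to [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    x_min = v.min() if x_min is None else x_min
    x_max = v.max() if x_max is None else x_max
    return (v - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

scaled = min_max_scale([2.0, 4.0, 6.0])  # minimum maps to -1, maximum to +1
```

With `new_min = -1` and `new_max = 1` the general expression reduces to twice the unit-range value minus one.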

IV. METHODOLOGY
This work focuses on a univariate time series forecasting problem. Here, data sets comprised of only a single variable are observed at each time step, and a model is used to exploit the values seen at prior time steps to predict the subsequent time step values. A sliding window approach (see footnote 2) is adopted to restructure the time series data as a supervised learning problem (see footnote 3). Thus, the models here make a set of predictions based on a window of consecutive samples from the data sets. This section first discusses how the prediction problem in this paper is framed in a supervised learning manner through data windowing. Then, the baseline and deep learning models used here are explained. Following this, their implementation details are presented.

A. DATA WINDOWING
Data windowing of the models is represented in Fig. 3. The input size, also called the input width, is the number of time steps considered by the window as an input, and is denoted by $T_x$. The number of output steps to be predicted, also called the horizon, is denoted by $T_y$.
Footnote 2: A sliding window is a statistical method in which a window of specified length moves over the data, sample by sample, and a statistic is computed over the data in the window.
Footnote 3: Supervised learning is the most popular way of framing problems for machine learning as a collection of observations with inputs and outputs.
The linear and feedforward models flatten the input data into a vector of size $T_x$ that conveys the previous time steps. The main drawback of this approach is that the resulting model can only be executed on input windows of exactly the same shape. The CNN model also takes multiple time steps as input to produce one-shot predictions of $T_y$ output steps. However, unlike feedforward networks, CNNs can be run on inputs of any length, and the predictions are based on a fixed-width history controlled by their kernel sizes. This might result in better performance since the model can see how things are changing over time. Recurrent neural networks such as LSTMs and GRUs are intrinsically well-suited to process sequential data. This is done by maintaining an internal state from time step to time step. At each time step an input of size $T_x$ is fed into the model, producing $T_y$ output steps as predictions. For the next time step, the data window is shifted by $T_y$ samples.
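The windowing described in this subsection can be sketched as follows: each training example pairs $T_x$ consecutive input samples with the $T_y$ samples that follow, and the window advances by $T_y$ samples between examples. The helper name `make_windows` and its `shift` argument are illustrative and do not come from the paper.

```python
import numpy as np

def make_windows(series, t_x, t_y, shift=None):
    """Split a 1-D series into (inputs, targets) of widths t_x and t_y."""
    series = np.asarray(series, dtype=float)
    shift = t_y if shift is None else shift  # default: shift by the horizon
    starts = range(0, len(series) - t_x - t_y + 1, shift)
    X = np.stack([series[s:s + t_x] for s in starts])
    Y = np.stack([series[s + t_x:s + t_x + t_y] for s in starts])
    return X, Y

# Toy series 0..9 with an input width of 4 and a horizon of 2.
X, Y = make_windows(np.arange(10), t_x=4, t_y=2)
# X[0] = [0, 1, 2, 3] is paired with Y[0] = [4, 5]; the next window starts at 2.
```

Each row of `X` then plays the role of one flattened input vector for the linear and feedforward models, while the recurrent and convolutional models can consume the same rows as sequences.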

B. LINEAR REGRESSION BASELINE MODEL
This model assumes that the relationship between the independent variables (or features) x and the dependent variable y is linear i.e., y can be expressed as a weighted sum of the elements in x, given some noise on the observations. It should be noted that the baseline model here refers to a univariate linear regression model.
Assuming that the inputs consist of $T_x$ features, the prediction $\hat{y}$ is expressed as
$$\hat{y} = w_1 x_1 + \dots + w_{T_x} x_{T_x} + b \tag{3}$$
where $w_1, \dots, w_{T_x}$ are called weights, and $b$ is called a bias (also called an offset or intercept). The weights determine the influence of each feature on the prediction, and the bias indicates the value that the prediction should take when all of the features take value 0. Models whose output prediction is determined by the affine transformation of input features are linear models, where the affine transformation is specified by the chosen weights and bias [19]. Now collecting all features into a vector $\mathbf{x} \in \mathbb{R}^{T_x}$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^{T_x}$, the model in (3) can be expressed as [19, eq. 3.1.3],
$$\hat{y} = \mathbf{w}^{\top} \mathbf{x} + b \tag{4}$$
Here, the vector $\mathbf{x}$ corresponds to features of a single data example and $(\cdot)^{\top}$ is the vector transpose. For a collection of features fed into the model in a batch size (see footnote 4) of $N$, with $\mathbf{X} \in \mathbb{R}^{N \times T_x}$ and $\mathbf{w} \in \mathbb{R}^{T_x \times T_y}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^{N \times T_y}$ can be expressed via the matrix-vector product [19, eq. 3.1.4]
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + \mathbf{b} \tag{5}$$
with $\mathbf{b} \in \mathbb{R}^{1 \times T_y}$.
FIGURE 4: Dense/feed-forward network with one hidden layer with h hidden units.
Footnote 4: The batch size is the number of samples processed before the model is updated.
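As a rough illustration of the batched linear model, the sketch below fits the weights and bias by ordinary least squares on synthetic, noiseless data and then forms the predictions as a matrix product plus bias. The closed-form `numpy.linalg.lstsq` fit is swapped in here for brevity; the paper does not state how its baseline was fitted, and the sizes used are only examples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_x, T_y = 200, 25, 11  # batch size, input width, horizon (example values)

# Synthetic data generated by a known linear model.
X = rng.normal(size=(N, T_x))
W_true = rng.normal(size=(T_x, T_y))
b_true = rng.normal(size=(1, T_y))
Y = X @ W_true + b_true

# Append a column of ones so the bias is estimated together with the weights.
X_aug = np.hstack([X, np.ones((N, 1))])
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_hat, b_hat = theta[:-1], theta[-1:]

Y_hat = X @ W_hat + b_hat  # predictions with shape (N, T_y)
```

Because the synthetic targets contain no noise, the recovered `W_hat` and `b_hat` match the generating values up to numerical precision.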

C. FEEDFORWARD NETWORK
These networks are also known as dense networks, and are capable of handling a more general class of functions by incorporating one or more hidden layers. These layers create non-linear representations of the data and are able to capture complex interactions among the inputs. The final (output) layer is usually a linear predictor. The network layers are connected in a fully connected manner, meaning that every input influences every neuron in the hidden layer, and each of these influences every neuron in the output layer. A dense network with one hidden layer is illustrated in Fig. 4. The inputs $\mathbf{X} \in \mathbb{R}^{N \times T_x}$ are fed into the model in a batch of $N$ training instances, where each instance has $T_x$ inputs. Considering a one-hidden-layer network whose hidden layer has $h$ hidden units, the hidden representation, $\mathbf{H} \in \mathbb{R}^{N \times h}$, and the network output, $\mathbf{Y} \in \mathbb{R}^{N \times T_y}$, are given as [19, eq. 4.1.3]
$$\mathbf{H} = g\left(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right) \tag{6}$$
and
$$\mathbf{Y} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)} \tag{7}$$
respectively. The weights and biases of the hidden layer are $\mathbf{W}^{(1)} \in \mathbb{R}^{T_x \times h}$ and $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$, respectively, whereas the weights and biases of the output layer are $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times T_y}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times T_y}$, respectively. Finally, the activation function $g(\cdot)$ is responsible for introducing non-linearity in the model. In this work, we adopt rectified linear units (ReLU), $g(x) = \max\{0, x\}$, as the hidden layer activations.
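The forward pass of the one-hidden-layer network can be sketched in NumPy as below, with a ReLU hidden layer followed by a linear output layer. The weights are random placeholders rather than trained values, and the function name `ffn_forward` is illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffn_forward(X, W1, b1, W2, b2):
    """Return (Y, H): network output and hidden representation."""
    H = relu(X @ W1 + b1)  # hidden layer, shape (N, h)
    Y = H @ W2 + b2        # linear output layer, shape (N, T_y)
    return Y, H

rng = np.random.default_rng(0)
N, T_x, h, T_y = 32, 25, 25, 11  # example sizes only
X = rng.normal(size=(N, T_x))
W1, b1 = rng.normal(size=(T_x, h)), np.zeros((1, h))
W2, b2 = rng.normal(size=(h, T_y)), np.zeros((1, T_y))
Y, H = ffn_forward(X, W1, b1, W2, b2)
```

The shapes of the weight matrices fix the input width, which is why this model can only be run on windows of exactly $T_x$ samples.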

D. LONG SHORT-TERM MEMORY NETWORK
The LSTM network is a type of RNN that is well known for its time series prediction capabilities. In a standard RNN, the nodes, i.e., the building blocks of a neural network architecture, are composed of basic activation functions such as tanh and sigmoid. As indicated in [11], because RNN weights are learned by backpropagating errors through the network, the use of these activation functions can cause RNNs to suffer from the vanishing or exploding gradient problem, in which the gradients take either infinitesimally small or excessively large values. This affects a recurrent neural network's ability to learn long-term dependencies [20]. The LSTM network is able to partially overcome this problem by creating paths through time that have derivatives that neither vanish nor explode [11] by incorporating the ability to forget.
FIGURE 5: Block diagram of the LSTM cell.
As explained in [11], LSTM recurrent networks have LSTM cells, each of which includes a memory cell (or cell for short) designed to record additional information, which allows it to handle long-term dependencies. Each cell has the same inputs and outputs as an ordinary recurrent network, but also has more parameters and a system of three gating units that control the flow of information, namely the output gate, input gate, and forget gate. The design of these gates was inspired by the logic gates of a computer. The output gate reads out the entries from the cell. The input gate decides when to read data into the cell. Lastly, the forget gate provides a mechanism for resetting the cell's content. The main motivation of this gating design is to be able to decide when to remember and when to ignore inputs in the hidden state. Fig. 5 shows the block diagram of a single LSTM cell, which has an internal recurrence (a self-loop) in addition to the outer recurrence of the RNN. The most important component is the memory cell state unit $\mathbf{c}^{(t)} \in \mathbb{R}^{N \times h}$ that captures the internal state of the LSTM cell and has a linear self-loop given by [19, eq. 9.2.3]
$$\mathbf{c}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \tilde{\mathbf{c}}^{(t)} \tag{8}$$
where $\odot$ is the Hadamard (elementwise) product operator. The memory cell is updated by partially forgetting the existing memory and adding new memory content. The candidate memory cell $\tilde{\mathbf{c}}^{(t)} \in \mathbb{R}^{N \times h}$ holds this new memory content, and the degree to which it is added to the memory cell is modulated by the input gate $\mathbf{i}^{(t)} \in \mathbb{R}^{N \times h}$. The new memory content is given as [19, eq. 9.2.2]
$$\tilde{\mathbf{c}}^{(t)} = \tanh\left(\mathbf{x}^{(t)}\mathbf{W}_{xc} + \mathbf{h}^{(t-1)}\mathbf{W}_{hc} + \mathbf{b}_c\right) \tag{9}$$
where $\mathbf{W}_{xc} \in \mathbb{R}^{T_x \times h}$ and $\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}$ are the input weights and recurrent weights with respect to the cell gate, respectively, and $\mathbf{b}_c \in \mathbb{R}^{1 \times h}$ is a bias parameter. The batch size is denoted by $N$, $T_x$ is the number of inputs, $\mathbf{x}^{(t)} \in \mathbb{R}^{N \times T_x}$ is the current input vector, and $\mathbf{h}^{(t)} \in \mathbb{R}^{N \times h}$ is the current hidden layer vector with $h$ hidden units containing the outputs of all the LSTM cells, with $\mathbf{h}^{(t-1)}$ denoting its value at the previous time step.
The self-loop weight is controlled by a forget gate unit $\mathbf{f}^{(t)} \in \mathbb{R}^{N \times h}$ (for time step $t$), which sets this weight to a value between 0 and 1 via a sigmoid unit $\sigma(\cdot)$. It is expressed as [19, eq. 9.2.1]
$$\mathbf{f}^{(t)} = \sigma\left(\mathbf{x}^{(t)}\mathbf{W}_{xf} + \mathbf{h}^{(t-1)}\mathbf{W}_{hf} + \mathbf{b}_f\right) \tag{10}$$
The biases, input weights and recurrent weights for the forget gates are denoted by $\mathbf{b}_f \in \mathbb{R}^{1 \times h}$, $\mathbf{W}_{xf} \in \mathbb{R}^{T_x \times h}$, and $\mathbf{W}_{hf} \in \mathbb{R}^{h \times h}$, respectively. The external input gate unit $\mathbf{i}^{(t)} \in \mathbb{R}^{N \times h}$ is computed similarly to the forget gate and is expressed as [19, eq. 9.2.1]
$$\mathbf{i}^{(t)} = \sigma\left(\mathbf{x}^{(t)}\mathbf{W}_{xi} + \mathbf{h}^{(t-1)}\mathbf{W}_{hi} + \mathbf{b}_i\right) \tag{11}$$
with $\mathbf{W}_{xi} \in \mathbb{R}^{T_x \times h}$ being the input weights, $\mathbf{W}_{hi} \in \mathbb{R}^{h \times h}$ the recurrent weights, and $\mathbf{b}_i \in \mathbb{R}^{1 \times h}$ the bias for the input gate. The output $\mathbf{h}^{(t)} \in \mathbb{R}^{N \times h}$ of the LSTM cell, also called the hidden state, and the output gate $\mathbf{o}^{(t)} \in \mathbb{R}^{N \times h}$ are given by
$$\mathbf{h}^{(t)} = \mathbf{o}^{(t)} \odot \tanh\left(\mathbf{c}^{(t)}\right) \tag{12}$$
and
$$\mathbf{o}^{(t)} = \sigma\left(\mathbf{x}^{(t)}\mathbf{W}_{xo} + \mathbf{h}^{(t-1)}\mathbf{W}_{ho} + \mathbf{b}_o\right) \tag{13}$$
respectively. Again, the input weights, recurrent weights, and bias for the output gate are respectively denoted as $\mathbf{W}_{xo} \in \mathbb{R}^{T_x \times h}$, $\mathbf{W}_{ho} \in \mathbb{R}^{h \times h}$, and $\mathbf{b}_o \in \mathbb{R}^{1 \times h}$. The hidden state vector is simply a gated version of the hyperbolic tangent of the memory cell. This ensures that $\mathbf{h}^{(t)}$ is always between -1 and 1. Whenever the output gate is close to 1, all information is effectively passed from the memory to the predictor, while for an output gate close to 0, all the information is retained within the memory cell and no further processing is performed.
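A single LSTM time step built from the forget, input and output gates, the candidate memory and the gated hidden state can be sketched as follows. This follows the standard LSTM formulation with the previous hidden state feeding the gates; the parameter dictionary, its key names and the random initialisation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(t_x, h, rng):
    """Random placeholder weights for the gates f, i, o and the cell c."""
    p = {}
    for gate in ("f", "i", "o", "c"):
        p["W_x" + gate] = rng.normal(scale=0.1, size=(t_x, h))  # input weights
        p["W_h" + gate] = rng.normal(scale=0.1, size=(h, h))    # recurrent weights
        p["b_" + gate] = np.zeros((1, h))
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    f = sigmoid(x_t @ p["W_xf"] + h_prev @ p["W_hf"] + p["b_f"])  # forget gate
    i = sigmoid(x_t @ p["W_xi"] + h_prev @ p["W_hi"] + p["b_i"])  # input gate
    o = sigmoid(x_t @ p["W_xo"] + h_prev @ p["W_ho"] + p["b_o"])  # output gate
    c_tilde = np.tanh(x_t @ p["W_xc"] + h_prev @ p["W_hc"] + p["b_c"])
    c_t = f * c_prev + i * c_tilde  # partially forget, then add new content
    h_t = o * np.tanh(c_t)          # gated read-out of the memory cell
    return h_t, c_t

rng = np.random.default_rng(0)
N, T_x, h = 32, 25, 25  # example sizes only
params = init_lstm_params(T_x, h, rng)
x_t = rng.normal(size=(N, T_x))
h_t, c_t = lstm_step(x_t, np.zeros((N, h)), np.zeros((N, h)), params)
```

Since the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of `h_t` stays strictly between -1 and 1, as noted in the text.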

E. GATED RECURRENT UNIT
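GRUs are a newer generation of RNNs and work similarly to LSTMs. Both have a dedicated mechanism composed of gating units to decide when to memorize and when to ignore inputs in the hidden state [13]. The key difference is that GRUs have only two gates that control the flow of information, namely the reset gate and the update gate. Furthermore, the cell state (memory unit) is not part of the GRU gating unit, which uses only the hidden state $\mathbf{h}^{(t)} \in \mathbb{R}^{N \times h}$ to transfer information. The core functionality of GRUs relies on a single gating unit simultaneously controlling the forgetting factor and the decision to update the state unit. This update is expressed as [11, eq. 10.45]
$$\mathbf{h}^{(t)} = \mathbf{u}^{(t)} \odot \mathbf{h}^{(t-1)} + \left(1 - \mathbf{u}^{(t)}\right) \odot \tilde{\mathbf{h}}^{(t)} \tag{14}$$
The update gate $\mathbf{u}^{(t)} \in \mathbb{R}^{N \times h}$ and the reset gate $\mathbf{r}^{(t)} \in \mathbb{R}^{N \times h}$ are expressed as [19, eq. 9.1.1],
$$\mathbf{u}^{(t)} = \sigma\left(\mathbf{x}^{(t)}\mathbf{W}_{xu} + \mathbf{h}^{(t-1)}\mathbf{W}_{hu} + \mathbf{b}_u\right) \tag{15}$$
and
$$\mathbf{r}^{(t)} = \sigma\left(\mathbf{x}^{(t)}\mathbf{W}_{xr} + \mathbf{h}^{(t-1)}\mathbf{W}_{hr} + \mathbf{b}_r\right) \tag{16}$$
respectively, where $\mathbf{W}_{xu}, \mathbf{W}_{xr} \in \mathbb{R}^{T_x \times h}$ and $\mathbf{W}_{hu}, \mathbf{W}_{hr} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_u, \mathbf{b}_r \in \mathbb{R}^{1 \times h}$ are biases. The current input vector is $\mathbf{x}^{(t)} \in \mathbb{R}^{N \times T_x}$ with batch size $N$ and input size $T_x$. The current hidden layer containing the GRU outputs is denoted as $\mathbf{h}^{(t)} \in \mathbb{R}^{N \times h}$ with $h$ hidden units.
FIGURE 6: Block diagram of the GRU model.
The candidate hidden state $\tilde{\mathbf{h}}^{(t)} \in \mathbb{R}^{N \times h}$ at time step $t$ is given as [19, eq. 9...]
$$\tilde{\mathbf{h}}^{(t)} = \tanh\left(\mathbf{x}^{(t)}\mathbf{W}_{xh} + \left(\mathbf{r}^{(t)} \odot \mathbf{h}^{(t-1)}\right)\mathbf{W}_{hh} + \mathbf{b}_h\right) \tag{17}$$
with $\mathbf{W}_{xh} \in \mathbb{R}^{T_x \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ denoting weight parameters and $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ the bias.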
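A single GRU time step can be sketched in the same style. The sketch assumes the common convention in which the update gate weights the previous hidden state and its complement weights the candidate state; the parameter names and the random initialisation are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru_params(t_x, h, rng):
    """Random placeholder weights for update (u), reset (r), candidate (h)."""
    p = {}
    for name in ("u", "r", "h"):
        p["W_x" + name] = rng.normal(scale=0.1, size=(t_x, h))
        p["W_h" + name] = rng.normal(scale=0.1, size=(h, h))
        p["b_" + name] = np.zeros((1, h))
    return p

def gru_step(x_t, h_prev, p):
    u = sigmoid(x_t @ p["W_xu"] + h_prev @ p["W_hu"] + p["b_u"])  # update gate
    r = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    h_tilde = np.tanh(x_t @ p["W_xh"] + (r * h_prev) @ p["W_hh"] + p["b_h"])
    # Assumed convention: u keeps the old state, (1 - u) admits the candidate.
    return u * h_prev + (1.0 - u) * h_tilde

rng = np.random.default_rng(0)
N, T_x, h = 32, 25, 25  # example sizes only
gru_params = init_gru_params(T_x, h, rng)
x_t = rng.normal(size=(N, T_x))
h_prev = np.tanh(rng.normal(size=(N, h)))
h_next = gru_step(x_t, h_prev, gru_params)
```

Compared with the LSTM, there is no separate memory cell: the hidden state alone carries information between time steps, and three weight groups are needed instead of four.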

F. CONVOLUTIONAL NEURAL NETWORK
The name convolutional neural network indicates that the network employs a mathematical operation called convolution, which is a specialised kind of linear operation. CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. As well as this, the convolution of the input with a set of filters (called kernels) is used as the main operation in at least one of its layers. A convolution of a general time series with a kernel of size 5 is shown in Fig. 7. Each kernel convolves with the input producing a representation of the input as an output (this is illustrated as a dashed line in Fig. 7). These representations are then flattened and fed into a feedforward network with one hidden layer with T y hidden units, producing the outputs.
The discrete convolution of an input $x$ with a kernel $w$ results in the output $y$, given by [11, eq. 9.3]
$$y(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \tag{18}$$
where $*$ represents the convolutional operator, $t$ is the time index, and $x$ and $w$ are defined only on integer $t$. Unlike GRUs and LSTMs, CNNs are not a type of RNN due to the lack of self-loop mechanisms. Instead, CNNs are well established in the literature and industry as an efficient feature extractor, leading to important progress in computer vision and related tasks [21]-[23]. It is worth highlighting that although CNNs are not RNNs, they are known for processing data that have a grid-like topology. In particular, one-dimensional convolutional neural networks (1DCNNs) are efficient in processing information present in one-dimensional data, such as time series [11]. The benefit of using 1DCNNs for sequence classification is that they can learn directly from the raw time series data and do not require domain experience to manually design input characteristics.
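For a finite series and a finite kernel, the discrete convolution can be sketched as below, keeping only the output positions where the kernel fully overlaps the input. The function name `conv1d_valid` is illustrative. Note that convolutional layers in deep learning libraries typically compute cross-correlation, i.e., they do not flip the kernel; for a learned kernel this makes no practical difference.

```python
import numpy as np

def conv1d_valid(x, w):
    """Discrete convolution of x with w over fully overlapping positions."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    flipped = w[::-1]  # convolution flips the kernel before sliding it
    return np.array([np.sum(x[t:t + k] * flipped)
                     for t in range(len(x) - k + 1)])

# A kernel of size 5 acting as a moving average over a short series.
series = np.arange(8, dtype=float)
kernel = np.ones(5) / 5
y = conv1d_valid(series, kernel)
```

With an input of 8 samples and a kernel of size 5, the output has 8 - 5 + 1 = 4 samples, each summarising a fixed-width history of 5 time steps.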

G. IMPLEMENTATION DETAILS
The models used in this paper were built on Ubuntu 20.04.2 LTS using TensorFlow® 2.5. The data was split into three independent sets, namely training (70%), validation (20%) and test (10%) sets. The models were trained and tested in parallel on three computing systems, which include: 1) a 9th Generation Intel® Core™ i7-9750H consisting of 6 cores, 16 [...] was used as an adaptive learning rate method with a step size of 0.001. Depending on the configuration of the parameters and computer used, training the deep learning models (i.e., a single experiment) took a maximum of 1 hour whilst the testing phase of the model took only a few minutes. Furthermore, due to the high computational requirement of the deep learning models, the parameter space was extensively investigated over a few months before empirically (see footnote 6) deciding upon the optimal parameters of the model. Table 1 provides the parameter space explored for the deep learning models on which the predictions here are based. A batch size of 32 was chosen as no significant improvements were noticed when the batch size was increased from 32 to 64. A dropout rate of 0.3 was chosen to reduce overfitting of the model to the training data, and the models were trained for 150 epochs (see footnotes 7 and 8). Besides dropout, early stopping with a patience of 15 epochs was used to regularize the models.
As shown in Table 2, for each of the D2D data sets the output lengths to be predicted were varied between 4 and 23 samples depending on the coherence time of the channel and the time correlation function. According to [25], the coherence time is defined as the time over which the time correlation function is above 0.5. However, in this study the threshold on the time correlation function was varied from 0.1 to 0.9 to obtain a range of output lengths for prediction, and to evaluate the prediction performance of the models. For instance, for the indoor LOS measurements, when the time correlation function is 0.5, the coherence time of the channel was found to be 11 ms. Since each sample is equal to 1 ms, the corresponding output length is computed to be 11 samples. In the experiments conducted here (at both training and test times), single shot predictions were made where the model predicted $T_y$ time steps into the future, given RSS measurement samples of length between 1 and 100.
Footnote 6: It is worth highlighting that determining the optimal parameter values theoretically for a particular data set is still an open research question. Hence, for the work carried out here, the parameters were determined empirically [8].
Footnote 7: The number of epochs is the number of complete passes through the training data set. Assume that a data set has x samples (rows of data), a batch size of y and z epochs. This means that the data set will be divided into x/y batches, each with y samples. The model weights will be updated after each batch of y samples. This also means that one epoch will involve x/y batches or x/y updates to the model. With z epochs, the model will pass through the whole data set z times, i.e., a total of x/y × z batches during the entire training process.
Footnote 8: In this study, no significant improvement in performance was noticed when the number of epochs was raised above 150.
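The link between the time correlation threshold and the output length can be illustrated with the sketch below. It estimates a normalised autocorrelation from a mean-removed series and returns the first lag, in samples, at which this estimate falls below the threshold. The estimator, the mean removal and the function name `coherence_time_samples` are assumptions made for illustration; the paper does not describe how its time correlation function was computed.

```python
import numpy as np

def coherence_time_samples(x, threshold=0.5):
    """First lag at which the normalised autocorrelation drops below threshold."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Biased autocorrelation estimate for lags 0 .. len(x) - 1.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    below = np.nonzero(acf < threshold)[0]
    return int(below[0]) if below.size else len(x)

# Synthetic stand-in for a fading trace sampled at 1 kHz (1 sample = 1 ms).
n = np.arange(2000)
trace = np.cos(2 * np.pi * n / 100)
t_c = coherence_time_samples(trace, threshold=0.5)
t_c_strict = coherence_time_samples(trace, threshold=0.9)
```

At a 1 kHz sampling rate the returned lag is directly the coherence time in milliseconds, and raising the threshold shortens it, which is how a range of output lengths can be obtained from a single data set.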

V. EXPERIMENTAL ANALYSIS AND RESULTS
This section discusses the main metrics used for evaluation in this study, the experiments performed on each of the D2D data sets to determine the model and parameters that provide the best prediction performance, and the results.
The main metrics used for evaluation in this study include the mean absolute error (MAE) and root mean squared error (RMSE). MAE measures the average magnitude of the errors in a set of predictions without considering their direction. It is the average over the test sample of the absolute differences between prediction and actual observation, where all individual differences have equal weight. RMSE also measures the average magnitude of the error. It is the square root of the average of squared differences between prediction and actual observation. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means that the RMSE should be more useful when large errors are particularly undesirable. Both MAE and RMSE express average model prediction error in units of the variable of interest. They are negatively-oriented scores, which means lower values are better. Let $y_{mn}$ be the $m$th test sample for the $n$th prediction step, where $n \in [1, z]$ and $z$ is the total number of prediction steps. Let $\hat{y}_{mn}$ be the predicted value of $y_{mn}$. Then, the RMSE and MAE are given by (19) and (20) as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{i z} \sum_{m=1}^{i} \sum_{n=1}^{z} \left(y_{mn} - \hat{y}_{mn}\right)^2} \tag{19}$$
$$\mathrm{MAE} = \frac{1}{i z} \sum_{m=1}^{i} \sum_{n=1}^{z} \left|y_{mn} - \hat{y}_{mn}\right| \tag{20}$$
where $i$ is the number of test samples.
Recall that for each data set, the model parameters were tuned by varying the number of stacked layers and hidden units/kernels in each layer as indicated in Table 1. Following extensive experimentation on the parameter space, in general, it was observed that a single LSTM, GRU and FFN layer with 25 hidden units, and a single 1DCNN layer with 128 kernels and a kernel size of 5, provided the best prediction performance across all environments and scenarios. Increasing the number of hidden units to more than 25, and the number of kernels to beyond 128, significantly increased the time taken to train the networks without providing any substantial performance improvements. Likewise, when the number of layers was increased to two, in general, the best prediction performances were obtained when the number of hidden units in each layer was 5 for the FFNs and 25 for the GRUs and LSTMs, and when the number of kernels was 64 for the 1DCNNs.
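The two metrics can be computed as in the sketch below, under the assumption that the average is taken over all test samples and all prediction steps jointly; the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all samples and prediction steps."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all samples and prediction steps."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Two test samples, two prediction steps each.
y_true = np.zeros((2, 2))
y_pred = np.array([[1.0, -1.0], [3.0, -3.0]])
err_rmse = rmse(y_true, y_pred)  # sqrt((1 + 1 + 9 + 9) / 4) = sqrt(5)
err_mae = mae(y_true, y_pred)    # (1 + 1 + 3 + 3) / 4 = 2
```

The example shows the different weighting of large errors: the RMSE (about 2.24) exceeds the MAE (2) because the two errors of magnitude 3 dominate the squared average.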
Once the number of hidden units and kernels for the deep learning models were chosen, their MAEs and RMSEs were compared with each other and the linear regression baseline model for varying input and output lengths, for single and multiple layers, across different D2D environments and scenarios. These results are discussed in detail next.

A. COMPARING ERRORS VS INPUT LENGTHS
Figs. 8 and 9 compare the RMSEs and MAEs of the deep learning models with each other, and with linear regression (see footnote 9), for a range of input lengths. Figs. 8 (a), (b), (c) and (d) were obtained for an indoor LOS environment for output lengths 8, 11, 14 and 17, respectively, whilst Figs. 8 (e), (f), (g) and (h) were obtained for an indoor NLOS environment for output lengths 8, 12, 15 and 18, respectively. Likewise, Figs. 9 (a), (b), (c) and (d) were obtained for an outdoor LOS environment for output lengths 10, 14, 19 and 23, [...]
Interestingly, it is observed that a fine-tuned GRU performs very similarly to the LSTM. GRUs control the flow of information in essentially the same way as LSTMs. The difference is that LSTMs use a specifically designed memory cell to capture the long-term dependencies in sequences whilst the GRUs use the update gate. Furthermore, these models outperformed FFNs and 1DCNNs for all measured data sets in this study. This points to the importance of accounting for long-term temporal dependencies for channel prediction, which FFNs and 1DCNNs are unable to capture. It is also seen that GRUs and LSTMs significantly outperform linear regression in all environments and scenarios. The basic idea behind linear regression is to provide a model which can observe linear trends in the data.
Footnote 9: Linear regression is a statistical model that fits the best line to the input data. Similar to the deep neural networks, the baseline also considers a history of 1-100 samples to predict the required number of output time steps in the future.
It is possible that this baseline model did not perform well here because: 1) the D2D data sets in this paper are composed of real-world measurements, possibly with nonlinearities introduced due to factors such as the presence of obstacles in the environment, and the direction in which the receiver moved as a result of the random movement undertaken; 2) it very closely follows the trend captured by the previous value, which predicted values may not necessarily follow.
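To make the behavior of such a baseline concrete, the following sketch fits a least-squares line to an input window and extrapolates it over the required number of output steps. This is one plausible reading of the baseline, written under our own assumptions; the helper names `fit_line` and `linear_forecast`, and the persistence fallback for a single-sample window, are hypothetical and are not taken from the original implementation.

```python
def fit_line(window):
    """Least-squares slope and intercept for samples taken at t = 0..n-1."""
    n = len(window)
    if n < 2:
        # Assumption: a single sample cannot define a slope, so the
        # forecast falls back to repeating the last observed value.
        return 0.0, float(window[0])
    t_mean = (n - 1) / 2
    y_mean = sum(window) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(window))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

def linear_forecast(window, out_steps):
    """Extrapolate the fitted line `out_steps` samples beyond the window."""
    slope, intercept = fit_line(window)
    n = len(window)
    return [intercept + slope * (n + k) for k in range(out_steps)]

# A perfectly linear window is extrapolated exactly.
print(linear_forecast([1.0, 3.0, 5.0, 7.0], out_steps=2))

# A window that ends on a falling edge keeps falling in the forecast,
# even if the true signal is about to recover from a fade.
print(linear_forecast([0.0, -1.0, -3.0, -6.0], out_steps=3))
```

The second call illustrates the limitation noted above: the extrapolated line continues the most recent trend, so any reversal within the prediction horizon is missed.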

B. RECOMMENDATIONS ON INPUT LENGTH FOR COHERENCE TIME PREDICTION
From Figs. 8 and 9 it is possible to see how little input a model requires to achieve a target prediction performance.
For instance, to predict 17 and 23 samples into the future, corresponding to coherence times of 17 and 23 ms in indoor and outdoor LOS environments, respectively, an input length of 25 samples is recommended. Likewise, to predict 17 and 18 samples into the future, corresponding to coherence times of 17 and 18 ms in indoor and outdoor NLOS environments, respectively, input lengths of 25 and 75 samples are recommended. Thus, through these figures, the interested reader can obtain the minimum input length required to achieve a target prediction performance for their chosen coherence time, provided that the environments being considered are similar to the ones presented in this work. It is also worth highlighting that in most cases, a short input length of around 25 samples was found to achieve similar prediction performance to a larger input length of 100 samples. This indicates that large input lengths (i.e., knowledge of a large number of past values) may not always be beneficial. This is intuitive because samples further in the past than the coherence time of the channel are uncorrelated and are therefore less likely to carry useful information.
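The selection of a minimum input length from error-versus-input-length curves can be expressed as a simple rule: choose the shortest input whose error is within a chosen tolerance of the best error observed. The sketch below is illustrative only. The function `min_input_length`, the 5% default tolerance, and the RMSE values are all hypothetical; the values are not read from Figs. 8 and 9.

```python
def min_input_length(errors_by_length, tolerance=0.05):
    """Return the shortest input length whose error is within a relative
    `tolerance` of the lowest error across all tested lengths."""
    best = min(errors_by_length.values())
    threshold = best * (1 + tolerance)
    for length in sorted(errors_by_length):
        if errors_by_length[length] <= threshold:
            return length

# Hypothetical RMSE values (dB) for one model and one output length.
rmse_by_length = {1: 2.10, 10: 1.40, 25: 1.02, 50: 1.01, 75: 1.00, 100: 1.00}

print(min_input_length(rmse_by_length))                  # shortest within 5%
print(min_input_length(rmse_by_length, tolerance=0.0))   # shortest at the minimum
```

With a tolerance of zero the rule returns the shortest length that attains the minimum error, whereas a small non-zero tolerance favors shorter inputs once the curve has flattened, which mirrors the reasoning used in the recommendations above.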

C. COMPARING ERRORS VS INPUT LENGTHS FOR MULTIPLE LAYERS' CASE
The observations discussed in the above two subsections for the single-layer case also hold when the number of stacked layers is increased to two, as demonstrated through Figs. 10 and 11. Fig. 10 shows the RMSEs of the deep learning and baseline models versus input length for the single-layer and multiple-layer cases, whilst Fig. 11 shows the corresponding MAEs. Furthermore, it can be seen that a single LSTM or GRU layer already provides good prediction performance, and that increasing the number of stacked layers increases the training time without providing any considerable performance benefit.

Fig. 12 compares the prediction errors for all deep learning and baseline models for different output lengths, across all D2D environments and scenarios. These plots were obtained for a single LSTM, GRU, FFN and 1DCNN layer with an input length of 25 samples, 25 hidden units and 128 kernels. As before, it can be seen that LSTMs and GRUs perform comparably across all environments and scenarios. Furthermore, these models significantly outperform FFNs, 1DCNNs and linear regression for all of the D2D data sets considered here. It is also seen that the prediction errors associated with the outdoor LOS scenario were the lowest. This could be because: 1) the small-scale fading data observed here exhibit lower overall fluctuations than in the indoor LOS, indoor NLOS and outdoor NLOS cases (see Fig. 2), which means that the models have less difficulty making predictions; and 2) the fades observed here are not as deep as in the indoor LOS, indoor NLOS and outdoor NLOS cases, again making it easier for the models to predict.

Fig. 13 shows a qualitative comparison between the actual and predicted results for the linear regression, FFN, 1DCNN, GRU and LSTM models. This figure was obtained for the outdoor LOS environment and illustrates an input timeframe of 25 ms (25 samples) used to predict 14 ms (14 samples) into the future. The number of hidden units is equal to 25 whilst the number of kernels is equal to 128. As indicated previously, the linear regression model is only able to capture a low-dimensional slice of the behavior (i.e., it very closely follows the trend captured by the most recent input values), resulting in poor prediction performance. GRUs and LSTMs perform the best, whilst FFNs and 1DCNNs are the worst performing deep learning models for the given data sets.

F. TIME PROFILING
Profiling is a way to measure how the models behave in relation to the resources (time and/or memory) they use. It is well known that deep learning models are typically computationally expensive. Thus, quantifying the resource consumption of these models can help to identify performance bottlenecks and, ultimately, make them execute faster. In this subsection, time profiling is implemented by comparing the training times of the two best performing deep learning models, namely the LSTM and the GRU. TensorBoard® (see footnote 10), a visualization toolkit of TensorFlow®, was used to profile and track the performance of the models on the device. The device used to evaluate the training times was an NVIDIA® GeForce® GTX 1650 4 GB GPU. It can be seen from Table 3 that the training time associated with the LSTMs was 123 s, whilst that of the GRUs was 101 s. This means that for these parameters in the outdoor NLOS scenario, the GRUs trained approximately 22% faster than the LSTMs (123/101 ≈ 1.22). Thus, by investigating the prediction performance and training times, it was found that for the D2D measurements considered in this paper, the GRUs and LSTMs were the two best performing models.
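The arithmetic behind the training-time comparison can be checked with the short sketch below, which also counts trainable parameters for a single recurrent layer as one plausible contributor to the difference. Several assumptions are ours rather than the paper's: the function names are hypothetical, the parameter count uses the textbook formulation with one bias vector per weight set (some library implementations add a second bias vector to the GRU), and a single input feature per time step is assumed.

```python
def speedup_percent(slow_s, fast_s):
    """How much faster the quicker model is, as a percentage of its own time."""
    return (slow_s / fast_s - 1) * 100

def time_saved_percent(slow_s, fast_s):
    """Reduction in training time relative to the slower model."""
    return (slow_s - fast_s) / slow_s * 100

def recurrent_params(hidden, inputs, weight_sets):
    """Textbook parameter count for one recurrent layer.

    Each weight set has an input matrix (hidden x inputs), a recurrent
    matrix (hidden x hidden) and one bias vector (hidden).
    """
    return weight_sets * (hidden * (hidden + inputs) + hidden)

lstm_s, gru_s = 123, 101  # training times in seconds, from Table 3

print(round(speedup_percent(lstm_s, gru_s), 1))     # speed-up of the GRU
print(round(time_saved_percent(lstm_s, gru_s), 1))  # training time saved

# An LSTM has 4 weight sets (3 gates + candidate); a GRU has 3 (2 gates + candidate).
lstm_params = recurrent_params(hidden=25, inputs=1, weight_sets=4)
gru_params = recurrent_params(hidden=25, inputs=1, weight_sets=3)
print(lstm_params, gru_params)
```

Note that "22% faster" corresponds to the ratio of training times (123/101); expressed as a reduction in training time relative to the LSTM, the same measurements give roughly 18%.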

VI. CONCLUSIONS
This paper investigated the capabilities of AI-based deep learning models (LSTM, GRU, FFN and 1DCNN) to predict received signal strength variations in D2D communications channels. A thorough investigation was performed on the efficacy of the models in predicting different output lengths, chosen according to the coherence time of the channel and the time correlation function. It was found that, in general, GRUs and LSTMs consisting of a single layer with 25 hidden units provided the best prediction performance. The training times of the models were also compared to identify the most suitable model for the D2D data sets considered here. Interestingly, no clear winner was found between the LSTMs and GRUs. The paper also investigated the minimum input length a model requires to achieve a target prediction performance. It was found that to predict 17 and 23 ms into the future, corresponding to the coherence times observed in indoor and outdoor LOS environments, respectively, an input length of 25 ms is recommended. Likewise, to predict 17 and 18 samples into the future, corresponding to coherence times of 17 and 18 ms in indoor and outdoor NLOS environments, respectively, input lengths of 25 and 75 samples are recommended. This indicates that large input lengths may not always be necessary, as samples further in the past than the coherence time of the channel are uncorrelated and are therefore less likely to carry useful information.

He has authored and co-authored over 140 publications in major IEEE/IET journals and refereed international conferences, two book chapters, and two patents. Among his research interests are cellular device-to-device, vehicular, and body-centric communications. His other research interests include radio channel characterization and modeling, and the simulation of wireless channels. He was a recipient of the H. A. Wheeler Prize, in 2010, from the IEEE Antennas and Propagation Society for the best applications journal paper in the IEEE Transactions on Antennas and Propagation, in 2009, and the Sir George Macfarlane Award from the U.K. Royal Academy of Engineering in recognition of his technical and scientific attainment since graduating from his first degree in engineering, in 2011.

DAVID E. SIMMONS received the B.Sc. degree in mathematics from the University of Central Lancashire, in 2011, the M.Sc. degree in communications engineering from the University of Bristol, U.K., in 2012, and the D.Phil. degree in engineering from the University of Oxford, U.K., in 2016. His research has focused on studying the information theoretic properties of relay networks as they scale. During his D.Phil. studies, he was a recipient of the Best Paper Award at the 23rd edition of EUCNC'14. From 2016 to 2017, he worked as a PDRA within the Networked Quantum Information Technologies group at the University of Oxford. From 2018 until present, he has worked as a Senior AI/ML Research Scientist, Senior Software Engineer, and Engineering Team Lead in two startup companies in Belfast, U.K. His research interests include communication and network theory, information theory, AI/ML, and cryptography.

VOLUME 4, 2016