A Comprehensive Evaluation of Deep Learning-Based Techniques for Traffic Prediction

Deep learning-based techniques are the state of the art in road traffic prediction or forecasting. Several deep neural networks have been proposed to predict the traffic, but they have not been evaluated using common datasets. Current studies analyze their capacity to predict road traffic in general, but do not focus on their capacity to predict the formation of congestions. This is critical for avoiding congestions or mitigating their negative impact. This paper advances the current state of the art by presenting a comprehensive comparison of the state-of-the-art deep neural networks for road traffic prediction. The comparison is conducted using the same real traffic datasets, and under both normal and congested traffic conditions. The evaluation includes new deep neural networks and error recurrent models. Our study first demonstrates that accurately predicting the traffic overall does not imply that a deep neural network can accurately predict the traffic when congestions are forming. This reinforces the idea that prediction techniques must also be evaluated under congestion conditions. Our analysis also shows that exploiting the spatiotemporal evolution of the traffic (and not just the temporal one) provides better prediction accuracy, both overall and in particular under congestion conditions. The study also demonstrates that error recurrent models outperform deep neural networks that do not utilize an error feedback, both under normal and congested traffic conditions. In particular, our study shows that the error recurrent model eRCNN is the deep learning technique that achieves the best traffic prediction accuracy to date. It is also important to emphasize that error recurrent models achieve better prediction accuracy with shallower neural networks, and therefore lower computational cost.


I. INTRODUCTION
Traffic congestion has a significant impact on the economy. The European Commission estimates that the annual cost associated with traffic congestion is in the order of 100 billion euros (1% of GDP) [1]. Traffic congestion also increases pollution and reduces drivers' comfort. These effects could be reduced with proactive solutions capable of predicting future road traffic conditions and implementing traffic management procedures that reduce the risk of congestion.
Several road traffic prediction techniques have been proposed in the literature. To date, those based on deep neural networks achieve the highest prediction accuracy. As an example, [2] and [3] propose using deep belief networks and recurrent neural networks for traffic flow and traffic speed prediction, respectively. Several deep neural network techniques have been proposed to date, and it is not possible to conclude from the existing literature which has the best prediction accuracy. This is the case because they have not been evaluated over a common framework with the same traffic datasets. In addition, current studies generally focus on predicting the traffic in general, and do not analyze in detail the capacity of the techniques to predict the road traffic under congestion conditions. This is critical to be able to anticipate potential congestions and avoid them, or at least mitigate their negative effects. Being able to predict the traffic in general does not necessarily imply that a technique can accurately predict the road traffic under congestion conditions. This is the case because the traffic speed and flow experience larger variations when congestions are formed or dissolved than under non-congested conditions. Congestions are also less frequent than free-flow conditions, and hence the traffic datasets include fewer congestion events to train the deep learning-based prediction techniques. (The associate editor coordinating the review of this manuscript and approving it for publication was Shaohua Wan.)
This study advances the current state of the art by conducting what the authors believe is the most comprehensive evaluation and comparison to date of the state-of-the-art deep learning-based traffic prediction techniques. The study allows us to identify the techniques that achieve the best prediction accuracy under different traffic conditions. The comparison includes the evaluation of new deep neural networks presented in this paper and error recurrent models. All techniques are evaluated using common real traffic datasets provided by the Department of Transportation of California (Caltrans) through its Performance Measurement System (PeMS) platform [4]. PeMS is a traffic monitoring system that provides real traffic data from highways in California. PeMS is the largest and most up-to-date open traffic detector database, and provides several traffic variables such as traffic flow, speed or density. The evaluation is conducted using two years of data from three different highway sections in California. The prediction accuracy is evaluated under general traffic conditions (i.e. considering the complete datasets) and under congestion conditions.
The study demonstrates that error recurrent deep neural networks achieve the best prediction accuracy under both normal and congested traffic. In addition, error recurrent models achieve better prediction accuracy with shallower neural networks, and therefore a lower computational cost for prediction. The study also demonstrates that exploiting the spatiotemporal evolution of the traffic (and not just the temporal one) provides better prediction accuracy overall, and in particular under congestion conditions. Finally, the study demonstrates that techniques that have been shown to accurately predict the traffic in general do not necessarily predict the traffic accurately under congestion conditions. The remainder of this paper is organized as follows. Section II analyzes the state of the art in traffic prediction. Section III describes the dataset used to train the neural networks. Section IV introduces neural networks and describes the neural network models implemented in this study. Sections V and VI evaluate the traffic prediction accuracy and the traffic congestion prediction accuracy of the different neural network models, respectively. Finally, Section VII summarizes the main contributions of this study.

II. STATE OF THE ART ON TRAFFIC PREDICTION
References [5] and [6] provide an exhaustive review of techniques for traffic prediction prior to the advent of deep learning. These techniques include, for example, parametric models like autoregressive models and the Kalman filter. Autoregressive schemes model output variables as a function of their past values. They can model stationary time series (like the temporal evolution of a traffic variable) using linear dependencies. Autoregressive models can hence be easily adjusted and have a low computational cost. The autoregressive models most commonly utilized for traffic prediction are the AutoRegressive Integrated Moving Average (ARIMA) and Seasonal ARIMA (SARIMA) models. [7] and [8] use these models to predict the traffic flow. These studies show that SARIMA can improve the prediction over ARIMA by taking the traffic seasonality into account. Other studies propose using Kalman filters to predict the traffic since they are easy to adjust, fast to compute, and robust against external disturbances and errors in the model. For example, [9] predicts travel times using a Kalman filter. The study models the traffic variables as a dynamic system in the state space. [10] also uses a Kalman filter for traffic prediction and fuses data provided by different traffic detectors. A disadvantage of parametric models is that they need to be readjusted every time the set of parameters that defines the model changes.
Other studies propose predicting road traffic using techniques based on non-parametric models like k-nearest neighbors, Bayesian networks or support vector regression. The k-nearest neighbors technique can model the variable to predict as a function of any type of variable, whether continuous or discrete. It is then possible to predict traffic variables as a function of diverse variables such as the weather. [11] shows that traffic prediction improves using k-nearest neighbors compared to using parametric models. A disadvantage of the k-nearest neighbors technique is that the prediction time increases significantly with the size of the training data. Bayesian networks have also been proposed to predict traffic [12]. They predict variables using probability distributions. They are robust to noise and incomplete datasets, and they do not overfit to the training set. Their main disadvantage is that they cannot learn cyclic relationships between variables, i.e. relationships where two variables influence each other simultaneously. Support vector regression also avoids overfitting during the prediction thanks to the use of regularization terms in the optimization process. In addition, its training does not suffer from local minima during the optimization process. These properties result in a lower prediction error for traffic flows compared to ARIMA models [13]. Prediction models using support vector regression are defined as a function of kernels that map the inputs of the model into high-dimensional feature spaces. New models need to be trained every time the kernels are changed to improve the prediction error.
Neural networks are also non-parametric models. They provided the most accurate traffic predictions prior to the emergence of techniques using deep learning [6]. One of the reasons for their accuracy is their versatility and robustness against noise in the data. Several types of neural networks have been proposed for traffic prediction. For example, [14] proposes a multilayer perceptron (MLP) that outperforms the ARIMA model when predicting the traffic flow. The MLP takes as input past values of the traffic flow. Other MLP proposals take as input preprocessed values of the temporal evolution of the traffic variables. This is for example the case of [15] where the authors show that the preprocessing improves the prediction accuracy. In [16], the authors use the discrete wavelet transform to decompose the temporal evolution of the traffic flow into a set of mutually orthogonal wavelets. A vector in the space of wavelets is then fed to a MLP to predict the traffic flow. [17] proposes partitioning the input vector space into a set of fuzzy clusters. The input vectors are replaced by vectors of their membership to the fuzzy clusters, and these membership vectors are then fed to a MLP. The discussed MLP proposals predict traffic based on its temporal evolution. The first proposal that uses temporal and spatial traffic information for the traffic prediction was presented in [18]. This study proposes an MLP to predict the mean speed of vehicles at predefined road sections. To do so, the MLP takes as input values for a given road section the past average speed of vehicles in that road section and in the previous and following road sections.
Deep learning-based techniques are currently the state of the art in traffic prediction given, among other strengths, their capacity to forecast time series. In fact, several studies have demonstrated that deep learning techniques can successfully forecast time series in different domains. For example, [19] used an ensemble of deep belief networks (DBN) to predict the evolution of the Mackey-Glass differential equation and the electricity demand. DBNs were shown to outperform other models like support vector regression or the MLP. In [20], autoencoders (AE) were proposed to predict the evolution of the electricity price. The study shows that AEs achieve better predictions than non-deep learning models like the MLP, support vector regression or lasso regression. Long short-term memory recurrent neural networks (LSTM RNN) have also been proposed for time series forecasting [21], [22]. In [21], the authors use LSTM RNNs to predict the sea surface temperature and show that they outperform the MLP and support vector regression. LSTM RNNs are used in [22] to predict the evolution of oil production. The study shows that LSTM RNNs provide better predictions than ARIMA, other recurrent neural networks (RNN), as well as other non-linear methods. Convolutional neural networks (CNN) were also proposed for time series forecasting in [23], where the authors used CNNs to predict the evolution of the electricity demand, quote prices in credit default swap markets, and artificial multivariate time series. The authors demonstrate that CNNs outperform vector autoregression and LSTM RNNs. Other studies have proposed combining deep learning techniques for time series forecasting [24], [25]. For example, [24] presents a hybrid model combining AEs, CNNs and RNNs. This model predicted the concentration of different chemicals in the tanks of a sewage treatment center better than models based on ARIMA, RNNs, LSTM RNNs, and CNNs.
The study presented in [25] proposed a hybrid model combining AEs and LSTM RNNs to predict stock market prices. This hybrid model proved to outperform LSTM RNNs and RNNs. Autoencoders were among the first deep learning models used for traffic prediction. Autoencoders are able to identify patterns in the input data. These patterns are then used to define a representation or coding of the data that facilitates its preprocessing. Autoencoders were used for traffic prediction in [26], where the authors showed the benefits of letting the autoencoder choose how to represent and preprocess the input data during the training process. This approach yields better results than a predefined preprocessing since the autoencoder identifies the preprocessing to be done following an optimization process. [26] shows that autoencoders predict road traffic with higher accuracy than other non-parametric models, including the MLP. In [27], the authors propose using autoencoders to predict road traffic congestion. To this aim, an autoencoder is trained to predict discrete congestion levels in a road network using the past temporal evolution of these congestion levels.
Different types of recurrent neural networks (RNN) have also been proposed for road traffic prediction. RNNs only require as input the last value of the traffic variables, and not an average or preprocessed value. This is the case because RNNs maintain the state of the network between predictions, and the state is influenced by past input values. The study in [28] compares the accuracy achieved with the Elman, Jordan and state-space RNNs when predicting freeway travel times. However, the study does not compare the performance of the trained RNNs with other traffic prediction techniques. Higher traffic prediction accuracy can be obtained with long short-term memory (LSTM) RNNs [29]. LSTM RNNs use a trainable memory architecture that learns when the network should keep information about past inputs. This allows LSTM RNNs to learn long-term dependencies with better results than standard RNNs. This feature is useful for predicting road traffic, as demonstrated in [29] and [30], where the authors use LSTM RNNs to predict traffic speed and flow, respectively. The study in [29] demonstrates that LSTM RNNs outperform several parametric and non-parametric models (including the MLP and the Elman RNN) when predicting the traffic speed. [30] also proved that LSTM RNNs predict the traffic flow more accurately than MLPs and AEs. The study in [31] showed that the prediction accuracy of LSTM RNNs can be further improved by using as inputs multiple traffic variables (e.g. traffic speed, flow or occupancy) and data from several neighboring traffic detectors. However, [31] did not compare the performance of the trained LSTM RNNs with other traffic prediction techniques. Other studies propose variations of the LSTM RNN architecture for traffic prediction. This is for example the case of [32], which proposes a 2-dimensional LSTM RNN capable of processing traffic data from different detectors at the same time.
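As an illustration of how such a model is typically set up (a generic sketch with layer sizes and window length chosen by us for illustration, not the exact architecture of any of the cited studies), a minimal TensorFlow/Keras LSTM that maps a window of past speed readings to the next value could look as follows:

```python
import numpy as np
import tensorflow as tf

# Generic sketch: predict the next traffic speed value from a window of
# 12 past readings, i.e. one hour of history at a 5-minute resolution.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64),   # internal state summarizes the past inputs
    tf.keras.layers.Dense(1),   # regression output: the next speed value
])
model.compile(optimizer="adam", loss="mae")

# One dummy window of past speeds, shaped (batch, time steps, variables).
window = np.random.rand(1, 12, 1).astype("float32")
prediction = model(window)
print(prediction.shape)  # (1, 1)
```

Because the LSTM keeps an internal state, the same network can be fed detector readings step by step at prediction time instead of preprocessed averages.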
The authors showed that the trained LSTM RNN outperforms AEs and other RNNs when predicting the traffic flow. In [33], the authors propose predicting traffic using an RNN consisting of stacked bidirectional LSTM layers and regular LSTM layers. Their proposal outperforms other techniques based on non-parametric models, including the MLP.
Convolutional neural networks (CNN) have also been recently utilized to predict road traffic. CNNs can detect local features in the input data thanks to the convolution operation. This is useful to detect traffic events that can occur at different locations or at different times. This information can then be used to improve the prediction accuracy. The study in [34] proposes a deep neural network made of several sub-CNNs. Each CNN processes the temporal evolution of the traffic variables measured at a detector in order to predict the traffic speed. The authors demonstrate that this CNN outperforms an MLP-based architecture. The temporal evolution of the traffic variables at each detector is represented as a 1-dimensional vector. Consequently, all the convolutions performed by the CNN proposed in [34] are 1-dimensional convolutions. The studies presented in [35], [36] and [37] propose utilizing 2-dimensional convolutions so that CNNs can simultaneously process the temporal and spatial evolution of the traffic variables. To this aim, traffic data is represented as a matrix where one dimension represents the spatial evolution of the data and the other one the temporal evolution. This approach allows CNNs to capture the spatiotemporal correlations present in the road traffic patterns. 2-dimensional CNNs have been shown to outperform different parametric and non-parametric models (including MLPs, AEs, and LSTM RNNs) when predicting the traffic speed [35]-[37]. CNNs have also been proposed to predict the traffic flow, and [38] showed that they can again outperform different parametric and non-parametric models. An interesting proposal is presented in [39], where the authors introduce a CNN that takes as input the matrix of the spatiotemporal evolution of the traffic variables and the prediction error in previous time steps. The proposal is referred to as error recurrent convolutional neural network (eRCNN).
The authors show in [39] that the eRCNN can outperform different parametric and non-parametric models (including AEs, 1-dimensional CNNs and 2-dimensional CNNs) when predicting the traffic speed. This is due to the error feedback introduced as input to the eRCNN. [39] also shows that 2-dimensional CNNs outperform 1-dimensional CNNs. A further evolution of CNNs for traffic prediction are graph convolutional neural networks (GraphCNN). This type of CNN exploits information on graphs using the convolution operation on matrices. This is useful for traffic prediction since road networks can be represented using graphs. GraphCNNs can then combine traffic data with information about the structure of the road network to predict the road traffic. [40] and [41] show that GraphCNNs can accurately forecast the traffic speed and outperform different parametric and non-parametric models, including MLPs and LSTM RNNs. Other studies propose using CNNs to process spatiotemporal road traffic data for other tasks. For example, [42] uses a CNN to classify vehicle trajectory data in order to categorize their transport mode and their speed.
Recent studies propose predicting road traffic using hybrid models that combine some of the previous models. For example, [43] proposes combining the k-nearest neighbors algorithm with LSTM RNNs to predict the traffic flow. The k-nearest neighbors algorithm is used to select the most relevant traffic detectors for the prediction. Then, the LSTM RNN predicts the traffic flow using the traffic data from the selected detectors. The study shows that this hybrid model outperforms other traffic prediction techniques based on parametric and non-parametric models, including MLPs and regular LSTM RNNs. In [44], the authors propose combining AEs and LSTM RNNs for the prediction of the traffic speed. The proposed model uses the AE to process the traffic data at each time step. The LSTM then takes as input the processed data to predict the traffic speed. The authors show that this hybrid model outperforms different parametric and non-parametric models, including regular AEs and LSTM RNNs. CNNs and LSTM RNNs have also been combined for traffic prediction in [45] and [46]. In this case, CNNs process the spatial evolution of the traffic and LSTM RNNs the temporal evolution. The proposal in [45] processes the traffic data of the entire road network using a 2-dimensional CNN. The data is processed at each time step. The processed data is then fed into an LSTM RNN to predict the traffic speed. The authors show that this hybrid model outperforms different non-parametric models, including AEs, CNNs and LSTM RNNs. The proposal in [46] uses a 1-dimensional CNN to process traffic data from all the detectors in a highway section. The processed data is then fed into the LSTM RNN in order to predict the traffic flow. The authors show that this hybrid model outperforms MLPs, AEs and LSTM RNNs.
The conducted review shows that deep learning models outperform parametric and non-parametric models (including shallow or non-deep neural networks) for traffic prediction. However, existing studies have several limitations when it comes to identifying the most adequate model. First, proposals are generally compared with only a subset of all deep learning models for traffic prediction. In addition, the deep learning proposals are trained and evaluated using different datasets, so there is no common benchmark. It is also important to note that current deep learning proposals have been evaluated to predict the road traffic in general, and no specific analysis has been presented on predicting road traffic congestion. The capacity to accurately predict road traffic under these conditions is highly valuable for effective road traffic management. This study complements the current state of the art with several important contributions. First, it compares the most relevant deep learning-based traffic prediction techniques using the same dataset. This allows us to identify the most accurate deep learning model. The comparison includes the current state-of-the-art techniques but also new hybrid models proposed by the authors. An important novelty is that the comparison considers for the first time an in-depth analysis under road traffic congestion conditions. This is important since this study also demonstrates that deep learning models that can accurately predict the road traffic in general do not accurately predict the formation of road traffic congestion. In summary, the conducted study allows identifying, using common datasets, the deep learning model that best predicts the road traffic under general conditions and also under congestion conditions.

III. DATASET
This study has been conducted using real traffic data provided by the Department of Transportation of California (Caltrans) through the PeMS platform [4]. PeMS is a real traffic monitoring system that provides real traffic data from the road network in California. This data has been used to train and test the different neural networks analyzed in this study. In particular, we have used data related to the traffic flow (measured in vehicles per unit time), traffic mean speed (measured in miles per hour), and traffic density (measured in vehicles per unit length). All the neural networks have been trained and tested with data from the I5 freeway. We have chosen this freeway because it is the longest freeway in the state (it crosses California from north to south), and because it has a large number of traffic detectors that provide road traffic data. In particular, we work with three sections of the I5 freeway. The characteristics of the selected sections are reported in Table 1. The three sections and the location of the selected detectors are shown in Figure 1.
We have downloaded the 2015 and 2016 traffic data (flow, mean speed and density) for each detector in the selected I5 freeway sections. The data has been downloaded with a temporal resolution of 5 minutes. The dataset contains 100000 samples for each year and each section, i.e. a total of 600000 samples. We provide as open source the tool we have developed for downloading road traffic data from PeMS. We also provide the downloaded dataset in an open repository. PeMS provides the traffic flow as the number of vehicles every 5 minutes. In addition, PeMS sums the flow for all lanes at the location of the detector. We convert this value to vehicles per hour and per lane as follows:

Q' = 12 · Q / n_lanes    (1)

where Q is the traffic flow in vehicles every 5 minutes, Q' is the traffic flow in vehicles per hour and per lane, and n_lanes is the number of lanes of the highway at the location of the detector. The transformation in (1) is done so that the neural networks are abstracted from the number of lanes in each particular road section. The dataset has been divided into a training set, a validation set and a test set. The training set is used for training the neural networks to approximate the prediction function. The validation set is used for monitoring the accuracy of the models during training and for deciding when to stop the training. The test set is used for evaluating the prediction accuracy of the neural networks under study. In this study, the entire 2015 dataset is used for training. Half of the 2016 dataset is used for validation and the other half for testing and evaluating the prediction accuracy.
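The conversion in (1) takes only a few lines (the function name below is ours, for illustration): multiply by the 12 five-minute intervals in an hour and divide by the number of lanes.

```python
def to_veh_per_hour_per_lane(q_5min_all_lanes, n_lanes):
    """Convert a PeMS flow reading (vehicles per 5 minutes, summed over
    all lanes at the detector) into vehicles per hour and per lane."""
    # 12 five-minute intervals per hour, divided by the number of lanes.
    return q_5min_all_lanes * 12 / n_lanes

# Example: 150 vehicles counted in 5 minutes on a 3-lane section.
print(to_veh_per_hour_per_lane(150, 3))  # 600.0
```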

IV. NEURAL NETWORK MODELS
This section presents the models implemented for road traffic prediction, including those proposed in this paper. All the models are based on neural networks since they outperform other models, as highlighted in Section II. Following the review in Section II, this study implements the following neural networks that have been proposed for road traffic prediction: the multilayer perceptron (MLP), the convolutional neural network (CNN), the LSTM recurrent neural network (LSTM), a hybrid neural network combining a 1D-CNN and an LSTM (referred to as CNN+LSTM), and the error recurrent CNN (eRCNN). The MLP model has been chosen for comparative purposes and to check that our results are in line with those reported in the literature. The CNN and CNN+LSTM models have been chosen since they consider the spatiotemporal evolution of the road traffic. The CNN model processes both the spatial and the temporal evolution of the road traffic by means of the convolution operation. The CNN+LSTM model processes the spatial evolution of the road traffic using the convolution operation, and processes the temporal evolution by maintaining an internal state influenced by previous inputs. The LSTM model has been chosen so that we can compare, over the same dataset, the results achieved when processing only the temporal evolution of the traffic and when processing both the temporal and spatial evolution. The eRCNN model also processes the spatiotemporal evolution of the road traffic, but introduces an additional feedback mechanism to account for previous prediction errors. This model has been chosen to analyze the impact of the error recurrent feedback mechanism and check its suitability for other hybrid models. All these models have been evaluated so far using different datasets, so it is not possible to draw a firm conclusion on which model achieves the highest traffic prediction accuracy. However, the CNN+LSTM and eRCNN models seem to offer the highest potential following the review reported in Section II.
This study does not implement the GraphCNN model since the dataset used to train and evaluate the different prediction techniques is made of highway road traffic data. The structure of the road is hence very simple, and it would not bring any advantage to exploit it using a GraphCNN model.
This study also proposes and evaluates new hybrid neural network models for traffic prediction inspired by existing models. These include a hybrid neural network combining an LSTM and a 1D-CNN (referred to as LSTM+CNN), an error recurrent LSTM (eRLSTM), and an error recurrent CNN+LSTM (referred to as eRCNN+LSTM). The LSTM+CNN model is proposed to analyze whether inverting the order in which the CNN and LSTM sub-networks are applied leads to lower prediction errors. Recent studies have shown that the LSTM and CNN+LSTM models can achieve good prediction accuracy. We have therefore decided to develop the new eRLSTM and eRCNN+LSTM models to analyze whether introducing the error recurrent feedback approach can further improve their prediction accuracy. All selected and proposed models are presented in the following subsections. All the models have been implemented using the TensorFlow framework [47].

A. MLP
The MLP is composed of stacked layers of neurons. Each layer is composed of a set of neurons that process the layer's input. This input is the output of the previous layer, except in the case of the input layer, which takes the MLP's input. The last layer of a neural network is called the output layer, and the layers located between the input and output layers are called hidden layers. In an MLP, all the neurons of a layer take as input all the inputs of the layer. Consequently, the neurons of the hidden and output layers are connected to all the neurons of the previous layer. These layers are hence referred to as fully connected layers or dense layers. Figure 2 illustrates the architecture of an MLP. We denote as y the output vector of a fully connected layer, composed of the individual outputs y_i of each neuron in the layer. W denotes the weight matrix built by concatenating the weight vectors of each neuron. b is the bias vector composed of the biases of all the neurons of a layer. The output of a fully connected layer can then be computed as follows:

y = F(W x + b)    (2)

where x is the input vector of a fully connected layer (and hence of all its neurons), and F is the activation function of the layer. F is the same as in the case of a single neuron, but in this case it is applied element-wise. The input vector of the implemented MLP is composed of the traffic variables (speed, flow and occupancy) of all the traffic detectors in the dataset over the last 72 time steps. This input vector has size 72 · S · 3, where S is the number of traffic detectors in the dataset (Table 1). The input vectors therefore have sizes of 5832, 5400, and 6696 when using the I5-N-3, I5-S-3, and I5-S-4 datasets, respectively.
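The layer output in (2) can be reproduced directly in NumPy; the sketch below uses ReLU as the activation and toy weight values chosen only for illustration.

```python
import numpy as np

def dense_layer(x, W, b, F):
    """Fully connected layer: y = F(W x + b), with F applied element-wise."""
    return F(W @ x + b)

relu = lambda z: np.maximum(z, 0)

W = np.array([[1.0, -1.0],
              [0.5,  0.5]])   # one weight vector per neuron, stacked as rows
b = np.array([0.0, -1.0])     # one bias per neuron
x = np.array([2.0, 1.0])      # layer input

print(dense_layer(x, W, b, relu))  # y = [1.0, 0.5]
```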
This input vector is normalized before being fed to the input layer of the MLP as follows:

x̂ = (x − E[x]) / √Var[x]    (3)

where x is the non-normalized input vector, x̂ is the normalized input vector, E[x] is the mean of the input vectors of the training set, and Var[x] is the variance of the input vectors of the training set. The division and the square root in (3) are element-wise operations. This normalization ensures that the distribution of input values has zero mean and unit variance. This contributes to a more stable training since the input values cannot take very low or very high values. The normalized input vector is fed to the implemented MLP. Its architecture is represented in Figure 3. This MLP is composed of three fully connected layers (FC) with 2048, 1024 and 1 neurons each (see footnote 4). The first two layers use the Rectified Linear Unit (ReLU) activation function [48]:

ReLU(x) = max(0, x)    (4)

where x is the input variable of the ReLU function. The ReLU activation function has been shown to outperform other activation functions like the sigmoid or the hyperbolic tangent [48]. In fact, we also tested the sigmoid and hyperbolic tangent activation functions, and the prediction accuracy was lower than that achieved using the ReLU function. The output layer uses the identity function as the activation function. This is because traffic prediction is a regression problem, and with the identity function we ensure that the output of the last layer is an affine transformation of its inputs. To avoid overfitting (see footnote 5), we have used dropout [49] for the training. Dropout is a regularization technique that consists in setting a random fraction of a layer's neurons' outputs to zero at each iteration of the training process. When using dropout, the neurons of a network do not know whether other neurons are going to have a non-zero output. This prevents the neurons from co-adapting. In our MLP implementation, we use dropout between the first fully connected layer and the second one.
The probability of keeping a neuron's output has been set to 0.6. We have not used dropout between the second FC layer and the output layer because it would prevent the output layer from correctly learning the affine transformation of its inputs. This MLP architecture is the one that achieved the best traffic prediction results in all the tests we performed. We tested MLP architectures with 2, 3, 4 and 5 layers and different numbers of neurons each. We also tested different dropout probabilities (0.2, 0.4, 0.6 and 0.8) with the best combination of number of layers and neurons. In total, we tested ten different architectures without dropout and four with dropout. The validation error obtained with the tested architectures is shown in Table 2, where each architecture is identified by the number of FC layers and the number of neurons in each layer. The first ten rows correspond to the architectures without dropout.

⁴ Reducing the number of neurons at each FC layer reduces the memory used by the parameters and the outputs of the FC layers.
⁵ Overfitting occurs when the network learns to predict the training set but does not have the capacity to predict other data that it has not used during the training process.
The validation error is measured with the Mean Absolute Error (MAE):

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|

where ŷ_i is the prediction obtained with a neural network, y_i is the ground truth, and N is the number of samples used to compute the MAE. Following the results in Table 2, the implemented MLP has 3 FC layers with the architecture FC 2048, FC 1024, FC 1 and a dropout probability of 0.6.
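The forward pass of an MLP with dropout, together with the MAE metric, can be sketched as follows. This is a toy-sized NumPy illustration of our own (inverted dropout, random weights); the implemented network has 2048, 1024 and 1 neurons and an input of 72 · S · 3 elements:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, params, keep_prob=0.6, training=True):
    """Three FC layers; dropout between the first and second layers.
    Inverted dropout: surviving outputs are rescaled by 1/keep_prob."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(W1 @ x + b1)
    if training:
        mask = rng.random(h1.shape) < keep_prob
        h1 = h1 * mask / keep_prob      # expected activation value is unchanged
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3                 # identity activation at the output layer

def mae(y_pred, y_true):
    """Mean Absolute Error over N samples."""
    return np.mean(np.abs(y_pred - y_true))

# Toy layer sizes (the real architecture is FC 2048, FC 1024, FC 1).
sizes = [12, 8, 4, 1]
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = mlp_forward(rng.standard_normal(12), params)
```

At evaluation time `training=False` disables the dropout mask, which is why the inverted scaling is applied during training.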

B. CNN
CNNs are a subclass of neural networks that is particularly useful for detecting local features in the input vectors [50]. CNNs have one or more convolutional layers, and the neurons of a convolutional layer only assign weights to a subset of contiguous inputs. This is illustrated in Figure 4. The neurons of a convolutional layer have the same trainable parameters, i.e. they share the same weights and biases. Consequently, the neurons of a convolutional layer assign the same weights to different parts of the input vector. This allows CNNs to detect the same features in different parts of the input, independently of where they are actually located. This property of CNNs is called translational invariance. The weight matrix of a convolutional layer contains many zeroes that correspond to the inputs without an assigned weight. The weight matrix also has many repeated variables as a consequence of the shared weights. Hence, instead of computing a matrix-vector product with this weight matrix (as done in MLPs using (2)), CNNs replace the matrix-vector product and the weight matrix with a convolution operation and a convolution filter, respectively [51]. The convolution filter is composed of the weights assigned to an input (i.e., the weights that are not necessarily zero), and the convolution operation slides the filter along the input of the layer. The output of a convolutional layer contains information about where a feature has been detected, and it is usually referred to as the feature map associated with the convolution filter. This is because each filter used in a CNN searches for a specific feature, and sliding the filter along the input produces a map of the locations where that feature was detected. Convolutional layers are usually followed by a set of fully connected layers that process the detected features in order to compute the output of the CNN.
CNNs can take as input vectors, matrices and tensors of any dimension. In fact, CNNs have become the state of the art in computer vision [52], where they process images with 2-dimensional convolutions. 2-dimensional convolutions work like 1-dimensional ones, but the weights and biases are shared along the two dimensions of the input. This turns the convolution filter into a 2-dimensional filter: the filter is no longer a vector but a matrix. In this case, convolution filters are slid along the height and width of the input. This can be generalized to n-dimensional convolutions [51].
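The weight sharing and translational invariance described above can be illustrated with a minimal 1-dimensional convolution in NumPy (a toy example of our own, not the implemented network):

```python
import numpy as np

def conv1d(x, w, b=0.0):
    """1-D convolution (cross-correlation): slide filter w along x,
    applying the same shared weights at every position."""
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) + b for i in range(len(x) - k + 1)])

# The filter responds identically to the same pattern wherever it occurs.
pattern = np.array([1.0, 2.0, 1.0])
x1 = np.concatenate([pattern, np.zeros(5)])   # pattern at the start of the input
x2 = np.concatenate([np.zeros(5), pattern])   # same pattern at the end
w = np.array([1.0, 2.0, 1.0])                 # filter matched to the pattern
m1, m2 = conv1d(x1, w), conv1d(x2, w)
```

The two feature maps m1 and m2 reach the same peak value; only the peak's position changes, which is exactly the location information the feature map encodes.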
The translational invariance of 2D-CNNs allows them to detect features of the objects present in an image no matter the context or the region of the image where those objects are located. CNNs are therefore an interesting option for traffic detection and prediction [35], since the traffic data can be represented as an image as illustrated in Figure 5. Figure 5 represents a contour plot of the spatiotemporal evolution of the traffic speed for the I5 freeway northbound on June 29th, 2017; the data has been obtained from PeMS [4]. Figure 5 clearly shows that traffic events (e.g. a congestion) can be visualized as shapes (or 'traffic objects') in the image. For example, Figure 5 shows a reduction of the traffic speed that causes a traffic jam between 14:30 and 15:30 and between postmiles 45 and 55. CNNs can recognize these events as objects in the image and can identify where and when they occurred. The CNN implemented in this work is similar to that of [35]. It takes as input the same traffic data as the MLP but formatted as an image. The input is then an image of the spatiotemporal evolution of the traffic variables (speed, flow and occupancy) of all the traffic detectors in the dataset over the last 72 time steps. The height of the image is equal to the number of traffic detectors, and its width is the time window (or number of time steps) over which the road traffic variables are analyzed in order to predict the road traffic. Images are usually composed of the R, G and B channels. Similarly, a traffic image is composed of three channels, each being the spatiotemporal evolution of one of the traffic variables (speed, flow and occupancy). Then, the CNN implemented in this work takes as input a tensor of size (S, 72, 3), with S being again the number of traffic detectors of the dataset (Table 1). Using the I5-N-3, I5-S-3, and I5-S-4 datasets, the CNN takes as input a tensor of size (27, 72, 3), (25, 72, 3), and (31, 72, 3), respectively.
Figure 6.a illustrates the general architecture of a CNN that processes an image in computer vision. Figure 6.b depicts the architecture of the CNN that processes the spatiotemporal evolution of the traffic represented as an image and that is used in this study for traffic prediction or forecasting. The input layer of the implemented CNN takes as input the traffic image, which is normalized using (3). The input layer is followed by a set of stacked convolutional layers that process the road traffic data in space and time in order to detect traffic features and patterns. The convolutional layers of the implemented CNN have been designed following the guidelines presented in [53]. In particular, we have used convolutional filters of size 3 × 3. These filters are stacked so that the output of one filter is the input of the next one. The same activation function is applied after each convolution operation. [53] shows that two filters of size 3 × 3 with 18 trainable parameters (9 parameters each) process their input similarly to a filter of size 5 × 5 with 25 trainable parameters. Likewise, three filters of size 3 × 3 process their input like a filter of size 7 × 7, and so on. [53] shows that using more filters of smaller size achieves the same or better performance with fewer trainable parameters than fewer filters of larger size. This approach also results in a more computationally efficient architecture that consumes less memory. All of the convolutional layers of the implemented CNN use the ReLU activation function [48]. Following [38], we use residual connections [54] in the convolutional layers. Residual connections consist in adding the input of the first layer of a set of convolutional layers to the output of that set. The authors of [54] demonstrated that the use of residual connections allows for deeper CNNs without increasing the complexity of the training. This approach leads to better results as shown in [55].
Figure 7.a depicts the diagram of the residual blocks used in the architecture of the implemented CNN. The architecture is illustrated in Figure 7.b. The implemented CNN uses batch normalization (BN in Figure 7) [56]. Batch normalization extends the normalization of the input of the network to the input of any layer. It reduces the covariate shift of the layers' inputs and ensures that their distribution is maintained during the training. This makes the training of the CNN more stable and accelerates it.⁶ The design of our implemented CNN has been adjusted following extensive testing. The final CNN we implement for traffic prediction has 18 convolutional layers arranged in 9 residual blocks like the one shown in Figure 7.a. Each residual block has two convolutional layers, and we apply batch normalization (BN) after each convolution. The residual blocks are stacked on top of each other so that each residual block takes as input the output of the previous one. We use padding in order to keep the size of the input of every convolutional layer constant⁷ and avoid losing information along the network. We do not use any pooling layer⁸ since it degraded the prediction accuracy. The first three residual blocks have convolutional layers with 32 convolution filters, the following three with 64 filters, and the last three with 96 filters. By increasing the number of filters, we augment the number of features that the network can detect. Augmenting the filters at the last residual blocks is useful to obtain the maximum number of feature maps at the last convolutional layer without using too much memory in the previous layers.

⁶ Our tests showed that the batch normalization we apply also improves the prediction accuracy.
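A simplified NumPy sketch of a residual block follows. It is 1-dimensional for brevity (the implemented CNN uses 2-dimensional 3 × 3 convolutions), batch normalization is shown inference-style using the batch statistics directly, and the placement of the final ReLU is one common convention rather than a transcription of Figure 7.a:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize over the batch axis, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def conv1d_same(x, w):
    """1-D convolution with 'same' zero padding: output length equals input length."""
    k = len(w)
    xp = np.pad(x, k // 2)
    return np.array([np.dot(w, xp[i:i + k]) for i in range(len(x))])

def residual_block(X, w1, w2):
    """Two conv+BN stages plus the identity shortcut.
    X has shape (batch, length); the filters are shared across the batch."""
    H = np.stack([conv1d_same(x, w1) for x in X])
    H = np.maximum(0.0, batch_norm(H))           # conv -> BN -> ReLU
    H = np.stack([conv1d_same(h, w2) for h in H])
    H = batch_norm(H)
    return np.maximum(0.0, H + X)                # residual connection: add the input

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 10))
Y = residual_block(X, rng.standard_normal(3), rng.standard_normal(3))
```

Because of the 'same' padding, the shortcut addition H + X is well-defined: the block's output has exactly the shape of its input, which is what allows the 9 blocks to be stacked directly.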
The implemented CNN has three Fully Connected (FC) layers after the convolutional ones. The FC layers compute the prediction as a function of the traffic features detected by the convolutional layers. All the FC layers use the ReLU activation function except the last (output) one, which uses the identity function. The first FC layer has 2048 neurons and takes as input the output of the last residual block. The second FC layer has 1024 neurons, and the last (or output) FC layer has 1 neuron. As in the MLP, we have used dropout [49] between the first and second FC layers in order to avoid overfitting. The probability of keeping the neurons' output has also been set to 0.6. Figure 7.b depicts the architecture of the implemented CNN for traffic prediction or forecasting. This architecture is the one with the best prediction accuracy in the tests we performed. In particular, we tested several CNN architectures with different numbers of residual blocks (1, 2, 3, 6 and 9), with 2 and 3 convolutional layers per residual block, and with different numbers of feature maps at each convolutional layer (16, 32, 64 and 96). We also tested different numbers of FC layers (the same as in the MLP, i.e. 2, 3, 4 and 5 FC layers with 256, 512, 1024 and 2048 neurons). Again, we tested the different dropout probabilities with the best architecture. In total, we tested 41 architectures without dropout and 4 with dropout. The MAE obtained on the validation set with the best ten CNN architectures without dropout and the four best architectures with dropout is shown in Table 3.

⁷ We use same padding for every convolution operation.
⁸ Pooling layers are used to reduce the size of the inputs.

C. LSTM
The Long Short-Term Memory (LSTM) [57] is a type of RNN that is the state of the art in natural language processing [58], machine translation [59] and speech recognition [60]. RNNs are particularly useful to address sequence problems since their architecture includes an internal state that contains information about previous inputs to the network. This state allows an RNN to compute its output as a function of the current and previous inputs without having to introduce all the previous inputs at each time step. In fact, the output can be obtained as a function of any previous input since the information of all previous inputs is represented in the internal state. The trainable parameters of an RNN remain the same at each time step. This allows RNNs to detect the same features independently of when they occur. This property of RNNs is called temporal invariance. However, RNNs experience some challenges when dealing with long-term dependencies. In particular, RNNs are exposed to the exploding gradient and the vanishing gradient problems [61], [62]. Both problems are related, although they have different solutions. The exploding gradient problem can be solved by clipping the gradient's maximum norm to a fixed value. This limits the learning step and ensures the training converges when the gradient is large. Hochreiter and Schmidhuber proposed in [57] to solve the vanishing gradient problem using a new RNN architecture referred to as LSTM. LSTMs are able to learn long-term dependencies without suffering the vanishing gradient problem. This is possible because LSTMs can learn when they should remember (and when not) their internal state as a function of the network's input, the previous internal state and the previous output.
The basic unit of the LSTM is the LSTM memory cell. These cells are arranged in layers referred to as recurrent layers. Recurrent layers are stacked on top of each other and take as input the output of the previous layer, except in the case of the input layer. In this work, we have implemented the LSTM memory cell of [63], which modifies the LSTM memory cell introduced in [57]; the same implementation is considered in [29]. Figure 8 depicts the architecture of the LSTM memory cell implemented in this study. Readers are referred to [63] for more details on the implementation of this LSTM memory cell. LSTMs usually stack a set of fully connected layers after the last recurrent layer in order to compute the output of the network. This is also done in our LSTM implementation. Figure 9 illustrates the architecture of the LSTM implemented in this study, which follows the proposal in [29]. The input vector is composed of the traffic variables (speed, flow and occupancy) of a single traffic detector in the last time step (like in [29] or [30]). The input vector is therefore of size 3. The input vector is normalized using (3) before being fed to the input layer of the LSTM. The implemented LSTM is composed of a single recurrent layer with 40 LSTM memory cells followed by a fully connected layer with one neuron. The fully connected layer uses the identity activation function. This LSTM is the one with the best prediction accuracy among the different architectures we tested. We have tested LSTMs with 1, 2, and 3 recurrent layers with 20, 40, 60, 80, and 100 LSTM memory cells.
In total, we have tested 15 LSTM architectures. The MAE obtained with the best ten LSTM architectures tested is shown in Table 4. We highlight in boldface the architecture achieving the best results and that has been implemented in this study. The results in Table 4 show that increasing the number of recurrent layers or LSTM memory cells did not increase the performance of the LSTM.
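The behavior of an LSTM memory cell can be sketched with the standard LSTM equations. This is a simplification of the cell of [63] implemented in this study (e.g. it omits peephole connections), with random toy weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM memory cell. W maps the concatenated
    [x, h] to the four gates; h is the output, c the internal (cell) state."""
    z = W @ np.concatenate([x, h]) + b
    n = len(h)
    i, f, o, g = (sigmoid(z[:n]), sigmoid(z[n:2*n]),
                  sigmoid(z[2*n:3*n]), np.tanh(z[3*n:]))
    c_new = f * c + i * g          # the cell learns what to keep and what to write
    h_new = o * np.tanh(c_new)     # the output gate decides what to expose
    return h_new, c_new

# A recurrent layer of 40 cells processing the 3-variable input sequence.
rng = np.random.default_rng(3)
n_cells, n_in = 40, 3
W = rng.standard_normal((4 * n_cells, n_in + n_cells)) * 0.1
b = np.zeros(4 * n_cells)
h = c = np.zeros(n_cells)
for t in range(72):                # one 6-hour sequence of 5-minute steps
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

The forget gate f is what lets the cell preserve its state across many steps, which is how the vanishing gradient problem is mitigated.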

D. CNN+LSTM
The CNN+LSTM model is a hybrid model that combines convolutional and recurrent layers. This combination has proven to be successful in tasks like image captioning [64] or action recognition in videos [65]. It has also been shown to perform well in traffic prediction [45], [46]. Like the CNN, the CNN+LSTM model also processes the spatiotemporal evolution of the traffic. However, it differs in how it processes it. The CNN+LSTM model first processes the spatial component with convolutional layers. Then, it uses recurrent layers to process the temporal evolution of the output of the convolutional layers. Since the convolutional layers only process the spatial evolution of the traffic data, the dimension of the convolution filters is decreased by one. Our study evaluates the traffic prediction over a (long) set of highway sections. In this case, the traffic data has a single spatial dimension. The convolution filters of the CNN+LSTM model are hence 1-dimensional, like in [46]. We denote as CNN_1D and LSTM the CNN and LSTM sub-networks, respectively, and we denote as FC a fully connected layer stacked on top of the LSTM sub-network. This FC layer is the last layer in our CNN+LSTM model and is in charge of computing the prediction. The output of the CNN+LSTM model can be computed as:

y_t = FC(LSTM(CNN_1D(x̂_t)))    (6)

where y_t is the output of the CNN+LSTM output (FC) layer at time step t, and x̂_t is the normalized input vector of the CNN sub-network input layer at time step t. The vector x̂_t is composed of the traffic variables measured at all the traffic detectors in the dataset at time step t (Table 1). The input vector then has a size of (27, 3), (25, 3), and (31, 3) when using the I5-N-3, I5-S-3, and I5-S-4 datasets, respectively. The convolutional layers search for local features in the input vector. Then, the recurrent layers process the temporal evolution of the detected local features to search for temporal features. Finally, the fully connected layer computes the prediction as a function of the detected temporal features.
Figure 10 illustrates the architecture of the implemented CNN+LSTM model. It is composed of three convolutional layers, one recurrent layer and one fully connected layer; this implementation is similar to the one in [46]. The three convolutional layers have 32 feature maps that are the result of applying 1-dimensional convolution filters of size 3. All convolutional layers use the ReLU activation function [48] and batch normalization [56]. The output of the last convolutional layer is the input of the recurrent layer. The recurrent layer is composed of 40 LSTM memory cells. These memory cells are implemented following [63]. Finally, the output of the recurrent layer at each time step is the input of the fully connected layer. This fully connected layer is composed of a single neuron and uses the identity function as its activation function. Like in [46], we have used L1 regularization for the weight matrix of the output layer. L1 regularization consists in adding to the loss function a term that penalizes large values of the trainable parameters. This term is defined in (7):

L1(W) = λ · Σ_{i,j} |w_{i,j}|    (7)

where W is the weight matrix that is regularized (in our case the weight matrix of the output layer), w_{i,j} are its elements, and λ is a weighting factor. λ is equal to 0.002 in our implementation, as in [46]. The implemented CNN+LSTM architecture is the one that achieved the best performance of all the tested architectures. We tested architectures with 1, 2 and 3 convolutional layers and different numbers of feature maps per layer (16, 32, and 64). We also tested the same numbers of recurrent layers and LSTM memory cells as in the case of the LSTM. Again, using more than one recurrent layer and 40 LSTM memory cells did not increase the performance of the model. In total, we tested 68 CNN+LSTM architectures. Table 5 shows the MAE obtained with the best ten architectures. We highlight in boldface the architecture achieving the best results, which is the one implemented in this study.
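The L1 regularization term of (7) and its addition to the squared-error loss can be illustrated as follows (toy numbers of our own, not the trained weight matrix):

```python
import numpy as np

def l1_penalty(W, lam=0.002):
    """L1 regularization term of (7): lam times the sum of |w_ij| over W."""
    return lam * np.abs(W).sum()

# Toy output-layer weight matrix and a single prediction/ground-truth pair.
W_out = np.array([[0.5, -1.5], [2.0, 0.0]])
y_pred, y_true = np.array([62.0]), np.array([60.0])

# The penalty is simply summed with the prediction loss during training.
loss = np.sum((y_pred - y_true) ** 2) + l1_penalty(W_out)
```

Because the penalty grows with the absolute value of every weight, minimizing the combined loss pushes the output layer towards small (and often sparse) weights.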

E. LSTM+CNN
This study proposes and evaluates an LSTM+CNN model that switches the order of the convolutional and recurrent layers compared to the previous model. The objective is to analyze whether exploiting the temporal evolution of the traffic prior to the spatial one might improve the accuracy of the prediction. In the LSTM+CNN model, the recurrent layers process the input, and the convolutional layers process the output of the recurrent layers. Again, a fully connected layer is in charge of computing the prediction. The output of the LSTM+CNN model can then be computed as:

y_t = FC(CNN_1D(LSTM(x̂_t)))    (8)

where CNN_1D, LSTM, and FC represent the CNN and LSTM sub-networks and the fully connected layer, respectively, and y_t and x̂_t are the output and the normalized input vector of the model. The input vector x̂_t contains the same information as in the case of the CNN+LSTM model. However, the input vector is now fed to the recurrent layers. The input vector is now a 1-dimensional vector of size S · 3, where S is again the number of detectors. The size of the input vector is then equal to 81, 75, and 93 when using the I5-N-3, I5-S-3 and I5-S-4 datasets, respectively. The recurrent layers do not search for local features in this vector but for features in its temporal evolution. The outputs of the recurrent layers at all time steps are concatenated into a vector and fed to the convolutional layers. The convolutional layers search for local features in the resulting vector. Finally, the fully connected layer computes the prediction as a function of the detected local features. Figure 11 depicts the architecture of the implemented LSTM+CNN model. The architecture is similar to the CNN+LSTM model (with the switched layers), but there are slight differences. The LSTM+CNN model is composed of one recurrent layer followed by three 1-dimensional convolutional layers and one fully connected layer. The recurrent layer is composed of 40 LSTM memory cells. These memory cells are also implemented following [63].
The output of the recurrent layer at all time steps is concatenated and fed to the convolutional layers. The three convolutional layers have 16, 32 and 64 feature maps, respectively. These feature maps are the result of applying 1-dimensional convolution filters of size 3. All convolutional layers use the ReLU activation function [48] and batch normalization [56]. The output of the last convolutional layer is fed to the fully connected layer. This layer is the output layer and computes the prediction. It is composed of a single neuron and uses the identity function as its activation function. We also use L1 regularization for the weight matrix of the output layer with a weighting factor of 0.002. Tests similar to those conducted for the CNN+LSTM model were also performed for the LSTM+CNN proposal in order to find the architecture that minimized the prediction error. We tested the same architectures but with the order of the convolutional and recurrent layers inverted. Again, the best results were obtained with one recurrent layer and 40 LSTM memory cells. The MAE obtained with the best ten LSTM+CNN architectures is shown in Table 6. We highlight in boldface the architecture achieving the best results, which is the one implemented in this study.
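The concatenation step that distinguishes the LSTM+CNN model (recurrent outputs of all time steps concatenated before the 1-dimensional convolutions) can be sketched as follows; the recurrent outputs here are random stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in recurrent outputs: 40 LSTM cell outputs at each of 72 time steps.
h_t = [rng.standard_normal(40) for _ in range(72)]

# LSTM+CNN: concatenate the outputs of all time steps into one long vector,
# then slide a 1-D filter of size 3 over it to search for local features.
v = np.concatenate(h_t)            # length 72 * 40
w = rng.standard_normal(3)
feature_map = np.array([np.dot(w, v[i:i + 3]) for i in range(len(v) - 2)])
```

Note the contrast with the CNN+LSTM model, where the convolution runs over the detector axis of the raw input and the LSTM then consumes one time step at a time.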

F. eRCNN
The eRCNN model is a CNN that modifies the fully connected layers. The modification introduces a feedback mechanism so that the fully connected layers take as input both the output of the convolutional layers and a vector composed of the prediction errors in the last time steps. The implemented eRCNN follows the proposal in [39]. This implementation processes the output of the convolutional layers and the error vector separately by means of two parallel fully connected layers. The output of these two layers is then concatenated and fed to another fully connected layer that computes the prediction. We denote as FC_FF and FC_E the fully connected layers that take as input the output of the convolutional layers and the error vector, respectively. We denote as FC_O the fully connected layer that computes the prediction. The output of the eRCNN model is then computed as:

y = FC_O(concat(FC_FF(y_CONV), FC_E(e)))    (9)

where concat is a function that concatenates two vectors, y_CONV is the output of the last convolutional layer, e is the error vector, and y is the output of the eRCNN model. The idea behind the eRCNN model is to improve the prediction by using feedback related to the accuracy of the prediction in the last time steps. The model can adapt its prediction if previous predictions made a significant error. The authors of [39] claim that this property of the eRCNN could be useful to predict traffic congestion. Figure 12 represents the architecture of the implemented eRCNN. This architecture is the one that achieved the highest prediction accuracy in the exhaustive tests we performed. We tested architectures with different numbers of feature maps (16, 32, 64), with and without a pooling layer, and with different numbers of neurons (16, 32, 64, 128, and 256) in the FC layers. In total, we tested 48 different eRCNN architectures. The MAE obtained with the best ten eRCNN architectures is shown in Table 7. We highlight in boldface the architecture achieving the best results, which is the one implemented in this study.
The implemented eRCNN is composed of a single 2-dimensional convolutional layer followed by an average pooling layer, the two parallel fully connected layers, and the output fully connected layer. This model is similar to the one implemented in [39]. The convolutional layer takes as input an image of the traffic data equal to the one used for the CNN. The convolutional layer has 32 feature maps that are the result of applying 2-dimensional convolution filters of size 3 × 3, and it uses the ReLU activation function [48]. The size of the output of the convolutional layer is reduced by an average pooling layer. This layer divides the feature maps of the convolutional layer into portions of size 2 × 2 and computes the average value of each portion, which reduces the width and height of the feature maps by a factor of two. The output of the average pooling layer is the input of the fully connected layer represented by FC_FF in (9). This fully connected layer is composed of 256 neurons and uses the ReLU activation function. In parallel, another fully connected layer takes as input an error vector composed of the prediction errors in the last 6 time steps. This fully connected layer is represented by FC_E in (9). It is composed of 32 neurons and also uses the ReLU activation function. The outputs of these two fully connected layers are concatenated into a vector that is the input of the output fully connected layer. This layer is composed of a single neuron and uses the identity activation function.
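The fully connected head of (9) can be sketched in NumPy as follows. The weights are random toy values, and the feature dimension of the convolutional output is an arbitrary stand-in; the neuron counts (256 / 32 / 1) mirror the implemented model:

```python
import numpy as np

rng = np.random.default_rng(6)

def relu(z):
    return np.maximum(0.0, z)

def ercnn_head(y_conv, e, params):
    """Fully connected head of the eRCNN, equation (9):
    y = FC_O(concat(FC_FF(y_conv), FC_E(e)))."""
    (Wf, bf), (We, be), (Wo, bo) = params
    ff = relu(Wf @ y_conv + bf)     # 256 neurons on the convolutional features
    fe = relu(We @ e + be)          # 32 neurons on the last 6 prediction errors
    return Wo @ np.concatenate([ff, fe]) + bo   # single output neuron, identity

d_conv = 128                        # stand-in size of the pooled conv output
params = [(rng.standard_normal((256, d_conv)) * 0.05, np.zeros(256)),
          (rng.standard_normal((32, 6)) * 0.05, np.zeros(32)),
          (rng.standard_normal((1, 288)) * 0.05, np.zeros(1))]
e = np.zeros(6)                     # error vector: errors at the last 6 steps
y = ercnn_head(rng.standard_normal(d_conv), e, params)
```

The two parallel branches keep the convolutional features and the error feedback separate until the final concatenation, so the output neuron can weigh both sources of information.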

G. eRLSTM
The eRLSTM is the second proposal in this paper. It is a modification of the eRCNN model in which the convolutional layers are replaced by recurrent layers with LSTM memory cells. The idea is to implement the error feedback mechanism in LSTMs. Equation (9) can still be applied to compute the prediction in the fully connected layers of the model by replacing y_CONV with y_LSTM, which represents the output of the last recurrent layer. The objective of the error feedback is the same as for the eRCNN model. Figure 13 illustrates the architecture of the implemented eRLSTM. This architecture is the one that achieved the best prediction accuracy in the tests we carried out. We tested architectures with different numbers of LSTM memory cells (20, 40, 60, 80, and 100) and neurons (20, 40, and 60) in each layer. In total, we tested 45 different eRLSTM architectures. The MAE obtained with the best ten eRLSTM architectures is shown in Table 8. We highlight in boldface the architecture achieving the best results, which is the one implemented in this study. The input of the implemented eRLSTM model is the same as in the case of the LSTM model previously described. The implemented eRLSTM is composed of a single recurrent layer, two parallel fully connected layers, and an output fully connected layer. The recurrent layer is composed of 40 LSTM memory cells, which have been implemented following [63]. The output of the recurrent layer at each time step is fed to the fully connected layer represented by FC_FF in (9). This fully connected layer is composed of 40 neurons and uses the ReLU activation function [48]. In parallel, a fully connected layer composed of 20 neurons takes as input the error vector. This fully connected layer is represented by FC_E in (9) and uses the ReLU activation function. The error vector is composed of the prediction errors in the last 6 time steps. The outputs of the parallel layers are concatenated and fed to the output layer.
This layer is composed of a single neuron and uses the identity activation function.
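The maintenance of the error vector fed back to FC_E can be sketched as a rolling buffer of the last 6 prediction errors (an illustration of our own; the papers do not prescribe this exact implementation):

```python
import numpy as np

def update_error_vector(e, y_pred, y_true):
    """Drop the oldest error and append the newest one, keeping 6 entries."""
    return np.append(e[1:], y_pred - y_true)

# Error vector after two prediction steps (speeds as toy values).
e = np.zeros(6)
for y_pred, y_true in [(61.0, 60.0), (58.0, 59.5)]:
    e = update_error_vector(e, y_pred, y_true)
```

At each time step, the model predicts, the true measurement arrives, and the resulting signed error is pushed into the buffer that the error-recurrent branch consumes at the next step.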

H. eRCNN+LSTM
The eRCNN+LSTM model is the third proposal presented in this paper. The model builds on the eRCNN model but replaces its convolutional layers with 1-dimensional convolutional layers and recurrent layers with LSTM memory cells. The 1-dimensional convolutional layers are the first to process the traffic data. Their output is then fed into the recurrent layers. Equation (9) can again be used to compute the prediction in the fully connected layers by replacing y_CONV with y_LSTM, which represents the output of the last recurrent layer. Figure 14 depicts the architecture of the eRCNN+LSTM model proposed and implemented in this study. This architecture corresponds to the one with the best prediction performance in the tests we carried out. We conducted tests with 1, 2 and 3 convolutional layers and different numbers of feature maps per layer (16, 32, 64). We also tested different numbers of LSTM memory cells in the recurrent layer (20, 40, 60, 80, and 100), and different numbers of neurons in the FC layers (20, 40, and 60). In total, we tested 96 different architectures. Similar to the previous recurrent models, using more than 40 LSTM memory cells did not improve the accuracy of the model. Table 9 shows the MAE obtained with the best ten eRCNN+LSTM architectures. We highlight in boldface the architecture achieving the best results, which is the one implemented in this study. The input of the implemented eRCNN+LSTM model is the same as for the CNN+LSTM model previously described. The implemented eRCNN+LSTM is composed of three 1-dimensional convolutional layers followed by one recurrent layer, two parallel fully connected layers, and an output fully connected layer.

(In Table 8, each eRLSTM architecture is identified as LSTM X, (FC Y, FC Z), FC A, where X is the number of LSTM memory cells, Y is the number of neurons in the FC layer that takes as input the output of the recurrent layer, Z is the number of neurons in the FC layer that takes as input the error vector, and A is the number of neurons in the output FC layer.)
The three convolutional layers have 32 feature maps that are the result of applying 1-dimensional convolution filters of size 3. All convolutional layers use the ReLU activation function [48] and batch normalization [56]. The output of the last convolutional layer is the input of the recurrent layer. This recurrent layer is composed of 40 LSTM memory cells implemented following [63]. The output of the recurrent layer at each time step is fed to the fully connected layer represented by FC FF in (9). This layer is composed of 40 neurons and uses the ReLU activation function. The parallel fully connected layer takes as input the error vector that is composed of the prediction error in the last 6 time steps. This layer is composed of 20 neurons and also uses the ReLU activation function. The output of the two parallel layers is concatenated and fed to the output layer. The output layer is composed of a single neuron and uses the identity activation function to compute the traffic prediction.

I. TRAINING
All neural networks implemented and proposed in this study have been trained using the backpropagation algorithm and a variation of the stochastic gradient descent (SGD) algorithm called ADAM [66]. The SGD algorithm trains neural networks at each iteration with a batch of training examples instead of using the whole training set. Each example is a tuple that contains an input to the neural network and the ground truth of the prediction that the network should compute for that particular input. SGD ensures that the neural network, its activations and its gradients fit in memory. We have trained all the neural networks with batches of 50 training samples each. All models with recurrent connections (LSTM, CNN+LSTM, LSTM+CNN, eRCNN, eRLSTM and eRCNN+LSTM) have been trained with a variation of the backpropagation algorithm called backpropagation through time (BPTT). BPTT computes the gradient of the loss function with respect to the trainable parameters through several time steps. This ensures that the computation of the internal state of the network and the connections through time are also trained. We have trained the networks with BPTT and sequences of 72 time steps. The objective of the training is to minimize a loss function. The prediction of the road traffic is a regression problem. We have then selected as loss function the squared L2 norm of the network's error:

L(ŷ, y) = ‖ŷ − y‖₂²    (10)

where ŷ is the network's output, y is the ground truth, and ‖·‖₂ is the L2 norm. Minimizing this loss function is equivalent to minimizing the squared error of the network.

(In Table 9, each eRCNN+LSTM architecture is identified by X, the number of convolutional layers; Y, the number of convolution filters in a convolutional layer; Z, the number of LSTM memory cells; and (FC A, FC B), FC C, where A is the number of neurons in the FC layer that takes as input the output of the recurrent layer, B is the number of neurons in the FC layer that takes as input the error vector, and C is the number of neurons in the output FC layer.)
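The loss of (10) and a single ADAM update can be sketched as follows. A toy linear model stands in for the networks, and the ADAM hyperparameters are the defaults proposed in [66]:

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Loss of (10): squared L2 norm of the network's error."""
    return np.sum((y_pred - y_true) ** 2)

def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update of parameter vector w given gradient g.
    m and v are running averages of the gradient and its square."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction for the warm-up steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One iteration on a batch of 50 samples for a toy linear model y = w . x.
rng = np.random.default_rng(7)
X, y_true = rng.standard_normal((50, 4)), rng.standard_normal(50)
w = np.zeros(4)
g = 2 * X.T @ (X @ w - y_true)         # gradient of the squared L2 loss
w, m, v = adam_step(w, g, np.zeros(4), np.zeros(4), t=1)
```

In the actual training, the gradient g is produced by backpropagation (or BPTT for the recurrent models) rather than by this closed-form expression.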
The CNN+LSTM and LSTM+CNN models include regularization terms. In this case, these terms are added to the loss function.
We have used learning rate exponential decay so that the training converges faster at the beginning and achieves low error rates at the end of the training process. The initial value of the learning rate is 10⁻⁴. The learning rate is multiplied by 0.1 every 2000 iterations of the backpropagation algorithm. We have also used gradient clipping in order to avoid training oscillations at the beginning. The maximum allowed gradient norm with gradient clipping is 40. Every epoch (i.e. every time we have iterated over the whole training set), we permute the training set in order to change the order of the training examples. This ensures that the training batches are never the same in different iterations, and prevents the network from learning non-existent correlations between different training examples. We have trained the different neural networks for 20000 iterations (10 epochs). All the 2015 data is used for the training set, half of the 2016 data for the validation set, and the other half for the test set. The data is prepared in training, validation and test samples so that they can be used by the different neural network models. Each sample contains data of six hours of continuous traffic, so it contains information about the temporal evolution of the traffic. The traffic data has a temporal resolution of 5 minutes, so every sample contains 72 measurements of the traffic detectors. The samples are prepared according to the format required by each neural network. They are formatted as vectors for the MLP, LSTM, CNN+LSTM, LSTM+CNN, eRLSTM, and eRCNN+LSTM (each one with the corresponding vector size according to each model) and as traffic images for the CNN and the eRCNN. The neural networks with recurrent connections (LSTM, CNN+LSTM, LSTM+CNN, eRCNN, eRLSTM, and eRCNN+LSTM) require the samples to be formatted into sequences for the BPTT training and for computing the internal state or error vectors.
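The optimization setup described in this section (Adam with an initial learning rate of 10⁻⁴, decay by 0.1 every 2000 iterations, gradient clipping at norm 40, and batches of 50 samples) can be sketched as follows. The model and data here are placeholders, not the networks evaluated in this study:

```python
import torch
import torch.nn as nn

model = nn.Linear(72, 1)                       # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# multiply the learning rate by 0.1 every 2000 iterations
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2000, gamma=0.1)
loss_fn = nn.MSELoss(reduction='sum')          # squared L2 norm of the error

batch = (torch.randn(50, 72), torch.randn(50, 1))  # one batch of 50 samples
for x, y in [batch]:
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
    opt.step()
    sched.step()
```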
The samples for networks with recurrent connections are then sequences of 72 vectors or traffic images. For the MLP and the CNN, the sequence of 72 traffic measurements is already included in the vector and the traffic image, respectively. In total, we have used 100000 samples to train all models, 50000 samples to validate them, and 50000 samples to test them. We would like to highlight that this is a large number of samples. For example, [29] uses 18000 samples for training and 3600 samples for testing. In [35], the authors used 21600 samples for training and 5040 samples for validation and testing. It should be noted that the 2016 halves do not correspond to half a calendar year but to half of the samples in the 2016 dataset. We first prepare the samples in the format required by each model as previously explained. To build the validation and test sets, we then randomly permute the order of the samples in the dataset. This eliminates any bias in the results by ensuring that both the validation and test sets represent the seasonality of the traffic throughout the year. However, the temporal evolution of the traffic is still present in each sample, which represents 6 hours of continuous traffic, and the integrity of the samples is hence not compromised. This approach is followed in related studies such as [33], [67]. Finally, we use the first half of the permuted samples for the validation set and the other half for the test set.
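The random permutation used to build the validation and test sets can be sketched as follows. The sample counts match the text; the seed is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
samples_2016 = np.arange(100000)                # indices of the prepared 2016 samples
perm = rng.permutation(samples_2016)            # removes seasonal ordering bias
val_idx, test_idx = perm[:50000], perm[50000:]  # 50000 validation, 50000 test
```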

V. TRAFFIC PREDICTION
All neural networks have been trained to predict the value of a traffic variable (traffic flow and traffic speed in this study) in the next 15 minutes. This choice was made since the reference Highway Capacity Manual [68] considers 15 minutes for most of its analysis. The prediction accuracy achieved with the different neural networks is compared using error metrics that measure the difference between predicted values and the ground truth. In particular, we utilize the mean absolute error (MAE, measured in mph for traffic speed and veh/h/lane for traffic flow), the mean absolute percentage error (MAPE, measured as a percentage), and the root mean squared error (RMSE, measured in mph for traffic speed and veh/h/lane for traffic flow). MAE was defined in (5). The MAPE and RMSE error metrics are defined as MAPE = (100/N) Σᵢ |ŷᵢ − yᵢ| / yᵢ and RMSE = √((1/N) Σᵢ (ŷᵢ − yᵢ)²), where ŷᵢ is the prediction obtained with a neural network, yᵢ is the ground truth, and N is the number of samples used to compute the error metrics. In our case, N is the size of the test set, i.e. 50000 samples. For all neural networks under evaluation, the error metrics computed over the validation set never increased and took values similar to those computed over the training set. This indicates that none of the trained neural networks suffered from overfitting. Figure 15 compares the capacity of all the implemented neural networks to predict the traffic speed and the traffic flow. This comparison is done utilizing the full test sets of the three I5 freeway sections. Specifically, Figure 15 reports the MAE of the prediction obtained with each neural network under evaluation. Table 10, Table 11, and Table 12 complement Figure 15 and report the three error metrics (MAE, MAPE, and RMSE). The results in these tables are again obtained using the full test sets for each I5 freeway section. We highlight in boldface the lowest value for each metric in the three tables.
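The three error metrics can be written compactly as a direct transcription of the definitions above:

```python
import numpy as np

def mae(y_hat, y):
    # mean absolute error
    return np.mean(np.abs(y_hat - y))

def mape(y_hat, y):
    # mean absolute percentage error (assumes y > 0)
    return 100.0 * np.mean(np.abs(y_hat - y) / y)

def rmse(y_hat, y):
    # root mean squared error
    return np.sqrt(np.mean((y_hat - y) ** 2))
```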
The results depicted in Figure 15 and Table 10, Table 11, and Table 12 show that the best prediction accuracy (for both traffic variables) is achieved with the error recurrent models, i.e. with eRCNN, eRLSTM and eRCNN+LSTM. (The performance achieved with MLP and its comparison with other existing models is in line with the results reported so far in the literature. The results also show that LSTM outperforms all other models that do not utilize the error feedback.) eRLSTM and eRCNN+LSTM are two new models proposed in this paper. eRLSTM achieves the best accuracy when predicting the traffic speed on the I5-S-3 and the I5-S-4 freeway sections, and eRCNN on the I5-N-3 section. eRCNN achieves the best traffic flow prediction accuracy on all I5 freeway sections. (It is interesting to observe that while LSTM outperformed CNN for both traffic variables, eRCNN outperforms eRLSTM. The error feedback has then a more positive effect on CNN models than on LSTM ones.) eRLSTM and eRCNN+LSTM also achieve good accuracy when predicting the traffic flow. However, these two models can be outperformed by the LSTM under certain scenarios. This trend was not observed when predicting the traffic speed; in this case, error recurrent models always outperform the rest of the models. Table 10, Table 11, and Table 12 also show that some models outperform error recurrent models when analyzing the MAPE achieved when predicting the traffic flow over some datasets. For example, the LSTM model achieves the lowest MAPE when predicting the traffic flow on the I5-N-3. The MAPE metric is greatly influenced by the accuracy of the prediction for low values of the ground truth. When such values are encountered, LSTM can better predict the traffic flow than the error recurrent models for some scenarios. However, it should be noted that low traffic flows are observed under free flow or traffic congestion. Traffic congestion is characterized by the reduction of the traffic speed.
Table 10, Table 11, and Table 12 show that the error recurrent models achieve the lowest MAPE when predicting the traffic speed. This indicates that error recurrent models better predict traffic congestion than the LSTM model. As a result, the lower MAPE observed for LSTM in certain scenarios is basically due to free flow conditions. Free flow conditions are less relevant from the point of view of traffic management. Accurately predicting the road traffic when congestions are being formed is highly relevant for road traffic management authorities. The next section then presents a more detailed analysis of the traffic prediction under congestion conditions. Another interesting result that can be observed in Figure 15 and Table 10, Table 11, and Table 12 is that the LSTM+CNN model exhibits higher error rates than the CNN+LSTM model in all cases. This is possibly due to the fact that the spatial information is lost after the traffic data is processed by the recurrent layers. As a result, processing the output of the recurrent layers with convolutional layers no longer offers an advantage, and degrades the prediction accuracy.
Results in Figure 15 and Table 10, Table 11, and Table 12 show that overall the error recurrent models achieve the most accurate prediction of the traffic speed and the traffic flow. The prediction accuracy of these models is illustrated in Figure 16, which compares the ground truth with the predicted traffic speed and flow for the three error recurrent models. The results are depicted for a typical weekday (i.e. no unusual traffic event happened on that weekday) in the I5-N-3 freeway section. The figure shows that the three error recurrent models can accurately predict the traffic speed and the traffic flow. Similar accuracy levels have been observed on the I5-S-3 and I5-S-4 freeway sections.

VI. TRAFFIC CONGESTION PREDICTION
This section analyzes the traffic prediction accuracy of the different neural networks when traffic congestion is being formed or dissolved. Congestion occurs when the traffic demand exceeds the road capacity. When this happens, further increases in the traffic demand cause the traffic flow and speed to decrease, and traffic congestion emerges. Being able to predict the formation of traffic jams is interesting to road authorities since they can then implement proactive measures to avoid congestions or reduce their negative impact. However, predicting congestions is not straightforward since the traffic datasets used to train the deep learning-based prediction techniques represent daily road traffic conditions, where non-congested traffic is generally much more frequent than congestion.
We have first analyzed the datasets to detect road traffic congestion conditions. To this aim, we follow the guidelines in [68] and its level of service (LoS) definitions. The level of service is a measure that classifies the traffic state into six different levels, ranging from LoS A (free flow) to LoS F (traffic congestion). LoS A-E correspond to non-congested traffic, while LoS F corresponds to traffic congestion. [68] defines LoS F as traffic situations with a traffic demand exceeding the road capacity, forcing the traffic speed and flow to drop. Then, in order to detect traffic congestion in our datasets, we plotted the fundamental diagrams of the traffic flow for the three datasets, and used them to identify a pair of values for the traffic speed and the traffic flow below which the road capacity is exceeded and congestion (LoS F) emerges. The fundamental diagram of traffic flow represents the relationship between the traffic speed, flow and density for a given road or location. In our case, we have plotted a scatter plot of the traffic speed and flow measures in the datasets, so the speed-flow relationship of the fundamental diagram is represented. The speed-flow relationship for the I5-N-3 dataset is illustrated in Figure 17. In the speed-flow relationship, the evolution of traffic from free flow conditions (upper left corner) to congestion conditions (lower left corner) can be observed. In our three datasets, LoS F is experienced when the speed and traffic flow are lower than 50 mph and 1500 veh/h/lane, respectively. This pair of values is highlighted in red in Figure 17. We can then categorize the traffic as congestion (LoS F) when the speed and traffic flow are below the selected pair of values. Figure 18 depicts the proportion of congestion (LoS F) and non-congestion (LoS A-E) situations in the three datasets used in this study. The figure clearly shows that congestion conditions are much less frequent than non-congestion conditions.
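The congestion (LoS F) labeling described above reduces to a simple threshold test on the two traffic variables; the function name is illustrative:

```python
def is_congested(speed_mph, flow_veh_h_lane):
    # LoS F (congestion) when both speed and flow fall below the thresholds
    # identified from the fundamental diagram for the three I5 datasets
    # (50 mph and 1500 veh/h/lane).
    return (speed_mph < 50.0) & (flow_veh_h_lane < 1500.0)
```

Note that the conjunction matters: a low flow alone can also occur under free flow (low demand), so only the combination of low speed and low flow is labeled as LoS F.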
The selected datasets are hence representative of the traffic experienced in most highways. It is also important to note that traffic congestion causes large variations of the traffic speed and flow. This can actually be observed in Figure 16, which represents the traffic flow and speed in a weekday for the I5-N-3 dataset. The figure clearly shows how in peak traffic hours (e.g. 12:00-14:00) the traffic flow and speed significantly decrease when traffic congestion appears and increase when it dissolves. These large variations are not present under free flow conditions. Consequently, neural networks trained with regular traffic datasets might be able to accurately predict normal traffic conditions but not congestions if congestions are less frequent and also have different traffic patterns. This is actually illustrated in Figure 19, which compares the error rates achieved by the implemented CNN and LSTM models as a function of the difference between the ground truth value we want to predict and the previous ground truth value. Figure 19 shows that LSTM achieves lower error rates than CNN when the difference is small, i.e. when the variations of traffic speed and flow over time are small. On the other hand, CNN achieves lower error rates when the differences are large. These results show that the CNN model better predicts large variations of the traffic speed or flow than the LSTM model. This is due to the capacity of CNNs to exploit the spatiotemporal evolution of the traffic. Road traffic is characterized by a strong spatiotemporal correlation. As a result, large variations of the traffic variables depend not only on the temporal evolution of the traffic in a given location, but also on the temporal evolution of the traffic in neighboring locations. CNNs consider this dependency since their prediction takes into account the spatiotemporal evolution of the traffic.
As a result, CNNs can better predict large variations of the traffic variables that are representative of traffic congestion conditions, as illustrated in Figure 16. We should remind that Figure 15 showed that the LSTM model better predicts the traffic flow and speed than the CNN model over the complete datasets. These results clearly show that the overall best traffic prediction technique might actually not be the best technique to predict congestion conditions. (These results are not in contradiction with Table 10, Table 11, and Table 12. Those tables report lower error rates for the LSTM because the error rates are computed considering the complete datasets, where free flow conditions, with small variations of the traffic variables over two time steps, are more frequent; see Figure 18.) The importance for traffic management of being able to accurately predict congestion conditions then justifies the need to search for the best prediction techniques overall (i.e. considering complete datasets with normal and congestion traffic conditions) and under congestion conditions. Figure 20 compares the prediction accuracy of the implemented neural networks during the formation and dissolution of congestions. Similarly to Figure 15, Figure 20 reports the MAE of the prediction for the three datasets. Table 13, Table 14, and Table 15 complement Figure 20 with the MAE, MAPE, and RMSE metrics. We highlight in boldface the lowest error achieved for each metric. Traffic congestion occurs when the road capacity is exceeded, which corresponds to a level of service F. A congestion is then formed when the traffic conditions at a given time step correspond to a level of service other than F and at the next time step to a level of service F. Similarly, a congestion is dissolved when the traffic conditions at a given time step correspond to a level of service F and at the next time step to a level of service other than F.
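Under this definition, formation and dissolution events can be extracted from a sequence of per-time-step LoS labels as follows; this is a sketch, and `congestion_events` is an illustrative name:

```python
import numpy as np

def congestion_events(los_f):
    """Given a boolean sequence marking LoS F at each time step, return the
    indices where a congestion forms (non-F -> F) or dissolves (F -> non-F)."""
    los_f = np.asarray(los_f, dtype=bool)
    change = los_f[1:] != los_f[:-1]              # transitions between steps
    formed = np.nonzero(change & los_f[1:])[0] + 1
    dissolved = np.nonzero(change & ~los_f[1:])[0] + 1
    return formed, dissolved
```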
There are 776, 1360, and 504 events matching the formation or dissolution of congestion situations in the I5-N-3, I5-S-3, and I5-S-4 test sets, respectively. The comparison of Figure 15 and Figure 20 shows that in general the MAE increases when predicting traffic congestion conditions compared to when predicting general traffic conditions. This demonstrates that predicting traffic congestion is harder than predicting the traffic under other levels of service. This is due to the fact that the level of service F (i.e. congestion) is less frequent than the other levels of service (e.g. Figure 18) and therefore neural networks are trained with fewer congestion examples. The neural networks then tend to underestimate the large variations of the traffic variables caused by the traffic congestion, which consequently increases the error metrics. Similar trends are observed for the RMSE metric. The MAPE also increases for all models when predicting the traffic speed under congestion conditions. However, it decreases when predicting the traffic flow. This is due to the fact that the MAPE metric is more affected by errors on low values of the ground truth. Such low values are less frequent under congestion conditions than when working with the complete datasets, which explains the lower MAPE experienced under congestion conditions. The MAPE increase when predicting the traffic speed is due to the drop of the speed under congestion conditions. Low speed values are hence experienced and the MAPE metric increases. Figure 19 showed that the CNN model can better predict large variations of the traffic speed and flow than the LSTM model. This translates into a better performance in Figure 20, Table 13, Table 14, and Table 15. It is interesting to observe that the LSTM model was one of the best models to predict traffic under general traffic conditions (i.e. using the complete datasets, see Figure 15).
However, it is one of the worst models when predicting traffic under congestion conditions (Figure 20). This clearly shows that being able to predict traffic in general does not necessarily imply that a technique can accurately predict congestion conditions. Both cases then require a separate analysis like the one conducted in this study. The comparison of LSTM and CNN also shows that exploiting the spatiotemporal evolution of the traffic is more useful to predict traffic congestion than exploiting only its temporal evolution. (Similar trends are observed when comparing LSTM with CNN+LSTM.) The results in Figure 20, Table 13, Table 14, and Table 15 show that the error recurrent models (in particular, eRCNN and eRCNN+LSTM) are also the best models to predict the traffic speed and flow when traffic congestion emerges. This demonstrates that the error recurrent models can better predict the large variations of traffic speed and flow experienced when congestion is being formed or dissolved. This is actually observed in Figure 21, which depicts the error metrics for the error recurrent models and the CNN model as a function of the difference between the ground truth value we want to predict and the previous ground truth value. The figure clearly shows that the error recurrent models achieve low prediction errors (in particular, eRCNN and eRCNN+LSTM) and outperform the CNN (which, in turn, outperformed the LSTM model, see Figure 19). The difference between the error recurrent models and the CNN model tends to increase with the difference between two consecutive values of the ground truth. The advantage of the error recurrent models for traffic prediction under congestion conditions is also demonstrated in Figure 22. The figure represents the MAE of the prediction for the eRCNN error recurrent model and the LSTM model. The LSTM model is the non-error recurrent model that performs best when predicting the traffic using the complete datasets. Figure 22 represents the MAE for each point in the test set of the I5-N-3 dataset.
The representation is done using the speed-flow relationship of the fundamental diagram of traffic flow. This figure shows how the error of the prediction changes as the traffic evolves from free flow conditions (upper left corner) to congestion conditions (lower left corner). The figure shows that in general the prediction error increases as the traffic approaches congestion. However, this increase is higher for the LSTM model than for the eRCNN one (for both the traffic speed and flow). Similar patterns have been observed for the I5-S-3 and I5-S-4 datasets.
We have also conducted a robustness analysis to study how the traffic congestion prediction of each model is affected by outliers. We followed the guidelines in [69] and trained the implemented neural network models with training sets containing 5%, 10%, 15%, and 20% of outliers. These outliers are sampled from a Gaussian distribution with the same mean as the training set and a standard deviation ten times higher than the original standard deviation of the training set. Then, we measure the error metrics when predicting congestion over the test set. Figure 23 illustrates the evolution of the MAE as a function of the percentage of outliers in the training set. Similar trends have been observed for the MAPE and RMSE. The results of the robustness analysis are depicted for the CNN, LSTM, CNN+LSTM, eRCNN, eRLSTM, and eRCNN+LSTM models. Figure 23 shows that the models least affected by the outliers are the error recurrent models. This is more evident for traffic speed than for traffic flow. Figure 23 also shows that the eRCNN model is the least affected by outliers among the error recurrent models. Among the other models, the CNN+LSTM model is the least affected by the outliers. It is interesting to note that the CNN+LSTM model always outperforms the CNN and LSTM models when the percentage of outliers in the training set increases. This is true even when the CNN+LSTM model achieves a worse prediction than the CNN or LSTM models when trained with no outliers (e.g., traffic speed for the I5-S-3 dataset). It can also be observed in Figure 23 that the differences between the errors of the different models tend to increase as there are more outliers in the training set.
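The outlier-injection protocol described above can be sketched as follows; the function name and seed are illustrative assumptions:

```python
import numpy as np

def inject_outliers(train, fraction, rng=None):
    """Replace `fraction` of the training values with samples drawn from a
    Gaussian with the training-set mean and 10x its standard deviation."""
    if rng is None:
        rng = np.random.default_rng(0)  # arbitrary seed
    out = train.copy()
    n = int(fraction * len(out))
    idx = rng.choice(len(out), size=n, replace=False)  # positions to corrupt
    out[idx] = rng.normal(train.mean(), 10.0 * train.std(), size=n)
    return out
```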
The results reported so far show that the eRCNN model achieves the highest accuracy when predicting the traffic speed and flow under congestion conditions. The eRCNN+LSTM model obtains better prediction accuracy than the eRLSTM model, which confirms that exploiting the spatiotemporal evolution of the traffic improves the traffic congestion prediction accuracy. Taking into account the results reported in this section and Section V, it can be concluded that the eRCNN model is the one that achieves the best traffic prediction accuracy. It is also important to emphasize that the eRCNN model outperforms the CNN model with a much lower number of convolutional layers, and hence a smaller computational cost for prediction.

VII. CONCLUSION
This study has presented the most comprehensive comparison of deep learning-based traffic prediction techniques reported to date. The study compares the state-of-the-art traffic prediction techniques and proposes new models to predict the traffic speed and flow. The comparison is conducted using the same datasets for a common benchmark. The prediction accuracy is compared under general traffic conditions (i.e. considering complete traffic datasets) and for the emergence and dissolution of traffic congestion. This is an important contribution since we have demonstrated that accurately predicting the traffic overall does not necessarily imply that a deep learning technique can accurately predict traffic under congestion conditions, when the traffic variables experience larger variations over time. This was for example the case of the LSTM model, which accurately predicted the traffic in general but not under congestion conditions. Traffic congestion might not be that frequent, but it is important to predict it accurately for an effective and proactive management of the road traffic. The study has also demonstrated that error recurrent models outperform other deep learning prediction techniques when predicting the traffic speed and flow in general and under congestion conditions. This is actually achieved with shallower neural networks than when the error feedback is not used, which reduces the computational cost for prediction. Our analysis has also demonstrated that exploiting the spatiotemporal evolution of the traffic (and not just the temporal one) provides better prediction accuracy overall and in particular under congestion conditions. Finally, our evaluation has shown that the error recurrent model eRCNN is the deep learning technique that achieves the best traffic prediction accuracy.