City-Wide Traffic Congestion Prediction Based on CNN, LSTM and Transpose CNN

Traffic congestion is a significant problem faced by large and growing cities that hurts the economy, commuters, and the environment. Forecasting the congestion level of a road network in a timely manner can prevent its formation and increase the efficiency and capacity of the road network. However, despite its importance, traffic congestion prediction is not a hot topic among researchers and traffic engineers, owing to the lack of high-quality city-wide traffic data and computationally efficient algorithms for traffic prediction. In this paper, we propose (i) an efficient and inexpensive city-wide data acquisition scheme that takes snapshots of the traffic congestion map from an open-source online web service, Seoul Transportation Operation and Information Service (TOPIS), and (ii) a hybrid neural network architecture formed by combining a Convolutional Neural Network, Long Short-Term Memory, and a Transpose Convolutional Neural Network to extract the spatial and temporal information from the input images and predict the network-wide congestion level. Our experiments show that the proposed model can efficiently and effectively learn both spatial and temporal relationships for traffic congestion prediction. Our model outperforms two other deep neural networks (Auto-encoder and ConvLSTM) in terms of computational efficiency and prediction performance.


I. INTRODUCTION
With economic growth, rapid urbanization, and a growing preference for private travel [1], the traffic congestion level of most large and growing cities around the world has increased drastically, which directly affects the growth, development, and environment of those cities. It also increases both commuting time and the tendency toward road rage, which raises the frequency of road accidents [2]-[4]. Thus, the study of traffic management is of high importance among researchers. High congestion can be alleviated either by expanding the transportation infrastructure, which is expensive, or by deploying feasible traffic strategies, such as congestion pattern analysis or short-term traffic information prediction, which can be applied efficiently to the existing road network at a fraction of the cost. Compared to pattern analysis, which identifies the road networks with recurring congestion [5], [6], accurately predicting short-term traffic information [7]-[10], such as traffic speed, traffic flow volume, and congestion level, is more informative to commuters and traffic management agencies. Among these, the traffic congestion level gives the status of the road network (Jam, Slow, or Free), which is the more desirable parameter for drivers, who can make better route choices that circumvent congested roads, and for traffic managers, who can operate efficiently by responding systematically to changes in the supply and demand of the transportation network.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.
Early forecasting models focused on predicting traffic parameters such as speed, volume, and traffic flow on a single road, a group of roads, or small road networks, mainly because of limited data availability. Because they cover only part of the road network, these works were of limited use to both commuters and traffic agencies. Existing work uses data from fixed sensors (road sensors, inductive loops, traffic cameras, etc.) installed on every road or from a network of vehicles (VANET, floating cars) operating on each route. Such data are difficult to collect because installation, operation, and maintenance are expensive, and they are difficult to obtain from third parties because access requires special permission. Recently, web services such as Google Traffic [11], Bing Map [12], Seoul Transportation Operation and Information Service (TOPIS) [13], and Baidu Map [14] have started to publicly provide accurate city-wide real-time traffic information, such as the congestion level and average speed of each road segment. Although these web services are public, easily accessible, and provide traffic information for most cities in the world, only a handful of studies are based on them. One reason is the curse of dimensionality: the prediction problem is a time-series analysis that takes multiple inputs, so processing multiple traffic map images is costly.
The Long Short-Term Memory (LSTM) recurrent neural network is known for its capability of mining and remembering temporal relationships over a long sequence of historical data and has been very successful in a variety of areas such as recognition [15], translation, and time-series prediction [16], [17]. However, LSTM can be challenging to use and is slow when the input and output are sequences of high-resolution two-dimensional data. Similarly, the Convolutional Neural Network (CNN) has gained its fame in spatial learning by extracting a low-resolution feature image from a high-resolution input image and has been successful in image understanding, object detection, and segmentation [18], [19]. Unlike LSTM, CNN has no difficulty processing high-resolution multidimensional data because of its local connectivity, weight sharing, and pooling. The image segmentation work in [18] uses a convolutional encoder-decoder architecture, in which the convolutional encoder encodes the input image into a low-resolution image consisting of important spatial features and the convolutional decoder decodes the latent representation back to its original size; in the process, the network learns to segment the image into various groups.
Inspired by the successful application of the convolutional autoencoder and LSTM mentioned above, in this paper we present an approach that can learn both spatial and temporal relationships between the sequences of historical image data for traffic congestion prediction. In the proposed architecture, we add an LSTM network between the convolutional encoder and the convolutional decoder. The convolutional encoder first converts the sequence of input images into a sequence of low-resolution latent states. The LSTM network then learns the time-series representation from these sequences, and finally, the convolutional decoder converts the latent state back to the original resolution. The contributions of the paper can be summarized as follows:
• We develop a new prediction model, PredNet. The model exploits the advantages of various deep learning architectures, including the convolutional neural network, LSTM network, and transposed convolutional neural network, to learn the spatiotemporal sequences of historical data for efficient end-to-end traffic congestion prediction on the transportation network.
• The proposed model can be generalized to large-scale traffic prediction problems while remaining trainable on resource-constrained devices because of its use of convolutional, downsampling, and upsampling layers.
• Our extensive experiments on the Seoul city transportation network demonstrate the efficiency and effectiveness of the proposed approach.
In this section, we have described the background and motivation of the study and briefly highlighted the inspiration for the proposed algorithm. The rest of the paper is structured as follows. Section II reviews the literature on traffic prediction. Section III presents the methodology of traffic prediction, which includes the problem statement, data source, preprocessing, and database design, and explains the architecture of our hybrid deep neural network, which learns both spatial and temporal features for traffic congestion prediction. Section IV presents the data description, the metrics used to test the effectiveness of the proposed model, a detailed explanation of the model construction, and a performance comparison of the proposed model with two state-of-the-art prediction models, ConvLSTM and Autoencoder. Finally, Section V concludes our work and outlines future directions of this study.

II. RELATED WORK
In the past, researchers used a data-driven approach focused mainly on developing statistical and mathematical models to analyze the time-series relations in traffic data, also referred to as the parametric approach. Early work relied on the assumptions of linearity and stationarity to capture future trends, as in the historical average model [20], smoothing techniques [21], and the error component model [22]. Later, a typical parametric time-series model, the autoregressive integrated moving average (ARIMA) model [23], was introduced to identify patterns by decomposing long-term trends and seasonal patterns. However, it tends to concentrate on the mean value of the time series and is unable to predict the extremes [24]. The family of ARIMA-based models, such as seasonal ARIMA models [25] and the Kalman filter model [26], uses a vast historical database for model development and is also very sensitive to the traffic data.
Due to the limitations of the parametric approach, researchers started to pay attention to nonparametric models such as k-nearest neighbor (KNN), support vector machine (SVM), Bayesian network (BN), and artificial neural network (ANN). Unlike the parametric approach, a nonparametric model relies on the training data to determine the model structure and the number of parameters. A KNN model [27] searches the historical database for a close neighbor matching the current data to predict the traffic flow. The SVM model [28], which is based on the structural risk minimization principle, has a unique advantage for small samples and high-dimensional nonlinear data. A family of SVM-based models, such as seasonal SVM, least-square SVM [29], online-SVM [30], and wavelet-SVM [31], was proposed for traffic prediction; these models can address the curse of dimensionality, overfitting, and local minima. The BN model [32], [33] statistically accounts for the causal relationships between random variables and can address the problem of incomplete data through a message-passing mechanism.
All the aforementioned works require significant prior domain knowledge and feature engineering to achieve good performance. The ANN model in [34] has an advantage over the previous algorithms because it can work with multi-dimensional data without any feature engineering and can capture the non-linear relationship between input and output features to provide generalized solutions. However, because of its shallow architecture, the accuracy of the model was not satisfactory, so researchers shifted toward deep learning architectures. Long short-term memory (LSTM), a special recurrent neural network (RNN), can learn the temporal relations in a time-series sequence because of its built-in memory cells. It has shown remarkable results for traffic speed [35], traffic flow [36], [37], and congestion prediction [38], [39]. A deep autoencoder neural network [10] uses the temporal relationship between the input image sequences to predict short-term traffic congestion. All these attempts consider only the temporal relationship between the image sequences.
To make use of both spatial and temporal features of traffic data, researchers started to build hybrid models, i.e., models that combine two or more independent models into one. In [40], the KNN-LSTM model mines spatial features by selecting the most related neighbors and temporal variability to predict the flow. In [41], an Autoencoder and an LSTM were combined: the internal relationship of the traffic flow was obtained by the Autoencoder, and the LSTM network predicted the complex linear traffic flow. In [8], traffic data were converted into a two-dimensional time-space matrix (spatiotemporal traffic data), and a convolutional neural network model was used to predict large-scale network-wide traffic speed with high accuracy. In [42], the researchers proposed a novel model called LC-RNN to predict road traffic speed, which consists of a look-up convolutional layer and recurrent layers. The look-up operation selects all the adjacent roads, the convolutional operation extracts the spatial correlations, and the recurrent layers learn the long-term temporal patterns. The researchers in [43] used the ConvLSTM model, which is like the LSTM model except that the internal matrix multiplications are replaced with convolutional operations. The convolutional operation extracts the spatial information, and the LSTM learns the temporal information of the traffic flow. Furthermore, in [44], the researchers proposed a deep learning model called SCRN, which is a combination of CNN and LSTM. First, the CNN extracts the spatial features of the traffic network for all the periods, and then the LSTM network learns the time-series temporal relations to predict the traffic speed of 278 road links.
Hybrid neural networks achieve better results than simple neural networks and traditional methods because they can mine both spatial and temporal features from the traffic data. Although hybrid models show encouraging results in the domain of traffic prediction, there are very few works on traffic congestion prediction based on deep neural networks, mainly because of the unavailability of high-quality city-wide congestion data. Some recent works on congestion prediction are as follows. In [10], the researchers trained an Autoencoder model with unnaturally compressed snapshots of traffic images from an open-source website to predict traffic congestion. The predicted images are not visually intuitive, as a lot of road information is lost during image compression. In [38], the researchers collected road-based congestion information for a few roads from an online source to learn and predict the traffic congestion on those roads using an LSTM network. In [39], the researchers used bus driving time data during peak periods to train an LSTM network to predict the traffic congestion time on six road segments. In [45], the researchers proposed a novel model called PCNN, which uses vehicle passage records from surveillance cameras on roads to model periodic traffic congestion patterns and predict short-term traffic congestion. In [46], the researchers used congestion information derived from GPS data on each link to learn and predict the evolution of traffic congestion. In [47], the researchers applied machine learning techniques (logistic regression, random forest, and neural networks) to vehicle trajectory data available through connected vehicle technology to identify and predict congestion formation. That study has prediction horizons of 10 and 20 seconds, intended for warning drivers of upcoming traffic conditions.
All these congestion studies either predict congestion on a single road or a few major roads in the city or, as in [10], fail to predict the traffic image at fine granularity. So, in this paper, we use city-wide transportation image data from the TOPIS website to predict city-wide short- and long-term (prediction horizons of 10, 30, and 60 minutes) traffic congestion with fine granularity, based on a hybrid structure containing CNN, LSTM, and Transposed CNN.

III. METHODOLOGY
In this section, we first define the problem statement for time-series traffic congestion prediction, then discuss data acquisition from an open-source online platform, and finally describe the components of the proposed Prediction Network (PredNet) architecture.

A. PROBLEM STATEMENT
Based on past observations of traffic congestion data, the proposed deep neural network is designed to predict the short-term congestion level. Let $N \in \{1, 2, \dots, n\}$ be the chronological order of the $n$ images in our database, and let $X_i$ denote the $i$-th past observation of the time series over a period of $t$. The primary objective of this study is to develop a prediction model $f$ that uses the previous observation $X_i$ to predict the congestion level for a prediction horizon of $k$, i.e., $Y_i = x_{i+t+k}$. The model $f$ can be defined as in (1)
\[ Y_i = f(X_i; \theta) \tag{1} \]
where $\theta$ denotes the model parameters. We divide our database $N$ into three parts: train, validation, and test. We generate $m$ sets of historical time-series data from the training dataset such that the corresponding prediction targets also lie in the training dataset. Hence, we can use supervised learning to train our neural network.
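The sample construction described above can be sketched as a sliding window over the chronologically ordered frames. This is a minimal illustration rather than the authors' code: the helper name `make_windows`, the 0-based indexing, and the convention that each input holds t consecutive frames with the target k steps after the last of them are all assumptions.

```python
import numpy as np

def make_windows(frames, t, k):
    """Build (X, Y) pairs: t consecutive past frames -> the frame k steps after the last one.

    frames: array of shape (n, ...) in chronological order.
    Assumed convention (0-based): X[i] = frames[i : i + t], Y[i] = frames[i + t - 1 + k].
    """
    frames = np.asarray(frames)
    n = len(frames)
    m = n - t - k + 1  # number of samples whose target still lies inside the data
    if m <= 0:
        raise ValueError("not enough frames for the requested window and horizon")
    X = np.stack([frames[i:i + t] for i in range(m)])
    Y = np.stack([frames[i + t - 1 + k] for i in range(m)])
    return X, Y

# Toy example: 20 "frames" that are just their own index, 12 past frames,
# and a horizon of 2 steps (10 minutes at a 5-minute sampling interval).
frames = np.arange(20)
X, Y = make_windows(frames, t=12, k=2)
```

Restricting the windows to one split (train, validation, or test) keeps every target inside that split, which is what makes supervised training possible. In practice the windows would presumably also be built per day so that no sample spans the overnight gap between recordings; that detail is an assumption, not something the text states.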
Since a digital image is a collection of pixels that can be represented as a matrix (2D for grayscale and 3D for a color image), the road network with its congestion information can be extracted easily by mathematical operations. Each congestion level in the image data has a unique color composition, with upper and lower boundary values of ([0, 28,160 [20,160,50], [110, 240, 120]) for red, yellow, and green, respectively, in BGR format. An image masking operation is performed to extract the congestion level from the image. First, a mask covering only the road network is generated by comparing the image data with the boundary values of each congestion color; then a bitwise AND operation is performed between the mask image and the raw image to generate an image containing only the road network. The resulting image, shown in Figure 1(b), uses red, yellow, and green to represent the congestion level, with black as the background. The matrix representation of the image is shown in (3); each element has three values representing the congestion level in [B, G, R] format.
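The masking step can be illustrated with plain NumPy. The sketch below is hedged in two ways: the text does not name the library used (OpenCV's `cv2.inRange` and `cv2.bitwise_and` would be the usual equivalents), and only the last pair of boundary values is legible in the text, so only that pair is used here, under the assumption that it is the (lower, upper) BGR range for green. The helper names are hypothetical.

```python
import numpy as np

# Assumption: the one legible pair of bounds is the (lower, upper) BGR range for green.
GREEN_RANGE = (np.array([20, 160, 50]), np.array([110, 240, 120]))

def in_range(image_bgr, lower, upper):
    """Boolean mask of pixels whose B, G, and R values all lie within [lower, upper]."""
    return np.all((image_bgr >= lower) & (image_bgr <= upper), axis=-1)

def extract_roads(image_bgr, color_ranges):
    """Keep pixels matching any congestion color; set everything else to black."""
    mask = np.zeros(image_bgr.shape[:2], dtype=bool)
    for lower, upper in color_ranges:
        mask |= in_range(image_bgr, lower, upper)
    return np.where(mask[..., None], image_bgr, 0).astype(image_bgr.dtype)

# Tiny 1x3 BGR image: a green road pixel, a white background pixel, a gray label pixel.
snapshot = np.array([[[60, 200, 80], [255, 255, 255], [128, 128, 128]]], dtype=np.uint8)
roads = extract_roads(snapshot, [GREEN_RANGE])
```

Passing one range per congestion color to `extract_roads` would reproduce the three-color road image with a black background.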
In this section, we discuss the architecture for predicting short-term city-wide traffic congestion using sequences of the historical image dataset. A schematic of our framework is shown in Figure 2. Our proposed model is a combination of three networks: 1) the Feature Extraction Network (FEN), which performs convolutional and pooling operations and converts the image into a lower-dimensional feature space; 2) the Recurrent Network, which consists of stacked LSTM layers responsible for learning time-series information from the output of the previous network; and 3) the Reconstruction Network, which performs convolutional and transposed convolutional operations on the output of the recurrent layers to produce the predicted image. The architecture is designed to learn the spatial-temporal relationship of traffic congestion levels among roads in the transportation network. As stated in Section III-A, our primary objective is to predict the congestion level of the transportation network from the historical time-series sequence of images, which could be performed by the LSTM model alone. The need for the CNN and transposed convolutional networks arises for two reasons: i) LSTM, being a recurrent neural network, computes slowly when the number of input parameters is large, and ii) LSTM considers only temporal features. In the following sections, we discuss each network in detail.

1) FEATURE EXTRACTION NETWORK
The Convolutional Neural Network is a special type of Deep Neural Network (DNN) inspired by Hubel and Wiesel's work in neuroscience [48]. Since it was first proposed for handwritten zip code recognition [49], numerous variations of the CNN architecture have been proposed. However, the defining aspects of CNN remain the same, i.e., local connectivity and weight sharing. CNN is superior at representing the features of an input image compared to other deep learning architectures such as the auto-encoder and multilayer perceptron, as it handles the spatial correlations between nearby pixels.
A CNN consists of two types of layers, namely convolutional and pooling layers. The main objective of the convolutional layer is to learn the feature representation of the input image. Its neurons are locally connected, i.e., each neuron of the output layer receives input only from a small local group of neurons in the previous layer. The layer is composed of several convolution kernels, which are convolved with the image or the previous layer's output to learn different feature representations. Mathematically, the $f$-th feature map of the $l$-th convolutional layer, $y^l_f$, is obtained by first convolving the input image or the previous layer's output with the convolutional filter and then applying an element-wise non-linear activation, as in (4), where $y^{l-1}_k$ is the $k$-th feature map of the $(l-1)$-th layer, $W^l_{kf}$ is the kernel weight at position $k$ connected to the $f$-th feature map of the $l$-th layer, $b^l_f$ is the bias of the $f$-th filter of the $l$-th layer, $f^l$ is the number of filters in the $l$-th layer, and $\sigma$ represents the element-wise non-linear activation function.
The function of the pooling layer is to achieve shift invariance by progressively reducing the spatial size of the feature map while retaining essential information. Pooling layers decrease the complexity of the network, leading to a faster convergence rate. Each pooled feature map corresponds to one feature map from the previous layer. The common pooling operations are max-pooling [50] and subsampling [51]. These pooling operations have no trainable parameters and compute one value from each (m, n) rectangular region of the feature map, which decreases the resolution by factors of m and n along the two directions. Max-pooling selects the superior invariant features from the patch, whereas subsampling takes the average over the patch and passes it through a nonlinearity. In our experiment, we replace the traditional pooling operation with a standard convolution operation with a stride of 2 × 2. This approach is also mentioned in [52]: the strided convolution increases the expressiveness of the model, as it has learnable parameters, while reducing the feature map resolution.
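The difference between fixed pooling and a learnable strided convolution can be made concrete on a single-channel feature map. The NumPy sketch below is illustrative only: it uses a 2 × 2 kernel with stride 2 and no padding, omits the activation, and, as deep learning frameworks do, computes a cross-correlation. The function names are hypothetical.

```python
import numpy as np

def _patches2x2(feature_map):
    """Split an (h, w) map into non-overlapping 2x2 patches: shape (h/2, w/2, 2, 2)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)

def strided_conv2x2(feature_map, kernel, bias=0.0):
    """2x2 convolution with stride 2: a weighted sum over each patch (learnable weights)."""
    return np.einsum("ijab,ab->ij", _patches2x2(feature_map), kernel) + bias

def max_pool2x2(feature_map):
    """2x2 max-pooling with stride 2: no trainable parameters."""
    return _patches2x2(feature_map).max(axis=(2, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
avg_kernel = np.full((2, 2), 0.25)  # one particular setting of the learnable weights
down_conv = strided_conv2x2(fmap, avg_kernel)
down_max = max_pool2x2(fmap)
```

Both operations halve the resolution from 4 × 4 to 2 × 2. With the weights fixed at 0.25 the strided convolution reduces to average pooling, which shows that it is a strict generalization of that operation: training is free to pick any other weights.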
The Feature Extraction Network is shown in Figure 2, section A. This architecture is formed by stacking convolutional and pooling layers, with one flatten layer at the end. The convolution operation is performed by a kernel of 3 × 3 over the input image of 192 × 448 × 3 using unit stride and zero padding (to conserve the dimensions of the input). The $(i, j)$ location of the $f$-th feature map of the $l$-th convolutional layer, $y^l_f(i, j)$, is obtained by first convolving the output of the previous $(l-1)$-th layer with a convolution filter of size $(m, m)$ and then applying an element-wise non-linear activation; this is the detailed version of Equation 4 and is given as in (5), where $(a, b)$ is the kernel location. The convolution layer is followed by the pooling layer. The $(i, j)$ location of the $f$-th feature map of the $(l+1)$-th pooling layer, $y^{l+1}_f(i, j)$, is obtained by first convolving the output of the previous $l$-th layer with a convolution filter of size $(2, 2)$ and then applying an element-wise non-linear activation, as given in (6), where $y^l_k$ is the $k$-th feature map of the $l$-th layer, $W^{l+1}_{kf}$ is the kernel weight at position $k$ connected to the $f$-th feature map of the $(l+1)$-th layer, $b^{l+1}_f$ is the bias of the $f$-th filter of the $(l+1)$-th layer, $f^{l+1}$ is the number of filters in the $(l+1)$-th layer, and $\sigma$ represents the element-wise non-linear activation function.
The output of the feature extraction network is connected to the recurrent network, which consists of stacked LSTM layers that take only vector input. Hence, the output of the CNN architecture is converted into a vector by the flattening layer. Let $L$ be the layer before the flattening layer, with $f^L$ feature maps of resolution $(x, y)$; then the output of layer $L+1$, $y^{L+1}$, is given as in (7)
\[ y^{L+1} = \mathrm{flatten}\left(y^L_1, y^L_2, \dots, y^L_{f^L}\right) \tag{7} \]
where $y^L_1, y^L_2, \dots, y^L_{f^L}$ are the feature maps of the $L$-th layer. Each feature map contains $x \cdot y$ elements. Hence, (7) can be rewritten as given in (8)
\[ y^{L+1} = \left[\, y^L_{1,1}, \dots, y^L_{1,e},\; y^L_{2,1}, \dots, y^L_{2,e},\; \dots,\; y^L_{f^L,1}, \dots, y^L_{f^L,e} \,\right] \tag{8} \]
where $e$ is the number of elements in each feature map. Equation 8 shows the vector representation of the $L$-th layer and represents the high-level features extracted from the input image.
The proposed architecture predicts traffic congestion from a sequence of historical data, but the FEN extracts high-level features from only one image at a time. Therefore, the entire FEN has to be encapsulated in a time-distributed layer, i.e., the FEN works in a loop to extract the features from all the time-series input images before they go to the Recurrent Network. As stated in Section III-A, t past images are used for prediction, so the output of the FEN has t vector representations of the input images. In Figure 2, section A, there is a dotted block with blue and white colors: blue represents the feature vector of the first input image, and white represents the feature vectors of the other input images in the sequence. Each input with a resolution of 192 × 448 × 3 is compressed to a vector of 672 elements.
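The compression from a 192 × 448 × 3 image to a 672-element vector can be checked with a small shape calculation. The two parameters below are assumptions chosen to match the stated result rather than values quoted from the text: five stride-2 downsampling stages (one per layer C1 to C5 named in the implementation section) and 8 filters in the last layer before flattening.

```python
def fen_output_size(height, width, num_downsamples, final_filters):
    """Spatial size and flattened length after repeated stride-2 downsampling.

    The 3x3 convolutions with unit stride and zero padding keep the resolution,
    so only the stride-2 layers change the height and width.
    """
    for _ in range(num_downsamples):
        height, width = height // 2, width // 2  # each stride-2 layer halves both dimensions
    return height, width, height * width * final_filters

# Assumed: 5 downsampling stages and 8 feature maps in the last layer.
h, w, flat = fen_output_size(192, 448, num_downsamples=5, final_filters=8)
```

Under these assumptions the final feature maps are 6 × 14, and 6 · 14 · 8 = 672, which agrees with the vector length given above.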
2) RECURRENT NETWORK
Figure 2, section B, shows the recurrent network, which is the primary prediction module in the architecture; it is made up of a stack of four LSTM layers. The LSTM network was first introduced in [53] and addresses the vanishing and exploding gradient problem seen when training conventional recurrent neural networks (RNNs) [54] with gradient-based back-propagation through time.
An LSTM unit contains a cell state, which is the memory of the LSTM, and three gates (input gate, forget gate, and output gate) that protect and control the cell state. The LSTM unit performs multiple operations at each gate to compute its output, called the hidden state. At time $t$, the hidden state $h_t$ is computed as follows. At the input gate, new information is added to the cell state in two parts: first, a sigmoid layer decides which input values are to be updated ($i_t$), as in (9), and then a tanh layer creates a vector of new candidate values ($\tilde{c}_t$), as in (10). At the forget gate, the LSTM decides what information to forget from the cell state, computed as in (11). Based on the updates at the input gate and forget gate, the old cell state at $t-1$, $c_{t-1}$, is updated to $c_t$, as in (12). At the output gate, the LSTM decides which parts of the cell state go to the output, as in (13). The final output of the LSTM unit, $h_t$, is a function of the cell state and the output gate and is computed as in (14). Here, $\sigma(z)$ and $\tanh(z)$ are the sigmoid and hyperbolic tangent activation functions, defined as
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \tag{15} \]
\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{16} \]
where $x_t$ is the input at time $t$, and $W_i$, $W_f$, and $W_o$ represent the weight matrices of the input gate, forget gate, and output gate.
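The gate operations described above correspond to the widely used LSTM update. The NumPy sketch below shows one common formulation and is not necessarily the exact form of (9)-(14): it assumes that the previous hidden state and the current input are concatenated before each gate, and the candidate weight matrix `W_c` and the bias terms `b_*` are assumptions that the text does not name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step in the concatenated-input formulation (assumed)."""
    z = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])    # input gate
    c_hat = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])    # forget gate
    c_t = f_t * c_prev + i_t * c_hat                    # updated cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])    # output gate
    h_t = o_t * np.tanh(c_t)                            # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 672, 16  # 672 matches the FEN vector; 16 is an arbitrary choice
params = {}
for gate in "icfo":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(12, input_dim)):  # a sequence of 12 feature vectors
    h, c = lstm_step(x_t, h, c, params)
```

Because the cell state is updated additively (`f_t * c_prev + i_t * c_hat`), information and gradients can persist across many time steps, which is the property the recurrent network relies on.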

3) RECONSTRUCTION NETWORK
The output from the recurrent layer is passed through the reconstruction network, where the compressed representation of the spatiotemporal learned data is enlarged to the original resolution of the input data. The reconstruction network consists of a series of convolutional and transposed convolutional operations, as shown in Figure 2(a), section C. In the transposed convolutional operations, we set the stride equal to the kernel size to prevent artifacts such as checkerboard patterns at the final layer, which are caused by overlap in the kernels. To further support decoding, skip connections from the FEN are added, as shown in Figure 2(a). Mathematically, reconstruction is performed similarly to the operations presented for the feature extraction network. The detailed implementation of the network is explained in the next section.
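The reason for setting the stride equal to the kernel size can be seen by counting how many kernel placements contribute to each output position of a transposed convolution. The one-dimensional sketch below is illustrative; the function name and the example sizes are hypothetical.

```python
import numpy as np

def overlap_count(input_len, kernel, stride):
    """Number of kernel placements that write to each output position (1-D, no padding)."""
    out_len = (input_len - 1) * stride + kernel
    counts = np.zeros(out_len, dtype=int)
    for i in range(input_len):
        # Input element i spreads its value over `kernel` consecutive output positions.
        counts[i * stride : i * stride + kernel] += 1
    return counts

even = overlap_count(input_len=4, kernel=2, stride=2)    # kernel size equals stride
uneven = overlap_count(input_len=4, kernel=3, stride=2)  # kernel size not divisible by stride
```

With equal kernel size and stride, every output position receives exactly one contribution. With a kernel of 3 and a stride of 2, the interior positions alternate between one and two contributions, and this uneven overlap is what shows up as a checkerboard pattern in two dimensions.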

IV. EXPERIMENT AND RESULT ANALYSIS

A. DATA SOURCE
In this research, we choose Seoul, the capital city of South Korea and among the largest cities in the world, where traffic congestion is very high, especially in the central region, as shown in Figure 1(a). The figure is an example of a raw snapshot of the road network of central Seoul from the TOPIS website. Each snapshot is 192 × 448 pixels in size, covering an area of about 7.5 km × 17 km (scale 1 cm = 1.3 km). Such online web services use multiple sources of data, such as inductive loops and crowdsourcing, to provide accurate real-time data for the entire city.
In this paper, we focus on traffic congestion level prediction from 07:00 to 12:00 on weekdays. We collect snapshots of traffic data from 19th September 2019 to 31st December 2019, a total of 104 days, at an interval of 5 minutes (60 samples per day). Out of the 104 days, there is partial or no data for 26 days, so we remove the data for all of those days when generating the database. Samples from 19th September to 25th November are used as the training set, samples from 25th November to 30th November are used as the validation set, and samples from 1st December to 31st December are used to test the trained model. Data preprocessing and database generation are explained in Section III-B.
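The dataset figures quoted above can be cross-checked with a few lines of date arithmetic. The sketch assumes that the daily window is half-open (a snapshot at 07:00 but none at 12:00), which is the reading that yields the stated 60 samples per day; the total snapshot count is derived here and is not a number given in the text.

```python
from datetime import date, datetime, timedelta

first_day, last_day = date(2019, 9, 19), date(2019, 12, 31)
total_days = (last_day - first_day).days + 1  # inclusive day count

# Snapshot times within one day: from 07:00 up to, but not including, 12:00, every 5 minutes.
start = datetime(2019, 9, 19, 7, 0)
end = datetime(2019, 9, 19, 12, 0)
step = timedelta(minutes=5)
times = []
current = start
while current < end:
    times.append(current)
    current += step
samples_per_day = len(times)

usable_days = total_days - 26  # days with partial or no data are removed
total_snapshots = usable_days * samples_per_day
```

This gives 104 days and 60 samples per day, as stated, and therefore 78 usable days. The number of training samples actually available is smaller than the snapshot count, since each sample consumes a window of past images plus the prediction horizon.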

B. PERFORMANCE COMPARISON AND METRICS
To verify the effectiveness and superiority of the proposed architecture, PredNet, two state-of-the-art deep neural networks, namely ConvLSTM [43] and Autoencoder [10], are selected for comparison. ConvLSTM is like LSTM, but the internal matrix multiplications are replaced with convolution operations, so it can mine both spatial and temporal information from the input image sequence. The deep Autoencoder is a neural network with an encoding layer, which aims to learn a representation (encoding) of a set of data, and a decoding layer, which learns to generate, from the reduced encoding, an output as close as possible to the original input. We perform traffic congestion prediction for three time horizons (10, 30, and 60 minutes) for the comparison and analysis of the proposed PredNet. Because the model predicts an image in which the congestion level of each road is represented by a color, pixel-wise classification based on the categorical cross-entropy loss function is more suitable than evaluating the model on mean square error or mean absolute error, as in [10] and [43]. In this paper, we present performance results based on the precision, recall, and accuracy of the model for traffic prediction. Equations 17, 18, and 19 define precision, recall, and accuracy, respectively, and the categorical cross-entropy loss function is given in (20). Here, $y$ is the true value, $\hat{y}$ is the predicted value, and $(i, j)$ are the row and column in the image.
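The evaluation metrics can be sketched with their textbook definitions applied pixel by pixel. This is an illustration under stated assumptions, not a reproduction of (17)-(20): the one-class-versus-rest treatment of precision and recall, the averaging of the loss over pixels, and the integer encoding of the congestion levels are all assumptions.

```python
import numpy as np

def pixel_metrics(y_true, y_pred, positive_class):
    """Precision, recall, and accuracy for one class, from integer label maps of equal shape."""
    true_pos = np.sum((y_pred == positive_class) & (y_true == positive_class))
    false_pos = np.sum((y_pred == positive_class) & (y_true != positive_class))
    false_neg = np.sum((y_pred != positive_class) & (y_true == positive_class))
    true_neg = np.sum((y_pred != positive_class) & (y_true != positive_class))
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    accuracy = (true_pos + true_neg) / y_true.size
    return precision, recall, accuracy

def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    """Mean over pixels (i, j) of -sum_c y * log(y_hat); inputs have shape (rows, cols, classes)."""
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=-1)))

# 2x2 toy label maps; assumed encoding: 0 = background, 1 = free, 2 = slow, 3 = jam.
y_true = np.array([[1, 1], [3, 0]])
y_pred = np.array([[1, 3], [3, 0]])
precision, recall, accuracy = pixel_metrics(y_true, y_pred, positive_class=3)

# Loss of a uniform (uninformative) prediction over four classes.
uniform_loss = categorical_cross_entropy(np.eye(4)[y_true], np.full((2, 2, 4), 0.25))
```

In the toy example one of the two pixels predicted as "jam" is wrong, so precision for that class is 0.5 while recall is 1.0. The helper does not guard against a class that is never predicted or never present, in which case the corresponding ratio is undefined.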

C. IMPLEMENTATION OF PROPOSED MODEL
As mentioned in Section III-A, the objective of the model is to take a chronological sequence of traffic images and predict the congestion level of the transportation network at different prediction horizons (10, 30, and 60 minutes). To achieve this objective, the deep neural network explained in Section III-C is used. Because the network is very deep, i.e., has a large number of hidden layers, it may suffer from the vanishing gradient problem: as the number of hidden layers increases, the gradient shrinks toward zero during backpropagation, so that the weights are never updated. The solution to the vanishing gradient is the skip connection, a connection from the initial layers of the network to later layers, which enables the gradient to flow directly backward from later layers to initial layers. Apart from the vanishing gradient, the other parameters to consider while implementing the model are the number of past images used as input, the number of LSTM layers between the convolutional encoder and decoder, the number of filters in consecutive layers of both the convolutional encoder and decoder, and the hyper-parameters.
First, we evaluate our prediction model for the vanishing gradient problem by monitoring the change in performance under different combinations of skip connections from the feature extraction network (Figure 2(a), section A) to the reconstruction network (Figure 2(a), section C). We select five upsampling layers in the reconstruction network, namely E1, E2, E3, E4, and E5, as shown in Figure 2(a), where a skip connection could be beneficial because it carries feature maps with much image detail, which helps the upsampling layer recover a clean version of the image. We select five layers located just before the downsampling layers in the feature extraction network, namely C1, C2, C3, C4, and C5, as shown in Figure 2(a), as the sources of the skip connections to the reconstruction network. There are five possible skip connections (C1 to E5, C2 to E4, C3 to E3, C4 to E2, and C5 to E1) and a large number of possible architectures based on combinations of them. We investigate 6 model scenarios to find the best architecture. Table 1 shows the prediction performance (for the 10-minute horizon) on the training and validation datasets. As seen from the table, the performance of configuration 7 (without any skip connection) is boosted significantly by the addition of skip connections. With the addition of one skip connection, in Configuration No. (C.N) 5 and 6, the training and validation mean square error decreases by 40% and 23%, respectively; the training and validation pixel-to-pixel accuracy improves by 1.85% and 1.60%, respectively; and the training and validation categorical cross-entropy loss decreases by 30% and 22%, respectively. Further investigation shows that the skip connection between the early and late layers (C1 to E5) in C.N 5 achieves a higher performance gain than the skip connection between intermediate layers (C5 to E1) in C.N 6, which suggests that the model suffers from the vanishing gradient problem.
With the addition of more skip connections, the model continues to perform better, and the best result is achieved with C.N 1, which has skip connections from the feature extraction network to all the upsampling layers in the reconstruction network, as shown in Figure 2(a). Although the training accuracy of C.N 1 is lower than that of C.N 2, C.N 1 has better representation capability, as its validation accuracy, MSE, and loss are the best among all the configurations. All the other model parameters, such as the number of filters, the length of the input image sequence, and the hyper-parameters, are kept the same when testing the configurations listed in Table 1.
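The pairing of encoder and decoder layers described above can be checked with a short sketch. It assumes that every downsampling layer exactly halves, and every upsampling layer exactly doubles, the spatial size (which agrees with the 192 × 448 input and 6 × 14 encoded size reported for the model), and that a skip connection joins the output of Ej. The function names are hypothetical and only illustrate why Ck is paired with E(6-k): the two feature maps have the same resolution.

```python
# Illustrative sketch only; names are hypothetical, not taken from the paper.
# Assumption: each downsampling halves and each upsampling doubles height/width.
INPUT_HW = (192, 448)   # input image rows and columns
NUM_STAGES = 5          # five downsampling and five upsampling layers


def encoder_tap_shape(k):
    """Spatial size at Ck, the layer just before the k-th downsampling (k = 1..5)."""
    scale = 2 ** (k - 1)
    return (INPUT_HW[0] // scale, INPUT_HW[1] // scale)


def decoder_output_shape(j):
    """Spatial size at the output of Ej, the j-th upsampling layer (j = 1..5)."""
    bottleneck = (INPUT_HW[0] // 2 ** NUM_STAGES, INPUT_HW[1] // 2 ** NUM_STAGES)
    return (bottleneck[0] * 2 ** j, bottleneck[1] * 2 ** j)


def skip_pairs():
    """Return (Ck, Ej, shape) triples for the pairing Ck -> E(6 - k)."""
    pairs = []
    for k in range(1, NUM_STAGES + 1):
        j = NUM_STAGES + 1 - k
        shape = encoder_tap_shape(k)
        # The skip connection is only well defined if both sides match.
        assert shape == decoder_output_shape(j)
        pairs.append((f"C{k}", f"E{j}", shape))
    return pairs


for source, target, shape in skip_pairs():
    print(source, "->", target, shape)
```

Under these assumptions, C1 and E5 share the full input resolution, while C5 and E1 meet at the coarsest resolution above the bottleneck.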
Choosing the optimal number of historical images as input to the model is crucial. Selecting more images than necessary consumes computing resources needlessly and makes the model difficult to train on resource-constrained devices. Choosing too short an input sequence, on the other hand, hinders the performance of the model, because there is not enough information to exploit the time-series relationship in the data. Since the time interval between consecutive samples is 5 minutes, we experiment with 13, 12, 11, and 9 historical samples (i.e., 65, 60, 55, and 45 minutes, respectively) to train the model and predict the traffic congestion.
As seen from Table 2, C.N 2, with 11 input samples, achieves exceptional performance on the training dataset but fails to be superior on the validation dataset. C.N 3 produces better MSE and accuracy results than the other configurations and a loss similar to that of C.N 4. Hence, we choose C.N 3, with 12 image samples as input, for implementing our prediction model, as it has better generalization capability than the other configurations and requires less computing time than the configuration with 13 input samples.
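The relation between the number of historical samples, the 5-minute sampling interval, and the prediction horizon can be sketched as follows. This is a minimal illustration with hypothetical helper names; it assumes that each training sample pairs a run of consecutive frames with the single frame lying one horizon after the last input frame, which the text does not state explicitly.

```python
# Illustrative sketch only; helper names and the single-target framing are assumptions.
STEP_MINUTES = 5  # sampling interval stated in the text


def window_minutes(n_frames):
    """Length of history covered by n_frames consecutive samples."""
    return n_frames * STEP_MINUTES


def make_samples(frames, n_past=12, horizon_minutes=10):
    """Pair each run of n_past frames with the frame horizon_minutes after its last frame."""
    horizon_steps = horizon_minutes // STEP_MINUTES
    samples = []
    last_start = len(frames) - n_past - horizon_steps
    for start in range(last_start + 1):
        past = frames[start:start + n_past]
        target = frames[start + n_past - 1 + horizon_steps]
        samples.append((past, target))
    return samples


# History lengths for the four tested configurations.
print([window_minutes(n) for n in (13, 12, 11, 9)])

# Frame indices stand in for images here.
demo = make_samples(list(range(20)), n_past=12, horizon_minutes=10)
print(len(demo), demo[0][0][-1], demo[0][1])
```

With 20 frames, 12 inputs, and a 10-minute horizon, the sketch yields 7 samples, and the first target is the frame two steps after the last input frame.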
Furthermore, the number of filters in the convolutional neural network plays a vital role in model performance. As each filter extracts a different feature map from the same layer, increasing the number of filters enables more learning, but increasing it beyond some optimal value no longer improves performance and only increases resource consumption. Hence, we experimented with three configurations of the number of filters for the convolutional layers to find a suitable one. The feature extraction network and the reconstruction network shown in Figure 2(a) are symmetric in terms of the number of filters. From Table 3, we can see that C.N 3 has better MSE and accuracy, and C.N 1 has a better loss, on the training dataset. However, on the validation dataset, C.N 1 outperforms all other configurations, which suggests that C.N 1 has better generalization capability. Other parameters, such as the skip connections, the number of input images, and the hyper-parameters, were kept the same when generating the results for all three configurations.
TABLE 1. Comparison of prediction metrics (10 min horizon) for training and validation dataset for different configurations of skip connections. The best result is marked in bold.

TABLE 2. Comparison of prediction metrics (10 min horizon) for training and validation dataset for the different numbers of historical images as input data. The best result is marked in bold.

TABLE 3. Comparison of prediction metrics (10 min horizon) for training and validation dataset for the different filter configurations of the convolutional layers. The best result is marked in bold.

The details of PredNet are explained in this subsection. As shown in Figure 2(a), the proposed model has 30 layers, consisting of 12 convolutional layers; 5 downsampling layers; 5 upsampling layers; 4 LSTM layers; and one each of the flatten, reshape, input, and output layers. The model input has four dimensions (12,192,448,3), where the first number indicates that a sequence of 12 images is taken as input, the second and third numbers indicate the rows and columns of the image, and the fourth number represents the channels of the image. The input layer is followed by a convolutional layer with 32 convolutional filters of size (3 × 3), strides of (1 × 1), and padding 'same'. This layer is followed by a downsampling layer, performed by a convolutional operation with strides of (2 × 2) and padding 'valid'. The combination of convolutional and downsampling layers encodes the input (12,192,448,3) to (12,6,14,8). The flatten layer further converts this to (12,672) and feeds it to the recurrent network, where the LSTM learns the features by unfolding the time series and capturing the pattern. The output of the LSTM (12,672) is reshaped into (12,6,14,8) by the reshape layer, which is followed by a series of convolutional layers and upsampling layers (performed by a transposed convolutional operation with filter size (2 × 2) and stride (2 × 2)) to regenerate the encoded representation back to the original image resolution (12,192,448,4). All the layers have ReLU activation except the last layer, which has softmax activation. All the convolutional, downsampling, and upsampling layers use a dropout of 0.1 and batch normalization, whereas the LSTM layers use a dropout of 0.2.
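The shape bookkeeping in this description can be verified with a few lines of arithmetic. The sketch below is not the authors' implementation; it assumes that each of the five downsampling layers exactly halves the height and width (consistent with the reported shapes) and takes the 8 channels at the bottleneck from the stated encoded shape. The function and variable names are hypothetical.

```python
# Illustrative shape arithmetic only; names are hypothetical.
# Assumption: each downsampling layer halves height and width exactly.

def trace_shapes(seq_len=12, height=192, width=448, n_down=5, bottleneck_channels=8):
    """Return the encoded shape and the flattened shape fed to the LSTM layers."""
    h, w = height, width
    for _ in range(n_down):
        h, w = h // 2, w // 2
    encoded = (seq_len, h, w, bottleneck_channels)
    flattened = (seq_len, h * w * bottleneck_channels)
    return encoded, flattened


# Layer inventory as listed in the text.
layer_counts = {
    "convolutional": 12,
    "downsampling": 5,
    "upsampling": 5,
    "LSTM": 4,
    "flatten": 1,
    "reshape": 1,
    "input": 1,
    "output": 1,
}

encoded, flattened = trace_shapes()
print(encoded, flattened, sum(layer_counts.values()))
```

The result reproduces the encoded shape (12,6,14,8), the flattened shape (12,672), and the total of 30 layers.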
In our experiment, we implemented the ConvLSTM model with six layers having the configuration [48, 36, 24, 24, 12, 4], with a filter size of (3 × 3), strides of (1 × 1), and 'same' padding. Each layer except the last has ReLU activation, a dropout of 0.1, and a batch normalization layer, whereas the last layer has softmax activation. The input to ConvLSTM is a sequence of 12 images with a resolution of (192 × 448 × 3). For the Autoencoder, we adopt the configuration [512, 384, 256, 128] with ReLU activation in each layer, except for softmax activation in the last layer. The input to the Autoencoder is (12,345), i.e., the congestion level of each road for 12 input samples. In addition, the loss function for both models is changed from MSE to categorical cross-entropy for a fair comparison.
All the models mentioned were trained on real-world traffic congestion data from the city of Seoul (South Korea) using the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 1e-4, a learning decay rate of 0.95, a variables moving average decay of 0.999, and a batch size of 1. We train all the models using a categorical cross-entropy loss function. All the models are implemented using the Keras deep learning library on an Ubuntu 18.04.4 machine with 4 NVIDIA TITAN Xp graphics cards. Table 4 presents the performance metrics of our proposed model, PredNet, along with ConvLSTM and Autoencoder in terms of precision, recall, and accuracy on a training dataset at different prediction horizons (10, 30, and 60 minutes). Here, instead of using all the pixels of the image, we randomly choose a single pixel for each road (road-wise value) to evaluate the model. The road-wise prediction performance gives a truer evaluation of the model, as the performance calculation is not affected by background pixels or road length. The proposed model, PredNet, achieves a performance gain of around 2 to 12% for all prediction horizons compared to ConvLSTM and Autoencoder. Table 5 shows the road-wise per-hour average prediction accuracy for eight working days from 3rd December 2019 to 12th December 2019, in the period from 08:00 to 12:00, for all three congestion prediction models. The proposed network, PredNet, achieves better accuracy for all prediction horizons, i.e., 10, 30, and 60 minutes. For the 10-minute horizon, the average accuracy of PredNet ranges from 0.8473 to 0.8793. Out of 32 hours of prediction, PredNet achieves the best accuracy for 28 hours, and ConvLSTM achieves the best for 4 hours. Similarly, for the 30-minute prediction horizon, PredNet achieves the best accuracy for 23 hours, and ConvLSTM achieves the highest value for the other 9 hours.
For the 60-minute prediction horizon, however, PredNet outperforms all other models by delivering the best result for all periods. The highest accuracy of PredNet for the 30- and 60-minute predictions is 0.8566 and 0.8489, respectively. Figure 3 shows a detailed representation of the prediction accuracy on 3rd December 2019, from 08:00 to 12:00, for prediction horizons of 10 and 30 minutes, respectively. It shows the prediction accuracy at every 5 minutes. Figure 3(a) shows that PredNet achieves the highest accuracy value 46 times out of 48 samples, the highest being 0.9325 at 10:00. Similarly, PredNet achieves the highest accuracy 42 times out of 48 samples for the 30-minute prediction, as shown in Figure 3(b). The accuracy of ConvLSTM is slightly lower than that of our proposed PredNet in most instances. However, the prediction accuracy of the Autoencoder, based on the architecture from the literature [10], is inferior. Both PredNet and ConvLSTM perform well for traffic congestion prediction, as they use spatial and temporal information. In contrast, the Autoencoder uses only temporal data for forecasting; this could be one of the reasons for its poor performance.
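The road-wise evaluation described above can be illustrated with a small sketch. It assumes a lookup from each road to the pixel coordinates it covers; this lookup, the label names, and the function name are hypothetical, since the text only states that one pixel per road is chosen at random.

```python
# Illustrative sketch only; the road-to-pixel lookup and names are hypothetical.
import random


def roadwise_accuracy(truth, pred, road_pixels, rng):
    """Accuracy over roads, sampling one random pixel per road.

    truth, pred: 2D lists of class labels with identical shape.
    road_pixels: dict mapping a road id to a list of (row, col) pixels on that road.
    """
    correct = 0
    for pixels in road_pixels.values():
        row, col = rng.choice(pixels)
        if truth[row][col] == pred[row][col]:
            correct += 1
    return correct / len(road_pixels)


# Tiny example: "bg" marks background pixels, which belong to no road.
truth = [["jam", "jam", "bg", "free"],
         ["slow", "slow", "bg", "free"]]
pred = [["jam", "jam", "bg", "free"],
        ["free", "free", "bg", "free"]]
road_pixels = {
    "road_a": [(0, 0), (0, 1)],
    "road_b": [(1, 0), (1, 1)],
    "road_c": [(0, 3), (1, 3)],
}

print(roadwise_accuracy(truth, pred, road_pixels, random.Random(0)))
```

In this example two of three roads are predicted correctly, whereas a pixel-wise count over the whole image, background included, would report 6 of 8 pixels correct. This is the effect of background pixels and road length that the road-wise metric is meant to remove.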

D. RESULT AND ANALYSIS
From Table 5 and Figure 3, we can say that the proposed PredNet performs better than the other two state-of-the-art prediction models at all prediction horizons. Even though the models attain high accuracy, there is no guarantee that a trained model represents every congestion level equally well. A model may predict some congestion levels accurately and others poorly. In Table 6, we present the precision and recall metrics for the Jam, Slow, and Free congestion levels for all three models and for all three prediction horizons of 10, 30, and 60 minutes. These metrics show the exactness and sensitivity of the models in learning and representing the congestion levels. Table 6 shows the comparison on 3rd December 2019 from 09:00 to 10:00 at every 5 minutes. The proposed PredNet predicts all the congestion levels with high precision at all three prediction horizons. For the 10-minute prediction horizon and the Jam congestion level, PredNet achieves a precision ranging from 77% to 94% with an average value of 86%, which is 10% and 12% higher than ConvLSTM and Autoencoder, respectively. For the Slow congestion level, PredNet attains an average precision of 86.7%, which is 3% and 14% higher than ConvLSTM and Autoencoder, respectively. Furthermore, for the Free congestion level, PredNet reaches an average precision of 82.9%, which is 0.6% and 10% higher than ConvLSTM and Autoencoder, respectively. Similarly, PredNet achieves precision values of 80.8%, 86.9%, and 80.7% for the 30-minute prediction and 82.4%, 85.3%, and 80.5% for the 60-minute prediction, for the Jam, Slow, and Free congestion levels, which are higher than those of the other two models. As shown in Table 6, PredNet also dominates the other models in terms of average recall for all three congestion levels at all prediction horizons.
The highest recall values achieved by PredNet are 0.876, 0.919, and 0.926 for the 10-minute horizon, 0.874, 0.944, and 0.918 for the 30-minute horizon, and 0.839, 0.903, and 0.904 for the 60-minute horizon, for the Jam, Slow, and Free congestion levels, respectively. In terms of recall for the Jam congestion level, PredNet achieves a performance gain of around 4-10% compared to ConvLSTM and 10-16% compared to Autoencoder for prediction horizons of 10, 30, and 60 minutes. For the Slow congestion level, PredNet attains a gain of around 1% for prediction horizons of 10 and 30 minutes and 10% for the 60-minute prediction horizon compared to ConvLSTM, and about 15-18% for all three prediction horizons compared to Autoencoder. Similarly, a significant gain in recall is attained compared to the other models for the Free congestion level. Our proposed model shows consistent predictions for all three prediction horizons, whereas ConvLSTM gives a reliable forecast for the 10- and 30-minute horizons but fails for the 60-minute prediction horizon. Figure 4 shows the end-to-end result of PredNet, comparing the ground truth with the corresponding predicted congestion level on 3rd December 2019 for a prediction horizon of 10 minutes. In Figure 4, Column A shows the ground truth image at every 5 minutes, and Column B shows the predicted image with its precision (P), recall (R), and accuracy (A). PredNet predicts the congestion level of the city at fine granularity, which is visually more intuitive than the work in [10].
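For reference, the per-level precision and recall reported above follow the standard definitions, which can be sketched as below. The labels and the toy data are made up for illustration and are not taken from Table 6; the function name is hypothetical.

```python
# Illustrative sketch of per-class precision and recall; the toy data is made up.

def precision_recall(truth, pred, label):
    """Precision and recall for one congestion level over road-wise labels."""
    pairs = list(zip(truth, pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = ["Jam", "Jam", "Slow", "Free", "Free", "Slow"]
pred = ["Jam", "Slow", "Slow", "Free", "Jam", "Slow"]

for level in ("Jam", "Slow", "Free"):
    print(level, precision_recall(truth, pred, level))
```

Precision measures how often a predicted level is correct (exactness), and recall measures how many roads at that level are found (sensitivity), so a model can score well on one level and poorly on another even when overall accuracy is high.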
The computational resource requirement of a deep neural network depends heavily on the type of connection between its layers. Locally connected, weight-sharing nodes use fewer resources than fully connected nodes between the layers. For the input dimension of (12,192,448,3) and the architectures described in Section IV-C, ConvLSTM has 0.307 million parameters, the proposed PredNet has 16.5 million parameters, and Autoencoder has 1,718.5 million parameters (575 million for a grayscale image). This shows that PredNet is more efficient than Autoencoder in terms of resource utilization, but not more efficient than ConvLSTM. In addition, the PredNet model is more computationally efficient than ConvLSTM in terms of computing time, as shown in Table 7. Our proposed network takes fewer epochs and less training time to converge than ConvLSTM: PredNet takes around 1.9 to 2.3 hours to converge, whereas ConvLSTM takes around 15.67 to 16.95 hours, so PredNet is roughly eight times faster than ConvLSTM. For the original input resolution, Autoencoder is practically impossible to build and train on the same device on which PredNet and ConvLSTM are trained. So, in this research, we have reformatted the input matrix for Autoencoder by taking one pixel value per road from the image instead of every pixel. The input dimension for Autoencoder decreases to (12,345), i.e., 12 image samples with 345 road congestion levels in each image. The number of trainable parameters decreases to 2.9 million, and the training time drops to around 40 to 50 minutes. Even though its resource utilization and training time are then very low compared to PredNet, the prediction performance of the Autoencoder model is very poor. Hence, among the models that deliver high prediction performance, the proposed model is the most efficient in terms of training time, and it is more efficient than the full-resolution Autoencoder in terms of resource utilization.
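The effect of weight sharing on the parameter count can be made concrete with the standard formulas for a 2D convolutional layer and a fully connected layer. The sketch below only compares single layers and does not attempt to reproduce the model totals quoted above; treating one flattened 192 × 448 × 3 frame as the dense input is an assumption made for illustration, and the function names are hypothetical.

```python
# Illustrative single-layer parameter counts; not a reproduction of the model totals.

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2D convolution; independent of the image size."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# First convolutional layer from the text: 32 filters of size 3 x 3 on a 3-channel image.
first_conv = conv2d_params(3, 3, 3, 32)

# Assumption for illustration: one flattened 192 x 448 x 3 frame fed to 512 dense units.
dense_full_frame = dense_params(192 * 448 * 3, 512)

# Reduced road-wise input: 345 values fed to 512 dense units.
dense_roadwise = dense_params(345, 512)

print(first_conv, dense_full_frame, dense_roadwise)
```

The convolutional layer needs 896 parameters regardless of image size, whereas a single dense layer on a full frame needs about 132 million. Reducing the input to 345 road values brings the dense layer down to about 0.18 million, which is in line with the sharp drop in Autoencoder size reported for the reformatted input.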

V. DISCUSSION AND CONCLUSION
In this work, we present a deep learning model architecture for city-wide traffic congestion prediction using image data from an online traffic portal. We develop a hybrid model by combining a Convolutional Neural Network, LSTM, and a Transposed Convolutional Network, which can effectively learn both the spatial and the temporal relations in the input data. The model was trained for three prediction horizons: 10, 30, and 60 minutes. Previous studies compared the predicted image with the ground truth in terms of MSE or MAE, which show the average error between the images rather than the road-wise prediction. In this paper, by contrast, we compare the performance of our proposed PredNet with two other state-of-the-art algorithms, ConvLSTM and Autoencoder, in terms of precision and recall for each congestion level, and in terms of accuracy based on road-wise prediction.
As discussed in Section IV-C, we conclude that the optimal architecture for the proposed prediction model includes 5 skip connections from the feature extraction layers to the reconstruction layers to prevent vanishing gradients, and that a sequence of 12 historical images provides an excellent prediction result. From the result and analysis in Section IV-D, we can see that our proposed model achieves the best average accuracy for all three prediction horizons, and that the precision and recall values of PredNet are the highest for all congestion levels at all prediction horizons. In addition, our proposed PredNet is about eight times faster than ConvLSTM in terms of computing time and can be trained on high-resolution image data with fewer resources than Autoencoder, as it uses convolutional and downsampling layers rather than the fully connected layers of the Autoencoder. However, despite the encouraging prediction performance, there is still room to improve the computational efficiency of the model, as a lot of resources and computing time are wasted on learning the background area.
For future work, we can include external factors such as weather information (rain, snow, fog) for each road, which could improve the model performance. In addition, we will try to improve computational efficiency by removing all the background during training, and we will try to add more information from different data sources for more accurate predictions.