Multi-Lane Short-Term Traffic Forecasting With Convolutional LSTM Network

Short-term traffic prediction constitutes a crucial component of intelligent transportation systems. With the explosion of automated traffic monitoring sensors and the flourishing of deep learning techniques, a growing body of deep neural network models has been employed to tackle this problem. In particular, convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent networks have demonstrated their advantages in modeling and predicting the spatiotemporal evolution of traffic flows. In this paper, we propose a novel Convolutional LSTM neural network architecture for multi-lane short-term traffic prediction. Compared to existing methods, we highlight the importance of (1) applying multiple features to characterize traffic conditions; (2) explicitly considering the routing between neighbouring lanes and downstream/upstream traffic; and (3) predicting multiple time steps of traffic in a rolling-prediction manner. Experiments on ten months of 5-minute-interval observations of the US I-101 North freeway in the California Bay Area verify the proposed model. The results show that our model has considerable advantages in predicting multi-lane short-term traffic flow.


I. INTRODUCTION
Accurate, timely, and detailed traffic prediction plays a critical role in intelligent transportation systems (ITS). Such predictions help to plan and guide vehicle routing, increase the reliability and efficiency of road networks, and alleviate congestion and its related problems [1]. Consider a multi-lane freeway, a relatively simple subsystem of a traffic network: the spatiotemporal evolution of its traffic status results from individual travelers' routing choices and their dynamic interactions. Although it is difficult to predict traffic evolution over long horizons due to highly variable traffic loading and releasing, it is feasible to make short-term traffic predictions based on the current traffic status. Typically, short-term forecasts are within the range of 5 to 30 minutes.
Short-term traffic forecasting has been an important task in the transportation research community since the 1970s.
Traffic flow data are often treated as time series or signal data, with application of conventional time series models, such as the autoregressive (AR) model, moving average (MA) model, autoregressive integrated moving average (ARIMA) model [2]-[4], Hidden Markov Model [5], and Bayesian network [6], [7]; or signal processing models, such as the Kalman filter [8]. Some variants of these models and their combinations have also been applied to short-term traffic prediction, such as the seasonal ARIMA model [9], the KARIMA model [10], and the ARIMAX model [11]. Other time series models can also be applied to this problem [12], [13]. While these classical models make strong assumptions about traffic evolution, such as stationarity and linearity, observation-based investigations have found that real-world traffic flows exhibit complex nonlinear behavior, and are usually non-stationary with respect to geographical location, traffic hours, weather, and unexpected incidents [14].
The recent decades have witnessed a flourishing of automated traffic monitoring sensors. These sensors provide a detailed and comprehensive picture of the traffic status. The massive traffic observation data offer an unprecedented opportunity to examine and analyze the traffic system, and make it possible to build more sophisticated, data-driven models for more accurate predictions. Machine learning algorithms build empirical functions by learning from data. Provided with a large amount of training data, these models usually show better prediction capacity than classical statistical models [15], [16]. Many machine learning techniques have been employed for short-term traffic prediction, such as support vector regression (SVR) [17], [18], the k-nearest neighbor algorithm (KNN) [19], fuzzy logic [20], [21], and artificial neural network (ANN) models [14], [22], [23].
Compared with classical machine learning algorithms, deep learning models have shown advantage in scaling well with data availability, i.e., deep neural networks usually make better use of massive amount of data by learning customized feature representations. In particular, convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent neural networks (RNNs) have demonstrated their peculiar advantages in modeling and predicting spatiotemporal data. Recently, such models have received extensive attention in the field of traffic flow prediction [24]- [31].
Although many researchers have noticed the advantages of the above-mentioned modeling techniques, most studies are limited to traffic prediction with a single input of lane-average information. In traffic flow theory, traffic flow, speed, and occupancy are correlated [32]. It is hard to learn and model traffic flow without speed or occupancy information. Besides, traffic patterns can be distinct for different lanes [33]: traffic flow tends to be high in the median lane of the freeway compared to shoulder lanes [34]. Drivers may frequently switch between neighbouring lanes to avoid congestion ahead. These local interactions aggregate to influence the overall evolution of traffic flows. By explicitly considering the local traffic patterns between neighbouring lanes and along traffic streams, we may better capture traffic evolution patterns and make more accurate predictions. These insights bring potential benefits for better modeling of traffic evolution, but a considerable gap remains in exploiting them for better traffic predictions.
To fill this gap, we propose a novel convolutional LSTM recurrent neural network architecture for multi-lane short-term traffic flow prediction. The model uses a CNN to extract routing patterns of neighbouring lanes and downstream/upstream traffic. The extracted features at each time step are connected using an LSTM module, allowing the model to simulate the temporal dynamic behavior of traffic flows.
This model allows the use of multiple-feature spatiotemporal input to characterize traffic conditions at the lane scale. The key contributions of this paper can be summarized as follows: • To the best of our knowledge, this paper is the first attempt to introduce a convolutional LSTM architecture for multi-lane short-term traffic prediction.
• Multiple features are used in the model to characterize traffic conditions with a high prediction accuracy.
• Learning traffic patterns from neighbouring lanes and downstream/upstream traffic provides better results in a transportation system. We demonstrate that predicting multi-lane traffic flow outperforms predicting the average flow of all lanes.
• The proposed model can make multi-step traffic predictions in a recursive manner. The rest of this paper proceeds as follows. Section II reviews deep learning based traffic prediction works. Section III presents our proposed model. Section IV shows the experiments and results. Discussion and conclusions are drawn in Section V.

II. RELATED WORKS
Short-term traffic flow prediction has a long history in the transportation literature. Many existing works have made comprehensive reviews of classical parametric modeling methods and the relatively novel machine and deep learning methods. The latter have demonstrated their advantage in leveraging large observation datasets for accurate predictions.
Here, we only review the most relevant deep learning based short-term traffic prediction works.
The stacked autoencoder (SAE) model was used to predict short-term traffic flow and achieved better results than parametric models such as ARIMA [23]. Similarly, a deep model of stacked denoising autoencoders was developed to learn hierarchical feature representations [36]. A deep belief network (DBN) was used to capture the spatial-temporal features of traffic flow, and a multi-task learning architecture was applied to perform exit-station flow and road flow forecasting [37]. In [38], two DBN architectures were compared for short-term traffic flow forecasting, showing that a DBN based on restricted Boltzmann machines (RBMs) with Gaussian visible units and binary hidden units outperforms a DBN based on RBMs with all units binary. In [39], DBNs have also been explored for longer-term (i.e., day-ahead) traffic prediction. Furthermore, Generative Adversarial Networks (GANs) have been used to estimate trip travel time distributions [40]. A deep neural network was used to capture nonlinear spatiotemporal effects [14]. These models can predict future traffic flow accurately to some degree, but they do not exploit the long-term memory of traffic flow, which hinders their predictive power.
RNNs are neural networks with hidden states, which are well suited for modeling sequential data with dynamic behavior. In the context of traffic prediction, researchers in ITS treat the traffic evolution as a dynamical system, and apply RNNs to model this system [41]. However, the training of a vanilla RNN has been demonstrated to be difficult due to the vanishing/exploding gradient problem [42], [43]. LSTM architecture with recurrent gates called ''forget'' gates has made it possible to train deep recurrent neural networks efficiently, and shows excellent ability with regard to traffic prediction tasks [24], [44]- [49]. In addition, LSTM has been combined with DBN to model lane changing process [48]. Other gating mechanisms, such as gated recurrent units mechanism, are adopted to predict network wide traffic [50]. Even though the studies above claimed that LSTM can characterize the time correlation of the traffic data and capture long-term memory of traffic flow, the spatial dependency is not fully utilized.
CNNs have succeeded in many research fields, including image recognition [51], [52], video analysis [53], [54], natural language processing [55], [56], and weather prediction [57]-[59]. Owing to the good performance of CNNs in capturing spatial features, some researchers in ITS use CNNs for traffic forecasting tasks [27], [29], [31], [60], [61]. In most of their model architectures, the input traffic data have spatial correlation along the columns and temporal correlation along the rows. In [29], multiple features are considered as channels of the input, and the output is the 1D traffic speed at each station. In [31], multiple lanes are treated as channels of the CNN input, and the output is the 2D traffic speed at each station and each lane.
The combination of CNN and LSTM RNN models can effectively improve the accuracy of traffic prediction tasks, where the CNN captures spatial dependencies and the LSTM learns temporal features. There are different variations in formulating the problem and in customizing and applying these modules. For instance, the model structure in [25] includes a 1D CNN for capturing the spatial features of traffic flow, two LSTM RNNs for learning temporal features, and fully connected layers for combining the output features. The model takes in the average flow of all lanes and treats time steps as CNN channels. In their following work [28], speed data were used to learn the weights of the models for flow prediction. Compared with these two freeway studies, Yu et al. [35] focused on regional network-wide traffic with hundreds of interchanges and intersections. The architecture they proposed consists of a 2D CNN, two LSTMs, and a fully connected layer. The model takes as input 2D average-speed networks represented by grid-based transportation-network segmentation, and outputs the 1D traffic speed at each link.
The works most related to ours are summarized in Table 1. The table lists the predictand, data source, samples, input variable, input dimension, output dimension, lane treatment, and model of each work. Most existing models are limited to one-step traffic prediction with a single input/output of lane-average information.
Our approach differs from all these methods in both the problem setting and the formulation of the CNN-LSTM combination. We model different attributes of lane-level traffic at the same time. We apply the convolution operator to explicitly seek spatially invariant patterns from the array-shaped data. In the context of traffic prediction, where we treat multi-lane traffic status as a 2D matrix, such local features preferably correspond to particular traffic flow patterns involving the interactions of several neighbouring vehicles. Besides, our sequence-to-sequence learning framework differs from previous studies. As shown in Figure 1, almost all existing research adopts the sequence-to-sequence LSTM model (diagram (a)), which uses observations at sequential time steps to make a multi-step forecast. The hidden state is accumulated with each input value before the first output value is produced. In contrast, we use a synced sequence-to-sequence LSTM (diagram (b)), which produces a one-step forecast for each observed time step received as input, and the information is passed from one step to the next.

FIGURE 1. Schematic of two different varieties of RNN sequence-to-sequence learning. The left diagram is the sequence-to-sequence architecture, and the right one is the synced sequence-to-sequence architecture. The yellow circles are the input sequence fed into the network; each circle stands for a feature vector at a distinct time step. The hidden layer in blue processes the inputs and outputs the prediction for the next time step(s). The model in (a) uses observations at sequential time steps (steps 1-3) to make a multi-step forecast (steps 4-6); the hidden state is accumulated with each input value before the first output value (step 4) is produced. In diagram (b), we input the features at time step 1 and output a prediction for the following time step (step 2); we repeat this for three time steps until we reach the end of the sequence. This loop allows information to be passed from one step to the next. The twist of these RNNs is that the predicted values from previous time steps are fed back as input.

III. METHODOLOGY

A. PROBLEM STATEMENT
Consider a traffic monitoring system that observes the traffic flow along multiple lanes of a freeway at a fixed observing frequency. At each observation time step, we monitor c attributes of the traffic condition at h observation stations along the considered freeway. Each observation is made for all w lanes. We treat the spatially and temporally dependent data as a sequence of spatial signals. Specifically, we apply an n × c × h × w tensor T_n to represent the traffic data, where n denotes the total number of observing time steps, c the number of observed attributes, h the number of stations, and w the number of lanes:

T_n = (X_1, X_2, ..., X_n), (1)

where X_i represents the traffic status at time step i. X_i is a c × h × w tensor:

X_i = (x_i^jk), j = 1, ..., h, k = 1, ..., w, (2)

where each x_i^jk is a c-dimensional vector, representing the c traffic attributes at time step i, observation station j, and lane k.

TABLE 1. Summary of related works regarding deep neural networks for learning spatial-temporal patterns for traffic prediction. The models include CNN, RNN, or their combinations. The predictand of each work is flow/speed for freeway/urban traffic with single/multiple variable input(s). The input variable considers the average value of all lanes or a distinct value for each lane. The input dimension of most models is 1D (number of stations) or 2D (number of stations/links by time step). The output dimension of most works is 1D (number of stations/links). The data source and data samples of each work are listed as well.
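As a concrete illustration, the tensor layout above can be assembled with NumPy. The dimension values and the synthetic `records` array below are placeholders, not the actual observation data:

```python
import numpy as np

# Placeholder dimensions: n time steps, c attributes, h stations, w lanes.
n, c, h, w = 288, 2, 125, 4

# Assume records[i][j][k] holds the c-dimensional attribute vector
# x_i^jk; here we fill it with synthetic values for illustration.
rng = np.random.default_rng(0)
records = rng.random((n, h, w, c))

# Rearrange to the n x c x h x w layout used in Equation 1.
T = np.transpose(records, (0, 3, 1, 2))

# X_i, the spatial signal at time step i, is a c x h x w slice (Equation 2).
X_0 = T[0]
```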
The goal of our short-term traffic flow forecasting task is to predict the one-step future traffic flow spatial signal based on the current and historic spatial signals. Each signal consists of multiple aspects of the traffic, such as traffic flow and speed. We formalize the forecasting problem as learning a function h(·) that maps n historical spatial signals to the (n + 1)th-step future spatial signal, as in Equation 3:

X_{n+1} = h(X_1, X_2, ..., X_n). (3)
With X_{n+1} predicted, we can estimate T_{n+1} by appending X_{n+1} to T_n. Thus, we can iteratively apply Equation 3 to make multi-step traffic predictions.
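The iterative multi-step scheme can be sketched as a small feedback loop. Here `h_fn` stands in for the trained network of Equation 3, and the two-signal averaging predictor is a toy placeholder, not the paper's model:

```python
import numpy as np

def rolling_forecast(h_fn, history, steps):
    """Iteratively apply the one-step predictor h_fn (Equation 3):
    each predicted spatial signal is appended to the input sequence
    and fed back for the next step."""
    seq = list(history)
    preds = []
    for _ in range(steps):
        x_next = h_fn(seq)   # one-step-ahead prediction
        preds.append(x_next)
        seq.append(x_next)   # feed the prediction back in
    return preds

# Toy stand-in predictor: average of the last two spatial signals.
toy = lambda seq: 0.5 * (seq[-1] + seq[-2])
out = rolling_forecast(toy, [np.ones((2, 3, 4)), np.ones((2, 3, 4))], 3)
```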

B. BUILDING BLOCK I: CONVOLUTIONAL NEURAL NETWORKS
This section describes the convolutional neural network that searches for, extracts, and synthesizes spatial features from the traffic input data for traffic flow estimation. The model here makes explicit use of the neighbouring-lane and upstream/downstream spatial structure of the traffic, but does not consider the temporal evolution of the traffic. Details of the convolutional neural network module are described as follows.
The traffic input is denoted by an n × c × h × w tensor. Here n is the number of observed time steps, c = 2 represents the two variables (flow and speed), h = 125 is the number of detector stations spanning the predictor field, and w = 4 is the number of freeway lanes. The CNN module operates on the 3D traffic data at each time step. As illustrated in Figure 2, the CNN applies a set of convolution kernels to scan the c × h × w input snapshot. The kernels are c' × c × a × b tensors composed of trainable parameters. Each convolution operation computes the dot product between the kernel and a particular input patch, followed by shifting and an element-wise non-linearity:

Y_{p,q}^{c'} = f(W^{c'×c×a×b} · X_{p,q}^{c×a×b} + b). (4)

Here the upper index labels the variable's dimension and the lower index labels the position. X_{p,q}^{c×a×b} represents a c × a × b input patch around location (p, q) in the h × w input field. (p, q) shifts as we scan the input with W^{c'×c×a×b} by a pre-defined stride. The scanning is performed by element-wise dot products between X_{p,q}^{c×a×b} and the convolution kernel W^{c'×c×a×b}. a × b is called the receptive field of the kernel. The result is further transformed by adding a bias vector b, followed by a non-linear transformation f. The final convolution result is a c'-dimensional vector Y_{p,q}^{c'} at location (p, q). Equation 4 can be interpreted as applying c' learnable filters that seek salient features from the input field while maintaining its spatial structure.
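A minimal sketch of the convolution in Equation 4, written naively in NumPy. Real implementations use optimized library kernels, and the ReLU nonlinearity here is an assumed choice for f:

```python
import numpy as np

def conv2d_step(X, W, bias, stride=1):
    """Naive Equation-4 convolution: slide a c' x c x a x b kernel over
    a c x h x w input, taking a full dot product at each location."""
    c_out, c_in, a, b = W.shape
    c, h, w = X.shape
    assert c == c_in
    H = (h - a) // stride + 1
    Wd = (w - b) // stride + 1
    Y = np.empty((c_out, H, Wd))
    for p in range(H):
        for q in range(Wd):
            patch = X[:, p*stride:p*stride + a, q*stride:q*stride + b]
            # Contract the c x a x b patch against every output filter.
            Y[:, p, q] = np.tensordot(W, patch, axes=3) + bias
    return np.maximum(Y, 0.0)  # ReLU as the nonlinearity f (assumption)
```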

C. BUILDING BLOCK II: RECURRENT NEURAL NETWORK
The temporal connections in the traffic information can be extracted using a recurrent neural network (RNN) module after spatial feature extraction by the CNN. An RNN can take advantage of all the available input information up to the current time: the final output depends not only on the current input but also on the output of the previous hidden layer. Basics of RNNs are briefly introduced below.
RNNs are neural networks with hidden state variables:

h_t = f(h_{t-1}, X_t; θ), (5)

where h_t denotes the hidden state variable at time t, and f is the transition function that updates the hidden state h_t based on the previous state h_{t-1} and the step-t input X_t. θ denotes the parameter vector. For instance, in a vanilla RNN, the state transition function takes the parametric form

h_t = tanh(W_h h_{t-1} + W_x X_t + b). (6)

We can unfold Equation 5 along time by iteratively applying it to itself:

h_t = f(f(... f(h_0, X_1; θ) ..., X_{t-1}; θ), X_t; θ). (7)

Gradient-based training of RNNs has been recognized as difficult for basic recurrent architectures such as Equation 6. The difficulty originates from the fact that the partial derivatives of the loss function L with respect to the parameters tend to vanish or blow up as error signals flow backwards in time [42], [43]. Although it is possible to clip the gradients as they blow up [62], a vanishing gradient prevents effective learning of long-term dependencies. Here we adopt the Long Short-Term Memory (LSTM) recurrent network [63] to mitigate the vanishing gradient problem.
A typical LSTM unit contains an input gate i, a forget gate f, an output gate o, and two hidden states in its computational graph: the conventional hidden state vector h_t and a memory vector c_t. The LSTM adopts four interacting modules to read from, write to, or reset the memory vector c_t:

(i, f, o, g) = (σ, σ, σ, tanh)(W · (x_t, h_{t-1})), (8)

where σ/tanh is the element-wise sigmoid/hyperbolic-tangent function that squashes each element of a vector to (0, 1)/(-1, 1). Assuming an input vector x of dimension p, and a hidden state vector h as well as a memory vector c of dimension q, W is thus a (4q) × (p + q) transition matrix, and i, f, o, and g are q-dimensional vectors. With these modules defined, the memory vector c_t and hidden state h_t are updated as follows:

c_t = f ⊙ c_{t-1} + i ⊙ g, (9)
h_t = o ⊙ tanh(c_t), (10)

where ⊙ indicates the element-wise product. Equations 9 and 10 tell how i/g, f, and o work as binary gates that control whether the memory cell c_t is updated, whether it is reset to zero, and whether its local state is revealed in the hidden vector, respectively [64]. The memory cell c_t is updated by two mechanisms that cooperate in an additive manner. The first tells how c_t maintains memory from the past state, described by f ⊙ c_{t-1}; the second tells how to update c_t based on the input x_t and the past hidden state h_{t-1}, described by i ⊙ g. The additive interaction of the two mechanisms allows the error gradient on c_t to be distributed through time without vanishing or blowing up, thus enabling the learning of long-term dependencies [64]. We use the LSTM to relate the extracted spatial features to the corresponding traffic flow time sequences. Specifically, we start by mapping a CNN over the time series of predictors. The extracted spatial feature time series is then used as input to the LSTM for flow estimation.
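The gate equations translate directly into a single-step LSTM cell. This NumPy sketch follows the (4q) × (p + q) transition-matrix formulation above; the gate ordering within W and the bias handling are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, bias):
    """One LSTM step: W is the (4q) x (p + q) transition matrix;
    the gates i, f, o and candidate g are sliced from one product."""
    q = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + bias
    i = sigmoid(z[0:q])        # input gate
    f = sigmoid(z[q:2*q])      # forget gate
    o = sigmoid(z[2*q:3*q])    # output gate
    g = np.tanh(z[3*q:4*q])    # candidate update
    c = f * c_prev + i * g     # additive memory update (Equation 9)
    h = o * np.tanh(c)         # revealed hidden state (Equation 10)
    return h, c
```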

D. ENTIRE ARCHITECTURE: CNN-LSTM
The high-level architecture of our model is shown in Figure 2. The model works by chaining the modules introduced in the previous sections. Specifically, at each 5-minute time interval (one time step), we have an observation represented as a c × h × w tensor. c denotes the features applied to characterize the traffic status; here 2 features are adopted, namely traffic speed and flow. h denotes the number of stations and w the number of lanes. To extract the spatiotemporal features, a convolution kernel of size c_2 × c_1 × a_1 × b_1 is applied to scan the observation tensor with a receptive field of a_1 × b_1. Each convolution operation results in c_2 local features. The output maintains the spatial alignment of the lanes and the downstream/upstream traffic. Multiple convolutional layers are stacked: higher-level convolutions operate on the output of lower-level convolutions, enabling the model to capture multi-scale spatial patterns. An LSTM recurrent module is applied to chain the extracted spatial features, as shown by the green arrows. At each time step, the model predicts the 5-minute-ahead traffic flow and speed based on (1) the current traffic flow and speed, and (2) the hidden state h_t from the LSTM module.
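To make the data flow concrete, here is a toy, untrained forward pass chaining a convolution, an LSTM step, and a linear readout in the synced sequence-to-sequence manner. The global average pooling is our simplification (the paper stacks convolutions and keeps spatial alignment), and all dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for the paper's setting.
c, h, w, q = 2, 8, 4, 16      # attributes, stations, lanes, LSTM width

# One illustrative (untrained) 3x3 convolution kernel bank.
K = rng.standard_normal((4, c, 3, 3)) * 0.1

def conv_features(X):
    """3x3 convolution with ReLU, then global average pooling,
    yielding a fixed-length spatial feature vector per time step."""
    c_out, _, a, b = K.shape
    Y = np.empty((c_out, h - a + 1, w - b + 1))
    for p in range(Y.shape[1]):
        for r in range(Y.shape[2]):
            Y[:, p, r] = np.tensordot(K, X[:, p:p+a, r:r+b], axes=3)
    return np.maximum(Y, 0).mean(axis=(1, 2))

W_lstm = rng.standard_normal((4*q, 4 + q)) * 0.1
b_lstm = np.zeros(4*q)
W_out = rng.standard_normal((c*h*w, q)) * 0.1

def forward(sequence):
    """Synced sequence-to-sequence pass: at every step the model emits
    a one-step-ahead prediction of the full c x h x w field."""
    hid, cell = np.zeros(q), np.zeros(q)
    preds = []
    for X in sequence:
        feat = conv_features(X)
        z = W_lstm @ np.concatenate([feat, hid]) + b_lstm
        i, f, o = (1/(1 + np.exp(-z[k*q:(k+1)*q])) for k in range(3))
        g = np.tanh(z[3*q:])
        cell = f * cell + i * g
        hid = o * np.tanh(cell)
        preds.append((W_out @ hid).reshape(c, h, w))
    return preds
```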

IV. EXPERIMENTS

A. DATASET
We use traffic data collected from the Caltrans Performance Measurement System (PeMS). The data are collected every 30 seconds from over 35,000 individual single-lane loop detectors deployed statewide in freeway systems across California. The collected data are aggregated to 5-minute intervals for each detector station, so one detector yields 288 data points per day. We select District 4 (US I-101 North), from 00:00 January 1st to 23:55 January 31st and from 00:00 October 1st, 2018 to 23:55 June 30th, 2019, covering 125 sensors (detectors), as experiment samples. We use the average flow and average speed of each lane in our experiments. The flow is the number of vehicles that crossed the detector during the time period. The speed is the average speed (km/h) at which traffic is traveling, calculated from the flow and occupancy data using the g-factor (effective vehicle length). The detailed algorithm for speed estimation can be found in [65].
To offer an impression of the data, the average traffic flow and speed for each lane at the 125 stations of freeway US I-101 North during October 1st, 2018 to June 30th, 2019 are given in Figure 3. The average flow/speed values are labeled with circles of corresponding size. The figure shows that the outer lane has relatively low flow and low speed, lane 2 has relatively high flow, and the inner lane has relatively high speed, indicating that different lanes show distinct traffic behavior and that it is reasonable to forecast traffic at the lane level.

B. MODEL IMPLEMENTATION 1) DATA PREPARATION
We start by structuring the observational data into the format n × c × h × w, where n denotes the total time steps, c the observed attributes, h the station numbers, and w the lane numbers. It is worth noticing that the station sequence of the data should be rearranged based on the real-world station alignment. After data re-aligning, the flow and speed data are scaled to [0, 1] using min-max normalization. Thereafter, we divide the data into non-overlapping training/validation/test sets (61 days × 288 5-min observations/day). The training/validation data are shuffled on a daily basis, with 80%/20% used for model training/validation. The training set is directly applied to optimize the model, while the validation set is frequently evaluated along the training process for model hyperparameter tuning and for preventing overfitting.

2) NETWORK ARCHITECTURE AND HYPERPARAMETERS

Figure 2 offers a high-level illustration of the network model. The network architecture and model hyperparameters need to be further specified. Here we rely on empirical experiments to determine them. For implementation convenience, we only consider equal convolution kernels (same channel size and receptive field) for the CNN. We try different network architectures, learning rates, and training iterations to decide the optimal setting. We admit that the results could be significantly different with alternative architectures. The specific network architecture and hyperparameter options are listed in Table 2.
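The data preparation steps of min-max scaling to [0, 1] and daily-based shuffling into training/validation sets can be sketched as follows; the test-set carve-out is omitted, and `steps_per_day` defaults to the 288 5-minute observations per day:

```python
import numpy as np

def prepare(T, train_frac=0.8, steps_per_day=288, seed=0):
    """Min-max scale each attribute to [0, 1], then shuffle whole days
    and split into training/validation sets (sketch of the daily-based
    shuffling described in the text)."""
    lo = T.min(axis=(0, 2, 3), keepdims=True)
    hi = T.max(axis=(0, 2, 3), keepdims=True)
    scaled = (T - lo) / (hi - lo + 1e-9)
    # Group consecutive time steps into whole days before shuffling.
    days = scaled.reshape(-1, steps_per_day, *T.shape[1:])
    idx = np.random.default_rng(seed).permutation(len(days))
    cut = int(train_frac * len(days))
    return days[idx[:cut]], days[idx[cut:]]
```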

3) MODEL TRAINING
The model is trained using a mini-batch gradient descent strategy. Each mini-batch is composed of B truncated traffic sequences, where B denotes the batch size, set to 10 here. We apply a truncation length of 1 day (288 5-minute steps). For each sequence in a mini-batch, the model error is calculated at each time step based on (1) the state updated from the previous step (for the first time step, the state is randomly initialized), and (2) the network activation for the current step. Then, we calculate and back-propagate the model error gradients for each time step, based on the current-step error and the next-step gradient [66]. The gradients for each time step are averaged. This calculation is carried out for all B sequences, and the averaged gradients are accumulated to update the model parameters. The learning rate r controls how much the parameters are adjusted along the gradient direction. The considered r options are listed in Table 2.
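The mini-batch scheme, with gradients averaged across the B truncated sequences of a batch before each parameter update, can be sketched as follows; `grad_fn` is a placeholder for the backpropagation-through-time computation, which is not shown:

```python
import numpy as np

def train_epoch(params, sequences, grad_fn, lr=1e-3, batch_size=10):
    """Mini-batch gradient descent sketch: gradients from each truncated
    sequence in a batch are averaged before the parameter update.
    grad_fn(params, seq) is assumed to return d(loss)/d(params)."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(sequences))
    for start in range(0, len(order), batch_size):
        batch = [sequences[i] for i in order[start:start + batch_size]]
        grads = [grad_fn(params, seq) for seq in batch]
        # One update per mini-batch, using the averaged gradient.
        params = params - lr * np.mean(grads, axis=0)
    return params
```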

4) MODEL IMPLEMENTATION DETAILS
We apply the Wolfram Mathematica 12.0 [67] deep learning library to implement the model. This library offers off-the-shelf network building blocks and can automatically differentiate the network, enabling efficient and effective computation. We split the training process across 3 NVIDIA Titan X (Pascal) GPUs using the Compute Unified Device Architecture (CUDA) library [68].

C. PERFORMANCE METRICS
The following four widely used performance metrics are used to evaluate the models:

MAE = (1/N) Σ_{i=1}^{N} |y_obser,i − y_simu,i|, (11)

RMSE = sqrt((1/N) Σ_{i=1}^{N} (y_obser,i − y_simu,i)^2), (12)

NSE = 1 − Σ_{i=1}^{N} (y_simu,i − y_obser,i)^2 / Σ_{i=1}^{N} (y_obser,i − ȳ_obser)^2, (13)

CORR = cov(y_obser, y_simu) / (σ_y_obser σ_y_simu). (14)

Here y_obser is the observation and y_simu is the simulation. MAE denotes the mean absolute error. RMSE denotes the root mean square error. The Nash-Sutcliffe Efficiency (NSE) is a normalized square error that expresses the square error relative to an average-value guess: NSE = 1 for perfect predictions, and NSE = 0 for non-informative predictions. CORR is the Pearson correlation coefficient. We carry out the evaluation at different spatiotemporal scales, as clarified in the following sections.
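The four metrics translate directly into NumPy, assuming the sample-wise definitions given above:

```python
import numpy as np

def mae(obs, sim):
    """Mean absolute error."""
    return np.mean(np.abs(obs - sim))

def rmse(obs, sim):
    """Root mean square error."""
    return np.sqrt(np.mean((obs - sim) ** 2))

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 for perfect predictions,
    0 when the model is no better than predicting the mean."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def corr(obs, sim):
    """Pearson correlation coefficient over all samples."""
    return np.corrcoef(obs.ravel(), sim.ravel())[0, 1]
```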
D. RESULTS

Figure 4 shows the training progress. Results here are based on the model configuration that achieved optimal performance on the validation set. The red/green/blue lines represent the mean square error as a function of training epoch for the training/mini-batch/validation set. The results show that the model has similarly satisfying prediction skill for most stations, with NSE above 0.9 and CORR above 0.95. We also notice that the intervals around Station 40 and Station 90 show poorer performance compared to the remaining stations. For these two intervals, the outer lane is predicted worse than the inner lanes. By projecting the stations back onto the geographical map, we found that these regions are where considerable traffic flows merge into I-101 North from the two bay bridges: Station 40 connects US I-101 and the Dumbarton Bridge, and Station 90 connects US I-101 and the San Mateo-Hayward Bridge. We attribute the poor performance to the fact that the model is targeted at long homogeneous freeway sections and does not explicitly consider traffic merging from outside flows.

The analysis above is based on results for the entire test period. Below we inspect the model's performance under typical traffic scenarios. Extreme weather, especially flash floods related to land-falling atmospheric river events, may pose severe challenges to the local traffic in our study area [69]. Besides, we may expect distinct traffic patterns on holidays and weekends, as compared to weekdays [70] (Figure 6). Considering the prediction for different lanes, both the traffic flow and speed predictions for Lane 4 (outer lane) show slightly worse performance compared to the predictions for the remaining lanes. The correlation skill difference is significant at the 95% confidence level, based on a z-statistic test. We believe the relatively poor performance for the outer lane is due to the frequent traffic loading and releasing from outside traffic systems.
Considering the differences among scenarios, we find that weekdays feature heavy morning traffic, while weekends and holidays show a slower increase of traffic load in the morning and stronger persistence throughout the daytime. The heavy rain on January 8th, 2018 significantly slowed down the traffic speed and flow. Considering the model performance for these scenarios, we find that traffic flow predictions show no significant correlation skill difference between weekdays and weekends/rainy days/holidays at the 95% confidence level, based on a z-statistic test.

E. APPLYING MULTIPLE FEATURES TO CHARACTERIZE TRAFFIC DYNAMICS
In this part, we verify the benefits of considering multiple aspects of traffic conditions for characterizing traffic dynamics. Ma et al. [27] have demonstrated the advantage of including traffic volume information as a supplementary predictor to improve traffic speed prediction. However, their models predict only traffic speed, not traffic volume. With no traffic volume being predicted, the models do not have sufficient input to support predictions beyond one time step. Here we highlight the following two aspects: 1) including both traffic flow and speed information as predictors enhances the prediction accuracy for both of them; 2) predicting flow and speed simultaneously helps to regularize the model, and may enhance the prediction accuracy for each of the two variables. To test these two points, we design the following two control experiments:

Experiment 1.1: We build a Convolutional LSTM neural network model that uses only traffic flow or speed as predictor and predictand. The results are compared against the model proposed in Section III, which takes both variables as input.

Experiment 1.2: We build a Convolutional LSTM neural network model that uses both traffic flow and speed as predictors, but predicts only traffic flow or speed. The results are compared against the model proposed in Section III, which predicts both variables simultaneously.
We follow architecture and hyperparameter settings similar to the benchmark model in Section III. Model modifications are made only to realize the control experiment settings described above. Figure 8 shows the training progress for the models in Experiments 1.1 and 1.2. For comparison purposes, we draw the validation loss of our benchmark model along with the training, mini-batch, and validation losses of the control models. Our benchmark model outperforms the comparison model in Experiment 1.1 for both speed and flow prediction. In Experiment 1.2, the benchmark model and the comparison model show a negligible difference.

The model performances at the lane scale and the multi-lane-average scale are summarized in Table 3. The benchmark model outperforms the control model in Experiment 1.1 for predicting flow and speed at both the lane scale and the lane-average scale. This evidence suggests that by including extra traffic features, we can increase the prediction accuracy for both features. The skill improvement is more evident for flow prediction than for speed prediction. For Experiment 1.2, the benchmark model outperforms the control model in predicting flow. This suggests that predicting traffic speed helps to regularize the traffic flow prediction task. On the other hand, the control model usually predicts speed with higher accuracy, indicating that predicting traffic flow as a regularization method does not contribute consistently to predicting traffic speed.
Besides regularization, another benefit of predicting traffic flow and speed simultaneously is that we can make n-step-ahead predictions (n ≥ 2) based on the 1- up to (n − 1)-step predictions. This is achieved by sequentially feeding the 1- up to (n − 1)-step predictions back as input to the model. Below we draw the prediction skill versus lead time based on the models' rolling predictions. The models considered here are the benchmark model (red), the control model in Experiment 1.1 (green), the control model in Experiment 1.2 (blue), and persistency (black). The persistency prediction uses the current step status as the prediction.
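The rolling-prediction procedure above can be sketched as a simple feedback loop. The `toy_model` below is a stand-in (it just predicts the per-lane window mean), not our trained network; only the sliding-window mechanics are the point:

```python
import numpy as np

def rolling_predict(model, history, n_steps):
    """n-step-ahead prediction: feed each one-step prediction
    back into the sliding input window (rolling prediction)."""
    window = list(history)
    preds = []
    for _ in range(n_steps):
        y = model(np.asarray(window))     # one-step-ahead prediction
        preds.append(y)
        window = window[1:] + [y]         # drop oldest step, append prediction
    return preds

# Toy stand-in for the trained network: predicts the window mean per lane.
toy_model = lambda w: w.mean(axis=0)

history = [np.full(4, v) for v in (1.0, 2.0, 3.0)]   # 3 steps, 4 lanes
preds = rolling_predict(toy_model, history, n_steps=2)
print(preds[0])   # per-lane mean of the 3 history steps: [2. 2. 2.]
```

Because each later step consumes earlier predictions instead of observations, prediction errors compound with lead time, which is why skill decays as the lead time grows.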
As expected, all models show a decrease in prediction skill with lead time. For both flow and speed predictions, the benchmark model and the control model in Experiment 1.2 perform best for lead times from 10 minutes up to 55 minutes. It is worth noticing that the control model in Experiment 1.2 requires both traffic flow and speed as predictors while outputting only one of them, so its rolling prediction is achieved by combining two models, each predicting flow or speed separately. The control model in Experiment 1.1 makes slightly worse flow predictions, but still outperforms persistency for all the considered lead times. For speed prediction, the disadvantage of the control model in Experiment 1.1 is more obvious, with skill falling far behind the benchmark model and the control model in Experiment 1.2 after a lead time of 30 minutes. Moreover, the skill of the control model in Experiment 1.1 drops below even persistency after around 45 minutes. This rapid decline of prediction skill is attributed to the fact that prediction errors can grow quickly in a non-linear model. Overall, these results suggest that the proposed model can make satisfactory multi-step rolling predictions.

F. USING INFORMATION FROM NEIGHBOURING LANES
In this part, we investigate how traffic evolution is influenced by neighbouring lanes. Considering the traffic conditions of different lanes at the same freeway cross-section, researchers have long noticed that drivers tend to use neighbouring lanes to buffer traffic congestion. This observation suggests that to accurately predict traffic, we need to carefully model the routing between neighbouring lanes. We design the following experiment to highlight the importance of considering neighbouring lanes for traffic prediction:

Experiment 2
We build a 1-D Convolutional LSTM neural network model that takes as input and predicts the traffic flow and speed at the lane-average scale. By averaging the flow and speed over the four lanes, the model does not consider the routing between neighbouring lanes. The results are compared against the model proposed in Section III, which explicitly predicts the traffic evolution for all four lanes.
As in Experiments 1.1 and 1.2, we follow similar architecture and hyperparameter settings as the benchmark model in Section III. The only difference is that the input/output are at the lane-average scale rather than per lane. Figure 10 shows the training progress for the control model in Experiment 2, and the flow and speed prediction skills are summarized in Table 3. For comparison, the lane-level predictions from the remaining models are averaged and evaluated. We find that the benchmark model and the control model in Experiment 1.2 offer better flow and speed predictions than the control model in Experiment 2. This evidence suggests that by explicitly modeling the routing between neighbouring lanes with the proposed architecture, we can improve traffic prediction accuracy.
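The lane-average input used by the Experiment 2 control amounts to collapsing the lane axis of the lane-level tensor. A minimal sketch with illustrative numbers:

```python
import numpy as np

# x: (time, lanes, features) with features = (flow, speed), as in the
# lane-level model; the Experiment 2 control collapses the lane axis.
x = np.array([[[100., 65.],    # lane 1: flow, speed
               [ 80., 60.],    # lane 2
               [ 90., 62.],    # lane 3
               [ 70., 58.]]])  # lane 4  -- one time step shown

lane_avg = x.mean(axis=1)      # (time, features): lane-average input
print(lane_avg)                # [[85.   61.25]]
```

A plain mean over lanes is used here to match the averaging described above; a flow-weighted speed average would be a possible alternative, but either way the per-lane routing information is lost.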

G. USING DYNAMICAL TRAFFIC INFORMATION
In this part, we verify the importance of using dynamical traffic information for short-term traffic prediction. To this end, we carry out a control experiment that makes predictions based solely on the current time step's traffic status. Also, bearing in mind that there are various ways to encode dynamical traffic information besides the LSTM recurrent architecture employed here, we implement a comparison 3-D convolutional neural network model that applies both spatial and temporal convolutions. The specific experiment settings are as follows:

Experiment 3.1
We build a convolutional neural network model that uses only the current step traffic status (flow and speed) as predictor and predictand. The difference from the benchmark model in Section III is that we drop the LSTM recurrent structure. The results are compared against the benchmark model.

Experiment 3.2
We build a 3-D convolutional neural network model that first applies convolution along the spatial domain. The results for neighbouring time steps are then synthesized using a temporal convolution kernel. The results are compared against the model proposed in Section III.

Figure 11 shows the training progress for the models in Experiments 3.1 and 3.2. We draw the validation loss of our benchmark model along with the training, mini-batch, and validation losses of the control models. Performances of the models are summarized in Table 3.
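To make the Experiment 3.2 design concrete, the sketch below shows how a temporal convolution kernel synthesizes features from neighbouring time steps. The kernel weights are illustrative, not learned, and the single feature stands in for a spatial feature map:

```python
import numpy as np

# One spatial feature tracked over 5 time steps (illustrative values).
feat = np.arange(5, dtype=float)          # [0, 1, 2, 3, 4]

# Temporal kernel of width 3: each output mixes 3 neighbouring steps.
kernel = np.array([0.25, 0.5, 0.25])

# 'valid' temporal convolution along the time axis.
out = np.convolve(feat, kernel[::-1], mode="valid")
print(out)   # [1. 2. 3.]
```

Unlike the LSTM, which carries a state across arbitrarily long histories, such a kernel only sees a fixed temporal window, which is one way to understand the performance gap reported below.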
The benchmark model proposed in Section III, which considers previous step traffic data using LSTM recurrent network, outperforms both control models by a large margin. These results suggest that the Convolution LSTM network architecture can effectively utilize dynamical traffic information for better predictions.

H. COMPARING WITH CLASSICAL TIME SERIES MODEL
The previous sections used several baseline models to highlight the importance of explicitly considering local traffic patterns and traffic dynamics for traffic prediction, which our model achieves through a novel composition of convolutional and recurrent neural network modules. At the beginning of this paper, we argued that classical time series models usually make strong assumptions about traffic evolution, such as stationarity and linearity, which may not hold for real-world traffic patterns. To justify this criticism, we compare our model with a well-established time series model, applying the Auto Regressive Integrated Moving Average (ARIMA) model as the baseline.
The ARIMA model can be understood by outlining each of its three components:
1) Autoregression (AR)
Autoregression refers to a time series in which the variable depends linearly on its own previous values. The lag length is denoted by p.
2) Moving average (MA)
Moving average refers to a time series in which the variable depends linearly on the current and past values of a stochastic term. The lag length is denoted by q.

3) Integration (I)
Integration is the key component that distinguishes ARIMA from autoregressive moving-average (ARMA) models, which assume only the autoregressive and moving-average properties of a time series. Integration replaces data values with the differences between current and previous values, allowing the time series to be nonstationary. The degree of differencing is denoted by d. A detailed explanation can be found in [71]. We test and compare the performance of ARIMA for the same station (Station 400010) investigated in the previous sections. Different combinations of the p, d, and q parameters are tested. The ARIMA model is trained using bootstrapped data from the training set and tested on the same test set. The comparison results are shown in Table 4.
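The interplay of the components can be illustrated in miniature. The sketch below differences the series once (d = 1, the "I" component) and fits the AR part by least squares; the MA terms are omitted for brevity, so this is an illustration of the mechanics, not the ARIMA implementation we benchmarked:

```python
import numpy as np

def arima_forecast(series, p=1):
    """One-step forecast of an ARIMA(p, 1, 0)-style model in miniature:
    difference once, fit AR(p) by least squares, undo the differencing."""
    x = np.diff(np.asarray(series, dtype=float))     # d = 1 differencing
    # Lagged design matrix: diff_t ~ c + sum_i a_i * diff_{t-i}
    X = np.column_stack([np.ones(len(x) - p)] +
                        [x[p - i - 1:len(x) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    next_diff = coef[0] + coef[1:] @ x[-p:][::-1]    # AR forecast of next diff
    return series[-1] + next_diff                    # invert the differencing

# A linear trend is nonstationary, but its first difference is constant,
# so the differenced AR fit recovers the trend exactly.
series = np.arange(20, dtype=float) * 2.0            # 0, 2, 4, ..., 38
print(arima_forecast(series, p=1))                   # ~ 40.0
```

The linearity is explicit here: the forecast is an affine function of past values, which is precisely the assumption the nonlinear Convolutional LSTM model does not make.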
The results in Table 4 show that the CNN-LSTM model outperforms ARIMA for both traffic flow and speed predictions. We attribute this advantage to the fact that the deep neural network model better captures the non-linear evolution of traffic flow by explicitly exploiting local traffic patterns and traffic dynamics.

V. CONCLUSION
In this paper, we propose a novel Convolutional LSTM neural network architecture for multi-lane short-term traffic prediction. Compared to existing methods, we highlight the following aspects:
• The inclusion of multiple features for characterizing traffic conditions helps to improve prediction accuracy.
• By explicitly modeling the routing between neighbouring lanes and downstream/upstream traffics, we can significantly enhance traffic prediction.
• By applying a synced sequence-to-sequence recurrent architecture, we can satisfactorily predict multiple time-step traffic in a rolling-prediction manner.
We carry out experiments based on the PeMS dataset. The results show that our model has considerable advantages in predicting multi-lane short-term traffic.

ACKNOWLEDGMENT
The authors gratefully acknowledge the Caltrans Performance Measurement System for providing valuable daily traffic flow data, and the anonymous reviewers whose comments and suggestions helped to improve and clarify this manuscript.
ALEXANDER IHLER (Member, IEEE) received the B.S. degree (Hons.) from Caltech, in 1998, and the Ph.D. degree in electrical engineering and computer science from MIT, in 2005. He is currently a Professor with the Department of Computer Science, University of California Irvine, Irvine. His research focuses on machine learning, graphical models, and algorithms for exact and approximate inference, with applications to areas such as sensor networks, computer vision, data mining, and computational biology. He was a recipient of the NSF CAREER Award and several best paper awards at conferences, including NIPS, IPSN, and AISTATS.