Bayesian Combination Approach to Traffic Forecasting With Graph Attention Network and ARIMA Model

To better capture the spatio-temporal characteristics and reduce unbalanced errors in short-term traffic prediction, an advanced Bayesian combination model with graph neural network (ABCM-GNN) is proposed. A new ABCM framework involving an error correction mechanism is established, based on the analysis of the distance correlation between historical and current traffic volumes. Two sub-predictors, built respectively on the graph attention gated recurrent unit (GAGRU) network, which captures the spatial correlations of the road network, and the autoregressive integrated moving average (ARIMA) method, are incorporated into the ABCM framework to enhance its strength and capability. The effectiveness and superiority of the proposed model are demonstrated in various scenarios with experiments conducted on real-time traffic data collected on California freeways. The overall results show that the ABCM-GNN with ARIMA method is superior to state-of-the-art methods in terms of precision and stability.


I. INTRODUCTION
Traffic flow forecasting is one of the core problems in intelligent transportation systems. Accurate real-time traffic prediction can help maximize the utilization of road network capacity, improve traffic efficiency and safety, and optimize traffic distribution, thereby alleviating congestion and reducing air pollution [1], [2], [3], [4]. There are various methods for short-term traffic forecasting, e.g., statistical models, machine learning methods, and big data-driven deep learning methods, which can be broadly classified into three types: parametric, non-parametric, and hybrid methods.
Typical statistical models for traffic prediction include the autoregressive integrated moving average (ARIMA) approach [5] and its variants such as seasonal ARIMA and space-time ARIMA [6]. Kalman filters are also a powerful statistical method for traffic prediction. Machine learning methods used for traffic forecasting include hybrid wavelet analysis [7], support vector regression models [8], neural network models [9], and others. Statistical models require prior knowledge, whereas traditional machine learning methods cannot deal with the complicated spatio-temporal dependencies of road traffic networks.
In order to better extract traffic features from voluminous data, deep learning models, as an advanced non-parametric method, have been increasingly utilized for traffic prediction [10]. Long short-term memory (LSTM) networks [11], [12] and their variant, gated recurrent units (GRU) [13], [14], were used to capture the sequential dynamics of traffic data over time. Hybrid models combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [15], [16], [17], [18] were utilized to explore spatio-temporal dependencies and improve traffic forecasting performance.
The performance of deep learning models depends on large sets of high-quality traffic data, and perturbed data may produce inaccurate or even erroneous predictions [19], [20]. In recent years, fusion models have been used to take advantage of all models involved and improve prediction performance and stability [21], [22], [23], [24]. For example, Zheng et al. [25] presented a Bayesian combination method (BCM) for traffic forecasting based on two neural network predictors. A BCM framework was given in [26], using gray correlation analysis to integrate the outputs of a back-propagation neural network, ARIMA, and a Kalman filter, to deal with the long operation period and insensitivity to prediction error fluctuations. In particular, an improved BCM method was associated with the deep learning model GRU in [27] to address the error amplification phenomenon and improve prediction performance.
It is worth noting that the above state-of-the-art techniques still leave much room for improvement. i) The deep learning networks used cannot satisfactorily capture the spatial characteristics of complex road networks, resulting in poor prediction performance. ii) The existing Bayesian combination models cannot deal with unbalanced sub-model errors of non-identical sign (i.e., negative vs. positive errors). iii) The traffic data sequences are nonlinearly correlated, and this correlation is ignored in determining the key parameters of the Bayesian combination method.
In this paper, we present an advanced Bayesian combination method (ABCM) in association with a graph attention gated recurrent unit network (or graph neural network, GNN) for traffic prediction. The model, denoted ABCM-GNN, adopts cointegration and error correction to correct short-term unbalanced errors through long-term cointegration in the Bayesian combination model. In the deep learning component, we use a graph attention gated recurrent unit (GAGRU) network to effectively capture the temporal and spatial characteristics of the traffic network. The ABCM-GNN integrates the sub-predictors GAGRU and ARIMA, the latter of which deals with the nonlinear characteristics of short-term traffic dynamics. The GAGRU model, which combines GAT and GRU, can handle massive data and capture the spatial characteristics of complex traffic networks. Meanwhile, the time interval correlation parameters are obtained by distance correlation.
The main contributions are as follows: 1) An advanced Bayesian combination method (ABCM) is proposed based on cointegration and error correction, which can effectively combine the sub-predictors and reduce short-term unbalanced errors. 2) By combining an ARIMA model and a GAGRU neural network using this ABCM, the nonlinear characteristics and spatial correlations of the road network can be effectively captured. 3) Distance correlation analysis is introduced to capture the nonlinear temporal correlations in the time series data, which also reduces the computation time of the algorithm. The remaining parts of this article are organized as follows. Section II gives a detailed description of the ABCM-GNN model. The sub-predictors are introduced in Section III. Section IV presents the validation data, evaluation criteria, the model implementation process, and the experimental results, followed by the conclusion and future work in Section V.

II. METHODOLOGY
This section presents a description of the ABCM-GNN model. The structure of ABCM-GNN is shown in Fig. 1. First, the correlation analysis is performed using the distance correlation coefficient. Next, the cointegration analysis and error correction model are presented. Then, we propose a new combination framework named ABCM. Finally, the process of implementing the ABCM-GNN method is presented in detail.

A. CORRELATION ANALYSIS
Short-term traffic volume forecasting is a typical time-series prediction problem. The time interval is the unit of time in which the sensor collects traffic flow data. As in the NBCM model, the traffic volume in the prediction interval t is assumed to be strongly related to the historical traffic flow [26]. Let Z denote the collection of time periods, Z = {t − 1, t − 2, . . . , t − z}. We present a distance correlation analysis approach to obtain the correlation between present and historical traffic volumes [28].
Assume that y_t is the traffic flow in a time interval, influenced by the J previous traffic flows (y_{t−1}, y_{t−2}, . . . , y_{t−J}), where J is set sufficiently large to include most related time intervals. Let Y_t = {y_t(l) | l ∈ E} denote the target sequence of traffic volumes to be predicted, and Y_{t−z} = {y_{t−z}(l) | l ∈ E} an alternative series whose data lie z time intervals before the corresponding data in Y_t, where E = {1, 2, . . . , L} and L is the length of the traffic volume series. Since the sequences Y_t, Y_{t−1}, . . . , Y_{t−J} are derived from the same traffic data series, the relation between the alternative series Y_{t−z} and the target series Y_t can be calculated by the distance correlation coefficient in (1), where r(t−z) denotes the correlation between Y_t and Y_{t−z}.
The size of the set Z, denoted R(Z), is formulated in (2), where δ ∈ [0, 1] is a threshold parameter that determines the dimensionality of the set Z.
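As a minimal sketch of this correlation analysis, the empirical distance correlation (Székely et al.) can be computed from double-centered pairwise distance matrices, and the lag set Z can then be selected by thresholding; the function names and the exact selection rule (keep lags whose correlation exceeds δ) are our illustrative assumptions, not the paper's equations (1)–(2):

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation between two 1-D series (in [0, 1])."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                      # pairwise distance matrices
    b = np.abs(y - y.T)
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                   # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

def select_interval_set(series, max_lag, delta):
    """Keep the lags z whose distance correlation with the target exceeds delta."""
    series = np.asarray(series, dtype=float)
    target = series[max_lag:]
    kept = []
    for z in range(1, max_lag + 1):
        shifted = series[max_lag - z:len(series) - z]   # series lagged by z
        if distance_correlation(shifted, target) >= delta:
            kept.append(z)
    return kept
```

Unlike Pearson correlation, the distance correlation is zero only for independent variables, which is why it can capture the nonlinear temporal dependencies mentioned above.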

B. ERROR CORRECTION MECHANISM
Traffic flow series are typical time series. Treating a non-stationary time series as a stationary one in regression analysis can lead to spurious regressions, in which variables that have no real correlation appear correlated in the regression results. According to cointegration theory, if several non-stationary variable sequences are integrated of the same order and some linear combination of them is stationary, a long-term equilibrium (cointegration) relationship exists between these series. Let the sequences of two variables x_t and y_t be first-order integrated processes, y_t ∼ I(1) and x_t ∼ I(1), and suppose there exists a constant such that

y_t − α_1 x_t ∼ I(0).
Then the two non-stationary time series are said to be cointegrated. According to Granger and Siklos [29], if a cointegration relationship exists between two non-stationary time series, these variables admit an error correction representation. When the stable relationship between the series is disturbed in the short term, i.e., the variables deviate from the cointegration relationship, the error correction model introduces the cointegrating variables reflecting the long-term equilibrium into the dynamic equation and uses the long-term equilibrium error as the correction term for short-term fluctuations, which compensates for the shortcomings of traditional statistical analysis models. The error correction model combines the long-term equilibrium relation with the short-term disequilibrium state to improve the stability of the forecasting model. The long-term equilibrium relationship between x_t and y_t is

y_t = α_0 + α_1 x_t + ε_t.

The short-term disequilibrium relationship is

Δy_t = β_1 Δx_t + ε_t.

The error correction expression is

Δy_t = β_1 Δx_t + λ · ecm_{t−1} + ε_t,   ecm_{t−1} = y_{t−1} − α_0 − α_1 x_{t−1},

where ecm_{t−1} represents the error correction term; α_0, α_1 are the long-term reaction parameters; β_1, λ are the short-term reaction parameters; and ε_t is the residual.

C. ADVANCED BAYESIAN COMBINATION MODEL
For a certain time interval t, usually only the n-th sub-model with the highest prediction accuracy is selected as the best model. The posterior probability is obtained as

p^n_t = P(U = n | y_t, y_{t−1}, . . . , y_1),

where p^n_t is considered as the weight of the n-th predictor at time period t, and N denotes the number of component predictors.
Based on Bayes' rule, the following recursion can be obtained:

p^n_t = P(y_{t−1} | U = n, y_{t−2}, . . . , y_1) p^n_{t−1} / Σ_{m=1}^{N} P(y_{t−1} | U = m, y_{t−2}, . . . , y_1) p^m_{t−1}.   (11)

Assume that the sub-predictor error obeys Gaussian white noise, e^n_t = (y_t − y^n_t) ∼ N(0, σ_n); then we have:

P(y_t | U = n, y_{t−1}, . . . , y_1) = P(e^n_t = y_t − y^n_t | U = n, y_{t−1}, . . . , y_1) = (1 / (√(2π) σ_n)) exp(−(y_t − y^n_t)² / (2σ_n²)),   (12)

where e^n_t denotes the forecasting error of the n-th predictor over the time interval, and y^n_t is the traffic forecast of the n-th predictor at time interval t.
We combine (5) and (10)–(12) to obtain (13). Recursively expanding p^n_{t−1}, p^n_{t−2}, . . . , p^n_1 in (13) yields (14). Eq. (14) rests on the unrealistic assumption that the traffic volume in the forecast interval is correlated with all past traffic volumes. As noted in previous studies [15], [16], traffic volumes are susceptible to disturbances from the external environment, especially during peak hours. Usually only the traffic volumes of the last few periods are strongly correlated with the traffic volume of a given forecast period: the smaller the time lag, the greater the influence on the current flow.
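The recursion above, prior weight times Gaussian likelihood of each sub-predictor's latest error, renormalized, can be sketched as follows; the function names are ours, and the error variances σ_n are treated as known inputs for illustration:

```python
import numpy as np

def update_weights(weights, y_true, preds, sigmas):
    """One Bayesian weight update: posterior ∝ prior × Gaussian likelihood
    of each sub-predictor's error, then renormalized to sum to one."""
    weights = np.asarray(weights, dtype=float)
    preds = np.asarray(preds, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    errors = y_true - preds
    # Gaussian likelihood of each sub-predictor's error, as in (12)
    lik = np.exp(-errors ** 2 / (2.0 * sigmas ** 2)) / (np.sqrt(2.0 * np.pi) * sigmas)
    post = weights * lik
    return post / post.sum()

def combine(weights, preds):
    """Combined forecast: weighted linear combination of sub-predictor outputs."""
    return float(np.dot(np.asarray(weights, float), np.asarray(preds, float)))
```

Starting from uniform weights, the sub-predictor with smaller recent errors accumulates a larger share of the combined forecast at each step.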
Therefore, the weight p^n_t of the n-th predictor is obtained as in (15), by restricting the product in (14) to the correlated interval set Z. The prediction result is then obtained as a linear combination of the outputs of the sub-predictors with their weights over the time interval [26]:

ŷ_{t+1} = Σ_{n=1}^{N} p^n_t y^n_{t+1}.   (16)

For combined prediction modeling with non-stationary time series, the effectiveness of the combination must be judged first. Since the traffic flow series are non-stationary, cointegration theory is adopted in the fused traffic volume forecasting, and cointegration is verified between each single forecasting series and the combined forecasting series. We test whether the predicted sequence ŷ_{t+1} is cointegrated with the m single-predictor outputs.
That is, we test whether the predicted sequence and the sub-predictor output sequences y^m_{t+1}, y^{m−1}_{t+1}, . . . , y^1_{t+1} have a cointegration relationship; the Engle-Granger two-step test [29] is used here.
In step one, the least squares method is used to estimate the regression

ŷ_{t+1} = α + β_i y^i_{t+1} + ε^i_{t+1},

where α̂ and β̂_i denote the estimated values of the regression coefficients, so that the estimated model residuals are

ε̂^i_{t+1} = ŷ_{t+1} − α̂ − β̂_i y^i_{t+1}.

In step two, the stationarity of ε̂^i_{t+1} is checked. If ε̂^i_{t+1} ∼ I(0), there is a cointegration relationship between the predicted sequence ŷ_{t+1} and the i-th output sequence in the sub-prediction period. If a cointegration relationship, and hence a long-term equilibrium relation, is found between each single traffic flow prediction series and the combined predicted series, these single prediction series can be used to establish an effective combined traffic flow prediction model.
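The two-step test above can be sketched in a few lines; the function names are ours, and step two is reduced to computing the Dickey-Fuller t-statistic on the residuals (in practice one would compare it against tabulated Engle-Granger critical values, roughly −3.4 at the 5% level for two series, which we only note in a comment):

```python
import numpy as np

def engle_granger_residuals(y_hat, y_i):
    """Step one: OLS regression y_hat = a + b*y_i + e; return the residuals e."""
    b, a = np.polyfit(y_i, y_hat, 1)         # slope b, intercept a
    return y_hat - (a + b * y_i)

def dickey_fuller_stat(resid):
    """Step two (sketch): t-statistic of rho in the regression
    Δe_t = rho * e_{t-1} + u_t. A value well below ~ -3.4 suggests the
    residuals are stationary, i.e. the two series are cointegrated."""
    e = np.asarray(resid, dtype=float)
    de = np.diff(e)
    lag = e[:-1]
    rho = np.dot(lag, de) / np.dot(lag, lag)  # least-squares slope, no intercept
    u = de - rho * lag
    se = np.sqrt(u.var(ddof=1) / np.dot(lag, lag))
    return rho / se
```

For a genuinely cointegrated pair (e.g., one random walk plus a scaled copy with small noise), the statistic is strongly negative; for two independent random walks it typically stays near zero.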
According to the Granger representation theorem, an error correction model can then be obtained, in which the first-order lag of y_t supplies the error correction term. Finally, the combined equation (16) is augmented with this error correction term. Through the cointegration between the sub-models and the prediction model, the long-term equilibrium correlation of the prediction time series is combined with the idea of error correction, and the final prediction is error-corrected to improve the prediction accuracy.

D. STEPS OF ABCM-GNN MODEL IMPLEMENTATION
This part summarizes the steps of ABCM-GNN implementation:
Step 1: The raw traffic volume data are normalized using the Min-Max method so that the traffic data take values between zero and one.
Step 2: Introduce the distance correlation coefficients to derive the historical time series that are highly correlated with the traffic volume within the target time interval.
Step 3: Each sub-predictor is calibrated using the past traffic volume data to select the best settings of the predictors. Then the trained sub-predictors are used to forecast the test traffic flow.
Step 4: The traffic volume is predicted by the ABCM model using the weights and the error correction model.
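As a minimal illustration of Step 1, a Min-Max scaler and its inverse (needed to map normalized predictions back to vehicle counts) might look like this; the function names are ours, not the paper's:

```python
import numpy as np

def min_max_scale(x):
    """Step 1: Min-Max normalization to [0, 1]; also returns the (min, range)
    pair needed to invert the transform after prediction."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi - lo)

def min_max_invert(x_scaled, params):
    """Map normalized values back to the original traffic-volume scale."""
    lo, rng = params
    return np.asarray(x_scaled, dtype=float) * rng + lo
```

In practice the scaling parameters are fitted on the training split only and reused on the test split, so that test information does not leak into training.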

III. TWO COMPONENT PREDICTORS
A. GRAPH ATTENTION GATED RECURRENT NETWORK
1) GRAPH ATTENTION MECHANISM
The attention mechanism can extract valuable information from time series tasks accurately and efficiently [17]. It is employed here to capture the propagation characteristics and spatial correlations of traffic volume data in complex traffic road networks.
The input to this part is a group of node features, H = {h_1, h_2, . . . , h_N}, h_i ∈ R^F, where N is the number of nodes in the traffic road network graph and F is the number of features per node. Masked attention is usually used to deal with real-world graphs: in the adjacency matrix, A_ij = 1 means that the road nodes i and j are within range and connected, while other entries mean disconnected.
In equation (22), the adjacency matrix is obtained through a thresholded Gaussian kernel.
We employ the softmax function to normalize the attention coefficients so that they are comparable across first-order neighborhoods. The normalization function is shown below.
where softmax(·) refers to a nonlinear function. The attention coefficient can be acquired by the expression below, where W is the weight matrix of node features. To further extract the spatial features of traffic networks efficiently, a multi-head attention mechanism is designed. The number of independent attention heads is denoted by K. Each node collects information simultaneously from adjacent nodes through multiple parallel channels.
where ∥ denotes the concatenation operation and ê_ij is a learnable weight vector.
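A single masked-attention head of the kind described above can be sketched as follows, assuming the standard GAT scoring (LeakyReLU of a^T[Wh_i ∥ Wh_j]); the function name and the use of a large negative constant for masking are our illustrative choices:

```python
import numpy as np

def masked_attention_coefficients(H, W, a, A):
    """One graph-attention head (sketch): score each node pair with a
    LeakyReLU of a^T [W h_i || W h_j], mask non-edges via the adjacency
    matrix A, then softmax over each node's first-order neighborhood."""
    Wh = H @ W                                   # projected node features, (N, F')
    N = Wh.shape[0]
    scores = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            z = float(np.concatenate([Wh[i], Wh[j]]) @ a)
            scores[i, j] = z if z > 0 else 0.2 * z   # LeakyReLU activation
    scores = np.where(A > 0, scores, -1e9)       # masked attention: non-edges vanish
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # each row sums to one
```

A K-head version would run this K times with independent (W, a) pairs and concatenate the resulting aggregated features, as the ∥ operator above indicates.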

2) GRAPH ATTENTION GATED RECURRENT NETWORK
Based on spatio-temporal information aggregation, the GRU [13], [14] unit is taken as the main body. The gated recurrent unit (GRU) is a popular recurrent neural network that effectively captures both long-term and short-term dependencies in time series. We replace the original linear connection layers of the GRU unit with graph attention operations. The architecture of the GAGRU memory block is illustrated in Fig. 2. The GAGRU element is established such that its internal tensor calculation conforms to the principle of information processing of spatio-temporal graph nodes. The architecture of the GAGRU model is shown in Fig. 3.
z_t = σ(GA(x_t) + GA(h_{t−1}) + b_z)
r_t = σ(GA(x_t) + GA(h_{t−1}) + b_r)
n_t = tanh(GA(x_t) + GA(r_t ∗ h_{t−1}) + b_n)
h_t = (1 − z_t) ∗ n_t + z_t ∗ h_{t−1}
where x_t and y_t denote the input and output at time step t, h_t is the hidden state of the model, z_t and r_t denote the update and reset gates, σ is the activation function, ∗ is the Hadamard product, and GA(·) is the graph attention mechanism function.
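One GAGRU step can be sketched as follows. For brevity we pass a single graph-attention operator ga(·) shared across gates; the actual model would use separately parameterized graph-attention layers for the input and hidden state of each gate:

```python
import numpy as np

def gagru_cell(x_t, h_prev, ga, b_z, b_r, b_n):
    """One GAGRU step (sketch): the GRU gate equations with the linear maps
    replaced by a graph-attention operator ga(.) that aggregates over
    neighboring road nodes."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z_t = sigmoid(ga(x_t) + ga(h_prev) + b_z)          # update gate
    r_t = sigmoid(ga(x_t) + ga(h_prev) + b_r)          # reset gate
    n_t = np.tanh(ga(x_t) + ga(r_t * h_prev) + b_n)    # candidate state
    return (1.0 - z_t) * n_t + z_t * h_prev            # new hidden state
```

Unrolling this cell over the input sequence, with ga(·) implemented by the multi-head attention of the previous subsection, yields the GAGRU network of Fig. 3.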

B. ARIMA METHOD
The ARIMA [5] model is one of the most widely applied traditional parametric time-series prediction methods. It performs an arithmetic fit to past data to predict the present. The ARIMA model usually refers to the ARIMA(p, d, q) model, where AR is the autoregressive part, MA is the moving average part, p is the number of autoregressive terms, d is the number of differences taken to make the series stationary, and q is the number of moving average terms. The model is expressed as follows, where ϕ(B) denotes the autoregressive process of order p; ε_n is a random error that follows a normal distribution with mean 0 and variance σ², and cov(ε_n, ε_{n−d}) = 0 for all d ≠ 0; and θ(B) denotes the moving average process of order q.
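To make the roles of p and d concrete, here is a stripped-down ARIMA(p, 1, 0) forecaster (no MA terms, one round of differencing, AR coefficients by least squares); this is our simplification for illustration, not the estimation procedure used in the paper:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model x_t = c + sum_i phi_i x_{t-i}."""
    x = np.asarray(x, dtype=float)
    # design matrix: column i holds the lag-(i+1) values for each target x_t
    X = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef                                # [c, phi_1, ..., phi_p]

def arima_p10_forecast(y, p):
    """One-step ARIMA(p,1,0) forecast: difference once (d = 1), fit AR(p) on
    the differences, then add the predicted difference to the last value."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)
    coef = fit_ar(d, p)
    recent = d[-1:-p - 1:-1]                   # last p differences, newest first
    return float(y[-1] + coef[0] + np.dot(coef[1:], recent))
```

Adding MA terms (q > 0) requires iterative maximum likelihood estimation, which is what library routines such as auto.arima automate, including the order selection itself.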

IV. EXPERIMENTS AND DISCUSSIONS
A. DATA DESCRIPTION
To verify the proposed short-term traffic volume prediction model, we validated it on the California highway traffic data set PeMS. The traffic data were captured by positioned remote traffic microwave sensors (RTMS) and collected in real time every 30 seconds by the Caltrans Performance Measurement System (PeMS) [30], shown in Fig. 4. Traffic volume data are assembled from the real-time information every five minutes. The system deploys more than 39,000 sensors on the freeways of California's major metropolitan areas, and geographic information about each sensor station is recorded in the dataset. The PeMS data used here are San Bernardino traffic data from June 1st to August 31st, 2022, covering 1,979 detectors on eight roads. We used the data of the first 50 days as the training set and the data of the last 12 days as the test set, and aggregated the traffic data every 5 minutes. We trained this architecture on a server with an NVIDIA 2080Ti GPU and an Intel i9-9980XE CPU. The GAGRU and GRU models were implemented in PyTorch, and we used the pyramid library for the ARIMA model. The predictors were combined with NBCM or ABCM.

B. PREDICTIVE PERFORMANCE CRITERIA
In order to estimate the accuracy of the forecasting models, in this paper we adopt three error evaluation metrics to measure the capabilities of different traffic prediction methods, namely mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). In addition, we adopt three information loss criteria [31], [32] for optimal model selection, namely the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the Hannan-Quinn information criterion (HQIC). The formulas for each evaluation metric are expressed below.
where n denotes the number of observed samples, m denotes the number of model parameters, ŷ_i denotes the predicted value, and y_i is the true value.
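The error metrics, together with AIC and BIC in their Gaussian least-squares form (where the residual sum of squares RSS stands in for the likelihood), can be written directly; this is a generic sketch of the standard definitions, not code from the paper:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean square error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat):
    """Mean absolute percentage error (in %); assumes y has no zeros."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)

def aic(n, m, rss):
    """AIC for a Gaussian least-squares fit: n*ln(RSS/n) + 2m."""
    return float(n * np.log(rss / n) + 2 * m)

def bic(n, m, rss):
    """BIC replaces the 2m penalty with m*ln(n), penalizing parameters harder."""
    return float(n * np.log(rss / n) + m * np.log(n))
```

Because ln(n) > 2 whenever n > 7, BIC punishes parameter-heavy models such as the deep sub-predictors more severely than AIC, which is relevant to the information-loss comparison in Section IV-D.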
C. MODEL SETTINGS
1) GAGRU AND GRU
GAGRU and GRU [13] utilize the data set of the previous two months for training. In the GAGRU model, a few hyperparameters require confirmation, such as the number of attention heads in the graph attention mechanism. Table 1 shows the effect of the number of attention heads on performance; when the number of attention heads is 3, each error metric is smallest on the test dataset. The initial learning rate is 1e-3 and the decay rate is 0.6 per 10 epochs. We set the number of training epochs to 100, the batch size to 32, and the number of hidden units to 64. The hidden size of the GRU is set to 128, and we use the Adam optimizer with an adaptive learning rate. The mean square error of the predicted results with respect to the ground truth is minimized to obtain the weight matrices.

2) ARIMA
Model identification is performed for the selected eight road segments using historical traffic volume data from the previous two months. Fig. 5 describes the process of ARIMA model prediction. First, the series stationarity test is performed using the MA and AR operators, and non-stationary series are differenced to obtain stationary series. Secondly, the parameter values of ARIMA are determined using maximum likelihood estimation, and the residual statistics of the model are calculated; then, the residuals of the ARIMA model are estimated and tested to obtain the best traffic volume prediction model parameters for the target road sections. In this experiment, we used the auto.arima() function to automatically determine the number of autoregressive terms, the number of moving average terms, and the other parameters of the optimal model for traffic volume prediction.

3) CORRELATION ANALYSIS
In this paper, the target series Y_t is obtained by calculating the average traffic volume of the eight roads in the dataset at each time period. We assume a potential correlation between the current traffic volume and the traffic volume within the preceding 25 intervals; therefore, the total length of the correlation interval is one hour. The distance correlation coefficient between each alternative sequence and the target sequence is computed as the lag z increases from 0 to 25, and the results are illustrated in Fig. 6. As depicted in Fig. 6, R(Z) and the set Z are decided by the parameter δ. For example, with δ set to 0.98, R(Z) = 3 and Z = {t − 1, t − 2, t − 3}.

D. EXPERIMENTAL RESULTS
After the experiments, the models in Table 2 were validated. Table 2 gives a detailed description of the tested models. We choose the ARIMA and GAGRU models as sub-predictors. NBCM-GNN and IBCM-GNN incorporate the ARIMA and GAGRU models using the NBCM [26] and IBCM [27] frameworks, while ABCM-GNN fuses the same sub-predictors through our proposed ABCM framework. Meanwhile, ABCM-DL fuses the ARIMA and GRU models. FC-LSTM is a variant of LSTM with input and hidden states in vector form; it captures the temporal correlation of traffic data through LSTM, but spatial correlation is not fully considered [12]. STGCN consists of graph convolutional layers and temporal convolutional layers, which capture the spatial correlation between road network nodes and the temporal characteristics of traffic data [18]. The forecasting results of the ARIMA and GAGRU models on Friday and Sunday are shown in Fig. 7 and Fig. 8.
All the results of the models with δ set to 0.98 are shown in Table 3. It shows the superior prediction performance of our proposed model under the MAE, RMSE, and MAPE criteria compared to the other models. Comparing the GAGRU model with GRU and ARIMA, the MAE metric improved by 2.63 and 5.53 for the next 15 minutes, respectively, indicating that the graph-attention-based GAGRU achieves higher predictive performance than the traditional ARIMA and the deep learning method GRU. Similarly, the MAE and MAPE criteria of ABCM-GNN over ABCM-DL improved by 0.13 and 0.44% for the next 15 minutes, implying that fusing a graph neural network into the combined model improves prediction accuracy. For the different combination approaches, the MAE and MAPE metrics of our proposed ABCM-GNN improve by 4.2% and 6.6% over IBCM-GNN; the error-correction-based ABCM model effectively corrects the prediction error compared to the IBCM model and yields better prediction results. In terms of optimal model selection, the number of model parameters acts as a penalty term, so a small AIC value indicates a better model. The AIC and BIC criteria of ABCM-GNN over IBCM-DL decreased by 0.12 and 0.11 for the next 15 minutes, implying the goodness of the ABCM-GNN model. Due to the large number of parameters in the deep learning model, ABCM-GNN did not demonstrate an advantage in the information loss criteria compared to ABCM-DL; however, in terms of the error evaluation indicators, ABCM-GNN has better performance and smaller error values. The comparison of the predictions of the ABCM-GNN and IBCM-GNN models with the observations is shown in Fig. 11. Fig. 9 presents the cumulative weights calculated by ABCM for each predictor, indicating that ABCM assigns weights among sub-models according to the error magnitude. The amount of error correction in ABCM-GNN and NBCM-GNN is shown in Fig. 10; the error correction mechanism is able to perform short-term equilibrium error correction in the combined model prediction process. The results of the different models on Friday and Sunday are shown in Table 4. The ABCM-GNN model is superior to the other models for different traffic volume data. In addition, another essential feature of ABCM-GNN is that its prediction performance is not significantly affected by the length of the prediction period. The results indicate that the predictive performance of our proposed ABCM-GNN model outperforms the other models as measured by MAE, RMSE, and MAPE.

V. CONCLUSION AND FUTURE WORK
This paper proposes a new short-term traffic volume forecasting model called the advanced Bayesian combination model with graph neural network (ABCM-GNN). Firstly, an ABCM framework is established, and the cointegration analysis and error correction mechanism are introduced. Through the long-term cointegration correlation, the short-term forecasts of the combined predictor are corrected to enhance the prediction accuracy. Secondly, the nonlinear correlation between historical and current traffic flows is analyzed using distance correlation coefficients to determine the length of the previous time sequences to be used in the ABCM framework. Then, exploiting the advantages of the graph attention mechanism in capturing the spatial characteristics of traffic flow, the graph attention gated recurrent neural network is combined with the time-domain prediction model, and the GAGRU method is used as a sub-model of the advanced Bayesian combination model. Experimental performance on a real-time dataset shows that both the proposed ABCM framework and the introduced GAGRU sub-predictor enhance the prediction accuracy of the combined model, with the MAE and RMSE metrics of the ABCM-GNN model improving by 0.25 and 0.31, respectively. Meanwhile, the ABCM-GNN model outperforms other models across different traffic volume data.
Since the performance of the ABCM-GNN model largely depends on its sub-predictors, more advanced sub-predictors, such as SVM and KNN, should be considered in the ABCM framework in the future. Meanwhile, more information loss criteria (e.g., AIC and its variants) will be included in our future work on optimal model selection. The next step will also focus on obtaining more kinds of information from various types of data, such as climate conditions, passenger travel demand distribution, and unexpected road conditions. Further, the method can be applied to other time series prediction tasks, such as arrival time estimation and speech recognition.