SDSCNN: A Hybrid Model Integrating Static and Dynamic Spatial Correlation Neural Network for Traffic Prediction

Traffic flow prediction is of great significance for traffic control, yet capturing the complex spatial-temporal correlations of traffic data remains challenging. Most existing prediction methods only consider the spatial adjacency of the nodes (i.e., static spatial correlation) and lack sufficient analysis of non-stationary traffic conditions (i.e., dynamic spatial correlation). Combining static and dynamic spatial correlation enables a model to comprehensively analyze the features of traffic flow at each moment and improves its mining capability. To address this problem, we use the multi-head self-attention mechanism to establish a hybrid model integrating static and dynamic spatial correlation neural network (SDSCNN) for traffic flow prediction. Specifically, we first construct a static adjacency matrix and a dynamic adjacency matrix using different methods. These two matrices are simultaneously input into a Graph Attention Network (GAT) for analysis, and the two outputs are integrated by a sum operation. The fused static and dynamic spatial features are then fed into a multi-head self-attention layer to analyze the temporal correlation. In addition, multiple SDSCNN layers are stacked to further analyze the dynamic correlations between road sections and to improve the model's multi-step prediction capability. Finally, a Multi-layer Perceptron is used to output the prediction results. Extensive experiments are conducted on the PEMS04, PEMS08, and METR-LA datasets, and the results demonstrate that our model achieves good prediction performance.


I. INTRODUCTION
With the development of intelligent transportation systems, more and more scholars pay attention to traffic flow prediction. Accurate and effective predictions can help traffic management departments guide vehicles more reasonably, alleviate traffic congestion and improve road traffic efficiency [1]. Current research on traffic flow prediction mainly focuses on the mining of temporal and spatial correlations [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Shaohua Wan.
Temporal correlation refers to the impact of the historical traffic flow state on the future state [3]. It focuses on periodicity. For example, the morning peak hours of weekdays are generally fixed, and the traffic flow states during these periods are similar. Likewise, the traffic state of the previous week is similar to the general trend of the current week, which means that there is a temporal correlation between the current traffic status and the historical traffic status. The existing models for analyzing temporal correlation include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) [4], [5]. However, these models can only analyze temporal correlation in the short term and still have limitations for periodic temporal correlation analysis [6]. Spatial correlation refers to the interactivity between traffic flows in a road network due to spatial location [3]. For example, the traffic state of an upstream intersection directly affects the traffic state of a downstream intersection. Since the two intersections are adjacent to each other, which is a relatively fixed spatial condition, the traffic state presents a comparatively stable spatial influence relationship. By now, the extraction of spatial correlation is still a big challenge due to accelerated urbanization and the complex structure of urban roads. Existing studies on spatial correlation focus on two directions: static spatial correlation analysis and dynamic spatial correlation analysis. Specifically, static spatial correlation analysis relies on the original topology of the road network. It transforms the road topology into an adjacency matrix and uses a Graph Convolutional Network (GCN) to extract the spatial correlation between road sections [7].
Only the static spatial correlation between neighboring traffic flows is extracted via the adjacency matrix. Dynamic spatial correlation analysis is part of dynamic graph learning [8], [10], which considers not only the weights of edges between nodes but also the dynamic changes of the nodes in the graph. Most traffic flow prediction studies focus on mining the weights of edges between nodes and rarely consider the changes in sensor nodes. These methods dynamically obtain optimal weight parameters between nodes through an end-to-end training process [11]. In this way, the spatial weights obtained from training depend on the input traffic values rather than the original road network topology. However, the drawback of this method is that, for a large road network, the number of parameters is large and the computational complexity is high.
The spatial correlation of traffic flow cannot be analyzed comprehensively using either of these two methods alone.
In the case of smooth traffic flow, the spatial correlation of traffic flow depends on the static topology of the road. The spatial correlation between the roads is usually consistent with the topology and remains unchanged. Feeding the original adjacency matrix into GCN can better extract static spatial correlation. However, for some unstable traffic situations (e.g., traffic congestion during peak hours, sudden traffic accidents, and one-way road restrictions during peak hours), the spatial correlation of traffic flow not only changes with time but also shows dynamic characteristics. In these cases, the simple topology cannot accurately represent the spatial correlation between roads at that moment. Adding the dynamic information based on static spatial correlation can fully consider the dynamic characteristics of spatial correlation of traffic flow which changes with time, and make up for the shortcomings of static spatial correlation [12], [13]. By integrating static spatial correlation and dynamic spatial correlation, we can conduct a comprehensive analysis of traffic flow and improve the generalization ability and prediction accuracy of the model.
To this end, we propose a hybrid model integrating static and dynamic spatial correlation neural network for traffic prediction. Figure 1 shows the model structure proposed in this paper. The model divides spatial correlation analysis into two parts: static spatial correlation analysis and dynamic spatial correlation analysis. Specifically, the static spatial correlation is extracted by a Graph Attention Network (GAT) from the static adjacency matrix, and the dynamic spatial correlation is extracted by GAT from a dynamically constructed adjacency matrix. The outputs of these two parts are summed as the final output of the spatial correlation analysis. For the temporal correlation part, we use multi-head self-attention to get the output. The above network structure constitutes an SDSCNN layer, and we stack multiple SDSCNN layers in our model for further exploration. Finally, the output of the last SDSCNN layer is fed into a Multi-layer Perceptron (MLP) layer to predict the traffic values at the target time steps.
In summary, the main contributions of this paper are as follows.
(1) We propose a combination mining method, which jointly uses static and dynamic spatial correlations for traffic flow prediction. Compared to using only static or dynamic information, our method can better analyze sudden traffic conditions and adapt to more complex traffic conditions.
(2) We construct a sequence-to-sequence (Seq2Seq) model structure, which stacks several spatial-temporal layers. Compared with a single spatial-temporal layer, our method can mine deeper spatial-temporal correlations and effectively avoids the weakening of correlation mining over long time intervals.
(3) We conduct experiments on three real datasets, PEMS04, PEMS08, and METR-LA. The results show that our model has better prediction performance compared to other benchmark models.

II. RELATED WORK
The problem of traffic flow prediction is to model the traffic status of the urban road network or highway to predict the future traffic situation [14]. Existing traffic flow prediction methods are mainly divided into two categories: model-driven methods and data-driven methods [15].

A. MODEL-DRIVEN METHODS
The model-driven method is also known as the parametric method. The main representative models include the autoregressive integrated moving average model (ARIMA) [16], the support vector regression model (SVR) [17], the gray prediction model (GM(1,m)) [18], etc. These models usually have strict assumptions and a fixed algorithm structure. However, traffic flow is easily affected by random interference factors (such as traffic accidents, weather, and traveler behavior) and has strong uncertainty. Therefore, parametric methods cannot mine the nonlinear characteristics of traffic flow, resulting in generally low prediction accuracy. Data-driven methods are further divided into traditional machine-learning methods and deep-learning methods. Classic representatives of the former are support vector machines [19], [20], k-nearest neighbors [21], Bayesian networks [22], MLP [23], and so on. The latter are mainly artificial neural networks and their variants [24]. These models have strong nonlinear mapping ability, and their data requirements are not as strict as those of model-driven methods. Therefore, they can better adapt to the uncertainty of traffic flow and effectively improve the prediction effect.

B. DATA-DRIVEN METHODS
In recent years, to fully extract the spatial-temporal characteristics of traffic flow, deep neural network models with high-dimensional data processing capabilities and efficient nonlinear feature mining have been favored by more and more researchers. Since traffic flow forecasting is a typical time series forecasting task with significant temporal correlation, LSTM and GRU have been widely used to model the temporal characteristics of traffic flow. However, models based purely on time series cannot cover all the factors of traffic flow prediction. The traffic flow in each area of the urban road network is affected not only by the traffic flow in the previous period but also by the congestion of adjacent roads. A time series model can only extract the traffic flow information of a single area, and relying on the information of a single area is far from enough to predict the traffic flow. Therefore, the Convolutional Neural Network (CNN) [25], which can handle the spatial structure of roads, is also considered. For example, Ma et al. [26] convert traffic flow into images and utilize a CNN to extract traffic features in adjacent regions. Meanwhile, to comprehensively consider the spatial-temporal dependencies of traffic data, many researchers combine CNN and LSTM models for traffic flow or traffic speed prediction, such as ConvLSTM [27], DMVST-Net [28], CLTFP [29], and so on. However, these CNN-based methods [30] are not suitable for non-Euclidean space, so they cannot extract features in traffic networks with complex structures.
Since the spatial structure of the road network lies in non-Euclidean space, the classic convolutional neural network cannot be used to extract the spatial features between adjacent observation points. Therefore, Graph Convolutional Networks (GCN) [31], [33] have received extensive attention, and many studies have proposed GCN-based traffic prediction models [34], [39]. For example, Li et al. [40] proposed the Diffusion Convolutional Gated Recurrent Unit (DCGRU), which improves the gating of GRU to capture the spatial-temporal dependence of traffic data, and combined it with an encoder-decoder to propose the Seq2Seq model DCRNN. Guo et al. [41] used three different spatial-temporal components to extract information from historical data, integrated the graph structure of the traffic network and the dynamic spatial-temporal pattern of the traffic data to characterize the spatial-temporal correlation between neighbor nodes and predicted nodes, and proposed an attention-based Spatial-temporal Graph Convolutional Neural Network (ASTGCN). Song et al. [42] used three consecutive time slices to construct a local spatial-temporal graph, used a sliding window to segment different periods, and stacked multiple graph convolutional layers to form a spatial-temporal synchronous graph convolution module. Wu et al. [43] used an adaptive adjacency matrix to capture hidden spatial dependencies and combined it with dilated causal convolution for modeling. Zhang et al. [44] proposed the Structure Learning Convolution Neural Network (SLCNN), which considers local and global spatial correlations and uses 3D convolutions to capture temporal dependencies. Overall, GCN shows good prediction performance on non-Euclidean datasets.

C. DIFFERENCES BETWEEN SDSCNN AND EXISTING WORKS
Although the above methods handle spatial and temporal dependencies relatively well, they have certain drawbacks in dealing with complex spatial-temporal correlations. Yan et al. [45] improved the transformer structure by introducing a global decoder and a global-local decoder: multi-head attention is used to extract non-local features, while masked multi-head attention focuses on extracting local features. In addition, temporal correlation analysis is performed using LSTM. However, this model only considers the dynamic characteristics and ignores the static correlation of traffic flow; moreover, the computational complexity of the LSTM is high. Wu et al. [46] also proposed a multi-attention model to predict traffic flow, but that work focused on data integrity and verified that initial interpolation affects the stability of model performance, especially for complex missing-data scenarios with large-scale data.
For temporal correlation, both convolutional and RNN-based time series models have shortcomings such as a weak ability to capture long-range connections and an inability to capture the periodicity of traffic information. For spatial correlation, GCN-based deep learning models cannot assign different weights according to node importance and cannot be applied to directed graphs. Furthermore, GCN uses a static Laplacian matrix as a filter and does not take into account dynamic spatial dependencies, i.e., situations where spatial weights change over time, such as traffic jams, traffic accidents, and one-way lanes that only appear during peak hours. To capture complex spatial-temporal associations, we propose the Hybrid Model Integrating Static and Dynamic Spatial Correlation Neural Network (SDSCNN). The SDSCNN model is based on the idea of SLCNN, which considers global and local spatial graphs, including static and dynamic spatial dependencies. Different from SLCNN, SDSCNN extensively uses the attention mechanism to obtain different features in space and time. Specifically, we construct spatial-temporal graphs based on the static road network and dynamic correlations, use attention to extract global and local spatial dependencies, and assign different weights to them. Meanwhile, to capture the periodicity of traffic information, we design a Seq2Seq temporal module based on multi-head self-attention, which can perform computations over different periods to extract temporal dependencies. Finally, we stack the two components to aggregate spatial-temporal correlations for prediction.

III. METHODOLOGY

A. PROBLEM DEFINITION
Given a graph G = {V, E} with N nodes, where V and E are the sets of nodes and edges respectively, the historical traffic data embedded on graph G with M input channels and T time intervals can be denoted as X ∈ R^(N×M×T). The task of traffic forecasting is to learn a mapping function f(·), which takes the historical traffic data X and the graph G as inputs to forecast the traffic data of the future K time intervals:

X̂ = f(X, G)    (1)

where X̂ ∈ R^(N×M×K) is the prediction value.
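As a shape-only sketch of the mapping f(·) defined above (the sizes are hypothetical, and a random linear map stands in for the learned model):

```python
import numpy as np

# Toy shape check for the problem definition. The sizes are hypothetical
# (N sensors, M channels, T input steps, K predicted steps), and a random
# linear map stands in for the learned function f(.) purely to show shapes.
N, M, T, K = 307, 3, 12, 12
X = np.random.randn(N, M, T)       # historical traffic data X in R^(N x M x T)
W = np.random.randn(T, K)          # placeholder parameters of f(.)
X_hat = X @ W                      # prediction X_hat in R^(N x M x K)
print(X_hat.shape)                 # (307, 3, 12)
```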

B. SDSCNN
Many works use ChebNet or GCN to analyze the spatial correlation. However, both ChebNet and GCN have three shortcomings. (1) GCN and its predecessor ChebNet are both spectral-domain models, which use graph theory to perform convolution operations on topological graphs. Convolution on a topological graph (the graph Fourier transform and graph convolution) requires the Laplacian matrix. The Laplacian matrix of an undirected graph is positive semi-definite, so it can be spectrally decomposed into N linearly independent eigenvectors; this enables the subsequent graph Fourier transform and convolution, and finally yields the first-order approximation formula of GCN. However, the Laplacian matrix of a directed graph is not positive semi-definite, so spectral decomposition is not possible and, theoretically, the GCN formula cannot be derived.
(2) GCN cannot handle dynamic graphs. GCN relies on the specific graph structure during training, and it is also performed on the same graph during testing. It is a full graph calculation method. GCN can only handle transductive learning tasks and cannot be applied to inductive tasks.
(3) GCN cannot assign different weights to each neighbor. GCN treats all neighbor nodes uniformly during convolution, and cannot assign different weights according to the importance of nodes.
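The first shortcoming can be illustrated numerically: the Laplacian of an undirected graph is symmetric with a real, non-negative spectrum, while a directed graph yields an asymmetric Laplacian that admits no such spectral decomposition. A minimal sketch with hypothetical 3-node graphs:

```python
import numpy as np

# Undirected vs. directed toy graphs: only the undirected Laplacian is
# symmetric positive semi-definite, which is what GCN's spectral derivation
# requires. Both adjacency matrices are hypothetical 3-node examples.
A_undirected = np.array([[0, 1, 1],
                         [1, 0, 0],
                         [1, 0, 0]], dtype=float)
A_directed   = np.array([[0, 1, 1],
                         [0, 0, 0],
                         [0, 0, 0]], dtype=float)

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A      # L = D - A

L_u = laplacian(A_undirected)
L_d = laplacian(A_directed)

eigvals = np.linalg.eigvalsh(L_u)          # real, non-negative spectrum
print(np.allclose(L_u, L_u.T), bool(np.all(eigvals >= -1e-9)))  # True True
print(np.allclose(L_d, L_d.T))                                  # False
```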
Based on the above issues, GAT [47], [48] was proposed. There are at least three benefits of using GAT rather than GCN.
(1) GAT is a spatial domain model that does not require the Laplacian matrix to join operations. Therefore, it is more suitable for directed graphs.
(2) GAT performs node-wise operations, whereas GCN performs convolution over the whole graph with the Laplacian matrix. Therefore, GAT can be used for dynamic graphs.
(3) The spatial weights of GAT are computed dynamically, instead of being fixed by the Laplacian matrix as in GCN. Therefore, GAT can assign different weights to nodes of the same neighborhood.
We introduce the attention mechanism to analyze the spatial and temporal correlations respectively and use it to construct the SDSCNN model. The model includes four parts: a static spatial module, a dynamic spatial module, a spatial fusion module, and a temporal module. The specific description is as below.

1) STATIC SPATIAL MODULE
GAT introduces a self-attention mechanism in the propagation process, and the hidden state of each node is calculated by attending to its neighbor nodes. The GAT network is implemented by stacking simple graph attention layers. For each attention layer, the attention coefficient between nodes is calculated as shown in equation (2):

α_ij = softmax_j( LeakyReLU( a^T [W h_i || W h_j] ) ), j ∈ N_i    (2)

where α_ij is the attention coefficient from node i to node j, and N_i represents the set of neighbor nodes of node i. The input features of the nodes are h = {h_1, ..., h_N}, h_i ∈ R^F, where N and F represent the number of nodes and the feature dimension respectively. W ∈ R^(F'×F) is the linear transformation weight matrix applied to each node, and a ∈ R^(2F') is the weight vector, which maps the concatenated features to R. Finally, softmax is used for normalization, and LeakyReLU is used as the activation.
The output feature of each node is obtained by equation (3):

h'_i = σ( Σ_{j∈N_i} α_ij W h_j )    (3)

where α_ij is the attention coefficient from node i to node j calculated by equation (2), W is the weight matrix to be trained, h_j is the feature of the neighbor node j of node i, and h'_i is the output feature of node i computed by the attention mechanism.
In addition, GAT utilizes multi-head self-attention [49], [50] to stabilize the learning process. It applies K independent attention mechanisms to calculate the hidden states, concatenates their outputs, and feeds the result into a fully connected layer to get the final result. The process is shown in Figure 2, and equation (4) gives the specific calculation of multi-head attention:

h'_i = ||_{k=1}^{K} σ( Σ_{j∈N_i} α^k_ij W^k h_j )    (4)

where α^k_ij is the normalized attention coefficient of the k-th attention head, W^k is the corresponding weight matrix, and || represents the concatenation operation. Our model introduces this multi-head self-attention method to mine the correlations.

Static spatial correlation analysis depends on the original road topology. Existing work generally uses a 0-1 matrix to represent the road structure. However, such a matrix cannot accurately reflect the static spatial structure of the road network. Therefore, this paper uses an adjacency matrix based on road network distances, which not only better represents the actual spatial structure of the road network but also reduces the computational complexity of the model. The product of the adjacency matrix and a parameter matrix is used as the input of GAT. The static adjacency matrix is calculated as follows:

Ã_ij = exp( −dist(v_i, v_j)² / σ² ),  if dist(v_i, v_j) ≤ d_g    (5)
Ã_ij = 0,  otherwise    (6)

where the adjacency matrix Ã ∈ R^(N×N) is calculated using a thresholded Gaussian kernel [51] and the road distance between each pair of nodes in the road network. dist(v_i, v_j) represents the road distance from sensor v_i to sensor v_j, σ² is the variance of the road distances, and d_g is the maximum threshold of the road network distance, which takes different values for different datasets. Since an actual road network is generally a directed graph, dist(v_i, v_j) and dist(v_j, v_i) are inconsistent in some cases, so Ã is an asymmetric matrix. The feature value X and the matrix W_s are input into equation (4) to obtain the static spatial correlation analysis result. The calculation rule is as follows:
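The thresholded-Gaussian-kernel construction of the static adjacency matrix described above can be sketched as follows; the distance matrix and threshold are hypothetical toy values:

```python
import numpy as np

# Static adjacency matrix from road distances via a thresholded Gaussian
# kernel. The distance matrix (possibly asymmetric, since roads are directed)
# and the threshold d_g are hypothetical toy values.
dist = np.array([[   0.,  500., 3000.],
                 [ 600.,    0.,  800.],
                 [3000.,  900.,    0.]])
d_g = 1500.0                               # maximum distance threshold
sigma2 = dist[dist > 0].var()              # variance of the road distances

A = np.where(dist <= d_g, np.exp(-dist ** 2 / sigma2), 0.0)
np.fill_diagonal(A, 0.0)                   # no self-loops

print(A.shape)                             # (3, 3)
print(np.allclose(A, A.T))                 # False: directed roads, asymmetric
```

Because the road distances are directed, the resulting matrix is asymmetric, matching the discussion above.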
X_s = ||_{k=1}^{K} σ( Σ_{j∈N_i} α^k_{s,ij} W_s X_j )    (7)

where X_s is the output of the static spatial correlation analysis and θ_s ∈ R^(N×N) is a learnable parameter. Sometimes the road-distance adjacency matrix Ã cannot be used directly in the traffic flow prediction task; considering the impact of road construction, natural disasters, etc., we use the parameter θ_s to obtain the matrix W_s from Ã. α^k_{s,ij} is calculated by equation (2) and is kept fixed during the calculation; it represents the weights of the edges trained by GAT. The product of α^k_{s,ij} and W_s determines the relationship between the nodes in the static spatial correlation module. The edges and the associated node weights in this module are static.
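A minimal single-head sketch of the attention computation in equations (2) and (3), with tanh standing in for the activation σ; the sizes, weights, and toy adjacency are all hypothetical:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Single-head graph attention: coefficients over each node's neighbors
# (eq. 2), then weighted aggregation (eq. 3). tanh stands in for the
# activation sigma; sizes, weights, and the adjacency are toy values.
rng = np.random.default_rng(0)
N, F, F_out = 4, 3, 5
h = rng.normal(size=(N, F))                 # input node features
W = rng.normal(size=(F, F_out))             # shared linear transformation
a = rng.normal(size=2 * F_out)              # attention weight vector
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # toy adjacency

Wh = h @ W
alpha = np.zeros((N, N))
for i in range(N):
    nbrs = np.flatnonzero(A[i])             # N_i: neighbors of node i
    e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                  for j in nbrs])
    alpha[i, nbrs] = softmax(e)             # normalized coefficients alpha_ij

h_out = np.tanh(alpha @ Wh)                 # aggregated output features
print(h_out.shape)                          # (4, 5)
```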

2) DYNAMIC SPATIAL MODULE
Dynamic spatial correlation often exists in the historical data. Similarly, we also use the multi-head attention mechanism for dynamic spatial correlation mining. The difference is that the attention coefficients of each layer change according to the input feature values.
Following the same idea as the static weight calculation, the dynamic part fuses two weight matrices to construct the dynamic weights: the matrix W_d and the matrix of coefficients α^k_{d,ij} calculated by GAT. The weight parameters depend on the current feature values, and the correlation between nodes at different moments shows dynamic characteristics, so we introduce a dynamic weight training method to construct W_d [44]. The optimal dynamic feature at the current moment is determined by combining the latest features of the nodes with a weight matrix that the model trains in real time. We first parameterize the matrix W_φ, and then the matrix W_φ and the latest feature value X are input into a heuristic function:

W_d = φ(X, W_φ)    (8)

where φ(·) is the heuristic function and W_d is the resulting dynamic weight matrix. For α^k_{d,ij}, the latest values of the nodes are input into equation (2) to get the output.
The dynamic spatial correlation is then analyzed using GAT, which is calculated as follows:

X_d = ||_{k=1}^{K} σ( Σ_{j∈N_i} α^k_{d,ij} W_d X_j )    (9)

where α^k_{d,ij} is calculated by equation (2), with the latest features of the nodes as input; it represents the training parameters of the dynamic spatial module. X_d is the output of the dynamic spatial correlation analysis.
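The excerpt does not fix the exact form of the heuristic function φ(·); as one hedged possibility in the spirit of SLCNN-style dynamic weights, φ(·) can be a row-normalized bilinear similarity of the latest node features (the form of φ and all names here are assumptions):

```python
import numpy as np

# One possible heuristic phi(.): a row-normalized bilinear similarity of the
# latest node features, modulated by the learnable matrix W_phi. All names,
# sizes, and the form of phi are assumptions for illustration.
rng = np.random.default_rng(1)
N, T = 4, 12
X = rng.normal(size=(N, T))            # latest feature values per node
W_phi = rng.normal(size=(T, T))        # learnable parameter matrix

def phi(X, W_phi):
    S = X @ W_phi @ X.T                # data-dependent pairwise scores
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # row-normalized dynamic weights

W_d = phi(X, W_phi)                    # dynamic weight matrix, as in eq. (8)
print(W_d.shape)                       # (4, 4)
```

Unlike the static W_s, W_d changes whenever the input features X change, which is what gives the module its dynamic character.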

3) SPATIAL FUSION MODULE
Spatial correlation analysis includes static spatial correlation analysis and dynamic spatial correlation analysis. We add the two outputs to achieve the fusion of spatial correlation:

X̄ = X_s + X_d    (10)

where X̄ is the result of the spatial correlation analysis.

4) TEMPORAL MODULE
To improve the model's ability to mine temporal correlation, especially periodicity, the multi-head self-attention mechanism is also used in this module. Since the self-attention mechanism computes attention between every pair of time points, it can directly capture the dependency relationship regardless of the interval between time points. The model structure is shown in Figure 3, and the input of the module is the fused spatial output X̄. The calculation rule is as below:

X̃ = ||_{k=1}^{K} σ( Σ_j α^k_ij W^k X̄_j )    (11)

where α^k_ij is calculated by equation (2), with the current node values as input, and X̃ is the result of the temporal correlation analysis. Based on multi-head self-attention, this paper constructs a Seq2Seq temporal correlation analysis module, which can extract information over different periods to predict traffic over different time interval lengths.
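The temporal module's core operation can be sketched as standard multi-head scaled dot-product self-attention over the time steps of one node's fused feature sequence (sizes and random weights are hypothetical; this is an illustration, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Multi-head self-attention over the T time steps of one node's sequence.
# Every time step attends to every other, regardless of the interval.
rng = np.random.default_rng(2)
T, D, heads = 12, 8, 2                 # time steps, feature width, heads
d_k = D // heads
X_bar = rng.normal(size=(T, D))        # fused spatial output for one node

out = []
for _ in range(heads):
    Wq, Wk, Wv = (rng.normal(size=(D, d_k)) for _ in range(3))
    Q, K, V = X_bar @ Wq, X_bar @ Wk, X_bar @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T) attention weights
    out.append(attn @ V)

X_out = np.concatenate(out, axis=1)    # concatenate heads, as in equation (4)
print(X_out.shape)                     # (12, 8)
```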

C. DETAILED IMPLEMENTATION
Based on the above analysis, SDSCNN is proposed. To mine deeper spatial-temporal correlations, we stack several SDSCNN layers. Since the receptive field size varies from layer to layer, the physical meaning of each layer's weight matrix also varies: at the first layer, a matrix element may correspond to the relationship between two roads, while at a higher layer, a matrix element can represent road information within an area. Therefore, SDSCNN learns different parameters at different layers to characterize the graph structure, which is more in line with the actual situation.
The training process of the SDSCNN is shown in Algorithm 1.

Algorithm 1 The SDSCNN Training Process
Input: Historical traffic dataset X ∈ R^(N×M×T), Y ∈ R^(N×1); static adjacency matrix Ã; number of training iterations P; mini-batch size B
Output: Hybrid network SDSCNN(X, W^k, θ_s, W_φ, W_y, b_y)
1: Randomly initialize the training parameters W^k, θ_s, W_φ, W_y, b_y
2: for p ← 1 to P do
3:   Compute α^k_{s,ij} by equation (2); obtain W_s from Ã and θ_s; compute X_s by equation (7)
4:   Compute α^k_{d,ij} by equation (2); compute W_d by equation (8); compute X_d by equation (9)
5:   Compute X̄ = X_s + X_d by equation (10); compute the temporal output by equation (11)
6:   Repeat the above for each stacked SDSCNN layer and feed the last output into the MLP
7:   Update the parameters by minimizing the loss between the prediction and Y
8: end for

For the static spatial module, the static attention coefficient α^k_{s,ij} is first calculated by GAT with equation (2). Then we use equations (5) and (6) to obtain the static adjacency matrix Ã and parameterize it with θ_s to obtain the matrix W_s. We feed the input X, W_s, and α^k_{s,ij} into equation (7) to get the output of the static spatial module X_s.
For the dynamic spatial module, the dynamic attention coefficient α^k_{d,ij} is also calculated by equation (2). In this module, we directly parameterize the dynamic matrix W_φ. The matrix W_φ and the input X are fed into equation (8) to get the matrix W_d. Then the input X, W_d, and α^k_{d,ij} are fed into equation (9) to get the output of the dynamic spatial module X_d.
After the above steps are complete, X_s and X_d are summed to get the spatial module result X̄. Then X̄ is input into the multi-head self-attention to analyze the temporal correlation with equation (11), and the output of the temporal module is used as the output of the first SDSCNN layer.
Next, several layers of SDSCNN are stacked according to the above steps, and the parameters of each layer are initialized independently. The output of each layer is used as input for the next layer. We choose the optimal numbers of the layer and the head number of the multi-head self-attention through extensive experiments, and the detailed values are shown in the following experiments section. Finally, the result of the last SDSCNN layer is input to the MLP layer to get the prediction value.
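The stacked-layer design described above can be sketched end to end; the layer internals are stubbed to a spatial fusion of two branches, and all sizes and parameters are hypothetical:

```python
import numpy as np

# Stacked SDSCNN layers in miniature: each layer has its own independently
# initialized parameters, each layer's output feeds the next, and an MLP head
# maps the last layer's T history steps to K predicted steps. The layer body
# is a stub (spatial fusion only); all names and sizes are hypothetical.
rng = np.random.default_rng(3)
N, T, K, n_layers = 4, 12, 3, 2

def sdscnn_layer(X, params):
    X_s = np.tanh(params["static"] @ X)    # static spatial branch (stub)
    X_d = np.tanh(params["dynamic"] @ X)   # dynamic spatial branch (stub)
    return X_s + X_d                       # spatial fusion; temporal part omitted

layers = [{"static": rng.normal(size=(N, N)),
           "dynamic": rng.normal(size=(N, N))} for _ in range(n_layers)]
W_mlp = rng.normal(size=(T, K))

X = rng.normal(size=(N, T))
for params in layers:                      # parameters independent per layer
    X = sdscnn_layer(X, params)
Y_hat = X @ W_mlp                          # MLP head: T steps -> K predictions
print(Y_hat.shape)                         # (4, 3)
```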

IV. EXPERIMENTS
To evaluate the effectiveness of the proposed model, a series of experiments is conducted, organized into the following steps.

A. EXPERIMENTS SETTINGS

1) DATA COLLECTION
The SDSCNN model is validated using three traffic datasets: PEMS04, PEMS08, and METR-LA. PEMS04 and PEMS08 are highway datasets collected by the Caltrans Performance Measurement System (PeMS, https://pems.dot.ca.gov).
PEMS04 contains traffic data collected in the San Francisco Bay Area, covering 3848 sensors on 29 roads. The dataset contains traffic information such as traffic flow, vehicle speed, and density, as well as the road distances between the sensors. Among them, 307 sensors are selected for prediction. The dataset covers the period from January 1, 2018, to February 28, 2018, a total of 59 days.
PEMS08 is the traffic data of San Bernardino, collected over 62 days.

2) EVALUATION METRICS
Three metrics are used to evaluate prediction performance: the mean absolute error (MAE), the root mean square error (RMSE), and the mean absolute percentage error (MAPE):

MAE = (1 / (M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} | y_ij − ŷ_ij |
RMSE = sqrt( (1 / (M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} ( y_ij − ŷ_ij )² )
MAPE = (100% / (M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} | y_ij − ŷ_ij | / y_ij

where y_ij and ŷ_ij represent the actual and predicted traffic flow of the j-th road in the i-th sample, M represents the number of samples, N represents the number of roads, and Ȳ is the average value of the samples.
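The three reported metrics (MAE, RMSE, MAPE) can be computed as follows on a toy prediction (the values are hypothetical):

```python
import numpy as np

# MAE, RMSE, and MAPE on a toy prediction; y and y_hat are hypothetical.
y     = np.array([10.0, 20.0, 30.0, 40.0])   # actual traffic flow
y_hat = np.array([12.0, 18.0, 33.0, 36.0])   # predicted traffic flow

mae  = np.abs(y - y_hat).mean()
rmse = np.sqrt(((y - y_hat) ** 2).mean())
mape = (np.abs(y - y_hat) / y).mean() * 100

print(round(mae, 3), round(rmse, 3), round(mape, 3))   # 2.75 2.872 12.5
```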

3) MODEL PARAMETER SETTINGS
This paper builds the SDSCNN model on the deep learning framework PyTorch and uses 2 NVIDIA TITAN 12GB GPUs for the experiments. In the experiments, the mean absolute error (MAE) is used as the loss function, and the Adam optimizer is used to optimize the model parameters. The learning rate is set to 0.0001, the batch size is set to 40, the maximum number of iterations is set to 500, and the early_stop patience is set to 10. The training parameters are initialized with a random function. Grid search is used to optimize some hyper-parameters; the recommended hyper-parameters are shown in Table 1. K-fold cross-validation is selected as the dataset division method, cross-validation with the hyper-parameters in Table 1 is repeated 10 times, and the mean and standard deviation are reported as the results. In addition, the adjacency matrix Ã ∈ R^(N×N) in equation (6) uses d_g as the maximum threshold of the road network distance, and different datasets have different values. Figure 4 shows the histograms of the road network distances for each dataset. According to the histograms, this paper sets the d_g of PEMS04, PEMS08, and METR-LA to 1500, 1000, and 12000, respectively.
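The training protocol described above (MAE loss, a maximum of 500 iterations, early stopping with patience 10) can be sketched on a toy one-parameter model; plain subgradient descent replaces Adam here, and all names and values are illustrative assumptions:

```python
import numpy as np

# Toy training loop mirroring the protocol above: MAE loss, at most 500
# iterations, early stopping with patience 10. The "model" is a single scalar
# weight fitted by subgradient descent (Adam and the real network are omitted).
rng = np.random.default_rng(4)
x = rng.normal(size=100)
w_true = 2.0
y = w_true * x + 0.01 * rng.normal(size=100)

w, lr = 0.0, 0.1
best, patience, wait = np.inf, 10, 0
for epoch in range(500):                     # maximum number of iterations
    grad = np.sign(w * x - y) @ x / len(x)   # subgradient of the MAE loss
    w -= lr * grad
    val = np.abs(w * x - y).mean()           # validation MAE
    if val < best - 1e-6:
        best, wait = val, 0                  # improvement: reset patience
    else:
        wait += 1
        if wait >= patience:                 # early_stop = 10
            break
print(abs(w - w_true) < 0.2)                 # True: w converged near 2.0
```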
B. BASELINE METHODS
ARIMA [16]: A classic time series prediction method. In the experiments, we first check the data stationarity and difference non-stationary data. We then fit the ARIMA model, check the model parameters for reasonableness, and output the predictions.
SVR [17]: An application of the support vector machine (SVM) to regression tasks. In the experiments, we choose the linear kernel function, and the regularization coefficient is set to 1.

LSTM [7]: A variant of the recurrent neural network that only considers the temporal correlation. The LSTM uses a gating mechanism, including input gates, output gates, and forget gates. We input the traffic flow data into the LSTM to obtain the prediction results.
DCRNN [40]: The DCRNN model represents the road network as a weighted directed graph. The model first performs a diffusion convolution of the graph signal with a filter to map P-dimensional features to Q-dimensional features, which mines the spatial correlation of traffic flow. The output is then used as the input of a GRU to further analyze the temporal correlation. To perform better multi-step prediction, a Seq2Seq model structure is built from the diffusion convolution and the GRU network. We predict by inputting the traffic flow data into the encoder and initializing the decoder with its final state.
STGCN [41]: STGCN consists of several spatial-temporal convolution modules. Each module contains two gated sequence convolution layers and one spatial graph convolution module. Gated sequence convolution is used to capture temporal correlation and consists of a 1D convolution and a GLU. Compared with the RNN series models, this method reduces the computational complexity and simplifies the model training. The spatial correlation is analyzed by Chebyshev graph convolution.
GWN [43]: The GWN model stacks multiple spatial-temporal layers. Each spatial-temporal layer consists of a GCN and a gated TCN, with residual connections in every layer. An adaptive adjacency matrix is employed in the GCN. Meanwhile, diffusion convolution is extended to graph convolution to form diffusion graph convolution, which complements the spatial structure of the road network to a certain degree. GWN uses causal convolution and dilated convolution to convert all inputs into one-dimensional vectors along the time dimension and performs 1D convolution operations.
SLCNN [44]: SLCNN is mainly designed to analyze dynamic spatial correlation. It consists of multiple SLCNN layers, each of which includes a global SLC module and a local SLC module. The global SLC module combines Chebyshev graph convolution with a static adjacency matrix and a dynamic adjacency matrix, respectively, to achieve global spatial correlation analysis. The local SLC module only considers the spatial correlation between the k-nearest-neighbor nodes. Both modules use P3D-SLC networks to analyze the temporal correlation.

C. RESULTS OF TRAFFIC PREDICTION
The experimental results are shown in Table 2. On the PEMS08 dataset, the SDSCNN model consistently outperforms the other benchmark models in accuracy. On the PEMS04 dataset, SDSCNN has the smallest MAE and MAPE, while its RMSE is slightly larger than that of SLCNN. Taking MAPE as the evaluation metric, SDSCNN improves on DCRNN by 6.52%, on STGCN by 1.78%, on GWN by 17.11%, and on SLCNN by 4.02%.
Meanwhile, we visualize and compare the prediction results of the SLCNN and SDSCNN models, as shown in Figure 5. The fit of the SDSCNN model is better than that of SLCNN. Although the predictions of SLCNN show a certain regularity, their accuracy still falls short of the SDSCNN model. This shows that replacing the 1D convolution of SLCNN with the multi-head attention mechanism yields a measurable performance improvement. In addition, SDSCNN shows stronger adaptability when sudden, large changes occur in the traffic flow state. This indicates that, compared with the global and local graph structures of SLCNN, SDSCNN can better capture the dynamic transformation of traffic flow by combining static and dynamic graph structures.

A. SPATIAL CORRELATION ANALYSIS
SDSCNN converts the distances between sensors into static spatial correlations, as shown in Figure 6 (a). From the perspective of the road network structure, this representation can distinguish adjacent road segments from non-adjacent ones, providing a reference for prediction. Figures 6 (b), (c), and (d) show the dynamic spatial correlation coefficients trained from the first SDSCNN layer to the last layer.
We define HCN as the number of correlation coefficients greater than 0.5 in the correlation matrix. From Figure 6, it can be found that the correlation pattern in Figure 6 (a) differs from those in Figures 6 (b), (c), and (d). This indicates that the real road network structure alone cannot fully express the spatial correlation; analyzing the previous traffic flow further improves the model's ability to capture spatial correlations.
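The HCN statistic defined above is straightforward to compute; a minimal sketch (whether the diagonal is counted is not specified in the paper, so we simply count all entries):

```python
import numpy as np

def hcn(corr, threshold=0.5):
    """Count the entries of a correlation matrix above the threshold (HCN)."""
    return int((corr > threshold).sum())

corr = np.array([[1.0, 0.6],
                 [0.2, 0.7]])
print(hcn(corr))   # three entries exceed 0.5
```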
The density of strong correlation coefficients in Figure 6 (b) is lower than in Figures 6 (a), (c), and (d). This is because the first hidden layer of SDSCNN can only obtain information from adjacent nodes for analysis, so its density of strong correlation coefficients is not as large as in Figure 6 (a). When the information aggregated by the previous hidden layer of SDSCNN is used as the input of the next hidden layer, the next hidden layer can obtain more information from non-adjacent nodes. Therefore, the density of strong correlation coefficients in Figures 6 (c) and (d) is higher than in Figure 6 (b).

Table 3 and Figure 7 show how the prediction performance of the various methods changes as the prediction output step increases. In general, the longer the prediction output step, the more difficult the prediction, so the prediction error also increases. As can be seen from the figures, methods that only consider temporal correlation, such as ARIMA, SVR, and LSTM, achieve good results in short-term prediction. However, their forecasting performance drops sharply as the forecast output step increases. In contrast, graph-based methods degrade slowly, mainly because they take into account the traffic flow effects of adjacent areas and of the previous period. The prediction errors of graph-based methods increase slowly with the output step, and they all perform well overall.

B. MULTI STEP PREDICTION ANALYSIS
The SDSCNN model achieves the best prediction performance in almost every scenario, except on the PEMS08 dataset. A careful analysis of the PEMS08 dataset shows that it is noisy and contains some extreme values. Since the RMSE is more sensitive to noisy data, the error rises slightly when the prediction output step is 5. Moreover, compared with the METR-LA dataset, the PEMS-series data are more volatile and contain many sudden flow changes, which makes the prediction task more difficult and leads to larger RMSE values.
In particular, in multi-step prediction tasks, the gap between ASTGCN and the other baselines is more significant, indicating that the strategy of combining graphs with the attention mechanism can better mine the dynamic spatial-temporal patterns of traffic data.

C. HYPER-PARAMETER OPTIMIZATION
We also compare the performance of SDSCNN with different hyper-parameters, including the influence of the time step, the number of layers of SDSCNN, and the number of heads of the multi-head self-attention mechanism on MAPE. The experimental results are shown in Table 4 and Table 5.
From Table 4, we can find that as the time step and the number of SDSCNN layers increase, the MAPE first decreases and then increases. The optimal hyper-parameters of SDSCNN differ across datasets. As can be seen in Table 4, the PEMS04 and PEMS08 datasets perform well when the time step is 8 and the number of SDSCNN layers is 3. On the METR-LA dataset, however, our method achieves the best results when the time step is 7 and the number of SDSCNN layers is 2. This may be caused by the inconsistency between the sensor types of METR-LA and those of the PEMS04 and PEMS08 datasets.

In Table 5, as the number of heads of the multi-head self-attention mechanism increases, the MAPE tends to decrease slightly. When the number of heads is greater than 5, the accuracy of the model no longer improves. This result is reasonable because adding heads increases the number of parameters and raises the risk of overfitting.

D. ABLATION EXPERIMENT
To further analyze the performance of our model, we construct two additional variants with different combinations of modules. These two variants are compared with the SLCNN and SDSCNN models on the PEMS04 dataset. The four models differ as follows: (1) SLCNN (GL+1D): the basic model, which uses global and local modules for spatial convolution and 1D convolution to extract temporal dependencies.
(2) SD+1D: The spatial correlation analysis of this variant is the same as that of our model, while the temporal module uses a 1D CNN.
(3) GL+ATT: The temporal module of this variant is designed based on a multi-head attention mechanism rather than a 1D CNN.
(4) SDSCNN (SD+ATT): The SDSCNN model deploys a multi-layer spatial module and a temporal module based on a multi-head attention mechanism for prediction.
As shown in Figure 8, the model with static and dynamic spatial modules (SD) achieves better prediction performance than the model with global and local modules (GL). Compared with the 1D convolutional temporal module, when the spatial component is (GL), the model based on the multi-head attention mechanism module (ATT) achieves improvements of 3.15%, 1.1%, and 3.78% in MAE, RMSE, and MAPE, respectively. When the spatial component is (SD), the (ATT)-based model achieves improvements of 2.86%, 0.91%, and 2.59% in MAE, RMSE, and MAPE, respectively.

For the task of road network prediction, previous research either started from the adjacency matrix and established a static spatial matrix to represent node relationships, or used trainable parameters to mine node relationships from the data. In this paper, a static spatial matrix and a dynamic spatial matrix are constructed separately. The static spatial matrix is built from the static adjacency matrix calculated from road network distances; the dynamic spatial matrix is determined by combining the historical data characteristics of the nodes with the weight matrix of the model trained in real time. Node aggregation is performed with the static and dynamic spatial matrices respectively, and the results are finally fused by weighting and fed into the temporal module. The ablation experiments show that the spatial component (SD) proposed in this paper improves the accuracy of the model.
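The (SD) spatial component described above, which aggregates the same node features over a static and a dynamic adjacency matrix and fuses the two results by summation, can be sketched as follows. For brevity, a plain matrix-product aggregation stands in for the paper's graph attention layers, so this is an illustrative simplification rather than the actual implementation.

```python
import torch

class FusedSpatialLayer(torch.nn.Module):
    """Sketch of the SD spatial component: node features are aggregated
    over a static adjacency matrix (road-network distances) and a dynamic
    adjacency matrix (learned from traffic data), and the two results are
    fused by summation before the temporal module."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_static = torch.nn.Linear(in_dim, out_dim)
        self.w_dynamic = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, a_static, a_dynamic):
        h_s = a_static @ self.w_static(x)    # aggregation over the static graph
        h_d = a_dynamic @ self.w_dynamic(x)  # aggregation over the dynamic graph
        return torch.relu(h_s + h_d)         # sum fusion, then nonlinearity
```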
Additionally, traffic information (vehicle flow, speed, etc.) is clearly periodic, and the 1D convolution components used in the original model cannot extract this periodicity. The multi-head self-attention mechanism, by contrast, can mine these periods and does not share the weakness of RNN-like models in capturing long-term dependencies. Therefore, the multi-head attention mechanism significantly improves the performance of the SDSCNN model.

E. LIMITATIONS AND FUTURE WORK
Although our model has achieved good results in traffic prediction, there are still some limitations to consider in this study. Firstly, we ignored the multi-modal features of the input data, such as heterogeneous datasets of weather, car accidents, and unexpected situations. The model's prediction performance may be affected by the relatively homogeneous characteristics of the dataset. Secondly, we directly employ the summation operation to fuse the static and dynamic spatial correlation analysis results, which is a relatively simple fusion method. To further improve the prediction performance, future work could let the model automatically learn the weights of these two components. Furthermore, we can introduce traffic sensing methods to improve data quality and thus enhance prediction accuracy [52], [53]. Biased estimated traffic speed data can be corrected by Wi-Fi and Bluetooth passive sensing technology, and some missing values can be computed by detecting time differences. Thirdly, this paper does not consider real-world application scenarios such as the compressed transmission of model datasets and abnormal-data removal in the Internet of Vehicles (IoV) [54], [55]. Applying the model to the Internet of Vehicles will also be a direction of our future work.
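As one possible realization of the second direction above, the fixed summation could be replaced by a learnable, normalized pair of fusion weights. This sketch is entirely our own assumption and is not part of the paper's model.

```python
import torch

class WeightedFusion(torch.nn.Module):
    """Hypothetical learnable fusion of the static and dynamic outputs:
    two softmax-normalized scalar weights are trained jointly with the
    rest of the model, instead of a fixed elementwise sum."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(2))

    def forward(self, h_static, h_dynamic):
        w = torch.softmax(self.logits, dim=0)   # two weights summing to 1
        return w[0] * h_static + w[1] * h_dynamic
```

With zero-initialized logits, the module starts as an even 0.5/0.5 average of the two components and lets training shift the balance.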

VI. CONCLUSION
In this paper, we propose a hybrid model that integrates a static and dynamic spatial correlation neural network for traffic flow prediction. Specifically, in our model, the graph attention network is used to construct static and dynamic modules based on the distance correlations and traffic information of the road network. Then, the periodicity and temporal correlation of traffic flow are captured using the multi-head self-attention mechanism. Finally, multiple spatial-temporal layers are stacked for prediction. Experiments on three public traffic flow datasets show that our SDSCNN outperforms several state-of-the-art methods.