A General Traffic Flow Prediction Approach Based on Spatial-Temporal Graph Attention

Accurate and reliable traffic flow prediction is critical to the safe and stable deployment of intelligent transportation systems. However, it is very challenging due to the complex spatial and temporal dependencies of traffic flows. Most existing works require information about the traffic network structure and human intervention to model the spatial-temporal associations of traffic data, resulting in low generality of the model and unsatisfactory prediction performance. In this paper, we propose a general spatial-temporal graph attention based dynamic graph convolutional network (GAGCN) model to predict traffic flow. GAGCN uses graph attention networks to automatically extract the spatial associations among nodes hidden in the traffic feature data, which can be dynamically adjusted over time. The graph convolution network is then adjusted based on these spatial associations to extract the spatial features of the road network. Notably, neither information about the road network structure nor human intervention is required in GAGCN. The forecasting accuracy and the generality are evaluated on two real-world traffic datasets. Experimental results indicate that our GAGCN surpasses the state-of-the-art baselines on one of the two datasets.


I. INTRODUCTION
The rapid growth in the number of vehicles has put tremendous pressure on urban traffic, which has seriously affected people's daily lives. Therefore, it is necessary to find effective technical means to improve traffic management efficiency and ease traffic problems. As a critical part of Intelligent Transportation Systems (ITS) [1], short-term traffic flow prediction forecasts the traffic conditions of a road section over the next 5-60 minutes and provides great help in many areas, such as signal control, traffic guidance, and path planning.
In the real world, traffic flow data is affected by many factors and is highly complex and nonlinear, so accurate traffic prediction is very challenging. After decades of research, traffic flow prediction methods can be classified into two approaches: model-driven and data-driven. Model-driven methods are also called parametric methods; examples include time-series models, which have a well-established theoretical background. However, such methods require many parameters and assumptions to apply to an entire network, which makes their prediction performance unsatisfactory. Recently, with the improvement of transportation infrastructure, data collection technologies such as monitoring points and detectors have provided a mass of available data for traffic flow prediction. Data-driven approaches can be separated into two subclasses: machine learning and deep learning [2]-[4]. Common machine learning methods are inadequate when processing high-dimensional data and also rely on detailed feature engineering; therefore, these methods have fragile generality. Deep learning models, for instance convolutional neural networks, long short-term memory (LSTM) networks, and their combinations, have achieved great success in traffic prediction [5]. Their success is mainly due to their good performance when dealing with highly nonlinear, dynamic, and multidimensional problems with arbitrary precision. (The associate editor coordinating the review of this manuscript and approving it for publication was Xi Peng.)

FIGURE 1. Topology graph of a traffic network. The detector nodes deployed in the road network can be regarded as vertices of the topology graph. We connect the nodes at each location so that the road network can be abstracted into a topology graph. Then, we predict the vehicle speed of each detector in the road network over the next period of time.

In traffic networks, the detector nodes are deployed on the traffic roads and form a topological graph with a non-Euclidean structure, as shown in Fig. 1. Observations obtained at nearby locations influence each other, resulting in spatial local associations. But traditional deep learning methods are not suitable for processing non-Euclidean data. An ideal way to process non-Euclidean structured data is to use a graph convolutional network (GCN) [6], whose essential purpose is to collect the spatial features of the topological graph. Graph convolution includes the vertex domain and the spectral domain; when using the vertex domain to extract features, because the neighbors of each vertex are different, the calculation process must be performed for each vertex separately. The spectral domain is the focus of GCN research: it regards the features of each node as signals on the graph and studies the features of the graph through spectral analysis to realize the topological graph convolution operation. However, existing traffic flow prediction models based on graph convolutional networks use fixed distance information between nodes when constructing the Laplacian matrix of the graph, and ignore the dynamic changes in the associations/weights among nodes. Even though some models consider that the associations among nodes change over time, their methods of dynamically adjusting the spatial associations/weights among nodes are not aimed at topology graphs. Moreover, they all rely on the road network configuration, such as the positions of the detectors deployed on the road and the distances among the detectors. These methods cannot reflect the true spatial-temporal properties of the traffic roads and lack generality.
The limitations of existing traffic prediction models based on graph convolutional networks encourage us to design a novel traffic flow prediction framework. We have two observations regarding this problem. First, besides distance, many other factors should be considered when modeling the spatial associations/weights among nodes, and we should study the dynamic spatio-temporal associations of traffic flow data with a non-Euclidean structure from the perspective of the topological graph, as shown in Fig. 2. Second, to make the model adapt to different road network structures and improve the generality of the prediction model, we should reduce subjective human participation and try to predict the traffic conditions of the road network without knowing the road network structure in advance.
In this paper, we propose a spatial-temporal graph attention based dynamic graph convolutional network (GAGCN), which predicts road network traffic flow based on spatial-temporal features and has better generality and prediction accuracy than previous approaches. Our main contributions can be summarized as follows:
• We develop a graph attention mechanism to dynamically adjust the spatial associations/weights among nodes over time. We identify the associations/weights among nodes hidden in the traffic data through the graph attention network, and the Laplacian matrix of the road network topology graph is dynamically adjusted in line with the spatio-temporal features of the traffic data.
• The structural information of the road network, such as the positions of the detectors, is not required; our model needs only traffic flow feature data. The proposed method reduces the error introduced by people's prior knowledge in previous models and also improves the generality of the model.
• Large-scale experiments are performed on two widely used traffic datasets. The experimental results confirm the prediction accuracy and generality of our model on different datasets. The source code can be found at https://github.com/sam101340/GAGCN-BC-20200720.
The remainder of this paper is organized as follows: Section II introduces the research and development of traffic flow prediction. Section III introduces the spatial-temporal graph attention based dynamic graph convolutional network. Section IV presents our experiments and analyzes the results, and Section V concludes our work.

II. RELATED WORK

A. TRAFFIC FLOW PREDICTION
In recent years, many well-performing prediction models have been proposed to assist signal control, traffic guidance, and path planning. Traffic data has the characteristic of flowing and is a typical time series: given traffic data $T$, the task is to predict the traffic parameters (such as speed, traffic flow, or occupancy) $Y$ of the next $H$ time points from the traffic data of the past $P$ time points of all the nodes in the road network, where $Y = (y_1, y_2, \ldots, y_N)^T \in \mathbb{R}^{N \times H}$ and $y_i = (y^i_{P+1}, y^i_{P+2}, \ldots, y^i_{P+H}) \in \mathbb{R}^H$ denotes the future parameters of node $i$ from time point $P+1$ onward.
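To make the formulation above concrete, the following is a minimal NumPy sketch (the function and variable names are illustrative, not from the paper) of how history/horizon training pairs can be sliced from a multi-detector series:

```python
import numpy as np

def make_samples(series, P, H):
    """Slice a (T, N) multi-node series into (history, horizon) pairs.

    series : array of shape (T, N) -- T time points for N detector nodes.
    Returns X of shape (S, P, N) and Y of shape (S, N, H), where each
    Y[s, i] holds the H future values of node i following window s.
    """
    T = series.shape[0]
    S = T - P - H + 1  # number of complete windows
    X = np.stack([series[s:s + P] for s in range(S)])
    Y = np.stack([series[s + P:s + P + H].T for s in range(S)])
    return X, Y

# toy example: 20 time points, 3 nodes, P=12 history points, H=3 horizon
data = np.arange(20 * 3, dtype=float).reshape(20, 3)
X, Y = make_samples(data, P=12, H=3)
```

Each row of `Y` then plays the role of $y_i$ above, the future parameters of node $i$ starting at time point $P+1$ of its window.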
The time-series analysis model uses mathematical formulas to model past behavior and then uses the obtained model to predict future results. ARIMA [7] is a classic statistical model in time-series analysis, which is widely used in traffic prediction [8]. References [9] and [10] extended the ARIMA time-series model to the spatial domain to obtain the spatio-temporal autoregressive integrated moving average (STARIMA) models, and STARIMA achieved good results in the field of traffic flow prediction [11]. However, the time-series analysis model is a purely inductive method that requires some ideal prior assumptions, and it is difficult to satisfy these assumptions in the real world because of the inherent complexity of traffic data. Therefore, the above methods often perform poorly in practical applications.
Machine learning methods, for instance support vector regression (SVR) [12], [13], the k-nearest neighbor algorithm (KNN) [14], and K-means [15], have a solid mathematical foundation and can help us handle more complex traffic data. To achieve the theoretical advantages of these methods, the premise is choosing appropriate parameters and conducting detailed feature engineering.
Recently, deep learning models with good learning capabilities and deep network structures have developed rapidly. Owing to their good performance, deep learning models have also made great progress in traffic flow prediction; stacked autoencoders (SAEs) [16] and deep belief networks (DBN) [17], [18] are examples of such deep learning based models.
However, fully connected networks are not sufficient to extract the spatial-temporal features of traffic flows: they only process one region at a time, and the configuration of their neurons cannot meet this demand. Convolutional neural networks (CNNs) [19], as typical deep neural networks, have made many breakthroughs in image processing, and recently some researchers have used CNNs to capture spatial features in traffic prediction tasks [20]. Gated recurrent unit networks (GRUs) [21], [22] and long short-term memory (LSTM) networks [23] are both good at processing time series [24] and have also been used in traffic flow prediction. Researchers then combined CNN and LSTM networks to propose CLTFP [25], a feature-level fusion architecture for short-term traffic prediction. Later, a convolutional LSTM was proposed by [26], which embedded an extended fully connected LSTM (FC-LSTM) [27] into the convolutional layers. Compared with the earlier methods, CLTFP and FC-LSTM can obtain better prediction performance. Convolutional neural networks can effectively capture the spatial features of grid data. However, when we consider the traffic road network and the detectors deployed on it as a topology graph, convolutional neural networks cannot extract the spatial features of the road network: the number of neighbors of each vertex in the topology graph differs, so the convolution operation cannot be performed with a convolution kernel of the same size.
However, it is hoped that spatial features can be effectively extracted on data structures such as topology graphs, so GCNs have become a research focus. Existing graph convolutional networks can be divided into two types according to the convolution operator: vertex domain and spectral domain. Vertex-domain methods find the adjacent neighbors of each vertex to extract the spatial features of the topology graph [28]; however, the neighborhood of each vertex is different, so each vertex needs to be processed separately. Spectral-domain methods use spectral graph theory to convolve the topological graph. Bruna et al. [29] proposed a general graph convolution framework, and then Defferrard, Bresson, and Vandergheynst [6] approximated it with Chebyshev polynomials to decrease the computational complexity of the model.
Recently, many researchers have used GCNs to predict traffic flow [30]-[33]. Spatial-Temporal Graph Convolutional Networks (STGCN) [34] constructs a fixed Laplacian matrix based on the spatial distances among the detector nodes and human experience. Further, Attention Based Spatial-Temporal Graph Convolutional Networks (ASTGCN) [35] uses an attention mechanism [36] to capture the dynamic associations among nodes. However, the construction of the Laplacian matrix required by the GCNs of the above methods depends entirely on the spatial distances among the detector nodes in the road network and on human experience, which gives these models great limitations.

B. ATTENTION MECHANISM
The attention mechanism, as a relatively new technique, has developed rapidly in recent years and is widely used in fields such as natural language processing and speech recognition. Attention mechanisms allow neural networks to focus on the input data and provide helpful information for the current task. An alignment model was proposed to evaluate the match between input and output [37]. After that, a neural network architecture consisting of two memory networks was proposed, which can model the semantic association and relationship between each word and two entities [36]. Based on the above research, Graph Attention Networks (GATs) [38] were proposed. GATs do not need to know the structure of the graph in advance; they focus only on the feature data of the nodes and use a self-attention layer to specify the weights among the nodes, achieving state-of-the-art results on three difficult benchmarks. In traffic flow prediction, to extract spatial features, Liu et al. perform a 2D convolution on each feature map to obtain the corresponding attention matrix, perform maximum average pooling on each feature map, and use the result as the input of a feed-forward neural network to obtain the channel attention [39]. Recently, Zheng et al. used a scaled dot-product attention mechanism to obtain spatial-temporal attention, and a transform attention layer from encoder to decoder [31].
To improve the generality of the prediction model and decrease the errors caused by human experience, we propose a spatial-temporal graph attention based dynamic graph convolutional network. Our framework employs graph attention networks to find the spatial dependencies hidden in the traffic data and adjusts the Laplacian matrix over time.

III. METHODOLOGY
In this section, before presenting our model in detail, we introduce some background and explain the definitions that appear in this article.

B. GCN
Consider the traffic network as a graph $G = (V, E, A)$, where $V$, $E$, and $A$ represent the set of vertices (detector positions), the set of edges, and the adjacency matrix of $G$, respectively. The vehicle speed $v$ observed by the detectors can be regarded as a graph signal defined on $G$, where $v_i$ is the signal value at the $i$-th node. The theoretical core of graph convolution is the eigendecomposition of the graph Laplacian matrix. The graph Laplacian is $L = D - A$, where $D$ is the degree matrix of the vertices. Further, the Laplacian matrix can be normalized as $L = I_n - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ ($I_n$ is an identity matrix). In general, the performance of graph convolutional networks mainly depends on the quality of $A$ [40]. Eigendecomposition of the Laplacian matrix yields its eigenvector matrix $U$ and eigenvalue matrix $\Lambda$, so the Laplacian matrix can be expressed as $L = U \Lambda U^T$, where $\Lambda \in \mathbb{R}^{N \times N}$ is a diagonal matrix and $U \in \mathbb{R}^{N \times N}$ is the Fourier basis. Let $g_\theta = \mathrm{diag}(\theta)$ be a graph convolution filter parameterized by $\theta \in \mathbb{R}^N$. The graph convolution of $x$ defined in the Fourier domain is:

$$g_\theta *_G x = g_\theta(L)\,x = U g_\theta(\Lambda) U^T x \tag{1}$$

where $*_G$ denotes a graph convolution operation and $U^T x$ is the graph Fourier transform of $x$. However, it is expensive to use (1) to calculate the eigenvector matrix of $L$ when the structure of the graph is very complicated. To solve this problem, we use a $K$-th order Chebyshev polynomial approximation to reduce the computational complexity [41], and (1) can be further approximated as:

$$g_\theta *_G x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\,x \tag{2}$$

where $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients, $\tilde{L} = \frac{2}{\lambda_{max}} L - I_n$ ($\lambda_{max}$ is the largest eigenvalue of $L$), and $T_k$ is the Chebyshev polynomial of order $k$.

The core of GCN is the spectral decomposition of the Laplacian matrix, so building an accurate Laplacian matrix is very helpful for improving prediction accuracy. We first propose a method of constructing a dynamic Laplacian matrix from the traffic data of the nodes, and then introduce a novel dynamic spatial-temporal GCN for traffic speed prediction. As shown in Fig. 3, our framework is composed of three modules: a Laplacian matrix module constructed from graph attention networks, two spatial-temporal convolution blocks, and an output layer. Taking the $m$-th time slice as an example, we use the constructed traffic data $T \in \mathbb{R}^{P \times N \times C}$ as the input of the graph attention networks to obtain the weighted adjacency matrix $A$ of the graph. Then we use $L = D - A$ to obtain the Laplacian matrix $L$, and spectrally decompose $L$ to obtain the graph convolution kernel required for graph convolution. $V \in \mathbb{R}^{P \times N \times C}$ is the only input data of the ''ST-Conv block''. There are two temporal convolution blocks and one spatial convolution block in each ''ST-Conv block'', with the spatial convolution block located between the two temporal convolution blocks. The output layer includes a temporal convolution layer and a fully connected layer; it integrates all features to produce the final prediction result $\tilde{V}$.
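The Chebyshev-approximated graph convolution described above can be illustrated with a minimal NumPy sketch (the function name and the dense-matrix implementation are illustrative only; a practical implementation would use sparse operations and learn the coefficients):

```python
import numpy as np

def cheb_graph_conv(x, L, theta):
    """K-th order Chebyshev approximation of the spectral graph convolution.

    x     : (N,) graph signal
    L     : (N, N) graph Laplacian
    theta : (K,) Chebyshev coefficients
    Uses the recurrence T_0(z) = 1, T_1(z) = z, T_k = 2 z T_{k-1} - T_{k-2},
    applied to the rescaled Laplacian L~ = 2 L / lambda_max - I.
    """
    N = L.shape[0]
    lam_max = np.max(np.linalg.eigvalsh(L))
    L_t = 2.0 * L / lam_max - np.eye(N)   # rescaled Laplacian L~
    Tx = [x, L_t @ x]                     # T_0(L~) x and T_1(L~) x
    for _ in range(2, len(theta)):
        Tx.append(2.0 * L_t @ Tx[-1] - Tx[-2])
    return sum(t * tx for t, tx in zip(theta, Tx[:len(theta)]))
```

With $\theta = (1, 0, \ldots)$ the filter reduces to the identity, which is a quick sanity check on the recurrence.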

1) OBTAINING LAPLACIAN MATRIX BY GRAPH ATTENTION NETWORKS
In previously proposed methods of predicting traffic flow with GCNs, specific road network information (such as detector position information) is needed to obtain a weighted adjacency matrix, which reduces the generality of the model. They also need prior knowledge (such as selecting a specific mathematical model to control the sparsity of the adjacency matrix) to construct the weighted adjacency matrix $A$.
In this paper, we implicitly assign different weights to the nodes in a neighborhood by graph attention networks and do not rely on knowing the road network structure in advance (such as the spatial position information of the detector nodes). Fig. 4 shows the details of obtaining the Laplacian matrix through the graph attention networks. Taking into account the impact of changes in time and traffic conditions on the relationships among nodes, we split the data into $M$ time slices and batch-input them to the graph attention network. Take the $m$-th time slice as an example: the data at the $p$-th time point is $T_p = \{\vec{T}_1, \vec{T}_2, \vec{T}_3, \ldots, \vec{T}_N\}$ ($\vec{T}_i \in \mathbb{R}^C$, where $N$ denotes the number of detectors and $C$ represents the number of traffic features), and the $m$-th time slice contains $P$ time points. Each time point of data yields an association matrix through the graph attention network. Finally, we average the $P$ association matrices to obtain the association/weight matrix (weighted adjacency matrix) of the $m$-th time slice.
As shown in Fig. 4, for data $T_p$, a one-dimensional convolution layer is first employed to transform the input features on each node into higher-level features, yielding $T'_p = \{\vec{T}_1, \vec{T}_2, \vec{T}_3, \ldots, \vec{T}_N\}$, $T'_p \in \mathbb{R}^{N \times C'}$ ($C'$ is the number of filters); the time complexity of this step is $O_1(N \cdot K \cdot C \cdot C')$, where $K$ is the size of the convolution kernel. Before obtaining the attention coefficients among the nodes on the graph, the high-dimensional feature data $T'_p$ is linearly transformed by the parameter matrix $W \in \mathbb{R}^{C' \times 1}$. The attention coefficient can be expressed as:

$$e_{ij} = a\left(W\vec{T}_i, W\vec{T}_j\right) \tag{3}$$

where $a \in \mathbb{R}^{2C'}$ is a single-layer feedforward neural network and $e_{ij} \in \mathbb{R}^{P \times N \times N}$. Equation (3) indicates the degree of influence of the traffic condition of node $j$ on node $i$, and its time complexity is $O_2(2N \cdot K \cdot C')$. This process is applied to each node on the graph. By implementing masked attention, the structural information of the graph is incorporated into the mechanism: $e_{ij}$ is computed only for nodes $j \in N_i$, where $N_i$ is some neighborhood of node $i$ in the graph. To make the attention coefficients comparable across different nodes, we normalize them with a softmax function:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \tag{4}$$

Then, applying the LeakyReLU nonlinearity, the coefficients computed by the attention mechanism can be expressed as:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^T \left[W\vec{T}_i \oplus W\vec{T}_j\right]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^T \left[W\vec{T}_i \oplus W\vec{T}_k\right]\right)\right)} \tag{5}$$

where $\oplus$ represents the concatenation operation. Once the normalized attention coefficient $\alpha_{ij}$ is obtained (i.e., the weight relationship between node $i$ and node $j$), we can get each element of the weighted adjacency matrix:

$$W_{ij} = \alpha_{ij} \tag{6}$$

Only one attention computation mode, as described above, has been used so far (as shown in Fig. 5). But in order to stabilize the self-attention learning process, it is beneficial to use multi-head attention, as shown in Fig. 6. In this case, the time complexity of the calculation process in Fig. 6 is $O_3(K \cdot (N^2 \cdot K \cdot C' + 2N \cdot K \cdot C'))$. We finally average the $K$ attention coefficients:

$$\alpha_{ij} = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \alpha_{ij}^k\right) \tag{7}$$

where $K$ denotes the number of attention computations and $\sigma$ is a nonlinear transformation.
$\alpha_{ij}^k$ is the attention coefficient computed by the $k$-th attention mechanism $a^k$. Finally, we get all the $W_{ij}$ and construct the final weighted adjacency matrix $A$. Then we get the Laplacian matrix $L$ according to $L = D - A$.

FIGURE 5. Constructing the attention coefficients $\alpha_{ij}$. Our model linearly transforms the input traffic data $T$ to obtain a higher-dimensional representation, and then employs the attention mechanism $a(W\vec{T}_i, W\vec{T}_j)$ and the nonlinear activation function $\mathrm{softmax}_j$ to obtain the attention coefficients $\alpha_{ij}$.

FIGURE 6. An illustration of multi-head attention. For nodes $i$ and $j_1$, multi-head attention can be viewed as using several different attention mechanisms $a$ to obtain the attention coefficient between nodes $i$ and $j_1$. Each attention mechanism $a$ yields a corresponding $\alpha_{ij_1}$, and these $\alpha_{ij_1}$ are then averaged to give the final attention coefficient.
In the end, the overall time complexity of the graph attention network is $O(K \cdot (N^2 \cdot K \cdot C' + 2N \cdot K \cdot C') + N \cdot K \cdot C \cdot (C' + 2))$. Treating $K$, $C$, and $C'$ as constants, the time complexity of the graph attention networks becomes $O(N^2)$.
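The single-head attention computation described above can be sketched as follows (a simplified NumPy illustration with hypothetical names; masking over neighborhoods, multi-head averaging, and batching over the $P$ time points are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along an axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gat_adjacency(T_p, W, a):
    """Single-head graph-attention sketch producing a weighted adjacency.

    T_p : (N, C) node features at one time point
    e   : (C, C') shared linear transform W
    a   : (2*C',) attention vector
    Returns an (N, N) matrix whose row i holds the coefficients alpha_ij.
    """
    H = T_p @ W                                 # (N, C') transformed features
    N = H.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # e_ij = LeakyReLU(a^T [W T_i || W T_j]), negative slope 0.2
            z = a @ np.concatenate([H[i], H[j]])
            e[i, j] = z if z > 0 else 0.2 * z
    return softmax(e, axis=1)                   # normalize over neighbors j
```

Each row of the returned matrix sums to one, so it can be used directly as a row of the weighted adjacency matrix $A$ before forming $L = D - A$.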

2) JOINT EXTRACTION OF SPATIAL FEATURES BY GRAPH ATTENTION NETWORKS AND GCN
Our purpose is to predict the vehicle speed $Y$ over the next $H$ time steps from the traffic data of the past $P$ time points of all the nodes in the road network, where $Y = (y_1, y_2, \ldots, y_N)^T \in \mathbb{R}^{N \times H}$ and $y_i = (y^i_{P+1}, y^i_{P+2}, \ldots, y^i_{P+H}) \in \mathbb{R}^H$ denotes the future velocity of node $i$ from time point $P+1$ onward. All the features are selected as the input of the graph attention networks, and only the velocity is selected as the input of the GCN. The input of the graph convolutional layer can be expressed as $x \in \mathbb{R}^{N \times P \times C_{in}}$, where $N$, $P$, and $C_{in}$ represent the number of nodes, time points, and channels, respectively.
We create the Laplacian matrix $L$ of the graph from the weighted adjacency matrix $A$ and perform an eigendecomposition of $L$. We set $K = 1$ and use a layer-wise linear formulation to stack multiple localized graph convolutional layers. With the first-order approximation of the graph Laplacian, in this linear formulation of a GCN we can further approximate $\lambda_{max} \approx 2$ [42]. Under these approximations, (2) simplifies to:

$$g_\theta *_G x \approx \theta_0 x + \theta_1 (L - I_n)\,x = \theta_0 x - \theta_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x \tag{9}$$

where $\theta_0$ and $\theta_1$ are two shared parameters; they can be replaced by a single parameter $\theta$ by letting $\theta = \theta_0 = -\theta_1$.
To avoid repeated application of this operator, which may lead to numerical instabilities and exploding/vanishing gradients [42], $A$ and $D$ are renormalized as $\tilde{A} = A + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, respectively. Then (9) can alternatively be expressed as:

$$g_\theta *_G x \approx \theta \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \right) x \tag{10}$$

Finally, the graph convolution in (10) can be rewritten in matrix form as:

$$Y = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta \tag{11}$$

where $Y \in \mathbb{R}^{N \times C_{out}}$, and $C_{in}$, $C_{out}$ represent the sizes of the input and output feature maps, respectively (in this case, $C_{in} = 1$). According to reference [33], the time complexity of the graph convolution network is $O(K(N \cdot C_{in} + N \cdot C_{in} \cdot C_{out}))$, where $K$ is the order of the Chebyshev polynomial approximation. Taking $K$, $C_{in}$, and $C_{out}$ as constants, the time complexity of the graph convolution module becomes $O(N)$.
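The renormalized first-order propagation rule above can be sketched in a few lines of NumPy (illustrative names; a real implementation would batch over time points and use sparse matrices):

```python
import numpy as np

def gcn_layer(X, A, Theta):
    """First-order GCN step with the renormalization trick:
    Z = D~^{-1/2} A~ D~^{-1/2} X Theta, with A~ = A + I.

    X     : (N, C_in) node features
    A     : (N, N) weighted adjacency matrix
    Theta : (C_in, C_out) learnable parameters
    """
    N = A.shape[0]
    A_t = A + np.eye(N)                    # A~ = A + I_n
    d = A_t.sum(axis=1)                    # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_t @ D_inv_sqrt @ X @ Theta
```

A quick check: with an all-zero adjacency matrix and an identity `Theta`, the renormalized operator reduces to the identity and the layer returns `X` unchanged.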

3) GATED CNNs FOR EXTRACTING TEMPORAL FEATURES
As shown in Fig. 7, the traditional 2-D convolution operation of a CNN is used to obtain the short-term features of traffic flow. We use $\Upsilon \in \mathbb{R}^{N \times P \times C_{in}}$ as the input of the temporal convolution layer ($N$ and $P$ represent the sizes of the spatial and temporal dimensions, respectively). The convolution kernel $\Gamma \in \mathbb{R}^{K \times C_{in} \times 2C_{out}}$ maps the input to an output element $[A, B] \in \mathbb{R}^{N \times (P-K+1) \times 2C_{out}}$ ($A$ and $B$ have the same number of channels, which is half of the total). The temporal convolution can be defined as:

$$\Gamma *_\tau \Upsilon = A \odot \sigma(B) \in \mathbb{R}^{N \times (P-K+1) \times C_{out}}$$

where $A$ and $B$ are the inputs of the gates in the GLU, and $\odot$ is the Hadamard product. The sigmoid gate $\sigma(B)$ is a gating mechanism that controls which information in $A$ can be passed to the next layer.
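The gated temporal convolution can be illustrated for a single node as follows (a minimal NumPy sketch with hypothetical names; real implementations convolve all $N$ nodes at once and learn the two kernel halves jointly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_temporal_conv(X, Wa, Wb):
    """1-D convolution along time with a GLU gate for one node.

    X      : (P, C_in) one node's time series
    Wa, Wb : (K, C_in, C_out) kernels producing the A (linear) and
             B (gate) halves of the output.
    Returns (P - K + 1, C_out): A * sigmoid(B) for each valid window.
    """
    K = Wa.shape[0]
    P = X.shape[0]
    out = []
    for t in range(P - K + 1):
        window = X[t:t + K]                      # (K, C_in)
        A = np.einsum('kc,kcd->d', window, Wa)   # linear half
        B = np.einsum('kc,kcd->d', window, Wb)   # gate half
        out.append(A * sigmoid(B))
    return np.stack(out)
```

With a zero gate kernel, $\sigma(B) = 0.5$ everywhere, so the output is simply half of the linear response, which makes the gating behavior easy to verify.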

4) SPATIAL-TEMPORAL CONVOLUTIONAL BLOCK
In order to improve the prediction accuracy, we fuse the spatial convolution block with the temporal convolution block, and jointly process the time series of the graph structure by using the space-time convolution block.

IV. EXPERIMENTS

A. DATASET DESCRIPTION
We validate our model on the real-world traffic dataset PeMS (collected by the California Department of Transportation). The dataset contains key attributes such as the overall flow at each detector node, average lane occupancy, and average vehicle speed. It also contains the geometric information of the detectors and the corresponding timestamps, as detailed below:
• PeMSD4: It refers to the traffic data in the San Francisco Bay Area, containing 3848 detectors on 29 roads. We randomly select 50/100 detectors and select data for major traffic routes during the workdays from May 1, 2012 to June 30, 2012. The traffic data are aggregated and output by each detector every 5 minutes.
• PeMSD7: It refers to traffic data for the Los Angeles area, which includes 39,000 detectors. We randomly select 100/206 detectors and select data for major traffic routes during the workdays from May 1, 2012 to June 30, 2012. The traffic data are aggregated and output by each detector every 5 minutes.

We select three traffic features: traffic volume, average lane occupancy, and speed; the time interval of the data is 5 minutes, so each node contains 288 data points per day. The missing values are filled by linear interpolation. In addition, z-score normalization is performed separately on the three traffic attributes.
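The preprocessing described above, linear interpolation of missing values followed by per-attribute z-score normalization, can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def fill_and_normalize(x):
    """Fill NaNs by linear interpolation, then z-score normalize.

    x : 1-D series of one traffic attribute at one detector.
    Returns the normalized series (zero mean, unit variance).
    """
    x = x.astype(float).copy()
    idx = np.arange(len(x))
    nan = np.isnan(x)
    # interpolate each missing point from its valid neighbors
    x[nan] = np.interp(idx[nan], idx[~nan], x[~nan])
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma
```

Applying this separately to volume, occupancy, and speed keeps the three attributes on comparable scales before training.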

B. EXPERIMENTAL SETTINGS
All experiments are performed and tested on the Windows operating system (CPU: Intel Core i7-8700 @ 3.20 GHz, GPU: NVIDIA GeForce GTX 1070Ti). A grid search strategy is used to find the best parameters on the validation set. In this paper, the historical time window of all experiments is 60 minutes, i.e., 12 observed data points (P = 12), for predicting the next one hour (H = 3, 6, 9, 12).
During the training phase, RMSprop is used to optimize the mean square error. All baselines are also trained for 50 epochs with a batch size of 30. The initial learning rate is $10^{-4}$, with a decay rate of 0.7 after every 5 epochs. We use the first-order approximation, and both the spatial and temporal convolution kernel sizes are set to 1.
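The step-decay schedule described above (initial rate $10^{-4}$, multiplied by 0.7 after every 5 epochs) can be written as a small helper (illustrative, not the authors' code):

```python
def learning_rate(epoch, base=1e-4, decay=0.7, step=5):
    """Step-decay learning-rate schedule: multiply the base rate by
    `decay` once for every completed block of `step` epochs."""
    return base * decay ** (epoch // step)
```

For example, the rate stays at 1e-4 for epochs 0-4, drops to 7e-5 at epoch 5, and to 4.9e-5 at epoch 10.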

1) EVALUATION INDICATORS AND BASELINES

a: EVALUATION INDICATORS
We use Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE) to evaluate the performance of different models. They are defined as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| v_t - \tilde{v}_t \right|$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{v_t - \tilde{v}_t}{v_t} \right| \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( v_t - \tilde{v}_t \right)^2}$$

where $v_t$ is the detected vehicle speed and $\tilde{v}_t$ is the predicted vehicle speed.
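The three metrics can be implemented directly (a minimal NumPy sketch; `v` holds the detected speeds and `v_hat` the predictions):

```python
import numpy as np

def mae(v, v_hat):
    """Mean Absolute Error."""
    return np.mean(np.abs(v - v_hat))

def mape(v, v_hat):
    """Mean Absolute Percentage Error, in percent (assumes v != 0)."""
    return np.mean(np.abs((v - v_hat) / v)) * 100.0

def rmse(v, v_hat):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((v - v_hat) ** 2))
```

Note that MAPE is undefined where the observed speed is zero, so near-zero observations are usually excluded or clipped in practice.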

b: BASELINES
We compare our GAGCN with the following baselines:
• HA: Historical Average method. Here, we use the average value of the last hour to predict the next value.
• VAR [43]: Vector Auto-Regression, a time-series model that can conveniently model multiple time series.
• DCRNN [30]: A diffusion convolutional recurrent neural network, which captures the spatial dependence between nodes on the graph by bidirectional random walk.
• ASTGCN [35]: Attention Based Spatial-Temporal Graph Convolutional Networks, which can learn the dynamic spatial-temporal associations of traffic data.

1) FORECASTING ACCURACY
We validated our model and ten baselines on the datasets PeMSD7 and PeMSD4; the mean and standard deviation over these two datasets are shown in Table 2. Table 1 shows the average prediction results of each baseline at four time steps (15 minutes, 30 minutes, 45 minutes, and 1 hour) over the next hour. It can be seen from Table 1 that the prediction results of the traditional time-series analysis methods (HA, VAR) are usually not ideal, demonstrating those methods' limited ability to model nonlinear and complex traffic data. The comparison results show that traditional deep learning models like CNN-LSTM have better prediction performance than traditional time-series analysis methods. The prediction results of graph convolutional network models without an attention mechanism, such as STGCN and MSTGCN, are better than those of traditional spatial-temporal deep learning models such as CNN-LSTM. When the road network structure is complicated and the positions of the detectors are relatively random, the prediction accuracy of ASTGCN is lower than that of traditional spatial-temporal models without an attention mechanism, such as STGCN. Compared with DCRNN, Graph WaveNet, and GMAN on PeMSD7, GAGCN does not inject road network graph structure information, yet it still achieves the best prediction performance. As shown in Table 1, on both datasets the prediction performance of GAGCN is better than that of MTGNN, which shows that using a graph attention network to extract the correlations among nodes hidden in the various traffic features is more effective. The prediction results of STGCN, ASTGCN, and GAGCN (all graph-convolution-based methods) on a certain day are shown in Fig. 8; we can easily observe that GAGCN is closest to the ground truth compared with the other two methods. Fig. 9 shows how the prediction performance of each method changes as the number of training epochs increases.
In general, as the training cycle increases, the prediction error also gradually decreases and eventually stabilizes.

2) EFFECT OF GRAPH ATTENTION MECHANISM
To investigate the effect of the graph attention mechanism in our model, we designed a version of GAGCN that incorporates road information, named the Non-universal spatial-temporal graph attention based dynamic graph convolutional network (NGAGCN). In this part, we compare and analyze three models: GAGCN, NGAGCN, and STGCN. The main difference between the three models is the way in which the associations/weights among road network nodes are extracted. STGCN uses only the structural information of the road network (the distances between the detection points); NGAGCN combines the graph attention mechanism with the road network structure information; and GAGCN uses only the graph attention mechanism to extract the association/weight relationships among the nodes hidden in the traffic feature data, where the data do not contain any road network structure information. It can be seen from Table 1 that the prediction accuracy of STGCN, which uses only road network structure information, is the worst, and it cannot generalize to other road networks. NGAGCN, which combines the graph attention mechanism with road network structure information, sacrifices generality but achieves better prediction results than STGCN. This proves that injecting a graph attention mechanism into the model to extract the spatial features of traffic flow is effective. Finally, compared with NGAGCN and STGCN, GAGCN, which does not use any road network structure information, not only improves the generality of the model but also achieves the best prediction performance. This further proves that the graph attention network can effectively mine the dynamic association features among nodes hidden in the traffic feature data.

3) SPATIAL WEIGHT MATRIX
As traffic conditions change over time, our model can attend to and adjust the associations/weights among nodes in the road network accordingly. Fig. 10 (b-c) shows part of the spatial association/weight matrix among nodes obtained by our graph attention networks, where the i-th row represents the association between each detector and the i-th detector. Taking the 0th and 1st detectors as an example, we can see from Fig. 10 (b-c) that the influence of detector 1 on detector 0 is stronger than that of detector 0 on detector 1. This result is consistent with the real situation: the traffic condition at detector 1 is affected by detectors 0, 2, 3, and 4, while detector 0 is affected only by detectors 1 and 5, as shown in Fig. 10 (a). Therefore, our model not only achieves the best prediction performance and the highest generality, but also has an interpretability advantage.
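The asymmetry observed above follows naturally from the row-wise softmax in the attention layer: each row is normalized over a different set of competing neighbors, so even symmetric raw scores yield asymmetric weights. A toy NumPy illustration (the scores below are made up for the demonstration, not taken from Fig. 10):

```python
import numpy as np

def row_softmax(e):
    """Normalize raw attention scores per row (per target node)."""
    e = e - e.max(axis=1, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=1, keepdims=True)

# Node 0 attends almost only to node 1 (few influencers, concentrated
# attention); node 1 spreads its attention over several nodes.
e = np.array([[0.0, 3.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A = row_softmax(e)
influence_1_on_0 = A[0, 1]   # large: node 1 dominates node 0's row
influence_0_on_1 = A[1, 0]   # smaller: node 1 divides attention
```

Here `A[0, 1] > A[1, 0]`, mirroring the detector 0 / detector 1 relationship described in the text.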

4) BENEFITS OF GATs BUILDING LAPLACIAN MATRIX
Obtaining the weighted adjacency matrix is the key to constructing the graph Laplacian matrix. Existing models are mainly based on the positional distances among the detectors in the road network and then use prior knowledge to explicitly assign a weight to each pair of detector nodes, which may introduce substantial error and lower the prediction accuracy. We instead attend to the traffic data of each detector and let the graph attention networks discover the hidden relationships among the detectors, that is, implicitly specify the weights among nodes and construct a realistic weighted adjacency matrix. Our method (GAGCN) reduces the error caused by human experience and does not require the spatial position information of the detectors. The results in Table 1 indicate that the error of our model is the smallest under the same conditions. This shows that our model captures the influence among detectors more accurately than STGCN; in other words, it better captures the spatial features of the road network. Although MTGNN likewise does not inject the graph structure into the model and uses traffic feature data to find the correlations among nodes, its experimental results on tasks with a large time span are not ideal, and its experimental process is relatively complicated. It can be concluded that constructing the Laplacian matrix with GATs not only improves the generality but also ensures the accuracy of the model.
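Once the attention layer produces a weighted adjacency matrix, a graph Laplacian for spectral convolution can be built by the standard symmetric normalization. The sketch below shows one common construction; symmetrizing the (generally asymmetric) attention matrix first is our assumption about how such a matrix would be fed to a spectral GCN, not a detail confirmed by the paper.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetrically normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    built from a non-negative weighted adjacency matrix A."""
    A = (A + A.T) / 2.0                 # symmetrize the learned attention weights
    d = A.sum(axis=1)                   # weighted node degrees
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5      # guard against isolated nodes
    D = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D @ A @ D
```

The eigenvalues of this Laplacian lie in [0, 2], which is the property Chebyshev-polynomial graph convolutions rely on.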

V. CONCLUSION
In this paper, we propose a novel traffic flow prediction model called GAGCN. GAGCN employs graph attention networks to dynamically obtain the weighted adjacency matrix of the road network graph. Its ''spatial-temporal convolution'' uses a graph convolutional network to extract the spatial features of the road network and a gated temporal convolution to excavate temporal features. The information of the road network structure and human intervention are not required in GAGCN, so it can flexibly handle various complex road networks. Extensive experiments and comparisons were performed on two real-world datasets. Compared with traditional methods that rely on the distances between nodes and human experience, GAGCN has better accuracy and versatility, and it also achieves better prediction performance than models without an injected graph structure.
In real life, many natural and unnatural factors can affect the traffic conditions of the road network, such as weather, social events, and air quality. In the future, we will consider such external factors to further improve the prediction accuracy.
CONG TANG (Student Member, IEEE) is currently pursuing the master's degree with the College of Computer Science and Electronic Engineering, Hunan University.
His research interests include data mining and intelligent transportation technology.
She is currently an Assistant Professor with the College of Computer Science and Electronic Engineering, Hunan University. She has published more than ten articles. Her research interests include intelligent transportation, memristors, and their application to storage and neural networks. She has also been widely involved in various IEEE technical committees and international conference activities.
MU PENG is currently pursuing the master's degree with the College of Computer Science and Electronic Engineering, Hunan University.
His research interests include data mining and intelligent transportation technology.
NIANFEI GAN received the Ph.D. degree in mechanical engineering from Hunan University, China, in 2007.
She is currently an Associate Professor with the College of Mechanical and Vehicle Engineering, Hunan University. She has published more than ten articles. Her research interests include driving intention recognition, human-like motion planning, and their application to intelligent vehicles. VOLUME 8, 2020