Spatiotemporal Data Fusion in Graph Convolutional Networks for Traffic Prediction

A wealth of information is now readily available for traffic prediction, and making effective use of it enables better traffic planning. With data coming from multiple sources and features spanning both spatial and temporal dimensions, there is an increasing demand to exploit them for accurate traffic prediction. Existing methods, however, do not provide a solution for this, as they tend to require expert feature engineering. In this paper, we propose a general, parameter-efficient architecture for SpatioTemporal Data Fusion (STDF). To make heterogeneous multi-source data fusion effective, we separate all data into traffic directly related data and traffic indirectly related data. Traffic indirectly related data is the input to Spatial Embedding by Temporal convolutiON (SETON), which simultaneously encodes each feature in both the space and time dimensions, while traffic directly related data is the input to a graph convolutional network (GCN). We design a fine-grained feature transformer that maps the SETON features to match those generated by the GCN. This is followed by a fusion module that combines all features to make the final prediction. Compared with a GCN trained with only traffic directly related data, experimental results show that our model achieves a 6.1% improvement in prediction accuracy as measured by Root Mean Squared Error.


I. INTRODUCTION
Traffic plays a vital role in human life and significantly influences every aspect of it. With the rapid increase in the number of vehicles, traffic congestion has become a national concern for urban management. The smart city is considered a potential solution: it uses intelligent technologies to predict traffic flow and smooths the peaks and valleys by offering residents travel guidance [1]. Traffic prediction is of great importance for smart cities and has attracted attention from both research and industry for many years. Accurate real-time road traffic prediction is critical for the realization of intelligent cities [2], [3]. Traffic authorities require reliable prediction to facilitate policy-making, regulation, and implementation. With the development of sensors, traffic data is collected by sensors installed in vehicles or along roads. Examples of traffic data include vehicle license numbers, vehicle GPS data, video or image records from surveillance devices, and temperature, wind speed and sunlight level data from weather sensors [4]. These multi-source data are transmitted to the data center over vehicular ad hoc networks (VANETs) or the upcoming 5G cellular network [5]. Many traffic prediction algorithms have been proposed to guide convenient travel for citizens based on these massive traffic data, and some works have shown the advantages of multi-source data fusion in spatio-temporal data prediction tasks [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Keli Xiao.
Unlike the data handled by traditional data fusion methods, multi-source traffic data includes not only traffic directly related data (e.g., vehicle speed, vehicle density, traffic flow) but also indirectly related data (e.g., weather, points of interest (PoIs)). All these data span both spatial and temporal dimensions [4], [7]. As shown in Figure 1, our goal is to predict the traffic condition at each road segment. In residential areas, many people go out for work or leisure in the morning, while in the evening many people go home or to a place of entertainment. This information must therefore be incorporated in the model for accurate traffic prediction. However, merging all these data straightforwardly cannot capture how the semantics of traffic indirectly related data change over time.
To tackle this, we propose a SpatioTemporal Data Fusion (STDF) framework, a general architecture that improves traffic prediction performance at metro-city scale by using data fusion.
Traffic prediction using data fusion needs to consider both spatial and temporal features from multi-source data. Notably, the distribution of urban traffic exhibits high variability in both the spatial and temporal domains. Traffic prediction in cities is challenging because of their complex environment. It is thus essential to find an efficient and effective way to make traffic prediction more accurate by using these data jointly. There have been many works on data fusion. According to the size of the model parameters, work on traffic prediction by data fusion can be classified into two categories: traditional machine learning methods and deep neural networks. Many effective traditional methods have been proposed, such as XGBOOST [8], random forest [9], LightGBM [10], and embedding learning [11]. Although these methods can find the relationship between traffic prediction and traffic indirectly related data, they require significant human effort because the features extracted from multi-source data play a vital role in prediction accuracy. Moreover, they are computationally expensive when applied to large-scale data fusion for cities. An end-to-end learning method is thus a desirable alternative, at the cost of computation power. For example, some works use deep neural networks [12]-[14] to automate multi-source data fusion and the extraction of useful features. They merge multi-source data straightforwardly into a vector and treat traffic directly related and indirectly related data equally. As a result, they ignore the changing semantics of traffic indirectly related data. In this paper, we explore an effective and efficient GCN-based way to fuse multi-domain data that considers both spatial and temporal properties.
Multi-source data fusion that considers spatial and temporal properties is challenging for the following reasons. The first challenge is large-scale feature representation. It is infeasible to encode each node at each time into a unified vector at metro-city scale. For example, the parameter size is over 10G for a small city containing 10,000 road segments and 100 external factors per node on average if each factor is represented by a 10-tuple vector at each time interval, which easily results in an over-parameterized model and overfitting during training. Second, an automated but efficient mechanism is urgently needed to find the spatio-temporal representation for all multi-source data. Third, fusing traffic indirectly related data into traffic prediction may have a negative effect on prediction accuracy. In addition, there are practical concerns when applying the method to real traffic prediction at metro-city scale.
To tackle the aforementioned challenges, we propose a general STDF framework. STDF adopts a branching-transfer-fuse strategy. STDF first separates the prediction model into two branches, with one branch processing traffic directly related data and the other processing traffic indirectly related data. The traffic directly related data is processed by a GCN, and its spatio-temporal representation is taken from the middle layer of the GCN. The traffic indirectly related data is processed by two parts in succession. The first part is called Spatial Embedding by Temporal convolutiON (SETON). SETON first encodes each feature in both the space and time dimensions simultaneously, followed by a convolutional operation with the spatial embeddings as input and the temporal embeddings as the convolutional kernel to get the spatio-temporal representation. All nodes share the same spatial and temporal embeddings, which are trainable in the model and avoid the over-parameterization problem. The second part is a feature transform module, which maps the spatio-temporal representation generated by SETON to the feature map space generated by the GCN. Finally, the feature maps generated by the GCN and the feature transform module are fused together, followed by several fully connected output layers. In summary, this paper makes the following contributions.
• Generic Architecture for Deep Spatio-temporal Data Fusion: The STDF framework is a general neural network architecture that can efficiently fuse multi-source data in both the spatial and temporal domains at large scale.
• Deep Spatio-temporal Data Fusion Operator: We design a new type of deep spatio-temporal data fusion operator, i.e., SETON. The operator is able to capture the spatial and temporal representations simultaneously.
• Computation Efficiency and Practicality: Both components in the STDF framework use a parameter sharing strategy to avoid over-parameterization, which makes the framework applicable to complex urban computing with high computation efficiency.
• Performance Improvement in Spatiotemporal Data Prediction: We apply our method to real traffic speed prediction and human flow prediction in a metro system. Experimental results demonstrate that our spatiotemporal data fusion method performs significantly better than prediction without data fusion or with only spatial data fusion.
VOLUME 8, 2020
FIGURE 1. Semantics changing over time. Multi-source data fusion will benefit the traffic prediction accuracy. At the same time, the representation of different factors is changing over time. Traffic prediction using data fusion needs to consider both the spatial and temporal representation simultaneously.
The rest of this paper is organized as follows. Section II gives a brief literature review of related work from the traffic prediction and data fusion perspectives. Section III formulates the traffic prediction problem and gives an overview of the architecture of the solution. Section IV details the process of spatiotemporal representation with parameter efficiency. With the extracted features, a feature transformer module and data fusion method are introduced in Section V. We conduct comprehensive experiments in Section VI and give a discussion of our model. Section VII concludes our work and outlines future work.

II. RELATED WORK
In this section, we review recent studies that are relevant to traffic prediction and data fusion. We first introduce traffic prediction methods from the mathematical model perspective. Data fusion methods are then detailed at both the feature level and the semantic level.

A. TRAFFIC PREDICTION
Many achievements have been made in traffic prediction, including the prediction of traffic flow, vehicle speed, vehicle density, etc. Traffic prediction can be modeled as time series prediction. Statistical models including historical average (HA), Autoregressive Integrated Moving Average (ARIMA) [15], Seasonal Autoregressive Integrated Moving Average (SARIMA) [16] and spatiotemporal correlations [17] are widely used in real traffic condition prediction for their computation efficiency. However, all these methods require the input data to meet certain conditions, and consequently perform poorly in complex urban traffic prediction.
To enable traffic prediction models to deal with complex data, there have been continuous attempts to apply machine learning methods to urban traffic prediction, such as XGBOOST [8], random forest [9], LightGBM [10], and embedding learning [11]. Although these methods have an inherent advantage in dealing with multi-source data, they need a lot of domain knowledge and careful feature engineering, which is not only computationally expensive but also raises scalability issues.
Because of the strong self-adapting and self-learning ability of artificial neural networks, deep learning has been used in different domains, such as computer vision [18], natural language processing and autonomous driving, and has brought many significant breakthroughs. At the same time, a great many studies have sought to improve traffic prediction performance by using different types of neural network architectures, such as multilayer perceptrons [19], long short-term memory [20] and autoencoders [21]. Although these works can effectively extract the local patterns of data, they can only be applied to data with a standard structure and lack awareness of global prediction. With their ability to process graph-structured data, graph convolutional networks are widely used to deal with complex graph data from a global perspective. Yu et al. proposed Spatio-Temporal Graph Convolutional Networks (STGCN), which capture comprehensive spatial and temporal dependencies for long-term traffic prediction [22]. Guo et al. applied an attention strategy to GCN to predict traffic flow while considering the dynamic spatial-temporal correlations of traffic data [23]. Li et al. proposed a diffusion convolutional recurrent neural network (DCRNN) to model traffic flow as a diffusion process [24]. Unlike our work, these models do not deal with the multi-source data problem.

B. DATA FUSION
Data fusion [25] in the traffic scenario often implies the combination of traffic related data sets that present an enormous diversity in terms of location, weather, points of interest, traffic flow, density and speed. These data sets are represented differently from different perspectives, but they describe the same real-world object and complement each other. A straightforward feature-level method [26], [27] extracts all the object-related features equally and concatenates them sequentially into an equal-sized or unequal-sized vector to be injected into the kernel task. The low-level representation may contain redundancies and the sampled data may not be independent, which easily leads to model instability.
Feature engineering is an especially useful practice that makes machine learning algorithms work. Lakhina et al. analyzed the distributions of packet features in flow traces in detail, which showed significant advantages for anomaly detection [28]. Samant and Adeli extracted traffic incident related features by using wavelet transform and linear discriminant analysis [29]. The two-stage feature extraction algorithm made the traffic incident detection model more robust. Although good feature engineering can yield better performance, it needs a deep understanding of domain knowledge. Moreover, it is time consuming and computationally expensive for large-scale data fusion. End-to-end learning, with its better flexibility and its ability to extract features automatically, provides an alternative.
Deep neural networks (DNNs) are an excellent solution for end-to-end learning when a unified feature representation must be obtained from disparate data sets. The end-to-end structure of ST-ResNet [12] was proposed to predict citywide crowd flows, where inputs with the unique properties of spatiotemporal data are fed into ST-ResNet simultaneously. Bojarski et al. trained a convolutional neural network (CNN) to map raw pixels from three cameras directly to steering commands [30]. The system automatically learns internal representations of the necessary processing steps, such as detecting useful road features. While they are able to learn feature representations by themselves, these end-to-end data fusion methods incur a high computation cost. In addition, the feature representations are extracted at grid scale rather than at the road segment level. In contrast, we are more interested in graph-structured data.
Feature-based data fusion approaches treat all features equally and ignore the semantic meaning of each feature. In contrast, semantics-based data fusion methods try to understand the meaning of each feature and find the relationships between features by mining the insight of each data source. For example, many works tried to find the relationship between emotion and audio signals in emotion recognition [31]-[33]. The fusion results combining acoustic and facial emotion recognition were achieved at the semantic level. DeepFM [34] is an end-to-end deep learning framework for click-through rate prediction, where data representation is realized by feature embedding. DeepFM fuses the features by combining a factorization machine with a deep neural network. However, all these feature representations are static and related only to their corresponding input data. In this paper, we tackle the spatiotemporal data fusion problem in traffic prediction scenarios, where the semantic-level spatial features change dynamically with time.

A. PROBLEM STUDIED
The problem of traffic prediction by data fusion can be described as follows: given the observations at N nodes over P historical time steps, $X = (X_G^{t-P+1}, X_G^{t-P+2}, \cdots, X_G^{t}) \in \mathbb{R}^{P \times N \times C}$, and the external factors $F_G$ collected from other domains, we aim to learn a mapping function f that maps the input data to the traffic conditions over the next Q time steps, where Q denotes the length of the prediction target. Figure 2 illustrates the architecture of the STDF framework that solves the problem. Because GCN studies have achieved state-of-the-art performance in spatio-temporal data prediction and many complete GCN architectures are widely used in time series prediction, we select one type of GCN [22] to demonstrate the STDF framework.
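The input and target shapes in this formulation can be made concrete with a small sliding-window sketch. This is only an illustration under stated assumptions: the helper name `make_windows`, the toy sizes, and the synthetic values are hypothetical and do not come from the paper.

```python
import numpy as np

def make_windows(series, P, Q):
    """Split a (T, N, C) series into inputs (S, P, N, C) and targets (S, Q, N, C)."""
    T = series.shape[0]
    num_samples = T - P - Q + 1
    if num_samples <= 0:
        raise ValueError("series is too short for the requested P and Q")
    # Each sample uses P historical steps as input and the next Q steps as target.
    X = np.stack([series[s:s + P] for s in range(num_samples)])
    Y = np.stack([series[s + P:s + P + Q] for s in range(num_samples)])
    return X, Y

# Toy example: 20 time steps, N = 4 nodes, C = 1 channel (synthetic values).
series = np.arange(20 * 4 * 1, dtype=np.float32).reshape(20, 4, 1)
X, Y = make_windows(series, P=6, Q=3)
```

Each row of `X` is one instance of the P × N × C input described above, and the matching row of `Y` holds the Q future observations that the mapping f is trained to predict.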

B. GRAPH CONVOLUTIONAL NETWORK
Graph convolutional network (GCN) is a neural network that operates on graphs and is able to extract local features with different receptive fields from translation-variant non-Euclidean structures [35]. As depicted in [22], GCN is designed to solve the time-series prediction problem, i.e., predicting the future traffic measurements given an input of fixed temporal length. The feature map generated by the second ST-Conv Block in GCN, as demonstrated in Figure 2, is denoted by $FM_g$. However, many external factors influence traffic patterns. For each node v, the external factors are written as a vector $F_v$. Spatio-temporal data fusion is not a simple data integration process. STDF is designed to provide an efficient and effective data fusion strategy that is practical for large-scale traffic data prediction in the real world. STDF consists of three parts: SETON, a feature transformer, and a fusion module. SETON finds a computation-efficient spatio-temporal representation for external traffic related variables. The feature transformer maps the spatio-temporal representation to a feature space that has the same shape as the feature map $FM_g$ for each node. A fusion module then combines the two features into one tensor. We introduce the three parts in detail as follows.

IV. SETON
SETON consists of three components: a spatial feature embedding layer, a temporal feature embedding layer and an embedding vector fusion layer. The spatial feature embedding layer maps the external factors to fixed-size embedding vectors. The vector length k is determined in advance. Similarly, the temporal feature embedding layer maps the time intervals to a 3-D tensor, with the lengths of the first and second dimensions equal to k and the length of the third dimension equal to the number of time slots. The embedding vector fusion layer takes the outputs of the two aforementioned components as input and produces the spatio-temporal embedding vectors, which are the low-level spatio-temporal representation of traffic indirectly related data. All vectors in SETON are learned by the network itself without any feature engineering.

A. SPATIAL FEATURE EMBEDDING
Because the traffic network is complex and the environment around each node differs, the external data related to traffic prediction becomes too large if we give each factor its own spatiotemporal representation in the neural network, which may cause over-parameterization and overfitting during training. To overcome the over-parameterization problem, we propose a data sharing strategy for all nodes.
We first classify the indirect traffic data into m fields according to how each PoI influences people's travel patterns. The fields may include categorical fields (e.g., residential area, hi-tech zone, entertainment place) and continuous fields (e.g., PoI density, PoI number). Different categorical fields may contain different numbers of categories, each represented by a one-hot encoding. The continuous fields are represented by the value itself. The instance for node v is written as $F_v = \{f_{field_1}, f_{field_2}, \cdots, f_{field_m}\}$, where $f_{field_j}$ stands for the j-th field of $F_v$. The instance for all nodes V is then $F = \{F_1, F_2, \cdots, F_n\}$. The task of spatial feature embedding is to find a parameter-efficient method to map each value in F to an equal-sized embedding vector. The length of the embedding vector is predefined as k. In the accompanying figure, the length of the embedding vector is set to 5. The left part stands for the embedding process for node 1 and the right part stands for the same process for node n. All nodes share the same latent feature vectors W. There is no need to pre-train W: the tensor W serves as network weights, which are learned by the network itself. W maps the input data to fixed-size embedding vectors $[e_{i,1}, e_{i,2}, \cdots, e_{i,m}]$, where $e_{i,j}$ stands for the embedding vector of the j-th field for node i and m is the number of fields. More specifically, the embedding output for each node is a $k \times m$ tensor. The parameter to be learned has size $M \times k$, where $M = \sum_{j=1}^{m} |f_{field_j}|$. The parameter size is independent of the number of nodes, which is the foundation for large-scale data fusion in urban cities.
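The shared-embedding idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the field sizes, the helper `embed_node`, and the choice to scale a latent vector by the continuous value are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fields: two categorical (3 and 4 categories) and one continuous.
field_sizes = [3, 4, 1]
k = 5                                   # embedding length
M = sum(field_sizes)                    # total number of field entries
W = rng.normal(size=(M, k))             # shared latent vectors (trainable in a real model)
offsets = np.cumsum([0] + field_sizes[:-1])

def embed_node(fields, W, offsets):
    """Map one node's fields to a (k, m) tensor using the shared W.

    A one-hot field selects one row of W; a continuous field scales its row
    by the field value (an assumption of this sketch).
    """
    columns = [f @ W[o:o + f.shape[0]] for f, o in zip(fields, offsets)]
    return np.stack(columns, axis=1)

node_1 = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 0.0, 1.0]), np.array([0.7])]
node_n = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0]), np.array([0.2])]
E_1 = embed_node(node_1, W, offsets)
E_n = embed_node(node_n, W, offsets)
```

Both nodes read from the same `W`, so the number of trainable values stays at M × k regardless of how many nodes are embedded.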

B. TEMPORAL FEATURE EMBEDDING
In traffic prediction, the time factor plays an important role in understanding people's travel patterns [7]. For example, people tend to go out in the morning and return home in the evening, so the embedding vector for a residential area differs at different times. The spatial semantics of other categories change over time as well. Because traffic data is cyclical, we divide the time in one day into T time intervals. Urban traffic patterns and people's travel patterns differ from one time to another, so each time interval has a distinct transformation matrix in our model, which is used to transform the spatial embedding vector into a corresponding time-specific vector.
We use a tensor W to represent the temporal embeddings for all time intervals. W serves as network weights, which are learned by the network itself. To make the temporal embeddings match the spatial embeddings, the tensor W has a 3-D shape of $T \times k \times k$.
Taking k = 5 as an example, we highlight the temporal feature embedding process from the input layer to the embedding layer as shown in Figure 5. For the time factor t, we encode it as a one-hot vector after discretization, where the vector length is equal to T. As with the spatial embedding, the latent feature vectors W for the temporal embedding process serve as network weights, which are learned by the network itself. After the embedding process, we get the temporal embedding matrix $\alpha_t$ corresponding to the time interval t, obtained by a look-up function f that retrieves the entry of W corresponding to t.
The output matrix $\alpha_t$ has a shape of $k \times k$ and describes how the spatial meaning of each category changes at its corresponding time t. The size of W depends only on the number of time intervals and the embedding size, not on the number of nodes n, which benefits large-scale data fusion in urban cities. Therefore, the spatial and temporal embeddings make our model scalable regardless of graph size.
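The look-up of $\alpha_t$ can be sketched as follows. This is a hedged illustration: the names `time_slot`, `temporal_embedding`, and `W_time`, and the choice of T = 48 (30-minute slots), are assumptions for the example rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 48, 5                             # assumption: 30-minute slots, so T = 48 per day
W_time = rng.normal(size=(T, k, k))      # temporal embeddings (trainable in a real model)

def time_slot(minute_of_day, T):
    """Discretize a minute-of-day value (0..1439) into one of T equal intervals."""
    return int(minute_of_day * T // (24 * 60))

def temporal_embedding(minute_of_day, W_time):
    """Return the k x k matrix alpha_t for the interval containing the given time."""
    T = W_time.shape[0]
    one_hot = np.zeros(T)
    one_hot[time_slot(minute_of_day, T)] = 1.0
    # Contracting the one-hot vector with W_time is equivalent to reading W_time[t].
    return np.tensordot(one_hot, W_time, axes=1)

alpha = temporal_embedding(8 * 60 + 15, W_time)   # 08:15 falls in slot 16
```

The tensor holds T × k × k values in total, and none of them is tied to a particular node.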

C. SPATIO-TEMPORAL FEATURE REPRESENTATION IN LOW LEVEL
To get both the spatial and temporal feature representation, we apply the temporal embedding matrix $\alpha_t$ to every field of the spatial embedding vectors for all nodes. For a node i at time t, its spatiotemporal feature embedding vectors are calculated by applying $\alpha_t$ to each field embedding, $\alpha_t * e_{i,j}$ for $j = 1, \cdots, m$, where * denotes the convolution operator. In summary, the number of parameters learned by the network itself is only $M \times k + T \times k \times k$. The output spatiotemporal feature embedding $A_t^0$ after the SETON operation has a shape of $n \times k \times m$. Analogous to the concept in convolutional neural networks for computer vision tasks, this feature representation is low-level.
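A small sketch of this step, under a simplifying assumption: because the k × k kernel spans the whole length-k embedding, the convolution is modeled here as a matrix-vector product applied to every field of every node. The variable names and toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, T, M = 6, 3, 5, 48, 8            # hypothetical sizes

E = rng.normal(size=(n, k, m))             # spatial embeddings for all nodes
W_time = rng.normal(size=(T, k, k))        # temporal embeddings
t = 16
alpha_t = W_time[t]

# Assumption: the k x k kernel covers the full embedding, so applying it
# reduces to alpha_t @ e_{i,j} for every node i and field j.
A0_t = np.einsum("ab,nbm->nam", alpha_t, E)

# Trainable values: M * k spatial plus T * k * k temporal, independent of n.
num_params = M * k + T * k * k
```

With these toy sizes the count is 8 × 5 + 48 × 5 × 5 = 1240, and it would stay the same if `n` grew to thousands of road segments.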

V. FEATURE TRANSFORMER AND DATA FUSION
This section introduces a feature transformer that extracts the high-level representation and matches it to the feature map $FM_g$ calculated by GCN.

A. EXTRACT SPATIOTEMPORAL REPRESENTATION IN HIGH LEVEL
As illustrated in Figure 2, the feature transformer component is a bridge between SETON and the feature map $FM_g$, which achieves shape alignment between the two layers. The feature transformer stacks q convolutional layers. Each convolutional layer contains a 1-D convolutional kernel, which enables all nodes in graph G to share the same kernel, followed by rectified linear units and batch normalization, except the last layer, which contains only the convolutional operation:
$A_t^l = \mathrm{bn}(\mathrm{relu}(\mathrm{conv}(A_t^{l-1}, k_l)))$,
where $k_l$ is a 1-D vector that is learned by the network itself and $l \in \{1, 2, \cdots, q\}$ stands for the layer number. All nodes share the same convolutional kernel $k_l$ at layer l, which not only makes the model parameter-efficient but also helps avoid overfitting during training. The padding operation depends on whether up-sampling is necessary. For example, the feature map $FM_g \in \mathbb{R}^{n \times k_g \times c_g}$ with $c_g$ channels generated by GCN is regarded as a high-level representation of the target traffic data. If $k < k_g$, up-sampling is necessary, and experimental results show that it causes sharp performance degradation. It is therefore better to keep $k_g$ less than k. After the node-wise convolutional operation, the output feature map $A_t^q$ is denoted by $FM_{st}$, which has the same size as $FM_g$.
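The layer stack can be sketched with plain NumPy. This is an illustration under assumptions, not the authors' code: it uses a standard multi-channel 1-D convolution with valid padding along the k axis, a normalization without learned scale or shift, and hypothetical shapes chosen so that the output matches an assumed GCN feature map of shape (n, k_g, c_g).

```python
import numpy as np

def conv1d_valid(x, kernel):
    """x: (n, length, c_in); kernel: (width, c_in, c_out). One kernel is shared by all nodes."""
    width = kernel.shape[0]
    out_len = x.shape[1] - width + 1
    # Gather sliding windows: (n, out_len, width, c_in).
    windows = np.stack([x[:, i:i + width, :] for i in range(out_len)], axis=1)
    return np.einsum("nlwc,wco->nlo", windows, kernel)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Normalize each channel over nodes and positions (no learned scale or shift)."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feature_transformer(A0, kernels):
    """Apply bn(relu(conv(.))) for all layers except the last, which is conv only."""
    A = A0
    for kernel in kernels[:-1]:
        A = batch_norm(relu(conv1d_valid(A, kernel)))
    return conv1d_valid(A, kernels[-1])

rng = np.random.default_rng(0)
n, k, m = 6, 10, 3            # hypothetical SETON output shape (n, k, m)
k_g, c_g = 6, 4               # hypothetical GCN feature-map shape, with k_g < k
kernels = [rng.normal(size=(3, m, 8)), rng.normal(size=(3, 8, c_g))]   # q = 2 layers
FM_st = feature_transformer(rng.normal(size=(n, k, m)), kernels)
```

With width-3 kernels and no padding, the length shrinks from 10 to 8 to 6, so no up-sampling is needed to reach k_g, which is the situation the text recommends.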

B. FEATURE MAP FUSION
Many feature map fusion methods are widely used in neural networks. For computing in large cities, however, we prefer to directly merge the feature map $FM_{st}$ generated by STDF with that of GCN, as shown in Figure 2. The merged feature map, denoted by FM, is followed by rectified linear units and batch normalization. This type of feature map fusion has two benefits in large-scale data fusion. First, it reduces the computation overhead when more data is added to traffic prediction. Second, it does not bring more parameters into our model, and thus helps avoid overfitting.
To get the predicted value, several fully connected layers are stacked to map the feature map to the target value.
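A parameter-free fusion followed by an output layer might look like the sketch below. The merge operator is an assumption: element-wise addition is used because it adds no parameters and requires equal shapes, but the text does not state which operator the authors use. The output head, `W_out`, and all sizes are likewise hypothetical.

```python
import numpy as np

def fuse(FM_g, FM_st, eps=1e-5):
    """Assumed merge: element-wise sum, then ReLU, then per-channel normalization."""
    merged = np.maximum(FM_g + FM_st, 0.0)
    mean = merged.mean(axis=(0, 1), keepdims=True)
    var = merged.var(axis=(0, 1), keepdims=True)
    return (merged - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n, k_g, c_g = 6, 6, 4                     # hypothetical feature-map shape
FM_g = rng.normal(size=(n, k_g, c_g))     # stand-in for the GCN feature map
FM_st = rng.normal(size=(n, k_g, c_g))    # stand-in for the transformed SETON features
FM = fuse(FM_g, FM_st)

# Hypothetical output head: one fully connected layer per node mapping the
# flattened fused features to Q predicted values.
Q = 3
W_out = rng.normal(size=(k_g * c_g, Q))
Y_hat = FM.reshape(n, -1) @ W_out
```

The fusion step itself introduces no trainable values in this sketch; only the output layer does.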

C. LOSS FUNCTION
In the training process, the goal is to minimize the gap between the real traffic condition $Y$ and the predicted value $\hat{Y}$. Unlike other tasks, traffic prediction suffers from incomplete and biased data. In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers than the squared error loss. To minimize the influence of traffic outliers, we select the Huber loss as the loss function:
$$L_\delta(Y, \hat{Y}) = \begin{cases} \frac{1}{2}(Y - \hat{Y})^2, & |Y - \hat{Y}| \le \delta \\ \delta |Y - \hat{Y}| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$
where $\delta$ is a threshold parameter which controls the range of the squared error loss.
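The standard Huber loss can be written directly in NumPy. The sample values below are synthetic and chosen only to show how a single outlier is penalized linearly rather than quadratically; the value of δ used by the authors is not given in the text.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2
    linear = delta * error - 0.5 * delta ** 2
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([10.0, 12.0, 11.0, 50.0])   # the last value plays the role of an outlier
y_pred = np.array([10.5, 11.0, 11.0, 12.0])
loss = huber_loss(y_true, y_pred, delta=1.0)
```

Here the outlier with absolute error 38 contributes 37.5 to the sum, whereas half its squared error would contribute 722, which is why the loss is less dominated by extreme samples.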

VI. EXPERIMENTS
In this section, we present the experiments and comparison results. We first present the experiment settings, introducing the baseline algorithms and datasets, then demonstrate the overall performance of STDF with an analysis of its components. Finally, we detail the training process, testing performance and hyperparameter selection.

A. EXPERIMENT SETUPS

1) DATASETS
• Metro: The dataset used in this study consists of the smart card transaction records and train operation logs in Shenzhen, China. The metro system had 5 metro lines by 2015, as shown in Figure 6(a). The data, collected from around 4 million smart cards, contains more than 300 million smart card transaction records, covering 184 consecutive days from January 1, 2015 to July 30, 2015. We use 144 days of data to train the network, 20 days for cross validation and 20 days for testing. The standard time interval is set to 30 minutes. The prediction task for Metro is to forecast the passenger number at each metro station.
• TaxiSZ: We use this data to predict the traffic speed at every road segment, as shown in Figure 6(b).

2) BASELINES
• ARIMA: Auto-Regressive Integrated Moving Average is fitted to time series data either to better understand the data or to predict future points in the series [15].
• SARIMA: SARIMA is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component [36].
• GCN: We use STGCN [22] as an example. The channels of the three layers in STGCN are 64, 32 and 128, respectively.
To evaluate each component of our model, we also compare variants that differ in the fusion layer.
• GCN-SDF-logits: GCN-SDF-logits only considers data fusion in spatial domain and the fusion layer is located at the logits layer.
• GCN-SDF-FM: GCN-SDF-FM only considers data fusion in spatial domain and the fusion operation is located at the middle layer as depicted in Section III-B.
• GCN-STDF-logits: GCN-STDF-logits considers data fusion in both the spatial and temporal domains, but the fusion layer is located at the logits layer.
• GCN-STDF-FM: GCN-STDF-FM considers data fusion in both the spatial and temporal domains, and the fusion layer is located at the middle layer as depicted in Section III-B.
All the above methods are evaluated and compared using two datasets: Metro and TaxiSZ. All GCN-based networks are trained using fine-tuned hyper-parameters. We use five-fold cross validation to calculate the average performance. All networks are trained for 50 epochs under the same settings with TensorFlow implementations. We use the Adam optimizer [37] to train all networks. For each node, the traffic indirectly related data contains all the features within a one-kilometer radius.
Table 2 shows the results of STDF and the baselines on the Metro and TaxiSZ datasets. ARIMA gets the worst results because of its low capacity for handling spatio-temporal data prediction. GCN performs better than ARIMA. However, GCN-SDF-logits gets worse results than GCN alone. Although prediction with data fusion is believed to be more effective than prediction without it, we can see that data fusion by simply putting more data into one model may have negative effects on model performance. GCN-SDF-FM and GCN-SDF-logits, which only consider the spatial property but ignore the temporal dependency, have much higher RMSE. We call this phenomenon negative fusion. GCN-SDF-FM and GCN-STDF-FM perform better than GCN-SDF-logits and GCN-STDF-logits, which suggests that it is better to locate the fusion layer at the middle layer rather than at the logits layer. Our proposed model GCN-STDF-FM consistently achieves the best performance on the Metro and TaxiSZ datasets, which shows the effectiveness of using the spatial and temporal properties simultaneously. The intuition is that STDF gives the model the ability to capture the dynamic traffic demands and the relationships between a node and its surroundings.

C. TRAINING EFFICIENCY AND GENERALIZATION
To further investigate the overhead caused by adding more data for prediction, we calculate the parameter size and training time consumption (seconds per epoch), as shown in Table 3, which reports the increase in the number of parameters. The training time of our model is only 0.909 seconds longer per epoch than that of GCN, which is practical for real traffic prediction. Similar observations have also been obtained for the Metro dataset. With STDF, GCN achieves a 6.1% lower testing error.

D. CASE STUDIES
To understand the performance of STDF, we conduct the following case studies.
Figure 7 compares the training processes of GCN, GCN-STDF-FM, GCN-STDF-logits, GCN-SDF-FM and GCN-SDF-logits. We randomly select one of the five cross-validation folds to show the training process. Each network is trained for 50 epochs. The X axis stands for the epoch number and the Y axis is the loss value. Taking the metro data as an example, we can see that GCN-STDF-FM achieves the lowest training loss and GCN-SDF-logits the highest. A similar phenomenon can be seen in the testing performance shown in Figure 8, which corresponds to Figure 7. It can be clearly observed that STDF provides GCN with both (1) enhanced capacity to fit training data and (2) the generalizability to adapt to testing samples.

2) LEARNING RATE
Configuring the learning rate is challenging and time-consuming. We use five-fold cross validation with grid search to find the best learning rate for each experiment. The candidate learning rates are 0.00001, 0.0001, 0.001, 0.01 and 0.1. As observed from Figure 9, the test RMSE reaches its best value of 68.49 when the learning rate is set to 0.001. The learning rate is fixed to 0.001 in all experiments for STDF.
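The search procedure can be sketched generically as a grid search scored by five-fold cross-validated RMSE. Everything model-related here is a stand-in: a linear regression fitted by gradient descent on synthetic data replaces the GCN-based networks, so the learning rate this toy selects says nothing about the value reported above.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def train_linear(X, y, lr, epochs=50):
    """Stand-in model: linear regression fitted by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def cross_validated_rmse(X, y, lr, folds=5):
    """Average held-out RMSE over contiguous folds."""
    all_idx = np.arange(len(y))
    scores = []
    for held_out in np.array_split(all_idx, folds):
        train = np.setdiff1d(all_idx, held_out)
        w = train_linear(X[train], y[train], lr)
        scores.append(rmse(y[held_out], X[held_out] @ w))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                              # synthetic data, not from the paper
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

learning_rates = [0.00001, 0.0001, 0.001, 0.01, 0.1]
results = {lr: cross_validated_rmse(X, y, lr) for lr in learning_rates}
best_lr = min(results, key=results.get)
```

Swapping `train_linear` for a real training routine keeps the rest of the loop unchanged, since the grid and the fold-averaged RMSE are independent of the model.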

VII. CONCLUSION
In this paper, we propose a novel framework, STDF, for traffic prediction to handle multi-source data fusion. A split-transform-merge strategy is used in STDF. We first separate multi-source data into directly related data and indirectly related data, which are input to GCN and SETON, respectively. The feature transformer module is designed to extract the spatiotemporal representation of traffic indirectly related data. We get the spatiotemporal representation of traffic directly related data from the middle layer of GCN. This is followed by a fusion module that combines all features to make the final prediction. By using a data sharing strategy, our model is scalable, and the overhead caused by fusing traffic indirectly related data is acceptable in real traffic prediction. Experimental results show that our model achieves the best performance compared with other state-of-the-art methods on two real-world datasets. In summary, STDF can successfully capture spatial features that change over time from multi-domain data, and it can be used not only for traffic prediction but also for other spatiotemporal data prediction tasks. In the future, we will predict traffic congestion diffusion via representation learning, where the representation vectors are extracted from the fusion layer of STDF. A traffic congestion control policy will then be made according to the traffic congestion diffusion model.