An Interpretable Station Delay Prediction Model Based on Graph Community Neural Network and Time-Series Fuzzy Decision Tree

High-speed train delay prediction has always been one of the important research issues in the railway dispatching. Accurate and interpretable delay prediction can enable staff to implement preventive measures and scheduling decisions in advance, and guide relevant departments to cooperate in completing complex transportation tasks, so as to improve rail transit operations, service quality, and the efficiency of train operation. This article proposes a new interpretable model based on graph community neural network and time-series fuzzy decision tree. This model can well capture the influence of spatiotemporal characteristics, train community structure, and multifactor in high-speed train station delay prediction. Besides, the time series fuzzy decision tree based on multiobjective optimization and reduced error pruning can mine potential decision rules to improve the model's interpretability, transparency, and high reliability. Finally, we prove that the prediction effect of the proposed model is superior than the other seven state-of-the-art models and our model is interpretable.


I. INTRODUCTION
H IGH-SPEED train delay prediction has always been one of the core research issues in the railway dispatching [1]. With the rapid development of high-speed rail, high-speed rail has become the travel choice of many people because of its speed and convenience. However, train delays will reduce the travel efficiency of passengers and bring a heavy burden to the train operation network. Therefore, accurate delay prediction can enable staff to implement preventive measures and scheduling decisions in advance, and guide relevant departments to cooperate in completing complex transportation tasks, so as to improve rail transit operations, service quality, and the efficiency of train operation.
This article focus on the high-speed train station delay prediction. The problem refers to: according to the train history and current running state, to predict the total number of delayed trains at each station on the operating network in the future via learning the distribution and evolution of train delays in time and space. According to the prediction duration, prediction can be divided into two categories: short-term prediction (0-30 min) and long-term prediction (>30 min) [2]. The problem is an event-driven predictions based on graph [34], and the network has typical characteristics such as spatiotemporal dependence, complexity, and community, which has become a bottleneck restricting the current high-speed train delay prediction. This bottleneck is mainly reflected in the following aspects.
1) Train station delay prediction is a typical spatiotemporal data prediction problem [3]. The analysis of train delay needs to comprehensively consider the spatiotemporal dependence between multiple trains and lines, and the adjacent stations and adjacent time stamps are dynamically associated. Train delay data has spatial dependence, temporal correlation, and spatiotemporal correlation. 2) Train delay is also affected by many external factors [4].
When the train operation is affected by wind, thunderstorm beyond a certain limit [26], the train needs to limit the speed or stop operation, which may cause some congestion of the train network. In addition, large fluctuations in passenger flow will cause the incompatibility of trains and stations, resulting in congestion of passengers, delays in getting ON and OFF, and a large number of passengers being stranded, for example, holidays will cause a sudden increase in passenger flow. All of these factors can lead to train delays. Therefore, to achieve the station delay prediction, the influence of external factors must be considered. In addition, different external factors have different influences on train operation. For example, extreme weather has a greater impact on train operation than good weather, however, many works do not consider this issue. little influence on each other. For example, trains on the Beijing-Shanghai line and the Beijing-Shenyang line have almost no intersection, which indicates that there is a community structure within the network. The current method only considers the simple interaction between railway lines, stations, and trains, and considers that they have the same community attributes, without mining richer community relationships. Strictly speaking, if any train on the train network is delayed, it will affect others, but, in some cases, the impact is extremely small and negligible. This negligible influence exists between different communities, and it will be manifested as noise that affects the prediction accuracy in the prediction model. Ignoring the community of the network, it is impossible to characterize the complex high-speed railway network in more detail. 4) An interpretable model can well verify whether the model behavior is as expected and establish a trusting relationship with users [28]. However, the model based on deep learning network has many parameters and high complexity, the decisions and intermediate processes made by the model are difficult to be understood by people. It is a "black box" that lacks explanation, which seriously hinders the application of deep learning models. Therefore, an interpretable modeling method is urgently needed to explain the neural network model and extract the decision rules, which is beneficial to improve the scalability and trust degree of the model. How to explain the model in a human habit has become a huge challenge. Focusing on the abovementioned challenges, we propose a new interpretable interpretable multiattention graph convolutional network (C-MATGCN) based on community detection model. The contributions of this article are summarized as follows.
1) The train operation network is modeled as a graph, the stations are nodes on the graph, and the lines connecting adjacent stations are edges, so that the train station delay prediction problem is transformed into a node prediction problem on the graph. We then add the external factors as node features and use graph convolutional network for node feature extraction. 2) Considering the spatial dependence, temporal correlation, spatiotemporal correlation of train operation, community of the train operation network, and the different influences of different external factors on train operation, we proposed a C-MATGCN model, which is used to process the abovementioned problems at the same time. 3) In order to eliminate the unnecessary noise of different lines and different intervals, we apply the graph community division to the train operation network, and propose a high-speed rail community detection strategy to improve prediction accuracy. 4) We model the train delay data as a fuzzy time series, and proposed a multiobjective optimization of time series fuzzy decision tree based on reduced-error pruning (REP) [31] to explain the decision-making process of the C-MATGCN model, which makes the prediction model transparent and reliable.
The C-MATGCN model can achieve the interpretable and accurate train delay prediction work, has a positive effect on the recovery of train operation. It helps to optimize the preparation of train operation diagrams, promote the development of robust timetables, and improve emergency response capabilities. Furthermore, for dispatchers, it can help reduce work pressure and improve decision-making efficiency. For drivers, it can help to use the buffer time to restore trains running on time and flexibly adjust the speed of trains. For passengers, the prediction results can help them adjust their travel plans and improve passenger satisfaction. Therefore, the interpretable high-speed train delay prediction model in this article is significant for improving the dispatching capacity of high-speed railways and improving the quality of railway transportation.
The rest of this article is organized as follows. Section II systematically investigates the relevant literature and clarifies the work of this article. Section III introduces the station delay prediction problem, the train community detection strategy, and the multifactor fuzzy decision tree. Section IV proposed the interpretable C-MATGCN model. Section V describes the experiments carried out in this article and the analysis of the results. Finally, Section VI concludes this article.

II. LITERATURE REVIEW
In this section, we will review the existing works from the following parts.
1) Methods to deal with the spatiotemporal correlation of train operation data. 2) Methods for dealing with external factors that affect train operation. 3) Graph community applications. 4) Methods for interpretability modeling based on decision tree. Huang et al. [3] used a three-dimensional (3-D) convolutional neural network to process spatial features and a long short-term memory recurrent neural network long short-term memory (LSTM) to process time series variables to predict station delays for four railways from China and The Netherlands, respectively. Wu et al. [7] integrated the long short-term memory network LSTM and the key point search technology critical point search (CPS), and proposed a hybrid LSTM mode to process the time series variables of historical train motion data to predict the running time and dwell time of the train. Huang et al. [6] modeled train delay data as time series data, using two long short-term memory neural networks LSTM to capture the interactions between stations and between trains, respectively.
Based on historical train delay data and weather data, Arshad and Ahmed. [8] used linear regression, gradient boosting regression, decision tree, and random forest algorithms to predict delays for Indian railways, respectively. Huang et al. [3] used a fully connected neural network (FCNN) to deal with the external factors of train operation, such as equipment-related factors, from four railways in China and The Netherlands. Sajan and Kumar [9] proposed the key to accurately predicting train delays is to determine a regression algorithm that can handle weather conditions such as ridge regression, lasso regression, Elastic-net regression, and support vector machine algorithms.
In the graph community applications, Chiang et al. [10] extended the depth of graph convolutional neural network (GCNs) by partitioning complex networks and tested them with a large dataset ogbn-proteins. Liao et al. [11] proposed a graph partition neural network GPNN, which is an extension of the graph neural network GNN for processing very large graphs. Mallick et al. [12] considered problems such as insufficient computing power, insufficient memory, and low computational efficiency in the process of convolution of large-scale graphs, and used graph partitioning to decompose large-scale graphs and execute them separately, thereby reducing the computational load.
For interpretability modeling methods based on decision tree, most of the interpretable methods for rule extraction are global interpretation methods. The CRED algorithm [29] uses decision trees to decompose the neural network, and merges the rules extracted from each tree to obtain the generated rules. The DecText algorithm [30] uses the black box data to extract decision rules. This method uses genetic algorithm to query the trained network and extract prototypes, then uses the prototype selection mechanism to select a subset of prototypes, and finally, uses ID3 or C5.0 and other standard inductive methods to extract decision trees.
Based on the results of literature research, we draw the following conclusions.
1) Most of the train delay prediction research works use LSTM and CNN for spatiotemporal modeling. Although spatiotemporal features can be extracted, they can only process structured 2-D or 3-D data, while the data on the train operation network is non-European. 2) Most of the train delay prediction works use the fully connected network and linear regression to deal with the external factors. This method ignores the connection between adjacent nodes and external factors of adjacent time stamps, and does not take into account the different influences of different external factors on the train operation.
3) The current work on train delay prediction does not consider the community structure of the train operation network. Most of the application of graph communities is to deal with complex large graph data. 4) Almost all delay prediction works do not explain the proposed deep learning model, making it difficult to understand its decision-making process. The tree generated by the current methods may be very large, requires a lot of time and memory, the scalability is limited, and needs to overcome the impact of tree deepening on interpretability modeling. Considering the abovementioned problems, this article proposes a high-speed train community division strategy, we use three popular community division algorithms, spectral clustering, Louvain and GN algorithms to divide, and judge the quality of the community division according to modularity, and choose the suitable community division algorithm for the current train operation network. Then, we propose a multiattention mechanism to deal with the spatiotemporal characteristics and external factors. The external factors selected are wind level, temperature, weather conditions, and whether it is a major holiday. In the interpretability modeling process, this article uses a multiobjective optimization of time series fuzzy decision tree based on REP to explain the decision-making process of the C-MATGCN model.

III. PRELIMINARIES
Before this section, as shown in Table I, we first give a table of some notation definition to help to find the meaning of notations used in the model and method descriptions.

A. Train Station Delay Prediction
In this article, the train operation network is defined as an undirected graph G = (S, E, A, W ), where S is the set of all stations, |S| = N , N is the number of stations, E is the edge, represents the set of lines between all stations, A ∈ R, represents the connectivity between stations, is the weighted adjacency matrix of G, and its weight value is W . The greater the distance between two stations, the smaller the mutual influence, and the weight W is also smaller, so W is the reciprocal of the distance between stations.
The nodes represent a series of interconnected stations, and the connection between stations is determined by the running route of one or more trains. Any train running on a train network has an itinerary consisting of S = S 1 , S 2 , . . . , S N , which consists of a starting station, a terminus and one or more intermediate stations. For any station S, the train schedule defines that a train should arrive at T S SA (the scheduled time to arrive at S station), and leave at T S SD (the scheduled time from S station) after staying at station S for a period of time, according to the train timetable, these data are accurate.
It should be noted that there is no scheduled arrival time at the starting station S 1 , and no scheduled departure time at the target station S N . Assuming that the actual arrival time of the train is T S AA (the actual time of arrival at station S), and the actual departure time is T S AD (the actual time of departure from the station S), then T S AA − T S SA is defined as the arrival delay, , it will be counted as an arrival delay, and if T S AD − T S SD > 0, it will be counted as a departure delay.
Given a fixed time period τ , F is the number of features per station. Each station has multiple statistical values in the time period τ , including the total number of departure delays and the total number of arrival delays in the time period τ , and the F characteristics of all stations in the time period τ :

B. High-Speed Train Community Detection Strategy
For the high-speed railway operation network, trains in different lines and regions have different mutual influences, and even some trains do not affect each other. Excessive consideration of these influences will introduce noise into the model and reduce the prediction accuracy. It is necessary to consider the community structure of the high-speed rail network in delay prediction. In a complex network, there are implicit relationships between nodes and edges. According to the degree of connection between different nodes in the network, nodes can be divided into different subnetworks. A subnetwork is called a community. As shown in Fig. 1, the left is the original network, and the right is the community division result, which is distinguished by different colors.
In this article, the idea of graph community is applied to the high-speed train station delay prediction. We take the train operation graph as the research object, and repute that the train operation between different communities does not affect each other. We divide the train operation graph G into M subgraphs, such as for each subgraph G i (i ∈ 0, . . . , M − 1), S i is the set of nodes in the subgraph, E i is the set of edges connecting S i , and A i is the weighted adjacency matrix of the subgraph. According to the subgraph division results, the train historical operation

C. High-Speed Train Community Detection Algorithm
For the network topology, the spatial correlation between high-speed rail stations is mainly reflected in the distance between stations and the importance of the station in the train operation network, i.e., the greater the distance between two stations, the smaller the mutual influence, the smaller the distance, and the greater the mutual influence.
For the attribute similarity, this idea is to place nodes with more similar attributes in the same community. For example, in social network community detection, it is more desirable to divide people with the same hobbies into a community. The attributes of the stations in this article are weather and holiday, and it is obviously unreasonable to divide the stations with more similar attributes into the same running community.
Therefore, in this article, considering the attribute similarity of stations will have an adverse effect on the spatial fitting within each community. So, spectral clustering [13], Louvain [14], and Girvan-Newman [15] are selected as the basis of the community detection part, and use the modularity to measure the degree of spatial fitting. The Louvain merges each vertex with the adjacent vertices in turn, and calculates whether the modularity gain is greater than 0. If it is greater than 0, it is merged into the same community. The Spectral clustering divides the original graph, making the weight of the edges between the subgraphs after the segmentation is as low as possible, and the weight of the edges in the subgraph is as high as possible. The Girvan-Newman detects communities by computing the betweenness centrality of all edges in the network and removing the edge with the highest betweenness centrality.

D. Fuzzification of Input Attributes
To make decision tree learning amenable to numeric attributes and improve the interpretability of the model, these attributes should be fuzzified and discretized beforehand. In practical applications, we prefer to use "strong wind" to describe the wind level rather than "level 5-6" or other specific descriptions, so the fuzzification of input attributes is very necessary to improve the interpretability of the model.
On the lowest level, the original input attributes (e.g., measurements) are modulated in terms of associated fuzzy sets. More specifically, each domain X i is discretized by means of a fuzzy partition, i.e., a set of fuzzy subsets [32] where the F i,j are often associated with linguistic labels such as "high" or "low".
In this article, we fuzzy the input attributes by customizing the membership function. We use the following membership function to divide the input attributes into three levels of "low," "medium," and "high". The formula is as follows and e is a human-given parameter: This article models the high-speed train delay data as a fuzzy time series. The concept of fuzzy time series is as follows: For a subset of the set of real numbers R, Y (t)(t = · · · , 0, 1, 2, · · ·) is a given domain, the fuzzy set is composed of all fuzzy sets f i (t)(t = · · · , 0, 1, 2, · · ·), is the fuzzy time series defined on the domain Y (t). This article is a multitime-step prediction problem, and F t is caused by the joint determination of F t−1 , F t−2 , …, F t−n , i.e., there is an n-order fuzzy relationship between several time series.

E. C-Fuzzy Decision Tree
Fuzzy decision tree is constructed by fuzzy C-means clustering algorithm [27] and the decision tree. Fuzzy decision tree is a non-attribute-guided decision tree, branching through fuzzy clustering algorithm, replacing the method of classical decision tree relying on entropy to select branch nodes.
The fuzzy decision tree is guided by the fuzzy C-means algorithm to generate the decision tree, and the data set X is clustered into c classes: X 1 , X 2 , . . . , X c through the fuzzy C-means algorithm. The multiattribute model needs to consider the characteristics of multiple attributes, so each sample X i is k-dimensional, representing the number of attributes in the multiattribute model, i.e., X i = (X i1 , X i2 , . . . , X ik ), the corresponding cluster center m i = (m i1 , m ik , . . . m ik ) is also k-dimensional.
The fuzzy C-means algorithm achieves the clustering effect by minimizing the objective function J, uses an iterative algorithm to minimize the objective value, and continuously updates the cluster center m and partition matrix A. The objective function J is defined as follows: where A p ij ∈ [0, 1] is the membership degree of the j-th sample to the ith category, p is the fuzzy factor, w t is the weight of the t-th dimension attribute, and the corresponding weight is adjusted according to the importance of each element, And k t = 1 w t = 1, d 2 (x jt , m it ) represents the distance between the tth dimension x jt of the sample and the cluster center m it of the ith category.
In the process of minimizing the objective function J, the iteratively update m it and A ij are as follows: The fuzzy C-means clustering algorithm guides the generation of multiattribute fuzzy decision trees by iteratively updating the above cluster centers and partition matrices.

A. Model Architecture
The C-MATGCN consists of three modules: the community detection module, the model training module, and the interpretation module (as shown in Fig. 2

B. Community Detection Module
In the community detection module, the original data are train delay data, weather and holiday data (see Section V for the dataset). The Louvain algorithm is used to divide the community to obtain the community division result. The divided results are processed by GAM, EWH, and ED, respectively, and the output is divided into two parts: 1) the corresponding stations in the community and the weighted adjacency matrix, 2) the train delay data, weather and holiday data of the corresponding stations in the community.

1) Community Detection:
In this article, by comparing the effects of the three community detection methods of spectral clustering [13], Louvain [14], and Girvan-Newman [15] on the train operation graph in this article, we finally choose to use the Louvain algorithm for community detection.
2) Community Data Processing: We use the Louvain algorithm to divide the community, and extract the data in each community according to the divided results. These data include: node id in each community, weighted adjacency matrix, train operation data, weather, and holiday data. The community data extraction algorithm is shown in Algorithm 1.
Among them, the input data are the starting and ending station ids and the distance between stations, train delay data, weather, and holiday data. First, read id_distance to obtain from_id (initial station id), to_id (end station id) and the distance between the them, construct a weighted adjacency matrix, and use Louvain algorithm to obtain the community division results, including the number of communities and the stations in each community.
Then, get the adjacency matrix of each community through from_id and to_id. Second, obtain train delay data and weather and holiday data in each community. Finally, the nonnumeric data is one-hot encoded to get all the data input to the model.

C. Model Training Module
The core component of the model training module consists of three parts with the same structure, which model the recent (one hour), daily-period and weekly-period dependencies of train historical data, respectively. Each of these components is composed of several multiattention convolution modules and a fully connected layer, and each multiattention convolution module includes a multiattention module and a spatiotemporal convolution module. We use a residual learning framework in the model to improve training efficiency. At the end of the model, multicomponent fusion is used to combine the results of the three parts to obtain the final prediction result.

1) Community_x Data:
The Community_x data is the input data, which is further divided into three categories in this article.
a) The most recent time series. The arrival delay of the previous one or more stations in the past will affect the arrival delay of multiple stations in the future, among them, external factors will have an effect on it. The mathematical representation is as follows: b) Daily-period time series. People's daily travel are regular, station delays may occur in a relatively fixed time period, such as five to six o'clock in the afternoon every day, and external factors will have an effect on it, the purpose of the daily-period component is to simulate the daily-periodicity of the train arrival delay data. The mathematical representation is as follows: c) Weekly-period time series data. The weekly attributes and time intervals of these fragments are the same as the predicted period. Normally, the traffic pattern on Wednesday is similar to the traffic pattern on Wednesday in history, but it may be very different from that on Thursday and Friday 2) Graph Convolutional Neural Network: In this article, GCN is used to model the spatial characteristics of nodes on the train operation network. In the spatial dimension, train operation data is a kind of graph structure data. Different from grid data, it exists in non-Euclidean space, which makes it difficult for the traditional neural network to process. However, graph convolution neural network can directly model the original graph structure data and obtain the representation of nodes in graph structure data. In this article, the spectral method [16] is used to define the graph convolution. The spectral method uses the convolution theorem and Fourier transform to transfer the graph from the node domain to the spectral domain, and then defines the convolution kernel in the spectral domain.
3) Multiattention Module: The multiattention module includes spatial, temporal, and multifactor attention, which can well capture the spatiotemporal characteristics of the data and dynamic adjust the input data of each layer according to the importance of external factors to the model.
In the spatial dimension, there is a certain correlation between the arrival delay of trains at different stations, especially the influence between adjacent stations is highly correlated, and the interaction between adjacent stations with different distances is also different. The greater the distance between the two stations, the greater the possibility of adjusting from the delayed state to normal, then the delay impact of the current station on the next is smaller.
Consider the static characteristics of high-speed railways network, we calculate the correlation weight matrix C of the input data. C ij in C represents the correlation between station i and j. The calculation formula is as follow: where, X Z ∈ R N×F r−1 ×t r−1 represents the input data processed by the multifeature attention module of the rth layer, are learnable parameters. d S i S j is the distance between station i and station j. Then Q is normalized by softmax.
In the temporal dimension, there is a correlation between the arrival delay of stations in different periods. The correlation of each station is also changing in different time. The arrival delays in the previous periods will affect the future arrival delays of the stations on the line.
We calculate the time weight matrix Z of the input data. The element Z ij in Z represents the degree of dependence between time i and j. The calculation formula is as follow: where .means inner product, means Hadamard product, X = (X 1 , X 2 , . . . , X t r−1 ) ∈ R N×F r−1 ×t r−1 represents the input data of the rth layer of multiattention module, F r−1 represents the number of features of the rth layer, T r−1 represents the length of the time series of the rth layer, the activation function is sigmoid, Then, Z is normalized by softmax. Different features have different effects on train delay, so, in this article, we propose a multifeature attention mechanism to capture this difference where X (r−1) h = (X 1 , X 2 , . . . , X r−1 ) ∈ R N×F r−1 ×t r−1 represents the input data of the rth layer of multifeature module, U ∈ R F r−1 ×F r−1 , b P ∈ R N ×F r−1 ×t r−1 , V P ∈ R t r−1 ×N ×N are learnable parameters, * represents the matrix batch dot, the activation function is sigmoid, attention matrix P is dynamically calculated according to the current input of this layer, S i,j in S semantically represents the importance of different features of different nodes to the model. Then, P is normalized by softmax. 4) Components Fusion: Train delays are affected by a combination of the recent, daily-period and weekly-period time series components, so, we fuse the results of the three components. The final fusion result is where W h , W d , W w ∈ R N ×P , can be obtained through elearning, it reflects the impact of the three time dimensions on the prediction objectives, stands for Hadamard product, P stands for predicted time step, Y h , Y d , Y w , respectively, represent the final output results obtained after the output of the recent, daily-period, and weekly-period components passing through the fully connected layer.

D. Multiobjective Optimization of Time Series Fuzzy Decision Tree Based on REP
In this article, we use C-fuzzy decision tree to approximate the prediction effect of the C-MATGCN model, so as to explain the decision-making process of C-MATGCN via the rules of fuzzy decision tree.
Based on the C-fuzzy decision tree algorithm, this article uses the training data to generate the fuzzy decision tree [see Section IV-D1)], and prunes the ready-made tree according to the error rate pruning algorithm REP [see Section IV-D2)]. During the optimization, we set multiple optimization goals including test set error rate and the complexity of the tree, because we do not want to get some too complex rules. When the optimization achieves the desired result [see Section IV-D3)], the pruning stops immediately.

1) C-Fuzzy Decision Tree Building: a) Node inconsistency:
The fuzzy decision tree guides the growth of the tree through fuzzy C-means clustering algorithm and node splitting criterion. The node with the highest inconsistency is selected each time for splitting, because the lower the attribute similarity within the node, the higher the inconsistency, the more the data tends to continue to be divided. This article defines the ith node as where X i is the samples set whose degree of membership to node N i is greater than that to N i s sibling nodes in the sample data, i.e., Y i is a set of output values corresponding to each sample in the sample set X i A i represents the degree of membership of the sample data in where s is the number of samples in the N i node. A i corresponds to the ith row in the division matrix, and represents the degree of membership of the sample data in X i to the ith class. For node N i , the inconsistency is defined as follows: . (22) According to the abovementioned definition of inconsistency, the node splitting criterion can be transferred into finding the node with the greatest inconsistency, i.e., finding the node satisfying V fission = max(V i ) , 1 ≤ i ≤ s for splitting. b) Node stability: If the data in the node has a high degree of membership to this node and a low degree of membership to other nodes, we think the structure is very stable. On the contrary, if the data has a relatively low degree of membership to the current node, we think the structure is not stable. Taking a single node as an example, if the degree of membership of each sample data in the node is 1, and the degree of membership for other nodes is 0, we think it is the most stable. If each sample data in the node has the same degree of membership to c child nodes, i.e., 1/c, we think it is extremely unstable. Taking the kth sample data as an example, the stability of a single sample is as follows: where c is the number of clusters and a ik is the degree of membership of the kth data to the ith class. Assuming that the kth sample data completely belongs to the sth node, there is a sk = 1, and a tk = 0 for other nodes, where s, t ∈ 1, 2, . . . , c, s = t, at this time θ k = 1, It shows that the sample has the highest stability in this node. c) Time series fuzzy decision tree: In this article, we use the multifactor time series fuzzy decision tree to explain the C-MATGCN model via using the final decision tree rules to interpret the C-MATGCN by approximating the effect between the tree and C-MATGCN.
The fuzzy decision tree generation algorithm is shown as Algorithm 2.
As shown in the Algorithm 2, first, the delay data under fuzzification is put into the split-node-set S F N as the root node, and select the node with highest inconsistency from S F N to do splitting and the node must including more than three data. Then, calculate the stability of each node after splitting, if it is stable, remove the original node in S F N , and put the split node into S F N , if it is unstable, remove the split node in S F N , and return to the state before the node split. Finally, repeat the abovementioned process until S F N is empty.
2) Multiobjective Optimization: The error rate reduction pruning algorithm REP is a postpruning algorithm commonly used in decision tree pruning. All data are divided into two sample sets: one is used as a training set to build a decision tree, and the other is a separately isolated test set to evaluate the accuracy of the decision tree built from the training set on other sample data.
The fuzzy decision tree is pruned based on minimizing the overall cost function of the decision tree. The number of leaf nodes of C fuzzy decision tree T is |T|, the number of samples in this leaf node is n. The cost function of the fuzzy decision tree is defined as follows: where C(T) represents the prediction effect of the C fuzzy decision tree on the test set, i.e., the fitting degree of the C fuzzy decision tree to the test set, |T| represents the complexity of the decision tree, the parameter α ≥ 0 controls the influence between the complexity of the decision tree and the prediction error. When the α is determined, the more leaf nodes in the tree, the higher the fitting degree with the test data, and the smaller the prediction error, but the higher the tree height, the more complex the structure; On the contrary, the less leaf nodes, the more complex the model is lower, but does not fit well with the test data. The pruning process is as follows: delete the entire subtree with the candidate node as the root node, make the parent node of the node as the root node, assign all the sample data in the candidate node as the value of the parent node, and calculate the cost function C α (T ) in this case, if the cost becomes smaller, delete this subtree. Pruning is to select the model with the smallest cost function C α (T ) when α is determined, i.e., the tree with the smallest cost function.
3) Termination Conditions: For the C fuzzy decision tree generated under different parameters, we use the abovementioned multiobjective optimization method for pruning optimization. During the optimization process, if the current decision tree meets the following conditions, the pruning will be stopped. We just think that we have obtained an ideal fuzzy decision tree. 1 Leaf decision threshold θ l [33] If the number of a dataset is less than a threshold θ l , then stop expanding. For example, a dataset has 300 examples where θ l is 2%. If the amount of data in a node is less than 6 (2% of 300), then stop expanding. 2 Prediction error θ e Based on root-mean-square error (RMSE), we use the following formula to measure the approximation of the prediction effect of the current decision tree and the C-MATGCN model: where RMSE Tree represents the prediction effect of the current decision tree on the test set, RMSE C−MATGCN indicates the prediction effect of C-MATGCN on the test set, when the prediction error of the two is less than θ e , then stop expanding. For example, set θ e is 30%, If the RMSE Tree of current tree is 0.2, RMSE C−MATGCN is 0.15, then θ c = 25% < θ e = 30%, then stop expanding.

V. EXPERIMENT
In this article, we perform comparative experiments on the three selected community detection algorithms to select the highest quality community detection results, and then perform train delay prediction experiments based on the community detection results. In this section, we introduce the experimental dataset, experimental setup, evaluation metrics, and experimental results in turn.

A. High-Speed Train Delay Dataset
This article uses the actual train operation data from the China Railway Passenger Ticket System (https://www.12306.cn) and the historical weather data from China Weather Network (https: //www.tianqi.com). The train operation data is recorded in the whole minute, including the train operation records of 727 stations from October 8, 2019 to January 27, 2020. The features include operation date, arrival delay times, departure delay times, wind level, weather conditions, and temperature, whether it is a major holiday.
We collected, collated relevant data, and published it on nature scientific data [5] for researchers in related fields. The distribution of stations and lines in the dataset is shown in Fig. 3.

B. Experimental Setup
In the community detection experiment, we implemented spectral clustering using Scikit-learn [17], Louvain using Python-Louvain [18], and Girvan-Newman using Communities [19], all three methods use the train operation graph weighted adjacency matrix as the input.
In the model training stage, in order to prove the superiority of the C-MATGCN model, this article selects the following methods as the benchmark model: 1) ANN [20]: artificial neural network; 2) SVR [21]: regression model of support vector machine; 3) RF [22]: Random forest model; 4) LSTM [23]: long short-term memory network; 5) ASTGCN [24]: a hybrid attention spatiotemporal network for predicting traffic flow; 6) TSTGCN [25]: a spatiotemporal attention network for railway delay prediction; 7) MATGCN (proposed in this article): C-MATGCN without community detection. We implemented the C-MATGCN model using mxnet, the Chebyshev polynomial was set to 3, all graph convolution layers and temporal convolution layers used 64 convolution kernels, and the time span of the data was adjusted by controlling the stride of the temporal convolution. We set t h = 3, t d = 1, t w = 1, the prediction window t p = 1, i.e., the number of delays to arrive at the station in the next hour, the training batch size is 4, and the learning rate is 0.00001.
We implement other models on Windows 10 system. ANN uses a single hidden layer network structure with a learning rate of 0.01; the learning rate of RF and ANN is 0.001, LSTM contains two hidden layers, the loss function is L2Loss, the learning rate is 0.001. TSTGCN is based on mxnet, the learning rate is 0.00001, and the other parameters remain the default.
In the fuzzy decision tree building stage, we use cross validation to adjust the parameter settings for constructing the C-fuzzy decision tree, such as the number of clusters c and the relevant parameters of fuzzy set division, and prune the decision tree under each set of parameters, and adjust the parameters α and θ l , θ e , please see the experimental results section for details.

1) Modularity:
Modularity is a method to evaluate the quality of a community detection result. The physical meaning is the difference between the sum of the weights of the edges of the nodes in the community and the sum of the weights of the edges in random cases. The value range is [ −1/2,1), the modularity mainly depends on the community distribution of the nodes in the graph, i.e., the community detection of the network, which can be used to quantitatively measure the quality of the network community detection. The closer the value is to 1, the stronger the community structure, i.e., the better the division quality. It is defined as follows: where c i , c j represent all the nodes in the two communities, A ij represents the weight of the edge between nodes i and j, k i and k j represent the degrees of nodes i and j, and δ is indicator functions. If nodes i and j are in the same community inside, the return value is 1, otherwise, it returns 0.

2) Model Evaluation Metrics:
We use the following three model evaluation metrics to evaluate the performance of C-MATGCN and its comparative test models. They are mean square error (MAE), RMSE, and mean absolute percentage error (MAPE). Their calculation formulas are as follows: where x i is the actual value,x i is the predicted value, and n is the number of test samples. Table II shows the modularity of the three community detection algorithms, with the best scores shown in bold.

1) Community Detection Results:
The modularity of the three algorithms is 0.7682, 0.6957, and 0.7026, respectively. The results show that the Louvain algorithm has the highest modularity and the best quality. Therefore, this article uses Louvain algorithm for community detection. We visualize the results, as shown in Fig. 4, communities are shown in different colors.
2) Comparative Results: We compare C-MATGCN with the other seven models on the high-speed train delay dataset, Table III shows the results of train delay prediction performance in the next 1-6 h, the result is the average of the prediction effects of all the stations in all 1-6 h in the test set, rather than some specific stations and time periods, and Table IV shows the performance of C-MATGCN compared to the other models in 1 h performance, the best scores are shown in bold.
As shown in Tables III and IV, we can observe changes in the prediction performance of each method as the prediction duration increases. Generally speaking, with the prediction time increases, the prediction error will also become larger. The model training results are analyzed as follows.
1) In the prediction of 1-6 h, the errors of ANN, SVR, and RF always keep a high level. The performance of RF has dropped dramatically. 2) Compared with ANN, SVR, and RF, the performance of LSTM degrades slowly because LSTM considers the temporality of input data, but also maintains a high error level in long-term prediction of 1-6 h. 3) The training effect of ASTGCN and TSTGCN is relatively better because they consider the spatiotemporal correlation of the input data. 4) Compared with the first four models, MATGCN performs better in prediction performance, because MATGCN not only considers the spatiotemporal correlation of the input data, but also considers the influence of external factors on the train operation, and the multiattention mechanism can deal with different external factors. 5) C-MATGCN achieves the best prediction performance at almost any time in 1-6 h. Even in long-term prediction, the error remains low because we capture the temporal and spatial features of train delay data well, taking into account external factors and their influences as well as the community structure of the train operating network. So, C-MATGCN can achieve better results on station delay prediction.

3) Fuzzy Decision Tree Building Result:
We used the crossvalidation method to test the tree-building results of multiple sets of parameters, and the results are as Table V. where θ e is the prediction approximation degree, θ l is the minimum threshold number of leaf node data, c is the number of clusters, and α controls the influence between the complexity of the decision tree and the prediction error. Result shows that the decision tree with c = 2, α = 5 can achieve more than 70% of the prediction effect of the C-MATGCN model. In this article, we believe that the fuzzy decision tree can approximately explain the decisionmaking process of the C-MATGCN model. Then, we give the partial resulting decision tree as Fig. 5 and refine several decision rules as follows.
1) When the arrival delay occurs in the first two hours, the arrival delay occurs in the previous hour, the snow is moderate, and the wind level in the current time period is strong (membership degree is 0.29), there will be three train arrival delays.  2) When the arrival delay occurs in the first two hours, and there is no delay in the previous hour. On a cloudy day, when the wind level in the current time period is medium wind (membership degree is 0.71), there will be 0 train arrival delays.

VI. CONCLUSION
This article proposes a new interpretable neural network C-MATGCN for high-speed train station delay prediction. Where, the high-speed train operation network has many complex characteristics. We use the community detection strategy, multiattention mechanism, and spatiotemporal graph convolution to capture the community structure, spatiotemporal characteristics, and multifactor influence of the network. To improve the interpretability of the model, we fuzz the original data to adapt to people's daily understanding, such as using "high temperature," "low temperature" to replace specific temperature, then, we use error rate pruning, fuzzy theory to build time-series fuzzy decision tree, and extract tree rules to approximate the decisionmaking process of C-MATGCN. The result shows, compared with the typical seven benchmark models, the C-MATGCN model can achieve the best long-term prediction in 1-6 h, and can also improve the transparency and practicality of the model.
The model can not only be used for high-speed train station delay prediction, but also can be applied to passenger trains, freight train network station prediction, and other application with spatiotemporal and community characteristics. In the future, we will improve our model from the following aspects.
First, we will add more features to the station, such as accurate passenger flow [35]. However, due to the difficulty of data acquisition, the current version uses major holiday to approximate passenger flow. We think passenger flow will increase during major holidays. Besides, line attributes such as railway utilization and track directions can also be added.
Second, in order to avoid the noise between high-speed rail communities, this article treats each community independently. More accurate prediction can be achieved if the intercommunity influence is well captured [36], therefore, we will consider setting an attention mechanism among multiple communities to capture the intercommunity train interaction influence. As far as we know, there is almost no research in this part.
Finally, about the interpretability modeling, this paper cannot explain the C-MATGCN model more perfectly. At present, there are endless researches on interpretable models for deep learning [37]. We will try more methods to improve the model. It is significant to improve the transparency and trustworthiness of the model.