A Data-Driven Approach for Estimating the Effects of Station Closures in Metro Systems

To estimate the impact of station closures, a novel data-driven approach is developed. First, Symbolic Aggregate Approximation (SAX) is used to identify stations with an anomaly in passenger flow volume. Second, Dynamic Time Warping (DTW) is used to identify the trend anomaly within each segment. The method thus considers the mean value and the trend information of passenger flow data simultaneously. To assess the proposed model, a case study of the Beijing metro system is presented. The findings suggest that the proposed model estimates the impact of station closures better than other state-of-the-art models.


I. INTRODUCTION
Metro, as the most effective and energy-efficient transit option, is seen as an ideal way to relieve urban pollution and thus plays a growing role in many big cities around the world. Such heavy dependence imposes enormous strain on metro systems and makes service disruptions hard to afford. Even minor service disturbances, such as temporary station closures, may result in a major loss of public safety and service efficiency, because these disturbances have ripple effects that extend to other stations and lines [1], [2]. Take the 4-day station closure from 23rd April 2019 to 26th April 2019 in Beijing's metro network as an example: train services at 10 stations were disrupted and more than 200,000 passengers were affected. It is therefore important for operators to understand the effect of station closures on their passengers.
Station closures affect passengers differently according to their origins, destinations and routes, so the shifts in passenger flow vary across stations [2], [3]. The review of relevant research focuses on the following aspects: (a) changes in passenger behavior under station closure; and (b) estimates of the impact of station closure based on smart card data.
The associate editor coordinating the review of this manuscript and approving it for publication was Huiling Chen.
Understanding the behavior of travelers under metro closures is important for implementing appropriate remedial actions that minimize the impact on the metro system. However, research on passenger flow characteristics under metro closures has been limited, because station closures are infrequent and the relevant data are scarce. More recently, Pnevmatikou et al. [4] combined information on traveler experiences and perceptions to model mode choice during a long-run metro service disruption. Sarker et al. [5] applied Affective Events Theory to investigate transit users' behavioral reactions to metro service disruptions based on survey data. Saxena et al. [6] analyzed travelers' mode choice under two forms of public transport service disruption using a joint revealed preference and stated preference survey method.
Fortunately, the widespread use of automated fare collection (AFC) data in metro systems offers an alternative way to understand passenger behavior under station closures at scale. However, few studies have exploited the potential of AFC data for metro station closure analysis. Younes et al. [7] investigated the effects of three public transit disruptions in Washington, D.C. on travel mode choice and bikeshare demand during the SafeTrack project, and suggested that the transit station closures had a considerable effect on bikeshare ridership. The station closures in these studies lasted for several months, whereas our study focuses on temporary station closures that last for several hours or a day.
In general, the impact of station closures can be assessed by detecting temporal-spatial anomalies in passenger flow relative to normal scenarios. Sun et al. [1] developed a model that estimates the passengers affected by disruptions on the metro network, taking spatiotemporal factors into account. Zhao et al. [8] used statistical and clustering methods to better understand the travel habits of individual Shenzhen metro passengers, and classified irregular passengers based on the clustering results. Tonnelier et al. [9] established four methods that use smart card logs to identify anomalies in a transport network. To detect passenger outflow irregularities and alert administrators in real time, Wu et al. [2] proposed an anomaly detection approach based on ensemble algorithms. These studies rely on statistical and clustering approaches and have the following drawbacks: the detection performance of clustering approaches is sensitive to data size and parameter settings, and overall statistical indices or aggregated results offer limited interpretability.
Existing anomaly detection work on time series data can be extended to the passenger flow analysis of station closures to resolve these problems. A comprehensive literature review concluded that many techniques for cluster analysis and anomaly detection have been built on time series data [10]. Symbolic Aggregate Approximation (SAX), which uses the mean value of a segment as its symbol, is a major symbolic representation method that has been widely used in time series data mining [11], [12], as well as in passenger flow analysis [13]. For example, Zhang et al. [13] used SAX to characterize the daily passenger flow of each station based on Beijing Metro AFC data. However, as noted in previous studies [14], SAX does not fully take into account two key time series characteristics, i.e., period and pattern. Specifically, SAX ignores the pattern of value change within a segment, which in some cases can lead to incorrect classification, because it cannot distinguish time series that have different patterns but the same mean-value symbol. In this paper, Dynamic Time Warping (DTW) [15], [16] is used to calculate the value-change trend within each passenger flow segment obtained by SAX, and a new weighted distance is created that takes into account both the mean value and the trend information of the passenger flow data.
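The limitation of a mean-based symbol can be illustrated with a minimal Python sketch. The two segments and their values are hypothetical and are not taken from the Beijing data; the sketch only shows that a rising and a falling segment with the same mean cannot be separated by their mean alone, although a pointwise comparison still sees the difference.

```python
from statistics import mean

# Two hypothetical segments (illustrative values, not from the paper's data).
rising = [10, 20, 30, 40, 50]
falling = [50, 40, 30, 20, 10]

# A mean-based representation keeps only the segment mean,
# so both segments would receive the same symbol.
same_mean = mean(rising) == mean(falling)

# A pointwise comparison still reveals that the shapes differ.
shape_gap = sum(abs(a - b) for a, b in zip(rising, falling))

print(same_mean, shape_gap)
```

This is the gap that a trend-aware distance within each segment is meant to close.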
Overall, a novel approach is proposed for estimating the effects of station closures. To achieve this objective, a novel SAX-DTW method that simultaneously considers the mean value and trend information of passenger flow data is proposed to determine the affected times and stations automatically. The contributions of this paper are as follows. First, a data-driven method based on quantity and trend characteristics is proposed to identify the spatial-temporal influence range of temporary station closures. Our proposed SAX-DTW measures the similarity between two time series robustly, even when their lengths are unequal, by using warping alignments; the resulting distance is not sensitive to abnormal points [15], [17], and it handles temporal drift better than existing improved SAX methods. Moreover, our proposed method is shape-based and data-driven and does not need predefined parameters, so the estimated effect of station closures is more flexible and more intuitive than with existing methods [1], [2]. Second, the method is tested on real-world data from the Beijing subway in our case study and shows strong stability. The results show that the proposed model captures abnormal changes in passenger flow under closure more accurately.
The rest of the paper is structured as follows. Section II presents the method for estimating the impact of station closures. Section III describes the data used from the Beijing subway AFC system. Section IV illustrates the application of SAX-DTW to identifying the effect of station closure on passenger flow and compares our approach with other models in detail. Section V concludes the paper.

II. METHOD
Consider a metro network with the set of metro stations $V$. Let the outbound volume be the feature attribute of each station, and let there be $N$ days in the sample, of which $Z$ are normal days. The outbound volume of station $i$ on day $j$ is a time series of length $T$, represented as

$$X_{ij} = \{x_{ij1}, x_{ij2}, \ldots, x_{ijT}\}.$$

The historical average passenger flow of station $i$ on normal days is defined as

$$\bar{X}_i = \{\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{iT}\}, \qquad \bar{x}_{it} = \frac{1}{Z} \sum_{j} x_{ijt},$$

where the sum runs over the $Z$ normal days and $\bar{x}_{it}$ is the average outbound volume of station $i$ at interval $t$ on normal days. To handle time series that may have different offsets and amplitudes, the outbound volume series are converted through Z-score standardization:

$$x^*_{it} = \frac{\bar{x}_{it} - \mu_i}{\sigma_i}, \qquad x^*_{ijt} = \frac{x_{ijt} - \mu_{ij}}{\sigma_{ij}},$$

where $x^*_{it}$ is the standardized outbound volume at interval $t$ on normal days at station $i$, $x^*_{ijt}$ is the standardized outbound volume at interval $t$ on day $j$ at station $i$, and $\mu$ and $\sigma$ denote the mean and standard deviation of the corresponding series. Hence, the normalized passenger flow of station $i$ on normal days is denoted as $X^*_i = \{x^*_{i1}, x^*_{i2}, \ldots, x^*_{iT}\}$ and the normalized passenger flow of station $i$ on day $j$ is denoted as $X^*_{ij} = \{x^*_{ij1}, x^*_{ij2}, \ldots, x^*_{ijT}\}$.

SAX is a data-adaptive reduction technique that transforms a time series into a symbolic sequence in two main steps: (1) transform the raw data into a Piecewise Aggregate Approximation (PAA) representation; (2) transform the PAA representation into a sequence of symbols belonging to a predefined alphabet. The key steps are given below; further details of SAX can be found in [21]. First, a time series $Y$ of length $T$ is divided into $W$ segments, so that $Y$ can be represented in a $W$-dimensional space by a vector $\bar{Y} = \{\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_W\}$. The $w$th element of $\bar{Y}$ is calculated by the following equation:

$$\bar{y}_w = \frac{W}{T} \sum_{t = \frac{T}{W}(w-1)+1}^{\frac{T}{W} w} y_t,$$

where $\bar{y}_w$ represents the mean value of the outbound volume in segment $w$ of station $i$ and $W$ is the number of PAA segments. Then, the mapping from the PAA representation $\bar{Y}$ to the corresponding symbolized sequence $\hat{Y}$ is obtained as follows:

$$\hat{y}_w = \mathrm{alpha}_k \quad \text{if } \beta_{k-1} \le \bar{y}_w < \beta_k,$$

where $\hat{y}_w$ is the mapping symbol; $\mathrm{alpha}_k$ denotes the $k$th element of the alphabet, i.e., $\mathrm{alpha}_1 = a$ and $\mathrm{alpha}_2 = b$; and the breakpoints $\beta_k$ depend on the alphabet size $\alpha$, with $\beta_0 = -\infty$ and $\beta_\alpha = +\infty$. Specifically, the breakpoints are derived from a Gaussian distribution such that the area from $\beta_k$ to $\beta_{k+1}$ is $1/\alpha$ [18]. In this way, the time series $X^*_i$ of station $i$ is first converted to the symbolized sequence $\hat{X}_i = \hat{x}_{i1} \hat{x}_{i2} \ldots \hat{x}_{iW}$, and then converted to a string sequence.

Based on the symbolized data, a distance function of passenger flow at station $i$ on a given day $j$ is defined to return the minimum distance $D_s$ between the historical average passenger flow $\bar{X}_i = \{\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{iT}\}$ and the passenger flow $X_{ij}$ on the given day:

$$D_s(\hat{X}_i, \hat{X}_{ij}) = \sqrt{\frac{T}{W}} \sqrt{\sum_{w=1}^{W} \big(\mathrm{dist}(\hat{x}_{iw}, \hat{x}_{ijw})\big)^2}, \tag{7}$$

where $\mathrm{dist}(\cdot,\cdot)$ denotes the SAX distance between two symbols and $\hat{x}_{ijw}$ is the $w$th mapping symbol of $X^*_{ij}$, calculated by SAX in the same way as $\hat{x}_{iw}$. Therefore, the symbolic distance $D_s$ can be applied to detect possible anomalous days with regard to the quantity of passenger flow.
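The quantity-anomaly steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`znorm`, `paa`, `breakpoints`, `sax`, `symbol_dist`, `mindist`) are hypothetical, each series is standardized with its own mean and population standard deviation, $T$ is assumed to be divisible by $W$, and the symbolic distance is assumed to take the standard SAX lower-bounding (MINDIST) form.

```python
import math
from statistics import NormalDist, mean, pstdev

def znorm(series):
    """Z-score standardize one series (assumption: per-series mean and std)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    seg_len = len(series) // n_segments  # assumes T is divisible by W
    return [mean(series[k * seg_len:(k + 1) * seg_len]) for k in range(n_segments)]

def breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def sax(series, n_segments, alphabet="abcde"):
    """Return the SAX word of a raw series."""
    cuts = breakpoints(len(alphabet))
    word = []
    for value in paa(znorm(series), n_segments):
        idx = sum(value >= c for c in cuts)  # count of breakpoints at or below value
        word.append(alphabet[idx])
    return "".join(word)

def symbol_dist(s1, s2, alphabet="abcde"):
    """SAX distance between two symbols; identical or adjacent symbols give 0."""
    cuts = breakpoints(len(alphabet))
    lo, hi = sorted((alphabet.index(s1), alphabet.index(s2)))
    return 0.0 if hi - lo <= 1 else cuts[hi - 1] - cuts[lo]

def mindist(word1, word2, series_len, alphabet="abcde"):
    """Symbolic distance between two SAX words of equal length W."""
    w = len(word1)
    total = sum(symbol_dist(a, b, alphabet) ** 2 for a, b in zip(word1, word2))
    return math.sqrt(series_len / w) * math.sqrt(total)

# Toy example (hypothetical values): T = 8 intervals, W = 4 segments.
reference = [0, 0, 5, 5, 10, 10, 5, 5]
word = sax(reference, n_segments=4)
d_s = mindist(word, "aaea", series_len=8)
print(word, round(d_s, 4))
```

Because adjacent symbols have zero distance, this symbolic distance only reacts to shifts of at least two alphabet levels, which is one reason a trend term within each segment is useful.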

B. TREND ANOMALY IDENTIFICATION
DTW is used to calculate the pattern similarity between the historical average and any given day for each segment. DTW is a robust method of calculating the similarity between two time series that overcomes the drawbacks of the Euclidean distance, such as requiring the two time series to have equal lengths and becoming unreasonably large because of its sensitivity to even slight mismatches [15], [17], [19]. The pattern anomaly in segment $w$ is determined by the distance $D_d$ between the outbound volumes of that segment at each station on different dates, calculated as follows. Let the historical outbound volume of normal days and the outbound volume of day $j$ with station closures in the $w$th segment of station $i$ be $X_i^w$ and $X_{ij}^w$, respectively. The warping path $P$ of $X_i^w$ and $X_{ij}^w$ is

$$P = \{p_1, p_2, \ldots, p_U\}, \qquad l \le U \le 2l - 1,$$

where $p_u = (x^w_{i,m} - x^w_{ij,k})^2$ represents the (squared) Euclidean distance between the $m$th point of $X_i^w$ and the $k$th point of $X_{ij}^w$ in step $u$ of the warping path $P$, and $l = T/W$. Then, the DTW distance $D_d$ of the $w$th segment is the minimum distance over all possible warping paths $P$:

$$D_d(X_i^w, X_{ij}^w) = \min_{P} \sqrt{\sum_{u=1}^{U} p_u}. \tag{9}$$
Note that the DTW distance and the optimal warping path are usually computed by dynamic programming; pseudo-code for DTW can be found in existing studies [17].
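For illustration, the dynamic program mentioned above can be sketched in Python. This is the textbook DTW recursion with a squared pointwise cost and no warping-window constraint, which is an assumption here; the function name `dtw_distance` is hypothetical and the code is not taken from [17].

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences using a squared pointwise cost."""
    n, m = len(a), len(b)
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return math.sqrt(cost[n][m])

# Toy segments (hypothetical values): same shape, shifted by one interval.
print(dtw_distance([1, 2, 3, 3], [1, 2, 2, 3]))
# Sequences of unequal length can be compared as well.
print(dtw_distance([0, 1], [0, 1, 1]))
```

The table has (n + 1) x (m + 1) cells, so the cost per segment is quadratic in the segment length, which stays small when each segment holds only a handful of intervals.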

C. A NOVEL SAX-DTW ANOMALY IDENTIFICATION METHOD
A new distance, called the SAX-DTW distance $D_{sd}$, is constructed to identify passenger flow anomalies; it simultaneously considers the quantity anomaly across segments and the pattern anomaly within each segment. It combines $\mathrm{dist}(\hat{x}_{iw}, \hat{x}_{ijw})$, the SAX-based symbolization distance calculated by Eq. (7), with $D_d(X_i^w, X_{ij}^w)$, the DTW trend distance calculated by Eq. (9). Note that the novel distance $D_{sd}$ extends the existing symbolic distance $\mathrm{dist}(\hat{x}_{iw}, \hat{x}_{ijw})$ by adding the trend distance $D_d$ with the weight $w/N$, which reflects the size of the segments, resulting in a more comprehensive distance than basic SAX.
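One plausible way to combine the two terms is sketched below in Python. The exact form is an assumption: the per-segment squared symbolic distance is added to the squared DTW distance scaled by the ratio of segment count to series length, by analogy with trend-augmented SAX distances, and the weight written as w/N in the text is assumed to play that role. All function names are hypothetical.

```python
import math
from statistics import NormalDist

ALPHABET = "abcde"
# Breakpoints that split the standard normal into equiprobable regions.
CUTS = [NormalDist().inv_cdf(k / len(ALPHABET)) for k in range(1, len(ALPHABET))]

def symbol_dist(s1, s2):
    """SAX distance between two symbols; identical or adjacent symbols give 0."""
    lo, hi = sorted((ALPHABET.index(s1), ALPHABET.index(s2)))
    return 0.0 if hi - lo <= 1 else CUTS[hi - 1] - CUTS[lo]

def dtw(a, b):
    """Textbook DTW distance with a squared pointwise cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def sax_dtw_distance(word_ref, word_day, series_ref, series_day):
    """Assumed combined distance over W segments of two length-T series."""
    T, W = len(series_ref), len(word_ref)
    seg = T // W  # assumes T is divisible by W
    total = 0.0
    for w in range(W):
        ref_seg = series_ref[w * seg:(w + 1) * seg]
        day_seg = series_day[w * seg:(w + 1) * seg]
        quantity = symbol_dist(word_ref[w], word_day[w]) ** 2
        trend = (W / T) * dtw(ref_seg, day_seg) ** 2  # assumed weight
        total += quantity + trend
    return math.sqrt(T / W) * math.sqrt(total)

# Toy example (hypothetical values): same symbol, opposite trend in one segment.
print(sax_dtw_distance("c", "c", [0, 1, 2, 3], [3, 2, 1, 0]))
```

In the toy example the symbolic term is zero because both segments carry the same symbol, so the whole distance comes from the trend term, which is the case basic SAX would miss.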

III. EXPERIMENT
The Olympic Sports Center Station on Beijing Subway Line 8 is chosen for a comparison test to verify the efficiency of the proposed model. The station is situated between two transfer stations, west of South Gate of Forest Park Station. More than 14,000 visitors a day travel from Beitucheng Station to Olympic Sports Center Station, owing to the major cultural and recreational activities held at the stadium and the many tourist attractions there.
The potentially affected stations in the Beijing metro, for which passenger flow data are analyzed, are shown in Fig. 1, which presents the layout of the Beijing metro network. Specifically, the potentially affected stations are selected based on travel time: (1) the neighbouring stations on the same line, Line 8 (i.e., Beitucheng, Olympic Green, South Gate of Forest Park, and Lincuiqiao); (2) the stations that require one transfer and are not on Line 8 (i.e., Beishatan, Anlilu, and Datunlu East).
The station closure that lasted from 17:00 until the last train on April 24, 2017 is selected. The experiment is based on archived historical data at 5-minute intervals. Data from the first two weeks are used to model the passenger flow characteristics on normal days, and the data from the station closure day are used to identify the effects and evaluate the results.
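As a minimal sketch of the data preparation described above, the normal-day profile can be built by averaging the counts interval by interval. The function name `historical_average`, the in-memory list layout and the toy counts are hypothetical; the 10:00 to 22:00 window is the one analyzed in Section IV.

```python
from statistics import mean

def historical_average(normal_days):
    """Average the outbound counts interval by interval across normal days."""
    return [mean(counts) for counts in zip(*normal_days)]

# A 10:00-22:00 window at 5-minute resolution contains this many intervals per day.
INTERVALS_PER_DAY = (22 - 10) * 60 // 5

# Toy data (hypothetical): three normal days with three intervals each.
normal_days = [
    [10, 20, 30],
    [14, 22, 26],
    [12, 18, 34],
]
profile = historical_average(normal_days)
print(INTERVALS_PER_DAY, profile)
```

With real data, each inner list would hold one full day of 5-minute counts for a single station, and the resulting profile would serve as the reference series for that station.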

IV. IDENTIFICATION RESULTS AFFECTED BY THE STATION CLOSURE
A. SAX RESULTS
The passenger flow from 10:00 to 22:00 is considered, and the possibly affected period is divided into 8 intervals according to the method proposed in Section II. Let alpha = {a, b, c, d, e}. The mean value of each segment is marked with the corresponding symbol and a differently colored line: red, green, blue, yellow, and black represent a, b, c, d, and e, respectively. The symbol sequences of the normal-day and station-closure-day time series are calculated with the proposed SAX method, as shown in Fig. 2. Note that the left subplot is the PAA symbolized sequence of the station closure day and the right subplot is that of the normal days.
According to Fig. 2, the quantity change of passenger flow under the station closure is as follows: (1) The passenger flows of Beitucheng and South Gate of Forest Park Stations change the most. For example, the PAA symbolized sequence of Beitucheng Station changed from "cbbbbbceecba" to "bacceccecbaa", where nine of the twelve symbols differ, as shown in Fig. 2(d). Likewise, the PAA symbolized sequence of South Gate of Forest Park Station changed from "ecdedccbdcaa" to "cbcecccedbac", where seven of the twelve symbols differ, as shown in Fig. 2(g). (2) The other stations show little change in passenger flow, as their PAA symbolized sequences remain almost fixed.
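The symbol counts quoted above can be checked with a short Python sketch. The helper name `differing_positions` is hypothetical; the four words are copied from the text.

```python
def differing_positions(word_a, word_b):
    """Return the 1-based segment indices at which two SAX words differ."""
    return [k + 1 for k, (a, b) in enumerate(zip(word_a, word_b)) if a != b]

beitucheng = differing_positions("cbbbbbceecba", "bacceccecbaa")
south_gate = differing_positions("ecdedccbdcaa", "cbcecccedbac")

print(len(beitucheng), beitucheng)
print(len(south_gate), south_gate)
```

Counting differing symbols is only a coarse screen: it treats a change between adjacent symbols the same as a change across the whole alphabet, whereas the symbolic distance weights the size of the shift.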
Then, $D_s$ is calculated for each time series, as shown in Table 1. Beitucheng and South Gate of Forest Park Stations are clearly detected by the SAX distance, marked in red in Table 1. Moreover, the values of individual segments are analyzed, and some abnormal segments of other stations are detected, as also shown in Table 1. The thresholds of the SAX distance for the segments of each station and for $D_s$ are each calculated according to the three-sigma rule. For example, several segments differ near the time of the station closure (i.e., the seventh segment): at Olympic Green Station (the third segment, shown in Fig. 2(b)), Beishatan Station (the seventh segment, shown in Fig. 2(c)) and Lincuiqiao Station (the eighth segment, shown in Fig. 2(f)). It can be concluded that the effects of the station closure are related to temporal and spatial distance: more abnormal segments occur closer in time or in space to the station closure.
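The three-sigma thresholding can be sketched in Python as follows. The text does not state which sample the mean and standard deviation are taken from, so this sketch assumes they come from the distances of normal days to the historical average; the function names and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def three_sigma_threshold(normal_distances):
    """Upper threshold: mean plus three standard deviations of the reference sample."""
    return mean(normal_distances) + 3 * pstdev(normal_distances)

def is_anomalous(distance, normal_distances):
    """Flag a distance that exceeds the three-sigma threshold."""
    return distance > three_sigma_threshold(normal_distances)

# Toy reference sample (hypothetical distances observed on normal days).
normal = [1.0, 1.2, 0.8, 1.1, 0.9]
threshold = three_sigma_threshold(normal)
print(round(threshold, 4), is_anomalous(1.3, normal), is_anomalous(2.0, normal))
```

The same rule can be applied per segment or to the whole-day distance by changing which distances are passed in as the reference sample.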

B. SAX-DTW RESULTS
The trend changes of passenger flow under the station closure are calculated by SAX-DTW to obtain a finer temporal-spatial range, as shown in Table 2. The thresholds for the segments and for $D_{sd}$ are each calculated according to the three-sigma rule. The results show that the most affected intervals are the seventh to the ninth intervals during the station closure (i.e., the eighth to the tenth columns of Table 2). During the transition segment before the station closure, the passenger flow of many affected stations decreases significantly compared with normal days, whereas after the closure begins their passenger flow increases suddenly with an obvious rising trend. Moreover, the nearby transfer stations are more likely to be affected, while stations that can only be reached by passing two or more transfer stations are less affected. In this scenario, the most affected stations are Beitucheng and South Gate of Forest Park Stations, while Olympic Green, Beishatan and Lincuiqiao Stations are less affected.
In summary, the proposed SAX-DTW method handles anomaly detection under station closures well, in terms of both the quantity and the trend within segments.

C. COMPARISONS OF METHODS
To validate the performance of our proposed SAX-DTW method, it is compared with the existing SAX method and the hybrid method proposed in [2].
First, our proposed SAX-DTW is compared with the existing SAX method. Take the identification results of Datunlu East and Lincuiqiao Stations as an example.
(1) Datunlu East Station anomaly detection. The change of passenger flow at this station is illustrated in Fig. 3. The SAX method found no difference in passenger flow for the first and seventh segments of the station, whereas our proposed SAX-DTW method detected variations in both segments. To verify which of the two results is reasonable, the passenger flow data in the first and seventh segments are examined together. Fig. 3 clearly shows that the historical average data in the first and seventh segments is relatively stable. However, on the station closure day the passenger flow in the first segment fluctuates (a sharp rise to a peak, followed by a rapid decline). The passenger flow in the seventh segment also fluctuates considerably (the second box in the figure). In other words, the SAX-DTW algorithm proposed in this paper detects pattern shifts in passenger flow more reliably than the SAX method.
(2) Lincuiqiao Station anomaly detection. The change in passenger volume at this station is shown in Fig. 4. The SAX method found no difference in passenger flow for the eighth and ninth segments of the station, whereas our proposed SAX-DTW method detected differences in passenger volume. To verify which of the two results is reasonable, the passenger flow data in the eighth and ninth segments are examined together. Fig. 4 shows that, on the closure day, the passenger flow in the eighth and ninth segments (marked by the box) is much higher than the historical average. There are therefore clear differences in passenger volume in these two segments. That is, the proposed SAX-DTW algorithm identifies changes in passenger volume more precisely than the SAX method.
In short, the proposed SAX-DTW method performs better than the existing SAX method in detecting irregularities under station closures, in terms of both precision and stability.
Second, the performance of our proposed SAX-DTW method is compared with that of the hybrid method.
In the hybrid model, the LOF algorithm, the Grubbs criterion and the kNN algorithm are combined by a weighting method, and a voting method is used to integrate the independent-samples t-test, the Wilcoxon signed-rank test and the Mann-Whitney U test. A total of five parameters need to be set: for kNN, the Euclidean distance is selected as the distance measure and the K value is 5; for LOF, the K value is 5 and the contamination is 0.1; for Grubbs, the confidence level is 90 percent. The results are given in Table 3, with anomalies marked in red. Take the comparison of the identification results for South Gate of Forest Park Station and Olympic Green Station as an example.
(1) South Gate of Forest Park Station anomaly detection. The change of passenger flow at this station is illustrated in Fig. 5. The hybrid method found no difference in passenger flow for the first and eighth segments of the station, whereas our proposed SAX-DTW method detected variations in both segments. To check which of the two results is reasonable, the passenger flow data in the first and eighth segments are examined together. Fig. 5 shows that the historical average passenger flow in the first segment rises sharply to a peak and then declines quickly. In the first segment on the station closure day, however, the passenger flow fluctuates up and down. In the eighth segment, the passenger flow also fluctuates greatly (the second box in the figure). That is, the SAX-DTW algorithm proposed in this paper identifies trend shifts in passenger flow more reliably than the hybrid method.
(2) Olympic Green Station anomaly detection. The change of passenger flow at this station is illustrated in Fig. 6. The hybrid method found no difference in passenger flow for the third segment of the station, whereas our proposed SAX-DTW method detected variations in that segment. To verify which of the two results is reasonable, the passenger flow data in the third segment are examined. Fig. 6 shows that the passenger flow in the third segment on the closure day (marked by the box) is much higher than the historical average. There is thus an obvious difference in passenger volume in this segment. In other words, the SAX-DTW algorithm proposed in this paper recognizes changes in passenger volume more accurately than the hybrid method.

V. CONCLUSION
In this study, a novel approach is proposed for estimating the effects of station closures. To achieve this objective, a novel SAX-DTW method that simultaneously considers the mean value and trend information of passenger flow data is proposed to determine the affected times and stations automatically. A case study of the Beijing metro system is conducted to verify the performance of the proposed model. In this case study, the proposed model outperforms the basic SAX method and the hybrid method of [2] in estimating the effects of station closures.
In practice, the proposed method can offer metro managers rich passenger information: identifying stations with irregular inbound or outbound volumes can inform passenger flow management strategies for overcrowding and train service adjustment strategies [20]-[22]. In addition, the proposed method considers not only the difference in overall passenger flow between the station closure and normal times, but also the trend of passenger flow. In future work, more cases and methods should be compared.
MING YANG is currently an Associate Professor with the School of Economics and Management. He is also the Director of the Institute of Transportation and Logistics Economics, Ningbo University of Technology, Ningbo, China. His research interests include transportation statistics methods and applications, urban transportation policy, and transportation economic analysis.
HUARONG QIN is currently an Associate Professor with the School of Economics and Management. She is also a member with the Institute of Transportation and Logistics Economics, Ningbo University of Technology, Ningbo, China. Her research interests include transportation planning and management, urban traffic management, and logistics system planning and design.
KE ZHANG received the B.S. degree from Central South University. She is currently pursuing the master's degree with the School of Traffic and Transportation, Beijing Jiaotong University. Her current research interests include passenger flow guidance, passenger flow prediction, and intelligent transportation systems.
XIN DING received the B.S. degree from East China Jiaotong University. He is currently pursuing the master's degree with the School of Traffic and Transportation, Beijing Jiaotong University. His research interest includes big data mining.
ZIYUE MI received the B.S. degree from East China Jiaotong University. He is currently pursuing the master's degree with the School of Traffic and Transportation, Beijing Jiaotong University. His research interests include passenger behavior analysis and big data mining. VOLUME 8, 2020