Two-Phase Multivariate Time Series Clustering to Classify Urban Rail Transit Stations

Consider the problem of clustering objects with temporally changing multivariate variables, for instance, in the classification of cities with several changing socioeconomic indices in geographical research. If the changing multivariate can be recorded simultaneously as a multivariate time series, in which the length of each subseries is equal and the subseries can be correlated, the problem is transformed into a multivariate time series clustering problem. The available methods consider the correlations between distinct time series but overlook the shape of each time series, which causes multivariate time series with similar correlations and opposite shapes to be clustered into the same class. To overcome this problem, this paper proposes a two-phase multivariate time series clustering algorithm that considers both correlation and shape. In Phase I, the discrete wavelet transform is applied to capture the wavelet variances and the correlation coefficients between each pair of variables to realize the initial clustering of multivariate time series, where time series with a similar correlation but opposite shape may be assigned to the same cluster. In Phase II, multivariate time series are clustered based on shape via the symbolic aggregate approximation (SAX) method. In this phase, time series with similar correlations but opposite morphologies are differentiated. The method is evaluated using multivariate time series of incoming and outgoing passenger volumes from Beijing IC card data; these volume data were collected between March 4, 2013 and March 17, 2013. Based on the silhouette coefficient, our approach outperforms two popular multivariate time series clustering methods: a wavelet-based method and the SAX method.


I. INTRODUCTION
The aggregation of objects with many time-dependent variables has been considered in research on data mining, such as the classification of cities with multiple changing socioeconomic indices, the identification of crop type from various remotely sensed image series, and the categorization of the point of interest (POI) social function based on incoming and outgoing passenger flow series. This problem can be regarded as a cluster determination problem in a multivariate time series that is composed of single-variable time series of equal length. In the multivariate time series, the single-variable time The associate editor coordinating the review of this manuscript and approving it for publication was Keli Xiao . series may be correlated. It may not be possible to adapt the available methods that are designed for univariate time series to multivariate time series, even if the multivariate time series can be treated as a single time series, because the correlations between single-variable time series may be disregarded by this adaptation. Hence, it is necessary to develop clustering methods for multivariate time series.
Univariate time series clustering has been well studied in the literature [1]- [6], whereas multivariate time series clustering, for which univariate methods are not suitable, has been less extensively addressed. The available approaches for multivariate time series clustering can be divided into three main categories: model-based methods, feature-based methods and shaped-based methods [7]. Model-based approaches assume that each time series can be represented by a known model [8]. For example, Maharaj employed the vector autoregressive model to fit multivariate time series and utilized an algorithm that was based on the p-values of hypothesis tests to cluster time series [9]. However, it was necessary to estimate the model parameters of each time series; this approach failed to capture the correlations between the variables of the multivariate time series. In a feature-based method, a raw multivariate time series is represented by a lower dimensional feature vector. The extracted feature vectors are then clustered via a conventional clustering algorithm [10]- [13]. To extract feature vectors that represent a raw time series, Ye et al. performed a generalized principal component analysis (GPCA) [10], and Guo et al. [11] and Wu and Philip [12] applied an independent component analysis (ICA). However, they disregarded the correlations between the variables of the multivariate time series. To overcome this problem, D'Urso and Maharaj [13] proposed decomposing each variable of the multivariate time series into the wavelet series on various scales and calculating the wavelet variance at each scale. The wavelet correlations at each scale for the multivariate time series of every pair of variables are calculated, and the wavelet variance and correlation coefficient are concatenated into a single vector to represent the multivariate time series. This approach has the advantage of constructing the time series of wavelet features while considering the correlations. This approach can be applied to average nonstationary time series. However, time series with similar variances and correlation coefficients but opposite morphologies may be clustered into the same category. In shape-based methods, the time series are similar in terms of time and shape. For time series with high dimensionality, the dimension reduction technique is typically employed to reduce the noise and simplify the variables. Symbolic Aggregate approXimation (SAX) [14], which was proposed by Lin et al. as a transformation function in a time series similarity measure study, boasts the advantages of a high compression ratio, retention of data locality details and effective dimensionality reduction. However, SAX is unable to capture the correlations between the variables of multivariate time series.
Although a variety of approaches have been proposed [9]- [12], [14]- [16] for multivariate time series clustering, they either disregard the correlations between the variables of the multivariate time series or the cluster time series with opposite morphology in the same category. To overcome this problem, we propose two-phase clustering of multivariate time series based on wavelet transform and SAX (WSAX), which has the following characteristics: 1) in Phase I, the inherent correlations between variables of multivariate time series are considered by using a wavelet transform to represent the wavelet features of the original time series, and 2) in Phase II, shape-based clustering can effectively distinguish multivariate time series with opposite morphologies. Thus, multivariate time series with opposite shapes but similar variances and correlation coefficients can be effectively identified and clustered into different classes.
The remainder of the paper is organized as follows: Section 2 briefly describes the background of our proposed method. Section 3 details the two-phase multivariate time series clustering algorithm, namely, WSAX. The experimental results and analysis are presented in Section 4. The conclusions of this work are presented in Section 5.

II. BACKGROUND
In this section, we briefly describe the notations of multivariate time series, the maximal overlap discrete wavelet transform, SAX and the similarity measure, from which our proposed method is extended.
where S i represents the multivariate time series of the i th object, with q (q = 1, . . . , Q) as the variable, where t (t = 1, . . . , T ) denotes the observation time interval. Thus, s iqt represents the observation of the i th object's q th variable at time t. S i is a univariate time series if q is equal to 1. Multivariate time series clustering is defined as a specified set of the multivariate time series S clustered into K clusters, C 1 , C 2 , · · · , C K , where C i (i = 1, · · · , K ) is the i th cluster, which contains at least 3 objects in S.

B. WAVELET-BASED CLUSTERING OF MULTIVARIATE TIME SERIES
The maximal overlap discrete wavelet transform (MODWT) [13], [17], [18] is a modified version of the discrete wavelet transform. An example of using the MODWT to analyze time series is available in the literature [18]. Assume that g jl , l = 0, · · · , L j , is a j-level wavelet filter of length L j that is associated with the scale τ j ≡ 2 j−1 and the univariate time series S iq = (s iq1 , s iq2 , . . . , s iqT ). An unbiased estimator of the time-independent variance at scale τ j is where H S iq ,jt denotes the MODWT coefficients of the time series S iq and N j = T − L j + 1 is the number of wavelet coefficients. The wavelet covariance of two specified univariate time series S iq and S ik with the MODWT coefficients H S iq ,jt and H S ik ,jt , respectively, is defined as v S iq S ik (τ j ) ≡ cov( H S iq ,jt , H S ik ,jt ). ∞ j=1 v S iq S ik (τ j ) =cov(S iq , S ik ) if j is infinite. Thus, we obtain the unbiased estimator of the correlation coefficient between S iq and S ik : SAX [14], [19], which was proposed by Keogh E, is an effective discrete dimensionality reduction method of a time series that is based on piecewise aggregate approximation (PAA) [20]. SAX represents the conversion of a time series of length T into a symbol string of length N (N T ), where N is the number of subseries. Given the univariate time series S iq = (s iq1 , s iq2 , . . . , s iqT ), SAX can be conducted via the following three steps: Step 1: Normalization. Each original univariate time series S iq is normalized into S iq = (s iq1 , s iq2 , . . . , s iqT ) with mean 0 and variance 1 via equation (5). The normalization does not affect the shape or scale of the original univariate time series S iq . [21].
where s iqt is the observed value at time t in the univariate time series S iq , u S iq is the mean of all observed values in the univariate time series S iq and σ S iq is the standard deviation of all observed values in the time series S iq .
Step 2: PAA dimensionality reduction. PAA is used to divide the univariate time series S iq of length T into the time seriesS iq = s iq1 ,s iq2 , . . . ,s iqN of length N (N T), according to the subseries length T /N . The mean of each subseries is calculated via equation (6).
Step 3: Symbolic representation. The time seriesS iq of the approximate Gaussian distribution can be divided into α equiprobable intervals, and the breakpoints β i can be obtained as specified in the literature [19]. The sequence values in the same interval are represented by the same symbol, and the original univariate time series S iq is symbolized as For two univariate time series of length T , namely, S iq = (s iq1 , s iq2 , . . . , s iqT ) and S ik = (s ik1 , s ik2 , . . . , s ikT ), SAX is utilized to obtain two symbolic representations of length N : Here, equation (7). Reference [19] is employed to calculate the distance between the two symbolic representations of S iq and S ik to determine their similarity.
where dist( s iqn − s ikn ) represents the distance between the two symbols; the calculation is demonstrated in the literature [19].

III. TWO-PHASE CLUSTERING OF MULTIVARIATE TIME SERIES
The algorithm for two-phase clustering of multivariate time series based on wavelet analysis and SAX is denoted as WSAX and consists of two phases. A flow chart of WSAX is shown in figure 1.
In Phase I, the wavelet variance of each variable and the correlation coefficient between each pair of variables of multivariate time series are calculated. The wavelet variances and correlations are concatenated into a single vector to represent the multivariate time series. The wavelet variance-correlation coefficient feature vector is applied for clustering to yield the Phase I clustering result. The advantage of wavelet analysis is that the inherent correlations between variables of multivariate time series are considered. However, only using wavelet features for multivariate time series clustering will create the problem that time series with similar variances and covariances but opposite morphologies are clustered into the same category. Therefore, the shape-based method SAX is utilized to perform secondary clustering with the results of the Phase I. In Phase II, SAX is utilized to reduce the dimensionality of each variable of the multivariate time series, and the similarity measure is applied. For each category that was clustered in Phase I, K-means is applied to perform a second clustering of the multivariate time series based on shape.
A. PHASE I: WAVELET-BASED CLUSTERING OF MULTIVARIATE TIME SERIES For the multivariate time series of each object, first, the wavelet variance of each variable and the correlation coefficient between each pair of variables can be obtained by using the MODWT. This wavelet information is represented as a wavelet variance-correlation coefficient matrix (equation (8)) using equation (3) and equation (4). Wavelet variances and correlations are concatenated into a single vector to represent the multivariate time series, as expressed in equation (9). For two multivariate time series S i and S i , i, i = 1, . . . , I , i = i , after their respective wavelet features have been obtained, equation (10), which is based on the wavelet variance-correlation coefficients, is applied to calculate the Euclidean distance between S i and S i . K-means is applied to cluster the multivariate time series to yield the Phase I clustering result.

B. PHASE II: SAX-BASED CLUSTERING OF MULTIVARIATE TIME SERIES
For any object i, the data representation is a multivariate time series S i , as expressed in equation (2), where the q th column is a univariate time series of length T, which is denoted as S iq = (S iq1 , S iq2 , . . . , S iqT ) , and its symbolic representation is a sequence S iq = ( S iq1 , S iq2 , . . . , S iqN ) of reduced length N (N T). The symbolic representation of the multivariate time series S i is approximated as S i = ( S i1 , S i2 , . . . , S iQ ).
The symbolic representations of any two multivariate time series S i and S i are approximated as S i = ( S i1 , S i2 , . . . , S iQ ) and S i = S i 1 , S i 2 , . . . , S i Q . The similarity between S i and S i is calculated via equation (11): To overcome the problem of clustering time series with similar variances and correlation coefficients but with opposite morphology into the same category in Phase I, for each category that is clustered in Phase I, we apply a similarity measure that is based on SAX, as expressed in equation (11), and use the silhouette coefficient to determine the number of second clusters of each category in Phase I.

C. CLUSTERING VALIDITY
The dataset size, classification objective and clustering validity must be considered in determining the number of clusters. Compared with other clustering validity indices [22], the silhouette coefficient is a comprehensive index that is highly suitable for clustering validity analysis of univariate or multivariate time series. The silhouette coefficient [23] considers both the discreteness of all clusters of samples and the coherence of each cluster of samples in the clustering structure. The larger is the silhouette coefficient, the higher is the clustering performance. Assume a total of H (H ≥2) clusters and that the K th cluster has N K objects. The silhouette coefficient of each object that belongs to cluster L(1≤ L ≤ H ) is calculated via equation (12).
where Sil L i (−1 ≤Sil L i ≤ 1) is the silhouette coefficient index of the i th object that belongs to cluster L; a L i (equation (13)) is the mean distance between the i th object and the other objects that belong to the same cluster L; and b L i (equation (14)) is the minimum of the mean distances between the i th object and all objects that do not belong to cluster L. where d L,K ij is the Euclidean distance between the i th object that belongs to cluster L and the j th object that belongs to cluster K . In this paper, the mean silhouette coefficient index over all objects (referred to as the mean silhouette coefficient index) is used to evaluate the clustering performance.

IV. EXPERIMENTS AND ANALYSIS A. DATA DESCRIPTION
We employ card touch in/out data from various rail transit stations in Beijing during a two-week period from March 4 to March 17, 2013. We cleaned the data and selected 195 rail transit stations with complete card touch in/out records for the study.
We aggregate these data in hours according to the incoming and outgoing directions and obtain multivariate time series of incoming and outgoing passenger volumes. Figure 2 plots the data from two stations (Huilongguan and Wudaokou) for two weeks. In Figure 2(a), Huilongguan station shows a single peak each day for each of the incoming and outgoing directions, and the daily peaks are scattered among time intervals. On weekdays, the peaks are higher in the incoming direction than in the outgoing direction, whereas on weekends, they are approximately the same, except that they are substantially lower on weekends than on weekdays. In Figure 2(b), Wudaokou station shows dual peaks for both the incoming direction and the outgoing direction, and the daily peaks are scattered among the time intervals. The peaks in the incoming direction are substantially lower in the morning hours than in the evening hours; the opposite is observed for the outgoing direction. No peaks are readily observed on weekends and the volume of passengers is steady throughout the day.
Data preprocessing normalizes the original incoming and outgoing time series to the time series with a mean of 0 and a variance of 1. The incoming and outgoing time series data for the entire period include T = 336 time points. We use the augmented Dickey-Fuller (ADF) test to perform statistical tests on the stationarity of the incoming and outgoing time series data. The ADF test results of 195 inbound and outbound time series are less than the critical statistical values t the 5% confidence level, and all p-values are close to zero, which shows that the data are stationary time series. The set of maximum and minimum ADF test result values for the incoming and outgoing time series are shown in Table 1.

B. EXPERIMENTAL RESULTS
The multivariate time series of incoming and outgoing passenger volumes for each of the 195 rail transit stations are clustered via our method. In the experiment, the wavelet filter is LA(8), the scale is τ j = 2 j−1 , for j = 2, 3, . . . , 5, where j is the number of wavelet filter layers, and the number of clusters is K = 2, 3, . . . , 10. To avoid the local optimal solution, for each K, we repeat K-means 100 times, calculate the average of its silhouette coefficient, and select the maximum of the silhouette coefficient that corresponds to K equal to 2 as the optimal number of clusters in Phase I. The average silhouette coefficients are listed in Table 2.
Based on the Phase I clustering results, we conduct a second clustering according to shape for the stations of each category. We apply SAX to the incoming/outgoing passenger volume time series for a symbolic representation with a window interval of 2 and apply equation (12) to calculate the similarity between the rail stations. Next, we conduct    K-means clustering to yield the results of Phase II clustering for Categories 3 and 2, according to the clustering validity index-silhouette coefficient ( Figure 3) and the presence of at least 3 stations for each cluster. The average wavelet variance, correlation coefficient and station number for the first category and the 3 clusters after Phase II clustering are listed in Table 3 and those for the second category and the 2 clusters after Phase II clustering are listed in Table 4. After the two-phase clustering has been completed, the rail transit stations are divided into 5 clusters.

C. CONTROLLED EXPERIMENT
To evaluate and compare the clustering performances on the multivariate time series of incoming and outgoing passenger volumes of 195 rail transit stations, where the number of clusters K ranges between 4 and 10, we conduct experiments using the wavelet transform (WVC), which is proposed by D'Urso and Maharaj [13]; SAX, which was proposed by Lin et al. [31]; and WSAX, which was proposed in this paper. Figure 4 plots the silhouette coefficients versus the K values of the three clustering methods, according to which the silhouette coefficients are larger for WSAX than for WVC and SAX. Hence, the WSAX clustering method is more effective in clustering rail transit stations. Combined with the curve characteristics of Figure 5(a)-(e), our proposed clustering method can accurately differentiate the time series with similar variances but opposite morphologies that were obtained in the initial clustering, which demonstrates the satisfactory performance of WSAX.
The larger is the silhouette coefficients that correspond to the cluster size, the better is the clustering performance, which can be applied to determine the cluster size. In Figure 4, the blue broken line depicts the change in the silhouette coefficients of the WSAX method with the cluster size K, which shows that the silhouette coefficient is the largest when k is equal to 5, that is, the best clustering performance is achieved.  Figure 5(a)-(e) are grouped into the same category, namely, Category 1, and clusters 4 and 5 are grouped into another category, namely, Category 2. According to Tables 3 and 4, the multivariate time series with weak correlations (0.09-0.24) and strong correlations (0.61-0.62) between incoming passenger volumes and outgoing passenger volumes are divided into Category 1 and Category 2, respectively. However, the multivariate time series of Category 1 in Figure 5(a) and Figure 5(b) have opposite morphologies. The same problem is observed in Category 2. In Phase II, this problem is effectively overcome by applying SAX. Category 1 is divided into 3 clusters, namely, cluster 1, cluster 2, and cluster 3, as plotted in Figure 5(a)-(c), and Category 2 is divided into 2 clusters, namely, cluster 4 and cluster 5, as plotted in Figure 5(d)-(e).

D. ANALYSIS OF THE CLUSTERING RESULTS
The multivariate time series with weak correlations between incoming passenger volumes and outgoing VOLUME 8, 2020 passenger volumes are shown in Figure 5(a)-(c). The curves of the incoming and outgoing passenger volumes exhibit single peaks, except during peak periods. The five clusters, cluster 1, which is plotted in Figure 5(a), show single peaks on weekdays and weekends. The weekday incoming peaks last from 3 p.m. to 9 p.m., and the weekday outgoing peaks last from 7 a.m. to 5 p.m. The weekend incoming/outgoing peaks last longer and are more readily observed. This finding demonstrates that these stations are in integrated service functional areas, namely, they are comprehensive stations (such as stations near scenic locations and large shopping centers). For example, Zoo Station is located near a large attraction, the Beijing Zoo, and a transportation hub; and Olympic Green Station and the Olympic Sports Center Station are located near the Beijing Olympic Park and large shopping malls, such as the XinAo Shopping Mall and Rainbow Shopping Mall. These stations are either located in the same urban functional area or have similar POIs. Cluster 2, which is plotted in Figure 5(b), shows a weekday incoming passenger volume peak between 5 p.m. and 7 p.m. and a weekday outgoing passenger volume peak between 8 a.m. and 10 a.m. Weekends are similar to weekdays but with substantially lower passenger volumes. Hence, these stations are located in business service functional areas, namely, these stations are business-related stations. Zhongguancun Station and Xi'erqi Station are located in typical business areas. Cluster 3, which is plotted in Figure 5(c), shows a weekday incoming passenger volume peak between 8 a.m. and 10 a.m. and a weekday outgoing passenger volume peak between 5 p.m. and 7 p.m. Weekends are similar to weekdays but with substantially lower passenger volumes and a morphology that is exactly the opposite that in Figure 5(b). Hence, these stations are located in residential service functional areas, namely, these stations are residential-related stations. For example, Tiantongyuan Station and Huilongguan Station are located in typical residential communities.
The multivariate time series with strong correlations between incoming passenger volumes and outgoing passenger volumes, which are depicted in Figure 5(d)-(e), show strong dual peaks during peak hours. Cluster 4, which is plotted in Figure 5(d), shows slightly higher morning peaks of weekday incoming passenger volumes, with morphologies that are opposite those of the weekday outgoing passenger volumes. Hence, these stations are located in residential service functional areas and business service functional areas, and the residential function outweighs the business function, namely, these stations are primarily residential-and secondarily business-related stations. Taoranting, Beiyuan and Sihuidong stations, for example, are located near residential areas that also contain business areas. In cluster 5, which is plotted in Figure 5(e), the shape is opposite that in Figure 5(d). Hence, these stations are located in both residential service functional areas and business service functional areas, and the business function outweighs the residential function, namely, these stations are primarily business-and secondarily residential-related stations.
Zhichunlu, Xizhimen and Wangjing stations, for instance, are located near business areas that are mixed with residential areas, and the business area functions outweigh the residential area functions.

V. CONCLUSION
The available methods for clustering multivariate time series, which contain multiple variables, are insufficient. In this paper, we propose a two-phase multivariate time series clustering algorithm, namely, WSAX, which combines the advantages of feature-based and shape-based clustering methods. WSAX not only explores the correlations between variables but also considers the similarity in terms of the time series morphology. Phase I obtains the wavelet variance of each variable and the correlation coefficients between the variables via the MODWT. Subsequently, Phase I uses the wavelet variance-correlation coefficient feature vector for clustering. Phase II uses SAX to reduce the dimensionality of each variable of the multivariate time series and applies the similarity measure to realize the second clustering of multivariate time series based on shape. In the experiment, in which real rail transit station data from Beijing IC cards are employed, the WSAX method outperformed the WVC and SAX methods. The rail transit stations were divided into 5 clusters: comprehensive, business-related, residential-related, primarily residential-and secondarily business-related, primarily business-and secondarily residential-related. The clustering validity index silhouette coefficient is employed to compare the WSAX, WVC and SAX methods, which in combination with the clustering results that are presented in figure 5(a)-(e), demonstrates the satisfactory performance and rationality of the WSAX algorithm.
The experimental results demonstrated two main advantages of the proposed WSAX method: 1) This method considers the inherent correlations between the variables of the multivariate time series in Phase I and re-clusters the initial clustering results into a cluster of weak correlations between incoming and outgoing passenger volumes and a cluster of strong correlations. 2) This method considers the morphological similarity of the multivariate time series in Phase II and overcomes the problem of clustering time series with similar variances and correlation coefficients but with opposite morphologies into the same category in Phase I. In addition, the results provide a scientific basis of reference for studying urban functions, planning rail transit stations, and managing related services. Her research interests include verification technologies for nonlinear circuits and systems, LSI simulation technologies, and intelligence computation technologies. VOLUME 8, 2020