Efficient Time Series Clustering by Minimizing Dynamic Time Warping Utilization

Dynamic Time Warping (DTW) is a widely used distance measurement in time series clustering. DTW distance is invariant to time series phase perturbations but has a quadratic complexity. An effective acceleration method must reduce the DTW utilization ratio during time series clustering; for example, TADPole uses both upper and lower bounds to prune off a large ratio of expensive DTW calculations. To further reduce the DTW utilization ratio, we find that the linear-complexity L1-norm distance (Manhattan distance) is effective enough when the time series only comprise small phase perturbations. Therefore, we propose a novel time series clustering by Minimizing Dynamic Time Warping Utilization (MiniDTW) algorithm to accelerate time series clustering. In MiniDTW, the dataset is first ''greedily'' summarized into seed clusters, which comprise time series of small phase perturbations, by L1-norm distance. Then, we develop a new Sparse Symmetric Non-negative Matrix Factorization (SSNMF) algorithm, which factorizes the DTW distance matrix of seed cluster centers, to merge the seed clusters into the final clusters. The experiments on UCR time series datasets demonstrate that MiniDTW, pruning 98.52% of the DTW utilization, is better than the counterpart method, TADPole, which only prunes 75.56% of the DTW utilization; and thus MiniDTW is 10 times faster than TADPole.


I. INTRODUCTION
Time series are among the most important data in the modern data-driven society and are generated from nearly every aspect of daily life [1]. Time series analysis can benefit pervasive applications in different domains, e.g., financial marketing [2], smart home [3] and autonomous vehicles [4]. Time series clustering is a basic technique for analyzing time series. It can discover the underlying structure of chaotic/raw datasets without ground truth labels. This makes it particularly useful for analyzing many unlabeled real-world datasets, in tasks such as common pattern discovery [5], information retrieval [6] and outlier detection [7].
The time series distance measurement is essential for clustering accuracy, but the precision of simple distance measurements, such as L1-norm (Manhattan distance) and cross correlation, is undermined by the phase perturbations (e.g., phase shifting, time warping) that widely occur in time series [8]. Dynamic Time Warping (DTW) [9] is a distance measurement that is robust to time series phase perturbations; however, its quadratic complexity greatly impairs the efficiency of time series clustering. To accelerate time series clustering with DTW distance, some methods, such as TADPole [5], reduce the DTW utilization ratio by pruning unnecessary DTW calculations with fast calculated upper/lower bounds of DTW. Unfortunately, existing methods struggle to prune most DTW calculations (for example, TADPole needs 24.43% DTW calculations after pruning), because it is challenging to define tight lower/upper bounds of DTW distance, which leads to a large runtime for the clustering.
The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.
To significantly reduce the DTW utilization ratio for the acceleration, we only apply the complex DTW calculation on a summarized time series dataset (rather than the original dataset). This is inspired by the work [6] that achieves interactive time series retrieval by querying a summarized database, rather than the original large dataset. Specifically, we find that L1-norm distance is effective for summarizing the time series dataset based on three observations. First, L1-norm distance has a linear complexity and is more efficient than DTW distance. Second, the precision loss of L1-norm distance is limited when time series have small phase perturbations. Third, L1-norm distance is an upper bound of DTW distance, which ensures that time series with a small L1-norm distance always have a small DTW distance. To ''greedily'' reduce the DTW utilization ratio, we summarize the dataset into natural-shaped seed clusters with L1-norm distance. Therefore, a seed cluster can group 1) two time series with a small phase perturbation and 2) two time series that comprise a large phase perturbation but can be related to each other by a series of slightly perturbed time series. Then, DTW distance is only used on a small number of seed cluster centers to merge the seed clusters into final clusters.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we propose a novel time series clustering by Minimizing Dynamic Time Warping Utilization (MiniDTW) algorithm to accelerate time series clustering. In MiniDTW, the original dataset is first ''greedily'' summarized as a small number of natural-shaped seed clusters with the efficient L1-norm distance. The seed clusters are further merged to form the final clusters by a new Sparse Symmetric Non-negative Matrix Factorization (SSNMF) algorithm, which factorizes the DTW distance matrix of seed cluster centers. Comprehensive experiments are conducted using UCR time series datasets [10] to evaluate the proposed MiniDTW algorithm. In summary, this paper makes three contributions: 1) We propose a novel MiniDTW method to speed up time series clustering. MiniDTW minimizes the DTW utilization ratio by dataset summarization with the linear-complexity L1-norm distance. 2) We propose an effective SSNMF matrix factorization algorithm, which merges the seed clusters in MiniDTW more accurately than other Non-negative Matrix Factorization based algorithms (i.e., NMF with L1/L2 constraints). 3) We conduct comprehensive experiments to evaluate the performance of the proposed MiniDTW, and the result shows that MiniDTW can effectively avoid 97.90% of DTW utilization and thus is 10 times faster than the counterpart, TADPole, which only prunes 75.73% of DTW utilization.
The rest of this paper is organized as follows. In Section 2, we review related works. In Section 3, we introduce the preliminary knowledge and the problem definition. The MiniDTW algorithm is introduced in Section 4 and evaluated in Section 5. Finally, we conclude this paper in Section 6.

II. RELATED WORK
Time series clustering groups similar time series into the same cluster, while separating disparate time series into different clusters. There are two essential techniques for effective time series clustering, i.e., the distance measurement of time series, and the clustering algorithm.
Many time series distance measurements have been proposed in the literature, such as the basic Norm distance, which is normally used as L1-norm distance (Manhattan distance) [11], [12] or L2-norm distance (Euclidean distance) [13], [14]. Norm distance is intuitive and has a linear complexity, but it may face significant precision loss when time series phase perturbation occurs. DTW distance [9] is invariant to time series phase perturbations. DTW finds the distance by searching the optimal continuous warping path between two time series; however, DTW distance has a complexity quadratic to the length of time series.
Many methods are proposed to accelerate DTW distance by reducing the complexity of its search space [15]-[17]. For example, SS-PrunedDTW [16] compresses the search space with a continuously updated upper bound; PDTW [14] reduces the dimension of the search space by compressing time series (with PAA [18]). Meanwhile, other distance measurements are proposed to resolve specific types of phase perturbations. For example, SBD [19] is effective for phase shifting by finding the optimal alignment of two time series with cross correlation, while LCSS [20] is invariant to sampling rate by finding the longest common subsequence (LCSS) between the two time series.
Time series clustering is a well-studied field and the clustering algorithms can roughly be categorized into four classes: hierarchical, model-based, partition-based and density-based time series clustering. HSM [21] is an agglomerative time series clustering technique that hierarchically merges clusters. The cluster distance in HSM is calculated with cluster representatives, which are estimated by spectral density. TS3C [22] hierarchically clusters time series, with single-linkage cluster distance, after each time series is mapped to a representation with the centroids of subsequence clusters. Model-based time series clustering normally assumes that time series are generated following specific statistical models. For example, GMM [23] assumes that time series are generated with a mixture of finite Gaussian distributions, while HMM [24], [25] uses a hidden Markov process. Hierarchical and model-based clustering algorithms are relatively complex in terms of calculations, and are usually used to interpret the clustering results.
Partition-based time series clustering partitions the dataset into clusters through minimizing the overall distances of time series to their respective cluster centers. The center of a cluster, e.g., in Kmeans based time series clustering [26], [27], is regarded as the point-wise average of the contained time series; however, such centers may poorly represent the common temporal pattern when phase perturbation appears. Other methods are proposed to find more accurate cluster centers. For example, K-medoids [28] regards the time series that has the least sum of distances to other time series as the center of a cluster; KDBA [29] uses a global averaging technique to generate the centers that adapt to DTW distance; Kshape [19] and KSC [30] discover the centers as eigenvectors by spectral analysis. Compared with density-based methods, these methods do not demand extensive distance calculations; however, their performance is highly affected by the distance measurement adopted.
Density-based clustering finds time series clusters by grouping time series according to their densities. YADING [11] adopts DBSCAN to effectively find clusters containing time series that comprise small phase shifting, with the efficient L1-norm distance. TADPole [5] uses DPC [31], another density-based clustering algorithm, with DTW distance to develop an anytime time series clustering algorithm. In TADPole, the DTW utilization ratio is reduced by pruning out-of-bounds DTW distances with the efficient DTW lower/upper bounds. Density-based clustering can find natural-shaped clusters, and in this paper we utilize this characteristic to develop the proposed method.

III. PRELIMINARIES AND PROBLEM DEFINITION

A. L1-NORM DISTANCE AND DTW DISTANCE
A time series is a series of real values, denoted as $X = \{x_1, x_2, x_3, \ldots, x_m\}$, where m is the length. A time series dataset contains n equal-length time series ($D = \{X_1, X_2, X_3, \ldots, X_n\}$). L1-norm distance measures the distance of two time series, X and Y, as the overall pair-wise difference as follows:
$$L1_{norm}(X, Y) = \sum_{i=1}^{m} |x_i - y_i| \quad (1)$$
L1-norm distance has a complexity linear to m, but its precision is low for poorly-aligned time series. As shown in Fig. 1 (a), $L1_{norm}(X, Y)$ is large, even though X and Y have similar wave shapes, because of the phase perturbation. DTW distance is another measurement that is widely used to accurately measure the distance of time series. DTW distance achieves this by finding the optimal continuous warping path (the best alignment) between X and Y. Specifically, a warping path is denoted as $W = \{w_1, w_2, \ldots, w_k\}$, where each $w_l = |x_i - y_j|$, and the overall weight of the optimal warping path is used as the DTW distance:
$$DTW(X, Y) = \min_{W} \sum_{l=1}^{k} w_l \quad (2)$$
Dynamic programming is applied so that the discovery of the optimal warping path has a complexity quadratic to m.
For the example in Fig. 1 (a), DTW distance can properly measure the distance of X and Y regardless of the phase perturbation. The optimal warping path between X and Y found by DTW distance is shown in Fig. 1 (b).
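The two measurements can be contrasted with a minimal sketch. It assumes the plain dynamic-programming form of DTW with no warping window (the experiments later fix a window of 5%, which is not modelled here); the function names `l1_distance` and `dtw_distance` are illustrative and not taken from the paper's implementation.

```python
from math import inf


def l1_distance(x, y):
    """L1-norm (Manhattan) distance of two equal-length series; O(m)."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    return sum(abs(a - b) for a, b in zip(x, y))


def dtw_distance(x, y):
    """Minimum total |x_i - y_j| over continuous warping paths; O(m^2)."""
    n, m = len(x), len(y)
    # cost[i][j] = weight of the best path aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance x only
                                    cost[i][j - 1],      # advance y only
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


x = [0, 0, 1, 2, 1, 0, 0, 0]
y = [0, 0, 0, 1, 2, 1, 0, 0]   # the same bump, shifted by one step
print(l1_distance(x, y))        # 4
print(dtw_distance(x, y))       # 0.0
```

The shifted pair has a non-zero L1-norm distance but a zero DTW distance, and the DTW value never exceeds the L1-norm value because the diagonal alignment is itself a valid warping path.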

B. PROBLEM DEFINITION
DTW distance can effectively measure the distance of time series since it is invariant to phase perturbations; however, its quadratic complexity confines its availability to applications that demand high efficiency. To accelerate time series clustering, we seek to reduce the DTW utilization ratio by dataset summarization using the L1-norm distance, which has linear complexity. This strategy is based on two observations about cases where DTW distance is not necessary: 1) time series have small phase perturbations and thus the precision loss of L1-norm distance is not significant, and 2) time series of large phase perturbations can be related by a series of slightly perturbed time series. The following content explains these two observations. YADING [11] shows that L1-norm distance has a limited precision loss for measuring the distance of time series with small phase shifting. We further extend this finding to general phase perturbation, that is, L1-norm distance has limited precision loss for measuring the distance of two time series with a small phase perturbation, in which scenario DTW distance can be replaced with L1-norm distance.
(Assume X is sampled from f (t) by an equal interval, and Y is sampled from the f (t + λ(t))).
Proof: An arbitrarily small phase perturbation λ(t) means that at each ...
Lemma 1 is the building block of our method: it shows that X and Y are neighbours, i.e., their distance is less than the distance threshold, if they differ by a small phase perturbation. Meanwhile, using the L1-norm to find neighbours is correct because the neighbours found by L1-norm distance are always neighbours under DTW distance, since L1-norm distance is an upper bound of DTW distance [5]. Moreover, X and Y that have a large phase perturbation may also be grouped into the same cluster by a density-based clustering algorithm with L1-norm distance. Specifically, X and Y are not neighbours of each other by L1-norm distance since the difference of phase perturbation of X and Y (i.e. ... where each $\epsilon_i$ is small, X and Y can be related by a neighbour chain ($\{X, Z_1, Z_2, \ldots, Z_k, Y\}$), according to Lemma 1.
Now we consider how to group together time series that comprise the same wave shape but different scales of phase perturbation by a density-based clustering algorithm, which rarely uses DTW distance. Without loss of generality, we formulate the distribution of the phase perturbation as a bimodal Gaussian distribution, PDF($\epsilon$), which denotes the probability that a phase perturbation with an intensity $\lambda(t, \epsilon)$ appears in the time series. PDF($\epsilon$) has two peaks ($\epsilon = 0$ and $\epsilon = 8$) that represent the peak probabilities (as shown in Fig. 2 (a)). We create a toy dataset that contains 300 time series, which have the simple phase perturbation pattern ($f(t + \lambda(t, \epsilon))$) following the bimodal Gaussian distribution, as shown in Fig. 2 (b). We separately apply DTW distance and L1-norm distance on the toy dataset, and visualize the respective distance measurements by Multidimensional Scaling (MDS) [32], as shown in Fig. 2 (c-d). The result of DTW distance clearly exhibits its invariance to phase perturbations (Fig. 2 (c)) since all the obtained DTW distances are small, i.e., the maximum DTW distance is only 0.03. In contrast, the visualization of the L1-norm distance matrix appears as two long thin arcs, as shown in Fig. 2 (d), with a much larger maximum L1-norm distance, i.e., 1.33. The arcs are thin because, for each time series, only a limited number of time series have small L1-norm distances to it; they are long because many time series have larger L1-norm distances to it as the difference of $\epsilon$ becomes larger. In addition, the time series whose $\epsilon$ values lie at the two peaks of PDF($\epsilon$), i.e., $\epsilon = 0$ and $\epsilon = 8$, also have peak densities in the two arcs, respectively. Therefore, it is intuitive to adopt a two-step approach to group these time series into a cluster. First, the toy dataset is summarized as two seed clusters, which contain the respective time series forming the two arcs in Fig. 2 (d), based on the density calculated with L1-norm distance. Second, the two seed clusters are merged into the final cluster by the small DTW distance of their centers (Fig. 2 (c)). In this way, the efficiency of clustering these time series is greatly improved since 44,849 ($\frac{300 \times 299}{2} - 1$) DTW calculations are avoided.
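The count of avoided DTW calculations in the toy example can be checked by hand: a full DTW distance matrix over 300 time series needs one calculation per unordered pair, whereas the two-step approach needs only the single DTW distance between the two seed cluster centers.

```latex
\[
\binom{300}{2} = \frac{300 \times 299}{2} = 44{,}850,
\qquad
44{,}850 - 1 = 44{,}849 .
\]
```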

IV. THE PROPOSED METHOD
We propose a novel time series clustering algorithm, MiniDTW, to minimize the DTW utilization ratio by dataset summarization with L1-norm distance. MiniDTW includes the following two steps: 1) Summarize the dataset with L1-norm distance as natural-shaped seed clusters, i.e., time series comprising small phase perturbations. 2) Discover the final clusters on the summarized dataset by merging seed clusters with the DTW distances among their centers. In the following, we detail the two steps of MiniDTW.

A. DATASET SUMMARIZATION WITH L1-NORM DISTANCE
To efficiently summarize time series comprising small phase perturbations as seed clusters with L1-norm distance, we take advantage of density-based clustering algorithms. For time series $X_i$, its density ($\rho_i$) with a distance threshold ($d_c$) is defined in Eq. (3). The density in Eq. (3) emphasizes the weight of time series with smaller phase perturbations, which have smaller distances.
Proof: Arbitrarily small phase perturbations $\lambda_Y(t)$ and ...
We define the center of a seed cluster as the time series with a local density peak to approximate the relative peak of PDF($\epsilon$) as follows:
We borrow the heuristic of DPC [31] to group time series with small phase perturbations into seed clusters, with the centers having the largest local densities (see Fig. 2 (d)). For $X_i$, we find a time series $n_i = X_j$ as follows:
Apparently, $X_i$ is the center of a seed cluster if $n_i$ does not exist, according to Definition 1. Then, the dataset is summarized as seed clusters in a two-step process: 1) assign a unique seed cluster label to each center; 2) spread the seed cluster label from the centers to time series that have lower densities, i.e., $X_i$ acquires its seed cluster label from its $n_i$.
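The two-step summarization described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the density uses a Gaussian kernel in the style of DPC (the exact kernel is an assumption), and $n_i$ is taken to be the nearest denser time series within the threshold $d_c$ (also an assumption). The names `seed_clusters` and `l1` are hypothetical.

```python
from math import exp


def l1(x, y):
    """L1-norm distance of two equal-length series."""
    return sum(abs(a - b) for a, b in zip(x, y))


def seed_clusters(series, d_c):
    """Return (labels, centers): one seed-cluster label per series, plus center indices."""
    n = len(series)
    dist = [[l1(series[i], series[j]) for j in range(n)] for i in range(n)]
    # Assumed Gaussian-kernel density: closer series contribute more weight.
    rho = [sum(exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # Rank by density (ties broken by index) so that "denser than" is a strict order.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank = {idx: r for r, idx in enumerate(order)}
    labels, centers = [None] * n, []
    for i in order:  # densest first, so n_i is always labelled before X_i
        denser = [j for j in range(n) if rank[j] < rank[i] and dist[i][j] < d_c]
        if not denser:
            # No denser neighbour exists: X_i is a seed-cluster center.
            labels[i] = len(centers)
            centers.append(i)
        else:
            # X_i inherits the label of its nearest denser neighbour n_i.
            n_i = min(denser, key=lambda j: dist[i][j])
            labels[i] = labels[n_i]
    return labels, centers


toy = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
labels, centers = seed_clusters(toy, d_c=1.0)
print(labels, centers)
```

Because labels are propagated from denser to sparser time series, a chain of close neighbours ends up in one seed cluster even when its two ends are far apart, which is the natural-shaped behaviour the text relies on.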

B. MERGE THE TIME SERIES SEED CLUSTERS
After the original dataset is summarized as seed clusters, we further merge the seed clusters into the final clusters, based on the DTW distances of their centers. Specifically, we merge the centers of seed clusters into final clusters, and then assign the non-center time series to the same clusters as their centers. Assume we merge τ seed clusters into K clusters by the DTW distances of the seed cluster centers. We propose a Sparse Symmetric Non-negative Matrix Factorization (SSNMF) algorithm to merge seed cluster centers, motivated by the non-negative and symmetric properties of the DTW distance matrix ($M \in \mathbb{R}^{\tau \times \tau}$). SSNMF is able to discover the latent structure of the relationships among the seed cluster centers. Specifically, in SSNMF, M is factorized as the product of two matrices with non-negative entries, i.e., $H \in \mathbb{R}^{\tau \times K}$ and $S \in \mathbb{R}^{K \times K}$ ($S = S^T$), as follows:
$$M \approx H S H^T$$
H is the feature matrix that represents the latent structure of data derived from M, and it also implies the center assignment, i.e., the ith center belongs to the jth cluster if $\arg\max_{1 \le k \le K} h_{ik} = j$. SSNMF imposes an $L_{1/2}$ sparse constraint on H to ensure that the assignment weights of each center to the K clusters are exclusive, i.e., each center has only a few non-zero weights. We choose the $L_{1/2}$ norm since it is differentiable and can produce sparser solutions than $L_1$ norm regularization [33]. Therefore, the cost function of SSNMF is defined as follows:
$$\min_{H \ge 0,\, S \ge 0} \; \|M - H S H^T\|_F^2 + \eta \|H\|_{1/2} \quad (6)$$
where $\|\cdot\|_F$ is the Frobenius norm, η is the weight of sparsity and $\|H\|_{1/2}$ is the $L_{1/2}$ sparse constraint of H defined as follows:
$$\|H\|_{1/2} = \sum_{i=1}^{\tau} \sum_{j=1}^{K} h_{ij}^{1/2}$$
The cost function in Eq. (6) is non-convex, and we obtain local minima by an iterative multiplicative updating process akin to [34]. Let $\Lambda \in \mathbb{R}^{\tau \times K}$ and $\Gamma \in \mathbb{R}^{K \times K}$ be the Lagrange multipliers subject to $\lambda_{ij} \ge 0$ and $\gamma_{ij} \ge 0$, respectively. The partial derivative of the Lagrangian with respect to H involves a matrix $A \in \mathbb{R}^{\tau \times K}$ with $a_{ij} = h_{ij}^{-1/2}$, which arises from the $L_{1/2}$ term. The partial derivative of the Lagrangian with respect to S is obtained in the same way. Using the Karush-Kuhn-Tucker conditions that $\lambda_{ij} h_{ij} = 0$ and $\gamma_{ij} s_{ij} = 0$, we obtain the final multiplicative updating rules for H and S, respectively. The rule for S is:
$$S \leftarrow S \odot \frac{H^T M H}{H^T H S H^T H} \quad (12)$$
where $\odot$ and the fraction denote element-wise multiplication and division.
We initialize H and S with random positive values, and the optimal H and S are obtained after the updates of H (Eq. (11)) and S (Eq. (12)) reach convergence. The pseudo code of seed cluster merging is demonstrated in Algorithm 1. At line 1, the DTW distance matrix of seed cluster centers (M ) is obtained. At lines 2-8, M is decomposed into H and S by the iterative updating process. The final clusters are obtained at lines 9-13.
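The iterative updating process can be sketched in plain Python. The update of S follows the multiplicative rule stated in the text. The update of H is an assumption: it is one multiplicative rule obtained by applying the Karush-Kuhn-Tucker conditions to the cost $\|M - HSH^T\|_F^2 + \eta \sum_{ij} h_{ij}^{1/2}$, and it may differ from the paper's Eq. (11) in constant factors. A fixed iteration count replaces the convergence test, and all names (`ssnmf`, `matmul`, `transpose`) are hypothetical.

```python
import random


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def ssnmf(M, K, eta=0.1, iters=100, eps=1e-9, seed=0):
    """Factorize a symmetric non-negative M (tau x tau) as H S H^T; return H, S, labels."""
    rng = random.Random(seed)
    tau = len(M)
    H = [[rng.uniform(0.1, 1.0) for _ in range(K)] for _ in range(tau)]
    S = [[0.0] * K for _ in range(K)]
    for a in range(K):                       # symmetric positive initialisation of S
        for b in range(a, K):
            S[a][b] = S[b][a] = rng.uniform(0.1, 1.0)
    for _ in range(iters):
        # H update (assumed form): H <- H * (M H S) / (H S H^T H S + (eta / 8) * H^(-1/2))
        MHS = matmul(matmul(M, H), S)
        HSHtHS = matmul(matmul(matmul(matmul(H, S), transpose(H)), H), S)
        H = [[H[i][k] * MHS[i][k]
              / (HSHtHS[i][k] + 0.125 * eta * max(H[i][k], eps) ** -0.5 + eps)
              for k in range(K)] for i in range(tau)]
        # S update as stated in the text: S <- S * (H^T M H) / (H^T H S H^T H)
        Ht = transpose(H)
        num = matmul(matmul(Ht, M), H)
        HtH = matmul(Ht, H)
        den = matmul(matmul(HtH, S), HtH)
        S = [[S[a][b] * num[a][b] / (den[a][b] + eps) for b in range(K)]
             for a in range(K)]
    # The i-th center is assigned to the cluster with the largest weight in row i of H.
    labels = [max(range(K), key=lambda k: row[k]) for row in H]
    return H, S, labels


# Hypothetical 4 x 4 distance matrix among four seed cluster centers.
M = [[0, 1, 5, 5],
     [1, 0, 5, 5],
     [5, 5, 0, 1],
     [5, 5, 1, 0]]
H, S, labels = ssnmf(M, K=2)
print(labels)
```

Both updates only multiply non-negative quantities, so H and S stay non-negative throughout; the small `eps` terms guard against division by zero once the sparsity term drives entries of H toward zero.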

C. TIME COMPLEXITY
The computation of the L1-norm distance matrix is $O(n^2 m)$, where n is the size of the dataset and m is the length of the time series. The complexity of finding time series seed clusters by DPC is $O(\frac{1}{2} n^2 + n(\varphi + \log n + 1))$, where $\varphi$ is the average number of neighbours. Thus the overall complexity of finding seed clusters is $O(n^2 (m + \frac{1}{2}) + n(\varphi + \log n + 1))$. In the seed cluster merging phase, the complexity of calculating the DTW distance matrix among the seed cluster centers is $O(\tau^2 m^2)$, where τ is the number of seed clusters. The iterative updating process to obtain the optimal H and S requires a complexity of $O(l \tau^2 K^4)$, where l is the number of iterations, and τ and K are normally far smaller than n. Therefore, the overall complexity of MiniDTW is approximately $O(n^2 m)$.

V. EVALUATION
In this section, we design the following experiments to compare the runtime efficiency and clustering accuracy of the proposed MiniDTW algorithm with the counterpart methods.
All the experiments are implemented with Python 3.2, and run on a Linux platform with a 2.6 GHz CPU and 132 GB of RAM.

A. EXPERIMENT SETUP
We use all the datasets in the UCR time series archive [10] for the evaluation. These datasets have different sizes (ranging from 40 to 16637) and different lengths of time series (ranging from 24 to 2709). Each dataset contains a training set and a testing set (with labels), and we use both for clustering. Since MiniDTW is proposed to accelerate time series clustering by reducing the DTW utilization ratio, TADPole [5] is the counterpart method most related to ours because it aims at accelerating time series clustering by pruning a fraction of DTW distance calculations based on faster DTW upper/lower (L1-norm/LB_Keogh [35]) bounds. SS-PrunedDTW [16] is another method that directly accelerates the DTW distance. KDBA [29] uses Kmeans for clustering and designs a DTW-adaptable center discovery method. Two state-of-the-art time series clustering algorithms, i.e., Kshape [19] and TS3C [22], are also included in the evaluation. We brief the counterpart methods as follows:
- TADPole uses a density-based clustering algorithm (DPC [31]) with the DTW distance measurement for clustering, and it accelerates the clustering process by pruning a significant proportion of DTW use with fast calculated lower/upper bounds of DTW.
- SS-PrunedDTW accelerates the DTW distance measurement by pruning out-of-bound search operations with an upper bound. Single-linkage hierarchical clustering is used to cluster time series with their efficiently calculated DTW distances.
- KDBA extends the conventional Kmeans clustering algorithm to support the use of the DTW distance measurement. KDBA adopts a DTW-adaptive center discovery method to ensure that the non-center time series in clusters have small DTW distances to the respective centers.
- Kshape is proposed to cluster time series invariant to time series phase shifting. Kshape measures the distance of two time series by a shape-based measurement (SBD). After time series that have phase shifting are re-aligned by SBD, Kshape discovers clusters by minimizing the distances of the re-aligned time series to the cluster centers.
- TS3C clusters time series by the temporal patterns of time series segments. TS3C first finds segment clusters for each time series, which contain time series segments of different lengths. Then, time series are represented by the segment clusters to discover the final clusters by hierarchical clustering.
We apply the above clustering algorithms on all the UCR time series datasets, and the clustering accuracy is measured by Rand Index (RI) following [19], [22]. RI penalizes false positive and false negative clustering results and is defined as follows:
$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of time series pairs that have the same ground truth label and are correctly clustered in the same cluster; TN is the number of pairs that have different labels and are correctly separated into different clusters; FP is the number of pairs that have different labels but are wrongly clustered in the same cluster; FN is the number of time series pairs that have the same label but are wrongly separated into different clusters. Note that RI ∈ (0, 1], and a higher RI means better clustering accuracy.
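The Rand Index can be computed by counting the four kinds of time series pairs directly. The following minimal sketch is an illustration; the function name `rand_index` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(truth, pred):
    """Rand Index of a clustering `pred` against ground truth labels `truth`."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1          # same label, same cluster
        elif not same_truth and not same_pred:
            tn += 1          # different labels, different clusters
        elif same_pred:
            fp += 1          # different labels, wrongly in the same cluster
        else:
            fn += 1          # same label, wrongly separated
    return (tp + tn) / (tp + tn + fp + fn)


print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))   # 1.0
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))   # 0.5
```

In the second call, the six pairs split into one TP, two TN, two FP and one FN, giving (1 + 2) / 6 = 0.5. Cluster identifiers do not need to match the ground truth labels, since only pair memberships are compared.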

B. ACCURACY ANALYSIS
Though this paper focuses on improving the efficiency of time series clustering, we first show that the acceleration of MiniDTW does not necessarily sacrifice clustering accuracy by comparing with TADPole, SS-PrunedDTW, KDBA, Kshape and TS3C. The warping window for DTW, which is used by MiniDTW, TADPole, SS-PrunedDTW and KDBA, is fixed at 5% in all the algorithms. SS-PrunedDTW, Kshape and KDBA only require the number of clusters; for TADPole and MiniDTW, we use grid search to find the optimal parameters. For TADPole, the optimal $d_c$ is obtained by a grid search ranging from 0.01 to 1 (multiplied by the largest DTW distance in the dataset), with a grid step of 0.01. For MiniDTW, the optimal $d_c$ is obtained in the same way as for TADPole, and the optimal η (the weight of sparsity) is searched among {0.1, 1, 10, 100, 1000}. We directly use the published accuracy results of TS3C [22].
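The parameter grid for MiniDTW can be enumerated with a short sketch. The grid values come from the text; the scoring callback `evaluate` is a hypothetical placeholder for running the clustering and returning its RI, and the names `DC_FRACTIONS`, `ETAS` and `grid_search` are illustrative.

```python
from itertools import product

# d_c candidates as fractions of the largest distance: 0.01, 0.02, ..., 1.00
DC_FRACTIONS = [round(0.01 * k, 2) for k in range(1, 101)]
# Candidate weights of sparsity (eta)
ETAS = [0.1, 1, 10, 100, 1000]


def grid_search(evaluate):
    """Return (best_score, best_dc_fraction, best_eta) for a user-supplied scorer."""
    return max((evaluate(dc, eta), dc, eta)
               for dc, eta in product(DC_FRACTIONS, ETAS))


# Stand-in scorer for illustration only; a real scorer would cluster and return RI.
best = grid_search(lambda dc, eta: -abs(dc - 0.25) - abs(eta - 10))
print(len(DC_FRACTIONS) * len(ETAS), best[1:])   # 500 (0.25, 10)
```

The grid contains 100 × 5 = 500 parameter combinations for MiniDTW, while TADPole, which has no η, needs only the 100 values of $d_c$.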

1) OVERALL CLUSTERING ACCURACY
We first compare MiniDTW with algorithms that use DTW distance for clustering, i.e., TADPole, SS-PrunedDTW and KDBA, in terms of clustering accuracy. KDBA is used as the baseline and the results are shown in Table 1. MiniDTW and TADPole, which are density-based time series clustering algorithms, perform better than KDBA (partition-based) on most datasets, while SS-PrunedDTW (hierarchical) achieves lower accuracy than KDBA on more than half of the datasets. Moreover, the average RI of MiniDTW is 0.7685, which improves on the accuracy of TADPole (average RI is 0.7322) and SS-PrunedDTW (average RI is 0.6479) by around 5% and 19%, respectively. MiniDTW is further compared with Kshape, TS3C and Kmeans, which do not use the DTW distance measurement for time series clustering. We use Kmeans as the baseline for comparison, and the results are shown in Table 2. Specifically, MiniDTW outperforms or equals Kmeans on most (80 out of 84) datasets, and achieves the highest average RI (0.7685). Kshape achieves slightly better clustering results than Kmeans, while TS3C performs the worst among the four algorithms and achieves accuracy lower than Kmeans on 57 datasets (out of 84). MiniDTW improves the accuracy of Kshape by 11% and that of TS3C by 16%.
The statistical comparison of MiniDTW and the counterpart algorithms is shown in Fig. 3. In general, the density-based clustering methods using DTW distance measurements, i.e., MiniDTW and TADPole, achieve better clustering accuracy than the other algorithms, while SS-PrunedDTW, which uses the same DTW distance values (but a fast calculated version) as MiniDTW and TADPole, performs the worst due to the use of hierarchical clustering. MiniDTW achieves the lowest rank score, that is, it statistically achieves the best clustering accuracy. Meanwhile, the hypothesis that these algorithms are significantly different is rejected by the Holm-Bonferroni method, and the horizontal lines connect algorithms that are not significantly different. Kmeans and KDBA both adopt the same strategy to group clusters and are not significantly different on these datasets. Similarly, SS-PrunedDTW and TS3C (both use hierarchical clustering) are not significantly different. Although the paper focuses on speeding up the clustering of time series, the above results show that the clustering accuracy of MiniDTW is at least comparable with state-of-the-art time series clustering algorithms.

2) CASE STUDY
We use the Italy.Dem. dataset as a real-world example to show how MiniDTW effectively clusters time series with phase perturbations. In contrast to the toy example in Fig. 2, where MiniDTW approximates the bimodal distribution with two seed clusters, the complex distribution of the ItalyPowerDemand dataset is approximated as a combination of multiple unimodal distributions (seed clusters), as shown in Fig. 4. Italy.Dem. contains two classes of time series that represent the Italy power consumption in winter and summer, respectively, as shown in Fig. 4 (a). MiniDTW first discovers several seed clusters, visualized by Multidimensional Scaling [32], as shown in Fig. 4 (b). As discussed in Section 4.1, we use the density (defined by L1-norm) of time series to approximate the PDF of phase perturbation. Fig. 4 (c) shows that two example seed clusters ($S_1$ and $S_2$) effectively group time series with small phase perturbations, and each seed cluster uses the density distribution to approximate a unimodal distribution. For example, the center time series in $S_1$ has the largest density, while the densities of the remaining time series gradually decrease with larger phase perturbations (relative to the center time series). With these fine seed clusters, the final merged clustering results (Fig. 4 (d)) show that MiniDTW correctly clusters most time series and achieves the highest accuracy (RI = 0.8134) among the compared methods.

3) VARIANTS OF SEED CLUSTER MERGING ALGORITHMS
To understand the effectiveness of the proposed SSNMF algorithm in MiniDTW for seed cluster merging, we replace SSNMF with other variants for comparison. We develop a MiniDTW-HAC method that uses hierarchical clustering (complete linkage), and MiniDTW-L1 and MiniDTW-L2 methods that use NMF with an L1/L2 constraint [36] for seed cluster merging. The optimal clustering results of MiniDTW-HAC, MiniDTW-L1 and MiniDTW-L2 on UCR time series datasets are obtained using the same grid search as MiniDTW. The results in Fig. 5 show that MiniDTW achieves statistically better clustering results than MiniDTW-HAC, MiniDTW-L1 and MiniDTW-L2. Meanwhile, the results also show that the NMF-based methods all perform better than the seed cluster merging method using hierarchical clustering.

C. EFFICIENCY ANALYSIS
For the convenience of demonstration, we generate one synthetic dataset similar to [37] and choose six UCR datasets of different sizes and time series lengths to compare the efficiency of MiniDTW. The statistics of the UCR datasets are shown in Table 3. We specifically compare MiniDTW with TADPole and SS-PrunedDTW, because they all aim at accelerating time series clustering using DTW. The calculation of the LB_Keogh lower bound matrix (for TADPole) is regarded as setup time, as in [5].

1) PERFORMANCE ON UCR DATASETS
The running time results, obtained under the optimal parameters that provide the best accuracy in the first experiment, are shown in Fig. 6. The results show that both methods that reduce the usage of DTW, i.e., TADPole (using lower/upper bound pruning) and MiniDTW (summarizing datasets with L1-norm distance), are more efficient than SS-PrunedDTW, which accelerates individual DTW calculations. Moreover, MiniDTW and TADPole are further accelerated by replacing the DTW they use with the more efficient SS-PrunedDTW.
Overall, MiniDTW is around 10 times faster than TADPole on the six datasets. We further compare the DTW utilization ratios of MiniDTW and TADPole to show why MiniDTW is more efficient. The DTW utilization ratio of an algorithm is defined as $\frac{2x}{n(n-1)}$, where x is the number of DTW calculations the algorithm performs and $\frac{1}{2} n(n-1)$ is the baseline (the full DTW distance matrix). The results of DTW utilization ratios are shown in Table 4. The average DTW utilization ratio of MiniDTW (1.58%, i.e., 98.42% of DTW calculations are avoided) is one order of magnitude smaller than that of TADPole (24.44%). This observation roughly explains why MiniDTW is much faster than TADPole (due to the quadratic complexity of the DTW calculation). Moreover, MiniDTW requires less than 1% of the DTW calculations on four datasets, while TADPole has a DTW utilization ratio of more than 10% on most datasets.
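The DTW utilization ratio defined above is simple to compute once the number of DTW calls has been counted. In this sketch the function name `dtw_utilization_ratio` is illustrative, and the example reuses the 300-series toy dataset from Section III, where a single DTW calculation between two seed cluster centers is needed.

```python
def dtw_utilization_ratio(num_dtw_calls, n):
    """Fraction of the n(n-1)/2 pairwise DTW calculations actually performed."""
    total_pairs = n * (n - 1) // 2   # size of the full DTW distance matrix baseline
    return num_dtw_calls / total_pairs


# Toy dataset: 300 time series, one DTW calculation between two seed centers.
ratio = dtw_utilization_ratio(1, 300)
print(f"{ratio:.6%} used, {1 - ratio:.4%} avoided")
```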
We further conduct a convergence analysis of SSNMF, which merges the seed clusters in MiniDTW; the results are shown in Fig. 7. The proposed SSNMF converges quickly on all datasets, i.e., fewer than 25 iterations are required for most datasets, and this fast convergence also contributes to the runtime efficiency of MiniDTW. Specifically, MiniDTW on the Car dataset requires only 4 iterations to reach convergence. This fast convergence is attributed to the small number of seed clusters, which determines the size of the factorized matrix.

2) PERFORMANCE ON SYNTHETIC DATASET
We compare MiniDTW with TADPole and SS-PrunedDTW on synthetic datasets that comprise different levels of phase perturbations. We generate 19 synthetic datasets using the method in [37]; each dataset contains 100 time series (length = 50) of two classes, which have sinusoidal and rectangular shapes, respectively. We use phase shift as an example of phase perturbation due to its pervasiveness. A phase shift that is randomly drawn from a normal distribution is added to each time series. Different datasets draw the phase shift intensity from different distributions; these distributions have mean values of 0 and variations of {0.1, 0.2, . . . , 1.9}, one for each of the 19 datasets. Gaussian noise (µ = 0.1 and δ = 0.5) is further added to each time series. The two datasets that have the smallest and the largest phase shift intensities are shown in Fig. 8 (a) and (b), respectively. The running time and DTW utilization ratio results are obtained using the same grid search as in the above experiments. The running time results in Fig. 8 (c) show that MiniDTW and TADPole (which reduce the DTW utilization ratio) consistently run faster than SS-PrunedDTW (which accelerates the DTW distance calculation). Meanwhile, the running time of MiniDTW increases with the phase shift intensity but is smaller than that of TADPole before the intensity becomes too large (≥ 1.6); this trend is consistent with the trend of the DTW utilization ratio shown in Fig. 8 (d). Specifically, the DTW utilization ratio of MiniDTW increases with the phase shift intensity and becomes larger than that of TADPole, which fluctuates around 20%, when the intensity exceeds 1.6.
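The data generation described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of [37]: the phase shift is treated as an angle in radians, the "variation" of each distribution and the noise parameter δ are both interpreted as standard deviations, the two classes are assumed to be balanced, and the names `make_series` and `make_dataset` are hypothetical.

```python
import math
import random


def make_series(kind, length, phase_shift, rng, noise_mean=0.1, noise_sd=0.5):
    """One period of a sine or rectangular wave, phase-shifted, plus Gaussian noise."""
    series = []
    for t in range(length):
        angle = 2.0 * math.pi * t / length + phase_shift
        wave = math.sin(angle)
        if kind == "rect":
            wave = 1.0 if wave >= 0.0 else -1.0  # rectangular wave with the same phase
        series.append(wave + rng.gauss(noise_mean, noise_sd))
    return series


def make_dataset(shift_spread, n=100, length=50, seed=0):
    """Build n series, alternating sinusoidal and rectangular, with random phase shifts."""
    rng = random.Random(seed)
    data, labels = [], []
    for i in range(n):
        kind = "sine" if i % 2 == 0 else "rect"  # assumption: balanced classes
        shift = rng.gauss(0.0, shift_spread)  # assumption: spread is a standard deviation
        data.append(make_series(kind, length, shift, rng))
        labels.append(kind)
    return data, labels


# One dataset per phase-shift level 0.1, 0.2, ..., 1.9 (19 levels in total).
levels = [round(0.1 * k, 1) for k in range(1, 20)]
datasets = {level: make_dataset(level, seed=k) for k, level in enumerate(levels)}
```

Fixing the seed per level keeps each dataset reproducible, which is useful when the same data must be fed to several clustering methods.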

D. SCALABILITY ANALYSIS
We use the StarLightCurves dataset, which is the largest dataset in the UCR time series archive in terms of both dataset size (n = 9236) and time series length (m = 1024), to compare the scalability of MiniDTW and TADPole with respect to different dataset sizes and time series lengths. For fairness, we use the same $d_c$, i.e., 0.2 multiplied by the largest DTW distance in the dataset, for both MiniDTW and TADPole, and set η = 100 for MiniDTW to produce near-optimal clustering accuracy on the StarLightCurves dataset. The calculation of the LB_Keogh lower bound matrix (for TADPole) is regarded as setup time, as in the previous efficiency analysis [5].
To compare MiniDTW with TADPole for clustering datasets of different sizes, we generate 9 subsets, whose sizes vary from 1000 to 9000, by randomly selecting time series from the StarLightCurves dataset. The running time results in Fig. 9 (a) show that MiniDTW is more scalable than TADPole on large datasets. Even though MiniDTW and TADPole achieve similar running times on the smallest dataset (n = 1000), the running time of MiniDTW increases much more slowly than that of TADPole on larger datasets. We further compare MiniDTW and TADPole using 7 subsets that each contain 1000 time series of a different length, i.e., from 200 to 800, which are segments of the time series in the StarLightCurves dataset. As shown in Fig. 9 (b), the running time of MiniDTW is nearly linear in the length of the time series, while that of TADPole is quadratic, since TADPole performs more quadratic-cost DTW calculations during clustering. Therefore, MiniDTW is more scalable than TADPole on large datasets.
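The gap between linear and quadratic growth in the time series length can be illustrated with a minimal sketch that contrasts the L1-norm distance with a textbook, unconstrained DTW. This is not SS-PrunedDTW or the implementation used in the experiments; the absolute-difference local cost, the cell counter, and the toy series are assumptions for illustration.

```python
def l1_distance(a, b):
    """Manhattan distance between two equal-length series: a single pass, O(m)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(a, b):
    """Unconstrained DTW with absolute-difference cost.

    Returns (distance, number of cost-matrix cells filled); the cell count is
    len(a) * len(b), which is what makes DTW quadratic in the series length.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cumulative cost matrix
    cells = 0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend the cheapest of the three admissible warping steps.
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
            cells += 1
        prev = curr
    return prev[-1], cells


# Hypothetical toy series: b is a copy of a's shape delayed by one step.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1]
print(l1_distance(a, b))   # 4
print(dtw_distance(a, b))  # (1.0, 25)
```

In this toy case DTW absorbs most of the one-step shift that the L1-norm distance penalizes, but it fills 25 cells for series of length 5; doubling the length quadruples the number of cells, while the L1-norm work only doubles.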

VI. CONCLUSION
We propose a novel MiniDTW algorithm, which minimizes the DTW utilization ratio by applying DTW to summarized datasets (obtained with L1-norm distance), to accelerate time series clustering. MiniDTW first uses density-based clustering with L1-norm distance to efficiently summarize the datasets as natural-shaped seed clusters, which contain time series with small phase perturbations, and then forms the final clusters by merging the seed clusters using an effective SSNMF decomposition of the DTW distance matrix of the seed cluster centers. The experiments conducted on the UCR time series datasets show that MiniDTW reduces DTW utilization by 98.52% and is better than its counterpart, TADPole, which reduces it by only 75.56%; thus, MiniDTW is 10 times faster than TADPole, without sacrificing clustering accuracy.

Before he joined CSIRO, he worked in industry (Philips Research Laboratory, USA, and IBM, Poughkeepsie, NY, USA) and universities (the Chinese University of Hong Kong, the National University of Singapore, and Tsinghua University) for more than 20 years. He has published more than 260 international journal and conference papers and edited ten books; he also holds six U.S. patents. His research interests include cybersecurity, behavior modeling, knowledge graph, data engineering and analytics, cloud and service computing, social computing, the Internet-of-Things, and distributed computing.