Time Adaptive Optimal Transport: A Framework of Time Series Similarity Measure

Similarity measure is a critical tool for time series analysis. However, currently established methods, for instance, dynamic time warping (DTW) and its variants, are still facing some issues such as non-maximum-to-maximum alignment and pathological alignment, etc. Despite many attempts to improve, these issues remain stubborn because they are directly caused by the intrinsic mechanism of DTW. Thinking out of the context of DTW based methods, we propose in this article a new time series similarity measure framework which we call Time Adaptive Optimal Transport (TAOT). As its name implies, TAOT is based on optimal transport, a powerful distance measure for histograms and probability distributions, and TAOT inherits several promising properties from optimal transport to tackle the problems of classic DTW based methods. We make optimal transport capable of handling time series data by considering both observed values and their corresponding time coordinates simultaneously. TAOT can generate a many-to-many alignment between time series that further releases the search space for a more correct result. Experimental results show that TAOT can outperform other widely used similarity measures on classification tasks on multiple datasets. We also introduce the parameter extracting and visualization strategies of TAOT in this article.


I. INTRODUCTION
Similarity measure is fundamental to many machine learning and data mining tasks, such as classification [1]- [4], clustering [5]- [7], indexing [8]- [11], etc. In the context of time series analytics, similarity measure has long been a research hotspot and many methods have been proposed [2], [12]- [16]. These similarity measures can be roughly categorized into: time-rigid measures (Euclidean distance), time-flexible measures (dynamic time warping) [15], [17], feature-based measures (Fourier coefficients) [18]- [20], and model-based measures (auto-regression and moving average model) [14], [21]. Among these methods, dynamic time warping (DTW) [17] and its variants [22]- [27] are probably the most popular and established ones [12], [21], [28]. However, this kind of DTW-based methods still exhibit several intrinsic alignment problems: a) Pathological alignment: A critical job for similarity measure of time series is to tackle time distortions. DTW The associate editor coordinating the review of this manuscript and approving it for publication was Dominik Strzalka . screens most time distortions by realigning data points in the two time series to be compared, and that helps DTW outperform Euclidean distance in many scenarios [28]. However, the intrinsic rules used by DTW to generate the new alignment can lead to pathological alignment, where a single point in one time series links to a large subsection of another time series [23], as shown in Fig. 1(a). In order to avoid this undesirable alignment, various constraints for DTW [29]- [32] have been proposed, for instance, Sakoe-Chiba band and Itakura parallelogram. But most of these constraints are rigid such that they take a risk of preventing the correct alignment from being generated. b) Pairing of maximum values: In some applications, the maximum value of a time series is of overwhelming importance, and thus it is decisive to guarantee that the maximum value of one time series aligns to the maximum value of another time series. Unfortunately, DTW and its existing variants provide no such guarantee, and to some extent this requirement even conflicts with the underlying mechanism of DTW. A mandatory pairing of maximum values can break the time-monotonicity of DTW and prevent DTW from finding the optimal alignment with the lowest global cost. Other common similarity measures, such as Euclidean distance and Minkowski distance, can only succeed in a certain situation when two peaks happen at the same time point, because they obey a time-rigid alignment rule. c) Many-to-many alignment: From the perspective of typology, there should be three categories of alignments between time series, one-to-one, one-to-many, and manyto-many, as illustrated in Fig. 1(b-c-d) respectively. Among the three categories, the potential of one-to-one alignment and one-to-many alignment has already been exploited; for instance, time-rigid methods employ one-to-one alignment and DTW-based methods employ one-to-many alignment. However, the use of many-to-many alignment is still an open direction. In theory, many-to-many alignment has a larger capacity and a higher flexibility. Therefore, many-to-many alignment based methods deserve to be developed. d) Privilege of time ordering: The similarity between time series is measured mainly from two aspects: observed values and time ordering. To some extent existing methods always consider time ordering in the first place. For instance, DTW first sets a rule of time ordering (continuity, monotonicity, etc), and then finds the minimum accumulated cost of observed values under the given time ordering rule. Euclidean distance simply follows a time-rigid ordering. This privilege of time ordering narrows the search space and as a result forfeits the opportunity to find a more correct alignment. Instead, a new method that considers observed values and time ordering simultaneously may further release the capacity and lead to a more accurate similarity measure.
A review of literature shows a lack of method effectively responding to the aforementioned issues, and one reason is that these alignment problems are closely related to the core mechanism of DTW. Thinking outside the context of classic DTW-based methods, we find Optimal Transport (OT) [33], also known as the Earth Mover's Distance (EMD) [34] or Wasserstein Distance, is a successful method to compare probabilities or histograms [35]- [37], and theoretically it has several promising properties [38] to solve the aforementioned alignment problems of DTW. For example, OT does not rely on the time order of data points and it always aligns maximum values with each other. In the past the utility of OT was greatly limited by its high computational complexity, but fortunately this issue has been largely alleviated by one of its variants, Sinkhorn distance [38], which transfers OT into a problem with a fast solution by adding an entropic regularization term and ends up making the computation several orders of magnitude faster. Another barrier to the use of OT on time series data is that OT only considers the observed values while ignores the corresponding time coordinates.
In this article, we propose a time-adaptive version of OT (TAOT) to serve as a new similarity measure framework between time series. We make Sinkhorn distance capable of handling time series data by considering both observed values and their corresponding time coordinates simultaneously. Classification experiments on multiple real-world and synthetic datasets are conducted to show the accuracy of TAOT and how the aforementioned alignment problems of DTW are coped with. Other closely related topics such as the parameter extracting strategies and the visualization of TAOT will also be discussed.
The rest of this article is structured as follows. Section II briefly revisits related works on dynamic time warping (DTW), optimal transport (OT), and their major variants. Section III details the proposed new similarity measure, namely TAOT. Section IV evaluates the performance of TAOT with classification experiments on UCR time series datasets and discusses several practical issues. Finally, Section V summarizes our main conclusions.

II. BACKGROUND
A. DYNAMIC TIME WARPING DTW was first introduced in the field of speech recognition in the influential work by [17] and soon become widely used because it can cope with time distortions effectively. DTW aims at finding the optimal alignment between time series that can achieve the minimum accumulated cost. Given two time series, A = a 1 , a 2 , . . . , a i , . . . , a m and B = b 1 , b 2 , . . . , b j , . . . , b n , DTW first constructs an m-by-n cost matrix D. Each matrix cell (i, j) contains the distance d(i, j) between the two data points a i and b j . Squared Euclidean distance is normally used when calculating the cost matrix, and thus d(a i , b j ) = (a i − b j ) 2 . Then DTW tries to find the optimal alignment that leads to the minimum accumulated cost. The alignment is represented by a warping path W = w 1 , w 2 , . . . , w k , . . . , w K that consists of a continuous set of matrix cells. Each matrix cell, for instance, w k = (i, j) corresponds to a pairing of points a i and b j , and the cost between them d(i, j) is added to the final accumulated cost. Fig. 2 shows an example of the cost matrix and warping path. Eq. 1 defines the above-described DTW distance: DTW satisfies three time ordering rules: boundary condition, continuity, and monotonicity [23]. The boundary condition restricts the warping path to start at the lower-left corner of the warping matrix w 1 = (1, 1) and finish at the upper-right corner w K = (m, n). The continuity restricts the allowable steps to adjacent cells. The monotonicity prevents all steps in the warping path from turning backward in any circumstances.
In practice, DTW distance can be efficiently calculated by the following recursive formula: is the accumulated cost found in current cell (i, j) and the final DTW distance is dtw(m, n).

B. VARIANTS OF DTW
Despite its popularity, DTW was proposed decades ago and its pathological alignment problem has long been noted. In order to alleviate the problem, different variants of DTW have been proposed and we briefly review them here. These methods can be classified into two major categories.
The first category sets constraints on DTW. The most common and straightforward constraint is windowing [29]- [32], where allowable elements of warping path must be located within a warping window, as illustrated by the dashed line in Fig. 2. Different warping windows have been proposed, among which Sakoe-Chiba band and Itakura parallelogram are two most commonly used ones. In recent years, some researchers suggest to learn adaptive-shaped windows [39], [40] from the data instead of fixed-shaped windows. Another frequently used type of constraints is weighting. For example, Weighted DTW (WDTW) [24] is a representative method of weighting which applies different weights to temporally adjacent points when computing DTW. By penalizing further points, WDTW prevents minimum distance distortions caused by outliers and enhances the detection of similarity between two time series. Besides windowing and weighting, recently proposed LDTW [27] constrains the maximum allowable length of the warping path based on the observation that pathological alignment always happens concurrently with an unusually long warping path. SP-DTW and its kernelization version SP-KrDTW [41] address the sparsification of the alignment path search space for DTW-like measures to improve their efficiency without loosing accuracy.
The second category replaces the feature DTW considers. For example, Piecewise DTW (PDTW) [42], [43] proposes to use a compact abstraction instead of the raw data to compute DTW, with the aim of avoiding the impact of outliers and accelerating the computation. PDTW first splits time series into fixed-length segments and compute the mean of each segment. Then these mean values are used instead of raw data points. A challenge of PDTW is the choice of the optimal size of segments. To avoid brute-force search, Parameter Free Piecewise DTW (FDTW) [26] proposes a heuristic search of the size of segments for PDTW on the basis of classification accuracy. Derivative DTW (DDTW) [23] considers the local derivative of each data point rather than the raw value. Many variants of DTW can have a derivative version, for instance, WDTW and Weighted Derivative DTW (WDDTW) [24]. SC-DTW [25] views time series as 2D contours and employs shape context, a rich shape descriptor, to compute DTW. Local feature based DTW (LFDTW) [20] proposes a general framework to use different type of local features to generate the warping path.

C. OPTIMAL TRANSPORT
In the context of machine learning, OT [33], [34](also known as Earth Mover's Distance (EMD) or Wasserstein Distance) has long been a powerful tool to compare probabilities or histograms [37], [44], [45]. OT is modeled as the solution to the transportation problem. Suppose there are a collection of mines mining iron ores, and a collection of factories that consume the iron ores. Given the amount of supply and demand of each mine and factory, and the shipment cost from each mine to each factory, OT can find the optimal allocation plan with a minimum total shipment cost for resolving the supply-demand transports.
Given two probability distributions denoted as: An example of optimal transport. We can observe that OT generates a many-to-many alignment.
where a i or b i is the i-th observed value in the respective distributions and p is its corresponding probability value. Let then the OT distance can be defined as: where 1 d is the vector of ones with a dimension of d, and U (A, B) contains all possible joint probabilities of A and B, whose row and column sums to p A and p B respectively. Fig. 3 illustrates an example of OT. The optimal transport plan P , is thus defined as: OT is a special case of the linear programming (LP) problem whose worst case time complexity is O(d 3 logd) [46]. Although the acceleration of OT has attracted a large amount of research interest and various algorithms have been proposed, the computational overhead remains high (supercubic). Recently a variant of OT, Sinkhorn distance [38], successfully makes OT several orders of magnitude faster, which greatly enhances the utility of OT. Sinkhorn distance adds an entropic regularization to the classic OT formula and then enforces a simple structure on the optimal transport matrix, as defined by the following equation: where λ is the regularization coefficient. As λ increases Sinkhorn distance converges to the classic OT distance. When M itself is a metric matrix, Sinkhorn distance satisfies all the three distance axioms, including the symmetry and the triangle inequality. As its name implies, Sinkhorn distance can be computed by Sinkhorn's fixed point iteration [47], [48].
To converge to the optimal transport matrix P λ , one only needs to iterate Sinkhorn's update a sufficient number of times. The algorithm can be even faster with GPUs because it supports 1-vs-N mode where the distances against multiple targets are computed simultaneously. The 1-vs-N mode turns vector-by-matrix operations into matrix-by-matrix operations, which leads to a higher computing density and makes the algorithm more suitable for parallel computing. A flaw of Sinkhorn distance is that sometimes it will be constrained by machine-precision limit when λ increases beyond a problem-dependent value λ max beyond which some elements of e −λM are represented as zeroes.

III. TIME ADAPTIVE OPTIMAL TRANSPORT (TAOT)
The motivation of TAOT is to migrate OT from probability distributions (one-dimensional) to time series data (twodimensional), which means except for the difference in observed values, TAOT also need to capture the difference in time coordinates. An intuitive solution is to consider both observed values and their corresponding time coordinates when calculating the cost between data points in respective time series. In this setting, the one dimensional data point, for example a i , is extended by another time dimension, (a i , t i ), and then a straightforward cost matrix can be defined as where w is a weight parameter to balance the two parts.
Originally the OT algorithm copes with probability distributions and each input value must have its corresponding probability. Here we assume all observed values of a time series share an equal probability based on the fact that these values come from independent observations. As a result, the input time series can be denoted by: Since each observed value shares an equal probability, the Sinkhorn iteration [38] can be further simplified. Algorithm 1 summarizes the simplified TAOT algorithm. Basically it contains two phases, where it first prepares the cost matrix and then iterates the simplified Sinkhorn update until the stopping criterion or the maximum iteration number is reached. Note that the time coordinates are normalized into z-score before calculating the cost matrix M . We can get not only the TAOT distance from the algorithm, but also the optimal transport plan, which is significant to the visualization of the results and the execution of other analysises.
The general idea of TAOT is to integrate temporal dissimilarity between time series to the final distance measure. For Sinkhorn distance based algorithm, the temporal dissimilarity is usually contained in how we construct the cost matrix, and we can observe from Algorithm 1 that any kind of cost matrix can easily fit into this algorithm framework and lead to a new distance measure. In this article we just employ the most straightforward way to construct our cost matrix. Other definitions of cost matrix is still an open direction for future work.
As for the computational efficiency, given two time series of length N , the time complexity of Euclidean distance   is O(N ). The time complexity of DTW-based methods, for example, DTW, derivative DTW, or weighted DTW, is O(N 2 ). The time complexity of TAOT is relatively tricky because it is not as straightforward as the Euclidean distance or most DTW-based methods. TAOT involves a stopping criterion for its iteration and the criterion is calculated based on the current state of the iteration. According to a study on the complexity of multimarginal optimal transport [49], the time complexity of TAOT is approximately O (N 2 log N ). Note that TAOT is based on Sinkhorn distance [38], which is a faster version of optimal transport. The naive optimal transport is a computational heavy method, whose cost of computing scales at least in O (N 3 log N ).

IV. EXPERIMENT A. 1NN ACCURACY REPORT
In this section, we evaluate the classification accuracy of TAOT on 63 datasets of the UCR time series classification archive [50], which has the largest and most widely used collection of public benchmark datasets of time series. These datasets include both real-world time series and synthetic time series from various application domains. The length of time series varies between 60 (Synthetic Control) and 1639 (CinC_ECG_torso), and the number of classes varies between 2 (Gun-Point) and 60 (ShapesAll).
In order to ensure fair comparisons between different similarity measures, we employ the one nearest neighbor classifier (1NN), because the classifier itself does not involve any parameter, and thus the accuracy depends only on the similarity measure. Six most well-established methods, Euclidean distance, DTW, DTW with Sakoe-Chiba constraint, DDTW, WDTW, and FDTW are selected for comparison. Table 1 summarizes the 1NN classification error rate of TAOT and other competing methods on all datasets. If any parameter is involved, for instance, the width of Sakoe-Chiba band, we choose the optimal one found by cross validation on the training set. From these error rates we can observe that in general TAOT outperforms all the other measures on 28 datasets (Synthetic Control, Swedish Leaf, 50Words, ECG200, Adiac, etc), and achieves leading accuracy together with one or more other measures on another 7 datasets (Wafer, Face(four), Lightning-7, Plane, Coffee, ECGFiveDays, and BirdChicken).
TAOT encounters the aforementioned machine precision limit problem on four datasets (Face(all), Lighting2, Fish, LargeKitchenAppliances) of the original UCR time series classification archive, and therefore there is no accuracy report on these datasets. This is harmless for the classification task since we can find the problem during the early training phase and then change for another alternative method. Fig. 4(a-b-c-d-e-f) shows the pairwise comparisons between TAOT and the six competing methods, respectively. In Fig. 4, each dot represents a dataset, whose x-coordinate and y-coordinate are respectively the accuracy generated by the competing method and TAOT. In this setting, a dot falling above, on, or below the diagonal indicates that the proposed TAOT outperforms, ties with, or lose to the competing method. The numbers of dots in each different region are labeled on these figures. We can observe that TAOT achieves a better performance in all of the six pairwise competitions. To demonstrate whether TAOT is statistically significant different from other methods, the p-value generated from Friedman test is shown the in lower-right corner of each figure. The p-value ranges from 0 to 1, and a smaller p-value indicates a more significant difference between methods. Note that all these p-values are smaller than 0.05 and it proves that TAOT performs significantly better than the six competing methods.

B. EXTRACTING RELIABLE PARAMETERS FOR TAOT
TAOT involves two parameters: the regularization coefficient λ and the time weight w. To find a reliable combination of λ and w, we employ grid-search based leave-one-out cross validation on the training set. We choose leave-one-out cross validation over hold-out or k-fold validation because sometimes the training set is too small to further divide. Typically, grid-search is considered to be time-consuming, but for TAOT the search spaces of λ and w are both small and thus the efficiency is acceptable.
For w, if it is too large, for example w > 10, the difference in time coordinates will predominate. Contrarily, if w is too   small, for example w < 0.1, the difference in observed values will predominate. Either way we lose the balance. Therefore we set the search space of w to a moderate range, [0.1, 10]. Given a uniform step size to the integer and the fraction part respectively, our candidates for w are 0.1, 0.2, . . . , 0.9 and 1, 2, . . . , 10.
For λ, recall that as it increases, Sinkhorn distance converges to the classic OT distance. Empirically, 200 is large enough to make Sinkhorn distance close to the classic OT distance, and 5 is strict enough for regularizing. So we set the search space of λ to [5,200] with a fixed step size of 5.
During grid-search, each pair of λ and w results in an error count, error(λ, w). There can often be multiple pairs of parameters that all achieve the lowest error count, and these optimal parameters on the training set may not perform as well on the testing set. Therefore we develop another strategy to decide the sole and most adaptable pair of parameters. For λ, the most adaptable one should achieve the lowest cumulative error count given every different w, as defined by the following Eq. 8. If there are ties, we choose the smallest λ because it leads to a faster computation.
After λ is fixed, the best w should be the one to have the lowest error count when pairing with the λ , as defined by the following Eq. 9. If there are ties, we choose the median. Fig. 5 shows the real testing error counts of different w candidates that all achieve the lowest training error count. We can observe that the medians of those w candidates achieve the most stable performance in general on the three datasets.

C. THE TEXAS SHARPSHOOTER FALLACY
To further demonstrate the capability of TAOT, we will test whether TAOT is capable of overcoming the Texas Sharpshooter Fallacy. The Texas Sharpshooter Fallacy is a common logic error that happens when comparing methods on multiple datasets. A pervasive scenario is that as long as a method can win on some datasets, the author then claim the method is valuable. Because the author think since the method can win on the datasets from some domains, it could be useful in those domains. However, it is not enough to have a method that can be more accurate on some datasets unless you can tell ahead of time that on which datasets it will be more accurate [51]. One way to show whether a method can tell in advance that it can be more accurate on a certain dataset is to test whether the expected accuracy gain over another competing method coincides with the actual accuracy gain. The expected accuracy gain and the actual accuracy gain are defined by Eq. 10 and Eq. 11, respectively. In our setting, the expected accuracy gain is based on the best training accuracy during leave-one-out cross validation, and the actual accuracy gain is based on the best testing accuracy. An expected accuracy gain larger than one indicates that we predict TAOT will perform better, while an actual accuracy gain larger than one indicates that TAOT indeed performs better. expected gain = training accuracy(TAOT ) training accuracy(competing method) (10) actual gain = testing accuracy(TAOT ) testing accuracy(competing method) (11) As shown in Fig. 6, Texas sharpshooter plot is a convenient tool to visualize the comparison between the expected accuracy gain and the actual accuracy gain on multiple datasets. Each point represents a dataset to test and each dataset falls into one of the following four possibilities: • TP(True Positive): In this region we predicted TAOT would increase accuracy, and TAOT did. Obviously, this is the most beneficial situation for TAOT and the majority of points fall into this region.
• TN(True Negative): In this region we predicted TAOT would decrease accuracy, and TAOT did. This is not a bad case. Because if we know ahead of time that TAOT will do worse, we can choose another method to avoid the loss of accuracy.
• FN(False Negative): In this region we predicted TAOT would decrease accuracy, but the accuracy actually increased. This is also not a bad case. We might miss the opportunity to improve, but we will not do worse.
• FP(False Positive): In this region we predicted TAOT would increase accuracy, but the accuracy actually decreased. This is the truly bad case. But not many points fall into this region.

D. THE PAIRING OF MAXIMUM VALUES
One motivation of TAOT is that we want global maximum values to pair with each other if such a pairing is critical for the current problem. Classic OT, which is a special case of TAOT when w = 0, can guarantee the pairing of global maximum values in theory. As for TAOT, if this pairing is decisive, then normally we will get a relatively small w from the training phase, and that will lead to a high possibility of generating a maximum-to-maximum alignment. Fig. 7 gives an example to visually demonstrate that TAOT is more likely to generate the required maximum-to-maximum alignment than the other currently established methods.

E. VISUALIZATION OF ALIGNMENTS GENERATED BY TAOT
The many-to-many alignment generated by TAOT, namely the transport plan, is usually represented by a weight matrix. As the most established and popular similarity measure of time series, DTW visualizes its output alignment with a warping path in a warping matrix, as shown in the previous Fig. 2.
In order to keep consistency with DTW, the visualization of TAOT is still based on a warping matrix, but with a fuzzy warping path or a heat map of the warping path. Fig. 8 illustrates the fuzzy version of a warping path where each matrix cell corresponds to a pairing between the two points respectively indicated by the x-coordinate and y-coordinate of the cell, and we use the transparency of a matrix cell to represent the weight of the pairing. In this setting, a more conspicuous area indicates more intense similarity and vice versa. Note that although we use an alignment path matrix here, TAOT does not have a real alignment path like other DTW-based methods. TAOT is based on optimal transport and the values of all transport matrix cells are updated simultaneously during the calculation instead of a step-by-step motion.

V. CONCLUSION
In this article, we proposed a new time series similarity measure framework entitled Time Adaptive Optimal Transport (TAOT). It inherits the advantages of several promising properties from the optimal transport (OT) that can greatly alleviate some long-suffered issues with currently established time series similarity measures. To make OT capable of handling time series data, TAOT takes into consideration not only the difference in observed values but also the difference in time coordinates. To make the efficiency of TAOT acceptable, the calculation of TAOT is based on Sinkhorn distance, a very fast variant of OT. TAOT further simplifies the Sinkhorn iteration by assuming that all observed values of a time series share a equal probability. The performance of TAOT was demonstrated by a series of one nearest neighbor classification experiments on multiple public datasets. Compared with other currently established methods, TAOT exhibits a better accuracy and it even has the ability to predict whether there will be an accuracy improvement on the majority of experimental datasets. We also introduced a grid-search based parameter extracting strategy and a fuzzy warping path based visualization method for TAOT. TAOT is not just a single distance measure, but also an algorithm framework where other kinds of cost matrix can easily fit into and each creates a new distance measure. We believe this may lead to some potential future work. In 1998, she was a Professor with the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing. She is currently serving as a Team Leader and leading a project of multispectral imagery radiometric and geometric correction for large volume of image data at global scale for higher resolution global land-cover mapping. She has significant experience in managing and designing software systems for satellite image processing and applications. Her research interest includes the use of mathematical theories to development algorithms related to image processing and analysis.