A Novel Similarity Measurement and Clustering Framework for Time Series Based on Convolutional Neural Networks

In recent years, with the development of machine learning, and especially the rise of deep learning, time series clustering has proven effective at providing useful information in cloud computing and big data. However, many modern clustering algorithms struggle to mine the complex features of time series, which are important for further analysis. Convolutional neural networks provide powerful feature extraction and perform excellently in classification tasks, but they are hard to apply to clustering. Therefore, a similarity measurement method based on convolutional neural networks is proposed. The algorithm converts the number of co-directional changes in the convolutional neural network's outputs into a similarity between time series, so that the network can mine the features of unlabeled data during clustering. In particular, by preferentially collecting a small amount of highly similar data to create labels, a classification algorithm based on the convolutional neural network can be used to assist clustering. The effectiveness of the proposed algorithm is demonstrated by extensive experiments on the UCR time series datasets, and the results show that it outperforms other leading methods. Compared with other clustering algorithms based on deep networks, the proposed algorithm can output intermediate variables and visually explain its working principle. An application to financial stock linkage analysis provides an auxiliary mechanism for investment decision-making.


I. INTRODUCTION
Time series are time-related or logically sequential data. Large amounts of time series have been generated in various fields with the rapid development of information technology, e.g., company sales data, stock prices, and industrial data. Much research on analyzing time series has been carried out, including prediction [1], [2], classification [3], clustering [4], anomaly detection [5], visualization [6], pattern recognition [7], and trend analysis [8].
Time series clustering is an essential method for mining data information when there is no prior knowledge of the data [9]. Research on time series clustering mainly focuses on two aspects: time series similarity measures and time series clustering mechanisms. Numerous studies have shown that time series data mining is highly dependent on the similarity measure [10]. To date, many effective similarity methods have been proposed for time series, such as Euclidean distance (ED) [11], edit distance [12], dynamic time warping (DTW) [13], area-based measures [14], set-based measures [15], and adaptive similarity metric selection strategies [16]. However, these techniques focus on point-wise correspondences between time series and do not observe trend changes from an overall perspective. Moreover, they are usually sensitive to outliers and noise because all time points are considered. Therefore, a time series similarity measurement method is proposed that measures the trend of time series with a Convolutional Neural Network (CNN) to reduce the influence of noise and outliers. In the algorithm, the co-directional changes in the CNN's outputs are transformed into a similarity between time series, so that a CNN can be used directly to measure the similarity of time series.
Traditional clustering algorithms have limitations when dealing with noisy, random, and nonlinear time series. Density-based spatial clustering of applications with noise (DBSCAN) can handle this problem effectively [17] and is widely used in time series tasks. Huang et al. [17] proposed a hybrid algorithm combining the Optimization of Initial Points and Variable-Parameter Density-Based Spatial Clustering of Applications with Noise (OVDBSCAN) with support vector regression (SVR); it efficiently improves the accuracy of stock price and financial index prediction. He et al. [18] proposed the semi-supervised time series clustering framework (STSC) and further designed two effective semi-supervised clustering algorithms; the framework integrates fast similarity measurement and constraint propagation methods. However, the choice of the parameters eps (epsilon) and min_samples (minimum number of points) in DBSCAN has a significant impact on the final clustering result. If eps is too small and min_samples is too large, samples that originally belong to the same category will be divided, eventually producing poor clustering results. Nevertheless, the similarity of the intra-class data in that case is very high, so if an algorithm can be designed to extract the intra-class features and use them to guide data aggregation, clustering accuracy can be improved.
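To make the eps/min_samples sensitivity concrete, the following minimal self-contained 1-D DBSCAN sketch (the toy data and all names are illustrative, not from the paper) shows how an overly small eps fragments natural clusters into noise, while a moderate eps recovers them:

```python
def dbscan(points, eps, min_samples):
    """Minimal 1-D DBSCAN; returns a cluster id per point, -1 marks noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1            # not a core point: noise (for now)
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:                  # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_samples:
                seeds.extend(jn)
        cluster += 1
    return labels

data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]          # two natural groups
print(dbscan(data, eps=0.05, min_samples=2))   # eps too small: all noise (-1)
print(dbscan(data, eps=0.15, min_samples=2))   # [0, 0, 0, 1, 1, 1]
```

With eps = 0.05 every point loses its neighbors and the intended clusters dissolve into noise, which is exactly the failure mode the parameter discussion above describes.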
Hence, a CNN-based weakly supervised auxiliary clustering algorithm built on DBSCAN is designed, motivated by CNN's excellent performance in time series classification [19]. The algorithm not only retains the CNN's strong time series feature extraction and noise reduction capabilities to extract intra-class features, but also cleverly uses the CNN's classification capability to assist time series clustering.
Based on the above analysis, the main contributions of the paper can be summarized as follows.
1) A CNN-based similarity measurement method is designed, based on the discovery that the similarity of time series is positively correlated with the number of co-directional changes in CNN outputs.
2) A two-step clustering algorithm is proposed in which the classification algorithm based on CNN can be used to assist clustering by preferentially gathering a small amount of data to create labels.
3) The performance of the algorithm is evaluated on 32 time series mining benchmarks. The simulation results show its effectiveness. Moreover, it is verified on financial time series.
The remainder of this article is organized as follows. Section 2 reviews CNNs, time series similarity measures, and clustering methods. The principle and details of the proposed algorithm are given in Section 3. Section 4 reports the comparison of the algorithm with other algorithms and its application in the financial field. The paper is concluded in Section 5.

II. RELATED WORK
In this section, the fully convolutional network technology used in the framework is introduced, and related research on time series similarity measures and clustering algorithms is reviewed.

A. FULLY CONVOLUTIONAL NETWORK FOR TIME SERIES
Fully convolutional neural networks show good quality and efficiency in image semantic segmentation. For time series clustering, a fully convolutional network is employed here as a feature extractor. Each convolution layer of the fully convolutional network is followed by batch normalization and a ReLU activation layer. The convolution layers use 1-D kernels of sizes {8, 5, 3} without striding [19]. The composition of the convolution layer is shown in Eq. (1), where ⊗ represents the convolution operation.
The batch normalization layer speeds up model convergence and improves generalization. After the convolution blocks, a global average pooling layer is connected instead of a fully connected layer, which greatly reduces the number of weight parameters; the final labels are produced by a softmax layer [19].
The input of the fully convolutional module varies with the dataset [20]. The input to a fully convolutional block is a univariate time series; if it contains Q time steps, the block receives a sequence of that length. The channel interconnecting the data streams is defined as in Eq. (1).
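As a shape-level illustration of the block just described, the sketch below chains 1-D convolutions of kernel sizes {8, 5, 3} (same padding, no striding) with ReLU and global average pooling. It is a simplified sketch in pure Python with a single filter per block rather than the paper's {128, 256, 128} filters, and the toy kernels and series are hypothetical:

```python
def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output length equals len(x)."""
    k, pad = len(kernel), len(kernel) // 2
    xp = [0.0] * pad + list(x) + [0.0] * (k - 1 - pad)
    return [sum(kernel[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def global_avg_pool(x):
    """Replaces a fully connected layer: one scalar per channel."""
    return sum(x) / len(x)

series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]   # Q = 8 time steps
h = series
for kernel in ([0.125] * 8, [0.2] * 5, [1 / 3] * 3):  # kernel sizes {8, 5, 3}
    h = relu(conv1d_same(h, kernel))                  # conv -> ReLU per block
feature = global_avg_pool(h)                          # pooled scalar feature
print(len(h), feature)
```

Note that same padding keeps the sequence length at Q through every block, so the global average pool always sees the full time axis, matching the parameter-saving role described above.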

B. TIME SERIES SIMILARITY MEASUREMENT
The similarity of time series is defined in [21] as follows: given two time series x_1 and x_2, they are similar if dist(x_1, x_2) < γ, where dist(x_1, x_2) is computed by a distance measurement function and γ is the threshold for judging whether the time series are similar.
ED has been widely used because of its simple calculation and clear meaning [22]. However, it does not measure the trend of a time series. DTW was first used by Berndt and Clifford [23] for similarity measurement in speech recognition. It measures the similarity between time series by stretching and contracting the time axis, solving ED's inability to compare time series of inconsistent lengths. Nevertheless, DTW's time complexity is high. To this end, FastDTW [24] was proposed, although it cannot guarantee that the best warping path is found [25]. The longest common subsequence (LCSS) [26] defines the proportion of the common subsequence as the similarity of time series and is highly robust to noise. Edit Distance with Real Penalty (ERP) [12] and Edit Distance on Real sequence (EDR) [27] are two methods based on edit distance; their basic idea is string matching. A large number of experiments comparing time series similarity measures have been carried out [28]. The results show that LCSS, EDR, ERP, and DTW achieve similar classification accuracy, and DTW is a relatively simple method. Therefore, DTW is more practical when there is no domain information about the time series.

FIGURE 1. TSC_CNN, which includes two parts: a similarity measurement algorithm and two-step clustering. The similarity between time series is the result after normalization. In two-step clustering, the red, yellow, and green dotted boxes represent different clusters, and d_i, i ∈ {1, 2, . . . , n} represents different time series in the dataset; every time series has Q steps.
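The contrast between ED and DTW can be sketched as follows; the two toy series share a shape but are shifted in time, so ED penalizes the shift while DTW warps the axis to align them. The recurrence is the classic O(nm) dynamic program, not the FastDTW approximation:

```python
def euclidean(a, b):
    """Point-wise Euclidean distance between equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]        # same peak, shifted one step earlier
print(euclidean(a, b))           # 2.0: ED compares fixed time points
print(dtw(a, b))                 # 0.0: DTW aligns the peaks perfectly
```

This also illustrates the trend-blindness of point-wise measures noted above: ED reports a nonzero distance purely because of the temporal shift.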

C. TIME SERIES CLUSTERING ALGORITHMS
Time series clustering algorithms are divided into two categories according to whether clustering is based on the original data or on features [29]. Raw-data-based clustering methods do not preprocess the original data; they directly match two time series, for example by nonlinearly stretching and contracting the time axis. Such methods usually adapt the distance measure of a traditional static clustering method to time series matching. Izakian et al. [30] proposed three alternatives for fuzzy clustering of time series based on DTW. Paparrizos and Gravano [31] presented k-shape, which adopts a normalized cross-correlation measure to define the distance between two time series. However, these methods are usually sensitive to outliers and noise because all time points are considered.
Feature-based methods convert the original time series into low-dimensional feature vectors and apply clustering to those vectors. Guo et al. [32] used Independent Component Analysis (ICA) to transform the original data into low-dimensional feature vectors and then applied a modified k-means for clustering. Madiraju et al. [33] proposed training an autoencoder and k-means jointly (with a loss based on Kullback-Leibler divergence). However, during training the target distribution is computed from the predicted distribution and updated at each iteration, which causes instability [34].

III. OUR APPROACH
The technical framework of Time Series Clustering based on CNN (TSC_CNN) proposed in this article is shown in Fig. 1; it includes two parts: a similarity measurement algorithm and two-step clustering.
In the similarity measurement algorithm (the upper part of Fig. 1), each time series in the dataset D = {d_1, d_2, . . . , d_n} is used to train a different CNN, and the changes in the CNNs' outputs are transformed into similarities between data in a specific way. In two-step clustering, part of the time series in D are first aggregated into three clusters {d_1, d_4}, {d_2}, {d_7, d_n} according to similarity, and these serve as the training set for a CNN. This process is called preliminary clustering. Then, in final clustering, the CNN trained on the preliminary clusters divides the whole dataset D, used as the test set, into three categories. A detailed description of these two parts is given below.

A. SIMILARITY MEASUREMENT ALGORITHM
In the similarity measurement algorithm, since the data have no labels, the training label is set to 0 to enable network training (that is, the training sample serves as the origin against which other, untrained data are measured). To avoid assigning the same label (0) to time series of different categories within one training run, only one time series is trained at a time; the output of the trained network can then be used as the distance between an input series and the training series (the origin). However, this output only indicates the similarity between an input series and the current training series; it cannot by itself judge whether two different input series are similar. Therefore, each series in the dataset must be trained separately.
Next, an example is used to describe how Algorithm 1 converts the network outputs into similarities. Suppose a dataset contains two categories, each containing 3 time series, as shown in Fig. 2. Fig. 2(a) shows three time series (x_1, x_2, x_3) belonging to category I, and Fig. 2(b) shows the result in which the values in the yellow and blue parts are significantly larger than those in the white part; that is, the co-directional output changes between time series of the same category are considerably larger than those between different categories. Hence, the number of co-directional output changes (increases or decreases) between time series can be used to represent their similarity.
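Under the assumption that each input's network output is recorded after every training epoch, the conversion from co-directional output changes to a similarity score can be sketched as follows. The trajectories below are hypothetical stand-ins for real CNN outputs, and the normalization choice is illustrative:

```python
def codirectional_similarity(out_a, out_b):
    """Fraction of epochs in which both output sequences move the same way
    (both increase or both decrease), used as a similarity in [0, 1]."""
    same = 0
    for t in range(1, len(out_a)):
        da = out_a[t] - out_a[t - 1]
        db = out_b[t] - out_b[t - 1]
        if da * db > 0:                  # strictly the same direction
            same += 1
    return same / (len(out_a) - 1)

# Hypothetical per-epoch outputs of one trained network for three inputs:
same_class = ([1.0, 0.8, 0.9, 0.6, 0.4], [2.0, 1.7, 1.9, 1.5, 1.2])
diff_class = ([1.0, 0.8, 0.9, 0.6, 0.4], [0.2, 0.5, 0.3, 0.7, 0.9])
print(codirectional_similarity(*same_class))   # 1.0: same-class pair
print(codirectional_similarity(*diff_class))   # 0.0: different-class pair
```

The pair that tracks the same up/down pattern scores high, mirroring the yellow/blue versus white contrast described in the example above.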
The structure of the CNN is shown in Fig. 1. The convolutional part consists of three stacked convolution blocks with 128, 256, and 128 filters, respectively. Each block is composed of a convolution layer with 1-D kernels without striding and a BN layer, followed by ReLU activation [19]. The kernel sizes of the blocks are {8, 5, 3}. A global average pooling layer is connected after the convolution layers, and the output of the similarity measurement phase is obtained from a linear layer.

B. TWO-STEP CLUSTERING
The proposed two-step clustering algorithm includes the following two steps. In the first step, the cluster generation algorithm, given in Algorithm 2, aggregates the part of the dataset with higher similarity. (Algorithm 1, referenced here, updates the weights ω of each CNN f_net_i by gradient descent and, for j = 0, 1, . . . , n − 1, records the output matrix entry E[i, j] ← f_net_i(d_j).) The key to the cluster generation algorithm is the data selection criterion, that is, which data should be clustered first. Two indicators are adopted for this judgment: the similarity value and the similarity ranking with respect to the current data. If only the similarity value is used as the criterion, too much data may be collected, especially when the similarity differences between data are small, which increases the training time.
Collecting only a fixed number of data based on similarity ranking may gather data with low similarity, which may reduce clustering accuracy. Therefore, Algorithm 2 selects data by similarity ranking from among the data whose similarity exceeds the similarity threshold. In the second step, the clusters generated in the first step are used as the training set to train the network, and the whole time series dataset is divided as the network's test set. In the two-step clustering phase, the structure of the CNN is the same as in Algorithm 1, except that the last layer is a softmax layer instead of a linear layer.
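A minimal sketch of the combined selection rule (similarity threshold p plus top-K ranking) follows; the similarity scores are hypothetical, and the function and variable names are illustrative rather than taken from the paper's code:

```python
def select_cluster(similarities, p, K):
    """From candidates whose normalized similarity to the current series
    exceeds the threshold p, keep at most the K top-ranked ones, so neither
    criterion alone decides how much data is gathered."""
    above = [(idx, s) for idx, s in similarities.items() if s > p]
    above.sort(key=lambda pair: pair[1], reverse=True)   # rank by similarity
    return [idx for idx, _ in above[:K]]

# Hypothetical similarities of five series to the current seed series:
sims = {1: 0.92, 2: 0.88, 3: 0.60, 4: 0.79, 5: 0.95}
print(select_cluster(sims, p=0.75, K=3))   # [5, 1, 2]: ranked, above threshold
print(select_cluster(sims, p=0.90, K=3))   # [5, 1]: threshold trims the list
```

Raising p shrinks the candidate pool even when K would allow more members, which matches the text's point that the threshold guards against gathering low-similarity data.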

IV. EXPERIMENTS
To evaluate TSC_CNN comprehensively, a large number of experiments were conducted; the specific details are elaborated below.
A. EXPERIMENT SETTINGS
32 datasets from the UCR time series classification archive [35] are selected for cluster analysis. The details are shown in Table 1.
TSC_CNN is compared with 10 representative time series clustering methods, which fall into two types.
1) Non-deep clustering:
• k-means: applies k-means to the entire time series.
• DBSCAN: applies DBSCAN to the entire time series.
• k-shape [36]: uses a scalable iterative refinement procedure to explore the shape of time series with normalized cross-correlation measures.
• USSL [4]: combines the advantages of shapelet learning, spectral analysis, pseudo-class tagging, and the least squares method for unsupervised salient subsequence learning.

2) Deep clustering algorithms:
• DEC [37]: uses a deep neural network not only to learn low-dimensional feature representations of the data but also to iteratively optimize a clustering objective.
• IDEC [34]: It jointly performs clustering and learns representative features with local structure protection.
• DTC [33]: Auto-encoder is used to reduce the dimensionality of the data, and the clustering target and the dimension reduction target are jointly optimized.
• DTCR [9]: integrates the temporal reconstruction and k-means objectives into a seq2seq model; a fake-sample generation strategy for time series and an auxiliary classification task are proposed to enhance the encoder.
• SOM-VAE [38]: It is a new time series interpretable discrete representation learning framework, which uses the SOM and Markov model to improve the clustering and interpretability of time series representation.
• N2D [39]: uses a manifold method to extract features and performs shallow clustering in the resulting re-embedded space.

During the similarity measurement phase, TSC_CNN is trained with a learning rate of 0.0001, epoch = 5000, and batch size = 1. In the clustering phase, the learning rate is 0.0001, epoch = 3000, the batch size is the dataset length divided by 5, K = 3, and the Adam optimizer is used [40].
B. EVALUATION METRICS
RI can be calculated as

RI = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN count pairs of sequences: TP is the number of pairs originally belonging to the same class and predicted to be in the same cluster; TN, pairs originally belonging to different classes and assigned to different clusters; FP, pairs originally belonging to different classes but assigned to the same cluster; and FN, pairs originally belonging to the same class but assigned to different clusters. NMI is calculated as

NMI(G, A) = ( Σ_i Σ_j N_ij · log( (N · N_ij) / (|G_i| · |A_j|) ) ) / sqrt( ( Σ_i |G_i| · log(|G_i| / N) ) · ( Σ_j |A_j| · log(|A_j| / N) ) )
where N represents the total number of time series, |·| is the number of time series in a cluster, and N_ij = |G_i ∩ A_j| represents the number of time series in the intersection of G_i and A_j. For both metrics, values close to 1 indicate high-quality clustering [4].
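Both metrics can be computed directly from ground-truth and predicted label lists. The sketch below uses the square-root normalization for NMI, one common convention that is assumed here:

```python
from itertools import combinations
from math import log, sqrt
from collections import Counter

def rand_index(truth, pred):
    """RI = (TP + TN) / (TP + TN + FP + FN) over all sample pairs."""
    agree = 0
    pairs = list(combinations(range(len(truth)), 2))
    for i, j in pairs:
        same_t = truth[i] == truth[j]
        same_p = pred[i] == pred[j]
        if same_t == same_p:
            agree += 1            # the pair counts as TP or TN
    return agree / len(pairs)

def nmi(truth, pred):
    """Mutual information normalized by the geometric mean of entropies."""
    n = len(truth)
    G, A = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))          # N_ij counts
    mi = sum(c / n * log(n * c / (G[g] * A[a]))
             for (g, a), c in joint.items())
    h_g = -sum(c / n * log(c / n) for c in G.values())
    h_a = -sum(c / n * log(c / n) for c in A.values())
    return mi / sqrt(h_g * h_a)

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(rand_index(truth, pred), 4))
print(round(nmi(truth, pred), 4))
```

Note that RI and NMI are invariant to permutations of the cluster ids, so relabeling the predicted clusters leaves both scores unchanged.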

C. COMPARISON WITH STATE-OF-THE-ART METHODS
TSC_CNN is evaluated against the selected methods using RI and NMI. The results in Tables 2 and 3 are the best obtained by each algorithm on each dataset after parameter tuning; the bold entries mark the best result among all algorithms on each dataset. TSC_CNN performs best with respect to RI on 14 datasets and with respect to NMI on 11 of the 32 datasets. The least improvement in RI is 0.4984 percent. This verifies that measuring the similarity between time series by counting co-directional movements is feasible and that a CNN-based time series classification algorithm can improve clustering performance.
To explain the principle of the feature analysis, the feature extraction process of TSC_CNN on the CBF dataset is visualized. The CBF dataset is synthesized from three types of one-dimensional time series and has been used as a benchmark for time series mining [43]. The data of each category are shown in Fig. 3. Fig. 4 shows the output changes of Algorithm 1 under different epochs. In Figs. 4 and 5, the gradient from red to blue indicates that the similarity between data gradually decreases. In Fig. 4, the proportion of blue in the similarity matrix increases with the epoch, which indicates that the similarity between data becomes smaller overall. This is because, as the epoch increases, the CNN better extracts the characteristics of the data and thus better distinguishes different data; when different data pass through the same CNN, the differences in the outputs gradually increase. Fig. 5 shows the similarity between the series numbered 6 in the CBF dataset and the other series. The similarity between data belonging to the same category (such as 6 and 9) is much greater than between different categories (such as 6 and 10). As the epoch increases, the similarity between series No. 6 and the other series gradually decreases; however, the decline rate for series No. 9 of the same category (15.21%) is much lower than that for series No. 10 (85.71%) of a different category. This shows that, with increasing epochs, the algorithm enlarges the gap between intra-class and inter-class similarity, enabling the clustering algorithm to better distinguish different types of data and thereby improving clustering accuracy to a certain extent.
Compared with other clustering algorithms based on deep networks, TSC_CNN can output intermediate variables and visually explain the principle of the algorithm. From the visualization of the intermediate variables, the rationality of the algorithm can be judged in advance, providing support for the use of the clustering algorithm. The visualization method can also be extended to a whole class of clustering methods, for analyzing the rationality of time series features extracted by deep networks and judging which features are effective for the data. For deep-network-based feature extraction, this visualization is beneficial for optimizing the structure and hyperparameters of the network.

D. ABLATION STUDY
To verify TSC_CNN's improvements in time series similarity measurement and clustering, TSC_CNN is compared with two ablation models:
1) ED_CNN: replaces the time series similarity measurement algorithm in TSC_CNN with ED.
2) TSC_DBSCAN: replaces the clustering algorithm in TSC_CNN with DBSCAN.
Table 4 shows that ED_CNN and TSC_DBSCAN perform best on 7 and 3 datasets respectively, while TSC_CNN performs best on 23 datasets, which shows that both the time series similarity measurement and the two-step clustering contribute to the overall performance.

E. PARAMETER STUDY
In this part, four crucial parameters of TSC_CNN are discussed: the similarity intensity threshold p, the nearest-neighbor number K, the batch size, and the epoch.
In the CNN part of TSC_CNN, the epoch and batch size are the main control parameters. For these two parameters, sensitivity analysis experiments are conducted on four datasets, as shown in Fig. 6(a). The RI of the BirdChicken, Plane, and Car datasets shows a clear trend as the epoch increases: feature extraction ability grows roughly linearly at first and then plateaus. However, the ECG200 dataset is not sensitive to the epoch; most likely the similarity between ECG200 series is low, so strong network training is not required to achieve good feature extraction. The analysis of the batch size shows that it is related to the dataset size and that clustering performance is nonlinearly related to batch size: monotonically increasing the batch size does not increase clustering performance. Setting the batch size for feature extraction therefore requires information about the dataset size.
For K and p, K is varied from 1 to 5 with a step of 1, and p from 0.5 to 0.9 with a step of 0.05. Fig. 6(e)(f)(g)(h) shows the variations in RI under different p and K on four datasets. The peak RI appears in the region where p is greater than 0.75, because the generated clusters then have high internal similarity, which helps the CNN extract internal features. But for the BirdChicken dataset, RI drops significantly when p exceeds 0.85: most of the data cannot be aggregated, which leads to some smaller clusters, and overfitting easily occurs when small clusters are used for training. Therefore the intensity threshold should not be too large. For K, most datasets obtain better results when K is small, and RI changes only slightly as K increases. But for the BirdChicken dataset, when K = 1 there is not enough data to merge into a training set, resulting in poor clustering results. Additionally, K should not be set too high, as this may cause the number of initial clusters to be less than the number of categories.

F. TIME PERFORMANCE ANALYSIS
In this section, the time performance of TSC_CNN is analyzed from two aspects: time complexity and wall-clock time.
Big O notation is used to express the time complexity of TSC_CNN, where n represents the amount of data, b the batch size, and e the epoch. TSC_CNN first takes O(n·e) to train the CNNs and obtain their outputs, and then O(n·n) to convert the outputs into similarities between time series. After that, it takes O(n·n) to gather part of the data according to those similarities, and O((n/b)·e) to classify the time series. The final time complexity of TSC_CNN is therefore O(n·e + 2·n·n + (n/b)·e).
Although Big O notation measures the time complexity of the algorithm to a certain extent, it is mainly related to the amount of data; it omits low-order terms and constant factors, such as the time required to train a network. For deep learning algorithms, however, these neglected values may be precisely the essential factors affecting running time. Therefore, to discuss the time consumption more concretely, the wall-clock time of the different algorithms on the same machine is shown in Table 5. For fairness of comparison, all algorithms with released code were run on a server equipped with an NVIDIA 1080. Table 5 shows that the time performance of our algorithm is the worst among all compared algorithms, especially relative to the non-deep-learning algorithms. Combined with the time complexity analysis, the time consumption mainly lies in the similarity measurement algorithm, because every series in the dataset must be trained, and the consumed time grows with the dataset size and the epoch. However, thanks to this training scheme, and in contrast to other deep-network-based clustering algorithms, the intermediate variables can be visualized to judge the rationality of the algorithm and support the use of the clustering algorithm. The visualization method can also be extended to a class of clustering methods, for analyzing the rationality of deep-network-based time series feature extraction and judging which features are effective for the data; it is beneficial for optimizing the structure and hyperparameters of the deep-network feature extractor. On the whole, the algorithm is suitable for offline tasks, or for online tasks that require high precision and have ample computing power.

G. APPLICATION IN THE FINANCIAL FIELD
Financial time series clustering is a significant application of time series clustering in the financial field and plays a vital role in analyzing and predicting the behavior of financial markets [44]. Constructing an effective stock market classification system based on the similarity of companies' stocks has become a conventional approach in recent years, and it can provide a reliable reference for investors' decisions, because exploring whether different companies share common trends in the stock market helps predict the share prices of other companies in the same cluster [45]. Moreover, when a clustering algorithm is used in a financial system, its result is rarely taken directly as the final statistical output; it usually serves as an intermediate layer supporting other tasks. As described in [46], the data are first clustered, and artificial neural networks (ANN) and logistic regression models then classify the datasets obtained from the clustering results to predict the daily direction of future market returns.
To illustrate the application of TSC_CNN to stock time series, the daily closing prices of 50 shares on the Shanghai Stock Exchange (from December 1, 2018 to December 1, 2019) are selected and normalized as the dataset for this experiment. The dataset contains 50 time series, each with 231 points.
The reasonable number of clusters is related to the shape and scale of the data distribution, along with the resolution required by users [47]. Therefore, the number of clusters is set to √(L/2) (where L is the number of time series in the dataset), and the clustering results of TSC_CNN are shown in Fig. 7. The results show that time series of the same class have similar movement trends, revealing the collective movement of different companies with similar stock prices. The movement trends of different categories differ, which shows that TSC_CNN can find and distinguish the trend characteristics of time series and classify them well. Besides, according to the similarities between different time series given by TSC_CNN, a dendrogram of the sample hierarchy of some companies can be drawn, as shown in Fig. 8, in which 600029 and 601111 represent China Southern Airlines and Air China, and 601398 and 601939 represent the Industrial and Commercial Bank of China and China Construction Bank, respectively. This proves the effectiveness of our algorithm to a certain extent, because companies with the same industry background have a greater probability of similar stock trends.

V. CONCLUSION
This article proposes a new time series clustering algorithm named TSC_CNN with high clustering performance. The framework includes two algorithms: similarity measurement and two-step clustering. In the similarity measurement algorithm, building on the strong time series feature extraction ability of CNNs, a novel data feature extraction method is designed that enables CNNs to be applied in the field of clustering. The visualization of the similarity between data can guide the adjustment of the network structure and parameters, helps predict the rationality of the algorithm, and supports the use of the clustering algorithm. In the two-step clustering algorithm, a method that prioritizes clustering data with higher similarity is designed, so that a CNN classification algorithm can assist clustering and improve clustering performance. Experiments show that the algorithm has a significant advantage over classical algorithms and existing techniques in the clustering of time series, and the analysis of financial time series data reflects its value. We believe that TSC_CNN can be useful for many future time series clustering efforts.