Heterogeneous Feature Based Time Series Classification With Attention Mechanism

The time series classification (TSC) problem has attracted substantial research attention for decades, and a large number of models based on various types of features have been proposed. However, with the rapid development of new applications such as IoT and intelligent manufacturing, time series data from different industries and applications keep emerging. To classify these data accurately, data scientists face two challenges: 1) how to select the optimal features and classification models, and 2) how to interpret the results. To tackle these challenges, in this paper we propose a heterogeneous feature ensemble network, named FEnet. Multiple features, including time-domain and frequency-domain features, are combined to build the model so that it can cope with the diversity of data characteristics. Furthermore, to improve interpretability, we propose a two-level attention mechanism. Finally, we propose two model optimization strategies to enhance classification accuracy and efficiency. Extensive experiments are conducted on real datasets, and the results verify the accuracy, operational efficiency, and interpretability of FEnet.


I. INTRODUCTION
Time series data are pervasive across almost all human endeavors, including medicine, finance, and science. In consequence, there is an enormous interest in analyzing (including query processing and mining) time series data [2], [9], [16].
The time series classification (TSC) problem has attracted substantial research attention for decades. Many approaches have been proposed, such as 1NN [25], shapelet-based [13], temporal-interval-based [8], frequency-feature-based [31], and neural-network-based [9] methods. The latest 1NN algorithm [25] fuses multiple distances to construct a time series classifier. Shapelet-based approaches [13] and temporal-interval-based approaches [8] use local features to distinguish time series of different classes. Frequency-feature-based approaches [31] ensemble multiple local frequency features to achieve good performance on time series classification. Neural-network-based approaches utilize the strong fitting ability of deep learning to deal with the TSC task.
Although the approaches mentioned above may work well on some datasets, two challenges remain in the TSC problem. First, when data scientists come across a new TSC problem (data collected from an unfamiliar domain), they have no idea which classification model ought to be selected. A natural solution is to test all types of models and pick the one with the highest accuracy, but the best model may be missed due to inappropriate parameter settings. Another solution is to select an ensemble model that combines various types of features, like [26], [34].
Although ensemble-based approaches perform better in many domains, they are still confronted with the second challenge: interpretability. It is extremely hard to tell why these models can distinguish two classes of time series. In some cases, such ''know-how'' knowledge is very important. In fact, many other models also face the interpretability challenge, such as deep learning based models. Even some shapelet-based approaches, like [13], have low interpretability.
To address these two challenges, in this paper we propose a feature-based ensemble algorithm, called Feature Ensemble network (FEnet for short), to solve the TSC problem. Multiple features, including time-domain and frequency-domain features, are combined to build the model so that it can cope with the diversity of data characteristics. Moreover, we propose a two-level attention mechanism to improve interpretability. The high-level attention vector, called inter-feature attention, tells which type of feature is truly discriminative. The low-level attention vectors, called intra-feature attention, reveal which part of a certain feature is more important.
In summary, our work makes the following contributions: • We propose an ensemble model at the feature level for the TSC problem. In this model, the two-level attention mechanism gives our network both strong fitting ability and interpretability.
• To further improve the performance of our model, we propose two optimization methods that enhance the network from the perspectives of accuracy and operational efficiency.
• We conduct extensive experiments to verify the effectiveness of the proposed approach.
The rest of the paper is organized as follows. In Section 2, we introduce the related work. In Section 3, we present the necessary concepts. In Section 4, we give an overview and the details of our model. In Section 5, two optimization methods are proposed. In Section 6, the experimental results and their analysis are presented. Finally, the conclusion and future work are described in Section 7.

II. RELATED WORK
A. TIME SERIES FEATURES
Since the algorithm proposed in this paper is a heterogeneous feature ensemble time series classification algorithm, we first review the various features of time series.

1) TIME DOMAIN FEATURE
First, the simplest time-domain feature is the raw data itself. In addition, there are many methods to extract features from the time domain, including PAA [19], PLA [39], APCA [18], SAX [24], and so on. The idea of these methods is to divide the time series into segments and approximate each segment to obtain a compact representation. All these representations are used more in time series similarity search than in time series analysis.

2) FREQUENCY DOMAIN FEATURE
In early time series analysis, DCT [1], DFT [17], and DWT [14] were often used to obtain the frequency-domain representation of time series. In recent years, the frequency-domain representation SFA [32] has attracted extensive attention. It encodes short subsequences of each time series from the perspective of the frequency domain to generate local frequency-domain word representations. In addition, WDN [38] originally uses multi-layer wavelet decomposition to extract time-frequency feature components at different levels; it extracts time-frequency features of different granularities by the wavelet decomposition method and uses them to construct a neural network.

3) EVOLUTION FEATURE
In classical financial time series analysis, scholars tend to use evolution features, such as autocorrelation or autoregression. Therefore, ACF, PACF [11], AR [36], and so on have become the preferred features in this line of time series research. In recent years, the autocorrelation and autoregression coefficients of subsequences at random positions of a time series have also been used as features to construct classifiers [26], [34].

4) DISTANCE FEATURE
In recent years, the similarity of time series has been studied by a large number of researchers, and dozens of distance measures have been proposed to deal with a variety of datasets, such as ED, DTW [3], TWE [29], MSM [35], and so on. EE [25] obtains the similarity of time series by an ensemble strategy, which yields good classification results. These distances are mostly used to construct 1NN classifiers in time series classification.

5) LOCAL FEATURE
There are two kinds of local features in time series analysis research: one from the perspective of shapelets and the other from the perspective of random intervals. Shapelet-based time series classification has strong interpretability; the idea is to find discriminative subsequences as features and construct classifiers from them [41], [42], but its computational efficiency has always been a major problem. Interval-based features do not care which subsequences are meaningful; instead, a large number of observation windows are randomly selected, statistical indicators within these windows are computed as features, and an ensemble model fuses these features [4], [8].

6) GRAPH FEATURE
Graph features, a popular form of time series representation over the past decades, have gradually entered researchers' field of vision with the development of graph neural networks in recent years [21]. The visibility graph [22] is a popular representation. Its main idea is as follows: each point in the time series is regarded as a vertical bar whose height is the value of that point, and each time point is a vertex of the graph. If the tops of two bars are visible to each other, that is, the line of sight between them does not intersect any other bar, then the two vertices are connected by an edge. Recently, scholars have begun to use it in classification tasks [23], [37].

7) KERNEL FEATURE
Nowadays, Rocket [7] and Grail [30] are two of the most widely used kernel-based approaches for obtaining time series features. Rocket [7] uses a large number of random convolution operations to transform the time series into a representation; its powerful nonlinear representation ability leads to good performance on time series classification. The main idea of Grail [30] can be summarized as follows: based on KPCA [33], the kernel function is replaced by learning a distance metric function and parameters that conform to the characteristics of time series, so as to obtain a low-dimensional representation. This work has been proved effective in many time series analysis tasks.

8) DL-BASED
Recently, some deep-learning-based methods for learning time series representations have been proposed, such as DTCR [28] and USRLTS [12]. DTCR combines an adversarial (fake-sample generation) strategy with an autoencoder to encode time series for the clustering problem. USRLTS constructs a scalable network to learn representations of time series with variable lengths.

B. TIME SERIES CLASSIFICATION
In recent years, although time series classification has been well studied, there are still new breakthroughs. Deep learning and ensemble methods have become the two mainstream routes of time series classification. Therefore, we summarize these two lines of work below.

1) DL-BASED ALGORITHM
Given the great progress of deep learning, its application to time series classification tasks has gradually attracted the attention of scholars in recent years. In 2019, a survey of deep-learning-based time series classification was published, comparing the main deep learning models used for the task [9]. Since then, with the development of deep learning technology, more models for time series classification have been proposed [10], [40].

2) ENSEMBLE ALGORITHM
To make time series classification models cope with data from various sources, ensemble methods have been proposed. HIVE-COTE [26] integrates dozens of time series classification models and aggregates their outputs. TS-CHIEF [34] uses a random forest to integrate three kinds of features, based on distance, subsequence statistical indicators, and frequency-domain word representations (SFA) [32]. Both methods have good classification performance. However, from the perspective of the algorithm itself, it is difficult to explain which sub-classifiers matter more and which matter less for the classification result. In a word, their interpretability is poor.
BOSS, mWDN, EE, and Pforest are also ensemble approaches [25], [27], [31], [38], but all of them ensemble homogeneous features, which differs from the problem discussed in this paper.

III. PRELIMINARY KNOWLEDGE
First, we give some necessary definitions.

Definition 1 (Time Series): A time series, T = {t_1, t_2, · · · , t_n}, is an ordered list of real-valued data points sampled at the same interval, where n = |T| is the length of T. A length-l subsequence of T is a shorter time series taken from consecutive positions of T, denoted as T_{i,l} = {t_i, t_{i+1}, · · · , t_{i+l-1}}, where 1 ≤ i ≤ n − l + 1.

Definition 2 (Time Series Dataset): A time series dataset, DS, is a set of time series, and |DS| denotes the number of time series in it.

Definition 3 (Time Series Feature): A time series feature, F = {f_1, f_2, · · · , f_|F|}, is a kind of feature extracted from a time series, which reflects some of its characteristics. We use the subscript to represent the feature type and the superscript to represent the ID of the time series corresponding to the feature. For example, we use F^i_time to denote the time-domain feature of the i-th time series.

Definition 4 (Wavelet Tree): A wavelet tree, WT, is a complete binary tree used to store the wavelet decomposition coefficients of one signal, as shown in Fig. 1. Each node WT(i, j) stores the wavelet decomposition coefficients of a certain level, where i represents the decomposition level and j the serial number within that level. The data stored in the root node, WT(0, 0), is the input signal, which in this paper is an original time series T. For each node WT(i, j), its left child node, WT(i+1, 2j), stores the high-frequency (detail) component of the signal stored in this node, and its right child node, WT(i+1, 2j+1), stores the low-frequency (approximation) component. The low-pass and high-pass filters used in the wavelet decomposition are denoted by G and H, respectively. Then we have Eqs. 1-3 for each node in the wavelet tree. In this paper, we splice the wavelet decomposition coefficients of different granularities stored in the WT nodes into one long vector as the wavelet decomposition coefficient feature of the time series, denoted as F_wavelet. This process is denoted as WaveletTransform(·).

Definition 5 (Distance Matrix):
A distance matrix, DM = {dm_11, dm_12, · · · , dm_ij, · · · , dm_|DS|,|DS|}, is a matrix that reflects the degree of similarity between examples in a dataset. Taking one element as an example, dm_ij reflects the distance between the i-th time series and the j-th time series, expressed as dm_ij = D(T_i, T_j). We use the subscript of D to distinguish different similarity measures. For instance, if we choose the DTW distance as the measure, then dm_ij = D_DTW(T_i, T_j). The whole process of computing DM is represented by the Distance() function.
Definition 6 (Time Series Graph): A time series graph is a graph G = (V, E), where V is the set of vertices, each corresponding to a point of the time series, and E is the set of edges, which are created by some connection rule. Let e_i,j ∈ E denote the edge connecting v_i and v_j. The time series visibility graph [22], denoted as VG, is a widely used graph-structured representation of time series. In essence, VG reflects the numerical relationship between the points of the time series, so this feature works better when the data are well aligned. From this point of view, we construct the Frequency Visibility Graph (FVG) as our time series graph representation [37], because when we convert a time series into a frequency-domain signal with the FFT, each point of the frequency-domain signal is well aligned.

Definition 7 (Individual Representation): In this paper, we first extract time series features from many different domains; then we use different neural network layers to fit each feature, and finally integrate them to obtain the classification result. Through feature fitting, we obtain an intermediate representation vector, IR, for each feature of each time series, which we call its individual representation. We use the subscript to represent the feature type and the superscript to represent the ID of the corresponding time series. For example, IR^i_time denotes the individual representation corresponding to the time-domain feature of the i-th time series.

IV. MODEL
In this section, we introduce the design ideas and details of our time series classification network. As shown in Fig. 2, we start with a time series T. First, we extract a total of nine time series features through a feature extraction module. Then, we fit the model to obtain the model M, the low-level intra-feature attention vectors Att^I, and the high-level inter-feature attention vector Att^II.

A. FEATURE GENERATION
Since our model tries to integrate heterogeneous features, we first need to generate several different kinds of features from the time series. The list of feature types is shown in Table 1.
We select nine different features as our input, following two principles. First, the selected features should cover all the feature types mentioned in the related work. Second, the feature generation step should not take too much time, so the selected features are relatively fast to compute.
For the time-domain feature, we do not perform any operation; we directly take the original data as the input, as Eq. 5 describes.
For frequency-domain features, we extract three kinds of features as input: the spectrum features obtained by the fast Fourier transform, the time-frequency features obtained by the wavelet transform, and the energy spectrum features obtained from the spectrum features, as Eqs. 6-8 describe.
A specific basic wavelet function corresponds to a specific set of wavelet filter coefficients; once the wavelet function is selected, the corresponding filter coefficients are known. Low-pass and high-pass filters of different dimensions are constructed from these coefficients. The original signal yields the approximation coefficients, denoted as a, through the low-pass filter G, and the detail coefficients, denoted as d, through the high-pass filter H. Here we choose the widely used DB4 wavelet as the fundamental wave of the filter. Taking the low-frequency coefficients of the first layer as the input signal, a new group of approximation and detail coefficients is obtained; then the obtained approximation coefficients serve as the input for the second layer, and so on, until the set decomposition level is reached. The original signal, T, can be regarded as the level-0 low-frequency coefficients.
To better understand the feature generation process, we use the wavelet tree of Definition 4 to illustrate the wavelet feature generation method. In this paper, we construct a WT with 3 layers for each time series. Since each node in WT stores a group of wavelet decomposition coefficients, putting them together into one long sequence is an easy way to obtain the wavelet feature. So the wavelet feature can be computed with Eq. 9, as sketched below.
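For concreteness, here is a minimal sketch of this computation, assuming the PyWavelets library; `wavelet_feature` is an illustrative name, not the authors' code. It follows the pyramid decomposition described above (recursively decomposing the low-frequency branch).

```python
import numpy as np
import pywt

def wavelet_feature(T, wavelet="db4", level=3):
    # wavedec follows the pyramid scheme described above: it repeatedly
    # feeds the approximation coefficients back into the filter pair,
    # returning [cA3, cD3, cD2, cD1] for level=3.
    coeffs = pywt.wavedec(T, wavelet, level=level)
    # Splice all coefficient arrays into one long vector (Eq. 9).
    return np.concatenate(coeffs)

T = np.sin(np.linspace(0, 8 * np.pi, 128))
F_wavelet = wavelet_feature(T)   # F_wavelet = WaveletTransform(T)
```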
For the distance feature, we calculate a distance matrix over all time series in DS and take the row vector corresponding to the sequence number of a time series as its distance feature. We select three distance metric functions to generate features, including the Euclidean distance and the DTW distance, as sketched below.
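A minimal sketch of this step, assuming plain NumPy; the DTW shown is the classic O(n²) dynamic program, not necessarily the exact variant used in the experiments.

```python
import numpy as np

def dtw(a, b):
    # Classic dynamic-programming DTW with squared point-wise costs.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def distance_feature(DS, dist=dtw):
    # Distance() of Definition 5: row i of DM is the distance feature of T_i.
    N = len(DS)
    DM = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            DM[i, j] = dist(DS[i], DS[j])
    return DM
```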
For the time evolution feature, we extract the Autocorrelation Coefficient (ACF for short) of time series as the feature.
For graph features, we extract the VG and use its adjacency matrix A_vg as the graph feature. Each point in the time series corresponds to a node in the graph. The edge construction strategy is as follows: for two data points t_i and t_j in a time series T, if they satisfy Eq. 13, there is an edge between them. In essence, VG reflects the numerical relationship between the points of the time series, so this feature works better when the data are well aligned. To deal with the misalignment phenomenon in time series data, a conventional means is to obtain the frequency-domain signal by the Fourier transform, because each dimension of the frequency-domain signal represents the coefficient of one sine wave and each point is strictly aligned. We follow this idea: we first apply the Fourier transform to the time series and then construct the Frequency Visibility Graph (FVG). We use f_k to represent a point of the transformed vector F_frequency; thus Eq. 13 can be rewritten as Eq. 14. The method of constructing the FVG is also shown in Fig. 3. Since each point in the graph also affects itself, F_FVG equals the adjacency matrix of the FVG plus an identity matrix, as Eq. 15 describes. A sketch of this construction is given below.
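A minimal NumPy sketch of the FVG construction; it applies the natural-visibility rule of [22] to the FFT magnitude spectrum, which is our reading of Eqs. 13-15.

```python
import numpy as np

def frequency_visibility_graph(T):
    f = np.abs(np.fft.rfft(T))   # aligned frequency-domain signal (Eq. 6)
    n = len(f)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # f_i and f_j are connected iff every bar between them stays
            # strictly below the line of sight joining their tops (Eq. 14).
            ks = np.arange(i + 1, j)
            if np.all(f[ks] < f[j] + (f[i] - f[j]) * (j - ks) / (j - i)):
                A[i, j] = A[j, i] = 1
    return A + np.eye(n)         # F_FVG = A_fvg + I (Eq. 15)
```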
Considering computational efficiency and the expressive power of features, we choose the kernel functions of Rocket [7] to generate random convolutional kernel features, where K is the generated kernel function set and encode represents the kernel function encoding process.
Next, we describe the specific calculation in the kernel generation and encoding functions. The ROCKET algorithm encodes a time series into a 2η-long vector through η randomly generated convolution kernels (in this paper, we set η to 10000 as the original paper suggests), where each random kernel provides two features: the maximum value and the proportion of positive values after the dot product of the kernel with the time series. We use w_i to denote the different convolution kernels, and vf^i_1 and vf^i_2 to denote the two time series features extracted by kernel w_i.
For each convolution kernel w and each time series T, a vector V is obtained after encoding. For each element v_i of V, we have

v_i = b + Σ_{j=0}^{l_kernel − 1} t_{i+j·d} · w_j (17)

where i represents the starting position of the convolution, l_kernel the kernel length, w the convolution kernel vector, b the bias, and d the dilation. Except for i, all parameters are randomly sampled from the value ranges given by the authors.
For each convolution kernel w_i, the two extracted features are the maximum value, vf^i_1 = max(V), and the proportion of positive values, vf^i_2 = |{v ∈ V : v > 0}| / |V|.
We splice the features extracted by all η convolution kernels to obtain a 2η-length time series feature vector TF, as sketched below.
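A minimal sketch of this encoding, assuming NumPy and the parameter ranges of the ROCKET paper [7]; η is kept small here for brevity.

```python
import numpy as np

def rocket_features(T, eta=100, seed=0):
    rng = np.random.default_rng(seed)
    n, feats = len(T), []
    for _ in range(eta):
        l = int(rng.choice([7, 9, 11]))                # kernel length
        w = rng.normal(0.0, 1.0, l)
        w -= w.mean()                                  # mean-centered weights
        b = rng.uniform(-1.0, 1.0)                     # bias
        d = int(2 ** rng.uniform(0, np.log2((n - 1) / (l - 1))))  # dilation
        span = (l - 1) * d
        # Eq. 17: v_i = b + sum_j t_{i + j*d} * w_j for each start position i.
        V = np.array([b + np.dot(T[i:i + span + 1:d], w)
                      for i in range(n - span)])
        feats += [V.max(), float((V > 0).mean())]      # max and PPV
    return np.array(feats)                             # TF, length 2*eta
```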
Note that features used in our model may have redundancy. The reason is that we hope to utilize the most suitable features for different datasets so that the accuracy can be guaranteed. Moreover, the core of this paper is not to select the most concise feature set, but to design a highly interpretable framework based on the attention mechanism.

B. INTRA-FEATURE ATTENTION
As shown in Fig. 2, our proposed FEnet is a feature-level embedding neural network model. A common observation is that, within each feature, different parts have different importance to the model. The intra-feature attention is used to tell the weight of each part of a feature. We use Att^I to represent the intra-feature attention vector of a feature; for example, the attention vector of the time domain is denoted as Att^I_time. The features modified by intra-feature attention are represented by F′. Taking the time-domain feature as an example, its calculation is shown in Eq. 21, i.e., F′_time = F_time ⊙ Att^I_time,
where F′ is used as the input of individual embedding in the next stage. A minimal sketch of this attention follows.
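A minimal PyTorch sketch of Eq. 21; the module name is illustrative, and the all-ones start is an assumption mirroring the stated Att^II initialization (the paper does not fix the Att^I initialization).

```python
import torch
import torch.nn as nn

class IntraFeatureAttention(nn.Module):
    """Element-wise attention over the positions of one feature (Eq. 21)."""
    def __init__(self, dim):
        super().__init__()
        # All-ones start is an assumption, mirroring the Att^II initialization.
        self.att = nn.Parameter(torch.ones(dim))

    def forward(self, F):          # F: (batch, dim)
        return F * self.att        # F' = F (.) Att^I, learned end to end
```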

C. INDIVIDUAL EMBEDDING
For each F′, we build a personalized small network S to fit the individual representation IR of the feature; we call this process individual embedding. In our model, three kinds of individual embedding methods are given: MLP-based, FCN-based, and GCN-based.

1) MLP-BASED
In our model, except for the time-domain feature, F_time, and the frequency visibility graph feature, F_FVG, all other features use an MLP to learn the representation IR. From the perspective of computational load, we try to use the simplest neural network to learn the IR representation of a feature. Each MLP consists of a fully connected layer, a batch normalization layer, a ReLU activation layer, and a dropout layer. The input dimension equals the dimension of the feature, and the output dimension is 50; all IR vectors have the same dimension to keep the features comparable. Taking the frequency spectrum feature, F_fft, as an example, its individual representation is fitted by Eq. 22,
where W is the weight matrix, B is the bias vector, and r is the dropout rate with default value 0.2, used to prevent overfitting of the model. A minimal sketch of this embedding is given below.
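A minimal PyTorch sketch of Eq. 22; the 50-dimensional output and r = 0.2 follow the text, while everything else is an assumption.

```python
import torch.nn as nn

def mlp_embedding(in_dim, out_dim=50, r=0.2):
    # Fully connected -> batch norm -> ReLU -> dropout, as described above.
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),   # W * F' + B
        nn.BatchNorm1d(out_dim),
        nn.ReLU(),
        nn.Dropout(r),                # prevents overfitting
    )
```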

2) FCN-BASED
For the time-domain feature F_time, we use an FCN network to learn its representation IR_time. The reason we use the more computationally intensive FCN model is that our feature set lacks a category of local time series features; we use the FCN to learn the representation of the time-domain feature in an attempt to capture local patterns missing from the other features. The FCN consists of three convolutional layers and one linear layer, and each convolutional layer is accompanied by a batch normalization layer, a ReLU activation layer, and a dropout layer. The first convolutional layer maps a time-domain feature F_time ∈ R^n to its output Y_conv ∈ R^n using Eq. 23,
where W is the weight vector and B is the bias vector. The following two convolutional layers have a similar structure to the first one. The number of convolution kernels is 128 for the first convolutional layer, 256 for the second, and 128 for the third. After the convolutional layers, we flatten the feature into a long sequence and fit IR_time through a linear layer. The output dimension of IR_time is also 50, for fairness with the other domain features' IRs. A minimal sketch follows.
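A minimal PyTorch sketch of this embedding; the channel counts (128/256/128) and the 50-dimensional output follow the text, while the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, r=0.2):
    # Convolution -> batch norm -> ReLU -> dropout, as described above.
    return nn.Sequential(nn.Conv1d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm1d(c_out), nn.ReLU(), nn.Dropout(r))

class FCNEmbedding(nn.Module):
    def __init__(self, n, out_dim=50):
        super().__init__()
        self.convs = nn.Sequential(conv_block(1, 128, 7),
                                   conv_block(128, 256, 5),
                                   conv_block(256, 128, 3))
        self.linear = nn.Linear(128 * n, out_dim)

    def forward(self, F_time):               # F_time: (batch, n)
        y = self.convs(F_time.unsqueeze(1))  # (batch, 128, n)
        return self.linear(y.flatten(1))     # IR_time: (batch, 50)
```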

3) GCN-BASED
For the graph feature F_FVG, we use a GCN (graph convolutional network) to learn its representation IR_FVG. We use a GCN here so that the network matches the data format without losing information. In this model, the GCN uses a multi-hop process to collect messages from neighbor nodes and a readout operation to aggregate all node information into a 1D representation. Then, three convolutional layers extract local features, which are flattened into a 1D vector. Finally, the vector is encoded with an MLP layer to obtain the individual representation of F_FVG, i.e., IR_FVG.
Generally, GNNs employ a neighborhood aggregation strategy in which we iteratively update the representation of a node by aggregating the representations of its neighbors. Multi-hop aggregation is a commonly used option. We use

h^k_{v_i} = Σ_{v_j ∈ VN(v_i)} h^{k−1}_{v_j}

as our aggregation function, where VN(v_i) is the set of neighbor nodes of v_i and h^k_{v_i} is the representation of v_i after the k-th multi-hop iteration. We use F_FVG as the initial representation, so for node v_i of T^j we have h^0_{v_i} = F^j_FVG(i, :). In this model, the number of multi-hop iterations is one, so after the multi-hop process the graph representation GP can be described by Eq. 24.
The readout process converts each GP into a 1D graph representation OGP. We use Eq. 26 to obtain OGP: OGP = GP · W_readout (26), where W_readout is a weight vector shared by all nodes, so its dimension is n × 1. Then, three convolutional layers extract the detailed features of the graph representation produced by the GCN. These convolutional layers are similar to those in the FCN model for IR_time, except that a max-pooling layer is added to each convolutional layer. Each convolutional layer thus maps a graph feature F_FVG ∈ R^D to its output Y_conv ∈ R^n, using Eq. 27.
After that, a linear layer collects the local features produced by the convolutional layers and fits the final graph individual representation, IR_FVG. A minimal sketch of this embedding follows.
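A minimal PyTorch sketch of this pipeline under our reading of Eqs. 24-27: one aggregation hop computed as a matrix product with F_FVG (which already contains self-loops, Eq. 15), a learnable n × 1 readout, and a small convolution stack. Kernel and pooling sizes are assumptions.

```python
import torch
import torch.nn as nn

class GCNEmbedding(nn.Module):
    def __init__(self, n, out_dim=50):
        super().__init__()
        self.w_readout = nn.Parameter(torch.ones(n, 1))        # Eq. 26
        blocks = []
        for c_in, c_out in [(1, 128), (128, 256), (256, 128)]:
            blocks += [nn.Conv1d(c_in, c_out, 3, padding=1),
                       nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(2)]
        self.convs = nn.Sequential(*blocks)
        self.mlp = nn.Linear(128 * (n // 8), out_dim)          # 3 poolings

    def forward(self, F_FVG):                 # F_FVG: (batch, n, n)
        # One multi-hop iteration (Eq. 24): neighbors' rows are summed,
        # with h^0_{v_i} = F_FVG(i, :) as the initial representation.
        GP = torch.matmul(F_FVG, F_FVG)
        OGP = torch.matmul(GP, self.w_readout)                 # (batch, n, 1)
        y = self.convs(OGP.transpose(1, 2))                    # (batch, 1, n)
        return self.mlp(y.flatten(1))                          # IR_FVG
```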

D. INTER-FEATURE ATTENTION
Different kinds of features also contribute differently to the classifier. We expose these contributions through the attention mechanism, which is the core of our model's interpretability. Now that we have the individual representations of all the features, we splice them together to get a global representation, denoted as GR.
As mentioned above, the contribution of each feature kind to the classifier is different, so our inter-feature attention is used to learn the importance weight of IR fitted by different feature kinds.
The inter-feature attention is a learnable vector that tells the contribution of each kind of feature to the classification model. We use Att^II to represent this global attention vector; for example, the inter-attention weight of the time domain is denoted as Att^II_time. To train the attention weights fairly, we initialize all attention vectors to all-one vectors, and the shape of Att^II is (1, 9). Thus, the global representation can be rewritten as GR = concat(Att^II_1 · IR_1, · · · , Att^II_9 · IR_9) (29).
We use the cross-entropy loss as the objective function of this network. The loss function of our model is

Loss = − Σ_{c=1}^{C} Y_c log(Ŷ_c),

where C is the total number of classes, Y is the true target value, and Ŷ is the predicted value. Specifically, if the time series belongs to class c, then Y_c = 1; otherwise Y_c = 0. We use the Adam [20] optimizer to train our network. A minimal sketch of this stage is given below.
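A minimal PyTorch sketch of Eq. 29 and the training objective; the linear classification head is an assumption, while the nine 50-dimensional IRs, the all-ones initialization, the cross-entropy loss, and the Adam optimizer follow the text.

```python
import torch
import torch.nn as nn

class InterFeatureAttention(nn.Module):
    def __init__(self, n_features=9, ir_dim=50, n_classes=2):
        super().__init__()
        self.att = nn.Parameter(torch.ones(n_features))  # Att^II, all ones
        self.head = nn.Linear(n_features * ir_dim, n_classes)  # assumed head

    def forward(self, IRs):              # IRs: list of 9 (batch, 50) tensors
        weighted = [a * ir for a, ir in zip(self.att, IRs)]
        GR = torch.cat(weighted, dim=1)  # global representation (Eq. 29)
        return self.head(GR)             # class logits

model = InterFeatureAttention()
criterion = nn.CrossEntropyLoss()                  # Loss = -sum_c Y_c log(Y_c)
optimizer = torch.optim.Adam(model.parameters())   # Adam [20]
```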

V. OPTIMIZATION STRATEGY
To further improve the performance of the FEnet network, we propose two optimization strategies, addressing accuracy and operational efficiency, respectively.

A. TWICE MODEL FITTING
As some time series training sets are relatively small, the network is likely to converge to a local optimum and overfit. Here, we give a twice-training optimization method to alleviate this problem. The idea of this strategy is to train the network in two phases: a pre-training phase and a training phase.
In the pre-training phase, we use an all-one vector as the initialization of the inter-feature attention. The purpose of pre-training is to determine the initialization of the Att^II vector for the training phase. When pre-training is completed, we evaluate the parameters of Att^II and separate the important weights from the unimportant ones: when a weight is greater than 1, we consider the corresponding feature important; when it is less than 1, we consider it unimportant.
In the training phase, we first take the positions of Att^II obtained in the pre-training phase whose values are less than 1; the features corresponding to these positions are considered unimportant for the classification model. In the second fitting, we multiply the corresponding elements of the initialization vector by α during initialization to weaken their impact, where α is a number between 0 and 1 (default 0.5 in our work). Then the training phase starts. In this way, after the training phase we obtain the final classification model M′. A minimal sketch follows.
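A minimal sketch of the two-phase procedure; `build_model` and `train` are hypothetical stand-ins for the FEnet constructor and training loop, and `inter_att.att` is assumed to expose the Att^II parameter.

```python
import torch

def twice_model_fitting(build_model, train, alpha=0.5):
    # Phase 1: pre-train with Att^II initialized to all ones.
    model = build_model()
    train(model)
    att = model.inter_att.att.detach()
    # Phase 2: re-initialize; weights < 1 mark unimportant features,
    # whose starting attention is damped by alpha (default 0.5).
    init = torch.where(att < 1, torch.full_like(att, alpha),
                       torch.ones_like(att))
    model2 = build_model()
    with torch.no_grad():
        model2.inter_att.att.copy_(init)
    train(model2)
    return model2                        # the final model M'
```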

B. ONLINE MODEL PRUNING
Since our neural network model is a feature-based fusion model, its computational load is relatively heavy when training on a big dataset. To solve this problem, we provide an online model pruning method to accelerate model fitting. The core idea of this optimization is to cut off the computational branches of unimportant features in the network.
The approach can be described as follows. Unlike the twice model fitting strategy, we do not set a pre-training phase; instead, we check the importance weights of Att^II every k epochs (the default value of k is 500 in our work). According to the minimum loss, we obtain the current optimal model and its corresponding inter-attention vector. We sort the inter-attention vector Att^II and prune the network branch of the feature domain corresponding to the minimum value. Then we re-initialize the network parameters other than Att^II and continue training. Because part of the network is pruned, training is further accelerated. A minimal sketch follows.
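A minimal sketch of the pruning loop; `train_epochs`, `snapshot`, `prune_branch`, and `reinit_except_att` are hypothetical helpers for training, saving the current model, removing a feature branch, and re-initializing everything except Att^II.

```python
def online_pruning(model, train_epochs, k=500, total_epochs=2000):
    active = list(range(9))              # indices of the 9 feature branches
    best = None                          # (loss, snapshot) of the best model
    for _ in range(total_epochs // k):
        loss = train_epochs(model, k)    # train for k epochs, return min loss
        if best is None or loss < best[0]:
            best = (loss, model.snapshot())   # keep the lowest-loss model
        att = model.inter_att.att.detach()
        worst = min(active, key=lambda i: float(att[i]))
        active.remove(worst)
        model.prune_branch(worst)        # cut that branch's computation
        model.reinit_except_att()        # reset all parameters but Att^II
    return best[1]
```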
As for classification ability, although we cannot guarantee that the classification accuracy will improve after pruning, our final model is the one with the smallest loss over the whole process, so the overall classification performance will not degrade much.

VI. EXPERIMENT
In this section, we conduct several experiments to verify the effectiveness of our approach.

A. DATA AND EXPERIMENT OPTIONS
The UCR Time Series Archive [5] is a popular benchmark for the time series classification problem. We build the training and testing dataset based on the UCR Archive. We randomly choose 63 datasets in different domains.
All of our experiments are run on a laptop computer with AMD Ryzen 7 4800H @2.90 GHz CPU, 32 GB memory, and Windows 10 operating system. We used Python 3 for implementation, with PyTorch 1.8.1 for neural networks. The network was trained on a single Nvidia RTX 2060 GPU card with CUDA 10.1 unless stated otherwise. In our experiment, the epoch number of all neural network algorithms is 2000 for fairness, and the model is selected when the loss reaches the minimum.

B. ACCURACY COMPARISON
In this section, we evaluate our model from the perspective of accuracy. Because our model is a deep-learning-based ensemble model, we take some mainstream deep learning models [9] and heterogeneous feature ensemble models [26], [34] for time series classification as baselines for comparison with our work. Our code is available,2 and the baseline code can be found in the original papers.
The accuracy comparison results on all 63 datasets are shown in Table 2. Each row of the table compares the classification accuracy of the 13 methods on one dataset; the best accuracy is shown in bold. The last row gives the average accuracy rank of each algorithm over the 63 datasets. As the table shows, without optimization, the classification accuracy of our FEnet model is comparable to TS-CHIEF and HIVE-COTE. With the twice model fitting optimization (FEnet_2F), our model has the best classification ability. Moreover, even with online pruning optimization (FEnet_Pru), our method outperforms the traditional deep learning algorithms. Since we test a large number of datasets, we use the Critical Difference (CD) diagram [6] to intuitively compare the algorithms. The CD diagram of the accuracy comparison between FEnet and the baselines is given in Fig. 4; it also shows that our method has the best classification performance.

C. INTERPRETABILITY
The interpretability of our method is mainly reflected in the intra-feature and inter-feature attention, Att^I and Att^II. In this section, we demonstrate their interpretive ability.

1) INTERPRETABILITY OF INTRA-FEATURE ATTENTION
To verify the effectiveness of intra-feature attention, we use the ECGFiveDays and GunPoint datasets as illustrations. We draw the heat map of the intra-feature attention vector corresponding to the time-domain feature, as shown in Fig. 5.
For the ECG data, let us start with some background knowledge. ECG is a manifestation of electrophysiological activity: in the process of heart beating, a series of electrophysiological changes are produced and transmitted to the body surface, where the electrical signals can be picked up by electrodes; unfolded along the time axis, they form the body-surface ECG. A complete ECG wave is composed of the P wave, the QRS complex, the T wave, and the U wave, which reflect different stages of the heartbeat [15], and all of them have a certain ability to identify abnormal waveforms in ECG data. The heat map of the intra-feature attention weights shows the importance of each position of the attention vector; the warmer the hue, the more important the corresponding part of the feature. In particular, the attention weight corresponding to the T wave reaches the maximum.
The GunPoint dataset in Fig. 5(b) comes from an experiment that recorded the hand trajectory of actors during the whole draw-aim-return motion, with or without a pistol [41]. It was found that, due to the weight of the pistol, there is a small backswing when the pistol is holstered. The attention vector fitted by our model finds this automatically, without human intervention: it is the darkest part of the heat map of the attention vector.

2) INTERPRETABILITY OF INTER-FEATURE ATTENTION
The UCR time series archive categorizes all datasets according to their sources. Therefore, we compute importance statistics on the fitted inter-feature attention vectors and sort the features by weight from high to low; the top three features per source are listed in Table 3. From the results, we can clearly see that when the data sources are images or human motion, the time-domain features have the highest weight. When the source is a device, whose mechanism is unclear, the kernel feature (Rocket), with its nonlinear descriptive ability, is more representative. When the source is a spectrograph, consistent with common sense, the frequency spectrum feature is more important.

D. EFFICIENCY
For fairness, in terms of operational efficiency we only compare our method with the deep-learning-based methods. The average runtimes are shown in Table 4. Our method reaches a mid-range level in running time and has an advantage over ResNet and FCN. In addition, the feature extraction time of the FEnet model, denoted as FEnet(Feature), accounts for only 0.78% of the total running time of the algorithm, so the feature extraction step is not a bottleneck. The runtime of the twice model fitting optimization, denoted as FEnet(2Fitting), is roughly twice that of FEnet. From the table, we can also see that our online model pruning optimization, denoted as FEnet(Pruning), improves the average running time by about 20%.

VII. CONCLUSION
In this paper, we propose a feature-level ensemble model, FEnet, to solve the TSC problem. Compared with existing works, FEnet is competitive in accuracy. Moreover, with its two-level attention mechanism, FEnet can automatically identify the more important features, which is very valuable for interpreting the results. In the future, we will try to build more accurate, lighter, more robust, and more interpretable time series classification models.

HANBO ZHANG received the B.S. degree in information engineering from the East China University of Science and Technology, Shanghai, in 2013, and the M.S. degree in computer systems organization from the China Electronics Technology Corporation, Academy of Electronics and Information Technology, Beijing, in 2016. He is currently pursuing the Ph.D. degree with the School of Computer Science, Fudan University, Shanghai. His research interests include data mining, time series data analysis, and low-quality data processing.
PENG WANG received the Ph.D. degree from Fudan University, Shanghai, in 2007. He is currently a Professor with the School of Computer Science, Fudan University. He has published more than 30 articles in refereed international journals and conference proceedings. His research interests include database, data mining, and series data processing.
SHEN LIANG received the Ph.D. degree from Fudan University, Shanghai, in 2020. He is currently a Research Associate with the Data Intelligence Institute of Paris (diiP), Université Paris Cité. His research interests include data mining, time series analysis, and knowledge-guided machine learning.
TONGMING ZHOU received the M.S. degree in software engineering from Fudan University, Shanghai, in 2018, where he is currently pursuing the Ph.D. degree with the School of Computer Science. His research interests include motif discovery and time series forecasting.
WEI WANG received the Ph.D. degree from Fudan University, Shanghai, in 1998. He is currently a Professor with the School of Computer Science, Fudan University. He has published more than 100 articles in refereed international journals and conference proceedings. His research interests include database, data mining, and series data processing.