A Novel Channel and Temporal-Wise Attention in Convolutional Networks for Multivariate Time Series Classification

Multivariate time series classification (MTSC) is a fundamental and essential research problem in the domain of time series data mining. Recently, deep neural networks have emerged as an end-to-end solution for MTSC and achieve state-of-the-art results on several public datasets. They are favored for their hierarchical feature extraction ability, and most existing research focuses on designing network architectures to improve performance on MTSC. Despite this, the attention mechanism, which has been demonstrated to be an effective module for feature extraction in other domains, has seldom been investigated for MTSC. In this paper, we propose a residual channel and temporal attention (CT_CAM) module, which aims to refine the features extracted by a convolutional neural network and thus improve classification performance. Extensive experiments on 15 public MTSC datasets show that the proposed CT_CAM module achieves competitive performance compared with nine baseline methods and three other attention modules.


I. INTRODUCTION
With the advance of sensor technologies, extensive data sequentially ordered by time are received and recorded in our daily life. These time series are typically recorded by different types of sensors simultaneously over time and form so-called multivariate time series. Extracting knowledge from multivariate time series has attracted increasing attention in recent decades, and multivariate time series classification (MTSC) is one of the most significant tasks. MTSC aims to predict a class label for a given multivariate time series and has many real-world applications, such as clinical time series analysis [1], human activity recognition [2], [3], sea state estimation [4], [5], and fault diagnosis in machinery systems [6], [7].
For MTSC, a plethora of research focuses on feature-based methods, which extract a set of features that represent the time-series patterns and then train a classifier on these features. These approaches require heavy feature engineering, and different applications may call for different feature extraction schemes. Moreover, the resulting huge feature space usually makes the feature selection step difficult and thus leads to low accuracy [8].
Recently, deep neural networks have been utilized to provide an end-to-end solution for time series classification and have achieved state-of-the-art results on several public datasets [9], [10]. Their advantage is that they combine hierarchical feature extraction and classification, and can therefore learn representations directly from the data.
Designing deep neural network architectures is a difficult but essential engineering task, because well-designed networks ensure remarkable performance improvements in various applications [11]. Most research in MTSC focuses on designing a network architecture by stacking convolutional neural network (CNN) blocks [12], combining long short-term memory (LSTM) with CNN [10], or adding skip-connections [9]. However, one aspect that lacks investigation in MTSC is the attention mechanism, which has been addressed extensively in natural language processing [13] and computer vision [11]. Attention indicates where to focus and therefore improves the representation of interest. Attention mechanisms designed for CNNs in other domains can play a certain role on time series data, but time series have their own characteristics, which may require a special treatment of the attention mechanism. The most notable feature of time series is their temporal correlation, whereas in image processing and other computer vision applications researchers pay more attention to the spatial correlation between pixels [5]. Intuitively, different sensor modalities come from different domains and have different importance for different tasks. For example, in human activity recognition, the accelerometer features may be more significant in distinguishing the ''walking'' and ''biking'' activities, while the gyroscope features may be more significant in distinguishing the ''turning-left'' and ''turning-right'' activities [14]. Besides, not all timesteps contribute equally to the task. For instance, the features at some timesteps may show a more salient pattern than others in distinguishing the ''fault'' and ''normal'' status in fault diagnosis.
In this paper, we propose a novel attention module for MTSC. It consists of two parts: a channel calibration attention module (CCAM) and a temporal calibration attention module (TCAM), which address importance along the channel and time axes, respectively, and thus enhance the representation power of the network. The proposed attention block can be implemented in state-of-the-art network architectures by simply adding it behind a CNN block. To summarize, our work has the following main contributions: 1) A novel attention module called CT_CAM (channel and temporal calibration attention module), which effectively integrates channel and temporal attention into CNN features, is proposed for MTSC. This module is generic and can therefore be applied to any layer in any CNN architecture, such as the fully convolutional network (FCN) and the deep residual network (ResNet). By integrating attention layers with both CCAM and TCAM, the proposed attention module can capture channel-wise and temporal dependencies of the time series data, which amplifies the more important and informative modalities and timesteps during classification. 2) Extensive experiments are performed on 15 public MTSC datasets. The combinations of the proposed CT_CAM with CNN, FCN, and DenseNet all outperform the baselines. Compared with other attention mechanisms, the proposed CT_CAM achieves state-of-the-art results whether it is combined with CNN, FCN, or DenseNet. The ablation study demonstrates the importance of the proposed attention module.
The structure of this paper is as follows: Section II gives an introduction to MTSC and the attention mechanism. Section III describes the architecture of the proposed approach. The experiments are discussed in Section IV, and the paper is concluded in Section V.

II. RELATED WORK
A. MULTIVARIATE TIME SERIES CLASSIFICATION
Most of the work on MTSC can be grouped into three categories: similarity-based methods, feature-based methods, and deep-learning methods. Similarity-based methods, as the name suggests, identify time series by calculating the similarity (Euclidean distance or other distance metrics) between two time series. Dynamic Time Warping (DTW) has been reported to be among the most competitive methods. There are two widely used versions of DTW for MTSC: dependent DTW (DTWD) and independent DTW (DTWI). DTWD warps all dimensions together along a single path, accumulating the squared Euclidean distance over all dimensions, whereas DTWI computes DTW on each dimension independently and sums the resulting distances. Feature-based methods transform the original time series into a lower-dimensional latent space that is easier to classify. Two techniques are widely used for this transformation: shapelet models and Bag-of-Words (BOW). The bag-of-features framework (TSBF) [15] extracts local and global features for each time series and feeds them to a random forest classifier. Bag-of-SFA-Symbols (BOSS) [8] introduces a combination of a distance-based classifier and histograms built with the symbolic Fourier approximation. Ensemble algorithms that combine multiple feature-based algorithms, such as the elastic ensemble (PROP) [16] and the flat collective of transform-based ensembles (COTE) [17], also achieve promising results. Recently, effort has been made to exploit deep learning approaches to overcome the limitations of feature-based methods. A hybrid model combining FCN and LSTM is proposed in [10] with the aim of better feature extraction. A model integrating a random group permutation method, LSTM, and multi-layer convolutional networks for MTSC is proposed in [18]. The above studies target designing a network architecture by stacking CNN and LSTM for better performance. We focus on the attention mechanism for MTSC, which is less addressed by most existing works.
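As a rough illustration of the two DTW variants (a minimal sketch with toy data, not the implementation used in the cited benchmarks; the function names are ours), the following snippet computes both distances for a small multivariate series:

```python
import numpy as np

def dtw(a, b, dist):
    """Classic dynamic-programming DTW between sequences a and b.
    dist(a[i], b[j]) returns the local cost between two time points."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_dependent(X, Y):
    # DTW_D: one warping path over all dimensions; the local cost is the
    # squared Euclidean distance between multivariate points.
    return dtw(X, Y, lambda x, y: np.sum((x - y) ** 2))

def dtw_independent(X, Y):
    # DTW_I: warp each dimension separately and sum the per-dimension distances.
    return sum(dtw(X[:, d], Y[:, d], lambda x, y: (x - y) ** 2)
               for d in range(X.shape[1]))

if __name__ == "__main__":
    X = np.random.randn(20, 3)   # toy series: 20 timesteps, 3 dimensions
    Y = np.random.randn(25, 3)
    print(dtw_dependent(X, Y), dtw_independent(X, Y))
```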

B. ATTENTION MECHANISM
Attention plays an important role in human perception [19]. Attention mechanisms have been demonstrated in sequence learning [20] and image understanding [21] for their ability to focus on the informative, salient parts of a signal, and they have proven to be an effective way to enhance CNNs. The development of attention mechanisms can be roughly categorized into two directions: enhancement of feature aggregation and combination of channel and spatial attention. A compact attention module called Squeeze-and-Excitation (SE) is proposed to exploit the inter-channel relationship [22]. SE is the first attempt to learn channel attention and achieves promising performance. The Convolutional Block Attention Module (CBAM), which can be integrated into any CNN architecture seamlessly, is proposed in [11]. CBAM infers attention maps along the channel and spatial dimensions, and the input feature map is then refined by element-wise multiplication with the attention maps. A second-order attention module is proposed for effective feature aggregation by learning more discriminative representations [23]. Another attention module, named gather-excite (GE), is introduced to aggregate spatial features in CNNs [24]. A non-local (NL) attention module is presented to exploit local relationships for capturing long-range dependencies in computer vision tasks [25]. On the basis of the NL module, GCNet is developed to model long-range dependency [26]. Inspired by the promising results achieved in image processing, CNNs have gradually been used in many MTSC tasks, yet the impact of the attention mechanism on CNNs has not been exploited extensively in MTSC. The SE module was first integrated into a CNN model and applied to MTSC in [10]; the experimental results show that, with the help of the SE module, the accuracy of the model is greatly improved. All of the above methods are dedicated to developing sophisticated attention modules that learn more discriminative features. Different from them, our proposed attention module aims at learning effective channel attention and temporal attention simultaneously.

III. CHANNEL AND TEMPORAL CALIBRATION ATTENTION MODULE
As mentioned above, CNNs have become a common framework for TSC tasks. This paper mainly studies the use of the attention mechanism to improve the classification ability of CNNs. In other words, the proposed attention module can be applied to all kinds of CNN variants, such as FCN, ResNet, and DenseNet. As illustrated in FIGURE 1, the original multi-layer feature maps of the FCN are enhanced through the CT_CAM module, which consists of CCAM and TCAM; the extracted features are processed by CT_CAM sequentially. Given an intermediate feature map at the $k$-th layer, $F^k \in \mathbb{R}^{T \times C}$, where $T$ stands for the number of timesteps and $C$ for the number of feature channels, CT_CAM modulates $F^k$ using the attention weights in a recurrent and multi-layer fashion as

$$F^{k+1} = f\big(F^k, \Phi(F^k)\big),$$

where $F^k$ is the feature map output by the previous CNN layer, which consists of a convolutional layer, a normalization layer, and a ReLU layer. $F^{k+1}$ and $\Phi(\cdot)$ are the modulated feature and the CT_CAM function, respectively, which will be detailed in Section III-A and Section III-B. $f(\cdot)$ is a weighting function that modulates the CNN features with the attention weights.
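To make this recurrence concrete, the following PyTorch sketch shows one way the attention module could be inserted behind each convolutional block. The layer sizes and the `attention_cls` hook are illustrative assumptions, not the authors' exact configuration; an attention class such as the CT_CAM sketched in the following subsections could be passed in.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1d -> BatchNorm -> ReLU, operating on (batch, channels, timesteps)."""
    def __init__(self, in_ch, out_ch, kernel_size=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding="same"),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
        )
    def forward(self, x):
        return self.block(x)

class AttentiveFCN(nn.Module):
    """FCN-style backbone realising F^{k+1} = f(F^k, Phi(F^k)) by letting a
    (hypothetical) attention module refine each block's feature map."""
    def __init__(self, in_ch, num_classes, attention_cls, filters=(128, 256, 128)):
        super().__init__()
        chans = [in_ch, *filters]
        self.blocks = nn.ModuleList(
            ConvBlock(chans[i], chans[i + 1]) for i in range(len(filters)))
        self.attns = nn.ModuleList(attention_cls(c) for c in filters)
        self.head = nn.Linear(filters[-1], num_classes)

    def forward(self, x):                  # x: (batch, in_ch, T)
        for conv, attn in zip(self.blocks, self.attns):
            x = attn(conv(x))              # refine each block's feature map
        return self.head(x.mean(dim=-1))   # global average pooling + classifier
```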

A. CCAM
The whole process of CCAM is depicted in FIGURE 2.
Assume the raw multivariate time series data has shape $\mathbb{R}^{T \times N}$, where $T$ and $N$ are the number of timesteps and the dimension of the time series, respectively. Usually, a 1D CNN is utilized to extract features from the raw time series. The filters of the 1D CNN act as pattern detectors, transforming the raw time series from $\mathbb{R}^{T \times N}$ to features in $\mathbb{R}^{T \times C}$, where $C$ is the number of filters in the 1D CNN and also the number of feature channels. Each channel of the feature map represents the activation response of one convolutional filter. In this paper, a channel attention module is proposed to overcome the fact that a conventional CNN treats all feature channels equally. That is, employing an attention module in a channel-wise manner can be regarded as ''channel selection''; in image processing it is also called semantic attribute selection [27].
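For intuition, this shape transformation can be reproduced with a single PyTorch layer (channel-first layout; the sizes below are illustrative):

```python
import torch
import torch.nn as nn

T, N, C = 100, 6, 128                 # timesteps, input dimensions, filters
x = torch.randn(1, N, T)              # PyTorch Conv1d expects (batch, channels, timesteps)
conv = nn.Conv1d(N, C, kernel_size=8, padding="same")
features = conv(x)                    # (1, C, T): each of the C channels is one
print(features.shape)                 # filter's activation response over time
```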
Let $X_k \in \mathbb{R}^{T}$ denote the $k$-th channel of the feature map $X$. Global average pooling and global max pooling are applied to each channel to obtain the average channel feature $X_{ac} \in \mathbb{R}^{C \times 1}$ and the max channel feature $X_{mc} \in \mathbb{R}^{C \times 1}$. To calculate the attention weights, $X_{ac}$ and $X_{mc}$ are forwarded to a two-layer multilayer perceptron (MLP) with shared weights. After the MLP, the features are passed through a sigmoid function. Finally, the raw features are calibrated by the attention map:
$$\alpha = \sigma\big(W_2(W_1 X_{ac}) + W_2(W_1 X_{mc})\big), \qquad X_{att} = \alpha \otimes X,$$

where $X$ is the original input, $\alpha$ contains the weights of the attention module, $X_{att}$ is the weighted feature, and $\otimes$ denotes element-wise multiplication. $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ represent the weights of the first and second MLP layers, respectively, and $\sigma$ denotes the sigmoid function. Inspired by the success of residual blocks [9], the channel attention is integrated with a residual connection, which is called residual channel attention. From FIGURE 2, we have $\tilde{X} = X + X_{att}$, where $\tilde{X}$ is the output feature, $X$ is the original input, and $X_{att}$ is the feature scaled by the attention weights.
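A minimal PyTorch sketch of the residual channel attention described above, following our reading of the equations; the reduction ratio r and the pooling details are assumptions:

```python
import torch
import torch.nn as nn

class CCAM(nn.Module):
    """Residual channel calibration attention for features of shape (batch, C, T)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Shared two-layer MLP (W1: C -> C/r, W2: C/r -> C) applied to both
        # the average-pooled and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (batch, C, T)
        x_ac = x.mean(dim=-1)                  # global average pooling -> (batch, C)
        x_mc = x.amax(dim=-1)                  # global max pooling     -> (batch, C)
        alpha = torch.sigmoid(self.mlp(x_ac) + self.mlp(x_mc))   # (batch, C)
        x_att = x * alpha.unsqueeze(-1)        # channel-wise re-weighting
        return x + x_att                       # residual connection
```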

B. TCAM
The channel attention focuses on the informative features in different channels, but the information along the time axis should also be emphasized. To achieve this goal, TCAM is proposed. To compute the temporal attention efficiently, a Gated Recurrent Unit (GRU) is utilized, as illustrated in FIGURE 3. Suppose the feature processed by the channel attention is $X \in \mathbb{R}^{T \times N}$. The feature is first mapped to $\tilde{X} \in \mathbb{R}^{T \times K}$. As shown in FIGURE 3, it is then transformed by a bidirectional Gated Recurrent Unit (BiGRU) to better capture temporal memory information: $X_{gru} = \mathrm{BiGRU}(\tilde{X})$.
To calibrate the temporal information, the idea of the channel attention and the residual connection is adopted. As in the channel attention, $X_{gru}$ is first forwarded to average pooling and max pooling, and the two pooled features are then repeated along the time axis. The attention weights are computed as

$$\beta = \sigma\big(W_2(W_1 F_{at}) + W_2(W_1 F_{mt})\big),$$

where $F_{at}$ and $F_{mt}$ are the features after average pooling and max pooling, respectively, and $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ represent the weights of the first and second layers of the shared MLP. The refined feature is computed as

$$X_{att} = \beta \otimes X_{gru},$$

where $\otimes$ denotes element-wise multiplication. The final output $X_T$ of the temporal attention module is computed with the residual connection, as shown in the left branch of FIGURE 3:

$$X_T = X + \mathrm{MLP}(X_{att}),$$

where the MLP maps the shape from $\mathbb{R}^{T \times K}$ back to $\mathbb{R}^{T \times N}$.
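A corresponding PyTorch sketch of TCAM is given below; the BiGRU hidden size, the reduction ratio, and the way the pooled statistics are broadcast along the time axis are our assumptions rather than the authors' exact settings:

```python
import torch
import torch.nn as nn

class TCAM(nn.Module):
    """Residual temporal calibration attention for features of shape (batch, C, T)."""
    def __init__(self, channels, hidden=8, reduction=4):
        super().__init__()
        self.bigru = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        k = 2 * hidden                                  # BiGRU output size K
        self.mlp_att = nn.Sequential(                   # shared MLP for attention weights
            nn.Linear(k, max(k // reduction, 1)),
            nn.ReLU(),
            nn.Linear(max(k // reduction, 1), k),
        )
        self.proj = nn.Linear(k, channels)              # maps R^{T x K} back to R^{T x C}

    def forward(self, x):                               # x: (batch, C, T)
        x_t = x.transpose(1, 2)                         # (batch, T, C) for the GRU
        x_gru, _ = self.bigru(x_t)                      # (batch, T, K)
        f_at = x_gru.mean(dim=1, keepdim=True)          # average pooling over time
        f_mt = x_gru.amax(dim=1, keepdim=True)          # max pooling over time
        beta = torch.sigmoid(self.mlp_att(f_at) + self.mlp_att(f_mt))  # (batch, 1, K)
        x_att = x_gru * beta                            # broadcast (repeat) along time axis
        out = x_t + self.proj(x_att)                    # residual connection
        return out.transpose(1, 2)                      # back to (batch, C, T)
```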

C. ARRANGEMENT OF ATTENTION MODULES
According to the implementation order of CCAM and TCAM, there are two types of models, each incorporating both attention mechanisms. These two types are described as follows.

1) CHANNEL-TEMPORAL (CT)
The first type, denoted Channel-Temporal (CT), applies CCAM before TCAM. The flow chart of CT is shown in FIGURE 1. For the initial convolutional feature map $X_r$, the residual channel-wise attention $\Phi_c$ is applied to obtain the weights $\alpha$ for the raw feature map. The weighted feature map is then obtained by combining $X_r$ and $\alpha$.
After the channel attention, the weighted feature map is fed to the temporal attention $\Phi_t$, and the temporal attention weights $\beta$ are obtained in the same way as in the channel attention. The whole process can be summarized as

$$\alpha = \Phi_c(X_r), \quad X_c = f_c(X_r, \alpha), \quad \beta = \Phi_t\big(\mathrm{Dropout}(X_c; dp)\big), \quad X_w = f(X_c, \beta),$$

where $f_c(\cdot)$ multiplies the feature map channels with the corresponding weights, $X_w$ is the modulated feature map, $f$ represents the modulation function, and $dp$ is the dropout rate between the two attention modules, which is set to 0.3 in this paper.

2) TEMPORAL-CHANNEL (TC)
The second type is called Temporal-Channel (TC) and applies TCAM first. For this type, given the raw feature map $X_r$, TCAM $\Phi_t$ is first utilized to calculate the temporal attention weights $\beta$. CCAM $\Phi_c$ then takes the temporally weighted feature map as input, and the channel attention weights $\alpha$ are calculated. The whole process is summarized as

$$\beta = \Phi_t(X_r), \quad X_t = f_t(X_r, \beta), \quad \alpha = \Phi_c\big(\mathrm{Dropout}(X_t; dp)\big), \quad X_w = f(X_t, \alpha),$$

where $f_t(\cdot)$ is an element-wise multiplication of the feature map timesteps with the corresponding attention weights, $f$ denotes the modulation function, $dp$ is the dropout rate between the two attention modules, and $X_w$ represents the feature map weighted by the two attention modules.
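Reusing the hypothetical CCAM and TCAM classes sketched above, the two arrangements could be combined as follows (a sketch; the exact placement of the dropout layer is our interpretation of the text):

```python
import torch.nn as nn

class CTCAM(nn.Module):
    """Sequential arrangement of the two attention modules (Section III-C).
    order='ct' applies channel attention first, order='tc' temporal first;
    a dropout layer with rate dp sits between the two modules."""
    def __init__(self, channels, order="ct", dp=0.3):
        super().__init__()
        if order == "ct":
            self.first, self.second = CCAM(channels), TCAM(channels)
        else:
            self.first, self.second = TCAM(channels), CCAM(channels)
        self.drop = nn.Dropout(dp)

    def forward(self, x):                 # x: (batch, C, T) from a CNN block
        x = self.first(x)                 # first attention module refines the feature
        x = self.drop(x)                  # dropout between the two modules
        return self.second(x)             # refined feature map X_w
```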

IV. EXPERIMENT
A. EXPERIMENTAL SETUP
1) DATASETS
We use 15 datasets from the latest MTSC archive [28]. This archive consists of real-world multivariate time series covering a wide range of cases, dimensions, and series lengths, as presented in TABLE 1. The applications mainly include human activity recognition, motion classification, ECG/EEG signal classification, and audio spectra classification. The number of classes ranges from 2 (e.g., face detection) to 39 (audio phoneme). The length of the time series ranges from 8 to 3,000, while the dimension ranges from 2 to 963. The dataset sizes range from 27 to 9,414 samples. For each dataset, the classification accuracy is used as the evaluation metric. The average accuracy, the number of Wins/Ties, and the average rank are computed to compare the different methods.
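For reference, the three summary metrics can be computed as in the following sketch (the `results` layout and the function name are ours, not part of the archive's tooling):

```python
import numpy as np
from scipy.stats import rankdata

def summarize(results):
    """results: dict mapping method name -> list of per-dataset accuracies,
    all lists aligned to the same dataset order. Returns, per method, the
    average accuracy, the number of Wins/Ties, and the average rank."""
    methods = list(results)
    acc = np.array([results[m] for m in methods])           # (methods, datasets)
    best = acc.max(axis=0)
    wins_ties = (acc >= best - 1e-12).sum(axis=1)           # ties count for every method
    # rank 1 = best accuracy on a dataset; ties receive the average (mid) rank
    ranks = np.vstack([rankdata(-acc[:, d]) for d in range(acc.shape[1])]).T
    return {m: dict(avg_acc=acc[i].mean(),
                    wins_ties=int(wins_ties[i]),
                    avg_rank=ranks[i].mean())
            for i, m in enumerate(methods)}
```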

2) IMPLEMENTATION DETAILS
All the experiments are run on a server equipped with Intel processors, 64 GB of RAM, and a TITAN V GPU (12 GB). PyTorch is used to implement the models [29]. During training, the learning rate is set to 1e-4 and Adam is used as the optimizer [30]. For a fair comparison, the number of training epochs is set to 3,000, the same as in [18].
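A minimal training-loop sketch matching the reported setup (Adam, learning rate 1e-4, 3,000 epochs) is shown below; the model and the data loader are placeholders rather than the authors' released code:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=3000, lr=1e-4, device="cuda"):
    """Training loop following the reported configuration; `model` stands for a
    backbone with CT_CAM and `train_loader` for a dataset-specific loader."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:              # x: (batch, C, T), y: class labels
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
```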

B. BENCHMARK COMPARISON
We plug the CT_CAM module into the FCN [9], a single-layer CNN (SLCNN), and the recently proposed DenseNet [4], and then compare the resulting DenseNet_CT_CAM, FCN_CT_CAM, and SLCNN_CT_CAM with nine baseline approaches, including common distance-based classifiers, bag-of-patterns feature-based methods, and deep learning frameworks. The baselines are as follows.

ED-1NN, ED-1NN(norm), DTW-1NN-I, DTW-1NN-I (norm), DTW-1NN-D and DTW-1NN-D (norm):
One-nearest-neighbor classifiers (1NN) with two different distance measures, Euclidean distance (ED) and dynamic time warping (DTW). I and D denote that DTW is computed by treating each dimension independently or all dimensions jointly, respectively. Data normalization is applied for the variants annotated with (norm) [31].

WEASEL+MUSE [8]: This framework builds a large feature space using multiple window lengths. A Chi-squared test is then used to identify the most relevant features, which are fed to a logistic regression classifier.

MLSTM-FCN [10]: This deep learning model consists of an LSTM layer and an FCN branch along with an SE module.

TapNet [18]: This model is also a combination of an LSTM layer and stacked CNN layers. A random permutation is used before the stacked CNN layers to reorganize the time series dimensions into different groups.

For a fair comparison, we reproduce the table reported in [18] and add the experimental results of our models, as listed in TABLE 2. The default settings are adopted for DenseNet_CT_CAM; the numbers of filters for SLCNN and FCN are 128 and {128, 256, 128}, respectively. In all three cases, the number of hidden units for the BiGRU is 8; for some large datasets, such as MotorImagery and Phoneme, this hyperparameter is adjusted accordingly. The best accuracy for each dataset is denoted in boldface. In terms of average accuracy, our three models outperform all the baseline methods. DenseNet_CT_CAM obtains the best average accuracy of 0.746, a significant improvement over the existing state-of-the-art approach TapNet, whose average accuracy is 0.691. In terms of the number of Wins/Ties, our model achieves 9 Wins/Ties, the best among all compared methods, while TapNet and WEASEL+MUSE each achieve 2 Wins/Ties. Our model achieves better performance on most datasets, especially on datasets with small amounts of data such as Heartbeat and HandMovementDirection, which contain only hundreds of training samples.

[Critical difference diagram: classifiers with the lowest (best) ranks are to the right; groups of classifiers whose rankings are not significantly different are connected by solid horizontal lines; the critical difference (CD) length at the top indicates statistically significant differences.]

C. COMPARISON WITH OTHER ATTENTION MECHANISMS
To illustrate the superiority of the proposed CT_CAM module for MTSC, three other attention modules are used for comparison on the benchmark datasets. Three backbones, SLCNN, FCN, and DenseNet, are used for these modules. SLCNN contains only one CNN block with 128 filters, while FCN consists of three CNN blocks with {128, 256, 128} filters. The setting of DenseNet is the same as in [4]. The attention module is stacked after each CNN block. The compared attention modules are as follows.

CBAM [11]: The convolutional block attention module consists of a channel and a spatial attention block, where both global average and max pooling are used to generate statistics.

GC [26]: Global context adopts 1 × 1 convolutions for both attention pooling and the bottleneck transform.

SE [22]: Squeeze-and-Excitation uses global average pooling to generate channel-wise statistics and a bottleneck MLP for the transform.

N/A means that no attention module is used. The comparison results are presented in TABLE 3, TABLE 4, and TABLE 5. From these tables, it is easy to see that the proposed CT_CAM outperforms the other attention modules in terms of average accuracy, Wins/Ties, and average rank for all three backbones. More specifically, the proposed CT_CAM in SLCNN shows 10.64%, 5.04%, and 3.47% improvement over SE, GC, and CBAM, respectively, as depicted in TABLE 3. There are improvements of 6.05%, 2.25%, and 1.18% over CBAM, GC, and SE when CT_CAM is applied to FCN, as presented in TABLE 4. From TABLE 5, the improvements of CT_CAM over CBAM, GC, and SE are 5.54%, 5.42%, and 2.49%, respectively. For SLCNN, the average accuracy can be improved dramatically by including an attention module, especially CBAM or CT_CAM, which consider both channel and temporal attention. However, for FCN and DenseNet, including CBAM, GC, or SE yields even worse results than the vanilla backbone. This suggests that an attention module can significantly enhance the performance of a simple network with relatively weak representation power, whereas the representation power of a deeper network might be suppressed due to the limited data and the increase in complexity. Our CT_CAM module uses a residual connection inside, which allows the information to flow explicitly into the next block, so the network's ability is not likely to be suppressed.

D. ABLATION STUDY
To validate the importance of the proposed attention module, four variants are compared:
1) C: a pure CCAM model; the TCAM is removed.
2) T: a pure TCAM model; the CCAM is removed.
3) CT: the proposed CT_CAM module, as described in Section III-C.
4) TC: the positions of TCAM and CCAM are exchanged, as described in Section III-C.
N/A means that no attention module is used. To fully illustrate the performance, SLCNN, FCN, and DenseNet are used as backbones for these four variants on the 15 benchmark datasets. The results are presented in TABLE 6, TABLE 7, and TABLE 8. As illustrated in TABLE 6, the best average accuracy is obtained with the TC variant. The CT variant shows slightly lower accuracy than TC, but more Wins/Ties and a better average rank. Compared with the N/A configuration in SLCNN, the performance of TC and CT improves by 12.10% and 12.03%, respectively. From TABLE 7, the CT variant shows a higher average accuracy than the other variants, although C and T achieve more Wins/Ties and a better average rank, respectively. TABLE 8 shows results similar to those in TABLE 7, where FCN is used as the backbone; here CT achieves the highest average accuracy and the best average rank.
TABLE 6, TABLE 7, and TABLE 8 show that adding C alone leads to a small decrease in average accuracy, while adding T alone gives a small increase. Sequentially adding CCAM and TCAM, however, yields a relatively large improvement. The reason might be that the temporal attention compensates the channel features. This phenomenon is more obvious in the shallow CNN architecture, but we empirically show that the CT_CAM block can enhance the performance of both shallow and deep network architectures. TABLE 6, TABLE 7, and TABLE 8 also summarize the experimental results for the different attention arrangements. From the results, the channel-first order performs slightly better than the temporal-first order, but they can be considered almost equal. All arrangements outperform using only the channel or the temporal attention independently, showing that utilizing both attentions is crucial.

E. VISUALIZATION
We visualize the attention maps using CAM [33]. FIGURE 6 shows that the three models highlight similar regions. They highlight the plateau area for ''I have a command'' and focus on the transition area for ''Spread wings''. The N/A model clearly has a more widespread attention region than CBAM and CT_CAM; the CBAM and CT_CAM modules help the network focus on the informative and related regions. From FIGURE 7, it can also be seen that the CT_CAM, CBAM, and N/A models all focus on the transition in both classes, although CT_CAM and CBAM have a wider attention area. FIGURE 8 and FIGURE 9 present similar scenarios where the signal changes only in a small area and remains stable elsewhere. In these two figures, the proposed CT_CAM not only captures useful information in the sharply changing area of the signal but also identifies subtle changes in the signal. The performance of CBAM is even worse than the N/A model in these two cases, as CBAM neither captures such a wide informative area nor observes the signal's subtle changes.
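For completeness, a 1D class activation map in the spirit of CAM [33] can be computed as in the following sketch (our simplification; the figures in the paper may use a different normalization):

```python
import torch

def class_activation_map(features, fc_weight, target_class):
    """1D CAM: `features` are the last convolutional feature maps (batch, C, T)
    before global average pooling, `fc_weight` is the final linear layer's weight
    matrix (num_classes, C). Returns a (batch, T) map indicating how much each
    timestep contributes to the target class."""
    w = fc_weight[target_class]                        # (C,)
    cam = torch.einsum("bct,c->bt", features, w)       # weighted sum over channels
    cam = cam - cam.min(dim=1, keepdim=True).values    # normalise to [0, 1] per sample
    cam = cam / (cam.max(dim=1, keepdim=True).values + 1e-8)
    return cam
```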

V. CONCLUSION
In this paper, the CT_CAM module is presented to improve the representation power of CNNs for the MTSC problem. The module consists of a channel block and a temporal block, which refine the features along two dimensions of multivariate time series, i.e., the channel and temporal dimensions. The experimental results on the public UEA archive demonstrate that the recently proposed DenseNet, as well as FCN and SLCNN, combined with the proposed CT_CAM module achieve state-of-the-art results compared with nine baseline methods. Compared with other attention modules, the proposed CT_CAM provides better performance whether it is combined with SLCNN, FCN, or DenseNet. The sensitivity analysis studies the impact of the number of hidden units in the GRU. From the experimental results, the proposed CT_CAM can enhance the performance of various CNN networks and can be an important component of CNN architectures.
The focus of this work is to improve the feature extraction ability of CNNs by utilizing the attention mechanism. Based on the characteristics of time series data and drawing on the design ideas of attention mechanisms in computer vision, we propose a sequential attention structure that can learn temporal and channel-wise information simultaneously. This novel attention module improves the accuracy of the model, but it inevitably makes the model more cumbersome and less lightweight.
PEIHUA HAN received the bachelor's and master's degrees from the Department of Architecture and Civil Engineering, Zhejiang University, China, in 2019. He is currently pursuing the Ph.D. degree with the Norwegian University of Science and Technology (NTNU), Aalesund, Norway, as part of the Mechatronics Laboratory, Department of Ocean Operations and Civil Engineering. His current research interests include fault diagnosis and prognostics, predictive maintenance, machine learning, and uncertainty quantification.
GUOYUAN LI (Senior Member, IEEE) received the Ph.D. degree from the Department of Informatics, Institute of Technical Aspects of Multimodal Systems, University of Hamburg, Hamburg, Germany, in 2013. Since 2014, he has been with the Mechatronics Laboratory, Department of Ocean Operations and Civil Engineering, Norwegian University of Science and Technology, Aalesund, Norway. In 2018, he became an Associate Professor in ship intelligence. His research interests include path planning, ship motion prediction, maneuvering control, artificial intelligence, optimization algorithms, and locomotion control of bioinspired robots. In these areas, he has authored or coauthored more than 60 articles. The focus of his research lies in two areas. One is on biological robots and modular robotics. The second focus is on virtual prototyping and maritime mechatronics. In these areas, he has published over 130 journal and conference papers and book chapters as author or coauthor.