Enhancing Multivariate Time Series Classifiers through Self-Attention and Relative Positioning Infusion

Time Series Classification (TSC) is an important and challenging task for many visual computing applications. Despite the extensive range of methods developed for TSC, relatively few utilize Deep Neural Networks (DNNs). In this paper, we propose two novel attention blocks (Global Temporal Attention and Temporal Pseudo-Gaussian augmented Self-Attention) that can enhance deep learning-based TSC approaches, even when such approaches are designed and optimized for a specific dataset or task. We validate this claim by evaluating multiple state-of-the-art deep learning-based TSC models on the University of East Anglia (UEA) benchmark, a standardized collection of 30 Multivariate Time Series Classification (MTSC) datasets. We show that adding the proposed attention blocks improves the base models' average accuracy by up to 3.6%. Additionally, the proposed TPS block uses a new injection module to include relative positional information in transformers. As a standalone unit with lower computational complexity, TPS performs better than most state-of-the-art DNN-based TSC methods. The source code for our experimental setups and proposed attention blocks is publicly available.


I. INTRODUCTION
Time series data is a set of data points representing qualitative or quantitative information over a time interval. The significance of any series depends on the order and timing of its data points. TSC is the task of classifying time series data based on their attributes over the time they are collected. The process is called Univariate Time Series Classification (UTSC) if the data points only have a single dimension. In contrast, classifying time series data with multidimensional data points is called Multivariate Time Series Classification (MTSC). Many real-world tasks, such as human activity recognition, machine condition monitoring [1], electrocardiogram (ECG) [2] and electroencephalography (EEG) classification [3], facial action unit classification [4,5], and more [6,7], can be categorized as TSC problems. As a result, TSC is an active research subject [7]. Deep learning-based TSC algorithms tend to have simpler implementations, shorter training periods, lower computational complexity, and better scalability than traditional methods. However, they are outnumbered by traditional methods because of their lower classification accuracies. Specifically, only one method has demonstrated competitive performance compared to the traditional methods [8,9]. One noticeable issue with deep learning-based TSC algorithms is their incapacity to generalize to different applications. For instance, a model may have superior performance for one task but inferior performance for another [10]. It has, therefore, been difficult to develop a general deep learning-based solution suitable for all TSC applications. There seems to be room for extending and improving deep learning-based TSC methods to be more accurate and general for different tasks. (Our source code is available at https://github.com/mehryar72/TimeSeriesClassification-TPS.)
This paper presents two deep learning-based modules that can be directly integrated into any Deep Neural Network (DNN) TSC model to improve performance. We introduce two new attention-based processing blocks called Global Temporal Attention (GTA) and Temporal Pseudo-Gaussian augmented Self-Attention (TPS). While the application format of these two blocks differs, both can improve a model regardless of the task. We present the improvement gained by adding the proposed blocks to three well-known and popular deep learning-based TSC models (FCN [11], ResNet [11], and InceptionTime [8]). The performance of these baseline models before and after adding the suggested attention modules is compared on the UEA [7] benchmark dataset collection.
Our main contribution here is the introduction of the GTA and TPS blocks. These two blocks use attention to underline informative temporal points and data sections unique to each class. They make the learning process of distinguishing between classes easier for any DNN-based TSC.

II. RELATED WORK
We categorize related works into three categories. The first category is focused on traditional (non-deep learning) TSC methods. The second category is Deep Neural Network (DNN) TSCs; the work proposed in this paper belongs to this group. The last category is the group that focuses on injecting position information into Transformer models. These works are primarily concerned with Natural Language Processing (NLP) tasks. However, according to [12], the way they process position information is similar to the proposed TPS block and is, therefore, worthy of review in this section.

A. Traditional TSC methods
The most common method of this group is the "NN-DTW" approach, which is composed of the nearest neighbor (NN) classifier with the Dynamic Time Warping (DTW) distance metric [13]. Recently, the field has been dominated by highly complex classifiers such as Shapelet Transform [14], BOSS [14], and HIVE-COTE [9,15] (ranked as the most accurate classifiers on the UCR archive [9,14]). The most recent classifiers can be divided into two categories of simple and complex methods. Complex classifiers are cumbersome, memory intensive, and difficult to train, making them highly unscalable. In contrast, simple methods are faster and relatively easier to train but usually less accurate.
1) Complex Traditional TSC: In this subcategory, we will review several famous, highly complex, computation-heavy, and accurate traditional TSC classifiers.
The Shapelet Transform classifier identifies discriminative subseries (i.e., shapelets) [16] unique to each class. For each input, shapelets are slid along the time dimension. The distance between the shapelet and the sample at each time step is used to form a new array (transformation). The classification step is performed using the transformed arrays. The Shapelet Transform is one of the most computationally complex methods, and its cost escalates with the number of training samples and the time series' length. Bag-of-SFA-Symbols (BOSS) is a dictionary-based ensemble classifier model that transforms the frequency of the patterns' occurrence into a new format. Word extraction for time series classification (WEASEL) [17] applies static feature selection on the output of a dynamically sized sliding window feature extractor. It achieves higher accuracy than BOSS but has similar training complexity and high memory usage. Collective Of Transformation-based Ensembles (COTE) [14] is a large ensemble of 35 different classifiers, including BOSS and Shapelet Transform.
Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [15] extends the COTE system to include a new hierarchical structure with probabilistic voting and two new classifiers. It includes two new functions to project time-series arrays into new feature spaces. Although HIVE-COTE has become one of the leading TSC algorithms, it is a highly complex algorithm with many hyper-parameters, requiring high memory usage and computational resources. Temporal Dictionary Ensemble (TDE) [18] is a new ensemble of dictionary-based classifiers similar to BOSS. It combines design features from multiple methods, such as BOSS and WEASEL, to create a more accurate approach. [18] showed that HIVE-COTE's accuracy can significantly increase if BOSS is replaced with TDE. HIVE-COTE 2.0 [9] is an upgraded version of HIVE-COTE, incorporating comprehensive changes by compiling scattered improvements such as TDE, which significantly improved its accuracy.
Even though complex traditional TSC methods are highly accurate, their complexity makes them unscalable and less practical. Training for some of these algorithms might take weeks to complete. Therefore, there exists a demand for scalable and less memory-intensive methods.
2) Simple Traditional TSC: Simple traditional TSCs are a set of algorithms designed to be faster, less complex, less memory intensive, and easier to train than complex traditional TSC methods. This subsection reviews more scalable traditional TSC methods such as Proximity Forest [19], TS-CHIEF [20], and ROCKET [21].
The Proximity Forest is an elastic ensemble of proximity decision trees, where the samples are compared against branch exemplars with a randomly chosen distance metric for each node [19]. The Time Series Combination of Heterogeneous and Integrated Embedding Forest (TS-CHIEF) extends the Proximity Forest by combining interval-based and dictionary-based branching [20]. Although these methods are more scalable, they are still highly complicated, with training complexities that are quadratic in time series length. RandOm Convolutional KErnel Transform (ROCKET) was proposed as a high-speed, high-accuracy method for TSC [21]. ROCKET requires only 5 minutes of training on the longest UCR archive time series [21]. For comparison, TS-CHIEF requires four days for training on that same dataset [21]. ROCKET uses many random convolution kernels in combination with a ridge regression classifier. A notable limitation of ROCKET is its requirement for an extensive collection of diverse data to establish a general feature space. Moreover, it shows lower performance on unseen datasets. MiniRocket [22] is a faster version of ROCKET that uses a deterministic approach toward selecting kernels and their specifications. MultiRocket [23] increases the accuracy of MiniRocket by generating more diverse features and utilizing pooling and transform operations.
Although simple traditional TSCs are more scalable and less computationally expensive than complex traditional TSCs, they are nonetheless unscalable and computationally expensive compared to non-traditional TSC methods. By comparison, deep learning-based TSC classifiers are much easier to train and considerably more scalable than traditional classifiers (both simple and complex categories).

B. DNN-based TSC
DNN methods' simpler implementations, shorter training times, and lower computational complexities make them a desirable choice for TSC. Although DNN methods have progressed quickly for TSC applications, they still lack generalizability compared to traditional methods. Still, based on relative complexity, we can divide DNN-based methods into two subcategories: Simple (earlier, low computational cost, less accurate methods) and Complex (later, higher complexity, more accurate methods).
1) Simple DNN-based TSC: Early DNN-based TSC approaches began with the simplest method, the Multilayer Perceptron (MLP). MLP is composed of four Fully Connected (FC) layers and was proposed as a baseline for TSC. The Multi-scale Convolutional Neural Network (MCNN) was composed of two convolutional layers with max pooling and two FC layers [24]. Even though MCNN was simple, it required heavy and complex data preprocessing steps. Time-LeNet [25] had a similar architecture to MCNN but with a modified pooling method. Time-CNN [26] used Mean Squared Error (MSE) loss for training its model and removed the final Global Average Pooling (GAP) layer behind the FC layer. The Multi-Channel Deep Convolutional Neural Network (MCDCNN) [27] also had a similar architecture designed for multivariate data. It applied parallel and independent convolutions to each input channel. Recurrent DNNs were traditionally used for time series forecasting in the form of Echo State Networks (ESNs). The Time Warping Invariant Echo State Network (TWIESN) [28] was a recurrent DNN method that redesigned ESNs for TSC.
2) Complex DNN-based TSC: The performance of early DNN-based TSC methods, based on accuracy benchmarking on UCR/UEA datasets, was still inferior to traditional methods [9,10]. The introduction of more complex 1D-CNN architectures (such as FCN and ResNet) showed that new DNN methods could achieve similar results to traditional methods with lower computational complexities and training times. The FCN model was a three-layered 1D-CNN model that preserved the length of the series throughout all its layers. The output layer was a fully connected layer right after global average pooling (GAP). The GAP layer was later replaced with an Attention layer in [29]. Residual Network (ResNet) is a deeper model with eleven 1D convolutional layers. ResNet was constructed with three residual blocks followed by GAP and a softmax classifier. The inferior performances of FCN and ResNet on UCR datasets compared to HIVE-COTE [10,21] meant that DNN-based TSCs could still be improved.
InceptionTime [8] was introduced as a DNN-based TSC method that achieved comparable accuracy to HIVE-COTE. It was developed as a TSC equivalent of the image classification architecture AlexNet. It consisted of multiple inception modules [30] that apply four concurrent convolutional filters of varying kernel sizes to their input. OS-CNN [31] is a newer 1D-CNN deep learning-based TSC method. It used OS-Blocks, in which the kernel size of each layer differs based on the data. OS-CNN's results on the UCR benchmark were marginally better than InceptionTime's. However, its performance on the UEA benchmark was worse than FCN, ResNet, and InceptionTime. DA-Net [32] is a model composed of two layers: Squeeze Excitation Window Attention (SEWA) and Sparse Self-Attention within Windows (SSAW). The first layer is a 1D version of the squeeze and excitation (SE) block [33], and SSAW is a windowed multi-head attention [34] layer. Even though this model utilized both self-attention and temporal attention, its results on the UEA benchmark are significantly worse than OS-CNN and, subsequently, FCN, ResNet, and InceptionTime. Voice2Series [35] leverages large-scale pre-trained speech models by reprogramming the input time series. Its performance was only reported on 30 out of 128 datasets of the UCR benchmark; therefore, the generality of this method is somewhat questionable.
A few works focus on changing convolution-based methods to a more suitable approach for TSC. DTWNet [36] replaces the inner product kernel with a DTW kernel. However, the authors evaluated its performance entirely differently from other related works. [37] replaced the inner dot product between the kernels and the input with an Elastic Matching (EM) mechanism in the form of an FC layer that imitates DTW. Therefore, the model became invariant to time distortion.
TapNet [38], SimTSC [39], SelfMatch [40], and iTimes [41] are a few works from the scope of semi-supervised TSC with different testing methods. These methods combine traditional TSC and feature prototyping to utilize unlabeled data. Since these methods require extensive external data for training, their results on isolated UEA/UCR benchmarks were lower than supervised methods. FCN, ResNet, and InceptionTime have been demonstrated to be the most successful methods for TSC by achieving the highest ranks [10,42] on the UCR archive [43]. They were therefore chosen as the baseline models for the work presented here. We would have considered using more models as our base models, such as XCM [44], ShapeNet (SN) [45], and TCRAN [46]. However, their reported performances on the UEA dataset included either marginal or no improvement compared to the FCN's. This concludes the review of TSC work related to our proposed method. However, since the proposed TPS model is a transformer with a positional information injection module, it is essential to also review related works on positional information modification in transformers.

C. Positional information injection in transformers
Our TPS model is a transformer with a modified approach to processing positional information. Therefore, this section reviews related works on positional information injection in transformers. Transformer models [47] have shown good performance on many natural language processing tasks. The baseline Self-Attention (SA) transformer model is indifferent to the time order of its input. However, text data is inherently sequential. Therefore, the injection of position information into transformers has been the focus of many methods. [48] provided an overview of these methods, laying out multiple categorizing specifications: (1) Reference Point (Ref.P): Absolute (Abs) or Relative (Rel) position information; (2) Injection Method (Inj.M): Additive Positional Embedding (APE) or Manipulating Attention Matrices (MAM); (3) Learnable during training or Fixed. Based on these categories, our proposed TPS algorithm falls into the Relative, MAM, and Learnable group of positional information processing methods. The next paragraph provides an overview of the methods in this field, whose novelty lies in how positional information is injected into transformers.
[49] modified the self-attention matrix by adding a learned representation of relative positions based on the distance between time entries. [49] hypothesized that exact relative positional information is not useful beyond a certain distance. DeBERTa [50] represented each word by two vectors, one for content and one for position. The positional vectors were used to generate a second attention matrix added to the original. [50] also injected a traditional absolute position embedding into its last stage, thereby utilizing both the APE and MAM injection methods and both Abs and Rel reference points. TUPE [51] separated the analysis of position and content: both relative and absolute positional placements were used to create a position-based attention matrix, which is added to the separately calculated content-correlation attention matrix. SPE [52] proposed a combination of K learned sinusoidal components to replace the classical additive fixed positional embedding in sparse transformers.
[55] proposed a direct relative and multiplicative smoothing on the attention matrix. [53] took a similar approach but included both Rel and Abs reference point utilization. A summary of the Ref.P, Inj.M, and number of learnable parameters for these methods is presented in Table I:

  Method                Ref.P     Inj.M      Learnable parameters
  [53]                  Rel       MAM        dlh(2N − 1)
  DeBERTa [50]          Rel+Abs   MAM+APE    3Nd
  Transformer-XL [54]   Rel       MAM        2d + d²lh
  DA-transformer [55]   Rel       MAM        2dlh
  TUPE [51]             Rel+Abs   MAM        2h
  SPE [52]              Rel

It is very hard to quantitatively compare these methods, as they were used for different tasks and tested on different datasets. However, none of these methods has been applied to TSC tasks.

III. APPROACH
This section describes and details each proposed attention block's operation format.

A. Global Temporal Attention block (GTA)
Temporal Attention (TA) is useful for TSC and regression-related problems [56-58]. In a Classic TA (CTA) block, features from each time unit emphasize or suppress the content based on how informative they are in creating class separation. However, the CTA block has two limitations.
First, the attention weight calculation for each time sample depends only on the values at that time, as shown in Eq. (1):

  A = σ1(W2 σ2(W1 F))    (1)

In this equation, the input time series array is F = (f_1, f_2, ..., f_N) with a dimension of d × N, where d refers to the input's dimension size and N refers to the input's maximum length (duration). Accordingly, the dimensions of the weights W1 and W2 are (d/r) × d and 1 × (d/r), respectively. W1 is a dimensionality-reduction layer, in which the input dimension d is decreased by a factor of r (the dimensionality-reduction factor). σ1(.) and σ2(.) indicate the softmax and ReLU activation functions. The attention matrix is A ∈ R^(1×N), and the output of the attention block, O ∈ R^(d×N), is shown in Eq. (2):

  O = (a_1 f_1, a_2 f_2, ..., a_N f_N)    (2)

As shown, the temporal relations between time samples do not affect the calculated attention weights (each o_i is multiplied by a number between 0 and 1, which is only dependent on f_i).
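As a sketch, the CTA computation of Eqs. (1)-(2) can be written in a few lines of NumPy (the weight values and the reduction factor r here are illustrative placeholders, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cta_block(F, W1, W2):
    """Classic Temporal Attention (Eqs. 1-2): one weight per time sample.
    F: (d, N) input series, W1: (d/r, d), W2: (1, d/r)."""
    A = softmax(W2 @ np.maximum(W1 @ F, 0.0), axis=-1)  # sigma2 = ReLU, sigma1 = softmax
    return F * A, A  # o_i = a_i * f_i (broadcast over the feature dimension)

d, N, r = 8, 16, 4
rng = np.random.default_rng(0)
F = rng.standard_normal((d, N))
O, A = cta_block(F, rng.standard_normal((d // r, d)), rng.standard_normal((1, d // r)))
```

Note that each column of F only contributes to its own attention weight before the softmax, which is exactly the first limitation discussed above.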
Second, each time sample's temporal location has no impact on the calculated attention weights. Some temporal samples may be more important than others due to their temporal location; the CTA block does not factor in such importance when calculating attention weights. We propose a novel Global Temporal Attention (GTA) block to address these two limitations. The formula for calculating the global attention is shown in Eq. (3):

  A1 = σ1(W3 σ2(W2 (W1 F)^T))^T    (3)

In this equation, the learnable weights W1, W2, and W3 have dimensions of 1 × d, (T/r) × T, and T × (T/r). W2 and W3 apply a dimensionality decrease/increase with the reduction/increase coefficient r set to a default value of 16. σ1(.) and σ2(.) denote the sigmoid and ReLU activations. A1 has a dimension of 1 × T. The final output of the GTA block, O, is calculated in the same manner as shown in Eq. (2).
A GTA block learns to utilize global temporal information to emphasize informative time samples and suppress non-informative ones. Its structure enables determining samples' attention based on their temporal location instead of exclusively on their values. Therefore, temporal relations between time samples and their placements are used to determine the importance of time samples during the model's training. Given the similarities between the CTA and GTA blocks and the squeeze and excitation (SE) block [33], a GTA block was added as an intermediate block after each processing layer. An example of such use for the FCN model [11] is shown in Fig. 1. One potential limitation of GTA could be its susceptibility to misclassification from temporal shifts. This is related to how GTA processes temporal placement information: during training, W1 in Eq. (3) is fixed to values that depend on the temporal placements of the training data. Therefore, a system that can overcome such a potential limitation is needed.
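A minimal NumPy reading of the GTA computation described by Eq. (3) follows; the exact layer arrangement is our interpretation of the dimensions stated above, and the weights are random placeholders:

```python
import numpy as np

def gta_block(F, W1, W2, W3):
    """Global Temporal Attention sketch: attention weights are derived from the
    whole temporal profile, so a sample's weight can depend on its location.
    F: (d, T), W1: (1, d), W2: (T/r, T), W3: (T, T/r)."""
    s = (W1 @ F).T                        # (T, 1) temporal summary of the series
    a = W3 @ np.maximum(W2 @ s, 0.0)      # sigma2 = ReLU, reduce then restore T
    A1 = (1.0 / (1.0 + np.exp(-a))).T     # sigma1 = sigmoid, (1, T)
    return F * A1, A1

d, T, r = 8, 16, 4
rng = np.random.default_rng(1)
F = rng.standard_normal((d, T))
O, A1 = gta_block(F, rng.standard_normal((1, d)),
                  rng.standard_normal((T // r, T)),
                  rng.standard_normal((T, T // r)))
```

Unlike CTA, the bottleneck here operates along the time axis, so every position's weight mixes information from all positions.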

B. Temporal Pseudo-Gaussian augmented Self-attention
The self-attention mechanism successfully replaced recurrence in the field of language modeling [47]. The similarities between language modeling and time series analysis suggest that self-attention might be a promising method for TSC. Self-attention in MTSC has been explored by [59,60]. However, two main reasons motivated a reformulation of the attention calculation in the self-attention mechanism for TSC.
In the reformulated method, the calculation of attention weights for each time sample is not limited to the relative similarity of that sample's content with the other samples; it also depends on its relative positional placement. The attention weights are modified so that neighboring samples receive more consideration, based on the content of the current sample, in a pseudo-Gaussian distribution form. We call this distribution pseudo-Gaussian because it resembles a Gaussian but is not symmetric; moreover, it is normalized after being combined with the self-attention weights.
The proper inputs for the self-attention mechanism are generated by transforming the input time series (F ∈ R^(N×d)) into the three elements Q, K, and V, as described by Eq. (4):

  Q = F W_Q,  K = F W_K,  V = F W_V    (4)

Q, K, and V stand for query, key, and value. Each has an N × d dimension, where N indicates the maximum sequence length and d is the feature array length. This operation is shown in Fig. 2 as passing the input through three fully connected layers. The self-attention mechanism then transforms the query and the set of key-value pairs into an output, as described in Eq. (5):

  A = softmax(Q K^T / √d),  O = A V    (5)

The output of self-attention, O, is calculated by multiplying V by A ∈ R^(N×N).
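Eqs. (4)-(5) describe the standard transformer attention of [47]; a compact NumPy sketch (random projection matrices stand in for the learned FC layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Baseline self-attention: F is (N, d); each projection W* is (d, d)."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv                      # Eq. (4)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # Eq. (5): (N, N)
    return A @ V, A

N, d = 16, 8
rng = np.random.default_rng(2)
F = rng.standard_normal((N, d))
O, A = self_attention(F, *(rng.standard_normal((d, d)) for _ in range(3)))
```

Nothing in this computation depends on the order of the rows of F, which is why the baseline model is indifferent to time order.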
The formulation for TPS is shown in Eq. (6), in which the attention matrix calculation is modified:

  O = N(S(Q K^T / √d) + A2) V,  with A1 = S(Q K^T / √d)    (6)

First, a scaling function S(.) is applied to the base attention matrix. Second, the scaled attention matrix (A1) is combined with the new pseudo-Gaussian temporal attention matrix (A2). Finally, the result of this addition is normalized (N(.)) by dividing each row by the sum of its elements.
The calculation of the additional pseudo-Gaussian temporal attention matrix A2 is presented in Eq. (7). Each row of A2 is denoted by P_i, where i indicates the row number. p_{i,j} represents the element that modifies the attention weight between time samples i and j based on their distance. This relation follows a pseudo-Gaussian distribution: the Gaussian variance differs depending on whether time sample j is placed before or after time sample i (σ̄²_i and σ²_i, respectively). Additionally, both σ̄_i and σ_i are calculated based on v_i, the value of time sample i. In Eq. (7), W̄ and W are 1 × d learnable weight matrices, and b is a configurable bias determined empirically.
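Since Eq. (7) is not reproduced in full here, the following NumPy sketch shows one plausible construction of the asymmetric pseudo-Gaussian matrix A2 and the combine-and-normalize step of Eq. (6). The exact functional form of σ̄_i and σ_i as functions of v_i is an assumption; only the asymmetry and the value dependence are taken from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_gaussian_matrix(V, W_before, W_after, b):
    """Asymmetric pseudo-Gaussian A2 (assumed form): the variance for j < i is
    driven by W_before and for j > i by W_after, both from the sample value v_i."""
    N = V.shape[0]
    sig_b = np.abs(V @ W_before.T).ravel() + b  # (N,) spread toward preceding samples
    sig_a = np.abs(V @ W_after.T).ravel() + b   # (N,) spread toward following samples
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    sig = np.where(j < i, sig_b[:, None], sig_a[:, None])
    return np.exp(-((j - i) ** 2) / (2.0 * sig ** 2))

def tps_attention(Q, K, V, W_before, W_after, b=1.0):
    """Eq. (6): scale the content attention, add A2, then row-normalize."""
    A1 = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # scaling S(.)
    A = A1 + pseudo_gaussian_matrix(V, W_before, W_after, b)
    A = A / A.sum(axis=-1, keepdims=True)                   # normalization N(.)
    return A @ V, A

N, d = 16, 8
rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O, A = tps_attention(Q, K, V, rng.standard_normal((1, d)), rng.standard_normal((1, d)))
```

The row-wise division reproduces the final normalization of Eq. (6), so each row of the combined attention matrix still sums to one.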
The complete TPS processing structure is shown in Fig. 2. It is inspired by the encoder structure first presented in [47]. The suggested application of TPS, incorporating it into a base TSC model, is shown in Fig. 3. It is independent of the TSC model's architecture: TPS can be integrated into a model as simple as a single FC layer or as complicated as InceptionTime [8]. Our method enables general users to enhance the performance of an existing model by simply adding the TPS module. Positional encoding (PE) allows the model to utilize sequence order by adding information about the absolute position of each time sample into its embedding. We used the learnable positional encoding introduced in [62] to project positional information into the input array (shown by ⊕ in Fig. 3). PE injection is placed after the base model, similar to its placement (after the embedding/FC layer) in [47]. Unlike self-attention encoders, 1D-CNN TSC classifiers do not need positional encoding, since the convolution kernel operations inherently encode positional placement into the output. Thus, PE only needs to be injected before the data enters the self-attention layer.
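The learnable PE injection (⊕ in Fig. 3) amounts to adding one trainable vector per position to the embedded series; a minimal sketch (the initialization scale is illustrative, and the vectors would be updated by the optimizer during training):

```python
import numpy as np

class LearnablePE:
    """Learnable positional encoding: one trainable d-vector per position,
    added to the embedded series right before the self-attention layer."""
    def __init__(self, n_max, d, rng):
        self.pe = 0.02 * rng.standard_normal((n_max, d))  # trained with the model

    def __call__(self, X):
        # X: (N, d) embedded time series, with N <= n_max
        return X + self.pe[: X.shape[0]]

rng = np.random.default_rng(4)
pe = LearnablePE(n_max=32, d=8, rng=rng)
X = rng.standard_normal((16, 8))
Y = pe(X)
```

Because the addition is position-wise, sequences shorter than n_max simply use a prefix of the table.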

IV. EXPERIMENTAL RESULTS
We explored different experimental settings to evaluate and compare GTA and TPS on different applications.
1) Benchmark Datasets: The UEA Multivariate TSC archive [7] is a collection of 30 datasets of various types, dimensions, and series lengths. These datasets are selected from various applications, including human activity recognition and classification of motion, electrocardiogram (ECG)/electroencephalography (EEG)/magnetoencephalogram (MEG) signals, audio spectra, and more. The dimensions of these datasets vary from 2 (AtrialFibrillation, Libras, PenDigits) to 1345 (DuckDuckGeese). The series lengths vary from 8 (PenDigits) to 17984 (EigenWorms) time samples. The collection was introduced as a benchmark for the standardized evaluation of MTSC algorithms. We used the UEA archive to assess and compare the proposed blocks against the state-of-the-art TSC methods.
2) Hyperparameters and Hardware Setup: Following [8,10], we utilized a standard setting with unified hyperparameters across all datasets. A uniform set of hyperparameters provides a fair and comparable testing ground. Batch size is the only parameter that does not have an identical value across all datasets; it is set to a lower number for some datasets due to hardware limitations (EigenWorms: 6, EthanolConcentration: 16, MotorImagery: 4, SelfRegulationSCP2: 32). For the rest of the datasets, batch size is set to 64. The remaining training parameters are the same for all models and all datasets. The initial learning rate, loss function, optimizer, and number of epochs are set to 0.0001, categorical cross-entropy, Adam, and 400, respectively. We also used a learning rate scheduler, which decreases the learning rate by a factor of 0.1 if the validation loss does not improve after 20 consecutive epochs. We used the exact model specifications, including kernel sizes and counts, for the CNN-based models (FCN, ResNet, and InceptionTime). For the self-attention encoders, the number of layers and heads is set to 1, and the input embedding dimension size is set to 128. Our experiments are performed on a Compute Canada [63] node equipped with an NVIDIA V100 Volta GPU (32 GB HBM2 memory).
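The learning-rate schedule described above can be sketched in a few lines of Python; this mirrors a reduce-on-plateau policy and is a standalone illustration, not the actual training code:

```python
def plateau_scheduler(val_losses, lr0=1e-4, factor=0.1, patience=20):
    """Multiply the learning rate by `factor` whenever the validation loss has
    not improved for `patience` consecutive epochs (as used in our setup)."""
    lr, best, wait = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor              # plateau reached: decay the LR
                wait = 0
        history.append(lr)
    return history

# One improving epoch followed by 20 stagnant ones triggers exactly one decay.
lrs = plateau_scheduler([1.0] + [1.0] * 20)
```

Frameworks such as PyTorch provide this behavior out of the box (e.g., a reduce-on-plateau scheduler), which is a reasonable drop-in for reproducing the setup.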

A. Comparison with state of the art
In this section, we compare five state-of-the-art multivariate TSC models ([31,32,38,44,45]) against three baseline deep learning TSC models (FCN [11], ResNet [11], and InceptionTime [8]). Table II encapsulates the performance accuracies for these methods. As the average accuracies on UEA show, all five state-of-the-art models [31,32,38,44,45] underperform compared to the three baseline models. Therefore, we chose to add the proposed GTA and TPS augmentations to the baseline models. In Table II, we also compare the baseline models (FCN, ResNet, and InceptionTime) with their GTA- or TPS-augmented counterparts. For GTA augmentation, the GTA block is added after each computing layer of the model, as shown in Fig. 1. As for TPS, the modified model is shown in Fig. 3.
The accuracy values for UEA datasets are shown in Table II.These results are the average of 5 independent runs for each model.Rank Average (the mean rank of each model in terms of its accuracy for each specific dataset, lower numbers are better) is presented in the last row of Table II.Bold numbers indicate the highest value in each row.Colored numbers in each vertical section highlight results higher than the base model (shown in red).
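For reference, the Rank Average metric used in Table II can be computed as follows. This is a sketch with hypothetical accuracies; ties share the mean rank, which is the usual convention, though the tie-handling rule is not stated in the text:

```python
def rank_average(acc_table):
    """Per dataset, rank models by accuracy (rank 1 = best, ties get the mean
    rank), then average each model's ranks over all datasets (lower is better).
    acc_table: dict mapping model name -> list of per-dataset accuracies."""
    models = list(acc_table)
    n_datasets = len(next(iter(acc_table.values())))
    ranks = {m: 0.0 for m in models}
    for i in range(n_datasets):
        accs = sorted(((acc_table[m][i], m) for m in models), reverse=True)
        pos = 0
        while pos < len(accs):
            end = pos
            while end + 1 < len(accs) and accs[end + 1][0] == accs[pos][0]:
                end += 1                      # extend over tied accuracies
            mean_rank = (pos + end) / 2 + 1   # mean rank shared by the tie group
            for k in range(pos, end + 1):
                ranks[accs[k][1]] += mean_rank
            pos = end + 1
    return {m: r / n_datasets for m, r in ranks.items()}

# Hypothetical example: model "B" ties with "A" on the second dataset.
avg = rank_average({"A": [0.9, 0.8], "B": [0.8, 0.8], "C": [0.7, 0.6]})
```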
From Table II, "InceptionTime + TPS" delivers the highest average accuracy. Adding TPS consistently improves the base models' accuracy by up to 3.0%. Adding GTA also improves the accuracy of all three models, though only marginally for the model with the largest temporal receptive field (InceptionTime). From the rank average metric, the highest-ranking models are "InceptionTime + TPS" and "InceptionTime + GTA". Adding PE to TPS only improves accuracy over TPS for the FCN base model, demonstrating that the effectiveness of Absolute Positional Encoding (PE) depends on the base model's architecture.
Figure 4 depicts the Wilcoxon-Holm post hoc Critical Difference (CD) diagrams for each base model section of Table II (one for every four columns under each base model). IT32 and RES stand for InceptionTime and ResNet, respectively. The CD diagrams imply that the GTA model produces better rankings than the TPS model; however, there is no significant probabilistic difference between the two.

B. Standalone TPS performance analysis
This section presents experimental results highlighting the effectiveness of reformulating self-attention with Temporal Pseudo-Gaussian augmentation in the temporal attention blocks. These experiments also include an accuracy comparison between TPS and two of the latest state-of-the-art works on positional information injection in transformers (TUPE [51] and DeBERTa [50]). DA-transformer [55] was also considered, but unlike the other two, no public reproduction material was available for it, so it was left out of the quantitative comparison. The baseline self-attention (SA) classifier comprises a TSC base model with a self-attention encoder [47] added at the output (this can be imagined as a modified version of Fig. 3, where the PE is removed and SA replaces the TPS encoder). Adding PE to the SA and TPS models creates SA+PE and TPS+PE, respectively. The TSC base model here (Fig. 3) is an FC layer that converts the data's dimension to 128 for input to the self-attention and TPS encoders.
The performance accuracy of each model on the UEA datasets is presented in Table III. The numbers in each row represent the average of five runs, and bold numbers highlight the highest value in each row. From this table, replacing SA with the TPS block improves the accuracy of the network by an average of 8.7%. SA + PE improves accuracy by 5.5%. TPS + PE results in the highest accuracy increase of 11%. TUPE and DeBERTa do not reach the average accuracy of the SA + PE model, even though both have newer ways of processing positional information. That could be because each method was designed to deal with different problems in different Natural Language Processing (NLP) tasks, and they therefore lost generality compared to the original SA encoder. As a result, a lower average accuracy than TPS was expected. Even though DeBERTa uses both Rel and Abs reference points and both the APE and MAM injection methods, its average accuracy is still lower than TPS + PE's. Based on the characteristics of TUPE, one would expect a performance better than TPS and worse than TPS + PE; it, however, did not catch up to either of them. A limitation of the TUPE algorithm is that it cannot operate on short sequences; as a result, it could not generate any results for the InsectWingbeat and PenDigits datasets.
Comparing the accuracy results for FCN and ResNet in Table II with the standalone TPS+PE unit in Table III shows that the standalone TPS+PE unit outperforms both FCN and ResNet, while TPS has fewer learnable parameters and lower computational complexity than either. The number of learnable parameters for TPS depends on the number of layers l (here 1), the number of attention heads h (1), the hidden dimension size d (128), and the dimension of the test dataset, d_dataset. Based on these values, the number of learnable parameters for the standalone TPS model is about d_dataset × 128 + 182k. Meanwhile, the number of learnable parameters for the FCN model is (8 × d_dataset + 1) × 128 + 267k, and ResNet has almost twice that. The learnable PE unit adds 2N × d_dataset additional learnable parameters (N is the max sequence length). Based on Table III, TPS+PE has more learnable parameters than FCN only for the "EigenWorms" and "MotorImagery" datasets. Interestingly, for both cases, standalone TPS without PE performs better than both TPS+PE and all base models. Additionally, if the number of learnable parameters is a limiting factor for PE, non-learnable PE functions [47] could be explored as an alternative.
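The parameter-count estimates above can be checked with a few lines of arithmetic. The functions below simply encode the approximate formulas stated in the text (for the l=1, h=1, d=128 configuration); the function names are illustrative.

```python
def tps_params(d_dataset, d=128):
    """Approximate learnable parameters of the standalone TPS unit
    (l=1 layer, h=1 head, d=128), per the estimate in the text."""
    return d_dataset * d + 182_000

def fcn_params(d_dataset, d=128):
    """Approximate learnable parameters of the FCN baseline."""
    return (8 * d_dataset + 1) * d + 267_000

def pe_params(d_dataset, n_max):
    """Extra parameters added by the learnable PE unit."""
    return 2 * n_max * d_dataset

# e.g. a 2-channel dataset with max series length 640,
# such as AtrialFibrillation:
print(tps_params(2))              # 182,256
print(fcn_params(2))              # 269,176
print(pe_params(2, 640))          # 2,560
```

For typical UEA datasets the PE term is small, so TPS+PE stays well below FCN; only very long, high-dimensional series (e.g. "EigenWorms", "MotorImagery") flip that comparison, as noted above.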

C. Qualitative Attention Analysis
This section visualizes the effect of asymmetrical pseudo-Gaussian positional attention on the content-correlation attention matrix. Pseudo-Gaussian attention injects positional information into the transformer and forces the encoder to find new, unexplored relations between the time samples. Visualizations are performed on two instances trained and tested on the AtrialFibrillation [7] and PenDigits [7] datasets. Each dataset consists of 2D multivariate time series arrays; however, their series lengths differ substantially (640 for AtrialFibrillation and 8 for PenDigits). AtrialFibrillation is composed of two-channel ECG signal recordings, and the task is to predict spontaneous atrial fibrillation (AF) termination (3 classes) from 5-second recordings sampled at 128 samples per second. PenDigits consists of time-recorded x and y coordinates of a pen tip while the pen writes a digit (0-9) on a digital 500 × 500 screen; the axial coordinates are normalized to 100 × 100 and resampled into 8 time samples. Fig. 5 shows the P and σ values calculated (Eq. (7)) by the TPS model for two multivariate time series samples from these two datasets.
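To make the role of P and σ concrete, here is a hedged sketch of an asymmetric pseudo-Gaussian positional weighting. This is not the paper's exact Eq. (7); it only illustrates the idea that each query position gets a Gaussian-shaped weight over key positions, centered at a learned point, with separate spreads for past and future samples.

```python
import numpy as np

def pseudo_gaussian_attention(T, centers, sigma_b, sigma_f):
    """Illustrative asymmetric pseudo-Gaussian positional weights.
    For each query position i, the weight over key position j decays
    as a Gaussian around centers[i], with spread sigma_b[i] for past
    samples (j < center) and sigma_f[i] for future samples."""
    j = np.arange(T)[None, :]                  # key positions, shape (1, T)
    c = np.asarray(centers, float)[:, None]    # per-query centers, (T, 1)
    delta = j - c                              # signed relative position
    sigma = np.where(delta < 0,
                     np.asarray(sigma_b, float)[:, None],   # backward spread
                     np.asarray(sigma_f, float)[:, None])   # forward spread
    return np.exp(-0.5 * (delta / sigma) ** 2)  # (T, T) positional weights
```

With sigma_b < sigma_f, the weight mass shifts toward future samples (as observed for PenDigits); with small spreads relative to T, it stays narrowly banded around the diagonal (as for AtrialFibrillation).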
As shown in Fig. 5, calculating separate σ values for the backward and forward directions provides flexibility in the distribution of neighboring attention. For AtrialFibrillation, the distribution of attention varies across different points. The backward σ values appear higher than the forward ones, indicating that more attention is placed on the previous samples. Also, the pseudo-Gaussian attention spans only a small temporal range compared to the series length. For the PenDigits sample, the attention is directed toward future samples, with larger forward σ values that make it span the entire series.
Fig. 6 shows the attention maps of the SA and TPS models (A in Eq. (6)) for two sample multivariate time series. From the second row, we can conclude that the self-attention mechanism takes a weighted average of the key time points (possibly because of the GAP layer). For TPS, in contrast, self-attention is forced to identify connections between multiple time points along the diagonal direction.
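One plausible way such a positional term can reshape the attention matrix A is an additive bias on the pre-softmax logits; the paper's exact combination in Eq. (6) may differ, so treat this as an assumption-laden sketch rather than the TPS formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, pos_bias):
    """Content-correlation logits QK^T/sqrt(d), augmented with a
    positional bias before the softmax (assumed additive injection;
    not necessarily the paper's exact Eq. (6))."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits + pos_bias)   # (T, T), rows sum to 1
```

A diagonal-favoring bias (zero on the diagonal, increasingly negative off it) provably increases each row's diagonal weight, which matches the banded attention maps TPS produces in Fig. 6.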
Our experimental results show that adding either proposed block to existing TSC models improves their performance. Since there is no statistically significant difference between how much the two blocks improve performance, the choice of block depends on the task and the ease of implementation. Furthermore, we showed that a TPS block can be used as a standalone TSC model with comparatively good performance and lower computational complexity than the state-of-the-art. The asymmetrical pseudo-Gaussian positional attention is the main reason the TPS block works well: it feeds relative positional information into the transformer model and forces the transformer to form new and better content-correlation attention matrices.

V. CONCLUSIONS
This paper presented two novel attention blocks, GTA and TPS, for deep learning-based TSC networks. We showed that incorporating these two blocks into DNN-based TSC models can improve their performance. GTA is proposed as a sublayer attention block placed after each 1D convolutional layer block, whereas TPS is presented as an add-on block that reprocesses the output of a TSC model. Experiments on the UEA benchmark archive highlighted the advantage of adding the TPS and GTA blocks to three state-of-the-art baseline deep learning-based TSC models. These experiments demonstrated that both blocks improve the accuracy and average rank of all three baselines; however, the improvement varies by application, being marginal in some cases and substantial in others. Since there is no statistically significant difference between the two methods, the choice can be based on the task. We also showed that the TPS block can be used as an independent TSC unit. The standalone TPS unit outperforms the state-of-the-art transformer positional-information injection methods at TSC. Additionally, an independent TPS unit coupled with PE performed better than both the base FCN and ResNet models, with almost half and one-sixth of their number of learnable parameters, respectively.

Fig. 4. CD diagram comparing the average ranks of different networks which use the same base models of (a) FCN, (b) ResNet, and (c) IT.

Fig. 6. Attention map comparison between TPS and SA models on samples from (a) AtrialFibrillation and (b) PenDigits datasets.

In this table, N refers to the max sequence length, h is the number of attention heads, l represents the number of layers, and d is the input dimension size.

TABLE II. ACCURACY [%] COMPARISON BETWEEN STATE-OF-THE-ART BASELINE TSC MODELS AND PROPOSED GTA, TPS, AND PE BLOCKS ON UEA

TABLE III. SELF-ATTENTION AND TPS ACCURACY [%] COMPARISON ON UEA