Electromyography Based Decoding of Dexterous, In-Hand Manipulation Motions With Temporal Multichannel Vision Transformers

Electromyography (EMG) signals have been used in designing muscle-machine interfaces (MuMIs) for various applications, ranging from entertainment (EMG controlled games) to human assistance and human augmentation (EMG controlled prostheses and exoskeletons). For this, classical machine learning methods such as Random Forest (RF) models have been used to decode EMG signals. However, these methods depend on several stages of signal pre-processing and extraction of hand-crafted features so as to obtain the desired output. In this work, we propose EMG based frameworks for the decoding of object motions in the execution of dexterous, in-hand manipulation tasks using raw EMG signals input and two novel deep learning (DL) techniques called Temporal Multi-Channel Transformers and Vision Transformers. The results obtained are compared, in terms of accuracy and speed of decoding the motion, with RF-based models and Convolutional Neural Networks as a benchmark. The models are trained for 11 subjects in a motion-object specific and motion-object generic way, using the 10-fold cross-validation procedure. This study shows that the performance of MuMIs can be improved by employing DL-based models with raw myoelectric activations instead of developing DL or classic machine learning models with hand-crafted features.


I. INTRODUCTION
HUMAN-MACHINE Interfaces (HMI) are finding an increased use in activities of daily living in recent years. For this purpose, biological signals can be used to develop such interfaces, as they carry vital information from the human physiological system. In tasks such as controlling bionic devices, e.g. prosthetic arms and hands, the most commonly employed method is Electromyography (EMG). These signals measure the myoelectric activations of the human muscles generated during contraction and offer an intuitive method for developing HMIs. EMG-based interfaces can decode human movement intention to classify hand gestures and motions [1], [2], as well as the execution of in-hand manipulation motions with an object or continuous human-hand motions [3], [4]. One of the main types of dexterous, in-hand manipulation is Equilibrium Point Manipulation (EPM), in which the contact points of the fingers remain relatively stationary on the object surface while the object is manipulated (see Fig. 1). Robotic arm-hand systems are able to achieve EPM [5], which can be employed to execute tasks such as object inspection or in-hand repositioning or reorientation.
Developing an EMG-based control scheme for intuitively executing EPM tasks with a robot or prosthetic hand is a new research direction that has achieved promising results [6], [7]. Machine learning (ML) techniques have been employed to analyse and decode EMG signals in the past few years. A classic machine learning model-based control system for an assistive device generally depends on prior signal preprocessing and feature engineering steps before obtaining the desired classification/regression output [8]. With classic tools such as RF, a feature vector set is extracted from raw data after processing the signal. Time-domain (TD) features have been proved to be a feature class computationally less expensive to calculate, achieving more consistent performance compared to frequency-domain features [9]. Castellini et al. [10] compared the results achieved by Neural Networks (NN), Support Vector Machines (SVM), and Locally Weighted Projection Regression to predict the type of grasp and the grasping force through regression. It was found that none of the tested approaches showed outstanding results among the others, indicating that ML as a whole is a viable approach. Liarokapis et al. [11] proposed a task-specific framework for myoelectric activations based on decoding the reach to grasp motions. When comparing the performance of RF with Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis, K-Nearest Neighbours (kNN), NN, and SVM, they found that task-specific models outperform general models, and RF methodology based learned models showed better performance than other learning techniques in classification and estimation accuracy.
Deep learning (DL) approaches have great potential for decoding the human motion or intention from the myoelectric activity. Due to their large number of parameters compared to conventional function approximators and their non-linear activation functions, DL techniques allow relating more abstract domains and counter domains. End-to-end DL-based models automatically identify and learn high-level features from processed input or raw data using multiple hidden layers, resulting in an increasingly complex and robust system without the need for prior feature extraction. Two well-established DL methods are convolutional neural networks (CNN) and recurrent neural networks (RNN). In recent studies, they have been employed in executing several classification [12] and regression [13]-[15] tasks. In [14], the authors proposed a scheme for estimating the direction and magnitude of the force applied to a grasped object using sEMG of the forearm with a CNN. Chen et al. [15] simultaneously predicted the forces of multi-DOF individual fingers based on high-density EMG signals. They compared the performance of a CNN model and a CNN plus RNN model with classical methods such as common spatial pattern, and concluded that methods based on neural networks significantly outperform traditional methods.
Transformer architectures [16] represent the state-of-the-art in Natural Language Processing (NLP) tasks and have recently been employed in new fields, such as image recognition through the Vision Transformer (ViT) [17]. This architecture has great potential in solving problems that hinder the adoption of RNNs and CNNs, such as the inherent impossibility of parallelisation of the former and the significant computational power needed for training the latter. Recent research has developed Transformer-based models for usage in other tasks. Regarding biological signals, Krishna et al. [18] proposed a Transformer-based automatic speech recognition model that uses statistical features extracted from EEG signals as input. Other recent works also employed Transformer-based models in classification tasks using EEG signals as input for emotion recognition [19] and EMG signals as input for hand gesture classification [20]. These recent advances open up a new range of application areas. However, to the best of the authors' knowledge, no Transformer-based model has ever been developed for regression using raw biological signal data with multiple channels and time steps as input.
In our previous works, we proposed a learning scheme based on the RF regression method to map the myoelectric activations of the muscles of the forearm and the hand to the object's motion. We studied the optimal muscle selection for the sEMG-based decoding of these in-hand manipulation motions [6], [21]. Then we explored how the EMG signals vary across different subjects of different genders and with different hand sizes, assessing the decoding models' performance [3]. In this paper, we extend our previous works by proposing a new Framework based on two novel Transformer-based regression models. We further compare the results with the results obtained with a CNN benchmark model and with the results of the previously proposed RF model. The rest of the paper is organised as follows. Section II presents the details regarding the dataset used in this work, a comprehensive description of the Transformers and ViT architectures, and a brief explanation of the CNN benchmark model. Section III presents the results obtained, which are discussed in detail in Section IV. Finally, Section V concludes the paper and presents potential future directions.

A. Dataset
We used the dataset collected by Dwivedi et al. [3] to test the proposed models' performance. The dataset was collected for 11 non-disabled subjects, five males and six females. More information regarding the subjects can be found in Table I. The experiments were performed by each subject with their dominant hand. Each subject performed 3-dimensional equilibrium point manipulation tasks using the Rubik's cube, the chips can from the Yale-CMU-Berkeley (YCB) grasping object set [22], and a custom-made off-center cube. Each manipulation task session was executed with a sequence starting with a 5 sec rest period followed by five repetitions of the manipulation motion for each trial. Adequate time to rest (approximately 30 sec) was given to each subject between trials to reduce the muscles' fatigue. There were 10 of these trials per session. The manipulation tasks performed during the experiments were: pitch, roll, and yaw. More information regarding the manipulation tasks can be found in [3].
The myoelectric activations were measured from eight muscles of the hand and eight forearm muscles using double differential EMG electrodes. The EMG signals were acquired at a sampling rate of 1200 Hz by the bioamplifier, which bandpass filtered the data using a Butterworth filter (5 Hz-500 Hz). The electric line noise was filtered out using a notch filter of 50 Hz.

B. Preprocessing
To evaluate the motion decoding capabilities of our proposed methods, we tested the models for raw and processed data. In order to train models using these methods, the input data needs to be segmented first.
1) Window Size: The procedure described in [3] was employed to segment the data. The signals were segmented into sample sets using a sliding window of 200 ms with a 10 ms increment. According to the literature, the window size is selected to be larger than 125 ms to avoid high biases and variance [23] and smaller than 300 ms due to real-time constraints [24].
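The segmentation step described above can be sketched as follows. This is a minimal illustration rather than the code used in [3]: the function names `window_params` and `segment` are hypothetical, and the synthetic recording only stands in for real EMG. At 1200 Hz, a 200 ms window with a 10 ms increment corresponds to 240 samples with a step of 12 samples.

```python
def window_params(fs_hz, window_ms, step_ms):
    """Convert window length and increment from milliseconds to samples."""
    window = round(fs_hz * window_ms / 1000)
    step = round(fs_hz * step_ms / 1000)
    return window, step


def segment(signal, window, step):
    """Split a multichannel recording into overlapping windows.

    `signal` is a list of channels, each a list of samples of equal length.
    Returns a list of windows; each window is a list of per-channel slices.
    """
    n_samples = len(signal[0])
    windows = []
    for start in range(0, n_samples - window + 1, step):
        windows.append([channel[start:start + window] for channel in signal])
    return windows


# 200 ms window with a 10 ms increment at a 1200 Hz sampling rate
WINDOW, STEP = window_params(1200, 200, 10)

# Synthetic 16-channel, 1-second recording (placeholder values, not real EMG)
recording = [[float(t + c) for t in range(1200)] for c in range(16)]
windows = segment(recording, WINDOW, STEP)
```

Each resulting window is a 16 × 240 matrix, which matches the input size quoted for the DL models.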
2) Raw Data: Raw data implies minimal preprocessing is employed. In the case of raw data, the signals are only filtered by the bioamplifier and segmented before being fed to the algorithms. The use of raw data as input is only possible due to the ability of DL algorithms to learn discriminative features even from noisy data. With raw data as input, the RF method cannot be used to train successful decoding models, as shown in our previous work [3]. Automatic feature extraction using DL in other biological signals, such as EEG, has been reported as being more robust and having more potential than hand-crafted features [25]. Our models identify patterns and characteristics that feature engineering could miss.
3) Processed Data: Three time-domain features were extracted from each EMG channel: Root Mean Square Value (RMS), Waveform Length (WL) and Zero Crossings (ZC). More information regarding the features used can be found in [26].
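As an illustration, the three time-domain features can be computed per channel as sketched below. The definitions follow common textbook forms and are an assumption here; the exact definitions used in this work are those given in [26]. The `threshold` parameter of `zero_crossings` and the helper `extract_features` are hypothetical.

```python
import math


def rms(x):
    """Root Mean Square value of one channel window."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def waveform_length(x):
    """Cumulative absolute difference between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def zero_crossings(x, threshold=0.0):
    """Count sign changes whose amplitude step exceeds `threshold`.

    The threshold is an assumed noise guard; 0.0 counts every sign change.
    """
    count = 0
    for a, b in zip(x, x[1:]):
        if a * b < 0 and abs(a - b) > threshold:
            count += 1
    return count


def extract_features(window):
    """Map a (channels x samples) window to a (channels x 3) feature matrix."""
    return [[rms(ch), waveform_length(ch), zero_crossings(ch)] for ch in window]


# Toy single-channel example (not real EMG)
example = [1.0, -1.0, 2.0, -2.0]
features = extract_features([example])
```

Applied to a 16-channel window, `extract_features` yields a 16 × 3 matrix, i.e. one row of RMS, WL, and ZC values per EMG channel.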

C. Training and Evaluation
Our models were trained on a Google Colab Pro virtual machine with GPU. The models were developed in Python using TensorFlow and Keras, employing a hyperparameter optimization framework [27] during 200 trials. Then, the hyperparameters were fine-tuned by executing cross-validation with 10% of the available training data for optimization by empirical evaluation. The mean squared error (MSE) loss function was employed during training. The MSE is defined as follows:
$$ l = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$
where $l$ is the loss function, $y$ is the desired output, $\hat{y}$ is the predicted output, and $n$ is the number of samples. All models used Adam as the optimizer [28]. The trained model's efficiency is assessed using the Pearson correlation coefficient and the percentage of the Normalized Mean Square Error (NMSE), which represents the accuracy when comparing the predicted and the actual object motion. An NMSE value of 0% denotes a bad fit, whereas an NMSE value of 100% denotes that the two trajectories are identical. The NMSE value is derived as follows:
$$ \mathrm{NMSE}(\%) = 100 \left( 1 - \frac{\lVert x_r - x_p \rVert^2}{\lVert x_r - \bar{x}_r \rVert^2} \right) $$
where $\lVert \cdot \rVert$ indicates the 2-norm of a vector, $x_r$ is the actual reference motion, $\bar{x}_r$ is its mean, and $x_p$ refers to the predicted motion. All the results presented in Section III are an average of the 10-fold cross-validation, in which one separated repetition of the dataset is used for testing per fold.
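The evaluation metrics can be sketched in a few lines of Python. This is an illustrative sketch only. The NMSE form used here, normalizing the squared error by the spread of the reference around its mean, is an assumption chosen because it gives 100% for identical trajectories and 0% for a prediction no better than the mean of the reference. The function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between desired and predicted outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def nmse_percent(x_ref, x_pred):
    """NMSE-based accuracy in percent (assumed normalization, see lead-in)."""
    mean_ref = sum(x_ref) / len(x_ref)
    err = sum((r - p) ** 2 for r, p in zip(x_ref, x_pred))
    spread = sum((r - mean_ref) ** 2 for r in x_ref)
    return 100.0 * (1.0 - err / spread)


# Toy reference trajectory (not real motion data)
reference = [0.0, 1.0, 2.0, 3.0]
perfect = nmse_percent(reference, reference)
mean_only = nmse_percent(reference, [1.5] * 4)
```

Under this assumed form, a prediction that is worse than the mean of the reference yields a negative value, so scores are not bounded below by 0%.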
To assess the robustness of our algorithms, we compared the results for specific and generalized models, analyzing four different sets:
Subject-Specific and Object-Specific Models: For each subject, we trained and tested one model for each object.
Subject-Specific and Object-Generic Models: For each subject, we trained and tested one model for all the objects.
Subject-Generic and Object-Specific Models: With this set, we trained and tested subject-generic and object-specific models for females, males, small hand size (hand length ≤ 165mm), medium hand size (165mm < hand length ≤ 185mm), and large hand size (hand length > 185mm).
Subject-Generic and Object-Generic Models: With this set, we trained and tested subject-generic and object-generic models for females, males, small hand size, medium hand size, and large hand size.
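The hand-size grouping used for the subject-generic sets can be expressed directly from the thresholds above. The sketch below is illustrative: the function names are hypothetical and the hand lengths are made-up placeholder values, not the measurements of Table I.

```python
def hand_size_group(hand_length_mm):
    """Assign a hand-size group from the hand length in millimetres."""
    if hand_length_mm <= 165:
        return "small"
    if hand_length_mm <= 185:
        return "medium"
    return "large"


def group_subjects(hand_lengths):
    """Group subject identifiers by hand size.

    `hand_lengths` maps a subject identifier to a hand length in mm.
    """
    groups = {"small": [], "medium": [], "large": []}
    for subject, length in hand_lengths.items():
        groups[hand_size_group(length)].append(subject)
    return groups


# Placeholder hand lengths for illustration only
example_groups = group_subjects({"S1": 160.0, "S2": 172.5, "S3": 190.0})
```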
Finally, we evaluated the prediction time of each model for raw and processed data as input to assess the applicability of the solutions developed here in online applications.

D. Models
The following sections describe the DL models designed for decoding continuous motion (regression) using EMG activations as input. The last layer in all models comprises three neurons with a linear activation function to perform the roll, pitch, and yaw regression.
1) CNN: The first DL model that we built for regression is a CNN. This technique can identify patterns and extract spatial characteristics of the data. It is one of the most well-established DL techniques, representing the state-of-the-art in several tasks and application fields. Hence, the CNN is evaluated as a benchmark model in order to compare its results with our novel DL techniques. Our CNN model, shown in Fig. 2, comprises three convolutional blocks. Each block contains a convolutional layer, followed by batch normalization [29] and dropout [30] layers. The dropout rate is set to 0.3. The first two blocks also include max-pooling layers with dimensions of 1 × 5 each. The filters of the convolutional layers have dimensions of 1 × 20, 4 × 4, and 2 × 2, respectively. Three fully-connected layers with 256, 128, and 64 neurons follow the convolutional blocks.
2) TMC-T: The Transformers networks [16] were a milestone in NLP applications, as most of the state-of-the-art algorithms in this area are based on this architecture. Transformers are designed to process sequential data without suffering from vanishing gradients like the RNN, without presenting such complexity as the GRU or LSTM or the impossibility of parallelization inherent to these recurrent techniques. These architectures are based only on attention mechanisms, dispensing with any convolution or recurrence.
Transformer architectures employ an attention mechanism to create an attention-based representation for each element in the input sequence. The Transformer then focuses on the regions of most significant interest for a given input and, consequently, spends more computational resources on these regions. Unlike the attention mechanisms employed with RNNs, the Transformer computes these representations in parallel for each input element. The attention mechanism used by Vaswani et al. [16] was the Scaled Dot-Product Attention, given by
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $$
where $d_k$ is the dimension of the keys, $1/\sqrt{d_k}$ is the so-called scale factor, and $Q$, $K$, and $V$ hold the query, key, and value vectors, respectively, which are used inside the attention layers to compute the attention value for each element.
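To make the attention computation concrete, a minimal pure-Python sketch of Scaled Dot-Product Attention is given below. It is written for readability rather than speed and is not the implementation used in this work, which relies on TensorFlow and Keras. The helper names `matmul`, `transpose`, and `softmax` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example with two tokens and two-dimensional queries, keys, and values
queries = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(queries, queries, values)
```

Each row of `weights` sums to one, and each query places the most weight on the key it matches best. Because the score matrix has one entry per query-key pair, the cost grows quadratically with the sequence length.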
Vaswani et al. [16] employed attention in different positions of different representations of input subspaces through a mechanism called Multi-Head Attention, which allows parallel computation and calculates a richer representation of the input sequence. In the Multi-Head Attention, the same $Q$, $K$, and $V$ are multiplied by learned weight matrices, one set per head. The attention is then calculated for each of the $h$ heads, and the concatenation of the head outputs is multiplied by a matrix $W^O$ to generate the output of the Multi-Head Attention, as follows:
$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) $$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learned weight matrices, one for each head. The Transformer encoder receives the input after an embedding step that converts each input element to a vector of the same dimension. Following the embedding step, since this model does not use convolution or recurrence, position information for each element is added to the input via a positional encoding. Then, these embeddings are fed to a Multi-Head Attention block with h heads within the encoder. The resulting matrix is provided to a feed-forward network. A residual connection (Add) [31] is employed after both the Multi-Head Attention and the feed-forward network to pass along positional information through the encoder, together with a normalization (Norm) layer [32] to speed up learning. This encoder structure is shown in Fig. 3.
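The Multi-Head Attention computation can be sketched as below for the self-attention case, in which the queries, keys, and values are all projections of the same input. The sketch is a simplified illustration with hypothetical helper names and hand-picked weights. It omits the embedding, positional encoding, residual connections, and normalization of the full encoder.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, heads, w_o):
    """Apply every head to x, concatenate the head outputs, and project with w_o.

    `heads` is a list of (w_q, w_k, w_v) weight-matrix tuples, one per head.
    """
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs token by token
    concat = [sum((head[i] for head in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


# Hand-picked weights: head 1 keeps the first two dimensions, head 2 the last two
first_half = [[1, 0], [0, 1], [0, 0], [0, 0]]
second_half = [[0, 0], [0, 0], [1, 0], [0, 1]]
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
heads = [(first_half, first_half, first_half), (second_half, second_half, second_half)]

single_token = multi_head_self_attention([[1.0, 2.0, 3.0, 4.0]], heads, identity)
```

With a single input token, every head attends only to that token, so the hand-picked weights reproduce the input. This gives a quick sanity check on the concatenation order.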
Advantages of using Transformers are the ability to perform parallel computing and fast training time at the cost of not supporting large input sequences since the attention mechanisms scale quadratically with the input length. For many machine translation applications, in which the input is not that long, the quadratic cost to run the algorithm might not be a problem. However, the quadratic cost represents an obstacle with biological signals acquired at high frequencies for several seconds or even hours. To fulfil the task of processing biological signals with several channels, we developed a Transformer-based model named Temporal Multi-Channel Transformer (TMC-T). The TMC-T model comprises a Transformer block with eight heads and feed-forward networks with 32 neurons. For position encoding, learnable embeddings were used. For token embedding, a convolutional network was used. Using a CNN to generate the inputs' embedding has two purposes: 1) Learn and extract the embeddings. This model employed an embedding dimension of 32 for each token, which is the number of filters in the last convolutional layer within the CNN block. 2) Reduce the input dimension. Since the Transformers scale quadratically with the input length and our input is a matrix of 16 × 240, i.e. 16 channels of EMG samples of 200 ms acquired at 1,200 Hz, the convolutional layers followed by max-pooling layers reduce the input size while keeping the most relevant information. For raw data as input, the CNN block is composed of convolution layers followed by batch normalization and a dropout with a rate of 0.3, and max-pooling layers after each of the first two convolution layers. There are three convolution layers of 16, 32, and 32 filters of dimensions 1 × 20, 4 × 4, and 2 × 2. The max-pooling layers have dimensions of 1 × 5 and 1 × 4. After the CNN block, the output is reshaped to a matrix of dimensions 192 × 32, i.e. an input sequence of length 192 and embedding size of 32. After the Transformer blocks, a dropout of 0.5 and a dense layer of 64 neurons with ReLU activation function is employed (see Fig. 4).
Fig. 4. TMC-T Model for raw EMG data. Three convolution layers extract the embeddings and reduce the input dimension. After that, the result of the convolutions is flattened and supplied to Transformers' blocks. For processed data, filters of 4 × 1, 4 × 3, and 2 × 1 dimensions are employed. A max-pooling layer of 2 × 1 is used between the second and third convolutional layers, reducing the data dimensions from 16 × 3 to 8 × 3. After the convolutional layers, the data is reshaped to 24 × 32, in which 32 is the embedding dimension.
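The effect of the dimension reduction on the attention cost can be checked with simple arithmetic. The sketch below assumes that the convolutions preserve the spatial size (e.g. "same" padding) so that only the non-overlapping max-pooling layers shorten the time axis. This padding choice is not stated in the text and is an assumption. The comparison against using every raw sample as a token is hypothetical and serves only to illustrate the quadratic scaling.

```python
def pooled_length(n_samples, pool_sizes):
    """Length of the time axis after successive non-overlapping poolings."""
    for pool in pool_sizes:
        n_samples //= pool
    return n_samples


N_CHANNELS = 16
WINDOW_SAMPLES = 240  # 200 ms at 1200 Hz

# Max-pooling of 1 x 5 and 1 x 4 along the time axis: 240 -> 48 -> 12
time_steps = pooled_length(WINDOW_SAMPLES, [5, 4])
tokens = N_CHANNELS * time_steps

# Hypothetical baseline: one token per raw sample of the 16 x 240 window
raw_tokens = N_CHANNELS * WINDOW_SAMPLES

# Attention builds one score per token pair, so the cost scales with length squared
attention_saving = (raw_tokens ** 2) / (tokens ** 2)
```

Under these assumptions the sequence length is 192, in agreement with the 192 × 32 matrix supplied to the Transformer block, and the number of attention scores per head drops by a factor of 400 relative to the hypothetical per-sample tokenization.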
3) TMC-ViT: ViT is a Transformer model adapted to use images as input. Thus, instead of processing 1D sequential data, ViT uses 2D images as input. In a first step, the ViT subdivides the input image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches. Then, a linear embedding sequence of these patches and position embeddings are provided as input to a Transformer encoder (see Fig. 3). While the position embedding adds input topology information, the ViT processes the image with a linear projection of the flattened patches, whose components indicate low-dimensional correlations in the patches, and the Multi-Head Attention mechanism aggregates image information across all layers. Dosovitskiy et al. [17] employed this architecture for image classification using 16 × 16 patches. To adapt this network to process multi-channel EMG signals as input, we developed the Temporal Multi-Channel Vision Transformer (TMC-ViT) model. A CNN was used at the input to create the embeddings and reduce the input matrix to dimensions 16 × 16. The patches we used have a size of 4 × 4. Thus, sequential signals from multiple channels are interpreted as 2D images. The CNN mentioned above has 16, 32, and 64 filters of dimensions 1 × 20, 4 × 4, and 2 × 2. The first two convolutional layers are followed by max-pooling layers of 1 × 5 and 1 × 3, respectively. After each convolutional layer, a dropout rate of 0.2 and batch normalization layers are employed. Again, learnable embeddings were used for the position encoding, and a convolutional network was used for the token embedding. Since the last convolutional layer has 64 filters, the embedding has a dimension of 64. The numbers of attention heads and Transformer layers adopted were 4 and 8, respectively. The last dense layers have dimensions 2,048 and 1,024. This model is illustrated in Fig. 5.
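The patch extraction step can be illustrated with a short sketch. For readability it operates on a single-channel 16 × 16 map (C = 1) filled with placeholder indices, whereas the feature map produced by the CNN in the TMC-ViT has 64 channels. The function names are hypothetical, and the time-axis reduction assumes non-overlapping pooling with size-preserving convolutions.

```python
def num_patches(height, width, patch):
    """N = H * W / P^2 for non-overlapping P x P patches."""
    return (height * width) // (patch * patch)


def extract_patches(image, patch):
    """Split an H x W map into flattened P x P patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


# Time axis after 1 x 5 and 1 x 3 max-pooling: 240 -> 48 -> 16
reduced_time_steps = 240 // 5 // 3

# Placeholder 16 x 16 map whose entries encode their own position
feature_map = [[r * 16 + c for c in range(16)] for r in range(16)]
patches = extract_patches(feature_map, 4)
```

With a 16 × 16 map and 4 × 4 patches, the encoder receives a sequence of 16 patch tokens, which keeps the quadratic attention cost small.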

III. RESULTS
This section presents the object motion decoding accuracies obtained using the TMC-ViT, TMC-T, and CNN. Respective decoding models were developed using raw and processed myoelectric activations. All the results presented are an average of 10-fold cross-validation. For each evaluation set, the results for the motion decoding models developed using raw data will be presented first, followed by the results for the models developed using the processed data. The DL models were statistically validated using the analysis of variance (ANOVA). The null hypothesis for the analysis was that all models are the same. However, a p-value of 0.0049 was obtained for the models developed using raw EMG data, implying the results are statistically significant (p-value<0.05), thus rejecting our null hypothesis and concluding that there is a significant difference among the tested models developed for raw EMG data. On the contrary, for the models developed using processed EMG data, a p-value of 0.40 was obtained, indicating that there is no significant difference when trained using the processed EMG data, due to the limited amount of information that can be extracted from processed data.

A. Subject-Specific and Object-Specific
In this set, one model was trained for each object in a subject-specific way.
1) Raw Data: The results obtained by our Subject-Specific and Object-Specific models for raw EMG data are shown in Table II.
2) Processed: The results obtained by our Subject-Specific and Object-Specific models for processed data are shown in Table III. The results are further compared with our previously RF model [3].

B. Subject-Specific and Object-Generic
Here, we developed subject-specific models for all objects.
1) Raw Data: The results obtained for the subject-specific and object-generic models for raw data are presented in Table IV. It can be noticed that our TMC-ViT model surpassed the CNN benchmark model in both correlation and accuracy, achieving 89.68% and 79.09%, respectively. Moreover, the TMC-ViT presented a correlation above 80% and an accuracy above 60% for all tested subjects, demonstrating its robustness in learning the unique characteristics of each individual's EMG, and reached an accuracy of up to 93.63% for subject number nine. The TMC-T model achieved better accuracy and competitive correlation compared to the CNN model.
2) Processed Data: The results obtained for the subject-specific and object-generic models for processed data are presented in Table V together with the results achieved by the RF model. All the DL techniques learned from the features extracted during preprocessing. The DL models trained in this work and the RF model from our previous paper [3] achieved similar results. This indicates that our models could have reached a threshold in which the algorithms learned as much as possible from the data available. The TMC-ViT achieved the best results, with an average correlation of 81.01% and an accuracy of 63.10%. As was expected, the RF model could benefit from processed data, presenting competitive performance with the DL models when features extracted through feature engineering were used as input.

C. Subject-Generic and Object-Specific
This section presents the results for models trained for a set of subjects and each object.
1) Raw Data: The results obtained for subject-generic and object-specific models for raw data are shown in Table VI. The TMC-ViT model achieved the best results compared with the other models for raw data for any group of subjects or objects. The TMC-T and CNN obtained competitive results with each other.
2) Processed: The results obtained for subject-generic and object-specific models for processed data are shown in Table VII. Once again, all the models for processed data presented similar results. The DL models achieved better performance for the males and for those with medium and large hand sizes.

D. Subject-Generic and Object-Generic
1) Raw Data: The results obtained for subject-generic and object-generic models for raw data are shown in Table IX. The TMC-ViT achieved better performance than the TMC-T and CNN models, presenting correlation above 81% and accuracy above 65% for all the tested groups. The males achieved 93.14% correlation and 86.6% accuracy, representing the group with the highest results. Fig. 6 shows the actual vs decoded motion for the participants of the male group achieved by the TMC-ViT model.
2) Processed: The results obtained for subject-generic and object-generic models for processed data are shown in Table VIII. The TMC-ViT model achieved the best results, whereas the others showed similar results. The medium and large hand sizes presented competitive results for processed data and Transformer-based methods. The CNN and RF models performed worse for the large hand sizes when compared to medium hand sizes, indicating higher robustness of the TMC-ViT and TMC-T models. The DL models performed better when raw data was used as input compared with the RF model.

E. Prediction Time
In this section, we measured the time required by each model to predict a new motion sample. The TMC-ViT and other models were developed and optimized to present a high correlation and accuracy. The time presented here is an average of 100 trials for predicting 12,000 samples. First, we calculated the time required for processing the data, i.e. extracting the three hand-crafted features. It was found that the feature engineering takes an average of 1.43 seconds to extract the features from 12,000 samples. The corresponding table presents the prediction time for the three DL models for raw and processed data as input. In the latter case, the data processing time is already considered. Moreover, in this table, we show each model's prediction frequency and number of parameters. Finally, the results obtained for the RF model are also presented.
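A timing procedure of this kind can be sketched as follows. The helper names are hypothetical and the measured function is a placeholder, not one of the trained models. Only the 12,000-sample batch size, the 100 trials, and the 1.43 s feature extraction time come from the text.

```python
import time


def mean_runtime_s(fn, batch, n_trials=100):
    """Average wall-clock time of fn(batch) over n_trials runs, in seconds."""
    total = 0.0
    for _ in range(n_trials):
        start = time.perf_counter()
        fn(batch)
        total += time.perf_counter() - start
    return total / n_trials


def prediction_frequency_hz(n_samples, elapsed_s):
    """Number of samples processed per second."""
    return n_samples / elapsed_s


# Feature extraction alone: 12,000 samples in 1.43 s on average
feature_rate_hz = prediction_frequency_hz(12_000, 1.43)

# Placeholder workload standing in for a model's predict call
placeholder_time_s = mean_runtime_s(lambda b: [v * 2 for v in b], list(range(1000)), n_trials=5)
```

The feature extraction step alone corresponds to roughly 8.4 kHz, which bounds the prediction frequency of any model that relies on processed data. A prediction frequency of 4 kHz corresponds to 3 s per 12,000 samples.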
For the sake of comparison, we also optimized a deep CNN (DCNN) with a similar number of parameters to the ViT. The DCNN was trained and tested for the subject-generic object-generic set. This subset of experiments was chosen because it is the largest and most generic set with the greatest potential to benefit from a deeper model. The DCNN is composed of three convolutional blocks. The blocks contain two, three, and four convolutional layers with 32, 64, and 128 filters, respectively. This model achieved an accuracy of [49.80%, 84.20%, 47.92%, 75.05%, 82.24%] for the "female", "male", "small hand", "medium hand", and "large hand" groups respectively. Comparing these results with those presented in Table IX shows that the DCNN achieved the worst results among the tested models. Adding complexity to the model did not improve the performance compared to the smaller CNN.

IV. DISCUSSION
A. Subject-Specific and Object-Specific Models
The analysis of Tables II and III highlights the better performance of the DL techniques compared to the classic ML algorithm tested, i.e. the RF model. All tested DL models outperformed the RF model in correlation and accuracy for [...] is only achievable due to the ability of DL techniques to learn the relevant features even from raw data, extracting information that could be lost during feature engineering. Another advantage is that using raw data minimizes the need for prior knowledge regarding the signals.

B. Subject-Specific and Object-Generic
The analysis of Tables IV and V leads to two conclusions: i) once again, employing raw data to train the DL models improved both accuracy and correlation when compared to DL or classic ML techniques for processed data, ii) our TMC-ViT for raw data and processed data outperformed any other model for the respective data type. When comparing the results achieved by the TMC-ViT model for raw data and the RF model from our previous work, we can observe an increase of 10.31% in correlation and 17.35% in accuracy. Another interesting finding is that the performance difference between DL and classical ML techniques is even more significant for the Subject-Specific and Object-Generic models. The accuracy of the Subject-Specific and Object-Generic model employing TMC-ViT and raw data is 1.28 times the accuracy of the model using RF. For the Subject-Specific and Object-Specific models, for the Chips Can object, for example, this ratio is only 1.14. This is explained by the advantage of DL models over classical ML models growing with the size of the dataset.
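The reported figures can be cross-checked with a short calculation. The sketch assumes that the increases of 10.31% and 17.35% are absolute differences in percentage points. This interpretation is an assumption, adopted because it reproduces the stated ratio of 1.28. The variable names are illustrative.

```python
# TMC-ViT results for raw data (subject-specific, object-generic)
vit_correlation = 89.68
vit_accuracy = 79.09

# Reported increases over the RF model, read as percentage points (assumption)
gain_correlation = 10.31
gain_accuracy = 17.35

# Implied RF results under that reading
rf_correlation = vit_correlation - gain_correlation
rf_accuracy = vit_accuracy - gain_accuracy

# Ratio between the TMC-ViT and RF accuracies
accuracy_ratio = vit_accuracy / rf_accuracy
```

Under this reading, the implied RF accuracy is about 61.74% and the implied RF correlation about 79.37%, and the accuracy ratio rounds to 1.28.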

C. Subject-Generic and Object-Specific
From Table VI, it can be noted that the female subjects and those with small or medium hand sizes have a considerable drop in motion decoding accuracy and correlation for the chips can as compared to the Rubik's cube and the off-center mass cube, whereas the male subjects and those with bigger hand sizes have better performance with the off-center mass cube and worse with the chips can. When comparing the results obtained by the subject-generic and object-specific models (Table VI) with those of the subject-specific and object-specific models (Table II), it is noted that the DL models with raw data could benefit from the larger dataset, performing better for subject-generic models than for the subject-specific models. It is interesting to notice that, for the RF model, the females and those with smaller hand sizes have a considerable drop in motion decoding accuracy for the off-center mass cube compared to the Rubik's cube, the opposite of the DL models. The DL models performed better when raw data was used as input, surpassing the RF model as expected.

D. Subject-Generic and Object-Generic
In Fig. 7 we present both the correlation (dashed line) and accuracy (solid line) obtained by the TMC-ViT model for raw and processed data. The results of the RF model for processed data are also shown for comparison. The behaviour observed in all the training sets is evident here as well: i) the TMC-ViT model achieves higher results than the RF model and ii) using raw data as input enhances both correlation and accuracy for the DL models.

E. Prediction Time
The DL models performed better for raw data as input than for processed data, showing that removing feature engineering steps during data preprocessing can improve the applicability of the DL models in real-time applications. The DCNN achieved the worst prediction time among the tested models. The TMC-T and the CNN models for raw data presented a shorter prediction time than any other model for processed data, including the RF model. The TMC-ViT is the most robust model, which showed better accuracy and correlation results in all tests. The TMC-ViT is also one of the deepest models, presenting 4,950,067 parameters for raw data and, consequently, a longer prediction time. Even though the TMC-ViT model has shown a longer prediction time than the other models, it is still a suitable candidate method for online applications with a prediction frequency for 12,000 samples higher than 4 kHz.

V. CONCLUSION
In this work, we have proposed a novel end-to-end deep learning approach for decoding object motion in dexterous, in-hand manipulation tasks based on EMG signals. The proposed framework employs a Transformer-based architecture modified to receive as input EMG signals in order to achieve motion decoding. In particular, two new models called Temporal Multi-Channel Transformer and Temporal Multi-Channel Vision Transformer are introduced for solving the EMG-based decoding problem. We tested our models with raw and processed data as input and compared the results with a CNN benchmark model and an RF model proposed in previous works, representing the classic machine learning techniques.
Our models have been trained in subject-generic and subject-specific ways and an object-generic and object-specific manner. It can be seen that both the accuracies and the correlations increase when using DL models with raw data instead of DL or classic ML techniques with processed data. The DL models also generalized better than the classic ML models, achieving better results for the subject-generic object-generic model. In terms of accuracy and correlation, the Temporal Multi-Channel Vision Transformer achieved the best results among the tested models. The DL models showed a faster prediction time for raw data than for processed data. Hence, end-to-end DL approaches surpassed the use of processed data and/or classic ML techniques, such as RF.
Future work will focus on the information learned by the DL techniques by evaluating the patterns learned by the different attention heads in the Temporal Multi-Channel Vision Transformer model, using both sEMG and high dimensional-EMG signals as input.