Upper Limb Movement Decoding Scheme Based on Surface Electromyography Using Attention-Based Kalman Filter Scheme

Convolutional neural network (CNN)-based models are widely used in human movement decoding based on surface electromyography. However, they capture only the spatial information of the surface electromyography and lack prior knowledge of the system, resulting in unsatisfactory decoding accuracy. To address these issues, we propose an attention-based Kalman filter scheme (AKFS), which uses an attention-based CNN model to better extract temporal information and a KF to add prior knowledge of the system. We further solve the problem of insufficient data due to the short training time of new subjects by using a transfer learning method based on a fine-tuning strategy. The proposed scheme was tested in four scenarios: intra-session, intra-session long-time use, inter-subject, and inter-subject with a fine-tuning strategy. The proposed attention-based CNN model outperformed the vanilla CNN model and a hybrid CNN-long short-term memory (LSTM) model in intra-session and intra-session long-time use. After fine-tuning with a small amount of data on a new subject, the attention-based CNN model achieved higher decoding accuracy than the vanilla CNN model and lower response time than the CNN-LSTM model. Furthermore, the schemes with KF outperformed the schemes without KF in all scenarios. Our proposed scheme improves the decoding accuracy of the traditional CNN model for a single subject by better capturing the temporal information of the surface electromyography signal and increasing the prior knowledge of the system. Additionally, the proposed scheme can be easily transferred to a new subject using only a small amount of data.

In contrast with machine learning approaches [9], [10], these CNN models are not as time-consuming in feature engineering and can realize multiple degrees of freedom (DoFs) angle prediction in human movement decoding [11]. Moreover, these CNN models can use small amounts of data after electrode shifting and user change to retrain the model through a fine-tuning strategy to improve the robustness of the decoding method in the case of electrode movement and user change [12], [13], [14], [15]. However, due to neglect of the temporal information of sEMG and lack of prior knowledge of the system, the accuracy of human movement decoding based on CNN models is unsatisfactory, which limits its clinical application [16].
Many researchers have tried to improve the accuracy of human movement decoding schemes based on CNN models by proposing mixed structure models [17], [18], [19], [20]. Although these CNN models are popular solutions for human movement decoding based on sEMG, single CNN methods focus only on the spatial information of sEMG and ignore the temporal information of sEMG. Thus, a novel structure model to better capture spatial information and temporal information of sEMG needs to be explored. Because the long short-term memory (LSTM) structure is suitable for processing time series and has achieved considerable improvements in human movement decoding based on sEMG [21], [22], [23], [24], many researchers have proposed hybrid CNN-LSTM structures to improve the accuracy of human movement decoding based on CNN models by better capturing the temporal information of sEMG. Xia et al. [25] proposed a CNN-LSTM model to decode the upper limb movement in three-dimensional (3D) space. This method performed better than the support vector regression (SVR) and CNN models. Bao et al. [26] also proposed a CNN-LSTM method; they adopted a CNN model to extract features from Fourier-transformed sEMG and adopted an LSTM model to estimate the angles of human joints. The method is superior to a single CNN model in decoding accuracy. The attention mechanism has recently been proposed to improve the performance of LSTM-based models in natural language processing [27], [28], [29], [30] by focusing on more important information in time series through an attention weight matrix. Hu et al. [31] proposed an attention-based CNN architecture for sEMG-based gesture classification that outperformed CNN, LSTM, and CNN-LSTM models. However, their method focused solely on human discrete movement decoding. Consequently, the effectiveness of the attention mechanism in human continuous movement decoding needs investigation.
Inspired by the application of the attention mechanism in human discrete movement decoding, we hypothesized that it can further improve the performance of CNN-LSTM models in human continuous movement decoding using sEMG by better capturing the temporal information of sEMG.
Lack of prior knowledge of the system also leads to low accuracy of human movement decoding based on CNN models. To address this issue, many researchers have adopted post-processing strategies to increase the prior knowledge of human movement decoding systems. Englehart and Hudgins [32] proposed a majority vote approach that determines the current output of the decoding model by voting on its previous three or more outputs, thereby reducing classification errors. Amsuss et al. [33] proposed a post-processing strategy based on an artificial neural network to reduce classification errors. The movement was determined by the maximum likelihood estimation of the artificial neural network and the forearm muscle activity. Their method significantly reduced the classification errors of discrete movement decoding. Zhang et al. [34] proposed a post-processing strategy based on a threshold-based motion onset detection method to reduce classification errors. The method significantly reduced the classification errors compared to the original classification. These post-processing strategies add prior knowledge to human movement decoding schemes and significantly reduce the classification errors of the schemes. However, as they focus solely on human discrete movement decoding, a post-processing strategy for human continuous movement decoding needs to be explored.
The Kalman filter (KF) has been applied to CNN-LSTM models to improve the performance of human continuous movement decoding. KF uses Kalman gain to determine the weight of the internal transition and observation models and uses noise measurement to calculate the optimal output of the model [35]. Bao et al. [36] proposed a CNN-LSTM with a KF scheme to estimate the single DoF of the finger and wrist. The CNN-LSTM with KF scheme achieved significantly higher performance than the CNN scheme and CNN-LSTM schemes. However, their method focused solely on the prediction performance of a single DoF joint angle under the same subject. The performance of KF in predicting multiple DoF joint angles between different subjects requires further investigation.
To solve the low decoding accuracy of CNN due to the neglect of temporal information of sEMG and the lack of prior knowledge of the system, we propose a human movement decoding scheme named the attention-based KF scheme (AKFS), which combines an attention-based CNN model with a KF. We adopt the attention-based CNN model to improve the decoding accuracy of the CNN model by better capturing the temporal information of sEMG and adopt the KF as a post-processing strategy to improve the decoding accuracy of the CNN model by adding prior knowledge of the system. In contrast to previous studies, we use the attention mechanism to solve the problem of continuous motion decoding based on sEMG signals, and explore multi-DoF angle prediction with the KF across different subjects. The proposed scheme was evaluated in four scenarios: intra-session, intra-session long-time use, inter-subject, and inter-subject with a fine-tuning strategy. To evaluate the effectiveness of each part of the proposed scheme, we examined the performance of six different schemes (CNN, CNN-LSTM, attention-based CNN, CNN with KF, CNN-LSTM with KF, and AKFS) by combining the three decoding models with the two post-processing strategies under the four scenarios.

A. Subjects
Ten non-disabled subjects (all right-handed males, age: 26-30 years) participated in this experiment. None of the subjects had a history of neuromuscular diseases. Before the experiment, we informed the subjects of the process and purpose of the experiment, and they provided informed consent. The study was approved by the ethics committee of Changchun University of Science and Technology (CUST), 20210013, July 3, 2021.

B. Collection of sEMG Data
Seven dual-electrode wireless sEMG sensors (Noraxon, USA) were used to collect sEMG signals from seven positions corresponding to the following muscles: biceps, brachioradialis, flexor carpi radialis, flexor carpi ulnaris, triceps, extensor carpi ulnaris, and extensor digitorum communis (as shown in Fig. 1). Sensors were placed on the belly of muscles, according to the Noraxon instructions. The locations of the electrodes on the skin were cleaned with 75% alcohol to reduce the impedance between the skin and the electrodes. The sEMG signals were sampled at 1500 Hz and band-pass filtered by a seventh-order Butterworth filter in the range of 20-500 Hz.
Data from two sessions were collected for each subject.

C. Kinematic Data Collection
To ensure that all the subjects' motions had similar movement trajectories and to label the sEMG data to obtain sufficient training data, we developed a HoloLens virtual system (HVS), as shown in Fig. 2.
A red cross was generated in front of the subject according to the limb position by a Unity program run on HoloLens. The trajectory of the upper limb motion was simulated by the trajectory of the red cross. A laser pen was fixed to the hand of the subject. A green cross was also projected in front of the subject by the laser pen. The subject was instructed to control the green cross to track the trajectory of the red cross. An MPU6050 sensor was fixed on the laser pen to collect the continuous kinematic data of upper limb movement (the Euler angles of three axes were calculated from the data of the accelerometer and gyroscope). To synchronize the kinematic data with the sEMG of upper limb movements, the MPU6050 sensor communicated with an STM32F103 board at 115200 baud through a serial port and synchronized with Noraxon sEMG acquisition software. The kinematic data of the human upper limb were sampled by MPU6050 at 100 Hz. We normalized the collected sEMG signal and kinematic data, respectively, as shown in (1).
$$\hat{x}_i = \frac{x_i - u}{\sigma} \tag{1}$$
where $x_i$ is the $i$th sample of data, $u$ is the mean value of the data, $\sigma$ is the variance of the data, and $\hat{x}_i$ is the $i$th sample of normalized data.
During the experiment, the hand of each subject was placed on a cross-shaped metal rod with the elbow supported by a bracket. When the subjects felt uncomfortable, they were allowed to quit at any time. Each subject performed nine upper limb movements by tracking the trajectory of the red cross generated by the Unity program on the HoloLens.
The trajectories of the nine movements are shown in Fig. 3. All subjects were encouraged to control the green cross projected by a laser pen to follow the trajectories of a red cross projected by HoloLens without fatigue. Before the experiment, the subjects familiarized themselves with the nine upper limb movements and the HVS.

D. Data Segmentation
The sEMG signals were segmented using a 250 ms sliding window with 80% overlapping (Fig. 4). To label the sEMG signal, we downsampled the kinematic data from 100 Hz to 20 Hz (taking the last point every five points). The number of samples in the first second was 16, and thereafter, 20 samples per second. In this way, the kinematic data collected by MPU6050 can be labeled with the sEMG signals to interpret the movement of the human upper limb.
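The segmentation arithmetic above can be illustrated with a short NumPy sketch. It assumes the 1500 Hz sampling rate and seven channels stated earlier, so a 250 ms window spans 375 samples and an 80% overlap gives a 75-sample (50 ms) increment. The function names `segment_semg` and `downsample_labels`, the zero-filled placeholder arrays, and the label alignment (dropping the first four 20 Hz labels, which fall before the end of the first window) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

FS_EMG = 1500                 # sEMG sampling rate in Hz (from the text)
WIN = FS_EMG * 250 // 1000    # 250 ms window -> 375 samples
STEP = WIN * 20 // 100        # 80% overlap -> 75-sample (50 ms) increment


def segment_semg(emg):
    """Split a (samples, channels) recording into overlapping windows.

    Returns an array of shape (n_windows, WIN, channels).
    """
    n_windows = (emg.shape[0] - WIN) // STEP + 1
    if n_windows < 1:
        raise ValueError("recording is shorter than one window")
    return np.stack([emg[k * STEP:k * STEP + WIN] for k in range(n_windows)])


def downsample_labels(kin):
    """Keep the last of every five 100 Hz kinematic samples (100 Hz -> 20 Hz)."""
    return kin[4::5]


# Placeholder data only: 12 s of 7-channel sEMG and 3-axis Euler angles.
emg = np.zeros((12 * FS_EMG, 7))
kin = np.zeros((12 * 100, 3))

windows = segment_semg(emg)
# Assumption: the first four 20 Hz labels precede the end of the first window.
labels = downsample_labels(kin)[4:]
print(windows.shape, labels.shape)  # (236, 375, 7) (236, 3)
```

Under these assumptions a 1 s recording yields 16 windows and every further second adds 20, which matches the counts quoted above.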

E. Attention-Based CNN
An attention-based CNN model was introduced in this study (Fig. 5). Raw sEMG signals from seven channels in a sliding window were used as the input of the model to directly decode the human movement intention. The model was mainly composed of one input layer, three convolutional modules, two LSTM modules, one attention mechanism module, one fully connected layer, and one output layer. The structure of the CNN module was adopted from [14], and the structure of the LSTM module was adopted from [25]. Because the decoding task in our study was significantly simpler, the structure of the proposed model was significantly simpler than those of the original models. Each sliding window had 375 samples of 7 channels as the input of the model. The filter numbers of the 3 convolution layers were 64, 96, and 128, and the kernel sizes of the 3 layers were 23, 13, and 11. All convolution layers used "same" padding with a stride of 1.
There were five layers in each of the first and second CNN modules (a convolution layer, a batch normalization layer, a rectified linear unit (ReLU) layer, an average pooling layer, and a dropout layer). The epsilon of the batch normalization layers was set to 0.001. The pooling size and the stride of the first pooling layer were both set to 15. The pooling size and the stride of the second pooling layer were both set to five. The drop rate of the two dropout layers was set to 0.15. The third CNN module had neither a pooling layer nor a dropout layer. The number of hidden units of each of the two LSTM modules was 128. The number of time steps of the two LSTM modules was set to five according to the increment of the sliding window. An attention mechanism module followed the last LSTM module to enhance the relationship between the sequences. The attention mechanism module was followed by a fully connected layer comprising 128 hidden units. Finally, the Euler angles of the three axes were output by a regression layer.
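The layer settings above imply how a 375-sample window is reduced to the five time steps seen by the LSTM modules. The following sketch traces the sequence length through the three CNN modules. It assumes that convolution and pooling act along the time axis and that pooling is non-overlapping and unpadded; the helper names are hypothetical.

```python
def same_conv_len(n, stride=1):
    """Output length of a 'same'-padded convolution: ceil(n / stride)."""
    return -(-n // stride)


def pool_len(n, size, stride):
    """Output length of an unpadded pooling layer."""
    return (n - size) // stride + 1


def trace_lengths(n=375):
    """Sequence length after each convolution and pooling layer."""
    lengths = []
    n = same_conv_len(n)
    lengths.append(n)          # module 1 convolution (64 filters, kernel 23)
    n = pool_len(n, 15, 15)
    lengths.append(n)          # module 1 average pooling
    n = same_conv_len(n)
    lengths.append(n)          # module 2 convolution (96 filters, kernel 13)
    n = pool_len(n, 5, 5)
    lengths.append(n)          # module 2 average pooling
    n = same_conv_len(n)
    lengths.append(n)          # module 3 convolution (128 filters, kernel 11)
    return lengths


print(trace_lengths())  # [375, 25, 25, 5, 5]
```

With 128 filters in the last convolution layer, the LSTM input under these assumptions is a sequence of five steps with 128 features each.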
The attention mechanism adopted in this study is shown in Fig. 6. After the second LSTM module, the shape of the sequence was 128 × 5. Q, K, and V matrices were defined. Q, K, and V were obtained from the dot product of the output of the second LSTM and three 128 × 128 matrices. The dimensions of the Q, K, and V matrices were 5 × 128. The Q matrix was multiplied by the transpose of the K matrix and then passed through a softmax layer to obtain a 5 × 5 relationship matrix between sequences. Then, the dot product of the relationship matrix with the V matrix produced the attention weight matrix. The output attention weight matrix can be described as in (2).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T}\right)V \tag{2}$$
where $Q$, $K$, and $V$ are the three matrices defined above. A softmax function was adopted in the attention mechanism.
For the human movement decoding schemes, the model was trained using an Nvidia Quadro P620 GPU. The stochastic gradient descent optimizer with momentum was applied to the model. The momentum and learning rate were set to 0.001 and 0.8 based on the results of multiple experiments.
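The attention computation described for the output of the second LSTM module can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the random weights and placeholder LSTM output are hypothetical, the sequence is stored as 5 time steps by 128 features, the softmax is assumed to act row-wise, and no scaling factor is applied because none is mentioned in the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(h, w_q, w_k, w_v):
    """Unscaled dot-product self-attention over a (steps, features) sequence."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v      # each of shape (5, 128)
    relation = softmax(q @ k.T, axis=-1)     # (5, 5) relationship matrix
    return relation @ v, relation            # (5, 128) output and its weights


rng = np.random.default_rng(0)
h = rng.standard_normal((5, 128))            # placeholder for the LSTM output
w_q, w_k, w_v = (0.05 * rng.standard_normal((128, 128)) for _ in range(3))

out, relation = self_attention(h, w_q, w_k, w_v)
print(out.shape, relation.shape)  # (5, 128) (5, 5)
```

Each row of the relationship matrix sums to one, so every output time step is a weighted average of the value vectors of all five steps.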

F. Kalman Filter (KF)
A KF was adopted in this study to add prior knowledge of the system. The kinematics of the transition model and observation model are described by (3) and (4):
$$x_i = F x_{i-1} + w \tag{3}$$
$$z_i = H x_i + v \tag{4}$$
where $x_i$ is the output of the transition model at time $i$, $F$ is the state transition matrix, $z_i$ is the output of the observation model at time $i$, $H$ is the measurement transformation matrix, and $w$ and $v$ are the state and measurement uncertainties drawn from Gaussian distributions $N(0, Q)$ and $N(0, R)$. After these parameters are determined, the state transition equation and the prior state covariance matrix of the system can be expressed using (5) and (6):
$$\hat{x}_i^{-} = F \hat{x}_{i-1} \tag{5}$$
$$P_i^{-} = F P_{i-1} F^{T} + Q \tag{6}$$
where $\hat{x}_i^{-}$ is the prior state estimate for the time $i$ and $P_i^{-}$ is the prior state covariance matrix. The Kalman gain can be calculated using (7):
$$K_i = P_i^{-} H^{T} \left(H P_i^{-} H^{T} + R\right)^{-1} \tag{7}$$
The output $\hat{x}_i$ and covariance matrix $P_i$ optimized by the KF can be expressed using (8) and (9):
$$\hat{x}_i = \hat{x}_i^{-} + K_i \left(z_i - H \hat{x}_i^{-}\right) \tag{8}$$
$$P_i = \left(I - K_i H\right) P_i^{-} \tag{9}$$
In the KF, the parameters $H$, $Q$, and $R$ need to be determined manually. $H$ was set to 1, $Q$ was set to $10^{-4}$, $R$ was set to $10^{-3}$, and $z_i$ was the actual output of the decoding model. The designed KF followed the decoding model to add prior knowledge of the system.
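The prediction and update steps of the filter can be written compactly for the scalar case. The sketch below uses the stated values H = 1, Q = 10^-4, and R = 10^-3. Several points are assumptions not given in the text: the transition coefficient F = 1 (a random-walk model), the initial state (first measurement) and initial covariance P0 = 1, and the idea that each Euler angle is filtered independently. The function name `kalman_smooth` is hypothetical.

```python
def kalman_smooth(z, F=1.0, H=1.0, Q=1e-4, R=1e-3, x0=None, P0=1.0):
    """Filter a 1-D sequence of decoder outputs with a scalar Kalman filter.

    F, x0, and P0 are illustrative assumptions; H, Q, and R follow the text.
    """
    x = z[0] if x0 is None else x0
    P = P0
    filtered = []
    for z_i in z:
        # Prediction: prior state estimate and prior covariance.
        x_prior = F * x
        P_prior = F * P * F + Q
        # Update: Kalman gain, posterior estimate, posterior covariance.
        K = P_prior * H / (H * P_prior * H + R)
        x = x_prior + K * (z_i - H * x_prior)
        P = (1.0 - K * H) * P_prior
        filtered.append(x)
    return filtered


# Hypothetical decoder output: a step from 0 to 1 (not real data).
raw = [0.0] * 10 + [1.0] * 10
smoothed = kalman_smooth(raw)
print([round(v, 3) for v in smoothed[8:14]])
```

With these settings the gain settles at a value well below one, so the filtered angle moves toward each new decoder output gradually instead of jumping, which is the smoothing effect a post-processing filter of this kind is meant to provide.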

G. Scheme Adaptability Analysis Experiment
In human movement decoding based on sEMG, the model is usually challenged by long-time use and user changes. Thus, a robust model that can cope with long-time use and user changes is essential. To solve this issue, we proposed AKFS that combines the attention-based CNN model with KF. To evaluate its effectiveness, we examined the performance of six different schemes (CNN, CNN-LSTM, attention-based CNN, CNN with KF, CNN-LSTM with KF, and AKFS) under four scenarios. The four scenarios were as follows.
1) Scenario 1. Intra-Session: The first session from each subject was chosen as the data, and five-fold cross-validation was performed on the data. The first session contained 9 movements and each movement contained 5 repetitions. Each repetition was segmented by a 250 ms sliding window with 80% overlapping. Thus, there were 236 samples in each movement trial. In each fold, four repetitions of every movement formed the training set (8496 samples: 9 movements × 4 trials × 236 samples), and the remaining repetition of every movement formed the testing set (2124 samples: 9 movements × 1 trial × 236 samples).
2) Scenario 2. Intra-Session Long-Time Use: Two sessions from each subject were used as the data. Each session contained 9 movements. To simulate the long-time use of movements, the five consecutive repetitions of each movement (60 s) were segmented. Each movement was segmented by a 250 ms sliding window, with 80% overlapping. Thus, there were 10764 samples in a session (9 movements × 1196 samples) after segmentation. Two-fold cross-validation was conducted. One session of one subject was chosen as the training set and the other session was chosen as the testing set.
3) Scenario 3. Inter-Subject: The first session of each subject was chosen as the data. The continuous 60 s data of each movement of the first session was segmented by a 250 ms sliding window with 80% overlapping. The first session of each subject contained 10764 samples (9 movements × 1196 samples). Two-fold cross-validation was conducted on each pair of subjects. We chose data from one subject as the training set, and the trained model was tested on the first session of the remaining subjects. To investigate whether the proposed scheme can learn the human muscle activities of multiple subjects, we also investigated whether a model trained on multiple subjects can accurately predict data from a new subject. Ten-fold cross-validation was conducted. We chose data from the first session of nine subjects as the training set (9 movements × 9 subjects × 1196 samples) and data from the first session of the remaining subject as the testing set (9 movements × 1196 samples).

4) Scenario 4. Inter-Subject With a Fine-Tuning Strategy: After changing the subjects, the performance of the model with the data from new subjects will be significantly reduced. To solve this problem, a fine-tuning strategy was applied, in which only a small amount of sEMG data was used to retrain the model to obtain a satisfactory performance of the model for the specific subject. Two-fold cross-validation was conducted on each pair of subjects. The data from the first session of one subject was chosen as the training set (9 movements × 1196 samples). The data from the first session of the remaining subjects were used to construct testing and calibration data. Five-fold cross-validation was conducted on the test subject. We used one repetition of each movement as a calibration set (9 movements × 236 samples) to retrain the trained model. The remaining four repetitions were used as the testing set (9 movements × 4 trials × 236 samples) to evaluate the fine-tuned model. The efficiency of the model trained on multiple subjects was also investigated. Ten-fold cross-validation was conducted. We used the first session of nine subjects as the training set (9 subjects × 9 movements × 1196 samples). Five-fold cross-validation was conducted on the test subject. We used one repetition from each movement as a calibration set (9 movements × 236 samples) to fine-tune the trained model. The remaining four repetitions were used as the testing set (9 movements × 4 trials × 236 samples) to evaluate the fine-tuned model.
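The fold construction and set sizes quoted in the four scenarios can be tallied with a few lines of Python. This is only a bookkeeping sketch: the helper `leave_one_out` is hypothetical, and the window counts per repetition (236) and per 60 s movement recording (1196) are taken directly from the text.

```python
N_SUBJECTS = 10
N_MOVEMENTS = 9
N_REPS = 5
WINDOWS_PER_REP = 236          # windows in one repetition (from the text)
WINDOWS_PER_MOVEMENT = 1196    # windows in 60 s of one movement (from the text)


def leave_one_out(n_items):
    """Yield (training indices, held-out index) for every item in turn."""
    for held_out in range(n_items):
        yield [i for i in range(n_items) if i != held_out], held_out


# Scenario 1: five folds over the repetitions of one session.
scenario1 = [
    (N_MOVEMENTS * len(train) * WINDOWS_PER_REP, N_MOVEMENTS * WINDOWS_PER_REP)
    for train, _ in leave_one_out(N_REPS)
]
print(len(scenario1), scenario1[0])  # 5 (8496, 2124)

# Scenarios 2 and 3: one full session per subject.
session_size = N_MOVEMENTS * WINDOWS_PER_MOVEMENT
print(session_size)  # 10764

# Scenarios 3 and 4 with multiple-subject training: ten folds over subjects.
multi_train = [
    len(train) * session_size for train, _ in leave_one_out(N_SUBJECTS)
]
print(len(multi_train), multi_train[0])  # 10 96876

# Scenario 4: one repetition per movement calibrates, the other four test.
n_calibration = N_MOVEMENTS * 1 * WINDOWS_PER_REP
n_finetune_test = N_MOVEMENTS * (N_REPS - 1) * WINDOWS_PER_REP
print(n_calibration, n_finetune_test)  # 2124 8496
```

The tally makes the asymmetry of the fine-tuning scenario explicit: the calibration set is one quarter the size of the set used to evaluate the fine-tuned model.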

H. Performance Index and Statistical Analysis
The coefficient of determination ($R^2$) was used as the performance index to evaluate the efficiency of the decoding schemes. $R^2$ can be described as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{D} \sum_{t=1}^{N} \left(F_i(t) - P_i(t)\right)^2}{\sum_{i=1}^{D} \sum_{t=1}^{N} \left(F_i(t) - \overline{F_i(t)}\right)^2}$$
where $F_i(t)$ is the real angle of the $i$th axis, $\overline{F_i(t)}$ is the average angle of the $i$th axis, $P_i(t)$ is the angle predicted by the scheme, $N$ is the number of samples in a sliding window, and $D$ is the number of axes. The $R^2$ in the four scenarios was analyzed using a repeated-measures analysis of variance (ANOVA) with factors including the three decoding models (CNN, CNN-LSTM, and attention-based CNN) and two post-processing strategies (with KF and without KF) with a significance level of 0.05. A Bonferroni-corrected post-hoc test was also conducted. All results were analyzed on a personal computer running IBM SPSS Statistics 22 software.
Figure caption: Flow of operation of the HoloLens virtual system (HVS). A HoloLens projects a red cross in a real environment and the subject controls a green cross projected by a laser pen to track the red cross. The upper limb movement is interpreted by the kinematic data of MPU6050 (Euler angles of the x, y, and z axes). A computer records the Euler angles of MPU6050 while recording the sEMG signals of the subject.

III. RESULTS
A. Results on Intra-Session
Table I summarizes the decoding accuracy ($R^2$) of the six decoding schemes of the ten subjects in the intra-session scenario. An ANOVA showed that both the decoding model ($F_{2,8} = 13.297$, $p = 0.003$) and post-processing strategy ($F_{1,9} = 76.278$, $p < 10^{-3}$) had a significant influence on the decoding accuracy of the schemes and there was no significant interaction between the decoding model and post-processing strategy ($F_{2,8} = 2.145$, $p = 0.179$). The schemes

B. Results on Intra-Session Long-Time Use
Table II summarizes the decoding accuracy ($R^2$) of the six decoding schemes, obtained by combining the three decoding models with the two post-processing strategies, for the ten subjects in the intra-session long-time use scenario. An ANOVA showed that both the decoding model (

C. Results on Inter-Subject
The inter-subject confusion matrices of the six schemes are depicted in Fig. 7.
Each small square represents the decoding accuracy ($R^2$ value) obtained by training (row) and testing (column) based on the first session data from different subjects. The ANOVA showed that the decoding model had an insignificant influence on the accuracy of the decoding schemes ($F_{2,43} = 1.461$, $p = 0.243$). However, the post-processing strategy had a
As previous results show ($R^2$ lower than 0.11), the accuracy of decoding new subject data using the model trained on the data of a single subject is very low. Therefore, we analyzed the results of the schemes trained with multiple-subject data to determine whether the proposed scheme can learn common features from multiple-subject data and thereby enhance the decoding accuracy of the model for new-subject data.
The results of the schemes trained on multiple subjects are presented in Table III.
An ANOVA showed that the post-processing strategy had a significant influence on the accuracy ($F_{1,9} = 102.265$, $p < 10^{-3}$). However, the decoding model had no significant influence on the accuracy (
TABLE III. CNN, CNN-LSTM, ATTENTION-BASED CNN, CNN WITH KF, CNN-LSTM WITH KF, AND AKFS TRAINED ON MULTIPLE SUBJECTS UNDER THE INTRA-SESSION SCENARIO

D. Results on Inter-Subject With a Fine-Tuning Strategy
The confusion matrices of the six inter-subject schemes with a fine-tuning strategy are depicted in Fig. 8. Each small square represents the accuracy ($R^2$ value) obtained by training (row) and testing (column). The results showed that both the decoding model ($F_{2,43} = 6.157$, $p = 0.004$) and post-processing strategy ($F_{2,43} = 808.327$, $p < 10^{-3}$) had a significant influence on the accuracy. A significant interaction between the decoding model and post-processing strategy was observed ($F_{2,43} = 3.637$, $p = 0.035$).
A simple effect analysis showed that the decoding model had a significant influence on the accuracy of both the schemes with KF (CNN with KF, CNN-LSTM with KF, and AKFS) ($F_{2,43} = 6.320$, $p = 0.004$) and the schemes without KF (CNN, CNN-LSTM, and attention-based CNN) ($F_{2,43} = 6.009$, $p = 0.005$). Without KF, both the CNN-LSTM model ($p = 0.005$) and the attention-based CNN ($p = 0.03$) had significantly higher accuracy than the CNN. However, there was no significant difference in accuracy between the CNN-LSTM and the attention-based CNN ($p = 0.216$). With KF, both the CNN-LSTM with KF ($p = 0.004$) and the AKFS ($p = 0.029$) achieved higher accuracy than the CNN with KF. There was no significant difference in accuracy between the CNN-LSTM with KF and the AKFS ($p = 0.278$).
TABLE IV. CNN, CNN-LSTM, ATTENTION-BASED CNN, CNN WITH KF, CNN-LSTM WITH KF, AND AKFS TRAINED ON MULTIPLE SUBJECTS UNDER THE INTER-SUBJECT WITH A FINE-TUNING
The results of the schemes trained on multiple subjects with a fine-tuning strategy are presented in Table IV. An ANOVA showed that the post-processing strategy had a significant influence on the accuracy of the decoding schemes ($F_{1,9} = 59.220$, $p < 10^{-3}$). However, the decoding model had no significant influence on the accuracy of the decoding schemes ($F_{2,8} = 0.232$, $p = 0.798$). There was a significant interaction between the decoding model and post-processing strategy ($F_{2,8} = 7.178$, $p = 0.016$).
A simple effect analysis showed that the decoding model had an insignificant influence on the accuracy of the schemes with KF (

E. Comparison of the Accuracy and Response Time Between the Model Trained by A Single Subject and the Model Trained by Multiple Subjects
We also analyzed the differences between the two training strategies (the schemes trained on a single subject with a fine-tuning strategy and the schemes trained on multiple subjects with a fine-tuning strategy) in the inter-subject scenarios to determine whether the model trained on multiple subjects can learn the generalization features of sEMG signals of different subjects. The average decoding accuracy ($R^2$) and average response time (RT) of the different schemes in the two different training strategies are shown in Fig. 9.
A repeated-measures analysis of variance (ANOVA) was conducted with factors including the three decoding models (CNN, CNN-LSTM, and attention-based CNN), two post-processing strategies (with KF and without KF), and two training strategies (the schemes trained on a single subject with a fine-tuning strategy and the schemes trained on multiple subjects with a fine-tuning strategy). The ANOVA showed that the training strategy had a significant influence on the accuracy of the decoding schemes ($F_{1,9} = 16.025$, $p = 0.003$). There was a significant interaction between the training strategy and the post-processing strategy ($F_{1,9} = 42.09$, $p < 10^{-3}$).
A simple effect analysis indicated that the training strategy had a significant influence on the accuracy of the schemes with KF (CNN with KF, CNN-LSTM with KF and AKFS) (
By comparing the response time results, we found that there was no significant difference between the response time in schemes using single-subject data for training and that in schemes using multiple-subject data for training ($p = 0.260$). The response time of the schemes with KF (CNN with KF, CNN-LSTM with KF, and AKFS) is significantly higher than that of the schemes without KF (CNN, CNN-LSTM, and attention-based CNN) ($p < 0.001$). The response time of the schemes with CNN-LSTM (CNN-LSTM and CNN-LSTM with KF) ($p < 0.001$) and the schemes with attention-based CNN (attention-based CNN and AKFS) ($p < 0.001$) is significantly higher than that of the schemes with CNN (CNN and CNN with KF). Adding the attention mechanism to CNN-LSTM significantly reduces the response time of the model ($p < 0.001$). The response time of all frames should be less than 1 ms.

IV. DISCUSSION
In this study, we proposed a human movement decoding scheme using an attention-based CNN model and a KF to improve the decoding accuracy of a traditional CNN model. An attention-based CNN model was adopted to further improve the spatial information and the temporal information extraction ability of CNN models by capturing the important information between various time series of sEMG signals. A KF was adopted to improve the decoding accuracy of the CNN model by adding prior knowledge of the system. Furthermore, a fine-tuning strategy was adopted to solve the problem of insufficient data due to the short training time for a new subject. To evaluate the effectiveness of the proposed scheme, we compared the decoding accuracy of the proposed scheme with that of five combined schemes of decoding models and post-processing strategies (CNN, CNN-LSTM, attention-based CNN, CNN with KF, and CNN-LSTM with KF schemes) in four scenarios (intra-session, intra-session long-time use, inter-subject and inter-subject with a fine-tuning strategy).
The attention-based CNN significantly improved the decoding accuracy of CNN models in the intra-session, intra-session long-time use, and inter-subject with fine-tuning strategy scenarios. As reported in previous studies, the hybrid CNN-LSTM model has been extensively used in human movement decoding and achieved satisfactory performance [25], [26] by capturing the temporal and spatial information of sEMG signals. To further improve the ability of the model to use temporal information and improve the decoding accuracy of the model in human upper limb movements, we applied the attention mechanism to CNN-LSTM. Attention-based CNN can extract effective information in the time domain to improve the utilization of the CNN-LSTM time domain information. The attention mechanism can also reduce information redundancy and reduce model calculation time. Thus, the proposed attention-based CNN achieved a higher decoding accuracy than CNN and a lower response time than CNN-LSTM.
Consistent with a previous study on human movement classification [31], the attention mechanism can also improve the performance of the CNN model in human continuous movement decoding. However, previous studies ignored the factors influencing sEMG pattern recognition in real life, such as user changes. In this study, we further extended the attention-based CNN model to the inter-subject scenario. However, its performance was unsatisfactory. This may be a result of the different data distributions of different subjects in the same movement because the data of different subjects have different mean values and variances. Even when we developed an HVS to unify the movement range and normalize the sEMG signals of different subjects, the exercise habits of different subjects still differed.
The models trained with multiple subjects achieved significantly higher performance than those trained with a single subject under the inter-subject scenario. This is because the deep learning method can extract generalization features from multiple subjects and adapt to the difference between the distributions of the training and test sets under the inter-subject scenario. However, although the model trained with multiple subjects' data showed improved performance in the inter-subject scenario, its performance is still unsatisfactory. There is still a gap between the data distribution of multiple subjects and that of a new subject. Therefore, it was necessary to propose a fine-tuning strategy to improve the decoding accuracy for a specific subject.
The fine-tuning strategy can reduce the gap between the data distributions of different datasets. After fine-tuning with a small amount of data on a new subject, the CNN, CNN-LSTM, and attention-based CNN models achieved higher performance than the models without fine-tuning. Furthermore, both the CNN-LSTM and attention-based CNN models outperformed the CNN model under the inter-subject with fine-tuning strategy scenario. After fine-tuning, all deep learning models can adjust parameters to adapt to the data distribution of the new subject through a small amount of data. Because CNN-LSTM and attention-based CNN can better extract the temporal information and spatial information of the sEMG signal, these two models can achieve a smoother prediction trajectory after fine-tuning compared with the CNN model.
We also evaluated the model trained with multiple-subject data in the context of inter-subject fine-tuning to determine whether the model can learn multiple generalization features to improve the fine-tuning performance. With the increase in training subjects, the gap between the three models became smaller after fine-tuning. Previous studies reported that mixture models can achieve better performance than single-structure models [25], [26]. However, we arrived at a different conclusion in the condition of the model trained on multiple subjects. Because the CNN model uses data from multiple subjects to obtain generalized initialization parameters, using a small amount of data from the target subjects to fine-tune the model can result in a similar performance to that of the fine-tuned CNN-LSTM and fine-tuned attention-based CNN models. Furthermore, because the model gets more generalized initial parameters from multiple subjects' data, the decoding performance of the model trained with multiple subjects' data is higher than that of the model trained with data from a single subject. The initialization strategy of model parameters has more influence than the difference of model structure.
The proposed KF significantly improved the decoding accuracy of the schemes. Inspired by the post-processing strategies of discrete movement classification based on sEMG, we proposed a post-processing strategy based on KF for continuous movement decoding. In contrast to Bao et al. [36], who focused only on a single subject or a single DoF, we tested not only the effectiveness of KF in decoding multiple DoFs for a single subject, but also the effectiveness of KF in decoding multiple DoFs between different subjects. The results indicated that KF significantly improves the decoding accuracy of the schemes in intra-session, intra-session long-time use, and inter-subject with a fine-tuning strategy. Thus, adding prior knowledge to the decoding scheme improves the decoding accuracy of the decoding scheme for inter-subject with multiple DoFs.

V. CONCLUSION
In this study, we proposed a decoding scheme that employs an attention-based CNN model to improve efficiency by using the temporal and spatial information of sEMG signals. In addition, it uses a KF to increase its prior knowledge and thereby improve its decoding accuracy in human movement decoding. The proposed scheme achieved good performance for a single subject under the scenarios of intra-session and intra-session long-time use, and the model could be easily transferred to a new subject through a fine-tuning strategy. Nevertheless, in this study, there was no significant difference in decoding accuracy under the scenario of inter-subject with a fine-tuning strategy between models trained on multiple subjects due to different data distributions. Therefore, in future work, we will use a domain adaptation method to further reduce the data distribution difference between different subjects and further improve the decoding accuracy of the AKFS using the fine-tuning strategy. We will also apply this scheme to an upper limb exoskeleton with multiple DoFs to achieve simultaneous and proportional control.