Classification of Motor Imagery Based on Multi-Scale Feature Extraction and the Channel-Temporal Attention Module

Motor imagery (MI) is a popular paradigm for controlling electroencephalogram (EEG) based Brain-Computer Interface (BCI) systems. Many methods have been developed to classify MI-related EEG activity accurately. Recently, deep learning has begun to draw increasing attention in the BCI research community because it does not require sophisticated signal preprocessing and can automatically extract features. In this paper, we propose a deep learning model for use in MI-based BCI systems. Our model is a convolutional neural network based on a multi-scale module and a channel-temporal attention module (CTAM), which we call MSCTANN. The multi-scale module is able to extract a large number of features, while the attention module includes both a channel attention module and a temporal attention module, which together allow the model to focus attention on the most important features extracted from the data. The multi-scale module and the attention module are connected by a residual module, which avoids the degradation of the network. Our network model is built from these three core modules, which combine to improve the recognition ability of the network for EEG signals. Our experimental results on three datasets (BCI competition IV 2a, III IIIa and IV 1) show that our proposed method has better performance than other state-of-the-art methods, with accuracy rates of 80.6%, 83.56% and 79.84%. Our model has stable performance in decoding EEG signals and achieves efficient classification performance while using fewer network parameters than other comparable state-of-the-art methods.

I. INTRODUCTION
Brain-Computer Interface (BCI) technology uses artificial intelligence to decode signals from the brain in order to provide a communication pathway between the brain and the world [1]. The original purpose of BCI technology is to identify the intention of human activities by analyzing EEG signals and converting them into commands to control external auxiliary devices, to assist people with motor disabilities to interact with the external environment [2]. As a result of continuous development of research, BCI technology is gradually being applied to increasing numbers of fields such as the medical industry [3], [4], entertainment [5], smart homes [6], and the military [7].
Commonly used BCI paradigms include steady-state visual evoked potentials (SSVEPs), P300 potentials, and motor imagery. Among these paradigms, motor imagery requires no additional stimulation apparatus. Instead, users modulate their EEG simply by imagining movement. In contrast to other BCI systems, the motor imagery paradigm can directly map a user's movement intention to an action. This allows participants to complete specific tasks by imagining limb movements. Furthermore, BCIs based on motor imagery can be spontaneous. In other words, participants can generate EEG signals through motor imagery without the need for external cues or other stimuli. Consequently, motor imagery has become one of the most popular BCI paradigms.
When imagining movement, the activity of specific frequency bands within the brain changes. Specifically, activity in the mu band (8-12 Hz) and the beta band (13-30 Hz) is known to change during motor imagery. While performing an imagined task, such as right-hand movement, the contralateral hemisphere of the brain exhibits reduced low-frequency activity, termed event-related desynchronization (ERD), while the ipsilateral hemisphere exhibits increased activity, termed event-related synchronization (ERS) [8].
To accurately recognize the ERD/S, considerable research effort has focused on combining feature extraction methods with machine learning to complete the classification task. Within the feature extraction step, a transformation (usually a linear transformation) is employed to identify and extract important features from the EEG signals. The important feature information is retained and the influence of noisy additional information is reduced or removed, thus transforming the originally complex high-dimensional EEG signals into a lower-dimensional, less noisy domain [9]. Feature extraction is usually conducted from four perspectives: the time domain [10], [11], the frequency domain [12], [13], the time-frequency domain [14], [15], and the spatial domain [16], [17]. Typical feature extraction methods include, but are not limited to, the Fast Fourier Transform (FFT) [18], the wavelet transform (WT) [19], principal component analysis (PCA) [20], and common spatial patterns (CSP) [21]. Among these methods, the CSP algorithm is the most widely used in BCI systems. Its basic principle is to identify spatial filters that maximize the variance of the filtered signal for one class while minimizing it for the other. Recently, the basic CSP algorithm has been extended in a number of ways to meet particular challenges within the BCI domain. Among these extensions, the filter bank common spatial pattern (FBCSP) [22] is one promising method that uses the frequency domain characteristics of MI to optimize the spatial filter. Specifically, within the FBCSP algorithm, the MI signal is divided into multiple frequency sub-bands and then each sub-band is filtered by CSP to extract an optimal feature set.
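As an illustration of the CSP principle described above, the following is a minimal NumPy sketch, not the implementation used in any of the cited works. It assumes two classes of trials shaped (trials, channels, samples), trace-normalized covariance matrices, and a whitening-based solution. The function name `csp_filters` is hypothetical.

```python
import numpy as np

def csp_filters(trials_a, trials_b):
    """Return CSP spatial filters (one per row) and their variance ratios.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples).
    """
    def mean_cov(trials):
        # Trace-normalised spatial covariance, averaged over trials.
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    cov_a, cov_b = mean_cov(trials_a), mean_cov(trials_b)

    # Whiten the composite covariance so that P (cov_a + cov_b) P^T = I.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = np.diag(evals ** -0.5) @ evecs.T

    # Eigen-decompose the whitened class-A covariance. Each eigenvalue is the
    # share of variance that class A contributes along that filter; class B
    # contributes the remainder (1 - eigenvalue).
    ratios, basis = np.linalg.eigh(whitener @ cov_a @ whitener.T)
    order = np.argsort(ratios)[::-1]
    return basis[:, order].T @ whitener, ratios[order]
```

The first rows of the returned matrix favor class A and the last rows favor class B, so a classifier would typically use the log-variance of a few filters from each end. FBCSP would apply the same procedure separately to each band-pass-filtered copy of the signal.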
However, traditional machine learning methods work most effectively only when the features they are applied to have been carefully chosen and pre-processed to maximize the signal-to-noise ratio. An alternative approach, that many investigators have recently begun to focus on, is deep learning. Deep learning models can effectively capture a high-dimensional feature representation from the EEG signals as well as the potential relationships between internal features through a nonlinear deep structure. For EEG signals with complex information content and strong time-varying characteristics, a deep feature representation can be extracted through deep learning models. Deep learning models do not require complex processing of the input data, and some models can even directly use the original data as their input without any pre-processing or feature extraction [31]. Many different deep learning models have been developed over recent years to classify EEG data. For example, Liu and Zeng [32] proposed a multi-feature fusion method based on ResNet to extract features. This model classified EEG with accuracies which were 39.65% higher than those achieved by a model using single features. For feature extraction and classification of MI-EEG signals, Wang et al. [33] combined a Squeeze-and-Excitation convolution neural network (SECNN) with the time-varying autoregressive model (TVAR) and power spectral density (PSD) time-frequency analysis, to improve the accuracy of MI-BCI systems. Hwaidi and Chen [34] trained deep neural networks by combining a deep autoencoder (DAE) and CNN architectures to classify EEG MI signals. The results of the model show that it outperforms current CNN-based approaches and several traditional machine-learning approaches.
Deep learning does not require complex feature extraction methods, but simple data preprocessing steps have been shown to further improve its classification results. For example, Shalu et al. [35] achieved relatively good results by using the continuous wavelet transform to produce time-frequency plots and feeding them into a deep CNN for classification. Zhu et al. [36] designed a separate channel convolution network to encode the multi-channel data of CSP, preserving the time-varying information that is helpful for distinguishing tasks. However, some studies have shown that deep learning can even extract features directly from raw EEG data. For example, Wu et al. [37] proposed a parallel multi-scale filter bank CNN for MI classification. The extracted output features are connected to the spatial convolution layer to complete the fusion of multimodal features from EEG signals. Roy et al. [38] explored the use of different fusion models to automatically complete multimodal feature extraction and classification from raw EEG signals. The model achieves 80.32% accuracy on the BCI competition IV 2b dataset. Li et al. [39] used amplitude interference as a means of data enhancement to expand the dataset and constructed a channel projection mixed-scale CNN to decode EEG. This framework used the original multi-channel EEG signal as the input, and achieved an average accuracy of 67.17% in a four-class classification task.
Deep learning has considerable potential to improve the performance of MI-EEG BCIs, but some problems still remain:
1) Most methods developed to date are intended for binary classification tasks or, if they are applied to multi-class problems, convert the multi-class problem into multiple binary tasks. However, there is a growing need for effective multi-class solutions for MI BCIs.
2) The process of feature extraction and selection is very time-consuming. Ideally, we would like to make full use of the automatic learning ability of deep neural networks. To do this we need to extract more effective features by either adopting data enhancement methods or by designing more complex network models. However, complex networks will increase the run time and the number of parameters that need to be trained in the network.
3) There are individual differences in the EEG signals of different participants, but a single-scale convolution kernel can only use a single set of weights when extracting features.
4) Although the neural network can automatically extract features, the extracted features are not necessarily effective in all cases. Extracting features without emphasis will not only increase the calculation cost but also lead to feature redundancy.
The MI studies mentioned above are all based on the use of 2D inputs to the network, while there are far fewer studies on the use of 1D inputs. Liu et al. [40] compared the classification results of both one-dimensional (1D) and two-dimensional (2D) input forms based on public datasets, and the results indicate that the 1D form of input can lead to higher classification accuracies and faster convergence. Jia et al. [41] used a 1D input with the convolution performed in the time dimension, and reached an average classification accuracy of 78% on public four-class datasets, demonstrating that a 1D input can also be helpful for MI classification tasks.
To tackle the problems listed above, a deep learning-based multi-class MI signal recognition method is proposed in this paper. This model utilizes preprocessed EEG signals to realize end-to-end automatic learning without the need for manually designed feature extraction methods. The main contributions of this study are as follows:
1) To investigate multi-class tasks, this paper carries out experiments on two BCI competition datasets, each containing four classes of movement tasks. To further demonstrate the performance of the model, we add a dataset with two categories of MI tasks.
2) To improve the efficiency of feature extraction, this paper proposes an end-to-end neural network and uses the proposed data augmentation to enrich the feature information.
3) In view of the differences in EEG signals recorded from different participants, a multi-scale module is designed to extract richer features, which widens the range of features the network can learn and improves the classification accuracy.
4) To address the problem that the neural network may learn features without emphasis, the information over different channels and over time is learned through two modules, a channel attention module and a temporal attention module, to attempt to improve the classification result.
The rest of the paper is organized as follows: the details of the methods are described in Section II. The experiments and results are presented in Section III. The factors influencing the experimental results, together with future work, are discussed in Section IV. Finally, we conclude our research in Section V.

II. METHODS
The model proposed in this paper is a neural network with a multi-scale module and two attention modules. The overall framework of the MSCTANN model is shown in Figure 1. The model includes three core parts: the multi-scale module, the residual module and the channel-temporal attention module (CTAM). Augmentation of the training data is used to provide more information for the neural network. The multi-scale module then automatically extracts features from this augmented data, and its different extraction scales address the problem of differences in EEG signals among participants. The residual module is used to fuse the features transmitted by the multi-scale module, and the introduction of the residual module avoids network degradation caused by an excessive number of network layers. The CTAM is used to automatically select among the fused features, avoiding information redundancy and automatically learning the importance of different features, thus improving the classification result for MI-EEG signals.

A. Data Augmentation
For neural networks, the amount of training data needs to be sufficiently large. However, most EEG datasets cannot satisfy this requirement. Therefore, data augmentation is necessary. Each sample of MI-based EEG data in this paper is represented as a 2D matrix of dimensions C×T (channel × time), where rows represent data collected from different electrodes and columns represent data at different sample points. In this paper, a head-to-tail extended data augmentation method is proposed. The schematic diagram of the head-to-tail augmentation method is shown in Figure 1(a). For each trial, a segment of a certain length is extracted from the head of the signal and appended to the tail. This extraction process is repeated until the whole signal has been cycled through. The length of the loop is an adjustable parameter and one of the factors affecting the final classification outcome. If the length of the loop is too long, the network cannot obtain enough information to overcome the over-fitting problem. Conversely, if it is too short, the difference between different samples will be very small.
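One way to read the head-to-tail procedure is as a sequence of circular shifts along the time axis, each shift moving one more head segment to the tail. The sketch below follows that reading. It is an interpretation under stated assumptions rather than the authors' code: the function name `head_to_tail_augment` and the parameter `step` (the loop length, in samples) are hypothetical.

```python
import numpy as np

def head_to_tail_augment(trial, step):
    """Generate shifted copies of one trial by moving the head to the tail.

    trial: array of shape (channels, time).
    step:  number of samples moved from head to tail per cycle (assumed unit).
    Returns an array of shape (n_copies, channels, time); copy 0 is the
    original trial.
    """
    n_samples = trial.shape[1]
    # A negative roll moves the first `shift` samples to the end of the signal.
    shifts = range(0, n_samples, step)
    return np.stack([np.roll(trial, -shift, axis=1) for shift in shifts])
```

Under this reading, a trial of T samples yields ceil(T / step) samples, so a shorter loop length produces more, but more similar, training samples.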

B. The Multi-Scale Module
There are several challenges still to be overcome when designing MI-BCIs. One of the most important is the difference in EEG signals between participants. If a single extraction method is used, it not only limits the information extracted from each participant but also ignores the individual differences between participants, resulting in a poor final classification result. How to use the available information to extract more features is a problem that still needs a solution. Therefore, this paper designs a multi-scale structure, which automatically extracts features from the original EEG signals based on multi-scale convolution and pooling. Its structure is shown in Figure 1.
The multi-scale structure proposed in this paper is designed according to related methods in the field of signal processing. Conv1 is a convolutional layer with a smaller kernel, which can effectively collect fine-grained local information. Conv2 is a convolutional layer with a medium kernel, which can retain relatively coarse-grained feature information. Conv3 is a convolutional layer with a larger kernel, which can capture the overall characteristics of the EEG signals. Together, the three kernel sizes allow the multi-scale structure to extract richer features than a single scale.
After feature extraction, a pooling layer is needed to reduce the feature dimensions, thereby reducing the number of parameters in the last fully connected layer. The pooling layer can also speed up computation and help prevent overfitting. Most existing studies use a single-scale pooling layer, which increases the possibility of information loss to some extent. Although single-scale pooling can remove redundant information, the criteria for information redundancy are not the same for different participants. If only a single scale is used for extraction and processing, the possibility of losing important information greatly increases. Therefore, our method adds multi-scale pooling to multi-scale convolution, and the two multi-scale structures are combined to better process the MI-EEG signals.
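The idea of parallel branches with different kernel and pooling sizes can be sketched as follows. This is only an illustration under assumptions: in the actual network the kernels are learned, whereas here fixed averaging kernels stand in for them, and the pooling sizes, the pairing of kernel and pooling sizes, and the function names are all hypothetical. The default kernel sizes (3, 11, 19) are one of the combinations reported in Section IV.

```python
import numpy as np

def conv1d_same(signal, kernel):
    """Convolve every channel of `signal` (channels, time) along time."""
    return np.stack([np.convolve(ch, kernel, mode="same") for ch in signal])

def avg_pool(signal, size):
    """Non-overlapping average pooling along time; trailing samples dropped."""
    usable = signal.shape[1] // size * size
    return signal[:, :usable].reshape(signal.shape[0], -1, size).mean(axis=2)

def multi_scale_features(signal, kernel_sizes=(3, 11, 19), pool_sizes=(2, 4, 8)):
    """Return one feature map per branch, from fine to coarse scale."""
    branches = []
    for k, p in zip(kernel_sizes, pool_sizes):
        kernel = np.ones(k) / k  # stand-in for a learned kernel of size k
        branches.append(avg_pool(conv1d_same(signal, kernel), p))
    return branches
```

Each branch sees the same input but summarizes it over a different temporal extent, which is the property the multi-scale module relies on. How the branch outputs are merged is left open here, since that detail is given by the figure rather than the text.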

C. The Residual Module
The residual module can fuse the extracted features. Furthermore, the introduction of this module avoids the problem of network degradation produced by an excessive number of network layers. The structure of the residual module is shown in Figure 1(c). The connection line on the right side of the module is called the identity shortcut connection, which adds neither extra parameters nor computational complexity. The identity shortcut connection addresses the problem of newly added layers not working effectively by allowing the module to skip one or more layers if needed. This module is formed by combining multiple 1D convolutional layers and batch normalization (BN) layers with the superimposed residual connection. The definition of this module can be expressed as:
$$Y = X + X_{out}$$
where $X$ and $X_{out}$ represent the input and the output of the residual module, and $Y$ represents the total output. The features learned by the shallow network can be passed to the deep network by the residual connection, thus avoiding network degradation.
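Reading the total output Y as the sum of the input X and the convolutional branch output X_out, the identity shortcut can be sketched as below. This is a simplified illustration, not the authors' architecture: the number of layers, the ReLU between layers, and the per-sample normalization that stands in for batch normalization are assumptions, and all function names are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Convolve every channel of `x` (channels, time) along time."""
    return np.stack([np.convolve(ch, kernel, mode="same") for ch in x])

def normalize(x, eps=1e-5):
    """Stand-in for batch normalization: normalizes each channel over time
    for a single sample, with no learned scale or shift."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, kernels):
    """Apply conv + normalization layers, then add the identity shortcut."""
    out = x
    for i, kernel in enumerate(kernels):
        out = normalize(conv1d_same(out, kernel))
        if i < len(kernels) - 1:
            out = np.maximum(out, 0.0)  # ReLU between layers (assumed)
    return x + out  # identity shortcut: no extra parameters
```

If the convolutional branch contributes nothing, the block reduces to the identity mapping, which is why adding such blocks should not make a deeper network worse than a shallower one.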

D. Channel-Temporal Attention Module (CTAM)
The convolutional block attention module (CBAM) is a lightweight attention module that can conduct attention training in both the channel and spatial dimensions [42]. Inspired by the CBAM, this article constructs a channel-temporal attention module, called CTAM. The overall structure of our CTAM is shown in Figure 1(d), and the specific structure is shown in Figure 2. The CTAM includes a channel attention module and a temporal attention module, which complement each other, achieving considerable performance improvement while keeping the computational overhead small. Further directed screening of features can be performed to automatically learn the more important features, thus boosting the MI-EEG signal classification performance.
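1) Channel Attention Module (CAM): The formula for channel attention is as follows:
$$M_c(Y) = \sigma\left(W_1\left(W_0\left(Y^c_{avg}\right)\right) + W_1\left(W_0\left(Y^c_{max}\right)\right)\right)$$
where $M_c(Y)$ represents the convolution result of the CAM and $\sigma$ represents the activation function. $W_1$ and $W_0$ represent the weights of the MLP, and $Y^c_{avg}$ and $Y^c_{max}$ represent the features output by the different pooling layers under the channel attention module.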
2) Temporal Attention Module (TAM): TAM compresses the channel dimension without changing the temporal dimension. This module focuses on the location information of the target. The specific flow of the temporal attention module is shown in Figure 2(b).
The TAM produces two feature maps by maximum pooling and average pooling of the output from the channel attention module. The two feature maps are then combined and turned into a single-channel feature map by a convolution operation. The temporal attention feature map is obtained through the sigmoid function. Finally, the output is multiplied by the original map. The formula for temporal attention is as follows:
$$M_s(Y) = \sigma\left(f\left(\left[Y^s_{avg}; Y^s_{max}\right]\right)\right)$$
where $M_s(Y)$ represents the convolution result of the TAM, $f$ represents the convolution operation, and $Y^s_{avg}$ and $Y^s_{max}$ represent the output features of the different pooling layers under the temporal attention module.

3) Combined Channel and Temporal Attention Module:
Attention modules can increase the representativeness of the network by focusing on important features, and suppressing unnecessary ones. The CTAM module emphasizes temporal information while simultaneously reinforcing channel information.
The CTAM consists of the CAM and the TAM. The input feature $Y$ is passed through the CAM, and the CAM output is multiplied by the original input to give $Y'$, which is the input of the TAM. Finally, the output of the TAM is multiplied by $Y'$:
$$Y' = M_c(Y) \otimes Y, \qquad Y'' = M_s(Y') \otimes Y'$$
where $Y$ is the original input feature, $Y'$ is the result of multiplying the CAM convolution output with the original map, and $Y''$ is the result of multiplying the TAM convolution output with $Y'$. $Y''$ is also the final output of the CTAM.
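The sequence of channel attention followed by temporal attention can be sketched in NumPy as below. This is a hedged illustration modeled on the CBAM design that the CTAM is said to follow, not the authors' implementation: the two-layer MLP with a ReLU, the shapes of `w0`, `w1` and `kernel`, and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(y, w0, w1):
    """Weight per channel. y: (channels, time); w0: (hidden, channels);
    w1: (channels, hidden). Pools over time, so time is compressed."""
    def mlp(v):
        return w1 @ np.maximum(w0 @ v, 0.0)  # shared MLP, ReLU assumed
    avg_pooled = y.mean(axis=1)
    max_pooled = y.max(axis=1)
    return sigmoid(mlp(avg_pooled) + mlp(max_pooled))[:, None]  # (channels, 1)

def temporal_attention(y, kernel):
    """Weight per time step. kernel: (2, k), one row per pooled map.
    Pools over channels, so the temporal dimension is unchanged."""
    avg_pooled = y.mean(axis=0)
    max_pooled = y.max(axis=0)
    # Convolving the stacked [avg; max] maps down to one channel equals the
    # sum of one 1D convolution per map.
    conv = (np.convolve(avg_pooled, kernel[0], mode="same")
            + np.convolve(max_pooled, kernel[1], mode="same"))
    return sigmoid(conv)[None, :]  # (1, time)

def ctam(y, w0, w1, kernel):
    """Channel attention first, then temporal attention on the result."""
    y_prime = channel_attention(y, w0, w1) * y
    return temporal_attention(y_prime, kernel) * y_prime
```

Because both attention maps lie in (0, 1), the module can only rescale features, never change their sign. With all weights at zero, every attention value is 0.5 and the output is simply 0.25 times the input.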

III. EXPERIMENTS AND RESULTS
A. Datasets
Dataset 1: The BCI competition IV 2a dataset contains data from nine participants performing four classes of MI tasks (involving the left hand, right hand, foot, and tongue). This dataset records EEG signals from an EEG setup placed according to the international 10-20 system with 25 electrodes (22 EEG channels and 3 EOG channels). The sampling frequency is 250 Hz, and a 0.5-100 Hz band-pass filter and a 50 Hz power-line notch filter are used for filtering. Each participant completed two sessions, each of which contained six runs with 48 trials per run, for a total of 288 trials per session.
The timing pattern of Dataset 1 is shown in Figure 3(a). Each trial begins with a fixation cross for the first 1s. A directional cue is then shown for 1.25s. At t=3s, participants are asked to imagine the corresponding movement until the task finishes at t=6s. This dataset was acquired using a feedback-free experimental paradigm. The signal is segmented from 0.5s after the start of the cue to the end of MI, that is, from 2.5-6s, giving a segment of 3.5s. Since MI is most commonly associated with changes in the α (8-12 Hz) and β (13-30 Hz) frequency bands, the data are band-pass filtered at 8-30 Hz using a Butterworth filter.
Dataset 2: The BCI competition III IIIa dataset contains data from 3 participants performing a four-category MI task. The task type is the same as that used in Dataset 1. This dataset includes EEG signals recorded with 60 electrodes. The sampling frequency used was 250 Hz. Participant K3 in this dataset completed 360 MI trials while the other two participants each completed 240 trials. The number of trials is equal for each category.
The timing paradigm used to record Dataset 2 is shown in Figure 3(b). In the experiment, the beginning of the trial is a black screen for the first 2s. Then a fixation cross "+" is displayed for 1s. A directional arrow is then displayed for 1s. At the same time, the participant is asked to imagine the corresponding movement until the fixation cross disappears at t=7s. Each of the 4 cues is displayed 10 times within each run in a randomized order. The trials for Dataset 2 are extracted from 3.5-6.5s, and the filter settings used remain consistent with Dataset 1.
Dataset 3: The BCI competition IV 1 dataset contains data from 7 participants performing a 2-category MI task. This dataset includes EEG signals recorded with 59 electrodes. The sampling frequency used was 100 Hz. The data from participants labeled c, d, and e from this dataset were not used, because they are artificially generated. Each participant was asked to complete 200 trials.
The timing paradigm used to record Dataset 3 is shown in Figure 3(c). In the experiment, the beginning of the trial is a fixation cross for the first 2s. Then a directional arrow is displayed for 4s. At the same time, the participant is asked to imagine the corresponding movement until the cross disappears at t=6s. The trials for Dataset 3 are extracted from 2.5-5.5s, and the filter settings used remain consistent with Dataset 1.
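The extraction windows above translate into sample indices through the sampling frequency of each dataset. The short sketch below works through that arithmetic using only the window bounds and sampling rates stated in the text. The helper name `epoch_indices` is hypothetical.

```python
def epoch_indices(start_s, stop_s, fs):
    """Convert a time window in seconds to (first, last + 1) sample indices."""
    return int(round(start_s * fs)), int(round(stop_s * fs))

# (window start in s, window end in s, sampling frequency in Hz)
windows = {
    "Dataset 1": (2.5, 6.0, 250),
    "Dataset 2": (3.5, 6.5, 250),
    "Dataset 3": (2.5, 5.5, 100),
}

for name, (start_s, stop_s, fs) in windows.items():
    first, last = epoch_indices(start_s, stop_s, fs)
    print(f"{name}: samples {first}-{last}, {last - first} samples per trial")
# Dataset 1: samples 625-1500, 875 samples per trial
# Dataset 2: samples 875-1625, 750 samples per trial
# Dataset 3: samples 250-550, 300 samples per trial
```

With a trial stored as an array of shape (channels, time), the segment would then be `trial[:, first:last]`, so each input is C×875, C×750, or C×300 depending on the dataset.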

B. Experimental Setup
This experiment adopts within-subject classification. We employ a five-fold cross-validation approach to evaluate the model. The neural network is trained using the Adam optimizer, which updates the network weights more efficiently than classical stochastic gradient descent and also accelerates the convergence of the network. The initial learning rate of the network is set to $1 \times 10^{-3}$ and then adjusted through a cosine annealing decay strategy, in which the learning rate is restored after decaying to a certain value, allowing the network to jump out of the current local optimum and search for the global optimum again. To prevent overfitting, a dropout rate of 0.4 is set in the final fully connected layer. More network parameter settings are detailed in Table I (note that the term N in the table denotes the batch size).
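The learning-rate behavior described above, decaying along a cosine curve and then being restored, matches the usual cosine annealing schedule with restarts. The sketch below shows the shape of that schedule. Only the initial rate of 1e-3 comes from the text: the minimum rate `lr_min` and the cycle length `cycle_len` are hypothetical values chosen for illustration.

```python
import math

def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=0.0, cycle_len=50):
    """Learning rate at `epoch` for cosine annealing with periodic restarts.

    lr_max comes from the text; lr_min and cycle_len are assumed values.
    """
    position = epoch % cycle_len  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * position / cycle_len)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
```

The rate starts at `lr_max`, reaches the midpoint halfway through a cycle, approaches `lr_min` at the end of the cycle, and jumps back to `lr_max` at the next restart.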

C. Overall Comparison
The model presented in this paper was compared with several classical and state-of-the-art models on the BCI IV 2a, BCI III IIIa and BCI IV 1 datasets. Table II compares the performance of our proposed method with other state-of-the-art methods on Dataset 1. The numbers highlighted in bold in the table indicate the best result for each participant. We compared our model to the following methods:
1. FBCSP [45]: A model that manually extracts features. This model is often used as a baseline method to classify MI-EEG signals. It has yielded good results in several previous EEG decoding studies. It performs task classification by extracting CSP features from different frequency bands and then using an SVM to classify the features.
3. DeepConvNet [47]: A deep learning model that is deeper than the ShallowConvNet. It consists of four convolutional and max pooling layer blocks, followed by a softmax layer.
4. FBCNet [48]: A deep learning model using EEG band-pass filtering to create multiple frequency bands. It consists of two trainable layers.
5. MBEEGSE [49]: A deep learning model for decoding MI known as a multi-branch EEGNet with squeeze-and-excitation blocks.
TABLE II: A comparison of the classification performances achieved by the different models on Dataset 1 (the p-value is the result of a t-test comparing our proposed method to each of the other methods).
As shown in Table II, our proposed MSCTANN model achieves an average recognition accuracy of 80.6%. Compared to the other methods, MSCTANN achieves a statistically significantly higher mean classification accuracy over participants. In terms of recognition accuracy, MSCTANN is, on average, 10.92% higher than FBCSP, which suggests that our model can extract more effective information than FBCSP. The five other deep learning models, EEGNet, DeepConvNet, FBCNet, MBEEGSE and EEG-TCNet, are generally more effective than FBCSP, which demonstrates the advantages of deep learning. However, MSCTANN is, on average, 8.16%, 8.58%, 7.4%, 2.71% and 3.98% more accurate than these five models, which demonstrates the effectiveness of our proposed model design. The structure of the deep learning model is very important for feature extraction, and selecting an appropriate model can result in better classification performance. The better performance of MSCTANN may be due to its combination of a multi-scale module and attention modules, which makes feature extraction and screening more reasonable and leads to improved classification performance.
Figure 4 shows the results of our proposed MSCTANN model and the other state-of-the-art models on Dataset 2 and Dataset 3. On Dataset 2, MSCTANN achieves the highest accuracy, 83.56%, which is 16.2% higher than FBCSP. Compared to the five deep learning models, MSCTANN achieves an average accuracy that is 7.87%, 11.62%, 4.9%, 1.1% and 3.23% higher, respectively, indicating that targeted extraction of features can improve model performance. On Dataset 3, MSCTANN achieves the highest accuracy, 79.88%, which is 17.13% higher than FBCSP. Compared to the five other deep learning models, our model performs 8.75%, 13.88%, 6.13%, 0.88% and 2% better, respectively. Across all three datasets, our model has a more stable performance than the other methods we compare against. In addition, the single training time for the three datasets is 64.98s, 65.8s, and 25.61s, respectively.

IV. DISCUSSION
The performance of our proposed MSCTANN model is affected by several factors:
1) The use of multi-scale kernels. To demonstrate their advantages, ablation experiments with single-scale kernels are conducted.
2) The sizes of the convolution kernels in the multi-scale module, which affect the information content of the extracted features.
3) The validity of the CTAM layer for feature learning and selection.
4) The abundance of features as a result of the data augmentation and feature augmentation methods.

A. Influence of Multi-Scale Kernel
Multi-scale kernels can extract features at different scales at the same time. To highlight the advantages of multi-scale convolution kernels, we perform ablation experiments on the optimal combination of multi-scale kernels for each dataset. The results for the three datasets are shown as radar charts in Figures 5(a), (b) and (c), respectively. As can be seen from the figures, for almost all participants the single-scale kernels do not perform as well as the multi-scale kernels. In Dataset 1, compared to the three single-scale kernels, the accuracy of the multi-scale kernels over all participants improved by 5.57%, 5.27%, and 3.05%, respectively. In Dataset 2, the accuracy improved by 2.82%, 6.94%, and 8.19%, respectively. In Dataset 3, the accuracy improved by 4.25%, 2.13%, and 3.52%, respectively. This result supports the claim that multi-scale kernels can extract more information than a single-scale kernel. In addition, we performed a significance test against the single-scale kernels (Figure 5(d)). From a statistical point of view, the results of the single-scale kernels and the multi-scale kernels differ significantly.

B. Multi-Scale Kernel Size
In this section, we use different combinations of convolution kernel sizes to explore the effect of kernel size on model performance. Because the number of available channels differs between datasets, kernel combinations need to be considered separately for each of the three datasets. The results for each dataset are shown in Figure 6. Because Dataset 2 and Dataset 3 have a similar number of EEG channels, the same kernel combinations are adopted for both these datasets. It can be seen that the most suitable kernel combination for Dataset 1 (Figure 6(a)) is (3,11,19), the most suitable kernel combination for Dataset 2 (Figure 6(b)) is (7,21,35) and the most suitable kernel combination for Dataset 3 (Figure 6(c)) is (3,11,19). For the three datasets, the accuracy of the best kernel combination was 5.95%, 5.32%, and 3.04% higher than that of the worst kernel combination, respectively. These results show that the performance difference between kernel combinations is still relatively large. In this article, we were only able to consider a limited number of combinations, so there may still be better kernel combinations that we have not yet discovered. Indeed, our current results do not yet allow us to derive useful optimization rules, and this and related aspects need to be explored further in the future.

C. Influence of the CTAM
To explore the effect of the CTAM layer on the classification results, we perform ablation experiments, as shown in Figure 7. The experimental results on all datasets show that, without the CTAM layer, the classification accuracy is reduced by a different amount for each participant. If participants are split into high- and low-accuracy groups at 78.89% (the average accuracy of all subjects), the improvement is more pronounced for the low-accuracy participants, with an average increase in performance of 3.45% and a best improvement of 5.95%. This demonstrates the validity of the CTAM layer in our proposed model. The features extracted from the original EEG signals by the multi-scale module and the residual module differ from one another. The CTAM layer automatically learns the importance of the different features and thereby improves the classification results for the MI tasks.
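The idea of weighting features first per channel and then per time step can be sketched as below. This section does not describe the internal layers of the CTAM, so the sketch is a parameter-free stand-in: the function name `channel_temporal_attention`, the use of mean pooling, and the sigmoid gating are assumptions, whereas a trained attention module would normally derive its weights through learned layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_temporal_attention(feat):
    """Gate a feature map by channel, then by time step.

    feat: array of shape (n_channels, n_samples).
    Returns the gated features plus the two weight arrays.
    """
    # Channel attention (assumed form): one weight per channel,
    # computed from that channel's mean over time.
    ch_w = sigmoid(feat.mean(axis=1, keepdims=True))   # (n_channels, 1)
    feat = feat * ch_w
    # Temporal attention (assumed form): one weight per time step,
    # computed from the mean across channels.
    t_w = sigmoid(feat.mean(axis=0, keepdims=True))    # (1, n_samples)
    return feat * t_w, ch_w, t_w

# Hypothetical feature map: 4 channels, 8 time steps.
demo = np.random.default_rng(0).standard_normal((4, 8))
gated, ch_w, t_w = channel_temporal_attention(demo)
print(gated.shape, ch_w.shape, t_w.shape)
```

Because every weight lies strictly between 0 and 1, the gating rescales features rather than removing them, which is the sense in which an attention layer emphasizes some channels and time steps over others.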

D. Influence of Data Augmentation
The purpose of data augmentation is to optimize the training process by overcoming the problem of insufficient training data. To demonstrate the validity of our head-to-tail data augmentation method, the experiments are repeated without data augmentation. Figure 8 shows the test results achieved with the un-augmented data for each of the three datasets (the blank groups). Data augmentation improved the classification performance on the three datasets by 7.99%, 6.58%, and 3.63%, respectively.
Figure 8. Influence of data expansion length and multiples on the classification results. Blank groups represent the results without data augmentation.
In the head-to-tail data augmentation method, the augmentation length determines the final amount of training data, which in turn affects the training of the model. If the augmentation length is too short, data may be reused, which provides no additional useful information and increases the computational cost of training. If the augmentation length is too long, the data may be under-utilized, which also prevents the extraction of additional useful information. To investigate the relationship between the augmentation length and the final training effect, we conducted a comparison test with different augmentation lengths. Considering computational cost and time, the augmentation length is only varied from 20 to 100 in steps of 10. The results of this test on each of the three datasets are shown in Figure 8. In Dataset 1, the best result is obtained with a length of 50, a 3.39% improvement over the worst setting. In Dataset 2, the best result is obtained with a length of 80, a 1.38% improvement over the worst setting. In Dataset 3, the best result is obtained with a length of 40, a 1.75% improvement over the worst setting. This shows that there is no clear relationship between the classification results and the augmentation length. A shorter augmentation length yields more training data, but it does not bring better results and requires more computing power.
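The trade-off between augmentation length and the amount of training data can be illustrated with a small sketch. The head-to-tail procedure itself is not specified in this section, so the code assumes one plausible reading: each augmented copy is made by moving the first k × length samples of a trial from its head to its tail (a cyclic shift). The function name `head_to_tail_augment` and the interpretation of the length as a number of samples are assumptions.

```python
import numpy as np

def head_to_tail_augment(trial, length):
    """Create shifted copies of one trial (assumed head-to-tail scheme).

    trial: array of shape (n_channels, n_samples).
    length: shift step in samples (assumed unit).
    Returns an array of shape (n_copies, n_channels, n_samples),
    where the first copy (shift 0) is the original trial.
    """
    n_samples = trial.shape[-1]
    shifts = range(0, n_samples, length)
    # A negative roll moves the first `s` samples to the end of the trial.
    return np.stack([np.roll(trial, -s, axis=-1) for s in shifts])

# Toy trial: 1 channel, 10 samples with values 0..9.
toy = np.arange(10).reshape(1, 10)
aug = head_to_tail_augment(toy, length=5)
print(aug[1, 0])  # [5 6 7 8 9 0 1 2 3 4]
```

Under this assumed scheme, a trial of 1000 samples yields 50 copies at a length of 20 but only 10 copies at a length of 100, which matches the observation above that shorter lengths produce more training data at a higher computational cost.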

E. Future Work
In the future, we will investigate lightweight networks for the classification of motor imagery, with the aim of reducing the number of network parameters and improving operational efficiency. In addition, we will further investigate which network architectures are best suited to processing MI-EEG signals and try to achieve better results with fewer channels.

V. CONCLUSION
In this paper, we propose a neural network model called MSCTANN, a deep learning-based signal recognition method for multi-class MI classification. The multi-scale module of the MSCTANN model automatically extracts and screens features, capturing rich feature information about the differences between MI-EEG signals. The CTAM layer automatically learns which channel and temporal information in the data is valid, making the learning of the network more targeted. Additionally, this paper proposes a data augmentation method that increases the number of training samples and thus provides more information to the MSCTANN model. Experiments on two four-class datasets verify the validity of the method. Our MSCTANN model offers ideas for the design of deep learning architectures and contributes to the MI recognition task.