A Novel Algorithmic Structure of EEG Channel Attention Combined With Swin Transformer for Motor Patterns Classification

With the development of brain-computer interface (BCI) technologies, EEG-based BCI applications have been deployed for medical purposes. Motor imagery (MI), applied to promote neural rehabilitation for stroke patients, is among the most common BCI paradigms. Electroencephalogram (EEG) signals encompass an extensive range of channels, rendering the training dataset high-dimensional. This high dimensionality tends to challenge traditional deep learning approaches, causing them to disregard the intrinsic correlations among channels. Such an oversight often culminates in erroneous classification, a significant drawback of these conventional methodologies. In our study, we propose a novel algorithmic structure of EEG channel attention combined with Swin Transformer for motor pattern recognition in BCI rehabilitation. Effectively, the self-attention module of the transformer architecture can capture the temporal-spectral-spatial features hidden in EEG data. The experimental results verify that our proposed method outperformed other state-of-the-art approaches with an average accuracy of 87.67%. This implies that our method can extract high-level and latent connections among temporal-spectral features, in contrast to traditional deep learning methods. This paper demonstrates that channel attention combined with the Swin Transformer has great potential for implementing high-performance motor pattern-based BCI systems.


I. INTRODUCTION
NEUROSCIENCE research is growing exponentially, and as a result, the brain-computer interface (BCI) has become one of the most crucial tools in neuroscience for communication and control purposes [1], [2]. A BCI is a device that enables the use of the brain's neural activity for communication without relying on neural control of limb movements [3], and it can be widely applied in various medical fields. Traditional rehabilitation during the treatment phase of a stroke is insufficient for patients, and brain-machine interfaces (BMIs) offer a promising solution for guiding motor rehabilitation. BMIs can be used on patients with no remaining movement, making them a valuable tool in such cases. Given the rigor required in the medical field, researchers have focused on improving model accuracy. Ang et al. employed the common spatial pattern (CSP) [4] for feature extraction and combined it with support vector machines (SVMs) for classification tasks [5], [6]. Linear discriminant analysis (LDA) [7] for motor imagery classification yielded an average classification accuracy of 80% [8]. Later, more classifiers were applied to EEG datasets, such as decision trees [9], which are now common in BCI systems [10]. Furthermore, some researchers use the same classifier but different feature extraction methods, for example, the power spectrum, variance, and mean of the Haar mother wavelet, to extract the important information from EEG datasets [11]. In addition, Cao et al. [12] presented an innovative method for extracting EEG signal features. This technique combines CSP, PCC, PLV, and TE features and integrates network connections with temporal-spatial analysis. The results demonstrate that the fusion network method performs very well on EEG datasets.
In the era of "big data," the need to transform enormous volumes of data into usable information has increased across many disciplines [13], while it is challenging to use traditional algorithms to extract the necessary features and classify them into logical control signals [14]. In contrast, flexible deep learning models [15] have become one of the most effective solutions for a variety of challenges involving extensive data. Deep learning frameworks have recently been used for decoding and classifying EEG signals, which are often characterized by low signal-to-noise ratios (SNRs) and high dimensionality [16]. Convolutional neural networks (CNNs) have demonstrated exceptional efficacy across a multitude of domains, serving as a vital representation of the progress made in deep learning. Accordingly, preliminary research has begun to examine the potential of CNNs for brain-signal decoding [17]. Sakhavi et al. [18] used the filter-bank common spatial pattern method to optimize CNNs for capturing temporal features of EEG signals, while Dewi et al. [19] employed CNNs on the power spectral density (PSD) of EEG recordings from normal and stroke subjects to classify pain levels. However, CNNs may encounter difficulties in processing EEG signals with intricate temporal relationships, particularly when dealing with lengthy sequences. Diverse recurrent neural networks (RNNs) [20] have been suggested to learn the temporal aspect of EEG data and have shown advantages in some situations [21]. To deal with long signals, long short-term memory (LSTM) networks [20] utilize gates that manage the flow of information in order to learn long-term relationships in EEG signals. Due to their inefficiency, however, these approaches are insufficient for massive datasets [22]. Furthermore, these methods naturally rely more on the last hidden states to capture dependencies on previous elements, rather than operating on the overall time series, so neither CNNs nor RNNs are sufficient to discern global EEG dependencies.
The Transformer method is renowned for its use of attention to describe long-range data dependencies [23]. Natural language processing (NLP) aims to enable computers to understand and generate human language in a manner that resembles human communication. This interdisciplinary field draws on computer science, linguistics, and psychology to develop algorithms and models that handle the complexity and ambiguity of natural language. Among NLP's objectives are sequence modeling, sentiment analysis, machine translation, transduction tasks, named entity recognition, and question answering. NLP is a rapidly evolving field with ongoing efforts to overcome the challenges posed by natural language, making communication more natural and effective. The significant accomplishments of transformers in NLP have motivated researchers to explore their potential application in BCI. Recently, they have demonstrated encouraging outcomes in certain tasks, such as the Vision Transformer (ViT) [24]. However, transformer-based models [23], [24], [25] have certain limitations in the task of EEG classification. One of these distinctions relates to scale. In contrast to the word tokens that serve as the core processing pieces in language transformers, the scale of the important EEG frequency bands can vary significantly. Existing transformer-based models use tokens of a fixed size, which is inappropriate for such EEG applications. Moreover, the number of EEG signal elements is far higher than the number of words in a text passage.
Traditional methods such as CSP and FBCSP have been widely used in EEG research for feature extraction and classification. Despite their capacity to deal with EEG datasets, the sensitivity of the CSP method to noise is a major limitation to its success. The CSP algorithm depends on the assumption of a stationary signal, and the presence of noise can disrupt this assumption, leading to reduced feature extraction accuracy. CNNs are designed to capture spatial patterns in the data but are not as well suited to capturing temporal dependencies, even though EEG signals are time-series data containing temporal dependencies between consecutive time points. Although LSTMs are better equipped to capture these temporal dependencies, they still have certain limitations when processing long-term dependencies. Attention mechanisms have proven effective in EEG classification, particularly in the presence of noise [26]: they allow the model to concentrate on the most relevant parts of the signal, thereby enhancing its ability to capture long-term dependencies, and they have been shown to improve performance in various EEG classification tasks [27], [28], [29]. The transformer architecture, including ViT, is nevertheless constrained by the limited scope and dimension of its attention mechanism, making it incapable of effectively modeling relationships among different channels.
To address these problems, we design a novel channel attention mechanism based on the Swin Transformer [30], such that it can serve as a general-purpose backbone for EEG classification tasks. We exploit its shifted window partitioning between consecutive self-attention layers to capture several essential characteristics of EEG signals of different lengths. Furthermore, we add the channel attention mechanism before the Swin Transformer to obtain channel attention maps, leading the model to focus more on the importance of each channel. This is an advance over other models, which, due to the huge disparity between the number of channels and the number of temporal samples, typically ignore the essential channel connections.
The remaining sections are organized as follows: Sec. II provides an overview of the relevant studies. Sec. III presents the research methodology employed in this study. The experimental design and outcomes are summarized in Sec. IV. In Sec. V, we provide concluding remarks on our work.

II. RELATED WORKS
Many recent studies have focused on MI-EEG classification to extend its influence. One such study reduced the classification error rate by an average of 1.07% on BCI Competition III & IVa [31]. Ang et al. [32] proposed the filter bank common spatial pattern (FBCSP), in which the EEG is divided into multiple frequency bands during pre-processing. They extracted and selected corresponding CSP features from the different frequency bands, improving classifier performance through this feature algorithm. In addition, through benchmarking, they found that the method yielded high cross-validation accuracy. Thomas and colleagues introduced a novel discriminative filter bank common spatial pattern method in their 2009 study [33], achieving error rates about 17.42% and 8.9% lower on BCI Competition III & IVa and Competition IV & IIb [34], respectively.
The development of deep learning has introduced more cutting-edge technologies to MI-EEG analysis, such as the multilayer perceptron (MLP) [35] and the convolutional neural network (CNN) [36]. The use of convolution kernels to establish a shared receptive-field structure has greatly reduced the complexity of neural networks and is widely utilized in computer vision (CV). This approach has also proved highly effective for processing EEG data [37] due to the impressive feature extraction capabilities of CNNs. Tang et al. [38] first proposed the use of CNNs for processing EEG data, achieving an accuracy of up to 86.41% ± 0.77 on the BCI competition dataset, a significant improvement over traditional methods.
Stroke is one of the primary causes of long-term disability in adults and contributes significantly to the worldwide socioeconomic burden [39], [40]. Even though motor imagery (MI) stroke therapy efficiently promotes brain remodelling, current therapeutic methods are hard to quantify and their repetitive nature can be demoralizing. For stroke rehabilitation, a real-time EEG-based MI-BCI system with a virtual reality (VR) game as motivating feedback has been created [41] based on CNN methods. However, CNNs have limits in recognizing global dependencies, which is insufficient for an EEG paradigm with strong global links [25], [42]. The Transformer, designed by the Google team [23], became the state of the art in the NLP field in 2017, and a large number of updated versions have since emerged [43]. Although the transformer architecture was used extensively in NLP, its usage in CV remained restricted [44]. Dosovitskiy et al. [24] constructed a novel architecture called the Vision Transformer (ViT) to tackle computer vision problems. At the same time, the complexity of global self-attention increases significantly with high image resolution and large pixel counts, resulting in longer training times. In 2021, Liu and colleagues [30] developed the Swin Transformer, featuring a shifted-window operation and a hierarchical architecture, to resolve the aforementioned difficulties. In their experiments, their method achieved an accuracy of 86.4% and also performed very well on object detection and image segmentation tasks. However, none of these methods can extract the implicit interconnection among channels, resulting in a significant loss of mutual information. Hence, in our paper, we construct channel attention, which fully utilises the internal connections to bridge the gap between the channels and our goal.

A. Common Spatial Pattern
The common spatial pattern (CSP) is frequently utilized as a spatial-filtering feature extraction method for binary classification tasks. CSP extracts the spatial component distribution of each category from multiple channels for BCI classification [45]. It applies joint diagonalization of the covariance matrices to locate a collection of optimal spatial filters for projection. In this way, the variance difference between the two classes of signals is maximized, and feature vectors with higher discrimination can be obtained. A normalized covariance matrix is computed for the n-th class, the original data $A_n$ are filtered by the spatial filter $\omega$, and the feature vector $f_p$ is obtained, where $\mathrm{var}(A_{csp}(p))$ is the variance of row $p$ of the feature matrix.
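The display equations referenced above appear to have been lost in extraction. As a hedged reconstruction, a standard CSP formulation consistent with the symbols used here (not necessarily the authors' exact notation) is:

$$R_n = \frac{A_n A_n^{\top}}{\operatorname{trace}\left(A_n A_n^{\top}\right)}, \qquad A_{csp} = \omega A_n, \qquad f_p = \log\left(\frac{\operatorname{var}\left(A_{csp}(p)\right)}{\sum_{i}\operatorname{var}\left(A_{csp}(i)\right)}\right).$$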

B. Transformer
The encoder and decoder are the two key components that make up the complete architecture. Each encoder is constructed in an identical manner, but they do not share any parameters. Within each encoder there are two sub-layers: a self-attention layer and a feed-forward neural network (FNN). The feed-forward network applied to each word position is the same [46], but each word follows its own route through the encoder, and these routes interact with one another only in the self-attention layer. The output of the feed-forward network is passed to the subsequent encoder. A residual connection is used around each sub-layer, followed by layer normalization [47]. Building on the encoder, the decoder incorporates an extra encoder-decoder attention sub-layer, which focuses only on the significant components of the input text. An attention function describes the mapping between an input query and a collection of key-value pairs: the output vector is a weighted sum of the values, and the weights are determined by a compatibility function of the query Q with the keys K. The attention score [23], which uses queries and keys of dimension $d_k$ and values of dimension $d_v$, is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$

In addition to the attention sub-layers, a fully connected feed-forward network is applied to each position in the encoder and decoder individually and identically across all layers. It comprises two linear transformations with a ReLU activation in between,

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2, \qquad (4)$$

where $W_1$, $W_2$, $b_1$, and $b_2$ are the parameters of the two linear layers and $x$ is the input from the previous layer.
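As an illustration only (not the authors' implementation), the following NumPy sketch computes the scaled dot-product attention and the position-wise feed-forward transform given above; the toy dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # query-key similarities
    return softmax(scores) @ V             # weighted sum of the values

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear maps with a ReLU in between (Eq. 4)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# toy example: 5 tokens, d_k = d_v = 8, hidden size 16
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((5, 8))
out = ffn(attention(Q, K, V),
          rng.standard_normal((8, 16)), np.zeros(16),
          rng.standard_normal((16, 8)), np.zeros(8))
print(out.shape)  # (5, 8)
```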
The transformer converts input and output tokens into vectors of dimension $d_{model}$, and the decoder output is then converted into next-token probabilities by a learned linear transformation followed by a softmax function. The same weight matrix is shared between the embedding layers and the pre-softmax linear transformation. To capture the sequence order, positional information must be injected into the input embeddings. The sinusoidal positional encoding is

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$

where $pos$ refers to the position of the word in the sequence, $d_{model}$ is the output dimension of the embedding layer, and $i$ indexes the dimensions of the token embedding vector.
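A short NumPy sketch of the sinusoidal positional encoding above, for illustration only (the sequence length and model dimension are toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dims, cosine on odd dims."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=32)
print(pe.shape)  # (50, 32)
```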

C. Swin Transformer
The Swin Transformer aims to extend the applicability of the Transformer such that it can be used as a general-purpose backbone for computer vision, analogous to the role transformers play in NLP and CNNs play in vision. A significant component of the Swin Transformer design is how it varies the window partition between consecutive self-attention layers. The shifted windows create linkages between the windows of the preceding layer, resulting in a significant increase in modeling capacity. Fig. 1 illustrates a simplified model of the Swin Transformer design (Swin-T) and provides an explanation of its components. Similar to the patch splitting module in the Vision Transformer (ViT), an input RGB image is divided into non-overlapping patches. Each patch is regarded as a token, and its feature is set to the concatenation of the raw RGB pixel values. The implementation uses a patch size of 4 × 4, and as a result, the feature dimension of each patch is 4 × 4 × 3 = 48. Through a linear embedding layer, this raw-valued feature is mapped to an arbitrary dimension (denoted as C). Several Transformer blocks with modified self-attention computation (referred to as Swin Transformer blocks) are applied to these patch tokens. Together with the linear embedding, these Transformer blocks, which maintain the number of tokens at H/4 × W/4, are referred to as Stage 1.
As the depth of the network increases, the number of tokens is reduced by patch merging layers, which yields a hierarchical representation. The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer to the resulting 4C-dimensional concatenated feature. The number of tokens is thereby reduced by a factor of 4 (a 2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks are then applied to transform the features, with the resolution kept at H/8 × W/8. This first block of patch merging and feature transformation is referred to as Stage 2. The procedure is repeated twice, with output resolutions of H/16 × W/16 and H/32 × W/32, as Stage 3 and Stage 4, respectively. The Swin Transformer block is created by replacing the regular multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows while leaving the other layers unchanged. It consists of a shifted-window-based MSA module followed by a 2-layer MLP with GELU non-linearity. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. Since the Swin Transformer enables collaborative modeling of visual and textual data and allows deeper modeling knowledge to be shared, a unified architecture for computer vision and natural language processing could be advantageous for EEG signals. Applying the Swin Transformer to a variety of BCI problems can strengthen this viewpoint.
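The following PyTorch sketch illustrates the block structure just described (window partitioning, an optional cyclic shift for SW-MSA, and the pre-norm MLP with residual connections). It is a simplified illustration under our own assumptions, not the authors' or the official Swin implementation; in particular, the relative position bias and the attention mask that suppresses cross-window interactions after the shift are omitted, and the example dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    """Simplified Swin block: (S)W-MSA + 2-layer MLP, pre-LayerNorm, residuals."""
    def __init__(self, dim, num_heads, window_size=7, shift=0):
        super().__init__()
        self.window_size, self.shift = window_size, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C), H and W divisible by window_size
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                         # cyclic shift for SW-MSA
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        ws = self.window_size
        # partition into non-overlapping ws x ws windows
        x = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, C)          # (num_windows * B, ws*ws, C)
        x, _ = self.attn(x, x, x)              # self-attention within each window
        # merge windows back to the (B, H, W, C) layout
        x = x.reshape(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, H, W, C)
        if self.shift:                         # reverse the cyclic shift
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                       # residual connection around (S)W-MSA
        return x + self.mlp(self.norm2(x))     # residual connection around the MLP

# toy usage: a Stage-1-sized feature map (56 x 56 tokens, C = 96)
blk = SwinBlock(dim=96, num_heads=3, window_size=7, shift=3)
y = blk(torch.randn(2, 56, 56, 96))
print(y.shape)  # torch.Size([2, 56, 56, 96])
```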

D. EEG Channel Attention Combined With Swin Transformer
Because the temporal dimension of EEG data is significantly greater than the channel dimension, the classic transformer model ignores some essential channel features of EEG data. Consequently, we created a unique structure combining a channel attention mechanism with a Swin Transformer. We introduce an EEG channel attention (ECA) block, which exploits the dependencies between channels so that the model can extract important characteristics under various conditions. ECA is a specific attention mechanism that helps the model assign a different weight to each channel and extract more crucial and significant information. The ECA block consists of an LN layer, an ECA layer, and a dropout layer, as shown in Fig. 2. Fig. 3 illustrates the structure of the channel attention mechanism and the flow of data. The dimension of the dataset is denoted as N × T, where N represents the number of channels in the EEG signal and T represents the length in time. The structure outlined in Fig. 3 reveals a distinct characteristic: each channel carries an extensive data dimensionality, which requires the model to allocate heightened attention to individual channels. Therefore, the structure of our model, designed to accommodate these specifications, is logical and appropriate. It is comparable to the transformer's scaled dot-product attention structure, with a series of queries and keys constituting the input. In equation (7), $x_l$ denotes the features extracted by layer normalization. We apply the query matrix $Q_C$ and the key matrix $K_C$ to obtain the output features $z_l$. To avoid overfitting and redundant features, we add a dropout layer at the end of each block to compute the next input features $x_{l+1}$.
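Since equation (7) is described only in words above, the following PyTorch sketch shows one plausible reading of the ECA block (LayerNorm, channel-wise scaled dot-product attention with learned projections $Q_C$ and $K_C$, then dropout). It is our own hedged illustration, not the authors' released code; the key dimension, dropout rate, and the example input size (31 channels, 1200 time points) are assumptions.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """One reading of the EEG channel-attention block: LN -> channel attention -> dropout."""
    def __init__(self, n_timepoints, d_k=64, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(n_timepoints)       # LN over each channel's time series
        self.q_proj = nn.Linear(n_timepoints, d_k)   # Q_C: channel queries
        self.k_proj = nn.Linear(n_timepoints, d_k)   # K_C: channel keys
        self.drop = nn.Dropout(p_drop)
        self.scale = d_k ** -0.5

    def forward(self, x):                             # x: (batch, N channels, T time points)
        x_l = self.norm(x)                            # x_l in Eq. (7)
        q, k = self.q_proj(x_l), self.k_proj(x_l)     # (batch, N, d_k)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, N)
        z_l = attn @ x_l                              # channels re-weighted by the attention map
        return self.drop(z_l)                         # x_{l+1}: input to the next block

# toy usage: a batch of 8 EEG segments with 31 channels and 1200 samples (assumed sizes)
eca = ECABlock(n_timepoints=1200)
x_next = eca(torch.randn(8, 31, 1200))
print(x_next.shape)  # torch.Size([8, 31, 1200])
```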

A. Experimental Setup
Seven stroke patients, selected from the Department of Rehabilitation Medicine at Huashan Hospital in Shanghai and ranging in age from 25 to 75 years (41.6 ± 12.0), participated in the tests. None of them had experience with a motor-modality BCI. All participants were informed of the procedures involved in our investigation and gave written consent. In the experiment, each patient completed 12 sessions, with every three sessions finished within one week. Each session included three runs, with each run comprising thirty trials of two motor tasks (motor attempt or resting state), presented in random order. Selecting suitable parameters for the deep learning model is of paramount importance for classification accuracy. For the parameters of the model (Swin Transformer), we applied the default parameter settings. Performance is improved by selecting the optimal values of the penalty factor, the learning rate, and the parameters of the ECA using the grid-search method [48] (in our settings, the learning rate is 1e-5 and the penalty factor is 5e-3).
In the clinical investigation, the patient sat on a chair one meter in front of a computer screen. Initially, a white arrow was displayed for three seconds. Then, a red circle representing the motor task or a red rectangle representing the resting condition appeared for 1.5 seconds. After the circle faded, the patient was required to imagine moving his or her wrist as vividly as possible without any compensatory movements. The subject then took a break of about 1.5 seconds after the cue disappeared. During this interval of inactivity, the subject was required to keep the eyes open without thinking. Because a pure resting state cannot be achieved in practice, the subject maintained a very light workload while performing the regulated resting task. The entire pipeline consists of three components, as shown in Fig. 4: preprocessing, followed by feature extraction and classification. Patients were instructed to wear an acquisition device to capture EEG signals. After filtering and denoising, the processed EEG data are annotated with position embeddings to extract the channels' feature map before entering the EEG Channel Attention block. The classification result is obtained by feeding the final processed features into the Swin Transformer's structural blocks.

B. Preparing the Input
The EEG acquisition instrument used in this paper is the actiCAP with Ag/AgCl electrodes, developed by Brain Products GmbH, Germany. The signal is amplified by a Brain Products amplifier. The electrodes are placed in accordance with the international 10-20 standard electrode placement method. The placement locations of the 31 electrodes are FP1, FZ, F3, F7, FT9, FC5, FC1, C3, T7, TP9, CP5, CP1, PZ, P3, P7, O1, O2, P4, P8, TP10, CP6, CP2, CZ, C4, T8, FT10, FC6, FC2, F4, F8, and FP2, respectively. The reference electrode is located at the right mastoid process. The irrelevant sensory part of the raw EEG signal, sampled at 200 Hz, was removed in the preprocessing stage using a 5-30 Hz band-pass filter. Fig. 5 shows the scalp spatial power maps of two different states for Subject 1.
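For illustration only (not the authors' preprocessing code), the 5-30 Hz band-pass filtering of the 200 Hz recordings could be performed as in the sketch below; the Butterworth design, the filter order, and zero-phase filtering are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 200.0                                   # sampling rate (Hz)
b, a = butter(4, [5.0, 30.0], btype="bandpass", fs=fs)   # assumed 4th-order Butterworth, 5-30 Hz

eeg = np.random.randn(31, 10 * int(fs))      # placeholder data: 31 channels, 10 s of EEG
eeg_filtered = filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering along the time axis
print(eeg_filtered.shape)                    # (31, 2000)
```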

C. Results
In order to evaluate the performance of our approach, we used the following metrics to assess classification performance: accuracy (Acc), F1 score, and training time.
Cao et al. [12] presented a unique study of connection networks for EEG-based feature selection, which not only recorded the spatial activities in response to motor tasks but also uncovered the interaction patterns between various brain areas. We follow Cao's previous work to conduct the comparison experiments. Table II and Table III show each subject's classification performance using the ECA mechanism. Both algorithms perform excellently in classification, even reaching an accuracy of 1.0. However, the two methods differ significantly in stability. For the ECA Transformer (56.4%-100%), not every session reaches the threshold of 70% [49], while the ECA Swin Transformer (72.7%-100%) achieves the expected accuracy of more than 70% throughout. The average accuracy is greater than 80% across all subjects, so the method can be widely used in BCI. We therefore prefer the ECA Swin Transformer based on its stable performance.

The ECA Swin Transformer and the PCC, PLV, and TE features combined with CSP are used to compare the receiver operating characteristic (ROC) curves generated on the test set. ROC curves [50] have properties that make them particularly advantageous for domains with asymmetric class distributions and unequal classification error costs. As research in cost-sensitive learning and learning in the presence of imbalanced classes progresses, the significance of these properties has increased. The ROC curves for all subjects are shown in Fig. 6. The area under the ROC curve of the proposed method in each figure is larger than those of the other three methods, indicating that the ECA Swin Transformer performs better.

Table IV shows the significance of the differences in accuracy between the ECA Swin Transformer and the other methods. We consider a difference significant when the p-value < .005 [51]. Combined with Table I, the larger differences between the baseline models and the ECA Swin Transformer demonstrate the greater gap and the lower accuracy of the baseline models: traditional models such as CSP and CNNs show large significant differences and low accuracy. In contrast, the p-value between the Swin Transformer and the ECA Transformer is large, indicating no significant difference. Table I also shows that the accuracy of these two models is much higher than that of the traditional models, which proves that the Swin Transformer structure and the ECA are effective.
In statistics, the F1 score [52] is derived from the precision and recall of the test, where precision is the number of true positive results divided by the total number of positive predictions, including those that were misidentified, and recall is the number of true positive results divided by the total number of samples that should have been identified as positive. In diagnostic binary classification, precision is also known as the positive predictive value, while recall is also known as sensitivity. The F1 scores of the Swin Transformer are shown in Fig. 7. It is clear that the classification effect produced by the Swin Transformer is superior.
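For reference, these definitions correspond to

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.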
To enhance the interpretability of the model, we use Fig. 8 to explain it. The highlighted green segments indicate that those specific periods of the series have a positive influence on identifying the positive class. A darker hue corresponds to a value approaching 1, indicating greater significance in the classification process. Because the Swin Transformer was designed for image processing, its training time is significantly longer than that of ordinary feature extraction models; hence, our comparison focuses solely on the training time of deep learning models. The evaluation was conducted using Python 3.7.1 on a server equipped with four GeForce RTX 2080 Ti GPUs. Fig. 9 illustrates the average training time across all subjects over 12 sessions. During the treatment phase, positive feedback plays a crucial role in encouraging patients to actively engage in training, whereas excessive negative feedback may discourage patients, causing them to lose faith and patience in the training process [53]. Therefore, the length of the training session is not a primary consideration in the context of rehabilitation BCI. Furthermore, model training occurs prior to the patient's rehabilitation, not during the experiment itself. Consequently, model training does not extend the rehabilitation time, nor does it exacerbate the discomfort experienced by the patient during rehabilitation. We also compared the ECA Swin Transformer with other baseline methods in terms of test time. Table V presents the average test time for these methods; as the results demonstrate, the differences in test time are marginal.
Furthermore, we can also use fine-tuning techniques to transfer the learned recognition capabilities from general domains to the specific task by inputting each patient's dataset, which drastically reduces training time. To better illustrate the differences in the subjects' training time, Fig. 10 shows the best epoch (out of 80) for each model, which has a great impact on training time. In the experiment, the convergence epochs for subjects 3 and 4 are 36 and 25, respectively, while the convergence epochs for subjects 5 and 7 are 45 and 42. Furthermore, we compared the training time of our model with that of a CNN, as depicted in Fig. 11. Due to the complexity of our ECA Swin Transformer model, its training time is approximately ten times longer than that of a CNN.

D. Limitations of Our Works
It is important to recognise the possible limitations of our work. Firstly, large training datasets significantly increase the required training time. Transformer encoders require sufficient data to establish a robust sequence learning ability. Even though we applied the Swin sliding-window strategy to reduce the training cost by dividing each trial into several segments, the dimensionality of the datasets remained considerable, resulting in longer training times compared to traditional models. Accordingly, we are committed to improving our model and reducing the training time by incorporating few-shot learning techniques [54]. By leveraging the advantages of few-shot learning, we aim to enhance the efficiency of our model while maintaining its performance and effectiveness.

V. CONCLUSION
We have proposed a novel approach for classifying MI-EEG data by combining an EEG channel attention block with the Swin Transformer. Through the ECA mechanism, we assign weights to the channel features, with particular emphasis on the crucial features of each channel. Our proposed end-to-end framework, distinct from traditional MI-EEG classification methods, requires only simple preprocessing of the data to obtain results. The experiments have demonstrated that our approach significantly improves classification accuracy by extracting key features, outperforming other advanced deep learning models. In the future, we will continue to explore and optimize the framework to develop a lightweight model suitable for deployment across all MI-EEG tasks. The proposed methodology has the potential to enhance our understanding of brain-computer interfaces and aid in the development of more efficient and effective medical devices for patients suffering from neurological disorders.