M-FANet: Multi-Feature Attention Convolutional Neural Network for Motor Imagery Decoding

Motor imagery (MI) decoding methods are pivotal in advancing rehabilitation and motor control research. Effective extraction of spectral-spatial-temporal features is crucial for MI decoding from limited and low signal-to-noise ratio electroencephalogram (EEG) signal samples based on brain-computer interface (BCI). In this paper, we propose a lightweight Multi-Feature Attention Neural Network (M-FANet) for feature extraction and selection of multi-feature data. M-FANet employs several unique attention modules to eliminate redundant information in the frequency domain, enhance local spatial feature extraction and calibrate feature maps. We introduce a training method called Regularized Dropout (R-Drop) to address training-inference inconsistency caused by dropout and improve the model’s generalization capability. We conduct extensive experiments on the BCI Competition IV 2a (BCIC-IV-2a) dataset and the 2019 World robot conference contest-BCI Robot Contest MI (WBCIC-MI) dataset. M-FANet achieves superior performance compared to state-of-the-art MI decoding methods, with 79.28% 4-class classification accuracy (kappa: 0.7259) on the BCIC-IV-2a dataset and 77.86% 3-class classification accuracy (kappa: 0.6650) on the WBCIC-MI dataset. The application of multi-feature attention modules and R-Drop in our lightweight model significantly enhances its performance, validated through comprehensive ablation experiments and visualizations.


I. INTRODUCTION
A brain-computer interface (BCI) constitutes a communication conduit connecting the nervous system to the external environment [1]. Motor imagery (MI) entails the internal rehearsal of a movement before its execution [2], and it serves as a cornerstone of BCI research [3]. Electroencephalography (EEG), a non-invasive technique renowned for its cost-effectiveness and convenience, facilitates the recording of neural activity with high temporal resolution [4], [5]. When participants visualize moving parts of the body, specific areas of the brain undergo energy changes, known as event-related desynchronization/synchronization (ERD/ERS), which can be recorded via EEG and then used to discriminate motor intent [6], [7]. MI-based BCI systems have advanced notably, enabling the control of exoskeletons [8], [9] and cursors [10]. Furthermore, the integration of MI with virtual reality technology [11] has shown promising prospects for stroke rehabilitation [12]. The success of these systems hinges on high-performance MI decoding methods [13]. However, improving the classification performance of spontaneous MI, as compared to other BCI paradigms reliant on external stimuli, such as event-related potential [14] and steady-state visual evoked potential [15], poses a formidable challenge due to factors like a low signal-to-noise ratio and cross-subject variability [16].
The EEG signals, distinguished by their excellent temporal resolution, encompass rich spectral and spatial features [4]. The integration of these temporal-spatial-spectral features can bolster classification accuracy. Traditional machine learning methods generally extract neurophysiological features from MI-induced EEG signals. For instance, the Common Spatial Pattern (CSP) is a spatial-domain filtering feature extraction method that distills spatially distributed components from multichannel EEG signals corresponding to each class [17]. The Filter Bank CSP (FBCSP) method enhances CSP performance by segmenting EEG data into multiple frequency bands [18]. However, traditional machine learning methods rely heavily on handcrafted features and may therefore fail to fully exploit the latent information embedded in the data.
With the advent of deep learning, a wealth of novel methods has emerged for EEG signal classification [19], such as Convolutional Neural Networks (CNN) [20]. DeepConvNet [21], proposed by Schirrmeister et al., employed multiple convolutional layers with larger temporal and spatial feature extraction kernels. Sakhavi et al. [22] utilized FBCSP for feature extraction, followed by CNN-based classification. These comparatively larger networks have demonstrated remarkable classification performance compared to traditional methods. Nevertheless, larger models pose challenges in hyperparameter selection, as they are more prone to overfitting and demand longer training times.
Recently, novel lightweight deep-learning networks based on CNN have achieved better performance. These networks are efficient and outperform traditional deep learning models in accuracy when applied to MI-BCIs. FBCNet [23], proposed by Mane et al., begins with spectral filtering of raw EEG data, followed by a spatial convolutional layer, and finally computes temporal variance. FBCNet effectively encapsulates time-frequency domain feature information. Lawhern et al. [24] introduced a compact network, EEGNet, employing depthwise and separable convolution for spatial-temporal feature extraction. EEGNet learns its frequency filters entirely through deep learning. However, hand-crafted frequency bands also encompass a substantial amount of frequency domain information, necessitating their effective combination with automatically learned frequency filters. In addition, these networks typically utilize a convolution kernel identical in size to the number of electrodes for direct spatial feature extraction. Nevertheless, considering that MI primarily activates particular brain regions [25], [26], it is crucial to place an additional focus on these responsive areas rather than solely relying on global spatial feature extraction.
Attention mechanisms, which have been widely adopted in natural language processing and image processing, have recently been successfully applied to MI-EEG decoding. The MI-DABAN [27], proposed by Li et al., leveraged domain-specific attention modules based on CNN to capture critical features in both the source and target domains, achieving higher transfer performance. Song et al. introduced the Conformer [28], which applied Transformer modules based on self-attention to MI decoding, achieving strong classification performance. Self-attention modules have the advantage of globally extracting information from the entire sequence without being constrained by distance, and they apply weights separately. However, self-attention modules come with drawbacks, such as a high number of parameters and demanding computational resources. Moreover, they are not tailored for EEG data, leading to suboptimal resource utilization.
EEG signals often suffer from limited training samples, leading to overfitting. Besides adopting more lightweight network structures, exploring EEG signal-specific training methodologies is essential. Recently, Regularized Dropout (R-Drop) [29], a novel training method for tackling deep learning overfitting issues, has shown promise in natural language and image processing. R-Drop enhances the model's generalization ability by narrowing the inconsistency between the complete model and sub-models, thereby boosting the final performance of the model. This training method is apt for MI task classification networks prone to overfitting.
To tackle the above issues, we propose a lightweight network incorporating multi-feature attention named Multi-Feature Attention Neural Network (M-FANet). This model comprises multiple convolutional layers for feature extraction and several distinct attention modules for calibrating relevant information from three perspectives: frequency, local space, and feature map. The multi-feature attention modules adaptively extract valid frequency band features specific to individual subjects, heightening the perception of the spatial response of MI-related channel groups and filtering out redundant feature maps. Additionally, we employ R-Drop as a training method, reducing the discrepancy between various sub-models resulting from dropout by constraining the probability distribution differences of their outputs, thereby enhancing the performance of the complete model. We conduct detailed comparative experiments and ablation studies on two MI datasets to demonstrate the significant performance of M-FANet.
The contributions are summarized as follows:
• We propose a lightweight network named Multi-Feature Attention Neural Network (M-FANet) by extracting a broader range of features from EEG signals and using multiple attention modules to effectively leverage them to enhance the classification performance of MI tasks.
• We introduce R-Drop as a model training method to mitigate overfitting, an issue stemming from the limited and noisy nature of EEG samples.
• We conduct experiments on two datasets to evaluate the effectiveness and superiority of the proposed M-FANet against state-of-the-art MI decoding methods. In addition, we delve into the impact of the regularization term introduced by R-Drop on the network performance.

A. Traditional Methods
Many traditional machine learning methods for MI-EEG classification have been proposed. Barachant et al. introduced a brain-computer interface classification framework based on MI, employing Riemannian geometry for direct classification using spatial covariance matrices as EEG signal descriptors [30]. CSP [17] is one of the most effective and popular feature extraction methods in EEG signal processing. It differentiates brain activities under various tasks by analyzing the spatial distribution of multi-channel EEG signals.
However, the effectiveness of CSP is greatly influenced by the selected frequency bands as well as the selected features. To address this issue, mutual information-based selection of optimal spatial-temporal patterns (OSTP) [31] utilized a feature selection method based on mutual information to optimize the CSP method, facilitating the automatic selection of frequency bands and time segments for CSP filtering. Ozdenizci et al. introduced a method that leverages information theoretic learning to enhance neural feature interpretation and overcome the constraints of traditional feature ranking and selection techniques in model training [32].
FBCSP [18] divided the EEG data into multiple frequency bands and applied CSP to these segments. Subsequently, a feature selection algorithm was employed to automatically select features specific to each subject, thereby significantly enhancing classification accuracy. The FBCSP method stands as one of the most successful in the field and is extensively used as a reference in method comparisons.

B. Deep Learning Methods
In recent years, a surge of deep learning methods has demonstrated promising advancements in the sphere of MI-BCI [19]. Among these, EEGNet [24] has emerged as a notably compact deep learning network. By leveraging depthwise and pointwise convolutions, EEGNet significantly reduces the number of parameters, thereby ensuring a lightweight network structure. The model begins with a convolution along the temporal dimension, broadening the feature maps. Then, it carries out a depthwise convolution along the spatial dimension, compressing the entirety of the spatial information. Ultimately, it uses depthwise separable convolution to extract features.
Compared to EEGNet, FBCNet manually extracts temporal features by calculating variance rather than employing temporal filters. Moreover, FBCNet utilizes a bank of band-pass filters to segment the data into multiple frequency bands. Owing to their reliable performance, EEGNet and FBCNet serve as baseline methods in our experiments.
Attention mechanisms are being widely applied in EEG decoding. More recently, Conformer [28], which leveraged self-attention modules to learn global temporal features, has reached the highest classification accuracy on the BCIC-IV-2a dataset. MI-DABAN approached EEG feature transfer with the design of compact attention modules, leading to improved classification performance. The advantage of such attention modules is that they are designed around the characteristics of the original input, which suits MI-EEG and ensures both lightweight implementation and effective feature extraction. Inspired by these observations and considerations, we propose the Multi-Feature Attention Neural Network (M-FANet), aimed at enhancing the extraction of features related to the principles of MI.

A. Proposed M-FANet Architecture
Among the deep learning methods employed for MI, EEGNet is known for its compact architecture and clear interpretation. However, its feature extraction capability is limited. Building upon the foundation provided by EEGNet, we propose M-FANet, which first incorporates data selectively from hand-crafted frequency bands by utilizing a frequency band attention module. Then, M-FANet's capacity for extracting local spatial features is enhanced by introducing a local spatial attention module.
The kernel length chosen for the following temporal convolution layer, $K_T$, is equivalent to one-eighth of the data sampling rate, and the layer outputs $F_T$ feature maps, where $F_T$ equals the number of temporal filters. A depthwise convolution follows, where each kernel connects to only one preceding feature map. This depthwise convolution layer reduces the number of trainable parameters while integrating all spatial information, where $D$ is the spatial filter multiplier and the number of spatial filters is $D \times F_T$. By adjusting the values of $D$ and $F_T$, the number of feature maps can be modified, allowing M-FANet to adapt to various datasets. Furthermore, a Squeeze-and-Excitation Block (SEBlock) [33] is also introduced, facilitating automatic feature calibration and allowing the network to prioritize feature maps dynamically.
Following the feature map attention mechanism, a separable convolution comprising a depthwise convolution and a pointwise convolution is employed. Specifically, it first extracts temporal features within each feature map and then extracts features across feature maps. This approach enables the network to capture local temporal patterns within individual feature maps and global relationships between different feature maps with fewer parameters. Ultimately, a convolution layer is used in the classification stage to aggregate the features.
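The parameter saving of a separable convolution can be illustrated with a small count. The sketch below is not taken from M-FANet: the helper names and the layer sizes (16 feature maps in and out, a temporal kernel of length 16) are hypothetical and chosen only to show the arithmetic.

```python
def conv2d_params(in_maps, out_maps, kernel_h, kernel_w):
    """Weight count of a standard 2-D convolution (bias omitted)."""
    return in_maps * out_maps * kernel_h * kernel_w


def separable_conv2d_params(in_maps, out_maps, kernel_h, kernel_w):
    """Weight count of a depthwise convolution followed by a pointwise one.

    Depthwise: one (kernel_h x kernel_w) kernel per input feature map.
    Pointwise: a 1x1 convolution that mixes the feature maps.
    """
    depthwise = in_maps * kernel_h * kernel_w
    pointwise = in_maps * out_maps
    return depthwise + pointwise


# Hypothetical sizes, for illustration only.
standard = conv2d_params(16, 16, 1, 16)
separable = separable_conv2d_params(16, 16, 1, 16)
print(standard, separable)
```

With these illustrative sizes the separable variant needs 512 weights instead of 4096, which is the kind of reduction the paragraph above refers to.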
The visual representation and comprehensive description of our M-FANet method are presented in Fig. 1 and Table I, respectively.
1) Frequency Band Attention Module: In machine learning algorithms for MI, the band-pass filter parameters significantly influence the decoding results, which suggests that while the frequency domain is replete with valuable information, it is difficult to extract it accurately. To capture the pertinent information within the frequency domain, we design a frequency band attention module based on the attention mechanism, facilitating adaptive frequency filtering.
A single-trial raw EEG sample can be represented as $X \in \mathbb{R}^{C \times T}$, where $C$ denotes the number of EEG channels and $T$ is the number of time points. Initially, we apply multiple band-pass filters to the raw EEG data, using Chebyshev Type II filters with a stopband ripple of -30 dB and a fixed step size $L$ (set to 4 Hz here). This yields the multi-band EEG data $X_{MB}$:
$$X_{MB}(n) = h(n) \ast X, \quad n = 1, \dots, N_b,$$
where $h(n)$ represents the band-pass filter corresponding to the $n$-th frequency sub-band and $N_b$ is the number of sub-bands. Then, we employ a pointwise convolution to amalgamate the multi-band information. This process enables the network to harness the complementary information from each frequency band. Simultaneously, an adaptive weight is assigned to each frequency band to attenuate noise in redundant frequency bands and amplify useful information in the other frequency bands. After fusing the frequency band information, we derive $X_{FS}$:
$$X_{FS} = \sum_{n=1}^{N_b} w(n)\, X_{MB}(n),$$
where $w(n)$ represents the feature weight corresponding to the $n$-th sub-band.
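The band partitioning and the weighted fusion can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the 4-40 Hz range is an assumption (the text only fixes the 4 Hz step), and the band-pass filtering itself is left out, so `x_mb` is assumed to already hold one filtered copy of the trial per sub-band.

```python
def make_sub_bands(low_hz, high_hz, step_hz):
    """Split [low_hz, high_hz] into contiguous sub-bands of width step_hz."""
    bands = []
    start = low_hz
    while start + step_hz <= high_hz:
        bands.append((start, start + step_hz))
        start += step_hz
    return bands


def fuse_sub_bands(x_mb, weights):
    """Weighted sum over sub-bands, i.e. a pointwise (1x1) convolution.

    x_mb[n][c][t] is the trial filtered into sub-band n (channel c, sample t).
    weights[n] is the learned weight of sub-band n.
    Returns x_fs[c][t] = sum over n of weights[n] * x_mb[n][c][t].
    """
    if len(x_mb) != len(weights):
        raise ValueError("one weight per sub-band is required")
    n_channels = len(x_mb[0])
    n_samples = len(x_mb[0][0])
    return [
        [
            sum(weights[n] * x_mb[n][c][t] for n in range(len(x_mb)))
            for t in range(n_samples)
        ]
        for c in range(n_channels)
    ]


# Assumed 4-40 Hz range with the 4 Hz step from the text.
bands = make_sub_bands(4, 40, 4)
# Toy data: 2 sub-bands, 1 channel, 2 samples.
fused = fuse_sub_bands([[[1.0, 2.0]], [[10.0, 20.0]]], [0.5, 0.25])
print(len(bands), fused)
```

A small weight suppresses a sub-band and a large weight emphasizes it, which is how the module can attenuate redundant bands.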
2) Local Spatial Attention Module: Based on neuroscientific prior knowledge regarding MI [25], [26], it is known that MI predominantly activates motor-related brain regions, especially near the electrodes C3, Cz, and C4. Hence, when extracting features in the spatial dimension, channels should not be treated as equal; instead, focus should be placed on areas relevant to MI. Consequently, leveraging the attention mechanism, we design a local spatial attention module to extract effective spatial information from local electrode groups.
We employ a convolution layer with a small kernel of size $(K_s, 1)$ and a stride, convolving solely along the spatial dimension, to enable the detailed extraction of local spatial features from EEG data. After local spatial feature extraction, we derive $X_{LS}$ as the weighted sum of the inputs under the kernel plus a bias, where $w_s$ represents the weight matrix of the kernel, $m$ denotes the index of the weights within the kernel, and $b$ is the bias term.
The small convolution kernel focuses M-FANet on local receptive fields, promoting the identification of intricate spatial patterns related to various MI tasks. For MI EEG signals, this directs attention towards particular sets of electrodes within the motor area, as evident in the output feature maps.
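A convolution that slides only along the channel axis can be sketched as follows. This is an illustrative reading of the description above, not the published implementation: the function name, the stride handling, and the assumption that adjacent rows of the input correspond to neighboring electrodes are all hypothetical.

```python
def local_spatial_conv(x, kernel, bias=0.0, stride=1):
    """Convolve along the channel (spatial) axis only.

    x[c][t] holds one trial (channel c, sample t).
    kernel[m] are the K_s kernel weights, shared across all time points.
    Output row i combines channels i*stride .. i*stride + K_s - 1:
        out[i][t] = bias + sum over m of kernel[m] * x[i*stride + m][t]
    """
    n_channels = len(x)
    n_samples = len(x[0])
    k = len(kernel)
    n_out = (n_channels - k) // stride + 1
    out = []
    for i in range(n_out):
        start = i * stride
        out.append([
            bias + sum(kernel[m] * x[start + m][t] for m in range(k))
            for t in range(n_samples)
        ])
    return out


# Toy trial: 4 channels, 1 sample each; kernel spans 2 neighboring channels.
toy = [[1.0], [2.0], [3.0], [4.0]]
print(local_spatial_conv(toy, [1.0, 1.0]))
print(local_spatial_conv(toy, [1.0, 1.0], stride=2))
```

Each output row depends on only a few neighboring channels, in contrast to a kernel that spans all electrodes at once.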
3) Feature Map Attention Module: A SEBlock, as shown in Fig. 2, is utilized to apply weights to the feature maps. Consider an input with $F$ feature maps, $C$ channels, and $T$ time points, represented as $X_{SE} \in \mathbb{R}^{F \times C \times T}$. We squeeze the global spatial and temporal information into a feature descriptor by using global average pooling to generate feature-wise statistics:
$$z_f = \frac{1}{C \times T} \sum_{c=1}^{C} \sum_{t=1}^{T} X_{SE}(f, c, t), \quad f = 1, \dots, F.$$
Then, two fully connected layers are used to learn the nonlinear relations between different feature maps:
$$s = \sigma\left(W_{F2}\, \delta\left(W_{F1}\, z\right)\right),$$
where $W_{F1} \in \mathbb{R}^{\frac{F}{r} \times F}$ is the weight matrix of the first fully connected layer, whose reduction ratio $r$ is a hyperparameter, $W_{F2} \in \mathbb{R}^{F \times \frac{F}{r}}$ is the weight matrix of the second fully connected layer, which restores the feature to its original dimension, $\delta(\cdot)$ is the rectified linear unit (ReLU) activation function, and $\sigma(\cdot)$ is the sigmoid activation function.
Fig. 3. The overall framework of R-Drop. We take a CNN structure for illustration. The network conducts dual forward passes and utilizes KL divergence to enforce consistency between the results of these passes.
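The squeeze-and-excitation step of the feature map attention module can be sketched without any deep learning library. The function name and the toy weights are hypothetical, biases are omitted, and the final rescaling of each feature map by its weight follows the standard SEBlock design [33] rather than anything stated explicitly in this section.

```python
import math


def se_block(x, w1, w2):
    """Squeeze-and-excitation over feature maps.

    x[f][c][t]: F feature maps, each C channels by T samples.
    w1: (F/r) x F weight matrix of the first fully connected layer.
    w2: F x (F/r) weight matrix of the second fully connected layer.
    Returns (scales, recalibrated feature maps).
    """
    # Squeeze: global average pooling over channels and time.
    z = [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in x
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Recalibration: scale every value of feature map f by scales[f].
    y = [[[scales[f] * v for v in row] for row in x[f]] for f in range(len(x))]
    return scales, y


# Toy input: F = 2 feature maps, C = 1 channel, T = 2 samples, r = 2.
scales, y = se_block(
    [[[1.0, 3.0]], [[2.0, 2.0]]],
    w1=[[0.0, 0.0]],
    w2=[[1.0], [1.0]],
)
print(scales, y)
```

With all-zero first-layer weights the sigmoid receives 0, so both feature maps are scaled by 0.5 in this toy case; trained weights would give each map its own scale.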

B. R-Drop Module
The dropout technique, which randomly drops a fraction of units with a given probability $p$ (set to 0.5 here) during forward propagation, suppresses overfitting [34]. During the training process, dropout randomly disables some units, thereby enhancing the generalization ability of the model. However, each training iteration uses a sub-model because of dropout, whereas inference uses the complete model. This discrepancy between the models used in training and inference can lead to a situation where the model performs well on the training set but still shows a persistent gap on the test set. This issue is especially pronounced in EEG datasets, whose small sample sizes make them highly susceptible to overfitting.
To address the abovementioned problem, we choose R-Drop as the training method, which constrains the different output predictions of sub-models caused by dropout [29]. The core idea of R-Drop is adding a regularization term to the loss function, which mitigates the discrepancies among different sub-models, narrowing the gap between the complete model and its sub-models. For each sample, the Kullback-Leibler (KL) divergence between the output probability distributions of two randomly chosen sub-models is computed and added to the loss function. The KL divergence encourages sub-models to align as closely as possible in their output probability distributions, narrowing the inconsistency between the complete model and the sub-models.
The framework of R-Drop is shown in Fig. 3. We obtain two distributions of the model predictions by performing forward propagation twice, denoted as $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$, which differ from each other because the dropout operator randomly drops units. Next, we calculate the bidirectional KL divergence between these two sub-model outputs to regularize the model output, where the input data pair $(x_i, y_i)$ is the same. This term is combined with the cross-entropy loss, which minimizes the classification error between the predicted labels and the ground-truth labels. The final training objective $L_i$ adds the KL term, weighted by $\alpha$, to the cross-entropy loss. Here, $D_{KL}(P_1 \| P_2)$ denotes the KL divergence between two probability distributions $P_1$ and $P_2$, the cross-entropy loss is computed between the distribution of the ground-truth label $Y$ and the predicted label probability distribution $P(Y \mid X)$, $\alpha$ is the coefficient weight that controls the degree of regularization, and $L_i$ is used for gradient updates. We will discuss the impact of $\alpha$ in the next section.
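The R-Drop objective for a single sample can be sketched as below. This is an illustration under stated assumptions, not the authors' training code: the function names are hypothetical, and both the averaging of the two cross-entropy terms and the factor 1/2 on the bidirectional KL term follow one common convention that this section does not confirm. The two logit vectors stand for the outputs of two forward passes of the same input with different dropout masks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def r_drop_loss(logits_1, logits_2, label, alpha):
    """Cross-entropy plus alpha times the bidirectional KL divergence.

    Assumed weighting: mean of the two cross-entropy terms and half of the
    sum of the two KL directions.
    """
    p1 = softmax(logits_1)
    p2 = softmax(logits_2)
    cross_entropy = -0.5 * (math.log(p1[label]) + math.log(p2[label]))
    bidirectional_kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return cross_entropy + alpha * bidirectional_kl


# Two passes that disagree are penalized more as alpha grows.
pass_1 = [2.0, 0.5, 0.1, 0.1]
pass_2 = [1.0, 1.2, 0.1, 0.1]
print(r_drop_loss(pass_1, pass_2, label=0, alpha=0.0))
print(r_drop_loss(pass_1, pass_2, label=0, alpha=0.4))
```

When the two passes produce identical distributions the KL term vanishes and the objective reduces to the ordinary cross-entropy loss, so $\alpha$ only matters to the extent that dropout makes the sub-models disagree.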

IV. EXPERIMENTS AND RESULTS
A. Datasets
1) Dataset I: The 2008 BCI Competition (BCIC) IV-2a EEG dataset is used to evaluate method performance [35]. It consists of EEG data from nine subjects on four types of MI tasks: left hand, right hand, foot, and tongue. Two recording sessions are collected on separate days, utilizing twenty-two Ag/AgCl electrodes with a sampling rate of 250 Hz. Each session contains 288 EEG trials, with 72 trials per task. The first session is designated for training and the second for testing.
2) Dataset II: The 2019 World robot conference contest-BCI Robot Contest MI (WBCIC-MI) dataset contains 3-class MI-EEG data from 12 healthy subjects provided by Shanghai University. Twelve healthy right-handed students from the school participated in the experiment. All participants are naive BCI users and provide informed consent before the experiment. This experimental study, which received ethical approval (No. 20190002) from the Tsinghua University Medical Ethics Committee, complies with the Declaration of Helsinki. Alcohol within 24 hours before the test, and coffee or tea within 4 hours before the test, are prohibited. The subjects are required to sit on a comfortable chair roughly 1 m in front of the computer screen and remain as still as possible when performing the tasks.
In a single session, there is a 1-minute eyes-open phase during which subjects need to keep calm and gaze at the "+" on the screen without MI. The eyes-open phase is followed by a 1-minute eyes-closed phase. Then, the subject can press any button to start MI, which includes three classes of MI tasks: left-handed fist, right-handed fist, and both-ankles bend, as shown in Fig. 4(a). In a single trial, shown in Fig. 4(b), a virtual reality video appears on the screen as the "Cue" for 1.5 seconds, during which subjects should prepare for MI. In the next 4 seconds, subjects perform the corresponding MI task according to the "Cue." At the end of the trial, the subjects have a break for 2 seconds. Each set of data (block) contains five sessions, a total of 300 trials divided equally among the three MI tasks, and there is a 2-minute rest for subjects between sessions.
The EEG signals are referenced to CPz and collected by Neuracle 64-lead equipment whose wet electrodes are arranged following the international 10-20 electrode placement method, with a 1000 Hz sampling rate. Impedance is maintained below 10 kΩ during recording, and the data are subsequently downsampled to 250 Hz. Ten-fold cross-validation is used on the WBCIC-MI dataset.
3) Preprocessing: In this study, we preprocess the EEG data with MNE [36]. We use the [2, 6] s window of each trial and apply 0.5-40 Hz band-pass filtering to the original data to remove interference such as electrooculographic and electromyographic signals, improving the signal-to-noise ratio.
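The time window translates into sample indices as in the sketch below. The helper name is hypothetical, and treating the window as half-open (start included, end excluded) is an assumption; the text does not say how the endpoints are handled.

```python
def crop_window(trial, sfreq, t_start, t_end):
    """Keep the samples of every channel that fall in [t_start, t_end) seconds.

    trial[c][t] holds one trial (channel c, sample t) sampled at sfreq Hz.
    """
    first = int(round(t_start * sfreq))
    last = int(round(t_end * sfreq))
    return [channel[first:last] for channel in trial]


# One toy channel of 2000 samples at 250 Hz whose values equal their index.
toy_trial = [list(range(2000))]
cropped = crop_window(toy_trial, sfreq=250, t_start=2, t_end=6)
print(len(cropped[0]), cropped[0][0], cropped[0][-1])
```

At 250 Hz the 4-second window therefore corresponds to 1000 samples per channel, starting at sample 500.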

B. Experiment Details
In this study, we implement our M-FANet using the PyTorch library, based on Python 3.7, with a GeForce RTX 3090 GPU. We train M-FANet employing the Adam optimizer [37] with default settings. The batch size and learning rate are set to 16 and 0.0001, respectively.
We utilize classification accuracy and Cohen's kappa as two key performance metrics for method evaluation. The computation formula for kappa is as follows:
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ denotes the classification accuracy and $p_e$ denotes the accuracy of random guessing. To analyze statistical significance, we apply the Wilcoxon signed-rank test.
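Cohen's kappa can be computed from a confusion matrix as sketched below. The function name is hypothetical, and estimating $p_e$ from the row and column totals is the standard definition rather than something this section spells out; when the true classes are balanced, this $p_e$ reduces to one over the number of classes (0.25 for four classes).

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts trials of true class i predicted as class j.
    p_o is the observed accuracy; p_e is the agreement expected by chance,
    computed from the row (true) and column (predicted) totals.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(n_classes)) / total
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    return (p_o - p_e) / (1.0 - p_e)


# Toy two-class example: 80 % accuracy with balanced classes.
print(cohen_kappa([[40, 10], [10, 40]]))
```

Unlike raw accuracy, kappa is 0 for chance-level predictions and 1 for perfect agreement, which makes results on the 4-class and 3-class datasets easier to compare.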

C. Baseline Comparison
We conduct extensive subject-dependent experiments and compare our method with several state-of-the-art approaches across two separate datasets. Dataset I is a widely adopted dataset for four-class MI tasks, and many notable methods have proven their efficacy on this dataset. For instance, OSTP [31] adapted parameters specifically for the selection within the CSP framework; FBCSP [18] initially subdivided the signal into frequency bands and subsequently applied individual spatial filtering to each band; DeepConvNet [21] and EEGNet [24] were both convolution-based methods, designed for feature extraction and classification, with EEGNet being more lightweight than DeepConvNet. FBCNet [23], on the other hand, leveraged narrowband filters to capture features across different frequency bands, followed by a convolution with a large kernel to learn the spatial pattern for each band, and subsequently extracted temporal features through variance computation. Conformer [28] extracts features through a self-attention module.
Table II displays the classification performance of all methods on Dataset I. Our M-FANet method achieves the highest average accuracy of 79.28% and the highest average kappa value of 0.7259. The results demonstrate that the classification accuracy of our proposed M-FANet is not only 11.5% higher than that of FBCSP (p < 0.05), which was the winner of the BCI Competition IV, but also significantly surpasses that of OSTP (p < 0.01). Deep learning methods entirely based on CNN, such as DeepConvNet and EEGNet, also achieve commendable classification results, further testifying to the effectiveness of CNN in feature extraction. However, these methods lack effective feature selection. For example, FBCNet does not selectively concentrate on the frequency bands it partitions, and EEGNet compresses all spatial information directly with large convolutional kernels, neglecting attention to local channel groups. In addition, they also lack calibration of feature maps. In contrast, our proposed M-FANet enhances CNN with frequency band attention, local spatial attention, and feature map attention, hence selecting informative features. M-FANet, employing compact yet accurate attention modules, outperforms the larger model Conformer [28], which utilizes self-attention modules. As a result, M-FANet achieves the best average accuracy (p < 0.05) and kappa.
Further, we compare several state-of-the-art methods on Dataset II in Table III. In comparison to other methods, M-FANet significantly enhances the overall classification performance, with improvements of 32.81%, 16.75%, 12.17%,

E. Ablation Study
The significant advancement of M-FANet over other state-of-the-art methods lies in its multi-feature attention modules. Therefore, we conduct an ablation study using Dataset I, as presented in Fig. 5, involving the sequential removal of frequency band attention, local spatial attention, and feature map attention from our method. The results show corresponding average accuracy decreases of 5.98%, 2.81%, and 2.05%, respectively. When we remove the frequency band attention module, three participants (S04, S07, S09) exhibit a decrease in classification accuracy of over 10%. The most pronounced decrease in accuracy is S07, with a reduction
function, and we can manipulate the impact through an adjustable parameter α. We experiment with α in {0, 1e-1, 2e-1, 3e-1, 4e-1, 5e-1, 6e-1, 7e-1} and evaluate the effect of the regularization term on Dataset I. When α = 0, there is no constraint term between sub-models. As shown in Table IV, a small α (e.g., 1e-1) does not outperform a large α (e.g., 4e-1), indicating the necessity for enhancing regularization. However, an excessively large α (e.g., 7e-1) signifies excessive regularization, causing the classification accuracy to decrease. In this work, the best-balanced choice is α = 4e-1 on Dataset I.
We also vary α in {0, 1, 5, 10, 15, 20} and conduct experiments on Dataset II. As shown in Table V, when α increases, the performance of our method gradually improves. Nonetheless, an excessively large α may lead to a decline in classification performance, as it might obstruct model convergence during training. The optimal balance is reached when α = 5 on Dataset II.
Due to the difference in spatial information between Dataset I with 22 channels and Dataset II with 64 channels, we utilize different spatial filter multipliers (D). For Dataset I, we use D = 2, while for Dataset II, we use D = 4. During the experiments, we observed that changing the value of D shifts the optimal value of α. For models with a larger D, indicating more parameters, a larger α is required to achieve optimal classification accuracy. Consequently, we set D in {2, 4, 8} on Dataset I, where the corresponding numbers of parameters are 4080, 7696, and 16656, respectively, and examine the impact of different α values on average accuracy. Fig. 7 shows that smaller models exhibit greater sensitivity as α increases by the same amount. With the continuous increase of α, the small model shows an initial increase and then a decrease in accuracy, quickly reaching the optimal point (α = 0.4). The medium-sized model demonstrates a similar trend but changes more slowly, with the optimal point occurring at a larger α value (α = 8). In contrast, the large model displays a consistently slow increase in accuracy and has not yet reached its optimal point.
A larger value of D increases the number of parameters in M-FANet, therefore requiring a larger value of α, i.e., a more substantial regularization term, to constrain the training process.Therefore we recommend larger values of D and α for datasets containing more raw EEG information.

G. Visualization
1) Global Representation: The frequency band attention module extracts features directly from the raw EEG signals, critically influencing the feature extraction processes in all subsequent modules. Therefore, it is essential to ascertain whether the features learned by the frequency band attention module are exclusively derived from specific MI tasks. We extract the output from the frequency band attention module and represent it as a brain topography map for comparison.
The raw EEG signals from subject S07 already exhibit well-known MI-related brain activation patterns. After processing through the frequency band attention module, the activation features under different MI tasks become more pronounced. When averaging the EEG under specific tasks of all subjects, the overall activation patterns appear irregular due to the variability in brain activation modes among individuals. However, the frequency band attention module still distinctly differentiates the brain activation patterns associated with different tasks.
3) Feature Distribution: T-distributed stochastic neighbor embedding (t-SNE) [38] is a commonly used statistical dimension reduction and feature visualization method. After separate training with DeepConvNet, EEGNet, FBCNet, and M-FANet,

V. DISCUSSION
The key to the research and application of MI-BCI lies in the accuracy of MI decoding methods.We propose a lightweight yet highly effective method called M-FANet.The distinctive feature of its network architecture is its reliance on attention mechanisms and the utilization of multiple attention modules based on the frequency and spatial characteristics of MI.In the frequency domain, we divide the original signals into multiple sub-bands and apply adaptive weights to non-overlapping subbands.These sub-bands are then linearly combined to remove redundant information and amplify relevant frequency-domain information.Inspired by the neurophysiological mechanisms of MI in the spatial domain, we employ a local spatial attention module for detailed feature extraction.When passing through MI-related brain areas, the local spatial attention module will output a significant activation, as demonstrated in Fig. 9, showcasing its sensitivity to the regions associated with MI.The local spatial attention module allows our network to prioritize MI-related responses in these areas.Additionally, we incorporate attention mechanisms from deep learning and utilize SEBlock to calibrate the feature maps.To address the challenges of small sample size, high feature dimensionality, and the risk of overfitting in EEG signals, we introduce R-Drop, which mitigates the inconsistencies between sub-models and enhances the generalization ability of the complete model during the inference stage.
The experimental results substantiate that M-FANet outperforms current state-of-the-art methods. Ablation experiments illustrate that each attention module contributes significantly to the overall model. To assess the effectiveness of feature extraction, we visualize the features. We also discuss the impact of the α parameter on the performance of R-Drop: as α increases, the model's classification performance gradually improves, but exceedingly large values of α lead to a deterioration in performance. We also note that the optimal value of α varies for deep learning networks of different parameter sizes. Networks with more parameters necessitate a larger α, likely because the amplified inconsistency between sub-models requires stronger regularization.
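To make the role of α concrete, the following sketch computes a loss in the commonly used R-Drop form: the cross-entropy of two dropout-perturbed forward passes plus α times their symmetric KL divergence. This is the general formulation of R-Drop and is assumed here; the exact loss used for M-FANet is not reproduced in this section. The probability vectors are made-up values for a single 4-class trial, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def r_drop_loss(probs_a, probs_b, label, alpha):
    """Cross-entropy of both passes plus alpha times the symmetric KL term."""
    nll = -math.log(probs_a[label]) - math.log(probs_b[label])
    sym_kl = 0.5 * (kl_divergence(probs_a, probs_b)
                    + kl_divergence(probs_b, probs_a))
    return nll + alpha * sym_kl


# Hypothetical softmax outputs of the same trial under two dropout masks.
pass_a = [0.70, 0.10, 0.10, 0.10]
pass_b = [0.40, 0.30, 0.20, 0.10]
loss_small_alpha = r_drop_loss(pass_a, pass_b, label=0, alpha=1.0)
loss_large_alpha = r_drop_loss(pass_a, pass_b, label=0, alpha=5.0)
```

When the two passes agree, the KL term vanishes and only the cross-entropy remains; a larger α penalizes disagreement between sub-models more heavily, which is consistent with larger networks benefiting from a larger α.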
Table VI displays the resource consumption of four state-of-the-art methods. The M-FANet model completes a single forward pass in 1,536 milliseconds, requiring 23.39 Million Floating Point Operations (MFLOPs), which denotes the computational load of the network for each inference. However, M-FANet boasts a compact architecture, requiring only 4,080 parameters and a minimal memory footprint of 28.10 KB, significantly less than ConvNet [21] and Conformer [28]. In summary, M-FANet strikes a balance between superior classification performance and minimal memory requirements.
Despite these advancements, several limitations call for further investigation. Firstly, we only conduct experiments on MI datasets. However, given that our proposed M-FANet exhibits high sensitivity to local regions, exploring its effects on other paradigms, such as event-related potentials, may prove valuable. Secondly, the adjustable parameter α in R-Drop currently depends on empirical settings. While extensive experimentation allows us to determine the optimal value of α for a particular network architecture, developing a computationally efficient formula for α that holds across different network structures could be beneficial. Thirdly, we currently fix the stride for sub-band partitioning. Although we can use some neurophysiological priors as references, a fixed stride may result in valuable information being submerged in sub-bands containing more redundant information. In our future work, we plan to explore the usage of a non-fixed stride for sub-band partitioning, where the number of sub-bands can be specified and boundary points can be automatically determined.
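For reference, fixed-stride partitioning into non-overlapping sub-bands can be sketched as follows. The 4 to 40 Hz range and the 4 Hz stride are illustrative assumptions only and are not taken from this section; `fixed_stride_bands` is a hypothetical helper name.

```python
def fixed_stride_bands(low_hz, high_hz, stride_hz):
    """Split [low_hz, high_hz] into consecutive non-overlapping sub-bands."""
    if stride_hz <= 0 or high_hz <= low_hz:
        raise ValueError("need stride_hz > 0 and high_hz > low_hz")
    bands = []
    start = low_hz
    while start < high_hz:
        # The last band is clipped so that it never exceeds high_hz.
        bands.append((start, min(start + stride_hz, high_hz)))
        start += stride_hz
    return bands


# Illustrative values only: 4-40 Hz with a 4 Hz stride gives nine sub-bands.
example_bands = fixed_stride_bands(4, 40, 4)
```

Every boundary here is determined by the stride alone, which is why informative and redundant frequencies can end up in the same sub-band; a learned partition would replace the fixed `stride_hz` with boundary points chosen from the data.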
Transfer learning is also a noteworthy method, enabling the utilization of pre-trained models on extensive datasets, followed by fine-tuning with a limited amount of specific subject data to improve classification performance [39], [40]. The combination of lightweight models and transfer learning facilitates high-performance decoding while maintaining low resource consumption. This approach is particularly beneficial for the further application and popularization of BCIs, especially in scenarios with limited resources, such as in portable and embedded devices.

VI. CONCLUSION
This study proposes a lightweight M-FANet that employs convolution for feature extraction and effectively selects frequency band features, local spatial features, and informative feature maps using attention mechanisms. Moreover, we introduce a training method, R-Drop, to tackle the training-inference inconsistency caused by dropout. We conduct extensive experiments on two datasets, and the results corroborate that our method outperforms other state-of-the-art methods. Ablation experiments and visualization further validate the effectiveness of the individual attention modules and R-Drop. The proposed M-FANet holds promising potential for MI-based BCI research and applications due to its high performance in MI-EEG decoding.

Fig. 1. The architecture of our proposed Multi-Feature Attention Convolutional Neural Network (M-FANet) employs convolution for feature extraction and classification, augmented with three distinct attention modules from the perspectives of frequency domain, spatial domain, and feature maps.

Fig. 4. Illustration of the single motor imagery task and the flow chart of the single trial.
ON DATASET I

TABLE III CLASSIFICATION ACCURACY(%) COMPARISONS WITH STATE-OF-THE-ART METHODS ON DATASET II

12.68%, 10.34%, and 4.7%, respectively, for OSTP (p < 0.01), FBCSP (p < 0.01), EEGNet (p < 0.01), FBCNet (p < 0.01), Conformer (p < 0.05), and DeepConvNet (p < 0.05).

D. Training Process
We employ a two-stage training strategy [21]. In the first stage, the training data is divided into training and validation sets. The model is trained exclusively on the training set, and its accuracy is monitored using the validation set. If the validation set accuracy does not improve for 400 consecutive epochs, the training is stopped. Once the stopping criterion is met, the network parameters with the best validation set accuracy are loaded. These loaded model parameters serve as the starting point for the second stage, where the model is further trained using the complete training data (training + validation sets). The second stage of training is stopped when the validation set loss decreases below the training set loss from stage 1. To prevent unlimited training in cases where convergence is not achieved, the maximum number of training epochs is limited to 1500 for stage 1 and 600 for stage 2. For Dataset I, 20% of the training data is reserved as the validation set. For Dataset II, one of nine training folds is retained as the validation set.
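The control flow of the two-stage strategy can be sketched as below. This is an illustration under stated assumptions, not the training code used in the experiments: the `model` interface (`fit_epoch`, `validate`, `get_state`, `set_state`) is hypothetical, "the training set loss from stage 1" is assumed to mean the training loss at the best checkpoint, and `ToyModel` is a scripted stand-in that only exercises the stopping rules.

```python
def two_stage_training(model, patience=400, max_stage1=1500, max_stage2=600):
    """Return (best validation accuracy, stage-1 epochs run, stage-2 epochs run)."""
    best_acc, best_state, best_train_loss = -1.0, None, None
    stale_epochs, stage1_epochs, stage2_epochs = 0, 0, 0

    # Stage 1: train on the training split, early-stop on validation accuracy.
    for epoch in range(1, max_stage1 + 1):
        stage1_epochs = epoch
        train_loss = model.fit_epoch(use_validation_data=False)
        val_acc, _ = model.validate()
        if val_acc > best_acc:
            best_acc, best_state, best_train_loss = val_acc, model.get_state(), train_loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break

    # Stage 2: restart from the best checkpoint and train on training + validation data
    # until the validation loss drops below the stage-1 training loss (assumption above).
    model.set_state(best_state)
    for epoch in range(1, max_stage2 + 1):
        stage2_epochs = epoch
        model.fit_epoch(use_validation_data=True)
        _, val_loss = model.validate()
        if val_loss < best_train_loss:
            break
    return best_acc, stage1_epochs, stage2_epochs


class ToyModel:
    """Scripted stand-in: accuracy plateaus after 5 epochs, losses keep shrinking."""

    def __init__(self):
        self.step = 0

    def fit_epoch(self, use_validation_data):
        self.step += 1
        return 1.0 / self.step            # pretend training loss

    def validate(self):
        return min(self.step, 5) / 10.0, 1.0 / (self.step + 1)  # (accuracy, loss)

    def get_state(self):
        return self.step

    def set_state(self, state):
        self.step = state


# Small patience and epoch limits keep the toy run short.
result = two_stage_training(ToyModel(), patience=3, max_stage1=50, max_stage2=10)
```

The defaults mirror the limits stated above (400 epochs of patience, at most 1500 and 600 epochs per stage); the toy run uses much smaller values purely for illustration.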

Fig. 5. Ablation study on the impact of frequency band attention, local spatial attention, and feature map attention on classification accuracy.

of 13.89%. A comparison of the confusion matrix for S07, as shown in Fig. 6, reveals that the frequency band attention module substantially enhances the classification accuracy for tasks involving the right hand and both feet.

F. Parameter Sensitivity
R-Drop also contributes to our method in terms of model training. It introduces a regularization term to the loss

Fig. 7. Different values of D are chosen to adjust the number of trainable model parameters, and the influence of model size on the selection of α is observed on Dataset I. The results indicate that for larger models, a larger α is necessary for regularization.

Fig. 8. Raw EEG topographies averaged over all trials for each MI task, and the attribution patterns after the frequency band attention module (FBA) on the input EEG. The attribution patterns show that the features become more focused on the MI-relevant regions. (a) Attribution patterns for S07. (b) Attribution patterns averaged across all subjects.

2) Region of Interest: To evaluate the effectiveness of the local spatial attention module, we employ heatmaps to visualize both the input and output of this layer, thereby scrutinizing the tangible impact of local spatial attention. As shown in Fig. 9 (b), during the early stages of model training, the local spatial attention module chiefly concentrates on local channel groups associated with motion regions, where attention is distributed over an expansive range and the response values are comparatively low. As the model matures into a more stable state, shown in Fig. 9 (c), the attention becomes more refined, with an increase in response values. This phenomenon suggests that the local spatial attention module persistently focuses on local spatial information and becomes more effective as the model evolves through training. The significance of this module lies in enabling the subsequent large convolutional kernel, which extracts global spatial features, to pay more attention to spatial information relevant to MI that is manifested in

Fig. 9. The heatmap of S07 from Dataset I, where the gray color represents the original EEG image, and the red color represents the output of the local spatial attention module. (a) The original EEG signals. (b), (c) The heatmap after 10 and 100 iterations, respectively.

TABLE IV THE CLASSIFICATION ACCURACY(%) OF M-FANET WITH DIFFERENT α ON DATASET I

TABLE V THE CLASSIFICATION ACCURACY(%) OF M-FANET WITH DIFFERENT α ON DATASET II