Motor Imagery Decoding in the Presence of Distraction Using Graph Sequence Neural Networks

In this study, we propose a graph sequence neural network (GSNN) to accurately decode patterns of motor imagery from electroencephalograms (EEGs) in the presence of distractions. GSNN aims to build subgraphs by exploiting biological topologies among brain regions to capture local and global relationships across characteristic channels. Specifically, we model the similarity between pairwise EEG channels by the adjacency matrix of the graph sequence neural network. In addition, we propose a node domain attention selection network in which the connection and sparsity of the adjacency matrix can be adjusted dynamically according to the EEG signals acquired from different subjects. Extensive experiments on the public Berlin-distraction dataset show that in most experimental settings, our model performs considerably better than the state-of-the-art models. Moreover, comparative experiments indicate that our proposed node domain attention selection network plays a crucial role in improving the sensitivity and adaptability of the GSNN model. The results show that the GSNN algorithm obtained superior classification performance (the average Recall, Precision, and F-score were 80.44%, 81.07%, and 80.54%, respectively) compared to the state-of-the-art models. Finally, analysis of the intermediate results revealed how the relationships between important brain regions and channels are influenced differently by the distraction themes.


I. INTRODUCTION
ELECTROENCEPHALOGRAM (EEG) is an important tool for decoding brain signals; it reflects the dynamic activity of neurons originating in the central nervous system and responds quickly to different brain states [1]. Motor imagery (MI) is a commonly used noninvasive brain-computer interface (BCI) paradigm [2], [3]. When a person imagines or mentally rehearses body movements, the corresponding MI response is evoked in the brain, and a large number of neuronal activities occur in the motor cortex [4]. People with disabilities can control auxiliary robots [5], [6] or wheelchairs [7] and perform daily activities by using an MI-based BCI system, which has also been shown to contribute to stroke rehabilitation [3], [8], [9]. However, recent studies found that the performance of an MI-based BCI system depends heavily on the behavior of the subjects, for example when subjects perform kinesthetic MI as visual MI. Specifically, the expected EEG response for kinesthetic MI requires participants to focus on imagining motion feedback, not just on imagining a picture of the action [10]. According to Friedrich et al. [11], passive auditory distraction improves the subjects' performance in MI. By contrast, Alvarsson et al. proposed that both positive and negative auditory interferences can extend the imagery time, and found a slight difference between positive and negative distraction, indicating that different forms of auditory interference (natural and noisy environments) have different effects on individual well-being and arousal levels [12]. Studies have also examined the differences in brain activation patterns between the eyes-open and eyes-closed states by considering the effects of distraction themes. Even when these studies were conducted under the same conditions, different brain activation patterns were found in the two cases [13].
In the eyes-open state, an increase in the level of arousal and increased eye movements are observed when the attention load increases [14]. Compared with the eyes-closed state, the "external sensory" network, including the attention, eye movement, and arousal systems, is more activated in the eyes-open state. In contrast, the "endosensory" network, including the visual, auditory, and sensory systems, along with some default mode networks, is more active when the eyes are closed than when they are open [15]. Therefore, under the influence of different distraction paradigms, accurate decoding of MI-associated EEG patterns becomes highly challenging and thus requires more sophisticated algorithms that are robust to the noise disturbances caused by distraction [16].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
So far, many traditional machine learning algorithms have been used for MI EEG classification, such as linear discriminant analysis (LDA), random forest (RF), and support vector machine (SVM). However, owing to the low signal-to-noise ratio (SNR) of EEG signals, the original EEG data are usually contaminated by noise, and the distribution of the original EEG samples is therefore chaotic. This situation is particularly evident for MI tasks performed in the presence of distraction. To solve this problem, numerous research efforts have been devoted to developing feature extraction methods that uncover MI-driven EEG features before classification. Common spatial pattern (CSP) [3] is a popular feature extraction method for MI BCI tasks. In the past decade, many CSP variants have been proposed to optimize EEG features for improved MI decoding. For example, Ang et al. [5] proposed the filter bank common spatial pattern (FBCSP), which uses bandpass filters and the CSP algorithm to extract optimal spatial features while maintaining high subject sensitivity. Inspired by the idea of multi-subband input, Jing et al. [17] proposed the correlation-based channel selection CSP (CCS-CSP) method to select the channels that contain more correlated information in MI tasks. However, this algorithm depends on frequency band selection and time window selection, and cannot extract ideal spatial features when the feature space lacks diversity [18], [19]. Although these approaches showed strengths in improving MI decoding, all of them have been studied under a distraction-free experimental paradigm, wherein the SNR of the EEG was guaranteed by the full engagement of the subjects. In natural and complex environments with various distractions, the use of these traditional algorithms for EEG-based MI decoding may face the following challenges:
1. Given that subjects are affected differently by the same independent distraction theme, the EEG signals of different subjects vary greatly, which hinders the generality of classifiers in specific distraction themes.
2. The topological structure of EEG channels may not be effectively utilized to learn specific EEG representations.
3. Participants may not always produce the expected imagery under certain distraction themes [20]. Consequently, the distribution of the original EEG in the collected data may be noisy, thus making it impossible to extract valid features.
To overcome these problems, researchers have proposed many end-to-end MI EEG decoding methods based on deep learning techniques. For example, Li et al. [21] proposed an end-to-end framework called channel projection mixed scale convolution neural network (CPMixedNet) to improve the performance of MI EEG decoding. The long short-term memory (LSTM) model is suitable for time series feature learning, although it ignores spatial information that is important for neural pattern decoding [22]. Wang et al. [23] proposed an LSTM-based EEG classification method (Em-LSTM) to solve the problem of vanishing or exploding gradients in the training process of convolution neural networks. Generally, methods based on deep learning can achieve better performance than traditional methods because feature extraction and classification are learned and embedded in a single network. However, existing EEG decoding methods based on deep learning do not consider the graph structure of EEG signals. Building on graph theory, some related works analyze brain signals, including topological structure information in cognitive tasks and rehabilitation engineering [24], [25]. The spectral method, which is based on redefining the Fourier convolution operation on the graph, can effectively extract feature information. Kipf et al. [26] proposed a layer-wise propagation rule that uses the Chebyshev expansion method to simplify the approximation of the graph Laplacian [27]. The graph attention network [28] may be used to calculate the representation of nodes in an entire community based on the attention mechanism [29]. Wu et al. proposed a framework named Simplifying Graph Convolutional Networks (SGC) [30]. These methods improve the performance of graph-related tasks, while ignoring some low-frequency characteristics of graph structures [31].
Nevertheless, none of these studies can ensure that the identified input features can be reasonably reconstructed in MI research, that is, that the important input features they provide are sufficient for performing the MI task in the presence of distraction. In short, previous graph convolutional networks are unable to adapt well to the MI task and do not generalize well. In this work, we aim to reconstruct features and interpret distracted MI patterns from the channel standpoint. At the same time, the fact that subjects behave differently from each other is rooted in individual differences in brain connectivity [32]. Domain-invariant feature learning may be used to align the features of the source and target domains in a common feature space. This alignment can be achieved by minimizing divergence [33], by maximizing reconstruction [34], or by adversarial training [35]. The node-wise domain adaptation (NodeDAT) regularizer proposed by Miao et al. [36] extends domain adversarial training to graph neural networks. By minimizing the difference between the features in the source and target domains of each node, NodeDAT achieves finer-grained regularization. In this study, we propose a novel graph sequence neural network equipped with a node domain attention selection network for the MI task under the influence of various distractions. With node domain adversarial training, our proposed model achieves a stronger generalization ability and can effectively address the problems of BCI blindness and poor goal realization of MI subjects. The primary contributions of this study are as follows:
1. A preprocessing regularization map algorithm is proposed, which is suitable for preprocessing EEG signals in the presence of distraction and improves the robustness of the model against noisy data.
2. A gated graph convolutional network is proposed for task-related EEG decoding, which can effectively capture the essential patterns of MI subjects in distraction and decode brain regions and interchannel structures in motor imagery tasks.
3. A node domain adversarial training method for different independent distraction themes is proposed to improve the self-adaptability and classification accuracy of the graph sequence neural network (GSNN).
Our proposed model was validated on the public dataset of the Berlin Institute of Technology through extensive experiments in comparison with benchmark models. The remainder of this paper is organized as follows. In Section II, we introduce the dataset, the baseline methods commonly used for graph representation learning, and our designed GSNN network. Section III presents the experimental results of the GSNN. Section IV provides a detailed analysis and discusses potential applications.

A. Dataset
A public dataset collected by the Technical University of Berlin from 16 healthy subjects aged between 24 and 28 was used in our study [37]. The participants performed motor imagery experiments in the presence of different independent distraction themes. For each subject, the main experiment was divided into seven runs of 10 min each, and 72 secondary experiments (trials) were performed in each run; one trial lasted 4.5 s (as shown in Fig. 1). The participants were asked to imagine a tactile hand movement of their choice under the influence of six secondary tasks (clean, eye-closed, news, numbers, flicker, stimulation). Several MI strategies were available to the participants, for example squeezing a soft ball, turning on a faucet, playing the piano, or using a saltshaker. The experiment used the Fast EasyCap (EasyCap GmbH) system, which consisted of 63 wet Ag/AgCl electrodes placed at symmetrical positions according to the International 10-20 system [38] and referenced to the nose. Moreover, two 32-channel amplifiers were used to amplify the signal at a sampling rate of 100 Hz. The original EEG signal of each subject is recorded in the dataset together with the main task label, which indicates whether the task was left-handed or right-handed. This study was conducted according to the Declaration of Helsinki and was approved by the Ethics Committee of the Charité-Universitätsmedizin Berlin (approval number: EA4/012/12).
To allow cross-comparison with other models, we adopted the same preprocessing method as the original dataset: we sampled the EEG data at 100 Hz (as shown in Fig. 2 (a)), computed the Laplacian-filtered signals of C3 and C4, and used a fifth-order Butterworth filter to band-pass the data in the ranges of 9-13 Hz and 18-26 Hz [37].
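The Laplacian step above can be illustrated with a minimal sketch. It assumes a small (nearest-neighbour) Laplacian, in which the mean of the surrounding channels is subtracted from the centre channel; the neighbour sets `C3_NEIGHBORS` and `C4_NEIGHBORS` and the function name `small_laplacian` are hypothetical and are not taken from the dataset description in [37].

```python
def small_laplacian(signals, center, neighbors):
    """Subtract the mean of the neighbouring channels from the centre channel.

    signals   : dict mapping channel name -> list of samples (equal lengths)
    center    : name of the channel to filter, e.g. "C3"
    neighbors : names of the surrounding channels (hypothetical choice)
    """
    if not neighbors:
        raise ValueError("at least one neighbouring channel is required")
    n_samples = len(signals[center])
    filtered = []
    for t in range(n_samples):
        neighbor_mean = sum(signals[ch][t] for ch in neighbors) / len(neighbors)
        filtered.append(signals[center][t] - neighbor_mean)
    return filtered


# Hypothetical neighbour sets; the actual montage must be taken from the dataset.
C3_NEIGHBORS = ["FC3", "CP3", "C1", "C5"]
C4_NEIGHBORS = ["FC4", "CP4", "C2", "C6"]
```

The band-pass step is omitted here because it would normally rely on a signal-processing library rather than hand-written code.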

B. Overall Structure of Framework
In this study, we propose a novel framework, known as the graph sequence neural network (GSNN), which adapts the graph convolution network with a self-attention mechanism and node domain adversarial training to predict the downstream task results of MI in the presence of distractions. The overall structure of our proposed framework is illustrated in Fig. 3. Specifically, the framework consists of three modules:
1. Graph reconstruction module for establishing a brain channel based graph (cf. Section II-C).
2. Gated graph convolutional network module for task-related EEG decoding, which captures the essential patterns of MI subjects in distraction (cf. Section II-D).
3. Node domain adversarial training method for improving the classification accuracy in each task, which provides the self-adaptability of GSNN from a brain channel perspective (cf. Section II-E).

C. Graph Reconstruction
We use a notch filter to remove power-line interference [38] and use the wavelet entropy reconstruction as the node attribute feature (as shown in Fig. 2). The dynamic distance ratio is reconstructed as the node connection parameter to construct the adjacency matrix of the edge connections.

1) Node Attribute Characteristics:
We reconstruct the signal from the preprocessed time-domain sequence. According to Rosso et al. [39], wavelet entropy measures the degree of order and disorder of a signal and provides information about the underlying dynamic process. It selects an appropriate wavelet basis, based on the wavelet transform with multiple scales and directions, and expands the original signal at different scales [22]. Therefore, we used the wavelet entropy reconstruction of each channel as the node attribute feature of that EEG channel. The relative wavelet energy of the jth signal at each scale is defined as the ratio of $E_h$ to $E_{total}$, where $L_h$ denotes the number of coefficients at each decomposition scale, $E_h$ is the wavelet energy, and $E_{total}$ is the total energy. The wavelet entropy is calculated from these relative energies, where $X_j \in \mathbb{R}^{1 \times 128}$ is one element of $X \in \mathbb{R}^{N \times F}$, the input feature matrix of the graph with 63 nodes and 128-dimensional wavelet entropy features (as shown in Fig. 2 (b)). Additional details are provided in [22].
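As an illustration of the quantities named above, the following sketch computes a wavelet entropy for a single channel. It assumes the standard definitions associated with [39], namely $E_h = \sum_{k=1}^{L_h} |C_h(k)|^2$, $p_h = E_h / E_{total}$, and entropy $-\sum_h p_h \ln p_h$; the coefficient symbol $C_h(k)$, the Haar basis, the number of levels, and the choice to sum only detail energies are assumptions made for this example, not the configuration used in the study.

```python
import math


def haar_detail_coefficients(signal, levels):
    """Return the detail coefficients of an orthonormal Haar transform, one list per scale."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx) - 1, 2)]
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details


def wavelet_entropy(signal, levels=4):
    """Shannon entropy of the relative wavelet energies p_h = E_h / E_total."""
    details = haar_detail_coefficients(signal, levels)
    energies = [sum(c * c for c in d) for d in details]  # E_h for each scale h
    total = sum(energies)                                 # E_total over detail scales
    if total == 0.0:
        return 0.0  # a constant signal carries no detail energy
    probs = [e / total for e in energies]                 # relative energies p_h
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A signal whose energy sits in a single scale yields an entropy of zero, whereas energy spread evenly over all scales yields the maximum value, the logarithm of the number of scales.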
2) Node Topology Reconfiguration:
The topological structure of EEG channels is necessary for graph representation learning. The adjacency matrix $A \in \mathbb{R}^{f \times f}$ represents the topological pattern, where f represents the number of channels in the EEG signals. The weight between channels m and n is given by the learnable entry $A_{mn}$. According to Salvador et al. [20], the strength of connections between brain regions attenuates as an inverse square function of physical distance. Therefore, we initialized the relationship between local channels in the adjacency matrix as follows:

$$A_{mn} = \min\left(1, \frac{\delta}{q_{mn}^{2}}\right),$$

where $\delta > 0$ denotes a calibration constant and $q_{mn}$, $m, n = 1, 2, \ldots, f$, denotes the physical distance between channels m and n, which is computed from their three-dimensional coordinates obtained from the data sheet of the recording device. Achard et al. [40] observed that sparse functional magnetic resonance imaging networks (~20% of all possible connections) can typically maximize the efficiency of the network topology. Accordingly, we set $\delta$ to 5, which implies that ~20% of the entries in A are non-negligible. Finally, to reduce overfitting, we constrain A to be a symmetric matrix.
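The distance-based initialization described above can be sketched as follows. The sketch assumes the clipped inverse-square form $A_{mn} = \min(1, \delta / q_{mn}^2)$ and a diagonal of ones for the self-connections; the function name `init_adjacency` is hypothetical, and the value of `delta` is meaningful only relative to the units of the electrode coordinates.

```python
import math


def init_adjacency(coords, delta=5.0):
    """Build a symmetric adjacency matrix from 3-D electrode coordinates.

    coords : list of (x, y, z) tuples, one per channel
    delta  : calibration constant (assumed clipped inverse-square form)
    """
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for k in range(n):
            if m == k:
                adj[m][k] = 1.0  # self-connection (assumption)
                continue
            q = math.dist(coords[m], coords[k])  # physical distance q_mn
            # Coincident electrodes are treated as fully connected.
            adj[m][k] = 1.0 if q == 0.0 else min(1.0, delta / (q * q))
    return adj
```

Because the distance is symmetric in m and n, the resulting matrix is symmetric by construction, which matches the symmetry constraint stated above.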

D. Gated Graph Convolutional Network
1) Graph Filter Neural Network:
We consider a graph $\mathcal{G} = (\mathcal{X}, \mathcal{A})$, where $\mathcal{X}$ denotes a set of nodes and $\mathcal{A}$ denotes a set of edges between the nodes in $\mathcal{X}$. If the graph convolution network (GCN) formula [26] is used, the score $G \in \mathbb{R}^{N \times 1}$ takes the following form:

$$G = \sigma\left(\tilde{D}^{-\frac{1}{2}} A \tilde{D}^{-\frac{1}{2}} X \Theta_{att}\right),$$

where $\sigma$ is the activation function (e.g., sigmoid), $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix with self-connections, which represents the edge set $\mathcal{A}$, $\tilde{D} \in \mathbb{R}^{N \times N}$ is the degree matrix of A after the renormalization trick, $X \in \mathbb{R}^{N \times F}$ is the input feature matrix of the graph with N nodes and F-dimensional features, which represents the data on $\mathcal{X}$, and $\Theta_{att} \in \mathbb{R}^{F \times 1}$ is the only parameter of the convolution layer. In general, GCNs learn a feature transformation function for X and produce the output $G \in \mathbb{R}^{N \times F'}$, where $F'$ denotes the output feature dimension. However, when the noise is too large to be suppressed by the first-order low-pass filter of GCN, there is a risk of overfitting to the noise. In particular, the inner weight $\Theta_{att1}$ and A may be trained on highly noisy features. According to Hoang et al. [31], the graph filter neural network (gfNN) is a faster structure than GCN after the correction of the convolution layer and is more suitable for filtering noisy signals. The two-layer perceptron takes the following form:

$$G_{gfNN} = \sigma_2\left(\sigma_1\left(H \Theta_{att1}\right) \Theta_{att2}\right),$$

where $\sigma_1$ is the entry-wise rectified linear unit function and $\sigma_2$ is the softmax function. Note that both $\sigma_1$ and $\sigma_2$ are contraction maps. The graph convolution kernel gives $H = \tilde{D}^{-\frac{1}{2}} A \tilde{D}^{-\frac{1}{2}} X$.
2) Self-Attention Pooling:
The pooling layer enables the gfNN model to lessen the number of parameters by reducing the scale of the representation, thus avoiding overfitting. According to Zhang et al. [41], SortPool sorts the embeddings of nodes and feeds the sorted embeddings to the next layer. Conversely, Ying et al. [42] proposed DiffPool, a hierarchical graph pooling method for end-to-end learning of assignment matrices. Inspired by these studies, self-attention pooling is divided into two parts: global and hierarchical pooling methods [43]. The global pooling method considers graph features and uses neural networks to aggregate all representations of the nodes in each layer.
Graphs with different structures can be processed because the global pooling method collects all representations. The hierarchical pooling method is used to construct a model with node assignments, based on features or topologies, that can be learnt in each layer. $P^{(l)} \in \mathbb{R}^{N_l \times N_{l+1}}$ contains the probabilities with which the nodes in layer l are assigned to clusters in the next layer l + 1. A denotes an assignment matrix that is updated in layer l + 1 based on $P^{(l)}$, and $N_l$ denotes the number of nodes in layer l. Specifically, nodes are assigned as follows: the top $\lceil kN \rceil$ nodes are selected based on the value of $G_{gfNN}$. The pooling ratio $k \in (0, 1]$ is a hyperparameter used to determine the number of nodes to be retained. The selection score is calculated as follows:

$$idx = \text{top-rank}\left(G_{gfNN}, \lceil kN \rceil\right), \qquad G_{mask} = \left(G_{gfNN}\right)_{idx},$$

where idx is an indexing operation and $G_{mask}$ is the feature attention mask. Here, top-rank is the function that returns the indices of the top $\lceil kN \rceil$ values. The selection module takes the following form:

$$X' = X_{idx,:}, \qquad X_{out} = X' \odot G_{mask}, \qquad A_{out} = A_{idx,idx},$$

where $X_{idx,:}$ is the node feature matrix obtained by node-wise indexing. For node pooling, the indexed nodes are embedded into the graph convolution mask in the sparse space. $A_{idx,idx}$ is the adjacency matrix pooled by row-wise and column-wise indexing. Finally, X denotes the node feature matrix. The output probability of class $Y_i$ can be computed by passing each feature matrix $X_i$ into the model, in the following form:

$$p(Y_i \mid X_i, \theta) = \mathrm{softmax}\left(\text{Com-pool}(X)\, O_{att}\right),$$

where Com-pool denotes the combined pooling across all nodes of the graph, $Y_i \in \{0, 1, \ldots, L - 1\}$ denotes the label class index, L denotes the number of classes, and $O_{att} \in \mathbb{R}^{F \times L}$ denotes the output weight matrix.
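The score-then-select procedure of this subsection can be sketched in a few lines. The example assumes a single sigmoid-activated propagation step for the node scores, an adjacency matrix that already contains self-connections, and scaling of the retained features by their scores; the function names and the plain-list matrix representation are choices made for illustration, not the implementation used in the study.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}; adj must contain self-connections and positive row sums."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def attention_scores(adj, x, theta):
    """One score per node: sigmoid(D^{-1/2} A D^{-1/2} X theta), with theta of shape F x 1."""
    h = matmul(normalize_adjacency(adj), x)  # propagated features H
    z = matmul(h, theta)
    return [1.0 / (1.0 + math.exp(-row[0])) for row in z]


def top_k_pool(adj, x, scores, ratio):
    """Keep the ceil(ratio * N) highest-scoring nodes and index features and adjacency."""
    k = math.ceil(ratio * len(scores))
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    idx.sort()  # restore the original channel order among the retained nodes
    x_out = [[value * scores[i] for value in x[i]] for i in idx]  # X' scaled by the mask
    a_out = [[adj[i][j] for j in idx] for i in idx]               # A indexed on rows and columns
    return idx, x_out, a_out
```

Sorting the retained indices keeps the surviving channels in their original order, which makes the pooled adjacency easier to map back to electrode names.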

E. Node Domain Adversarial Training
When participants perform the prompted main task, the influence of distraction means that they may not always produce the expected MI, which may have a negative impact on the performance of the model [38]. Inspired by the node-wise domain distributed learning method (NodeDAT) [36], it is possible to learn the distribution of classes, which can mitigate the influence of different subjects to a certain extent. Specifically, the model converts each training label $Y_i \in \{0, 1, \ldots, L - 1\}$ into a prior probability distribution over all classes, $\hat{Y}_i \in \mathbb{R}^{L}$, where $\hat{Y}_{iL}$ denotes the probability of class L in $\hat{Y}_i$. We have adapted this method to make it suitable for the task of motor imagery under distraction. The label has two classes, left and right, with class indices 0 and 1, respectively, and each label Y is converted into its distribution $\hat{Y}$ in this way. Subsequently, the model establishes two domains: labelled source/training data $X^O \in \mathbb{R}^{B \times N \times F}$ (in this subsection, we denote X by $X^O$ for better clarity) and unlabelled target/testing data $X^E \in \mathbb{R}^{B \times N \times F}$, where in practice $X^E$ can be either oversampled or downsampled to have the same number of samples as $X^O$ [9], and B denotes the number of training samples. The domain classifier aims to minimize the sum of two binary cross-entropy losses, in which 0 and 1 denote the source and target domains, respectively, and $\theta_D$ denotes the parameters of the domain classifier. Intuitively, the domain classifier learns to classify source data as 0 and target data as 1. O denotes the labelled source data, and E denotes the unlabelled target data. $p_j$ is the domain probability of the jth node calculated by Eq. (5). To make the domain classifier less sensitive to noisy input in the early stage of the training process, Zhong et al. implemented a gradient reversal layer (GRL) in domain training. As training proceeds, the scaling factor $\beta$ gradually increases from 0 to 1, thus further scaling the reversed gradient. Additional details are provided in [36].
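The two ingredients of this subsection, the domain loss and the scaling factor, can be sketched as follows. The loss is written as the sum of two binary cross-entropy terms over per-node domain probabilities, as described above. The schedule for $\beta$ is an assumption: the text only states that $\beta$ grows from 0 to 1, and the sigmoid-shaped ramp with a hypothetical steepness `gamma` used here is one common choice in domain-adversarial training.

```python
import math


def domain_loss(p_source, p_target, eps=1e-12):
    """Sum of two binary cross-entropy terms over per-node domain probabilities.

    p_source : predicted probabilities of the target domain (label 1) for source nodes
    p_target : predicted probabilities of the target domain (label 1) for target nodes
    """
    loss_source = -sum(math.log(max(1.0 - p, eps)) for p in p_source)  # source should be 0
    loss_target = -sum(math.log(max(p, eps)) for p in p_target)        # target should be 1
    return loss_source + loss_target


def grl_scale(progress, gamma=10.0):
    """Assumed schedule for the gradient reversal factor beta.

    progress : fraction of training completed, in [0, 1]
    gamma    : hypothetical steepness of the ramp
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0
```

A small $\beta$ early in training limits the influence of a still-unreliable domain classifier on the feature extractor, which is the motivation given above for the gradual increase.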

F. Dynamics of GSNN
The features of the reconstructed EEG data obtained from the preprocessing regularization map algorithm are used to train the GSNN network (as shown in Fig. 2 (c)). First, gfNN denoises the node features in a high-dimensional space. After the 63-dimensional gfNN convolution, the self-attention parameter G for node selection is obtained. According to the node selection method [44], part of the nodes of the input graph are retained even when graphs of different sizes and structures are input. Therefore, we used self-attention pooling structures to preserve graph features and topologies, executing the global pooling and the hierarchical pooling in turn. The label output of the sparse node domain was used as the input of NodeDAT. That is, the source domain label and the target domain label are separated as much as possible in the node domain classifier. The graph convolutional feature extractor is combined with a domain classifier for adversarial training (using a gradient reversal layer (GRL)). Finally, the model fuses the testing data features to achieve generalization of feature extraction. Through the regularization of the smoothed node labels, the embedding is conducted by using 63 nodes, and the subsequent round is executed after supervised evaluation (as shown in Fig. 3). The optimization of the model is determined by a loss function that includes graph network and NodeDAT terms, where $\theta$ denotes the model parameters that need to be optimized, $\alpha$ denotes the weight of the L1 sparse regularization for the adjacency matrix A, and $p(Y \mid X_i, \theta)$ denotes the output probability distribution computed via Eq. (10). The overall loss function of GSNN combines NodeDAT and the graph sequence network. The detailed algorithm for training GSNN is presented in Algorithm 1. The entire neural network is fine-tuned after GSNN training.
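For orientation, one form of these losses that is consistent with the description above is given below. It is a sketch that assumes a Kullback-Leibler label loss with L1 regularization of A and an additive combination with the domain loss, in the spirit of the formulation in [36]; the symbols $\Phi'$, $\Phi_D$, and $\Phi''$ are notational assumptions introduced here.

```latex
\begin{aligned}
\Phi'   &= \sum_{i=1}^{B} \mathrm{KL}\!\left( p(Y \mid X_i, \theta) \,\middle\|\, \hat{Y}_i \right) + \alpha \lVert A \rVert_1, \\
\Phi_D  &= -\sum_{i=1}^{B} \sum_{j=1}^{N} \left[ \log p\!\left(0 \mid X^{O}_i, \theta_D\right)_j + \log p\!\left(1 \mid X^{E}_i, \theta_D\right)_j \right], \\
\Phi''  &= \Phi' + \Phi_D .
\end{aligned}
```

Under this reading, the gradient reversal layer flips the sign of the gradient of $\Phi_D$ with respect to the feature extractor, scaled by $\beta$, so that the extracted features become harder to assign to a domain.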

Algorithm 1 (excerpt): Training of GSNN
9: Compute output representation G based on Eq. (5);
10: Select top-nodes to represent the graph sparsely G based on Eq. (9);
11: Use $X_B$ and $\hat{Y}_B$ to compute KL loss based on Eq. (13);
12: Use $X^E$ and $X^E_B$ to compute domain loss D based on Eq. (12);
13: Compute gradient reversal layer scaling factor $\beta$;
...
18: until all samples in X have been drawn;

A. Experiment Protocols
According to Glorot et al. [45], even if the evaluation is conducted on the same database, different evaluation schemes can lead to significant differences in results. Generally, the influence of verification methods on test results is as follows: (1) compared with theme-independent evaluation methods, subject-independent evaluation methods are more objective, because subject-dependent evaluation disregards individual differences. (2) Subject-level k-fold cross-validation (CV) can better evaluate the sensitivity of the model among different subjects. (3) Independent distraction theme-level k-fold CV can better evaluate the adaptive adjustment ability of the model across different secondary tasks. Therefore, to avoid data leakage in the evaluation process, we adopted a six-fold CV, theme-independent evaluation protocol under the independent distraction themes of the database. The training and test data used to evaluate the individual difference sensitivity of the model were produced from the same subjects, and there is no information overlap between the subjects. Specifically, for each independent distraction theme, we cross-validated the 16 subjects with six-fold CV and repeated the verification process until each secondary task of the subjects had been treated as a test dataset. In other words, for each independent distraction theme, we repeated the procedure 16 times and calculated the final CV performance as the average of all the test results obtained. This verification method provides a fair evaluation of the performance of the cross-subject model and can accurately estimate the possible recognition accuracy on new data from new subjects.

TABLE I: Motor imagery performance (%) comparison with different network configurations using all independent distraction themes on the Berlin motor imagery database. The statistical significance of the performance difference between our method and other approaches was examined using a paired-sample t-test.

To compare our findings with the results of previous studies, we used four indicators, namely accuracy ($P_{acc}$), recall ($P_{re}$), precision ($P_{pre}$), and F1-score ($P_f$), to evaluate the performance of the model. The F1-score was introduced as a comprehensive index to balance the influence of precision and recall and to evaluate the classifier in a more comprehensive manner.
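The four indicators named in the evaluation protocol follow the usual confusion-matrix definitions for a two-class task. A minimal sketch is given below; the function name `binary_metrics` and the choice of class 1 as the positive class are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, recall, precision, f1) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, recall, precision, f1
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it serves as the balancing index.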

B. Network Training
According to Shchur et al. [46], different data splits affect the performance of graph neural network (GNN) models during training. In our experiment, the weight parameters in the convolution layer were initialized with 20 random seeds and six-fold CV. In one round of cross-validation on the distraction database, samples from 16 subjects × 12 secondary experiments were randomly used as test data (a total of 16 subjects × 12 secondary tasks × 63 nodes × 1953 edges), and samples from the other 16 subjects × 60 secondary experiments were used as training data (16 subjects × 60 secondary tasks × 63 nodes × 3969 edges). To reduce the computation time, the momentum of the Adam optimizer was set to 0.9. Training was stopped if the validation loss did not improve for 50 epochs, with an upper limit of 100 k epochs. The model weights that produced the lowest loss on the validation set were saved as the final parameters. Mini-batch stochastic gradient descent was adopted with fixed learning rates of 0.001 for the generator and 0.0002 for the discriminator. The mini-batch size was 12. All models were trained on an NVIDIA GeForce RTX 2080 graphics processing unit (GPU) using CUDA 11.0 and the PyTorch API.

Table I presents a comparison of the results of the proposed method with those of previous studies. The corresponding verification methods are also given. The experimental results show that, compared with other supervised methods, the self-attention framework (based on GSNN) yields a better MI recognition performance. For the supervised method proposed in [21], the initial evaluation scheme is based on the CV of the subjects under all independent distraction themes, and the cross-subject performance is reported separately. This allows a fair comparison between our proposed model and the other methods.
Based on [21], we used six-fold CV to evaluate their work for each distraction theme (the same method we proposed) and report the results in Table I. The comparison shows that the proposed method is better than that reported by Darvish et al. [19]. For the theme "calibration", the Recall increases from 70.31% to 78.31%, the Precision from 71.80% to 78.81%, and the F-score from 71.36% to 79.24%. For the theme "clean", the Recall increases from 77.37% to 82.14%, the Precision from 77.37% to 82.14%, and the F-score from 77.50% to 82.89%. For the remaining six distraction themes, the accuracy of the model was also improved to a certain extent. The average Recall, Precision, and F-score were 80.44%, 81.07%, and 80.54%, respectively. The standard deviation for each independent motor imagery task was also calculated. As shown in Table I, the proposed method achieves a lower standard deviation than the other methods. This indicates that the proposed method can alleviate overfitting to a great extent and helps achieve improved and more stable performance than other state-of-the-art models (paired t-test, p < 0.05, with seven degrees of freedom).

Fig. 4 shows the accuracy assessment of our model and other algorithms. The average growth rates of $P_{acc}$ and $P_f$ were 8.35% and 14.09%, respectively. The results also showed that, compared with the handcrafted features used in [5], the deep topology and node features characterized by GSNN were less sensitive to individual differences. Table II shows the classification performance of the GSNN model for different subjects. For the 16 subjects participating in each distraction experiment, we conducted independent model training and evaluation for each subject. For subjects 1, 3, 6, 12, 14, and 16, the model has good classification accuracy under each distraction theme. For the remaining subjects, it also showed good performance.
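The significance statement above relies on a paired t-test across the distraction themes, where eight paired scores give seven degrees of freedom. The sketch below computes the paired t statistic from two equally long lists of per-theme scores; the function name `paired_t_statistic` is hypothetical, and no values from Table I are reproduced here.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two samples of equal length with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / math.sqrt(var_diff / n)
    return t, n - 1
```

For a two-sided test at the 0.05 level with seven degrees of freedom, the statistic is compared against a critical value of approximately 2.365.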

IV. DISCUSSION
In this section, we conduct ablation experiments and analyze the sensitivity of the GSNN model with respect to the distraction effect. The relationship between the important brain regions and MI channels is then analyzed. Finally, we discuss the potential prospects of GSNN.

A. Ablation Analysis
To evaluate the sensitivity of the proposed GSNN method, we conducted experiments on datasets with different data enhancement methods. First, we conducted a controlled experiment with no data enhancement for each distraction theme. Subsequently, under the same conditions, the accuracy of each independent motor imagery task was calculated after applying the node domain adversarial enhancement method. As shown in Table III, compared with each individual data enhancement method, the proposed method achieves a higher accuracy. This indicates that properly extracting the regularization map is beneficial to model performance. Node domain adversarial training also proves to be useful in reducing overfitting and improving accuracy.

B. Experiments Using Different Structures of GSNN
Comparing convolution kernels of different types shows that gfNN is better suited than the other kernels to classifying motor imagery tasks under independent distraction themes. The accuracy is reported in Table IV, where higher performance is achieved when gfNN blocks are applied. Constructing models with different numbers of blocks allows the optimal number of convolutional channels to be identified. The accuracy is reported in Table V, where the highest performance is achieved when four gfNN blocks are applied. It can therefore be concluded that selecting appropriate parameters is conducive to improving classification performance.

C. Analysis of Important Brain Regions and Interchannel Structures
We identified important brain regions for MI under independent distraction themes. Fig. 5 presents a heat map of the diagonal elements of the learned adjacency matrix, where each subgraph corresponds to the classification under one independent distraction theme. These values were scaled to the [0, 1] range for better visualization. Intuitively, the diagonal values represent the contribution of each channel to the calculation of the final EEG representation. Fig. 5 shows strong activation in the prefrontal, parietal, and temporal lobes in all frequency bands, suggesting that these areas may be closely related to MI in the brain. Our findings are consistent with existing research. Fig. 5 also shows the 17 connections between channels with the largest edge weights in the adjacency matrix. After A is learned and sparse selection is completed based on the selection parameter G, global linkages remain the strongest, confirming that the relationship between global channels is essential for MI. Additionally, Fig. 5(a) shows that the connection between the channel pair (C3, C5) is the strongest during right-side imagery under the independent distraction theme, followed by (CCP3, C5) and (CCP1, CP3). The connection between the channel pair (CCP4, C6) is the strongest during left-side imagery [10] under the independent distraction theme, followed by (C2, C4) and (CCP2, CP4). We also found that the topological connection strengths of (F5, FC3) and (F4, FC5) in the prefrontal lobe are related to the distraction during left- and right-side imagery [47], indicating that the relationship between channels in the local frontal region may be important for MI.
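The two readouts described above (diagonal values scaled to [0, 1] as channel contributions, and the strongest interchannel connections ranked by edge weight) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy 4-channel matrix, and the choice to average the two triangular entries of each channel pair are all hypothetical. The analysis in the text uses the 17 strongest connections; the toy example uses k = 2.

```python
def minmax_scale(values):
    """Scale a list of numbers linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def channel_importance(adj, names):
    """Map each channel name to its scaled diagonal entry of adj."""
    diag = [adj[i][i] for i in range(len(adj))]
    return dict(zip(names, minmax_scale(diag)))


def top_k_edges(adj, names, k):
    """Return the k strongest channel pairs as (weight, name_i, name_j)."""
    edges = []
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            # Assumption: treat the graph as undirected by averaging
            # the two off-diagonal entries of each channel pair.
            weight = 0.5 * (adj[i][j] + adj[j][i])
            edges.append((weight, names[i], names[j]))
    edges.sort(key=lambda e: e[0], reverse=True)
    return edges[:k]


# Toy adjacency matrix (illustrative values only, not learned weights).
names = ["C3", "C5", "CP3", "C4"]
adj = [
    [0.9, 0.8, 0.3, 0.1],
    [0.8, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.7, 0.4],
    [0.1, 0.1, 0.4, 0.1],
]

importance = channel_importance(adj, names)
strongest = top_k_edges(adj, names, k=2)
```

Keeping the diagonal and off-diagonal readouts separate mirrors the distinction drawn in the text between per-channel contribution and interchannel connection strength.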

D. Other Prospects of the Model
In graph-based representation learning, a better pooling method is needed for learning and selecting important topological features. Because pooling is not a decisive factor, it is difficult to determine whether a global pooling architecture or a hierarchical pooling architecture is more beneficial to graph classification in MI. The global pooling architecture minimizes information loss, so it performs better than the hierarchical pooling architecture on datasets with fewer nodes. However, hierarchical pooling is more effective on datasets with numerous nodes because it can effectively extract useful information from large-scale graphs. Therefore, for the distracted MI dataset, setting the pooling-weight hyperparameters of the two parts still requires a more precise model.
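The contrast between the two pooling architectures discussed above can be made concrete with a small sketch. It is a simplified illustration, not the pooling used in GSNN: the function names are hypothetical, node scores come from a user-supplied `score_fn` (learned in real hierarchical pooling layers, here simply the sum of a node's features), and the hierarchical readout is assumed to be the sum of mean readouts taken after each top-k selection step.

```python
import math


def global_mean_pool(x):
    """Average all node feature vectors into one graph-level vector."""
    n = len(x)
    return [sum(row[d] for row in x) / n for d in range(len(x[0]))]


def top_k_pool(x, scores, ratio=0.5):
    """Keep the ceil(ratio * n) highest-scoring nodes (one pooling step)."""
    k = max(1, math.ceil(ratio * len(x)))
    keep = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()  # preserve the original node order among kept nodes
    return [x[i] for i in keep]


def hierarchical_readout(x, score_fn, ratio=0.5, levels=2):
    """Sum the mean readouts taken after each top-k pooling step."""
    out = [0.0] * len(x[0])
    for _ in range(levels):
        x = top_k_pool(x, [score_fn(row) for row in x], ratio)
        out = [o + r for o, r in zip(out, global_mean_pool(x))]
    return out


# Toy graph: four nodes (channels) with two features each.
nodes = [[1.0, 0.0], [3.0, 2.0], [2.0, 2.0], [0.0, 0.0]]

flat = global_mean_pool(nodes)               # uses every node once
coarse = hierarchical_readout(nodes, sum)    # discards nodes level by level
```

The global readout uses every node, which is why it loses little information on small graphs, whereas the hierarchical readout progressively discards low-scoring nodes, which pays off mainly when the graph is large.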

E. Conclusion
In this study, we trained the model on subject-specific datasets for each independent distraction theme. The model adaptively adjusted to the subjects' influence under different distraction themes. The problem of noisy samples was addressed by using regularization-based feature standardization and the node domain network method, which account for differences in influence across subjects. The sensitivity of the model to different subjects was thereby reduced. Finally, the results showed that, for the task of MI during distraction, our model performed significantly better than the state-of-the-art models. Moreover, the interchannel structures of active brain regions under different distraction themes were revealed to a certain extent, which provides basic knowledge for further research.

CONFLICTS OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.