Domain-Generalized EEG Classification With Category-Oriented Feature Decorrelation and Cross-View Consistency Learning

Generalizing electroencephalogram (EEG) decoding methods to unseen subjects is an important research direction for realizing practical applications of brain-computer interfaces (BCIs). Since data distributions shift across subjects, the performance of most current deep neural networks for decoding EEG signals degrades when dealing with unseen subjects. Domain generalization (DG) aims to tackle this issue by learning invariant representations across subjects. To this end, we propose a novel domain-generalized EEG classification framework, named FDCL, to generalize EEG decoding through category-relevant and -irrelevant Feature Decorrelation and Cross-view invariant feature Learning. Specifically, we first devise a data augmentation regularization that mixes segments of same-category features from multiple subjects, which increases the diversity of EEG data by spanning the space of subjects. Furthermore, we introduce a feature decorrelation regularization that learns weights for the augmented EEG trials to remove the dependencies between their features, so that the true mapping relationship between relevant features and the corresponding labels can be better established. To further distill subject-invariant EEG feature representations, a cross-view consistency learning regularization is introduced to encourage consistent predictions of category-relevant features induced from different augmented EEG views. We seamlessly integrate these three complementary regularizations into a unified DG framework to jointly improve the generalizability and robustness of the model on unseen subjects. Experimental results on motor imagery (MI) based EEG datasets validate that the proposed FDCL outperforms the available state-of-the-art methods.


I. INTRODUCTION
BRAIN-COMPUTER interfaces (BCIs) enable users to communicate with external systems or devices directly using human brain signals [1]. As a well-established non-invasive brain imaging technique, electroencephalography (EEG) has been widely used to record neural activities for BCI development [2]. To date, typical BCI paradigms include steady-state visual evoked potential [3], [4], event-related potential [5], and motor imagery (MI) [6]. Among these, MI-based BCIs provide a new perspective for neurorehabilitation and have drawn increasing attention in the therapy of various diseases, such as amyotrophic lateral sclerosis (ALS) [7], spinal cord injuries [8], brainstem stroke [9], and so on.
Currently, most machine learning methods decode EEG patterns by leveraging the statistical correlations that exist in the training data distribution. In conformity with the hypothesis that training and test data are independently sampled from the identical distribution, these methods exhibit excellent EEG classification performance. For example, many shallow and deep learning models have been developed to decode subject-specific MI-based EEG signals. For shallow methods, prior research usually classifies EEG data by combining a feature extractor (e.g., common spatial pattern, CSP) [10] and a classifier (e.g., support vector machine, SVM) [11]. To further reduce manual intervention, many deep learning (DL) models have been developed to realize end-to-end EEG decoding. For instance, two types of convolutional neural networks (CNNs) [12] (i.e., shallow ConvNet and deep ConvNet) were developed for MI-based EEG classification. With depth-wise and separable convolution operations, a compact CNN [13] was developed for EEG decoding with typical neurophysiological patterns. Besides, cascade and parallel convolutional recurrent neural networks were constructed to learn the spatial and temporal feature representations of raw EEG signals for recognizing MI-based EEG signals [14]. An in-depth review of DL-based EEG decoding methods can be found in [15].
[Fig. 1. Domain shift between subject S3 and subject S4 in BCI Competition IV Dataset IIa. The power spectral density (PSD) of each subject is averaged at the C3 electrode; the blue line and the shaded area represent the mean and standard deviation of the PSD.]
Nevertheless, the above methods require large amounts of subject-specific EEG data to tune the model for trustworthy predictions. For a new subject, it usually takes a calibration session of 20-30 minutes to collect training EEG signals, which is especially inconvenient and user-unfriendly. For practical clinical application of BCI systems, zero-calibration EEG decoding methods that can directly generalize to unseen target subjects need to be developed. However, due to the inter-subject variability in physiological/psychological states, domain shifts exist in EEG signals between subjects in real scenarios, as shown in Fig. 1. This inter-subject variability makes it difficult for most machine learning approaches to generalize well to unseen subjects.
To tackle the above-mentioned domain shift issue, transfer learning methods [16] have been proposed to explore the shared knowledge underlying the input EEG data, serving as a bridge connecting the source and target subjects. Depending on the input target EEG data, current transfer learning methods for MI-based EEG classification can be broadly grouped into three categories. The first category is inductive transfer learning methods, where a limited amount of labeled target EEG data is available [17], [18], [19], [20]. It has attracted attention for reducing the calibration time for new BCI users. For example, Fahimi et al. [19] significantly improved the classification performance for target subjects by fine-tuning a CNN model with partially labeled target EEG data. Kullback-Leibler (KL) distances between the EEG data of the labeled target subject and multiple source subjects were used as the weights for multiple source subjects [20]. The second category is domain adaptation methods. These methods typically need sufficient unlabeled target EEG data to reduce the discrepancies between source and target subjects [21], [22]. For instance, Kostas et al. [23] proposed to leverage maximum mean discrepancy (MMD) and a center-based loss to simultaneously achieve distribution alignment and discriminative feature learning. The adversarial learning strategy [22] was adopted to reduce the domain shift between the source and target subject EEG features.
The above two categories of methods require labeled or unlabeled target EEG data as part of the input, which hinders the clinical application of BCI systems when EEG data are unavailable from new subjects. In view of this, the third category of methods, i.e., domain generalization (DG), has received increasing attention, as it supports the development of zero-calibration MI-based BCIs. So far, only a few DG algorithms [23], [24], [25] have been developed and applied to MI-based EEG decoding. The generalization strategies used in these methods can be generally grouped into two types: data augmentation and invariant feature learning. In [23], Mixup [26] incorporating a data alignment technique was used to create a universal classifier for unseen subjects. Here, Mixup was introduced to linearly interpolate EEG trials from random pairs across subjects. Without considering the class-related distribution across subjects, this vanilla Mixup method may corrupt class-specific EEG features, which limits the robustness of the model [27]. Furthermore, Ozdenizci et al. [24] proposed a domain adversarial network to learn invariant representations among multiple subjects. The domain discriminator is built on fixed source subjects, which limits its scalability to more source subjects. Since relevant features (i.e., the features that are relevant to a given category) remain invariant and irrelevant features (e.g., the background noise) vary across domains [28], the use of relevant EEG features can enhance the generalization capability of the model under domain shifts. Recently, a mutual information-based deep representation learning method [25] was developed to decompose EEG features into category-relevant and category-irrelevant features, and the information overlap between them was minimized using mutual information.
Enlightened by the above studies, we propose a novel DG framework, which consists of one implicit and two explicit regularization processes. Similar to Mixup [23], [26], we first devise an implicit regularization induced by inter-subject and category-specific data augmentation to generate EEG features in the time-frequency domain. But instead of whole EEG trials, only the segments of same-category EEG features from multiple subjects are mixed, so that the proposed inter-subject data augmentation can preserve the category information while spanning the space of the subjects. In addition, we develop an explicit regularization to help the model establish the true mapping relationship between category-relevant features and the corresponding labels. First, a feature decorrelation scheme [28], [29], [30] is introduced to learn weights for EEG trials that remove the dependencies between features, so that the model can effectively partial out irrelevant features and capture the relevant features across subjects for prediction. To further distill subject-invariant EEG feature representations, another explicit regularization scheme, cross-view consistency learning, is introduced to encourage consistent predictions of category-relevant features induced from different augmented EEG views, thereby improving the robustness of the model for downstream tasks. The implicit regularization (i.e., inter-subject and category-specific data augmentation) and explicit regularizations (i.e., feature decorrelation and cross-view consistency learning) are integrated into a unified DG framework to jointly improve the robustness and generalizability of the model on unseen subjects.
The main contributions of this paper are summarized as follows: • We develop a unified DG framework for EEG decoding, where the complementary implicit and explicit regularizations are jointly leveraged to generalize the model to unseen subjects.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
• We introduce three innovative schemes into the EEG signal decoding model to improve its robustness and generalization capability: 1) inter-subject and category-specific EEG data augmentation to generate multiple augmented views; 2) a feature decorrelation scheme to help the model partial out spurious correlations and establish the true mapping relationship between relevant features and the corresponding labels; 3) cross-view consistency learning to distill subject-invariant feature representations.
• We demonstrate the superiority of the proposed FDCL approach over the state-of-the-art DG methods experimentally on three publicly available MI-based EEG datasets.
II. DOMAIN-GENERALIZED EEG CLASSIFICATION FRAMEWORK
This study aims to learn an EEG decoding model that generalizes to unseen target subjects by leveraging EEG data from multiple source subjects. Given S source domains (subjects) {D_1, D_2, · · · , D_S}, subject l contains N_l EEG trial-label pairs D_l = {(X_i^l, y_i^l)}_{i=1}^{N_l}. For the l-th subject, X_i^l and y_i^l ∈ {1, 2, · · · , C} are the i-th input EEG trial and the corresponding label, respectively. The goal here is to train a model consisting of an encoder network f: X → Z, which maps an input trial X_i to its representation Z_i = f(X_i), and a task network (e.g., a classifier) g: Z → R^C. We assume that f can extract class-relevant features, such that g can perform well on unseen target subjects.
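For illustration, the leave-one-subject-out data organization described above can be sketched in a few lines; the array shapes, subject counts, and the `loso_split` helper below are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: S source subjects, each with 50 trials of shape
# (channels, time points) and labels in {0, ..., C-1}.
S, C, n_channels, n_times = 4, 2, 22, 1000
domains = [
    (rng.standard_normal((50, n_channels, n_times)), rng.integers(0, C, 50))
    for _ in range(S)
]

def loso_split(domains, target_idx):
    """Aggregate all subjects except `target_idx` for training (DG setting);
    the held-out subject is the unseen target domain."""
    X_train = np.concatenate([X for i, (X, _) in enumerate(domains) if i != target_idx])
    y_train = np.concatenate([y for i, (_, y) in enumerate(domains) if i != target_idx])
    X_test, y_test = domains[target_idx]
    return X_train, y_train, X_test, y_test

X_tr, y_tr, X_te, y_te = loso_split(domains, target_idx=0)
print(X_tr.shape)  # (150, 22, 1000)
```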
To mitigate performance degradation on unseen target subjects, we introduce three innovative constraints: an implicit regularization with an inter-subject and category-specific EEG data augmentation strategy, an explicit regularization with feature decorrelation, and another explicit regularization with a cross-view consistency learning scheme. A previous study [31] demonstrated that simply mixing up features through a data augmentation strategy can enable the model to learn better feature representations. Therefore, we design a novel inter-subject and category-specific EEG data augmentation strategy that mixes the segments of same-category EEG features from multiple subjects in the time-frequency domain. We further apply a feature decorrelation scheme to remove the dependencies between relevant and irrelevant features by learning weights for the training EEG trials, so that the true mapping between category-relevant features and labels can be established. To make the EEG feature representations more subject-invariant, we adopt another explicit regularization, cross-view consistency learning, to encourage consistent predictions of category-relevant features induced from different augmented EEG views. All these tasks use a common auxiliary encoder f′, and they are seamlessly integrated into a unified DG framework. The overall FDCL framework is schematically shown in Fig. 2, and the major components are introduced below.

A. Inter-Subject and Category-Specific EEG Data Augmentation
We aim to design an appropriate EEG data augmentation strategy that generates augmented data from multiple source subjects to increase the diversity of the input EEG data [32], which enables the model to learn informative features from a larger data space and generalize to unseen subjects. A simple and efficient EEG data augmentation strategy is to directly concatenate segments from different EEG trials, but this may lead to mismatches at the boundary between two consecutive segments [33]. To alleviate this issue, we propose an inter-subject and category-specific (ISCS) EEG data augmentation strategy that linearly interpolates consecutive segments. Note that we mix the segments of the feature representations of same-category EEG trials from arbitrary source subjects, such that the augmented data can span the data space of the subjects to increase the diversity of the input EEG data while preserving its category information. The proposed ISCS data augmentation is described in detail as follows.
Given a randomly selected EEG trial X_i, i = 1, 2, · · · , N, we transform it channel-by-channel into a time-frequency feature representation T_i ∈ R^{c×r×t} using the short-time Fourier transform (STFT) with fixed-length Hamming windows and 50% overlap [33], where c, r and t are the numbers of channels, frequency bands and time points, respectively. Then, we partition T_i along the time dimension into K segments T_i^k ∈ R^{c×r×(t/K)}, k = 1, 2, . . . , K, as shown in Fig. 3, so that T_i can be reformulated as:

T_i = [T_i^1, T_i^2, · · · , T_i^K]. (1)

For another same-category EEG trial X_j from an arbitrary source subject, we obtain its feature representation T_j by repeating the same procedure. We linearly interpolate the k-th segments of T_i and T_j as follows:

T̃_i^k = (1 − λ) T_i^k + λ T_j^k, (2)

where λ ∼ U(0, η), U(·) denotes the uniform distribution, and the parameter η controls the strength of the augmentation. The mixed segments are then sequentially combined to form an augmented time-frequency representation T̃_i as follows:

T̃_i = [T̃_i^1, T̃_i^2, · · · , T̃_i^K]. (3)

Finally, we use the inverse STFT to obtain the augmented EEG trial, which is formulated as:

X̃_i = invSTFT(T̃_i) = A(X_i, X_j), (4)

where invSTFT(·) performs the inverse STFT operation channel-by-channel and A denotes the ISCS EEG data augmentation function. Fig. 3 shows the flowchart of the proposed ISCS data augmentation. Note that the proposed ISCS data augmentation strategy is essentially different from the segmentation and recombination (SR) method [33]. The SR method flatly concatenates the segments of same-category EEG trials within each subject, which may cause prominent boundary mismatches between consecutive segments and limits the generalizability of the learned EEG feature representations; therefore, it fails to achieve the expected improvements in generalizing the model to unseen subjects. To increase the diversity of EEG trials, the inter-subject Mixup method [23] interpolates EEG trials from random pairs across subjects in the time domain.
Unlike [23], we only linearly interpolate the segments of inter-subject, same-category EEG trials to increase the diversity of the input EEG trials while preserving their category information, which can improve the generalizability and robustness of the model on unseen subjects.
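A minimal sketch of the ISCS pipeline (STFT, segment-wise mixing with λ ∼ U(0, η), inverse STFT), assuming SciPy's `stft`/`istft`; the window length, trial shapes, and the `iscs_augment` helper name are illustrative choices, not the paper's exact implementation:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 250                 # sampling rate in Hz (as in BCI Competition IV IIa)
nperseg = fs             # 1 s Hamming window with 50% overlap (the paper's STFT setting)

def iscs_augment(x_i, x_j, K=4, eta=0.01, rng=rng):
    """Mix K time-frequency segments of two same-category trials
    (each of shape (channels, time)) and invert back to the time domain."""
    _, _, T_i = stft(x_i, fs=fs, window='hamming', nperseg=nperseg, noverlap=nperseg // 2)
    _, _, T_j = stft(x_j, fs=fs, window='hamming', nperseg=nperseg, noverlap=nperseg // 2)
    t = T_i.shape[-1]
    bounds = np.linspace(0, t, K + 1, dtype=int)     # K segments along time
    T_mix = T_i.copy()
    for k in range(K):
        lam = rng.uniform(0, eta)                    # lambda ~ U(0, eta)
        s = slice(bounds[k], bounds[k + 1])
        T_mix[..., s] = (1 - lam) * T_i[..., s] + lam * T_j[..., s]
    _, x_aug = istft(T_mix, fs=fs, window='hamming', nperseg=nperseg, noverlap=nperseg // 2)
    return x_aug[..., :x_i.shape[-1]]

x_i = rng.standard_normal((3, 1000))   # 3 channels, 4 s at 250 Hz
x_j = rng.standard_normal((3, 1000))   # another trial of the same class
x_aug = iscs_augment(x_i, x_j)
print(x_aug.shape)  # (3, 1000)
```

With a small η, the augmented trial stays close to the original while borrowing class-consistent structure from the other subject.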

B. Category-Oriented Feature Decorrelation
When distributional shift exists in EEG trials across subjects, the model is prone to making wrong predictions caused by the spurious correlation between the irrelevant features and the class labels. Such spurious correlations arise from the subtle correlations between the irrelevant and relevant features [28]. This statistical dependence between the relevant and irrelevant features is a major cause of model failure under distribution shift. Hence, decorrelating the relevant and irrelevant features can significantly improve the generalizability of the model. The sample weighting method can be used to decorrelate the relevant and irrelevant features, and it has been theoretically shown to make models produce trustworthy predictions under distribution shift [29], [30]. Referring to [27], we measure the independence between features using the independence testing statistic, which can be minimized through sample weighting to eliminate the statistical correlations between the irrelevant and relevant features. Then, the model can automatically partial out the irrelevant features and establish the true mapping relationship between the relevant features and the corresponding labels.
We adopt the stable learning approach [28] to remove the feature dependence through sample weighting in the representation space. Let Z_{:,p} and Z_{:,q} denote any pair of feature dimensions output by the encoder. The independence between Z_{:,p} and Z_{:,q} can be measured through the independence testing statistic, expressed as the Frobenius norm of the partial cross-covariance matrix:

Σ̂(Z_{:,p}, Z_{:,q}) = (1/(N − 1)) Σ_{i=1}^{N} [(u(Z_{i,p}) − ū)(v(Z_{i,q}) − v̄)^T], (5)

where N = |D_1| + |D_2| + · · · + |D_S|, the samples Z_{1,p}, · · · , Z_{N,p} and Z_{1,q}, · · · , Z_{N,q} are drawn from Z_{:,p} and Z_{:,q}, respectively, and ū and v̄ denote the sample means of u(Z_{i,p}) and v(Z_{i,q}). The mappings u(·) = (u_1(·), · · · , u_{N_u}(·)) and v(·) = (v_1(·), · · · , v_{N_v}(·)) consist of functions sampled from the random Fourier feature (RFF) space:

H_RFF = {h : x → √2 cos(wx + φ) | w ∼ N(0, 1), φ ∼ U(0, 2π)}, (6)

where N(·) denotes the standard normal distribution and U(·) denotes the uniform distribution. Thus, the independence testing statistic I(Z_{:,p}, Z_{:,q}) can be formulated as:

I(Z_{:,p}, Z_{:,q}) = ‖Σ̂(Z_{:,p}, Z_{:,q})‖_F^2. (7)

In practice, Z_{:,p} and Z_{:,q} become independent when I(Z_{:,p}, Z_{:,q}) decreases to zero. The sample weighting method can be applied to remove the dependence between the features [28]. Eq. (5) is then reformulated as follows:

Σ̂_α(Z_{:,p}, Z_{:,q}) = (1/(N − 1)) Σ_{i=1}^{N} [(α_i u(Z_{i,p}) − ū_α)(α_i v(Z_{i,q}) − v̄_α)^T], (8)

where α ∈ R^N_+ denotes the sample weights with Σ_{i=1}^{N} α_i = N, and ū_α and v̄_α are the corresponding weighted sample means. The weights are learned by minimizing the weighted statistic over all feature pairs (p, q). After decorrelating the relevant and irrelevant features, the model can be trained on the relevant features, which leads to more reliable predictions on unseen subjects. To this end, the supervised loss function is formulated as the weighted cross-entropy loss:

L_s = (1/N) Σ_{i=1}^{N} α_i ℓ_ce(g(f(X_i)), y_i), (9)

where ℓ_ce(·) denotes the cross-entropy loss.
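The independence testing statistic can be sketched numerically with NumPy; the RFF sampling and the way the weights enter the covariance follow our reading of the stable learning approach [28] and should be treated as an illustrative approximation rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(z, n_fns=5, rng=rng):
    """Sample n_fns random Fourier feature functions h(x) = sqrt(2)*cos(w*x + phi),
    with w ~ N(0, 1) and phi ~ U(0, 2*pi), applied to a feature column z of shape (N,)."""
    w = rng.standard_normal(n_fns)
    phi = rng.uniform(0.0, 2.0 * np.pi, n_fns)
    return np.sqrt(2.0) * np.cos(np.outer(z, w) + phi)   # shape (N, n_fns)

def independence_stat(z_p, z_q, alpha):
    """Weighted independence testing statistic: squared Frobenius norm of the
    partial cross-covariance matrix between two feature dimensions."""
    u, v = rff_map(z_p), rff_map(z_q)
    au, av = alpha[:, None] * u, alpha[:, None] * v
    cov = (au - au.mean(0)).T @ (av - av.mean(0)) / (len(z_p) - 1)
    return float(np.linalg.norm(cov, 'fro') ** 2)

N = 2000
alpha = np.ones(N)                                   # uniform sample weights
z_p = rng.standard_normal(N)
stat_dep = independence_stat(z_p, z_p, alpha)        # fully dependent feature pair
stat_ind = independence_stat(z_p, rng.standard_normal(N), alpha)  # independent pair
print(stat_dep, stat_ind)
```

A dependent pair yields a clearly larger statistic than an independent pair; in FDCL the weights α would be optimized to drive this statistic toward zero.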

C. Cross-View Consistency Learning
Different inter-subject and category-specific data augmentations generate different augmented EEG data, which can be regarded as different EEG views. The above data augmentation and feature decorrelation methods impose an implicit and an explicit constraint on the model, which enables the model to learn category-relevant feature representations from different augmented EEG views. Although effective, the feature representations learned from one augmented view may be inconsistent with those learned from another view. To alleviate such disagreement, we further propose to align the outputs of the encoder f with the outputs of the auxiliary encoder f′ under different augmented EEG views. Through this output alignment, the model is expected to learn invariant feature representations across views and thereby achieve further improved generalizability.
To this end, we devise a cross-view consistency learning scheme to guide the model to focus more on invariant features, which can be formulated as:

L_c = ℓ_c(g(f(A(X_i))), g′(f′(A′(X_i)))), (10)

where ℓ_c(·) denotes the mean-square error, A and A′ denote two instances of the ISCS EEG data augmentation function producing different views, and g′ denotes the classifier of the auxiliary network.
Since differently augmented EEG trials can be seen as two views of the same data, the cross-view consistency learning scheme encourages invariant knowledge sharing between the two views, which directs the model to further distill invariant feature representations. By focusing on the common underlying patterns that remain consistent across different views, the model becomes more robust to variations between views and can better generalize to unseen data.
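A toy sketch of the consistency term: applying the mean-square error to the softmax outputs of the two views is our assumption about where ℓ_c acts (the paper specifies only the mean-square error); the names are illustrative:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def consistency_loss(logits_student, logits_teacher):
    """Mean-square error between class-probability outputs of the two views
    (student: f/g on view A; teacher: EMA network f'/g' on view A')."""
    p = softmax(logits_student)
    q = softmax(logits_teacher)
    return np.mean((p - q) ** 2)

rng = np.random.default_rng(0)
logits_a = rng.standard_normal((8, 4))          # predictions for augmented view A
loss_same = consistency_loss(logits_a, logits_a)
loss_diff = consistency_loss(logits_a, rng.standard_normal((8, 4)))
print(loss_same, loss_diff)   # identical views give exactly zero loss
```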

D. Optimization
All of the above regularizations are seamlessly integrated into an end-to-end learning framework. The objective of the proposed FDCL framework can be formulated as the following bi-level optimization problem [34]:

min_θ (1/N) Σ_{i=1}^{N} α*_i ℓ_ce(g(f(X̃_i)), y_i) + λ_c L_c,
s.t. α* = argmin_α Σ_{p≠q} I_α(Z_{:,p}, Z_{:,q}), (11)

where I_α denotes the weighted independence testing statistic and λ_c denotes the hyper-parameter that balances the supervised loss and the cross-view consistency learning loss.
Denote θ and θ′ as the parameters of the encoder f and the auxiliary encoder f′, respectively. θ is optimized by back-propagation, while θ′ is updated as the exponential moving average (EMA) of θ:

θ′_t = β θ′_{t−1} + (1 − β) θ_t, (12)

where t denotes the time stamp and β is a trade-off coefficient. Besides, decorrelating the relevant and irrelevant features requires learning the weights of the EEG trials globally. To reduce the computational complexity and the storage requirements of the model, b batches of features are used to approximate the global features, denoted as Z^G_i, i = 1, 2, · · · , b, with corresponding global weights α^G_i, i = 1, 2, · · · , b. The features and weights used to optimize the weights of the current batch are given by:

Z_O = Concat(Z^G_1, · · · , Z^G_b, Z^L), α_O = Concat(α^G_1, · · · , α^G_b, α), (13)

where Z^L and α are the features and initial weights of the current batch. After computing the weights α^L of the current batch, the features and weights of the current batch are fused with the global features and weights, which are given by:

Z^G_i = ρ_i Z^G_i + (1 − ρ_i) Z^L, α^G_i = ρ_i α^G_i + (1 − ρ_i) α^L. (14)

Here, the hyperparameter ρ_i, i = 1, 2, · · · , b, is used to balance the contributions of the global and current-batch information.
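The EMA update of the auxiliary parameters and the global/current-batch fusion described above can be sketched as follows; the dictionary-of-arrays parameter representation and the helper names are illustrative assumptions:

```python
import numpy as np

def ema_update(theta_teacher, theta_student, beta=0.99):
    """Exponential moving average update of the auxiliary encoder's parameters:
    theta'_t = beta * theta'_{t-1} + (1 - beta) * theta_t."""
    return {k: beta * theta_teacher[k] + (1 - beta) * theta_student[k]
            for k in theta_teacher}

def fuse_global(z_global, a_global, z_local, a_local, rho=0.9):
    """Fuse the current-batch features/weights into the stored global ones so
    that sample weights can be learned without keeping the whole dataset in memory."""
    z = rho * z_global + (1 - rho) * z_local
    a = rho * a_global + (1 - rho) * a_local
    return z, a

teacher = {'w': np.zeros(3)}
student = {'w': np.ones(3)}
teacher = ema_update(teacher, student)
z_g, a_g = fuse_global(np.ones(4), np.ones(4), np.zeros(4), np.zeros(4))
print(teacher['w'], z_g)  # [0.01 0.01 0.01] [0.9 0.9 0.9 0.9]
```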

In summary, we propose to iteratively optimize the network parameter θ and the sample weights α. The learning procedure of the proposed FDCL framework is given in Algorithm 1.

III. EXPERIMENTS
A. Experimental Datasets
1) BCI Competition III Dataset IVa (Exp.1): This dataset contains EEG signals collected from 5 subjects (denoted as "aa", "al", "av", "aw" and "ay") using 118 electrodes at a sampling rate of 100 Hz. Each subject performed 280 cue-based trials of two-class MI tasks. The numbers of training trials were 168, 224, 84, 56 and 28 for the 5 subjects, respectively, and the remaining trials were used as the test set. For each trial, we used a 2.5 s time window starting 0.5 s after the cue.
2) BCI Competition IV Dataset IIa (Exp.2): This dataset contains EEG signals collected from 9 subjects (denoted as "S1"-"S9") using 22 electrodes at a sampling rate of 250 Hz. For each subject, EEG signals from two sessions were recorded. Each session produced 288 EEG trials of four-class MI tasks. EEG signals from the first session were used to train the model, while the second session served as the test set. For each trial, we used the [-0.5 s, 4 s] time period relative to the trial start cue.
3) BCI Competition IV Dataset IIb (Exp.3): This dataset contains EEG signals collected from 9 subjects (denoted as "B1"-"B9") using electrodes C3, Cz and C4 at a sampling rate of 250 Hz. For each subject, EEG signals from five sessions of two-class MI tasks were recorded. EEG signals from the first three sessions were used to train the model, while the remaining two sessions served as the test set. For each trial, we used the [-0.5 s, 4 s] time period relative to the trial start cue.
The EEG signals were filtered with a fifth-order Butterworth band-pass filter in the frequency range of 4-38 Hz before subsequent analyses. To verify the effectiveness of FDCL for domain generalization tasks, the training data of all subjects except the unseen subject were aggregated to train the model on multiple source subjects. The classification performance was then evaluated on the test set of the unseen subject. Fig. 4 illustrates the data configuration, where the test data of the first subject are treated as the unseen domain.
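The filtering step can be reproduced approximately with SciPy; using zero-phase `filtfilt` is our assumption, since the paper does not state the filtering direction:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250  # sampling rate in Hz (Exp.2 / Exp.3)

# Fifth-order Butterworth band-pass filter, 4-38 Hz, applied channel-wise.
b, a = butter(N=5, Wn=[4, 38], btype='bandpass', fs=fs)

def preprocess(eeg):
    """Zero-phase band-pass filtering of a (channels, time) EEG trial."""
    return filtfilt(b, a, eeg, axis=-1)

rng = np.random.default_rng(0)
trial = rng.standard_normal((22, 1125))   # 4.5 s trial at 250 Hz
filtered = preprocess(trial)
print(filtered.shape)  # (22, 1125)
```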

B. Experimental Setup
To verify the effectiveness of the proposed method, we compared the proposed FDCL with seven state-of-the-art DG methods: 1) the baseline method (DeepAll); 2) inter-domain Mixup (Mixup) [34]; 3) StableNet [28]; 4) CSD [37]; 5) MLDG; 6) ERDG; 7) MIDLR; and 8) SMA. For all comparison methods, Shallow ConvNet (SCNN) [12] was adopted as the backbone network. For the baseline DeepAll, the EEG trials of all source subjects were pooled together for training. Referring to [34], Mixup performed linear interpolation on randomized pairs of EEG trials from multiple source subjects and their corresponding labels, and the generated augmented EEG data were then used to train SCNN. For StableNet, the size of the global features was set equal to the batch size; the other settings were the same as in [28]. Besides, the default rank and the loss parameters of CSD were set according to [37]. For MLDG, we split the multiple source subjects and adopted one source subject as the meta-test domain at each learning iteration, while the remaining source subjects were used as the meta-training domains. For the other comparison methods, we set the parameters as in the original papers. For the proposed FDCL, the mean-square error was adopted to calculate the consistency loss. The hyperparameter λ_c was set to 0.001. In addition, the size of the global features was set equal to the batch size, i.e., b = 1. In the STFT, we used a 1 s Hamming window with 50% overlap; thus, in our ISCS, the number of segments K was 4, 8, and 8 for Exps.1-3, respectively. The parameter η controlling the strength of the augmentation was set to 0.01. The numbers of functions sampled from the RFF space, i.e., N_u and N_v, were set to 5. The trade-off coefficients β and ρ_i, i = 1, 2, · · · , b, in Eqs. (12) and (14) were set to 0.99 and 0.9, respectively. The SGD optimizer was used to optimize the proposed framework, and the learning rate was set to 0.001.
Moreover, the batch size and the number of epochs were set to 128 and 2000, respectively.

A. Evaluation of the Proposed ISCS Data Augmentation
To examine the efficacy of the proposed ISCS EEG data augmentation method, we compared it with the following mainstream methods: 1) noise addition [35]; 2) sliding window [36]; 3) segmentation and recombination in the time-frequency domain [33]; and 4) Mixup [34]. In addition, we took SCNN without data augmentation as the classification baseline. For the noise addition method, we set the mean and standard deviation of the Gaussian noise to 0 and 0.001, respectively [35]. For segmentation and recombination, we used the same number of segments as ISCS, and all segments were used for recombination. For the sliding window method, the augmented EEG trials were generated by sliding a window of 1125 sampling points with a step of 50 points. For Mixup, the parameter η controlling the strength of the augmentation was also set to 0.01.
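The two simplest baselines above can be sketched directly; the trial shape is a placeholder and the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_addition(trial, sigma=0.001, rng=rng):
    """Baseline: add zero-mean Gaussian noise (std 0.001, as in the setup)."""
    return trial + rng.normal(0.0, sigma, trial.shape)

def sliding_window(trial, win=1125, step=50):
    """Baseline: crop overlapping windows of `win` samples with stride `step`."""
    n = trial.shape[-1]
    return [trial[..., s:s + win] for s in range(0, n - win + 1, step)]

trial = rng.standard_normal((3, 1375))   # 3 channels, 5.5 s at 250 Hz
crops = sliding_window(trial)
print(len(crops), crops[0].shape)  # 6 (3, 1125)
```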
The classification accuracies of the baseline and the different EEG data augmentation methods on Exp.1, Exp.2 and Exp.3 are given in Tables I, II and III, respectively. Compared to the baseline SCNN without data augmentation, the performance of the noise addition method drops for 2 out of 5 subjects in Exp.1 and 5 out of 9 subjects in Exp.2, leading to average accuracy degradations of 0.04% and 0.69%, respectively. Similar outcomes are observed for 6 out of 9 subjects in Exp.3, leading to an average accuracy degradation of 0.35%. These results indicate that the magnitude of the noise is critical to the noise addition method; in other words, it is difficult to find a universal noise level that jointly boosts performance across all subjects. The sliding window data augmentation method faces the same problem. In addition, the segmentation and recombination and Mixup data augmentation methods yield better classification accuracy than the noise addition and sliding window methods on most subjects. The proposed ISCS EEG data augmentation strategy linearly interpolates the segments of same-category EEG features from multiple subjects in the time-frequency domain, and it outperforms all the compared data augmentation methods in terms of average accuracy. Besides, we list the average classification kappa and recall on the three datasets in Fig. 5. As observed, the proposed ISCS data augmentation method achieves consistent improvements in terms of average kappa and recall over the other compared methods. We further investigated the classification performance of the proposed ISCS data augmentation method with different numbers of segments, as shown in Fig. 5(c), where the number of segments was varied from 2 to 10 with an interval of 2.
For the three datasets, the average classification performance first increases and then decreases as the number of segments increases, but the variation is small; the relatively stable accuracy curves in Fig. 5(c) indicate that the classification performance of the proposed ISCS is not sensitive to the number of segments. Overall, the promising performance validates the merits of the proposed ISCS method: it mitigates the noise caused by boundary mismatches between consecutive segments while spanning the data space of multiple source subjects and preserving category-specific feature knowledge.

B. Evaluation of the FDCL DG Framework
To evaluate the efficacy of the proposed FDCL framework, we compare the classification results of FDCL with those of the state-of-the-art EEG DG methods on Exp.1, Exp.2 and Exp.3. The experimental results are listed in Tables IV, V and VI, with the best classification accuracies boldfaced. As observed, most of the DG methods show only a slight advantage over the baseline DeepAll, which suggests that directly aggregating multiple source data cannot yield decent predictions on unseen subjects. Nonetheless, the proposed FDCL outperforms all the other compared methods in terms of average accuracy in Exp.1, Exp.2 and Exp.3, which reflects the benefit of jointly learning category-relevant and subject-invariant feature representations.
Moreover, Mixup and StableNet are two basic components of the proposed FDCL. The experimental results demonstrate that FDCL outperforms Mixup for all subjects, with average classification accuracy improvements of 5.54%, 2.86% and 3.29% in Exp.1, Exp.2 and Exp.3, respectively. Compared to StableNet, FDCL achieves better performance for all subjects in Exp.1, with a 3.27% improvement in average classification accuracy. In Exp.2, FDCL achieves better classification performance in 7 out of 9 subjects, with an average accuracy improvement of 2.78%. In Exp.3, FDCL improves the average classification accuracy by 2.86%, outperforming StableNet in 8 out of 9 subjects. In addition, for all subjects in Exp.1, FDCL outperforms the comparison methods MLDG, ERDG and MIDLR, with 2.43%, 3.75% and 3.11% improvements in average classification accuracy, respectively. In Exp.2, the proposed FDCL outperforms the comparison methods CSD, MLDG and MIDLR for all subjects, with 3.13%, 1.86% and 2.82% improvements in average classification accuracy, respectively. In Exp.3, FDCL outperforms CSD in all cases. Compared to MLDG, FDCL shows better classification performance in 7 out of 9 subjects with an average accuracy increase of 2.48%, and it outperforms MIDLR in 8 out of 9 subjects with a 2.9% improvement in average classification accuracy. Besides, FDCL shows better classification performance than SMA in 8 out of 9 subjects with an average accuracy increase of 3.21%. Furthermore, FDCL outperforms CSD and SMA in 4 and 3 out of 5 subjects in Exp.1, with average classification accuracy improvements of 2.93% and 1.25%, respectively. Compared to the most competitive method, ERDG, the proposed FDCL improves the absolute classification accuracies in 6 out of 9 subjects in Exp.2 and in 7 out of 9 subjects in Exp.3, with the average classification accuracies increased by 1.47% and 2%, respectively.
These improvements benefit from the proposed FDCL scheme, which seamlessly integrates the implicit EEG data augmentation regularization, the category-oriented consistent feature decorrelation, and the cross-view consistency learning into an end-to-end DG framework.
We further present the confusion matrices of DeepAll and the proposed FDCL on the three MI-based EEG datasets, as shown in Fig. 6. These matrices give the correspondence between the true labels and the predicted labels on the datasets. As observed, FDCL achieves better classification performance than DeepAll for all MI tasks in Exp.1 and Exp.3. In Exp.2, FDCL achieves better classification performance in 3 out of 4 categories, outperforming DeepAll by 0.16%, 3.7% and 11.11% on the LH, RH and F MI tasks, respectively. The promising classification results highlight the effectiveness of the proposed domain generalization framework.
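As a concrete illustration, the per-class correspondence summarized by such a confusion matrix can be computed directly from the true and predicted labels. The sketch below uses synthetic labels for a 4-class MI setup; the class ordering and label values are illustrative assumptions, not the actual experimental outputs:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index true labels, columns index predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical 4-class MI labels (e.g., 0=LH, 1=RH, 2=F, 3=T)
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])

cm = confusion_matrix(y_true, y_pred, n_classes=4)
# Per-class accuracy = diagonal counts divided by the true-label counts
per_class_acc = cm.diagonal() / cm.sum(axis=1)
```

Normalizing each row by the number of true trials in that class yields the per-class accuracies reported along the diagonal of Fig. 6.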
C. Empirical Analysis

1) Statistical Significance Testing: To evaluate the statistical significance of the classification results, we perform hypothesis testing to verify whether FDCL is significantly superior to the comparison methods. We first use the Friedman test and the Iman-Davenport test [41] to check whether significant differences in performance exist among the eight comparison methods. The Friedman test is a nonparametric analogue of the parametric two-way analysis of variance; its objective is to determine whether we may conclude from a sample of results that there are differences among treatment effects. The Iman-Davenport test is an extension of the Friedman test [42]. If the null hypothesis is rejected, indicating significant differences, we then perform pairwise comparisons between the control method (FDCL) and the other compared methods using the post-hoc Holm's test [41]. The p-values of the Friedman and Iman-Davenport tests over all 23 subjects are given in Table VII, with values less than 0.05 boldfaced. It is clear from Table VII that all the null hypotheses are rejected at the 95% confidence level; there are significant differences in performance among the comparison methods. We then apply Holm's method to calculate adjusted p-values for pairwise comparisons, with the proposed FDCL as the control method. The values that are less than 0.05 are boldfaced. The results in Table VIII show that all the adjusted p-values are less than 0.05, which confirms that FDCL is better than the other comparison methods with statistical significance.
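The testing pipeline above can be sketched as follows. The accuracy matrix here is randomly generated purely for illustration, and paired Wilcoxon tests stand in for the rank-based post-hoc statistics; Holm's step-down correction of the raw p-values is implemented directly:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy matrix: rows = 23 subjects, columns = 8 methods.
rng = np.random.default_rng(0)
acc = rng.uniform(0.5, 0.9, size=(23, 8))
acc[:, 0] += 0.05  # pretend column 0 is the control method (e.g., FDCL)

# Omnibus Friedman test across the 8 methods over the 23 subjects
stat, p_friedman = stats.friedmanchisquare(*acc.T)

# Pairwise tests of the control against each other method
raw_p = [stats.wilcoxon(acc[:, 0], acc[:, j]).pvalue
         for j in range(1, acc.shape[1])]

# Holm's step-down adjustment: sort ascending, multiply the k-th smallest
# raw p-value by (m - k), and enforce monotonicity of the adjusted values
m = len(raw_p)
order = np.argsort(raw_p)
adj = np.empty(m)
running_max = 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (m - rank) * raw_p[idx])
    adj[idx] = min(1.0, running_max)
```

Rejections are then read off by comparing each adjusted p-value against the 0.05 significance level.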
2) Impact of Different Components: To obtain better insights into the performance of the FDCL framework, we further conduct ablation studies to investigate the role of each component in FDCL. We decompose FDCL into different components: 1) DeepAll is used as the baseline; 2) DeepAll model trained with

3) Impact of Different Network Backbones: We further investigate the generalizability of the proposed DG framework with different network backbones. We adopt the widely used neural networks, i.e., SCNN, Deep ConvNet (DCNN) [12] and EEGNet [13], as the backbones of our framework. The experimental results are listed in Fig. 7. As observed, SCNN and DCNN are more suitable network backbones for Exp.1. Besides, DeepAll and FDCL with SCNN as the network backbone achieve the best classification performance in Exp.2, while EEGNet appears more suitable for Exp.3. Overall, the proposed domain generalization framework achieves consistent classification performance improvements with different backbone networks, demonstrating its stability.

V. DISCUSSION
Automated EEG decoding is essential for assisting the clinical diagnosis and rehabilitation of individuals affected by neurological injuries or disorders. Although deep learning methods have shown promise for EEG decoding tasks, they typically require large amounts of labeled data for model training. In clinical practice, acquiring sufficient subject-specific EEG data is labor-intensive and often impractical for disabled patients. Transfer learning-based EEG decoding methods have the potential to reduce the demand for subject-specific data by effectively leveraging other subjects' EEG data. Inspired by recent domain generalization methods in computer vision [27], we focus on learning the underlying invariant feature representations from multiple source subjects, which are expected to generalize well to unseen subjects. Specifically, a category-oriented feature decorrelation scheme is introduced to decorrelate the relevant and irrelevant features, helping the model establish the true mapping relationship between the relevant features and the corresponding labels. Besides, an inter-subject, category-specific EEG data augmentation and a cross-view consistency learning scheme are developed to further improve the generalizability of the model by distilling the common underlying patterns that remain consistent across different augmented EEG views. Different from most current domain adaptation methods used in EEG decoding, the proposed DG method FDCL no longer requires target EEG data to fine-tune the model, which significantly facilitates the clinical practicability of MI-based BCIs. In clinical practice, robustifying a DG model to any unknown distribution is difficult without utilizing target subject EEG data [43].

Fig. 7. Average accuracies of the proposed FDCL and the baseline method DeepAll on three MI-based EEG datasets using different network backbones: SCNN [12], Deep ConvNet (DCNN) [12] and EEGNet [13].
However, in certain scenarios, we may have access to a small amount of target EEG data. To validate the effectiveness of FDCL in short-calibration EEG decoding tasks, we conducted experiments using an additional 10% and 20% of the target subject's training data, respectively. The average classification accuracies on the three datasets are listed in Table X. The results reveal that FDCL achieves substantial improvements in classification accuracy when only 10% or 20% of the target subject's training data is available. For example, in Exp.1, the average classification accuracies of FDCL are improved by 8.06% and 11.47%, respectively. These results demonstrate that our FDCL method has potential in short-calibration EEG decoding tasks.
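The short-calibration protocol above can be sketched as a simple data split, where a fraction of the target subject's trials is moved into the training pool and the remainder is held out for evaluation. All shapes and trial counts below are hypothetical placeholders, not the datasets' actual dimensions:

```python
import numpy as np

def short_calibration_split(X_target, y_target, frac, seed=0):
    """Randomly assign `frac` of the target subject's trials to calibration
    (added to the source training pool); keep the rest for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_target))
    n_cal = int(round(frac * len(X_target)))
    cal, test = idx[:n_cal], idx[n_cal:]
    return (X_target[cal], y_target[cal]), (X_target[test], y_target[test])

# Illustrative target subject: 100 trials, 22 channels, 1000 time samples
X = np.random.randn(100, 22, 1000)
y = np.random.randint(0, 4, size=100)

# 20% short-calibration split
(cal_X, cal_y), (test_X, test_y) = short_calibration_split(X, y, frac=0.2)
```

The calibration portion is then simply appended to the multi-source training set before training proceeds as in the zero-calibration setting.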
However, the proposed domain-generalized EEG classification framework still suffers from certain limitations. For example, in the bi-level optimization problem, we use an alternating optimization method to iteratively optimize the network parameters θ and the sample weights α, which incurs additional computational cost. Future work could therefore explore accelerating this optimization.
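The alternating scheme can be illustrated with a toy example: fix α to take a gradient step on θ, then fix θ to update α. This is only a schematic under a weighted least-squares surrogate loss; FDCL's actual decorrelation objective and network are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))   # surrogate features
y = rng.normal(size=64)        # surrogate targets
theta = np.zeros(8)            # stand-in for the network parameters
alpha = np.full(64, 1.0 / 64)  # per-sample weights on the simplex

lr_theta, lr_alpha = 0.1, 0.05
for _ in range(50):
    # Step 1: fix alpha, gradient step on theta for the weighted loss
    resid = X @ theta - y
    theta -= lr_theta * (X.T @ (alpha * resid))
    # Step 2: fix theta, exponentiated-gradient step on alpha
    # (mirror descent keeps alpha positive and normalized)
    resid = X @ theta - y
    alpha *= np.exp(-lr_alpha * resid ** 2)
    alpha /= alpha.sum()
```

Because each outer iteration solves an inner weighted problem, the wall-clock cost grows with the number of alternations, which is the overhead noted above.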
In summary, the experimental results suggest that the proposed EEG data augmentation strategy can increase the diversity of the input EEG data while preserving their category information, which helps the EEG decoding model overcome the overfitting issue and improves its generalizability. Besides, the proposed FDCL improves the classification performance of zero-calibration EEG decoding tasks and also shows great potential for short-calibration EEG decoding tasks.

VI. CONCLUSION
In this paper, we present a novel domain generalization framework for generalizing EEG decoding methods to unseen subjects. It simultaneously exploits the implicit inter-subject category-specific EEG data augmentation regularization, the explicit feature decorrelation regularization and the cross-view consistency learning regularization in a unified framework. The proposed EEG data augmentation method can mitigate the noise caused by mismatch at the boundary between two consecutive EEG segments, preserve the category-specific feature knowledge, and span the data space of multiple source subjects. The feature decorrelation regularization can remove the dependencies between features by learning their weights, enabling the model to establish the true mapping relationship between the invariant features and the corresponding labels. Finally, the cross-view consistency learning scheme is introduced to guide the model to focus on invariant feature representations from different views. These three complementary regularizations are seamlessly integrated into the designed unified DG framework to jointly improve the robustness and generalizability of the model on unseen subjects. We conduct a comprehensive performance evaluation on three EEG datasets. The experimental results indicate that FDCL yields the best classification performance compared with the other DG approaches. Future work includes applying the proposed FDCL framework to other EEG tasks with different neurophysiological patterns and investigating whether DG methods can generalize across different datasets, even those recorded with different EEG systems (e.g., different numbers of channels).