Toward Domain-Free Transformer for Generalized EEG Pre-Training

Electroencephalography (EEG) signals are brain signals acquired non-invasively. Owing to their high portability and practicality, EEG signals have found extensive application in monitoring human physiological states across various domains. In recent years, deep learning methodologies have been explored to decode the intricate information embedded in EEG signals. However, because EEG signals are acquired from humans, collecting the enormous amounts of data required to train deep learning models is difficult. Therefore, previous research has attempted to develop pre-trained models that can yield significant performance improvements through fine-tuning when data are scarce. Nonetheless, existing pre-trained models often struggle with constraints, such as the necessity to operate within datasets of identical configurations or the need to distort the original data to apply the pre-trained model. In this paper, we propose the domain-free transformer, called DFformer, for generalizing the EEG pre-trained model. In addition, we present a pre-trained model based on DFformer that is capable of seamless integration across diverse datasets without necessitating architectural modification or data distortion. The proposed model achieved competitive performance across motor imagery and sleep stage classification datasets. Notably, even when fine-tuned on datasets distinct from those used in the pre-training phase, DFformer demonstrated marked performance enhancements. Hence, we demonstrate the potential of DFformer to overcome the conventional limitations in pre-trained model development, offering robust applicability across a spectrum of domains.


I. INTRODUCTION
Brain-computer interfaces (BCIs) have garnered significant attention due to their potential to bridge human cognition with computational systems. Electroencephalography (EEG) signals, captured by a non-invasive method that records the electrical activity of the brain, are central to BCI technology. The high portability and practicality of EEG acquisition devices have paved the way for advancements in neuroscience, medical diagnostics, and other interdisciplinary domains [1], [2], [3], [4], [5], [6]. However, the inherent complexity and non-linearity of EEG signals pose challenges for extracting meaningful information. Traditional signal processing techniques, such as filtering, the Fourier transform, or cross-correlation, have been widely used for analyzing EEG signals [7], [8], [9], [10]. These techniques still require domain-specific knowledge and extensive manual feature engineering, which can be time-consuming and unsuitable for applications based on EEG signals characterized by high nonlinearity and variability. Moreover, their performance may degrade when confronted with augmented or previously unseen data [10]. In recent years, deep learning has emerged as a promising approach for analyzing EEG signals. Deep learning models have demonstrated significant success in various EEG-based applications, such as sleep staging [11], [12], seizure detection [13], [14], motor imagery (MI) [15], [16], and emotion classification [17], [18]. However, the limited availability of diverse and high-quality data for training these models presents a major challenge when applying deep learning to EEG signal analysis [19]. EEG datasets are typically small and highly imbalanced, which makes it difficult to train deep learning models that generalize effectively to new data.
Leveraging a pre-trained model as a foundation for training is an effective strategy when the amount of data is insufficient [20]. By transferring previously acquired knowledge, these models can converge to a more refined optimal solution [21]. Consequently, several studies have developed pre-trained models for various EEG tasks. He et al. [22] proposed an MLP-Mixer-based neural network and a self-supervised learning algorithm that builds a pre-trained model by training the network to predict the subsequent EEG signals. The authors achieved superior performance with the pre-trained model compared to conventional methods on an MI-based downstream task. Jiang et al. [23] developed a strategy for constructing a pre-trained model based on contrastive learning. They reported the optimal combination of data augmentation methods, which was pivotal for effective contrastive learning, and achieved significant performance in sleep stage classification compared with other methods. Zhang et al. [24] presented an adversarial learning-based self-supervised algorithm for building a pre-trained model. Their strategy revolved around the reconstruction of arbitrarily masked segments within EEG signals. By integrating their proposed algorithm, they enhanced the classification accuracy on emotion recognition datasets.
Although pre-trained models have proved effective when data are scarce, applying them generally to EEG analysis remains challenging owing to the differing characteristics of datasets. Specifically, differences in configuration, such as the number and order of channels or the sampling rate of the EEG electrodes, across datasets pose significant challenges when deploying a pre-trained model [19], [25]. Conventional methods cannot reuse previously trained models when the configuration of the target dataset differs, because the architectures of the pre-trained and target models then diverge. Spherical spline interpolation and channel selection have been employed to circumvent these limitations. Spherical spline interpolation is the fundamental method for addressing this problem: EEG signals at missing electrodes are reconstructed from the signals of the existing electrodes. Channel selection refers to choosing only the electrodes common to both datasets. Kostas et al. [26] strategically chose the 19 electrodes most commonly used across all datasets, and missing data for specific electrodes were set to zero. Wei et al. [27] observed that the best-performing methods typically employed channel selection to mitigate issues arising from variations in electrode types, and they standardized the electrode types by selecting those common across all datasets used. However, spherical spline interpolation can introduce significant errors when the electrode density is low [28]. Furthermore, channel selection cannot utilize all available data, given that it discards information from uncommon electrodes.
In this paper, we propose the domain-free transformer, called DFformer, capable of consistently decoding EEG signals irrespective of the diverse configurations across datasets. We define one domain as one configuration of a dataset. In addition, we append class tokens along each axis to establish a pre-trained model that effectively compresses EEG signals into representative vectors along each axis. By utilizing the class tokens, we can perform autoencoder-based signal reconstruction, a fundamental pre-training technique, to create pre-trained models adaptable to various datasets. We verified the effectiveness of the guidance from the pre-trained model based on our proposed architecture when decoding EEG signals. Furthermore, upon analyzing the trained class tokens, we found that the model predominantly focused on the fundamental information needed to analyze EEG signals. By implementing our proposed model, we effectively address the constraints of conventional methods that struggle to build and apply pre-trained models due to domain differences. By alleviating these limitations, we can develop a unified pre-trained model that applies to different datasets without distortion or loss of information. In addition, the knowledge acquired from pre-training can guide the model toward a more optimized space when applied to other datasets in the downstream step, resulting in enhanced performance compared with models trained entirely from scratch. In summary, the main contributions of this study are as follows:
• We proposed DFformer, a novel architecture adept at decoding EEG signals from various datasets with disparate configurations without modifying the architecture. The source code of our implementation is readily accessible on GitHub.
• The pre-trained model based on DFformer was developed by leveraging an autoencoder paradigm. By using DFformer as an encoder, we could construct the pre-trained model without resorting to the previous approaches for alleviating domain differences.
• We achieved performance improvements when transferring knowledge from the pre-trained model compared with training from scratch. This finding validated that the knowledge acquired by the pre-trained model can guide the decoding of an EEG dataset not utilized during the pre-training phase.

A. Datasets and Preprocessing
For our experiments, we used two prominent MI datasets and two sleep stage classification datasets. These publicly available datasets are frequently used to evaluate various methods in their respective domains [29], [30].
2) BCI Competition IV-2b (BCIC2b): BCIC2b [33] contains EEG data from three electrodes and nine participants at 250 Hz. The participants performed two MI tasks, Left and Right, across five sessions. The initial sessions consisted of 120 trials without visual feedback, whereas the subsequent sessions consisted of 160 trials with feedback. The imagination interval was between 3 and 7 s, and the signals were filtered with a 38 Hz low-pass Butterworth filter [32].
3) Sleep-EDF: Sleep-EDF [34], [35] features the polysomnographic recordings of 20 individuals. EEG signals were acquired at a sampling rate of 100 Hz, and the sleep stages were annotated by professionals. In this study, the N3 and N4 stages were combined into the N3 stage [36], and only data from the Fpz-Cz EEG electrode were used. In addition, we included only the 30-minute segments immediately before and after the sleep periods, given that our primary interest was in sleep. We divided the classification into five stages (WAKE, N1, N2, N3, and REM) for the evaluation. Classification was performed using sets of 30 trials.
4) Sleep Heart Health Study (SHHS): SHHS [37] is a dataset obtained from studies on the connection between sleep-breathing disorders and cardiovascular issues. It consists of overnight EEG recordings from 5,793 participants, originally sampled at 150 Hz and downsampled to 100 Hz for compatibility with Sleep-EDF [11]. The performance of the models was assessed using the same five categories as in the Sleep-EDF dataset. The evaluations used the C4-A1 electrode and grouped 20 trials into a set.

B. DFformer
In previous methods, a significant challenge in developing a pre-trained model for various datasets was the dependency on fixed parameters determined by the configuration of the dataset. This constraint implied that, once trained, the model could not be seamlessly deployed without architectural modifications [19]. Consequently, our primary goal was to design a model that remains applicable despite the different shapes of EEG signals encountered. Since the transformer can handle variably shaped data without modifying the architecture, it was utilized as the base structure.
As shown in Fig. 1, the model consists of four main components: i) the tokenizer compresses information from the high-frequency raw EEG signals into patches; ii) the biaxial information embedding block enriches the compressed EEG data with auxiliary information, including positional encoding and both intra- and inter-channel class tokens; iii) the DFformer blocks extract significant information from the features produced in the preceding stages; and iv) the classification head interprets the features extracted by DFformer, facilitating the prediction of the proper labels for each task. By combining these components, each of which accommodates changes in the shape of the data, we could develop a more robust and adaptable architecture applicable across diverse domains.
1) Tokenizer: Since the raw EEG signals (X ∈ R^{B×C×T_raw}, where B indicates the batch size, and C and T_raw are the number of channels and the number of time points in the raw EEG signals, respectively) were acquired at a high sampling rate, we converted the signals into patches by applying tokenization through a module similar to wav2vec [38]. Previous tokenization approaches commonly involve a spatial convolution layer whose kernel shape depends on the number of channels in the EEG signals [26], [39]. Such a dependency necessitates architectural modification each time the model is trained on a different dataset, causing complexity and constraints. Therefore, to train the model independently of channel configurations, the spatial convolution used in previous approaches was omitted. Instead, tokenization was performed using only a 1-D convolutional neural network (CNN) in the temporal direction after reshaping the EEG signals by aggregating the batch and channel axes. The tokenization module comprised three embedding layers with kernel sizes of [125, 8, 4] and strides of [1, 4, 2]. Layer normalization and GELU activation functions were applied to each layer. After the tokenization module (f : R^{BC×T_raw} → R^{BC×T×D}, where D indicates the number of embedded dimensions and T is the length of the downsampled EEG signals), we split the data back along the batch and channel axes, which had been combined. Therefore, we could generate the downsampled patches (H = f(X) ∈ R^{B×C×T×D}) from the raw EEG signals.
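The channel-independent tokenization described above can be sketched in PyTorch. This is our illustrative reading, not the authors' released code: the module and variable names are ours, and the embedding dimension defaults to D = 64 as fixed later in the evaluation settings.

```python
import torch
import torch.nn as nn


class Tokenizer(nn.Module):
    """Channel-independent tokenizer: temporal 1-D convolutions only,
    so the module is agnostic to the number of EEG channels."""

    def __init__(self, dim=64, kernels=(125, 8, 4), strides=(1, 4, 2)):
        super().__init__()
        chans = [1] + [dim] * len(kernels)
        self.convs = nn.ModuleList(
            nn.Conv1d(chans[i], chans[i + 1], k, s)
            for i, (k, s) in enumerate(zip(kernels, strides))
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in kernels)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, C, T_raw)
        b, c, t_raw = x.shape
        h = x.reshape(b * c, 1, t_raw)           # merge batch and channel axes
        for conv, norm in zip(self.convs, self.norms):
            h = conv(h)                          # (BC, D, T')
            h = self.act(norm(h.transpose(1, 2))).transpose(1, 2)
        h = h.transpose(1, 2)                    # (BC, T, D)
        return h.reshape(b, c, -1, h.shape[-1])  # split the axes back: (B, C, T, D)
```

For example, a 4-s trial at 250 Hz (T_raw = 1000) passes through the three stride stages (1000 → 876 → 218 → 108), yielding T = 108 patches per channel.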
2) Biaxial Information Embedding Block: EEG signals contain intra-channel (temporal) features, such as variations in frequency over time, and inter-channel features, including differences in activity between channels. However, these features are not uniformly represented across all EEG tasks; rather, the dominant features may differ according to the specific task. For example, in sleep stage classification data, the target frequency band depends on the sleep stage [40]. Conversely, in motor imagery classification data, differences in activity between channels at a certain time, such as event-related desynchronization/synchronization, are the representative features [41]. Since EEG signals contain significant features that differ intra- and inter-channel-wise, we designed the biaxial information embedding block to encode significant information for each feature type by adding positional encoding and class tokens along the intra- and inter-channel axes, respectively.
EEG signals are time-series data; therefore, we assumed that information on the temporal order is significant for comprehending the relationships between channels. After tokenizing the EEG signals into patches, the channel and batch axes were merged. We added intra-channel sinusoidal positional encoding, POS_intra ∈ R^{BC×(T+1)×D}, along the temporal axis and concatenated BC class tokens (intra-channel class tokens), CLS_intra ∈ R^{BC×D}, at the beginning of the intra-channel axis. The temporally preprocessed data have the dimensionality BC × (T+1) × D. After the intra-channel positional encoding and class tokens were added, multi-head attention (MHA) and a feed-forward network were applied to the channel-wise-divided data to encode significant intra-channel information. Through this process, the intra-channel class tokens containing the temporal features within each channel were generated. After the intra-channel information was encoded for each channel, we transformed the shape of the data to extract the inter-channel information for each patch. We applied inter-channel sinusoidal positional encoding, POS_inter ∈ R^{B(T+1)×(C+1)×D}, along the channel axis and attached B(T+1) class tokens (inter-channel class tokens), CLS_inter ∈ R^{B(T+1)×D}, at the beginning of the temporally segmented data. The shape of the inter-channel-wise-preprocessed feature maps is B(T+1) × (C+1) × D. By applying inter-channel-wise multi-head attention and a feed-forward network to the temporally divided data, we extracted significant information between channels within a specific time. In this manner, every inter-channel class token captured information about the relationships between the channels within each patch. The final shape of the preprocessed data after the biaxial information embedding layer is B × (C+1) × (T+1) × D.
Therefore, we could effectively encode significant features from the downsampled EEG patches by applying the biaxial information embedding block using two different types of class tokens after the tokenization.
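The shape bookkeeping above can be sketched in PyTorch as follows. This is a simplified illustration under our own assumptions: the ordering of the token concatenation and positional encoding, the single attention layer per axis, and the feed-forward widths are ours, not a reproduction of the authors' implementation.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_pe(length, dim):
    """Standard sinusoidal positional encoding of shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class BiaxialEmbedding(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cls_intra = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_inter = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn_intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff_intra = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ff_inter = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h):                                        # h: (B, C, T, D)
        b, c, t, d = h.shape
        # Intra-channel axis: merge batch/channel, prepend CLS_intra, add PE over T+1.
        x = h.reshape(b * c, t, d)
        x = torch.cat([self.cls_intra.expand(b * c, 1, d), x], dim=1)
        x = x + sinusoidal_pe(t + 1, d)
        x = x + self.attn_intra(x, x, x)[0]                      # temporal attention
        x = x + self.ff_intra(x)
        # Inter-channel axis: regroup per patch, prepend CLS_inter, add PE over C+1.
        x = x.reshape(b, c, t + 1, d).permute(0, 2, 1, 3).reshape(b * (t + 1), c, d)
        x = torch.cat([self.cls_inter.expand(b * (t + 1), 1, d), x], dim=1)
        x = x + sinusoidal_pe(c + 1, d)
        x = x + self.attn_inter(x, x, x)[0]                      # channel attention
        x = x + self.ff_inter(x)
        return x.reshape(b, t + 1, c + 1, d).permute(0, 2, 1, 3)  # (B, C+1, T+1, D)
```

Note how the output gains one position on each axis, matching the B × (C+1) × (T+1) × D shape stated above.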
3) DFformer Block: The DFformer block was applied to enhance the generalization ability by extracting high-level features from the data preprocessed by the biaxial information embedding block.
4) Classification Head: Although the shape of the input data is B × C × T_raw for all data, the format required for prediction varies with the domain. For instance, in MI signal classification, the input data are multi-channel EEG signals; however, they contain the information of only one trial of a single class. Consequently, we aim to derive the predicted probability for a single class to predict the label. Conversely, for sleep stage classification, the input data consist of several sequential single-channel EEG signals. Therefore, the number of predicted values should match the number of combined channels. In these tasks, customized classification heads are required for each domain after extracting the features from the backbone model. Hence, we applied a simple task-specific classification head for each assignment to obtain the proper results, as shown in Table I.

C. Strategy for Pre-Training
Since the main purpose of DFformer is to serve as a pre-trained model unaffected by various domains, we developed a pre-trained model by performing a reconstruction task based on an autoencoder, one of the simplest self-supervised learning methods, to check the effectiveness of pre-training [42]. We assumed that, by utilizing EEG signals directly as input and subsequently encoding and reconstructing them, the fundamental characteristics of the EEG could be learned without being biased by the distinctive attributes of any individual dataset.
We utilized the intra- and inter-channel class tokens to reconstruct the original EEG signals. Since each class token contained intra- and inter-channel information, we generated the basis matrix B by multiplying these class tokens, as shown in Fig. 2. A decoder with a simple architecture, containing three convolution modules, was used to reconstruct the signals from the basis matrix. These modules reverse the tokenization process. Each module contains a transpose convolution layer to expand the shape of the features and a convolution layer to fuse the information derived from the transpose convolution. The GELU activation function was used for each layer; however, since the input EEG signals were normalized between 0 and 1, a sigmoid activation function was used for the final layer. Within each module, both the transpose convolution and convolution layers used kernel sizes of [4, 8, 125]. Only the transpose convolution, which was employed to upscale the data, had stride sizes of [2, 4, 1]. Additionally, padding was applied to the convolution layer so that the data size remained consistent before and after processing. To ensure stability during training, we used the mean absolute error as the loss function. Channel permutation was conducted to reduce bias from channel-order information.
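One plausible reading of this reconstruction pipeline is sketched below. Two points are our assumptions rather than the paper's specification: the basis matrix is formed as a broadcast element-wise product of the two token sets, and the fusion convolutions use kernel 3 for simplicity (the paper matches them to the transpose-convolution kernels).

```python
import torch
import torch.nn as nn


def basis_matrix(cls_intra, cls_inter):
    """cls_intra: (B, C, D) intra-channel tokens; cls_inter: (B, T, D)
    inter-channel tokens. A broadcast element-wise product -- our reading
    of 'multiplying the class tokens' -- yields a (B, C, T, D) basis."""
    return cls_intra[:, :, None, :] * cls_inter[:, None, :, :]


class Decoder(nn.Module):
    """Reverse of the tokenizer: transpose convolutions with kernels
    [4, 8, 125] and strides [2, 4, 1]; sigmoid output because the
    inputs were min-max normalized to [0, 1]."""

    def __init__(self, dim=64):
        super().__init__()

        def block(k, s):
            return nn.Sequential(
                nn.ConvTranspose1d(dim, dim, k, s),   # upsample in time
                nn.GELU(),
                nn.Conv1d(dim, dim, 3, padding=1),    # fusion conv (kernel 3, size-preserving)
                nn.GELU(),
            )

        self.up = nn.Sequential(
            block(4, 2),
            block(8, 4),
            nn.ConvTranspose1d(dim, 1, 125, 1),       # back to one raw-signal channel
            nn.Sigmoid(),
        )

    def forward(self, basis):                          # basis: (B, C, T, D)
        b, c, t, d = basis.shape
        h = basis.reshape(b * c, t, d).transpose(1, 2)  # (BC, D, T)
        return self.up(h).reshape(b, c, -1)             # (B, C, T_raw)
```

With T = 108 patches, the three stages expand the sequence back to 1000 time points, exactly inverting the tokenizer strides; training would then minimize `nn.L1Loss()` (mean absolute error) against the normalized input.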

D. Evaluation Settings
We conducted a quantitative evaluation of the decoding performance of DFformer and assessed the efficacy of the pre-trained DFformer. The proposed model was trained from scratch to assess its performance. Furthermore, to evaluate the efficacy of the pre-trained model, we compared the results obtained by fine-tuning on the target dataset; the evaluation methods were selected based on this dataset. Leave-one-subject-out cross-validation was employed for the MI and Sleep-EDF datasets, which have a limited number of participants. In contrast, for SHHS, which has a sufficient number of participants, we conducted 20-fold cross-validation [30], [43]. Performance indicators were carefully selected to ensure meaningful comparisons across the experiments. For the MI datasets, the evaluation metrics included the average classification accuracy (Acc.), kappa value (Kappa), and F1-score. The average classification accuracy is the most common indicator for evaluating classification architectures. Since the kappa value accounts for the off-diagonal components of the confusion matrix, it is utilized to supplement the accuracy [44]. Although the F1-score is usually used when the class labels are imbalanced, we applied this indicator to evaluate the performance of the architecture from various perspectives. For the sleep stage classification datasets, the average classification accuracy, kappa value, and F1-score were also utilized. Moreover, to address class imbalance in the sleep stage classification datasets, the per-class F1-score was employed to compare performance across different sleep stages [43], [45]. All results are the average values obtained from ten repetitions with different seeds.
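The three overall indicators plus the per-class F1-score can all be computed with scikit-learn; the snippet below is an illustrative sketch of ours, not the authors' evaluation code.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score


def evaluate(y_true, y_pred):
    """Overall accuracy, Cohen's kappa, macro F1, and per-class F1."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),   # corrects for chance agreement
        "f1": f1_score(y_true, y_pred, average="macro"),
        "f1_per_class": f1_score(y_true, y_pred, average=None),
    }
```

For instance, with `y_true = [0, 0, 1, 1]` and `y_pred = [0, 0, 1, 0]`, accuracy is 0.75 while kappa drops to 0.5, illustrating how kappa penalizes agreement expected by chance.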
For consistent analysis, when pre-training DFformer, we fixed the number of blocks in DFformer at three, including the embedding block, the number of heads in MHA at four, and the dimension at 64. Pre-training was executed over 100 epochs for each dataset with a 10-epoch warm-up, adopting a learning rate of 3e-4 for the MI datasets and 3e-5 for the sleep stage classification datasets. During training from scratch and fine-tuning, the learning rates for the MI and sleep stage classification datasets were set to 3e-3 and 3e-4, respectively. To compare the effects of the pre-trained model, the training epochs were 100 and 30 for the MI and sleep stage classification datasets, respectively. Throughout the optimization phase, the AdamW optimizer was used with a cosine learning-rate scheduler.
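A minimal sketch of this optimizer setup, assuming a linear warm-up into the cosine decay (the exact warm-up shape is not specified in the text):

```python
import math

import torch


def build_optimizer(params, lr=3e-4, warmup_epochs=10, total_epochs=100):
    """AdamW with linear warm-up followed by a cosine learning-rate decay."""
    opt = torch.optim.AdamW(params, lr=lr)

    def factor(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs             # linear warm-up
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to ~0

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, factor)
```

The learning rate ramps to its peak of 3e-4 over the first 10 epochs and then decays smoothly toward zero by epoch 100.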
Various preprocessing and data augmentation methods can help to train a robust model. For example, He et al. [50] proposed Euclidean-space alignment (EA), which alleviates issues arising from covariance shift between data; by conducting EA, they effectively extracted significant features from EEG signals. However, since we aimed to verify the effect derived from our proposed model, only min-max normalization was employed during the training and evaluation phases, without other preprocessing or data augmentation techniques. By minimizing the use of these techniques, we could evaluate the effect of DFformer without interference from other variables.
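The min-max normalization step can be expressed in a few lines; in this sketch, normalizing each channel over its time axis is our assumption, as the text does not state the axis.

```python
import numpy as np


def minmax_normalize(x, axis=-1, eps=1e-8):
    """Scale values into [0, 1] along `axis` (here: per channel, over time).
    `eps` guards against division by zero on flat signals."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo + eps)
```

This [0, 1] range is also why the decoder described above ends in a sigmoid activation.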

A. Comparison Among the Conventional Models
Although the main goal of designing DFformer was to ensure that it works sufficiently well across domains, it is also critical that DFformer performs comparably to conventional architectures. In Tables II and III, we evaluated the performance of DFformer against conventional models trained from scratch on the MI and sleep stage classification datasets. In Table II, we compared DFformer with three commonly used models for decoding MI-based EEG signals: DeepConvNet [46], EEGNet [47], and ShallowConvNet [32]. DFformer achieved the highest classification accuracy, kappa value, and F1-score of 0.5841, 0.4455, and 0.5837, respectively, on BCIC2a. Although DeepConvNet achieved the highest performance on all indicators on BCIC2b, DFformer achieved the second-highest performance on all indicators, behind DeepConvNet. The classification accuracy, kappa value, and F1-score of DFformer on BCIC2b were 0.7618, 0.5208, and 0.7552, respectively.
Table III presents the comparison between DFformer and three architectures used in sleep stage classification: DeepSleepNet [48], RobustSleepNet [45], and U-Sleep [49]. On Sleep-EDF, DFformer achieved the highest overall classification accuracy, kappa value, and F1-score of 0.8370, 0.7778, and 0.7809, respectively. DFformer accomplished the best performance in the WAKE, N2, and REM stages, with per-class F1-scores of 0.9087, 0.8657, and 0.8416, respectively. In the N1 and N3 stages, U-Sleep achieved the best per-class F1-scores of 0.4470 and 0.8480, respectively. On SHHS, DFformer showed the second-highest classification accuracy, kappa value, and F1-score of 0.8389, 0.7739, and 0.7620, respectively. In addition, it achieved the highest per-class F1-scores of 0.9104 and 0.4420 in the WAKE and N1 stages, respectively. On this dataset, U-Sleep achieved the best performance on the overall metrics and on the per-class F1-scores of the N2, N3, and REM stages: its classification accuracy, kappa value, and F1-score were 0.8410, 0.7800, and 0.7690, respectively, and its per-class F1-scores for the N2, N3, and REM stages were 0.8580, 0.8330, and 0.8600, respectively. Although DFformer did not achieve the highest performance on every metric for all datasets, it performed competitively with the conventional architectures, regardless of the domain.

B. Evaluation of the Pre-Trained Model's Efficacy
To assess the efficacy of using a pre-trained model derived from the autoencoder-based signal reconstruction task, we evaluated the performance of the fine-tuned models on each of the four previously utilized datasets. As shown in Fig. 3, the data in black represent the baseline performance of DFformer trained from scratch on the selected baseline dataset. Furthermore, the performance differences are presented as ratios on radar charts to allow an intuitive comparison across the various indicators. For the MI datasets, in addition to the three indicators used in the previous evaluation step, we compared the per-class F1-scores, similar to the sleep stage classification indicators. For the sleep stage classification datasets, we evaluated the effectiveness of the pre-trained model using the same indicators as in the previous assessment.
As shown in Fig. 3(a), when BCIC2a was used as the baseline dataset, performance mostly improved when fine-tuning DFformer from models pre-trained on the other datasets. In particular, using the model pre-trained on BCIC2b, one of the MI datasets, there was a significant performance improvement, with an increase of 2.50 % in the kappa value and 1.46 % in the F1-score. Moreover, when the model pre-trained on SHHS was applied, the accuracy improved by 2.76 %. As shown in Fig. 3(b), when BCIC2b was chosen as the baseline dataset, the scores on all indicators were enhanced, regardless of the dataset used for pre-training. Notably, with the model pre-trained on BCIC2a, the performance was significantly improved, with increases of 2.77 % in accuracy, 4.67 % in the kappa value, and 1.46 % in the F1-score. Fig. 3(c) shows the results obtained when Sleep-EDF was chosen as the baseline dataset. All overall metrics improved when utilizing models pre-trained on the other datasets. In particular, the per-class F1-score for the N1 stage showed an average gain of 4.81 % when the pre-trained models were applied. However, as shown in Fig. 3(d), for SHHS, only when utilizing the model pre-trained on Sleep-EDF, another sleep stage classification dataset, did the performance align closely with that of DFformer trained from scratch; in the other cases, the performance decreased compared with the baseline.
Based on these results, we confirmed that fine-tuning with models pre-trained on datasets of similar tasks generally enhanced the performance. Impressively, we found that utilizing the pre-trained model was effective even when the channel types, channel numbers, and sampling rates differed. This suggests that the information embedded in the pre-trained models could guide DFformer toward better decoding outcomes, capturing foundational characteristics of EEG signals without being heavily biased by dataset-specific features. We observed considerable performance enhancement when the MI datasets, which have limited data, were fine-tuned using pre-trained models. In contrast, for the sleep stage classification datasets with enormous amounts of data, utilizing models pre-trained on limited data could have a negative effect, owing to the discrepancy in distribution between the datasets.

IV. DISCUSSION
We confirmed the enhanced decoding performance when DFformer was fine-tuned from models pre-trained on various domains. To better understand DFformer, we analyzed the features learned during the pre-training phase, which contributed to performance improvement during fine-tuning. Additionally, since the intra-channel class tokens were intended to capture distinctive features within a single channel and the inter-channel class tokens were intended to extract the correlations between channels at specific times, we investigated whether both types of class tokens were trained as intended.

A. Generalization Performance of the Pre-Trained Model
During the pre-training phase, we intended the model to learn the fundamental features of EEG signals without being biased by the unique characteristics of the datasets. We computed the cosine similarity between the intra-channel class tokens of BCIC2a encoded by two DFformers, pre-trained on Sleep-EDF and SHHS, respectively. Connectivity maps were created by displaying the 100 highest similarity scores for each representation. In addition, we plotted the grand-average data from all BCIC2a participants to observe general tendencies.
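The connectivity-map construction reduces to a cosine-similarity matrix followed by a top-k selection over channel pairs. A NumPy sketch under our own naming (k = 100 matches the paper; everything else is illustrative):

```python
import numpy as np


def top_k_channel_pairs(tokens, k=100):
    """tokens: (n_channels, dim) array of intra-channel class tokens
    (e.g. grand-averaged over participants). Returns the k channel pairs
    with the highest cosine similarity, as (i, j, similarity) tuples."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T                           # cosine-similarity matrix
    rows, cols = np.triu_indices_from(sim, k=1)   # unique off-diagonal pairs
    order = np.argsort(sim[rows, cols])[::-1][:k]
    return [(int(rows[i]), int(cols[i]), float(sim[rows[i], cols[i]])) for i in order]
```

Drawing an edge between the channels of each returned pair on a scalp layout yields a connectivity map like those in Fig. 4.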
As shown in Fig. 4, although BCIC2a was inferred by DFformer pre-trained only on the sleep stage classification datasets, we found that DFformer could cluster channels located in the same brain regions associated with MI features. As shown in Fig. 4(a), the model pre-trained on Sleep-EDF showed high correlations within channels grouped into three distinct regions: the central, fronto-central, and centro-parietal regions. Meanwhile, with the model pre-trained on SHHS, the channels were mainly clustered in two regions: the central and fronto-central regions, as shown in Fig. 4(b). Notably, when using the model pre-trained on SHHS, which achieved the highest accuracy among the models fine-tuned on BCIC2a, we observed that the channels in the central region, recognized as an important region for classifying MI data [51], showed the highest correlation. Based on these results, even though the model was pre-trained on a sleep stage classification dataset unrelated to MI, the pre-trained DFformer could capture significant MI-related features. Therefore, since DFformer could learn the fundamental

B. Effectiveness of Applying the Pre-Trained Model
We conducted a qualitative analysis to understand the effectiveness of fine-tuning DFformer from a pre-trained model. We used DFformer fine-tuned on Sleep-EDF from the model pre-trained on BCIC2a, which showed the most significant improvement in the per-class F1-score for N1, the most challenging sleep stage to classify. We assessed the similarity between intra-channel class tokens within each block of DFformer by generating a similarity matrix using cosine similarity. For this comparison, we used three models: DFformer pre-trained on BCIC2a, DFformer fine-tuned on Sleep-EDF from the pre-trained model, and DFformer trained from scratch on Sleep-EDF. Fig. 5 shows the similarity matrices generated by the three models for two randomly selected samples from Sleep-EDF. We found that, as the depth of the model increased, the difference in similarity between intra-channel class tokens from different sleep stages became more apparent. In addition, high similarity was observed between identical sleep stages, despite inferring Sleep-EDF data with the model pre-trained on the MI dataset, as shown in the first row of Fig. 5(a) and 5(b). We also found a significant difference between the similarity matrices of the fine-tuned model and those of the model trained from scratch. When the sleep stage patterns shifted, DFformer trained from scratch struggled to identify the transition points between different sleep stages. By contrast, the fine-tuned DFformer recognized these moments of transition more accurately. Moreover, as shown in the second row of Fig. 5(a), the fine-tuned DFformer was more effective in identifying temporary changes in sleep stage than DFformer trained from scratch. These results suggest that, when DFformer is fine-tuned from the pre-trained model, training begins from a more refined starting point than when training from scratch. Thus, we confirmed that utilizing the pre-trained model is significantly important for enhancing the performance of DFformer.
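The per-block analysis above reduces to pairwise cosine similarity between the intra-channel class tokens collected from one DFformer block. A minimal sketch of that computation (the toy token array and its shape are illustrative assumptions, not the paper's actual tensors):

```python
import numpy as np

def cosine_similarity_matrix(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors.

    tokens: (n_tokens, dim) array, e.g. the intra-channel class
    tokens collected from one DFformer block.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

# Toy example: two nearly parallel tokens and one orthogonal token.
toks = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sim = cosine_similarity_matrix(toks)
```

Tokens from the same sleep stage would show blocks of high similarity, as in Fig. 5, while orthogonal tokens score near zero.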

C. Analysis of Intra- and Inter-Channel Class Tokens
The intra- and inter-channel class tokens have specific roles in decoding EEG signals, and we evaluated whether these tokens were trained as intended. We used DFformer fine-tuned on BCIC2a from the model pre-trained on SHHS. With this model, we evaluated the accurately classified samples from BCIC2a and investigated the similarity matrix between the intra- and inter-channel class tokens. Using this similarity matrix, we identified the pivotal channels at specific moments.
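The channel-by-patch similarity matrix described here can be sketched as the cosine similarity between each intra-channel token and each inter-channel token; the toy inputs below are illustrative assumptions, not the model's actual token values:

```python
import numpy as np

def cross_token_similarity(intra: np.ndarray, inter: np.ndarray) -> np.ndarray:
    """Cosine similarity between intra-channel class tokens (one per
    channel) and inter-channel class tokens (one per temporal patch),
    yielding an (n_channels, n_patches) matrix.

    intra: (n_channels, dim), inter: (n_patches, dim)
    """
    a = intra / np.linalg.norm(intra, axis=1, keepdims=True)
    b = inter / np.linalg.norm(inter, axis=1, keepdims=True)
    return a @ b.T

# Toy example: each channel token aligned with a distinct patch token.
cross = cross_token_similarity(np.eye(2), np.array([[2.0, 0.0], [0.0, 3.0]]))
```

High entries in a row then flag the moments at which that channel is pivotal.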
Fig. 6(a) shows the difference in the similarity matrix between the entire data and class-specific data. To identify the feature differences between classes relative to the entire data, we computed the difference between the average similarity matrix for all data and that for each class. Fig. 6(a)-i), ii), iii), and iv) represent the results for the Left, Right, Foot, and Tongue classes in BCIC2a, respectively. The greatest differences were found at C4 for Left, C3 for Right, Cz for Foot, and POz for Tongue. These observations agree with established neurophysiological knowledge. Specifically, when imagining movements of the right and left hands, the left and right areas of the motor cortex are activated, respectively, owing to the crossing of motor nerve fibers known as the pyramidal decussation [52]. Additionally, the central region of the motor cortex is stimulated when imagining foot movements [53]. For imagined tongue movement, the parietal lobe exhibited apparent differences compared with the overall data, given the close connection between tongue movement and speech processes [54]. Furthermore, significant differences were observed in the channels located at both ends of the motor cortex, C5 and C6, which are associated with tongue movement [55].
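The class-versus-grand-average contrast described above amounts to subtracting the mean similarity matrix over all trials from the mean over one class's trials. A minimal sketch (shapes and synthetic values are assumptions for illustration):

```python
import numpy as np

def class_contrast(sim_stack: np.ndarray, labels: np.ndarray,
                   cls: int) -> np.ndarray:
    """Average similarity matrix of one class minus the grand average
    over all trials, highlighting class-specific channel interactions.

    sim_stack: (n_trials, n_channels, n_channels) per-trial matrices
    labels:    (n_trials,) integer class labels
    """
    grand = sim_stack.mean(axis=0)
    per_class = sim_stack[labels == cls].mean(axis=0)
    return per_class - grand

# Toy check: trials of class 1 have uniformly higher similarity.
sims = np.zeros((4, 2, 2))
sims[2:] = 1.0
labels = np.array([0, 0, 1, 1])
diff = class_contrast(sims, labels, cls=1)
```

Channels with the largest positive entries (C4 for Left, C3 for Right, and so on in Fig. 6(a)) are the ones whose interactions most distinguish that class.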
To compare the difference between the entire data and class-specific single-trial data, we used the data from the Left and Right classes, whose target activation regions are clearly different. Fig. 6(b)-i) and 6(b)-ii) present the results for Left and Right, respectively. In Fig. 6(b), the first row of each figure represents the disparity between the average similarity matrix for the entire data and that of the individual trial. The second row of each figure shows the topographies of the EEG signals at the time points indicated by the black boxes in the first row. Based on these topographies, we investigated the actual brainwave patterns corresponding to these differences in similarity. The black boxes indicate the moments with the two greatest and two smallest differences in the average similarity score along the temporal axis. Before visualizing the topographies, we applied average re-referencing. We verified that the topographies for the moments with significant differences from the overall data contained the neurophysiological characteristics corresponding to each class, such as the activation of the right motor cortex during left-hand imagery and of the left motor cortex during right-hand imagery [52]. In contrast, for moments with low similarity differences compared with the entire data, the topographies contained features that were less correlated with each class.
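The average re-referencing step mentioned above subtracts the instantaneous mean over all channels from every channel (the common average reference). A minimal sketch, with a toy two-channel signal as an illustrative assumption:

```python
import numpy as np

def average_rereference(eeg: np.ndarray) -> np.ndarray:
    """Common average reference: subtract the instantaneous mean
    across channels from every channel at each sample.

    eeg: (n_channels, n_samples) array.
    """
    return eeg - eeg.mean(axis=0, keepdims=True)

x = np.array([[1.0, 3.0],
              [3.0, 5.0]])
y = average_rereference(x)
```

After re-referencing, the channel mean at every time point is zero, which removes the common offset before plotting topographies.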
From the analysis presented in Fig. 6(a), it is evident that the intra-channel class tokens accurately capture the dominant intra-channel attributes of the data. Meanwhile, the analysis in Fig. 6(b) reveals that the inter-channel class tokens incorporate the inter-channel relationships at specific time intervals. Thus, we confirmed that DFformer was trained in alignment with its intended design.

V. CONCLUSION
In this paper, we proposed the domain-free transformer, DFformer, which can be utilized on datasets with various configurations without modifying the architecture. Although research on developing pre-trained models has been conducted in the EEG domain, issues remain in developing and applying pre-trained models in previous research. Since the configurations of the datasets differ, applying a pre-trained model to other datasets without modifying the architecture is difficult. Unifying the configuration of datasets through channel selection and interpolation raises another problem of distorting the original data. To alleviate this issue, we designed DFformer, which can be applied to various datasets without architectural modification, by separating the encoding of intra- and inter-channel information. In addition, based on the intra- and inter-channel class tokens, we could develop the pre-trained model by conducting an autoencoder-based reconstruction task without distorting the original EEG signals. DFformer achieved classification accuracies of 0.5841 and 0.7618 for BCIC2a and BCIC2b, respectively. For sleep stage classification, DFformer achieved overall F1-scores of 0.7809 and 0.7602 on Sleep-EDF and SHHS, respectively. Therefore, we confirmed that DFformer achieved competitive decoding performance compared with conventional methods, regardless of the domain. In addition, we verified the performance improvement when DFformer was fine-tuned from the pre-trained model. We also demonstrated the benefit of fine-tuning from the pre-trained model over training from scratch by analyzing the class tokens. As shown in the above results, DFformer showed significant performance regardless of the datasets used, without modifying the architecture. Furthermore, we could successfully transfer information between datasets from different domains by developing the pre-trained model based on DFformer. Hence, we could alleviate the issues of the conventional approach to developing pre-trained models. However, limitations remain in our approach. Although the class token, marked by the dotted line in Fig. 1, could be utilized as the representative token of the single-trial EEG data, it was difficult to optimize. In addition, when fine-tuning on SHHS using the pre-trained model developed from the autoencoder-based reconstruction task, we achieved a relatively low performance improvement compared with the other datasets. In the future, we will develop an improved self-supervised method based on DFformer to create a pre-trained model. By applying the refined self-supervised method, we will develop more optimized pre-trained models that could improve performance on a wider variety of datasets. Hence, we will construct a generalized model that could serve as a foundation model for decoding EEG data.

Fig. 1. The overall architecture of DFformer is presented in the upper row, which contains the tokenizer, biaxial information embedding layer, DFformer block, and classification head. The detailed process of the biaxial information embedding block is visualized in the lower row. The green and blue tokens indicate intra- and inter-channel class tokens (CLS), respectively.

Fig. 2. The overall flow of the signal reconstruction task and visualization of the process for generating the basis matrix based on intra- and inter-channel class tokens.

Fig. 3. Radar chart comparing the performance of models fine-tuned from the pre-trained models across various datasets with that of models trained from scratch. The black line represents the baseline performance of the model trained from scratch. The other colors indicate the performance of models fine-tuned from each dataset to the baseline dataset. The baseline datasets are (a) BCIC2a, (b) BCIC2b, (c) Sleep-EDF, and (d) SHHS, respectively.

Fig. 5. Visualization of the similarity matrices between each intra-channel class token in each block. (a) and (b) represent randomly sampled data from Sleep-EDF. The first row indicates the similarity matrices derived from DFformer pre-trained on BCIC2a. The second row presents the similarity matrices from DFformer fine-tuned on Sleep-EDF, starting from the model pre-trained on BCIC2a. The third row shows the similarity matrices of DFformer trained entirely from scratch on Sleep-EDF. The x- and y-axes represent the labels for each channel in the data. 1, 2, 3, and 4 indicate N1, N2, N3, and REM, respectively.

Fig. 6. (a) The grand-average difference between the similarity matrices for the entire BCIC2a dataset and those of its individual classes. These matrices are derived from the cosine similarity between intra- and inter-channel class tokens in DFformer pre-trained on SHHS. The class of each matrix is i) Left, ii) Right, iii) Foot, and iv) Tongue, respectively. (b) The first and third rows indicate the difference between the similarity matrix of randomly sampled data from BCIC2a and the average matrix for all data. The second and fourth rows show the topographies associated with the two highest and two lowest sums of differences along the temporal axis of the similarity matrix. The class of each matrix is i) Left and ii) Right, respectively. The x- and y-axes indicate the list of channels and the temporally tokenized patches, respectively.

TABLE I
THE ARCHITECTURE, PARAMETERS, AND OUTPUT SHAPE OF THE CLASSIFICATION HEAD FOR EACH TASK

embedding block. The structure of DFformer is similar to that of the biaxial information embedding block, except for the positional encoding and class tokens added in the embedding block. This block employs MHA and feed-forward networks along both the intra- and inter-channel axes. By applying DFformer blocks, we could design a flexible architecture that allows the depth of the model to be changed dynamically. Therefore, we could effectively fuse intra- and inter-channel information to extract significant features from the EEG data as the depth of the model increased through this block structure. The preprocessed features without class tokens derived from the DFformer blocks served as input for the classification head. Furthermore, the tokenizer, biaxial information embedding block, and DFformer blocks were utilized during the pre-training phase.
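The alternating intra- and inter-channel attention described above can be sketched with standard multi-head attention by reshaping the channel and patch axes in turn, so that no fixed channel count is baked into the weights. The layer sizes, normalization placement, and head count below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BiaxialBlock(nn.Module):
    """Sketch of one biaxial attention block.

    Input x: (batch, n_channels, n_patches, dim). Intra-channel
    attention mixes patches within each channel; inter-channel
    attention mixes channels within each patch.
    """
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, p, d = x.shape
        # Intra-channel: fold channels into the batch, attend over patches.
        h = x.reshape(b * c, p, d)
        q = self.n1(h)
        h = h + self.intra(q, q, q)[0]
        # Inter-channel: fold patches into the batch, attend over channels.
        h = h.reshape(b, c, p, d).transpose(1, 2).reshape(b * p, c, d)
        q = self.n2(h)
        h = h + self.inter(q, q, q)[0]
        h = h.reshape(b, p, c, d).transpose(1, 2)
        return h + self.ffn(self.n3(h))

blk = BiaxialBlock(dim=8, heads=2)
y = blk(torch.randn(2, 3, 5, 8))  # (batch, channels, patches, dim)
```

Because attention is permutation-based along each folded axis, the same weights apply whether a dataset has 3 channels or 22, which is the property that lets the architecture stay unchanged across datasets.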

TABLE II
COMPARISON OF PERFORMANCES FOR DECODING EEG SIGNALS IN THE BCIC2a AND BCIC2b DATASETS AMONG THE CONVENTIONAL MI DECODING METHODS AND THE PROPOSED METHOD

TABLE III
COMPARISON OF PERFORMANCES FOR DECODING EEG SIGNALS IN THE SLEEP-EDF AND SHHS DATASETS AMONG THE CONVENTIONAL SLEEP STAGE CLASSIFICATION METHODS AND THE PROPOSED METHOD