Multi-Scale Masked Autoencoders for Cross-Session Emotion Recognition

Affective brain-computer interfaces (aBCIs) have found widespread application, with remarkable advances in using electroencephalogram (EEG) technology for emotion recognition. However, the time-consuming annotation process, inherent individual differences, the non-stationary nature of EEG signals, and noise artifacts introduced during data collection pose formidable challenges to developing subject-specific cross-session emotion recognition models. To address these challenges simultaneously, we propose a unified pre-training framework based on multi-scale masked autoencoders (MSMAE), which utilizes large-scale unlabeled EEG signals from multiple subjects and sessions to extract noise-robust, subject-invariant, and temporal-invariant features. We then fine-tune the resulting generalized features with only a small amount of labeled data from a specific subject for personalization, enabling cross-session emotion recognition. Our framework emphasizes: 1) multi-scale representation to capture diverse aspects of EEG signals and obtain comprehensive information; 2) an improved masking mechanism for robust channel-level representation learning, addressing missing-channel issues while preserving inter-channel relationships; and 3) invariance learning for regional correlations in spatial-level representation, minimizing inter-subject and inter-session variance. With these designs, the proposed MSMAE exhibits a remarkable ability to decode emotional states from a different session of EEG data during the testing phase. Extensive experiments on two publicly available datasets, SEED and SEED-IV, demonstrate that MSMAE consistently achieves stable results and outperforms competitive baseline methods in cross-session emotion recognition.


I. INTRODUCTION
Affective Brain-computer Interfaces (aBCIs) employ brain imaging techniques to capture and interpret human emotional states, aiming to achieve emotional communication and expression between humans and computers. This endeavor enhances both the immersive user experience and the efficiency of human-computer interaction. Additionally, aBCIs exhibit promising applications in fields such as healthcare and education for long-term monitoring and prediction of emotional states, enabling personalized psychological interventions and treatment plans [1], [2]. A variety of modalities have been utilized in aBCIs, including functional magnetic resonance imaging (fMRI), near-infrared spectroscopy (NIRS), and electroencephalography (EEG). In particular, EEG-based aBCIs have garnered increasing attention due to rapid advancements in noninvasive, user-friendly, and low-cost EEG recording devices, particularly portable dry-electrode devices [3].
EEG-based aBCIs have demonstrated their capability to decode users' intentions from brain recordings and have shown potential applications in neural rehabilitation systems [4]. However, individual differences and the non-stationary characteristics of EEG [5] make the development of stable EEG-based emotion recognition models a challenging task. Consequently, it is necessary to collect labeled samples for each subject at each recording time to train new models, leading to time-consuming and expensive labeling work. To mitigate this reliance on labeled data, an increasing number of researchers have in recent years turned to transfer learning methods to reduce individual differences [5], [6], [7], [8], [9] and improve invariant feature representation [10], [11], [12].
Currently, the predominant transfer learning methods employed in EEG-based aBCIs are domain adaptation (DA) and domain generalization (DG). These methods are designed to reduce the distribution discrepancy between the source and target domains, thereby improving recognition performance in the target domain. Nevertheless, DA methods require access to the target domain during training and typically assume that the data distribution remains invariant, or changes minimally, between the source and target domains. In scenarios where the data distribution continuously evolves during real-time data acquisition, DA methods cannot effectively adapt to these variations. DG, on the other hand, learns domain-invariant representations from the source domains without exposure to target-domain data and is thus more suitable for practical applications. However, DG methods require a large number of source domains to train the model and achieve strong generalization.
In short, DA methods require access to the target domain's data distribution, while DG methods need a large number of source domains. Both are impractical in the following cross-session emotion recognition scenario: only one session (i.e., one source domain) of labeled data is available for a specific subject during training. In this context, the primary concern is how to effectively utilize the limited labeled data to train a subject-specific model for cross-session emotion recognition.
Within a brain-big-data center, real-time EEG data from a vast group of individuals are continuously transmitted, resulting in an abundance of unlabeled signals from various subjects and sessions, potentially containing some degree of corruption. This situation therefore presents an intriguing challenge: can these unlabeled data be combined with the limited labeled data to train a subject-specific model for cross-session emotion recognition? This paper addresses the challenge by proposing Multi-Scale Masked Autoencoders (MSMAE). The MSMAE model is based on a multi-scale Vision-Transformer hybrid architecture, incorporating spectrum embedding, multi-head spatial attention, and multi-scale feature fusion to effectively capture channel and spatial information in EEG signals. Specifically, MSMAE is pre-trained on unlabeled EEG data from multiple subjects and sessions, encoding and reconstructing channel-level and spatial-level representations of EEG signals to extract noise-robust, subject-invariant, and temporal-invariant features. Subsequently, only a small amount of labeled data from a specific subject is needed to fine-tune the model for personalization. Under this comprehensive training, the subject-specific model demonstrates a remarkable ability to decode emotional states from a different session of EEG data during the testing phase.
The main contributions of this study can be summarized in three aspects:
1) We introduce a unified multi-scale pre-training framework aimed at addressing the challenges of missing EEG channels and limited labeled data in emotion recognition. This framework significantly enhances the practicality and effectiveness of EEG-based emotion recognition in real-world applications.
2) We present an innovative multi-scale fusion approach that combines channel-level and spatial-level learning. Our model aligns spatial-level correlations between pre-training and fine-tuning data to mitigate inter-subject and inter-session variations. Furthermore, it fine-tunes the channel-level representation to preserve the exclusivity of subject-specific features. These techniques enhance adaptability and robustness for subject-specific cross-session emotion recognition tasks.
3) Our proposed model exhibits superior performance on two publicly available datasets for cross-session emotion recognition, even when only one session of labeled data is accessible for training.
The remainder of this paper is structured as follows: Section II offers a brief review of related works. Section III elaborates on the proposed method. Section IV conducts a comprehensive evaluation of the proposed method. Finally, Section V concludes the paper.

II. RELATED WORK

A. EEG Emotion Recognition
EEG-based emotion recognition depends on extracting sufficiently discriminative EEG features. Widely used EEG features fall into four groups: temporal-domain features, frequency-domain features, time-frequency-domain features, and brain connectivity features. Commonly employed statistics in the temporal domain include entropy, the fractal dimension, and higher-order crossings [13], [14]. Within the frequency domain, power spectral density (PSD) [15] and differential entropy (DE) [16] stand out as two of the most frequently used features. Several approaches [17], [18], [19], [20] have demonstrated excellent performance with time-frequency-domain features. Nalwaya et al. [19] employed the Fourier-Bessel domain adaptive wavelet transform (FBDAWT) to analyze multi-sensor EEG signals, accurately identifying emotional states. Bhattacharyya et al. [20] integrated the empirical wavelet transform (EWT) with Fourier-Bessel series expansion (FBSE), yielding an enhanced time-frequency representation of multi-component signals. For brain connectivity features, two crucial measures, the Phase Lag Index (PLI) and the Phase Lock Value (PLV), were utilized to assess phase synchronization among electrode signals across various brain regions. Liu et al. [21] employed the PLI feature to discern the emotional states of individual subjects, highlighting its remarkable discriminative capability. Chen et al. [22] integrated frequency-domain features with brain connectivity features for cross-subject emotion recognition, demonstrating superior performance. Furthermore, with the widespread adoption of deep learning methods, Alhagry et al. [23] utilized a two-layer long short-term memory network to extract temporal features. Zhang et al. [24] employed a recurrent neural network (RNN) to capture spatial-temporal representations from EEG signals. Zhong et al. [8] introduced a regularized graph neural network that considers the topological structure of EEG channels. Although these supervised approaches have successfully improved EEG-based emotion recognition performance, they require well-annotated and robust EEG data, which is relatively difficult to obtain in practical applications. Additionally, they often ignore the influence of session differences, such as variations in the duration and content of the elicitation videos across experiments, which introduce emotional biases.

B. Transfer Learning
Transfer learning seeks to enhance the performance of a new task by leveraging knowledge from a source task. DA, a subset of transfer learning, has been extensively applied in EEG-based emotion recognition, demonstrating promising results. Chen et al. [25] introduced a multi-source marginal distribution adaptation method that captures domain-invariant and domain-specific features for emotion recognition. Li et al. [26] developed an innovative domain adaptation method for emotion recognition that extracts generalized features across different subjects and sessions by simultaneously adapting both the marginal and conditional distributions to approximate the joint distribution. However, these DA methods require access to the target domain's data distribution. Unlike DA, DG aims to generate domain-invariant representations from the source domains without utilizing target-domain data. Ma et al. [27] developed a domain residual network that learns domain-specific and domain-shared weights separately, with the latter used to classify emotion in unknown domains. Ozdenizci et al. [10] proposed an adversarial inference approach that extends deep learning models for EEG-based person identification, aiming to learn session-invariant, person-discriminative representations. However, this requirement becomes impractical when only one source domain of labeled data is available. Recently, Li et al. [28] utilized self-supervised learning for initial model pre-training and subsequently fine-tuned the model on new data, demonstrating notable performance in emotion recognition tasks, including scenarios where data may be incomplete or corrupted. However, this model cannot handle complex tasks such as cross-session analysis. Conducting cross-session emotion recognition with limited training data thus still poses significant challenges.

III. METHOD

A. Formulation
We transform the EEG channels into a two-dimensional plane using the EEG electrode distribution map to improve spatial information consistency among adjacent channels, as depicted in Fig. 1. Specifically, each channel is repositioned onto a two-dimensional electrode topology of size 9 × 9, with zero-padding for missing electrodes. We apply this transformation to frequency-domain features, resulting in the EEG image x ∈ R^{9×9×C_f}, where C_f is the number of frequency bands. The pre-training dataset consists of unlabeled data from various subjects and sessions, denoted as X_Pre = {x_i^{Pre}}_{i=1}^{N_F}, where N_F is the number of samples in this dataset. The test data and labels for a specific subject s are denoted as X_T^s = {x_i^T, y_i^T}_{i=1}^{N_T}, with N_T the number of samples in the test dataset.
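The 2D electrode mapping can be sketched as follows. The channel layout below is illustrative, not the exact SEED montage, and `to_eeg_image` is a hypothetical helper name:

```python
import numpy as np

def to_eeg_image(features, layout, grid=9, n_bands=5):
    """Place per-channel frequency features onto a grid x grid plane.

    features: (n_channels, n_bands) array of band features (e.g., DE).
    layout:   list of (row, col) grid positions, one per channel.
    Cells with no electrode stay zero (the zero-padding in the text).
    """
    img = np.zeros((grid, grid, n_bands), dtype=np.float32)
    for ch, (r, c) in enumerate(layout):
        img[r, c] = features[ch]
    return img

# Toy example: 3 channels placed on the 9 x 9 grid, C_f = 5 bands.
layout = [(0, 3), (0, 5), (4, 4)]
x = to_eeg_image(np.ones((3, 5)), layout)
```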

B. Overview
We propose a multi-scale pre-training model based on the masked autoencoder (MAE) [29], as shown in Fig. 2. The framework consists of a multi-scale pre-training stage, a personalized fine-tuning stage, and a personal testing stage.
In the multi-scale pre-training stage, both the channel-level feature extractor E_Pre_1 and the spatial-level feature extractor E_Pre_3 are employed to extract general information that is shared by all subjects. Specifically, the unlabeled EEG data x_Pre are first convolved with kernels of different scales (1 × 1 and 3 × 3), denoted Conv_1 and Conv_3, yielding the channel-level representation x̃_Pre_1 and the spatial-level representation x̃_Pre_3. For the channel-level representation x̃_Pre_1, considering the presence of missing data in some channels, we avoid encoding channels with missing data so as to preserve complete information and prevent the introduction of noise. We reconstruct the masked portions to learn the encoder E_1 and obtain z_Pre_1. For the spatial-level representation x̃_Pre_3, which aggregates information from multiple channels, we apply an attention feature extractor, denoted Attn, to align the features of the pre-training and fine-tuning data based on brain region correlations, producing the aligned feature x̃_Pre_3. We then apply masking and reconstruction on x̃_Pre_3 to learn the encoder E_3 and obtain z_Pre_3:

z_Pre_1 = E_1(Mask(x̃_Pre_1)),  z_Pre_3 = E_3(Mask(Attn(x̃_Pre_3)))   (1)

In the fine-tuning calibration stage, only a limited amount of labeled data from a specific subject is employed to fine-tune the channel-level feature extractor E_Pre_1 for the personal emotion predictor. Simultaneously, we freeze the parameters of the pre-trained spatial-level feature extractor E_Pre_3 for the generalized emotion predictor. Finally, we fuse the channel-level representation with the spatial-level representation to perform the final emotion classification. Through this comprehensive training, the subject-specific model demonstrates an exceptional capability to decode emotional states from a different session of EEG data during the test phase. We elaborate on each stage below.

C. Multi-Scale Pre-Training
To make use of more corrupted EEG data and enhance the learning capacity of the model, we adopt the MAE framework with a transformer-based backbone network [30]. The model splits images into equal blocks and uses transformer encoders to extract features, with an asymmetric encoder-decoder design for image reconstruction. It leverages transformers for global information, masking for robustness, and self-supervised training for generalizability. In our study, we employ convolutional kernels for patch embedding. The size of the convolutional kernel offers different interpretations of the partitioning of two-dimensional EEG images: 1 × 1 convolutions partition individual electrodes to learn inter-channel relationships, while 3 × 3 convolutions learn broader spatial features. We conduct multi-scale feature fusion to enhance data utilization and model representation capacity, enabling the extraction of deeper emotional representations from the frequency-domain channel features and spatial features of the EEG.
1) Channel-Level Representation: By employing 1 × 1 convolution, we map each EEG electrode to a patch, enabling the vision-transformer framework to encode channel relationships and capture channel-specific feature information. However, the combination of partially missing channels, zero-padding, and random masking risks losing valuable data. To address this, we improve the masking so that all zero-padded patches are masked, preserving meaningful channel information in the feature extraction process. More specifically, given the input pre-training data x_Pre, we embed patches using C_1 convolutional kernels of size 1 × 1 with added positional embeddings, obtaining x̃_Pre_1 ∈ R^{9×9×C_1}:

x̃_Pre_1 = Conv_1(x_Pre) + E_pos   (2)

where Conv_1 denotes the convolution operation and E_pos the positional embedding. Assume that, out of the 81 (9 × 9) patches, p patches are non-zero-padded (e.g., p = 62 as illustrated in Fig. 1). To ensure the effectiveness of subsequent feature encoding, we randomly mask these p non-zero-padded patches in addition to masking all zero-padded patches:

M̃(i, j) = 0, if position (i, j) should be masked; 1, otherwise   (4)

where M = {M(i, j)}_{i=1,j=1}^{9,9} ∈ R^{9×9} is the mask matrix corresponding to the 2D EEG image with missing channels, M̃ is the updated mask, and ⊙ denotes the element-wise multiplication.
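A minimal sketch of the improved masking rule, assuming NumPy arrays and treating all-zero cells as zero-padded patches (the helper name and the exact keep-count policy are our assumptions):

```python
import numpy as np

def channel_mask(img, mask_ratio=0.75, rng=None):
    """Build the updated mask: 0 = masked, 1 = visible.

    All zero-padded cells (no electrode) are always masked; of the p
    non-zero-padded patches, a mask_ratio fraction is masked at random,
    so only the remaining visible patches are fed to the encoder.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    nonzero = np.argwhere(img.any(axis=-1))          # the p real patches
    M = np.zeros(img.shape[:2], dtype=np.int64)      # start fully masked
    n_keep = int(round(len(nonzero) * (1 - mask_ratio)))
    keep = nonzero[rng.permutation(len(nonzero))[:n_keep]]
    M[keep[:, 0], keep[:, 1]] = 1                    # visible patches
    return M
```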
2) Spatial-Level Representation: When using a 3 × 3 convolution for partitioning, each patch contains information from multiple electrode channels. Neighboring channels in EEG signals influence each other and reflect the signal characteristics of the corresponding brain region, and the connectivity between brain regions is closely related to their spatial positions. The spatial features of EEG signals reflect the coordination and interaction among different areas of the brain, which is crucial for analyzing the spatial distribution and temporal variations of neural activity. In cross-session emotion recognition experiments, factors such as the induced emotional stimuli, external environments, and physiological expressions contribute to
variability. However, the regional influence of EEG signals remains more objective and stable. Therefore, we further encode and decode the spatial features. By using larger-scale convolutions for weighted-average partitioning, we not only incorporate spatial features of brain regions to some extent but also achieve universality across all EEG data with missing channels, reducing the workload of data preprocessing. Specifically, given the input pre-training data x_Pre, C_3 convolutional kernels of size 3 × 3 are applied to obtain x̃_Pre_3 ∈ R^{3×3×C_3}:

x̃_Pre_3(i, j) = Conv_3(x_Pre)(i, j) / Σ_{(u,v) ∈ patch(i,j)} I(x_Pre(u, v) ≠ 0)   (3)

Here, Conv_3 denotes the convolution operation, and I(•) is the indicator function that returns 1 if the condition is true and 0 otherwise. x̃_Pre_3(i, j) ∈ R^{C_3} (for i = 1, 2, 3 and j = 1, 2, 3) is the patch obtained through the 3 × 3 convolution, normalized by dividing by the number of existing channels in the corresponding patch.
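The per-patch normalization by the number of existing channels can be illustrated with plain average pooling standing in for the learned 3 × 3 convolution (a simplification; `spatial_patches` is a hypothetical helper name):

```python
import numpy as np

def spatial_patches(img):
    """3 x 3 patch embedding normalized by the existing channels.

    Each 3 x 3 block of the 9 x 9 x C image is summed and divided by
    its number of non-zero (existing) channels, so missing electrodes
    do not dilute the patch value.
    """
    out = np.zeros((3, 3, img.shape[-1]))
    for i in range(3):
        for j in range(3):
            block = img[3 * i:3 * i + 3, 3 * j:3 * j + 3]
            n = np.count_nonzero(block.any(axis=-1))   # existing channels
            out[i, j] = block.sum(axis=(0, 1)) / max(n, 1)
    return out
```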
3) Invariance Learning for Region Correlation: We align pre-training and fine-tuning data features based on brain region correlations to obtain subject-invariant and temporal-invariant features. Since each individual's emotional fluctuations are unique and represent their distinct characteristics, we align the shared features through brain region correlations instead of directly aligning the pre-training spatial-level representation x̃_Pre_3 and the fine-tuning spatial-level representation x̃_F_3. This partially attenuates the differences in data distribution while preserving the unique characteristics of each subject's EEG signals. Specifically, x̃_Pre_3 ∈ R^{3×3×C_3} and x̃_F_3 ∈ R^{3×3×C_3} are first rearranged into x̃^R_Pre_3 ∈ R^{9×C_3} and x̃^R_F_3 ∈ R^{9×C_3}, respectively. An attention mechanism is then employed to capture the correlations between patches:

A_Pre = softmax(Q_Pre K_Pre^T / √d_k),  A_F = softmax(Q_F K_F^T / √d_k)

Here, Q_Pre ∈ R^{9×d_k} and K_Pre ∈ R^{9×d_k} are the queries and keys for x̃^R_Pre_3, obtained by performing linear transformations on x̃^R_Pre_3, while Q_F and K_F are the corresponding queries and keys for x̃^R_F_3; the dimension of the keys (and queries), d_k, is used for scaling the dot product.
Then, the similarity between A_Pre on the pre-training data and A_F on the fine-tuning data is measured using the Maximum Mean Discrepancy (MMD):

L_MMD = || (1/B) Σ_{i=1}^{B} ∅(A_Pre^{(i)}) − (1/B) Σ_{j=1}^{B} ∅(A_F^{(j)}) ||^2

where B is the number of samples in a training mini-batch, i and j are indices within the batch, A_Pre^{(i)} is the correlation matrix of the i-th pre-training sample, A_F^{(j)} is the spatial correlation matrix of the j-th fine-tuning sample, and ∅(•) denotes the mapping function.
By doing so, we can quantify the distribution differences in attention representations between the impaired pre-training data and the fine-tuning data. Introducing this loss mitigates feature disparities between different subjects while preserving the emotional characteristics inherent to each subject, thereby enhancing the model's classification performance and generalization ability. The attention mechanism is further used to obtain the aligned feature x̃_Pre_3:

x̃^R_Pre_3 = Attn(x̃^R_Pre_3) = A_Pre V_Pre

where V_Pre is the values obtained by performing a linear transformation on x̃^R_Pre_3, and Attn is the attention feature extractor. Finally, x̃^R_Pre_3 is rearranged back into x̃_Pre_3 ∈ R^{3×3×C_3} for the subsequent 2D masking of size 3 × 3.
At this point, x̃_Pre_3 carries richer spatial features and prior knowledge than the initial data, and it is complementary to the channel-level representation.
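A compact sketch of the region-correlation attention and its MMD alignment, assuming single-head attention and a linear kernel for the MMD (the kernel choice is ours; the weight matrices stand in for the learned linear maps):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def region_attention(x, Wq, Wk, Wv):
    """Single-head attention over the 9 rearranged 3x3 patches.

    x: (9, C) rearranged spatial representation. Returns the patch
    correlation matrix A = softmax(Q K^T / sqrt(d_k)) and the aligned
    feature A @ V.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A, A @ V

def mmd_linear(A_pre, A_fin):
    """Linear-kernel MMD between two batches of correlation matrices
    (a common simplification; the paper only specifies MMD)."""
    mu_p = A_pre.reshape(len(A_pre), -1).mean(axis=0)
    mu_f = A_fin.reshape(len(A_fin), -1).mean(axis=0)
    return float(((mu_p - mu_f) ** 2).sum())
```

Identical batches of correlation matrices yield zero MMD, so the loss only penalizes genuine distribution differences between the pre-training and fine-tuning attention maps.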
4) Encoder, Decoder, and Reconstruction: Based on the embeddings at different scales described above, we obtain x̃_Pre_1 and x̃_Pre_3. We apply masks to these data according to the meaning of each scale's features, then utilize a multi-layer Transformer encoder to extract features, followed by a decoder to reconstruct the images:

z_Pre_1 = E_1(M̃ ⊙ x̃_Pre_1),  z_Pre_3 = E_3(M(3) ⊙ x̃_Pre_3)

where M(3) ∈ R^{3×3} is the random mask for x̃_Pre_3, ⊙ denotes the element-wise masking operation, and the mask values are broadcast correspondingly. z_Pre_1 and z_Pre_3 are the encodings of the masked data obtained through the encoders E_1 and E_3.
x'_Pre_1 and x'_Pre_3 are the reconstructions obtained through the decoders D_1 and D_3. We then use the mean squared error (MSE) to measure the quality of the masked reconstruction. The reconstruction loss is computed only over masked non-zero patches to avoid introducing noise:

L_recon_1 = (1/|Ω_1|) Σ_{(i,j)∈Ω_1} || x_Pre(i, j) − x'_Pre_1(i, j) ||^2
L_recon_3 = (1/|Ω_3|) Σ_{(i,j)∈Ω_3} || x̃_Pre_3(i, j) − x'_Pre_3(i, j) ||^2

where Ω_1 is the index set of the masked non-zero patches for x_Pre, Ω_3 is the index set of the masked patches for x̃_Pre_3, | • | denotes the number of elements in a set, (i, j) indexes the masked patches, x'_Pre_1(i, j) ∈ R^{C_f}, and x'_Pre_3(i, j) ∈ R^{C_3}. This yields the reconstruction losses L_recon_1 and L_recon_3 for the two scales. For a mini-batch of training data, the corresponding losses are denoted L^B_recon_1 and L^B_recon_3.
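The restriction of the MSE to masked, non-zero-padded patches can be sketched as follows (a NumPy illustration; the tensor layout and helper name are our assumptions):

```python
import numpy as np

def masked_recon_loss(x, x_rec, mask, nonzero):
    """MSE over patches that are both masked and non-zero-padded.

    mask:    (H, W) with 0 = masked, 1 = visible.
    nonzero: (H, W) bool, True where the patch holds real channel data,
             so zero-padded patches never enter the loss.
    """
    idx = (mask == 0) & nonzero
    if idx.sum() == 0:
        return 0.0
    return float(((x - x_rec)[idx] ** 2).mean())
```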

D. Fine-Tuning Stage and Test Stage
After pre-training, we obtain generalized feature extractors E_Pre_1 and E_Pre_3, which can be fine-tuned into a personalized feature extractor Ê adapted to a new task. However, certain modifications are made when transferring the model to EEG data.
In channel-level representation learning, to address the zero-padding introduced when mapping EEG data to two-dimensional brain images, we use channel masking during the pre-training stage to minimize the impact of zero-padding on the pre-training data. Similarly, during the fine-tuning stage, a masking matrix can be used in the self-attention mechanism to assign a weight of 0 to the contribution of the padded regions in the attention weights. This effectively removes the influence of missing data on the attention weights and prevents the padded regions from interfering with the results.
Specifically, given the input x_F, we obtain x̃_F_1 through the patch and positional embedding. Afterward, we calculate the corresponding attention matrix A_chan ∈ R^{81×81} within the encoder E_1, where each element A_chan(i, j) is defined as:

A_chan(i, j) = exp(e(i, j) + M^(F)(i, j)) / Σ_{k=1}^{n} exp(e(i, k) + M^(F)(i, k))

where n is the number of patches, e(i, j) is the similarity score between the i-th and j-th patches, determined through the dot product of the two vectors, and M^(F)(i, j) serves as a padding-patch indicator. If either the value of the i-th or the j-th patch is missing (i.e., represented by a padding value), then M^(F)(i, j) is set to −∞ to eliminate their contribution to the attention matrix (since lim_{M^(F)(i,j)→−∞} exp(e(i, j) + M^(F)(i, j)) = 0); otherwise, M^(F)(i, j) is set to 0.

In the pre-training phase for spatial-level representation, spatial feature alignment has already been performed with the fine-tuning data, so the aligned data can be used directly for feature extraction in the encoder. To reduce the number of tuning parameters and enhance model stability, we freeze the pre-trained parameters of the spatial-level representation in this stage. This allows us to effectively leverage the pre-training results while avoiding issues such as overfitting during fine-tuning, thus improving the model's generalization ability. Finally, the features extracted from the channel-level and spatial-level representations, denoted z_F_1 and z_F_3, respectively, are concatenated and passed through a Batch Normalization layer to enhance the model's robustness and generalization ability. A classification layer is then applied to the fused features z_F for emotion classification, and we compute the classification loss using cross-entropy.
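The −∞ padding trick in the fine-tuning attention can be illustrated as follows (a sketch; fully padded query rows are simply left as zeros rather than softmaxed over an empty row):

```python
import numpy as np

def masked_attention_weights(e, pad):
    """Zero out attention to and from padded patches via -inf scores.

    e:   (n, n) raw similarity scores between patches.
    pad: (n,) bool, True for zero-padded patches. Scores touching a
    padded patch are set to -inf before the softmax, so their final
    weight is exactly 0 and padded regions cannot influence the result.
    """
    e = e.copy()
    e[pad, :] = -np.inf
    e[:, pad] = -np.inf
    out = np.zeros_like(e, dtype=float)
    valid = ~pad
    z = e[np.ix_(valid, valid)]
    z = np.exp(z - z.max(axis=1, keepdims=True))
    out[np.ix_(valid, valid)] = z / z.sum(axis=1, keepdims=True)
    return out
```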
In the test stage, we employ a new session of EEG data from the specific subject, denoted x_T^s and y_T^s, to validate the effectiveness of the subject-specific model. The details of our proposed method are shown in Algorithm 1.

Algorithm 1 Multi-Scale Masked Autoencoders (Pre-Training)
Input: Pre-training data X_Pre = {x_i^{Pre}}_{i=1}^{N_F}
1: repeat
2:   Sample a mini-batch from X_Pre.
3:   Embed patches at the two scales: x̃_Pre_1 = Conv_1(x_Pre) + E_pos, x̃_Pre_3 = Conv_3(x_Pre).
4:   Align the spatial-level features: x̃_Pre_3 ← Attn(x̃_Pre_3).
5:   Mask the pre-training data and encode: z_Pre_1 = E_1(M̃ ⊙ x̃_Pre_1), z_Pre_3 = E_3(M(3) ⊙ x̃_Pre_3).
6:   Reconstruct the input data: x'_Pre_1 = D_1(z_Pre_1), x'_Pre_3 = D_3(z_Pre_3).
7:   Optimize E_Pre_3 by minimizing the reconstruction loss L^B_recon_3.
8:   Optimize E_Pre_1 by minimizing the reconstruction loss L^B_recon_1.
9: until all samples in X_Pre have been drawn.

E. Implementation
Because the channel-level and spatial-level representation learning stages have different numbers of blocks (9 × 9 and 3 × 3, respectively), a different mask rate is set for each stage. Specifically, the mask rate for the channel-level representation learning stage is set to 0.75, in accordance with the original MAE [29], while for the spatial-level representation learning stage it is reduced to 0.5 owing to the limited number of blocks.
The encoder and decoder parameters for channel-level and spatial-level representation learning are set identically for simplicity. Specifically, the dimensions of the encoder and decoder are chosen from {128, 256, 512, 1024}, the number of layers from {1, 2, 3, 4}, and the number of self-attention heads from {2, 4, 6, 8}. MSMAE is optimized with the SGD optimizer using a learning rate of 0.001, 50 epochs, and a batch size of 32. The parameter settings are detailed in Table I.
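The search space and fixed training settings above can be collected into a small configuration sketch (values taken from this section; the per-dataset choices finally selected are those reported in Table I and are not reproduced here):

```python
# Hyperparameter search space and fixed training settings for MSMAE,
# as described in the Implementation section.
search_space = {
    "embed_dim": [128, 256, 512, 1024],   # encoder/decoder dimension
    "num_layers": [1, 2, 3, 4],
    "num_heads": [2, 4, 6, 8],
}
train_cfg = {
    "optimizer": "SGD",
    "lr": 1e-3,
    "epochs": 50,
    "batch_size": 32,
    "mask_ratio": {"channel": 0.75, "spatial": 0.5},
}
```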

IV. EXPERIMENT

A. Datasets
Experiments are performed on two publicly available datasets, SEED and SEED-IV. The SEED dataset includes EEG signals from 15 subjects, recorded using an ESI NeuroScan system with 62 channels [31]. Each subject participates in three sessions, with an interval of approximately one week between sessions. During these sessions, data are collected while the subjects watch emotion-eliciting movies designed to evoke three emotional states: negative, positive, or neutral. The signals are initially recorded at a sampling rate of 1000 Hz and subsequently downsampled to 200 Hz for analysis. They are further segmented into non-overlapping 1-second segments, with each segment treated as a sample. Consequently, each subject and each session yields a total of 3,394 samples.
The SEED-IV dataset consists of EEG signals from 15 subjects recorded using the same device as SEED [32]. Similar to SEED, each subject participates in three separate sessions with intervals between them. In this case, four emotional states are collected: happiness, sadness, fear, and neutral. The signals are divided into non-overlapping 4-second segments, each regarded as an individual sample. Consequently, Sessions I, II, and III contain 851, 832, and 822 samples per subject, respectively.

B. Data Preprocessing
To construct a unified pre-training model, all data must be preprocessed in a consistent manner. First, based on the structure of the EEG cap, the EEG channels of each frame are mapped into a two-dimensional EEG image to preserve the spatial locations of the electrodes, as shown in Fig. 1. This transformation is applied to the frequency-domain features of each sample. We employ differential entropy (DE) as the frequency-domain feature, which is widely used in emotion recognition [31]. Specifically, DE features are derived from five predefined frequency bands: Delta (1-3 Hz), Theta (4-7 Hz), Alpha (8-13 Hz), Beta (14-30 Hz), and Gamma (31-50 Hz). Additionally, min-max normalization is performed at the sample level to address varying feature ranges, improve the convergence of the model, and eliminate dimensional differences between features.
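Under the usual Gaussian assumption, the DE feature of a band-passed segment reduces to 0.5 · log(2πe · σ²) per band; a minimal sketch (the band-pass filtering step is omitted for brevity):

```python
import numpy as np

# The five predefined frequency bands used for DE extraction, in Hz.
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def de_feature(segment):
    """Differential entropy of a band-passed EEG segment under the
    common Gaussian assumption: DE = 0.5 * log(2 * pi * e * var)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(segment))

def minmax(x):
    """Sample-level min-max normalization to [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```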

C. Cross-Session Evaluation
Compared to other datasets, SEED and SEED-IV have the distinctive property that each subject completed the experiment in three different sessions. We exploit this to investigate the generalization of models across sessions, specifically assessing whether models can deliver consistently satisfactory performance when training and testing data come from different sessions. Even under the same stimuli, the accuracy of various methods in predicting the emotions of the same subject at different times varies, reflecting differences in temporal stability. To date, however, there have been few studies on cross-session experiments, and most of them access the test data during training to minimize the distribution discrepancy with the training data. In contrast, our experimental setup does not require the inclusion of test data during training. This approach, while more challenging, offers greater practical value. Specifically, we use one session's EEG data as training data and another as testing data. The session pairs used for validation are session1-session3, session2-session1, session3-session2, session1-session2, session2-session3, and session3-session1. Through this six-fold cross-validation, we report the average recognition accuracy, along with the standard deviation, over all 15 subjects.
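The six ordered train→test session pairs amount to all 2-permutations of the three sessions:

```python
from itertools import permutations

# The six ordered train -> test session pairs used for the six-fold
# cross-session validation; accuracy is then averaged over all pairs
# and all 15 subjects.
sessions = [1, 2, 3]
folds = list(permutations(sessions, 2))
```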

D. Method Comparisons
We compare the proposed MSMAE with several relevant models on the SEED and SEED-IV datasets to demonstrate its effectiveness. We select models focusing on spatial features to ensure a meaningful comparison, including ViT [37], SimpleViT [33], FBCCNN [34], STNet [35], and DGCNN [36]. Furthermore, we implement these models using TorchEEG, a PyTorch-based library for EEG signal analysis, and search the parameter space of the compared models following the descriptions in their respective papers. The average accuracies (± standard deviation) for each method are reported in Table II. The experimental results demonstrate that our method significantly outperforms existing methods. Specifically, as shown in Table II, our model achieves a recognition accuracy of 80.86% on the SEED dataset with a standard deviation of only 6.21%. On the SEED-IV dataset, our model achieves a recognition accuracy of 59.33%, with a standard deviation of 12.61%. Furthermore, according to the results in Fig. 3, our method improves performance across the different session-to-session transfers. Even without utilizing the target domain, our model can reduce the influence of domain shift by aligning regional features. Additionally, as illustrated in Fig. 4, our model demonstrates advantages for each individual subject, indicating its generalization and stability.

TABLE III
ABLATION STUDY OF OUR MODEL WITHIN SEED AND SEED-IV

E. Ablation Study
To evaluate the effectiveness of each module in the MSMAE model, we conduct ablation experiments with the MAE model and the Vit model at different scales, namely (1 × 1) and (3 × 3). We also compare the results with the feature fusion of both methods at scale (1 × 1 & 3 × 3), as listed in Table III. Comparing the experimental results across scales, we observe that the results at the 3 × 3 scale consistently outperform those at the 1 × 1 scale, indicating the advantage of spatial frequency features in EEG emotion recognition tasks. Furthermore, by comparing the Vit and MAE models at the 1 × 1 scale on the SEED-IV dataset, where pre-training and fine-tuning data are relatively limited, we find that using MAE to pre-train a large model tends to overfit and lose accuracy compared to Vit without pre-training. This indicates that such pre-trained models are highly dependent on the volume of data, as model performance largely relies on the quality and diversity of the data used during training. Building upon this foundation, we further enhance the model's performance by fusing multi-scale features and conducting pre-training. Importantly, our model achieves higher stability and generalization by aligning the regional correlations between the pre-training and fine-tuning data. Through these ablation experiments, we validate the importance of scale selection, pre-training, and multi-scale feature fusion in our model. These results provide strong support for our research and application in complex EEG emotion recognition tasks and offer valuable directions for future improvements.
We randomly select one subject from the SEED dataset for visualization. The t-SNE visualization of different methods is presented in Fig. 5. In comparison to other methods, MSMAE reduces the data distribution discrepancy to some extent, even without utilizing target domain information.

F. Interpretability
To validate the interpretability of our proposed method, we conduct EEG topographic visualization using the adjacency matrices at a scale of 1 × 1 learned by MSMAE. Following [38] and [39], we visualize the degree centrality of each scalp EEG electrode based on the adjacency matrices. Suppose Ã = {Ã_{i,j}}_{i,j=1}^{p} is the submatrix of A_chan ∈ R^{81×81}, where p represents the number of non-zero-padded patches in the channel-level representation (p = 62 in the SEED dataset). In this matrix, the values in the i-th row and i-th column correspond to the connection weights associated with the i-th electrode. The degree centrality of the i-th EEG electrode, denoted as DC_i, can therefore be derived by

DC_i = Σ_{j=1}^{p} (Ã_{i,j} + Ã_{j,i}).

The resulting topographic maps depict the spatial distribution of the emotion recognition task, reflecting the inter-electrode correlation analysis of EEG signals in our method. By examining Fig. 6, we observe that the regions of emotional activity are primarily concentrated in the frontal and temporal areas. These findings are consistent with existing research on emotions [40], [41], [42]. Furthermore, we note that for neutral emotions the neural patterns are relatively smoother than for positive and negative emotions. Positive emotions are more readily activated in the lateral temporal areas than negative and neutral emotions, consistent with the finding in [31]. In addition, the activation range of negative emotions is larger in the frontal regions.
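As a concrete sketch, the degree-centrality computation and the [0, 1] scaling used for the topographic maps can be written as below. The exact formula was partially lost in extraction, so this assumes the standard row-plus-column weight sum described above; the adjacency matrix in the usage example is random stand-in data, not weights learned by MSMAE.

```python
import numpy as np

def degree_centrality(A_chan, p=62):
    """Degree centrality per electrode from the channel-level adjacency.

    A_chan: (81, 81) adjacency at the 1 x 1 scale; the first p rows/columns
    correspond to the non-zero-padded electrode patches (p = 62 for SEED).
    Returns p values min-max scaled to [0, 1] for scalp plotting.
    """
    A = np.asarray(A_chan)[:p, :p]          # submatrix over real electrodes
    dc = A.sum(axis=1) + A.sum(axis=0)      # i-th row + i-th column weights
    return (dc - dc.min()) / (dc.max() - dc.min() + 1e-12)

# Usage with stand-in data:
rng = np.random.default_rng(0)
dc = degree_centrality(rng.random((81, 81)))
```

The scaled values can then be passed to any scalp-plotting routine (e.g., a topomap function) together with the 62 electrode positions.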

G. Cross-Dataset Generalization
We perform cross-dataset experiments to assess the generalization ability of our model. We choose the unlabeled data from the latest publicly available dataset, FACED [43], as the pre-training data. This dataset contains EEG signals from 123 subjects recorded with 32 channels. Given that the SEED and SEED-IV datasets lack the A1 and A2 electrodes, we exclude these channels and retain 30 channels for our analysis. We fine-tune the model with data from one session of a specific subject from the SEED or SEED-IV dataset and test the model on another session of the same subject. The challenge of cross-dataset experiments is that pre-training is conducted on unlabeled 30-channel data, whereas fine-tuning uses 62-channel data from the SEED or SEED-IV dataset, resulting in missing channels and device differences. Notably, the only difference between our cross-dataset and within-dataset settings lies in whether the pre-training data originates from the same dataset as the fine-tuning data.
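The channel exclusion step can be sketched as follows; the montage ordering here is hypothetical (only the presence of A1/A2 matters), and the generic `CH*` names are placeholders rather than the FACED dataset's actual channel list.

```python
import numpy as np

# Hypothetical ordering for the 32-channel FACED montage; A1/A2 (earlobe
# electrodes) are absent from SEED/SEED-IV, so they are dropped before
# pre-training, leaving 30 channels shared across datasets.
FACED_CHANNELS = ["FP1", "FP2", "A1", "A2"] + [f"CH{i}" for i in range(1, 29)]
KEEP = [i for i, ch in enumerate(FACED_CHANNELS) if ch not in ("A1", "A2")]

def drop_missing_channels(eeg):
    """eeg: (n_samples, 32, n_features) -> (n_samples, 30, n_features)."""
    return np.asarray(eeg)[:, KEEP, :]
```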
We compare the performance of MSMAE under the cross-dataset and within-dataset settings. Additionally, Vit (1 × 1) and MAE (1 × 1) under the within-dataset setting are included for comparison, as depicted in Fig. 7. Based on the experimental results, our model demonstrates consistent and stable generalization ability in the cross-dataset setting. Furthermore, the results confirm our model's capability to address the issue of missing channels, validating its robustness and portability.

V. CONCLUSION
This paper introduces a unified, multi-scale pre-training framework to overcome challenges related to missing EEG channels and limited labeled data in emotion recognition. We propose a novel multi-scale fusion approach that combines channel-level and spatial-level representation learning with an improved masking mechanism to preserve electrode relationships, together with invariance learning for regional correlations. Compared to Vit (1 × 1) without pre-training, MSMAE improves accuracy by 10.76% on the SEED dataset and 11.9% on the SEED-IV dataset. Moreover, MSMAE surpasses the original MAE (1 × 1) in accuracy by 9.13% on SEED and by 15.31% on SEED-IV. MSMAE also demonstrates superiority over current state-of-the-art methods, outperforming the second-best method by 2.84% and 1.26% on the SEED and SEED-IV datasets, respectively.
In summary, the proposed model significantly elevates the performance of cross-session emotion recognition in a self-supervised fashion. MSMAE is a general framework that can be easily extended to other EEG-based learning tasks, offering promising directions for future research. However, the current implementation of MSMAE relies on handcrafted features as input, potentially losing valuable information in the original signals. Consequently, our future efforts will explore MSMAE's potential for directly extracting information from raw signals, addressing this constraint and enhancing the framework's utility.

Fig. 1. Mapping the EEG electrode distribution to a two-dimensional plane. The left illustration depicts the spatial arrangement of channels on the brain cap, while the right shows the converted 2D feature matrix format. Missing channels are filled with 0.
The pre-training data is denoted as X_Pre = {x^(i)_Pre}_{i=1}^{N_Pre} ∈ R^{N_Pre×9×9×C_f}, with N_Pre being the number of samples in this dataset. The fine-tuning data contains a limited amount of labeled data from a specific subject s, represented as X^s_F = {x^(i)_F}_{i=1}^{N_F}.
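The zero-filled 2D mapping of Fig. 1 can be sketched as below; the (row, column) coordinates cover only a few frontal electrodes and are illustrative, not the paper's exact 62-channel layout.

```python
import numpy as np

# Illustrative (row, col) cells on the 9 x 9 plane for a few electrodes;
# the full 62-channel layout follows the scalp geometry shown in Fig. 1.
GRID = {"FP1": (0, 3), "FPZ": (0, 4), "FP2": (0, 5)}

def to_feature_matrix(channel_features, grid=GRID, size=9):
    """channel_features: dict mapping electrode name -> C_f feature vector.
    Returns a (size, size, C_f) matrix; cells with no electrode stay zero."""
    c_f = len(next(iter(channel_features.values())))
    mat = np.zeros((size, size, c_f))
    for ch, vec in channel_features.items():
        r, c = grid[ch]
        mat[r, c] = vec
    return mat
```

Stacking N such matrices yields the N × 9 × 9 × C_f tensors used for pre-training and fine-tuning.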

Fig. 2. Overall structure of MSMAE. The framework consists of a multi-scale pre-training stage, a personalized fine-tuning stage, and a personal testing stage.
specific subject s; the number of epochs Epoch; and the batch size B.
Output: The generalized feature extractors E_Pre_1 and E_Pre_3 (including Conv_3, Attn, and E_3); the personalized emotion predictor Ê; and the predicted emotion classes ŷ^s_T = {ŷ^(i)_T}_{i=1}^{N_T}.
Pre-training Stage for Channel-level Representation:
1: Randomly initialize E_Pre_1.
2: for i = 1: Epoch do
3:   repeat
4:     Draw one batch of pre-training data x^B_Pre.
5:     Embed the pre-training data x^B_Pre to obtain x̃^B_Pre_1.
6: …

Fig. 3. Comparison between MSMAE and other algorithms in various cross-session scenarios within SEED and SEED-IV.

Fig. 6 presents the EEG topographic maps of positive, neutral, and negative emotions in the SEED dataset. The values of DC are scaled to the interval [0, 1]. Through scalp mapping visualization, we can gain a direct and intuitive understanding of these spatial patterns.

Fig. 4. Comparison of MSMAE and other algorithms on different subjects within SEED and SEED-IV.

Fig. 5. Feature visualization by different methods and at different scales within SEED dataset.

Fig. 6. Topographic maps learned from the MSMAE model within the SEED dataset.

10: Return E_Pre_1.
Pre-training Stage for Spatial-level Representation:
11: Randomly initialize E_Pre_3.
12: for i = 1: Epoch do
13:   repeat
14:     Draw one batch of pre-training data x^B_Pre and one batch of fine-tuning data x^B_F.
15:     Embed the input data x^B…
…      Optimize Conv_3 and Attn by minimizing the reconstruction loss L_mmd.
18:   until all samples in X_Pre have been drawn.
19: Return Conv_3 and Attn.
…
…    until all samples in X_Pre have been drawn.
27: Return E_Pre_3.
Personalized Calibration Stage:
28: Initialize Ê with E_Pre_1, E_Pre_3, and frozen E_Pre_3.
29: for i = 1: Epoch do
…
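The loop structure shared by the stages of Algorithm 1 can be sketched in a framework-agnostic way as below; `state`, `batches`, and `step` are stand-ins for MSMAE's actual encoders, data loaders, and loss-minimization updates, and the stage names are only labels for the flow above.

```python
def run_stage(state, batches, step, epochs):
    """One stage of Algorithm 1: for each epoch, repeatedly draw batches
    until all samples have been drawn, applying one update per batch."""
    for _ in range(epochs):
        for batch in batches:        # "repeat ... until all samples drawn"
            state = step(state, batch)
    return state

def msmae_flow(pre_batches, fine_batches, steps, epochs):
    """Two pre-training stages on unlabeled data, then personalized
    calibration on a specific subject's labeled fine-tuning data."""
    e1 = run_stage(0, pre_batches, steps["channel"], epochs)   # E_Pre_1
    e3 = run_stage(0, pre_batches, steps["spatial"], epochs)   # E_Pre_3
    return run_stage((e1, e3), fine_batches, steps["calibrate"], epochs)
```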