Multi-Task Collaborative Network: Bridge the Supervised and Self-Supervised Learning for EEG Classification in RSVP Tasks

Electroencephalography (EEG) datasets are characterized by low-signal-to-noise-ratio signals and unquantifiable noisy labels, which hinder classification performance in rapid serial visual presentation (RSVP) tasks. Previous approaches primarily relied on supervised learning (SL), which may result in overfitting and reduced generalization performance. In this paper, we propose a novel multi-task collaborative network (MTCN) that integrates both SL and self-supervised learning (SSL) to extract more generalized EEG representations. The original SL task, i.e., the RSVP EEG classification task, is used to capture initial representations and establish classification thresholds for targets and non-targets. Two SSL tasks, i.e., the masked temporal and masked spatial recognition tasks, are designed to enhance temporal-dynamics extraction and to capture the inherent spatial relationships among brain regions, respectively. The MTCN simultaneously learns from multiple tasks to derive a comprehensive representation that captures the essence of all tasks, thus mitigating the risk of overfitting and enhancing generalization performance. Moreover, to facilitate collaboration between SL and SSL, MTCN explicitly decomposes features into task-specific features and task-shared features, leveraging both label information with SL and feature information with SSL. Experiments conducted on the THU, CAS, and GIST datasets illustrate the significant advantages of learning more generalized features in RSVP tasks. Our code is publicly accessible at https://github.com/Tammie-Li/MTCN.


I. INTRODUCTION
The human brain is capable of processing visual information within a few milliseconds, making it an attractive target for investigating neural mechanisms [1]. This ability has led to extensive research [2], [3], [4] aimed at decoding the mechanisms of human vision by analyzing brain neural signals. Noninvasive techniques, including electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), offer accessible means for studying brain activity without surgical intervention. EEG has become a popular and practical method due to its safety, affordability, high temporal resolution, potential portability, and wide accessibility. Recent EEG studies have shown promising results, particularly in identifying objects of interest (OOI) within visual stimuli presented to individuals [5].

(The authors are with the College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; e-mail: mrtang@nudt.edu.cn. Digital Object Identifier 10.1109/TNSRE.2024.3357863)
The rapid serial visual presentation (RSVP) paradigm entails presenting an image sequence as visual stimuli at a frequency of 5-20 Hz. Brain-computer interface (BCI) researchers have combined the RSVP paradigm with EEG decoding techniques to develop efficient automatic image classification systems [6], [7]. The RSVP-based BCI system enables faster detection and recognition of objects and relevant information compared to manual analysis [8], thereby significantly enhancing the work efficiency of professionals. This system finds extensive applications in counterintelligence, law enforcement, and healthcare, where professionals are required to review substantial volumes of images or information. During the visual presentation, the OOI information can elicit an event-related potential (ERP) component, i.e., the P300, which is related to the brain's memory processing and attention mechanisms [9], [10]. This component helps detect the subject's attention and cognitive processing, thereby identifying OOI information.
To enhance the reliability of RSVP-based BCI systems, it is crucial to decode EEG signals generated during RSVP tasks accurately. Over the past decades, conventional and deep learning classification methods have been used to process RSVP EEG signals, and Lotte et al. [11], [12] have summarized these methods. According to the literature, almost all RSVP EEG classification methods rely on supervised learning (SL), whose performance heavily depends on well-collected and well-labeled training datasets. However, EEG signals generated by RSVP tasks have two distinct characteristics that differ from these conditions: (1) Low signal-to-noise ratio. EEG signals are prone to significant contamination from external interference and noise [13], including eye movements, muscle activity, and electromagnetic noise, which are difficult to filter out of the recorded EEG data. (2) Difficulty of annotation. The behavioural responses of the subjects may not necessarily be consistent with the stimuli. In other words, noisy labels exist unquantifiably for several reasons: from a cognitive perspective, human judgment is inconsistent and subjective, which may introduce cognitive bias when watching OOI-eliciting stimuli [14]; from an experimental participant's perspective, it is challenging to accurately know what subjects are thinking or doing during an RSVP experiment [15]; and from a data perspective, directly interpreting and annotating RSVP EEG data is impossible because of the complexity of the brain processes of interest and the individual differences between subjects [16]. With low signal-to-noise-ratio data and noisy labels, SL tends to overfit misleading information and degrade generalization performance, causing the parameters obtained after training to deviate from their optimal values [17]. Thus, it is both applicable and meaningful to develop a novel RSVP EEG classification method for settings with low signal-to-noise-ratio data and noisy labels.
To this end, two challenges need to be addressed: (1) how to extract discriminative features from low signal-to-noise-ratio EEG data while improving the generalization of the classification model; and (2) how to mitigate the negative impact of noisy labels. Regarding the first challenge, traditional handcrafted features such as band power in the frequency domain, statistical features in the time domain, and the discrete wavelet transform in the time-frequency domain [18] have been commonly used for generic EEG signal processing. However, these features are not specifically designed for RSVP EEG signals.
Recent deep-learning literature has also discussed this issue. For example, PPNN [10] and PLNet [19] take the phase characteristics of RSVP EEG signals into consideration, and DRL [20] alleviates the class imbalance between targets and non-targets. Although these deep learning methods can capture certain discriminative information from RSVP EEG signals, they do not sufficiently leverage prior ERP-related information. Thus, the first key aspect of algorithm design is to extract discriminative features by exploiting the characteristics of EEG signals.
For the second challenge, a limited number of studies on other EEG tasks, which share many common features with RSVP EEG classification, have considered this problem. Zhong et al. [21] proposed an EEG emotion recognition method based on a regularized graph neural network (RGNN), which utilizes emotion-aware distribution learning and node-wise domain adversarial training to handle noisy labels. Li et al. [14] proposed JO-CapsNet, which is based on the capsule network and employs a joint optimization strategy. Banville et al. [16] designed three self-supervised learning (SSL) tasks, namely temporal shuffling (TS), relative positioning (RP), and contrastive predictive coding (CPC), to extract deep representations without using label information, thereby mitigating the negative impact of noisy labels. However, RGNN requires prior information on the test distribution, which is not suitable for online EEG classification, and JO-CapsNet needs the noise rate, which is unquantifiable in RSVP tasks. The SSL tasks involve successive self-supervised pre-training and supervised downstream tasks, aiming to learn universal representations from unlabeled data and then determine a classification boundary between targets and non-targets. Despite the promising results shown in extensive SSL studies, there still exists a performance gap compared to conventional SL. This is because the SSL paradigm requires designing an independent algorithm for each learning task separately, leading to a significant mismatch between the feature distributions of the pre-training and target tasks. Therefore, the second key aspect is to explore a unified algorithm that enhances the collaboration between self-supervised pre-training tasks and supervised downstream tasks.
Based on these two key aspects of algorithm design, we propose a multi-task collaborative network (MTCN) that bridges the benefits of SL and SSL to enhance the performance of RSVP EEG classification. The design of MTCN includes two parts: (1) task-related SSL tasks that reinforce temporal-spatial feature extraction without labels, thereby mitigating the effect of noisy labels; and (2) a collaborative mechanism that bridges the advantages of SL and SSL while ensuring the consistency of their distributions. In the first part, we develop two highly related SSL tasks, namely masked temporal recognition (MTR) and masked spatial recognition (MSR), according to the characteristics of RSVP EEG data and the feature extraction process. The MTR task considers multiple ERP components to extract the important temporal dynamics of RSVP EEG data, while the MSR task partitions the electrodes into multiple predefined regions according to their location distribution to learn the spatial patterns of EEG electrodes across different brain regions. In the second part, MTCN simultaneously learns from multiple tasks, including the SSL tasks and the original SL task, to find a universal representation space that captures the common features of all tasks, reducing the likelihood of overfitting and improving generalization performance. Moreover, MTCN explicitly decomposes features into task-specific and task-shared features to make the feature space more structured. The multiple tasks, including the SL-based classification task, the SSL-based MTR task, and the SSL-based MSR task, share the task-shared feature space to promote collaboration and information sharing between tasks, while the task-specific spaces facilitate the learning of individual patterns to improve the performance of each task.
This study focuses on RSVP EEG classification in the presence of low signal-to-noise-ratio data and noisy labels. The contributions are summarized as follows: 1) This is the first study that adopts a unified algorithm to bridge the advantages of SL and SSL, thereby improving RSVP EEG classification performance. 2) Through two designed SSL tasks, namely the MTR and MSR tasks, MTCN learns more discriminative temporal-spatial features without using label information. 3) MTCN explicitly decomposes features into task-specific and task-shared features to facilitate collaboration between the SL and SSL tasks. 4) Experimental results on the THU, CAS, and GIST RSVP datasets demonstrate that MTCN achieves remarkable performance.

(Fig. 1 caption, partial: ... (2) the model is optimized jointly for the primary task and two auxiliary SSL tasks. Note that the three tasks share a common task-shared feature extractor (pink) and each task has a private task-specific feature extractor (yellow/blue/green).)

A. Problem Definition
For the RSVP EEG classification problem, suppose D = {X, Y} = {(x_i, y_i)}_{i=1}^{N} denotes the training dataset, where X ∈ R^{N×C×K} denotes the input space and Y ∈ R^{N×2} denotes the ground-truth label space using one-hot encoding. To establish the mapping from EEG data to labels, SL models ideally learn a mapping function F: X → Y that minimizes the cross-entropy loss L.
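The objective above can be made concrete with a minimal NumPy sketch of the cross-entropy loss over one-hot labels; the function name and toy values are illustrative, not from the paper.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    """Mean cross-entropy L between predicted class probabilities
    (N x 2) and one-hot labels Y (N x 2), as in the problem definition."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=1)))

# Toy check: confident, correct predictions give a loss near zero.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
onehot = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = cross_entropy(probs, onehot)
```

Minimizing this quantity over F drives the predicted probabilities toward the (possibly noisy) one-hot targets, which is exactly why label noise can pull the parameters away from their optimum.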
However, from the data dimension, the recorded EEG data x consist of task-related EEG signals and task-irrelevant noise, which may cause the SL model to overfit misleading noisy information. In some batches, the influence of the noise is greater than that of the EEG signals, which drives the model to fit in the wrong direction and affects its generalization ability. More severely, from the label dimension, the recorded EEG labels inevitably contain noisy labels that are difficult to quantify and localize. According to the correctness of the sample annotation, labels can be divided into correct and incorrect labels. Incorrect labels arise in two situations, i.e., target samples annotated as non-targets or non-target samples annotated as targets. Optimization with noisy labels may easily take an inaccurate direction, thereby misguiding the model and degrading its generalization performance on RSVP EEG data.
The ideal solution would be to filter or correct the noise in the signals or labels. However, due to the low signal-to-noise ratio of EEG signals, it is challenging to directly filter out the noise. In addition, given the cognitive biases and unstable states of subjects, noisy labels are often difficult to quantify and localize. MTCN is therefore proposed to improve the generalization ability of SL and enhance RSVP EEG classification performance in the presence of low signal-to-noise-ratio signals and noisy labels.

B. Multi-Task Collaborative Network
MTCN aims to extract general and discriminative RSVP EEG representations by bridging the advantages of SL and SSL, as illustrated in Fig. 1. The primary SL task is to capture universal patterns and determine a classification boundary between targets and non-targets using label-dependent information. Inspired by masked autoencoders [22], we design the MTR and MSR tasks to reinforce temporal information extraction and to extract the spatial patterns of each brain region, respectively. To facilitate the collaboration of SL and SSL, we first decompose the feature space into task-specific and task-shared features, thereby allowing the model to recombine representations based on the requirements of each task. Subsequently, the multiple tasks are jointly optimized to utilize both label-dependent information from SL and feature-dependent information from SSL.
1) Masked Temporal Recognition Task: This task is designed to reinforce the extraction of temporal dynamics. The transformed dataset is generated by successively masking each ERP-component block with Gaussian noise, where G_t(·) transforms the masked block of the signal into Gaussian noise with the same mean and variance. To distinguish the masked ERP components in the transformed data, a mapping function F_t: X_t → Y_t should be built to minimize the cross-entropy loss L_t, where x^t_{i,j} and y^t_{i,j} are generated by masking the j-th ERP component of the i-th sample in the original data x. Masking the ERP components may help extract deeper temporal dynamic features and make the model more generalizable.
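The MTR transformation might be sketched as follows; the helper name `mask_temporal_block` and the example window boundaries are illustrative assumptions, with G_t realized as Gaussian noise matching the masked segment's mean and variance, as the text describes.

```python
import numpy as np

def mask_temporal_block(x, start, end, rng=None):
    """Sketch of G_t: replace time samples [start, end) of every channel
    with Gaussian noise sharing that segment's mean and variance.
    x has shape (C, K); returns a masked copy."""
    rng = np.random.default_rng(rng)
    x_t = x.copy()
    seg = x[:, start:end]
    x_t[:, start:end] = rng.normal(seg.mean(), seg.std(), size=seg.shape)
    return x_t

# Hypothetical ERP-component window (in samples at 256 Hz); the paper's
# exact block boundaries for each ERP component may differ.
x = np.random.default_rng(0).standard_normal((64, 256))
masked = mask_temporal_block(x, 64, 128, rng=1)
```

Applying this to each ERP-component block in turn yields the transformed samples x^t_{i,j}, whose labels y^t_{i,j} simply index which block was masked, so no RSVP class labels are needed.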
2) Masked Spatial Recognition Task: This task is designed to extract the spatial patterns of each brain region. Considering that different brain regions have distinct cognitive functions, the masked spatial recognition task is defined as identifying the brain region to which the missing electrodes belong. As illustrated in Table I, the 64/32 electrodes are first divided into eight regions according to their location distribution, denoted as [s_1; s_2; . . . ; s_8], where C_i is the number of electrodes in the i-th region. Taking the 64-lead EEG data as an example, as illustrated in Fig. 3, the transformed dataset is created by successively masking each brain region with Gaussian noise. A set of nine masked samples transformed from an EEG sample x is constructed, where G_s(·) transforms the masked region of the signal into Gaussian noise with the same mean and variance.
To identify the masked brain regions in the transformed data, a mapping function F_s: X_s → Y_s should be built to minimize the cross-entropy loss L_s, where x^s_{i,j} and y^s_{i,j} are generated by masking the j-th electrode region of the i-th sample in the original data x. Masking the channels in each brain region may help extract robust spatial functional relationships and enhance the generalization ability of the model.
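The MSR transformation can be sketched analogously; the region grouping below (eight contiguous blocks of eight channels) is a placeholder assumption, since the paper's Table I groups electrodes by their physical locations.

```python
import numpy as np

def mask_spatial_region(x, region_channels, rng=None):
    """Sketch of G_s: replace all channels of one predefined brain region
    with Gaussian noise matching their mean and variance. x: (C, K)."""
    rng = np.random.default_rng(rng)
    x_s = x.copy()
    seg = x[region_channels, :]
    x_s[region_channels, :] = rng.normal(seg.mean(), seg.std(), size=seg.shape)
    return x_s

# Hypothetical grouping of 64 channels into eight regions of eight
# channels each; the actual grouping follows electrode locations (Table I).
regions = [list(range(i * 8, (i + 1) * 8)) for i in range(8)]
x = np.random.default_rng(0).standard_normal((64, 256))
transformed = [mask_spatial_region(x, r, rng=i) for i, r in enumerate(regions)]
```

Each transformed sample x^s_{i,j} is labeled y^s_{i,j} by the index of the masked region, again requiring no RSVP class labels.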
3) Feature Extraction and Decomposition: Considering the limited data and the need for explicability in neurophysiology, the feature extractor E(·) is inspired by the EEGNet [24] architecture. The compact neural network begins with a temporal convolution block T(·) to learn frequency filters. Then a depthwise convolution D(·) is used to connect each feature map individually, thereby learning frequency-specific spatial filters. Finally, the spatial-temporal features are fused with a separable convolution S(·), which combines a depthwise convolution and a pointwise convolution. The depthwise convolution independently learns a temporal summary for each feature map, while the pointwise convolution learns the optimal way to combine the feature maps. The layout details of each block can be found in Fig. 4.

(Fig. 4 caption: The feature extraction and decomposition in detail. The task-specific feature extractor includes a temporal block and a depth block; the task-shared feature extractor adds a separable block on this basis. Note that the input and output dimensions of S(·) are the same.)
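The block layout described above can be sketched in PyTorch. The filter counts (F1 = 8, D = 2, F2 = 16), kernel sizes, and pooling factors below are assumptions borrowed from typical EEGNet configurations, not necessarily the paper's exact values.

```python
import torch
import torch.nn as nn

class SharedExtractor(nn.Module):
    """Sketch of the task-shared extractor: temporal conv -> depthwise
    spatial conv -> separable conv, following the EEGNet-style layout.
    Hyperparameters are illustrative assumptions."""
    def __init__(self, n_ch=64, F1=8, D=2, F2=16):
        super().__init__()
        self.temporal = nn.Sequential(   # T(.): learns frequency filters
            nn.Conv2d(1, F1, (1, 64), padding=(0, 32), bias=False),
            nn.BatchNorm2d(F1))
        self.depth = nn.Sequential(      # D(.): per-map spatial filters
            nn.Conv2d(F1, F1 * D, (n_ch, 1), groups=F1, bias=False),
            nn.BatchNorm2d(F1 * D), nn.ELU(), nn.AvgPool2d((1, 4)))
        self.separable = nn.Sequential(  # S(.): depthwise + pointwise fusion
            nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8),
                      groups=F1 * D, bias=False),
            nn.Conv2d(F1 * D, F2, 1, bias=False),
            nn.BatchNorm2d(F2), nn.ELU(), nn.AvgPool2d((1, 8)))

    def forward(self, x):                # x: (N, 1, C, K)
        return self.separable(self.depth(self.temporal(x)))
```

A task-specific extractor would, per Fig. 4, stop after the depth block, while the shared extractor appends the separable block shown here.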
However, multi-task learning generally shares a common feature space, and the model may converge excessively towards the auxiliary tasks, affecting the optimization of the main task, because features learned for different tasks interfere with each other. To extract feature information without redundancy, MTCN explicitly decomposes the feature extractor into a task-shared extractor E_sh(·) and task-specific extractors E_sp(·) to utilize both label-dependent information from SL and feature-dependent information from SSL. As shown in Fig. 4, each task-specific feature extractor successively adopts a temporal block and a depth block to extract spatial-temporal features. The task-shared feature extractor adds a separable block to enhance the information interaction between the specific tasks. The outputs of the task-shared and task-specific feature extractors are denoted as f_sh and f_sp; Eq. 6 shows the calculation process.
After feature extraction, we first combine (overlay) the task-shared and task-specific features, and then adopt a classification head H_p, a temporal head H_t, and a spatial head H_s, each with two fully connected layers, to map the feature representation into the corresponding classification space. Here ỹ_p, ỹ_t, and ỹ_s are the predicted class probabilities, with dimensions of 2, 9, and 8 for the primary, MTR, and MSR tasks, respectively. Moreover, to decompose the original features into task-specific and task-shared components as completely as possible, MTCN adds an orthogonal constraint between the task-shared and task-specific features, where ∥·∥²_F denotes the squared Frobenius norm.
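A common way to realize such a squared-Frobenius-norm orthogonality penalty (as in shared-private multi-task models) is sketched below; the exact normalization and aggregation across tasks in the paper may differ.

```python
import torch

def orthogonality_loss(f_shared, f_specific):
    """Sketch of the orthogonal constraint: squared Frobenius norm of the
    cross-correlation between flattened task-shared and task-specific
    features, pushing the two subspaces to carry non-redundant information."""
    fs = f_shared.flatten(1)    # (N, d1)
    fp = f_specific.flatten(1)  # (N, d2)
    return (fs.transpose(0, 1) @ fp).pow(2).sum()
```

When the shared and specific features occupy orthogonal directions the penalty is zero, and it grows as the two subspaces overlap, which is what drives the decomposition to be non-redundant.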

4) Joint Optimization for Training:
The model is simultaneously trained on the multiple SSL tasks and the primary SL task. To eliminate the need for manual weight tuning of the various loss functions, the total loss function is formulated by taking into account the homoscedastic uncertainty of each task [25]. In the joint training objective, L_p denotes the cross-entropy loss of the primary RSVP EEG classification task, and τ_p, τ_t, and τ_s are the observation noise scalars of the corresponding tasks. The observation noise scalar τ provides a principled approach to multi-task deep learning by weighting multiple loss functions based on the homoscedastic uncertainty of each task. This enables simultaneous learning of diverse quantities with different units or scales in both classification and regression scenarios, optimizing the balance of these weightings and leading to superior performance. These scalars are treated as trainable parameters that evolve dynamically during training, with an initial value of 1. The complete procedure of MTCN is summarized in Algorithm 1.
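Uncertainty-based loss weighting in the style of [25] might be sketched as follows; the log-variance parameterization (s_k = log τ_k², so τ_k = 1 when s_k = 0) is a common numerical-stability choice and an assumption here, as the paper's exact formula is not reproduced in the text.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of homoscedastic-uncertainty weighting [25]: each task loss
    L_k is scaled by 1/tau_k^2 plus a log(tau_k) regularizer, with the
    tau_k trainable and initialized to 1."""
    def __init__(self, n_tasks=3):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(n_tasks))  # s_k = log tau_k^2

    def forward(self, losses):
        total = 0.0
        for k, loss_k in enumerate(losses):
            # 1/tau_k^2 down-weights noisy tasks; the log term keeps tau finite
            total = total + torch.exp(-self.log_var[k]) * loss_k \
                          + 0.5 * self.log_var[k]
        return total
```

With all τ_k initialized to 1 the objective starts as a plain sum of the task losses, and the balance then adapts during training via gradient descent on the s_k.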

III. EXPERIMENTS

A. Datasets
To evaluate the proposed method, experiments are performed on three publicly available datasets. All experiments were approved by the Ethics Committee and adhered to the Declaration of Helsinki. The subjects had normal or corrected-to-normal vision and did not report a history of neurological problems. Each subject signed a consent form. During the RSVP task, each subject had to focus on the screen, which rapidly displayed an image sequence at a rate of 10 Hz. The characteristics of the three datasets are as follows:

(Algorithm 1, summarized: for each training iteration, compute the task-shared and task-specific features by Eq. 6; compute the prediction of each task by Eq. 7; compute the empirical loss of each classification task by Eqs. 1, 3, and 5 from y, y_t, y_s and ỹ_p, ỹ_t, ỹ_s; compute the structured loss between task-shared and task-specific features by Eq. 8; compute the overall loss of MTCN by Eq. 9; update the network parameters θ ← θ − η∇L; return the trained parameters.)

1) Tsinghua University Dataset (THU): This dataset is publicly available at http://bci.med.tsinghua.edu.cn [26]. It records EEG data from 64 healthy subjects (32 ...). Target images contain pedestrians and non-target images are devoid of pedestrians. The experiment consists of four blocks, with each block comprising 40 trials. Each trial corresponds to 100 images and contains 1-4 targets. The EEG data were recorded using a 64-electrode Neuroscan Synamps2 system at a sampling rate of 1000 Hz.
The reference electrode was at the vertex.
2) Chinese Academy of Sciences Dataset (CAS): The dataset can be downloaded at https://figshare.com [27]. It includes EEG signals recorded from 14 subjects (four males, aged 24.9 ± 1.5 years). Similar to THU, images containing pedestrians in the street are regarded as targets, and the rest are non-targets. Each subject participated in two sessions of the RSVP experiment with a 23-day interval. Each trial consisted of 100 images, including four target images. One session comprised three blocks, with each block containing 14 trials. The recording device and sampling frequency are consistent with the THU dataset. The reference electrode, named "Ref" in the 10-20 system, was located at the vertex.
3) Gwangju Institute of Science and Technology Dataset (GIST): The GIST dataset is open-sourced at https://springernature.com [28]. It includes EEG data from 55 healthy subjects (14 females, aged 22.9 ± 2.9 years). The stimulus images comprise one green-colored target character and 20 white-colored non-target characters. Each participant completed 40 RSVP trials, yielding 40 target events and 800 non-target events per participant. The EEG data were recorded with a Biosemi ActiveTwo system using 32 Ag/AgCl electrodes at a sampling rate of 512 Hz. GIST uses the average of all channels as the reference because the EEG device used for data acquisition does not provide hardware-level referencing.

B. Data Preprocessing
The data preprocessing successively includes the following three steps: 1) Under-Sampling, Filtering, and Normalization: Following the original papers, all EEG data were re-referenced to the average of all electrodes using the common average reference (CAR). To unify the input structure of MTCN, EEG signals are under-sampled to 256 Hz. Because ERP components are phase-locked, each EEG trial undergoes 6th-order zero-phase bandpass filtering using a Butterworth filter with cut-off frequencies of 0.1 Hz and 48 Hz. This helps eliminate slow drift and high-frequency noise while preventing delay distortion. Lastly, EEG signals are normalized per channel using the z-score method.
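This preprocessing chain might be sketched with SciPy as follows; the function name `preprocess` and the exact filtering details (SOS form, padding defaults) are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(eeg, fs_in):
    """Sketch of the chain: common average reference, resampling to 256 Hz,
    6th-order zero-phase Butterworth band-pass (0.1-48 Hz), and per-channel
    z-scoring. eeg: (C, K) array sampled at fs_in Hz."""
    eeg = eeg - eeg.mean(axis=0, keepdims=True)       # CAR re-reference
    if fs_in != 256:                                   # e.g. 1000 Hz -> 256 Hz
        eeg = resample_poly(eeg, 256, fs_in, axis=1)
    sos = butter(6, [0.1, 48], btype="bandpass", fs=256, output="sos")
    eeg = sosfiltfilt(sos, eeg, axis=1)                # zero-phase filtering
    mu = eeg.mean(axis=1, keepdims=True)
    sd = eeg.std(axis=1, keepdims=True) + 1e-12
    return (eeg - mu) / sd                             # per-channel z-score

x = np.random.default_rng(0).standard_normal((64, 1000))  # 1 s at 1000 Hz
out = preprocess(x, 1000)
```

Forward-backward filtering (`sosfiltfilt`) is what gives the zero-phase property the text requires, at the cost of doubling the effective filter order's magnitude response.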
2) Data Segmentation: The RSVP EEG data are divided into 1-second segments starting from the onset of each stimulus; each segment is referred to as one EEG sample x ∈ R^{C×K}, where C is the number of EEG channels and K denotes the number of time samples per channel. Concretely, the shapes are 64 × 256, 64 × 256, and 32 × 256 for the THU, CAS, and GIST RSVP datasets, respectively.
3) Training and Test Sets: In the THU dataset, the first two blocks are used as the training dataset, and the remaining two blocks as the test dataset. In the CAS dataset, we adopted the cross-day experiment to evaluate performance. In the GIST dataset, we adopt the first 80% of the EEG data as the training dataset and the rest as the test dataset. In this way, training and test samples are collected from distinct time periods, thereby mitigating the influence of temporal correlation on classification and bolstering the reliability of the experimental results.

C. Baseline Methods
This study considers both the low signal-to-noise-ratio characteristics of EEG signals and the noisy labels in RSVP tasks. Therefore, the baseline methods are selected from two perspectives: 1) Representative methods that learn more discriminative representations from low signal-to-noise-ratio RSVP EEG data: four traditional SL methods and six deep learning methods that have been commonly used for RSVP EEG classification. The traditional SL methods include rLDA [23], HDCA [29], xDAWN-RG [30], [31], and XGB-DIM [32], and the deep learning methods consist of DeepConvNet [33], EEGNet [24], EEG-Inception [34], PLNet [19], PPNN [10], and DRL [20]. 2) Representative methods that alleviate the negative impact of noisy labels: we considered five methods, i.e., RGNN [21], JO-CapsNet [14], RP, TS, and CPC [16], which address the negative impact of noisy labels in EEG classification. RGNN requires prior information on the test distribution, which is not suitable for online EEG classification, and JO-CapsNet needs the noise rate, which is unquantifiable in RSVP tasks. Therefore, we only compare our algorithm with the remaining three SSL methods.

D. Experimental Protocol
Due to the extreme class imbalance between targets and non-targets in RSVP tasks, there is a tendency for the model

TABLE II CLASSIFICATION PERFORMANCE OF DIFFERENT METHODS ON THU RSVP DATASET (MEAN ± STANDARD DEVIATION)

TABLE III CLASSIFICATION PERFORMANCE OF METHODS ON CAS RSVP DATASET (MEAN ± STANDARD DEVIATION)
to be biased towards the majority class, thereby affecting the classification accuracy. To address this issue, we randomly selected samples from the non-target class so that the target and non-target classes have equal numbers of samples. It should be noted that this operation is applied exclusively to the training dataset. Besides the structure of the feature extractor, the proposed MTCN has no additional hyperparameters; the parameters of the feature extractor are the same as those of EEGNet. This study uses the PyTorch framework, and the model is trained for 60 epochs. Concretely, the Adam optimizer is used for model optimization with an initial learning rate of 0.0001. The source code is open-sourced at https://github.com/Tammie-Li/MTCN. For the baseline methods, the results of XGB-DIM are based on its open-source code; the remaining models are implemented in Python and PyTorch following the descriptions in the original papers.
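The class-balancing step described above can be sketched as a simple random undersampling of the non-target class; the function name and label convention (1 = target, 0 = non-target) are assumptions.

```python
import numpy as np

def balance_training_set(X, y, rng=0):
    """Sketch of the balancing step: randomly undersample the non-target
    class (label 0) so both classes contain equally many samples.
    Applied to the training set only, as in the text."""
    rng = np.random.default_rng(rng)
    tgt = np.flatnonzero(y == 1)
    non = np.flatnonzero(y == 0)
    keep = rng.choice(non, size=len(tgt), replace=False)
    idx = rng.permutation(np.concatenate([tgt, keep]))
    return X[idx], y[idx]

# Toy RSVP-like imbalance: 4 targets among 100 samples.
X = np.arange(100).reshape(100, 1)
y = np.array([1] * 4 + [0] * 96)
Xb, yb = balance_training_set(X, y)
```

Leaving the test set untouched preserves the natural class ratio that the evaluation metrics (BA, TPR, FPR, etc.) are meant to reflect.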

E. Evaluation Metrics
Following recent RSVP EEG classification studies [32], [35], six metrics, i.e., balanced accuracy (BA), false positive rate (FPR), true positive rate (TPR), F1-score, Cohen's kappa coefficient (KAPPA), and area under the curve (AUC), are used to comprehensively evaluate our method. The results are presented as mean ± standard deviation (SD) over all test subjects.
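These six metrics can be computed with scikit-learn as sketched below; the wrapper name `rsvp_metrics` is an illustrative assumption.

```python
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, roc_auc_score)

def rsvp_metrics(y_true, y_pred, y_score):
    """Sketch of the six evaluation metrics: BA, TPR, FPR, F1, KAPPA, AUC.
    y_pred are hard labels; y_score are target-class probabilities."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "BA": balanced_accuracy_score(y_true, y_pred),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "F1": f1_score(y_true, y_pred),
        "KAPPA": cohen_kappa_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }

m = rsvp_metrics([0, 0, 1, 1], [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Note that for FPR, unlike the other five metrics, lower is better, which matters when reading the "improvement" figures reported later.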

F. Experimental Results
The classification performance of each method under the various metrics on the three RSVP datasets is shown in Tables II, III, and IV. From these results, we make the following four observations: 1) MTCN achieves superior performance over all baseline methods. On the THU dataset, the average results of BA, TPR, FPR, F1-score, KAPPA, and AUC are improved by 3.42%, 1.45%, 4.94%, 6.23%, 3.47%, and 2.17%, respectively, compared with the best baseline methods. On the CAS dataset, MTCN outperforms the best baseline methods by 6.09%, 3.07%, 4.46%, 10.74%, 5.35%, and 4.34%. On the GIST dataset, the improvements are 4.02%, 0.00%, 7.13%, 12.79%, 7.90%, and 1.88%. These results verify the superior performance of MTCN in RSVP tasks. 2) MTCN has a stronger ability to extract generalizable representations than traditional SL methods.
Compared to XGB-DIM, the best-performing conventional algorithm, MTCN achieves an

TABLE IV CLASSIFICATION PERFORMANCE OF DIFFERENT METHODS ON GIST RSVP DATASET (MEAN ± STANDARD DEVIATION)
average improvement of 7.71% and 10.82% in all metrics on the three datasets. Compared with the best deep learning method, PPNN, these improvements are 3.69%, 9.75%, and 7.69%, respectively. These results suggest that the SSL tasks may be beneficial for extracting the intrinsic structural information of EEG signals, making the learned models more generalizable.
3) MTCN has the ability to alleviate the negative impact of noisy labels. Compared to the existing solutions RGNN and JO-CapsNet in the EEG emotion field, MTCN requires neither test-set data nor noise rates, making it more suitable for real-world applications. Compared to RP, TS, and CPC, which use SSL to alleviate the problem of noisy labels, MTCN improves all metrics by an average of 6.84%, 9.98%, and 9.64% on the three datasets. This may be because MTCN jointly trains the SL-based classification task and the SSL-based representation learning tasks, making the EEG structural information learned by the SSL tasks more compatible with the downstream classification task. 4) Bridging SL and SSL is more advantageous than using either approach alone. EEGNet, RP, TS, CPC, and MTCN share a similar feature extraction structure, where EEGNet uses only SL; RP, TS, and CPC use only SSL; and MTCN bridges both approaches. Across the three datasets, MTCN achieved average improvements of 4.64%, 10.32%, and 10.93% on all metrics compared to EEGNet. Compared to the best SSL method, the improvements were 6.18%, 9.99%, and 5.66%. These results demonstrate that MTCN can effectively bridge the advantages of SL and SSL, enabling the extraction of spatial-temporal features from EEG signals while achieving accurate classification.
To assess whether the proposed MTCN method is statistically superior to the baseline methods, we used repeated-measures ANOVA and pairwise comparisons with Bonferroni adjustment to analyze significant differences. Prior to conducting the statistical tests, we performed the Shapiro-Wilk test (S-W test) to confirm that the data followed the normality hypothesis, which is a requirement for analysis of variance. The results of the repeated-measures ANOVA, presented in Table V, indicate significant differences in all six metrics among the methods across the three datasets. Additionally, the results of the pairwise comparisons with Bonferroni adjustment at a confidence level of 0.05 are presented alongside the classification results in Tables II, III, and IV. Of the 234 pairwise comparisons, 216 showed p-values less than 0.05, indicating that MTCN significantly improved all six metrics compared to the other methods. In summary, the proposed MTCN method achieved significantly better classification performance than the baseline methods.
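A simplified version of this testing pipeline can be sketched with SciPy. Since SciPy offers no built-in repeated-measures ANOVA, the sketch substitutes Bonferroni-adjusted paired t-tests against each baseline after a Shapiro-Wilk normality check — an approximation of, not a reproduction of, the paper's procedure.

```python
import numpy as np
from scipy import stats

def paired_tests_bonferroni(scores, alpha=0.05):
    """Shapiro-Wilk normality check per method, then Bonferroni-adjusted
    paired t-tests of the proposed method (column 0) against each baseline.
    scores: (n_subjects, n_methods) array of per-subject metric values."""
    n_sub, n_meth = scores.shape
    normal = [stats.shapiro(scores[:, j]).pvalue > alpha for j in range(n_meth)]
    adj_p = []
    for j in range(1, n_meth):
        p = stats.ttest_rel(scores[:, 0], scores[:, j]).pvalue
        adj_p.append(min(1.0, p * (n_meth - 1)))  # Bonferroni adjustment
    return normal, adj_p

# Synthetic per-subject scores: method 0 is clearly better than the baseline.
rng = np.random.default_rng(0)
base = rng.normal(0.7, 0.02, 20)
m0 = base + 0.1 + rng.normal(0, 0.005, 20)
normal, adj = paired_tests_bonferroni(np.column_stack([m0, base]))
```

Multiplying each raw p-value by the number of comparisons is the standard Bonferroni correction and keeps the family-wise error rate at the nominal level.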

IV. DISCUSSION

A. Ablation Study
To assess the contribution of each essential SSL task in our model, experiments are conducted with ablated MTCN models on the THU dataset. The ablation study verifies the influence of each SSL task and of their combinations on the performance of EEG classification in RSVP tasks. Table VI presents the results of all ablated MTCN models. The suffix "-T" indicates ablation of the MTR task, "-S" indicates ablation of the MSR task, and "-D" indicates ablation of the feature decomposition (FD) module. By combination, a total of seven ablated models were obtained, namely MTCN-TSD, MTCN-TD, MTCN-SD, MTCN-TS, MTCN-T, MTCN-D, and MTCN-S. As in the main experiment, we first used repeated-measures ANOVA to assess the performance differences among the ablated models at a significance level of 0.05. It should be noted that post-hoc comparisons with Bonferroni correction were conducted for metrics with significant differences, while the Friedman test was used for those without. Upon observing the results presented in Table VI, the following three observations can be made: 1) MTCN outperforms all ablated models across all metrics. The performance of the model is affected when any module of MTCN is ablated. Compared with the results of all ablated models, MTCN achieved average improvements of 3.22%, 1.71%, 4.72%, 4.96%, 2.96%, and 1.36% on the six metrics, respectively. This observation emphasizes the positive impact of the two SSL tasks and the FD module on the performance of MTCN.
2) The ablation of the FD module has a significant impact on the model's performance. Although MTCN-TD achieved the best result in terms of TPR, it also achieved the worst results on four of the six metrics, which means that MTCN-TD tends to predict test EEG samples as targets. MTCN-D and MTCN-TSD achieved the remaining two worst results. The reason may be the feature redundancy that can occur during joint training of the SSL tasks and the supervised classification task in the absence of the FD module.
3) The ablation of the MTR and MSR tasks has a certain impact on the model's performance. The results of MTCN-T and MTCN-S are slightly lower than those of MTCN; across the six metrics, the average gaps are 1.52%, 0.41%, 2.63%, 3.21%, 1.79%, and 0.52%, respectively. This is because the MTR and MSR tasks are beneficial for extracting the temporal dynamics of EEG signals and establishing spatial relationships between different brain regions. Therefore, all modules in MTCN are verified to be effective for RSVP EEG classification. The labels for the MTR and MSR tasks are generated from the structural characteristics of the EEG signals themselves, without relying on label information from the RSVP classification task. This alleviates the negative impact of noisy labels while extracting the temporal dynamics and inter-regional relationships of the EEG signals. Furthermore, the FD module decomposes the features into shared and specific components, reducing feature redundancy and yielding more generalizable ERP features, which in turn mitigates, to some extent, the risk of overfitting caused by low signal-to-noise-ratio EEG signals.

B. Performance on the Cross-Set Experiment
To explore the generalization ability of MTCN, we further conducted experiments on the cross-set RSVP task. Since the EEG shapes are the same in the CAS and THU datasets, we used the CAS dataset as the training data and the THU dataset as the test data. We selected the top three methods from the basic experiments, i.e., EEGNet, PPNN, and DRL, as comparative algorithms, and used BA, KAPPA, and AUC as evaluation metrics. As shown in Fig. 5, the performance of MTCN is significantly better than that of the other methods, which may be because MTCN leverages the benefits of SSL in extracting high-level EEG representations.

Fig. 6. The performance of MTCN after using three different re-referencing strategies.

C. Performance Effect of Different Re-Reference Strategies
Re-referencing techniques aim to reduce conduction distortion caused by conductivity differences, thereby alleviating its impact on EEG signal analysis. Commonly used methods include the common average reference (CAR) [36], linked mastoids (LM) [37], and the reference electrode standardization technique (REST) [38]. CAR adopts the average value of all electrodes as the reference signal. LM uses the average of the electrode sites on the two mastoids (T7 and T8 in this experiment). REST calculates the potential of each electrode relative to a standard reference source by utilizing a physical model of scalp potentials. To assess the performance effect of re-referencing strategies, we compared the above three strategies on the first eight subjects of the THU dataset. As shown in Fig. 6, after applying the three different re-referencing strategies, there were only small performance differences in MTCN on the BA, KAPPA, and AUC evaluation metrics. This may be because MTCN can effectively learn the linear transformations introduced by re-referencing strategies.
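As a minimal sketch (not the paper's exact preprocessing code), CAR and LM re-referencing are simple linear operations on an epoch; REST is omitted here because it additionally requires a physical head model:

```python
import numpy as np

def rereference(eeg, method="car", mastoids=(0, 1)):
    """Re-reference one EEG epoch of shape (channels, samples).

    "car": subtract the mean of all electrodes (common average reference).
    "lm":  subtract the mean of the two mastoid channels (linked mastoids);
           `mastoids` holds their row indices (T7/T8 in the experiment above).
    """
    eeg = np.asarray(eeg, dtype=float)
    if method == "car":
        ref = eeg.mean(axis=0, keepdims=True)
    elif method == "lm":
        ref = eeg[list(mastoids)].mean(axis=0, keepdims=True)
    else:
        raise ValueError(f"unknown method: {method}")
    return eeg - ref
```

Because both strategies subtract a fixed linear combination of channels, a network with a learned spatial-filtering layer can in principle absorb the transformation, which is consistent with the small performance differences observed across strategies.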

D. Activity of EEG Electrodes
To explore the contribution of different brain regions in decoding RSVP EEG signals and assess the temporal feature extraction capability of MTCN, we used the MNE-Python toolbox [39] to generate the electrode activity maps in Fig. 7 based on the THU dataset. The values at each electrode location were obtained by calculating the L2 norm of the averaged features and mapping these values onto the corresponding electrode regions. Four types of features were obtained along both the data and feature dimensions: shared target features (Sh-TF), shared non-target features (Sh-NF), specific target features (Sp-TF), and specific non-target features (Sp-NF), computed from target/non-target samples with the shared/specific temporal feature extractor.
Based on the results shown in Fig. 7, we make the following observations. Along the data dimension, comparing Fig. 7(a) with Fig. 7(b) and Fig. 7(c) with Fig. 7(d), the brain activity induced by target signals is mainly concentrated at the occipital electrode sites, likely because the occipital lobe is strongly associated with visual information processing and the retrieval of memory for target images [40]. The brain activity induced by non-target signals is more distributed and tends to be located in the frontal lobe, which may be related to the subjects' focused and sustained attention during the RSVP experiment. Along the feature dimension, comparing Fig. 7(a) with Fig. 7(c) and Fig. 7(b) with Fig. 7(d), the brain activity maps obtained from the target-specific feature extractor exhibit higher activation intensity than those obtained from the shared feature extractor. This is likely because the shared feature extractor must balance the SL task for RSVP classification against the two SSL tasks, which can compromise the discriminative power of its extracted features.
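The per-electrode activity values described above (the L2 norm of trial-averaged features) can be sketched as follows; the resulting vector can then be drawn as a topographic map with `mne.viz.plot_topomap` given the montage's electrode positions. The function and array shapes here are illustrative, not the paper's code:

```python
import numpy as np

def electrode_activity(features):
    """Activity value per electrode from extracted features.

    features: array of shape (trials, channels, time), e.g. the output of the
    shared or specific temporal feature extractor for target or non-target
    trials. Returns the L2 norm of the trial-averaged time course per channel.
    """
    mean_feat = np.asarray(features).mean(axis=0)   # (channels, time)
    return np.linalg.norm(mean_feat, axis=1)        # (channels,)
```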

E. Representation Visualization
To verify the discriminative ability of the representations obtained by MTCN, the data representations of the three tasks were visualized using the t-distributed stochastic neighbor embedding (t-SNE) [41] method, as shown in Fig. 8. We visualized the distributions of four types of features, namely raw data, shared features, target-specific features, and fused features, for the first subject in the THU dataset. Fig. 8(a-d), (e-h), and (i-l) respectively show the distributions of these features for the RSVP classification task, the MTR task, and the MSR task. Observing the raw-data distribution shown in the first column of Fig. 8, the distribution of target and non-target data is scattered and disordered. For the MTR and MSR tasks, the distributions of the masked samples lie near the original samples and are difficult to distinguish. Observing the shared-feature distribution shown in the second column, the feature distributions become separable, indicating the strong spatio-temporal feature extraction capability of the shared feature extractor. However, the features of different categories remain close to each other in space, which may be because the shared feature extractor learns three tasks simultaneously, compromising the feature extraction for individual tasks to some extent. Observing the target-specific feature distribution shown in the third column, most of the feature distributions are separable, especially in the MSR task. In addition, as seen from the coordinate axes, features belonging to different categories are farther apart in space. This is because the target-specific feature extractor focuses on specific tasks and can learn spatial-temporal feature information better suited to those tasks. Observing the fused-feature distribution shown in the fourth column, the features of different categories have more distinct discriminative boundaries, which demonstrates the effectiveness of fusing shared and target-specific features and, to some extent, reflects the function of the feature decomposition module.
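The visualization step can be sketched with scikit-learn's t-SNE on hypothetical feature matrices (in the actual figure, the inputs would be the extractor outputs for one subject):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical 64-dim features: 100 "target" and 100 "non-target" samples.
feats = np.vstack([rng.normal(0.0, 1.0, (100, 64)),
                   rng.normal(2.0, 1.0, (100, 64))])
labels = np.r_[np.zeros(100, int), np.ones(100, int)]

# Embed to 2-D; scatter-plot `emb` coloured by `labels` to inspect separability.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(feats)
```

Note that t-SNE distances are only locally meaningful, so separability is judged by cluster overlap rather than absolute coordinates.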

F. Limitations and Future Directions
Although the proposed MTCN method achieves considerable classification performance for RSVP EEG signals, some limitations remain. On the one hand, the training time of MTCN is relatively long, and its SSL tasks require generating a large number of samples, which imposes high demands on the memory and computing power of the training equipment. This effect is not observed during testing, as the model's parameter count and inference speed are similar to those of most deep learning algorithms. In future research, we will optimize the design of the SSL tasks to reduce the algorithm's computational resource consumption while maintaining performance. On the other hand, in practical applications, high inter-subject variability, electrode shifts, and physiological-state changes inevitably lead to a mismatch between the training and test distributions. Although MTCN enhances the model's generalization ability, it still struggles to alleviate the negative impact of distribution differences. To address this issue, we will consider real-time model optimization during testing: before inferring each test sample, we will first use unlabeled data to adjust the model so that it adapts to the distribution differences between training and testing.
V. CONCLUSION
This paper proposes MTCN to achieve more accurate decoding of RSVP EEG signals. By combining the advantages of SL and SSL, MTCN reduces the risk of model overfitting and makes the learned ERP features more generalizable. The advantage of SL lies in extracting task-level features and determining the decision boundary between targets and non-targets; the advantage of SSL lies in mining the intrinsic structural information of the data. Through the designed MTR and MSR tasks, the SSL components proposed in this article enhance the model's temporal information extraction and construct relationships between different brain regions. To promote collaboration between SL and SSL, MTCN trains multiple tasks simultaneously and divides the features into task-shared and task-specific features. The shared features extract information common to the different tasks, while the specific features extract information personalized to each task. This approach reduces feature redundancy, alleviates the seesaw phenomenon in multi-task learning, and promotes information fusion among the different tasks. Extensive experiments on three public datasets demonstrate the superiority of the proposed MTCN method. On the whole, this study reduces the risk of overfitting and improves the performance of RSVP EEG classification in the presence of noisy data and labels, which promotes the development of RSVP-based BCI applications.

APPENDIX A RELATED WORKS
A. Existing Methods for RSVP EEG Classification
1) Conventional Methods: Earlier conventional methods first extract handcrafted spatial, temporal, and spectral features of RSVP EEG data, followed by classification using Fisher linear discriminant (FLD) algorithms or linear discriminant analysis (LDA). For example, Blankertz et al. [23] developed regularized linear discriminant analysis (rLDA) to accurately estimate the covariance matrix in high-dimensional spaces. This method utilizes shrinkage estimators to create a regularized version of LDA, which exhibits superior performance compared with other LDA-based approaches. Sajda et al. [29]
introduced the hierarchical discriminant component analysis (HDCA) algorithm, which utilizes FLD to train spatial weights and employs a logistic regression classifier to learn temporal weights and perform classification. The champion of the Kaggle BCI competition adopted the xDAWN-RG algorithm, which combines Riemannian geometry [31], xDAWN spatial filtering [30], L1 feature regularization, and channel subset selection. Li et al. [32] proposed an ensemble-learning-based algorithm, XGB-DIM, which adopts one global spatial-temporal filter and a group of local filters to extract discriminant information.
2) Deep Learning Methods: Deep learning methods have emerged as the dominant approach for enhancing RSVP EEG classification. Unlike conventional research that heavily relies on prior domain knowledge and expert-level experience, deep learning has the capability to automatically extract discriminative features from EEG data. For instance, Schirrmeister et al. [33] proposed an end-to-end network called DeepConvNet for generic EEG decoding tasks. This model extracts task-related information without the need for handcrafted features and demonstrates the potential of combining deep CNNs with advanced visualization techniques for EEG-based brain mapping. Lawhern et al. [24] developed EEGNet and achieved excellent classification performance on multiple EEG paradigms. That study employs depthwise and separable convolutions to build an EEG-specific network that incorporates various established EEG feature extraction concepts, including optimal spatial filtering and filterbank construction. Santamaria-Vazquez et al. [34] developed EEG-Inception, which utilizes inception modules to effectively extract temporal features at multiple temporal scales for ERP classification. Considering the phase-locked characteristics of event-related potential (ERP) components, Zang et al. [19] and Li et al. [10] proposed PLNet and PPNN to improve classification performance by capturing phase information from RSVP EEG data. Later, in another study, Li et al. [20] further considered the class-imbalance problem of RSVP tasks and proposed a DRL model to alleviate its negative impact.

B. Self-Supervised Learning
SSL employs pretext tasks to generate labels derived from the data itself, removing the need for external supervision. To obtain robust and valuable intrinsic representations, it is crucial to design appropriate pretext tasks for SSL. In the field of computer vision, Noroozi and Favaro [42] utilized jigsaw-puzzle tasks to learn general image features by predicting the relative position of patches. Komodakis and Gidaris [43] employed 2D rotation as a pretext task, where the model predicted the rotation angle to learn object position, type, and posture in the image. Caron et al. [44] proposed unsupervised visual feature learning by contrasting cluster assignments, leveraging contrastive methods without the need for pairwise comparisons. He et al. [45] highlighted the effectiveness of momentum contrast in bridging the gap between unsupervised and supervised representation learning. Later, in another study, He et al. [22] introduced the Masked Autoencoder (MAE) for visual feature learning; MAE masks random patches of the input image and reconstructs the missing patches in pixel space, thereby improving feature generalization. In the EEG feature-learning field, Xie et al. [46] introduced a novel approach that incorporates six distinct transformations to extract generalized EEG representations. Mohsenvand et al. [47] suggested learning representations using contrastive learning; their method recombines multiple channels and trains a channel-wise kernel to capture EEG emotion representations. Inspired by SSL, our work incorporates two mask tasks to assist in learning more discriminative ERP features while overcoming the challenge of noisy labels in RSVP EEG. Different from previous SSL methods, the proposed MTCN incorporates multiple RSVP-related tasks and leverages all task data to facilitate knowledge sharing. This approach enhances the model's learning capacity, generalization ability, and robustness. Additionally, the design of the SSL tasks in MTCN takes into account the characteristics of RSVP EEG signals: through the masked recognition tasks, MTCN learns the relationships between brain regions and the temporal dynamics of EEG signals, which is crucial for RSVP EEG classification.

APPENDIX B CALCULATION FORMULA FOR EVALUATION METRICS
In this appendix, we introduce the calculation methods for the six evaluation metrics. AUC denotes the area under the ROC curve, which depicts the ratio of true positives to false positives across various threshold values. AUC is a threshold-independent metric that quantifies the probability that a randomly selected positive instance receives a higher decision value than a randomly selected negative instance.
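The probabilistic interpretation of AUC stated above translates directly into a pairwise-comparison sketch (ties receive half credit):

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC as P(random positive outscores random negative), ties counted 0.5."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise count agrees with the trapezoidal area under the ROC curve, e.g. as computed by `sklearn.metrics.roc_auc_score`.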

APPENDIX C CONFUSION MATRICES
Furthermore, confusion matrices are constructed for each classification task, visually presenting the performance of MTCN across all categories and tasks. As depicted in Fig. 9, the test results of all participants in each dataset were aggregated into a single confusion matrix. It should be noted that in the MTR and MSR tasks, only 10% of the data is randomly sampled for computation due to limits on memory and computing resources. From the results presented in Fig. 9(a), (b), and (c), we find that MTCN achieved excellent classification performance on the primary task across the three RSVP datasets. In particular, MTCN discriminates the non-target class more accurately on all three datasets, which may be attributed to the abundance of non-target samples, which carry more generalized feature information than the target samples. Figures 9(d)-(e), (f)-(g), and (h)-(i) intuitively display the classification results of the various classes in the MTR and MSR tasks on the three datasets. It can be observed that the classification performance of the MTR task is not ideal on VP1, N1, VN1, P2, and N3, which may be due to the overly fine division of ERP components and considerable redundant information among these components, making them difficult for MTCN to distinguish. The MSR task achieved good performance on all three datasets, which partially demonstrates the necessity of the various brain regions for decoding EEG signals.
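Pooling all participants' test results into one confusion matrix, as done for Fig. 9, can be sketched as follows (the per-subject label lists below are hypothetical):

```python
import numpy as np

def aggregate_confusion(y_true_per_subject, y_pred_per_subject, n_classes):
    """Pool per-subject label/prediction lists into a single confusion matrix.

    Entry [i, j] counts samples of true class i predicted as class j,
    summed over every subject's test set.
    """
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for y_true, y_pred in zip(y_true_per_subject, y_pred_per_subject):
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
    return cm
```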

Manuscript received 25 October 2023; revised 9 January 2024; accepted 18 January 2024. Date of publication 24 January 2024; date of current version 1 February 2024. This work was supported in part by the STI 2030-Major Projects under Grant 2022ZD0208504; in part by the National Natural Science Foundation of China under Grant U22A2059, Grant U1913202, and Grant 62203460; in part by the Major Project of the Natural Science Foundation of Hunan, China, under Grant 2021JC0004; and in part by the Key Laboratory of Space Flight Dynamics Technology under Grant 2022-JYAPAF-F1028. (Corresponding author: Jingsheng Tang.)

Fig. 1. Pipeline of the proposed MTCN: (1) from the temporal and spatial dimensions respectively, the original EEG dataset is masked to generate auxiliary datasets for the MTR and MSR tasks; (2) the model is optimized jointly for the primary task and the two auxiliary SSL tasks. Note that the three tasks share a common task-shared feature extractor (pink), and each task has a private task-specific feature extractor (yellow/blue/green).

Fig. 2. Masked temporal recognition task. This task transforms the EEG data by randomly masking one ERP component with Gaussian noise. The goal is to identify which ERP component is masked.

Fig. 3. Masked spatial recognition task. Taking a 64-channel EEG recording device as an example, the electrodes are divided into eight blocks according to the locations of brain regions. This task transforms the EEG data by randomly masking the electrode channels of one brain region with Gaussian noise. The goal is to identify which brain region is masked.
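The two masking operations in Figs. 2 and 3 can be sketched as below. The window boundaries and region blocks are illustrative placeholders; in the paper, temporal windows correspond to ERP components and the 64 electrodes are grouped into eight brain-region blocks:

```python
import numpy as np

def mask_temporal(x, windows, rng):
    """MTR sample: replace one ERP time window with Gaussian noise.

    x: (channels, samples) epoch; windows: list of (start, end) sample ranges.
    Returns the masked copy and the masked-window index (the SSL label).
    """
    k = int(rng.integers(len(windows)))
    s, e = windows[k]
    xm = x.copy()
    xm[:, s:e] = rng.normal(0.0, x.std(), size=(x.shape[0], e - s))
    return xm, k

def mask_spatial(x, blocks, rng):
    """MSR sample: replace all channels of one brain-region block with noise."""
    k = int(rng.integers(len(blocks)))
    xm = x.copy()
    xm[blocks[k], :] = rng.normal(0.0, x.std(), size=(len(blocks[k]), x.shape[1]))
    return xm, k
```

Because the SSL label is the masked index, it is derived from the data's own structure and never touches the (possibly noisy) RSVP class labels.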

Fig. 4. The feature extraction and decomposition in detail. The task-specific feature extractor includes a temporal block and a depth block; the task-shared feature extractor adds a separable block on this basis. Note that the input and output dimensions of S(·) are the same.

Fig. 5. The results of cross-set experiments from the CAS to the THU dataset.

Fig. 7. EEG electrode activity maps. Sh-TF, Sh-NF, Sp-TF, and Sp-NF respectively denote the features of target/non-target samples generated by the shared/specific feature extractors.

Fig. 8. The t-SNE of different feature spaces in MTCN: the first column denotes raw data; the second, the output of the shared-feature extractor; the third, the output of the specific-feature extractor; and the last, the output of the projection head.
The F1-score is calculated as:
$$\mathrm{F1} = \frac{2TP}{2TP + FN + FP} \tag{10}$$
where TP denotes the number of accurately classified positive samples, FN the number of incorrectly classified positive samples, TN the number of correctly classified negative samples, and FP the number of incorrectly classified negative samples. It should be noted that the test set in this study is extremely imbalanced, which leads to poor results in terms of the F1-score. The calculation process of KAPPA is as follows:
$$P_o = Acc = \frac{TP + TN}{TP + TN + FP + FN}, \qquad P_c = \frac{TP^2 + TN^2}{(TP + FN + FP + TN)^2}, \qquad \mathrm{KAPPA} = \frac{P_o - P_c}{1 - P_c} \tag{11}$$
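As a sketch, the metrics in Eqs. (10) and (11) follow directly from the four confusion counts; the counts below are hypothetical, chosen to mimic an imbalanced RSVP test set:

```python
def metrics_from_counts(TP, FN, TN, FP):
    """F1 (Eq. 10) and KAPPA (Eq. 11) from binary confusion counts."""
    n = TP + TN + FP + FN
    f1 = 2 * TP / (2 * TP + FN + FP)        # Eq. (10)
    p_o = (TP + TN) / n                     # observed agreement = accuracy
    p_c = (TP ** 2 + TN ** 2) / n ** 2      # chance agreement, as in Eq. (11)
    kappa = (p_o - p_c) / (1 - p_c)
    return f1, kappa

# Hypothetical imbalanced split with a small target class:
f1, kappa = metrics_from_counts(TP=40, FN=10, TN=940, FP=10)
```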

Fig. 9. Confusion matrices for the primary, MTR, and MSR tasks based on the THU, CAS, and GIST datasets, respectively.

TABLE I
EEG ELECTRODES ASSOCIATED WITH EACH BRAIN REGION

Algorithm 1 The Training Pipeline of MTCN
Require: Source dataset D = {X, Y}; task model; number of iterations N; learning rate η
Ensure: Optimal task model *
1: Generate the auxiliary datasets for the MTR and MSR tasks by Eq. 2 and Eq. 4, respectively: D_s = {X_s, Y_s} ← X and D_t = {X_t, Y_t} ← X;
2: for i = 1: N do

TABLE V
THE RESULTS OF REPEATED-MEASURES ANOVA ON THE THREE DATASETS

TABLE VI
RESULTS OF ABLATION EXPERIMENTS ON THE THU RSVP DATASET (MEAN ± STANDARD DEVIATION)