A Cross-Scale Transformer and Triple-View Attention Based Domain-Rectified Transfer Learning for EEG Classification in RSVP Tasks

Rapid serial visual presentation (RSVP)-based brain-computer interfaces (BCIs) are a promising target detection technique that uses electroencephalogram (EEG) signals. However, existing deep learning approaches seldom consider the dependencies of multi-scale temporal features and discriminative multi-view spectral features simultaneously, which limits the representation learning ability of the model and undermines EEG classification performance. In addition, recent transfer learning-based methods generally fail to obtain transferable cross-subject invariant representations and commonly ignore individual-specific information, leading to poor cross-subject transfer performance. In response to these limitations, we propose a cross-scale Transformer and triple-view attention based domain-rectified transfer learning framework (CST-TVA-DRTL) for RSVP classification. Specifically, we first develop a cross-scale Transformer (CST) to extract multi-scale temporal features and exploit the dependencies among features at different scales. Then, a triple-view attention (TVA) is designed to capture spectral features from three views of multi-channel time-frequency images. Finally, a domain-rectified transfer learning (DRTL) framework is proposed to simultaneously obtain transferable domain-invariant representations and untransferable domain-specific representations, and then utilize the domain-specific information to rectify the domain-invariant representations to adapt to the target data. Experimental results on two public RSVP datasets suggest that our CST-TVA-DRTL outperforms state-of-the-art methods in the RSVP classification task. The source code of our model is publicly available at https://github.com/ljbuaa/CST_TVA_DRTL.


I. INTRODUCTION
THE electroencephalogram (EEG)-based brain-computer interface (BCI) is a promising interactive technology that empowers humans to interact with computers directly through brain signals [1], [2]. Rapid serial visual presentation (RSVP) is a well-established BCI paradigm that has been widely used in spellers [3] and image retrieval [4]. Nevertheless, noisy single-trial EEG data and the class imbalance problem hinder EEG classification methods from achieving better performance on the RSVP task [5], [6], [7].
Many researchers have proposed various temporal feature extraction methods to improve EEG classification performance in the RSVP task [8], [9]. For example, hierarchical discriminant component analysis (HDCA) introduced a group of spatial-temporal filters to extract discriminative temporal information from single-trial EEG signals [10]. With the successful application of deep learning technology in various fields, several deep learning-based frameworks have been proposed for EEG classification. The deep ConvNet (DCN) uses temporal convolution and spatial convolution to extract spatio-temporal features from EEG data and realizes end-to-end EEG classification [11]. Similarly, EEGNet captures EEG spatio-temporal features through depthwise and separable convolution layers with fewer parameters [5]. In order to extract multi-scale temporal features, Santamaría-Vázquez et al. developed an EEGInception method that incorporates the inception module into the convolutional network [12]. The inception module uses three convolutions with different kernel sizes to extract temporal features at multiple scales. This approach effectively improves the classification accuracy of the RSVP task. However, it ignores the dependencies among temporal features at different scales, which may cause information redundancy in the multi-scale temporal features and thus limit further improvement of model performance.
Additionally, recent studies have demonstrated that spectrograms of EEG signals can provide discriminative features for EEG classification. For instance, Kang et al. converted EEG data into spectrogram images by the short-time Fourier transform and then used an ensemble of convolutional neural networks (CNNs) to capture spectral features from the time-frequency view [13]. To better discover important spectral features, Zhang et al. adopted the channel attention mechanism to enhance the spectral features on important channels after converting non-stationary EEG signals into multi-channel spectrogram images through the continuous wavelet transform (CWT) [14]. These studies effectively improved the performance of EEG classification models by exploiting discriminative time-frequency features. However, the aforementioned works only capture spectral features from a single time-frequency view, ignoring the correlation of spectral features between different channels in multi-channel EEG spectrograms, which undermines the discriminability of the spectral features.
In the RSVP classification task, the risk of overfitting caused by insufficient training data and class imbalance limits the flexible application of RSVP-based BCIs [15], [16]. Recently, some studies have introduced transfer learning strategies to enable rapid deployment of RSVP-based BCIs. For example, Wei et al. proposed a multi-source transfer learning framework based on domain adversarial training, which reduces the amount of data required to train models on new subjects by learning the common features of other subjects' EEG data [17]. Similarly, He et al. developed a transfer learning method based on Euclidean-space data alignment, which improves the learning performance for new subjects by aligning EEG trials from different subjects in Euclidean space [18]. The above studies show that the cross-subject transfer performance of deep learning methods can be effectively improved by learning the common EEG features among multiple subjects. However, due to substantial inter-individual variability in EEG signals [19], existing transfer learning methods tend to overlook individual-specific features and consequently compromise transfer performance by incorporating untransferable individual information into the common features.
In response to the above issues, we propose a cross-scale Transformer and triple-view attention based domain-rectified transfer learning framework (CST-TVA-DRTL) for RSVP classification. First, a cross-scale Transformer (CST) temporal feature extractor is employed to extract multi-scale temporal features from EEG signals, which can characterize the dependencies of temporal features across scales and reduce redundant information. Second, a triple-view attention (TVA) spectral feature extractor is proposed, which can capture multi-channel spectral features from the spectral-temporal, spatio-temporal, and spatio-spectral views. Finally, we design a domain-rectified transfer learning (DRTL) strategy that simultaneously encodes transferable domain-invariant representations and untransferable domain-specific representations of EEG features, and then uses the domain-specific representations to rectify the domain-invariant representations to adapt to the target domain. Two public datasets were used to evaluate the proposed CST-TVA-DRTL method, and the experimental results show that the proposed method is superior to state-of-the-art methods in the RSVP classification task, which proves the effectiveness of our proposed CST-TVA-DRTL method.
The contributions of this study are fourfold: (1) We propose a novel CST-TVA-DRTL method to detect ERPs from single-trial EEG signals for RSVP-based BCIs, which effectively improves RSVP classification performance by extracting multi-scale temporal features and multi-view spectral features from EEG data and adopting a domain-rectified transfer learning approach.
(2) A CST is designed to extract multi-scale temporal features from EEG data and reduce redundant information, which can significantly enhance the representation ability of extracted temporal features.
(3) We develop a TVA to capture spectral features of multi-channel EEG spectrograms from the spectral-temporal, spatio-temporal, and spatio-spectral views, which can provide more discriminative spectral representations.
(4) We propose a DRTL framework that can simultaneously obtain transferable domain-invariant representations and untransferable domain-specific representations, and use domain-specific representations to rectify the domain-invariant representations, which can enhance cross-subject transfer performance.

II. METHODOLOGY

A. Overview
The CST-TVA-DRTL framework mainly consists of two stages, pre-training and fine-tuning, as shown in Fig. 1. In the pre-training stage, the data of all subjects except the target subject are used to train the whole framework. In the fine-tuning stage, the target-domain data are used to fine-tune the parameters of the domain-specific feature encoder to rectify the domain-invariant representation.
Regarding the framework structure, the spatial filtering algorithm xDAWN [20] is first used to filter the raw EEG signals to enhance the P300 evoked potentials, and the CWT is adopted to convert the EEG data into multi-channel spectrogram images. Then, the cross-scale Transformer temporal feature extractor is constructed to capture multi-scale temporal features of the EEG signal and characterize the dependencies among temporal features at different scales. Meanwhile, the triple-view attention spectral feature extractor is adopted to extract spectral features from multiple views of the multi-channel EEG spectrogram. Next, the domain-specific and domain-invariant feature encoders are used to obtain domain-specific and domain-invariant representations, respectively. The domain rectification block uses the domain-specific representations to rectify the domain-invariant representations, and the rectified representations are used for the final RSVP classification.
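To make this front end concrete, the sketch below applies xDAWN spatial filtering and a per-channel CWT using MNE and PyWavelets. The component count, frequency grid, and Morlet mother wavelet are illustrative assumptions, not settings reported in this paper.

```python
import numpy as np
import mne
import pywt

def xdawn_cwt_frontend(epochs, n_components=8, freqs=np.arange(2.0, 31.0), wavelet="morl"):
    # xDAWN spatial filtering to enhance the P300 evoked response.
    xd = mne.preprocessing.Xdawn(n_components=n_components)
    xd.fit(epochs)
    sources = xd.transform(epochs)  # (n_epochs, n_components * n_event_types, n_times)

    # Map the desired frequencies to CWT scales for the chosen mother wavelet.
    fs = epochs.info["sfreq"]
    scales = pywt.central_frequency(wavelet) * fs / freqs

    # CWT per epoch and channel -> multi-channel spectrogram images U of shape (C, F, T).
    n_ep, n_ch, n_t = sources.shape
    images = np.empty((n_ep, n_ch, len(freqs), n_t))
    for e in range(n_ep):
        for c in range(n_ch):
            coef, _ = pywt.cwt(sources[e, c], scales, wavelet, sampling_period=1.0 / fs)
            images[e, c] = np.abs(coef)
    return sources, images
```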

B. CST Temporal Feature Extractor
By averaging multiple trials of EEG signals, significant differences can be observed between the waveform of the P300 signal evoked by target images and the EEG signals evoked by non-target images. In order to capture this time-scale difference, multi-scale temporal features are first extracted from the EEG signal by a set of parallel convolution blocks:

$$MS_s = \mathrm{Conv}_k(X), \quad s = 1, \dots, S$$
where $X \in \mathbb{R}^{C \times T}$ is the EEG signal obtained through the xDAWN spatial filter, $C$ is the number of channels, $T$ is the time length, and $S$ is the number of scales. $\mathrm{Conv}_k(\cdot)$ denotes a convolution block with kernel size $k = T/2^s$; each convolution block consists of a convolution layer, a batch normalization layer, and an ELU activation layer. $MS_s$ is the $s$-th scale temporal feature of the EEG signal. In order to enhance key features at different scales by adaptive weighting, a convolution layer with a $1 \times 1$ kernel followed by a sigmoid function is adopted to calculate adaptive weights. Then, the weighted multi-scale features $\widetilde{MS}_s$ are calculated as follows:

$$AW_s = \sigma(\mathrm{Conv}_{1\times 1}(MS_s)), \quad \widetilde{MS}_s = AW_s \odot MS_s$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\mathrm{Conv}_{1\times 1}(\cdot)$ is a $1 \times 1$ convolution layer, and $AW_s$ are the adaptive weights.
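A minimal PyTorch sketch of this step is given below. The number of feature maps per scale `d` and the "same" padding are our choices for illustration: parallel convolution blocks produce $MS_s$, and a $1 \times 1$ convolution with a sigmoid supplies the adaptive weights $AW_s$.

```python
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch, d, T, n_scales=3):
        super().__init__()
        self.branches, self.gates = nn.ModuleList(), nn.ModuleList()
        for s in range(1, n_scales + 1):
            k = max(T // (2 ** s), 1)  # kernel size k = T / 2^s
            self.branches.append(nn.Sequential(
                nn.Conv1d(in_ch, d, kernel_size=k, padding="same"),
                nn.BatchNorm1d(d),
                nn.ELU(),
            ))
            # 1x1 convolution + sigmoid yields the adaptive weights AW_s.
            self.gates.append(nn.Sequential(nn.Conv1d(d, d, kernel_size=1), nn.Sigmoid()))

    def forward(self, x):  # x: (batch, channels, time)
        out = []
        for conv, gate in zip(self.branches, self.gates):
            ms = conv(x)               # MS_s
            out.append(gate(ms) * ms)  # AW_s * MS_s, one weighted map per scale
        return out
```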
Owing to the interdependencies among multi-scale features, ordinary self-attention mechanisms may inadequately capture cross-scale dependencies, potentially leading to excessive redundancy in the fused multi-scale features [21]. To address this limitation, we propose a cross-scale multi-head self-attention block that effectively models the cross-scale dependence between multi-scale temporal features, thereby reducing redundancy in the fused multi-scale features. In the cross-scale multi-head self-attention block, two linear transformation layers are first used to transform the weighted multi-scale temporal features $\widetilde{MS}_s \in \mathbb{R}^{d \times T}$ into the matrices $V_s \in \mathbb{R}^{d \times T}$ and $K_s \in \mathbb{R}^{d \times T}$, respectively. The query matrix $Q_s \in \mathbb{R}^{d \times T}$ is obtained by a linear transformation of the smaller-scale features $\widetilde{MS}_{s+1}$. The transformations are defined as:

$$V_s = W_V^s \widetilde{MS}_s, \quad K_s = W_K^s \widetilde{MS}_s, \quad Q_s = W_Q^s \widetilde{MS}_{s+1}$$

where $W_V^s$, $W_K^s$, and $W_Q^s$ are the learnable matrices of the linear layers. Then, the cross-scale attention $CA_s$ can be obtained by:

$$CA_s = \mathrm{Softmax}\left(\frac{Q_s K_s^{*}}{\sqrt{d}}\right) V_s$$

where $K_s^{*}$ is the transpose of $K_s$, $\sqrt{d}$ is the scaling factor, and $\mathrm{Softmax}(\cdot)$ is the softmax function.
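The single-head core of this block can be sketched as follows; features are transposed to (T, d) so that nn.Linear acts on the feature dimension, and the multi-head split is omitted for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)  # W_Q^s
        self.w_k = nn.Linear(d, d, bias=False)  # W_K^s
        self.w_v = nn.Linear(d, d, bias=False)  # W_V^s
        self.scale = d ** 0.5

    def forward(self, ms_s, ms_next):
        # ms_s, ms_next: (batch, T, d); queries come from the (s+1)-th, smaller scale.
        q = self.w_q(ms_next)
        k, v = self.w_k(ms_s), self.w_v(ms_s)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # CA_s
```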
The multi-scale features obtained through cross-scale multi-head attention [22] are fused through the concatenation function and a feed-forward network after linear transformation. The formulas for the CST to obtain the fused multi-scale temporal features are as follows:

$$MT = \mathrm{FFN}(\mathrm{LN}(\mathrm{Concat}(MS'_1, \dots, MS'_S)))$$

where $\mathrm{Concat}(\cdot)$ is the concatenation function, $MS'_s = P_s(CA_s)$, $P_s$ are linear transformation layers, $\mathrm{FFN}(\cdot)$ is the feed-forward network, and $\mathrm{LN}(\cdot)$ denotes layer normalization.

C. TVA Spectral Feature Extractor

Many studies have shown that the spectral characteristics of EEG are helpful in classifying EEG signals [23], [24], [25]. However, converting EEG signals into multi-channel spectrogram images through the CWT may introduce redundant information into the spectrogram images. In order to extract features from high-dimensional time-frequency images more effectively, we construct a TVA spectral feature extractor to explore the spectral features.
Fig. 3 presents the diagram of the TVA. Concretely, a spatial convolution block, a spectral convolution block, and a temporal convolution block are first used to obtain triple-view features of the multi-channel time-frequency images $U \in \mathbb{R}^{C \times F \times T}$, where $F$ is the frequency dimension. The spatial convolution block consists of a convolution layer with a $1 \times C$ convolution kernel and a reshape layer. Similarly, the convolution kernels in the spectral convolution block and the temporal convolution block are $1 \times F$ and $1 \times T$, respectively. The triple-view attention can be expressed by the following formula:

$$A_{spa} = \mathrm{spa}(U), \quad A_{spe} = \mathrm{spe}(U), \quad A_{tem} = \mathrm{tem}(U)$$

where $\mathrm{spa}(\cdot)$, $\mathrm{spe}(\cdot)$, and $\mathrm{tem}(\cdot)$ represent the spatial, spectral, and temporal convolution blocks, respectively. With the triple-view attention, the output spectral features of the TVA unit are calculated by the following formula:

$$U' = U \odot A_{spa} \odot A_{spe} \odot A_{tem}$$

where $\odot$ means element-wise multiplication.
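A compact PyTorch sketch of the three view branches is given below. Each convolution kernel spans one full axis of U, mirroring the 1×C, 1×F, and 1×T kernels above; the sigmoid activations and the fusion by broadcast multiplication are our assumptions about details the text leaves open.

```python
import torch.nn as nn

class TripleViewAttention(nn.Module):
    def __init__(self, n_ch, n_freq, n_time):
        super().__init__()
        self.spa = nn.Conv2d(1, 1, kernel_size=(n_ch, 1))    # collapses the channel axis
        self.spe = nn.Conv2d(1, 1, kernel_size=(n_freq, 1))  # collapses the frequency axis
        self.tem = nn.Conv2d(1, 1, kernel_size=(n_time, 1))  # collapses the time axis
        self.act = nn.Sigmoid()

    def forward(self, u):  # u: (B, C, F, T) multi-channel spectrogram images
        b, c, f, t = u.shape
        a_spa = self.act(self.spa(u.reshape(b, 1, c, f * t))).reshape(b, 1, f, t)
        a_spe = self.act(self.spe(u.permute(0, 2, 1, 3).reshape(b, 1, f, c * t))).reshape(b, c, 1, t)
        a_tem = self.act(self.tem(u.permute(0, 3, 1, 2).reshape(b, 1, t, c * f))).reshape(b, c, f, 1)
        # Broadcast the three attention maps over U (element-wise modulation).
        return u * a_spa * a_spe * a_tem
```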

D. Domain-Rectified Transfer Learning
Transfer learning is an effective strategy to improve the performance of deep learning models on EEG datasets with few labeled samples [26], [27], [28]. Existing transfer learning methods usually aim to learn domain-invariant features of EEG data from different subjects [29]. However, learning domain-invariant representations is difficult due to the large individual differences in EEG signals, which affects the performance of transfer learning methods. To address this issue, we design a domain-rectified transfer learning (DRTL) framework, which adapts to the target domain by rectifying the domain-invariant representation with the target domain-specific representation.
The diagram of domain-rectified representation learning is shown in Fig. 1. In particular, a domain-specific feature encoder $E_\phi$ and a domain-invariant feature encoder $E_\varphi$ are used to obtain the domain-specific representations $z' \in \mathbb{R}^L$ and the common domain-invariant representations $z \in \mathbb{R}^L$ of different subjects' EEG data, respectively, where $L$ denotes the dimension of the representation vectors. The domain-specific and domain-invariant feature encoders have the same network structure, both consisting of a convolutional block and a fully-connected (FC) layer. To make the domain-invariant representations adaptive to different domains, the domain-specific representations are used to rectify the domain-invariant representations. The domain rectification is calculated as follows:

$$z^{*} = \mathrm{ELU}(F_z([z, z']))$$

where $z^{*}$ denotes the rectified representations, $[\cdot, \cdot]$ denotes concatenation, $\mathrm{ELU}(\cdot)$ is the exponential linear unit activation function, and $F_z$ denotes the FC layer. Then, the rectified representation is input into the task classifier $G_\psi$ for RSVP classification. $\hat{y} = G_\psi(z^{*})$ is the predicted class label, and the classification loss is defined as follows:

$$\mathcal{L}_{cls} = -\frac{1}{H}\sum_{h=1}^{H}\left[y_h \log \hat{y}_h + (1 - y_h)\log(1 - \hat{y}_h)\right]$$

where $y_h$ and $\hat{y}_h$ are the actual and predicted labels of the $h$-th sample, respectively, and $H$ is the total number of samples. In the pre-training stage, in order to make the domain-specific feature encoder learn more domain-specific information, a domain classifier $D_{dc}$ is used to classify the domain-specific representation. $\hat{d} = D_{dc}(z')$ is the predicted domain label. The domain-specific feature encoder is constrained by minimizing the domain classification loss, which is defined as follows:

$$\mathcal{L}_{dc} = -\frac{1}{H}\sum_{h=1}^{H}\sum_{j=1}^{J} d_h^{j} \log \hat{d}_h^{j}$$

where $d_h^{j}$ and $\hat{d}_h^{j}$ are the actual and predicted domain labels of the $h$-th sample, respectively, and $J$ is the total number of domains. Meanwhile, similar to DANN, a domain discriminator $D_{dis}$ is used to identify domain labels for domain-invariant representation learning. To constrain the distance between features of different domains, we confuse the domain discriminator by maximizing the domain discrimination loss [30]. $\hat{d} = D_{dis}(z)$ is the predicted domain label, and the domain discrimination loss is defined as follows:

$$\mathcal{L}_{dis} = -\frac{1}{H}\sum_{h=1}^{H}\sum_{j=1}^{J} d_h^{j} \log \hat{d}_h^{j}$$
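The sketch below shows the rectification step and the three pre-training losses in PyTorch. The concatenation inside $F_z$, the unit loss weights, and the use of a gradient reversal layer (following DANN, which the adversarial part cites) are our assumptions where the text does not pin down the details.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so minimizing the discriminator loss maximizes domain confusion upstream."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out.neg()

class DomainRectifier(nn.Module):
    """z* = ELU(F_z([z, z'])); fusing by concatenation is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.fz = nn.Linear(2 * dim, dim)
        self.elu = nn.ELU()

    def forward(self, z, z_spec):
        return self.elu(self.fz(torch.cat([z, z_spec], dim=-1)))

def drtl_pretrain_loss(cls_logits, y, dc_logits, d, z, domain_discriminator):
    """Task loss on z*, domain classification on z', adversarial loss on z."""
    ce = nn.CrossEntropyLoss()
    l_cls = ce(cls_logits, y)   # L_cls: RSVP classification on the rectified z*
    l_dc = ce(dc_logits, d)     # L_dc: domain classifier on the specific z'
    l_dis = ce(domain_discriminator(GradReverse.apply(z)), d)  # L_dis via reversal
    return l_cls + l_dc + l_dis
```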

III. EXPERIMENTS

A. RSVP Datasets
Two publicly available RSVP target detection EEG datasets are used to evaluate the performance of the proposed CST-TVA-DRTL: Tsinghua RSVP dataset [31] and PhysioNet RSVP dataset [32].The RSVP paradigms are shown in Fig. 4.
1) Tsinghua RSVP dataset: This dataset comprises RSVP EEG data collected from 64 subjects (32 female). Each subject observed 160 sequences of stimulus images. Within each sequence, 100 distinct images were presented for 0.1 s each. Street-view images containing people are targets, while street-view images without people are non-targets. The number of non-target images is approximately 64 times the number of target images. EEG recordings were obtained using a 64-channel Synamps2 system at a sampling rate of 1,000 Hz. The electrodes were placed according to the international 10-20 system.

2) PhysioNet RSVP dataset: This dataset encompasses RSVP EEG data acquired from 10 subjects (4 female) while viewing 8 sequences of images at a presentation rate of 10 Hz. The stimulus images are divided into target images containing an airplane and non-target images without an airplane. Among these stimulus images, the number of target images is only one tenth of the total. The EEG recordings were captured using a 64-channel BioSemi ActiveTwo system at a sampling rate of 2048 Hz, but only 8 channels (P7, P8, PO7, PO3, PO4, PO8, O1, and O2) are available. Electrode placement followed the international 10-20 system.

B. Data Preprocessing
Before the classification experiments, we used the Python EEG toolkit MNE to preprocess the raw EEG signals, following the data preprocessing procedure in [31]. First, the electrooculography channels were removed, and the EEG data were processed by a band-pass filter with a passband of [2, 30] Hz. Then, EEG epochs were extracted according to the event triggers within the time interval of [−200, 1000] ms, where the [−200, 0] ms segment was used for baseline correction. In order to reduce the amount of data, the epochs were down-sampled to 100 Hz before model training.
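A minimal MNE sketch of this pipeline is shown below; the event array and event IDs are dataset-specific inputs, and EOG channel removal is assumed to have been done beforehand.

```python
import mne

def preprocess(raw, events, event_id):
    raw = raw.copy().filter(l_freq=2.0, h_freq=30.0)       # [2, 30] Hz band-pass
    epochs = mne.Epochs(raw, events, event_id=event_id,
                        tmin=-0.2, tmax=1.0,               # [-200, 1000] ms window
                        baseline=(-0.2, 0.0),              # correct on [-200, 0] ms
                        preload=True)
    epochs.resample(100)                                   # down-sample to 100 Hz
    return epochs.get_data(), epochs.events[:, -1]         # (n_epochs, C, T), labels
```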

C. Evaluation Metrics and Comparison Models
Since the number of non-target samples far exceeds the number of target samples, RSVP classification has a class imbalance problem. In order to effectively evaluate the performance of the proposed CST-TVA-DRTL in the RSVP classification task, the balanced accuracy (BA), true positive rate (TPR), true negative rate (TNR), and area under the receiver operating characteristic curve (AUC) are used as evaluation metrics. BA is the average of TPR and TNR and measures the average classification accuracy over target and non-target samples.
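These metrics follow directly from the confusion matrix and the predicted scores; a small helper using scikit-learn, with labels coded 1 for target and 0 for non-target:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def rsvp_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)            # sensitivity on target trials
    tnr = tn / (tn + fp)            # specificity on non-target trials
    ba = (tpr + tnr) / 2.0          # balanced accuracy = mean of TPR and TNR
    auc = roc_auc_score(y_true, y_score)
    return ba, tpr, tnr, auc
```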
To verify the classification performance of the proposed CST-TVA-DRTL method, five widely used EEG classification methods, including EEGNet, Deep ConvNet (DCN), EEGInception, EEGConformer, and STSTNet, were used for fair comparative experiments:

1) EEGNet [5]: This is a lightweight convolutional neural network that extracts the spatio-temporal features of EEG through convolution, depthwise convolution, and separable convolution. EEGNet has been proven effective in various EEG classification tasks, but its simple network structure limits its representation learning capability.
2) DCN [11]: This is a deep convolutional neural network that mainly captures high-level features from EEG signals through four convolution blocks. This model can serve as a versatile tool for decoding EEG signals across various tasks. However, due to its traditional convolutional network structure, DCN struggles to efficiently capture the multi-scale features of EEG signals.
3) EEGInception [12]: To extract multi-scale time-domain features, EEGInception introduces the Inception mechanism into the deep convolutional neural network. Each Inception module extracts multi-scale temporal features from EEG through convolution layers with varying kernel sizes. This method has demonstrated its efficacy in detecting ERPs. Although the model can effectively extract multi-scale information, it cannot effectively capture the long-term dependencies of time-domain features.
4) EEGConformer [33]: Low-level local features are first extracted using temporal and spatial convolutions. Then, a Transformer is used to capture the global correlation within the local temporal features for EEG classification. However, this model ignores the important time-frequency features of EEG signals.
5) STSTNet [2]: This is a multi-view-feature-based EEG decoding method, which simultaneously extracts spatio-temporal and spectral-temporal features from EEG signals and spectrum images, and then fuses the multi-view features through a spatio-temporal-spectral Transformer. Although this model can comprehensively extract spatio-temporal and spectral-temporal features, it does not consider the multi-scale time-domain information of EEG signals.
We also compare our method with three other transfer learning approaches:

1) Domain-Adversarial Neural Networks (DANN) [29]: This is a domain adaptation transfer learning method, which achieves the alignment of features from different domains through a domain classifier and a gradient reversal layer. It has been used for cross-subject transfer learning on EEG data [34]. Due to the large inter-individual differences in EEG, aligning features with the source domain may cause the loss of discriminative information.
2) Adaptive Transfer Learning based on DCN (ATLDCN) [35]: The ATLDCN handles the substantial inter-subject variability of EEG data through five adaptation schemes. However, the selection of adaptive transfer learning schemes relies on time-consuming optimization processes.
3) Source-free Subject Adaptation (SFSA) [36]: The SFSA transfer learning method generates source-domain data through a classifier-based source-domain data generator and then aligns the target subject's features with the generated source subjects' features. Although this method can utilize common information from different subjects, it ignores subject-specific information that contributes to the classification task.
The above comparison methods and the proposed CST-TVA-DRTL are all implemented using Python 3.9.7 and PyTorch 1.13.1, and the comparative experiments are conducted on the same hardware platform. A 5-fold cross-validation strategy is used for the comparative experiments. When training the models, all model parameters are optimized by the Adam optimizer. The initial learning rate and batch size are set to $10^{-4}$ and 64, respectively. Because the data of the RSVP paradigm are extremely class-imbalanced, an oversampling strategy is adopted to balance the classes in the training set.
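A sketch of this training setup is given below; random oversampling is realized here with a weighted sampler, which is one common implementation of the stated strategy rather than necessarily the paper's exact one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_train_loader(x, y, batch_size=64):
    # Give rarer (target) samples proportionally higher sampling probability.
    counts = torch.bincount(y)
    weights = 1.0 / counts[y].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, sampler=sampler)

# Adam with the stated initial learning rate; `model` is defined elsewhere.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```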

D. Overall Performance
The feature extraction capability and transfer performance of the proposed method and the above baseline methods are compared in subject-dependent experiments and cross-subject experiments, respectively. In the subject-dependent experiments, models are trained and tested on the same subject's data. Following the 5-fold cross-validation setting, each subject's data are evenly divided into five parts. In each fold, one part is used as the test set, one part as the validation set, and the other three parts as the training set. For a fair comparison, five deep learning-based EEG classification models, namely DCN, EEGNet, EEGInception, EEGConformer, and STSTNet, are compared with the CST-TVA method, i.e., our method without domain-rectified transfer learning. The results of the subject-dependent experiments on the Tsinghua and PhysioNet RSVP datasets are presented in Table I, where BA, TPR, TNR, and AUC are the mean values over all subjects and std is the standard deviation. The results in Table I show that CST-TVA achieves the best BA on both the Tsinghua and PhysioNet RSVP datasets. For the Tsinghua dataset, the BA of CST-TVA reaches 92.56%, which is 1.96%, 1.65%, 0.68%, 1.09%, and 0.55% higher than EEGNet, DCN, EEGInception, EEGConformer, and STSTNet, respectively. Although our method is slightly lower than DCN on TPR, it is higher than the other methods on TNR and AUC, reaching 0.9551 and 0.9415, respectively. We perform paired-sample t-tests between the proposed CST-TVA method and the other methods. The adjusted p-values with Bonferroni correction are provided in Table I. The results show that the p-values for BA are less than 0.001 for all comparison methods. The significant improvements in BA suggest that CST-TVA has stronger EEG feature extraction capabilities. For the PhysioNet dataset, CST-TVA achieved a BA of 72.02%, outperforming EEGNet (p < 0.01), DCN (p < 0.01), EEGInception (p < 0.01), EEGConformer (p < 0.05), and STSTNet (p < 0.05) by 2.66%, 1.01%, 1.06%, 0.89%, and 0.66%, respectively. Although EEGConformer achieves the highest TPR, its TNR is lower than that of our CST-TVA, which is caused by the class imbalance in the training set. In terms of TNR and AUC, CST-TVA achieves 0.7277 and 0.7448, respectively, performing better than the five baseline methods.
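The significance testing above amounts to one paired t-test per baseline on per-subject BA values, followed by a Bonferroni adjustment; a small SciPy sketch:

```python
from scipy.stats import ttest_rel

def paired_tests(ours, baselines):
    """`ours` is an array of per-subject BAs; `baselines` maps method name to
    its per-subject BAs. Returns Bonferroni-adjusted p-values per method."""
    m = len(baselines)
    return {name: min(ttest_rel(ours, scores).pvalue * m, 1.0)
            for name, scores in baselines.items()}
```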

TABLE I THE OVERALL COMPARISON OF SUBJECT-DEPENDENT CLASSIFICATION PERFORMANCE ON TSINGHUA AND PHYSIONET DATASETS

TABLE II THE OVERALL COMPARISON OF CROSS-SUBJECT PERFORMANCE ON TSINGHUA AND PHYSIONET DATASETS
Subject-dependent experiments on the Tsinghua and PhysioNet RSVP datasets indicate that our proposed method can effectively extract discriminative features to achieve better balanced accuracy.
In the cross-subject experiments, each subject in the dataset is selected as the target subject in turn for fine-tuning and testing the model, and the remaining subjects are used as the source domain for pre-training the model. In the pre-training stage, only the source subjects' data are used to train the model. In the fine-tuning stage, the same 5-fold cross-validation strategy as in the subject-dependent experiments is adopted: the data from the target subject are divided into five parts, three parts for fine-tuning the model, one part for validation, and one part for testing. The results of the cross-subject experiments are presented in Table II. From Table II, we can observe that, compared with other transfer learning methods such as DANN, ATLDCN, and SFSA, our CST-TVA-DRTL achieves better results on both datasets. For the Tsinghua dataset, the BA of our method reaches 93.07%, which is 0.53%, 0.44%, and 0.36% higher than the three baseline transfer learning methods, respectively. In terms of TNR and AUC, our method achieves the best results, reaching 0.9589 and 0.9581, respectively. The adjusted p-values for BA are less than 0.001 for all comparison methods, indicating that the proposed CST-TVA-DRTL not only demonstrates excellent performance on specific subjects but also exhibits statistically significant superiority overall. For the PhysioNet dataset, our CST-TVA-DRTL also outperforms the other methods. The BA of our method reaches 73.95%, which is 1.44%, 1.24%, and 1.01% higher than the other methods, respectively. The results of the paired-sample t-tests show that the improvement is significant compared to DANN (p < 0.01), ATLDCN (p < 0.01), and SFSA (p < 0.01). The cross-subject experimental results on the Tsinghua and PhysioNet RSVP datasets indicate that our CST-TVA-DRTL achieves a substantially higher BA than DANN, ATLDCN, and SFSA, which confirms that our proposed transfer learning framework can leverage data from other subjects to boost performance. Fig. 5 illustrates the mean and standard deviation of BA for each subject in the 5-fold cross-validation experiments, together with the mean and standard deviation over all subjects. As shown, after adopting domain-rectified transfer learning, the classification accuracy of CST-TVA-DRTL is improved for most subjects. In particular, the BA of the third subject in the Tsinghua dataset improved from 92.48% to 94.59%. It is evident that the incorporation of domain-rectified transfer learning can effectively use cross-subject information to enhance the performance of CST-TVA-DRTL. In addition, the standard deviation of the BAs obtained by CST-TVA-DRTL across subjects is smaller, indicating that domain-rectified transfer learning also makes CST-TVA-DRTL more stable.

IV. DISCUSSION

A. Comparison With the State-of-the-Art Methods
We compare the proposed method with recently reported state-of-the-art (SOTA) methods to demonstrate its effectiveness. For a fair comparison, the SOTA subject-dependent and cross-subject methods are compared with the CST-TVA and CST-TVA-DRTL methods in Table III and Table IV, respectively. Since these works report BA and AUC results in their papers, we list BA and AUC in the tables for comparison.
As shown in Table III, the proposed CST-TVA exhibits significant improvements over the SOTAs on the Tsinghua and PhysioNet RSVP datasets. For the Tsinghua dataset, XGB-DIM [9] yields the worst results due to its reliance on machine learning methods, which struggle to effectively extract the key features from RSVP EEG signals. iEEGNet [8], an enhanced version of the EEGNet model, achieves an AUC of 0.9274, second only to CST-TVA. For the PhysioNet dataset, the CST-TVA method outperforms PPNN [7] and DRL [16], both based on deep neural networks, by 4.76% and 3.22% in terms of BA, respectively. This indicates that CST-TVA possesses stronger representation learning capabilities, mainly because our proposed CST-TVA can simultaneously learn multi-scale temporal features and multi-view spectral features.
Table IV presents a comparison between the proposed cross-subject method CST-TVA-DRTL and other SOTA cross-subject approaches. As can be seen from the table, CST-TVA-DRTL achieves the best results on both datasets. For the Tsinghua dataset, CST-TVA-DRTL achieves an AUC of 0.9581, while MACRO [19] and MCGRAM [15] only achieve AUCs of 0.9309 and 0.9352, respectively. This may be attributed to the fact that MACRO and MCGRAM do not consider subject-specific information, thereby weakening their adaptability to the target-domain data. For the PhysioNet dataset, the proposed CST-TVA-DRTL method significantly outperforms the xDAWN-SVM [18] approach, highlighting the weaker feature transfer capability of machine learning methods. In summary, the proposed CST-TVA-DRTL exhibits superior performance compared to existing SOTA methods, as it not only leverages the temporal and spectral features of EEG signals but also utilizes subject-specific information to enhance cross-subject transfer learning.

B. Ablation Studies
The proposed CST-TVA-DRTL method primarily consists of three key components: CST, TVA, and DRTL. These components are designed to equip the model with the capabilities of extracting temporal features, extracting spectral features, and performing transfer learning, respectively. To investigate the impact of these components on the model's classification performance, a series of ablation experiments was conducted. The results of the ablation experiments are presented in Table V.
1) Efficacy of CST: As shown in Table V, the BA of CST on the Tsinghua and PhysioNet datasets reached 91.94% and 71.37%, respectively, an increase of 1.03% and 0.36% over the baseline DCN method. The reason is that the DCN model only considers single-scale temporal characteristics of the EEG signal and can hardly capture the differences between the time-domain waveforms of target and non-target EEG. The CST can simultaneously extract temporal features at different scales and model their temporal dependencies, which improves its ability to extract temporal features and leads to better RSVP classification results. To investigate the impact of adaptive weighting on the performance of CST, we compared the CST with a multi-scale feature (MSF) extraction method without adaptive weighting. As shown in Table V, the BA of CST exhibits a significant (p < 0.05) improvement over MSF on both datasets. This indicates that the adaptive weighting in CST effectively enhances the performance of the model.

2) Efficacy of TVA: The TVA is designed to extract multi-view spectral features from EEG time-frequency images. To validate the effectiveness of TVA, we compared it with a conventional single-view spectral feature (SVF) extraction method. As shown in Table V, the TVA method with the triple-view attention mechanism demonstrates a significant (p < 0.01) improvement in BA over SVF on both datasets. This suggests that the triple-view attention mechanism is more effective in capturing spectral features. However, if only time-frequency features are used, the results obtained by TVA are worse than those of CST, which suggests that temporal features are indispensable for the RSVP classification task. After incorporating TVA into the CST method, the CST+TVA method surpasses the CST method by 0.62% and 0.65% in BA on the Tsinghua and PhysioNet datasets, respectively. The superior performance of CST+TVA is due to its ability to extract both temporal and spectral features of EEG signals, while the CST method only focuses on temporal features. The incorporation of TVA enables CST+TVA to capture key spectral features of multi-channel time-frequency images from different perspectives, thus enhancing its representation learning ability.
3) Efficacy of DRTL: After introducing DRTL into the CST+TVA method, the BA of the proposed CST+TVA+DRTL on the two datasets increased by 0.51% and 1.93%, respectively. The proposed DRTL framework is beneficial because it can use more subjects' EEG data to improve the feature extraction ability of the model through the transfer learning strategy, and use domain-specific representations to rectify domain-invariant representations to adapt to the target subject's data. The domain-specific and domain-invariant representations learned by the proposed CST-TVA-DRTL on the Tsinghua and PhysioNet datasets are visualized in Fig. 6. We can see that the domain-specific representations obtained by the proposed model from the source subjects and the target subject have clear boundaries, while the domain-invariant representations of the source subjects and the target subject are indistinguishable. This shows that our model can simultaneously extract individualized information and common invariant information from different subjects' data, thus effectively improving the transfer learning performance of the model.

C. Saliency Map Analysis of EEG Channels
We investigate the correlation between target and non-target visual stimuli and different channels of EEG signals by using the saliency map method [37]. The saliency map is a commonly used method to visualize the classification inference process of deep learning models in the field of computer vision [38], which can reveal the importance of each part of the input data to the classification results through a single gradient back-propagation. In this study, to investigate the importance of different EEG channels for target and non-target classification, the EEG data were input into the well-trained DCN model, and then the Gradient-weighted Class Activation Mapping (Grad-CAM) method was used to obtain the importance score of the EEG data [39]. The channel importance score was obtained by accumulating the importance scores of all data points in each channel. Fig. 7 visualizes the averaged and normalized channel importance scores of the two classes of EEG data from the Tsinghua and PhysioNet datasets. As shown in Fig. 7(a), target visual stimuli evoke brain activity in both the prefrontal and occipital cortices, while non-target stimuli mainly evoke activity in the occipital visual area.

In order to visually explore the differences between target and non-target EEG signals, we visualized the multi-channel EEG signals in both the spatial and temporal dimensions in Fig. 8. From the scalp topography in Fig. 8(a), it can be observed that for non-target visual stimuli there is periodic brain activity in the occipital visual area, whereas for target visual stimuli significant activity appears in the prefrontal cortex at around 300 ms. The same results can be clearly observed from the time-domain waveforms. The target EEG waveforms at channel O2 exhibit clear P300 and N400 components, while the non-target EEG waveforms are near-sinusoidal signals. These findings are consistent with previous studies [31], demonstrating a strong correlation between prefrontal cortex activity and target visual stimuli. Since the PhysioNet dataset only provides EEG signals from the visual area, it is difficult to observe significant differences between target and non-target EEG signals in the spatial dimension. However, in the temporal dimension, clear components such as the P300 and N400 can be observed in the target EEG, while the non-target EEG exhibits periodic patterns.
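A plain single-backprop saliency variant of this channel-importance computation is sketched below; the paper uses Grad-CAM on the trained DCN, whereas this simpler gradient accumulation conveys the same per-channel aggregation idea.

```python
import torch

def channel_importance(model, x, target_class=1):
    # x: (batch, channels, time) EEG epochs; model is assumed to output class logits.
    model.eval()
    x = x.clone().requires_grad_(True)
    model(x)[:, target_class].sum().backward()   # one gradient back-propagation
    imp = x.grad.abs().sum(dim=(0, 2))           # accumulate over epochs and time points
    return (imp / imp.max()).detach()            # normalized per-channel importance
```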

D. Analysis of Hyperparameter Settings
As the hyperparameters of deep learning models have a significant impact on performance, it is crucial to set appropriate values for optimal model performance [40], [41], [42], [43], [44]. In this study, we employed a simple yet effective grid search strategy to optimize hyperparameters such as the number of scales S of the multi-scale temporal features and the dimension L of the domain-specific representation. Performance comparisons of the proposed model under different parameter settings are presented in Fig. 9. As shown, increasing the number of scales from 1 to 3 improves model performance, indicating that multi-scale temporal features provide more discriminative information. However, when the number of scales exceeds 3, model performance decreases slightly due to the redundant information contained in the additional temporal scales. Similarly, setting L to 128 results in lower balanced accuracy, because smaller representation dimensions carry less information and lose detail, whereas increasing L to 512 yields the best model performance. However, increasing L to 1024 or beyond may cause redundancy and negatively affect efficiency.
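The grid search itself is straightforward; a sketch, where `train_eval(S, L)` is a hypothetical helper that trains the model with the given hyperparameters and returns validation BA, and the candidate grids are illustrative:

```python
from itertools import product

def grid_search(train_eval, scales=(1, 2, 3, 4), dims=(128, 256, 512, 1024)):
    # Evaluate every (S, L) pair and keep the one with the best validation BA.
    results = {(s, l): train_eval(s, l) for s, l in product(scales, dims)}
    return max(results, key=results.get), results
```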

E. Computational Complexity
The computational complexity of our proposed CST-TVA-DRTL is evaluated in terms of the number of parameters and the training and inference time of the model. For comparison purposes, Table VI presents the average BA, parameter count, training time, and inference time of CST-TVA-DRTL and its competitors on the Tsinghua dataset. It is important to note that the time consumption of all models was measured on a hardware platform with an Intel Core i7-7700 3.60 GHz CPU and an NVIDIA GTX 1080 GPU. The software environment includes Python 3.9.7, PyTorch 1.13.1, and CUDA 11.0. From Table VI, it can be observed that EEGNet has the lowest computational complexity. However, due to its simple network architecture, it struggles to effectively capture the most discriminative features in EEG signals, resulting in the lowest average BA. In contrast, although the proposed CST-TVA-DRTL involves more parameters due to the introduction of the Transformer and the feature encoders, its average BA is significantly improved. Furthermore, compared with the existing methods, the training and inference times of the proposed CST-TVA-DRTL do not exhibit a significant increase. This is primarily because the majority of the parameters in CST-TVA-DRTL originate from the fully connected layers of the feature encoders, and fully connected layers are computationally efficient.

F. Limitations and Future Directions
The proposed CST-TVA-DRTL framework achieves better RSVP classification results than existing methods, but it still has some limitations that need to be addressed in future work. First, all channels of EEG data were used for RSVP classification, although only a subset of channels is significantly correlated with the RSVP task; the performance of CST-TVA-DRTL can be degraded by noisy signals on uncorrelated channels. Thus, an adaptive channel enhancement strategy will be studied to reduce the interference of noisy channels. Second, although our proposed method effectively improves transfer performance, it requires data from multiple subjects, which may limit its applicability to datasets with a small number of subjects. Therefore, in future work we will investigate further improving RSVP classification performance by generating minority-class data through generative adversarial networks.

V. CONCLUSION
This paper presents a novel CST-TVA-DRTL framework for RSVP classification. Specifically, the framework leverages a CST temporal feature extractor to obtain multi-scale temporal features from EEG signals and characterize the temporal feature dependencies across different scales. In addition, a TVA spectral feature extractor is adopted to capture discriminative spectral features of multi-channel EEG spectrograms from three different views. Moreover, a DRTL framework is designed to improve cross-subject transfer learning performance by simultaneously exploiting the common invariant information and subject-specific information of multiple subjects' data. Experimental results on the Tsinghua and PhysioNet RSVP datasets confirm that our proposed CST-TVA-DRTL outperforms the state-of-the-art methods, demonstrating its viability as a solution for RSVP classification.

Fig. 6. The t-SNE visualization of domain-specific and domain-invariant representations learned by the proposed CST-TVA-DRTL on (a) the Tsinghua dataset and (b) the PhysioNet dataset.

Fig. 9. Performance comparison with various S and L on (a) Tsinghua dataset and (b) PhysioNet dataset.

TABLE III RESULTS OBTAINED BY THE STATE-OF-THE-ART SUBJECT-DEPENDENT METHODS ON TSINGHUA AND PHYSIONET DATASETS

TABLE IV RESULTS OBTAINED BY THE STATE-OF-THE-ART CROSS-SUBJECT METHODS ON TSINGHUA AND PHYSIONET DATASETS

TABLE V THE RESULTS OF ABLATION EXPERIMENTS ON TSINGHUA AND PHYSIONET RSVP DATASETS