Improving Generalized Zero-Shot Learning SSVEP Classification Performance From Data-Efficient Perspective

Generalized zero-shot learning (GZSL) has significantly reduced the training requirements of steady-state visual evoked potential (SSVEP) based brain-computer interfaces (BCIs). Traditional methods require training data for all classes, whereas GZSL needs data for only a subset, dividing classes into 'seen' (those with training data) and 'unseen' (those without). However, inefficient utilization of SSVEP data limits the accuracy and information transfer rate (ITR) of existing GZSL methods. To this end, we proposed a framework that utilizes SSVEP data more effectively at three systematically combined levels: data acquisition, feature extraction, and decision-making. First, prevalent SSVEP-based BCIs overlook the inter-subject variance in visual latency and employ a fixed sampling starting time (SST). We introduced a dynamic sampling starting time (DSST) strategy at the data acquisition level, which uses the classification results on the validation set to find the optimal sampling starting time (OSST) for each subject. In addition, we developed a Transformer structure that captures the global information of the input data, compensating for the small receptive field of existing networks; its global receptive field can adequately process longer input sequences. At the decision-making level, we designed a classifier selection strategy that automatically selects the optimal classifier for the seen and unseen classes, respectively. We also proposed a training procedure that makes the above solutions work in conjunction with each other. Our method was validated on three public datasets and outperformed the state-of-the-art (SOTA) methods. Crucially, it also outperformed representative methods that require training data for all classes.


I. INTRODUCTION
Steady-state visual evoked potential (SSVEP) is widely used in building brain-computer interfaces (BCIs) for multi-target systems due to its high signal-to-noise ratio (SNR) [1], [2]. Each target in these BCIs blinks at a different frequency, and when a subject looks at a target, the corresponding SSVEP signal can be recorded using electroencephalography (EEG). A classification algorithm can subsequently identify the target the subject is gazing at by analyzing the EEG signals.
Researchers have proposed numerous classification methods for SSVEP-based BCIs. While training-free methods exist [3], [4], training-based methods have shown significant improvement in BCI performance, particularly in information transfer rate (ITR) [5], [6], [7]. However, conventional training-based methods can cause user fatigue when acquiring training data [8] in systems with a large number of targets [9], [10]. Thus, it is crucial to perform accurate classification when only a subset of the targets has training data.
Recent studies have proposed methods that use transfer learning and generalized zero-shot learning (GZSL) to reduce the training data requirements of SSVEP classification. Stimulus-stimulus Transfer is a transfer learning method that extracts the common SSVEP components from the training data [11]; these components can be used to construct templates for classes without training data and thereby achieve classification. Wang et al. proposed the GZSL-SSVEP network for classification. The network projects test EEG data and the sine templates of all classes into the same latent space through a convolutional neural network (CNN), and then uses the correlation coefficients of the projected signals in the latent space for classification [12].
Despite these methods providing ideas to alleviate the training data requirements, their utilization of SSVEP data is inefficient. This inefficiency manifests at three levels: data acquisition, feature extraction, and decision-making. At the data acquisition level, these methods ignore inter-subject variability in visual latency and use a fixed sampling starting time (SST). For instance, in the Benchmark dataset, previous studies have used data from 0.14 s after the onset of the stimulus; this SST is the average SSVEP response delay in the dataset [13]. However, the SST that yields the best ITR may differ across subjects. At the feature extraction level, the CNN used in GZSL-SSVEP has a small receptive field and cannot capture the global information of the input data. At the decision-making level, existing methods use the same strategy to classify the input data regardless of whether they belong to seen classes, thus underutilizing the training data.
In this study, the seen classes constitute the training and validation sets, while all classes constitute the test set. To utilize the SSVEP data effectively, we proposed a framework that overcomes the problems of existing methods. First, we proposed the dynamic sampling starting time (DSST) strategy to optimize the SST for each subject. The DSST strategy first slices the training data into multiple sampling time (ST) windows with different starting times but the same length; we then used the classification results on the validation set to select the optimal SST. Furthermore, to address the limited CNN receptive field, we proposed a Transformer structure that processes the input data in conjunction with the CNN. The global receptive field of the Transformer [14], [15] allows the network to capture information on a long time scale. We also trained a classification network using intermediate features acquired by the Transformer. This classification network directly outputs classification results and is complementary to the Transformer network, which uses the correlation coefficient for classification. This network is also essential to the automated selection of seen-class classifiers and contributes to the DSST strategy.
We used three public datasets [13], [16], [17] to validate the proposed framework and compared it with the state-of-the-art (SOTA) methods. We explored the relationship between SST and ST to demonstrate their impact on classification performance. In addition, we conducted ablation studies on the proposed algorithm to verify the contribution of each component to the results.
The rest of this paper is organized as follows. First, we introduced the related works in Section II. Then, in Section III, we described our proposed framework in detail. In Sections IV and V, we presented the experimental results and discussion. In the last section, we summarized this work.

II. RELATED WORK
In this section, we first introduced the deep learning methods relevant to this paper and their application in SSVEP-based BCIs. Then we introduced the generalized zero-shot learning (GZSL) method.
A. Deep Learning in SSVEP-Based BCIs

Recent studies have found that the size of the convolutional kernel of a CNN limits its receptive field, making it difficult to handle long sequences [21]. In particular, EEG classification cannot neglect the global interaction of task-related trials [22]. The Transformer structure is a solution with a global receptive field [14], [23]. The Transformer consists of two parts, an encoder and a decoder, where the encoder is suitable for latent-space projection tasks. After embedding each input token, the structure performs self-attention operations on these embeddings in the encoder part. The encoder output has the same size as the input embedding. A recent study used a Transformer for cross-subject transfer learning in SSVEP classification [15]; it uses the complex spectrum of the SSVEP as input tokens and outputs the classification results directly.
Although all these studies show a promising future for DL in SSVEP classification, these neural network schemes require access to the training data of all classes, creating a training burden for users of multi-target systems [8], [24].

B. Generalized Zero-Shot Learning
Traditional neural network classification algorithms can only classify classes that appeared during training, which does not match many real-world settings. Researchers have proposed generalized zero-shot learning (GZSL) [25] to address this issue. GZSL separates the classes into seen and unseen classes, where the seen classes have training data and the unseen classes do not. The GZSL method projects the semantic information of all classes into latent-space embeddings. Then, it maps the input data into the same latent space, and the nearest embedding corresponds to the classification result [26]. GZSL has been applied to EEG processing in recent years [27], [28], but work on SSVEP classification can still be improved. Wang et al. proposed a scheme to apply GZSL to SSVEP [12]. GZSL-SSVEP uses EEG as input data and the sine template corresponding to the SSVEP as semantics. The network projects the data and semantics into the latent space and uses their correlation coefficients in the latent space for classification. Although the method alleviates the training requirements of a multi-target system, its accuracy and ITR need further improvement.

III. METHOD
In this section, we described the proposed network structure in detail. We also demonstrated its training and testing process. The framework structure is shown in Fig. 1.

A. GZSL Scheme Overview
In the GZSL task, we assumed that the SSVEP and the stimulus can be projected into a common latent space by the network, which enables us to represent any stimulus or EEG trial as a vector in the same latent space. Our framework consists of two types of networks: those that take SSVEP data as input and project real EEG into the latent space, and one that uses the luminance modulation function of the stimulus to project it into the same space. Both types update their parameters during training by maximizing the similarity between their outputs in the latent space. For classes without training data, the network can use the stimulus function to construct a vector representation in the latent space during testing, and output classification results by comparing the similarity between this vector representation and the projection of the input EEG.
The networks that use SSVEP data as input are Extraction-Net, Transformer-Net, and Elec-Comb-Net. They embed their inputs, the EEG Data and the Class Template, into a latent space (i.e., $X_1, X_2, T \in \mathbb{R}^{1 \times 1 \times N_S}$), where $N_S$ is the number of sampling points. Their inputs correspond to the same target. The input of Extraction-Net and Transformer-Net is the EEG Data $Z_k \in \mathbb{R}^{1 \times N_C \times N_S}$, where $N_C$ is the number of data channels and $k$ is the index of the class. The input Class Template of Elec-Comb-Net, with the same shape $\mathbb{R}^{1 \times N_C \times N_S}$, is the channel-by-channel averaged template of seen class $k$ with the same sampling starting time (SST). In addition, Classification-Net uses the intermediate features of Transformer-Net as input and directly outputs the classification results.
The input Sine Template $Y \in \mathbb{R}^{N_f \times (2 \times N_h) \times (N_S + N_{latency})}$ of Generation-Net contains the sine waves of the luminance modulation functions of all classes, constructed by

$$Y_k(n) = \left[ \sin\!\left(2\pi f_k \tfrac{n}{F_s} + p_k\right),\; \cos\!\left(2\pi f_k \tfrac{n}{F_s} + p_k\right),\; \ldots,\; \sin\!\left(2\pi N_h f_k \tfrac{n}{F_s} + N_h p_k\right),\; \cos\!\left(2\pi N_h f_k \tfrac{n}{F_s} + N_h p_k\right) \right]^{\mathsf{T}}$$

In this equation, $k \in [1, N_f]$ is the index of the target, $N_f$ is the number of all classes, $f_k$ is the stimulus frequency, $p_k$ is the stimulus phase, $N_h$ is the number of harmonics, and $F_s$ is the sampling rate. The last convolutional layer merges all output channels; the output corresponding to target $k$ is $S_k \in \mathbb{R}^{1 \times 1 \times N_S}$. Finally, the network concatenates the $S_k$ to obtain the output $S \in \mathbb{R}^{1 \times N_f \times N_S}$.
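To make this construction concrete, the following NumPy sketch builds such a template bank. The Benchmark-style frequencies and phases and the harmonic count $N_h = 5$ are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def sine_template(freqs, phases, n_harmonics, n_samples, fs):
    """Construct sine templates Y of shape (N_f, 2*N_h, N_S + N_latency).

    freqs, phases: per-class stimulus frequency (Hz) and phase (rad).
    """
    t = np.arange(n_samples) / fs                      # sample times in seconds
    templates = np.zeros((len(freqs), 2 * n_harmonics, n_samples))
    for k, (f, p) in enumerate(zip(freqs, phases)):
        for h in range(1, n_harmonics + 1):
            arg = 2 * np.pi * h * f * t + h * p        # h-th harmonic of class k
            templates[k, 2 * (h - 1)] = np.sin(arg)
            templates[k, 2 * (h - 1) + 1] = np.cos(arg)
    return templates

# Benchmark-style example: 40 classes, 8.0-15.8 Hz in 0.2 Hz steps, phases in
# pi/2 steps, 0.6 s of data at 250 Hz plus N_latency = 0.14 * fs extra samples.
fs, n_latency = 250, int(0.14 * 250)
freqs = 8.0 + 0.2 * np.arange(40)
phases = (np.pi / 2 * np.arange(40)) % (2 * np.pi)
Y = sine_template(freqs, phases, n_harmonics=5,
                  n_samples=int(0.6 * fs) + n_latency, fs=fs)
```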

TABLE I ARCHITECTURE AND PARAMETERS OF Generation-Net
The testing stage offers the network only two inputs, the EEG Data and the Sine Template. The EEG Data in the testing stage were not used in training. The subsequent subsections describe the specific details of the network and the implementation process.

B. Generation-Net
The network first sorts the input Sine Templates by frequency and divides them into n frequency groups in order. The setting of n is consistent with [12]. The sine templates of the same frequency group are processed by the same temporal convolution network (TCN) block (TCN-1, TCN-2, etc.). Each TCN block ends with a Conv layer that combines the individual channels for output.
The TCN [29] is a causal convolutional network whose output at any given moment depends solely on the preceding inputs. Since the SSVEP response consistently exhibits a delay from the actual stimulus moment, it is crucial for the TCN to leverage this delay effectively in order to improve the mapping to the latent space.
Following [12], we modified the original TCN network to handle 2-dimensional input data. At the same time, we observed that the TCN structure pads the front end of the input with zeros, making the initial period of the output $S$ unusable. Therefore, unlike [12], we prepended the Sine Template with an extra segment of length $N_{latency}$ before the SST (Fig. 2), where $N_{latency} = 0.14 \times F_s$ and $F_s$ is the sampling rate. Since the output of the TCN has the same size as the input, we kept only the last segment of length $N_S$ in $S$.
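The sketch below illustrates this with a single causal convolution standing in for the full TCN blocks (the kernel size and channel count are assumptions): only the trailing $N_S$ output samples, which were computed from real template data rather than zero padding, are kept.

```python
import torch
import torch.nn as nn

fs = 250
n_latency = int(0.14 * fs)   # extra template samples prepended before the SST
n_s = int(0.6 * fs)          # analysis window length N_S

# Stand-in for one causal TCN layer: left-pad so the output at time t depends
# only on inputs up to t (kernel size 3; the real blocks are deeper/dilated).
causal = nn.Sequential(nn.ConstantPad1d((2, 0), 0.0),
                       nn.Conv1d(10, 10, kernel_size=3))

sine_in = torch.randn(1, 10, n_s + n_latency)  # (batch, 2*N_h, N_S + N_latency)
s_full = causal(sine_in)                       # causal output, same length as input
s = s_full[..., -n_s:]                         # keep the trailing N_S samples, which
                                               # saw real data rather than zero padding
```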

C. Extraction-Net and Elec-Comb-Net
The BandGen layer of Extraction-Net first divides the input signal into 16 subbands. The subsequent BandMerge and ChCombine layers merge the subbands and channels and output the vector representation in the latent space.
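As an illustration of the subband decomposition, the sketch below implements a fixed filter bank in the spirit of FBCCA; the paper does not give the exact band edges or whether the BandGen filters are learnable, so the Chebyshev design and the 6 Hz cutoff steps here are assumptions.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

def band_gen(eeg, fs=250, n_bands=16):
    """Split EEG (n_channels x n_samples) into n_bands subbands.

    Illustrative stand-in for the BandGen layer: Chebyshev type-I bandpass
    filters whose low cutoffs step up by 6 Hz (assumed band edges).
    """
    out = np.empty((n_bands,) + eeg.shape)
    for b in range(n_bands):
        low, high = 8.0 + 6.0 * b, 110.0   # passband edges in Hz (assumption)
        b_coef, a_coef = cheby1(N=4, rp=0.5,
                                Wn=[low / (fs / 2), high / (fs / 2)],
                                btype="bandpass")
        out[b] = filtfilt(b_coef, a_coef, eeg, axis=-1)  # zero-phase filtering
    return out

subbands = band_gen(np.random.randn(9, 150))  # (16, 9, 150): band x channel x time
```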
The input to Elec-Comb-Net is the channel-by-channel averaged training data, which has a high signal-to-noise ratio (SNR). Therefore, Elec-Comb-Net should process the input as little as possible; hence, it is a shallow network with small kernel sizes.

D. Transformer-Net
The Transformer structure can utilize the global information of the input data. In our work, we used one 1D-CNN layer for channel embedding, with an EmbSize of 64. After that, we used a matrix transpose to make the embedding conform to the input size of the Encoder Block.
We used multi-head attention (MHA) and channel attention (CA) layers to process the embedded EEG. In MHA, we used the input at each sample point as a token, with the embedding channels of the sample point forming the tensor representation. MHA can explore the temporal relationships between the sampling points of the EEG data. Unlike MHA, CA takes each embedding channel as a token and uses all sample points of that channel as the tensor representation; CA can thus explore the spatial relationships between different EEG channels. The two FeedForward layers are linear layers and serve to weight the embedding channels. The dropout rate in the network is 0.1.
The final Wave Block uses a linear layer to combine all the embedding channels and outputs the waveform $X_1 \in \mathbb{R}^{1 \times 1 \times N_S}$.
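A minimal PyTorch sketch of one such encoder block is given below. The head counts, feed-forward design, and absence of layer normalization are assumptions for illustration; only EmbSize = 64, the token definitions of MHA and CA, and the 0.1 dropout come from the text.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of a Transformer-Net encoder block (EmbSize = 64, dropout 0.1)."""
    def __init__(self, emb_size=64, n_samples=150, n_heads=8, dropout=0.1):
        super().__init__()
        # temporal attention: each sample point is a token
        self.mha = nn.MultiheadAttention(emb_size, n_heads, dropout=dropout,
                                         batch_first=True)
        self.ff_t = nn.Sequential(nn.Linear(emb_size, emb_size), nn.GELU(),
                                  nn.Dropout(dropout))
        # channel attention: each embedding channel is a token
        self.ca = nn.MultiheadAttention(n_samples, 1, dropout=dropout,
                                        batch_first=True)
        self.ff_c = nn.Sequential(nn.Linear(n_samples, n_samples), nn.GELU(),
                                  nn.Dropout(dropout))

    def forward(self, x):                 # x: (batch, n_samples, emb_size)
        x = x + self.mha(x, x, x)[0]      # relationships across time
        x = x + self.ff_t(x)
        x = x.transpose(1, 2)             # (batch, emb_size, n_samples)
        x = x + self.ca(x, x, x)[0]       # relationships across embedding channels
        x = x + self.ff_c(x)              # FeedForward weighting of the channels
        return x.transpose(1, 2)

out = EncoderBlock()(torch.randn(8, 150, 64))  # e.g., 0.6 s at 250 Hz, EmbSize 64
```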

E. Classification-Net
The Classification-Net consists of two convolutional layers and one linear layer, with a dropout rate of 0.4 in the Dropout layer. The output $C \in \mathbb{R}^{1 \times N_{seen}}$ of this network contains the probability of each seen class (the number of seen classes is denoted as $N_{seen}$). The input is the intermediate feature of the Transformer network, which reduces the need to learn low-level features and allows faster convergence.
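A sketch consistent with this description is shown below; the convolution channel counts and kernel sizes are assumptions, while the two-conv-plus-linear layout and the 0.4 dropout follow the text.

```python
import torch
import torch.nn as nn

class ClassificationNet(nn.Module):
    """Two conv layers + one linear layer over Transformer-Net features."""
    def __init__(self, emb_size=64, n_samples=150, n_seen=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(emb_size, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.Dropout(0.4),
            nn.Flatten(),
            nn.Linear(8 * n_samples, n_seen),  # logits; softmax applied at use site
        )

    def forward(self, feat):               # feat: (batch, emb_size, n_samples)
        return self.net(feat)

logits = ClassificationNet()(torch.randn(8, 64, 150))  # (8, N_seen)
```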

F. Dynamic Sampling Starting Time
The dynamic sampling starting time (DSST) strategy is achieved by searching over multiple feasible SSTs, as illustrated in Fig. 3. The starting point of the stimulus is the 0.0 s time point. First, we sliced the training data into several windows with the same sampling time (ST). Afterward, we trained a corresponding Classification-Net for each SST.
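The window slicing itself is straightforward; a sketch assuming the Fig. 3 settings (first SST at 0.02 s, 0.04 s steps, SST + ST below 2.0 s) follows.

```python
import numpy as np

def slice_windows(eeg, fs=250, st=0.6, sst0=0.02, step=0.04, max_end=2.0):
    """Slice trials into candidate ST windows for the DSST search.

    eeg: (n_trials, n_channels, n_samples) array aligned to stimulus onset.
    """
    windows, ssts = [], []
    sst = sst0
    while sst + st < max_end:
        start = int(round(sst * fs))
        windows.append(eeg[:, :, start:start + int(st * fs)])
        ssts.append(round(sst, 2))
        sst += step
    return np.stack(windows), np.array(ssts)

wins, ssts = slice_windows(np.random.randn(160, 9, 5 * 250))
print(len(ssts), ssts[:4])  # 35 windows; SSTs 0.02, 0.06, 0.10, 0.14, ...
```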
In the validation set, we estimated the classification accuracy and ITR. The ITR (in bits/min) is computed as

$$ITR = \frac{60}{T}\left[\log_2 N + P \log_2 P + (1 - P)\log_2\frac{1 - P}{N - 1}\right]$$

In this equation, $T$ is the total verdict time, including gaze shift time (GST), SST, and ST; $N$ is the number of targets; and $P$ is the classification accuracy. The SST corresponding to the maximum ITR is the optimal SST (OSST). The detailed training process is described below.
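A direct implementation of this ITR computation is:

```python
import numpy as np

def itr_bits_per_min(n_targets, p, t_verdict):
    """Wolpaw ITR in bits/min; t_verdict = GST + SST + ST in seconds."""
    if p <= 0 or p >= 1:                  # guard the log terms at the extremes
        bits = np.log2(n_targets) if p >= 1 else 0.0
    else:
        bits = (np.log2(n_targets) + p * np.log2(p)
                + (1 - p) * np.log2((1 - p) / (n_targets - 1)))
    return 60.0 / t_verdict * bits

# Example: 40 targets, 90% accuracy, 0.5 s GST + 0.14 s SST + 0.6 s ST
print(itr_bits_per_min(40, 0.9, 0.5 + 0.14 + 0.6))  # ~209 bits/min
```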

G. Implementation Details
First, we separated all targets into seen (with training data) and unseen classes (without training data). Then, we divided all the data into training, validation, and test sets using leave-one-block-out cross-validation (LOOCV) according to the dataset. We first took out one block for the test set and randomly selected one of the remaining blocks for the validation set; the rest formed the training set. The training and validation sets contain only the seen classes, while the test set contains all classes. Within one iteration, the data input to the network have the same SST.
The training process has four stages in total: cross-subject pre-training, subject-specific fine-tuning, DSST search, and fixed-window alternating training. The batch size is consistent with the number of seen classes to ensure sufficient data for each iteration. We first pre-trained the network using the training set of all subjects (Algorithm 1). The pre-training part only trains Transformer-Net (parameters $\theta_1$), Extraction-Net (parameters $\theta_2$), Generation-Net (parameters $\theta_3$), and Elec-Comb-Net (parameters $\theta_4$). The pre-trained network parameters are shared among subjects, and our goal was to minimize the cosine embedding loss between the output waveforms during pre-training. The loss function for target $k$ is shown in (4):

$$\min_{\theta_1, \theta_2, \theta_3, \theta_4} \; \left[1 - \cos(X_1, S_k)\right] + \left[1 - \cos(X_2, S_k)\right] + \left[1 - \cos(T, S_k)\right] \quad (4)$$

The pre-training process performs only one epoch and stores the pre-trained network parameters. The subject-specific fine-tuning stage loads the pre-trained parameters for network initialization, and (4) is optimized again for the specific subject over $N_{tune} = 5$ epochs. The pseudo-code is illustrated in Algorithm 2.
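A sketch of the loss in (4) is given below; pairing each EEG-side output with the Generation-Net output of the same class is our reading of "minimize the cosine embedding loss between the output waveforms", so the exact pairing should be treated as an assumption.

```python
import torch
import torch.nn.functional as F

def latent_alignment_loss(x1, x2, t, s_k):
    """Cosine embedding loss between EEG-side outputs (X1, X2, T) and the
    Generation-Net output S_k of the matching class (assumed pairing)."""
    target = torch.ones(x1.shape[0], device=x1.device)  # +1 = matching pairs
    return (F.cosine_embedding_loss(x1.flatten(1), s_k.flatten(1), target)
            + F.cosine_embedding_loss(x2.flatten(1), s_k.flatten(1), target)
            + F.cosine_embedding_loss(t.flatten(1), s_k.flatten(1), target))
```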
After completing the subject-specific fine-tuning stage, we conducted the DSST strategy to find the optimal time window. During this stage, the fine-tuned network parameters $\theta_1, \theta_2, \theta_3, \theta_4$ are frozen. For each ST window, we trained a Classification-Net (parameters $\theta_5$) using the cross-entropy loss:

$$\min_{\theta_5} \; -\sum_{i} \sum_{k} y_{ik} \log \hat{y}_{ik} \quad (5)$$

In this equation, $\hat{y}_{ik}$ is the predicted class probability, and $y_{ik}$ is the ground truth. For each window, we performed $N_{DSST} = 20$ epochs of training and used the validation set to estimate the performance in this window. We defined four classification tactics and obtained the final classification result according to their combination (6).
In the validation set, we counted the accuracy of these four classification tactics separately and denoted the tactic with the highest accuracy as $C_{opt}$. The overall accuracy is predicted from $C_{agg}$ and $C_{opt}$ using (7). $Acc_{agg}$ represents the classification accuracy using $C_{agg}$ as the result, while $Acc_{opt}$ represents the same for $C_{opt}$. $\alpha$ is the coefficient for predicting the unseen-class accuracy from the seen-class accuracy. $\alpha$ and $\beta$ are hyperparameters based on the proportion of seen classes to all classes (as shown in (8) and (9)).
In [12], we observed that the accuracy of the unseen classes can be estimated by multiplying the accuracy of the seen classes by a factor $\alpha$ ($\alpha < 1$). We assumed this coefficient varies linearly with the proportion of seen classes to simplify calculations. Based on the results of [12], we set $\alpha$ to 0.9 for a seen-class proportion of 0.8 and to 0.6 for a seen-class proportion of 0.2. It is important to note that we did not optimize $\alpha$ using the results of this study, to ensure unbiased outcomes.
The $\beta$ parameter relates to the characteristics of Transformer-Net. Given the limited amount of data in an SSVEP dataset, the Transformer network is more susceptible to overfitting, leading to worse performance [15], particularly when training with fewer seen classes. To mitigate this issue, we introduced a weight $\beta$ to perform an average ensemble of the network outputs. When the seen ratio is 0.8, both networks contribute equally to the result; conversely, when the seen ratio is 0.2, we block the output of Transformer-Net. We chose to vary the $\beta$ parameter linearly for simplicity of calculation. The specific settings for $\alpha$ and $\beta$ are given in (8) and (9), respectively, where $r$ represents the ratio of seen classes:

$$\alpha = 0.5r + 0.5 \quad (8)$$

$$\beta = \frac{5r - 1}{6} \quad (9)$$
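Expressed in code, with both linear rules fit through the stated endpoints:

```python
def alpha_beta(r):
    """Hyperparameters from (8) and (9) as linear functions of the seen-class
    ratio r (alpha: 0.6 at r=0.2, 0.9 at r=0.8; beta: 0 at r=0.2, 0.5 at r=0.8)."""
    return 0.5 * r + 0.5, (5.0 * r - 1.0) / 6.0

print(alpha_beta(0.8))  # (0.9, 0.5)
print(alpha_beta(0.2))  # (0.6, 0.0)
```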
The DSST strategy estimates the ITR of each window based on its $Acc_{pred}$. The window with the highest ITR is denoted as $Win_{opt}$, and its sampling starting time is the optimal SST (OSST).
We used the data from $Win_{opt}$ for the last stage, fixed-window alternating training. We first trained 3 epochs according to (4), followed by 2 epochs according to (5). This procedure was repeated for a total of $N_{fix} = 25$ epochs.
We tested using the data from $Win_{opt}$. For the test output, if the classification result of $C_{agg}$ belongs to the unseen classes, this result is used directly; otherwise, the classification result of $C_{opt}$ is used as the output.
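This decision rule can be written compactly:

```python
def final_prediction(pred_agg, pred_opt, unseen_classes):
    """If C_agg predicts an unseen class, use it directly (only the
    correlation pathway covers unseen classes); otherwise defer to C_opt."""
    return pred_agg if pred_agg in unseen_classes else pred_opt

unseen = set(range(32, 40))                    # e.g., classes 32-39 unseen
assert final_prediction(35, 3, unseen) == 35   # unseen prediction kept
assert final_prediction(10, 3, unseen) == 3    # otherwise C_opt decides
```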

IV. DATASETS AND RESULTS
A. Datasets

1) Benchmark Dataset: The Benchmark dataset [13] contains the data of 35 subjects aged 17 to 34 years. It has a total of 40 targets arranged in a 5 × 8 matrix on a 23.6-inch LCD monitor. Each stimulus is a square with a side length of 32 pixels and a black-and-white checkerboard texture inside. Subjects were seated 70 cm in front of the monitor, and each stimulus corresponded to a 3.2° field of view (FOV). The stimulus frequencies range from 8.0 to 15.8 Hz in 0.2 Hz increments, and the phases increase in steps of π/2. Each target uses a sine function to encode luminance:

$$s(f, p, m) = \frac{1}{2}\left\{1 + \sin\left[2\pi f \left(\frac{m}{F_r}\right) + p\right]\right\}$$

In the equation, $f$ and $p$ are the encoding frequency and phase of the target, $F_r$ is the refresh rate of the monitor ($F_r = 60$ Hz), and $m$ is the display frame index. There are 6 blocks of data for each subject, and each block contains 40 trials (one trial per target). Each trial contains 6 s of data. First, subjects were given a cue of the stimulus for 0.5 s, which also served as the GST. Next, subjects were required to gaze at the specified stimulus for 5 s. At the end of the stimulus, the idle state was recorded for another 0.5 s. The dataset was sampled using a 64-channel SynAmps2 EEG system (10-20 layout) at a sampling rate of 1000 Hz. The reference electrode was Cz, and the electrode impedances were below 10 kΩ during recording. The data were down-sampled to 250 Hz for storage after filtering out power-line noise.
2) BETA Dataset: The BETA dataset [16] contains EEG data from 70 subjects aged 9 to 64. It also has 40 targets, but arranged as a QWERTY keyboard. The side length of each square stimulus block is 136 pixels (3.1° FOV), while the spacebar is 966 × 136 pixels. Each subject has a total of 4 blocks of data. The task settings were consistent with the Benchmark dataset, but the stimulus times were different: 2 s for the first 15 subjects and 3 s for the remaining subjects. The data were acquired outside the laboratory setting, so the SNR is lower than that of the Benchmark dataset. A 64-channel EEG (10-10 layout) was recorded using SynAmps2 at a sampling rate of 1000 Hz and down-sampled to 250 Hz for storage. Other settings in this dataset, such as stimulus frequency and phase, are the same as those in the Benchmark dataset.
3) UCSD Dataset: The UCSD dataset [17] recorded SSVEP data from 10 subjects with a total of 12 classes. Each class has 15 trials of data. Each stimulus was 6 × 6 cm in size, arranged in a 4 × 3 matrix on a 27-inch LCD monitor. Each target flashes at a different stimulus frequency (starting from 9.25 Hz and increasing in 0.5 Hz steps) and phase (starting from 0 and increasing in π/2 steps). The luminance modulation function was a square wave with a 50% duty cycle. Subjects sat 60 cm in front of the monitor wearing an 8-electrode EEG cap. A red square would randomly appear at the position of a target stimulus for 1 s, and the subject then needed to gaze at this stimulus for 4 s. During the gaze period, EEG data were recorded at a sampling rate of 2048 Hz and down-sampled to 256 Hz for storage.

B. Results
We used data from nine electrodes located over the occipital lobe, namely Pz, POz, PO6, PO5, PO4, PO3, Oz, O2, and O1. We verified the performance of our method with 8 and 32 unseen classes using the Benchmark dataset. The selection of unseen classes is consistent with the Grid Distribution used by GZSL-SSVEP [12]. For the case of 8 unseen classes, we used two representative training-based methods as baselines, namely extended canonical correlation analysis (eCCA) [5] and task-related component analysis (TRCA) [6], both of which require the training data of all classes. eCCA obtains the optimal sine template using the training data, while TRCA obtains the EEG data template by maximizing the correlation coefficients between different training trials of the same target. Both methods perform classification using the correlation coefficients between the templates and the test data.
For 32 unseen classes, we used training-free methods as baselines. Standard canonical correlation analysis (sCCA) [3] is the commonly used training-free SSVEP classification method, while filter bank canonical correlation analysis (FBCCA) combines filter bank analysis with sCCA and is currently the SOTA training-free method. We also compared with the Stimulus-stimulus Transfer method in both cases [11]. In the original paper, that method was verified with 36, 32, and 28 unseen classes; we adapted it based on its published code and set its unseen classes to be consistent with our method.
All methods used 0.14 s as the SST, except for our proposed method. The accuracy and ITR results are shown in Fig. 4, with error bars indicating the LOOCV standard error. As shown in the figure, our method greatly outperforms the existing comparison methods. Since Stimulus-stimulus Transfer uses five-fold cross-validation (while the other methods use LOOCV), we used an independent two-sample t-test to compare with Stimulus-stimulus Transfer and paired t-tests for the other methods. The t-tests showed a significant performance improvement. The least significant differences are presented in Table V.
On the BETA dataset, we illustrated the accuracy and ITR of the classification methods with 8 and 32 unseen classes (Fig. 5). Our method still significantly outperforms the comparison methods (Table V).
The training procedure used a GeForce RTX 3080 10GB GPU; the computation time is shown in Table VI, and the model size is shown in Table VII. The pre-training and subject-specific fine-tuning stages used data from all ST windows. When the ST is shorter, the computation time of a single trial decreases, but the total data volume increases; therefore, the training time first increases and then decreases. In the DSST search and fixed-window alternating training stages, each epoch used the same ST window, so the training time increases with the length of a single trial. Although our method requires a certain training time, the inference times are all lower than the GST (i.e., 500 ms) and are therefore practical.

C. Performance on Small Dataset
To validate the robustness of our method in few-target applications, we conducted experiments using five-fold cross-validation on the UCSD dataset [17]. We used POz, PO8, PO7, PO4, PO3, Oz, O2, and O1 as sampling electrodes, and the targets corresponding to the 9.75, 11.25, 12.75, and 14.25 Hz stimuli as the seen classes. We compared the experimental results with FBCCA and sCCA, as shown in Table VIII. All improvements are significant ($p < 0.001$). The results show that our algorithm maintains good performance even in the few-target scenario.

D. Sampling Starting Time
We proposed the DSST strategy to improve ITR based on the hypothesis that the OSST varies across subjects. To verify this hypothesis, we recorded the OSST of different subjects (Fig. 6).

E. Ablation Studies
To investigate the contribution of each component to the overall performance, we performed ablation studies using the Benchmark dataset with a 0.6 s ST. We started with GZSL-SSVEP as the basis and gradually added the methods proposed in this work. The results are presented in Fig. 7.

F. Latent Space Embeddings
Our objective is to leverage the larger receptive field of the Transformer structure, compared with the CNN, for more effective mapping in the latent space. Therefore, it is crucial to explore the differences between the output waveforms of the two networks in the latent space. Studies on Transformers indicate that this structure can generate more detailed features than CNNs [30], [31], [32]. To investigate this, we conducted experiments on subject 22 of the Benchmark dataset and plotted Fig. 8(a). The figure shows that the output waveform of the Transformer structure has more detailed components. We analyzed the power spectrum (Fig. 8(b)) above the 3rd harmonic (24 Hz) and found that the output power of Transformer-Net is significantly higher than that of Extraction-Net ($p < 0.05$).

G. Number of Unseen Classes
A decrease in the number of seen classes leads to a decrease in overall classification performance [11], [12]. We investigated the classification accuracy and ITR of the proposed network with respect to the number of seen classes. We adjusted $\alpha$ and $\beta$ based on (8) and (9), and the network batch size was kept the same as the number of seen classes. We also followed the treatment of n in GZSL-SSVEP. The final results are presented in Fig. 9. As shown in the figure, accuracy and ITR decrease as the number of unseen classes increases.

V. DISCUSSION AND FUTURE WORK
We proposed an SSVEP classification method based on generalized zero-shot learning in this study. The method can classify all categories while requiring training data for only the seen classes. Our method uses three strategies to improve training data utilization and can effectively handle visual latency differences across subjects. To the best of our knowledge, we achieved SOTA performance in terms of accuracy and ITR.
Individual variability in visual latency among subjects has never been discussed in previous studies. Most of these studies used a fixed SST, whereas we found differences in OSST across subjects; therefore, we proposed the DSST strategy. By analyzing ST and OSST (Fig. 6), we speculated that an optimal verdict range exists in the data that achieves the maximum ITR. When the ST is small, DSST increases the SST to make the input data fall within this range, until doing so can no longer increase the ITR. When the ST is large, the data are sufficient to cover the optimal verdict range; therefore, to increase the ITR as much as possible, the DSST strategy decreases the total verdict time by using a short SST.
We also speculated that the improvement in classification accuracy from DSST is an indirect result of improving ITR. Since the ST is fixed, the strategy adjusts the SST to maximize ITR, which can increase accuracy. When the ST is short, the SST will increase to avoid the visual latency and thus increase accuracy. However, DSST may limit accuracy when the ST is long. ITR is constrained by verdict time, accuracy, and target number. Previous studies show that overall accuracy keeps increasing as ST increases, but ITR first increases and then decreases [4], [5], [6], [20]. The DSST strategy will decrease the total verdict time after accuracy reaches a certain level in order to obtain a higher ITR. This introduces more data prior to the SSVEP response and interferes with the classifier; although ITR can improve, accuracy may not. Future studies can propose more effective solutions to this problem.
We analyzed the effectiveness of the different network components through an ablation study (Fig. 7). The results indicated that the improvement from every component was significant, except for the cross-subject pre-training part, which showed insignificant differences in accuracy and ITR ($p > 0.05$). For accuracy, the smallest p-value occurs at the introduction of Transformer-Net ($p = 6.98 \times 10^{-5}$); for ITR, the smallest p-value occurs at the introduction of DSST ($p = 2.87 \times 10^{-4}$). These findings confirm the validity of our network design: Transformer-Net can handle long data sequences, Classification-Net improves the utilization of seen-class data, and DSST can handle differences in visual latency between subjects. The cross-subject pre-training can reduce the convergence time for each subject without performance loss. In GZSL-SSVEP [12], the network needs 30 epochs to converge, whereas our proposed structure only requires 5 epochs per subject to accomplish the same fitting.

Fig. 7. Ablation studies. Experiments are conducted using data from the Benchmark dataset to explore the contribution of each optimization over GZSL-SSVEP to the overall accuracy and ITR.
By analyzing the outputs of Transformer-Net and Extraction-Net in the latent space, we found that Transformer-Net offers better detail retention, which is consistent with studies on Transformers in the image processing field. Our proposed Transformer-Net uses a combination of MHA and CA: MHA utilizes the correlation of signals on the time scale, while CA utilizes the correlation of the EEG between different electrodes. As a result, Transformer-Net can better preserve the input details. This retention of details means that the intermediate features contain rich information, which enables Classification-Net to perform more accurate classification. Previous studies used CNNs as the backbone, which we believe may lead to a loss of detail that affects the performance of classifiers based on correlation coefficients; moreover, this loss of detail may also result in inadequate fitting of the Classification-Net.
Although we have demonstrated the superiority of the proposed method in the results section, it still has some shortcomings. First, there may be a bias in predicting the accuracy of the unseen classes. We used the $\alpha$ parameter to predict the unseen-class classification accuracy from the seen-class classification accuracy. Some studies on domain generalization provide ideas that could deal with this generalization problem [33], [34]; however, no studies have applied them to GZSL for SSVEP, which may be a new direction to explore. Another problem caused by a reduced number of seen classes is that the Transformer structure may suffer from overfitting. As shown in Fig. 9, when the number of seen classes decreases, both accuracy and ITR decrease. In particular, when validating on the UCSD dataset, we found that further reducing the number of seen classes may cause the network to collapse. We used the $\beta$ parameter to weaken the effect of the Transformer on overall performance, but this still requires improvement. In addition, other parts of the network, such as Generation-Net and Classification-Net, may also overfit as the number of seen classes decreases. Although overfitting of these components was not evident in this study, it should be noted. Future studies can explore two directions to prevent this problem: using an optimization structure that avoids overfitting [35], [36] or introducing data augmentation in training [37].

Fig. 1. Structure of the proposed network. Extraction-Net, Elec-Comb-Net, and Generation-Net are similar to the networks proposed in [12]. Transformer-Net processes the global information of the input. Classification-Net classifies the seen classes and implements the DSST strategy. The inputs to the network use the same SST. The outputs $X_1$, $X_2$, $S$, and $T$ are waveforms, and $C$ is the classification probability.

Fig. 2. Comparison of the input Sine Template of Generation-Net. (a) The construction of the Sine Template in [12]. (b) The construction of the Sine Template in this work. We added 140 ms more input to make full use of the output $S$.

Fig. 3. Illustration of the dynamic sampling starting time (DSST) strategy. Gaze shift time (GST) is the time it takes for the user to shift their gaze from one target to another. The SST of Window 0 is 0.02 s, and each subsequent segment starts in 0.04 s steps (i.e., 0.06 s, 0.10 s, 0.14 s, . . .). We limited the sum of SST and ST to less than 2.0 s to reduce the search amount. The total verdict time contains the GST, SST, and sampling time (ST).

Fig. 4. The mean classification accuracy and ITR on the Benchmark dataset. The horizontal axis indicates the length of the ST. Stimulus-stimulus Transfer [11] is abbreviated as Stim-stim Trans. (a) Classification accuracy with 8 unseen classes. (b) ITR with 8 unseen classes. (c) Classification accuracy with 32 unseen classes. (d) ITR with 32 unseen classes.

Fig. 5. The mean classification accuracy and ITR on the BETA dataset. The horizontal axis indicates the length of the ST. (a) Classification accuracy with 8 unseen classes. (b) ITR with 8 unseen classes. (c) Classification accuracy with 32 unseen classes. (d) ITR with 32 unseen classes.

Fig. 6. OSST of different subjects in the Benchmark dataset. (a) The average OSST of each subject over all blocks. (b) The trend of OSST with ST for all subjects (*: p < 0.05, **: p < 0.01, ***: p < 0.001). The error bars represent the standard error between the different test blocks.

Fig. 8. Comparison of the latent space embeddings of the seen classes of subject 22 from the Benchmark dataset. The vertical coordinates are the amplitudes of the normalized signal and spectrum in arbitrary units (a.u.). The solid red line is the waveform output from Transformer-Net, and the dashed blue line is the waveform output from Extraction-Net. (a) In the time domain, the output of Transformer-Net contains more detailed components. (b) In the frequency domain, it has more high-frequency energy.

Fig. 9. Accuracy (a) and ITR (b) under different numbers of unseen classes on the Benchmark dataset.

TABLE II ARCHITECTURE AND PARAMETERS OF Extraction-Net AND Elec-Comb-Net