An Auxiliary Synthesis Framework for Enhancing EEG-Based Classification With Limited Data

While deep learning algorithms significantly improve the decoding performance of brain-computer interfaces (BCI) based on electroencephalogram (EEG) signals, this performance relies on a large amount of high-resolution training data. However, collecting sufficient usable EEG data is difficult due to the heavy burden on the subjects and the high experimental cost. To overcome this data insufficiency, this paper introduces a novel auxiliary synthesis framework, which is composed of a pre-trained auxiliary decoding model and a generative model. The framework learns the latent feature distributions of real data and uses Gaussian noise to synthesize artificial data. The experimental evaluation reveals that the proposed method effectively preserves the time-frequency-spatial features of the real data, enhances the classification performance of the model using limited training data, is easy to implement, and outperforms common data augmentation methods. The average accuracy of the decoding model designed in this work is improved by (4.72±0.98)% on the BCI Competition IV 2a dataset. Furthermore, the framework is applicable to other deep learning-based decoders. The finding provides a novel way to generate artificial signals for enhancing classification performance when data are insufficient, thus reducing the cost of data acquisition in the BCI field.


I. INTRODUCTION
BRAIN-COMPUTER interface (BCI) identifies brain activity, converts it into instructions or information, and establishes pathways between the brain and external devices [1]. Electroencephalogram (EEG) is one of the most commonly used methods for recording brain activity in BCI. EEG measures the electrical signals generated by the brain at the scalp, and is characterized by high temporal resolution, low trauma and low cost [2], [3]. EEG-based BCI systems are usually used for prosthesis control [4], emotion recognition [5], speech recognition [6], epilepsy prediction [7], sleep monitoring [8], etc. However, limited by data acquisition and low recognition accuracy, the practical application of BCI technology remains challenging. Over the past decades, numerous studies have applied deep learning methods to EEG signal recognition [9], and the classification performance has been greatly improved as a result. Deep learning methods automatically extract features from the original signals and complete the classification [10]. However, such methods usually need a large amount of training data to learn the latent features; small datasets and low-resolution data tend to cause overfitting and feature dependence of the models [11]. As a result, the classification performance of deep learning methods may even be inferior to that of traditional methods, such as linear discriminant analysis, support vector machine and naive Bayes classifier. Some few-shot learning strategies, such as multi-task learning, transfer learning and meta-learning [12], try to solve this problem from the perspective of models and algorithms and have achieved some results, but such algorithms are complicated to design. By contrast, approaching the problem from the perspective of data is expected to fundamentally solve the training problem of deep learning networks caused by insufficient data.
Compared with computer vision (CV) and natural language processing (NLP), it is difficult to collect enough high-quality data in the BCI field [13], [14]. There are generally four reasons: 1) Data collection experiments are cumbersome and time-consuming. Subjects may feel uncomfortable during the collection process, and the state of the subjects affects the quality of the data [15]. 2) Because of the physical dysfunction of some subjects, information such as movement and sound is difficult to track [16]. 3) Collected data may be discarded due to problems such as interference and missing information [15], [17]. 4) Qualified subjects and experimental environments are scarce due to the strict requirements [18].
Data augmentation is one of the effective ways to alleviate the problem of insufficient data [19]. This approach is based on the assumption that more information can be extracted from the original dataset through augmentation, and it artificially increases the size of the training dataset by deforming or sampling [20]. Data augmentation is a mature technology in the field of CV; however, the geometric transformations used for image data augmentation may not be applicable to BCI research due to the characteristics of EEG signals, such as non-stationarity, time-varying sensitivity and individual differences [21], [22], [23]. According to the investigations of Lashgari and He et al. [20], [24], common augmentation methods used in this area include cropping (sliding window), adding noise, and generative adversarial networks (GAN) [25]. Other methods, such as recombination of segmentation [26], Fourier transform [27] and the synthetic minority over-sampling technique (SMOTE) [28], can also be found in some BCI studies.
Cropping is a simple and effective method for EEG data augmentation. This approach uses a sliding window to slice the raw data and thereby obtain many more training samples. Schirrmeister et al. [29] used a 2 s sliding window to crop and expand the original motor imagery data, and improved the decoding performance of a deep convolutional network. Zhao et al. [30] used the cropping method to train a network composed of three deep convolutional neural network branches, and alleviated overfitting. Mousavi et al. [31] proposed an automatic sleep stage recognition method based on deep learning and used the cropping method to balance the data of different sleep stages, finally obtaining 93.55% accuracy, which is higher than that obtained with the GAN augmentation method. In addition, Luo [32], Tayeb [33], Majidov [34] et al. also improved the classification performance with the cropping method.
Another easily implemented augmentation method is adding noise. It expands the dataset by adding noise to the original EEG signals or to the extracted feature maps. Wang et al. [35] effectively improved the emotion recognition performance of a deep learning model by adding Gaussian noise to DE features. Salama et al. [36] added Gaussian noise with zero mean and unit variance to the raw data to improve the emotion recognition performance of their 3D-CNN, and the accuracy increased from 79.11% and 79.22% to 88.49% and 87.44% for valence and arousal classification, respectively. Li et al. [37] proposed CP-MixedNet and trained the model with an amplitude-perturbation data augmentation method, which adds noise to the amplitudes of spectral images; the classification performance on the BCI Competition IV 2a dataset and the High Gamma dataset was significantly improved.
Despite the advances of cropping and adding noise in the field of EEG data augmentation, these two methods still cannot completely satisfy the needs of artificial multi-channel EEG signal generation, due to information loss, redundant noise, or the inability to use the underlying features of the data [38]. Data augmentation methods based on deep learning can extract features and use them to reconstruct artificial data [39]. GAN is one of the most common methods; it aims at achieving a Nash equilibrium [40] between a generative model and a discriminative model, learning distributions from the original data and generating new data [41]. Luo et al. [15] used conditional Wasserstein GAN (cWGAN) and selective Wasserstein GAN (sWGAN) to improve the performance of the classifiers in their emotion recognition task. Nik Aznan et al. [42] used Deep Convolutional GAN (DCGAN), Wasserstein GAN (WGAN) and Variational Autoencoder (VAE) models, respectively, to generate artificial data and improve the cross-subject decoding performance of the models. Xu et al. [23] designed a BWGAN-GP model to alleviate class imbalance, and the area under the curve tested with EEGNet was 3.7% higher than that obtained with the original data. Xu et al. [43] applied a GAN based on a convolutional neural network (CNN) and a recurrent neural network (RNN) to synthesize artificial multichannel EEG preictal samples for epileptic seizure (ES) prediction, and the accuracy and area under the curve improved from 73.0% and 0.676 to 78.0% and 0.704.
Although GAN-based methods perform well in EEG data augmentation, some problems remain, such as the large number of variant models, the complex training process and instability [17], [44]; moreover, an additional decoding model is usually required to complete the final classification task. In this study, we propose a data synthesis framework based on a deep generative model to improve classification performance. The framework uses limited real data and Gaussian noise to synthesize artificial data, and is expected to reduce the complexity of the training process while preserving the feature information of real data and reducing the redundant noise of synthesized data. The novelties of this study are summarized as follows:
• An auxiliary approach is introduced to design a synthesis framework, which utilizes a pre-trained decoding model to assist in synthesizing artificial signals.
• A decoding model and a generative model are designed to extract the temporal-spatial features of EEG signals for classification and synthesis.
• Different numbers of training samples are set to explore the improvement of the decoding performance under limited data. The method is also transferred to and implemented with state-of-the-art decoders.
• Visualization methods from multiple perspectives are provided for interpreting the artificial data and the framework.
The rest of this paper is organized as follows. Section II introduces the dataset and describes the proposed framework and the general methods used for comparison and evaluation. Section III presents the experimental results. Section IV discusses the results. Section V gives the main conclusions.

A. Dataset and Preprocessing
The BCI Competition IV 2a dataset is used in this study [45]. This dataset contains EEG signals recorded from nine subjects while they imagined movements of the left hand, right hand, foot and tongue. Each subject completed 2 sessions, each containing 288 motor imagery trials. EEG signals were collected through 25 Ag/AgCl electrodes: the first 22 are EEG channels and the last 3 are EOG channels. The sampling frequency is 250 Hz, and a 0.5-100 Hz band-pass filter and a 50 Hz notch filter were applied. In this study, only the EEG channels are used, and the data of the cue and motor imagery periods are clipped as a single sample lasting 4 s. The dataset for each subject therefore has the shape 576 × 22 × 1000 (trials × channels × time points). All datasets are normalized before being input to the model.
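The shape arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the normalization scheme is not specified in this section, so the global z-score in `zscore` is an assumption, and `toy_trials` is a random stand-in rather than real EEG.

```python
import numpy as np

FS = 250             # sampling frequency in Hz
TRIAL_SECONDS = 4    # cue + motor imagery period kept per sample
N_SESSIONS, TRIALS_PER_SESSION, N_EEG_CHANNELS = 2, 288, 22

# Shape of one subject's dataset: (trials, channels, time points).
dataset_shape = (N_SESSIONS * TRIALS_PER_SESSION, N_EEG_CHANNELS, FS * TRIAL_SECONDS)

def zscore(x, eps=1e-8):
    """Assumed normalization: zero mean and unit variance over the whole array."""
    return (x - x.mean()) / (x.std() + eps)

# Small random stand-in for EEG trials (NOT real data), with an offset and scale.
rng = np.random.default_rng(0)
toy_trials = rng.normal(loc=5.0, scale=3.0, size=(8, N_EEG_CHANNELS, FS * TRIAL_SECONDS))
normalized = zscore(toy_trials)
```

Other choices, such as per-trial or per-channel normalization, would fit the same description; the sketch only illustrates the zero-mean, unit-variance target.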

B. Auxiliary Synthesis Framework
In order to effectively retain the original information and reduce redundant noise, the proposed framework is built on deep learning methods that have been proven to work for EEG synthesis and classification. In the following subsections, the generic layout of the auxiliary synthesis framework is described first, followed by the details of the auxiliary decoding model, the generative model, the loss functions and the training configuration.
1) Framework Overview: An overview of the architecture of the auxiliary synthesis framework is presented in Fig. 1. The framework can be divided into 2 stages:
• Pre-training Process of the Auxiliary Decoding Model. In this stage, the auxiliary decoding model is pre-trained on the limited real samples, while ensuring that its accuracy is not less than a given threshold, which can be determined by cross-validation on the real samples. This constraint prevents the performance degradation of the decoding model caused by random factors during the training process, thereby improving the stability of the generative model and further improving the quality of the synthesized data. In fact, training with insufficient data is likely to produce an unstable result [46].
• Training and Synthesis Process of the Generative Model. The pre-trained auxiliary decoding model is used to assist the generative model in training and synthesizing new data. Specifically, in this stage, the generative model learns the mappings between labeled Gaussian noise and the real data distribution to synthesize specific artificial data. The synthesized data and their ground-truth labels are then input into the auxiliary decoding model to obtain the probability distribution, which is used to calculate the cross-entropy (CE) loss. The mean squared error (MSE) loss between the synthesized data and the real data is also calculated. Finally, both CE and MSE are used to optimize the generative model. The parameters of the auxiliary decoding model are frozen at this stage, and only the generative model is updated. All synthesized samples from the last epoch of the training stage are retained, labeled with the ground-truth labels, and eventually appended to the training dataset to retrain the decoding model.
2) Auxiliary Decoding Model: The auxiliary decoding model captures the latent features of EEG data and outputs the probability distribution over the classes. In the framework, it helps the generative model synthesize artificial data, and it is also used for the final classification task. As shown in Fig. 2(a), a combination of ordinary convolution and depthwise separable convolution is used to extract the spatial-temporal features of real or synthesized data. Depthwise separable convolution reduces the number of parameters while maintaining the decoding performance [47], [48]. Considering the efficacy and generalizability of deep learning on EEG-based decoding of motor imagery, the Squeeze-and-Excitation (SE) attention mechanism is added to improve the classification performance by changing the weights of different channels [27], [49]. These weighted spatial-temporal features are finally classified by a fully connected layer.
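To illustrate how an SE gate changes the weights of different channels, the following NumPy sketch applies the usual squeeze (global average), excitation (two small dense layers) and rescaling steps. It is not the layer configuration of Fig. 2(a): the function name `se_block`, the weight shapes and the random inputs are all illustrative assumptions.

```python
import numpy as np

def se_block(features, w1, w2):
    """Reweight feature maps with a squeeze-and-excitation gate.

    features: (batch, channels, time) feature maps.
    w1: (channels, hidden) and w2: (hidden, channels) dense weights.
    """
    squeezed = features.mean(axis=-1)             # squeeze: (batch, channels)
    hidden = np.maximum(squeezed @ w1, 0.0)       # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate in (0, 1)
    return features * gate[:, :, None], gate

# Hypothetical sizes: 2 samples, 8 feature maps, 50 time points, bottleneck of 2.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(2, 8, 50))
toy_w1 = rng.normal(size=(8, 2))
toy_w2 = rng.normal(size=(2, 8))
reweighted, channel_gate = se_block(toy_features, toy_w1, toy_w2)
```

Each feature map is multiplied by a single learned scalar between 0 and 1, so informative channels can be emphasized without changing the tensor shape.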
3) Generative Model: The generative model learns the real data distributions and synthesizes new data from a batch of fixed Gaussian noise. The architecture is shown in Fig. 2(b). This model is built on transposed convolution, which enables the neural network to learn how to up-sample in the best way and improves the quality of the synthesized data [50], [51], [52].
Gaussian noise and category labels are used as inputs for the generative model. The label is first encoded by an embedding layer, then the encoded label is treated as a new channel and concatenated with the Gaussian noise, and finally the concatenated data are transformed by transposed convolution layers along the time and spatial directions. To keep the size of the synthesized data consistent with the original data, we set different stride sizes in the transposed convolution operations and add padding operations; the specific parameters are shown in Fig. 2(b).
4) Loss Functions: Inspired by the loss functions used in [53], [52], and [55], both the cross-entropy loss and the mean squared error loss are used to optimize the auxiliary synthesis framework. The cross-entropy loss enables the generative model to focus on the classification features of the data, and ensures that the synthesized data maintain a certain level of classification performance under the current decoding model [55]. At the generative model training stage, the cross-entropy loss of the synthesized data is calculated using the probability distribution provided by the pre-trained auxiliary decoding model, and is defined as follows:

$$L_{CE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{r,n,c}\,\log A\big(G(z_n, y_{r,n})\big)_c$$

where $N$ is the batch size, $C$ is the number of classes, $z_n$ is the Gaussian noise, $y_{r,n}$ is the label of the real data (with $y_{r,n,c}$ its one-hot component for class $c$), $G(\cdot)$ is the generative model, and $A(\cdot)$ is the auxiliary decoding model. The mean squared error loss is used in the final loss to ensure that the distribution of the synthesized data is similar to that of the real data and to prevent the generative model from synthesizing unexpected data. It is defined as follows:

$$L_{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left\| x_{r,n} - G(z_n, y_{r,n}) \right\|_2^2$$

where $N$ is the batch size, $x_{r,n}$ is the real data, $z_n$ is the Gaussian noise, $y_{r,n}$ is the label of the real data, and $G(\cdot)$ is the generative model. The final loss used to update the generative model consists of the cross-entropy loss and the mean squared error loss, as shown in Fig. 1, and is defined as a weighted sum of the two, where α and β are the weights that control the interaction of the losses; we set α to 1 and β to 0.0001 in this study.
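A minimal NumPy sketch of the two loss terms and their weighted combination is given below. The function names are hypothetical, the weights are exposed as generic arguments `w_ce` and `w_mse` rather than tied to α or β, and the MSE is averaged over all elements (as common library implementations do), which differs from a per-sample sum only by a constant factor.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean CE between decoder probabilities (N, C) and integer labels (N,)."""
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]          # probability of the true class
    return float(-np.mean(np.log(picked + eps)))

def mean_squared_error(real, synthesized):
    """Element-wise mean squared error between real and synthesized samples."""
    return float(np.mean((real - synthesized) ** 2))

def weighted_loss(ce, mse, w_ce, w_mse):
    """Weighted sum of the two terms; the weight values are set by the caller."""
    return w_ce * ce + w_mse * mse

# Toy example: a decoder that is maximally unsure about 4 classes.
toy_probs = np.full((3, 4), 0.25)
toy_labels = np.array([0, 1, 2])
toy_ce = cross_entropy(toy_probs, toy_labels)     # close to ln(4)
toy_mse = mean_squared_error(np.zeros((3, 22, 10)), np.ones((3, 22, 10)))
```

In an actual training loop the probabilities would come from the frozen auxiliary decoding model applied to the generator output, and only the generator parameters would receive gradients.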

5) Detailed Configuration of Training: All models are implemented in PyTorch and are trained and tested using an NVIDIA RTX A5000. The Adam optimizer is used for training both the decoding model and the generative model. We set the weight decay coefficient to 0.001, and use an early stopping strategy [56] when training the auxiliary decoding model. Early stopping reduces the excessive influence of incoherent gradients and improves the generalization ability of the model. Both models are trained using a learning rate decay strategy, and the initial learning rate is set to 0.0003.

C. General Methods Used for Comparison
The following methods are chosen for performance comparison.
1) Cropping: Cropping is commonly used for EEG data augmentation. In our experiment, a sliding window with a length of 3.9 s and a step of 0.1 s is used to crop the original data. Both the training dataset and the testing dataset are cropped to keep their lengths consistent in the time dimension. For the training dataset, all the cropped data are retained in order to increase the size of the dataset. For the testing dataset, only the last 3.9 s of data, which contain the intact motor imagery signals, are retained.
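The cropping settings can be sketched as follows. The function names `crop_trials` and `last_window` are hypothetical; the window and step lengths are the ones stated above. Under these settings a 4 s trial at 250 Hz yields two crops (starting at 0 s and 0.1 s).

```python
import numpy as np

def crop_trials(trials, labels, fs=250, window_s=3.9, step_s=0.1):
    """Slide a window over the time axis and stack every crop as a new sample."""
    window = int(round(window_s * fs))    # 975 samples
    step = int(round(step_s * fs))        # 25 samples
    n_times = trials.shape[-1]
    starts = range(0, n_times - window + 1, step)
    crops = [trials[..., s:s + window] for s in starts]
    # Crops are stacked block-wise, so the labels are tiled in the same order.
    return np.concatenate(crops, axis=0), np.tile(labels, len(crops))

def last_window(trials, fs=250, window_s=3.9):
    """Keep only the final window of each trial (used here for test data)."""
    window = int(round(window_s * fs))
    return trials[..., -window:]

# Random stand-in for 6 training trials with 22 channels and 1000 time points.
rng = np.random.default_rng(0)
toy_trials = rng.normal(size=(6, 22, 1000))
toy_labels = np.array([0, 1, 2, 3, 0, 1])
cropped, cropped_labels = crop_trials(toy_trials, toy_labels)
test_view = last_window(toy_trials)
```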
2) Adding Noise: By adding Gaussian noise to the original data, new training data are generated while retaining the features of the original data. The probability density function of Gaussian noise follows the Gaussian distribution:

$$p(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)$$

where $z$ is the random variable, $\mu$ is the mean, and $\sigma$ is the standard deviation. We set $\mu = 0$ and $\sigma = 0.1$ in this study, and only add noise to the training dataset.
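A short sketch of this augmentation, using the stated values µ = 0 and σ = 0.1, is shown below. The function name and the fixed seed are illustrative choices, and the input is a random stand-in for normalized trials.

```python
import numpy as np

def add_gaussian_noise(trials, mu=0.0, sigma=0.1, seed=0):
    """Return a noisy copy of the trials; the input array is left unchanged."""
    rng = np.random.default_rng(seed)
    return trials + rng.normal(mu, sigma, size=trials.shape)

# Random stand-in for 4 normalized training trials (NOT real EEG).
toy_trials = np.random.default_rng(1).normal(size=(4, 22, 1000))
noisy_trials = add_gaussian_noise(toy_trials)
added_noise = noisy_trials - toy_trials
```

The noisy copies keep the labels of the trials they were derived from and are appended to the training set only.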
3) GAN: A GAN [25] consists of a generator and a discriminator. A traditional GAN needs to train multiple generators to generate multiple types of samples, but its variant cGAN [57] can impose constraints on the generator and discriminator to synthesize specified samples. In this paper, cGAN is used to directly synthesize the four types of samples. The structure of the generator is consistent with that described in Section II-B. We redesign the discriminator to form an adversarial relationship between the two models; it consists of four convolutional layers and two fully connected layers, and its structure is shown in Table I.
4) Decoding Models Used for Replacement: EEGNet [47], ShallowConvNet [29] and DeepConvNet [29] are used as the auxiliary decoding model to test the framework. We modify the sizes of the temporal convolution kernel and the pooling kernel in EEGNet to twice the original size according to the authors' suggestion. The size of the spatial convolution kernel is set to 22 for the three decoders, and the number of hidden units in the fully connected layer is modified according to the input size; other parameters are the same as the original ones and can be found in [29] and [47].

D. Cross-Validation Analysis
As shown in Fig. 3, all real samples are split into a training set, a verification set and a test set ten times. The total number of samples in the training set and verification set is equal to the number of real samples set in each experiment (i.e., 40, 80, . . . , 520). Training samples account for 90% of these and verification samples for 10%. The test set consists of all real samples other than the training and verification samples.
After each split, the training and verification sets are input into the synthesis framework to generate synthesized samples. Then the training samples, synthesized samples and verification samples are used to train and verify the classifier. Real test samples are used to test and evaluate the classifier.
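One repetition of this splitting scheme can be sketched with the standard library. The function `split_indices` is hypothetical, and the plain random shuffle is an assumption: this section does not state whether the split is stratified by class, so a class-balanced variant may be closer to the actual procedure.

```python
import random

def split_indices(n_total, n_real, seed, train_fraction=0.9):
    """Return disjoint (train, verification, test) index lists for one repetition."""
    rng = random.Random(seed)
    order = list(range(n_total))
    rng.shuffle(order)
    n_train = round(n_real * train_fraction)
    train = order[:n_train]
    verification = order[n_train:n_real]
    test = order[n_real:]              # every remaining real sample
    return train, verification, test

# Ten repetitions with 40 real samples drawn from the 576 trials of one subject.
splits = [split_indices(576, 40, seed) for seed in range(10)]
```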

E. Evaluation Metrics
In the visualization section, data are mainly evaluated by visual inspection. Accuracy and standard deviation are used to evaluate the classification performance of the model. The following metrics are also used for performance evaluation.
1) Fréchet Inception Distance (FID): FID is commonly used to evaluate the quality of generative model and synthesized samples [58]. Compared with Inception Score (IS), this metric is more robust to noise and more sensitive to the quality of the generative model. FID uses a pre-trained classifier to compare the feature distribution of real samples and synthesized samples in the embedded layer.
2) Wasserstein Distance (WD): WD describes the cost of converting one distribution to another under a given cost function [59], and is often used to measure the similarity between any two distributions.
3) Euclidean Distance (ED): ED is used to evaluate the similarity between samples. The minimum Euclidean distance (ED min) is the minimum distance between real samples, or the minimum distance between real samples and synthesized samples. The ED min between real samples and synthesized samples should be comparable to the distribution of minimum distances between real samples [60].
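The three metrics can be sketched in NumPy as below. These are simplified stand-ins, not the evaluation code of this study: `frechet_distance` expects feature vectors that have already been extracted by some pre-trained classifier, `wasserstein_1d` handles only one-dimensional samples of equal size, and `min_euclidean` uses a brute-force nearest-neighbor search. All function names are hypothetical.

```python
import numpy as np

def frechet_distance(feat_a, feat_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # Trace of sqrt(cov_a cov_b) equals the sum of square roots of its eigenvalues.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size one-dimensional samples."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    if a.size != b.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

def min_euclidean(queries, references, exclude_self=False):
    """Distance from each query sample to its nearest reference sample."""
    q = queries.reshape(len(queries), -1)
    r = references.reshape(len(references), -1)
    dists = np.sqrt(((q[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1))
    if exclude_self:                   # use when queries and references coincide
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)
```

For the real-to-real reference, `min_euclidean(real, real, exclude_self=True)` avoids the trivial zero distance of each sample to itself.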

A. Visualization of Synthesized Data
The synthesized data are visualized in the time, frequency, and spatial domains. Gaussian noise with zero mean and unit variance is used for comparison with the synthesized data. Gaussian noise is one of the inputs of the generative model, and its statistical distribution is the same as that of the normalized EEG signal, which also has zero mean and unit variance.
The waveform of the data is directly evaluated by visual inspection in the time domain analysis. In the frequency analysis, we use the continuous wavelet transform (CWT) [61] to transform the data. Since motor imagery leads to energy changes in the alpha band (8-13 Hz) and beta band (13-30 Hz) [62], the time-frequency features within the range of 8-30 Hz are selected for analysis. For the spatial analysis, common spatial pattern (CSP) [63] is used to extract two-dimensional spatial features of the data. Fig. 4 shows the waveform, time-frequency features and spatial features of the real data, synthesized data and Gaussian noise in the C3 and C4 channels. These two channels are typically related to MI features [17]. According to visual inspection and the WD and ED evaluation metrics, the distributions of these three features of the synthesized data are similar to those of the real data, while there is a significant difference between Gaussian noise and real data. This similarity means that the synthesized data effectively preserve the time-frequency-spatial features of the real data.
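As a rough illustration of the alpha and beta bands mentioned above, the sketch below measures the fraction of spectral power inside a band. It deliberately swaps the CWT for a plain FFT to stay short, so it gives no time resolution; the function name and the 10 Hz toy signal are assumptions made for the example.

```python
import numpy as np

def band_power_fraction(signal, fs, low, high):
    """Fraction of spectral power in [low, high) Hz, using a plain FFT."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    in_band = (freqs >= low) & (freqs < high)
    return float(spectrum[in_band].sum() / spectrum.sum())

fs = 250
t = np.arange(4 * fs) / fs                    # one 4 s trial
toy_signal = np.sin(2 * np.pi * 10.0 * t)     # hypothetical 10 Hz rhythm

alpha_fraction = band_power_fraction(toy_signal, fs, 8.0, 13.0)
beta_fraction = band_power_fraction(toy_signal, fs, 13.0, 30.0)
```

For this toy signal almost all power falls in the alpha band; applying the same function to real and synthesized C3 or C4 traces would give a coarse check of whether the 8-30 Hz content is comparable.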

B. Expansion Ratio Exploration and Overall Performance
The performance of a deep learning model depends on the amount of training data, so adding different ratios of synthesized data to the training set has different effects on the classification performance. To find an appropriate expansion ratio, we select different numbers of real samples for cross-validation on the first subject, and the expansion ratios of the training dataset are set to 0.5, 1, 1.5, 2, 3 and 4. As shown in Table II, the classification accuracy of the model after expansion is better than that without expansion (p-value<0.05 for the expansion ratio of 0.5 and p-value<0.001 for the other ratios; the Wilcoxon signed-rank test is used for assessment and the Holm-Bonferroni approach for correction), and the standard deviation decreases after data augmentation, which means that the model is more stable. When the number of expanded samples is twice the number of original training samples, the average accuracy is the highest, which is 6.2% higher than the average accuracy without expansion. In addition, at least 480 samples are required for 68.3% accuracy without expansion, while only 320 samples are required to achieve this accuracy when the ratio of expanded synthesized data is 4, a decrease of 33.3%. The original decoding performance and the decoding performance after applying the proposed method are tested using the expansion ratio of 2 for all subjects, and the result is shown in Table III. After applying the proposed method, the average accuracy of all subjects under different numbers of real samples is improved by (4.72±0.98)%, and the maximum improvement of 6.2% is obtained when the number of real samples is 400.
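The arithmetic behind the expansion ratio and the reported sample reduction can be written out explicitly. The helper names are hypothetical, and the example of 360 training samples assumes the 90% training share of 400 real samples described in the cross-validation analysis.

```python
def synthesized_count(n_train, ratio):
    """Number of synthesized samples added for a given expansion ratio."""
    return int(round(n_train * ratio))

def sample_reduction(baseline, augmented):
    """Relative reduction in the number of real samples needed."""
    return (baseline - augmented) / baseline

# Expansion ratio 2 on an assumed 360 training samples (90% of 400 real samples).
added_at_ratio_2 = synthesized_count(360, 2)

# 480 real samples without expansion versus 320 with an expansion ratio of 4.
reduction = sample_reduction(480, 320)
```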

C. Comparison of Different Augmentation Methods
We compare the performance of three different data augmentation methods under different numbers of real samples, including the proposed method, cropping and adding noise. The expansion ratio is set to 2. As shown in Table IV, the proposed method significantly improves the model performance under different conditions (p-value<0.001). Compared with the result using only real data, the average accuracy is improved by 6.2%, which is higher than the 2.7%, 1.5% and 3% obtained for cropping, adding noise and GAN. Further analysis reveals an interesting phenomenon: cropping is more suitable when data are insufficient, adding noise and GAN are more suitable when data are relatively sufficient, while our method performs well under both conditions. The quality of the samples synthesized by the different augmentation methods is evaluated by FID, ED min and WD. We analyze the synthesized samples obtained when the number of real samples is 520. The distance between the real samples and the distance between the real samples and the noise samples are calculated as references. Table V shows that the FID and ED min of the samples synthesized by the proposed method are the closest to those of the real samples, which is superior to the other methods. The proposed method is inferior to cropping in WD.

D. Generality of Auxiliary Synthesis Framework
To verify that the proposed framework is also applicable to other decoding models, we replace the auxiliary decoding model with EEGNet, ShallowConvNet and DeepConvNet, respectively, and then test the model performance. The result can be found in Table VI. The data augmentation method proposed in this paper improves the average accuracy of these three models by 3.4%, 2.1% and 7%, respectively, and the stability of the models also increases. DeepConvNet obtains the greatest improvement, which suggests that our method may be more compatible with complex models.
Tables II-VI also reveal that after applying the proposed method, the highest accuracy of all models is higher than that obtained by training only with the real data. Fig. 5(a) shows an intuitive histogram of this result. Besides, we count the number of samples required to achieve the highest accuracy when training the model only with real data, and the minimum number of samples required to achieve the same or higher accuracy after augmentation, as shown in Fig. 5(b). The number of samples required to train the decoding model to achieve the same or higher accuracy decreases after applying our method.

A. Setting of Step Size and Expansion Ratio
The selection of the step size needs to take into account the sample balance, the significance of the results and the test error. The most important consideration is the sample balance, which has been shown to affect model training, so each class of samples needs to remain balanced as the number of samples increases. In addition, to avoid the time consumed by unnecessary tests, a step size that leads to significant changes in the results is selected in this study. We test the performance of the model with different step sizes based on real samples. The accuracy obtained with different numbers of real samples is used as the reference. Starting from these samples, each class of samples is increased with different step sizes, and the statistical difference between the obtained accuracy and the reference is tested by the Wilcoxon signed-rank test and corrected by the Holm-Bonferroni method. As shown in Fig. 6, the p-value decreases as the step size increases, and the step size of 10 is close to the significance level. Although a larger step size produces a more significant change, it also causes a greater error in the calculated sample reduction. When a given accuracy is reached with data augmentation, too large a step size may yield a number of samples much larger than the number actually needed to reach that accuracy, which results in a smaller calculated sample reduction.
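The Holm-Bonferroni correction mentioned above can be sketched in pure Python. This is the standard step-down procedure returning adjusted p-values; the function name and the example p-values are illustrative and are not results from this study.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from three paired comparisons.
example_adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A comparison is declared significant at level 0.05 when its adjusted p-value is below 0.05.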
The number of synthesized samples in the training set affects the improvement of model performance. Table VII shows the data expansion ratios commonly used in the EEG recognition literature in the BCI field, in which the optimal ratio refers to the expansion ratio at which the model performance is the best after expansion. According to the relevant references, the conventional expansion ratios of 1, 2, 3 and 4 are set in this study, while the ratios of 0.5 and 1.5 are set to test the performance change of the model at smaller expansion ratios. A larger ratio is not used because the improvement of model performance by synthesized samples tends to saturate when the ratio is larger than 2.

B. Complexity of Designing and Training Process
In this study, we propose a data synthesis framework based on a deep generative model. The framework only requires the design of a decoding model and a generative model, and because there is no adversarial relationship between the two models, no complicated parameter tuning is needed. In contrast, GAN-based data augmentation realizes data synthesis through adversarial training, which is extremely sensitive to the hyperparameters. In this paper, we first tried to use the decoding model described in Section II-B as the discriminator and found that it was always unable to distinguish real samples from fake ones, so the GAN could not converge. After the discriminator was redesigned, this phenomenon was alleviated. This difference in the early network structure design indicates that the proposed framework is much easier to implement.
The training process of this framework is stable, and the quality of the generative model can be judged by the loss. Furthermore, the result in Table IV shows that the minimum data required for the training of this framework is less than that of GAN, because when the number of the real samples is less than 120, effective synthesized data cannot be obtained by GAN. Training time of this framework is also less than that of GAN, as presented in Table VIII.

C. Ablation Experiment
In order to verify the effectiveness of the proposed method and verify that the synthesized sample is not a copy of  We test the decoding model using the samples synthesized under these three conditions. The average accuracy of the MSE-CE-based condition is 2.3% and 7.2% higher than that of the MSE-based and CE-based conditions, respectively, and the model is more stable, as shown in Table IX. When the number of real samples is less than 400, the samples synthesized with the participation of the auxiliary decoding model stably improve the decoding accuracy, but when the number of real samples is more than 400, the impact of the auxiliary decoding model on the decoding accuracy may be unstable, and it is necessary to reduce the constraints of the auxiliary decoding model on the generative model. A grid search ranging from 1 to 1e-5 is used to find an appropriate coefficient combination; we also test the case in which a coefficient is zero. Fig. 7 shows the two coefficients and the corresponding accuracy after Gaussian interpolation when there are 200 and 400 samples. The accuracy is generally higher than for other combinations when α is larger than 1e-1 and β is smaller than 5e-4. The β required for the highest accuracy decreases from 5e-4 to 1e-5 when the number of samples increases from 200 to 400.
To further investigate the reasons for the disparity in the ablation experiment, we visualize the effect of the different conditions on the generative model, and the effect of the samples synthesized under the different conditions on the training and testing of the decoding model.
1) Influence on the Results of Generative Model: To explain the impact of the different components of the synthesis framework on the synthesized data, we observe the outputs and the loss curves of the generative model. Fig. 8 shows the time and frequency evaluation of the real data and the synthesized data under different conditions. The alpha and beta bands associated with motor imagery and the gamma band associated with complex tasks and cognition are selected for brain topographical map comparison. The results show that, except for the CE-based condition, the synthesized data under the other two conditions have distributions similar to the real data in the alpha and beta bands, while the distribution in the gamma band is significantly different. As mentioned in [60], the generated signal is expected to have additional high-frequency features, which have a positive effect on improving the generalization of the model.
We record the changes of the MSE and CE losses during the training of the generative model. Fig. 9 reveals that the MSE loss of the synthesized signal is similar under the MSE-CE-based and MSE-based conditions but extremely large under the CE-based condition, indicating that the synthesized data distribution deviates from the real distribution when the MSE loss does not participate in optimization. The CE loss under the MSE-CE-based condition is smaller than that under the MSE-based condition, which means that the CE loss provided by the pre-trained decoding model helps the generative model synthesize distinguishing features. Fig. 9 also reveals a large difference between the values of the MSE and CE losses in the training stage, which supports the finding that a larger α and a smaller β are more conducive to balancing the constraints of the two losses.
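To make the role of the two coefficients concrete, the following sketch computes both losses and their weighted sum. It assumes the objective has the form α·MSE + β·CE, with the MSE taken between synthesized and real signals and the CE computed from the auxiliary decoder's logits; the function names are illustrative, not the authors' code.

```python
import numpy as np


def mse_loss(synth, real):
    """Mean squared error between synthesized and real signals."""
    synth, real = np.asarray(synth, float), np.asarray(real, float)
    return float(np.mean((synth - real) ** 2))


def ce_loss(logits, labels):
    """Mean cross-entropy of logits (n, n_classes) against integer labels."""
    logits = np.asarray(logits, float)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def combined_loss(synth, real, logits, labels, alpha, beta):
    """Assumed objective: alpha * MSE + beta * CE.

    beta = 0 corresponds to the MSE-based condition and alpha = 0 to the
    CE-based condition of the ablation study.
    """
    return alpha * mse_loss(synth, real) + beta * ce_loss(logits, labels)
```

Because the two terms can differ by orders of magnitude, the coefficients act as a scale correction: the weighted terms, rather than the raw losses, determine how strongly each constraint shapes the generator.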
2) Influence on the Training of Decoding Model: We further study the influence of the samples synthesized under different conditions on the decoding model. From the perspective of training, the convolution outputs of the decoding model are analyzed by t-SNE [66]. In this experiment, only the synthesized samples are used to train the decoding model. Fig. 10(a) shows that the latent vectors of the samples synthesized under the CE-based condition are clearly separated between different motor imagery classes, which indicates that the CE loss assists the generator in synthesizing distinctive features. However, as shown in Fig. 8 and Fig. 9, using only the CE loss causes the data distribution of the synthesized samples to deviate from that of the real EEG signals. These samples are easily classified by the decoding model, but the model trained with them cannot accurately recognize the real samples during testing. Fig. 11 shows that, compared with the other conditions, the model trained with the samples synthesized under the CE-based condition focuses on different areas of the real samples for classification information during testing, which leads to the results of the CE-based condition in Table IX being inferior to those of the other ablation conditions. An appropriate use of the auxiliary decoding model makes the feature distributions of different motor imagery classes more distinct than under the real-only and MSE-based conditions while retaining the real signal distributions. The Euclidean distance between different clusters is shown in Fig. 10(b). After adding the auxiliary decoding model, the average distance between clusters increases; in particular, the distances between left and right hand and between foot and tongue increase significantly, which means that the model is better able to distinguish these movements.
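The inter-cluster distance analysis can be illustrated with a short sketch. It assumes that each cluster is summarized by its centroid in the embedded space (for example, the 2-D t-SNE output) and that the reported distance is the Euclidean distance between centroids; the paper may define the cluster distance differently.

```python
import numpy as np


def centroid_distances(features, labels):
    """Pairwise Euclidean distances between class centroids.

    features: (n_samples, n_dims) embedded vectors.
    labels:   (n_samples,) integer class labels.
    Returns (classes, D), where D[i, j] is the distance between the
    centroids of classes[i] and classes[j].
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    diff = centroids[:, None, :] - centroids[None, :, :]
    return classes, np.linalg.norm(diff, axis=-1)


def mean_intercluster_distance(features, labels):
    """Average of the off-diagonal (upper-triangle) centroid distances."""
    _, dist = centroid_distances(features, labels)
    upper = np.triu_indices(dist.shape[0], k=1)
    return float(dist[upper].mean())
```

Since t-SNE does not preserve global distances exactly, distances measured in its output are best read as a qualitative indicator of class separation rather than a calibrated metric.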
3) Influence on the Testing of Decoding Model: The outputs of the decoding model are further analyzed from the perspective of testing. In this experiment, we use both real samples and synthesized samples obtained under different conditions to train the decoding model, and then select the same samples for testing. Class activation mapping (CAM) [67] is used to visualize the performance of the decoding model on the testing dataset, as shown in Fig. 11. The energy distribution of the CAM differs when the same sample is correctly classified by decoding models trained under the different conditions. According to the statistical analysis of random samples, the MSE-CE-based condition has larger energy and a narrower focused area, which means that the decoding model captures the information needed for classification more precisely; this may improve the classification performance.
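The CAM comparison can be sketched as follows. The map is computed in the standard way, as a weighted sum of the last convolutional feature maps using the weights that connect each pooled filter to the class of interest. The two summary statistics (sum-of-squares energy and the fraction of the map above half of its peak) are assumed definitions chosen for illustration, not the measures used in the paper.

```python
import numpy as np


def class_activation_map(feature_maps, class_weights):
    """CAM for one class: weighted sum of the last conv layer's feature maps.

    feature_maps:  (n_filters, n_channels, n_times) activations for one trial.
    class_weights: (n_filters,) weights from each pooled filter to the class.
    Returns an array of shape (n_channels, n_times).
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    class_weights = np.asarray(class_weights, dtype=float)
    return np.tensordot(class_weights, feature_maps, axes=(0, 0))


def cam_energy(cam):
    """Total energy of the map, assumed here to be the sum of squares."""
    return float(np.sum(np.asarray(cam, dtype=float) ** 2))


def cam_focus_area(cam, rel_threshold=0.5):
    """Fraction of the map at or above rel_threshold * peak.

    A smaller value indicates a narrower focused area.
    """
    cam = np.asarray(cam, dtype=float)
    peak = cam.max()
    if peak <= 0:
        return 0.0
    return float(np.mean(cam >= rel_threshold * peak))
```

Comparing `cam_energy` and `cam_focus_area` for the same test trial across decoders trained under each condition gives a simple numerical counterpart to the visual comparison of the maps.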

D. Limitation and Future Work
The proposed method improves the decoding performance with limited real samples. However, the generative model learns the real data distribution to generate artificial samples and cannot generate information that is not included in the original dataset. The performance of augmentation methods is affected by the data quality and diversity of the real dataset. Therefore, the improvement of decoding model performance by synthesized samples is limited. In actual EEG acquisition experiments, some random variations, such as physiological individual differences and interference caused by the environment, cannot currently be simulated by deep learning methods. Besides, overfitting still exists, so an early stopping strategy is used to prevent the performance of the model from decreasing. We expect to achieve acceptable decoding accuracy with few training samples, or even without retraining. In follow-up research, we will continue to study how to train a model with very few samples to achieve acceptable accuracy, for example by combining our method with transfer learning.

V. CONCLUSION
In this paper, an auxiliary synthesis framework is proposed to effectively improve the classification performance of the model under limited samples. We tested the method on BCI Competition IV 2a, and the results show that the data synthesized by this method preserve the time, frequency and spatial features of the original data well. After applying the proposed method, the average decoding accuracy over all subjects is improved by (4.72±0.98)%. A detailed investigation on the first subject shows that the improvement in classification performance brought by our method is higher than that of cropping, adding noise and GAN, and also higher than that obtained without the auxiliary decoding model in the framework. The number of training samples needed to reach the original highest accuracy is reduced by about 33.3%. The proposed method is also applicable to other decoding models.
Finally, since our framework is purely data-driven, it can be transferred to other BCI domains such as emotion recognition, speech recognition, and epilepsy prediction.