On the Deep Learning Models for EEG-Based Brain-Computer Interface Using Motor Imagery

Motor imagery (MI) based brain-computer interface (BCI) is an important BCI paradigm which requires powerful classifiers. Recent development of deep learning technology has prompted considerable interest in using deep learning for classification and resulted in multiple models. Finding the best performing models among them would be beneficial for designing better BCI systems and classifiers going forward. However, it is difficult to directly compare performance of various models through the original publications, since the datasets used to test the models are different from each other, too small, or even not publicly available. In this work, we selected five MI-EEG deep classification models proposed recently: EEGNet, Shallow & Deep ConvNet, MB3D and ParaAtt, and tested them on two large, publicly available, databases with 42 and 62 human subjects. Our results show that the models performed similarly on one dataset while EEGNet performed the best on the second with a relatively small training cost using the parameters that we evaluated.

and robotic arm [6], [10]. It has shown potential in helping patients with motor disabilities interact with the environment by controlling devices such as computer cursors, prostheses, and wheelchairs [11]. Electroencephalography (EEG) based BCI is a commonly used BCI technique because it is non-invasive and requires no surgical procedure. One of the main components of these BCI systems is the classification of circumscribed and transient EEG changes, such as event-related synchronization (ERS) or event-related desynchronization (ERD), during different types of motor imagery [3]. Developing a robust decoding algorithm is central to advancing BCI research and improving user experience [12]. Many successful feature extraction methods exist, such as common spatial patterns (CSP) methods [13], [14]. Among the CSP methods, filter bank common spatial pattern (FBCSP) [15] is one of the most popular algorithms; it uses a group of band-pass filters to extract the optimal spatial features and has achieved robust performance in MI classification tasks. However, this kind of traditional framework separates feature extraction/selection from feature classification, which requires manual work and prior knowledge from the operator and can lead to bias and inefficiency.
Recently, with the rapid development of high-powered computing devices, deep learning has become increasingly popular in many fields. A major advantage of bringing deep learning technology into BCI systems is that the feature extraction and classification steps can be jointly learned directly from data, also known as 'end-to-end' learning [16]. Several groups have been working on deep neural networks for MI classification [17] and have published several models. Schirrmeister et al. [18] explored deep neural network structures for MI classification. They presented four different models: shallow CNN, deep CNN, hybrid shallow+deep CNN and a residual network, and showed that their shallow CNN and deep CNN substantially outperformed the traditional method while the other two did not. Sakhavi et al. reported an envelope representation for EEG-based motor imagery classification and combined it with a 5-layer convolutional neural network, which increased the classification accuracy by 7% on BCI competition dataset IV [19]. EEGNet [20] is another successful network which combined different basic convolutional layers together and achieved good performance on multiple datasets. Several other CNN-based structures have also been explored in motor imagery classification, showing that it is a popular architecture choice [21], [22], [23]. In addition, other studies have tried to combine traditional CNN layers with other tools to improve performance, such as self-attention modules [24] or stacked autoencoders [25]. Amin et al. designed a fusion model, which fuses multiple CNN networks to extract different levels of feature characteristics, and achieved substantial improvements in classification accuracy on multiple datasets [26].
Fig. 1. Trial structure of the two datasets. A trial starts with a relax stage, shown as the blank screen. Then a rectangular target appears at either side of the screen, giving the subject the direction in which to perform motor imagery. At the feedback stage, a circular cursor appears at the center of the screen and moves towards either side based on the motor imagery of the subject. After the cursor reaches or misses the target, or the time exceeds the limit, the cursor is frozen during the post-feedback stage. The length of each stage is summarized in the table.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A common approach to using EEG signals as inputs to these models is to represent the signal as a two-dimensional (2D) array where each row is a timeseries recording of a single EEG channel, and each column represents signal from different channels at a time. This method is convenient for visualizing the EEG signal and mimics the 2D structure of images in image processing where CNNs have been used extensively. However, this representation loses the spatial relationship among channels in the EEG montage, which may contain information useful for MI classification. To deal with this problem, some studies have explored a 3D representation of EEG data instead [27], [28].
The variety of deep learning architectures proposed in recent publications shows the field's effort to improve MI-EEG classification with new tools. Comparing the performance of these architectures offers a strategy for improving future BCI classifiers by further exploring the models and tools that lead to higher accuracies. However, a major difficulty in comparing the proposed architectures is that the datasets used to train and test these models usually differ between studies, or may not even be publicly available. The differences in the datasets may account for some of the differences in published accuracies, since some models may perform better on some datasets than on others. In addition, the most commonly used motor imagery dataset is the BCI competition dataset IV [29], which includes only 9 subjects and does not include feedback. The small number of subjects could lead to a large variance in evaluation, which calls the generalizability of a model into question. This creates a strong need to compare existing models on a larger, publicly available database, which will help in selecting deep learning models and design choices to build better MI classifiers in the future. In this study, we selected 5 MI-EEG deep learning models with released code from 4 recent studies: EEGNet [20], Deep & Shallow ConvNet [18], Multi-branch 3D CNN [28] and parallel self-attention network [24]. We designed a group of experiments to test these models on two large publicly available motor imagery databases: MBT-42 [30] with 42 human subjects and Med-62 [31], [32] with 62 human subjects, each of which contains multiple BCI sessions with online feedback. We evaluated each model both on its classification accuracy on L/R (left/right) motor imagery tasks and on its training cost.
Our results are as follows: Among the five models we selected, all of the models performed similarly on the first dataset, while EEGNet performed the best on the second in terms of classification accuracy and training time for the hyperparameters and training choices we tested. We also show that a simple preprocessing step is important to deep learning model training, and its magnitude of improvement depends on the dataset.

A. Datasets
Two large, publicly available datasets were used in this study to compare the performances of several recently published deep learning neural networks for MI classification. More information on these datasets is available in the original publications [30], [31].
1) MBT-42: This dataset was recorded during a series of MI-BCI training [30]. A total number of 42 healthy human subjects took part in this L/R cursor moving task. Here, the subjects' goal is to perform left/right motor imagery and to direct the cursor to reach the target, which appears randomly at either side of the screen (Figure 1). The complete dataset consists of three experiments involving 16, 12, and 14 subjects respectively. All the subjects participated in three sessions of online control tasks. In the first experiment, there are 125 trials within each session. In the second and third experiment, there are 120 trials. The EEG signals are recorded using a 62 channel Neuroscan system in the first two experiments, and a 64 channel Biosemi Active Two system in the third experiment. The trial structures of the three experiments are similar, each including 4 stages. A trial will start with a relaxation stage, which is considered as the rest time between two trials. Next is the cue stage, where two square targets are placed on the left and right of the screen. One of these targets is made visible as a yellow square to provide a goal for the subject. The left and right trials are randomly distributed and balanced within sessions. After that, the feedback stage will start. A round pink cursor appears and starts moving based on the control signal provided by the classifier. This will give subjects the feedback about the effects of their motor imagery performance. If the cursor reaches the visible target the trial ends and results in a "hit". If the cursor reaches the invisible target on the opposite side of the screen the trial also ends and results in a "miss". If neither target is hit before the time limit, the trial ends and results in an "abort" trial. The last stage is the post-feedback stage, where the cursor will be frozen for one second. 
Subjects are only explicitly instructed to perform motor imagery during the feedback stage, though they may opt to perform it during other stages as well. In the first two experiments, the EEG signals were bandpass-filtered between 0.5 and 200 Hz, notch-filtered at 60 Hz, and down-sampled to 100 Hz. In the third experiment, the EEG signals were bandpass-filtered between 0.16 and 100 Hz, notch-filtered at 60 Hz, and down-sampled to 128 Hz.
2) Med-62: This dataset was collected to explore the effect of meditation on motor imagery performance [31], [32]. In this dataset, 62 subjects participated in cursor movement control tasks of three types: left/right (LR) movement only, up/down (UD) only, and combined 2D movement (2D). Each subject completed a total of 7-11 sessions of online BCI tasks. Each session comprises 450 trials across the three tasks, 150 trials for each task. The EEG data were recorded with a 64-channel Neuroscan EEG cap. The data are sampled at 1,000 Hz and have already been bandpass-filtered between 0.1 and 200 Hz, with a notch filter at 60 Hz as well. The trial structure is similar to the first dataset, including a relax stage, cue stage, feedback stage and post-feedback stage. The length of the feedback stage varies from 0 to 6 s depending on if and how quickly a target was hit. The online decoders of both datasets are similar: they include spatial filtering with a Laplacian filter, estimation of mu rhythm power by fitting an autoregressive model, and cursor movement driven by lateralized mu rhythm power (C4-C3) for left/right movement tasks.
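The online decoding idea described above (Laplacian spatial filtering followed by a lateralized mu-power control signal) can be illustrated with a minimal sketch. This is not the decoder used to record the datasets: it swaps the autoregressive power estimate for a plain FFT periodogram, assumes an 8-13 Hz mu band, and uses hypothetical channel indices and neighbor lists.

```python
import numpy as np

def band_power(signal, fs, low=8.0, high=13.0):
    """Mean periodogram power of a 1-D signal in [low, high] Hz.

    Assumption: 8-13 Hz is used as the mu band; the source does not state the band.
    """
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    mask = (freqs >= low) & (freqs <= high)
    return power[mask].mean()

def lateralized_mu(eeg, fs, c3, c4, c3_neighbors, c4_neighbors):
    """eeg has shape (channels, time). Returns mu power at C4 minus mu power at C3.

    A small Laplacian is applied first: each target channel minus the mean
    of its neighbors (neighbor lists here are hypothetical).
    """
    c3_lap = eeg[c3] - eeg[c3_neighbors].mean(axis=0)
    c4_lap = eeg[c4] - eeg[c4_neighbors].mean(axis=0)
    return band_power(c4_lap, fs) - band_power(c3_lap, fs)

# Toy example: a stronger 10 Hz rhythm on "C4" (row 1) than on "C3" (row 0).
fs = 100
t = np.arange(200) / fs
eeg = np.zeros((4, t.size))
eeg[0] = 0.5 * np.sin(2 * np.pi * 10 * t)
eeg[1] = 2.0 * np.sin(2 * np.pi * 10 * t)
control = lateralized_mu(eeg, fs, c3=0, c4=1, c3_neighbors=[2], c4_neighbors=[3])
```

How the sign of the control signal maps to cursor direction is a design choice of the online system and is not specified here.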
Here, we use only the first three sessions and only the LR trials of this dataset to keep the two datasets comparable. The Med-62 dataset is openly available at the following URL/DOI: https://doi.org/10.6084/m9.figshare.13123148.

B. Deep Learning Models
We chose to compare the following models, which were specifically designed for motor imagery classification tasks and whose code has been released. This makes it easier to replicate the original authors' work by testing their models on a new dataset, and the released code allows us to use the models as closely as possible to the way the authors intended. This framework can also be used in the future to compare more models as they are released. These models all adopt convolutional layers in their structures, which are widely used in processing temporal signals.
1) EEGNet [20]: EEGNet is a compact convolutional neural network combining depthwise and separable convolutions. It consists of 3 convolutional layers and 1 fully-connected layer, trying to encode several EEG feature extraction concepts like optimal spatial filtering and filter-bank construction.
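To make concrete why depthwise and separable convolutions keep a network compact, the short sketch below counts weights for a standard convolution and for a depthwise-plus-pointwise (separable) convolution with the same input and output sizes. The layer sizes are illustrative assumptions, not the EEGNet configuration used in this study.

```python
def conv2d_params(in_ch, out_ch, kernel_h, kernel_w):
    """Number of weights in a standard 2D convolution (bias omitted)."""
    return in_ch * out_ch * kernel_h * kernel_w

def separable_conv2d_params(in_ch, out_ch, kernel_h, kernel_w):
    """Weights in a depthwise convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution (bias omitted)."""
    depthwise = in_ch * kernel_h * kernel_w
    pointwise = in_ch * out_ch
    return depthwise + pointwise

# Illustrative sizes only: 16 feature maps in and out, 1 x 16 temporal kernel.
standard = conv2d_params(16, 16, 1, 16)             # 16 * 16 * 16 = 4096
separable = separable_conv2d_params(16, 16, 1, 16)  # 256 + 256 = 512
```

With these assumed sizes the separable layer uses one eighth of the weights of the standard layer.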
2) Deep & Shallow ConvNet [18]: In this work, 4 different DNN structures (a shallow convolutional network, a deep convolutional network, a hybrid network and a residual network) are explored and carefully compared to state-of-the-art methods. According to their results, the shallow and deep ConvNets outperformed the traditional FBCSP method while the other two did not, so we include both the shallow and the deep network in our study.
3) Multi-Branch 3D CNN [28]: A typical deep learning model takes a C × T matrix as input (C channels, T time samples). In this setting, each channel is treated equally and independently, which loses the spatial correlation among channels. The novelty of this model is converting this 2D input into a 3D tensor, placing the data from each channel into matrix entries arranged by the channel's spatial position on the scalp. The channel arrangement in our experiment can be found in Supplementary Table S1. The model then uses 3D convolutional layers instead of 2D convolutional layers to generate its prediction.
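The 2D-to-3D conversion can be sketched as follows. The grid positions below are a hypothetical three-channel layout chosen for illustration; the arrangement actually used is the one in Supplementary Table S1, and the function name `to_3d` is not from the original code.

```python
import numpy as np

def to_3d(eeg, positions, grid_shape):
    """Arrange a (channels, time) array on a 2D scalp grid.

    positions[ch] is the (row, col) cell of channel ch.
    Returns an array of shape (rows, cols, time); cells with no electrode stay zero.
    """
    n_channels, n_times = eeg.shape
    volume = np.zeros((*grid_shape, n_times), dtype=eeg.dtype)
    for ch, (row, col) in enumerate(positions):
        volume[row, col] = eeg[ch]
    return volume

# Hypothetical layout: one frontal channel above two lateral channels.
positions = [(0, 1), (1, 0), (1, 2)]
eeg = np.arange(12, dtype=float).reshape(3, 4)  # 3 channels, 4 time samples
volume = to_3d(eeg, positions, grid_shape=(2, 3))
```

Zero-filling the unoccupied cells is one possible convention; other fill strategies (for example interpolation) would change only the body of the loop.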
4) Parallel Self-Attention Network [24]: ParaAtt introduced the popular self-attention concept in deep learning models [33] into EEG classification. Attention modules can automatically capture global relationships among input entries. With the parallel spatial-temporal self-attention mechanism, high-level distinguishable spatial-temporal features of raw signal data can be captured.
The detailed structures and parameters of these models we used in our experiment can be found in the supplementary Information.

C. Data Analysis
In our study, we included both within-subject analysis and cross-subject analysis. Within-subject analysis trains a model specifically for each subject. For both datasets, we use the first two sessions as the training set and test on the third session. Within the training set, we hold out 20% of the data samples for validation. For data pre-processing, following [24], we first apply exponential moving standardization to the raw data. For a signal vector $x = x_{1:T}$, this standardization process can be formulated as:
$$x'_t = \frac{x_t - \mu_t}{\max(\sigma_t, \epsilon)}, \qquad \mu_t = \alpha x_t + (1-\alpha)\,\mu_{t-1}, \qquad \sigma_t^2 = \alpha\,(x_t - \mu_t)^2 + (1-\alpha)\,\sigma_{t-1}^2,$$
where $\alpha = 0.001$ is the exponential factor, and $\epsilon = 0.0001$ is a small number to avoid division by zero.
We performed cross-subject analysis on the Med-62 dataset. Since the MBT-42 dataset was recorded with different systems for different subjects, with different sampling frequencies and numbers of channels, it was not included in the cross-subject analysis. We partitioned the data as follows: for each evaluation round, one of the 62 subjects was selected as the test set, and samples from the other 61 subjects were used for training and validation. Within the 61 subjects, the training set and validation set were split at an 80:20 ratio. We trained each model on the training set and selected the best training epoch and parameters on the validation set. After that, we evaluated the trained model on the test subject and obtained an accuracy. This process was repeated with each of the 62 subjects used as the test subject. As in the within-subject analysis, we only used the data from the first three sessions for all subjects. Full details of the model and training parameters used in our study can be found in the Supplementary Information.
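As a rough illustration of the pre-processing step, exponential moving standardization as it is commonly defined keeps a running mean and running variance, each updated with factor α, and divides every centered sample by the running standard deviation floored at ε. The sketch below follows that common definition; the initialization of the running statistics is an assumption and may differ from the implementation used in this study.

```python
import numpy as np

def exponential_moving_standardize(x, alpha=0.001, eps=1e-4):
    """Standardize a (channels, time) array sample by sample.

    alpha: exponential factor for the running mean and variance.
    eps: floor on the running standard deviation to avoid division by zero.
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    mean = x[:, 0].copy()        # assumption: running mean starts at the first sample
    var = np.zeros(x.shape[0])   # assumption: running variance starts at zero
    for t in range(x.shape[1]):
        mean = alpha * x[:, t] + (1 - alpha) * mean
        var = alpha * (x[:, t] - mean) ** 2 + (1 - alpha) * var
        out[:, t] = (x[:, t] - mean) / np.maximum(np.sqrt(var), eps)
    return out

# The output is insensitive to the overall amplitude of the input.
signal = np.sin(0.3 * np.arange(200))[None, :]
small = exponential_moving_standardize(signal)
large = exponential_moving_standardize(1000.0 * signal)
```

Because each sample is divided by a running standard deviation, `small` and `large` agree apart from floating-point error, which is the property that makes this step useful when recordings differ widely in scale.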

A. Within-Subject Analysis
Under our analysis setting, all of the deep learning models achieved higher accuracies than the online accuracy in both datasets (Table I, Figure 2). The complete accuracy results can be found in the Appendix. The Friedman test for repeated measurements shows that there are significant differences among the five models in both datasets (P<0.05 in MBT-42, P<0.001 in Med-62). In MBT-42, all of the models performed relatively similarly and no model significantly outperformed the others (P>0.05 on each pair, one-sided Wilcoxon signed-rank test, FDR adjusted). Note that although the difference in average accuracy between EEGNet and the online experiment is larger (73.65% vs 70.90%), they perform similarly under the Wilcoxon test (P>0.2). In Med-62, EEGNet performed significantly better than each of the other models (P<0.001 compared to Shallow ConvNet, P<0.01 to others). No other model pair shows a significant difference in performance. All of the P values have been adjusted for the false discovery rate. Figure 3 shows the distribution of model accuracies across subjects. Each black point represents the model accuracy of one subject, and the width of the colored area represents the density distribution of the accuracies. Here we can see a slight difference in the distributions between the deep learning models and the online decoder, while the deep learning models are similar to each other.
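For readers unfamiliar with false-discovery-rate adjustment of pairwise P values, the sketch below implements the Benjamini-Hochberg procedure. The text does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the input P values are made-up numbers rather than results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up P values for four hypothetical pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

Each raw value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw P value never receives a larger adjusted value than a bigger one.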

B. Effects of Preprocessing
To explore the role of data preprocessing in model accuracy, we performed another group of experiments in which the deep learning models were trained on the original data instead of the preprocessed data, using the same model structures. Figure 4 compares the classification accuracies of the deep learning models trained on preprocessed data and on original data. The preprocessing step has different effects on the two datasets. On the MBT-42 dataset, preprocessing yields only limited improvement in deep learning model performance: none of the models shows a significant improvement when trained on preprocessed data (P>0.05). However, on the Med-62 dataset, each model achieved a significant improvement from preprocessing (P<0.001). All of the P values have been adjusted for the false discovery rate. In comparison with the online performance, denoted by the red dashed line, most of the deep learning models trained on original data cannot beat online performance, which indicates that simple data preprocessing steps are necessary for training deep learning models under these conditions.

C. Across-Subject Analysis
The cross-subject classification results of the deep learning models on the Med-62 dataset are shown in Figure 5. The Friedman test for repeated measurements shows that there are significant differences among the five models (P<0.001). EEGNet performed significantly better than each of the other models (P<0.001 to all other models, one-sided Wilcoxon signed-rank test, FDR adjusted). No other model pair shows a significant difference in performance. All of the P values have been adjusted for the false discovery rate.
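The leave-one-subject-out partitioning behind this analysis can be sketched as below. The sketch assumes that the 80:20 training/validation split is made over the pooled samples of the remaining subjects after shuffling; the text does not say whether the split was made over samples or over subjects, and the function name and seed are illustrative.

```python
import random

def leave_one_subject_out(sample_subjects, val_fraction=0.2, seed=0):
    """Yield (test_subject, train_idx, val_idx, test_idx) for each subject.

    sample_subjects[i] is the subject id of sample i.
    Assumption: validation samples are drawn at random from the pooled
    samples of all non-test subjects.
    """
    rng = random.Random(seed)
    for test_subject in sorted(set(sample_subjects)):
        test_idx = [i for i, s in enumerate(sample_subjects) if s == test_subject]
        rest_idx = [i for i, s in enumerate(sample_subjects) if s != test_subject]
        rng.shuffle(rest_idx)
        n_val = int(round(val_fraction * len(rest_idx)))
        yield test_subject, rest_idx[n_val:], rest_idx[:n_val], test_idx

# Toy data: 62 subjects with 10 samples each.
subjects = [s for s in range(62) for _ in range(10)]
rounds = list(leave_one_subject_out(subjects))
```

Each of the 62 rounds keeps all samples of one subject out of both training and validation, so the reported accuracy reflects performance on an unseen subject.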

D. Computational Cost
Apart from model accuracy, training cost is also an important criterion for evaluating a model. Training costs are reported in Table II. We also compared the inference time of the deep learning models. Inference time is the amount of time a trained model takes to generate an output from the input signal. The results are shown in Table III. We fixed the input length of all data to 300 and measured the model inference time on 125 samples and on one single sample. Owing to parallel computation, models process a batch of samples faster per sample. Generally, the model inference time should be shorter than the update interval of real BCI systems, for which 40 ms is a commonly used value. All five models are able to decode one single sample in less than this time, which means they all have the potential to be put into real use.
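A minimal way to check an inference-time budget of this kind is sketched below. The `dummy_predict` function is a placeholder standing in for a trained model's forward pass, and the number of repeats is an arbitrary choice; neither comes from the study.

```python
import time

def mean_inference_ms(predict, batch, n_repeats=20):
    """Average wall-clock time of predict(batch) in milliseconds."""
    predict(batch)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_repeats):
        predict(batch)
    return (time.perf_counter() - start) * 1000.0 / n_repeats

def dummy_predict(batch):
    # Placeholder for a trained model; returns one label per sample.
    return [sum(sample) > 0 for sample in batch]

single_sample = [[0.0] * 300]          # one sample with input length 300
batch_of_125 = [[0.0] * 300] * 125     # a batch of 125 samples

single_ms = mean_inference_ms(dummy_predict, single_sample)
batch_ms = mean_inference_ms(dummy_predict, batch_of_125)
meets_budget = single_ms < 40.0        # compare with a 40 ms update interval
```

Timing a real network would additionally require fixing the hardware and, on a GPU, synchronizing before reading the clock.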

IV. DISCUSSION
In this study, we have tested five previously reported BCI deep learning models on two large and publicly available left/right motor imagery classification datasets. Prior research has tested deep learning algorithms on motor imagery tasks. Schirrmeister et al. [18] compared the performance of novel DL models against a traditional BCI classifier in an offline setting to show the potential benefits of using DL models for BCI decoding. In their work, ShallowConvNet and DeepConvNet achieved 85.3% and 84.0% accuracy respectively on two datasets, outperforming the FBCSP baseline (82.1%). Lawhern et al. [20] compared EEGNet with ShallowConvNet, DeepConvNet and traditional approaches on both ERP and oscillatory-based BCIs, and found that the best-performing model changes with the dataset. The MB3D network achieved 75.0% accuracy on the BCI competition IV dataset in its original work [28]. Liu et al. [24] reported ParaAtt and compared it to multiple models. In their original work, ParaAtt achieved 78.5% accuracy on the BCI competition IV dataset, which outperformed EEGNet (65.4%), DeepConvNet (70.3%), the FBCSP baseline (67.4%) and several other networks. Stieger et al. [35] previously examined the Shallow ConvNet [18] on the Med-62 dataset and reported that DL-based decoders can outperform online performance, and that using all of the available electrodes provides additional benefit over using just the electrodes around the sensorimotor cortex. The classification accuracy of ShallowConvNet can reach nearly 79% on session 3 of the Med-62 dataset. This cannot be directly compared to our results, since we use different training and test sets and different data clipping strategies. These works, along with several others, show the promise of DL-based decoders for MI-BCI.
However, while there are several successful DL models that have already been proposed for BCI, and more are certainly being developed, it is still unclear which, if any, of these models perform the best in a general setting.
Previous works have also compared different basic generic deep learning architectures on motor imagery tasks [36], [37]. These results may provide some guidance for improving DL-based BCI decoders, but they do not examine state-of-the-art models. Here, we aim to address this by comparing multiple state-of-the-art deep learning networks that were specifically designed for BCI motor imagery tasks. From the experimental results, we have shown that EEGNet has the best performance on one dataset under these selected conditions among the models we investigated, and that it outperforms online accuracy on average as well. The depth of the network does not seem to be the most important factor in predicting a model's performance. Though the models we tested have varying depths, they all performed similarly on the MBT-42 dataset. In addition, though EEGNet and Shallow ConvNet have similar depths, EEGNet outperforms the other models on the Med-62 dataset. The way inputs from different channels are handled is also similar among these three models, since they all apply a depthwise convolutional layer to mix data from all channels in the early part of the model. A possible reason for the high performance of EEGNet on the Med-62 dataset is its special design of separable convolutional layers, which may be capable of extracting stronger features related to this task. The Shallow and Deep ConvNets had the lowest training cost, which may be an important factor for performing offline data analysis on lower-powered machines or in a time-sensitive environment.

A. The Role of Preprocessing
The performance of deep learning model is greatly influenced by its input scale. Without data scaling, a deep learning model will learn larger weight parameters, which can cause instability and undermine the performance [38]. Figure 6 shows an illustration of different preprocessing methods on a real data snippet of 5 seconds length from Med-62 dataset.
Here the term 'original data' refers to the data we directly obtained from the released dataset, which has already been minimally preprocessed by the authors. The orange line is the signal after a highpass filter with a cutoff frequency of 1 Hz. The scale of this highpass-filtered data is almost the same as that of the original data (blue line), since the filter only removes the low-frequency component of the signal. On the other hand, exponential moving standardization (green line) and normal standardization (red line) rescale the data to a range suitable for model training, and the preprocessed data from these two methods are almost the same. From Figure 4, the accuracy improvement from preprocessing is much higher in the Med-62 dataset. We checked the average standard deviation of the original EEG recordings of the two datasets, which is a rough reflection of input scale. The standard deviation averaged over all subjects is 28.2 in the MBT-42 dataset and 1335.0 in the Med-62 dataset. Although there is no clear threshold separating a 'normal scale' from an 'abnormal scale', this indicates that the Med-62 data are far from a normal scale and can therefore benefit more from preprocessing. The MB3D model was affected the least by input scale. One possible explanation is its multi-branch structure: if one branch dies out because of abnormal inputs and their consequences, such as vanishing gradients, the other branches still have the chance to fit well, so the model can still classify correctly.

B. Subject-Wise Clustering
In addition to comparing the overall performances of the various deep learning architectures, we also wanted to explore subject-specific performance between the models. Although some models reached higher accuracies than others on average, it could be the case that specific models work well for some subjects and not others. To test whether this is the case, we plotted the subjects' performance in a five-dimensional space where the accuracy from each model is a single dimension. Since the differences in average accuracy between subjects were much larger than any differences between models for a single subject, we needed to center the subjects' accuracies in order to focus on any model-specific differences. To remove the effect of each subject's average performance, we zero-centered their accuracies: for a subject's accuracy vector x, x − mean(x) is the zero-centered vector of x. Here, subjects that performed similarly among the various models are closer together in space (e.g., if subjects A and B both performed better with Shallow ConvNet than with EEGNet, they would be closer together). Using t-distributed stochastic neighbor embedding (t-SNE) [39], [40], a statistical method for visualizing high-dimensional data, this five-dimensional space can be visualized in two dimensions, as shown in Figure 7.
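The zero-centering step can be illustrated with a short sketch. The accuracy values below are made up for two hypothetical subjects and five models; they are not results from this study.

```python
import numpy as np

def center_per_subject(acc):
    """Subtract each subject's mean accuracy across models.

    acc has shape (subjects, models); each row is one subject's accuracy vector x,
    and the result holds x - mean(x) for every row.
    """
    acc = np.asarray(acc, dtype=float)
    return acc - acc.mean(axis=1, keepdims=True)

# Made-up accuracies: a strong and a weak subject with the same model preferences.
acc = np.array([
    [0.90, 0.88, 0.92, 0.89, 0.91],
    [0.60, 0.58, 0.62, 0.59, 0.61],
])
centered = center_per_subject(acc)
```

After centering, the two rows coincide even though the subjects' average accuracies differ by 0.30, so a distance-based embedding such as t-SNE applied to `centered` would place them together.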
Deep learning models with different connective structures might have different preference on special EEG features of certain subjects. In other words, there might exist a kind of phenomenon that: within a subset of subjects with a common EEG feature, or MI strategy, a group of deep learning models perform better, while within another subset of subjects with a different EEG feature, another group of deep learning models perform better. This difference might be expected due to the differences in architectures of the models, and the abilities of the various layers to extract different features. The t-SNE plots in Figure 7 explore how similarly subjects perform among the different models. Here, distinct groups of subjects, or clusters, would signal that there are sub-populations of subjects with unique features that result in them reaching higher performance with some models rather than others. The results of Figure 7 show that subjects are relatively spread out across this space. This suggests that if the models are extracting different features, then these features are relatively distributed across different subjects. Future work could further explore this idea by comparing the electrophysiology between subjects that are the furthest away from each other in this space to extract the features causing the difference in accuracies between models. Including different types of models in this analysis, such as RNNs, may also yield even more stark differences as the architectures extract different features from the data.
Since the clusters are not clear from the two plots in Figure 7, this kind of model preference is weak based on our experiment result. However, we believe it's still a good direction to explore if we can have more data samples and more models to compare in the future.

C. Future View of Deep Learning
Deep learning has shown success in improving the power of BCI systems, and we believe it has further potential to advance them. EEG is one of the most commonly used non-invasive BCI inputs and has a low signal-to-noise ratio (SNR). As a result, extracting features from such a noisy signal is hard for traditional EEG processing methods. As mentioned, the feature extraction step requires a great deal of manual work and prior knowledge, and may cause information loss. Deep learning offers a better option for automatically generating useful features from noisy data.
There are still issues to be addressed before deep learning can be brought further into online systems. The first is how to deal with different data formats. Compared to the easy data collection process in most computer science fields, made possible by the growth of the Internet, human EEG data collection is much harder because of tedious routines and the time-consuming process of recruiting subjects and conducting experiments. EEG data are also collected by different research groups using different systems. Given this, building a unified model that can be trained on various types of data and serve multiple purposes would increase the utilization of limited EEG data and may become an important issue in the future. The ideal case is that a user can start using a BCI system without any pre-training/calibration trials. During the early stage of interaction, the DL model can then adapt gradually as the user continues to use the system.
Most current models treat each trial equally, which means the order of trials and sessions can be shuffled arbitrarily. However, in a real scenario, a subject may generate gradually varying features over the course of BCI training, which might pose an additional challenge for decoding. We believe that future investigations can be extended to examine the effects of MI-BCI training when deep learning is used.
In our work, the performance distributions among subjects in the two datasets are not identical. The distribution in the MBT-42 dataset is pear-shaped, while that in the Med-62 dataset is closer to a spindle. This suggests that different datasets may include different proportions of 'poor performers', 'moderate performers', and 'good performers'. Future investigation should be extended to examine the effects of deep learning on the various performance sub-groups.