State-of-the-Art Versus Deep Learning: A Comparative Study of Motor Imagery Decoding Techniques

State-of-the-art (SOTA) techniques for motor imagery decoding have largely involved the use of common spatial patterns (CSP) and power spectral density (PSD) for feature extraction. Other frequency transforms, such as wavelets and empirical mode decomposition (EMD), have also been used, but the aforementioned two have been the most popular. For classification, linear discriminant analysis (LDA) and support vector machines (SVM) have mostly been used. It is, however, worth investigating other approaches, such as deep learning, which offer potential for improvement but are not yet mainstream. Deep learning techniques based on neural networks (NNs) have been underexplored in motor imagery processing. Considering their success in other fields, which speaks to their potential for obtaining improved results over the SOTA, they should be explored for motor imagery decoding. This study takes a comparative approach, evaluating deep learning against the SOTA. From our findings, we infer that neural networks are suitable for motor imagery decoding and might be preferable to the SOTA. Specific feature extraction is also not as necessary as with the SOTA approaches, though it might offer some gains in performance. Our results show a statistically significant improvement in decoding accuracies, up to a 20% increase, with the use of NNs as compared with the SOTA. Also, we conclude that the use of crops for data augmentation might yield better results with shallow architectures than with deeper ones, and that there might be other factors affecting the effectiveness of crops, needing further investigation.

Traditional techniques used in motor imagery processing typically involve pre-processing, feature extraction and then classification. Pre-processing is geared towards cleaning the data, to rid it of noise and improve the signal-to-noise ratio. Such pre-processing procedures are usually necessary when using EEG, as EEG signals have been known for their susceptibility to varying types of noise [5]. The pre-processing procedures get rid of artifacts, which could be environmental, instrumental and/or biological. Other device types, apart from EEG, may not be as susceptible to all these artifact types. Feature extraction is performed to extract relevant features, which provide distinguishing characteristics between imagined tasks. Typical feature extraction techniques include the use of the CSP algorithm, well known for providing plausibly distinguishing features. Other time-based, frequency-based and time-frequency transforms, such as wavelets, auto-regressive modelling and empirical mode decomposition (EMD), have also been applied for feature extraction. For classification, support vector machines (SVM) and linear discriminant analysis (LDA) have mostly been used, as compared with other algorithms [6]-[9].
A consideration of deep learning techniques and their use in computer vision and natural language processing shows that deep learning could be utilized in many fields, with the potential for improvements. More specifically, the use of convolutional neural networks (CNNs) and sequence models (recurrent neural networks (RNNs), long short-term memory (LSTM) networks and gated recurrent unit (GRU) networks) has shown significant improvements in these fields [10]-[12]. Deep learning techniques have some advantages over traditional approaches. These include little to no reliance on a priori knowledge and the elimination of specific feature engineering. Neural networks can learn features from the data and do not require manually crafted features for optimal decoding performance. Considering their degree of success and use, this study takes a comparative approach, evaluating these techniques against the SOTA. First, we take an exploratory approach in the use of these networks, determining the best-performing architectures and then finding optimal parameters for the best architectures. In addition, we compare these results with SOTA techniques (CSP and spectrograms for feature extraction; SVM and LDA for classification), inferring from the results that deep learning techniques can equally be applied in motor imagery decoding, with significantly better results. One reason for the wide use of SVM and LDA has been the small amount of data acquired during motor imagery studies. This is typically a few hundred samples per session, due to the nature of the experiments and the fatiguing effect they could have on subjects. However, the application of data augmentation techniques could enhance decoding when using deep learning techniques. The remainder of this manuscript is structured as follows: related works, methods, results and discussion, and conclusion.

II. RELATED WORKS
Previous works exploring motor imagery for communication and control have mostly used CSP and frequency transforms for feature extraction, and SVM and LDA for classification. In this section, we present some of the works using these techniques. We also present other, more recent works exploring neural networks or deep learning-based techniques for classification. Steyrl et al. [13] made comparative use of CSP features with regularized LDA and random forests. They had demonstrated the online use of a random forests classifier with the discrete Fourier transform (DFT) of the motor imagery signals in an earlier work [14]. Their comparative analysis, which was done afterwards, showed that the CSP features contained highly discriminatory information. The results showed good performances for both classifiers, with peak accuracies up to 100%, though random forests were reported to have made better use of the CSP features. Overall, the authors reported significantly better results with the use of CSP, as compared with the DFT. Another work, by Wu et al., used a combination of CSP and LDA. The authors used cross-validation to determine the optimal time window and number of CSP features for each subject and reported a maximum classification accuracy of 80%, using only 9 channels [15]. In Jin et al.'s work [16], the authors made use of a correlation-based approach for channel selection. Their correlation-based channel selection (CCS) method was used to select channels that contained more relevant information. After channel selection through the CCS method, regularized CSP was applied for feature extraction and, finally, an SVM classifier was used for classification. The authors validated their approach on the BCI competition datasets. On the BCI Competition IV dataset, their method achieved up to 94.5% accuracy on a subject and a mean accuracy of 81.6 ± 11.5% across all subjects.
On the BCI Competition III dataset IVa, up to 96.8% accuracy was reported on a subject, with a mean accuracy of 87.4 ± 10.6% across all subjects; on the IIIa dataset, up to 98.9% single-subject accuracy was achieved, with a 91.9 ± 10.3% mean accuracy across all subjects [16].
Yet another work, by Feng et al. [17], made use of CSP and SVM for feature extraction and classification respectively, with 10-fold cross-validation. Their approach considered time in the feature extraction, with the argument that there exists a time latency in the performance of MI for subjects and that this latency could affect the performance of the BCI if not accounted for. To that end, they proposed a correlation-based time window selection (CTWS) method, to determine optimal time windows for both training and testing samples, adjusting the windows using correlation analysis. Afterwards, feature extraction and classification were performed. They validated their approach on two datasets: the BCI Competition IV dataset I [18] and a primary data source of MI EEGs from stroke patients. They reported that the average classification accuracy improved by 16.72% on the dataset of healthy subjects (BCI Competition IV Dataset 1) and by 5.24% on the dataset of stroke patients. The works described so far made use of SOTA processing techniques. Some other works have explored neural networks or deep learning techniques for MI classification. Though some works have used this approach, it still appears that deep learning techniques have been underexplored in the space. A look at such works shows that most have used convolutional neural networks (CNNs). Others have used RNNs, LSTMs and GRUs, as these sequence models have been known to model time-series relationships well. Sakhavi et al. [19] demonstrated the use of CNNs for MI classification. The authors made use of a modified version of the filter-bank CSP (FBCSP) and a CNN architecture tailored to the extracted features. The temporal representation used was the channel-relative instantaneous energy of the signal envelope, extracted using the Hilbert transform. Building upon one of their previous works [20], the authors made use of this temporal representation, which differed from their earlier work.
Their approach entailed the following:
1) First, FBCSP was performed on the signal.
2) Next, the envelope of each signal was extracted using the Hilbert transform.
3) Afterwards, three possible representations were generated: the raw or smoothed version of the EEG envelope (R1); the power of the envelope (R2); and the ratio of the envelope to the total energy of each of the channels in each trial (R3).
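The envelope extraction in step 2 can be illustrated with a small, self-contained numpy sketch of the FFT-based analytic signal (our own minimal re-implementation of the Hilbert-transform envelope; the function name and the synthetic test signal are illustrative, not taken from [19]):

```python
import numpy as np

def signal_envelope(x):
    """Envelope of a real signal via the analytic signal (FFT-based Hilbert)."""
    n = x.shape[-1]
    X = np.fft.fft(x)
    h = np.zeros(n)            # spectral multiplier for the analytic signal
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)   # x + i * hilbert(x)
    return np.abs(analytic)

# A 10 Hz sinusoid of amplitude 2 has a flat envelope of 2.
fs = 200
t = np.arange(400) / fs
env = signal_envelope(2 * np.sin(2 * np.pi * 10 * t))
```

On a band-limited oscillation such as a mu-rhythm component, the envelope traces the instantaneous signal energy over time, which is the kind of temporal representation described above.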
The three generated representations (R1, R2 and R3) were used to determine the most effective representation. In constructing the CNN, the authors made use of three approaches: convolutions only across time, with a common kernel shape for all channels; convolutions only across channels; and convolutions across both time and channels. Varying kernel and stride sizes were used for the first layers, to determine the optimal sizes. They excluded the use of pooling layers and had just two convolution layers, with a fully connected layer just before the output layer. They included batch normalization and dropout layers, before and after the rectified linear unit (ReLU) activation layers respectively, and used the Adam optimizer for learning. Their results showed that the first signal representation, R1, achieved the best results. They also compared the results of their 2-dimensional CNN architecture, used for selecting the best representation, with a multi-layer perceptron (MLP) and a support vector machine (SVM) and reported better results with the CNN. Their approach was validated on the public BCI Competition IV 2a dataset.
Other works using neural networks include the work by Dose et al. [21], where the authors used a CNN to classify MI signals in the Physionet movement and motor imagery database (MMIDB) [22], [23]. Their CNN architecture was based on the shallow ConvNet proposed by Schirrmeister et al. [24] and consisted of two convolutional layers with 40 kernels per layer and a fully connected layer.

III. METHODS
This section details our approach in this comparative study. All techniques and methods were applied on a public dataset. Details of the dataset and the methods are given in the following subsections.

A. DATASET
The public dataset used in this study was provided by Kaya et al. [27]. The full dataset contains 60 hours of EEG recordings from 13 participants, collected over 75 recording sessions and yielding over 60,000 examples of motor imageries in 4 interaction paradigms. The dataset is currently one of the largest public EEG BCI datasets. The data was collected at the University of Mersin, Turkey. 13 healthy participants (8 males and 5 females), who were students between the ages of 20 and 35, were recruited for the study. Participants were labelled A-M, for anonymity. Necessary checks were done to ascertain the health status of the participants. The EEG-1200 system, a standard medical EEG station used in many hospitals, was used for data collection. The sampling rate of the device was 200 Hz and 19 channels were used, placed according to the 10-20 standard. Participants kept their gaze at the centre of a graphical user interface (GUI) window, containing icons representing the imageries to be performed. The original dataset contains 2-class and 6-class motor imageries; however, for this study, we made use of the 6-class dataset. The 6 motor imageries were the left hand, right hand, left leg, right leg, tongue and a passive mode. Each imagined action lasted a second. For all subjects except one (Subject D), there were one or more 6-class motor imagery sessions, resulting in the use of 12 subjects' data for this study.

B. PRE-PROCESSING
First, the data were pre-processed to remove different types of artifacts, mostly oculographic. The data had already been notched at 50 Hz with an inbuilt filter in the device, to remove line noise. The following steps were taken:
1) Band-pass filtering at 1-40 Hz.
2) Oculographic artifact correction, using independent component analysis (ICA). First, a bipolar channel was created using the two available pre-frontal electrodes, Fp1 and Fp2. The montage used is depicted in Figure 1. The constructed channel was then used as a simulated electrooculographic (EOG) channel for oculographic artifact removal, using the MNE python package [28]. With ICA, eye artifact components are detected using the bipolar channel and are marked for rejection. The signal is then reconstructed using the other, non-artifact components. The use of the bipolar channel is based on the fact that the pre-frontal electrodes typically capture eye artifacts, rather than activity related to the imagined action.
3) Re-referencing to average. The data was re-referenced to improve the signal-to-noise ratio.
4) Baseline correction on epochs, using 200 ms of pre-cue data.
5) Artifact rejection, using the autoreject package [29].
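Steps 1, 3 and 4 operate directly on the trials × channels × samples array and can be sketched in plain numpy (an illustrative sketch only; the study used MNE's filtering, referencing and epoching routines, and a simple FFT mask stands in here for a proper band-pass filter):

```python
import numpy as np

def fft_bandpass(x, fs, lo, hi):
    """Zero out spectral content outside [lo, hi] Hz along the last axis."""
    X = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    X[..., (freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(X, n=x.shape[-1], axis=-1)

def average_reference(epochs):
    """Subtract the mean across channels (axis 1) from every channel."""
    return epochs - epochs.mean(axis=1, keepdims=True)

def baseline_correct(epochs, fs, baseline_ms=200):
    """Subtract the mean of the pre-cue baseline period from each channel."""
    n_base = int(fs * baseline_ms / 1000)
    return epochs - epochs[..., :n_base].mean(axis=-1, keepdims=True)

fs = 200
rng = np.random.default_rng(0)
# 10 epochs, 19 channels, 40 baseline samples (200 ms) + 200 imagery samples
epochs = rng.standard_normal((10, 19, 240))
clean = baseline_correct(average_reference(fft_bandpass(epochs, fs, 1, 40)), fs)
```

The ICA-based EOG correction (step 2) and autoreject (step 5) are model-based and are best left to the MNE and autoreject packages cited above.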
Some subjects had more trials rejected than others, due to more noise being present. Figure 2 shows the stages in the decoding process, with the pre-processing, feature extraction and classification techniques used for the SOTA and deep learning approaches.

C. STATE-OF-THE-ART FEATURE EXTRACTION
For the SOTA techniques, we chose CSP in the time domain and the power spectrum, computed as spectrograms, as features for the SOTA classifiers. We also tried Welch's periodogram; however, it yielded results that did not compare with those achieved with the spectrograms. We, therefore, used the spectrograms as the desired frequency representation.
In computing the features, optimal parameters were first chosen, to obtain the best feature representations possible. This involved generating the CSP features with each option from the range of options available for each parameter, in turn.
Afterwards, the features were used to train the untuned LDA classifier, with 10-fold cross-validation (CV). The number of components giving the highest CV score was then chosen as the optimal parameter. For CSP, the two most important parameters to optimize were the number of CSP components and the covariance estimation technique, which could be done with or without shrinkage. The optimal number of components, chosen from a range of 4 to 19, was found to be 19 (all). Covariance estimation options included principal component analysis (PCA); Ledoit-Wolf shrinkage [30]; diagonal fixed regularization; and an empirical mode, using the estimated noise covariance matrix without shrinkage. The optimal technique was the empirical mode.
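For intuition, the core of two-class CSP (simultaneous diagonalization of the two class covariances) can be written in a few lines of numpy. This is a simplified sketch under our own naming; the study used a library implementation with the parameter options described above, and the 6-class case additionally needs a one-vs-rest style extension:

```python
import numpy as np

def csp_filters(X1, X2, n_components=4):
    """CSP spatial filters from two classes of (trials, channels, samples) data."""
    def avg_cov(X):
        return np.mean([x @ x.T / np.trace(x @ x.T) for x in X], axis=0)
    C1, C2 = avg_cov(X1), avg_cov(X2)
    d, U = np.linalg.eigh(C1 + C2)          # whiten the composite covariance
    P = np.diag(d ** -0.5) @ U.T
    lam, B = np.linalg.eigh(P @ C1 @ P.T)   # diagonalize whitened class-1 cov
    order = np.argsort(lam)[::-1]           # sort by class-1 variance ratio
    W = B[:, order].T @ P
    half = n_components // 2                # keep filters from both extremes
    return W[np.r_[:half, -half:0]]

def csp_features(W, X):
    """Log of the relative variance of each spatially filtered trial."""
    Z = np.einsum('fc,tcs->tfs', W, X)
    var = Z.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))

# Synthetic check: class 1 is strong on channel 0, class 2 on channel 1.
rng = np.random.default_rng(1)
X1 = rng.standard_normal((30, 4, 200)); X1[:, 0] *= 3
X2 = rng.standard_normal((30, 4, 200)); X2[:, 1] *= 3
W = csp_filters(X1, X2, n_components=2)
F1, F2 = csp_features(W, X1), csp_features(W, X2)
```

The first filters maximize variance for one class while minimizing it for the other, which is why the log-variance features separate the classes well.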
For the PSD, on the other hand, the parameters tuned were the number of samples per segment and the amount of overlap between segments. The optimal number of samples per segment, chosen from a range between 50 (250 ms) and 200 (1000 ms), was found to be 100, equivalent to 500 ms of the imagery period. The optimal overlap was found to be between 75% and 80%; however, we chose 80%, as that generally seemed to give better results. With the optimal parameters, CSP components and spectrograms were generated and finally fed into the SOTA classifiers.

D. STATE-OF-THE-ART CLASSIFIERS
For classification using SVM and LDA, we performed parameter searches for both classifiers, to determine the optimal parameters giving the best results. To this end, we ran a 10-fold cross-validated grid search on both models, varying the key parameters. The key parameters for the SVM were the C-value, gamma and the polynomial degree. For LDA, the key parameters were the solver and the shrinkage value. For the SVM classifier, the C-value ranged between 0.1 and 100, with steps in multiples of 10. The range of values for gamma was from 1 to 1e-8, with steps in multiples of 10e-2, plus a 'scale' value, which is the inverse of the product of the number of features and the variance of the data. The polynomial degree ranged from 1 to 3. The optimal C-value and polynomial degree were set to 10 and 1, respectively, and a radial basis function (RBF) kernel was used. The optimal gamma was set to the 'scale' value, represented by the formula γ = 1/(Nσ²), where N is the number of features and σ² is the variance [31]. For LDA, the eigen solver was chosen over the singular value decomposition and least squares solvers. The shrinkage for LDA was set to 0.6, chosen from a range of values between 0 and 1 with a step of 0.1. The classifiers were trained using 10-fold cross-validation on the training set and the best resulting classifier was used for evaluations on the test set.
To handle potential class imbalance resulting from the rejection of trials, we oversampled from any class with fewer trials than the maximum number of class trials in any set. This was done differently for the SOTA and the NNs, to carefully avoid data leakage. For the SOTA cross-validation, each fold from the training set was individually oversampled and, for the NNs, each of the train, validation and test sets was individually oversampled. All of this was done after the data split.
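Two of the details above can be made concrete in a short numpy sketch: the 'scale' gamma and the post-split oversampling (function names are ours; scikit-learn's gamma='scale' option computes the same quantity internally):

```python
import numpy as np

def gamma_scale(X):
    """gamma = 1 / (N * sigma^2), with N features and sigma^2 the variance of X."""
    return 1.0 / (X.shape[1] * X.var())

def oversample(X, y, rng):
    """Randomly resample minority classes up to the majority-class count.
    Applied per fold / per split, after the data split, to avoid leakage."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.append(members)
        extra = n_max - members.size
        if extra:
            idx.append(rng.choice(members, size=extra, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 5))
y = np.array([0] * 8 + [1] * 4)          # imbalanced toy fold
Xb, yb = oversample(X, y, rng)           # both classes now have 8 trials
```

Oversampling each fold and split separately, rather than the whole dataset, ensures that duplicated trials never appear on both sides of a train/test boundary.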

E. DEEP LEARNING TECHNIQUES
To compare the deep learning-based techniques, we made use of a variety of network architectures. CNNs, RNNs, LSTMs and GRUs were used either solely in a network or in a hybrid fashion, such that CNNs and sequence layers were both used within a network. Of the various networks, we chose the 3 best-performing for comparison with the SOTA, namely the bidirectional GRU (bi-GRU), the deep net and the multibranch network. We also experimented with pretrained image models, Inception V3 [32] and MobileNet V2 [33]; however, their performances were significantly worse than those of the other models, so they were excluded from further analyses.

1) Neural Networks
a) Bidirectional GRU - A 2-layer bidirectional GRU network was used to learn sequence relationships in both the forward and backward directions. A tanh activation was used, with a dropout of 0.4 to avoid overfitting. Sequence models are known to be good at learning time relations in data, so we applied the network for this purpose.
b) Deep Net - The network was inspired by the Deep ConvNet [24]. A 5-layer modification of the original 4-layer network, introduced in Lawhern et al.'s work [34], was used. We used AveragePooling layers rather than MaxPooling, as the former performed better. The default exponential linear unit (ELU) activation was also replaced with the scaled exponential linear unit (SELU) activation, as that seemed to perform better. A dropout of 0.4 was used to curb overfitting.
c) Multibranch - The network comprises 4 branches of the same CNN architecture, the Shallow network introduced by Schirrmeister et al. [24]. We used a multibranch structure since that generally performed better than an unbranched one, and made slight modifications considering the data specifications. Increasing the number of branches beyond 4 generally yielded no gains. The outputs from the branches were concatenated and fed into a convolution layer before a final fully connected layer.
The choice of the 3 networks was for the following reasons: a) to have representatives of sequence and convolutional models, which have been the most used; b) to explore CNN branching, with the intent of discovering how performance varies in a deep CNN as compared with a multibranched shallow one. A branched architecture tended to give more stable results across runs.
Across both the SOTA and the NNs, an 80:20 ratio was used to split the data into train and test sets. In the case of the NNs, the training set was further split using an 80:20 ratio into train and validation sets. For the networks, the optimal batch size was found to be 32. The networks were each trained for 50 epochs, with model checkpointing to save only the best weights, i.e., those giving the least loss. Structures of the networks are presented in Tables B1, B2 and B3 of Appendix B. Data from subject A was used for all optimizations, SOTA and NNs, since subject A had plausible performance and with the expectation that average optimal values might not vary greatly from subject-specific ones. Optimizations could also be done on a per-subject and per-session basis; however, that could quickly become cumbersome.
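The split scheme described above can be sketched as follows (an illustration only; the 100-trial count and the function name are ours):

```python
import numpy as np

def split_indices(n_trials, rng, with_validation=False):
    """Shuffled 80:20 train/test split; for the NNs, the training portion
    is split again 80:20 into train and validation sets."""
    idx = rng.permutation(n_trials)
    n_test = int(round(0.2 * n_trials))
    test, train = idx[:n_test], idx[n_test:]
    if not with_validation:
        return train, test
    n_val = int(round(0.2 * train.size))
    return train[n_val:], train[:n_val], test

rng = np.random.default_rng(0)
train, val, test = split_indices(100, rng, with_validation=True)
# 100 trials -> 64 train, 16 validation, 20 test
```

Splitting before any oversampling or cropping keeps augmented copies of a trial from leaking across the train/test boundary.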
2) Training and testing approaches
a) Trial-wise (TW) approach - This approach entailed the use of the whole 1-second trial block for training and testing. The 200 × 19 matrix for each trial in the training set was used for training and, in the same way, test trials were used for evaluation.
b) Cropped approach - For this approach, crops of the 1-second block of data were used for training and testing. The main idea behind the use of crops is data augmentation, for possible improvement of model performances. Crops were generated separately on the train and test sets. The steps in generating crops were as follows:
i) The window length, wlen, was defined, ranging from 0.1 (100 ms) to 1 (1000 ms). A window length of 1000 ms meant no cropping.
ii) The amount of overlap, n_overlap, was defined, with values ranging from 0 to 95, meaning no overlap to 95% overlap. In practice, any value less than 100 could be used for the overlap, since 100% overlap would be infeasible.
iii) Next, a sliding window was applied to extract crops of length wlen, with the chosen overlap, from the start to the end of the trial. The number of crops generated is given by the formula: n_crops = floor((trial_size - crop_size) / (crop_size × (1 - n_overlap/100))) + 1, where crop_size = wlen × sampling_rate and trial_size is the number of samples in a trial. For instance, to generate crops of wlen = 0.3, with n_overlap = 50% and the 200 Hz sampling rate, on a trial length of 1 second, the number of crops generated is 5. Crops originating from the same trial were assigned the same label.
iv) For testing, the prediction scores generated on the crops originating from a test trial were averaged and the trial was assigned the class with the highest mean score.
For the cropped approach, the optimal window length was found to be between 500 ms and 600 ms and the optimal overlap was 90%. We, therefore, used a window length of 600 ms with 90% overlap. An illustration of the cropped approach is seen in Figure 3.

IV. RESULTS AND DISCUSSION
The statistical tests used for the comparisons are summarized in Table 1 and the threshold, α, was set to 0.05.
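For concreteness, the crop generation and test-time score averaging of the cropped approach described above can be sketched in numpy (function and variable names are ours; rounding guards against floating-point step sizes):

```python
import numpy as np

def make_crops(trial, wlen, n_overlap, fs=200):
    """Sliding-window crops of a (channels, samples) trial.
    wlen is in seconds; n_overlap is a percentage, 0 <= n_overlap < 100."""
    crop_size = int(round(wlen * fs))
    step = int(round(crop_size * (1 - n_overlap / 100)))
    starts = range(0, trial.shape[-1] - crop_size + 1, step)
    return np.stack([trial[..., s:s + crop_size] for s in starts])

def predict_trial(crop_scores):
    """Average per-crop class scores and pick the class with the highest mean."""
    return int(np.argmax(crop_scores.mean(axis=0)))

trial = np.zeros((19, 200))                        # one 1-second, 19-channel trial
crops = make_crops(trial, wlen=0.3, n_overlap=50)  # the worked example: 5 crops
```

With the optimal settings found in this study (wlen = 0.6, n_overlap = 90), the same 1-second trial yields 7 crops of 120 samples each.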
We do not expect strict conformance of the results to normality and urge the reader to bear this in mind.

A. STATE-OF-THE-ART TECHNIQUES
A summary of results for the SOTA methods is seen in Table 2. Figure 4 also shows summarized mean accuracies in descending order. Details of our results and those reported by the dataset authors are given in Appendix Tables A1-A3. The authors reported their results for different train, validation and test partitions. However, for comparison with theirs, we chose the best, most consistent result achieved for each subject, irrespective of the partition ratio used. From the results, we make the following observations and deductions:
I) Optimal parameters were more generalizable for CSP than for spectrograms
In the feature extraction parameter searches for CSP and spectrograms, optimal parameters for CSP seemed to have less variability across subjects than those for spectrograms. Before deciding to use subject A's optimal values for feature extraction across all subjects, we performed parameter searches on different subjects' data. In initial subject-specific parameter searches, we noticed wider variability in the optimal values for spectrograms. For CSP, on the other hand, most optimal values lay in a smaller range, making them more generalizable across subjects. We infer from this that optimal parameters and, in turn, features from CSP were more stable and generalizable across subjects and classifiers than spectrograms, which seemed to be more subject-specific.
II) CSP features yield better results than spectrograms
CSP yielded better results than spectrograms, with p-values showing a statistically significant difference (3.42E-07, 7.62E-08, 8.67E-09, 1.48E-09 < 5E-02).
The reduced performance of spectrograms may be partly attributed to the relative non-generalizability and instability of the optimal parameter values. Our approach to the use of CSP SOTA techniques gave slightly better results than the authors' approach, albeit not significantly (1.85E-01, 6.89E-02 > 5E-02), but both the authors' results and ours were better than those obtained with spectrograms.

B. DEEP LEARNING-BASED APPROACHES
We present the results from using the 3 neural network architectures, with the data in trial-wise and cropped forms. Also, CSP features and spectrograms were used with the networks, to determine whether the neural networks needed these feature extraction techniques to give optimal performance or were capable of capturing the relevant relationships from the raw data. Summaries of results for the deep learning methods are seen in Tables 3 and 4. Table 3 shows results for the use of the raw data in trial-wise and cropped forms, while Table 4 shows the use of SOTA features with the networks. Figure 5 also shows summarized mean accuracies of all the methods in descending order. The following observations and deductions were made:
I) All subjects gave plausibly discriminatory signals
While some subjects performed significantly better than others, all subjects achieved greater than chance-level performance (approximately 20%) [35]. This shows that all subjects were able to give signals discriminatory enough for the decoding.
II) Crops yielded better performance in shallow networks compared with the deep net
The results from the use of crops across the neural networks yielded varying performances, which we attribute to their respective architectures. For the multibranch network, results show that using crops gave significantly better results than the trial-wise form (p = 1.32E-02 < 5E-02). The reverse was observed in the case of the deep net, for which the trial-wise approach gave significantly better results than the use of crops (p = 4.37E-05 < 5E-02). Contrary to the observed results of these two, the BiGRU showed no difference in performance using either approach (p = 9.15E-01 > 5E-02). From these, we infer, based on the network architectures, that performance with crops tends to degrade with a deep network, while with shallow networks the performance obtained is similar to that of the trial-wise form, as in the case of the BiGRU, or better, as with the multibranch network.
We infer that the multibranch network performed particularly well since the branches were shallow networks and the branched structure provided more stability compared with unbranched ones. Branches may provide stability by combining knowledge learnt across branches to form a more robust classification decision. We, therefore, recommend the use of multibranch shallow networks with crops, for improved and more stable results. This is somewhat comparable with a related work [24], where improvements were noticed with crops, but only at high frequencies. This suggests that crops-based improvements might depend on different factors, such as the frequencies present, the length of the signals or the architecture of the networks. It would be worth investigating these factors in detail.
III) Neural networks generally perform optimally but can be exploited to yield improved results in suitable scenarios
Using the trial-wise approach, the Deep net outperformed the BiGRU (p = 5.05E-03 < 5E-02) but had no significant difference compared with the multibranch network (p = 7.15E-02 > 5E-02). Using crops, on the other hand, resulted in the multibranch network performing significantly better than the Deep net while attaining similar performance to the BiGRU. Comparisons amongst the networks do not generally show one as better than the others, as each network performed best in different scenarios, while performing optimally in general. We, therefore, infer that all the networks perform optimally but can be more suited to specific scenarios. For instance, it might be preferable to use multibranch shallow networks when cropping is applied and deep networks when it is not.
IV) The use of SOTA features with NNs could yield improved results
The use of CSP features yielded better or similar performance with most networks. With the BiGRU, for instance, using CSP significantly outperformed TW (p = 3.28E-03 < 5E-02), but CSP yielded similar performance to TW with the deep net (p = 6.35E-01 > 5E-02) and slightly better performance with the multibranch network (p = 2.42E-02 < 5E-02). As noticed in other cases, the use of spectrograms gave worse results. This leads us to conclude, in this case, that the use of CSP does help improve neural network performance, though not always by a significant margin. The benefits of using CSP for decoding with neural networks would have to be weighed against the corresponding computational cost of computing the CSP features before their use in the neural networks.
V) Neural network approaches significantly outperformed the authors' approaches
For most approaches using the neural nets, results were significantly better than the authors' reported results. For instance, p-value comparisons of TW + NN against the authors' results give 5.64E-07, 7.10E-08 and 3.51E-06 (all < 5E-02) for the BiGRU, Deep Net and multibranch networks, respectively. This can be expected, given a number of contributing factors: pre-processing, which was not done by the authors; optimal feature and hyperparameter searches; and the use of more sophisticated neural networks, with a careful consideration of network architectures for optimality. Pre-processing helped to remove different forms of artifacts in the signals, strengthening the signal quality and, therefore, yielding better results from the learning. Also, optimal feature and hyperparameter tuning helped to obtain a more general set of values yielding optimal features and learning for the networks. The architectural choices made also contributed; our choices considered the best-performing networks with as few parameters as possible.
All of these resulted in improvements over the authors' reported results.

VI) Neural networks significantly outperformed SOTA
For the other SOTA techniques, as applied in this study, all feature parameter tuning and data enhancement techniques were applied. With that, the results reveal how well the SOTA performs as compared with the NNs, since these enhancements were applied to both categories. Across the neural network approaches, results show that, for all except where spectrograms were used with the networks, performances were significantly better (p-values from 4.94E-04 to 9.04E-09; all < 5E-02) than for all SOTA approaches. This shows that neural networks are more sophisticated and capable of giving better results than the SOTA. They should, therefore, be explored more for motor imagery decoding, as they tend to capture task-relevant relationships in the data much better than the prevalent techniques. The challenge of having a small number of data points might be surmounted by carefully choosing architectures suitable for such sizes. Also, transfer learning and data augmentation might be helpful in such situations. Given the approach detailed here, we recommend the use of neural networks over the SOTA.
VII) Comparisons with other works
Not many works have explored the dataset used in this study. Of the few works that have [36]-[38], only two performed classification. In Shahbakhti et al. [36], the authors investigated the detection and elimination of eye blinks from EEG trials but performed no classification. The other works, by Phang and Ko [37] and Mwata-Velu et al. [38], performed classification, though with a limited number of classes. Phang and Ko [37] focused on left- and right-foot distinction using CSP, band power and Pearson's correlation-based connectivity features with traditional SOTA algorithms: SVM, LDA and KNN. While they reported plausible results for their best-performing method (86.26 ± 9.95%), it should be noted that their result is based on a binary decoding task. Also, the authors did not explore deep learning methods, as done in this study. Mwata-Velu et al.
[38], on the other hand, explored a hybrid CNN-LSTM architecture for the classification of signals. The accuracy of their three-class classifier of left-, rightand no-hand imagined movements was reported to be 79.2%. As compared with ours, they used a smaller number of classes and reported less performance as compared with some of our deep learning methods. Also, they did not report exploring multiple architectures and making comparisons based on those. Finally, many of these works did not perform pre-processing in the manner reported in this study.
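The significance values reported above come from paired comparisons of decoding accuracies across the two categories of approaches. A minimal sketch of such a comparison, using hypothetical per-subject accuracies for an NN pipeline and a SOTA pipeline (the numbers below are illustrative, not values from this study):

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject decoding accuracies for two pipelines;
# in practice these would come from the study's evaluation folds.
nn_acc = np.array([0.81, 0.77, 0.85, 0.72, 0.79, 0.83, 0.76, 0.80])
sota_acc = np.array([0.64, 0.66, 0.70, 0.58, 0.65, 0.69, 0.61, 0.67])

# Paired t-test: subjects are matched across the two pipelines.
t_stat, p_value = stats.ttest_rel(nn_acc, sota_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# A p-value below 5E-02 indicates a significant difference.
```

A non-parametric alternative such as the Wilcoxon signed-rank test (`stats.wilcoxon`) may be preferable when accuracies cannot be assumed normally distributed.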

V. CONCLUSION
In this comparative study, we investigated different approaches to motor imagery decoding. SOTA techniques, which have mostly involved CSP and frequency transforms with SVM and LDA classifiers, were compared with neural networks. We presented our results categorized by the different approaches and provided summaries of the results and p-values for the comparisons.
From the results, we conclude that neural networks are suitable for motor imagery decoding and offer some improvement over the SOTA. They are more sophisticated, capable of modelling the underlying task-relevant relationships in the data, and do not need specific feature extraction to perform well. As seen from the results, using the raw data is suitable for motor imagery decoding; while specific feature extraction might give some gains in performance, it is not required for optimal performance. We also conclude that the use of cropping for data augmentation enhances performance depending on factors such as the network architecture: cropping improved results in the shallow networks but worsened performance in the deep network. This leads us to infer that the performance of crops in neural networks might be affected by the range of frequencies present (as seen in Schirrmeister et al.'s work [24]), the network architecture and, possibly, the length of the trial. Further investigation on other datasets of varying trial lengths is needed to determine how much effect trial length has when applying cropping for enhancement.
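Cropped training augments the data by cutting each trial into overlapping windows that inherit the parent trial's label. A minimal sketch, with illustrative channel counts and crop parameters (not the settings used in this study):

```python
import numpy as np

def crop_trials(trials, crop_len, stride):
    """Cut each trial (channels x samples) into overlapping crops;
    each crop inherits its parent trial's label."""
    crops = []
    for trial in trials:
        n_samples = trial.shape[1]
        for start in range(0, n_samples - crop_len + 1, stride):
            crops.append(trial[:, start:start + crop_len])
    return np.stack(crops)

# Illustrative sizes only: 10 trials, 22 channels, 1000 samples each.
rng = np.random.default_rng(0)
trials = rng.standard_normal((10, 22, 1000))
crops = crop_trials(trials, crop_len=500, stride=100)
print(crops.shape)  # each trial yields 6 crops -> (60, 22, 500)
```

The crop length and stride control the augmentation factor; shorter crops yield more training examples but restrict the temporal context each example carries.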
Considering the performances of the CSP-based SOTA approaches, we conclude that when following a SOTA approach, CSP is preferable for feature extraction: it significantly outperforms the other SOTA approaches and offers more stability in features and results. We could not, however, provide direct time comparisons, since the neural networks were optimized for running on a GPU while the core library used for the SOTA techniques was not.
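For reference, the core of the two-class CSP algorithm reduces to a generalized eigenvalue problem on the classes' average covariance matrices: filters at either end of the eigenvalue spectrum maximize variance for one class while minimizing it for the other. A minimal sketch on synthetic data (dimensions are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Common spatial patterns for a two-class problem.
    trials_*: arrays of shape (n_trials, n_channels, n_samples)."""
    def mean_cov(trials):
        # Trace-normalized spatial covariance, averaged over trials.
        covs = [t @ t.T / np.trace(t @ t.T) for t in trials]
        return np.mean(covs, axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenvalue problem: ca w = lambda (ca + cb) w.
    eigvals, eigvecs = eigh(ca, ca + cb)
    # Keep filters from both ends of the (ascending) spectrum.
    order = np.argsort(eigvals)
    picks = np.r_[order[:n_pairs], order[-n_pairs:]]
    return eigvecs[:, picks].T  # shape: (2 * n_pairs, n_channels)

# Synthetic two-class data for illustration only.
rng = np.random.default_rng(1)
a = rng.standard_normal((20, 8, 256))
b = rng.standard_normal((20, 8, 256)) * 1.5
W = csp_filters(a, b)
print(W.shape)  # (6, 8)
```

Log-variances of the spatially filtered trials (`np.log(np.var(W @ trial, axis=1))`) are the features typically passed to the LDA or SVM classifier.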
In future work, we will apply transfer learning specifically with the neural networks. This will involve intra-subject and inter-subject transfer learning, to address the non-stationarity problem in EEG experiments. Intra-subject transfer learning makes use of knowledge learnt in a previous session for another session of the same subject. Inter-subject transfer learning, on the other hand, makes use of knowledge learnt from other subjects, across different sessions, for a target subject. With this, we expect improvements in the performance of the neural networks, since previously learnt knowledge can be useful in future sessions. Improvements are also anticipated with model adaptation, since the model is adapted periodically; this approach, however, raises the challenge of finding the optimal adaptation frequency.
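The intra-subject transfer described above amounts to warm-starting a model on a new session with weights learnt from a previous one. A minimal sketch using a simple logistic-regression stand-in for the network (all data and dimensions are synthetic; a real pipeline would fine-tune the actual NN and evaluate on held-out trials):

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Logistic regression via gradient descent; passing w warm-starts
    training from previously learnt weights (the transfer step)."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(2)
true_w = rng.standard_normal(16)
# Source session: plenty of labelled trials (features are synthetic).
Xs = rng.standard_normal((200, 16))
ys = (Xs @ true_w > 0).astype(float)
# Target session: only a handful of labelled trials.
Xt = rng.standard_normal((20, 16))
yt = (Xt @ true_w > 0).astype(float)

w_source = train_logreg(Xs, ys)  # pre-train on the source session
w_adapted = train_logreg(Xt, yt, w=w_source.copy(), epochs=50)  # fine-tune

acc = np.mean(((Xt @ w_adapted) > 0) == yt.astype(bool))
print(f"target-session fit accuracy: {acc:.2f}")
```

The warm start lets the few target-session trials refine, rather than relearn, the decision boundary; the same pattern applies when initializing a network's layers from another subject's trained model.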