Improving Generalization of CNN-Based Motor-Imagery EEG Decoders via Dynamic Convolutions

Deep Convolutional Neural Networks (CNNs) have recently demonstrated impressive results in electroencephalogram (EEG) decoding for several Brain-Computer Interface (BCI) paradigms, including Motor-Imagery (MI). However, the neurophysiological processes underpinning EEG signals vary across subjects, causing covariate shifts in data distributions and hence hindering the generalization of deep models across subjects. In this paper, we aim to address the challenge of inter-subject variability in MI. To this end, we employ causal reasoning to characterize all possible distribution shifts in the MI task and propose a dynamic convolution framework to account for shifts caused by inter-subject variability. Using publicly available MI datasets, we demonstrate improved generalization performance (up to 5%) across subjects in various MI tasks for four well-established deep architectures.


I. INTRODUCTION
BRAIN-COMPUTER Interface (BCI) technology primarily aspires to provide neural communication and control between a user and a machine, bypassing the normal neuromuscular pathways. This is feasible by analyzing brainwaves captured by electroencephalogram (EEG) signal recordings using signal processing and Machine Learning (ML) techniques. Nowadays, BCIs find application in various areas, including emotion recognition (e.g. [2], [3]), epileptic seizure detection (e.g. [4], [5]), robotic control [6] as well as video gaming [7].
One of the first and most popular BCI paradigms is Motor-Imagery (MI). MI-BCIs are based on a neural process by which a subject mentally simulates a motor action, for example the movement of a hand or foot, without actually executing it [8]. Developing MI-BCI systems (e.g. [9], [10]) mainly relies on robust decoding of a subject's motor intentions from the recorded EEG signals, under the prior assumption that these signals encode the relevant information. Such systems are mainly used for movement rehabilitation purposes (e.g. [11], [12], [13], [14], [15], [16]) as well as wheelchair/exoskeleton control [17].
Several works have addressed the problem of EEG-based motor-imagery classification using classical feature extraction techniques [18]. The common spatial pattern (CSP) algorithm [19] and its various extensions, like the Filter-Bank CSP (FBCSP) [20], are among the most popular methods of this category due to their simplicity in design and computational efficiency in implementation. In all these methods, specific band-pass filters are applied to the EEG signals prior to the design of spatial filters, sacrificing flexibility and adaptivity to some extent.
In recent years, Deep Learning (DL) techniques, and most specifically Convolutional Neural Networks (CNNs), have largely alleviated the need for manual feature extraction, achieving state-of-the-art performance in various areas, most notably computer vision [21]. Due to their massive progress, CNN-based feature extractors have been introduced in various paradigms in the field of BCIs (e.g. [22], [23], [24], [25]), in an effort to become generic EEG signal processing tools compared to classical feature extraction techniques (e.g. [18], [19], [20]). DeepConvNet and ShallowConvNet [26] are among the first deep learning architectures employed in MI-BCIs and are inspired by common spatial pattern (CSP) filters [19], since they include convolutions across time followed by convolutions across EEG channels. EEGNet [27] is a lightweight BCI architecture which consists of a compound of temporal and spatial filtering inspired by the filter bank common spatial pattern (FBCSP) technique [20]. EEG-Inception [28] shares the exact same fundamentals with EEGNet and has strong performance results across different benchmarks. Although it is similar to EEGNet, it includes several Inception branches, originally introduced in [29]. These branches consist of trainable convolutional temporal filters of different scales, capturing several temporal modulations of the EEG signals.
Although these deep learning architectures are inspired by classical EEG feature extraction techniques and achieve impressive performance in MI classification tasks, they usually fail to tackle the problem of inter-subject variability [30], preventing the successful deployment of a previously trained MI classifier to new unseen subjects. Inter-subject variability is defined as the change in data distributions across different subjects: each individual has a unique brain anatomy and functionality that makes the discovery and exploitation of shared invariant features extremely difficult. In fact, these differences are so distinct that previous works have shown that the identification of a specific subject out of many is actually feasible (e.g. [31], [32], [33]). Therefore, modern DL-based BCIs tend to generalize poorly to unseen subjects due to this type of data distribution shift. For many years, normalization techniques (e.g. [34], [35]), i.e. data scaling using a mean and standard deviation, in conjunction with classical machine learning techniques have been considered the gold standard to solve the problem of inter-subject variability. With the advent of deep learning, methods like transfer learning have emerged in an effort to provide a solution (e.g. [36], [37], [38], [39]). In most of these methods, a small calibration set from the unseen subject is utilized to fine-tune parts of the pre-trained deep network architecture. In [39], only the last fully-connected layers are fine-tuned while the previous layers are frozen. In [36], some identified layers are fine-tuned to maximize knowledge transfer for MI classification. Although transfer learning has been proven to perform well, it still requires a calibration session in order to generalize well to unseen subjects. In the direction of zero-calibration networks, [40] proposes an adversarial inference framework that learns subject-invariant features.
In this work, we aspire to provide an alternative solution to the problem of inter-subject variability and enhance the above-mentioned BCI deep architectures dynamically without the need for a calibration session.
Causal reasoning provides tools to break down and analyze important aspects of a BCI task, and to identify and possibly resolve some of its challenges by employing appropriate ML strategies. The methodical breakdown of a BCI task and the identification of the causal relationships between the various variables of interest take into account the expert's knowledge of the involved biological and neurophysiological processes and can be of vital importance when designing and building ML-based models in the field of BCIs. In this work, we focus mainly on MI-BCI systems, and inspired by the work of [41], we analyze the task of MI EEG signal classification through the lens of causal reasoning. Motivated by this causal analysis, we introduce a framework based on dynamic convolutions that provably tackles the identified problem of data distribution shift across subjects.
Our contributions can be summarized as follows:
1) We employ causal reasoning to break down and analyze important challenges / distribution shifts in the task of MI brainwave decoding.
2) We propose a subject attention network based on learnable Gabor wavelets that can accurately identify the different available subjects.
3) Inspired by [42], we propose a framework based on dynamic convolutions that utilizes our proposed subject attention network and, with zero calibration, provably tackles the issue of inter-subject variability in the task of MI brainwave decoding according to our proposed causal breakdown.
More specifically, our causal analysis allows us to design an evaluation setup which controls for all the identified distribution shifts except the inter-subject variability. Therefore, unlike other works in the area which claim improved cross-subject performance and often utilize a mixture of techniques like data augmentation (which can also affect other causal variables of interest), our work is theoretically proven to target the problem of inter-subject variability through this specifically crafted evaluation setup.
The remainder of the paper is organized as follows: Section II describes our causal analysis, which breaks down important challenges / distribution shifts in the task of MI brainwave decoding. Section III outlines our proposed framework based on dynamic convolutions that improves the generalization of MI-BCI systems. Section IV consists of the experimental part, where performance results and comparisons are presented in detail. Section V discusses the advantages and disadvantages of our proposed framework. The last section summarizes and concludes our work and briefly outlines future research steps.

II. CHARACTERIZING DISTRIBUTION SHIFTS IN MOTOR-IMAGERY (MI) DECODING USING CAUSAL REASONING
The main goal of this paper is to propose a framework that tackles the issue of inter-subject variability in CNN-based BCI models. To achieve this, we will first investigate the problem of MI brainwave decoding through the lens of causal reasoning. As demonstrated in [41], causal models naturally encode more information, which can be vital in the machine learning design process and, if appropriately used, can lead to models which are more robust to certain types of distribution shifts. But why is this causal analysis important in this work and for the proposed framework? By performing this causal breakdown, we can identify most of the possible distribution shifts that can be met in the task of MI classification. By associating the inter-subject variability with a distribution shift in one of the core variables of interest, we can design an evaluation setup which controls for all the identified challenges except the inter-subject variability. Therefore, we can claim that our framework specifically contributes to solving the targeted problem.

A. Preliminaries
Causal reasoning is the analysis of a task / problem in terms of cause-effect relationships between the different variables of interest: if a variable A is a direct cause of variable B, we express it as A → B (A causes B or B is the effect of A). When designing a machine learning algorithm, it is crucial to understand all the involved factors as well as their causal relationships. A causal breakdown of a system can be represented as a directed acyclic graph (DAG) where the nodes are the variables of interest and the edges represent direct causal relationships. These diagrams can capture vital information about the involved variables of interest, such as conditional dependencies as well as independencies.
Fig. 1. Key challenges in machine learning for a MI EEG classification task. X represents input EEG signals, Z the true unobserved brain activity, Y the associated MI labels. • and × represent EEG signals of different labels. Dots represent data points of any label and their color represents different EEG acquisition devices.
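The DAG representation described above can be illustrated with a short sketch. The graph below uses the variables X, Z and Y from Fig. 1 with the edges Y → Z → X purely as an illustrative assumption, and the helper names `causal_order` and `ancestors` are hypothetical. It relies only on the Python standard library (`graphlib`, Python 3.9+).

```python
from graphlib import TopologicalSorter

# Illustrative graph (an assumption for this sketch): each variable maps to
# the set of its direct causes, so "Z": {"Y"} encodes the edge Y -> Z.
parents = {"Y": set(), "Z": {"Y"}, "X": {"Z"}}


def causal_order(parents):
    """Return the variables ordered so that every cause precedes its effects.

    graphlib raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(parents).static_order())


def ancestors(node, parents):
    """Collect all direct and indirect causes of `node`."""
    found, stack = set(), list(parents[node])
    while stack:
        cause = stack.pop()
        if cause not in found:
            found.add(cause)
            stack.extend(parents[cause])
    return found


print(causal_order(parents))
print(sorted(ancestors("X", parents)))
```

Storing parents rather than children keeps the lookup of "what causes this variable" direct, which is the question a causal breakdown usually asks first.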

B. Causality in Motor-Imagery Decoding
In an MI classification problem, we want to accurately predict the mentally performed task from a recorded EEG signal. Mathematically, given an input EEG signal X, we train a statistical model to predict the correct MI task Y, which can be the imagined movement of, e.g., a hand or foot. In essence, this statistical model tries to estimate the conditional probability P(Y|X) using an appropriate objective function.
In machine learning tasks, given the input X and the prediction target Y, the task of estimating P(Y|X) can be either [43]:
• Causal: when X → Y, predict effect from cause
• Anti-causal: when Y → X, predict cause from effect
Using the above basis, we can define an MI EEG classification task as an anti-causal problem, since the true MI intention (observed with the MI label Y) can be considered the cause of the recorded EEG signal X. Additionally, inspired by [43], we can consider X as a sequence of imperfect observed measurements of the true unobserved brain activity Z within, mainly, the cortical areas responsible for the sensorimotor rhythms, i.e. Z → X. Therefore, using a causal diagram, an MI EEG classification task can be described as:
$$Y \rightarrow Z \rightarrow X \quad (1)$$
As a consequence of the above anti-causal definition and causal diagram, we can explore the problem of MI EEG classification through the following causal factorization:
$$P(X, Z, Y) = P(X \mid Z)\, P(Z \mid Y)\, P(Y) \quad (2)$$
Through this causal breakdown, we can categorize the major challenges associated with Motor-Imagery (MI) EEG classification tasks into three main categories, as illustrated in Figure 1. These are challenges related to the:
1) Training EEG signals - X. One of the renowned challenges in motor-imagery classification problems (as in any medical-related machine learning problem) is the scarcity of labelled data due to the lengthy acquisition process (e.g. [44], [45], [46]). Subjects are required to spend hours in a laboratory facility performing successive motor-imagery tasks [47]. This process has been reported to cause fatigue and discomfort, even when devices with dry electrodes are utilized. To make things worse, due to the wide variety of available EEG recorders in the market, the data acquisition can be undertaken with various devices (acquisition shift P(X|Z)) which have completely different specifications (e.g. number of electrodes and sampling frequency, to name just a few), making the combination of publicly available EEG datasets extremely difficult [48].
2) Anatomical differences of subjects - P(Z|Y). Each subject has a unique brain anatomy and functionality that results in polymorphous neural activity patterns when they appear in the surface-observed EEG signal (e.g. [49], [50]). When designing a generic ML-based MI-BCI, researchers need to take this inter-subject variability (data distribution shift across subjects) into account.
3) Class imbalance - P(Y). Class imbalances can arise between the training and the deployment set of an MI-BCI. When training machine learning models, the class balance of the training set needs to match that of the deployment set as closely as possible.
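To make the role of each factor concrete, the following sketch computes P(Y|X) for a toy discrete model by marginalizing the unobserved Z out of the product P(X|Z) P(Z|Y) P(Y), and then changes only P(Z|Y) to mimic a different subject. All probability tables are hypothetical numbers chosen for illustration, and the function name `posterior_y_given_x` is an assumption of this sketch.

```python
import numpy as np

# Hypothetical toy distributions (illustrative numbers, not from any dataset).
p_y = np.array([0.5, 0.5])                  # P(Y): two balanced MI classes
p_z_given_y = np.array([[0.8, 0.2],         # P(Z | Y=0)
                        [0.3, 0.7]])        # P(Z | Y=1)
p_x_given_z = np.array([[0.9, 0.1],         # P(X | Z=0)
                        [0.2, 0.8]])        # P(X | Z=1)


def posterior_y_given_x(p_y, p_z_given_y, p_x_given_z):
    """Return P(Y | X) as an array indexed [y, x]."""
    # joint[y, z, x] = P(Y=y) * P(Z=z | Y=y) * P(X=x | Z=z)
    joint = (p_y[:, None, None]
             * p_z_given_y[:, :, None]
             * p_x_given_z[None, :, :])
    p_yx = joint.sum(axis=1)                         # marginalize Z
    return p_yx / p_yx.sum(axis=0, keepdims=True)    # normalize over Y


post = posterior_y_given_x(p_y, p_z_given_y, p_x_given_z)

# A different subject: only P(Z | Y) changes; P(X | Z) and P(Y) stay fixed.
p_z_given_y_shifted = np.array([[0.6, 0.4],
                                [0.5, 0.5]])
post_shifted = posterior_y_given_x(p_y, p_z_given_y_shifted, p_x_given_z)
```

In this toy setting, a classifier fitted to `post` would be miscalibrated for the second subject even though the acquisition model and the class prior are unchanged, which is the situation the second challenge describes.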

III. OUR PROPOSED FRAMEWORK
In this work, we mainly focus on the challenge of subject distribution shift (or inter-subject variability). Using the causal breakdown described in Section II, we will use two publicly available MI datasets (which contain a large number of different subjects, are class balanced, have a reasonably sufficient number of trials per subject, and whose trials all come from a single EEG recorder within each dataset), essentially resolving all the above identified challenges but the subject distribution shift. In terms of the causal factorization (Eq. 2), the problem of inter-subject variability can be seen as a distribution shift S where:
$$P_S(X, Z, Y) = P(X \mid Z)\, P_S(Z \mid Y)\, P(Y) \quad (3)$$
Our framework can be applied to any established CNN-based MI-BCI architecture, resulting in a performance increase. Inspired by [42], we utilize dynamic convolutions in the domain of MI brainwave decoding. Instead of having a BCI architecture that tries to discover a common latent space for all K subjects in the training set, we use K parallel trainable convolutional layers (corresponding to the K available training subjects) for each convolutional block of a CNN-based BCI network. Using a subject attention network that learns to distinguish between the available individuals, the subjects are separated from one another and essentially K parallel personalized models of the same BCI architecture are trained simultaneously, as illustrated in Figure 2.
Our proposed framework is inspired by the work of [42] in the field of computer vision, but it includes various modifications to address challenges apparent in the EEG domain. Although the complete framework is described in detail in the following subsections III-A and III-B, these differences can be summarized as follows:
• Instead of fully trainable attention mechanisms, it utilizes our novel subject attention network (described in III-A), which uses only trainable Gabor filters, making it more lightweight and explainable than a shallow fully trainable neural network, and which achieves very high performance in the subject identification task.
• Unlike [42] where there is an attention mechanism for each convolutional layer and these mechanisms are trained in an unsupervised manner, our framework uses only one attention mechanism for all convolutional layers, and with supervised training, it learns to distinguish between the available different subjects.
• The K number of parallel layers in our proposed framework is not a tunable hyperparameter (like in [42]) but coincides with the number of available subjects in the training set.
• Instead of using the output vector of the attention mechanism as in [42], our framework utilizes the proposed "uniformly attended" vector A* (described in III-B) in order to be more robust to the low Signal-to-Noise Ratio (SNR) of the EEG signal.

A. Attention Network
The first layer of our subject attention network is the first order wavelet scalogram of the input EEG signal X .
Mathematically, let $x(t) \in \mathbb{R}^{T}$ denote a one-dimensional input EEG signal, where T is the number of initial EEG time points, and let $\psi_{\epsilon}(t)$ be a wavelet. The first-order scalogram is defined as $X(\epsilon, t) = |x(t) * \psi_{\epsilon}(t)|$, where * stands for the convolution operator. To perform this operation, the raw input signal from each EEG channel is convolved with a wavelet kernel of size $(1, W) = (1, F_s/2)$, where $F_s$ is the sampling frequency. This wavelet kernel follows the real Gabor wavelet format with $t \in [-W/2, W/2]$, where $1/\sigma$ denotes the bandwidth and $\epsilon$ the normalized frequency of the Gabor wavelet; these two properties are the only trainable parameters of this layer. During training, $\epsilon$ is restricted ($\epsilon \in [0, 1/2]$) to satisfy the Nyquist theorem. The three-dimensional (3D) tensor $X(c, \epsilon, t) \in \mathbb{R}^{C \times F \times T}$ (where F is the number of Gabor filters and C the number of EEG channels), containing the first-order wavelet scalograms $X(\epsilon, t)$ for each EEG channel, is followed by a global average pooling across time and frequency. Finally, the resulting vector is passed through a fully-connected layer to compute the subject id vector π.
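A minimal NumPy sketch of this pipeline follows. It assumes one common form of the real Gabor wavelet, a Gaussian envelope multiplied by a cosine, $\psi_{\epsilon}(t) = e^{-t^2/(2\sigma^2)} \cos(2\pi\epsilon t)$; the exact normalization is an assumption of this sketch. The function names, the `weights` and `bias` of the final layer, and the use of zero-padded "same" convolution are likewise illustrative choices.

```python
import numpy as np


def gabor_kernel(eps, sigma, width):
    """Real Gabor wavelet, assumed here to be a Gaussian envelope times a cosine."""
    t = np.arange(width) - (width - 1) / 2.0       # time axis centred on zero
    return np.exp(-t**2 / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * eps * t)


def scalogram(x, eps, sigma, fs):
    """First-order scalogram |x * psi| for x of shape (C, T); returns (C, F, T).

    Assumes T >= fs // 2 so that mode="same" keeps T output samples.
    """
    eps = np.clip(eps, 0.0, 0.5)                   # stay below the Nyquist limit
    width = fs // 2                                # kernel size (1, Fs/2)
    out = np.empty((x.shape[0], len(eps), x.shape[1]))
    for f, (e, s) in enumerate(zip(eps, sigma)):
        kernel = gabor_kernel(e, s, width)
        for c in range(x.shape[0]):
            out[c, f] = np.abs(np.convolve(x[c], kernel, mode="same"))
    return out


def subject_logits(x, eps, sigma, fs, weights, bias):
    """Global average pooling over frequency and time, then a linear layer."""
    pooled = scalogram(x, eps, sigma, fs).mean(axis=(1, 2))    # shape (C,)
    return weights @ pooled + bias                             # shape (K,)
```

In a trainable version, `eps` and `sigma` would be the only learnable parameters of the first layer; here they are plain arrays so that the shapes and the pooling step are easy to follow.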

B. Subject-Attended Dynamic Convolutions
The proposed framework takes the EEG signal X as input and tries to learn both the correct MI task Y (estimate the conditional probability P(Y|X)) as well as the correct subject id π (estimate the conditional probability P(π|X)). The subject attention network and the K parallel convolutional layers are trained simultaneously using a joint loss function of $\ell_{Attention}$, $\ell_{MI}$ and acc, where acc is the training accuracy of the subject attention network and ℓ denotes the cross-entropy function ($\ell_{Attention}$ for the subject attention network and $\ell_{MI}$ for the MI classification task). This loss function effectively enforces first the training of the subject attention network and, as the attention's accuracy increases, it switches its focus to training the parallel convolutional layers for the different MI tasks.
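One loss with the behaviour described above is a convex combination of the two cross-entropy terms weighted by the attention accuracy, $(1 - acc)\,\ell_{Attention} + acc\,\ell_{MI}$. The sketch below assumes this particular weighting for illustration only; the function names `cross_entropy` and `joint_loss` are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[target])


def joint_loss(att_probs, subject_id, mi_probs, mi_label, acc):
    """Accuracy-weighted loss (assumed form): (1 - acc) * l_att + acc * l_mi."""
    w = min(max(acc, 0.0), 1.0)        # keep the weight inside [0, 1]
    l_att = cross_entropy(att_probs, subject_id)
    l_mi = cross_entropy(mi_probs, mi_label)
    return (1.0 - w) * l_att + w * l_mi
```

With `acc` near 0 the gradient comes almost entirely from the subject identification term, and with `acc` near 1 it comes almost entirely from the MI term, which matches the switching behaviour described in the text.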
As also suggested in [42], since softmax does not work well due to its near one-hot output, we use a large temperature in the softmax of the attention network during training in order to flatten the framework's attention, allow a broader gradient backpropagation and effectively assist in the subject attention network's training in the early epochs. During inference, when an input EEG signal x from a new unseen subject $S_x$ is processed, it first passes through the attention network and the subject attention vector π is computed, where $\sum_i \pi_i = 1$. We empirically observed that this vector is quite sparse, and if it were used during inference, only a handful of parallel convolutional layers would be utilized during the mixing. Instead, we would ideally like to use knowledge from all K individuals and "shift" the attention more to the most relevant subjects. To accomplish that, we compute what we call the "uniformly attended vector" A*. If there were no attention network, the K parallel convolutional layers would be mixed with a uniform factor $A_i = 1/K$. To compute the "uniformly attended vector", the uniform attention vector A is combined with the subject attention vector π and the result is passed through a softmax activation to flatten the attention across all subjects while maintaining the focus on the most relevant ones (we refer the reader to the Appendix for a performance comparison between using π and A* as attention). Here, σ denotes the softmax operation applied to this combination, A the uniform attention vector with $A_i = 1/K$, and A* the "uniformly attended" vector, where $\sum_i A^*_i = 1$. Let us denote with $W_i^j$ the learned convolutional kernel of the network's i-th convolutional layer from the j-th parallel network and with $W_i^*$ the dynamic convolutional kernel of the network's i-th layer, as illustrated in Figure 2.
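The temperature-scaled softmax and the "uniformly attended" vector can be sketched as follows. The way A and π are combined is taken here to be an elementwise product, which is an assumption of this sketch (a sum would be equally consistent with the description above); the function names are hypothetical.

```python
import numpy as np


def softmax(z, temperature=1.0):
    """Softmax with temperature; larger temperatures give flatter outputs."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def uniformly_attended(pi):
    """A* = softmax(A * pi) with A_i = 1/K (elementwise product is an assumption)."""
    pi = np.asarray(pi, dtype=float)
    uniform = np.full(pi.shape, 1.0 / pi.size)
    return softmax(uniform * pi)


# Hypothetical sparse attention over K = 4 training subjects.
pi = np.array([0.9, 0.1, 0.0, 0.0])
a_star = uniformly_attended(pi)
```

For the sparse `pi` above, `a_star` keeps its largest entry on the same subject while giving every subject a non-zero weight, so all K parallel layers contribute to the mixture.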
In our proposed framework, we compute the dynamic convolutional kernel as follows:
$$W_i^* = \sum_{j=1}^{K} A^*_j\, W_i^j$$
In other words, using the causal factorization (3), our proposed framework estimates the probability $P_{S_x}(Z \mid Y)$ of a new unseen subject $S_x$ as the linear combination of K learned conditional probabilities. More specifically:
$$P_{S_x}(Z \mid Y) = \sum_{j=1}^{K} A^*_j\, P_{S_j}(Z \mid Y)$$
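The mixing of the K parallel kernels into a single dynamic kernel can be sketched in NumPy as below. The one-dimensional kernels and the function name `dynamic_kernel` are illustrative assumptions. Because convolution is linear, convolving once with the mixed kernel gives the same output as mixing the K individual layer outputs, which is why inference needs only one convolution per layer.

```python
import numpy as np


def dynamic_kernel(kernels, attention):
    """Mix K parallel kernels of shape (K, ...) with attention weights of shape (K,)."""
    kernels = np.asarray(kernels, dtype=float)
    attention = np.asarray(attention, dtype=float)
    # Weighted sum over the first (subject) axis.
    return np.tensordot(attention, kernels, axes=1)


rng = np.random.default_rng(0)
kernels = rng.standard_normal((3, 5))       # K = 3 hypothetical temporal kernels
attention = np.array([0.5, 0.3, 0.2])       # hypothetical A*, sums to one
signal = rng.standard_normal(50)

# One convolution with the mixed kernel ...
mixed_output = np.convolve(signal, dynamic_kernel(kernels, attention), mode="valid")
# ... matches the attention-weighted sum of the K separate convolutions.
separate_output = sum(a * np.convolve(signal, k, mode="valid")
                      for a, k in zip(attention, kernels))
```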

IV. EXPERIMENTS
To validate our proposed framework based on our causal breakdown in Section II, two publicly available MI datasets are used, namely Physionet [51] and OpenBMI - MI [52]. For the purpose of this study, we kept only the MI trials without feedback, since the neurofeedback was not included as a factor in our initial causal analysis. In particular, we extracted trials corresponding to hand MI in the form of segments starting with the visual cue and lasting for 4 seconds. Furthermore, we applied a notch filter at 60 Hz and its harmonics (120, 180, 240, 300, 360, 420 and 480 Hz) to remove powerline noise. We also applied a notch filter at 460 Hz due to a spurious artifact (consistent across all trials).

A. Subject Verification
The subject attention mechanism is a vital part of our proposed framework. Therefore, we first evaluated its performance separately in order to ensure its ability to distinguish between the various available subjects in the two datasets. We performed 10-fold trial-wise cross-validation to measure its performance. The Adam optimizer was used with a learning rate of 0.01 for the first 30 epochs (to allow the Gabor filters to quickly adapt to the data) and 0.001 for the remaining 20 epochs. As shown in Table I, the subject attention network performs sufficiently well in both datasets, which makes it an ideal candidate for the attention mechanism in our proposed dynamic framework.
TABLE I: PERFORMANCE OF SUBJECT ATTENTION NETWORK (TRAINED AND EVALUATED USING 10-FOLD CROSS-VALIDATION) IN PREDICTING THE SUBJECT ID IN PHYSIONET AND OPENBMI - MI DATASETS. CV STANDS FOR CROSS-VALIDATION
TABLE II: HYPER-PARAMETER CHOICES FOR THE EXPERIMENTS

B. Comparison Between Standard and Dynamic Models
We tested our proposed framework in four well-established BCI architectures, namely DeepConvNet [26], ShallowConvNet [26], EEGNet [27] and EEG-Inception [28], in the following MI tasks: for the publicly available MI dataset Physionet [51], one binary classification task (MI Left vs Right Hand) and one 3-class classification task (MI Left Hand / Right Hand / Feet), and for OpenBMI - MI [52], one MI binary classification task (MI Left vs Right Hand).
As shown in Table II, we trained the standard networks for 30 epochs with a learning rate of 0.001, while their dynamic versions were also trained for 30 epochs: the first 20 epochs with a learning rate of 0.01, to help the attention's Gabor filters adapt quickly to the data, and the last 10 epochs with a learning rate of 0.001 and frozen attention, to fine-tune to the MI task. In all cases, we used an Adam optimizer. Finally, a temperature of 30 was used during training in the attention mechanism, as described in the previous section. We evaluated the performance of the standard networks and their equivalent dynamic networks in a leave-one-subject-out fashion (cross-subject performance), as reported in Table III.
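The evaluation protocol and the two-phase schedule described above can be sketched in plain Python. The learning rates and epoch counts are the ones stated in the text; the 0-based epoch indexing and the function names `leave_one_subject_out` and `dynamic_schedule` are assumptions of this sketch.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out each subject once."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def dynamic_schedule(epoch):
    """Return (learning_rate, attention_frozen) for a 0-based epoch index."""
    if epoch < 20:
        return 0.01, False      # attention and parallel layers train together
    return 0.001, True          # attention frozen, fine-tune to the MI task


# Hypothetical per-trial subject labels.
subject_ids = ["s1", "s1", "s2", "s3", "s2"]
folds = list(leave_one_subject_out(subject_ids))
```

Splitting by subject rather than by trial ensures that no trial of the test subject is seen during training, which is what makes the reported numbers cross-subject performance.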

C. Comparison With State-of-the-Art Approaches
In this work, we are interested not only in comparing the models trained with our framework against regularly trained CNN-based BCI architectures but also in comparing our framework with other transfer learning approaches in the EEG domain. Therefore, we evaluated the performance of the standard networks and their equivalent dynamic networks in a leave-M-subjects-out fashion (Table IV). Furthermore, we compared our framework with two other commonly used transfer learning EEG techniques: 1) an adversarial approach, namely [40], that (similarly to our approach) does not use a calibration set and 2) Euclidean alignment [53], which projects data into a domain-invariant space but uses all the trials of a subject. We trained the Euclidean alignment networks similarly to their vanilla equivalents after performing the data projection for each subject. We trained the equivalent adversarial networks with early stopping and adversarial regularization weight ϵ = 0.005 (hyperparameters taken from the original paper [40]). As can be seen from Table IV, our proposed method outperforms adversarial networks (a similar zero-calibration method) while achieving the same or higher performance when compared with Euclidean alignment. It is worth mentioning, though, that Euclidean alignment uses all the trials of an unseen subject, while our framework is dynamically adapted for each trial during inference.
1 ±% refers to the rounded standard deviation across 10 runs of 10-fold cross-validation experiments.
2 Early stopping has been applied to some folds during the fine-tuning phase.
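For context on the Euclidean alignment baseline, a minimal NumPy sketch of its commonly described formulation is given below: the trials of one subject are whitened by the inverse square root of that subject's mean spatial covariance, so that the mean covariance of the aligned trials becomes the identity. The function name and the array layout (trials, channels, time) are assumptions of this sketch, not details taken from the experiments above.

```python
import numpy as np


def euclidean_alignment(trials):
    """Align the trials of one subject; `trials` has shape (n_trials, C, T)."""
    trials = np.asarray(trials, dtype=float)
    covs = trials @ trials.transpose(0, 2, 1)       # per-trial X X^T, shape (n, C, C)
    reference = covs.mean(axis=0)                   # mean covariance of the subject
    evals, evecs = np.linalg.eigh(reference)        # symmetric eigendecomposition
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # reference^(-1/2)
    return inv_sqrt @ trials                        # broadcast over trials
```

The reference matrix is computed from all trials of the subject, which illustrates why this baseline needs access to the unseen subject's full set of trials, in contrast to a per-trial adaptation.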

D. Calibration Methods
We evaluated the performance of the calibrated networks (using a small calibration set of the unseen subjects to fine-tune the final classification layer). For a fair comparison, we also fine-tuned the last layer of the equivalent dynamic networks using the same calibration sets. As shown in Table V, the calibrated dynamic models also outperform their equivalent vanilla calibrated networks.

E. Investigation of Negative Transfer Learning
Although our proposed framework showed increased cross-subject performance, as experimentally shown above, we wanted to investigate whether there are any signs of negative transfer learning during the process. As shown in Figure 3, although there are limited cases of negative transfer learning, the vast majority of cases are either marginally or significantly better compared to the vanilla architectures.

V. DISCUSSION
The proposed dynamic framework can be used in various CNN-based MI-BCI architectures to increase the cross-subject performance and can take us one step closer to tackling the problem of inter-subject variability, as the experimental evaluation in Section IV illustrates. We expect this framework, with certain modifications, to be able to generalize well and be adapted to various BCI paradigms, not only MI. Investigating different BCI paradigms is beyond the scope of this paper, where the causal analysis of the MI task is a core factor in ensuring that our proposed framework tackles the targeted problem and that there are no misleading performance increases. Extending the framework to different paradigms would also require a causal breakdown for these tasks.
One limitation of our work is the unavoidable increase in the number of trainable parameters (by a factor of about K, where K is the number of available subjects in the dataset). Although our subject attention mechanism seems to identify a large number of subjects well (e.g. 103 on PhysioNet), this increase in the number of trainable parameters might be a limiting factor in some cases, especially if these models are deployed in real-life applications where devices have a limited amount of memory storage. Fortunately, this tremendous increase in the number of parameters does not translate to execution time. As shown in the Appendix, the increase in inference time cost is less than a factor of K. Inspired by related works [54], we could investigate approaches to mitigate this increase in future work.
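The "about ×K" growth can be made concrete with a short back-of-the-envelope sketch. It assumes that only the convolutional blocks are replicated K times, while the remaining layers and the attention network are counted once; this split, the function name, and all sizes below are hypothetical and chosen only for illustration.

```python
def dynamic_parameter_count(conv_params, shared_params, attention_params, num_subjects):
    """Approximate trainable parameters when every conv block is replicated K times."""
    return num_subjects * conv_params + shared_params + attention_params


# Hypothetical sizes, not measurements of any of the evaluated architectures.
standard = 2_000 + 300                                    # conv + other layers
dynamic = dynamic_parameter_count(2_000, 300, 500, num_subjects=50)
ratio = dynamic / standard
```

Under these assumptions the ratio approaches K when the convolutional parameters dominate the model, and stays below K otherwise.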
In contrast to other techniques that promise to tackle the issue of inter-subject variability, our framework is dynamically adapted to a new subject during inference without the need of re-training or calibration trials, commonly used in transfer learning methods. Furthermore, an inherent advantage of our framework is the training of K parallel personalized models of the same BCI architecture. During training, these models are not trained using only the samples of one specific subject but also samples from "similar" subjects since the attention mechanism is trained simultaneously. An interesting future step would be to evaluate the performance of these inherent personalized models compared to standard personalized models -trained using strictly the samples of one specific subject. Although the BCI deep architectures used in Section IV are considered state-of-the-art and achieve high performance across different MI-BCI tasks, they are usually comprised of thousands of trainable parameters, making the training of standard personalized models difficult with these publicly available datasets. For that endeavour, we need first to design more lightweight BCI architectures and then perform these comparisons.

VI. CONCLUSION
In this work, we analyze the task of MI EEG classification through the lens of causal reasoning. To the best of our knowledge, this is the first work that brings machine learning in conjunction with causal reasoning to the domain of EEG. Through this analysis, we identify and analyze some of the major challenges and we introduce a framework based on dynamic convolutions that tackles the problem of subject distribution shift (inter-subject variability). Our proposed subject attention mechanism achieves high performance in identifying subjects, and the overall proposed dynamic framework demonstrates increased performance when applied to different BCI architectures while at the same time outperforming other similar methods. In future work, we plan to use it to tackle more, if not all, of the challenges described in detail in our causal analysis of MI brainwave decoding.

APPENDIX
As described in Section III, during inference, when an input EEG signal from a new unseen subject $S_x$ is processed, it first passes through the attention network and the subject attention vector π is computed. Through investigation, we observed that this vector is quite sparse. Although this is something we would ideally like, the low SNR of the EEG signal makes our framework unstable, especially when used in our desired zero-calibration one-trial setup. In order to have a robust network that dynamically adapts to the new trial from an unseen subject, we utilized the "uniformly attended vector" A*, which uses knowledge from all K individuals and "shifts" the attention more to the most relevant subjects. A comparison between using the vector π versus our proposed uniformly attended vector A* as attention in our proposed dynamic framework can be seen in Figure 4.
A significant drawback of our proposed framework is the unavoidable increase in the number of trainable parameters (by a factor of about K, where K is the number of available subjects in the dataset). This factor can have limiting effects when these models are deployed in real-life applications where devices have a limited amount of memory storage. As shown in the following table, the tremendous increase in the number of parameters does not translate to execution time: the increase in inference time cost is less than a factor of K.