MDTL: A Novel and Model-Agnostic Transfer Learning Strategy for Cross-Subject Motor Imagery BCI

In recent years, deep neural network-based transfer learning (TL) has shown outstanding performance in EEG-based motor imagery (MI) brain-computer interfaces (BCIs). However, due to the long preparation time of pre-trained models and the arbitrariness of source domain selection, applying deep transfer learning across different datasets and models remains challenging. In this paper, we propose a multi-direction transfer learning (MDTL) strategy for cross-subject MI EEG-based BCI. This strategy transfers knowledge from multiple source domains to the target domain as well as from one multi-source domain to another. It is model-independent, so it can be quickly deployed on existing models. Three generic deep learning models for MI classification (DeepConvNet, ShallowConvNet, and EEGNet) and two public motor imagery datasets (BCIC IV dataset 2a and Lee2019) are used in this study to verify the proposed strategy. For the four-class dataset BCIC IV dataset 2a, the proposed MDTL achieves 80.86%, 81.95%, and 75.00% mean prediction accuracy with the three models, outperforming the same models without MDTL by 5.79%, 6.64%, and 11.42%. For the binary-class dataset Lee2019, MDTL achieves 88.2% mean accuracy with DeepConvNet, outperforming the accuracy without MDTL by 23.48%. The achieved 81.95% and 88.2% also surpass the existing deep transfer learning strategy. Moreover, the training time of MDTL is reduced by 93.94%. MDTL is an easy-to-deploy, scalable, and reliable transfer learning strategy for existing deep learning models, which significantly improves model performance and reduces preparation time without changing the model architecture.


I. INTRODUCTION
Brain science is one of the most challenging research fields, attempting to reveal the inner mechanisms of brain activities and functions [1]. A brain-computer interface (BCI) based on electroencephalography (EEG) provides a promising way to exchange information between the brain and a device. Researchers in the BCI community have used various paradigms to trigger changes in EEG signals, such as motor imagery (MI), steady-state visual evoked potentials (SSVEP), and P300 evoked potentials.
Nowadays, MI EEG-based BCI is a promising technology due to the enormous demand in medical implementations, including stroke rehabilitation, wheelchair control, prostheses control, exoskeleton control, cursor control, speller, and thought-to-text conversion [2]. MI EEG-based BCI has also been used in non-medical fields such as virtual reality (VR), video gaming, vehicle control, smart home, and the military. The MI task is accomplished by imagining movement without actually performing it. Many pattern recognition algorithms and traditional machine learning methods have been applied to encode signals and decode features. However, real-world MI applications are still limited due to low decoding performance and poor generalizability.
Recently, deep learning methods have shown superior performance to traditional machine learning methods in EEG processing. In MI experiments, each subject needs a time-consuming acquisition period to collect EEG data for training, and this massive demand for training data hinders improvements in accuracy. Researchers have therefore tried to utilize the data of existing subjects to expand the training set, the so-called cross-subject strategy. However, since MI signals have subject-dependent properties, they differ from person to person, and the result is often worse than using only one subject's own data. Researchers believe it is necessary to consider the non-stationarity of EEG features when utilizing existing data.
Researchers have applied transfer learning in BCI fields to balance the huge demand for training data and the non-stationarity of cross-subject EEG signals. Despite this non-stationarity, subjects still present similarities in EEG when they imagine the same movement. According to [3] and [4], this similarity is demonstrated by the generality of the feature extractor parameters, and the shallow layers in a neural network usually act as a feature extractor. As shown in Fig. 1, transfer learning can utilize the parameters of a feature extractor from similar or relevant subjects/sessions/devices/tasks to facilitate learning for a new subject/session/device/task.
Although transfer learning has been applied in the above EEG-based BCI research, source domain selection is still controversial. According to current MI research, there are mainly two strategies: (a) treat one existing subject as the source domain; (b) treat all existing subjects as the source domain. Technically, (a) avoids the non-stationarity among different subjects and achieves ideal predictive accuracy in the source domain. However, the accuracy may not be better than using only data from the target domain, due to the non-stationarity between the source and target domains. Therefore, this strategy relies on manually selecting the most suitable source domain. Furthermore, it only uses the data of one existing subject and abandons the data of all the others.
On the contrary, (b) makes full use of all existing data, but it ignores the influence of subjects with low correlation to the target domain. That will turn into noise when utilizing data to the target domain. Furthermore, current strategies need to perform N times training process for N subjects, which is time-consuming.
To overcome the above problems, we propose a multi-direction transfer learning (MDTL) strategy, which automatically selects several suitable subjects from all existing subjects and transfers part of the trained model's parameters to the target domain. Different from traditional transfer learning, MDTL uses multiple transferring directions to fully use the data of all subjects. It also reduces the influence of source subjects with poor correlation to the target subject domain. Meanwhile, MDTL significantly simplifies the training process compared with other transfer learning strategies in MI. We applied MDTL to three state-of-the-art (SOTA) models, and classification performance is significantly improved on two public datasets. We also compare MDTL with the traditional transfer learning strategy; the results show that MDTL has advantages in predictive accuracy and training time. In summary, this study highlights the following contributions:
• We put forward a high-performance transfer learning strategy named MDTL, which utilizes the multi-source domain, target domain adaptation, label smoothing, and cosine annealing.
• Multi-source domain helps target domain to match the most suitable source domains, thus efficiently enhancing predictive accuracy. Target domain adaptation, label smoothing, and cosine annealing avoid overfitting. The multiple directions help to utilize source domain data and reduce the computation costs.
• We apply MDTL to three deep-learning models. It significantly improves the predictive accuracy and achieves the best performance on BCI Competition IV dataset 2a (BCIC IV dataset 2a) [5] and Lee2019 [6]. Experiments show that MDTL can be quickly deployed on different neural networks without changing their structure. MDTL provides a new baseline for researchers carrying out motor imagery research based on transfer learning.
The remainder of this paper is organized as follows. In Section II, we introduce the relevant concepts and recent studies. Then we describe the proposed MDTL strategy and the experiment procedures in Section III and Section IV. After that, we present and discuss the results in Section V. Finally, Section VI concludes this paper.

II. RELATED WORKS

A. Motor Imagery Classification
MI-based EEG feature extraction methods have been demonstrated in the frequency and spatial domains. Among these methods, filter bank common spatial patterns (FBCSP) [7], based on common spatial pattern (CSP) features [8], has been a classical baseline. The fast Fourier transform (FFT) [9] and autoregressive (AR) models [10] are also used to extract power spectral density (PSD) features. For classification, the support vector machine (SVM) has been widely used in many studies [7], [11]. Cho et al. [12] used Fisher discriminant analysis (FDA) as a classifier and acquired good performance on binary MI tasks. However, all these methods rely on handcrafted features, which means the feature extractor and the classifier are separated into two stages: their parameters are trained separately with different objective functions [13]. Besides, handcrafted features extracted by the above extractors are not necessarily the most suitable for the classifier.
Deep learning partly solves the above problems. Deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have reached competitive performance for MI BCI using automated feature extraction. In contrast to machine learning-based methods, deep learning needs less preprocessing and embeds all computation steps in one single network to generate an end-to-end model. Schirrmeister et al. proposed DeepConvNet [14], which consists of a series of convolutional temporal filters and spatial filters; it has been verified and widely accepted as a baseline in MI studies. ShallowConvNet is a simplified version of DeepConvNet that reduces computational complexity with no significant performance degradation. EEGNet, proposed by Lawhern et al. [15], was inspired by the FBCSP method [7]. EEGNet uses horizontal and vertical convolutions to simulate frequency filters and spatial filters, and has been employed for P300 and MI BCI paradigms due to its high-level EEG decoding ability across paradigms. Researchers in [13] proposed a novel multi-branch 3D convolutional neural network (3D CNN) that transforms EEG signals into a sequence of 2D arrays preserving the spatial distribution of the sampling electrodes. It shows good performance and excellent robustness on different datasets and subjects.

B. Transfer Learning
EEG signals are weak, nonstationary, mixed with noise for one subject, and vary across different subjects and sessions [16]. Therefore, it is hard to design an optimal model for different subjects, during different sessions, and on different tasks and devices. Besides, deep neural networks have more parameters to train compared to traditional machine learning methods. For example, the DeepConvNet in [14] has as many as 305,077 trainable parameters for the binary classification of MI [3]. These parameters require a staggering amount of training data. Although many public MI datasets are accessible, the training data available for one subject is still limited [6].
The transfer learning strategy is a promising solution to overcome the problem of insufficient subject-specific data. Dai et al. [17] proposed a transfer kernel CSP (TKCSP) method to learn a domain-invariant kernel by directly matching distributions of source subjects and target subjects. It first introduces kernel CSPs (KCSPs) [18] and transfer kernel learning (TKL) [19] to EEG spatial filtering for cross-subject MI classification. Azab et al. [20] proposed a new transfer learning strategy to reduce the calibration time and avoid degrading the performance of MI-BCI. They train a logistic regression classifier for the source domain subject and then calibrate the trained classifier to the target subject so that the new classifier has similar parameters to those of the source domain subject. He and Wu [21] proposed a Euclidean alignment (EA) method to align EEG data from different subjects into Euclidean space and then extract common Euclidean space and Riemannian space features. Online and offline experiments show their effectiveness in MI classification tasks.
Above all, these transfer learning-based methods mainly concern the parameters of traditional feature extractors such as CSP, Riemannian alignment (RA), etc.
Fewer studies involve deep transfer learning, which uses the neural network as both the feature extractor and the classifier. Among these studies, Xu et al. [22] first introduced deep transfer learning to the EEG-based MI field. They used a VGG-16 model pre-trained on ImageNet and transferred part of the parameters of the pre-trained model to the target subject. However, they did not consider EEG-specific features but directly used an image-based pre-trained model. Zhang et al. [4] proposed a transfer learning-based MI classification strategy that combined a convolutional neural network (CNN) and a long short-term memory (LSTM) in one network. Zhang et al. [3] proposed a deep CNN-based strategy for decoding MI that fine-tunes a pre-trained model and adapts it to a target subject; this strategy acquired an excellent improvement on the Lee2019 dataset. The above studies used the traditional transfer learning strategy, which adopts all subjects or a single picked subject's EEG data as the source domain. However, this strategy takes substantial time, and it is hard to achieve the desired effect when the number of subjects is limited.
The main process of these deep transfer learning-based methods can be described as follows: given a source domain D_s with labeled samples and a target domain D_t, transfer learning aims to learn a prediction function f_t : X_t → Y_t with the lowest error on the unlabeled target samples, using labeled samples in both the source and target domains.
In Fig. 1, we show the structure of traditional transfer learning applied to motor imagery. In MI, we usually consider a single subject as a domain, since different subjects exhibit variability in EEG signals when performing the same motor imagery task [1]. To facilitate understanding, we introduce the multi-source domain to MI. A multi-source domain D_m = {D_s^i}_{i=1}^{M} includes M subjects with M×N labeled samples. The target domain only contains one subject, so we still use D_t, with both labeled and unlabeled samples, to represent it. A sample means a trial, and (X, Y) represents the EEG data and labels of MI tasks. Therefore, transfer learning in MI can be described as: given M subjects, each with N labeled MI trials, and one target subject with N_l labeled trials and N_u unlabeled trials, we aim to utilize the M×N + N_l labeled trials to learn a prediction function f_t with low error on the N_u unlabeled trials.
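This formulation can be made concrete with a small bookkeeping sketch (purely illustrative: the placeholder strings stand in for real EEG trial arrays, e.g. 22 channels × 1000 samples for BCIC IV 2a, and the counts are example values):

```python
# Illustrative sketch of the cross-subject MI transfer learning setup.
M, N = 9, 288          # M source subjects, N labeled trials each
N_l, N_u = 144, 144    # target subject: labeled / unlabeled trial counts

# Multi-source domain D_m: one sub-domain per source subject.
D_m = [[("x_trial", "y_label") for _ in range(N)] for _ in range(M)]

# Target domain D_t: a labeled part and an unlabeled part.
D_t_labeled = [("x_trial", "y_label") for _ in range(N_l)]
D_t_unlabeled = ["x_trial" for _ in range(N_u)]

# Transfer learning uses all M*N + N_l labeled trials to learn f_t,
# which is then evaluated on the N_u unlabeled target trials.
n_labeled = sum(len(d) for d in D_m) + len(D_t_labeled)
print(n_labeled)  # 9*288 + 144 = 2736
```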

III. METHODS
In this section, we introduce the proposed multi-direction transfer learning (MDTL) strategy for MI BCI. We focus on cross-subject MI classification task and aim to learn a good target model using a large amount of source domain data and a small amount of target domain data.
Consider n subjects, each with m_l labeled MI trials and m_u unlabeled MI trials. Here we hypothesize that each subject has the same number of labeled and unlabeled MI trials for ease of description; in fact, the strategy also works when the sizes differ. Labeled MI trials are described as a set of (data, label) pairs, and unlabeled trials as a set of data only. Here we use BCIC IV dataset 2a to describe the algorithm. Each subject's EEG data is represented by a set of EEGs, with the number below representing the subject number. We first divide the training EEG data into a training set and a validation set; purple represents the training set, and orange represents the validation set. We randomly select the training sets of three subjects to form a multi-source domain and train a source model on that domain. Then we validate the model on the validation sets of the three subjects, respectively. We compare the results with the existing results to decide whether to update the saved models for these subjects. Finally, we use the specific training set to fine-tune the corresponding saved model to obtain the target model.
To avoid data leakage, all unlabeled trials are only used for testing. The models in both domains have the same structure, containing two major components: a feature extractor f and a classifier c. We use F = c • f to represent a complete model, so the source model is represented as F_s = c_s • f_s and the target model as F_t = c_t • f_t. The proposed MDTL has two major steps, as shown in Fig. 2:
• Multi-source domain training: the training process of MDTL consists of multiple consecutive training rounds. At the beginning of each round, k different subjects are randomly selected from all subjects to form a multi-source domain. To partly initialize the parameters, the feature extractor f_s is inherited from the previous round, where it has already converged on that round's multi-source domain. Then, once the new multi-source domain is generated, the network is retrained and the parameters are updated. This iteration continues until the maximum number of rounds is reached.
• Target domain adaptation: after step 1), we obtain several trained models based on multi-source domains. We then pick the best source model for each target subject by evaluating the models whose multi-source domain contains the labeled trials of that subject; the remaining labeled trials are used as a validation set. Furthermore, we use fine-tuning and cosine annealing to adapt the picked model to the target subject.
Note that we divide the strategy into two steps here just for demonstration purposes; the two steps are carried out simultaneously and are integrated with each other. This is one of the main innovations of this paper, which greatly reduces the computation cost while improving the accuracy of the model.
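The two interleaved steps can be sketched as follows. This is a schematic only: `train_on_domain` and `validate` are stub functions standing in for real network training and validation, and the subject count, domain size, and round count mirror the BCIC IV 2a configuration described later in the paper.

```python
import random

random.seed(0)
n, k, m = 9, 3, 20   # subjects, multi-source domain size, training rounds

# Stubs standing in for real network training and validation.
def train_on_domain(domain, init_extractor):
    # A real implementation would initialize the feature extractor from
    # init_extractor (if any) and train on the k subjects' training sets.
    return {"extractor": ("f", tuple(sorted(domain))), "domain": domain}

def validate(model, subject):
    return random.random()   # placeholder validation accuracy

best = {s: {"acc": 0.0, "model": None} for s in range(n)}
extractor = None             # feature extractor carried across rounds
domains_seen = []
for _ in range(m):
    domain = random.sample(range(n), k)   # step 1: random multi-source domain
    domains_seen.append(domain)
    model = train_on_domain(domain, extractor)
    extractor = model["extractor"]        # inherited by the next round
    for s in domain:                      # validate only on domain members
        acc = validate(model, s)
        if acc > best[s]["acc"]:          # keep the best source model per subject
            best[s] = {"acc": acc, "model": model}

# Step 2 (target domain adaptation) would fine-tune best[s]["model"] on
# subject s's own training set, with a cosine-annealed learning rate.
```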

A. Multi-Source Domain Training
As presented in Fig. 2, the training on multi-source domains consists of several epochs.
In each epoch, we randomly select k subjects from all n subjects to form a multi-source domain, i.e., we have D_m = {D_s^i}_{i=1}^{k}. Note that the domain size k is set manually; in Section V, we will analyze the influence of the number k.
An automatically selected source domain helps to focus on the most suitable domains for the target subject. However, it increases the computation: testing n subjects on m multi-source domains requires n×m evaluations. In our strategy, we only test each subject on the multi-source domains that contain that subject, because models trained on these domains have already adapted to the subject during the training phase. This reduces the computation cost from n×m to k×m, where k stands for the size of the multi-source domain and is much smaller than n.
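The saving can be checked with a quick count, using the illustrative values n = 9, k = 3, m = 20 that match the BCIC IV 2a configuration used later:

```python
n, k, m = 9, 3, 20   # subjects, multi-source domain size, number of domains

naive_evals = n * m  # every subject validated against every multi-source domain
mdtl_evals = k * m   # each domain validated only on its own k member subjects

print(naive_evals, mdtl_evals)  # 180 60
```

Here two-thirds of the validation runs are avoided, and the ratio k/n is independent of m.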
According to existing studies [17], [20], different subjects exhibit similar EEG features when they perform the same motor imagery task. As neural networks are trained on different subjects, they still share some common information in layers of feature extractors.
We denote a complete model by F as illustrated in Equation (1), and then train a model F_s = c_s • f_s on the first multi-source domain D_m,s. Here, D_m,s is a collection of training sets from randomly selected subjects. Therefore, the model F_s is not specifically trained for one subject or for all subjects; it is only trained for the subjects belonging to the multi-source domain D_m,s. This approach does not require training the source model on all subjects, as other studies do.
We adopt the cross-entropy loss with label smoothing [23], [24] to compute the loss of the output:

L = − Σ_{k=1}^{N} p_{k,i}^{ls} log δ_k(g^s(x_i^s)),

where p_{k,i}^{ls} = (1 − α) p_{k,i} + α/N, p_{k,i} is the kth element of the one-hot encoding of the label y_i^s, N is the number of classes, and δ_k(g^s(x_i^s)) is the kth softmax output of g^s(x_i^s). α is manually set to 0.1. The models are trained through stochastic gradient descent (SGD) and backpropagation.
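A plain-Python rendering of this label-smoothed cross-entropy may help make the terms concrete (the logits here are made-up numbers, not model outputs; α and the class count follow the text):

```python
import math

def softmax(logits):
    """Numerically stabilized softmax, i.e. the delta_k outputs."""
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_cross_entropy(logits, y, alpha=0.1):
    """Cross-entropy with label smoothing: p_ls = (1-alpha)*onehot + alpha/N."""
    N = len(logits)                      # number of classes
    probs = softmax(logits)
    p_ls = [(1 - alpha) * (1.0 if k == y else 0.0) + alpha / N
            for k in range(N)]           # smoothed target distribution
    return -sum(p * math.log(q) for p, q in zip(p_ls, probs))

# Four-class example, as in BCIC IV 2a; logits are illustrative.
loss = smoothed_cross_entropy([2.0, 0.5, -1.0, 0.1], y=0)
# The smoothed target is never a pure one-hot, so the loss stays positive.
```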
After we have trained the first source model F s = c s • f s for the first multi-source domain, the parameters of feature extractor f s are delivered to the next epoch. They are set as the next source model's initial parameters to retain the feature extractor's information. The classifier of the next source model is randomly initialized.
Then we repeat the above steps to acquire various multi-source models. Furthermore, after a certain number of training epochs, we stop transferring the parameters and reset the model to its initial parameters to avoid overfitting.

B. Target Domain Adaptation
When training on a multi-source domain is completed, the trained model is validated on the validation sets of the k subjects belonging to that multi-source domain, respectively. The accuracy and loss on the validation set are obtained to measure the performance of the trained model.
In this period, the feature extraction layers of the source model are suited to the MI data of the corresponding multi-source domain. However, the feature distribution of MI data differs between the multi-source and target domains, and the degree of variation varies across target domains. Therefore, different target domains should use different learning rates during domain adaptation to avoid both over-fitting and under-fitting. However, the adaptation process of different target domains in the MDTL method is continuous, and each multi-source domain is randomly selected, so it is impossible to manually choose the optimal learning rate for each pair of multi-source domain and target domain in the domain adaptation process. Cosine annealing is a method that dynamically adjusts the learning rate while optimizing the objective function. We apply cosine annealing in target domain adaptation to adjust the learning rate to accommodate differences in feature distribution between the multi-source domain and the target domain. We define the initialized target model as F_t = c_t • f_s, where the parameters of the feature extractor f_s are copied from the multi-source model and the classifier c_t is randomly initialized. The target domain adaptation can be formulated as the problem of minimizing the loss function L as described in equation 3.
In this work, we consider cosine annealing with warm restarts. We execute a new warm-started run once T epochs have been performed. Importantly, the restarts are not performed from scratch but emulated by increasing the learning rate η_t, while the old value of x_t is used as the initial solution at time step t. Within the ith run, we decay the learning rate with cosine annealing for each batch as follows:

η_t = η_min + (1/2)(η_max − η_min)(1 + cos(T_cur π / T)),  (4)

where η_t represents the learning rate at time step t, and η_max and η_min represent the range of the learning rate. T_cur represents how many epochs have been performed since the last restart. The relationship between t, T_cur, T, and the run index i is given by t = T · i + T_cur. Thus η_t = η_max when T_cur = 0; when T_cur = T, the cosine function outputs −1 and thus η_t = η_min. The decrease of the learning rate is shown in Fig. 3 for T = 100. Note that the vertical coordinates are in logarithmic form, so the cosine function does not appear in its typical shape.

Then we optimize the parameters of F_s with stochastic gradient descent (SGD) as follows:

x_{t+1} = x_t − η_t ∇L_t(x_t),  (5)

where x_t accounts for the parameter vector at time step t. The learning rate η_t is given by equation 4, and it decreases and warm-restarts as shown in Fig. 3. L_t is the cross-entropy loss function described in equation 3. By calculating the gradient ∇L_t of the loss function at x_t, we iterate equation 5 repeatedly to approach the optimal parameters x* that minimize the loss function L_t. SGD with cosine annealing has a better ability to search for the global optimum by avoiding getting trapped in local optima.

After all these training phases are done, we test each subject's model on its testing set, respectively. The testing accuracy is used to represent the predictive capability of the models.
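The schedule in equations 4 and 5 can be sketched in a few lines of Python (η_max, η_min, and T here are illustrative values, not the paper's exact settings):

```python
import math

def cosine_annealing_lr(t, T=100, eta_max=1e-3, eta_min=1e-5):
    """Eq. 4: eta_t = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi*T_cur/T)),
    with a warm restart (T_cur wraps back to 0) every T steps."""
    T_cur = t % T
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T))

def sgd_step(x, grad, t):
    """Eq. 5: x_{t+1} = x_t - eta_t * grad(L_t)(x_t), applied elementwise."""
    eta = cosine_annealing_lr(t)
    return [xi - eta * gi for xi, gi in zip(x, grad)]

# eta_t starts at eta_max, decays toward eta_min, then warm-restarts.
print(cosine_annealing_lr(0))    # eta_max
print(cosine_annealing_lr(99))   # close to eta_min
print(cosine_annealing_lr(100))  # restart: back to eta_max
```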

Algorithm 1 Framework of the Proposed MDTL Strategy
Input: Training data from all subjects, D_a; training labels corresponding to the training data, L_a; size of one multi-source domain, k; number of multi-source domains, m; number of all subjects, n
Output: Learned parameters of all subjects: θ_f,1, θ_f,2, ..., θ_c,1, θ_c,2, ...
1: Initialize all parameters θ_f and θ_c
2: while epoch ≤ m do
3:   Randomly select k subjects to form a multi-source domain
4:   Train the source model on the multi-source domain
5:   Separately calculate the classifier loss of each subject in the multi-source domain: L_s,1, L_s,2, ..., L_s,k
6:   Update the saved model of each subject in the domain if its validation result improves
7:   Preserve the feature extractor parameters θ_f for the next epoch
8:   Initialize the classifier parameters θ_c
9: end while
10: for j ≤ n do
11:   while θ_f,j, θ_c,j have not converged do
12:     Sample a batch from subject j's training data
13:     Calculate the model loss L_c(θ_f,j, θ_c,j)
14:     Optimize the parameters of the feature extractor and classifier by θ_f,j ← θ_f,j − µ ∂L_c/∂θ_f,j, θ_c,j ← θ_c,j − µ ∂L_c/∂θ_c,j
15:   end while
16: end for

C. Model Architecture
In this study, we use three state-of-the-art (SOTA) models to verify MDTL's effectiveness and its deployability on different models. We describe these models below.
1) DeepConvNet: DeepConvNet [14] has four convolution-max-pooling blocks, with a special first block split into two layers to better handle the input channels. The first layer performs a convolution over time, and in the second layer each filter performs spatial filtering with learned weights. These are followed by three standard convolution-max-pooling blocks and a dense softmax classification layer.
2) ShallowConvNet: ShallowConvNet [14] is similar to DeepConvNet but has a simplified architecture. The first two layers of the ShallowConvNet perform a temporal convolution and a spatial filter, which is the same as the DeepConvNet.
In contrast, the temporal convolution of the ShallowConvNet has a larger kernel size (25 vs. 10), allowing a more extensive range of transformations in this layer. The temporal convolution and spatial filter of the ShallowConvNet are followed by a squaring nonlinearity, a mean pooling layer, and a logarithmic activation function. In contrast to FBCSP, the ShallowConvNet embeds all computation steps in a single network so that they can be optimized jointly.
3) EEGNet: EEGNet [15] is a compact CNN architecture for EEG-based BCI that (1) can be applied across several different BCI paradigms, (2) can be trained with minimal data, and (3) can produce neurophysiologically interpretable features. EEGNet has three blocks. Block 1 has a temporal convolution to learn frequency filters, followed by a depthwise convolution to learn frequency-specific spatial filters. Block 2 combines depthwise convolutions, which individually learn a temporal summary for each feature map, with a pointwise convolution that learns how to mix the feature maps optimally. Block 3 is a classification block, where features are passed directly to a softmax classification layer with N units, N being the number of categories. Inspired by the work in [25], EEGNet also omits a dense layer for feature aggregation in front of the softmax classification layer to limit the free parameters in the model.

IV. EXPERIMENTS

A. Dataset and Preprocessing

1) BCIC IV Dataset 2a:
The BCI Competition IV dataset 2a involves a four-class MI-based EEG signal recognition task. The dataset consists of MI-based EEG signals from 9 subjects [5]. The cue-based BCI paradigm consists of four different motor imagery tasks, namely left hand (class 1), right hand (class 2), both feet (class 3), and tongue (class 4). For each subject, there were 288 trials (72 trials per class) in the training and testing sessions, recorded on different dates.
A fixation cross appears on the black screen at the beginning of a trial (t = 0 s), together with a short acoustic warning tone. After two seconds (t = 2 s), a cue in the form of an arrow pointing either to the left, right, down, or up (corresponding to left hand, right hand, feet, or tongue) appears and remains on the screen for 1.25 s, instructing the subject to perform the corresponding MI task. EEG signals from 22 electrodes are recorded at a sampling rate of 250 Hz with a bandpass filter between 0.5 Hz and 100 Hz. The amplifier's sensitivity is set to 100 µV, and an additional 50 Hz notch filter is enabled to suppress line noise. More details can be found in [5]. Our experiment uses the 0-4 s MI EEG signals from all 22 electrodes. Following [14] and [15], we apply an additional 4-38 Hz bandpass filter to extract the MI-related frequency band. All data from the training session are used as the training set (144 trials), and the testing session is split in half into a validation set (72 trials) and a testing set (72 trials).
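The per-subject split described above amounts to the following index bookkeeping (a sketch only; loading of the actual GDF recordings, e.g. with MNE, and the bandpass filtering are omitted):

```python
# Per-subject split for BCIC IV dataset 2a, following the text above:
# the whole training session becomes the training set, and the testing
# session is halved into validation and testing sets.
train_session = list(range(144))        # training-session trial indices
test_session = list(range(144, 288))    # testing-session trial indices

train_set = train_session               # 144 trials
val_set = test_session[:72]             # first half of the testing session
test_set = test_session[72:]            # second half

print(len(train_set), len(val_set), len(test_set))  # 144 72 72
```

Keeping the validation and testing halves disjoint is what prevents leakage between model selection and final evaluation.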
2) Lee2019: The second dataset used in this paper was collected by the Department of Brain and Cognitive Engineering, Korea University. In their experiments, 54 healthy subjects perform binary-class MI tasks of the left hand and the right hand. Their EEG signals are recorded using a BrainAmp with 62 Ag/AgCl electrodes at a sampling rate of 1000 Hz. The design of the experiments follows the well-established protocol of Pfurtscheller and Neuper (2001) [26]. Each trial begins with a fixation mark, during which the subject prepares for the trial. Then a left or right arrow appears as a visual cue for 4 seconds, during which the subject performs the MI task of grasping with the appropriate hand. After each task, the screen remains blank for 6 s (±1.5 s). Each subject participates in two sessions with a total of 400 trials. The signals are further down-sampled to 250 Hz by a factor of 4 with an order-8 Chebyshev type-I filter for anti-aliasing. More details on the data and the experimental protocol can be found in Lee et al. [6]. We use all 62 EEG channels in our experiment and apply a 4-38 Hz bandpass filter. The 400 trials are split into a training set (200 trials), a validation set (100 trials), and a testing set (100 trials).

B. Experiment Setup

1) Train-Within:
In this experiment, deep learning models are trained on the training set and tested on the testing set within the same subject and dataset. As a result, there are six train-within experiments across the three deep-learning models and two public datasets. This experiment provides a baseline against which to evaluate the performance of the MDTL strategy.
2) Train-Cross: We applied the MDTL strategy using the three models on the two public datasets, totaling six experiments. For comparison, the Adaptive TL strategy proposed by Zhang et al. in [3] is also applied to both datasets. For BCIC IV dataset 2a, we set the size of the multi-source domain k to 3 and the number of multi-source domains m to 20. For the Lee2019 dataset, we set k = 6 and m = 50. For the selection of feature extractor layers, we refer to the research in [3] and [4]: we regard the convolution layers in front of the last classification layer of the three models as the feature extractor.
We implemented these experiments in PyTorch using a workstation with Intel Xeon Gold 6226R × 2, 512 GB RAM, and NVIDIA Tesla P100 × 8. To avoid the influence of parameter settings on the results, we set the learning rate of all experiments to 0.001, the batch size to 32, and the number of training epochs to 500.
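For reference, the settings above can be gathered into a single configuration object (the numeric values are taken from the text; the key names themselves are our own and purely illustrative):

```python
# Experiment configuration as reported in the text, collected in one place.
MDTL_CONFIG = {
    "bcic_iv_2a": {"k": 3, "m": 20},   # multi-source domain size / count
    "lee2019": {"k": 6, "m": 50},
    "optimizer": {
        "lr": 1e-3,        # shared across all experiments
        "batch_size": 32,
        "epochs": 500,
    },
}
print(MDTL_CONFIG["bcic_iv_2a"])
```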

V. RESULTS AND DISCUSSION

A. Performance Comparison of Train-Within and MDTL
We first evaluate MDTL using different models on BCIC IV dataset 2a and record classification accuracy on each subject and their mean accuracy in TABLE I. We conduct the Wilcoxon signed-rank test and use the p-value to indicate the level of statistical difference.
As shown in TABLE I:
• The accuracy of the MDTL strategy is increased by 5.79% (p = 0.021), 6.64% (p = 0.012), and 11.42% (p = 0.008) over the three traditional deep learning models, respectively. This indicates that MDTL is model-agnostic and significantly improves classification capability without modifying the model architecture.
• ShallowConvNet achieves better classification capability than the other two models, and its improvement via MDTL is also significant.
To verify the effectiveness of our MDTL strategy on another dataset, we evaluate MDTL on Lee2019; the results are shown in TABLE II. As shown in TABLE II, our MDTL strategy outperforms all the traditional deep learning and transfer learning strategies. In addition, we observe that when using traditional deep learning methods, many subjects in the Lee2019 dataset appear 'BCI-illiterate', with testing accuracy between 40% and 60% in binary classification. Interestingly, these 'BCI-illiterate' subjects show high classification accuracy when using the MDTL strategy. This result conflicts with the common belief that 'BCI illiteracy' is physiologically determined and model-independent.
Therefore, we conclude that some subjects are not truly 'BCI-illiterate'; rather, their own training data are insufficient to fit the model. Transfer learning provides more diverse data than a subject's own data alone, and this mixed training data is the main reason for the performance improvement in subjects with poor initial results.

B. Performance Comparison of Other TL Methods and the MDTL Strategy
In this section, we compare MDTL with other transfer learning strategies to measure the improvement. We use the state-of-the-art (SOTA) transfer learning strategy Adaptive TL [3] as a benchmark. In addition, we use two other cross-subject TL methods, MFAR [27] and RA-MDRM [28], for comparison. MFAR preprocesses the EEG signal by aligning the motor imagery trials to their resting-state trials, which reduces the differences among subjects. RA-MDRM affine-transforms the covariance matrices of every subject to center them with respect to a reference covariance matrix, and performs classification with a probabilistic classifier based on a density function. Since MFAR and RA-MDRM have not been applied to the Lee2019 dataset, we only use their results on BCIC IV dataset 2a for comparison. To avoid data contamination, we split each subject's trials into training, validation, and testing sets before the experiments. The training and validation sets are used in the training and transferring periods; the testing set is used only in the testing period. Testing accuracy is the model evaluation criterion.
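The contamination-free, subject-wise split can be sketched as follows; the split ratios are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def split_trials(n_trials, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split one subject's trial indices into train/val/test BEFORE any
    training or transferring, so the test set never leaks into adaptation.

    The 60/20/20 ratios are an assumption for illustration only.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_trials)
    n_tr = int(ratios[0] * n_trials)
    n_va = int(ratios[1] * n_trials)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# 288 trials per subject in BCIC IV dataset 2a
train_idx, val_idx, test_idx = split_trials(288)
```

Because the split is performed once per subject before any model sees the data, the testing trials are excluded from both the training and the transferring periods.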
First, we investigate the accuracy of each method. The results are shown in the train-cross columns of TABLE I and TABLE II, from which we make the following observations:
• The three models with MDTL clearly outperform their counterparts without transfer learning, whose mean accuracies on dataset 2a are 75.69%, 75.31%, and 63.58%, while the newly proposed FBCNet [29] reaches 75.62%. With MDTL, all three models improve greatly and exceed this newly proposed method. The best model with MDTL on BCIC IV dataset 2a is ShallowConvNet; DeepConvNet performs best on the Lee2019 dataset, with 88.1% mean accuracy.
• As shown in TABLE I, ShallowConvNet with MDTL achieves 81.95% mean accuracy on BCIC IV dataset 2a, outperforming MFAR and RA-MDRM by 6.87% and 11.04%, respectively, and the Adaptive TL strategy by 16.9%. On Lee2019, DeepConvNet reaches 88.11% mean accuracy, which is 1.3% higher than the Adaptive TL strategy; the improvement is not as pronounced as on BCIC IV dataset 2a.
The main reason is that the number of subjects in BCIC IV dataset 2a is limited, whereas Lee2019 contains 54 subjects, far more than BCIC IV dataset 2a. When subjects are limited, the unselected source domains introduce a large amount of noisy data, making it difficult for Adaptive TL to improve classification capability.
Second, we compare the computation cost of Adaptive TL and MDTL on different models. To provide a fair experimental setting, we set the number of training epochs in one source domain (multi-source domain) to 500, the size of one multi-source domain to 6, and the number of multi-source domains to 50, and run both strategies on the Lee2019 dataset. Results are shown in TABLE III. We use multiply-accumulate operations (MACs) to represent the computation complexity; one MAC comprises a multiplication and an addition. From the table, we observe that the MDTL strategy reduces the training time of the three models by 93.94%. This indicates that MDTL can bound the training-phase time, thus reducing the preparation time of a BCI system.
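The scale of the saving can be illustrated with a deliberately simplified epoch count. This is an assumption-laden sketch, not the paper's cost model: the reported 93.94% is a measured wall-clock reduction, so the numbers below differ from it:

```python
# Simplified training-run accounting (illustrative assumptions only).
n, m, epochs = 54, 50, 500   # Lee2019 settings from the text

# Traditional TL: the whole training stage is repeated for every target subject.
epochs_traditional = n * m * epochs

# MDTL: the m multi-source models are trained once and shared by all targets.
epochs_mdtl = m * epochs

reduction = 1 - epochs_mdtl / epochs_traditional
print(f"epoch reduction: {reduction:.2%}")
```

Under this idealized accounting the saving is even larger than the measured 93.94%, since real training also includes fixed per-run overheads that do not shrink with the shared models.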

C. Effectiveness Analysis
In this section, we design the following experiments to explore the influence of parameter settings.
1) Effect of Subject Batch Size: To investigate the impact of the size of the multi-source domain in MDTL, we further analyze the accuracy on BCIC IV dataset 2a using the DeepConvNet model with different batch sizes. The experiment was repeated five times for each size, and we report the mean accuracy. Results are plotted in Fig. 6, where the x-axis denotes the size of the multi-source domain and the y-axis denotes the mean accuracy over all subjects. As the size of the multi-source domain increases, the mean accuracy first improves and then degrades, reaching its maximum of 81.95% at a size of 5; accuracy declines as the size moves away from this value in either direction.
We explain that the source model will be under-fitted or over-fitted when the size of the multi-source domain is too small or too large. When the size is too small, there is too little MI data for the multi-source model to extract features and train its parameters, which leads to under-fitting of the source model. The purpose of using a multi-source domain instead of the traditional single-source domain is to extract more accurate MI-related features from a large amount of MI data; when the MI data in the multi-source domain are insufficient, the feature extractor lacks common features. Conversely, when the size of the multi-source domain is too large, it contains multiple subjects with different feature distributions, and the source model over-fits when trained on such data. Moreover, with a large multi-source domain, it is practically impossible to find a group of mutually similar subjects by randomly generating the domain. In our experience, the size of the multi-source domain is best set to about one-third of the total number of subjects; this value varies with the number of MI trials per subject. The ultimate goal is to have sufficient MI data with similar MI-related features in each multi-source domain so that the model converges without under-fitting or over-fitting.
2) Confusion Matrix: To examine MDTL's performance on the different motor imagery tasks, we analyze the classification results using a confusion matrix, evaluated with the DeepConvNet model on BCIC IV dataset 2a. The result is shown in Fig. 7, where the values on the diagonal are the correctly predicted samples. We observe that most mistakes involve the foot and tongue MI tasks.
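The confusion-matrix analysis can be reproduced in miniature; the labels below are hypothetical and merely mimic the foot/tongue confusion pattern reported for Fig. 7:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=4):
    """Rows index the true class, columns the predicted class; the diagonal
    counts correctly predicted samples, as in Fig. 7."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for the four MI classes of BCIC IV dataset 2a
# (0: left hand, 1: right hand, 2: foot, 3: tongue).
y_true = [0, 1, 2, 3, 2, 3, 0, 1]
y_pred = [0, 1, 3, 2, 2, 3, 0, 1]   # the two errors are foot/tongue swaps
cm = confusion_matrix(y_true, y_pred)
```

Off-diagonal mass concentrated in the foot/tongue rows and columns, as in this toy example, is exactly the pattern the text describes for the real results.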

D. Discussion
Early studies have shown that neuronal activity exhibits considerable variability across individuals. This variability, known as the non-stationarity of EEG signals [30], mainly causes the drift in statistical distribution between different subjects [8]; as a result, one subject cannot directly use the EEG data of other subjects. Traditional transfer learning strategies try to exploit EEG data from existing subjects by using source-domain data indiscriminately, which yields little performance improvement. In addition, the training stage must be repeated on different source domains for each target subject, and this unnecessary repetition and high complexity make traditional transfer learning difficult to apply in a BCI system. The proposed MDTL strategy improves on traditional transfer learning by changing one-direction transferring to multi-direction transferring: the first direction is from one source domain to a multi-target domain, and the other is from the previous source domain to the next source domain. This design addresses the shortcomings of the traditional strategy as follows: transferring between the previous and next source domains allows models to fit EEG data through randomly selected subject batches, and transferring from one source domain to a multi-target domain makes it possible to train multiple models in one epoch.
In Section IV, we evaluate the MDTL strategy using different models. From the results, we observe that MDTL significantly improves the classification capability of existing models by utilizing multi-source domain data, without changing the model framework. Moreover, MDTL provides a more efficient, less time-consuming, and upgradeable platform, which enables us to apply transfer learning in an online MI system. Selecting a source domain helps to focus on the domains most suitable for the target subject; however, it raises the computation cost, because testing n subjects on m multi-source domains requires n × m evaluations. In our strategy, we instead test each subject only on the multi-source domains that contain it, since models trained on those domains' data have already been adapted to the target subject in the training phase. This reduces the computation from n × m to k × m, where k, the size of the multi-source domain, is much smaller than n. For example, Lee2019 contains 54 (n) subjects, and the size of the multi-source domain k is set to 6.
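The reduction in testing-phase evaluations follows directly from the numbers in the text:

```python
# Testing-phase evaluation counts for Lee2019 under the settings in the text.
n, m, k = 54, 50, 6

evals_full = n * m   # every subject tested on every multi-source domain
evals_mdtl = k * m   # each size-k domain is only a test bed for its own
                     # k members, giving k * m evaluations in total

print(evals_full, evals_mdtl)   # 2700 vs 300
```

That is a 9× reduction (n / k = 54 / 6) in the number of test-time evaluations for this dataset.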
Before MDTL, traditional transfer learning needed to retrain all models whenever new subjects were added in order to improve classification capability. However, we cannot collect the EEG data of all subjects simultaneously: when new EEG data arrive, the existing models see no performance improvement unless all models of existing subjects are retrained, which is time-consuming. MDTL can update the models of all existing subjects when a new subject is added, meaning all trained models are updated with the expansion of the EEG dataset through a single training epoch, which is far less time-consuming than the traditional strategy.

VI. CONCLUSION
In this study, we present a novel deep transfer learning strategy based on multi-direction transferring. It improves MI classification performance without changing the model architecture. By selectively utilizing data along multiple transferring directions, the proposed MDTL strategy significantly improves a model's ability to fit MI tasks. The first transferring direction, from the source domain to the target domain, reduces the training preparation time. The other direction, from one source domain to another source domain, enables each target subject to selectively utilize EEG data from the specific source domain with the least variance from the target subject's own data. We considered three generic deep learning models and two public MI datasets, and designed a series of experiments to verify that the strategy is reliable and effective in real-time applications.
We then analyze the improvement of the three models under MDTL and compare them with related deep learning algorithms and deep transfer learning strategies. MDTL significantly improves the classification accuracy of the three models, by 5.79%, 6.64%, and 11.42%, respectively, and reduces training time by 93.94% compared with traditional transfer learning. The results show that MDTL achieves its best classification performance when the size of one source domain is near the median of the number of all subjects. In summary, MDTL is a model-agnostic transfer learning strategy; compared with other transfer learning strategies, it achieves higher accuracy at lower computation cost.
MDTL has two main advantages for real-world MI applications. The first is less preparation time: it reduces the time a patient spends collecting training data before using the BCI system, and the shorter training time accelerates the deployment of MI devices. The second is higher prediction accuracy: the equipment can be applied in realistic environments with stricter requirements for accuracy and stability, promoting MI BCI from the laboratory toward clinical application.