Intra- and Inter-Subject Common Spatial Pattern for Reducing Calibration Effort in MI-Based BCI

One major problem limiting the practicality of a brain-computer interface (BCI) is the need for a large amount of labeled data to calibrate its classification model. Although the effectiveness of transfer learning (TL) in overcoming this problem has been evidenced by many studies, a widely recognized approach has not yet been established. In this paper, we propose a Euclidean alignment (EA)-based intra- and inter-subject common spatial pattern (EA-IISCSP) algorithm for estimating four spatial filters, which exploit intra- and inter-subject similarities and variability to enhance the robustness of feature signals. Based on this algorithm, a TL-based classification framework was developed for enhancing the performance of motor imagery (MI) BCIs, in which the feature vector extracted by each filter is dimensionally reduced by linear discriminant analysis (LDA) and a support vector machine (SVM) is used for classification. The performance of the proposed algorithm was evaluated on two MI data sets and compared with that of three state-of-the-art TL algorithms. Experimental results showed that the proposed algorithm significantly outperforms these competing algorithms for 15 to 50 training trials per class and can reduce the amount of training data while maintaining an acceptable accuracy, thus facilitating the practical application of MI-based BCIs.


I. INTRODUCTION
A BRAIN-COMPUTER interface (BCI) is an intelligent system that encodes and decodes brain signals and realizes the interaction between the brain and external devices [1]. The electroencephalogram (EEG), acquired on the scalp, is the most commonly used neurophysiological signal for creating a BCI due to its low cost and high temporal resolution. In the past decades, EEG-based BCIs have been rapidly developed and widely applied in many fields such as neural rehabilitation, control of assistive technologies, entertainment and intraoperative awareness detection [2], [3]. Motor imagery (MI) is one of the most widely used cognitive tasks in the design of BCI systems. MI refers to motor behavior that is only rehearsed in the brain without overt action. During MI, µ-rhythm and β-rhythm (18-26 Hz) brain waves can be recorded over the sensorimotor area of the brain. When a subject imagines a limb movement, the amplitude of the contralateral µ/β rhythm decreases, an electrophysiological phenomenon called event-related desynchronization (ERD) [4]; when the subject finishes the imagined movement, the rise in amplitude of the µ/β rhythm is called event-related synchronization (ERS) [5]. ERD/ERS patterns differ significantly in their spatial distribution over the brain between different MI tasks, a difference that can be measured by various algorithms. Among them, common spatial pattern (CSP) [6], [7] is one of the most effective; it decomposes raw EEG signals into spatial patterns that maximize the difference in band power between two categories, e.g., MI of the left hand and right hand.

A typical BCI system requires a time-consuming calibration stage to collect a sufficient amount of labeled training data from which to extract subject- and task-specific information as features [8]. This limits the practicality of a BCI because it is not only time-consuming but also degrades the user experience.
As a result, significantly reducing training time while maintaining good BCI performance is one of the main research directions [9]. Transfer learning (TL) is a feasible method for solving this problem [10]. TL uses the knowledge learned from the data of source domains (i.e. previous users or previous sessions of the current user) to assist the learning process of the target domain (i.e. the current user) [10], [11], [12], [13], [14], [15], which can greatly improve classification results when samples in the target domain are insufficient. The main hypothesis in TL is that the target and source domains belong to the same feature space. However, high inter-domain variability often violates this hypothesis [16]. Effectively integrating the features of the source domains with those of the target domain is therefore a critical problem for TL.
So far, many CSP-based transfer learning algorithms have been proposed for decreasing the calibration effort. These algorithms mainly fall into two categories: regularization and data alignment. In terms of the former, Fazli et al. [17] achieved cross-subject TL by extracting sparse feature sets and incorporating linear discriminant analysis (LDA). Kang et al. [18] measured the similarity between subjects by KL-divergence to improve the spatial covariance matrix (SCM) by regularization. The premise of the method is that the SCMs of the source subjects are highly similar to those of the target subject. Lu et al. [19] proposed a regularized CSP (RCSP) algorithm that regularizes the SCM estimation with two parameters to lower the estimation variance while reducing the estimation bias. The RCSP shrinks the SCM of the target subject towards an identity matrix and the generic SCM of all source subjects. Lotte et al. [20] developed several regularized CSP (RCSP) algorithms, which utilize data from other subjects to achieve inter-subject transfer by regularizing the SCM or adding a penalty term to the objective function. Arab et al. [21] presented a weighted transfer learning approach in the classification domain. A regularization parameter is added to the objective function of the classifier to make the classification parameters as close as possible to those of previous users whose feature spaces are similar to that of the target user. With respect to the latter, Zanini et al. [22] proposed a Riemannian alignment (RA) method to align the SCMs of different subjects, and the aligned SCMs can be used as features to be directly classified by the minimum distance to Riemannian mean (MDRM) classifier.
He and Wu [23] presented a Euclidean alignment (EA) method, in which EEG trials from different subjects are aligned in Euclidean space to make them more similar, and any Euclidean-space classifier can be used after EA. Arab et al. [24] proposed a dynamic time warping (DTW)-based TL framework that combines two SCMs with the trials from the new subject and the previous subjects respectively. The labeled trials from the previous subjects are temporally aligned to the average of the available trials of the new subject from the same class. Rodrigues et al. [25] proposed a Procrustes analysis-based method for matching the statistical distributions of two data sets using geometrical transformations (translation, scaling and rotation) over the data points. The method handles the statistical variability of EEG signals from different subjects. In summary, all these methods can reduce calibration time substantially while keeping an acceptable classification accuracy.
Recently, Tanaka extended the task-related component analysis (TRCA) [26] algorithm used for SSVEP-based BCIs to group TRCA (gTRCA) [27] by maximizing reproducible components across trials within a subject and a group of subjects. The gTRCA exploits the similarity between a target subject and the source subjects for classifying stimulus frequencies, which offers an alternative to grand averaging. Motivated by this idea, we propose a novel CSP-based spatial filtering algorithm, named EA-based intra- and inter-subject common spatial pattern (EA-IISCSP), and develop a cross-subject TL framework to improve the performance of MI-BCIs and reduce their calibration time. The EA-IISCSP estimates an ensemble spatial filter consisting of four different types of spatial filters, each of which is optimized by CSP with two classes of EEG data from the same domain or from two different domains. Thereby, the EA-IISCSP not only contains subject-specific information, but also incorporates similarities between subjects. The feature vectors extracted by these filters are dimensionally reduced by linear discriminant analysis (LDA) and classified by a support vector machine (SVM).
The proposed EA-IISCSP was evaluated on two publicly available BCI data sets and compared with three state-of-the-art CSP-based TL algorithms with unaligned and aligned data. Nine different-sized groups of training trials were applied for the classification of MI tasks. The results indicated that the proposed algorithm significantly outperforms these competing algorithms for the training groups with 15 to 50 training trials per class. The contributions of this paper are as follows.
1) An intra- and inter-subject CSP algorithm (EA-IISCSP) is proposed for feature extraction. With this algorithm, experience under the same task can be shared across subjects and training data of the same subject can be shared across trials, thereby improving the separability of feature signals;
2) A classification framework was developed based on a feature fusion method, in which the four feature vectors extracted by EA-IISCSP are dimensionally reduced separately by LDA and then concatenated as the feature signal for the recognition of MI tasks;
3) The superior performance of the TL-based classification framework was analyzed intensively and validated with nine different groups of training trials on two data sets comprising a total of 14 subjects.

II. METHODS
This section details the basic principles underlying the IISCSP-based classification framework, including TL, data alignment, IISCSP, feature fusion and pattern classification. Throughout the paper, the two technical terms training trials and training data refer to the trials and the data derived from the target subject respectively.
A. Related Works

1) Transfer Learning (TL): TL is usually designed to cope with the shortage of labeled data in the target domain by transferring labeled data from other (source) domains. A domain may refer to a session, a subject, a task, a device, etc. As described in Pan and Yang [10], a domain D consists of a feature space 𝒳 and a marginal probability distribution P(X), where X ∈ 𝒳. A task T consists of a label space 𝒴 and a predictive function f(X). A source domain D_S and a target domain D_T may have different feature spaces or different marginal probability distributions, i.e. 𝒳_S ≠ 𝒳_T or P_S(X) ≠ P_T(X). Meanwhile, a source task T_S and a target task T_T may have different label spaces. TL aims to improve the learning of the target predictive function f_T(X) by incorporating the knowledge embedded in the N_S source domains. TL approaches used in BCIs can be divided into three categories, i.e. instance transfer, feature representation transfer and parameter transfer [10]. A variant of the second approach is used in this study, because two of the four spatial filters for feature extraction are estimated with EEG data from both the target and the source subjects, as described in the following subsection.
2) Data Alignment (DA): TL is a useful approach to improving classification performance in BCIs by exploiting labeled data from auxiliary subjects performing similar tasks. Directly transferring instances, features or parameters derived from the original data, however, is problematic due to individual differences. Thereby, prior to performing TL, it is necessary to preprocess the data in a proper way. Data alignment (DA) is such a method, which can substantially enhance the similarity between subjects.
Recently, researchers have proposed a variety of DA methods, the two best-known of which are Riemannian alignment (RA) [22] and Euclidean alignment (EA) [23]. The former aligns the covariance matrices of EEG signals to a common reference in Riemannian space, using the geometric mean of the SCMs from resting-state data as the reference matrix. The latter aligns the EEG trials to a common reference in Euclidean space, using the arithmetic mean of the SCMs from task-state data. Since EA is faster to compute than RA and is more suitable for use in Euclidean space, EA is adopted in this study.
Assume that a single-trial band-pass filtered EEG signal is denoted by a matrix X_t ∈ R^(N_c×N_s), t = 1, 2, ..., N_t, where N_c, N_s and N_t are the number of channels, samples and trials of a subject respectively. The SCM of signal X_t is estimated by

P_t = X_t X_t^T,  (1)

where the superscript T denotes the transpose operation. The averaged SCM across all trials from the two classes is

P̄ = (1/N_t) Σ_{t=1}^{N_t} P_t.  (2)

Note that EA does not use any data labels, even if the labels of the training data are known. The mean SCM is used as the reference matrix to align a single-trial EEG signal X_t as

X̃_t = P̄^{-1/2} X_t.  (3)

After EA, the averaged SCM across aligned trials equals the identity matrix [23], which means that all aligned SCMs of a subject are transformed to lie close to the identity matrix. Before EA, the SCMs from two subjects usually do not overlap at all; after EA, the two groups of SCMs have the same mean but different variances. Thereby, the SCM distributions of different subjects are more similar after EA than before. Since most algorithms for feature extraction and pattern classification are based on SCM estimation, DA is instrumental for transferring data from one domain to another.
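As a concrete illustration, the alignment steps above can be sketched in NumPy; the toy data shapes and the random trials are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def euclidean_align(trials):
    """Sketch of Euclidean alignment.

    trials: array of shape (n_trials, n_channels, n_samples).
    Returns the aligned trials and the whitening matrix P̄^{-1/2}.
    """
    # Per-trial SCMs P_t = X_t X_t^T, then their arithmetic mean (Eqs. 1-2)
    covs = np.einsum('tcs,tds->tcd', trials, trials)
    R = covs.mean(axis=0)
    # Inverse square root of the reference matrix via eigendecomposition
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # Align each trial: X̃_t = P̄^{-1/2} X_t (Eq. 3)
    aligned = np.einsum('cd,tds->tcs', R_inv_sqrt, trials)
    return aligned, R_inv_sqrt

# After EA, the mean SCM of the aligned trials equals the identity matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8, 100))
Xa, _ = euclidean_align(X)
mean_scm = np.einsum('tcs,tds->cd', Xa, Xa) / Xa.shape[0]
print(np.allclose(mean_scm, np.eye(8), atol=1e-6))  # True
```

The returned whitening matrix is exactly the reference that, per the next paragraph, would be reused to align single-trial testing data online.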
To make EA suitable for online applications, the EEG data from the source subjects and the target subject are aligned in different ways. The EEG data from the source subjects are aligned per subject. Since the EEG data of the target subject are divided into a training set and a testing set, the former is aligned as a whole, and its reference matrix is then used to align each single-trial testing sample. For convenience, the following discussion is based on unaligned data.
3) Common Spatial Patterns (CSP): CSP is a spatial filtering and feature extraction algorithm for binary classification tasks and is considered one of the most popular and effective algorithms in BCI design [6], [7]. CSP aims to learn a spatial filter that maximizes the discriminability of two-class EEG signals. Its basic principle is to use matrix diagonalization to find an optimal spatial filter for projection, so as to maximize the difference between the variances of the two classes of signals and thereby obtain feature vectors with higher separability.
Assume that there are two classes of EEG signals evoked by two mental tasks, e.g. MI of the left hand and right hand. Let X_i ∈ R^(N_c×N_s), i = 1, 2 denote a single-trial band-pass filtered EEG signal from class i, and P̄_i the SCM averaged across trials of class i. The CSP filter w is obtained by extremizing the following objective function

J(w) = (w^T P̄_1 w) / (w^T P̄_2 w).  (4)

Given a single-trial band-pass filtered EEG signal X_t ∈ R^(N_c×N_s), t = 1, 2, ..., N_t, feature extraction is performed by filtering the EEG signal with the filter w and then calculating the variance of the resulting signal

f = log( var(w^T X_t) ),  (5)

where the logarithmic operation makes the distribution of the feature signals more normal.
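A minimal NumPy sketch of CSP filter estimation and log-variance feature extraction; the trace normalization of the SCMs and the toy channel-power data are assumptions for illustration:

```python
import numpy as np

def csp_filters(trials1, trials2, m=2):
    """Sketch of CSP: returns 2m spatial filters as columns.
    trials*: (n_trials, n_channels, n_samples), band-pass filtered."""
    def mean_cov(trials):
        covs = np.einsum('tcs,tds->tcd', trials, trials)
        covs /= np.trace(covs, axis1=1, axis2=2)[:, None, None]  # normalize
        return covs.mean(axis=0)
    P1, P2 = mean_cov(trials1), mean_cov(trials2)
    # Eigenvectors of P2^{-1} P1: the largest / smallest eigenvalues give
    # the directions with the most extreme variance ratio between classes.
    vals, vecs = np.linalg.eig(np.linalg.solve(P2, P1))
    order = np.argsort(vals.real)
    return vecs[:, np.r_[order[-m:], order[:m]]].real

def log_var_features(W, trial):
    """Eq.-(5)-style features: log variance of the filtered signal."""
    return np.log((W.T @ trial).var(axis=1))

rng = np.random.default_rng(1)
def make(ch):  # toy data: one channel carries the class-specific power
    X = rng.standard_normal((30, 4, 200))
    X[:, ch, :] *= 3.0
    return X
A, B = make(0), make(1)
W = csp_filters(A, B, m=1)                # 2 filters for m = 1
fA = np.array([log_var_features(W, x) for x in A])
fB = np.array([log_var_features(W, x) for x in B])
print(fA[:, 0].mean() > fB[:, 0].mean())  # True: filter 0 favors class A
```

The same `log_var_features` step is the one reused later for each of the four base filters.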

B. Intra- and Inter-Subject CSP (IISCSP)
Conventional CSP aims to extract task-related knowledge from the EEG data of a single subject, and thus can be used to estimate one spatial filter. As an extension of CSP, IISCSP extracts task-related knowledge from the EEG data of a target subject, of several source subjects, and of both, and thus can be used to estimate multiple filters. IISCSP comprises two types of CSP algorithms, namely intra-subject CSP and inter-subject CSP, which are detailed as follows.
1) Intra-Subject CSP: The goal of intra-subject CSP is to find two spatial filters that maximize the difference in variances between the two classes of signals from a single subject and from a set of subjects respectively. Formally, the objective function of intra-subject CSP is the same as that of conventional CSP, formulated by Eq. (4). For the target subject, the objective function of the intra-subject CSP is written as

J_1(w) = (w^T P̄_t1 w) / (w^T P̄_t2 w),  (6)

where P̄_t1 and P̄_t2 are the SCMs averaged across training trials from class 1 and class 2 respectively. For the set of source subjects, the objective function of the intra-subject CSP is written as

J_2(w) = (w^T P̄_s1 w) / (w^T P̄_s2 w),  (7)

where P̄_s1 and P̄_s2 are the SCMs averaged across all trials of all source subjects from class 1 and class 2 respectively. The two filters w_1 and w_2 are acquired by maximizing Eq. (6) and Eq. (7) respectively, taking the eigenvectors of M_1 = P̄_t2^{-1} P̄_t1 and M_2 = P̄_s2^{-1} P̄_s1 corresponding to the m largest and m smallest eigenvalues.
2) Inter-Subject CSP: The goal of inter-subject CSP is to find two spatial filters that maximize the difference in variances between the two classes of signals derived respectively from the target subject and from the set of source subjects. Accordingly, there are two objective functions for the inter-subject CSP, formulated respectively as

J_3(w) = (w^T P̄_t1 w) / (w^T P̄_s2 w),  (8)

J_4(w) = (w^T P̄_s1 w) / (w^T P̄_t2 w).  (9)

The two filters w_3 and w_4 are acquired by maximizing Eq. (8) and Eq. (9) respectively, taking the eigenvectors of M_3 = P̄_s2^{-1} P̄_t1 and M_4 = P̄_t2^{-1} P̄_s1 corresponding to the m largest and m smallest eigenvalues.
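Given the definitions of M_1-M_4, the four base filters can be sketched as follows; the `pair_filters` helper and the random SPD test matrices are hypothetical illustrations, not part of the paper:

```python
import numpy as np

def pair_filters(Pnum, Pden, m=2):
    """Eigenvectors of Pden^{-1} Pnum for the m largest / m smallest
    eigenvalues, as in the construction of M1-M4."""
    vals, vecs = np.linalg.eig(np.linalg.solve(Pden, Pnum))
    order = np.argsort(vals.real)
    return vecs[:, np.r_[order[-m:], order[:m]]].real

def iiscsp_filters(Pt1, Pt2, Ps1, Ps2, m=2):
    """Sketch of the four IISCSP base filters.
    Pt*/Ps*: class-wise mean SCMs of the target / pooled source subjects."""
    return [
        pair_filters(Pt1, Pt2, m),  # w1: intra-subject, target   (Eq. 6)
        pair_filters(Ps1, Ps2, m),  # w2: intra-subject, sources  (Eq. 7)
        pair_filters(Pt1, Ps2, m),  # w3: inter-subject           (Eq. 8)
        pair_filters(Ps1, Pt2, m),  # w4: inter-subject           (Eq. 9)
    ]

rng = np.random.default_rng(2)
def spd(n=6):  # random symmetric positive-definite stand-in for an SCM
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

filters = iiscsp_filters(spd(), spd(), spd(), spd(), m=2)
print([W.shape for W in filters])  # [(6, 4), (6, 4), (6, 4), (6, 4)]
```

With m = 2 each base filter has 2m = 4 columns, matching the 4-dimensional feature vectors used below.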
C. Feature Fusion and Pattern Recognition

1) Feature Extraction: Given a single-trial band-pass filtered EEG signal X_t ∈ R^(N_c×N_s), t = 1, 2, ..., N_t from the target subject, four feature vectors can be extracted by filtering the signal with the four base filters w_n as follows

f_n = log( var(w_n^T X_t) ),  (10)

where f_n ∈ R^(4×1), n = 1, 2, 3, 4 is a 4-dimensional feature vector in the case m = 2, since each w_n comprises 2m = 4 spatial filters.
Since the four base filters are derived from different subject domains, the feature vectors extracted by them are heterogeneous. Thereby, a suitable method is needed to fuse these feature vectors. The simplest fusion method is to concatenate the four feature vectors into one 16-dimensional feature vector. However, such a dimensionality easily leads to overfitting when classifying EEG signals with a small number of training samples. Therefore, we do not directly concatenate the four feature vectors. Instead, we first reduce their dimension from four to one by linear discriminant analysis (LDA) [28] and then concatenate the four one-dimensional feature scalars into a four-dimensional feature vector.
2) Dimension Reduction: LDA is not only a popular classification algorithm, but also a good method for dimension reduction. It aims to project sample data from a high-dimensional space onto a coordinate axis, so that on this axis the sample data from the same class are more concentrated and the data from different classes are more scattered. To achieve this goal, LDA maximizes the difference of the sample means between the two classes while minimizing the total within-class scatter of the projected data. Each of the four feature vectors is reduced to a feature score according to

s_n = a_n^T f_n + b_n,  (11)

where the normal vector a_n and intercept b_n are calculated using training data as

a_n = C_n^{-1} (f̄_n1 − f̄_n2), b_n = −a_n^T (f̄_n1 + f̄_n2) / 2,  (12)

where f̄_n1 and f̄_n2 are the mean feature vectors from class 1 and class 2 respectively, and C_n is the composite covariance matrix of the two classes.
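A sketch of the Eq. (11)-(12) reduction; the pooled covariance estimate and the synthetic 4-D features are assumptions for illustration:

```python
import numpy as np

def lda_scalar(F1, F2):
    """Fisher LDA projection of feature vectors to one score.
    F1, F2: (n_trials, d) training features per class. Returns (a, b)
    so that s = a @ f + b is positive-leaning for class 1."""
    m1, m2 = F1.mean(axis=0), F2.mean(axis=0)
    # composite (pooled) covariance matrix of the two classes
    C = np.cov(F1.T) * (len(F1) - 1) + np.cov(F2.T) * (len(F2) - 1)
    C /= len(F1) + len(F2) - 2
    a = np.linalg.solve(C, m1 - m2)       # a_n = C_n^{-1}(f̄_n1 - f̄_n2)
    b = -a @ (m1 + m2) / 2                # midpoint intercept
    return a, b

rng = np.random.default_rng(3)
F1 = rng.standard_normal((50, 4)) + 1.0   # toy class-1 features
F2 = rng.standard_normal((50, 4)) - 1.0   # toy class-2 features
a, b = lda_scalar(F1, F2)
s1, s2 = F1 @ a + b, F2 @ a + b
print(s1.mean() > 0 > s2.mean())          # True: scores separate the classes
```

Each of the four 4-dimensional feature vectors is reduced this way to one scalar score before concatenation.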
An ensemble feature vector used for classification is obtained by vertically concatenating the four feature scores, f_e = [s_1; s_2; s_3; s_4]. This low-dimensional feature vector not only avoids overfitting, but also incorporates both subject-specific and cross-subject information. As a result, the feature fusion method promises to improve TL performance.
3) Pattern Classification: As one of the most commonly used classifiers, the support vector machine (SVM) [29] is widely used in the field of BCIs. In this study, we use LIBSVM [30] as the classification tool. The principle is to separate the data x ∈ R^d from two classes by finding a weight vector ŵ ∈ R^d and an offset b̂ of a hyperplane

ŵ^T x + b̂ = 0  (13)

with the largest possible margin. In cases where the data are not fully separable, a variant of the algorithm solves the following optimization problem

min_{w,b,ξ} (1/2) ∥w∥^2 + c Σ_i ξ_i, s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0,  (14)

where c, ξ_i, y_i and x_i are the penalty parameter of the error term, the slack variable, the class label and the feature vector of the i-th training sample respectively. The radial basis function (RBF) is used as the kernel function. All parameters are left at their default values except for the γ parameter of the RBF, which is set to 0.01.

D. The Classification Framework
The flowchart of the proposed classification framework is shown in Fig. 1, which includes the following procedures.
1) The original continuous EEG data from each subject are preprocessed by segmenting them into single-trial signals and band-pass filtering the EEG trials in the specified frequency band;
2) The subjects in a data set are divided into a target subject and source subjects by a leave-one-subject-out (LOO) cross-validation strategy, i.e. each time one subject is selected as the target subject and the remaining subjects serve as source subjects;
3) For the source subjects, EA is applied to the EEG trials of each subject separately. For the target subject, the EEG trials are divided into a training set and a testing set; EA is applied to the trials of the training set, and the resulting reference matrix is used to align the trials of the testing set;
4) An ensemble spatial filter comprising four base filters is obtained by IISCSP from the averaged SCMs across trials of the target subject's training set and of all source subjects;
5) Four feature vectors are extracted by the four base filters for both the training set and the testing set, dimensionally reduced to one by LDA, and then concatenated into an ensemble feature vector;
6) The ensemble feature vectors from the training set are employed for training an SVM classifier, whereas those from the testing set are classified by the trained SVM model.
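The six steps can be sketched end-to-end on synthetic data. For brevity this sketch assumes m = 1, toy two-class data whose class power sits on different channels, and a nearest-class-mean rule in place of the paper's SVM; none of these stand-ins are part of the original framework:

```python
import numpy as np

rng = np.random.default_rng(4)

def scm_mean(X):                         # averaged SCM across trials
    return np.einsum('tcs,tds->cd', X, X) / len(X)

def ea_ref(X):                           # EA reference: inverse sqrt of mean SCM
    v, U = np.linalg.eigh(scm_mean(X))
    return U @ np.diag(v ** -0.5) @ U.T

def apply_ref(W, X):                     # align trials with a reference matrix
    return np.einsum('cd,tds->tcs', W, X)

def csp(P1, P2, m=1):                    # eigvecs of P2^{-1} P1, extreme eigvals
    v, V = np.linalg.eig(np.linalg.solve(P2, P1))
    o = np.argsort(v.real)
    return V[:, np.r_[o[-m:], o[:m]]].real

def logvar(W, X):                        # log-variance features after filtering
    return np.log(np.einsum('ck,tcs->tks', W, X).var(axis=2))

def lda(F1, F2):                         # LDA scores s = a.f + b
    m1, m2 = F1.mean(0), F2.mean(0)
    a = np.linalg.solve((np.cov(F1.T) + np.cov(F2.T)) / 2, m1 - m2)
    return a, -a @ (m1 + m2) / 2

def toy_subject(gain):                   # synthetic subject: class power on ch 0 / ch 1
    def cls(ch):
        X = rng.standard_normal((40, 4, 100)) * gain
        X[:, ch] *= 3.0
        return X
    return cls(0), cls(1)

# steps 2-3: sources aligned per subject; target reference from training set only
S1, S2 = [], []
for a, b in (toy_subject(1.2), toy_subject(0.8)):
    W = ea_ref(np.concatenate([a, b]))
    S1.append(apply_ref(W, a)); S2.append(apply_ref(W, b))
S1, S2 = np.concatenate(S1), np.concatenate(S2)

t1, t2 = toy_subject(1.0)
tr1, te1, tr2, te2 = t1[:28], t1[28:], t2[:28], t2[28:]
W = ea_ref(np.concatenate([tr1, tr2]))
tr1, tr2, te1, te2 = (apply_ref(W, X) for X in (tr1, tr2, te1, te2))

# step 4: the four base filters (m = 1 here; the paper uses m = 2)
Pt1, Pt2, Ps1, Ps2 = map(scm_mean, (tr1, tr2, S1, S2))
bases = [csp(Pt1, Pt2), csp(Ps1, Ps2), csp(Pt1, Ps2), csp(Ps1, Pt2)]

# step 5: per-filter LDA scores concatenated into the ensemble feature vector
ldas = [lda(logvar(Wb, tr1), logvar(Wb, tr2)) for Wb in bases]
def ensemble(X):
    return np.stack([logvar(Wb, X) @ a + b
                     for Wb, (a, b) in zip(bases, ldas)], axis=1)

# step 6: nearest-class-mean rule as a stand-in for the SVM
c1, c2 = ensemble(tr1).mean(0), ensemble(tr2).mean(0)
def predict_class1(X):
    E = ensemble(X)
    return np.linalg.norm(E - c1, axis=1) < np.linalg.norm(E - c2, axis=1)

acc = (predict_class1(te1).mean() + (~predict_class1(te2)).mean()) / 2
print(acc >= 0.75)  # True on this separable toy problem
```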

III. EXPERIMENTAL DATA
The proposed EA-IISCSP algorithm was evaluated using two MI-based BCI data sets, namely data set IIa from BCI competition IV [31] and data set IVa from BCI competition III [32]. The two data sets and their preprocessing methods are described as follows.
A. Data Sets

1) Data Set 1: The data set contains EEG data from 9 subjects (named A01-A09), each of whom performed four classes of mental tasks: MI of the left hand, right hand, tongue and foot. As shown in Fig. 2(a), each trial started with a fixation cross on the screen and a short warning tone (beep). At 2 s, an arrow pointing left, right, up or down (corresponding to one of the four MI tasks) appeared and stayed on the screen for 1.25 s. The arrow cue prompted the subject to perform the specified MI task until 6 s. The trial ended after a short break of 1.5-2.5 s. EEG data were recorded at a sampling rate of 250 Hz using 22 electrodes placed according to the international 10/20 system. The experiment included a training session and an evaluation session conducted on two different days. Each subject performed 72 trials per class in each session. Only the two MI tasks of the left and right hand were used in this study.
2) Data Set 2: The data set comprises EEG data from 5 subjects (named aa, al, av, aw and ay), who performed three mental tasks, i.e. MI of the left hand, right hand and foot. As shown in Fig. 2(b), each trial started with a visual cue in the form of the letter L, R or F, corresponding to one of the three MI tasks. The cue lasted for 3.5 s, during which the subject was required to perform the cued task. The trial ended after a relaxation period of 1.75-2.25 s. EEG data were recorded at a sampling rate of 1000 Hz using 118 electrodes placed according to the international 10/20 system. A version down-sampled to 100 Hz is used in this study. Only the cues for 'right' and 'foot' were used for the competition. Each subject performed 280 trials, 140 trials per class.

B. Data Preprocessing
The original continuous EEG data were segmented into single-trial data with a time window of 3 s and 2 s for data sets 1 and 2 respectively. The time window started 0.5 s after the visual cue to avoid the influence of the cue on brain signals. Specifically, the data segments 2.5-5.5 s and 0.5-2.5 s were extracted for data sets 1 and 2 respectively. The single-trial data were then band-pass filtered between 8 Hz and 30 Hz to retain EEG signals including both the µ and β rhythms, which are closely related to the MI tasks. The filtering was performed with a 5th-order Butterworth infinite impulse response (IIR) digital filter.
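The segmentation step might be sketched as follows; the continuous-recording layout and cue positions are hypothetical, and the subsequent Butterworth band-pass filtering is omitted here:

```python
import numpy as np

def segment(eeg, cue_samples, fs, t0=0.5, dur=3.0):
    """Cut single trials from a continuous EEG array (channels x samples).
    Each window starts t0 seconds after its cue, as in the paper's
    2.5-5.5 s windows for data set 1 (cue at 2 s, fs = 250 Hz).
    Band-pass filtering (8-30 Hz) would follow this step."""
    start = cue_samples + int(t0 * fs)
    n = int(dur * fs)
    return np.stack([eeg[:, s:s + n] for s in start])

fs = 250
eeg = np.random.default_rng(5).standard_normal((22, 60 * fs))  # 60 s, 22 channels
cues = np.array([2, 10, 20, 30]) * fs       # hypothetical cue onsets in samples
trials = segment(eeg, cues, fs)
print(trials.shape)  # (4, 22, 750)
```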
Since data set 2 contains a large number of channels, we manually selected the 33 channels closely related to MI, {FC1-6, FCz, CFC1-6, C1-6, Cz, CCP1-6, CP1-6, CPz}, according to neurophysiological prior knowledge, in order to avoid overfitting of the CSP algorithms. For data set 1, the two sessions (T and E) were combined, so each subject had a total of 288 trials. For each subject in either data set, the first 70% of all trials per class were used as the training set and the remaining trials as the testing set. Table I shows the data structure of the two data sets after preprocessing.

IV. RESULTS
To investigate the performance of IISCSP and EA-IISCSP with different amounts of training data, we divided the training set of the target subject into nine different groups of training trials, from 10 to 50 trials per class at intervals of 5 trials. To avoid selection bias, the EEG trials of each training group were randomly drawn from the training set and the procedure was repeated 10 times, leading to 10 classification experiments. The classification accuracy of each target subject was the average over the 10 experiments. For each source subject, all trials of the training set were used for estimating SCMs, which were averaged across trials per class. The mean SCMs from all source subjects were further averaged per class.
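This sampling protocol can be sketched as follows; the function name and the training-set size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

def training_groups(n_train, sizes=range(10, 51, 5), repeats=10):
    """Index sets for the evaluation protocol: for each group size k
    (10, 15, ..., 50 trials per class), draw k trial indices at random
    from the training set, repeated 10 times."""
    return {k: [rng.choice(n_train, size=k, replace=False)
                for _ in range(repeats)]
            for k in sizes}

groups = training_groups(n_train=100)
print(len(groups), len(groups[15]), len(groups[15][0]))  # 9 10 15
```

Averaging accuracy over the 10 repeats per group size yields the per-subject accuracies reported below.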
Based on aligned and unaligned data, we compared the proposed algorithm with the four different types of CSP algorithms and with the following three state-of-the-art TL algorithms in terms of topography, classification accuracy, feature distribution and computational complexity.
1) CCSP1 and EA-CCSP1. Kang et al. [18] proposed two composite CSP algorithms (CCSP1 and CCSP2) for TL, which are based on different methods of weighting the source subjects' SCMs. Since the two algorithms differ little in accuracy, we adopted CCSP1 as a comparison algorithm.
3) EA-CSP. He and Wu [23] proposed a Euclidean geometry-based data alignment (EA) method and combined EA with CSP (EA-CSP) for TL, which achieved higher accuracy than RA.
In [33], He and Wu reviewed six TL-based enhanced CSP algorithms (BL1-3, CM1-2 and MA) and proposed a new instance-based algorithm (IA). The first two and the fourth are the same as CSP1, CSP2 and CCSP1 in this study respectively, whereas the third is similar to EA-CSP with an additional EA step.
The classifiers and their parameters used in the above baseline algorithms are the same as those employed in the proposed classification framework, i.e. an SVM with the parameters chosen above. For the baseline algorithms and the proposed algorithm, the methods of applying EA are also the same, as detailed in Section II-A.2) Data Alignment (DA).
A. The Proposed Algorithm

1) The Ensemble Filter and Its Base Filters: Fig. 3 illustrates the distribution of the four base filters obtained from aligned EEG data of subject A08 from data set 1 and subject ay from data set 2. Fig. 3(a) shows that for the MI of the right hand, the coefficients of the four filters are localized over the left hemisphere, whereas for the MI of the left hand, they are localized over the right hemisphere. Fig. 3(b) shows that for the MI of the foot, the coefficients of the four filters are approximately localized over the central area of the brain, albeit somewhat divergent, whereas for the MI of the right hand, they are localized over the left hemisphere. The distributions of the four base filters yielded by the other subjects in data sets 1 and 2 are similar to Fig. 3(a) and (b) respectively. These observations are in accordance with the cortical regions related to the MI of human limbs. Thereby, all four filters are useful for discriminating two different MI tasks, and combining them in an intelligent way is expected to improve feature extraction for MI EEG signals.
2) Classification Performance: Fig. 4 depicts the accuracy averaged across subjects for the ensemble filter and its four base filters on each of the two data sets. The figure reveals that, with either aligned or unaligned data, the accuracies yielded by the four base filters are much higher than the chance level (50%) for binary classification, and thus all four are discriminative. The accuracy of the ensemble filter is higher than those of its base filters for all groups of training trials except 10 trials per class in the case of aligned data, and large gaps in accuracy exist for training groups larger than 15 trials per class. Thus, combining the four base filters is effective for feature extraction and the subsequent classification of MI tasks. Table II reports the p-values of the statistical significance analysis of the accuracy averaged across subjects from the two data sets (corresponding to the third column of Fig. 4), obtained by paired t-tests at nine groups of training trials between each of the six unaligned CSP algorithms and IISCSP, and between each of the six aligned CSP algorithms and EA-IISCSP.
3) Dimension Reduction: Fig. 5 shows the accuracies of the proposed algorithms averaged across subjects from data set 1 and data set 2, with dimension reduction (IISCSP and EA-IISCSP) and without dimension reduction (IISCSP\DR and EA-IISCSP\DR). The figure reveals that for data set 1, the accuracies yielded by EA-IISCSP and IISCSP are much higher than those yielded by EA-IISCSP\DR and IISCSP\DR respectively for all groups of training trials except 10 trials per class, whereas for data set 2, the accuracies yielded by EA-IISCSP and IISCSP are higher than or equal to those yielded by EA-IISCSP\DR and IISCSP\DR respectively for all groups except 10 trials per class. These results verify the necessity and usefulness of dimension reduction of the feature signal used for classification.

4) Ablation Experiments: Fig. 6 shows the accuracies of the proposed algorithm averaged across subjects from data set 1 and data set 2, together with the modified algorithms obtained by removing either CSP3 or CSP4 from the proposed algorithms. The figure shows that for each of the two data sets, removing either CSP3 or CSP4 from IISCSP and EA-IISCSP decreases accuracies considerably for all groups of training trials except 10 trials per class. For data set 1, CSP4 plays a more important role than CSP3 in IISCSP for all groups of training trials, and a less important role than CSP3 in EA-IISCSP for most groups; for data set 2, the performance difference between CSP3 and CSP4 in both IISCSP and EA-IISCSP is small.

B. The Proposed and Comparison Algorithms
1) Accuracy: The accuracy averaged across subjects from the individual data sets and from both data sets combined, yielded by the seven CSP algorithms, is illustrated in Fig. 7 and Fig. 8 respectively. From Fig. 7, it can be seen that for either data set, the accuracies of these algorithms tend to increase with the number of training trials, but the increase slows down or even reverses for some algorithms after a certain number of training trials. The accuracies of the proposed algorithm are higher than those of all other algorithms beyond 10 training trials per class for data set 1 and beyond 20 training trials per class for data set 2. From Fig. 8, it can be seen that among the seven algorithms, EA-IISCSP, IISCSP and EA-CSP outperform the other four algorithms by large accuracy gaps at all groups of training trials. EA-IISCSP and IISCSP achieved the highest and the second highest accuracies beyond 10 and 15 training trials per class respectively. EA-CSP yielded the highest accuracy at 10 training trials per class and approximately the same accuracy as IISCSP at 15 training trials. Moreover, the accuracy gaps between EA-IISCSP/IISCSP and EA-CSP increase with the number of training trials. In terms of calibration time, EA-CSP required 25 trials per class to reach an accuracy of 75%, whereas EA-IISCSP required only 15 trials per class, reducing the calibration time by two fifths. Taking 30 training trials per class as an example, the classification accuracies achieved by each subject from the two data sets are listed in Table III.
For data set 1, EA-IISCSP and IISCSP achieved the best accuracy on four and three of the nine subjects respectively, while EA-CSP and CCSP1 each obtained the best accuracy on one subject. On average, EA-IISCSP achieved the best accuracy of 76.91%, which is 2.57% higher than the accuracy yielded by EA-CSP, the best of the first five algorithms. For data set 2, EA-IISCSP and IISCSP achieved the best accuracy on three and two of the five subjects respectively. On average, EA-IISCSP achieved the best accuracy of 81.91%, which is 2.47% higher than the accuracy yielded by EA-CSP. Statistical significance analysis was performed between each of the first six algorithms and the last algorithm by paired t-tests at the 95% confidence level, over all fourteen subjects from the two data sets. The p-values listed in the last row indicate that the accuracy of EA-IISCSP is significantly better than those of the first five algorithms and that there is no significant difference between EA-IISCSP and IISCSP. Using the last 60% of the total trials from a subject as the testing set, the classification accuracies on the two data sets are listed in Table IV. Although the accuracy yielded by each of the seven algorithms changes only slightly compared with the results in Table III, the proposed algorithm is still significantly better than the comparison algorithms.

TABLE III: CLASSIFICATION ACCURACY (%) OF EACH SUBJECT ON THE TWO DATA SETS YIELDED BY THE SEVEN CSP-BASED TRANSFER LEARNING ALGORITHMS AT 30 TRAINING TRIALS PER CLASS. AS A COMPARISON, THE CONVENTIONAL CSP ALGORITHM (CSP) IS ALSO LISTED IN THE TABLE. THE BEST RESULT FOR EACH SUBJECT IS HIGHLIGHTED IN BOLD. THE LAST ROW DENOTES THE P-VALUE OF STATISTICAL SIGNIFICANCE ANALYSIS BETWEEN EACH OF THE FIRST SIX ALGORITHMS AND THE LAST ALGORITHM, OBTAINED BY PAIRED T-TESTS WITH THE CONFIDENCE LEVEL SET TO 95%.
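The paired t-test used for this significance analysis can be reproduced with SciPy; the per-subject accuracies below are synthetic placeholders, not the paper's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic per-subject accuracies (%) for 14 subjects; NOT the paper's results.
base = rng.uniform(65, 85, size=14)           # e.g. a competing algorithm
acc_a = base + rng.normal(3.0, 1.0, size=14)  # e.g. EA-IISCSP, slightly better
acc_b = base

# Paired t-test: the same subjects are evaluated under both algorithms,
# so the test operates on per-subject accuracy differences.
t, p = stats.ttest_rel(acc_a, acc_b)
print(p < 0.05)  # significant at the 95% confidence level
```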
2) Feature Distribution: To further explore the performance of the seven algorithms, the SVM classification score is adopted to visualize their differences in feature distribution. Taking 30 training trials per class as an example, Fig. 9 illustrates the distributions of training and testing scores for subject A01 from data set 1 and subject ay from data set 2. For subject A01, the overwhelming majority of the two-class scores yielded by all of these algorithms are clearly separated, for both the training and testing sets, but those yielded by IISCSP and EA-IISCSP lie further from their classification line; for subject ay, the two-class scores yielded by the first five algorithms are close to their classification lines, whereas those yielded by IISCSP and EA-IISCSP are well separated, with larger intervals between the two classes of trials. The greater the distance between the two classes of scores, the easier their classification. Therefore, the feature distributions of these algorithms further verify the superiority of the proposed algorithm.
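The SVM classification score used here is the signed distance of a trial from the decision boundary, which scikit-learn exposes as `decision_function`; a minimal sketch with synthetic features (not the paper's EEG features):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic classes of 4-D feature vectors, 40 trials each.
X = np.vstack([rng.normal(-1.0, 1.0, (40, 4)),
               rng.normal(1.0, 1.0, (40, 4))])
y = np.repeat([0, 1], 40)

clf = SVC(kernel="linear").fit(X, y)
# Signed distance of each trial from the separating hyperplane; trials with
# larger |score| lie farther from the classification line.
scores = clf.decision_function(X)
print(scores.shape)  # (80,)
```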
3) Running Time: In terms of computational complexity, the classification process of the seven algorithms includes a training phase and a testing phase. In the training phase, these algorithms perform band-pass filtering of the EEG data, estimation of the CSP filters, feature extraction and optimization of the classifier models, plus the additional procedures of EA and dimensionality reduction where applicable. In the testing phase, they perform band-pass filtering, feature extraction and classification of a single trial, again with EA and dimensionality reduction where applicable. Compared with the training time, the testing time of a single trial is less than 0.1 s and thus negligible. The evaluation of computational complexity was conducted on a desktop computer with an Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz, 8.00 GB RAM and a 64-bit OS. Taking 50 training trials per class as an example, Fig. 10 illustrates the training time of these algorithms on the two data sets. The figure reveals that, for each algorithm, the training time on data set 1 is much longer than that on data set 2. Among the seven algorithms, the first two took the longest to train on either data set, whereas the last three took the least time. The training time of the proposed algorithm is less than 2 s and 1 s for data sets 1 and 2 respectively, which meets the requirement of real-time computation in online use.

4) Multi-Task Classification: In the above subsections, we analyzed the performance of the proposed algorithm in binary classification. To evaluate its performance in multi-task classification, we applied the two proposed algorithms and the five comparison algorithms to data set 1, which contains EEG data from four mental tasks, namely MI of the left hand, right hand, foot and tongue.
A voting strategy based on one-versus-one classification was employed to implement the four-task classification. The experimental results are shown in Fig. 11. As shown in the figure, the accuracies of all seven algorithms are much lower than those of binary classification because more categories make the recognition of mental tasks more difficult. Nevertheless, the two proposed algorithms still outperform the comparison algorithms at all groups of training trials except 10 training trials per class. This result is similar to that of the binary classification and further verifies the effectiveness of the proposed algorithms. The main difference is that, for the proposed algorithms, the contribution of data alignment decreases in four-task classification because four classes of EEG data are aligned together.
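Such a one-versus-one voting strategy can be sketched as follows (a generic implementation with synthetic data; `ovo_vote` is a hypothetical helper, and scikit-learn's `SVC(decision_function_shape="ovo")` offers a built-in equivalent):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def ovo_vote(X_train, y_train, X_test):
    """One-vs-one voting: train one binary SVM per class pair; each test
    trial is assigned the class that collects the most pairwise votes."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y_train, classes[[i, j]])
        clf = SVC().fit(X_train[mask], y_train[mask])
        for row, c in enumerate(clf.predict(X_test)):
            votes[row, i if c == classes[i] else j] += 1
    return classes[np.argmax(votes, axis=1)]

rng = np.random.default_rng(0)
centers = np.eye(4) * 4  # four well-separated synthetic classes
X = np.vstack([rng.normal(c, 1.0, (25, 4)) for c in centers])
y = np.repeat(np.arange(4), 25)
pred = ovo_vote(X, y, X)
print((pred == y).mean() > 0.9)
```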

V. DISCUSSION
Fig. 11. The classification accuracies of the four tasks (i.e., MI of left hand, right hand, foot and tongue) from data set 1 yielded by the seven CSP algorithms.

TL is popular in EEG-based BCIs because it can cope with variations among different subjects, sessions or tasks. In this study, we propose a novel TL-based classification framework for enhancing performance and decreasing calibration time. The classification framework includes three main steps: 1) EEG trials from the target subject and the source subjects are separately aligned in Euclidean space; 2) an ensemble spatial filter comprising four base filters is estimated by IISCSP using the EEG trials from the target subject, the source subjects, or both; 3) the feature signals extracted by each base filter are first dimensionally reduced by LDA, then concatenated and finally classified by an SVM.
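Step 2 builds on CSP filter estimation. As a minimal sketch (not the paper's exact IISCSP routine, which derives four base filters from target-only, source-only and pooled trials), a generic two-class CSP estimator can be written as:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Generic two-class CSP: find spatial filters that maximize the
    band-power ratio between classes. Each trial: (channels, samples)."""
    def mean_cov(trials):
        # Trace-normalized spatial covariances, averaged over trials.
        return np.mean([X @ X.T / np.trace(X @ X.T) for X in trials], axis=0)
    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem Ca w = lambda (Ca + Cb) w; scipy returns
    # eigenvalues in ascending order.
    vals, vecs = eigh(Ca, Ca + Cb)
    # Keep n_pairs filters from each end of the eigenvalue spectrum, which
    # correspond to the most discriminative variance directions.
    W = np.hstack([vecs[:, :n_pairs], vecs[:, -n_pairs:]])
    return W.T  # shape: (2 * n_pairs, n_channels)

rng = np.random.default_rng(0)
# Synthetic 8-channel trials: class A is strong on channel 0, class B on channel 1.
a = [rng.standard_normal((8, 128)) * np.r_[3.0, np.ones(7)][:, None] for _ in range(20)]
b = [rng.standard_normal((8, 128)) * np.r_[1.0, 3.0, np.ones(6)][:, None] for _ in range(20)]
W = csp_filters(a, b)
print(W.shape)  # (4, 8)
```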
Data alignment (DA) is a useful data preprocessing approach for TL. EA is adopted in the classification framework because it performs better than other DA approaches. Including the EA in the proposed algorithm further improves its performance in terms of classification accuracy. It is noted that EEG trials from the target subject and source subjects are aligned in different ways so that the EA is applicable to online experiments.
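The EA step can be sketched as follows; this is a minimal NumPy illustration of the standard Euclidean alignment recipe (whitening all trials by the inverse square root of the mean spatial covariance), not the paper's exact implementation:

```python
import numpy as np

def euclidean_alignment(trials):
    """Align EEG trials (n_trials, n_channels, n_samples) so that their
    mean spatial covariance becomes the identity matrix."""
    # Per-trial spatial covariance matrices.
    covs = np.stack([X @ X.T / X.shape[1] for X in trials])
    R = covs.mean(axis=0)  # Euclidean mean covariance
    # Inverse matrix square root of R via eigendecomposition.
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return np.stack([R_inv_sqrt @ X for X in trials])

rng = np.random.default_rng(0)
trials = rng.standard_normal((20, 8, 256))  # 20 synthetic trials, 8 channels
aligned = euclidean_alignment(trials)
# After alignment the mean covariance is (numerically) the identity.
mean_cov = np.mean([X @ X.T / X.shape[1] for X in aligned], axis=0)
print(np.allclose(mean_cov, np.eye(8), atol=1e-6))  # → True
```

Because the target subject's trials are aligned with a reference matrix computed only from already-seen trials, the same recipe remains applicable online.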
Unlike existing CSP-based TL methods, which consist mainly of regularizing either the SCM estimate or the CSP objective function, or of aligning the data, the main idea of the proposed algorithm is to create an ensemble filter that exploits knowledge from within and across subjects. The proposed algorithm works because the four base filters are estimated from EEG trials drawn from different domains of subjects, so the feature signals extracted separately by these filters are independent of each other and complementary. It is this complementarity among the feature signals that enhances the classification performance of the algorithm.
Dimensionality reduction is an important step for signal classification when only a small number of training samples is available. We assumed that only 10-50 calibration trials per class are available. Accordingly, the four feature vectors extracted by the four base filters are each reduced to one scalar score, and the four scores are concatenated as a low-dimensional feature vector for classification. As an alternative, we tried to reduce the dimensionality of the feature vector by selecting features from all combined features with approaches such as the Fisher score and minimum redundancy maximum relevance (mRMR), but the classification results were poor. The reason might be that the four spatial filters yielded by IISCSP/EA-IISCSP are derived from the EEG data of different subjects, i.e., the target subject, the set of source subjects or both, and thereby the complementarity of the resulting four feature vectors can be represented well only when each of them acts as a whole. We also tried to create an ensemble classifier with the feature vectors from the four base filters and recognize testing trials with classifier fusion methods such as Dempster-Shafer evidence theory [34], [35], but the classification results were not satisfactory either.
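This score-concatenation step can be sketched as follows; the feature sets below are synthetic stand-ins for the outputs of the four base filters, and `score_concat_features` is a hypothetical helper name:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def score_concat_features(feature_sets, labels):
    """Reduce each base-filter feature vector to one scalar LDA score and
    concatenate the scores into a low-dimensional feature vector."""
    ldas = [LinearDiscriminantAnalysis().fit(F, labels) for F in feature_sets]
    # For two classes, LDA.transform projects onto a single discriminant axis.
    scores = np.column_stack([lda.transform(F)[:, 0]
                              for lda, F in zip(ldas, feature_sets)])
    return scores, ldas

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
# Four synthetic 6-D feature sets (one per base filter), class-shifted by 0.8.
feats = [rng.standard_normal((60, 6)) + y[:, None] * 0.8 for _ in range(4)]
Z, ldas = score_concat_features(feats, y)
clf = SVC().fit(Z, y)  # final classification on the four concatenated scores
print(Z.shape)  # (60, 4)
```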
At all groups of training trials except 10 training trials per class, EA-IISCSP achieved higher averaged accuracy across subjects than the five competing algorithms on each of the two data sets, as shown in Fig. 5 and 6. In other words, EA-IISCSP achieved accuracy similar to that of the competing algorithms using fewer training trials, thus reducing the calibration time. In terms of computational complexity, the training time of EA-IISCSP is less than 2 s and 1 s for data sets 1 and 2 respectively, and the testing time of a single trial is negligible, as shown in Fig. 10. Considering its high accuracy and fast running speed, EA-IISCSP is a superior algorithm for MI-based BCIs.
The performance of a BCI system depends mainly upon the classification algorithm and/or the experimental paradigm. For MI-based BCIs, the experimental paradigms differ little, and so performance is determined largely by the former. Although the CSP algorithm is a well-recognized method for feature extraction, the spatial filters and resulting feature signals are not necessarily optimal due to the nonstationarity of EEG and inherent defects of the CSP objective function. Jin et al. [36] proposed a new feature selection method to deal with this issue by selecting features based on an improved objective function. The improvements are achieved by suppressing outliers and finding features with larger interclass distances. In addition, a fusion algorithm based on Dempster-Shafer theory was developed, which takes the distribution of features into account. Experimental results on two data sets showed that their method significantly outperforms competing feature selection methods in both accuracy and computation time. As for other types of BCIs, the design of the experimental paradigm plays a major role in performance improvement. Taking P300-based BCIs as an example, the number of items should be adjustable in accordance with the requirements of specific tasks. Zhou et al. [37] proposed a novel task-oriented optimization approach to deal with this issue, which aims to increase the performance of general P300-BCIs with different numbers of items. First, a stimulus presentation paradigm with variable dimensions (VD) was developed as a generalization of the conventional single-character (SC) and row-column (RC) paradigms. Then, an embedding design approach was used for any given number of items. Finally, the VD flash pattern was determined by a linear interpolation approach for a certain task based on the score-P model of each subject.
The experimental results indicate that the proposed paradigm is consistently superior to conventional SC and RC paradigms and significant improvement in the practical ITR can be achieved for a large number of items.
Recently, deep learning (DL) has emerged as a powerful tool for developing BCI systems [38], [39], [40]. Lee et al. [38] proposed a convolutional neural network (CNN) for inter-task TL using a channel-wise variational autoencoder (CVNet) for decoding various forearm movements. EEG samples from motor execution (ME) are transferred and used as auxiliary data for building an MI classification model. The results suggest that a good model for decoding MI can be created using data from ME and a small number of calibration samples from MI. Zhang et al. [39] presented five adaptive TL schemes for MI classification with a CNN. Each scheme fine-tunes a pre-trained model and adapts it to enhance the performance on the target subject. An improvement of 10% in accuracy was achieved on an MI data set compared to the highest accuracy reported in the literature. Wang et al. [40] proposed an unsupervised deep CNN-based TL method to deal with the non-stationarity of EEG signals, in which EA and CSP are used for data alignment and feature extraction respectively. The effectiveness of the method was verified by comparing its experimental results with those of four other methods. These methods provide new ideas for developing high-performance BCIs.
A limitation of the proposed algorithm is that it is only suitable for classifying EEG data with a moderate number of recording channels. If the number of channels is very large, as in data set 2 with 118 channels, the accuracy of the algorithm decreases. To address this, we manually selected 33 channels according to neurophysiological prior knowledge. In the future, we will explore a channel selection method based on the proposed algorithm. The present study only carried out an offline analysis of the proposed algorithm; future work will focus on its online performance.

VI. CONCLUSION
In this paper, we proposed an intra- and inter-subject CSP (IISCSP) algorithm for MI-based BCIs. Four spatial filters are estimated with EA-aligned EEG data from the target subject, the source subjects, and both. Four feature vectors are extracted by these filters and dimensionally reduced by LDA; the four resulting feature scores are concatenated and classified by a trained SVM model. The proposed classification framework was applied to two MI data sets and evaluated on nine different-sized groups of training trials. The results indicate that the proposed algorithm outperforms the three competing algorithms and can significantly decrease the calibration time of MI-based BCIs.