An Adaptive CSP and Clustering Classification for Online Motor Imagery EEG

A potential limitation of the motor imagery (MI) based brain-computer interface (MI-BCI) is that it usually requires a relatively long time to record sufficient electroencephalogram (EEG) data for robust feature extraction and classification. Moreover, due to the non-stationarity of EEG signals, an offline-trained model has poor adaptability and classification ability in cross-session or sample-wise online testing. Methods: To address these problems, we propose a fast, adaptive model updating scheme. Based on the Common Spatial Pattern (CSP), we propose an online and fast generalized eigendecomposition method that updates the CSP filter coefficients by Recursive Least Squares (RLS-CSP), which allows incremental training of the CSP spatial filters. Additionally, we present an Incremental Self-training Classification algorithm based on Density Clustering (ISCDC) that selects high-confidence samples to update the spatial filters and the classifier while classifying. Results: We conducted extensive experiments to validate the efficiency of the proposed adaptive CSP and classifier on the BCI III_IVa and BCI III_V data sets. Experimental results demonstrate that RLS-CSP significantly outperforms competing methods in a small sample setting (SSS), and that ISCDC adapts well to cross-session and non-stationary EEG signals. The results indicate that our proposed methods are feasible for improving the real-time performance of an online BCI system.


I. INTRODUCTION
Brain-Computer Interface (BCI) is a human-computer interaction technology. It does not rely on the peripheral nerves and muscles, and aims to provide a bridge between the human brain and external devices [1], [2]. It has demonstrated broad application prospects in the rehabilitation of disabled people and in auxiliary control for healthy people [3].
MI-BCI generates correlated signals through imagined movement, which is accompanied by event-related desynchronization and event-related synchronization (ERD/ERS) in functional motor areas [4], [5]. Effective characterization of the ERD/ERS phenomenon is of vital importance to an MI-BCI system. The Common Spatial Pattern (CSP) is a widely used time-spatial feature extraction method in MI-BCI systems, which is effective at extracting frequency band variances as discriminative features.

A. SMALL SAMPLE SETTING
CSP is highly dependent on sample-based covariance [2], and is very sensitive to noise and prone to overfitting in a small sample setting (SSS) [9]. Regularized CSP was developed to estimate the covariance matrix by adding a-priori information into the CSP learning process in the form of regularization terms [10], [11]. Regularized CSP can be implemented at two levels. One approach operates at the covariance matrix estimation level, since spatial covariance matrix estimates can suffer from noise or small training sets. The other approach regularizes CSP at the level of the objective function itself. Tikhonov Regularization CSP (TRCSP) reduces deviation by adding an identity matrix to the denominator of the objective function to constrain the norm of the spatial filters [12], [13]. TRCSP imposes an equally weighted penalty on all signal channels, ignoring the particularity of individual channels in reflecting specific regions. Therefore, weighted Tikhonov Regularization CSP (wTRCSP) improves on TRCSP by introducing a-priori information on channel weights, which can be obtained from training samples of auxiliary subjects. However, wTRCSP ignores the individual differences between auxiliary subjects and the target subject. Invariant CSP (iCSP) maintains the robustness of spatial filters under given noise by adding general invariances to the objective function [15]. However, computing the penalty covariance matrix requires an extra artifact recording and additional preprocessing. Complex CSP (CCSP) uses the samples of other subjects to compensate for the estimation bias of the covariance matrix [16], which may aggravate the bias due to differences between subjects.
In Regularized CSP with Selected Subjects (SSRCSP), samples of other subjects are selectively added into the covariance matrix; however, when the data set is large, the selection process takes a long time [9]. Besides, a filter band regularization with CSP (FBRCSP) [2] has been proposed to overcome the dependence on the frequency band and covariance matrix estimation, in which a regularized CSP [9] serves as the spatial filter on each frequency band. Moreover, subject-to-subject feature transfer provides a promising approach to learn reliable features from the limited data of the target subject with the help of sufficient data from other subjects [17]-[19]. A sparse representation-based classification (SRC) scheme has demonstrated its advantage in exploring the potential relationships of CSP features among subjects [20]. Based on SRC, the Sparse Group Representation Model (SGRM) aims at finding the most significant training features from both the target and other subjects by exploiting two norm regularizations [8]. A sparse filter band CSP (SFBCSP) [21] was proposed to improve filter band selection in a supervised way by exploiting sparse representation learning [20], [22], and a temporally constrained sparse group spatial pattern (TSGSP) [22] simultaneously optimizes filter bands and time windows within CSP to further improve the accuracy of MI-related EEG classification. However, sparse-based methods represent a linear relationship over all the training samples, which places higher requirements on the quality of the training samples [20].
It should be noted that all the regularization-based and sparse-based CSP algorithms are affected by their hyperparameters. Parameter optimization based on cross-validation is relatively time-consuming and requires an additional dataset for validation, which inevitably produces parameter deviations in SSS [9], [20]. Besides, the parameters need to be re-determined when new samples are added, which limits the practicality of a BCI system to some extent.
To address the time consumption and poor adaptability of offline regularized CSP in MI-BCI systems, this paper proposes RLS-CSP, an online and fast generalized eigendecomposition method which updates the filter coefficients by Recursive Least Squares (RLS). Compared with regularized CSP, which relies on batch computing, RLS-CSP allows adding testing samples to alleviate overfitting and improve the adaptability of the RLS-CSP filters. How to select high-confidence samples is thus very important to the performance of RLS-CSP.

B. NON-STATIONARITY IN THE TESTING PHASE
Because of the non-stationarity of EEG, the feature distribution changes markedly across sessions. Therefore, adaptive improvements to the spatial filters are necessary to maintain high performance over a long duration. One way to keep CSP adaptive is to update the initial spatial filters with high-confidence testing samples [23]-[25]. To evaluate confidence, this paper proposes an Incremental Self-training Classification method based on Density Clustering (ISCDC) by combining Density Peaks Clustering (DPC) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). In ISCDC, the self-training method based on density peaks is redesigned to classify in SSS and to quantify the reliability of classification. Moreover, a parameter-free local noise filter based on DBSCAN is proposed to filter out mislabeled instances. Compared with existing classifiers used in MI-BCI, ISCDC can remove mislabeled samples by exploiting the information of both labeled and unlabeled data, and can assess the reliability of classification. Fig.1 demonstrates the framework of our proposed approach for sample-wise online learning. The main contributions of this work are as follows.
1) Aiming at feature extraction in SSS, this paper proposes an adaptive CSP method (RLS-CSP), which allows incremental updating of the covariance matrix and fast eigenvector decomposition by RLS. RLS-CSP has two advantages: a) it allows incremental computation of the filter coefficients, which is efficient regarding the required memory and computational effort, and b) it allows the incorporation of new samples to adapt the current filters, gradually alleviating overfitting.
2) For adaptive classification of time-varying EEG, a classification method based on density clustering is proposed to evaluate the reliability of testing samples, remove outliers, and adaptively update the cluster centers and spatial filters. Moreover, unlike recently investigated batch-based clustering [26]-[29], our algorithm performs efficient classification in SSS and is applicable to sample-wise online BCI.

3) For the SSS and cross-session cases, a series of detailed experiments are designed, with RLS-CSP for feature extraction and ISCDC for classification.
The rest of this paper is organized as follows. Section II details the basic principle of proposed RLS-CSP algorithm. Section III describes the proposed adaptation classification and the implementation framework of our algorithms. Section IV provides a detailed description of the experiment design. Section V evaluates the performance of our method on two datasets. Finally, the conclusions and future work are presented in Section VI.

II. ADAPTIVE CSP SPATIAL FILTERS IN SSS
Since the CSP algorithm is the basis of our method, we briefly describe it first. Afterwards, we introduce the RLS-based eigendecomposition method [30] to incorporate newly arriving samples into the current spatial filters in an application.

A. BASIC CSP
Suppose X_c ∈ R^{M×T} represents a single-trial time-space matrix, with M channels and T samples per channel. We suppose that, except for the different cognitive tasks, the matrix dimensions and recording conditions are the same for the two classes indexed by c ∈ {1, 2}. CSP aims at learning spatial filters which maximize the variance of band-pass filtered EEG signals from one class while minimizing their variance for the other class. Its objective function can be defined as:

J(ω) = (ω^T Σ_1 ω) / (ω^T Σ_2 ω),  s.t. ω^T Σ_2 ω = K    (1)

where Σ_c is the mean covariance matrix for class c, ω denotes a spatial filter, and K is a real constant. Using the Lagrange multiplier method, this constrained optimization problem amounts to extremizing the following function:

L(λ, ω) = ω^T Σ_1 ω − λ (ω^T Σ_2 ω − K)    (2)

The spatial filters ω extremizing L are such that the derivative of L with respect to ω equals zero:

∂L/∂ω = 2 Σ_1 ω − 2 λ Σ_2 ω = 0  ⟹  Σ_2^{-1} Σ_1 ω = λ ω    (3)

Equation (3) is a standard eigenvalue decomposition problem. The spatial filters ω are the eigenvectors of V = Σ_2^{-1} Σ_1 that correspond to its largest and smallest eigenvalues. The extracted features are the logarithm of the EEG signal variance after projection onto ω.
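As a concrete illustration, the CSP computation above can be sketched in a few lines of NumPy. The helper names `csp_filters` and `log_var_features` are hypothetical, and the trace-normalized covariance is a common convention rather than something the text prescribes:

```python
import numpy as np

def csp_filters(trials_1, trials_2, n_pairs=2):
    """CSP spatial filters from two lists of band-pass filtered
    single-trial EEG matrices (channels x samples), per Eq. (3)."""
    def mean_cov(trials):
        # trace-normalized covariance, averaged over trials
        covs = [X @ X.T / np.trace(X @ X.T) for X in trials]
        return np.mean(covs, axis=0)
    S1, S2 = mean_cov(trials_1), mean_cov(trials_2)
    # eigenvectors of V = S2^{-1} S1
    evals, evecs = np.linalg.eig(np.linalg.solve(S2, S1))
    order = np.argsort(evals.real)
    # keep filters for the smallest and largest eigenvalues
    idx = np.r_[order[:n_pairs], order[-n_pairs:]]
    return evecs[:, idx].real

def log_var_features(W, X):
    """Log-variance features after projecting a trial onto the filters."""
    Z = W.T @ X
    v = np.var(Z, axis=1)
    return np.log(v / v.sum())
```

A trial of shape (M, T) then yields a 2·n_pairs-dimensional feature vector for the downstream classifier.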
Obviously, the optimization of the spatial filter is based on variance estimation. However, in SSS, the covariance matrix estimate is prone to overfitting [9]-[12]. In addition, non-stationary factors such as noise, fatigue and emotion lead to deviations in the spatial filters [14], [15]. One way to solve this problem is to adjust the spatial filters adaptively through test samples.

B. RLS BASED CSP
The optimal spatial filter of (3) is a standard generalized eigendecomposition. Instead of using the QRD and SVD, we introduce an incremental approach based on the recursive least squares (RLS) method [31].
All the generalized eigenvalues are stationary points of (3). Hence the formula is transformed as:

Σ_1 ω = λ Σ_2 ω    (4)

Left-multiplying (4) by Σ_2^{-1} gives:

Σ_2^{-1} Σ_1 ω = λ ω    (5)

Here a recursive stochastic algorithm [27] is used to approximate the maximum eigenvector at the n-th iteration as:

ω_i(n) = Σ_2^i(n)^{-1} Σ_1^i(n) ω_i(n−1) / ||Σ_2^i(n)^{-1} Σ_1^i(n) ω_i(n−1)||    (6)

where ω_i(n) represents the i-th eigenvector at the n-th iteration, and Σ_c^i(n) denotes the c-class covariance matrix corresponding to the i-th spatial filter.
Equation (6) shows how to approximate the largest generalized eigenvector at the n-th iteration from the one obtained at the (n−1)-th iteration. Note that the eigenvectors ω are arranged in descending order of their eigenvalues.
Computation of Σ_2(n)^{-1}: According to the Sherman-Morrison-Woodbury theorem, the inverse of Σ_2(n) can be implemented using recursive estimators:

Σ_2(n)^{-1} = Σ_2(n−1)^{-1} − Σ_2(n−1)^{-1} X_2(n) [I + X_2(n)^T Σ_2(n−1)^{-1} X_2(n)]^{-1} X_2(n)^T Σ_2(n−1)^{-1}    (7)

where X_c(n) denotes the new sample of class c at the n-th iteration, and the superscript T denotes the transpose of a matrix.

Covariance matrix Σ_c(n): The covariance matrix can be updated by:

Σ_c(n) = Σ_c(n−1) + X_c(n) X_c(n)^T    (8)

Lower-order filters: Using (6), the first spatial filter is obtained. For the minor components, a standard deflation procedure [26] is considered:

Σ_c^{i+1}(n) = (I − Σ_c^i(n) ω_i(n) ω_i(n)^T / (ω_i(n)^T Σ_c^i(n) ω_i(n))) Σ_c^i(n)    (9)

Hence the lower-order filter ω_i(n) is given by (6) with Σ_1^1(n) and Σ_2^1(n) replaced by Σ_1^i(n) and Σ_2^i(n). Note that the deflation given by (9) does not increase the complexity of the algorithm, because all the terms in (9) are pre-computed in the preceding iteration, and each filter ω_i(n) at the n-th iteration depends on the filter ω_i(n−1) from the preceding iteration.
The complete algorithm is detailed in Alg.1.
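Since Alg.1 itself is not reproduced here, the core recursive updates of Eqs. (6)-(8) can be sketched as follows. The names `smw_inverse_update` and `rls_csp_step` are hypothetical, and the normalized power-iteration step is one standard way to realize the eigenvector update of Eq. (6):

```python
import numpy as np

def smw_inverse_update(S2_inv, X_new):
    """Eq. (7): rank update of the inverse covariance via the
    Sherman-Morrison-Woodbury identity, avoiding a full re-inversion."""
    T = X_new.shape[1]
    K = np.linalg.inv(np.eye(T) + X_new.T @ S2_inv @ X_new)
    return S2_inv - S2_inv @ X_new @ K @ X_new.T @ S2_inv

def rls_csp_step(w, S1, S2_inv, X1=None, X2=None):
    """One RLS-CSP iteration: fold new trials into the class statistics
    and refresh the spatial filter with a normalized step of Eq. (6)."""
    if X1 is not None:
        S1 = S1 + X1 @ X1.T                      # Eq. (8)
    if X2 is not None:
        S2_inv = smw_inverse_update(S2_inv, X2)  # Eq. (7)
    v = S2_inv @ S1 @ w                          # Eq. (6), power step
    w = v / np.linalg.norm(v)
    return w, S1, S2_inv
```

Each arriving high-confidence trial thus costs only matrix-vector products and one T×T inversion, instead of a full M×M eigendecomposition.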

III. ADAPTATION CLASSIFICATION AND IMPLEMENTATION FRAMEWORK
Notations:
CoNb(x) is the set of points lying in the region with x as the center and the cutoff distance dist_cutoff as the radius, defined as CoNb(x) = {y | dist(x, y) ≤ dist_cutoff}. ECoNb(x) is the set of points lying in the field of the extended CoNb(x). X_i ∈ R^{M×T} is a single-trial EEG signal, with M channels and T sample points. {x_1^u, . . . , x_t^u} is a training set without labels.

A. DENSITY CLUSTERING-BASED CLASSIFICATION
Density peaks clustering is a widely used unsupervised clustering method, based on the idea that cluster centers have a higher density than their neighbors [26]. For each sample x_i, two quantities have to be computed: its local density ρ_i and its distance δ_i from points of higher density. The local density ρ_i of data point x_i is defined as:

ρ_i = ∑_{j≠i} φ(dist(x_i, x_j), dist_cutoff)    (10)

where dist_cutoff is a cutoff distance and φ(x, y) is usually defined as a sign function, i.e. φ(x, y) = 1 if x − y ≤ 0 and φ(x, y) = 0 otherwise. δ_i measures the minimum distance between the sample x_i and any other point with higher density:

δ_i = min_{j: ρ_j > ρ_i} dist(x_i, x_j)    (11)

However, for computing the density in SSS, the exponential kernel is suggested:

φ(x, y) = exp(−x² / y²)    (12)

A cluster center is identified by relatively large δ_i and ρ_i. Then, the samples are clustered in the direction of decreasing density, with the cluster center as the starting point. Similar to the DPC principle, DBSCAN calculates the local density via the ε-neighborhood, where the number of samples in the neighborhood is the desired density. Points with a density higher than MinPts are considered cluster centers. Eventually, the samples are classified according to neighborhood reachability starting from the cluster centers. Readers can refer to [27] for more details about DBSCAN.
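A minimal sketch of the ρ/δ computation, assuming Euclidean distances; `dpc_rho_delta` is a hypothetical helper, and assigning the densest point its maximum distance as δ follows common DPC practice:

```python
import numpy as np

def dpc_rho_delta(X, dist_cutoff, kernel=True):
    """DPC local density rho (Eq. 10 or its kernel form, Eq. 12) and
    distance-to-higher-density delta (Eq. 11) for rows of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if kernel:                       # exponential kernel, better in SSS
        rho = np.exp(-(D / dist_cutoff) ** 2).sum(axis=1) - 1.0
    else:                            # hard cutoff counting
        rho = (D <= dist_cutoff).sum(axis=1) - 1.0
    n = len(X)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        # densest point gets its maximum distance by convention
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return rho, delta
```

Candidate cluster centers are then the points where both ρ_i and δ_i are large.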
Once the cluster centers are selected, the testing samples can be classified by calculating their distances to the cluster centers. However, for time-varying samples, especially in online BCI, clustering-based classification has the following disadvantages: 1) Because of the limitation of SSS, the obtained cluster center is a local optimum under the current samples, not a global one for the whole test set.
2) The batch-based updating method for cluster centers is time-consuming, which makes it unsuitable for a sample-wise online BCI system.
3) Due to the non-stationarity, test samples influence the model differently: some are conducive to clustering and others are not, so filtering outliers is of particular importance for a robust classifier.

B. ADAPTIVE CLASSIFICATION FOR A TIME-VARYING SITUATION: ISCDC
To address the above problems, an adaptive density classification method (ISCDC) is proposed. It uses incremental computation to reduce time consumption, automatically updates cluster centers according to the sample distribution, and performs confidence evaluation and parameter-free noise filtering. Fig.2 shows the general framework of ISCDC, which can be divided into three steps. In the first step, we use DPC to conduct supervised clustering on a small amount of labeled data, and construct the neighborhood CoNb(x) of each labeled point. Step two is a standard self-training process, where unlabeled samples are initially predicted according to their density with respect to the c-class points. Besides, we introduce the concept of ECoNb(x) to assess the reliability of the pre-predicted labels and remove noisy instances. Then, high-confidence samples are continuously added to the labeled data to expand the neighborhoods of the labeled samples and update the spatial filters.

1) DISCOVER THE UNDERLYING STRUCTURE OF TRAINING DATA
Unsupervised DPC clustering may cause mislabeling [25], [28]. Therefore, we utilize labeled instances to obtain the best dist cutoff by adjusting dist cutoff and repeatedly clustering. The expected result is that the clustering results of the training samples are as consistent as possible with the true labels, and dist cutoff is relatively small. After that, we construct the neighborhood set ({CoNb(x i )}) of training samples with dist cutoff being radius in density decreasing order. Algorithm 2 illustrates the procedure of mining the structure of the training samples by our algorithm, and the output is the set {CoNb(x i )}.
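Algorithm 2 is not reproduced here, but the neighborhood construction it describes can be sketched as follows, assuming the exponential-kernel density above and a given dist_cutoff; `build_conb` is a hypothetical name:

```python
import numpy as np

def build_conb(X, dist_cutoff):
    """Construct the neighborhood sets {CoNb(x_i)} of the training
    samples, visited in order of decreasing DPC local density."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = np.exp(-(D / dist_cutoff) ** 2).sum(axis=1) - 1.0
    conb = {}
    for i in np.argsort(-rho):           # density-decreasing order
        conb[i] = set(np.where(D[i] <= dist_cutoff)[0]) - {i}
    return conb
```

In the full algorithm, dist_cutoff would first be tuned so that clustering the labeled training samples reproduces their true labels as closely as possible.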

2) CLASSIFICATION AND CONFIDENCE EVALUATION OF TESTING SAMPLES
Equation (12) indicates that in Euclidean space, the closer the point is to a group of samples, the higher the local density. Therefore, we get an initial prediction on the tested sample according to the density between the point and different types of labeled ones. After that, it is crucial to assess the reliability of the classification.
Inspired by neighborhood-based clustering (i.e. DBSCAN), we introduce an extended-neighborhood method to quantify confidence and remove outliers. We suppose that there is no mutation between adjoining samples of the same type, and that a mislabeled sample has a different label from its adjacent neighborhood. We use CoNb(x_i) to represent the neighborhood of a tested instance. The distribution of the testing sample relative to the training samples generally falls into three situations: the neighborhood CoNb(x_i) of the tested point (a) contains points of one category, (b) contains points of two categories, or (c) contains no points. For the third case, an extended CoNb(x_i) (i.e. ECoNb(x_i)) is defined.
To quantify the confidence, we modify the definitions of harmfulness and usefulness of a sample x_i^u from reference [31], as follows:

Harm(x) = N({z ∈ ECoNb(x) : l(z) ≠ l(x)}),  Help(x) = N({z ∈ ECoNb(x) : l(z) = l(x)})    (14)

Harm(x) indicates the number of instances z such that ECoNb(x) contains z and z's label differs from that of x. By contrast, Help(x) represents the number of instances z such that z ∈ ECoNb(x) and l(z) = l(x), where N(·) denotes the cardinality of a set and l denotes the class label of an instance. According to (14), we can quantify the classification and filter out noisy points without any additional parameters. If Harm(x_i^u) = 0, x_i^u is considered reliable and beneficial for supplementing the clustering distribution. So x_i^u is incorporated into the training set, its neighborhood relationship CoNb(x_i^u) is preserved, and its corresponding EEG signal X_i^u is saved for RLS-CSP.
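The Harm/Help test of Eq. (14) reduces to simple counting once the labels inside ECoNb(x) are known. The sketch below, with the hypothetical helper `confidence_filter`, shows the acceptance rule Harm = 0:

```python
def confidence_filter(pred_label, econb_labels):
    """Harm/Help counts of Eq. (14) over the extended neighborhood
    ECoNb(x): a sample is accepted only when no neighbor disagrees."""
    harm = sum(1 for l in econb_labels if l != pred_label)
    help_ = sum(1 for l in econb_labels if l == pred_label)
    reliable = (harm == 0)
    return harm, help_, reliable
```

Accepted samples would then be appended to the labeled set and passed on to the RLS-CSP filter update; rejected ones are discarded as noise.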

IV. EXPERIMENT
This section presents a large number of experiments to support the following objectives: 1) Investigate how the performance of RLS-CSP is affected by the size of training set and testing duration.
2) Investigate how the performance of ISCDC is affected by the size of training set and testing duration.
3) Evaluate the efficacy of RLS-CSP combined with ISCDC in a continuous BCI system.

A. DATASET DESCRIPTION
1) BCI COMPETITION III DATASET IVa, TERMED DATASET1
The dataset contains EEG signals recorded from 118 channels at a 1000 Hz sampling rate (downsampled to 100 Hz in this paper) from five subjects named ''AA'', ''AL'', ''AV'', ''AW'' and ''AY''. For each subject, a total of 280 cue-based trials are available (half for each MI class). In each trial, a cue was presented for 3.5 s, during which one of two MI tasks was performed: (R) right hand or (F) right foot. Cues were separated by relaxation periods of random length, from 1.75 to 2.25 s. See http://www.bbci.de/competition/iii/desc_IVa.html for more details about the dataset.

2) BCI COMPETITION III DATASET V TERMED AS DATASET2
In this dataset, EEG was recorded from three subjects over 32 channels at a sampling rate of 512 Hz for three tasks: left-hand movement, right-hand movement, and word generation (the word-imagery task was excluded in our experiment). Each subject performed 4 sessions, each lasting 4 minutes with 5-10 minute breaks in between. The subject performed a given task for about 15 seconds and then switched randomly to another task.

B. EXPERIMENT DESIGN
Dataset1 and Dataset2 were filtered by a fifth-order Butterworth band-pass filter of 8-30 Hz. All trials were normalized before the feature-extraction process, and two pairs of spatial filters were selected for feature extraction.
Experiments I, II and III were carried out on Dataset1. In each trial, 3 seconds of data were captured starting 0.5 s after the cue. In Experiment IV, Dataset2, a continuous data set, is used to simulate an online BCI system. Trials are captured by a 2-second sliding window with 1-second overlap.
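The preprocessing described above might be sketched as follows; `preprocess` is a hypothetical helper, and the trial-wise norm normalization is an assumption, since the paper does not specify its normalization scheme:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(trial, fs, band=(8.0, 30.0), order=5):
    """Band-pass filter one EEG trial (channels x samples) with a
    zero-phase 5th-order Butterworth filter, then normalize."""
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = filtfilt(b, a, trial, axis=1)      # zero-phase filtering
    return filtered / np.linalg.norm(filtered)    # assumed normalization
```

For Dataset1 one would pass fs=100.0 (after downsampling); for Dataset2, fs=512.0.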
The specific settings of each experiment are as follows:

1) EXPERIMENT I: RLS-CSP IN SSS
To investigate the performance of RLS-CSP in SSS, a set of experiments was carried out with different training set sizes for each subject. We randomly selected 30 trials from each auxiliary subject (15 trials per class) and 0 to 140 trials from the target subject for classifier training, and the remaining 140 trials from the target subject for testing. The hyperparameters β and γ (from the set of 10 candidates {0.001, 0.01, 0.1, 0.2, . . . , 0.8}) and the hyperparameter C in the SVM were determined by five-fold cross-validation on the calibration data.
RLS-CSP was compared against the conventional CSP as well as other competing CSP variants:
CSP: the conventional CSP.
CCSP1: reduces the bias of the covariance matrix by using other subjects' data, with one parameter β [16].
FBRCSP: 10 subbands with 2 Hz overlap for the EEG filter, and R-CSP [10] with two parameters β and γ for the covariance matrix estimation to extract features; see [2] for details.
SGRM: two hyperparameters β and γ from 0.001 to 0.01 with an interval of 0.001, as suggested in [8].

2) EXPERIMENT II: RLS-CSP IN NON-STATIONARY EEG
To verify the adaptability of RLS-CSP to time-varying EEG, we designed experiments in two scenarios: subject-specific and subject-independent.
Subject-specific means that both training and test data come from the same subject. We used 40 trials to initialize the spatial filters and then analysed the feature distributions of the first 120 trials (120st for short) and the second 120 trials (120nd for short); the two testing sets were organized in chronological order. Furthermore, we defined a measurement to quantify the divergence of the distributions.

DR = Tr(S_b) / Tr(S_w)    (13)

where Tr(·) denotes the trace of a matrix, S_w is the within-class scatter, S_b is the between-class scatter, and DR represents the divergence ratio of the feature distribution. The larger the DR, the more separable the two sets of features are. Subject-independent means that the training data come from auxiliary subjects and the test data from the target subject. The first 40 trials of each subject served as the training set and the remaining 240 trials for testing. The proposed RLS-CSP and ISCDC were used for feature extraction and classification.
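The DR measure of Eq. (13) can be computed directly from labeled feature vectors; the sketch below uses the standard scatter-matrix definitions, with `divergence_ratio` as a hypothetical helper name:

```python
import numpy as np

def divergence_ratio(features, labels):
    """DR = Tr(Sb)/Tr(Sw): ratio of between-class to within-class
    scatter; a larger DR means more separable feature sets (Eq. 13)."""
    classes = np.unique(labels)
    mu = features.mean(axis=0)
    d = features.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Fc = features[labels == c]
        mc = Fc.mean(axis=0)
        Sw += (Fc - mc).T @ (Fc - mc)             # within-class scatter
        Sb += len(Fc) * np.outer(mc - mu, mc - mu)  # between-class scatter
    return np.trace(Sb) / np.trace(Sw)
```

Two well-separated classes yield a large DR; heavily overlapping classes drive it toward zero.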

3) EXPERIMENT III: ISCDC IN SSS
The training set was randomly selected as 0 to 140 trials for each target subject, and the remaining 140 trials were used for testing. We evaluated the efficacy of ISCDC by comparing it with DPC and DBSCAN.

4) EXPERIMENT IV: THE EFFICACY OF RLS-CSP COMBINED WITH ISCDC IN CONTINUOUS BCI SYSTEM
Dataset2 was used to simulate an online BCI system. The first session was used in the training phase and reflects a small sample problem. The remaining three sessions were used as test data, which possess the time-varying and unstable characteristics of MI-EEG. We compared the online recognition capabilities of three algorithms: SGRM, SFBCSP and RLS-CSP, where the SFBCSP and RLS-CSP features were classified with ISCDC. Here, we discarded single trials with more than one label.

V. RESULTS AND DISCUSSION
Table 1 and Fig.3 summarize the experimental results for Experiment I, where the average of five repetitions is reported for each subject. Table 1 shows that the RLS-CSP algorithm achieves overall superior performance compared to the other CSP-based algorithms in all scenarios. Specifically, RLS-CSP achieved the highest average accuracy of 84.94%, and its performance improved with increasing training size, reaching its peak of 90.46% when the training size was 100 trials. All of the methods tend to provide stable classification when using sufficient (more than 80 trials) training samples. The results illustrate that the RLS-CSP method can gradually remedy the covariance matrix estimation bias caused by SSS, and update the spatial filters through high-confidence test samples to adapt to the time-varying nature of EEG.

A. PERFORMANCE COMPARISON WITH DIFFERENT SPATIAL FILTERS
Moreover, as shown in Table 1, the average accuracies of SFBCSP and SGRM were 83.09% and 79.92%, and those of CCSP, TRCSP and FBRCSP were 77.19%, 77.71% and 81.04%, respectively. The sparse-based CSP algorithms performed better than the regularization-based CSP algorithms. A possible reason is that, in the framework of group sparse learning, the less relevant features of the non-target subjects can be eliminated more rapidly than with the regularization methods, which contributes more significant test samples and thus improves classification performance. Besides, since the optimal filter band is generally subject-specific, many studies have shown that filter bands are more effective than a single wide frequency band.
Table 2 and Table 3 summarize the classification accuracy for each target subject using 40 and 80 training trials, respectively. A paired-sample t-test was used to investigate the statistical significance of the accuracy differences between the compared methods (see p-values). Table 2 indicates that RLS-CSP yields significantly higher accuracy than the others with 40 trials, at a confidence level of up to 95%. Nevertheless, Table 3 shows that the sparse-based methods perform better when training data are sufficient: SGRM reached 75.45% and 85.75% for subjects 'AW' and 'AV', and SFBCSP reached 84.67% for 'AY'. There was no significant difference between SFBCSP, SGRM and RLS-CSP. In summary, RLS-CSP improves the robustness and adaptability of the feature model in SSS; however, its advantage is not significant when the training samples are sufficient. Fig.4 depicts the training and testing time for each method, measured in Matlab R2016b on a computer with a 3.40 GHz CPU (i5-7500, 8 GB RAM). In particular, the RLS-CSP training time refers to the process of ''offline spatial filter initialization - ISCDC-based evaluation - spatial filter update'', and its testing time is the time to complete ''ISCDC classification and spatial filter update''. The regularization CSP methods (CCSP, TRCSP, FBRCSP) required much longer training time than the non-regularization algorithms, with FBRCSP taking up to 410.204 s, while the sparse-based methods (SGRM, SFBCSP) showed moderate computational cost. This is because cross-validation for parameter selection is time-consuming. Since RLS-CSP and CSP do not need cross-validation, their training times were short, about 1.293 s and 0.492 s, respectively. The filter band-based CSP methods required much more testing time than the other algorithms. The RLS-CSP method took about 1 s for a single-trial test, which is acceptable in an online setting. Therefore, considering both time consumption and performance in SSS, RLS-CSP presents a clear advantage over the others.

B. EVALUATION OF THE EFFECT OF ADAPTATION
To study the effect of non-stationarity on the feature distribution, experiments were carried out as described in Experiment II. Fig.5 demonstrates the testing feature distributions of CSP and RLS-CSP for subject ''AY''. The x-axis represents the normal vector of the classifier hyperplane, and the y-axis is the largest PCA component of the testing features. It can be seen from Fig.5 that the feature distributions of the two testing sets shifted due to the non-stationarities in the EEG signal. For CSP, the features not only underwent a rotation between 120st and 120nd but their distribution also changed, with 120st having a smaller DR than the latter. For RLS-CSP, the features were more robust and discriminative than those of CSP, with DR values of 0.806 and 0.862, respectively. Under the same testing set, the overlap of the RLS-CSP features was smaller than that of the CSP features, and the DR values improved by 0.582 and 0.537, respectively. Therefore, it is evident that RLS-CSP is more robust and discriminative than CSP, and more powerful when dealing with time-varying EEG signals. Table 4 shows the recognition accuracy of RLS-CSP in the subject-independent scenario. The performance of RLS-CSP in this scenario is poor, which fully illustrates the impact of individual differences on classification. Further observation shows that, in general, the accuracy on the 120nd set was higher than on the 120st set. This demonstrates that RLS-CSP is adaptive and can compensate for the initial deviation of the spatial filters. However, once the deviation is too large, the degree of repair is limited. Therefore, RLS-CSP is more suitable for the subject-specific scenario.

C. EVALUATION OF CLASSIFICATION PERFORMANCE
The effectiveness of RLS-CSP is based on merging high-confidence testing samples and filtering outliers. Table 5 reports the results for ISCDC with different training sizes and compares them with DPC and DBSCAN. As can be seen in Table 5, ISCDC was superior to the other two methods on the different data sets for all subjects, and the performance of all methods improved with increasing training size. A possible reason is that in SSS (below 80 trials in our experiments) the cluster center is only a local optimum, not a global one, so the recognition accuracies of DPC and DBSCAN, which depend on the cluster center, are poor; ISCDC, by contrast, is based on the distribution of all test features, and the diversity of distributions is conducive to ISCDC classification. Sufficient samples reduce the deviation of the cluster center and provide varied sample distribution information, which expands the neighborhood set of ISCDC and improves the recognition rate of all three algorithms. In addition, ISCDC performs parameter-free outlier filtering, which can gradually overcome the model instability caused by poor-quality training samples.

D. PERFORMANCE ON THE CONTINUOUS IMAGERY
To verify the application value of the proposed adaptive scheme in an online BCI system, Experiment IV was carried out on Dataset2. Fig.6 demonstrates that SFBCSP has a high accuracy in the first session for all subjects, while the scheme proposed in this paper performs better in the second and third sessions. The average accuracies of our method were 78.91%, 70.41% and 59.14%, respectively; compared with the competition winners (79.6%, 70.31% and 56.02%, respectively), our algorithm used fewer training samples, which proves the advantages of our algorithm for small-sample online BCI.
TABLE 4. The accuracy (%) of subject-independent on two ordered sets. Note: The numbers in solid frames represent subject-specific results.
The results indicate that our scheme is capable of updating the feature model and the classification model, and can better adapt to an online time-varying system.

VI. CONCLUSION
Aiming at the problems of long calibration time and poor stability of an online motor imagery BCI system, this paper proposes a feature model based on recursive least squares (RLS-CSP). RLS-CSP uses the recursive least squares method to update the spatial filters, which allows incremental computation of the filter coefficients and incorporates new samples to update the filters. Moreover, an adaptive density clustering classification (ISCDC) is proposed to classify the testing samples and quantify the classification confidence.
To verify the efficacy of the RLS-CSP spatial filters in SSS, we compared RLS-CSP with CSP, TRCSP, CCSP, SFBCSP, FBRCSP and SGRM on different training sets. As shown in Table 1, RLS-CSP demonstrates better performance. Fig.5 illustrates the feature distributions of RLS-CSP, which proves that RLS-CSP provides more robust and discriminative spatial filters than CSP. Meanwhile, the RLS-CSP method is effective for non-stationary EEG signals in SSS. The self-training classification (ISCDC) works for both classification and confidence assessment. Table 5 proves the adaptive classification ability of ISCDC, which achieves excellent performance in SSS and improves its stability by filtering outliers. By classifying continuous EEG (Fig.6), our proposed scheme shows advantages for the small sample setting and for online BCI systems. Our method provides a new insight into shortening the calibration phase and improving the adaptability of an online BCI system. However, due to individual differences, RLS-CSP has poor recognition performance in the subject-independent scenario. In recent years, transfer learning has shown promise in coping with subject transfer in BCI systems [17]-[19]. Thus, a transfer learning extension of our RLS-CSP model could be beneficial for coping with inconsistent data distributions; in this way, fewer training samples would be required to initialize the spatial filters, improving the subject adaptability of RLS-CSP. In addition, as training samples accumulate, the computational cost of ISCDC will increase. An alternative solution is to delete those training samples that do not contribute much to classification, which is worth future investigation.
QIN JIANG received the B.S. and M.S. degrees in biomedical engineering from the Chongqing University of Technology, China. She is currently pursuing the Ph.D. degree in pattern recognition and knowledge discovery with the Chongqing University of Posts and Telecommunications. Her research interests include machine learning, pattern recognition, biomedical signal process, and brain-computer interface.
YI ZHANG received the Ph.D. degree in mechanical manufacturing and automation from the Huazhong University of Science and Technology, Wuhan, China, and the Ph.D. training in intelligent multimode human-computer interaction with the University of Essex, London, U.K. He is currently a Professor and a Ph.D. Supervisor with the Advanced Manufacturing Engineering College, Chongqing University of Posts and Telecommunications. His research interests include robot automatic control and human-computer interaction.
GENGYU GE received the B.S. degree in information and computing science from Chuzhou University, Chuzhou, China, in 2011, and the M.S. degree in computer system architecture from Southwest University, Chongqing, China, in 2014. He is currently pursuing the Ph.D. degree with the Chongqing University of Posts and Telecommunications. His research interests include mobile robot navigation, semantic slam, and embedded system application.
ZHIRONG XIE received the B.S. degree in electronic information engineering from the Sichuan University of Science and Engineering, Zigong, China. He is currently pursuing the M.S. degree with the Chongqing University of Posts and Telecommunications, Chongqing, China. His research interests include deep learning and pattern recognition.