Spectrum-Weighted Tensor Discriminant Analysis for Motor Imagery-Based BCI

Spatio-temporal ﬁltering has been widely used for extracting discriminative features in the motor imagery-based brain-computer interface (MI-BCI). In order to obtain high performance, the algorithms need to enhance robustness or ﬁnd class-discriminative bands for the spatial ﬁlter. However, the existing methods either cannot derive the spatial and spectral ﬁlters with a unique objective function for guaranteeing convergence or rarely consider the combined optimization of spatial-spectral ﬁlters and other patterns for enhancing the discrimination. In this study, we present a novel feature extraction method termed Spectrum-weighted Tensor Discriminant Analysis (SwTDA), which optimizes spectral ﬁlters along with spatial ﬁlters and other associated patterns by tensor-based discriminant analysis. The proposed method considers intrinsic spatial-spectral-temporal information contained by the physiological signal and hence can identify discriminative characteristics robustly. The effectiveness of the algorithm is demonstrated by comparing it with several state-of-the-art methods on two datasets involving 15 different subjects. Results indicate that the SwTDA method yields higher classiﬁcation accuracies than the competing methods. Furthermore, interpretable spatial-spectral patterns that are determined by the algorithm can be used for further analysis of the MI-based EEG signal


I. INTRODUCTION
Brain-computer interface (BCI) system can translate the user's intent to computer commands, so it is valuable for people with severe motor disabilities [1], [2]. Electroencephalogram (EEG) is often used in many BCI studies for its noninvasive measurement and low cost. One type of EEG-based BCI system is based on the modulation of sensorimotor rhythm (SMR), which follows the event-related (de)synchronization (ERD/ERS) when the subjects imagine movements of their limbs, i.e., motor imagery (MI) [3]. Three simple limb motor imagery (sMI) tasks (i.e., left hand, right hand, and foot) are typical for the application because of the discriminative characteristics between each other. In order to obtain more output commands in MI-based BCI for the need of many real-world applications, it is necessary to develop The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang . new types of MI tasks. Yi et al. [5] have investigated the differences of the induced brain oscillatory patterns between sMI and compound limb motor imagery (cMI).
In recent years, many new classification techniques have been designed and explored to classify the EEG signal. Adaptive classifiers (e.g., [37]) are demonstrated to be generally superior to static ones, even with unsupervised adaptation; Riemannian geometry-based methods (e.g., [6], [36]) have reached state-of-the-art performances on multiple BCI problems; deep learning methods (e.g., [32]- [34]) have been applied to enhance the EEG classification accuracy. Other advanced techniques, such as transfer learning [11] or boosting ensemble methods [9], [35], are also believed to be able to improve the performance. However, MI-based BCIs, the same as most pattern recognition, not only use a classifier but also apply feature extraction method to obtain characteristics as discriminative as possible. VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Spatio-temporal filters have been proven to be successful in enhancing the signal-to-noise ratio of the highly noisy and low-spatial-resolution EEG signals [7], [14], [19], [23], [25], [30]. The well-known common spatial patterns (CSP) [8] is a highly successful spatial filtering technique to extract Motor imagery-based signal features. However, good performance of CSP severely depends on performing temporal filtering with a proper frequency band, which is important for portraying SMR activities. Researches (e.g., [10], [30]) have shown that responsive frequency bands of ERD/ERS are not consistent with inter-and intra-subjects. Many recent works have been proposed and aim to resolve the problem. The strategies they used are to enhance robustness or find class-discriminative bands for spatial filtering. These works can be categorized into three schemes as follows (to avoid confusion, we use the temporal or spectral filter on the aspect of class-discriminative bands, whereas the temporal pattern for extracting time domain-related features).
The first scheme focuses on boosting robust to the individual variability by using regularization approaches, robust data averaging, and/or new divergence measures [29]- [31]. Regularization of the covariance matrix, e.g., [29], is one common approach to increase robustness, especially in small-sample settings. Recently, beta divergence-based CSP algorithms (e.g., [30] and [31]) incorporate different types of regularization schemes into the optimization process and improve the performance robustly. The extended common spatial pattern (ECSP) [4], which aims to obtain enough variance information from the prior knowledge available data (e.g., the mean data matrices of the two classes), has been reported to achieve high performance in the dataset IIb of BCI competition IV. However, these methods still neglect the frequency information, and hence cannot provide to portray the rhythmic activities in the sense of signal analysis.
The second category passes the signal through a filter bank and then finds spatial filters from the filtered EEG signal. The Filter Bank Common Spatial Pattern (FBCSP) [14] and its variants (e.g., SFBCSP [15] and SCSSP [16]) belong to this category. Although these methods have been reported to achieve high performance, the spectral range of each subband must be predetermined. As an improvement, Bayesian spatio-spectral filter optimization (BSSFO) [17] is proposed to devise the arbitrary frequency bands which are generated by a particle-based approximation method. More recently, an error-correcting output coding-based CSP (ECCSP-TB2B) is proposed by Shahtalebi and Mohammadi [19]. In the part of feature extraction, the ECCSP-TB2B extends the BSSFO technique by using two frequency bands within each particle. It is worth pointing out for the BSSFO and ECCSP-TB2B that the spatial weights are optimized by CSP and the frequency bands are found by an optimization problem based on the approximate estimation of the posterior probability density function. This means that the cost functions for these two optimization problems are different.
The third category (e.g., [23]- [26]) performs the simultaneous optimization of spatial and spectral filters. The spatio-temporally regularized CSP (STRCSP) [23] embeds several time-delayed signals into raw EEG signals, and obtains spatial and temporal bandpass filters simultaneously based on CSP computation. However, the small time-delayed number limits the flexibility of the spectral filters. The discriminative filter bank CSP (DFBCSP) [25] conducts the optimization of bandpass temporal filters and spatial filters by solving two subproblems of generalized eigenvalue decomposition sequentially and alternatively. Along with a different line, the iterative spatio-spectral patterns learning (ISSPL) [24] optimizes temporal filters equivalently in the frequency domain by alternate iteration. At each iteration, the ISSPL computes spatial filters by the CSP technique, whereas the spectral filters are equivalently solved by the max-margin machine of a classifier. In other words, two different criteria are respectively used for spatial and spectral optimization. As an improvement on ISSPL, the maximizing mutual information of spatial spectral features (MMISS) [26] utilizes the criterion of maximizing the mutual information instead of the strategy ISSPL uses. However, as mentioned by the authors, MMISS is limited in finding the global optimum because the convergence is related to a gradient search method.
Overall, in order to obtain the class-discriminative bands, it is preferred to optimize spectral filters rather than using a fixed filter bank. As mentioned above, most of the studies derive the spatial and spectral filters alternately by two different optimization criteria. The alternating iteration is hard to guarantee to be converged because the cost functions for these two problems are different. Furthermore, these algorithms rarely consider the optimization to combine the spatial-spectral filters and other patterns. Motivated by overcoming the major issues of the methods mentioned above, we develop a new algorithm, named Spectrum-weighted Tensor Discriminant Analysis (SwTDA), for EEG classification. EEG data are firstly transformed into the time-frequency domain for forming tensor-based samples by adding the frequency mode. Then the SwTDA finds the spectral filters with the spatial filters and other associated patterns (e.g., temporal pattern) based on Fisher's discrimination criterion from the repeatedly calculated residues of the original tensor data. Our method belongs to the third category and contributes to three primary aspects.
First, because of processing EEG signals in the form of multiway arrays, the proposed method can find spectral filters together with spatial and temporal patterns. Some recent researches (e.g., [34]) have shown that temporal information can contribute to high classification accuracy. The decomposition that takes into account spatial-spectral-temporal information and their consistency for trials can provide the advantage of capturing interactions and couplings among various variables (e.g., temporal, spectral, and spatial components) often with physical or physiological meanings and interpretations [7]. Hence features extracted by the SwTDA preserve intrinsic interactive information between multidimensions, which can help for boosting the robustness.
Second, the SwTDA yields the frequency mode through time-frequency transform so that it can find spectral weights directly in the frequency domain. This contrasts with the STRCSP, where the low order filters limit the flexibility in shaping the passband and stopband. It also differs from the BSSFO and ECCSP-TB2B, where the parameters of spectral filters are defined as cutoff frequencies and thus the passband and other frequency characteristics are only decided by the used filter.
Third, the SwTDA achieves the simultaneous optimization of spatial-spectral filters and other patterns based on maximizing tensor-based Fisher's discrimination criterion, which has a single cost function and has been proved to guarantee monotonic convergence in reference [20]. Other methods, such as ISSPL and ECCSP-TB2B, do not preserve this characteristic because the objective functions for optimizing the spatial and spectral filters differ from each other.
The remainder of the paper is organized as follows: the mathematical formulation of the proposed SwTDA method is presented in Section II. The experimental results come on Section III. Then, the discussions are carried out in Section IV. Finally, Section V concludes the paper.

II. SPECTRUM-WEIGHTED TENSOR DISCRIMINANT ANALYSIS (SwTDA) A. NOTATIONS AND BASIC TENSOR CONCEPTS
We denote tensors (i.e., multi-way arrays) by boldface Euler script letters. For example, X ∈ R I 1 ×I 2 ×...×I N is a N -order (or mode) tensor, where I n (n = 1, . . . , N ) is the size of the mode-n. Matrices are denoted by boldface capital letters, e.g., X ∈ R I ×J ; vectors are denoted by boldface lowercase letters, e.g., x; scalars are denoted by lowercase letters, e.g., x. The mode-n product of a tensor X ∈ R I 1 ×I 2 ×...×I N with a matrix A ∈ R J ×I n , is defined as The product of a tensor Elementwise, we have

B. FEATURE EXTRACTION
Assume that we have a training set, where I 1 is the number of channels and I 2 is the number of sample points, and M is the total trial number.
In order to optimize spectral filter directly in the frequency domain, each trial data is represented as a 3-order tensor with modes (channel × time × frequency) by wavelet transformation. To be more specific, the tensor X m ∈ R I 1 ×I 2 ×I 3 of the m-th trial is given by the amplitude of wavelet transformed signal where ψ is the wavelet function. We select the complex Morlet wavelet, ψ(t) = 1 , as the mother wavelet. The bandwidth parameter f b is set to 2, and the wavelet center frequency f c is set to 1, since these parameters have been successfully applied in the time-frequency analysis of EEG signal [22]. If the broad frequency range we focus on is [f 1 , f 2 ], the mode size of the frequency domain, I 3 , is decided by . . , f 2 ]; size( · ) return the element number of the vector. Note that f r = 1 Hz is used in this study.
The tensor X m consists of I 3 spatio-temporal matrices, each of which denotes a frequency component. Given a spectral filter p = [p 1 , p 2 , . . . , p I 3 ] T ∈ R I 3 , we define the spectral projection as the weighted sum of the matrices: where X m,f = X m (:, :, f ), f ∈ {1, 2, . . . , I 3 }. p f denotes the weight that X m,f contributes to feature. A spatial projection (named spatial filters) and a temporal projection are used to reduce the dimension of the spectral projection Y m . Given a spatial filter u r ∈ R I 1 and a temporal VOLUME 8, 2020 pattern w j ∈ R I 2 , the spatial-temporal projection can be formulated as where z r,j is the projection signal of X m by {u r , w j , p}. Finally, the combination of the spatial, temporal, and spectral projections can be written as the product of the tensor X m with {U, W , p} , J is the number of temporal projection; Z m consists of the projection features and z r,j is the element (r, j) of Z m . Fig. 1(a) illustrates the principle of SwTDA. Fisher's discrimination criterion (FDC) can be adopted to find {U, W , p}. The reason is that maximizing FDC has a single cost function for the optimization problems and guarantees monotonic convergence [20]- [22]. The scatter matrices and the corresponding FDC can be defined as: where S w and S b are within-class and between-class scatter matrices respectively; K is the number of classes; k m is the class label for the m-th training sample;Z k m is the mean feature of class k m ; andZ is the mean of all sample feature, i.e.,Z = 1 M M m=1 Z m . · 2 denote L 2 Frobenius norm. There exist no closed-form solutions for (10) to determine {U, W , p} simultaneously, so the alternating projection method that fix all but one is used as a suboptimal solution. Please see Appendix A for details to solve the problem (10).
Assuming that p 1 , . . . , p D are the spectral filters we finally sovle for, equation (8) can be rewrote as follows repeat 6: fix U d and W d to solve the problem: p d = arg max (S b /S w ); 7: fix W d and p d to solve the problem: U d = arg max (S b /S w ); 8: fix U d and p d to solve the problem: until Stop criterion is met 10: This means that we obtain a set of projections with the scatter ratio criterion from the repeatedly calculated residues of X m , and each p d explains the most discriminative frequency pattern of the remaining variation. As shown in Fig. 1(b), the decomposition model of SwTDA can be represented as The pseudocode of the proposed SwTDA is summarized in Algorithm 1. Note that the SwTDA is easily extended into four or more higher order case. To be more specific, given X m is a L-order tensor, equation (12) can be rewrote as is the projection matrix in mode-l and Z m,d ∈ R I d,1 ×...×I d,L−1 is the feature tensor.
Indeed, if D is set to 1, SwTDA is a special case of GTDA [20] or HODA [22] with the objective of maximizing the scatter ratio. Furthermore, SwTDA is a heuristic approach with residue calculation, where the spectral filters are obtained one after another. Therefore, there is no orthogonality between any two spectral filters. Additionally, mathematical arguments in reference [20] show that maximizing the scatter differential (or ratio) by the alternating method guarantees monotonic convergence. The MATLAB code of SwTDA is available upon requests from the authors. 93752 VOLUME 8, 2020 Features of a trial can be reshaped to form a vector v m ∈ R T , where T = D d=1 I d,1 × I d,2 . . . × I d,L−1 . In order to obtain the most discriminative features for classification as well as reduce the feature dimension, Fisher Score [27] is employed for ranking the features. To be specific, given feature matrix V ∈ R M ×T for the M training samples, let µ denote the mean feature vector of all classes whereas µ k and σ k respectively denote the mean and standard deviation vectors of the k-class feature vectors. The i-th Fisher score, which corresponds to the i-th column of V , is computed below: where µ k,i and µ i are the i-th element of µ k and µ respectively, and σ k,i is the i-th element of σ k . Significant features corresponding to the largest N (N T ) scores could be chosen for training the classifier. In the testing phase, a trial tensor X t is first projected to the feature components, and then the components are selected and fed into the classifier for predicting the class label.

III. EXPERIMENTS AND RESULTS
In this section, we compare the proposed method with the following algorithms: (i) the conventional CSP; (ii) the FBCSP method [14]; (iii) the STRCSP method [23], which optimizes bandpass temporal filters with spatial filters in the CSP-based framework; (iv) the divCSP [30], which is based on divergence measures and incorporates robust property into optimization process; (v) the CSP\AM-BA-SVM method [28], which is based on the idea of FBCSP but applies a hybrid Attractor Metagene (AM) and Bat Algorithm (BA) for feature selection along with SVM optimization; (vi) the ECSP method [4], which takes into account prior knowledge available from data and leads to a set of features that enhance the discrimination between the two classes; (vii) the method in reference [34], which employs the FBCSP to select the temporal representation for a convolutional neural network, named as Channel-wise Convolution with Channel Mixing (C2CM); and (viii) the ECCSP-TB2B [19], which iteratively optimizes a set of particles (each particle consisting of 2 bandpass filters) and obtains CSP-based features from the bandpass filtered EEG signal. The experiments are performed based on two public databases. The performance is measured by the classification accuracy, i.e., acc = M c M a × 100%, where M c represents the number of correct classification trials, and M a is the total trial number of the testing set.

A. DATASET ASSESSMENT
Dataset A was from international BCI dataset IVa of the BCI competition III (available at http://www.bbci.de/ competition/iii/) [13]. EEG signals from five healthy subjects (aa, al, av, aw, ay) were recorded at 1000 Hz sampling rate from 118-scalp positions with high-pass and low-pass filters 0.05 and 200 Hz. This dataset consists of two motor imageries: right hand and right foot. Table 2 shows the respective number of training trials and test trials for each subject. The EEG data were down-sampled to 100 Hz (by picking each 10th sample) in our experiment.
Dataset B was provided by Yi et al. [5] (available at https://doi.org/10.7910/DVN/27306). This dataset consists of six kinds of motor imagination tasks, i.e., left-hand (LH), right-hand (RH), both-hand (BH), foot (F), left-hand with right-foot (LH&RF), and reft-hand with left-foot (RH&LF). Signals of 64 EEG channels were recorded from ten healthy subjects (S1-S10) with sampling rate of 1000 Hz, and an additional 50 Hz notch filter was used during data acquisition. Similarly, all data were down-sampled to 100 Hz for analysis. A total of 480 trials were collected for each subject (80 trials per task, among which top 60% numbers of trials were used for the training data set, and the remaining trials were for testing). In order to evaluate the single-trial binary classification performance between sMI and cMI tasks, nine binary classification subsets were constructed from the data of each subject: LH vs. BH, LH vs. LH&RF, LH vs. RH&LF, RH vs. BH, RH vs. LH&RF, RH vs. RH&LF, F vs. BH, F vs. LH&RF and F vs. RH&LF. Besides, another three subsets were constructed to evaluate the classification performance between two cMI tasks, i.e., BH vs. LH&RF, BH vs. RH&LF, and LH&RF vs. RH&LF. Thereby, there were 12 binary subsets for each subject, and finally, a total of 120 subsets were obtained. Fig. 2(a) shows the general pipeline for all comparative algorithms. Although a wide range of classifiers is available, a simple classifier is preferred so that the performance is mainly contributed by the feature extraction algorithm rather than the classifier. We typically used linear discriminant analysis (LDA) to model and classify MI tasks except for the CSP\AM-BA-SVM and C2CM methods. If there are no special instructions, details of implementation for both datasets are listed as follows. A homogeneous setting of channels and time window is adopted for all the comparative methods for a fair comparison.

B. PARAMETER SELECTION AND EXPERIMENTAL PROCEDURE
In the preprocessing stage, the EEG data of 40 channels (see Fig. 2(b)) were filtered in 7-30 Hz with a fifth-order Butterworth filter (the FBCSP and C2CM were the exceptions for this procedure), and the signals from the period of 0.5-3.5s were selected for each trial. So a trial was represented as a matrix of 40 channels × 300 time points, and for the SwTDA, the tensor of each trial was size of 40 channels × 300 time points × 24 frequency bins according to the equations (4) and (5). Since the number of time points was much  larger than the number of channels or frequency bins, the scatter matrices consequently would cause an ill-conditioned eigenvalue problem. Hence, the strategy of multiple time intervals was used in practice. To be specific, each trial was finally reshaped as a 4-D tensor with modes 40 channels × 15 time segments × 20 time points × 24 frequency bins. For FBCSP or C2CM, the filter bank was designed to cover the frequency band of 7-31 Hz, which consisted of six nonoverlapping temporal bandpass filters that covered a bandwidth of 4 Hz.
In the stage of feature extraction, three pairs of spatial filters corresponding to the largest and smallest eigenvalues were obtained by CSP, divCSP, STRCSP, ECSP, each sub-band of FBCSP, C2CM, or ECCSP-TB2B. For SwTDA, the number of spectral filters D was set to 3. For the d-th spectral filter, we assumed J d = S d = R d = 3, where S d was the number of time segment patterns. Moreover, we also assumed J 1 = J 2 = J 3 .
Furthermore, other parameters that we tuned to obtain the best classification accuracy were decided by 5 × 5 cross-validation on the training set for each method and subject (or subset). The dimension of log-variance features was chosen out of {2, 4, 6} for CSP, FBCSP, divCSP, and STRCSP, whereas the number of features N was selected from 1 to 6 with step 1 in SwTDA. The ECSP method uses linear regression method LASSO to decide the feature dimension. The regularization parameter λ 1 in the LASSO cost function was selected from a set of 200 logarithmically spaced points between 1e-5 and 1. In the C2CM, temporal features were extracted from 2×N s spatially filtered channels. The parameter N s was chosen out of {1, 2, 3}. The filter order and regularization coefficient in STRCSP were selected from the sets {6, 8, 10, 12, 14, 16} and 10 −6 ×{1, 2, . . . , 50} respectively. In divCSP, the parameter λ was selected from the set of 11 candidates {0, 0.1, 0.2, . . . , 1} and beta parameter was chosen out of {−0.0005, −0.001, −0.0015, . . . , 0, 0.5, 1}. For the ECCSP-TB2B method, the number of particles was selected within the range of 5 to 50 with steps of 5.
Besides, the divCSP, ECCSP-TB2B, and SwTDA need a stop criterion, which guarantees a stable convergence. The stopping criterion for iterations of divCSP was that the difference between two successive iterations was less than 1 × 10 −8 , or the number of iterations has achieved 1000. The maximum iterative number for the ECCSP-TB2B was varied from 5 to 50 with steps of 5, which was determined by cross-validation. In the case of learning one spectral filter for SwTDA, the iteration stopped when the change of Fisher ratio between two consecutive iterations was less than a preset threshold 6 × 10 −4 or iteration number reached 40. Table 3 shows the testing classification accuracies for dataset A. We can see that our algorithm achieves the highest mean accuracy of 86.71%. Although the FBCSP obtains the best FIGURE 3. Scatter plots showing classification accuracies of the SwTDA (y -axis) against six other methods(x-axis). The points above the diagonals indicate the superiority of the algorithm on the y -axis over the one on the x-axis. The p-value of the Wilcoxon signed rank test is shown at the right-bottom corner. The null hypothesis of the Wilcoxon test is that the median of the accuracy rate differences is greater or equal to zero. For p < 0.05, we reject this null hypothesis; thus, we say that our method significantly outperform the baseline.

C. RESULTS FOR DATASET A
performance for subjects al, aw, it only gives slightly better mean accuracy than CSP, because of weak accuracy for subjects ay. It is worth noting that the STRCSP method obtains weak classification performances in this experiment. However, it yields 83.52% of average accuracy when we only modify the filtering band from 7-30 to 4-47 Hz. Moreover, the combination of STRCSP and SFLDA (sparse fisher linear discriminant analysis) classifier has been proved to be effective in [23]. Hence, two likely explanations for the poor performances by STRCSP+LDA include: (1) the STRCSP algorithm constructs for each trial an augmented data that contains both the original EEG signal and N −1 time-delayed copies, where N is the order number of the frequency filters. In order to alleviate serious dimensionality issue, fig3 the flexibility of the frequency filters with a small order (N ≤ 16) is still limited for spectral optimization; and (2) good performance of the STRCSP needs the classifier with sparsity to address the curse-of-dimensionality issue.
As an improvement on FBCSP, the CSP\AM-BA [28], which selects the frequency bands based on the Bat algorithm (BA), is reported to achieve an average accuracy of 85.01%. And the ECCSP-TB2B finds the particles (2 frequency bands per particle) based on a Bayesian framework and yields 85.36% of mean accuracy. We have also compared the proposed method with the C2CM [34], which is considered as one representative method of deep learning applied to EEG-based BCI. The advantage of deep learning is that a large number of hidden layers can be used for extracting features out of input data. In our applications, the C2CM achieves good performances except for the data of subject ay, which only has a small number of training samples. As mentioned in [7], [34], one challenge of deep learning is how a small number of samples can be used to train high capacity networks without suffering from the overfitting issue. Furthermore, an excellent deep learning method still needs to include a data preparation stage, in which a feature extraction method is employed to represent EEG signals in a compact relevant manner without any significant loss in information [32]- [34].

D. RESULTS FOR DATASET B
Compound limb motor imagery (cMI) is a more complex cognitive process than simple limb motor imagery (sMI) [18]. Fig. 3 shows the classification accuracies of the testing sets of the 120 subsets. The accuracy rate of our approach is represented on the y-axis, i.e., if the coordinate point is above the diagonal line, then our method outperforms the method represented on the x-axis. The red number at the lower right corner denotes the p-value when applying the one-sided Wilcoxon sign-rank test. The null hypothesis of the Wilcoxon test is that the median of the accuracy rate differences [our method (y-axis)-baseline method (x-axis)] is greater or equal to zero. For p < 0.05, we reject this null hypothesis; thus, we say that our method significantly outperforms the others.
We can see from the plots that the SwTDA significantly increases classification accuracy rates compared to the CSP (p = 0.0001), FBCSP (p = 0.0013) and STRCSP (p < 0.0001). Although the improvements are not significant, we also can see that the SwTDA performs better than the divCSP (p = 0.2192), ECSP (p = 0.1980) and ECCSP-TB2B (p = 0.2312). The results of 120 subsets are classified into 12 groups according to the task of motor imagery. The mean classification accuracy for each group is provided in Table 4. The results show that the SwTDA achieves the best mean accuracy of 84.4%, which is slightly better than the accuracy of ECCSP-TB2B.

A. PERFORMANCE ANALYSIS
Since the frequency band of ERD/ERS is usually subjectspecific, the CSP works well in some subjects but fails at others when the raw EEG signals are bandpass filtered with a fixed frequency range 7-30 Hz. As a further extension of CSP, the FBCSP obtains the EEG rhythms from a filter bank and hence performs slightly better than the CSP. The divCSP encompasses different types of robust properties in the computing process of the spatial filters and hence performs excellently in a broad frequency range. But this also means that it is hard to obtain exact frequency information for portraying rhythmic activities. Different from the CSP, FBCSP, and divCSP, the STRCSP allows finding temporal bandpass filters within the CSP computing. However, as a feature extraction method, the STRCSP suffers from deterioration of performance when combined with a simple LDA classifier in our applications.
The ECSP uses prior knowledge available from data to produce discriminative features and hence significantly outperforms the conventional CSP method. However, similar to divCSP, the ECSP neglects to find the class-discriminative bands, which are crucial for capturing the ERD/ERS change. From the results of two datasets, the SwTDA method that optimizes the spectral filter appears to achieve better performance than the ECSP. In our applications, the ECCSP-TB2B can achieve excellent results in most of the cases. However, deterioration of performance happens on some subsets. During the optimization procedure, the particles (i.e., frequency bands) are updated by adding a disturbance that follows a normal distribution to the values of the band limits. Then the filter bank is checked for continuing the next update or stopping the iteration. Because it is difficult to traverse all the candidates of the cutoff frequencies, the value of the variance in the normal distribution and the number of iterations need to be selected appropriately, or the final bands may be suboptimal in the sense of classification enhancement.
Besides, defining the bandpass filters by a simple band limitation-based way falls into the case that the passband and other frequency characteristics will only be determined by the selected filter. This problem can be demonstrated in the next subsection.
The SwTDA is proposed to find the patterns by tensorbased discriminant analysis, which is a natural extension of LDA. The maximization of Fisher's discrimination finds the spectral filters and the associated spatial-temporal patterns simultaneously. On the one hand, unlike the conventional CSP and FBCSP, the proposed method directly designs the user-specific parameters for the filters such as the passbands. On the other hand, different from ECSP and ECCSP-TB2B, the proposed method has the opportunity to capture the interactions and couplings within multi-dimensions and seems promising to lead to more robust EEG classification. The results confirm that the SwTDA is effective in extracting discriminative features of the ERD/ERS activity.

B. SPATIAL-SPECTRAL OPTIMIZATION
One advantage of spatial-spectral optimization is that it allows to reveal discriminating parts in the spectrum and topography. To demonstrate differences between the spatialspectral filters obtained by the comparative methods, one specific subset is picked up for subject S8. During each trial of this subset, the subject performs motor imageries of F and LH&RF. Fig. 4 shows the spatial and spectral filters for this subset. The r 2 -value [13], [21] which is listed beneath each weight map is used to evaluate the discriminative information derived from corresponding spatial pattern. We define the r 2 -value as where z 1 and z 2 are the features generated from the same spatial filter for class 1 and 2 respectively, and N 1 and N 2 are the sample numbers of calss 1 and class 2 respectively. Larger r 2 -value indicates a higher separability of features. We can observe that all the methods have extracted the important spatial components for this specific example. Although there exist similar weight maps between some methods, different r 2 -values are obtained because of adopting different optimization criteria. Features generated from the SwTDA may be more correlated with the imagery tasks for the highest r 2 -value. To be specific, the topographic maps provided by SwTDA show that, channels (i.e., FC2 and FC4) over centro-parietal on the right brain hemisphere show the most influential weights with opposite directions for discrimination respectively, which would explain underlying cortical activity patterns of subject S8 during the motor imagery tasks. Fig. 4 also provides the spectral information associated with the spatial filters. Note that the CSP, FBCSP, divCSP, ECSP and ECCSP-TB2B obtain frequency response curves by 5-th order Butterworth bandpass filter, and the STRCSP achieves them by the optimized temporal filters whereas the spectral weights of SwTDA denote the amplitude-frequency characteristics directly. We can observe that all the methods have included the important components of higher alpha (10-16 Hz) rhythms. But for the scheme that adopts the fixed filter bank strategy (such as the FBCSP), since each frequency band is treated independently, possible correlations between different EEG rhythms may be ignored when it is used for further analysis of the signal. On the other hand, we cannot obtain the spectral characteristics exactly from the curves obtained by the STRCSP, since the bandpass filter with low order limit the flexibility in shaping the passband and stopband. By contrast, the spectral filter obtained by SwTDA clearly exhibits that the important rhythms concentrate on 12 Hz and 24 Hz. This is in accordance with the spectral characteristics presented by the ECCSP-TB2B. However, unlike the ECCSP-TB2B where the passband characteristics are decided by the used filter (e.g., Butterworth filter), the proposed SwTDA finds spectral weights directly in the frequency domain, which is more flexible in depicting the frequency characteristics.

V. CONCLUSION
In this study, we introduce a method of Spectrum-weighted Tensor Discriminant Analysis (SwTDA) to motor imagery classification in BCI application. The SwTDA employs the technique of tensor-based discriminant analysis to solve for the combined optimization problem of spatial and spectral filters. The alternative and iterative learning method is adopted to optimize the cost function and derive spatial-spectral filters together with other patterns. Therefore, features extracted by the SwTDA can reflect the intrinsic characteristics in multiway data, which assists to enhance discriminability and robustness. Under the same condition, the experimental results of two EEG datasets demonstrate that the proposed method improves the classification accuracy in comparison with state-of-the-art methods. Furthermore, significant spatial-spectral patterns that are determined by the SwTDA are useful for studying the oscillatory patterns of different EEG tasks and contribute to a better understanding of the task mechanisms. Finally, since the extracted features are tensors, the combination of SwTDA and the tensor-based classifier (e.g., Riemannian geometry classifier and deep neural network) deserves to be explored for further improving the classification performance.