Motor Imagery EEG Classification Based on Riemannian Sparse Optimization and Dempster-Shafer Fusion of Multi-Time-Frequency Patterns

Motor imagery-based brain-computer interfaces (MI-BCIs) features are generally extracted from a wide fixed frequency band and time window of EEG signal. The performance suffers from individual differences in corresponding time to MI tasks. In order to solve the problem, in this study, we propose a novel method named Riemannian sparse optimization and Dempster-Shafer fusion of multi-time-frequency patterns (RSODSF) to enhance the decoding efficiency. First, we effectively combine the Riemannian geometry of the spatial covariance matrix with sparse optimization to extract more robust and distinct features. Second, the Dempster-Shafer theory is introduced and used to fuse each time window after sparse optimization of Riemannian features. Besides, the probabilistic values of the support vector machine (SVM) are obtained and transformed to effectively fuse multiple classifiers to leverage potential soft information of multiple trained SVM. The open-access BCI Competition IV dataset IIa and Competition III dataset IIIa are employed to evaluate the performance of the proposed RSODSF. It achieves higher average accuracy (89.7% and 96.8%) than state-of-the-art methods. The improvement over the common spatial patterns (SFBCSP) are respectively 9.9% and 12.4% (p < 0.01, paired t-test). These results show that our proposed RSODSF method is a promising candidate for the performance improvement of MI-BCI.

Riemannian features. Besides, the probabilistic values of the support vector machine (SVM) are obtained and transformed to effectively fuse multiple classifiers to leverage potential soft information of multiple trained SVM. The open-access BCI Competition IV dataset IIa and Competition III dataset IIIa are employed to evaluate the performance of the proposed RSODSF. It achieves higher average accuracy (89.7% and 96.8%) than state-of-the-art methods. The improvement over the common spatial patterns (SFBCSP) are respectively 9.9% and 12.4% (p < 0.01, paired t-test). These results show that our proposed RSODSF method is a promising candidate for the performance improvement of MI-BCI.
Various methodologies have been proposed to extract discriminant patterns from high-dimensional EEG signals [11], [12], [13]. The spatial patterns associated with ERD generated from different MI tasks could be identified [14]. The most frequently used method is the common spatial patterns (CSP) proposed in [15]. Whereas, the effectiveness of CSP is This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ very sensitive to the selection of frequency bands and easily affected by irrelevant noise [16]. Thus, a fixed time window for feature extraction may lead to low classification accuracy [17]. There are numerous extensions of traditional CSP to optimize subject-specific operational time windows or frequency bands, such as sub-band CSP (SBCSP) [18], filter bank CSP (FBCSP) [19] and sliding window discriminative CSP (SWDCSP) [20]. Advanced algorithms by dividing time window and frequency bands effectively improve the classification accuracy, however, these methods still bring much information redundancy and multiply the decoding time cost [21]. To solve the problem of ineffectiveness, several feature selection algorithms emerge to automatically select the corresponding extracted features, such as sparse CSP (SCSP) [22] and sparse FBCSP (SFBCSP) [23] which select features through sparse learning.
With regard to further effectively utilizing discriminative information from multiple time windows and frequency bands, several heuristic methods have been proposed to automatically select one or a set of time windows and frequency bands [24], [25], [26]. Though these methods contrive to improve the final decoding performance, the feature selection procedure should not be an open/shut case [27]. In [24], the frequency bands are selected through the affinity propagation of multi-filter bank CSP features, which contain the cluster centers. In [25], time windows are selected based on correlation analysis and performance evaluation. In [26], the frequency band and time interval are coded with the particles and the optimal settings of them can be simultaneously detected by the evolution of particles for individual subjects. However, valuable messages in abandoned bands would be lost. Because the optimal time-frequency bands change rapidly and even vary between trials [28]. However, simply concatenating extracted features of all time-frequency bands would cause much redundancy in information and increase the difficulty of classification then multiplying the calculation time of algorithms [29]. Moreover, the Riemannian method has been wildly applied in MI-BCIs for its robustness [30]. A. Barachant et al. successfully utilized the symmetric positive-definite (SPD) spatial covariance matrix [31], to introduce the robustness of Riemannian distance and Riemannian mean into the BCI system and build more stable decoding models. They also combined the Riemannian method with state-of-the-art algorithms by mapping the data onto the Riemannian tangent space (RTS), where the matrices could be vectorized to fit traditional Euclidean models [32].
Aiming at solving the preceding problems, we propose a Riemannian sparse optimization and Dempster-Shafer fusion approach (RSODSF) for MI-BCIs classification to effectively utilize the underlying information of multiple time-frequency patterns. Raw EEG signals are segmented and filtered into multiple time windows and overlapping sub-bands respectively. Then discriminative features of each sub-band from Riemannian tangent space are extracted and then optimized through sparse learning to enhance the stationarity and robustness of the algorithm. Moreover, Dempster-Shafer theory (DST) is introduced to implement the classification fusion of multi-time windows. After we train a support vector machine (SVM) classifier with the linear kernel for the classification, we conduct a series of experiments to evaluate our proposed algorithms with different public datasets of BCI Competition. The major contributions of this work are:1) This work combines the Riemannian geometry of the spatial covariance matrix with sparse optimization to extract more robust and distinct features. 2) The Dempster-Shafer theory is introduced and used to effectively fuse the probabilistic outputs of each time window and reduce uncertainty of the final prediction.
The remainder of this paper is organized as follows. Section II elaborates on the materials and proposed method. Section III and IV describe the public datasets we use for experiments and the experimental setup, show the result and discussion, respectively, and conclude in Section V.

II. METHODOLOGY AND MATERIALS A. Riemannian Feature Extraction
In MI-BCIs, Riemannian geometry is applied to the covariance matrix of EEG signal, which is one of the most frequently used statistical properties [33] and has the property of being symmetric positive-definite (SPD) and located in the Riemannian manifold [34].
Let the X = [x(t), . . . , x(t + L − 1)]∈R N C ×L be the bandpass filtered EEG signal, N c , L respectively demonstrates the number of channels and sample points. The N c x N c spatial covariance matrix C and Riemannian tangent space features can be calculated using the following steps.
1) Bandpass filter the raw EEG data at a specific timefrequency band.
2) Calculate the N c x N c spatial covariance matrix C of each trial using Eq. (1): 3) Use the iterative algorithm to calculate the Riemannian mean between N covariance matrices, which has a minimum sum of the squared distances to all covariance matrices [35]. N demonstrates the number of EEG trials.
where C 1 , C 2 are two different covariance matrices respectively, λ i denotes the i-th eigenvalue of C −1 1 C 2 , || || F denotes the Frobenius norm [36], and log() is the log-matrix operator.
4) Use the logarithmic mapping to project the data points from Riemannian manifold to its corresponding tangent space, as shown in Fig. 1. and expressed in Eq. (4). The projected data points could be mapped back to the manifold by the exponential mapping, i.e., the exponential function of the matrix [35]. 5) The projected data, i.e., the symmetric matrices can be vectorized into Euclidian objects and further used for classification, expressed in Eq. (5).
Considering the symmetry property of the projected matrices, the non-diagonal elements of C need to multiply √ 2 to maintain the standard when vectorizing the projected matrices of each trial [37].

B. Probabilistic Support Vector Machine
Support Vector Machine (SVM) is one of the most popular machine learning algorithms and has been broadly used in binary classification situations [38], which separates differential data samples by constructing a set of optimal highdimensional hyperplanes. These hyperplanes are capable of maximizing the margins between two features samples, and are least affected by local disturbance of samples [39]. Unlike advanced deep learning methods [40], SVM does not require large scales of data and is suitable for small and medium-sized nonlinear data samples [41]. Moreover, it has a potent generalization ability for unknown data and flexibility for decision making [38]. For consideration of expedience, we select the SVM with a linear kernel as our fundamental classification tool, as expressed in Eq. (6).
where the weight vector ω and an offset b of the hyperplane could be obtained through the minimum optimization, φ(x i ) is the function that maps x i into a higher-dimensional space, C is the penalty parameter of the error term, and ξ denotes the slack variable [38].
To ensemble the SVM outputs trained through multiple time windows, we further introduce an SVM-Platt model to transform the outputs into posterior probabilities [42]. The transformation steps are as follows: 1) Calculate the distances d between the sample points and decision boundary.
2) Note that if the distance is a positive value, the sample belongs to a positive class [43]. The transformed posterior probability ρ j is actually the probability that a sample belongs to a specific class and could be computed by fitting a two-parameter sigmoid function of the distance value d j which matches the posterior that is empirically observed [44].
where exp (•) is the exponential function, A and B are the parameters representing the slope and the intercept [42].
3) Transform the labels into the probability target t j , in order to fit the afore sigmoid function. In our study, labels 1, and 2 are respectively demonstrating left-hand and right-hand MI tasks. As expressed in Eq. (9), the right-hand MI task, i.e., label = 2, is the positive target for example.
4) Find the optimal parameters A and B through a crossentropy function, as expressed in Eq. (10). The two-parameter minimization is fitting by using a maximum likelihood estimation from the training set, which finds the two optimal parameters by minimizing the negative log likelihood of the training data [42].
Since we obtain the transformed probabilistic outputs of several trained SVM models, we contrive to find a way to maximumly use underlying information within and between classifiers and effectively fuse multiple posterior probabilities.

C. Dempster-Shafer Theory
Dempster-Shafer Theory (DST) is an evidence theory presented by Dempster [45] and further developed by Shafer [46], which is an extension of the Bayesian inference method by introducing the uncertainty of our knowledge into the probabilistic space. DST could combine all the available evidence come from different sources about a specific event [45] and applied in BCI context [47]. In our context, it is a binary classification event utilizing sources of multiple time windows.
In short, given a finite and exhaustive set of possible conclusions to a question = 1 , 2 , . . . , n . 2 includes all possible subsets of and could be presented as ∅, 1 , 2 , . . . , n , 1 , 2 , { 1 , 3 , . . .. DST would assign a probability for each subset, called mass function and satisfies the following conditions: where ∅ is an empty set, the focal element E A is a hypothesis to the event and may include one or more conclusions. DST also defines belief and plausibility functions to model the uncertainty, expressed in Eq. (12).
The smaller the distance between the plausibility and the belief, the lower the uncertainty. In order to further mitigate the uncertainty, we can fuse two sets of evidence of focal element E A through joint mass function, expressed in Eq. (13). E B and E C respectively demonstrate the hypotheses to the event from two sets of evidence.

D. Riemannian Sparse Optimization and Dempster-Shafer Fusion of Multi-Time-Frequency Patterns (RSODSF)
Since the responding time and optimal operational frequency bands vary between subjects. We contrive to fully take benefit of scarce EEG resources and to operate the decoding better. Given N trials of EEG signals X ∈ R N×N c ×L , and X is segmented into a set of T overlapping time windows and K overlapping sub-frequency bands. After the preceding feature extraction procedure, the multiple time-frequency Riemannian patterns are obtained. Each time window of which could be donated as stands for the extracted Riemannian patterns from i-th frequency band.
Motivated by SFBCSP [23], a multiband optimization method is proposed and developed to select discriminative features through sparse regression [48], we tend to incorporate the sparse learning with our previous extracted Riemannian tangent space features. l1-norm is widely used due to its better optimization solution characteristics than l0-norm and sparser weight parameters than l2-norm [49]. The sparse regression model used in our study would be defined as: where y ∈R N is the label vector of EEG trials, F ∈ R N×(N C +1)N C /2 is the extracted Riemannian features and w ∈R (N C +1)N C /2 is the weight vector, l1-norm is introduced to penalize the inessential feature weights close to zero. λ denotes the regularization parameter to control the sparsity of w, which would be determined through experiments. Now that the Riemannian features of multiple frequency bands from each time window are extracted. We tend to further effectively fuse the information of T time windows in the classification stage.
In our context, two elements are corresponding to each MI task class label and would generate the frame of discernment = 1 , 2 . First, we contrive to get the mass function from every single time window. We respectively set each MI label i.e., labels=1 or 2, as the probability target to obtain the posterior probabilities P 1 , P 2 of each classifier. For the sake of expedience, we simply use the identity function to transform the probability values to mass functions.
where m i and m j are respectively the mass functions of i and j , j is the complement of i , and P i is the probabilistic outputs of SVM when label i is the target probability.
Noting that when choosing one label as the target probability, the probability output of another label would be unknown and usually set as zero, the complement of the target would be the uncertainty. However, when the value of zero is involved, the fusion result will always be zero regardless of the basic mass function allocation value of the other evidence.
In order to address the above-mentioned problems, we first contrive to fuse the mass functions of a time window by respectively setting each one of the labels as our target probability. Then we could obtain three non-zero mass functions of one model through Eq. (14), which respectively stand for three focal elements in the frame of discernment m i1 , m i2 and m iu . In our BCI context, we tend to classify two MI tasks. We preset the frame of discernment as {LH, RH, {LH, RH}} for example and the third element is the uncertainty needed to be mitigated. As shown in TABLE I, the D-S fusion procedure of one PSVM model is achieved, which actually degenerates into Bayesian probability. P1 and P2 are the probabilistic outputs of the PSVM when the positive targets are respectively left-hand and right-hand MI tasks, respectively. The values behind the elements are the basic mass functions allocated in the last section. The mass functions m1, m2 are transformed from the P1, P2 through Eq. (15).
After we obtain mass functions m i1 , m i2 and m iu of the i-th time window. We manage to further combine T segmented time windows using Eq. (13). As shown in TABLE II, the D-S fusion procedure of mass functions from PSVM models trained by two different time windows is achieved. P is the probabilistic output of the PSVM. Once the final three joint mass functions are calculated, we are now capable of implementing the decision-making by comparing two mass functions corresponding to the label. The outputs of the framework would be assigned with the larger mass function and be more credible with smaller uncertainty. The combination of the mass functions of all time windows is achieved through two together. To be specific, after one time window combines with another, the results of combination are further utilized to combine with a third one. The whole structure of the proposed method is shown in Fig. 2.

III. EXPERIMENTS AND RESULTS
A. Data Descriptions 1) Dataset 1: We choose Dataset IIa from BCI Competition IV [50] as the first dataset to evaluate the effectiveness of the proposed methods. It contains 22-channel EEG data recorded from 9 subjects performing 4 kinds of MI tasks, i.e., Left-Hand, Right-Hand, Feet, and Tongue mental tasks. Each participant completed two sessions, one for training and the other for evaluation [50]. We combine two sessions and choose Left and Right-Hand tasks to evaluate the binary classification performance, which contains 144 trials of each MI task. The training set and test set would be further divided for an 8-fold cross validation. The dataset is provided by the Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology. This work involves human subjects or animals in its research. We confirm all human/animal subject research procedures and protocols are exempt from review board approval. More details can be found on the following website: http://www.bbci.de/competition/iv/ 2) Dataset 2: We choose Dataset IIIa from BCI Competition III [51] as the second dataset to evaluate the effectiveness. It contains 3 healthy subjects performing 4 kinds of MI tasks and recorded 60-channel EEG signals with a sampling rate of 250 Hz. We downsample them to 100Hz. In our experiments, a total of 180, 120, and 120 trials for subject k3, k6, and l1 were respectively included for two chosen different MI tasks. The training set and test set would be further divided for a 6-fold cross validation. This work involves human subjects or animals in its research. We confirm all human/animal subject research procedures and protocols are exempt from review board approval. Detailed information about this dataset can be found on: http://www.bbci.de/competition/iii/

B. Experimental Setup
In this study, EEG data of 0-4 s after the visual MI cue appeared has been extracted for signal processing. We firstly used 5th order Butterworth bandpass filters to capture EEG signals from 8-32Hz wide frequency bands. The length of the time window should not be too short to include the execution time of MI and also not be too long which would add irrelevant information. However, the response time fluctuates between different subjects. Moreover, time windows and filter banks of different lengths are adopted. The divided time windows are respectively, [0. 5,4], [0.5, 3.5], [1,4]  classification SVM models, the final classification accuracy is calculated within an 8-fold/6-fold cross-validation framework for DS1/DS2. We manually search a set of hyperparameters [10^−4, 10^−3, …, 10^1] and choose λ as 10^-2 through experiments to weigh the classification accuracies and computational cost. To validate the effectiveness of the proposed RSODSF, we implement extensive comparisons with several state-of-the-art methods.
(1) SFBCSP-SVM [23]: A fixed time window EEG of 0.5-4 s after the cue, filter banks are set the same as the proposed method and extracted by SFBCSP.
(2) SW-LCR [52]: It uses the longest consecutive repetition of the sequence of prediction of all the sliding windows.
The entire evaluation is under the environment of MATLAB R2018a on a laptop with 2.60GHz CPU (i7-9750H, 8GB RAM). The classification tool is the SVM with linear kernel and default classifier parameters. The Covariance Toolbox is used which mainly focuses on Riemannian features extraction of SPD matrices and contains a series of MATLAB functions dedicated to covariance matrices estimation [55].

C. Comparison of Feature Distribution
To better interpret the experimental results, we utilize t-distributed stochastic neighbor embedding (t-SNE) [56] to visualize the extracted features. Fig. 3 presents the feature distribution extracted from subjects A01-A09 using SFBCSP and our proposed method. Compared with SFBCSP, the features extracted using our method have stronger discriminability and seem easier to be separated for the majority of subjects. As shown in Fig. 3, subjects A03, A04 and A06 have the most obvious improvement, especially A04 and A06 have an improvement of 14.3% and 11.9%, respectively. The feature distribution of A02 and A05 seems to have no improvement, yet the accuracies still improve a lot.

D. Classification Performance of Multiple Time-Frequency-Patterns Versus Canonical Time-Frequency Band
In MI-BCIs, canonical time-frequency bands are most commonly used to process the raw EEG signals, which is 0.5-4s after the visual cue and filtered within 8-32Hz. As shown in Fig. 4, the accuracies vary between time-frequency bands and the optimal operational time-frequency bands are not always the canonical ones, i.e., TW1-FB15 in Fig. 4.The multiple subbands would provide much more affluent band information that allows us to achieve more credible classification.
More specifically, as shown in Fig. 5 (a), each classifier corresponding to each time window has a relative accuracy. We can conclude from Fig. 5 (a) that the ensembled accuracies are always higher than results obtained from a single fixed time window, which also proves the effectiveness of our methods.

E. Classification Performance With Dempster-Shafer Fusion Versus Majority Voting Strategy
Numerous data fusion methods have been applied to combine outputs of multiple trained classifiers. The majority voting strategy (MVS) [57] is the most commonly adopted fusion method which is followed by the "Max Wins" rule and determines the decisions with the maximum votes. Whereas,  the unintelligible soft information between different classifiers would be lost. The proposed D-S fusion effectively resolves the ambiguity in determining the final class caused by the majority voting of binary classifier results. As shown in Fig. 5 (b), the accuracy of subject A07 obviously decreases with applying MVS, as the number of classifiers increases while its accuracy increases with applying the proposed RSODSF. This might originate from the ambiguity caused by MVS. As the number of fused classifiers grows, DFS always outperforms MVS.

F. Classification Accuracies Comparison
The experimental results of binary classification are summarized in TABLE III. The mean accuracy improvement of the proposed RSODSF method, compared with the above SFBCSP/SW-LCR/RoCSP-SRIT2NFIS/AR-CSP+SRSG-Fas-Art/RSO-SVM are respectively 9.9%/12.4%, 9.7%/11.6%, 6.4%/4.2%, 4.2%/2.4% and 8.8%/11.3% for DS1/DS2, which demonstrates that the proposed RSODSF achieves the highest mean accuracy among them and outperforms other state-of-the-art methods in processing 9 of 12 subjects. All the p-values obtained by paired t-test between the RSODSF and any of others are less than 0.01 (p<0.01). As shown in TABLE III, the improvement of RSODSF is significantly obvious when applied in A02 and A05. The classification results of these two subjects are the worst ones among 9 subjects when processed by other methods. However, the improvement is not that much notable in those subjects who originally performed well with basic algorithms. Moreover, the accuracies of subjects A03 and A08 even slightly decline, which is probably on account of the wellperformed EEG data of them and benefits from the simplicity of conventional algorithms. Since the decline in accuracies is not significant, the average performance of the proposed RSODSF still outperforms other state-of-the-art methods. The average confusion matrices of the proposed RSODSF and compared state-of-the-art methods (see from Fig.6) further evaluate the effectiveness of the proposed method. We also compare the computational time of the proposed method with other state-of-the-art methods. The average model training time and testing time is calculated and shown in TABLE III. The proposed RSODSF also outperforms in the matter of computational time. Compared with SFBCSP, the RSO method has improved the mean accuracies in both DS1 and DS2, especially for subject K6 in DS2 with an improvement of 10.3%. The use of Riemannian geometry has enhanced the robustness of the classification model. The consequent use of the SPD matrix as a feature may have led to the curse of dimensionality and increase the training time. However, the sparse optimization has been utilized to select the most discriminative features which reduces the influence of the large dimensionality of SPD matrix.

IV. DISCUSSION
Previous studies have demonstrated that Riemannian geometry could be effectively applied in BCI fields. Our proposed RSODSF method integrates advanced process algorithms and achieves the best performance among the abovementioned state-of-the-art method. However, the feature distribution of two different classes from subjects A02 and A05 in Fig. 3 still seems mixing and confusing, but the classification accuracies are significantly increased with an improvement of 22.7% and 17.7% respectively. Moreover, we can see that the feature distribution of these two subjects is spatially divided into two clusters, which probably results from the intrinsic geometry structure of the EEG data. Moreover, the accuracy improvement in the feature layer (1.3% and 1.1%) is much less than in the classification layer (8.8% and 11.3%), the key step leading to the improvement in accuracy is at the classification layer. This has further explained the obvious accuracy improvement when the feature distribution has no enhancement.
Besides, see A01 and A03 in Fig. 5 (a), we can observe the accuracies with TW2 of A01 and TW4 of A03 are higher than TW1, i.e., the most frequently used fixed time window in MI classification. These results further denote that TW1 is not always the optimal operational time interval. Due to the individual difference, the response time to the visual cue of subjects varies between subjects, and even between trials. Other achieved accuracies of time windows are obviously lower than TW1, however, the fusion of combined time windows always achieves the best accuracies which validates the necessity of utilizing multiple time windows as our research direction.
It is also worth noting that when evidence conflicts severely or completely, the D-S fusion of mass functions would cause ambiguity which would affect the classification accuracy. Nevertheless, this problem would be naturally resolved in the BCI context because of the design of paradigms. The subjects would only perform the cued MI tasks, and the possibility of occurring undesirable results would become negligible.
Although the RSODFS method has been evaluated and proved to outperform state-of-the-art methods, there still exists leaving blank space in which could be further ameliorated. First, the Riemannian sparse optimization is realized by mapping the covariances points to the tangent space and vectorizing them, which may destruct the underlying structure of original data to some extent. This could be addressed through Riemannian dictionary learning [58], which treats atoms as covariance matrices to keep the intrinsic distribution of EEG data. In addition, the improvement of decoding performance benefits from the robustness of the introduction of Riemannian geometry. However, the succeeding surging of feature dimensionality also increases the computational costs and results in the inability of the sparsity hypermeter to be large, which should be considered in future studies. Moreover, when fusing the probabilistic outputs of the abovementioned one-versusone SVM classifiers in a multi-class classification scenario, two-versus-two strategies could be together considered to improve the credibility of the final prediction.

V. CONCLUSION
In this study, we propose an integrated framework combining the robust Riemannian tangent space features extracted from multiple time-frequency bands with sparse optimization. Then the Dempster Shafer theory based fusion method is utilized to effectively fuse the transformed probabilistic values, which are obtained from trained linear kernel SVM of each divided time window to maximally exploit underlying soft information within and between multiple classifiers. Experimental results on the public dataset show that the proposed RSODSF achieves significantly higher classification accuracy compared with other state-of-the-art methods. In addition, our method is capable of extracting more discriminative features and solves the issue of ambiguity in determining the final class. In conclusion, the RSODSF is a promising candidate for improving the performance of MI-BCIs.