Tensor-CSPNet: A Novel Geometric Deep Learning Framework for Motor Imagery Classification

Deep learning (DL) has been widely investigated in a vast majority of applications in electroencephalography (EEG)-based brain-computer interfaces (BCIs), especially for motor imagery (MI) classification in the past five years. The mainstream DL methodology for the MI-EEG classification exploits the temporospatial patterns of EEG signals using convolutional neural networks (CNNs), which have remarkably succeeded in visual images. However, since the statistical characteristics of visual images depart radically from EEG signals, a natural question arises whether an alternative network architecture exists apart from CNNs. To address this question, we propose a novel geometric deep learning (GDL) framework called Tensor-CSPNet, which characterizes spatial covariance matrices derived from EEG signals on symmetric positive definite (SPD) manifolds and fully captures the temporospatiofrequency patterns using existing deep neural networks on SPD manifolds, integrating with experiences from many successful MI-EEG classifiers to optimize the framework. In the experiments, Tensor-CSPNet attains or slightly outperforms the current state-of-the-art performance on the cross-validation and holdout scenarios in two commonly-used MI-EEG datasets. Moreover, the visualization and interpretability analyses also exhibit the validity of Tensor-CSPNet for the MI-EEG classification. To conclude, in this study, we provide a feasible answer to the question by generalizing the DL methodologies on SPD manifolds, which indicates the start of a specific GDL methodology for the MI-EEG classification.


I. INTRODUCTION
A brain-computer interface (BCI) is a direct communication pathway between a user's brain and an external device by measuring and analyzing the behaviorally relevant information of brain activities.[1] The non-invasive electroencephalogram (EEG)-based BCI is one of the most common BCIs, employing portable, non-invasive electrodes on the scalp for instantaneously measuring electrical changes in neurons.It allows brain-derived communication between patients with amyotrophic lateral sclerosis and motor control restoration after stroking and spinal cord injury.[2] However, decoding mental states from EEG-based BCI is challenging for various reasons, such as low signal-to-noise ratio (SNR), artifacts, and high inter/intra-subject variabilities (a.k.a, nonstationarity changes) in EEG signals.[3] In the paradigm of traditional EEG analysis, spatial patterns of EEG signals are crafted by a preprocessing algorithm to have more strong discrimination between mental states and afterward classified using machine-learning classifiers, such Ce Ju and Cuntai Guan are with the S-Lab, Nanyang Technological University, 50 Nanyang Avenue, Singapore (emails: {juce0001,ctguan}@ntu.edu.sg).
as the support vector machine (SVM) and linear discriminant analysis (LDA).[4] Many spatial filterings such as common spatial pattern (CSP) and its variants [5]- [8] are widely used as such preprocessing algorithms to increase the SNR of signals and, therefore, enhance oscillatory brain electrical activities before feature extraction.However, the validness of the analysis is limited to the capacity of feature extraction for complex event-related and event-unrelated (resting state) neural oscillations.[9] To remedy this limitation, the architecture of CNNs has been broadly adopted as an emerging tool in BCIs [9]- [14].Technically, the scheme of these CNN classifiers is designed to automatically capture the temporospatiofrequency features of neural signals in end-to-end learning without the experience of human engineers.It has been proven effective for the MI-EEG classification in literature [4], [13].Compared with the previous non-DL approaches, CNN is making significant advances in the incredible power of representation with multiple levels of abstraction, end-to-end learning, and causal contributions of patterns on brain topography.[10] However, the essential difference in the underlying structure between images and EEG signals discernibly weakens the feature expression of CNNs in BCI tasks.Specifically, several prior assumptions in computer vision require the underlying structure of visual images to be stationary, translation invariant, translation equivariant, and stable with respect to local deformations, conceptually characterized as the Euclidean nature, which enables CNNs to effectively extract local features from local statistics.[15]- [17] In contrast, the underlying structure of EEG signals might not embody the Euclidean nature according to electrophysiological studies and (nonlinear) dynamical neuroscience [18], [19].To illustrate, a prominent example is that EEG signals are nonstationary, and their local statistics are variant to the location of spatially distributed regions.Consequently, a natural question arises whether an alternative network architecture exists apart from CNNs for efficient feature extraction in the MI-EEG classification.
In the history of the MI-EEG classification study, such an alternative discipline has been raised, which uses a graph convolutional neural network to learn the graph signal representations of EEG rhythmic components [20].Apart from their approach, in this study, we set off a novel discipline in terms of spatial covariance matrice (SCM) derived from EEG signals, which is the inherent correlation between neighbor channels, a second statistics in the spatial domain, and have been developed in CSP for over 30 years [5], [21].We aim to formulate SCMs using Riemannian geometry for an in-depth analysis of the non-Euclidean nature as many existing Riemannian geometry-based modelings in engineering disciplines such as diffusion tensor imaging and geometric mechanics [22]- [24].The Riemannian-based BCI classifier that characterizes EEG signals using the geometric information of SCMs emerged about a decade ago.[25]- [27] It has gained growing interest from the BCI community, and various follow-ups were proposed to optimize the structure [28]- [30].Technically, SCMs derived from EEG signals are inherently symmetric positive definite (SPD).The space of SPD matrices is formulated as a Riemannian manifold called the SPD manifold, provided with a specific metric.Then, the geodesic distance on SPD manifolds between two SCMs is encoded as a high-level feature for the machine-learning classifier.
The most fruitful part of the Riemannian-based BCI classifier is the conceptual importance of using SPD manifolds to characterize EEG signals.However, there are many practical drawbacks to the Riemannian-based BCI classifier.Firstly, hand-crafted feature extraction is outdated and inefficient in complex scenarios such as feature expression for nonhomogeneous BCI sensor data.Secondly, the neurophysiological interpretation of existing hand-crafted features such as geodesic distance on SPD manifolds has not yet been fully understood [31].To cope with these practical drawbacks, more recently, a novel classifier on SPD manifolds [32] is probably a promising solution, which investigates the low-level features of SCMs for the EEG classification using SPDNet [33], an existing Riemannian-based network architecture, to capture the spatial patterns of EEG rhythmic components.Architecture SPDNet is a DL architecture that preserves the SPD structure of matrices across layers and exhibits competitive performance compared with the current state-of-the-art approaches using CNNs on an increasing number of computer vision tasks.[34], [35] Its perspective generally originates from an emerging subfield geometric deep learning (GDL) [17], which aims to generalize the DL models to the non-Euclidean domain as graphs and manifolds.
In this paper, we propose a novel GDL framework, Tensor-CSPNet, that generalizes the DL methodology for the MI-EEG classification.To this end, we build a network architecture upon the principle that largely exploits the temporospatiofrequency patterns of EEG signals.Each structure in our architecture aims to capture features from either of the temporospatiofrequency domains.Firstly, tensor stacking segments EEG signals and stacks them into the temporospatiofrequency tensors according to hand-crafted technical and neurological experience.Each tensor is in an SPD-matrix representation that encodes the inherent correlation between neighbor channels with respect to the time and frequency information, which is also a rough estimation of the brain connectivity between spatially segregated areas [36].Secondly, the spatial patterns and temporal dynamics behind EEG signals are extracted by deep neural networks on SPD manifolds and CNNs on the tangent space sequentially and respectively.Significantly, the combination of the depthwise BiMap layer and ReEig consists of a nonlinear spatial filter that enhances the feature expression of spatial patterns.Finally, the classification stage classifies the extracted temporospatiofrequency patterns using fully-connected neural networks.
In the experiments, Tensor-CSPNet is investigated on sev-eral motor imagery (MI) tasks of EEG-based BCIs, including stationary and non-stationary scenarios.Typically, an MI task refers to an experiment where the individual mentally simulates a physical action.In neurophysiology, since MI of motor actions produces replicable and discriminable patterns (i.e., synchronization/desynchronization) over the primary sensory and motor areas, the signals are discernible to be a classification task.[37], [38] In addition, the visualization and interpretability analyses are conducted on two MI datasets to double verify the validity of Tensor-CSPNet.The remainder of this paper is organized as follows: Section II introduces the paradigm of traditional EEG analysis and elaborates on the mathematical background of CSP and SPDNet.The subsequent section III is the methodology for Tensor-CSPNet.The performance of Tensor-CSPNet is then compared on a broad set of experiments in Section IV, and cautious discussions of nonstationarity and contrasts with other mainstreams are in Section V.In Appendix, we discuss the prior assumptions for CNNs, a brief overview of SPD manifolds, the strategy of fixed-interval segmentation, and an ablation study on BCIC-IV-2a.

B. Paradigm of Traditional EEG Analysis
Let X ∈ R C×T be a short segment (trail) of EEG signals, where C is the number of EEG channels (electrodes), and T is the number of sampled points on epoch durations.This paper assumes that trail X is already band-pass filtered, centered, and scaled.A linear classifier that predicts the label of trail X is typically written as f (X; where N is the number of spatial filters, {w i } N i=1 ∈ R C are spatial filters and {β i } N i=1 ∈ R are biases.At many physiological and anatomical levels in the brain, the lognormal distributions are fundamental to structural and functional brain organization because the distribution of numerous parameters is strongly skewed with a heavy tail.[39] Hence, the logarithm of the power/variance of the projected signal w i XX ⊤ w ⊤ i is considered in the classifier.

C. Common Spatial Pattern
Let Σ + , Σ − ∈ R C×C be the estimates of covariance matrices of trails , where I c (c ∈ {+, −}) is the set of indices of trails of one class.The CSP algorithm is given by a simultaneous diagonalization of covariance matrices Σ + and Σ − in two equations such as In Line 3, we capture the temporal information on the tangent space using 2D CNNs.Fully connected neural networks with cross-entropy loss are used for the MI-EEG classification.
R C×C hold an identity constraint, i.e., Λ + + Λ − = I.The problem of the above simultaneous diagonalization is mathematically equivalent to solve a generalized eigenvalue problem as follows: (Σ + w) = λ • (Σ − w).Feature vectors z := log (wXX ⊤ w ⊤ ) consisting of eigenvectors w ∈ col(W) from both ends of the eigenvalue spectrum are commonly used in the EEG analysis.

D. Riemannian Batch Normalization
Riemannian Batch Normalization (BN) is a generalization of the classic batch normalization on Riemannian manifolds.[40].Formally, the weighted Riemannian barycenter on SPD manifolds Bar w ({B}) utilizes the parallel transport Γ to connect any sample S i with the identity matrix I d according to formulas , where each mini batch B of SPD matrices {S i } i=1 , the biasing parameter of G acquired by the matrix backpropagation in the training is directly applied in the inference.The definition of the parallel transport and the weighted Riemannian barycenter refers to Appendix B.

E. SPDNet
Architecture SPDNet is a deep neural network architecture fed with SPD matrices that preserves the SPD structure of matrices across layers during non-linearly learning.[33] Analogous to convolutional neural networks, the basic layers in SPDNet are designed to include the following layers:

III. METHODOLOGY
In this section, we propose a novel GDL framework Tensor-CSPNet for non-invasive EEG-based BCIs, consisting of four stages: the tensor stacking stage, the common spatial pattern stage, the temporal convolutional stage, and the classification stage.The architecture of Tensor-CSPNet is illustrated in Figure 1.

A. Stage One: Tensor Stacking Stage
In the first stage, EEG signals will be segmented into the temporospatiofrequency tensors concerning the theory of neurophysiology, electrophysiology, and signal processing.
1) frequency Segmentation: We use a well-known filterbank technique in the EEG-BCI classification [6] for frequency segmentation, which employs a bank of bandpass filters to decompose the raw oscillatory EEG signals into multiple frequency passbands using the causal Chebyshev Type II filter.
2) Temporal Segmentation: The temporal segmentation aims to divide EEG signals into small segments on the time domain with or without overlapping.Generally, the signals should be segmented according to the characteristics of EEGbased BCI tasks, for instance, dynamic changes in very short durations in many cognitive tasks.For those signals that we are not familiar with their characteristics, we propose a fixed-interval segmentation strategy in Appdendix C that EEG signals are initially subdivided into fixed short equal-length intervals without overlapping.We require that the length of the time window ω (time resolution) is limited by Garbor's uncertainty principle [41] that time and frequency resolutions cannot be at a high level simultaneously.
3) Tensor Stacking: After two segmentations, we stack elementary information cells to the four-dimensional temporospatiofrequency tensors X ∈ R W ×F ×C×ω , where W , F , C and ω are the number of window slices, the number of filter banks, the number of channels and window length respectively.As a consequence, the input tensors of Tensor-CSPNet are spatial covariance matrices S ij := X[i, j, :, :] • X[i, j, :, :] ⊤ for i ∈ windows slices W and j ∈ filter banks F .The pseudocode of tensor stacking refers to Algorithm 1.

Algorithm 1: Tensor Stacking Stage
Input : X ∈ R F ×C×T , window length ω, stride s, and padding value p) Remark.1).The tensor stacking stage is a data preprocessing stage, which is not included in the network architecture.
2).In frequency segmentation, we adopt a widely used portfolio of filter banks {4 ∼ 8 Hz, 8 ∼ 12 Hz, • • • , 36 ∼ 40 Hz}, which has exhibited the best competition results in the BCI Competition IV 2a 1 It is a well-known and widely used dataset in the MI-EEG classifiers.There are also many pieces of literature to exploit different combinations of frequency ranges, but this one is the most straightforward.

B. Stage Two: Common Spatial Pattern Stage
In the second stage, we modify and employ the architectures of SPDNet to capture the spatial patterns of EEG signals.
1) Depthwise BiMap: The BiMap layer in SPDNet [33] will be first modified to the depthwise BiMap layer that does not take a channel-wise summation after the bi-multiplication and then employed in the CSP stage.Preserving the SPD structure, it will transform spatial covariance matrices in each channel by right-multiplying a full column-rank matrix W and leftmultiplying its transpose simultaneously, i.e., W • S ij • W ⊤ .
2) Riemannian BN: Riemannian BN is used to de-correlate a batch of sample-based spatial covariance matrix estimation towards to an identity by , which equalizes the variance in all directions and removes the batch effects.The behind statistical mechanism has been exhibited in CSP's variant, Regularized-CSP [8], in which a shrinkage is performed towards the identity on the regularized estimate for each class as Sij := (1 − γ) • Sij + γ • I d , where Sij is the regularized estimate of S ij , γ is a user-defined parameter.
3) ReEig: ReEig in SPDNet is used to consist of a nonlinear spatial filter.In contrast with the traditional EEG analysis in which spatial filters are linear, this layer enables Tensor-CSPNet to have a richer feature expression of spatial information.
4) LOG: LOG in SPDNet is adopted to log-project the transformed SPD matrices onto the tangent space for the logpower/variance.It is consistent with a step in the standard paradigm of EEG analysis, refer to Section II-B.

C. Stage Three: Temporal Convolutional Stage
In this stage, we aim to capture temporal dynamics on tangent space using CNNs.We first flatten the outputs of the CSP stage on the frequency and space domains called the Spatial-frequency Flattening.Then, we concatenate the flattened tensors along the time domain called the Temporal Concatenation.After the spatial-frequency flattening and the temporal concatenation, the temporospatiofrequency tensor becomes a 2-dimensional tensor in R W ×(F o 2 ) without SPD structure anymore, where F is the number of filter banks and o is the output dimension of the CSP stage, as illustrated in Figure 2. Finally, we use 2-dimensional (2D) CNN with po 2width (p = 1 or F ) and q-height (1 ≤ q ≤ W ) to capture the temporal dynamics of EEG signals.Remark.1).In Stage Three, the use of 2D CNN for extracting temporal dynamics is because, in principle, the tangent space at a point on a Riemannian manifold is a vector space isomorphic to Euclidean space with the same dimension and thus always flat.Hence, after LOG, the classification problem returns to one in the (flat) Euclidean domain.The geometric neural networks, developed to deal with problems in the curved space, are not necessary to apply.
2).The width of 2D CNN is set at a multiple of o 2 because we hope to alleviate the influence of different spatial locations of EEG electrodes on the scalp.Two possible multiple p = 1 and F mean that each frequency band in the portfolio can independently and equally contribute to the model performance.

D. Stage Four: Classification Stage and Loss Function
In the final stage, single-layer or multi-layer neural networks are utilized for the final classification.The loss function in our approach is cross-entropy for the sake of simplicity.
2) BCI Competition IV 2a (BCIC-IV-2a): BCIC-IV-2a is a cue-based BCI paradigm with four-class MI-EEG motor imagery tasks including left hand, right hand, feet, and tongue recorded in 22 Ag/AgCl EEG electrodes and three monopolar EOG channels with a sampling rate of 250 Hz from 9 subjects.The BCIC-IV-2a dataset has the training session (T), and the evaluation session (E) recorded on different days.Each subject performed six runs of 12 cue-based trials for each of the four classes in either training or evaluation sessions, yielding 288 trials per subject.

B. Evaluation Baselines
The proposed approach is compared with the following diverse baselines the CSP approach FBCSP, the Riemannianbased approaches (MDM/TSM), and the DL approaches (Con-vNet/EEGNet/FBCNet/SPDNet).

C. Naming Conventation for Hyper-parameters
To further analyze Tensor-CSP, we introduce notations of its hyper-parameters.Formally, w-CSPNet (m,n,l) represents the Tensor-CSPNet with w window slices, m banks for filters, n blocks of the CSP layer, and l-layer neural networks in the perception layer.The number of banks m is set to F , and the specific frequency ranges refer to Section III-A2.The depth of the fully-connected layer l has two options {1, 3}.The output dimension of the depthwise BiMap is denoted as o, where o ∈ {4, 8, 12, 16, 20, 22, 24, 28, 32, 36}.The hyperparameters in the temporal convolutional stage are a portfolio @(p, q, r), where the triple is the width, the height, and the number of channels of 2D CNNs, respectively.The summary of the notations refers to Table IV.

D. Evaluation Scenarios
We will evaluate Tensor-CSPNet on two scenarios of the subject-specific analysis.The subject-specific analysis refers to the training and testing datasets from the same subject.
1) Cross-Validation Scenario (Stationary Scenario): This scenario uses a standard evaluation setting of 10-fold crossvalidation (with a shuffle data index) for each subject.2) Holdout Scenario (Non-stationary Scenario): The holdout scenario is a cross-session scenario in which the model is trained in one session and evaluated in another session.Figure 3 illustrates the holdout scenario on two datasets.Note that the two-session signals of each dataset are collected on different days.Hence, there is typically a drift of statistical distributions between two sessions (i.e. the non-stationary phenomenon), as illustrated in Figure 6 (a).

E. Performance Comparison
In this section, we evaluate Tensor-CSPNet on MI-KU and BCIC-IV-2a.Each dataset has three scenarios, including two 10-fold-cross-validation (CV) scenarios and one holdout scenario.
TABLE I: Configurations of Temporal Segments: In this paper, there are three kinds of temporal segments without overlapping on MI-KU, and there are four kinds of temporal segments with overlapping on BCIC-IV-2a that are adopted from [44].

BCIC-IV-2a
Temporal Segments (sec.)The configurations for Tensor-CSPNet are a little different in each scenario.We require the output dimension of the depthwise BiMap layer to be o = 20 and 22, respectively, on two datasets.The reason for picking such a hyper-parameter is discussed in Appdenix D. For the CV scenarios of both datasets, Tensor-CSPNet adopts a shallow neural network 5-CSPNet (9,1,1) because the amount of trials for training is small (i.e., 90 trial/class on MI-KU and 65 trial/class on BCIC-IV-2a), which yields the over-fitting for an extensive neural network.For the holdout scenario of both datasets, we also adopt shallow neural networks but with finner temporal segmentation 10-CSPNet (9,1,1) @(9, 5, 2) and 5-CSPNet (9,1,1) @(9, 5, 4), respectively.Because finner temporal segmentation is much helpful to the performance against the nonstationarity, cautiously discussed in Section V-A.
In EEG-based BCIs, the performance of a classifier typically varies widely in different data preparation, such as the segment length of signals, number of electrodes, and electrode placements, even in the same experimental scenario.FBCSP is always regarded as the most stable and convincing baseline in most cases.From Table II, we notice that the Riemannianbased approaches, MDM and TSM, perform like a random guess on MI-KU but a bit better on BCIC-IV-2a.It exhibits the limited effectiveness of using geometric quantities on SPD manifolds as the high-level features for classification.
The mainstream DL methodology in the MI-EEG classification exploits EEG signals' temporospatiofrequency features.Hence, we will categorize the five DL approaches in Table II into three groups, 1) SPDNet: It only exploits the spatial patterns of EEG signals and achieves the worst performance among all the DL approaches in Table II.2) EEGNet and ConvNet: They exploit the temporospatial patterns, and their performances are close to FBCSP.Note that FBCSP extracts the temporospatiofrequency patterns.
The similar performance shows that combining any two components nearly contributes to the classification.3) FBCNet and Tensor-CSPNet: They exploit the temporospatiofrequency patterns and outperform EEGNet and ConvNet in all scenarios, attributed to bandpass filters that embody the frequency information.Tensor-CSPNet performs slightly better than FBCNet on MI-KU but somewhat worse on BCIC-IV-2a, except for its holdout scenario.We briefly discuss why it performs well on both holdout scenarios in Section V-A.

F. Interpretability Analysis
In this section, we investigate the interpretability of extracted temporospatiofrequency patterns of Tensor-CSPNet using Deep Learning Important FeaTures (DeepLIFT) [45], which is a gradient-based interpretation method widely employed in the BCI classification [9], [13].
Fig. 4: Illustration of the heatmap of the relevance patterns of 5-CSPNet (9,1,1) : The experiment is conducted on Subject No.2 of MI-KU with a testing accuracy of over 0.9.The relevance pattern after DeepLIFT has an output shape (5,9,20,20).We flatten the relevance pattern into five rectangles with a height of 20 grids (20 channels) and a width of 9 grids (9 frequency bands).Each rectangle represents the spatial-frequency information within a time window of {1.0 ∼ 1.5 s, 1.5 ∼ 2.0 s, 2.0 ∼ 2.5 s, 2.5 ∼ 3 s, 3.0 ∼ 3.5 s}.The rectangle column records the main diagonal of the relevance pattern's 20 × 20 covariance matrix.The value in each cell on the heatmap is normalized in [0, 1] and smoothed by a Gaussian filter.
To interpret the extracted features, we propose a simple visualized approach to flatten the four-dimensional relevant pattern of DeepLIFT to a two-dimensional rectangle, illustrated in

G. Visualization
In this section, we plot the 2-dimensional projections for outputs of each intermediate layer in Tensor-CSPNet using tdistributed Stochastic Neighbor Embedding (t-SNE) [48].The t-SNE algorithm is a widely used technique of non-linear dimensionality reduction to visualize high-dimensional data.Specifically, we will investigate the mechanism of Tensor-CSPNet via visualizing the outputs of each intermediate layer

V. DISCUSSION
In the discussion, we first provide evidence of why Tensor-CSPNet outperforms the other approaches in the nonstationary scenarios.Then, we will discuss the relationship between Tensor-CSPNet and other existing mainstreams of the MI-EEG classifiers.

A. Evidence of Temporal Segmentation against Nonstationarity
Early electrophysiological studies show that large-scale patterns of synchronized neuronal activity exhibit considerable variability over time, e.g., alpha-blocking with eyes opening, the transition from wakefulness to drowsiness, etc.The variability was termed as the nonstationarity nature of EEG signals [49] and mainly caused the drift in statistical distribution between different sessions and subjects.To determine Tensor-CSPNet's good performance in non-stationary scenarios, we pick Subject No.28 of MI-KU because Tensor-CSPNet's accuracy of this subject is 0.3 higher than FBCSP's.Figure 6 exhibits a noticeable trend that the more refined temporal segmentation yields a more extensive crossover region of the training, validation, and test sets in the statistical distribution space.In the view of statistics, temporal segmentation fixes the nonstationarity, which is the rediscovery of a four-decadeago theory called segmentation techniques for nonstationary EEGs [50].In the view of neural signals, the fixed-interval temporal segmentation breaks down EEG signals into many short piecewise quasi-stationary intervals.Therefore, the drift between different sessions disappears in the numerical aspect, which is helpful to classification performance when using the statistical classifier.2) CSP: Attributed to the BiMap layer, Tensor-CSP performs like a CSP-like approach.The weight update using the data-driven approach improves the knowledge that the most appropriate projection matrix W can be entirely determined by label data rather than using the rule of simultaneous diagonalization.Moreover, Riemannian BN performs a regularization in Tensor-CSPNet similar to the Regularized-CSP approach [8], and ReEig leverages the linear spatial filter to a non-linear one.
3) Riemannian-Based Approaches: Both Tensor-CSPNet and the Riemannian-based approach characterize EEG signals on SPD manifolds.The Riemannian-based approach uses geodesic distance on SPD manifolds as a high-level feature for the MI-EEG classification.In contrast, Tensor-CSPNet uses the low-level feature expressions of SMCs captured by a neural network-based approach for classification.4) Manifold Learning: Manifold learning [51] is a theoretical dimensionality reduction setting in which the samples are assumed to be on or near a low-dimensional submanifold embedding in high-dimensional space.It aims to acquire a low-dimensional geometric representation of high-dimensional data retaining a meaningful property.Architecture SPDNet can be regarded as a new class of manifold learning for the supervised learning setting because it is a neural-networkbased transformation from one SPD manifold to another, and so is Tensor-CSPNet.However, the studies in Appendix D-A exhibit that expanding the dimension, rather than reducing it, yields a better classification performance in some cases.

VI. CONCLUSIONS
In this work, we propose a novel GDL framework called Tensor-CSPNet to exploit the temporospatiofrequency features of EEG signals for a general EEG-BCI classification paradigm.To achieve this goal, the framework is inspired by a growing interest in formulating EEG signals on SPD manifolds and uses existing network architectures on SPD manifolds to exploit the patterns.Tensor-CSPNet exhibits better classification performance in the experiments than the current state-of-the-art approach.In addition, we investigate how each layer in Tensor-CSPNet works and how temporal segmentation improves the Tensor-CSPNet's performance in the cross-session scenario and gives an interpretability analysis of the extracted patterns.The current experimental results demonstrate the validity of Tensor-CSPNet for the MI-EEG classification.Despite the validity, Tensor-CSPNet also has the following appealing upsides to existing CNN classifiers: 1).SPD-matrix representation for encoding spatial patterns is typically compact and robust to noise.2).Specific Architecture on SPD manifolds to enhance feature extraction.For example, It preserves the SPD structure of matrices across layers and essentially maintains more encoding information of SCM.In addition, the combination of the depthwise BiMap layer and ReEig improve the feature expression of spatial patterns.3).Tensor stacking for well-performing against nonstationarity.

VII. ACKNOWLEDGMENT
This study is supported under the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).This study is also supported by the RIE2020 AME Programmatic Fund, Singapore (No. A20G8b0102).
in-depth review of these assumptions, we refer the readers to [15], [17] and references therein.Technically, suppose a signal φ(x) ∈ L2 (Ω), where x ∈ Ω ⊂ R d .The goal of the supervised learning setting is to train a statistical model f : X → Y, where X is the space of representations φ(x) and Y is typically a discrete set of labels.We say model f is translation invariant and translation equivariant with respect to any φ ∈ L 2 (Ω) and any v ∈ Ω if f φ(x − v) = f φ(x) and f φ(x − v) = f φ(x) − v respectively.Many tasks in computer vision are assumed to be translation invariant and translation equivariant and required to be stable with respect to local deformations that is defined as a Lipschitz continuity condition as follows, where C is constant, τ (x) is a smooth displacement field that deforms the signal, and ∇τ (x) is the deformation gradient tensor.

APPENDIX B SPD MAINFOLDS
In this section, we give a brief overview of SPD manifolds with respect to the affine invariant Riemannian metric (AIRM) [22], [52] and the weighted Riemannian barycenter.For a more in-depth review of the geometry of the space of S n ++ , we refer the reader to [53], [54] and references therein.

A. Riemannian Geometry of SPD Matrices
The space of S n ++ is a Riemannian manifold if endowed with a Riemannian metric.AIRM, a widely-used class of the Riemannian metric for the space of S n ++ , was put forward independently from information science in the 1980s [55] and engineering disciplines [22], [56] after 2005.Formally, AIRM is defined as g P (v, w) := P − 1 2 vP − 1 2 , P − 1 2 wP − 1 2 F , for each v and w on tangent space T P S n ++ .Riemannian manifold S n ++ , AIRM is a Hadamard that is simply connected and complete with everywhere non-positive sectional curvature.It holds many nice properties, for example, there is an unique ++ between any two SPD matrices P 1 and P 2 of S n ++ such that γ(0) := P 1 , γ(1) := P 2 and γ(t) )|| F .In addition, the geodesic distance on S n ++ , AIRM is invariant under any congruence transformations Γ W , i.e., 1 2 , where P 1 , P 2 ∈ S n ++ and v ∈ T P1 S n ++ .

B. Weighted Riemannian Barycenter
The weighted Riemannian barycenter is a generalization of weighted barycenter on Riemannian manifolds.In our study, the Riemannian-based BCI classifier and Riemannian BN have been used in the computation procedure.Formally, given a batch B of N SPD matrices {P i } N i=1 , the weighted Riemannian barycenter (a.k.a.F réchet mean [57]) on S n ++ , AIRM is given as the solution to the following optimization problem [58]: where weights w i ≥ 0 (i = 1, ..., N ) and 1≤i≤N w i = 1.

APPENDIX C FIXED-INTERVAL SEGMENTATION AND Q VALUE
In this section, we propose a strategy for those signals that we are not familiar with their temporal characteristics.We call this strategy to be fixed-interval segmentation.Technically, this strategy means that the EEG signals are initially subdivided into fixed, short equal-length intervals or segments without overlapping.In the main paragraph, this strategy is used for the evaluation of the MI-KU dataset in which we require that the length of each time window can divide by the length of EEG signals for simplicity, and therefore, we employ two configurations, 5-CSPNet, and 10-CSPNet, whose length of time windows is 500 ms and 250 ms, respectively on MI-KU.Furthermore, we conclude that the best q value under the strategy of the fixed-interval segmentation is equal to width W. (q is the height of 2D CNN in the TC stage.)This is because the model performance monotonically increases as the q value increases in Table III.It is also consistent with the neurobiological fact that a wider window size yields a higher probability of examining event-related desynchronization/synchronization during motor imagery tasks.

TABLE III:
The results in the hold-out scenario of MI-KU with different q values in the TC stage of 5-CSPNet and 10-CSPNet.q is the height of 2D CNN in the TC stage.

APPENDIX D ABLATION STUDY ON BCIC-IV-2A
This section investigates the effects of each layer and hyper-parameter of Tensor-CSPNet on the training session of BCIC-IV-2a.We will have an in-depth analysis of its mechanism using visualization and discuss the computational efficiency of the Tensor-CSPNet.For ease of communication, we summarize the naming conventions for hyper-parameters in Tensor-CSPNet in Table IV.Primarily, we put a symbol BN at the end of each configuration of Tensor-CSPNet to indicate if it has Riemannian BN in the CSP stage.

A. Output Dimension o of the Depthwise BiMap Layer
We investigate the output dimension o in the depthwise BiMap layer.The average accuracies and standard deviations for evaluation are summarized in Table V, and their quartiles TABLE IV: Notations for Hyper-Parameters in w-CSPNet (m,n,l) @(p, q, r).The different configurations of temporal segments on two datasets refer to Table I The width, height, and output channels in 2D CNN of the TC stage.p ∈ {1, 9}.
Remark.From Table V, Tensor-CSPNet reveals an interesting phenomenon: the accuracy will improve when we lift the output dimension of the depthwise BiMap.The map from a small input to a large output is like a volume-conduction problem reconstructing EEG sources.It is evident that many latent variables exist in the EEG signals because the number of current sources is significantly greater than measurements in the 3-dimensional brain volume of the electrophysiological source imaging in neurophysiology.[59]

B. Validity of the Riemannian BN
The Riemannian BN is after depthwise BiMap and before ReEig inspired by the position of BN in ResNet [60].According to Table V, we have Observation Four (O4), that Riemannian BN relieves overfitting and improves up to 1% on the average accuracy in many pairs.Notice that the average accuracy and standard deviation across all the output dimensions 0.7123 and 0.1305 are both better than the ones of 1-CSPNet (9,1,1) , respectively, so with the pair of 1-CSPNet (9,1,3) _BN and 1-CSPNet (9,1,3) .However, the improvement has no statistical significance, i.e., average p>0.4 across 20 top-bottom pairs in Table V, Wilcoxon signed-rank test.Figure 7 exhibits this statistical result that the shapes of quartiles for both kinds of architecture are similar.

C. Depth of Architecture
For the sake of brevity, the output dimension o of each depthwise BiMaps is set 22 in a 3-block CSP layer, i.e. o 1 , o 2 , o 3 = 22.In Table VI, the shallow model (top) statistically significantly outperforms the deep one (down) in each top-down pair (average p<0.05 across top-down pairs, Wilcoxon signed-rank test).Apart from the effect of the depth of architecture, we notice that the average accuracy of 1-CSPNet (9,1,1) and 1-CSPNet (9,3,1) (Column One) are statistically significantly better than ones of 1-CSPNet (9,1,3)  and 1-CSPNet (9,3,3) (Column Three), respectively in Table V (average p<0.02 across top-down pairs, Wilcoxon signed-rank test), so with Column Two and Four (average p<0.05 across top-down teams, Wilcoxon signed-rank test).Thus, we have Observation Five (O5), that the shallow model statistically significantly outperforms the deep one, and the single-layer statistically greatly exceeds the multilayer (average p<0.05,Wilcoxon signed-rank test).

E. Hyper-parameters in the Temporal Convolutional Stage
In the temporal convolutional stage, there are three hyperparameters in a portfolio @(p, q, r), where width p (p = 1 or F ) and height q (1 ≤ q ≤ W ), and output channels r for 2D CNN.According to the analysis in the previous subsection, we pick 5-CSPNets (9,1,1) _BN with output dimension o = 22 in this experiment.The height q is set at 2 in all the cases.The width p has two options from {1, 9}, and the number The more refined segmentation of hyper-parameters in the temporal convolutional stage yields a higher dimension of the output vector.For instance, for a 5-CSPNets (9,1,1) , the dimension of the concatenated vector after a 2D CNN with @(p, q, r) = (1, 2, 20) is 720 = 9 × (5 − 1) × 20 .From Table VIII, we have Observation Seven (O7) that the performance statistically significantly monotonically improves as the number of output channels increases and the width of 2D CNN decreases (p<0.05,Wilcoxon signed-rank test).

F. Visualization
We investigate the visualization of BCIC-IV-2a using t-SNE.The well-learned and badly-learned cases are considered in the comparison, where the well-learned and badly-learned cases are with the average accuracy over 0.80 or under 0.50 under both Tensor-CSPNet and the FBCSP, respectively.The first line of Figure 8 is the well-learned case upon Fold 2 of Subject 1.The left subplot is the projection from the original data set.We concatenate 22 × 22 SMCs derived from 9 band-pass signals as a 9 × 22 × 22-dimensional vector and project them via the t-SNE method.The four colors in the resulting cross together, and the middle subplot is the projection from the outputs after the depth BiMap, Riemannian BN, and ReEig.We notice that the four-color points are more separated than those in the left subplot and stacked in a sequence where the same color points concentrate on a specific location in the figure.The right subplot is the projection from the outputs after LOG, and it is more concentrated for each color and, after that, more accessible for classification.However, we observe that the configuration of the proposed approach is still hard to distinguish between two pairs, such as the pair of the Tongue (red) and the Feet (green) and the pair of Hand (L) and Hand (R).It is consistent with the statistics in the confusion matrix of Table 9 that both FN (False-Negative) and FP (False-positive) for the tongue and feet are 2.6% and 2.9%, respectively.Both FN and FP for the hand (L) (blue) and hand (R) (yellow) are 2.5% and 3.0%, respectively.The second line of the three subplots is the projections on the badly-learned case.All color points cross together from the first figure to the last one, and there is no clear statistical pattern in shape.

G. Computational Efficiency
This subsection investigates the computational efficiency of the Tensor-CSPNet.The experiments are conducted on an Intel (R) Xeon (R) CPU @ 2.20 GHz with one socket, two cores per socket, and two threads per core.There are two subtopics    2) Calibration Time per Subject: Many training parameters in the DL approaches will affect the calibration time, such as learning rate, batch size, epochs, etc.We fix batch size equivalent to the test set size 28 for training and validation, and the initial learning rate is 0.01 with decay.The total number of training epochs is default 60, and an early stopping strategy with 15 patience is adopted.The average accuracy and standard deviation of the calibration time are shown in Table XI.The average calibration time per subject for non-DL approaches is around 1 min.However, the calibration    (9,1,1) @ (1,2,20) 7650.99 sec 2.13 hr 5-CSPNet (9,1,1) _BN@ (1,2,20) 8860.63 sec 2.46 hr time per subject for the DL approaches is much longer, especially 5-CSPNet (9,1,1) _BN@(1,2,20) is the longest.Three configurations in Tensor-CSPNet are adopted for comparison.Firstly, 1-CSPNet (9,1,1) _BN is the simplest without temporal segmentation, and it has a longer calibration time than other DL approaches.Because 9-time input size and bigger size of architecture extend the calibration time.Secondly, for 5-CSPNet (9,1,1) @(1,2,20), the input size of both algorithms is five times that of 1-CSPNet (9,1,1) _BN.Hence, there is no doubt that both have over two hours of calibration time.

H. Architecture of Tensor-CSPNet
We provide the detailed architecture of 1-CSPNet (9,1,1) with o = 22 and 5-CSPNet (9,3,1) @(1, 2, 20) with o 1 , o 2 , o 3 = 22 on BCIC-IV-2a.Table IX and X exhibit the total parameters of two configurations are 27104 and 232360, respectively.Table XII exhibits the numbers of parameters of different architecture as follows, The numbers of ConvNet and EEGNet refers to Table 3 [9].From Table XII, we notice that EEGNet has a tiny size of parameters.In contrast with ConvNet and EEGNet, the architecture of Tensor-CSPNet needs large-scale parameters to preserve the geometric information of SCMs.
be a set of n × n real SPD matrices.The Frobenius inner product and norm on m × n matrices A and B are defined as A, B F := Tr (A ⊤ B) and || A || 2 F := A, A F respectively.

Fig. 1 :
Fig.1: Illustration of Architecture of Tensor-CSPNet: The network architecture is built upon the principle that fully exploits the temporospatiofrequency patterns behind EEG signals.Hence, each structure in the architecture aims to capture features from either of the temporospatiofrequency domains.In Line 1, EEG signals are segmented into the temporospatiofrequency tensors in the tensor stacking stage.Frequency information has been unfolded in this stage.In Line 2, the CSP stage is designed to capture spatial information from tensors using the depthwise BiMap layer, Riemannian BN layer, and the ReEig layer.In Line 3, we capture the temporal information on the tangent space using 2D CNNs.Fully connected neural networks with cross-entropy loss are used for the MI-EEG classification.

•
BiMap: This layer transforms the covariance matrix S using the bi-map operator W • S • W T .Transformation matrix W is required to be full row rank.• ReEig: This layer is analogous to ReEig in classical deep neural networks that introduces the non-linearity on SPD manifolds using U •max (ǫI, Σ)•U T , where singular value decomposition S = U • Σ • U T , and ǫ is a rectification threshold and I is an identity matrix.• LOG: This layer is to map elements on SPD manifolds on its tangent space using U •log (Σ)•U T , where singular value decomposition S = U • Σ • U T .

Fig. 2 :
Fig. 2: Illustration of Temporal Convolutional Stage: F blocks of 1 × o 2 rectangles flattened and W lines concatenated.To illustrate, in the case of 5-CSPNet employed on MI-KU, F = 9, o = 20, and W = 5.Thus, each line is a 1 × 3600 flattened tensor, and the shape of the whole rectangle is 5 × 3600.

Figure 4 .
Subject No.2 is selected from the MI-KU dataset for interpretation, whose testing accuracy is over 0.9.The upper and lower rows in the heatmap represent the right-hand and left-hand MI, respectively, and are interpreted as follows, (a) The right-hand MI: Patterns with 8∼28 Hz highlights around C3 in 1.0∼1.5 sec and 2.5∼3.0 sec.(b) The left-hand MI: Patterns with 24∼28 Hz highlights around C4 in 1.0∼1.5 sec and 2.5∼3.0 sec.The above-interpreted temporospatiofrequency information is consistent with the existing practical frequency components of the left and right-hand MI [46] that the alpha band 9∼14 Hz and beta bands 18∼26 Hz perform more significantly on C3 and C4 of the primary motor cortex, or M1.Two active time windows indicate the change of band power event-related desynchronization (ERD) and the event-related synchronization (ERS) occurring during MI [47].

Fig. 5 :
Fig. 5: Illustration of outputs of each intermediate stage in 5-CSPNet (9,1,1) with o = 22 on Subject No.28 of MI-KU: The time windows for the model are 1 ∼ 1.5 s, 1.5 ∼ 2.0 s, 2.0 ∼ 2.5 s, 2.5 ∼ 3.0 s, and 3.0 ∼ 3.5 s. (a).5-Temporal Segmentation of Subject No.28:This figure is the same as Fig. 6, but there is a rotation due to the figure scale.(b).Features after CSP layer (Stage 2): Blue and yellow/green have five segments because of the five temporal segmentation.We name each data cluster as the temporal segment in this paper.(c).Features after TC layer (Stage 3): Segments of either blue or yellow/green aggregates.The yellow/green one lies in the middle of two blue parts.(d).Features after TC layer (Class Label): Draw the points with label information.Two classes are almost evenly distributed on both sides of the decision boundary.

B
. Tensor-CSPNet VS.Other BCI Classifiers 1) DL: Most of the DL approaches in the MI-EEG classification are designed to exploit the temporospatial information from EEG signals using CNNs.In contrast, Tensor-CSPNet formulates EEG signals on SPD manifolds, uses existing layers in SPDNet on SPD manifolds to exploit the spatial patterns from SCMs, and uses CNNs to capture the temporal dynamics of EEG signals on the tangent space.

Fig. 6 :
Fig. 6: 2-dimensional Projection of Subject No.28 in MI-KU using t-SNE: There are two sessions for each subject in the MI-KU dataset.S1 is for the training set, and two halves of S2 are for the validation and test sets.The lengths of time windows are (a).2500ms, (b).500 ms, (c).250ms, and (d).125 ms.There is no overlapping between time windows.Each 2-dimensional color point is dimensionality reduced from a 9 × 20 × 20-dimensional point, where it has 20 electrodes in the motor cortex region and nine frequency bands.This is the input format for Tensor-CSPNet.

Fig. 9 :
Fig. 9: Unnormalized Confusion Matrices for FBCSP and Tensor-CSPNet in the CV scenario of BCIC-IV-2a: 4-class labels include Hand (L), Hand (R), Feet, and Tongue.The total number of trials for both matrices is 2610 ( = 29 trails in test dataset × 10 folds × 9 subjects).The quantities in each cell include the number of the corresponding class (top) and its percentage (%) over the total number of trails (bottom).

Fig. 10 :
Fig. 10: Illustration of the relation between the operation time per iteration and the output dimension of the depthwise BiMap layer.

TABLE II :
Average accuracies and standard deviations for the subject-specific analysis of MI-KU (a total of 54 Subjects) and BCIC-IV-2a (a total of 9 Subjects).Each result in the table is denoted as average accuracy (standard deviation).The best-performing number for each analysis is highlighted in boldface.

TABLE V :
Experiments on the Output Dimension o of the Depthwise BiMap Layer: Average accuracy (Acc.) and standard deviation (Std.)under 1-CSPNet

TABLE VI :
Experiments on Effectiveness of Riemannian BN: Average accuracy (Acc.) and standard deviation (Std.)under four pairs of CSPNets with the CSPNet

TABLE VII :
Experiments on Time-Frequency Resolution: Average accuracy (Acc.) and standard deviation (Std.) of Tensor-CSPNet under four configurations of the time windows of 1-CSPNets (9,1,1) _BN without the temporal convolutional stage on BCIC-IV-2a.The best-performing method for each analysis is highlighted in boldface.

TABLE VIII :
Experiments on Hyper-parameters in the Temporal Convolutional Stage: Average accuracy (Acc.) and standard deviation (Std.) of various hyper-parameter portfolios of the temporal convolutional stage for 5-CSPNets(9,1,1)_BN on BCIC-IV-2a.The portfolio @(p, q, r) represents (Width, Height, Number of the Output Channels) in the temporal convolutional stage.The best-performing result is highlighted in boldface.

TABLE XI :
Experiments on Calibration Time per Subject: Average accuracy (Acc.) and standard deviation (Std.) for the calibration time (s) per subject of baselines and Tensor-CSPNet on BCIC-IV-2a.The output dimension of the CSP stage is 22 for the three configurations of Tensor-CSPNet.

TABLE XII :
Table of number of parameters in network architecture.