SincNet-Based Hybrid Neural Network for Motor Imagery EEG Decoding

Abstract—It is difficult to identify optimal cut-off frequencies for the filters used with the common spatial pattern (CSP) method in motor imagery (MI)-based brain-computer interfaces (BCIs). Most current studies choose filter cut-off frequencies based on experience or intuition, resulting in sub-optimal use of MI-related spectral information in the electroencephalography (EEG). To improve information utilization, we propose a SincNet-based hybrid neural network (SHNN) for MI-based BCIs. First, raw EEG is segmented into different time windows and mapped into the CSP feature space. Then, SincNets are used as filter-bank band-pass filters to automatically filter the data. Next, we used squeeze-and-excitation modules to learn a sparse representation of the filtered data. The resulting sparse data were fed into convolutional neural networks to learn deep feature representations. Finally, these deep features were fed into a gated recurrent unit module to seek sequential relations, and a fully connected layer was used for classification. We used the BCI competition IV datasets 2a and 2b to verify the effectiveness of our SHNN method. The mean classification accuracies (kappa values) of our SHNN method are 0.7426 (0.6648) on dataset 2a and 0.8349 (0.6697) on dataset 2b.


I. INTRODUCTION
Brain-computer interfaces (BCIs), which can directly connect the outside world and the human brain without the involvement of peripheral nerves and muscles [1], have been attracting more and more attention in recent years. Practical applications of BCIs include virtual reality games [2], exoskeleton control [3], [4], rehabilitation of stroke patients [5], [6], communication with patients with disorders of consciousness [7], [8], and many other applications.
Motor imagery (MI)-based BCIs allow their users to enact control by imagining movement in one or more body parts [9]-[11]. The electroencephalogram (EEG) is one of the most commonly used signals in MI-based BCIs and has the advantages of low cost, low risk, and high portability. Movement imagination is associated with changes in the power of the oscillatory EEG. These changes in power are known as the event-related desynchronization (ERD) and event-related synchronization (ERS) phenomena. By detecting the ERD/ERS in the EEG, the part of the body that the user is imagining moving can be identified. Compared with other common non-invasive BCI control paradigms, such as event-related potential (ERP) [12], [13] and steady-state visual evoked potential (SSVEP) [14], [15] based BCIs, the MI-based BCI does not rely on external stimuli, which can make it more convenient and intuitive for its users. However, improving the classification performance of MI-based BCIs is still a challenging problem.
Combinations of spatial, spectral, and temporal features are usually extracted from the EEG for MI task classification. The common spatial pattern (CSP) method is the most commonly used spatial feature extraction method for this process. However, the performance of CSP is easily affected by the selection of the cut-off frequencies that are used to filter the EEG [16], [17]. To address this problem, Ang et al. [18] proposed a filter-bank CSP (FBCSP) method to filter the EEG into multiple sub-bands. However, these sub-bands also need to be determined manually, while the appropriate filter bands vary across participants. With the development of deep learning methods, many researchers have proposed novel feature extraction and classification methods for MI-based BCIs. Lawhern et al. [19] proposed a compact neural network called EEGNet, which can extract temporal and spatial features together. In a recent study, Izzuddin et al. [20] used SincNet, which was originally proposed to improve the classification performance of speech signals [21], as a band-pass filter to process MI-related EEG signals and achieved promising performance. However, these methods cannot make full use of the information included in MI-related EEG signals.
Inspired by the filter-bank technique and the SincNet method, we propose a SincNet-based hybrid neural network (SHNN) in this study. First, the raw EEG is segmented into different time windows and mapped into the CSP feature space using the learned CSP projection matrices. Then, for each time window, a SincNet-based CNN module is used to extract the spatial and spectral features from the mapped EEG. In addition, after the SincNet method, we also use squeeze-and-excitation (SE) modules to re-calibrate the data processed by the different maps of SincNet to obtain a sparse representation of MI-related EEG. Next, features extracted from different time windows are concatenated and fed into a gated recurrent unit (GRU) module to seek sequential relations in the data. Finally, a fully connected layer is used for classification. To verify the effectiveness of the proposed neural network, our SHNN method is evaluated on two public BCI competition datasets. Ablation experiments were also conducted to explore the function of each module in our SHNN method.
The rest of our paper is organized as follows. In section II we introduce our SHNN method and datasets. In section III we present our experiment results. Finally, in section IV and section V we present our discussion and conclusion, respectively.

II. METHODS
In this section, we first introduce the method that was used to extract features from the raw EEG. Then we show the detailed structure of our proposed SHNN method. Next, we describe the datasets used in this study. Finally, we describe our experimental process.

A. Common Spatial Pattern Feature Extraction With Retained Temporal Information
The common spatial pattern (CSP) method has been widely used to classify ERD/ERS activity in the EEG. The principle of CSP is to seek a spatial projection matrix that maximizes the variance of one class of EEG while simultaneously minimizing the variance of the other class (see equation (2)) [22]. For multi-class MI tasks, the one-versus-rest CSP (OVR-CSP) considers one task as one class and the remaining tasks as the other class [23]. Assume that X_{i,j} ∈ R^{C×T} denotes the j-th EEG trial belonging to the i-th class, where C and T are the numbers of channels and time points, respectively. The typical steps to obtain the spatial filter and CSP features are:

(1) Use a band-pass filter with a specific cut-off range to process the raw EEG data, then subtract the mean from the filtered data.

(2) Calculate the covariance matrix of the i-th class:

R_i = (1/N_i) Σ_{j=1}^{N_i} X_{i,j} X_{i,j}^T / tr(X_{i,j} X_{i,j}^T)    (1)

where N_i denotes the number of EEG trials belonging to the i-th class.

(3) Use equation (3) to solve equation (2) and obtain the spatial projection matrix W:

max_w (w^T R_1 w) / (w^T R_2 w)    (2)

R_1 w = λ R_2 w    (3)

where w is a column vector of the spatial projection matrix W, i.e., equation (2) is solved as the generalized eigenvalue problem of equation (3).

(4) Construct the spatial filter matrix W̄ ∈ R^{C×2M} with the hyperparameter M, which is the number of selected eigenvectors corresponding to the largest and smallest eigenvalues.

(5) Transfer the filtered EEG data into the CSP feature space by the projection of W̄:

Z_{i,j} = W̄^T X_{i,j}    (4)

The traditional CSP method further uses a log operation to extract CSP features from EEG data mapped into the CSP feature space. However, the extracted CSP features lack temporal information. In this study, we instead used the EEG data Z_{i,j} transferred into the CSP feature space (hereafter referred to as CSPWT) as the input to our proposed neural network.
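As an illustration, the CSP steps above can be sketched in NumPy. The whitening-based solver below is a standard way to solve the generalized eigenvalue problem referred to as equations (2) and (3); the function name, trial shapes, and default M = 2 are our own choices, not taken from the paper:

```python
import numpy as np

def csp_projection(trials_a, trials_b, m=2):
    """Sketch of a two-class CSP spatial filter matrix W_bar (C x 2m).

    trials_a, trials_b: arrays of shape (n_trials, C, T) containing
    band-pass-filtered, mean-subtracted EEG for the two classes.
    """
    def avg_cov(trials):
        # Trace-normalized covariance, averaged over trials
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    r_a, r_b = avg_cov(trials_a), avg_cov(trials_b)
    # Whiten the composite covariance R_a + R_b
    d, u = np.linalg.eigh(r_a + r_b)
    p = np.diag(d ** -0.5) @ u.T
    # Eigendecompose the whitened class-a covariance
    s, v = np.linalg.eigh(p @ r_a @ p.T)
    w = (v.T @ p).T                       # full projection matrix
    order = np.argsort(s)
    # Keep eigenvectors for the m smallest and m largest eigenvalues
    sel = np.concatenate([order[:m], order[-m:]])
    return w[:, sel]                      # W_bar, shape (C, 2m)
```

A trial X of shape (C, T) is then mapped into CSP space as Z = W_bar.T @ X, which keeps the full time axis (the CSPWT representation) instead of collapsing it with the traditional log-variance step.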

B. SincNet-Based Hybrid Neural Network
In this study, we propose a SincNet-based hybrid neural network (SHNN) to classify MI tasks. Fig. 1 shows the time flow of the classification process of our SHNN method. The raw EEG data were first divided into multiple time windows, and the CSPWT method was applied to the EEG in each time window to extract the features. The extracted spatial and temporal features from each window were then fed into a sub-CNN module to learn deep features, and these deep features were fed into the GRU module for classification.
As shown in Fig. 2, the SincNet, CNN, and GRU methods are the three main parts of the proposed neural network. We first applied the SincNet method with N_f maps in our network to band-pass filter the EEG data that had been transformed into the CSP feature space (Fig. 2(c)). The filter-bank parameters, including the low cut-off frequency and the filter's bandwidth, are learned adaptively, which can result in a minimum classification error.
The SincNet method was first proposed to discover more meaningful filters for the better recognition of speech signals [21]. Its structure is similar to that of a 1D convolution, which is defined as follows:

y[n] = x[n] * h[n] = Σ_{l=0}^{L−1} x[l] h[n − l]    (5)

where x[n] is the input signal, y[n] is the filtered signal, and h[n] is the filter kernel of length L. Motivated by the 1D convolution, SincNet performs the convolution with a predefined function g:

y[n] = x[n] * g[n, θ]    (6)

where θ denotes the parameters that SincNet needs to learn. In the frequency domain, a band-pass filter can be seen as the difference between two low-pass filters. Thus, the g[n] in equation (6) can be viewed in the frequency domain as:

G[f, f_1, f_2] = rect(f / (2 f_2)) − rect(f / (2 f_1))    (7)

where rect(·) denotes the rectangular function, and f_1 and f_2 are the low and high cut-off frequencies, respectively. Applying the inverse Fourier transform, the time-domain equation can be written as:

g[n, f_1, f_2] = 2 f_2 sinc(2π f_2 n) − 2 f_1 sinc(2π f_1 n)    (8)

where sinc(x) = sin(x)/x. Therefore, SincNet can act as a band-pass filter by learning the parameters f_1 and f_2. However, to satisfy the constraints f_1 ≥ 0 and f_2 ≥ f_1 during the training process, the parameters are modified as:

f_1 = |f_1^abs|    (9)

f_2 = f_1 + f_band    (10)

where f_band = |f_1 − f_2| is the cut-off range of the band-pass filter. Hence, f_1^abs and f_band are the parameters that SincNet updates during the training process. Moreover, to mitigate the problem of filter truncation, we windowed the convolutional filter with a Hamming window, and the windowed filter kernel is:

g_w[n, f_1, f_2] = g[n, f_1, f_2] · w[n]    (11)

where

w[n] = 0.54 − 0.46 cos(2πn / L)    (12)

The outputs of SincNet were fed into the squeeze-and-excitation (SE) modules [24], [25] for recalibration. The structure of the SE modules is shown in Fig. 2(a). Suppose the input feature map is X ∈ R^{N_f×N_c×N_t}, where N_f is the number of maps of SincNet, and N_c and N_t are the numbers of channels and time points, respectively.
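For illustration, a windowed sinc band-pass kernel of the kind described above can be generated as follows. This is a sketch, not the paper's implementation; note that NumPy's np.sinc is the normalized sinc, sin(πx)/(πx), so the scaling differs slightly from the sinc(x) = sin(x)/x convention used in the equations, and the function name and sampling-rate normalization are our own:

```python
import numpy as np

def sinc_bandpass_kernel(f1_abs, f_band, kernel_len, fs):
    """Hamming-windowed sinc band-pass kernel (sketch).

    f1_abs, f_band: the learnable low cut-off and bandwidth in Hz.
    fs: sampling rate in Hz; kernel_len: filter length in samples.
    """
    f1 = abs(f1_abs)
    f2 = f1 + abs(f_band)                 # enforce f2 >= f1 >= 0
    # Symmetric time axis in seconds
    n = np.arange(kernel_len) - (kernel_len - 1) / 2.0
    t = n / fs
    # Band-pass = difference of two ideal low-pass (sinc) responses
    g = 2 * f2 * np.sinc(2 * f2 * t) - 2 * f1 * np.sinc(2 * f1 * t)
    # Hamming window mitigates truncation ripple; /fs normalizes gain
    return g * np.hamming(kernel_len) / fs

# Example: an 8-30 Hz kernel covering the mu and beta rhythms at 250 Hz
kernel = sinc_bandpass_kernel(8.0, 22.0, 129, 250)
```

During training, only f1_abs and f_band would be updated by backpropagation, while the sinc shape and window stay fixed, which is what keeps the number of learnable parameters per filter at two.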
We used average pooling to generate band-wise statistics:

s_k = (1 / (N_c N_t)) Σ_{i=1}^{N_c} Σ_{j=1}^{N_t} X_k(i, j)    (13)

Next, two fully connected layers were used to learn the nonlinear relations between different bands and obtain the scale vector:

W = σ(W_2 δ(W_1 S))    (14)

where W is the scale vector, S = {s_1, s_2, · · · , s_{N_f}} denotes the band-wise statistics vector, W_1 and W_2 are the weight matrices of the two fully connected layers, r is the reduction ratio, which is one of the hyperparameters, and δ(·) and σ(·) denote the rectified linear unit (ReLU) activation function and the softmax function, respectively. Note that the recalibrated output features of the SE modules have the same shape as the input features.
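A minimal NumPy sketch of this squeeze-and-excitation recalibration is shown below. Bias terms are omitted, and the function names and weight shapes (W_1 of size (N_f/r) × N_f, W_2 of size N_f × (N_f/r)) are our assumptions for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def se_recalibrate(x, w1, w2):
    """Squeeze-and-excitation over SincNet maps (sketch, no biases).

    x: (N_f, N_c, N_t) band-pass-filtered data.
    w1: ((N_f // r), N_f) and w2: (N_f, (N_f // r)) FC weights.
    """
    s = x.mean(axis=(1, 2))                       # squeeze: band-wise statistics
    scale = softmax(w2 @ np.maximum(w1 @ s, 0))   # excitation: ReLU then softmax
    return x * scale[:, None, None]               # recalibrate; same shape as x
```

Because the scale vector comes out of a softmax, the per-band weights sum to one, which is what drives the sparse, competitive weighting across frequency bands that the paper exploits.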
Afterward, temporal and spatial convolutional blocks were used to further extract deep features. Specifically, the temporal convolutional block consists of a convolutional layer with a kernel size of 1 × 10 followed by a batch normalization layer and a ReLU layer. The ReLU function is given as follows:

ReLU(x) = max(0, x)    (15)

In addition, the spatial convolutional block has a convolutional layer with a kernel size of C × 1 (where C is the number of EEG channels), a max-pooling layer with a kernel size of 1 × 6, a convolutional layer with a kernel size of 2 × 2, a max-pooling layer with a kernel size of 1 × 2, and a fully connected layer with 100 nodes. As shown in Fig. 2(b), the numbers of maps of the three convolutional layers are 128, 64, and 16, respectively. Thus, the output dimension of the spatial convolutional block is 16 × 100.
According to Fig. 1, we concatenated the output feature vectors from the CNNs. The inputs to the CNNs are N_T different time-window segmentations of the EEG mapped into the CSP feature space. These feature vectors were then fed into the gated recurrent unit (GRU) module to learn sequential information. The structure of the GRU is shown in Fig. 2(c). The input of the GRU module is a vector of length 100. A reset gate r_t and an update gate z_t were used in the GRU cells. The reset gate adjusts how new inputs are incorporated with the hidden state of the previous GRU cell. Its mathematical formula can be written as follows:

r_t = σ(W_r x_t + U_r h_{t−1})    (16)

where W_r and U_r are the weight matrices of the reset gate, σ(·) denotes the sigmoid activation function, and x_t and h_{t−1} denote the input at time t and the hidden state at time t − 1, respectively.
The update gate z_t controls the degree to which new information will be incorporated. Its mathematical formula can be given as follows:

z_t = σ(W_z x_t + U_z h_{t−1})    (17)

Here, σ(·), x_t, and h_{t−1} are the same as previously introduced in equation (16), and W_z and U_z are the weight matrices of the update gate. Consequently, the candidate state h̃_t and the output h_t can be obtained by calculating equations (18) and (19):

h̃_t = tanh(W x_t + U (r_t ⊗ h_{t−1}))    (18)

h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h̃_t    (19)

where tanh(·) denotes the hyperbolic tangent operator, W and U are weight matrices, and ⊗ denotes the Hadamard product operator. Finally, the latest outputs of the GRU module were fed into a fully connected layer and a softmax layer for classification.
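One GRU step, as described by the gate equations above, can be sketched in NumPy as follows. Bias terms are omitted, and the output gating follows the standard GRU convention (the previous state weighted by 1 − z_t); this is an illustrative sketch, not the paper's PyTorch implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, params):
    """One GRU time step (sketch, biases omitted).

    params: (Wr, Ur, Wz, Uz, W, U) weight matrices for the reset gate,
    update gate, and candidate state, respectively.
    """
    Wr, Ur, Wz, Uz, W, U = params
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_cand          # new hidden state
    return h_t
```

In the SHNN, the cell would be applied once per time window, so the hidden state after the last window summarizes the sequence of per-window CNN features before the final fully connected and softmax layers.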
Since the lengths of the different time windows are the same, the sizes of the inputs to the sub-CNNs (Fig. 1) are identical. Therefore, the sub-CNN structure is consistent across the different time windows.

C. Loss Function
In this study, we used the cross-entropy loss to minimize the classification error between the predicted labels and the ground-truth labels. Moreover, the sparse loss [24] and center loss [26] were used to simplify the neural network and improve the discriminability of different class features, respectively. The objective functions of the cross-entropy loss L_CE, sparse loss L_Sparse, and center loss L_Center are given as follows:

L_CE = −(1/N_batch) Σ_{i=1}^{N_batch} y_i^g log(y_i^p)    (20)

L_Sparse = Σ_{j=1}^{N_T} ||W_j||_1    (21)

L_Center = (1/2) Σ_{k=1}^{N_batch} ||f_k − v_{y_k}^e||_2^2    (22)

where y_i^g denotes the ground-truth label of the i-th training sample, y_i^p denotes the normalized probability values of the i-th training sample, N_batch is the number of samples in a training batch, W_j is the scale vector of the filter banks in the j-th time window, ||·||_1 denotes the l_1-norm operator, f_k denotes the feature vector extracted from the k-th sample by the GRU module, y_k denotes the ground-truth label of the k-th sample, and v_{y_k}^e denotes the center vector of class y_k in the e-th training epoch.
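The combined objective can be sketched as a single function. The center-loss weight of 0.1 matches the experiment setup below; the sparse-loss weight is not stated in the paper, so the default here is an assumption for illustration, as are the function and argument names:

```python
import numpy as np

def combined_loss(probs, labels, scale_vectors, feats, centers,
                  w_sparse=0.01, w_center=0.1):
    """Cross-entropy + l1 sparse loss + center loss (sketch).

    probs:   (N_batch, n_classes) softmax outputs.
    labels:  (N_batch,) integer ground-truth labels.
    scale_vectors: list of per-time-window SE scale vectors W_j.
    feats:   (N_batch, d) GRU feature vectors f_k.
    centers: (n_classes, d) current class center vectors v^e.
    w_sparse is an assumed weight; w_center follows the paper's 0.1.
    """
    n = len(labels)
    # Cross-entropy over the probability assigned to the true class
    l_ce = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    # l1 penalty on the filter-bank scale vectors across time windows
    l_sparse = sum(np.abs(w).sum() for w in scale_vectors)
    # Pull each feature toward its class center
    l_center = 0.5 * np.sum((feats - centers[labels]) ** 2)
    return l_ce + w_sparse * l_sparse + w_center * l_center
```

The three terms act on different parts of the network: the cross-entropy on the classifier output, the l1 term on the SE scale vectors (encouraging sparse band selection), and the center term on the GRU features (encouraging compact classes).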
As introduced in [27], the center vectors can be updated in each training epoch through the following formulas:

Δv_j^e = (Σ_{k=1}^{N_batch} 1(y_k = j) · (v_j^e − f_k)) / (1 + Σ_{k=1}^{N_batch} 1(y_k = j))    (23)

v_j^{e+1} = v_j^e − ρ · Δv_j^e    (24)

where 1(·) denotes the indicator function, Δv_j^e is the average difference between the j-th class samples and the center vector of the j-th class in the e-th training epoch, and v_j^{e+1} and v_j^e are the center vectors in the (e + 1)-th and e-th training epochs, respectively. The term ρ denotes the learning rate for the center loss, and its value is in the range of 0 to 1.
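The center-update rule can be sketched with a boolean mask standing in for the indicator function. The default ρ matches the center-loss learning rate given in the experiment setup; the function name is our own:

```python
import numpy as np

def update_centers(centers, feats, labels, rho=5e-5):
    """One center-vector update per class (sketch of the update rule).

    centers: (n_classes, d); feats: (N_batch, d); labels: (N_batch,).
    The +1 in the denominator guards against classes absent from the batch.
    """
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        mask = labels == j
        delta = (centers[j] - feats[mask]).sum(axis=0) / (1 + mask.sum())
        new_centers[j] = centers[j] - rho * delta
    return new_centers
```

When every feature of a class already sits at its center, delta is zero and the center stays put, which is the fixed point the center loss drives training toward.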
The center vectors are updated as training progresses; they are first initialized with random Gaussian noise of dimension N_c × 100, where N_c denotes the number of classes of the MI tasks and 100 is the length of the GRU output vector. The target of the center loss is to minimize the gap between the center vector of a class and its GRU outputs, which makes the features extracted from the same class of data more aggregated (as shown in Fig. 6). Therefore, the ability of the neural network to identify different classes of data can be improved.

D. Dataset Description
In this study, we used two public BCI competition datasets to validate the effectiveness of the proposed neural network. The detailed information of the two datasets is as follows:

DS1: The first dataset is the BCI competition IV dataset 2b [28], which was recorded from nine participants (B1-B9) at a sample rate of 250 Hz. Each participant performed 720 trials of two MI tasks (left-hand vs right-hand motor imagery), and the EEG was recorded from 3 electrodes placed at positions C3, Cz, and C4 of the international 10/20 system for EEG electrode placement. Detailed information about this dataset can be found via www.bbci.de/competition/iv/#dataset2b.

DS2: The second dataset is the BCI competition IV dataset 2a [29], which was recorded from nine participants (A1-A9). For each participant, 576 trials of EEG were recorded from 22 Ag/AgCl electrodes at a sample rate of 250 Hz. In this dataset, four kinds of MI tasks were performed: left hand (class 1), right hand (class 2), tongue (class 3), and feet (class 4). The number of trials recorded for each class is 144. Detailed information about this dataset can be found via www.bbci.de/competition/iv/#dataset2a.

E. Experiment Setup
We used the adaptive moment (Adam) optimizer [30] to train our proposed neural network, and the optimizer parameters β_1 and β_2 were set to 0.5 and 0.999, respectively. The learning rate ρ and the weight of the center loss were set to 0.00005 and 0.1, respectively. The batch size and the learning rate of the neural network were set to 16 and 0.0001. In addition, the reduction ratio r in the SE module was chosen as 3. The number of maps of each SincNet is 15, meaning that there are 15 auto-adaptive band-pass filters. Given the 3 seconds of motor imagination time in DS1 and DS2, we segmented the raw EEG into four time windows of length 2 seconds, whose start times range from -0.5 seconds to 1 second in steps of 0.5 seconds (i.e., −0.5s∼1.5s, 0s∼2s, 0.5s∼2.5s, and 1s∼3s). In this study, because the SincNets act as band-pass filters, we only used a notch filter to remove the 50/60 Hz power-line noise from the raw EEG. We trained our proposed neural network with PyTorch on a platform with an AMD R7 3700X CPU, 32 GB RAM, and an Nvidia 2080Ti GPU. For each participant, the data were divided into ten subsets: eight subsets were used to train the SHNN method, one subset was used as a validation dataset, and the performance of our SHNN method was evaluated on the remaining subset.
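The four-window segmentation above can be reproduced with a small helper that converts window start times to sample indices. This assumes (our assumption, for illustration) that each epoched trial array begins at −0.5 s relative to the task timing used in the paper:

```python
def window_indices(starts, length_s, fs=250, t0=-0.5):
    """Sample index ranges for sliding time windows (sketch).

    starts:   window start times in seconds (relative to task timing).
    length_s: window length in seconds.
    t0:       assumed time of the first sample in the epoched trial array.
    """
    out = []
    for s in starts:
        begin = int(round((s - t0) * fs))
        out.append((begin, begin + int(length_s * fs)))
    return out

# The paper's four 2 s windows at 250 Hz: -0.5-1.5 s, 0-2 s, 0.5-2.5 s, 1-3 s
windows = window_indices([-0.5, 0.0, 0.5, 1.0], 2.0, fs=250)
```

Each window then spans 500 samples, so the per-window inputs to the sub-CNNs all share the same shape, consistent with the identical sub-CNN structure described earlier.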
In this study, Cohen's kappa coefficient and the classification accuracy are used as the two metrics for performance evaluation. The mathematical formula of Cohen's kappa coefficient is given as follows:

κ = (p_o − p_e) / (1 − p_e)    (25)

where p_o denotes the classification accuracy, p_e = Σ_i n_{oi} n_{io} / N² denotes the hypothetical probability of chance agreement, n_{oi} and n_{io} are the sums of each column and row of the confusion matrix, respectively, and N denotes the sum of all entries in the confusion matrix.
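The kappa computation from a confusion matrix can be sketched directly from this definition (the function name is our own):

```python
import numpy as np

def cohens_kappa(conf):
    """Cohen's kappa from a confusion matrix (rows: true, cols: predicted)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    p_o = np.trace(conf) / n                                     # observed agreement
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Example: a two-class confusion matrix with 85% accuracy
kappa = cohens_kappa([[40, 10], [5, 45]])  # -> 0.7
```

Unlike raw accuracy, kappa discounts agreement expected by chance, which matters when comparing the two-class DS1 results against the four-class DS2 results.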

III. RESULTS

A. The Training Loss and Evaluating Accuracy of the SHNN Method
We first analyzed the training loss and the accuracy of the SHNN method on DS1. Fig. 3 shows the training loss and accuracy for each participant (B1-B9) on DS1. The number of training epochs is 200. After each training epoch, the validation datasets were used to validate the SHNN method. The red lines denote the training loss curves, while the blue lines denote the accuracy curves. As shown in Fig. 3, the accuracy increases as the training loss decreases. It can be seen that the accuracy increased quickly during the first 25 training epochs, and the training loss is generally stable after about 100 training epochs, which suggests that the SHNN method can be trained well for MI task classification. The average training time and single-trial test time are 399.31 s and 2.03 s, respectively.
We take one participant as an example to illustrate the operation of our method. Fig. 4 shows the first, seventh, and fifteenth sinc-filters g_w[n] learned by the SincNet method for participant B4 on DS1. The low cut-off frequencies and optimal bandwidths were automatically identified by SincNet and varied across the different sinc-filters. The first sinc-filter can extract spectral information from 4 Hz to 10.4 Hz, while the seventh and fifteenth sinc-filters extract spectral information from 18.41 Hz to 24.81 Hz and from 37.61 Hz to 44.02 Hz, respectively. The mu and beta rhythms (8-30 Hz), which have been proven crucial for MI task classification [9], are included in the band-pass frequency ranges of these sinc-filters.
To better show the role of the SE module in our SHNN method, we also illustrate some examples of the weight values learned by the SE module. Fig. 5 shows examples of the weight values of the EEG data filtered by the 15 sinc-filters of the SincNet method across the four time windows for participants B3 and B4 on DS1.
The classification accuracies and kappa values for participants B3 and B4 are (B3: accuracy = 0.5833, kappa = 0.1667; B4: accuracy = 0.9730, kappa = 0.9458), which are the worst and best performances achieved by the SHNN method on DS1, respectively. As shown in Fig. 5, weight values from 0 to 1 were automatically assigned to the data after filtering by the different sinc-filters. Inspecting time window 2 (TW2) and comparing the weight value distributions of participants B3 and B4 reveals that, for participant B3, the weight values of frequency bands 3 to 13 are similar, while for participant B4, the weight values of the different frequency bands are sparser. Thus, a sparser representation of features may contribute to a better classification performance due to the suppression of redundant information.
We used the t-distributed stochastic neighbor embedding (t-SNE) method [31] to map the feature vectors to a 2D embedding space before and after applying the center loss method. The distribution of feature vectors for participant B4 is shown in Fig. 6. The red and blue dots indicate the left-hand and right-hand classes of samples, respectively. It can be seen that after applying the center loss method, the feature vectors of the same class cluster together more tightly and the variance is smaller. Hence, the center loss method can improve the discriminative power of the feature vectors for distinguishing samples of different classes, which is also supported by the ablation experiment results in Table I (see below).

B. Result of the Ablation Experiments
Three ablation experiments were conducted on DS1 to verify the effectiveness of the SincNet method and the specific loss functions (i.e., the sparse l_1-norm loss and the center loss) within our proposed SHNN method. We used three models, named model1, model2, and model3, which represent three cases: 1) Model1: The SHNN method is tested without any additional structure (SincNet and SE modules) and only trained with the cross-entropy loss.
2) Model2: The SHNN method is tested without the SincNet module and trained with the cross-entropy loss, sparse l_1-norm loss, and center loss.
3) Model3: The complete SHNN method trained only with the cross-entropy loss. Table I shows the classification accuracies, kappa values, and Wilcoxon signed-rank test results of the three ablation experiments. The mean classification accuracies of model1, model2, and model3 are 0.7428, 0.7827, and 0.8044, while the mean classification accuracy of our proposed model is 0.8349. Moreover, the mean kappa values of model1, model2, and model3 are 0.4858, 0.5643, and 0.6086, while the mean kappa value of our proposed model is 0.6697. In addition, the Wilcoxon signed-rank test results demonstrate that our complete SHNN method, i.e., with the SincNet and SE modules included and trained with the cross-entropy loss, sparse l_1-norm loss, and center loss, outperformed model1, model2, and model3.

C. Comparison of the Classification Accuracies and Kappa Values
In order to validate the efficacy of our SHNN method, we compared its performance with that of other state-of-the-art methods, including EEGNet and FBCSP [18], [19], [32]-[37]. Table II shows the classification accuracies, kappa values, and Wilcoxon signed-rank test results for the nine participants in DS1. Table III shows the classification accuracies, kappa values, and Wilcoxon signed-rank test results for the nine participants in DS2.
As shown in Tables II and III, our SHNN method achieved the highest mean classification accuracies of 0.8349 and 0.7462, and the highest mean kappa values of 0.6697 and 0.6648, on datasets DS1 and DS2, respectively. Compared with the other state-of-the-art methods, our SHNN method achieved improvements of 10.16%, 4.39%, 9.68%, and 3.72% in average classification accuracy on DS1, while for DS2, the improvements are 9.58% (kappa), 7.48% (kappa), 14.62% (accuracy), and 9.19% (accuracy). It can be seen that, for fourteen of the eighteen participants in DS1 and DS2, our SHNN method achieved the highest classification performance. Since the classification results were not normally distributed (as confirmed by Kolmogorov-Smirnov test results), we used the Wilcoxon signed-rank test to compare the performance of our SHNN method with that of the other state-of-the-art methods. The statistical test results demonstrate the superiority of our SHNN method in both the two-class and multi-class MI classification tasks. The p-values between SHNN and the eight compared methods on the two datasets are 0.008 (Z=-2.666), 0.038 (Z=-2.073), 0.021 (Z=-2.310), 0.008 (Z=-2.666), 0.008 (Z=-2.666), 0.028 (Z=-2.192), 0.008 (Z=-2.666), and 0.038 (Z=-2.073), respectively.

IV. DISCUSSION
CSP is a very widely used and popular feature extraction method for MI classification tasks. However, previous studies that have used CSP to extract spatial features from EEG ignore the temporal information contained in the EEG. Moreover, the performance of CSP is easily affected by the selection of the cut-off frequencies of the band-pass filters used with the method. In this study, we proposed a new method, SHNN, to automatically learn spatial-spectral-temporal information from the EEG. The four main advantages of our SHNN method are: (a) SincNets were used to automatically identify band-pass filters from the data, meaning the parameters of each filter can be automatically learned from the data, which results in better extraction of MI-related spectral features from the EEG. (b) The use of the GRU module and the CSPWT feature extraction method can make full use of the temporal information in the EEG and boost the classification performance. (c) Center loss was used to improve the distinguishability of different classes of samples. (d) A sparse representation of the data filtered into different frequency bands can be learned by the SE module.

[TABLE III: The classification accuracies, kappa values, and Wilcoxon signed-rank test results of the different methods for all participants in DS2.]
The function of the SincNets module in our SHNN method is to band-pass filter the EEG into different frequency ranges. As shown in Fig.4, according to the characteristics of the EEG data, each SincNet can determine a specific frequency range in order to optimally band-pass filter the data into spectral features. Moreover, the optimal cut-off ranges of the band-pass filters vary across different participants and the SincNet module can make full use of the differences in spectral information between participants. Hence, our SHNN method can extract more discriminative spectral information from the EEG for classification.
The traditional CSP method has the limitation of losing the temporal information of the EEG. In this study, we used the CSP projection matrix to map the EEG data into the CSP feature space while retaining the temporal information of the EEG. In addition, the GRU module was used to learn the sequential information from the different time-window segmentations of the EEG. Furthermore, the center loss was used to train our SHNN method, which allows it to reduce the variance of a single class of samples and gather the samples together. Fig. 6 shows the clustering performance of the center loss method: the feature vectors from a single class are gathered together and the variance is reduced. Table I demonstrates that the use of center loss can improve the classification performance of our SHNN method (with center loss vs. without center loss, DS1 accuracies: 0.8349 vs. 0.8044; DS1 kappa values: 0.6697 vs. 0.6086).
The SE module in our method can re-calibrate the data by explicitly modeling the interdependencies between EEG channels [23]. As shown in Fig. 5, different weight values were automatically assigned to the data processed by the different SincNets and segmented into the different time windows. Frequency bands that contain discriminative information can be assigned higher weight values, which enhances the features extracted from those bands. On the other hand, redundant features can be suppressed by assigning lower or zero weight values. Therefore, the SE module can improve the representational power of our SHNN method and ease its learning.
Of course, our SHNN method still has some limitations. First, although Tables II and III show the superiority of our SHNN method, the structure of SHNN is complex. Recently, pruning [38] and knowledge distillation [39] techniques have been used to simplify the structure of neural networks while maintaining their performance. We will try to combine these methods with our SHNN method to achieve a more compact neural network. Secondly, our SHNN method is an offline neural network that has not been validated in online BCI environments. In future work, we will explore its use in online BCI experiments.

V. CONCLUSION
In this study, we proposed a SincNet-based hybrid neural network consisting of SincNets, SE modules, and GRU modules, which can automatically filter the data and extract spatial, spectral, and temporal features from the EEG. The results of our experiments, conducted on two public BCI competition datasets, demonstrate that the performance of our SHNN method surpasses that of other state-of-the-art methods.