Recognition of Motor Imagery EEG Signals Based on Capsule Network

In order to fully extract the temporal and spatial features contained in motor imagery electroencephalography (EEG) signals for effective identification of motor imagery, a three-dimensional capsule network (3D-CapsNet) EEG signal recognition model is proposed, which integrates the MI-EEG temporal dimension, the channel spatial dimension, and the intrinsic relationships between features to maximize the feature representation capability. Firstly, a multi-layer 3D convolution module is used to extract features in the temporal and inter-channel spatial dimensions as low-level features. Secondly, high-level spatial features are obtained through capsule network integration. Finally, dynamic routing connections and the squash function are applied for classification. The experimental analysis is conducted on BCI competition IV dataset 2a. The proposed model performs well on all subjects' datasets: the average accuracy and average Kappa value over the 9 subjects are 84.028% and 0.789, respectively. The experimental results confirm the effectiveness of the proposed method. Additionally, the accuracy of four-class classification is improved, and the impact of individual variability is overcome to a certain extent.


I. INTRODUCTION
Brain-computer interface (BCI) allows people to interact with the real world through brain neural activity alone [1]. Motor imagery electroencephalography (MI-EEG) is one of the most widely used BCI paradigms and is currently applied mainly in the field of motor rehabilitation [2]. On the one hand, rehabilitation training based on MI-EEG converts motion intentions into instructions after processing and transforming the MI-EEG signals, enabling control of rehabilitation aids such as wheelchairs and robotic arms. This partly facilitates interaction with the environment for patients with damaged muscles or nerve endings. On the other hand, motor imagery promotes the remodeling of brain function, enabling functional compensation and ultimately restoring part of the motor function, thereby improving patients' quality of life.
The associate editor coordinating the review of this manuscript and approving it for publication was Gang Wang.

MI-EEG data is an extremely complex nonlinear random time series acquired from the 3D scalp surface. Because of the diverse physiological states of different subjects, the distributions of their collected EEG data are dissimilar. Even for the same subject, the EEG data distribution varies at different times. The complexity, non-stationarity, and individual variability of EEG signals pose great difficulties for the recognition task.
Based on the event-related synchronization (ERS) and event-related desynchronization (ERD) phenomena, numerous MI-EEG classification methods have been proposed [3], [4], [5], [6], [7], [8]. Although feature extraction combined with machine learning has been successfully applied to MI-EEG classification, such methods divide feature extraction and classification into two successive stages. Accordingly, the parameters of the feature extraction model and the classifier are trained with different objective functions. In addition, the extraction of the best features is subjective: if a sub-optimal frequency band is selected during feature extraction, the classification performance suffers. Moreover, for a complex nonlinear random time series, manual determination of the frequency band requires great expertise, experience, and understanding of the EEG data. Due to the variability among subjects, selecting the best frequency band for each subject does not generalize well to a larger population [9].
Recently, various deep learning methods have been applied to EEG classification, including the convolutional neural network (CNN) [10], the recurrent neural network (RNN) [11], and the capsule network (CapsNet) [12]. Deep learning removes the need for manual feature extraction, because it embeds feature extraction and classification into an end-to-end network and allows simultaneous optimization of all parameters, which minimizes the effort required for EEG signal preprocessing and is clearly more suitable for online BCI studies.
The first task in using deep learning for MI-EEG classification is to represent the data in a consistent form that can be processed by a deep model. At present, MI-EEG data is usually expressed as a two-dimensional (2D) matrix, referred to as 2DMI-EEG hereafter, whose height and width represent the number of sampling electrodes and the number of sampling time steps, respectively [13]. Another common method is to transform the EEG signal into a 2D time-frequency image via the short-time Fourier transform [14] or the wavelet transform [15] and use it as the network input. However, neither the 2D matrix representation nor the 2D time-frequency image representation can preserve the spatial information of MI-EEG data [2]. Accordingly, the intrinsic relationship between adjacent electrodes cannot be reflected in the 2D matrix, which degrades the classification performance. Moreover, the 2D time-frequency image representation is essentially an artificially designed research method. In 2019, Zhao et al. [16] proposed a three-dimensional (3D) representation of EEG signals in which EEG time series are mapped into a 3D array according to the spatial distribution of the electrodes. This method retains both temporal and spatial features and uses the constructed 3D array as the model input. Based on such 3D motor imagery EEG signal representation, this paper proposes an improved MI-EEG recognition model relying on the capsule network, referred to as 3D-CapsNet hereafter. CapsNet is a deep learning model proposed by Hinton et al. [17] that evolved from the CNN, with the difference that a CapsNet consists of capsules whereas a CNN consists of neurons. Each layer in a CapsNet contains multiple capsule units that encode various types of spatial feature state information of entities. The presence of capsules enables the network to capture entities from all angles.
In 2017, Hinton's team [18] established a dynamic routing algorithm for the implementation of capsule networks. They used CapsNet for handwritten digit recognition for the first time and achieved state-of-the-art classification performance. The innovative capsule structure of CapsNet enables it to effectively describe the intrinsic relationships between individual channels and features.
To address the problem that the two-dimensional matrix representation cannot preserve spatial information, this paper uses 3DMI-EEG as input. 3DMI-EEG is a three-dimensional representation of the EEG signal that fully preserves the temporal information in the EEG time series and the spatial information in the electrode distribution. To address the small size of motor imagery EEG data sets and the insufficient extraction of spatio-temporal features that lead to low recognition accuracy, we propose a three-dimensional capsule network model (3D-CapsNet) for EEG signal recognition. 3D-CapsNet integrates the MI-EEG temporal dimension, the channel spatial dimension, and the intrinsic relationships between features to maximize the feature representation capability. At the same time, the feature vector dimension is reduced by the dynamic routing connection between capsules instead of the pooling layers popular in traditional fully connected networks. This retains many detailed EEG features and ensures an effective feature extraction process, which is very friendly to small-scale data sets such as EEG signals. In addition, the method embeds EEG feature extraction and classification into an end-to-end network and enables fast identification, making it competitive for online BCI applications.

II. METHODS

A. 3D REPRESENTATION OF MI-EEG

Fig.1 shows the 3DMI-EEG mapping process of a subject's EEG signal based on BCI competition IV dataset 2a. First, the EEG signal is intercepted frame by frame to obtain the current frame value. Each frame corresponds to one sampling time point (TP). The unused electrode positions are filled with zero values, and each frame is converted into a 2D-map according to the spatial distribution of the sampling electrodes. Second, the N 2D-maps at different time points are arranged into a 3D matrix according to their temporal order, where N is the number of sampling points in each channel.
Representing the MI-EEG data in 3D based on the electrode distribution not only fully preserves the temporal information of the EEG time series but also retains the spatial information of the electrode distribution, while ensuring the processability of the EEG data.
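As a concrete illustration, the mapping from a (channels × time) trial to the 3DMI-EEG form can be sketched as below. The 6 × 7 grid size and the electrode-to-cell layout here are placeholders, not the actual 10-20 positions of the 22 electrodes shown in Fig.1; unused cells stay zero.

```python
import numpy as np

# Hypothetical grid giving each electrode a (row, col) cell; the real layout
# follows the 10-20 spatial distribution of the 22 electrodes.
GRID_H, GRID_W = 6, 7
LAYOUT = {c: (c // GRID_W, c % GRID_W) for c in range(22)}  # placeholder positions

def to_3d(eeg):
    """Map an EEG trial of shape (channels, timepoints) to a (timepoints, GRID_H, GRID_W) array."""
    n_ch, n_tp = eeg.shape
    out = np.zeros((n_tp, GRID_H, GRID_W), dtype=eeg.dtype)
    for c in range(n_ch):
        r, col = LAYOUT[c]
        out[:, r, col] = eeg[c]  # one 2D-map per time point, zero-filled elsewhere
    return out

trial = np.random.randn(22, 1000)  # 4 s at 250 Hz
cube = to_3d(trial)
print(cube.shape)  # (1000, 6, 7)
```

With the placeholder layout, rows 4 and 5 of every 2D-map remain zero, playing the role of the unused electrode positions.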

B. THE STRUCTURE OF 3D-CAPSNET
3D-CapsNet is mainly composed of two parts: the 3D convolution module and the CapsNet module. The 3D-CapsNet framework is shown in Fig.2. First, multi-layer 3D convolution is used to extract features in the temporal and inter-channel spatial dimensions as primary features. Then, the convolutional capsule layer is applied to detect the intrinsic relationships between the features. Finally, high-dimensional feature vectors called motion capsules are obtained through the dynamic routing connection and classified using the squash function. The specific parameters of the 3D-CapsNet hierarchical structure are presented in Fig.5(a); they are the optimal settings obtained through extensive experimentation. The 3D convolution module extracts basic features at multiple levels and provides local perception information for the primary capsule layer. Gradually increasing the number of convolution kernels ensures the extraction of increasingly rich features. Batch normalization (BN) is implemented after each convolution layer to speed up convergence and mitigate overfitting. The input is convolved by the convolution module to produce 128 3D outputs with dimension 4 × 5 × 6, which are combined into a 128 × 4 × 5 × 6 tensor before being delivered to the primary capsule layer. This results in 384 4-dimensional capsules as the output. The primary capsules store different forms of MI-EEG spatial features. The dynamic routing connection is utilized between the primary and motion capsule layers. The dynamic routing algorithm aggregates closely predicted capsules and abstracts the motion capsules representing the inter-class variability. The motion capsules are finally subjected to the nonlinear squash activation function to provide the classification output.
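A minimal PyTorch sketch of the 3D convolution front end and the grouping of its output into 4-dimensional primary capsules is given below. The kernel sizes, strides, channel counts and activation are illustrative assumptions, not the exact settings of Fig.5(a).

```python
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    """Squash non-linearity: scales vector length into [0, 1), keeps direction."""
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

class Conv3DModule(nn.Module):
    """Multi-layer 3D convolution front end with BN (parameters are illustrative)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(7, 3, 3), stride=(2, 1, 1), padding=(0, 1, 1)),
            nn.BatchNorm3d(32), nn.ELU(),
            nn.Conv3d(32, 64, kernel_size=(7, 3, 3), stride=(2, 1, 1), padding=(0, 1, 1)),
            nn.BatchNorm3d(64), nn.ELU(),
            nn.Conv3d(64, 128, kernel_size=(7, 3, 3), stride=(2, 1, 1), padding=(0, 1, 1)),
            nn.BatchNorm3d(128), nn.ELU(),
        )

    def forward(self, x):  # x: (batch, 1, T, H, W)
        return self.net(x)

x = torch.randn(2, 1, 500, 6, 7)        # batch of 2 cropped 3DMI-EEG samples
feat = Conv3DModule()(x)                # (2, 128, 58, 6, 7) with these settings
caps = squash(feat.flatten(2).transpose(1, 2).reshape(2, -1, 4))  # group into 4-D capsules
print(feat.shape, caps.shape)
```

After squashing, every capsule's length lies strictly below 1, so it can be read as a feature-presence probability.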

C. TRAINING ALGORITHM FOR THE CapsNet
The dynamic routing algorithm is only used between two consecutive capsule layers, as depicted in Fig.3, which shows the information transfer and routing process between capsules. The detected low-level feature vectors $u_i$ ($i = 1, 2, \cdots, n$) are first multiplied by the corresponding weight matrices $W_{ij}$ to calculate the high-level prediction vectors $\hat{u}_{ij}$:

$$\hat{u}_{ij} = W_{ij} u_i \tag{1}$$

where $i$ denotes the $i$-th low-level feature and $j$ the $j$-th primary capsule. The vector length encodes the probability that the corresponding feature is present, and the vector direction encodes the internal state of the feature; the $\hat{u}_{ij}$ are also called primary capsules. This step represents the spatial and other important relationships between low-level and high-level features. Second, the primary capsules $\hat{u}_{ij}$ are weighted in a way similar to scalar weighting in neurons. However, whereas neuron weights are learned through back-propagation, the coupling coefficients of the capsules are learned using the dynamic routing algorithm. After adjustment of $c_{ij}$, the outputs of the primary capsules are transferred to the appropriate motion capsules $s_j$ by calculating the weighted sum of the prediction vectors, which aggregates close predicted values:

$$s_j = \sum_i c_{ij} \hat{u}_{ij} \tag{2}$$

Finally, $s_j$ is mapped to the range 0 to 1 without changing its direction using the nonlinear squash activation function. The resulting vector $v_j$ is calculated as:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \tag{3}$$

The length and direction of $v_j$ again encode the probability and internal state of the corresponding feature. These three steps constitute the complete propagation process between capsules. Among them, the coupling coefficients $c_{ij}$ are learned by the dynamic routing algorithm and determined by the softmax function:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{4}$$

where $b_{ij}$ is a temporary variable initialized to zero.
Although all coupling coefficients $c_{ij}$ are identical after the first iteration, the value of $b_{ij}$ is updated in further iterations and the uniform distribution of $c_{ij}$ changes. The update rule for $b_{ij}$ is:

$$b_{ij} \leftarrow b_{ij} + \hat{u}_{ij} \cdot v_j \tag{5}$$

The capsule loss is evaluated by a marginal loss function for each class, denoted by $L_k$:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2 \tag{6}$$

where $k$ denotes the category, $T_k$ indicates the presence of the category ($T_k = 1$ if and only if the motor imagery of category $k$ is present), $v_k$ is the current prediction, and $m^+$, $m^-$ and $\lambda$ are hyper-parameters without direct physical meaning: $m^+ = 0.9$, $m^- = 0.1$, and $\lambda$ takes the empirical value 0.5 to reduce the loss contribution of absent categories. The total loss is the sum of all motion capsule losses.
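The routing and loss computations above can be sketched in NumPy as follows. Capsule counts and dimensions are illustrative; in the real model the prediction vectors come from the learned weight matrices, not from random data.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Map each vector's length into [0, 1) without changing its direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_low, n_high, dim) prediction vectors (W_ij applied to u_i)."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))  # routing logits, zero-initialized
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum s_j
        v = squash(s)                                         # output capsules v_j
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update of b_ij
    return v

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Marginal loss summed over classes; targets is a 0/1 vector T_k."""
    lengths = np.linalg.norm(v, axis=-1)
    return np.sum(targets * np.maximum(0, m_pos - lengths) ** 2
                  + lam * (1 - targets) * np.maximum(0, lengths - m_neg) ** 2)

u_hat = np.random.randn(384, 4, 8)   # 384 primary capsules -> 4 motion capsules
v = dynamic_routing(u_hat)
print(v.shape)  # (4, 8)
```

Note that only the weight matrices are trained by back-propagation; the coupling coefficients inside `dynamic_routing` are recomputed from scratch for every forward pass.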

D. TRAINING STRATEGY
This study adopts the cropped training strategy, in which samples are generated by sliding a 3D window with a predefined step size along the time dimension, such that the window covers all electrodes. The window size is selected based on the EEG sampling frequency and the type of classification task. The cropped training strategy is a common method for augmenting EEG training samples, similar to cropping strategies in image recognition. The experiments by Schirrmeister et al. [13] demonstrate that, compared to trial-wise training, cropped training leads to better classification performance.
CapsNet training is performed by optimizing the marginal loss function. The number of training iterations is set to 80. The learning rate is dynamically adjusted using the Adam stochastic optimization algorithm, which replaces classic stochastic gradient descent (SGD) to update the network weights more efficiently and accelerate convergence.

III. EXPERIMENTS AND RESULTS
The 3D-CapsNet model was implemented in Python under the Pytorch framework. The experimental environment involved 11th Gen Intel(R) Core (TM) i5-11400H@2.70GHz 2.69 GHz, 16 GB RAM, NVIDIA GeForceRTX3050 graphics card and 64-bit Windows 11 system.
1) DATA DESCRIPTION

The experiment was performed on the public BCI competition IV dataset 2a in a four-class study [19], which consists of EEG data from nine healthy subjects. Twenty-two Ag/AgCl electrodes with the spatial distribution recommended by the international 10-20 system were used for recording. The data contains four categories of motor imagery tasks: left hand, right hand, feet, and tongue. The paradigm of a single motor imagery trial is shown in Fig.4. At t = 2 s, an arrow appeared and pointed to the left, right, down, or up (corresponding to the left hand, right hand, feet, or tongue category, respectively). This cue arrow persisted on the screen for 1.25 s. The subject then performed the motor imagery task until the cross mark disappeared from the screen at t = 6 s. After a short break, the screen was activated again and the next trial began. The data for each subject were recorded in two sessions on different days, which serve as the training set and the evaluation set, respectively. Each session comprised 6 runs separated by short breaks, and each run consisted of 48 trials (12 for each of the four classes), resulting in a total of 288 trials per session. The signals were sampled at 250 Hz. Band-pass filtering from 0.5 to 100 Hz and notch filtering at 50 Hz were applied to eliminate power-line interference.

2) DATA PROCESSING
The experiments by Shajil et al. [20] showed that applying a 1-100 Hz wide-band filter leads to higher MI classification accuracy than an 8-30 Hz narrow-band filter. Also, Zhao et al. [16] investigated the decoding performance of several frequency bands (0.5-4 Hz, 0.5-38 Hz, 4-38 Hz and 0.5-100 Hz) and found the original full band (0.5-100 Hz) to be more advantageous. Based on the above analysis, the present study uses the full frequency band (0.5-100 Hz) of the raw EEG signal, with minimal preprocessing to protect the features. In addition, the 4-s segment acquired during cued motor imagery is selected as the experimental data and represented in 3DMI-EEG form. A 2-s segment of data is considered one sample. More specifically, the crop window size in the time dimension is set to 500 TP and the step size to 20 TP, resulting in 26 cropped samples for training from each 4-s original sample. All of these cropped samples carry the same label as the original sample.
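The cropping described above, a 500-TP window slid with a 20-TP step over a 1000-TP (4-s at 250 Hz) trial, can be sketched as:

```python
import numpy as np

def crop_samples(trial, win=500, step=20):
    """Slide a window of `win` time points with stride `step` over one trial.
    trial: (timepoints, H, W) 3DMI-EEG array; returns (n_crops, win, H, W)."""
    n_tp = trial.shape[0]
    return np.stack([trial[s:s + win] for s in range(0, n_tp - win + 1, step)])

trial = np.random.randn(1000, 6, 7)   # 4 s at 250 Hz, mapped to an illustrative 6x7 grid
crops = crop_samples(trial)
print(crops.shape)  # (26, 500, 6, 7)
```

With these parameters the number of crops is (1000 − 500) / 20 + 1 = 26, matching the 26 cropped samples per trial stated above.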

B. EXPERIMENTAL RESULTS
The experimental part first verifies feasibility of the proposed scheme, followed by comparing the decoding performance of 2DMI-EEG and 3DMI-EEG representations. Fig.5 shows the structure of all networks. Fig.5(a) shows the proposed framework of 3D-CapsNet, while Fig.5(b) and 5(c) depict the new network structures obtained by partial modification of 3D-CapsNet according to the validation content.

1) VALIDATION OF CapsNet APPLIED TO MI-EEG IDENTIFICATION
The 3D-CapsNet structure is shown in Fig.5(a); 80 epochs are monitored on the test data of the 9 subjects. The training and test accuracies of this network are presented in Fig.6. It should be noted that, because of the cropped training strategy, each cropped sample during training corresponds to one prediction value, whereas during testing the predictions of the 26 cropped samples are averaged to obtain the final prediction. Thus, the final recognition result of an original sample is determined by combining the results of its cropped samples.
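The test-time combination of cropped samples can be sketched as follows: average the per-crop class probabilities of one trial, then take the arg-max. The 26 crops and 4 classes match the setup above; the probabilities here are synthetic.

```python
import numpy as np

def trial_prediction(crop_probs):
    """crop_probs: (n_crops, n_classes) class probabilities of one trial's crops.
    Averages over crops, then returns the arg-max class index."""
    return int(np.argmax(crop_probs.mean(axis=0)))

rng = np.random.default_rng(0)
probs = rng.random((26, 4))   # 26 crops, 4 motor imagery classes (synthetic)
probs[:, 2] += 1.0            # make class 2 dominate on average
print(trial_prediction(probs))  # 2
```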
As shown in Figure 6, 3D-CapsNet performs very well on subjects 1, 3, 7, 8 and 9, with high accuracy and overall stability after 40 iterations. It also performs well on subjects 4 and 6. However, the performance of the model on subjects 2 and 5 is not as good as on the other subjects, although about 70% accuracy is still achieved for these two subjects. The overall performance of 3D-CapsNet on all subjects shows that the model accuracy does not deviate significantly across subjects, and therefore the model is fairly robust.
The confusion matrix of 3D-CapsNet is shown in Figure 7. The confusion matrix visualizes the number of correct predictions for each category and the number of samples misclassified into other classes. TPR denotes the sensitivity, i.e., the proportion of correct model predictions among the true labels, and PPV denotes the precision, i.e., the number of correct predictions divided by the total number of predictions for a category. We use the F1-Score composite index for evaluation, calculated as:

$$F1\text{-}Score = \frac{2 \cdot TPR \cdot PPV}{TPR + PPV} \tag{7}$$

The F1-Score metric combines the sensitivity and precision results. It ranges from 0 to 1, with 1 representing the best model output and 0 the worst. The F1-Scores corresponding to left hand, right hand, feet and tongue, calculated according to equation (7), are F1-Score (left) = 0.93, F1-Score (right) = 0.93, F1-Score (feet) = 0.82 and F1-Score (tongue) = 0.90. It is obvious that, among the four categories, the model classifies left- and right-hand motor imagery best, tongue motor imagery second best, and feet motor imagery slightly worse. Overall, the performance of the network is very good. To further verify the effectiveness of the capsule network applied to MI-EEG recognition, the 3D convolution module is instead followed by a fully connected layer with the softmax function for classification. The hierarchical structure of this network is shown in Fig.5(b). Based on the number of iterations required for a stable loss trend, the number of iterations in the experiments is set to 80. The experiments are implemented using Pytorch. According to the recognition results presented in Table 1, higher accuracy and lower standard deviation are obtained with the dynamic routing connections.
It can be concluded that the capsule network is effective when it is applied to MI-EEG identification.
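The F1-Score used above is the harmonic mean of sensitivity (TPR) and precision (PPV); a minimal sketch, taking 0.93 as an illustrative per-class input:

```python
def f1_score(tpr, ppv):
    """F1 = 2 * TPR * PPV / (TPR + PPV): harmonic mean of sensitivity and precision."""
    return 2 * tpr * ppv / (tpr + ppv)

# When TPR and PPV are equal (e.g. 0.93 for the left-hand class),
# the harmonic mean equals that common value.
print(round(f1_score(0.93, 0.93), 2))  # 0.93
```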

2) DECODING PERFORMANCE OF 2DMI-EEG AND 3DMI-EEG FORMS
In order to compare the decoding performance of the 2DMI-EEG and 3DMI-EEG representations, all 3D convolutions in 3D-CapsNet are changed to 2D convolutions in this section, and experiments are performed on 2DMI-EEG inputs. The framework structure is shown in Fig.5(c), and the comparison with the proposed method is presented in Table 2. Over the 9 subjects, the average recognition accuracy of the 3DMI-EEG representation is 11.568% higher than that of the 2DMI-EEG representation. Furthermore, the standard deviation of the 3DMI-EEG representation is 1.37% lower than that of the 2DMI-EEG representation. Consequently, the 3D representation of the motor imagery EEG signal is more suitable for decoding with deep networks.

C. COMPARISON OF THE RESULTS WITH SIMILAR STUDIES
This section compares the results of the proposed method with those of similar studies. They include the DeepNet, EEGNet and ShallowNet structures, all of which apply 2D-EEG for EEG signal decoding. Furthermore, Multi-Branch3D [16] and Dense-MB3D [9], which utilize 3DMI-EEG for decoding, are included. Reference [22] proposes CNN+BiLSTM, a deep learning model based on an attention inception convolutional neural network and long short-term memory, for 2D-EEG decoding. Table 3 presents the classification accuracy of the different methods on the dataset of 9 subjects. It can be seen that Dense-MB3D [9] and the present paper dominate in recognition accuracy. Multi-Branch3D [16] yields lower decoding accuracy than EEGNet, ShallowNet and CNN+BiLSTM, but its result distribution is more stable, as its standard deviation is much smaller than that of these three networks. In general, the standard deviation of the 3D representations is much smaller than that of the 2D representations, and the 3D networks are more robust.
The recognition accuracy of the proposed method is generally better than that reported in the literature. More specifically, for 6 subjects the accuracy of the proposed method is the highest among all methods, and the average accuracy over all subjects is 2.805% higher than the second-best average accuracy. Given that the standard deviation of the 3D representation is much smaller than that of the 2D representation, it can be inferred that the 3D representation of motor imagery EEG is more suitable for MI-EEG signal decoding. Furthermore, the 3D representation has more potential to retain the general characteristics of MI-EEG across different subjects. It also helps to overcome the differences among individuals to a certain extent and has stronger interpretability, which is confirmed by the results of the experiment in Section 2.2.2.
To further validate the performance of 3D-CapsNet, the Kappa values of the classification results are calculated and compared with other studies, as presented in Table 4. The Kappa value measures the consistency between the model predictions and the actual classifications. It ranges from −1.0 to 1.0, and the larger the value, the better the classification performance of the algorithm. The Kappa value is calculated as:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{8}$$

where $P_o$ is the overall sample classification accuracy and $P_e$ assesses the chance probability. Let $c$ be the total number of categories, $T_i$ ($i = 1, 2, \cdots, c$) the number of correctly classified samples in each category, $a_1, a_2, \cdots, a_c$ the number of true samples in each category, $b_1, b_2, \cdots, b_c$ the number of predicted samples in each category, and $n$ the total number of samples. Then $P_o$ and $P_e$ are calculated as:

$$P_o = \frac{\sum_{i=1}^{c} T_i}{n} \tag{9}$$

$$P_e = \frac{\sum_{i=1}^{c} a_i b_i}{n^2} \tag{10}$$

The results are presented in Table 4. Comparing 3D-CapsNet with the other studies, the mean Kappa value of the proposed method is the highest. In particular, 3D-CapsNet exhibits higher Kappa values than the two other 3D frameworks, Multi-Branch3D and Dense-MB3D. Thus, 3D-CapsNet performs well for 3DMI-EEG recognition.
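The Kappa computation above can be sketched directly from a confusion matrix; the two-class matrices below are synthetic, for illustration only.

```python
import numpy as np

def kappa(confusion):
    """Cohen's Kappa from a confusion matrix (rows: true class, cols: predicted class).
    P_o is the diagonal mass; P_e uses the row/column marginals (a_i, b_i)."""
    n = confusion.sum()
    p_o = np.trace(confusion) / n                                   # overall accuracy
    p_e = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum() / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

cm = np.array([[40, 10],
               [10, 40]])          # 80% accuracy, balanced two-class case
print(round(kappa(cm), 2))  # 0.6
```

A perfectly diagonal confusion matrix gives Kappa = 1.0, while chance-level predictions give Kappa near 0.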

D. THE ABILITY OF OVERCOMING INDIVIDUAL DIFFERENCES
Since each subject has a different physiological state, the distribution of the generated EEG signals varies, and even for the same subject the distribution of EEG signals collected at different times differs. For the application of BCI systems, it is important to establish a stable classification model that can overcome individual differences. Once such a model is obtained, any subject can directly use it for MI-related classification tasks without any pre-training.
In this paper, a mixture of the ''Training data'' of all 9 subjects is used to train the model, and the ''Test data'' of each test subject is used as the test set; the classification performance is shown in Table 5. As can be seen from the table, the proposed framework can still achieve effective classification with an average accuracy of 69.22%, which means it does have the ability to overcome individual differences in MI classification tasks. Compared with similar studies, the ability to overcome individual variability is effectively improved.

E. TIME CONSUMPTION
Classification accuracy and time consumption are the two factors to be considered in the practical application of BCI systems. Experimental results in [24] show that BCI systems with a response time of 1 second or less are very suitable for real-time applications. In this paper, we conducted experiments on a GeForce RTX 3050 GPU with 4 GB of memory. The test times from subject 1 to subject 9 were obtained, and the results are listed in Table 2. The fast prediction speed (5.1 × 10^−3 s per sample, about 1.48 s for 288 samples) and high prediction accuracy show that our 3D-CapsNet is competitive for online BCI system applications.

IV. CONCLUSION
Compared to applications such as image recognition and natural language processing, EEG signal recognition research must overcome the low recognition accuracy caused by small data sets and inconspicuous EEG features. The network should fully extract the embedded features while avoiding overfitting. In addition, it is important to adopt an approach that accounts for the real-time requirements and convenience of online BCI in practical applications.
Inspired by the dynamic routing connection of the capsule network, the 3D-CapsNet MI-EEG signal recognition model was proposed by combining it with 3D convolution. The multi-layer 3D convolution module was utilized to extract features from both the temporal and inter-channel spatial dimensions as low-level features. The CapsNet also has a certain spatial detection capability, such that the low-level output features of the 3D convolution module were integrated by the CapsNet to obtain high-level spatial vectors containing inter-feature relationships. Finally, the classification output was obtained using the nonlinear squash activation function. 3D-CapsNet considers both the temporal and spatial features of the original EEG signal. It adopts the dynamic routing connection to discard the pooling layer and retain subtle features, maximizing the feature expression capability of the network, which is very friendly to small data sets. In addition, our approach embeds EEG feature extraction and classification into an end-to-end network, avoiding manual involvement while achieving fast recognition (5.1 × 10^−3 s) and being advantageous for practical BCI system applications. This paper provides a new idea for EEG signal recognition research.
The experimental verification was performed in stages. First, the effectiveness of the CapsNet in EEG signal recognition was verified. The results showed that the dynamic routing connection of the CapsNet was more successful in EEG signal recognition than the traditional fully connected method. Second, the decoding performance of the CapsNet-based network was compared for the 2DMI-EEG and 3DMI-EEG configurations. The results indicated that the decoding accuracy of the 3DMI-EEG representation was higher and its standard deviation lower than those of the 2DMI-EEG representation. Finally, comparison of the proposed scheme with the related literature confirmed the results of the above two experiments. Compared to the classical networks including DeepNet, EEGNet and ShallowNet used with the 2DMI-EEG representation, Dense-MB3D [9] and the proposed model, which are based on 3DMI-EEG, exhibited better decoding accuracy and higher robustness. In addition, the proposed 3D-CapsNet achieved satisfactory results on advanced performance metrics such as recognition accuracy and Kappa value. Finally, we experimentally showed that our approach has advantages in practical BCI system applications. A limitation of this study is that, although the subject-specific recognition accuracy is significantly improved, the cross-subject performance with respect to individual variability is not outstanding. In the future, we will explore cross-subject MI classification by incorporating transfer learning.
Considering the difficulties in applying current recognition methods to online BCI systems, future work will further investigate strongly robust deep networks applicable to decoding EEG signals from a broader population. A study dealing with channel selection was also carried out, in which a combination of channels applicable to the current subject was selected. The data of the selected channels were used as deep network input for application in the online BCI system.