Spatio-Temporal Explanation of 3D-EEGNet for Motor Imagery EEG Classification Using Permutation and Saliency

Recently, convolutional neural network (CNN)-based classification models have shown good performance for motor imagery (MI) brain-computer interfaces (BCI) using electroencephalogram (EEG) in end-to-end learning. Although a few explainable artificial intelligence (XAI) techniques have been developed, it is still challenging to interpret the CNN models for EEG-based BCI classification effectively. In this research, we propose 3D-EEGNet as a 3D CNN model to improve both the explainability and performance of MI EEG classification. The proposed approach exhibited better performances on two MI EEG datasets than the existing EEGNet, which uses a 2D input shape. The MI classification accuracies are improved around 1.8% and 6.1% point in average on the datasets, respectively. The permutation-based XAI method is first applied for the reliable explanation of the 3D-EEGNet. Next, to find a faster XAI method for spatio-temporal explanation, we design a novel technique based on the normalized discounted cumulative gain (NDCG) for selecting the best among a few saliency-based methods due to their higher time complexity than the permutation-based method. Among the saliency-based methods, DeepLIFT was selected because the NDCG scores indicated its results are the most similar to the permutation-based results. Finally, the fast spatio-temporal explanation using DeepLIFT provides deeper understanding for the classification results of the 3D-EEGNet and the important properties in the MI EEG experiments.


I. INTRODUCTION
M OTOR imagery (MI) refers to a mental action in which humans imagine movements of their body parts without any actual movements [1].MI has been actively studied to identify the movement intentions of patients who have limited ability to perform physical movements (e.g., patients with cerebral palsy or strokes) [2], [3].Electroencephalogram (EEG) is widely used in research on MI brain-computer interfaces (BCIs).EEG has a good time resolution on the order of milliseconds and can be collected more easily compared to invasive methods that measure signals from the surface of the brain directly.However, EEG involves many noisy signals and has a high data dimension.So, it is required to extract effective features that represent the characteristics of EEG signals [4].Hence, many studies have adopted various feature extraction methods [5] for extracting suitable hand-crafted features and have developed various machine learning classifiers using the extracted features [6], [7], [8].
However, deep learning (DL) approaches such as convolutional neural networks (CNNs) can automatically extract useful features from the input through convolutional operation and can be trained through end-to-end learning in which raw data is directly used without complex feature engineering techniques [9].Therefore, CNNs have made EEG classification models to extract various complex features more easily [10], [11], [12], [13], [14].
Nevertheless, CNNs are typically difficult to understand their internal mechanism due to their complex structures.To solve this limitation, many studies on EEG classification have applied explainable artificial intelligence (XAI) methods to understand how their CNN models work [15], [16], [17].The CNN models for EEG classification generally have been explained in the spatial (e.g., which EEG channels were important in the classification results) and temporal (e.g., which time intervals were important) dimensions [18].
In this study, 3D-EEGNet is proposed as an improved classification model for MI EEG.The 3D-EEGNet architecture extends the structure of EEGNet [14] for better explainability in EEG classification.First of all, the three-dimensional (3D) shape of input data preserves the spatial and temporal information of the EEG data.As a result, the experiment showed that the 3D shape could improve the accuracy of MI EEG classification as well as spatio-temporal explanatory power.By applying XAI methods to the 3D-EEGNet, it can be explained in terms of spatial (channel) and temporal (time) aspects.
In the meantime, to explain the 3D-EEGNet model in this research, two kinds of XAI methods, permutation and saliencybased methods, are considered.First, the permutation method, which is more accurate than the saliency-based ones, is used to interpret the model in spatial or temporal aspects.However, the permutation method is difficult to apply to inspect spatial and temporal features at the same time due to its high computational complexity.For that reason, saliency-based methods, which are much faster than the permutation method, are additionally adopted for the spatio-temporal explanation of 3D-EEGNet instead of the permutation method.Unfortunately, the saliency-based methods cannot guarantee their reliability [19].Hence, in this study, a novel evaluation process is also proposed to choose the best XAI method for the spatiotemporal explanation of 3D-EEGNet.Specifically, the normalized discounted cumulative gain (NDCG) score [20] is used to compare the rank of important spatial and temporal features.
The main contribution of this study is two folds.
• Improvement of explainability and classification performance of MI EEG classification: The proposed 3D-EEGNet performs better in terms of the accuracy of MI classification, and also maintains the spatial and temporal properties of the original data in the 3D data shape for better explainability.
• Novel evaluation process for selecting XAI methods based on rank: An evaluation process is presented to choose the best XAI among a few saliency-based methods, specifically by using the NDCG score.Through this process, the computational cost for the spatiotemporal explanation can be reduced in consideration of reliability.The remainder of this paper is structured as follows.In Section II, previous studies on MI EEG classification and XAI methods are described.In Section III, the framework for developing 3D-EEGNet and explaining its output is introduced.In Section IV, the detailed architecture of 3D-EEGNet is presented.The explanation methods for the 3D-EEGNet model are described in Section V.The experimental results are illustrated in Section VI with spatial and temporal explanations of the 3D-EEGNet.Finally, Section VII concludes this paper with future work.

II. RELATED WORK A. EEG Classification for Motor Imagery
The classification of MI EEG signals is difficult because of a few reasons such as limited spatial resolution and low signal-to-noise ratio (SNR) [17].To handle these issues, conventional machine learning (ML)-based MI EEG classification models have employed common spatial pattern (CSP) features and its variants to mainly extract spatial information of MI EEG signals [21], [22].However, such classifiers require comprehensive prior knowledge on EEG because the features are commonly extracted by hand-crafted ways which depend on experimental subjects and hardware settings.
DL-based end-to-end EEG classifiers have been used to overcome this problem because of their ability to automatically extract useful features.Particularly, CNNs have received the most attention owing to their advantages in effectively reflecting structural information of the input data.In EEG classification, DeepConvNet, ShallowConvNet [23], and EEG-Net [14] are popular CNNs models.They commonly used minimally preprocessed EEG signals, and they used spatial and temporal convolutional layers to extract meaningful features of EEG.In this study, the architecture of EEGNet is used as the base structure of MI classification model to extract EEG features effectively.

B. XAI for EEG Classification
Most XAI methods have been introduced in the fields of computer vision and natural language process [24].By all means, in BCI applications, XAI is also adopted to obtain transparency of CNNs models [25].One of the early studies that addressed the interpretability of DL models on EEG signals investigated the application of layer-wise relevance propagation (LRP) [26], [27].Studies applying LRP have generally provided heatmaps that represent the feature importance of each EEG channel in a single experimental trial [28], [29].This visualization using heatmaps has been widely used to show their models work well compared with common methods such as CSP and FBCSP [30].This helped to understand how the DL models exploit specific channels and time points for their classification tasks.In this context, as XAI methods have been enhanced, various methods such as DeepLIFT [31], SHAP [32], and Grad-CAM [33] have also been applied in the BCI applications to understand DL models for EEG [34], [35].

C. Model-Agnostic XAI: Permutation Method
Model-agnostic XAI methods are applicable to various DL models regardless of the type of models.The permutation method explains the models by arbitrarily removing or manipulating specific features.The method is not only straightforward, but also provides an excellent explanation of models in terms of the connection between the feature information and their performance.Therefore, in this study, the permutation method is used as a ground-truth method to explain the proposed MI EEG classification model.However, because of its high computational complexity of having to permuting features one by one, it cannot be used to explain the huge number of combinations of spatial and temporal features.
The permutation method for time-series data was introduced in [39].The authors showed three permutation techniques, zero, swap, and inverse permutations.They showed which data points are important for the ML model's results.However, in EEG, there is no meaning to indicate certain time points as important because users are only interested in important spatial and temporal information (channels and time intervals).Therefore, in this study, the three permutation techniques are applied to sets of time points in spatial and temporal dimensions.

D. Model-Specific XAI for CNN: Saliency-Based Methods
Model-specific XAI methods are dependent on the target algorithms.Among the methods, saliency-based methods are widely used to explain CNNs models [36].The methods calculate the influence of the input on the prediction output.In this study, three saliency-based methods, saliency map, DeepLIFT, and DeepSHAP, are considered XAI for CNN models.
1) Saliency Map: This method uses the gradients of trained CNNs model to represent the importance of the input on the prediction result.For instance, the importances of input image pixels are calculated by multiplying an image with the weight vectors which are calculated using backpropagation [37].The saliency map is created by summing the weight values of each image pixel.This method requires only a single backpropagation to obtain the importances of all the input pixels, so the calculation is simple and fast.

2) DeepLIFT:
Deep Learning Important FeaTures (DeepLIFT) is designed to obtain the importance of input in the prediction of CNNs models.DeepLIFT uses multipliers that represent a slope describing how the outputs are changed when the inputs are different from reference data.In this study, DeepLIFT is also considered to explain the MI EEG classification model due to its good mathematical background and intuitive concept as well as its ease of application.
3) DeepSHAP: DeepSHAP (DeepLIFT + SHAP) is a method of approximating SHAP (SHapley Additive exPlanations) values using DeepLIFT.The calculation of the Shapley values is an expensive process because of the massive number of combinations of features.DeepLIFT makes it possible to obtain the Shapley values by approximating them using multipliers.In this study, DeepSHAP is also regarded as the explanation method because it is based on the Shapley values, which are mathematically proven to satisfy desirable properties for explaining models.

III. FRAMEWORK
The proposed framework for developing and explaining a MI EEG classification model is shown in Fig. 1.In the first stage, an EEG dataset is preprocessed by channel selection, bandpass filtering, and normalization, and it is then used to train 3D-EEGNet for MI EEG classification.Several electrode channels located on the sensorimotor cortex are selected.The MI-related frequency bands, the alpha band (8-12 Hz), beta band (16-24 Hz), and gamma band (30-35 Hz) [38], are then extracted by finite impulse response (FIR) bandpass filtering [39].Finally, the filtered data are normalized by the z-normalization for stable training.The preprocess EEG data is used as the input for training the proposed 3D-EEGNet model.
The next stage is the explanation of the trained 3D-EEGNet model.The XAI methods determine which brain areas (spatial importance) and time intervals in the EEG signal (temporal importance) are important to obtain the prediction results.As a model-agnostic XAI, the permutation method is used to obtain the spatial and temporal importance separately.To obtain spatial and temporal importance of EEG data, the data are permuted channel-wise and time-wise.
In contrast, three saliency-based methods, saliency map, DeepLIFT, and DeepSHAP are considered to obtain the spatiotemporal explanation for the 3D-EEGNet because they are much faster than the permutation method.As the saliencybased methods have been found to be unreliable in a recent study due to the unsatisfaction of input invariance [19], however, the best one is selected for faster, but reliable spatiotemporal explanation after comparison with the permutation method.In the comparison, the spatial and temporal importances are compared by the ranks of the important channels and time intervals because the scales of the XAI methods are different, making it difficult to compare their results directly.
The similarity of the ranks between the saliency-based methods and the permutation method is measured by the NDCG score, which is commonly used for evaluating recommendation systems.The NDCG score assigns more weights to higher ranks to focus on the high-rank items.The NDCG score is appropriate for the EEG XAI because users want to focus more on only important channels and time intervals rather than unimportant ones in terms of ranking.
The saliency-based method with the highest NDCG score is selected as the best XAI method.It is finally used to explain the 3D-EEGNet model in the spatio-temporal dimension so that we can understand the behavior of the 3D-EEGNet and the properties of the experimental subjects.

IV. 3D-EEGNET FOR EEG CLASSIFICATION A. Design of 3D-EEGNet
The 3D-EEGNet is a variant of EEGNet, which is wellknown as a compact CNNs model for EEG classification [14].
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.The successful structure of EEGNet includes depthwise convolutional layers and separable convolutional layers, which enable to reduce the model's parameters and extract powerful features of EEG signals.
While the input of EEGNet has two dimensions (channels and time), the input of 3D-EEGNet is designed to have three dimensions so that the model can preserve the spatial information of channels.Although several studies have shown the positive effects of 3D input representation in EEG model developments [16], [40], the main motivation of the 3D input for 3D-EEGNet is the activated brain area when MI.
The 2D structure of the original EEGNet (herein, 2D-EEGNet) learns spatial and temporal features in two dimensions, respectively.However, the receptive fields of EEG itself can be considered the 2D structure as shown in Fig. 3.In other words, 2D-EEGNet uses a single layer (DepthwiseC-onv2D) for spatial learning of EEG signals, while 3D-EEGNet does two layers (Blocks 2-1 and 2-2 in DepthwiseConv3D layer) for vertical and horizontal spatial learning, respectively.Therefore, we think that the 3D structure of our proposed 3D-EEGNet might be beneficial to reflect the 2D spatial features and 1D temporal features.
Additionally, the convolutional filters of 3D-EEGNet are trained to capture the event-related desynchronization (ERD) of alpha (often called mu) and beta frequency bands which are measured in the primary sensorimotor area [41].When humans imagine a kind of movement, ERD quantifies the taskrelated power of the frequency bands by showing a temporary decrease in the signal, which is observed at the onset of events.Note that the sensorimotor area where ERD is significant is related to the type of MI.For example, when movement of the right hand is imagined, ERD is known to be significant in the left sensorimotor area [42].In the case of foot MI, the center of the sensorimotor area is activated.

B. Architecture of 3D-EEGNet
The architecture of 3D-EEGNet is shown in Figure 2, and the details of each layer are summarized in Table I.In block 1, the input data are shaped in (H , W , T ), where H and W are the height and width in the channel dimension, respectively, and T is the length of the time dimension.In the first 3D convolutional layer, each temporal filter learns the temporal information of the specific frequency of MI with F 1 filters.
In block 2-1, (H , 1, 1) spatial filters in the depthwise convolutional layer learn the vertical (front-to-back of the head) information of the sensorimotor area.The number of filters, D, is set to W , which means that one filter only learns one vertical information, excluding the horizontal (ear-to-ear) information.The spatial filters are applied to each output given from the block 1, not to all the outputs (this is the depthwise convolution).
In block 2-2, (1, W , 1) spatial filters learn the horizontal features of the sensorimotor area by depthwise convolution.The number of filters is set to be the same as in the previous layer (block 2-1) because the output of block 2-1 has only one horizontal information (H becomes 1 in block 2-1).Ultimately, the output size of the block 2-2 becomes (1, 1, T /4, F 1 × D), which means that the spatial information is summarized into a single value.
In block 3, two types of filters are learned by separable convolution: one of which summarizes the outputs from the previous layer, and the other reduces the dimension by using (1, 1, 4) feature maps.Subsequently, the outputs of the separable convolutional layer are flattened and connected to the dense layer for classification in block 4.

V. EXPLANATION OF 3D-EEGNET A. Explanation Using Permutation
This section describes how important channels and time intervals in EEG data are recognized in the trained 3D-EEGNet model by using the permutation method.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
1) Spatial Explanation: The spatial explanation of 3D-EEGNet is provided through important channels.Suppose an EEG dataset X for each subject includes I trials, J channels, and T times of EEG signals, denoted by X = {X 1 , . . ., X I }, where X i = {x 1 i , . . ., x J i } is the i-th trial of EEG data, x j i = (x j i,1 , . . ., x j i,T ) is the j-th channel vector of X i , and x j i,t is the signal value of x j i at time t.Three permutation techniques, zero, swap, and inverse, are applied to all channels one by one [43].The zero permutation for spatial explanation replaces x j i with a zero vector of the same size as in (1).The swap permutation changes x j i in the reverse order as in (2).The inverse permutation inverts the values of x j i by subtracting each value from the maximum value of x j i as in (3).
To determine the importance of the j-th channel, three types of the permuted data, X − j zer o , X − j swap , and X − j inver se , are prepared as follows.
where X − j t ype is the permuted data for the j-th channel by replacing all the j-th channel data X j with one of the three channel permutation functions, f zer o , f swap , or f inver se .
Subsequently, the three permuted data for the j-th channel, X − j zer o , X − j swap , and X − j inver se , are used to evaluate the change in the accuracy of the MI classification model.If a specific feature is important, the prediction performance would be significantly decreased when the corresponding permutated data is given as input of the model.The accuracy change for the j-th channel permutation, denoted by j t ype , is measured by the difference between the accuracy of the original model, acc(X), and the accuracy of the model trained by the j-th channel permuted data, acc(X Finally, the permutation importance of the j-th channel, denoted by I j pt , is evaluated by the min-max normalization of the average of three accuracy changes for the j-th channel, 2) Temporal Explanation: The temporal explanation of 3D-EEGNet is performed by identifying the important time intervals in EEG signals.To evaluate the importance of a time interval, one can extract a subsequence x j,k i from the j-th channel vector of the i-th trial, x j i = (x j i,1 , . . ., x j i,T ), as follows.
where x j,k i is the k-th time interval in x j i for k= 1 to K (≪ T ), s is the size of a time interval, and t (k) = s (k − 1).
The goal of the temporal explanation is to evaluate the importance of K time intervals across all I trials.Each time interval x j,k i can also be permuted by the three types of permutation in the same way as the channel permutation.
Similar to ( 1)-( 3), three types of permutations for x j,k i , f t ype (x j,k i ) for type ∈ {zero, swap, inverse}, can be applied to obtain three time-interval permutation data for the k-th time interval, X −k t ype , similarly to ( 4)- (5).By using the permutation data, the change in the accuracy for the time interval, k t ype , is calculated, and the importance of the k-th time interval, I k pt , can also be evaluated similarly to ( 6)-( 8).

B. Explanation Using Saliency
The saliency-based methods are applied to the trained 3D-EEGNet.The test data is given to the model as inputs, and the saliency-based methods generate the gradients (Saliency map), attribution scores (DeepLIFT), or SHAP values (DeepSHAP) with respect to the inputs.When the saliency-based methods are applied to EEG signals, one signal point is considered as a single feature.In other words, they generate the importance of each signal point.
1) Spatial Explanation: Spatial explanation can be conducted by identifying important channels using signal point importance.The importance of a single signal point x j i,t is denoted as v j i,t .The importance of the j-th channel, also known as attribution, α j , can be calculated by the summation of v j i,t 's in all the trials and time as follows.α = α 1 , . . ., α j , . . ., α J , (10) where v j i,t is the importance of channel j on trial i at time t.Finally, the importance of the j-th channel, I j sal , calculated by a saliency-based method sal, is obtained by normalizing the attribution value into a relative value between 0 and 1.It is performed to obtain relative values for each saliency-based method to compare the importance among the methods.

(12)
2) Temporal Explanation: The temporal explanation of 3D-EEGNet is conducted in a similar way to the spatial explanation.The EEG signals were divided into K time intervals as (9), and the importances of the time intervals are determined.
Similar to ( 10)-( 12), the importance of the k-th time interval is obtained by the attribution of the k-th time interval, α k , Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
calculated by the summation of v k i,t 's in all the trials and time.The importance of the k-th time interval, I k sal , calculated by a saliency-based method, is also obtained by transforming the attribution value through the min-max normalization.

C. Selection of the Best Saliency-Based Method
This section describes the selection process of the best saliency-based method.In specific, using the NDCG score [20], the results of three saliency-based methods are compared with that of the permutation method to investigate their consistencies.
The importance of the j-th channel calculated by the permutation method, I pt is compared with the temporal importances of three saliency-based methods, I k sal 's.The IDCG is calculated using I k pt because the I k pt is considered the ground truth.According to the I k pt obtained from a saliency-based method, the k-th time interval is ranked as r k pt , and the temporal IDCG is calculated as I DC G t pr = K k=1 I k pt / log 2 (r k pt + 1) .And the temporal DCG of the saliency-based method is calculated as DC G t pr sal = K k=1 I k pt / log 2 (r k sal + 1) .Finally, the temporal NDCG score of the saliency-based method is calculated as N DC G t pr sal = DC G t pr sal /I DC G t pr .

VI. EXPERIMENTS A. Datasets
In the experiment, the BCI Competition III-IVa (BCIC) dataset [44] and GigaDB dataset [45] were used.The BCIC dataset was collected from five subjects (aa, al, av, aw, and ay).The subjects were instructed to imagine movements of their right hand and right foot according to visual cues.Each subject conducted 280 trials (right hand: 140, right foot: 140) and the visual cues appeared for 3.5 seconds.The cues were randomly given by periods of 1.75 to 2.25 seconds for the subjects to relax.The EEG signals were collected from 118 electrode channels of the extended 10/20 system [46] at 100 Hz.
GigaDB involves the MI EEG of 52 subjects collected with 64 electrode channels at 512 Hz.Each subject conducted 100 or 120 trials (right hand, left hand) and the cue appeared for 3 seconds.In this paper, only the five subjects (s01-s05) were used to evaluate the performance of 3D-EEGNet.To consider only the channels located on the sensorimotor cortex of the brain, 21 channels (FC5, FC3, FC1, FCz, FC2, FC4, FC6, C5, C3, C1, Cz, C2, C4, C6, CP5, CP3, CP1, CPz, CP2, CP4, and CP6) were selected among the many channels, as depicted in Fig. 3.It helps to prevent the other channels from influencing the model with noisy signals.
The frequency bands, alpha (8-12 Hz), beta (16-24 Hz), and gamma (30-40 Hz) bands, are known to be significant to MI analysis [47].Therefore, to involve the above frequency bands, the range of 8-40 Hz was extracted by using band-pass filtering.Then, the EEG signals were normalized to assign the mean and standard deviation of the EEG signals as 0 and 1, respectively.
The original shape of the dataset is represented as (trials I , channels J = 21, time length T ).To train the 3D-EEGNet, the input shape is transformed to (I , W = 3, H = 7, T ), where the channel dimension is converted to two dimensions.The (3,7) channel shape was designed based on the actual location of those channels on the sensorimotor cortex, as shown in Fig. 3.

B. Training of 3D-EEGNet
The hyperparameters of the 3D-EEGNet were set to consider the properties of MI EEG.The number of temporal filters in the first 3D convolutional layer (F 1 ) was set to 24 to match the number of frequencies mainly related to the MI (alpha band: 8-12 Hz, beta band: 16-24 Hz, gamma band: 30-40 Hz).
The length of the temporal filters was set depending on the EEG sampling rate.The 8 Hz EEG in the alpha band was the longest frequency signal.This means that the 8 Hz signal is shown every 12.5 signal points in the 100 Hz EEG (BCIC).Therefore, the length of temporal filters was set to 13 because the filter length should be an integer.In the result, the length 13 temporal filters can involve the alpha, beta, and gamma bands altogether.Likewise, in the GigaDB, the length of the temporal filters was set to 64 (512 Hz / 8).
The number of pointwise filters in the separable convolutional layer was set to D×F 1 , which means learning the representation as much as the number of inputs, as suggested in [14].Batch normalization was applied after the convolutional layers, and dropout was applied after blocks 2 and 3 to prevent overfitting.The ELU [48] was used as the activation function before applying average pooling.

C. Classification Performances of
The classification accuracies are presented in Table II.The accuracy of each subject was the averaged value of the 10-fold cross validation results.The mean and deviation of the 10 test accuracies is represented.The proposed 3D-EEGNet showed better performances in the two datasets compared to the original EEGNet.In the BCIC, 3D-EEGNet showed 1.76% higher accuracy with lower variance in average on the all subjects.In the GigaDB, 3D-EEGNet improved the performance by 6.16% with lower variance.
The hyperparameters of the 3D-EEGNet model were verified by ablation studies, as shown in Table III.Three hyperparameters, the size and the number of temporal convolutional filters in Block 1, and the number of spatial filters in Block 2 were tested.The current 3D-EEGNet model showed the best performances on average for the accuracy of the all subjects in the two datasets (See Supplementary materials Table S.I for the ablation results of 2D-EEGNet).

D. Spatial and Temporal Explanation Using Permutation
In this section, the explanation results for the BCIC dataset are represented (see Supplementary materials Fig. S1 and  Table S.II for the results of GigaDB dataset).Three types of permutation, zero, swap, and inverse, were applied to investigate the importance of the channels and time intervals for five subjects.The importances of channels and time intervals are shown in Figs.4a and 4b, respectively.In Table IV, the top-3 important channels and time intervals are presented.For temporal explanation, EEG signals were divided into eight time-intervals with 0.5 second interval.
As summarized in Table IV, the important channels of five subjects were slightly different.For four subjects, aa, al, av, and ay, channels C3 and CP3, which are on the left area of the sensorimotor cortex, are shown as important channels.This means that the 3D-EEGNet model uses the EEG signals of the left area of the sensorimotor cortex, which is known to be activated when the right-hand MI is conducted.

TABLE V NDCG SCORES OF SALIENCY-BASED METHODS
In the temporal importance, time intervals T3 (0.5 -1.0 sec.) and T4 (1.0 -1.5 sec.) were found as the important ones for the four subjects.This means that the EEG signals for 1 second after 0.5 second from the visual cue were important for MI classification.This result corresponds to the prior knowledge that ERD is observed immediately after initiating motor imagery.Therefore, it is expected that the 3D-EEGNet successfully learned the related features for MI classification.
Meanwhile, the subject aw showed different results with the other subjects.In this subject, the channel C4, which is located on the right area of the sensorimotor cortex, was found to be the most important channel, and time intervals T6 (2.0 -2.5 sec.) and T5 (1.5 -2.0 sec.) to be the most important.The reason of this different result was investigated in the ERD/ERS map (see Appendix A).The ERD of aw cannot be found in the left-side channels differently from the other subjects.Thus, it can be inferred that the 3D-EEGNet learned different important features in the right-side channels (e.g., C4) in the late time intervals (see the results for subject aw in Fig. 4).

E. The Best Saliency-Based Method Selection and Spatio-Temporal Explanation
To select the best saliency-based method for spatio-temporal explanation, the spatial and temporal importances of saliency map, DeepLIFT, and DeepSHAP were compared with those of the permutation method.Their obtained importances of channels and time intervals were transformed to ranks to compare with one another by the NDCG score.
The NDCG scores calculated for the BCIC dataset are summarized in Table V. (see Supplementary materials Table S.III for the NDCG scores for the GigaDB dataset).In the tables the NDCG scores represent how close the spatial and temporal importances of the saliency-based method yielded to those of the permutation method.Among the three saliencybased methods, DeepLIFT showed the highest average NDCG scores of 0.9736 in the spatial dimension and 0.9741 in the temporal dimension, respectively.This means that DeepLIFT provides the most similar ranking results compared with the permutation method, which is known to provide reliable results than saliency-based methods despite high computation cost.Therefore, DeepLIFT was selected as the most proper spatiotemporal explanation method for the trained 3D-EEGNet model.
As a result, using DeepLIFT, the 3D-EEGNet model is explained in the spatio-temporal dimension.In Fig. 5, the spatio-temporal importances obtained by DeepLIFT are depicted.The higher value represents the more important.The important channels and time intervals can be recognized simultaneously in 0.01-second time resolution.
In Fig. 5, for the subjects aa and av, the importances of channels CP3, CPz, and C3 (right area of the sensorimotor cortex) from 0 s to 1.5 s are found to be high.This result is consistent with the permutation results where the channels CP3, C3, and CPz and the time intervals T2, T3, and T4 (0.5-1.5 seconds) were important (see Table IV).Specifically, the importances of subject av are widespread over many leftside channels like CP4 and CP4, as well.From this, it can

+ 1 .+ 1 .
j pt , is compared with importances of three saliency-based methods, I j sal 's.The ideal DCG (IDCG) is calculated using I j pt because the I j pt is considered the ground-truth.According to the magnitude of I j pt , the j-th channel is ranked as r j pt , and the spatial IDCG is calculated as I DC G spt = Similarly, according to the I j sal obtained from a saliencybased method, the j-th channel is ranked as r j sal .The spatial DCG of the saliency-based method is calculated as DC Finally, the spatial NDCG score representing the similarity of the ranks between the permutation method and the saliency-based method in terms of spatial explanation is calculated as N DC G spt sal = DC G spt sal /I DC G spt .Similar to spatial importance, for the comparison of temporal importance, I k

Fig. 4 .
Fig. 4. Spatial and temporal importance by the permutation method.

Fig. 5 .
Fig. 5. Spatio-temporal importances using DeepLIFT; the vertical red dot line at time 0 represents when the visual cue was given.

TABLE I DESIGN
OF THE 3D-EEGNET

TABLE II ACCURACY
COMPARISON OF EEGNET AND 3D-EEGNET TABLE III ABLATION STUDY ON THE ACCURACY OF 3D-EEGNET MODEL

TABLE IV TOP
-3 IMPORTANT CHANNELS AND TIME INTERVALS OBTAINED FROM THE PERMUTATION METHOD