Multiscale Convolutional Transformer for EEG Classification of Mental Imagery in Different Modalities

A new kind of sequence-to-sequence model called a transformer has been applied to electroencephalogram (EEG) systems. However, the majority of EEG-based transformer models have applied attention mechanisms to the temporal domain, while the connectivity between brain regions and the relationships between different frequencies have been neglected. In addition, many related studies on imagery-based brain-computer interfaces (BCIs) have been limited to classifying EEG signals within one type of imagery. Therefore, it is important to develop a general model that can learn various types of neural representations. In this study, we designed an experimental paradigm based on motor imagery, visual imagery, and speech imagery tasks to interpret the neural representations during mental imagery in different modalities. We conducted EEG source localization to investigate the underlying brain networks. In addition, we propose the multiscale convolutional transformer for decoding mental imagery, which applies multi-head attention over the spatial, spectral, and temporal domains. The proposed network shows promising performance, with mental imagery classification accuracies of 0.62, 0.70, and 0.72 on the private EEG dataset, the BCI competition IV 2a dataset, and the Arizona State University dataset, respectively, compared to conventional deep learning models. Hence, we believe that it will contribute significantly to overcoming the limited number of classes and the low classification performance of BCI systems.


I. INTRODUCTION
Electroencephalogram (EEG)-based brain-computer interface (BCI) is one of the most actively used non-invasive BCI systems. Owing to the advantages of non-surgical electrode placement, high temporal resolution, portability, and cost efficiency, various EEG-based systems have been developed to rehabilitate or assist patients with neurological impairments [1]. Numerous experimental BCI paradigms have been proposed to control external devices such as a robotic arm [2], an exoskeleton [3], and a speller system [4] using particular brain signals in a specific condition. Among them, mental imagery has been deeply researched as a control strategy for BCI since it uses intrinsic brain activity manifested directly from users' voluntary imagination. Also, as mental imagery is independent of external stimuli, it facilitates the realization of a user-friendly interface that causes less fatigue and enables more natural situation awareness for users. The majority of mental imagery experimental paradigms were designed modality-specifically for the purpose of the system. For example, motor imagery (MI) has been used to develop systems to rehabilitate or restore lost motor function in stroke patients [5], speech imagery (SI) can be used to develop communication tools for people who are unable to communicate [6], and visual imagery (VI) can be used as an alternative for users who exhibit BCI illiteracy for MI [7].
Meanwhile, many studies based on functional magnetic resonance imaging, magnetoencephalogram (MEG), and EEG have found prominent brain areas and spectral band groups that engage in specific mental imagery. Furthermore, studies on EEG classification using both MI and SI have shown the possibility of expanding the limited number of commands for EEG-based BCI [8], [9]. Also, a recent MEG study on the classification of MI, motor execution, VI, and visual perception [10] reported that imagery behaviors are correlated with modulated activity in the respective modality-specific regions and with additional activity in supramodal imagery regions.
Therefore, EEG classification of mental imagery in different modalities can be used to overcome the limited number of classes and the low classification performance of EEG-based BCI. Since the EEG patterns have a spatial constraint for modality-specific regions, it becomes more challenging to extract discriminative features as the number of classes increases. Moreover, EEG has a low spatial resolution and a low signal-to-noise ratio (SNR) owing to the large distance between the source of the signal (brain) and the location of the electrodes (scalp). Consequently, the classification performance of modality-specific imagery exhibits a trade-off between the classification accuracy and the number of classes, which results in low stability or a limited number of commands for EEG-based rehabilitation applications. By taking advantage of the different functional brain networks of mental imagery in different modalities, this approach can circumvent the constraint of maintaining adequate spacing between the corresponding sources. Also, it can reinforce system redundancy by providing extra control options for users with BCI illiteracy in a particular imagery.
However, multiclass classification of mental imagery in different modalities remains a challenge. First, as the similarity among classes within the same modality is higher than that between classes of different modalities, classification accuracy suffers from the biased distribution of classes across inter- and intra-modality cases. Second, as each modality of mental imagery has its own prominent frequency bands, it is difficult to extract optimal spectral features for each class.
In this study, we designed the experimental paradigms and explored multi-class EEG classification of mental imagery in different modalities (MI, VI, and SI). The main contributions of this research are summarized as follows. i) We confirmed that distinguishable spectral-spatial patterns can be found in mental imagery of different modalities through EEG source localization analysis. ii) We propose a neural network called a multiscale convolutional transformer for decoding mental imagery in different modalities. To devise a general-purpose network for the classification of different mental imageries, we adopted a multiscale temporal convolutional block to extract a wide range of spectral features. Also, to learn a better spatio-temporal representation of EEG signals during mental imagery, we designed a factorized transformer encoder with spatial mapping. Statistical and neurophysiological analyses were conducted to investigate the significant features for classifying MI, VI, and SI EEG signals. We classified a total of six classes of EEG signals with a high and robust accuracy of 0.62 using our proposed multiscale convolutional transformer. To the best of our knowledge, this is the first attempt to classify high-dimensional imagery tasks using MI, VI, and SI, which has the potential to expand the number of commands and enhance the classification performance of multi-class EEG signals.

II. RELATED WORK
Various kinds of deep learning models have been proposed to classify EEG signals [11], [12], [13], [14]. The application of a convolutional neural network (CNN) to the EEG-based BCI domain showed successful results for end-to-end feature extraction and classification. Schirrmeister et al. [15] designed CNN architectures called DeepConvNet and ShallowConvNet to classify raw MI EEG signals. Their performance was competitive with that of the filter bank CSP [16], which is one of the most well-known machine learning methods for classifying MI EEG signals. Lawhern et al. [17] proposed a versatile and lightweight CNN architecture called EEGNet, which can classify EEG signals from various BCI paradigms (P300, movement-related cortical potentials [18], error-related negativity responses, and sensory-motor rhythms). Antelis et al. [19] proposed a dendrite morphological neural network that incorporates the computational structure in the dendrites of neurons. They designed the model architecture to produce closed separation surfaces between the classes to offer a different solution for enhancing multi-class classification performance. Virgilio et al. [20] introduced the spiking neural network (SNN) to the EEG classification task and achieved an accuracy of 0.83 for binary classification of an MI task. The SNN is considered a third-generation artificial neural network (ANN) with great potential to replace second-generation ANNs such as the CNN.
Recently, an attention mechanism-based deep learning model named the transformer [21] demonstrated outstanding performance in diverse domains such as natural language processing [21], [22] and computer vision [23]. Conventional recurrent neural networks (RNNs) such as gated recurrent units and long short-term memory [24] suffer from the vanishing/exploding gradient problem as the layers become deeper, thus hindering the learning of long data sequences. Following these developments, the transformer has been introduced to EEG systems [25], [26], [27], [28], [29]. Bagchi et al. [25] proposed a deep learning network that consists of multi-head attention (MHA) and convolutional layers for EEG-based visual stimulus classification. Kostas et al. [26] adapted the transformer architecture of the language model to the EEG domain and found that a single pre-trained model is capable of modeling new raw EEG sequences and different subjects performing different tasks. Li et al. [30] proposed a convolutional self-attention network for emotion recognition that captures individual information within and across different frequency bands to learn complementary frequency information. While the majority of works used attention mechanisms and convolutional layers with limited spatial-temporal information, the application of self-attention with multi-kernel and multi-branch convolutional layers can extract more diverse features.

III. MATERIALS AND METHODS

A. Participants
Forty healthy subjects (40 males, aged 25.5 (±3.1) years) participated in the experiments. All experiments were approved by the Korea University Institutional Review Board (KUIRB-2020-0318-01). Prior to the experiments, we instructed the subjects to get adequate sleep (at least seven hours) and to refrain from drinking alcohol on the previous day. A detailed explanation of the experimental protocol and procedures was provided to the subjects. In accordance with the Declaration of Helsinki, they provided their written consent.

B. Data Description
We used various mental imagery-based EEG datasets to validate our proposed model, including private and public datasets. The details of the datasets are shown in Table I.

1) Mental Imagery Dataset (Private Dataset): We acquired the EEG signals during the mental imagery tasks using our experimental paradigms. Fig. 1 describes the mental imagery in different modalities and the method for preserving the spatial information of EEG. The lower part of Fig. 1 corresponds to the subclasses of each mental imagery type. The upper part of Fig. 1 shows the method for preserving the spatial information of EEG: after the transformer encoder compressed the temporal features, the EEG features were rearranged into an 8 × 8 shape. Fig. 2 presents the experimental protocols for each mental imagery task. EEG signals were measured using BrainAmp (BrainProduct GmbH, Germany) and 64 Ag/AgCl electrodes according to the international 10/20 system. The reference electrode was placed at FCz and the ground electrode at FPz. In this experiment, the sampling rate was set to 500 Hz, and a notch filter was applied at 60 Hz. Conductive gel was injected between the electrodes and the scalp before the EEG signals were acquired to maintain electrode impedances below 10 kΩ. The MI tasks involve kinesthetic imagery of the "left hand" and "right hand." In the VI tasks, participants were instructed to visualize two different swarm behaviors ("split" and "fall in") after watching video clips of the corresponding tasks. For the SI tasks, two words were presented, and the participants imagined the pronunciations of the given words ("go" and "stop").
The imagery period for the MI and VI tasks was set to 4 s because imagery of kinesthetic movement or a visual representation can continue for a few seconds per trial. However, the SI tasks take only one or two seconds to pronounce each word. For the SI tasks, the imagery period was therefore set to 1.5 s, repeated four times consecutively in one run. The tasks were repeated 50 times each. Therefore, we acquired 50 trials per class for MI and VI, and 200 trials per class for SI.

2) BCI Competition IV 2a Dataset: To validate the proposed method on a public EEG dataset, we used the BCI competition (BCIC) IV 2a dataset [31]. The BCIC IV 2a dataset consists of four MI classes: "left hand", "right hand", "feet", and "tongue". The data were acquired from nine subjects using twenty-two EEG channels and three electrooculogram (EOG) channels. Each subject completed two sessions on different days. There were six runs in each session, separated by short breaks. There were 144 trials per class, and the participants were instructed to perform each MI task for 3 s. The data were sampled at 250 Hz with a frequency range of 0.5-100 Hz.
3) Arizona State University Dataset: The Arizona State University (ASU) dataset [32] was used for the classification of SI tasks. In this study, we used the dataset for short versus long word classification. The dataset consists of two SI classes, the English words "in" and "cooperate." Each class consists of 100 trials, and a single trial lasts 5 s. The data were acquired from six subjects with sixty EEG channels and four EOG channels. The data were preprocessed using a frequency range of 8-70 Hz.

C. Data Preprocessing
The mental imagery dataset was pre-processed using the BBCI Toolbox [33] and the EEGLAB Toolbox [34] (version 2021.1). The δ band (0.5-4 Hz) is the frequency band most susceptible to contamination by EOG artifacts. To examine the influence of including and excluding the δ band, the high-pass cutoff was set to 0.5 and 4 Hz, respectively. In addition, each mental imagery has different spectral features that contribute to the classification accuracy. Therefore, the low-pass cutoffs were set to 30, 60, and 120 Hz to investigate coordinated frequency bands for a multi-class classification task with different mental imageries. In summary, the data were band-pass filtered using a fifth-order Butterworth filter into six types (0.5-30 Hz, 0.5-60 Hz, 0.5-120 Hz, 4-30 Hz, 4-60 Hz, and 4-120 Hz). After that, the data were downsampled to 250 Hz. In order to alleviate fluctuation and nonstationarity, z-score normalization was employed as

X̂ = (X − μ) / σ,

where X̂ ∈ R^(C×T) and X ∈ R^(C×T) indicate the normalized and input signals, respectively. The mean value and variance of the data are represented by μ and σ², respectively. 60% of the total data were used as the training dataset, 20% as the validation dataset to prevent overfitting, and the remaining 20% as the test dataset. To match the length of the training dataset, a sliding window technique was applied to the MI and VI datasets: a single 4 s trial was divided into four epochs of 1.5 s with an overlapping period of 0.7 s. Consequently, 720 trials (120 trials × six classes) were used for the training dataset, 240 trials (40 trials × six classes) for the validation dataset, and 240 trials (40 trials × six classes) for the test dataset.
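For concreteness, a minimal sketch of this preprocessing chain in Python with SciPy is shown below; the function names and the simple decimation step are illustrative assumptions rather than the exact BBCI/EEGLAB pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(x, fs=500, band=(4.0, 120.0), fs_new=250):
    """x: raw EEG of shape (channels, samples); returns the normalized signal."""
    # Fifth-order Butterworth band-pass filter (zero-phase).
    sos = butter(5, band, btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x, axis=-1)
    # Downsample 500 Hz -> 250 Hz (simple decimation; the low-pass cutoff
    # is already below the new Nyquist frequency of 125 Hz).
    x = x[:, :: fs // fs_new]
    # z-score normalization: X_hat = (X - mu) / sigma.
    return (x - x.mean()) / x.std()

def sliding_windows(trial, fs=250, win=1.5, overlap=0.7):
    """Split a 4 s trial into four 1.5 s epochs with 0.7 s overlap."""
    size, step = int(win * fs), int((win - overlap) * fs)
    starts = range(0, trial.shape[-1] - size + 1, step)
    return np.stack([trial[:, s:s + size] for s in starts])
```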

D. Data Analysis
Data analysis was conducted utilizing the EEGLAB Toolbox and Brainstorm [35]. EEG source localization (ESL) was performed with the standardized low-resolution electromagnetic tomography (sLORETA) [36] inverse solution implemented in Brainstorm [35]. An inverse problem entails estimating the current density or activity values of the source that generated the measured electric potential or magnetic field vector. Under a smoothness constraint, the Tikhonov method is employed to regularize the EEG inverse problem. sLORETA implicitly assumes that the activity of adjacent neuronal populations is highly correlated, which is physiologically plausible. The noise covariance was calculated using the resting state (fixation cross). The parameter λ for the regularization of the ill-posed problem was set to the default value (λ = 0.1). As we used the Montreal Neurological Institute template instead of the individual anatomy of each subject, unconstrained dipole orientations were used to prevent failure in representing common activity patterns.
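As an illustration only (the authors used Brainstorm, not Python), a comparable sLORETA inverse solution with unconstrained dipole orientations can be sketched with MNE-Python; `epochs`, the forward solution `fwd`, and the noise covariance `noise_cov` (estimated from the resting fixation cross) are assumed to be prepared beforehand, and mapping Brainstorm's λ = 0.1 onto MNE's `lambda2` is an assumption.

```python
from mne.minimum_norm import make_inverse_operator, apply_inverse

# `epochs`, `fwd`, and `noise_cov` are assumed to exist already.
inv = make_inverse_operator(epochs.info, fwd, noise_cov,
                            loose=1.0)        # unconstrained dipole orientations
stc = apply_inverse(epochs.average(), inv,
                    lambda2=0.1,              # regularization (assumed mapping of λ)
                    method="sLORETA")         # standardized LORETA solution
```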

E. Multiscale Convolutional Transformer for EEG Classification
As a decoding method, we propose the multiscale convolutional transformer for EEG classification in different modalities. We devised the proposed method to learn the representation of functional brain networks during mental imagery as a temporal-spatial-spectral pattern from EEG signals. Fig. 3 shows the overall framework of the proposed method. The pipeline consists of multiscale temporal convolutional blocks, temporal transformer encoder blocks, parallel spatial convolutional blocks with spatial mapping, and a spatial transformer encoder. The specifications of the proposed method are listed in detail in Table II.

1) Multiscale Temporal Convolutional Block: To extract a wide range of spectral features of mental imagery, we adopted a multiscale temporal convolutional block inspired by TSception [37]. TSception is a CNN model for EEG-based emotion detection that showed robust classification performance by learning multiple temporal and frequency representations. Since mental imagery shows different optimal frequency ranges depending on its modality, utilizing multiscale kernels is more beneficial. The multiscale temporal convolutional block comprises three convolutional blocks with different kernel window sizes. Each convolutional block consists of a convolutional layer, the GELU activation function, and an average pooling layer. The sizes of the temporal kernels were set to half, a quarter, and an eighth of the sampling rate, to capture frequency information above 2 Hz, 4 Hz, and 8 Hz, respectively. Since the sampling rate of the EEG signals was 250 Hz, the filter sizes were rounded to the nearest integer. The number of output channels of each convolutional layer was fixed to 8 to compress the EEG features. After each feature passed through the activation function and the average pooling layer with a window size of (1,16) and a stride of (1,8), the features were concatenated along the output channel dimension. As a result, the number of output channels of the multiscale convolutional layer was 24. After that, batch normalization was applied.
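A minimal PyTorch sketch of this block is given below; the padding choices and module layout are assumptions made for runnability, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiscaleTemporalBlock(nn.Module):
    """Three parallel temporal convolutions with kernels of 1/2, 1/4, and 1/8
    of the sampling rate (250 Hz -> 125, 62, 31), each followed by GELU and
    average pooling, concatenated along channels (3 x 8 = 24 output maps)."""
    def __init__(self, fs=250, out_ch=8):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in (fs // 2, fs // 4, fs // 8):  # kernel sizes 125, 62, 31
            self.branches.append(nn.Sequential(
                nn.Conv2d(1, out_ch, kernel_size=(1, k), padding=(0, k // 2)),
                nn.GELU(),
                nn.AvgPool2d(kernel_size=(1, 16), stride=(1, 8)),
            ))
        self.bn = nn.BatchNorm2d(3 * out_ch)

    def forward(self, x):            # x: (batch, 1, channels, time)
        feats = [b(x) for b in self.branches]
        # The padding above keeps branch lengths only approximately equal;
        # crop to the shortest branch before concatenating.
        t = min(f.shape[-1] for f in feats)
        return self.bn(torch.cat([f[..., :t] for f in feats], dim=1))
```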
2) Temporal and Spatial Transformer Encoder: To apply temporal and spatial attention separately, we used two transformer encoders with the same architecture. For spatial attention scoring, we present region-based self-attention. Most conventional methods apply self-attention to the channel dimension, whereby the relationship between channel locations is ignored and most of the spatial information is blurred. To apply self-attention in the spatial domain with the multiple output kernels from the convolutional layer, we reshaped the input features by merging the batch and temporal sizes into the same dimension. In this case, we fixed the batch size to 32. Fig. 4 shows the detailed architecture of the transformer encoder. The process for region-based attention can be expressed as

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Attention(Q, K, V) is the weighted representation; Q, K, and V are matrices packed with vectors for simultaneous calculation, and d_k is the dimension of the K vectors. The MHA is employed to emphasize the spatial and temporal features from different perspectives.
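The reshape underlying region-based attention can be sketched as follows; the dimension values are illustrative placeholders, not the paper's exact sizes.

```python
import torch

# Illustrative dimensions: 32 batches, 24 feature maps, 64 channels, 45 steps.
B, F, C, T = 32, 24, 64, 45
x = torch.randn(B, F, C, T)

# Merge the batch and temporal dimensions so each EEG channel (region)
# becomes a token; attention then scores relationships between regions.
tokens = x.permute(0, 3, 2, 1).reshape(B * T, C, F)   # (B*T, C, F)

mha = torch.nn.MultiheadAttention(embed_dim=F, num_heads=4, batch_first=True)
out, attn = mha(tokens, tokens, tokens)               # out: (B*T, C, F)
```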
The operation of the MHA is expressed as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O,
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where W^O, W_i^Q, W_i^K, and W_i^V are learnable projection matrices and h is the number of heads. A feed-forward network (FFN) containing two fully-connected layers and the GELU activation function is connected after the MHA to improve the non-linear learning capabilities of the model. The FFN is expressed as

FFN(x) = GELU(xW_1 + b_1)W_2 + b_2.

Φ(x) is the cumulative distribution function of the Gaussian distribution and is often computed with the error function (erf); hence GELU is defined as

GELU(x) = xΦ(x) = 0.5x(1 + erf(x/√2)).

Layer normalization is performed before the MHA and FFN blocks, and residual connections are also used to improve training. For the ensemble effect, the MHA and FFN blocks are repeated four times.
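A pre-norm encoder block consistent with this description might look as follows; the FFN hidden size and dropout rate are assumed values.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm transformer encoder block: layer normalization before the MHA
    and the FFN, residual connections around both, and GELU in the FFN."""
    def __init__(self, dim=24, heads=4, ffn_dim=64, p=0.3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, dropout=p, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(),
            nn.Dropout(p), nn.Linear(ffn_dim, dim),
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.mha(h, h, h)[0]           # residual around MHA
        x = x + self.ffn(self.norm2(x))        # residual around FFN
        return x

# The MHA and FFN blocks are repeated four times (D = 4).
encoder = nn.Sequential(*[EncoderBlock() for _ in range(4)])
```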

3) Spatial Mapping and Parallel Spatial Convolutional Blocks:
To maintain the spatial information of the EEG channel locations, we introduce a spatial mapping strategy. After the temporal features were compressed into a single temporal class token, we reshaped the features. Using the temporal axis as an extra channel dimension, we reformed the one-dimensional channel vectors into two-dimensional vectors. As shown in the upper part of Fig. 1, the EEG channels were reshaped into an (8,8) shape to compress the spatial information while maintaining its symmetry. Specifically, channels along the middle position (gray tone) were reshaped into a diagonal position. The positions of Fz and AFz were shifted in order to bridge the gap of the reference position (FPz). The row components on the left hemisphere (odd numbers) and right hemisphere (even numbers) were repositioned along the horizontal (left) and vertical (down) directions from the diagonal line as a reference line. Channels in bold letters indicate the repositioning necessary to maintain the 8×8 shape. Table II shows the detailed output shape of each layer.

The parallel spatial convolutional block consists of two convolutional blocks, which are applied before and after the spatial mapping to produce more diverse spatial features. The first spatial convolutional layer has a larger filter size of (64,1) to extract global spatial features. The second spatial convolutional layer has a smaller filter size of (2,2) with a stride of (2,2) to extract local spatial features. The local spatial features were reshaped into a 1D channel vector to feed into the spatial transformer encoder. After the parallel spatial convolutional blocks, the EEG features were concatenated. In this stage, the spatial class token of the spatial transformer encoder and the spatial features from the convolutional layer were fused together.

4) Classifier: After concatenating the encoded features of the spatial transformer encoder based on the class token and the output features from the spatial convolutional layer, layer normalization was applied. Finally, a fully connected layer produces the class predictions, which are optimized with the cross-entropy loss

L = −(1/N) Σ_{n=1}^{N} Σ_{m=1}^{M} y_{n,m} log(p_{n,m}),

where N is the number of trials and M is the number of classes; y_{n,m} and p_{n,m} denote the true label and predicted probability of the n-th trial for class m, respectively.
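A sketch of the dual-stream spatial learner described above together with the classifier head is shown below; `grid_perm`, the permutation realizing the 8×8 mapping of Fig. 1, is left as a placeholder, and the spatial transformer encoder is omitted (noted in a comment).

```python
import torch
import torch.nn as nn

class DualStreamSpatialHead(nn.Module):
    """Global (64,1) convolution over the flat channel axis plus a local
    (2,2)/stride-(2,2) convolution over the 8x8 spatial map, fused and fed
    to a LayerNorm + fully connected classifier."""
    def __init__(self, feat=24, n_classes=6, grid_perm=None):
        super().__init__()
        # grid_perm: placeholder for the electrode-to-grid mapping of Fig. 1.
        self.perm = grid_perm if grid_perm is not None else torch.arange(64)
        self.global_conv = nn.Conv2d(feat, feat, kernel_size=(64, 1))
        self.local_conv = nn.Conv2d(feat, feat, kernel_size=(2, 2), stride=(2, 2))
        self.norm = nn.LayerNorm(feat * (1 + 16))
        self.fc = nn.Linear(feat * (1 + 16), n_classes)

    def forward(self, x):                        # x: (batch, feat, 64, 1)
        g = self.global_conv(x).flatten(1)       # global features: (batch, feat)
        grid = x[:, :, self.perm, :].reshape(x.size(0), x.size(1), 8, 8)
        l = self.local_conv(grid).flatten(1)     # local features: (batch, feat*16)
        # A spatial transformer encoder would refine the 16 local tokens here;
        # omitted for brevity.
        fused = torch.cat([g, l], dim=1)         # fuse the two streams
        return self.fc(self.norm(fused))         # logits for cross-entropy
```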

5) Training and Optimization:
In order to optimize the cross-entropy loss, we used AdamW [38]. We also utilized batch normalization and dropout for each block, and used cosine annealing to speed up the training. The network was trained for a maximum of 200 epochs, and the epoch with the lowest validation loss was selected.
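A training-loop sketch consistent with this description is given below; `model`, `train_loader`, and `val_loader` are assumed inputs, and the learning rate and weight decay are illustrative values rather than the paper's settings.

```python
import torch

def train(model, train_loader, val_loader, epochs=200):
    """AdamW + cosine annealing; keep the epoch with the lowest validation loss."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = torch.nn.CrossEntropyLoss()        # the cross-entropy loss above
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        sched.step()                             # cosine annealing schedule
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
        if val < best_val:                       # lowest-validation-loss epoch
            best_val = val
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```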

IV. RESULTS

A. Decoding Performance Evaluation
To evaluate the decoding performance of the proposed method in a fair manner, we used the public datasets and the private dataset gathered from our experiment. We calculated the classification accuracy using five-fold cross-validation. The dataset was randomly shuffled and divided into training, validation, and test datasets in a 6:2:2 ratio. We performed the experiments on GeForce RTX 3090 GPUs with 24 GB of memory. The machines had an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz with 36 cores and 128 GB of RAM. We implemented the deep learning models using Python 3.7.

Table III presents the performance comparison of the conventional and proposed methods using the mental imagery dataset. To investigate the optimal frequency range for the EEG classification of mental imagery in different modalities, we divided the frequency ranges into six bandwidths. The average accuracy and standard deviation (std.) of each frequency band are listed in Table III. Our proposed method exhibits superior performance in all frequency bands with a reasonable standard deviation. The frequency ranges that include the δ band exhibited lower accuracy than the others, indicating that the δ band is highly contaminated by EOG artifacts. Moreover, the frequency bands that include the γ band (30-120 Hz) exhibit better performance for all the decoding algorithms. Although the difference between the 4-60 Hz and 4-120 Hz frequency bands was not statistically significant, the frequency range with the high-γ band exhibited better performance for the majority of the methods; only ShallowConvNet [15] exhibited superior performance in the 4-60 Hz band compared to 4-120 Hz. EEGNet [17] exhibited robust performance compared to DeepConvNet and ShallowConvNet within 4-30 Hz, 4-60 Hz, and 4-120 Hz, while DeepConvNet [15] and ShallowConvNet showed better performance than EEGNet within 0.5-30 Hz, 0.5-60 Hz, and 0.5-120 Hz. TCACNet [39] showed performance similar to DeepConvNet; however, the channel attention module of TCACNet was less effective than our proposed model. Among the conventional models, TSception [37], which introduces different-sized kernels for handling temporal dependencies, exhibited superior performance compared to the other CNN-based methods. S3T [40], which utilizes CSP filters and multi-head attention, also showed performance comparable to TSception.
However, our proposed method exhibited a significant improvement over TSception. The highest average performance of the proposed method was 0.62 (±0.07) in the 4-120 Hz frequency band. Although the std. of the classification accuracy increased slightly, the results indicate that the average classification accuracy of the proposed method improved by 0.05 compared to that of TSception. To investigate statistically significant differences among the various frequency bands, a two-way analysis of variance (ANOVA, at a significance level of p<0.05) was used; one factor was the frequency band, and the other was the decoding method. Furthermore, we performed the two-tailed Wilcoxon signed-rank test to estimate the p-values between the competing baseline models and our proposed multiscale convolutional transformer, and compared the performance differences statistically. * and ** denote p<0.05 and p<0.01, respectively. Fig. 5 shows the confusion matrix of the six-class classification using the proposed method. The classification performance was averaged over 40 subjects for the 4-120 Hz frequency range. Both true positives and true negatives are evenly distributed with a small proportion of false negatives and false positives. The result shows that the proposed method is capable of decoding mental imagery in different modalities without bias toward specific tasks.
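The signed-rank comparison can be reproduced with SciPy as follows; the per-subject accuracy arrays here are random placeholders, not the reported values.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-subject accuracies of the proposed model and one baseline (40 subjects);
# random stand-ins are used instead of the actual results.
rng = np.random.default_rng(0)
acc_proposed = rng.uniform(0.50, 0.80, size=40)
acc_baseline = rng.uniform(0.45, 0.75, size=40)

stat, p = wilcoxon(acc_proposed, acc_baseline)   # paired, two-tailed by default
print(f"W = {stat:.1f}, p = {p:.4f}")            # * p<0.05, ** p<0.01
```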
The performance comparison on the private and public EEG datasets is presented in Table IV. For a fair comparison, the BCIC IV 2a dataset and the mental imagery dataset were both preprocessed using a frequency range of 4-60 Hz. Since the ASU dataset was already band-pass filtered to a frequency range of 8-70 Hz, we did not apply additional filtering to avoid distortion. The proposed method exhibited outstanding performance on the mental imagery dataset. On the BCIC IV 2a dataset, TCACNet showed the best performance among the comparison models; nevertheless, the proposed method showed superior performance compared to the rest of the conventional CNN models. On the ASU dataset, the proposed method reached an accuracy of 0.7 for the two-class SI classification task, although TSception showed better performance than the proposed model. The results demonstrate that our proposed method is a suitable model for decoding EEG signals with class versatility.

Table V shows the results of the ablation study for the TSformer with different parameters of the transformer encoder and temporal convolutional layer using the mental imagery dataset. The temporal and spatial transformer (TSformer) is the basic version of the proposed model without the multiscale temporal convolutional blocks and the parallel spatial convolutional blocks; it is thus a reduced variant of the proposed method. We varied each parameter to explore the effect of the depth of the transformer encoder (D), the number of output channels (C_out), the kernel size of the temporal convolutional layer (T), and the kernel size and stride of the pooling layer (P,S). According to the results in Table V, the optimal parameters are D = 4, C_out = 24, T = 62, and (P,S) = (16,8). As the depth (number of repetitions) of the transformer encoder either increased or decreased, performance dropped relative to D = 4. The number of output channels of the temporal convolutional layer is important for training the transformer encoder with an adequate amount of temporal features: a smaller number of output channels decreased performance owing to an insufficient amount of features, while a larger number also reduced accuracy owing to the larger number of parameters. The kernel sizes of the temporal convolutional layer were set to half, a quarter, and an eighth of the sampling rate, to capture frequency information above 2 Hz, 4 Hz, and 8 Hz, respectively. Although the average accuracies of T = 125 and T = 62 were similar, the standard deviation of T = 62 was smaller. The kernel size and stride of the pooling layer are also important for compressing temporal features, which influences the performance and computational cost of the temporal transformer encoder. As the pooling kernel and stride decreased, the computational cost of the model increased owing to the longer temporal sequences, and the performance also decreased. With a larger pooling kernel and stride, the computational cost decreased; however, the average accuracy also decreased owing to an insufficient number of temporal sequences. Fig. 6 shows the results of the ablation study on three important modules: 1) the multiscale (MS) temporal convolutional block, 2) the temporal and spatial (TS) transformer encoder, and 3) the dual-stream (DS) spatial learner.
The MS module is a modification that replaces the temporal convolutional block with the MS temporal convolutional block. The DS module is a modification that fuses different spatial features from the spatial transformer encoder and the spatial convolutional block. The Tformer-sConv and tConv-Sformer are models in which the temporal or spatial transformer encoder, respectively, was replaced with a convolutional block. The MS-TSformer-DS showed the highest performance among the variant models, with an average accuracy of 64 %. The MS-TSformer and TSformer-DS showed better results than the TSformer, which indicates that the MS and DS modules are effective for extracting temporal and spatial features.
As the MS-TSformer-DS model showed the best performance, we chose it as the proposed model. Replacing the temporal transformer encoder with additional temporal convolutional layers yielded lower accuracy than the TSformer. However, in the case of the spatial transformer encoder, replacing it with a spatial convolutional layer showed accuracy similar to the TSformer. Table VI compares model complexity in terms of the number of trainable parameters. The basic TSformer has approximately 33K trainable parameters, while the MS-TSformer has 32K parameters owing to its smaller kernel sizes. The TSformer-DS has approximately 66K parameters on account of the dual-stream spatial convolutional layer. DeepConvNet shows the largest number of trainable parameters owing to its multiple convolutional layers and large number of output channels. EEGNet shows the smallest number of trainable parameters, although its performance was lower than that of the proposed method. TCACNet, which showed the highest performance on the BCI competition IV 2a dataset, has 233K parameters, the second-largest number of trainable parameters. Our proposed method shows balanced performance while maintaining fewer parameters than conventional deep learning methods.
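Counts like those in Table VI can be obtained with a short helper (a sketch; any of the models above can be passed in):

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Number of trainable parameters, as compared in Table VI."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., count_trainable(tsformer) -> ~33K; count_trainable(eegnet) is smallest.
```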

B. Feature Visualization
We visualized the learned features using t-distributed stochastic neighbor embedding (t-SNE) to interpret the proposed model further. We used the MS-TSformer-DS, as it showed the best performance among our proposed models. Fig. 7 presents a comparison of the visualizations of TSception and the proposed method. The features of the last layer are visualized for one epoch of data. We used the data from subject 1 to display the learned features. The six colors in the figure denote the six classes of EEG signals (i.e., MI of "left hand" and "right hand", VI of "split" and "fall in", and SI of "go" and "stop"). In the case of TSception, the three types of mental imagery were distinguished, but the classes within the same imagery category were vague and difficult to differentiate. In the case of the proposed method, the separation among the classes is clearly distinguishable compared to the result of TSception.
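A minimal reproduction of this visualization with scikit-learn, using random placeholder features instead of the actual last-layer embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder last-layer features for one subject: 240 trials x 64 dims,
# with labels 0-5 for the six imagery classes (random stand-ins, not data).
rng = np.random.default_rng(0)
features = rng.standard_normal((240, 64))
labels = np.repeat(np.arange(6), 40)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of last-layer features (one subject)")
plt.show()
```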
According to the t-SNE results, the proposed model showed strong capabilities in classifying EEG-based mental imagery in different modalities. Even though classes within the same modality lie closer to one another than classes from different modalities, the model achieved a decent level of separation. By combining convolutional layers and transformer encoders, our proposed model achieved enhanced feature extraction capacity in the temporal, spatial, and spectral domains.

C. EEG Source Localization
We conducted ESL to examine activation patterns during mental imagery in different modalities. We used sLORETA [36] implemented in Brainstorm [35], which is an inverse solution method. Fig. 8 depicts the grand-average source localization results across all 40 subjects obtained using sLORETA. The source localization was performed on each class of mental imagery, with all trials and time points averaged into a single value. The frequency range was set to 0.5-120 Hz in order to survey the most prominent brain activity. Across all the imagery tasks, shared neural activities were revealed in the prefrontal cortex (PFC) and supplementary motor area (SMA), with the PFC exhibiting stronger activity. The EEG source estimations were captured from the top, left, and right views. The white letters at the top of the figure refer to L: left, R: right, F: front, and B: back, respectively.
During both MI tasks, the left dorsolateral prefrontal cortex (dlPFC), the center of the primary motor cortex (M1), and the primary somatosensory cortex (S1) exhibited predominant neural activity. From the top view, we can observe symmetrical highlighting patterns of S1 depending on the task. For MI of the "left hand", the bilateral temporal lobe, left inferior parietal lobule, and right inferior occipital lobe were activated. During MI of the "right hand", the left ventrolateral PFC (vlPFC) was activated in addition to the left dlPFC.
As both of the VI tasks were performed, we observed activity over the ventromedial PFC (vmPFC), primary visual cortex (V1), and inferior temporal gyrus (ITG). During VI of the "split", we can observe connected patterns from the PFC to V1; the superior parietal lobule and right inferior parietal lobule were also activated. While VI of the "fall in" was performed, the vmPFC, right dlPFC, and superior occipital gyrus were activated.
For both SI tasks, we identified common neural representations in the left vmPFC and Wernicke's area. As SI of the "go" was performed, the right inferior parietal lobule and right V1 were activated. During SI of the "stop", the left ITG and left inferior frontal gyrus (IFG) exhibited strong activity in addition to the left vmPFC. The overall ESL results indicate that distinguishable spatial patterns are presented depending on the type of mental imagery.

V. DISCUSSION
This study demonstrates the possibility of classifying multiple mental imagery tasks using only brain signals. First of all, we acquired EEG signals related to various mental imagery tasks from 40 subjects. In order to acquire the MI, VI, and SI EEG signals independently, we designed the experimental paradigms carefully and conducted the experiments in a strictly controlled environment. Our proposed network decoded a total of six mental imagery tasks (MI: "left hand" and "right hand"; VI: "split" and "fall in"; SI: "go" and "stop") with high performance. EEG classification of mental imagery in different modalities can be used to overcome the limited number of classes and the low classification performance within a single mental imagery category. In addition, we conducted a source localization analysis to confirm that mental imagery in different modalities exhibits notable differences in spectral-spatial patterns. Hence, we believe that this work can make a significant contribution to future research on decoding mental imagery.

A. Source Localization Analysis
Through a source localization analysis across all the mental imagery tasks, the ESL results showed common neural activity in the PFC and SMA. The source of mental imagery is unknown, but it is likely that the executive structure in the PFC is critical [41]. Furthermore, the SMA is known to contribute not only to multiple aspects of motor behavior but also to function as a core network of brain regions recruited during imagery, irrespective of the task [42]. The overall source estimation results imply that unique brain networks are engaged during the MI, VI, and SI tasks and that the neural representations of the cortical activity can be revealed in the connectivity of the spatial-spectral-temporal domain. The results show that valuable features can be extracted from different brain regions and frequency bands for the three imagery types. In the case of the MI tasks, as expected from the extant literature, M1 exhibited the predominant neural activity. Moreover, the dlPFC exhibited strong activity; it has been found to be involved in superordinate control functions for various cognitive tasks such as decision making, novelty detection, working memory, conflict management, mood regulation, theory of mind processing, and timing [43]. In the case of the VI tasks, the occipital lobe exhibited the predominant activity compared to the other mental imagery tasks. VI is known to share a neural mechanism similar to that of visual perception [44], [45], especially over V1; it shows overlapping neural representations, much like a weak version of afferent perception [41], [46]. Moreover, the activation of the left ITG implies a relationship with the left fusiform gyrus. The exact functionality of the fusiform gyrus is still disputed, but there is relative consensus on its involvement in face and body recognition and within-category identification. A recent study has also shown that VI engages the left fusiform gyrus before it influences the early visual cortex [47]. In the case of the SI tasks, although the motor-related cortical activity of articulation imagery [48] was not visible in the present figure, Broca's and Wernicke's areas, which take part in language processing [49], were highlighted.

B. Model Analysis
Based on the extensive source estimation results and several studies on the brain networks of mental imagery, we propose a multiscale convolutional transformer for EEG classification in different modalities. The contributions of the proposed method are summarized as follows. First, we adopted a multiscale temporal convolutional block to extract a wide range of spectral features. Second, we extracted global and local spatial information by utilizing parallel spatial convolutional blocks and preserved the EEG channel relations with spatial mapping. Third, we devised region-based spatial self-attention to emphasize the prominent brain regions instead of the channel attention method, which is computationally expensive and less effective. One of the significant differences between transformer-based models and RNN-based models is that the transformer does not have the same memory bottleneck as RNN-based models; it has direct access to all previous sequences. RNN-based models, in contrast, have only one current state, which is adjusted with each new input. This increased memory capacity allows the model to "remember" exactly which sequences precede the current one. Also, our proposed method used broad frequency ranges, which is adequate for capturing valuable spectral features from different types of imagery.

C. Limitations and Future Works
Even though the proposed method showed relatively robust overall performance and the source localization results of mental imagery in different modalities corresponded with the neurophysiological evidence reported in related works, there are still some limitations in our study. First of all, although the entire set of experiments was conducted consecutively, the sequences of the experimental protocol differed across the imagery modalities. The main reason for the non-unified experimental design in this study was to follow the optimized protocols for each mental imagery. Thus, we will continue with additional experiments using unified experimental protocols to validate whether the signal quality is comparable to that of the original experimental paradigm. Secondly, our experiments were conducted in a strictly controlled laboratory environment, so the method may not be reliable in real-world environments. Also, as the proposed method is restricted to offline classification, it is not suitable for online classification.
Also, one of the limitations of the grand-average source estimation results is that some of the brain networks function in a subject-dependent manner. Since human imagination depends on an individual's unique memory, identical experimental instructions can still result in totally different neural representations. Moreover, some studies have reported that mental imagery often shows cross-modal aspects [50]. For example, motor imagery did not always show prominent representations in the motor-related brain regions; for some participants, it engaged the visual-associated areas more prominently. Also, the time-averaged source estimation results show limited temporal-spatial patterns. Specifically, we could find repetitive activation patterns in the source estimations during certain periods of time in the video format, but some of these patterns were invisible in the time-averaged image.
In future works, investigations of cross-modal mental imagery tasks would be valuable for expanding the limited degrees of freedom (DoF) of the BCI system with robust performance. Furthermore, the development of an online decoding model and bridging the gap between laboratory and real-world environments would be desirable.

VI. CONCLUSION
Various mental imageries such as MI, VI, and SI use the neural signals generated by performing designated mental imagery tasks without any restrictions from external devices. In this study, we proposed the multiscale convolutional transformer based on self-attention over the spatial, spectral, and temporal domains. Our proposed method outperformed conventional deep learning methods in classifying mental imagery in different modalities with an accuracy of 0.62. The stability of the method and the effectiveness of each module were also demonstrated. The results of the ESL analysis show that the functioning brain networks and EEG representations differ for each type of mental imagery. This study demonstrates the possibility of multi-class inter-modality classification through robust feature extraction from each mental imagery task.