TSception: Capturing Temporal Dynamics and Spatial Asymmetry from EEG for Emotion Recognition

High temporal resolution and asymmetric spatial activations are essential attributes of the electroencephalogram (EEG) underlying emotional processes in the brain. To learn the temporal dynamics and spatial asymmetry of EEG for accurate and generalized emotion recognition, we propose TSception, a multi-scale convolutional neural network that classifies emotions from EEG. TSception consists of dynamic temporal, asymmetric spatial, and high-level fusion layers, which learn discriminative representations in the time and channel dimensions simultaneously. The dynamic temporal layer consists of multi-scale 1D convolutional kernels whose lengths are related to the sampling rate of the EEG, and it learns dynamic temporal and frequency representations of the EEG. The asymmetric spatial layer takes advantage of the asymmetric EEG patterns for emotion, learning discriminative global and hemisphere representations. The learned spatial representations are fused by a high-level fusion layer. Using more generalized cross-validation settings, the proposed method is evaluated on two publicly available datasets, DEAP and MAHNOB-HCI. The performance of the proposed network is compared with previously reported methods such as SVM, KNN, FBFgMDM, FBTSC, Unsupervised learning, DeepConvNet, ShallowConvNet, and EEGNet. TSception achieves higher classification accuracies and F1 scores than the other methods in most of the experiments. The code is available at https://github.com/yi-ding-cs/TSception


INTRODUCTION
EMOTIONS are fundamental factors in human beings' daily life [1], affecting decision-making, perception, human interaction, and human intelligence [2]. Emotion recognition plays an important role in Cognitive Behavioural Therapy (CBT) [3], Emotion Regulation Therapy (ERT)/Emotion-Focused Therapy (EFT) [4] [5] [6], and the evaluation of medical treatment [7] for emotion-related mental disorders, such as Generalized Anxiety Disorder (GAD) [8] and Depression [9]. With these potential applications in CBT and EFT, enabling Artificial Intelligence (AI) to identify human emotions has attracted growing interest from researchers [1].
Electroencephalography (EEG) is one of the widely used brain imaging technologies and measures human brain activity directly. Several electrodes are placed on the surface of the human head to collect EEG signals. EEG has a high temporal resolution, so it can capture varying brain states at the sub-second level. A Brain-Computer Interface (BCI) system can identify human emotions through EEG with the help of machine learning and signal processing techniques [10].
Recently, using EEG-BCI for emotion recognition has gained popularity among researchers [1] [11]. Atkinson et al. [12] improved the accuracy of an SVM classifier for emotion detection to 73.14% by selecting features efficiently. Zheng et al. [13] used a discriminative graph regularized extreme learning machine to investigate stable patterns over time from the differential entropy (DE) features of emotional EEG. Li et al. [14] utilized the phase-locking value to construct emotion-related brain networks with multiple feature fusion to detect emotions from EEG. Deep learning-based methods have also shown promising results in the BCI domain, such as motor imagery classification [15] [16] [17] [18] [19], emotion recognition [20] [21] [22] [23] [24] [25], and mental-task classification [26] [27] [28]. Yang et al. [20] designed a hierarchical network structure to perform emotion classification, proposing sub-network nodes to enhance the performance. Li et al. [21] converted EEG into 2D images and proposed a Hierarchical Convolutional Neural Network (HCNN) to extract the spatial patterns of the EEG. Li et al. [22] applied 18 kinds of linear and nonlinear features to solve cross-subject emotion recognition problems, achieving 59.06% and 83.33% on two public datasets. Zhang et al. [25] utilized recurrent neural networks (RNN) to learn temporal-spatial information from the DE features of EEG for emotion recognition. Although many machine learning methods have been proposed for emotion recognition, most of them rely heavily on hand-crafted features.
With the ability to learn from EEG directly, convolutional neural networks (CNN) have shown promising results in BCI [18] [29] [30]. Schirrmeister et al. [15] proposed deep and shallow convolutional neural networks, named DeepConvNet and ShallowConvNet, to process EEG data, combining feature extraction and classification using a two-stage spatial and temporal input convolution layer. Recently, Lawhern et al. [18] proposed EEGNet, which extracts spatial information with a depth-wise convolution kernel whose size is (n, 1). The global spatial dependency can be learned by letting n be the number of channels. All of these networks apply single-scale 1D convolutional kernels along the time and channel dimensions to extract temporal and spatial information from EEG.
To learn temporal-spatial information from EEG effectively for emotion recognition, several neurophysiological signatures should be considered. In the temporal dimension, the EEG signal contains abundant brain activity information in different frequency bands [31]. Due to the nonstationary and dynamic nature of EEG, we hypothesize that a single-sized temporal kernel cannot effectively capture the neural processing underlying emotions, which occurs at different time scales and durations. In the spatial dimension, especially for emotional processes in the brain, the right and left hemispheres respond asymmetrically to emotions [32]. Hence, we hypothesize that a global spatial kernel is less able to extract the distinct asymmetric EEG patterns that arise during emotional processes.
To address the above issues, in this paper we propose TSception, a multi-scale temporal-spatial convolutional neural network that captures temporal dynamics and spatial asymmetry from EEG to classify emotional states. Different from the methods using manually extracted features [21] [22] [25] [33] [34], EEG signals are fed into TSception directly, which makes it an end-to-end deep learning method that needs less domain knowledge about the features. A dynamic temporal layer with differently scaled convolutional kernels is proposed to learn richer time-frequency representations from EEG instead of using single-sized temporal CNN kernels [15] [18]. This layer is inspired by the inception block of GoogleNet [35]. Besides the global kernel utilized in [15] [18], we take the brain's emotional asymmetry into the kernel design. A hemisphere kernel whose length equals the number of EEG channels located on the right/left hemisphere is proposed to extract the hemisphere asymmetric pattern. The effectiveness of multi-scale convolutional neural networks was preliminarily explored in our previous work [36]. We further propose a high-level fusion layer after the asymmetric spatial layer to learn from the combined hemisphere-global representations and extract emotion-class-specific information, as well as to make the network more compact for online usage in the future.
Emotion classification experiments on two publicly available benchmark datasets, a Database for Emotion Analysis using Physiological signals (DEAP) [33] and a multimodal database for affect recognition and implicit tagging (MAHNOB-HCI) [37], were conducted to evaluate the performance of TSception. Generalized cross-validation settings are utilized to avoid potential data leakage and biased evaluation. TSception is compared with several deep and non-deep state-of-the-art methods in the BCI domain, namely SVM [33], KNN [34], DeepConvNet [15], ShallowConvNet [15], EEGNet [18], Unsupervised learning [38], FBFgMDM [39], and FBTSC [39]. In most of the experiments, TSception achieves higher accuracy and F1 score than the other methods while having relatively few network parameters. After the statistical analysis, extensive ablation studies are conducted to analyze the contribution of each module in TSception. The saliency map method [40] is utilized to identify the most informative parts of the EEG data as seen by the network. The maps show that the network mainly learns from frontal, temporal, and parietal areas, which are commonly known as the functional brain areas related to emotional processes [1].
The major contributions of this work can be summarised as:

• We propose TSception, a novel multi-scale temporal-spatial convolutional neural network, for EEG emotion recognition tasks. Several neurophysiological signatures are involved in the network design. The proposed multi-scale temporal/spatial convolution kernels can capture temporal dynamics and spatial asymmetry from EEG to classify emotions. A high-level fusion layer is proposed to further learn from hemisphere-global representations and to make the network more compact, which can benefit the online usage of TSception in the future.

• Extensive ablation studies and interpretability experiments are conducted to understand the importance of each module in TSception and, using saliency maps, what it learns.
The PyTorch implementation of TSception is available at https://github.com/yi-ding-cs/TSception.
The remainder of this article is organized as follows. A summary of related works is introduced in Section II. In Section III the details of TSception are introduced. Section IV describes the datasets and experiment settings. The results and analysis are given in Section V. Finally, we discuss the significance of our results in Section VI.

WORKS
A detailed description of the proposed TSception, a multi-scale convolutional neural network, is presented in this section. EEG data can be treated as 2D time series whose dimensions are channels (EEG electrodes) and time. The time dimension reflects how brain activity changes over time. The spatial dimension shows the brain activation patterns across different functional areas, owing to the different locations of the electrodes on the head. EEG signals contain abundant information in different frequency bands [31]. TSception is proposed to identify the most distinct time-frequency-channel-specific EEG features corresponding to the emotional states of the user. TSception incorporates specially designed network modules, namely a dynamic temporal layer, an asymmetric spatial layer, and a high-level fusion layer. To extract more discriminative time-frequency representations, multi-scale 1D convolutional kernels are utilized in the dynamic temporal layer. The asymmetric spatial layer takes advantage of neuroscience findings [32] which indicate that brain activities in the right and left hemispheres are not symmetrically related to emotions. A hemisphere kernel is proposed to learn the asymmetric representations between the two hemispheres. A high-level fusion layer is further proposed to learn from the learned representations of both the hemisphere and global kernels and to make the network more compact for real-time usage. The network structure of TSception is shown in Fig. 1. The temporal, spatial, and high-level fusion layers are described in detail in this section.

Fig. 1. [...] is the number of channels, BN stands for batch normalization, AP is the average pooling operation, and GAP represents global average pooling. TSception has four main parts: the dynamic temporal layer, the asymmetric spatial layer, the high-level fusion layer, and the classifier. The dynamic temporal layer first learns the dynamic temporal/frequency representations from EEG data channel by channel. Given the learned representations for each channel, the asymmetric spatial layer is applied to learn the global spatial representations and the emotional asymmetry pattern using convolutional kernels of different scales. To fuse the information from the hemisphere and global representations, a high-level fusion layer is utilized. Finally, the fused representation is passed to the fully connected layers with softmax as the activation function.

Dynamic Temporal Layer
The dynamic temporal layer consists of multi-scale 1D temporal kernels (T kernels). To enable the neural network to learn dynamic temporal representations, we set the lengths of the temporal kernels to specific ratios of the EEG sampling rate $f_S$. These ratios are defined as $\alpha^i \in \mathbb{R}$, where $i$ is the level of the dynamic temporal layer and varies from 1 to $L$ if the layer has $L$ levels. Hence $s_T^i$, the size of the T kernels in the $i$-th level, can be defined as:

$$s_T^i = (1, \alpha^i \cdot f_S), \quad i \in [1, 2, \ldots, L] \tag{1}$$

From the frequency perspective, the length of the T kernel is set to half the sampling rate in EEGNet, which allows frequency information at 2 Hz and above to be captured [18]. Activations related to emotions are observed in the Alpha (8-12 Hz), Beta (12-30 Hz), and Gamma (>30 Hz) bands [1]. In this work, we expand the temporal receptive field by letting $L = 3$, $i = 1$ to $3$, and $\alpha = 0.5$, so that the ratio coefficients become [0.5, 0.25, 0.125] and diversified frequency representations are learned. We hypothesize that the multi-scale temporal kernels can enrich the dynamic frequency representations learned from EEG, providing more emotion-related information. From the time perspective, multi-scale T kernels can capture both long-term and short-term temporal patterns and learn more diverse representations. A higher-level T kernel has a smaller ratio coefficient, which gives a shorter convolutional kernel, and vice versa. The long temporal kernel learns long-term temporal and low-frequency representations, while the short kernel extracts short-term temporal and high-frequency representations.

Let $X$ denote the EEG input samples, $X = [X_0, X_1, \ldots, X_n]$, $X_n \in \mathbb{R}^{c \times l}$, where $n$ is the number of EEG samples, $c$ is the number of channels, and $l$ is the length of each sample. The dynamic temporal representations are generated by applying the multi-scale temporal kernels to the input EEG samples in parallel. After the LeakyReLU(·) activation function, the feature map is further down-sampled by average pooling (AP). Average pooling is used to reduce both the effect of noise and the feature dimension, since EEG signals are high-dimensional with a low signal-to-noise ratio. Let $Z^i_{temporal}$ denote the output of the $i$-th level temporal kernel, $Z^i_{temporal} \in \mathbb{R}^{n \times t \times c \times f_i}$, where $n$ is the number of samples, $t$ is the number of T kernels in each level, $c$ is the number of channels, and $f_i$ is the length of the feature after the $i$-th level convolution operation. $Z^i_{temporal}$ is defined as:

$$Z^i_{temporal} = \mathrm{AP}(\Phi_{LeakyReLU}(\mathrm{Conv1D}(X, s_T^i)))$$

where $s_T^i$ is the T kernel size, $X$ is the input EEG sample array, Conv1D(·) is the 1D convolution operation with kernel size $s_T^i$ and step (1, 1), $\Phi_{LeakyReLU}(\cdot)$ is the LeakyReLU activation function, and AP(·) is the average pooling operation.

The outputs of the T kernels of all levels are concatenated along the feature dimension. To reduce the internal covariate shift problem in neural networks, we add batch normalization [41] after the dynamic temporal layer. Hence the final output of the dynamic temporal layer, $Z_T \in \mathbb{R}^{n \times t \times c \times \sum_i f_i}$, is defined as:

$$Z_T = f_{bn}([Z^1_{temporal}, \ldots, Z^L_{temporal}])$$

where $f_{bn}$ is the batch normalization operation, and $[\cdot]$ stands for the concatenation operation along the feature (f) dimension.
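As a rough illustration of Eq. 1 and of one channel passing through the multi-scale temporal kernels, the following plain-Python sketch computes the kernel lengths from the sampling rate and applies convolution, LeakyReLU, and average pooling at each level. The function names, the LeakyReLU slope of 0.01, the pooling size of 8, and the uniform (moving-average) kernel weights are assumptions made for illustration; they are not taken from the released implementation, where the weights are learned.

```python
def kernel_lengths(fs, ratios):
    """Temporal kernel length per level: ratio * sampling rate, truncated to an integer."""
    return [int(r * fs) for r in ratios]


def ratio_coefficients(alpha=0.5, levels=3):
    """Ratio for level i is alpha**i, with i running from 1 to levels."""
    return [alpha ** i for i in range(1, levels + 1)]


def leaky_relu(x, slope=0.01):
    # Slope value is an assumption; the text does not state it.
    return x if x >= 0 else slope * x


def conv1d_valid(signal, kernel):
    """Stride-1 convolution without padding (cross-correlation form)."""
    k = len(kernel)
    return [sum(signal[t + j] * kernel[j] for j in range(k))
            for t in range(len(signal) - k + 1)]


def avg_pool(x, size):
    """Non-overlapping average pooling; an incomplete trailing window is dropped."""
    return [sum(x[i:i + size]) / size
            for i in range(0, len(x) - size + 1, size)]


def dynamic_temporal_layer(signal, fs, ratios, pool=8):
    """Apply every temporal scale to one channel and concatenate along the feature axis."""
    features = []
    for length in kernel_lengths(fs, ratios):
        kernel = [1.0 / length] * length  # placeholder weights (hypothetical)
        activated = [leaky_relu(v) for v in conv1d_valid(signal, kernel)]
        features.extend(avg_pool(activated, pool))
    return features
```

With a 128 Hz signal and ratios [0.5, 0.25, 0.125], `kernel_lengths` gives 64, 32, and 16 samples, so the three branches respond to progressively shorter time windows before their pooled outputs are concatenated.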

Asymmetric Spatial Layer
The asymmetric spatial layer has multi-scale 1D convolutional kernels whose sizes are related to the locations of the EEG channels. There are two types of spatial kernels: the global kernel and the hemisphere kernel.
The global kernel has a size of (c, 1), where c is the number of channels. Since the length of the kernel is the same as the channel dimension of the input EEG segment, it can learn the global spatial information.
In this work, we further incorporate the emotional asymmetry of the frontal brain area [42] into the kernel design. The hemisphere kernel is used to extract the relations between the left and right hemispheres by sharing the convolutional kernels. The size of the hemisphere kernel is (0.5 · c, 1), and the step is (0.5 · c, 1), where c is the total number of channels. The hemisphere kernel is shared by the two hemispheres without overlapping so that the asymmetric pattern can be extracted. The size of the spatial kernel $s_S^j$ can be defined as:

$$s_S^j = \begin{cases} (c, 1), & j = \text{global} \\ (\delta \cdot c, 1), & j = \text{hemisphere} \end{cases}$$

where $\delta = 0.5$ is the coefficient that controls the ratio between the spatial kernel length and the total number of channels. Let $Z^j_{spatial}$ denote the output of the $j$-th type of spatial kernel, $Z^j_{spatial} \in \mathbb{R}^{n \times s \times c_j \times f}$, where $n$ is the number of samples, $s$ is the number of S kernels of each type, $c_j$ is the number of channels after the $j$-th spatial convolution, and $f$ is the length of the feature after each spatial convolution operation. $Z^j_{spatial}$ is defined as:

$$Z^j_{spatial} = \mathrm{AP}(\Phi_{LeakyReLU}(\mathrm{Conv1D}(Z_T, s_S^j)))$$

where $s_S^j$ is the S kernel size, $Z_T$ is the output of the dynamic temporal layer, Conv1D(·) is the 1D convolution operation with kernel size $s_S^j$ and with the step being (1, 1) for the global kernels and (0.5 · c, 1) for the hemisphere kernels, $\Phi_{LeakyReLU}(\cdot)$ is the LeakyReLU activation function, and AP(·) is the average pooling operation.

Fig. 2. The location map of the 32-channel cap. The electrodes can be divided into 3 groups: electrodes on the left hemisphere (in orange), electrodes on the right hemisphere (in blue), and electrodes on the central line (in black).

The electrodes located on the central line of the head, Fz, Cz, Pz, and Oz, cannot be paired across the left and right hemispheres, so we removed them to let TSception better learn the asymmetric pattern of the two hemispheres.
To apply hemisphere kernels, the channels in the input EEG samples must be arranged in a particular order. The order of the channels should be $[channel_{left}, channel_{right}]$, where $channel_{left}$ are the channels located on the left hemisphere and $channel_{right}$ are the ones on the right hemisphere. The channels within each hemisphere should also be ordered so that each kernel weight is shared between a pair of symmetrically located electrodes on the two hemispheres, because the step of the hemisphere kernel is also (0.5 · c, 1). Fig. 2 shows the electrode locations of the DEAP dataset. The final output of the asymmetric spatial layer, $Z_S \in \mathbb{R}^{n \times s \times \sum_j c_j \times f}$, is defined as:

$$Z_S = f_{bn}([Z^{global}_{spatial}, Z^{hemisphere}_{spatial}])$$

where $f_{bn}$ is the batch normalization operation, and $[\cdot]$ stands for the concatenation operation along the channel (c) dimension. The output of the hemisphere kernel has a length of two in the spatial dimension, corresponding to the two hemispheres. The output of the global kernel has a length of one in the channel dimension. After concatenation, the channel dimension is $\sum_j c_j = 3$.
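The channel ordering and weight sharing described above can be made concrete with a small plain-Python sketch. It operates on a single time step (one column of channel values) and shows why the hemisphere kernel yields two values while the global kernel yields one. The helper names, the example electrode list, and the placement of the global output before the hemisphere outputs are hypothetical; activation, pooling, and batch normalization are omitted.

```python
def reorder_indices(names, left, right):
    """Indices that arrange channels as [left..., right...].

    `left` and `right` must list symmetric electrode pairs in the same order
    (e.g. Fp1 with Fp2, F3 with F4); midline electrodes are simply not listed.
    """
    index = {name: i for i, name in enumerate(names)}
    return [index[ch] for ch in left] + [index[ch] for ch in right]


def global_kernel(column, weights):
    """Kernel of size (c, 1): one weighted sum over all channels."""
    assert len(column) == len(weights)
    return [sum(w * x for w, x in zip(weights, column))]


def hemisphere_kernel(column, weights):
    """Kernel of size (c/2, 1) with step c/2: the same weights slide over each half."""
    half = len(weights)
    assert len(column) == 2 * half
    left = sum(w * x for w, x in zip(weights, column[:half]))
    right = sum(w * x for w, x in zip(weights, column[half:]))
    return [left, right]


def asymmetric_spatial(column, global_w, hemi_w):
    """Concatenate global and hemisphere outputs along the channel axis (length 3)."""
    return global_kernel(column, global_w) + hemisphere_kernel(column, hemi_w)


names = ["Fp1", "Fp2", "F3", "F4", "Fz"]
values = [1.0, 2.0, 3.0, 5.0, 9.0]
order = reorder_indices(names, left=["Fp1", "F3"], right=["Fp2", "F4"])
column = [values[i] for i in order]  # Fz is dropped, left channels come first
z = asymmetric_spatial(column, [1.0, 1.0, 1.0, 1.0], [1.0, 1.0])
```

Because the hemisphere weights are shared, any difference between the two hemisphere outputs comes from the data rather than from the kernel, which is what lets later layers pick up left-right asymmetry.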

High-level Fusion Layer
To learn high-level spatial representations by fusing the information learned from the global and hemisphere kernels, a high-level fusion layer is further proposed. Given the output of the asymmetric spatial layer, $Z_S \in \mathbb{R}^{n \times s \times 3 \times f}$, a 1D convolutional layer whose kernel size is (3, 1) is utilized to fuse the information along the spatial dimension. After LeakyReLU(·), average pooling, and batch normalization, a global average pooling (GAP) layer is added to mitigate over-fitting and reduce the model size. The final learned global-hemisphere fusion representations are generated by:

$$Z_{fusion} = \mathrm{GAP}(f_{bn}(\mathrm{AP}(\Phi_{LeakyReLU}(\mathrm{Conv1D}(Z_S, (3, 1))))))$$

Finally, the latent representation $Z_{fusion}$ is fed into the fully connected layers. The final output layer is activated by the softmax function, $\Phi_{softmax}(\cdot)$. Hence the final output can be calculated by:

$$\mathrm{Output} = \Phi_{softmax}(\mathrm{FC}(Z_{fusion}))$$

where FC(·) denotes the fully connected layers.
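A minimal plain-Python sketch of the fusion step follows, assuming one sample whose asymmetric spatial output has three rows (global, left hemisphere, right hemisphere). It applies (3, 1) kernels with LeakyReLU, reduces each result with global average pooling, and passes the pooled vector through a single linear layer and softmax. Average pooling and batch normalization are left out for brevity, and all names, weights, and the LeakyReLU slope are illustrative assumptions rather than the released code.

```python
import math


def leaky_relu(x, slope=0.01):
    return x if x >= 0 else slope * x


def fuse(z_s, weights):
    """One (3, 1) kernel: weighted sum over the 3 spatial rows at every feature position."""
    assert len(z_s) == 3 and len(weights) == 3
    return [leaky_relu(sum(w * row[k] for w, row in zip(weights, z_s)))
            for k in range(len(z_s[0]))]


def global_average_pool(features):
    """Collapse the feature axis to a single value per kernel."""
    return sum(features) / len(features)


def linear(x, weight_rows, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight_rows, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z_s = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # rows: global, left, right
kernels = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]    # two hypothetical fusion kernels
pooled = [global_average_pool(fuse(z_s, k)) for k in kernels]
probs = softmax(linear(pooled, [[0.1, 1.0], [-0.1, -1.0]], [0.0, 0.0]))
```

Global average pooling leaves one number per fusion kernel, so the classifier input size depends only on the number of kernels and not on the segment length.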

Datasets
To evaluate the proposed TSception, we conducted several experiments on two publicly available benchmark datasets, a Database for Emotion Analysis using Physiological signals (DEAP) [33] and a multimodal database for affect recognition and implicit tagging (MAHNOB-HCI) [37]. Table 1 summarizes the related information of the two datasets used in our experiments. The arousal and valence dimensions of both datasets were utilized, as reported in [39]. DEAP is a multi-modal dataset of human affective states, including EEG, facial expressions, and galvanic skin response (GSR). In DEAP, 32 subjects watched music video clips while their EEG, facial expressions, and GSR were recorded. Each subject participated in 40 trials in total. The duration of each trial is 1 minute, with a 3-second pre-trial baseline. After each trial, the subject was given a questionnaire to rate their own emotional state in arousal, valence, dominance, and liking, with each dimension having 9 discrete levels. The EEG was collected using a 32-channel device at a sampling rate of 512 Hz. MAHNOB-HCI [37] is another multi-modal dataset similar to DEAP. In it, 30 subjects watched movie clips while their facial expressions, audio signals, eye gaze data, EEG signals, and other physiological signals were recorded. Note that Subjects 12, 15, and 26 failed to finish the data collection; therefore, the remaining 27 of the 30 subjects were used in this work. The movie clips are between 35 and 117 seconds long. The EEG signals were acquired from 32 electrodes placed according to the international 10-20 system. The sampling frequency is 256 Hz. For each trial, four integers ranging from 1 to 9, self-reported by the subjects, are used to label the valence, arousal, dominance, and emotional keywords, respectively.

Pre-processing
For DEAP, the 3-second pre-trial baseline was removed from each trial. The data was then down-sampled from 512 Hz to 128 Hz, after which the electrooculogram (EOG) artifacts were removed with a blind source separation method as in [33]. To remove low-frequency and high-frequency noise, a band-pass filter from 4.0 to 45 Hz was applied to the original EEG as in [33]. Finally, the EEG channels were re-referenced to the common average. The class label for each dimension ranges from 1 to 9, hence 5 was selected as the threshold to project the 9 discrete values into low and high classes in each dimension, as in [33] [39]. In line with [39], only the arousal and valence dimensions are used in this study. Deep neural networks have a large number of trainable parameters, so a large number of labelled data samples is required to learn emotion-state representations from EEG well. However, as listed in Table 1, the number of trials is very small in the selected datasets. To overcome this challenge, a data augmentation step was applied that splits each trial into shorter non-overlapping 4 s segments. The segments were then used to train the deep neural network.
For MAHNOB-HCI, the pre-processing was the same as that for the DEAP dataset except for the following. First, the 30-second pre-trial and post-trial baselines were removed from each trial, so that the remaining data corresponds to the emotion elicitation event [37]. Second, to remove low-frequency and high-frequency noise, a band-pass filter from 0.3 to 45 Hz was applied to the original EEG as in [36]. Note that the delta band (0.3-4 Hz) is included since it also contributes to an individual's affective state [43], [44].
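The segmentation and label projection steps can be sketched in a few lines of plain Python. The sketch assumes a trial is stored as a list of channels, each a list of samples, and that any remainder shorter than one segment is discarded. Whether a rating exactly equal to 5 falls in the low or the high class is not stated here, so the strict comparison below is an assumption, as are the function names.

```python
def segment_trial(trial, fs, seconds=4):
    """Split a (channels x samples) trial into non-overlapping segments.

    Samples left over after the last full segment are dropped (assumption).
    """
    step = fs * seconds
    n_samples = len(trial[0])
    return [[channel[start:start + step] for channel in trial]
            for start in range(0, n_samples - step + 1, step)]


def binarize(rating, threshold=5):
    """Project a 1-9 rating to low (0) or high (1).

    Assumption: ratings strictly above the threshold count as high.
    """
    return 1 if rating > threshold else 0


def label_segments(segments, rating):
    """Every segment inherits the label of the trial it was cut from."""
    label = binarize(rating)
    return [(segment, label) for segment in segments]
```

For a 60 s DEAP trial at 128 Hz this gives 15 segments of 512 samples per channel, all carrying the label of the trial they came from.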

Performance Evaluation Metrics
The first metric is accuracy, one of the most commonly used evaluation metrics in classification problems [36]. It is the ratio of the number of correctly predicted samples to the total number of samples. For binary classification problems, the accuracy can be defined as:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives. Accuracy measures how precise the predictions are on a class-balanced dataset. However, after the label pre-processing described in the pre-processing section, the labels become imbalanced. To better evaluate the performance of a classifier on class-imbalanced datasets, the F1 score is added as in [33] [38]. It is defined as the harmonic mean of the classifier's precision and recall:

$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

where $\mathrm{Precision} = TP / (TP + FP)$ and $\mathrm{Recall} = TP / (TP + FN)$.
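The two metrics can be computed directly from the confusion counts, as in the short sketch below. Returning 0.0 when precision or recall is undefined is a convention chosen here for illustration, not something specified in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives (assumed convention),
    which also covers the cases where precision or recall is undefined.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A classifier that always predicts the majority class on a 90/10 split reaches an accuracy of 0.9 but an F1 score of 0.0 for the minority class, which is why F1 is reported alongside accuracy when labels are imbalanced.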

Experiment Settings
There are two types of experiment settings in this paper: I) trial-wise 10-fold cross-validation and II) leave-one-trial-out cross-validation.Each of them is introduced in the following paragraphs.
In the first experiment setting, we split each trial into 4 s non-overlapping segments, also known as cropped experiments [15], and a trial-wise 10-fold cross-validation is utilized for each subject to prevent potential data leakage. Cropped experiments are used because, for an efficient real-time BCI system, predictions on shorter segments are preferred over the trial-wise predictions evaluated in [33] [38] [39]. In addition, a decoding model with good generalization capability is needed for real-world situations, where the testing data is unseen by the model. In each trial, the subject was asked to watch or listen to a stimulus intended to evoke a certain type of emotion. Because emotion is a continuous cognitive process in the brain, the data segments within a single trial are highly correlated. Hence, randomly shuffling the segments across trials before the training-testing split could place adjacent segments in both the training and the testing data, which would inflate the classification results. The accuracy would then drop in real-world situations, where such highly correlated segments are never seen by the model. To obtain a more generalized evaluation, the 10 folds are split by trial, which ensures that adjacent segments from one trial do not appear in both the training and the testing data. In each step of the 10-fold cross-validation, one fold is selected as the testing data and the remaining 9 folds are utilized as the training data. The data in the 9 training folds is randomly divided into 80% training data and 20% validation data. During training, we train the network on the training data for 500 epochs and evaluate it on the validation data in each epoch. The model with the highest validation accuracy among those 500 epochs is saved and tested on the testing data. The above process is repeated 10 times for each subject until each fold has served as the testing fold once. In each fold, the test data remains completely unseen in all stages of training and validation. The mean accuracy and F1 score over all subjects are reported as the final results.
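The trial-wise split can be sketched as follows: folds are built over trial indices, and a segment goes to the test side only if its parent trial belongs to the test fold, so segments from one trial never straddle the split. The function names, the shuffling of trials before fold assignment, and the fixed seeds are assumptions for illustration.

```python
import random


def trial_wise_folds(n_trials, k=10, seed=0):
    """Assign whole trials to k folds (trial order is shuffled first; an assumption)."""
    trials = list(range(n_trials))
    random.Random(seed).shuffle(trials)
    return [trials[i::k] for i in range(k)]


def split_segments(segment_trial_ids, test_trials):
    """Return (train_idx, test_idx) over segments, keeping each trial on one side only."""
    test = set(test_trials)
    train_idx = [i for i, t in enumerate(segment_trial_ids) if t not in test]
    test_idx = [i for i, t in enumerate(segment_trial_ids) if t in test]
    return train_idx, test_idx


def train_val_split(indices, val_ratio=0.2, seed=0):
    """Randomly hold out a fraction of the training segments for validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


# 40 trials with 15 segments each, as for one DEAP subject.
segment_trial_ids = [t for t in range(40) for _ in range(15)]
folds = trial_wise_folds(40, k=10)
train_idx, test_idx = split_segments(segment_trial_ids, folds[0])
fit_idx, val_idx = train_val_split(train_idx)
```

Splitting by trial rather than by segment is the design choice that prevents highly correlated neighbouring segments from leaking into the test set.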
In the second experiment setting, a leave-one-trial-out cross-validation is adopted for each subject to further compare our method with the recently proposed methods in [39] and [38]. In each cross-validation step, one trial is selected as the testing data and the rest are selected as the training data. For each step of the leave-one-trial-out cross-validation, the training data is also split into 80% training and 20% validation data. The process is repeated until every trial has been selected as the testing data once for each subject. The average accuracy and F1 score over all subjects are reported as the final evaluation criterion, as in [38]. In [33] [38] [39], the features extracted from the entire trial's data are utilized as one input sample to the classifier. To compare our deep learning methods, which are trained on segmented EEG data, with those papers, a voting mechanism is applied to the segment predictions in each testing trial:

$$y_t = \begin{cases} 1, & n_{\hat{y}=1} \geq n_{\hat{y}=0} \\ 0, & n_{\hat{y}=1} < n_{\hat{y}=0} \end{cases}$$

where $y_t$ is the prediction for one testing trial, and $n$ is the number of segment predictions in that trial that satisfy the condition indicated in the subscript.
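The voting mechanism amounts to a majority vote over the segment-level predictions of one trial, as in this minimal sketch. Resolving a tie in favour of the high class is an assumption made here; for DEAP the question does not arise, since each trial yields an odd number (15) of 4 s segments.

```python
def vote(segment_predictions):
    """Trial-level prediction from binary segment-level predictions.

    Assumption: a tie between the two classes is resolved as class 1.
    """
    ones = sum(1 for p in segment_predictions if p == 1)
    zeros = len(segment_predictions) - ones
    return 1 if ones >= zeros else 0
```

For example, a trial whose 15 segments are predicted as 9 high and 6 low is labelled high.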

Implementation Details
The code is implemented using the PyTorch library, and the source code is available at https://github.com/yi-ding-cs/TSception.
The ratio coefficients of the T kernel lengths are [0.5, 0.25, 0.125] for DEAP. The sampling rate of the data in DEAP is 128 Hz, hence the temporal kernel lengths are 64, 32, and 16 according to Eq. 1. When training TSception on MAHNOB-HCI, we found that using [0.25, 0.125, 0.0625] as the ratio coefficients achieved a higher mean accuracy on the validation set. The sampling rate of MAHNOB-HCI is 256 Hz, which gives temporal kernel lengths of 64, 32, and 16 as well. The number of kernels in each of the dynamic temporal, asymmetric spatial, and high-level fusion layers is 15. The number of hidden nodes in the first fully connected layer is 32. For model training, the maximum number of training epochs is 500. The batch size is set to 64 on the DEAP dataset and reduced to 32 on the MAHNOB-HCI dataset because MAHNOB-HCI has half as many trials as DEAP. All the other hyper-parameters (including the structural hyper-parameters as well as the training hyper-parameters) are the same for DEAP and MAHNOB-HCI in order to test the generalization ability of TSception. The hyper-parameters are the same for all subjects. The Adam optimizer is utilized with an initial learning rate of 1e-3. Cross-entropy is selected as the loss function. For more details, please refer to the open-access GitHub repository for TSception.

RESULTS AND ANALYSIS
In this section, we first report and statistically compare the accuracy and F1 score of our method against those of the state-of-the-art methods. The ablation studies are then presented to reveal the contribution of each component in TSception. Finally, saliency maps are presented to visualize how the brain areas contribute to the arousal and valence dimensions.

Statistical Analysis
The experiment results include I) the per-subject accuracy and F1 score on the DEAP dataset (see Fig. 3 and Fig. 4), II) the overall accuracy and F1 score on the DEAP dataset (see Table 3) and the MAHNOB-HCI dataset (see Table 4), and III) the comparison against the results from the existing literature (see Table 5). For the statistical analysis, a two-tailed Wilcoxon Signed-Rank Test is utilized. Compared to accuracy, the F1 score is a more reliable metric for quantifying the performance of classification methods when a dataset has imbalanced classes. Based on the results, we make the following observations and analysis.
Interestingly, we also notice that the difficulty of predicting the two emotional dimensions is not consistent across the two datasets. Considering the trade-off between accuracy and F1 score, we find that valence is harder to predict for DEAP while arousal is harder to predict for MAHNOB-HCI.
Our method also outperforms the results reported in the existing literature [33] [38] [39]. According to Table 5, our method achieves the best accuracies for both the arousal and valence dimensions. TSception improves on the accuracy of FBTSC [39] by 3.15% for arousal and 1.18% for valence. Compared with UL [38], our method is 1.41% higher for arousal and 6.02% higher for valence in terms of accuracy. The accuracies of our method for arousal and valence are 1.75% and 4.67% higher than those of the SVM reported in [33]. The F1 scores of TSception are 5.05% and 2.91% higher than those of SVM [33] and UL [38] for arousal, and 9.07% and 4.12% higher for valence, indicating the effectiveness of the proposed method.
According to this extensive comparison against a variety of methods, the proposed method shows promising performance on the arousal-valence prediction task, with a reasonable degree of generality.

Ablation Study
The proposed TSception has three functional parts: a dynamic temporal layer, an asymmetric spatial layer, and a high-level fusion layer. The combination of these three parts leads to the success of the classification tasks. Ablation studies are conducted to further understand which part contributes more to the improvement of the classification results. The classification results after removing each of the dynamic temporal layer, asymmetric spatial layer, and high-level fusion layer from TSception are reported. The DEAP dataset is used for the ablation study since the overall performance on it is higher than on MAHNOB-HCI. The results of the ablation study are shown in Table 6.
All of the accuracies and F1 scores drop after removing any of the three types of layers, indicating that all components contribute to the improvement of the classification results. Overall, the most significant drops are observed when the asymmetric spatial layer is removed from TSception, with decrements of 1.5%/1.84% in accuracy for arousal/valence and 1.84%/1.59% in F1 score for arousal/valence. This demonstrates that the asymmetric spatial layer contributes more than the other two layers, especially for the valence dimension, where the drop is the largest in the ablation study. The high-level fusion layer contributes more to arousal: after removing it, the accuracy drops by 1.54% and the F1 score by 2.40% for arousal, while the drops for valence are smaller (0.93% in accuracy and 1.19% in F1 score). The dynamic temporal layer contributes less than the others, with drops in accuracy/F1 score of 1.12%/1.96% for arousal and 0.52%/0.86% for valence.
The kernel-level ablation studies are further conducted to analyze the effects of two types of spatial kernels in the spatial asymmetric layer because it has more contribution than other layers.The weights and biases are set to zeros as [18] did to study the kernel-level effects.The results are shown in Table 7.
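As a minimal illustration of ablating a kernel by zeroing its parameters, the sketch below models a spatial kernel as a weighted sum over channels plus a bias. The dictionary layout, the helper names `apply_spatial_kernel` and `ablate`, and the toy data are assumptions made for illustration; they are not taken from the TSception code base.

```python
from copy import deepcopy


def apply_spatial_kernel(eeg, kernel):
    """Weighted sum over channels at each time step, plus the bias."""
    weights, bias = kernel["weight"], kernel["bias"]
    n_time = len(eeg[0])
    return [
        sum(w * channel[t] for w, channel in zip(weights, eeg)) + bias
        for t in range(n_time)
    ]


def ablate(kernel):
    """Return a copy of the kernel with its weights and bias set to zero."""
    ablated = deepcopy(kernel)
    ablated["weight"] = [0.0] * len(kernel["weight"])
    ablated["bias"] = 0.0
    return ablated


# Toy input (hypothetical): 4 channels x 3 time samples.
eeg = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
global_kernel = {"weight": [0.1, -0.2, 0.3, 0.4], "bias": 0.05}

full_output = apply_spatial_kernel(eeg, global_kernel)
ablated_output = apply_spatial_kernel(eeg, ablate(global_kernel))
```

With both the weights and the bias at zero, the ablated kernel outputs zero at every time step, so it passes no information to later layers while the rest of the network stays unchanged.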
Hemisphere kernels learn more discriminative representations than global kernels in TSception, according to the results in Table 7.The drops of classification results after removing the hemisphere kernels are all larger than the ones after removing the global kernels.After removing either type of the hemisphere and global kernels will downgrade the performance of TSception for both the arousal and valence dimensions.This indicates both types of spatial convolutions help to improve the performance of TSception.H: Hemisphere kernels; G: Global kernels.w/o: Without the component.

Interpretability
In this part, the saliency map [40] is used to visualize which parts of the data are more informative and contribute to classification performance. The saliency map is one of the most commonly used tools for intuitively showing which regions of the input carry classification-related information. For better visualization, the original saliency map is averaged along the time dimension to obtain a topological map over the EEG channels. The normalized saliency maps of the different samples of each subject are then averaged to get the subject's mean saliency map for general visualization. The averaged saliency maps for the DEAP dataset are shown in Fig. 5. The mean saliency maps of individual subjects for arousal are shown in Fig. 6 to illustrate the differences across subjects.
Fig. 5 shows the saliency maps under different calculation settings. The upper three maps, Fig. 5(a)-(c), are the averaged saliency maps for the arousal dimension, while the lower three, Fig. 5(d)-(f), are for valence. The first column, Fig. 5(a) and Fig. 5(d), shows the mean saliency map of all the subjects. The second column, Fig. 5(b) and Fig. 5(e), shows the mean saliency map of the subjects whose F1 scores are in the top 10%. The last column, Fig. 5(c) and Fig. 5(f), shows the average saliency map of the subjects whose F1 scores are in the bottom 10% for arousal. The mean saliency map is normalized between -1 and 1 for better visualization. We choose the F1 score as the selection criterion for visualization because it reflects how precise the predictions are when the classes are imbalanced.
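The averaging and normalization steps described above can be sketched as follows. This is an illustrative outline only: the text does not state which normalization is used, so min-max rescaling to [-1, 1] is an assumption, and the function names and toy values are hypothetical.

```python
def average_over_time(saliency):
    """Collapse a channels x time saliency map to one value per channel."""
    return [sum(row) / len(row) for row in saliency]


def normalize_to_unit_range(values):
    """Min-max rescale to [-1, 1] (assumed scheme); constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def subject_mean_map(sample_maps):
    """Average the normalized per-sample channel maps of one subject."""
    per_sample = [
        normalize_to_unit_range(average_over_time(m)) for m in sample_maps
    ]
    n_samples = len(per_sample)
    mean_map = [sum(column) / n_samples for column in zip(*per_sample)]
    # Rescale the mean map again so it spans [-1, 1] for plotting.
    return normalize_to_unit_range(mean_map)


# Toy data (hypothetical): 2 samples, each 3 channels x 2 time points.
sample_maps = [
    [[0.2, 0.4], [0.0, 0.0], [1.0, 0.8]],
    [[0.5, 0.5], [0.1, 0.3], [0.0, 0.0]],
]
mean_map = subject_mean_map(sample_maps)
```

The resulting list holds one value per channel, which is the form needed to draw a topological map over the electrode positions.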
For arousal, the frontal, temporal, and right side of the parietal and occipital areas of the brain are more informative according to Fig. 5(a) and Fig. 5(b). The averaged saliency map of all the subjects, Fig. 5(a), shows that the values at the Fp2, F3, FC2, FC5, T7, T8, C4, P8, and O2 channels are higher than at the others. Comparing the saliency maps of the top 10% (Fig. 5(b)) and bottom 10% (Fig. 5(c)) F1 score subjects, we can see that the frontal (Fp1, AF3, F3, and F4), temporal (T8), and parietal (P7) areas provide more information in Fig. 5(b), while the network mainly learns from the parietal area (P8) in Fig. 5(c). This indicates that the frontal, temporal, and parietal areas of the brain provide more emotion-related information, which is consistent with the previous literature [45] [46] [47]. Emotional arousal is mostly reflected in the frontal, temporal, and parietal lobes [45]. Pre-frontal and temporal asymmetry are closely related to arousal recognition [46].
For valence, the frontal, temporal, and right side of the parietal and occipital areas of the brain are more informative according to Fig. 5(d) and Fig. 5(e). As with arousal, the occipital (O1 and O2) activities provide less classification-related information than the frontal (F8), temporal (T8), and parietal (P7 and P8) activities for the high F1 score subjects (Fig. 5(e)). According to previous studies, asymmetry patterns in the pre-frontal, parietal, and temporal regions are observed for valence recognition [46].
In general, the most informative regions identified by the neural network are the frontal, temporal, parietal, and occipital regions, while the occipital activities are less informative for the subjects with high F1 scores. This is consistent with previous works [45] [46] [48] [49] [50], which indicates that the network learns from the proper regions. For both arousal and valence, the occipital activities provide certain information, which may be because the stimuli used in DEAP are music videos.
To obtain a generalized evaluation of our method, we adopt trial-wise cross-validation of cropped trials on two benchmark datasets. As mentioned in Section 4.4, if the samples from different trials are randomly shuffled before the data are divided into training and testing sets in cropped experiments, one can obtain very high classification results that drop once the highly correlated adjacent segments of a trial are no longer seen by the model [38] [51]. Hence, trial-wise 10-fold cross-validation is used to make sure that the highly correlated adjacent segments of each trial do not appear in both the training and testing data. To further compare our method with those in the existing literature that also use generalized evaluation settings, leave-one-trial-out cross-validation is conducted with a voting mechanism over each trial's segment predictions. As for evaluation metrics, we follow [38] and report the F1 score in addition to accuracy to get a better evaluation on imbalanced datasets.
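The trial-wise splitting and the voting step can be sketched as below. This is a simplified outline under stated assumptions rather than the evaluation code used in the experiments: the function names, the fixed seed, the number of segments per trial, and the tie-breaking behaviour of the vote are all illustrative choices.

```python
import random
from collections import Counter


def trial_wise_folds(trial_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) segment indices with folds built over trials.

    trial_ids[i] is the trial that segment i was cropped from, so every
    segment of a held-out trial lands in the test set and none in training.
    """
    trials = sorted(set(trial_ids))
    random.Random(seed).shuffle(trials)
    folds = [trials[k::n_folds] for k in range(n_folds)]
    for fold in folds:
        held_out = set(fold)
        test_idx = [i for i, t in enumerate(trial_ids) if t in held_out]
        train_idx = [i for i, t in enumerate(trial_ids) if t not in held_out]
        yield train_idx, test_idx


def vote_trial_label(segment_predictions):
    """Majority vote over the segment-level predictions of one trial."""
    return Counter(segment_predictions).most_common(1)[0][0]


# Hypothetical layout: 40 trials, each cropped into 15 segments.
trial_ids = [trial for trial in range(40) for _ in range(15)]
splits = list(trial_wise_folds(trial_ids, n_folds=10))

# Leave-one-trial-out is the special case with one fold per trial.
loto_splits = list(trial_wise_folds(trial_ids, n_folds=40))

# Voting example: three of five segments predict class 1.
trial_prediction = vote_trial_label([1, 0, 1, 1, 0])
```

Because the folds are formed over trial identifiers rather than over individual segments, adjacent segments from the same trial can never be split between the training and testing sets.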
According to the results on two public datasets shown in Table 3, Table 4, and Table 5, the proposed TSception achieves higher classification results than the compared methods in most of the experiments. Notably, TSception has 1/4 or 1/10 of the trainable parameters of its counterparts. Such efficiency and effectiveness may benefit the online use of the neural network in real-world BCI applications.
Extensive ablation studies and interpretability experiments suggest that all modules in TSception contribute positively to the classification results and that our method learns from emotion-related information. According to Table 6, the asymmetric spatial layer contributes most to the classification results. To verify that the neural network learns emotion-related information rather than irrelevant features, saliency maps are used to visualize the most informative regions identified by the network itself. The mean saliency maps of all subjects in Fig. 5(a) and (d) show strong activation in the frontal, temporal, parietal, and occipital areas. However, the saliency maps of the subjects with high F1 scores in Fig. 5(b) and (e) show strong activation only in the frontal, temporal, and parietal areas, which is consistent with [1] [45] [46] [48] [49] [50] [52]. A right hemisphere lateralization pattern is also observed for valence in the averaged saliency map of the top 10% F1 score subjects (Fig. 5(e)), which indicates that the right hemisphere is more informative for valence recognition. Neuroscience studies [53] [54] suggest that the right hemisphere has a special role in emotional processing in the brain. However, right hemisphere lateralization is not present for high F1 subjects for arousal, as shown in Fig. 5(b). This could be because the information provided by the frontal area is enough for the neural network to make its decision. Moreover, we find that the occipital activities also contribute to the inference process of the neural network for all the subjects, as shown in Fig. 5(a) and (d). A possible reason for the high occipital activity is that music videos are used as stimuli in DEAP. However, the information provided by the occipital activities is less useful for high F1 subjects (for both arousal and valence). This suggests that the occipital region is not as informative as other brain regions, such as the frontal and temporal regions, for emotion recognition.
To conclude, we propose a multi-scale convolutional neural network, named TSception, to capture temporal dynamics and spatial asymmetry for EEG emotion recognition. Using generalized cross-validation strategies, the proposed method and several baseline methods are evaluated on two publicly available benchmark datasets. The proposed method shows promising performance on the arousal-valence prediction task and generalizes reasonably well. In the future, the generalization ability of TSception across subjects will be explored. The effect of segment length in cropped experiments on TSception should also be studied.

Fig. 1. Structure of TSception. In the figure, fs is the sampling rate of the EEG signals, C is the number of channels, BN stands for batch normalization, AP is the average pooling operation, and GAP represents global average pooling. TSception has four main parts: the dynamic temporal layer, the asymmetric spatial layer, the high-level fusion layer, and the classifier. The dynamic temporal layer first learns the dynamic temporal/frequency representations from the EEG data channel by channel. After the representations for each channel have been learned, the asymmetric spatial layer is applied to learn the global spatial representations and the emotional asymmetry pattern using convolutional kernels of different scales. A high-level fusion layer then fuses the information from the hemisphere and global representations. Finally, the fused representation is passed to the fully connected layers with softmax as the activation function.
where Γ(·) is the squeeze operation, the W terms are the trainable weight matrices, and the b terms are the bias terms. The proposed TSception is summarised in Algorithm 1. The structure of the proposed TSception is shown in Table 2.

[Figs. 3 and 4: per-subject plots, with subject labels (sub6, sub7, ..., sub31) on the x-axis.]
Fig. 3. Mean accuracy of each subject for arousal and valence on DEAP using TSception.

Fig. 5. Averaged saliency maps in the DEAP dataset. The upper three saliency maps (a)-(c) are the averaged saliency maps for the arousal dimension, while the lower three (d)-(f) are for valence. The first column, (a) and (d), shows the mean saliency map of all the subjects. The second column, (b) and (e), shows the mean saliency map of the subjects whose F1 scores are in the top 10%. The last column, (c) and (f), shows the average saliency map of the subjects whose F1 scores are in the bottom 10% for arousal. The mean saliency map is normalized between -1 and 1 for better visualization. F1 is chosen as the criterion because it reflects how precise the predictions are when the classes are imbalanced. The most informative regions identified by the neural network are the frontal, temporal, and parietal regions for high F1 score subjects.

Fig. 6. Saliency maps of all 32 subjects for arousal in the DEAP dataset. The saliency map is averaged along the time dimension to plot the topological map.

TABLE 1
Summary of related information of the datasets used in the experiments

TABLE 2
Structure of the proposed TSception. ReLU is the Leaky-ReLU activation function. AP is the average pooling operation. BN stands for batch normalization. GAP is the global average pooling. '-1' in the tensor size stands for the number of samples within one mini-batch. The strides of the CNNs are (1,1) if not specified, and the stride of each pooling layer is the same as its pooling step.

TABLE 5
Comparison with the results reported in the existing literature using leave-one-trial-out cross-validation on DEAP

TABLE 6
Ablation study results of removing functional layers in TSception using DEAP

TABLE 7
5 DISCUSSION AND CONCLUSION
Accurate emotion detection can benefit many healthcare applications, including Cognitive Behavioural Therapy (CBT) and Emotion Regulation Therapy (ERT)/Emotion-Focused Therapy (EFT) for the treatment of emotion-related mental disorders. Most previous works rely heavily on hand-crafted features, which require substantial domain knowledge. Deep learning, especially the family of convolutional neural networks, can extract features automatically. In this paper, we propose TSception, a multi-scale convolutional neural network, for EEG emotion recognition tasks. Parallel multi-scale temporal kernels, whose lengths are related to the sampling rate of the EEG, are proposed in the temporal convolutional layer of TSception to enrich the learned temporal/frequency representations. To capture the emotional asymmetry patterns, we propose hemisphere kernels in addition to the global kernels in the asymmetric spatial layer. A high-level fusion layer is designed to further learn from the hemisphere/global representations of EEG and to reduce the model size.