A Brain Network Analysis-Based Double Way Deep Neural Network for Emotion Recognition

Constructing reliable and effective models to recognize human emotional states has become an important issue in recent years. In this article, we propose a double way deep residual neural network combined with brain network analysis, which enables the classification of multiple emotional states. To begin with, we transform the emotional EEG signals into five frequency bands by wavelet transform and construct brain networks by inter-channel correlation coefficients. These brain networks are then fed into a subsequent deep neural network block which contains several modules with residual connection and enhanced by channel attention mechanism and spatial attention mechanism. In the second way of the model, we feed the emotional EEG signals directly into another deep neural network block to extract temporal features. At the end of the two ways, the features are concatenated for classification. To verify the effectiveness of our proposed model, we carried out a series of experiments to collect emotional EEG from eight subjects. The average accuracy of the proposed model on our emotional dataset is 94.57%. In addition, the evaluation results on public databases SEED and SEED-IV are 94.55% and 78.91%, respectively, demonstrating the superiority of our model in emotion recognition tasks.


I. INTRODUCTION
Physiological states are significant for health management and life quality improvement. Among these, emotions have a great impact on physical health and communication in our daily lives [1]. By properly recognizing emotional states, brain-computer interaction systems can be developed to a higher level and applied in many fields such as health care, nursing, entertainment, etc. Judging emotions from external modalities involves a high degree of complexity and uncertainty [2]. Recently, methods based on EEG signals have drawn a lot of attention due to the high temporal resolution and objectivity of EEG [3], [4]. Liang et al. [5] proposed a hybrid unsupervised deep convolutional recursive adversarial network-based EEG feature representation and fusion model for training and identifying emotional EEG features in an unsupervised manner. Liu et al. [6] proposed a subject-independent emotion recognition algorithm based on dynamic empirical convolutional neural networks. Jiang et al. [7] proposed a depression classification detection method using differential entropy and a genetic algorithm for feature extraction and selection, and a support vector machine for classification. These methods obtain temporal and spatial information from the EEG signals for emotion classification. However, due to the non-stationary nature of EEG signals and the high coupling between different channels, it remains challenging to detect emotion-related information in EEG signals while maintaining recognition robustness under noise interference.
Considering the nonlinearity and complexity of EEG signals, dynamic systems analysis methods are potentially well suited to EEG-based emotion recognition [8], [9]. Recently, to analyze EEG rhythms, Fourier-Bessel series expansion (FBSE) based methods [10], [11], [12], [13] have been widely applied to decompose EEG signals. Sharma et al. [14] proposed the discrete wavelet transform as the decomposition method to explore the nonlinear dynamics of each sub-band signal. Nalwaya et al. [15] proposed the tunable Q-factor wavelet transform to separate EEG rhythms. Das and Pachori [16] proposed an EEG rhythm segmentation method using multivariate iterative filtering. Alternatively, brain activity can also be described by constructing brain networks from EEG signals, which are built by taking electrodes (channels) as nodes and using correlation measures between channels as edges [17]. By using EEG, MEG, and fMRI to measure brain activity, Straaten and Stam [18] created functional networks for analyzing brain activity. Dang et al. [19] proposed a multilayer brain network to map the time, frequency and channel-related information from EEG signals into the multilayer network topology. The brain network method is an effective way to represent the inter-channel relationships of EEG signals, which are important in emotion recognition tasks.
The significance of the attention mechanism in deep neural networks has been studied extensively in the previous literature [20], [21]. Attention not only tells where to focus, but it also improves the representation of interesting areas. Woo et al. [22] proved the efficiency of the attention mechanism used in computer vision tasks. Similarly, brain activities can be divided into many channels or regions. Each channel or region has a different intensity of response when the brain is engaged in different activities. To explore the effect of targeting specific regions and channels when analyzing brain activity, we introduce the attention mechanism into the emotion analysis model for further discussion.
Although existing works have advanced the field to some extent, there remains a need to explore new methods for studying brain activities and to improve classification accuracy. The main contributions of this paper are as follows: 1) It proposes a novel double way feature extracting strategy to improve the utilization of time-frequency information of EEG signals for the emotion recognition task. The proposed method takes advantage of dynamic systems analysis by transforming raw EEG signals into brain networks. Deep learning techniques are adopted to attain feature embeddings. The outputs are combined to mitigate information loss and to improve classification performance.
2) In the process of dealing with the brain networks, the attention mechanism with residual connection is adopted to perform adaptive feature optimization. By applying it to the feature extraction from the brain networks, channels and areas that are useful for emotion recognition are emphasized, while gradient dissipation and gradient explosion are prevented.
3) A series of experiments are conducted to collect emotional EEG data from multiple subjects. The proposed method is validated on our emotional dataset and achieves an average accuracy of 94.57% which is 1.4% higher than the highest baseline. Also, we evaluate our method on SEED and SEED-IV and it achieves 94.55% and 78.91% respectively. 4) We investigate the effect of different EEG rhythms on classification results. Moreover, a series of ablation experiments are carried out to demonstrate the necessity of the double way structure and the residual connection.
The remainder of this paper is organized as follows. Section II explains the detailed experiment protocol and the processing of EEG signals. Section III explains the brain network construction block and the double way structure. Section IV shows the classification results on the three datasets and compares the results with other baselines. Section V discusses the ablation experiments. Finally, Section VI summarizes this study.

II. EXPERIMENTS
The experiments are conducted in the Institute of Artificial Intelligence and Network Science of Tianjin University. The experiment process has been approved by the ethics committee of Tianjin University, Tianjin, China. A multi-channel electrode cap is used for data acquisition. The electrode distribution is in accordance with the 10-20 international standard. All 30 electrodes used for EEG acquisition are shown in Fig. 1. With the two mastoid electrodes behind the ears serving as references, the number of recording channels is 30, corresponding to these electrodes. In addition to the 30 electrodes, there are also four electrooculogram (EOG) electrodes. A NuAmps amplifier is used to amplify the collected data, and a computer equipped with Scan4.5 is used to store and visualize the amplified EEG data, which helps to determine whether the electrodes are working properly. We recruited eight right-handed healthy subjects (6 males and 2 females, aged 22-24 years). Before the experiment, participants were asked to remain calm and to avoid fatigue to ensure that their emotions were properly aroused. In addition, each subject was provided with detailed instructions during the experiment process to reduce deviation in the EEG signals.
The specific stimuli used to arouse the happy and sad emotions of the subjects were eight carefully selected movie clips, as shown in Table I. The participants were asked to watch these movie clips in an independent, distraction-free environment to evoke corresponding emotional responses. According to the requirements of the experimental design, the subjects were asked to fill in a questionnaire during the interval between the movie clips to support the investigation of the truly evoked emotional types and emotional intensity. At the end of each video clip, the computer screen went black for a period to calm the subjects down from the emotionally aroused state so that the arousal of the next emotional state was not interfered with.
The subjects were guided by the instructions on the screen throughout the experiment to ensure that they were not disturbed by other factors. The experimental procedure was divided into four steps. In the preparation stage of emotional arousal, the preset preparation time was 5 s, and an instruction on the screen prompted the subjects to focus on the screen before the movie clips began. In the emotional arousal stage, the program randomly selected a movie clip to play to arouse the corresponding emotional state. In the questionnaire stage, the screen provided a simple survey to obtain the subjects' assessment of emotional arousal. In the emotional recovery stage, the subjects chose a recovery time by themselves, and then entered the next emotional arousal stage after complete recovery. The experiment scenario and the timing diagram are shown in Fig. 2 and Fig. 3, respectively.
After the EEG signals were obtained from the experiment, we down-sampled the data from 1000 Hz to 250 Hz to reduce the input dimension and thereby improve the efficiency of the analysis process. Then the processed data were filtered to 1-50 Hz using a bandpass filter to reduce noise, artifact interference and irrelevant components. Independent component analysis (ICA) [23] was used to eliminate blink artifacts. For the upper limit of the time window, directly using an entire trial as an input would cause a huge memory burden. For the lower limit of the time window, Ouyang et al. [24] demonstrated that a 2-s time window can improve the classification accuracy using power spectral density (PSD) features and differential entropy (DE) features. Here, after preliminary experiments, we divide the data into non-overlapping 5-s segments to strike a balance between accuracy and computational burden. Finally, our emotional dataset is obtained with 510 samples in total and 170 samples for each category.
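The preprocessing pipeline above (down-sampling to 250 Hz, 1-50 Hz band-pass filtering, and non-overlapping 5-s segmentation) can be sketched in a few lines of SciPy. This is a minimal illustration with hypothetical helper and variable names, not the authors' actual code; the ICA artifact-removal step is omitted.

```python
import numpy as np
from scipy import signal

def preprocess(eeg, fs_in=1000, fs_out=250, band=(1.0, 50.0), win_s=5):
    """Downsample, band-pass filter, and segment raw EEG.
    eeg: (channels, samples) array at fs_in Hz."""
    # Downsample 1000 Hz -> 250 Hz (factor 4) with an anti-aliasing filter.
    x = signal.decimate(eeg, fs_in // fs_out, axis=-1, zero_phase=True)
    # 1-50 Hz band-pass (4th-order Butterworth, zero-phase filtering).
    b, a = signal.butter(4, band, btype="bandpass", fs=fs_out)
    x = signal.filtfilt(b, a, x, axis=-1)
    # Cut into non-overlapping 5-s windows, dropping any trailing remainder.
    step = win_s * fs_out
    n_win = x.shape[-1] // step
    return x[..., : n_win * step].reshape(x.shape[0], n_win, step)

raw = np.random.randn(30, 1000 * 20)   # 30 channels, 20 s of synthetic data
segments = preprocess(raw)
print(segments.shape)                  # (30, 4, 1250): 4 windows of 5 s each
```

Each 5-s window at 250 Hz yields 1250 samples per channel, matching the segment length implied by the text.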

III. METHODOLOGY

A. Overall Architecture
In this section, we introduce the implementation of the whole method. The schematic of our proposed method is shown in Fig. 4. The whole model is divided into two ways, and both are fed with raw EEG signals. The first way begins with a brain network construction block which establishes brain network matrices from the raw EEG signals. Then the matrices are fed into a deep convolutional neural network with several successively connected residual modules. In the second way, the raw EEG signals are directly fed into a deep neural network block. Finally, the outputs of the two ways are concatenated and merged to obtain the emotional categories.

B. Brain Network Construction Block
The brain network is employed to show correlations of EEG signals gathered in different areas of the human brain [18]. To reveal the correlations in different frequency bands, we use the continuous wavelet transform (CWT) [25] to decompose the EEG signals into a superposition of a series of wavelets. Compared to the short-time Fourier transform (STFT), CWT attains a higher frequency resolution in the low-frequency region and a higher time resolution in the high-frequency region. For any energy-limited signal x(t), the CWT of x(t) can be represented as

W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}_{a,b}(t)\, dt, \qquad \psi_{a,b}(t) = \psi\!\left(\frac{t-b}{a}\right),

where a and b denote the scale factor and the offset, respectively, \psi^{*}_{a,b}(t) denotes the complex conjugate of the mother wavelet \psi_{a,b}(t), and W_x(a, b) are the wavelet coefficients of the EEG signals. In this work, we choose the Morlet wavelet as the mother wavelet because it can preserve both the time and frequency resolution [26]. The Morlet wavelet can be represented as [27]

\psi(t) = \pi^{-1/4}\, e^{j\omega_0 t}\, e^{-t^2/2},

where \omega_0 is the center angular frequency. To further analyse the time-frequency representation obtained by CWT, the frequency should be determined from the scale factor a. The relationship between the scale factor a and the frequency can be expressed as follows [28]:

f = \frac{f_c}{a \Delta},

where f is the frequency corresponding to scale factor a, f_c is the center frequency of the frequency window of \psi(t), and \Delta is the sampling period of the EEG signals. By applying CWT to the raw EEG signals, we obtain a wavelet time-frequency diagram of each EEG channel. Based on the different EEG rhythms [29] (δ (1-3 Hz), θ (4-7 Hz), α (8-13 Hz), β (14-30 Hz), and γ (31-50 Hz)), the wavelet coefficients can be divided into five frequency bands. For each EEG rhythm, the wavelet coefficients can be aggregated within the boundaries of the corresponding frequency band by the following equation [30]:

W_{ave}(b) = \frac{1}{r_B - r_A + 1} \sum_{a = r_A}^{r_B} W_x(a, b),

where [r_A, r_B] denotes the boundary of the different EEG rhythms and W_{ave}(b) represents the aggregated coefficient sequence of the corresponding EEG rhythm.
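The Morlet CWT and the per-rhythm aggregation can be illustrated with a naive NumPy implementation. This is a sketch under simplifying assumptions (direct convolution, magnitude aggregation, ω₀ = 5, a 10-s test tone), with hypothetical function names; it is not the authors' implementation.

```python
import numpy as np

def morlet_cwt(x, fs, freqs, w0=5.0):
    """Naive complex-Morlet CWT: returns |W_x(a, b)| for each target frequency.
    The scale for a centre frequency f follows a = w0 / (2*pi*f)."""
    out = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        a = w0 / (2 * np.pi * f)                      # scale giving centre freq f
        tau = np.arange(-4 * a * fs, 4 * a * fs + 1) / fs   # +/- 4 std support
        psi = np.exp(1j * w0 * tau / a) * np.exp(-((tau / a) ** 2) / 2) / np.sqrt(a)
        out[i] = np.abs(np.convolve(x, np.conj(psi[::-1]), mode="same")) / fs
    return out

fs = 250
t = np.arange(0, 10, 1 / fs)                 # 10 s so low-freq wavelets fit
x = np.sin(2 * np.pi * 10 * t)               # pure 10 Hz (alpha-band) tone
freqs = np.arange(1, 51)                     # 1-50 Hz in 1 Hz steps
coefs = morlet_cwt(x, fs, freqs)
bands = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}
# Aggregate coefficients within each rhythm's boundary [r_A, r_B].
power = {name: coefs[lo - 1:hi].mean() for name, (lo, hi) in bands.items()}
print(max(power, key=power.get))             # the alpha band dominates
```

For a 10 Hz tone the α-band aggregate is by far the largest, mirroring how the aggregated coefficient sequences separate the rhythms.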
To construct the brain networks in each frequency band, the channels of the EEG signals are taken as the nodes, and the Spearman correlation coefficients between channels are used to determine the connecting edges of the brain networks. The Spearman correlation coefficient [31] is a nonparametric index that measures the monotonic relationship between two variables. It is calculated from the rank positions of the original data and is not sensitive to data errors. Therefore, the Spearman correlation coefficients between the channels of EEG signals are robust against the noise in EEG signals.
For two aggregated coefficient sequences from two EEG channels within the same frequency band, the sequences are first sorted according to the same sorting rule to obtain two ranking sets X and Y. A difference set D of ranked elements can be obtained by subtracting X and Y element-wise, in which the k-th element of D is calculated as [30]

d_k = x_k - y_k,

where x_k and y_k are the k-th elements of X and Y, respectively. Then the Spearman correlation coefficient ρ can be obtained by the following equation [30]:

\rho = 1 - \frac{6 \sum_{k=1}^{N} d_k^2}{N (N^2 - 1)},

where N denotes the length of the input sequence. The obtained Spearman correlation coefficients lie in the range [−1, 1]. The absolute value of the coefficient increases as the correlation between the variables increases, and the sign of the coefficient indicates the direction of the correlation.
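The rank-difference formulation of the Spearman coefficient can be written directly in NumPy. The helper below is a hypothetical illustration assuming no tied values (ties would require averaged ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation from rank differences: rho = 1 - 6*sum(d^2)/(N(N^2-1)).
    Assumes no ties in x or y."""
    rx = np.argsort(np.argsort(x))    # rank of each element of x
    ry = np.argsort(np.argsort(y))    # rank of each element of y
    d = (rx - ry).astype(float)       # element-wise rank differences d_k
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(spearman(a, a))     # 1.0  -- identical rankings
print(spearman(a, -a))    # -1.0 -- perfectly inverted rankings
```

Because only ranks enter the formula, any monotonic distortion of the signal amplitudes leaves the coefficient unchanged, which is the robustness property exploited here.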
The initially constructed brain network is fully connected, and a threshold value needs to be set to convert it into a sparsely connected brain network. For each brain network, 50% is set as the sparsity threshold; that is, the half of the edges with lower absolute values are removed according to [32]. Finally, a brain network matrix of shape 30×30 is obtained, in which the element in the i-th row and j-th column denotes the Spearman correlation coefficient between the i-th channel and the j-th channel.
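The 50% sparsity step can be sketched as follows. This is an illustrative helper (hypothetical name and choices such as zeroing the diagonal), not the authors' code:

```python
import numpy as np

def sparsify(corr, keep=0.5):
    """Keep only the strongest `keep` fraction of edges (by |weight|)
    in a symmetric correlation matrix; weaker edges are zeroed."""
    mask = np.triu(np.ones_like(corr, dtype=bool), k=1)   # unique edges
    thresh = np.quantile(np.abs(corr[mask]), 1.0 - keep)  # cut weakest half
    out = np.where(np.abs(corr) >= thresh, corr, 0.0)
    np.fill_diagonal(out, 0.0)                            # no self-loops
    return out

rng = np.random.default_rng(0)
m = rng.uniform(-1, 1, (30, 30))
m = (m + m.T) / 2                  # symmetric, correlation-matrix-like
sparse = sparsify(m)
frac = np.count_nonzero(sparse[np.triu_indices(30, 1)]) / (30 * 29 / 2)
print(round(frac, 2))              # roughly half of the edges survive
```

The surviving matrix retains the signed Spearman values of the strong edges, so the direction of correlation is preserved after thresholding.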

C. Deep Neural Network Blocks
For the first way of the model, the raw EEG signals are mapped to the brain network matrices X ∈ R^{N×C×C} by the brain network construction block, where N represents the number of frequency bands and C represents the number of EEG channels. The matrices are then input into the subsequent deep neural network block. The deep neural network block starts with a convolutional layer, whose calculation can be expressed by the following equation [33]:

X_i^{out} = f(X_i * k + b),

where X_i is the i-th feature map, k is the weight matrix, b is the bias, and * denotes the convolution operation. The rectified linear unit (ReLU) [34] is adopted as the activation function f of the convolutional layer. The convolutional layer is followed by a batch normalization layer [35], and then an activation layer which also uses the ReLU function. We then adopt a max-pooling layer to speed up training and prevent over-fitting [36]. After these operations, the processed features are fed into multiple successively connected residual modules. Each residual module has a residual connection that can effectively alleviate the problems of gradient dissipation and gradient explosion [37]. The schematic diagram of a residual module is shown in Fig. 5. We introduce both the channel attention mechanism and the spatial attention mechanism into the modules.
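A toy single-channel version of the convolutional layer, computing f(X_i * k + b) with a 'valid' window and ReLU as f, can be written as below. This is a pedagogical sketch (loop-based, one input and one output channel), not the block's actual multi-channel implementation:

```python
import numpy as np

def conv2d_relu(x, k, b):
    """One 'valid' 2-D convolution channel followed by ReLU: f(X * k + b)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Slide the kernel over the input and accumulate + bias.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(out, 0.0)    # ReLU activation f

x = np.random.default_rng(4).standard_normal((30, 30))  # one 30x30 brain-network map
y = conv2d_relu(x, np.ones((3, 3)) / 9.0, 0.0)          # 3x3 averaging kernel
print(y.shape)                                           # (28, 28)
```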
The main idea of the channel attention mechanism [22] is to use another neural network to obtain the importance of each channel in the feature maps and use this importance to assign weights, making the neural network focus on certain feature channels. Specifically, the channel attention mechanism first uses both average-pooling and max-pooling operations to aggregate the spatial information of the feature maps. Here, the pooling operations are adaptive to the input feature map and the output size is set to 1 × 1. Then we compute the correlation between channels through a shared multi-layer perceptron whose output dimension equals the number of channels in the input feature maps. Finally, the two output weight vectors are summed to obtain a channel attention map whose values are normalized to the range (0, 1). In short, channel attention can be expressed as the following formula [38]:

M_c(X) = \sigma\big(W_1(W_0(F^c_{avg}(X))) + W_1(W_0(F^c_{max}(X)))\big),

where σ is the sigmoid activation function, W_0 and W_1 are the MLP weights, F^c_{avg} and F^c_{max} denote the average-pooling and max-pooling operations for channel attention, respectively, and X is the input feature map.
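The channel attention computation can be sketched in NumPy as follows. Shapes, the reduction ratio, and helper names are illustrative assumptions in the CBAM style, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w0, w1):
    """CBAM-style channel attention: sigma(MLP(avgpool(X)) + MLP(maxpool(X))).
    x: (C, H, W) feature map; w0, w1: weights of the shared two-layer MLP."""
    avg = x.mean(axis=(1, 2))                       # (C,) global average-pooling
    mx = x.max(axis=(1, 2))                         # (C,) global max-pooling
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)    # shared MLP with ReLU
    weights = sigmoid(mlp(avg) + mlp(mx))           # (C,) values in (0, 1)
    return x * weights[:, None, None]               # reweight each channel

rng = np.random.default_rng(1)
C, r = 8, 2                                         # channels, reduction ratio
x = rng.standard_normal((C, 30, 30))
w0 = rng.standard_normal((C // r, C))               # squeeze C -> C/r
w1 = rng.standard_normal((C, C // r))               # excite C/r -> C
y = channel_attention(x, w0, w1)
print(y.shape)                                      # (8, 30, 30)
```

Since the sigmoid output lies in (0, 1), each channel is scaled down in proportion to its learned importance rather than amplified, which keeps the reweighting stable.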
Compared to the channel attention mechanism, the spatial attention mechanism [22] assigns weights to spatial information, which is complementary to channel attention. Specifically, the spatial attention mechanism first aggregates channel information by applying average-pooling and max-pooling operations along the channel axis to get two weight maps. The pooling operations here are also adaptive, automatically squeezing the input channel dimension to one. Then we apply a convolution operation to the two stacked maps, and the output map is normalized to the range (0, 1). The calculation of spatial attention is shown in the following equation [22]:

M_s(X) = \sigma\big(F_{conv}([F^s_{avg}(X); F^s_{max}(X)])\big),

where σ is the sigmoid activation function, F_{conv} denotes the convolution operation, F^s_{avg} and F^s_{max} denote the average-pooling and max-pooling operations for spatial attention, respectively, and X is the input feature map.
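The spatial branch can likewise be sketched in plain NumPy. The 7×7 kernel size and the loop-based same-padding correlation below are illustrative choices, not the paper's parameters:

```python
import numpy as np

def spatial_attention(x, kernel):
    """CBAM-style spatial attention: sigma(conv([avgpool; maxpool] along channels)).
    x: (C, H, W); kernel: (2, k, k) filter applied to the two pooled maps."""
    avg = x.mean(axis=0)                   # (H, W) channel-wise average-pooling
    mx = x.max(axis=0)                     # (H, W) channel-wise max-pooling
    stacked = np.stack([avg, mx])          # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    H, W = avg.shape
    conv = np.zeros((H, W))
    for i in range(H):                     # naive same-size correlation
        for j in range(W):
            conv[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    weights = 1.0 / (1.0 + np.exp(-conv))  # sigmoid -> (0, 1) spatial map
    return x * weights[None]               # broadcast over all channels

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 30, 30))
kernel = rng.standard_normal((2, 7, 7)) * 0.05
y = spatial_attention(x, kernel)
print(y.shape)                             # (8, 30, 30)
```

Every spatial location receives one shared weight across channels, the complement of the per-channel weighting above.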
The second deep neural network block is adopted to gather temporal features from the raw EEG signals at a more comprehensive and adequate level. Inspired by EEGNet [39], we adopt a deep neural network to enhance the performance. Specifically, three convolutional layers are applied in the second deep neural network block. The first convolutional layer is computed along channels to aggregate channel-wise information. The second convolutional layer is a depthwise convolution, which does not connect to all of the previous feature maps and can thus learn a unique spatial filter for each temporal filter. A separable convolution is chosen for the third convolutional layer, which synthesizes the kernels of each feature map along the time dimension. Every convolutional layer is followed by a batch normalization layer and an activation layer. The max-pooling layer and the dropout layer adopted here reduce the trainable parameters and alleviate over-fitting.
Finally, a feature fusion strategy is implemented by concatenating the outputs of the two ways row-wise. The fused features are then fed into a linear layer with softmax activation, which maps the classification outputs to values between 0 and 1 that sum to 1. Table II shows the specific parameter settings of the overall method. Floating-point operations (FLOPs) quantify the number of floating-point operations and serve as a measure of computational cost. We calculate the FLOPs of the whole method, and the result is 65.789M, which shows a competitively low computational cost compared to other deep learning methods [37], [40].
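The fusion-and-classification head can be sketched as a concatenation followed by a softmax linear layer. Feature dimensions and names below are illustrative assumptions, not the dimensions from Table II:

```python
import numpy as np

def fuse_and_classify(feat_a, feat_b, w, b):
    """Concatenate the two ways' feature vectors, then apply a
    softmax linear layer to obtain class probabilities."""
    fused = np.concatenate([feat_a, feat_b])   # row-wise concatenation
    logits = w @ fused + b
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()                     # entries in (0,1), summing to 1

rng = np.random.default_rng(3)
fa, fb = rng.standard_normal(64), rng.standard_normal(64)  # two ways' features
w, b = rng.standard_normal((2, 128)), np.zeros(2)          # 2 emotion classes
probs = fuse_and_classify(fa, fb, w, b)
print(round(float(probs.sum()), 6))            # 1.0 -- a valid distribution
```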

IV. RESULTS
In this section, the experimental results are presented. First we introduce the public datasets SEED and SEED-IV and the detailed experimental settings. Then the proposed method and other baseline models are scored on three datasets, namely, our emotional dataset collected in Section II and the two public datasets.

A. SEED and SEED-IV Datasets
The SEED [42], [44] dataset includes fifteen Chinese film clips (positive, neutral and negative emotions) and fifteen Chinese subjects (7 males and 8 females). The international 10-20 electrode positioning protocol was followed with 62 channels. The recorded EEG signals were down-sampled to 200 Hz and further band-pass filtered between 0.5-70 Hz. We follow the same experimental settings as previous research [45], [46]. Specifically, the first 9 trials of EEG data serve as the training set and the remaining 6 trials serve as the testing set. Then, the classification results corresponding to each session are obtained for each subject. Finally, the average classification accuracy and standard deviation over two sessions of all 15 subjects are calculated.
The SEED-IV [47] dataset contains seventy-two film clips which tend to induce happiness, sadness, fear or neutral emotions. A total of 15 subjects participated in the experiment. For each participant, 3 sessions were performed on different days, and each session contained 24 trials. The same electrode positioning protocol and data preprocessing as SEED were followed. We follow the same experimental settings as previous research [48]. Specifically, the first 16 trials are used for training and the remaining 8 trials (containing all labels) for testing. All three sessions are used for evaluation.

B. Evaluation Results
To evaluate the proposed method, our emotional dataset is first used for validation. Here we adopt the 10-fold cross-validation manner to evaluate the performance. Specifically, the 510 samples are divided into ten sets for each subject, and each time nine sets are used to train the models while the remaining set is used for validation. This process is repeated three times and the results are averaged. The confusion matrices across the ten folds for each subject are shown in Fig. 6.
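The 10-fold split used here can be sketched with NumPy alone. The seed and helper name are arbitrary; with 510 samples, each fold contains exactly 51 samples:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(510, 10)
for i, test_fold in enumerate(folds):
    # Each round: nine folds train, the held-out fold validates.
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(train) + len(test_fold) == 510
print([len(f) for f in folds])    # ten folds of 51 samples each
```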
To evaluate the effectiveness of the proposed model on our emotional dataset, we set up a group of comparative experiments with three baseline models. The implementation of these three models is as follows: PSD + SVM: The power spectral density (PSD) [41] features are extracted from the δ, θ, α, β, and γ bands of each EEG channel to form feature vectors. The classifier is a support vector machine.
DE + kNN: Feature vectors composed of differential entropy features are classified by the K-nearest neighbor (kNN) method, where K is equal to 5 [42].
DE + HCNN: The differential entropy features of each frequency band are rearranged into a two-dimensional map in a zero-fill manner to retain the information hidden in the electrode positions. Then, the feature maps are input into a six-layer CNN structure [43]. These baseline methods are tested on our emotional dataset under the same evaluation protocol. The results are shown in Table III. Taken as a whole, the method we propose achieves the highest accuracy among all the compared methods. The proposed method surpasses 90% accuracy for most subjects. From a comparative perspective, it is about 20% higher than the PSD+SVM method and 11% higher than the DE+kNN method. Although the DE+HCNN method, which uses a deep neural network structure, achieves a higher result than the other baselines, there is still a 1.4% gap in averaged accuracy compared with the proposed method. This indicates that our method can effectively perform emotion classification tasks. Table IV gives the validation results on the SEED and SEED-IV datasets. For the SEED dataset, Lu et al. [49] applied concatenation fusion, max fusion, and fuzzy integral to fuse multiple modalities and demonstrated that the fuzzy integral fusion method achieved an accuracy of 87.6%. Song et al. [45] proposed DGCNN and obtained a classification accuracy of 90.4%. Yang et al. [50] built a single-layer feedforward network (SLFN) with subnetwork nodes and achieved an accuracy of 91.5%. From Table IV it can be seen that our method achieves the best result among the five benchmarks. This result illustrates the generality of our proposed double way structure. Compared to conventional model structures, the proposed structure can adaptively extract features in different frequency bands and capture fine-grained temporal features. Our model also obtains the best results on the four-class SEED-IV dataset, which indicates that our model structure can be applied to multi-class emotion recognition scenarios.
We also investigate the effectiveness of each EEG rhythm by applying decomposed signals. We use FBSE to extract the five EEG rhythms from our emotional dataset. The proposed method is used for testing with the first way removed, because the attention mechanism is self-adaptive across frequency bands, which is not suitable for comparing the effects of the individual rhythms. The 10-fold cross-validation is adopted on our dataset. Also, we process the SEED dataset with FBSE and test it with the same model. The results are shown in Table V. It can be seen that among the five rhythms, the γ rhythm attains the best results while the δ and θ rhythms show relatively poor performance. These results indicate that emotional information is more embedded in the higher frequency bands, especially the γ band, rather than the lower frequency bands of EEG signals. Similar results have also been reported recently [45].

V. DISCUSSION
In this section, we make a comprehensive investigation of the proposed method. In Section V-A, we conduct a series of ablation experiments to demonstrate the necessity of the double way structure of our method. In Section V-B, we discuss the effect of residual modules on the classification performance.

A. Ablation Experiments
To demonstrate the necessity of the double way structure in our model, we conduct a series of ablation experiments. The specific settings of the ablation experiments are as follows: Model 1: Use the double way structure as in the proposed method.
Model 2: Only use the first way, namely, the brain network construction block and its deep neural network block. At the end of the structure, the linear layer takes the extracted features as input to obtain the classification result.
Model 3: Only use the second way, that is, only the deep neural network block serves as the feature extractor with raw EEG input, and the linear layer is set as in Model 2. We perform 10-fold cross-validations on our emotional dataset. The results are shown in Table VI. Among them, the averaged accuracy of Model 1 is 94.57%, which is higher than that of Model 2 and Model 3. The standard deviation of Model 1 is 1.86%, which is also better than the other two models. The above results indicate that the proposed double way structure can extract information from multiple perspectives and retain robustness across subjects, demonstrating its superiority.

B. Residual Modules
In this work, we employ residual modules in the deep neural network blocks to enhance the effectiveness of emotion classification. The residual modules are connected in sequence, and the appropriate number of modules needs to be determined. Thus, a series of experiments is conducted to find out how this number influences performance. We gradually increase the number of residual modules in the proposed method, and a 10-fold cross-validation is performed for each modified model on our emotional dataset. The results are shown in Fig. 7. The accuracy rises as the number of residual modules increases and reaches its best value when the number is three. However, the accuracy does not improve significantly when the number is further increased, and it decreases when the number reaches six. This indicates that residual modules with a certain depth can improve the overall performance, but an excessively deep structure degrades performance.

VI. CONCLUSION
In this work, we have proposed a double way deep neural network model based on brain networks. Because of the deep correlation between EEG signals gathered from different channels, it is important for a method to extract and merge effective information from different perspectives. Firstly, we transform the EEG signals into five frequency bands by wavelet transform and construct brain networks for each frequency band based on the Spearman correlation coefficients between EEG channels. Brain networks are good at reflecting the correlation between brain activity regions. Then we use deep neural networks with residual connections to extract features from the obtained brain networks. Secondly, considering that the brain network may lose temporal information of the EEG signals when computing the Spearman correlation coefficients, another deep neural network block is adopted. Finally, the features extracted by the two ways are concatenated and classified. The performance of the proposed model is verified by several comparative experiments on three datasets. A series of ablation experiments is conducted to prove the necessity of the particular structures. Moreover, the influence of the residual modules on performance is discussed, and we find the optimal number for the task of emotion recognition with the proposed method.
The research in this work can be applied to a variety of emotion recognition scenarios. To address the channel reduction and data quality issues associated with portable EEG acquisition devices in practical applications, further work will be conducted on small-sample problems and subject-independent problems. With the help of progressing algorithms and hardware, we expect BCI systems to be applied to real-world scenarios sooner and more effectively.