Deep Learning Model with Adaptive Regularization for EEG-based Emotion Recognition using Temporal and Frequency Features

Because EEG signal acquisition is non-invasive and portable, it is convenient for a wide range of applications. Emotion recognition based on a Brain-Computer Interface (BCI) is an important active BCI paradigm for recognizing a person's inner state. Emotion recognition has been studied extensively, but most studies rely heavily on staged, complex handcrafted EEG feature extraction and classifier design. In this paper, we propose a hybrid multi-input deep model with convolutional neural networks (CNNs) and bidirectional Long Short-Term Memory (Bi-LSTM). The CNNs extract time-invariant features from raw EEG data, and the Bi-LSTM allows long-range lateral interactions between features. First, we propose a novel hybrid multi-input deep learning approach for emotion recognition from raw EEG signals. Second, in the first layers, we use two CNNs with small and large filter sizes to extract temporal and frequency features from each raw 2-s epoch of 62-channel EEG and merge them with the differential entropy of EEG bands. Third, we apply an adaptive regularization method over each parallel CNN layer to take into account the spatial information of the EEG acquisition electrodes. The proposed method is evaluated on two public datasets, SEED and DEAP. Our results show that our technique can significantly improve accuracy compared with a baseline in which no adaptive regularization is used.


I. INTRODUCTION
EEG is a complex time-series signal used by researchers to recognize brain activation, and EEG signals are acquired and stored in clinical and psychiatric settings. In particular, over the last ten years, EEG signals have been used to reflect human brain activity in the field of brain-computer interfaces (BCI). Automatic recognition of human emotion is an important research area in human-computer interaction [1]. An ideal BCI can detect the arousal or valence of an emotional state from spontaneous EEG signals without explicit user feedback [2].
A variety of machine learning techniques and models are used for EEG-based emotion recognition. Generally, the methods can be described by three procedures: preprocessing, feature extraction, and classification. The preprocessing procedures prepare the data prior to any further analysis. There are two primary technical approaches to feature extraction: traditional machine learning and deep learning methods. The traditional approach relies on handcrafted characteristics [3]. Because EEG signals are highly non-stationary, extracting such features is a difficult task that requires expert knowledge. More recent studies used multivariate statistical analysis techniques in the frequency, time-frequency, and nonlinear domains to capture handcrafted features that can represent EEG characteristics [4], [5], [6], [7], [8]. For example, Wei et al. [4] evaluated the performance of different common feature extraction methods. They investigated the Discriminative Graph regularized Extreme Learning Machine with differential entropy features, which achieved the best average accuracy. Fu et al. [5] proposed combining multiple features to form high-dimensional features. Generally, traditional recognition methods combine handcrafted features with shallow models such as support vector machines (SVM) and ensemble learning [9], [10].
Deep learning-based methods have recently become dominant in improving the efficiency and generalizability of EEG applications. It is claimed that deep learning methods address shortcomings of recent approaches, such as the requirement for a large amount of data [11]. Therefore, deep learning networks have gained substantial interest from researchers [12], [13]. The convolutional neural network (CNN) is one of the most important deep networks because of its excellent performance in signal processing [14]. CNNs allow the extraction of higher-level features from the EEG signal. Recent articles demonstrate that CNNs may achieve promising performance owing to their automatic feature detection. In addition, recurrent neural networks (RNNs), a family of neural networks, are used for processing variable-length sequential data [15]. However, deep RNN architectures can suffer from vanishing or exploding gradients. A recent study showed that long short-term memory (LSTM) networks outperform traditional RNNs substantially [16].
In this paper, we introduce a multi-input deep model with CNNs and bidirectional Long Short-Term Memory (Bi-LSTM). Bi-LSTM is a two-way LSTM that combines a forward LSTM path with a reverse path. In our model, each individual parallel input consists, in the first layers, of two CNNs with small and large filters. In addition, for each segment, we extract five frequency bands of the EEG signals. Since experimental results indicate that the beta and gamma bands (roughly beta: 14-30 Hz, gamma: 31-50 Hz) are suitable for EEG-based emotion classification, we calculate the differential entropy for these two frequency bands. Finally, the CNN features are integrated with the differential entropy features. We apply the adaptive regularization method over each parallel CNN layer to take into account the spatial information of the EEG acquisition electrodes.
In summary, the major contributions of this paper are:
• A multi-input deep learning CNN+BiLSTM architecture that classifies raw signals from EEG channels together with the differential entropy of two frequency bands.
• An approach that extracts temporal EEG patterns and frequency components simultaneously, which improves the performance of our model.
• Adaptive adjustment of the model's regularization parameters to make use of seemingly unnecessary channels and prevent overfitting.
The rest of this paper is organized as follows. In Section II, we describe the datasets, the proposed deep learning model, and the training algorithm. In Section III, we present the results of our analysis. Finally, Sections IV and V present the discussion and the conclusion, respectively.

II. MATERIALS AND METHODS
In this section, we first describe the SEED and DEAP datasets and the pre-processing procedures. Then, we introduce our model in detail. Finally, we explain the training parameters used in our model.

A. DATASETS
1) SEED
SEED is a dataset of electroencephalogram signals recorded during an emotion experiment. The dataset was collected from 15 subjects (7 males and 8 females) while they watched 15 Chinese movie clips. The dataset focuses on three specific emotions: positive, neutral, and negative. There are 45 trials in the database. Each scalp EEG signal was recorded with 62 channels according to the standard 10-20 system and downsampled to a 200 Hz sampling rate. In addition, a 0-75 Hz bandpass filter was applied to remove physiological noise [17].

2) DEAP
The DEAP dataset includes 40 channels of physiological signals from 32 participants (16 males and 16 females) recorded while they watched 40 one-minute music videos. For each of the 40 trials per subject, the signals were recorded as 40-channel data. The first 32 channels are EEG signals, and the last 8 channels are peripheral physiological signals. After watching each video, participants rated it in terms of arousal, valence, like/dislike, dominance, and familiarity. Valence is the scale used in this work. It ranges from one (low) to nine (high), and the scale is divided into three parts to construct positive, neutral, and negative labels consistent with the SEED dataset. An emotion is labeled negative if the valence rating is smaller than 3, neutral if the rating is greater than 3 and smaller than 7, and positive if the rating is greater than 7 [18].
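The three-way split of the valence scale can be sketched as a small helper. This is only an illustration of the thresholds stated above; the function name `valence_to_label` is hypothetical, and because the stated rule does not say how ratings of exactly 3 or 7 are handled, the sketch assumes they fall into the neutral class.

```python
def valence_to_label(valence: float) -> str:
    """Map a DEAP valence rating (1 to 9) to one of three emotion labels."""
    if not 1.0 <= valence <= 9.0:
        raise ValueError("valence must lie in [1, 9]")
    if valence < 3.0:
        return "negative"
    if valence > 7.0:
        return "positive"
    # Assumption: ratings of exactly 3 or 7 are not covered by the stated
    # thresholds; this sketch folds them into the neutral class.
    return "neutral"


ratings = [1.5, 3.0, 5.0, 7.0, 8.2]
labels = [valence_to_label(r) for r in ratings]
```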

B. PREPROCESSING
To remove noise and artifacts, the EEG data are preprocessed with a 0.3-50 Hz bandpass filter. After noise removal, we extract the EEG segments corresponding to the movie clips and remove the remaining parts, including the hint of start, self-assessment, and rest periods. After removing these extra fragments, the EEG signals of one experiment comprise 3300 clean epochs, as explained in [19]. Finally, the EEG data in each channel are divided into 2-s segments without overlap. Each segment is standardized to zero mean and unit variance.
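The segmentation and standardization steps can be sketched with NumPy as follows. This is a minimal illustration, not the authors' pipeline: it assumes the signal has already been band-pass filtered, that standardization is done per channel within each segment, and that an incomplete trailing segment is dropped. The function name `segment_and_standardize` is hypothetical.

```python
import numpy as np


def segment_and_standardize(eeg, fs=200, seg_seconds=2.0):
    """Split (n_channels, n_samples) EEG into non-overlapping segments.

    Returns an array of shape (n_segments, n_channels, seg_len) in which
    every channel of every segment has zero mean and unit variance.
    """
    seg_len = int(round(fs * seg_seconds))
    n_channels, n_samples = eeg.shape
    # Assumption: samples that do not fill a whole segment are discarded.
    n_segments = n_samples // seg_len
    trimmed = eeg[:, : n_segments * seg_len]
    segments = trimmed.reshape(n_channels, n_segments, seg_len).transpose(1, 0, 2)
    mean = segments.mean(axis=-1, keepdims=True)
    std = segments.std(axis=-1, keepdims=True)
    # Guard against division by zero for flat channels.
    return (segments - mean) / np.maximum(std, 1e-12)
```

With a 200 Hz sampling rate, each 2-s segment holds 400 samples per channel.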

C. ARCHITECTURE DESIGN
As shown in Figure 1, our architecture consists of four sequential parts: representation learning, feature reduction, sequence learning, and classification. In the representation learning part, the model extracts features from each raw single-channel EEG epoch, both manually and automatically. The second part reduces the number of features to avoid heavy computation. The third part is sequence learning, which captures temporal information with the Bi-LSTM block. The last part is the classification unit, which labels the emotions as positive, negative, or neutral. The details of each part are given below.

1) Representation Learning
This structure uses the spatial, temporal, and frequency information of EEG signals. For each raw single-channel 2-s EEG epoch, we apply two CNNs with small and large filter sizes in the first layers. The idea of using filters of different sizes comes from signal processing, where there is a trade-off between temporal and frequency precision and resolution. Namely, the frequency resolution is determined by the number of samples in the time series, whereas the temporal resolution is determined by the data sampling rate and does not depend on the length of the time window. Thus, small and large filter sizes are best suited to extracting temporal and frequency features, respectively [20], [21]. In our proposed model, every two 1D-CNN layers are followed by a max-pooling layer that down-samples the input representation. A 1D-CNN has three main parameters: the filter size, the number of filters, and the stride size. We adjusted these parameters to extract temporal and frequency information from the EEG. Namely, since we want to analyze the signal in the 2-50 Hz range, one cycle spans between 500 ms (at 2 Hz) and 20 ms (at 50 Hz), so the window must be at least that long. To cover two cycles, the time segment needs at least 8 points (the EEG sampling rate is 200 Hz and the minimum window for capturing two cycles is 2 × 20 ms, so 200 × 2 × 0.020 = 8 points are required). We cannot use three or more cycles, because doing so fails to catch higher-frequency transient behavior. In our dataset, there are 2-s EEG epochs $\{x_{ch1}, x_{ch2}, \ldots, x_{ch62}\}$ from the 62-channel EEG. For each EEG channel, we use the large-filter and small-filter CNNs to extract the features $f_n^l$ and $f_n^s$ from $x_n$.
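The window-length arithmetic above can be checked with a few lines of Python. The helper names `cycle_ms` and `min_window_points` are hypothetical and serve only to reproduce the numbers quoted in the text.

```python
import math


def cycle_ms(freq_hz):
    """Duration of one cycle at the given frequency, in milliseconds."""
    return 1000.0 / freq_hz


def min_window_points(freq_hz, fs_hz, cycles=2):
    """Smallest number of samples that spans `cycles` cycles at `freq_hz`."""
    return math.ceil(fs_hz * cycles / freq_hz)


longest_cycle = cycle_ms(2)              # one cycle at 2 Hz
shortest_cycle = cycle_ms(50)            # one cycle at 50 Hz
points_50hz = min_window_points(50, 200) # two cycles at 50 Hz, sampled at 200 Hz
```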

2) Differential Entropy Feature
Entropy is a quantity that measures the disorder of a system. Entropy features have been applied successfully in EEG research. We use an efficient frequency-domain feature based on the concept of entropy, called differential entropy. The original signal is decomposed into frequency bands, and the differential entropy is computed for each band to extract useful information. Previous research [5] indicated that the beta and gamma bands (beta: 14-30 Hz, gamma: 31-50 Hz) are suitable for EEG-based emotion classification. We calculate the differential entropy of these two frequency bands for the 62 channels, which yields a 124-dimensional feature vector for each segment. Differential entropy is defined as follows:
$$h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right) dx = \frac{1}{2}\log\left(2\pi e \sigma^2\right)$$
where the time series $X$ follows the Gaussian distribution $N(\mu, \sigma^2)$. Finally, a moving average filter with a window length of 20 s was applied to smooth the feature sequence [17].
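A minimal NumPy sketch of the band-wise differential entropy feature is given below. It relies on the closed form for a Gaussian signal, 0.5 · log(2πeσ²), and it isolates each band by zeroing FFT bins outside the band, which is an assumed stand-in for whatever band-pass filter was actually used. The names `band_limit`, `differential_entropy`, and `de_features` are hypothetical, and the 20-s moving-average smoothing is not included.

```python
import numpy as np

# Band edges in Hz, as quoted in the text.
BANDS = {"beta": (14.0, 30.0), "gamma": (31.0, 50.0)}


def band_limit(segment, fs, low, high):
    """Keep only FFT components in [low, high] Hz (simple band-pass stand-in)."""
    n = segment.shape[-1]
    spectrum = np.fft.rfft(segment, axis=-1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = (freqs >= low) & (freqs <= high)
    return np.fft.irfft(spectrum * mask, n=n, axis=-1)


def differential_entropy(x):
    """Differential entropy along the last axis under a Gaussian assumption."""
    var = np.var(x, axis=-1)
    return 0.5 * np.log(2.0 * np.pi * np.e * np.maximum(var, 1e-12))


def de_features(segment, fs=200):
    """(n_channels, n_samples) segment -> vector of n_channels * n_bands values."""
    feats = [
        differential_entropy(band_limit(segment, fs, low, high))
        for low, high in BANDS.values()
    ]
    return np.concatenate(feats)
```

For a 62-channel segment and two bands, `de_features` returns 124 values, matching the feature dimension stated above.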

3) Feature Reduction
This unit is a neural network that helps improve the speed and reliability of computation. At the end of feature extraction, the outputs of the CNN parts are concatenated with the differential entropy features and passed to the feature reduction network, which selects features and reduces the dimensionality of the feature space.

4) Sequence Learning
Since EEG signals contain temporal dynamic information, we apply two layers of Bi-LSTM cells to extract temporal information. Let $F = \{f_n\}_{n=1}^{N}$ be the set of extracted features, where $f_n$ corresponds to the $n$-th epoch of emotion data, with labels $Y = \{y_n\}_{n=1}^{N}$. As shown in Fig. 2, a Bi-LSTM can be regarded as a forward LSTM (Fw-LSTM) and a backward LSTM (Bw-LSTM). Assume that $F_f = F$ represents the forward target feature sequence and $y_n$ represents the final prediction value. We calculate the prediction sequence as follows:
$$y_n = \mathrm{LSTM}(F_{f_n})$$
where LSTM represents a function that processes sequences of features using the input gates, forget gates, and output gates in the memory block. The output of the LSTM layer, $h_t$, is computed as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\sigma$ is the logistic sigmoid function, $x_t$ is the input feature vector at time step $t$, and $i$, $f$, $o$, and $c$ are the input gate, the forget gate, the output gate, and the cell activation vectors, respectively. The $W$ terms are weight matrices and the $b$ terms are bias vectors. Similarly, we calculate the backward sequence by feeding the input in reverse order, starting from $n = N$. We then obtain the hidden vector $h$ by concatenating $h_n$ from the forward and backward paths. The output of the sequence learning unit is based on the hidden vector $h$ as follows:
$$y_n = W_h h_n + b_h$$
where $W_h$ and $b_h$ are the weight matrix and bias vector, respectively.
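To make the forward and backward passes concrete, the following NumPy sketch runs a single-layer bidirectional LSTM over a feature sequence. It assumes the standard LSTM gate equations without peephole connections and stacks the four gate weight blocks into one matrix; these choices, the parameter layout, and the names `lstm_forward` and `bilstm` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(seq, W, U, b):
    """Run an LSTM over seq of shape (T, D); returns hidden states (T, H).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order input gate, forget gate, output gate, cell candidate.
    """
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    outputs = []
    for x_t in seq:
        z = W @ x_t + U @ h + b
        i = sigmoid(z[:H])            # input gate
        f = sigmoid(z[H:2 * H])       # forget gate
        o = sigmoid(z[2 * H:3 * H])   # output gate
        g = np.tanh(z[3 * H:])        # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def bilstm(seq, fw_params, bw_params):
    """Concatenate forward and time-reversed backward hidden states."""
    h_fw = lstm_forward(seq, *fw_params)
    # The backward path reads the sequence from the last step to the first;
    # its outputs are flipped back so both paths align per time step.
    h_bw = lstm_forward(seq[::-1], *bw_params)[::-1]
    return np.concatenate([h_fw, h_bw], axis=-1)
```

Each time step thus receives a hidden vector of size 2H that combines context from both directions.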

5) Classification
Finally, we predict the label of each emotion EEG segment with a softmax layer. The softmax layer is used as the output layer, and the probability of each class is computed as follows:
$$P(c \mid f(x_n, c)) = \frac{\exp(f(x_n, c))}{\sum_{c'} \exp(f(x_n, c'))}$$
where $P(c \mid f(x_n, c))$ is the probability that the EEG segment $x_n$ is labeled as class $c$, given the real value $f(x_n, c)$ calculated by our model. We train the model by minimizing the cross-entropy error [22]:
$$E = -\sum_{n=1}^{N} \sum_{c} \mathbb{1}[y_n = c] \log P(c \mid f(x_n, c))$$
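The softmax output and cross-entropy error can be sketched in NumPy as below. The sketch averages the loss over the batch, which is an assumption (a sum over segments differs only by a constant factor), and the function names are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class (integer labels)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))
```

For three classes with equal scores, every class receives probability 1/3 and the loss equals log 3.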

6) Adaptive regularization
The adaptive regularization method aims to identify features with high variance and to reduce the weight distribution in the layers. In other words, the adaptive regularization strategy is a trade-off between the information from different electrodes with varying spatial resolution, so that both seemingly unnecessary and necessary electrodes are preserved in the emotion recognition problem. The algorithm consists of three alternating steps: training the model, calculating the variance of the features extracted by the individual convolution layers, and updating the regularization parameters. Therefore, the adaptive model iteratively changes the hyperparameters between the adaptive layer and the preceding layer.
In this section, we describe the adaptive regularization approach in its general formulation. The goal is to adjust the regularization parameters of the CNN layers adaptively. Specifically, the output of the representation learning unit has length $R = 64$ (2 × 32 CNN features). To minimize the distance, we use the standard deviation of this output, where $\Sigma_i$ denotes the total distribution of the $i$-th pair of large and small CNN units. We apply L1 and L2 penalty terms to the kernel weights of each parallel convolutional layer. In the general form of regularization methods, $l(\cdot, \cdot)$ is a loss function and $\lambda_i$ is the regularization parameter of the $i$-th pair of large and small CNN units. When $f$ is linear and the loss function is the square loss, $\|f\|_k$ is the norm of the coefficients of the linear model. In this paper, $\lambda$ is the tuning parameter and is adapted so that each $\lambda_i$ lies in $[0, 1)$, where $\lambda_0$ is a constant smaller than one.
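Two ingredients of this scheme, the per-electrode spread of the learned features and a penalty weighted by per-electrode regularization parameters, can be sketched as follows. This is a hedged illustration only: pooling the standard deviation over samples and over the 64 feature dimensions, and combining the L1 and squared L2 norms as a plain sum, are assumptions, and the rule that maps the spread to each parameter is deliberately not reproduced here. The names `electrode_spread` and `weighted_penalty` are hypothetical.

```python
import numpy as np


def electrode_spread(features):
    """Standard deviation of the learned features for each electrode.

    features: (n_samples, n_electrodes, R), with R = 64 in the text.
    Assumption: the spread is pooled over samples and feature dimensions.
    """
    return features.std(axis=(0, 2))


def weighted_penalty(kernels, lambdas):
    """Sum over electrodes of lambda_i * (L1 norm + squared L2 norm).

    kernels: one weight array per parallel CNN unit (one per electrode).
    lambdas: one regularization parameter per unit, each in [0, 1).
    """
    total = 0.0
    for w, lam in zip(kernels, lambdas):
        total += lam * (np.abs(w).sum() + np.square(w).sum())
    return float(total)
```

In a training loop, the penalty would be added to the classification loss, and the parameters would be refreshed from the spread between training rounds.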

D. TRAINING ALGORITHM
Our training algorithm is based on iterative backpropagation to adjust the network weights (see Algorithm), while applying the adaptive regularization method over each parallel CNN layer to take into account the spatial information of the EEG acquisition electrodes. The algorithm trains the model iteratively until performance has not improved for a fixed number of training epochs. Then, the variance of the CNN representations of all available channels is estimated on the training data and used to adjust the regularization penalty, $\lambda$, adaptively. In line 5 of the Algorithm, the weights of the whole model are updated and fine-tuned, and the learning rate decreases after each fine-tuning step.
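The alternating structure described above can be outlined as a framework-independent skeleton. This is a sketch of the control flow only, not the paper's Algorithm: the three callables, the number of rounds, and the function name `adaptive_training` are hypothetical placeholders supplied by the caller, and the default learning rate and decay factor are the values quoted in the training parameters.

```python
def adaptive_training(train_until_plateau, estimate_spread, update_lambdas,
                      lambdas, lr=0.0005, lr_decay=0.5, rounds=3):
    """Alternate between training and refreshing regularization parameters.

    train_until_plateau(lambdas, lr): trains (or fine-tunes) the model until
        validation performance stops improving for a fixed number of epochs.
    estimate_spread(): returns the per-channel variance statistic computed
        from the CNN representations on the training data.
    update_lambdas(lambdas, spread): returns the adapted parameters.
    """
    history = []
    for _ in range(rounds):
        train_until_plateau(lambdas, lr)
        spread = estimate_spread()
        lambdas = update_lambdas(lambdas, spread)
        history.append((lr, list(lambdas)))
        # The learning rate is reduced before the next fine-tuning round.
        lr *= lr_decay
    return lambdas, history
```

Passing the steps in as functions keeps the skeleton independent of any particular deep learning library.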

E. TRAINING PARAMETERS
We train the model with the ADAM optimizer, using a learning rate of lr = 0.0005, a learning rate decay of 0.5, and the framework's default decay rates for the first and second moments (β1 = 0.9 and β2 = 0.999). The model is trained on the training set with a mini-batch size of 120. For the 1D-CNNs in the representation learning part, we choose the small and large window sizes according to the guideline given in the Representation Learning section. For instance, the size of the large window in the first layer is set to kernel_size = 25 (because it can capture several cycles of the higher frequencies, i.e., 50 Hz), and its stride is set to 5 (because the sampling rate is $F_s = 200$ and the important higher-frequency transient band related to emotion lies below 50 Hz, we choose a minimum step of $F_s/4 = 50$ to capture cycles of the higher frequencies). Using a stride can reduce the number of parameters and the computational cost while keeping the model complexity comparable. The small window size is set to kernel_size = 10 to detect temporal EEG patterns. In each block, we use several sequential layers to extract higher-level features from the EEG signal. At the end of feature extraction, the outputs of the CNN parts are concatenated with the differential entropy features to obtain 4092 features (62 × 2 × 32 + 124 = 4092, where 62 × 2 × 32 is electrodes × small and large blocks × 1D-CNN features, and 124 is the differential entropy of the two frequency bands extracted from the 62 channels), which are then passed to the feature reduction part, a two-layer neural network, for feature selection. In the sequence learning part, the number of Bi-LSTM units is set close to the output size of the feature reduction part, which is 256.
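The feature count quoted above can be verified with a short calculation. The helper `conv1d_output_length` is a hypothetical addition that assumes a convolution without padding; it shows how many positions the large filter (kernel size 25, stride 5) visits in a 400-sample epoch.

```python
def conv1d_output_length(n_in, kernel_size, stride=1):
    """Output length of a 1D convolution, assuming no padding."""
    return (n_in - kernel_size) // stride + 1


n_electrodes = 62
n_blocks = 2            # small-filter and large-filter CNN per electrode
n_cnn_features = 32     # features produced by each CNN block
n_de_features = 62 * 2  # differential entropy: 62 channels x 2 bands

total_features = n_electrodes * n_blocks * n_cnn_features + n_de_features
large_filter_steps = conv1d_output_length(400, kernel_size=25, stride=5)
```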

III. RESULTS
The performance of the proposed method on the SEED and DEAP datasets is presented in this section. The classification confusion matrices obtained on the test set of the SEED dataset are illustrated in Figure 3. Rows correspond to the target emotion and columns to the label predicted by our model, and each cell reports the 2-s EEG emotion segments in that combination. The diagonal of the table gives the percentage of epochs that our model classified correctly.
In Figure 4, we show the effect of adaptive regularization on the final contribution of each electrode. In this figure, each bar represents $\Sigma$ for one electrode, as described in Section II-C6. Comparing the two charts shows that a set of electrodes such as {T7, T8, TP7, FC5, FPZ, F7, F3, FC3, O1, ...} has more influence on the output of the activation layer. The adaptive regularization strategy optimizes the use of information from different electrodes with varying spatial resolution, preserving both seemingly unnecessary and necessary electrodes in the emotion recognition problem. From these results, we can see that our method is able to smooth the weights of each parallel CNN layer by considering the spatial information of the EEG acquisition electrodes.
To evaluate the static functional connectivity between channels and track its changes over time [23], we calculate the spatial correlation between all the features extracted by the parallel network from a multi-channel signal, $f = \{f_1, f_2, \ldots, f_c\}$, where $c$ is the number of channels and $f_c = [f_c^s, f_c^l]$ combines the small-filter and large-filter features of channel $c$. For this purpose, we randomly select two samples from the SEED dataset and calculate the correlation between features with and without the regularization technique. The result of this analysis is presented in Fig. 5. We also compare the accuracy of our proposed method with state-of-the-art methods evaluated with 10-fold cross-validation on the SEED and DEAP datasets. The result of this comparison is presented in Table 1.
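A channel-by-channel correlation matrix of this kind can be computed with NumPy as sketched below. Taking the absolute value of the Pearson correlation, so that entries lie between 0 and 1, is an assumption made for this illustration, and the function name `channel_correlation` is hypothetical.

```python
import numpy as np


def channel_correlation(features):
    """Absolute Pearson correlation between per-channel feature vectors.

    features: (n_channels, R) array holding one feature vector per channel.
    Returns an (n_channels, n_channels) symmetric matrix with ones on the
    diagonal.
    """
    return np.abs(np.corrcoef(features))
```

Computing this matrix for the same sample with and without regularization gives two matrices that can be compared entry by entry.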

IV. DISCUSSION
We propose a deep learning model that combines differential entropy features with features automatically extracted by CNNs to recognize emotion from raw EEG. The proposed model makes the features of each class more conspicuous and efficient. The proposed CNN structure is able to extract temporal EEG patterns and frequency components simultaneously. In addition, according to the recent literature on emotion recognition, the beta and gamma bands are more closely related to emotion, so we selected only the differential entropy features of these two bands. We also added an adaptive regularization technique to make use of all channels, including seemingly unnecessary ones, and to prevent overfitting of the model. Based on the specification of the input, we designed a hybrid parallel neural network of 1D-CNNs and handcrafted features. The parallel architecture allows the identification of the most statistically significant features and automates the extraction of spatial information related to the channel position in EEG signals. Consequently, we apply an adaptive regularization method to exploit the time-frequency information in each channel and to remove the bias related to the correlation of the channel positions. Therefore, the method extracts a wide range of features related to EEG signal patterns and overcomes the influence of channel correlation. According to Figure 5, the correlation between some electrodes is significantly higher than between others, which means that some brain regions may be more synchronized with each other. It also suggests that some areas may be responsible for generating a particular emotion. The correlation values between channels lie between 0 and 1, where 0 means that $f_i$ and $f_j$ are completely uncorrelated and 1 means that they are completely positively correlated. Comparing the two correlation matrices in Figure 5 shows that the pairwise correlation between features is reduced after regularization. This shows that the adaptive regularization strategy establishes a trade-off between the information from functional connectivity and from individual regions, and that it focuses on finding salient features for emotion recognition.
Table 1 gives a brief summary and performance comparison of our method and other emotion recognition methods. These methods are deep learning models and use the same evaluation method. Zheng and Lu [24] extracted differential entropy features from multichannel EEG data and applied a Deep Belief Network (DBN) for emotion recognition on the SEED dataset. They examined the weight distributions of the trained DBN models and selected the more important frequency bands and channels according to the tendency of the network. Finally, they trained the DBN with a profile of just 12 channels and evaluated the results. Recently, Chen Wei et al. [25] used a simple recurrent units network with ensemble learning and obtained an accuracy of 78.8% on the SEED dataset, which is 2.74% lower than our result. They applied highly complex systems for classification and feature engineering. Zhongmin et al. [26] proposed a multichannel EEG emotion recognition method based on phase-locking value (PLV) graph convolutional neural networks (P-GCNN). Their P-GCNN model consists of PLV-based graph signal construction, a graph convolution operation, a graph pooling operation, ReLU activation, and a fully connected layer. This method led to an average accuracy of 84.35% on SEED and 73.31% on DEAP. Although their performance is better than that of our proposed method, they need a long EEG time series before the PLV can be calculated. Therefore, their model is computationally complex, and its suitability for online computation is uncertain. Moreover, they use handcrafted features, which may reduce the generalizability of the model.
The methods presented in [27], [28] have certain limitations, including the number of emotion states considered, which results in higher accuracy. In general, accuracy for binary class prediction is higher than for multi-class prediction because the outcome is less complex. In [25], [29], [30], features were computed in multiple domains (time, wavelet, and frequency) from EEG signals, and a set of stable features was identified for emotion recognition. Although some studies have reported higher performance, it is important to note that they extract a large number of features and perform extra pre-processing, such as reducing the number of classes or removing subjects from the datasets. In addition, selecting hand-engineered features and electrodes may not be efficient for developing an emotion recognition system, and it decreases model generalizability.
In our model, the feature extraction, feature reduction, and classification components are merged and computed in parallel processes, which enhances computational efficiency. Compared with other models, our model achieves comparable classification performance. Moreover, the results show that, with the proposed adaptive regularization technique, all channels are utilized, which may prevent overfitting and make the model more generalizable with better performance.

V. CONCLUSION
The architecture proposed in this paper is based on a combination of deep learning and differential entropy. We used small and large filters to capture heterogeneous features representing the temporal and frequency resolution of the signals. We also used adaptive regularization to remove bias from the model and to select more effective electrodes for classifying emotions. The proposed method is parallelizable and hence computationally efficient. For future studies, we suggest applying this model to independent datasets or using leave-one-out cross-validation to assess the generalizability of the model.

EBRAHIM KHALILI is a graduate student in electrical engineering at Tarbiat Modares University, Tehran, Iran.

BENTOLHODA AYATI is an assistant professor in the Biomedical Engineering department at Islamic Azad University of Tehran - Central Branch, Tehran, Iran. She received her Ph.D. in Biomedical Engineering from Loughborough University, UK, in 2016. Her research area is signal and image processing. She also has experience in biomedical data collection and image reconstruction.

MARZIEH AYATI is an assistant professor in the Computer Science department at the University of Texas Rio Grande Valley, Texas, USA. She received her Ph.D. in Computer Science from Case Western Reserve University, Ohio, USA, in 2018. She works on the development of algorithms that integrate various types of data to improve the understanding of diseases. Her research also applies machine learning to biological data. Through her research, she has gained significant experience in analyzing noisy, incomplete, high-dimensional, and heterogeneous datasets, which has enabled her to develop a broad range of theoretical and practical skills in algorithm development and implementation, data integration, data management, data mining, machine learning, and statistical considerations.
VOLUME 4, 2016