Emotion Recognition From Multi-Channel EEG Signals by Exploiting the Deep Belief-Conditional Random Field Framework

Automatic emotion recognition based on multi-channel electroencephalogram (EEG) signals has attracted much attention with the rapid development of machine learning methods. However, traditional methods ignore the correlation information between different channels and cannot fully capture the long-term dependencies and contextual information of EEG signals. To address these problems, this paper proposes a deep belief-conditional random field (DBN-CRF) framework, which integrates improved deep belief networks with glia chains (DBN-GC) and a conditional random field (CRF). In the framework, a raw feature vector sequence is first extracted from the multi-channel EEG signals with a sliding window. Then, parallel DBN-GC models are used to obtain the high-level feature sequence of the multi-channel EEG signals, and the CRF model generates the predicted emotion label sequence from the high-level feature sequence. Finally, a decision merge layer based on the K-nearest neighbor algorithm is employed to estimate the emotion state. To the best of our knowledge, this is the first attempt to apply the conditional random field methodology to deep belief networks for emotion recognition. Experiments are conducted on three publicly available emotional datasets: AMIGOS, SEED and DEAP. The results demonstrate that the proposed framework can mine inter-channel correlation information through the glia chains and capture the contextual information of EEG signals for emotion recognition. In addition, the classification accuracy of the proposed method is compared with several classical techniques, and the results indicate that the proposed method outperforms most of the other deep classifiers, which demonstrates the potential of the proposed framework.


I. INTRODUCTION
Emotion is a psychological and physiological state accompanied by cognitive and conscious processes, and it plays a very important role in human communication. In a human-computer interaction (HCI) system, the machine should be able to understand human affective behaviors. This ability helps it to understand human intentions accurately and to trigger appropriate feedback. Therefore, affective computing has gradually become a focus of emotion research in recent years. Specifically, affective computing is performed by a computational model that is established by modeling emotional states. It analyzes behavior and physiological signals, and establishes emotional interaction between human and computer based on the measured emotional states. As an indispensable part of affective computing, emotion recognition, which predicts and estimates the corresponding emotional state from the user's behavior and physiological response, has become increasingly attractive [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Eunil Park.
Emotion recognition can be implemented with external information such as voice intonation [2], [3], facial expression [4] and body posture [5], as well as with changes in the nervous system such as EEG signals. External information can easily be concealed deliberately, so emotion states identified from external features are not very objective. In contrast, recognition results based on changes in the nervous system are relatively objective. Therefore, much attention has been paid to EEG based emotion recognition [6]-[10].
In general, emotion recognition is a pattern recognition task, and various machine learning methods have been applied to EEG based emotion recognition. Specifically, Jadhav et al. integrated the gray-level co-occurrence matrix (GLCM) into a K-nearest neighbor (KNN) classifier for emotion recognition [11]. A support vector machine (SVM) was employed by Anas et al. to fuse band power, statistical features and higher order crossings of the EEG signal [12]. In addition, the multilayer perceptron (MLP) and the decision tree have also been used for the emotion classification task [13]. Deep learning models can extract high-level features from raw physiological data automatically. Thus, various deep learning models, such as deep belief networks [14], the convolutional neural network (CNN) [15] and the capsule network (CapsNet) [16], have been widely used in emotion classification tasks, and promising results have been achieved. Recently, hybrid deep learning frameworks have also shown great potential for EEG based emotion recognition, as they have for acoustic modeling [17] and health state diagnosis [18]. Yin et al. presented an ensemble classifier which extracted intermediate representations of physiological features in multiple modalities with different stacked autoencoders (SAEs) and merged the intermediate representations with a graph based fusion network. In reference [19], the C-RNN framework, which integrated a CNN and a recurrent neural network (RNN), was proposed to recognize emotion states from multi-channel EEG signals.
Although encouraging results have been achieved, traditional emotion recognition methods ignore the temporal dependencies of EEG signals. Similar to the speech signal, EEG is essentially a time series signal. Therefore, preceding and following units of an EEG signal are not independent, and there is some correlation between them. For example, it is very difficult for humans to enter a state of sadness immediately after they are happy. Thus, the transition and evolution of emotional states need to be taken into account when modeling EEG signals. This is especially necessary when the EEG signal is long. Unlike for speech signals, the long-term dependencies and contextual information of EEG signals have rarely been considered in previous studies. In addition, the correlation information between different channels is beneficial to emotion recognition when multi-channel EEG signals are used. Generally, features extracted from multi-channel EEG signals are utilized by a feature-level fusion based approach (Chen et al., 2015) or a decision fusion based strategy [20]. Neither the feature-level fusion based approach nor the decision fusion based strategy models the correlation information explicitly, so neither can utilize it effectively.
The conditional random field (CRF) is a discriminative model, described by Sutton et al. [21], for labeling and segmenting sequential data. It constructs the conditional probability distribution of the label sequence given the observation sequence. CRF provides two attractive advantages over other models [22]. Firstly, unlike Hidden Markov Models (HMMs), CRF does not require strict independence assumptions on the observations. Therefore, it can effectively utilize overlapping and long-range dependent information in the observations. Secondly, unlike SVM and neural networks (NN), CRF determines an optimal result according to the whole sequence. These advantages make CRF more suitable for processing sequential data. Inspired by this, the paper investigates the possibility of integrating CRF into a deep learning framework to achieve superior recognition performance. In reference [23], Banda and Engelbrecht proposed the deep continuous conditional recurrent neural field framework to model dimensional emotion from multiple input features.
In this paper, we propose a novel deep learning framework termed the deep belief-conditional random field. In the framework, a deep belief network with glia chains (DBN-GC) is employed instead of a DBN to learn high-level abstract features. Compared with DBN, DBN-GC can not only utilize time-domain or frequency-domain information of EEG signals, but also mine the inter-channel correlation information through the glia chains. Using deep models reduces the difficulty of extracting task-related features and helps to bypass the feature selection stage. Considering that the emotion state is a response to external events and evolves with the change of stimulus, EEG signals, as an expression of the emotional state, contain rich contextual and semantic information. Therefore, CRF is adopted to handle the evolution and long-term dependencies of the EEG signals for emotion recognition. The contribution of this paper thus lies in two aspects. Firstly, in addition to time-domain and frequency-domain features, this paper also attempts to incorporate inter-channel correlation information and contextual information of EEG signals into emotion recognition. Secondly, the deep belief-conditional random field framework is proposed to capture the inter-channel correlation information and contextual information of EEG signals.
The remainder of this paper is organized as follows. The basic background knowledge of DBN and CRF is elaborated in Section II. Section III provides the detailed description of the proposed DBN-CRF framework. The experimental settings and evaluation results are described in Section IV. Section V briefly concludes this work.

II. THEORETICAL BACKGROUND
In order to better understand the following DBN-CRF framework, in this section, we briefly introduce the theoretical background of relevant models, namely restricted Boltzmann machine (RBM), DBN, and CRF.

A. RBM AND DBN
As a kind of Boltzmann machine with a special topological structure, the RBM is a generative stochastic artificial neural network that can learn a probability distribution over its input data set [24]. An RBM consists of two parts: the visible layer and the hidden layer. Different nodes in the same layer are independent of each other, and symmetrical undirected full connection is adopted between the two layers. The architecture of a typical restricted Boltzmann machine can be seen in Figure 1. The contrastive divergence (CD) algorithm, which adopts Gibbs sampling to update the weights in the process of gradient descent, is utilized to train the RBM model.
VOLUME 8, 2020
The deep belief network was proposed by Hinton et al. [25] in 2006. As a deep learning model, it has attracted wide attention and has been successfully applied in object recognition, speech recognition and other fields. A deep belief network is a network with multiple hidden layers, composed of several restricted Boltzmann machines and a supervised output layer, as shown in Figure 2. Here $v$ represents the visible layer, and $h_i$ represents the i-th hidden layer. DBN training includes two steps: pre-training and fine-tuning.
In the pre-training stage, unsupervised layerwise training is employed to train the RBMs in each layer. The output of the hidden layer of the lower RBM is used as the input of the visible layer of the upper RBM. This training method can learn some non-linear complex functions by mapping data directly from input to output, which gives DBN a strong feature extraction ability. In the fine-tuning stage, supervised learning is used to train the network in the last layer, the errors between the actual output and the expected output are propagated backwards layer by layer, and the weights of the whole DBN are fine-tuned.
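The greedy layerwise pre-training described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes Bernoulli units, inputs scaled to [0, 1], full-batch updates and a single Gibbs step (CD-1), and the names `cd1_step` and `pretrain_stack` as well as the default hyper-parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr, rng):
    """One CD-1 update on a batch v0 of shape (batch, n_visible); updates W, b, c in place."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Approximate gradient: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))  # reconstruction error

def pretrain_stack(data, layer_sizes, epochs=50, lr=0.1, seed=0):
    """Greedy layerwise pre-training: each RBM is trained on the output of the one below."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hidden))
        b = np.zeros(x.shape[1])   # visible biases
        c = np.zeros(n_hidden)     # hidden biases
        for _ in range(epochs):
            cd1_step(x, W, b, c, lr, rng)
        layers.append((W, b, c))
        x = sigmoid(x @ W + c)     # hidden output becomes the next visible input
    return layers, x
```

Supervised fine-tuning with backpropagation would follow this stage and is omitted here.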

B. CONDITIONAL RANDOM FIELD
CRF is a discriminative probabilistic model which is often used to annotate or analyze sequence data, such as natural language [26], [27]. A CRF constructs the conditional probability distribution $P(Y \mid X)$, which represents a Markov random field over a group of output random variables Y given a set of input random variables X. Similar to Markov random fields, conditional random fields are undirected graph models, where vertices represent random variables and links between vertices represent dependencies among random variables. In a CRF, the distribution of the random variables Y is a conditional probability, and the given observations are the random variables X. In principle, the layout of the graph model of a conditional random field can be chosen arbitrarily. However, the common layout is the chain architecture, because efficient algorithms for training, inference and decoding exist for it. Among these, the linear-chain conditional random field is the most commonly used one.

III. DBN-CRF FRAMEWORK
To exploit the temporal dependencies of EEG signals and utilize the correlation information between different channels, we present the DBN-CRF framework for the EEG based emotion recognition task, and offer the details of the whole framework in this section.

A. OVERVIEW OF THE FRAMEWORK
We integrate the DBN-GC methodology and conditional random field together to establish a novel deep framework. The architecture consists of three key parts: parallel DBN-GCs, CRF based sequence learning model and decision merge layer, as shown in Figure 3.
The multi-channel EEG signals of the current sample are split into several segments by a specific time window, and a raw feature vector is extracted from the multi-channel EEG signals of each segment. Thus, a raw feature vector sequence is obtained. Each raw feature vector is fed into the corresponding DBN-GC to extract the high-level representations. Thus, a high-level feature sequence is obtained. Due to the existence of the glia chain, DBN-GC can mine inter-channel correlation information of the EEG signal. Then, the sequence learning unit, which adopts a CRF structure, models the contextual information of the high-level feature sequence and generates the predicted emotion label sequence $Y = \{y_1, y_2, \ldots, y_l, \ldots, y_n\}$. Finally, the decision merge layer based on the k-nearest neighbor algorithm is utilized to determine the emotion state of the observation sequence.
The proposed framework combines the strong ability of the DBN-GC model in processing intra-channel and cross-channel information with the ability of CRF in processing sequential data. Therefore, it is quite suitable for processing sequential signals such as EEG or speech.

B. INDEPENDENT DBN-GC LEARNING
DBN-GC is based on the deep belief network, which is composed of many stacked restricted Boltzmann machines (RBMs). In DBN-GC, each RBM has two node layers, as well as a group of glia cells represented by stars and linked into a chain structure, as shown in Figure 4. Each glia cell is also connected to a unit in the hidden layer of the RBM, as shown in Figure 5. There are connection weights between each glia cell and its corresponding node unit.
In the training process, the effect of each glia cell is applied directly to its hidden unit, so that the output of the hidden layer nodes is adjusted accordingly. Because the glia cells are connected, each glia cell can also transmit activation signals to other glia cells and regulate their glia effect. In order to simplify the calculation, all signals produced by glia cells propagate along a specific direction of the glia chain. This means that the signal is transmitted from the first glia cell towards the last one in the chain.
For an RBM with a glia chain, the output rule of the hidden units is modified as follows:

$$h_j = \sigma\left(h_j^{*} + \alpha g_j\right) \tag{1}$$

where $h_j^{*}$ is the input value of the j-th node of the hidden layer, $g_j$ is the glia effect value of the corresponding glia cell, $\alpha$ is the weight coefficient of the glia effect value, and $\sigma$ is the sigmoid function. The weight coefficient $\alpha$ is preset and controls the influence of the glia effect value on the hidden units. The input value $h_j^{*}$ is calculated as follows:

$$h_j^{*} = \sum_{i} W_{ij} v_i + c_j \tag{2}$$

where $W_{ij}$ is the connection weight between node i of the visible layer and node j of the hidden layer, $v_i$ is the state value of node i in the visible layer, and $c_j$ is the offset of node j. The glia effect value $g_j$ of node j is defined as:

$$g_j(t) = \begin{cases} 1, & \left(h_j^{*} > \theta \ \text{or} \ g_{j-1}(t-1) = 1\right) \ \text{and} \ d_j > T \\ \beta \, g_j(t-1), & \text{otherwise} \end{cases} \tag{3}$$

where $\theta$ is a predetermined threshold, $T$ is an unresponsive time threshold after activation, and $\beta$ represents the attenuation factor. Signals generated by activated glia cells are transmitted to the next glia cell. Whether a glia cell is activated depends on whether the input of the corresponding hidden unit reaches the predetermined threshold $\theta$ or whether the previous glia cell transmits a signal to it. At the same time, the difference $d_j$ between the last activation time and the current time must be greater than the threshold $T$. If the glia cell is activated, it sends a signal to the next glia cell; otherwise it produces no signal, and its glia effect gradually attenuates. Here $t$ represents the number of steps for which signals produced by activated glia cells have been transmitted in the specific direction, each step leading to the next adjacent glia cell.
Based on our previous work [28], this paper improves the way in which glia signals are transferred between glia cells. According to (3), the glia effect values of the glia cells attached to the hidden layer nodes of the RBM are updated as described in Table 1. In the process, a newly activated node, acting as the starting node, can only transmit a signal once. Thus, the newly activated node that transmits the signal in the current cycle becomes an old activation node which does not transmit signals in the next cycle.
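The glia mechanism described in this subsection can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the update procedure of Table 1: a glia cell is assumed to be activated (effect value set to 1) when its hidden input exceeds `theta` or the previous cell fired in the previous step, provided that more than `T` steps have passed since its last activation; otherwise its effect value decays by `beta`. The hidden output is assumed to be the sigmoid of the hidden input plus `alpha` times the glia effect. The function names and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glia_step(h_star, g, d, fired_prev, theta, T, beta):
    """Advance the glia chain by one step and return the new (g, d, fired)."""
    n = h_star.shape[0]
    g_new = np.empty(n)
    d_new = np.empty(n, dtype=int)
    fired = np.zeros(n, dtype=bool)
    for j in range(n):
        # A signal arrives if the previous cell in the chain fired one step ago.
        received = j > 0 and fired_prev[j - 1]
        if (h_star[j] > theta or received) and d[j] > T:
            g_new[j], d_new[j], fired[j] = 1.0, 0, True   # activation
        else:
            g_new[j], d_new[j] = beta * g[j], d[j] + 1     # attenuation
    return g_new, d_new, fired

def hidden_output(h_star, g, alpha):
    """Hidden activation modulated by the glia effect."""
    return sigmoid(h_star + alpha * g)

# Illustrative run: only the first hidden unit exceeds the threshold.
h_star = np.array([2.0, 0.1, 0.1, 0.1])
theta, T, beta, alpha = 1.0, 2, 0.5, 1.0
g = np.zeros(4)
d = np.full(4, T + 1)              # assume every cell is initially responsive
fired = np.zeros(4, dtype=bool)
for _ in range(3):
    g, d, fired = glia_step(h_star, g, d, fired, theta, T, beta)
h = hidden_output(h_star, g, alpha)
```

In this run the activation started by the first unit travels one cell per step along the chain, while the effect values of cells that fired earlier decay by `beta` at each step.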
Similar to DBN, the training process of DBN-GC includes two steps: pre-training and fine-tuning. The glia cell mechanism only acts on pre-training. The specific training process can be found in reference [29].

C. SEQUENCE LEARNING AND DECISION MERGE LAYER
After the high-level feature sequence has been extracted by the parallel DBN-GCs, CRF is employed to perform sequence learning and generate the predicted emotion label sequence. CRF adopts the form of an undirected graphical model, which combines the characteristics of the maximum entropy model and the hidden Markov model. Given the observation sequence $X = \langle X_1, X_2, \ldots, X_n \rangle$, which corresponds to the high-level feature sequence derived from EEG signals in this emotion recognition task, a conditional probability distribution over the label sequence $Y = \{y_1, y_2, \ldots, y_l, \ldots, y_n\}$, which corresponds to the emotion label sequence, is defined as

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{l=1}^{n} \sum_{k} u_k f_k(y_{l-1}, y_l, X, l)\right) \tag{4}$$

where $f_k(y_{l-1}, y_l, X, l)$ is either a state feature function or a transition feature function, and $u_k$ is the weight coefficient. $Z(X)$ is the normalization factor, which is defined as

$$Z(X) = \sum_{Y' \in Y^{*}} \exp\left(\sum_{l=1}^{n} \sum_{k} u_k f_k(y'_{l-1}, y'_l, X, l)\right) \tag{5}$$

where $Y^{*}$ is the set of all possible label sequences.
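To make the linear-chain model concrete, the following sketch computes the normalization factor with the forward algorithm and decodes the most probable label sequence with the Viterbi algorithm. It rests on simplifying assumptions that are not taken from the paper: the weighted state features at position l are collapsed into a score matrix `emit[l, y]` (for example, a linear map of the high-level feature vector), and the weighted transition features into a position-independent matrix `trans[y_prev, y]`. All function names are hypothetical.

```python
import itertools
import numpy as np

def sequence_score(emit, trans, labels):
    """Unnormalized log-score of one label sequence (the exponent of the CRF)."""
    score = emit[0, labels[0]]
    for l in range(1, len(labels)):
        score += trans[labels[l - 1], labels[l]] + emit[l, labels[l]]
    return score

def log_partition(emit, trans):
    """log Z(X) by the forward algorithm, in O(n * m^2) instead of O(m^n)."""
    alpha = emit[0].copy()
    for l in range(1, emit.shape[0]):
        # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emit[l, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) + emit[l]
    return np.logaddexp.reduce(alpha)

def viterbi(emit, trans):
    """Most probable label sequence under the same scores."""
    n, m = emit.shape
    delta = emit[0].copy()
    back = np.zeros((n, m), dtype=int)
    for l in range(1, n):
        cand = delta[:, None] + trans      # cand[i, j]: best path ending in i, then i -> j
        back[l] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[l]
    path = [int(delta.argmax())]
    for l in range(n - 1, 0, -1):
        path.append(int(back[l, path[-1]]))
    return path[::-1]

def log_prob(emit, trans, labels):
    """log P(Y | X) for a given label sequence."""
    return sequence_score(emit, trans, labels) - log_partition(emit, trans)
```

For short sequences the result of `log_partition` can be checked against a brute-force sum over all label sequences generated with `itertools.product`, which is a useful sanity test before training.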
In the decision merge layer, the k-nearest neighbor algorithm (KNN) is adopted to determine the emotion state of the observation sequence $X$. Given the predicted emotion label sequence $Y = \langle y_1, y_2, \ldots, y_n \rangle$, the emotion state $Y^{*}$ of the observation sequence $X$ can be defined as

$$Y^{*} = \arg\max_{1 \le v \le m} s_v \tag{6}$$

where $s_v$ is the number of elements belonging to the v-th emotion state in $Y$, and $m$ is the total number of emotion states.
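Read literally, the decision rule counts how many predicted labels fall into each emotion state and returns the state with the largest count. A minimal sketch of that reading follows; the function name `merge_decisions` is hypothetical, and resolving ties in favour of the smallest label is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter

def merge_decisions(label_sequence):
    """Return the emotion state that occurs most often in the predicted sequence."""
    counts = Counter(label_sequence)          # s_v for every state v that occurs
    best = max(counts.values())
    # Assumption: ties are resolved in favour of the smallest label.
    return min(label for label, count in counts.items() if count == best)
```

For example, a predicted sequence in which three of five segments are labelled 1 is assigned state 1.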

IV. EXPERIMENTS

A. DATABASE AND DATA PREPROCESSING
Datasets which are often used in EEG emotion experiments include AMIGOS [30], SEED [31], HR-EEG4EMO [32] and DEAP [33]. In this paper, DEAP, AMIGOS and SEED were adopted to validate the effectiveness of the proposed DBN-CRF framework. The DEAP dataset recorded various physiological signals of 32 subjects (16 males and 16 females) while they watched 40 music video excerpts with different emotional markers. Among them, the EEG signals were collected with a 32-channel electrode cap placed according to the international 10-20 system, at a sampling frequency of 512 Hz. After watching each video, all subjects were asked to rate the valence, arousal, liking and dominance of the video on a scale of 1-9.
The AMIGOS dataset recorded various physiological signals of 40 volunteers. In the first experiment, the 40 volunteers watched a set of 16 short affective video extracts from movies. In the second experiment, 37 of the participants of the previous experiment watched a set of 4 long affective video extracts from movies. Each participant rated each video in valence, arousal, dominance, familiarity and liking on a scale of 1-9, and selected basic emotions. Among them, the EEG data was recorded using the EMOTIV Epoc, with 14 channels and a sampling frequency of 128 Hz. The EEG channels according to the 10-20 system are: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4. For the DEAP dataset and the AMIGOS dataset, only the arousal, valence and dominance scales are employed for evaluation. All samples are divided into two classes according to the arousal, valence and dominance scales. The thresholds of arousal, valence and dominance are all 5.
The SEED dataset contains EEG signals and eye movement signals from 15 subjects (7 males and 8 females) recorded while they watched emotional movie clips. The dataset contains 15 movie clips, and each clip lasts about 4 minutes. Each subject performed the experiment three times with an interval of about one week. The EEG signals have 62 channels at a sampling rate of 1000 Hz. The data is then downsampled to 200 Hz. In this paper, only the 12 channels selected by reference [31] are employed. There are three emotional labels (−1 for negative, 0 for neutral and +1 for positive).
Among the physiological signals, only EEG signals are used to perform the emotion recognition task. A sliding window is employed to divide the raw EEG signals into several segments, and there is no overlapping region between adjacent windows. As an independent sample, each segment inherits the label of the corresponding original sample. The window lengths for the three datasets are 6 seconds, 5 seconds and 10 seconds, respectively. Thus, the total numbers of samples of the three datasets are 12800, 50320 and 14895. In the experiments, the 10-fold cross validation technique is adopted. For each volunteer, the corresponding samples are divided into 10 subsets, of which 9 subsets are assigned to the training set and the remaining one to the test set. This process is repeated 10 times.
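The non-overlapping windowing can be sketched as follows. The function name `segment_trial` and the array layout (channels by samples) are assumptions made for illustration, and any samples left over at the end of a trial are simply dropped, which the text does not specify.

```python
import numpy as np

def segment_trial(eeg, fs, window_s):
    """Split one trial of shape (channels, samples) into non-overlapping windows.

    Returns an array of shape (n_segments, channels, window_samples).
    """
    win = int(round(fs * window_s))
    n_seg = eeg.shape[1] // win
    trimmed = eeg[:, : n_seg * win]            # assumption: drop the incomplete tail
    return trimmed.reshape(eeg.shape[0], n_seg, win).transpose(1, 0, 2)

# Illustrative trial: 32 channels, 60 s at 128 Hz (both values are assumptions here).
trial = np.arange(32 * 128 * 60, dtype=float).reshape(32, 128 * 60)
segments = segment_trial(trial, fs=128, window_s=6)
labels = np.full(len(segments), 1)             # every segment inherits the trial label
```

As a consistency check on the sample counts, and assuming 60-second trials, a 6-second window gives 10 segments per trial, so 32 subjects and 40 trials yield 32 × 40 × 10 = 12800 samples, which matches the first figure reported above.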

B. FEATURE EXTRACTION
In order to verify the effectiveness of the proposed model, three different types of features are employed in this paper. These features describe the salient information related to emotion states in the time domain, frequency domain and time-frequency domain of EEG signals, respectively [33]-[37], and Table 2 shows the detailed description.

C. PERFORMANCE OF THE INDEPENDENT DBN-GC
To analyze the performance of the improved DBN-GC, statistical measures, power features and HHS features are all used to train independent DBNs and independent DBN-GCs. Considering the similarity between DEAP and AMIGOS, these two datasets are used in the experiment. Each type of features is employed to train a DBN and a DBN-GC, respectively. This means that a DBN and its corresponding DBN-GC are trained by the same features and they have the same hyper-parameters. Furthermore, the parameters of a DBN and its corresponding DBN-GC, such as learning rate, are all set to the same value. Specifically, DBN1 and DBN-GC1 are trained by statistical measures of the DEAP dataset. DBN2 and DBN-GC2 are trained by power features of the DEAP dataset, and DBN3 and DBN-GC3 are trained by HHS features of the DEAP dataset. Similarly, DBN4 and DBN-GC4 are trained by statistical measures of the AMIGOS dataset. DBN5 and DBN-GC5 are trained by power features of the AMIGOS dataset, and DBN6 and DBN-GC6 are trained by HHS features of the AMIGOS dataset. The above models implement the same emotion recognition task, and accuracy and F1-score are adopted to evaluate the recognition performance of these models.
In order to obtain the appropriate combination of hyperparameters, we set different combinations of parameters to perform the experiment, and the parameter combination with the minimal recognition error is adopted. The hyperparameters of DBNs and DBN-GCs are shown in Table 3 and Table 4. The training epoch is set to 200 and the batch size is 40.
The recognition results are shown in Figure. 6 and Figure.
The above results show that the glia chain can effectively improve the deep learning performance of the DBN model. Specifically, nodes in the same hidden layer can transmit information to each other through the glia chain. Thus, the correlation information between adjacent hidden nodes, which is beneficial to the pattern recognition task, can be captured by the DBN-GC model. When the DBN-GC model is applied to multi-channel EEG based emotion recognition, the glia chain can mine and utilize inter-channel information, which is related to emotion states and helps to judge the emotional state. In addition, the results also show that, among the three types of features, the frequency-domain features contain the most salient information regarding emotion states.

D. RESULTS OF THE DBN-CRF FRAMEWORK
To verify the effectiveness of the integrated framework in Figure 3, two kinds of methods are utilized for emotion recognition. One uses the DBN-GC-CRF model just described in Figure 3. The other uses the DBN-CRF model, which replaces the DBN-GCs of the framework with several parallel DBNs.
In the experiments, only the frequency-domain features are employed to construct the DBN-CRF framework and the DBN-GC-CRF framework. The hyper-parameters of DBN and DBN-GC in the two frameworks constructed with the DEAP dataset are the same as those of DBN2 and DBN-GC2 in Table 3, respectively. Similarly, DBN5 and DBN-GC5 in Table 4 are applied to the emotion recognition task based on the AMIGOS dataset. When the DBN-CRF framework and the DBN-GC-CRF framework are constructed with the SEED dataset, the numbers of nodes in the three hidden layers of DBN and DBN-GC are 260, 180 and 45, respectively. A user-dependent model is built to perform leave-one-out cross validation, and a user-independent model is built to perform leave-one-subject-out cross validation. In leave-one-out cross validation, for each volunteer, the corresponding samples are divided into 10 subsets, of which 9 subsets are assigned to the training set and the remaining one to the test set. This process is repeated 10 times. In leave-one-subject-out cross validation, one subject set is assigned to the test set, and the remaining subject sets are assigned to the training set. This process is repeated until each subject set has been used as the test set.
Tables 5-7 show the results of leave-one-out cross validation on all datasets, and Tables 8-10 show the results of leave-one-subject-out cross validation on all datasets. As can be seen from these tables, no matter which dataset is used, the DBN-GC-CRF framework outperforms the DBN-CRF framework in all dimensions. In addition, the recognition results of the DBN-GC-CRF framework are better than those of DBN-GC.
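The leave-one-subject-out protocol used for the user-independent model can be sketched with a small generator. The function name `leave_one_subject_out` and the representation of the data as one subject identifier per sample are assumptions made for illustration.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) once per subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # All samples of one subject form the test set; the rest form the training set.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Illustrative use with five samples from three subjects.
splits = list(leave_one_subject_out([0, 0, 1, 1, 2]))
```

Because no sample of the held-out subject appears in the training set, this protocol measures generalization to unseen subjects, whereas the per-volunteer 10-fold split measures performance within a subject.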

E. PARAMETER ANALYSIS
As mentioned in Section III-A, each sample is split into several sections. Then, a raw feature vector is extracted from each section and fed into the corresponding DBN or DBN-GC. In order to investigate the effect of the section number on recognition performance and find an appropriate section number, experiments with different section numbers are carried out. The experiments are performed on the DEAP dataset. The section number ranges from 1 to 20. When the section number is set to 1, the DBN-CRF framework is simplified to the DBN model and the DBN-GC-CRF framework is simplified to the DBN-GC model. As can be seen from Figure 8, when the section number increases from 1 to 11, the mean recognition accuracy (MRA) on arousal as well as the MRA on valence increases rapidly, whether DBN or DBN-GC is used. The results indicate that the contextual information of the EEG signal contains salient information related to emotion states, which is beneficial to the improvement of emotion recognition. Meanwhile, the proposed framework can effectively combine the feature extraction ability of DBN with the sequence data processing ability of CRF. Once the section number exceeds 11, the two MRAs increase very slowly and remain basically stable. With the increase of the section number, the computational complexity of the proposed framework increases dramatically. Therefore, it is appropriate to set the section number to 11. In this case, the recognition rates of DBN-CRF and DBN-GC-CRF in the arousal dimension are 0.7428 and 0.7613, respectively, and the recognition rates of DBN-CRF and DBN-GC-CRF in the valence dimension are 0.7479 and 0.7702, respectively. In addition, regardless of the number of sections, the DBN-GC-CRF framework outperforms the DBN-CRF framework in both the arousal and valence dimensions. This indicates that the glia chain of DBN-GC is also effective in the integrated framework.
Considering the similarity between DBN-CRF and DBN-GC-CRF, hold-out validation is utilized to evaluate the two models for further comparison. The section number is set to 11. All samples are randomly divided into training, validation and test sets, with respective proportions of 8:1:1 for each label. Figure 9 shows the results on the validation set. With the same data, the validation accuracy of DBN-GC-CRF increases faster in both the arousal and valence dimensions, and the accuracy of DBN-GC-CRF is higher than that of DBN-CRF with a similar network structure at the same epoch.

F. RESULTS COMPARISON
Finally, the proposed method is compared with other deep learning methods which also use the DEAP dataset, and the detailed results are given in Table 11. As shown in Table 11, the highest recognition accuracies in both the arousal and valence dimensions are achieved by the DBN-GC-CRF model, with the exception of the results of reference [20]. The mean recognition accuracy of 0.7702 on valence provided by the DBN-GC-CRF model is lower than that of 0.8141 reported in reference [20]. The reason may be that reference [20] trains a subject-related model, which is only utilized to recognize the samples belonging to the same subject. In addition, the F1-scores of the DBN-GC-CRF model are superior to the values of 0.5830 and 0.5630 reported in reference [33] with the same kind of features.

V. CONCLUSION
In this paper, we have proposed an ensemble deep learning framework, which integrates parallel DBN-GCs and CRF, for emotion classification based on multi-channel EEG signals. Specifically, a raw feature vector sequence is first obtained from the multi-channel EEG signals of the current sample by dividing the EEG signals into several segments. Then, each raw feature vector is fed into the corresponding DBN-GC model to generate the high-level representations. Thus, a high-level feature sequence is obtained. According to the high-level feature sequence, the CRF model generates the predicted emotion label sequence. Finally, the KNN based decision merge layer is adopted to determine the emotion state. The proposed method is evaluated on the AMIGOS, SEED and DEAP datasets. The results indicate that the proposed method outperforms most of the previous studies that use the same EEG datasets. The main contributions of this work lie in two aspects. Firstly, the DBN-GC can mine inter-channel or cross-modal correlation information of EEG signals through the glia chain, which includes salient information related to emotion states. Secondly, the CRF based model structure can capture the long-term dependencies and contextual information of EEG sequences, which is beneficial to the improvement of recognition performance. In particular, to the best of our knowledge, this is the first attempt to apply the conditional random field methodology to deep belief networks.