Expression-EEG Based Collaborative Multimodal Emotion Recognition Using Deep AutoEncoder

Emotion recognition plays many valuable roles in people’s lives against the background of artificial intelligence technology. However, most existing emotion recognition methods achieve poor recognition performance, which hinders their adoption in practical applications. To alleviate this problem, we propose an expression-EEG interaction multi-modal emotion recognition method using a deep autoencoder. First, a decision tree is applied as an objective feature selection method. Then, based on the facial expression features recognized by sparse representation, the solution vector coefficients are analyzed to determine the facial expression category of the test samples. After that, a bimodal deep autoencoder (BDAE) is adopted to fuse the EEG signals and facial expression signals. The third layer of the BDAE supplies the extracted features for supervised training. Finally, a LIBSVM classifier is used to complete the classification task. We carried out experiments on a constructed video library to verify the proposed emotion recognition method. The results show that the proposed method can effectively extract and integrate high-level emotion-related features in EEG and facial expression signals. Both the recognition rate for discrete emotion states and the average emotion recognition rate are improved, with the average emotion recognition rate reaching 85.71%. Overall, the emotion recognition ability is greatly improved.


I. INTRODUCTION
Emotion is a very complex mental state or process of human beings, which reflects human perceptions and attitudes and plays an important role in communication between people [1]. Research on emotion recognition is highly valuable for human-computer interaction [2]. The environment of a human-computer interaction system is complex and dynamic. On many occasions the system must coordinate its operations with people, so a system with emotional interaction ability can adapt better to such environments. If the human-computer interaction system can quickly and accurately identify human emotions, the interaction process will be more friendly and natural [3]. EEG signals have a strong ability to characterize changes in human brain state, so emotion recognition based on EEG signals has become a popular research direction.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan M. Gorriz.
With the development of multimedia and human-computer interaction technology, automatically recognizing people's emotional states has become highly significant [4]. For example, while a person is playing a game, the difficulty can be adjusted according to the identified emotional state. In a negative emotional state, the system provides a simple and cheerful game for relaxing, which is easier to pass. In a positive emotional state, it increases the difficulty, which makes the game experience more challenging [5]. In addition, emotions can be regulated through the recommendation of music and video: when a person is watching videos or listening to music, the system can alleviate the effects of negative emotions by pushing positive multimedia content. With the development of artificial intelligence, research on robot technology has also advanced rapidly, and robots are now applied in many fields, such as household appliances, food and other industries. As robots enter all aspects of daily life, people place higher requirements on them. In terms of humanization, it is hoped that a robot can think and feel like a human and can recognize a person's emotional state to achieve better human-computer interaction. A guaranteed emotion recognition rate is one of the key technologies for putting emotional robots into practical application, and it is also of great significance to their realization [6].
In the medical field, emotion recognition is also very important. For patients, the quality of the emotional state has a great impact on the course of the disease and the corresponding treatment management [7]. Although current medical research has no definite evidence of a relationship between emotional states and diseases, positive emotional states are in practice conducive to recovery and to physical and mental health. When patients are in a negative state and do not cooperate with treatment, recovery is usually very slow, and the disease may even be exacerbated. Therefore, detecting the emotional state is important for patients. Keeping patients in a positive emotional state as much as possible can promote recovery. Meanwhile, emotion recognition is also helpful for the prevention and treatment of depression and other diseases [8].
In the field of public safety, emotion recognition also has practical value. The polygraph is an important tool for public security personnel interrogating suspects [9]. While a criminal suspect is giving a statement, the corresponding emotional state can be judged from physiological signals, which serves as a basis for assessing the truth of the statement. In addition, emotion recognition can be used in teaching management. By having students wear a harmless portable device with an emotion recognition function, teachers can detect students' emotional states in real time. When a student's mood is relatively poor or the student even shows extreme behavior, the teacher can communicate with the student and their parents in time to avoid tragedy [10].
Since the theory of affective computing was proposed, its related theories and analytical methods have developed rapidly, and emotion recognition has attracted considerable attention from researchers [11]. In this paper, a multi-modal emotion recognition method based on expression-EEG interaction using a deep autoencoder is proposed, and experiments on the constructed video library verify the effectiveness of our method.

II. RELATED WORKS
The features used in traditional emotion recognition methods are mainly external features such as facial expressions, body postures and speech [12]. No sensors need to be worn to obtain these signals, which makes them easy to acquire at low cost. [13] used people's facial expressions in videos for emotion recognition. The authors applied the Naive Bayes algorithm to recognize seven different emotions: happiness, surprise, anger, disgust, fear, sadness, and neutrality. The emotion recognition accuracy across the facial expressions of different people is 64.3%, while testing on the same person achieves an accuracy of 93.2%, indicating that facial expressions can be used to recognize emotions effectively. [14] combined acoustic features and speech content for emotion recognition based on speech signals. A support vector machine-belief network architecture is used to distinguish six different emotions (anger, disgust, fear, neutrality, sadness and surprise), and the recognition accuracy reaches 93%. These experimental results confirm the effectiveness of speech signals for emotion recognition. However, these signals are relatively sensitive and easily affected by the subjective factors of the subjects. The system cannot make a correct judgment when a subject's true inner emotions are inconsistent with their external performance. Meanwhile, external performance is only one part of emotional expression and cannot convey the rich emotions of human beings. Physiological changes are governed by a person's central nervous system and can therefore reflect the emotional state more objectively. Therefore, the use of human physiological signals for emotion recognition is currently a novel international research trend in affective computing.
Currently, researchers usually use Electroencephalogram (EEG), Electromyogram (EMG), Galvanic Skin Response (GSR), Electrooculogram (EOG), Electrocardiogram (ECG), blood pressure, blood volume pulse (BVP), epidermal temperature, eye movement signals and other physiological signals for emotion recognition research [15]. Emotion recognition methods based on physiological signals can achieve high accuracy, and the collected data objectively reflect the emotional state of the subjects [16]. According to the signals adopted, existing emotion recognition methods can be roughly divided into EEG signal-based, facial video feature-based, and multi-modal emotion recognition.

A. EMOTION RECOGNITION METHOD BASED ON EEG SIGNALS
At present, there are many studies on EEG-based emotion recognition, which have proved the effectiveness of EEG signals for emotion recognition. Scholars mainly focus on EEG feature extraction, feature selection and classification model selection. EEG-based emotion recognition methods are generally divided into two categories [17], [18]. One is supervised learning, which is trained and tested with emotion labels, such as KNN and the Fisher algorithm. The other is unsupervised learning, in which the sample data do not contain labels: all samples are automatically divided into different categories according to a certain strategy, and the corresponding labels are then assigned. Unsupervised learning methods usually include K-means, fuzzy C-means (FCM), and self-organizing maps.
For EEG-based emotion recognition, [19] proposed using a deep learning network (DLN) to discover unknown feature correlations between input signals that are crucial for the learning task. The DLN is implemented with a stacked autoencoder (SAE) using a hierarchical feature learning approach. The input features of the network are the power spectral densities of 32-channel EEG signals from 32 subjects. [20] studied 40 patients with Parkinson's disease (PD) from China and 40 healthy controls, using 24 black-and-white portraits and 24 music excerpts designed to express happiness, sadness, fear and anger. Four tests were used to evaluate the participants' executive functions: the trail making test (TMT), the clock drawing test (CDT), the semantic verbal fluency test (VFT) and the digit span test (DST). The experimental results showed that the ability to recognize angry faces was impaired in the PD group, which may be related to executive dysfunction, while the group performed better in recognizing musical emotions. To further improve the accuracy of CNN-based modules, [21] devised a multi-column structured model whose decision is produced by a weighted sum of the decisions of the individual recognizing modules. The authors applied the model to EEG signals from the DEAP dataset and demonstrated its improved accuracy. [22] proposed a real-time EEG emotion recognition hardware system architecture that performs binary and quaternary classification with a multiphase convolutional neural network (CNN) algorithm on a 28-nanometer technology chip and a field programmable gate array (FPGA). Sample entropy, differential asymmetry, the short-time Fourier transform and channel reconstruction methods are used for emotional feature extraction. EEG signal features can be divided into time domain features, frequency domain features and time-frequency features. Time domain features are mainly statistical features of the signal. Frequency domain features mainly include frequency band energy and higher order spectra (HOS).
According to the asymmetric effects of emotions, differential asymmetry features and ratio asymmetry features can be extracted. Experimental results show that multiple feature selection strategies perform better than the univariate method. In addition, the emotion recognition accuracy of EEG features obtained by more advanced extraction algorithms is higher than that of band spectrum features obtained by traditional methods.

B. EMOTION RECOGNITION METHOD BASED ON FACIAL VIDEO FEATURES
Emotions are caused by specific scenes, and the process is usually very complicated. Emotion recognition needs to consider the specific situation and cannot simply rely on external performance while ignoring the content attributes and semantics of emotions [23]. Eye-tracking signals provide various indicators of eye activity. They let a system observe the subtle subconscious behavior of users, which provides an important reference for the user's current activity context [24]. The technique of recording an individual's eye movement is called eye tracking. On the basis of this technique, one can determine where the subject is looking at a given moment and also obtain the eye's movement trajectory over a period of time. [25] proposed a novel multi-pattern correlation network for emotion recognition, which aims to achieve more powerful and accurate detection by combining information from the audio and video channels. This method first preprocesses the audio and visual signals for feature extraction, obtaining the Mel spectrogram, which is treated as an image, and representative frames from the visual segments. The Mel spectrogram and the representative frames are then fed to a CNN to extract features. [26] studied the reduced facial emotion recognition and expression associated with autism spectrum disorder (ASD). On this basis, they evaluated the acceptability, feasibility, and initial efficacy of an attention-correcting intervention designed to reduce facial emotion recognition (FER) deficits.

C. MULTIMODAL EMOTION RECOGNITION METHOD
At present, multi-modal fusion emotion recognition methods are being introduced to further improve the accuracy of emotion recognition. A multi-modal fusion model obtains emotion recognition results by fusing different physiological signals. [27] analyzed the performance of a convolutional neural network that uses autoencoder units for emotion recognition in human faces. The combination of the two deep learning techniques boosts the performance of the classification system. A total of 8000 facial expressions from the Radboud Faces Database were used for both training and testing. The results showed that five of the eight analyzed emotions reached accuracy rates higher than 90%. [28] had 644 outpatients (57.6% female, average age 31.31) complete an emotion recognition (ER) task and self-report their depression severity. The study used 10 items specific to unipolar depression identified through factor analysis. 34.6% of the participants had clinical depression, while all other participants had clinical anxiety or other unspecified emotional disorders. In this large diagnosed clinical sample of adults with emotional disorders, the authors found that ER accuracy declined with increasing age, especially for negative emotions such as sadness and fear. [29] proposed a method of facial expression recognition during speech. The method uses a hybrid deep network architecture to perform multi-modal fusion of EEG signals and facial expressions. The experimental results show that the emotion recognition rate of the multi-modal fusion model is higher than that of each single-modal model. [30] analyzed the epoch data from the EEG sensor channels and tested a variety of machine learning techniques, including the support vector machine (SVM), K-nearest neighbor, linear discriminant analysis, logistic regression, and decision tree, with and without principal component analysis (PCA) for dimension reduction. Grid search on a Spark cluster is also used to tune the hyperparameters of each tested model and shorten the execution time. [31] proposed a speech emotion recognition algorithm based on a stacked sparse deep model. The algorithm builds on the autoencoder, the denoising autoencoder and the sparse autoencoder. The first layer uses a denoising autoencoder to learn hidden features whose dimension is larger than that of the input features. The second layer uses a sparse autoencoder to learn sparse features. Finally, a wavelet kernel sparse SVM classifier is used to classify the features. However, how to further improve the emotion recognition rate of multi-modal fusion remains worth considering.
To address the problems of single modality and low accuracy in the above studies, a multi-modal emotion recognition method based on expression-EEG interaction using a deep autoencoder is proposed. The innovations are summarized as follows:
1) To accurately obtain facial expression features, the proposed method recognizes facial expression features based on sparse representation and uses the orthogonal matching pursuit algorithm to analyze the solution vector coefficients and determine the facial expression category of the test sample.
2) Different from single-modal recognition, the proposed method uses a bimodal deep autoencoder (BDAE) to fuse EEG signals and facial expression signals, and inputs the fused features to a supervised LIBSVM classifier to obtain an emotion classification model.

III. OVERALL FRAMEWORK AND FEATURE EXTRACTION

A. FRAMEWORK OF THE PROPOSED METHOD
EEG signals have a strong ability to characterize changes in human brain state, so EEG-based emotion recognition has become a common research approach. Besides the EEG signals, an external signal, the facial expression signal, is added for emotion recognition. The aim is to explore in depth the ability of EEG signals and facial expression signals to distinguish and characterize different emotions, and to combine the two signals through different model fusion strategies (including deep neural networks) so as to establish a multi-modal emotion recognition model that combines internal neural patterns with external subconscious behaviors. The framework is shown in FIGURE 1 [32]. Meanwhile, to study how stable the emotion-expressing ability of EEG and facial expression signals is over time, a BDAE is used for model fusion. The accuracy of the emotion recognition model is effectively improved after the fusion of EEG signals and facial expression signals. After training, the middle layer (i.e., the third of the five BDAE layers) provides the extracted features, which are sent to the LIBSVM classifier for supervised training to obtain the emotion classification model. In this way, the emotion recognition ability of the model is significantly improved, which benefits from the high-order emotion-related features that the deep neural network extracts from the two signals.

B. EEG SIGNAL FEATURE SELECTION
To alleviate the shortcomings of the traditional EEG signal feature selection methods, a decision tree-based EEG signal feature selection method is proposed to achieve both objectivity of feature selection and high classification accuracy.

1) PRINCIPLE OF DECISION TREE
The decision tree is one of the most classic and commonly used algorithms in data mining. Compared with other data mining algorithms, the decision tree has three advantages [33]: (1) it is very easy to understand; (2) training it does not require researchers to understand the background knowledge of the training data; (3) its classification accuracy is relatively high. Given these three advantages, researchers often use decision tree algorithms for data classification studies.
The decision tree uses a ''divide and conquer'' greedy algorithm, outputting a tree-like structure. The structure of the decision tree is shown in FIGURE 2. Each non-leaf node stores a split attribute, branches are divided according to different attribute values of the split attribute, and each leaf stores a category label. Completing the decision tree algorithm requires two steps: constructing the decision tree; pruning the complete decision tree to form a simplified decision tree.

2) EEG SIGNAL FEATURE SELECTION BASED ON DECISION TREE
The EEG signal is a non-stationary, non-linear, high-dimensional, weak physiological signal, and the decision tree has three advantages: it is easy to understand, it requires no relevant background knowledge during training, and it achieves high classification accuracy. Therefore, decision trees can not only select features objectively from high-dimensional and complex EEG signals but also improve classification accuracy [34]. The framework of the proposed decision tree-based EEG signal feature selection method is shown in FIGURE 3. The method is divided into the following five steps:
Step 1. Input the EEG signal feature vector obtained from the feature extraction operation;
Step 2. Apply the C4.5 decision tree algorithm to divide the input EEG signal feature vector into dominant features and non-dominant features;
Step 3. Discard the non-dominant features from the input EEG signal feature vector;
Step 4. Recombine the dominant features of the input EEG signal feature vectors to obtain new feature vectors;
Step 5. Output the restructured feature vector produced by the feature selection operation.
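Step 2 is now discussed in detail. Let $X$ be the set of EEG signal feature vectors after feature extraction. $X$ contains $N$ samples, denoted as

$$X = \{x_1, x_2, \cdots, x_N\}, \quad x_i \in \mathbb{R}^{d}$$

Regarding the $d$-dimensional features as $d$ attributes, the candidate attribute set $A$ can be denoted as

$$A = \{A_1, A_2, \cdots, A_d\}$$

The decision tree is then built on this feature sample set.
Initially, the feature vector sample set $X$ is treated as the training set, and the information entropy of $X$ is calculated:

$$\mathrm{Info}(X) = -\sum_{j=1}^{l} p_j \log_2 p_j$$

In the formula, $p_j$ is the probability that a sample of training set $X$ belongs to category $j$, and $l$ is the number of categories.
For a given attribute $A_k$, the information entropy of training set $X$ is:

$$\mathrm{Info}_{A_k}(X) = \sum_{j=1}^{v} \frac{|X_j|}{|X|}\,\mathrm{Info}(X_j)$$

and the corresponding information gain is

$$\mathrm{Gain}(A_k) = \mathrm{Info}(X) - \mathrm{Info}_{A_k}(X)$$

For the given attribute $A_k$, the split information of training set $X$ is:

$$\mathrm{SplitInfo}_{A_k}(X) = -\sum_{j=1}^{v} \frac{|X_j|}{|X|} \log_2 \frac{|X_j|}{|X|}$$

In the formulas, $v$ is the number of branches produced by $A_k$, and $X_j$ is a subset of $X$: all samples in $X_j$ belong to branch $j$, and no sample in $X - X_j$ belongs to branch $j$.
For the given attribute $A_k$, the information gain ratio of training set $X$ is:

$$\mathrm{GainRatio}(A_k) = \frac{\mathrm{Gain}(A_k)}{\mathrm{SplitInfo}_{A_k}(X)}$$

In the process of constructing a complete tree, the attribute that maximizes the information gain ratio of the training set is selected as the split attribute, the branches are then divided according to its different values, and each branch is processed recursively. When pruning the complete tree, the post-pruning method is used to form a simplified decision tree.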
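The gain-ratio criterion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute is assumed to be already discretized (C4.5 itself also handles continuous features through thresholds), and the names `entropy`, `gain_ratio`, `attr_a` and `attr_b` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy -sum_j p_j * log2(p_j) of a list of category labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of one discrete attribute (assumed already discretized).

    values[i] is the attribute value of sample i, labels[i] its category.
    """
    total = len(labels)
    # Partition the labels by attribute value: one subset per branch.
    branches = {}
    for value, label in zip(values, labels):
        branches.setdefault(value, []).append(label)
    # Entropy of the training set after splitting on the attribute.
    cond = sum(len(b) / total * entropy(b) for b in branches.values())
    gain = entropy(labels) - cond
    # Split information penalizes attributes with many small branches.
    split = -sum(len(b) / total * math.log2(len(b) / total) for b in branches.values())
    return gain / split if split > 0 else 0.0

# Toy data: attribute "a" separates the two categories perfectly, "b" does not.
labels = ["pos", "pos", "neg", "neg"]
attr_a = ["hi", "hi", "lo", "lo"]
attr_b = ["hi", "lo", "hi", "lo"]
ratios = {"a": gain_ratio(attr_a, labels), "b": gain_ratio(attr_b, labels)}
best = max(ratios, key=ratios.get)  # attribute chosen as the split attribute
```

In this toy case the attribute with the larger gain ratio is the one chosen as the split attribute, which mirrors the selection rule used when the complete tree is constructed.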
The split attributes of all nodes in the simplified decision tree are defined as the dominant attributes (dominant features). The features that remain after removing the dominant features from the initial features are the non-dominant features. The initial feature vectors X are then reorganized according to the dominant features to form new feature vectors. These recombined feature vectors are the output of the decision tree-based feature selection method.
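Steps 3 to 5 of the selection method can be illustrated as follows. The sketch assumes a pruned tree is already available as a nested dictionary; this representation, the keys `attr`, `children` and `label`, and the toy feature values are all hypothetical choices made for illustration.

```python
def split_attributes(node):
    """Collect the split-attribute indices stored in all non-leaf nodes."""
    if "label" in node:  # a leaf stores only a category label
        return set()
    found = {node["attr"]}
    for child in node["children"].values():
        found |= split_attributes(child)
    return found

def select_features(samples, dominant):
    """Keep only the dominant features, preserving their original order."""
    keep = sorted(dominant)
    return [[row[k] for k in keep] for row in samples]

# Hypothetical pruned tree over 5 features: it splits on feature 3, then feature 0.
tree = {
    "attr": 3,
    "children": {
        "low": {"label": "negative"},
        "high": {
            "attr": 0,
            "children": {
                "low": {"label": "neutral"},
                "high": {"label": "positive"},
            },
        },
    },
}
X = [[0.1, 0.2, 0.3, 0.4, 0.5],
     [1.1, 1.2, 1.3, 1.4, 1.5]]
dominant = split_attributes(tree)          # dominant features
X_selected = select_features(X, dominant)  # recombined feature vectors
```

Features 1, 2 and 4 never appear as split attributes in this toy tree, so they are treated as non-dominant and discarded.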

C. FACIAL EXPRESSION FEATURE EXTRACTION
Facial expression carries visual information and is a key part of conveying emotions. The human face and its expressions reflect people's different emotional states. For example, when people are happy, their lips are often open, the corners of their mouths are raised, and their eyes become smaller. When people are angry, they often widen their eyes, frown, knit their eyebrows and tense their zygomatic muscles. Recognizing these states by computer is facial expression recognition.

1) BASIC PRINCIPLE OF SPARSE REPRESENTATION
The facial expression recognition model based on sparse representation first composes the training sample set into a sparse dictionary and then solves for the sparsest solution vector of the test sample [35]. Finally, analysis of the solution vector coefficients determines the facial expression category of the test sample. The specific implementation process is as follows:
Step 1. Preprocessing phase. Input four types of facial expression samples: angry, depressed, happy, and normal. The $n_g$ training samples of the $g$-th category are formed into a matrix $B_g$. Each sample $b_{g,h}$ is assumed to be a grayscale image of $m \times n$ pixels, i.e., $b_{g,h} \in \mathbb{R}^{m \times n}$.
Step 2. Stack the columns of each sample $b_{g,h}$ to form a column vector $u_{g,h} \in \mathbb{R}^{mn}$. The new matrix is denoted as

$$B_g = [u_{g,1}, u_{g,2}, \cdots, u_{g,n_g}] \in \mathbb{R}^{mn \times n_g}$$

Each column represents one facial expression training sample of the $g$-th category. The training samples of all $k$ categories together form the total training sample set matrix $B$:

$$B = [B_1, B_2, \cdots, B_k]$$

Step 3. After obtaining matrix $B$, a test sample is first converted to column vector form $y \in \mathbb{R}^{mn}$. A test sample of the $g$-th category can be approximated by a linear combination of the training samples of this category:

$$y = a_{g,1} u_{g,1} + a_{g,2} u_{g,2} + \cdots + a_{g,n_g} u_{g,n_g}$$

In the formula, $a_{g,h} \in \mathbb{R}$, $h = 1, 2, \cdots, n_g$. With the dictionary $A = B$, the test sample $y$ is expressed as:

$$y = Ax, \quad x = [0, \cdots, 0, a_{g,1}, a_{g,2}, \cdots, a_{g,n_g}, 0, \cdots, 0]^{T}$$

In the formula, $x$ is the expansion coefficient vector of test sample $y$ relative to the total training sample set $B$.
Step 4. Finding the approximate sparse solution $\hat{x}$ of $y = Ax$ yields the emotional state corresponding to the facial expression.
The number of unknowns in the system of equations is greater than the number of equations; therefore, the solution of $y = Ax$ is not unique. Since sparsity is measured by the $L_0$ norm, the solution can be sought by $L_0$-norm minimization:

$$\hat{x} = \arg\min \|x\|_0 \quad \text{subject to} \quad y = Ax \tag{9}$$

In the formula, $\|\cdot\|_0$ is the $L_0$ norm of a vector, representing the number of non-zero elements in vector $x$. According to recent studies on compressed sensing, when the coefficient vector $x$ is sufficiently sparse, it can be solved using the $L_1$-norm approximation:

$$\hat{x} = \arg\min \|x\|_1 \quad \text{subject to} \quad y = Ax \tag{10}$$

In the formula, $\|\cdot\|_1$ is the $L_1$ norm of a vector, representing the sum of the absolute values of the elements in vector $x$.
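The dictionary construction of Steps 1 to 3 can be sketched with NumPy as below. The image size, the number of samples per class and the random placeholder images are assumptions made only for illustration, and the unit-norm scaling of the columns is a common convention that the text does not state.

```python
import numpy as np

def build_dictionary(images_by_class):
    """Stack each image column by column and concatenate all classes.

    images_by_class: list over classes, each a list of 2-D arrays of equal shape.
    Returns the dictionary B and the class index of every column.
    """
    columns, owners = [], []
    for g, images in enumerate(images_by_class):
        for img in images:
            # order="F" stacks the columns of the image, as in Step 2.
            columns.append(np.asarray(img, dtype=float).reshape(-1, order="F"))
            owners.append(g)
    B = np.stack(columns, axis=1)
    # Unit-norm columns: an assumed convention, not taken from the text.
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    return B, np.array(owners)

rng = np.random.default_rng(0)
# 4 expression classes with 3 random 8x6 placeholder "images" each.
images = [[rng.random((8, 6)) for _ in range(3)] for _ in range(4)]
B, owners = build_dictionary(images)

# A test sample from class 1 has nonzero coefficients only in that class's block.
x = np.zeros(B.shape[1])
x[owners == 1] = [0.6, 0.3, 0.1]
y = B @ x
```

With 12 columns of dimension 48, this toy dictionary is not yet overcomplete; a realistic training set would contain many more samples per class.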

2) ORTHOGONAL MATCHING PURSUIT ALGORITHM
Several algorithms can solve equation (9) approximately, including the matching pursuit (MP) algorithm and the orthogonal matching pursuit (OMP) algorithm. Here the OMP algorithm is used to solve for the sparse coefficients: through repeated iterations, it selects the column vector of the training matrix that has the highest correlation with the residual signal. The sparse solution of the test sample can effectively recognize different facial expressions. The pseudo code of the OMP algorithm is shown in Algorithm 1.

Algorithm 1 Pseudo Code of OMP Algorithm
Begin
1. The approximate sparse solution of y = Ax is x̂. First, select the column of matrix A that has the highest correlation with the residual r_0 = y.

It can be seen from the algorithm that the solved sparse coefficients not only exclude the interference of most samples from other categories but also find the most similar sample category among a small number of samples, which greatly reduces the misjudgment rate and effectively handles image occlusion, lighting and other issues. Benefiting from these advantages, the sparse representation classifier has been used in many pattern recognition fields.
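For readers who want a concrete reference point, a generic textbook form of OMP is sketched below; it is not a reconstruction of Algorithm 1. The class decision by smallest per-class reconstruction residual is the usual rule for sparse representation classifiers and is an assumption here, since the text only says that the solution vector coefficients are analyzed. The demo dictionary has orthonormal columns so that recovery is exact; real dictionaries are overcomplete and recovery is only approximate.

```python
import numpy as np

def omp(A, y, n_nonzero, tol=1e-10):
    """Greedy approximation of arg min ||x||_0 subject to y = A x."""
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    x = np.zeros(A.shape[1])
    for _ in range(n_nonzero):
        # Select the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick a column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns keeps the residual
        # orthogonal to all of them (the "orthogonal" in OMP).
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coef
    return x

def classify(A, owners, y, n_nonzero):
    """Assign y to the class whose coefficients reconstruct it best (assumed rule)."""
    x = omp(A, y, n_nonzero)
    classes = np.unique(owners)
    errors = [np.linalg.norm(y - A[:, owners == g] @ x[owners == g]) for g in classes]
    return int(classes[int(np.argmin(errors))]), x

rng = np.random.default_rng(0)
# Orthonormal demo dictionary: 12 columns in 20 dimensions, 3 columns per class.
A, _ = np.linalg.qr(rng.standard_normal((20, 12)))
owners = np.repeat(np.arange(4), 3)
y = 0.8 * A[:, 4] + 0.5 * A[:, 5]  # built from two class-1 columns
label, x_hat = classify(A, owners, y, n_nonzero=3)
```

Because both active columns belong to class 1, the residual of that class is essentially zero and the sample is assigned to it.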

IV. MULTIMODAL EMOTION RECOGNITION BASED ON DEEP NEURAL NETWORK
In recent years, deep neural networks have been studied extensively and have shown superiority in artificial intelligence applications such as speech recognition, natural language processing, and image recognition. However, research on the use of deep neural networks for emotion recognition is lacking. Recently, Liu et al. used deep neural networks to combine EEG signals and eye tracking signals for emotion recognition. They found that using deep neural networks to fuse the two signals effectively improves the accuracy of emotion recognition and performs better than traditional model fusion methods. This observation inspired us to use deep neural networks to fuse EEG signals and facial expression signals, so as to further improve the accuracy of emotion recognition [36]. The following parts introduce the details of the deep neural network model: the Bimodal Deep Auto-Encoder.
In a restricted Boltzmann machine (RBM), the observation data corresponds to the visible layer, and the extracted features correspond to the hidden layer, which can be regarded as a feature detector. The model has no connections between nodes on the same layer. Assuming that the visible layer variable is $v \in \{0, 1\}^{M}$ and the hidden layer variable is $h \in \{0, 1\}^{N}$, the energy of an RBM system can be defined as:

$$E(v, h; \theta) = -\sum_{i=1}^{M} \sum_{j=1}^{N} W_{ij} v_i h_j - \sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{N} a_j h_j \tag{11}$$

In the formula, the parameter $\theta = \{a, b, W\}$, $W_{ij}$ is the symmetric weight between visible layer node $i$ and hidden layer node $j$, and $b_i$ and $a_j$ are the biases of visible layer node $i$ and hidden layer node $j$, respectively.
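With the energy function, the joint probability distribution of the visible layer nodes and the hidden layer nodes can be obtained:

$$P(v, h \mid \theta) = \frac{\exp(-E(v, h; \theta))}{Z(\theta)}, \quad Z(\theta) = \sum_{v, h} \exp(-E(v, h; \theta)) \tag{12}$$

In practical applications, we are concerned with the observation data, that is, the marginal probability distribution $P(v \mid \theta)$ of the visible layer $v$. This distribution is also called the likelihood function:

$$P(v \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp(-E(v, h; \theta)) \tag{13}$$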
In the formula, calculating $Z(\theta)$ requires traversing all states of the visible layer nodes and hidden layer nodes, $2^{M+N}$ in total, which is intractable in practical applications. However, the RBM has a special structure: there are connections between the layers but none between nodes within a layer. Therefore, when the state of one layer (such as the visible layer) is given, the states of the nodes in the other layer (that is, the hidden layer) are conditionally independent. The conditional probability that hidden layer node $j$ is activated is therefore:

$$P(h_j = 1 \mid v, \theta) = \sigma\Big(a_j + \sum_{i=1}^{M} W_{ij} v_i\Big) \tag{14}$$

In the formula, $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid activation function.
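The conditional independence argument can be checked numerically on a tiny RBM, where the normalization factor is still cheap to enumerate. The sketch below is only an illustration: the sizes (3 visible and 2 hidden nodes) and the random parameters are assumptions, and it uses the standard RBM energy with symmetric weights W, visible biases b and hidden biases a.

```python
import itertools
import numpy as np

def energy(v, h, W, b, a):
    """Standard RBM energy: -v^T W h - b^T v - a^T h."""
    return -(v @ W @ h) - b @ v - a @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
M, N = 3, 2                    # visible / hidden sizes (toy values)
W = rng.normal(size=(M, N))
b = rng.normal(size=M)         # visible biases
a = rng.normal(size=N)         # hidden biases

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=M)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=N)]

# Normalization factor: a sum over all 2**(M+N) joint states.
Z = sum(np.exp(-energy(v, h, W, b, a)) for v in states_v for h in states_h)

# Brute-force P(h_0 = 1 | v) from the joint distribution ...
v = np.array([1, 0, 1])
num = sum(np.exp(-energy(v, h, W, b, a)) for h in states_h if h[0] == 1)
den = sum(np.exp(-energy(v, h, W, b, a)) for h in states_h)
brute = num / den
# ... agrees with the sigmoid closed form, which never needs Z.
closed = sigmoid(a[0] + v @ W[:, 0])

# The marginal likelihoods P(v) sum to one over all visible states.
total = sum(
    sum(np.exp(-energy(u, h, W, b, a)) for h in states_h) / Z for u in states_v
)
```

The enumeration costs 32 terms here but doubles with every added node, which is why the closed-form conditional is the quantity used in training.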
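Since the structure of the RBM is symmetrical, the conditional probability that node $i$ in the visible layer is activated given the hidden layer $h$ can be obtained in the same way:

$$P(v_i = 1 \mid h, \theta) = \sigma\Big(b_i + \sum_{j=1}^{N} W_{ij} h_j\Big) \tag{15}$$

The goal of training an RBM is to obtain the value of parameter $\theta$, which can be found by maximizing the log-likelihood of the RBM on the training data:

$$\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{t=1}^{T} \log P(v^{(t)} \mid \theta)$$

In the formula, $T$ is the number of training samples. Taking the partial derivative of $L(\theta)$ with respect to $\theta$ gives the gradient:

$$\frac{\partial L(\theta)}{\partial \theta} = \sum_{t=1}^{T} \left( \left\langle \frac{\partial (-E(v^{(t)}, h; \theta))}{\partial \theta} \right\rangle_{P(h \mid v^{(t)}, \theta)} - \left\langle \frac{\partial (-E(v, h; \theta))}{\partial \theta} \right\rangle_{P(v, h \mid \theta)} \right)$$

In the formula, $\langle \cdot \rangle_{P}$ is the mathematical expectation with respect to distribution $P$. $P(h \mid v^{(t)}, \theta)$ is the probability distribution of the hidden layer nodes given a training sample (that is, with the value of each visible layer node fixed), which is easy to calculate. However, $P(v, h \mid \theta)$ is the joint probability distribution of the visible layer and the hidden layer, which involves the normalization factor $Z(\theta)$ and is difficult to obtain. Therefore, only an approximation can be obtained by sampling. The commonly used sampling method is Gibbs sampling [37].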
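Hinton proposed a fast RBM learning method based on the Contrastive Divergence (CD) algorithm in 2002. He suggested that $k$ steps of Gibbs sampling (usually $k = 1$) give a good enough approximation. First, the visible layer is set to a training sample, and the probability from Equation 14 is used to compute the values of the hidden layer nodes. Then the probability from Equation 15 is used to determine the values of the visible layer nodes, forming a reconstruction of the training sample. In this way, the parameter update rule for a sample becomes:

$$\Delta W_{ij} = \gamma \left( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{r} \right), \quad \Delta b_i = \gamma \left( \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{r} \right), \quad \Delta a_j = \gamma \left( \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{r} \right)$$

In the formula, $\gamma$ is the learning rate, $\langle \cdot \rangle_{\mathrm{data}}$ is the expectation given the training sample, and $\langle \cdot \rangle_{r}$ is the expectation under the distribution of the model after one-step reconstruction. Then all sample points are traversed and the model parameters are updated continually to train the RBM.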
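A single CD-1 update for one training sample might look as follows. This is a sketch under stated assumptions rather than the authors' code: W is stored as a (visible × hidden) matrix, hidden probabilities are used in place of sampled states in the statistics (a common variance-reduction choice), and the function name `cd1_update` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, a, lr, rng):
    """One CD-1 step for one binary training sample v0; returns new (W, b, a)."""
    # Positive phase: hidden probabilities and a sampled hidden state.
    ph0 = sigmoid(a + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then recompute hidden probs.
    pv1 = sigmoid(b + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(a + v1 @ W)
    # Data statistics minus reconstruction statistics, scaled by the learning rate.
    W = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b = b + lr * (v0 - v1)
    a = a + lr * (ph0 - ph1)
    return W, b, a

rng = np.random.default_rng(0)
M, N = 6, 3                                  # toy layer sizes
W = np.zeros((M, N))
b = np.zeros(M)                              # visible biases
a = np.zeros(N)                              # hidden biases
v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)
W_new, b_new, a_new = cd1_update(v0, W, b, a, lr=0.1, rng=rng)
```

Repeating this update over all training samples, sample by sample, is the traversal described above.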
The nodes used in RBM obey Bernoulli Distribution, that is, the value of the node is 0 or 1. Moreover, the contrast divergence algorithm (CD) is applied to train the RBM. In addition, when updating the weights of RBM, only learning one sample at a time will greatly increase the amount of calculation [38]. If dividing the samples into mini-batches with tens or hundreds of samples for training, the efficiency will be improved, which is mainly benefited from using the graphics processor GPU (Grapic Processing Unit) or the efficient matrix multiplication operation in MATLAB. In this paper, the batch size of small batch data is set to 100.
Let the number of visible layer units be n and the number of hidden layer units be m. Let w denote the connection weights between the visible layer and the hidden layer (an m × n matrix), let the vector α (an m-dimensional column vector) denote the offset vector of the hidden layer, and let the vector b (an n-dimensional column vector) denote the offset vector of the visible layer. The CD-based RBM fast learning method is shown in FIGURE 5.
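The CD-based learning procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: it assumes k = 1, updates the parameters one sample at a time instead of in mini-batches of 100, and uses the hidden-unit probabilities rather than sampled states in the update, which is a common practical choice. The function names, the toy data and the default hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_bernoulli(probs, rng):
    # Draw a 0/1 state for each node from its activation probability
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def hidden_probs(v, w, alpha):
    # P(h_j = 1 | v) for every hidden node j; w is an m x n matrix
    return [sigmoid(alpha[j] + sum(w[j][i] * v[i] for i in range(len(v))))
            for j in range(len(alpha))]

def visible_probs(h, w, b):
    # P(v_i = 1 | h) for every visible node i (symmetric to hidden_probs)
    return [sigmoid(b[i] + sum(w[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_update(v0, w, alpha, b, gamma, rng):
    """One CD-1 step on a single binary sample; updates w, alpha, b in place."""
    ph0 = hidden_probs(v0, w, alpha)      # positive phase
    h0 = sample_bernoulli(ph0, rng)
    v1 = sample_bernoulli(visible_probs(h0, w, b), rng)  # reconstruction
    ph1 = hidden_probs(v1, w, alpha)      # negative phase
    for j in range(len(alpha)):
        for i in range(len(b)):
            w[j][i] += gamma * (ph0[j] * v0[i] - ph1[j] * v1[i])
        alpha[j] += gamma * (ph0[j] - ph1[j])
    for i in range(len(b)):
        b[i] += gamma * (v0[i] - v1[i])

def train_rbm(data, m, gamma=0.1, epochs=50, seed=0):
    # data: list of binary vectors of length n; m: number of hidden units
    rng = random.Random(seed)
    n = len(data[0])
    w = [[rng.gauss(0.0, 0.01) for _ in range(n)] for _ in range(m)]
    alpha = [0.0] * m
    b = [0.0] * n
    for _ in range(epochs):
        for v0 in data:
            cd1_update(v0, w, alpha, b, gamma, rng)
    return w, alpha, b

# Toy data (hypothetical): two binary patterns over n = 4 visible nodes
toy_data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
w, alpha, b = train_rbm(toy_data, m=2, epochs=5, seed=1)
```

A mini-batch variant would accumulate the three differences over all samples in a batch and apply their average once per batch, which is where matrix multiplication pays off.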

B. CONSTRUCTION OF MULTIMODAL EMOTION RECOGNITION MODEL
A BDAE is used to construct a deep neural network, which includes two parts: encoding and decoding. In the encoding part, two RBM models are trained using the features of the EEG signals and the facial expression signals respectively, as shown in FIGURE 6(a).
After training, the hidden layers $h_{EEG}$ and $h_{Face}$ and the weights $w_1$ and $w_2$ of the two RBMs are obtained. $h_{EEG}$ and $h_{Face}$ are then concatenated to form the visible layer of a new RBM, which is trained to obtain the corresponding weights $w_3$, as shown in FIGURE 6(b).
In the decoding part, two layers of RBMs are unfolded to reconstruct the input features, so as to form a deep autoencoder. The weights between the network layers are $w_1$, $w_2$, $w_3$, $w_3^T$, $w_1^T$ and $w_2^T$, as shown in FIGURE 6(c). Then, the back-propagation (BP) algorithm is used in an unsupervised manner to fine-tune the parameters of the network. After the BDAE is trained, the output of the middle layer (the third of the five layers of the BDAE) is taken as the extracted features. These features are then sent to the LIBSVM classifier for supervised training to obtain the final emotion classification model.
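The five-layer structure described above can be sketched as a single forward pass in Python. This is only a structural illustration under stated assumptions: offsets are omitted, the weights are random rather than RBM-pretrained, the BP fine-tuning step is not shown, and all layer sizes other than the 14-dimensional EEG input are hypothetical. The decoder reuses the transposed encoder weights, matching $w_3^T$, $w_1^T$ and $w_2^T$.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(w, x):
    # w has one row per output unit and one column per input unit
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def decode(w, h):
    # Applies the transpose of w, i.e. tied weights, to map back to the input space
    n = len(w[0])
    return [sigmoid(sum(w[j][i] * h[j] for j in range(len(w)))) for i in range(n)]

def bdae_pass(x_eeg, x_face, w1, w2, w3):
    # Layer 2: modality-specific hidden layers
    h_eeg = encode(w1, x_eeg)
    h_face = encode(w2, x_face)
    # Layer 3: shared representation, later used as classifier input
    code = encode(w3, h_eeg + h_face)
    # Layer 4: reconstruct both hidden layers with w3^T
    joint_rec = decode(w3, code)
    h_eeg_rec = joint_rec[:len(h_eeg)]
    h_face_rec = joint_rec[len(h_eeg):]
    # Layer 5: reconstruct the two inputs with w1^T and w2^T
    return code, decode(w1, h_eeg_rec), decode(w2, h_face_rec)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 14 EEG features, 10 expression features,
# 8 hidden units per modality, 6 units in the shared layer
rng = random.Random(0)
w1 = random_matrix(8, 14, rng)
w2 = random_matrix(8, 10, rng)
w3 = random_matrix(6, 16, rng)
code, eeg_rec, face_rec = bdae_pass([0.5] * 14, [0.2] * 10, w1, w2, w3)
```

Fine-tuning would minimize the difference between the inputs and `eeg_rec`, `face_rec`; after that, only `code` is passed on to the classifier.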
The proposed method uses LIBSVM as the classifier to identify emotions. LIBSVM is a software package that implements SVM and simplifies the selection of its parameters. An SVM is fundamentally a binary classifier: based on the established hyperplane, it separates positive examples from negative examples as well as possible.
In the process of emotion classification, each channel involved in emotion recognition eventually produces its own emotion recognition result [39]. Each channel can therefore be regarded as a separate set of EEG signals that forms a separate LIBSVM classifier, and decision layer fusion is performed on the classification results generated by these classifiers. Applying multi-classifier fusion based on the fuzzy integral not only fuses the results obtained by each LIBSVM classifier, but also reflects the importance of each classifier in the fusion process [40]. The fuzzy integral uses the fuzzy measure as the weight of each LIBSVM classifier and fully considers the relationship between the classifiers.
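The following Python sketch illustrates one common form of fuzzy-integral fusion. The paper does not specify which fuzzy integral or fuzzy measure is used, so the Sugeno integral with a λ-fuzzy measure is an assumption here, as are the function names, the classifier scores and the density values (which could, for example, be set from each classifier's validation accuracy). Densities are assumed to lie strictly between 0 and 1, with at least two classifiers.

```python
def solve_lambda(densities):
    # Find lambda > -1 with prod(1 + lambda * g_i) = 1 + lambda (lambda-fuzzy measure)
    s = sum(densities)
    if abs(s - 1.0) < 1e-6:
        return 0.0  # densities already sum to one, so the measure is additive

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if s > 1.0:
        lo, hi = -1.0 + 1e-12, -1e-9  # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0            # root lies in (0, +inf)
        while f(hi) < 0.0:
            hi *= 2.0
    sign_lo = f(lo) > 0.0
    for _ in range(200):              # plain bisection
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0.0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sugeno_integral(supports, densities, lam):
    # supports[k]: support of classifier k for one class, in [0, 1]
    order = sorted(range(len(supports)), key=lambda k: supports[k], reverse=True)
    g_acc, best = 0.0, 0.0
    for k in order:
        # Measure of the set formed by the classifiers visited so far
        g_acc = densities[k] + g_acc + lam * densities[k] * g_acc
        best = max(best, min(supports[k], g_acc))
    return best

def fuse(scores, densities):
    # scores[k][c]: support of classifier k for class c
    lam = solve_lambda(densities)
    n_classes = len(scores[0])
    fused = [sugeno_integral([s[c] for s in scores], densities, lam)
             for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: fused[c]), fused

# Hypothetical example: two classifiers, two classes
label, fused = fuse([[0.9, 0.1], [0.8, 0.2]], densities=[0.2, 0.3])
```

Because the measure of a set of classifiers is not simply the sum of their densities, the fusion can reward or penalize agreement between classifiers, which is the interaction effect that a weighted vote cannot express.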

V. EXPERIMENTAL VERIFICATION AND RESULT ANALYSIS
To demonstrate the performance of the proposed multi-modal deep learning emotion recognition method, a video library was first established for video emotion-evoked EEG experiments. The video library contains 90 video clips, collected from different movies and TV shows and converted to a uniform wmv format. The clips cover three types of emotions: violent, neutral and pornographic, with 30 clips of each type. The violent and pornographic video clips were taken from two different types of famous movies: action films and drama films. Each video clip contains only one emotion type and lasts about 6 s. Before being included in the video library, each video clip was objectively evaluated by 6 researchers (3 men and 3 women). A video clip was selected into the library only when all six researchers agreed that it belongs to a given emotion type.
Thirteen healthy subjects (7 males and 6 females) participated in the video emotion-evoked electroencephalogram experiment. The subjects were 24-28 years old, with uncorrected or corrected vision of 1.0. To induce EEG signals containing different video emotions, the video library of 90 video clips described above was used as the stimulus. The subjects wore a 64-lead Quik-Cap electrode cap and watched the video clips played continuously on a computer.
The electrodes of the electrode cap are arranged according to the 10-20 electrode placement system, as shown in FIGURE 7. A Neuroscan system is used to collect and preprocess the EEG signals induced in the experiments. The E-Prime software developed by PST is used to design the video emotion-evoked EEG experiment.
The flow chart of the designed experiment is shown in FIGURE 8. At the beginning of the experiment, the computer screen in front of the subject displayed the instructions and cautions. After understanding the experimental process and the general content of the experiment, the subject started the experiment by pressing the space bar.
For each subject, 30 video clips were randomly selected from the video library of 90 clips, with 10 clips for each emotion type. To prevent the subjects from forming inertial memories, these 30 video clips were played in random order. Before each video clip, a cross-shaped prompt was displayed on the computer screen to attract the subject's attention. After each video clip, there was a period of rest to allow the subject to calm down. The experiment ended after all 30 selected video clips had been played. Throughout the experiment, the subject's EEG signals were collected at a sampling rate of 1000 Hz. The experiment was repeated for each subject, finally yielding the EEG signals of all 13 subjects.
After the collected raw EEG data are preprocessed, a relatively ''pure'' EEG signal is obtained. Features are then extracted from the ''pure'' EEG signals to obtain the initial EEG features. In this method, wavelet packet decomposition (WPD) is used to extract the features of the preprocessed EEG signals, with DB6 as the wavelet base. Experiments showed that a decomposition level of up to 10 has little impact on the classification results, so the decomposition level is set to 3 in this paper. WPD features of dimension 8 are extracted from the EEG signal in each time window. For a video clip, 48-dimensional WPD features are extracted from the EEG signal that one electrode collects from a subject. With the 64 electrodes used in the experiment, 3072-dimensional WPD features are extracted from a subject's EEG signal for a video clip. A feature selection method based on a decision tree is then used to select from the initial EEG features. In this paper, each dimension of the EEG features is regarded as an attribute. The WPD feature vector is input into the C4.5 decision tree for tree building and pruning, and a simplified tree is formed. All the attributes included in the simplified tree are the selected EEG features. Experiments show that the simplified tree contains 14 attributes. Therefore, 14-dimensional EEG features are selected from the 3072-dimensional WPD features for later classification.
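The feature dimensions above follow from the structure of the decomposition: a level-3 wavelet packet tree has 2^3 = 8 terminal nodes, which gives 8 features per window. The Python sketch below illustrates this with several loudly stated assumptions. It uses the Haar wavelet as a simple stand-in for DB6, takes the energy of each terminal node as the feature (the paper does not state which statistic is used), and assumes 6 windows per clip, a value inferred only from 48 / 8. The function names and the test signal are hypothetical.

```python
import math

def haar_split(x):
    # One Haar analysis step: approximation and detail, each half as long as x
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wpd_energies(signal, level=3):
    # Full wavelet packet tree: every node is split again, not only the approximation
    nodes = [list(signal)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_split(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    # One energy value per terminal node, in decomposition order
    return [sum(c * c for c in node) for node in nodes]

# Hypothetical 1 s window sampled at 1000 Hz
window = [math.sin(0.05 * t) for t in range(1000)]
features = wpd_energies(window, level=3)

# Dimension bookkeeping for one video clip
n_windows = 6                                              # assumption: inferred from 48 / 8
features_per_electrode = len(features) * n_windows         # 8 * 6 = 48
features_per_clip = features_per_electrode * 64            # 48 * 64 = 3072
```

With a library such as PyWavelets, the same tree could be built with the DB6 wavelet; the node count per level, and hence the feature dimension, would be unchanged.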

A. EXTRACTION COMPLEMENTARY ANALYSIS OF EEG SIGNAL AND EXPRESSION SIGNAL
To study the difference in the ability of EEG signals and facial expression signals to recognize different emotional states, the EEG signals, the expression features, the feature layer fusion (FLF) method and the BDAE method were each used to classify the four emotions, and the resulting confusion matrices are shown in Table 1. Each row of a confusion matrix represents the true category of the data, and each column represents the predicted category. The entry in the i-th row and the j-th column is the number of samples that truly belong to the i-th category and are classified by the model as the j-th category. The confusion matrix can also be used to calculate Precision, Recall, and F1-score. Precision is the percentage of the data identified as a certain category that really belongs to this category. Recall is the percentage of the data that really belongs to a certain category that the model classifies correctly. The F1-score combines the two measures as their harmonic mean, with a value between 0 and 1, and is also a measure of accuracy. FIGURE 9 is obtained by calculating the precision, recall, and F1-values from the four confusion matrices, and focuses on comparing the F1-values of the EEG signals, facial expression signals, FLF, and BDAE methods on different emotion types.
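The three measures can be computed directly from a confusion matrix whose rows are true categories and whose columns are predicted categories. The Python sketch below shows the calculation; the function name and the 2 × 2 example matrix are hypothetical and are not taken from Table 1.

```python
def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each category of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        if precision + recall:
            f1 = 2.0 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical two-category example: rows are true labels, columns are predictions
example_cm = [[8, 2],
              [1, 9]]
example_metrics = per_class_metrics(example_cm)
```

For the first category of this example, precision is 8/9, recall is 8/10, and the F1-value is 16/19, which lies between the two as expected of a harmonic mean.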
The F1-values of the EEG signals in the four emotional states in FIGURE 9 show that the values for the sad and happy states are higher, indicating that the model trained with EEG signals is better at distinguishing the sad and happy states. The F1-values of the facial expression features show that the value for the neutral state is significantly higher than those for the other three emotional states, indicating that the model trained with facial expression features is better at distinguishing neutral emotions. These results show that models trained with EEG signals and with facial expression features differ greatly in their ability to distinguish the four emotions. The two models are therefore complementary in their ability to characterize different emotions, and this complementarity can be used to improve the accuracy of emotion recognition.
The F1-values of the FLF method in the four emotional states are relatively high for the fear and neutral states, indicating that the FLF method is better at distinguishing fear and neutral emotions. In addition, except that its F1-value in the sad state (0.73) is slightly lower than that of the EEG signals (0.74), the F1-values of the other three emotions are higher than the corresponding F1-values of both the EEG signals and the expression features. The F1-values obtained by the proposed BDAE method are significantly improved in all four emotional states, and the values for sadness, fear and neutral are relatively high, indicating that the ability of the BDAE method to distinguish sadness, fear and neutral emotions is significantly improved. This also shows that the BDAE method exploits the strength of EEG signals in distinguishing sad emotions and the strength of facial expressions in distinguishing neutral emotions to improve its overall recognition ability. According to the confusion matrix in Table 1(d), for sad emotions, the model trained with EEG signals most often misclassifies truly sad data as neutral and less often as fear, whereas the model trained with expression features shows the opposite pattern. BDAE combines the characteristics of the two models and effectively reduces the cases in which truly sad data are misclassified as neutral or fear. The other three types of emotions show similar results.
The F1-values of the EEG signals, facial expression features, FLF and BDAE methods for each emotional state are compared in FIGURE 10. For sad emotions, the F1-value of the FLF method is not significantly improved, whereas the proposed BDAE method recognizes sad emotions significantly better. For fear and happiness, the F1-values of the two single modalities are relatively close, but the FLF and BDAE methods improve the recognition of fear more than the recognition of happiness. For neutral emotions, the model trained with EEG signals is not very good at distinguishing this emotion, but the model trained with facial expression features performs better. By combining the complementary information of the two modalities, the resulting model effectively improves the ability to recognize neutral emotions.
The results show that EEG signals and facial expression features differ in their ability to characterize different emotions, and that combining them makes effective use of the complementary information to improve the recognition of the four types of emotions.

B. THE INFLUENCE OF LEARNING RATE ON THE RECOGNITION ACCURACY RATE OF FOUR KINDS OF EMOTION
To analyze how the learning rate γ affects the accuracy of the proposed emotion recognition method, the classification accuracy for the four emotions (happiness, sadness, fear and neutrality) is shown in FIGURE 11. As the learning rate γ increases, the accuracy of emotion recognition fluctuates to a certain extent. When the learning rate is too small, parameters with large gradients converge very slowly; when the learning rate is too large, parameters that are already nearly optimized may become unstable. The accuracy is higher when γ is 0.4 and 0.6. In addition, the proposed method combines the EEG signals and facial video features, using the complementary strengths of EEG signals in distinguishing sad emotions and of expression features in distinguishing neutral emotions to improve the model's ability to recognize emotions. As a result, the model achieves the highest recognition accuracy for neutral emotions and a lower recognition accuracy for happy emotions.

C. MODEL STABILITY OVER TIME
In practical applications, it is not realistic to train a new model every time the user's emotions need to be recognized. It is necessary to train a stable emotion recognition model that can be used for emotion recognition over a long period. To study the stability of the emotion recognition model over time, each participant was asked to do three experiments, a few days apart. Data from the same subject in different experiments are used as the training set and the test set to verify that the proposed model is relatively stable over time. Table 2 shows the stability of the models obtained by fusing 6-lead and 62-lead EEG signals with facial expression features and training with the BDAE method. The entry in the i-th row and j-th column of each table is the recognition accuracy when the model is trained on the data from the i-th day and tested on the data from the j-th day.
Table 2 shows that the model obtained by combining 6-lead or 62-lead EEG signals with expression features has a certain stability over time, and that the accuracy on the diagonal is higher than elsewhere. A model trained and tested on data from the same day achieves higher accuracy. When the model is tested on data from other experiments, the accuracy is not as high as on the same day, but it is still satisfactory for a four-class problem. The results of training and testing within the same experiment are better than those across different experiments. The environmental variables change slightly from one experiment to another: for example, the impedance of the EEG cap differs, and the external interference also changes. When the training and test data come from different experiments, this noise affects the distribution of the samples, which reduces the accuracy. In addition, the emotions induced in the subjects also differ between experiments, which further lowers the accuracy when training and testing are carried out on different experiments.
In addition, the longer the interval between the experiments used for training and testing, the lower the accuracy of emotion recognition, indicating that the stability of the model decreases as time goes by. The difference between the recognition accuracy of 6 leads and 62 leads is not very obvious. The model trained by combining 6-lead EEG signals with facial expression features has stability over time equivalent to that of the 62-lead model, which shows that 6-lead EEG signals have a strong emotion representation capability and that EEG signals from a small number of electrodes are sufficient for emotion recognition. This reduces complexity and cost and enhances portability. Therefore, the model obtained by combining EEG signals and facial expression features has been shown to have a certain stability over time, which makes it possible to use EEG signals and facial expression features for emotion recognition in practice.

D. COMPARATIVE RESULTS OF MULTI-MODAL FUSION EMOTION MODEL EXPERIMENTS
The proposed method is compared with the methods in [22], [26], [28], [30], and [31]. The experimental results are shown in Table 3, where the evaluation standard is the average emotion recognition rate over the discrete emotional states. To evaluate the methods fairly, the same data sets were used for all the compared methods in this study.
As Table 3 shows, [22] used brain functional networks and achieved a 62.5% average emotion recognition rate on positive and negative emotions. In [26], EEG and wavelet analysis were used to identify four emotional states (happiness, sadness, anger, and relaxation), with a final average emotion recognition rate of 74.52%. [28] recognized four emotion states based on facial expression features and used SVM for classification, with an average emotion recognition rate of 77.63%. [30] proposed a new group sparse canonical correlation analysis (GSCCA) method for simultaneous EEG channel selection and emotion recognition, which divided the entire EEG frequency band into five parts and then extracted the GSCCA features from each frequency band. When distinguishing 3 discrete emotion states, the model achieved a 79.32% average emotion recognition rate. [31] performed multi-modal fusion of EEG and electrooculogram signals. In feature layer fusion, the feature vectors of the two physiological signals are directly concatenated to form a longer feature vector. In decision layer fusion, two classifiers are trained with the different features respectively, and their outputs are then fused at the decision layer. The proposed method uses BDAE to fuse EEG signals with facial video features and conducts emotion recognition based on deep learning. As a result, both the recognition rate for each discrete emotional state and the average emotion recognition rate are improved, with the average reaching 85.71%. In terms of running time, the proposed model is in the middle of the compared models, with an average running time of 15.57 s.

VI. CONCLUSION
As an important field of artificial intelligence, emotion computing is favored by more and more researchers. At present, most emotion recognition methods are based on speech signals, facial expressions, and bioelectric signals such as the electrocardiogram and electroencephalogram. When the emotion signal of a single channel is interfered with by other signals, the models usually achieve a lower emotion recognition rate. Therefore, taking expression signals and EEG signals as the emotion signals, a multi-modal emotion recognition method that combines them through BDAE feature fusion is proposed. Supervised learning experiments on the constructed video library show that the complementarity of EEG signals and facial expression features can effectively improve the ability to recognize four types of emotions, and that the application of deep neural networks can greatly improve multi-modal emotion recognition. Compared with other methods, the proposed method achieves a higher recognition accuracy, reaching 85.71%.
Currently, most research works take two different emotion signals as target objects, but when humans deliberately disguise their emotion signals, the recognition rate tends to decrease. Therefore, combining more emotion information will improve the construction of emotion recognition systems, and building a more effective emotion database will be the focus of future research. In addition, different emotions are usually correlated; for example, sad emotions often contain a certain amount of anger. It is therefore worth studying how to construct a more effective emotion recognition model by exploiting the correlation between emotions. Moreover, the method of classifying video clips used in this paper lacks subjectivity and needs to be further improved. Finally, there is currently no public database that includes videos and the corresponding evoked EEG. In the future, we can consider increasing the types, number and length of the videos and collecting EEG signals from more subjects in order to establish a public video EEG database.