Emotion Recognition with Audio, Video, EEG, and EMG: A Dataset and Baseline Approaches

This paper describes a new posed multimodal emotional dataset and compares human emotion classification based on four different modalities: audio, video, electromyography (EMG), and electroencephalography (EEG). Results are reported for several baseline approaches using various feature extraction techniques and machine learning algorithms. First, we collected a dataset from 11 human subjects expressing six basic emotions plus a neutral emotion. We then extracted features from each modality using principal component analysis, autoencoders, convolutional networks, and Mel-frequency cepstral coefficients (MFCC), some unique to individual modalities. A number of baseline models were applied to compare classification performance in emotion recognition, including k-nearest neighbors (KNN), support vector machines (SVM), random forest, multilayer perceptron (MLP), long short-term memory (LSTM), and convolutional neural network (CNN) models. Our results show that bootstrapping the biosensor signals (i.e., EMG and EEG) can greatly increase emotion classification performance by reducing noise; for these signals, the best classification results are obtained with a traditional KNN, whereas audio and image sequences of human emotion are better classified with an LSTM.


I. INTRODUCTION
In daily life, emotions abound, and there are countless reasons for determining someone's emotional state, including better communication and work efficiency. In the product development process, product features and design can be made more suitable for users by analyzing the users' emotional states during their user experience. In medical care, caregivers can provide better care if a patient's emotional states in different situations are known. Emotion recognition has been an important interdisciplinary research topic in various fields, including psychology, neuroscience, and artificial intelligence. Many emotion classification studies use deep learning methods in combination with state-of-the-art statistics to optimize the accuracy of emotion detection, and many attempt to integrate multiple modalities for better accuracy.
With increasing attention to emotion recognition, which is detailed in the Related Work section, many emotional datasets have been collected, including both nonphysiological signals (e.g., facial expressions and speech) and physiological signals (e.g., the electroencephalogram (EEG), electromyogram (EMG), and electrooculogram (EOG)).
The two major categories that emotion datasets usually fall into are posed and spontaneous expressions [1]. Posed expressions are more intense and less ambiguous, as the test subjects receive instructions to act out or perform an emotion. Spontaneous expressions contain more valuable information on natural expressions, but they are more difficult to evaluate than posed expressions; the results rely on the subjects' self-reports, which may introduce differences between the actual and reported emotion experienced [2].
Deliberate behavior is often exaggerated and may fail to generalize to real-world behavior [3]. Posed emotions are widely used in the 22 dynamic facial expression datasets surveyed in [4]. Unlike these posed facial expression datasets, spontaneous expressions are more widely used in physiological emotion datasets, where music or video clips are often used to elicit subjects' emotions. These datasets usually collected subjects' self-reports of valence and arousal levels on a continuous scale, and these values can be used to categorize emotion [5]. However, there could be errors in subjects' self-assessment reports, and the same emotions from different subjects could have different valence and arousal levels. Posed emotions are more systematically controlled and contain a direct correspondence between the collected data and the associated emotion. Therefore, databases with deliberately posed emotions are usually more reliable for obtaining multimodal data and provide higher accuracy in emotion recognition [6].
Although there are many emotional datasets, multimodal emotional datasets, especially ones that include physiological data collected from posed emotions, are still deficient. In this study, we collected a new posed multimodal emotion dataset called PME4 to study emotion recognition from both nonphysiological (audio and video) and physiological signals (EEG and EMG). The recordings consist of image sequences of actors producing facial expressions and of their speech while uttering a generic sentence, "The sky is green." EEG signals reflect brain activity and EMG signals reflect facial muscle movement during these utterances. Each modality has its unique contribution to emotion recognition and might have a different impact at different emotion processing stages (i.e., pre-speech, during-speech, post-speech). Hence, instead of aiming to improve existing emotion recognition methods, our goal is to provide a new posed multimodal emotional dataset to the research community (the dataset and code are publicly available to enrich state-of-the-art approaches to emotion recognition research), together with different feature extractors and machine learning models across the four modalities to classify emotional expressions.
The key contributions of this work include: 1) A new posed multimodal emotion dataset with four modalities (PME4): audio, video, EEG, and EMG; 2) A thorough comparison and analysis of a set of state-of-the-art data preprocessing and feature extraction techniques for each modality (including bootstrapping, principal component analysis, convolutional autoencoders, and/or Mel-frequency cepstral coefficients); 3) Comparisons of a few baseline machine learning methods (KNN, SVM, Random Forest, MLP, CNN, LSTM) in classifying emotions with optimal features; and 4) A comprehensive survey of emotion recognition in terms of datasets, features, and recognition methods.
The remainder of the paper is structured as follows. First, existing emotion datasets, feature extraction techniques, and emotion classification methods are described in Section 2. Next, we detail the data collection process for the PME4 dataset in Section 3. Afterward, the feature extraction for each modality and the classification methods are discussed in Section 4. Results and analysis of emotion recognition with each modality are provided in Section 5. Finally, we provide a conclusion and discuss limitations in Section 6.

II. RELATED WORK

A. EMOTION DATASETS
With the increase in human-computer interactions, more emotional databases are being developed to better classify emotions, especially from physiological signals. Some popular emotion databases are listed in Table I. The table lists these datasets with several important aspects: number of subjects, emotion states, elicitation, data types, feature extraction methods, and classifiers. Here we mainly focus on the datasets; the last two aspects will be discussed more thoroughly later. CK [7] and CK+ [8] are the most widely used facial expression datasets, collected by Kanade et al. [7,8]. CK+ [8] consists of both frontal views and 30-degree views of 123 subjects' facial expressions, obtained by instructing subjects to perform different expressions, including anger, contempt, disgust, fear, joy, surprise, and sadness. The authors used active appearance models to track the subjects' face shape across the image sequences, then extracted similarity-normalized shape (SPTS) and canonical appearance (CAPP) features and classified the action units (AUs) and emotions using a linear support vector machine (SVM). The linear SVM obtained 94.5% accuracy in AU detection and 83.33% in emotion detection when using both SPTS and CAPP, which is better than using the individual features. Brain electrical activity measured by electroencephalography (EEG) has recently become an interesting avenue for detecting internal emotional states [9,10]. Valence and arousal are commonly used measurements to characterize emotion. Both the DEAP [11] and DECAF [13] datasets collected multimodal physiological signals elicited by music videos and/or affective movie clips. The emotion state was determined by subjects' self-evaluated arousal and valence scores, where valence is associated with the level of happiness and arousal with the level of calmness [11]. The SEED dataset [12] also used video clips for eliciting emotion, but unlike DEAP and DECAF, each video clip is associated with one of three emotional states: positive, negative, and neutral. The participants of the SEED dataset were extroverts with stable moods based on the Eysenck Personality Questionnaire.
Similarly, the MPED [14] dataset also used video clips to elicit target emotion states. The 28 video clips and their corresponding target emotions were selected from about 1,500 candidate clips based on participants' self-scoring on three psychological questionnaires, evaluated with the k-means algorithm. All four datasets consisted of 32 to 62 EEG channels or 306 MEG channels along with other peripheral physiological signals, as shown in Table I.
Multiple other studies have attempted to determine emotional state based on EEG signals. Lan et al. [46] proposed using an autoencoder in combination with the K-means clustering algorithm to automatically learn meaningful frequency features from the power spectral density of the EEG signals. Zhang and Lu [12] applied a critical EEG channel selection method based on the weight distribution of a trained Deep Belief Network (DBN) model with differential entropy features from five different frequency bands of the EEG signals. This method achieved similar accuracy (82.88% to 86.65%) with fewer EEG channels (4 to 12) compared with 86.08% accuracy using all 62 EEG channels of the SEED dataset when classifying the three emotional states.
Lan et al. [47] used domain adaptation techniques on the SEED and DEAP datasets to reduce inter-subject variance and technical differences between the datasets. The reported accuracies were 72.47% for SEED and 48.93% for DEAP using maximum independence domain adaptation (MIDA) with differential entropy features. Soroush et al. [48] proposed an angle space reconstruction to obtain geometrical features from the EEG phase space. The reported classification accuracy over the four valence-arousal spaces was 91.37% using the statistically significant features together with the nonlinear features extracted from the estimated differential angle and vector length in the angle space.
Time-frequency analysis is also being widely used in EEG signal processing. In [49], the multivariate synchrosqueezing transform (MSST) method based on continuous wavelet transform has been used to obtain features that stem from multichannel dependency in addition to the mono-channel features. The joint instantaneous frequency and bandwidth estimate the multivariate bandwidth for all channels to partition the time-frequency domain. This method achieved an accuracy of 86.93% in classifying eight emotional states in DEAP. Abadi et al. [13] used the discrete cosine transform (DCT) feature to obtain the spatio-temporal patterns of DECAF's MEG data. They reported 62% and 59% accuracy in determining arousal and valence level using the linear SVM classifier.
Song et al. [14] proposed a novel attention-long short-term memory (A-LSTM) model to extract more discriminative features by capturing the information of interest from different sequences using residual connections. The model also uses a convolution kernel with a size of 1x1 to avoid the interaction of different channels. It achieved 76.06% accuracy in classifying seven emotions from the MPED dataset with the higher-order crossings features.
Although there are many physiological emotion databases with multiple modalities, most of them used spontaneous expressions and relied on the subjects' self-report arousal and valence levels. Many problems can arise when trying to match the collected data with corresponding emotions, such as inaccurate self-report values, differences between various subjects' report values for the same emotion, and multiple emotions being elicited simultaneously [52].
Unlike other datasets, the PME4 dataset collected posed emotions, where subjects were asked to express emotions. All subjects were either acting students or had acting experience, which helped to minimize variance in the data, as actors are trained to express the exact emotions indicated by the instruction. Most likely, subjects also experienced these emotions through embodied cognition, thereby providing more comprehensive matches between the collected data and its associated emotions. PME4 is a comprehensive dataset that consists of synchronized physiological and nonphysiological signals and can be used to compare emotions that the subjects expressed, in contrast to previous studies that typically measured viewers' EEG activity in response to different stimuli intended to elicit different emotions. As subjects were required to switch between emotions within a short time, their physiological signals might not immediately reflect the instructed emotion, as compared with the non-physiological signals. This may especially be the case for the EEG signals reflecting brain activity, where an emotion aftereffect was found in our previous study [41]. Moreover, each emotion period in all four modalities of our PME4 dataset can be separated into three stages: pre-speech, during-speech, and post-speech. Each stage could lead to different classification performance in the four modalities, especially for the EEG signals during the speech stages, which are unique relative to other emotional datasets. The research community could find new insights by analyzing the EEG activity under different stages with various emotional states. In addition, PME4 also consists of data from five data collection time blocks for each subject with nearly evenly distributed sample sizes for each emotional state, allowing researchers to analyze how different time slots could impact subjects' emotions. Finally, integrating multiple modalities can improve the inference of emotions [15,16].

B. FEATURE EXTRACTION TECHNIQUES
Extracting meaningful features from the raw data is a critical step for emotion recognition, as the classifiers cannot have optimal performance with noisy and/or uninformative data. Each modality contains different information and so we need to use different feature extraction methods.
Feature transformation techniques are used to reduce the data dimensionality by transforming data into a feature space. Principal component analysis (PCA) uses an orthogonal transformation to remove data redundancy by finding the projection matrix that maps the original high-dimensional feature space onto a low-dimensional component subspace. The first component captures more of the variance in the original features than the second, and so on [19]. PCA has been applied to image and EEG pattern classification [9,44,45].
Speech signals contain significant information that can be used to identify and understand the speaker's emotion; however, these signals often contain "uninformative" information, such as background noise and acoustic variability across speakers. There are various feature extraction techniques available for obtaining meaningful audio features through eliminating noise, such as Mel frequency cepstral coefficient (MFCC), perceptual linear prediction coefficient (PLPC), linear predictive cepstral coefficient (LPCC), linear predictive coder analysis (LPC), etc. [21].
MFCC has been widely used in speech recognition systems, as it represents the audio signal with a cepstrum on the mel scale, which is close to the human auditory system [11,22]. It extracts frequency-domain features, which perform better than time-domain features [23]. Extracting MFCCs includes the following key steps: framing and windowing the signal with a Hamming window, time-domain to frequency-domain conversion with the FFT, Mel log power computation with a bank of filters, and MFCC computation with the discrete cosine transform [24,25]. Besides speech signals, MFCC can also be used to extract features from EMG signals recorded from several facial muscles [26]. Studies that applied MFCC to EMG data for classification have suggested that a large time frame is needed to extract a better representation of the EMG features [27].
An autoencoder is a well-known feature extractor that contains two major parts: an encoder and a decoder. The encoder extracts meaningful features from the data, and the decoder reconstructs the original data from the features extracted by the encoder. Multiple studies have used an autoencoder to extract high-dimensional EEG information [9,46,50]. Convolutional autoencoders are also widely applied for obtaining salient feature vectors from image data [28,29]. They use convolution layers to extract the input's significant features while preserving the relationship between the pixels and the extracted features. Convolutional networks excel at capturing valuable spatial correlations in images, and deeper networks can capture more abstract features [30]. The autoencoder is trained directly in an end-to-end manner, without applying any regularization, to ensure that no features are lost between layers.
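As an illustration, the following is a minimal Keras sketch of a convolutional autoencoder used as a feature extractor; the layer sizes and input resolution are illustrative assumptions, not the architecture of any particular study cited here.

```python
# Minimal convolutional autoencoder sketch (illustrative sizes): the encoder
# compresses an image into a low-dimensional feature vector, and the decoder
# reconstructs the image from that vector.
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(64, 64, 1))                                  # grayscale input (assumed size)
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
code = layers.Dense(128, activation="relu", name="feature")(x)          # bottleneck feature vector

x = layers.Dense(16 * 16 * 32, activation="relu")(code)
x = layers.Reshape((16, 16, 32))(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
out = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(x)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
# After end-to-end training on reconstruction, reuse the encoder as a feature extractor:
encoder = Model(inp, code)
```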
Pre-trained CNN models are usually better at retrieving meaningful generic features, especially from images. The VGG neural networks (VGG16, VGG19) [53] have been widely used in image classification and in extracting image features for emotion recognition [54,55]. Even though the VGG networks are pre-trained for classifying various object categories rather than human faces, ImageNet contains vast numbers of samples, so the convolution filters have learned to extract the key features of face images as well. We use them to extract image features for emotion classification.

C. EMOTION RECOGNITION METHODS
Many traditional classifiers have been used in emotion recognition, such as Support Vector Machines (SVM) [11,13,16,31,42,44,49], K-Nearest Neighbors (KNN) [12,14,31,32,42], Random Forest (RF) [49,54,57], and the Multi-layer Perceptron (MLP) [10,48,54]. KNN is an intuitive and straightforward supervised method, which decides the sample class by majority voting among the K nearest training samples. SVM utilizes a radial basis function kernel to improve performance on high-dimensional data. Random Forest is an ensemble learning method that consists of multiple independent decision trees. MLP is a non-linear classifier that uses the backpropagation algorithm to update network weights. For our experiments, these classifiers are used as the baselines for emotion classification.
The Convolutional Neural Network (CNN) is commonly applied in areas related to analyzing visual images, such as object detection, image recognition and classification, and facial recognition. A CNN contains convolution layers that extract the input's significant features while preserving the relationship between the 2D spatial domain and the extracted features, and it has been used for emotion recognition [53,55]. We can also convert temporal data into a two-dimensional time-frequency representation, then use a CNN to find the relationship between the time domain and the frequency/spatial domain to determine the corresponding emotion. We use the CNN as a baseline model with strong local spatial learning ability through its convolutional layers.
Long Short-Term Memory (LSTM) [33] is a special type of recurrent neural network with feedback connections that can process a sequence of data [34]. It overcomes the vanishing gradients problem of the traditional RNN [35] and has shown state-of-the-art performance on time series data, including emotion recognition [14,41,56]. The RNN unit has neurons representing a data sequence's temporal dependency and has a vanishing and exploding gradient problem with a long or unstable data sequence. The LSTM solves this problem by integrating more memory gates to allow the network to learn long temporal data. The LSTM unit consists of a memory cell and three gates, the input, output, and forget gates. With the additional gates, these units can forget the previous states and update current states as new information is provided. The input gate controls the input signal's impact on the state of the memory cell. The output gate is responsible for the change of the hidden state based on the memory cell. The forget gate controls the impact of the previous hidden state. LSTM has outperformed other traditional classifiers in classifying emotion in Song et al. [14]. We use LSTM as a baseline model to deal with all four modalities that are inherently time sequences.

III. DATA COLLECTION
According to psychologist Paul Ekman [17], the six basic emotions are anger, fear, disgust, sadness, happiness, and surprise. Public emotion databases are usually categorized into five to eight emotions. This study focuses on recognizing the six basic human emotions in [17] plus a neutral emotion, for a total of seven emotions. The data were collected from 11 human subjects (5 female and 6 male), all students in acting, after informed consent. The Institutional Review Board of the City University of New York approved this study. To enhance the accuracy of the collected posed emotions, all subjects had some acting experience. Data collection took approximately four months, and each test session for each subject lasted approximately two hours. The entire test session was divided into five blocks, with 10 trials of each emotion presented in random order in each block. Subjects were allowed to take an optional break between blocks. We included a large number of repetitions of each emotion for each subject in the dataset to minimize the effects of variability and noise. Each trial was five seconds long, with one of the seven emotion labels presented on a monitor placed 57 cm in front of the subject. Subjects were required to utter the generic sentence "The sky is green" while producing the facial expression and experiencing the emotion indicated by the presented label. This sentence was chosen because of its neutral content, thereby minimizing interference with any emotion that the subject was trying to experience and express. Each emotion label was displayed for 4 seconds, and a one-second break was given between emotions. Overall, the longest time for subjects to finish speaking the sentence was approximately 3 seconds.
Multiple issues can arise during data acquisition, such as electrodes becoming loose, interruptions from external sources, large head movements that cause the faces not to be fully captured in the images, etc. After removing these error trials, 3829 trials remained across all four modalities, with the details shown in Table II.

A. AUDIO AND VIDEO
During the test session, the subjects' facial emotional expressions were video recorded with a Logitech V-UCR45 USB webcam attached to a MacBook Pro 15" Retina Display (Late 2013), and the subjects' voices were recorded with the laptop's microphone. The laptop was placed in front of the subject to ensure adequate quality of the acquired video and audio. The audio signals were recorded at a 44.1 kHz sampling rate and the video at a resolution of 960 x 720 pixels and 10 FPS.

B. EEG AND EMG
The EMG and EEG signals were acquired using gold-plated surface electrodes connected to Grass amplifiers. The EMG data were bandpass filtered online between 50 and 1000 Hz, whereas the EEG data were bandpass filtered online between 0.1 and 100 Hz. We used a 5kHz sampling rate, and all electrode impedances were below 10 kΩ at the beginning of the experiment. The six muscles chosen for recording the EMG activity were the depressor anguli oris, zygomaticus major, levator labii superioris alaeque nasi, levator labii superioris, procerus, and occipitofrontalis (Fig. 1(a)), which are the major muscles involved in speech and in the associated facial Action Units (AUs) used for facial emotion recognition [18]. Note that the electrodes covered only half of the face, so that the video data could still be used for facial expression recognition. The EEG data were collected through eight surface electrodes placed on the scalp: F3, Fz, F4, Cz, P3, Pz, P4, and O2 (Fig. 1(b)). In total, we used 16 electrodes: 6 for EMG, 8 for EEG, one ground electrode placed on the nasion, and two reference electrodes placed on the left and right mastoids. All data were referenced online to the left mastoid and re-referenced offline to the average of the left and right mastoids.

IV. METHODS
The dataset contains both EEG and EMG signals together with the corresponding audio-video data for 11 subjects. Although our sample size is small, it is on the same order as that of many neuroscience experiments, and the results remain statistically significant for emotion recognition with such a dataset.

A. AUDIO
Non-speech intervals are largely uninformative and contain noise that impacts the features used for emotion classification. To minimize this noise, we focus only on the speech interval of the audio data. We used a CNN-based audio segmentation method [51] to extract the speech interval for each trial. After manually checking and fixing the extraction results, the intervals during which subjects spoke the generic sentence "The sky is green" ranged from 0.75 to 3 seconds. Resampling the speech segments to a uniform length causes multiple problems; for example, high-frequency content could alias into the low-frequency range, which would eventually provide invalid information for feature extraction. Therefore, instead of resampling, we used a fixed 3-second speech interval. To extract this 3-second during-speech stage, we started from the center of the speech interval automatically detected by the CNN-based audio segmentation method [51], then expanded 1.5 seconds before and 1.5 seconds after the center.
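As an illustration of this centering step, the following is a minimal NumPy sketch, assuming the detected speech interval is given as start and end sample indices; the boundary handling (shifting the window rather than padding) is our own assumption.

```python
import numpy as np

def center_window(signal, start, end, sr=44100, duration=3.0):
    """Take a fixed-length window centered on a detected speech interval.

    signal: 1-D audio array for one 5-second trial
    start, end: sample indices of the detected speech interval
    """
    half = int(duration * sr / 2)
    center = (start + end) // 2
    lo, hi = center - half, center + half
    # Clamp to the trial boundaries (assumption: shift the window rather than pad).
    if lo < 0:
        lo, hi = 0, 2 * half
    if hi > len(signal):
        lo, hi = len(signal) - 2 * half, len(signal)
    return signal[lo:hi]
```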
To extract audio features, each 3-second during-speech audio sample was first normalized, and a Hamming window was then applied to each frame. Finally, the 20 most significant Mel-frequency cepstral coefficients (MFCC) were extracted from each Hamming-windowed time interval using 20 filter banks between 300 Hz and 3700 Hz, following the MFCC parameters used for emotion analysis in Dahake et al. [36]. These extracted features form a sequence that embeds both frequency (20 MFCCs) and time (the number of Hamming window intervals within the 3-second segment) information.
We tried two different Hamming window sizes, 20 ms intervals with 10 ms offsets and 100 ms intervals with 50 ms offsets, to compare the influence of the window size on feature extraction performance. In total, for each 3-second speech segment, we obtain a 299x20 MFCC feature matrix for the 20 ms windows and a 59x20 matrix for the 100 ms windows.
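A minimal sketch of this extraction using the librosa library is shown below; librosa itself and any unstated settings are assumptions, while the stated parameters (20 MFCCs, 20 filter banks, 300-3700 Hz, a Hamming window, and 20 ms frames with 10 ms offsets) follow the description above.

```python
import numpy as np
import librosa

def extract_mfcc(speech, sr=44100, win_ms=20, hop_ms=10):
    """20 MFCCs per Hamming-windowed frame of a 3-second speech segment."""
    speech = speech / (np.max(np.abs(speech)) + 1e-8)        # amplitude normalization
    mfcc = librosa.feature.mfcc(
        y=speech, sr=sr, n_mfcc=20, n_mels=20,
        fmin=300, fmax=3700,
        n_fft=int(sr * win_ms / 1000),
        hop_length=int(sr * hop_ms / 1000),
        window="hamming", center=False)
    return mfcc.T          # shape (n_frames, 20), e.g. (299, 20) for 20 ms / 10 ms
```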
As MFCC features contain time information, we used an LSTM for analysis, since the LSTM architecture is well suited to time series. As the speech duration for each trial varies, some portions of the 3-second interval likely contain noise. Therefore, instead of taking only the last output state of the LSTM, we connected the output state of each LSTM cell (over time) to a fully connected layer (with dense cells), then averaged the outputs of the dense cells to get the final prediction through a softmax, as illustrated in Fig. 3. We also applied an ensemble learning approach by training 30 simple LSTM models, each with the same structure shown in Fig. 3. After all 30 models were trained with the same training dataset, we averaged the outputs of each model's softmax layer. The averaged result is the final probability of each emotion class, and the emotion with the highest probability is our final prediction for the input data.
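A minimal Keras sketch of our reading of this architecture (per-timestep dense outputs averaged before the softmax, and an ensemble of 30 identically structured models whose softmax outputs are averaged at prediction time) is shown below; the LSTM width and training settings are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import layers, Model

def build_lstm(timesteps, n_features, n_classes=7, units=64):
    inp = layers.Input(shape=(timesteps, n_features))
    h = layers.LSTM(units, return_sequences=True)(inp)           # output state at every timestep
    h = layers.TimeDistributed(layers.Dense(n_classes))(h)       # dense layer applied per timestep
    h = layers.GlobalAveragePooling1D()(h)                       # average the dense outputs over time
    out = layers.Softmax()(h)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Ensemble: train 30 identically structured models, then average their softmax outputs.
models = [build_lstm(299, 20) for _ in range(30)]
# for m in models: m.fit(X_train, y_train, epochs=..., batch_size=...)
# probs = np.mean([m.predict(X_test) for m in models], axis=0)
# y_pred = probs.argmax(axis=1)
```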
To evaluate the effectiveness of the LSTM approach, we also compared it with our baseline models. These include KNN with K equal to 10 using the Euclidean distance to determine the nearest neighbors, a Support Vector Machine (SVM) with a Gaussian kernel, Random Forest (RF) with 100 estimators and a maximum depth of 7 to reduce overfitting, and a multilayer perceptron (MLP) with 512 hidden nodes. The input trials to these baseline models are flattened into one dimension with a size equal to the number of timesteps multiplied by the number of features.
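A scikit-learn sketch of these baselines with the hyperparameters stated above; all unstated settings are left at the library defaults, which is an assumption.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Inputs are flattened to shape (n_trials, timesteps * n_features) before fitting.
baselines = {
    "KNN": KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
    "SVM": SVC(kernel="rbf"),                                    # Gaussian (RBF) kernel
    "RF":  RandomForestClassifier(n_estimators=100, max_depth=7),
    "MLP": MLPClassifier(hidden_layer_sizes=(512,)),
}
# X_flat_train = X_train.reshape(len(X_train), -1)
# for name, clf in baselines.items():
#     clf.fit(X_flat_train, y_train)
#     print(name, clf.score(X_flat_test, y_test))
```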
To compare the performance of different classifiers, we used K-fold cross-validation with K equal to 5 in our current implementation. We randomly split each subject's data samples for each emotion evenly into the five subsets. This split ensures randomness within the training and testing datasets and maintains enough samples for each emotion and subject during the training process. Most importantly, it helps minimize information leakage and provides a more accurate estimate of model performance. Each subset contained either 765 or 766 samples, and the classifiers were trained on four subsets and tested on the remaining subset.
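A sketch of this split, under our reading that each subject's trials of each emotion are divided evenly and randomly across the five folds; the fold-assignment mechanics shown here are an assumption.

```python
import numpy as np
from collections import defaultdict

def subjectwise_folds(subjects, emotions, n_folds=5, rng=None):
    """Assign each trial to one of n_folds folds so that every (subject, emotion)
    group is split evenly across folds."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.empty(len(subjects), dtype=int)
    groups = defaultdict(list)
    for i, key in enumerate(zip(subjects, emotions)):
        groups[key].append(i)                      # trial indices per (subject, emotion)
    for idx in groups.values():
        idx = rng.permutation(idx)                 # shuffle within the group
        for f, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[chunk] = f
    return folds   # train on folds != k, test on fold == k
```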

B. IMAGE
As some subjects did not provide consent to release the original image data, we applied multiple feature extraction methods to the original images and release only the extracted features, which are used for training the machine learning models in place of the original images.
As with the audio processing, we focused on the image sequence during each trial's utterance interval (the during-speech stage), as it contains the most emotional expression. However, the lengths of the during-speech stages vary considerably across trials, and the average during-speech interval was 1.3 seconds. To equate the speech interval across trials, we extracted an image sequence of 16 screenshots per trial, evenly sampled from the central 1.5-second during-speech window at 10 FPS. Before extracting the image features, we cropped the face area in each frame, as other regions do not contain any emotional information. We used the open-source MTCNN [37] and the Python face recognition library [38] built on dlib [39] for face detection and extraction. As existing face detection networks do not always detect the correct face region, manual correction was applied to fix any errors in the extracted images. Note that we used electrode paste and transparent tape to affix the surface electrodes to the face for collecting EMG data (see Fig. 4), which minimized occlusion of the facial expression. The cropped face images varied in size and were resized to 224x224 pixels, the input size required by the pre-trained networks.
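A minimal sketch of the cropping step using the face_recognition library mentioned above; taking the first detected box and the choice of resize interpolation are our own assumptions, and frames without a detection are flagged for the manual correction described above.

```python
import cv2
import face_recognition

def crop_face(frame_bgr, size=224):
    """Detect the face in one video frame, crop it, and resize to size x size pixels."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    boxes = face_recognition.face_locations(rgb)       # list of (top, right, bottom, left)
    if not boxes:
        return None                                    # no detection: flag for manual correction
    top, right, bottom, left = boxes[0]
    face = frame_bgr[top:bottom, left:right]
    return cv2.resize(face, (size, size), interpolation=cv2.INTER_AREA)
```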
We applied four different feature extraction techniques to the extracted facial images: PCA, convolutional autoencoder, and two pre-trained networks (VGG16 and VGG19) [53].
Each image contains three color channels; however, color does not have much influence on emotion recognition. Before applying PCA, we converted all images to grayscale and normalized the grayscale values to [0, 1]. The PCA transform matrix was computed from all images in the training set and then applied to both the training and testing sets to obtain their corresponding PCA components.
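A scikit-learn sketch of this step, fitting the projection on the training images only; the 50 components match the number reported in the results, and other settings are library defaults (an assumption).

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_features(train_imgs, test_imgs, n_components=50):
    """train_imgs, test_imgs: grayscale frames scaled to [0, 1], shape (N, 224, 224)."""
    X_train = train_imgs.reshape(len(train_imgs), -1)
    X_test = test_imgs.reshape(len(test_imgs), -1)
    pca = PCA(n_components=n_components)
    train_feats = pca.fit_transform(X_train)   # projection computed on the training set only
    test_feats = pca.transform(X_test)         # same projection applied to the test set
    return train_feats, test_feats, pca.explained_variance_ratio_.sum()
```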
Convolutional networks should be more powerful than PCA at obtaining significant visual features for larger image sizes. The convolutional autoencoder consists of an encoder (shown in Fig. 5) and a decoder (a mirror of the encoder, sharing the same parameters and replacing the convolutional layers with transposed convolutional layers). A feature vector of 2048 elements was obtained for each image from the output of the encoder's final layer (a dense layer). Images were also passed into VGG16 and VGG19 with weights pre-trained on ImageNet to extract features. We used the output of the last max pooling layer, of size 7x7x512, as the image feature, since the remaining layers are used for classification. We also tried InceptionV3 [58] and ResNet50 [59], but the extracted feature size was too large for our classifiers, so those results are not reported.
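A sketch of the VGG16 feature extraction with the Keras ImageNet weights: dropping the classification layers (include_top=False) leaves the 7x7x512 output of the last max-pooling layer described above. The flattening step is an assumption about how the features are fed to downstream classifiers.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# include_top=False removes the classification layers; for 224x224 inputs the
# output of the last max-pooling layer has shape (7, 7, 512).
vgg = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

def vgg16_features(images):
    """images: float array of shape (N, 224, 224, 3) in RGB order."""
    feats = vgg.predict(preprocess_input(images.copy()), verbose=0)
    return feats.reshape(len(images), -1)      # flatten the 7x7x512 map per frame if needed
```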
To ensure a fair evaluation of the extracted features, we used the same 5-fold cross-validation technique as in the audio processing. As the LSTM is more powerful at dealing with temporal features, for each of the four extracted feature types, the LSTM (Fig. 3) performance was compared with the baseline methods described in the audio section.

C. EEG
The EEG data were recorded from scalp electrodes and reflect activity from a large number of neuron potentials. Because the EEG signal also contains noise, extracting meaningful information from the EEG signals can be challenging.
Scalp EEG signals are typically unstable and noisy, so we applied several noise reduction techniques when preprocessing the EEG data. First, the EEG data were converted to voltage values and then re-referenced to the right mastoid. Since neural activity measured with non-invasive EEG electrodes is more robust in the 0.1 to 30 Hz frequency band range, a bandpass Butterworth filter from 0.1Hz to 30Hz was applied to eliminate noise and less meaningful parts of the signals. After the filtering, the data were downsampled from 5kHz to 1kHz for the sake of data dimension reduction in the later steps while still maintaining a high fidelity of the neural signals.
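A SciPy sketch of this filtering and downsampling step; the filter order and the use of zero-phase filtering are assumptions, since only the passband (0.1-30 Hz) and the 5 kHz to 1 kHz downsampling are stated above.

```python
import numpy as np
from scipy import signal

def preprocess_eeg(eeg, fs=5000, fs_out=1000, low=0.1, high=30.0, order=4):
    """eeg: array of shape (n_channels, n_samples) recorded at 5 kHz."""
    sos = signal.butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    filtered = signal.sosfiltfilt(sos, eeg, axis=-1)        # zero-phase Butterworth bandpass
    factor = fs // fs_out                                    # 5 kHz -> 1 kHz
    # decimate applies its own anti-aliasing filter before downsampling
    return signal.decimate(filtered, factor, axis=-1, zero_phase=True)
```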
EEG recordings often include various artifacts, including those from blinks and facial movements. Because subjects in this study made facial expressions to convey different emotions, it was essential to minimize the influence of these extraneous, non-neural signals. We therefore applied an automatic artifact detection method that removed the impact of the muscular activity on the EEG signal. To remove the EMG effect on the EEG signals, we used the AAR plug-in for EEGLAB [40] and applied this removal process before bandpass filtering.
To further minimize noise, we applied a bootstrapping method when extracting EEG features, averaging over multiple trials of the same emotion for the same subject to obtain a more stable EEG signal. We processed the data of all 11 subjects with the same bootstrapping method, shown in Fig. 7. First, we extracted each emotional state from each subject, where the number of trials per emotion ranges from 46 to 51 per subject (see Table II). Second, we randomly split each subject's trials of each emotion evenly into training and testing subsets, each consisting of 23 to 26 trials. Then, for each subset, we randomly selected 20 trials and averaged them to obtain a new sample. This last step was repeated 400 times for the training subset and 100 times for the testing subset of each subject-emotion pair.
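A NumPy sketch of this bootstrapping step for one subject-emotion pair; whether the 20 trials are drawn with or without replacement within each bootstrapped sample is not stated above, so drawing without replacement here is an assumption.

```python
import numpy as np

def bootstrap_trials(trials, n_samples, n_average=20, rng=None):
    """trials: array (n_trials, n_channels, n_timepoints) for one subject and emotion.

    Returns n_samples bootstrapped samples, each the average of n_average
    randomly selected trials.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty((n_samples,) + trials.shape[1:])
    for i in range(n_samples):
        idx = rng.choice(len(trials), size=n_average, replace=False)
        out[i] = trials[idx].mean(axis=0)        # averaging suppresses trial-to-trial noise
    return out

# e.g. 400 samples from the training subset and 100 from the testing subset:
# train_boot = bootstrap_trials(train_trials, 400)
# test_boot  = bootstrap_trials(test_trials, 100)
```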
The bootstrapping method results in 400x11x7 training samples (400 bootstrapped samples, 11 subjects, and 7 emotions) and 100x11x7 testing samples, and it overcomes the issue of the limited number of trials available for training the models. Each of the eight EEG channels contains a sequence vector of 5x1000 elements (five seconds at 1000 samples per second after downsampling), which was input into the 1D CNN and LSTM models. As with the image processing, we also applied PCA feature extraction to each EEG channel and tested the performance with the baseline models. We obtained 50 PCA components for each EEG channel, accounting for over 97% of the energy spectrum. We again applied the 5-fold cross-validation technique to validate the performance of this method, repeating the bootstrapping process five times with different training and testing subsets for each subject-emotion pair.
We also applied the autoencoder method described in [46] together with the K-means clustering algorithm to extract ten features (K=10) for each EEG channel. The input to the autoencoder is a raw periodogram from 0.1 to 30 Hz with a resolution of 0.2 Hz, resulting in a vector of size 155 for each channel. Two types of features result from this method. The first is the features extracted directly from the hidden layer of the autoencoder, which contains 100 features for each of the eight channels. The second is the ten features from the 10 cluster groups formed based on the similarity of the hidden layer weights for each channel, where each feature is the average of the hidden node values in that cluster group.
LSTM models were used to evaluate the features extracted by these methods. Unlike the LSTM in Fig. 3, we used the last output state of the LSTM layer and connected it to the softmax layer to obtain the emotion class for the EEG and EMG data, because we took the entire 5-second interval of the EEG (and EMG) data as one feature vector. The eight channels of the EEG data were treated as the eight timesteps for the LSTM model. Besides the LSTM, we also applied a CNN model to both the EMG and EEG features. The CNN model for the EEG PCA features is shown in Table III; the architecture is similar for other input sizes. It contains 1D convolution layers with a kernel size of 4 and a stride of 2, an average pooling layer, and a dropout layer with a dropout rate of 0.2 after the second convolutional layer. The four baseline models discussed in the audio section were also applied for comparison with the LSTM and CNN results.
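A Keras sketch consistent with this description (1D convolutions with kernel size 4 and stride 2, average pooling, and a dropout rate of 0.2 after the second convolutional layer); the filter counts and the classification head are illustrative assumptions, since Table III is not reproduced here.

```python
from tensorflow.keras import layers, Model

def build_cnn(timesteps, n_channels, n_classes=7):
    """1D CNN sketch: two Conv1D layers (kernel 4, stride 2), average pooling,
    and dropout 0.2 after the second convolutional layer."""
    inp = layers.Input(shape=(timesteps, n_channels))
    x = layers.Conv1D(32, kernel_size=4, strides=2, activation="relu")(inp)
    x = layers.AveragePooling1D(pool_size=2)(x)
    x = layers.Conv1D(64, kernel_size=4, strides=2, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```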

D. EMG
Like EEG, EMG data are also noisy. Thus, we applied a Butterworth filter from 20Hz to 500Hz to EMG data to minimize noise within the signals. The EMG data were processed using the same bootstrapping methods as the EEG data.
We also used MFCC for EMG feature extraction, as EMG signals are often analyzed together with audio data. Norali's work [27] suggested that longer frame sizes represent EMG data better; based on [27], frame sizes of 2000 ms, 4000 ms, 5000 ms, and 10000 ms result in higher accuracies. However, each of our trials is only 5 seconds long, so we used the 2000 ms window size. For the first method, we applied a window size of 2000 ms with a 1000 ms overlap to a single trial to extract its MFCC features. For the second method, a bootstrapping method similar to that used for the EEG (Fig. 7) was applied, with the same window size applied to the average of 20 EMG trials before extracting the MFCC features. For the third method, instead of averaging the 20 trials, we concatenated them to form a longer "super-trial", and used the same window size to obtain the MFCC features.
We extracted 12 MFCCs for each EMG channel at each time step and concatenated all channels' coefficients into one feature vector. LSTM and CNN models were used to classify emotion from the EMG feature vectors, with architectures similar to those used for the EEG data. The input to the LSTM and CNN was two-dimensional: 12x6 coefficients (12 MFCCs for each of the 6 channels) by the number of time steps, which depends on the window size and the method. For the four baseline models, all the coefficients were concatenated into a one-dimensional feature vector.

V. RESULTS AND ANALYSIS
Table V summarizes the emotion recognition results for the four sensory modalities with various feature extraction methods and different classifiers (four baseline models and three deep learning methods). We discuss each modality in turn.

A. AUDIO
Table V shows that the best classification model for audio data with MFCC coefficients is the ensemble of 30 LSTM models for both window sizes. We obtain an average accuracy of 71.32% for the 20 ms window and 69.60% for the 100 ms window. This performance is better than all baseline models, whose accuracies range between 42.7% and 56.57%. To statistically validate that the LSTM model performs better than the baseline models, we compared the LSTM with the most accurate baseline model (SVM) using a t-test and obtained highly significant p-values of 1.24e-7 and 3.99e-7 and t-values of 17.34 and 14.92 for the 20 ms and 100 ms windows, respectively, confirming that the LSTM outperformed the baseline models. The difference between the 20 ms and 100 ms windows was not significant (t = 1.7299, p = 0.1219), suggesting that the window size does not have much influence on the MFCC characteristics of the audio data. The details of the t-test results comparing the LSTM model with all the other models for all four modalities are shown in Table VI. The model can be confused between emotions that result in a similar tone. As shown in Fig. 8, "fear", the emotion with the lowest classification accuracy, is misclassified as disgust, surprise, or sadness approximately 8% to 10% of the time. One reason could be that some subjects over-exaggerated their voices, making it difficult for the models to find clear boundaries between emotion categories. This can be further seen in Table IV, where the model's accuracy varies widely across subjects, ranging from 43.4% to 86.1%. Subject 4 had the worst accuracy, especially for "fear", which was misclassified as every other emotion except "anger" more than 10% of the time. Each subject expresses the same emotion with various tones and talking speeds, which increases the difficulty of training a general model that accommodates all this variability.

B. IMAGE
We have 224x224 pixel values for each image, but PCA greatly reduces the number of values while preserving data variability for reconstructing the image; for example, 50 PCA coefficients accounted for an average of 83.37% of the data variance with a standard deviation of 0.08%. PCA was mainly used for dimension reduction, but the CNN autoencoder appears better at extracting meaningful spatial features, as it outperforms PCA by about 9% with both MLP and LSTM, as shown in Table V. However, MLP and LSTM have similar performance with either PCA or autoencoder features (p = 0.58 and p = 0.21). These two feature extraction methods, trained on a small dataset, might not preserve the temporal characteristics of the image sequence, so the LSTM loses its advantage.
VGG19 features perform similarly to VGG16 features, with no statistical difference between the LSTM results for these two feature types (t = -0.93; p = 0.38). The VGG16 features with the LSTM model achieve a mean accuracy of 67.20%, which is over 10% better than the autoencoder features and 20% better than the PCA features. Features extracted from the pre-trained models (VGG16 and VGG19) are significantly different from the PCA and autoencoder features trained on our dataset (with p-values between 3.14e-6 and 8.70e-7). This is likely because the pre-trained models were trained on vastly more images and have therefore learned richer image features. With these more meaningful features, the LSTM outperforms the MLP by nearly 20%, as shown in Table V. As with audio, the model's performance varies across subjects, ranging from 46.3% to 89.4%. Subjects with lower accuracies either showed less facial expression or over-exaggerated it, which makes it hard for the model to distinguish between emotions with similar facial expressions, such as fear and surprise or sad and neutral, with around a 15% misclassification rate.

C. EEG
Without the bootstrap method, all the models fail to classify the emotions reliably, as shown in Table V, where the best accuracy is only 20%, obtained with the SVM.
With the bootstrap method applied, accuracy increases by 15% to 20%. Surprisingly, the baseline models either outperform or closely match the deep learning models. The KNN model achieves an average accuracy of 39.70% with the PCA features and is significantly better than the LSTM model (p = 5.9e-4).
Our data yielded two types of features from the autoencoder method [46] (Section 4.3). The first is a feature vector of size 8x100 per trial, taken from the autoencoder's hidden layer with the raw periodogram of the EEG data as input. The second is the average values of the 10 cluster groups of the first feature type, where the clusters are based on the similarity of the hidden layer weights, resulting in a vector of size 8x10. Unfortunately, neither feature type performs well in emotion classification, which could be due to several reasons. First, our dataset contains only 8 EEG channels, fewer than the 32 channels used in [46]. Second, our dataset has a distinct data collection process, in which the subjects' emotions are posed and must change within a short period of time (5 s), which may introduce more noise into the signals compared with the SEED dataset, which used movie clips to elicit emotions over long trial times (60 s). Third, even after reducing noise and expanding the number of training samples through bootstrapping, there is still insufficient data to train the autoencoder to capture generic features of the EEG data.
PCA reduces the redundant information in the EEG signals and preserves meaningful information better than the autoencoder method. PCA is also faster and requires less computation than the autoencoder or the raw data, as shown in Table VII. The PCA performance is very similar to that of the filtered data and outperforms the autoencoder features. With 50 PCA components, an average of 97.9% of the data variability is preserved, with a standard deviation of less than 0.1%. The PCA features do not work well with the CNN model, since the CNN can already extract features from the raw data in its convolution layers. Also, since the input EEG features are not temporal data, the LSTM does not perform well compared with the baseline models.

D. EMG
EMG performance in classifying emotions was similar to that of EEG (Table V). As we used a window size of 2000 ms, there are only four timesteps for the single-trial and averaged-trial data, which does not provide much temporal information for the LSTM. With concatenated trials, which have 99 timesteps, the LSTM model improves slightly. The data used in Norali's paper [27] were 50 seconds in duration, whereas our data are only 5 seconds per trial; even with concatenated trials, the signals are not consecutive, which introduces errors when extracting the MFCC coefficients. With this limitation of the small dataset and the noise within the features, the deep learning models easily overfit the training data and do not perform better than the baseline models. The bootstrap method also improves denoising of the EMG data, with around a 10% accuracy increase using KNN for both the concatenated and averaged methods. Moreover, the p-value of the KNN accuracy between the concatenated and averaged methods is 0.54, suggesting that these two methods have similar performance with the KNN model. This could be due to the similarity of EMG signal characteristics within the same subject-emotion pair: the averaging method smooths out the noise, while the concatenation method emphasizes the common characteristics.

E. DISCUSSION
We used general feature extraction techniques to obtain features from the four modalities, but with more advanced, fine-tuned methods for extracting task-specific features, the models could recognize emotion better, as shown in another of our studies [41]. Our dataset is relatively small, so it is hard to train the CNN autoencoder to obtain significant image features. Our previous work [41] used pre-trained models combined with an ROI network to extract features from the regions of interest on the face for emotion recognition, which increased the trained LSTM model's accuracy by around 20%.
Besides the methods discussed above, we also conducted a wavelet analysis on the EEG data, which should be more powerful than PCA and the FFT in obtaining both time- and frequency-domain information. We used a Morlet wavelet at varying frequencies to extract the power at frequencies from 1 Hz to 40 Hz over the 5-second window. The resulting wavelet feature vector has size 8x40x5000 (8 channels, 40 frequencies, and 5000 sampling points per 5 seconds at 1 kHz) per trial. With this large number of features per trial and limited trials, it is challenging to train a model that distinguishes between emotional states without overfitting; emotion recognition accuracy for the wavelet features was only 20.72% with the LSTM model. However, our preliminary analyses suggest that there might be a delay before the emotional state is evoked in the brain. As our data can be split into three emotional processing stages, we will analyze these stages of each trial in future work. This paper aims to provide different baseline methods along with the dataset for the research community to work on this interesting and challenging task.

Table VII shows the computation time to determine the emotion of a single trial using the best method for each modality. All experiments were run in Python on a MacBook Pro with Intel Iris Plus Graphics 655, an i7 CPU @ 2.7 GHz, 16 GB RAM, and a 512 GB SSD. If we apply face tracking, we can reduce the time spent on face detection for each image. Overall, our method does not require substantial computation power and has large potential in many applications.

For example, virtual assistant applications could use audio and image emotion detection to monitor user reactions, whereas portable EEG and EMG systems might be used for online classification of emotions based on neural and muscular signals. A/B testing of applications could incorporate these reaction data to achieve a design that better fits users' needs. EEG and EMG emotion detection could also help convey the emotions of people with various disabilities or disorders, such as cerebral palsy or a vegetative state. Even though the recognition accuracies for these two data types are not sufficient to determine which of the seven emotions people are expressing, we may be able to assess whether people have positive or negative reactions. One possible disadvantage of our method is that it requires asking the same question multiple times to determine a person's reaction. However, people might not have an immediate response when asked; thus, repeating the question can help ensure the correct interpretation of their reaction.

VI. CONCLUSION
This work provides baseline approaches for posed emotion recognition based on our new dataset, PME4, with four different modalities. We examined various feature extraction techniques (MFCC, PCA, autoencoder, and pre-trained CNN) and machine learning models (KNN, SVM, Random Forest, MLP, CNN, LSTM) for each modality. We found that the deep learning LSTM model works better than the traditional KNN in classifying emotion from the audio and image sequences, as these data contain more abundant features of human emotion. As everyone expresses emotions differently, the general model varies widely in accuracy across individuals, ranging from 43.4% to 86.1%. This may also be because we collected posed rather than spontaneous expressions, and some subjects might have overdone or underperformed the required expressions. As there was more noise in the biosensor EEG and EMG data, bootstrapping proved to improve data stability: with bootstrapped samples averaged over 20 trials, accuracy increased by around 20% for EEG and 10% for EMG. Our initial motivation for collecting only eight EEG channels was to determine which EEG channels are most influenced by the subject's emotional state and to minimize the difficulty and complexity of biosensor data collection. We achieved 39.70% accuracy with only eight channels of EEG data using the traditional KNN model, which uses much less data than other emotional datasets and requires a shorter computation time.

A. LIMITATIONS AND FUTURE WORK
There are a few limitations that we faced in the analysis, especially with the biosensor data. Bootstrapping the EEG data did increase emotion recognition performance, and focusing on the speech interval of the biosensor data may improve it further. However, our PME4 dataset is relatively small, and because EEG and EMG data are inherently noisy, larger numbers of trials and subjects in future datasets may improve classification based on these types of signals. Also, each person uses different talking speeds and tones for the same emotion; resampling the speech interval could introduce many problems, but using the same speech duration for all subjects includes unneeded data that affects the extracted features. Moreover, each experimental session was relatively long, which might have differentially affected subjects' performance of each emotion at the beginning compared to the end of the session. It will be interesting to analyze each experimental session as a function of time to assess whether the subjects' emotional states were expressed evenly over the five testing blocks. We could apply the best baseline model (LSTM for audio and images, KNN for EEG and EMG) to the four modalities in each block and compare the results to see how time influences the subjects' emotions. Subjects could also experience multiple emotions during the same trial, such as exhaustion toward the last block; more analysis could be done with the initial block used as the baseline.

In the future, there are multiple analyses we would like to perform. First, we would like to compare person identification of each of the 11 subjects within each emotion, or combined person and emotion recognition, to explore the differences between people and their emotions in the four modalities. Second, we would like to analyze the time course of the physiological signals (EEG and EMG) for different emotional states to see whether there are differential delays in evoking the emotion in the brain. Integrating multiple modalities for emotion recognition is another line of future research. We will make our dataset available for research purposes and welcome other researchers to work on the dataset with new ideas and improved performance.