Autoencoder With Emotion Embedding for Speech Emotion Recognition

An important part of the human-computer interaction process is speech emotion recognition (SER), which has been receiving more attention in recent years. However, although a wide diversity of methods has been proposed in SER, these approaches still cannot improve the performance. A key issue in the low performance of the SER system is how to effectively extract emotion-oriented features. In this paper, we propose a novel algorithm, an autoencoder with emotion embedding, to extract deep emotion features. Unlike many previous works, instance normalization, which is a common technique in the style transfer field, is introduced into our model rather than batch normalization. Furthermore, the emotion embedding path in our method can lead the autoencoder to efficiently learn a priori knowledge from the label. It can enable the model to distinguish which features are most related to human emotion. We concatenate the latent representation learned by the autoencoder and acoustic features obtained by the openSMILE toolkit. Finally, the concatenated feature vector is utilized for emotion classification. To improve the generalization of our method, a simple data augmentation approach is applied. Two publicly available and highly popular databases, IEMOCAP and EMODB, are chosen to evaluate our method. Experimental results demonstrate that the proposed model achieves significant performance improvement compared to other speech emotion recognition systems.


I. INTRODUCTION
In human speech interaction, people convey the underlying intent through paralinguistic characteristics such as emotions, intonations and styles. Therefore, speech emotion recognition (SER), the technique of recognizing emotions from speech, has gradually become a significant research interest. This technology has promising prospects and plays an important role in natural language understanding. For example, emotion recognition has been widely used in human-computer interaction (HCI) and computer-mediated human communication [1]. Recognizing these paralinguistic characteristics can help intelligent systems understand user intention and further improve the user experience. In this paper, we propose a deep learning algorithm that analyzes the underlying emotions of speech.
Human emotions in speech are complex to model. The main reasons are as follows: 1) human emotions may be treated as noise and discarded in many current speech recognition methods due to their abstraction; 2) in general, human emotion in a long utterance can only be detected in some specific moments [2]. Early work on SER mainly focused on selecting speech acoustic features that can distinguish different emotions, such as statistical features and prosodic features. The most common approach is to extract a large number of statistical features from utterances and apply basic machine learning algorithms (e.g., hidden Markov model (HMM) [3], Gaussian mixture model (GMM) [4], and support vector machine (SVM) [5]). Recently, with the increased interest in deep learning (DL) algorithms, the automatic extraction of useful features from speech signals by deep neural networks (DNNs), such as recurrent neural networks (RNNs) [6] and convolutional neural networks (CNNs) [7], has become a very popular technique. Prior research using DNNs has demonstrated that deep learning yields more promising results than traditional algorithms. Encouraged by the recent success of autoencoder structures [19], [20] with deep unsupervised learning and the idea of word embedding [8], [9] in natural language processing (NLP), we propose a novel algorithm based on an autoencoder with emotion embedding for SER. The key [...] In the comparative analysis, our system achieved better recognition results than the compared methods. The rest of this paper is organized as follows. Section II reviews related work on autoencoders. Section III presents our proposed algorithm in detail. Section IV describes the experimental details and databases. Section V presents the experimental results.
The associate editor coordinating the review of this manuscript and approving it for publication was Sunil Karamchandani.

II. RELATED WORK
Speech emotion recognition is considered a challenging task in the HCI domain. A large number of methodologies and corpora have been proposed in previous works [10]-[12]. The early stage of SER research used handcrafted speech features and low-level descriptors to train classic machine learning models. Recently, increasing attention has been drawn to the study of DNNs. However, two major issues are observed in DL approaches: (1) the need for a sufficient amount of labeled speech data and (2) the difficulty of extracting emotion-related features from audio.
To address the scarcity of training data, multiple methodologies have been investigated. Generally, there are three approaches to address this obstacle. (1) Collecting and annotating new data. However, it is expensive and consumes much time to create a large enough dataset. (2) Data augmentation. This is the most common method that has been widely used in the DL field [13], [14]. (3) Transfer learning. This method is a popular research problem in DL that focuses on storing knowledge gained while training one model and applying it to another task. It has been successfully applied in various domains [15]- [18]. However, the mismatch between the datasets is the reason why the accuracy of the SER system has not been further improved. In this paper, a simple data augmentation method is applied to the proposed SER algorithm.
In recent years, to strengthen the feature extraction ability, many improvements to DL algorithms have been proposed. With the successful application of autoencoders in the DL field, they have also been introduced into SER tasks. An autoencoder is an unsupervised learning model used to reconstruct the input with minimum reconstruction error.
The basic autoencoder has one input layer, one hidden layer and one output layer. The autoencoder first maps the input vector to the best latent representation through nonlinear mapping, and then this representation is mapped back to the output layer to reconstruct the input vector. If the number of hidden layers is greater than one, the network is considered to be deep. Many previous works directly utilized the latent representation learned by basic autoencoders for SER tasks. For example, in [21], a deep autoencoder based on a multilayer perceptron was proposed for SER. Pal and Baskar [22] proposed a deep dropout autoencoder based multilayer perceptron. Similar to [21] and [22], an autoencoder was also applied to extract the bottleneck features for dimensionality reduction in [23] and [24]. Finally, some machine learning algorithms, such as SVM and long short-term memory (LSTM), were applied for emotion classification.
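The single-hidden-layer autoencoder described at the start of the paragraph above can be written compactly as follows. This is a generic textbook sketch, not notation taken from the cited works: the weights $W, W'$, biases $b, b'$, nonlinearities $f, g$ and per-sample loss $\ell$ are symbols introduced here for illustration only.

```latex
h = f(W x + b), \qquad
\hat{x} = g(W' h + b'), \qquad
\min_{W,\, b,\, W',\, b'} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(x_i, \hat{x}_i\big)
```

Here $h$ is the latent representation that the works cited above pass to a downstream classifier, and stacking several such mappings gives the deep variant.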
Moreover, to extract more robust features, the denoising autoencoder (DAE) and its deep structure, stacked denoising autoencoders, have been introduced into the SER field. The major difference between a DAE and a traditional autoencoder is that the DAE is trained to recover the original input from a corrupted version of it. Motivated by this idea, Ghosh et al. [25] explored stacked DAEs for representation learning. Furthermore, Zhang et al. [26] proposed a memory-enhanced recurrent denoising autoencoder (rDA) and showed that it can significantly improve the performance.
In the aforementioned methods, the autoencoder is directly trained to learn a lower-dimensional distributed representation of the input data. However, the representation learned by the basic autoencoder architecture also contains redundant information not related to human emotions. To address this problem, researchers have explored a variety of modified autoencoder networks. Xia and Liu [27] proposed a modified autoencoder method to project the input to two hidden spaces. One of them is meant to represent emotional information, whereas the other is used to capture redundant information. Deng et al. [28] proposed a shared hidden-layer autoencoder (SHLA) model for learning common feature representations shared across the training and test sets to reduce the discrepancy between them. In addition, Zong et al. [29] proposed a novel framework named multichannel autoencoder (MTC-AE) for emotion recognition. MTC-AE contains multiple local DNNs based on different low-level descriptors with different statistical functions that are partly concatenated together, which enables the structure to consider both local and global features simultaneously. Wei et al. [30] proposed an algorithm based on an autoencoder, denoising autoencoder, and sparse autoencoder. The first layer of the structure uses a denoising autoencoder to learn a hidden feature with a larger dimension than the dimension of the input features, and the second layer employs a sparse autoencoder to learn sparse features.
Even if such methods can further improve the performance of SER, the high-level features extracted by reconstructing the input mainly contain content information rather than emotion-oriented features. Moreover, the above-mentioned works do not consider the significance of a priori knowledge. To address this problem, our method does not rely on the basic autoencoder architecture. In this work, we increase the modeling capacity by designing a new autoencoder architecture, the emotion-embedded autoencoder. Emotion embedding layers in our method lead the model to efficiently learn a priori emotion information from the label, which allows the autoencoder to focus more on deep emotion features during the reconstruction process. Experimental results demonstrate that the proposed method improves performance.

III. PROPOSED METHOD
In this section, we describe our proposed method. There are three parts: input speech feature, autoencoder with emotion embedding and emotion classification network. Fig. 1 depicts the model framework, which includes an autoencoder, an emotion embedding path, and an emotion classification net. Let us consider a dataset with $N$ labeled samples $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ denotes the $i$th acoustic feature sequence of the speech sample and $y_i$ is the emotion label corresponding to $x_i$. Here $y_i \in \{1, 2, 3, \cdots, K\}$, and $K$ is the number of emotion categories.

A. INPUT SPEECH FEATURE

1) LOG MAGNITUDE SPECTROGRAM
A spectrogram is a useful expression for the analysis of speech and audio signals. Many applications are performed in the spectral domain using spectrograms rather than in the original time domain. Furthermore, the magnitude spectrograms of audio signals tend to be highly structured in terms of both spectral and temporal regularities [31]. Therefore, it is easier to deal with many problems by processing magnitude spectrograms than directly processing time-domain signals. In fact, magnitude spectrograms have been introduced into many audio processing fields including audio source separation and speech synthesis systems [32]- [36]. In this paper,  the detailed spectral analysis was the same as in previous work [36].

2) IS10 FEATURE SET
We utilize the openSMILE [37] toolkit to extract the statistical features that were used in the INTERSPEECH 2010 Paralinguistic Challenge [38]. The open-source media interpretation by large feature-space extraction (openSMILE) toolkit is a modular tool for signal processing and machine learning applications. It can flexibly extract the features of signals and is mainly used for audio signal feature extraction. With this configuration, 1582-dimensional features are generated by extracting 38 kinds of LLDs and applying 21 statistical functions. Details about these features can be found in [38].
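Multiplying 38 LLDs by 21 functionals does not give 1582 directly, so a short arithmetic check may help. The sketch below uses the breakdown commonly cited for the INTERSPEECH 2010 feature set; the group sizes are an assumption drawn from the challenge description rather than from this paper, and [38] remains the authoritative reference. The names `GROUPS`, `EXTRA_SCALARS` and `is10_dimension` are hypothetical.

```python
# Commonly cited breakdown of the INTERSPEECH 2010 (IS10) feature set.
# Assumption: these group sizes come from the challenge description,
# not from this paper; see [38] for the authoritative definition.
GROUPS = [
    # (description, number of contours, functionals applied to each contour)
    ("34 LLDs plus their 34 delta contours", 68, 21),
    ("4 pitch-based LLDs plus their 4 delta contours", 8, 19),
]
EXTRA_SCALARS = 2  # number of pitch onsets and total utterance duration


def is10_dimension():
    """Total feature dimension implied by the breakdown above."""
    return sum(contours * functionals for _, contours, functionals in GROUPS) + EXTRA_SCALARS


print(is10_dimension())
```

Under this breakdown the 38 LLDs split into 34 general and 4 pitch-based descriptors, and only the former receive all 21 functionals.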

B. AUTOENCODER WITH EMOTION EMBEDDING
In this part, we describe the complete scheme of the autoencoder with emotion embedding in detail. Figs. 3-4 and Tables 1-2 depict the detailed autoencoder architecture. In this paper, due to the 2D representation of the spectrogram, our proposed autoencoder is mainly based on a CNN, and it contains two parts: an encoder and a decoder. Each part mainly consists of four building blocks: convolution parts, instance normalization, dropout layers and a gated recurrent unit (GRU). CNNs are one of the most popular deep learning models and have demonstrated great success in various research fields. In SER, CNNs have also been widely used to learn salient features and also directly for classification. Generally, the basic components of a CNN are convolution layers, pooling layers, batch normalization (BN) and activation layers. In our method, pooling layers are discarded since we do not want to lose any high-level information that may be related to emotions. In addition, we replace BN with instance normalization (IN) [39] in our network. BN [40] is one of the most common components in many CNNs, and it normalizes the features by the mean and variance computed within a batch. It enables a larger learning rate and faster convergence by reducing the internal covariate shift during the training process. The key difference between BN and IN is that the latter applies normalization to an individual sample instead of a whole batch of samples. Generally, IN is mainly used in the style transfer field, for instance, image style transfer [41]. Some methods [42], [43] employ IN to help remove image contrast. Moreover, many existing works show that IN learns features that are invariant to appearance changes, such as colors, styles, and virtuality, while BN is essential for preserving content-related information [44].
To this end, we introduce IN into our autoencoder network with the purpose of leading the model to pay more attention to features related to emotion while maintaining the discrimination of the learned features. In addition, considering that emotions in speech are context-dependent, the ability to model contextual information makes RNNs suitable for SER. Therefore, the GRU [45], which is a special case of the RNN, is utilized in our network. Finally, residual connections [46] are utilized in the network to address vanishing/exploding gradients.
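The difference between BN and IN described above comes down to which values share a mean and variance. The following is a minimal sketch for a single feature channel, written in plain Python; it omits the learnable scale and shift as well as the running statistics that practical layers use, and the value of `EPS` is an assumption. It is an illustration of the two normalization schemes, not the layer configuration used in the proposed network.

```python
import math

EPS = 1e-5  # small constant for numerical stability (assumed value)


def _mean_var(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def _normalize(values, mean, var):
    return [(v - mean) / math.sqrt(var + EPS) for v in values]


def batch_norm(batch):
    """Normalize one channel with statistics shared by the whole batch."""
    mean, var = _mean_var([v for sample in batch for v in sample])
    return [_normalize(sample, mean, var) for sample in batch]


def instance_norm(batch):
    """Normalize one channel with statistics computed separately per sample."""
    return [_normalize(sample, *_mean_var(sample)) for sample in batch]


# Two toy samples with the same shape but a different overall level and scale.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
bn = batch_norm(batch)
inn = instance_norm(batch)
```

With IN the two samples map to nearly identical outputs because each is normalized by its own statistics, whereas BN keeps them apart; this is the sense in which IN discards per-sample level and scale.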
In the encoding process, the encoder is trained to map an input sequence $x$ to a latent representation encoder($x$). In the reconstruction process, the decoder network is equipped with the emotion embedding path, leading the model to efficiently learn emotion information from the label, as shown in Fig. 4. The latent representation that a conventional autoencoder learns by reconstructing the input mainly contains content information, which is why the SER accuracy in many works has not been further improved. The main attraction of emotion embedding for SER is that it allows the network to distinguish which deep features are related to emotion. In this paper, the decoder is trained to generate $\hat{x}$, a reconstruction of $x$ from encoder($x$) given the emotion label $y$, as shown in (1).
The mean absolute error (MAE) is used as the reconstruction loss since it generates a sharper output than the mean square error [47], as shown in (2), where $\theta_{Enc}$ and $\theta_{Dec}$ are the parameters of the encoder and decoder, respectively.
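Equations (1) and (2) are not legible in this copy of the text. A form consistent with the surrounding definitions would be the following; it is a reconstruction under assumptions rather than the authors' exact notation. In particular, the symbol $L_{rec}$ and the use of an expectation over the dataset $D$ are introduced here for illustration.

```latex
\hat{x} = \mathrm{decoder}\big(\mathrm{encoder}(x),\, y\big) \tag{1}

L_{rec}(\theta_{Enc}, \theta_{Dec}) = \mathbb{E}_{(x, y) \sim D}\Big[\, \big\lVert x - \hat{x} \big\rVert_{1} \Big] \tag{2}
```

The $\ell_1$ norm corresponds to the mean absolute error named in the text, up to a constant normalization by the number of spectrogram entries.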

C. EMOTION CLASSIFICATION WITH FEATURE FUSION
In the classification process, the classification network takes the encoder's output and learns the links between it and the emotion label, as shown in Fig. 5 and Table 3. The best representation from the encoder is fed into the self-attention [48] layer first, and the details of the attention layer are the same as in previous work [48]. With the attention mechanism, the network can focus more on the emotion-oriented feature in the best representation obtained by the autoencoder. Moreover, while the progressive downsampling of CNNs provides strong capability in local context modeling and emotion-related pattern detection, Li et al. [49] believed that the temporal structure of speech that is highly related to emotions will gradually be lost in the downsampling process [50]. To overcome this problem, we concatenate the deep attention emotion feature extracted from the attention layer and acoustic features obtained by the openSMILE toolkit. These features contain global information of speech. Finally, the concatenated feature vector is fed into the fully connected network for emotion classification.
The emotion classification network takes the concatenated feature vector as input and outputs the predicted emotion class. The classifier is trained to minimize the negative log-probability, as shown in (3),
where $\theta_{Att}$ and $\theta_{Cla}$ are the parameters of the attention layer and classification network, respectively.
During the training process, the objective function of our network is a joint function determined by both the reconstruction error and the negative log-probability, where $\lambda$ is a constant controlling the weighting between the encoder path and the classification path.
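The joint objective can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the exact form of the objective is not legible in this copy, so the code assumes that $\lambda$ scales the classification term, which is only one of the possible placements, and the function names `mae`, `nll` and `joint_loss` are hypothetical.

```python
import math


def mae(x, x_hat):
    """Mean absolute error between an input and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def nll(probs, label):
    """Negative log-probability of the true class (labels are 0-based here)."""
    return -math.log(probs[label])


def joint_loss(x, x_hat, probs, label, lam=1.0):
    # Assumption: lambda multiplies the classification term. The placement
    # and the value of lambda used by the authors are not given in this text.
    return mae(x, x_hat) + lam * nll(probs, label)


# Toy values: a 3-bin "spectrogram" and a 4-class softmax output.
loss = joint_loss(
    x=[0.0, 1.0, 2.0],
    x_hat=[0.5, 1.0, 1.5],
    probs=[0.1, 0.7, 0.1, 0.1],
    label=1,
    lam=0.5,
)
```

Because both terms are minimized together, gradients from the classifier and from the reconstruction both reach the encoder, which is what ties the latent representation to the emotion label.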

IV. EXPERIMENT
A. DATA AUGMENTATION
Currently, there are two common problems with datasets in the SER field: the inherent mismatch between datasets and the difficulty in creating corpora. The mismatch arises because different datasets use different emotion annotation schemes, and data collection often comes with high annotation costs. Therefore, one of the serious obstacles to the application of SER systems in real-life settings is the lack of a sufficient amount of labeled speech data. In this paper, a simple data augmentation method is presented to make effective use of the data. Data augmentation was originally proposed as a method to generate additional training data for computer vision. Artificial data have also been used for augmentation in many previous works in automatic speech recognition (ASR) [51], [52]. In [53], Jaitly et al. proposed a method named vocal tract length perturbation for data augmentation. The approach of superimposing clean audio with a noisy audio signal was adopted in [54]. In [55], speed perturbation was applied to LVCSR tasks. In addition, the use of an acoustic room simulator [56] and generative adversarial networks (GANs) [57] has also been proposed for data augmentation. However, the aforementioned approaches all operate on the raw audio itself rather than the spectrogram. In [58], Park et al. proposed a simple and computationally cheap method for data augmentation, which acts directly on the log mel spectrogram and does not require any additional data. Three deformations of the spectrogram were chosen in their work: time warping, frequency masking and time masking. More generally, many works have demonstrated that data augmentation techniques help achieve state-of-the-art performance in ASR. In this paper, to avoid losing local context information of the speech signal, we randomly sampled 128 frames of the log magnitude spectrogram with overlap.
This means that 128 consecutive time steps [t, t + 128) are termed a training sample, where t is chosen from a uniform distribution [0, T − 128), and T is the length of the log magnitude spectrogram, as shown in Fig. 2.
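The sampling rule above can be sketched in a few lines of plain Python. The representation of the spectrogram as a list of frames, the function name `sample_segment`, and the handling of utterances with fewer than 128 frames are assumptions made for illustration; the text does not say how short utterances are treated.

```python
import random

SEGMENT_FRAMES = 128  # segment length stated in the text


def sample_segment(spectrogram, rng=random):
    """Return (t, frames) where frames are the time steps [t, t + 128).

    `spectrogram` is a list of frames, one list of frequency bins per time step.
    """
    total = len(spectrogram)
    if total < SEGMENT_FRAMES:
        # Assumption: too-short utterances are rejected here; padding is
        # another option that the text does not discuss.
        raise ValueError("utterance shorter than one segment")
    # t is drawn uniformly from the integers in [0, T - 128). When T == 128
    # that interval is empty, so the single possible segment (t = 0) is used.
    start = rng.randrange(total - SEGMENT_FRAMES) if total > SEGMENT_FRAMES else 0
    return start, spectrogram[start:start + SEGMENT_FRAMES]


# Toy spectrogram: 300 frames with 4 bins, the frame index stored in bin 0.
spec = [[float(t), 0.0, 0.0, 0.0] for t in range(300)]
start, segment = sample_segment(spec, random.Random(0))
```

Calling `sample_segment` repeatedly on the same utterance yields overlapping segments, so one recording contributes many distinct training samples.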

B. DATASET
To investigate the performance of the proposed method, two publicly available and highly popular databases, namely the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [59] and the Berlin Emotional Speech Database (EMODB) [60], are chosen as source sets. We first briefly introduce the databases.

1) IEMOCAP
IEMOCAP was collected by SAIL lab at USC, USA, and it consists of 5 sessions. It has 10 professional actors (5 male and 5 female) acting in two different scenarios: scripted play and spontaneous dialog. This corpus has approximately 12 hours of audiovisual data, including video, speech, motion capture of face, and text transcription. Each interaction is segmented into sentences that are labeled by at least 3 annotators. In this paper, we used four emotion categories: angry, happy, sad and neutral. Note that, like many previous works, happy and excited in the original annotation were merged into one class: happy. Only the audio signals were used in the experiments.

2) EMODB
This dataset was collected by the Institute of Communication Science at the Technical University of Berlin. It was spoken by 10 actors and comprises 535 utterances divided into seven emotion classes, namely, anger, fear, happiness, sadness, disgust, boredom, and neutral. We use all data in this study.

C. EXPERIMENT SETUP
Since there were 10 speakers in IEMOCAP and each session consisted of 2 speakers, leave-one-speaker-out cross-validation was applied in our experiments so that there was no speaker overlap between the training and test data. Moreover, 10-fold cross-validation was also used to evaluate the proposed method. For EMODB, we performed all evaluations using 5-fold and 10-fold cross-validation, to remain consistent with most existing approaches. For performance comparison, we utilize unweighted accuracy (UA) and weighted accuracy (WA) [61], which have been used in several previous emotion challenges. Weighted accuracy is the accuracy over all testing utterances in the dataset, and unweighted accuracy is the average accuracy over each emotion category. Reporting both is informative in this case since the class distribution is imbalanced. In addition, to further assess the proposed method, the metrics of precision, recall and F1-score are also computed.
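The two accuracy measures defined above differ only in how utterances are averaged, which a small example makes concrete. The sketch below follows the definitions in the text; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Accuracy over all test utterances (WA)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Average of the per-class accuracies (UA)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced toy test set: 8 neutral utterances and 2 sad ones.
y_true = ["neu"] * 8 + ["sad"] * 2
y_pred = ["neu"] * 8 + ["neu", "sad"]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

In this toy case WA is 0.9 while UA is 0.75, because the single error falls in the minority class and UA gives each class equal weight.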
We used sampled log magnitude spectrogram as the inputs, and trained the network using Adam optimizer with lr = 0.0001, β 1 = 0.9, β 2 = 0.999. The model was trained for 40 epochs on the dataset. All the experiments were performed using an Nvidia GTX 1080Ti with 11 GB memory.
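For readers unfamiliar with the optimizer settings, the following sketch shows one Adam update for a scalar parameter with the stated learning rate and $\beta$ values. The value of `EPS` is an assumption, since the text does not give it, and the quadratic toy objective is purely illustrative.

```python
import math

LR, BETA1, BETA2 = 1e-4, 0.9, 0.999  # values stated in the text
EPS = 1e-8  # assumed; not given in the text


def adam_step(param, grad, m, v, step):
    """One Adam update for a scalar parameter; `step` counts from 1."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** step)             # bias correction
    v_hat = v / (1 - BETA2 ** step)
    param -= LR * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v


# Toy objective f(w) = (w - 1)^2, minimized for 100 steps from w = 0.
w, m, v = 0.0, 0.0, 0.0
for step in range(1, 101):
    grad = 2 * (w - 1)
    w, m, v = adam_step(w, grad, m, v, step)
```

Each step moves the parameter by roughly the learning rate regardless of the gradient scale, which is why a small value such as 0.0001 gives slow but stable progress.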

V. RESULTS AND ANALYSIS

A. IMPACTS OF INSTANCE NORMALIZATION AND EMOTION EMBEDDING
In this part, the IS10 feature set is combined with the traditional machine learning algorithm SVM (IS10+SVM) to [...]. Table 4 shows the performance of different classifiers on the IEMOCAP speech database. For 10-fold cross-validation, the UA obtained with AE$_{-OS,+EE,IN}$ is improved by 1.84% compared with AE$_{-OS}$. Furthermore, the UA obtained with the proposed method is improved by 1.22%, 0.2% and 0.65% compared with AE, AE$_{+EE}$ and AE$_{+IN}$, respectively. For 10-fold leave-one-speaker-out cross-validation, the UA obtained with AE$_{-OS,+EE,IN}$ is improved by 2.93% compared with AE$_{-OS}$. Meanwhile, the UA obtained with the proposed method is improved by 8.1%, 1.24%, 1.8% and 0.36% compared with IS10+SVM, AE, AE$_{+EE}$ and AE$_{+IN}$, respectively. In summary, the performance of SER is further improved by introducing emotion embedding and instance normalization.

B. VISUALIZING THE EMOTION EMBEDDING USING T-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING (T-SNE)
T-SNE is an algorithm developed for visualizing multidimensional data based on the idea of dimensionality reduction. We use t-SNE plots to visualize the emotion embedding of our modified autoencoder model. In our method, there are five emotion embedding layers in the decoder network, as shown in Fig. 4. Each test sample is a multidimensional data point (512 dimensions). T-SNE was then used to reduce the dimensions to only two for a 2D plot, as shown in Fig. 6 and Fig. 7. From Fig. 6 and Fig. 7, we can clearly see the separation between "ang", "hap" and "neu". Such a result is expected since there are clearly different characteristics between them. However, we can also see that "sad" is not clearly separated from the other three emotions. One possible explanation is that the low-energy state of the emotion "sad" does not have salient characteristics compared with the other emotions. In summary, the experimental results demonstrate that our proposed autoencoder naturally learns useful emotion representations from the label; in turn, the learning process discovers the intrinsic attributes necessary to solve emotion recognition.
[...] seen that "anger" and "sadness" are easier to distinguish than other emotions. In the EMODB dataset, "boredom", "disgust" and "fear" can also be recognized well. However, "happiness" and "neutral" are the most difficult to identify.
To further observe the performance of the proposed method for each emotion, we also present the classification confusion matrices, as shown in Tables 9-12. These results correspond to the IEMOCAP (10-fold, 10-fold LOSO) and EMODB (5-fold, 10-fold) datasets, respectively. Tables 9-12 clearly indicate that recognizing "happiness" is the most difficult task since "happiness" is easily confused with other emotions. In addition, the recognition accuracy of "neutral" varies greatly across datasets. This is because the data distribution of each dataset is different.
To evaluate the proposed method against existing approaches, Tables 13 and 14 present the performance comparisons of different methods on the IEMOCAP and EMODB datasets. From Tables 13 and 14, the best performance of our proposed method is 71.2% for IEMOCAP and 95.6% for EMODB. Additionally, we find that the performances of all models on the IEMOCAP dataset are relatively low. This is because the IEMOCAP dataset is collected in two different scenarios, scripted play and spontaneous dialog, and spontaneous dialog is much more difficult to identify than acted dialog. Finally, the results clearly show that the proposed method achieves a significant performance improvement compared with other deep learning methods.
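A confusion matrix of the kind referred to above can be computed directly from the true and predicted labels. The sketch below uses plain Python with rows as true classes and columns as predicted classes; the toy labels are hypothetical and do not reproduce any value from Tables 9-12.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, one value per true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


labels = ["ang", "hap", "neu", "sad"]
y_true = ["ang", "ang", "hap", "hap", "neu", "sad"]
y_pred = ["ang", "ang", "neu", "hap", "neu", "sad"]
cm = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(cm)
```

Off-diagonal entries in a row show which emotions a given class is mistaken for, and averaging `recall` gives the unweighted accuracy.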

VI. CONCLUSION
In this paper, we proposed a novel algorithm that combines both autoencoder and emotion embedding. The emotion embedding path focuses on learning strong emotional information from labels. This allows the latent representation from the autoencoder to learn which deep features are related to emotion. In the emotion classification process, the IS10 feature set was fused with the deep emotion feature from the autoencoder. Experimental results with two publicly available corpora show that the proposed algorithm further enhances the classification accuracy.
In future work, considering the powerful capabilities of BERT [73] in natural language processing tasks, we will consider introducing it into SER tasks to help the model extract deep attention features. In addition, the use of text information can be a measure to further improve the accuracy of SER.
CHENGHAO ZHANG was born in Wuxi, Jiangsu, China, in 1992. He is currently pursuing the M.E. degree in communication and information systems with Shanghai University. His research interests include signal processing, pattern recognition, and deep learning.
LEI XUE was born in Beijing, China, in 1963. He received the B.E. degree from Henan University, China, in 1985, the M.E. degree in computer science from Fudan University, China, in 1999, and the Ph.D. degree in pattern recognition and intelligent control from the Huazhong University of Science and Technology, China, in 2004. His research interests include signal processing and intelligent control. VOLUME 9, 2021