Convolutional-Recurrent Neural Networks With Multiple Attention Mechanisms for Speech Emotion Recognition



Abstract:

Speech emotion recognition (SER) aims to endow machines with the intelligence to perceive latent affective components in speech. However, the structures of existing deep-learning-based SER models make it difficult to jointly consider time–frequency and sequential information in speech, which may leave local emotional representations underexplored. In this regard, this article proposes a convolutional-recurrent neural network with multiple attention mechanisms (CRNN-MA) for SER, comprising parallel convolutional neural network (CNN) and long short-term memory (LSTM) modules that operate on extracted Mel-spectrograms and frame-level features, respectively, so as to acquire time–frequency and sequential information simultaneously. Furthermore, we adopt three strategies in the proposed CRNN-MA: 1) a multiple self-attention layer in the CNN module that computes frame-level weights; 2) a multidimensional attention layer applied to the input features of the LSTM; and 3) a fusion layer that summarizes the features of the two modules. Experimental results on three conventional SER corpora demonstrate the effectiveness of the convolutional-recurrent structure and the multiple attention mechanisms, compared with related models and existing state-of-the-art approaches.
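To make the described architecture concrete, the following is a minimal sketch of the parallel CNN/LSTM structure and the three strategies. It assumes PyTorch; the layer sizes, kernel shapes, the simplified single-head form of the attention layers, the number of emotion classes, and all names (CRNNMA, frame_attn, dim_attn, fusion) are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a CRNN-MA-style model, assuming PyTorch.
# All dimensions and attention forms below are illustrative assumptions.
import torch
import torch.nn as nn

class CRNNMA(nn.Module):
    def __init__(self, n_mels=64, frame_feat_dim=39, hidden=128, n_classes=4):
        super().__init__()
        # CNN branch: time-frequency information from the Mel-spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool the frequency axis, keep time frames
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        cnn_dim = 64 * (n_mels // 4)
        # Strategy 1 (assumed simplified form): self-attention over frames,
        # producing frame-level weights in the CNN module.
        self.frame_attn = nn.Linear(cnn_dim, 1)
        # Strategy 2 (assumed form): multidimensional attention that re-weights
        # each dimension of the frame-level features before the LSTM.
        self.dim_attn = nn.Linear(frame_feat_dim, frame_feat_dim)
        self.lstm = nn.LSTM(frame_feat_dim, hidden, batch_first=True)
        # Strategy 3: fusion layer summarizing the features of both modules.
        self.fusion = nn.Linear(cnn_dim + hidden, n_classes)

    def forward(self, mel, frames):
        # mel: (B, 1, n_mels, T); frames: (B, T, frame_feat_dim)
        c = self.cnn(mel)                         # (B, 64, n_mels//4, T)
        c = c.flatten(1, 2).transpose(1, 2)       # (B, T, cnn_dim)
        w = torch.softmax(self.frame_attn(c), dim=1)   # frame-level weights
        cnn_vec = (w * c).sum(dim=1)              # attention-pooled CNN features
        g = torch.sigmoid(self.dim_attn(frames))  # per-dimension attention gates
        out, _ = self.lstm(g * frames)
        rnn_vec = out[:, -1]                      # final LSTM hidden state
        return self.fusion(torch.cat([cnn_vec, rnn_vec], dim=1))

model = CRNNMA()
logits = model(torch.randn(8, 1, 64, 200), torch.randn(8, 200, 39))  # (8, 4)
```

In this sketch the fusion layer simply concatenates the attention-pooled CNN features with the final LSTM state before classification; the paper's actual fusion operation may differ.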
Published in: IEEE Transactions on Cognitive and Developmental Systems ( Volume: 14, Issue: 4, December 2022)
Page(s): 1564 - 1573
Date of Publication: 29 October 2021


I. Introduction

Paralinguistics focuses on exploring latent information in speech that represents the states of speakers or acoustic intermediates [1]–[3]. As an affective task within paralinguistics, speech emotion recognition (SER) enables machines to learn emotional categories or descriptors from speech utterances, which supports intelligent human–computer interaction through the audio modality [4]–[7]. Recent research on deep learning provides SER with deep models that better describe emotional states in speech. One of the primary deep learning models is the deep neural network (DNN) [8], which is typically used to learn discriminative representations from low-level acoustic features [9], [10]. Furthermore, SER works tend to center on convolutional neural networks (CNNs) [11], [12] and long short-term memory (LSTM)-based recurrent neural networks (RNNs) [13]–[15] in order to mine local information in speech utterances. In most CNN-based SER cases [11], [16], [17], CNNs are used to learn time–frequency information from spectral features, while the LSTM-RNN cases [18], [19] focus on extracting sequential correlations from the time series in speech.
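As a concrete illustration of these two input streams, the sketch below derives a log-Mel spectrogram for a CNN branch and frame-level features for an LSTM branch, assuming librosa. The choice of MFCCs as the frame-level features and all parameter values are assumptions; the paper's exact feature set is not specified here.

```python
# Hedged sketch of preparing the two input streams, assuming librosa.
# The feature choices and parameters are illustrative assumptions.
import librosa

def extract_inputs(path, sr=16000, n_mels=64, n_mfcc=39):
    y, sr = librosa.load(path, sr=sr)
    # Time-frequency input for the CNN branch: log-Mel spectrogram, (n_mels, T).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Sequential input for the LSTM branch: frame-level MFCCs, (T, n_mfcc).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    return log_mel, mfcc
```

With the default hop length, both streams share the same number of frames T, so the CNN can retain the time axis while the LSTM consumes the frame sequence.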
