I. Introduction
Paralinguistics focuses on exploring the latent information in speech that characterizes speaker states or acoustic intermediates [1]–[3]. As an affective task in paralinguistics, speech emotion recognition (SER) enables machines to learn emotional categories or descriptors from speech utterances, which can support intelligent human–computer interaction through the audio modality [4]–[7]. Recent research on deep learning provides SER with deep models that better describe the emotional states in speech. One of the primary deep learning models is the deep neural network (DNN) [8], which is usually used to learn discriminative representations from low-level acoustic features [9], [10]. Furthermore, SER works tend to center on convolutional neural networks (CNNs) [11], [12] and long short-term memory (LSTM)-based recurrent neural networks (RNNs) [13]–[15] in order to mine local information in speech utterances. In most CNN-based SER cases [11], [16], [17], CNNs are used to learn time–frequency information from spectral features, while the LSTM-RNN cases [18], [19] focus on extracting sequential correlations from the time series in speech.
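To make the CNN/LSTM division of labor concrete, the following is a minimal sketch, not taken from any of the cited works, of the common pattern in which a small CNN summarizes local time–frequency patches of a log-Mel spectrogram and an LSTM then models the resulting frame sequence; the framework (PyTorch) and all layer sizes (n_mels=64, hidden size 128, four emotion classes) are assumptions made purely for illustration.

```python
# Illustrative CNN + LSTM pipeline for utterance-level SER (assumed sizes).
import torch
import torch.nn as nn


class CnnLstmSER(nn.Module):
    def __init__(self, n_mels: int = 64, num_classes: int = 4):
        super().__init__()
        # CNN front end: learns local time-frequency patterns from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2)),   # halve frequency and time
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),   # halve frequency only
        )
        reduced_mels = n_mels // 4              # frequency bins left after the two pools
        # LSTM back end: models sequential correlations across the remaining frames.
        self.lstm = nn.LSTM(input_size=32 * reduced_mels, hidden_size=128,
                            batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, n_frames) log-Mel spectrogram
        feats = self.cnn(spec)                                # (batch, 32, n_mels/4, n_frames/2)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one feature vector per frame
        _, (h_n, _) = self.lstm(seq)                          # final hidden state summarizes the utterance
        return self.classifier(h_n[-1])                       # utterance-level emotion logits


if __name__ == "__main__":
    model = CnnLstmSER()
    dummy = torch.randn(2, 1, 64, 300)                        # two utterances, 300 spectrogram frames
    print(model(dummy).shape)                                 # torch.Size([2, 4])
```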