Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition

The automatic detection of an emotional state from human speech, which plays a crucial role in the area of human–machine interaction, has consistently been shown to be a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and highly engineered features. Results from these works have demonstrated the importance of discriminative spatio-temporal features for modelling the continual evolution of different emotions. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition (SER). How machine learning algorithms can learn effective compositional spatio-temporal dynamics for SER remains a fundamental problem for deep representations of this kind, herein denoted as deep spectrum representations. In this paper, we develop a model to address this problem by leveraging a parallel combination of attention-based bidirectional long short-term memory recurrent neural networks with attention-based fully convolutional networks (FCNs). Extensive experiments were undertaken on the interactive emotional dyadic motion capture (IEMOCAP) database and the FAU Aibo Emotion Corpus (FAU-AEC) to highlight the effectiveness of our approach. The experimental results indicate that deep spectrum representations extracted from the proposed model are well-suited to the task of SER, achieving a weighted accuracy (WA) of 68.1% and an unweighted accuracy (UA) of 67.0% on IEMOCAP, and a UA of 45.4% on the FAU-AEC dataset. Key results indicate that the extracted deep representations combined with a linear support vector classifier are comparable in performance with eGeMAPS and ComParE, two standard acoustic feature representations.


I. INTRODUCTION
Automatic emotion recognition from speech signals, which aims to identify basic human emotional states using machine learning, remains a difficult task. A major challenge currently faced by researchers is how best to extract discriminative, robust, and affect-salient features that represent the acoustic content of speech signals. Many previous research efforts have investigated hand-crafted acoustic features for the task of speech emotion recognition (SER), such as prosodic features (e.g., pitch, energy, zero-crossings), spectral features (e.g., linear predictor coefficients (LPC), linear predictor cepstral coefficients (LPCC), and mel-frequency cepstral coefficients (MFCC)), and non-linear features such as the Teager-energy-operator (TEO). More recently, with the increased use of neural networks for SER tasks, mel-scale filterbank spectrograms are now widely used as an input feature. Deep spectrum representations, which are features automatically extracted from speech spectrogram images using deep learning models, have produced promising results in SER [1] and in other speech and audio related applications [1]-[3]. Inspired by the performance of convolutional neural networks (CNNs) in visual recognition tasks [4], recent SER approaches such as deep spectrum have incorporated CNNs to extract features from spectrograms. CNNs are exceptionally good at capturing high-level representations in a spatial domain. Recently, fully convolutional networks (FCNs) [5] have been proposed as a variant of CNNs. A major advantage of FCNs is that they can handle inputs of variable sizes; based on this property, they have achieved state-of-the-art performance in time-series based classification tasks [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Haishuai Wang.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
However, a drawback of FCNs is that they are not primarily tailored for learning temporal features. In this regard, recurrent neural networks with long short-term memory (LSTM-RNNs) are well suited to modelling temporal dependencies within sequences [8], and as a result are widely used in SER [9], [10]. The approach proposed herein aims to leverage the inherent strengths of the two aforementioned models. The framework combines, in a parallel manner, FCNs and LSTM-RNNs, specifically bidirectional LSTM-RNNs (BLSTM-RNNs), to learn effective compositional spatio-temporal dynamics from spectrograms for the SER task.
In addition to learning useful spatio-temporal features, it is also important to select the emotionally salient sections of an input signal to improve SER performance further [11]. Attention mechanisms in RNN and CNN-based models have frequently been shown to be a useful tool for encouraging a model to weight specific regions of an input sequence or image more heavily [12]. Attention mechanisms have also been effectively applied in SER [11], [13]-[15].
Motivated by the above analysis, and following on from our previous preliminary work [10], [16], we propose the Attention-BLSTM-FCN model, a spatio-temporal spectrogram-based approach which leverages attention-based BLSTM-RNNs (Attention-BLSTM-RNNs) and attention-based FCNs in parallel for SER. An advantage of the Attention-BLSTM-FCN model is that it captures both temporal and frequency dependencies in the spectrogram of the speech, relying on FCNs to extract representations from the spectrogram and modelling the temporal dynamics using a BLSTM network. In order to focus feature extraction on the emotionally salient parts of an utterance, we investigate the benefits of including attention-based architectures in the model. A concatenation operation is employed to take advantage of the complementary features extracted from the BLSTM and the FCN, and the learnt representations are then fed into a deep neural network (DNN) to predict the emotion of the input utterance.
The main contributions of this article are, therefore, as follows: i) we propose a novel framework to fuse both spatial and temporal representations for SER by leveraging attention-based FCNs with attention-based BLSTM-RNNs, an approach capable of automatically learning feature representations and modeling the temporal dependencies; ii) following the recent success of applying deep learning methods directly to spectrograms, enhanced deep spectrum representations are derived from forwarding spectrograms through the Attention-BLSTM-FCN model; and iii) the proposed method can be easily adapted to enhance existing state-of-the-art methods. To the best of the authors' knowledge, this is the first work in the literature that applies the Attention-BLSTM-FCN model to learn enhanced deep-spectrum representations for SER.

II. RELATED WORK
SER is a highly active research field, with many novel approaches being proposed and investigated over the past decade. With the increase of available data and computational power, deep learning methods are rapidly becoming the predominant approach [17]-[19]. In particular, many recent studies have explored leveraging deep neural networks as feature extractors to learn discriminative representations [20]. Due to their success in many visual recognition tasks, CNNs are widely used for feature representation learning in various speech analysis tasks. For example, Huang et al. used spectrograms of speech together with a CNN to perform SER [21], and similar work is presented in [22], in which a CNN was employed to learn affect-salient features from spectrograms.
Generating spectrograms from audio clips and feeding them through a deep CNN to extract deep spectrum representations has recently become a research trend [1], [2], [23]-[25]. Furthermore, deep spectrum representations benefit from transfer learning, as they are formed by passing spectrograms through pre-trained image classification deep CNNs such as AlexNet [26] or VGG [27]. Deep spectrum representations have been shown to produce suitable salient features which achieve state-of-the-art performance in a range of speech-related recognition tasks including SER [1].
Additionally, given that context information is crucial for detecting emotional states, RNN paradigms are widely used in SER to exploit the temporal information inherent in speech signals. LSTM-RNNs, in particular, are frequently employed in SER tasks [9], [11], [28]-[30].
Inspired by the success of CNNs and RNNs, there has been an increasing interest in incorporating both into a single architecture. For example, in [31], the Convolutional Long Short-Term Memory Deep Neural Networks (CLDNN) model was proposed for speech recognition. The developed model consisted of convolutional layers, LSTM gated recurrent layers, and fully connected (FC) layers. More recently, end-to-end network architectures have emerged as a promising network structure. These can automatically extract representations directly from raw (unprocessed) data, rather than manually extracting hand-crafted features.
The SER approach proposed in [9] jointly exploited a CNN to automatically extract suitable representations from raw audio signals and an LSTM-RNN to capture the temporal information. A similar framework was proposed in [32] for the related task of speech-based depression detection. In [33], a specially designed neural network structure that accepts variable-length speech was proposed for SER. This approach combines CNN-based deep spectrogram representations with an RNN to handle the variable-length speech segments.
Similar to the Attention-BLSTM-FCN model developed in this paper, a parallel combination of an LSTM and a CNN has been explored for acoustic scene classification [34]. The results presented in [34] demonstrate that the LSTM model extracted key sequential information from consecutive audio features and the CNN model learnt salient spectro-temporal locality from spectrogram images.
In summary, while there is a range of work in the literature focusing on feeding spectrograms into CNNs for speech-based recognition tasks, very little research has been undertaken to explore attention-based FCNs and attention-based LSTM-RNNs as mechanisms for extracting emotionally salient information from spectrograms.

III. PROPOSED METHODOLOGY
In our proposed Attention-BLSTM-FCN model (cf. Fig. 1), the Mel-spectrograms are fed into two parallel networks, namely an Attention-BLSTM and an Attention-FCN. We then concatenate the network outputs to form a new feature sequence. The Attention-BLSTM layers extract sequential information from the spectrograms, while the Attention-FCN layers extract spatial information. Fusion of the concurrently extracted and complementary features forms a joint spatio-temporal feature vector.

A. SPECTROGRAM GENERATION
The first step in our proposed system is the extraction of the mel-spectrograms. A spectrogram is a visual time-frequency representation of a signal produced by a short-time Fourier transform (STFT) [38]. In the presented work, we used the librosa framework (https://github.com/librosa/librosa) to first resample the audio signals to 16 kHz, and then transform them to spectrograms using an STFT with a Hamming window, a frame length of 25 ms, and a frame shift of 10 ms. Following this, we mapped the STFT matrices into their squared magnitude via:
$$S_i(f, m) = \left| \mathrm{STFT}\{x_i\}(f, m) \right|^2,$$
where $x_i$ is an utterance signal, $f$ stands for frequency, and $m$ for window position. Finally, we generate the mel-spectrograms by mapping a frequency of $f$ hertz onto the mel scale via:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
where, in this equation only, $m$ denotes the value in mels rather than the window position. Mel-frequency spacing approximates that of the human cochlea, and thus the resulting mel-spectrograms reflect the relative importance of different frequency bands as perceived by the human ear [39].
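The framing parameters and the mel mapping described above can be illustrated with a short, standard-library-only sketch. It is not the authors' implementation: the helper names are hypothetical, and the choice of band edges spanning 0 Hz to the Nyquist frequency (with n + 2 edge points for n bands, as for triangular filters) is an assumption. Note also that the formula used here is the one often referred to as the HTK mel scale, which is not necessarily the default of every toolkit.

```python
import math

SAMPLE_RATE_HZ = 16_000   # resampling rate stated in the text
FRAME_LENGTH_S = 0.025    # 25 ms Hamming window
FRAME_SHIFT_S = 0.010     # 10 ms frame shift


def hz_to_mel(f_hz):
    """Map a frequency in hertz to mels: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Return n_bands + 2 edge frequencies (Hz), equally spaced in mels.

    Assumption: n_bands overlapping bands need n_bands + 2 edge points.
    """
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + k * step) for k in range(n_bands + 2)]


frame_length = round(SAMPLE_RATE_HZ * FRAME_LENGTH_S)  # samples per window
frame_shift = round(SAMPLE_RATE_HZ * FRAME_SHIFT_S)    # samples per shift
edges = mel_band_edges(40, 0.0, SAMPLE_RATE_HZ / 2)    # 40 bands up to Nyquist
```

At 16 kHz, a 25 ms window corresponds to 400 samples and a 10 ms shift to 160 samples; the band edges are dense at low frequencies and sparse at high frequencies, which is the perceptual warping the mel scale is intended to provide.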

B. ATTENTION-BASED BIDIRECTIONAL LONG SHORT-TERM MEMORY NETWORKS
Our proposed system uses attention mechanisms together with a BLSTM in order to focus feature learning onto the salient regions of a sequence. The so-called Attention-Based BLSTM-RNN unit contains four components:
1) The input layer: the spectrogram is fed into the model.
2) An LSTM layer: a BLSTM extracts high-level representations from the output of step (1).
3) An attention layer: produces a weight vector, and merges the frame-level features from each time step into an utterance-level feature vector by multiplying them with the weight vector.
4) The output layer: outputs the resulting utterance-level feature representation.
We describe the LSTM and attention layers in the following.

1) BIDIRECTIONAL LONG SHORT-TERM MEMORY NETWORKS
As LSTM units mitigate the issue of vanishing and exploding gradients in RNN training [8], they are usually employed as the basic unit in RNNs. An LSTM-RNN can, therefore, model long-range dynamic dependencies. A standard LSTM can, however, only process sequential data in one direction [40]; the BLSTM-RNN has been proposed to overcome this limitation. In a BLSTM-RNN, the input is processed both in the standard order and in reversed order, allowing the network to combine future and past information at every time step.
A BLSTM component comprises two LSTM layers that process the input separately: one processes the input in the forward direction and produces the hidden states $\overrightarrow{h}$ and cell states $\overrightarrow{c}$, while the other processes the input in reversed order and produces the hidden states $\overleftarrow{h}$ and cell states $\overleftarrow{c}$. Both $\overrightarrow{h}$ and $\overleftarrow{h}$ are then combined to produce the output sequence of the BLSTM layer. Note that it is also possible to use the cell states $\overrightarrow{c}$ and $\overleftarrow{c}$, instead of the hidden states, of the two LSTM layers to produce the output sequence of the BLSTM layer.

2) ATTENTION LAYER
In this layer, a 1D attention module is built on top of the BLSTM layer. For a sequence of inputs $x$, the attention weight $\alpha_i$ of each entry $x_i$ is calculated as follows:
$$\alpha_i = \frac{\exp(f(x_i))}{\sum_{j} \exp(f(x_j))},$$
in which $f(x)$ denotes the scoring function. We use the linear scoring function $f(x) = W^T x$, in which $W$ is a trainable parameter. The output of the attention layer, denoted $\mathrm{attentive}_x$, is then the weighted sum of the input sequence:
$$\mathrm{attentive}_x = \sum_{i} \alpha_i x_i.$$

C. ATTENTION POOLING BASED FULLY CONVOLUTIONAL NETWORKS
Our proposed system also uses attention mechanisms together with an FCN in order to focus feature learning onto the more emotion-relevant time-frequency regions of the mel-spectrograms of speech.

1) FULLY CONVOLUTIONAL NETWORKS
An FCN resembles a conventional CNN but consists only of convolutional layers, and hence the local feature structures are effectively preserved with a relatively small number of weights. The FCN structure also allows the network to model the temporal and harmonic structure of audio signals [41]. Given these benefits, we use spatial convolutional neural networks with an FCN-like structure for our deep spectrum feature extraction.
In this work, the output of our FCN is a three-dimensional array of size $F \times T \times C$, where $F$ and $T$ correspond to the frequency and time domains of the spectrogram and $C$ to the channel size. We consider the output as a variable-length grid of $L$ elements, $L = F \times T$, which form the set $A = \{a_1, \ldots, a_L\}$. Each element $a_i$ of $A$ is a $C$-dimensional vector corresponding to a region of the speech spectrogram.
In this work, we employ a 3-layer FCN which contains three convolutional layers and three max-pooling layers. The network takes a log-amplitude mel-spectrogram sized 40 × 500 as input and predicts a 128-dimensional output vector. As the FCN is performing feature extraction, its final output comes from the attention pooling [42], which reduces the number of parameters of the network.

2) ATTENTION POOLING METHOD
As not all time-frequency units contribute equally to the emotional state associated with an utterance, we adopt an attention mechanism similar to [6]. We place it on top of the FCN to help the network pay more attention to specific time-frequency regions of the input spectrogram. We realize the attention module as follows. First, each annotation vector $a_i$ is passed through a multilayer perceptron (MLP) layer, with weight matrix $W$ and bias $b$, employing tanh as the non-linear activation function, to obtain a new representation of $a_i$. Next, we calculate the importance weight $e_i$ of $a_i$ as the inner product between this new vector and a learnable vector $u$:
$$e_i = u^T \tanh(W a_i + b).$$
After this, the normalized importance weight $\alpha_i$ is calculated using a scaled softmax function:
$$\alpha_i = \frac{\exp(\lambda e_i)}{\sum_{k=1}^{L} \exp(\lambda e_k)}.$$
In this equation, $\lambda$ is a scale factor which controls the uniformity of the importance weights of the annotation vectors, and ranges between 0 and 1. If $\lambda = 1$, the scaled softmax becomes the commonly used softmax function. If $\lambda = 0$, the importance weights form a uniform distribution on the set $A$, which means that all the time-frequency units have the same importance weight for the final utterance emotion vector.
In this work, we set $\lambda = 0.3$ according to the performance on the validation set [6]. Finally, the utterance emotion vector $c$ is computed as the weighted sum of set $A$ with the importance weights:
$$c = \sum_{i=1}^{L} \alpha_i a_i.$$
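To make the role of the scale factor λ concrete, the following standard-library sketch implements attention pooling as described above: a tanh MLP layer, an inner product with a learnable vector, a scaled softmax, and a weighted sum. It is an illustration under stated assumptions rather than the authors' code; the function names and all numerical values (the toy annotation vectors and the parameters `W`, `b`, and `u`) are hypothetical.

```python
import math


def scaled_softmax(scores, lam=1.0):
    """Return exp(lam * e_i) / sum_k exp(lam * e_k) for each score e_i."""
    scaled = [lam * e for e in scores]
    peak = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [v / total for v in exps]


def attention_pool(annotations, W, b, u, lam=0.3):
    """Pool L annotation vectors (length C each) into one C-dimensional vector."""
    scores = []
    for a in annotations:
        # MLP layer with tanh activation: tanh(W a + b)
        hidden = [math.tanh(sum(w * x for w, x in zip(row, a)) + bias)
                  for row, bias in zip(W, b)]
        # importance weight: inner product with the learnable vector u
        scores.append(sum(ui * hi for ui, hi in zip(u, hidden)))
    weights = scaled_softmax(scores, lam)
    n_channels = len(annotations[0])
    pooled = [sum(w * a[j] for w, a in zip(weights, annotations))
              for j in range(n_channels)]
    return pooled, weights


# Toy example: L = 3 annotation vectors with C = 2 channels (hypothetical values).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.25, 0.75]]  # hypothetical MLP weights
b = [0.0, 0.1]                   # hypothetical MLP bias
u = [1.0, -1.0]                  # hypothetical learnable vector

c_uniform, w_uniform = attention_pool(A, W, b, u, lam=0.0)
c_scaled, w_scaled = attention_pool(A, W, b, u, lam=0.3)
```

With `lam=0.0` every region receives the same weight, so the pooled vector reduces to the plain average of the annotation vectors; with `lam=1.0` the function is the ordinary softmax. The 1D attention layer of Section III-B follows the same pattern with the linear score $W^T x$ in place of the MLP score.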

IV. EXPERIMENTS AND RESULTS
In this section, we provide key details relating to the experimental setup, our experiments, and the results of our analysis.

A. DATASET DESCRIPTION
IEMOCAP consists of audio-visual data with transcriptions from recordings of dialogues between two professional actors, over 5 sessions, with the corpus divided into two parts: improvised and scripted [43]. In our experiments, we only focus on the improvised sessions. Adopting the methodology of previous works, we used a leave-one-session-out strategy. In each training process, 8 speakers from 4 sessions were used as training data, and the remaining session was split into two parts, one used as validation data and the other as test data. It is also worth noting here that the data distribution of the emotion classes is heavily imbalanced. As in [44], we therefore merge the happy and excited utterances into the happy class, since they are close in emotion. Four emotion categories are thus employed in the training and evaluation: angry, happy, sad, and neutral (cf. Table 1).
The FAU Aibo Emotion Corpus (FAU-AEC), on the other hand, is composed of spontaneous and emotional German speech samples [45]. The corpus contains 9.2 hours of German speech from a total of 51 children interacting with Sony's pet robot Aibo at two different schools. As per [46], we used 9 959 utterances from 26 children (13 males and 13 females from the Ohm School) as the training set and 8 257 utterances from 25 children (8 males and 17 females from the Montessori School) as the test set. In this study, we concentrated on the five-class problem with the emotion categories of anger, emphatic, neutral, positive, and rest (cf. Table 2).
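The leave-one-session-out protocol used for IEMOCAP can be sketched as follows. This is a minimal illustration with hypothetical helper names. The text does not state how the held-out session is divided into validation and test parts, so splitting it by speaker is an assumption made here, as is the representation of an utterance as a dictionary with a `speaker` key.

```python
def leave_one_session_out(session_ids):
    """Yield (training sessions, held-out session), one fold per session."""
    for held_out in session_ids:
        train = [s for s in session_ids if s != held_out]
        yield train, held_out


def split_held_out_by_speaker(utterances):
    """Split one session's utterances into validation and test parts.

    Assumption: each utterance is a dict with a 'speaker' key, and the
    utterances of one speaker form the validation part while the rest
    form the test part.
    """
    speakers = sorted({u["speaker"] for u in utterances})
    validation = [u for u in utterances if u["speaker"] == speakers[0]]
    test = [u for u in utterances if u["speaker"] != speakers[0]]
    return validation, test


folds = list(leave_one_session_out([1, 2, 3, 4, 5]))

# Hypothetical utterances from a single held-out session.
session = [
    {"id": "utt_a", "speaker": "F"},
    {"id": "utt_b", "speaker": "M"},
    {"id": "utt_c", "speaker": "F"},
]
validation_part, test_part = split_held_out_by_speaker(session)
```

Each of the five folds trains on four sessions, so the reported results can be averaged over folds without any speaker appearing in both the training data and the held-out data.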

B. EXPERIMENT SETUP AND EVALUATION METRICS
The proposed Attention-BLSTM-FCN model has many hyperparameters, a proportion of which were tuned based on the recommendations from previous works which utilized the same database [10], [16]. In order to identify the optimal model, we optimized 15 hyperparameters: window size, convolutional kernel size, pooling size, stride of the convolutional layers, initial number of filters and neurons, learning rate, the number of convolutional/pooling/fully connected layers, type of activation function, optimization algorithm, dropout on the convolutional and fully connected layers, and frequency resolution of the input spectrogram. The details of these hyperparameters are given below:
1) The window size was set to 25 ms (window sizes from 15 ms to 200 ms were tested) and the window shift to 10 ms.
2) The BLSTM contained 128 × 2 nodes. We also tested BLSTMs of 64 × 2 nodes; however, we observed an accuracy drop of 1-3 %.
3) Our Mel-spectrograms were formed using 40 Mel bands (30, 60, 80, and 100 bands were also tested).
4) The optimal FCN topology was found to be 3 layers (we tested 2-5 layers); similarly, the best topology for the BLSTM was found to be 2 layers (we tested 1-3 layers).
5) The numbers of FCN filters were set to 64, 128, and 128 (each layer was tested with 8 to 256 filters). The stride of the convolutional layers was set to (1, 1).
6) A dropout layer, batch normalization, and ReLU activation functions were applied to prevent overfitting.
7) The Adam optimizer with a learning rate of $10^{-3}$ and a decay of $10^{-6}$ was used for training.
8) All models were implemented with the TensorFlow framework (https://www.tensorflow.org).
9) All models were trained for a maximum of 100 epochs with a batch size of 100, with dropout regularization utilized to prevent overfitting.
To evaluate the performance of the proposed framework, we conducted several experiments. First, in order to investigate the influence of spatial and temporal information, we built our FCN, attention-FCN, and attention-BLSTM models as described above. A comparison of the FCN, attention-FCN, attention-LSTM, and attention-BLSTM models, as well as our proposed model, was performed.
Second, we evaluated the performance of spectrograms with different resolutions using the Attention-BLSTM-FCN model. Resolution is an important design decision for models that rely on spectrograms; the work presented in [47] reveals performance differences among different frequency resolutions of the input spectrogram. To this end, the Attention-BLSTM-FCN model was re-trained using a 30-band, 40-band, 60-band, 80-band, or 100-band Mel-spectrogram. Moreover, spectrograms are a 2D representation of audio signals. On the one hand, changes in the Mel-scale resolution have a scale effect in the vertical direction. On the other hand, the horizontal scale of each data point is influenced by the (temporal) window length (cf. Figure 2). We therefore also tested the effect of varying the window size from 15 ms to 200 ms.
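For reference, the settings reported above can be gathered into a single configuration object, sketched below. The class and field names are hypothetical; the values are those stated in the text, except that interpreting "128 × 2 nodes" as 128 units per direction is an assumption. Kernel sizes, pooling sizes, and dropout rates are not reported in this section and are therefore left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionBlstmFcnConfig:
    """Reported settings of the selected model (field names are hypothetical)."""
    window_ms: int = 25
    shift_ms: int = 10
    n_mel_bands: int = 40
    blstm_layers: int = 2
    blstm_units_per_direction: int = 128  # assumption: "128 x 2" = 128 per direction
    fcn_filters: tuple = (64, 128, 128)
    conv_stride: tuple = (1, 1)
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    decay: float = 1e-6
    max_epochs: int = 100
    batch_size: int = 100


# Values explicitly listed as tested in the text.
TESTED_MEL_BANDS = (30, 40, 60, 80, 100)
TESTED_WINDOW_RANGE_MS = (15, 200)

config = AttentionBlstmFcnConfig()
```

Keeping the selected values and the tested ranges side by side makes it easier to reproduce the ablations over frequency resolution and window size described in this section.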
Third, we compared the effectiveness of the deep spectrum representations extracted from the Attention-BLSTM-FCN model with two commonly used SER feature representations: the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [48] and the Interspeech Computational Paralinguistics Challenge (ComParE) feature set. In order to do so, we first extracted the eGeMAPS and ComParE low-level features with the openSMILE toolkit [49]. Due to the high dimension of the ComParE feature set, we performed principal component analysis (PCA) on the training set to reduce the feature size, selecting the top 150 components, which explained >95 % of the variance of the original features. We then applied functionals (max, min, range, mean, and standard deviation) to the two feature representations, independently for each combination of speaker and feature. Finally, all the feature representations were fed into a linear support vector machine (SVM) implemented using the scikit-learn toolbox and trained via stochastic gradient descent.
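The functional step can be illustrated with a small standard-library sketch that collapses a sequence of frame-level feature vectors into one fixed-length vector. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used. Grouping the frames by speaker before calling the function is left to the caller.

```python
import statistics

FUNCTIONAL_NAMES = ("max", "min", "range", "mean", "std")


def apply_functionals(values):
    """Compute the five functionals named in the text for one feature."""
    values = list(values)
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),  # assumption: population estimator
    }


def summarise(frames):
    """Map a list of frame-level feature vectors to one flat functional vector."""
    n_features = len(frames[0])
    summary = []
    for j in range(n_features):
        stats = apply_functionals(frame[j] for frame in frames)
        summary.extend(stats[name] for name in FUNCTIONAL_NAMES)
    return summary


# Three frames with two low-level features each (hypothetical values).
frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
vector = summarise(frames)
```

Each low-level feature contributes five values, so an input with D features yields a 5D-dimensional vector regardless of the number of frames, which is what allows variable-length utterances to be passed to a linear SVM.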
Finally, in order to show the effectiveness of our approach, we compared its performance with systems based on pre-trained CNNs, namely 'AlexNet' [26], 'VGG16', and 'VGG19' [27]. We obtained the pre-trained 'AlexNet' network from MATLAB R2017a3, and 'VGG16' and 'VGG19' from MatConvNet [50]. We then used the Mel-spectrograms as the input for these three pre-trained CNNs and extracted the deep representations from the activations of the second fully connected layer (fc7) as feature vectors (cf. Table 3). The feature representations extracted by the three pre-trained CNNs and by the Attention-BLSTM-FCN were fed into the linear SVM.
As evaluation measures, we employ the standard evaluation criteria used on the IEMOCAP and FAU-AEC datasets. For IEMOCAP, we used both unweighted and weighted accuracy (UA and WA, respectively) as the evaluation metrics, while for FAU-AEC we use only UA as the evaluation measure, as this database is extremely unbalanced. Furthermore, in order to tackle the problem of unbalanced data, we apply class weights during training (cf. Table 4); these are determined from $N$, the total number of training examples, and $N_k$, the number of training examples of class $k$ [51].
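The difference between the two metrics can be seen in a short sketch that follows the definitions commonly used in SER: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls. These definitions are assumed here, as the section does not spell them out, and the labels in the example are hypothetical.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correctly classified utterances / all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: a classifier that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

In this toy case the majority-class predictor reaches a WA of 0.8 but a UA of only 0.5, which illustrates why UA is the preferred measure for a heavily unbalanced corpus such as FAU-AEC.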

C. RESULTS
A comparison of the FCN, attention-FCN, attention-LSTM, attention-BLSTM, and proposed models shows that the Attention-BLSTM-FCN model achieves the best performance. It can be seen that the proposed approach outperforms previous works on the IEMOCAP and FAU-AEC datasets (cf. Table 5). Our highest WA and UA achieved on IEMOCAP were 68.1 % and 67.0 %, respectively. This represents a significant improvement over the baseline FCN model (p < .05 in a one-tailed z-test).
The same system set-up also achieved the best UA, 45.4 %, on FAU-AEC. Again, this represents a significant improvement over the baseline FCN (p < .05 in a one-tailed z-test).
Our second experiment explores the effect of the frequency resolution in our system set-up (cf. Table 6). We observed that the frequency resolution plays an important role when extracting deep features. In this group of experiments, the best performances, 68.1 % (WA) and 67.0 % (UA) on IEMOCAP and 45.4 % (UA) on FAU-AEC, were achieved with a resolution of 40 Mel-bands.
When comparing our features with two standard acoustic feature sets (cf. Table 7), we observed that the best UA (66.5 %) and WA (66.7 %) on IEMOCAP and the best UA (43.9 %) on FAU-AEC were achieved by the deep spectrum features extracted from our proposed model. This set-up yielded a significant improvement over the eGeMAPS (p < .01 in a one-tailed z-test) and ComParE (p < .01 in a one-tailed z-test) feature sets. These comparisons indicate the promise of the deep spectrum features; further investigations are warranted to establish their suitability over a range of speech-related tasks.
Finally, when comparing the Attention-BLSTM-FCN model with more conventional deep spectrum approaches, the advantages of this framework can be clearly seen (cf. Fig. 3 and Fig. 4). Across the two data sets, and across the different mel-frequency resolutions, the Attention-BLSTM-FCN approach also yielded a significant improvement over the deep spectrum representations extracted by 'AlexNet', 'VGG16', and 'VGG19' (p < .01 in a one-tailed z-test). Given the previous results showing the suitability of AlexNet, in particular, for deep spectrum feature extraction [1], [2], [10], [16], these results highlight the effectiveness of our proposed model for SER.
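The significance statements above refer to a one-tailed z-test, but the exact form of the test is not specified in the text. One common choice for comparing two accuracies is the pooled two-proportion z-test, sketched below as an assumption rather than as the authors' procedure. The function name is hypothetical, and the counts in the example are invented for illustration only; they are not results from this paper.

```python
import math


def one_tailed_two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Test whether system A's accuracy exceeds system B's.

    Returns the z statistic and the one-tailed p-value P(Z >= z) under
    the null hypothesis that both systems have the same accuracy.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return z, p_value


# Invented counts, for illustration only: 700/1000 versus 650/1000 correct.
z_stat, p_val = one_tailed_two_proportion_z_test(700, 1000, 650, 1000)

# Identical accuracies give z = 0 and a one-tailed p-value of 0.5.
z_same, p_same = one_tailed_two_proportion_z_test(500, 1000, 500, 1000)
```

The test treats the two sets of predictions as independent samples; when both systems are evaluated on the same utterances, a paired test may be more appropriate.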

D. DISCUSSION
From an overall experimental viewpoint, the presented results demonstrate that our proposed model achieves notable performance improvements over existing methods on IEMOCAP as well as on FAU-AEC. Furthermore, the proposed model outperforms both the baseline models and the individual application of the attention-FCN and attention-BLSTM models. These comparisons imply that it is crucial to use both spatial and temporal spectral information to boost speech emotion recognition and analysis. In terms of improved performance, it is clear that the attention-FCN and attention-BLSTM models complement each other. The consistently stronger performance of the Attention-BLSTM-FCN deep features compared to those of the three pre-trained deep convolutional neural networks (cf. Fig. 3 and Fig. 4) supports this hypothesis.
Our results also demonstrate that, on average, attention mechanisms can improve the prediction accuracy of the FCN and BLSTM modules. We observed that the attention-FCN module did not result in a consistent improvement in WA over the FCN model alone on the IEMOCAP dataset. In this regard, it is important to note that WA is highly dependent on the distribution of classes in the dataset. Therefore, we lend more importance to the UA, as it better reflects performance under the imbalanced distribution of the emotional classes. A comparison of the results for the proposed architecture with those for the eGeMAPS and ComParE feature sets indicates that it performs well as a feature extractor. It is worth noting that we did not perform any preprocessing on the data and only used an SVM for classification. Additionally, the recognition results based on the deep spectrum representations derived from the proposed model outperformed those of the other two commonly used feature sets. These results add to the growing evidence in the literature that forwarding spectrogram representations through deep learning models produces salient features suitable for speech-related classification tasks.
We also observed that the frequency resolution of the input spectrogram is an important factor in determining the overall performance of the model (cf. Table 6). This effect is most likely due to the network learning some form of frequency discriminating function. Consistent with some other results in the literature [58], setting the frequency resolution to 40 mel bands yields better results than any other value. This result contradicts those presented in [47], in which it was observed that using a higher number of mel frequency bands uniformly improved system accuracy. However, a reasonable explanation for this could be the difference in the models employed. Thus, the number of mel bands required should potentially be treated as a hyperparameter and evaluated on a case-by-case basis.
Although our results are encouraging, our approach has several limitations, and a number of directions should be considered for future research. A potential limitation of our proposed model is the increased computational cost due to the larger number of trainable weights and hyperparameters. Moreover, further research needs to be conducted to confirm the robustness of our proposed model. Furthermore, we expect that the application of our approach to larger datasets would show bigger improvements with respect to deep spectrum representations.

V. CONCLUSION
We have proposed and developed a joint deep neural network architecture comprising a parallel combination of attention enhanced FCN and BLSTM networks to perform efficient SER from spectrograms. We trained an Attention-BLSTM-FCN model based on the spectrograms generated from the IEMOCAP and FAU-AEC datasets. The results of our experiments are highly promising, providing a new direction for consideration when performing emotion recognition.
In future work, we plan to further realize the potential of our proposed model and deep spectrum representations by establishing their suitability in other speech and acoustic recognition tasks.
ZHONGTIAN BAO was born in Ningbo, Zhejiang, China, in 1999. He received the bachelor's degree from Nanjing University, in 2017. He is currently pursuing the master's degree with Tianjin Normal University. His research interests include speech emotion recognition and applications.
YIQIN ZHAO was born in Taiyuan, Shanxi, China, in 1996. He is currently pursuing the bachelor's degree in software engineering with Tianjin Normal University, where he has been a member with the Cognitive and Affective Computing Lab, since 2016. His current research interests include the intersection of affective computing, audio signal processing, and machine learning.