Speech Emotion Recognition by Late Fusion for Bidirectional Reservoir Computing With Random Projection

Speech Emotion Recognition (SER) attracts many researchers because it is a key component of Human-Computer Interaction (HCI). The main focus of this work is to design a model for recognizing emotion from speech, a task that poses many challenges. Because emotion in speech is sparse and unfolds over time, we adopt a multivariate time series feature representation of the input data. We also adopt the Echo State Network (ESN), a reservoir computing model and a special case of the Recurrent Neural Network (RNN), whose untrained and sparse reservoir maps the features into a higher dimensional space without adding model complexity. Additionally, we apply dimensionality reduction using Sparse Random Projection (SRP), since it offers significant computational advantages. Late fusion of the bidirectional input is applied to capture additional information independently from each direction of the input data. Speaker-independent and/or speaker-dependent experiments were performed on four common speech emotion datasets: Emo-DB, SAVEE, RAVDESS, and FAU Aibo Emotion Corpus. The results show that the designed model outperforms the state-of-the-art at a lower computational cost.


I. INTRODUCTION
Emotion plays an important role in many parts of human life, such as communicating, understanding, helping each other, rational thinking, and creativity, and it sometimes has a vital part in decision making. However, there is no general agreement on how to categorize, recognize, and analyze it because of the differences among cultures and individuals. Emotion can be detected from various channels such as electroencephalography (EEG) signals, acoustic, visual, text, and gestures. Detecting emotion is a challenging task that has become an active and wide research area due to the high demand for it in many practical applications such as healthcare, social robots, and Human-Computer Interaction (HCI) [1], [2]. However, emotions do not have a static categorization, and adapting to new ones is not easy, which is why some works use unsupervised models for unknown emotions and growing models to deal with the adaptation [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
Speech is an effective, quick, and important way for individuals to communicate with each other [4], and the speech signal is considered a fast and useful mechanism for HCI. Emotions have always been a part of normal human conversation, which makes speech more attractive and more effective. Detecting emotions from speech signals is an old yet major challenge in the field of artificial intelligence [5], and it has motivated many researchers to work on it.
For this reason, Speech Emotion Recognition (SER) plays a significant role in HCI and has made great progress in recent years. However, certain aspects of inner feelings remain concealed and are not easily measurable from speech, particularly when people want to suppress their emotions. Thus, a computer-based system cannot be expected to go beyond what is perceived from the input speech sample. One of the challenges in SER is to determine the most relevant acoustic emotion features to extract from the raw speech signal. Researchers endeavor to find more effective features for detecting emotion in speech [6]. Recent studies have shown that emotional information in speech is distributed over multiple types of features [2], and finding the features that carry the most information about human emotion is critical. Two main kinds of features have been used: handcrafted features and deep learned emotion features. Many applications involving time series or sequential data, such as speech recognition, have achieved state-of-the-art results with deep learning approaches such as Recurrent Neural Networks (RNNs), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM) [7]. However, Zhong et al. [8] reviewed data representation research, including traditional feature extraction and deep learning, and concluded that the gap between the theory and practical applications of deep learning is still quite big, and that deep learning models are not always the best approach, especially in real-world problems.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A multivariate time series feature representation can accommodate the sparse nature of emotion in speech. To handle such data, some studies have used the Echo State Network (ESN), a special type of RNN that belongs to the reservoir computing framework. The main reported advantage of the ESN is its simple architecture, which consists of an input layer, a reservoir layer whose sparsely connected neurons are randomly assigned without training, and an output layer [9]. The ESN can handle the temporal dependency of time series data effectively and has been successfully applied to chaotic time series prediction [10], [11]. Its simplicity comes from assigning random, non-trainable weights, which avoids the time complexity of deep recurrent networks [12] and makes the ESN a strong candidate for tasks involving real-time processing [13], [14], such as time series forecasting [15].
Some researchers have addressed the instability of the ESN that results from the random assignment of the reservoir weights, which are assigned only once and then fixed [9]. The authors in [16] adopted a bidirectional input: the input sequence is fed to the same reservoir in both the forward and backward directions to capture different, independent views of the information in the input data. The authors in [17] showed that feeding the input in both straight and reverse order improves memorization.
Dimension reduction techniques transform high dimensional data in the feature space into a lower-dimensional subspace to reduce computation and to help de-correlate the transformed data. The high dimensional sparse output of the reservoir layer makes the feature representation intractable, leads to overfitting, and demands high computational resources [17]. Dimensionality reduction techniques address these problems with a transformation map such as Principal Component Analysis (PCA) or Random Projection (RP) [18]. In machine learning, dimension reduction is also useful for preparing a more informative representation for the classifier. Several studies have used PCA as a powerful tool for reducing the dimension of the reservoir layer output [17], [19].
Tuning the hyperparameters of the ESN is a common issue since they significantly affect the performance of the reservoir. Optimizing these hyperparameters is typically slow; consequently, researchers either assign them manually based on experience [20] or adopt optimization approaches such as grid search, random search, and Bayesian optimization [21].
In this work, we propose a novel reservoir computing approach for SER that uses bidirectional late fusion, Sparse Random Projection (SRP), and hyperparameter optimization with a Bayesian optimization method. Additionally, handcrafted multivariate time series features, namely Mel-Frequency Cepstral Coefficients (MFCCs) and Gamma-Tone Cepstral Coefficients (GTCCs), are used to feed the reservoir layer. The main contributions of the proposed model are: 1) adopting a very sparse random projection [22] approach for dimension reduction, which can be more compatible with the sparse data distribution produced by the reservoir; 2) using the bidirectional approach with late fusion of the representations, which may improve the memorization capability of the ESN.
The rest of this paper is organized as follows: Section II reviews the literature on existing SER methods, Section III presents the proposed model, and Section IV shows the experiments and results. The discussion is presented in Section V, and the conclusion and future work come in Section VI.

II. LITERATURE REVIEW
Researchers widely use speech signals to detect emotion in the field of HCI to improve the interaction between humans and computers. A suitable classification model and relevant emotion features with distinctive information are therefore the two most significant aspects of speech emotion recognition models [23].
To extract valuable features, some researchers prefer handcrafted features while others use deep learned features. A handcrafted feature representation can be a global feature that represents each sample as one vector, or it can consist of local features extracted from the sequence of frames. There are a variety of open-source toolkits for extracting features from speech, such as openSMILE [24] and COVAREP [25]. Many studies [26]-[29] have used the openSMILE toolkit, as it is one of the best-known tools for extracting emotion features from speech. In those studies, openSMILE is used to extract non-temporal global features. However, some researchers use time series features from speech signals for real-time emotion recognition. Scherer et al. [14] used spectral features from frames; however, they were not fully successful in recognizing emotion in real time. On the other hand, recent works focus on features learned directly from the raw speech signal by deep learning models [8], [30]. The authors in [31] used a 1D CNN for SER systems that can learn features from the speech signal. A time series feature representation requires a suitable classifier such as an RNN, which is computationally intensive.
Besides choosing the right features from speech, developing a robust mathematical model is another vital step toward high performance in emotion recognition from speech signals [32]. As mentioned before, frame-based features require a model that supports multivariate time series data, such as an RNN. In [31] and [33], high-level feature representations are used together with a bidirectional long short-term memory (BLSTM) model. A speech emotion model using both CNN and LSTM was proposed in [34] and [35]. Data augmentation techniques are applied in [36] to the Acted Emotional Speech Dynamic Database (AESDD) with a CNN for continuous speech emotion recognition.
However, few researchers have reported using the ESN for SER. For example, the authors in [14] proposed a real-time speech emotion recognition model that was not fully successful. To participate in the Evalita 2014 competition, Gallicchio et al. [37] proposed an ESN to detect emotion from speech. Additionally, Saleh and Micheli [38] used an ESN for SER, where only the neutral and anger emotion classes are used in their model.
The time complexity of RNN-based models (such as LSTM) versus the ESN has been investigated and reported frequently. The untrained nature of the ESN can significantly reduce the time complexity, as shown in Table 1. The reported ESN performance is generally comparable to that of the LSTM. However, we shall see in the discussion section that the ESN proposed in this work can outperform the LSTM for SER. With its competitive performance in time series prediction and the simplicity of its architecture, the ESN has been proposed for many applications [42]. Bidirectionality is applied in the ESN by feeding an input sequence into the same reservoir in both the forward and backward directions to capture additional information independently from each direction of the input data. For example, the authors in [16] and [17] proposed a bidirectional reservoir to improve the memorization capability. For the same purpose, Bianchi et al. [43] proposed the Bidirectional Deep-readout ESN (BDESN) with a multilayer perceptron (MLP) as a classifier. The deep bidirectional LSTM [31], [33] has been used in the SER field to learn the temporal information for detecting the final state of emotion.
The high dimensional sparse output of a reservoir layer makes the feature representation suffer from the curse of dimensionality, which is why a dimension reduction step is necessary to prepare a non-sparse representation to feed the classifier. Principal Component Analysis (PCA) was used with the ESN in [17] and [43] to improve the model performance, while [19] used an ELM-based Auto-encoder (ELM-AE) besides PCA to reduce the dimensionality between reservoirs in a deep ESN approach.
The hyperparameters of the ESN have a significant effect on the model performance. Some researchers fix these hyperparameters [17], [20]. To improve ESN performance, [44] optimized the hyperparameters with the Grasshopper Optimization Algorithm (GOA). Bayesian optimization [45] has also been used to tune ESN hyperparameters [21], [48]. To achieve more satisfactory performance for SER, Bayesian optimization has been adopted by [46] to optimize the hyperparameters of k-nearest neighbors, support vector machine, and decision tree classifiers, and by [47] to optimize the kernel size of a CNN.

III. METHODOLOGY
In this section, the model design is presented and the proposed model is briefly explained. The section describes the main components of the solution and explains how the proposed method helps to improve the performance of the ESN in recognizing emotions from speech. Most works on SER have used global features, and very few have worked on local time series features. Several works in the literature feed time series features to an LSTM model. In addition, a few works have used the ESN for SER systems [14], [37], [38]; however, none of them reached an outstanding performance. This unconvincing performance may be due to three factors: 1) unidirectional signal processing, which loses important information between the speech frames in the opposite direction; 2) the very high dimensional representation that the ESN produces for temporal data, which negatively influences the performance of the classifier; and 3) manual tuning of the ESN hyperparameters instead of optimizing them, which may not lead to the optimum performance of the ESN model. To overcome these drawbacks, and inspired by the work of [17], we use an ESN with bidirectional time series features and a dimension-reduced representation to recognize emotions from speech. Our contribution in this work is to modify the adapted model to improve the performance of SER. The next subsections give the details of the proposed model, which is shown in Figure 1.

A. FEATURE EXTRACTION
Speech features with discriminative information have a vital role in emotion recognition from speech. Properly extracted speech emotion features reflect the available information about emotion characteristics and the effect of the speaker's emotional condition on the speech signal.
In this work, frame-based handcrafted features are adopted to feed the proposed model. The first set of features consists of 13 MFCC features. MFCC is the most widely used feature for speech emotion recognition because of its simplicity of computation and its good capability of extracting informative features. However, the performance of MFCC-based models degrades under noisy conditions because MFCCs are biased by noise, which leads to mismatched likelihood calculation [49]. Therefore, we also extracted 13 GTCC features, which perform better than MFCCs under noisy conditions. Overall, 26 features are used as input to our model.
The audioFeatureExtractor object in MATLAB was used to extract the features with windows of length 30 ms overlapped by 20 ms. Since the lengths of the samples vary (see Figure 2), we equalized them by padding with zeros or pruning at the start and the end of each raw sample. Consequently, we used 500, 600, 400, and 300 frames for Emo-DB, SAVEE, RAVDESS, and FAU Aibo respectively, based on the near-maximum length for each dataset.
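The length equalization step can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `fix_length` is hypothetical, and the choice to split the padding or pruning evenly between the start and the end of a sample is an assumption based on the description above. With 30 ms windows overlapped by 20 ms, consecutive frames are 10 ms apart, so 500 frames cover roughly 5 s of speech.

```python
def fix_length(frames, target_len, n_features=26):
    """Return exactly target_len frames of n_features values each.

    Assumption: padding and pruning are split evenly between the start
    and the end of the sample; the paper does not spell out the split.
    """
    n = len(frames)
    if n >= target_len:
        # Prune: keep a centered window of target_len frames.
        start = (n - target_len) // 2
        return [list(f) for f in frames[start:start + target_len]]
    # Pad: add zero frames before and after the sample.
    missing = target_len - n
    before = missing // 2
    after = missing - before
    zero = [0.0] * n_features
    padded = [list(zero) for _ in range(before)]
    padded += [list(f) for f in frames]
    padded += [list(zero) for _ in range(after)]
    return padded
```

For example, `fix_length(sample, 500)` would bring an Emo-DB sample of 13 MFCCs and 13 GTCCs per frame to the 500 frames used for that dataset.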

B. BIDIRECTIONAL RESERVOIR COMPUTING-ESN
Echo State Networks (ESNs) were first proposed by [50] as a special case of RNN for learning nonlinear systems, and they are part of the reservoir computing framework.
The Reservoir Computing (RC) framework is a kind of RNN model whose recurrent weights are initialized randomly and then fixed without training, followed by a trainable layer that is fitted to the output [51].
The untrained nature of the ESN lets it avoid the complexity of trained networks such as LSTM. It is sparse by nature, and it maps the features into a higher dimensional space. In addition, the ESN has a simple architecture that contains an input layer, a reservoir layer, and an output layer. At the input layer, bidirectional multivariate time series data are applied by feeding an input sequence into the reservoir in both the forward and backward directions. The advantage of the bidirectional approach is that it captures additional information independently from each direction of the input data and improves memorization with straight and reverse inputs. The reservoir layer contains sparsely connected neurons whose weights are randomly assigned and fixed without training.
The ESN can handle the temporal dependence of time series data effectively and has been successfully applied to chaotic time series prediction. The simplicity of the ESN lies in the fact that most of its weights are randomly assigned and not trainable. Deep recurrent networks require extensive computing time because of their complexity, which makes the ESN a strong candidate for tasks involving real-time processing such as time series forecasting.
Each input sample is a multivariate time series that contains a D-dimensional feature vector for each time step t, where t = 1, 2, ..., T and T is the number of time steps. In other words, $x(t) \in \mathbb{R}^{D}$ and $X = [x(1), x(2), \ldots, x(T)]^{T}$. Note that T represents the number of time steps after padding the samples to avoid length differences. As an RNN-based model, the reservoir model is suitable for sequential data and for the bidirectional approach adopted in this work. The state in the reservoir layer can be updated using the following equations:

$$\overrightarrow{rs}(t) = f\big(\overrightarrow{x}(t), \overrightarrow{rs}(t-1); \theta_{enc}\big), \qquad \overleftarrow{rs}(t) = f\big(\overleftarrow{x}(t), \overleftarrow{rs}(t-1); \theta_{enc}\big) \tag{1}$$

where $\overrightarrow{x}(t)$ and $\overleftarrow{x}(t)$ are the input at time t in the forward and time-reversed order, and $\overrightarrow{rs}(t)$ and $\overleftarrow{rs}(t)$ are the RNN states at time t for both bidirectional inputs, which are computed as a function of their previous values ($\overrightarrow{rs}(t-1)$ and $\overleftarrow{rs}(t-1)$). In addition, f is a nonlinear activation function, here the hyperbolic tangent, and $\theta_{enc}$ are the adaptable parameters of the reservoir. Equation (1) can be written in its simplest formulation as follows:

$$\overrightarrow{rs}(t) = \tanh\big(W_{in}\,\overrightarrow{x}(t) + W_{r}\,\overrightarrow{rs}(t-1)\big), \qquad \overleftarrow{rs}(t) = \tanh\big(W_{in}\,\overleftarrow{x}(t) + W_{r}\,\overleftarrow{rs}(t-1)\big) \tag{2}$$

where $W_{in}$ is the input weight matrix and $W_{r}$ is the weight matrix of the reservoir connections, and the reservoir state sequences $\overrightarrow{RS} = [\overrightarrow{rs}(1), \ldots, \overrightarrow{rs}(T)]^{T}$ and $\overleftarrow{RS} = [\overleftarrow{rs}(1), \ldots, \overleftarrow{rs}(T)]^{T}$ are generated by the reservoir layer over time. The parameters $\theta_{enc}$ can be represented as

$$\theta_{enc} = \{W_{in}, W_{r}\}$$

The reservoir has several hyperparameters that have a significant effect on its performance: (i) the number of internal (hidden) units R; (ii) the spectral radius ρ of the reservoir connection weight matrix $W_{r}$, which helps the system to be stable [52] and normally should be less than 1; (iii) the connectivity β, the percentage of non-zero connection weights; (iv) the scaling ω of the values in $W_{in}$, which together with ρ controls the amount of nonlinearity in the hidden units and can change the internal dynamics from a chaotic regime to a contractive regime [53]; (v) the leak, the amount of leakage in the reservoir state update; and (vi) an optional dropout regularization, for which we applied a dropout designed for recurrent architectures [17].
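The bidirectional state update described above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the uniform weight initialization, and the default hyperparameter values are hypothetical, and the leaky-integrator form of the update is a common ESN convention assumed here because the text lists the leak as a hyperparameter without giving its formula. With `leak=1.0` the update reduces to a plain tanh of the input and recurrent terms. Dropout is omitted.

```python
import numpy as np

def make_reservoir(n_inputs, n_units, spectral_radius=0.9,
                   connectivity=0.25, input_scaling=0.5, seed=0):
    """Draw fixed random weights (never trained)."""
    rng = np.random.default_rng(seed)
    # Input weights scaled by omega (input_scaling).
    w_in = rng.uniform(-input_scaling, input_scaling, size=(n_units, n_inputs))
    # Sparse recurrent weights: keep a fraction beta (connectivity) nonzero.
    w_r = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    w_r *= rng.random((n_units, n_units)) < connectivity
    # Rescale so the largest absolute eigenvalue equals rho.
    radius = np.max(np.abs(np.linalg.eigvals(w_r)))
    w_r *= spectral_radius / radius
    return w_in, w_r

def run_reservoir(x, w_in, w_r, leak=1.0):
    """x has shape (T, D); returns the state sequence of shape (T, R)."""
    states = np.zeros((x.shape[0], w_r.shape[0]))
    state = np.zeros(w_r.shape[0])
    for t, x_t in enumerate(x):
        update = np.tanh(w_in @ x_t + w_r @ state)
        # Assumed leaky integration; leak=1.0 gives the plain tanh update.
        state = (1.0 - leak) * state + leak * update
        states[t] = state
    return states

def bidirectional_states(x, w_in, w_r, leak=1.0):
    """Feed the sequence and its time reversal to the same reservoir."""
    forward = run_reservoir(x, w_in, w_r, leak)
    backward = run_reservoir(x[::-1], w_in, w_r, leak)
    return forward, backward
```

Both directions share `w_in` and `w_r`, which mirrors the description of feeding the input to the same reservoir in forward and backward order.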

C. RANDOM PROJECTION BASED DIMENSION REDUCTION
The high dimensional sparse output of the reservoir layer makes the feature representation intractable and leads to overfitting and high computational cost. Therefore, Sparse Random Projection (SRP) is used to transform the sparse output into a more compact representation.
Trainable dimension reduction methods such as PCA are well known. However, because the sparse data produced by the reservoir follow a binomial distribution, a sparse random projection whose nonzero values are initialized to 1 and −1 can be a suitable alternative. In addition, PCA is more time consuming because it requires training. SRP reduces the dimensions while approximately preserving distances, and it has low complexity since it needs no training and removes redundancies with minimal loss of information.
In this work, we follow [22] by using an SRP matrix. The SRP matrix R is initialized with 1 and −1 as in the following equation:

$$R_{ij} = \sqrt{s} \times \begin{cases} +1 & \text{with probability } \frac{1}{2s} \\ 0 & \text{with probability } 1 - \frac{1}{s} \\ -1 & \text{with probability } \frac{1}{2s} \end{cases}, \qquad s = \sqrt{d}$$

where d is the dimension of the reservoir output state. This step reduces the dimension to a specific number that can be fixed or optimized. Reducing the dimensions has a significant impact on the reservoir model space, which is applied next. The dimensionality reduction step decreases the number of reservoir output features and produces the new sequences $\overrightarrow{H}$ and $\overleftarrow{H}$, which are the input to the model space.
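A very sparse random projection of the kind cited in [22] can be sketched with NumPy as follows. The sketch assumes the standard very sparse setting, in which each entry is +√s or −√s with probability 1/(2s) each and 0 otherwise, with s = √d; the function names and the 1/√k output scaling are illustrative choices and are not taken from the paper.

```python
import numpy as np

def very_sparse_projection_matrix(d, k, seed=0):
    """Return a (d, k) matrix with entries in {+sqrt(s), 0, -sqrt(s)}, s = sqrt(d)."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(d)
    values = np.array([1.0, 0.0, -1.0]) * np.sqrt(s)
    probabilities = [1.0 / (2.0 * s), 1.0 - 1.0 / s, 1.0 / (2.0 * s)]
    return rng.choice(values, size=(d, k), p=probabilities)

def project(states, matrix):
    """Map states of shape (T, d) to shape (T, k).

    Dividing by sqrt(k) keeps squared norms unchanged in expectation.
    """
    return states @ matrix / np.sqrt(matrix.shape[1])
```

No fitting is involved: the matrix is drawn once and reused for every sample and for both directions, which is where the cost advantage over a trained reduction such as PCA comes from.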

D. RESERVOIR MODEL SPACE AND LATE FUSION
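The reservoir model space proposed by [17] characterizes a generative model of the reservoir sequence and induces a metric relationship between the samples. In this work, we adopt a bidirectional approach with late fusion. Processing each direction separately can provide richer information about the relation of the time steps in both the forward and backward directions. Late fusion combines more diverse representations of the data and highlights the characteristics of each individual direction. Consequently, the formula of the model proposed by [17] is adapted to the two separate outputs of the unsupervised dimensionality reduction by SRP, as shown in the following equations:

$$\overrightarrow{h}(t+1) = \overrightarrow{U}_{h}\,\overrightarrow{h}(t) + \overrightarrow{u}_{h}, \qquad \overleftarrow{h}(t+1) = \overleftarrow{U}_{h}\,\overleftarrow{h}(t) + \overleftarrow{u}_{h}$$

$$\overrightarrow{r}_{X} = \overrightarrow{\theta}_{h} = \big[\mathrm{vec}(\overrightarrow{U}_{h}); \overrightarrow{u}_{h}\big] \in \mathbb{R}^{D(D+1)}, \qquad \overleftarrow{r}_{X} = \overleftarrow{\theta}_{h} = \big[\mathrm{vec}(\overleftarrow{U}_{h}); \overleftarrow{u}_{h}\big] \in \mathbb{R}^{D(D+1)}$$

where $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$ are the elements of $\overrightarrow{H}$ and $\overleftarrow{H}$ at time t, $U_{h}$ and $u_{h}$ are the weight matrix and bias of the linear model in each direction, and D is the number of dimensions after the reduction process. Late fusion is applied at this stage by concatenating the generated outputs $\overrightarrow{r}_{X}$ and $\overleftarrow{r}_{X}$, where:

$$r_{X} = \big[\overrightarrow{r}_{X}; \overleftarrow{r}_{X}\big]$$

Equation (7) shows how $\overrightarrow{\theta}_{h}$ and $\overleftarrow{\theta}_{h}$ can be learned by minimizing a ridge regression loss function, written here for the forward direction (the backward direction is analogous):

$$\overrightarrow{\theta}_{h}^{*} = \arg\min_{\{\overrightarrow{U}_{h}, \overrightarrow{u}_{h}\}} \frac{1}{2} \sum_{t=1}^{T-1} \big\| \overrightarrow{U}_{h}\,\overrightarrow{h}(t) + \overrightarrow{u}_{h} - \overrightarrow{h}(t+1) \big\|^{2} + \mu \big\| \overrightarrow{U}_{h} \big\|^{2} \tag{7}$$

where µ is the regularization parameter that adjusts the amount of coefficient shrinkage in the reservoir model space. At the classification level, the ESN adopts a linear model for decoding, which is usually written as in the following equation:

$$y = g(r_{X}) = V_{o}\, r_{X} + v_{o}$$

This model has a set of parameters $\theta_{dec} = \{V_{o}, v_{o}\}$, which can be learned by minimizing a ridge regression loss function that admits a closed form solution:

$$\theta_{dec}^{*} = \arg\min_{\{V_{o}, v_{o}\}} \frac{1}{2} \big\| V_{o}\, r_{X} + v_{o} - y \big\|^{2} + \lambda \big\| V_{o} \big\|^{2}$$

where λ is the regularization parameter for ridge regression, which helps to minimize overfitting of the training data. The aim of the linear readout is to perform the final classification that maps the $r_{X}$ representation into the class labels y.
FIGURE 3. Samples of the optimization process for one of the speakers in RAVDESS, SAVEE, and Emo-DB to find the optimal values of the hyperparameters.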
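The model space representation and the late fusion step can be sketched with NumPy. This is a sketch under stated assumptions rather than the authors' code: each reduced sequence is summarized by the parameters of a ridge-regularized linear model that predicts the next time step from the current one, in the spirit of [17], and the two direction-specific parameter vectors are concatenated. The function names are hypothetical, and leaving the bias term unpenalized is an assumption.

```python
import numpy as np

def model_space_representation(h, mu=1.0):
    """h has shape (T, D). Fit h(t+1) ~ U h(t) + u by ridge regression.

    Returns the flattened parameters, a vector of length D * (D + 1).
    """
    past, future = h[:-1], h[1:]
    # Append a column of ones so the bias u is fitted with U.
    design = np.hstack([past, np.ones((past.shape[0], 1))])
    penalty = mu * np.eye(design.shape[1])
    penalty[-1, -1] = 0.0  # assumption: the bias is not shrunk
    theta = np.linalg.solve(design.T @ design + penalty, design.T @ future)
    return theta.ravel()

def late_fusion(h_forward, h_backward, mu=1.0):
    """Concatenate the forward and backward model space representations."""
    return np.concatenate([
        model_space_representation(h_forward, mu),
        model_space_representation(h_backward, mu),
    ])
```

Each direction is fitted separately and the fused vector has length 2·D·(D+1), so a small D keeps the input to the linear readout compact.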

E. THE BAYESIAN HYPERPARAMETER OPTIMIZATION
Determining the ESN hyperparameters is a known issue because of their effect on the performance of the ESN model. However, most works have assigned the ESN parameters manually or based on experience. In this work, we optimized the major ESN hyperparameters: the size of the reservoir state, the spectral radius, the connectivity, the input scaling, the amount of leakage in the reservoir state update, and the dropout. Furthermore, we optimized the number of dimensions that remain after the dimensionality reduction procedure and both regularization parameters, µ in the model space and λ in the ridge regression readout. In the comparison between Bayesian optimization and grid search reported in [48], Bayesian optimization proved more efficient than grid search. Bayesian optimization is a gradient-free global optimization approach for black-box functions [54]. It is used to minimize the loss functions f(θ) of models whose hyperparameters θ are normally difficult to tune [48]. Additionally, the Bayesian optimization method has been used in various applications, including SER models [46], [47], to optimize the models' hyperparameters.
In this work, Bayesian optimization [54] is used to tune the parameters of the reservoir layer and the ridge regression, in addition to the dimensionality reduction size in SRP, as shown in Table 2. Figure 3 shows a sample of the 100 iterations for the three datasets used. Optimizing these parameters significantly improves the performance of the model.

F. SPEAKER NORMALIZATION (SN)
Inspired by the work of Vlasenko et al. [55], we adopted Speaker Normalization (SN) for the samples of each speaker in the speaker-independent experiments. SN subtracts the mean of the utterances that belong to one speaker in a specific dataset and divides the result by the standard deviation of those samples. The purpose of SN is to remove speaker-specific influences from the samples so that the emotion space is improved. SN is a totally unsupervised approach, so labels are not necessary. This method improved the speaker-independent performance of our proposed model.
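Speaker normalization as described above amounts to a per-speaker z-score. The following NumPy sketch assumes that the mean and standard deviation are computed per feature dimension over all frames of all utterances of a speaker; the paper does not state this granularity, and the function name and the small `eps` guard against division by zero are illustrative.

```python
import numpy as np

def speaker_normalize(utterances, speaker_ids, eps=1e-8):
    """utterances: list of (T, D) arrays; speaker_ids: one label per utterance.

    No emotion labels are used, so the step is fully unsupervised.
    """
    normalized = [None] * len(utterances)
    for speaker in set(speaker_ids):
        idx = [i for i, s in enumerate(speaker_ids) if s == speaker]
        # Pool all frames of this speaker to estimate the statistics.
        frames = np.concatenate([utterances[i] for i in idx], axis=0)
        mean = frames.mean(axis=0)
        std = frames.std(axis=0)
        for i in idx:
            normalized[i] = (utterances[i] - mean) / (std + eps)
    return normalized
```

Because only the speaker identity is needed, the same function can be applied to the held-out speaker in a speaker-independent split.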

IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we evaluate and validate the performance of the proposed model in detecting emotions from speech on the most commonly used publicly available SER datasets. So far, few research works have reported using the ESN for speech emotion recognition. The reason may be the wide use of global features instead of time series features. However, as mentioned in the previous sections, the ESN can perform well on time series data.
In this work, we use handcrafted time series features, namely 13 MFCCs and 13 GTCCs extracted for each window of length 30 ms overlapped by 20 ms. The features feed the reservoir layer, whose number of internal units is optimized using Bayesian optimization. SRP is applied to transform the high dimensional sparse output of the reservoir layer into a more compact representation. The reservoir model space characterizes a generative model of the reservoir sequence and induces a metric relationship between the samples that come from the SRP step. Subsequently, late fusion of the bidirectional input is applied, with each direction processed separately. Bayesian optimization is used to tune the hyperparameters of the reservoir layer, the ridge regression, and the size of the dimensionality reduction in SRP, as shown in Table 2.
The results of the proposed model are presented in terms of precision, recall, F1 score, and unweighted and weighted percentage accuracy. Precision and recall are used to evaluate the classification performance, and the F1 score is the harmonic mean of precision and recall. The weighted accuracy (WA) is the number of correctly classified utterances divided by the total number of utterances, while the unweighted accuracy (UA) is the average of the per-class accuracies. The detailed results for each emotion class of all four datasets are given in Tables 3-12. We applied a speaker-independent approach using Leave One Speaker Out (LOSO), in addition to a speaker-dependent approach using 5-fold and 10-fold cross-validation, on the Emo-DB, SAVEE, and RAVDESS datasets. In 5-fold and 10-fold cross-validation, the dataset is divided into 5 and 10 mutually exclusive folds respectively. The model is trained and tested 5 times for 5-fold and 10 times for 10-fold; each time, one fold is used as the test set and the remaining folds are used as the training set. To conduct a fair comparison with the state-of-the-art studies on the FAU Aibo dataset, we followed the protocol of the INTERSPEECH 2009 Emotion Challenge [56].
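The difference between the two accuracy measures and the per-class scores can be made concrete with a short, dependency-free Python sketch. The function names are illustrative, and the definitions follow common SER usage: weighted accuracy is the overall fraction of correct predictions, and unweighted accuracy is the mean of the per-class recalls.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Correct predictions divided by the total number of utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and their harmonic mean (F1)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

On imbalanced data the two accuracies diverge: with four anger utterances all correct and two happiness utterances of which one is correct, the weighted accuracy is 5/6 while the unweighted accuracy is 0.75.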
Since the ESN uses fixed rather than trainable weights in the reservoir layer, it does not need a GPU or other high-end resources. Therefore, we carried out the experiments using a CPU on Google Colab (12 GB RAM) and on a PC with 64 GB RAM.
Furthermore, our speaker-independent experiments on all datasets and our speaker-dependent experiments on the Emo-DB, SAVEE, and RAVDESS datasets show better performance than state-of-the-art works.
The performance of the proposed model is validated using four well-known public speech emotion datasets: the Berlin Database of Emotional Speech (Emo-DB) [57], Surrey Audio-Visual Expressed Emotion (SAVEE) [58], the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [59], and the FAU Aibo Emotion Corpus [60].

A. EMO-DB
Emo-DB [57] is a German dataset of emotional speech produced by the Technical University of Berlin. It covers seven emotion classes: anger, boredom, neutral, disgust, fear, happiness, and sadness. Ten actors (5 females and 5 males, between the ages of 20 and 35) performed each emotion by recalling their own real experiences. Emo-DB is the most popular dataset used in speech emotion recognition, with a total of 535 utterance files: anger (127), boredom (81), neutral (79), disgust (46), fear (69), happiness (71), and sadness (62); see Figure 4. We validated the proposed model with the speaker-independent approach and with the speaker-dependent approach using 5-fold and 10-fold cross-validation.

1) SPEAKER-INDEPENDENT
The LOSO method is applied for the speaker-independent approach: in Emo-DB, we set 9 speakers as the training set and one speaker as the test set, and this process is repeated so that every speaker appears in the test set once. Table 3 shows the detailed results of precision, recall, F1 score, and unweighted and weighted percentage accuracy for each emotion class of the Emo-DB dataset. The confusion matrix in Figure 5 shows the individual accuracy of each of the 7 emotion classes of the Emo-DB dataset for the speaker-independent approach. The anger class recorded the highest accuracy, 100% across all speakers, while happiness recorded the lowest with only 63%. Disgust and fear also have lower accuracy than boredom, sadness, and neutral.
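The LOSO protocol can be sketched as a small Python generator. The function name is hypothetical and the snippet only illustrates how the splits are formed: each speaker is held out once as the test set while all remaining speakers form the training set.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker."""
    for held_out in sorted(set(speaker_ids)):
        test_idx = [i for i, s in enumerate(speaker_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(speaker_ids) if s != held_out]
        yield held_out, train_idx, test_idx
```

For Emo-DB this yields 10 splits with 9 training speakers each, and for SAVEE 4 splits with 3 training speakers each.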

2) SPEAKER-DEPENDENT
For the speaker-dependent approach, we applied 5-fold cross-validation on the Emo-DB dataset; the results in terms of precision, recall, F1 score, and unweighted and weighted accuracy are shown in Table 4. The confusion matrix for the speaker-dependent approach (5-fold) in Figure 6 shows that the highest accuracy is obtained by the anger emotion, while the lowest accuracy is recorded by the happiness emotion, as in the speaker-independent approach.
The same procedure is applied for 10-fold cross-validation, and the detailed results in terms of precision, recall, F1 score, and unweighted and weighted accuracy are shown in Table 5.
The confusion matrix for the speaker-dependent approach (10-fold) in Figure 7 shows that the highest accuracy is achieved by the anger emotion (99%) and that, as in the speaker-independent and 5-fold approaches, the lowest accuracy is achieved by the happiness emotion.

B. SAVEE
SAVEE [58] is a multimodal (audio and visual expression) British English database that can be used for facial expression and speech emotion recognition. In our study, only the audio speech part is used. It was recorded from four male native English speakers at the University of Surrey with seven basic emotion categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. Each actor recorded 120 utterances, for a total of 480 speech files. The neutral emotion class contains 120 utterances, while each of the remaining emotion classes comprises 60 utterances, as shown in Figure 4. The SAVEE dataset is also used to validate the proposed model with the speaker-independent approach and with the speaker-dependent approach using 5-fold and 10-fold cross-validation.

1) SPEAKER-INDEPENDENT
The LOSO method has been applied for the speaker-independent approach: in SAVEE, one speaker forms the test set and the remaining speakers form the training set, and this process is repeated so that every speaker appears in the test set. Table 6 shows the detailed results of precision, recall, F1 score, and unweighted and weighted percentage accuracy for each emotion class of the SAVEE dataset. The confusion matrix in Figure 8 shows the individual accuracy of each of the 7 emotion classes of the SAVEE dataset for the speaker-independent approach. The neutral class recorded the highest accuracy, 93% across all speakers, while sadness recorded the lowest with only 43%, with 47% of the sadness samples classified as neutral. The same happened with the disgust emotion, of which 37% was recognized as neutral and only 45% as the correct emotion.

2) SPEAKER-DEPENDENT
For the speaker-dependent method, and to obtain the most reliable result for our model, we applied 5-fold and 10-fold cross-validation, as for the Emo-DB dataset. Table 7 shows the results in terms of precision, recall, F1 score, and weighted and unweighted percentage accuracy. The confusion matrix for the speaker-dependent (5-fold) evaluation in Figure 9 relates the true emotion labels to the predicted emotion labels. Similar to the speaker-independent approach, the highest accuracy is achieved by the neutral emotion with 97%, while the sadness emotion is still the lowest with 55%, of which 37% is classified as neutral. However, the disgust emotion obtained 57%, which is much higher than its speaker-independent accuracy (45%). The same procedure is applied for the 10-fold cross-validation approach, and the results of precision, recall, F1 score, and unweighted and weighted accuracy are shown in Table 8. The confusion matrix for the speaker-dependent (10-fold) evaluation in Figure 10 shows that the highest accuracy is achieved by the neutral emotion (97%) and, as in the speaker-independent and speaker-dependent (5-fold) approaches, the sadness emotion has the lowest accuracy (58%) among all classes. Additionally, 35% of the sadness emotion was recognized as neutral, which is the same situation as in both previous approaches.

C. RAVDESS
RAVDESS [59] is the third speech emotion dataset that has been used to validate our model. It is a multimodal dataset that contains facial expression and voice data for speech and song. RAVDESS was recorded with a North American accent by 24 professional actors (12 females and 12 males) with eight emotions: calm, happy, sad, angry, fearful, surprise, neutral, and disgust. RAVDESS contains 7356 files overall, of which only the 1440 audio-only speech files have been used for speech emotion recognition. The neutral emotion class has 96 utterances in total, while each of the remaining emotion classes has 192 utterances, as shown in Figure 4.
The proposed model has been validated on the RAVDESS dataset with the speaker-independent approach and with the speaker-dependent approach using 5-fold and 10-fold cross-validation.

1) SPEAKER-INDEPENDENT
The LOSO method has been applied for the speaker-independent approach: for RAVDESS, 23 speakers form the training set and one speaker forms the test set, and this process is repeated so that every speaker appears in the test set.
Only a few works have applied a speaker-independent approach to the RAVDESS dataset, and none of them applied LOSO. In our work, however, we adopted the LOSO approach, as for Emo-DB and SAVEE. Table 9 shows the detailed results of precision, recall, F1 score, and unweighted and weighted percentage accuracy for each emotion class of the RAVDESS dataset for the speaker-independent approach. The confusion matrix in Figure 11 shows the individual accuracy of each of the 8 emotion classes of the RAVDESS dataset for the speaker-independent approach. The calm emotion class recorded the highest accuracy, 89% across all speakers; the neutral emotion, however, recorded only 50% accuracy, with 18% and 20% recognized as the calm and sad emotion classes respectively.

2) SPEAKER-DEPENDENT
For the speaker-dependent approach, we applied 5-fold and 10-fold cross-validation. The 5-fold results are shown in Table 10 in terms of precision, recall, F1 score, and unweighted and weighted percentage accuracy. The confusion matrix for the speaker-dependent (5-fold) evaluation in Figure 12 shows that the highest accuracy is obtained by the calm emotion (94%), while the sad emotion records the lowest accuracy (73%). The neutral emotion improves significantly in the speaker-dependent (5-fold) approach, reaching 80% accuracy compared with only 50% in the speaker-independent approach. The same procedure is applied for the 10-fold approach, and the results in terms of precision, recall, F1 score, and unweighted and weighted percentage accuracy are shown in Table 11. The confusion matrix for the speaker-dependent (10-fold) evaluation in Figure 13 shows the accuracy for each RAVDESS emotion class; compared with the 5-fold approach, the accuracy of every emotion class is higher except for the happy emotion, with 86% accuracy.

D. FAU AIBO EMOTION CORPUS
The fourth dataset that has been used to evaluate the proposed model is the non-acted FAU Aibo Emotion Corpus, which contains 9.2 hours of spontaneous and emotional German speech samples [60]. The dataset was recorded from a total of 51 children (21 male and 30 female) aged 10-13 years during their interactions with Sony's pet robot Aibo at two different schools, 'Ohm' and 'Mont'. The corpus contains 18216 speech chunks; the dataset designers labeled each word in the dataset with one of 10 categories and later mapped them into five emotion classes: anger, emphatic, neutral, positive, and rest. The final numbers for each emotion class are listed in Figure 15. Following the protocol of the INTERSPEECH 2009 challenge [56], we used 'Ohm' with 9959 chunks from 26 children (13 males, 13 females) as the training set and 'Mont' with 8257 chunks from 25 children (8 males, 17 females) as the testing set.
The number of chunks per class in the FAU Aibo Emotion Corpus is extremely unbalanced, as shown in Figure 15: in the training set, 56.1% of the data are labeled as neutral, 21% as emphatic, 8.8% as angry, 6.8% as positive, and 7.2% as rest. To overcome this imbalance, we applied a random under-sampler [61], which under-samples the majority classes by randomly picking a fixed number of samples. Table 12 lists the detailed results of precision, recall, F1 score, and unweighted and weighted percentage accuracy for each emotion class of the FAU Aibo dataset. It can be observed that there is a big gap between the weighted and unweighted accuracy due to the high imbalance of the data. The low accuracy on this dataset compared with the others reflects the challenge of emotion recognition in a spontaneous dataset. The confusion matrix in Figure 16 shows the accuracy of each of the 5 emotion classes of FAU Aibo. The positive class recorded the highest accuracy (66%), and the rest class recorded the lowest (18%). The low accuracy of the rest class may be due to the nature of its samples, which carry different original labels but are gathered under the same class.
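The random under-sampling step mentioned above can be illustrated with a small standard-library sketch. It is not the implementation of [61]; the helper name `random_under_sample`, the toy label counts (chosen only to roughly mirror the reported training proportions), and the cap of 68 samples per class are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def random_under_sample(labels, n_per_class, seed=0):
    """Return sorted indices keeping at most n_per_class samples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    kept = []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        if len(idxs) > n_per_class:
            # Randomly pick a fixed number of samples from a majority class
            idxs = rng.sample(idxs, n_per_class)
        kept.extend(idxs)
    return sorted(kept)

# Toy labels (hypothetical counts, roughly 56/21/9/7/7 percent)
labels = (["neutral"] * 561 + ["emphatic"] * 210 + ["anger"] * 88
          + ["positive"] * 68 + ["rest"] * 72)
kept = random_under_sample(labels, n_per_class=68)
counts = Counter(labels[i] for i in kept)
print(counts)
```

Under-sampling is applied to the training data only; the test set keeps its natural class distribution, which is why weighted and unweighted accuracy can still differ considerably.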

E. THE IMPACT OF ZERO PADDINGS
As mentioned in the methodology section, since the length of the samples varies, as shown in Figure 2, we have equalized the sample lengths by padding with zeros or pruning at the start and end of each sample. During the experiments, as shown in Figure 14, we found no clear relation between the error and the amount of sample length scaling. For example, in the Emo-DB dataset, the error of samples to which 100-200 zeros were added was 16.91%; however, samples with more added zeros (200-300) recorded a lower error (9.17%), while samples with 300-400 added zeros recorded an error of 12.59%. In the SAVEE dataset, the low misclassification ratio (50%) where 400-500 zeros are added may not be related to the zero padding ratio, since 78.5% of the samples in this length range come from a single speaker, whose result is 53%. Similar observations can be made for the RAVDESS and FAU Aibo datasets, where a noticeable zero padding ratio has been applied. These observations do not reveal any pattern relating increased error to zero padding.

[...] datasets show that late fusion with both PCA and RP outperforms early fusion (see Figure 17). However, on the Emo-DB dataset, LF-PCA is not able to outperform EF-PCA, but the impact of RP on the late fusion model is significant, reaching 86.80% accuracy. Overall, the proposed model (LF-RP) outperformed the other three methods on all four datasets. To show the impact of the adopted bidirectionality, dimension reduction, and optimization method in the proposed model over a basic ESN (unidirectional, with all dimensions and non-optimized hyperparameters), Table 13 reports the improvement achieved by the proposed model on all the involved datasets in the speaker-independent approach.

V. DISCUSSION
In this section, we compare the performance of the proposed model with other baseline methods. In order to obtain high classification accuracy, we proposed a novel ESN model that takes a small set of handcrafted features as input to the reservoir layer, uses a bidirectional time series representation, and has optimized hyperparameters. Additionally, we applied sparse random projection to reduce the output feature representation of the reservoir layer, which helped the model to perform better when dealing with sparse data representations. We adopt the speaker-independent approach for all four popular benchmark datasets, and the speaker-dependent approach with 5-fold and 10-fold cross-validation on the Emo-DB, SAVEE, and RAVDESS datasets, to recognize the emotional state from speech signals. The late bidirectional fusion helped to extract more information from the data before feeding it to the ridge regression classifier. This novel approach to SER helped to improve the classification accuracy, and because of the simplicity and the untrained nature of the ESN reservoir, the processing time is reduced compared with other deep learning methods such as LSTM and CNN.
In this discussion, we present the overall unweighted accuracy (UA), since it better represents the actual performance, especially when the data is imbalanced in terms of the number of utterances per class, as shown in Figure 4. UA is the sum of all class accuracies divided by the number of classes, without taking into account the number of samples per class. Consequently, UA is a useful evaluation metric for emotion recognition studies due to the imbalanced nature of emotion datasets.
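The UA definition above can be made concrete with a short sketch. The function name `ua_wa` and the toy labels are assumptions introduced here; the weighted accuracy (WA) is taken to be the overall fraction of correctly classified samples, which is the usual convention in SER but is not defined explicitly in this section.

```python
from collections import Counter

def ua_wa(y_true, y_pred):
    """Return (unweighted accuracy, weighted accuracy) as fractions in [0, 1]."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Per-class accuracy: correct predictions of a class over its sample count
    per_class = {c: correct[c] / total[c] for c in total}
    # UA: plain mean of the class accuracies, ignoring class sizes
    ua = sum(per_class.values()) / len(per_class)
    # WA (assumed convention): overall fraction of correct predictions
    wa = sum(correct.values()) / len(y_true)
    return ua, wa

# Toy imbalanced example: 8 neutral and 2 anger utterances
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]
ua, wa = ua_wa(y_true, y_pred)
print(ua, wa)  # 0.75 0.9
```

In this toy case the majority class is classified perfectly and the minority class only half of the time, so WA (0.9) looks considerably better than UA (0.75), which is the gap the paragraph above refers to.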
We have compared the performance of our proposed model in the speaker-independent and/or speaker-dependent schemes with previously presented methods for the Emo-DB, SAVEE, RAVDESS, and FAU Aibo datasets (see Tables 14-17). The classification UA values of the speaker-independent experiments are shown in Table 14. For the Emo-DB dataset, our model achieved 86.80% UA, which is better than various recent works. On the SAVEE dataset, our model obtained 68.45%, which is 13.45% higher than the second-best result. For RAVDESS, our proposed model adopted LOSO, whereas the few works that have used a speaker-independent method on this dataset did not apply the LOSO approach. For example, [31] used 2 speakers for testing the model, while [68] used 19 speakers for training and four other speakers for testing. The proposed model nevertheless achieved 73.05% with the LOSO approach. Table 15 summarizes the unweighted classification accuracies (UA%) achieved by various researchers using 5-fold cross-validation on the Emo-DB, SAVEE, and RAVDESS speech datasets.
Among all state-of-the-art methods, our approach performed the best. Although an ESN is far less complicated than LSTM-based models, our proposed model achieved 91.64% on Emo-DB, which is slightly better than the work in [31], where a deep BiLSTM was used with a CNN for feature extraction. Additionally, our result on Emo-DB outperformed the work in [62] by 1.27% and the other works by a significant margin, as shown in Table 15.
On SAVEE, the proposed model achieved 70.34% UA. To the best of our knowledge, there is only one recent work that applied a 5-fold approach to the SAVEE dataset [70]; it obtained 68% UA, about 2% lower than our result.
Regarding the RAVDESS dataset, our proposed model achieves a high UA with the 5-fold scheme (85.68%) and outperforms the highest result reported in state-of-the-art studies by 2.68%.
To compare the performance of our proposed model with state-of-the-art studies, a number of recent works that used the FAU Aibo dataset and followed the 2009 challenge protocol are presented in Table 17. The proposed model outperformed these studies with a UA of 45.9%, which indicates its usefulness for non-acted emotional datasets as well as acted ones.
Some of the studies listed in Tables 14, 15, and 17 adopted spectrogram-based features with deep learning and achieved distinguished results. For the Emo-DB dataset and the speaker-independent approach, the authors of [62] and [65] achieved classification accuracies of 84.99% and 82.82% respectively by extracting 3-D Log-Mel spectrums from raw speech signals and feeding them to 3-D attention-based convolutional recurrent neural networks (ACRNN). Additionally, Jiang et al. [30] extracted 3-D log Mel-spectrograms from the speech signal, fed them to a parallelized convolutional recurrent neural network (PCRN) model, and recorded 84.53% UA. However, our proposed model is able to outperform these spectrogram-based deep learning approaches, achieving a UA of 86.80%. With 5-fold cross-validation, the works in [31], [62] applied spectrogram-based deep learning models to Emo-DB and achieved better results than other models. Mustaqeem et al. [31] achieved 91.14% UA by using salient features from the speech spectrogram with a deep bidirectional LSTM to learn the spatio-temporal information for detecting the last state of the emotion model. Again, our proposed model achieved 91.64% UA, exceeding the model of [31] by 0.5%.
Spectrogram-based features with deep learning models have also been applied to the RAVDESS dataset [31], [72]; however, despite their good results, and unlike on Emo-DB, the work of [71] outperformed them.
Regarding the FAU Aibo dataset, the highest results in previous works were obtained using spectrogram-based features with deep learning models [75], [78], [79]. Shih et al. [78] achieved 45.4% UA by extracting deep spectrum representations and developing a deep learning model with attention-enhanced FCN and BLSTM networks. Our proposed model is once again able to outperform this study, achieving a UA of 45.9%.

VI. CONCLUSION AND FUTURE WORK
We proposed a novel recurrent architecture for time series speech emotion recognition that uses a bidirectional late fusion ESN based on the reservoir model space representation with sparse random projection. Early fusion of the temporal features generated by the bidirectional reservoir leads to a loss of independence between the two representations. Thus, to avoid the drawback that dimension reduction produces linear combinations of both directional representations, we proposed late fusion of the representations, which is applied after the dimension reduction step.
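The difference between early and late fusion described above can be sketched with NumPy. This is an illustrative sketch only: random matrices stand in for the forward and backward reservoir representations, the helper `sparse_projection_matrix`, the sizes, the default density of 1/sqrt(d_in), and the use of a separate projection matrix per direction are all assumptions rather than details taken from the paper.

```python
import numpy as np

def sparse_projection_matrix(d_in, d_out, rng, density=None):
    """Sparse random matrix with entries in {-v, 0, +v}, v = sqrt(1 / (density * d_out))."""
    if density is None:
        density = 1.0 / np.sqrt(d_in)
    v = np.sqrt(1.0 / (density * d_out))
    signs = rng.choice([-1.0, 1.0], size=(d_in, d_out))
    mask = rng.random((d_in, d_out)) < density  # keep roughly `density` of the entries
    return v * signs * mask

rng = np.random.default_rng(0)
n, d, k = 6, 400, 20
H_fwd = rng.standard_normal((n, d))  # placeholder: forward reservoir representation
H_bwd = rng.standard_normal((n, d))  # placeholder: backward reservoir representation

# Early fusion: concatenate first, then reduce. Every reduced dimension can
# linearly combine entries of both directions.
R_early = sparse_projection_matrix(2 * d, 2 * k, rng)
Z_early = np.hstack([H_fwd, H_bwd]) @ R_early

# Late fusion: reduce each direction on its own, then concatenate. The first k
# columns depend only on the forward pass, the last k only on the backward pass.
R_f = sparse_projection_matrix(d, k, rng)
R_b = sparse_projection_matrix(d, k, rng)
Z_late = np.hstack([H_fwd @ R_f, H_bwd @ R_b])

print(Z_early.shape, Z_late.shape)
```

Both variants produce a representation of the same size for the classifier; only the late variant keeps the two directional representations separate after the reduction.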
In addition, dimensionality reduction of sparse data using SRP is reported to be useful for preparing a more compact and informative representation for the classifier. SRP reduces the dimensions while approximately preserving distances, and random projection has low complexity since it does not need training. Because of the small feature set and the untrained ESN reservoir, our model is fast and robust while achieving better performance. Another factor that has a notable impact on the performance of our model is the use of Bayesian optimization to tune the ESN hyperparameters. Bayesian optimization has been adopted in our work to set a large number of hyperparameters in the proposed model and has proven able to deliver good performance.
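The distance-preservation property of sparse random projection mentioned above can be checked numerically with a small sketch. The data, the sizes, and the density of 1/sqrt(d_in) are assumptions chosen for illustration; the projection matrix is built by hand with NumPy rather than with the library used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 20, 2000, 500
X = rng.standard_normal((n, d_in))  # hypothetical high-dimensional representations

# Sparse random projection matrix: entries are +v or -v with probability
# density / 2 each and 0 otherwise; no training is involved.
density = 1.0 / np.sqrt(d_in)
v = np.sqrt(1.0 / (density * d_out))
R = v * rng.choice([-1.0, 1.0], size=(d_in, d_out)) * (rng.random((d_in, d_out)) < density)
Z = X @ R

def pairwise_dists(A):
    """Euclidean distances between all distinct pairs of rows of A."""
    sq = np.sum(A**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    iu = np.triu_indices(len(A), k=1)
    return np.sqrt(np.maximum(D2[iu], 0))

# Ratio of projected to original distance for every pair; values close to 1
# mean the pairwise distances are approximately preserved.
ratio = pairwise_dists(Z) / pairwise_dists(X)
print(ratio.min(), ratio.mean(), ratio.max())
```

The ratios concentrate more tightly around 1 as the target dimension `d_out` grows, which is the trade-off between compactness and distortion when choosing the reduced size.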
The proposed model achieved the highest classification UA compared with previous works on SER when using 5-fold and 10-fold speaker-dependent and LOSO speaker-independent evaluation on the Emo-DB, SAVEE, and RAVDESS datasets, and speaker-independent evaluation on FAU Aibo.
A single reservoir struggles to generate a comprehensive representation and suffers from the randomness assigned to it. For this reason, in future work, we intend to use more than one reservoir to create a more typical representation of the input data that captures more information independently of the input data.