Robust Speech Emotion Recognition Using CNN+LSTM Based on Stochastic Fractal Search Optimization Algorithm

One of the main challenges facing the current approaches of speech emotion recognition is the lack of a dataset large enough to train the currently available deep learning models properly. Therefore, this paper proposes a new data augmentation algorithm to enrich the speech emotions dataset with more samples through a careful addition of noise fractions. In addition, the hyperparameters of the currently available deep learning models are either handcrafted or adjusted during the training process. However, this approach does not guarantee finding the best settings for these parameters. Therefore, we propose an optimized deep learning model in which the hyperparameters are optimized to find their best settings and thus achieve better recognition results. This deep learning model consists of a convolutional neural network (CNN) composed of four local feature-learning blocks and a long short-term memory (LSTM) layer for learning local and long-term correlations in the log Mel-spectrogram of the input speech samples. To improve the performance of this deep network, the learning rate and label smoothing regularization factor are optimized using the recently emerged stochastic fractal search (SFS)-guided whale optimization algorithm (WOA). The strength of this algorithm is its ability to balance the exploration and exploitation of the search agents' positions to guarantee reaching the globally optimal solution. To prove the effectiveness of the proposed approach, four speech emotion datasets, namely, IEMOCAP, Emo-DB, RAVDESS, and SAVEE, are incorporated in the conducted experiments. Experimental results confirmed the superiority of the proposed approach when compared with state-of-the-art approaches. Based on the four datasets, the achieved recognition accuracies are 98.13%, 99.76%, 99.47%, and 99.50%, respectively. Moreover, a statistical analysis of the achieved results is provided to emphasize the stability of the proposed approach.


I. INTRODUCTION
Speech emotion recognition (SER) has received much attention in recent years [1,2]. Although human emotions are hard to characterize and categorize, research on machine understanding of human emotions is rapidly advancing. The recognition of speech emotions usually includes extracting paralinguistic features from speech. These features should be independent of the speaker and lexical content of the speech signal. Generally, the information embedded in speech signals can be categorized into paralinguistic information and linguistic information. Paralinguistic information refers to implicit features, such as the emotions harnessed in the speech signal, which is the domain of SER [3]. On the other hand, linguistic information refers to the context and meaning of the speech signal, which is the domain of interest in speech recognition.
To recognize the embedded emotions in speech, many distinguishing features can be extracted. These features include spectral features, qualitative features, and continuous features [4]. Many researchers have investigated the application of these features in SER. On the other hand, other researchers investigated the advantages and disadvantages of these features; however, the best features that can be used for this task cannot be identified easily. These features are usually referred to as handcrafted features. The accuracy of these features is relatively high; however, professional knowledge is required for extracting these features. Consequently, deep learning is introduced to model the extraction of high-level features from lower-level features to save the efforts needed for extracting the handcrafted features [5].
Currently, deep learning approaches are employed to solve many critical problems. The strength of deep learning comes from its ability to learn high-level features. Therefore, many researchers have introduced these approaches to recognize speech emotions based on many deep learning architectures. These architectures could achieve reasonable accuracy for the task of SER. However, more efforts are still required to improve the recently achieved performance [6,7].
Deep learning greatly improved the performance of speech signal processing frameworks. Excellent results are achieved by researchers in this field based on the application of convolutional neural networks (CNNs), deep belief networks (DBNs), and long short-term memory (LSTM) [8,9]. Special processing is required for speech signals to model their time-varying nature. Therefore, LSTM is more suitable to extract the long-term contextual dependencies in the input speech. One of the most effective features that can be used in SER is time-frequency decomposition, which is represented by a spectrogram. These features are proven to give significant recognition accuracy compared to using the raw speech signal when used to train deep learning frameworks [10].
The hyperparameters of deep learning models affect their performance to a certain extent. The selection of proper values of these parameters usually forms a challenge in utilizing deep learning models for different tasks. Recently, many optimization techniques have emerged to optimize the parameters of various models. These optimization techniques include particle swarm optimization (PSO) [11], whale optimization algorithm (WOA) [12], gray wolf optimization (GWO) [13], dipper throated optimization (DIP) [14], etc. In this research, we adopted the WOA, as an example, for optimizing the hyperparameters of the proposed deep learning model. Other types of optimizers will be considered in the future perspectives of this research.
There are many beneficial usages of SER in various applications that are based on the interaction between humans and computers. These applications include customer service, speech synthesis, medical analysis, forensics, and smart education. These applications highlight the significance of the automatic recognition of speech emotions and the necessity for achieving high recognition accuracy to realize these applications properly.
This paper presents an accurate approach for recognizing speech emotions using an optimized deep learning model based on cascaded layers of CNN+LSTM and stochastic fractal search guided whale optimization algorithm (SFS-Guided WOA). The effectiveness of the proposed approach is validated in terms of four standard speech emotion datasets, namely, IEMOCAP [15], Emo-DB [16], RAVDESS [17], and SAVEE [18]. In addition, the results of the proposed approach are compared with the results achieved by the other competing approaches in the literature to prove its superiority. Moreover, statistical analysis is performed to confirm the stability of the performance of the proposed approach.
The structure of this paper is organized as follows. A literature review is presented in section II. The proposed approach, along with the system architecture, is then explained in section III. Section IV presents and discusses the results of the conducted experiments. Finally, the conclusions and future perspectives are given in section V.

II. LITERATURE REVIEW
Speech emotion recognition (SER) is addressed by many researchers in the literature. In this section, we discuss some of these research efforts focusing on their achievements.
Aharon et al. [19] employed a deep neural network to recognize speech emotions from paralinguistic information. This deep network consists of convolutional and recurrent layers to learn the inherent representations of speech emotions. This approach utilizes the speech signal spectrogram to achieve this goal. The processing of speech signals is performed on small segments with non-overlapping parts. This approach was tested on the IEMOCAP dataset and achieved a recognition accuracy of 68% when the deep network was combined with a high-complexity convolutional LSTM.
Jonathan et al. [20] proposed an improved approach based on two machine learning techniques. They employed both multitask machine learning and deep convolutional generative adversarial networks to generate a set of unlabeled data. Using these techniques, they could increase the size of the speech emotion training corpus to 100 hours. This large corpus could improve the performance of speech emotion classifiers, and the achieved performance was better than that of the baseline systems. The achieved improvement reached 43.88%, which is competitive with existing methods.
Chen et al. [21] hypothesized that measuring deltas and delta-deltas for customized characteristics not only retains effective emotional information but also reduces the impact of emotionally irrelevant variables, resulting in less misclassification. Furthermore, SER is often plagued by silent frames and emotionally meaningless frames. In the meantime, the attention mechanism has shown exceptional abilities in studying relevant feature representations for complex tasks. They considered using the Mel spectrogram with deltas and delta-deltas as input to train 3-D attention-based convolutional recurrent neural networks (ACRNNs) to learn discriminative features for SER. Experiments on the Emo-DB and IEMOCAP corpora reveal that the suggested method works well and achieves best-in-class unweighted average recall.
Log-Mel spectrograms and high-level features are learned from raw audio clips by Zhao et al. [22]. In this research, the authors created a combined convolutional neural network (CNN) with two branches, namely, a 1D CNN and a 2D CNN. There are two stages in constructing the combined deep CNN. The hyperparameters of the two designed architectures are chosen using Bayesian optimization during training. After designing and evaluating one 1D CNN and one 2D CNN architecture, the two CNN architectures were combined after removing the second dense layer. Transfer learning was used to speed up the training of the combined CNN. The 1D and 2D CNNs were trained first, and their learned features were then transferred to the combined CNN. The final step was to fine-tune the merged deep CNN that had been initialized with the migrated features. Experiments on two benchmark datasets show that combining deep CNNs significantly boosts emotion classification results.
Yenigalla et al. [23] suggested a phoneme-based and spectrogram-based approach for speech emotion detection. The phoneme sequence and spectrogram both preserve the emotional content of expression, lost as it is translated to text. They used various deep neural networks with phonemes and spectrograms as inputs to conduct multiple experiments. Three of these network architectures are discussed there, and compared to state-of-the-art approaches on a comparison dataset, they helped to achieve better precision. The phoneme and spectrogram hybrid CNN model was the most reliable model for understanding feelings on IEMOCAP data. Compared to current state-of-the-art approaches, the average class accuracy and the overall accuracy are improved.
Sarma et al. [24] used the IEMOCAP database to analyze many DNN architectures for emotion recognition. First, they contrast different feature extraction front ends: time-domain and frequency-domain approaches with high-dimensional Mel-frequency cepstral coefficient (MFCC) input (equivalent to filter banks) that learn filters as part of the network. The time-domain filter-learning technique gives them the best outcomes. The researchers then looked at various methods for aggregating data throughout a speech. They experimented with approaches that use time aggregation within the network with a single label per utterance and approaches that replicate the label for each frame. The best design they tried interleaves time-restricted self-attention with time-delay neural network (TDNN) + LSTM layers and achieves a weighted precision of 70.6%, compared to 61.8% achieved by the most promising previously presented method, which was based on Fourier log-energy input with 257 dimensions.
Latif et al. [25] used a novel transition learning methodology in cross-language and cross-corpus situations to enhance the accuracy of SER systems. Compared to support vector machines (SVMs) and sparse autoencoders, deep belief networks (DBNs) offer greater accuracy on cross-corpus emotion detection than previous approaches on five different corpora in three different languages. The results also show that using many languages for training and only a small portion of the target data in training will greatly improve accuracy compared to the baseline, including for corpora with few examples for training.
Zhao et al. [26] proposed two CNN+LSTM networks, one 1D CNN+LSTM network, and one 2D CNN+LSTM network, to learn local and global emotion-related features from speech and log-Mel spectrograms, respectively. The architecture of the two networks is identical, with four local function learning blocks (LFLBs) and one LSTM layer in each. LFLB is designed to learn local correlations and derive hierarchical correlations, and it consists primarily of one convolutional layer and one max-pooling layer. The LSTM layer is used to learn long-term dependencies from the local learned functions.
Sun et al. [27] presented a new algorithm that incorporates both a sparse autoencoder and a method for focusing attention. The goal is to use an autoencoder to learn from both labeled and unlabeled data and to use the attention function to focus on speech frames with strong emotional content. Such nonemotional speech frames can also be overlooked. Three online databases with a cross-language system are used to test the proposed algorithm. Compared to current speech emotion detection algorithms, experimental findings reveal that the proposed algorithm provides substantially more reliable predictions.
Jiang et al. [28] suggested a feature representation extraction method based on deep learning from heterogeneous acoustic feature groups that could include redundant and irrelevant content, resulting in poor emotion recognition output in their research. A fusion network is learned to jointly learn the discriminative acoustic feature representation and SVM as the final classifier after the informative features are obtained. The proposed architecture increased recognition efficiency by 64% compared to current state-of-the-art methods, according to experimental findings on the IEMOCAP dataset.
Pandey et al. [29] provided an overview of deep learning strategies for extracting and classifying emotional states from speech utterances. They investigate the most commonly used simple deep learning architectures in the literature. On the two common datasets, Emo-DB and IEMOCAP, architectures such as CNN and LSTM were used to measure the emotion capture capability of various standard speech representations, such as Mel-spectrograms, magnitude spectrograms, and MFCCs. The experiments' results and the reasoning behind them have been discussed to determine which architecture and feature combination is best for speech emotion detection.
Meng et al. in [30] employed the bidirectional LSTM along with CNN to recognize speech emotions. In addition, they adopted the Mel-spectrogram features in the 3D space as the main features used to train the CNN network. That model was evaluated based on IEMOCAP and Emo-DB datasets. Although the results achieved by this model are promising, it lacks generalization, as the model performs well on the training data; however, the performance is worse on the test set.
Zhen et al. in [31] proposed a model composed of CNN, BLSTM, and SVM for recognizing the speech emotions based on log-Mel spectrogram features. The model is evaluated on the IEMOCAP dataset and shows better performance when compared with another approach in the literature. Despite the promising performance of the model, it still needs to be evaluated using other datasets to show its generalization capability. On the other hand, the study presented in [32] showed the performance of various models used in SER using six speech datasets. This study concluded that the CNN+LSTM model performs better than the other models for five out of the six datasets.
Guo et al. [33] employed a kernel extreme learning machine (KELM) for classifying speech emotion classes. In this approach, a fusion of spectral features is used to train the presented model. The evaluation of this model is performed on two datasets, Emo-DB and IEMOCAP. However, the presented results show promising performance on only one dataset, which means that the presented approach lacks proper generalization. In addition, the authors concluded that the fusion of spectral features allows the models to achieve higher classification accuracy.
Misbah et al. in [34] investigated the application of a deep convolutional neural network (DCNN) to extract features from the log-Mel spectrogram of the raw speech. The study employed four datasets: IEMOCAP, Emo-DB, SAVEE, and RAVDESS. The classification of speech emotions is performed using four classifiers: SVM, random forest, k-nearest neighbors, and neural networks. The performance of these classifiers is promising; however, no single classifier could perform well on all four datasets. This indicates that these classifiers lack generalization capability.
Sonawane et al. [35] demonstrated a deep learning approach for speech emotion understanding. For the classification of emotions such as positive, negative, indifferent, disgust, and surprise, a multilayer convolutional neural network is used with a basic K-nearest neighbor (KNN) classifier. The combination of MFCC-CNN and the KNN classifier performs better than the current MFCC algorithm, according to experimental findings on a real-time database obtained from the open-access social media site YouTube.
Sajjad et al. [36] presented a new SER system based on radial basis function network (RBFN) similarity calculation in clusters and the main sequence segment selection method. The STFT algorithm is used to transform the chosen sequence into a spectrogram, which is then fed into the CNN model, which extracts the discriminative and salient features from the speech spectrogram. Additionally, to ensure precise recognition performance, the CNN features were normalized and fed to a deep bidirectional long short-term memory (BiLSTM) network for emotion recognition based on the learned temporal information.
Kwon et al. [37] made two significant contributions: (1) improving SER accuracy in comparison to other methods and (2) reducing the complexity of the proposed SER model. They suggest an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture based on the simple net approach to learn salient and discriminative features from spectrograms of speech signals. The hidden local features are learned in convolutional layers rather than pooling layers, with unique strides to downsample the feature maps, and fully connected layers are used to learn the global features. This approach uses a softmax classifier for classifying speech emotions. On the RAVDESS and IEMOCAP datasets, the proposed strategy improves the overall accuracy by 4.5% and 7.85%, respectively.
Vryzas et al. [38] developed and tested SER based on CNN. On consecutive time frames of continuous expression, emotion recognition is performed. The acted emotional speech dynamic database (AESDD) is the dataset used for training and analyzing the model and the techniques of data augmentation. The AESDD is subjected to arbitrary evaluations to act as a benchmark for human-level identification performance. In terms of precision, the CNN model outperforms the other models using SVM by 8.4%.
Ngoc-Huynh et al. [39] presented a multimodal approach for recognizing speech emotions. The presented approach is based on a multi-level multi-head fusion (MLMHF) attention mechanism and a recurrent neural network [44]. MFCC features are utilized in the presented approach. Three datasets are employed to evaluate it: IEMOCAP, MELD, and CMU-MOSEI. Despite the promising performance achieved by this approach, the performance varies greatly depending on the tested dataset. Therefore, based on the presented results, it can be noted that this approach does not generalize well.
Orhan et al. [40] presented a model based on 3D CNN+LSTM guided by an attention model. This model follows the approach of deep end-to-end learning. The features extracted from the speech signals to train the model are Mel-frequency coefficients. The presented model is evaluated using three datasets: RAVDESS, SAVEE, and RML. The accuracies achieved by this model on these datasets are 96.18%, 87.50%, and 93.32%, respectively.
Turker et al. [41] developed a nonlinear multi-level feature generation model based on a cryptographic structure. The performance of that model is validated using four speech emotion datasets, namely, RAVDESS, Emo-DB, SAVEE, and EMOVO. The presented model achieved 87.43%, 90.09%, 84.79%, and 79.08% classification accuracy on these datasets, respectively, using a 10-fold cross-validation strategy.

A summary of the relevant milestones of SER in the literature is presented in Table 3. This summary is presented in terms of the year of the publication, the proposed methodology, the type of features utilized in the research, the dataset employed, and the achieved accuracy corresponding to each dataset.

III. PROPOSED METHODOLOGY
This section explains the proposed speech emotion recognition (SER) methodology. The proposed approach consists of a proposed data augmentation algorithm, a proposed CNN+LSTM deep neural network, and a proposed optimization approach using a stochastic fractal search-guided whale optimization algorithm (SFS-Guided WOA) for optimizing the parameters of the deep network. Figure 1 depicts the overall architecture of the proposed SER methodology.

A. DATA AUGMENTATION
A large amount of training data is usually required for deep learning to achieve good results. One way to increase the number of training samples is through data augmentation. In this paper, we propose a new data augmentation algorithm, as presented in Algorithm (1). This algorithm creates additional training samples by carefully adding fractions of noise to the clean samples. The choice of this fraction is critical, as the noise may corrupt the signal content if its amount is too large or may be irrelevant if its amount is too small. In this paper, we set the noise ratio to 0.005 × the maximum value of the speech signal. After performing data augmentation, each clean sample in the dataset has three new samples generated by the augmentation algorithm. Therefore, the ratio of clean to newly generated samples in the augmented dataset is 1:3.
Algorithm 1 Data augmentation
1: procedure AUGMENT(data)
2:    Ratio ← 0.005
3:    Max ← np.amax(data)
4:    rUniform ← np.random.uniform()
5:    NoiseFactor ← Ratio × Max × rUniform
6:    Noise ← np.random.randn(len(data))
7:    AugmentedData ← data + NoiseFactor × Noise
8:    return AugmentedData
9: end procedure

The addition of this fraction of noise to the clean signal is significant for improving the generalization of the proposed deep learning model. On the other hand, keeping the clean samples in the dataset makes the model capable of recognizing the speech emotions of a clean signal as well as a noisy one.
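Algorithm 1 can be written as a short NumPy function. The following is a minimal sketch of the procedure above; the function name, the test signal, and the seeded generators are ours, added for reproducibility:

```python
import numpy as np

def augment(data, ratio=0.005, rng=None):
    """Add a small fraction of Gaussian noise to a clean speech signal,
    mirroring Algorithm 1: noise amplitude = ratio * max(signal) * U(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    noise_factor = ratio * np.amax(data) * rng.uniform()
    noise = rng.standard_normal(len(data))
    return data + noise_factor * noise

# Three augmented copies per clean sample give the 1:3 clean-to-augmented ratio.
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 16000))
augmented = [augment(clean, rng=np.random.default_rng(s)) for s in range(3)]
```

Because the noise factor is bounded by 0.005 × the signal maximum, the perturbation stays small relative to the clean signal, which is the property the algorithm relies on.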

B. FEATURE EXTRACTION
The features extracted from the speech dataset are represented in the 2D space as log-Mel spectra. These features are employed as a static input to the deep network to achieve a better distribution of emotional features. In addition, this representation of features can extract the features corresponding to the emotions of interest accurately when compared with the raw spectrum and with a reduction in the dimensionality of the feature space [30]. Moreover, the log-Mel spectrum helps to reduce the effect of interference that may occur in the frequency bands and improve the linearization of the frequency perception of human ears [43]. Consequently, the speed of training the classification model along with the recognition process can be significantly improved.
To map the signal frequency to the Mel scale, equation (1) is employed:

k = 2595 log10(1 + f / 700),    (1)

where k represents the frequency on the Mel scale and f denotes the linear frequency, which ranges over 0 ≤ f ≤ 22,050 Hz.
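The Hz-to-Mel mapping and its inverse can be sketched in a few lines of Python; the constants 2595 and 700 are the standard HTK-style values of the Mel-scale formula:

```python
import math

def hz_to_mel(f):
    """Map a frequency in Hz to the Mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

For example, 700 Hz maps to 2595 · log10(2) ≈ 781 Mel, and the two functions are exact inverses of each other up to floating-point rounding.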
The process of extracting the log-Mel spectrum is represented in the following steps.
• Framing and windowing: A window size of 25 ms, or equivalently 256 samples, is used as the analysis window. To smoothly cover the spectrum variation, a skip rate of 50% is also applied. The analysis window is a Hamming window, which effectively reduces signal distortions. The Hamming window is expressed as presented in equation (2):

w(n) = (1 − δ) − δ cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1,    (2)

where the window length is denoted by N, and δ is usually set to 0.46.
• Mel filter: To measure the energy of the speech signal, the modulus of the frequency spectrum is squared. Then, a set of triangular filters is applied to the Mel scale of the energy spectrum. Applying this filter bank helps reduce the harmonics and improves the smoothness of the frequency spectrum. In addition, these filter banks reduce the time needed to calculate the resulting output while reducing the dimension of the feature space. In most speech processing approaches, the number of filter banks is usually 13. The output from each triangular filter is defined as shown in equation (3), where the Mel-frequency filter bank is characterized by the function f(·). Because the Mel spectrogram is inspired by the human auditory system, it is widely used in several speech processing operations, such as speech recognition, speech synthesis, and speech emotion recognition.
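The windowing and Mel-filtering steps above can be sketched in NumPy. The window follows equation (2) with δ = 0.46; in the filter bank, the FFT size, sampling rate, and exact bin placement are illustrative assumptions, since the section does not fix them:

```python
import numpy as np

def hamming(N, delta=0.46):
    """Hamming window of equation (2): w(n) = (1 - delta) - delta*cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return (1.0 - delta) - delta * np.cos(2.0 * np.pi * n / (N - 1))

def mel_filterbank(n_filters=13, n_fft=256, sr=16000, fmin=0.0, fmax=None):
    """Triangular filters spaced uniformly on the Mel scale; returns a
    (n_filters, n_fft // 2 + 1) matrix applied to the power spectrum."""
    fmax = sr / 2.0 if fmax is None else fmax
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters triangles need n_filters + 2 edge points on the Mel axis
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge (peak value 1 at center)
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

w = hamming(256)
fb = mel_filterbank()
```

With δ = 0.46, equation (2) reproduces NumPy's built-in `np.hamming` exactly, and each triangular filter peaks at 1 at its center bin.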

C. THE PROPOSED CNN+LSTM
To understand speech emotions, researchers face one key challenge: extracting the most distinctive features that represent emotions accurately. Based on the existing methods of feature extraction, speech features can be categorized as either learned features or handcrafted features. Deep neural networks, such as CNNs, offer a simple way to extract features that can achieve exceptional performance [45,46]. To extract accurate emotional features, four neural network layers were stacked together to form a local feature-learning block (LFLB). These layers include a convolutional layer for performing the convolution of the speech features with a kernel mask, a batch normalization (BN) layer [47], an exponential linear unit (ELU) [48], and finally, a max-pooling layer. Four LFLBs are then stacked to form the general architecture of the proposed approach, as shown in Figure 2. The LFLB's main layers are the convolution and max-pooling layers [49,50,51]. The convolution layer performs the function of the learning kernel. The BN layer increases the efficiency and reliability of deep networks by normalizing the activations of the convolutional layer in each batch. The batch normalization transformation keeps the standard deviation of the activations near one and their mean near zero [52].
The ELU layer follows the BN layer. ELU admits negative values, which pushes the mean activation toward zero, allowing a much faster learning rate and thus boosting the recognition accuracy accordingly. The pooling layer makes the features more resistant to noise and small perturbations. Nonlinear functions such as max-pooling are the most widely used; they divide the input into non-overlapping regions and output their max values [53].
In this research, log-Mel spectrogram is used to extract local and global features that are then learned using a combination of LSTM and LFLB. The central layer of the LFLB is the convolution layer, which is designed to process a grid of values. It will learn sequence features based on the neighboring inputs. In particular, each feature element is formed in terms of a small number of these neighboring inputs. On the other hand, the learned features are based on the previous outputs. High-level features can be learned by LSTM and CNN in conjunction and provide both long-term and local contextual information.
The output z(i, j) is obtained by convolving the input x(i, j) of the convolution layer with the kernel w(i, j), which has a size of a × b. In the conducted experiments, the 2D convolution kernel w(i, j) is initialized randomly.
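The convolution of x(i, j) with the a × b kernel w(i, j) can be illustrated with a naive "valid" 2-D sliding-window sketch; as in most deep learning frameworks, no kernel flip is applied (i.e., this is cross-correlation):

```python
import numpy as np

def conv2d_valid(x, w):
    """z(i, j) = sum over (a, b) of x(i + a, j + b) * w(a, b), no padding."""
    a, b = w.shape
    h, wd = x.shape[0] - a + 1, x.shape[1] - b + 1
    z = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            z[i, j] = np.sum(x[i:i + a, j:j + b] * w)
    return z

x = np.ones((4, 4))
z = conv2d_valid(x, np.ones((2, 2)))   # every 2x2 window of ones sums to 4.0
```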
The BN layer is fed with the convolved features from the previous layer, which are then normalized in each batch. The BN layer uses a transformation that keeps the convolved features' variance equal to one and their mean equal to zero. This operation can be interpreted as follows:

z_i^l = σ(BN(Σ_j z_j^{l−1} ∗ w_ij^l + b_i^l)),

where z_j^{l−1} and z_i^l refer to the j-th input feature at the (l − 1)-th layer and the i-th output at the l-th layer, respectively; the convolution kernel between the j-th and i-th features is denoted by w_ij^l, and b_i^l is the corresponding bias. The normalization of the features learned by the convolution layer is denoted by the function BN(·). In addition, the network activation function is denoted by σ(·) and is defined as

σ(x) = x for x > 0,  σ(x) = α(e^x − 1) for x ≤ 0,

where e is Euler's number and α > 0. The nonlinear downsampling operation is performed by the pooling layer, which decreases the feature resolution. The output of the max-pooling layer is given by the following equation.
z_k^l = max_{p ∈ Ω_k} z_p^l,

where Ω_k represents the k-th pooling region, and z_k^l and z_p^l denote the output and input features of the l-th max-pooling layer at indices k and p, respectively. The LSTM layer is used to learn long-term information, whereas the stacked LFLBs learn local information. Through a cell with a self-recurrent connection, LSTM can add or remove information from the cell state via three gates: an input gate, an output gate, and a forget gate. A softmax layer renders predictions that combine both local and global contextual information from the learned features, and a fully connected layer maps these features into the output space.
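The three LFLB operations discussed above, batch normalization, the ELU activation, and non-overlapping max pooling, can each be sketched in NumPy. These are simplified stand-ins (for example, the learnable scale and shift of a real BN layer are omitted):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def batch_norm(z, eps=1e-5):
    """Normalize each feature over the batch axis to zero mean, unit variance
    (the learnable scale/shift of a real BN layer is omitted here)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def max_pool2d(x, pool=2):
    """Non-overlapping 2-D max pooling: z_k = max over region Omega_k."""
    h, w = x.shape[0] // pool, x.shape[1] // pool
    return x[:h * pool, :w * pool].reshape(h, pool, w, pool).max(axis=(1, 3))
```

Note how the ELU's negative branch saturates at −α, which is what keeps the mean activation close to zero.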
When designing a deep architecture, selecting a suitable collection of hyperparameters is crucial. To improve the efficiency of the deep network, hyperparameter optimization is performed on a separate dataset. Random search and grid search have been used successfully in several deep learning applications to accelerate the design of deep networks. Bayesian optimization, however, has been shown to produce improved outcomes with fewer trials [54]. Therefore, the Bayesian optimization approach is used to select the hyperparameters of the proposed deep network.
Bayesian optimization is a sequential model-based approach that efficiently minimizes the objective function. The hyperparameters in our experiments are optimized using Hyperopt, a Python library. Hyperopt treats the objective function to be minimized as a random function [55]. A prior is placed over the objective function and is updated with the gathered function evaluations to form the posterior distribution over the objective function. From the posterior distribution, an acquisition function is established, and the hyperparameters are then chosen iteratively. The optimization algorithm is selected from the choice distribution ('rmsprop', 'sgd', 'adam', 'adagrad'). The best model is returned after training with the optimized hyperparameters [56].
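To illustrate the hyperparameter selection loop without depending on Hyperopt itself, the sketch below uses a plain random search over a hypothetical space that mirrors the optimizer choices listed above; the toy objective is ours and stands in for a real validation loss:

```python
import random

def random_search(objective, space, n_trials=200, seed=0):
    """Minimal random-search stand-in for a Hyperopt-style fmin: sample each
    hyperparameter from its listed choices and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = objective(params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best[1]

# Hypothetical search space mirroring the choice distribution in the text
space = {
    "optimizer": ["rmsprop", "sgd", "adam", "adagrad"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}
# Toy objective: pretend 'adam' with lr = 1e-3 gives the lowest loss
toy = lambda p: abs(p["learning_rate"] - 1e-3) + (0.0 if p["optimizer"] == "adam" else 0.1)
best = random_search(toy, space)
```

In a real experiment the objective would train the network and return a validation loss; here the toy function simply encodes which combination should win.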

D. HYPERPARAMETERS OPTIMIZATION
As the proposed CNN+LSTM consists of a set of hyperparameters, the significant parameters in this set are the learning rate and label smoothing regularization factor. The learning rate affects network performance and directly determines the convergence speed along with the model accuracy. On the other hand, the smoothing regularization factor affects the intensity of the disturbance applied to the correct labels and thus affects the correctness of the input labels to the model. In this research, both of these parameters are optimized to determine their optimal values to improve the trained model accuracy. The optimization of these parameters is performed in terms of the recently published SFS-guided WOA.
The basic idea of the SFS-guided WOA is based on the behavior of whales, which trap their prey using bubbles that push it up to the surface in a spiral loop. In the SFS-guided WOA, a whale searches for the optimal parameter values while being guided by three other random whales [57]. This strategy improves the balance between the exploration and exploitation phases of this optimization task. The representation of these whales is described by the following equation.
where W_rand1, W_rand2, and W_rand3 represent the three random whales, each of which represents a potential solution. The stochastic fractal search employed in this algorithm depends on diffusion-limited aggregation (DLA), which generates the fractal shape of the objects. The SFS technique uses a diffusion process and two kinds of updating processes. Figure 3 depicts a graphical form of the SFS diffusion process: around the best solution BP, a list of solutions BP1, BP2, BP3, BP4, and BP5 is generated. Algorithm 2 presents the full process of the SFS-guided WOA. For more details about this optimization algorithm, please refer to [58].
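The WOA mechanics described above can be sketched as follows. This is an illustrative sketch only: the guidance term averaging three random whales is an assumption made here for demonstration, and the exact guided update and exponential z schedule are given in [57], [58].

```python
import numpy as np

def woa_step(W, W_best, a, rng):
    """One illustrative position update per whale: encircling when |A| < 1,
    exploration steered by three random whales otherwise, or the spiral
    bubble-net move with probability 0.5."""
    n, d = W.shape
    new = np.empty_like(W)
    for i in range(n):
        r1, r2 = rng.random(d), rng.random(d)
        A, C = 2 * a * r1 - a, 2 * r2
        if rng.random() < 0.5:
            if np.all(np.abs(A) < 1):
                D = np.abs(C * W_best - W[i])        # encircle the best whale
                new[i] = W_best - A * D
            else:
                # guided exploration via three random whales (assumed form)
                guide = W[rng.choice(n, 3, replace=False)].mean(axis=0)
                D = np.abs(C * guide - W[i])
                new[i] = guide - A * D
        else:
            l = rng.uniform(-1, 1, d)                # spiral bubble-net update
            D = np.abs(W_best - W[i])
            new[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + W_best
    return new
```

Decreasing `a` over iterations shifts the population from exploration (large |A|) toward exploitation around the best whale, which is the balance the SFS guidance is designed to improve.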

IV. EXPERIMENTAL RESULTS
To recognize the emotions embedded in the input speech, each utterance is segmented if it is longer than 8 seconds and zero-padded to an 8-second length otherwise. An FFT with a window length of 2048 and a hop length of 512 is used to compute the log-Mel spectrogram. Consequently, the log-Mel spectrogram is estimated with 251 frames and 128 Mel frequency bins [59]. The resulting 128 x 251 matrices are used in the conducted experiments as input to the CNN+LSTM network, and these 2D log Mel-spectrogram patches allow the network to learn high-level contextual information.
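Under the stated parameters, an assumed 16 kHz sampling rate makes the arithmetic consistent: 8 s × 16,000 samples/s = 128,000 samples, and with hop length 512 and centered framing this yields 1 + 128,000/512 = 251 frames. A sketch of the length-fixing step and the frame-count arithmetic:

```python
import numpy as np

SR = 16000            # assumed sampling rate; 8 s at 16 kHz with hop 512 -> 251 frames
N_FFT, HOP = 2048, 512
TARGET = 8 * SR       # fixed 8-second utterance length

def fix_length(y):
    """Zero-pad short utterances to 8 s; split longer ones into 8-s segments."""
    if len(y) <= TARGET:
        return [np.pad(y, (0, TARGET - len(y)))]
    n_seg = int(np.ceil(len(y) / TARGET))
    y = np.pad(y, (0, n_seg * TARGET - len(y)))
    return [y[i * TARGET:(i + 1) * TARGET] for i in range(n_seg)]

def n_frames(n_samples, hop=HOP):
    """STFT frame count with centered padding."""
    return 1 + n_samples // hop

print(n_frames(TARGET))   # 251 frames x 128 mel bins per segment
```

Each fixed-length segment is then converted to a 128 × 251 log-Mel spectrogram before being fed to the network.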

A. EXPERIMENTAL PLATFORM
The platform used to run the conducted experiments has the parameters presented in Table 2. The main factor in accelerating the training process is the utilization of the available GPU and memory. These resources allow running the experiments with a batch size of at least 16, which enables completing the model training in a relatively short time.

B. EXPERIMENTAL DATASETS
In this research, four datasets were included in the conducted experiments. These datasets are introduced in the following.
• RAVDESS: This dataset is composed of audio clips of songs and speech recorded by 24 speakers: 12 women and 12 men. The emotional expressions included in the speech clips are surprise, fear, anger, sadness, happiness, calm, and disgust, whereas the song clips cover fear, anger, calm, sadness, and happiness. Each sentence is recorded twice by each speaker. The dataset contains 1,012 song clips and 1,440 speech clips.
• IEMOCAP: This dataset contains approximately 12 hours of recordings performed by 10 actors in five dyadic sessions, covering emotions such as anger, happiness, sadness, and neutrality.
• Emo-DB: This Berlin emotional speech database contains 535 German utterances recorded by 10 actors (5 men and 5 women), covering the emotions of anger, boredom, disgust, fear, happiness, sadness, and neutrality.
• SAVEE: This dataset contains 480 British English utterances recorded by 4 male speakers, covering the emotions of anger, disgust, fear, happiness, sadness, surprise, and neutrality.

Algorithm 2: SFS-guided WOA (outline)
  while the maximum number of iterations is not reached do
    for (i = 1 : i < n + 1) do
      if (r3 < 0.5) then
        if (|A| < 1) then
          Update the position of the current search agent (encircling update).
        else
          Select the three random search agents W_rand1, W_rand2, and W_rand3 from the current solutions.
          Update the (z) parameter using the exponential form.
          Update the position of the current search agent using the guided update.
        end if
      else
        Update the position of the current search agent (spiral update).
      end if
    end for
    for (i = 1 : i < n + 1) do
      Update the solutions based on the SFS diffusion equation.
    end for
    Convert the updated solutions to binary using the sigmoid function.
    Calculate the fitness function F_n for each agent W_i.
    Find the best solution W* from the updated solutions.
    Set t = t + 1
  end while
To use these datasets in the conducted experiments, each dataset is split into 80% for training/validation and 20% for testing. In addition, to allocate a subset for validation, the training/validation part is split further into 80% for training and 20% for validation. The input to the proposed model is the log Mel-spectrograms of the input speech utterances. The log Mel-spectrograms are calculated for 3 seconds of the input speech utterance. Utterances shorter than 3 seconds are extended to 3 seconds by a zero-padding operation; otherwise, they are split into 3-second chunks.
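The nested 80/20 split described above can be sketched as an index-level implementation (a hypothetical helper, not the authors' code):

```python
import numpy as np

def split_dataset(n_samples, seed=0):
    """80/20 train+validation vs. test, then 80/20 train vs. validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(0.2 * n_samples)
    test, trainval = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(trainval))
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

train, val, test = split_dataset(1440)   # e.g., the 1,440 RAVDESS speech clips
print(len(train), len(val), len(test))   # 922 230 288
```

The net effect is a 64%/16%/20% partition of each dataset into training, validation, and test subsets.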

C. MODEL TRAINING AND TESTING
The experimental data were arbitrarily divided into two groups, with the training group receiving 80% of the data and the test group receiving 20%. Experiments with comparable findings demonstrate that the CNN+LSTM network is capable of accurately detecting speech emotions. In the conducted experiments, only the most accurate and well-fit models are taken into consideration. The validation accuracy of the learned model is an important predictor of its generalization. The best predictive model is obtained when the validation accuracy reaches its maximum during CNN+LSTM training. As a result, the recorded model not only fits the experimental data well but also performs well in predicting speech emotions.
The CNN+LSTM deep network architecture is summarized in Table 3. Four local feature-learning blocks are depicted in the table. Convolutional, batch normalization, activation, and max-pooling layers are used in each learning block. The table also shows the shape of each layer. An LSTM layer is applied after the local feature-learning blocks to learn the global features from the input spectrogram.
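As a rough shape trace (the filter counts and pool sizes are assumptions here, since Table 3 is not reproduced): with 'same'-padded convolutions and 2×2 max-pooling, four blocks reduce the 128×251 input as follows.

```python
# Shape trace through four feature-learning blocks (2x2 pooling assumed).
def after_pool(h, w):
    """'Same' conv + batch norm + activation keep (h, w); 2x2 max-pool halves them."""
    return h // 2, w // 2

h, w = 128, 251            # log-Mel input: mel bins x frames
for block in range(4):     # four local feature-learning blocks
    h, w = after_pool(h, w)
print(h, w)                # reduced time-frequency map handed to the LSTM layer
```

The reduced feature maps are then unrolled along the time axis so the LSTM can model long-term correlations across frames.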
To verify the generalization ability of the developed CNN+LSTM network, the performance is recorded for the training and validation sets. Five-fold cross-validation was used to estimate the true generalization error of the network. Figure 4 depicts the progress of the loss values during the training of the network. As shown in the figure, the model could learn the significant features necessary for classifying speech emotions accurately; the loss values approach zero after epoch number 60. In the literature, many methods have been proposed to reduce the likelihood or the amount of overfitting. Overfitting is partly responsible for poor predictions on unseen sample data: when a model is overfitted, it memorizes the training data instead of learning to predict better. Overfitting can occur because of the complexity of the deep network or because the network is overtrained. Therefore, model selection, early stopping, batch normalization, regularization, and cross-validation are adopted to overcome overfitting [60]-[63].
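The five-fold protocol can be sketched as an index generator (a minimal NumPy version, not the authors' code):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=5))
print(len(splits))   # 5 folds, each holding out 20% of the samples
```

Averaging the validation scores over the five folds gives a less optimistic estimate of the generalization error than a single hold-out split.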
Early stopping, as shown in Figure 4, prevents overtraining and increases the prediction ability of the model. Performance monitoring is used to track the training and validation accuracy. The patience is the number of epochs with no improvement after which training is stopped. The network thus retains its best predictive performance even when the validation accuracy stops increasing.
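The patience rule amounts to the following check (the patience value is chosen here for illustration only):

```python
def stop_epoch(val_accs, patience=3):
    """Return the epoch at which training stops: validation accuracy has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float('-inf'), 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accs) - 1

# Validation accuracy plateaus after epoch 2; with patience 3, training stops at epoch 5.
print(stop_epoch([0.50, 0.70, 0.80, 0.80, 0.79, 0.80, 0.80], patience=3))
```

In practice, the model weights from the best-validation epoch (epoch 2 in this toy run) are the ones kept.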
In addition, the accuracy of the trained model is recorded during the training process. Figure 4 shows the progress of the accuracy over the training epochs. As shown in the figure, the accuracy of the trained model progresses smoothly for the selected learning rate. The training accuracy reaches its best performance after the 60th epoch for the four datasets. In addition, the validation accuracy stabilizes after the 60th epoch, which means that the model learns the training data accurately and is ready to generalize to the test set.
The accuracy increased significantly as a result of the use of data augmentation during training. The average recognition accuracy of the correctly identified emotions in the test sample was 99.2%, which is higher than all current competing approaches. The closest rival method in recognition precision was introduced in [26]; it is based on a CNN+LSTM deep network but does not use data augmentation, making it less resilient to variations in the input speech. This comparison clearly shows that the proposed solution outperforms the competition in recognizing speech emotions.
The confusion matrix of the recognition of speech emotions in the test set is shown in Figure 5. The test set is usually the final judge of the effectiveness of the developed approach. As shown in this figure, the proposed approach can successfully recognize almost all the speech emotions in the test set with very high accuracy. This reflects the efficiency of the proposed deep network along with the notion of data augmentation and parameter optimization, which positively affects the overall recognition accuracy.
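A confusion matrix like the one in Figure 5 is built by counting (true, predicted) label pairs; a minimal version:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true emotion; columns index the predicted emotion."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 emotion classes.
cm = confusion_matrix([0, 1, 1, 2, 2, 2], [0, 1, 0, 2, 2, 1], 3)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)   # diagonal mass = correctly recognized emotions
```

The diagonal concentration of such a matrix is exactly what "almost all emotions recognized with very high accuracy" means quantitatively.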

D. COMPARISON WITH EXISTING SYSTEMS
The proposed approach is compared with a set of competing approaches in the literature to validate its superiority. Table 4 presents the classification accuracy achieved by each approach, including the proposed approach. As shown in the table, the proposed approach outperforms the other approaches applied to the RAVDESS dataset, where the maximum accuracy previously achieved was 97.36%, whereas the proposed approach achieves an accuracy of 99.47%. Similarly, the best competing approach achieved an accuracy of 89.16% when applied to the IEMOCAP dataset. However, the proposed approach could achieve a higher accuracy, reaching 98.13% on the same dataset.

E. STATISTICAL ANALYSIS
Another perspective on the evaluation of the proposed approach is presented in this section in terms of an in-depth statistical analysis of the achieved results based on the employed datasets and in comparison with the other competing approaches. Figure 6 presents multiple measures of this statistical analysis. One of these measures is the residual plot, which shows the residual between the predicted emotion and the absolute residual of the recognition values, considering that the sum and mean of the residuals are equal to zero, as shown in Figure 6a. In the ideal case, the residual values should be distributed uniformly around the horizontal axis, which is clearly shown in the figure. In addition, the heteroscedasticity plot is shown in Figure 6b; homoscedasticity describes whether the error term is the same across the values of the independent variables. Figure 6c shows the quantile-quantile (QQ) plot, and Figure 6d presents the heatmap plot with ordinary one-way ANOVA. In addition, a histogram of the accuracies achieved using the proposed and other approaches is presented in Figure 6f. In this figure, the presented models were evaluated using multiple test sets, and the accuracy was then binned and counted. As shown in the figure, the proposed approach could achieve a stable performance while varying the test sets from the four datasets, with the achieved performance residing in the range of accuracy > 98%.

Table 4 (excerpt). SAVEE: [34] (Log-Mel spectrogram) 83.80, StarGAN+DCNN [43] (Log-Mel spectrogram) 92.97, TLCNN-RAM [64] (Log-Mel spectrogram) 89.02, Convolution-LSTM [65] (Log-Mel spectrogram) 85.46, Proposed (Log-Mel spectrogram) 99.50. Emo-DB: ADRNN [30] (Log-Mel spectrogram) 85.61, CNN + LSTM [32] (Mel filter bank) 69.72, DCNN + CFS + ML [34] (Log-Mel spectrogram) 82.10, RBFN+BiLSTM [36] (Spectrogram) 85.57, StarGAN+DCNN [43] (Log-Mel spectrogram) 91.06, TLCNN-RAM [64] (Log-Mel spectrogram) 80.71, Convolution-LSTM [65] (Log-Mel spectrogram).
However, the highest accuracy achieved by the other approaches is within the range of 88% to 95%. Moreover, Figure 7 shows the ranges of accuracy for each of the presented approaches, including the proposed approach. As shown in the figure, the competing approaches exhibit variation in performance, whereas the proposed approach exhibits a stable performance. A statistical test based on the proposed approach compared to the other approaches is also shown in Table 7. These results confirm the superiority of the proposed approach with parameter optimization using the guided whale optimization algorithm and indicate the statistical significance of the proposed approach for the tested SER problem compared to the other competing approaches. An additional set of experiments was conducted to evaluate the effectiveness of the data augmentation and its impact on the achieved results. In this set of experiments, we tested the proposed approach without employing the proposed data augmentation and recorded the results. Table 8 presents the findings of this evaluation. As shown in the table, the proposed data augmentation algorithm has a significant impact on the achieved results and is thus recommended. In addition, the adopted softmax classifier is compared with two other classifiers to show its effectiveness. Table 9 presents the comparison results. The other classifiers included in the experiments are K-NN, with K equal to the number of emotion categories in the dataset, and SVM with a radial basis function kernel. The presented results show the effectiveness of the adopted softmax classifier in the proposed approach.

V. CONCLUSION
A new approach for recognizing emotions embedded in a speech signal is proposed in this paper. The proposed approach utilizes deep learning through cascaded feature-learning blocks followed by a long short-term memory layer. The feature-learning blocks are composed of four layers, namely, convolutional, batch normalization, activation, and max-pooling layers. These layers extract high-level features from the log Mel-spectrum of the given speech samples. The log-Mel spectrograms are used to extract the local correlations and contextual information of the spoken utterances. To improve the performance of the proposed deep network, two hyperparameters were optimized using the whale optimization algorithm guided by the stochastic fractal search method. These hyperparameters are the learning rate and the label smoothing regularization factor. The evaluation of the proposed approach is performed on four speech emotion datasets, namely, IEMOCAP, Emo-DB, RAVDESS, and SAVEE. To train the proposed model using these datasets, a new data augmentation algorithm is proposed to increase the number of training samples and boost the generalization capability of the model. Experimental results show the effectiveness of the proposed approach in accurately recognizing speech emotions in the four adopted datasets. In addition, a comparison with the other competing approaches is performed to show the superiority of the proposed model. Moreover, a statistical analysis is performed to emphasize the stability of the performance of the proposed approach in recognizing speech emotions.

EL-SAYED M. EL-KENAWY (Member, IEEE) is an assistant professor at the Delta Higher Institute for Engineering & Technology (DHIET), Mansoura, Egypt. He has published more than 35 papers with more than 1000 citations and an H-index of 21. He has launched and pioneered independent research programs.
He motivates and inspires his students in different ways, providing a thorough understanding of a variety of computer concepts and explaining complex concepts in an easy-to-understand manner. He is a reviewer for Computers, Materials & Continua, IEEE Access, and other journals. His research interests include artificial intelligence, machine learning, optimization, deep learning, digital marketing, and data science.
BANDAR ALOTAIBI (Member, IEEE) received the Bachelor of Science degree (Hons.) in computer science (information security and assurance) from the University of Findlay, USA, the Master of Science degree in information security and assurance from Robert Morris University, USA, and the Ph.D. degree in computer science and engineering from the University of Bridgeport, USA. He is currently an Associate Professor with the Information Technology Department, University of Tabuk. His research interests include computer vision, network security, mobile communications, computer forensics, wireless sensor networks, and quantum computing.
GHADA M. AMER is the Vice Dean for Postgraduate Studies and Research at the Faculty of Engineering, Benha University, the President of the Centre for Strategic Studies of Science and Technology, and a VP at the Arab Science and Technology Foundation. She holds several other positions within her profession, including Director of the Innovation and Entrepreneurship Centre at Benha University, CEO and Co-founder of the ASTF Innovation Lab, former Head of the Electrical Engineering Department at Benha University, and CEO of the Global Awqaf Research Centre. Because she believes in the importance of R&D and innovation for the community, she was selected in 2016 as one of the jury members of the Rolex Awards for Enterprise. In January 2014, she was named one of the "Top 20 Influential Muslim Women Scientists in the World" by Muslim-Science Magazine, and she was named "Personality of the Year" by Muslim Science magazine, United Kingdom, in 2015. She also ranked first among the 50 most prominent Arab women leaders in entrepreneurship in 2014, as published by Sayidaty magazine. In 2016 and 2017, she was named one of the "500 Most Influential Muslims" in the field of science and technology by The Royal Islamic Strategic Studies Centre. Prof. Ghada is an active advocate of socio-economic development based on RDI within her country and the region. She has worked since 2009 as a volunteer with the Arab Science and Technology Foundation and later joined as a volunteer Manager for Women Programs. She was elected as a member of the Board of Directors (2011) and then VP of the Foundation (since 2012). She has developed and led more than 20 projects and programs to support scientific development and entrepreneurship. She has published about 42 papers in international journals. She has raised more than $2 million to support research, innovation, and entrepreneurship activities to create jobs and support the Arab community, and she has helped establish 142 startups based on innovative ideas from the region.
MAHMOUD Y. ABDELKADER received the Bachelor's degree in scientific computing from the Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt. He works as an AI-Pro collaborator in the role of machine learning engineer at the Information Technology Institute (ITI), Cairo, Egypt, where he has developed many machine learning-based intelligent systems. In addition, he is currently pursuing a diploma at the EPITA School of Engineering and Computer Science, Paris, France. His research interests include computer vision, speech processing, 3D visualization, deep learning, and data science.