Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks

In this study, we present a deep learning-based implementation for speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model was applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper for baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer achieves better performance than other state-of-the-art CNN-based SER models that can work on both temporal and sequential representations of emotions. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model attained state-of-the-art perceptual efficiency, achieving weighted accuracies (WAs) of 86.9% and 82.7% for the SUBESCO and RAVDESS datasets, respectively.


I. INTRODUCTION
Identifying human emotions from voice signals using a machine learning approach is important for constructing a natural human-computer interaction (HCI) system. Robotics, mobile services, contact centers, computer games, and psychological examinations are just a few of the areas where speech emotion recognition (SER) is used. Research on building a successful SER system has intensified in recent years, though the field has been active for the last two decades [1]. An SER system as a whole is a collection of methodologies for analyzing and classifying speech data in order to discover the embedded emotions. The first step in developing it is to create a dataset that is appropriate for the target language and modality. The emotional database can be acted, simulated, or elicited, and can contain audio-only, audio-visual, or facial-expression data [2]. However, selecting the appropriate features for classifying emotions accurately is the most crucial design decision. Acoustic features of speech are considered the most essential and extensively used features for speech emotion representation. Different kinds of acoustic features, such as prosodic, spectral, voice quality, and energy operator features, have been employed to construct SER systems in various studies [3]. Those features can be further classified as temporal (time-domain) and spectral (frequency-domain) features. It is also important to understand how emotions are represented in discrete or dimensional emotional models to explain the functions of similar emotions [4]. In the discrete approach, emotions are represented as distinct emotional states, e.g., sadness and happiness. In the dimensional model, emotions are represented by levels of valence (negative to positive) and arousal (low to high).
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Jiang.
The final result of SER is obtained with a classifier, which allows the system to determine the best match for the input emotional speech. Selecting an efficient classifier is a crucial part of SER. As a result, numerous types of classifiers have emerged to date, and the research is still ongoing. The hidden Markov model (HMM), support vector machine (SVM), Gaussian mixture model (GMM), artificial neural network (ANN), decision trees, and ensemble approaches are some well-known classifiers that have been employed in previous studies. The recent tendency is to use deep learning-based classifiers such as CNNs, DNNs, and RNNs, as well as deep learning-based augmentation techniques such as auto-encoders, multitask learning, attention mechanisms, transfer learning, and adversarial training [3]. In this study, two datasets from distinct languages were investigated. The first one is the SUST Bangla Emotional Speech Corpus (SUBESCO), which has recently been developed and made publicly available for the Bangla language [48]. The second corpus is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which was created for American English [49]. The log mel-spectrogram is used as the input feature vector to the CNN layer. Previous experiments demonstrate that the mel-spectrogram is a more effective audio feature than alternatives such as MFCCs and STFTs [5]-[7]. In recent studies, a CNN-LSTM combination has frequently been used to construct an end-to-end SER system, as it gives promising outcomes for spectral-temporal features [8]. CNNs are widely used in computer vision as well as in speech-related research [9]. A deep CNN can learn from a large number of samples and represent higher-level, task-specific knowledge. It is also important to capture the sequential information of speech for emotion recognition. The long short-term memory recurrent neural network (LSTM-RNN) architecture can exploit this information [10].
A BLSTM layer with a DCNN block has been utilized in this study for effective feature extraction to classify emotions. Transfer learning with deep learning models is a comparatively new and efficient tool for cross-lingual research [11], and it is also explored in this paper.
The primary contributions of this study are as follows: i) This work examines a deep learning-based model for the low-resource language Bangla, which achieved a high accuracy of 86.86% using SUBESCO, the largest emotional speech corpus available for this language. ii) A novel architecture, DCTFB (deep CNN with time-distributed flatten and BLSTM layers), is proposed. It consists of a DCNN feature learning block, a time-distributed flatten layer, and a BLSTM layer, and it can acquire both local and sequential information from emotional speech. iii) A comparative analysis is presented to show that the proposed architecture enhances emotion detection in comparison with other similar models. The model obtained state-of-the-art performance for the SUBESCO and RAVDESS datasets. iv) A detailed cross-lingual study covering cross-corpus training, multi-corpus training, and a transfer learning technique using Bangla and English emotional audio corpora is presented here for the first time.
The rest of this paper is organized as follows: Section II highlights related works from recent years, while Section III focuses on the methodology, including preprocessing, spectrogram generation, architectures, and training methods. Experimental details are presented in Section IV. Sections V and VI explain the findings and discussions. Finally, a conclusion is drawn in Section VII.

II. RELATED WORKS
Though there have been numerous studies in the field of SER for other languages, particularly English, just a few attempts have been made to establish SER for Bangla. In 2018, Rahman et al. proposed a Dynamic time warping assisted SVM emotion classifier for Bangla words [12]. The first and second derivatives of MFCC features were extracted as features for classification. The system achieved 86.08% average accuracy for a small dataset of only 200 words. In 2017, Badshah et al. proposed a CNN architecture consisting of three convolutional layers and three FC layers [13]. The model was trained on Berlin emotional corpus to discriminate between seven emotions based on the spectrograms collected from the stimuli. The average prediction accuracy for this system was 56%. Satt et al. presented an SER model which calculates log-spectrograms as feature vectors [14]. They experimented with two architectures: convolution-only and convolution-LSTM deep neural networks achieving prediction rates of 66% and 68%, respectively, for the IEMOCAP dataset. Etienne et al. employed a CNN-LSTM architecture to classify emotions using spectrogram information in 2018 [15]. They trained the model using the improvised part of the IEMOCAP dataset and got a WA of 64.5%. Three scenarios were considered for their experiment: shallow CNN with deep BLSTM, deep CNN with shallow BLSTM, and deep CNN with deep BLSTM. They achieved the best result for the combination of 4 convolutional and 1 BLSTM layer. With 3-D attention-based convolutional recurrent neural networks (ACRNN), Chen et al. used deltas and delta deltas of log mel-spectrogram for emotion identification [16]. The model was trained on Emo-DB corpus and improvised data of IEMOCAP corpus, yielding recognition accuracies of 82.82% and 64.74% for the two databases, respectively. The researchers trialed combinations of different numbers of convolution layers with LSTM. 
Among them, the combination of 6 convolutional layers with LSTM performed the best. Zhao et al. [17] presented another CNN-LSTM based deep learning model for end-to-end SER. The model was trained and tested using spectrograms taken from the audios of the IEMOCAP dataset, yielding a WA of 68%. The model was composed of attention-based BLSTM layers to extract the sequential features and fully convolutional network (FCN) layers to learn the spectro-temporal locality of the spectrograms. Another FCN model incorporating an attention mechanism was evaluated on the IEMOCAP corpus in 2019 and was claimed to outperform state-of-the-art models with a WA of 63.9% [18]. A 2D CNN-based architecture was employed for feature extraction from audios, and an SVM was used for emotion classification. Ghosal et al. [19] proposed a graph neural network-based technique for emotion recognition in conversation called the dialogue graph convolutional network (DialogueGCN). They compared the performance of the architecture with baseline CNN models and others on three datasets: IEMOCAP, AVEC, and MELD. The weighted accuracy for the IEMOCAP dataset was 64.18%. Zhao et al. [20] used connectionist temporal classification (CTC) with an attention-based BLSTM for SER. The system outperformed existing systems with a 69% accuracy for the IEMOCAP dataset in 2019. They extracted the log mel-spectrogram of each audio to perform the classification task. Another model, composed of a combination of BLSTM with FCN, reported a weighted accuracy of 68.1% for the IEMOCAP dataset and an unweighted accuracy of 45.4% for the FAU-AEC dataset [21]. Mel-spectrograms were fed into attention-based FCNs to classify emotions using LSTM-RNNs. Mustaqeem and Kwon proposed a state-of-the-art SER model using a deep stride CNN (DSCNN) with special strides in 2020 [22]. Spectrogram features were extracted from clean speech to classify emotions from the IEMOCAP and RAVDESS datasets.
The system reported average accuracies of 81.75% and 79.5% for the IEMOCAP and RAVDESS datasets, respectively. An edge and cloud-based emotion recognition system using the Internet of Things (IoT) was proposed in [23], where deep-learned features were extracted using a CNN in the core cloud. The system achieved unweighted accuracies of 82.3% and 87.6% for the RML database and the eNTERFACE'05 database, respectively. The same authors presented another deep learning-based emotion recognition system for Big Data containing both speech and video [24]. A recent study based on dilated causal convolution with context stacking for end-to-end SER was proposed by Tang et al. [25]. The proposed architecture consists of dilated causal convolution blocks that are stacked with various dilation numbers. The stacked structure consists of three learnable sub-networks and uses local conditioning related to the input frame for end-to-end SER. The datasets used in the experiments were RECOLA and IEMOCAP, and the extracted feature was the log mel-spectrogram. For improvised utterances from the IEMOCAP dataset, the system obtained a WA of 64.1%. For the RECOLA dataset, this design increased the WA by 10.7%.
Our research differs from the previous works mentioned above in that it proposes a new architecture, evaluated on the RAVDESS and SUBESCO datasets, that shows state-of-the-art performance.

III. METHODOLOGY
A. DATA PREPROCESSING
The librosa [26] framework was used to read and re-sample each wav file at a sampling rate of 44 kHz. Silence was removed by trimming the parts of the signal below 25 dB. Both datasets were created in studio environments and contain minimal noise in the stimuli. To remove the remaining additive noise from the audios, a Wiener filter was utilized. This filter is a linear minimum mean square error estimator (LMMSEE), and it performs very well with little speech distortion for single-channel audios [27]. It estimates the desired signal $y(n)$ from the observed signal $x(n) = y(n) + v(n)$ using the optimal filter coefficients $w^{*}$, assuming that the desired signal and the noise $v(n)$ are uncorrelated.
The signal obtained after filtering is:
$$\hat{y}(n) = \sum_{k} w^{*}_{k}\, x(n-k)$$
Before creating the mel-spectrograms, all audios were limited to 3 seconds in length. Fixing the length does not trim any important information because the stimuli in the RAVDESS dataset are already 3 seconds long, and almost all of the stimuli in SUBESCO end within 3 seconds if the silence at the end of each recording is excluded.
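The length-fixing step can be illustrated with a minimal sketch. It assumes that the nominal 44 kHz rate means the standard 44,100 Hz, that shorter clips are zero-padded, and that the spectrogram uses a centered transform with a hop length of 512 samples (librosa's default); none of these details is stated in the text, and the function names are hypothetical. Under these assumptions, a 3-second clip yields 259 frames, which agrees with the 128 × 259 input size of the network.

```python
def fix_length(samples, target_len):
    """Truncate or zero-pad a list of samples to exactly target_len items."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def num_frames(n_samples, hop_length):
    """Frame count of a centered short-time transform: 1 + floor(n / hop)."""
    return 1 + n_samples // hop_length


SAMPLE_RATE = 44100   # assumed: standard rate behind the nominal "44 kHz"
HOP_LENGTH = 512      # assumed: librosa's default hop length
DURATION_S = 3        # clip length used in the paper

target_len = SAMPLE_RATE * DURATION_S
clip = fix_length([0.1] * 100000, target_len)   # a shorter clip is zero-padded
frames = num_frames(len(clip), HOP_LENGTH)
```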

B. MEL-SPECTROGRAM GENERATION
Humans have a logarithmic perception of auditory frequencies. The mel scale represents signal frequencies on a logarithmic scale, which is similar to this notion. A spectrogram visually represents how the frequencies evolve over time. This time-frequency representation of a signal is very important for experiments where the time-domain or the frequency-domain description alone is not enough to provide comprehensive information for classification [28]. The mel-spectrogram represents the power spectrum in each mel band against time, and it can illustrate the relative importance of different frequency bands in a way similar to human auditory perception. The relationship between the mel frequency $f_{mel}$ and the signal frequency $f_{Hz}$ is defined as:
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$
In this study, the mel-spectrogram for each sentence was calculated with 128 mel filter banks using a function of the librosa framework. The log mel-spectrogram was obtained by converting the power of the spectrogram to the logarithmic decibel scale. Figure 1 illustrates examples of extracted mel-spectrograms for four different emotional audios: anger, happiness, neutral, and sadness, all spoken by the same male speaker for the same sentence. It is evident that the four spectrograms in this figure differ from one another. Time is displayed on the x-axis, while the frequency on the converted log mel scale is represented on the y-axis. The color dimension represents the magnitude of the decomposed frequency components of the signal, corresponding to the mel scale. Dark colors indicate low amplitudes, while stronger amplitudes are denoted by brighter colors.
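The two conversions described in this subsection can be sketched in a few lines. The sketch assumes the common $2595 \log_{10}(1 + f/700)$ form of the mel mapping and a decibel conversion relative to a reference power of 1.0; librosa's own default mel mapping may differ in detail, and the function names here are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale (assumed 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def power_to_db(power, ref=1.0, floor=1e-10):
    """Convert a power value to decibels, clamping tiny values to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor) / ref)


mel_1k = hz_to_mel(1000.0)      # close to 1000 mel by construction of the scale
db_100 = power_to_db(100.0)     # a power ratio of 100 is 20 dB
```

Equal steps on the mel axis cover wider and wider bands in Hz, which is why the 128 mel filter banks resolve low frequencies more finely than high ones.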

C. CONVOLUTIONAL NEURAL NETWORK (CNN)
The convolutional neural network is a special kind of artificial neural network that can adaptively learn spatial hierarchies of features [29]. Convolutional layers, pooling layers, and fully connected layers are the fundamental building components of a CNN. Convolution layers use arrays of numbers, called kernels, to transform input data into feature maps. Two hyperparameters, the size and the number of kernels, are defined before the training procedure begins. The stride parameter indicates how far the kernel moves across the input matrix at each step. Padding is applied to the input matrix to allow kernels to overlap its outermost elements. The pooling layer down-sizes the feature maps by subsampling them and retaining only the dominant information. Common pooling methods include max pooling, min pooling, and global average pooling. An activation function determines whether or not a neuron fires, shaping the mapping from the input to the desired output. Sigmoid, tanh, and ReLU are commonly used activation functions. This non-linear function is applied after each convolutional layer and each fully connected layer. Dropout is a regularization technique that reduces overfitting by ignoring some randomly selected neurons during training. It can be used on neurons in the input layer as well as in hidden layers. The output feature maps of the last convolution layer are flattened into a one-dimensional array of numbers and passed to the fully connected (FC) layer, which connects all of its neurons to all neurons of the previous layer. Finally, to complete the classification task, a last activation function is applied. A CNN has the advantage of reducing the number of network parameters in training by allowing weight sharing. Concurrent learning of feature extraction in this network makes it highly organized and easier to implement than other networks [30].
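The effect of kernel size, stride, and padding on the size of a feature map can be made concrete with a small sketch. It assumes the two padding conventions used by Keras ('valid' for no padding and 'same' for padding that preserves the size at stride 1); the function names are hypothetical, and the pooling routine works on plain nested lists for illustration only.

```python
import math


def output_size(n, kernel, stride=1, padding="valid"):
    """Length of one output axis of a convolution or pooling layer."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1


def max_pool(feature_map, size, stride):
    """Max pooling over a 2D list, keeping the largest value of each patch."""
    rows = output_size(len(feature_map), size, stride)
    cols = output_size(len(feature_map[0]), size, stride)
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled = max_pool(grid, size=2, stride=2)   # each 2x2 patch becomes one value
```

With 'same' padding and stride 1 a 3 × 3 convolution leaves a 128-bin axis at 128, while a 2 × 2 pooling with stride 2 halves it to 64.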

D. LONG SHORT-TERM MEMORY (LSTM)
Long short-term memory (LSTM) is a kind of RNN composed of recurrently connected memory blocks, which contain memory cells with self-connections that store the temporal state of the network. In the memory blocks, there are special multiplicative units, called gates, that control the flow of information [31]. There are three gate units in each memory block: input, output, and forget gates. The cell input is multiplied by the activation of the input gate to perform the write operation, and the cell output is multiplied by the activation of the output gate to perform the read operation. To perform the reset operation, the previous cell values are multiplied by the activation of the forget gate [32]. LSTM addresses the problem of long-term dependence in the RNN, and it implements a refined internal processing unit that effectively stores and updates context information [33]. It also mitigates the standard RNN's problem of vanishing or exploding gradients during training [34].
The traditional LSTM can only learn in one direction; however, the bidirectional LSTM can access context information in both the forward and backward directions. It can make the system more robust by recognizing concealed emotions through directional analysis [35]. For forward analysis, the received signal sequence is fed in its original order into one LSTM cell in the forward direction, generating the sequence of hidden states $fh_{kt} = \{fh_{k1}, \ldots, fh_{kT}\}$. For backward analysis, the signal sequence is fed in reverse order into another LSTM cell in the backward direction, generating the sequence of hidden states $bh_{kt} = \{bh_{kT}, \ldots, bh_{k1}\}$. As the last states of those sequences contain the information of the entire sequences, they are concatenated to get the final state $h_t$.
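The gating and the bidirectional pass described in this subsection can be sketched with a single-unit cell that takes scalar inputs. This is an illustrative simplification rather than the network used in the experiments: the weights in `WEIGHTS` are arbitrary hypothetical values, and all function names are assumptions introduced for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell.

    params maps 'i', 'f', 'o', 'g' (input gate, forget gate, output gate,
    candidate cell input) to a (w_x, w_h, bias) tuple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))          # input gate: how much new input to store
    f = sigmoid(pre("f"))          # forget gate: how much old state to keep
    o = sigmoid(pre("o"))          # output gate: how much state to expose
    g = math.tanh(pre("g"))        # candidate cell input
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(sequence, params):
    """Feed a sequence through the cell and return the last hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h


def bilstm_final_state(sequence, fwd_params, bwd_params):
    """Concatenate the last forward state and the last backward state."""
    return [run_lstm(sequence, fwd_params), run_lstm(sequence[::-1], bwd_params)]


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {
    "i": (0.5, 0.1, 0.0),
    "f": (0.3, 0.2, 1.0),
    "o": (0.4, -0.1, 0.0),
    "g": (0.9, 0.5, 0.0),
}
final_state = bilstm_final_state([0.2, -0.4, 0.9], WEIGHTS, WEIGHTS)
```

The concatenated state has twice the width of a single direction, which is why a BLSTM layer outputs twice as many values as it has hidden units.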

E. PROPOSED DEEP CNN WITH TIME-DISTRIBUTED FLATTEN LAYER AND BLSTM LAYER (DCTFB) ARCHITECTURE
The proposed SER system (Figure 2) utilizes log mel-spectrograms extracted from the speech signals. The spectrograms are fed into the DCNN as input of size 128 × 259. The DCNN architecture consists of four local convolutional blocks similar to the local feature learning blocks (LFLB) described in [36]. However, the number of kernels and the layer parameters of this model are different from those of the reference model. The 2D convolutional layer in each block is followed by a batch normalization layer, an exponential linear unit (ELU) activation, and a 2D max-pooling layer. The result of the 2D convolution z(i, j) is obtained by convolving the input signal x(i, j) with the kernel w(i, j) of size k.
The batch normalization (BN) layer normalizes the activations of the convolutional layer, taking its learned features as input one mini-batch at a time. It acts as a regularizer, and it accelerates the training process by reducing the internal covariate shift [37]. ELU acts as the activation function applied to the output of the BN layer. The advantages of using ELU are that it has a lower computational complexity, it speeds up learning by reducing the bias shift effect, and it performs better with batch normalization than other activation functions [38]. The ELU activation function is defined as:
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$
where $\alpha > 0$. The max-pooling layer employs the widely used maximum pooling function, which extracts the largest value from each patch of the activated convolved feature map. It is used to prevent over-fitting and to down-sample the output features to reduce the computational load. Figure 3 shows details of the suggested DCNN architecture. The first convolutional layer has 128 kernels of size 3 × 3 and a stride of 1 × 1. In the first convolutional block, the max-pooling layer has a kernel of size 2 × 2 and a stride of 2 × 2. The convolution layers in all blocks have the same kernel size and stride, but the numbers of filters differ between blocks. There are 128 filters in the first two blocks and 64 filters in the latter two blocks. In the last three blocks, the max-pooling layers have a kernel of size 4 × 4 and a stride of 4 × 4. The suggested deep neural network model's layer parameters are detailed in Table 1. The output dimension indicates the height × width × filter_number in each layer. For an input of size $x_1 \times x_2$ with zero padding of 1 and k kernels of stride 1 in the convolutional layer, the output feature map is $x_1 \times x_2 \times k$. In the max-pooling layer with a stride of 2 × 2, the output feature map after pooling is $\frac{x_1}{2} \times \frac{x_2}{2} \times k$. The output of the DCNN block's final CNN layer is fed into a flatten layer enclosed in a time-distributed wrapper, which forms the time-distributed flatten (TDF) layer.
This wrapper allows the same weights and biases to be applied at every time step of a layer, which is important for exploiting the temporal correlations of sequential input data. It also helps the LSTM analyze the sequential information obtained from the previous layer in its temporal order. The TDF layer keeps the time axis of its input and flattens the remaining dimensions of each time step into a single feature vector. The output of this layer is passed into a BLSTM layer to capture both past and future information for analysis. The output dimension of the BLSTM network is twice the number of hidden units of the LSTM network. To avoid over-fitting, the BLSTM layer is followed by a 25% neuron dropout. Finally, in the fully connected (dense) layer, the softmax activation function is employed to normalize the prediction of the emotion classification. This activation is a simple and effective function to assess and discriminate the features for prediction [39]. A classical softmax function for every component i of a j-dimensional input vector z is defined as:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{n=1}^{j} e^{z_n}}, \quad i = 1, \ldots, j$$
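The shape bookkeeping of the last stages can be sketched without a deep learning library. The feature-map shape and the score vector below are hypothetical values chosen for illustration and are not the dimensions reported in Table 1; the function names are also assumptions. The sketch shows how a time-distributed flatten keeps the time axis, why a BLSTM doubles the output width, and how softmax turns seven class scores into probabilities.

```python
import math


def time_distributed_flatten(shape):
    """(steps, d1, d2, ...) -> (steps, d1 * d2 * ...): flatten each time step."""
    steps, *rest = shape
    return (steps, math.prod(rest))


def blstm_output_dim(hidden_units):
    """Forward and backward states are concatenated, doubling the width."""
    return 2 * hidden_units


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


feature_maps = (8, 4, 64)    # hypothetical (time steps, height, filters)
sequence_shape = time_distributed_flatten(feature_maps)
probabilities = softmax([2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3])   # 7 emotion scores
```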

1) COMPUTATIONAL COMPLEXITY
There are four convolutional blocks and a BLSTM layer in the proposed DCTFB model. Given that d is the total number of convolutional layers, the computational complexity of multiple convolutional layers is expressed as [40]:
$$O\left(\sum_{l=1}^{d} n_l \cdot k_l^{2} \cdot o_l \cdot m_l^{2}\right)$$
where, for each convolutional layer l, $m_l$ is the spatial size of the output feature map, $k_l$ is the size of the kernel, $n_l$ is the number of input channels, and $o_l$ is the number of output channels. As batch normalization and pooling layers take very little time, we ignore them while calculating the computational complexity of the model. Moreover, both of them speed up the overall learning process. The computational complexity of the BLSTM network is O(w), where w is the number of input parameters for the layer [41]. For i iterations and e epochs, the overall computational complexity of the learning process is:
$$O\left(i \cdot e \cdot \left(\sum_{l=1}^{d} n_l \cdot k_l^{2} \cdot o_l \cdot m_l^{2} + w\right)\right)$$
This expresses the complexity of the suggested architecture in asymptotic big-O notation.
The number of parameters $P_c$ for a convolutional layer with u kernels of size s, v input channels, and b biases is:
$$P_c = (s \times s \times v) \times u + b$$
For each input feature, batch normalization calculates four parameters: gamma weight, beta weight, optimal mean, and standard deviation. As a result, the total number of parameters $P_{bn}$ for N input features is:
$$P_{bn} = 4 \times N$$
The LSTM layer calculates the parameters for 3 gates (input gate, output gate, and forget gate) and a cell state. The total number of parameters $P_l$ for this layer is calculated as:
$$P_l = 4 \times \left((a + c) \times c + c\right)$$
where a is the input size and c is the output size. A BLSTM layer has twice the number of parameters of the LSTM layer. The dense layer connects each neuron of the previous layer to each neuron of the current layer. If the number of neurons in the previous layer is $p_n$, the number of neurons in the current layer is $c_n$, and the number of bias terms is b, then the number of parameters $P_d$ of this layer is calculated as:
$$P_d = p_n \times c_n + b$$
As the activation, max-pooling, dropout, and flatten layers do not involve back-propagation learning, the number of learnable parameters is 0 for those layers.
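The parameter counts described above can be checked with a short sketch. It assumes square kernels and one bias per kernel or neuron, which is how Keras counts them; the BLSTM width of 256 units and the dense input width of 512 used in the example are hypothetical, and all function names are introduced only for illustration.

```python
def conv2d_params(kernels, kernel_size, in_channels, use_bias=True):
    """Weights of all kernels plus one bias per kernel."""
    weights = kernel_size * kernel_size * in_channels * kernels
    return weights + (kernels if use_bias else 0)


def batch_norm_params(features):
    """Gamma, beta, mean, and variance statistic for every feature."""
    return 4 * features


def lstm_params(input_size, units):
    """Three gates plus the cell input, each with input, recurrent, and bias weights."""
    return 4 * ((input_size + units) * units + units)


def blstm_params(input_size, units):
    """A bidirectional layer holds two independent LSTMs."""
    return 2 * lstm_params(input_size, units)


def dense_params(prev_neurons, neurons, use_bias=True):
    """One weight per connection plus one bias per neuron."""
    return prev_neurons * neurons + (neurons if use_bias else 0)


first_conv = conv2d_params(kernels=128, kernel_size=3, in_channels=1)
second_conv = conv2d_params(kernels=128, kernel_size=3, in_channels=128)
first_bn = batch_norm_params(128)
recurrent = blstm_params(input_size=128, units=256)   # hypothetical sizes
classifier = dense_params(prev_neurons=512, neurons=7)
```

The first convolutional layer sees a single-channel spectrogram and is therefore tiny compared with the second, whose kernels span 128 input channels.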

F. OTHER NEURAL NETWORK MODELS
In addition to the proposed DCTFB model described above, we also experimented with eight other models. Three of them are reference models that were proposed recently for SER. The other five are models we built to observe the impact of applying TDF with LSTM and BLSTM layers. The networks are briefly described below:

1) 2 CNN BLOCK ARCHITECTURE
The first model, which we considered the base model for our study, contains two pairs of 2D convolutional and max-pooling layers followed by two fully connected layers, as presented in [42]. There is a ReLU activation after each convolutional layer. The first convolutional block uses 128 kernels, while the second block uses 64 kernels. The kernel size is 5 × 5 with stride 1 in each layer. The max-pooling layers have kernels of size 2 with stride 2. Batch normalization was added to each convolutional layer; it was not present in the original model, but it improved the performance of the system. The last max-pooling layer is followed by an 85% dropout. The system contains two fully connected layers with a softmax classifier. A stochastic gradient descent algorithm was applied for model optimization. This model is named RM1 (reference model 1) in this study.

2) 4 CNN BLOCK ARCHITECTURES
To compare the results of four-block architectures, we experimented with the reference model described in [36], which is referred to as RM2 (reference model 2) in this study. The first two convolutional layers of RM2 have 64 kernels, whereas the last two have 128 kernels. Each convolutional layer is followed by a BN layer, an ELU activation, and a max-pooling layer. The kernel size is 3 (stride 1) for the convolutional layer and 2 (stride 1) for the max-pooling layer in each block. We experimented with three other variations of the proposed 4 CNN block DCTFB model. In the 4CNN+LSTM and 4CNN+BLSTM models, reshaping was employed instead of a TDF layer after the last convolutional block. For all of these models, the training parameters were the same as for the proposed system.

3) 7 CNN BLOCK ARCHITECTURES
The base model for the seven-layer architecture was the deep stride CNN architecture described in [22]. This model was tested on 1440 RAVDESS audio-only speech files and yielded an accuracy of 79.5% for clean speech data. It is referred to as RM3 (reference model 3) in this study. The model is made up of seven convolutional layers, each with a batch normalization layer and an ELU activation. There is no max-pooling layer. The numbers of filters used in the convolutional layers are 16, 32, 32, 64, 64, 128, and 128. Kernel sizes are 7 × 7 for the first layer, 5 × 5 for the second layer, and 3 × 3 for the remaining convolutional layers. All of these kernels have a stride of 2. The remaining part has a 25% dropout followed by an FC layer with a softmax activation function. We experimented with two variants of the RM3 system. Those variants had a TDF layer after the last convolutional layer. In one case, it was followed by an LSTM layer, whereas in the other, it was followed by a BLSTM layer.

G. TRAINING THE MODELS
The Keras [43] neural network package was used to implement all of the models for comparative analysis. The 'same' padding approach from Keras was utilized in each convolutional layer, implying that the output spatial dimensions for stride 1 are the same as the input spatial dimensions. To train the models, we used log mel-spectrograms derived from the audios in our target dataset as input features. The entire set of input features was divided in a 70%, 20%, and 10% ratio for training, validation, and testing, respectively. To get optimal results, it is very important to choose proper settings for training a DNN. We trialed different learning rates and other parameters for all of the models detailed above and found that a learning rate of $10^{-3}$ with a decay of $10^{-6}$ and a batch size of 32 produced the best results. Batch normalization and dropout were utilized for regularization. The loss function used to measure the prediction error of the DNN model was the cross-entropy loss [44]. The gradient-based optimization algorithm Adam [45] was used during training of the models, and the softmax activation function was applied for classification. Adam is computationally efficient in dealing with gradients and is also suitable for training a large number of parameters with a low memory requirement. For our experiments, all of the models were trained and executed using Jupyter notebook [46], both on a local workstation without a GPU and on Google Colaboratory [47], which is a cloud-based service with a GPU. The local machine was a MacBook Pro notebook with a 2.4 GHz Intel Core i9 processor, 32 GB RAM, and Intel UHD Graphics 630, running macOS version 10.15.7.
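Two of the training quantities mentioned above can be written out explicitly. The sketch assumes that the stated decay refers to the time-based decay of the older Keras optimizers, where the rate at update step t is the initial rate divided by (1 + decay · t); the text does not say which schedule was used, so this is an interpretation. The cross-entropy function takes the softmax probabilities of one sample and the index of its true class, and the function names are hypothetical.

```python
import math


def decayed_learning_rate(initial_rate, decay, step):
    """Time-based decay (assumed schedule): rate shrinks as 1 / (1 + decay * step)."""
    return initial_rate / (1.0 + decay * step)


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Categorical cross-entropy of one sample with a one-hot target."""
    return -math.log(max(probabilities[true_index], eps))


rate_start = decayed_learning_rate(1e-3, 1e-6, step=0)
rate_late = decayed_learning_rate(1e-3, 1e-6, step=1_000_000)
loss_confident = cross_entropy([0.1, 0.7, 0.2], true_index=1)
loss_wrong = cross_entropy([0.1, 0.7, 0.2], true_index=0)
```

With a decay of $10^{-6}$ the rate changes very slowly: it takes one million update steps to halve it, so within a typical run the schedule stays close to the initial $10^{-3}$.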

IV. EXPERIMENTAL SETUPS
A. DATASETS
SUBESCO and RAVDESS are the two datasets used in this research study. SUBESCO is the only verified emotional speech corpus for Bangla that is gender-balanced [48]. It is an acted emotional corpus with 7000 audios from ten male and ten female speakers. In this dataset, there are seven acted emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. The statistics of all the recognizers that were run on SUBESCO were compared to those from the RAVDESS dataset [49].
RAVDESS is an audio-visual resource for American English emotional speech and songs. The stimuli were recorded by twelve male and twelve female actors. Three forms of recordings are available for this corpus: audio-video, audio-only, and video-only. For our research, we looked solely at the audio-only speech and song recordings. This portion includes 1440 speech files and 1012 song files recorded from 24 actors. The song files of one female actor are missing from the dataset. In total, 2452 audio files from RAVDESS were used to analyze the results. Speech includes neutral, calm, happy, sad, angry, fearful, surprise, and disgust expressions. The songs contain neutral, calm, happy, sad, angry, and fearful emotions. RAVDESS was chosen alongside SUBESCO for the experimental study of the proposed models because the authors of SUBESCO were influenced by RAVDESS's development technique during its construction. The audio-only stimuli in these two corpora were created and validated in the same way. In this regard, using RAVDESS for model comparisons and cross-lingual analysis was a sensible design decision.

B. SETUP 1: BASELINE EXPERIMENTS WITH SUBESCO
The same corpus was utilized for training, validation, and testing in the baseline experiments. For SUBESCO, all 7000 stimuli for seven emotions were taken into consideration: 70% was allocated to training, 20% to validation, and 10% to testing. We used 4900 SUBESCO files to train the models. For validation, another subset of 1400 files was employed. Testing was carried out on 700 files that had not been used in the training or validation stages. All the subsets were balanced in terms of speaker gender and emotion classes. Details can be found in Table 3.
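A balanced 70:20:10 split of this kind can be sketched as a stratified split over (emotion, gender) groups. The sketch assumes that the 7000 files are spread evenly over the seven emotions and two genders (500 per group) and uses placeholder tuples in place of real file names; the function name, the random seed, and the grouping key are hypothetical choices and do not reproduce the exact partition in Table 3.

```python
import random
from collections import Counter, defaultdict


def stratified_split(items, key, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split items into train/validation/test while preserving group proportions."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for stratum in sorted(groups):
        members = groups[stratum]
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0])
        n_val = round(len(members) * ratios[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
GENDERS = ["female", "male"]
# Placeholder corpus: (emotion, gender, index) stands in for one audio file.
corpus = [(e, g, i) for e in EMOTIONS for g in GENDERS for i in range(500)]
train, val, test = stratified_split(corpus, key=lambda item: (item[0], item[1]))
test_counts = Counter((emotion, gender) for emotion, gender, _ in test)
```

Splitting inside each group, rather than over the whole corpus at once, is what keeps every subset balanced in both emotion and speaker gender.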

C. SETUP 2: BASELINE EXPERIMENTS WITH RAVDESS
RAVDESS speech and song audio-only files were merged for the six emotions they share, namely neutral, calm, happy, sad, angry, and fearful. Because of the perceptual similarities highlighted in [49], calm audios were relabeled as neutral. Finally, a total of five emotions were considered for this dataset. The training, validation, and testing ratio of the models for this setup is 70:20:10, which is the same as for SUBESCO. Due to the missing audios from a female speaker, it was not possible to partition this dataset in a balanced way in terms of emotions and speaker gender. Table 4 shows the distribution of RAVDESS for training, validation, and testing.

D. SETUP 3 & 4: CROSS-CORPUS EXPERIMENTS
For the cross-corpus study, every explored model trained on one dataset was tested against the other dataset. To evaluate the models trained with SUBESCO (Setup 3), 1440 RAVDESS audio stimuli were chosen, comprising seven emotions (with calm substituted by neutral). To assess the models trained with RAVDESS (Setup 4), 850 audio samples from SUBESCO were used for the five matching emotions: anger, fear, happiness, neutral, and sadness.

E. SETUP 5 & 6: TRANSFER LEARNING
To improve the experimental outcomes of the cross-lingual analysis, we next applied transfer learning. Transfer learning is the process of training a deep learning model on a large dataset from one domain and then applying that model to a similar problem with a different, smaller dataset. In this technique, the weights of the initial layers of the original model are kept unchanged, allowing those layers to reuse what they have learned when modeling a new but related task. In Setup 5, all the models trained with the RAVDESS dataset were used as a starting point to train and test on 850 files of the SUBESCO dataset, with an 80:20 split for training and testing, respectively. The final dense layer was removed from the trained models, and the weights of all remaining layers were frozen to make them untrainable. Then, a new dense layer with softmax activation was added and trained on the SUBESCO data. For Setup 6, all the models pre-trained on the SUBESCO dataset were used to transfer the knowledge of emotional feature extraction to new models, which were then trained to classify the RAVDESS dataset. This experiment used 1440 RAVDESS audio-only speech files, with an 80:20 split for training and testing the new transferred models.
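The freeze-and-replace procedure can be illustrated with a deliberately tiny, pure-Python stand-in. This is a sketch under stated assumptions, not the authors' network: the "pre-trained" base is a single made-up dense layer with ReLU, the target data are four toy points, and the names `features` and `train_new_head` are hypothetical. It shows only the mechanics: the base weights are read but never updated, while a fresh softmax layer is trained on the new data.

```python
import copy
import math
import random

def features(x, frozen_w):
    """Frozen base: one dense layer with ReLU; its weights are only read."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in frozen_w]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_new_head(frozen_w, data, n_classes, epochs=100, lr=0.5, seed=0):
    """Train only a fresh softmax layer on top of the frozen features."""
    rng = random.Random(seed)
    head = [[rng.uniform(-0.01, 0.01) for _ in frozen_w] for _ in range(n_classes)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            h = features(x, frozen_w)
            p = softmax([sum(w * hj for w, hj in zip(row, h)) for row in head])
            total -= math.log(p[y])  # cross-entropy for the true class
            for c in range(n_classes):
                grad = p[c] - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
                for j, hj in enumerate(h):
                    head[c][j] -= lr * grad * hj  # only head weights change
        losses.append(total / len(data))
    return head, losses

# Made-up "pre-trained" base weights and a small target-domain dataset.
base_w = [[1.0, -1.0], [-1.0, 1.0]]
snapshot = copy.deepcopy(base_w)
target_data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]

head, losses = train_new_head(base_w, target_data, n_classes=2)
print(base_w == snapshot, losses[-1] < losses[0])  # True True
```

In a deep learning framework the same idea is usually expressed by marking the reused layers as non-trainable and attaching a new output layer sized for the target label set.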

F. SETUP 7: EXPERIMENTS WITH MULTILINGUAL DATASETS
In the multilingual training experiment, 2000 stimuli from each corpus were considered, giving 4000 stimuli in total for the emotional states of neutral, happiness, sadness, anger, and fear. RAVDESS's neutral, happy, sad, angry, and fearful labels were renamed to match these states, and all the RAVDESS files labeled as calm were treated as neutral. Testing was carried out on a subset of 850 audios from SUBESCO that had not been included in the training or validation stages. Each class in the test set has the same number of instances, 170. The file distributions of both datasets for this experiment are shown in Tables 3 and 4.

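G. EVALUATION METRICS
The performance of the classification task was assessed at two levels: overall accuracy and class-wise accuracy. Weighted accuracy, unweighted accuracy, and average F1 values were computed to assess overall performance. Sensitivity, specificity, precision, and negative predictive value were used to report class-wise accuracy. These metrics are described below.
Unweighted accuracy (UA) is the ratio of the total correct predictions to the total instances of all classes in the dataset:
$$\mathrm{UA} = \frac{\text{correct predictions}}{\text{total instances}} \tag{14}$$
Weighted accuracy (WA) weighs each class according to its number of correct predictions. The total correct predictions of a class, $\mathrm{correct}_i$, is divided by the total instances of that class, $\mathrm{instance}_i$, in the dataset. The sum of these class ratios, normalized by the number of classes $C$, becomes:
$$\mathrm{WA} = \frac{1}{C}\sum_{i=1}^{C}\frac{\mathrm{correct}_i}{\mathrm{instance}_i} \tag{15}$$
Recall or sensitivity is the fraction of correctly classified instances among the total instances of that class in the dataset. It is also referred to as the true positive rate (TPR) and is defined as:
$$\text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \tag{16}$$
Specificity or selectivity is the fraction of correctly classified negative instances among all the negative instances of a class. It is also termed the true negative rate (TNR) and is defined as:
$$\text{specificity} = \frac{\text{true negative}}{\text{true negative} + \text{false positive}} \tag{17}$$
Precision is the fraction of correctly classified instances among all instances predicted as a class, and negative predictive value (NPV) is the fraction of correctly classified negative instances among all instances predicted as negative:
$$\text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}, \qquad \text{NPV} = \frac{\text{true negative}}{\text{true negative} + \text{false negative}}$$
The F1 score is defined as:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$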
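The metrics above can all be read off a confusion matrix. The sketch below is a minimal illustration that follows the definitions of this section, assuming that WA averages the per-class ratios over the number of classes; the function names and the toy matrix are hypothetical. Note that some SER papers attach the UA and WA labels the other way round, so the definitions should be checked before comparing numbers across studies.

```python
def ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0

def overall_metrics(cm):
    """cm[i][j] = number of class-i instances predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    ua = ratio(correct, total)                                   # Eq. (14)
    wa = sum(ratio(cm[i][i], sum(cm[i])) for i in range(n)) / n  # mean class ratio
    return ua, wa

def class_metrics(cm, k):
    """One-vs-rest metrics for class k."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                        # class k predicted as another class
    fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Toy 3-class confusion matrix (made-up counts, imbalanced on purpose).
cm = [[5, 1, 0],
      [1, 3, 0],
      [0, 0, 10]]
ua, wa = overall_metrics(cm)
print(round(ua, 3), round(wa, 3))  # 0.9 0.861
```

With a balanced test set every class has the same number of instances, so the two overall measures coincide; the toy matrix is imbalanced to make the difference visible.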

A. ANALYSIS OF MODELS USING SUBESCO DATASET (SETUP 1)
We evaluated all the target SER models on the SUBESCO dataset. Table 5 presents the test results for all topologies and shows that the DCTFB model has the highest accuracy among them. The WA achieved by this model is 86.86%, and the average F1 score is also 86.86%. Here, the WA and UA are the same because we used a balanced dataset for training, validation, and testing. As a result, only the WA [...]
Table 9 presents the proposed model's confusion matrix for the RAVDESS dataset. Table 10 reports the accuracy metrics for this setup. It reveals that anger has the highest F1 score (90.91%) and sad has the lowest (71.43%).

C. ANALYSIS OF CROSS-CORPUS TESTS (SETUP 3 & 4)
For the cross-lingual analysis, all models trained on one dataset were tested on a subset of the other dataset. A subset of 1440 RAVDESS files for seven emotions was evaluated on the models trained with SUBESCO (Setup 3), while the models trained with RAVDESS were used to test a subset of SUBESCO for five emotions (Setup 4). The prediction performance for this study is reported in Table 11, which shows that all of the models had poor accuracy in this experiment. For SUBESCO, the DCTFB model yielded the best testing accuracies (UA = 27.92%, WA = 24.06%). In the case of RAVDESS, 4CNN+LSTM performed slightly better. RM3 had the lowest recognition performance. Linguistic and cultural barriers are known to hinder the classification of emotions in one language by a deep learning model trained on another language. Although a few studies have conducted this type of experiment, the results have not been as promising as expected [50].

D. ANALYSIS OF TRANSFER LEARNING (SETUP 5 & 6)
[...] Table 15 shows the confusion matrix for the proposed model of this study. Anger achieved the highest recognition rate, while fear had the second-highest. Sadness is the least recognized emotion and is largely confused with fear. The result suggests some notable patterns; for example, anger is commonly expressed through loudness in many cultures, which may have influenced the outcome of this study. However, these observations are not sufficient to draw conclusions from a multilingual trial. A more in-depth investigation might help determine the similarities of emotion expression across cultures. Emotion-wise accuracy metrics are shown in Table 16.

VI. DISCUSSION
In this work, a deep CNN block was used to learn high-level local emotional features, while the LSTM was used to learn long-term contextual global features. We experimented with nine distinct topologies using a variety of layers and parameters. Comparing the best results of all the models, we found that the proposed model (DCTFB) performed consistently well and outperformed the others in almost all cases. To demonstrate the performance of the trained models, both weighted and unweighted accuracies were calculated. WA is important when the class distribution in the dataset is not balanced, so that the reported performance is not biased towards the larger classes.
For both datasets, we attempted to employ a balanced split because studies have shown that a DNN classifier trained on a balanced dataset produces the best results and is more robust [51], [52]. In the case of RAVDESS, all of the emotions have the same number of samples, except for neutral. The WA, UA, and F1 scores of the proposed model under the different setups are compared in Figure 4. We found that introducing TDF enhanced the prediction accuracy by using the salient discriminating features, regardless of the training language. Comparing the two databases, SUBESCO performed better than RAVDESS. One possible explanation is that SUBESCO has more balanced instances of each class than RAVDESS.
This is the first SER implementation for SUBESCO, whereas the audio-only files of the RAVDESS dataset have recently been used in a variety of emotion classification studies. A study [22] obtained 79.5% UA for 1440 RAVDESS files using a CNN classifier. Issa et al. used a CNN to classify RAVDESS audio files based on a combination of spectral parameters and reported a recognition rate of 71.61% [53]. For those files, Zisad et al. obtained an average accuracy of 82.5% using data augmentation with a CNN classifier [54]. For the same subset of the RAVDESS dataset, a real-time speech recognition system using transfer learning techniques with the pre-trained VGG16 model showed an emotion perception rate of 62.51% [55]. Patel et al. presented another work in which they used an autoencoder to reduce dimensionality and a CNN classifier to reach an accuracy of 80% for RAVDESS audio-only files [56]. A system consisting of a CNN and head-fusion multi-head attention achieved 77.8% WA for the audio-only speech files of RAVDESS in recent work [57]. The most recent SER system using this dataset was presented in [58]. The system employed a GA-optimized feature set to classify emotions using an SVM trained and tested with the 1440 speech files of RAVDESS. It achieved a UA of 82.5% in a speaker-dependent experiment carried out using this subset of the corpus. In contrast, our proposed model used both the speech and song files of the RAVDESS audio-only recordings. It achieved a UA of 82.7%, which is the highest accuracy reported to date for RAVDESS audio-only files.
For the seven-layer CNN models, WA improves consistently when TDF is used with LSTM/BLSTM on both datasets. The four-layer CNN models also show their best recognition accuracies when accompanied by the TDF layer. The seven-layer models, on the other hand, achieve good accuracies with substantially less training time, owing to the reduced size of the feature maps in these architectures. Figure 5 compares the overall performance of all the tested models for Setups 1, 2, and 7.
The cross-lingual study (Setups 3-7) shows poor performance when a model is tested on an unseen dataset of another language. We observed that applying transfer learning with pre-trained models significantly boosts the performance of the deep learning models on another dataset. Training the models with aggregated datasets also enhances performance to a large extent. The line graph in Figure 6 compares the UAs of all the models across the seven setups. Subsets of SUBESCO and RAVDESS, rather than the whole datasets, were chosen to examine the effect of training deep learning models on a larger dataset and testing them on a smaller one, which is very useful for classifying emotions in low-resource languages. A limitation of our study is that we experimented with only two datasets, from completely different languages and cultures. Further cross-lingual research involving more datasets needs to be conducted.

VII. CONCLUSION
In this study, a novel architecture for SER has been proposed that demonstrated state-of-the-art prediction performance for the Bangla dataset SUBESCO as well as the English dataset RAVDESS. Weighted accuracy, unweighted accuracy, precision, recall, NPV, specificity, and F1 scores were used as performance measures to report the results. To combat overfitting during model training, batch [...] audio-only files (speech and song) with an accuracy of 82.7%. A cross-lingual study implemented using transfer learning shows that models trained on the SUBESCO dataset can be applied to other languages as well with satisfactory performance. As Bangla is still considered a low-resource language, we anticipate that this work will provide a new direction for Bangla research. In our future research, we plan to extend this work using a multi-dimensional dataset for Bangla. We also wish to conduct a cross-lingual study for prominent Indo-Aryan languages.

APPENDIX TRAINING PLOTS FOR BASELINE EXPERIMENTS
The training histories of Setups 1 and 2 for the proposed model are presented in Figures 7 to 10.

ACKNOWLEDGMENT
This work was carried out as part of a Ph.D. research project that was initially supported by the Higher Education Quality Enhancement Project (AIF Window 4, CP 3888) for "The Development of Multi-Platform Speech and Language Processing Software for Bangla." The authors would like to thank Dr. Yasmeen Haque for proofreading the manuscript, and all of the creators of the RAVDESS dataset for developing and sharing their invaluable resources.