Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition

Automatic speech recognition (ASR) is one of the most demanding tasks in natural language processing owing to its complexity. Recently, deep learning approaches have been deployed for this task and have been shown to outperform traditional machine learning approaches such as artificial neural networks (ANNs). In particular, deep learning methods such as long short-term memory (LSTM) have achieved improved ASR performance. However, this method is limited in its ability to process continuous input streams. A traditional LSTM requires four linear layers (multilayer perceptron (MLP) layers) per cell, with a large memory bandwidth for each sequence time step. LSTM cannot accommodate the many computational units required for processing continuous input streams because the system does not have sufficient memory bandwidth to feed the computational units. In this study, an enhanced deep learning LSTM recurrent neural network (RNN) model is proposed to resolve this shortcoming. In the proposed model, the RNN is incorporated as a “forget gate” into the memory block to allow the cell states to be reset at the beginning of sub-sequences. This enables the system to process continuous input streams efficiently without necessarily increasing the required bandwidth. In the proposed model, the standard architecture of the LSTM network is modified to use the model parameters effectively. Several CNN-based and sequential models were trained on the same dataset and compared with the proposed model. The LSTM-RNN outperformed the other deep learning models with an accuracy of 99.36% on the well-established public benchmark spoken English digit dataset.


I. INTRODUCTION
Speech comprises a sequence of uttered sounds, which are also known as phonemes. Speech is used to transmit information from one speaker to the other. When the signal from speech is converted into a meaningful message or text, it is called Automatic Speech Recognition (ASR) [1]. The recognition of isolated spoken digits has proven to be a challenging task in ASR owing to its complexity.

A. BACKGROUND
Deep learning is an emerging technology that is regarded as a promising direction for advancing artificial intelligence [2]. At present, deep learning has been deployed in a wide range of domains, including bioinformatics, computer vision, machine translation, dialogue systems, and natural language processing. One area that has been transformed by this technology is ASR [3]. In recent times, deep learning has been deployed for ASR [4]-[6], speech recognition systems [7], [8], and speech enhancement problems [9]-[11], and has outperformed traditional machine learning approaches such as artificial neural networks (ANNs).
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
Although ANNs can classify small acoustic-phonetic units such as individual phonemes, they cannot model long-term dependencies in acoustic signals [12]. Deep neural networks (DNNs) provide only restricted temporal modeling of the acoustic frames and likewise cannot deal with data that have longer-term dependencies. Feedforward neural networks can be extended for more effective classification by feeding the signals from previous time steps back into the network. Such networks with recurrent interconnections are called recurrent neural networks (RNNs) [13], [14]. RNNs are nevertheless restricted because they can look back in time for only roughly ten time steps [15].
The connections in RNNs are cyclic, which makes them a dynamic mechanism for modeling sequence data [16]. Thus, RNNs use a dynamic contextual window against a static fixed-size window over sequences. Unfortunately, RNNs are difficult to train using gradient-based back propagation through time (BPTT) [17] and are not likely to demonstrate the full power of recurrent models. This is because of the well-known vanishing and exploding gradient problems [18].
One way to improve the training of RNNs is to use an optimization algorithm with higher-order approximations [19]. However, this normally comes at a remarkably increased computational cost, which makes the approach unattractive for language modeling, which requires an enormous amount of training data [20]. Hochreiter and Schmidhuber [21] proposed the long short-term memory (LSTM) architecture as a solution to this challenge. LSTMs are specifically designed to avoid the long-term dependency problem. Remembering information for long periods is their default behavior. LSTMs have many advantages over conventional feed-forward neural networks and RNNs because of their ability to remember patterns for long durations.
LSTM is a type of recurrent neural network with a strong ability to learn and predict sequential data. Sequence prediction is a long-standing problem. With recent advancements in data science, LSTM has been observed to be the most successful approach for practically all sequence prediction problems [22]. The core idea behind LSTMs is the cell state and its gates. The cell state conveys the relevant information along the sequence chain. Sak et al. [16] introduced the first implementation of LSTM networks on a large-vocabulary Google voice search speech recognition task. They presented an LSTM RNN model architecture that makes more effective use of model parameters for training acoustic models for large-vocabulary tasks. They trained and compared LSTM, RNN, and DNN models using different numbers of parameters and configurations. The results of their experiment show that LSTM models converge quickly and perform best when applied to moderately small-sized models.
Geiger et al. [23] proposed an LSTM RNN in a hybrid acoustic modeling structure for robust speech recognition in an environment affected by noise and reverberation. The experiment was conducted using the database of the medium-vocabulary recognition track of the 2nd CHiME speech separation and recognition challenge. The authors compared state prediction networks with networks that predict phonemes using LSTM networks. The results showed that, with LSTMs, networks that predict states perform better than networks that predict phonemes.
Recently, there has been a remarkable improvement in RNN-HMM hybrid systems with deep bidirectional (DB) LSTM-based acoustic models for CD phonetic units, states for the LSTM output space and distributed training methods to perform large-scale modeling [24].

2) GATED RECURRENT UNITS (GRU)
Modified gated recurrent units (GRU), known as light-gated recurrent units (Li-GRU), were proposed in [25] for automatic speech recognition across various tasks, features, conditions, and paradigms. The experiment was conducted using TIMIT, DIRHA-English, CHiME, and TED-talk speech recognition corpus in various subsections. The proposed method outperformed the standard GRU in terms of recognition and computational performance and significantly reduced the per-epoch training time by 30% compared to the standard GRU.
Feng et al. [26] proposed a projected minimal gated recurrent unit (PmGRU), an improved version of the mGRUIP with context module (mGRUIP-Ctx), as a speech recognition acoustic model on five different ASR tasks. The proposed model showed a significant reduction in the word error rate (WER) compared to the WER of mGRUIP-Ctx.

3) END-TO-END SPEECH RECOGNITION
Graves et al. [7] showed that end-to-end training methods such as connectionist temporal classification (CTC) can be used to train RNNs for sequence-labelling tasks on the TIMIT corpus, where the input-output alignment is not known. They suggested that combining these methods with the LSTM RNN architecture is likely to yield state-of-the-art results.
Hannun et al. [27] used a 5-layer RNN with a bidirectional recurrent layer trained with CTC loss, together with a language model, to credibly fix the phonetic transcriptions. The results of this approach exceeded the best results on the Switchboard dataset.
Li [28] provided a detailed overview of E2E models and the feasible technologies that make E2E models outperform hybrid models in the industrial world.

4) DEEP BELIEF NETWORK
Mohamed et al. [29] conducted the first successful experiment using a hybrid DNN-hidden Markov model (HMM) with an acoustic model based on a deep belief network (DBN) on the TIMIT dataset. Their results outperformed those of previous studies using the same dataset. Over the years, other researchers have used restricted Boltzmann machines (RBMs) and DBNs to explore and demonstrate their use in speech recognition tasks [30]-[33]. Abdel-Hamid et al.'s [34] work using a CNN outperformed previously published results obtained with the hybrid NN-HMM model. Their experimental results showed a remarkable improvement in recognition performance using local filtering and max-pooling, achieving over a 10% relative error reduction on the core TIMIT test sets compared with conventional neural networks (NNs) with the same number of hidden layers and weights. Abdel-Hamid et al.'s work in [35] also investigated convolution over the time and frequency axes simultaneously.
Sainath et al. [36] investigated the most suitable approach for making CNNs a more capable model than DNNs for large-vocabulary continuous speech recognition (LVCSR) tasks. They also investigated the behavior of NN features extracted from CNNs on a variety of LVCSR tasks, which were compared with DNNs and GMMs. The results of their experiment showed 13-30% and 4-12% relative improvements over GMMs and DNNs, respectively, on the 400-hr broadcast news and 300-hr Switchboard tasks. In addition, an experimental investigation of CNN-based acoustic models for low-resource languages has shown that CNNs are better than DBNs in terms of robustness and generalization [37].

C. PROPOSED MODEL
A modified LSTM RNN model is proposed in this work to perform sequence prediction using deep supervised learning on the benchmark spoken English digit dataset. The effectiveness of the model is estimated with respect to training and validation accuracy, and the results are compared with those of other studies that used deep learning models for various speech recognition tasks. In addition, the classification performance of the model is evaluated using a confusion matrix to obtain the average scores for precision, recall, and f1-score. The choice of LSTM RNN is based on the fact that an LSTM consists of a standard RNN augmented with ''memory units'' that specialize in carrying long-term information, together with a set of ''gating'' units that allow the memory units to interact selectively with the normal RNN hidden state [19].
Several studies have been conducted using LSTM RNNs. Virtually all of the notable results based on RNNs have been achieved with LSTM. Thus, it has become the centre of deep learning in ASR systems [38]. LSTMs have been used extensively in speech recognition tasks because of their powerful learning ability [7], [16], [23], [25], [39], [40], [41], but this is the first time an LSTM RNN has been used on the spoken English digit speech recognition dataset.
The contributions of this paper can be summarized as follows: 1) This study reviews existing deep learning methods for sequential data and highlights the limitations of traditional LSTM in processing continuous input streams. 2) A recurrent neural network (RNN) is incorporated as a forget gate into the memory block to allow the cell states to be reset at the beginning of sub-sequences.

II. RELATED WORK
Graves et al. [7] showed that end-to-end training methods such as CTC can be used to train RNNs for sequence labelling tasks on the TIMIT corpus. Merging these methods with the LSTM RNN architecture will likely yield state-of-the-art results. In this study, the standard LSTM RNN training method was used to obtain a 99.36% accuracy for the sequence prediction speech recognition task. Sak et al. [16] introduced the first implementation of LSTM networks on the Google voice search speech recognition task. Their proposed model architecture improved the use of model parameters while training acoustic models. They trained and compared LSTM, RNN, and DNN models with various numbers of parameters and configurations. The results show that the LSTM model was the fastest to converge and performed best when applied to moderately small-sized models.
Geiger et al. [23] proposed an LSTM RNN in a hybrid acoustic modelling structure for robust speech recognition in an environment affected by noise and reverberation. The experiment was conducted using the database of the 2nd CHiME medium-vocabulary recognition track. The authors compared state prediction networks with networks that predict phonemes using LSTM networks. The results of their experiment showed that, with the use of LSTMs in a hybrid or double-stream system, the state prediction network is superior to the network predicting phonemes.
He and Droppo [40] proposed a generalized LSTM known as the (G)LSTM-DNN. The strength of the proposed model was first analyzed using a normal 80-hour LVCSR task, AMI, and then applied to the 2000-hour Switchboard data set. The results of their experiment showed that the proposed (G)LSTM-DNN performs better with more layers and achieved a relative word error rate reduction of 8.2% on the 2000-hour Switchboard data set. One issue discussed in their work is that the model's performance comes at the cost of a large number of parameters, and that it would be worthwhile to find a system that uses fewer parameters while maintaining its modeling power.
Tachioka and Ishii [39] proposed an LSTM RNN for bandwidth extension (BWE) on the TIMIT phoneme recognition task. The proposed LSTM RNN-based BWE was compared with the standard Gaussian mixture model (GMM)-based BWE. The results of the experiment showed that the LSTM RNN-based BWE was more powerful than the GMM-based BWE. In addition, they noted that for ASR purposes, it is better to predict MFCC features directly than to predict Mel-cepstrum features. The model used in this study uses MFCC features for its prediction.
The authors of [42] proposed an LSTM-RNN for deep sentence embedding. Here, the RNN accepts each word in a sentence sequentially and maps it, together with the contextual information, into a latent space in a recurrent manner. Furthermore, LSTM cells were incorporated into the RNN model (LSTM-RNN) to address the weakness of the RNN in learning long-term memory. Because labeled data were not available, user click-through data were used and the model was trained in a weakly supervised manner. In contrast, the proposed LSTM RNN used in this work for sequence prediction on the spoken English digit data was trained using a strongly supervised deep network, which helped to obtain optimal accuracy.
One of the RNN models, gated recurrent units (GRUs), was revised, and a simpler architecture was proposed in [25] for automatic speech recognition across various tasks, features, conditions, and paradigms. The experiment was conducted using the TIMIT, DIRHA-English, CHiME, and TED-talk speech recognition corpora in various subsections. The proposed method outperformed the standard GRU in terms of recognition and computational performance and significantly reduced the per-epoch training time by 30% compared to the standard GRU.
Wazir and Chuah [41] proposed an Arabic digit speech recognition model using an RNN with LSTM cells. Their model exhibited an overall accuracy of 94.00% for model training and 69.00% for model testing. In this work, when the standard LSTM was implemented for the spoken English digit speech recognition task, an overall accuracy of 99.36% was achieved for model training.

III. METHODS AND TECHNIQUES
A. THE STANDARD LSTM ARCHITECTURE
The main structure of LSTM consists of unique segments known as ''memory blocks'' in the hidden layer. The first type of LSTM block consists of cells and the input and output gates. The standard structure of LSTM has a limitation, which was addressed for the first time in [43] through the establishment of a ''forget gate'' that empowers the LSTM to adjust its state. The ''forget gate'' $f_t$ resets the cell variable, leading to the 'forgetting' of the stored input $C_t$, whereas the input and output gates manage the reading of inputs from the feature vector $x_t$ and the writing of output to $h_t$, respectively [21].
The gates regulate the action of the memory block, whereas the ''forget gate'' weighs the information inside the cells, such that whenever previous information becomes unimportant for some cells, it resets the state of those cells. ''Forget gates'' also enable continual prediction [44] by making cells forget their previous state, thereby restricting biases in prediction.
The computation within an LSTM block is as follows. Input values can only be preserved in the cell state if the input gate allows them. The input gate value, $i_t$, and the candidate value of the memory cell, $\tilde{C}_t$, at time step $t$ are calculated as follows:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
where $W$ and $b$ (with the corresponding gate subscript) represent the weight matrices and biases, respectively, $[h_{t-1}, x_t]$ is the concatenation of the previous output and the current input, and $\sigma$ is the logistic sigmoid function. The forget gate controls the weight of the state cell unit, and the value of the forget gate is computed as:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
By this process, the new state of the memory cell is updated as
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Given the new state of the memory cell, the value of the output gate is computed as
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
The final output value of the cell is then
$$h_t = o_t \odot \tanh(C_t)$$
With this structure in place, the network can store inputs for a long period, thus exploiting a learned amount of long-range temporal context [23]. Additionally, the more recent LSTM architecture accommodates ''peephole connections'' from its internal cells, which allow it to learn the precise timing of the outputs [45]. The standard LSTM structure is illustrated in Figure 1.
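The block computation described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard LSTM formulation without peephole connections; the function names, the parameter layout, and the sizes (20 input features, 8 hidden units) are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases for the four gates (illustrative)."""
    rng = np.random.default_rng(seed)
    shape = (n_hidden, n_hidden + n_in)
    return {gate: (0.1 * rng.standard_normal(shape), np.zeros(n_hidden))
            for gate in ("i", "c", "f", "o")}


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell; returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])      # the concatenation [h_{t-1}, x_t]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_f, b_f = params["f"]
    W_o, b_o = params["o"]
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell value
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # block output
    return h_t, c_t


# Run five random frames through the cell (20 features per frame is assumed).
params = init_params(n_in=20, n_hidden=8)
h, c = np.zeros(8), np.zeros(8)
frames = np.random.default_rng(1).standard_normal((5, 20))
for x_t in frames:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (8,) (8,)
```

When the forget gate output is close to zero, the previous cell state is multiplied away, which is the resetting behaviour the text attributes to the forget gate.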

B. THE PROPOSED LSTM ARCHITECTURE
The proposed model addresses the problem of processing continuous input streams that are not segmented into sub-sequences. This means that streams that are not already subdivided into smaller units can be processed by the network. To this end, the proposed model integrates an RNN as a ''forget gate'' into the memory block to permit cell states to be reset at the beginning of sub-sequences. The network's internal state must be reset to prevent the cell state from growing indefinitely, which may eventually cause the network to break down. The memory blocks use their memory cells to store the network's temporal state, and distinctive multiplicative units known as gates to control information flow. The proposed model architecture uses model parameters effectively by modifying the standard LSTM architecture. This modification increases the computational cost, because adding an RNN as a forget gate requires additional computational resources. Figure 2 shows the proposed LSTM RNN memory block.
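The need to reset the cell state can be illustrated with a scalar version of the cell-state recurrence. The sketch below is not the RNN-based gate proposed in this work: it assumes constant gate values, and the `write` argument stands in for the product of the input gate and the candidate value at each step. All names are illustrative.

```python
def run_cell_state(forget, write, steps, reset_every=None):
    """Iterate c_t = forget * c_{t-1} + write and return the trajectory.

    If reset_every is set, the state is zeroed at the start of each
    sub-sequence of that length, mimicking a reset at sub-sequence boundaries.
    """
    c = 0.0
    history = []
    for t in range(steps):
        if reset_every is not None and t % reset_every == 0:
            c = 0.0
        c = forget * c + write
        history.append(c)
    return history


# Without forgetting, the state grows linearly with the stream length.
no_forget = run_cell_state(forget=1.0, write=0.5, steps=100)
# A forget value below 1 keeps the state bounded (limit 0.5 / (1 - 0.9) = 5).
with_forget = run_cell_state(forget=0.9, write=0.5, steps=100)
# Resetting at sub-sequence boundaries also bounds the state.
with_reset = run_cell_state(forget=1.0, write=0.5, steps=100, reset_every=10)

print(no_forget[-1], round(with_forget[-1], 4), max(with_reset))
```

On an unsegmented stream, the first trajectory keeps growing with the number of steps, whereas the other two stay bounded.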
Supervised learning is a learning technique that uses labelled data. In a supervised deep learning setting, the environment provides a set of inputs with corresponding outputs $(x_t, y_t) \sim p$. For instance, if for an input $x_t$ the agent predicts $\hat{y}_t = f(x_t)$, the agent obtains a loss value $l(y_t, \hat{y}_t)$. The agent then repeatedly adjusts the network parameters to obtain a better approximation of the desired output, which is the deep supervised approach used in this study [47].
Algorithm 1 presents the algorithm of the proposed model.

IV. EXPERIMENTS
A. DATASET
The dataset is a well-established publicly available dataset under Pannous, a collaboration working on improving speech recognition [48], from the librosa library [49]. Speech data were downloaded using an MFCC batch generator. The files consist of batches of wav files together with their labels. The audio dataset was pre-processed using librosa, a Python library dedicated to analyzing sounds. The dataset used in this study consists of isolated spoken digits. It is a tar file containing recordings of 15 speakers (male and female). Each speaker utters each digit 16 times, leading to 15*16 = 240 instances for each digit. The phrases were the English numbers 0-9. This gives a total of 2400 different audio files in wav format for training the proposed system.
Algorithm 1 (closing steps, as recovered): Perform Optimal Estimation using Adam Optimization; 10: Output Recognized Speech; 11: end procedure
The dataset was split into training and validation datasets. Ten percent (10%) of the dataset was used for validation, and the remaining ninety percent (90%) was used for training. The training step output contained validation accuracy and loss as shown in Table 1 because the validation set was introduced as a part of the model fit function during training.
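The dataset size and the 90/10 split described above can be checked with a short sketch. The recordings are represented here by placeholder (digit, speaker, take) triples rather than real file names, and the shuffled split with a fixed seed is an assumption, since the text does not say how the validation samples were selected.

```python
import random

SPEAKERS = 15
UTTERANCES_PER_DIGIT = 16
DIGITS = list(range(10))

# One placeholder triple per recording; real file names are not assumed.
samples = [(digit, speaker, take)
           for digit in DIGITS
           for speaker in range(SPEAKERS)
           for take in range(UTTERANCES_PER_DIGIT)]


def train_val_split(items, val_fraction=0.1, seed=0):
    """Shuffle the items and hold out val_fraction of them for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


train, val = train_val_split(samples)
print(len(samples), len(train), len(val))  # 2400 2160 240
```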
The proposed LSTM RNN network structure comprises four network layers: an input layer, LSTM (dropout) layer, fully connected layer and regression layer. The model was trained using a deep-learning library known as TFLearn.

B. PROPOSED MODEL TRAINING
The learning rate and number of training iterations can affect the accuracy and training time of the proposed model. Therefore, both parameters were adjusted to different values for optimal performance. Given that the learning rate should be considered the most crucial hyperparameter, it might be necessary to understand how to adjust it properly to achieve a positive outcome [50]. The learning rate regulates the speed of the network weight updates. The initial learning rate of the model was set at $10^{-3}$.
Next is the number of training iterations, which was set to an initial value of 1000. The training iterations were multiplied by the epoch size to obtain the training steps. The training steps, with 10 epochs of batch size 64/64, ranged from 10000 to 20000. A higher accuracy was achieved when the number of training steps was increased.
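The stated range of training steps can be reproduced with a short calculation. The sketch assumes that the ''epoch size'' is the 10 epochs mentioned in the text and that the upper end of the range corresponds to 2000 iterations; the function name is illustrative.

```python
def training_steps(iterations, epochs=10):
    """Total training steps, read as iterations times epochs (an assumption)."""
    return iterations * epochs


for iterations in (1000, 2000):
    print(iterations, training_steps(iterations))
```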
To reduce the total LSTM loss on a set of training sequences, the Adam optimization algorithm was used to update the network weights, with the gradients obtained using the BPTT method [17], [51], [52]. The BPTT method, used for learning the weight matrices of an RNN, unrolls the network in time and propagates error signals backward through time. The major challenge with the BPTT method is the vanishing gradient problem. However, this difficulty is overcome to a great extent by using LSTM cells [53].
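For reference, a single Adam update can be sketched as follows. This is the commonly published form of the algorithm with its usual default constants, applied to a toy quadratic objective. It is not the training code used in this study, and the objective, step count, and learning rate are assumptions chosen only to make the example converge quickly.

```python
import numpy as np


def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam step; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = ||theta - target||^2 with gradient 2 * (theta - target).
target = np.array([1.0, -2.0])
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * (theta - target)
    theta, m, v = adam_update(theta, grad, m, v, t, lr=1e-2)
print(theta)
```

In an LSTM, the `grad` argument would be the gradient of the loss with respect to the weights obtained by BPTT, rather than the closed-form gradient used here.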
The networks were trained using the cross-entropy loss with the softmax activation function. With an initial learning rate of $10^{-3}$, the model trained quickly but started to overfit at some point, and the accuracy dropped when the model overfitted. When the learning rate was adjusted to $10^{-4}$, the model trained more slowly, with an increase in network accuracy.
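The loss described above can be sketched in NumPy as a softmax over the ten digit classes followed by the mean negative log-probability of the correct class. The function names and the example inputs are illustrative and do not come from the study.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))


# Uninformative logits over 10 digit classes give a loss of ln(10).
logits = np.zeros((4, 10))
labels = np.array([0, 3, 7, 9])
print(cross_entropy(logits, labels))  # about 2.3026
```

A network that assigns almost all probability to the correct digit drives this value toward zero.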
The proposed model was implemented on a multi-core central processing unit (CPU) on a single machine instead of a graphics processing unit (GPU). A CPU was chosen because CPU implementations are relatively simple to develop and easy to debug. They also allow for easy distributed implementation on a large cluster of machines [54].
The computational graphs of the model's output were visualized using TensorBoard, a visualization extension created by the TensorFlow team to reduce the complexity of understanding neural networks. Scalar statistics that vary over time, including the accuracy and loss, are visualized in Figures 3 and 4 for 2000 iterations at learning rates of $10^{-3}$ and $10^{-4}$, respectively.
Other deep learning models, namely ResNet-18, ResNet-34, DenseNet-121, DenseNet-169, and VGG-16, were also trained on the same dataset. The loss and accuracy curves from this training, and the bar chart comparing the performance of these deep learning models with that of the proposed model, are shown in Figures 5 and 6, respectively.

C. RESULTS AND DISCUSSIONS
The results of the model's training show that well-chosen hyperparameters, such as the learning rate, matter when managing a large set of hyperparameter-tuning experiments. Increasing the learning rate leads to faster network training, whereas reducing the learning rate leads to more accurate predictions. The learning rate therefore represents a trade-off between training time and accuracy. The best accuracy was obtained when the learning rate was reduced and the number of training steps was increased.
The training performance of the proposed model is consistent with the view that RNNs are at the centre of recent ASR systems. Specifically, LSTM RNNs have shown exciting results in numerous speech recognition tasks, owing to their capability to represent both long-term and short-term dependencies in sequences [55].
VOLUME 10, 2022
The model showed 99.36% accuracy and 100.00% validation accuracy with a minimum loss of 0.02656 for 2000 training iterations at a learning rate of $10^{-4}$, as shown in Table 1. This indicates that a lower learning rate led to higher accuracy. Table 1 presents the results of the learning rate tuning of the model.
The ResNet-18, ResNet-34, DenseNet-121, DenseNet-169, and VGG-16 deep learning models were run on the same dataset, and the results were compared with the performance of the proposed model, as presented in Table 2. From Table 2, it can be seen that DenseNet-121 and DenseNet-169 showed high accuracies of 89.67% and 87.17%, respectively, but the LSTM-RNN showed the highest accuracy of 99.36% and outperformed the other deep learning models on the same dataset. A summary of the performance of the deep learning models in comparison with the proposed model is shown as a bar chart in Figure 6. LSTM exhibited the best performance in terms of both accuracy and loss. Sequential models such as GRU, bidirectional LSTM, simple LSTM, and RNN were also tested on the same dataset, and their performance is compared in Table 3.
To further investigate the model results, a confusion matrix was used to evaluate classification performance. The average scores for precision, recall, and f1-score were derived from the model's classification report and served as the performance metrics for the evaluation of the proposed model, as shown in Table 4.
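The averaged scores mentioned above can be derived from a confusion matrix as in the following sketch. It assumes unweighted (macro) averaging over classes and uses a made-up two-class matrix purely for illustration; neither the averaging mode nor the numbers are taken from Table 4.

```python
import numpy as np


def macro_scores(confusion):
    """Return macro-averaged (precision, recall, f1).

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    predicted = confusion.sum(axis=0)   # column sums: predicted per class
    actual = confusion.sum(axis=1)      # row sums: true samples per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean()


# Toy matrix: 10 samples per class, 3 misclassified in total.
toy = [[8, 2],
       [1, 9]]
precision, recall, f1 = macro_scores(toy)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

For the ten-digit task, the same function would be applied to a 10 x 10 matrix built from the validation predictions.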

V. CONCLUSION
In this study, an LSTM-RNN model has been proposed that incorporates an RNN into the LSTM network to overcome the challenges of the traditional LSTM in processing a continuous input stream. The proposed system utilizes an RNN as a forget gate in the network, which allows the cell states to be reset at the beginning of sub-sequences and consequently enables the model to make effective use of network parameters. This addresses the computational efficiency problems of large networks for large-vocabulary speech recognition. The proposed model was evaluated using a well-established dataset. Some CNN-based and sequential models were also trained on the same dataset, and their performance was compared with that of the proposed model. The proposed LSTM-RNN outperformed the other deep learning models with an accuracy of 99.36% on the well-established public benchmark spoken English digit dataset.