A Deep Diacritics-Based Recognition Model for Arabic Speech: Quranic Verses as Case Study

Arabic is spoken by more than 422 million people worldwide. Although classic Arabic is the language of the Quran, which 1.9 billion Muslims are required to recite, Arabic speech recognition remains limited. In classic Arabic, diacritics affect the pronunciation of a word, and a change in a diacritic can change the word's meaning. However, most Arabic speech recognition models discard diacritics. This work aims to recognize classic Arabic speech while considering diacritics by converting audio signals to diacritized text using Deep Neural Network (DNN)-based models. DNN-based models recognize speech end-to-end, avoiding the phonetic dependency of traditional speech recognition systems. Three models were developed to recognize Arabic speech: (i) Time Delay Neural Network-Connectionist Temporal Classification (CTC), (ii) Recurrent Neural Network (RNN)-CTC, and (iii) transformer. A 100-hour dataset of Quran recordings was used. Based on the results, the RNN-CTC model obtained state-of-the-art results with the lowest word error rate of 19.43% and a 3.51% character error rate. The RNN-CTC model recognizes speech character by character, which is more reliable than the transformer's whole-sentence recognition behaviour. The model performed well with clear, unstressed recordings of short sentences. Moreover, the RNN-CTC model effectively recognized out-of-the-dataset sounds. The findings recommend continuing the efforts to enhance diacritics-based Arabic speech recognition models using clear and unstressed recordings to obtain better performance. Moreover, pretraining large speech models could yield accurate recognition. The outcomes can be used to enhance existing classic Arabic speech recognition solutions by supporting diacritics recognition.


I. INTRODUCTION
Although more than 422 million people speak Arabic [1], the Arabic speech recognition field still needs improvement [2], [3]. Arabic is a complicated language due to its richness. The Arabic language can be classified into three classes, as illustrated in Fig. 1: (i) classic Arabic, used in the Quran, Hadith, and old Arabic poetry, (ii) modern Arabic, a modified classic Arabic used in news, formal communications, and modern books, and (iii) dialectal Arabic, an altered modern Arabic with regional speaking additions [4]. Moreover, diacritics, i.e., vocal symbols associated with the letters that affect pronunciation, are used more heavily in classic Arabic than in modern or dialectal Arabic. Diacritics are important in Arabic speech, as mispronouncing a character, i.e., a letter or diacritic, can change the word's meaning. Even though some Arabic words contain the same letters, a difference in diacritics can change their meaning. For instance, the terms ''جَنَّة'', ''جُنَّة'', and ''جِنَّة'' mean ''Heaven'', ''Protector'', and ''Jinns'', respectively.
Researchers have made some efforts to recognize Arabic speech, especially modern [5], [6], [7], [8] and dialectal [9], [10], [11], [12] Arabic. Researchers have also developed an Arabic poetry meter recognition model [13]. In contrast, only a few efforts have targeted classic Arabic, mostly recognizing letters [14], [15], digits [7], [16], [17], and one-word commands or isolated words [16], [18], [19]. Although classic Arabic is mostly used in education and Quran recitation, there is a lack of continuous classic Arabic speech recognition models. Additionally, diacritics are highly associated with classic Arabic. Therefore, classic Arabic speech recognition models and systems should be able to recognize diacritics in speech. Diacritics negatively affected Arabic speech recognition performance, as reported in [20]. Nevertheless, recognizing diacritics is still important and requires extensive effort to train and develop accurate classic Arabic speech recognition models. A speech recognition model converts audio signals to text using either a traditional approach or an end-to-end Deep Neural Network (DNN)-based approach [21]. Traditional speech recognition depends on phonetics and pronunciation dictionaries to convert speech to text [22], [23]. It consists of three parts: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model. The Hidden Markov Model (HMM) is the most used acoustic model in the traditional approach. However, the traditional approach has limitations, such as its phonetic dependency. Therefore, the end-to-end DNN-based speech recognition approach was proposed [24]. End-to-end speech recognition models can recognize speech using a DNN without the need for a predefined pronunciation dictionary. An end-to-end system consists of an encoder, a decoder, and an alignment method, such as Connectionist Temporal Classification (CTC). Few classic Arabic speech recognition models have been developed using traditional [20], [25] and end-to-end [3], [14], [17], [19], [26] speech recognition approaches.
Additionally, as there is no standard classic Arabic pronunciation dictionary, using an end-to-end speech recognition approach is preferable. Thus, this effort aims to recognize classic Arabic speech using DNNs following the end-to-end approach to convert audio signals to diacritized text. A dataset of more than 100 hours of classic Arabic speech with its diacritized transcripts was used to train the proposed models. Three DNN-based models were implemented and compared: (i) Time Delay Neural Network (TDNN)-CTC-based, (ii) Recurrent Neural Network (RNN)-CTC-based, and (iii) transformer-based.
This work converts classic Arabic speech input to diacritized text using the end-to-end speech recognition approach. The main contribution of this work is recognizing diacritized continuous classic Arabic speech using three DNN models that, to the best of our knowledge, have not been used with a large classic Arabic dataset. This effort also investigates the performance of the best-performed model, which reaches that of state-of-the-art models fine-tuned for recognizing classic Arabic speech.
The rest of the paper is structured as follows: Section II discusses the related work on classic Arabic speech recognition, Section III illustrates the work's methodology, Section IV presents the experiments' details, Sections V and VI discuss the results and the findings, and Section VII concludes the study.

II. RELATED WORK
Humans verbally communicate through sounds. Recognizing sounds and speech is the key to understanding what others say [21]. Speech recognition technology therefore uses computational power to recognize human speech by converting it to a machine-readable format that can be turned into text to perform specific actions. Speech recognition is applied in different fields and applications. Traditional speech recognition consists of three independent components: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model [21]. Statistical models, such as the HMM and the Gaussian Mixture Model (GMM), are used as acoustic models in traditional speech recognition systems. However, traditional speech recognition has several limitations, e.g., the requirement of a predefined pronunciation dictionary. Therefore, DNNs have been developed to recognize audio signals directly into text without the need for a predefined pronunciation lexicon, forming end-to-end speech recognition systems.
Moreover, speech recognition systems can recognize (i) letters, (ii) isolated words, such as digits, commands, or single words, or (iii) continuous speech. Traditional and end-to-end speech recognition methods have been used with the classic Arabic language. The following subsections discuss the few classic Arabic speech recognition-related efforts, gaps, and emerging speech recognition models.

A. TRADITIONAL CLASSIC ARABIC SPEECH RECOGNITION
Regarding the importance of Arabic diacritics in classic Arabic, the effect of diacritics on recognizing classic Arabic speech was studied in [20]. Eight traditional speech recognition models, namely (i) GMM-SI, (ii) GMM SAT, (iii) GMM MPE, (iv) GMM MMI, (v) SGMM, (vi) SGMM-bMMI, (vii) DNN, and (viii) DNN-MPE, were trained with a 23-hour continuous speech dataset containing 4,754 sentences. The authors used two versions of the same dataset: a diacritized version (supporting six diacritics only) and a non-diacritized version. The DNN-MPE model reported the lowest Word Error Rates (WERs) of 4.68% (without diacritics) and 5.53% (with diacritics). Even though diacritics increase the WER by about 1%, recognizing diacritics in classic Arabic speech is still important and should be further improved.
On the other hand, other researchers used parts of the traditional speech recognition pipeline to convert graphemes to phonemes of diacritized classic Arabic words [25]. The joint multigram model was used to predict the phonemes of classic Arabic words and recorded a 42.5% WER. Although dealing with Arabic diacritics is challenging, interested researchers continued their efforts using more advanced, end-to-end methods, which are explained in the following section.

B. END-TO-END CLASSIC ARABIC SPEECH RECOGNITION
An end-to-end diacritized classic Arabic speech recognition system was discussed in [26]. The authors compared the performance of three different speech recognition models on a single-speaker Arabic corpus. The corpus consisted of audio recordings of 51 thousand words and their diacritized texts. The authors built and trained one traditional speech recognition model and two end-to-end models (a CTC-based model and a Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM)-attention-based model). As a result, the CNN-LSTM-attention-based model outperformed the traditional and CTC-based models, achieving a 28.48% WER compared to 33.72% and 31.10% WERs, respectively. Thus, Arabic speech recognition applications valuing diacritics should consider end-to-end models.
Given the limited research attention to classic Arabic, especially in speech recognition, Arabic alphabet learning models were designed in [14] to correct mispronunciation. The work in [14] was divided into two parts: (i) alphabet recognition and (ii) pronunciation quality classification. After training two CNN-based and one RNN-based DNNs, namely (i) DCNN, (ii) AlexNet, and (iii) Bidirectional LSTM (BLSTM), AlexNet outperformed the other models in both parts with 98.41% alphabet recognition accuracy and 99.14% pronunciation classification accuracy.
Furthermore, non-diacritized Arabic digits and single-word commands were collected to train SVM, LSTM, and KNN speech recognition models [19]. The LSTM reported the best recognition accuracy of 98.12%, while the KNN was the fastest to train. Moreover, an LSTM-based speech recognition model was also trained to recognize Arabic digits and recorded a 69% testing accuracy [17].

C. CLASSIC ARABIC SPEECH RECOGNITION RESEARCH
Despite the worldwide spread of Arabic speakers, classic Arabic speech recognition efforts remain limited [14]. Isolated classic Arabic words have received the most attention from the research community. Thus, continuous classic Arabic speech recognition remains underdeveloped and lacks both datasets and dedicated efforts. Moreover, Arabic speech recognition in general still requires researchers' attention compared with other languages such as English [27]. However, speech recognition models applied to other languages could be suitable for Arabic with slight tuning. Besides, as there is no standard Arabic pronunciation dictionary, an end-to-end speech recognition approach is preferred for Arabic speech. Furthermore, most Arabic speech recognition models discard diacritics, even though diacritics affect the pronunciation of a word and a change in a diacritic can change the word's meaning.

D. EMERGING SPEECH RECOGNITION DEVELOPMENTS
The rapid development of transformer-based speech recognition models, their performance in streaming environments, and their fast training have directed researchers to use transformers in end-to-end speech recognition [28]. Transformer models are sequence-to-sequence models that use self-attention mechanisms instead of RNNs [29]. The authors in [29] aimed to transcribe audio recordings of dialectal Arabic using three different methods: (i) transformer-based, (ii) HMM-DNN-based, and (iii) manual transcription. The manual transcription was performed by native, expert linguist participants given a set of audio recordings. The findings in [29] stated that the end-to-end transformer-based model supplemented with CTC outperformed the HMM-DNN-based and manual transcription methods. In addition, using the transformer model without a language model recorded a lower WER, hence achieving better recognition results.

III. METHODOLOGY
This section discusses the methodology followed and the materials used to conduct this study. The dataset, data processing, speech recognition models, and evaluation techniques used to recognize classic Arabic speech are presented in the following sections.

A. DATASET
A total of 72,735 classic Arabic, i.e., Quran, audio recordings of more than 100 hours were collected from the ''EveryAyah'' [30] dataset. Each audio file is stored in the waveform file format (.wav) and contains a Quran recitation by an expert reciter. The collected audio files were recorded in both optimal and noisy environments. Thus, some cleaning steps were performed, e.g., manually eliminating unclear audio recordings. Any recording that was not audible to the human ear was removed from the dataset. Moreover, each audio file was mapped to its transcript, i.e., textual form, in a CSV file.
Data splitting was performed to avoid overfitting. Overfitting is an issue that occurs when a model recognizes the training data well but poorly recognizes new, unseen data [31]. Thus, overfitting affects the model's reproducibility. Therefore, the dataset was split into three subsets, (i) training, (ii) validation, and (iii) testing sets, as presented in Tab. 1.

B. DATA PROCESSING
The dataset needed some preprocessing before being used to train and test the developed models, such as transforming the data into a machine-readable format. Two data types are input to our classic Arabic speech recognition models: (i) audio data and (ii) textual data (transcripts). The audio signals were converted to Mel-spectrograms. Spectrograms digitally visualize the time, frequency, and amplitude of audio signals using the Short-Time Fourier Transform (STFT), which combines several Fast Fourier Transforms (FFTs) of overlapped audio segments over time [32]. The FFT is a Fourier Transform algorithm that converts the time domain of a non-periodic segmented signal into the frequency domain. Non-periodic signals represent real-life non-stationary signals, such as audio signals.
Moreover, the Mel scale is applied to the spectrograms, generating Mel-spectrograms that mimic how the human hearing system detects different frequencies. Thus, audio signals were converted to Mel-spectrograms before being input into the speech recognition models. Fig. 2 illustrates a sample audio signal converted to a Mel-spectrogram.
Furthermore, as we use character-based speech recognition models, each character in Arabic is treated as a class. In our case, the characters are the Arabic letters, diacritics, other symbols that affect letters' pronunciation, and the space character. Each character was vectorized with a specific index. Thus, the character sequence of any verse, i.e., transcript, was converted to a sequence of indices, as exemplified in Fig. 3.

C. SPEECH RECOGNITION MODELS
This section discusses the different speech recognition models used to recognize classic Arabic speech. End-to-end speech recognition models consist of an encoder, an alignment technique, and a decoder. We applied three character-based speech recognition models with different DNN architectures in the encoder to find the best-performed model with our data. TDNN-based, RNN-based, and transformer-based speech recognition models were trained with the diacritized classic Arabic speech. A greedy decoder was applied in each model. The greedy decoder decodes the aligned encoder output, i.e., a sequence of character indices with high probabilities, to text, i.e., a sequence of characters [33]. The greedy decoder was used to support outputs with different character combinations and to avoid the dependency on language vocabularies required by a beam search decoder. Fig. 4 overviews the steps followed to recognize classic Arabic speech. The following subsections discuss the encoder structures of the implemented speech recognition models.
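To make the decoding step concrete, the following is a minimal sketch of greedy (best-path) CTC decoding as described above; the tensor shape, the index-to-character map, and the blank index are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, idx_to_char: dict, blank: int = 0) -> str:
    """log_probs: (time, num_classes) encoder output after log-softmax."""
    best_path = log_probs.argmax(dim=-1)      # highest-probability class per frame
    chars, prev = [], blank
    for idx in best_path.tolist():
        # CTC collapse rule: merge repeated indices, then drop the blank symbol
        if idx != prev and idx != blank:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)
```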

1) TIME DELAY NEURAL NETWORK SPEECH RECOGNITION MODEL WITH CONNECTIONIST TEMPORAL CLASSIFICATION
TDNNs have been widely used as acoustic models in traditional and hybrid speech recognition systems [34]. With the development of end-to-end speech recognition models, combinations of CNNs and TDNNs have been proposed and reported state-of-the-art results [35]. Jasper (Just Another SPEech Recognizer) is a TDNN-CTC end-to-end speech recognition model developed by NVIDIA in 2019 [35]. Jasper uses the CTC loss and has a block architecture consisting of blocks and convolutional sub-blocks; each sub-block contains four layers. Blocks in Jasper are connected using residual connections. In neural networks with residual connections, data flow through different paths; thus, some layers might be skipped to reach the last layer. Residual connections differ from sequential connections, where data flow along a single path, constructing feedforward neural networks [36]. The residual connection in Jasper applies a 1 × 1 convolution followed by a batch normalization layer, whose output is added to the output of the batch normalization of the last sub-block. The summation is then passed to the activation function and dropout layers to produce the block's output.
Jasper achieved state-of-the-art results on English speech datasets. However, Jasper has high computational power and memory requirements due to its large number of parameters, i.e., over 200 million. Thus, a smaller speech recognition model called QuartzNet was proposed based on the Jasper architecture with fewer parameters and lower computational power requirements [37]. QuartzNet implements depthwise separable convolutions by replacing Jasper's 1D convolutions with 1D time-channel separable convolutions.
Depthwise separable convolutions deal with the spatial (height and width) and depth (channel) dimensions [37]. They speed up the network and reduce its complexity by splitting the kernel into two smaller kernels: (i) a depthwise convolution and (ii) a pointwise convolution. The depthwise convolution is applied individually to each channel across a number of time frames (time steps), while the pointwise convolution operates independently on each time frame across all channels. The components of the used QuartzNet are illustrated in Fig. 5. This work applied QuartzNet as a TDNN-CTC speech recognition model to recognize classic Arabic speech; to the best of our knowledge, QuartzNet has not previously been used with classic Arabic speech.
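The sketch below illustrates, under assumed channel sizes and kernel width, how a 1D time-channel separable convolution of the kind used in QuartzNet can be built in PyTorch: a depthwise convolution over time frames followed by a pointwise (1 × 1) convolution across channels.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        # Depthwise: one filter per channel, operating across time frames only
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: 1x1 convolution mixing information across all channels per frame
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Example with assumed sizes: 64 feature channels -> 256 output channels, kernel width 33
layer = TimeChannelSeparableConv1d(64, 256, kernel_size=33)
out = layer(torch.randn(8, 64, 400))   # -> (batch=8, channels=256, time=400)
```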
2) RECURRENT NEURAL NETWORK SPEECH RECOGNITION MODEL WITH CONNECTIONIST TEMPORAL CLASSIFICATION
RNNs have been used in end-to-end speech recognition models to transform audio spectrograms into text transcriptions, such as Deep Speech [38]. The Deep Speech model consists of 5 layers of hidden units; the first three layers and the last layer are non-recurrent, while the fourth layer is an RNN with forward and backward passes. Moreover, the Deep Speech model uses the CTC loss to align the encoder's output to character sequences. However, speech recognition models with a single recurrent layer in the encoder cannot deal with large and continuous speech datasets, which limits their capabilities [39]. Therefore, an updated version, Deep Speech 2, was proposed with multiple CNN and RNN layers. Deep Speech 2 with one CNN layer, 5 GRU layers, and one fully connected layer achieved the lowest WER compared to other proposed combinations of CNN and RNN layers. Therefore, in this work, we constructed our RNN-CTC speech recognition model based on the enhanced Deep Speech 2 architecture proposed in [40], which outperformed the recognition performance of the original Deep Speech 2.
The architecture of our RNN-CTC speech recognition model consists of 4 CNN layers (1 traditional CNN and 3 ResidualCNN), 5 Bidirectional Gated Recurrent Unit (BiGRU) layers, a fully connected layer, and a linear classification layer, as presented in Fig. 6. The audio features are extracted by the CNN layers, whereas the predictions for each frame, considering the previous frames, are performed in the BiGRU layers. The GRU is an RNN variant that uses fewer computational resources compared to the LSTM.
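The following is a compact sketch of this encoder layout; the layer counts follow the description above, while the hidden sizes, kernel shapes, and activation choices are illustrative assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """Simplified residual CNN block: convolution with a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.relu(self.conv(x))

class RNNCTCEncoder(nn.Module):
    def __init__(self, n_mels: int = 128, rnn_dim: int = 512, n_classes: int = 63):
        super().__init__()
        # 1 traditional CNN layer + 3 residual CNN layers extract features from the Mel-spectrogram
        self.cnn = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1),
                                 ResidualCNN(32), ResidualCNN(32), ResidualCNN(32))
        self.fc_in = nn.Linear(32 * n_mels, rnn_dim)
        # 5 bidirectional GRU layers predict each frame considering neighbouring frames
        self.bigru = nn.GRU(rnn_dim, rnn_dim, num_layers=5,
                            bidirectional=True, batch_first=True)
        # Fully connected layer followed by the linear classifier (62 characters + CTC blank)
        self.classifier = nn.Sequential(nn.Linear(2 * rnn_dim, rnn_dim), nn.ReLU(),
                                        nn.Linear(rnn_dim, n_classes))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                                  # (batch, 32, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)      # (batch, time, features)
        x, _ = self.bigru(self.fc_in(x))
        return self.classifier(x)                            # (batch, time, n_classes)
```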

3) TRANSFORMER SPEECH RECOGNITION MODEL
Transformers were first proposed to enhance machine translation using attention mechanisms [41]. Attention mechanisms are applied in sequence-to-sequence modelling to allow modelling dependencies regardless of the distances within the input or output sequences. Attention layers have been implemented in RNN models [42]. However, transformers, recurrence-free models based solely on attention layers to map input and output dependencies, were proposed and reported state-of-the-art results using fewer resources and less computation power compared to RNN-based models [41]. Recently, transformers have been applied in speech recognition and recorded competitive performance compared with other sequence-to-sequence models [43], [44]. The main advantages of transformers are (i) their fast learning ability with low memory usage compared with RNN models and (ii) their capability to capture long dependencies. However, transformers require large amounts of data to obtain good results. In speech recognition, CNN layers are added to the transformer architecture to extract the audio features [44]. The flattened audio feature vector and its d-dimensional positional encoding form the input to the transformer's encoder. In contrast, the character encoding, which converts the output character sequences to d-dimensional vectors, together with their positional encoding, forms the input to the transformer's decoder. Additionally, the encoder's output is input to the second multi-head attention sub-layer in the decoder. Fig. 7 illustrates the speech transformer architecture consisting of an encoder and a decoder with attention layers.
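As a rough illustration of this layout, the sketch below builds a speech transformer from PyTorch's built-in nn.Transformer; the CNN front-end, model dimensions, and layer counts are assumptions, and the positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    def __init__(self, n_mels: int = 128, d_model: int = 256, n_classes: int = 64):
        super().__init__()
        # CNN front-end extracts and subsamples audio features from the Mel-spectrogram
        self.subsample = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.audio_proj = nn.Linear(32 * (n_mels // 4), d_model)
        self.char_embed = nn.Embedding(n_classes, d_model)   # 62 characters + '<' and '>'
        # Positional encodings omitted here for brevity
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(d_model, n_classes)   # followed by softmax + cross-entropy loss

    def forward(self, spec: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time); tgt_tokens: (batch, target_length)
        x = self.subsample(spec)
        b, c, f, t = x.shape
        src = self.audio_proj(x.permute(0, 3, 1, 2).reshape(b, t, c * f))
        tgt = self.char_embed(tgt_tokens)
        # Mask future positions so the decoder predicts each character from previous ones only
        mask = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(dec)   # (batch, target_length, n_classes)
```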

D. EVALUATION METHOD
WER was measured to evaluate each speech recognition model's performance. WER is a metric derived from the Levenshtein distance to measure the accuracy of a speech recognition model [45]. WER is calculated based on the number of deleted (D), inserted (I), and substituted (S) words that appear in the recognized text, as shown in (1), where N is the number of words in the target (reference) text:

WER = (S + D + I) / N    (1)

WER measures the speech recognition model's performance in terms of word recognition, considering word deletions, insertions, and substitutions. The word accuracy for each speech recognition model was calculated by subtracting the WER value from 1. Moreover, the Character Error Rate (CER) was calculated for the best-performed speech recognition model, i.e., the model with the lowest WER, to measure character recognition considering character deletions, insertions, and substitutions. The character accuracy was also calculated for the best-performed speech recognition model by subtracting the CER value from 1. Additionally, a similarity score for each recognized verse was calculated by finding the longest contiguous matching sub-sequences between the recognized verse and the target verse. The similarity score ranges between 1 (identical) and 0 (dissimilar).
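A minimal sketch of the WER computation in (1) is shown below: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words; CER follows the same procedure over characters instead of words, and the accuracies are one minus the error rates.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# Character accuracy = 1 - CER, word accuracy = 1 - WER,
# where CER applies the same distance to character sequences.
```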

IV. EXPERIMENTS SETUP
This section discusses experimental details of implementing DNN-based models to recognize classic Arabic speech.

A. DATA PROCESSING
The Arabic speech dataset went through processing steps before being fed to the speech recognition models. The preprocessing steps were:

1) DATA SPLITTING
Data splitting is a technique used to split the data into two or three subsets to avoid overfitting the model on the data. The model is trained on one subset of the data and tested on another subset that it has not seen before. Our dataset was split into three sets: (i) training, (ii) validation, and (iii) testing sets. To ensure that the voices in the validation and testing sets differ from those in the training set, we randomly selected some reciters to appear only in the validation and testing sets, i.e., the training set did not contain any recording of those selected reciters. Based on this reciter separation, the splitting ratios were approximately 79%, 10%, and 11% for the training, validation, and testing sets, respectively.
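The sketch below illustrates such a reciter-based split; the CSV layout, column names, number of held-out reciters, and random seed are assumptions for illustration only.

```python
import random
import pandas as pd

df = pd.read_csv("dataset.csv")               # assumed columns: audio_path, transcript, reciter
reciters = sorted(df["reciter"].unique())
random.seed(42)
held_out = set(random.sample(reciters, k=2))  # reciters reserved for validation/testing only

train_df = df[~df["reciter"].isin(held_out)]
held_df = df[df["reciter"].isin(held_out)].sample(frac=1, random_state=42)
val_df, test_df = held_df[: len(held_df) // 2], held_df[len(held_df) // 2:]
print(len(train_df), len(val_df), len(test_df))
```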

2) TRANSFORMING AUDIO FILES TO MEL-SPECTROGRAMS
Each audio file was transformed from its waveform (time-series signal) to a Mel-spectrogram using the torchaudio.transforms library in Python. The torchaudio.transforms library helps in transforming raw audio files to other representations, such as Mel-spectrograms. Each raw .wav audio file was transformed to a Mel-spectrogram tensor with the default 128 Mel filterbanks. Mel filterbanks mimic the filterbanks in human ears. Then, normalization was applied to eliminate null values in the Mel-spectrogram tensors.
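A minimal sketch of this transformation with torchaudio is shown below; the file name and the specific normalization step are illustrative assumptions beyond the default 128 Mel filterbanks mentioned above.

```python
import torchaudio

waveform, sample_rate = torchaudio.load("recitation.wav")     # hypothetical file name
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=128)
mel_spec = mel_transform(waveform)            # (channels, n_mels, time)

# Simple normalization so the tensor contains no null/invalid values for later processing
mel_spec = (mel_spec - mel_spec.mean()) / (mel_spec.std() + 1e-9)
```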

3) CHARACTERS VECTORIZATION: MAPPING ARABIC CHARACTERS TO NUMERICAL INDICES
The Arabic characters, including letters, diacritics, symbols, and the space, were mapped to numerical indices, i.e., character embedding. After removing symbols that do not affect letters' pronunciation, such as mandatory and optional recitation stop symbols, a total of 62 characters remained. Thus, as our speech recognition models are character-based, those 62 characters represent the 62 classes our models deal with when recognizing speech. Fig. 8 illustrates the mapping of those 62 characters. Therefore, every textual sentence (sequence of characters) was converted to a vector (sequence of numerical indices).
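The sketch below illustrates the character-to-index mapping; the character inventory shown is a small illustrative subset of the 62 classes, and reserving index 0 for the CTC blank is an assumption.

```python
chars = [" ", "ا", "ب", "ت", "ث", "ج", "ح",           # letters (subset)
         "\u064E", "\u064F", "\u0650", "\u0651"]      # fatha, damma, kasra, shadda
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}  # index 0 reserved for the CTC blank
idx_to_char = {i: c for c, i in char_to_idx.items()}

def text_to_indices(verse: str) -> list[int]:
    """Convert a diacritized verse (sequence of characters) to a sequence of indices."""
    return [char_to_idx[c] for c in verse if c in char_to_idx]

def indices_to_text(indices: list[int]) -> str:
    """Convert a sequence of indices back to text."""
    return "".join(idx_to_char[i] for i in indices)
```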

B. SPEECH RECOGNITION MODELS
This section illustrates the experimental details of each speech recognition model trained and tested to recognize classic Arabic speech. This work implemented three speech recognition models: (i) TDNN-CTC, (ii) RNN-CTC, and (iii) transformer. Each model's details are explained in the following subsections.

1) TIME DELAY NEURAL NETWORK SPEECH RECOGNITION MODEL WITH CONNECTIONIST TEMPORAL CLASSIFICATION
Our TDNN model consisted of an encoder, a decoder, and the CTC loss. We used the default QuartzNet [37] encoder architecture provided by NeMo [46] with two fixed blocks (the first and last blocks) and 15 repeated blocks; each block contained five sub-blocks. The encoder in the QuartzNet model is based on the Jasper model [35] and consists of seven Jasper layers (six residual layers and one traditional layer). In contrast, the decoder is a linear classification layer that converts the encoder's output to probabilities over 63 classes (62 characters and one blank character for the CTC loss). The CTC loss was then calculated, and a greedy decoder mapped the highest-probability class to its character representation. Tab. 2 illustrates the TDNN model summary. Our TDNN model used the Novograd optimization method. Novograd is an adaptive layer-wise stochastic optimization method that normalizes gradients and decouples weight decay per layer [47].

2) RECURRENT NEURAL NETWORK SPEECH RECOGNITION MODEL WITH CONNECTIONIST TEMPORAL CLASSIFICATION
Our RNN model consisted of an encoder, a decoder, and the CTC loss. We used the improved version of Deep Speech 2 implemented in [40]. The encoder in the RNN model consists of 4 CNN layers (1 traditional CNN and 3 ResidualCNN), 5 BiGRU layers, and a fully connected layer. The decoder is a linear classification layer that converts the encoder's output to probabilities over 63 classes (62 characters and one blank character for the CTC loss). The CTC loss was then calculated, and a greedy decoder mapped the highest-probability class to its character representation. Tab. 3 illustrates the RNN model summary. Our RNN model used the AdamW optimization method. AdamW is a stochastic optimization method modified from Adam that decouples the weight decay [48].
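For illustration, the sketch below shows one training step of a CTC-based model optimized with AdamW; the placeholder network, tensor shapes, learning rate, and weight decay are assumptions rather than the experimental settings reported later in Tab. 5.

```python
import torch
import torch.nn as nn

n_classes, blank_idx = 63, 0                      # 62 characters + 1 CTC blank
model = nn.GRU(128, n_classes, batch_first=True)  # placeholder for the full RNN-CTC encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
ctc_loss = nn.CTCLoss(blank=blank_idx, zero_infinity=True)

specs = torch.randn(4, 200, 128)                  # (batch, time, n_mels) dummy batch
targets = torch.randint(1, n_classes, (4, 30))    # character indices of the target verses
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 30)

log_probs, _ = model(specs)
log_probs = log_probs.log_softmax(dim=-1).transpose(0, 1)   # CTCLoss expects (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```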

3) TRANSFORMER SPEECH RECOGNITION MODEL
Our transformer model consisted of an encoder and a decoder. We used the speech transformer architecture proposed in [44] with different numbers of encoder and decoder layers to explore the transformer's performance. Audio feature embedding and character embedding were applied to the dataset's audio and textual verses before they were fed to the encoder and decoder. The audio feature embedding method extracted audio features from the Mel-spectrograms, whereas the character embedding method mapped each character to a numerical representation. The transformer's encoder consisted of two sub-layers: (i) multi-head attention and (ii) feedforward network layers, while the transformer's decoder consisted of three sub-layers: (i) masked multi-head attention, (ii) multi-head attention, and (iii) feedforward network layers. The decoder's output is then fed to a linear classification layer with a softmax activation function that converts the transformer's decoder output to probabilities over 64 classes (62 characters and two special characters, < and >, marking the beginning and ending of each verse). A categorical cross-entropy loss was calculated instead of the CTC loss to map the highest-probability class to its character representation using a greedy decoder. Tab. 4 illustrates the transformer model summary. Our transformer models used the Adam optimization method. Adam is a stochastic optimization method based on adaptive estimation of the first- and second-order moments [49]. Tab. 5 presents the experimental details of each speech recognition model trained and validated with the same training and validation set sizes specified in Tab. 1. Due to the limited storage and computing resources, different parameter settings were applied to each model. For instance, the batch size differs from one model to another based on the available memory. Moreover, the learning rates were chosen after being tuned for each model in several experiments. The TDNN-CTC model took the longest duration, with approximately two days of training, as illustrated in Tab. 6.
On the other hand, we explored different numbers of transformer encoder and decoder layers. Based on [44], the transformer with six encoder and six decoder layers reported the best results. However, we also trained a transformer with four encoder and four decoder layers and a transformer with four encoder layers and one decoder layer, inspired by [50], to explore the effect of the number of layers.
Furthermore, the early stopping technique was applied to the RNN-CTC and transformer models. Early stopping is an optimization technique that ends model training early when the model's performance stops improving on the validation data [51]. Without early stopping, the model would overfit the training data. Fig. 9 illustrates the points where early stopping should be applied to the best-performed speech recognition model, as the model seemed to overfit the training data. Early stopping should be applied when the model's learning rate and training loss decrease after a peak while the validation loss increases. The red circles in Fig. 9 identify the points where early stopping should be performed to avoid overfitting.
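A minimal sketch of such an early-stopping loop is shown below; the patience value and the function names are hypothetical.

```python
import math

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs: int = 100,
                              patience: int = 5):
    best_val_loss, epochs_without_improvement = math.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        val_loss = evaluate()                  # loss on the held-out validation set
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
            # a checkpoint of the best model would typically be saved here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}: validation loss stopped improving")
                break
```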

V. RESULTS
This section discusses the experiments' outcomes regarding recognizing diacritized classic Arabic speech using the three speech recognition models and analyzes the best-performed model's recognition results. Besides, the effectiveness of the best-performed classic Arabic speech recognition model is explored using a sample of out-of-the-dataset audio recordings.
Three DNN-based speech recognition models were trained and tested to recognize diacritized classic Arabic speech: (i) TDNN-CTC-based, (ii) RNN-CTC-based, and (iii) transformer-based recognizers. Tab. 7 illustrates the testing WER for each model, indicating the best-performed model with the lowest WER. Moreover, Fig. 10 presents a sample of the models' recognized verses. Correctly recognized characters are in green, while misrecognized characters are identified in red. From Fig. 10, it is noticeable that the TDNN-CTC and RNN-CTC models are character-based recognition models, i.e., they recognize character by character, thus helping in identifying character-based mistakes. However, the transformer-based recognition model behaved as a sentence-based recognition model, i.e., the model either recognizes the sentence and outputs the correct sentence or outputs another unrelated sentence. Moreover, an early stopping technique was applied to the transformer models to avoid overfitting, as we noticed that the transformer-based models overfit a specific sentence after a number of training epochs. The behaviour of the transformer-based models was directed toward recognizing the whole sentence instead of recognizing character by character, which is the aim of this work. Therefore, based on the WER results, the RNN-CTC speech recognition model is the best-performed model as it reported the lowest WER.
Based on Tab. 7, the RNN-CTC speech recognition model is considered the best-performed model as it reported the highest recognition performance with the lowest WER compared to the other two recognition models. Furthermore, Tab. 8 shows the RNN-CTC model's recognition performance on the validation and testing sets. The RNN-CTC model recorded a low CER, indicating a 96.49% character accuracy on the testing set. However, around 44.22% of the recordings in the testing set were misrecognized, with different similarity scores to their target sentences, as shown in Tab. 9. The similarity scores in Tab. 9 represent the character similarity between the recognized and target verses. Similarity scores close to 1 indicate high similarity and few mistakes, while similarity scores below 0.5 indicate low similarity and major mistakes. Fig. 11 presents a sample of generated sentences with different similarity scores.
Overall, 72.58% of the dataset's audio recordings were correctly recognized, while the remaining 27.42% were misrecognized. Analyzing the misrecognized speech helps in detecting the factors affecting the recognition performance. Verses that were misrecognized more than 100 times are considered the model's most misrecognized recordings because the dataset contained 129 different recordings of each verse. Tab. 10 illustrates the top 5 least and most misrecognized verses. Out of 129 recordings of each verse, the most misrecognized verse was correctly recognized only two times, while the most recognized verse was correctly recognized 122 times. Moreover, a verse was misrecognized an average of 35 times, i.e., 35 is the mean number of times the model misrecognized a verse.
In addition, the most recognized speaker was Ibrahim Alakhdar, with only 190 of his 1,692 recordings misrecognized, as shown in Fig. 12. In contrast, the least recognized speaker was Abdulbasit Abdulsamad, with 1,172 misrecognized recordings out of 1,692. Furthermore, to test the model's effectiveness with out-of-the-dataset voices, a sample of Quran recitation recordings of six verses was collected from out-of-the-dataset voices of different genders and age groups. Recordings of the top two most and top two least misrecognized sentences were collected from online-available Quran recordings. In addition, recordings of two sentences that had been misrecognized 35 times, i.e., the mean value, were collected from ordinary people. Online-available recordings of these six sentences by a man, a woman, a boy, and a girl were collected, along with the author's recordings, to test the model's effectiveness. Tab. 11 presents the RNN-CTC model's performance in recognizing out-of-the-dataset voices. A sample of the sentences recognized from ordinary speakers is illustrated in Fig. 13, with the model-generated mistakes highlighted in yellow. From Fig. 14, we can infer that the model's performance was not affected by the gender or age group of the reciter.

VI. DISCUSSION
According to the results, the main findings are:

A. SPEECH RECOGNITION MODELS CAN CONVERT AUDIO SPEECH TO DIACRITIZED TEXT AFTER TRAINING THEM WITH DIACRITIZED TARGET TEXT LABELS
Both the TDNN-CTC and RNN-CTC models converted audio speech to diacritized text, as illustrated earlier in Fig. 10. The TDNN-CTC and RNN-CTC models are character-based recognition models, i.e., they recognize character by character, thus helping in identifying character-based mistakes. The characters in this case include letters and diacritics. On the other hand, the transformer-based models failed to recognize character by character and recognized the whole sentence instead.

B. THE RNN-BASED SPEECH RECOGNITION MODEL OUTPERFORMED THE TRANSFORMER-BASED AND TDNN-BASED MODELS WHEN TRAINED WITH DIACRITIZED CLASSIC ARABIC SPEECH
Based on the results, our RNN-CTC speech recognition model outperformed the TDNN-CTC and transformer recognition models with the lowest WER of 19.43%. The RNN-CTC speech recognition model recognized classic Arabic speech and converted it to diacritized text with word and character testing accuracies of 80.57% and 96.49%, respectively. Moreover, the RNN-CTC model was the largest in terms of trainable parameters (23.7 M), which might support its performance.
Moreover, the performance of our RNN-CTC model is very close to that of state-of-the-art models fine-tuned on Arabic datasets. Whisper [52], the large transformer-based speech recognition model, reported a 34.28% WER when trained with an Arabic dataset [53] and reached a 13.4% WER and 2.8% CER when trained with a 10-hour diacritics-based single-speaker classic Arabic dataset [3]. On the other hand, wav2vec reported a 16% WER and 3% CER on a diacritics-based single-speaker Arabic dataset.
Therefore, considering that our RNN-CTC model was trained on multi-speaker data, reaching a 19.43% WER and 3.5% CER is competitive with the state-of-the-art models' performance.

C. THE SOUND AND AUDIO RECORDING QUALITY HIGHLY AFFECT THE RECOGNITION MODEL PERFORMANCE AS STRESSED SOUNDS ARE LIKELY TO BE MISRECOGNIZED
After analyzing the verses misrecognized by the RNN-CTC speech recognition model, we found that the model tends to misrecognize lengthy speech more than short speech. Additionally, the model's performance highly depends on the stress during speaking and the quality of the audio recordings, as the model recognized Ibrahim Alakhdar's unstressed recordings better than the stressed recordings of Abdulbasit Abdulsamad. Stressed recordings are recordings with longer and louder sounds. Moreover, our model was not affected by the age or gender of the reciter.

VII. CONCLUSION
Despite the large population of Arabic speakers, Arabic speech recognition efforts are still underdeveloped. Specifically, continuous classic Arabic speech recognition has received the least attention. Moreover, the majority of Arabic speech recognition works are based on the traditional speech recognition structure, relying on a pronunciation dictionary that links each word with its phonetic representation. Creating and building pronunciation dictionaries for any language or purpose requires huge efforts from different experts, e.g., linguists and phoneticians. However, with technology's growth, DNN-based end-to-end speech recognition structures have been developed to overcome the limitations of traditional speech recognition systems. Despite the rapid development of end-to-end speech recognition solutions, limited classic Arabic speech recognition solutions have been developed.
Furthermore, most Arabic speech recognition models discard diacritics, even though diacritics affect the pronunciation of a word and a change in a diacritic can change the word's meaning. Therefore, this work contributed to recognizing diacritized classic Arabic speech using DNN-based speech recognition models. This work went through two phases: (i) data processing and (ii) classic Arabic speech recognition. Three DNN-based models were trained and tested with classic Arabic recordings to convert them into diacritized text in the speech recognition phase. After comparing the performance of the three DNNs, (i) TDNN-CTC, (ii) RNN-CTC, and (iii) transformer speech recognition models, the RNN-CTC model obtained the best results with the lowest WER of 19.43%. Based on the results, the diacritics-based Arabic speech recognition model performs well with clear, unstressed recordings of short sentences. The longer the spoken sentence, the more mistakes the model could generate. Moreover, the best-performed recognition model effectively recognized out-of-the-dataset sounds.
The work's outcomes contribute to classic Arabic speech recognition efforts and solutions, given the lack of DNN-based continuous classic Arabic speech recognition developments. This work's outcomes can also enhance existing smart classic Arabic speech recognition solutions by recognizing diacritics. We believe that this effort will open opportunities for recognizing classic Arabic speech. Moreover, the trained models could be retrained, i.e., using transfer learning, to build other Arabic speech recognition solutions in different fields, such as education. The main contribution of this work was training and comparing the performance of three DNN models with diacritized classic Arabic speech. We highly encourage interested researchers to contribute to developing smart Arabic solutions. We also plan, in the near future, to continue our developments by further improving the recognition performance.