Deep Investigation of the Recent Advances in Dialectal Arabic Speech Recognition

Speech recognition systems play an important role in human–machine interaction. Many systems exist for Arabic speech; however, systems for dialectal Arabic speech remain limited. The Arabic language has many properties, some of which, such as its syntax and phonology, are well suited to building automatic speech recognition systems, while others complicate speech system development. Importantly, most data are non-diacritized, vary in dialect, and exhibit morphological complexity. Moreover, the Arabic dialects lack a standard structure. In this paper, we present an overview and summary of dialectal Arabic speech recognition systems built with different approaches and techniques. The main goal of this paper is to compare and discuss studies of dialectal Arabic speech systems across several criteria, including techniques, datasets, evaluation metrics, and dialect types. The study also describes the techniques used in each step of a dialectal Arabic speech system and introduces the associated challenges and problems. Overall, more studies are required to obtain more accurate speech systems for dialectal Arabic.


I. INTRODUCTION
Automatic speech recognition (ASR) is one of the earliest tasks in artificial intelligence (AI) research; it converts speech waveforms or signals into a corresponding sequence of words (units) using an appropriate algorithm [1]. ASR has a wide range of IT applications, including solutions for civil and industrial domains, human-computer interaction (HCI), voice applications, automatic language translation, and many via-voice systems [2]. Unlike other languages, Arabic has seen limited research on speech recognition systems. Building Arabic speech recognition systems is difficult for many reasons: the data sparseness of the language, its lexical variety, the large number of dialects spoken across the Arab world, and the predominance of non-diacritized text material. Moreover, the Arabic language is morphologically complex, and its vocabulary is very rich [3]. Thus, large-vocabulary ASR for Arabic presents several challenges for speech research. Over the past decade, researchers have shown great interest in building robust Arabic automatic speech recognition (AASR) systems [4,5]. The Arabic language has a set of special symbols (marks), called diacritics (Arabic harakat), that are placed above the main symbols (letters) [6]. These diacritics represent sounds similar to vowels in English and tones in Chinese. These sounds are important for understanding the meaning of words and sentences and represent another challenge for Arabic ASR.
In addition, building acoustic models for dialectal Arabic ASR is challenging, since training the model requires data in the appropriate Arabic dialect. Developing dialectal Arabic ASR therefore faces several challenges. The first is the lack of sufficient training data: a large dataset must be collected to obtain a good and accurate model. Unfortunately, collecting dialect training data is difficult compared to modern standard Arabic (MSA) and other languages; the problems mainly concern building accurate transcriptions for dialectal speech. The variety of dialects is another challenge, since dialectal Arabic has many different forms (Egyptian, Levantine, Iraqi, Gulf, etc.), and sometimes each village has its own dialect form. In addition, building a pronunciation dictionary that covers all dialectal words is immensely challenging, because dialectal Arabic lacks a standard orthography and is mostly a spoken language. Diacritization for dialectal Arabic is far more challenging than for MSA, since it requires a dialectal Arabic morphological analyzer to generate the various diacritization forms. Using context-based forms, diacritization also requires a robust language model for dialectal Arabic, which is currently unavailable. Moreover, dialectal Arabic diacritization, using automatic alignment against the audio signal, is also difficult due to the larger set of vowels [7]. As a result, dialectal Arabic text usually omits diacritics, which leads to inaccurate and less predictive language models. Furthermore, the high degree of morphological complexity leads, during decoding, to high out-of-vocabulary rates and larger search spaces [7,8].
This paper presents an analysis and discussion of dialectal Arabic ASR systems, along with some of the current challenges and difficulties facing system developers. We also introduce the techniques used in these systems and investigate the approaches applied and the open-source datasets available for several Arabic dialects.
The rest of the paper is organized as follows. Section 2 introduces the research methodology. In Section 3, we present a literature review for dialectal Arabic ASR. In Section 4, the main steps for ASR are described. Section 5 presents the discussion and challenges. Finally, Section 6 includes the conclusion and suggestions.

II. RESEARCH METHODOLOGY
The research methodology comprises a number of steps, as shown in Figure 1. We used Google and Google Scholar to search for studies in the dialectal Arabic ASR field. Several keywords and search strings were used to find studies and manuscripts related to our research topic, including: "dialectal Arabic automatic speech system", "dialect Arabic automatic speech system", "dialectal Arabic ASR", "automatic speech system for dialect Arabic", "dialect Arabic", "Arabic automatic speech system", "Arabic ASR", and "English language". Studies containing any of these keywords or strings were filtered to select the manuscripts suitable for dialectal Arabic ASR. We also followed the citations of some manuscripts to obtain further manuscripts on our topic. The initial collection covered the period between 2005 and 2022 and yielded 130 papers. The date range was then narrowed to 2009-2022 and filtered by article type, resulting in 76 papers. Finally, we manually selected the studies related to dialectal Arabic ASR based on criteria such as: (1) studies that presented speech recognition (speech-to-text) systems; (2) studies that include systems for pure Arabic dialects, i.e., systems that used only pure Arabic dialects for training and evaluation; (3) studies that used Arabic dialects as part of the training and evaluation data; (4) studies that utilized Arabic dialects for evaluation; (5) studies that utilized Arabic dialects for adapting and testing the acoustic model; (6) studies that used Arabic dialects for speech code-switching. Studies or papers containing only background information, descriptions, or related work on dialectal Arabic ASR were excluded; i.e., studies that do not include an investigation and results for dialectal Arabic ASR were not taken into consideration. After the manual selection process, 35 studies were reviewed and analyzed in this review according to criteria such as techniques, approaches, datasets, evaluation metrics, dialect types, and well-known publishers.

III. LITERATURE REVIEW
This section introduces 35 studies of dialectal Arabic ASR published in different journals and conferences over the last 13 years. Most studies were developed using machine learning methods. We present several studies covering various Arabic dialect types.
Soltau et al. [9] presented a description of the Arabic broadcast transcription evaluation using several techniques. They used a large vocabulary and cross-adaptation between two acoustic models (unvowelized and vowelized) to enhance performance. Hidden Markov models (HMMs) were used to build acoustic models with mixtures of diagonal-covariance Gaussian densities. Feature space maximum likelihood linear regression (FMLLR), maximum likelihood linear regression (MLLR), and feature minimum phone error (fMPE) were utilized as discriminative training techniques. The global autonomous language exploitation (GALE) Phase 2 and Arabic Gigaword corpora were used for training and evaluating the acoustic and language models. For dialect-specific acoustic modeling, experiments are reported using a decision tree built on dialect-dependent questions. The reported results were 25.9% for the regular tree and 24.7% for the dialect tree.
Elmahdy et al. [10] introduced a new multilingual system for dialectal ASR. The HMM-based technique was used for training with MLLR, maximum a posteriori (MAP), and vocal tract length normalization (VTLN) as adaptation techniques. The acoustic models used the news broadcast corpus of MSA for decoding Egyptian colloquial Arabic (ECA). The authors collected the ECA connected digits data for evaluating their model. An accuracy rate of 99.34% was reached.
Al-Haj et al. [11] proposed a model to recognize dialectal Iraqi Arabic. Pronunciation modeling was used for investigating dialectal Iraqi Arabic. The acoustic model combines HMM-based, sub-phonetically tied, and semi-continuous modeling. The Mel frequency cepstral coefficient (MFCC) and approximations of the first and second derivatives were used for feature extraction, yielding 42-dimensional coefficients. The models were trained on 450 hours of Iraqi Arabic data. The results were evaluated on two versions of evaluation data; the best results were 35.84% and 33.30% using multi-pronunciation with estimated weights. Selouani and Boudraa [12] created the dialectal Algerian database, known as the Algerian Arabic speech database (ALGASD). This database includes 300 Algerian native speakers. To train and evaluate on these data, the authors built an ASR system using the hidden Markov model toolkit (HTK) with MFCCs for feature extraction. In the experiments, the test data consisted of 157 sentences for evaluating the system. The results achieved an accuracy rate of 91.65%.
Elmahdy et al. [13] proposed an ASR system for ECA that benefits from MSA resources. Cross-lingual acoustic modeling was suggested using the Gaussian mixture model (GMM) and HMM. MLLR and MAP were used for adapting the acoustic model. The authors investigated phoneme-based and grapheme-based acoustic modeling to adapt the MSA model using spelling variants. This adaptation was used to select the correct ECA spelling. The results showed a word error rate (WER) of 35.00% with MLLR, MAP, and spelling variants.
Saon et al. [14] introduced a description of the Arabic broadcast transcription system using a mixture of GALE, FBIS, and topic detection and tracking (TDT-4) audio. Subspace Gaussian mixture models (SGMMs) were utilized to train the acoustic model, and neural network acoustics were utilized to train the language model (LM). They used modified Kneser-Ney smoothing to enhance the LM. MLLR and fMLLR were used to adapt the speaker-independent (SI) acoustic models. The best WER result was 9.10% with the language model. Elmahdy et al. [15] suggested a dialectal Arabic speech transcription system using the Arabic chat alphabet (ACA). GMM-HMM was used to train phoneme-based and grapheme-based acoustic models. Kneser-Ney smoothing was used to train a bi-gram LM. The ECA corpus with collected ACA data was used for training. The best WER of this work was found to be 13.40%.
Huang and Hasegawa-Johnson [16] presented an Arabic ASR system to classify phones based on the West Point MSA and Babylon Levantine Arabic corpora. They proposed a cross-dialectal GMM as a training method for the acoustic model and used transfer learning to transfer MSA data into dialectal Levantine Arabic.
Biadsy et al. [17] built Google's Arabic voice search system for multiple Arabic dialects and compared them. These dialects were Egypt (EG), Jordan (JO), Lebanon (LB), Saudi Arabia (SA), and the United Arab Emirates (AE). They used the standard 3-state HMM for training the acoustic model and boosted maximum mutual information (MMI) as the discriminative training technique. For feature extraction, linear discriminant analysis (LDA) was used as an adaptation method. The language model was trained as a 5-gram backoff LM using entropy pruning and Katz smoothing. The results were 27.7%, 28.7%, 24.6%, 18.5%, and 24.2% for AE, SA, EG, JO, and LB, respectively.
Almeman and Lee [18] proposed an Arabic ASR system for recognizing the MSA, Egyptian, Gulf, and Levantine dialects. This work presented a comparison between different Arabic dialects. The CMU Sphinx framework was used for training the acoustic model. The best WERs achieved were 13.7%, 10.00%, 17.00%, 15.10%, and 16.30% for multi-dialect, MSA, Gulf, Levantine, and Egyptian, respectively.
Masmoudi et al. [19] presented a novel Tunisian Arabic corpus and dictionary for ASR, coined the Tunisian Arabic railway interaction corpus (TARIC). The Tunisian graphemes were converted into the corresponding phonemes using rule-based tools. Moreover, the authors built a tool for this rule-based conversion that relies on a set of graphemes, phonemes, a lexicon of exceptions, and phonetic rules. Two types of corpora were used for evaluating the performance of the rule-based tools and pronunciation dictionaries. The results showed a WER of around 9%.
Ali et al. [20] developed an under-resourced Egyptian ASR system and presented its results. GMM, SGMM, and deep neural network (DNN) models were employed for training the acoustic model using the KALDI toolkit. Minimum phone error (MPE) and bMMI were used as discriminative training to adapt the acoustic model. The SRILM toolkit was utilized to build the LM with Kneser-Ney smoothing. Standard 13-dimensional cepstral mean-variance normalized (CMVN) MFCCs were used to extract features. The training and evaluation were conducted using a 10-hour dataset. A best WER of 44.71% was obtained for the EG grapheme system.
Elmahdy et al. [21] proposed a dialectal Arabic ASR system for the Qatari Arabic (QA) dialect, an under-resourced Arabic dialect. The GMM-HMM architecture was used to train the model using the KALDI toolkit. A transfer learning technique was proposed to transfer an MSA-based model to the Qatari dialect. In the transfer learning method, they used different processes to increase the accuracy, such as orthographic normalization, phone mapping, data pooling, acoustic model adaptation, and model combination. The backoff trigram model was built with Kneser-Ney smoothing. A best evaluation WER of 64.4% was obtained with a combination of data pooling, adaptation methods, and lattice minimum Bayes risk (MBR) decoding.
Wray et al. [22] presented a study assessing quality control in crowdsourced transcriptions. The Arabic ASR system was built on MSA and dialectal Arabic data. Moreover, the dialect data were used to evaluate the quality of the transcriptions with an edit distance algorithm. For the Egyptian and North African dialects, the transcription error was reduced by 1.0% for Egyptian data and 4.0% for North African data.
Ali et al. [23] proposed a method for measuring the accuracy of ASR systems. They presented a new approach for reporting ASR accuracy in a language with non-standard orthography, known as the multi-reference word error rate (MR-WER). A grapheme-based approach was used for building an acoustic model using a sequential DNN. In the experiments, an MR-WER of 53% was obtained, and WERs of 76.4% and 80.9% were reported.
Khurana and Ali [24] presented a description of the dialectal Arabic multi-genre broadcast (MGB-2) challenge, evaluated on 1,200 hours of speech audio. They proposed an LF-MMI modeling framework for building the system. The system was trained separately using long short-term memory (LSTM), BLSTM, and TDNN techniques. A recurrent neural network (RNN) was used for building the 4-gram language model with MaxEnt connections (RNNME) via the RNNLM toolkit. The features were transformed using adaptation techniques such as LDA, MLLT, and fMLLR. The KALDI toolkit was used for building all trained models. Finally, the three models were combined into one model that achieved a WER of 14.2%.
Amazouz et al. [25] introduced a study of the effectiveness of code-switching (CS) in an ASR system. CS between French and Algerian Arabic was examined to compare the quantity of CS occurring in dialectal Arabic speech and switching to French. They built an acoustic-phonetic model based on collected Maghrebian broadcast news data covering the Algerian, Moroccan, and Tunisian dialects. The results showed that the Algerian dialect had a better CS rate than the Tunisian and Moroccan dialects.
Masmoudi et al. [26] proposed a framework for developing an ASR system for the Tunisian dialect. They sought to summarize the linguistic characteristics of the Tunisian dialect, such as its phonology, morphology, and syntax. This work introduced grapheme-to-phoneme (G2P) conversion using a rule-based technique. A WER of 22.60% was obtained.
Menacer et al. [27] developed an ASR system for MSA and Algerian dialect known as Arabic Loria ASR (ALASR). The DNN-HMM technique was utilized to build the acoustic model, and the LM was built by a classical n-gram. The sMBR (state-level minimum Bayes risk) criterion was applied for adapting the training. The Kaldi toolkit was utilized for building the acoustic model. The Nemlar, Gigaword Arabic, and NetDC corpora were used for training and testing the model. WERs of 14.02%, 89%, and 65.45% were obtained for MSA, Algerian dialect, and the combined data (MSA and Algerian), respectively.
Ali et al. [28] presented a detailed description of the Arabic MGB-3 challenge. The MGB-3 was used for evaluating the Arabic ASR system. MGB-3 consists of 16 hours of Egyptian dialect that are collected from talk show programs on YouTube. The system was trained using LSTM, BLSTM, and TDNN techniques. The lexical and i-vector bottleneck features were extracted for use in this system. The system was evaluated using MGB-3 testing with an average WER of 37.5%.
Ali et al. [29] introduced a study assessing the effectiveness of ASR on dialectal Arabic speech. This study focused on the problems associated with the orthography and spelling of dialects. The authors proposed an LF-MMI modeling framework for building the system. The system was trained using three techniques (LSTM, BLSTM, and TDNN) separately. In LM training, two types of n-gram were used: first, a tri-gram was used to generate decoding lattices; then, a 4-gram was used for rescoring the output of the first model based on external LM data. Both language models were trained using an RNN with MaxEnt connections (RNNME) with Kneser-Ney as the smoothing technique. A multi-reference word error rate (MR-WER) of 25.3% was reported in this work as the average MR-WER for the Egyptian dialect.
Najafian et al. [30] presented a study investigating the performance of several spoken dialect identification techniques. Multilingual models covering Arabic, English, Czech, Hungarian, and Russian were trained separately to enhance accuracy. This work used a multi-dialectal speech corpus covering Egyptian (EGY), Gulf (GLF), Levantine (LEV), MSA, and North African dialects. A new n-gram phonotactic feature was proposed and integrated with an SVM classifier for generating the phone sequences. In addition, the i-vectors method was combined with the phonotactic features using a DNN. Finally, convolutional neural networks (CNNs) were used to map the acoustic model and the proposed features to each of the five dialects. The system achieved accuracies of 56.82% and 57.91% for phone n-grams with a support vector machine (SVM) and phone n-grams with a CNN, respectively.
Hassine et al. [31] built an Arabic ASR system for recognizing Arabic numbers (digits), from 0 to 9, in the Tunisian dialect. In the feature extraction step, different techniques were used separately to extract features, such as perceptual linear prediction (PLP), ∆PLP, MFCC, and vector quantization of Linde-Buzo-Gray (VQLBG). All features were then merged and used in training. An ANN type, known as a feedforward back-propagation neural network (FFBPNN), was used for training the acoustic model. An average accuracy of 98.54% was obtained.
Khurana et al. [32] developed the DARTS system to convert speech to text in the Egyptian Arabic dialect. A transfer learning technique was used to transfer from the high-resource broadcast domain to the dialectal text. The acoustic model was trained on the Arabic MGB-2 and MGB-3 challenges using a deep neural network comprising a CNN and multiple layers of TDNN and LSTM. Discriminative methods such as LF-MMI and Multi-LF-MMI were used in training. The training process was performed with the KALDI toolkit. Two LMs were developed: the first was a tri-gram LM built with the SRILM toolkit using Kneser-Ney (KN) smoothing; the other was a 4-gram RNN-LM with MaxEnt connections built using the Mikolov RNN LM toolkit. DARTS was evaluated using the MGB-3 testing corpus and achieved a WER of 35.8%.
Ali et al. [33] presented a new edition of the multi-genre broadcast challenge known as MGB-5. Its construction builds on the MGB-3 dataset and contains over 48 hours of dialectal Moroccan audio recorded from YouTube. These data were used for evaluating the Arabic ASR system. They proposed an LF-MMI modeling framework for building the system. The system was trained using three techniques (LSTM, BLSTM, and TDNN) separately. RNNME was used to build the 4-gram LM using the RNNLM toolkit. The features were transformed using adaptation techniques such as LDA, MLLT, and fMLLR. The KALDI toolkit was used for building all the trained models. Error rates of 67.1% and 48.4% were obtained for AV-WER and MR-WER, respectively.
Ali et al. [34] evaluated Arabic ASR systems on dialectal Arabic transcription using a set of evaluation metrics. This work compared the metrics' correlation with human judgments on a validation set of 1,000 utterances from six systems. They proposed new degrees of morphological abstraction and spelling normalization. The results showed that the new degrees of morphological abstraction and spelling normalization demonstrated the best correlation with human judgment.
Bougrine et al. [35] developed a complete recipe for building large-scale speech corpus from web resources. The presented recipe was used to create a corpus for the Algerian Arabic dialect which was named KALAM'DZ. This corpus included eight classes of Algerian Arabic sub-dialects containing about 104.4 hours.
Alsharhan and Ramsay [36] evaluated an Arabic ASR system on Arabic dialects based on MSA data. They used MFCCs for feature extraction. The acoustic model was built on a DNN using the HTK toolkit. A pronunciation model was used to integrate the acoustic model with the LM via the pronunciation lexicon. The pronunciation lexicon contains a set of units (words) with single or multiple phonetic transcriptions. The LM was then built using DNN and HMM. Two datasets were utilized for training and testing: the first is the GALE Phase 3 dataset for MSA, while the second is an Arabic dialect dataset that includes the Gulf, Iraqi, Egyptian, Levantine, and Maghrebi dialect versions. The final integration achieved a WER between 3.24% and 5.35%.
Hamed et al. [37] collected and analyzed a speech corpus based on Egyptian Arabic and English conversations. Code-switching was used for mixing Egyptian Arabic and English conversations. A three-fold procedure was proposed for building the corpus: recording conversational Egyptian Arabic spontaneous speech, obtaining manual transcriptions, and analyzing the speech from the code-switching perspective. Part-of-speech (POS) tags were used to annotate some of the transcriptions.
Ali [38] developed a multi-dialect ASR system for Arabic using an end-to-end approach. The author proposed a CNN, an RNN, and a joint connectionist temporal classification (CTC)/attention encoder-decoder for building the acoustic model. In LM training, an RNN with Kneser-Ney smoothing was used to build the LM. An open-source corpus, assembled from several corpora, was used for the training and testing processes. A WER of 14.07% was obtained.
Mubarak et al. [39] introduced an ASR system for dialectal Arabic speech using an end-to-end approach. They proposed a joint CTC/attention encoder-decoder for building the acoustic model. In LM training, an RNN was used to build the LM. The QASR corpus was used for the training and testing processes. An average accuracy of 52.6% was reported in this work.
Hamed et al. [40] developed a code-switching Egyptian Arabic-English ASR system. They used DNN-based hybrid and transformer-based end-to-end approaches to build the ASR systems. In LM training, an RNN was used to build the LM. The MGB-3 corpus was used for the training and testing processes. A best WER of 32.1% was obtained.
Ahmed et al. [41] developed and described an Arabic ASR based on MGB-5 in Arabic. They applied speech augmentation using speed and volume perturbation, data reverberation, and music-noise-speech injection transformation. CNN with TDNN and TDNN-f were used for building the acoustic model. The x-vector and i-vector were combined and used as new features in this system. In addition, language model interpolation, semi-supervised learning, genre adaptation, and lattice-based MBR were proposed and combined. The proposed system achieved an average WER of 62.17%.
Al-Anzi and AbuZeina [42] presented a dialectal Arabic speech system comprising pronunciation dictionaries, language models, and acoustic models. The acoustic model was trained and built on a hybrid deep neural network-hidden Markov model (DNN-HMM) architecture using the HTK toolkit. An n-gram language model exploiting long-distance word relationships is presented. MFCCs were used to extract the features. The models were trained and evaluated on a discrete-word speech dataset. The system achieved a WER of 54.02%.
Hussein et al. [43] proposed a state-of-the-art end-to-end ASR system for Arabic speech. They used a transformer technique to build the encoder and decoder. The language model was built using TDNN-LSTM. Mel filter bank features were utilized as the acoustic features. The MGB-3 and MGB-5 corpora were used to train and evaluate the system. The system achieved a new state-of-the-art performance of 27.5% and 33.8% for MGB-3 and MGB-5, respectively.

IV. DIALECTAL ARABIC SPEECH RECOGNITION SYSTEM
As mentioned in the literature review above, most ASR systems comprise six steps: 1) feature extraction; 2) lexical modeling; 3) language modeling; 4) acoustic modeling; 5) discriminative training; and 6) evaluation.
The ASR system architecture is shown in Figure 2 and is further analyzed in this section.

A. FEATURE EXTRACTION
Feature extraction is an important step in ASR tasks. The speech waveform is continuous in amplitude and time, and the purpose of signal processing is to convert the waveform into vectors. Feature extraction maps the audio signals to a set of acoustic features that are used to build the acoustic model. The acoustic features must be built without losing substantial signal information while simultaneously minimizing variability across speakers and environmental acoustic conditions. Moreover, the features are used to distinguish speech sounds from one another. The same extraction is applied to the test utterances, whose features serve as the input of the recognizer to generate the sequence of uttered words [44]. Several techniques can be used for feature extraction [45,46].
• Mel frequency cepstral coefficient (MFCC) is a popular technique used to extract features for ASR. It depends on cepstral analysis, a method for separating speech signals into components in order to represent pitch and vocal tract information. MFCC simulates human behavior in distinguishing sound frequencies, since the frequency bands are spaced logarithmically. Feature processing begins with the windowing step, which converts the waveform into vectors or chunks; typically 25 ms windows are handled at 10 ms intervals. Each window is then transformed into the spectral domain and power spectrum using the short-time fast Fourier transform. The power spectra are smoothed for each window using a 20-40 filter Mel filter bank; this smoothing models the frequency sensitivity of human hearing. The smoothed power spectra are then taken logarithmically to produce the Mel-filter bank (FBANK) features, which can themselves be used for training acoustic models. Finally, the FBANK features are decorrelated with the discrete cosine transform (DCT) to produce MFCCs [47]. This method has been used for feature extraction in different ASR systems (see [17][18][19][20][21][22][23]). A code sketch of this pipeline appears at the end of this subsection.
• Perceptually based linear predictive analysis (PLP) uses certain aspects of audition and provides the same spectral estimation of speech as LPC analysis but with a lower-order model. In addition, it provides better performance for cross-speaker ASR. It computes filter-bank filters followed by a linear predictive analysis to produce a cepstral representation. This method has been used for feature extraction in different ASR systems (see [9,16,17,26]).
The LDA transformation is used to improve separability and reduce the dimensionality of acoustic features. Table 1 lists the studies that use these methods for feature extraction.
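To make the MFCC pipeline above concrete, the following is a minimal Python sketch of the standard steps (framing, short-time FFT, Mel filter bank, log, DCT). The 25 ms windows and 10 ms shift follow the values above; the 26-filter bank, 13 coefficients, and 512-point FFT are common defaults assumed here for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, shift_ms=10,
         n_filters=26, n_ceps=13, n_fft=512):
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    # 1) Windowing: split the waveform into overlapping Hamming-windowed frames.
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2) Short-time FFT -> power spectrum for each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Mel filter bank: triangular filters spaced linearly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4) Log of the smoothed power spectra gives the FBANK features.
    fbank_feats = np.log(power @ fbank.T + 1e-10)
    # 5) DCT decorrelates the FBANK features; keep the first n_ceps coefficients.
    return dct(fbank_feats, type=2, axis=1, norm='ortho')[:, :n_ceps]

signal = np.random.randn(16000)     # one second of stand-in audio
print(mfcc(signal).shape)           # -> (98, 13): ~98 frames, 13 coefficients
```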

B. LEXICAL MODELING
A lexical model is a method for representing the phoneme sequences of the vocabulary. It is used as a pronunciation dictionary that maps sequences of phones into words. Each line in the lexical model represents a word suitable for the recognition model in the speech decoder, together with the context-independent phonemes of that word. Such a dictionary is the simplest way of building a lexical model.
There are also statistical methods that model the lexicon using the probabilities of multiple pronunciations of each word [48].
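As a concrete illustration, the following minimal Python sketch represents a lexical model as a pronunciation dictionary; the statistical variant simply attaches a probability to each alternative pronunciation. The Buckwalter-style entries and phone symbols are hypothetical examples, not taken from any cited lexicon.

```python
# Each word maps to one or more (phoneme sequence, probability) pairs.
lexicon = {
    "kitAb":  [(("k", "i", "t", "aa", "b"), 1.0)],                 # "book"
    "yaktub": [(("y", "a", "k", "t", "u", "b"), 0.7),              # "he writes"
               (("y", "i", "k", "t", "i", "b"), 0.3)],             # dialectal variant
}

def pronunciations(word):
    """Return the (phoneme sequence, probability) pairs for a word."""
    return lexicon.get(word, [])

for phones, prob in pronunciations("yaktub"):
    print(" ".join(phones), prob)
```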

C. LANGUAGE MODELS
The LM is a statistical model (used in ASR decoding) for enhancing the word (unit) recognition process. The LM depends on a large set of vocabulary items connected as sentences. It is stored in a file that contains all words and their occurrence probabilities; each probability is the prior probability of a sequence of words appearing in the language. An ASR system with an LM is faster and achieves higher accuracy. In general, the quality of an LM depends on the morphology of the language, i.e., an LM for a morphologically simple language performs better than one for a morphologically complex language. Thus, the LM of the Arabic language presents challenges due to its morphological complexity compared to some other languages [7,49]. Sentences that were not included in the ASR dataset receive zero probability from the model; this is a known challenge and problem for the language model. To solve the zero-probability problem, smoothing methods redistribute probability mass to the sentences that would otherwise have zero probability, based on the sentences in the dataset. Smoothing also tends to improve the accuracy of the model. A set of smoothing techniques are used to calculate the probability of a word [49]; they are classified into backoff and interpolated techniques. In the first, the probability of a sentence missing from the corpus is estimated using its lower-order n-grams, while the second combines the sentence probability with its lower orders, i.e., the trigram, bigram, and unigram probabilities are combined. Common smoothing techniques include Witten-Bell, Good-Turing, and Kneser-Ney smoothing [49].
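To illustrate how smoothing assigns nonzero probability to unseen word sequences, here is a minimal Python sketch of an interpolated bigram model. Plain linear interpolation is used for clarity; the surveyed systems use stronger schemes such as Kneser-Ney, and the toy corpus is a made-up example.

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # toy training data
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
total = sum(unigrams.values())

def p_interp(prev, word, lam=0.7):
    """Interpolated P(word | prev): mix the bigram and unigram estimates so
    that unseen bigrams still receive a nonzero probability."""
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / total
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("the", "cat"))   # seen bigram: high probability
print(p_interp("dog", "cat"))   # unseen bigram: smoothed, nonzero
```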

D. ACOUSTIC MODELING
A set of statistical models are estimated to represent a set of sub-word/word (units), such as phonemes, tri-phones, or complete words. These models are usually used to measure how likely the acoustic features are emitted by the word sequence hypothesis and its constituting sub-word units. The acoustic model is trained and built using generative learning algorithms. This model can recognize dialectal Arabic speech. In the reviewed studies, many techniques are used to represent acoustic modeling as shown in Table 1. A brief description of some of these techniques is presented.

1) Hidden Markov Models
HMMs were introduced at the end of the last century. HMMs are a special case of regular Markov models and have proven to be a powerful model for representing time-varying signals as a parametric random process [50]. HMMs are considered the most popular acoustic models for ASR [6]. They consist of a finite-state Markov chain together with a set of output distributions; the transition parameters model temporal variability, while the output distributions model spectral variability. Each state of the HMM is associated with a probability density function. A GMM with mixtures of diagonal-covariance Gaussians is used to model each state in the HMM [45,50]. Several diagonal-covariance Gaussians are utilized to generate the probability densities as follows [50]:

$$b_{S_i}(o_t) = \sum_{j} c_{ij}\, \mathcal{N}(o_t; \mu_{ij}, \Sigma_{ij}),$$

where $j$ ranges over the count of Gaussian densities in the mixture of state $S_i$, and $c_{ij}$ are the mixture weights. For training HMMs, the model parameters $\lambda$ are estimated to maximize the data likelihood [50]:

$$\hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda).$$

The parameters of the model (state transition probabilities and output distribution parameters, e.g., the means and variances of a Gaussian) are automatically estimated from the training data. An HMM is used to model a unit (phone, phoneme, word, etc.) [50]. The phone model is represented by a phoneme connected with each HMM. When HMMs model phones as the main unit in speech, left-to-right HMMs model each phone using three emitting states plus entry and exit states. As shown in Table 1, nine dialectal Arabic ASR studies used HMMs.
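As a concrete illustration of scoring features with an HMM, the following Python sketch computes the forward log-likelihood of a feature sequence under a toy left-to-right HMM. A single diagonal-covariance Gaussian per state is assumed for brevity; a real acoustic model uses a mixture per state and learns all parameters with EM (Baum-Welch).

```python
import numpy as np

def log_gauss(x, mean, var):
    # log N(x; mean, diag(var)) for one frame x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(feats, trans, means, variances):
    """feats: (T, D) frames; trans: (S, S) transition probs; per-state means/vars."""
    T, S = len(feats), len(means)
    log_a = np.log(trans + 1e-300)
    alpha = np.full(S, -np.inf)
    alpha[0] = log_gauss(feats[0], means[0], variances[0])   # start in state 0
    for t in range(1, T):
        emit = np.array([log_gauss(feats[t], means[s], variances[s])
                         for s in range(S)])
        alpha = emit + np.array([np.logaddexp.reduce(alpha + log_a[:, s])
                                 for s in range(S)])
    return np.logaddexp.reduce(alpha)

# Toy 3-state left-to-right phone model over 2-dimensional features.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
means = np.zeros((3, 2)); variances = np.ones((3, 2))
feats = np.random.randn(20, 2)      # stand-in feature frames
print(forward_loglik(feats, trans, means, variances))
```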

2) Gaussian Mixture Model
The GMM is a probabilistic statistical learning model that represents the distribution of speech signal features. It is used to manipulate variations in signals and convert them into a dynamic sequence of vectors. The GMM is a suitable method for text-independent ASR systems. To implement the likelihood ratio as a recognition model, the actual likelihood function must be determined; this function is selected based on the features extracted from the signals. In addition, the GMM is built on the underlying distribution of acoustic observations from speech. The temporal aspects of the utterance do not impact GMM modeling. In speech recognition, each utterance is represented as a GMM, and the parameters of the model λ must be estimated to best match the training vectors of the utterance [51]. The most common techniques for estimating these parameters are: 1) maximum likelihood (ML) estimation, whose main goal is to obtain the model parameters that maximize the likelihood of the GMM; and 2) expectation-maximization (EM), an iterative algorithm utilized to compute the ML estimate of the GMM parameters. As shown in Table 1, seven dialectal Arabic ASR studies used GMMs.
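A minimal sketch of GMM training with EM, using scikit-learn's GaussianMixture with diagonal covariances as in the HMM-GMM systems above; the random frames stand in for real MFCC features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

frames = np.random.randn(500, 13)                  # stand-in for 13-dim MFCC frames
gmm = GaussianMixture(n_components=8, covariance_type="diag", max_iter=100)
gmm.fit(frames)                                    # EM estimates weights, means, variances
print(gmm.score(frames))                           # average per-frame log-likelihood
```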

3) Subspace Gaussian Mixture Model
In conventional acoustic models, a GMM with a large set of parameters represents every HMM distribution. The SGMM instead represents the states' distributions with a small set of parameters in a low-dimensional subspace. There is a high correlation between the states' distributions; therefore, they can be represented by one low-dimensional subspace for all states. Human sounds correspond to a limited variety of distributions; therefore, speech can be treated as triphone states with highly correlated distributions. The SGMM is suitable for ASR and comprises parameters shared across all states. In addition, SGMMs can be naturally trained in a multilingual fashion. In an SGMM, the correlations across the triphone states are stored as parameters of a low-dimensional model subspace [52]. All context-dependent HMM states in SGMMs use the universal background model (UBM) as a shared common representation. The UBM is a GMM trained over all speech classes pooled together [53]. A GMM-UBM is a large mixture of Gaussians, with I components, that represents all speech; it is used for pruning the Gaussian components and initializing the model. The acoustic space is split into I regions by the UBM, where each region is characterized by its mixture parameters (means, covariances, and weights). In the UBM, the P Gaussian components with the highest likelihood scores are selected and used in both model training and recognition. As in the GMM, ML is used to estimate the SGMM parameters, and EM is used to compute the ML estimate. As shown in Table 1, four dialectal Arabic ASR studies used SGMMs.

4) Deep Neural Network Model
A DNN model comprises an input layer, an output layer, and two or more layers of hidden units. Each hidden unit combines all inputs from the previous layer into a scalar state using the logistic function, which is then sent to the next layer. DNNs are discriminatively trained by backpropagating the derivatives of a cost function; backpropagation measures the discrepancy between the target outputs and the outputs produced by the DNN [54]. With the softmax output function, the cross-entropy between the target probabilities $d$ and the softmax outputs $p$ gives the natural cost function $C$:

$$C = -\sum_{j} d_j \log p_j,$$

where each target probability $d_j$ is one or zero. A DNN can be trained on a large training set by calculating the derivatives on a small part of the training set (a "minibatch") rather than the whole training set, and then updating the weights along the gradient. The trained neural networks in a DNN are used to recognize speech. The network includes an input layer matching the dimension of the input spectral features, N hidden layers, and one output layer. The output layer dimension is equal to the number of utterances the system is designed to identify. The frame-level DNN posteriors from the output layer are combined by simply averaging over the test utterance [55]. A DNN can be used to train on and recognize speech signals in a low-resource setting without a secondary classifier; a secondary classifier is unsuitable for small datasets and requires additional computational resources. Increasing the number of hidden layers will enhance the system's performance; however, the complexity will also increase. As shown in Table 1, 12 dialectal Arabic ASR studies used DNNs.
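The following PyTorch sketch illustrates the setup described above: a feedforward DNN of logistic hidden units trained with the softmax cross-entropy cost on minibatches. The input dimension, hidden sizes, number of target units, and random data are assumptions made for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(39, 256), nn.Sigmoid(),   # spectral features in (e.g., MFCC + deltas)
    nn.Linear(256, 256), nn.Sigmoid(),  # hidden layers of logistic units
    nn.Linear(256, 100),                # logits over e.g. 100 output units
)
loss_fn = nn.CrossEntropyLoss()         # softmax + cross-entropy C = -sum d_j log p_j
opt = torch.optim.SGD(model.parameters(), lr=0.1)

feats = torch.randn(10_000, 39)                   # stand-in acoustic frames
labels = torch.randint(0, 100, (10_000,))         # stand-in frame labels
for start in range(0, len(feats), 256):           # minibatch gradient descent
    x, y = feats[start:start + 256], labels[start:start + 256]
    opt.zero_grad()
    loss_fn(model(x), y).backward()               # backpropagate the CE derivatives
    opt.step()                                    # update weights along the gradient
```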

5) Convolutional Neural Network
The CNN [56] is a kind of DNN. It has a mechanism for simulating the mammalian visual neuron system [57], in which neurons are activated by specific areas of the visual field. A CNN has restricted, as opposed to fully connected, connections for manipulating data with a grid-like essential structure; for example, an image can be represented by a 2D pixel grid and fixed-length audio by a 1D grid. In addition, the CNN has properties that make it more suitable than a plain DNN for image and signal data. A CNN has three stages: a convolution stage, a detector stage, and a pooling stage. Because the CNN is a biologically inspired model, it is suitable for developing acoustic models in ASR systems in order to enhance performance. In addition, the structural locality of the acoustic features is used to reduce the spectral variance in acoustic features and to capture long-term dependencies in the speech frames by exploiting prior knowledge of the speech signal [59,60]. Sainath et al. [61] reported that CNNs achieve a 13-30% improvement over GMMs, and a 4-12% improvement over DNNs, using 700 hours of speech data. As shown in Table 1, four dialectal Arabic ASR studies used CNNs.
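A minimal PyTorch sketch of the three CNN stages named above (convolution, detector, pooling) applied to a spectrogram-like input; the filter counts and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # convolution stage: local connections
    nn.ReLU(),                                   # detector stage: nonlinearity
    nn.MaxPool2d(2),                             # pooling stage: reduces spectral variance
)
spec = torch.randn(1, 1, 40, 100)   # (batch, channel, Mel bands, frames)
print(cnn(spec).shape)              # -> (1, 32, 20, 50)
```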

6) Time-delay Neural Network
Time-delay neural networks (TDNNs) are a type of CNN that shares weights along a single temporal dimension. The first TDNN model was proposed to recognize phonemes [62]; TDNNs were later utilized for recognizing spoken words [63] and handwriting [64]. They enable the acoustic model to learn the temporal dynamics of the speech signal from short-term acoustic feature vectors, and they use sub-sampling to reduce computation during training. In a DNN, a wide temporal context is processed through a wide contextual window of features at the initial layer, whereas in a TDNN each layer operates on a different level of the features: local patterns are learned by the first layer, and higher layers learn a wider temporal context. Each layer in a TDNN thus operates at a different temporal resolution, which increases as one moves deeper into the network. As shown in Table 1, six dialectal Arabic ASR studies used TDNNs.
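The following PyTorch sketch expresses a TDNN layer stack as 1D convolutions over time: the first layer sees a narrow context, and dilation widens the context in deeper layers, mirroring the description above. Context widths and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

tdnn = nn.Sequential(
    nn.Conv1d(40, 256, kernel_size=3, dilation=1), nn.ReLU(),  # context {-1, 0, +1}
    nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(), # context {-2, 0, +2}
    nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(), # context {-3, 0, +3}
)
frames = torch.randn(1, 40, 200)   # (batch, feature dim, time): 200 FBANK frames
out = tdnn(frames)
print(out.shape)                   # time axis shrinks as temporal context accumulates
```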

7) Long Short-Term Memory Network
LSTMs are a special type of RNN that evolved from the standard RNN. The LSTM can save information over a long period using long-term dependencies in order to find and exploit long-range context. The standard RNN has a single neural network layer, while the LSTM uses four interacting layers with unique communication links [4,65]. In ASR, the future context can also be used if the transcriptions of all utterances are available at training time. An LSTM maps an input sequence $X = x_1, x_2, \ldots, x_T$ to an output sequence $Y = y_1, y_2, \ldots, y_L$ through the calculation of the network unit activations. In the training stage, an LSTM with sub-sampling takes a T-length speech feature sequence $o_{1:T}$ and produces high-level features as follows:

$$h_{1:T'} = \mathrm{LSTM}(o_{1:T}),$$

where $T' < T$ due to the sub-sampling. The input features $X$ are processed to create the hidden states $h_t$ through frame-wise operations. The LSTM sub-samples its outputs to reduce the computational cost; therefore, in ASR, the input length can differ from the output length [66]. A bidirectional LSTM (BLSTM) additionally processes the sequence in the backward direction in its hidden layers. As shown in Table 1, 11 dialectal Arabic ASR studies used LSTMs and BLSTMs.
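A minimal PyTorch sketch of an LSTM encoder with temporal sub-sampling, matching the $h_{1:T'} = \mathrm{LSTM}(o_{1:T})$ formulation above: every second frame of each layer's output is kept, so the output is shorter than the input. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SubsampledLSTM(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, layers=2):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)]
        )

    def forward(self, o):            # o: (batch, T, feat_dim)
        h = o
        for lstm in self.lstms:
            h, _ = lstm(h)           # frame-wise hidden states h_t
            h = h[:, ::2, :]         # sub-sample: keep every second frame
        return h                     # high-level features h_{1:T'}, T' < T

enc = SubsampledLSTM()
print(enc(torch.randn(1, 100, 40)).shape)   # -> (1, 25, 320): T=100 reduced to T'=25
```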

E. DISCRIMINATIVE CRITERIA
In speech recognition, the acoustic model may be trained on large datasets consisting of hundreds of hours (or more) of speech from different speakers. However, there are often utterances that are poorly represented in the training data, which leads to a mismatch between the training and testing representations. To reduce this mismatch, adaptation techniques are used for the discriminative training of acoustic models. The discriminative training approach directly optimizes a mapping function from the input samples to the output labels and is used to enhance the acoustic model's recognition of utterances [67,68]. The main goal of the discriminative learning approach is therefore to modify only the decision boundary without constructing a data generator over the entire feature space. In the reviewed studies, several discriminative training criteria are used for dialectal Arabic speech recognition, such as LDA, MLLR, fMLLR, maximum mutual information estimation (MMIE), boosted MMI, MPE, CTC, and attention-based models (see Table 1). In addition, adaptation is an effective approach that alleviates mismatches between the models and the data arising from the utterance, channel, or other factors. As shown in Table 1, 19 dialectal Arabic ASR studies used adaptation methods.

F. EVALUATION
In the reviewed studies, accuracy-based performance metrics are used for evaluating ASR systems, and the perplexity metric is used for evaluating the performance of LMs. This section describes these evaluation metrics in some detail.

1) ASR Evaluation
The performance of ASR is usually evaluated in terms of two criteria: (1) accuracy (Acc), the percentage of correctly recognized units, and (2) WER, the percentage of word-level errors among the recognized units. These criteria are defined as follows:

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%, \qquad \mathrm{Acc} = 100\% - \mathrm{WER},$$

where N is the total number of words in the set of evaluation utterances, substitutions (S) is the number of misrecognized words, deletions (D) is the number of words deleted in the recognition result, and I is the number of words inserted in the recognition result.
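As a worked illustration of the WER formula, the following Python sketch computes WER from the Levenshtein alignment between a reference and a hypothesis, where the edit distance counts the substitutions, deletions, and insertions.

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)   # (S + D + I) / N * 100

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 1 sub + 1 del -> 33.3%
```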

2) Language Model Evaluation
The performance of an LM is evaluated using the perplexity measure, which depends on the tokens in the transcriptions. The perplexity of an LM over K tokens is calculated as follows [69]:

$$\mathrm{PPL} = \left( \prod_{i=1}^{K} P(\mathrm{token}_i \mid \mathrm{token}_{1:i-1}) \right)^{-1/K},$$

where $P(\mathrm{token}_i \mid \mathrm{token}_{1:i-1})$ is the probability the LM assigns to the i-th token given the first i − 1 tokens. Table 2 shows the datasets and evaluation results for all systems presented in the reviewed studies.
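As a worked illustration of the perplexity formula above, this short Python sketch exponentiates the average negative log-probability the LM assigns to held-out tokens; the per-token probabilities are made-up values.

```python
import math

def perplexity(token_probs):
    """token_probs[i] = P(token_i | token_1 .. token_{i-1}) from the LM."""
    K = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / K)

print(perplexity([0.2, 0.1, 0.25, 0.05]))   # lower perplexity = better LM
```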

V. DISCUSSION AND CHALLENGES
In this section, we present a discussion and analysis of the results and highlight some challenges for the dialectal Arabic ASR system.

A. DISCUSSION AND ANALYSIS
As shown in the literature review, many studies present dialectal Arabic ASR using several techniques and methods, summarized in Table 1. From Table 1, we can see that most studies used the MFCC technique for feature extraction; these studies used 13 MFCCs as well as 39- and 40-dimensional high-resolution MFCCs. Furthermore, some of these studies used LDA for feature transformation, while others used techniques such as PLP (12- and 13-dimensional), neural network (NN) features, bottleneck features, and VQLBG. In addition, most studies built language models using different techniques, with a large number using RNNs; some also used the Kneser-Ney smoothing technique to enhance the LM. Both 3-gram and 4-gram LMs were used in these studies, and five studies did not use an LM at all. Table 1 shows the techniques and approaches used for building acoustic models in the presented studies. Most studies used traditional techniques, while others used deep learning techniques; some used hybrid techniques, and two studies used a rule-based technique for building models. Most studies used adaptation (discriminative) techniques, as shown in Table 1, to enhance the acoustic model's recognition of utterances. In general, most studies used a traditional approach for building the dialectal Arabic ASR system, while two studies used an end-to-end approach.
We cannot definitively determine the state of the art in dialectal Arabic ASR. However, according to our literature review and current knowledge, we can conclude that studies implementing the end-to-end approach are at the forefront of dialectal Arabic ASR systems, as shown in Table 2.
From Table 2, we can observe that WER was reported as the result in most studies, while other studies reported accuracy. In addition, most studies did not report the perplexity of the LM. Furthermore, we can see that five studies were presented in 2017 and one study in 2016. Over the last five years, 17 studies were presented, as shown in Table 3.
As shown from the literature review and Table 2, datasets, corpora, and databases were used in the dialectal Arabic ASR systems. Some of these data sources are freely available and others require an access fee. Table 4 summarizes the characteristics and availability of some (free) data sources that were used in dialectal Arabic ASR systems.
Arabic has several dialect types (e.g., Algerian, Egyptian, Gulf, Iraqi, Levantine, Tunisian, Moroccan, and Mauritanian); Figure 3 shows the number of speakers of each dialect type. The presented studies cover dialectal Arabic ASR systems for some of these Arabic dialects. Table 5 summarizes the types of Arabic dialects in the presented studies. From this table, we note that the Egyptian dialect has the highest study count (17), while Algerian and Tunisian have seven and eight studies, respectively. The Mauritanian dialect has only one study, the lowest count. In addition, the Sudanese and Yemeni dialects have no studies.
Some of the presented studies were published in the Web of Science and Scopus databases as shown in Table 6. Figure 4 shows a comparison between the types of Arabic dialects regarding the number of studies for each type.

B. CHALLENGES
As mentioned above, many studies have developed dialectal Arabic ASR systems. However, in addition to the challenges mentioned in the introduction, many challenges still exist: 1) Few datasets related to the Arabic dialects exist. 2) There are no datasets for some dialect types, such as Sudanese and Yemeni. 3) All of the available studies used non-diacritized Arabic. 4) Some studies lack variation in their data. 5) All of the studies use English letters instead of Arabic letters in transcription, using the Buckwalter format. 6) Many of the presented studies used the MSA version to train and adapt the acoustic model and the dialectal version for testing; i.e., these studies did not use pure dialectal data in the training process. 7) The studies developed on collected or specific datasets report good accuracy, but they did not use standard datasets, and some used only one dialect type. 8) Multi-dialect systems have low accuracy compared to single-dialect systems.

VI. CONCLUSION
In this work, we reviewed 35 studies of dialectal Arabic ASR. Many approaches and techniques were described across the feature extraction, lexical modeling, language modeling, acoustic modeling, discriminative criteria, and evaluation steps. Moreover, we presented the current progress of dialectal Arabic ASR and introduced three comparisons between the presented studies covering techniques and methods, datasets, accuracies, and dialect types; these studies were analyzed and discussed. In addition, a brief overview of the dialectal Arabic datasets and corpora was presented. We also discussed and highlighted some challenges and problems. Based on these challenges and our analysis, we suggest some future directions, including collecting diacritized data, collecting new and more varied data, collecting Sudanese and Yemeni dialect data, adapting techniques and methods to handle Arabic letters, and applying other techniques and methods for building dialectal Arabic speech systems.

BANDAR ALOTAIBI (MEMBER, IEEE) received the Bachelor of Science degree (Hons.) in computer science (information security and assurance) from the University of Findlay, USA, the Master of Science degree in information security and assurance from Robert Morris University, USA, and the Ph.D. degree in computer science and engineering from the University of Bridgeport, USA. He is currently an Associate Professor with the Information Technology Department, University of Tabuk. His research interests include computer vision, network security, mobile communications, computer forensics, wireless sensor networks, and quantum computing.
ABDELAZIZ A. ABDELHAMID received the M.Sc. degree in computer science from the Faculty of Computer and Information Sciences, Ain Shams University, and the Ph.D. degree in computer engineering from the Faculty of Engineering, Auckland University, New Zealand. He is an Assistant Professor with the Department of Computer Science, Faculty of Computer and Information Sciences, Ain Shams University. He is currently working as an assistant professor with the computer science department, College of Computing and Information Technology, Shaqra University. His research interests include speech and image processing, and machine learning-based intelligent systems.