Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields

Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This article discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionaries, or feature engineering; rather, it uses a sequence-to-sequence scheme. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.


I. INTRODUCTION
Diacritics are marks written above or below words or letters in several languages such as Arabic [1], Turkish [2], and Romanian [3]. Arabic texts are usually written without diacritics, and readers can infer the meanings and correct pronunciations of the words from their contexts. However, such inference is not always easy. In Arabic, diacritics are used to clarify the correct pronunciations, meanings, and syntactic roles of words, specifically in the last letters of words (see Section III for more information). The word ''Elm,'' for instance, has several meanings: ''Ealamo'' (flag: noun), ''Eil∼omo'' (science: noun), and ''Ealima'' (he knows: verb).
Furthermore, the diacritic of the last letter in the word reveals its syntactic role. Consider the noun ''Ealamo'' (flag) in the following two sentences: (The country flag flapped.) (The soldier raised the country's flag.) In the first sentence, the word is a subject, so the diacritic of the last letter ''m'' is Damma '' '' because it is a singular noun. In the second sentence, the word is an object; hence, the diacritic of the last letter ''m'' is Fatha '' '' because it is a singular noun. Note that the diacritics of the first and second letters of the word do not change; otherwise, the meaning would also change.
The objective of any diacritization system is to restore the missing diacritics, that is, to assign the correct diacritics to the letters in the words of a given sentence either fully or partially as required. Diacritic restoration is not exclusive to classical Arabic [1], modern standard Arabic [4], or Arabic dialects [5], [6]. Several efforts are also being made in other languages such as Turkish [2], Romanian [3], and Yorùbá [7].
Restoring Arabic diacritics can be useful in many natural language processing and computational linguistics tasks. These tasks include text-to-speech applications [8], automatic speech recognition [9], homograph disambiguation [10], part-of-speech (POS) tagging [11], identifying the syntactic role of the words in sentences [12], and Arabic machine translation [13].
The existing systems for Arabic diacritic restoration typically consider the problem either within morphological disambiguation or as a standalone problem. In the latter case, most proposed systems are based on dictionaries and rules, language resources, or feature engineering approaches that employ linguistic information. In contrast, few efforts have been made based on data alone (see Section II).
In this study, we investigated the performance of a supervised deep learning approach, specifically, bidirectional long short-term memory (BiLSTM) with conditional random fields (CRFs) [14], for Arabic diacritization. The BiLSTM-CRF approach has been shown to be suitable for many natural language processing and text mining tasks that depend on sequence tagging, such as named entity recognition [15], POS tagging [16], and sentiment detection [17].
The contribution of this study is twofold. First, the proposed method of Arabic diacritic restoration does not employ any type of morphological analyzer, dictionary, rules, or any kind of feature engineering. It is solely based on data and is distinct from other Arabic diacritic restoration efforts that employ long short-term memory (LSTM) networks in that it uses the sequence of characters that constitute the sentence as input and their corresponding diacritics as output. The proposed approach does not use moving windows or one-hot vectors as input. Second, we investigated the performance of the proposed method using four datasets of different sizes covering different genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). To the best of our knowledge, no previous researchers have used such diverse datasets to evaluate their models.
The remainder of this article is organized as follows. Section II presents the related literature. Section III provides background information about Arabic diacritics, BiLSTM and CRF. Sections IV and V describe the datasets used in the experiments and the proposed method. Section VI presents the experimental results. Section VII compares the proposed system with other Arabic diacritization systems. Finally, Section VIII summarizes the conclusions.

II. RELATED WORK
This section discusses some of the recent studies examining full Arabic diacritization. Partial Arabic diacritization, for example, on case ending [12] and semantic disambiguation [10], is not addressed because it is outside the scope of this study.
The existing Arabic diacritization algorithms have used three approaches to represent data. The first mainly relies on morphological analysis. Habash et al., for example, used the Buckwalter and Standard Arabic morphological analyzers to list all possible analyses, including diacritization, for each word in the given sentences. To predict the correct morphological analysis from the generated analysis lists, they used different algorithms such as support vector machines (SVMs) [18]-[21], the J48 decision tree classifier with manually crafted rules [22], and LSTM [23]. Hussein et al. [24] also used the Buckwalter Arabic morphological analyzer with hidden Markov models (HMMs) for Arabic diacritization.
Other researchers have employed different morphological analyzers to solve the problem of Arabic diacritic restoration. Chennoufi and Mazroui [25] used the Alkhalil Arabic morphological analyzer with a set of rules and an HMM. In addition, Said et al. [26] developed a statistical- and rule-based morphological analyzer and a rule-based algorithm.
The second approach to represent the data in Arabic diacritization relies on feature engineering. The objective of this approach is to supply the learning algorithm with the most representative features of the data to help the algorithm learn more effectively.
The most commonly used features in Arabic diacritization are linguistic features. For instance, Zitouni et al. [27] utilized lexical, segment-based, and POS features with a statistical approach based on maximum entropy. To train deep neural networks, Rashwan et al. [28] employed linguistic features, namely, POS tags, prefixes, suffixes, roots, and patterns, alongside other features related to word context, such as previous character, last character, previous word, and last word.
Based on language models built on the surface forms of words, morphologically segmented words, and the character level, Mubarak et al. [6] used a Viterbi search to choose the most probable diacritization for an input word in a sentence.
Similarly, Darwish et al. [29] utilized a Viterbi decoder at the word level with word stems, morphological patterns, and transliteration for the core word diacritization and SVM-based ranking with morphological patterns and linguistic rules to determine the proper case ending of the word.
The third approach relies solely on data, specifically, on the characters and words that compose the sentence. This approach is mainly used with deep learning algorithms, especially recurrent neural networks (RNNs) and their variants, LSTM and BiLSTM, because of their ability to handle sequential data such as text. Our work falls into this category.
Abandah et al. [30] employed a deep neural network built by stacking a set of BiLSTM layers to capture the dependency of the diacritization of the current word on both the previous and following words. This architecture allows the decision to be more context-aware. Subsequent error correction techniques were used to improve the results. The network was trained in either a one-to-one manner (in which the input length equals the output length) or a many-to-many manner (in which the output is longer than the input); empirically, the one-to-one architecture outperformed the many-to-many architecture in several tests.
Belinkov and Glass [31] focused on using different types of neural networks to build a language-independent method of restoring missing diacritics. They undertook numerous experiments for various hidden layer types of neural networks, from a single feed-forward neural network (FFNN) layer to BiLSTM layers. Their results suggest that the use of stacked layers of BiLSTM (three layers) yields better results than FFNNs and LSTM.
Mubarak et al. [32] readjusted a standard sequence-to-sequence neural machine translation setup based on an LSTM architecture to formulate a unified model for Arabic diacritic restoration. Their model operates by training on a fixed-length sliding window of n words, which are expressed in terms of their individual characters, followed by a voting mechanism to select the most suitable diacritized form for a given word.
The performances of two convolutional neural networks, namely, the temporal convolutional neural network (TCN) and acausal temporal convolutional neural network (A-TCN), were compared with two RNNs, namely, LSTM and BiLSTM, for Arabic diacritization [33]. The results showed that BiLSTM outperforms LSTM and A-TCN outperforms TCN, and the best model among the four is BiLSTM.
Using the same data representation approach, Khorsheed [34] investigated the use of an HMM in conjunction with a character-based 4-gram model to facilitate the selection of the most suitable diacritization for a word. Every individual diacritic has a separate HMM model that concatenates the output models alongside the input character sequence to determine the final diacritized sentence.
The state-of-the-art results for ATB part 3 are a diacritic error rate (DER) of 1.6% [6] and a word error rate (WER) of 9.07% [30]. For the three parts of the ATB, the state-of-the-art results are a DER of 2.8% and a WER of 8.2% [33]. The method of Abandah et al. [30] yielded the best results for the Holy Quran, achieving a DER of 3.04% and a WER of 8.7%. Mubarak et al. [32] reported a WER of 4.49% using a 9.7-million-word manually diacritized and revised corpus provided by a commercial vendor.
Building Arabic diacritization systems that rely on the first and second approaches, that is, morphological analyzers and feature engineering, entails additional costs in practice. These approaches require the morphological analyzers and/or feature extraction tools used to build the systems, which in most cases are available only for a fee or are not available to the public research community at all. The third approach, which relies solely on data, requires only diacritized texts, which are freely available, along with powerful machine learning packages that are also freely available.

III. BACKGROUND

A. ARABIC DIACRITICS
Arabic diacritics can be classified into four groups. The first consists of the short vowels, namely, Fatha '' ,'' Damma '' ,'' and Kasra '' .'' The second group consists of Sukun '' ,'' which indicates that the consonant is not followed by a vowel. The third group consists of the Tanwin diacritics: Tanwin Fatha '' ,'' Tanwin Damma '' ,'' and Tanwin Kasra '' .'' Tanwin diacritics are used only with the last letter of a word if needed. They indicate that the vowel is followed by the consonant ''n,'' which is pronounced but not written in that particular case. Finally, the fourth group consists of Shadda '' ,'' which indicates gemination, i.e., the doubling of a consonant letter. Shadda can be utilized with all short vowels, Sukun, or Tanwin diacritics. These four groups include 15 diacritics. Table 1 summarizes these diacritics along with their Buckwalter transliterations when used with the letter ''m.''
In fully diacritized texts, the following diacritics are often neglected:
1) The diacritics for ''lam shamsiah,'' the second letter of the definite article when it is written but not pronounced. A well-known example is the word ''Al$ms'' (the sun), in which this letter is not pronounced; hence, no diacritic is assigned.
2) Similar to case 1, the diacritic for ''lam qamariah,'' the second letter of the definite article when it is both written and pronounced, is not written, as in the word ''Alqmr'' (the moon). However, its correct diacritic '' '' is occasionally assigned.
3) The diacritics for Alif maqsourah ''Y'' and Alif madd ''='' are ignored.
4) The diacritic for the letter ''A'' is not added when it occurs in the middle of a word.
5) Diacritics that are part of the long vowel ''w'' are not added.
6) Diacritics that are part of the long vowel ''y'' are not added.
7) Very occasionally, the diacritic for the letter ''<'' is neglected because it must be Kasra.
The omission of diacritics is not consistent across diacritized texts. This inconsistency does not affect readers' ability to understand such texts because all neglected diacritics can easily be inferred. The case is different for machines, however: inconsistent diacritization often causes models trained on such data to perform poorly.
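For illustration, the 15-tag inventory implied by the four groups above can be enumerated in a few lines, here using the Buckwalter symbols from Table 1. This is a sketch, and the variable and function names are our own, not the authors':

```python
# Buckwalter symbols for the four diacritic groups described above.
SHORT_VOWELS = ["a", "u", "i"]   # Fatha, Damma, Kasra
SUKUN = ["o"]
TANWIN = ["F", "N", "K"]         # Tanwin Fatha, Tanwin Damma, Tanwin Kasra
SHADDA = "~"

def diacritic_tags():
    """Return the 15 diacritic tags: 8 single marks plus
    Shadda combined with each short vowel, Sukun, or Tanwin mark."""
    singles = SHORT_VOWELS + SUKUN + TANWIN + [SHADDA]            # 8 tags
    combos = [SHADDA + d for d in SHORT_VOWELS + SUKUN + TANWIN]  # 7 tags
    return singles + combos

tags = diacritic_tags()
print(len(tags))  # 15
```

Note that the rarely occurring combinations Shadda with Sukun and Shadda with Tanwin Fatha (see Section IV) are members of this inventory.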

B. BiLSTM NETWORKS
The function of any classification algorithm is to map the input data X_t to a certain output Y_t. One limitation of FFNNs is that the transformation from X_t to Y_t does not have access to the previous output Y_{t−1}. Recurrent neural networks (RNNs), in which the output of each cell y_t depends not only on its input x_t but also on the output of the previous cell y_{t−1}, partially overcome this limitation. In the literature, the output y_t is usually denoted by h_t.
The ability to remember the previous output only (short-term memory) has proven to be efficient for several tasks, such as predicting the last word in a sentence. However, in some cases, it is necessary to remember the previous outputs (long-term memory) to produce h t . LSTM networks solve this problem because they are able to learn the long-term dependencies between the inputs and outputs.
An LSTM cell receives three inputs, namely, the output of the previous cell h_{t−1}, the cell state of the previous cell C_{t−1}, and the input to the cell from the previous layer x_t. It then produces two outputs: the cell output h_t and the cell state C_t [37], [38]. Figure 1 illustrates the architecture of an LSTM cell, which controls the following information:
a. Which information the cell passes to the next cell: This information is controlled using a ''forget gate.'' The output of the forget gate f_t is given by

f_t = σ(W_f · [h_{t−1}, x_t] + b_f). (1)

b. The information that is stored in the cell state: This information is based on the output of the input gate i_t and the candidate cell state C̃_t, which can be expressed as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i), (2)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C). (3)

The cell state C_t, which is passed to the next cell, is given by

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t. (4)

c. The information produced as an output of the cell: The cell output h_t is based on the output of the output gate o_t and the current cell state C_t, where

o_t = σ(W_o · [h_{t−1}, x_t] + b_o) (5)

and

h_t = o_t ⊙ tanh(C_t). (6)

Like the current cell state C_t, the cell output h_t is passed to the next cell. Here, W_f, W_i, W_C, and W_o are weights; b_f, b_i, b_C, and b_o are biases; and σ is the sigmoid function.
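The gate equations above can be traced numerically with a single-unit (scalar) LSTM cell. The following sketch uses hypothetical toy weights purely for illustration; a real layer applies the same computation with weight matrices over vectors:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a single-unit LSTM following the gate equations.
    W maps each gate ('f', 'i', 'C', 'o') to a (w_h, w_x) weight pair
    acting on [h_{t-1}, x_t]; b maps each gate to its bias."""
    f_t = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t + b["f"])   # forget gate
    i_t = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t + b["i"])   # input gate
    c_hat = math.tanh(W["C"][0] * h_prev + W["C"][1] * x_t + b["C"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat                               # new cell state
    o_t = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t + b["o"])   # output gate
    h_t = o_t * math.tanh(c_t)                                     # new cell output
    return h_t, c_t

# Toy weights (assumed values, for demonstration only).
W = {g: (0.5, 0.5) for g in "fiCo"}
b = {g: 0.0 for g in "fiCo"}
h, c = lstm_step(1.0, 0.0, 0.0, W, b)
```

Both h_t and C_t are then fed into the cell at time t + 1, which is how the network carries long-term information forward.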
In a BiLSTM network, each BiLSTM cell consists of two LSTM cells. One LSTM cell processes the input data from right to left, and the other processes the input data from left to right. The outputs of the two LSTM cells (the forward and backward hidden states) are concatenated to produce the output of the BiLSTM cell. This architecture allows the network to benefit from both the left and right context to produce the output for the current instance. Figure 2 shows the architecture of the BiLSTM network.

C. CRF
CRFs constitute a class of probabilistic undirected graphical models. A CRF represents the conditional density of a set of class labels Y_t given a set of observations X_t, and the density is factorized according to the graph [39]. One of the most widely adopted instances of CRFs is the linear-chain conditional random field (L-CRF), which is considered in this article. The density of an L-CRF, shown in Fig. 3, is defined over a sequence of class labels Y = {y_t}_{t=0}^{n} (empty ovals) given a sequence of observations X = {x_t}_{t=0}^{n} (filled ovals) as follows:

p(Y | X) = (1 / Z(X)) ∏_{t=1}^{n} exp( Σ_k λ_k f_k(y_{t−1}, y_t, X, t) ), (7)

where Z(·) is a normalizing factor, the f_k are feature functions, and the λ_k are their weights. As shown in (7), the density models the dependency of the class labels not only on the observations but also on the other class labels [40]. This property enables relational data to be processed: assigning a class label y_t depends on the features x_t and the context y_{j≠t}. L-CRFs are widely adopted in natural language processing, where tokens, whether they are letters, words, morphemes, phonemes, or diacritics, appear as sequences, and each mostly depends on its context. They have been used successfully in many language processing tasks such as POS tagging [41] and named entity recognition [42].
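For intuition, decoding in a linear-chain CRF is typically performed with the Viterbi algorithm, which finds the label sequence maximizing the sum of per-position emission scores and label-to-label transition scores (working in log space, so scores add). The following is an illustrative pure-Python sketch, not the CRF layer implementation used later in this article:

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.
    emissions[t][j]: score of label j at position t (log-space).
    transitions[i][j]: score of moving from label i to label j."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score ending in each label at t = 0
    back = []                    # backpointers for each later position
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    # Backtrack from the best final label.
    y = [max(range(k), key=lambda j: score[j])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return list(reversed(y))
```

With "sticky" transition scores that reward staying in the same label, the decoder can prefer a consistent labeling even when an individual emission score disagrees, which is exactly the context sensitivity described above.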

IV. DATA
To investigate the performance of the BiLSTM-CRF for automatic Arabic diacritization, we used four datasets of different sizes covering various genres, collected from KACST TTS, the Holy Quran, Sahih Al-Bukhary, and the ATB.
KACST TTS [43] is the smallest corpus among our datasets. It consists of randomly selected sentences from different genres such as news, finance, customer service, and literature and was manually diacritized and revised.
The Holy Quran is the most analyzed text in Arabic because it is, as Muslims believe, the revelation from Allah to the Prophet Mohammad. Sahih Al-Bukhary is one of the two major collections of Hadith, or traditions relating to the Prophet and his companions in Islam. All official releases of the Holy Quran and Sahih Al-Bukhary, whether printed or in an electronic format, are fully diacritized and thoroughly revised to facilitate reading and to avoid ambiguity in meaning. Thus, they are ideal datasets for Arabic diacritization. We downloaded the Holy Quran and Sahih Al-Bukhary from the Tanzil and Shamela websites, respectively.
ATB [44] is the largest corpus among our datasets. Three parts of the ATB were used in this study: part 1 (v 3.0), part 2 (v 2.0), and part 3 (v 2.0). The ATB consists of modern standard Arabic texts from the news domain and has been used in numerous studies on Arabic computational linguistics. Several cases of inconsistent diacritization have been identified in the ATB, and numerous reports have noted such observations [6], [29].
To unify the structures of our datasets, which came from different sources, we performed a few preprocessing steps:
a. We divided each dataset into a set of sentences, using ''.,'' ''?,'' and ''!'' as delimiters. For the Holy Quran, however, each verse was treated as a sentence.
b. Extra spaces were removed from each sentence.
c. The positions of the diacritics were unified, with each diacritic inserted directly after the corresponding letter.
d. When Shadda was used with other diacritics, the sequence was unified so that the Shadda came first, followed by the other diacritic.
Table 2 summarizes the basic statistics for the datasets used in this study. We counted the number of sentences based on our sentence segmenter.
To prepare our data for training, we assigned a tag to each character in each sentence. The tag was either one diacritic (a short vowel, Sukun, Shadda, or Tanwin diacritic), two diacritics (Shadda with a short vowel, Sukun, or Tanwin diacritic), or one of two special tags, ''_'' and ''!.'' The special tag ''_'' indicates that the diacritic is missing, while the special tag ''!'' indicates that the character is not an Arabic letter and hence that no diacritic can be assigned to it. We split each dataset randomly into three parts: training (60%), validation (20%), and testing (20%).
Table 3 lists the percentages of Arabic letters with and without diacritics for each corpus in our dataset. The data show that about 74% of all the letters in the overall dataset have diacritics. About 80% of the letters are diacritized in the Al-Bukhary text. In contrast, only 61% of the letters are diacritized in the ATB.
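The character-level tagging scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name is our own, the Arabic letter range U+0621-U+064A is a simplification, and the diacritic range U+064B-U+0652 covers the eight basic marks:

```python
# Arabic combining diacritics occupy Unicode U+064B..U+0652; Shadda is U+0651.
DIACRITICS = set(chr(c) for c in range(0x064B, 0x0653))
SHADDA = "\u0651"
# Simplified Arabic letter range (assumption; real data needs more care).
ARABIC_LETTERS = set(chr(c) for c in range(0x0621, 0x064B))

def char_tags(sentence):
    """Return (character, tag) pairs for one sentence:
    '!' for a non-Arabic character, '_' for an Arabic letter with no
    diacritic, otherwise the letter's diacritic(s) with Shadda first
    (preprocessing step d above)."""
    pairs = []
    for ch in sentence:
        if ch in DIACRITICS and pairs and pairs[-1][0] in ARABIC_LETTERS:
            letter, tag = pairs[-1]
            marks = ("" if tag == "_" else tag) + ch
            if SHADDA in marks:  # unify order: Shadda comes first
                marks = SHADDA + marks.replace(SHADDA, "")
            pairs[-1] = (letter, marks)
        elif ch in ARABIC_LETTERS:
            pairs.append((ch, "_"))   # letter seen, diacritic (so far) missing
        else:
            pairs.append((ch, "!"))   # not an Arabic letter
    return pairs
```

Feeding each sentence through such a routine yields the aligned character and tag sequences used as the model's input and target.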
After reviewing the datasets, we found that about 40% of the non-diacritized letters are associated with the definite article ''Al.'' If the second letter ''l'' is lam shamsiah, no diacritic is assigned because it is not pronounced. If the second letter is lam qamariah, the diacritic must be Sukun; however, this diacritic is neglected, specifically in the ATB corpus. One clear case of neglected diacritics in the ATB is the diacritic of a letter in the middle of a word that must be Fatha and followed by '' .'' In this case, the diacritics for both letters were neglected. The neglect of diacritics negatively affects the accuracy of the diacritization model, especially when the diacritics are not neglected consistently.
Table 4 summarizes the distributions of all the diacritics in the datasets. The most frequent diacritic in all datasets is Fatha, followed by Kasra; together, they represent approximately 63% of the total. Sukun, Damma, and Shadda with Fatha come next, representing approximately 32% of all diacritics. The remaining 10 diacritics all occur infrequently in the dataset, with less than 2% frequency for each tag. The unbalanced distribution of diacritics will probably affect the diacritization accuracy; that is, the system will restore frequent diacritics more accurately than infrequent ones.
The diacritic distributions reveal that the KACST TTS corpus is the only corpus that contains the diacritization Shadda with Sukun '' .'' All of these cases (47 cases) occur at the ends of sentences and on the letter . The reason for the occurrence of '' '' in the KACST TTS corpus is that the corpus was diacritized specifically for the TTS application. In Arabic, there is a rule stating that ''Arabic speech does not start with a consonant and does not pause with a vowel.'' This rule may explain why the Sukun diacritic percentage (20%)  is higher in the KACST TTS corpus than in the other corpora. Additionally, Table 4 shows that the KACST TTS and Sahih Al-Bukhary corpora are the only corpora that contain the diacritic Shadda with Tanwin Fatha.
In Arabic, the diacritic of the last letter of a word usually has a special property: it is not always fixed. This property originates from the fact that, in most cases, the same word can have different diacritics for the last letter based on its syntactic role in the sentence. Exceptions occur in a set of words known as ''Mabniah,'' whose phonological structures are fixed regardless of their syntactic roles. In the former case, if the word is a singular noun, for instance, and it is the subject in a verbal sentence, the last diacritic should be Damma. If the same word is the object, the last diacritic should be Fatha. In the data, the most frequent diacritics for the last letter are Kasra (20%), Fatha (18%), and Damma (13%).

V. METHOD
Our model is based on the BiLSTM-CRF. It consists of six consecutive layers: an input layer, a character embedding layer, a BiLSTM layer, a time-distributed layer, a CRF layer, and an output layer. Figure 4 illustrates the architecture of the proposed model.
1) Input layer: The model accepts its input (i.e., a sentence) as a sequence of characters through the input layer. Because the length of the input varies from sentence to sentence, the length of the input given to the model must be fixed to a predefined value (max_len). When an input sentence is longer than max_len, any character beyond this limit is truncated. When it is shorter than max_len, the input sequence is padded with ENDPAD for each position exceeding the actual length of the sentence. Each character in the training data has a unique numerical representation. The output of this layer is fed to the next layer: the character embedding layer.
2) Character embedding layer: Arabic is a morphologically rich language, and word prefixes such as the letters ''b'' and ''l,'' for example, affect the diacritic of the last letter in the word. Several studies have shown that character embedding is useful for many natural language processing tasks and deals well with the out-of-vocabulary problem and morphologically rich languages [45]-[47]. The character embedding layer accepts the input via the input layer and produces a vector representation with a predefined length for each character in the training data. We used a vector of length 128 to represent each character in the training dataset.
3) BiLSTM layer: The input to this layer is a set of embedded characters. We set the number of units to 128, and the layer is set to return all sequences, that is, the BiLSTM representation of each character in the input sentence (h_0 to h_n). This representation has been proven effective for making independent tagging decisions [44]. The encoded sequence is passed to the CRF through the next layer (the time-distributed layer). The recurrent dropout is set to 0.4 to avoid over-fitting. The hyperbolic tangent (tanh) serves as the activation function for this layer. Several BiLSTM layers could be stacked instead of a single layer; however, this configuration complicates the model and slows the training process.
4) Time-distributed layer: The time-distributed layer applies one layer to every element of the BiLSTM sequence output independently and controls the dimension of the data input to the CRF. The output dimension of this layer was set to 128. As in the BiLSTM layer, tanh serves as the activation function.
5) CRF layer: The CRF algorithm is well known for its performance in sequence labeling. We chose the L-CRF architecture for this layer to complement the BiLSTM as the classification layer. The CRF layer increases the overall complexity of the model, specifically its time complexity, which is largely affected by the input and output sequence lengths. The time complexity of the linear-chain CRF is O(l·|y|²), where |y| is the size of the label set and l is the length of the input sequence.

6) Output layer: The target of our model is a sequence of diacritics corresponding to the input sequence, where each diacritic in the training data is represented as a one-hot vector. The length of this vector is equal to the number of diacritics in the training dataset plus the two special tags used to indicate that the character is not an Arabic letter (and hence cannot be diacritized) or that the diacritic for the corresponding Arabic letter is missing from the dataset.
We implemented the proposed model in Python 3.7 using the Keras library (with a TensorFlow backend) version 2.2.4 and the Keras_contrib library version 0.0.2. To train our model, we used Adam as the optimization function with the default parameters provided by the Keras library (learning rate = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-07, amsgrad = False) and crf.loss_function from Keras_contrib as the loss function. The batch size was 32, and the number of epochs was set to 200 with early stopping if the loss did not improve within 40 epochs. For the other training parameters, we used the default settings provided by the aforementioned libraries; for example, the kernel initializer was ''glorot_uniform,'' and the bias initializer was ''zeros.'' We ran our experiments on a PC with an Intel Core i7-8750H CPU (2.2 GHz) and 16 GB RAM. We did not use a GPU.
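As a rough sketch of the input/target preparation described above, truncation to max_len, ENDPAD padding, and one-hot encoding of the tags might look like the following. The function name, the choice of index 0 for ENDPAD, and the padding tag are our assumptions; in practice, Keras' `pad_sequences` and `to_categorical` utilities perform these steps:

```python
ENDPAD = 0  # index assumed to be reserved for the padding symbol

def prepare(seq_ids, tag_ids, max_len, n_tags):
    """Truncate or pad a sentence's character indices to max_len,
    and one-hot encode the corresponding tag indices."""
    x = (seq_ids + [ENDPAD] * max_len)[:max_len]   # pad, then cut to max_len
    y = []
    for t in (tag_ids + [0] * max_len)[:max_len]:
        row = [0] * n_tags                          # one-hot vector per tag
        row[t] = 1
        y.append(row)
    return x, y

x, y = prepare([5, 9, 2], [1, 0, 3], max_len=5, n_tags=4)
```

The resulting x feeds the input layer, and y serves as the one-hot target sequence for the output layer.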
To measure the performance of the resulting models, we used three measures, namely, the DER, WER, and SER. The DER is the percentage of characters that are incorrectly diacritized. The WER is the percentage of words that are incorrectly diacritized; an entire word is considered incorrectly diacritized if at least one of its characters is incorrectly diacritized. The SER focuses on the last letter of the word and is the percentage of words for which the last letter is incorrectly diacritized.
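Under these definitions, the three measures can be computed as in the following sketch. The function name is hypothetical, and the gold and predicted per-character tag sequences are assumed to be aligned:

```python
def der_wer_ser(gold_sents, pred_sents):
    """Compute DER over characters, WER over words, and SER over
    last-letter diacritics, all as percentages. Each sentence is a
    list of words; each word is a list of per-character diacritic tags."""
    chars = char_err = words = word_err = last_err = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for gw, pw in zip(gold, pred):
            words += 1
            mism = [g != p for g, p in zip(gw, pw)]
            chars += len(gw)
            char_err += sum(mism)
            word_err += any(mism)          # one wrong character fails the word
            last_err += gw[-1] != pw[-1]   # last-letter (SER) error
    return (100.0 * char_err / chars,
            100.0 * word_err / words,
            100.0 * last_err / words)
```

Because any single character error also fails its word, WER is always at least as large as DER on the same data, which matches the relative magnitudes reported in Table 5.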

VI. RESULTS
We conducted 20 experiments on each dataset to test the performance of the proposed method. First, we examined the effect of the input length, increasing the length of the input sequence in steps of 50 characters, from 250 to 700 characters. For each input length, we set the dropout value to 0.3 and 0.4; the results suggest that a dropout of 0.4 yields marginally better results. Table 5 lists the DER, WER, and SER for all datasets and for the combined Holy Quran, Sahih Al-Bukhary, and ATB dataset (Q + BUKH + ATB), along with the input length that yielded the best results. The best performance is obtained on the second smallest dataset, the Holy Quran, followed by the third smallest, Sahih Al-Bukhary. These results suggest that the proposed method is considerably affected by the data quality. The Holy Quran is the most revered, revised, and studied text in Arabic, so its diacritization quality is the best, followed by that of Sahih Al-Bukhary.
Although the ATB dataset is larger than the Holy Quran and Sahih Al-Bukhary datasets, the diacritization performance is worse for this dataset than for the Holy Quran and Sahih Al-Bukhary datasets. This phenomenon may be due to the quality of ATB diacritization (see Section IV).
The method produced worse DER, WER, and SER results when the combined dataset (Q + BUKH + ATB) was employed than when the individual datasets were used. This low performance may be due to the inconsistency between the diacritization approaches in the three datasets and the differences between the genres of the datasets. The proposed method performed the worst on the KACST TTS dataset because of the small size of this dataset and the diversity of its topics.
Although different training/testing splits were used, the DER and WER results for the Holy Quran are superior to those achieved by Abandah et al. [30] (3.04% DER and 8.7% WER), and the DER results for KACST TTS are better than those of Khorsheed [34] (26.87%).
We analyzed the errors produced by the models at the sentence level by counting the number of words that each model failed to predict correctly. We found that 50.9%, 74.3%, 29.7%, 28.0%, and 21.3% of the sentences from the testing data were diacritized with no error by the KACST TTS, Holy Quran, Sahih Al-Bukhary, ATB, and Q + BUKH + ATB models, respectively. Although the KACST TTS model performed worse than the other models in terms of DER and WER, it outperformed all models except the Holy Quran model in terms of the percentage of sentences diacritized without error. The fact that the KACST TTS model performed the worst in terms of DER and WER may be due to the small sizes of its training and testing datasets; in particular, DER and WER are largely affected by the size of the testing data. Table 6 lists the percentages of sentences with one or more word errors among all erroneous sentences for each test dataset. The data show that most erroneous sentences have only one word error, followed by sentences with two, three, four, and more than four word errors. Table 7 presents the percentages of errors among the diacritics for each test dataset. In general, as expected for any classification problem, the data suggest that diacritics that appear frequently in the corpus have lower error percentages than diacritics that appear less frequently (see Table 4). The data also show that the models were able to correctly assign no diacritics to non-Arabic letters; note that we did not use any post-processing rules in this case. Moreover, the lowest error percentage for all testing datasets occurs for the special tag ''_,'' which denotes that the diacritic is missing.
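The Table 6-style analysis, bucketing sentences by their number of word errors, can be sketched as follows. The same assumed list-of-lists layout as above (sentences of words of per-character diacritic labels) is used for illustration only.

```python
from collections import Counter

def word_error_histogram(ref_sents, hyp_sents):
    """Map 'number of word errors in a sentence' -> 'number of such sentences'.

    A word counts as erroneous if any of its per-character diacritic
    labels differs from the reference. Bucket 0 holds error-free sentences.
    """
    hist = Counter()
    for ref, hyp in zip(ref_sents, hyp_sents):
        errs = sum(
            any(r != h for r, h in zip(rw, hw))
            for rw, hw in zip(ref, hyp)
        )
        hist[errs] += 1
    return hist
```

The percentage of error-free sentences reported above corresponds to bucket 0 divided by the total number of sentences.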
We randomly selected 100 words with diacritization errors from each testing dataset to find the common error types. Table 8 shows the numbers of words with one, two, three, and more than three diacritization errors. The data suggest that all of the trained models behave similarly regarding the number of diacritic errors per word: most words (84 on average) have one diacritic error, followed by words with two diacritic errors (12.6 on average). Words with three or more errors are very rare (3.4 on average); one word each in the Holy Quran and Q + BUKH + ATB samples had four diacritic errors, and one word in the ATB sample had seven. These observations indicate the consistency of the model performance regardless of the type of data.
In the error samples, diacritization errors occurred at the beginning, middle, or end of the words; concurrently at all of these places; or at two of them. As Table 9 illustrates, the trained models are sensitive to the syntactic roles of the words because most of the errors occurred at the ends of the words (50.2 on average), except in the case of the Holy Quran, where most of the errors occurred in the middle of the words. The high accuracy for the Holy Quran is probably due to its diacritization quality and the short length of its sentences compared to those in the other datasets. Table 10 shows the distributions of diacritization errors across the three main Arabic POS categories: nouns, verbs, and particles. The data indicate that the most errors occurred on nouns, followed by verbs, for each sample. This tendency is as expected because nouns occur more frequently than verbs in Arabic. However, the difference between nouns and verbs varies considerably among the error samples: the maximum difference occurs for the KACST TTS sample and the minimum for the Holy Quran sample.
Diacritization errors on particles are rare or even absent because, despite their frequent usage in the language, particles have fixed diacritization; the errors that do occur are due to orthographic similarity with other words in Arabic, such as ''mino'' (from) and ''mano'' (who).
The main error pattern for verbs is confusion between passive and active verbs. This error type constitutes 31%, 44%, 43%, and 38% of the verb errors in the Holy Quran, Sahih Al-Bukhary, ATB, and Q + BUKH + ATB error samples, respectively. For KACST TTS, the main error pattern is confusion between present and past tense verbs. Only one instance of confusion between passive and active verbs was found in the KACST TTS error sample, possibly because passive verbs rarely occur in the KACST TTS dataset.
We investigated the error patterns for each error sample and found that the error samples generally share the same error patterns. The main error patterns for nouns in the error samples include confusion between different morphological patterns, syntactic confusion regarding the second word of the genitive construction, confusion between verb subject and verb object, named entities, and not assigning diacritics to the letter ''A'' at the beginning of a definite noun. We found the last error pattern in the KACST TTS error sample only. Additionally, we did not find any errors regarding named entities in the Holy Quran error sample. Adding syntactic features may help eliminate such errors. Table 11 summarizes the distributions of these noun error patterns in the error samples.
Additionally, for the KACST TTS, ATB, and Q + BUKH + ATB error samples, we noticed instances in which the models predicted the correct diacritization of letters when the diacritization was missing from the original sentence and the predicted diacritization was possible in Arabic. Table 12 lists the numbers of such instances in each error sample along with corresponding examples.

VII. COMPARISON
We employed two approaches to compare the proposed method with other available methods. The first approach involved comparing the performance results of the proposed method with those of other methods using the same data and data splits that have been utilized in previous studies. We used the three parts of the ATB (parts 1, 2, and 3) and the same split as Diab et al. [35] to compare the performance of the proposed method with those of two methods that, to the best of our knowledge, achieved state-of-the-art published results for Arabic diacritization, namely, the methods of Zalmout and Habash [23] and Alqahtani et al. [33].  Table 13 lists the results. Our proposed method achieved the best DER (2.34%) but yielded the worst WER among the considered methods. However, when we used random splits for training, validation, and testing (see Section VI), the proposed method produced a better WER (8.43%) and an even better DER (2.13%).
In the second comparison approach, we compared the resulting models (see Section VI) with three well-known and publicly available Arabic diacritization systems, namely, MADAMIRA 3 (morphological analysis-based), Farasa 4 (feature engineering-based), and the Belinkov and Glass model 5 (data-based). We based our comparison on three manually diacritized texts from three different genres with various writing styles and sentence structures. These texts were a Friday sermon 6 (1,854 words), a children's story 7 (927 words), and a classical Arabic poem 8 by the 10th-century poet Al-Mutanabbi (399 words).
Each text was divided into sentences, as described in Section IV. However, for the sentences that exceeded the maximum length of the model input, extra splits were performed to ensure that every sentence was within the length accepted by the model. To compare the performances of the models, a non-diacritic version of the testing dataset was also used.
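The extra splitting of overlong sentences can be done greedily at word boundaries so that no chunk exceeds the model's maximum input length. The sketch below is an assumed implementation (the paper does not specify the splitting algorithm); the hard-split fallback for a single word longer than the limit is likewise an assumption.

```python
def split_to_max_len(sentence, max_len):
    """Greedily pack words into chunks of at most max_len characters
    (spaces included), so every chunk fits the model's input length."""
    chunks, cur = [], ""
    for word in sentence.split():
        cand = word if not cur else cur + " " + word
        if len(cand) <= max_len:
            cur = cand
        else:
            if cur:
                chunks.append(cur)
            cur = word
        # fallback: a single word longer than max_len is hard-split
        while len(cur) > max_len:
            chunks.append(cur[:max_len])
            cur = cur[max_len:]
    if cur:
        chunks.append(cur)
    return chunks
```

Splitting at word boundaries preserves the local context each character needs for diacritic prediction, whereas splitting mid-word would separate a character from the rest of its word.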
We faced the following issues with the outputs of the three diacritization systems.

A. FARASA
The insertion of additional letters into the middle of words, the removal of some punctuation, and letter substitutions in which certain letters were changed to other letters.

B. MADAMIRA
Letter substitutions; the conversion of punctuation into letters; the insertion of question marks into the middle of words; and the insertion of spaces between punctuation and letters.

C. BELINKOV AND GLASS MODEL
The insertion of question marks into the middle of words, the diacritization of punctuation, the insertion of extra characters and new lines, and the replacement of some letters.

To compare the performances of our models with those of the other systems fairly, the resulting outputs should be standardized in terms of character order, and the sentence division must be identical to that of the original testing text. Therefore, we manually edited the diacritized text of each model by removing unnecessary insertions of letters and diacritics. Additionally, we considered only the Arabic characters in our DER, WER, and SER calculations, discarding non-Arabic letters and non-alphabetic characters such as numbers and punctuation. Tables 14, 15, and 16 present the performances of all models when applied to the Friday sermon text, the children's story, and the classical Arabic poem, respectively. As expected, the error rates are high for both our models and the other systems because the data used for comparison were not from the same discourses on which the models were trained.
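Restricting the error-rate calculations to Arabic characters amounts to filtering by Unicode range. A minimal sketch follows; the exact ranges the authors used are not stated, so the basic Arabic letter range U+0621–U+064A is an assumption (it excludes digits, punctuation, Latin letters, and the combining diacritics at U+064B–U+0652).

```python
def is_arabic_letter(ch):
    # Basic Arabic letters: hamza (U+0621) through yeh (U+064A).
    # Combining diacritics (U+064B-U+0652) fall outside this range.
    return "\u0621" <= ch <= "\u064A"

def strip_for_scoring(text):
    """Keep only Arabic letters; drop diacritics, digits, punctuation, Latin."""
    return "".join(ch for ch in text if is_arabic_letter(ch))
```

Applying this filter to both the reference and the system output before alignment ensures that extra punctuation or inserted non-Arabic characters do not distort the DER, WER, and SER.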
The results indicate that our models outperformed the other systems when applied to the Friday sermon in terms of the DER and WER, where the Q + BUKH + ATB-based model achieved the best DER and WER of 9.02% and 27.65%, respectively, followed by the Sahih Al-Bukhary-based model. In addition to the prediction capabilities of the deep learning approaches, this performance can be ascribed to the genre similarity between the Friday sermon and the data from the Q + BUKH + ATB- and Sahih Al-Bukhary-based models, which were trained on religious texts. Farasa achieved the best SER (14.9%), perhaps because it was trained to determine the proper word case endings. However, Farasa was followed by the Q + BUKH + ATB-based model with a marginal difference (15.11%).
For the children's story text, the Holy Quran-based model achieved the best DER (24.84%) and SER (31.89%) but came in second in terms of the WER (62.00%), with a marginal difference from the KACST TTS-based model (61.88%). Because the sentences in children's stories are generally short, the performance of these two models can be ascribed to the short sentences in their training data.
For the classical Arabic poem, the performances of all models are not encouraging. The poor results may be due to the tendency of Arabic poetry not to repeat words more than once in the same poem and the extensive use of metaphors. These two characteristics would have caused the models to face two problems: out-of-vocabulary situations and new contexts for words not seen before. Any model may face such problems; however, these issues are especially pronounced in poetry texts. Farasa achieved the best results for the classical Arabic poem in terms of the DER (34.47%) and WER (74.87%), followed by the Q + BUKH + ATB-based model (39.89% and 85.68%, respectively), whereas the Sahih Al-Bukhary-based model achieved the best SER (32.66%) followed by the Q + BUKH + ATB-based model (33.42%). The good performance of Farasa may have resulted from the large amount of training data (9.7 million tokens) and their diversity [32].
We used the McNemar test [48], [49], also known as the within-subjects chi-squared test, to check whether the differences in DER between the best performing model and other models were statistically significant. We selected the DER because our approach is based on character diacritization.
Suppose we have a 2 × 2 confusion matrix of paired predictions, where b denotes the number of characters that the first model diacritized correctly and the second model diacritized incorrectly, and c denotes the number that the first model diacritized incorrectly and the second model diacritized correctly. The McNemar test statistic can then be computed from the two discordant counts as χ² = (b − c)² / (b + c). The significance threshold for the test is a p value of 0.05, which corresponds to a 95% confidence interval; that is, the McNemar test statistic should be equal to or greater than 3.841. Table 17 illustrates the McNemar test statistic values for the best performing model and the other models for each testing set.
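Because the McNemar statistic depends only on the two discordant counts, it is straightforward to compute. The sketch below uses the version without Yates' continuity correction (the paper does not specify which variant was used, so this is an assumption).

```python
def mcnemar_statistic(b, c):
    """McNemar chi-squared statistic from the discordant counts:
    b = items model A got right and model B got wrong,
    c = items model A got wrong and model B got right.
    Without continuity correction: (b - c)^2 / (b + c)."""
    if b + c == 0:
        return 0.0  # no disagreements: no evidence of a difference
    return (b - c) ** 2 / (b + c)

# chi-squared critical value at p = 0.05 with 1 degree of freedom
CRITICAL_05 = 3.841
```

A statistic of at least 3.841 means the difference between the two models is significant at the 95% level; larger values correspond to smaller p values.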
Considering the smallest McNemar test statistic (12.7), which occurs for the Friday sermon text, the performance difference between the Q + BUKH + ATB-based model and the other models is significant, with a minimum p value of 0.0003 (99.97% confidence interval). The difference in performance between the Holy Quran-based model and the other models on the children's story is significant, with a minimum p value of 0.003 (99.7% confidence interval). For Farasa and the other models, the difference is also significant for the Arabic poems, with a minimum p value of 0.00004 (99.996% confidence interval).

VIII. CONCLUSION
In this study, we examined sequence-to-sequence tagging using a BiLSTM neural network with CRF for Arabic diacritization. Our approach does not involve a morphological analyzer, dictionary, or feature engineering. The results suggest that the approach is affected primarily by the quality of the diacritization and secondarily by the size of the data.
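The sequence-to-sequence schema pairs each input character with a diacritic label. A minimal sketch of this pairing is shown below; the tag ''_'' for a missing diacritic matches our tag set, while the function name and the use of the standard Arabic combining-mark range U+064B–U+0652 are assumptions for illustration.

```python
# Arabic combining diacritics: fathatan through sukun (U+064B-U+0652)
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def to_char_label_pairs(diacritized):
    """Split diacritized text into parallel (characters, labels) lists:
    each base character is paired with the diacritic(s) that follow it,
    or the tag "_" if none (e.g. shadda + short vowel form one label)."""
    chars, labels = [], []
    for ch in diacritized:
        if ch in DIACRITICS and chars:
            # attach the mark to the preceding base character's label
            labels[-1] = ch if labels[-1] == "_" else labels[-1] + ch
        else:
            chars.append(ch)
            labels.append("_")
    return chars, labels
```

The character list forms the model input and the label list forms the target output, so diacritization reduces to per-character sequence labeling.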
Comparing the performance of the proposed method with those of existing models using benchmarking data and data splits for training, validation, and testing, the proposed method achieved, to the best of our knowledge, state-of-the-art DER results (2.37%) for ATB parts 1, 2, and 3.
Additionally, we compared the proposed approach with three well-known and publicly available Arabic diacritization systems using three texts from different genres. The results revealed that our models outperform all models except for Farasa, which yielded the best results for the classical Arabic poem.
Our proposed method generally achieved the best DERs because it operates on characters, whereas the other Arabic diacritization systems operate on words, and because of the prediction capabilities gained by combining deep learning networks with CRF. This performance is generally reflected in the WER as well. Like any other deep learning approach, our Arabic diacritization approach requires more memory, computational resources, and time than other machine learning approaches. However, deep learning approaches can extract useful features solely from data and produce better results than other machine learning algorithms, specifically when combined with CRF.
Our experiments and comparisons with other systems demonstrate that a robust Arabic diacritization system can be learned from data if the training set is large and fully and consistently diacritized and covers a large range of topics.