Arabic Text Processing Model: Verbs Roots and Conjugation Automation

Natural Language Processing (NLP) automates the analysis of text or speech in natural languages. Most of this work has targeted Western languages, and the Arabic language has received less attention. This paper presents a model to recognize an Arabic sentence. A new morphological model based on regular expressions is developed to recognize Arabic verbs. A hash table containing all Arabic three-letter verb roots is implemented. The total number of Arabic verbs derived from three-letter roots is 23090, and the number of roots is 6104. A set of rules forming the Arabic grammar is used to derive and analyze the syntax of Arabic sentences. About 87% of the verbs represented in our regular expression engine are detected, and the sentences are also recognized. In several Surat of the Quran, only 9% of the detected verbs are false positives (a non-verb declared as a verb), and 4% are false negatives (a verb considered as a noun). These rates arise mainly because we do not use vowels, even though the Quran (our case study) does. We made this decision to be able to handle all Arabic texts, most of which are written without vowels.


I. INTRODUCTION
NLP is a well-known domain that has been an active area of research for many years. The complexity of this domain has made progress slower than in most other domains. Moreover, among natural languages, Arabic has received less attention, although it is known for its highly complex syntax, structure, and verb conjugation. Research in this direction is therefore needed to ease the discovery of extensive knowledge about words.
The syntax and semantics of Arabic text are challenging because of the morphological features of Arabic sentences and words [1]. Arabic is a rich language and is counted among the most complicated languages. The complexity of a sentence is embedded in both its syntax and its semantics. The language also contains a massive vocabulary, including synonyms, antonyms, and word roots as nouns or verbs.
The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.
This complexity affects the understandability, analysis, and automation of the language. Word roots combine with different vowel forms to constitute simple verbs and nouns, which can be generated using complicated methods of grammatical derivation [2]. Arabic comprises a diversity of verb forms, and the conjugation of each verb depends on several factors.
Only a limited number of research works concentrate on Arabic language automation and correction, dealing with word root generation and syntax analysis. For instance, Abu-Errub et al. presented a technique to generate trilateral Arabic word roots, based on a list of morphological weights and the roots of three consonants. The technique removes the prefixes and the suffixes. The results show the usefulness of the proposed technique, with a performance rate of 94% [3]. The main drawback of this research is that it does not mention the type and size of the test input.
An Arabic spelling correction system has been proposed to discover and remove or correct spelling errors. The system is intended for Arabic search in electronic dictionaries. It was evaluated on single-error queries, and its performance was 28% better than the baseline. It is the first such system for a spoken colloquial Arabic dialect [4]. Likewise, Azmi et al. designed an algorithm for Arabic spelling error detection and correction that relies on dictionary search. Introduced as a spell checker, the algorithm detects spelling errors and then suggests corrections for incorrect Arabic spelling entered by the user [5].
Also, the work in [6] developed a hybrid system based on two techniques: the correction model and the confusion matrix. The system automatically detects and corrects spelling errors in Arabic text and contains a robust error confusion matrix. Its evaluation found that its results outperform those of other systems.
Furthermore, Hassan [7] studied the writing problems of a sample of learners in regards to writing Arabic and English statements. The study identified three common errors in the statement structure, along with suggesting some practical solutions to minimize these errors.
On the other hand, a recent morphological analysis system has been developed [8], which analyzes the surface patterns of Arabic words. The mechanism used to classify verb roots is based on a set of morphological rules, from which a database of conjugated surface patterns is then constructed. The system was tested and evaluated on 4,000 verbs and showed an error rate of only 4%. However, this system derives root derivatives based on assumptions rather than on the verb roots actually used in Arabic.
This research focuses mainly on automating Arabic root generation. It aims to develop a tool that processes Arabic text to discover the syntax used in the input and to correct its errors. The tool can automate verb conjugation in both directions, from a root to its conjugations and from a conjugation back to its root. It will also be used to detect syntax and report errors, helping the writer use the correct syntax. Later, this tool will be incorporated into a smart editor that assists in writing Arabic text.
The method includes a collection of Arabic verbs, from which a knowledge database is built and used by the tool. Moreover, it provides and implements a set of algorithms that cover verb conjugation, root detection, and algorithm evaluation, collect the Arabic grammar rules, build the grammar, add the semantic actions for syntax generation, and generate the syntax of an input sentence.
The rest of the paper is structured as follows. Section II reviews the literature. Section III describes the proposed model through the Arabic language grammar, the construction of the verbs database (verb generation), and the verb recognition system. Section IV presents the experiments and results, and Section V gives the conclusions and future work.

II. LITERATURE REVIEW
Over the last decade, many researchers have focused on Arabic language processing, although most NLP research has targeted the English language. Transferring the work done on English to Arabic is not as trivial as it may seem. Arabic is among the richest natural languages in syntax and semantics. Among these works, we present the following related contributions.
Althobaiti et al. [9] built a Java-based library that consists of various tools for Arabic text processing. They presented a complete library to handle Arabic texts.
An Arabic morphology dataset was presented by [10]. They focused on the uniqueness of the root pattern phenomenon and studied the associative relationships between words meaning at a higher level and their possible occurrences. This approach can be viewed as an instantiated global root-pattern related to the morpho-phonetic items.
Other work in [11] studied Arabic text automation using a semantic approach, and the work in [12] built a classification method for the Arabic language.
Moreover, some research has addressed Arabic root extraction. For instance, a multi-objective method was combined with a statistical method to separate the suggested Arabic roots, and the results showed that the developed method improved the performance of Arabic root extraction [13]. Yousef et al. proposed an approach that improves Arabic root extraction for all words using the bi-gram technique. The performance of the proposed approach reached 80% [14], and the approach succeeded with vocalic roots.
Likewise, a model was developed to identify the roots of verbs with a software tool called RootIT, to overcome the problem of verb root generation without disambiguation [15]. In addition, [16] presented a novel technique for Arabic NLP that generates the roots of Arabic words. Farwaneh [17] focused on the Levantine Arabic variety and gave an account of a set of complex facts related to the inflection of sound and non-sound verbs. The account distinguishes four levels of correspondence, (input-output), (output-output). Focusing on the paradigmatic differences found in the inflection of sound verbs, the method concentrates on stems of more than two consonants and on non-sound verbs, whose stems comprise two consonantal realizations.
The Arabic sentence is syntactically ambiguous and complicated because of the different meanings of a word in context and the frequent use of conjunctions, grammatical relationships, and other forms [18]. The Arabic language is characterized by its complex morphology based on root-pattern schemes. Extracting the roots of Arabic words is challenging, and it is an essential topic in NLP applications such as information retrieval, text analysis, machine translation, and speech tagging [3].
Another study [19] discussed machine translation models, their current problems, and their lower accuracy, especially in translation from Arabic to Chinese, two complex languages. The researchers also propose a combination of factors that can help in the translation task within a proposed approach. On the other hand, Thalji et al. [20] created a rule-based algorithm for Arabic root extraction to eliminate the weaknesses of previous methods. They used the ''Thalji'' corpus for testing and for comparison with other works.
Elazhary et al. concentrated on automated tutoring for Arabic word root extraction [21]. They suggested an automated tutor that can be used in learning and teaching to help students extract the appropriate roots of any Arabic word.
Yaseen and Hmeidi introduced an algorithm called Stemming Algorithm for root extraction, based on a set of rules and an Arabic roots file that contains the word roots. The algorithm keeps the affixes during the extraction steps and is competitive, with an accuracy of 84% [22]. Likewise, Mohammed [23] suggested a combined stemmer for Arabic word roots, which achieved an exploration ratio of roots equal to 99.08. Moreover, Zeroual et al. developed an efficient stemming algorithm for Arabic text that deals with the morphological characteristics of Arabic. They ran experiments and evaluated the performance of the algorithm on two styles of Arabic: Classical Arabic and Modern Standard Arabic. The outputs of the stemmer are organized in three classes: the stem, a unique root, and a class combining the root and the stem [24].
In the Arabic language, a word root appears in various patterns that can be matched by different algorithms. Word-pattern matching algorithms are employed to extract the roots of Arabic words [25], [26]. According to this idea, a root is isolated after removing the affixes attached to the word. The root is then obtained by comparing the corresponding pattern with the positions of the letters in the word.
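As a minimal sketch of the word-pattern matching idea described above (not the algorithms of [25], [26] themselves), the following Python function aligns a word with a pattern of the same length. It assumes that the conventional placeholder letters ف, ع, and ل mark the root positions and that every other pattern letter must match the word literally. The function name `extract_root` and the sample words are illustrative assumptions.

```python
# Placeholder letters that mark the root positions in a pattern (assumed convention).
ROOT_SLOTS = "فعل"

def extract_root(word, pattern):
    """Return the root letters of `word` if it fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root.append(w)   # root position: keep the letter of the word
        elif w != p:
            return None      # affix position: the letters must agree
    return "".join(root)

print(extract_root("مكتوب", "مفعول"))  # root letters at the three placeholder positions
print(extract_root("كتاب", "فاعل"))    # None: the pattern's ا does not align with the word
```

A pattern whose affix letters coincide with one of the placeholder letters would need explicit slot markers instead of this simple convention.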
Generally, Arabic roots can be classified into two types according to their vowels [27]. The first is the vowel root, which contains at least one vowel. The second is the base root, which does not contain any vowel. Blanchete et al. formalized a model for Arabic verbs based on a pattern approach and the verb root. This model uses a linguistic classification method that identifies a group of morphological features. Their work has two main parts: first, a dictionary containing patterns, lemmas, and roots, and second, the generation of all potential verbs. The primary process classifies the roots and matches them with patterns to give lemmas. The output of this work is a dictionary that includes a large set of inflectional and derivational verb styles [28].
Many techniques have been employed to find the roots of Arabic words. Nevertheless, none of these methods has been accepted as a standard method because of the morphological richness of the language [26]. Several methods of extracting Arabic word roots have been presented [29], [30]. Some researchers rely on morphological rules to find the trilateral word roots.
Many researchers use morphological rules to generate the roots of Arabic words [3]. Root extraction is complex because of the multiple forms of the morphological formulas in Arabic words. Therefore, a technique was developed that combines some non-morphological rules with a statistical approach. This method was proposed to reduce the complexity of the root extraction process.
Little research has been done on Arabic text processing compared to other languages, especially for colloquial text [12]. However, interest in this area has recently been increasing. One such work improves the classification of Arabic dialects by using metadata [31]. Despite the difficulty of Arabic text structures and the scarcity of research in this field, a limited number of contributions have tried to analyze Arabic expressions using different algorithms. However, these approaches have some restrictions.
For instance, El Kourdi et al. [32] classified a dataset of 1500 Arabic web documents, collected from news channels, into five classes using the Naïve Bayes algorithm. They obtained a low average accuracy of 69%. Most research contributions for Arabic text rely on lexicon-based categorization and classification using machine learning algorithms [33]-[36], where resources for standard Arabic are available. Also, Kamps et al. [37] developed a simple distance measure on WordNet to determine the semantic orientation of adjectives; it classifies text with a simple technique based on lexical relations. Similarly, [38] used WordNet to classify English text based on a quantitative analysis of the glosses of terms that carry opinionated content with a positive or negative meaning. Moreover, in [39], the main features were extracted from Twitter data using a simple lexicon-based approach.
The sentiment classification in [40] used a supervised learning algorithm to identify the semantic orientation of conjoined adjectives from a large corpus of conjunction constraints. Turney [41] used a machine learning algorithm to classify reviews as recommended or not, predicted from the average semantic orientation of the expressions containing adjectives in a data set of reviews. The semantic orientation of a phrase is calculated with statistical techniques from the mutual information between the phrase and the words ''excellent'' and ''poor''. Our proposed technique performs better than most of the systems found in the literature, reaching more than 87% accuracy in verb recognition.

III. PROPOSED MODEL
Our proposed Arabic text processing model uses the grammar rules of the Arabic language. The model is given in Fig. 1. It has three levels:
Level 1: The Arabic text segment unit, which is the sentence. Parsing the sentence narrows the recognition of the type of lexemes. Most words of type Noun or Verb, mainly when they are vowel-less, cannot be recognized as such except in their context in a sentence. Most researchers look only at the morphology level and do not consider the context (syntax) level.
Level 2: The lexical level, where the parsed lexemes are matched to the forms of Arabic words using a regular expression engine.
Level 3: Lexeme type recognition: one of our contributions at this level is the construction of a database containing all three-letters-root Arabic Verbs, all well-known special Nouns, and Hurufs.
In the rest of this section, we will present the main grammar rules representing the construction of a sentence in Arabic Language describing the main complexity and the different sentence forms of such a language.
A. ARABIC GRAMMAR
Fig. 2 gives a global view of the Arabic language formal grammar, which is inspired by [42] and slightly enhanced. We call it the Arabic Context-Free Grammar (ACFG); it will be tuned further in future work to represent as many of the sentences in an Arabic text as possible. The lexemes represented by tokens are Verb, Harf, and Noun. These are the only kinds of lexemes in the Arabic language, and they group all Verbs, Hurufs, and Nouns. The Verbs and their different forms are all known, as are the Hurufs. Some Nouns are grouped under abstractions because they have a special effect in a sentence, and their forms may not change or may change in a particular way. The rest of the Nouns cannot be enumerated. For this reason, any lexeme that is not a Verb or a Harf is a Noun.
This grammar is presented in its first form, before it passes through the following processing steps:
1) adding the semantic actions that are used to describe the syntax analysis;
2) left factoring the rules that share the same name and start with the same prefix.
We denote by Non-Terminal Symbols the set of all rule names, and by Terminal Symbols the set of all tokens provided by the lexical analyzer for the lexemes (strings) in the user input, including Verb, Harf, Noun, VerbalTransformedParticle, AdjectiveParticle, Pronoun, Adverb (Time, Place), and Preposition. Because ambiguity (the same sentence having different meanings) is frequent in natural languages, and particularly in the Arabic language, our model presents as many syntax analyses as it can find for the same sentence.
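The left-factoring step can be illustrated with a short Python sketch. It performs a single pass: alternatives of one rule that begin with the same symbol are merged into one alternative that ends in a fresh non-terminal. The rule `Sentence` and its alternatives are hypothetical and are not taken from Fig. 2; the helper names are also assumptions. An empty list stands for the empty production.

```python
from collections import defaultdict

def common_prefix(sequences):
    """Longest list of symbols shared at the start of every sequence."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(name, alternatives):
    """One pass of left factoring over the alternatives of rule `name`."""
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0] if alt else None].append(alt)  # group by first symbol
    rules = {name: []}
    counter = 0
    for first, alts in groups.items():
        if first is None or len(alts) == 1:
            rules[name].extend(alts)                 # nothing to factor
            continue
        prefix = common_prefix(alts)
        counter += 1
        new_name = f"{name}_{counter}"               # fresh non-terminal
        rules[name].append(prefix + [new_name])
        rules[new_name] = [alt[len(prefix):] for alt in alts]
    return rules

# Hypothetical rule: Sentence -> Verb Noun | Verb Noun Noun | Noun Noun
factored = left_factor("Sentence", [["Verb", "Noun"],
                                    ["Verb", "Noun", "Noun"],
                                    ["Noun", "Noun"]])
for rule, alts in factored.items():
    print(rule, "->", alts)
```

After this pass the parser can commit to the shared prefix `Verb Noun` before deciding which continuation applies. A full implementation would repeat the pass until no two alternatives of a rule share a prefix.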

B. VERBS WITH THREE LETTERS' ROOT
Verbs in the Arabic language are classified in different ways based on the number of letters in their roots. The most used are three-letter and four-letter roots. This study focuses on three-letter roots. Table 1 shows some verbs with their past derivatives. There are 15 past derivative forms for three-letter-root verbs in the Arabic language. These derivative forms are represented in Fig. 3. All Arabic verbs are described in detail in [43].
There are some special forms of the derivatives, which result from combining two letters or changing a letter by  another. The first form is due to getting the same letter in the second, and the third positions of the root. The second by having in the root one or two letters equal to , , or which are called weak letters ( ), or a vowel with a long sound. An example of these two forms is shown in Table 2. The third special form is when the letter' ' is replaced by , , or in the derivative since it is easier to pronounce. Table 3 presents an example of this latter case.
In this paper, we cover the first special form, but not fully the second and the third, as they have several forms depending on the context; these will be addressed in future work. In the present tense, these derivatives change to those given in Fig. 4.
In all of these derivatives (past and present), ف, ع, and ل represent the first, second, and third letters of the root in the verb, respectively. In an Arabic text, depending on the pronoun and the tense, the verb is concatenated with the prefixes given in Fig. 5 and the suffixes given in Fig. 6.
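The construction of a surface form from a root can be sketched in a few lines of Python. The sketch assumes the conventional placeholder letters ف, ع, and ل for the first, second, and third root letters, and works on vowel-less text. The function names, the templates, and the prefix and suffix used in the example are illustrative and are not the entries of Figs. 3 to 6.

```python
def fill_template(root, template):
    """Replace the placeholders ف, ع, ل in `template` with the three root letters."""
    slots = dict(zip("فعل", root))
    return "".join(slots.get(ch, ch) for ch in template)

def surface_form(root, template, prefix="", suffix=""):
    """Concatenate prefix + derivative + suffix, as found in vowel-less text."""
    return prefix + fill_template(root, template) + suffix

print(fill_template("كتب", "استفعل"))          # derivative built on the root
print(surface_form("كتب", "فعل", "و", "وا"))   # conjunction + verb + plural suffix
```

Running the same mapping in reverse, from a surface form back to the root, is what the recognition step needs.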
As Table 1 shows, there is no general rule for deriving the derivatives of a root. To recognize a verb, the root has to be extracted from the parsed user lexeme, and the form of the verb used must also be a valid derivative of that root. For this reason, our model builds a database of all three-letter roots and their derivatives. A hash table is used to find a root together with its derivatives. As the 28 Arabic letters are used to construct roots, the chosen hash function is $h(a, b, c) = 28 \cdot \mathrm{rank}(a) + 28^{2} \cdot \mathrm{rank}(b) + 28^{3} \cdot \mathrm{rank}(c)$, which gives a lookup complexity of O(1). Here a, b, and c are the first, second, and third letters of the root, respectively, and rank is the position of the letter in the alphabet (from 1 to 28). The algorithm that builds this hash table is presented in Fig. 7. Table 4 presents the number of generated Arabic verbs per derivative and per first letter of the root.
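The hash function above can be written directly in Python. This is a sketch under stated assumptions, not the algorithm of Fig. 7: the alphabet ordering in `ALPHABET`, the helper names, and the use of a set of derivative numbers as the stored value are all illustrative choices, and the sample entry is toy data.

```python
# The 28 Arabic letters in conventional alphabetical order (assumed ordering).
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

def rank(letter):
    """1-based position of a letter in ALPHABET."""
    return ALPHABET.index(letter) + 1

def root_hash(root):
    """28*rank(a) + 28^2*rank(b) + 28^3*rank(c) for a three-letter root."""
    a, b, c = root
    return 28 * rank(a) + 28**2 * rank(b) + 28**3 * rank(c)

# Hash table: hash value -> set of derivative numbers recorded for the root.
table = {}

def add_root(root, derivative_ids):
    table[root_hash(root)] = set(derivative_ids)

def has_derivative(root, derivative_id):
    """Constant-time check that a root is recorded with a given derivative."""
    return derivative_id in table.get(root_hash(root), ())

add_root("كتب", [1, 3, 10])        # toy entry, not taken from the database
print(root_hash("كتب"))
print(has_derivative("كتب", 3), has_derivative("كتب", 2))
```

Because every rank lies between 1 and 28, two different roots never receive the same value, so this sketch needs no collision handling.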
The total number of 3-letters-root verbs in the Arabic Language is 23090. The number of roots is 6104. Fig. 8 presents the number of verbs per root's first letter. The maximum number of verbs are those starting with the letter whereas, the roots starting with and have the least number of verbs.    Fig. 9 shows the number of Verbs per root's derivatives. As all three-letters' verbs have a derivative , it is the most used derivative, whereas, the derivatives containing the least number of verbs are , and

C. VERBS RECOGNITION
Verb recognition in our model passes through several steps, which are presented in Fig. 10. The lexeme read from the user text is parsed by a regular expression engine, which tries to match the lexeme against all forms [[prefix] prefix] derivative [suffix [suffix]]. The number of prefixes and of suffixes in the Arabic language may reach three each. The root is extracted from the derivative, and the hash table is then indexed by the result of the hash function applied to the root. If the derivative is represented in the code stored in the hash table, the lexeme is considered a verb. If none of the matching regular expressions leads to a verb, the lexeme is not a verb. The lexical analyzer uses the regular expression engine to check whether a lexeme from the user input is a Verb, after first checking whether it is a Harf. If it is neither a Harf nor a Verb, it is considered a Noun. Fig. 11 presents some of the regular expressions used in this engine.
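The recognition steps above can be sketched in Python as follows. The templates, prefixes, suffixes, Hurufs, and table entries are small illustrative subsets chosen for the example; they are assumptions and not the contents of Figs. 5, 6, and 11 or of the verbs database. A plain dictionary keyed by the root stands in for the hash table, and the placeholder letters ف, ع, ل are assumed to mark the root positions.

```python
import re

ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
LETTER = f"[{ALPHABET}]"
SLOTS = {"ف": "f", "ع": "e", "ل": "l"}          # placeholder -> regex group name

TEMPLATES = ["فعل", "فاعل", "انفعل", "استفعل"]  # illustrative derivatives
PREFIXES = ["", "و", "ف"]                       # illustrative prefixes
SUFFIXES = ["", "ت", "ن", "وا", "نا"]           # illustrative suffixes
HURUF = {"في", "من", "على", "عن"}               # illustrative Hurufs
VERB_TABLE = {                                  # toy root -> derivatives table
    "كتب": {"فعل", "فاعل", "استفعل"},
    "فتح": {"فعل", "فاعل", "انفعل", "استفعل"},
}

def template_to_regex(template):
    """Turn a derivative template into a regex with one named group per root letter."""
    return "".join(f"(?P<{SLOTS[ch]}>{LETTER})" if ch in SLOTS else ch
                   for ch in template)

def is_verb(lexeme):
    """Try every prefix/derivative/suffix combination and look the root up."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            for template in TEMPLATES:
                m = re.fullmatch(prefix + template_to_regex(template) + suffix, lexeme)
                if m and template in VERB_TABLE.get(m["f"] + m["e"] + m["l"], ()):
                    return True
    return False

def classify(lexeme):
    """Harf first, then Verb, otherwise Noun."""
    if lexeme in HURUF:
        return "Harf"
    return "Verb" if is_verb(lexeme) else "Noun"

for word in ["من", "وكتب", "فتحت", "استكتب", "كتاب"]:
    print(word, classify(word))
```

Trying every combination, rather than taking the first regular expression that matches, matters because a leading letter such as ف can be either a prefix or the first letter of the root.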

IV. EXPERIMENTATION AND RESULTS
The experiments cover verb recognition and sentence recognition, each of which is discussed in the following subsections.

A. VERB RECOGNITION
The input to our framework was composed of some Surat from the Quran. Fig. 12 shows the number of 3-letters-root's verbs that we recognized from some Surat of the Quran.
Globally, there are only four used verbs' derivatives, at a certain level, in these Surat: , which is the most used in all Surat, , and . This goes with the fact that these derivatives have the maximum number of roots in the Arabic language.
We call false-positive the case where a lexeme that is not a verb is recognized as a verb. Similarly, false-negative describes the case where a verb is not recognized as such. Table 5 gives the false-positive and false-negative rates of verb recognition. The average false-positive rate is around 9%. The average false-negative rate is around 4%, which is due to some added rules that eliminate nouns having verb roots. These rates can be considerably reduced by taking the vowels into account. Moreover, for Arabic texts without vowels, the context (the sentence at the syntax level) will help to reduce them.
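The two error rates can be computed as in the Python sketch below. The text does not spell out the denominators, so the sketch assumes that the false-positive rate is taken over the lexemes declared as verbs and the false-negative rate over the lexemes that really are verbs. The function name and the labels in the example are hypothetical.

```python
def verb_error_rates(gold, predicted):
    """Return (false_positive_rate, false_negative_rate) for the Verb class.

    Assumed denominators: lexemes declared as verbs for false positives,
    lexemes that really are verbs for false negatives.
    """
    pairs = list(zip(gold, predicted))
    false_pos = sum(1 for g, p in pairs if p == "Verb" and g != "Verb")
    false_neg = sum(1 for g, p in pairs if g == "Verb" and p != "Verb")
    declared = sum(1 for _, p in pairs if p == "Verb")
    actual = sum(1 for g, _ in pairs if g == "Verb")
    fp_rate = false_pos / declared if declared else 0.0
    fn_rate = false_neg / actual if actual else 0.0
    return fp_rate, fn_rate

# Hypothetical labels for six lexemes.
gold = ["Verb", "Noun", "Verb", "Noun", "Verb", "Verb"]
predicted = ["Verb", "Verb", "Noun", "Noun", "Verb", "Verb"]
print(verb_error_rates(gold, predicted))
```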

A verb in the form [[prefix] prefix] derivative [suffix [suffix]] is, in fact, a sentence or a part of one. By parsing this lexeme, the sentence syntax is recognized. In the same way, Hurufs and Nouns may have prefixes and suffixes and will be treated using regular expressions. While this is easy for Hurufs, it is not trivial for Nouns and will be the focus of future work.
Using the grammar presented in Fig. 2, a parser is built to recognize and analyze the syntax of an Arabic sentence. If there is an ambiguity, all the recognized syntax analyses are given. Fig. 13 gives an example of a sentence syntax analysis. Table 6 compares the results of different techniques for Arabic root extraction. Although the inputs are not the same, our proposed model outperforms other methods.
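The handling of ambiguity, where every recognized syntax analysis is returned, can be illustrated with a small exhaustive parser in Python. The grammar below is a toy example that is ambiguous on purpose; its rules and the names `GRAMMAR`, `NounPhrase`, and `all_parses` are assumptions and do not reproduce Fig. 2. The sketch works only for grammars without left recursion.

```python
# Toy grammar: non-terminal -> list of alternatives (lists of symbols).
GRAMMAR = {
    "Sentence": [["Verb", "NounPhrase", "NounPhrase"], ["Verb", "NounPhrase"]],
    "NounPhrase": [["Noun"], ["Noun", "Noun"]],
}

def parse(symbol, tokens, i):
    """Yield (tree, next_position) for every way `symbol` matches from position i."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if i < len(tokens) and tokens[i] == symbol:
            yield symbol, i + 1
        return
    for alternative in GRAMMAR[symbol]:
        for children, j in parse_sequence(alternative, tokens, i):
            yield (symbol, children), j

def parse_sequence(symbols, tokens, i):
    """Yield (list_of_trees, next_position) for a sequence of symbols."""
    if not symbols:
        yield [], i
        return
    for tree, j in parse(symbols[0], tokens, i):
        for rest, k in parse_sequence(symbols[1:], tokens, j):
            yield [tree] + rest, k

def all_parses(tokens, start="Sentence"):
    """Return every parse tree that covers the whole token sequence."""
    return [tree for tree, end in parse(start, tokens, 0) if end == len(tokens)]

# The token types would come from the lexical analyzer.
for tree in all_parses(["Verb", "Noun", "Noun", "Noun"]):
    print(tree)
```

For the four tokens in the example, the two analyses differ in which noun phrase takes two nouns.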
Most recent research concentrates on the morphological level, extracting roots from Arabic words with a recognition performance between 94% and 96%. Compared with these works, our proposed model is as good as the reported approaches, since it achieved 87% verb recognition and a 9% false-positive (non-verb) rate, which brings the total to 96%. Our model is close to what is proposed in [22]: they use a file containing some Arabic roots, while we build a hash table that contains all three-letter-root verbs. Similarly, they use what they call Arabic rules, while we describe the verb derivatives and their potential prefixes and suffixes using regular expressions. Moreover, our proposed model uses the Arabic grammar to recognize a sentence and give its syntax analysis. On the other hand, [3] and [8] claim root extraction performance rates of 94% and 96%, respectively. However, the first does not mention the input type and size, while the second is tested on 4000 verbs that are not embedded in normal Arabic text, where distinguishing between verbs and nouns adds complexity.
VOLUME 8, 2020

V. CONCLUSION
To derive the syntax analysis of an Arabic sentence, we proposed a new model based on regular expressions and Arabic grammar rules. We generate all verbs with three-letter roots and their derivatives and build a hash table with an access complexity of O(1). A lexical analyzer reads and slices the input and returns the token of each input word to the parser, which checks the grammar rules and produces the analysis. About 87% of the verbs represented in the regular expression engine are detected, and only 9% of the detected verbs are false positives. A false-negative rate of 4% represents the verbs that are rejected because they are considered Nouns. When counting the total root recognition (87% verbs and 9% false positives), the performance of our model is as good as that claimed by most works that use a stemming process for root extraction. As Arabic text processing needs deeper analysis than root recognition, our model addresses this need by attempting to distinguish verbs from nouns, which is not an easy task in the Arabic language and needs further work. The grammar is used to recognize Arabic sentences and obtain their syntax analysis. Although our results are promising, some verb forms were not taken into consideration, such as imperative verbs and verbs with weak letters (the long-vowel letters ا, و, and ي). These will be considered in future work, together with refining the grammar to include more granularity in the syntax.