Hybrid Feature Model for Emotion Recognition in Arabic Text

In recent years, research into developing state-of-the-art models for Arabic natural language processing tasks has gained momentum. These models must address the added difficulties related to the nature and structure of the Arabic language. In this paper, we propose three models, a human-engineered feature-based (HEF) model, a deep feature-based (DF) model, and a hybrid of both models (HEF+DF) for emotion recognition in Arabic text. We evaluated the performance of the proposed models on the SemEval-2018, IAEDS, and AETD datasets by comparing the performances of those models on each emotion label. We also compared the model performances with those of other state-of-the-art models. The results show that the HEF+DF model outperformed the DF and HEF models on all datasets. The DF model performed better than the HEF model on the SemEval-2018 and AETD datasets, while the HEF model performed better than the DF model on the IAEDS dataset. The HEF+DF model outperformed the state-of-the-art models in terms of accuracy, weighted-average precision, weighted-average recall, and weighted-average F-score on the AETD dataset and in terms of accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F-score on the IAEDS dataset. It also achieved the best macro-averaged F-score and the second-best Jaccard accuracy and micro-averaged F-score on the SemEval-2018 dataset.


I. INTRODUCTION
Because we rely on computers to perform our daily tasks, the need for improved human-computer interaction has increased. Text is the main medium of human-computer interaction and takes various forms: text messages, emails, product reviews, web blogs, and posts on social media platforms such as Facebook, Twitter, and YouTube. Automating emotion recognition can benefit the field of human-computer interaction as well as other fields, including virtual reality, e-learning, psychology, business, data mining, information filtering systems, and robotics. Because computers lack common-sense knowledge, it is difficult for them to understand emotion; thus, emotion recognition from text is both a difficult and an important natural language processing (NLP) task.
Emotion recognition from text refers to the task of automatically assigning to a text an emotion selected from a set of predefined emotion labels. There are few published studies on emotion recognition in Arabic text. In general, NLP in Arabic is not as advanced as NLP in English. Arabic is a Semitic language spoken by more than 400 million people. There are three main types of Arabic: Classical Arabic (CA), which is used in the Quran; modern standard Arabic (MSA), which is used in formal conversations and writing; and Arabic dialects (AD), which are used in daily life communication and social media. Arabic is written from right to left. The Arabic alphabet has 28 letters, not counting the hamza. No capitalization exists in Arabic, but the letters change shape according to their positions in words. To develop a model for Arabic, one must have insight into the structure and syntax of the Arabic language.
The associate editor coordinating the review of this manuscript and approving it for publication was Imran Sarwar Bajwa.
Motivated by the objective of boosting research on Arabic NLP, this paper proposes three models for emotion recognition in Arabic text: a human-engineered feature-based (HEF) model, a deep feature-based (DF) model, and a hybrid model (HEF+DF). For the HEF model, we selected features that represent different aspects of the text. The feature set includes stylistic, lexical, syntactic, and semantic features. For the DF model, we built the embedding layer using four different pre-trained word embedding models. We overcame the out-of-vocabulary (OOV) word problem by calculating character embeddings from these pre-trained word embedding models. The DF model consists of stacked deep neural networks in which the embedding layer is reinserted multiple times to slow down the learning process. The performance of the proposed models was tested on three datasets: the SemEval-2018 dataset, the Iraqi Arabic emotion dataset (IAEDS), and the Arabic emotions Twitter dataset (AETD). The results show that the HEF+DF model outperformed the HEF and DF models on all datasets. Moreover, the HEF+DF model outperformed other state-of-the-art models on the IAEDS dataset in terms of accuracy, macro-averaged precision (P_macro), macro-averaged recall (R_macro), and macro-averaged F-score (F_macro). It also outperformed the state-of-the-art models on the AETD dataset in terms of accuracy, weighted-average precision (P_weighted), weighted-average recall (R_weighted), and weighted-average F-score (F_weighted). Finally, it achieved the best F_macro and the second-best Jaccard accuracy and micro-averaged F-score (F_micro) on the SemEval-2018 dataset.
The remainder of this paper is organized as follows. Section II presents related works. Section III describes the proposed models for emotion recognition in Arabic text. Section IV presents the experiments, reports the performance results, and provides a discussion. Finally, we conclude this work in Section V and outline some future research directions.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
Research on emotion recognition in Arabic is not as advanced as that for English or Chinese, mainly because of the limited resources available for Arabic. Mohammad et al. [1] organized SemEval-2018 Task 1: Affect in Tweets, which included five subtasks. The fifth subtask was multi-label emotion recognition in tweets. They created labeled training, development, and testing datasets in three languages: Arabic, English, and Spanish. The annotations were performed by presenting one tweet at a time to the annotators and asking them which of eleven emotions best described the emotional state of the tweeter. More information on the dataset and the distribution of instances between the emotion labels is provided in Section IV-A Datasets. Compared with the number of English participants, the number of participants in the SemEval-2018 competition for emotion recognition in Arabic was low. Of the eleven participants, only five achieved results higher than the baseline, and of those five, only Badaro et al. [2], Mulki et al. [3], and Abdullah and Shaikh [4] submitted papers describing their systems.
Badaro et al. [2] proposed a learning-based model for multi-label emotion recognition and tested several features, including n-grams, affect lexicons, sentiment lexicon, and word embeddings from AraVec [5] and FastText [6]. AraVec embeddings outperformed the other features. The authors also tested several learning models, including a support vector classifier (SVC) with both L1 and L2 penalties, ridge classification (RC), random forests (RF), and an ensemble of the three. Linear SVC with L1 outperformed the other learning models. Mulki et al. [3] formulated multi-label emotion recognition as a binary classification problem and tested different preprocessing steps. The preprocessing pipeline used in their best results replaced emoji with emotion tags and performed stemming and stop-word removal. They used term frequency-inverse document frequency (TF-IDF) to generate the features and performed classification using a one-vs-all support vector machine (SVM) classifier with a linear kernel. Abdullah and Shaikh [4] also formulated multi-label emotion recognition as a binary classification problem and used pretrained AraVec word embeddings for word representation. The embeddings were fed into four dense neural networks (DNNs); the output of the fourth DNN was normalized to either one or zero based on a threshold of 0.5.
Samy et al. [7] proposed a context-aware gated recurrent unit (C-GRU). The preprocessing steps included removing links, hashtag symbols, user mentions, diacritics, and elongations. Then, they normalized characters such as the ''hamza'', ''alf'', ''haa'', and ''yaa''. The input to the C-GRU model was a set of sentences and their corresponding topic representations. For word representation, they used 300-dimensional pre-trained word embeddings from AraVec. A gated recurrent unit (GRU) model was pre-trained to detect topics on the SemEval-2017 [8] dataset. Utilizing a transfer learning approach for topic detection overcomes the challenges of learning from a small training dataset. The learned topics were fed into four stacked convolutional neural networks (CNNs); then, the output of the last CNN layer was input into a global max-pooling layer. The word embeddings were fed into a GRU layer. The outputs of the global max-pooling layer and the GRU layer were merged and fed into a DNN with a rectified linear unit (ReLU) activation function. The classification was performed by logistic regression. The performance of the C-GRU model was evaluated on the SemEval-2018 dataset. The results achieved by this model exceeded the results obtained by Badaro et al. [2], who ranked first on the leaderboard of the SemEval-2018 competition.
Abdul-Mageed et al. [9] created DINA, a multi-dialect dataset for Arabic emotion analysis, by crawling Twitter between July and October of 2015. The annotation was performed by two annotators who were native speakers of Arabic with postgraduate education. The annotators were provided with several examples and were advised to consult with each other, talk to their friends, and ask online in cases where a given dialect was not understandable. Their analysis shows the effectiveness of the phrase-based seed approach for automatically acquiring emotion data.
Al-Khatib and El-Beltagy [10] also created a dataset (AETD) for emotion recognition from tweets. More information on the AETD dataset is presented in Section IV-A Datasets. The preprocessing steps included removing diacritics, links, mentions, and retweet indicators, followed by normalization, in which variant letter forms were replaced by a single form and Arabic numerals replaced Hindi numerals. They used n-gram features and tested different classification algorithms, including naïve Bayes (NB), Complement NB [11], and sequential minimal optimization (SMO). The experiments showed that Complement NB outperformed the other models and achieved the highest results in terms of accuracy, P_weighted, R_weighted, and F_weighted.
Almahdawi and Teahan [12] created a dataset (IAEDS) for emotion recognition from Facebook posts. More information on the IAEDS dataset is presented in Section IV-A Datasets. They performed two experiments. In the first experiment, WEKA 1 (Waikato Environment for Knowledge Analysis) was used to extract n-grams as features, which were tested with five classifiers: ZeroR, J48, NB, multinomial naïve Bayes (MNB) for text, and SVM with SMO. ZeroR and MNB resulted in the worst performances. In the second experiment, a compression-based classifier called prediction by partial matching (PPM) [13] was tested. The results showed that the PPM classifier significantly outperformed the other classifiers and achieved the highest results in terms of accuracy, precision, recall, and F-score.

III. PROPOSED MODELS
This section presents the proposed models for emotion recognition in Arabic text.

A. PREPROCESSING
The performances of the proposed models were tested on three datasets, all of which were created from social media platforms; for more details, see Section IV-A Datasets. The writing style used in social media is informal, contains grammatical and spelling mistakes, and includes hashtags, emoticons, and emojis. Table 1 shows some examples of sentences before and after preprocessing. The preprocessing pipeline includes the following:
• Use Tashaphyne [14], which is an Arabic light stemmer, to remove diacritics (tashkeel) and tatweel.
• Remove stop words (except for negation words).
1 https://www.cs.waikato.ac.nz/ml/weka/
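The two preprocessing steps listed here can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps Tashaphyne for plain regular expressions over the Unicode ranges for tashkeel and tatweel, and the stop-word and negation lists (`STOP_WORDS`, `NEGATION_WORDS`) are tiny hypothetical examples rather than the lists used in the paper.

```python
import re

# Arabic diacritics (tashkeel) occupy U+064B..U+0652; the tatweel is U+0640.
TASHKEEL = re.compile(r"[\u064B-\u0652]")
TATWEEL = re.compile(r"\u0640")

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"في", "من", "على", "لا", "لم"}
NEGATION_WORDS = {"لا", "لم", "لن", "ليس"}


def preprocess(text):
    """Strip diacritics and tatweel, then drop stop words but keep negations."""
    text = TASHKEEL.sub("", text)
    text = TATWEEL.sub("", text)
    tokens = text.split()
    kept = [t for t in tokens if t not in STOP_WORDS or t in NEGATION_WORDS]
    return " ".join(kept)
```

Keeping negation words matters for emotion recognition because removing them can invert the meaning of a sentence.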

B. HUMAN-ENGINEERED FEATURE-BASED MODEL
This section presents the HEF model. Figure 1 shows a diagram of this model.

1) FEATURE SET
We selected features that represent different aspects of the text, including stylistic, lexical, syntactic, and semantic features. After text preprocessing, we extracted the following features:
• Domain-specific features: SenticNet [15] was used to retrieve the mood tag of each word in the dataset. Then, each word was replaced by its mood tag. Words without mood tags were deleted. Finally, the TF-IDF was calculated. Table 2 shows some examples of sentences and the mood tags assigned to their words.
• Linguistic features:
  - The TF-IDF of the character-grams: The number of characters ranges between one and ten.
  - The TF-IDF of the uni-grams.
• Lexical features:
  - Lexical sentiment features (LSF): The sentiment of a sentence was calculated by summing the word sentiment scores provided by the following lexicons: Arabic Twitter sentiment lexicon [16], Arabic emoticon lexicon [17], [18], Arabic hashtag lexicon [17], [18], and Arabic hashtag lexicon (dialectal) [17], [18].
  - Lexical emotion features (LEF): The Arabic translation of the NRC emotion lexicon lists words and, for each word, provides a value of either zero or one for each of the following: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust. We excluded the negative and positive indicators for the SemEval-2018 dataset and the negative, positive, anticipation, and trust indicators for the AETD and IAEDS datasets. The LEFs were calculated for each sentence by counting the number of words matching each emotion in this lexicon.
• Syntactic features: The TF-IDF of the POS tags.
• Semantic features:
  - The TF-IDF of the semantic meaning: SenticNet was used to retrieve the semantic meaning of each word in the dataset. Then, the word was replaced by its semantic meaning. Finally, the TF-IDF was calculated.
  - Hourglass of emotions (HGE) [19]: SenticNet was used to retrieve the sensitivity, attention, pleasantness, and aptitude scores of each word in a sentence. Then, the scores for each emotion dimension were summed.
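As a rough illustration of the two lexical features, the sketch below sums word sentiment scores for the LSF and counts lexicon matches per emotion for the LEF. The lexicons here are hypothetical toy dictionaries with English placeholder words, merged into a single sentiment lexicon for brevity; the real features draw on the Arabic lexicons cited above, and the function names are assumptions.

```python
# Hypothetical toy lexicons; real ones would be loaded from the lexicon files.
SENTIMENT_LEXICON = {"happy": 0.8, "sad": -0.7, "great": 0.6}
EMOTION_LEXICON = {
    "happy": {"joy"},
    "afraid": {"fear"},
    "terrible": {"fear", "sadness", "anger"},
}
# Emotion indicators kept for the AETD and IAEDS datasets, per the text.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]


def lexical_sentiment_feature(tokens):
    """LSF: sum of the sentiment scores of the words found in the lexicon."""
    return sum(SENTIMENT_LEXICON.get(t, 0.0) for t in tokens)


def lexical_emotion_features(tokens):
    """LEF: number of words matching each emotion, in EMOTIONS order."""
    counts = {e: 0 for e in EMOTIONS}
    for t in tokens:
        for e in EMOTION_LEXICON.get(t, ()):
            if e in counts:
                counts[e] += 1
    return [counts[e] for e in EMOTIONS]
```

Words missing from a lexicon contribute nothing to either feature, which is one reason a translated lexicon with limited coverage can weaken these features.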

2) HEF MODEL
Three DNNs containing 100, 64 and 32 units, respectively, and a ReLU activation function were trained on each of the TF-IDF features. The HGE and LSF were trained with a DNN with ten units and a ReLU activation function. The outputs of both DNNs were concatenated and fed into two DNNs with ten units and a ReLU activation function. The LEFs were trained with two DNNs with ten and eight units and a ReLU activation function. To perform the classification, the outputs from all the previous DNNs were concatenated and passed into two DNNs with 50 units and a ReLU activation function. Finally, a dropout of value 0.1 was added to avoid overfitting, and a DNN whose units were equal to the number of emotion labels and a sigmoid activation function was added as an output layer.

C. DEEP FEATURE-BASED MODEL
This section presents the DF model. Figure 2 shows a diagram of this model.

1) PRE-TRAINED EMBEDDINGS
The available datasets for the Arabic language are small; however, deep learning requires large amounts of data for training. Thus, we used pre-trained word embeddings as a means of transfer learning to train the deep learning models. However, not all words are represented in the pre-trained embedding models. This OOV word problem was solved by using character embeddings, which were obtained by taking the average of the embeddings of all the words containing each character. The pre-trained embedding models used are as follows:
• Emoji2vec [20]: 300-dimensional emoji vectors learned from their descriptions in the Unicode emoji standard.
• GloVe [21]: 300-dimensional word vectors trained on tweets. To obtain the 300-dimensional word vectors, we concatenated the 200-dimensional and 100-dimensional word vectors.
Each embedding matrix was built using the following steps:
• For emojis, use the emoji embeddings from emoji2vec.
• Use the word embeddings if the word is represented in GloVe.
• Use the word stem embeddings if the word is not represented in GloVe.
• If the word stem is not represented in GloVe, substitute the sum of the embeddings of the characters that comprise the word.
These steps were repeated three more times while varying only the source of the pre-trained word embeddings. In the steps above, we used GloVe; for the other matrices, we used AraVec-CBOW, AraVec-SkipGram, and FastText.
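The fallback order described in these steps can be sketched as follows. This is a minimal illustration with plain Python lists standing in for embedding vectors; the function names, the `stem` callable, and the `dim` parameter are assumptions introduced for the example, not part of the original system.

```python
def character_embeddings(word_vectors):
    """Average, for each character, the vectors of all words containing it."""
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        for ch in set(word):  # each word counts once per character
            if ch not in sums:
                sums[ch] = [0.0] * len(vec)
                counts[ch] = 0
            sums[ch] = [a + b for a, b in zip(sums[ch], vec)]
            counts[ch] += 1
    return {ch: [v / counts[ch] for v in sums[ch]] for ch in sums}


def lookup(token, emoji_vectors, word_vectors, char_vectors, stem, dim):
    """Return a vector for token: emoji, then word, then stem, then characters."""
    if token in emoji_vectors:
        return emoji_vectors[token]
    if token in word_vectors:
        return word_vectors[token]
    stemmed = stem(token)  # hypothetical stemmer callable
    if stemmed in word_vectors:
        return word_vectors[stemmed]
    # Out-of-vocabulary fallback: sum of the embeddings of the characters.
    vec = [0.0] * dim
    for ch in token:
        if ch in char_vectors:
            vec = [a + b for a, b in zip(vec, char_vectors[ch])]
    return vec
```

Running `lookup` over the vocabulary once per word-embedding source would yield one embedding matrix per source, four in total in the setting described here.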

2) DF MODEL
We utilized different deep neural networks from the Keras deep learning library. After text preprocessing, we built four embedding matrices and used them to create four embedding layers. Then, the average of the four embedding layers was fed into a CuDNNLSTM (a long short-term memory layer built with the NVIDIA CUDA® Deep Neural Network library) with 300 units and a tanh activation function. Then, the average of the CuDNNLSTM output and the averaged embedding layer was fed into a CuDNNGRU (a gated recurrent unit layer built with the same library) with 300 units and a tanh activation function. Next, the average of the CuDNNGRU output and the averaged embedding layer was fed into a second CuDNNGRU with 300 units and a tanh activation function. Global max-pooling was conducted on the output of the last CuDNNGRU. Dropout with a rate of 0.1 was added to help avoid overfitting. The same classification method used in the HEF model was used here.

D. HYBRID MODEL HEF+DF
This section presents the hybrid model HEF+DF. A diagram of this model is shown in Figure 3. The features from the HEF and DF models were concatenated; Algorithm 1 shows the pseudocode for the concatenation. As input, it takes the features from those two models (all of which should have the same shape except for the concatenation axis) and returns a single output, the concatenation of all inputs. The output of the concatenation was fed into two DNNs with 50 units and a ReLU activation function. Dropout with a rate of 0.1 was added to avoid overfitting. Finally, a DNN with a number of units equal to the number of emotion labels and a sigmoid activation function was added as an output layer for the classification of the emotions.
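A minimal sketch of the concatenation step, assuming each input is a batch of per-sample feature vectors stored as nested lists and that the concatenation axis is the last one. The function name `concatenate` and the list-based representation are illustrative assumptions; a deep learning library would normally provide this operation as a layer.

```python
def concatenate(feature_batches):
    """Concatenate per-sample feature vectors along the last axis.

    Every batch must hold the same number of samples; only the
    feature dimension (the concatenation axis) may differ.
    """
    n_samples = len(feature_batches[0])
    if any(len(batch) != n_samples for batch in feature_batches):
        raise ValueError("all inputs must have the same number of samples")
    return [
        [value for batch in feature_batches for value in batch[i]]
        for i in range(n_samples)
    ]
```

For example, two HEF feature vectors of length 2 and two DF feature vectors of length 3 yield two hybrid vectors of length 5, which would then be passed to the dense layers described above.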

IV. EXPERIMENTS
The proposed emotion recognition models were implemented in Python. We used the following libraries: NLTK, Tashaphyne, scikit-learn [23], and the Keras deep learning library with a TensorFlow backend, on the Google Colaboratory platform running on a 25-GB GPU.

A. DATASETS
In this section, we present the datasets used to evaluate the performance of the proposed emotion recognition models. Tables 3, 4, and 5 show the emotion labels, the number of instances in each label, and the distribution percentages of those instances in the AETD dataset, IAEDS dataset and the SemEval-2018 dataset, respectively.
• AETD [10]: This dataset consists of tweets, mostly in the Egyptian dialect. The total number of instances is 10,065, and each instance is labeled as anger, fear, happiness, love, sadness, surprise, sympathy, or none. The distributions, as shown in Table 3, range from 10.38% for surprise to 15.40% for none.
• SemEval-2018 [1]: Each tweet in this dataset can be labeled with more than one of eleven emotions: anger, anticipation, disgust, fear, happiness, love, optimism, pessimism, sadness, surprise, and trust. Therefore, the total number of instances may be less than the sum of the instance counts of the emotion labels in Table 5.
  - Training dataset: The total number of instances is 2,278. The distributions range from 0.91% for surprise to 17.41% for anger.
  - Development dataset: The total number of instances is 585. The distributions range from 0.94% for surprise to 15.66% for sadness.
  - Test dataset: The total number of instances is 1,518. The distributions range from 1.07% for surprise to 17.14% for anger.
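Jaccard accuracy
\[ \text{Jaccard accuracy} = \frac{1}{|S|} \sum_{s \in S} \frac{|G_s \cap P_s|}{|G_s \cup P_s|} \tag{1} \]
where $G_s$ is the set of gold labels for sentence $s$, $P_s$ is the set of predicted labels for sentence $s$, and $S$ is the set of sentences.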
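The multi-label Jaccard accuracy, as defined for SemEval-2018 Task 1, averages over sentences the size of the intersection of the gold and predicted label sets divided by the size of their union. The sketch below is a minimal illustration; the function name and the convention for sentences with no gold and no predicted labels are assumptions.

```python
def jaccard_accuracy(gold, predicted):
    """Mean over sentences of |G_s & P_s| / |G_s | P_s|."""
    total = 0.0
    for g, p in zip(gold, predicted):
        union = g | p
        # Assumption: an empty gold set matched by an empty prediction scores 1.
        total += len(g & p) / len(union) if union else 1.0
    return total / len(gold)
```

For instance, predicting only anger for a sentence labeled anger and fear gives that sentence a score of 0.5.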

Accuracy
where $E$ is the set of emotion labels, $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives. The TP, FP, FN, and TN values were calculated as follows:
• TP: For a given label, if that label occurs in both the set of gold labels and the set of predicted labels, then increment by one.
• FP: For a given label, if that label occurs in the set of predicted labels but not in the set of gold labels, then increment by one.
• FN: For a given label, if that label occurs in the set of gold labels but not in the set of predicted labels, then increment by one.
• TN: The total number of occurrences that are not a given label minus the FP of that label.
For the micro-averaged results, the TP, FP, and FN for each emotion label $e$ are summed and the average is taken. The micro-averaged precision ($P_{micro}$) and micro-averaged recall ($R_{micro}$) are calculated as follows:
\[ P_{micro} = \frac{\sum_{e \in E} TP_e}{\sum_{e \in E} (TP_e + FP_e)} \tag{3} \]
\[ R_{micro} = \frac{\sum_{e \in E} TP_e}{\sum_{e \in E} (TP_e + FN_e)} \tag{4} \]
where $F_{micro}$ is the harmonic mean of the above two equations:
\[ F_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}} \tag{5} \]
For the macro-averaged results, the precision and recall are calculated independently for each emotion label $e$, and then the average is taken. Hence, all the emotion labels are treated equally.
\[ precision_e = \frac{TP_e}{TP_e + FP_e} \tag{6} \]
\[ recall_e = \frac{TP_e}{TP_e + FN_e} \tag{7} \]
The F-score$_e$ is the harmonic mean of the above two equations:
\[ f\text{-}score_e = \frac{2 \cdot precision_e \cdot recall_e}{precision_e + recall_e} \tag{8} \]
The $P_{macro}$ and $R_{macro}$ are calculated as follows:
\[ P_{macro} = \frac{1}{|E|} \sum_{e \in E} precision_e \tag{9} \]
\[ R_{macro} = \frac{1}{|E|} \sum_{e \in E} recall_e \tag{10} \]
and the $F_{macro}$ is the harmonic mean of the above two equations:
\[ F_{macro} = \frac{2 \cdot P_{macro} \cdot R_{macro}}{P_{macro} + R_{macro}} \tag{11} \]
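The micro- and macro-averaged scores described in this section can be sketched from per-label TP, FP, and FN counts. This is a minimal illustration under two stated assumptions: a ratio with a zero denominator is treated as 0.0, and, following the text, F_macro is taken as the harmonic mean of P_macro and R_macro. All function names are hypothetical.

```python
def label_counts(gold, predicted, labels):
    """Count TP, FP, and FN per label over parallel lists of label sets."""
    counts = {e: {"tp": 0, "fp": 0, "fn": 0} for e in labels}
    for g, p in zip(gold, predicted):
        for e in labels:
            if e in g and e in p:
                counts[e]["tp"] += 1
            elif e in p:
                counts[e]["fp"] += 1
            elif e in g:
                counts[e]["fn"] += 1
    return counts


def safe_div(num, den):
    # Assumption: a ratio with a zero denominator is reported as 0.0.
    return num / den if den else 0.0


def harmonic_mean(p, r):
    return safe_div(2 * p * r, p + r)


def micro_scores(counts):
    """Pool the counts over all labels, then compute P, R, and F."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    return p, r, harmonic_mean(p, r)


def macro_scores(counts):
    """Compute P and R per label, average them, then take the harmonic mean."""
    ps = [safe_div(c["tp"], c["tp"] + c["fp"]) for c in counts.values()]
    rs = [safe_div(c["tp"], c["tp"] + c["fn"]) for c in counts.values()]
    p, r = sum(ps) / len(ps), sum(rs) / len(rs)
    return p, r, harmonic_mean(p, r)
```

Micro-averaging lets frequent labels dominate the score, whereas macro-averaging gives a rare label such as surprise the same weight as a frequent one such as anger.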
The weighted average considers label imbalance and can result in an F_weighted that is not between P_weighted and R_weighted.

C. PERFORMANCE RESULTS
... all the human-engineered features were used. Table 6 shows the hyperparameter values, and Table 7 shows the comparison results of the proposed models. Tables 8, 9, and 10 show the comparison results of the proposed models with state-of-the-art models on the SemEval-2018, IAEDS, and AETD datasets, respectively.

1) COMPARISON OF THE PROPOSED MODELS
On the SemEval-2018 dataset, the HEF+DF model performed better than the DF and HEF models, and the DF model performed better than the HEF model. The HEF+DF model outperformed the DF model, achieving improvements ...
On the AETD dataset, the HEF+DF model performed better than the DF and HEF models, and the DF model performed better than the HEF model. The HEF+DF model outperformed the DF model, achieving improvements of 1%, 0.8%, 1%, and 0.8% in accuracy, P_weighted, R_weighted, and F_weighted, respectively. The DF model outperformed the HEF model, achieving improvements of 5%, 4%, 5%, and 4.6% in accuracy, P_weighted, R_weighted, and F_weighted, respectively.

2) PERFORMANCE COMPARISON BASED ON EMOTION LABELS
Precision The performance results on the SemEval-2018 dataset showed that the HEF model achieved the highest performance results for anticipation, happiness, optimism, and pessimism, while the DF model achieved the highest performance results for anger, disgust, fear, love, and surprise. The HEF+DF model achieved the highest performance results for sadness and trust. For the emotion labels on which either the HEF model or the DF model achieved the highest result, the HEF+DF model achieved the second-best results with one exception: the only time that the HEF+DF model performance result came in last was for the emotion label love; however, the difference between it and the HEF model was insignificant. The best performance results on the IAEDS dataset were achieved by either the DF model or the HEF+DF model. The HEF model came in a close second to the DF model for anger (a 1.37% difference). Moreover, the HEF model was second to the HEF+DF model for disgust, fear, and sadness (differences of 2.57%, 5.06%, and 5%, respectively). The performance results on the AETD dataset show that the HEF model achieved the highest performance results for fear, surprise, and sympathy, while the DF model achieved the highest performance results for anger and love. The HEF+DF model achieved the highest performance results for happiness, sadness, and none. For the emotion labels in which the DF model achieved the highest result, the HEF+DF model achieved the second-best results. However, for the emotion labels in which the HEF model achieved the highest result, the DF model achieved the second-best results.
Recall The best performance results on the SemEval-2018 dataset were consistently achieved by either the DF model or the HEF+DF model. The HEF model was second to the HEF+DF model for anger and anticipation with 4.11% and 5.69% differences, respectively. The performance results on the IAEDS dataset show that the HEF model achieved the highest performance results for happiness and surprise; the DF model achieved the highest performance results for disgust and fear; and the HEF+DF model achieved the highest performance results for anger and sadness. For the emotion labels in which either the HEF model or the DF model achieved the highest result, the HEF+DF model achieved the second-best results except for the emotion label disgust, where it came in last, but with only an insignificant difference between it and the HEF model. The best performance results on the AETD dataset were achieved by either the DF model or the HEF+DF model. The HEF+DF model came in a close second to the DF model for happiness and sympathy (1.9% and 0.8% differences, respectively).
F-score The performance results on the SemEval-2018 dataset showed that the HEF+DF model achieved the highest performance results for anger, anticipation, disgust, fear, happiness, and trust, while the DF model achieved the highest performance results for love, pessimism, sadness, and surprise, and the HEF model achieved the highest performance results for optimism. For the emotion labels in which either the HEF model or the DF model achieved the highest result, the HEF+DF model achieved the second-best results. Moreover, the difference was insignificant between the HEF+DF model and the first-place model for optimism and sadness. The performance results on the IAEDS dataset show that the HEF+DF model achieved the highest performance results for anger, fear, and sadness, and the HEF model achieved the highest performance results for disgust and happiness. Although the HEF model and the DF model achieved the highest performance results for disgust and surprise, respectively, the differences between their results and the results achieved by the HEF+DF model were insignificant. The best performance results on the AETD dataset were mostly achieved by the HEF+DF model, which came in a close second to the DF model for sadness and sympathy with 0.2% and 0.5% differences, respectively.

D. DISCUSSION
In this section, we discuss the performances of the HEF, DF, and HEF+DF models in light of the results presented in Section IV-C Performance Results.
On the IAEDS dataset, the HEF+DF model outperformed the Almahdawi and Teahan [12] model by 0.1%, 6%, 1%, and 3% in accuracy, P_macro, R_macro, and F_macro, respectively. Moreover, the HEF model and the DF model outperformed the Almahdawi and Teahan [12] model on P_macro by 4% and 1%, respectively. The DF model achieved the same F_macro as the Almahdawi and Teahan [12] model, whereas the HEF model outperformed it by 1%.

2) THE IMPACT OF HYBRIDIZING THE HEF AND DF MODELS
The AETD dataset is more than seven times the size of the IAEDS dataset; however, the hybrid model HEF+DF outperformed the other two models on both of these datasets. On the AETD dataset, which has 10,065 instances, the DF model performed better than the HEF model in terms of accuracy, P_weighted, R_weighted, and F_weighted, while on the IAEDS dataset, which has only 1,365 instances, the HEF model performed better than the DF model in terms of accuracy, P_macro, R_macro, and F_macro. Moreover, in terms of precision, the HEF model outperformed the DF model on both datasets for the emotion labels with the smallest number of instances (surprise, sympathy, and fear in the AETD dataset, and fear and disgust in the IAEDS dataset). These results show that the performance of the DF model is affected by the dataset size and the instance distribution of the emotion labels. They also show that hybridizing the two models improved the results by combining the strengths of both models.

3) THE IMPACT OF IMBALANCED DATASETS
All three datasets are imbalanced, but the imbalance is greater in the SemEval-2018 dataset than in the IAEDS and AETD datasets. The emotion label with the largest number of instances in the SemEval-2018 dataset was anger, comprising 17.41% and 17.14% of the instances in the training and testing datasets, respectively. On the other hand, the surprise, trust, and anticipation emotion labels had the smallest numbers of instances: only 0.91%, 2.32%, and 3.99% in the training dataset and 1.07%, 2.18%, and 4.45% in the testing dataset, respectively. All three models had difficulty recognizing the trust and surprise emotions; in fact, the HEF model and the HEF+DF model failed to recognize the surprise emotion.

4) EASY-TO-GRASP CHARACTERISTICS
Some emotions are easier to recognize than others. In the SemEval-2018 dataset, although the number of instances for the emotion label happiness was less than the number of instances for the emotion label anger, the F-score for recognizing happiness was higher than that for recognizing anger. Moreover, all the models recognized fear better than they did disgust or pessimism. In the IAEDS dataset, although the number of instances of the emotion label anger was the largest, the models were able to recognize sadness, happiness, and fear better than anger. Furthermore, while the emotion label fear had the smallest number of instances, the F-scores for recognizing the emotions disgust and surprise were lower than that for fear. In the AETD dataset, the models were able to recognize fear, sympathy, and love better than anger even though the emotion label anger has more instances. Hence, the emotions happiness, love, sadness, fear, and anger have characteristics and indicators that are easier to grasp.

V. CONCLUSION
In this paper, we proposed three models, the HEF model, the DF model, and the hybrid model HEF+DF, for emotion recognition in Arabic text. The DF model performed better than the HEF model on the SemEval-2018 dataset; however, the SemEval-2018 dataset was more imbalanced than the IAEDS dataset. Utilizing different pre-trained embedding models provided the DF model with a good starting point. Reinserting the embedding layer allowed the DF model time to learn by delaying the convergence caused by stacking deep neural networks and training on a small dataset. Moreover, it improved the prediction of emotion labels with only small numbers of instances, such as surprise. Combining the HEF model with the DF model achieved the highest performance in terms of F_macro. Although the HEF+DF model improved the predictions on the majority of the emotion labels, the limitations of the HEF model affected its prediction of some emotion labels. Nevertheless, the HEF model performed better than the DF model when tested on the IAEDS dataset, which was smaller than the AETD and SemEval-2018 datasets. The performance of the DF model was affected by the dataset size. Combining the HEF model with the DF model achieved the highest performance results on the IAEDS dataset in terms of accuracy, P_macro, R_macro, and F_macro; however, the DF model performed better than the HEF model when tested on the AETD dataset, which is larger than the SemEval-2018 and IAEDS datasets. Combining the HEF model with the DF model achieved the highest performance result on the AETD dataset in terms of accuracy, P_weighted, R_weighted, and F_weighted.
People tend to use strong words and more emojis when expressing happiness, love, sadness, fear, and anger, which makes it easier to recognize those emotions. We used the NRC emotion lexicon to help improve the recognition of anticipation, disgust, surprise, and trust. Nevertheless, the NRC emotion lexicon is a translated lexicon. Creating emotion lexicons specifically for Arabic would help improve the recognition of emotions that lack distinct characteristics and indicators.
In the future, we plan to investigate how to represent words that share the same spelling but have different meanings. We noticed this problem when we dealt with ADs. For example, in regions nearest to the Arabian Gulf, the letter normally pronounced (ka) is pronounced (cha), and when writing it, some people replace it with another letter. As a result, the word meaning liar can be written with the same spelling as the word meaning attractive. Hence, a sentence meaning "this girl is lying" written in an Iraqi dialect can become "this girl is attractive". FastText provided a pre-trained word embedding model in Egyptian Arabic. Providing pre-trained word embedding models for other ADs would help solve such problems.
Deep learning requires large datasets for training. Using pre-trained word embeddings helps minimize the effect of the absence of a large training dataset; however, we still needed to address the OOV word problem. We overcame that by calculating character embeddings from the available pre-trained word embedding models, but a more robust solution to this problem is still needed. Finally, more research should be conducted to improve emotion recognition in Arabic and boost Arabic NLP.
NOURAH ALSWAIDAN received the master's degree in computer science from King Saud University, Saudi Arabia, in 2014, where she is currently pursuing the Ph.D. degree with the Department of Computer Science. Her main research interests include meta-heuristics, natural language processing, and machine learning.
MOHAMED EL BACHIR MENAI received the Ph.D. degree in computer science from the Mentouri University of Constantine, Algeria, and the University of Paris VIII, France, in 2005. He also received the postdoctoral degree ''Habilitation Universitaire'' in computer science from the Mentouri University of Constantine, in 2007 (it is the highest academic qualification in Algeria, France and Germany). He is currently a Professor with the Department of Computer Science, King Saud University. His main interests include satisfiability problems, evolutionary computing, natural language processing, machine learning, and AI in medicine.