Arabic Aspect Extraction Based on Stacked Contextualized Embedding With Deep Learning

The exponential growth of the internet and a multi-fold increase in social media users over the last decade have resulted in a massive growth of unstructured data. Aspect-Based Sentiment Analysis (ABSA) is a text analysis technique in which opinions are grouped by aspect; it is challenging because it performs a fine-grained analysis. The Aspect Extraction (AE) task is one of the core subtasks of ABSA; it identifies aspect terms in text, comments, or reviews. Arabic AE is made harder by the complexity of the Arabic language. This work aims to advance the Arabic AE task through transfer learning with state-of-the-art pre-trained contextual language models. We concatenate the Bidirectional Encoder Representations from Transformers (BERT) language model and contextualized string embeddings (Flair embeddings) as a stacked embeddings layer for better word representation for the Arabic language. Then, we extend it with different deep learning network architectures. For Arabic AE, the model is developed by concatenating the Arabic contextual language model, AraBERT, and Flair embeddings as a contextual stacked embeddings layer, followed by a BiLSTM-CRF or BiGRU-CRF layer for sequence labeling. Our proposed models are called BF-BiLSTM-CRF and BF-BiGRU-CRF. The proposed models are evaluated on the Arabic Hotel's reviews dataset using the F1 score. The experimental results show that the proposed BF-BiLSTM-CRF configuration outperformed the baseline and other models, achieving an F1 score of 79.7%.


I. INTRODUCTION
Sentiment Analysis (SA), often known as opinion mining, is a popular study topic in Natural Language Processing (NLP). In recent years, it has attracted the research community's attention due to the explosion of social media data and the abundance of online reviews on services and products [1]-[3]. However, identifying the opinion or sentiment expressed about services and products in text remains difficult.
SA is classified into document, sentence, and aspect levels. The document and sentence levels are considered coarse-grained analysis, where the opinion is about the whole given text (document or sentence); in many cases, this is not sufficient to indicate an opinion about the specific aspects given in the text [4]. On the other hand, aspect level analysis, or ABSA, is a fine-grained analysis where the sentiment is predicted based on the entities (products, services, organizations, or events) in a particular domain that SA cannot cover [5]. ABSA helps, for example, organizations, governments, and decision-makers to know which features customers find attractive and which they do not favor, so that those features can be avoided or enhanced in the future. For example, a sentence about a hotel, "The buffet was very delicious, but it is a little pricey," gives positive and negative remarks about a hotel on two different aspects, "food" and "price," respectively. Another example is a review in Arabic about a new mobile phone, " ", which means "The price of the device is reasonable, but its disadvantage is the front camera and the battery life is short." The main aspects in this review are price, battery, and camera; the price of the mobile phone has a positive polarity, while negative polarities are expressed about the battery and the camera, that is, about the properties of the phone.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuihua Wang.
ABSA comprises four main subtasks: aspect term extraction, aspect term sentiment analysis, aspect category classification, and aspect-category sentiment analysis [6]. Every task can be performed independently or combined with others. Aspect term extraction constitutes the basis of ABSA and governs the accuracy of ABSA results [5], [7], [8]. We concentrate on the Aspect Extraction (AE) task for the Arabic language [6].
An AE task is treated as a sequence labeling problem, similar to Named Entity Recognition (NER). Well-known methods to extract aspects are the Conditional Random Field (CRF) and the Hidden Markov Model (HMM) [9], [10], which rely heavily on handcrafted features, for instance, bi-grams and parts of speech. Aspect category classification is solved by a support vector machine or logistic regression with several feature representations and extraction methods, such as frequency features, n-grams, a bag of words, term frequency-inverse document frequency (tf-idf), and word embedding [11]. The accuracy of applying these techniques to AE depends on the quality of the handcrafted features, and it requires a high level of feature engineering, which is time-consuming. With the increase in the number of datasets and in available computation resources, researchers have utilized deep neural networks for ABSA tasks to obtain rich feature representations. Deep learning consists of cascaded layers with nonlinear processing units for feature extraction. Applying deep learning to NLP problems shows excellent potential, and it can train complex models on big datasets [12]. Deep learning reports higher accuracy through several algorithms and techniques, or a hybrid of them, such as [13], Convolutional Neural Networks (CNNs) [14], Recurrent Neural Networks (RNNs), and RNN extensions, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [15]. In addition, using attention mechanisms for AE improves the performance of ABSA [16].
Arabic is the fifth most widely spoken language in the world. More than 422 million people speak it, and it is the official language of 22 countries. Arabic has some unique characteristics, such as an alphabet of 28 letters, a right-to-left orientation, and a lack of capitalization. Arabic is a Semitic language and, unlike languages such as English, it has a rich morphology and words whose meaning depends on the given context.
Moreover, diacritics in Arabic serve the same purpose as short vowels in English; they determine how letters are pronounced. Arabic is a derivational language in which words are derived from a root composed of consonants, usually three or four letters [17].
The Arabic language has a complex inflectional and derivational morphology, which gives rise to challenges for sentiment analysis and Arabic ABSA (AABSA). Some of these challenges are [17], [18]:
• Reliable NLP resources that deal with Dialectal Arabic (DA) are lacking, and few opinions are written in Classical Arabic or Modern Standard Arabic (MSA). Users typically use dialect forms to express opinions on social media, for example, Twitter, forums, and online websites.
• Large AABSA-annotated corpora are not available for learning accurate models, and there are no annotated datasets for social media and different dialects.
Moreover, many challenges occur during the preprocessing phase [18], such as:
• A given term can take on several meanings depending on the context, such as ( ), which may mean "gold" or "went."
• The absence of capital letters is a problem in AE.
• Transliteration and misspelling in microblogs such as Twitter and in hashtags result in noisy and dirty datasets.
• With many word forms and the removal of diacritics, the same written form can hold different meanings; e.g., ( ) can be "flag," "taught," "understood," "science," and "knew."
In comparison to the large number of studies on ABSA in English, there are few works on AABSA for the following reasons: (1) the difficulty and complexity of the Arabic language; (2) the lack of reliable NLP tools for the Arabic language; and (3) the limited availability of large annotated datasets.
Training deep learning models is data-intensive. Thus, deep-learning-based NLP tasks need more training data than is available in limited-resource languages such as Arabic; this problem could be solved by transfer learning [19]. Transfer learning enables transferring the knowledge from pre-trained models such as XLNet [20], ELMo [21], Flair [22], and Bidirectional Encoder Representations from Transformers (BERT) [23] to other limited domains, for example, product reviews and healthcare. In other words, it allows models pre-trained on a vast text corpus to be reused in other downstream tasks with small datasets [24].
Moreover, unlike transformers, recurrent deep learning methods such as LSTM cannot be parallelized and are less able to capture semantics over long sequences. In addition, deep learning methods based on word embedding vectors such as Word2vec [11] and GloVe [25] cannot account for the polysemy of a word in different contexts, because a single vector is generated for each word in the vocabulary regardless of its surrounding context.
Recently, transfer learning has shown a significant advantage in the semantic representation of text through pre-trained contextual models. BERT and Flair have a strong semantic text representation; they achieve state-of-the-art results in several NLP tasks. BERT and Flair can deal with polysemy, Out-Of-Vocabulary (OOV) words, and misspelling [23], [26].
In our work, to obtain a better word representation and to advance Arabic AE, we find that concatenating BERT with the contextual string embedding Flair (as a stacked embedding layer) strengthens the word representation and enriches it with additional semantic and syntactic information; stacked embeddings of this kind are the state of the art in sequence labeling. Thus, the combination can improve the performance of the Arabic AE task.
VOLUME 10, 2022
The contribution of our work can be summarized as follows:
• We propose using transfer learning based on stacked contextualized embedding for Arabic AE. To our knowledge, this is the first work using transfer learning based on a combination of fine-tuned AraBERT and contextual character-level embedding (Flair embedding) for better word representation to solve the AE task on an Arabic dataset.
• We propose combining Bidirectional LSTM (BiLSTM) or Bidirectional GRU (BiGRU) with CRF on top of the contextual embedding layer for sequence labeling.
• We use the Arabic Hotel's reviews dataset to train and evaluate our proposed model. Extensive experimental results demonstrate that our proposed fine-tuned BF-BiLSTM-CRF model outperforms baseline and state-of-the-art works on the Arabic AE task.
The rest of this paper is organized as follows. Section II discusses the related work. The proposed models are provided in Section III. The details of the experiments are presented in Section IV, and the evaluation results are discussed in Section V. Finally, the conclusion of this work is presented in Section VI.

II. RELATED WORK

A. ENGLISH AND OTHER LANGUAGES
There are three categories of AE approaches: early methodologies that are rule-based or dictionary-based; traditional approaches based on machine learning; and modern approaches that rely on deep learning and transformers [5]. Additionally, the approaches can be classified into supervised and unsupervised methods. Unsupervised techniques to extract aspects are rule-based, frequency-based, and statistical methods [27], [28]. Rule-based approaches rely on predefined rules to extract aspect terms manually [29], [30] or automatically [31], [32]. With limited grammatical information, frequency-based algorithms extract aspects based on the most frequent features, such as nouns or noun phrases [33].
On the other hand, the supervised approach treats the AE task as a sequence-labeling problem. Several machine learning-based approaches have been applied to AE, such as the Support Vector Machine (SVM) [8]. Also, topic modeling is widely used for AE and aspect grouping [34]. Many hybrid approaches combine more than one of these methods. The limitation of machine learning methods lies in handcrafted features, which usually require many domain experts and considerable human labor.
In recent years, researchers have adopted deep learning rather than traditional techniques for AE; deep learning models enhance performance by automatically learning the semantic and syntactic features. One work used a CNN for aspect extraction [14], which had seven layers: one input layer, two convolutional layers, two max-pooling layers, and fully connected layers with softmax output. For features, the researchers combined word embedding and parts of speech. The results showed a significant performance improvement. In [7], the authors proposed two embedding layers combined with a CNN network, where the two embedding layers were a general embedding layer and an in-domain layer. For sequence labeling, they used four CNN layers. Their results showed that the two embedding mechanisms with CNN enhanced the performance of the model.
Tang et al. [35] extended LSTM into Target-Dependent LSTM (TD-LSTM) and Target-Connection LSTM (TC-LSTM) to consider the aspect target. In TD-LSTM, they used two LSTM layers: LSTML to represent the left context of the aspect and LSTMR to represent the right context of the aspect plus the aspect itself. After that, they concatenated the last hidden vectors of LSTML and LSTMR and passed them to the last layer to predict the sentiment polarity related to the aspect. TC-LSTM was introduced to capture the interaction between the aspect word and its context. Some works combined RNN with CRF methods to improve the identification of aspect boundaries in the AE task [5], [36]. In addition, several studies used an attention mechanism to help the model learn representations more effectively, highlighting aspect-related words while de-emphasizing words irrelevant to the aspect [16], [37]-[41].
Most deep learning models are based on word embedding. Word embedding is a neural method for text representation that translates text into vectors (a vector representation for a particular word). It is used as input for deep learning. The vector space model represents words as continuous vectors, where semantically and syntactically similar words are mapped to nearby points. Word2vec [11] and GloVe [25] are the most common examples of word embedding. The limitation of word embedding is that it is context-free and cannot account for polysemous words, which limits the performance of models that rely on it.
Recently, pre-trained models, such as ELMo [21], XLNet [20], and BERT [23], accomplished state-of-the-art results in NLP tasks because they adjust the word vector based on the context. Utilizing BERT in ABSA, Gao et al. [42] converted the ABSA problem into a sentence-pair classification task. At the same time, TD-BERT concentrated on using the positioned output at the target word as the classification input rather than the first ([CLS]) tag; the features from these two base models were concatenated. Li et al. [43] also proposed the BERT-CRF model for End-to-End ABSA. The authors of [44] presented the FAGOM model for aspect level opinion mining; they used the BERT model for context embedding and a multi-head attention mechanism. Finally, a pooling layer was added to extract local and global features.
Ma et al. [45] proposed combining LDA with a lexicon for aspect extraction from Chinese reviews. Yu et al. [46] proposed fine-tuned BERT to extract implicit aspect terms from online clothing reviews. ABSA was also studied on some low-resource languages such as Urdu [47]; the authors annotated a dataset for the ABSA task in Roman Urdu and validated it by using several machine learning models.
To investigate the effect of transfer learning in a low-resource language, Winatmoko et al. [48] used the multilingual version of BERT with an auxiliary label and CRF as an output layer to extract aspect and opinion terms from hotel reviews in Bahasa Indonesia. The results showed improvement in the F1 score. Lopes et al. [49] also used BERT for an AE task on a Portuguese dataset.

B. ARABIC LANGUAGE
In this paper, our primary focus is on AABSA research. Several works address Arabic SA; some earlier works used traditional methods [50]-[52], whereas most recent works are based on modern techniques such as deep learning and transformers [53]-[55]. Compared to English, only a limited number of research works have targeted AABSA. There was no work on AABSA data before 2015, when the first AABSA research was presented.
The first Arabic dataset supporting ABSA is HAAD (Human Annotated Arabic Dataset of Book Reviews); it contains Arabic book reviews [56]. A subsequent work used the HAAD dataset to enhance two ABSA tasks, aspect category extraction and aspect polarity classification [57]. Recently, the authors of [58] extracted the explicit aspects from HAAD by using description logic to describe terminological knowledge. Then, they combined linguistic rules and description logic to extract opinion targets. Areed et al. [59] presented a dataset for government reviews and combined rule-based and lexicon models for AE.
In [60], ABSA was used to study and analyze the effect of Arabic news on readers. Another dataset for Arabic laptop reviews was prepared to support ABSA. Concerning the use of deep learning in AABSA, few studies are found in the literature. The first one is INSIGHT-1; the authors used CNN for aspect category and sentiment polarity detection for multilingual ABSA (11 languages) [61]. For Arabic, they used the Hotel's reviews dataset, and the results showed an enhancement in performance over the baseline result by 11% for aspect category detection and 6% for sentiment polarity classification. Another study used the LSTM approach for eight languages, including Arabic [62]. For Arabic, their results were modest relative to the baseline; they achieved F1 = 47.3%, an enhancement of around 7%. Al-Smadi et al. [63] used the Arabic Hotel's reviews dataset from SemEval-2016 Task 5. The authors compared RNN with SVM for the ABSA tasks related to the dataset: aspect category identification, aspect opinion target expression extraction, and aspect polarity classification. They extracted lexical, syntactic, semantic, and morphological features for training SVM classifiers. Then, they compared a trained RNN with SVM, but it had a long execution time in the training and testing phases.
Two approaches were developed to handle the AABSA in [64]. The first one used Bi-LSTM and CRF on the word and character level to extract aspects from the review. For the sentiment polarity classification, LSTM was used. The result showed enhancement over baseline research on both tasks (39% for task 1 and 6% for task 2). In [65], the authors used an attention mechanism for AE, and the performance improved with a 72.8 F-score. In [66], a BiGRU was used. For AE, the result was close to [65].
A few works used fine-tuned BERT with a linear classification layer for Arabic aspect polarity classification [67]. Bensoltane et al. [68] proposed a BERT-BiLSTM-CRF model for AE from the News dataset that outperformed the previous works on this dataset.
To the best of our knowledge, no prior work used a combination of BERT and string embedding for Arabic Aspect Extraction.

III. PROPOSED MODEL ARCHITECTURE
This section introduces our proposed models; we propose two models for the Arabic AE task: the BF-BiLSTM-CRF model and the BF-BiGRU-CRF model. The proposed models integrate two contextual pre-trained models, namely BERT language model and contextual string embedding (Flair) as stacked embeddings. The architectures of the models consist of three main layers: (1) the combination of the Arabic BERT (AraBERT) and Flair embeddings is used as the input layer, (2) BiLSTM and BiGRU represent the encoder layer, and (3) CRF serves as the decoding layer.
As shown in Figure 1, the proposed model consists of three main layers: embedding layer, BiLSTM or BiGRU layer, and CRF layer. For the embedding layer or input representation, we use Arabic BERT (AraBERTv02) and Flair embeddings, in which every word of the sentence is mapped to a contextual vector formed by concatenating the two embeddings. BiLSTM/BiGRU-CRF is used for sequence labeling. BiLSTM/BiGRU encodes the contextual information for each word in the input sequence; it is used for semantic encoding and obtaining global sequence features. The embedding vectors from the embedding layer are used as input to BiLSTM or BiGRU, which generates a score for each tag; taken alone, the tag with the highest score would be selected as the final prediction for a particular token. However, some label sequences predicted this way are invalid, so the output from BiLSTM or BiGRU is passed to the CRF layer for correction.
The CRF is a decoding layer that predicts the final sequence labels, considers the dependency relationship between adjacent tags, and selects the best sequence tagging.

A. EMBEDDING LAYER
The embedding layer receives a sequence of N words $(w_1, w_2, \ldots, w_N)$ and generates an embedding vector for each word $(e_1, e_2, \ldots, e_N)$. The final embedding vector is a stacked embedding, in which the Flair embedding is concatenated with the BERT embedding to form the final vector for the word $w_i$ at position $i$:

$$e_i = \left[ e_i^{\mathrm{Flair}} \,;\, e_i^{\mathrm{BERT}} \right] \quad (1)$$

Here, $e_i^{\mathrm{Flair}}$ and $e_i^{\mathrm{BERT}}$ represent the contextual Flair and BERT embedding vectors, respectively. The concatenation combines the strengths of these pre-trained models to enhance the performance; it enriches word representations with additional syntactic and semantic information.
Moreover, the performance of sequence labeling tasks, such as NER, is enhanced by using a BiLSTM-CRF architecture on top of the embedding layer. Because we use a combination of pre-trained embeddings, we expect the performance of the Arabic AE task to improve as well.
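The stacking step amounts to concatenating, token by token, the vectors produced by the two embedders. The following is a minimal NumPy sketch of that operation only; the function name `stack_embeddings`, the random vectors, and the dimensions are illustrative assumptions, not the embeddings or sizes used in the paper.

```python
import numpy as np


def stack_embeddings(flair_vecs, bert_vecs):
    """Concatenate per-token Flair and BERT vectors along the feature axis."""
    flair_vecs = np.asarray(flair_vecs)
    bert_vecs = np.asarray(bert_vecs)
    if flair_vecs.shape[0] != bert_vecs.shape[0]:
        raise ValueError("both embedders must return one vector per token")
    return np.concatenate([flair_vecs, bert_vecs], axis=1)


# Toy sentence of 4 tokens; random vectors stand in for real embeddings,
# and the dimensions (6 and 8) are illustrative only.
rng = np.random.default_rng(0)
e_flair = rng.normal(size=(4, 6))
e_bert = rng.normal(size=(4, 8))
e_stacked = stack_embeddings(e_flair, e_bert)
print(e_stacked.shape)  # (4, 14)
```

The stacked dimension is the sum of the two input dimensions, so the encoder on top sees both representations for every token.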

1) BERT
BERT uses a transformer encoder in a bidirectional way to encode a word, representing the semantics of the word in context based on its semantic relationship with the other words [23]. Thus, the output is a contextual embedding vector for each word. It is implemented using a multi-layer transformer encoder, proposed and developed by [69]. The transformer's core is a self-attention mechanism that obtains the relationship between words in the context by calculating the attention between each pair of words in the input sequence. As shown in (2), the attention is calculated using three input word vector matrices: the query matrix Q, the key matrix K, and the value matrix V:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \quad (2)$$

Here, $d_k$ is the dimension of the input (key) vectors, while $QK^{T}$ is used to calculate the relationship between input words. The weights are obtained by softmax normalization, and the final output is the weighted sum of the value vectors. In addition, the transformer uses multi-head attention, which computes the attention h times from different points of view with different weight matrices and concatenates the results.
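To make the attention computation concrete, the sketch below implements single-head scaled dot-product attention in NumPy. The projection matrices `W_q`, `W_k`, `W_v`, the input `X`, and all sizes are hypothetical toy values introduced only for illustration; a real BERT layer uses learned projections and several heads.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise word-to-word relevance
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors


# Toy input: 5 tokens with 8-dimensional vectors, projected to d_k = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, attn.shape)  # (5, 4) (5, 5)
```

Multi-head attention would repeat this with h different projection triples and concatenate the h outputs.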
The BERT input is composed of three types of embeddings. The first is token embedding to encode the word by adding special tokens to the input [CLS] and [SEP] at the beginning and end of each sentence, respectively. Second, segment embedding is used to encode the sentence position. Third, position embedding encodes the word's position in the sequence. BERT is pre-trained on a massive corpus, enabling it to obtain a much better feature representation.
BERT is pre-trained on two tasks to understand the language: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, 15% of the input words are masked randomly and must be predicted by BERT based on the context information. In NSP, BERT predicts whether two sentences occur consecutively in a given text, that is, whether the second sentence continues the first.
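The masking step of MLM can be illustrated with a short sketch. This is a simplified version under stated assumptions: every selected position is replaced by a `[MASK]` token, whereas the original BERT recipe also replaces some selected tokens with random words or leaves them unchanged. The helper `mask_tokens` and the toy token list are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask about mask_rate of the tokens; return the masked copy and the targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]   # what the model must predict at position p
        masked[p] = MASK
    return masked, targets


# Twenty placeholder tokens; with a 15% rate, three of them are masked.
tokens = [f"w{i}" for i in range(20)]
masked, targets = mask_tokens(tokens)
print(len(targets))  # 3
```

During pre-training, the loss is computed only at the masked positions, using the stored targets.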
In this paper, we apply the AraBERTv02 [70] pre-trained on the Arabic Wikipedia and a large Arabic news corpus containing 8.5 articles, 70 million sentences, and 2.5B tokens; the size of the dataset used is 24 GB. It covers different topics from different regions and covers a wide range of Arabic language in the Arabic world.

2) FLAIR EMBEDDING
Flair embedding is a pre-trained contextual string embedding [22]. It is based on a character-level language model (CharLM); each character of the text is fed into the character language model. The language model treats text as a probability distribution over sequences of characters, in which every new character depends on the characters that come before it. The input representation is taken from a forward and a backward language model. For each word, the hidden state of the forward model after the last character of the word is concatenated with the corresponding hidden state of the backward model to form the final embedding.

B. BiLSTM LAYER
The Recurrent Neural Network (RNN) is an artificial neural network that processes sequential data. An RNN has internal memory that enables it to process a data sequence by applying the same function and set of parameters to every sequence element. Each output depends on all previous information; this enables inferring the following word or character in the sequence. Theoretically, RNNs can use information from a long sequence, but in practice, backpropagated gradients can grow extremely large (exploding gradients) or shrink toward zero (vanishing gradients) after many steps.
To solve these RNN problems, extension networks such as LSTM were developed [15]. LSTM has the same recurrent structure as an RNN but computes the hidden state differently. It has memory cells and three nonlinear gates that control which information should be kept and which should be forgotten. The forget gate controls the gradient passing through it and allows for explicit memory deletions and updates; the input gate determines the vital information in the current state; and the output gate determines the next hidden state. The LSTM network structure is shown in the following equations:

$$f_t = \sigma\left( W_f \left[ h_{t-1}, x_t \right] + b_f \right)$$
$$i_t = \sigma\left( W_i \left[ h_{t-1}, x_t \right] + b_i \right)$$
$$\tilde{C}_t = \tanh\left( W_C \left[ h_{t-1}, x_t \right] + b_C \right)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma\left( W_o \left[ h_{t-1}, x_t \right] + b_o \right)$$
$$h_t = o_t \odot \tanh\left( C_t \right)$$

where $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively. $C_t$ and $h_t$ represent the cell state and hidden state at time t, and $x_t$ is the input at time t. b is the bias vector, and W is the weight matrix. $\sigma$ is the sigmoid function, and tanh is the hyperbolic tangent activation function. $\tilde{C}_t$ represents the candidate value created by the tanh layer; it is scaled by the output of the input gate and added to the retained part of the previous cell state to obtain the new cell state $C_t$ at time t.
This work uses the Bi-LSTM as a layer in the architecture to capture both forward and backward long dependencies between the words of the sequence. Two hidden states are computed, one reading from the first word to the last and one reading separately in reverse order. The two hidden states are concatenated as the final output.
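The recurrence can be sketched in a few lines of NumPy. The code assumes the standard LSTM formulation (forget, input, and output gates with a tanh candidate) and stacks the four weight blocks into one matrix `W` for brevity; the sizes, random weights, and function names are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to four stacked pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])                # forget gate
    i = sigmoid(z[H:2 * H])            # input gate
    o = sigmoid(z[2 * H:3 * H])        # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])  # candidate cell value
    c = f * c_prev + i * c_tilde       # new cell state
    h = o * np.tanh(c)                 # new hidden state
    return h, c


def run_lstm(xs, W, b, H):
    """Run the LSTM over a sequence and return all hidden states."""
    h, c = np.zeros(H), np.zeros(H)
    outs = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
        outs.append(h)
    return np.stack(outs)


def bilstm(xs, fwd, bwd, H):
    """Concatenate forward states with re-aligned backward states."""
    h_f = run_lstm(xs, *fwd, H)
    h_b = run_lstm(xs[::-1], *bwd, H)[::-1]
    return np.concatenate([h_f, h_b], axis=1)


# Toy sequence: 4 tokens, input size 5, hidden size 3 per direction.
rng = np.random.default_rng(0)
D, H, N = 5, 3, 4
xs = rng.normal(size=(N, D))
fwd = (rng.normal(size=(4 * H, H + D)) * 0.1, np.zeros(4 * H))
bwd = (rng.normal(size=(4 * H, H + D)) * 0.1, np.zeros(4 * H))
states = bilstm(xs, fwd, bwd, H)
print(states.shape)  # (4, 6)
```

Each row of `states` holds the left-to-right and right-to-left context of one token, which is what the next layer receives.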

C. BiGRU LAYER
GRU is a simplified version of LSTM; like LSTM, it controls the flow of information through gates, but it uses only two, the reset and update gates, and has no separate memory unit. The update gate determines the information that should be passed from the previous state to the current state; the reset gate determines the information from the prior time step that should be forgotten. The hidden unit's calculations are shown in the following equations:

$$r_t = \sigma\left( W_r \left[ h_{t-1}, x_t \right] + b_r \right)$$
$$u_t = \sigma\left( W_u \left[ h_{t-1}, x_t \right] + b_u \right)$$
$$\tilde{h}_t = \tanh\left( W \left[ r_t \odot h_{t-1}, x_t \right] + b \right)$$
$$h_t = u_t \odot h_{t-1} + \left( 1 - u_t \right) \odot \tilde{h}_t$$

where $r_t$, $u_t$, and $\tilde{h}_t$ represent the reset gate, update gate, and candidate hidden state, respectively. $x_t$ is the input vector at time t, and $h_t$ is the hidden state and the output vector. $W_r$, $W$, and $W_u$ represent the weights for the reset, cell, and update states, respectively. $b_r$, $b$, and $b_u$ represent bias parameters for the reset, cell, and update states, respectively.
BiGRU is a bidirectional variant of GRU. As in BiLSTM, the input sequence is read in both directions: from left to right (from the first word to the last) by a forward layer and from right to left (from the last word to the first) by a backward layer. The final hidden state from the forward direction and the final hidden state from the backward direction are concatenated to produce the last hidden state.
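A single GRU step can be sketched as follows. The code assumes the common formulation in which the update gate interpolates between the previous hidden state and a tanh candidate, with the update gate weighting the previous state; other references swap the roles of `u` and `1 - u`. The parameter layout, sizes, and random weights are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, params):
    """One GRU step with reset gate r, update gate u, and candidate h_tilde."""
    W_r, b_r, W_u, b_u, W, b = params
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx + b_r)   # how much of the past to expose to the candidate
    u = sigmoid(W_u @ hx + b_u)   # how much of the past to carry forward
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)
    return u * h_prev + (1.0 - u) * h_tilde


# Toy sizes: input dimension 5, hidden dimension 3.
rng = np.random.default_rng(0)
D, H = 5, 3


def random_params():
    """Small random weights and zero biases for the three transformations."""
    params = []
    for _ in range(3):
        params += [rng.normal(size=(H, H + D)) * 0.1, np.zeros(H)]
    return tuple(params)


params = random_params()
h = np.zeros(H)
for x in rng.normal(size=(4, D)):
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

A BiGRU applies the same step in both reading directions and concatenates the resulting states, exactly as a BiLSTM does.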

D. CRF LAYER
In sequence-labeling tasks, such as aspect extraction, the strong dependency relationships between labels should be considered. BiLSTM or BiGRU can consider the long-term context information, but they cannot model the dependencies between output tags. CRF can solve this problem [71]. The CRF layer is used when the output labels are highly interdependent. Instead of modeling labeling decisions independently, they are jointly modeled with a CRF layer, which aims to produce the globally optimal sequence of labels for a given input sequence [72]. The main advantage of CRF is that it learns restrictions on the output labels that follow the BIO labeling scheme, which ensures the validity of the predicted label sequence. These restrictions are learned automatically during the learning process. Some examples of these restrictions in the case of our AE task are:
• The label of the first word in a sentence should be B-ASP or O, not I-ASP.
• I-ASP cannot directly follow O; an aspect term must begin with B-ASP.
Let $X = (x_1, x_2, \ldots, x_N)$ be the input sequence to the model and $Y = (y_1, y_2, \ldots, y_N)$ the corresponding tag sequence. CRF determines the final score of a predicted label sequence from two types of scores. Emission scores come from the output of the BiLSTM layer: P is the output matrix of size $N \times K$, where N is the number of words and K is the number of tags, and $P_{i, y_i}$ is the score of label $y_i$ for word i. Transition scores come from the transition matrix A, where $A_{y_i, y_{i+1}}$ represents the score of the transition from tag $y_i$ to tag $y_{i+1}$. The final total score is:

$$s(X, Y) = \sum_{i=1}^{N-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{N} P_{i, y_i}$$

A softmax over all possible tag sequences $\tilde{Y}$ is used to obtain the probability of the sequence Y:

$$p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y}} e^{s(X, \tilde{Y})}}$$

Training then maximizes the log-probability of the correct tag sequence:

$$\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y}} e^{s(X, \tilde{Y})}$$

Finally, the output sequence with the maximum score is given by:

$$Y^{*} = \arg\max_{\tilde{Y}} \; s(X, \tilde{Y})$$
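The effect of the CRF layer at prediction time can be illustrated with a small Viterbi decoder. In this sketch, the emission matrix `P` and transition matrix `A` are hand-set toy values, not learned parameters, and only one restriction is encoded (a large penalty on the transition from O to I-ASP); a trained CRF learns all transition scores from data.

```python
import numpy as np

TAGS = ["O", "B-ASP", "I-ASP"]


def sequence_score(P, A, y):
    """Total score of tag sequence y: emission scores plus transition scores."""
    emit = sum(P[i, t] for i, t in enumerate(y))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans


def viterbi_decode(P, A):
    """Return the highest-scoring tag sequence and its score."""
    N, K = P.shape
    score = P[0].copy()
    back = np.zeros((N, K), dtype=int)
    for i in range(1, N):
        # cand[prev, cur] = best score ending in prev, then moving to cur
        cand = score[:, None] + A + P[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for i in range(N - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1], float(score.max())


# Emission scores for 4 tokens; a per-token argmax would give O, I-ASP, I-ASP, O.
P = np.array([[2.0, 0.5, 0.1],
              [0.2, 1.0, 1.2],
              [0.1, 0.3, 1.5],
              [2.0, 0.2, 0.1]])
A = np.zeros((3, 3))
A[0, 2] = -1e4  # O -> I-ASP is not a valid BIO transition

greedy = [TAGS[t] for t in P.argmax(axis=1)]
path, best_score = viterbi_decode(P, A)
decoded = [TAGS[t] for t in path]
print(greedy)   # ['O', 'I-ASP', 'I-ASP', 'O']
print(decoded)  # ['O', 'B-ASP', 'I-ASP', 'O']
```

The per-token argmax yields an invalid sequence, while decoding with the transition scores repairs it by choosing B-ASP for the second token.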

IV. EXPERIMENT AND RESULT

A. DATASET
We test the proposed models on the Arabic Hotel's reviews dataset. The dataset was part of the SemEval-2016 competition Task 5 for ABSA [6]. SemEval-2016 Task 5 is a multilingual task that covers customer reviews for seven domains and eight different languages. It is considered the benchmark dataset for AABSA tasks. The dataset consists of several sentences, and each sentence is divided into a list of tuples. The dataset was annotated at the text level with 2029 reviews: 1839 for training and 425 for testing. Moreover, it was annotated at the sentence level with 6029 sentences: 4802 for training and 1227 for testing. It consists of 24,028 annotated tuples split into 19,226 for training and 4802 for testing. We adopted the BIO annotation strategy for labeling the dataset. An aspect term contains one word or a phrase; there are three types of labels, in which B-ASP indicates the first word of the aspect term, I-ASP indicates a word inside the aspect term (but not the first word), and O indicates a word that is not part of an aspect. For example, the input sentence " " (which means "the hotel location is good but the service is bad") is annotated word by word with these labels.

B. EVALUATION METRIC
For the Arabic AE task, the F1 score was used for performance evaluation. F1 is a standard measure for sequence labeling problems; it combines precision and recall. Precision is the ratio of correctly predicted aspect entities to all detected aspect entities, and recall is the ratio of correctly predicted aspect entities to the number of entities in the gold standard. F1 is the harmonic mean of precision and recall. Precision, recall, and F1 are calculated as follows:

$$\mathrm{Precision} = \frac{\text{number of correctly predicted aspect entities}}{\text{number of predicted aspect entities}}$$
$$\mathrm{Recall} = \frac{\text{number of correctly predicted aspect entities}}{\text{number of gold aspect entities}}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
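Entity-level scoring over BIO tags can be sketched as follows. The helper names are hypothetical, and the gold and predicted tags below are our own illustrative labeling of the English gloss of the example sentence, not the dataset's annotation; an aspect counts as correct only when its boundaries match the gold span exactly, which is one common convention and an assumption here.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end) aspect spans."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != "I-ASP":
            spans.add((start, i))
            start = None
        if tag == "B-ASP":
            start = i
    return spans


def precision_recall_f1(gold_seqs, pred_seqs):
    """Exact-match, entity-level precision, recall, and F1 over a corpus."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative tags for: the hotel location is good but the service is bad
gold = ["O", "B-ASP", "I-ASP", "O", "O", "O", "O", "B-ASP", "O", "O"]
pred = ["O", "O", "B-ASP", "O", "O", "O", "O", "B-ASP", "O", "O"]
p, r, f1 = precision_recall_f1([gold], [pred])
print(p, r, f1)  # 0.5 0.5 0.5
```

In this toy case the predicted span covering only "location" does not match the gold span "hotel location", so one of two predictions is correct and one of two gold aspects is found.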

C. EXPERIMENT SETTING
We used the Flair framework to implement the proposed models in this experiment [26]. All experiments were run on Google Colaboratory with a Tesla T4 GPU. During training, the model parameters are updated using the Stochastic Gradient Descent (SGD) algorithm. The model's hyperparameters are shown in Table 1. Both models were trained with one and with two recurrent layers to assess whether the additional layer increases or decreases the overall performance. The pre-trained AraBERTv02 was utilized in two ways: a feature-based method, where the BERT weights are frozen, and a fine-tuning-based method, where all model parameters are fine-tuned, including those of BERT. Due to limited computational resources, the generic pre-trained AraBERT model (AraBERTv02-base) was utilized rather than the large version (AraBERT-large).
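A setup of this kind could be assembled with Flair roughly as sketched below. This is not the authors' code: the Hugging Face model id, the Arabic Flair model names, the tag type name, and the hidden size are assumptions or placeholders (the actual hyperparameters are those of Table 1), and the Flair imports are done lazily inside the function so that the configuration can be inspected without the library installed.

```python
# Placeholder settings; the paper's actual values are listed in its Table 1.
CONFIG = {
    "bert_model": "aubmindlab/bert-base-arabertv02",  # assumed Hugging Face id
    "flair_models": ("ar-forward", "ar-backward"),    # assumed Flair model names
    "rnn_type": "LSTM",       # "GRU" would give the BF-BiGRU-CRF variant
    "rnn_layers": 2,          # stacked recurrent layers
    "hidden_size": 256,       # placeholder value
    "fine_tune_bert": True,   # False corresponds to the feature-based method
}


def build_tagger(tag_dictionary, config=CONFIG):
    """Build a stacked-embedding sequence tagger with a CRF output layer."""
    # Imported lazily: these calls need Flair and download pre-trained models.
    from flair.embeddings import (
        FlairEmbeddings,
        StackedEmbeddings,
        TransformerWordEmbeddings,
    )
    from flair.models import SequenceTagger

    bert = TransformerWordEmbeddings(
        config["bert_model"], fine_tune=config["fine_tune_bert"]
    )
    flair_fwd, flair_bwd = (FlairEmbeddings(name) for name in config["flair_models"])
    stacked = StackedEmbeddings([flair_fwd, flair_bwd, bert])
    return SequenceTagger(
        hidden_size=config["hidden_size"],
        embeddings=stacked,
        tag_dictionary=tag_dictionary,
        tag_type="aspect",  # hypothetical label name for the B-ASP/I-ASP/O tags
        use_crf=True,
        rnn_type=config["rnn_type"],
        rnn_layers=config["rnn_layers"],
    )
```

The tag dictionary would be built from the BIO-labeled corpus, and the returned tagger would then be trained with Flair's trainer; both steps are omitted here because their exact form depends on the Flair version.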

V. RESULT AND DISCUSSION
We compare the proposed models with the baseline [6] and with previous models that used traditional deep learning: RNN [63], BiLSTM-CRF with Word2vec/fastText as word embedding [64], and an attention-based neural model [65]. In addition, we conducted a series of experiments to verify the proposed BF-BiLSTM-CRF and BF-BiGRU-CRF models on the Arabic AE task. First, the fine-tuned BERT with a linear layer and the BERT-CRF model were run. Then, the BERT-BiLSTM-CRF model was run with one BiLSTM layer and then with two stacked layers. Then, the BERT-BiGRU-CRF model was run with the same configurations. Finally, our proposed models, BF-BiLSTM-CRF and BF-BiGRU-CRF, were run; BF is an abbreviation for "BERT + Flair." All the models were tested on the same training and testing datasets.
We used two training methods for BERT: a feature-based method with fixed parameters and a fine-tuning method training the whole model.
As shown in Table 2, BERT-BiLSTM-CRF and BERT-BiGRU-CRF outperformed BERT with a linear layer and BERT-CRF, reflecting the effectiveness of BiLSTM-CRF and BiGRU-CRF on top of the BERT layer for sequence labeling. This indicates the positive effect of using CRF to consider the dependency between adjacent labels.
In addition, the results demonstrate that using two layers of BiLSTM/BiGRU enhances the model's performance in all cases. Therefore, the proposed models BF-BiLSTM-CRF and BF-BiGRU-CRF were tested with two stacked BiLSTM or BiGRU layers.
As for overall performance, our proposed fine-tuned BF-BiLSTM-CRF model outperformed the baseline, previous related works, and the other BERT-based models by achieving a 79.7% F1 score. This demonstrates the effectiveness of stacking contextual word embeddings for the Arabic AE task: combining BERT and Flair enhances the semantic representation of words and of the semantic relationships between the words in the text.
Moreover, we used BERT as a feature-based model with different model configurations: BERT-CRF, BERT-BiLSTM-CRF, BERT-BiGRU-CRF, and BF-BiLSTM-CRF. As shown in Table 2 and Table 3, the proposed model, BF-BiLSTM-CRF with fine-tuned BERT, outperformed the feature-based BF-BiLSTM-CRF model by 2.5 F1 points, which shows the effectiveness of fine-tuning over the feature-based method.

VI. CONCLUSION AND FUTURE WORK
Compared to the English language, few works target Arabic AE because the Arabic language has a richer inflectional and derivational morphology and lacks available NLP tools and resources. In this paper, we integrated the BERT language model and contextualized string embedding (Flair) for better word representation to enhance Arabic AE. We extended it with variants of neural network architectures and CRF.
The model's performance was investigated using Arabic pre-trained AraBERTv02 and Flair embedding as stacked embedding layers for contextual representation, and then it was combined with BiLSTM/BiGRU and CRF. The proposed models are called BF-BiLSTM-CRF and BF-BiGRU-CRF.
We experimented with two kinds of BERT training methods: a fine-tuning-based method and a feature-based method. Our results showed that integrating BERT and Flair with BiLSTM-CRF improves the outcomes because it combines the advantages of pre-trained contextual embedding, the BiLSTM network, and the CRF model. Across all evaluation experiments, the fine-tuned BF-BiLSTM-CRF with two stacked BiLSTM layers outperformed all the other models and previous works on the AE task on the same dataset, with a 79.7% F1 score.
For future work, we can use contextual embeddings to improve the results of other Arabic ABSA tasks, such as aspect polarity classification and aspect category detection. This area has many challenges that the research community must address in the future.