BERT-Based Model for Aspect-Based Sentiment Analysis for Analyzing Arabic Open-Ended Survey Responses: A Case Study

Educational institutions typically gather feedback from beneficiaries through formal surveys. Offering open-ended questions allows students to express their opinions about matters that may not have been measured directly in closed-ended questions. However, responses to open-ended questions are typically overlooked due to the time and effort required. Aspect-based sentiment analysis is used to automate the process of extracting fine-grained information from texts. This study aims to 1) examine the performance of different BERT-based models for aspect term extraction for Arabic text sourced from educational institution surveys; 2) develop a system that automates the ABSA process in a way that will automatically label survey responses. An end-to-end system was developed as a case study to extract aspect terms, identify their polarity, map extracted aspects to their respective categories, and aggregate category polarity. To accomplish this, the models were evaluated using an in-house dataset. The result showed that FAST-LCF-ATEPC, a multilingual checkpoint, outperformed other models including AraBERT, MARBERT, and QARiB, in the aspect-term extraction task, with an F1 score of 0.58. Hence, it was used for aspect-term polarity classification, showing an F1 score of 0.86. Mapping aspects to their respective categories using a predefined list yielded an average F1 score of 0.98. Furthermore, the polarities of the categories were aggregated to summarize the overall polarity for each category. The developed system can support Arabic educational institutions in harnessing valuable information in responses to open-ended survey questions, allowing decision-makers to better allocate resources, and improve facilities, services, and students’ learning experiences.


I. INTRODUCTION
Universities and higher education institutions worldwide allocate significant financial resources to enhance their services in order to retain existing students and attract new ones [1]. Student satisfaction and opinions about the university's service quality are very important because they have a direct impact on student impressions and the institution's reputation [2]. Students can express their thoughts through official surveys published at the institutional level. Usually, these surveys include closed- and open-ended questions. Closed-ended questions are specific and easy to analyze. In contrast, open-ended questions give students the opportunity to express their opinions and sentiments [1]. This type of question is valuable because it encourages students to express their thoughts and feelings and to provide useful information on personal experiences [3], [4]. However, these textual responses require more effort to analyze in order to extract helpful information and the sentiments they convey. Such analyses also consume considerable human time, especially when the number of responses is large and the questions cover more than one aspect [3], [4]. Students' responses are typically related to university aspects, such as services, professors, and buildings, and to their feelings (positive, negative, or neutral) toward these aspects. Extracting useful information from textual responses calls for an automated system that can analyze the text and detect the sentiments toward the elements (aspects) mentioned in the response.
The associate editor coordinating the review of this manuscript and approving it for publication was Camelia Delcea.
Sentiment analysis, or opinion mining, is an active area of natural language processing [5]. The main task of sentiment analysis is to classify the opinions expressed in a text [6]. The extracted opinion is typically classified according to its polarity as positive, negative, or neutral [7]. There are three levels of classification in sentiment analysis: document, sentence, and aspect [8]. Although document- and sentence-level analyses are useful for some applications, they are not sufficient for applications that require fine-grained information about a particular aspect. In such cases, aspect-based sentiment analysis (ABSA) is used. Principally, ABSA systems receive a set of texts (product reviews, comments, forum discussions, etc.) that discuss a specific entity. The system attempts to obtain the main aspects of the entity and detect the sentiments expressed toward each aspect [7]. The results of ABSA provide detailed sentiment information that can be highly valuable in various domains. Despite this benefit, ABSA has not been extensively applied in the educational domain. In addition, the majority of prior work on ABSA has focused on English, with a limited number of studies targeting ABSA in Arabic and other languages [9]. Arabic is the primary spoken language for approximately 422 million speakers worldwide [10]. It is a rich language with a large vocabulary, varied sentence structures, and words with multiple meanings. It has approximately 10,000 roots and more than 900 forms of nouns and verbs based on their morphology [11]. This results in a variety of derivational morphologies and structural forms, which increase the sparsity of morphemes and words [9] as well as the complexity of the analysis.
An advanced system is needed to analyze students' survey responses written in Arabic, categorize them based on various aspects of the university, and identify students' sentiments toward these aspects. Such a system will support the integration of student feedback into decision-making processes and aid university leaders in allocating resources and improving the quality of the services provided. Thus, there is a need to examine the literature and identify potential approaches that can improve ABSA for the Arabic language as well as its effectiveness in analyzing educational data.
This study aims to fill this gap by examining a transfer learning approach to ABSA in the Arabic educational context and evaluating its performance. Different bidirectional encoder representations from transformers (BERT)-based models were evaluated for aspect extraction using an in-house dataset of open-ended survey responses at King Abdulaziz University (KAU). The best-performing model was used to classify the polarity of each aspect. The aspects were then mapped to their categories, and the results were summarized by category by simply counting the polarities of the aspects in each category. To the best of our knowledge, this is the first study to assess ABSA using Arabic surveys in the educational domain.

II. BACKGROUND

A. ASPECT-BASED SENTIMENT ANALYSIS
ABSA produces finely detailed sentiment information. This information is useful for many applications in various domains. ABSA consists of four tasks: aspect term extraction (T1), aspect polarity classification (T2), aspect category mapping (T3), and category polarity (T4). T1 extracts all the aspect terms whose sentiment polarity needs to be determined. An aspect can be implicit or explicit. This task has been carried out using both supervised and unsupervised methods [12]. T2 assigns a sentiment polarity to each extracted aspect [13]. T3 identifies the category using a multilabel classifier that classifies each entity into multiple labels, where the label consists of entities and aspects. T4 assigns a sentiment polarity to each identified category. Figure 1 shows an example of these tasks.
Assuming that there are only two reviews for a restaurant, tasks T1, T2, and T3 are assigned as shown in Figure 1. For task T4, the overall category polarity in this example is positive for food and negative for service: pasta and steak are both rated positive, yielding an overall positive polarity for the food category, whereas the waiter is rated negative, yielding an overall negative polarity for the service category.
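As an illustration of the T4 step in this restaurant example, the following minimal sketch takes the output of T1 to T3 as a list of (aspect, polarity, category) triples and assigns each category its most frequent polarity. The triples, the function name `category_polarity`, and the majority rule with arbitrary tie-breaking are assumptions made for this illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical T1-T3 output for the two restaurant reviews:
# (aspect term, aspect polarity, aspect category)
annotations = [
    ("pasta", "positive", "food"),
    ("steak", "positive", "food"),
    ("waiter", "negative", "service"),
]


def category_polarity(rows):
    """Return the most frequent aspect polarity for each category (T4)."""
    counts = defaultdict(Counter)
    for _aspect, polarity, category in rows:
        counts[category][polarity] += 1
    # most_common(1) returns the polarity with the highest count; ties are
    # resolved by first occurrence, which is an arbitrary choice here.
    return {cat: c.most_common(1)[0][0] for cat, c in counts.items()}


result = category_polarity(annotations)
print(result)  # {'food': 'positive', 'service': 'negative'}
```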

B. DEEP LEARNING
Deep learning is a rising technique in machine learning that uses a hierarchy of layers to progressively extract higher-level features. During training, the higher layers exploit the complex compositional nonlinear functions of the lower layers. This means that layers higher in the hierarchy have more abstract or divided representations than the lower ones. Consequently, each layer receives an input, analyzes and classifies it, and provides an output that serves as the input of the next layer [14], [15]. A variety of algorithms, such as deep neural networks, convolutional neural networks, recurrent neural networks (RNN), and recursive neural networks, help in the analysis of many fields, especially in fine-grained processes for processors with a large number of layers [14]. Additionally, word embedding, long short-term memory (LSTM), and bidirectional LSTM are concepts related to deep learning that allow dealing with various types of data such as text, images, and videos [16].

C. TRANSFORMER-BASED TRANSFER LEARNING
Transfer learning is an emerging machine-learning technique that uses existing knowledge to solve problems in different domains and produces state-of-the-art prediction results [17]. Transfer learning methods are used extensively in computer vision tasks such as anomalous activity detection, object classification, and image captioning. Moreover, transfer-learning-based methods, such as BERT, have been successful in several natural language processing (NLP) tasks [18] and in the field of sentiment analysis [17]. BERT is a pre-trained language model developed by Google in 2018. It uses a deep neural network architecture with an attention component. It is designed to process sequential data such as text and learn the contextual relationships between words [19].

III. LITERATURE REVIEW
To decide which approach is most appropriate, a comprehensive literature review on ABSA in the educational domain and on Arabic ABSA approaches was conducted. The "ABSA in Educational Domain" section presents all existing empirical studies in the educational domain and reviews their data sources, approaches, and ABSA tasks; these studies were identified through a methodical and exhaustive literature search using queries consisting of the keywords ("aspect-based sentiment analysis" OR "ABSA") AND "education". Because more research was still needed on the approaches used for ABSA tasks on Arabic datasets in other domains, the "Arabic ABSA Approaches" section explores the literature using the keywords ("aspect-based sentiment analysis" OR "ABSA") AND "Arabic". That section is organized into three subsections: unsupervised learning, supervised learning, and deep learning. The studies associated with each approach are presented in detail.

A. ABSA IN EDUCATIONAL DOMAIN
Most ABSA studies in the educational field have been conducted on English-language datasets. The aim of these studies was to assist academic institutions in identifying and addressing student issues through feedback analysis. The data for these studies were primarily gathered from social media platforms such as Twitter and Facebook, as mentioned in [20], [21], [22], and [23]. Other studies utilized data collected from the institution, such as MOOC platforms or traditional institution surveys, as seen in [24], [25], and [26]. The methods used in these studies included semantic relatedness and sentiment polarity categorization. The researchers employed various classical machine learning algorithms such as k-means clustering, naive Bayes, linear regression, and support vector machine (SVM), with only two studies employing deep neural networks (LSTM) [24], [26]. We also found that all studies focused on the aspect extraction and polarity classification tasks, with the exception of one study that used a combination of machine learning- and lexicon-based approaches to accomplish all four tasks [23].
ABSA has also found application beyond the English language in education, albeit in a limited capacity. In Serbia, [1] employed ML algorithms to achieve T1 and T2. They examined student reviews on the "Oceni profesora" ("Rate my professors") website to gain insights into the teaching faculty, courses, and programs offered by the Faculty of Technical Sciences. Another study in Indonesia used an unsupervised lexicon-based method for both tasks [27]. It used feedback from recent online learning graduates of BINUS (Bina Nusantara University). Moreover, [28] proposed a hybrid feature selection method to address T1 and T2 in Arabic tweets related to Qassim University. It extracts aspects related to the education domain such as teaching quality, services, and activities. The purpose of that study was to enhance the SVM classifier in ABSA by reducing the number of features used. The results showed that the hybrid method successfully improved SVM classifier performance, with F1 scores of 0.70 for T1 and 0.71 for T2. Table 1 provides a summary of prior work on ABSA in the educational domain, showing the year of publication, the targeted language, the data source, the approach used, and the tasks covered by each paper. As shown, there is a lack of ABSA research on the Arabic language in the educational domain. This research aims to contribute in this direction, benefitting researchers and practitioners.

B. ARABIC ABSA APPROACHES

1) UNSUPERVISED APPROACHES
A comparative study was conducted to test and assess various lexicon-based approaches for ABSA tasks T3 and T4 based on 63,000 book reviews annotated by humans [29]. This was later extended using enhanced lexicon-based approaches on the same book review dataset to achieve results that exceeded those of the previous study, particularly for T4 (accuracy: 0.88) and T3 (F1 score: 0.24) [30]. Several studies have combined two approaches or models to produce superior models. Reference [31] combined corpus- and lexicon-based approaches to address tasks T2 and T4 using a large-scale Arabic book review dataset. Furthermore, [32] proposed a hybrid approach to address T1 and T2 from reviews of Arabic government applications. This approach combined lexicons with rule-based models. The authors aimed to develop rules, techniques, and lexicons to address the challenges of sentiment analysis. The results showed an increase in accuracy when compared to the baseline models.

2) SUPERVISED APPROACHES
Supervised approaches depend on a training process that uses labeled data to train the machine to predict the output for new input. Various studies have used the Arabic-language hotel review dataset as a benchmark to evaluate their proposed approaches or models. The authors in [9] proposed a framework for applying ABSA to Arabic. They suggested the use of an SVM approach for tasks T1, T2, and T3. Reference [33] considered morphological, syntactic, and semantic features to address task T2, in addition to T1 and T3. The authors examined multiple classification methods such as naïve Bayes, Bayes networks, decision trees, k-nearest neighbor (K-NN), and SVM. The results showed that models developed using the supervised learning approach performed better than those combining lexicons with rule-based models, and that SVM performed best among the classifiers for all tasks in the study. Moreover, [12] evaluated various classifier techniques for T1, and the results showed that the adaptive boosting (AdaBoost) classifier achieved the best results compared with previous methods in terms of precision (97%) and recall (96.9%).

3) DEEP LEARNING APPROACHES
A study by [34] compared two pretrained word-embedding models for ABSA: fastText Arabic Wikipedia and AraVec-Web. An SVM classifier was used to train the model for tasks T1 and T2 on a dataset of 5000 Arabic tweets related to airline services that were manually labeled for ABSA. The study showed an enhancement in SVM classifier performance when features were extracted using word embedding. The result was slightly better when the fastText Arabic Wikipedia word embedding was used compared with AraVec-Web, indicating the usefulness of word embedding for sentiment analysis.
Other studies used the Arabic-language hotel review dataset to evaluate the proposed approach or model. Reference [35] applied a deep RNN and SVM to hotel reviews to address tasks T1, T2, and T3. The results showed that the SVM exceeded the deep RNN. However, the authors suggested enhancing the proposed deep learning approach by assessing different LSTM networks and using word embedding, such as fastText. Reference [36] applied the suggestions of the previous study by utilizing LSTM neural networks for T1 and T2. The results showed that the method used exceeded the baseline (SVM trained with N-gram features) for both the T1 and T2 tasks. Furthermore, [37] applied two deep learning models: the convolutional independent LSTM model (C-IndyLSTM) for T1, and the memory-based recurrent attention model (MBRA) for T3. The C-IndyLSTM model is based on a convolutional neural network and stacked independent long short-term memory, whereas the MBRA model is based on stacked bidirectional independent LSTM, a position-weighting mechanism, and multiple attention mechanism layers. Moreover, [38] applied two deep-learning models based on GRU neural networks. The first model, BGRU-CNN-CRF, combines a bidirectional GRU, CNN, and CRF for T1. The second model, IAN-BGRU, is an interactive attention network used for T2.
Recently, increased attention has been paid to the use of large pre-trained language models, such as BERT and its variations, as they achieve superior results for a variety of NLP tasks. Reference [39] proposed a BERT model with a simple linear classification layer to accomplish T2 only. Experiments on three Arabic datasets (hotel reviews, book reviews, and Arabic news) showed that the proposed model achieved accuracies of 89.51%, 73.23%, and 85.73%, respectively. The researchers aim to accomplish T1 and T3 in future work. Reference [40] proposed a transfer learning method using the AraBERT pre-trained language model to accomplish tasks T1 and T3.
Most previous studies handled the T1 and T2 tasks individually or sequentially, with independent models designed for each task. However, other studies perform T1 and T2 jointly through multi-task learning. Reference [41] developed a lightweight ABSA framework called Python aspect-based sentiment analysis (PyABSA), which can be used for T1 and T2. The models were trained on various datasets, including restaurants, laptops, MOOCs, Twitter, and other domains in eight languages (including the Arabic-language SemEval-2016 Task 5 dataset). The Arabic dataset was used to evaluate the BERT-ATESC, Fast LCF-ASESC, and LCF-ATESC models. Performance evaluation showed that the BERT-ATESC model achieved the best results, with an F1 score of 71.18% for T1 and T2.
Furthermore, [42] tested a transfer-learning approach using Arabic-BERT-CRF for tasks T1 and T2 on a human-annotated Arabic dataset for ABSA. The experimental results demonstrated that the model exceeded the baseline model, which relied on conditional random fields (CRF) with features extracted using named entity recognition (NER), POS tagging, parsing, and semantic analysis, as well as other recently proposed models such as AraBERT, MarBERT, and CamelBERT-MSA. Reference [43] proposed a multi-task learning approach called local context focus-aspect term extraction and polarity classification (LCF-ATEPC), with AraBERT as a shared layer for Arabic contextual text representation, to accomplish T1 and T2 simultaneously. The reference hotel and product review datasets were used. In addition, the authors proposed a data augmentation technique for T2 that involves generating synthetic data using back-translation and synonym replacement. The results showed that the proposed model outperformed the baseline models on both datasets for both single- and multi-task approaches, achieving state-of-the-art performance. Table 2 compares the Arabic ABSA studies summarized above in terms of the year of publication, the data source, the approach used, and the results for the tasks covered by each study.
Overall, Arabic ABSA has evolved significantly over the years, transitioning from lexicon-based approaches to deep learning techniques. Lexicon-based approaches were simple but suffered from scalability constraints and the inability to adapt to context-dependent nuances in sentiment analysis. Supervised learning methods improved scalability but required substantial amounts of labeled data and involved feature engineering. In recent years, however, deep learning has emerged as a dominant approach in ABSA, largely due to the Transformer-based BERT model. BERT has demonstrated remarkable effectiveness in understanding contextual information, capturing complex language patterns, and addressing prior limitations. Moreover, BERT has introduced the concept of transfer learning in NLP, enabling it to learn general language representations through pre-training on vast text corpora. Subsequently, fine-tuning BERT on task-specific data significantly reduces the need for extensive labeled data, making it an ideal choice for this research.
As a part of our research, we carefully selected the latest and top-performing BERT-based models from the literature: LCF and AraBERT. The FAST-LCF-ATEPC model stood out as it efficiently performs aspect term extraction and aspect polarity classification simultaneously. AraBERT, on the other hand, was specifically designed for and trained on Arabic data, making it a promising model. While AraBERT has shown significant improvement over baseline approaches in various Arabic NLP tasks, it has been outperformed by MARBERT [44]. QARiB also performed well in Arabic NLP tasks such as SA and NER, but its performance for Arabic ABSA has not yet been evaluated. Therefore, it is essential to experiment with the most promising BERT-based models for Arabic ABSA and evaluate their performance on related data in order to develop an effective ABSA system for educational institutions. Table 3 provides a comprehensive overview of the BERT models used, highlighting their respective areas of focus, advantages, and limitations. Moreover, each model is explained separately in the methodology section.

IV. RESEARCH CONTRIBUTION
This study differs from prior work in that it examines and applies ABSA methods in a domain that has received limited attention: Arabic-language text obtained from the educational sector. This sector has not been extensively studied, and the effectiveness of pre-trained models such as BERT, which have performed well in various NLP tasks, remains unexplored at the intersection of Arabic ABSA and the educational sector. It is essential to acknowledge that models trained for one domain may not perform as well in another, emphasizing the need for rigorous evaluation of different ABSA models on Arabic text derived from educational data. The main contributions of this research are 1) to examine the performance of different BERT-based models for aspect term extraction for Arabic text sourced from educational institution surveys; and 2) to develop a system that automates the ABSA process in a way that will automatically label survey responses. This research has significant implications, including improving the quality of education and enhancing user satisfaction. Moreover, it enables the automatic identification of areas of concern or success, which, in turn, can inform policymakers and aid in the allocation of resources to meet the evolving needs of students and educators. The benefits of this research are not limited to educational institutions, which can expedite their analysis through the automation of the four steps of ABSA, but also extend to the advancement of natural language processing research, particularly for the Arabic language.

V. METHODOLOGY
This section introduces the dataset and models used. It then describes the approach used to develop the aspect-based sentiment analysis system. Lastly, it defines the performance measures used to evaluate the different tasks in this research.

A. DATA
This section describes the steps involved in building a reliable annotated dataset from an educational context for testing and evaluation. In the first step, data were collected from the KAU service evaluation survey. The responses were usually written in formal Arabic, with a maximum of 200 characters. Students disclosed their feelings transparently regarding the university's main aspects. A total of 1815 responses were collected, 91 of which were written in English and 218 of which were garbled data, such as spaces, numbers, and symbols. The remaining 1506 responses were analyzed. Some sample responses to the open-ended question "Any other additions that were not mentioned in the questionnaire that you would like to mention?" are shown in Table 4.
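The source does not state how English and garbled responses were identified. As one possible approach, the sketch below labels a response as Arabic if it contains at least one Arabic letter, as English if it contains only Latin letters, and as garbled otherwise. The regular expressions, the function name `classify_response`, and the sample responses are assumptions for illustration and not the procedure used in the study.

```python
import re
from collections import Counter

# Assumed heuristics: Arabic letters U+0621-U+064A; basic Latin letters.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
LATIN_LETTER = re.compile(r"[A-Za-z]")


def classify_response(text):
    """Label a survey response as 'arabic', 'english', or 'garbled'."""
    if ARABIC_LETTER.search(text):
        return "arabic"
    if LATIN_LETTER.search(text):
        return "english"
    # Only spaces, digits, or symbols remain.
    return "garbled"


# Hypothetical sample responses.
responses = ["الخدمات ممتازة", "Good services", "12345 ***", "   "]
summary = Counter(classify_response(r) for r in responses)
kept = [r for r in responses if classify_response(r) == "arabic"]

# Sanity check on the counts reported in the text: 1815 - 91 - 218 = 1506.
remaining = 1815 - 91 - 218
```

A response that mixes Arabic with a few English words is kept as Arabic under this rule, which is a design choice that may or may not match the manual screening used for the dataset.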
In the second step, the collected dataset was annotated manually to label the responses. The annotation was performed by three KAU employees: A, B, and C. Detailed guidelines were provided to the annotators to help them extract aspects and identify their polarity and categories. After the annotated data were received from the annotators, the data were explored and cleaned. Table 5 provides three examples of responses to this question, along with human annotations, showing a sample of the in-house dataset.
In the third step, the labeled responses were assessed by calculating Cohen's kappa, which measures the agreement between two annotators, to ensure the quality of the annotation process [57]. In our case, Cohen's kappa was used to check the agreement for pairs of annotators (A and B, A and C, and B and C) separately for aspect, polarity, and category. As shown in Table 6, the best agreement for aspect, polarity, and category was between annotators B and C. Cohen's kappa showed substantial agreement for the aspect, with a kappa value of 0.70. Furthermore, B and C yielded the largest number of aspects compared with A and B or A and C. Moreover, the polarity and category showed almost perfect agreement, with kappa values of 0.92 and 0.87, respectively.
According to the Cohen's kappa results, annotators B and C were selected to construct the golden dataset. When there was a discrepancy between B and C, the annotation of A was consulted. Only matching aspects, polarities, and categories were retained, which resulted in 448 responses that included 639 aspects related to 13 categories. The polarity of these aspects is skewed toward negative (512 aspects, ∼80%), which is expected because people tend to recall and report negative experiences or thoughts more than positive ones, a tendency known as negativity bias [58], [59].
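For reference, Cohen's kappa for two annotators is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The following minimal sketch computes it for one pair of annotators; the polarity labels shown are hypothetical and are not taken from the KAU dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical polarity labels from two annotators for six aspects.
annotator_b = ["neg", "neg", "pos", "neg", "neu", "pos"]
annotator_c = ["neg", "neg", "pos", "pos", "neu", "pos"]
kappa = cohen_kappa(annotator_b, annotator_c)
print(round(kappa, 3))  # 0.739
```

In this toy example the annotators agree on five of six items (p_o = 5/6) and the chance agreement is p_e = 13/36, giving κ = 17/23 ≈ 0.739.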

B. MODELS

1) FAST-LCF-ATEPC MODEL
FAST-LCF-ATEPC, proposed in 2021, is a multitask learning model based on self-attention and local context focus (LCF) that integrates the pretrained BERT model. Unlike other models, it extracts aspect terms and infers their polarity at the same time [46]. It employs two separate BERT layers to capture the global and local context, respectively. To enable simultaneous multi-task learning, the input sequences are divided into separate tokens, and each token is assigned two labels. The first label determines whether the token is part of an aspect, while the second label denotes the polarity of the token associated with the aspect.
PyABSA, which is an open framework, provides different versions of FAST-LCF-ATEPC trained on the SemEval 2016 Arabic dataset [41]. The checkpoints used in this study were multilingual, multilingual-256, and multilingual-256-2. The main difference between these three models is the number of languages and the size of the embedding layers used in the model, as shown in Table 7. For the multilingual checkpoint, the model is trained on a multilingual dataset with 5 languages (English, French, German, Italian, and Spanish), and the embedding layer size is 768. For the multilingual-256 checkpoint, the model is also trained on a multilingual dataset with 5 languages but uses a smaller embedding layer size of 256. This reduces the memory footprint of the model and can improve training speed on smaller datasets. For the multilingual-256-2 checkpoint, the model is trained on a larger multilingual dataset with 15 languages and uses the smaller embedding layer size of 256. This allows the model to generalize better across languages and reduces the likelihood of overfitting on any particular language [41], [46]. Therefore, it is essential to conduct an empirical evaluation of all three models to determine which one yields better results, considering the differences in the size of the embedding layer and the diversity of languages used in training.
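The two-labels-per-token scheme described above can be sketched as follows. The example assumes a BIO-style tag set (`B-ASP`, `I-ASP`, `O`) for the first label and uses `None` as the polarity of non-aspect tokens; the tag names, the function `label_tokens`, and the English sample sentence are illustrative assumptions rather than the exact encoding used by FAST-LCF-ATEPC.

```python
def label_tokens(tokens, aspects):
    """Assign each token an aspect tag and a polarity tag.

    aspects: list of (start, end, polarity) token spans, end exclusive.
    """
    aspect_tags = ["O"] * len(tokens)
    polarity_tags = [None] * len(tokens)
    for start, end, polarity in aspects:
        for i in range(start, end):
            # First token of the span opens the aspect, the rest continue it.
            aspect_tags[i] = "B-ASP" if i == start else "I-ASP"
            polarity_tags[i] = polarity
    return list(zip(tokens, aspect_tags, polarity_tags))


# Hypothetical example: "campus parking" is one negative aspect term.
tokens = ["The", "campus", "parking", "is", "crowded"]
labeled = label_tokens(tokens, [(1, 3, "negative")])
for row in labeled:
    print(row)
```

Each token therefore carries one label for the extraction task and one for the polarity task, so a single forward pass over the sequence can supervise both.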

2) ARABERT MODEL
AraBERT was developed in 2021 as a pretrained BERT model specifically for the Arabic language, with the aim of achieving the same success as BERT for the English language. In addition to the BERT base configuration, AraBERT employs two tasks: a Masked Language Modeling (MLM) task, which improves pre-training by forcing the model to predict the whole word instead of getting hints from parts of the word, and a Next Sentence Prediction (NSP) task, which helps the model understand the relationship between two sentences and can be useful for many language understanding tasks such as question answering. It was trained on a large-scale Arabic corpus extracted from news articles in the Arabic media. This corpus contained modern standard Arabic (MSA) data and includes 70 million sentences and 3 billion words. The authors evaluated the model on three downstream NLP tasks: SA, question answering, and NER. The performance of AraBERT was compared with that of multilingual BERT from Google and other state-of-the-art approaches. The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks [48].

3) MARBERT MODEL
MARBERT is an Arabic-focused transformer language model developed in 2021. Unlike AraBERT, MARBERT is trained using data from the Twitter platform (one billion Arabic tweets), which includes both MSA and diverse Arabic dialects. MARBERT uses the same network architecture as the BERT model but excludes the next sentence prediction objective because of the word count limit in tweets. MARBERT was evaluated using six NLP tasks: sentiment analysis, topic classification, dialect identification, question answering, NER, and social meaning. According to [44], the results of these six tasks showed that MARBERT was significantly better than AraBERT.

4) QARIB MODEL
QARiB is a pretrained model developed in 2021 [56]. The authors trained five BERT models on different sizes of training sets, different linguistic preprocessing, and different text varieties: formal MSA and informal Arabic dialects. The MSA texts include data extracted from newswire sources, online Arabic newspaper websites, and movie and TV subtitles, whereas the dialect text includes Twitter data. The corpus contained 180 M sentences and 440 M tweets composed of 2.7 B words. According to [56], QARiB achieved state-of-the-art results on several tasks such as emotion, NER, and offensive aspects.

C. ABSA TASKS
Our research objective was to develop an aspect-based sentiment analysis (ABSA) system for KAU that facilitates the analysis of Arabic survey responses. To achieve this, we conducted experiments to determine the most suitable model for our task. Using the PyABSA framework [41], we evaluated three checkpoints of the FAST-LCF-ATEPC model: multilingual, multilingual-256, and multilingual-256-2. These checkpoints are designed to perform T1, which involves identifying the aspect term, and T2, which involves assigning its polarity, simultaneously. In addition, we fine-tuned three pre-trained language models designed for Arabic NLP tasks, namely AraBERT [48], MARBERT [44], and QARiB [56]. The base architecture of these three models remains the same, including the tokenization of input text into subword tokens.
To fine-tune the models for the aspect extraction task (T1), a task-specific token classification head of the kind used for named entity recognition (NER) was added. NER is an NLP technique that automatically finds and categorizes names, words, or phrases in text that refer to real-world objects such as people, groups, places, dates, and amounts. This additional layer is responsible for predicting the label of each token in the input. Fine-tuning requires annotated datasets in which each word is assigned a label, such as 'ASP' for aspect and 'O' for other. We used the reference multilingual ABSA dataset (SemEval2016-ABSA for Task 5), with 9620 examples [60], for both training and validation. During fine-tuning, the cross-entropy loss function measured the dissimilarity between the predicted label probabilities of each token and the actual labels. This loss was then backpropagated through the network to update the weights of the token classification head. To avoid overfitting, we monitored the model's performance on the validation set and adjusted parameters accordingly. Default hyperparameters were used for all models: an embedding size of 100, a batch size of 32, 8 epochs, and a learning rate of 5e-5. After fine-tuning, the model is equipped to perform the aspect extraction task.
Figure 2 illustrates the four tasks performed to develop the end-to-end ABSA system in this study. The input was Arabic survey responses. The first task (T1) involved extracting aspects from each response; we conducted six experiments to compare the models on this task. The best-performing model, the one with the highest F1 score for aspect extraction, was then used in T2, which involved identifying the polarity associated with each aspect. The third task (T3) mapped each extracted aspect to a category using a predefined list of categories and their associated aspects curated from the golden dataset. Once T3 was completed for all responses, we presented the extracted aspects, polarity, and category for each response, as shown in Table 8. In the final task (T4), we aggregated the results for each category by counting the polarities of its related aspects, which allowed us to assign an overall polarity to each category.
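The token-level labeling required for fine-tuning can be illustrated with a minimal sketch. The function name `label_tokens`, the whitespace tokenization, and the English example sentence are assumptions made for illustration only; the study's data are Arabic responses, and this is not the authors' annotation tool.

```python
from typing import List


def label_tokens(tokens: List[str], aspect_terms: List[List[str]]) -> List[str]:
    """Assign 'ASP' to tokens inside any aspect term and 'O' to all others."""
    labels = ["O"] * len(tokens)
    for term in aspect_terms:
        n = len(term)
        if n == 0:
            continue
        # Slide a window over the sentence and mark every exact match of the term.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == term:
                for i in range(start, start + n):
                    labels[i] = "ASP"
    return labels


# Hypothetical English example (whitespace-tokenized for simplicity).
tokens = "the library closes too early".split()
labels = label_tokens(tokens, [["library"]])
print(list(zip(tokens, labels)))
```

With a subword tokenizer, word-level labels such as these would additionally have to be aligned to the subword pieces before the loss is computed; that step is omitted here.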

D. PERFORMANCE MEASURES
In this study, various evaluation metrics were used. For T1, because aspects were extracted directly from the responses and not from a predefined list, Message Understanding Conference (MUC) metrics were used [61] to obtain detailed results. MUC represents one of the earliest and longest-running efforts to evaluate language-understanding technologies. It is particularly useful for text processing problems such as sentiment analysis and information extraction [62]. MUC considers different categories of errors: correct (COR), incorrect (INC), partial (PAR), missing (MIS), and spurious (SPU). These metrics are defined by comparing the responses of a model against the golden annotation. We used the COR, INC, MIS, and SPU metrics and eliminated the PAR metric because we considered PAR to be COR in our case. For example, the aspect "university" was considered COR as long as it was part of the actual aspect "KAU university" in the golden dataset; it did not need to be identical. Recall (R), precision (P), and F score (F1) were calculated as secondary metrics from MUC-5, as these metrics are commonly used to compare models, as shown in (1)-(3):
$$R = \frac{COR}{COR + INC + MIS} \tag{1}$$
$$P = \frac{COR}{COR + INC + SPU} \tag{2}$$
$$F1 = \frac{2 \times P \times R}{P + R} \tag{3}$$
where COR, INC, MIS, and SPU denote the numbers of correct, incorrect, missing, and spurious aspects, respectively.
For the polarity classification (T2) and category mapping (T3), a confusion matrix was used to report the detailed performance of the classification tasks. From the confusion matrix, four commonly used classification metrics were computed: P, R, F1, and accuracy (Acc) [63]. The overall category polarity (T4) is a summation of the polarities that belong to the same category, which allows for an overall result representation.
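The T1 scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: precision is taken as COR over everything the model produced (COR + INC + SPU) and recall as COR over everything in the golden annotation (COR + INC + MIS), which are the standard MUC-style definitions with PAR folded into COR. The helper names `is_match` and `muc_scores`, the character-level containment test, and the counts in the usage line are hypothetical.

```python
from typing import Tuple


def is_match(predicted: str, gold: str) -> bool:
    """Count a prediction as COR if it equals, or is contained in, the gold aspect.

    Containment is checked on characters, a simplification of the
    'partial counts as correct' rule described in the text.
    """
    predicted, gold = predicted.strip(), gold.strip()
    return bool(predicted) and predicted in gold


def muc_scores(cor: int, inc: int, mis: int, spu: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from MUC-style counts."""
    actual = cor + inc + spu      # everything the model produced
    possible = cor + inc + mis    # everything in the golden annotation
    precision = cor / actual if actual else 0.0
    recall = cor / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to show the call.
precision, recall, f1 = muc_scores(cor=8, inc=1, mis=1, spu=1)
print(is_match("university", "KAU university"), precision, recall, f1)
```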

VI. EXPERIMENTAL RESULTS
The experiments were performed on an educational dataset with 448 responses to evaluate T1. The first three experiments used the FAST-LCF-ATEPC model, each with a different checkpoint: multilingual, multilingual-256, and multilingual-256-2. The remaining three experiments used AraBERT, MARBERT, and QARiB, respectively, with default hyperparameters. The experiments were implemented in Python, with PyTorch as the deep learning framework. A snapshot of the output results is shown in Table 8.

A. ASPECT EXTRACTION RESULTS
The results of the six experiments are presented in Table 9. The FAST-LCF-ATEPC (multilingual) model achieved the best results for T1, with an F1 score of 0.58, a precision of 0.65, and a recall of 0.52. One possible reason is its large embedding layer size, which provides a higher-dimensional space in which tokens can be represented and thus allows more expressive representations. This higher dimensionality enables the model to capture more nuanced relationships and semantic information between words. The model extracted aspects from 314 of the 448 responses; no aspects could be extracted from the remaining 134.
Upon evaluating the MUC-5 metrics, we found that the model accurately extracted aspects from 205 responses, while 55 contained incorrect aspects and 54 contained spurious aspects that were not in the golden dataset.We believe that with fine-tuning and data augmentation, the model can be further improved to extract aspects from the remaining responses to achieve better results.
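As a worked check, the reported T1 scores can be reproduced from the counts above. The calculation assumes MUC-style definitions in which PAR is folded into COR, and it treats the 134 responses with no extracted aspect as MIS; both are assumptions made for this illustration.

```latex
\begin{align*}
P  &= \frac{COR}{COR + INC + SPU} = \frac{205}{205 + 55 + 54} = \frac{205}{314} \approx 0.65 \\
R  &= \frac{COR}{COR + INC + MIS} = \frac{205}{205 + 55 + 134} = \frac{205}{394} \approx 0.52 \\
F1 &= \frac{2PR}{P + R} = \frac{2 \times 205}{314 + 394} = \frac{410}{708} \approx 0.58
\end{align*}
```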
AraBERT, on the other hand, was able to extract aspects for the largest number of responses, which could be due to its tailored training for the Arabic language and its ability to capture the unique nuances of the language.However, it also extracted a high number of spurious aspects, leading to a lower precision and F1 score.MARBERT and QARiB had lower performance, which could be due to the original dataset used in the pre-trained models, which included various dialects in addition to formal Arabic.Overall, these observations highlight opportunities for further improvement in aspect extraction for the Arabic language in the educational domain.
The domain addressed in this study is relatively unexplored and, as noted in [64], no technique can guarantee good performance in all domains. We therefore opted not to compare our results with existing work in other domains, to avoid potentially misleading comparisons. Within the educational domain, only one prior study used Arabic data; the data were collected from Twitter and analyzed with an SVM method, which is completely different from the method used in this study. Instead, we compared various BERT-based methods on our own dataset of Arabic text from the educational domain to determine the best method for Arabic ABSA related to education.

B. POLARITY CLASSIFICATION RESULTS
The polarity classification task determined the polarity of each extracted aspect. For this task, the FAST-LCF-ATEPC (multilingual) model was used because it achieved the best results for T1. Table 10 shows that 29% of the aspects extracted by the model were positive, 70% were negative, and 1% were neutral. By comparison, of the 300 matched aspects in the golden dataset, 14% were positive and 86% were negative. Since there was no neutral polarity in the golden dataset, the neutral aspects were removed from the results.
The model was then reevaluated using a confusion matrix, P, R, Acc, and F1, as shown in Figure 3 and Table 11. In the confusion matrix, rows represent the actual polarity of the aspects in the golden dataset, whereas columns represent the polarity predicted by the FAST-LCF-ATEPC (multilingual) model. As shown, the data were unbalanced, with 255 negative aspects and 42 positive aspects. Only one positive aspect was incorrectly classified as negative, whereas 47 negative aspects were incorrectly classified as positive.
As shown in Table 11, the accuracy of the model was 84%. For negative polarity, the precision was 100%, the recall was 82%, and the F1 score was 90%. For positive polarity, the precision was 47%, the recall was 98%, and the F1 score was 63%.
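The per-class figures can be recomputed from the confusion-matrix counts given in the text (255 negative and 42 positive aspects, 47 negatives predicted as positive, and one positive predicted as negative). The sketch below is an illustration only; the function name `class_metrics` is hypothetical and the code is not taken from the study.

```python
from typing import Tuple


def class_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


n_negative, n_positive = 255, 42           # actual class sizes from the text
neg_as_pos, pos_as_neg = 47, 1             # misclassifications from the text
neg_correct = n_negative - neg_as_pos      # 208 negatives predicted correctly
pos_correct = n_positive - pos_as_neg      # 41 positives predicted correctly

negative = class_metrics(tp=neg_correct, fp=pos_as_neg, fn=neg_as_pos)
positive = class_metrics(tp=pos_correct, fp=neg_as_pos, fn=pos_as_neg)
accuracy = (neg_correct + pos_correct) / (n_negative + n_positive)
# Support-weighted average of the two per-class F1 scores.
weighted_f1 = (n_negative * negative[2] + n_positive * positive[2]) / (n_negative + n_positive)

print([round(x, 2) for x in negative])     # precision, recall, F1 for negative
print([round(x, 2) for x in positive])     # precision, recall, F1 for positive
print(round(accuracy, 2), round(weighted_f1, 2))
```

The rounded values agree with the percentages reported above, and the support-weighted F1 comes to about 0.86, which appears consistent with the overall polarity classification score stated in the abstract.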

C. CATEGORY MAPPING RESULTS
In this task, each aspect extracted by the model was mapped to a category. There were 13 categories, including university infrastructure and public services, medical administration and its services, and libraries and their services. To achieve this, a predefined list of categories was constructed from the golden dataset. The assigned category was evaluated against the human-assigned category for each sample.
In the confusion matrix (Figure 4), the rows represent the actual categories in the golden dataset, and the columns represent the categories assigned using the predefined list. As shown, the data were unbalanced. For category (0), for example, 148 aspects were labeled correctly and three were incorrectly classified. Overall, category assignment for the extracted aspects achieved an accuracy of 0.98 and a weighted average F1 score of 0.98.
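The list-based mapping used in this task can be sketched as a dictionary lookup. The entries in `ASPECT_TO_CATEGORY`, the function name `map_aspect`, and the containment fallback are all hypothetical: the study's predefined list was curated from the golden dataset and contains Arabic aspect terms, and the text does not state how near-matches were handled.

```python
from typing import Dict, Optional

# Hypothetical English entries; only the category names come from the text.
ASPECT_TO_CATEGORY: Dict[str, str] = {
    "wifi": "university infrastructure and public services",
    "parking": "university infrastructure and public services",
    "clinic": "medical administration and its services",
    "library": "libraries and their services",
}


def map_aspect(aspect: str, lookup: Dict[str, str]) -> Optional[str]:
    """Map an extracted aspect term to a category, or None if it is not in the list."""
    key = aspect.strip().lower()
    if not key:
        return None
    if key in lookup:
        return lookup[key]
    # Assumed fallback: accept containment in either direction, so that
    # "central library" still maps through the listed term "library".
    for known, category in lookup.items():
        if known in key or key in known:
            return category
    return None


print(map_aspect("central library", ASPECT_TO_CATEGORY))
```

Returning `None` for unlisted aspects keeps them visible for manual review instead of forcing them into a category.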

D. CATEGORY POLARITY RESULTS
In this section, the results are summarized to provide the overall polarity for each category. Table 12 summarizes the number of positive and negative aspects extracted by the model and the overall polarity of each category. Table 12 can inform decision-makers about the services that need to be improved, as people tend to leave written feedback when they want to complain. In this summary, all short responses with fewer than three words were removed from the analysis for two reasons. First, they did not have an explicit aspect, and second, they were typically positive sentiments, such as "Thank you" or "Nothing," which are not valuable to decision-makers.
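The aggregation in T4 can be sketched as a per-category count of positive and negative aspects. The record layout, the function name `aggregate_category_polarity`, the sample records, and the majority rule used to pick the overall polarity are assumptions for illustration; the text states only that polarities were counted per category and that responses with fewer than three words were removed.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def aggregate_category_polarity(records: List[dict], min_words: int = 3) -> Dict[str, dict]:
    """Count aspect polarities per category and derive an overall polarity."""
    counts = defaultdict(Counter)
    for rec in records:
        # Skip short responses, which were excluded from the summary.
        if len(rec["response"].split()) < min_words:
            continue
        counts[rec["category"]][rec["polarity"]] += 1

    summary = {}
    for category, c in counts.items():
        pos, neg = c["positive"], c["negative"]
        # Assumed rule: the more frequent polarity wins; equal counts are a tie.
        if pos > neg:
            overall = "positive"
        elif neg > pos:
            overall = "negative"
        else:
            overall = "tie"
        summary[category] = {"positive": pos, "negative": neg, "overall": overall}
    return summary


# Hypothetical records, one per extracted aspect.
records = [
    {"response": "the library closes too early", "category": "libraries", "polarity": "negative"},
    {"response": "library staff are very helpful", "category": "libraries", "polarity": "positive"},
    {"response": "the library is too crowded", "category": "libraries", "polarity": "negative"},
    {"response": "Thank you", "category": "libraries", "polarity": "positive"},
]
summary = aggregate_category_polarity(records)
print(summary)
```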

VII. CONCLUSION AND FUTURE WORK
In this study, we evaluated different BERT-based models for Arabic ABSA in the educational domain: FAST-LCF-ATEPC (multilingual), FAST-LCF-ATEPC (multilingual-256), FAST-LCF-ATEPC (multilingual-256-2), AraBERT, MARBERT, and QARiB. These models were fine-tuned using a reference multilingual ABSA dataset (SemEval2016-ABSA for Task 5). Six experiments were performed to determine the best method for extracting aspect terms. The best result was achieved using the FAST-LCF-ATEPC (multilingual) model. This model performs T1 and T2 simultaneously by extracting aspect terms and classifying their polarities. This is preferable to pipeline solutions that use a separate model for each task, in which the output of the T1 model is used as the input for the T2 model and errors can therefore propagate from one step to the next. The end-to-end ABSA system achieved good results for all four tasks. Future research should explore new methods to improve the aspect extraction task, as there is still room for improvement. Other methods for optimizing T2 should also be investigated. This study contributes to the body of knowledge by enriching research in the Arabic language as well as the educational field. The system can be used by educational institutions to analyze open-ended Arabic responses more efficiently and improve their services and institutions.

TABLE 4. Examples of responses to the open-ended question.

FIGURE 2. An end-to-end ABSA framework used in the study.

FIGURE 3. Confusion matrix of model polarity classification results.

FIGURE 4. Confusion matrix of model category mapping result.

TABLE 1. A summary of prior work on ABSA in the educational field.

TABLE 3. Overview of the used pre-trained BERT models.

TABLE 5. Examples of labeled responses.

TABLE 8. Snapshot of the output results.

TABLE 9. Summary of T1 experiment results.

TABLE 12. Summarization of the overall polarity for each category.