Exploratory Data Analysis and Classification of a New Arabic Online Extremism Dataset

The dissemination of extremist ideas and causes online has intensified over the last decade. Extremist organizations use social media to gain publicity and new recruits, often with little interference from network providers. New techniques are being developed to identify extremist content so that it can be promptly removed and its authors blocked from network access. However, most techniques are only compatible with the English language, despite the fact that extremist propaganda is frequently shared in other languages, including Arabic. Since the most effective methods for automated linguistic analysis use deep learning and require large, high-quality datasets, creating specialised data samples containing examples of extremist communication is an essential step toward a practical solution. In this paper, we present a dataset compiled for this purpose and discuss the classification methods that can be used for extremism detection. The manually annotated Arabic Twitter dataset consists of 89,816 tweets published between 2011 and 2021. Following predefined guidelines, three expert annotators labelled the tweets as extremist or non-extremist. Exploratory data analysis was performed to understand the dataset’s features. Several classification algorithms were evaluated on the dataset, including logistic regression, support vector machine, multinomial naïve Bayes, random forest, and BERT. Among the traditional machine learning models, support vector machine with term frequency-inverse document frequency features achieved the highest accuracy (0.9729). However, BERT outperformed the traditional models with an accuracy of 0.9749. This dataset is expected to enhance the accuracy of Arabic online extremism classification in future research, and so we have made it publicly available.


I. INTRODUCTION
Extremism on social media is a growing problem [1]. Extremists use these channels to promote their ideologies and gain recruits, exerting their influence and extending their operations beyond physical space. The specific features of online networks enable extremists to use them to contact other groups or individuals anonymously [2]. Therefore, social media platforms such as Twitter often serve as an ideal place for extremist individuals, groups, or organisations to gather substantial audiences, recruit cost-effectively, and engage in extremist discourse with limited restrictions [3]. In 2021, the number of social media users reached approximately 3 billion [4]. With this large and easily accessible audience, online social networks have become a useful platform for extremist propaganda. For example, recent events surrounding the 2020 US Presidential Election and the Black Lives Matter movement reflect how the distribution of violent or inflammatory content on social networks can instigate violence in the streets [5]. This reflects the power of social media networks to influence public opinion. Therefore, guarding against the use of this power by online extremists has become a key topic of interest for many governments, organisations, and social media platforms. The Islamic State of Iraq and Syria (ISIS) emerged in 2014 [6]. At the same time, it became clear that extremists can effectively use social media networks to pursue their agendas. In response, companies such as Twitter and Facebook launched various initiatives to prevent extremism on their platforms. The methods used for this task were both manual and automatic, with the latter often involving the detection of violent or extreme keywords. Recently, many researchers have sought to devise systems to automatically detect and predict extremist online content. Although progress has been made, most researchers have studied English content, and only a few have explored Arabic content [7].
Significantly, this lack of studies can be attributed to the limited availability of public Arabic datasets. Since extremists are extensive social media users, it is worthwhile to identify and detect extremist content automatically to assist in restricting its spread. Researchers from different areas, including computer science, social science, and psychology, have collaborated to develop initiatives to counter online extremism. Most of these initiatives have aimed to detect and classify extremist content on social media [8], [9], [10], and substantial contributions have been made in this area. Such efforts hold considerable value for governments, counter-terrorism agencies, and social network operators because they may contribute to controlling crime, limiting the spread of extremist ideologies, and preventing terrorist recruitment. Due to the value of automated systems for online extremism detection, the need for research in the domain of computer science has increased. Various domains have used natural language processing (NLP) and machine learning techniques to solve problems such as movie and product reviews, authorship, and sentiment analysis. These techniques have also been applied to detect and predict extremism or radicalisation in social media content (e.g., Twitter posts), usually in the context of ISIS groups. The number of active Twitter users reached 397 million users in July 2021 [11], which makes it an important data source for researchers. Twitter is a real-time public platform used by members of various social strata, ranging from ordinary people to celebrities to international organisations. Since November 2017, Twitter users have been able to post short, 280-character text messages, or tweets, which other users can interact with. The short form of the text messages makes extremism detection challenging because it is difficult to build contextual meaning from short sentences. 
The Arabic language is widely used on Twitter, with estimates from 2014 indicating that there are approximately 5 million active Arabic-speaking Twitter users [12]. At present, the Arab world includes 22 countries and millions of individuals who speak and write Arabic. However, despite the availability of online extremism detection techniques and datasets in the English language, only a limited body of literature exists on classifying Arabic content. A basic overview of existing scientific resources revealed that relatively few datasets that have been built specifically to explore the spread of extremism are publicly available at this time. Two datasets were released in 2016 and contain 17,000 and 122,000 messages related to ISIS activities, respectively [13] [14]. Another was published by Gupta et al. [15] in 2017 with 48,000 messages related to several terrorist groups, including Al Qaeda and the Taliban. All of these datasets are in the English language and, as such, cannot be used as training tools for other languages, which limits their practical utility. The only existing dataset of this kind in the Arabic language was created by Fraiwan, with 24,000 messages annotated by knowledgeable experts [16]. Regarding the classification methods that can be used to differentiate between extremist content and normal messages, machine learning methodologies have proven to be very effective. Some of the most commonly used methods include random forest (RF), support vector machines (SVM), and long short-term memory (LSTM) networks. These are trained using features related to message text, profile of the sender, and the time sent. These methods have proven to be relatively successful, recognising extremist messages with up to 85% accuracy, but they still leave too large a margin of error for practical use. 
The primary contributions of this study are as follows:
• A novel open-access Arabic language benchmark dataset for online extremism detection consisting of 89,816 labelled tweets has been created. Expert annotation and data validation were performed using different techniques to ensure the quality of the proposed dataset.
• Exploratory data analysis using in-depth statistical analyses was performed to understand and visualise the proposed online extremism dataset.
• Different classification models for online extremism detection are presented. To boost the accuracy of the classification models, N-gram features along with different feature sets were evaluated.
The remainder of this paper is organised as follows. Section II provides an overview of related work, including existing datasets. Section III presents the proposed methodology for the detection of extremist content. Section IV explains the implementation setup for evaluating the proposed method, and Section V offers concluding remarks and discusses pathways for future research.

II. RELATED WORK
This section provides an analysis of the existing literature on extremism detection with respect to the datasets and classifier techniques used.

A. DATASETS
Researchers have sought to detect online extremism by applying artificial intelligence techniques to social media content datasets (e.g. tweets from Twitter). However, research in this area faces a challenge in terms of the availability and quality of datasets containing extremist content. Aldera et al. [7] reported that very few datasets are publicly available because of data regulations. We conducted a review of publicly available datasets and identified only four. An overview of these datasets is provided in this section.
The Kaggle data science community [13] published two English language datasets in 2016. The first dataset, 'How ISIS Uses Twitter', scraped over 17,000 tweets from more than 100 pro-ISIS Twitter users worldwide after the November 2015 Paris Attacks. The second dataset, 'Tweets Targeting ISIS', served as a counterpart to the first, containing 122,000 tweets from 95,725 distinct users; these were general tweets about ISIS and related words [14]. These two datasets have been used in many studies [1], [17], [18], [19], [20], [21], and their availability has enabled the development of extremism detection techniques. Gupta et al. [15] released a data repository on GitHub [22] to assist in the identification of radical social media posts using machine learning. The authors used the Twitter Search REST API to extract public tweets that were posted between mid-February 2017 and mid-March 2017. The targeted tweets contained hashtags associated with radical groups, including #ISIS, #Taliban, #AlQaeda, #Wahhabism, and #Daesh (as seed hashtags); in turn, the frequencies of all the hashtags were calculated from the extracted tweets for use as a new search query. This process yielded approximately 48,000 unique English tweets. The content of the tweets was cleaned and preprocessed using tokenising and lemmatisation. Finally, approximately 25,000 tweets from the initial 48,000 were manually labelled as radical or not radical. Fraiwan [23] published the first dataset of annotated ISIS radical tweets in Arabic, which consisted of 24,000 tweets from 174 accounts related to ISIS. The author developed crawler software to collect tweets from suspected ISIS accounts. The annotation process evaluated whether a given tweet was radical, religious but not radical, or unrelated to the subject matter (e.g. sports). Two experts in religion from the armed forces performed the annotation and found that the dataset contained 45% radical, 43% religious but not radical, and 11% unrelated tweets. 
In their survey paper, Gaikwad et al. [24] identified three main challenges in online extremism datasets. The first is data imbalance, where the extremist class is smaller than the nonextremist class. The second challenge is that of data validation/verification: data availability becomes a challenge due to the suspension of extremist accounts, which makes replication of results difficult. Additionally, manual data validation suffers from bias as few experts label the data. The third challenge is that the data are collected from specific events or groups, which introduces a bias into the classification algorithms. As our analysis indicates, there is a limited number of datasets available for online extremism detection study. Moreover, certain datasets are incomplete and suffer from bias due to unclear or low-quality annotation processes; most of the datasets are in English, and only one is in Arabic. Thus, obtaining and annotating more data, specifically in the Arabic language, is essential for continued research on online extremism.

B. CLASSIFICATION
Recently, interest in text classification models for extremism detection, particularly in the context of social media networks, has been growing. Over the last few years, experts from different disciplines (e.g. computer science, social science, and psychology) have collaborated to develop solutions by applying artificial intelligence technology to the issue of online extremism [25]. This section presents the results of a literature review on automatic detection and classification of extremism on social media networks. Multidisciplinary research in online extremism detection has focused on the analysis of online extremism to understand the processes underlying it [18] [26] and the examination of how propaganda spreads online [27]. Researchers have also sought to devise systems to automatically detect extremist users and radical content online. In recent years, using machine learning techniques with textual features has become a popular practice. Traditional machine learning techniques and, more recently, deep learning techniques have been used by researchers to detect extremism on social media networks. Table 1 summarises prior studies that have applied machine learning techniques to online extremism detection. As shown in Table 1, the most commonly implemented algorithms were support vector machine (SVM), random forest (RF), and long short-term memory (LSTM). In several studies, including [28], [10], and [29], the accuracy of SVM exceeded 90%, and in [25], [15], and [30], the accuracy of the RF algorithm was higher than that of the SVM. Recently, deep learning techniques, particularly convolutional neural networks (CNN), originally developed in [31], and recurrent neural networks (RNN), proposed in the late 80s [32], [33], have yielded notable results. Researchers have also used LSTM networks to devise systems for extremist content detection on social media, as well as techniques such as SVM, RF, and maximum entropy [25]. 
The results in Table 1 indicate that LSTM outperformed most traditional machine learning techniques in terms of precision (85.9%). Most prior studies addressed extremism detection using three main categories of features: textual (NLP features), time, and profile. Textual features were primarily used in the classification task, which may involve the use of techniques such as term frequency-inverse document frequency (TF-IDF), N-gram, part-of-speech, and bag-of-words. Some studies combined time, profile, and network features to classify extremist content. In addition, some authors used more advanced features, such as psychological and behavioural features. However, to the best of our knowledge, no comparative study of the literature has identified which features and classification models perform with greater accuracy than others.

III. METHODS
This section outlines the proposed extremism detection module, which has a four-part architecture, as shown in Figure 1.
In the first part, data are obtained from Twitter using the Twitter API. Following this, standard NLP pre-processing techniques (e.g., tokenisation and lemmatisation) are applied to generate a dataset. Thereafter, the tweets are manually labelled as either extremist or non-extremist. In this study, we performed exploratory data analysis (EDA) to understand the dataset. Afterwards, we evaluated the dataset using various traditional machine learning models with different NLP features. We also evaluated the dataset using the bidirectional encoder representations from transformers (BERT) deep learning model. Finally, we evaluated the model's performance using metrics such as accuracy, F1-score, and area under the receiver operating characteristic curve (AUC). The following subsections outline the key stages of our module, including data collection and preparation. In turn, feature extraction, EDA, and predictive model development are discussed. Figure 2 demonstrates the methodology adopted to collect and construct the corpus. The methodology comprises three primary phases: corpus collection, corpus cleaning, and data annotation, which are described in detail in the following subsections.

1) CORPUS COLLECTION
By the third quarter of 2020, the average number of active daily Twitter users reached 187 million, and based on an average posting frequency of 500 million tweets every day, approximately 200 billion tweets are published on the platform per year, each up to 280 characters long [41]. Moreover, Twitter allows researchers to access its public data for research purposes via the Twitter API, with some limitations.
As mentioned in Section II, there is a limited number of datasets, particularly Arabic language datasets, available on the Internet for online extremism detection. Therefore, we collected new data from Twitter. The Twitter API allows developers to collect real-time tweets with different parameters based on given query terms. Our final query was as follows:
Data[] = Search{Search_Term, longitude, latitude, lang}
The API returned the text of tweets, along with user information (e.g. username, user location, number of friends, number of followers, number of likes, and user description). Moreover, a list of search terms was prepared for data collection based on trending Twitter topics in the Arab world.

2) CORPUS CLEANING
After corpus collection and before its annotation, a cleaning step is required to prepare the tweets for processing. We removed duplicates and empty tweets from the corpus and excluded non-Arabic tweets. At this stage, the number of tweets was reduced to approximately 145,000.

3) MANUAL ANNOTATION
The annotation process is critical because it directly influences model accuracy. Furthermore, annotating large datasets can be costly and time-consuming. Given the costly nature of the annotation process, most researchers use automatic annotation techniques that are based on dictionary resources, including WordNet and SentiWordNet. However, a limitation of these techniques is their potentially low accuracy. Owing to this and other limitations of automated annotation, we opted for manual annotation; specifically, we annotated our dataset by checking each tweet and considering the occurrence of each word and the meaning and context of the tweet. Although it is lengthy, manual annotation is accurate and reliable compared to automated annotation. In this research, three different raters manually labelled the collected tweets. Precautions were taken to minimise bias; these included establishing clear guidelines and validating the results using various techniques (as discussed in Section IV).

4) FEATURE EXTRACTION
Before classification, our proposed system converts tweets into vectors, enabling the classification models to perform statistical operations. First, NLP pre-processing is applied (e.g. stop word removal, lowercasing, tokenisation, and lemmatisation to obtain unigrams). To create feature vectors, different feature extraction techniques were used in this study:
• N-grams: These are the basic features in detection problems. A sequence of n words can be referred to as a unigram (one word), bigram (two words), trigram (three words), and so on, depending on the value of n.
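The N-gram definition above can be illustrated with a minimal sketch. The function name `extract_ngrams` and the English sample sentence are illustrative assumptions, not part of the study's pipeline, and tokenisation is reduced to whitespace splitting.

```python
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous sequence of n tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list shorter than n yields no n-grams at all.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenisation is a simplification of real Arabic pre-processing.
tokens = "the quick brown fox".split()
unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
```

In practice, the resulting tuples would be counted or weighted (for example with TF-IDF) before being passed to a classifier.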

B. EXPLORATORY DATA ANALYSIS
EDA is a technique used to explore datasets so as to extract useful and actionable information, identify relationships among the explanatory variables, detect mistakes, and preliminarily select appropriate models. It uses descriptive statistics and graphical tools to develop an understanding of the data [42]. EDA is used primarily to maximise insight into a dataset, detect outliers and anomalies, and test underlying assumptions [43]. In this study's EDA, we used graphical methods to summarise the data visually and diagrammatically. We applied a univariate graphical method that examines one variable at a time (e.g., using histograms, boxplots, or pie charts) for categorical data, whereas a multivariate graphical method was applied to consider two or more variables at a time and explore relationships. In the latter case, we used correlation analysis, which quantifies the pairwise correlation between numerical variables. In Section IV, we discuss how categorical and numerical features were explored and visualised using several Python libraries, including the Plotly Python graphing library, NumPy, and pandas. Using EDA, we acquired detailed insights into the dataset's characteristics.

C. CLASSIFICATION MODELS
There are many available classification models, and their effectiveness depends on the problem domain. Choosing the right model is critical for building a robust detection system. For our problem, the following algorithms were evaluated: logistic regression (LR), support vector machine (SVM), multinomial naïve Bayes (MNB), random forest (RF), and BERT.

D. PERFORMANCE EVALUATION
Different performance metrics were used to evaluate the classifier performance. Accuracy is the simplest and most widely used metric to evaluate a classifier.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \]
where TP is the number of correctly classified extremism tweets, TN is the number of correctly classified non-extremism tweets, FN is the number of incorrectly classified extremism tweets, and FP is the number of incorrectly classified non-extremism tweets.
Precision is defined as the ratio of true positives to all tweets identified as positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall is defined as the ratio of correctly classified positives to total positives. In our study, recall measures the proportion of extremism tweets that were detected.

\[ \text{Recall} = \frac{TP}{TP + FN} \]
The trade-off between recall (false negatives) and precision (false positives) is captured by the F1-measure. The F1-score is defined as the harmonic mean of recall and precision; it balances precision and recall in a single value and therefore is commonly used as a classification evaluation metric.

\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
The receiver operating characteristic curve is obtained by plotting the true positive rate against the false positive rate. The AUC measure is bounded by zero and one; a value of 0.5 corresponds to random guessing, so a useful classifier typically exceeds it.
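As a worked illustration of the metric definitions in this subsection, the following sketch computes accuracy, precision, recall, and F1-score from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not results from this study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the four metrics from confusion-matrix counts (illustrative helper)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The F1-score is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions (such as raising an error) are equally defensible.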

IV. EXPERIMENTS
The framework was implemented in three parts: first, collection and annotation of data; second, EDA; and third, development of classification models and evaluation on the newly proposed dataset.

A. DATA COLLECTION
After registering as a developer on the Twitter developer platform, we obtained user status data with more than 40 attributes. We used the Twitter Streaming API and Search API to collect tweets in real time, filtering with specific settings (e.g., based on keywords, location, and Arabic language). The crawled tweets were published between May 2011 and March 2021, and a final set of 2 million tweets with their associated metadata was extracted. The dataset size was reduced to 500,000 after excluding retweets (i.e., considering original tweets only). Arabic search terms were used in the query, focusing on religious and political terms. A list of specific query search terms, including تنظيم داعش, عدو الله, داعش العراق, داعش, رعاة الإرهاب, مقاطعة المنتجات, جمعة الغضب, كفار, and هيئة كبار العملاء, was used to identify and collect tweets from public Twitter profiles.

1) DATA CLEANING
Before labelling the dataset, the data were cleaned by removing empty tweets and tweets containing fewer than seven words (excluding hashtags and user mentions). Then, duplicate tweets were removed, using the dedupe Python library [46]. The library applies machine learning for rapid deduplication and entity resolution on structured data, using human-annotated training data. After data cleaning, the dataset consisted of 145,000 tweets.
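The cleaning rules described above can be sketched as follows. This is a simplified illustration under stated assumptions: the names `clean_tweets` and `count_content_words` are hypothetical, words are counted by whitespace splitting, and exact-match deduplication is used as a plain stand-in for the machine-learning-based dedupe library used in the study.

```python
import re
from typing import Iterable, List

MIN_WORDS = 7  # threshold stated in the text
_TAG_OR_MENTION = re.compile(r"[#@]\S+")  # hashtags and user mentions


def count_content_words(text: str) -> int:
    """Count words after dropping hashtags and user mentions."""
    return len(_TAG_OR_MENTION.sub(" ", text).split())


def clean_tweets(tweets: Iterable[str]) -> List[str]:
    """Drop empty, too-short, and exactly duplicated tweets, keeping order."""
    seen = set()
    kept = []
    for text in tweets:
        text = text.strip()
        if not text or count_content_words(text) < MIN_WORDS:
            continue
        if text in seen:  # exact-match stand-in for fuzzy deduplication
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "",
    "too short #tag",
    "one two three four five six seven",
    "one two three four five six seven",
    "one two three four five six #seven @user",
]
cleaned = clean_tweets(sample)
```

Exact matching will miss near-duplicates that differ in punctuation or spacing, which is the kind of case a fuzzy deduplication tool is meant to catch.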

2) DATA ANNOTATION
As discussed in Section III, manual annotation is more reliable and accurate than automatic annotation. Therefore, the data annotation approach followed in this study was that of Wosom [47], which specialises in data annotation and supports the annotation of text, audio, image, and video, among others. It is difficult to identify extremism based on an individual's judgment about whether a given text (e.g. a tweet) is extremist or not. For this reason, a predefined process is required to identify extremism and to avoid the personal judgment and bias of the annotators. In this research, three expert annotators reviewed the tweets individually and labelled them as extremist or non-extremist. The guidelines listed in Table 2 were used for the annotation process. The final label for each tweet was decided by a majority vote. An odd number of annotators was employed to prevent ties. At the end of this stage, 89,816 tweets had been labelled using the majority vote method. Moreover, during the annotation process, a sample of 1,000 tweets was used to validate the annotations. A fourth annotator was invited to check the annotators' work, resulting in more than 80% agreement between the average of the three raters and the validating fourth rater. This procedure was undertaken ten times during the annotation process. Our dataset has been made publicly available on IEEE DataPort [48].
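The majority-vote rule described above can be illustrated with a short sketch. The function name `majority_label` and the label strings are assumptions made for illustration; the sketch assumes a binary label set, for which an odd number of annotators guarantees a strict majority.

```python
from collections import Counter
from typing import Sequence


def majority_label(labels: Sequence[str]) -> str:
    """Return the most frequent label among an odd number of annotators."""
    if len(labels) % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    # With two possible labels and an odd count, one label always wins.
    label, _count = Counter(labels).most_common(1)[0]
    return label


final = majority_label(["extremist", "non-extremist", "extremist"])
```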

3) ANNOTATION EVALUATION
The inter-annotator agreement was calculated using Gwet's AC1 measure, which is one of the main statistical measures of agreement and can be used when the outcome is ordinal or nominal in nature [49]. In this study, nominal weights were used to take into account the nominal nature of the data [49].
The initial study sample included 89,816 observations. The overall value of Gwet's AC1 was 0.6, which indicates substantial agreement between raters.
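For readers unfamiliar with Gwet's AC1, the sketch below implements the standard multi-rater form of the coefficient for nominal labels: observed agreement is averaged over items, and chance agreement is derived from the average category prevalence. The function name `gwet_ac1`, the label strings, and the toy ratings are illustrative assumptions; this is not the software used to obtain the value reported above.

```python
from typing import Sequence


def gwet_ac1(ratings: Sequence[Sequence[str]], categories: Sequence[str]) -> float:
    """Gwet's AC1 for nominal labels; ratings[i] holds all raters' labels for item i."""
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n_items = len(ratings)
    observed = 0.0
    prevalence = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)  # number of raters for this item (must be >= 2)
        counts = {c: sum(1 for label in item if label == c) for c in categories}
        # Share of rater pairs that agree on this item.
        observed += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r
    p_a = observed / n_items
    # Chance agreement: sum of pi_k * (1 - pi_k), scaled by 1 / (q - 1).
    p_e = sum((p / n_items) * (1 - p / n_items) for p in prevalence.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy ratings from three raters on four items; T = extremist, F = non-extremist.
toy = [["T", "T", "T"], ["T", "T", "F"], ["F", "F", "F"], ["T", "F", "F"]]
ac1 = gwet_ac1(toy, categories=("T", "F"))
```

For the toy data, the observed agreement is 2/3 and the chance agreement is 0.5, so the coefficient is 1/3.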

4) POTENTIAL BIAS
Most datasets carry some risk of demographic bias [50], and the risk increases when manually chosen search terms are used. Researchers must be aware of potential biases in their datasets and address them. Gender and ethnicity are common sources of bias that are often identified in datasets. Therefore, to explore the potential biases in our dataset, we first checked whether gender bias is a feature of the dataset. Since Twitter does not record user gender information, we used the Gender API [51] to detect user gender from the first name. We were able to process the gender of 52,929 unique users and found that 66.5% of the users were identified as male, 25.6% as female, and 7.9% as unknown (cannot be identified). This suggests a male bias in terms of the users included in the dataset. However, this approach is limited because names are not a reliable and truthful way to determine gender identities. Regarding ethnicity bias, we generalised the search keywords during the dataset collection phase in order to mitigate potential ethnicity bias. However, it should be noted that the popularity of certain keywords may still give rise to an unintended bias. For example, most of the tweets were posted from Saudi Arabia, Egypt, and Yemen due to the use of specific keywords such as 'جمعة الغضب' and 'هيئة كبار العملاء'.

6) FEATURE EXTRACTION
NLP pre-processing was performed, which involved lemmatisation, stop word removal, and tokenisation. To create word vectors, various feature extraction techniques were used, including unigrams, bigrams, and trigrams with TF-IDF and Word2Vec.

Annotation guidelines (Table 2): After reading the tweet, please classify it as 'extreme' or 'not-extreme' using the following table, focusing on the context of its text:

Extremist (T)
Classify a tweet as extremist if it has one of the following traits:
1. Concept of burglary
2. Intellectual unilateralism
3. The 'grievance' fallacy
4. The Islamophobia fallacy
5. Dismantling and building chaos
The most important 'ideas' and 'sayings' of extremist discourse are:
1. Atonement for others who are different
2. Advocating and inciting violence
3. Reliance on references to extremist groups
4. The logic of fatal divisions
5. The idea of a single model

Non-extremist (F)
Classify a tweet as non-extremist if it does not contain the features mentioned above even if it contains religious, political, ethnic, or social themes or if its topic is unrelated to extremism (e.g. sports or fashion).

B. EXPLORATORY DATA ANALYSIS
After a preliminary examination of the dataset, we identified a series of steps to perform at the outset:
• Capitalise all column names.
• Split user dictionary column into spread columns using Python pandas and Contractions libraries.
• Cast columns to their appropriate data types.
• Map target classes to extremist and non-extremist.
• Create a new feature for the tweet text length and word count.
• Split the date column into multi-features (e.g. hour, day, month) by processing the associated timestamp for each tweet record.
• Drop irrelevant columns: Conversationid, Outlinks, Tcooutlinks, etc.
• Check for duplicates and remove them.
• Check for missing values.
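Two of the steps above, creating length features and splitting the timestamp, can be sketched with the standard library alone. The column names `Content` and `Date`, the ISO timestamp format, and the function name `derive_features` are assumptions made for illustration; the study itself reports using pandas.

```python
from datetime import datetime
from typing import Dict


def derive_features(record: Dict[str, str]) -> Dict[str, object]:
    """Add text-length, word-count, and time-based features to one tweet record."""
    text = record["Content"]  # hypothetical column name
    posted = datetime.fromisoformat(record["Date"])  # hypothetical column name
    return {
        **record,
        "TextLength": len(text),
        "WordCount": len(text.split()),
        "Hour": posted.hour,
        "DayOfWeek": posted.weekday(),  # Monday is 0, Sunday is 6
        "Month": posted.month,
    }


row = derive_features({"Content": "one two three", "Date": "2018-10-02T21:15:00"})
```

With pandas, the same features would typically be derived column-wise rather than record by record.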

1) METADATA ANALYSIS
We performed further analysis of the tweet metadata to identify important features. In the extremism dataset, there were 89,816 tweets in total published by 52,929 unique users. Figure 3 shows the percentage of extremist tweets and non-extremist tweets identified in the dataset. A total of 50,279 tweets (56%) from 22,858 unique users were labelled as extremist, whereas 39,537 tweets (44%) from 30,911 unique users were labelled as non-extremist. We applied Shannon's entropy measure to check the dataset's balance, deriving a result of 0.98, which indicates that the dataset is well balanced.
Figure 4 illustrates the number of tweets published annually. We observed increased content publication in certain months, for example in October 2018. In this month, the publication of extremist content increased by 100%, which coincided with the death of Saudi journalist Jamal Khashoggi; this evidently sparked significant social media activity. As another example, the April 2021 publication of the American CIA report on the Khashoggi case also appears to have caused an increase in Twitter social media content engagement.
Figures 5 and 6 show the day and time of extremist and non-extremist tweet publication, respectively. The time for extremist tweets was typically later in the day, which is likely because most people are busy or working during the daytime. By contrast, the days of publication are almost uniformly distributed over the week with a slight increase in non-extremist tweets during the weekend. Based on these observations, it was concluded that these features are not associated with the variable of extremism, and thus they were excluded from the classification model.
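The entropy-based balance check mentioned in the metadata analysis can be sketched as follows. The sketch assumes the common normalised form, in which Shannon entropy is divided by the logarithm of the number of classes so that 1.0 means perfectly balanced; the exact variant used in the study is not stated, and the function name `balance_score` is hypothetical.

```python
import math
from typing import Sequence


def balance_score(class_counts: Sequence[int]) -> float:
    """Normalised Shannon entropy of a class distribution (1.0 = perfectly balanced)."""
    total = sum(class_counts)
    k = len(class_counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two classes and one observation")
    # Empty classes contribute nothing to the entropy.
    entropy = -sum((c / total) * math.log(c / total) for c in class_counts if c > 0)
    return entropy / math.log(k)


# Class sizes reported in the text: extremist and non-extremist tweets.
score = balance_score([50279, 39537])
```

For two classes, this normalised form coincides with the entropy measured in bits, so no extra scaling is involved in the binary case.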

FIGURE 7. Correlations between numerical variables
We also examined the correlations between numerical variables to check for any relationships among them. Figure 7 shows a strong correlation between reply count, retweet count, and like count. Moreover, a relationship was identified between a user's follower count and listed count. Finally, a correlation existed between user media count and user status count. Therefore, we can use these correlations to build new features, such as favourite/follower distribution and retweet/like distribution. Figures 8 and 9 show that the favourite count was higher in extremist tweets compared with non-extremist tweets (nonextremist median=2084, extremist median=540). However, the follower count of extremist tweets slightly exceeded that of non-extremist tweets (non-extremist median=419, extremist median=157). This may be attributable to the fact that users who have refrained from following an account may agree with the idea and mark the tweet as a favourite. Figures 10 and 11 show that the lengths and word count of the tweets were similar across extremist and non-extremist tweets, respectively. However, as mentioned earlier, tweets with fewer than seven words were excluded from the dataset.

2) NLP ANALYSIS
To analyse the tweets in our Arabic language online extremism dataset in depth, we investigated the top 10 unigrams and bigrams based on TF-IDF for extremist and non-extremist tweets after removing Arabic stop words. In TF-IDF, words are assigned numerical weights that represent their relative importance in a particular document within a set of documents (i.e., a corpus). Figures 12 and 13 show the top-ranked unigrams based on TF-IDF for extremist and non-extremist tweets. Tables 3 and 4 provide the English translations for the words shown in Figures 12 and 13, respectively. As the figures indicate, the word 'Allah' was used in both types of tweets but more frequently in extremist tweets. Furthermore, the most frequent terms in extremist tweets tended to be more violent and dominant compared to those of non-extremist tweets.

Supervised classification was performed by splitting the dataset into training, validation, and test sets. The dataset is sufficiently large to support an independent test set for verifying the performance of the models while leaving sufficient data for training and validation. We tuned the hyperparameters on the validation set and, after achieving optimal results, evaluated the final performance on the test set. We used four machine learning models, namely LR, SVM, MNB, and RF, and all experiments were performed using a combination of feature sets extracted from our Twitter dataset. These feature sets comprised TF-IDF features over N-grams and Word2Vec embeddings, which were compared to determine the most accurate and effective model. We also used BERT, a popular language model that can be fine-tuned for a specific NLP classification task, and included different Arabic BERT models in our research. In total, we trained and tested 17 models with our Twitter extremism dataset.
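To make the TF-IDF weighting used in both the N-gram analysis and the feature sets concrete, the following sketch computes one common variant (relative term frequency multiplied by the unsmoothed logarithmic inverse document frequency) and ranks terms by their summed weight. The exact variant, tokenisation, and ranking rule used in our experiments are not restated here, so these choices, the function names, and the placeholder tokens are all assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_scores(documents, n=1):
    """Compute TF-IDF weights for each document.

    documents: list of token lists (stop words assumed already removed).
    Returns one {term: weight} dictionary per document.
    """
    term_lists = [ngrams(doc, n) for doc in documents]
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    num_docs = len(documents)
    scores = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        scores.append({
            term: (count / total) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(scores, k=10):
    """Sum weights over all documents and return the k highest-weighted terms."""
    totals = Counter()
    for doc_scores in scores:
        totals.update(doc_scores)  # adds the weights term by term
    return [term for term, _ in totals.most_common(k)]


# Placeholder tokens standing in for pre-processed tweets.
docs = [["alpha", "beta", "gamma"], ["alpha", "delta"], ["alpha", "beta"]]
unigram_scores = tfidf_scores(docs, n=1)
bigram_scores = tfidf_scores(docs, n=2)
print(top_terms(unigram_scores, k=2))
```

A term that occurs in every document, such as `alpha` here, receives a weight of zero under this variant, which is why very common words do not dominate the ranking.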

D. RESULTS
This section presents the performance evaluation results of our models in terms of the correct classification of extremist and non-extremist tweets. Table 7 lists the obtained values for accuracy, F1-score, and AUC. As our dataset is balanced, accuracy, F1-score, and AUC are considered appropriate measures for evaluating our models. For the detection of extremist tweets using traditional machine learning algorithms, Table 7 indicates that the inclusion of bigrams and trigrams generally does not improve the performance of the models. Furthermore, the performance of certain models degraded when trigram features were included. The only model that benefitted from the inclusion of bigrams and trigrams was MNB. Additionally, the inclusion of Word2Vec resulted in improvements for RF only.
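The three evaluation measures can be computed as in the following sketch. It assumes binary labels with 1 denoting the extremist class, an F1-score computed for that positive class, and AUC computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one; the labels and scores are made-up values for illustration, not results from Table 7.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """AUC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative labels and classifier scores (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(f"accuracy={acc:.4f}  F1={f1:.4f}  AUC={auc:.4f}")
```

Accuracy and F1-score depend on the decision threshold (0.5 in this sketch), whereas AUC depends only on how the scores rank the examples.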

V. CONCLUSIONS AND FUTURE WORK
The accessibility and reach of social media networks provide extremist individuals, groups, and organisations with an easy means to attract large audiences, disseminate propaganda, and recruit members. In this study, our objective was to compile a dataset of Arabic language tweets obtained from Twitter and automatically detect extremist content using machine learning algorithms. Many online extremism detection systems have been proposed in the literature, often achieving accuracies of around 90%. However, there is an insufficient number of publicly available extremism datasets, particularly in the Arabic language; most of the available Arabic language datasets are limited to ISIS and do not extend to political and other types of extremism.
In this research, we present an Arabic Twitter dataset for online extremism detection consisting of 89,816 tweets and associated metadata. The dataset was manually annotated by three experts and achieved a Gwet's AC1 score of 0.6, indicating substantial inter-annotator agreement. A two-step analysis was performed: first, EDA to understand the dataset and provide insights into the features; and second, the classification modelling process, wherein 17 different classification models were used. Among the traditional machine learning models, SVM achieved the best accuracy (0.9729) using TF-IDF features extracted from the tweet content. Notably, the BERT deep learning model outperformed SVM, achieving an accuracy of 0.9749.

This research applies some of the achievements in NLP to a pivotal social issue and seeks to create tools that other researchers and stakeholders can use to prevent the proliferation of extremist ideas. The compilation of our Arabic language dataset, consisting of annotated tweets that may or may not include terrorist rhetoric, is an important contribution that may have a wide-ranging impact on this field. Data collection and pre-processing were conducted systematically, while the annotation process was performed manually and with a full understanding of the context. This approach resulted in a reliable dataset that is expected to provide a solid foundation for future works, specifically those aiming to detect extremist content quickly and accurately on social media. We also compared several simple machine learning classification algorithms with a more complex deep learning approach based on the BERT model, obtaining results that illustrate the advantages of the latter. Classification methods were evaluated experimentally using metrics such as accuracy and F1-score to assess their effectiveness.
This study is notable for being one of the first to address the issue of extremist communication in the Arabic language, which is widely used on the Internet and constitutes one of the major global languages. However, this study has a limited scope: it does not account for regional variations of Arabic or consider that various groups could be using different vocabularies or code words. It is therefore necessary to continue building similar datasets and make them publicly available to other researchers, thereby strengthening the digital defences against hateful ideologies and dangerous individuals and organisations. It is worth highlighting that monitoring extremist tweets and users can help establish early warning systems and create opportunities for predictive and preventative actions against extremism. In future works, different features and different