Detecting Extremism on Twitter During U.S. Capitol Riot Using Deep Learning Techniques

In the 21st century, social media platforms have become popular channels for communicating ideas, opinions, and emotions. These platforms are influential in reaching out to youth, recruiting, and spreading propaganda. Extremist groups are now active users of social media platforms, so there is an urgent need to monitor their activities and detect extremism on these platforms. Existing research on extremism lacks a dedicated extremism dataset and provides minimal insight into extremist texts. This study introduces an extremism dataset of tweets collected from Twitter, with texts classified as propaganda, recruitment, radicalization, or non-extremism. The proposed extremism dataset is evaluated using different Artificial Intelligence approaches such as Bi-LSTM, BERT, RoBERTa, and DistilBERT. Among the four models, RoBERTa proved to be the most suitable for detecting extremism on social media, with an accuracy of 95%.


I. INTRODUCTION
Social media platforms were originally used to make friends online, chat, or share informal information. Currently, they are also used for business marketing, discussion of the latest events, debates on political events, news, entertainment, and so on. Today there are over 3.78 billion users of social media worldwide [1] and 330 million monthly active users on Twitter [2], with an average of 500 million tweets posted every day [3]. A study by Maryville University suggests that almost 72% of adults in the United States are users of social media platforms [4], a share that is expected to increase in the coming years. With this reach, social networking services such as Facebook, YouTube, Instagram, and Twitter can be used to influence election results and mislead the public through false information. False information spread through social sites sometimes causes uproar among the masses, giving rise to protests and riots. Riots result from dissatisfaction with a particular event and involve violence and vandalism. Many political or social movements are intended to gain either political or social advantage, yet they do not aim to disturb the nation's harmony. Usually, extremists target the emotional and moral sentiments of the masses to disrupt the peace within communities, and one of their ways of doing that is through social media. Therefore, social media texts can be examined to identify the perpetrators influencing and engaging people to support a riot. The agendas of such riots are planned through social media platforms by creating groups to fight for a common cause or sometimes through the spread of misinformation. Individuals can spread extremism by discussing their views and opinions on the outcome of an event or through conspiracy theories posted by subversive groups.
The language of such individuals and groups can be devoted to a single person, mislead the public, and support one side. Hence, extremism detection on social media is vital for detecting extremist users as well as for preventing extremist content from being posted.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
The latest event demonstrating the negative impact of social media platforms was the 2020 U.S. presidential election, which gave rise to the infamous Capitol riot. Around 300k tweets associated with the presidential election in the United States in 2020 contained deceptive content [3]. With the impending outcome of the presidential election in 2020, users discussed the 2020 U.S. election, Donald Trump, voter fraud, and election fraud across social media platforms such as Parler [5], Facebook, Instagram, and Telegram [6]. According to [7], on 6th January 2021 a pro-Trump crowd stormed the U.S. Capitol. The attack, in which 800 people stormed the Capitol [8], was planned using social media platforms with the intent of causing harm and disharmony among citizens. Such posts on social media incited other users to violence; therefore, such content deserves close attention, and its identification is possible using deep learning techniques.
Some papers have explored social media texts about the U.S. Capitol riot, especially from Parler and Twitter. Studies such as [9] and [10] have discussed the U.S. Capitol riot and its theoretical aspects, focusing on Twitter posts and users' reactions. However, there is a lack of research offering practical insights, based on state-of-the-art techniques, that could help control extremist activities conducted through online platforms during political riots.
Researchers have explored the political causes of the U.S. Capitol riot; however, few studies have experimented with finding extremism in tweets related to this riot. Besides, publicly available Twitter datasets on the U.S. Capitol riot and the 2020 U.S. presidential election are limited. Existing research addresses the detection of online extremism and sentiment analysis of extremist text. Furthermore, most extremism research focuses on detecting radical ISIS accounts, sentiment analysis of an event, or predicting the likelihood of a protest, utilizing machine learning and deep learning approaches. Although advanced models such as BERT and RoBERTa are efficient for text categorization, they have not been experimented with much in this area. Besides, the identification of multiclass extremism has not been a popular choice among authors. Thus, social media text analysis for identifying the type of extremism is absent from existing research. The available datasets did not have the features necessary to train and test the models, and they lacked classified posts and training data for model testing. The use of traditional machine learning models is prevalent in many extremism studies.
Nevertheless, deep learning models are also implemented to compare and discuss their performance. Therefore, this research aims to overcome all these limitations by gathering data with the required attributes. This curated dataset can help observe the trends prevalent on Twitter during the U.S. Capitol riot, classify tweets into propaganda, recruitment, radicalization and non-extremism, and compare performance of trained deep learning models on the collected dataset. The significant research gaps found in the literature survey are as listed below.
• Lack of publicly available dataset on U.S. Capitol riot.
• Limited discussion on techniques helpful in combating online extremism.
• Limited research on identifying extreme political discourse-related social media posts.

A. CONTRIBUTIONS
This research has contributed significant work keeping the research gaps and critical objectives in mind. The significant contributions of this work are shown below.
• Development of a seed dataset and an online extremism dataset consisting of tweets collected from Twitter.
• Evaluation of the online extremism dataset using LSTM, BERT, and the BERT variants RoBERTa and DistilBERT.
• Classification of tweets in the extremism dataset into multiple labels: propaganda, recruitment, radicalism, and non-extremism.
• Analysis of trending hashtags before and after the U.S. Capitol riot incident.
This research paper is organized into nine sections. The second section presents the literature review of the previous work performed, followed by the third section, which discusses the proposed architecture of this work. Then sections four, five, and six mention the data collection, preprocessing, and visualization. The seventh section covers the technical aspect of the work, explaining the experimental setup and the results observed during the experiment. Limitations of this research are mentioned in section eight. The ninth section discusses the future scope and conclusion of this study.

II. RELATED WORK
This section presents an overview of relevant research papers addressing the classification and detection of social media-based extremist associations and predicting public demonstrations.
Today many researchers investigate the automatic detection of extremists in social media texts using various methods [11], [12]. Research has also been conducted on sentiment analysis of public protests and elections [13], [14], [15]. Another study [16] extracts the sentiment or emotion behind social media texts in specific contexts. Machine learning techniques have been utilized extensively in extremism research since 2013 [17]. The prominent machine learning algorithms used for extremism detection research are Logistic Regression, Naive Bayes, Decision Tree Classifier, K-Neighbors Classifier, Random Forest Classifier, and Support Vector Machine [18]. On the other hand, the deep learning techniques implemented for extremism detection research are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [18]. Research on automated extremism detection focuses mostly on areas related to social movements, presidential elections, political issues, and terrorism.

A. MACHINE LEARNING CLASSIFIERS
Several research papers have tested different machine learning algorithms to detect extremist content related to terrorism. Hamidreza [11] designed an automatic detection scheme to detect extremists based on three features of a social media user: the textual content posted by the user, profile information, and usernames. The effectiveness of the automatic detection scheme is demonstrated by testing the trained model on a realistic ISIS dataset collected from Twitter. The dataset contains messages by ISIS terrorist groups for recruitment and propaganda [19]. The authors used a set of 3000 Twitter handles in the experiment, divided into 150 suspended ISIS-related Twitter accounts and 150 regular Twitter user accounts [11]. Various semi-supervised and supervised machine learning algorithms were implemented to predict extremist users. These algorithms are SVM, Char-LSTM, LabelSpreading (RBF), Laplacian SVM, LabelSpreading (KNN), Co-Training (SVM), KNN, Gaussian NB, LR, AdaBoost, and Random Forest. The semi-supervised LabelSpreading and Char-LSTM achieved the highest F1 scores. Compared to the others, Char-LSTM has a high precision of 77% and a high recall of 76% in the positive class.
Abd-Elaal et al. [20] presented an intelligent system for detecting ISIS online communities on the social media platform Twitter. The dataset for this study was obtained by looking at extremist accounts on Twitter that used the most common hashtags in ISIS propaganda. Around 21,000 tweets were collected for each of the Pro-ISIS, Anti-ISIS, and non-ISIS user groups. The dataset underwent various transformations, grouped into text feature vectorization, text feature analytics, and behavioural feature organization. The suggested system examines linguistic and behavioural characteristics such as hashtags, mentions, and followers. The system features a crawling subsystem that builds an ISIS account detector from previously identified ISIS-related accounts and can be used to infiltrate ISIS's online Twitter community. It also features an inquiry subsystem for detecting Pro-ISIS accounts, which lets the user look up a specific Twitter account using the Twitter ID as input. The study utilized six distinct machine learning algorithms: Bernoulli Naive Bayes, Decision Tree Classifier, K Neighbors Classifier, Linear Support Vector Classifier, Logistic Regression, and Random Forest Classifier.
Mussiraliyeva et al. [21] discuss the identification of religious extremism on social networks using machine learning models on a dataset curated from the VK social network in the Kazakh language. The authors tested the dataset with six machine learning models: Support Vector Machine, Naive Bayes, Logistic Regression, Random Forest, Decision Trees, and K Nearest Neighbors. They used multiple feature techniques such as statistical features, LIWC, POS, and TF-IDF to improve the models' accuracy. They also applied oversampling and under-sampling methods to compare the respective performance of the models. The best result was achieved by Naive Bayes, with 94% accuracy, which showed its efficiency in detecting extremist content on the web. Likewise, Aldera et al. [22] focused on classifying extremist posts in the Arabic language using 89,816 manually annotated tweets. The models used to test the dataset included Support Vector Machine, Logistic Regression, Multinomial Naive Bayes, Random Forest, and BERT, with TF-IDF as one of the feature extraction methods. Among the machine learning models, SVM achieved the highest accuracy of 97.29%, and BERT outperformed it by 0.20% accuracy, demonstrating its efficiency in text classification over machine learning models.
Meanwhile, some research papers investigated the probability of a public protest happening. Bahrami et al. [23] aimed to predict protests through machine learning algorithms. Their approach first searches Twitter's Trending Topics for hashtags that call for protests and downloads the associated tweets. Four machine learning algorithms are then used to classify the tweets. Their findings show that Twitter can effectively forecast future protests, with an average prediction accuracy of over 75%. The classifiers examined in this study are C4.5, Naive Bayes, Logistic Regression, and SVM, with Logistic Regression yielding the best overall results. This research focuses on predicting violent public protests. Few papers have experimented with predicting and analyzing mild forms of political protest, for example, as seen in [24]. In addition, some studies explore violent protests; for instance, another study [25] forecasts when a protest in China will take place by identifying protest-related articles and negative propaganda. There are some limitations to using machine learning algorithms in this research. Machine learning algorithms cannot capture the overall dependencies within a sentence; thus, machine learning models do not efficiently categorize text as extremist or non-extremist. Another limitation is that machine learning algorithms require feature extraction to achieve a better performance score [17]. Also, they cannot handle extensive data because of their predefined features, and context analysis is challenging using machine learning [17].

B. DEEP LEARNING CLASSIFIERS
Natural Language Processing is used to analyze extracted text data. Recently, pre-trained deep learning models that produce word embeddings have been used for this purpose. Models such as BERT and LSTM are becoming increasingly popular as a result of their excellent accuracy when compared to standard machine learning models. Sentiment detection has also been done with the help of machine learning and deep learning models. Previous work has used LSTM, GRU, and BERT models for sentiment analysis, although most research papers have used machine learning models such as SVM, Naive Bayes, and Logistic Regression for sentiment detection. To understand the public's reaction, some researchers look into the sentiment behind social media texts related to social movements, politics, and terrorism.
Deep learning techniques are employed for classification and prediction in extremism research but are mainly used for sentiment analysis. The deep learning techniques for extremism detection are mostly LSTM, GRU, Random Embedding, FastText, and CNN. Some research papers have used LSTM as a deep learning technique for tweet sentiment analysis. Ahmad et al. [12] proposed a terrorism-related content analysis methodology that categorizes tweets into extremist and non-extremist classes. The study uses Twitter posts to create a tweet classification system based on a deep learning-based sentiment analysis technique called LSTM + CNN. The data was gathered using a Twitter streaming API and Dark Web forums such as Al-Firdaws, Montada, alokab, and Islamic Network. There are 12,754 tweets labelled "extremist" and 8,432 tweets labelled "non-extremist" in the training dataset. The study compares word embeddings learned with CNN, LSTM, FastText, and GRU to conventional feature sets like n-grams, TF-IDF, and bag-of-words (BoW) for extremism classification. After experimenting with various parameter values for eight LSTM + CNN models, the authors found that the LSTM + CNN models outperformed the other models, with a 92.66% accuracy rate. The precision score of LSTM + CNN is 90%, the recall score is 88%, and the F1 score is 88%.
The BERT deep learning model is also used in many extremism studies for sentiment analysis. Chiorrini et al. [26] investigate the use of BERT models for sentiment analysis and emotion recognition in tweets, and evaluate the performance of these models on real-world tweet data. The models are created by fine-tuning BERT on specific tweet datasets. For sentiment analysis, the model was trained with 1,600,000 tweets and tested with 430 tweets manually annotated as positive, negative, or neutral. For emotion analysis, they considered the tweet emotion intensity dataset consisting of 6,755 tweets labelled as anger, fear, happiness, or sadness. According to their findings, the models had an accuracy of 92% on sentiment analysis and 90% on emotion recognition.
Meanwhile, Alatawi et al. [27] experimented with detecting white supremacists using the BERT model. This paper identifies white supremacists by hate speech on Twitter, as it is imperative to interpret the spread of such hateful content to prevent it [28]. They used both Bi-LSTM and BERT models, where the BERT model showed the highest F1 score. They paired a Twitter dataset with a Stormfront dataset compiled from a white nationalist site.
A few recently published studies on extremism utilize NLP to detect Islamic radicalism in Arabic tweets and BERT to detect extreme sentiments on social networks. Mursi et al. [29] presented research on detecting Islamic extremism in Arabic tweets using machine learning algorithms and conducted sentiment analysis on the dataset. For this experiment, they curated their own dataset, which was manually labelled as extremism and non-extremism by cybersecurity specialists. Two machine learning algorithms, Support Vector Machine and Multi-Layer Perceptron, were trained using the curated dataset, which was converted into a matrix of token counts through CountVectorizer and TF-IDF. Both models achieved high accuracy; however, the Support Vector Machine achieved a greater accuracy, 92%, than the Multi-Layer Perceptron.
Jamil et al. [30] proposed the detection of extreme sentiments in social media posts with the help of a semi-supervised algorithm known as BERT to reevaluate the accuracy of their prior research. Extreme sentiment detection is a kind of sentiment analysis that identifies any negative or positive opinion, evaluation, or judgment relevant to a particular thing or person. The former research used an unsupervised approach known as ExtremeSentiLex that automatically detects extreme sentiments in social media posts. This research extends that work by taking the classified social media posts and validating them using the BERT model. In their experiment, they applied this methodology to five datasets, one of which is the TurntoIslam dataset relevant to extremism. In the TurntoIslam dataset, the texts were classified as positive extreme, negative extreme, positive non-extreme, negative non-extreme, and inconclusive, with the negative extreme and inconclusive labels having the highest precision, recall, and F1-score, all above 85%.
In deep learning, the advantage is that it is not required to perform feature extraction because most deep learning libraries have an in-built embedding layer that performs feature extraction. Therefore, there is always some advantage with deep learning models when working with large datasets.

C. RELATED DATASETS
The research on detecting and classifying extremist affiliations on social media was conducted using a custom dataset compiled from Twitter and Dark Web forums such as Al-Firdaws, Montada, alokab, and Islamic Network [12]. One of the limitations mentioned in that paper is that the dataset lacked visual and social context features. In addition, the dataset was imbalanced, as the number of extremist labels was higher than the number of non-extremist labels. Five distinct datasets, built from a combination of standard and custom datasets, were used in another research paper about detecting radical content on Twitter [31]. One of the limitations of these datasets was that they were collected in different periods. Also, the number of tweets collected varies from one dataset to another. The authors noted as a limitation that more extensive data samples should be taken for better prediction and that data collection should be extended to other languages, especially Pashto. The research on disruptive event detection collected data from Twitter and Gnip using hashtags such as #Ramadi, #Aleppo, #Cairo, and #Dubai for one of the datasets, and the England riot dataset was purchased online [32]. The limitation of this dataset was that the data was imbalanced, as there were more event tweets than non-event tweets. The dataset was also only in English.
Meanwhile, in the research on predicting public protests, they collected tweets about the protests against the Trump presidency after the announcement of the presidential election results in November 2016 [23]. Some of the limitations seen in the dataset are that more event-specific features could have been used to bring better performance scores. They could have collected the dataset before the presidential election was almost over to check the likelihood of public protest.
The research paper on violent extremist detection in social media used a custom dataset collected from Twitter [11]. One limitation of this dataset is that it was collected only in a particular year, i.e. 2019, so it does not represent a large sample of data. Also, the dataset only focuses on tweets in English. A second limitation is that it does not use more user-specific features. The research paper on radicalization detection based on emotion signals and semantic similarity collected its entire dataset from magazines. The dataset lacks radical accounts, as they are banned, and it does not cover other languages, which is another limitation. The research paper on detecting violent radical accounts on Twitter could not find an Arabic ISIS-related dataset because of the lack of proper Arabic resources [20]. Therefore, the authors had to use two different datasets. The first is a collected dataset extracted from Pro-ISIS, Anti-ISIS, and non-ISIS Arabic-speaking Twitter accounts. Its limitation is that the labels Pro-ISIS, Anti-ISIS, and non-ISIS were not balanced. The second is a translated dataset collected from published non-Arabic ISIS-related datasets found in online data science communities. Its limitations were that the sample sizes were not equivalent and that the periods in which the source datasets were extracted are not uniform. Lastly, the paper that researched linguistic cues for analyzing social movements collected data from two places: one dataset from Twitter using hashtags like #blacklivesmatter from June 2014 to June 2015, and the other from news articles using the same hashtag [33]. One limitation is that the dataset could have been built using a wider variety of hashtags. In addition, the dataset could have used more text-specific or news-related features.
Researchers were able to collect data from various sources, create their own datasets, use standard datasets found online, or use both types, as seen in Table 3 under dataset type. Prominent sources from which researchers can gather large amounts of data are social media platforms such as Twitter and Facebook, as well as online repositories such as Kaggle and UCI. However, the datasets used in this research have limitations. Some studies use several different datasets, while others use only a single data collection. Datasets combined from standard and custom sources were better for predicting extremism, whereas the studies in which only a standard dataset was used did not give more accurate results.
VOLUME 10, 2022
Some studies collected data over an extensive range of periods, which in some cases is reasonable. For this research, it was difficult to retrieve a dataset covering the U.S. presidential election in 2020 and the U.S. Capitol riot in 2021. The available datasets had either too many or too few of the attributes necessary to implement the models. Furthermore, the classification of the dataset into categories such as propaganda, recruitment, and radicalization was required, which was not present in the available datasets.
The existing datasets found online contain tweets about the U.S. presidential elections, and few datasets contain tweets about the U.S. Capitol riot, which this research required. The customized dataset contains tweets collected from Twitter after a survey of popular keywords used during the U.S. Capitol riot. The dataset was manually labelled to classify different types of extremism in the text. The tweets are distributed across labels as 28.6% recruitment, 32% radicalism, 37.2% propaganda, and 2.3% non-extremism. The customized dataset is larger in sample size than existing datasets, and the tweets were collected between the U.S. presidential election in 2020 and the U.S. Capitol riot in 2021. Existing datasets could not provide labels of extremism. Hence, a customized dataset is curated for this research work.

III. PROPOSED ARCHITECTURE
This section discusses the architecture followed during the study, which involves data collection, labelling, and implementation of deep learning models such as Bi-LSTM, BERT, RoBERTa, and DistilBERT.

A. DATA COLLECTION
The collected data is from the Twitter platform, where many trending hashtags were related to the U.S. Capitol riot. The data collection is performed using the Twitterscraper library. The dataset for this study collected tweets from 25th December 2020 to 10th January 2021, including only the tweets posted in English. The dataframe prepared was used for further cleaning and preprocessing. Figure 2 shows the steps followed during the extraction of tweets.

B. U.S. CAPITOL RIOT KEYWORDS AND COMBINATIONS FOR TWEET COLLECTION
The hashtags are collected from various online sources (news articles, research papers). The listed keywords were the most used on social platforms and were part of many discussions significant to the U.S. Capitol riot. Hence, we have gathered these keywords to extract tweets that can recognize the extremism in the posts shared on Twitter. The dataset contains posts including these keywords. These keywords will help us determine the sentiment in the tweets. Table 4 contains the list of all collected keywords used for extracting tweets to create the dataset and the count of tweets collected for each keyword used during data collection. Thus a total of 93,501 tweets were collected from 25th December 2020 to 10th January 2021.
The metadata obtained from Tweets using the Twitter API is listed below.  Table 5 shows the final dataset prepared for training after the tweet collection. The data contains four columns, which include the date of the tweet posted, the unique id of the tweet posted, the tweet posted by the user, and the username. The text column is of utmost importance for this study as the extremism analysis, labelling, modelling, and testing are performed only on the text data. The text data is further cleaned and preprocessed to make it suitable for modelling and testing.
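As an illustration of the record structure described above, the following minimal Python sketch models the four columns and restricts records to the stated collection window. The names `TweetRecord`, `posted_at`, and `filter_records` are hypothetical, and the removal of repeated tweet ids is an assumption (a tweet may match more than one keyword); the sketch does not reproduce the collection code used in this study.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, List


@dataclass(frozen=True)
class TweetRecord:
    """One row of the dataset: the four columns named in the text."""
    posted_at: datetime  # date the tweet was posted
    tweet_id: str        # unique id of the tweet
    text: str            # tweet text, the only column used for modelling
    username: str        # account that posted the tweet


# Collection window stated in the text, treated here as inclusive.
WINDOW_START = date(2020, 12, 25)
WINDOW_END = date(2021, 1, 10)


def filter_records(records: Iterable[TweetRecord]) -> List[TweetRecord]:
    """Keep tweets inside the window and drop repeated tweet ids."""
    seen = set()
    kept = []
    for record in records:
        if not (WINDOW_START <= record.posted_at.date() <= WINDOW_END):
            continue
        if record.tweet_id in seen:  # assumption: duplicates are discarded
            continue
        seen.add(record.tweet_id)
        kept.append(record)
    return kept
```

Keeping the text in a dedicated field makes it straightforward to pass only that column to the cleaning and labelling stages.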

D. SEED DATA COLLECTION
Seed data is collected based on political ideologies. The seed dataset's primary purpose is to contain text on propaganda, radicalization, and recruitment. For data collection, we collected various research articles, newspapers, websites identifying extremists, and blogs identifying influential propagandists, radicals, and recruiters.

1) RESEARCH ARTICLES AND NEWSPAPERS
The seed data was drawn from research articles and newspapers that expressly identified extremist text as propaganda, radicalization, or recruitment. The research articles and newspapers contained tweets relevant to this experiment; thus, we collected the tweets from the research papers and news websites or downloaded the Excel files from the websites. The collection of text was confined to the period from January 2016 to December 2021. A total of 30 articles were gathered for this study.

2) WEBSITES AND BLOGS
The majority of the seed samples chosen are from blogs and websites. Users were labeled as propagandists and recruiters on some websites. Such users' tweets or posts are regarded as propaganda or recruitment. For this experiment, 90 web blogs and websites were reviewed, of which 45 were deemed suitable for the study. In some sources, only a few of the tweets were used, while in other sources, all the available tweets were utilized. Table 6 contains examples of tweets and their corresponding sources.

E. SEED DATA FEATURES
The characteristics of seed data include Source Type, Text, and Label.
1. Source Type - Indicates whether the sample collected is from a research article, website, or newspaper article.
2. Text - Contains extremist text or tweet provided by the source.
3. Label - As per the selected article, the label indicates the class to which the text belongs, such as propaganda, radicalization, or recruitment.

F. SEED DATA ANNOTATION
Each tweet in the seed dataset is classified as radicalism, propaganda, or recruitment. No manual annotator was used. Instead, tweets were placed in categories according to how the websites, research papers, news articles, and blogs that contained them described the content, i.e. as radicalism, propaganda, or recruitment.
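The seed record and its source-driven annotation can be sketched as follows. This is a minimal illustration under stated assumptions: the class `SeedSample`, the function `annotate_from_source`, and the exact spellings of the allowed source types and labels are hypothetical and chosen only to mirror the three features listed above.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed vocabularies, taken from the feature descriptions in the text.
SOURCE_TYPES = {"research article", "website", "newspaper article"}
LABELS = {"propaganda", "radicalization", "recruitment"}


@dataclass(frozen=True)
class SeedSample:
    source_type: str  # where the sample was found
    text: str         # extremist text or tweet quoted by the source
    label: str        # class assigned by the source itself

    def __post_init__(self):
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source type: {self.source_type}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")


def annotate_from_source(source_type: str, source_label: str,
                         texts: Iterable[str]) -> List[SeedSample]:
    """Every text inherits the category its source assigned to it.

    No manual annotation step is involved; empty strings are skipped.
    """
    return [SeedSample(source_type, t.strip(), source_label)
            for t in texts if t.strip()]
```

Validating the label at construction time keeps a non-extremism sample from entering the seed set by accident, since the seed data is meant to contain only the three extremist classes.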

IV. DATA PREPROCESSING
The collected Twitter data is studied through exploratory data analysis, and the text data is processed to continue with the models' development. Data preprocessing helps in the transformation of collected data into a proper format. The Preprocessing stage involves cleaning and removal of unwanted data as well as formatting the raw data into an understandable structure for machines to interpret text [42]. For preprocessing the dataset, the tweets were first converted into lowercase, and then noisy data was removed from the tweets to make the data suitable for further analysis. Eliminating noisy data involved removing URL links, placeholders, HTML references, non-letter characters (punctuation and special characters), Twitter handles and hashtags from text data.
The text data was further processed by removing stopwords, and tokenization was performed to divide the text into meaningful tokens [43]. The tokenized data was further lemmatized to get the base form of the word [43], which helps in performing sentiment analysis on the text data.

A. STOPWORDS
These are unwanted and meaningless words that are removed from the text to reduce noise, as their removal does not impact the performance of the models.

B. TOKENIZATION
This reduces the sentence into tokens by splitting text into words, which helps analyze the word's meaning.

C. LEMMATIZATION
Lemmatization normalizes words to their root form using a dictionary-based approach, so that inflected variants with the same meaning share one form.
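The three steps above can be illustrated on an already-cleaned tweet. The sketch below is a toy under stated assumptions: the stopword list and the lemma dictionary are tiny hypothetical stand-ins, whereas a real pipeline would typically rely on a library resource such as NLTK's stopword corpus and `WordNetLemmatizer`.

```python
# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "to", "of", "and", "in"}

# Toy lemma dictionary (hypothetical entries) standing in for a real lemmatizer.
LEMMAS = {"votes": "vote", "voting": "vote", "voted": "vote", "riots": "riot"}


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    """Map each token to its dictionary form, leaving unknown words unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(cleaned_text):
    return lemmatize(remove_stopwords(tokenize(cleaned_text)))


result = preprocess("the votes were counted in the riots")
print(result)
```

In this sketch the text is tokenized first so that stopwords can be filtered token by token; the result is the same as removing stopwords from the text and then tokenizing.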

V. DATA LABELING
The data extracted from Twitter is unlabeled. To annotate the curated Twitter dataset, pseudo-labeling is implemented. Pseudo-labeling is the technique of predicting labels for unlabeled data using a model trained on labelled data. A seed dataset was created, consisting of 1000 samples gathered from multiple sources and carefully annotated with labels such as propaganda, recruitment, and radicalization.
The labelled data is used to train an SVM model, which then predicts labels for the unlabeled tweets together with a confidence level. Confident guesses are identified using the predicted class probability rather than the predicted label alone. A prediction is treated as confident when its predicted probability exceeds a threshold, with threshold values ranging from 0.35 to 0.86. The confident predictions are then added to the labelled data, and the model is retrained on the combined seed and pseudo-labelled data.
Different threshold values were then compared to check for any improvement, and an accuracy of 94% was obtained. The final model was used to predict the final labels, i.e., propaganda, recruitment, and radicalization.
For the SVM training, the dataset was split into two subsets: the train set, which accounts for 90% of the existing dataset, and the test set, which accounts for the remaining 10%. The test set was labelled so that the confidence level could be checked. The 90% train set was further split into 20% labelled data and 80% unlabeled data. The first model was trained on the labelled train set and then used to classify the unlabelled training data.
Confidence levels were then measured against the test set that had already been labelled, which was an essential step. Once this was done, we concatenated the labelled train data with the predictions for the unlabeled data. Predictions whose probability exceeded the threshold (0.35 to 0.86) are called pseudo-labels and are treated like actual labels. The model was then retrained on the new training set.
After selecting the pseudo-labelled samples from the unlabeled data, we retrained the model to predict the remaining unlabelled data. This step of combining the labelled train data with the predictions for the unlabeled data was iterated until no remaining prediction had a probability above the threshold. The trained model was evaluated using automatic metrics such as accuracy, precision, recall, and F1-score to assess the performance of the pseudo-labelling model.
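The iterative procedure above can be sketched as a self-training loop. To keep the example dependency-free, a small nearest-centroid classifier stands in for the SVM; this is a swapped-in technique, not the authors' model (an SVM with class probabilities, such as scikit-learn's `SVC(probability=True)`, would take its place). The function names, the two-dimensional toy features, and the single threshold of 0.86 are all assumptions made for illustration.

```python
import math


class CentroidClassifier:
    """Stand-in for the SVM: scores classes by distance to class centroids."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_proba(self, x):
        # Softmax over negative squared distances gives pseudo-probabilities.
        scores = {c: -sum((a - b) ** 2 for a, b in zip(x, m))
                  for c, m in self.centroids.items()}
        top = max(scores.values())
        exps = {c: math.exp(s - top) for c, s in scores.items()}
        total = sum(exps.values())
        return {c: e / total for c, e in exps.items()}


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.86, max_rounds=10):
    """Repeatedly add confident predictions to the labelled set and retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = CentroidClassifier().fit(X_lab, y_lab)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            proba = model.predict_proba(x)
            label = max(proba, key=proba.get)
            target = confident if proba[label] >= threshold else remaining
            target.append((x, label))
        if not confident:          # nothing above the threshold: stop iterating
            break
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in remaining]
        model = CentroidClassifier().fit(X_lab, y_lab)
    return model, X_lab, y_lab, pool


# Toy 2-D feature vectors (hypothetical; real inputs would be text features).
seed_X = [[0.0, 0.0], [4.0, 4.0]]
seed_y = ["propaganda", "recruitment"]
unlabeled = [[0.2, 0.1], [3.9, 4.2], [2.0, 2.0]]

model, X_all, y_all, pool = pseudo_label(seed_X, seed_y, unlabeled)
print(y_all, pool)
```

The point midway between the two classes never reaches the threshold, so it stays in the unlabeled pool, which mirrors the stopping condition described above.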
Following are the features of the seed dataset:
1. Text: Contains an extremist tweet posted by the user.
2. Label: It identifies different categories of extremism in the tweet as propaganda, radicalization or recruitment.

VI. ANALYSIS OF TRENDING TWEETS THROUGH DATA VISUALIZATION
The entire dataset was analyzed using visualization techniques to understand the trends followed during the U.S. Capitol riot in 2021 and the U.S. presidential election in 2020. This analysis helped identify the most used hashtags, understand the public reaction, and check the distribution of labelled tweets after text classification.

A. WORD CLOUD
One of the most popular techniques for finding the top keywords in a dataset is the word cloud, in which the size of each word indicates its frequency. Figure 5 highlights the hashtags most used on Twitter during the U.S. Capitol riot.

B. TOP HASHTAGS USED DURING U.S. CAPITOL RIOT (BEFORE 6TH JANUARY)
The bar chart in Figure 6 presents the count of trending keywords according to the word cloud.

C. TOP HASHTAGS USED DURING U.S. CAPITOL RIOT (AFTER 6TH JANUARY)
The bar chart in Figure 7 presents the count of trending keywords according to the word cloud. The most used hashtags after the U.S. Capitol riot in 2021 are #Antifa, #Kag, #Stopthesteal, #Trump2020, and #Voterfraud. From Figure 6 and Figure 7, it is clear that the trends changed before and after the Capitol riot took place. Before 6th January 2021, tweets in support of Trump were trending. After the riot, however, the trend changed, with a different set of hashtags gaining momentum.
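The hashtag counts behind such bar charts can be obtained with a simple frequency count over the raw (uncleaned) tweets. The following is a minimal sketch; the function name `top_hashtags` and the sample tweets are hypothetical placeholders, not data from the study.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def top_hashtags(tweets, n=5):
    """Return the n most frequent hashtags, compared case-insensitively."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet))
    return counts.most_common(n)


# Placeholder tweets for illustration only.
sample = [
    "placeholder text #StopTheSteal #Trump2020",
    "placeholder text #stopthesteal",
    "placeholder text #VoterFraud #StopTheSteal",
    "placeholder text #Trump2020",
]

top = top_hashtags(sample, n=2)
print(top)
```

The counting has to run before preprocessing, because the cleaning stage removes hashtags from the text.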

D. EXTREMISM CLASSIFICATION OF LABELED DATA
The extracted data was labelled as propaganda, recruitment, radicalization, or non-extremism. The pie chart shows the percentage of each type of labelled tweet. Almost 98% of tweets belong to at least one of the categories of extremism. This information is significant because the labels play a dominant role in the training of the models.

VII. DATASET EVALUATION

A. EXPERIMENTAL SETUP
The hardware and software requirements for this study are listed in Table 7. The programming was done on a computer using the Anaconda Jupyter Notebook. The entire program was written in Python using Python libraries for deep learning.

B. DEEP LEARNING MODELS
The four deep learning models implemented in this experiment are Bi-Directional Long Short-Term Memory (Bi-LSTM), Bidirectional Encoder Representations from Transformers (BERT), RoBERTa, and DistilBERT.

C. Bi-LSTM
Bi-Directional Long Short-Term Memory is a Recurrent Neural Network and a Sequence Processing Model, which comprises two LSTM layers. The first layer takes input in one direction, and the second layer processes input in the opposite direction. The first recurrent layer is repeated in the network, resulting in two parallel layers. The input sequence is passed to the first layer as an input, and a reversed version of the same input is given to the second layer for processing.
The working of the Bi-LSTM can be understood in six phases as explained below:

D. WORD EMBEDDING
A word embedding is a learned text representation in which words with similar meanings have similar representations. To extract semantic information from a tweet, it is first represented as a sequence of word embeddings.

E. BI-LSTM LAYER
This layer records the sequences that appear in the data.

F. DENSE LAYER
The output is passed to the dense layer, which has a sigmoid activation function and uses dropout between the two dense layers to avoid overfitting.

G. INPUT LAYER
It brings the initial data into the system to be processed further.

H. HIDDEN LAYER
The inputs entering the network are transformed nonlinearly by the hidden layers with the help of an activation function.

I. OUTPUT LAYER
This layer extracts the desired named entities.
The bidirectional feature of the Bi-LSTM model serves as an enhancement to the RNN, making it possible for the network to use both the preceding and the following context of each word. Therefore, Bi-LSTM has been implemented in this study to understand the neural network's performance in the classification of text data and to compare the performance of Bi-LSTM with that of more advanced models.
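The idea of two parallel recurrent layers reading the input in opposite directions can be shown with a deliberately simplified sketch. It uses a plain scalar tanh recurrence instead of LSTM gates, and the weights `w` and `u` are arbitrary hypothetical values, so it illustrates only the bidirectional wiring, not the model trained in this study.

```python
import math


def run_rnn(inputs, w=0.5, u=0.5):
    """Simple recurrence h_t = tanh(w * x_t + u * h_{t-1}); returns every state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states


def bidirectional(inputs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs)
    # Run over the reversed sequence, then flip back to align positions.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))


outputs = bidirectional([1.0, 0.0, -1.0])
for position, (fwd, bwd) in enumerate(outputs):
    print(position, round(fwd, 3), round(bwd, 3))
```

At the first position the forward state has seen only the first input, while the backward state already summarizes the whole sequence; this is why the combined representation carries context from both sides.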

J. BERT
Bidirectional Encoder Representations from Transformers is a transformer model that uses a self-attention mechanism to weight each element of the input relative to the others, and it is pre-trained with a masked language modelling objective. In this study, a pre-trained BERT model is first chosen according to the language requirement, and its architecture is then modified to suit the task, as shown in Figure 10. Lastly, the modified model is fine-tuned on the prepared training data.
The BERT encoder expects a token sequence that begins with [CLS], a special token placed before the first sentence. Each sentence ends with [SEP]. To distinguish between sentences, a segment embedding 'A' or 'B' is added to the token embeddings. BERT takes the input sequence and passes it up the stack of encoder blocks. Within each block, the input passes through a self-attention layer and then a feed-forward neural network, and the result is handed to the next encoder. Each position outputs a vector of the hidden size (768 in BERT Base). The pre-trained model is trained further on the dataset, and the result is fed to a sigmoid layer. Errors are backpropagated through the entire architecture, and the model's pre-trained weights are adjusted according to the new dataset.
This model performs well on large datasets, and its masked-language-model pre-training helps it identify keywords from context. BERT is therefore widely used for text classification, and its performance has been tested in this study to achieve better classification results.
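The input format described above ([CLS] at the start, [SEP] after each sentence, and an 'A' or 'B' segment per token) can be illustrated with a toy function. This is a sketch under simplifying assumptions: it splits on whitespace rather than using BERT's WordPiece tokenizer, encodes segments 'A' and 'B' as 0 and 1, and the name `build_bert_input` is hypothetical.

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Return (tokens, segment_ids) in BERT's single- or paired-sentence layout."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # segment 'A'
    if sentence_b is not None:
        b_tokens = sentence_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)     # segment 'B'
    return tokens, segment_ids


tokens, segment_ids = build_bert_input("stop the steal", "join us")
print(tokens)
print(segment_ids)
```

For single-tweet classification only the first segment is used, and the output vector at the [CLS] position is typically the one passed to the classification layer.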

K. RoBERTa
The Robustly Optimized BERT Pre-training Approach (RoBERTa) optimizes the pre-training procedure of BERT, while the architecture remains very similar to BERT. RoBERTa extends BERT's language masking approach and changes key hyperparameters: it removes BERT's next-sentence pre-training objective and trains with much larger mini-batches and learning rates. RoBERTa was also trained on roughly an order of magnitude more data than BERT and for a longer period. It is an advanced version of the BERT model that gives better performance than BERT itself. RoBERTa has been used in this study to achieve good results in extremist text classification.
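One of RoBERTa's changes to the masking approach is dynamic masking: instead of fixing the masked positions once during preprocessing, a new mask is sampled each time a sequence is fed to the model. The toy sketch below contrasts the two. It is a simplification under stated assumptions: it only substitutes [MASK] tokens at a 15% rate, the helper name `mask_tokens` is hypothetical, and the example sentence is a placeholder.

```python
import random


def mask_tokens(tokens, rng, mask_rate=0.15):
    """Replace a random subset of tokens with [MASK] (at least one token)."""
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]


tokens = "the crowd gathered outside the building before noon today".split()

# Static masking: one pattern is chosen once and reused in every epoch.
static_pattern = mask_tokens(tokens, random.Random(0))
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled every time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, masked in enumerate(dynamic_epochs):
    print(epoch, " ".join(masked))
```

With dynamic masking the model can see different masked positions for the same sentence across epochs, which gives it more varied training signal from the same data.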

L. DistilBERT
DistilBERT is a Transformer model distilled from BERT base; it is small, quick, cheap, and light. To mimic Google's BERT, it uses a process called distillation, in which a larger neural network is approximated by a smaller one. After a large neural network has been trained, its full output distributions can be approximated using a smaller network. The data is tokenized using the DistilBERT tokenizer, turned into a series of tokens, converted to tensors, and supplied to the model, as shown in Figure 12. The DistillBERTClass is used to create a neural network. This network uses the DistilBERT language model, a dropout, and a linear layer to obtain the final outputs.
The data is passed into the DistilBERT language model, which is followed by a single dense output layer with a sigmoid activation. After training the additional classification layers, DistilBERT's embedding layer is unfrozen and all weights are fine-tuned with a reduced learning rate to extract even more performance out of the model.
DistilBERT is a distilled version of BERT that offers good performance; therefore, it was selected to compare its performance with that of the other models.
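The core of distillation, training a small network to reproduce the output distribution of a large one, can be expressed as a loss between softened probability distributions. The sketch below shows only this soft-target component in plain Python; the temperature of 2.0 and the logit values are hypothetical, and it is not the full training objective used to produce DistilBERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher = [4.0, 1.0, 0.5, 0.1]          # hypothetical teacher logits
close_student = [3.8, 1.1, 0.4, 0.2]    # student that mimics the teacher well
far_student = [0.1, 0.5, 1.0, 4.0]      # student that disagrees with the teacher

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is lowest when the student's distribution matches the teacher's, so minimizing it pushes the smaller network toward the larger network's behaviour.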

M. RESULTS AND DISCUSSIONS
The dataset is evaluated with different deep learning classifiers, as shown in Tables 9, 10, and 11. The dataset is divided into training and testing in the ratio of 80:20. For all of the deep learning models examined, Table 11 shows the reported accuracy, precision, recall, and F1-Score.
As evident from Table 11, RoBERTa outperformed the other three models by up to almost 6% in accuracy, demonstrating its advantage over BERT and its distilled variant, DistilBERT. RoBERTa's precision, recall, and F1-score show an increase of almost 0.06, 0.11, and 0.07, respectively, compared to Bi-LSTM. The experiment with the Bi-LSTM network helped to understand the difference between the results of transformer models and recurrent neural networks. BERT-based models are pre-trained on text data, so they perform better than traditional neural networks.
The dynamic masking of RoBERTa is an advantage over BERT, which contributes to the model's performance and makes it robust. It outperforms BERT by 3% in terms of accuracy, achieving an accuracy of 95%. Although RoBERTa performs better than DistilBERT, the performance of DistilBERT was also remarkable. It performed better than BERT and Bi-LSTM due to its architectural advantage, which makes it a good choice for text classification. Lastly, the training time of RoBERTa was low compared to BERT, DistilBERT, and Bi-LSTM, which makes it well suited to classifying text, as shown in this study.
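For reference, the four reported metrics can be computed from true and predicted labels as sketched below. The averaging scheme is an assumption: the sketch uses macro averaging (an unweighted mean over classes), which may differ from the scheme behind the reported tables, and the function name and the short label names in the example are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for illustration only.
y_true = ["prop", "prop", "rec", "rad"]
y_pred = ["prop", "rec", "rec", "rad"]
scores = evaluate(y_true, y_pred)
print(scores)
```

Reporting precision, recall, and F1-score alongside accuracy matters here because the class distribution is uneven, and accuracy alone can hide weak performance on the smaller classes.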

N. PERFORMANCE EVALUATION
The performance of the deep learning models was evaluated using an automatic verification dataset in which 33% of the data was held back for validation. The verbose output for each epoch showed the loss and accuracy on both the training dataset and the validation dataset, which were used to evaluate the models' performance.

O. ERROR ANALYSIS
Even though the trained deep learning models have good performance scores, some tweets were still misclassified, especially tweets identified as radicalism that were actually propaganda, and vice versa. The incorrect identification of these tweets is probably due to improper labelling in the seed dataset and to the number of epochs used during training. It is possible that tweets labelled as radicalism or propaganda in the seed dataset were mixed up, as both seem to have similar contexts. Therefore, when the model is trained, it may predict too many or too few tweets as propaganda or radicalism because their contexts are somewhat similar. Moreover, the number of training epochs could have affected how unlabelled data was misclassified. When the number of epochs is larger, the model is trained more thoroughly on the data, and as a result it reaches a low validation loss and high accuracy. Thus, in this experiment, a moderate number of epochs might not have trained the model properly on some of the propaganda and radicalism tweets, which could have caused some of those tweets to be misclassified.

VIII. LIMITATIONS
This work is limited to Twitter and can be expanded to other networking platforms, such as Facebook and Parler. Although this study explores the social media responses to the U.S. presidential election and the U.S. Capitol riot in English, the responses in other languages remain to be investigated. The collected dataset lacked emoticons, which can help in understanding the user's sentiment. Therefore, emoji detection can be used to obtain better results. Bot tweets present in the dataset can significantly affect the data, resulting in incorrect classification. Hence, it is best to detect bots through advanced approaches and to improve the model's ability to identify bot-generated posts. Finally, a user interface can be designed to detect extremism in posts.

IX. CONCLUSION
This research contributes to the field by creating a high-quality, diversified dataset on the U.S. Capitol riot gathered via Twitter. This study explains how semi-supervised learning is used to predict labels. This research also compares and evaluates several deep learning classifiers: BERT, Bi-LSTM, RoBERTa, and DistilBERT. According to the experimental data, RoBERTa achieved the most competitive outcome, with 95% accuracy. The results show that the models can help identify extremist messages on social media, thereby helping to prevent the tragic consequences of the spread of radical posts. Other platforms, such as Facebook and Parler, can be analyzed to gain a broader perspective of the riot and to investigate social media's influence on the masses. A few advanced features can be developed or explored to suggest a more accurate model for detecting extremism.
ARUNDARASI RAJENDRAN received the B.Tech. degree in computer science and engineering from the Symbiosis Institute of Technology, Pune, Maharashtra, India. She has completed her five month research internship at the Symbiosis Center for Applied Artificial Intelligence. Her research interests include data science, machine learning, deep learning, and natural language processing.
VATTIKUTI SREE SAHITHI received the B.Tech. degree in computer science and engineering from the Symbiosis Institute of Technology, Pune, Maharashtra, India. Her research interests include machine learning, deep learning, and natural language processing.
CHHAVI GUPTA received the B.Tech. degree in computer science and engineering from the Symbiosis Institute of Technology, Pune, Maharashtra, India. Her research interests include mathematics, machine learning, deep learning, and natural language processing.
MADHURI YADAV received the B.Tech. degree in computer science and engineering from the Symbiosis Institute of Technology, Pune, Maharashtra, India. She has completed her five month research internship at the Symbiosis Center for Applied Artificial Intelligence. She is interested in data science, machine learning, and web development.
SWATI AHIRRAO received the Ph.D. degree from Symbiosis International (Deemed University), Lavale, Pune, Maharashtra, India. She is currently working as an Associate Professor with SIT, Pune. Her research interests include big data analytics, machine learning, deep learning, natural language processing, and reinforcement learning. She has published over 31 research papers in international journals and conferences. According to Google Scholar, her articles have 71 citations, with an H-index of 3 and an i10-index of 2.