Context-Aware Deep Learning Model for Detection of Roman Urdu Hate Speech on Social Media Platform

Over the last two decades, social media platforms have grown dramatically. Twitter and Facebook are the two most popular social media platforms, with millions of active users posting billions of messages daily. These platforms allow users freedom of expression. However, some users exploit this facility by disseminating hate speech. Manual detection and censorship of such hate speech are impractical; thus, an automatic mechanism is required to detect and counter hate speech in a real-time environment. Most research on hate speech detection has been carried out in the English language, while minimal work has explored other languages, notably Urdu written in Roman script. A few studies have applied machine learning and deep learning models to Roman Urdu hate speech detection; however, owing to the scarcity of Roman Urdu resources and the lack of a large corpus with defined annotation rules, a robust hate speech detection model is still required. With this motivation, this study contributes in the following manner. First, we developed annotation guidelines for Roman Urdu hate speech. Second, we constructed a new Roman Urdu Hate Speech Dataset (RU-HSD-30K), annotated by a team of experts using these rules. To the best of our knowledge, a Bi-LSTM model with an attention layer has not been explored for Roman Urdu hate speech detection. Therefore, we developed a context-aware Roman Urdu hate speech detection model based on Bi-LSTM with an attention layer, using custom word2vec word embeddings. Finally, we examined the effect of lexical normalization of Roman Urdu words on the performance of the proposed model. Traditional machine learning models as well as deep learning models, including LSTM and CNN, were used as baselines. The performance of the models was assessed in terms of evaluation metrics such as accuracy, precision, recall, and F1-score, and the generalization of each model was also evaluated on a cross-domain dataset.
Experimental results revealed that Bi-LSTM with attention outperformed the traditional machine learning models and the other deep learning models, with an accuracy of 0.875 and an F1-score of 0.885. In addition, the results demonstrated that the proposed model (Bi-LSTM with attention layer) generalizes better than previous models when applied to unseen data. The results also confirmed that lexical normalization of Roman Urdu words enhanced the performance of the proposed model.


I. INTRODUCTION
Over the past two decades, there has been a substantial increase in social media users. Twitter is a well-known social networking site. According to one report, Twitter has 330 million active users (J. Clement, 2020), and 200 billion tweets are posted annually. Likewise, Facebook has over 2.89 billion active users, who share an average of 4.75 billion items and 10 billion messages daily. These platforms enable users to openly share their ideas and perspectives on any subject relating to current events, religion, politics, sports, the entertainment industry, etc. On the other hand, some users abuse this service by promoting hatred, instigating violence, posting words that incite conflict, and using inappropriate and abusive language. Some conduct online harassment based on gender, religion, race, or sexual orientation. In most terror incidents, the suspects have a long history of hate speech on social media, indicating that social media is a factor in their extremist activity. Facebook was used to live-stream a 2019 terrorist attack in Christchurch, New Zealand. Similarly, on April 14, 2017, a student at a Pakistani university was murdered by a mob of other students over alleged anti-religious Facebook statements. A suicide bomb attack on a Shia mosque during Friday prayers in Kandahar (Afghanistan) in October 2021 became a Twitter trend, resulting in hateful messages from members of the Shia and Sunni faiths, Afghans, and Pakistanis, who blamed each other for the catastrophe. Such hate speech and abusive language incite violence and aggression, making the protection of human rights more challenging.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei-Yen Hsu.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Furthermore, political campaigns against political leaders on social media employ highly derogatory language. Social media posts that target underrepresented groups on the basis of their religious beliefs can provoke violence. Similarly, Muslims worldwide have an ardent affection for the Prophet Muhammad (SAW), which is an integral part of their religion; therefore, anything shared on social media that promotes blasphemy against the Prophet Muhammad is intolerable to Muslims, and such a scenario may result in violence. Moreover, comments on social media about celebrities are often vulgar. Likewise, online harassment of women is on the rise. According to one survey, 66% of adolescent girls claim to have been bullied on Facebook. Similarly, 73% of adults reported having witnessed online harassment, and 40% had been personally targeted (Pew Research Center, 2017).
To eradicate online hate speech from electronic media, the government enacted regulations such as the ''National Action Plan'' and the ''Prevention of Electronic Crimes Act, 2016''. Implementing these regulations requires the development of advanced artificial intelligence (AI) tools to identify online hate speech. Facebook's proactive detection rate for hate speech increased 14 percentage points over the past year, from 80 percent to 94 percent (Facebook Community Standards Enforcement Report, 2020). However, this feature initially worked only in English; Facebook has since extended its technology to Spanish, Arabic, Indonesian, and Burmese, among other languages. In contrast, little effort has been made to combat hate speech written in Roman Urdu script, which is extensively used by social media users in Pakistan, Bangladesh, India, and Afghanistan. Thus, automatically detecting Roman Urdu hate speech on social media platforms remains a challenge. Numerous researchers have attempted to identify online hate speech in English, as discussed in Section II, and have identified several significant obstacles.

II. RELATED WORK
This section describes previous studies undertaken on the detection of hate speech, which is summarized in Table 1. The literature is categorized as follows:

A. NORMALIZATION OF LEXICAL VARIATIONS
Roman Urdu is the name given to Urdu written with the English alphabet according to word pronunciation. Roman Urdu is not a standard language; hence, its lexicon lacks standard spellings. Different individuals spell the same word differently. It is therefore difficult to map the numerous lexical variants of Roman Urdu terms to a standard spelling, and little research has been undertaken in this area. The authors in [1] devised a feature-based clustering technique that encodes Roman Urdu words into phonetic representations and then clusters lexical variants of Roman Urdu words with identical phonetics. The technique was evaluated on a manually annotated gold-standard dataset and achieved a higher F-score than the baseline. Likewise, the authors of [2] proposed a method based on a phonetic algorithm to normalize lexical variations in Roman Urdu text. The authors of [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] conducted comparative studies assessing how normalization techniques handle lexical variation in the text of social media posts in several languages, including English, Japanese, Chinese, Bangla, Dutch, Finnish, Spanish, Arabic, Polish, and Roman Urdu. The normalization techniques include rule-based approaches, machine learning algorithms, phonetic algorithms, stemming and lemmatization, etc.

B. ANNOTATION OF DATASETS
Roman Urdu lacks annotated datasets but does have pre-trained embeddings. The viability of transfer learning was investigated by [14], which examined the performance of five distinct multilingual embeddings, namely LASER, ELMo, BERT, XLM-RoBERTa, and FastText, for Roman Urdu text. This work reported the development of RomUrEm, a pre-trained BERT model for Roman Urdu, but the model is not accessible to scholars for further study and reuse.
The authors of [15] utilized an iterative strategy to develop annotation guidelines for their own constructed Hate Speech Roman Urdu 2020 corpus (HS-RU-20). The corpus was intended to be utilized for Hate Speech detection.

C. LEXICAL RULE-BASED APPROACHES
Researchers have proposed lexical rule-based approaches for effectively detecting offensive and hateful terms. The work of [16] offered a lexicon-based methodology for identifying hate speech that detects subjectivity in sentences and constructs a vocabulary of hate-related words using a rule-based mechanism. A classifier is then trained on features collected from the lexicon and tested on documents to detect hate speech. Similarly, the authors of [17] presented VADER, a rule-based approach for sentiment analysis of social media content. They designed and validated a gold-standard sentiment lexicon, identified and analyzed rules concerning the conventional use of grammatical and syntactic aspects of text, and compared the performance of lexical rule-based approaches to baseline models.
The most significant shortcoming of lexical rule-based techniques is their inability to capture the context and domain of words in a document. In addition, rule-based approaches are lexicon-based; therefore, these methods are not robust against adversaries and subtleties in the text.

D. TRADITIONAL MACHINE LEARNING APPROACHES
Historically, lexical approaches have been deemed successful for detecting potentially offending phrases. Nonetheless, additional research revealed that, except for a few terms indicated by the Hatebase lexicon and identified by Human Coders, the lexical approaches gave erroneous findings in detecting hate speech. In contrast, machine learning algorithms performed significantly better than lexical techniques for hate speech detection [18].
The authors of [18] discovered that Logistic Regression and Linear SVM beat the other models for detecting hate speech in tweets by a significant margin. In addition, the study showed that biases in the dataset and the existence of overlapping phrases in both hate speech and offensive tweets led to misclassifications.
The research of [19] demonstrated that all proposed cutting-edge models are vulnerable to adversaries. They proved that adversarial training could not fix the issue fully and argued that character-level features are more resistant to attack than word-level features. They suggested Logistic Regression with character-level features for the development of a robust model for detecting hate speech.
The authors of [20] proposed a novel technique known as Multi-view SVM, which has the extra benefit of improved interpretability and outperformed the previous algorithms for detecting hate speech. Similarly, the work of [21] demonstrated that the bigram features, when combined with the support vector machine method, had the highest overall accuracy (79%) for automatic Hate Speech Detection.
The authors of [22] proposed the SVM-Radial Basis Function (RBF) method with character level FastText for Hate speech detection in Hindi-English code-mixed social media text. They demonstrated that character-level features of Fast-Text provide more information than word and document-level features for classification.
The majority of hate speech detection efforts have been carried out in English; detecting hate speech in Roman Urdu has received relatively little attention. The authors of [15] applied various machine learning algorithms to Roman Urdu hate speech detection and demonstrated that Logistic Regression outperformed other machine learning approaches and a deep learning approach (CNN) in distinguishing between neutral and hostile tweets. In addition, they asserted that bag-of-words is an appropriate feature extraction method for detecting Roman Urdu hate speech. Traditional machine learning models function effectively as long as the dataset is small; however, their performance declines on large datasets.

E. DEEP LEARNING APPROACHES
Deep Learning outperforms other techniques if the data size is large. However, with small data sizes, traditional Machine Learning algorithms are preferred. The work by [23] proposed CNN with word2vec embedding for abusive content detection on social media sites and showed that deep learning models could outperform the classification results of the traditional SVM when the training dataset is imbalanced. The performance of the SVM can be dramatically improved through oversampling. This work assists researchers in selecting acceptable text classification algorithms for the identification of abusive content, including cases in which the training datasets exhibit class imbalance.
Similarly, the authors of [24] conducted numerous studies to detect cyberbullying on various social media platforms (SMPs). They utilized three datasets (Formspring with 12,000 posts, Twitter with 16,000 posts, and Wikipedia with 100,000 posts) and four Deep Neural Network (DNN) models, namely CNN, LSTM, Bidirectional LSTM (BiLSTM), and BiLSTM with Attention. All DNN models utilized the same fundamental structure as [25]. It was observed that the models favored the non-bullying class since the datasets were completely unbalanced, with bullying constituting the minority class. Moreover, it was discovered that oversampling considerably enhanced performance. In conclusion, they determined that DNN models combined with transfer learning beat state-of-the-art results on all three datasets. This work, however, encountered the issue of overfitting.
The authors of [26] utilized the BERT model for abusive language classification and demonstrated that BERT performs better than other state-of-the-art models when fine-tuned for the underlying problem. However, BERT can only be fine-tuned for languages for which a BERT model has been pre-trained.
The authors of [27] proposed a multi-channel model with three variants of BERT (MC-BERT) for hate speech detection: English, Chinese, and multilingual BERTs. In addition, they investigated the use of translations as extra input by translating training and test sets into the languages required by various BERT models. They discovered that fine-tuning of the Pre-trained BERT model achieved state-of-the-art or comparable performance on three distinct datasets. Roman Urdu, however, lacks a pre-trained BERT model and is exceedingly difficult to translate due to the lexical variations of Roman Urdu words. In addition, dictionaries in both languages are required for translation and no Roman Urdu dictionary exists that includes the Standard Lexical form of Roman Urdu words.
The work of [14] introduced CNN-gram, a new deep learning architecture for detecting hate speech and abusive language in Roman Urdu, and compared its performance with seven existing baseline techniques on the RUHSOLD dataset. The proposed model showed more robustness as compared to the baselines. The work by [28] demonstrated that the performance of Neural-based approaches for the detection of hate speech is better than Classical ML techniques. Amongst Neural network architecture, BI-LSTM, with multiple embeddings, exhibited the best performance in detecting hate speech.
The authors [29] presented a transfer learning strategy for detecting hate speech based on an existing pre-trained language model known as BERT and assessed the proposed model using two publicly available datasets. Next, they developed a method for reducing the influence of bias in the training set during the fine-tuning of a pre-trained BERTbased model for the hate speech detection task.
The study by [30] demonstrated a pipeline for adapting the general-purpose RoBERT model to detect Vietnamese hate speech. They proposed a pipeline that greatly improved performance, attaining a new F1 score of 0.7221 for the Vietnamese Hate Speech Detection (HSD) campaign.
The work of [31] introduced HateXplain, the first benchmark dataset of hate speech spanning several facets of the topic. Using current state-of-the-art models, they noticed that even models that perform exceptionally well in classification do not score well on explainability criteria such as model plausibility and model faithfulness. Additionally, they discovered that models that employed human rationales for training are more effective at minimizing inadvertent bias towards target communities.
The research of [24], [25] was thoroughly analyzed by [32], which concluded that the findings reported by state-of-the-art systems indicate that supervised techniques attain nearly perfect performance, but only on particular datasets, the majority of which are in English. They examined the apparent disparity between the available literature and real applications, investigating the experimental methodology utilized in prior studies [24], [25] and its generalizability to additional datasets. Their findings revealed methodological flaws and a significant dataset bias; as a result, current state-of-the-art performance claims are greatly exaggerated. They identified that the majority of the difficulties are due to data overfitting and sampling concerns.
According to a study [33], RNN outperformed other shallow and deep learning models in terms of accuracy on Dataset-1 and Dataset-2, with scores of 0.7871 and 0.9030, respectively. However, the datasets were distributed in an imbalanced manner in this study. In addition, the study contributes to the detection of English hate speech posted on social media during the COVID-19 era but was unable to detect Roman Urdu hate speech made on social media.

F. UNSUPERVISED LEARNING
Numerous researchers have proposed unsupervised learning approaches to the problem of hate speech detection. The authors of [34] proposed the Growing Hierarchical Self-Organizing Map (GHSOM), a method of unsupervised learning for cyberbullying detection on social media. This research retrieved hand-crafted features that were utilized to capture the semantic and syntactic traits of cyber bullies. The proposed strategy was effective in detecting cyberbullying on social media platforms. However, the approach was incapable of identifying sarcastic text. In addition, the work is restricted to the English language. The authors of [35] investigated a unique framework for detecting frequently discussed subjects/topics on Facebook that cause hate speech. They utilized graphs, sentiment and emotion analysis approaches to cluster and analyze posts on popular Facebook pages.
Consequently, the proposed framework can automatically detect pages whose comment sections promote hate speech on sensitive themes. According to the results, the proposed method achieved an accuracy of 0.74. However, this work is restricted to English hate speech and uses English-language slurs within the context of American society. Language and other demographic factors influence hate speech; in Roman Urdu, insults and derogatory terms differ from their English counterparts. The authors of [39] presented an unsupervised approach for detecting German hate speech on Twitter. They used the skip-gram method to determine the context of words and the k-means clustering methodology to group them into meaningful categories, such as immigration, crime, and politics. A machine learning model was then used to detect hate speech automatically. A complete qualitative and quantitative investigation of what constitutes hate speech from a political communication perspective was also conducted. The outcome demonstrated that the proposed method yields an accuracy score of 84.21 and an F-score of 84.21. However, this study is confined to detecting hate speech in German tweets. In addition, the machine learning model utilized in this study may be prone to overfitting and may perform poorly on datasets from other domains.

G. METAHEURISTIC APPROACHES
The study in [42] suggested a meta-heuristic approach for automatic hate speech detection based on the AntLion Optimization (ALO) and MothFlame Optimization (MFO) algorithms. To extract features, they utilized Bag of Words (BoW), Term Frequency (TF), and document vectors (Doc2Vec). The suggested model was evaluated on three datasets, with ALO and MFO exhibiting the highest accuracy (0.921 and 0.90, respectively). This work was applied to datasets in English and Spanish; however, the technique might be extended to other languages, such as Roman Urdu.
The literature revealed that researchers have worked on Hate Speech and Offensive Language Detection, mainly in English and other regional languages worldwide. Many researchers chose their native languages to detect hate speech. However, relatively little research has been conducted on detecting Roman Urdu Hate Speech. As native Urdu speakers, we undertook this study to detect Roman Urdu hate speech on social media.
According to the literature, comprehensive datasets on Roman Urdu Hate Speech are limited, and there is no standard Roman Urdu dictionary. To the best of our knowledge, a deep learning model (Bi-LSTM with attention mechanism) has not been explored for detecting Roman-Urdu hate speech. Consequently, this study investigates the application of (Bi-LSTM with attention mechanism) in detecting Roman Urdu hate speech.

III. METHODOLOGY
This section describes our proposed method in depth, as depicted in Figure 2. The proposed architecture encompasses five phases, described below. Twitter and Facebook, two popular social media networks, were chosen as data sources. Twitter was scraped for data (tweets) using Tweepy, a Python package for accessing the Twitter API. However, scraping messages from Facebook was difficult, since Facebook does not permit scraping applications to extract its messages automatically. As illustrated in Figure 1, a computer operator was therefore employed to extract Facebook messages manually into an MS Excel file.
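To make the collection step concrete, the following is a minimal sketch of keyword-based filtering of collected posts. The keyword list and the `collect_rows` helper are illustrative, and a stub list stands in for the actual Tweepy call (`tweepy.Client(...).search_recent_tweets(...)`), which requires API credentials.

```python
# Hypothetical keyword list; a stub list stands in for the real Tweepy call
# (tweepy.Client(bearer_token=...).search_recent_tweets(query=...)),
# which requires API credentials.
KEYWORDS = ["dehshatgardi", "beghairat"]

def collect_rows(tweets, keywords):
    """Keep only posts that contain at least one target keyword; the label
    column is left blank for the annotators to fill in later."""
    rows = []
    for text in tweets:
        if any(k in text.lower() for k in keywords):
            rows.append({"text": text, "label": ""})
    return rows

# Stub standing in for the API response
sample = ["Yeh dehshatgardi hai", "hello world", "kitna beghairat insan hai"]
rows = collect_rows(sample, KEYWORDS)
```

The resulting rows would then be written to the spreadsheet that the annotators work from.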

3) SELECTION OF ROMAN URDU COMMENTS
The gathered data were manually filtered to include only Roman Urdu comments/tweets.

B. DATASET DEVELOPMENT
1) DEVELOPMENT OF ANNOTATION GUIDELINES
Annotation Guidelines/rules are developed to get consistent annotations of the dataset. If comprehensive guidelines are in place, the expert can readily annotate the data with relevant labels, resulting in high-quality data annotation. The maximum inter-annotator agreement is used to measure annotation quality.
The initial stage in establishing annotation rules is explicitly defining the classes/labels, i.e., distinguishing between hate speech and neutral or normal speech. As part of this study, we established annotation guidelines for Neutral and Hate speeches, which are provided in Table 2 and Table 3. These rules allowed the annotators/experts to distinguish between neutral and hateful speech and to label the data appropriately.

2) MANUAL ANNOTATION OF TEXT USING ANNOTATION GUIDELINES
Initially, we manually labeled the text messages as ''Hate'' or ''Neutral'' in the Excel file created in Section III(1)(ii). The annotation guidelines discussed in Section II(2)(i) were consulted while performing the annotation. In this step, we developed a base annotated dataset that was yet to be validated by a team of experts.

3) VALIDATION BY EXPERTS
The corpus was shared with a team of ten experts, comprising representatives from various religions, genders, nations, and fields, to annotate data as Hate Speech and Neutral Speech. The team was given the Annotation Guidelines designed for this study.

4) FINALIZATION OF ANNOTATION BASED ON INTER ANNOTATOR AGREEMENT
After receiving dataset annotations from the 10 annotators/experts, the final annotation decision was made based on the value of the highest inter-annotator agreement, as depicted in Figure 1. There are two common measures of inter-annotator agreement: Cohen's kappa and Fleiss' kappa. Cohen's kappa assesses the degree of agreement between a pair of annotators, whereas Fleiss' kappa calculates the degree of agreement across a group of several annotators. Since this study involves ten annotators, Fleiss' kappa was utilized to calculate the inter-annotator agreement.
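Fleiss' kappa can be computed directly from a count matrix of annotator votes. Below is a minimal NumPy sketch of the standard formula (observed agreement versus chance agreement); the example matrices are illustrative, not taken from the RU-HSD-30K annotations.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a (n_items, n_categories) count matrix, where
    counts[i, j] is the number of annotators who put item i in category j
    (each item is rated by the same number of annotators)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                      # annotators per item
    N = counts.shape[0]                            # number of items
    p_j = counts.sum(axis=0) / (N * n)             # category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()  # observed vs. chance
    return (P_bar - P_e) / (1 - P_e)

# Ten annotators, two classes (Hate, Neutral)
perfect = [[10, 0], [0, 10], [10, 0]]   # unanimous on every item
chance = [[5, 5], [5, 5], [5, 5]]       # even split on every item
```

Unanimous agreement yields kappa = 1, while agreement no better than chance yields kappa at or below 0.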

5) CLASS BALANCING
After data annotation was completed, the dataset was balanced by removing excess text messages so that both classes contained an equal number of samples, i.e., Hate: 15,000 and Neutral: 15,000.
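The balancing step amounts to undersampling the majority class. The following is a minimal sketch under the assumption that messages are held as (text, label) pairs; the `balance` helper and toy data are illustrative.

```python
import random

def balance(messages):
    """messages: list of (text, label) pairs. Undersample the majority class
    so both classes end up the same size."""
    random.seed(0)  # reproducible choice of which messages are removed
    hate = [m for m in messages if m[1] == "Hate"]
    neutral = [m for m in messages if m[1] == "Neutral"]
    k = min(len(hate), len(neutral))
    return random.sample(hate, k) + random.sample(neutral, k)

# Toy imbalanced set: 4 Hate vs. 7 Neutral
data = [("a", "Hate")] * 4 + [("b", "Neutral")] * 7
balanced = balance(data)
```

After balancing, each class contributes the same number of messages, as in the final 15,000/15,000 split.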

C. PREPROCESSING
Before feeding the dataset to the model, it was pre-processed and transformed into an appropriate format for the model's optimal performance. The following steps were utilized for the dataset's preliminary processing.

2) LOWERCASE
To standardize the spelling, the entire text was transformed to lowercase letters to minimize case-sensitivity concerns with prediction.

3) STOP WORDS REMOVAL
Stop words are high-frequency function words that contribute little to the model's performance. To reduce the dataset's dimensionality, these terms are deleted. Typical examples include prepositions, articles, conjunctions, and so forth. In this research, we used the Roman Urdu stop words available at https://github.com/haseebelahi/roman-urdu-stopwords to filter them out of the text.
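Stop-word removal is a simple set-membership filter. The sketch below assumes a small hand-picked subset of Roman Urdu stop words; in practice the full list would be loaded from the GitHub repository cited above.

```python
# Hypothetical subset of the Roman Urdu stop-word list; the full list
# would be loaded from the repository cited above.
STOP_WORDS = {"ka", "ki", "ke", "hai", "ho", "main", "se", "aur", "to"}

def remove_stop_words(text):
    """Lowercase the text, then drop stop words, keeping content words."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

cleaned = remove_stop_words("Yeh log mulk ke dushman hain aur dehshatgard hai")
```

Here ''ke'', ''aur'', and ''hai'' are removed, shrinking the vocabulary the model must handle.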

4) NORMALIZATION OF LEXICAL VARIATIONS
Roman Urdu is not an official language; it is the name given to Urdu written in the Latin script, commonly known as the Roman script. There is no standard way to spell its words: people use several spellings of a word based on how it is pronounced. For example, the Urdu word [ ] [Terrorism] is written as ''deshatgardi'', ''dehshatgardi'', ''dahshatgardee'', ''dehshat gardi'', or ''deshtgardee''. Similarly, ( ) [Shameless] is written as ''baghairat'', ''bayghiarat'', ''by ghairat'', ''baygherat'', ''beghairat'', etc. This is called lexical variation. Lexical variants of Roman Urdu words increase the number of distinct words in the corpus and degrade the performance of natural language processing models. By normalizing these lexical variants to a standard form, it is possible to reduce sparsity and improve performance.
It is difficult to handle the multiple lexical variants of Roman Urdu terms. Several normalization strategies, such as Soundex and UrduPhone, manage such lexical variance. The Soundex algorithm encodes words based on their phonetic sounds; words with similar sounds can then be grouped according to their phonetic codes. UrduPhone is a phonetic encoding scheme devised specifically for the Roman Urdu script [29]. It is derived from Soundex but differs from it in two respects. First, Soundex is based on four-character codes, while UrduPhone is based on six-character codes that retain more information. Second, UrduPhone employs groups based on homophones, which are mapped differently in Soundex. UrduPhone encodes Roman Urdu text in accordance with its pronunciation and tackles the character-level differences that are expected to occur when the Roman script is used to write Urdu words.
In addition, lexical normalization is required to accommodate lexical variants of Roman Urdu terms and map spelling variations to a single lexical form. There is no standard form of Roman Urdu words to which variants can be mapped, and there is currently no standard lexicon or spelling for Roman Urdu terms. In light of this, we compiled a 4,000-word lexicon of Roman Urdu words, which facilitates the normalization of lexical variants using a supervised technique. The primary objective of creating the dictionary is to standardize Roman Urdu words, not to create a Roman Urdu-to-English dictionary. Roman Urdu is not a language in its own right but rather a script in which the majority of Urdu speakers write Urdu on social media using electronic devices such as computers, mobile phones, and tablets.
To standardize the Roman Urdu words, the following rules listed in Tables 4-7 were established.

RULE 2: SOME CHARACTERS OF THE ENGLISH ALPHABET SHALL BE IGNORED IN THE ABOVE MAPPING
If several English letters produce the sound of a single Roman Urdu letter, the extra letters will be disregarded, and only the letter most frequently used by users shall be employed. The intention is to eliminate spelling variance. For example, '' '' can be written with 'K' or 'C'; similarly, '' '' can be written with ''W'' or ''V''. The algorithm is described as follows: after establishing the Roman Urdu dictionary, the numerous lexical variations of words are normalized. The procedure compares the phonetic codes of clusters with the phonetic codes of the corresponding standard terms in the dictionary and replaces lexical variations with their standard forms. The same module was applied to the entire corpus to replace variants of a term with its standard form, as depicted in Figures 3-4.

5) ROMAN URDU WORD EMBEDDING USING WORD2VEC TECHNIQUE
After pre-processing and normalization, extracting features is the next crucial step in analyzing the raw data. The computer does not manipulate raw text directly; instead, the data are converted into derived numerical values while preserving the information contained in the original data. There are various methods for feature extraction, including Bag of Words, TF-IDF (Term Frequency/Inverse Document Frequency), word n-grams, and character n-grams.
In this study, the features were extracted with a custom Word2Vec embedding, trained by passing our corpus to the Word2Vec function with a minimum count of 2 and a dimension size of 200. The Word2Vec technique turns each corpus word into a 200-dimensional real-valued vector based on its meaning and context; in this manner, semantically related words are grouped together in the vector space.

6) TRAIN / TEST SPLIT
For the experiment, the dataset is divided into two sets, i.e., Training Set and Testing Set.
In this research, we split our dataset into a training set and a testing set with a ratio of 80:20, using the ''train_test_split'' function of the sklearn library in Python.
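The 80:20 split can be reproduced as follows. The placeholder texts and labels are illustrative; `stratify` is an assumption on our part to keep both classes balanced in each split, and `random_state` is fixed only for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for the 30,000-message dataset
texts = [f"message {i}" for i in range(30000)]
labels = [i % 2 for i in range(30000)]  # 0 = Neutral, 1 = Hate (balanced)

# 80:20 split; stratify preserves the 50/50 class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
```

This yields 24,000 training and 6,000 testing messages, with 3,000 of each class in the test set.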

D. TRAINING
1) TRAINING SET
The training set is the portion of the data utilized for learning, i.e., for fitting the parameters of the machine learning model. It is the data used to train the model and enable it to uncover the hidden features/patterns in the data.
Across epochs, the neural network is presented with the same training data multiple times, and the model continues to learn features from the data. The training set should contain a diverse group of inputs so that the model is trained on all scenarios and can predict any future unobserved data sample. We used 80 percent of the corpus as training data.

2) SELECTION OF DEEP LEARNING MODELS
In this research, the following deep learning models were applied to detect Roman Urdu hate speech.

3) TRAINING OF DEEP LEARNING MODELS
At first, the training dataset was preprocessed, normalized, and embeddings were obtained with the custom word2vec embedding technique. Then, the aforementioned deep learning models were trained with embeddings (feature vectors) obtained from the training data as per the setting described in Section IV.

4) CROSS-VALIDATION OF TRAINED MODELS
Each model was trained at this stage using a validation split of 0.2. Here, we employed the simplest form of cross-validation, hold-out validation, in which 20% of the training data is held out to monitor generalization during training while the majority of the data is used for fitting. The metrics and loss values recorded during training were stored in a ''history'' object for visualization purposes.
E. TESTING 1) TESTING SET
As discussed in Section III (3)(vi), the dataset was divided into two parts: one part was used for training the models and the other for testing them. The testing set was also passed through the preprocessing, normalization, and embedding steps.

2) TESTING OF MODELS
In this step, the testing dataset was supplied to each trained model for evaluation, and the results in each case were recorded.

3) EVALUATION OF MODELS
In this step, the results of all deep learning models were evaluated through statistical analysis of the test results. The following metrics were used for the evaluation of the models: accuracy, precision, recall, and F1-score. The performance of each model is interpreted using graphical representations. Python provides the Matplotlib and Seaborn libraries for visualization; we used Matplotlib to visualize the results of the different models considered in this study.
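Computing these metrics can be sketched with sklearn; the labels below are illustrative placeholders, not results from the study:

```python
# Sketch of the evaluation metrics used in the study.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # placeholder model predictions

print(accuracy_score(y_true, y_pred))    # 0.75
print(precision_score(y_true, y_pred))   # 0.75
print(recall_score(y_true, y_pred))      # 0.75
print(f1_score(y_true, y_pred))          # 0.75
```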

IV. EXPERIMENTAL SETTINGS
We conducted our experiments in the Colab environment hosted by Google, which provides a Python 3 Google Compute Engine backend (GPU) with 16 GB of RAM and 120 GB of storage. The DL algorithms were implemented in Python using TensorFlow with the Keras API. We used various Python libraries, including Keras, Pandas, NumPy, NLTK, JSON, Gensim, and sklearn. We also used the ''UrduPhone'' library introduced by [43].
For Roman Urdu Hate speech detection, we compared our proposed context-aware deep learning model based on Bi-LSTM+attention to baseline traditional machine learning and DL models.
Additionally, we performed two experiments with each model. One investigation was conducted utilizing our provided dataset without normalization, while the other was conducted on a normalized dataset (the dataset is normalized by removing lexical variations in Roman Urdu words).
Using the procedure illustrated in Figure 1, a dataset of 30,000 text messages was produced: 15,000 messages were labeled ''Hate,'' and the remaining 15,000 were labeled ''Neutral.'' The dataset was preprocessed, tokenized, and normalized. Features were extracted with Word2Vec embedding, which generates an n-dimensional vector for each word based on its context, so that semantically related terms lie close to one another in the vector space. The tokenized data was passed to the Word2Vec function with a minimum count of 2 and a dimension size of 200, resulting in a 200-dimensional vector for each word.
The experimental settings and hyperparameters used in each experiment are summarized in Table 8.

A. TRADITIONAL MODELS
Initially, seven traditional machine learning techniques were used as baselines to build hate speech detection models. For the conventional machine learning models, we defined a function named MeanEmbeddingVectorizer, which computes the mean of the Word2Vec embeddings of the words in a message, yielding a single fixed-length feature vector that traditional models can consume. The dataset was split into a training set and a testing set with an 80:20 ratio. The same experiment was repeated on the normalized dataset, and the results in each case are recorded in Table 9.
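A sketch of such a vectorizer is shown below. The class name follows the paper, but the implementation details (sklearn-style `fit`/`transform` interface, zero vector for all-unknown messages) are assumptions:

```python
# Sketch of a MeanEmbeddingVectorizer: averages the Word2Vec vectors
# of a message's words into one fixed-length feature vector.
import numpy as np

class MeanEmbeddingVectorizer:
    def __init__(self, word_vectors, dim=200):
        self.word_vectors = word_vectors   # mapping word -> np.ndarray
        self.dim = dim

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is a list of tokenized messages; unknown words are skipped,
        # and a zero vector is used if no word is in the vocabulary.
        return np.array([
            np.mean([self.word_vectors[w] for w in tokens if w in self.word_vectors]
                    or [np.zeros(self.dim)], axis=0)
            for tokens in X
        ])

# Toy 2-dimensional embeddings for readability (the study used 200).
wv = {"acha": np.array([1.0, 0.0]), "bura": np.array([0.0, 1.0])}
vec = MeanEmbeddingVectorizer(wv, dim=2)
features = vec.transform([["acha", "bura"], ["unknown"]])
# first row: [0.5, 0.5]; second row: zeros
print(features)
```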

B. DEEP LEARNING MODELS 1) LSTM MODEL
In the LSTM model, an embedding layer was added first, followed by an LSTM layer of size 64 with the 'relu' activation function and a Dropout layer with a rate of 0.2. Thereafter, a Dense layer of size 32 with the 'relu' activation function was placed, connected to another Dropout layer with a rate of 0.2, followed by a final Dense layer with a sigmoid activation function. The model was compiled with Adam as the optimizer and Binary Cross-Entropy as the loss function, and then fit on our data.
The same experiment was performed on the preprocessed non-normalized dataset (without lexical normalization) and on the normalized dataset (with lexical normalization); the results in each case were visualized using the ''matplotlib.pyplot'' library, as depicted in Figures 5-8. The accuracy trend in Figure 6 demonstrates that when the dataset is not normalized, the training accuracy of the LSTM model grows with each training epoch, whereas the validation accuracy fluctuates; in later epochs, the model appears to be overfitted. The model-loss graph demonstrates that the training loss decreases with each epoch, whereas the validation loss initially climbs, then declines, and then grows continually.
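The LSTM architecture described above can be sketched in Keras as follows; the vocabulary size and sequence length are illustrative assumptions not stated in the text:

```python
# Sketch of the described LSTM model: Embedding -> LSTM(64, relu) ->
# Dropout(0.2) -> Dense(32, relu) -> Dropout(0.2) -> Dense(1, sigmoid).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 200, 100   # assumed values

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```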

2) BI-LSTM MODEL
In the Bi-LSTM model, an embedding layer was constructed first. Afterward, a stack of two Bidirectional LSTM layers with 32 and 16 units was added, with a dropout of 0.3 after each Bi-LSTM layer. The final output was generated by a dense layer with a sigmoid function. Adam as the optimizer and Binary Cross-Entropy as the loss function were used to train the model. The same experiment was conducted on the preprocessed non-normalized and normalized datasets, and the results are depicted in Figures 9 through 12.
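The stacked Bi-LSTM just described can be sketched as follows; vocabulary size and sequence length are again illustrative assumptions:

```python
# Sketch of the described Bi-LSTM model: two stacked Bidirectional
# LSTM layers (32 and 16 units) with Dropout(0.3) after each.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 200, 100   # assumed values

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(16)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```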
The trend displayed in Figure 10 illustrates that the Bi-LSTM model's training accuracy increases with each training epoch even when the dataset is not normalized; however, training accuracy and validation accuracy are not consistent. Similarly, training loss diminishes after each epoch, whereas validation loss initially decreases and then increases continuously after a few epochs. These results indicate that Bi-LSTM overfits when applied to the non-normalized dataset. Training on the normalized dataset resulted in a slight improvement; nevertheless, after a few epochs, validation loss steadily increased. Consequently, the model does not reflect a good fit.

C. BI-LSTM + ATTENTION MODEL
In this configuration, an attention layer is added on top of the Bi-LSTM; the rest of the procedure is the same as for the Bi-LSTM model. This experiment was also performed on the non-normalized and normalized datasets, and the results in each case are depicted in Figures 13 through 16. Figure 14 demonstrates that the training and validation accuracy of the Bi-LSTM model with the attention layer increases with each epoch until it reaches a maximum of 86 percent in the final epoch. Nevertheless, this accuracy is inferior to that of the same model trained on the normalized dataset. Figures 15 and 16 show that the testing accuracy of Bi-LSTM + Attention on the normalized dataset is 87.50 percent. Figure 16 demonstrates that, when applied to the normalized dataset, the Bi-LSTM model with the attention layer generalizes better than the previous deep learning models.
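One way to realize attention over Bi-LSTM outputs is sketched below. The paper does not specify its exact attention formulation, so the additive scoring and weighted-sum pooling here (and the single Bi-LSTM layer) are assumptions for illustration:

```python
# Sketch of Bi-LSTM + attention: score each timestep, softmax-normalize
# the scores, and take the weighted sum of the Bi-LSTM outputs as a
# context vector for classification.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 200, 100   # assumed values

inputs = keras.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)  # (batch, 100, 64)
x = layers.Dropout(0.3)(x)

scores = layers.Dense(1, activation="tanh")(x)     # (batch, 100, 1)
weights = layers.Softmax(axis=1)(scores)           # attention weights over timesteps
context = layers.Dot(axes=1)([weights, x])         # (batch, 1, 64) weighted sum
context = layers.Flatten()(context)                # (batch, 64)

outputs = layers.Dense(1, activation="sigmoid")(context)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```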

D. CONVOLUTION NEURAL NETWORK (CNN) MODEL
The CNN architecture was developed by first utilizing an embedding layer with a dimension of 200, followed by a spatial dropout layer with a rate of 0.3. Next, a one-dimensional convolutional layer with 64 filters, a filter size of 3, and the ReLU activation function was added, followed by a max pooling layer. Then a second convolutional layer with 32 filters, a filter size of 5, ReLU activation, and 'same' padding was added, followed by a max pooling layer connected to a third convolutional layer with 16 filters, a filter size of 3, ReLU activation, and 'same' padding. Another max pooling layer and a Flatten layer were then added, followed by a Dense layer of size 32 with ReLU activation. A final Dense layer with a single sigmoid-activated output was added at the end.
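This stack can be sketched as follows; vocabulary size, sequence length, and the default pooling size of 2 are illustrative assumptions:

```python
# Sketch of the described CNN: three Conv1D blocks with max pooling,
# then Flatten -> Dense(32, relu) -> Dense(1, sigmoid).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 200, 100   # assumed values

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.SpatialDropout1D(0.3),
    layers.Conv1D(64, 3, activation="relu"),
    layers.MaxPooling1D(),
    layers.Conv1D(32, 5, activation="relu", padding="same"),
    layers.MaxPooling1D(),
    layers.Conv1D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling1D(),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```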
This experiment was also performed on the preprocessed normalized and non-normalized datasets, and the results in each case are visualized in Figures 17-20. Figure 18 depicts the accuracy of the CNN model trained on the non-normalized dataset. The graph demonstrates that the model's training accuracy steadily improves after each epoch, but its validation accuracy decreases from 0.87 to 0.862.
Similarly, when the CNN model is trained on the normalized dataset, its validation accuracy barely reaches 0.866, as shown in Figure 19. The CNN model's training loss lowers with each epoch, beginning at 0.41 and reaching 0.14. The validation loss indicates that the loss initially decreased from 0.36 to 0.33 before jumping to 0.39. The training and validation patterns depicted in Figure 20 reveal that the CNN model suffers from severe overfitting on both the normalized and non-normalized datasets.

V. RESULTS AND DISCUSSIONS
The testing dataset was used to evaluate both the classical machine learning models and the deep learning models. The performance of the trained models was measured using the well-known metrics of accuracy, precision, recall, and F-measure, and the results are presented in Table 10. It is evident from the empirical results reported in Table 10 and illustrated in Figures 21 and 22 that the deep learning models outperformed the machine learning models. LSTM, Bi-LSTM, and CNN scored lower in accuracy and F-measure than Bi-LSTM with the attention layer. In addition, Random Forest and XGBoost performed relatively better than the other conventional machine learning methods. Figure 16 demonstrates that Bi-LSTM with an attention layer not only performed well in terms of accuracy and F-score but also generalizes better than the other models. Figure 8 demonstrates that, after normalization, LSTM is also effective in generalization. As depicted in Figures 18 and 20, the CNN's generalization was inadequate, resulting in substantial overfitting. LSTM and Bi-LSTM models were designed specifically for sequence and time-series data, such as the textual data used in Natural Language Processing (NLP), whereas CNN was developed for image processing.
Next, we reviewed the findings of the earlier work by Khan et al. [15]. We selected their results for the two class categories, neutral and hostile. They implemented Naive Bayes, Logistic Regression, Random Forest, SVM, and CNN models, trained on 5,000 Twitter messages. Their work utilized different feature extraction techniques, including character-level features (CLF),

VI. CONCLUSION AND FUTURE WORK
Detecting Roman Urdu hate speech is a difficult challenge due to limited resources. Our proposed method demonstrates the viability of this novel approach to hate speech detection (HSD). The following conclusions are derived from the experimental findings.
In conclusion, the context-aware HSD model based on Bi-LSTM with an attention layer outperformed the machine learning models and the other deep learning models (LSTM, Bi-LSTM, and CNN) in terms of accuracy and F-measure. In addition, Random Forest and XGBoost performed relatively better than the other conventional machine learning methods.
As demonstrated in Figure 14, Bi-LSTM with an attention layer not only performed well in terms of accuracy and F-score, but the model also generalizes better than the others.
According to Table 11, previous research showed that Logistic Regression with count vectors beat both traditional and deep learning models (i.e., CNN) in terms of accuracy, precision, recall, and F-measure. However, our experimental findings revealed that the deep learning models performed better than the machine learning models, including Logistic Regression. The size of our dataset likely explains this boost in the performance of the deep learning models: deep learning models perform favorably on large datasets rather than small ones.
Regarding the effect of lexical normalization, Table 9 and Figures 5 to 20 reveal that the performance of all models, notably the traditional models, improved dramatically after the dataset was normalized, whereas normalization had little effect on the deep learning models. Nevertheless, the objective of handling lexical variations through normalization and the development of the Roman Urdu Dictionary was to improve performance and to standardize the Roman Urdu language by providing a standard spelling for each Roman Urdu word, thereby enabling the models to operate quickly and effectively.

FUTURE DIRECTIONS
Future efforts should focus on improving the annotation quality of the Roman Urdu hate speech dataset so that algorithms can make the most accurate predictions. Sentences containing sarcasm and implicit hate speech demand increased scrutiny. In addition, the severity of hateful content depends on the target. For instance, a slur directed at a friend may be viewed as amusing rather than offensive, yet the same phrase directed at an adversary may have severe consequences. Similarly, if the target is a religious holy figure, the sentence carries the highest level of hatred. Likewise, derogatory remarks that a person makes about his or her own race, nation, or gender may be viewed as humorous and made in jest, whereas the same remarks from an outsider would be regarded as toxic. Future hate speech detection methods may incorporate these elements.
On the other hand, the impact of normalization could be improved by refining the phonetic encoding method. During our experiments, we discovered that UrduPhone has some limitations that cause unrelated words to cluster together, which degrades performance and left the results of the deep learning models unimpressive. To address the issue of lexical variations in Roman Urdu words, we will work on an improved phonetic coding scheme in the future.