HHSD: Hindi Hate Speech Detection Leveraging Multi-Task Learning

Hate speech is now a frequent occurrence on social media. Most recent studies have focused on identifying hate speech in languages with abundant resources (e.g., English), whereas relatively few works address languages with limited resources (e.g., Hindi, the third most widely used language on earth). In this study, the Hindi Hate Speech Dataset (HHSD) is created following a novel hierarchical, fine-grained, four-layer annotation approach. The top layer separates posts into hateful and non-hateful categories. The second layer further categorises hateful posts into explicit hateful and implicit hateful. The third layer is the multi-label tagging of the post into topics such as political, religion, racism, or sexism. The fourth layer involves the identification of the named entity that is targeted, either explicitly or implicitly. Additionally, a thorough evaluation of the data annotation schema for trustworthy annotation is provided. HHSD is the largest multi-layer annotated corpus in Hindi compared with the existing multi-layer annotated data. Experiments on the dataset using transformer-based approaches in single-task learning (STL) attain encouraging performance in accuracy and weighted-F1 score. Multi-task learning (MTL), which includes multiple related hate speech detection tasks from high-resource English and from languages of the same linguistic family such as Urdu and Bangla, with a transformer encoder as the shared layers, obtains a significant increment over STL of 5.31% and 5.35% in accuracy and weighted-F1 for layer A, and of 8.20% and 22.83% for layer B. MTL surpasses STL by 8.98% and 4.07% in exact match and Hamming loss for layer C.


I. INTRODUCTION
With the advancement of the Internet and the widespread acceptance of opinion-rich online resources, users have many options to express their thoughts in real time. However, these platforms are frequently abused to disseminate harmful and hateful messages that target specific people or groups. The prevalence of unpleasant and abusive content on social media sites is posing a significant problem for the government and technology firms. Thus, it is crucial to create automatic methods to stop the spread of hateful content and filter it out. Hate speech is commonly defined as any communication that disparages a person or a group based on some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Aysegul Ucar.
Hateful language can range from offensive, aggressive, abusive, or harassing to toxic or violent. The Google project named Perspective defines toxicity as a rude, disrespectful, or unreasonable comment that makes the user leave the conversation. Therefore, it is crucial to identify detrimental posts and stop their spread over social networks to preserve social peace. The identification of hate speech on social media sites like Twitter, Facebook, etc., has received a lot of attention in recent years. Due to weaker regulation of hate speech in non-English speaking countries, platforms there are more vulnerable to abuse. A new law in India requires social media companies to remove any illegal content within 36 hours of receiving it. Current research on hate speech analysis is oriented toward monolingual corpora. Even Hindi, the first language of 528 million people (43.63%) in India, does not have sufficient labelled corpora. There will likely be 650 million internet users in India by 2023, which would cause the number of Hindi posts to rise dramatically. Recently, Hindi-English code-mixed data annotated for three subtasks were made publicly available by [3] and [4]. A Devanagari-based dataset (D-HOT) was created by [5] to establish a hate speech classifier in Hindi. Following the work of [5], the primary motivation of this paper is to create a novel dataset covering multiple aspects of hateful posts in Hindi. The script for Hindi is Devanagari.
References [6] and [7] argue that, due to the tremendous variability in annotating hate speech, including definition, categories, annotation standards, types of annotators, and agreement of annotations, the nature and content of the datasets are more significant than the models developed. The majority of social media platforms use reporting and manual review methods, which are constrained by the reviewer's speed, ability to understand the evolution of slang and jargon, and familiarity with multilingual content [8]. In this study, the models are trained in both single-task and multi-task learning paradigms. To improve the performance of the classifier, the training data is further augmented with existing English, Hindi, Urdu, and Bangla hate speech data in the multi-task framework.
The key contributions of this work are as follows:
1) Dataset: A novel Hindi Hate Speech Dataset (HHSD) is created following a hierarchical fine-grained four-layer annotation approach. The first two layers are binary classification tasks, the third is a multi-label classification task, and the fourth layer is named entity tagging of the targets. This dataset will be made available to the community for research purposes.
2) Model: The experiments are conducted on the newly created HHSD in a single-task learning fashion using numerous cutting-edge models, such as convolutional neural network (CNN), bidirectional long short-term memory (Bi-LSTM), multilingual BERT (M-BERT), language-agnostic BERT sentence embeddings (LaBSE), multilingual representations for Indian languages (MuRIL), XLM-RoBERTa, and IndicBERT. The multi-task learning results are reported using the two best-performing transformer encoders, viz., MuRIL and M-BERT, as the shared layers.
3) Analysis: The model efficacy is examined in a 5-fold cross-validation setting through qualitative and quantitative analysis. A statistical significance test is also performed to check whether the improvement of the best model is indeed significant.
4) Auxiliary data: In the multi-task learning setup, low-resource languages with a high degree of resemblance to Hindi, such as Urdu and Bangla, are also used to expand the training set. Bangla, Urdu, and Hindi translations and transliterations are likewise derived from the readily available English data. Human evaluation scores depicting the quality of translation and transliteration are shown in Table 9.
The remainder of the article is structured as follows. The related background literature is presented in Section II. Section III discusses the resource creation and the annotation schema. Section IV describes the state-of-the-art techniques used for the experiment. In Section V, the evaluation metrics and the experimental setting are described. The results and error analysis are reported in Section VI, and the conclusion and suggested future work are presented in Section VII.

II. RELATED WORK
The advancement in deep learning techniques has widened the application of natural language processing tasks such as classification. Work on hate speech detection is growing rapidly, but most of the datasets are available in English [2], [9], [10]. In general, the resources available for hate speech detection can be categorized into three settings [11]: (i) high resource setting, (ii) low resource setting, and (iii) zero resource setting. The majority of current research focuses on English and other high-resource languages. Recently, a few attempts have been made to make Hindi resources publicly available through shared tasks [3], [4], [12], [13], but because little labelled data is available, detecting hateful content remains a challenging task. In this section, the approaches used for Hindi, Urdu, and Bangla hate speech detection are discussed.

A. HINDI
The existing work on Hindi mainly deals with data that mixes Hindi and English. An annotated corpus of 4575 Hindi-English code-mixed texts is presented by [14]. The experiment is done with a support vector machine (SVM) and random forest (RF) by utilizing features such as character n-grams, word n-grams, punctuations, negations, and hate lexicon. For the purpose of identifying hate speech in social media code-mixed text, [15] studied a number of techniques, including the sub-word level long short-term memory (LSTM) model and the hierarchical LSTM model with attention based on phonemic sub-words. Reference [16] explored deep learning architectures like CNN, LSTM, and variants of BERT like M-BERT, IndicBERT, and monolingual RoBERTa to solve hate and offensive content detection on the data by [17]. The detection of code-mixed Hindi-English data is improved by including social media-based features in the model and additionally capturing the features of profane words [18]. A bias elimination algorithm is also developed to mitigate any bias from the proposed model.
A Hinglish offensive tweet (HOT) dataset was introduced by [19] for the multiclass categorization of offensive textual tweets in the Hindi-English code-switched language. The proposed multi-input multi-channel transfer learning (MIMCT) based model uses multiple embeddings and secondary semantic features in a CNN-LSTM parallel channel architecture to outperform various transfer learning models. Recently, multi-layer annotated data such as [3], [4], [13], [17], and [38] have been released. A suite of functional tests, i.e., HateCheckHIn, is presented by [22] for Hindi hate speech detection models by combining the existing monolingual and multilingual functionalities. Reference [23] concluded that character level embedding, GRU, and attention layer are novel to hate speech detection in Hinglish code-mixed language. A dataset of 10,000 samples from different sources is created by [24] to train a model with the Facebook pre-trained word embedding library to classify between hate and non-hate. Reference [25] experimented by aggregating six datasets in English, Hindi, and code-mixed Hindi to conclude that logistic regression with TF-IDF and POS features outperformed other monolingual models such as CNN-LSTM, BERT, and RoBERTa. A thorough examination of multilingual abusive speech in eight Indic languages from fourteen publicly accessible sources is presented by [26]. The experiments were carried out for numerous languages in a variety of settings, including ELFI (each language for itself), zero-shot learning, few-shot learning, model transfer, instance transfer, cross-lingual learning, etc. The effectiveness of transformer models like IndicBERT and M-BERT, and of transfer learning from already-trained language models like ULMFiT and BERT, for identifying hateful text in Hinglish is examined by [27]. For the purpose of identifying hate speech in Hinglish, the transformer-based interpreter and feature extraction model (TIF-DNN) outperformed current cutting-edge techniques. A human-labelled dataset of 150K samples (MACD) covering five languages, with 49% in the abusive class, is created by [28]. The user comments are crawled from 70K users of the social media platform ShareChat. An abusive content detection model, AbuseXLMR, pre-trained on a large number of social media comments in 15 Indic languages, which outperforms XLM-R and MuRIL on multiple Indian datasets, is also released.

B. BANGLA AND URDU
This section discusses the literature for the two low-resource languages: Bangla and Urdu.
A two-layer manually annotated Bangla aggression dataset (BAD) is presented by [29]. The experiment applies various machine learning algorithms (LR, SVM, RF, NB), deep learning algorithms (CNN, LSTM, CNN+BiLSTM), and deep transformer-based models like M-BERT, Distil-BERT, Bangla-BERT, and XLM-R. A dataset of 30,000 Bengali user comments from Facebook and YouTube, of which 10,000 are hate posts, is released by [30]. The comments were collected from 7 categories: sports, entertainment, crime, religion, politics, celebrity, TikTok and memes. The experiment leveraged three word embeddings (word2vec, fastText, and BengFast) and machine learning models such as SVM, LSTM, and BiLSTM. Reference [8] presented a lexicon of 621 hateful words in Roman Urdu. To identify hate speech, five fine-grained labels were added to the dataset in Roman Urdu. The transfer learning abilities of five pre-existing multilingual embedding models to Roman Urdu are examined through extensive experiments. Reference [31] explored different data augmentation techniques such as synonym replacement, random swap, random insertion, random deletion, MT5 text generation, and M-BERT for the improvement of hate speech classification in Roman Urdu.

C. MULTI-TASK LEARNING
Multi-task learning (MTL) [32] seeks to enhance the learning of a model for a classification task T_i by utilising the knowledge contained in some or all of the m learning tasks, given that all of them or a subset of them are related.
A deep shared-private multi-task learning framework that leverages valuable information from multiple related tasks, such as hate detection, racism detection, aggression detection, and harassment detection, is presented by [33]. Reference [34] focuses on hate speech detection in Spanish corpora and proposes an MTL model to benefit from associated tasks like polarity and emotion categorization. Reference [35] presented MT-GAN-BERT, a new architecture that extends BERT-based models with semi-supervised learning while using a single encoder in multi-task learning.

III. CORPUS CREATION
The procedure for gathering data is described in the next part, along with the annotation schema that was given to the annotators. Research and development in this area have been hampered by the lack of a significant Hindi annotated corpus for hate messages. We, therefore, set out to create new data.

A. DATA CRAWLING AND PROCESSING
The proposed dataset is constructed from Hindi tweets crawled using the Twitter search API. As is common practice, the stream was filtered based on a list of frequent Hindi words and by Twitter's language identification mechanism. The data collection covers the period from May 2021 to September 2021. Keywords and topics written in Hindi script and related to politics, religion, racism, and sexism, which were in the news at the time and for which hate speech can be expected, were identified. Abusive lexicons in Hindi script were also collected to crawl explicit hateful posts. The key objective of this method is to ensure that the final dataset contains a balanced mix of hateful and non-hateful tweets. Table 1 and Table 2 list the important keywords and topics that were used to crawl the posts.
Selecting relevant tweets for annotation: To select the relevant tweets from the large pool of unlabelled data for the final annotation, tweets were selected based on a weakly generated probability value. A classifier C_i based on a convolutional neural network (CNN) is trained using eight publicly available Hindi datasets (see Table 6). Each unlabelled tweet i obtained in the crawling is passed through the trained model C_i to generate a weak label based on the probability value p. A set S_h of tweets with p(hateful) ≥ 0.65 and a set S_nh of tweets with p(non-hateful) ≥ 0.85 are selected and given to the annotators.
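The selection rule above can be sketched as follows. This is a minimal illustration and not the authors' code: `hateful_probability` is a hypothetical stand-in for the trained CNN, and taking p(non-hateful) as 1 - p(hateful) is an assumption that holds for a binary softmax output but is not stated in the text.

```python
from typing import Callable, Iterable, List, Tuple

# Thresholds taken from the text: p(hateful) >= 0.65, p(non-hateful) >= 0.85.
HATEFUL_THRESHOLD = 0.65
NON_HATEFUL_THRESHOLD = 0.85


def select_for_annotation(
    tweets: Iterable[str],
    hateful_probability: Callable[[str], float],
) -> Tuple[List[str], List[str]]:
    """Split unlabelled tweets into the candidate sets S_h and S_nh.

    `hateful_probability` is a hypothetical stand-in for the trained CNN.
    It maps a tweet to p(hateful); p(non-hateful) is assumed to be 1 - p.
    """
    s_h: List[str] = []
    s_nh: List[str] = []
    for tweet in tweets:
        p_hateful = hateful_probability(tweet)
        if p_hateful >= HATEFUL_THRESHOLD:
            s_h.append(tweet)
        elif 1.0 - p_hateful >= NON_HATEFUL_THRESHOLD:
            s_nh.append(tweet)
        # Tweets that fall between the two thresholds are not selected.
    return s_h, s_nh


# Toy scores standing in for classifier outputs (illustrative values only).
toy_scores = {"t1": 0.90, "t2": 0.10, "t3": 0.50, "t4": 0.65}
s_h, s_nh = select_for_annotation(toy_scores, toy_scores.get)
```

Using a stricter threshold for the non-hateful set than for the hateful set, as the text does, keeps borderline posts out of the non-hateful pool.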
Figure 1 explains the data creation process. To prepare the collected tweets for annotation, we applied the following preprocessing steps:
1) The encoding was converted to UTF-8.
3) The tweet should not contain any links, pictures, or videos as they might contain information not available to the annotators.

B. HIERARCHICAL ANNOTATION SCHEMA
The annotation was carried out by five annotators with good knowledge of Hindi and linguistics. The annotators had a higher education level (Masters, PhD). The annotators were made aware of the posts' hatefulness before they began their annotations. In the HHSD dataset, we use a hierarchical annotation schema with four layers to determine (A) whether a post is hateful or not, (B) whether it is implicit hateful or explicit hateful, (C) its associated domain, and (D) its named entity target. The following subsections go into further depth about each layer. Figure 2 explains the flow of the annotation covering all four layers.

1) LAYER A: HATEFUL LANGUAGE IDENTIFICATION
Objective: In this layer, posts are divided into two distinct categories, i.e., Hateful and Non-hateful.
Hateful: Language that is intended to be disparaging, humiliating, or insulting to the members of a group or to an individual based on race, gender, ethnic origin, sexual orientation, disability, religion, or colour [2], [36].
Non-hateful: Posts that do not contain any hateful content.

2) LAYER B: CATEGORIZATION OF HATEFUL POSTS
Objective: This layer further categorizes hateful tweets into two types of hate, i.e., Explicit hateful and Implicit hateful.
Explicit hateful: Any speech or text that displays hate, either through the usage of a particular type of lexical item or lexical feature that is deemed hateful or through certain syntactic structures, is regarded as explicit hate.
Implicit hateful: Any post where hate is subtly communicated. It is a hidden attack on the victim and is frequently disguised as (false) courteous interaction (through the use of conventionalized polite structures).

3) LAYER C: MULTI-LABEL TAGGING OF HATEFUL LANGUAGE
Objective: This layer consists of multi-label tagging of the hateful tweets into four domains, viz., Political, Religion, Racism, and Sexism. The definition of each domain according to the Cambridge Dictionary is given as follows:
Political: The activities of the government, members of law-making organizations, or people who try to influence the way a country is governed.
Religion: The belief in and worship of a god or gods, or any such system of belief and worship.
Racism: Policies, behaviours, rules, etc. that result in a continued unfair advantage to some people and unfair or harmful treatment of others based on race.
Sexism: The belief that the members of one sex are less intelligent, able, skilful, etc. than the members of the other sex, especially that women are less able than men.

4) LAYER D: TARGET ENTITY IDENTIFICATION
The hateful tweets consist of individuals and groups targeted implicitly or explicitly. We tag all relevant, targeted named entities into four types depending on their presence in the post: Explicit Individual (EI), Explicit Group (EG), Implicit Individual (II), and Implicit Group (IG).
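One way to picture the four-layer schema is as a single record per post whose lower layers are only filled in when the post is hateful. The sketch below is an assumed representation for illustration only: the class name `Annotation`, its field names, and the lowercase label strings are hypothetical and do not describe the released data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

DOMAINS = {"political", "religion", "racism", "sexism"}  # layer C labels
TARGET_TYPES = {"EI", "EG", "II", "IG"}                  # layer D tags


@dataclass
class Annotation:
    """Hypothetical container for one annotated post (not the released format)."""

    text: str
    hateful: bool                                           # layer A
    hate_type: Optional[str] = None                         # layer B
    domains: List[str] = field(default_factory=list)        # layer C (multi-label)
    targets: Dict[str, str] = field(default_factory=dict)   # layer D: entity -> tag

    def validate(self) -> None:
        """Raise ValueError if the record breaks the layer hierarchy."""
        if not self.hateful:
            if self.hate_type or self.domains or self.targets:
                raise ValueError("layers B-D apply only to hateful posts")
            return
        if self.hate_type not in {"explicit", "implicit"}:
            raise ValueError("hateful posts need an explicit/implicit label")
        if not set(self.domains) <= DOMAINS:
            raise ValueError("unknown domain label")
        if not set(self.targets.values()) <= TARGET_TYPES:
            raise ValueError("unknown target tag")


# Placeholder strings stand in for real post text and entity mentions.
record = Annotation(
    text="<post text>",
    hateful=True,
    hate_type="implicit",
    domains=["political", "religion"],
    targets={"<entity>": "II"},
)
record.validate()
```

Keeping the hierarchy check in one place makes it easy to reject records in which, for example, a non-hateful post carries a layer B label.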

C. ANNOTATION INSTRUCTIONS
Pilot Annotation: Given the subjective nature of the task, the annotators were provided with multiple examples from different classes to get an idea. In the pilot annotation, all five annotators were given the same set of 200 tweets to annotate by following the annotation schema in Figure 2. The purpose of this round was to check the agreement between the annotators. After the first round of the pilot work, we continued with the final annotation and evaluated the quality of the annotation.
Main Annotation: We chose to move forward with a batch of 500 tweets that included distinct samples for each annotator. To create a trustworthy dataset, the annotation quality was examined after each batch.
This tweet requires the annotator to know that the second underlined phrase is the name of a political figure who is being targeted with a derogatory phrase.
Solution: Two rounds of annotation discussion were held to resolve these issues.

E. INTER-ANNOTATOR AGREEMENT (IAA)
The Fleiss' Kappa score [37] is used to assess inter-annotator agreement for the first three layers. It is a metric for evaluating the degree of agreement between two or more raters, known as inter-rater agreement. A high value suggests that the annotation is reliable: it reflects how clear the annotation guidelines are, how uniformly the annotators understood them, and how reproducible the annotation task is. It is a vital part of both the validation and reproducibility of classification results. For the first, second, and third layers, the IAA is 86%, 76%, and 82%, respectively. The interpretation of Fleiss' Kappa is shown in Table 3.
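For readers unfamiliar with the statistic, the standard Fleiss' Kappa computation can be sketched in a few lines. The function below follows the textbook definition; the rating table is a toy example with five raters and two categories, not data from HHSD.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number of raters")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement computed from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table: 4 posts, 5 raters, columns = (hateful, non-hateful).
toy_ratings = [[5, 0], [0, 5], [4, 1], [5, 0]]
kappa = fleiss_kappa(toy_ratings)
```

A value of 1 corresponds to perfect agreement, and values near 0 correspond to agreement no better than chance.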

F. DATA STATISTICS
In this paper, the newly created HHSD is used to evaluate the performance of deep neural network-based approaches.
The detailed statistics for the dataset are shown in Table 4. Table 5 lists the different types of hate attack that are present in the data.

G. AUXILIARY DATA
The experiment also leverages related task data from high-resource languages such as English and other data from semantically similar languages such as Urdu and Bangla. In total, eleven English (E), eight Hindi (H), three Urdu (U), and one Bangla (B) datasets were used to augment the training data. Table 6 shows information about the existing datasets in Hindi, and in English, Bangla, and Urdu. As the main aim is to improve performance on the Hindi data, we increase the training data in Hindi (Devanagari), Bangla, and Urdu by obtaining translations from English (E) → Devanagari (D), English (E) → Bangla (B), and English (E) → Urdu (U) using Google Translate. The transliterated versions Devanagari (D) → Roman Devanagari (RD) and Bangla (B) → Roman Bangla (RB) are obtained using Indic Trans [45] to increase the number of training samples. The class-wise statistics for all the English, Hindi, Urdu, and Bangla data are shown in Table 8. The eight Hindi datasets are denoted H_1, ..., H_8 and the three Urdu datasets U_1, ..., U_3. The size of the English (E) data is the sum over all eleven datasets, and B_1 is the Bangla data.

H. HUMAN EVALUATION
The quality of the translation and transliteration obtained from Google Translate and IndicTrans is manually measured on a sample of 500 tweets based on fluency and content preservation. Each tweet was given a Likert scale rating from 1 (worst) to 5 (best) for each of the two evaluation criteria. The ratings are averaged to produce the final score.
Fluency: It measures the grammatical correctness and fluency of the output text [46].
Content preservation: It measures the degree to which the content is preserved in the translation and transliteration, including the degree of hatefulness preserved.
Table 9 presents the human score to measure the quality of translation and transliteration.
Challenges in language adaptation: Using data from other languages inevitably introduces some challenges. Table 9 shows that the quality of the transliteration surpasses the quality of the translation. There is an error rate of 1.9, 1.8, and 2.2 in fluency and 1.9, 1.6, and 1.9 in content preservation when translating the English posts to Devanagari, Urdu, and Bangla. However, the error rate in fluency and content preservation drops significantly for transliteration: error rates of 1.2 and 0.7, and 1.1 and 0.8, in fluency and content preservation are observed when transliterating the Bangla script to Roman Bangla and Devanagari to Roman Hindi. As the fluency and content preservation scores obtained are >2.5 in all cases, the auxiliary data is used for augmentation.
101466 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

IV. METHODOLOGY
The experiment is carried out using seven single-task learning (STL) and eight multi-task learning (MTL) frameworks based on deep neural network architectures, as shown in Figure 3. A detailed explanation of all the models is as follows:
CNN: This model adopts the architecture proposed by [47], which has five primary layers: the input layer, the embedding layer, the convolution layer, the pooling layer, and the fully connected layer.
BiLSTM [63]: It is a type of long short-term memory (LSTM) network that uses two LSTMs to capture information from the past and the future. The hidden state at each time step is the concatenation of the forward and backward states for the given time sequence.
Multilingual-BERT: [48] introduced M-BERT, i.e., Multilingual Bidirectional Encoder Representations from Transformers, to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. A classifier can be created by adding just one more output layer to the pre-trained BERT model. It generally learns from two training objectives, described as follows:
1. Masked Language Modeling (MLM): The model randomly masks some of the tokens from the input, and the goal of the model is to fill each mask with an appropriate token. This allows the model to focus on both left and right contexts.
XLM-RoBERTa: It outperforms M-BERT on cross-lingual classification, especially for low-resource languages. To apply sub-word tokenization on the raw input, SentencePiece [50] with a unigram language model [51] is employed. Batches from different languages are sampled using the same sampling distribution as in [52].
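As a rough illustration of the masked language modelling objective described for M-BERT, the sketch below masks a random subset of tokens and records the originals as prediction targets. It is deliberately simplified: the 15% default rate is the value commonly used for BERT rather than a figure from this paper, the `[MASK]` string and function name are illustrative, and BERT's additional random-replacement and keep-unchanged cases are omitted.

```python
import random
from typing import Dict, List, Sequence, Tuple

MASK = "[MASK]"  # illustrative placeholder for the tokenizer's mask token


def mask_tokens(
    tokens: Sequence[str],
    mask_rate: float = 0.15,  # assumed default, not taken from this paper
    seed: int = 0,
) -> Tuple[List[str], Dict[int, str]]:
    """Return the masked sequence and a map from masked position to the
    original token, which is what the model is trained to recover."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[position] = token
            masked[position] = MASK
    return masked, targets


sentence = "yeh ek udaharan vakya hai".split()  # toy romanised tokens
masked_sentence, targets = mask_tokens(sentence, mask_rate=0.5)
```

Because a masked token can sit anywhere in the sequence, recovering it requires context from both its left and its right, which is the property the text attributes to MLM.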
LaBSE [53]: It adapts multilingual BERT to produce language-agnostic sentence embeddings for 109 languages. This model combines masked language modelling (MLM) and translation language modelling (TLM) [52].
The training takes place using two types of data.
1. Monolingual Data: The data is collected from Wikipedia and Common Crawl and then filtered with the heuristics from [54] to remove noisy text. After the pre-processing stage, 17B monolingual sentences were obtained for training.
2. Bilingual Translation Pairs: The translation corpus was obtained from web pages using a bitext mining system similar to the approach of [55].
MuRIL [56]: A multilingual language model that has been specifically created for Indian languages and was trained using text corpora from 16 Indian languages (referred to as "IN"). The training objectives include MLM and TLM, among others. The TLM objective uses pairs of translated and transliterated documents to train the model, whereas the MLM objective uses only monolingual text. The model is trained for 1M steps with a maximum global batch size of 4096 and a maximum sequence length of 512, using the Adam optimizer with a learning rate of 5e-4; it has 236M learned parameters in total.
IndicBERT [57]: It is a multilingual ALBERT model that was developed using extensive corpora covering 12 major Indian languages. It has far fewer parameters than M-BERT and XLM-R, but it manages to give state-of-the-art performance on the classification task. All the languages are trained jointly using a single shared model to exploit the relatedness of the Indian languages. Table 10 lists the source of the training data and the number of trained parameters for the transformer encoders used in the experiments.

A. TRAINING OF MTL
The architecture of the multi-task deep neural network (MT-DNN) model is shown in Figure 4. It adheres to the approach proposed in [58] to solve only classification tasks. An encoder based on BERT represents the shared layers for all T tasks. The shared layers aim to capture common features. The specific classification tasks are implemented by the output layers D_1, ..., D_T. The encoder captures the contextual information for each word in each input example (either a phrase or a group of sentences) made up of n word-pieces by using self-attention to generate a sequence of contextual embeddings. These are (n + 2) vector representations in R^d, i.e., (h_CLS, h_w1, ..., h_wn, h_SEP). Here h_CLS corresponds to the d-dimensional representation of the whole input sequence, while h_w1, ..., h_wn are the d-dimensional embeddings of the individual word-pieces. h_CLS is retained for sentence-based classification and passed as input to the layer D_t to classify the input sentence with respect to the task t = 1, ..., T.
The training procedure of MT-DNN is reported in Algorithm 1. Input examples generally belong to datasets ϵ_1, ..., ϵ_T that are specific to each task and have different sets of labels. MT-DNN requires that each dataset is split into mini-batches B^t_j, each containing valid examples from the same task t. In each epoch, a random mini-batch B^t_j is selected, all the inputs are encoded with the same BERT encoder, and the generated representation h_CLS of the batch is classified by D_t. The task-specific loss L_t is computed and used to update the weights of the entire model via back-propagation. In this way, the output layer D_t is fine-tuned for the t-th task but, most importantly, the BERT encodings are at the same time optimized for all the tasks. M-BERT and MuRIL were the two best single-task learning models, so we chose to employ them as the shared BERT encoder in the multi-task learning framework. We developed four variants of multi-task learning models based on training data in Hindi (H), Bangla (B), Urdu (U), and the combined data from Hindi-Bangla-Urdu (HBU). In this paper, we report only the results obtained on the HHSD data from the MTL paradigm. The entire experiment is completed by assigning separate MTL settings to layer A, layer B, and layer C.
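The batch scheduling described above can be sketched independently of any deep learning library. The code below is only an outline of the control flow, under stated assumptions: the function names are hypothetical, and `update(task, batch)` is a placeholder callback standing in for encoding the batch with the shared encoder, applying D_t, computing L_t, and back-propagating. It is not a reproduction of Algorithm 1.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Batch = Tuple[str, List[object]]


def build_task_batches(
    datasets: Dict[str, Sequence[object]],
    batch_size: int,
    seed: int = 0,
) -> List[Batch]:
    """Pack each task's examples into mini-batches, then shuffle the batches
    across tasks so that every batch holds examples from a single task."""
    rng = random.Random(seed)
    batches: List[Batch] = []
    for task, examples in datasets.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batches.append((task, shuffled[start:start + batch_size]))
    rng.shuffle(batches)
    return batches


def train_one_epoch(
    datasets: Dict[str, Sequence[object]],
    batch_size: int,
    update: Callable[[str, List[object]], None],
) -> None:
    """Visit every mini-batch once in random order. `update` is a hypothetical
    callback that would run the shared encoder and the task-specific head."""
    for task, batch in build_task_batches(datasets, batch_size):
        update(task, batch)


# Toy run: integer ids stand in for examples; the callback only logs the task.
toy_datasets = {"hindi": list(range(5)), "urdu": list(range(100, 103))}
visited: List[str] = []
train_one_epoch(toy_datasets, batch_size=2, update=lambda task, batch: visited.append(task))
```

Interleaving batches from different tasks is what lets the shared encoder receive gradient updates from every task within a single epoch, while each output layer only sees its own task.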

V. EXPERIMENTS
This section presents the experimental setups and the evaluation metrics.

A. EXPERIMENTAL SETUP
All deep learning models were built using Keras [59], a neural network library, with TensorFlow [60] as the backend. We performed 5-fold cross-validation, using 80% of the data in each fold to tune the batch size and the number of training epochs, and testing the optimized model on the 20% held-out data. The network is optimised using the Adam [61] optimizer, with categorical cross-entropy as the loss function. The CNN employs 100 filters, with kernel widths ranging from 1 to 4. There are 100 hidden nodes in the BiLSTM. The bias is initialized to zeros, the ReLU activation function is employed at the intermediate layer, and softmax is used at the last dense layer. Pre-trained FastText word embeddings [20] are used to initialize the non-BERT models. We use a learning rate of 0.001 for the non-transformer models and 2e-5 for the transformer models. The transformer models are loaded from the Hugging Face Transformers library, a Python library providing pre-trained and configurable transformer models useful for various NLP tasks.
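The 80%/20% split that arises in each round of 5-fold cross-validation can be illustrated with a small index generator. This is a generic sketch with a hypothetical function name; the paper does not state whether the folds were shuffled or stratified, so the seeded shuffle here is an assumption.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.
    With k = 5, each round trains on about 80% and holds out about 20%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffled, not stratified
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=100, k=5))
```

Every sample is held out exactly once across the five rounds, so the reported scores are averaged over predictions on the full dataset.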

B. EVALUATION METRICS
The Accuracy and Weighted-F1 scores have been used to report the evaluation results for layer A and layer B. The Exact match and Hamming loss [62] were employed as metrics to assess the effectiveness of multi-label classification for layer C in HHSD.
Hamming loss: The fraction of labels that are incorrectly predicted.
Exact match: The percentage of samples that have all their labels classified correctly.
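The two multi-label metrics defined above are simple to compute from binary label vectors. The sketch below uses toy predictions over the four layer C domains (political, religion, racism, sexism); the values are illustrative and are not results from the paper.

```python
from typing import List, Sequence


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def exact_match(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose full label vector is predicted correctly."""
    hits = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Columns: political, religion, racism, sexism (toy values only).
y_true: List[List[int]] = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred: List[List[int]] = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

loss = hamming_loss(y_true, y_pred)   # 2 wrong labels out of 12
match = exact_match(y_true, y_pred)   # 1 fully correct sample out of 3
```

Lower is better for Hamming loss and higher is better for exact match, which is why the two numbers move in opposite directions when a model improves.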

VI. RESULTS, COMPARISON AND ANALYSIS
We present 5-fold cross-validation results for HHSD, evaluated with state-of-the-art approaches, in Table 11. The results obtained with STL and MTL are discussed separately below.
Single-task learning: Layer A: M-BERT obtained the highest accuracy and weighted-F1 of 85.82% and 85.67%, respectively. It is followed by MuRIL, which obtained 84.50% accuracy and 84.46% weighted-F1.
Layer B: M-BERT obtained the highest accuracy and weighted-F1 of 72.52% and 56.97%, respectively. It is followed by MuRIL with 70.81% accuracy and 56.13% weighted-F1.
Layer C: IndicBERT surpasses the other models with an exact match of 53.52% and a Hamming loss of 14.75%. It is followed by MuRIL with 53.12% and 15.16%, respectively.
Multi-task learning: The multi-task learning set-up leverages multiple data from Hindi, Bangla, and Urdu scripts.The four combinations of MTL are set up for M-BERT and MuRIL.The first three MTL is taking data from three languages one at a time, and the fourth one is taking all the languages.
layer A: The model is performing best when all three languages are simultaneously trained in the MTL fashion.The M-BERT and MuRIL obtained highest accuracy and weighted-f1 of 91.13% and 91.02%.It is interesting to note that M-BERT and MuRIL, when trained only with Hindi data in MTL, obtain a significant score.The reason for this performance is due to a large number of Hindi data available for the training.The inclusion of Bangla is outperforming the results obtained from Urdu.
layer B: M-BERT trained with only Hindi tasks outperformed M-BERT trained with tasks from all the languages by 1.27% and 0.45% in accuracy and weighted-F1; that is, the inclusion of Bangla and Urdu hampered the performance of the model. In the MuRIL set-up, the model performs best with joint training on all the languages, surpassing the Hindi-only model by 0.30% and 0.96% in accuracy and weighted-F1.
layer C: In this layer, MuRIL with all the languages obtained the best results, with an exact match of 62.10% and a Hamming loss of 10.68%, followed by the Hindi, Bangla, and Urdu settings. M-BERT with all the languages obtained a slight improvement over the models using Hindi, Bangla, or Urdu data alone.
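
The multi-task set-up above shares one transformer encoder across tasks. Algorithm 1 is only partially legible in this text, so the following is a sketch under stated assumptions rather than the training code used here: it assumes that, as in MT-DNN-style training, each task's data is divided into mini-batches and the mini-batches of all tasks are then merged and shuffled for every epoch. The task names, dataset sizes, and function names are hypothetical.

```python
import random

def make_batches(examples, batch_size):
    """Split one task's examples into consecutive mini-batches."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

def epoch_schedule(task_data, batch_size, seed=0):
    """Return a shuffled list of (task_name, mini_batch) pairs for one epoch.

    Every example of every task appears in exactly one mini-batch, and each
    mini-batch contains examples from a single task only.
    """
    rng = random.Random(seed)
    schedule = []
    for task, examples in task_data.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        schedule.extend((task, batch)
                        for batch in make_batches(shuffled, batch_size))
    rng.shuffle(schedule)  # interleave tasks so the shared encoder sees all
    return schedule

# Hypothetical tasks: integers stand in for annotated posts.
task_data = {
    "hindi_layer_a": list(range(10)),
    "bangla_hate": list(range(100, 106)),
    "urdu_hate": list(range(200, 204)),
}
for task, batch in epoch_schedule(task_data, batch_size=4):
    # In a real model, the batch would pass through the shared encoder and
    # the head for `task`, and only that task's loss would be back-propagated.
    print(task, batch)
```

With this schedule, a larger dataset contributes proportionally more updates to the shared encoder in each epoch.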
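A. QUANTITATIVE ANALYSIS
Table 12 and Table 13 present the confusion matrices obtained by the best-performing models for layers A and B. The best performers for layer A and layer B on HHSD are M-BERT fine-tuned with all three languages and M-BERT trained with only Hindi data, respectively. The misclassification rate of the best model for hateful posts is 9.35% in layer A, and 44.86% for implicit hateful and 7.54% for explicit hateful in layer B. However, the misclassification rate for non-hateful posts is 8.37%. MuRIL-MTL trained with Hindi, Urdu, and Bangla performs best for layer C, with an exact match of 62.10% and a Hamming loss of 10.68%.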

B. QUALITATIVE ANALYSIS
This section highlights some of the Type 1 errors (False Positives) and Type 2 errors (False Negatives) from HHSD. We show three cases of each in which the model misclassified the post, together with a possible human explanation for the wrong prediction.
Type 1 Error: False Positives (Non-hateful → Hateful)
Transliteration: yeh wah wahan hai jisse kutte ko mara gaya hai.
Translation: This is the same vehicle used to kill the dog.
Human Explainability: The underlined phrases in both sentences refer to an animal. Because the language is structured in a way that suggests an implied attack on people, the model incorrectly predicts hate speech.
Human Explainability: In both of the highlighted bigrams, an abusive token is combined with a non-abusive token. The coronavirus is the target of the attack in the first post, whereas a motivational quotation is the subject of the harsh word in the second tweet. However, the model was unable to capture the post's sentiment.

3) INDIRECT REFERENCE
1) Transliteration: Ek medhak nikal kar bola paani me aa teri udasi utaru saale.
Translation: A frog came out and said, come in the water, I will remove your sadness. U B****rd.

2) Transliteration: ye kaisi kutte jaisi billi hai.
Translation: What kind of dog-like cat is this?
Human Explainability: In the first tweet, a frog verbally assaults a human by speaking in a negative manner and using vulgar language. In the second post, a person appears to be referenced twice through references to animals.
Human Explainability: Users obfuscate the slang term to trick the model and succeed in posting their sentiment. The underlined words in the two posts were used in an insulting manner towards both a person and a group, yet the model did not pick up on the seriousness of the offence.
Human Explainability: Both posts make an implicit reference to attack with the highlighted term. To fully comprehend the true meaning underlying these posts, contextual information is necessary. The covertness present in the posts is not captured by the model.

1) Transliteration: Waah Palturaam waah aakhir sangati kaa asar hai.
Translation: Wah Palturam, wow, after all it is the effect of company.

Human Explainability: The phrase in italics is a name-calling reference to a specific individual. Both pieces metaphorically criticise well-known political figures. This is not captured by the model, leading to incorrect classification.

C. STATISTICAL SIGNIFICANCE TEST
A bootstrap sample test is used to assess whether the difference between two models is statistically significant (at p ≤ 0.05). By selecting three confusion matrices out of the five at a time, the test determines whether the better system on that sample is the same as the better system across the entire data set. The outcome (p-value) of the bootstrap test is the proportion of samples in which the winner differs from the winner on the entire data set. Table 14 displays the results of the statistical significance test conducted on each of the best three pairs of models for both datasets. We measured the score between M-BERT_H (M1), M-BERT_HBU (M2), MuRIL_H (M3), and MuRIL_HBU (M4).
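
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the test code used for Table 14: it assumes that accuracy is the comparison score, that the five confusion matrices are the per-fold matrices of each model, and that all ten three-out-of-five subsets are enumerated rather than randomly resampled. All function names and the toy matrices are hypothetical.

```python
from itertools import combinations

def sum_matrices(cms):
    """Element-wise sum of equally sized confusion matrices."""
    size = len(cms[0])
    return [[sum(cm[i][j] for cm in cms) for j in range(size)]
            for i in range(size)]

def accuracy_from_cm(cm):
    """Accuracy = diagonal total divided by the total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def winner(cms_a, cms_b):
    """Return 'A', 'B', or 'tie' according to pooled accuracy."""
    acc_a = accuracy_from_cm(sum_matrices(cms_a))
    acc_b = accuracy_from_cm(sum_matrices(cms_b))
    if acc_a == acc_b:
        return "tie"
    return "A" if acc_a > acc_b else "B"

def subset_p_value(cms_a, cms_b, k=3):
    """Proportion of k-fold subsets whose winner differs from the overall one."""
    overall = winner(cms_a, cms_b)
    subsets = list(combinations(range(len(cms_a)), k))
    disagreements = sum(
        winner([cms_a[i] for i in idx], [cms_b[i] for i in idx]) != overall
        for idx in subsets
    )
    return disagreements / len(subsets)

# Toy 2x2 per-fold matrices (10 posts per fold) for two hypothetical models.
model_a = [[[5, 0], [1, 4]], [[4, 1], [0, 5]], [[5, 0], [1, 4]],
           [[3, 2], [3, 2]], [[2, 3], [2, 3]]]
model_b = [[[3, 2], [2, 3]], [[3, 2], [2, 3]], [[3, 2], [2, 3]],
           [[4, 1], [1, 4]], [[4, 1], [1, 4]]]
print(subset_p_value(model_a, model_b))
```

Under this reading, a difference would be called significant when the returned proportion is at most 0.05.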

VII. CONCLUSION AND FUTURE WORK
In this study, we developed a benchmark corpus for hate speech identification by crowdsourcing the manual annotation of roughly 15K tweets using a novel four-layer annotation schema. The tweets were crawled using keywords and topics related to politics, religion, racism, and sexism. We carried out an in-depth examination of experiments on the newly created Hindi data, employing deep learning and transformer-based architectures in single-task and multi-task learning frameworks, and obtained promising results in terms of accuracy and weighted-F1. By utilising multiple datasets from the same domain in different languages, the proposed technique offers a long-term approach that eases the adoption of BERT-based models by lowering the requirement for annotated training data. Explainability in AI is very important when dealing with sensitive issues that can negatively impact society. Considerable efforts are being made to ensure that AI-based technology does not suffer from bias introduced by the training data or the training procedure.
As future work, we plan to enrich the dataset with more boosted data since, as we showed, they carry most of the valuable information about inappropriate speech. Since many tweets require contextual information, localised knowledge graphs can be created by collecting intra-user and inter-user tweets to obtain valuable features. The contextual knowledge can then be verified against this knowledge base.
D. ANNOTATION CHALLENGES AND SOLUTIONS
1) Lots of unlabelled data and small teams: Annotation is a time-consuming and laborious process. It is very challenging to have the resources capable of handling high-volume labelling. Solution: A classifier is trained on existing data to obtain a silver label for the unlabelled set, which is then given to the annotators to obtain the gold label.
2) Keeping human bias out of the AI solution: Human bias is one of the issues in reliable annotation that can hamper the efficacy of the classifier. Solution: To mitigate bias, large amounts of training data are collected, and a diverse group of annotators is recruited to ensure the data is as universally applicable as possible.
3) Ambiguity: It is very challenging for the annotators to tag ambiguous tweets. For example: Transliteration: Sanjay Bhaiya, Kutte ke bhaukne par dhyan nahi dete. Translation: Sanjay Bhaiya, no need to pay attention to the barking dog. This tweet carries two meanings: the former attacks a person by comparing them with a dog, whereas the latter refers only to a dog. Solution: A two-round annotation discussion is held to resolve this type of issue.
4) Contextual information: Some tweets require contextual knowledge to be tagged correctly. For example: Transliteration: Luteri dulhan sun kar antonio maino kii yaad aa jati hai. Translation: Hearing the robber bride, one remembers Antonio Maino.

2. Next Sentence Prediction (NSP): It pre-trains text-pair representations to determine whether or not two sentences follow one another.
The multilingual version of BERT can operate with 104 different languages. Every sequence begins with a distinct classification token ([CLS]) as the first token. The aggregate sequence representation for the classification task is the last hidden state corresponding to this token.
XLM-RoBERTa [49]: It is a transformer model trained by sampling streams of text in 100 languages and predicting the masked tokens in the input with the MLM objective. 2.5 TB

FIGURE 4. Multi-task learning (MTL).

Algorithm 1 Training of a MT-DNN Model
1. Load the encoder parameters acquired during pretraining
2. Initialize D_1, D_2, ..., D_T randomly
3. for t in 1, ..., T do // Prepare the data for T tasks
4. Divide the data of the t-th task into mini-batches so ϵ_t = ∪_j B_t

1) Transliteration: Bakchodi corona ke suruaati lakshan dikhne par maine test karaya.
Translation: I went for a test after seeing the initial symptoms of the f***ing corona.
2) Transliteration: Aaram haram hai shikha.
Translation: Relaxing is Ba****d shikha!

TABLE 1. Topics to collect the data.

Table 7 depicts the statistics for

TABLE 5. Variants of hate attack in the HHSD.

TABLE 6. Publicly available Hindi datasets used in the experiment.

TABLE 7. Publicly available English, Bangla, and Urdu datasets used in the experiment.

TABLE 8. Statistics of auxiliary datasets used in the experiment.

Table 12 and Table 13 present the confusion matrix obtained by the best-performing model for layers A and B. The best performers for layer A and layer B for HHSD are M-BERT fine-tuned with all three languages and M-BERT trained with only Hindi data. It can be seen that the misclassification rate in the best-proposed model for hateful is 9.35% in layer A,