HCovBi-Caps: Hate Speech Detection Using Convolutional and Bi-Directional Gated Recurrent Unit With Capsule Network

Adversaries and anti-social elements have exploited the rapid proliferation of computing technology and online social media to create novel security threats, such as fake profiles, hate speech, social bots, and rumors. Hate speech in particular is widespread on online social networks (OSNs). The existing literature offers machine learning approaches for hate speech detection on OSNs. However, the effectiveness of contextual information at different orientations is understudied. This study presents a novel Convolutional, BiGRU, and Capsule network-based deep learning model, HCovBi-Caps, to classify hate speech. The proposed model is evaluated over two Twitter-based benchmark datasets, DS1 (balanced) and DS2 (unbalanced), and achieves its best precision, recall, and f-score of 0.90, 0.80, and 0.84, respectively, over the unbalanced dataset. In terms of training and validation accuracy, the proposed model shows its best performance of 0.93 and 0.90, respectively, over the unbalanced dataset. In comparative evaluation, HCovBi-Caps demonstrates significantly better performance than state-of-the-art approaches. In addition, HCovBi-Caps shows comparatively better performance over the unbalanced dataset. We also investigate the impact of different hyperparameters on the efficacy of HCovBi-Caps to justify the selection of their values. We observed that a higher number of routing iterations adversely affects the model performance, whereas a higher capsule dimension improves it.


I. INTRODUCTION
In the last few decades, advancement in computing technology, especially in OSNs, has changed users' communication behavior. Online social networking platforms, such as Facebook, Twitter, Weibo, and WhatsApp, are popular and part of people's routine life. On these platforms, users discuss current trends and express views, sharing them over the virtual network of family and friends. Users currently use at least one OSN platform and generally more than one. Interactions among the large user base generate massive data, which we can mine to extract valuable insights. In the existing literature, researchers have used OSN datasets in many domains, such as sentiment analysis [1], [2], social bot detection [3], [4], sarcasm detection [5] and its different specializations [6]-[8], humor detection [9], recommendation systems [10], [11], and rumor detection [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Users' discourse on OSNs ranges across diverse themes, such as politics, democracy, economics, fashion, and science & technology, as well as sharing travel and nature experiences. Twitter, a microblogging platform, is one of the widely used OSN platforms and allows users to express views within a limit of 280 characters. Politicians, celebrities, and public figures generally use Twitter to connect with their supporters and followers. These high-profile people use it to announce events and professional updates as tweets, triggering reactions from platform users that contain various emotions and different stances concerning the post. These OSNs have also attracted anti-social and ill-minded people, who generally respond to a tweet with hateful and abusive comments. The adversaries and anti-social elements generally target a specific community, race, gender, or socio-political group. Hateful and abusive content has adverse impacts and occasionally causes depression and anxiety in the targeted individual or community. The existing literature has no universally accepted definition of hate speech, and even OSNs do not have a consensus. In the literature, researchers use hate speech and abusive speech interchangeably [13]. Hate speech is defined as offensive and aggressive content targeted toward specific groups based on ethnicity, religion, gender, sexual orientation, or other characteristics. Twitter defines a tweet as hate speech (HS) "that promotes violence against or directly attack or threatens other people based on race, ethnicity, nationality, sexual orientation, gender, religious affiliation, age, disability, or serious disease." Hate speech on OSN platforms can cause riots, resulting in communal disharmony in real life.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For example, OSNs were recently overwhelmed with hate content and anti-social propaganda related to the Shaheen Bagh protest in Delhi (India) against the National Register of Citizens, Citizenship Amendment Act, and National Population Register. 1 Anti-Asian and sinophobic content became prevalent on OSN platforms during the COVID-19 pandemic, blaming Asian people for the pandemic [14]. The OSN service providers have policies and methods to tackle such content; for example, Twitter occasionally deletes tweets and comments or suspends the violating users. However, no usable and well-accepted solution exists to date.
Researchers from academia and industry are continuously introducing novel methods, using techniques ranging from statistical analysis to pattern mining and deep learning, to tackle the growing problem of hate content on social media platforms [15]-[17]. Among the existing categories of approaches, machine learning methods are more effective than the others [18]. However, the current models still fall short of a satisfactory level of accuracy across different datasets. To this end, this study presents HCovBi-Caps, a novel deep learning model incorporating contextual information at different orientations.

A. OUR CONTRIBUTIONS
The role of context is vital for detecting HS in online content. Extracting the relevant hate speech-related information whose context appears in different orientations is a challenging and notable research problem, especially in short texts, such as tweets. In this direction, this paper presents a capsule-based deep neural network model to detect HS on Twitter. The proposed model treats hate speech detection as a two-class (binary classification) problem, wherein the trained model classifies a tweet as either HS or non-hate speech (NHS). The proposed model, HCovBi-Caps, is a novel end-to-end deep learning model for HS detection. It integrates input, embedding, convolutional, bi-directional gated recurrent unit (BiGRU), capsule network, dense, and output layers. HCovBi-Caps extracts HS-related contextual information from the input text, considering different orientations and ordering of words. HCovBi-Caps first converts the input text into a numeric vector and then converts it into an embedding vector at the embedding layer. The model passes the embedding layer output to the convolutional layer to extract the low-level syntactic and semantic features. The BiGRU further retrieves the latent semantic features from sequences having contextual information. BiGRU is more effective than a simple GRU because it considers the contextual information in both the forward and backward directions. As a result, BiGRU extracts the preceding and succeeding contextual information-related sequences from the features generated by the convolutional layer. The capsule network further retrieves the contextual information of different orientations by maintaining the ordering of words in the input text. The capsule network considers the part-whole spatial/local relationship of HS-related words. This layer covers the hate-related context by considering various orientations of the input text.
The dynamic routing algorithm used in the capsule network increases the weight values of the HS-related latent contextual information. HCovBi-Caps passes the output vector from the capsule network to the dense layer that is further given to a sigmoid function to classify the input text as either HS or NHS.
The problem of hate speech is non-trivial and prevalent in OSNs. Hateful content on social media may lead to large-scale violence and riots in real life. To address the hate speech problem, researchers have presented various approaches to detect it. OSN service providers are also introducing built-in solutions and formulating policies to tackle this menace. HCovBi-Caps is a small contribution to the collaborative effort going on around the world to eradicate hateful and anti-social content from OSNs. In this direction, HCovBi-Caps detects hate content written with different contextual orientations.
Overall, the main contributions of this paper can be summarized as follows.
• Introduce a novel deep neural network model, HCovBi-Caps, by integrating the BiGRU, Convolutional layer and Capsule network to incorporate the contextual information at different orientations for hate speech detection.
• Perform the comparative evaluation of HCovBi-Caps over two benchmark datasets to establish its efficacy.
• Investigate the impact of different hyper-parameter values on the efficacy of HCovBi-Caps to identify the best values.
For reproducibility, we release our implementation code in a GitHub 2 repository. The remainder of this paper is organized as follows. Section II presents a brief overview of existing approaches to HS detection. Section III provides a detailed description of the proposed HCovBi-Caps model, including its functional details. Section IV presents the experimental setup and evaluation results. This section further establishes the efficacy of the proposed model through a comparative evaluation with two state-of-the-art and six baseline methods. Section V investigates and discusses the effect of various hyperparameters on the efficacy of the proposed HCovBi-Caps model. Finally, Section VII concludes the paper with future research directions.

II. RELATED WORKS
This section presents a synopsis of existing literature on computational detection of hate content over OSN platforms. Researchers have studied hate speech and related research directions, such as rumor detection [12] and credibility analysis [19], and presented approaches to track and curb this nuisance. These include statistical analysis, pattern mining, and machine learning-based methods, among which machine learning methods are prevalent and effective. This study classifies the existing approaches into two categories: (i) machine learning-based methods and (ii) deep learning-based methods. Finally, the section ends with a highlight on current status and limitations.

A. MACHINE LEARNING-BASED METHODS
Warner and Hirschberg [20] used unigram, part-of-speech, and other template-based features in one of the early approaches to tackle the hate speech problem. The authors trained an SVMlight model with a linear kernel and evaluated it over two datasets from Yahoo and the American Jewish Congress websites to separate hate from non-hate content. In another approach, Kwok and Wang [21] used unigram features to train a Naive Bayes classifier that segregates racist tweets from ordinary ones with an accuracy of 76%. They experimentally concluded that bigram, trigram, and sentiment features improve model performance. In another n-gram-based approach, Burnap and Williams [22] employed various n-gram features and trained three machine learning models: Bayesian logistic regression, SVM, and voted ensemble classifiers. They evaluated the trained models over a crawled Twitter dataset and reported that the voted ensemble classifier shows the best performance. Djuric et al. [23] utilized the paragraph2vec [24] language model for the joint modeling of comments and words collected from the Yahoo Finance website. They then used the learned dense vector representation to train a logistic regression model to classify hate comments.
In a popular approach, Waseem and Hovy [25] open-sourced a benchmark dataset of 16k tweets containing hate speech. The authors used 1- to 4-gram features to train a logistic regression classifier to segregate hate tweets from ordinary tweets. The best model achieves an F1-score of 73.89. They also used location and gender features and reported that gender combined with n-grams gives the best performance. The various categories of hateful content, such as hate, offensive, and abusive, have subtle differences; however, these differences are understudied. Davidson et al. [18] investigated the difference between hate, abusive, spam, and genuine content and used unigram, bigram, POS tag-based n-grams, Flesch-Kincaid Grade Level, Flesch Reading Ease scores, sentiment score, and various linguistic features to train a logistic regression classifier to segregate them. Malmasi and Zampieri [26] presented a similar approach using character and word n-gram features to train a linear SVM classifier to classify hate, offensive, and ordinary content. They experimented with various feature combinations and reported that character 4-grams show the best performance.

B. DEEP LEARNING-BASED METHODS
Classical machine learning shows good performance but requires feature engineering, a manual, time-consuming, and tedious task. As a result, these approaches depend on human judgment and therefore include human bias. Recently, researchers started exploiting the advancements in deep neural networks and presented various deep learning models for HS detection to avoid these limitations [15], [27], [28]. Badjatiya et al. [27] introduced one of the first deep learning-based approaches for hate speech detection. They evaluated various neural network architectures (CNN, LSTM, and DNN) for hate content detection by employing different word representation techniques, such as GloVe, word2vec, and FastText. Park and Fung [28] presented a hybrid model integrating logistic regression and a CNN architecture to segregate abusive tweets from genuine tweets. On investigation, the authors found that the hybrid model performs better than the isolated machine learning and deep learning models. Zhang et al. [15] integrated a convolutional neural network and a gated recurrent network into a deep learning model to classify hate content and reported the best performance with an F1-score of 0.94. The existing deep learning models used word embedding representations [15], [16], [27], [29], which are static and do not include contextual information. Cao et al. [30] incorporated sentiment and topic-based contextual information using word, sentiment, and topic-based representations and presented a hybrid deep learning model constructed using CNN, LSTM, and an attention mechanism to classify hate content. Roy et al. [31] introduced a deep convolutional neural network-based framework for hate content detection and reported the best performance with an accuracy of 92%. Researchers recently targeted hateful and abusive content written in code-mixed languages, such as Hinglish.
In this research direction, Kamble and Joshi [32] compared three classical deep learning models (CNN, LSTM, and BiLSTM) using a domain-specific word embedding to separate code-mixed hate tweets from regular content. The authors found that the trained domain-specific code-mixed embedding provided better performance than pre-trained word embedding. Finding labeled hateful content in various languages is tedious. Researchers recently started presenting approaches to detect hate speech in low-resource languages. Pamungkas et al. [33] presented a zero-shot learning approach for cross-lingual hate speech detection. Hate-reflecting words in textual content are used in different contexts depending on the situation. For example, certain words may be used in a derogatory manner in one context to target a race of people but could be slang in another context. To incorporate the contextual information of hate-inciting words, researchers exploited contextual embeddings trained using transformer-based language models, such as BERT, for HS detection [34], [35]. Researchers have also studied various other aspects of hate speech, ranging from the identification of hate-targeted vulnerable communities [36] to the generalization of different categories of HS detection models across datasets [37].

C. CURRENT STATUS AND LIMITATIONS
The review of existing literature concludes the persistence of the hate speech problem and the need for globally accepted automatic hate content detection approaches. This problem is also evolving with different complexities. In the existing categories of methods, deep learning models are most effective for HS detection. Researchers also acknowledged that contextual information is crucial for HS detection, but extracting contextual information through different orientations of an instance is difficult. The concept of the capsule network presented by Hinton et al. [38] is prolific because it captures contextual information at different orientations. The existing literature has many approaches using capsule networks for NLP applications, such as text classification [39], clickbait detection [40], and sentiment classification [41]. Authors in [17], [42] have used capsule network for HS detection, but its utilization in this research direction is understudied. Integrating the capsule network with BiGRU and convolutional layer to retrieve the contextual information in different orientations for hate content detection is a fascinating, nontrivial, and significant research problem.

III. PROPOSED APPROACH
This section presents the proposed HCovBi-Caps model for hate content detection. It includes a detailed description of data crawling, pre-processing, and the proposed HCovBi-Caps model in the following subsections.

A. DATA CRAWLING
We perform the empirical evaluation of the HCovBi-Caps model over two Twitter-based benchmark datasets. To this end, a data crawler is developed in Python to retrieve the tweets via the REST API. We use Tweepy, 3 a Python library, to crawl the tweets and save them on a local machine for use in the later stages.

B. DATA PRE-PROCESSING
The HCovBi-Caps model first pre-processes the crawled tweets to filter out useless information. Algorithm 1 presents the pre-processing steps. In the pre-processing, HCovBi-Caps performs the following steps:
• Filtering of Twitter-related markers and symbols, such as hashtags, URLs, mentions, and retweets.
• Filtering of real numbers, stop words, ampersands, redundant white spaces, dots, single and double quotes, non-ASCII characters, commas, emoticons, exclamation marks, interjection, and punctuation marks.
• Removal of all redundant tweets.
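The filtering steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Algorithm 1: the regular expressions, the tiny `STOP_WORDS` set, and the function names `clean_tweet` and `preprocess` are assumptions, hashtags are removed together with their text, and interjection filtering is not covered.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

def clean_tweet(text):
    """Apply the filtering steps to a single tweet and return the cleaned text."""
    text = re.sub(r"^RT\s+", "", text)               # retweet marker
    text = re.sub(r"https?://\S+", " ", text)        # URLs
    text = re.sub(r"[@#]\w+", " ", text)             # mentions and hashtags (assumed: drop the whole tag)
    text = text.replace("&amp;", " ")                # HTML-escaped ampersand
    text = text.encode("ascii", "ignore").decode()   # non-ASCII characters, including emoji
    text = re.sub(r"\d+(\.\d+)?", " ", text)         # integers and real numbers
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # punctuation, quotes, emoticon symbols
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)                          # split/join also collapses extra white space

def preprocess(tweets):
    """Clean every tweet and drop empty results and duplicates, preserving order."""
    seen, cleaned = set(), []
    for tweet in tweets:
        c = clean_tweet(tweet)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned
```

Duplicates are detected after cleaning, so two tweets that differ only in case, punctuation, or URLs count as redundant; whether the original pipeline compares raw or cleaned tweets is not stated.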

C. PROPOSED HCovBi-Caps MODEL
Figure 1 shows the architecture of the proposed HCovBi-Caps model. This section presents a detailed description of various HCovBi-Caps components. The model comprises input, embedding, and convolutional layers integrated with BiGRU and capsule network followed by dense and output layers. Algorithm 2 presents an algorithmic representation of the proposed model. The following subsections discuss the functionality of these layers.

1) INPUT LAYER
The HCovBi-Caps model passes the pre-processed tweet to the input layer, which tokenizes it and maps each token to a unique number using its dictionary-based index value. Thus, the input layer maps the input text into a numeric vector. Mathematically, the input layer represents the input text as follows: if an input text (tweet) $T$ comprises $n$ tokens, then each token is replaced with its dictionary index such that $T \in \mathbb{R}^{1 \times n}$. Further, padding is applied to the input text to maintain a fixed length $p$ of the input vector. Finally, all the input texts are transformed into an input matrix $T \in \mathbb{R}^{1 \times p}$.
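As a minimal illustration of this mapping, the sketch below builds a dictionary index and pads each tweet to a fixed length `p`. The helper names `build_vocab` and `encode`, the reserved padding index 0, and padding on the right are assumptions made for the example; the paper does not state these details.

```python
def build_vocab(texts):
    """Map each distinct token to a unique positive index; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(text, vocab, p):
    """Replace tokens by their indices, then truncate or right-pad with 0 to length p."""
    ids = [vocab[t] for t in text.split() if t in vocab]   # unknown tokens are skipped
    ids = ids[:p]
    return ids + [0] * (p - len(ids))

texts = ["you are awful", "you are kind people"]
vocab = build_vocab(texts)
matrix = [encode(t, vocab, p=5) for t in texts]
print(matrix)  # [[1, 2, 3, 0, 0], [1, 2, 4, 5, 0]]
```

Each row of `matrix` is one padded tweet of length `p`, which is the form the embedding layer consumes.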

2) EMBEDDING LAYER
The embedding layer learns word representations as low-dimensional dense vectors by training over massive real-world corpora. In this representation, semantically related words have similar vectors. Embedding layers have diverse applications, from text classification and machine translation to recommender systems. In this paper, we use pre-trained GloVe 4 embedding, a well-known word representation technique based on word distribution statistics. GloVe uses an unsupervised approach to train vector representations of words from a co-occurrence matrix of word pairs. In this study, we use Twitter-based GloVe embedding because the datasets used are Twitter-based. Each token of the padded fixed-length input text is converted into a corresponding GloVe embedding of dimension $d$, transforming the input vector-matrix to $T \in \mathbb{R}^{p \times d}$.

3) CONVOLUTIONAL LAYER
HCovBi-Caps applies the convolutional layer over the embedding vector to extract the spatial features. The proposed model uses the one-dimensional convolutional operation because the input embedding vector is a row vector. The convolutional layer uses 128 filters of three different filter sizes to extract the hate-related spatial and temporal features comprising 128 sequences. Equation 1 represents the $n$-th feature sequence, $f_n$, generated from word window $x_t$, where $w_t$, $b$, and $f(\cdot)$ represent the filter weight, bias, and ReLU (a popular nonlinear activation), respectively. The 128 filters execute the convolutional operation from top to bottom, extracting the feature sequence $f_s = [f_1, f_2, \ldots, f_{128}]$ from the input text. The max-pooling operation is applied to obtain the underlying feature map. Finally, HCovBi-Caps concatenates the output from each filter to form the final feature vector, which is the input to the next layer.
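To make the operation concrete, the following sketch applies a single one-dimensional filter with ReLU and then max-pooling to a toy embedding matrix. It assumes the common form $f = \mathrm{ReLU}(w \cdot x + b)$ for one word window; the toy values, the filter size of 2, and the function names are illustrative only and do not reproduce the model's 128 filters.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def conv1d_feature(embeddings, weights, bias):
    """Slide one filter over the token axis and return ReLU(w . window + b) per position."""
    k = len(weights)                                  # filter size: tokens per window
    features = []
    for t in range(len(embeddings) - k + 1):
        window = embeddings[t:t + k]
        total = bias
        for w_row, x_row in zip(weights, window):
            total += sum(w * x for w, x in zip(w_row, x_row))
        features.append(relu(total))
    return features

def max_pool(features):
    """Global max-pooling over the positions of one feature sequence."""
    return max(features)

# Toy input: p = 4 tokens with embedding dimension d = 2 (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
weights = [[1.0, 1.0], [1.0, -1.0]]                   # one filter of size k = 2
features = conv1d_feature(embeddings, weights, bias=0.0)
pooled = max_pool(features)
print(features, pooled)  # [0.0, 1.0, 2.0] 2.0
```

With many filters, each filter yields one such feature sequence, and the per-filter outputs are concatenated as described above.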

4) BiGRU LAYER
BiGRU, a type of RNN, is generally used in sequential modeling problems to extract sequences in the forward and backward directions. In BiGRU, a forward GRU ($\overrightarrow{\mathrm{GRU}}$) and a backward GRU ($\overleftarrow{\mathrm{GRU}}$) are integrated to retrieve the succeeding (i.e., $f_1$ to $f_{128}$) and preceding (i.e., $f_{128}$ to $f_1$) feature sequences, respectively. The proposed HCovBi-Caps model retrieves the forward and backward sequences having contextual information by applying the BiGRU layer on the convolutional layer output. Equations 2 and 3 present the outcomes from BiGRU in the forward and backward directions, respectively. The BiGRU output is the hate-incorporating representation of the input text, integrating the contexts from the forward and backward directions. The BiGRU-based representation for a given feature sequence $f_s$ of the input text is the concatenation of the forward, $\overrightarrow{h_f}$, and backward, $\overleftarrow{h_b}$, hidden states. The two hidden states integrate the information collected around $L_{f_s}$ to retrieve the hate-incorporating contextual information-based sequences. Finally, Equation 4 represents the concatenated contextual information-incorporating sequence as a final hidden state $h_t$, which is passed to the capsule network layer.
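Written with the symbols defined in this subsection, a bidirectional GRU of this kind is commonly expressed as follows. This is a sketch of the usual formulation; the index ranges and the concatenation notation are assumptions and may differ in detail from Equations 2 to 4.

```latex
\begin{align}
\overrightarrow{h_f} &= \overrightarrow{\mathrm{GRU}}(f_s), \quad s = 1, \dots, 128 \\
\overleftarrow{h_b} &= \overleftarrow{\mathrm{GRU}}(f_s), \quad s = 128, \dots, 1 \\
h_t &= \left[\, \overrightarrow{h_f} \,;\, \overleftarrow{h_b} \,\right]
\end{align}
```

The first two lines read the feature sequence in opposite orders, and the third concatenates the two hidden states into the representation passed on to the capsule network layer.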

5) CAPSULE NETWORK LAYER
Traditional CNN cannot extract salient features. It also loses crucial information because of the pooling technique: Max, Min, or Average pooling discards part of the information generated by the activation functions. Hinton et al. [43] introduced the capsule network, a novel neural network architecture, to extract syntactically enriched features considering different orientations and local ordering of words from the input data [44]. Recently, it has shown remarkable performance in text classification and information retrieval problems. Unlike CNN, it can identify the part-whole spatial relationship within features in textual data and, therefore, effectively identifies the semantic representation and hidden contextual information of the input text [39]-[41]. In a capsule network, a capsule layer contains multiple capsules, and each capsule is made of neurons that extract semantic and syntactic information. The network uses the modulus (length) of a capsule's output vector to represent the classification probability and the vector's direction to describe different orientations of the text. Hence, the capsule network representation is more efficient and enriched than that of traditional neural network models, such as CNN. The capsule network generates a vector rather than a scalar value as obtained in the pooling layer of CNN. Furthermore, the dynamic routing algorithm [45], a principal component of the capsule network, adjusts the weight of latent features, facilitating the extraction of additional features. As a result, the capsule network improves the classification performance of the underlying model.
The HCovBi-Caps model uses the capsule network because of the advantages discussed above. To this end, the final hidden state $h_t$, representing the output of the BiGRU layer, is passed to the capsule network layer. The resultant output of the capsule network is obtained using Equations 5, 6, 7, 8, and 9. In Equation 5, the final hidden state $h_t$ of BiGRU is first converted into a feature capsule $u_i$ using a non-linear activation function. Furthermore, $u_i$ determines the correlation between the input and output layers and generates the prediction vector $\hat{u}_{j|i}$, where $W_{ij}$ represents the weight matrix. The dynamic routing process is applied to calculate the coupling coefficients $c_{ij}$. This process ignores trivial words that are irrelevant to hate speech in the input text. The weight of HS-related features is proportional to the coupling coefficient $c_{ij}$: a feature with a high $c_{ij}$ value has a higher weight, and vice versa. This correlation helps in encoding vital HS-related contextual representations of the input text considering different orientations. The output of a capsule, $s_j$, is calculated as the summation of all the prediction vectors using Equation 6.
The coupling coefficient $c_{ij}$ is calculated by the softmax function using Equation 7.
Equation 8 updates $b_{ij}$ iteratively until the iteration requirements are met. It represents the higher layers of the capsule network.
Equation 9 normalizes the final output vector $v_j$ via the squash function (a nonlinear activation function), which includes different orientations and local ordering of words/tokens of the input text.
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \tag{9}$$
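The routing procedure described in this subsection can be sketched in plain Python. The sketch follows the standard dynamic routing formulation, with prediction vectors $\hat{u}_{j|i}$ given as input, $c_{ij}$ obtained by a softmax over $b_{ij}$, $s_j = \sum_i c_{ij}\hat{u}_{j|i}$, the squash function of Equation 9, and the update $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$. It is assumed, not confirmed, that Equations 5 to 8 take exactly these forms; the function names and the toy input are illustrative.

```python
import math

def squash(s):
    """Scale s to length ||s||^2 / (1 + ||s||^2) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]                       # avoid division by zero
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of input capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # routing logits b_ij
    for _ in range(iterations):
        c = [softmax(row) for row in b]               # coupling coefficients c_ij
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s_j))                     # output capsule v_j
        for i in range(n_in):
            for j in range(n_out):
                # agreement between a prediction and the output raises its logit
                b[i][j] += sum(a * o for a, o in zip(u_hat[i][j], v[j]))
    return v, c

# Toy example: both inputs agree on output capsule 0 and disagree on capsule 1.
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, -1.0]]]
v, c = dynamic_routing(u_hat, iterations=3)
```

In the toy example the coupling coefficients shift toward output capsule 0, where the predictions agree, which is the behavior the text describes as increasing the weight of relevant features.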

6) DENSE AND OUTPUT LAYERS
The output vector v j generated from the capsule network passes to the fully connected layer. Finally, the output layer classifies the hate speech by applying the sigmoid function on the dense layer output. The proposed HCovBi-Caps model uses the binary cross-entropy loss function.
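For reference, the sigmoid output and the binary cross-entropy loss mentioned above can be written as the short sketch below. The 0.5 decision threshold, the clipping constant `eps`, and the label convention (1 for HS, 0 for NHS) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def classify(z, threshold=0.5):
    """Return 1 (HS) if the sigmoid of the dense-layer output z reaches the threshold, else 0 (NHS)."""
    return 1 if sigmoid(z) >= threshold else 0
```

A dense-layer output of 0 corresponds to a probability of 0.5, so positive outputs are labeled HS and negative outputs NHS under this threshold.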

Algorithm 1 DataPreparation_Algo
Input: T ← set of tweets, L ← labels, l_m ← maximum number of words in a tweet
Output: E ← embedding matrix
for t in T do

IV. EXPERIMENTAL SETUP AND RESULTS
This section presents the experimental details of the proposed HCovBi-Caps model. It includes the description of the datasets, experimental settings, hyperparameter settings, evaluation metrics, evaluation results, and comparative analysis.

A. DATASETS
This study uses two Twitter-based datasets for the empirical evaluation of the proposed model. One of the datasets is relatively balanced and the other is unbalanced, so that HCovBi-Caps is evaluated on both types of datasets. Table 1 presents brief statistics of the datasets. A brief description of the datasets is provided in the following paragraphs.
• Founta et al. [46] (DS1): This dataset contains 80,000 tweets, each labeled as one of four classes: abusive, hateful, normal, or spam. We consider only two labels, hateful and normal, because the proposed approach is a two-class problem. This dataset is relatively balanced, containing 2615 hate and 5385 non-hate tweets, as shown in the first row of Table 1.

B. EXPERIMENTAL SETTINGS
The HCovBi-Caps model is implemented in Python. We performed all the experimental evaluations on a Windows 10 (64-bit) machine with an Intel i-3 6006 processor and 8 GB RAM. We use Tweepy, 5 a Python library, to crawl the tweets from Twitter. Finally, Keras, 6 a popular neural network library, is used to implement the proposed model.

E. EVALUATION RESULTS AND COMPARATIVE ANALYSIS
This section presents the evaluation results of the HCovBi-Caps model for HS detection over the two datasets. We also perform a comparative evaluation of the proposed model with two deep learning-based state-of-the-art and six baseline methods. Table 3 shows that the proposed model demonstrates better performance over both datasets, DS1 (balanced) and DS2 (unbalanced), considering precision, recall, and f-score. The analysis of Table 4 also reveals the improved performance of the proposed model considering training and validation accuracy. We also evaluate the performance of the proposed model using the receiver operating characteristics (ROC) curve, which plots the true positive rate against the false positive rate and thus shows the performance of a classification model at various thresholds. Figure 2 presents the ROC curve of the proposed model for both datasets, DS1 and DS2. We can observe from the figure that the proposed model shows good results over both datasets. Interestingly, the proposed model shows better results on the DS2 (unbalanced) dataset than on the DS1 (balanced) dataset.
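The metrics used in this section can be computed as in the sketch below, which treats HS as the positive class (label 1). It is a generic illustration with hypothetical function names, not the evaluation script used for Tables 3 and 4.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and f-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_points(y_true, scores, thresholds):
    """Return (false positive rate, true positive rate) for each threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for th in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= th)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= th)
        points.append((fp / neg, tp / pos))
    return points
```

Sweeping the threshold from 1 down to 0 traces the ROC curve from (0, 0) to (1, 1); a curve closer to the top-left corner indicates a better classifier.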

1) COMPARISON WITH STATE-OF-THE-ART METHODS
We perform the comparative evaluation of the HCovBi-Caps model with two state-of-the-art deep learning methods for hate speech detection. To this end, we implement the comparison methods from scratch following the instructions given in the respective papers.
• Ding et al. [42]: The authors use a stack of BiGRU and capsule network layers to detect HS in tweets.
• Roy et al. [31]: The authors use a deep CNN to classify tweets as HS or NHS.
Tables 3 and 4 also show that Ding et al. [42], which employs the capsule network layer, performs better than Roy et al. [31], which uses only a deep CNN. Therefore, the proposed model integrates the capsule network and CNN with BiGRU to gain the advantages of the various deep learning components. We can infer from both tables that the proposed model remarkably outperforms the two state-of-the-art methods. On analysis, we observed that the inclusion of convolutional layers in the proposed model significantly enhances the performance because it helps in encoding spatial and temporal features. The proposed model outperforms Ding et al. [42] by 16, 13, and 15 points considering precision, recall, and f-score, respectively, over the DS1 (balanced) dataset. HCovBi-Caps also outperforms Ding et al. [42] by 25, 19, and 22 points considering precision, recall, and f-score, respectively, over the DS2 (unbalanced) dataset. Additionally, we can observe from the fifth row of Table 3 that the HCovBi-Caps model outperforms Roy et al. [31]. Similarly, rows 3-5 of Table 4 present the performance evaluation results of the HCovBi-Caps model in comparison to the state-of-the-art approaches considering training and validation accuracy over the two datasets, DS1 and DS2. We can observe from the table that the proposed model performs significantly better than the state-of-the-art approaches.

2) COMPARISON WITH BASELINE METHODS
HCovBi-Caps is compared with six baseline methods in this paper. The hyperparameter values of these baselines are given below:
• CNN: It is used in this paper as one of the baseline methods, in which the filter width and the number of filters are 3 and 128, respectively.
• LSTM: It is used in this paper as one of the baseline methods, in which 128 neurons are utilized.
• GRU: It is used as one of the baseline methods, in which 128 neurons are considered.
• BiLSTM: It is a special type of RNN whose functionality lies in both forward and backward directions. In this paper, it is used as one of the baseline methods, in which 128 neurons are used.
• BiGRU: Like BiLSTM, it is also a special type of RNN which is functional in both directions. In this paper, it is used as one of the baseline methods, in which 128 neurons are used.
• DNN: It contains many hidden nodes, and it uses the input data and weights of these nodes. In this paper, it is used as one of the baseline methods, in which two dense layers having 128 neurons in each layer are used. Tables 3 and 4 show that the bi-directional RNN models, such BiLSTM and BiGRU, perform effectively across all the baseline methods. Such a performance is due to its efficient retrieval of latent sequential features from forward and backward directions. The proposed model outperforms significantly better than all the six baseline methods. For example, the proposed HCovBi-Caps model outperforms CNN baseline by 12, 29, and 23 points considering precision, recall, and f-score, respectively over the DS1(balanced) dataset. Similarly, additional comparative results with other baselines can be observed from Tables 3 and 4. Among the baselines, BiGRU shows the best performance over both the datasets DS1 and DS2, considering precision, recall, and f-score, whereas, in terms of training and validation accuracy, BiLSTM shows the best performance over the datasets DS1 and DS2.
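The baseline settings listed above can be collected in a small configuration table. The following is a minimal sketch, not the code used in the experiments: the class name `BaselineConfig` and its field names are hypothetical, and only the numeric values (128 units, filter width 3, two dense layers) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaselineConfig:
    """Hypothetical container for one baseline's hyperparameters."""
    name: str
    units: int                          # neurons per layer (filters for the CNN)
    num_layers: int = 1                 # the DNN baseline stacks two dense layers
    filter_width: Optional[int] = None  # only meaningful for the CNN baseline
    bidirectional: bool = False         # True for BiLSTM and BiGRU


# Values taken from the baseline descriptions; everything else is illustrative.
BASELINES = [
    BaselineConfig("CNN", units=128, filter_width=3),
    BaselineConfig("LSTM", units=128),
    BaselineConfig("GRU", units=128),
    BaselineConfig("BiLSTM", units=128, bidirectional=True),
    BaselineConfig("BiGRU", units=128, bidirectional=True),
    BaselineConfig("DNN", units=128, num_layers=2),
]

for cfg in BASELINES:
    print(cfg)
```

Keeping the six settings in one place makes it easy to check that every baseline uses the same capacity (128 units), so that differences in Tables 3 and 4 reflect the layer type and not the layer size.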

V. DISCUSSION
We found in the experimental evaluation that the HCovBi-Caps model significantly outperforms the state-of-the-art and baseline methods over both the DS1 and DS2 datasets. In the proposed model, we use BiGRU because it shows the best performance among the baselines. Interestingly, both the proposed and the comparison approaches perform relatively better on the unbalanced dataset. Furthermore, the performance difference between the proposed and comparison approaches is also lower over the unbalanced dataset. In this section, we present the effects of different neural network hyperparameters on the performance of the HCovBi-Caps model. We perform this empirical investigation to find the best-performing value of each hyperparameter, so that the final evaluation uses the best configuration found.

A. EFFECT OF NEURAL NETWORK HYPERPARAMETERS
The selection of hyperparameter values is crucial for any deep learning-based model because it affects the model performance. This section experimentally analyzes the impact of four hyperparameters (activation function, CNN filter size, number of BiGRU hidden units, and optimization algorithm) on the performance of the proposed HCovBi-Caps model over the DS1 and DS2 datasets in terms of f-score and accuracy.

1) ACTIVATION FUNCTIONS
An activation function is crucial because it determines whether, and how strongly, the neurons of a deep learning-based model are activated. We perform the experiments using the sigmoid and softmax activation functions to analyze their impact on the classification performance of the HCovBi-Caps model over the DS1 and DS2 datasets in terms of f-score and accuracy. The evaluation results are shown in Figure 3. We can observe from the figure that sigmoid performs significantly better than softmax over both datasets. One of the key reasons for this is that sigmoid is well suited to binary classification problems. Thus, the proposed model uses the sigmoid activation function.
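To make the two activation functions concrete, the following sketch computes both for a single hypothetical logit. It assumes the activation is applied at the output layer, which the text does not state explicitly; the logit value 1.2 and the function names are illustrative only.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Map one logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    """Normalize a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logit = 1.2  # hypothetical score produced by the last layer for the HS class

# One sigmoid output unit: the probability of HS is read off directly.
p_hs_sigmoid = sigmoid(logit)

# Two softmax output units (NHS, HS): one probability per class.
p_nhs, p_hs_softmax = softmax([0.0, logit])

# Softmax over a single output unit carries no information: it is always 1.
p_single = softmax([logit])[0]

print(p_hs_sigmoid, p_hs_softmax, p_single)
```

With two classes, a softmax over the logits (0, z) assigns the second class the same probability as sigmoid(z), so the two functions are mathematically interchangeable in that layout. A softmax over a single unit, in contrast, is constant. Which output layout was used in the experiments of Figure 3 is not specified here, so this observation should be read only as one possible source of the gap.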

2) CNN FILTER SIZE
The CNN filter size is vital in retrieving important features from CNN layers. Thus, the filter size is an important factor in the classification performance of a deep learning-based model. In this paper, we performed the experimental evaluation using three CNN filter sizes to observe their impact on the performance of the proposed HCovBi-Caps model. Figure 4 presents the evaluation results using different CNN filter sizes (2, 3, and 4) over the DS1 and DS2 datasets in terms of f-score and accuracy. We can observe from the figure that the model performance goes down as the CNN filter size increases. Furthermore, the model shows the best performance with filter size 3 over both the DS1 and DS2 datasets.
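The role of the filter size can be illustrated by listing the token windows that a one-dimensional filter covers. The sketch below assumes stride 1 and no padding, neither of which is stated in the text, and the example sentence is hypothetical.

```python
from typing import List


def sliding_windows(tokens: List[str], filter_size: int) -> List[List[str]]:
    """Return the token windows seen by a 1D filter (stride 1, no padding)."""
    last_start = len(tokens) - filter_size
    return [tokens[i:i + filter_size] for i in range(last_start + 1)]


# Hypothetical tokenized tweet, used only to show the window shapes.
tokens = ["this", "is", "a", "short", "example", "tweet"]

for filter_size in (2, 3, 4):
    windows = sliding_windows(tokens, filter_size)
    # A sequence of n tokens yields n - filter_size + 1 windows.
    print(filter_size, len(windows), windows[0])
```

A filter of size k therefore acts on k consecutive tokens at a time: larger filters capture longer phrases but produce fewer feature positions per tweet, which matters for short texts.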

3) BiGRU HIDDEN UNITS
The number of hidden units is another important hyperparameter impacting the classification performance of a deep neural network model. To this end, we performed the experimental evaluation with different numbers of hidden units in BiGRU to find the optimal value. Figure 5 presents the impact of different numbers of BiGRU hidden units (64, 128, and 256) on the performance of the proposed model over both the DS1 and DS2 datasets in terms of f-score and accuracy. The figure demonstrates that BiGRU with 128 hidden units shows significantly better performance on both datasets. Therefore, the proposed model uses 128 BiGRU hidden units.
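One way to see what this hyperparameter changes is to count the trainable parameters of the BiGRU layer. The sketch below uses the standard GRU formulation with one bias vector per gate; some library implementations keep a second bias vector, which adds a few more parameters. The input dimension of 300 is a hypothetical value, since the size of the features fed to the BiGRU is not given in this section.

```python
def gru_param_count(input_dim: int, hidden_units: int) -> int:
    """Parameters of one GRU direction (standard formulation).

    Three blocks (update gate, reset gate, candidate state), each with
    input weights, recurrent weights, and one bias vector.
    """
    per_block = (hidden_units * input_dim
                 + hidden_units * hidden_units
                 + hidden_units)
    return 3 * per_block


def bigru_param_count(input_dim: int, hidden_units: int) -> int:
    """A bidirectional GRU holds two independent GRUs."""
    return 2 * gru_param_count(input_dim, hidden_units)


INPUT_DIM = 300  # hypothetical feature size, not taken from the paper

for hidden_units in (64, 128, 256):
    print(hidden_units, bigru_param_count(INPUT_DIM, hidden_units))
```

Because of the recurrent weight matrix, the parameter count grows faster than linearly in the number of hidden units, so moving from 128 to 256 units more than doubles the layer size. This is consistent with, but does not by itself explain, the observation that 128 units perform best.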

4) OPTIMIZATION ALGORITHMS
Like the hyperparameters discussed above, the optimization algorithm also has a significant effect on the performance of a deep learning-based classification model. We performed the experiments using different optimization algorithms to analyze their impact on the classification performance of the HCovBi-Caps model over both the DS1 and DS2 datasets. Figure 6 presents the results using the Adam and Adadelta algorithms over both datasets in terms of f-score and accuracy. We can observe from the figure that Adam performs better than Adadelta on both datasets. Thus, the proposed model uses Adam as the optimization algorithm.
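For reference, the update rules of the two optimizers can be written in a few lines. The sketch below applies them to a toy one-dimensional objective f(x) = x^2 and is not the training setup of the paper: the starting point, the step count, and the Adam learning rate of 0.1 are illustrative choices, and the remaining constants are the commonly used defaults.

```python
import math


def grad(x: float) -> float:
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x


def run_adam(x: float, steps: int, lr: float = 0.1, beta1: float = 0.9,
             beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Adam: bias-corrected running averages of the gradient and its square."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def run_adadelta(x: float, steps: int, rho: float = 0.95,
                 eps: float = 1e-6) -> float:
    """Adadelta: step size set by the ratio of past update and gradient RMS."""
    avg_sq_grad = 0.0
    avg_sq_delta = 0.0
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
        x += delta
    return x


X0 = 5.0      # illustrative starting point
STEPS = 300   # illustrative number of updates

x_adam = run_adam(X0, STEPS)
x_adadelta = run_adadelta(X0, STEPS)
print(x_adam, x_adadelta)
```

The main structural difference is that Adam scales its steps by an explicit learning rate, whereas Adadelta derives its step size from the history of its own updates and therefore starts with very small steps. The toy objective says nothing about which optimizer suits the HCovBi-Caps model; that conclusion rests on Figure 6.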

B. EFFECT OF CAPSULE NETWORK HYPERPARAMETERS
The accurate functioning of a capsule network relies on various hyperparameters. This section presents the effect of three hyperparameters (number of capsules, dimension of capsule, and number of routing iterations) on the performance of the HCovBi-Caps model over both datasets in terms of f-score and accuracy.

1) NUMBER OF CAPSULES
The total number of capsules in a capsule network highlights the role of neurons at each layer and affects the performance of the underlying model. To this end, the performance of the HCovBi-Caps model is investigated using different numbers of capsules over both datasets. Figure 7 presents the experimental evaluation results showing the effect of the number of capsules (4, 5, and 6) on the performance of the HCovBi-Caps model over both datasets in terms of f-score and accuracy. We found on analysis that the model performance decreases as the number of capsules increases. Furthermore, we can observe from Figure 7 that HCovBi-Caps performs significantly better using 4 capsules over both datasets. This result justifies the selection of 4 capsules in the proposed model.

2) DIMENSION OF CAPSULE
The capsule dimension is another critical hyperparameter in a capsule network. It sets the dimensionality (number of components) of the output vector of a capsule. We perform experiments using different capsule dimensions to analyze the classification performance of the HCovBi-Caps model. Figure 8 presents the effect of various capsule dimensions (2, 4, and 8) on the HCovBi-Caps model over both datasets in terms of f-score and accuracy. The figure demonstrates that the proposed model performs best with capsule dimension 4. A further observation is that neither a smaller nor a larger capsule dimension is as effective.
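The following sketch shows what the capsule dimension changes in practice. It assumes the standard squash non-linearity from the capsule network literature, which this section does not spell out, and the raw capsule values are arbitrary numbers chosen for illustration.

```python
import math
from typing import List


def squash(s: List[float]) -> List[float]:
    """Standard capsule squash: keep the direction, map the norm into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def vector_norm(v: List[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


for capsule_dim in (2, 4, 8):
    raw = [0.5 * (i + 1) for i in range(capsule_dim)]  # arbitrary raw values
    out = squash(raw)
    # The dimension fixes the number of components; the norm stays below 1.
    print(capsule_dim, len(out), vector_norm(out))
```

In this formulation the capsule dimension determines how many components are available to encode properties of a feature, whereas the Euclidean norm of the squashed vector is always below 1 and is usually read as the probability that the feature is present.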

3) NUMBER OF ROUTING ITERATIONS
Routing iterations connect the capsules of consecutive layers in a capsule network. We performed the experimental evaluation with different numbers of routing iterations to analyze the classification performance of the HCovBi-Caps model over both datasets. Figure 9 presents the results for different numbers of routing iterations (2, 3, and 4), showing their impact on the classification performance of HCovBi-Caps over both datasets in terms of f-score and accuracy. We can observe from the figure that HCovBi-Caps performs significantly better with 3 routing iterations.
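The three capsule hyperparameters studied in this subsection can be seen together in a sketch of routing-by-agreement. The code assumes the standard dynamic routing procedure from the capsule network literature; whether HCovBi-Caps uses exactly this variant is not stated here. The number of input capsules (6) is hypothetical, and random numbers stand in for the learned prediction vectors.

```python
import math
import random
from typing import List


def squash(s: List[float]) -> List[float]:
    """Keep the direction of s and map its norm into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat: List[List[List[float]]],
                    iterations: int) -> List[List[float]]:
    """u_hat[i][j] is the prediction of input capsule i for output capsule j."""
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits
    v: List[List[float]] = []
    for _ in range(iterations):
        # Coupling coefficients: each input capsule distributes itself
        # over the output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: raise the logit where prediction and output align.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


random.seed(0)
NUM_IN = 6        # hypothetical number of input capsules
NUM_CAPSULES = 4  # number of capsules selected in this section
CAPSULE_DIM = 4   # capsule dimension selected in this section
ROUTING_ITERS = 3 # routing iterations selected in this section

u_hat = [[[random.uniform(-1.0, 1.0) for _ in range(CAPSULE_DIM)]
          for _ in range(NUM_CAPSULES)]
         for _ in range(NUM_IN)]
outputs = dynamic_routing(u_hat, ROUTING_ITERS)
print(len(outputs), len(outputs[0]))
```

Each additional iteration sharpens the coupling between input and output capsules, which can help agreement to emerge but also makes the assignment more rigid. This offers one intuition for why a moderate number of iterations (3) performs best in Figure 9.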

VI. CONCLUSION
This study has presented a novel deep neural network model, HCovBi-Caps, integrating convolutional, BiGRU, and capsule network layers for hate speech detection. Unlike existing models, HCovBi-Caps incorporates contextual information at different orientations using the capsule network. We evaluated the proposed model over two Twitter-based benchmark datasets, DS1 (balanced) and DS2 (unbalanced), to classify hate speech from general text. The proposed model shows the best performance over DS2 (unbalanced), with values of 0.90, 0.80, and 0.84 for precision, recall, and f-score, respectively. It also achieves its best accuracy on the unbalanced dataset. The proposed model has shown significantly improved performance over the state-of-the-art and baseline methods. We have further investigated the impact of various hyperparameters of the neural and capsule networks to analyze the efficacy of the proposed HCovBi-Caps model.

VII. LIMITATIONS AND FUTURE WORKS
The proposed HCovBi-Caps model detects hateful content with different contextual orientations. However, it can be further improved to detect hate content by considering different contextual semantics. Further, HCovBi-Caps does not exploit sentiment and user profile-related features, which may be effective. We can also evaluate HCovBi-Caps over more diverse datasets. The HCovBi-Caps model detects hate propagated in text only. Therefore, it can be extended to a multi-modal approach for hate speech detection. Extending HCovBi-Caps to classify hateful multi-lingual and code-mixed content is another direction of research. The contextual information that triggers controversy and hate on OSNs will also be investigated.

MOHD FAZIL received the master's degree in computer science from Aligarh Muslim University, Aligarh, India, and the Ph.D. degree in computer science from Jamia Millia Islamia, New Delhi. He is currently working as a Postdoctoral Research Associate with the Department of Computer Engineering, Qatar University, Qatar. He has published over 14 research articles, including two articles in IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. His research interests include data science, social computing, and data-driven cyber security.

VINEET KUMAR SEJWAL received the Ph.D. degree in computer science from Jamia Millia Islamia (A Central University), New Delhi, India. He has qualified one of the most prestigious Indian exams in computer science and engineering and GATE. He has published research articles in reputed SCI-indexed journals. His research interests include recommender systems, text mining, machine learning, and information retrieval.