Racism Detection by Analyzing Differential Opinions Through Sentiment Analysis of Tweets using Stacked Ensemble GCR-NN Model

With social media's dominant role in the socio-political landscape, several existing and new forms of racism have emerged on social media. Racism appears there in both hidden and open forms: hidden through the use of memes, and open as racist remarks posted under fake identities to incite hatred, violence, and social instability. Although often associated with ethnicity, racism is now thriving based on color, origin, language, culture, and, most importantly, religion. Social media opinions and remarks that provoke racial differences are regarded as a serious threat to social, political, and cultural stability and have threatened the peace of different countries. Consequently, social media, being the leading source for the dissemination of racist opinions, should be monitored, and racist remarks should be detected and blocked in a timely manner. This study aims at detecting tweets that contain racist text by performing sentiment analysis of tweets. Owing to the superior performance of deep learning, a stacked ensemble deep learning model, called Gated Convolutional Recurrent-Neural Networks (GCR-NN), is assembled by combining a gated recurrent unit (GRU), a convolutional neural network (CNN), and a recurrent neural network (RNN). GRU is at the top of the GCR-NN model to extract suitable and prominent features from raw text, while CNN extracts important features for the RNN to make accurate predictions. Several experiments are conducted to investigate and analyze the performance of the proposed GCR-NN against machine learning and deep learning models, indicating the superior performance of GCR-NN with an accuracy of 0.98. The proposed GCR-NN model can detect 97% of the tweets that contain racist comments.


I. INTRODUCTION
Social media has become a dominating element in socio-political prospects and controls our minds and actions in different ways. With the wide use of social media platforms across the world and freedom of speech, several vices have emerged over the past few years, racism being one of the leading ones. Social media sites, such as Twitter, represent a new setting in which racism and related stress are apparently prospering [1]. Currently, 22% of United States (US) adults use Twitter [2], while Twitter has 1.3 billion accounts and 336 million active users across the globe, 90% of whom have a public profile, leading to 500 million tweets per day [3]. Unless tweets are made private, they are publicly available, and Twitter users can react to such tweets and engage by sharing them on their profile (retweet), tagging someone's user name, clicking the like button, or responding to the author of the tweet [4]. On Twitter, the expression of feelings, emotions, attitudes, and opinions forms the raw data for sentiment analysis [5].
The growing popularity of social media platforms has led to their wide use for several old and new forms of racist practices [6]. Racism is expressed on such platforms both surreptitiously, for example through memes, and openly, for example by posting tweets containing racist remarks under fake identities. Although often associated with ethnicity, racism is now thriving based on color, origin, language, culture, and, most importantly, religion. Social media opinions and remarks that provoke racial differences are regarded as a serious threat to social, political, and cultural stability and have threatened the peace of different countries. Social media, being the leading source for the dissemination of racist opinions, should be monitored, and racist remarks should be detected and blocked in a timely manner.
Racist comments and tweets on social media have been regarded as the source of several kinds of mental and physical illness leading to adverse health outcomes [7]-[12]. With respect to its use on social media, racism can be categorized into three groups: institutionalized, personally mediated, and internalized [13]. Personally mediated racism can be experienced through racial discrimination or differential racial treatment, or through awareness of discrimination against family and friends. Consequently, the racist behavior of society adversely affects individuals and ignites several kinds of psycho-social stress, often leading to the risk of chronic diseases [14]-[16]. Additionally, racist groups and individuals perpetuate cyber-racism by employing higher skill levels and intricacy through various channels and strategies [5].
Special consideration has been given to sentiment analysis for analyzing text from social media platforms for a large variety of tasks, including hate speech detection, sentiment-based market prediction, and racism detection.
The wide use of social media is a potential source of data containing important information regarding people's attitudes, responses, emotions, and opinions about specific events, objects, personalities, and entities. Sentiment analysis provides powerful tools to mine such data and analyze emotions. A large part of Twitter feeds has become characterized less by coherent rational discussion and more by floods of emotion and affect, and can be used to divide narratives into polarities of good and evil [17], [18]. Research shows that issues may become less obvious than a shared sense of outrage and a compelling sense of shared agreement, and that Twitter feeds can be quite insular and nodal [19].
Keeping in view its wide use, social media has become an attractive source to apprehend attitudes and analyze interactions over sensitive topics such as racism. In the USA, the discussions about race and ethnicity on Twitter have been considered as indicators of the current state of relations based on race. Additionally, the variation in the types of discussions about racism indicates the geographic variability in racial attitudes and sentiment [20]. So, analyzing the details of how people, events, and circumstances are represented reveals the dynamics of how users communicate, and many problems related to racism can be exposed on this platform. Owing to the extreme and atypical racist attitude an individual faces related to personal traits and attitudes, one can easily become relativized, contextualized, and therefore depoliticized. It leads to distracting attention from the actual and specific structural inequalities in society experienced by certain ethnic groups [21].
Machine learning and deep learning approaches have proven their strength and superiority over traditional methods in several domains such as image processing [22], [23] and text classification [24], [25], and sentiment analysis is no exception. Several recent studies show that machine learning techniques perform better for sentiment analysis tasks [26], [27]. Therefore, this study leverages machine learning and deep learning models to perform sentiment analysis on tweets related to racism and makes the following contributions:
• An ensemble model is proposed that makes use of recurrent neural networks. For this purpose, a gated recurrent unit (GRU), a convolutional neural network (CNN), and a recurrent neural network (RNN) are stacked to form the GCR-NN model, which performs sentiment analysis.
• A large dataset of tweets containing racist comments/text is crawled from Twitter, which can be used by the research community. The dataset is annotated using TextBlob, based on the polarity score, into positive, negative, and neutral sentiments.
• For performance comparison, several well-known machine learning models are implemented using optimized parameters: decision tree (DT), random forest (RF), logistic regression (LR), k nearest neighbor (KNN), and support vector machines (SVM). Term frequency-inverse document frequency (TF-IDF) and bag of words (BoW) are studied as feature extraction techniques.
• For a fair comparison with the proposed approach, GRU, long short term memory (LSTM), CNN, and RNN are implemented as standalone models. Similarly, the performance of several state-of-the-art models is compared with the proposed GCR-NN in terms of accuracy, precision, recall, and F1 score.
The rest of the paper is organized as follows. Section II describes several important research papers related to the current study. The proposed approach, dataset, and description of machine learning algorithms are given in Section III. Section IV provides the analysis and discussion of results. Finally, the conclusion is drawn in Section V.

II. RELATED WORK
The overwhelming effects of hate crimes are increasing to a great extent because of the extensive use of social media [37] and the anonymity enjoyed by online users [38]. Abusive content on social media is a problematic phenomenon with several overlapping modes and aims [31]. Decontamination of such content is necessary. For this purpose, several studies have been conducted to automatically detect hate speech and hateful messages among other content on social media. Automatic hate speech detection using machine learning algorithms is still new and requires extensive research efforts from both industry and academia [39]. A few recent and related papers are discussed here [40], [41]. Machine learning algorithms have contributed enormously to hate speech detection and content analysis [37].
The authors present a multimodal hate speech detection model specifically for Greek social media in [28]. The study focuses on Twitter messages in Greek, especially racist speech and xenophobia aimed at migrants and refugees. An ensemble model, transfer learning, and fine-tuning of bidirectional encoder representations from transformers (BERT) and ResNet are used on the collected dataset. Different variants of BERT and ResNet are used, and the highest accuracy of 0.944 is reported using nlpaueb/greek-bert for the text modality and 0.97 with resnet18 + nlpaueb/greek-bert using the text+image modality. Similarly, [29] proposes a state-of-the-art machine learning-based system for the automatic detection of hate speech in Arabic social media networks. Several types of emotions are captured and different sets of features are used for analysis. The study uses four machine learning algorithms, namely Naïve Bayes (NB), DT, SVM, and RF, with TF-IDF, profile-related, and emotion-related features. RF with TF-IDF and profile-related features achieves the highest accuracy of 0.913.
Along the same lines, [30] classifies fake news and hate speech propaganda using features extracted from content containing fake and real news. The study uses NB, LR, and XGBoost with TF-IDF features. XGBoost achieves a recall of 0.83, which indicates that 17% of the hateful content is missed by the model. XGBoost also achieves a precision of 0.82, which indicates that 18% of the content the model flags as hateful is not actually hateful. The authors of [31] investigate the hate speech problem in the Saudi Twitter sphere using different deep learning approaches. A series of experiments is conducted on two datasets using BERT, CNN, GRU, and an ensemble of CNN and GRU (CNN+GRU). Results indicate that the CNN model achieves an F1 score of 0.79 and an area under the receiver operating characteristic curve (AUROC) of 0.89.
Study [32] investigates the automatic detection of cyberbullying. To review deep learning and machine learning approaches, the authors use two different datasets. Different word embedding techniques such as distributed BoW (DBoW), distributed memory mean (DMM), and Word2Vec CNN are used to classify online racism. An accuracy of 96.67% on one dataset and 97.5% on the second dataset is achieved using a neural network with 3 hidden layers and Doc2Vec features. In the same way, study [33] explores the automatic detection of Indonesian tweets that contain hate speech or racism. The authors use machine learning models such as multinomial NB (MNB), multilayer perceptron (MLP), AdaBoost (AB) classifier, and SVM. The synthetic minority oversampling technique (SMOTE) is used for upsampling, and experiments are performed on both SMOTE and non-SMOTE features. Results show that MLP with SMOTE features has an accuracy of 83.4%, while AB and MNB have 71.2% accuracy for non-SMOTE features.
Ching She et al. work on hate speech detection from social media in [34]. For experiments, the audio data is extracted from videos and converted to text using a speech-to-text converter. MNB, linear SVM, RF, and RNN are used for experiments. Two different sets of experiments are carried out: the first classifies videos into normal and hateful videos, while the second classifies videos into normal, racist, and sexist classes. Results show that RF performs best in terms of accuracy, achieving 0.9464 for the first set of experiments and 0.857 for the second set.
Another similar work is [35], which investigates hate speech related to Islam on social media. The study constructs an automated tool that can distinguish between non-islamophobic, weak islamophobic, and strong islamophobic content. Different machine learning algorithms such as NB, RF, LR, DT, and SVM, as well as deep learning models, are used. Results suggest that SVM obtains a testing accuracy of 72.17%. The performance of SVM is also evaluated using 10-fold cross-validation, which shows an accuracy of 74.6% and a balanced accuracy of 80.7%. Study [36] proposes a novel system to detect hate speech across multiple social media platforms such as Reddit, YouTube, Twitter, and Wikipedia. A large dataset is built from these platforms with 80% labeled as non-hateful and 20% labeled as hateful. Several machine learning algorithms such as XGBoost, SVM, LR, NB, and feed-forward neural networks are tested with BoW, TF-IDF, Word2Vec, BERT, and their combinations. XGBoost outperforms all models with a 0.92 F1 score using all features. Feature importance analysis shows that BERT features have a great effect on predictions.
Taking into account the reported results from deep learning models, this study leverages a deep learning ensemble model to detect racist comments on Twitter. The study aims at obtaining high classification accuracy by stacking recurrent neural networks. Racism detection is performed using sentiment analysis, where tweets with negative sentiment are taken to indicate racist content.

A. PROPOSED METHODOLOGY
This study proposes an approach for racism detection on social media platforms using machine learning and deep learning techniques. Figure 1 shows the flow of the steps carried out in the proposed approach. As the first step, data is crawled from Twitter, followed by data cleaning and preprocessing, and finally data annotation. In the end, the proposed stacked ensemble model is trained and tested on the dataset and its performance is compared with several other deep learning and machine learning models.

B. DATASET DESCRIPTION
The racism tweets dataset is collected from Twitter. Twitter has been the first choice of the majority of researchers for text and sentiment analysis because it is widely used by a large number of people to express their feelings, views, comments, and opinions. In particular, this study intends to study racism trends based on Twitter posts. For data collection, tweets related to racist comments are collected using several keywords, such as '#racism', '#racial', and '#racist', for the period from 29 July 2021 to 6 August 2021. A total of 169,999 tweets that match the criteria have been collected. The data are collected using the Twint library, and important attributes such as 'username', 'date', 'location', and 'content' are extracted. A specimen of collected tweets is provided in Table 2.
TABLE 2: Specimen of collected tweets.
Text
@_LeBale racism is good
tonyhasanidea @manoutdoors4 @AJ_Lady_Liberty @FBIWFO @TheJusticeDept @FBI it is clear to hundreds of millions of people of all walks that this country has a severe problem with systemic racism. your denial is discussing.

C. DATA PREPROCESSING
Several steps are carried out at the preprocessing level to clean the data. It is vital to preprocess and clean the text adequately so that a model can be trained appropriately. This study combines natural language processing (NLP) methods using the natural language toolkit (NLTK) of Python to preprocess the tweets.
• Tokenization: It is the process of splitting natural text into tokens without any white spaces. It involves breaking sentences down into their constituent words. Although it looks simple and straightforward, deciding which tokens are appropriate is not a trivial task.
• Stemming: The text contains different forms of the same word, which can create complexity in machine learning models. Words such as 'gone' and 'going' are modified forms of 'go'. Stemming converts each word into its root form, so that 'gone' and 'going' are transformed into 'go'. Stemming is performed using the Porter stemmer algorithm.
• Lemmatization: It is a similar procedure to that of stemming; however, it produces a different output. Stemming simply removes suffixes such as 's' or 'es' at the end of a word to change it to its root form, which often results in wrong words/spellings. Lemmatization retains the root form of a word by considering the context in which the word is used. It also lowers the count of unique occurrences of similar words. This approach is used in the proposed strategy to bring words into their canonical form and limit the count of unique occurrences of identical text tokens.
• Stop Words Exclusion: Stop words are words that do not contribute to the training of machine learning algorithms. Instead, they create complexity by increasing the feature space. So, stop words such as 'a', 'am', and 'an' are removed to increase the learning efficiency of the models in this study.
• Case Normalization: Because the same word in different cases, such as "Racism" and "racism", must be treated in the same way, the entire text is converted to lowercase letters. It is commonly regarded as part of data cleansing because it helps minimize the repetition of features that differ only in letter case.
• Noise Removal: This stage removes any noise that could degrade the performance of the classification. Special characters, numeric data, ids, and '#' signs are examples of noise types deleted in this phase.
The sample tweets are preprocessed using the above-discussed steps and the resulting text is given in Table 3.
TABLE 3: Sample text before and after preprocessing.

Before preprocessing | After preprocessing
@_LeBale racism is good | racism good
@manoutdoors4 @AJ_Lady_Liberty @FBIWFO @TheJusticeDept @FBI it is clear to hundreds of millions of people of all walks that this country has a severe problem with systemic racism. your denial is discussing. the world is changing , get on board or get left | clear hundr million people walk country sever problem system racism denial
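The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the stop-word list is a tiny hypothetical stand-in for NLTK's list, the regular expressions are assumptions about what counts as noise, and stemming is left as an optional hook so that the sketch runs without any external data.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"a", "am", "an", "is", "it", "in", "the", "to", "of",
              "that", "this", "has", "with", "your", "or", "on"}

def preprocess(text, stem=None):
    """Clean one tweet and return its list of tokens."""
    text = text.lower()                              # case normalization
    text = re.sub(r"@\w+|https?://\S+", " ", text)   # noise: user ids and links
    text = re.sub(r"[^a-z\s]", " ", text)            # noise: digits, '#', punctuation
    tokens = text.split()                            # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word exclusion
    if stem is not None:                             # optional stemming hook
        tokens = [stem(t) for t in tokens]
    return tokens

print(preprocess("@_LeBale racism is good"))
```

If NLTK is installed, `nltk.stem.PorterStemmer().stem` can be passed as the `stem` argument to add the stemming step.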

D. DATA ANNOTATION
To annotate the dataset with positive, negative, and neutral sentiments, this study uses the TextBlob library. TextBlob finds the polarity score for a given text, which is used to assign a sentiment label to the text. The TextBlob polarity score ranges from -1 to 1. The polarity score range for positive, negative, and neutral sentiments is shown in Table 4. After the annotation, the distribution of tweets is shown in Figure 2. It shows the ratio of positive, negative, and neutral sentiments in the dataset. The three classes are of broadly similar size, with neutral sentiments forming the largest part of the dataset.
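The annotation rule can be illustrated with a short sketch. The thresholds below (score above zero is positive, below zero is negative, exactly zero is neutral) are an assumption, since the ranges of Table 4 are not reproduced in this text; `label_from_polarity` and `annotate` are hypothetical helper names. The only TextBlob call used is `TextBlob(text).sentiment.polarity`.

```python
def label_from_polarity(score, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is an assumed half-width of the neutral zone around zero.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

def annotate(text):
    """Label one tweet with TextBlob's polarity score."""
    # Imported lazily so the thresholding logic works without TextBlob installed.
    from textblob import TextBlob
    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

Keeping the thresholding separate from the library call makes it easy to swap in the actual ranges from Table 4.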

E. FEATURES EXTRACTION
BoW and TF-IDF are used for features extraction to train the machine learning models. Each feature extraction technique gives 125,461 features for models' training.

1) Term Frequency -Inverse Document Frequency
TF-IDF is among the most commonly employed scoring metrics for summarization and information retrieval. It is used to measure the significance of a term within a given text [42]. The TF-IDF score combines two components: TF and IDF. TF-IDF highlights tokens that are uncommon across the dataset: a term that occurs frequently within a document but in few other documents receives a higher weight.
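In its standard form, which is assumed here rather than taken from the study, the score is the product of the two components:

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{|D|}{|\{\, d \in D : t \in d \,\}|}
```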
where t denotes terms, d denotes each document, and D is the set of documents. The n-gram range parameter is used in conjunction with TF-IDF. TF-IDF is used to compute the weight of each word relative to the corpus, and the output is a weighted word matrix. The TF approach is frequently used for extracting features and is therefore widely used for text categorization. During classifier training, the frequency of occurrence of terms is used as a parameter. The TF function does not consider the importance of rare words, in contrast to TF-IDF, which gives less weight to more frequent terms. TF-IDF results on the sample preprocessed data are shown in Table 5.

2) Bag of Words
BoW is another commonly used feature extraction technique in NLP tasks. It is the most convenient and adaptable approach to obtain a document's features [43].
BoW examines the histogram of the words within the text, and the frequency of the words is used as a feature for training. The BoW approach is implemented in this study using the CountVectorizer from the scikit-learn library of Python. The technique of transforming a textual dataset into numerical vectors is termed vectorization. The occurrences of each token are counted to build the token vectors, so BoW assigns a value to every feature based on its frequency. BoW results on sample preprocessed data are shown in Table 6.
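To make the difference between the two feature sets concrete, the following sketch computes BoW counts and TF-IDF weights for two toy documents in plain Python. It is an illustration under stated assumptions: it uses the unsmoothed formula (raw count times log of N over document frequency), whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

def tf_idf(docs):
    """Unsmoothed TF-IDF: raw term count times log(N / document frequency)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = [[row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
               for row in counts]
    return vocab, weights

docs = [["racism", "good"], ["racism", "problem", "problem"]]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)    # sorted vocabulary
print(bow)      # raw counts per document
print(weights)  # 'racism' occurs in every document, so its weight is zero
```

The term that appears in every document gets zero weight under TF-IDF but keeps its raw count under BoW, which is the main practical difference between the two representations.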

F. MACHINE LEARNING MODELS
For racism detection from tweets, machine learning models have been adopted due to their superior performance over traditional models. Some of the renowned models, namely RF, LR, DT, SVM, and KNN, are discussed briefly in this paper for completeness. The performance of these models is optimized by fine-tuning several hyperparameters. A complete list of the parameters used in this study is provided in Table 7, along with the range used for optimization and the values used for experiments.

1) Random Forest
RF is a tree-based classifier that builds trees based on a random vector taken from the input vector [44]. Initially, RF builds a forest by producing multiple decision trees using random features. Later, voting is performed by aggregating the decisions from all decision trees to make the final prediction. Votes from a decision tree with a low error rate are given a higher weight and vice versa. Relying on decision trees with low error rates reduces the chances of a wrong prediction [1]. RF can be defined by the equations:
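One common way to write the voting step is given below as an assumption, since the symbols are not defined in the text: M is the number of trees, T_m(x) is the class predicted by tree m for input x, and w_m is an optional per-tree weight. The first line is plain majority voting and the second is the weighted variant that matches the description of error-based weights.

```latex
\hat{y}(x) = \operatorname{mode}\{\, T_1(x),\, T_2(x),\, \ldots,\, T_M(x) \,\}
\\[4pt]
\hat{y}_w(x) = \arg\max_{c} \sum_{m=1}^{M} w_m \, \mathbb{1}\!\left[ T_m(x) = c \right]
```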

2) Logistic Regression
LR is a statistics-based classifier that is mostly used for the analysis of binary data in which one or more variables are used to determine the result. It is also used for probability evaluation of class association [45]. LR is especially recommended for categorical data due to its superior performance. It finds the relationship between the dependent variable and one or more independent variables of the categorical data using approximation. For probability approximation, LR makes use of a logistic function. A logistic function, or logistic curve, is a common S-shaped (sigmoid) curve defined as
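The standard sigmoid and its use in LR are shown below as a hedged reference form; the weight vector w and bias b are notation assumed here, not symbols from the study.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
P(y = 1 \mid x) = \sigma\!\left( w^{\top} x + b \right)
```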

3) Support Vector Machine
SVM is a well-known machine learning algorithm that is widely used for the classification of linear as well as nonlinear data. For binary classification problems, it is the first choice of many researchers, and it is available with various kernel functions [25]. The main purpose of the SVM classifier is to estimate the hyperplane, based on the feature set, that classifies the data points [44]. The dimensions of the hyperplane vary with the number of features. As multiple possibilities exist for hyperplanes in n-dimensional space, the task is to derive the hyperplane that maximizes the margin between samples of the classes. The hyperplane is determined by minimizing a cost function subject to margin constraints:
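A standard soft-margin formulation is shown below as an assumption; the weight vector w, bias b, slack variables ξ_i, penalty C, and labels y_i in {-1, +1} are notation introduced here for illustration.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{such that} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,
\;\; \xi_i \ge 0,
\;\; i = 1, \ldots, n
```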

4) K Nearest Neighbor
KNN is a simple and widely used machine learning algorithm for both classification and regression problems. KNN assumes that similar data can be found in close proximity, so it uses the concepts of 'neighbors'. It estimates the distance of the new data points to its neighbors by using distance calculation metrics such as Euclidean distance, Manhattan distance, and Minkowski distance, etc. In KNN, the value of K determines the number of neighbors to be considered for prediction. Well-known distance calculation metrics are given here [46]:
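The three distance metrics named above are related: Minkowski distance with p = 2 is Euclidean distance and with p = 1 is Manhattan distance. The sketch below shows them together with a minimal majority-vote KNN prediction; the function names and the toy data are hypothetical.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, p=2)

def manhattan(a, b):
    return minkowski(a, b, p=1)

def knn_predict(train, query, k=3, dist=euclidean):
    """Predict the majority label among the k training points closest to query.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "neutral"), ((0, 1), "neutral"),
         ((5, 5), "negative"), ((6, 5), "negative"), ((5, 6), "negative")]
print(knn_predict(train, (5, 4), k=3))
```

Because every prediction sorts the whole training set, the cost grows with the dataset size, which is one reason KNN is described as a lazy learner.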

5) Decision Tree
DT is a rule-based supervised machine learning algorithm. DT is a renowned and powerful predictive model that can handle regression and classification problems efficiently. Attribute selection is the major problem in DT [47], and information gain and the Gini index are the most used methods for attribute selection. Information gain is the rate of increase or decrease in the entropy of attributes, where entropy shows how homogeneous a dataset is [43]. The above equation computes the entropy E of a given dataset D which contains the positive and negative decision attributes. Gain of the attribute X is calculated by the formula:
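Entropy and information gain can be illustrated with a small sketch using their textbook definitions, which are assumed here: entropy is the negative sum of p log2 p over the class proportions, and the gain of an attribute is the entropy of the dataset minus the size-weighted entropy of the subsets produced by splitting on that attribute. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting labels on an attribute's values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = ["+", "+", "-", "-"]
print(entropy(labels))                                   # evenly mixed classes
print(information_gain(["a", "a", "b", "b"], labels))    # attribute separates the classes
print(information_gain(["a", "b", "a", "b"], labels))    # attribute carries no information
```

An attribute that perfectly separates the classes recovers the full entropy of the dataset as gain, while an uninformative attribute yields a gain of zero.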

6) Proposed Gated Convolutional Recurrent Neural Networks
The proposed model, GCR-NN, is a combination of GRU, CNN, and RNN. This study combines these models in a stack, with GRU at the top, CNN in the middle, and RNN at the end. The selection of these models for the ensemble is based on their individual performance. GRU takes its input from the embedding layer, which has a vocabulary size of 5000. This input is processed by the GRU model to extract features for the following layers. The GRU layer is used with 64 units, followed by a CNN layer that uses the output of the GRU model. The CNN layer is used with 64 filters and a 4×4 kernel. The CNN layer is followed by a max-pooling layer with a pooling size of 4. A dropout layer with a 0.2 dropout rate is also used to reduce the complexity of GCR-NN, because the dropout layer randomly drops neurons and reduces the chances of model overfitting. RNN works at the end of the GCR-NN model with 16 units. The outputs of the GRU and CNN are directed to the RNN model. At the end of the RNN, a dense layer is used with 3 neurons and a softmax activation function because of the three target classes. The model is compiled with the categorical_crossentropy loss function because of the multi-class problem, and the 'adam' optimizer is used for training [48]. The model is fit using 100 epochs and a batch size of 16.
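The layer stack described above can be sketched with the Keras API as follows. This is an approximation under stated assumptions, not the authors' code: the embedding dimension (100) and sequence length (50) are not given in the text, the 4×4 kernel is interpreted as a 1D convolution of width 4 over the token sequence, the dropout layer is placed after pooling, and the convolution activation is left at the Keras default. The helper `conv_pool_length` is a hypothetical function that only checks the sequence-length arithmetic.

```python
def conv_pool_length(seq_len, kernel_size=4, pool_size=4):
    """Sequence length after a 'valid' 1D convolution followed by max pooling."""
    return (seq_len - kernel_size + 1) // pool_size

def build_gcr_nn(vocab_size=5000, embed_dim=100, seq_len=50, n_classes=3):
    """Stack GRU -> CNN -> RNN as described; embed_dim and seq_len are assumed."""
    # Imported lazily so the arithmetic helper works without TensorFlow installed.
    from tensorflow.keras import Sequential, layers

    model = Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.GRU(64, return_sequences=True),   # keep the sequence for the CNN
        layers.Conv1D(64, 4),                    # 64 filters, kernel width 4 (assumed 1D)
        layers.MaxPooling1D(pool_size=4),
        layers.Dropout(0.2),
        layers.SimpleRNN(16),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

# With the assumed length of 50 tokens, the RNN sees 11 time steps.
print(conv_pool_length(50))

# Training as described in the text (x_train and y_train are placeholders):
# build_gcr_nn().fit(x_train, y_train, epochs=100, batch_size=16)
```

Setting `return_sequences=True` on the GRU is what allows a convolution to follow it; without it, the GRU would emit a single vector and the convolution would have no sequence to slide over.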

IV. RESULTS AND DISCUSSIONS
Experiments for sentiment analysis on racism tweets have been carried out using an Intel Core i7 11th generation machine running Windows 10. Machine learning and deep learning models are implemented in Jupyter in the Python language using the TensorFlow, Keras, and scikit-learn frameworks. The performance of all models is evaluated in terms of accuracy, precision, recall, F1 score, number of correct predictions, and number of wrong predictions.
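For reference, precision, recall, and F1 score for a single class follow from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of flagged items that are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of true items that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```

For a three-class problem such as this one, the function would be applied once per class and the results averaged.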

A. VISUAL REPRESENTATION OF SENTIMENT DISTRIBUTION
To show the distribution of the dataset by country, the data is divided among the four countries with the highest number of tweets. Figure 4a shows that the highest number of tweets are posted from the US, followed by the United Kingdom (UK), Nigeria, and the Republic of South Africa (RSA) when racist content is considered.
The sentiment distribution of tweets for each of the top four countries is given in Figure 4. It shows that the majority of the tweets belong to the neutral class for the US, UK, and RSA, with 54%, 55%, and 43% neutral tweets, respectively. The highest ratio of negative tweets comes from RSA, where 40% of the tweets are negative. On the other hand, the highest ratio of positive tweets regarding racism originates from Nigeria, with 80% of the total tweets from Nigeria. The ratio of positive and negative tweets is approximately similar in the US and the UK. Figure 5, show the word

B. MACHINE LEARNING MODELS RESULTS USING BOW AND TF-IDF
This section contains the results of machine learning models using BoW and TF-IDF features. Table 8 shows the performance of all machine learning models using TF-IDF features, and the results show that the performance of linear models is significantly better than that of the other models. SVM achieves the highest accuracy of 0.97 and LR achieves an accuracy of 0.96. These models perform best when the feature set is large, as in this study where the TF-IDF feature size is 125,461; these are appropriate conditions for both SVM and LR. RF also performs well with an accuracy of 0.91. In this study, the RF ensemble combines 300 DTs under the majority voting criterion, and this ensemble architecture makes RF a strong model in terms of accuracy. KNN performs very poorly because it is a lazy learner that performs better when the dataset is small.
Experimental results of machine learning models using BoW features are given in Table 9. Results suggest that SVM and LR also show better performance when used with BoW features. Both SVM and LR obtain an accuracy of 0.97, which is substantially better than all other models. LR and RF both improve their accuracy by 1% with BoW features compared to TF-IDF features. The improvement is attributed to the simpler BoW features, which aid the training of machine learning models: TF-IDF gives a weighted feature set, which can be complex when the feature set is large, while BoW gives a simple set that can be more appropriate for training machine learning models. The accuracy of KNN also rises from 42% to 52%, which is a significant improvement. On average, the performance of the machine learning models is better with BoW features than with TF-IDF features.
Models' performance is also evaluated in terms of the number of correct predictions (CP) and wrong predictions (WP). Among the machine learning models, SVM gives the highest number of correct predictions using BoW, with 41,397 correct predictions and 1,103 wrong predictions. SVM also performs best with TF-IDF features, giving 41,361 correct predictions and 1,139 wrong predictions. Across both machine learning and deep learning models, the proposed GCR-NN gives 41,520 correct predictions and 980 wrong predictions, which is the highest correct prediction ratio of all the models used in this study.
To show the significance of the proposed approach, the results of the proposed GCR-NN are compared with other studies. The study [49] uses a dataset related to racism and hate speech. That dataset has only two target classes, 'racism' and 'no racism', whereas the current study uses three classes for experiments. The study leverages XGBoost for racism detection and obtains accuracy and F1 scores of 0.69 each. The proposed model in this study, on the other hand, achieves a 0.95 accuracy score and shows far better results than previous studies even on the multiclass task. Another dataset, related to US airline sentiments and taken from [50], is also considered to evaluate performance on a small dataset. Results indicate that GCR-NN performs well on the US airline dataset with 0.81 accuracy.
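The accuracy implied by the correct and wrong prediction counts reported above can be checked directly, since accuracy is CP divided by CP plus WP. The short sketch below uses only the counts stated in the text; the dictionary layout and function name are illustrative.

```python
def accuracy(correct, wrong):
    """Accuracy from the number of correct and wrong predictions."""
    return correct / (correct + wrong)

# (correct predictions, wrong predictions) as reported in the text.
reported = {
    "SVM (BoW)": (41397, 1103),
    "SVM (TF-IDF)": (41361, 1139),
    "GCR-NN": (41520, 980),
}

for name, (cp, wp) in reported.items():
    print(f"{name}: {cp + wp} test tweets, accuracy {accuracy(cp, wp):.4f}")
```

Each pair sums to the same number of test tweets, which is close to one quarter of the 169,999 collected tweets and is therefore consistent with a held-out test set of roughly that share.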

E. DISCUSSIONS
This study aims at identifying racist content posted in tweets by performing sentiment analysis. For this purpose, the dataset is annotated into positive, negative, and neutral classes. The positive and neutral classes indicate that racist content is not present in a tweet, while the negative class indicates that a tweet is racist, as it contains negative views related to racism. A distribution of correct and wrong predictions, along with accuracy, is therefore provided here with respect to the negative class. The collected dataset contains a total of 169,999 tweets, comprising 66,579 neutral, 49,887 positive, and 53,533 negative tweets. Tweets containing negative sentiments make up 31.49% of the total, which is a substantial share. Results in Table 13 are provided with respect to the 53,533 negative tweets. Results indicate that SVM detects negative tweets with the highest accuracy among the machine learning models, 0.96 for both TF-IDF and BoW features, which means that 4% of racist tweets are misclassified by SVM. Similarly, LR correctly identifies 95% of the racist tweets but classifies 5% of them as non-racist. For racism detection, the performance of the proposed GCR-NN is superior to all models: only 352 of the 13,425 racist tweets are misclassified, which gives a racism detection accuracy of 0.97. This performance is superior to that of both the machine learning and the deep learning models.
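The figures quoted in this discussion can be checked with a few lines of arithmetic. The sketch below recomputes the share of negative tweets and the GCR-NN detection rate on the negative class; the variable names are illustrative assumptions, and the counts are those stated in the text.

```python
# Class counts of the collected dataset, as reported in the text.
class_counts = {"neutral": 66_579, "positive": 49_887, "negative": 53_533}
total = sum(class_counts.values())
negative_share = class_counts["negative"] / total

# Negative-class result reported for GCR-NN: 352 of 13,425 misclassified.
negative_evaluated = 13_425
negative_missed = 352
detection_rate = (negative_evaluated - negative_missed) / negative_evaluated

print(f"total tweets: {total}")
print(f"negative share: {negative_share:.2%}")
print(f"GCR-NN negative-class detection rate: {detection_rate:.4f}")
```

The class counts sum to 169,999, the negative share comes to 31.49%, and the detection rate rounds to 0.97, which matches the values reported above.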

V. CONCLUSION
Racist comments are becoming more frequent on social media platforms like Twitter and should be automatically detected and stopped to avoid further spread. This study approaches racism detection from a sentiment analysis perspective and detects tweets containing racist content by identifying negative sentiments. To obtain high-performance sentiment analysis, deep learning is complemented by an ensemble approach in which GRU, CNN, and RNN are stacked to form the GCR-NN model. A large dataset collected from Twitter and annotated using TextBlob is used for experiments with several machine learning models, deep learning models, and the proposed GCR-NN model. Overall, 31.49% of the collected 169,999 tweets contain racist comments. Results show that deep learning models perform substantially better than machine learning models, with the proposed GCR-NN obtaining an average accuracy score of 0.98 for sentiment analysis over the positive, negative, and neutral classes. Since the negative class is the one that matters for racism detection, a separate analysis indicates that SVM and LR correctly detect 96% and 95% of racist tweets, respectively, while misclassifying 4% and 5%, respectively. The proposed GCR-NN, on the other hand, correctly detects 97% of the racist tweets with only a 3% misclassification rate.
ERNESTO LEE is working as a Professor at the Department of Computer Science, Broward College, Broward County, Florida, USA. His recent research interests are related to blockchain, IoT, and data mining, mainly machine learning and deep learning-based IoT and text mining tasks.
PATRICK BERNARD WASHINGTON is working as a Professor at the Division of Business Administration and Economics, Morehouse College, Atlanta, GA, USA. His recent research interests are related to blockchain, IoT, and data mining, mainly machine learning and deep learning-based IoT and text mining tasks.
FATIMA EL BARAKAZ received the engineer diploma in computer science engineering from The National Institute of Statistics and Applied Economics, Morocco, in June 2015. She is currently pursuing a Ph.D. degree in machine learning and data mining in the LAROSERI Laboratory, with the Department of Computer Science, Chouaib Doukkali University, El Jadida, Morocco. She is also a temporary teacher of C/C++ classes. Her recent research interests are related to data mining, mainly machine learning, deep learning, and text mining tasks.
WAJDI ALJEDAANI received a bachelor's degree in Software Engineering from the Athlone Institute of Technology, Ireland, in 2014, and a master's degree in Software Engineering from Rochester Institute of Technology, New York, in 2016. He is currently a computer science and engineering Ph.D. student at the University of North Texas. He worked as a lecturer for three years (2017-2020) at Al-Khari College of Technology, Saudi Arabia. His research interests are software engineering, mining software repositories, accessibility, machine learning, and text mining.