L-Boost: Identifying Offensive Texts from Social Media Post in Bengali

Due to the significant increase in Internet activity since the COVID-19 pandemic, a large amount of informal, unstructured, offensive, and even misspelled textual content has been used for online communication through various social media. Bengali and Banglish (Bengali words written in English script) offensive texts have recently been widely used to harass and criticize people on various social media. Our in-depth survey reveals that limited work has been done to identify offensive Bengali texts. In this study, we have engineered a detection mechanism using natural language processing to identify Bengali and Banglish offensive messages in social media that could abuse other people. First, different classifiers were employed as baseline classifiers to classify the offensive text from real-life datasets. Then, we applied boosting algorithms based on the baseline classifiers. AdaBoost (adaptive boosting) is one of the most effective ensemble methods and enhances the outcomes of the classifiers. The long short-term memory (LSTM) model eliminates long-term dependency problems when classifying text but suffers from overfitting, whereas AdaBoost has strong forecasting ability and does not easily overfit. By combining these two powerful and diverse models, we propose L-Boost, a modified AdaBoost algorithm that uses bidirectional encoder representations from transformers (BERT) with LSTM models. We tested the L-Boost model on three separate datasets, using the BERT pre-trained word-embedding vector model. We find that our proposed L-Boost outperforms all the baseline classification algorithms, reaching an accuracy of 95.11%.


I. INTRODUCTION
During the COVID-19 global pandemic, people have been protecting themselves by maintaining as much social distance as possible and communicating virtually with each other. For this purpose, they have been engaging in various social media such as Facebook, TikTok, FaceTime, WhatsApp, Zoom, etc., and have been in constant touch with others. As a result, online content has grown rapidly, and people are constantly harassed through online media. The L1ght [1] organization has been monitoring cyberbullying content, such as online harassment or hate speech, since 2018. They found that hate speech on Twitter increased by 900% and traffic on several hate sites increased by 200% during COVID-19.
About 245 million people in the world speak Bengali, making it the 7th most spoken language. Bengali language processing research is not as rich as that of other languages, such as English, Arabic, or European languages. Many researchers have attempted to automate the processing of the Bengali language, but this is still at the development stage. Most Bengali people are connected to virtual social networks [2], where people are personally attacked using offensive and inaccurate texts [3] that may threaten their lives and cause social media harassment, religious riots, eve-teasing, and, above all, incitement against the government.
Texts that are intentionally used to harass people are known as offensive [4] texts. Offensive content encourages people to engage in immoral activities against the government and law enforcement agencies, hurts religious feelings, and encourages people to commit crimes without legitimate reasons [5]. Offensive content can be spread on social media in different forms, including text, images, audio, video, or graphical formats. However, text is the most commonly used form on social media. In addition, a text using hashtags can easily be converted into clickable links, and the same hashtag can easily link text to another group. Thus, text spreads faster than any other format and enables the convenient retrieval of any information. In this study, we focus on the textual format of offensive content and classify whether any given text is offensive or not.
Undoubtedly, the manual detection of offensive text from a large number of Internet sources is difficult [6]. Thus, the enormous growth of web data necessitates the development of an automated offensive text detection system. Recently, Bengali offensive texts have been extensively used to harass people on various platforms. Automated Bengali offensive text detection is thus required to prevent harassment on social media [4]. Such a system may also display a warning message to the user before they read the content. Thus, a user can verify the content of a website or other social media before reading or sharing content written by others and avoid online irritation. However, developing an automated offensive text classification model for the Bengali language is a challenging task. This is due to some unique characteristics of the Bengali language in word formation, as well as sentence structures that differ from those of other languages. In Bengali, the same verb can convey different meanings depending on the subject, person, gender, and tense, and there are a large number of synonyms and word senses. For example, in Table 1, the first Bengali sentence has the same meaning as its English counterpart. However, in the second example, the literal translation of the Bengali sentence into English does not convey the same meaning, because the second example is an offensive text: the verb "eat" has an offensive meaning in Bengali, whereas in the first example it is used in its normal sense. Bengali people also write Bengali sentences using English letters, known as the Banglish form. They often write a full paragraph in the Banglish form rather than a single sentence using Bengali characters, because writing a Bengali word in pure Bengali script is more difficult than writing it in the Banglish format, and they use different types of emojis while writing their comments.
Table 2 shows an example of a Banglish offensive sentence in which a user uses various emojis and emotions to directly insult someone.
Processing Bengali or Banglish text is not similar to processing any other language. Therefore, existing offensive text detection methods for other languages cannot accurately detect offensive Bengali text. Recently, however, some text classification algorithms have been used to identify offensive Bengali texts [7], but these algorithms do not eliminate long-term dependency problems when classifying text. Salim Sazzad [8] suggested abusive text detection methods for a transliterated Bengali-English social media dataset. Their dataset contains several English and Banglish words but not a single Bengali word. In their feature extraction step, they did not convert the text to the corresponding Bengali or English format but instead used the TF-IDF method to find the vectors of all the words and applied different classification algorithms to measure the accuracy.
Existing offensive text detection methods in Bengali are mainly limited to baseline experiments, and offensive text detection methods developed for other languages show poor accuracy when applied to Bengali because of the unique syntactic and semantic structure of the Bengali language [7], [9]. The baseline classification methods do not re-weight the samples that were incorrectly classified in previous iterations. In addition, the baseline methods cannot fit all datasets into an optimal class [10]. However, ensemble methods can exhibit better accuracy with a smaller dataset [11]. Recently, an article [12] was published on identifying Bengali hate speech using ensemble prediction. In that study, the authors used a variety of BERT approaches, such as BanglaBERT-base [13], Bangla-BERT-uncased, and XLM-RoBERTa, in the ensemble method to increase their accuracy. They showed that the ensemble method boosts their accuracy by up to 7%. The most popular ensemble method for binary classification is AdaBoost [14]. The AdaBoost [15] ensemble method improves accuracy by training weak learners sequentially and boosting the results of text classification. The algorithm selects the weak classifier whose weighted error is the lowest among all weak classifiers. Then, each sample is re-weighted and a new weak classifier is selected, repeating this T times. Finally, all the weak classifiers are combined to form a strong classifier. Sometimes, the AdaBoost algorithm selects a very poor classifier, which raises low-margin and overfitting issues [16]. Viola and Jones [16] used Haar-like features as weak classifiers in the AdaBoost algorithm for object detection.
To address the challenges mentioned above, we propose the L-Boost ensemble method for Bengali offensive text identification. Our proposed model consists of three parts. The first part predicts offensive text using BERT [17] embeddings with a long short-term memory (LSTM) [18] model. BERT is an unsupervised learning model based on bidirectional deep learning that is used for a variety of text-processing purposes [17]. It can be used as a feature-based approach, as a pretrained word embedding model, or with fine-tuning in text classification. The LSTM model can retain data over a long period of time owing to its recurrent formation and gating procedure. In addition, the LSTM model is considered a sophisticated method for time-series-related problems. Conventional classification models, such as SVMs and general neural networks, find it difficult to capture long-term dependencies in offensive text, whereas LSTM neural networks handle them better. The second part of our proposed model is AdaBoost with a custom BERT fine-tuning model. The LSTM model performs well in predicting offensive texts, but because of its complex architecture, problems such as overfitting may occur, reflected in the squared prediction error loss and the trade-offs between model complexity and model fit. BERT performs significantly better than other existing NLP models, but it also has some limitations. In the BERT model, the first token must be [CLS], which is used for finding a specific embedding, and the model uses another token, [SEP], for segment separation. These tokens are used during the fine-tuning process but are unavailable during the prediction phase. On the other hand, AdaBoost has strong forecasting ability with low bias error, and overfitting does not occur easily.
Considering the above advantages and disadvantages of these two models, the last part of our model combines the first and second parts and averages their results for the final output. Because both the LSTM and AdaBoost models have low bias errors, the proposed L-Boost model also has low bias error. In this study, a number of supervised text classification algorithms were first applied to predict offensive Bengali text. Then, we employed the L-Boost ensemble architecture for detecting Bengali offensive text based on these text classification algorithms. For feature extraction, we used fastText [19], BERT [20], TF-IDF [21], and Word2Vec [22]. Machine classification algorithms such as support vector machine (SVM) [23], decision tree (DT) [24], random forest (RF) [25], and LSTM [18] were implemented, and their results were compared. These text classification methods are known as baseline methods, and the highest outcome was chosen as the benchmark. The BERT baseline methods were used in the proposed L-Boost model to boost the classification [26]. Finally, the L-Boost model was evaluated on a comprehensive Bengali hate speech dataset [12] and on the transliterated Bengali-English dataset [8], and it was compared with existing models. This research focuses on creating appropriate identification methods for offensive Bengali text on social media. The remainder of this paper is organized as follows. Section II presents the related works. The methodologies are described in Section III. The results of the analyses are presented in Section IV. Finally, Section V concludes the work.

II. RELATED WORKS
Offensive text identification is a relatively new area of natural language processing (NLP). Most work on offensive text detection has been done in English, and very few works exist in Bengali language processing. In this section, we first discuss some existing methods for identifying offensive texts in languages other than Bengali and then discuss existing works on the Bengali language.
Cavnar et al. [27] proposed an N-gram-based method for text categorization in which textual errors were tolerated. In this method, a large range of text categorization is possible, as the N-gram provides a trustworthy basis for frequency. However, this method is suitable only for ASCII text. Yarowsky [28] implemented an unsupervised method to sense the proper meaning of a sentence. Manual preparation is required to train on unprepared text using a supervised technique, which makes supervised techniques very slow. McCallum and Nigam [29] aimed to show the differences between the multivariate Bernoulli model and the multinomial model. The authors implemented these methods on a corpus of five text files and compared their performance. Hu et al. [30] assigned distinctive weights to the same word and considered the impact of these weights in a Bayes network. The authors assigned these weights to texts according to where they appeared in distinctive web page components, such as the body, title, etc. Yin et al. [31] proposed an n-gram-based supervised classification technique to identify abusive text.
Chen et al. [32] identified abusive words. They tried to improve machine learning methods using context-specific features, structural features, and style features, as well as lexical features, to detect offensive language. Their process can tolerate the informal intent of any English writing style. Sood et al. [33] analyzed social media comments and observed that the effect of the existing method was negligible.
After identifying social differences based on the tolerance of pornography, the authors proposed a more advanced method for detecting contempt. Yang et al. [34] implemented a new method in which embeddings can be properly used for mining new rules. Jurka et al. [35] offered comprehensive text categorization by interfacing preprocessed words and machine learning algorithms with a new analytical function. Marmol et al. [36] proposed a method using SNSs to prevent users from accusing reliable content. Kim [37] investigated sentence classification using convolutional neural networks (CNNs). Yadav and Manwatkar [38] proposed an approach to text filtration and classification using the Aho-Corasick string pattern matching algorithm with a dictionary of keywords to match. Profane words were prevented from being published, and semantically related words were ignored. Shashank et al. [39] proposed a text filtration module and created a database of abusive keywords; their research aimed to explain various text categorization methods. Nobata et al. [40] proposed a machine learning method to detect abusive speech in Internet users' comments. Recent deep-learning-based offensive text detection methods have benefited from their machine-learning approach. Chu et al. [41] identified and categorized abusive comments using CNN and LSTM deep learning methods and a neural language processing technique. They tested three models: a long short-term memory (LSTM) recurrent neural network (RNN) with word embedding, a convolutional neural network (CNN) with word embedding, and a CNN with character embedding. Wulczyn et al. [42] discussed the effectiveness of logistic regression and the multilayer perceptron for abuse classification. They also compared their methods with a human baseline.
Bengali language processing is under development, and no proper dataset is available for experimenting with offensive text identification methods. Because of this, very few studies have been conducted on offensive text identification. Ishmam et al. [43] proposed a gated recurrent neural network (GRNN) method to detect hateful Bengali language and achieved 70.10% accuracy on a 5K-sized dataset with six classes. They used approximately 900 documents per class to train the model. The main reason for the poor accuracy is that they did not define the classes correctly before training the hate text classifier. Eshan and Hasan [44] analyzed different machine classifiers, such as a support vector machine (SVM) with linear, polynomial, sigmoid, and radial basis function (RBF) kernels, multinomial naive Bayes (MNB), and random forest [51]. A few researchers [48], [49], [51] used fine-tuning of BERT transformers, whereas reference [50] exploits pre-training and fine-tuning during sequential text classification and showed better accuracy than other models. In [52], the authors used MNB, SVM, and CNN-LSTM with different kernels to identify threatening and abusive text in the Bengali language. Their proposed system shows low accuracy because of the limited dataset and the unavailability of pretrained word embeddings. From the above discussion, we can conclude that some studies have attempted to identify Bengali offensive text, albeit with small datasets. However, limited research has used Banglish datasets to identify offensive texts. Furthermore, most researchers performed baseline experiments with limited datasets, and the main limitation of baseline experiments is that they show low accuracy on limited datasets. Moreover, none of the studies focused on ensemble learning methods for Bengali offensive text identification.
In our study, our main target is to develop the L-Boost model and to train and test it using the available Bengali and Banglish datasets. We implemented AdaBoost in combination with the baseline classifiers and reconstructed it by applying bidirectional encoder representations from transformers with long short-term memory models.

III. METHODOLOGY
In this section, we propose an offensive text detection process using social media datasets and different classification algorithms. First, we applied some baseline methods, such as a support vector machine (SVM) with a linear kernel, a decision tree (DT), and a random forest (RF), to train and test the model for offensive text classification. These baseline methods are used when creating the offensive text detection ensemble methods. Figure 1 shows the step-by-step offensive text detection process using the baseline architecture. The TF-IDF vectorizer, word2vec model, fastText, and BERT pre-trained embeddings were used for feature extraction and word matrix generation for text processing. We also use the popular boosting ensemble method (AdaBoost) with the baseline classifiers. We then discuss the proposed methodology: we modify the boosting (AdaBoost) method using bidirectional encoder representations from transformers with long short-term memory (LSTM) models. We use the BERT transformer to focus on positional word embedding and custom fine-tuning during model training. The symbols used in this study are shown in Table 3.

A. DATA COLLECTION
Data collection from various social networking platforms can be done automatically, rather than manually compiling a list of sentences to represent offensive text in Bengali. We used several Bengali websites, such as prothomalo.com and banglanews24.com, and social media platforms to collect offensive data. Figure 2 shows the data crawling process, in which the BeautifulSoup [5] Python library is used to crawl all hashtag- and emoji-related posts and comments from different social media. We collected 16,800 posts, comments, and memes from different Bengali websites, blogs, and various social media platforms. We collected all comments and posts under a fixed-length size. We removed all permalinks, dates, times, and user details to ensure greater accuracy and privacy. We manually labeled the crawled datasets to identify the offensive texts used to harass people. Table 4 shows the data collection statistics after crawling the 16,800 posts. The dataset contains 4,463 Bengali comments and 2,480 Banglish comments. In total, there were 2,932 offensive and 4,011 non-offensive comments among the Bengali and Banglish datasets. Of the whole dataset, 30% was set aside for testing, and the rest was kept for training.
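The comment-extraction step of this crawling pipeline can be sketched as follows with BeautifulSoup, assuming `beautifulsoup4` is installed. The CSS selector and the fixed-length cap are hypothetical placeholders; real sites need site-specific selectors, and the page-fetching step is omitted here.

```python
from bs4 import BeautifulSoup

def extract_comments(html, max_len=280):
    """Return comment texts no longer than max_len from a crawled page."""
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for node in soup.select("div.comment"):   # hypothetical selector
        text = node.get_text(strip=True)
        if text and len(text) <= max_len:     # fixed-length size, as above
            comments.append(text)
    return comments
```

In practice, each extracted comment would then be stripped of permalinks, dates, times, and user details before labeling.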

B. DATA SET PREPROCESSING
Text preprocessing is the most distinctive part of automated text classification. Typical text preprocessing includes handling punctuation, commas, and case conversion, but in addition to these tasks, each language has its own syntactic and grammatical structures that need to be processed using language-specific methods and rules. The following subsections discuss the Bengali text preprocessing steps in detail.

1) Emoji and Emoticon Conversion
Text processing is very important for identifying offensive content because there are multiple variations in the dataset, including emojis, emoticons, and words. Comments on different social networks frequently use different types of emojis and emoticons. Table 5 shows the top emojis and emoticons used in both offensive and normal text. Most people use different types of emojis and emoticons when commenting on social platforms to express their feelings and attitudes. Emojis and emoticons are very important criteria for identifying offensive comments. For example, #sexy_baby is an offensive text that can be identified primarily based on the female genitalia emoji symbol. The meaning of an emoji depends on language, culture, and location [55]. Some emojis have different meanings in Bengali than in English. For example, the "rooster" emoji in English usually refers to the bird, but in Bengali, the contextual meaning refers to male genitalia.
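The emoji and emoticon conversion step can be sketched as a simple lookup-and-replace. The mapping below is a tiny illustrative sample; the paper's actual emoji-to-meaning table (Table 5) is far larger and, as noted above, language- and culture-dependent.

```python
# Hypothetical sample mapping; the real table covers many more symbols,
# with Bengali-specific contextual meanings.
EMOJI_MAP = {
    "🐓": "rooster",   # contextual Bengali meaning differs from English
    "😂": "laughing",
    ":-)": "smile",
    ":(": "sad",
}

def convert_emoji(text):
    """Replace known emojis and emoticons with textual tokens."""
    for symbol, meaning in EMOJI_MAP.items():
        text = text.replace(symbol, f" {meaning} ")
    return " ".join(text.split())
```

The resulting textual tokens then pass through the same feature extraction pipeline as ordinary words.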

2) Hashtag Segmentation
Hashtags are frequently used when commenting on or discovering a post on social platforms such as Facebook and Twitter. Considering all hashtag texts is very effective in identifying offensive text. Therefore, during preprocessing, we remove all hashtag symbols (#) and keep the hashtag text for future use. A hashtag can be in English or Bengali format, but we translate all hashtag text into the Bengali language. For example, if a commenter uses the #Boycott hashtag in English, we first remove the (#) symbol and then translate it to Bengali: #Boycott => Boycott => বয়কট. We then store "বয়কট" for understanding and processing in the next step.
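The symbol-stripping part of this step can be sketched as below; the subsequent English-to-Bengali translation requires an external service and is omitted. Splitting underscore-joined hashtags (such as #sexy_baby from the previous subsection) into separate words is included as an assumption.

```python
import re

def segment_hashtags(text):
    """Strip '#' and split underscore-joined hashtag words, keeping the text."""
    def repl(match):
        return match.group(1).replace("_", " ")
    return re.sub(r"#(\w+)", repl, text)
```

Because `\w` matches Unicode word characters, the same function handles hashtags written in Bengali script.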

3) Miscellaneous
We use other common text preprocessing methods, such as converting numbers into words and removing punctuation, white space, accent marks, stop words, etc. Unlike English, the Bengali language has no fixed stop word list; the list depends on the specific task. For example, in sentiment analysis we can use "যান, সিহত, িদেছ, িদেয়েছ, িদেলন, স্পষ্ট, উপের, উপর, দু িট, দু েটা, েদওয়ার, হেচ্ছ , েদখা, করার, কের, কির, েবশ," etc. as stop words, but for offensive text detection we cannot use these words as stop words because they are used to form offensive sentences. For example, in "েস বিবেক উপের েরেখ কের িদেছ। " (He puts Bobby on top and fucks him.), an offensive sentence, these words are the main words of the sentence. We manually identified all stop words that were not part of offensive content and used Python libraries and regular expressions to remove unwanted words and punctuation from the text. In our research, we converted all Banglish datasets into the corresponding Bengali format during preprocessing. GAMITISA (https://gamitisa.com/tools/banglish-bangla.php) was used for Banglish-to-Bengali transliteration. GAMITISA is a free online tool with a rich Bengali and Banglish vocabulary that converts Banglish words into the corresponding Bengali words.
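This cleaning step can be sketched as below. The stop-word set shown is a hypothetical placeholder; as discussed above, the real list was built manually so that words needed to form offensive sentences are kept out of it.

```python
import re

STOP_WORDS = {"ebong", "kintu"}   # hypothetical task-specific stop words

def clean_text(text):
    """Remove common punctuation (incl. the Bengali danda) and stop words."""
    text = re.sub(r"[.,!?;:\"'()#।]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

An explicit punctuation character class is used rather than `[^\w\s]`, so Bengali combining vowel signs are never accidentally stripped.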

4) Data Labelling
The data labeling process is shown in Figure 3, where eight graduate students from the Advanced Machine Learning (AML) lab at Bangladesh University of Business and Technology (BUBT) manually labeled all the data. The data labels were checked by the expert team of the AML lab. Professors, Ph.D. students with more than five years of experience, and researchers with extensive knowledge of the Bengali language processing area are considered experts in the AML lab. Table 7 presents some examples of sample datasets, the processed datasets after preprocessing and translation, and the labeling of each processed sentence.

C. FEATURE ENGINEERING
The input values of each text classification model must be a number vector representing a text document. Encoding the words of a text document as floating-point or binary values in fixed-length vectors is known as feature extraction or vectorization. In this study, we used TF-IDF [21], Word2vec [22], fastText [19], and BERT [20] embeddings to extract features.

1) TF-IDF vectorizer
When processing huge text datasets, some words occur repeatedly but carry little necessary information. These words usually overshadow more important but less frequent words during text processing and calculation. The TF-IDF feature extraction technique resolves this problem. It can be defined as follows:

tf-idf(t, d) = tf(t, d) × idf(t)

where tf(t, d) is the term frequency of term t in document d, and idf(t) is the inverse document frequency of term t.
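The tf-idf score can be sketched in plain Python as below, taking the common form idf(t) = log(N / df(t)); note that production vectorizers (e.g. scikit-learn's) add smoothing and normalization on top of this basic formula. The toy corpus is purely illustrative.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf(t, d) = tf(t, d) * idf(t), with idf(t) = log(N / df(t)).

    Assumes the term occurs in at least one corpus document (df > 0).
    """
    tf = Counter(doc)[term] / len(doc)                # term frequency in d
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical toy corpus of tokenized comments.
docs = [["bad", "word"], ["good", "word"], ["bad", "bad", "text"]]
```

A word appearing in every document gets idf = 0 and is thus suppressed, which is exactly how frequent-but-uninformative words are down-weighted.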

2) Word2vec
Word2Vec determines the vector of every term in such a way that similar texts have related vectors, so it can be used in the semantic analysis of texts. It is a two-layer neural network whose input is a text corpus and whose output is a set of feature vectors for the words in the corpus. We used the Python scikit-learn library to implement these feature vectors.

3) FastText
In 2016, Facebook proposed the fastText word embedding model, an extension of the Word2Vec model that represents each word by its character n-grams, so that vectors can also be produced for out-of-vocabulary words.
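fastText's sub-word decomposition can be sketched as follows: a word is wrapped in boundary markers and split into character n-grams, and the word vector is the sum of the n-gram vectors. This is why misspelled or unseen Banglish words can still receive a vector. The default n-gram range (3 to 6) matches fastText's usual setting; the example word is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the sub-word character n-grams fastText would use."""
    word = f"<{word}>"                       # boundary markers, as in fastText
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams
```

An out-of-vocabulary word such as a misspelling still shares most of its n-grams with the correct form, so their vectors end up close.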

4) BERT
The Google AI research team developed bidirectional encoder representations from transformers (BERT), a deep-learning-based bidirectional unsupervised model for text processing. BERT is formed as a multilevel bidirectional transformer encoder; deep-learning-based transformers are used as encoders and decoders for translation purposes. A sample BERT diagram is shown in Figure 4, where E_1, E_2, ..., E_n are the inputs of the BERT model. The input can be a set of tokens, special symbols, words, etc. Several multilayer transformers follow the input level; these bidirectional transformers encode the input text and produce the corresponding output vectors. BERT is another popular pre-trained word-embedding model that is applied to create token embeddings. BERT supports token embedding, segment embedding, and different types of position embedding, as shown in Figure 5. In this study, we used the BERT positional embedding method to extract the Bengali and Banglish text features. The main objective of position embedding is to bind the positional relationships between words. The BERT transformer calculates attention by considering the weighted value (W_v), weighted query (W_q), and weighted key (W_k) matrices in each attention head. For example, let x ∈ N and y ∈ N be two positions, let WV_x be the word vector for the word at position x, let E_x be the embedding of position x, and let E_{x−y} be the embedding of the relative position. Then, the q, k, and v vectors for the word at position x can be calculated as follows:

q_x = W_q(WV_x + E_x), k_x = W_k(WV_x + E_x), v_x = W_v(WV_x + E_x)

The sum over all attention heads of the attention-weighted values gives the final result, where the attention weight depends on a = qk^T; with relative positions, the attention between positions x and y additionally involves E_{x−y}.

TF-IDF measures the importance of a word in a document and calculates the score of a particular word, whereas word2vec combines all the senses of a word to create only one vector.
Word2vec cannot handle words outside its vocabulary, whereas fastText fixes this constraint, and BERT uses attention-based positional encoding to find word vectors. The Bengali text in this article was processed using the fastText pre-trained Bengali word embedding vectors [19] and the BanglaBERT-base model [13], a pretrained language model constructed with BERT-based masked language modeling. All feature extraction methods use various n-gram features to calculate the word matrix, as shown in Table 9. To identify offensive text in a document and to find the most suitable model for detecting it, we used a variety of combinations of n-gram features with the feature extraction techniques.
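The word n-gram feature generation combined with these feature extraction techniques can be sketched as:

```python
def word_ngrams(tokens, n):
    """Return the word n-grams of a tokenized sentence as joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Uni-gram, bi-gram, and tri-gram features would then be produced with n = 1, 2, 3 and fed to a vectorizer such as TF-IDF.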

D. PROPOSED L-BOOST MODEL
Our proposed model has three parts. First, we applied BERT embedding with an LSTM-based deep neural network. Figure 6 shows the LSTM model for offensive text prediction, which consists of an input layer, a BERT embedding layer, two LSTM layers, a dense layer, and an output layer. The first LSTM layer produces the outputs of all hidden states, and the next LSTM layer produces the last hidden state as the output of the LSTM network. We would like to evaluate the focus of attention on word-level tasks, but BERT applies byte-pair tokenization, which means that certain words are divided into multiple sub-words as tokens. Therefore, we convert (token, token) attention maps into (word, word) attention maps. To find the attention to a split word, we sum the attention weights over its tokens; to find the attention from a split word, we average the attention weights of its tokens. These conversions preserve the property that the attention from each word sums to 1. LSTM guards against errors vanishing as they propagate backward through hidden layers, so learning continues across the steps of the training session. LSTM was constructed to train the long-distance dependence between hierarchical data. Each LSTM cell integrates input, forget, and output gates, which capture, cancel, adjust, and decide the next step in retrieving a fraction of the data. The LSTM cell decides when access permissions are read, written, and deleted as data are transferred or blocked across the LSTM unit through the gates, as illustrated in Figure 7. The functionality of the gates and the flow of information can be expressed using Equations 5-10.
ig_t = σ(W_i · [hv_{t−1}, x_t] + b_i) (5)
fg_t = σ(W_f · [hv_{t−1}, x_t] + b_f) (6)
og_t = σ(W_o · [hv_{t−1}, x_t] + b_o) (7)
c̃u_t = tanh(W_c · [hv_{t−1}, x_t] + b_c) (8)
cu_t = fg_t × cu_{t−1} + ig_t × c̃u_t (9)
hv_t = og_t × tanh(cu_t) (10)

where x_t is the input vector at time t, hv_t is the output vector, cu_t stores the state information, ig_t, og_t, and fg_t are the input, output, and forget gate vectors, W_* is the weight matrix of each gate, b is the shift (bias) vector, and σ is the activation function.

All the baseline classifiers produce reasonable output but with low accuracy scores, and these results must be refined to increase the accuracy; we discuss this issue further in Section IV. This means that the baseline methods cannot completely learn the entire set of features from a dataset, because the information is collected from different websites or sources, and a single model cannot capture the decision boundaries for all aspects. It may also indicate that uni-gram features alone are not sufficient to represent the latent signification of abusive or general text. Ensemble methods allow a combination of baseline classifiers and various other methods for sampling features during the training phase. The most well-known ensemble methods are bagging and boosting [57].
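The LSTM gating in Equations 5-10 can be sketched as a single cell step in NumPy. This is a minimal illustration using the paper's symbol names (ig, fg, og, cu, hv); the weights here are arbitrary, not the trained model's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, hv_prev, cu_prev, W, b):
    """One LSTM step: gates read [hv_{t-1}, x_t], then the cell and output."""
    z = np.concatenate([hv_prev, x_t])
    ig = sigmoid(W["i"] @ z + b["i"])        # input gate,      Eq. (5)
    fg = sigmoid(W["f"] @ z + b["f"])        # forget gate,     Eq. (6)
    og = sigmoid(W["o"] @ z + b["o"])        # output gate,     Eq. (7)
    cu_hat = np.tanh(W["c"] @ z + b["c"])    # candidate state, Eq. (8)
    cu = fg * cu_prev + ig * cu_hat          # cell state,      Eq. (9)
    hv = og * np.tanh(cu)                    # output vector,   Eq. (10)
    return hv, cu
```

The additive cell-state update in Equation 9 is what lets errors flow backward over long distances without vanishing.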
The boosting ensemble process improves accuracy by training weak learners sequentially. Boosting has both advantages and disadvantages: it can enhance classification performance, but training the model on large datasets takes time. The goal of the AdaBoost [14] boosting algorithm is to improve the results of the weaker classifiers by calculating the errors and updating the weights of the input data. In the ensemble process, a new classifier is added iteratively, and weak classifiers are identified from the dataset. Each newly added classifier then concentrates on the portion of the data for which the classifiers of the previous iteration performed poorly. Another popular ensemble method is bagging, which generates samples of a certain size from the original dataset. The bagging ensemble process trains classifiers on all dataset samples, estimates the class probabilities, and averages them. This procedure is run for each individual class on all baseline classifiers in the ensemble techniques.
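One AdaBoost round as described above can be sketched in plain Python. This is an illustrative sketch assuming binary ±1 labels and a weighted error strictly between 0 and 1; the paper's weak learners are the BERT-based base classifiers, which are represented here simply by their precomputed predictions.

```python
import math

def adaboost_round(weights, y_true, weak_preds):
    """Pick the lowest-weighted-error weak classifier and re-weight samples.

    weights: current sample weights; weak_preds: {name: predictions in {-1,+1}}.
    """
    errors = {name: sum(w for w, y, p in zip(weights, y_true, preds) if y != p)
              for name, preds in weak_preds.items()}
    best = min(errors, key=errors.get)             # lowest weighted error
    err = errors[best] / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)        # classifier weight
    preds = weak_preds[best]
    new_w = [w * math.exp(-alpha * y * p)          # up-weight mistakes
             for w, y, p in zip(weights, y_true, preds)]
    total = sum(new_w)
    return best, alpha, [w / total for w in new_w]
```

After T rounds, the strong classifier is the sign of the alpha-weighted sum of the selected weak classifiers' predictions.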
The second part of the proposed model is shown in Figure 8. AdaBoost with a custom BERT fine-tuning model with different strategies was applied to boost the model. In the weight-allocation phase, this model incorporates BERT into the boosting architecture. To include the pre-trained language encoder BERT in the boosting process, we use the fine-tuning weight-initialization approach to dynamically assign the transformer parameters whenever a new base classifier is added to the ensemble. The BERT-base architecture has an encoder composed of 12 transformer blocks, each with 12 self-attention heads. The MLP architecture is made up of multiple hidden layers that take each transformer encoder's output as the input layer. Since the base classifiers have varying confidence scores on certain tasks, we weight each base classifier's output to exploit this knowledge. A dense layer is then added on top of the last layer, where a softmax classifier is used to predict the probability of a specific label S.
Prob(S|h) = softmax(Th) (11)
where T is the parameter matrix for the specific task. We use AdaBoost with multiple fine-tuning approaches to adjust the parameters so as to maximize the probability of the specific label S. During fine-tuning, we consider several factors to capture different syntactic and semantic levels for offensive text classification.

1) Long text pre-processing
BERT supports a maximum sequence length of 512 tokens. To handle texts longer than this, we applied the following approaches in different iterations of the AdaBoost model.
Cutting approach: An article typically contains its key information at the beginning and end of the text. In the proposed model, we applied three ways of truncating text for BERT fine-tuning. Hierarchical approach: In this method, text longer than 512 tokens is subdivided into sub-texts; in our model, we split the text into t(subtext) = fulltext/510 chunks. Each sub-text is then fed into the BERT model for fine-tuning. Finally, several pooling methods, such as max pooling, mean pooling, and self-attention, are applied to aggregate all sub-text representations into a representation of the whole article.
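The hierarchical splitting and pooling steps can be sketched as follows. The helper names `split_into_subtexts` and `pool_subtext_vectors` are hypothetical, and the per-chunk embeddings below are numeric stand-ins for real BERT outputs.

```python
import numpy as np

MAX_TOKENS = 510  # 512 minus the [CLS] and [SEP] special tokens

def split_into_subtexts(tokens, max_len=MAX_TOKENS):
    """Hierarchical approach: break a long token list into BERT-sized chunks."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def pool_subtext_vectors(vectors, mode="mean"):
    """Aggregate per-chunk vectors into one article representation."""
    stacked = np.stack(vectors)
    if mode == "mean":
        return stacked.mean(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

tokens = list(range(1200))                 # stand-in for a 1200-token article
chunks = split_into_subtexts(tokens)       # 510 + 510 + 180 tokens
# Stand-in 768-d chunk embeddings (these would come from BERT in practice).
reps = [np.ones(768) * i for i in range(len(chunks))]
article_vec = pool_subtext_vectors(reps, "mean")
```

Mean pooling averages the chunk embeddings, whereas max pooling keeps the strongest activation per dimension; self-attention pooling would instead learn the weighting.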

2) The most effective layer selection
In the BERT transformer model, different layers capture different properties of the text during offensive text classification. We examined the usefulness of the features from distinct layers and fine-tuned the model using the layer whose features yielded the lowest test error rate.

3) Handle overfitting problem
We used different learning rates during fine-tuning to minimize the overfitting problem. We observed that, for the bottom layers, a lower learning rate (lr < 2.5e-5) performs best during BERT fine-tuning.
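A minimal sketch of assigning lower learning rates to lower layers, assuming a simple geometric decay from the top layer downward. The decay factor of 0.95 is an illustrative assumption, not a value from this work.

```python
def layerwise_learning_rates(n_layers=12, base_lr=2.5e-5, decay=0.95):
    """Return one learning rate per BERT layer (index 0 = bottom layer).

    The top layer gets base_lr; each layer below it is scaled down
    by `decay`, so the bottom layers train with the smallest rates.
    """
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

lrs = layerwise_learning_rates()
```

These per-layer rates would then be passed to an optimizer with parameter groups, so the pre-trained bottom layers change slowly while the task-specific top layers adapt faster.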
Generally, hybrid approaches perform better than single methods. Given the complementary strengths and strong prediction abilities of the two models, we merge them to obtain the highest prediction accuracy.
We refer to this model as the L-Boost model; Figure 9 illustrates the combined architecture. In the proposed architecture, the offensive texts are first predicted separately by the LSTM-BERT and AdaBoost-BERT models. The outputs of the two models are averaged to generate the final prediction results. The dataset sequence is then updated according to the final prediction results and used as the model input in the next iteration.
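The averaging step of the combined model can be sketched as follows. The probability values and the 0.5 decision threshold are illustrative assumptions, standing in for the two branches' real outputs.

```python
import numpy as np

def lboost_combine(p_lstm, p_ada, threshold=0.5):
    """Average the two branches' class probabilities and threshold
    the result to obtain the final offensive / non-offensive label."""
    p_final = (np.asarray(p_lstm) + np.asarray(p_ada)) / 2.0
    return p_final, (p_final >= threshold).astype(int)

# Hypothetical per-sample P(offensive) from the two branches.
p_lstm = [0.9, 0.2, 0.6]
p_ada = [0.8, 0.4, 0.3]
probs, labels = lboost_combine(p_lstm, p_ada)
```

Averaging gives both branches equal weight; a weighted average (e.g. favoring the stronger AdaBoost-BERT branch) would be a natural variation.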

IV. RESULT ANALYSIS
The main objective of this research is to test the effectiveness of various machine-learning algorithms combining different features. We present several graphical and statistical performance evaluation methods to discover a suitable model that can successfully classify offensive text. We set up our experiments on an Intel Core i5-6500 CPU machine with 16 GB of RAM running Ubuntu 18.04. We used Python 3.6.9 with TensorFlow 2.2.1 to implement all offensive text classification models, and pandas 1.0.3 and scikit-learn 0.22.2 to create the datasets for training and testing. While creating the training and test datasets, we randomly shuffled all the texts so that both subsets contained a mixture of offensive and non-offensive text. K-fold cross-validation was used to benchmark the datasets, with K = 5; hence, each fold divides the dataset into 80%-20% training-testing subsets. We implemented the model described in Section III.D, and the parameters and hyperparameters used in this model are listed in Table 10.

A. PERFORMANCE ANALYSIS PARAMETERS
We evaluated our system performance based on accuracy, precision, recall, and F1-score.

1) Accuracy(A)
Accuracy is the ratio of correctly classified samples to the total number of samples tested:
A = (TP + TN) / (TP + TN + FP + FN) (12)
where TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative.

2) Precision(P)
Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations:
P = TP / (TP + FP) (13)

3) Recall(R)
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class:
R = TP / (TP + FN) (14)

4) F1-score(F1)
The F1-score is the harmonic mean of precision and recall:
F1 = 2 × (P × R) / (P + R) (15)
This score therefore accounts for both false positives and false negatives. Intuitively, the F1-score is not as easy to interpret as accuracy, but it is usually more informative, especially when the class distribution is unequal. Accuracy works best when false positives and false negatives carry the same cost; when their costs differ greatly, it is preferable to consider both precision and recall.
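The metrics above follow directly from the confusion counts; a minimal sketch with hypothetical counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion counts for an offensive-text classifier.
a, p, r, f = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these counts, precision exceeds recall, so the F1-score sits between them; this is exactly the imbalance the F1-score is designed to summarize.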

B. BASELINE EVALUATION
A baseline test was performed using Python machine learning libraries. We used binary classification with three baseline classifiers and one deep neural network model for text classification: support vector machine (SVM), decision tree (DT), random forest (RF), and long short-term memory (LSTM). The SVM classification technique demonstrates excellent performance on small amounts of data. All feature vectors are arranged in a high-dimensional space, and SVM searches for a hyperplane that separates the space so that points of different classes lie on different sides; sometimes multiple candidate hyperplanes are evaluated to find the target hyperplane that maximizes the margin between classes. The decision tree classification algorithm is used in a wide variety of classification tasks; its strategy is a hierarchical partitioning of the data space.
The scikit-learn machine learning library was used for the baseline assessment. Different classifiers have different parameters, which depend on the size and form of the input data. max_features = n_features was used for the DT and RF classifiers, but the entropy criterion was used in DT, whereas RF used the Gini criterion. We used learning_rate = optimal and random_state = 0 for all classifiers. In this section, we discuss the performance analysis of all baseline classifiers in both statistical and graphical formats. The feature combinations are: F1 = Uni-gram + Bi-gram; F2 = Bi-gram + Tri-gram; F3 = Uni-gram + Bi-gram + Tri-gram; F4 = Quadri-gram; F5 = Tri-gram + Quadri-gram.
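The n-gram feature sets can be reproduced with scikit-learn's TfidfVectorizer by setting ngram_range; a minimal sketch with invented Banglish-style toy samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy samples standing in for real Banglish comments (invented for illustration).
texts = ["tumi khub kharap", "apni valo manush", "tumi valo na"]

# Feature set F1 (uni-gram + bi-gram) corresponds to ngram_range=(1, 2);
# F3 (uni- to tri-gram) would be ngram_range=(1, 3), F4 (quadri-gram only)
# would be ngram_range=(4, 4), and so on.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
n_docs, n_features = X.shape
```

The resulting sparse TF-IDF matrix (one row per comment, one column per n-gram) is what the baseline classifiers consume.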

1) Statistical Evaluation
We implemented all baseline classification algorithms for the three feature extraction models with five different types of features. Table 11 lists the different features with their corresponding values. We used several combinations of features from Table 9 and selected the top five features with the highest accuracy score. Table 12 shows the performance results of the baseline classification where each classifier is applied separately to the TF-IDF, Word2Vec, and fastText feature extraction methods.
The performance results of the RF classification algorithm are shown in Table 12.
The F1-score addresses the limitations of the accuracy metric and is therefore a more robust machine learning evaluation metric. Table 12 shows the F1-scores of all baseline classification algorithms for the three feature extraction techniques. For the DT classification algorithm, the highest F1-score for the offensive class is obtained with the F1 feature set using the fastText FE method, and the lowest F1-score comes from the F5 feature set with the Word2Vec FE method. For the SVM classifier, the F1-scores with the TF-IDF feature extraction method are better than with the other feature extraction methods.
In the LSTM classification algorithm, the F1-scores of the offensive and non-offensive classes are good for all n-gram features across the different feature extraction techniques. For the RF classification algorithm, the F1-score is highly unstable and is too low for the offensive class; the lowest F1-score was 78.01% for the offensive class. In contrast, the TF-IDF strategy has a maximum F1-score of 90.05% for the F4 feature and a minimum score of 33.35% for the F2 feature. The F1-scores of the different classifiers with the Word2Vec FE technique are shown in Table 12, where RF and SGD achieved the highest score of 94.85% and the MNB classifier with feature F2 achieved the lowest F1-score (55.0%).

2) Graphical Evaluation
Receiver operating characteristic (ROC) curves are often employed to graphically display the relationship between sensitivity and specificity across possible cut-off points for an experimental combination. Figure 10 presents the ROC curves of the baseline classifiers. To find the best feature extraction model, we applied the conventional machine learning algorithms to the whole sentence instead of the individual features to calculate performance. Among the four feature extraction techniques, BERT showed satisfactory scores for all types of classification algorithms, as shown in Figure 11. The fastText feature extraction technique performs well for some classifiers but gives poor results for others, indicating that it is not stable for offensive text classification. In addition, the Word2Vec feature extraction technique requires a large amount of time for big datasets, and its accuracy is poor compared to the TF-IDF method. For these reasons, our proposed method uses the BERT feature extraction technique to process large datasets and extract useful information from noisy datasets. To clarify, in this article we used different n-grams for the traditional machine learning models and the full text for the BERT-based model.
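An ROC curve and its area under the curve (AUC) can be computed with scikit-learn; the labels and scores below are hypothetical stand-ins for a classifier's P(offensive) outputs:

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical ground-truth labels and predicted P(offensive) scores.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# fpr/tpr trace out the ROC curve over all candidate thresholds;
# auc summarizes it as a single number in [0, 1].
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
```

Plotting tpr against fpr (e.g. with matplotlib) yields the curves of Figure 10; an AUC near 1 indicates a classifier that ranks offensive texts above non-offensive ones almost everywhere.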

C. ENSEMBLE EVALUATION
We chose the LSTM deep neural classification algorithm to build the proposed offensive text detection model, using the BERT FE technique with the LSTM classifier to reduce overfitting and underfitting. We evaluated our ensemble classification scheme with 10 repetitions of cross-validation, training all classifiers recurrently in this manner. We used the same dataset splits for testing and training as in the previous baseline classification, and the results were calculated after the 10 repetitions. Table 13 presents the performance results of the proposed model: LSTM-BERT achieves an accuracy of 89.45%, and AdaBoost-BERT achieves 92.16% on our dataset. After updating the dataset, the combined L-Boost model reaches an accuracy of 95.11%. Table 14 highlights the performance of several ML classifiers for identifying offensive language. The recall and precision scores of LR and SVM are comparable among the three classic ML classifiers. The recall score of the RF classifier is similar to that of LR and SVM, but its precision score is lower. It is worth noting that both LR and SVM yield lower recall scores than precision scores. The deep learning-based BiLSTM architecture achieves a lower F1-score than LR and SVM, which might be ascribed to the corpus's limited size (i.e., 3000 comments). In addition, Table 14 shows that L-Boost outperforms the other methods on the translated Bengali corpus by a good margin. Our proposed architecture first converts the entire dataset into the corresponding Bengali format and then applies various pre-trained feature extraction and classification methods to detect offensive text. Since our dataset contains both Bengali and Banglish words, we convert all Banglish words into the corresponding Bengali format before applying the feature extraction method.
However, when we tried to convert all the words of [8] into the corresponding Bengali format, the online tool could not convert some English words. For this reason, the L-Boost model shows better accuracy on our dataset than on the transliterated dataset of [8]. In addition, the transliterated Bengali-English dataset [8] is limited in size (i.e., 3000 comments) compared to our dataset (i.e., over 6000 comments).
The Friedman test [59], [60] and the Nemenyi post-hoc test [59], [60] were used for a more detailed comparison of the performance of the models mentioned above. Table 15 shows the ranking values of the different baseline models, and Table 16 shows the comparative ranking of the methods across different datasets. The Friedman test was used to compare the average rank of the various approaches. Equations 16 and 17 are used to calculate the Friedman statistic τ_F:
χ²_F = (12N / (k(k+1))) × (Σ_j R_j² − k(k+1)²/4) (16)
τ_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F) (17)
where N is the total number of datasets, k is the total number of methods, and R_j is the average rank of the j-th method. Using these equations, for the baseline classification we obtain τ_F = 6.6 with a p-value of 0.086, while α = 0.05. In the Friedman test, if p < α, the null hypothesis that all methods perform equally is rejected. Since p > α here, we cannot conclude that the performance of the baseline models differs significantly. Similarly, Friedman's statistic τ_F is 4.4 with p = 0.34 for all existing and proposed methods, which is also greater than α. The Nemenyi test was then used to further differentiate between the models. According to the Nemenyi test, the critical difference (CD) is defined as:
CD = q_α × sqrt(k(k+1) / (6N))
Figures 12 and 13 show the visual representation of the Friedman test and Nemenyi post-hoc analysis results, where the average rank of each technique is indicated along the horizontal line, flipped so that the lowest (best-performing) ranks are on the left and the highest (worst-performing) on the right. Our L-Boost was statistically the best of the examined approaches, as shown in Figure 13. In addition, according to the Nemenyi test, the performance of L-Boost and RF differed only slightly.
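The Friedman statistic can be computed directly with SciPy; the accuracy values below are hypothetical stand-ins for per-dataset scores of three methods:

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracy of three methods on four datasets
# (one list per method, one entry per dataset).
svm = [0.88, 0.85, 0.90, 0.86]
rf = [0.90, 0.87, 0.91, 0.88]
lstm = [0.93, 0.91, 0.94, 0.92]

# Ranks the methods within each dataset and tests whether
# the average ranks differ more than chance would allow.
stat, p_value = friedmanchisquare(svm, rf, lstm)
```

Here one method dominates on every dataset, so the ranks are maximally separated; if p_value falls below α, the null hypothesis of equal performance is rejected and a post-hoc test such as Nemenyi's can localize the differences.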
Finally, Table 17 shows a few examples from our test dataset for which the proposed L-Boost system predicted a sentence incorrectly. In Table 17, the third example is labeled as offensive because its contextual meaning indicates an offensive conversation; however, because the text contains no explicitly offensive words, L-Boost fails to detect it and predicts the wrong class.

V. CONCLUSION
We presented two types of machine learning experiments, baseline and ensemble analysis, for offensive text classification. In the baseline tests, we used different types of machine learning and deep learning text classification algorithms. Of all the classification algorithms, LSTM showed good overall performance for all features, and SVM showed average performance for all features. We used four feature extraction algorithms to extract useful features from the raw datasets; among them, BERT performed better than the others. We used five types of n-gram features and applied all the classification algorithms to the different feature sets with the different feature extraction methods. We then selected the LSTM classifier with BERT to reduce the underfitting and overfitting problems. For the ensemble tests, we used the AdaBoost algorithm with different types of BERT fine-tuning classification models.
We modified AdaBoost with the BERT transformer for ensemble classification, which shows excellent performance compared to other efforts in offensive text detection. We anticipate that this research can be further improved if some issues are addressed. First, the data used in case studies could be diversified further to achieve better results. Second, since people in different regions speak different varieties of the Bengali language, those phenomena need to be integrated. Finally, future efforts may target detecting offensiveness in other formats of Bengali content, such as images, PDFs, videos, or speech.