Enhancing Big Social Media Data Quality for Use in Short-Text Topic Modeling

With the emergence of microblogging platforms and social media applications, large amounts of user-generated data in the form of comments, reviews, and brief text messages are produced every day. Microblog data is typically of poor quality; hence, improving its quality is a significant scientific and practical challenge. Despite the relevance of the problem, little work has been done so far, especially with regard to microblog data quality for Short-Text Topic Modelling (STTM) purposes. This paper addresses this problem and proposes an approach called the Social Media Data Cleansing Model (SMDCM) to improve data quality for STTM. We evaluate SMDCM using six topic modelling methods, namely Latent Dirichlet Allocation (LDA), the Word-Network Topic Model (WNTM), Pseudo-document-based Topic Modelling (PTM), the Biterm Topic Model (BTM), Global and Local word embedding-based Topic Modelling (GLTM), and Fuzzy Topic Modelling (FTM). We used the Real-world Cyberbullying Twitter (RW-CB-Twitter) and the Cyberbullying Mendeley (CB-MNDLY) datasets in the evaluation. The results demonstrate that GLTM and WNTM outperform the other STTM models when the SMDCM techniques are applied, achieving the best topic coherence and high accuracy values on the RW-CB-Twitter and CB-MNDLY datasets.


I. INTRODUCTION
Microblogging platforms such as Twitter have emerged as the primary sources of big data [1], [2], [3], giving organizations access to previously unattainable opportunities to obtain vital intelligence that will guide their decisions and drive insights. Data quality in the big data context is a specific and critical problem, particularly with Twitter data [4]. In this regard, data cleansing is the most critical process in maintaining the quality of data. It is the most time-consuming step in any text mining process and negatively affects the accuracy of the data if done improperly. Despite the growing literature on the use of Twitter data for various applications [1], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], most of the extant work has concentrated on mining and classification of Twitter data (i.e., tweets) while the quality of the data is mostly overlooked [1]. (The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.)
High-quality data is a prerequisite for data-driven applications such as predicting flu trends from Twitter data [6], sentiment analysis to assess Airport Service Quality (ASQ) [7], multimodal sentiment analysis [9], and analyzing and capturing tourist activities [15], as it guarantees the quality of the analysis outcome. However, issues with Twitter data quality continue to be a serious and challenging research concern [4], [16]. Conventional data cleansing techniques focus on stop-word removal, plural words, and frequent words. These traditional cleansing methods are not suited to microblogging datasets and may even degrade data quality. This is because tweets contain many anomalies, data sparsity problems, and noise, such as slang, typos, repeated characters in a word (elongation), complex spelling errors, poor structure, concatenated words, unconventional usage of acronyms, diversified abbreviations of the same word, short document lengths, and varying grammatical structures, and they are written in informal language compared to long, conventional documents. Therefore, ensuring the quality of data collected from microblogging sources is an important research and practical issue that has not yet been adequately addressed [1], [4], [17].
Research in data cleaning has been undertaken in applications such as RFID (Radio Frequency Identification) [18], [19] and online movie reviews [4], [20]. In the context of RFID, the aim is to make the tag read rate as close to the real one as possible by applying data deduplication techniques. In the Topic Modeling (TM) context, which is our interest in this paper, converting slang and acronyms to normal text, removing repeated characters in a word, splitting concatenated words into single words, and stemming are the most significant social media data cleansing techniques; they are utilized to reduce the feature space, simplify the task, and improve short-text topic discovery performance. Moreover, the majority of existing research that examines the impact of preprocessing approaches focuses on balanced, small-sized datasets. In most sentiment analysis work, Twitter data cleansing is largely ignored, as the focus is mainly on the extraction of new sentiment features [21].
The work presented in this article addresses the problem of data quality issues in Twitter datasets for use in short-text topic modeling methods. High-quality data is necessary for topic modeling: topic modeling methods perform poorly when little or no data cleansing is performed [22]. The noisy short texts, data sparsity, and scarcity of word co-occurrences in Twitter data pose considerable challenges to topic modelling methods [23]. To address this problem, we propose a Social Media Data Cleansing Model (SMDCM) to increase the quality and accuracy of social media data for use with topic modeling methods. The proposed SMDCM can rectify a wide range of abnormalities such as typos, slang, complex spelling errors, contractions, emojis, emoticons, repeated characters in a word (elongation), concatenated words, unconventional usage of acronyms, and diversified abbreviations of the same word. The overall contributions of this article can be stated as follows:
• A specific and detailed literature review and comparison of short text topic modelling (STTM) using six models: LDA, WNTM, BTM, PTM, GLTM, and FTM.
• A new framework for a data cleansing model specifically tailored to social media datasets for use in topic modelling is proposed.
• We investigate the effects of the Twitter data cleansing method on the performance of short-text topic modelling methods.
• We validated the framework through extensive experiments using two real-world social media datasets on six short text topic modelling algorithms in terms of purity, NMI, accuracy, and topic coherence.
• Extensive experiments are conducted under various scenarios to evaluate the quality of the data as well as the quality of the topics, utilizing different short text topic modelling algorithms in terms of purity, NMI, accuracy, and topic coherence.
• We present the overall performance improvement rate of the short text topic modelling models with the proposed data cleansing approach as compared to baseline techniques.

The rest of this paper is structured as follows: Section II introduces the problem formulation and related work regarding preprocessing. Section III presents a review of short text topic modeling models. The proposed methodology and mathematical models are described in detail in Section IV. The details of the experimental analysis and results are discussed in Section V. A conclusion with key findings is given in Section VI.

II. PROBLEM FORMULATION AND RELATED WORKS
This section formulates the problem and presents the related works. The problem formulation is presented in the upcoming sub-section.
A. PROBLEM FORMULATION
The data cleansing operations return a clean social media post by removing all these noises from every token. When the operations are applied to each social media post in the dataset D, the result is a set of clean, modified social media posts denoted D'. The short text topic modelling model T_model represents the dataset D' as a set of k topics T = {t_j | 1 ≤ j ≤ k} present in D'. Each social media post D_i is utilized by the T_model model to generate the T topics, subject to the following constraints. In our study, we utilize data cleansing techniques to demonstrate their significance and effects on short text topic modelling over social media data.
Here, SMDCM indicates the social media data cleansing techniques and T_model denotes a short text topic modelling algorithm. Let Q_1 denote the optimal data quality of the social media posts generated by the SMDCM model, and Q_2 the optimal quality of topics discoverable by T_model. Also, let A denote the optimal accuracy of topics, R the optimal recall of topics discoverable by T_model, and P the optimal precision of topics discovered by T_model. Constraint (2) stipulates that the size of every social media post must not be less than S_min and must not be more than S_max; this constraint is formulated specifically for social media tweets. In our case, the minimum size of a short text, S_min, can be set depending on the quality of the social media posts received, and S_max is 280 characters, including blank spaces. Constraint (3) deals with the optimality of the quality of the social media data generated by SMDCM, which enhances the quality of the extracted topics. Constraint (4) deals with the optimality of four measures: the quality of the extracted topics in terms of topic coherence, recall, precision, and accuracy.
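The formulation above can be summarized compactly as follows (a hedged reconstruction sketched from the textual description of constraints (2)-(4); the paper's original notation for equations (1)-(4) is assumed, not quoted):

```latex
% Notation: D = raw posts, D' = SMDCM(D) = cleansed posts,
%           T = T_model(D') = {t_j | 1 <= j <= k} = extracted topics.
% Objective (1): jointly maximize data quality and topic quality
\max_{\mathrm{SMDCM},\; T_{model}} \; \bigl(Q_1(D'),\, Q_2(T)\bigr)
% Constraint (2): bounded post length
\text{s.t.}\quad S_{\min} \le |D_i| \le S_{\max} \quad \forall D_i \in D'
% Constraint (3): cleansing must not degrade data quality
Q_1(D') \ge Q_1(D)
% Constraint (4): optimality of coherence, accuracy, recall, and precision
\max \bigl(\mathrm{Coherence}(T),\, A(T),\, R(T),\, P(T)\bigr)
```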

B. RELATED WORK
STTM process mainly includes two phases: preprocessing (social media data cleansing) and topic modelling. This section briefly reviews the existing STTM models based on these phases.

1) PREPROCESSING
Many researchers have investigated the effects of data cleansing and preprocessing on text classification [24]. AL-Ghuribi et al. [20] investigated the impacts of different preprocessing methods, such as handling negation words, stopwords, and word occurrence counts, on constructing a domain-based lexicon for unbalanced reviews and computing the total review sentiment score. Zin et al. [25] investigated the effectiveness of three preprocessing techniques in Sentiment Analysis (SA): stopword removal; eliminating stopwords together with meaningless words; and eliminating words shorter than three characters, numbers, meaningless words, and stopwords. Sentiment analysis faces major challenges related to data quality [26], [27], and in most sentiment analysis work, Twitter data cleansing is largely ignored, as the focus is mainly on the extraction of new sentiment features [21]. Murshed et al. [28] investigated the effects of data cleansing on sentiment analysis performance. Krouska et al. [29] applied five preprocessing methods to investigate their impacts on sentiment analysis performance. Sun et al. [30] proposed preprocessing techniques covering URL, punctuation, number, and stop-word removal, tokenization, contraction expansion, and lemmatization. Duwairi and El-Orfali [31] studied the impact of various pre-processing techniques, such as n-gram models and feature correlation, on Arabic text sentiment analysis. Other works studied the impact of stemming on Arabic text classification performance, such as [32], [33], [34], and [35].
Topic identification and topic discovery also face major challenges related to data quality [26]. Three prominent heuristic mechanisms have been used to mitigate the data sparsity issue. The first mechanism is to aggregate short texts into pseudo-documents. This mechanism is widely utilized on social media text data, but it is also extremely data-reliant. To this end, Mehrotra et al. [36] aggregated tweets into macro-documents in the preprocessing phase based on pooling schemes (author, hashtags, and burst-score), Hong et al. [37] aggregated all the posts or short texts that contain the same term or word, and Weng et al. [38] aggregated all the tweets generated by the same twitterer (user). Once the pseudo-documents have been generated, traditional TM approaches, such as LDA, are applied to learn and discover more prevalent and significant topics from the richer contexts of the aggregated tweets or short texts. Nonetheless, additional information like hashtags or authorship is not always available in real-world applications. The second mechanism is to extend TM by adding strong assumptions about short-text documents. Some works, like Lakkaraju et al. [39] and Zhao et al. [40], suppose that each post or short text is a blend of unigrams drawn from a single topic, while other models attempt to leverage the rich global word co-occurrence patterns to infer hidden topics, such as BTM [41] and PTM [42]. The BTM [41] model discovers the latent topics in Short Texts (STs) by generating word co-occurrence patterns (biterms). Pseudo-document-based Topic Modelling (PTM) and Self-Aggregation-based Topic Modelling (SATM) are the most common self-aggregation models. SATM [43] assumes that each ST is a sample drawn from a hidden long pseudo-document and merges them automatically, using Gibbs sampling [44] for topic extraction; however, it suffers from over-fitting and is computationally expensive.
PTM is another model suggested by Zuo et al. [42] for short texts. Here, the pseudo-document concept implicitly combines short texts to address the data sparsity and over-fitting issues. Most of these models were developed to mitigate the sparsity issue.
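As an illustration of the first (aggregation) mechanism, hashtag pooling can be sketched in a few lines (a hypothetical simplification of the pooling schemes of Mehrotra et al. [36]; the function name and the rule of treating `#`-prefixed tokens as hashtags are our own assumptions):

```python
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Group tokenized tweets into pseudo-documents, one per hashtag.

    `tweets` is a list of token lists; tokens starting with '#' are
    treated as hashtags. Tweets without hashtags fall into '#none'.
    """
    pools = defaultdict(list)
    for tokens in tweets:
        hashtags = [t for t in tokens if t.startswith("#")] or ["#none"]
        for tag in hashtags:
            # The non-hashtag words join the pseudo-document for this tag.
            pools[tag].extend(t for t in tokens if not t.startswith("#"))
    return dict(pools)

pools = pool_by_hashtag([
    ["stop", "#bullying", "now"],
    ["#bullying", "hurts", "people"],
    ["nice", "weather", "today"],
])
# Tweets sharing '#bullying' merge into one longer pseudo-document,
# giving LDA a richer context than either tweet alone.
```

A conventional topic model such as LDA would then be run on the pooled pseudo-documents instead of the raw tweets.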
This study takes into account several social media data cleansing/preprocessing techniques and concentrates on unsupervised STTM instead of supervised SA and text classification tasks. Denny and Spirling [45] investigated the effectiveness of preprocessing techniques for text classification and topic modelling on political text datasets. However, their investigation of topic modelling covered only Latent Dirichlet Allocation (LDA). Besides, the dataset they utilized, about 2,000 documents, is smaller than the ones we used in our study. The key aim of the authors' research was to study and analyze the differences between supervised and unsupervised learning on political text datasets. Other papers studied the impact of preprocessing on the performance of topic modelling over long texts such as speeches and newspapers. Schofield et al. [46] analyzed the effectiveness of a single preprocessing technique, stopword removal from the corpus, before conducting topic modelling. This analysis is informative; however, the authors evaluated only one preprocessing method, only on newspaper text, and social media data was not investigated in their study. Churchill and Singh [47] suggested a standardized pre-processing approach for topic modelling over social media data and showed its influence and usefulness on topic modelling with various social media data.
Compared to the existing works, this research provides an in-depth analysis of various social media data cleansing models and investigates their effectiveness and usefulness for short text topic modeling algorithms. We conduct extensive experiments on two real-world social media cyberbullying datasets, the RW-CB-Twitter dataset and the CB-MNDLY dataset, and evaluate the topic quality and data quality on these short, noisy, and sparse cyberbullying datasets for each scenario. Six short text topic modelling algorithms are utilized: LDA, BTM, WNTM, PTM, GLTM, and FTM. They are assessed in terms of topic coherence and two external evaluations: short text clustering evaluation, including purity and NMI, and short text classification evaluation in terms of accuracy.

III. SHORT TEXT TOPIC MODELING
This section presents a review of short text topic modeling techniques. Numerous models in the literature address topic modelling problems, based on our previous taxonomy and survey [48]. Some of them concentrate on long text datasets and are called traditional long text topic modelling models. These models, such as LDA [49], Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and Probabilistic LSA (PLSA), are well-known unsupervised generative models for extracting the hidden topics from long text datasets. LDA has inspired the vast bulk of other generative TM approaches, such as Twitter-LDA [40] (an extension of LDA), Authorless Topic Models (ATM) [50], and Dynamic Topic Models (DTM) [51]. As the restrictions of LDA impact its performance on short texts, the research community shifted toward modifying conventional LDA. To this end, Chen and Kao [52] suggested an approach to enhance the performance of topic modelling utilizing Re-Organized LDA (RO-LDA), which addresses LDA's scarcity of local word co-occurrences; the limitation of this model lies in its treatment of redundant data. A model named Time-Sensitive Variational Bayesian inference LDA (TSVB-LDA) was suggested by Fang et al. [53] to discover latent trending topics with high accuracy; however, TSVB-LDA has shortcomings in the inference of news tweets. The Corpus-based Topic Derivation (CTD) model was developed by Sharath et al. [54]; it integrates Timestamp-based Popular Hashtag Prediction (TPHP) and Latent Feature-LDA (LF-LDA) utilizing an asymmetric topic model to extract Twitter hidden topics based on corpus semantics. Ni et al. [55] introduced a hot event detection approach utilizing Background Removal LDA (BR-LDA), which eliminates background words from short text tweets.
All these models suffer from data sparsity problems with short text datasets due to the scarcity of word co-occurrences in STs.
As a result of the scarcity of word co-occurrences in short texts, many other STTM models have been suggested to extract and reveal the latent topics from short text datasets. In this research, we focus on the most widely used models for short text datasets, known as STTM. Most of these models were suggested to alleviate the data sparsity problem. Three prominent heuristic mechanisms have been used to mitigate the data sparsity problem. The first mechanism is to aggregate STs into pseudo-documents. This mechanism is widely utilized on SM text data, but it is also extremely data-reliant. To this end, Mehrotra et al. [36] aggregated tweets into macro-documents in the preprocessing phase based on pooling schemes (author, hashtags, and burst-score), Hong et al. [37] aggregated all the short text posts that contain the same term or word, and Weng et al. [38] aggregated all the tweets generated by the same twitterer (user). Once the pseudo-documents have been generated, traditional TM approaches, such as LDA, are applied in order to learn and discover more prevalent and significant topics from the richer contexts of the aggregated tweets or STs. Nonetheless, in real-world applications, additional information like authorship or hashtags is not always available. The second mechanism is to expand TM by adding strong assumptions about ST documents. Some works, like Lakkaraju et al. [39] and Zhao et al. [40], suppose that each post or ST is a blend of unigrams drawn from a single topic.
In contrast, other models attempt to leverage the rich global word co-occurrence patterns to infer hidden topics, such as BTM [41] and PTM [42]. BTM [41] is one of the well-known global word co-occurrence based models. BTM learns the hidden themes in STs by modelling the generation of biterms directly in the dataset. A biterm is a pair of unordered terms/words that appear together (co-occur) in a post/short text. The main idea is that if two terms or words co-occur more frequently, they are more likely to belong to the same topic. Zuo et al. [56] suggested a method named WNTM for clustering topics from imbalanced and short texts. WNTM is a novel model that simultaneously addresses the data sparsity and imbalance problems. However, WNTM is unable to express the underlying meaning between words due to a lack of semantic distance metrics. In addition, WNTM incorporates a large amount of irrelevant data in the word-word space. As an extension of WNTM, Wang et al. [57] introduced a model called Robust WNTM (R-WNTM), which filters the unrelated data during the sampling process, since the proportion of irrelevant data in WNTM's word-word space building procedure is high. Jiang et al. [58] suggested WNTM with Word2Vector (WNTM-W2V) to discover deep meaning among words, increasing the accuracy of the relationships among words as well as improving topic coherence. Wu et al. [59] introduced a clustering method for short texts based on the (BG & SLF-Kmeans) method. In addition, a novel approach named Noise BTM Word Embedding (NBTMWE) was suggested by [60] to resolve the data sparsity problems; this approach integrates the noise BTM and word embeddings from external datasets to improve topic coherence.
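The biterm idea underlying BTM can be made concrete in a few lines (a sketch of biterm extraction only; BTM's actual Gibbs sampling over topic assignments is not shown, and the function name is our own):

```python
from itertools import combinations

def extract_biterms(tokens):
    """Return the set of unordered word pairs (biterms) in a short text.

    BTM models topics over these pairs across the whole corpus rather
    than over individual documents, which eases the sparsity problem.
    """
    # Deduplicate tokens, then take every unordered pair; sorting each
    # pair makes ('talk', 'white') and ('white', 'talk') identical.
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}

biterms = extract_biterms(["white", "people", "talk", "people"])
# {('people', 'talk'), ('people', 'white'), ('talk', 'white')}
```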
Another class of short text topic modelling is the self-aggregation models. PTM and SATM are the most common self-aggregation models. SATM [43] assumes that each ST is a sample drawn from a hidden long pseudo-document and merges them automatically, using Gibbs sampling [44] for topic extraction; however, it suffers from over-fitting and is computationally expensive. PTM is another model proposed by Zuo et al. [42] for short texts; the pseudo-document concept implicitly combines short texts to address the data sparsity and over-fitting problems. Besides, the authors proposed another model named Sparsity-enhanced PTM (SPTM), employing the spike-and-slab prior method to eliminate unwanted correlations among the pseudo-documents. An extension of the PTM model called Word Embedding-enhanced PTM (WE-PTM) was developed by Zuo et al. [61] to leverage pre-trained word embeddings, which alleviates the data sparsity problem. Feng et al. [62] proposed a User group-based Topic-Emotion model (UGTE) for topic extraction and emotion detection, which mitigates the data sparsity problems by aggregating the STs of a group into long pseudo-documents. Most of the previous work considered the data sparsity problem; however, it did not consider the sensitivity of word order in short texts.
Moreover, Dirichlet Multinomial Mixture (DMM) based models were proposed to extract and detect the hidden topics from STs, and many studies incorporating DMM models for STTM followed. Yin and Wang [63] suggested a Gibbs Sampling algorithm for DMM (GSDMM), which used DMM for short text topic clustering and achieved higher efficiency. They also developed a Fast GSDMM (FGSDMM) [64], which adapted an online clustering method for initialization. An improved DMM model called Poisson DMM (PDMM) was proposed by Li et al. [65]; it models the topic number as a Poisson distribution with auxiliary word embeddings. An efficient topic model named GPU-DMM was proposed by Li et al. [66] for short texts. GPU-DMM enhances the semantic relatedness of words under the same topic during the sampling process of DMM by utilizing the Generalized Polya Urn (GPU) method. These models outperform both the plain DMM and the individual PDMM methods but also involve high computation costs. A model called Collaboratively Modeling and Embedding DMM (CME-DMM) was proposed by Liu [67] for capturing coherent hidden topics from STs. All these models were suggested for topic modeling over short texts.
In this research, we evaluate the influence of the Social Media Data Cleansing Model (SMDCM) utilizing the most prominent short text topic models, namely LDA [49], WNTM [56], BTM [41], PTM [42], GLTM [68], and FTM [69], in terms of topic coherence, purity, NMI, and accuracy. LDA is the conventional and ubiquitous topic model, and PTM is the prominent model among the self-aggregation models. BTM, WNTM, and GLTM are chosen to represent the global word co-occurrence based methods, whereas FTM is a clustering-based topic model that extracts and discovers the latent topics from short text datasets from a fuzzy-concept perspective.

IV. PROPOSED METHODOLOGY
The proposed methodology of this research includes the following stages: (I) the Social Media Data Cleansing Model (SMDCM); (II) feature extraction using different techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), Global Vectors (GloVe), and Bag of Words (BoW); and (III) applying various short text topic modeling algorithms: LDA [49], BTM [41], WNTM [56], PTM [42], GLTM [68], and FTM [69]. These algorithms are adopted to extract and discover latent topics from social media short text datasets. The workflow of the suggested framework is depicted in Figure 1. As shown in the figure, the input of the framework is the social media short text, which goes through the stages mentioned above to extract and discover the topics using the short text topic modelling (STTM) algorithms. The results are then evaluated using different performance metrics. The findings of the current research are used to better understand how various data cleansing techniques impact the performance improvement rate (PIR) of STTM algorithms. The SMDCM stage comprises four sub-stages (as depicted in Figure 1-B): (1) filtering short texts, (2) noise elimination, (3) Out of Vocabulary (OOV) cleaning, and (4) post transformation. The aim of these data cleansing and pre-processing stages is to reduce the dimensionality of social media posts. Thus, informal posts are converted, as far as possible, into formal posts using the proposed SMDCM. Subsequently, the reduced posts are fed into the representation techniques. The following sub-sections present each of these sub-stages in detail.

A. SOCIAL MEDIA DATA CLEANSING MODEL (SMDCM)
1) FILTERING AND EXTRACTING POSTS
Filtering is the first stage of the proposed SMDCM model; it retains only English posts and discards posts in other languages from the dataset for further analysis. We utilized the ''Tweepy'' package for extracting the tweets from the Twitter platform via the Twitter streaming API. In this stage, re-tweets are also filtered out, and duplicate tweets are removed utilizing Regular Expression (RegEx) methods, which seek hyperlinks in the posts and remove the duplicates.
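The retweet and duplicate filtering described above can be sketched as follows (a minimal sketch with hypothetical inputs; the Tweepy streaming setup and language detection are omitted, and the exact RegEx patterns of the paper are assumptions):

```python
import re

def filter_posts(raw_tweets):
    """Keep only original, unique tweets from a list of tweet texts."""
    seen, kept = set(), []
    for text in raw_tweets:
        if re.match(r"^RT\s@", text):
            continue                                    # drop re-tweets
        # Compare tweets with hyperlinks stripped, so the same text
        # shared under different shortened links counts as a duplicate.
        key = re.sub(r"https?://\S+", "", text).strip().lower()
        if key in seen:
            continue                                    # drop duplicates
        seen.add(key)
        kept.append(text)
    return kept

posts = filter_posts([
    "RT @user: check this http://t.co/x",
    "I love this place http://t.co/a",
    "i love this place http://t.co/b",
])
# → ["I love this place http://t.co/a"]
```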

2) NOISE ELIMINATION
Noise is any factor that negatively impacts the analysis and the quality of classification results. This subsection presents the pre-processing methods for eliminating the noise in social media data: URL elimination, hashtag/mention elimination, emoji and emoticon transformation, and punctuation and symbol elimination.
• URL ELIMINATION: This is the process of removing the Uniform Resource Locators (URLs) contained in the posts or tweets. Although these URLs can provide a detailed description of the posts, they are deemed unneeded in our study and are removed from the posts, since we concentrate only on meaningful words. Regular expressions are used to remove these URLs in the SMDCM model.

• HASHTAGS/MENTIONS ELIMINATION: This is the process of removing unnecessary tokens starting with symbols such as '@' or '#'. The '@' symbol is typically used in tweets and posts to mention people's names, while the '#' symbol is used to describe the topic being discussed. For example: ''I love when white people talk to black people @Belal #black_people''. In our model, the symbols are eliminated using regular expressions, and the terms, phrases, or tags that contain meaningful words are kept.
• EMOJI AND EMOTICON TRANSFORMATION: This is the process of converting emojis and emoticons to their appropriate word representations to enhance the feature extraction process. Two dictionaries created by NeelShah are used to transform emojis and emoticons into word format. The emoji dictionary has 4,853 emojis, whereas the emoticon dictionary consists of 222 emoticons.
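The URL, mention/hashtag, and emoticon steps above can be sketched as one small function (a minimal sketch: the emoticon dictionary here is a tiny illustrative subset of the 222-entry dictionary the paper uses, and the regex patterns are our own assumptions about the implementation):

```python
import re

def eliminate_noise(post):
    """Apply URL removal, @mention removal, '#' stripping (keeping the
    tag words), and emoticon-to-word transformation to one post."""
    emoticons = {":)": "happy", ":(": "sad"}          # illustrative subset
    for emo, word in emoticons.items():
        post = post.replace(emo, " " + word + " ")
    post = re.sub(r"https?://\S+", " ", post)         # URL elimination
    post = re.sub(r"@\w+", " ", post)                 # drop @mentions
    post = re.sub(r"#", " ", post)                    # keep hashtag words
    return re.sub(r"\s+", " ", post).strip()          # tidy whitespace

clean = eliminate_noise("I love this :) @Belal #black_people http://t.co/x")
# → "I love this happy black_people"
```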
• PUNCTUATION AND SYMBOL ELIMINATION: The aim of eliminating special characters such as (%, &, $, etc.), punctuation marks, extra white-space characters, and numbers is to obtain only informative data. We use NLTK and regular expressions for this process. It also helps reduce the storage size of the dataset.

3) OUT OF VOCABULARY (OOV) CLEANING
This stage is the most crucial one; it identifies and eliminates words/terms that are not in the English dictionary. This stage covers several issues such as concatenated words (e.g., 'BlackPeopleRacism'), slang (e.g., 'Luv', 'ppl'), elongated words (e.g., 'happppppy'), and contractions. The techniques used to address these issues and enhance data quality are detailed in the following subsections.
• CONCATENATED WORDS SPLITTING: Since the maximum capacity for tweets or postings is limited, some Twitter users concatenate their words to create longer tweets. Concatenated words should be broken down into their individual parts, such as the concatenated word ''BlackPeopleRacism'' should be split into three words ('Black', 'People', 'Racism') using the regular expression technique.
• ELONGATED WORDS TRANSFORMATION: This process is responsible for transforming an elongated word into its original form by eliminating repeated letters. On social media, users use elongated words to express their feelings or emotions, such as ''I am so happpppppppy to meet you'' and ''loooooooove you''. Using this process, 'happpppppppy' is transformed into 'happy' and 'loooooooove' into 'love'. A regular expression technique, backreferences, is used to conduct this process. It is a popular technique that permits the text captured by one group in a pattern to be matched again to exactly the same text; it matches and removes repeated characters from the words of posts.
• CONTRACTION REPLACEMENT: This is the most important process in the SMDCM model; it plays a significant role in identifying the tweet or post's sentiments. This process expands the contractions in social media data such as tweets using a lexicon that contains all the contractions utilized for the transformation. The initial task in this process is to find the contraction pattern and then replace it with the respective pattern from the lexicon. For example, the contractions ''can't'', ''didn't'', ''hasn't'', and ''won't'' are transformed into ''can not'', ''did not'', ''has not'', and ''will not'', respectively.
• SLANGS MODIFICATION: Slang is the vocabulary of an informal language that is frequently used in user-to-user communication, particularly on social media platforms like YouTube, Reddit, Facebook, Instagram, Snapchat, Twitter, and others. The process of converting informal words into their formal (original) counterparts is known as slang modification. Users frequently employ slang words to lower the number of characters in each post or tweet due to the character limit on tweets. For instance, slang terms like ''luv'', ''ppl'', and ''plz'' are not listed in the English dictionary; these terms should be changed into official English terms like ''love'', ''people'', and ''please'', respectively. We constructed a dictionary containing 2,864 slang terms with their formal counterparts to handle this issue. A binary search algorithm is utilized to find the right words in the constructed dictionary for the slang words.
• SPELLING CORRECTION: This process involves correcting typos and mistakes that have arisen in the data. The Pyspellchecker package, which can fix a variety of mistakes, is used for this correction.
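The OOV-cleaning steps can be sketched as follows (a minimal sketch: the slang and contraction dictionaries are tiny illustrative subsets of the paper's lexicons, a plain dict lookup stands in for the binary search used in the paper, and the elongation rule collapses runs of three or more repeated letters to two, which a dictionary check would refine further):

```python
import re

# Illustrative subsets; the paper's slang dictionary holds 2,864 entries.
SLANG = {"luv": "love", "ppl": "people", "plz": "please"}
CONTRACTIONS = {"can't": "can not", "won't": "will not", "didn't": "did not"}

def clean_oov(post):
    """Contraction replacement, elongated-word normalisation
    (backreference regex), CamelCase concatenation splitting,
    and slang modification, in that order."""
    for contraction, full in CONTRACTIONS.items():
        post = post.replace(contraction, full)
    post = re.sub(r"(\w)\1{2,}", r"\1\1", post)     # happppppppy -> happy
    post = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", post)  # BlackPeopleRacism ->
                                                      # Black People Racism
    return " ".join(SLANG.get(tok.lower(), tok) for tok in post.split())

out = clean_oov("luv u ppl I am so happppppppy BlackPeopleRacism can't stop")
# → "love u people I am so happy Black People Racism can not stop"
```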

4) POST TRANSFORMATIONS
This section presents the common pre-processing methods to clean up social media data. These methods are all described in the following subsections.
• LOWERCASE CONVERSION: It is the process of lowercasing all letters in all words in a post or tweet in order to give a uniform and consistent format.
• TOKENIZATION (WORD SEGMENTATION): It is a fundamental task in most text processing applications. It divides the post or tweet into lexical units (features or words) named tokens. The words in the sentence are often separated by breaks like commas, semicolons, periods, and white space. The NLTK library [70] is utilized for this process.
• STOP WORDS REMOVAL: This process eliminates the stop words in the post or tweet. Stop words are words that carry no meaning regarding the content; most of them are prepositions, pronouns, and conjunctions such as 'the', 'she', 'he', 'is', 'a', 'an', etc. The common way of eliminating stop words is based on pre-compiled lists. Since there are numerous potential stop-word lists, we restrict our attention to the default list provided by the NLTK library when deciding which words or terms to eliminate from posts or tweets.
• STEMMING: This process transforms a word into its base form by removing suffixes and prefixes to obtain the word root. It is an important process in NLP because it lets the analysis concentrate on the base form of words rather than discriminating among different variations of a word, which might introduce ambiguity during data mining and analysis. As an illustration, the words ''eliminate'', ''eliminated'', ''eliminating'', and ''elimination'' all reduce to the same root form or stem. The literature contains a wide variety of stemming algorithms; in our model, we used the Porter Stemmer algorithm [71], which is regarded as the most popular technique for English datasets [72].
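The four post-transformation steps can be chained into one small pipeline. The sketch below is pure-Python for self-containment: the stop-word set is a tiny excerpt standing in for NLTK's default English list, and a toy suffix stripper stands in for the Porter stemmer; the input post is hypothetical.

```python
import re

# Tiny excerpt standing in for NLTK's English stop-word list.
STOP_WORDS = {"the", "she", "he", "is", "a", "an", "to", "and"}

def preprocess(post: str) -> list:
    post = post.lower()                                  # lowercase conversion
    tokens = re.findall(r"[a-z]+", post)                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

    def stem(t):
        # Toy suffix stripping standing in for the Porter stemmer.
        for suf in ("ing", "ed", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                return t[: -len(suf)]
        return t

    return [stem(t) for t in tokens]                     # stemming

tokens = preprocess("She is eliminating the duplicated posts")
```

In practice the same pipeline would use `nltk.word_tokenize`, `nltk.corpus.stopwords`, and `nltk.stem.PorterStemmer`, as the paper does.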

B. FEATURE EXTRACTION
The preprocessed social media data, such as posts, are represented as vectors of features. Feature extraction is the process of extracting the words from the text or post and converting them into a set of numerical features usable for ML. In this section, three well-known feature extraction methods are used: BoW, TF-IDF, and GloVe [73], [74]. The feature extraction methods are selected for the experiments based on the original papers of the STTM models. The following subsections describe the mathematical modelling of each feature extraction method.

1) BAG OF WORDS (BoW)
The BoW is the most flexible, popular, and simplest technique for extracting features from documents (posts). It is based entirely on the occurrence of terms/words in the post. Tokenization and counting of token occurrences are performed in this method. The BoW method [75] has numerous parameters that can be used to refine the feature type; the features can be constructed using unigrams, bigrams, or trigrams. In our experiment, we utilized unigrams. In this case, each term in the post indicates a specific feature name, and the occurrence of each feature is represented using a matrix to make it simpler to comprehend. Hence, consider the set of two tweets,
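A unigram BoW matrix as described above can be built in a few lines; the two example tweets here are hypothetical stand-ins for the paper's example (rows index posts, columns index vocabulary terms):

```python
from collections import Counter

tweets = ["good movie good plot", "bad movie"]  # hypothetical example posts

# Vocabulary: all distinct unigrams, sorted for a stable column order.
vocab = sorted({w for t in tweets for w in t.split()})

# Count matrix: entry [i][j] = occurrences of vocab[j] in tweets[i].
bow = [[Counter(t.split())[w] for w in vocab] for t in tweets]
```

The same matrix is what `sklearn.feature_extraction.text.CountVectorizer` with its default unigram setting would produce.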

2) TF-IDF
The TF-IDF technique [75] is a weighting scheme mainly utilized as a weighting factor in Information Retrieval (IR). It is utilized to evaluate the significance of a term/word (weight + count) in each post (document) of a given social media dataset. It is made up of two measures: the first is Term Frequency (TF), and the second is Inverse Document Frequency (IDF). Mathematically, it can be expressed as in Eq. (5).

TF(w, d) = (number of times word w appears in post d) / (total number of words in post d)

where the term frequency is denoted by TF(w, d), T denotes the total number of posts in the dataset, and DF(t) denotes the number of posts in which term t appears, so that IDF(t) = log(T / DF(t)) and the TF-IDF weight of w in d is TF(w, d) × IDF(w). Table 2 shows the TF-IDF matrix for the previous example.
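The TF and IDF definitions above translate directly into code. This sketch uses two hypothetical tokenized posts; note that with IDF = log(T / DF(t)), a term occurring in every post (here "movie") receives weight zero:

```python
import math
from collections import Counter

posts = [["good", "movie", "good"], ["bad", "movie"]]  # hypothetical posts
T = len(posts)

# DF(t): number of posts containing term t (count each post once).
df = Counter(w for p in posts for w in set(p))

def tf_idf(w, post):
    tf = post.count(w) / len(post)   # TF(w, d)
    idf = math.log(T / df[w])        # IDF(t) = log(T / DF(t))
    return tf * idf

weight_good = tf_idf("good", posts[0])
weight_movie = tf_idf("movie", posts[0])
```

Production code would typically use `sklearn.feature_extraction.text.TfidfVectorizer`, whose smoothed IDF variant differs slightly from this textbook formula.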

3) GloVe
GloVe stands for Global Vectors, a word embedding framework that produces numerical text representations and offers a semantic similarity measure between words [73], [74]. Word Embedding (WE) is the representation of words or terms in their context, together with the words or terms around them [76]. WE is generally utilized in various deep learning tasks like semantic analysis, entity recognition, syntactic parsing, etc. The GloVe method learns word representations by factorizing the word-word (w2w) co-occurrence matrix. The key aim of GloVe is to reduce the reconstruction error (only for the positive entries of the co-occurrence matrix z), and it is computed as given in Eq. (8).
where the vocabulary size is indicated by m, and the scalar bias terms associated with words i and j are denoted by b_i^A and b_j^B, respectively. The term f(z_ij) indicates the weighting function, which filters the zero entries and down-weights unusual co-occurrences; it is defined as given in Eq. (9).
Hence, GloVe is considered a distributed word representation framework used to obtain vector representations of words. GloVe can be applied to discover relations between words, such as synonyms. The most common drawback of the GloVe model is that it requires a lot of memory when trained on the word co-occurrence matrix. Moreover, if the parameters related to the co-occurrence matrix are changed, the matrix must be reconstructed, which is very time-consuming.
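In practice, pretrained GloVe vectors are distributed as plain text files ("word v1 v2 …"), loaded into a dictionary, and compared with cosine similarity. The sketch below parses a made-up three-dimensional stand-in for such a file (real GloVe vectors have 50–300 dimensions, and the words' values here are invented for illustration):

```python
import io
import math

# Tiny stand-in for a pretrained GloVe text file; vectors are invented.
glove_txt = io.StringIO(
    "love 0.9 0.1 0.2\n"
    "adore 0.85 0.15 0.25\n"
    "terrible -0.7 0.6 0.1\n"
)

embeddings = {}
for line in glove_txt:
    word, *vals = line.split()
    embeddings[word] = [float(v) for v in vals]

def cosine(u, v):
    """Cosine similarity: the usual semantic-relatedness measure for GloVe."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

sim_close = cosine(embeddings["love"], embeddings["adore"])
sim_far = cosine(embeddings["love"], embeddings["terrible"])
```

With real pretrained vectors, semantically related words such as synonyms score markedly higher than unrelated ones, as the toy vectors are constructed to mimic.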

C. SHORT TEXT TOPIC MODELING ALGORITHMS
The final phase of the STTM framework is discovering and extracting the latent topics from short social media posts, based on the content of the discussion, using STTM models. Topic modelling automatically discovers the latent topics in a short text dataset. We evaluate the influence of the Social Media Data Cleansing Model (SMDCM) utilizing the most prominent short text topic models: LDA [49], BTM [41], WNTM [56], PTM [42], GLTM [68], and FTM [69]. The considered STTM models are described in the upcoming subsections.

1) LATENT DIRICHLET ALLOCATION (LDA)
LDA is a prevalent unsupervised, probabilistic topic modelling method used for discovering and extracting the hidden topic structure in short social media texts [49]. The main concept behind LDA is that each document is represented as a probability distribution (mixture) over topics, whereas each topic is represented as a probability distribution over a set of words. These topics remain hidden in the latent layer. The LDA model is based on the BoW assumption, which neglects the order of words. The generative process of the LDA method for each short text/post D_i ∈ D in a corpus D is formulated in the steps of Algorithm 1.
where the parameters β and α are dataset-level parameters, sampled just once in the procedure of generating the dataset. The word-level variables, sampled once for every word in every document, are z_{d,n} and w_{d,n}. The number of topics is K. The parameter ϕ_k denotes the word probability distribution for topic k. Finally, the parameter θ_d is the document-level variable, sampled once per short text document. The posterior (conditional) probability is formulated as given in Eq. (10).
where P(w_D) denotes the marginal probability, which sums the joint distribution over all instantiations of the latent structure. The variables β_k, θ_D, and z_D are hidden (unobserved) variables denoting the topics, the document-topic distribution, and the word-topic assignments, respectively. There are three main kinds of inference methods: expectation propagation [77], variational methods [49], and Gibbs sampling [44]; this article utilizes Gibbs sampling [44].
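Since the paper uses Gibbs sampling for LDA inference, a minimal collapsed Gibbs sampler is sketched below. The four-document corpus, K = 2, and the hyperparameters are toy assumptions chosen only to keep the example self-contained; real runs use the settings in the parameters section.

```python
import random
from collections import defaultdict

random.seed(0)

docs = [["apple", "banana", "fruit"], ["goal", "match", "football"],
        ["fruit", "banana", "juice"], ["football", "goal", "score"]]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Count tables for collapsed Gibbs sampling.
ndk = [[0] * K for _ in docs]                # topic counts per document
nkw = [defaultdict(int) for _ in range(K)]   # word counts per topic
nk = [0] * K                                 # token counts per topic
z = []                                       # topic assignment of each token

for d, doc in enumerate(docs):               # random initialization
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[d].append(t); ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):                         # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                      # remove current assignment
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # P(z=k | rest) ∝ (n_{d,k}+α) · (n_{k,w}+β)/(n_k+Vβ)
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[d][i] = t                      # add new assignment back
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

def topic_of(d):
    """Dominant topic of document d under the current sample."""
    return max(range(K), key=lambda k: ndk[d][k])
```

Libraries such as Gensim's `LdaModel` or scikit-learn's `LatentDirichletAllocation` provide production-grade alternatives (the latter using variational inference rather than Gibbs sampling).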

2) BITERM TOPIC MODEL (BTM)
BTM [41] is one of the well-known global word co-occurrence based models. BTM learns the hidden topics in short texts (STs) by directly modelling the generation of biterms in the dataset D. A biterm is an unordered pair of words that co-occur in a short text or post. The main idea is that the more frequently two words co-occur, the more likely they are to pertain to the same topic. Formally, suppose D is a dataset consisting of N_D short texts that includes n_B biterms B = {b_i}, i = 1, …, n_B, where b_i = (w_{i,1}, w_{i,2}). Suppose z ∈ [1, K] is a topic indicator variable, where the K topics are defined over the W words of a vocabulary V. The word distribution over topics (for example, P(w | z)) is defined by a K × W matrix ϕ, whose k-th row ϕ_k is a W-dimensional multinomial distribution with ϕ_{k,w} = P(w | z = k) and Σ_w ϕ_{k,w} = 1. The proportion of topics in the dataset D (for example, P(z)) is represented by a K-dimensional multinomial distribution θ. The number of biterms assigned to topic k, excluding b_i, is denoted n_{k,¬i}; z_{¬i} denotes the topic assignments of all biterms except the current b_i. The numbers of times the words w_{i,1} and w_{i,2} are assigned to topic k, excluding b_i, are denoted n^{w_{i,1}}_{k,¬i} and n^{w_{i,2}}_{k,¬i}, respectively. For every biterm, we remove the biterm from its current Topic Feature (TF) vector, reassign it to a topic using Eq. (11), and update the new topic feature vector using Eq. (12). After completing the iterations, the BTM model estimates ϕ and θ using Eq. (13) and Eq. (14).

(Steps displaced in typesetting from Algorithm 1: select a word-topic assignment z_{d,n} ∼ Multinomial(θ_d), where z_{d,n} ∈ {1, 2, …, K}; then select a word w_{d,n} ∼ Multinomial(ϕ_{z_{d,n}}), where w_{d,n} ∈ {1, 2, …, V}.)
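The biterm construction that BTM models can be sketched directly: every unordered pair of distinct word positions in a short post becomes one biterm. The example post is hypothetical.

```python
from itertools import combinations

def biterms(tokens):
    """All unordered co-occurring word pairs (biterms) in one short text."""
    return [tuple(sorted(p)) for p in combinations(tokens, 2)]

post = ["cold", "weather", "today"]      # hypothetical tokenized post
pairs = biterms(post)
```

A post of n tokens yields n(n-1)/2 biterms, which is why BTM mitigates sparsity: even a three-word tweet contributes three co-occurrence observations.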

3) WORD-NETWORK TOPIC MODEL (WNTM)
WNTM utilizes global word co-occurrences (WC) to build a word co-occurrence network (WCN). It is a novel framework that simultaneously addresses the data sparsity and text imbalance problems. WNTM learns the distribution over topics for every word rather than for each short text, rendering it less sensitive to the length of social media short texts and to topic-distribution heterogeneity. In addition, WNTM constructs every word's adjacent word list in the network utilizing hidden word groups and the words corresponding to those groups. Algorithm 3 shows the entire pseudo-document generative process.

Algorithm 3 The Entire Pseudo-Document Generative Process in WNTM
1. For every hidden word group z:
2.   Draw ϕ_z ∼ Dirichlet(β), a multinomial distribution over words for z
3. Draw θ_i ∼ Dirichlet(α), a latent word-group distribution for the adjacent word list L_i of the word w_i
4. For each word w_j ∈ L_i:
5.   Choose a hidden word group z_j ∼ θ_i
6.   Choose the adjacent word w_j ∼ ϕ_{z_j}

As the WNTM model scans the text window by window, any two different words in the same window are considered a co-occurrence. The undirected WCN is then constructed by WNTM, where each node of the WCN represents one word and each edge weight denotes the number of co-occurrences of the two words it connects. The number of nodes in the network equals the number of features in the vocabulary V. After that, the Word-Network Topic Model produces a pseudo-document (PD) l for every vertex v in the word network, composed of its neighbouring vertices. Following the acquisition of the pseudo-documents P, the WNTM model uses Gibbs sampling for LDA to discover hidden topics or themes from the generated PDs. The WNTM model infers the hidden topics utilizing the conditional distribution given in Eq. (15).
Assuming a pseudo-document l is produced from word l, we can compute the topic-word distribution using Eq. (16), where n_l denotes the number of words or terms in l. The document-word distribution θ_D is based on the topic-word distribution ϕ_k^{w_{D,i}}; it can be computed as given in Eq. (17), where n_D^{w_{D,i}} denotes the number of occurrences of word w_{D,i} in the post.

VOLUME 10, 2022
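The network and pseudo-document construction described above can be sketched as follows; the two example posts are hypothetical, and the window size of 10 matches the experimental setting reported later.

```python
from collections import defaultdict
from itertools import combinations

def word_network(posts, window=10):
    """Undirected word co-occurrence network: edge weights count how often
    two distinct words fall within the same sliding window."""
    edges = defaultdict(int)
    for tokens in posts:
        for start in range(max(1, len(tokens) - window + 1)):
            for u, v in combinations(set(tokens[start:start + window]), 2):
                edges[tuple(sorted((u, v)))] += 1
    return edges

def pseudo_documents(edges):
    """One pseudo-document per word: its neighbours, repeated by edge weight."""
    pd = defaultdict(list)
    for (u, v), w in edges.items():
        pd[u].extend([v] * w)
        pd[v].extend([u] * w)
    return pd

posts = [["rain", "cold", "weather"], ["cold", "flu", "season"]]
net = word_network(posts, window=10)
pdocs = pseudo_documents(net)
```

Standard LDA Gibbs sampling is then run over `pdocs` instead of the original short texts, which is exactly how WNTM sidesteps document-level sparsity.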

4) PSEUDO-DOCUMENT-BASED TOPIC MODEL (PTM)
PTM is one of the popular self-aggregation based methods, proposed specifically for short text (ST) datasets. PTM introduced the idea of the pseudo-document to aggregate STs implicitly against the data sparsity problem. PTM supposes that the large volume of social media short texts is produced from a smaller number of long pseudo-documents, and it subsequently learns the hidden topics from the long pseudo-documents instead of from the STs. Formally, assume K topics {ϕ_z}, z = 1, …, K, each of which is a multinomial distribution over a vocabulary of size V. Suppose P is the set of pseudo-documents {d_l}, l = 1, …, P, and D is a dataset consisting of short texts {d_s}, s = 1, …, D. Here, the pseudo-documents are hidden, whereas the short texts are observed. A multinomial distribution ψ is utilized to model the distribution of STs over pseudo-documents, and every ST is assumed to belong to exactly one pseudo-document. Every word or term in an ST is produced by first drawing a hidden topic z from the topic distribution θ of the pseudo-document and subsequently drawing a word w ∼ ϕ_z. The generative process of the PTM is presented in Algorithm 4. For inference, the pseudo-document assignment l for ST d_s is sampled using collapsed Gibbs sampling, as defined in Eq. (19).
where the length of the s-th short text d_s is denoted by n_{d_s}, the number of tokens assigned to topic k in d_s is denoted by n^k_{d_s}, and the number of short texts associated with the pseudo-document is denoted by m_l. The total number of tokens in d_l is represented by n_l, and the topic selector of pseudo-document d_l for topic k is indicated by b_{l,k}. The size of A_l is denoted by |A_l|. The topic assignments k are sampled in the same way as in LDA. After obtaining the pseudo-document, PTM samples the topic assignment k for each word w in d_s. The θ is drawn from a spike-and-slab prior.
where n_k = Σ_{w=1}^{V} n^w_k and the number of times w is assigned to topic k is represented by n^w_k. The document-word distribution is computed using Eq. (21).

5) GLOBAL AND LOCAL WORD EMBEDDING-BASED TOPIC MODELING (GLTM)
The GLTM [68] trains global embeddings on a huge external dataset and uses the continuous Skip-Gram method with Negative Sampling (SGNS) to obtain local word embeddings. The model can extract semantic relatedness among words in short texts using both the local and global word embeddings, which the Gibbs sampler can exploit to increase the semantic coherence of topics throughout the inference process. The spike-and-slab prior is then employed in this model to extract the sparse topic structure of every ST. The dual-sparsity topic method is adopted, which specifies a weak smoothing prior and a smoothing prior for the topic distribution under the spike-and-slab structure. The spike-and-slab prior can efficiently decouple the smoothness and the sparsity of the probability distribution [78]. Algorithm 5 describes the GLTM generation process in detail.
where the set of STs is denoted by D, the length of short text d is denoted by N_d, and the set of topics is indicated by K. The multinomial distribution over topics of short text d is represented by θ_d; γ is the Beta prior for ψ_d, the Bernoulli distribution over the topic selectors of short text d. The multinomial distribution over words of topic k is denoted by ϕ_k. The weak smoothing prior for the topic distribution is denoted by b, and α is the smoothing prior for the topic distribution. Each topic k has an indicator (topic selector) in short text d; w_{d,i} is the i-th word in short text d, and z_{d,i} is the topic index of word w_{d,i}. In the GLTM model, collapsed Gibbs sampling is employed to conduct approximate inference. The latent variables to be sampled are the topic assignments of words z and the topic indicators in documents. Maximum a posteriori (MAP) estimation is used to estimate the parameters ψ, θ, and ϕ; the details of model inference can be found in [68].

(Steps displaced in typesetting from Algorithm 5: draw a word distribution for topic k, ϕ_k ∼ Dirichlet(β); for each document d ∈ {1, …, |D|}, draw a topic z_{d,i} ∼ Multinomial(θ_d) and draw a textual word w_{d,i} ∼ Multinomial(ϕ_{z_{d,i}}).)

6) FUZZY TOPIC MODELING (FTM)
FTM [69] is a clustering-based topic modelling approach that extracts and discovers the latent topics in a corpus of STs from a fuzzy-concept perspective. FTM is developed to alleviate the data sparsity problems of STs. In this model, the BoW approach is used to compute the global and local frequencies of terms. Then, Principal Component Analysis (PCA) is utilized to reduce the high-dimensional features. After that, the Fuzzy C-Means (FCM) technique is used to cluster the short texts and extract the themes from the ST data, treating each cluster as an extracted topic. The overall process of FTM is described in Algorithm 6. More detail about this model is presented in [69].
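The FCM clustering stage of the FTM pipeline can be sketched with a minimal NumPy implementation. This is not the paper's code: the four 2-D points stand in for PCA-reduced BoW vectors, and c = 2, m = 2, and the iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuzzy_c_means(X, c=2, m=2.0, iters=100):
    """Minimal fuzzy c-means: returns (cluster centers, membership matrix U).
    U[i, k] is the degree to which point i belongs to cluster k."""
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1
    for _ in range(iters):
        W = U ** m                               # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2.0 / (m - 1.0)))       # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Toy 2-D features standing in for PCA-reduced BoW vectors of short texts.
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
centers, U = fuzzy_c_means(X)
labels = U.argmax(axis=1)    # hard labels: each cluster ≈ one extracted topic
```

Unlike hard k-means, every post retains a graded membership in every topic cluster, which is the "fuzzy" perspective FTM exploits.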

V. EXPERIMENTAL ANALYSIS
This section discusses the experimental analysis, evaluates the performance of the proposed SMDCM, and studies its effects on the different STTM models (LDA, PTM, BTM, WNTM, GLTM, and FTM) in terms of accuracy, Topic Coherence (TC), NMI, and purity. For the evaluation, we utilized two real-world short text social media datasets: Real-world Cyberbullying Twitter (RW-CB-Twitter) and Cyberbullying Mendeley (CB-MNDLY). The proposed preprocessing and data cleansing phases of the SMDCM model have been explained in detail in sub-section IV-A. Concerning the feature extraction process, the features are extracted using TF-IDF, BoW, and GloVe; the feature extraction techniques are chosen according to those utilized in the original papers of the considered topic modelling models. Lastly, the STTM models are applied to discover the topics. The effects of the SMDCM on STTM were evaluated with different numbers of topics, k = {5, 20, 30}.

A. EXPERIMENTAL SETUP
This subsection provides the experimental setup of this research, like SMDCM configurations and the parameters setting of the considered short text topic models.

1) SMDCM CONFIGURATIONS
The selected models are implemented in Python 3.7.4 with the Anaconda Spyder IDE and Java. The suggested SMDCM incorporates many dictionaries, tools, and libraries, such as an English dictionary and acronym and slang dictionaries, which provide a collection of all abbreviations, their variants, and slang terms as lookup dictionaries for the transformation, besides required libraries such as Gensim [79], Scikit-learn [80], NumPy, pandas, NLTK [81], and Tweepy. In addition, we have utilized packages like ''SpellChecker'', which provides methods such as ''Correction'' and ''Spell'' to correct and check the spelling mistakes in the respective datasets. For the experiment, we construct a set of SMDCM settings and compare their differences and similarities to depict the functionality of the SMDCM. Here, we select two scenarios for both the RW-CB-Twitter and CB-MNDLY datasets. The first scenario, called the ''with baseline'' techniques, consists of only a few techniques: tokenization, lowercase conversion, punctuation removal, and stopword elimination. The second scenario, the suggested preprocessing known as the ''with SMDCM'' techniques, consists of four stages with all the proposed tasks (as depicted in Figure 1-B): (1) filtering short texts; (2) noise removal, including URL elimination, mention and hashtag elimination, emoji and emoticon transformation, and punctuation and symbol removal; (3) Out-of-Vocabulary (OOV) cleaning, including concatenated word splitting, contraction replacement, elongated word transformation, slang modification, and spelling correction; and (4) post transformation, including lowercase conversion, tokenization, stop word removal, and stemming.

2) PARAMETERS SETTING
The parameter settings of all the considered short text topic modelling approaches are as given in the original articles. The number of iterations is fixed to 1000 for all the approaches. The value of α is set to 0.05 for LDA and to α = 50/K for both BTM and GLTM, whereas we fixed α = 0.1 for the WNTM and PTM models. The hyperparameter λ is fixed at λ = 0.1 and λ = 0.5 for the PTM and GLTM approaches, respectively. We fixed β = 0.01 for the LDA, BTM, PTM, WNTM, and GLTM models. The sliding window size for WNTM is set to 10, and the number of pseudo-documents for PTM is fixed to 1000. We evaluate the effects of the SMDCM on the considered topic discovery models with different numbers of topics, k = {5, 20, 30}.

B. DATASETS
In this subsection, we briefly describe the datasets utilized in the experimental analysis. The evaluation is performed on two social media datasets: the publicly available Cyberbullying Mendeley dataset collected by Elsafoury [82], which we refer to as CB-MNDLY, and the Real-world Cyberbullying Twitter (RW-CB-Twitter) dataset. Table 3 shows the statistics of these datasets. The descriptions of these datasets are provided in the upcoming subsections.

(Steps displaced in typesetting from Algorithm 6: 10. Compute the probability P(X_i | Z_j) of words in short texts (documents); 11. Compute the probability P(X_i | Y_k) of words in topics.)

1) RW-CB-TWITTER DATASET
This dataset was collected from the Twitter social media platform through the Twitter streaming API by selecting cyberbullying key terms such as whale, bitch, LGBTQ, fucking, idiot, sucker, fuck, pussy, nigger, poser, and moron, as recommended in the psychology literature [83], [84], along with key terms related to racism, as recommended by [85], such as black, hate, Islamic, threat, Islam, terrorist, attack, racism, ban, and kill. The RW-CB-Twitter dataset comprises 435,764 gathered tweets. After deleting irrelevant tweets and retweets, we randomly selected 20,000 tweets and utilized them in this research for the evaluations. This dataset extends the dataset collected in [14] and is classified into five classes: not-bullying, sexism, racism, aggressive, and insult.

2) CB-MNDLY DATASET
This dataset is freely available for research purposes in the Mendeley data repository. It was collected by Elsafoury [82] from various social media sources such as Kaggle, Twitter, YouTube, Talk pages, and Wikipedia. The corpus covers different types of cyberbullying, such as aggression, racism, hate speech, insults, sexism, and toxicity; each type is kept in a separate file and categorized as bullying or not-bullying. We combined these files to generate a new dataset we refer to as the CB-MNDLY dataset, composed of six classes: insult, racism, sexism, aggression, toxicity, and not-bullying. The CB-MNDLY dataset contains 448,880 short texts, of which we selected 50,000 to evaluate the effect of data cleansing on short text topic discovery models.

C. EVALUATION METRICS
In this subsection, we introduce the metrics used to evaluate the effects of the proposed SMDCM techniques on the STTM models. To provide a thorough assessment, we evaluate all the considered models from several perspectives: topic coherence; short text clustering, measured by purity and NMI; and short text classification, measured by accuracy. These metrics are described as follows:

1) TOPIC COHERENCE (TC)
TC is a metric utilized to assess the quality of the extracted topics. For every generated topic k, TC is applied to the top N words (w_1, …, w_N); we chose the top 10 words in the experiment. It computes the semantic score of a particular topic by assessing the degree of semantic similarity among the topic's high-scoring words. To compute TC, we require an external corpus (e.g., Wikipedia) to score pairs of words using term co-occurrence. Here, TC is computed using Normalized PMI (NPMI) [86] instead of PMI [87], as given in Eq. (26), where score(w_j, w_l) denotes the NPMI.
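The NPMI score for a word pair can be sketched as below, using the standard definition NPMI(w_j, w_l) = PMI(w_j, w_l) / (-log p(w_j, w_l)); the tiny set of co-occurrence windows is a hypothetical stand-in for windows drawn from an external corpus such as Wikipedia.

```python
import math

# Hypothetical reference windows standing in for an external corpus.
windows = [{"cold", "weather"}, {"cold", "weather", "rain"},
           {"cold", "flu"}, {"sunny", "weather"}]
N = len(windows)

def p(*words):
    """Estimated probability that all given words co-occur in a window."""
    return sum(all(w in win for w in words) for win in windows) / N

def npmi(w1, w2):
    joint = p(w1, w2)
    if joint == 0:
        return -1.0                              # never co-occur
    pmi = math.log(joint / (p(w1) * p(w2)))
    return pmi / (-math.log(joint))              # normalize to [-1, 1]

score = npmi("cold", "weather")
```

A topic's coherence is then the aggregate of these pairwise scores over its top 10 words.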

2) SHORT TEXT CLUSTERING EVALUATION METRICS (NMI AND PURITY)
Short text clustering is a significant application of STTM. We select the maximum value from its topic probability distribution for each social media post as the cluster label. Then, the golden and cluster labels are compared using the clustering evaluation metrics NMI and Purity.

a: PURITY
The purity metric evaluates the ratio of the number of correctly clustered posts (short texts) to all the labelled posts (gold labels) in the corpus. The value of purity lies between 0 and 1. It is defined as in Eq. (28), where the total number of posts in the corpus (dataset) is denoted by N, the set of clusters is denoted by A = {a_1, …, a_|A|}, and the set of ground-truth (labelled) classes is represented by B = {b_1, …, b_|B|}.

b: NMI [88]
NMI is a metric used to calculate the Mutual Information I(A, B) shared between A and B, normalized to the range [0, 1].
where H(A) and H(B) are the entropies of the clusters and classes, respectively. The NMI is formulated as given in Eq. (29), and the final formula of the NMI is defined as given in Eq. (33).
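Both clustering metrics can be computed from the predicted cluster labels and the gold labels; the sketch below follows the standard definitions (with the common square-root normalization for NMI) on a hypothetical four-post example.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Fraction of posts assigned to their cluster's majority gold label."""
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i, ci in enumerate(clusters) if ci == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

def nmi(clusters, labels):
    """Mutual information I(A, B) normalized by sqrt(H(A) * H(B))."""
    n = len(labels)

    def H(xs):
        return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

    mi = 0.0
    for a, na in Counter(clusters).items():
        for b, nb in Counter(labels).items():
            nab = sum(1 for c, l in zip(clusters, labels) if c == a and l == b)
            if nab:
                mi += nab / n * math.log(n * nab / (na * nb))
    return mi / math.sqrt(H(clusters) * H(labels))

clusters = [0, 0, 1, 1]                          # hypothetical cluster labels
labels = ["bully", "bully", "none", "none"]      # hypothetical gold labels
```

For this perfectly separated toy example both metrics reach their maximum of 1; scikit-learn's `normalized_mutual_info_score` offers an equivalent off-the-shelf NMI.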

3) SHORT TEXT CLASSIFICATION EVALUATION
Each social media post can be represented by its document-topic distribution P(z | D), and text classification can be used to evaluate topic modelling performance. Therefore, we select accuracy as the measure for short text classification. Accuracy is the ratio of correctly predicted observations to the total number of predictions [89]; higher accuracy indicates that the learned topics are more representative and discriminative. We utilize the SVM classifier for this task. The classification accuracy is calculated with five-fold cross-validation over both the CB-MNDLY and RW-CB-Twitter datasets. It is computed as defined in Eq. (34).
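The accuracy computation itself is a one-liner; in the paper the predictions come from an SVM under five-fold cross-validation (e.g., scikit-learn's `cross_val_score` with `SVC`), while this sketch simply assumes some hypothetical predictions.

```python
def accuracy(y_true, y_pred):
    """Correctly predicted observations divided by total predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical gold labels and classifier predictions for four posts.
y_true = ["racism", "sexism", "insult", "racism"]
y_pred = ["racism", "sexism", "racism", "racism"]
acc = accuracy(y_true, y_pred)
```

Here three of four predictions match, so the accuracy is 0.75.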

4) PERFORMANCE IMPROVEMENT RATE (PIR)
In this study, the PIR is defined as the improvement rate of the STTM models' performance with SMDCM compared to the same models without SMDCM. It can be expressed as given in Eq. (35).
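Eq. (35) is not reproduced here, but assuming PIR is the usual relative improvement, (with − baseline) / baseline × 100, the definition is consistent with the figures reported later: WNTM's accuracy of 77.42% with SMDCM versus 75.65% at baseline yields the quoted 2.34%.

```python
def pir(with_smdcm, baseline):
    """Performance improvement rate (%) of a model with SMDCM over its
    baseline score, assuming the standard relative-improvement formula."""
    return (with_smdcm - baseline) / baseline * 100

# WNTM accuracy at k = 5 on RW-CB-Twitter, as reported in the results.
rate = pir(77.42, 75.65)
```

Rounded to two decimals this reproduces the 2.34% improvement rate reported for WNTM.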

D. RESULTS AND DISCUSSION
In this subsection, we discuss the effects of the preprocessing and data cleansing techniques on short text topic modelling from three perspectives, using four metrics: topic coherence, classification accuracy, and clustering quality (purity and NMI). In addition, we report the Performance Improvement Rate (PIR) of the STTM-with-SMDCM scenario over the STTM-with-baseline scenario on the two cyberbullying datasets, RW-CB-Twitter and CB-MNDLY, in terms of all performance metrics.

1) TOPIC COHERENCE EVALUATION RESULT WITH SMDCM AND BASELINE TECHNIQUES
This subsection investigates the effects of the SMDCM preprocessing on short text topic modelling in terms of the topic coherence (TC) metric on both the RW-CB-Twitter and CB-MNDLY datasets. In the case of the RW-CB-Twitter dataset, all the considered STTM models were run with various numbers of topics, k = {5, 20, 30}. When k = 5, the topic coherence values of GLTM, FTM, and WNTM are 0.565, 0.553, and 0.540 with the SMDCM scenario, respectively, with GLTM yielding the best topic coherence among the STTM models. Without SMDCM (with only the baseline scenario), GLTM, FTM, and WNTM achieve topic coherence of 0.556, 0.548, and 0.531, respectively. Similarly, when the number of topics is k = 30, FTM yields a high topic coherence of 0.528 with the SMDCM model and 0.507 without it. We observe that the SMDCM preprocessing affects the topic coherence of the topics discovered by short text topic modelling, as depicted in Figure 2. In addition, we have investigated the impact of the SMDCM on the STTM models on the CB-MNDLY dataset. Figure 3 depicts the topic coherence results of the considered topic models with k = {5, 20, 30} topics on the CB-MNDLY dataset with and without the data cleansing model and shows the effect of the SMDCM on short text topic discovery. For k = 5, 20, and 30, WNTM yields the best topic coherence of 0.634, 0.600, and 0.593 with SMDCM, followed by GLTM with 0.629, 0.592, and 0.579, respectively; the topic coherence decreases without the SMDCM preprocessing, as depicted in Figure 3.
An interesting observation is that GLTM yields a high topic coherence of 0.565 with SMDCM when k = 5 on the RW-CB-Twitter dataset, whereas WNTM attains 0.634, the best topic coherence when k = 5 across all models with and without SMDCM, on the CB-MNDLY dataset. We can conclude that social media data cleansing (SMDCM) affects the performance of the short text topic discovery models, as presented in Figures 2 and 3. Table 4 reports the PIR of the STTM models with SMDCM over the same models with the baseline on both the RW-CB-Twitter and CB-MNDLY datasets in terms of topic coherence; the PIR is computed as formulated in Eq. (35). In the case of RW-CB-Twitter, the PIRs of the LDA, BTM, PTM, GLTM, FTM, and WNTM models are 1.82%, 1.96%, 2.36%, 2.51%, 3.05%, and 2.73%, respectively. Similarly, on the CB-MNDLY dataset, the PIRs of the STTM models with the SMDCM scenario over the same models without SMDCM (with the baseline scenario) are 4.70%, 5.65%, 4.97%, 5.04%, 4.28%, and 3.63%, respectively. This improvement demonstrates the effectiveness of the SMDCM for short text topic modelling.

2) ACCURACY EVALUATION RESULTS WITH SMDCM AND BASELINE TECHNIQUES
In this subsection, the six topic modelling approaches (LDA, PTM, BTM, WNTM, GLTM, and FTM) are run on the two cyberbullying datasets, RW-CB-Twitter and CB-MNDLY, with and without the proposed SMDCM model. The evaluations have been performed with different numbers of topics, k = {5, 20, 30}. Figure 4 shows the accuracy results with k = {5, 20, 30} topics on the RW-CB-Twitter dataset with and without SMDCM and the effect of the SMDCM on short text topic discovery. When k = 5, WNTM and FTM yield good accuracies of 77.42% and 77.43% with the SMDCM preprocessing, versus 75.65% and 75.33% without it, respectively, as depicted in Figure 4; the corresponding performance improvement rates over the baseline preprocessing are 2.34% and 2.78%. In contrast, the classification accuracy of LDA with SMDCM is 72.41%, the lowest among all models with SMDCM when k = 5. Similarly, when the number of topics is k = 30, WNTM attains a high accuracy of 78.53% with the SMDCM model and 77.85% with the baseline. We conclude that WNTM is the best model choice with the SMDCM social media data cleansing; in contrast, the accuracy decreases with the baseline (without SMDCM) on the RW-CB-Twitter dataset. Besides, we have studied the Cyberbullying Mendeley (CB-MNDLY) dataset to investigate the effectiveness of SMDCM on the short text topic modelling approaches. Figure 5 provides the accuracy results with k = {5, 20, 30} topics on the CB-MNDLY dataset with and without SMDCM preprocessing. When k = 30, GLTM achieves the best accuracy of 81.87% with the SMDCM preprocessing, followed by the WNTM model with 81.31%, whereas GLTM and WNTM achieve 79.75% and 78.89% accuracy without SMDCM, as depicted in Figure 5.
Similarly, when k = 20, GLTM has the highest accuracy with SMDCM compared to the other short text topic modelling methods. In contrast, when k = 5, WNTM has the best accuracy, followed by GLTM, with 79.54% and 78.96%, respectively. In general, we conclude that GLTM and WNTM achieve the best results and that the SMDCM preprocessing affects short text topic modelling performance, as shown in Figure 5.

3) PURITY EVALUATION RESULTS WITH SMDCM AND BASELINE TECHNIQUES
This subsection studies the effects of the suggested SMDCM on short text topic modelling on both the RW-CB-Twitter and CB-MNDLY datasets in terms of short text clustering purity. In the case of the RW-CB-Twitter dataset, all the considered STTM models were evaluated with different numbers of topics, k = {5, 20, 30}. When k = 5, the purity values of WNTM, GLTM, and FTM are 0.775, 0.768, and 0.766 with SMDCM, respectively, with WNTM offering the best purity among the topic modelling approaches, whereas WNTM, GLTM, and FTM achieve 0.754, 0.742, and 0.735 with the baseline techniques. Similarly, when the number of topics is k = 30, GLTM yields a high purity of 0.754 with the SMDCM preprocessing and 0.722 with the baseline (without SMDCM). We note that the SMDCM affects the clustering purity of the topics discovered by the STTM models, as depicted in Figure 6.
In addition, we investigated the effectiveness of SMDCM on short-text topic discovery on the CB-MNDLY dataset. Figure 7 shows the purity results of the considered topic models with k = {5, 20, 30} topics on the CB-MNDLY dataset with SMDCM and with the baseline techniques, illustrating the effect of SMDCM on short-text topic discovery. For k = 5, 20, and 30, GLTM yields the best purity among all models, 0.853, 0.842, and 0.817 with SMDCM, respectively, followed by FTM with 0.849, 0.838, and 0.811, respectively, with SMDCM preprocessing. The purity decreases without SMDCM (baseline), as depicted in Figure 7. We can conclude that social media data cleansing (SMDCM) improves the performance of the STTM models, as presented in Figures 6 and 7.
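Purity scores a clustering by assigning each cluster its majority true class and counting the fraction of documents that match. A minimal sketch of the metric (our own illustration, using hypothetical toy labels rather than the paper's data):

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of documents belonging to the majority true class
    of their assigned cluster; higher is better (maximum 1.0)."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, []).append(t)
    # For each cluster, count how many members share its most common class
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)

# Toy example: 8 documents, 3 true classes, 3 discovered topics
true = [0, 0, 0, 1, 1, 1, 2, 2]
topics = [0, 0, 1, 1, 1, 1, 2, 2]
print(purity(true, topics))  # 0.875
```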
This paragraph analyzes and discusses the PIR (%) of the considered STTM models with the social media data cleansing model over the same models under the baseline scenario in terms of purity. Table 6 presents the PIR of the STTM models with SMDCM over STTM with the baseline on both the CB-MNDLY and RW-CB-Twitter datasets. From the purity PIR values, we conclude that the proposed SMDCM positively affects the short-text topic modelling approaches on both social media cyberbullying datasets, as shown in Table 6.

4) NMI EVALUATION RESULTS WITH SMDCM AND BASELINE TECHNIQUES
In this subsection, we study the effectiveness of SMDCM in terms of NMI over the STTM models (LDA, PTM, BTM, WNTM, GLTM, and FTM) on the RW-CB-Twitter and CB-MNDLY datasets. On the RW-CB-Twitter dataset, the evaluations were performed with different numbers of topics, k = {5, 20, 30}. Figure 8 shows the NMI results on the RW-CB-Twitter dataset and the effectiveness of SMDCM for the STTM models. When k = 5, 20, and 30, WNTM yields good NMI results of 0.695, 0.688, and 0.678 with the SMDCM techniques, respectively, versus 0.672, 0.663, and 0.649 without SMDCM. GLTM and FTM follow, achieving the second- and third-best NMI results across all numbers of topics. The performance improvement rates (%) of WNTM with SMDCM over WNTM without SMDCM when k = 5, 20, and 30 are 3.42%, 3.77%, and 4.47%, respectively. In contrast, the NMI values of LDA with SMDCM when k = 5, 20, and 30 are 0.626, 0.607, and 0.615, respectively, the lowest among all models with SMDCM, while the NMI values of LDA without SMDCM are 0.601, 0.585, and 0.600, respectively. We conclude that WNTM is the best model choice with the social media data cleansing model (SMDCM), whereas NMI decreases without SMDCM.
In addition, we studied the CB-MNDLY dataset to investigate the effectiveness of SMDCM on short-text topic modelling approaches. Figure 9 provides the NMI results with k = {5, 20, 30} topics on the CB-MNDLY dataset with the baseline and with SMDCM (preprocessing). For k = 5, 20, and 30, GLTM achieved the best NMI results of 0.849, 0.851, and 0.838 with SMDCM, while its baseline NMI values are 0.827, 0.835, and 0.818, respectively, as depicted in Figure 9. WNTM followed, achieving an NMI of 0.841 with SMDCM preprocessing for both k = 5 and k = 20 on the CB-MNDLY dataset, and FTM achieved the second-best NMI of 0.827 with SMDCM when k = 30, as depicted in Figure 9. An interesting observation is that GLTM attains an NMI of 0.851 when k = 20 with SMDCM, the best NMI value among all models with either the baseline or SMDCM on the CB-MNDLY dataset, while WNTM yields the highest NMI of 0.695 with SMDCM when k = 5 on the RW-CB-Twitter dataset. In general, we conclude that the NMI values increase with SMDCM (preprocessing) and decrease somewhat without it, confirming the effect of SMDCM on short-text topic modelling performance.
Here, we discuss the PIR (%) of the STTM models with the proposed SMDCM over the same models under the baseline scenario in terms of the NMI metric on both datasets. On RW-CB-Twitter, the short-text topic modelling approaches LDA, BTM, PTM, GLTM, FTM, and WNTM with SMDCM generate NMI improvements of 3.47%, 4.51%, 3.32%, 3.75%, 3.56%, and 3.88% over the same models with the baseline, respectively. On the CB-MNDLY dataset, the LDA, BTM, PTM, GLTM, FTM, and WNTM models with the Social Media Data Cleansing Model (SMDCM) produce NMI improvements of 2.90%, 1.61%, 2.81%, 2.34%, 1.46%, and 0.81% over these models with the baseline.
The overall performance improvement rate (%) of STTM with SMDCM over STTM with the baseline in terms of NMI is presented in Table 7. From these results and PIR values, we conclude that the proposed SMDCM positively affects the performance of STTM.
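NMI measures the mutual information between the discovered topic assignments and the true class labels, normalized by their entropies. A minimal pure-Python sketch of the metric (our own illustration; implementations differ in the normalization term, here the geometric mean of the two entropies):

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings,
    normalized by the geometric mean of their entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label counts
    mi = sum((nij / n) * log(n * nij / (count_a[a] * count_b[b]))
             for (a, b), nij in joint.items())
    entropy = lambda c: -sum((v / n) * log(v / n) for v in c.values())
    denom = sqrt(entropy(count_a) * entropy(count_b))
    return mi / denom if denom > 0 else 0.0  # degenerate single-cluster case

# A relabeled but structurally identical clustering scores 1.0:
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # 1.0
```

Unlike purity, NMI is invariant to the naming of clusters and penalizes degenerate solutions such as putting every document in its own cluster.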

VI. CONCLUSION
As the use of Twitter data in topic modelling increases, improving the quality of social media data before processing it to derive value and insight represents an important and challenging requirement. This paper introduced a model called SMDCM to address the data quality problem in social media, and investigated its impact on the performance of short-text topic modelling (STTM) using six models: LDA, WNTM, BTM, PTM, GLTM, and FTM. Extensive experiments were conducted under various scenarios on two social media datasets, RW-CB-Twitter and CB-MNDLY, evaluating the quality of the data as well as the quality of the topics for each scenario, using different short-text topic modelling algorithms in terms of purity, NMI, accuracy, and topic coherence. The experimental results showed that STTM performance depends strongly on the data cleansing (SMDCM) techniques and the nature of the dataset used. It can be concluded that SMDCM affects both the performance of topic modelling and the quality of the data. The results proved the efficiency of GLTM and WNTM over the other STTM models when applying the SMDCM techniques, achieving optimum topic coherence and high accuracy on the RW-CB-Twitter and CB-MNDLY datasets.
SURESHA MALLAPPA received the M.Sc. degree from the University of Mysore, Mysore, the M.Phil. degree from DAVV, the M.Tech. degree from IIT Kharagpur, and the Ph.D. degree from IISc, Bengaluru. He is currently working as a Professor with the Department of Studies in Computer Science, University of Mysore. He has 31 years of teaching experience in computer science at the postgraduate level in various universities. He has published more than 80 research papers in reputed international and national journals and conferences. He has supervised numerous Ph.D. students. His research interests include dynamic web caching, database systems, image search engines, e-governance, data mining, big data, opinion mining, and cloud computing. He has also taught many courses in foreign universities as part of teaching assignments.