Big Data ML-Based Fake News Detection Using Distributed Learning

Users rely heavily on social media to consume and share news, which facilitates the mass dissemination of both genuine and fake stories. The proliferation of misinformation on social media platforms has serious consequences for society. The inability to differentiate between the several forms of false news on Twitter is a major obstacle to effective fake news detection. Researchers have made progress toward a solution by emphasizing methods for identifying fake news. This study uses the FNC-1 dataset, which includes four categories for identifying false news. State-of-the-art methods for spotting fake news are evaluated and compared using big data technology (Spark) and machine learning. The methodology employs a decentralized Spark cluster to create a stacked ensemble model. Following feature extraction using N-grams, Hashing TF-IDF, and a count vectorizer, we apply the proposed stacked ensemble classification model. The results show that the proposed model achieves an F1 score of 92.45%, compared with 83.10% for the baseline approach. The proposed model thus gains 9.35 percentage points in F1 score over the state-of-the-art techniques.


I. INTRODUCTION
The use of social media platforms to disseminate and consume news has increased in recent years. Social networking sites like Facebook and Twitter generate data every day [1]. It is no secret that the internet is a goldmine of information, especially recent news [2]. The proliferation of fake news is directly attributable to the internet's user-friendly nature. Because fake news is usually presented as factual, it is widely shared on social media. Such content is often spread for profit or to influence politics. The effects of fake news on society as a whole are profound, so addressing this issue is crucial [3]. Multiple instances of false news were reported to have spread on social media during the 2016 US elections, including the presidential election, and during the nomination of a new Air Marshal in India [4]. The dissemination of false information has negatively affected people's mental health and society as a whole [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.
Many approaches automatically judge news to be either bogus or legitimate based on the article's content. Content-based techniques extract information and tone from fake news stories. Style-based methods for detecting false news aim to exploit the manipulators' writing styles. By examining certain language features, we can distinguish fake news from the real thing [3]. However, false news is created with the intent of fooling readers, so improving detection from news content style alone is a difficult problem. To help avoid the difficult and time-consuming human work of fact-checking, the Natural Language Processing (NLP) community has shown considerable interest in the automatic recognition of fake news [6], [7]. Determining the integrity of news is a difficult task, even for automated approaches [8]. Examining what other news outlets say on the same issue can be a useful starting point for recognizing false news. The purpose of this step is to identify the position each source takes. Multiple tasks, such as evaluating online arguments [9], [10], verifying the integrity of Twitter rumors [11], [12], or understanding the argumentation structure of seminal works [13], [14], have traditionally relied on position identification.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The first Fake News Challenge (FNC-1) was set up to encourage automated fake news detection systems built on AI technology and machine learning. Almost fifty groups from industry and academia worked on this problem. One of the objectives of the FNC-1 challenge is to determine how a news article relates to a given headline. The article might support the headline, challenge it, discuss it, or have nothing to do with it, which gives four possible stances. The guidelines, dataset, and grading criteria for the FNC-1 challenge are all available on the challenge site. These topics are further shown in Figure 1, which depicts the results of four distinct studies.

A. OVERVIEW OF FAKE NEWS DETECTION
In 2017, Facebook released a white paper that explored the risks of online communication and the challenges of running one of the most prominent social media platforms today. Weedon, Nuland, and Stamos also noted the growing difficulty of using the ambiguous phrase ''fake news'' and stated that ''the overuse and misapplication of the term 'fake news' might be challenging since we cannot understand or adequately address these concerns without shared definitions'' [19]. The term can apply to anything from factually incorrect news articles to deceptions, April Fools' jokes, rumors, clickbait, or opinions posted online with incorrect facts.
In this research work, ''fake news'' is defined as a written article that is manifestly untrue and is disseminated as though it were authentic, mostly with malicious intent. Fake news has three important bases: textual, visual, and audio. Video-based and audio-based fake news are typically ignored when referring to textual fake news; moreover, each form has its own complexities that require different machine learning and deep learning algorithms to address problems such as 'Deep Fake'. The definition also implies that fake news can be fact-checked, an important characteristic: its claims can be checked to see whether they are true or false. Because rumors are usually hard to verify, this requirement excludes them from the definition. Conspiracy theories are classed as rumors because they are persistent and difficult to refute. False information from the entertainment sector, including hoaxes and April Fools' gags, is excluded because the intent must be harmful. Furthermore, the intent is malicious in that it seeks to sway public opinion in favor of a specific message. The definition also excludes text that was published incorrectly by mistake, such as transposed numbers.
A model of the connection between headlines and news content is necessary for identifying clickbait. It is also crucial to tell the difference between false news and clickbait. The term ''clickbait'' refers to articles with enticing headlines written to attract an online audience or traffic; when people click on such a headline, they end up at a different website with poorly written articles that have nothing to do with the headline. Clickbait is therefore written with one goal: getting more people to visit a website that relies on advertising to make money. The motive is monetary gain rather than furthering a political agenda by disseminating false information.
A prominent example is the deliberate spread of false news about Hillary Clinton by Russian trolls during the 2016 presidential election campaign, which was designed to shift people's voting choices away from Clinton and toward Donald Trump. This instance demonstrates how dangerous it can be when false information spreads on critical issues. There is another problem with false news: toxic information is also spread for no purpose other than to sow doubt, stir up chaos, and make it difficult for readers to tell fact from fiction.

1) SOCIAL MEDIA AND FAKE NEWS
Global knowledge dissemination has been democratized by technological advancements and the emergence of social media. Major news organizations have invested heavily in digital journalism, generating content for media platforms and growing their reach via social media and online tools. Furthermore, online social media platforms are becoming the most important sites for spreading information. Dissemination of information allows for the exchange of ideas and the connectivity of previously inaccessible locations. It enables users to form opinions, from many perspectives, about the information that platforms offer.
In the past, media companies have invested heavily in creating their presence online, with online media networking sites playing a significant role. They use social media platforms such as Facebook and Twitter to promote their material, spread information/news, and develop a network of individuals they may engage with. On the other hand, users benefit from social media's technical developments since people now have access to a wide range of information sources.
The current digital landscape for information dissemination and the challenges that media organizations face in an ever-present media environment have resulted in substantial changes in how news organizations are founded. Economic, technical, and social pressures have combined with the desire to be constantly visible, the race to report with the same speed and excitement as competitors, and the pursuit of followers, creating an atmosphere in which fake news is prevalent.
The latest technological advancements in social media have undoubtedly provided a fertile environment for spreading online lies in a largely deregulated media landscape financed and driven by advertising. The motivation to do good is usually overshadowed by the desire for profit, which significantly influences how the medium changes over time. Consequently, fake news appears on social media in the same context as real news, and the difficulty lies in distinguishing the two. While fake news is not a new phenomenon, the speed, quantity, and worldwide reach with which it is distributed are unprecedented: social media platforms such as Twitter, Facebook, and Instagram provide an ideal ground for quickly transmitting fake news. Furthermore, bots are increasingly being used to distort information, disrupt social media conversations, and draw users' attention, according to the same author.

2) USERS' RE-SHARING BEHAVIOR AND FAKE NEWS
From the perspectives discussed so far, it can be deduced that social media sites play an essential role in disseminating false information. Internet users also share the blame for spreading it. There are two main types of data sharing on online sharing sites: self-disclosure, in which a user voluntarily discloses private information, and re-sharing, in which a user distributes material already created by another user of the site or a third party. Distributing low-quality, erroneous, or purposefully misleading material may have negative implications, such as spreading false news, whereas spreading high-quality information can assist in the development of a more informed community. One of the most common ways information is disseminated online is by re-sharing, which includes retweeting, re-posting, revining, and re-blogging. On social media, for instance, it is common practice for users to write articles, distribute them among their networks, and engage in related online discourse. Social media users may engage in this practice with various apps. Sharing information rapidly is essential in many situations, including political campaigns and times of crisis, and sites like Twitter, YouTube, and Facebook have therefore become more important. Individuals are also using social media accounts for news production and dissemination.
On social media, for instance, someone may spread false information (or even create a fake story and post it). Re-sharing is a feature of many social media sites, so if one person shares a story, it increases the likelihood that others will do the same. Several remedies have been proposed, but there is still much disagreement over what constitutes ''fake news'', how it spreads, and how it affects social and political outcomes. Multiple major actors, including social media platforms, users, and groups working against the spread of fake news, may be able to control the spread of false information on the internet. This brief theoretical overview of Uses and Gratifications Theory (UGT), the filter bubble phenomenon, and social media re-sharing behavior provides important context for the current investigation. According to UGT research, the Ellinika-Hoaxes-Facebook demographic represents an engaged audience searching for high-quality news and information from sources outside their echo chamber via media consumption. This demographic is engaged, actively looking for information, and trying to confirm the integrity of rumors they may have seen on social media. Users' familiarity with the Internet, social media, and other media is crucial for identifying the prevalence of false news on these platforms and stopping its spread. To properly answer the research question (RQ) and draw conclusions on how members of the Ellinika-Hoaxes-Facebook group use particular media to detect and prevent the spread of false news, it is necessary to conduct research into their online behavior.

B. FAKE NEWS CHARACTERIZATION
The concept of fake news has two components: authenticity and purpose. ''Authenticity'' refers to the fact that misleading news contains false information that can be demonstrated to be untrue. Conspiracy theories, for example, are not included in the definition of fake news since it is nearly impossible to tell whether they are true or false in most situations. According to the second component, the objective of the erroneous material is to deceive the reader. Figure 2 represents the categories of fake news on social media. The characterization module represents fake news belonging to traditional media and social media. The second module shows the fake news detection techniques used for both traditional and social media.
To identify fake news, one must first understand the context of the text and the procedure used to categorize it. It is vital to begin with characterization when developing detection models, because it is necessary to grasp what fake news is before attempting to identify it. It is also not easy to develop a universally agreed definition of ''fake news''; one candidate is ''stories that are purposely and verifiably misleading and mislead readers''. As per Wikipedia, deliberate misinformation or hoaxes spread via multiple online platforms, news channels, or digital social media constitute a sort of fake journalism or propaganda [20]. Today's fake news is manipulative and diverse in its topics, techniques, and platforms. It consists of two components: authenticity and intent. Fake news material that contains inaccuracies that can be verified falls under authenticity. However, this excludes conspiracy theories because they are difficult to prove true or false in most circumstances. The second component refers to misleading material written to deceive the reader.

C. TRADITIONAL MEDIA FAKE NEWS
The media ecosystem supporting the spread of false information has grown and evolved over time, and it includes print, broadcast, social media, and digital platforms. Even before the rise of social media, traditional media was seen as a concern because of its role in disseminating false information. Multiple psychological and social scientific foundations are used to characterize the effects of false news on individuals and the social knowledge environment. Humans are not good at telling believable stories from those that are not. Several psychological and perceptual theories explain this phenomenon and the impact of misleading information. Traditional false news exploits readers' emotional vulnerabilities. Consumers are vulnerable to incorrect information due to the following two major factors:
• Consumers with naive realism believe that their view of the world is valid and that others who disagree with them are irrational or dishonest [21].
• People prefer information that backs their existing worldview. The cognitive biases that are part of the human condition lead consumers to regularly confuse fake news with the genuine thing [22].
By analysing the news ecosystem as a whole, we may be able to pinpoint some of the societal factors that fuel the spread of disinformation. Theories of Social Identity [23] and Normative Influence [5] argue that the need for others' approval is central to a person's sense of self and identity, which increases the likelihood that users will prefer the anonymity and security of online platforms when obtaining and sharing news content, even if it is false.

D. THE EXTRACTION OF FEATURES
Unlike social media, where additional social data may help identify false news, conventional news organizations rely on content such as text and photographs to spot and identify fake news. Some representative features of false news are shown in Figure 3. We next examine how to extract and represent relevant data from the media.

1) TEXTUAL CONTEXT BASED
News content is made up of three important components:
• Source: where the news comes from, who published it, and whether the source is authentic.
• Headline: a brief summary of the news written to entice readers.
• Body Text: the actual story or content of the news.
The most common method for detecting false information is to look at the content of the news piece. The substance of a news report is generally separated into two types: textual and visual. Much of the news material is presented in the textual mode. As previously noted, fake news aims to manipulate the audience, and it does so through the use of specific terminology. Genuine news, by contrast, usually draws on a different vocabulary because it is more legitimate. Linguistic features fall into two common categories: attribute-based language features and structure-based language features.

2) ATTRIBUTE-BASED LANGUAGE FEATURES
Attribute-based features capture ten parallel aspects of the linguistic elements of content style. These aspects include volume, uncertainty, objectivity, emotions, diversity, and readability [24]. Although attribute-based language features are generally highly relevant, explainable, and predictable, they are often less effective at assessing deception style than structure-based features. Furthermore, attribute-based features require extra resources for deception detection, which may take longer and demands considerable attention to correct feature evaluation and filtering.

3) STRUCTURE-BASED LANGUAGE FEATURES
Structure-based linguistic features describe content style at four language levels: lexicon, syntax, semantics, and discourse. Structure-based features are also called technique-oriented features because most of their quantification depends on NLP-based methods. The critical challenge at the lexical level is identifying the frequency statistics of words, letters, or other entities, which can be done by applying n-gram models. At the syntax level, Part-of-Speech (POS) taggers perform shallow syntax tasks, making POS tagging and assessment easier. Probabilistic Context-Free Grammars (PCFG) perform deep syntax-level operations by analysing Context-Free Grammars (CFG) with parse trees. At the semantic level, Linguistic Inquiry and Word Count (LIWC) is used to create semantic classes for semantic features.
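The lexical-level frequency statistics described above can be illustrated with a minimal n-gram counter. This is only a sketch in plain Python, assuming whitespace tokenization and lowercasing; the function names `ngrams` and `ngram_counts` are illustrative and are not taken from the study's Spark pipeline.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-gram frequencies in a text.

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace; a real pipeline would use a proper tokenizer.
    """
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    counts = ngram_counts("fake news spreads fake news", 2)
    for gram, freq in counts.most_common():
        print(" ".join(gram), freq)
```

The resulting counts can serve directly as lexical features, or as input to a weighting scheme such as TF-IDF.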

E. PROBLEM FORMULATION
Developing a Spark distributed cluster-based environment for efficiently detecting fake news articles via a supervised learning paradigm required solving two sub-problems. First, our model needed to learn how to recognize and capture the necessary information in lengthy textual news articles in order to categorize the association between news item titles and related meta descriptions.

F. RESEARCH OBJECTIVES
In the first section of this research, we examine the effectiveness of Recurrent Neural Networks (RNN) in modeling news articles to identify the link between an article's body content and its title. As part of our research, we use the dataset made available for the FNC-1 competition to train and assess a classifier. Our objectives are the following:
• Use the Spark framework to research, assess, and compare several machine learning classification techniques on the four classes of the FNC-1 dataset.
• Given a title and an article, determine whether the article agrees with, disagrees with, discusses, or is irrelevant to the assertion made in the headline.
• Propose an efficient, systematic, and functional approach based on machine learning algorithms for detecting fake news using Spark, and design an efficient stacked ensemble classifier for fake news detection.
In an experiment, we demonstrate that the proposed method can accurately identify fake news and outperforms current state-of-the-art algorithms.
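To make the idea of a stacked ensemble over the four stance classes concrete, the following is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the two rule-based base learners, the cue-word list, and the lookup-table meta-learner are hypothetical stand-ins and are not the classifiers used in this study. For brevity the meta-learner is also fit on the same examples as the base learners; a proper stacking setup would use held-out or cross-validated base predictions.

```python
from collections import Counter, defaultdict

LABELS = ("agree", "disagree", "discuss", "unrelated")


def overlap_learner(headline, body):
    """Toy base learner: no shared words means 'unrelated'."""
    shared = set(headline.lower().split()) & set(body.lower().split())
    return "discuss" if shared else "unrelated"


def negation_learner(headline, body):
    """Toy base learner: refuting cue words suggest 'disagree'."""
    cues = {"hoax", "fake", "denies", "false", "not"}  # hypothetical cue list
    return "disagree" if cues & set(body.lower().split()) else "agree"


BASE_LEARNERS = (overlap_learner, negation_learner)


def base_predictions(headline, body):
    """Level 0: collect the output of every base learner."""
    return tuple(learner(headline, body) for learner in BASE_LEARNERS)


def fit_meta(pairs, labels):
    """Level 1: map each combination of base outputs to its most
    frequent true label in the training data."""
    votes = defaultdict(Counter)
    for (headline, body), label in zip(pairs, labels):
        votes[base_predictions(headline, body)][label] += 1
    return {combo: c.most_common(1)[0][0] for combo, c in votes.items()}


def predict(meta, headline, body, default="unrelated"):
    """Stacked prediction; unseen combinations fall back to a default."""
    return meta.get(base_predictions(headline, body), default)


if __name__ == "__main__":
    train_pairs = [
        ("plant rips contract", "plant rips the contract"),
        ("plant rips contract", "the story is false says plant"),
        ("plant rips contract", "weather is sunny today"),
    ]
    train_labels = ["agree", "disagree", "unrelated"]
    meta = fit_meta(train_pairs, train_labels)
    print(predict(meta, "zeppelin reunion", "reunion is a hoax"))
```

The design point is that the meta-learner never sees the raw text: it only learns how to combine the base learners' outputs, which is what distinguishes stacking from simple voting.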

G. PAPER LAYOUT
The remainder of the paper is organized as follows. Related work is reviewed in section II. The dataset used for experimentation and preliminaries is discussed in section III. The experimental results and discussion are presented in section IV. Finally, section V presents the conclusion and future work.

II. LITERATURE REVIEW
This section provides an overview of the difficulties that previous research has faced in identifying fake news. To identify fabricated news stories, it is necessary to perform rumor detection and identification. It is important to distinguish between real and fake news since both are based on deliberate fabrication. Fake news identification is particularly difficult when detection relies on hand-crafted characteristics. Tweets and social context can be used to generate features. Accordingly, we assess prior work based on single-modality and stance identification.

A. TEXTUAL CONTENT BASED
Most earlier news identification studies relied mainly on textual elements and user metadata. Text-based features are statistically extracted from message text content and have been extensively discussed in the literature on fake news identification. The textual component captures the distinctive writing styles [15], [19], [20] and emotional signals [18] that are prominent in fake news.
Network connections, style analysis, and individual emotions have all been shown to contribute to detecting fake news [19]. The authors of [20] explored writing style and its effects on readers' viewpoints and attitudes. Emotion is a significant predictor in many fake news detection studies, and most rely on user stances or simple statistical emotional features to convey emotion. In [15], the authors introduced a novel dual emotion-based method for identifying fake news that can learn from publishers' and users' content, user comments, and emotional representation. Reference [25] employed an ML model for identifying fake news that uses convolution filters to distinguish between different granularities of text information. They investigated the issue of stance classification in an innovative approach to consumer health information inquiries and achieved 84% accuracy using the SVM model.

B. SOCIAL CONTEXT BASED
User-generated social media interactions with news stories may give additional information beyond the aspects directly relevant to the substance of the stories. In [26], the authors proposed a novel approach employing a knowledge graph to identify fake news based on actual content. A graph-kernel-based approach was used by [27] to discover propagation patterns and attitudes. On the other hand, social context features are difficult to gather because they are noisy, unstructured, and time-consuming to collect [28].

C. STANCE DETECTION OVERVIEW
From a broad viewpoint, stance detection can be described as the problem of determining an author's or text's point of view concerning a specified target, such as a single topic, headline, or even a person [15], [29]. Consequently, three factors and a machine learning based categorization technique determine how the comparison occurs. The class labels (for example: support, against, for, or neutral) are determined by the task. Stance detection has been applied across a wide range of fields, including political arguments [30], [31], articles [32], [33], and even internal company dialogues [25], [34]. Detecting the stance of tweets or short texts such as rumors [35] or microblog posts has received much attention in opinion mining. Examples of targets in the available datasets include ''Hillary Clinton'' as a celebrity, ''Atheism'' as a specific issue, and the claim that ''E-cigarettes are safer than regular cigarettes''. Shared tasks that provide such datasets and promote research have emerged in several languages.
The sub-task for detecting stance in tweets [26] was presented at SemEval-2016, with roughly 5,000 English tweets covering five familiar subjects. The task has prompted a variety of approaches, including conventional techniques (for example, KNN [36], SVM [22], or feature-based methods [34]) and deep learning approaches (e.g., BiLSTM [37], Bidirectional Conditional Encoding [27], [34]). Furthermore, public datasets are available to support new work; for instance, the Multi-Perspective Consumer Health Query dataset [38] is dedicated to detecting the stance of sentences taken from high-quality articles on five separate claims, such as ''Sun exposure causes skin cancer''. It contains an in-depth examination of various approaches to the two goals listed above. The need for well-annotated data in languages other than English has rapidly increased annotation efforts and shared tasks aimed at furthering research. Examples include Stance-Cat, a task for identifying stance in Spanish and Catalan tweets [39], a proposal and database of brief statements in Russian online forums [40], and even projects that integrate several languages [41].
A group of volunteers from industry and academia launched the Fake News Challenge in December 2016 [10]. Using machine learning, Natural Language Processing (NLP), and Artificial Intelligence (AI), this competition aimed to encourage the development of technologies that could assist human fact-checkers in detecting deliberate deception in news reporting. As a first step, the organizers decided to examine what other media outlets have to say about a topic. Consequently, they opened the competition with a stance detection challenge in its first round. The organizers collected data on headlines and body text before the event. In the competition, they asked participants to create classifiers that could reliably classify a body text's stance toward a given headline into one of four categories: ''disagree'', ''agree'', ''discuss'', or ''unrelated''. On this task's test set, the top three teams achieved accuracy rates greater than or equal to 80%. The top team's model combined Gradient Boosted Decision Trees and Deep Convolutional Neural Networks.

D. MISLEADING HEADLINES
Identifying misleading headlines in this research required classifying each article's treatment of the assertion made in the title into one of four categories: (a) agrees, (b) discusses, (c) disagrees, and (d) irrelevant (the body text discusses a different topic from the headline). As a result of the proliferation of annotated corpora and the increased use of new technologies to combat the fake news pandemic, a new obstacle has recently presented itself to the field of fake news analysis [8]. In this setting, several research challenges and competitions have been presented. The most recent and important ones are examined below. The evolving dataset [18] was used to create the Fake News Challenge (FNC-1) [42].
The goal of FNC-1 is to serve as a benchmark for research into AI-based technologies, machine learning, and natural language processing as they apply to the detection of false news. The organizers decided to begin with stance detection as a first step toward this macro-challenge. The FNC-1 dataset, which includes over 75,000 instances labelled as either ''agreeing,'' ''discussing,'' ''disagreeing,'' or ''unrelated,'' was made publicly available. Given the headline ''Robert Plant Ripped up $800M Led Zeppelin Reunion Contract,'' the following descriptions illustrate the categories mentioned, as annotated in the FNC-1 dataset.
Body content that confirms the headline is an instance of the agree class. In the discuss class, the article's main body addresses the same issue as the title but does not take a position on the matter. When the headline and the body content deal with different topics, the pair belongs to the unrelated class. The FNC-1 competition had 200 entries, the top 10% of which averaged 82% relative points. The organizers developed a basic baseline using just hand-coded features and a Gradient Boosting Classifier, both freely accessible on GitHub. The top systems were UCLMR [43], Talos [44], and the Athene system [23]. The CNNs utilised by Talos [44] were one-dimensional, operated at the word level, and were trained using Google News word vectors for the article's main body and title. The output of the CNN is then fed into a multi-layer perceptron (MLP) model that generates one of four possible classes. The whole model is trained end to end. The system won the FNC-1 competition with its superior performance using the CNN-MLP combination. More recently, several studies have employed FNC-1 with encouraging outcomes. For instance, [45] suggested a tree-like structure for the related classes by grouping the existing disagree, agree, and discuss classes. This approach uses a two-layer neural network to learn a hierarchical representation of classes, achieving a weighted accuracy of 88.0%.
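The ''relative points'' and weighted accuracy figures quoted above come from a weighted scoring scheme. The sketch below assumes the weighting commonly described for FNC-1: 0.25 points for getting the related/unrelated decision right, plus a further 0.75 points for getting the exact stance right on a related pair, with the relative score obtained by dividing by the score of a perfect prediction. The function names are illustrative, and the official scorer on the challenge site remains the authoritative definition.

```python
RELATED = {"agree", "disagree", "discuss"}


def fnc_score(gold, predicted):
    """Weighted FNC-1-style score (assumed weighting, see lead-in)."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if g == p:
            score += 0.25          # exact match on any label
            if g != "unrelated":
                score += 0.50      # extra credit for an exact related stance
        if g in RELATED and p in RELATED:
            score += 0.25          # related/unrelated decision was right
    return score


def relative_score(gold, predicted):
    """Score as a fraction of the best achievable score on `gold`."""
    best = fnc_score(gold, gold)
    return fnc_score(gold, predicted) / best if best else 0.0


if __name__ == "__main__":
    gold = ["agree", "unrelated", "discuss", "disagree"]
    pred = ["agree", "unrelated", "agree", "unrelated"]
    print(f"relative score: {relative_score(gold, pred):.3f}")
```

Under this weighting, a classifier that only separates related from unrelated pairs still earns partial credit, which is why weighted accuracy and plain accuracy can differ noticeably on FNC-1.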
Additionally, scholars built a stance detection model using transfer learning on a RoBERTa deep bidirectional transformer language model. They achieved a weighted accuracy of 90.01% by employing bidirectional cross-attention between claim-article pairs via pair encoding with self-attention [46]. Further work should be done on stance detection problems, such as linking a news title and article content, outside the FNC-1 Challenge and dataset. Several writers have compiled claims and criticisms [21], [47] to help with identification. Some analytic effort is devoted to ''argument mining,'' in which the headline presents an argument not supported by the content. While argument mining is effective in solving the problem of stance detection, other tasks that discover semantic relationships within the text, such as inconsistency detection [48], contrast detection [49], and synthesis detection [50], may also be useful. Mishra et al. provided a comprehensive taxonomy for spotting false news, outlining the many forms of disinformation and what sets them apart. Multiple mechanisms exist to track down those who propagate false information. Multiple datasets (Liar, Fake News, and Corpus) were used to compare traditional machine learning and deep learning techniques. That study demonstrated that deep learning methods outperformed more conventional machine learning strategies, with Bi-LSTM performing best at detecting bogus news with an F1 score of 96.
In [43], the authors introduced the Multi-integrated Domain Adaptive Supervision (MIDAS) system to automatically choose the model that best fits a particular collection of data drawn from random distributions. By using local smoothness as a proxy for accuracy and the relevance of training data, MIDAS can increase generalization accuracy across nine distinct fake news datasets. MIDAS improves the recognition of bogus news linked to COVID-19 by more than 10% compared to other labelling methods [43]. The results of the literature review are summarized in Table 1.

III. PROPOSED METHODOLOGY
This section describes the proposed approach in detail. The approach comprises data analysis, feature extraction, single-classifier classification, and ensemble classification, as shown in Figure 4. Stage 1 of the Fake News Challenge provides a specific task and dataset for addressing the difficulty of identifying fake news. The challenge's primary motivation is to build a semi-automated pipeline that examines the stance of several news items on a specific topic. Thus, each instance in the dataset comprises a title, an article body, and one of four labels: ''Disagree'', ''Agree'', ''Unrelated'', and ''Discuss''. Figure 4 summarizes our proposed approach, which achieves fake news classification by solving a multi-class labelling problem. The first phase explains the corpus creation technique, which combines stances and bodies based on news article ids. The second phase describes the preprocessing applied to the news article text. The third phase demonstrates techniques for feature selection or dimensionality reduction. The fourth phase describes each ML model used in this study. Finally, the last phase outlines this study's ensemble learning models. For the experiments, we divide the dataset into a training set comprising 75% of the data and a testing set comprising the remaining 25%.

A. DATASET
Carnegie Mellon University adjunct professor Dean Pomerleau, Joostware, and the AI Research Corporation founder Delip Rao hosted a competition called the Fake News Challenge Stage 1 (FNC-1) to investigate the potential of machine learning and natural language processing in the fight against fake news [27]. This issue was the driving force for the competition, which focused on stance detection. This section provides an overview of the competition dataset, the baseline used by the FNC-1 organisers, and the winning strategies used throughout the competition.
The underlying data were produced by summarizing each news story into a headline and then annotating the headline with the stance the story takes on the claim it introduces. For this stance classification task, there are three possible labels: ''for,'' ''against,'' and ''observing.'' The Emergent dataset [27] is the basis for the FNC-1 competition dataset. To create the FNC-1 dataset, headlines and articles from the Emergent dataset were randomly matched depending on their stance toward the linked claim. First, the headlines and articles are separated into related and unrelated groups. Second, and more difficult, the collection of related headline-article pairs is further split into the three classes disagree, agree, and discuss, allowing for supervision of the task of evaluating the stance of an article relative to the claim presented in the associated headline. There are 49,972 headline-article pairs in the training set of the FNC-1 dataset, and another set of pairs in the test set. The training pairs are built from 1,689 distinct headlines and 1,648 unique articles. The test set includes 904 distinct articles and 894 unique headlines. In the training data, 73 percent of the pairs are classified as unrelated, 7.4 percent as agree, 1.7 percent as disagree, and 17.8 percent as discuss. In the test data, about 72.2 percent of the pairs are unrelated, 7.4 percent agree, 2.7 percent disagree, and 17.6 percent discuss. The training set has 40,350 headline-article pairs, the hold-out set has 9,622, and the claim set has 25,413.

B. CORPUS DESIGN
The FNC-1 dataset has four distinct classes (agree, disagree, discuss, unrelated). During pre-processing, the labels are encoded into numeric target values and the text is cleaned. The preprocessed data is then split into 75% for training and 25% for testing.
This study used the FNC-1 dataset, which consists of two CSV files containing the stances and the body corpora of news stories written in English. Collecting news stories from multiple sources is difficult due to a lack of linguistic resources. Furthermore, annotating these news pieces based on their contents requires specialist expertise, a significant amount of time, and substantial funding. As a result, augmented corpus design is a practical way to conduct fake news detection research. Our augmented corpus is created by combining 49,972 stances with 1,683 bodies based on their ids. The corpus has four distinct classes (agree, disagree, discuss, unrelated). It contains 8,909 discuss stances, 36,545 unrelated stances, 3,678 agree stances, and 840 disagree stances. After the headlines and articles are gathered into one column, the final corpus contains two fields: text and stance.
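The corpus construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' Spark pipeline: the column names (`Headline`, `Body ID`, `Stance`), the integer label encoding, the toy rows, and the helper names `build_corpus` and `train_test_split` are all assumptions made for the example. The class counts are taken from the text, with 3,678 assumed to be the agree count.

```python
import random

# Hypothetical miniature stand-ins for the two FNC-1 CSV files.
stances = [
    {"Headline": "Claim A is true", "Body ID": 0, "Stance": "agree"},
    {"Headline": "Claim A is false", "Body ID": 0, "Stance": "disagree"},
    {"Headline": "Weather today", "Body ID": 1, "Stance": "unrelated"},
    {"Headline": "Experts debate B", "Body ID": 1, "Stance": "discuss"},
]
bodies = {0: "Body text about claim A.", 1: "Body text about claim B."}

# Assumed label encoding; the paper only states that labels become numeric.
LABELS = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}

def build_corpus(stances, bodies):
    """Join each stance row with its body by id and merge headline + body."""
    corpus = []
    for row in stances:
        text = row["Headline"] + " " + bodies[row["Body ID"]]
        corpus.append({"text": text, "label": LABELS[row["Stance"]]})
    return corpus

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

corpus = build_corpus(stances, bodies)
train, test = train_test_split(corpus)
print(len(train), len(test))

# Class shares implied by the stance counts reported in the text.
class_counts = {"unrelated": 36545, "discuss": 8909, "agree": 3678, "disagree": 840}
total = sum(class_counts.values())
for name, count in class_counts.items():
    print(name, round(100 * count / total, 1))
```

The printed shares make the class imbalance explicit, which is why a stratified split may be preferable to the plain random split shown here.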

C. PRE-PROCESSING
Data mining relies heavily on pre-processing, which converts inconsistent and incomplete raw data into a machine-readable representation. Various text preprocessing steps were applied to the FNC-1 dataset using NLP techniques such as conversion to lowercase, stop word elimination, stemming, and tokenization, as well as algorithms from the Keras library. Stop words, such as ''the'', ''of'', and ''there'', are the most commonly used words in everyday language and typically contribute little to the meaning of a sentence. Removing them saves the time and space that would otherwise be spent on uninformative words. Words that share a root may also appear in the text in many inflected forms; for example, ''eating'' and ''eats'' both reduce to ''eat''. Reducing words to their most basic form, known as stemming [51], is done with the open-source NLTK implementation of the Porter stemmer. The preprocessing steps are as follows:
1) Stop Word Removal: The text is first cleaned by removing all stop words. Stop words can be removed because they are very common and carry little useful information; examples include the conjunctions ''and'', ''or'', and ''but''. This step matters in natural language processing because processing these frequent but uninformative words consumes a significant amount of time.
2) Punctuation Removal: Punctuation provides the grammatical context of a sentence but often adds little to its meaning. A comma, for example, may not add anything to the understanding of the statement.
3) Link Removal: This step removes hypertext links from social media posts. Regular expressions are used to do this.
4) Lemmatization or Stemming: Either lemmatization or stemming is done during this step. The NLTK WordNet Lemmatizer is used for lemmatization, while the NLTK Snowball Stemmer implementation, based on the Porter2 stemming algorithm [52], is used for stemming.
5) Reply Removal: In addition to the stages above, which every social media post must go through, words beginning with @ (primarily used for Twitter replies) are eliminated in this step. Regular expressions are also used to do this.
6) Lowercase Transformation: Every word is converted to lowercase in this step to account for variations in capitalization.
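As an illustration of the cleaning steps listed above, the following sketch chains lowercasing, link removal, reply removal, punctuation removal, and stop word removal using only the Python standard library. The stop word list is a small assumed sample rather than the NLTK list, the function name `preprocess` is hypothetical, and stemming or lemmatization is deliberately left out because it would require NLTK.

```python
import re
import string

# Small illustrative stop word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "of", "there", "and", "or", "but", "a", "is", "to", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hypertext links
REPLY_RE = re.compile(r"@\w+")                  # Twitter-style replies
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Return the cleaned token list for one headline or article."""
    text = text.lower()                  # lowercase transformation
    text = URL_RE.sub(" ", text)         # link removal
    text = REPLY_RE.sub(" ", text)       # reply removal
    text = text.translate(PUNCT_TABLE)   # punctuation removal
    tokens = text.split()                # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

print(preprocess("@user The claim is FALSE, see https://example.com/story!"))
# ['claim', 'false', 'see']
```

Links and replies are removed before punctuation so that the `@` and `://` markers are still present when the regular expressions run.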

D. FEATURE EXTRACTION
Feature extraction transforms raw data into numerical features that can be further processed while preserving the original data set's information. It is more effective than just using raw data to train a machine.

1) HASHINGTF
HashingTF is a transformer that converts a set of terms into fixed-length feature vectors. In text processing, a ''set of terms'' could be a collection of words. HashingTF employs the hashing technique: a hash function maps each raw feature (term) to an index, and the mapped indices are then used to calculate the term frequencies. The hash function in use here is Austin Appleby's MurmurHash 3 (MurmurHash3_x86_32). This approach avoids creating a global term-to-index map, which can be time-consuming and expensive for large corpora. However, it is vulnerable to hash collisions [45], which occur when different raw features are hashed to the same index. Increasing the number of buckets in the hash table is recommended to reduce the likelihood of collisions. Because a simple modulo on the hashed value determines the vector index, the feature size should be a power of two; otherwise, the features will not be mapped evenly to the vector indices. A binary toggle parameter controls the term frequency counts. When it is set to true, all nonzero frequency counts are set to 1, which is useful for discrete probabilistic models that use binary rather than integer counts.
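The hashing technique described above can be sketched without Spark. The example below is an assumption-laden illustration: it uses `zlib.crc32` from the standard library as a stand-in for MurmurHash 3, and the function name `hashing_tf` and its parameters are hypothetical rather than Spark's API.

```python
import zlib

def hashing_tf(tokens, num_features=16, binary=False):
    """Map tokens to a fixed-length term-frequency vector via hashing."""
    vec = [0] * num_features
    for tok in tokens:
        # Stand-in hash (assumption): Spark's HashingTF uses MurmurHash 3.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features  # simple modulo
        if binary:
            vec[idx] = 1      # binary toggle: any nonzero count becomes 1
        else:
            vec[idx] += 1     # plain term frequency
    return vec

tokens = ["fake", "news", "fake", "claim"]
counts = hashing_tf(tokens)
flags = hashing_tf(tokens, binary=True)
print(counts)
print(flags)
```

No vocabulary is built or stored, so the transform can be applied to each partition of a distributed dataset independently. The price is that two different terms may land in the same bucket, which a larger `num_features` makes less likely.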

2) IDF
Inverse Document Frequency (IDF) is a measure frequently used together with term frequency. The issue with term frequency is that frequent terms are not necessarily the most significant. For example, ''content'' will appear on every web page. IDF lowers the weight of words that occur frequently across a corpus (collection of documents). It is derived from the ratio of the total number of documents to the number of documents containing the term, typically on a logarithmic scale. In Spark, IDF is an Estimator that generates an IDFModel after being fitted to a dataset. The IDFModel takes feature vectors (typically created by HashingTF or CountVectorizer) and scales each feature [46]. Intuitively, it down-weights features that appear frequently in a corpus.
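A small worked example may help here. The sketch below assumes the smoothed form IDF(t) = log((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) is the number of documents containing term t; the function names `fit_idf` and `tf_idf` and the three toy documents are hypothetical.

```python
import math

def fit_idf(docs):
    """Compute a smoothed IDF weight for every term seen in the corpus."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Scale the raw term frequencies of one document by the IDF weights."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}

docs = [
    ["content", "fake", "news"],
    ["content", "real", "news"],
    ["content", "claim"],
]
idf = fit_idf(docs)
print(idf["content"], idf["news"], idf["fake"])
print(tf_idf(docs[0], idf))
```

With this smoothing, a term that occurs in every document (''content'' in the toy corpus) receives a weight of exactly zero, while rarer terms receive larger weights.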

E. CLASSIFICATION MODELS AND PARAMETERS SETTINGS
We use the following machine learning techniques to classify news items and investigate the effectiveness of our proposed method: Random Forest (RF): a supervised learning technique that may be used for classification, regression, and other tasks. It builds many decision trees, each on a random sample of the data, obtains a prediction from each tree, and then selects the final prediction by voting. The parameters for our RF method are n-estimators = 200, bootstrap = True, criterion = Gini, min-samples-split = 2, random-state = 0, and min-samples-leaf = 1.
Logistic Regression (LR): a supervised classification model. It is a very straightforward ML algorithm applied to classification problems such as noise detection, diabetes prediction, and cancer detection. LR is used to predict the probability of the target variable [47]. In our application, the parameters of the LR algorithm are Penalty = l2, C = 1.0, reduce rating = 1, solver = lbfgs, max iter = 100, and verbose = 0.
Decision Tree (DT): decision trees are extensively used in decision analysis and machine learning [21]. A DT is a decision-making tool that uses a tree-like graph of decisions and their consequences, such as random event outcomes, resource costs, and utility, to make judgments. Internal nodes in a DT express a condition on an attribute. Each internal node splits into branches depending on the condition's outcome, until a node no longer splits and becomes a leaf node, which indicates the class label to be assigned [48].
Ensemble Classifier: In addition to the individual classifiers, an ensemble technique was developed that combines the three individual classifiers. The objective is to develop a voting classifier that calculates the weights to apply to each classifier's prediction [53]. The probabilities computed by the classifiers are first stored in a matrix, so that each training instance is linked with a probability vector. This matrix of vectors is then fed into a meta-classifier model, which calculates the weights and produces the final label (0, 1, 2, or 3).
For comparison with the ensemble model, a voting classifier was also constructed to perform simple majority voting among the models' predictions. Ensemble classification is generally divided into two stages: base-level and ensemble-level. The base predictors take the HashingTF-IDF features extracted from the news articles as input. The output predictions from these base predictors are fed into the ensemble-level models. The ensemble model's main purpose is to improve the overall prediction F1 score by overcoming the shortcomings of the base predictors. We used stacking ensemble models for ensemble classification [54].
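The difference between simple majority voting and the probability-matrix input of a stacked model can be illustrated as follows. This is only a sketch: the probability vectors are made-up numbers, the names `majority_vote`, `meta_features`, and `weighted_meta_predict` are hypothetical, and the fixed weights stand in for a meta-classifier that would normally be learned from training data.

```python
from collections import Counter

# Made-up class-probability outputs of three base models (RF, LR, DT)
# for two instances over the labels 0, 1, 2, 3.
base_probs = {
    "rf": [[0.6, 0.1, 0.2, 0.1], [0.1, 0.1, 0.3, 0.5]],
    "lr": [[0.2, 0.1, 0.6, 0.1], [0.2, 0.1, 0.2, 0.5]],
    "dt": [[0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]],
}

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def majority_vote(base_probs, i):
    """Each base model votes for its most probable label."""
    votes = [argmax(p[i]) for p in base_probs.values()]
    return Counter(votes).most_common(1)[0][0]

def meta_features(base_probs, i):
    """Concatenate the base models' probability vectors for instance i."""
    row = []
    for name in sorted(base_probs):
        row.extend(base_probs[name][i])
    return row

def weighted_meta_predict(base_probs, i, weights):
    """Stand-in meta-classifier: weighted sum of the probability vectors."""
    n_classes = len(next(iter(base_probs.values()))[i])
    scores = [0.0] * n_classes
    for name, w in weights.items():
        for c, p in enumerate(base_probs[name][i]):
            scores[c] += w * p
    return argmax(scores)

weights = {"rf": 0.4, "lr": 0.4, "dt": 0.2}  # assumed, not learned
for i in range(2):
    print(majority_vote(base_probs, i), weighted_meta_predict(base_probs, i, weights))
```

In an actual stacking setup, the rows produced by `meta_features` would be the training input of the meta-classifier, ideally computed on held-out folds so that the meta-classifier does not see predictions made on the base models' own training data.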

F. EVALUATION METRICS
The main concern is determining the model's ability to distinguish true from false news. Model selection and implementation are essential, but evaluation deserves equal care. Several assessment measures are applied to the test data to assess the model's capacity to detect false news, including the classification report (accuracy, precision, recall, F1-score) and the confusion matrix. Each of these measures is described in detail below. Combining careful preprocessing of fake news data with a strong algorithm has been reported to produce strong results [49].
Observations for which the model's prediction matches the actual class are true positives and true negatives. False positives and false negatives are the two types of error that we want to minimize. Because these terms can be confusing, each is defined below with an example.
A True Positive (TP) is a correctly predicted positive value: both the actual and the predicted class are yes. For instance, the actual class indicates that a passenger survived, and the predicted class says the same. A True Negative (TN) is a correctly predicted negative value: both the actual and the predicted class are no. For instance, the actual class indicates that a passenger did not survive, and the predicted class says the same. False positives and false negatives occur when the actual class differs from the predicted class.
A False Positive (FP) occurs when the predicted class is yes but the actual class is no. For example, the actual class indicates that the passenger did not survive, but the predicted class says that the passenger survived. A False Negative (FN) occurs when the actual class is yes but the predicted class is no. For example, the actual class indicates that the passenger survived, but the predicted class says that the passenger died.
To verify the usefulness of the model, the following assessment criteria are used. Accuracy is the proportion of correct predictions relative to all predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). Several metrics may be used to evaluate a model's efficacy, but accuracy is often prioritized. Precision is the proportion of predicted positives that are actually positive: Precision = TP / (TP + FP). Recall is the proportion of actual positives that the model correctly identifies: Recall = TP / (TP + FN). The F1-score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall). Its best value is 1, which indicates perfect precision and recall. The classification report combines these measures (accuracy, precision, recall, F1-score, and support), where ''support'' refers to the number of occurrences of each class in the dataset [50]. The report also summarizes the weighted averages of precision and recall for the sample.
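The standard definitions of these measures can be checked with a short calculation. The sketch below uses hypothetical confusion counts (80 TP, 90 TN, 20 FP, 10 FN) and a hypothetical helper name, `classification_metrics`; none of the numbers come from the experiments in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = classification_metrics(tp=80, tn=90, fp=20, fn=10)
for name, value in m.items():
    print(name, round(value, 4))
```

For these counts, accuracy is 170/200 = 0.85, precision is 80/100 = 0.80, recall is 80/90 ≈ 0.8889, and F1 is 160/190 ≈ 0.8421. For the four-class stance problem, the same quantities are computed per class by treating that class as ''yes'' and all other classes as ''no''.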

A. CLASSIFICATION RESULTS
The experimental results of the Term Frequency-Inverse Document Frequency (TF-IDF) and HashingTF feature extraction techniques with ensemble models are presented in Table 2. Using HashingTF and IDF features, the accuracy, precision, recall, and F1-score are 93.45%, 92.03%, 92.45%, and 92.25%, respectively. The accuracy of LR_HashingTF-IDF, 93.45%, is the highest among all experiments. Furthermore, Bigram Logistic Regression exhibits 88.45% accuracy, 87.02% precision, 88.01% recall, and an 87.06% F1-score. We also performed experiments using GloVe word embeddings with logistic regression. The results of GloVe with logistic regression are lower but still reasonable, with 73.25% accuracy, 63.12% precision, 73.25% recall, and a 62.45% F1-score. To make a broader comparison, we include count vectorizer features, which were passed to logistic regression to detect fake news. Using the count vectorizer technique, the logistic regression model achieved 88.45% accuracy, 82.12% precision, 88.45% recall, and an 87.35% F1-score. Moreover, we merged the count vectorizer and TF-IDF features to obtain better results, but we failed to obtain improved results due to the high computational cost. The accuracy, precision, recall, and F1-score using count vectorizer and TF-IDF features with logistic regression are 84.54%, 83.12%, 84.25%, and 83.26%, respectively. We also employed the Support Vector Machine (SVM) model with count vectorizer features to test its abilities, and it obtained improved results with 91.75% accuracy, 91.25% precision, 91.24% recall, and a 90.45% F1-score, higher than LR with the count vectorizer. We also employed LR and SVM models with HashingTF-IDF features. The results of LR with HashingTF-IDF are better than those of the SVM model, which achieved 90.75% accuracy with HashingTF-IDF.
The LR model with HashingTF-IDF obtained 93.78% accuracy, which is higher than the SVM model's accuracy. Finally, we used Trigram, Unigram + Bigram + Trigram, Unigram + Bigram + Trigram limited to the top 16000 features, and Unigram + Bigram + Trigram + Cv + IDF + Chiseq features with Logistic Regression to detect fake news. LR with Trigram features obtains 83.47% accuracy, 82.01% precision, 83.45% recall, and an 82.64% F1-score. Compared to Trigram features alone, the LR model with Unigram, Bigram, and Trigram features obtained better results, with 88.64% accuracy. However, when limited to the top 16000 Unigram, Bigram, and Trigram features, the LR model obtained a lower accuracy of 83.78%. Finally, we merged all the features (Unigram + Bigram + Trigram + Cv + IDF + Chiseq), applied LR to them, and obtained 83.45% accuracy and an 82.45% F1-score.
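For readers unfamiliar with the n-gram feature sets compared above, the following sketch shows how unigram, bigram, and trigram features can be generated and how a feature set can be limited to the top k entries. The helper names are hypothetical, and ranking by raw frequency is an assumption; the text does not state how the top 16000 features were chosen.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def combined_ngrams(tokens, max_n=3):
    """Unigram + Bigram + Trigram features for one document."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

def top_features(docs_tokens, k, max_n=3):
    """Keep the k most frequent n-gram features (assumed ranking rule)."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(combined_ngrams(tokens, max_n))
    return [feat for feat, _ in counts.most_common(k)]

tokens = ["fake", "news", "spreads", "fast"]
print(ngrams(tokens, 2))             # ['fake news', 'news spreads', 'spreads fast']
print(len(combined_ngrams(tokens)))  # 9
```

A document of m tokens yields m unigrams, m - 1 bigrams, and m - 2 trigrams, so the combined vocabulary grows quickly, which is one motivation for capping the number of features.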
Figure 5(a) shows the classification report of the ensemble model. The support is the number of instances of each class in the testing set; 12,403 instances are used for testing. We report weighted averages of precision, recall, and F1-score because they account for the class imbalance problem. The macro average is the unweighted mean of the per-class precision, recall, or F1-score, while the weighted average weights each class's score by its support. The weighted average score is higher because of the class imbalance in the dataset. We also construct the ensemble model's confusion matrix, as shown in Figure 5(b). A confusion matrix, also known as an error matrix, is a table that visually depicts the performance of a supervised classification machine learning system. Figure 5(b) shows that the model made multiple incorrect classifications. The ensemble model's final accuracy on the testing data is 93%.
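The effect of class imbalance on the two averages can be seen in a small calculation. The per-class F1 scores and supports below are made-up values for illustration only (they are not the values in Figure 5), and `macro_and_weighted` is a hypothetical helper.

```python
def macro_and_weighted(per_class_scores, supports):
    """Macro average (plain mean) and support-weighted average of scores."""
    macro = sum(per_class_scores) / len(per_class_scores)
    total = sum(supports)
    weighted = sum(s * n for s, n in zip(per_class_scores, supports)) / total
    return macro, weighted

# Made-up per-class F1 scores and supports, in the order
# agree, disagree, discuss, unrelated.
f1_scores = [0.80, 0.60, 0.85, 0.98]
supports = [900, 200, 2200, 9100]

macro, weighted = macro_and_weighted(f1_scores, supports)
print(round(macro, 4), round(weighted, 4))
```

Because the large ''unrelated'' class dominates the supports, a high score on that class pulls the weighted average above the macro average, which treats the small ''disagree'' class as equally important.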

B. PERFORMANCE COMPARISON OF DIFFERENT APPROACHES
The comparative analysis of the proposed approaches with various baseline approaches is presented in Table 3. The bold values indicate the highest scores achieved by the proposed and baseline approaches. The experimental setting of the proposed approaches matched that of the baselines. Table 3 shows that the proposed approach with TF-IDF features and the LR model outperforms the highest baseline F1 score of 83.10%, obtaining the highest F1 score of 93.84%. In addition, regarding the class-wise scores, the baseline approach of [46] exhibits the best baseline score for the Agree class, 73.76%, whereas the proposed approach with TF-IDF features and the LR model achieved the highest Agree-class score of 80.23%. The proposed approach outperforms the baseline regarding the F1 score, with the highest F1 score of 92.45%, an improvement of 9.35 percentage points.

C. DISCUSSION
The FNC-1 dataset, which contains 49,972 headline-article pairs in four distinct categories (discuss, agree, unrelated, and disagree), was used to achieve the investigation's objectives and obtain the desired results. The proposed system comprises numerous components, such as data pre-processing, visualization, exploratory analysis, feature extraction, and classification using machine learning strategies. We proposed classifying the data with a machine learning ensemble model in real time during the experiment. As a direct result, a more rapid interpretation of the findings is now possible. Instead of relying on a single classification method, the proposed ensemble model combines three distinct machine learning approaches (Random Forest, Logistic Regression, and Decision Tree). This ensemble model was created as part of our efforts to improve our previous investigations into identifying and categorizing fake news.
Several experiments were carried out using the Apache Spark framework to handle big data and perform the classification task, with the aim of improving our ability to detect fake news, hoaxes, and other forms of disinformation. The model's performance was evaluated with five measures: accuracy, precision, recall, the F1-score, and the confusion matrix.
PySpark was chosen because it uses Resilient Distributed Datasets (RDDs), which significantly accelerate computation. As a result, the computations were finished significantly faster than they otherwise would have been; this was the essential consideration in choosing PySpark. The proposed ensemble model achieved the highest F1 score compared to the existing baseline studies and the other approaches used in this study. This model achieved the highest F1 score of 92.45% due to the HashingTF-IDF features that were added during development. We improved the F1 score by 9.35 percentage points, a substantial gain that supports the contribution of this research.
In the future, one of our long-term goals is to use Spark to implement deep learning models in a multi-agent distributed learning environment to detect fake news. This will allow us to assess the effectiveness of a wide range of machine learning and deep learning algorithms on a diverse set of fabricated news stories. Furthermore, we intend to create a feature ensemble of different embedding techniques alongside different machine learning and deep learning models capable of accurately recognizing and categorizing various hoaxes and fake news. This will help in understanding the patterns that characterize hoaxes and fake news and in developing a cutting-edge real-time fake news detection system.

V. CONCLUSION
A headline stance checker has been shown to be a helpful method for exposing falsehood in the news, particularly when a headline is compared with its content body. To demonstrate the applicability of the headline stance checker, various tests were conducted in the context of an existing task (Fake News Challenge FNC-1), in which the stance of a headline had to be categorized into one of the following classes: disagree, agree, unrelated, and discuss. The studies verified each of the suggested classification steps separately and evaluated the overall method by comparing it with the state of the art for this task. In this study, we used the FNC-1 dataset, which categorizes fake news into four categories, together with big data technology (Spark) to perform machine learning analysis for assessment and comparison with other state-of-the-art approaches in fake news identification. The suggested approach created a stacked ensemble model and experimented with it on a distributed Spark cluster. We used N-grams, HashingTF-IDF, and count vectorizer for feature extraction, followed by the suggested stacked ensemble classification model. Compared to the baseline techniques' results, the suggested model has a high classification performance of 92.45% in F1-score. The suggested ensemble model outperforms the previous baseline techniques and improves the F1 score by 9.35 percentage points.

A. RECOMMENDATIONS FOR FURTHER WORK
We currently work with a supervised approach, but researchers can work on unsupervised fake news detection in the future. The proposed work can also be extended using various neural network-based models, which are better suited to unsupervised fake news detection. Because of the standalone cluster, Spark takes twice as long to train. In the future, researchers can run experiments on a cluster built on a different machine; we will also try to build a cluster on a separate computer.