Detecting Fake News in Social Media Using Voting Classifier

The availability of social media, blogs, and websites to everyone creates many problems. False news is a critical issue that can affect individuals or entire countries. Fake news can be created and shared all over the world; the 2016 presidential election in the United States illustrates the problem. As a result, monitoring social media is essential. Machine learning (ML) algorithms help to detect fake news automatically. This article proposes a framework for detecting fake news based on feature extraction and feature selection algorithms and a set of voting classifiers. The proposed system distinguishes fake news from real news. First, we preprocessed the data by removing unnecessary characters and numbers and reducing words to their dictionary form (lemmatization). Second, we extracted important features using two feature extraction techniques: the term frequency-inverse document frequency (TF-IDF) technique and the DOC2VEC algorithm, a word embedding technique. Third, the extracted features were reduced with the help of the chi-square algorithm and the analysis of variance (ANOVA) algorithm. We used three datasets that are published online: Media-Eval, Fake-or-Real-News, and ISOT. To evaluate the proposed framework, we used five performance metrics: accuracy (ACC), the area under the curve (AUC), precision, recall, and f1-score. Our system achieved an ACC of 94.6% for the Fake-or-Real-News dataset, 92.3% for the Media-Eval dataset, and 100% for the ISOT dataset. We compared the proposed framework with several other classification algorithms. The experimental results show that the proposed framework outperforms existing works in terms of ACC by 0.2% for the ISOT dataset.


I. INTRODUCTION
One of the consequences of technology is fake news: misinformation or misleading information presented as fact that can affect a person's opinion. This false information serves several goals; organizations can use it for financial purposes (e.g., Facebook pages that spread fake news leading to specific ads) or for political purposes. Compared with Google, Twitter, and webmail services such as Yahoo and Gmail, Facebook is the worst platform for pervasive fake news. A group of researchers tracked the internet usage of more than 3,000 Americans in the run-up to the 2016 presidential election. They discovered that Facebook was the referrer for untrusted news sources in more than 15% of cases, whereas it guided users to authoritative news sites only 6% of the time. By comparison, Google preceded 3.3% of untrusted news visits versus 6.2% of trusted ones, and Twitter 1% of untrusted versus 1.5% of trusted [1].
Spreading false news is roughly as dangerous as spreading a virus. People currently encounter fake coronavirus news daily, and this fake information triggers fear and panic among people. Therefore, there is a need for ways to fact-check news.
Finland, for example, is leading the fight to tackle fake news through education. They teach primary school students how to combat false news and develop media literacy skills. This has been a priority in Finland's education agenda due to Russia's false stories targeting the country. Thus, Finland ranks first in media literacy compared to the UK, France, and Italy [2].
Fake information can be spread in the form of text, video, pictures, and audio via social media networks such as Facebook and Twitter [3]. The fake news problem has existed for a long time, and people used to believe such news even if it was false [4]. Therefore, detecting fake news can be difficult, especially with no supervising body on the internet [5]. Concern regarding the detection of unreliable news has grown recently. It is difficult for a human to manually check news across all the topics shown on social media. Therefore, there is a need for an efficient way to distinguish false information from true information posted on social media. One efficient way is to classify the news using machine learning (ML) algorithms. There are two broad approaches to automatically identifying fake news: news content-based and social context-based [6] [7]. The news content-based approach focuses on extracting unique features from fake news content. Because misinformation tries to spread false claims, knowledge-based approaches use external sources to fact-check the truthfulness of any news content. Style-based approaches uncover fake news by spotting manipulation in the writing style.
Social context-based approaches use third-party user engagements as auxiliary signals to detect fake news. Stance-based approaches use users' viewpoints from relevant posts to infer the veracity of original news articles. Credibility-based approaches verify news using the credibility scores of the relevant social media posts. Fake news detection on social media is a new research area. The research directions are outlined in four areas: data-oriented, feature-oriented, model-oriented, and application-oriented.
Data-oriented research addresses different aspects of the data. Feature-oriented research aims to explore effective features for detecting false information. Model-oriented research builds more specialized models. Application-oriented research goes beyond detection itself, covering, for example, fake news diffusion and intervention.
Text classification is a popular task in natural language processing (NLP). It makes the program classify the text document based on predefined classes. These tasks can be done by using ML and deep learning algorithms. Deep learning is a sub-field of ML that requires massive data to make a sensible decision.
This paper enhances the performance of conventional ML algorithms in detecting fake news, because the datasets utilized did not contain enough data to feed a deep learning algorithm [8]. This work proposes a system for differentiating fake news from real news using ordinary ML algorithms and a voting classifier technique. The algorithms used are naïve Bayes (NB), linear support vector machine (LSVM), logistic regression (LR), random forest (RF), passive-aggressive (PA), and the stochastic gradient descent classifier (SGD). The experiments were conducted on three datasets that are published online.
The contributions of the paper can be summarized in the following points:
• Six different ML models and the voting classifier technique were trained to compare the differences in performance between them.
• To enhance the results, we preprocessed the data by removing all unnecessary characters.
• For feature extraction, the term frequency-inverse document frequency (TF-IDF) algorithm and the word embedding DOC2VEC algorithm were used. We used the chi-square algorithm and the analysis of variance (ANOVA) algorithm to reduce the features.
• We applied three experiments, namely TF-IDF with chi-square, TF-IDF with ANOVA, and DOC2VEC, for feature extraction and selection, and then compared the results of these three experiments.
The remainder of the paper is organized as follows. Section II presents a review of the related studies. Section III illustrates a proposed framework in detail. Section IV presents the experimental results of the proposed system. Finally, Section V introduces the conclusion and future work.

II. RELATED WORK
The spread of misleading information on social media has led researchers to do their best to solve this problem. We present some of the previous work in this direction. Ahmed et al. [3] proposed using an n-gram model to differentiate between fake and real news. They generated several sets of n-grams from training data and used various word-based n-gram baseline features. They preprocessed the dataset by stemming and removing stop words, used TF-IDF for text feature extraction, and applied several ML algorithms, including SGD, k-nearest neighbor (KNN), support vector machines (SVM), LSVM, and decision trees (DT), on three online datasets. They achieved an accuracy (ACC) of 87.0% in differentiating fake from real news using n-gram features and the LSVM algorithm. However, trying other algorithms, such as PA, could achieve better performance.
Keskar et al. [4] proposed an n-gram analysis using the DT ML technique to detect fake news. The proposed system was tested on live stream data collected from Twitter and achieved an ACC of 70.0%. For better performance, several ML algorithms such as SVM and PA could be combined with n-grams. Bali et al. [5] compared the performance of seven ML algorithms on three standard datasets. Their system achieved a mean ACC of 88.0% and an f1-score of 91.0%. Nevertheless, using linguistic features could enhance the ACC.
Wang [9] introduced a new dataset, LIAR, which can be used for automated fake news detection systems. This dataset does not contain whole articles; it consists of 12,800 short labeled statements collected from the PolitiFact website and manually labeled. They applied ML algorithms to this dataset, and the evaluation results show that they achieved an ACC of 27.0%. However, trying algorithms like SGD and PA could achieve better ACC.
Yang et al. [10] proposed a framework for detecting real news and assessing users' credibility using unsupervised learning and probabilistic graphical models. Their system achieved an ACC of 75.9%. However, incorporating features of news content and user profiles could improve the performance of their unsupervised model. Shu et al. [11] built the fake news tracker system for the fake news problem. First, they collected news and social context automatically to build their dataset. Then, they extracted features from the dataset and used ML algorithms to differentiate between false and real news. The experimental results showed that their system achieved an ACC of 74.2%. However, using other features available in the dataset, such as favorites and re-tweets, could enhance the performance.
Wu and Liu [12] proposed a system called TraceMiner for classifying social media posts. To model the propagation pathways of a message, they used the LSTM-RNN algorithm. Their proposed system achieved an f1-score of about 93.8%. However, reporting other measures, such as ACC and the area under the curve (AUC), would give a fuller picture. Zhang et al. [13] introduced a fake-detector inference model for automated fake news detection; they extracted a distinct set of latent features from the text. This system builds a deep diffusive network to learn representations of news articles, their creators, and their subjects. They experimented on a real-world fake news dataset to compare their system with various models. The ACC score obtained by their system is about 63.0%, but they could have used other datasets to evaluate the proposed method and compared the results.
Gilda [14] created an application for detecting fake news using an open-source dataset. They used TF-IDF for feature extraction and probabilistic context-free grammar detection on a corpus of about 11,000 articles. Their experiments covered several classification algorithms: SVM, SGD, gradient boosting (GB), bounded decision trees (BDT), and RF. They found that the TF-IDF method with the SGD algorithm achieved an ACC of 77.2%. However, other feature extraction algorithms, such as Word2Vec, may enhance the results.
Ko et al. [15] proposed a cognitive system using backtracking to detect fake news, achieving an 85.0% detection rate. However, they did not make clear how fake news and subjective posts are detected. Atodiresei et al. [16] proposed a system for identifying fake users and news on Twitter. Their system receives a link to a tweet from a user and then computes the tweet's credibility, along with statistics such as emotion scores. However, they did not specify the evaluation metrics used to assess their work.
Pan et al. [17] proposed a B-TransE model for detecting fake news from news content using knowledge graphs. Their experimental results show that some of their approaches achieve an f1-score above 80.0%. The results could be enhanced by combining two approaches, such as content-based and style-based approaches. Potthast et al. [18] proposed a new method for assessing style similarity across text categories using meta-learning, showing that fake news can be differentiated by its style. Their system achieved an ACC of about 74.0%. Feature extraction algorithms like TF-IDF could enhance the results.
Ghafari et al. [19] presented TDTrust, a context-aware trust prediction system based on tensor decomposition that incorporates contextual knowledge to assess an entity's trustworthiness. They used a real-world dataset to investigate novel algorithms that exploit social context factors. Nevertheless, they did not evaluate the proposed system using performance metrics such as precision, recall, and f1-score. Fernandez-Reyes and Shinde [20] proposed using different deep neural algorithms to detect fake news in the political domain. The experimental results showed that the proposed system achieved an ACC of 48.5%. Evaluating the framework with additional performance metrics, such as precision, recall, and f1-score, would have been more informative for readers.
Grandis et al. [21] proposed using the Multi-Criteria Decision Making paradigm for fake news detection. To evaluate their results, several ML algorithms were used.
Their proposed system achieved an ACC of about 84.0%. However, aggregation functions that model feature interaction could enhance performance. Ksieniewicz et al. [22] focused on fake news detection in articles published online. They used ML for detecting fake news and confirmed that the chosen ML algorithms could distinguish between fake and real information. The system proposed in their study achieved an average ACC of 74.3%. Nevertheless, they did not evaluate the results using other performance measures such as precision, recall, and f1-score.
Crammer et al. [23] presented a family of margin-based online learning algorithms for several prediction tasks. This online algorithmic framework solves numerous prediction problems, including classification and sequence prediction. Hakak et al. [24] proposed an ensemble classification model to detect fake news. They extracted features from the dataset and then classified the data using three models: decision tree, random forest, and extra trees classifier. They achieved an ACC of 99.8% for the ISOT dataset. However, using algorithms like SVM and PA may enhance the ACC on the LIAR dataset.
Madani et al. [25] focused on fake news tweeted during the coronavirus pandemic. They proposed a classification approach based on natural language processing, ML, and deep learning. Using an RF algorithm, they achieved an ACC of 79.0%; compared to other systems, this ACC is low. Nasir et al. [26] proposed a novel hybrid deep learning system that combines convolutional and recurrent neural networks. They used two datasets, ISOT and FA-KES, and achieved an ACC of 99.0% for the ISOT dataset, a good result in comparison to other systems. The previous studies have several limitations, which are discussed below. Some of the existing works did not try algorithms like SGD and PA, which could achieve better ACC. Other works discarded important features that could enhance performance. Some did not report multiple evaluation metrics to support their claims. Finally, some studies did not use enough datasets, or standard datasets, to evaluate the proposed methods.
In summary, this framework proposes an ensemble voting classifier to overcome the limitations of the previous works. The novelty of this framework can be summarized in a set of points. First, the study used three published datasets, making it easy to compare the results with other works. Second, the framework is based on a set of six different classifier algorithms. Third, we preprocessed the data by removing unnecessary characters, which helps to enhance the performance. Fourth, to avoid overfitting, we split the data using the k-fold cross-validation method. Fifth, feature extraction algorithms reduce the computational cost and remove redundant and irrelevant words that the machine cannot understand. Sixth, the chi-square algorithm selects the best features for better performance. Seventh, these algorithms were combined to form the ensemble method, and we used a voting function to get the final decision based on the base classifiers. Finally, we used five metrics to evaluate the proposed framework: ACC, AUC, precision, recall, and f1-score. The proposed approach can detect any news in text format with high ACC.

III. PROPOSED FRAMEWORK
This section presents the suggested framework. Three publicly available datasets are used to train the system. First, the data is cleaned by deleting any superfluous characters or digits. After that, two strategies are applied to partition the dataset. The first is holdout, meaning that the data is divided into training and testing parts. K-fold cross-validation is the other technique; we used 4-fold and 10-fold and compared the outcomes. TF-IDF and DOC2VEC are utilized to extract features from the dataset [22] [27]. The chi-square discretization algorithm and the ANOVA algorithm are used to choose the best features. Finally, the features are fed to the classification algorithms, and an ensemble voting classifier is utilized for the best results. The proposed framework is illustrated in Figure 1.

A. Data preprocessing
Preprocessing the data is required before extracting the features. The data may contain special characters, numbers, and unnecessary spaces. First, we remove all special characters (also known as non-word characters) and numbers. Then we remove all single characters. For example, when the punctuation mark in "Alice's" is removed and replaced with a space, we are left with "Alice" and the single character "s", which has no meaning in the text. We also replace every single character at the beginning of the text with a single space; this can produce multiple consecutive spaces, which are then collapsed into a single space. The final step of preprocessing is lemmatization, which reduces words to their dictionary root form; "computers", for example, is reduced to "computer". The purpose of the lemmatization stage is to avoid repeated features [28].
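The cleaning steps above can be sketched with regular expressions. This is a minimal illustration under our own naming (`clean_text` is not from the paper); the lemmatization step, which would typically use a library such as NLTK's WordNetLemmatizer, is omitted to keep the sketch self-contained:

```python
import re

def clean_text(text):
    """Apply the cleaning steps described above, minus lemmatization."""
    text = re.sub(r"\W", " ", text)              # replace special (non-word) characters with spaces
    text = re.sub(r"\d", " ", text)              # remove numbers
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # remove stranded single characters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # remove a single character at the start
    text = re.sub(r"\s+", " ", text)             # collapse multiple spaces into one
    return text.strip().lower()
```

For example, `clean_text("Alice's 2 cats!")` drops the apostrophe, the stranded "s", and the digit, leaving "alice cats".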

B. Dataset split
Splitting the dataset into training and testing sets is important for evaluating the classification models. In this step, the datasets are split into two parts. One part is used to train the model and is called the training set. The other part is used to test the classification model and is called the test set. For better performance, the test set is usually smaller than the training set. The method we used to split the data is called k-fold cross-validation [29]. Cross-validation is robust against overfitting. This method resamples the dataset depending on a parameter called k, which refers to the number of groups (folds) into which the dataset is split. In this experiment, we used 10-fold cross-validation [30], [31].
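The mechanics of k-fold splitting can be sketched in pure Python (in practice one would use a library routine such as scikit-learn's `KFold`; the function name `k_fold_indices` here is our own):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k folds; each fold serves once as the
    test set while the remaining k - 1 folds form the training set."""
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    # Yield (train, test) index pairs, one per fold.
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With k = 10, every sample is used for testing exactly once and for training nine times, which is what makes the method robust against overfitting.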

C. Feature Extraction
This study works on textual data such as articles and posts with large numbers of words and characters, resulting in high computational costs. The data is also redundant and contains irrelevant words whose meaning the machine cannot understand. Most of the text is unstructured and has high dimensionality, so it can be difficult to apply many classifiers to text data directly. Thus, we first need to extract the most distinguishing features from the text so that the dimensionality is reduced. In addition, it is important to take a list of words from the text data and transform it into a feature set that the machine can use; this process is called feature extraction from text [32]. Feature extraction algorithms also help to enhance the performance of the ML algorithms. There are many ways to extract features from text. In this paper, we used the TF-IDF algorithm and the word embedding DOC2VEC algorithm.
The TF-IDF algorithm is commonly used to extract features in ML tasks because of its simplicity and robustness. TF-IDF combines two terms: TF (term frequency), which measures how often a word occurs in the current post, and IDF (inverse document frequency), which measures how informative a term is across all posts. TF-IDF gives each word a score that highlights useful or discriminative words [33].
The stop_words parameter ignores English stop words such as a, about, above, after, again, at, as, and are. The min_df parameter, which we set to 5, specifies the minimum number of posts a feature must appear in: we only keep features that appear in at least five posts. The max_df parameter works the same way but is set to 0.7, a fraction corresponding to a percentage: we keep only features that appear in at most 70.0% of all posts. Words that appear in almost every post are unsuitable for classification because they do not supply any features that distinguish the posts, so they are removed. Table 2 shows some examples of extracted features and their frequencies using the TF-IDF algorithm.
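The parameters above mirror those of scikit-learn's TfidfVectorizer. A simplified pure-Python sketch of the same idea follows (the function name, the tiny stop-word set, and the unsmoothed idf = log(D/DF) weighting are our own simplifications; scikit-learn uses a smoothed IDF and normalized rows):

```python
import math
from collections import Counter

# A tiny illustrative stop-word list; real implementations use a full one.
STOP_WORDS = {"a", "about", "above", "after", "again", "at", "as", "are", "the", "is"}

def tfidf_features(posts, min_df=1, max_df=1.0):
    """Simplified TF-IDF: weight = tf * log(D / df), with min_df/max_df filtering."""
    docs = [[w for w in post.lower().split() if w not in STOP_WORDS] for post in posts]
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    # Keep terms in at least min_df posts and at most max_df fraction of posts.
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        row = [0.0] * len(vocab)
        for term, count in tf.items():
            if term in index:
                row[index[term]] = count * math.log(n_docs / df[term])
        matrix.append(row)
    return vocab, matrix
```

Setting `min_df=5` and `max_df=0.7`, as in the text, would discard both very rare and near-ubiquitous words.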
Word embedding algorithms are a type of word representation that gives words with the same meaning a similar representation. There are many word embedding types, such as word2vec, GloVe, and DOC2VEC. The word2vec algorithm converts every word in the article into a unique vector [34]. GloVe extends the word2vec technique for better word-vector learning. The DOC2VEC algorithm uses a three-layer deep neural network to capture document context and bind similar contexts together [35]. Figure 2 shows the histogram of extracted features using the DOC2VEC algorithm.

D. Feature Reduction
Feature reduction or selection is the process of extracting the most pertinent features from a dataset before the ML algorithms are applied; this improves the performance of the classification model. Feature selection methods also reduce the risk of overfitting and shorten training time.

Algorithm 1: Data cleaning
1) For each uncleaned document do
2) Remove special characters and numbers
3) Delete all single characters
4) Delete single characters from the beginning
5) Substitute multiple spaces with a single space
6) Delete the prefixed 'b'
7) Transform to lowercase
8) Lemmatization
9) End for (uncleaned document)

Algorithm 2: TF-IDF weight for each term
Procedure:
1) For each term ti ∈ T do
2) For each document dj ∈ D do
3) If TFij is not equal to zero, then DFi++
4) End for (document)
5) IDFi = log(|D| / DFi)
6) End for (term)
7) For each term ti ∈ T do
8) For each document dj ∈ D do
9) TF-IDFij = TFij × IDFi
10) End for (document)
11) End for (term)

Feature selection methods are classified into three groups: filter, wrapper, and embedded. Filter methods keep the meaning of the selected features the same as the original features, and they do not depend on the performance of any classifier. Therefore, in this paper, we used two filter-method algorithms for feature selection. The first is the chi-square algorithm, which is used for categorical features. We calculated the chi-square score between each feature in the dataset and the target (label), then selected the required number of features with the best chi-square scores. The chi-square score is given by:

chi² = Σ (O − E)² / E

where O is the observed number of class-label observations and E is the number of observations the class label would have if there were no relevance between the feature and the target [36]. The second is the ANOVA algorithm, a statistical technique used to check whether the means of two or more distinct sets differ significantly.
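The chi-square score above can be computed directly from a contingency table. A minimal sketch for a binary feature and binary label follows (the function name is ours; in practice one would use a routine such as scikit-learn's `SelectKBest` with the `chi2` scoring function):

```python
def chi_square(feature, labels):
    """Chi-square score between a binary feature and binary labels.

    Builds the 2x2 contingency table of observed counts O, computes the
    expected counts E under independence (row_total * col_total / n),
    and returns sum((O - E)^2 / E) over all cells."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, labels):
        observed[f][y] += 1
    row = [sum(observed[i]) for i in range(2)]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            if e:
                score += (observed[i][j] - e) ** 2 / e
    return score
```

A feature perfectly correlated with the label gets a high score, while an independent feature scores 0; keeping the top-scoring features is the selection step described above.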

E. Classification Algorithms
The system will be trained using six distinct ML algorithms in this step, which will be discussed in the subsections below.

1) NAÏVE BAYES
Using Bayes' theorem, the NB classifier divides the data into classes based on their probability. When using the NB classifier, all predictors are assumed to have the same influence on the class outcome. We can find the probability of an event y (which refers to fake news) given that event x is true using Bayes' theorem:

P(y | x) = P(x | y) P(y) / P(x)

where the variable y is the class label (fake/real), the variable x represents the word/features, P(y | x) is the conditional probability that a news article is fake given that the word appears, P(x | y) is the conditional probability of finding the word in fake news articles, P(y) is the overall probability that a given news article is fake, and P(x) is the predictor prior probability.
Following that, the probability that a given news article is fake is calculated by combining the probabilities P(y | xi), which are known for each word in the news article, as given by Eqs. (5-7):

p = p1 / (p1 + p2)
p1 = P(y | x1) · P(y | x2) ⋯ P(y | xn)
p2 = (1 − P(y | x1)) · (1 − P(y | x2)) ⋯ (1 − P(y | xn))

where y is the class label (fake/real), the variables xi represent the words/features, n is the total number of words in the news article, p1 is the product over all words of the probability that a news article is fake given that it contains the specific word, p2 is the same as p1 but with the complement probabilities used instead, p is the overall probability that the given news article is fake, and P(y | x1), P(y | x2), P(y | x3), ⋯, P(y | xn) are the conditional probabilities that a news article is fake given that the words x1, x2, x3, ⋯, xn appear in it, respectively [37].
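The combination of per-word probabilities above can be sketched directly, assuming the per-word estimates P(fake | word) are already available (the function name is our own):

```python
import math

def combined_fake_probability(word_probs):
    """Combine per-word P(fake | word) estimates into a document-level
    probability p = p1 / (p1 + p2), where p1 multiplies the per-word
    probabilities and p2 multiplies their complements."""
    p1 = math.prod(word_probs)
    p2 = math.prod(1.0 - p for p in word_probs)
    return p1 / (p1 + p2)
```

Two neutral words (each 0.5) yield p = 0.5, while several words strongly associated with fake news push p toward 1.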
2) SUPPORT VECTOR MACHINES
The support vector machine is a supervised ML model used for classification and regression. Its goal is to find the best hyperplane that divides a dataset into two classes. The support vector machine involves three important concepts: support vectors, the hyperplane, and the margin. Support vectors are the data points closest to the hyperplane; they determine the correct separating line. A hyperplane is a decision plane that divides a set of objects. The margin is the gap between two lines through the closest data points of the distinct classes; a large margin is desirable, while a small one is undesirable.
In applications such as document classification and fake news detection, the dimension of the feature vectors is large enough to find linear classifiers that separate the training feature vectors into non-overlapping regions (classes). Like the perceptron and logistic regression, SVMs are linear classifiers. However, SVMs are more effective because they maximize the margin, the distance between the decision boundary (the hyperplane) and the closest training point (support vector). In order to find the decision boundary w that maximizes the margin, the following two conditions must be satisfied:

w · x⁺ + b ≥ +1 and w · x⁻ + b ≤ −1

where x⁺ and x⁻ are n-dimensional feature vectors that belong to the positive and negative datasets, D⁺ (real news) and D⁻ (fake news), respectively. The above conditions must hold for all feature vectors in D⁺ and D⁻.
Let m⃗ define the margin around the decision boundary w; it can be written as

m⃗ = Σ αi xi (xi ∈ D⁺) − Σ αj xj (xj ∈ D⁻)

where αi and αj are weights assigned to the feature vectors in the positive and negative datasets D⁺ and D⁻, respectively. Since some news items are more important than others, these weights are assigned to each feature vector in order to keep only the vectors that matter and neglect the vectors (news items) that do not matter by setting their weights to 0. Note that Σ αi xi over D⁺ and Σ αj xj over D⁻ represent the centroids of real news and fake news, respectively. By forming a polytope around all positive (real) / negative (fake) news, any point h⃗ can be expressed as a linear combination of the corner points of that polytope:

h⃗ = Σ αc c, with αc ≥ 0 and Σ αc = 1

where c is a corner point. Assuming that the nearest points to w (the support vectors) are h⁺ and h⁻, the margin width must equal ‖h⁺ − h⁻‖, and the boundary must be located halfway between h⁺ and h⁻ on the line linking them. Since there might be several support vectors, we search for the minimum distance between support vectors on both sides of the boundary, that is:

m = min ‖h⁺ − h⁻‖

where ‖·‖ is the Euclidean norm.
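The two margin conditions and the standard relation between w and the margin width can be sketched as follows (the function names are ours; the width formula 2/‖w‖ is the standard SVM relation when the conditions hold with equality at the support vectors):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_margin(w, bias, positives, negatives):
    """Check the two margin conditions from the text:
    w . x+ + b >= +1 for every real-news vector x+,
    w . x- + b <= -1 for every fake-news vector x-."""
    return (all(dot(w, x) + bias >= 1 for x in positives) and
            all(dot(w, x) + bias <= -1 for x in negatives))

def margin_width(w):
    """Width of the margin for a boundary meeting the conditions
    with equality at the support vectors: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))
```

Minimizing ‖w‖ subject to `satisfies_margin` being true is exactly the margin-maximization problem described above.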

3) LOGISTIC REGRESSION
LR is a classification algorithm used to predict discrete, binary values like "real and fake" using the logistic function, also known as the sigmoid function. The logistic sigmoid function converts the output of LR into a probability value. The formula of the sigmoid function is:

f(z) = 1 / (1 + e^(−z))

where e is a mathematical constant known as Euler's number and z is computed from the feature vector.
In LR, the output value is between 0 and 1 because it predicts the probability that an event occurs. We still need the result in binary form; therefore, we use a threshold to convert the probability to binary form [39]. If y ∈ {0, 1} is the class label, then 0 is the negative class (fake news) and 1 is the positive class (real news). The LR equation is:

hθ(x) = 1 / (1 + e^(−θᵀx))

where θᵀ is the transpose of the weight vector θ and hθ(x) is the predicted probability. When the class label y ∈ {0, 1}, with 0 for fake news and 1 for real news, the estimated probability can be calculated by:

P(y = 1 | x; θ) = hθ(x)

To calculate the estimated probability for fake news, y = 0:

P(y = 0 | x; θ) = 1 − hθ(x)

where P is the estimated probability, y is the class label, x is the feature vector, and θ is the weight vector.
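The sigmoid-plus-threshold prediction step can be sketched directly (function names are ours; 0.5 is the usual default threshold):

```python
import math

def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Predict 1 (real) or 0 (fake) from h_theta(x) = sigmoid(theta . x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0
```

The probability sigmoid(theta · x) is compared against the threshold to produce the binary label the text describes.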

4) RANDOM FORESTS CLASSIFIER
The random forest is a supervised ML algorithm used for classification and regression. This algorithm generates its output by building decision trees from data samples, taking a prediction from each tree, and then selecting the best solution through voting.
All labeled data is first assigned to a root node n0, after which a feature f is found within a random subset of features and a threshold t is determined. The labeled data assigned to the root node n0 is then divided into two subsets, left and right, according to (f, t). If the left and right subsets are too small to be split again, leaf (child) nodes n_left and n_right are attached to n0, and each leaf is labeled with the most frequent label among the data routed left and right. Otherwise, child nodes n_left and n_right are attached to n0, the corresponding data subsets are assigned to them, and the procedure is repeated for n = n_left and n = n_right. At each node, the algorithm randomly draws a subset of ⌊√d⌋ features (a typical subset size), where d is the total number of features and ⌊·⌋ denotes rounding the number down while keeping its value close to what it was.
RF uses a modified tree learning algorithm that selects a random subset of features at each candidate split in the learning process. This random selection of features is sometimes referred to as "feature bagging" [40].
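The "feature bagging" step can be sketched as follows, assuming the typical subset size of ⌊√d⌋ features (the function name and the optional `seed` parameter are ours):

```python
import math
import random

def random_feature_subset(d, seed=None):
    """Feature bagging: pick floor(sqrt(d)) of the d feature indices
    at random, as done at each candidate split in tree learning."""
    rng = random.Random(seed)
    k = math.floor(math.sqrt(d))
    return rng.sample(range(d), k)
```

Each tree node then searches for its best split only among these k features, which decorrelates the trees in the forest.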

5) STOCHASTIC GRADIENT DESCENT (SGD CLASSIFIER)
The SGD classifier is used to fit linear models. Computing the gradient over the whole training set, referred to as "batch gradient descent," is very expensive for large datasets. SGD instead computes the update from a single point of the training data, so each weight update is cheap; however, the updates are noisy, and it can take many steps to converge to the global minimum cost. The weights are updated after each training sample until the minimum cost is reached one or more times [41].
First, calculate the gradient by determining the slope of the objective function with respect to each feature parameter. Second, choose a random initial value for the parameters and evaluate the gradient function by substituting the parameter value into it. Then, determine the step size using the following formula:

step size = learning rate × gradient

and finally, determine the new weight:

w_new = w_old − step size

The loss function J(w, b) is then computed over the training pairs, where xi is the features, yi is the class labels (real and fake labels), w is the weight, and b is the bias.
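A single SGD update on one training point can be sketched as follows, assuming a squared-error loss (y − (w·x + b))² purely for illustration (the paper does not specify the loss; the function name is ours):

```python
def sgd_step(w, b, x, y, lr=0.01):
    """One stochastic gradient descent update on a single (x, y) pair
    for the squared loss (y - (w.x + b))^2."""
    pred = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = pred - y
    # Gradient of the squared loss w.r.t. each weight and the bias,
    # applied as: new parameter = old parameter - learning_rate * gradient.
    w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
    b = b - lr * 2 * err
    return w, b
```

Repeating this step over shuffled training points is the whole algorithm; the learning rate controls the step size named in the text.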

6) PASSIVE AGGRESSIVE CLASSIFIER
The passive aggressive (PA) classifier belongs to a family of online learning algorithms used for either classification or regression tasks [26]. It attempts to settle the issues of the perceptron rule. Let X = {x_1, x_2, x_3, …, x_n} be a set of posts, where x_t is the instance presented to the algorithm on round t, and let Y = {y_1, y_2} be the set of labels, where y_t ∈ {−1, +1}. The algorithm then processes each instance-label pair (x_t, y_t) as an example.
This algorithm predicts using a classification function. It takes an array of size [n_samples, n_features] as input, and the output is an array of size [n_samples] because of its binary classification [42]. When applying PA to our data, first initialize w = (0, …, 0). Then receive a post as a vector of words x = (x_1, …, x_n). Predict positive (real) if w · x ≥ 1 and negative (fake) if w · x ≤ −1; note that multiplying either condition by (y), the two become the single condition y(w · x) ≥ 1. The loss function is L = max(0, 1 − y(w · x)), and the updated weight is w ← w + τ · y · x with τ = L / ‖x‖², where L is the loss function, (w) is the weight vector, x is a word vector, and (n) is the number of words. Many of these algorithms were combined using the voting classifier method to get the best ACC, depending on trial and error.
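The predict/loss/update cycle above can be sketched directly. This minimal pure-Python function (our illustration, not the paper's code) performs one Passive-Aggressive update:

```python
def pa_update(w, x, y):
    """One Passive-Aggressive update. Hinge loss L = max(0, 1 - y*(w.x));
    if L > 0, move w just enough to satisfy the margin: w += tau*y*x,
    with step size tau = L / ||x||^2."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    if loss == 0.0:
        return list(w)                       # passive: margin already satisfied
    tau = loss / sum(xj * xj for xj in x)    # aggressive: full margin correction
    return [wj + tau * y * xj for wj, xj in zip(w, x)]
```

After an aggressive step the new weights satisfy y(w · x) = 1 exactly, which is why a repeated presentation of the same example leaves w unchanged.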

7) VOTING CLASSIFIER
We generated a set of weak classifiers and then merged their outputs into a single final decision via an ensemble technique [43]. One such ensemble approach is the voting classifier utilized in this study. To accomplish this, multiple weak ML techniques are applied to the same dataset. The voting classifier used in this study (voting = 'hard') means that each model (from the previous classification models) votes for every instance in the dataset. The final output prediction is the one that receives more than half of the votes, so each instance is classified according to the majority class label; check the outcome tables to compare the models (Tables 5-13).
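A hard (majority) vote reduces to counting labels. Conceptually, this is what a hard-voting ensemble such as scikit-learn's VotingClassifier with voting='hard' computes over the base models' predictions; a minimal sketch:

```python
from collections import Counter

def hard_vote(predictions):
    """Majority (hard) vote: each model's predicted label for an instance
    counts as one vote; the label with the most votes is the ensemble output."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, if five base classifiers predict ['fake', 'fake', 'real', 'fake', 'real'] for a post, the ensemble outputs 'fake'.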

IV. EXPERIMENTAL RESULTS
This section is divided into four subsections: dataset description, hardware and software specifications, evaluation metrics, and discussion results. The first represents the three benchmark datasets: Fake-or-Real-News dataset, Media-Eval dataset, and ISOT dataset. The second represents the hardware and software specifications that we used in our experiments. The third is the evaluation metrics used to evaluate our work. The fourth subsection represents the results and the discussion. This subsection presents the classification results and ensemble-voting classifier on three different benchmark datasets.
We applied three different experiments to the three datasets. We then compared the results of six different ML algorithms and the proposed voting classifier. Tables and figures are included to clarify our points. We supported our idea with five different performance measures, explained in detail in the evaluation metrics subsection. In the discussion subsection, we discuss the experimental results and illustrate them with ROC curves.

A. Dataset Description
We tested our work on three datasets published online. The first is called the Fake-or-Real-News dataset. The second dataset is the Media-Eval dataset, and the third is the ISOT fake news dataset.
Fake-or-Real-News dataset: contains three columns, which are title, text, and label. In our experiment, we only need the text and label columns. This dataset contains 6335 posts collected from social media, divided into two types: 3164 fake posts and 3171 real posts. The dataset is available for free online [46].
Media-Eval dataset: this dataset contains fake and real posts from Twitter. It also contains images shared online, but since we only need the posts in our work, the images were ignored. This dataset is designed especially for information retrieval tasks and new technologies that depend on social media. From this dataset, we worked on 15629 posts: 9404 fake posts and 6225 real posts [45].
ISOT dataset: this dataset contains two types of articles, fake and real news. The data was collected from real-world sources in the period from 2016 to 2017. The real articles were collected from Reuters.com, and the fake articles were collected from unreliable websites. This dataset covers different topics. There are 23481 fake articles and 21417 real articles, 44898 articles in total [47] [48] [24]. The main difference between the three datasets is the source of the collected data: the Fake-or-Real-News dataset was collected from Facebook, the Media-Eval dataset was collected from Twitter, and the ISOT dataset was collected from real-world sources in the period from 2016 to 2017.

B. Hardware and software specifications
This work was implemented by using an anaconda 3.8/ Jupyter notebook. We ran our experiments in a Core (TM) i5/ 2.6 GHz machine with 8 GB RAM.

C. Evaluation metrics
We measured performance with ACC, precision, recall, f1-score, and the ROC curve [44]. ACC measures how close the predictions are to the right values: ACC = number of correct predictions / total number of predictions = (TP + TN) / (TP + TN + FP + FN). Precision measures how many of the predicted positives are correct: Precision = TP / (TP + FP). Recall (sensitivity) measures how many of the actual positives are correctly identified: Recall = TP / (TP + FN). The f1-score is used when the costs of false positives and false negatives differ and we require both precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). The ROC (receiver operating characteristic) curve is an important performance measure that shows how well the model can distinguish between the classes, where the AUC is the area under the curve; the larger this area, the better the model's detection. The ROC curve is computed using the true positive rate TPR = TP / (TP + FN) and the false positive rate FPR = FP / (FP + TN), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Figures 2-9 show the applied system's ROC curves on the three datasets used in this research.
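The metric definitions above can be collected into one small helper; a sketch computing them from the confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix based scores: accuracy, precision, recall,
    f1-score, and the false positive rate used by the ROC curve."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate (sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return acc, precision, recall, f1, fpr
```

Sweeping the decision threshold of a classifier and plotting (FPR, TPR) pairs from this helper traces the ROC curve; the AUC is the area under that trace.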

D. Results and Discussion
This part presents the results of the classification process using six different ML algorithms and an ensemble voting classifier. The data is split into training and testing parts using the 10-fold cross-validation method. The evaluation metrics ACC, precision, recall, f1-score, and ROC-AUC are used to evaluate the results. This section presents the results for the three datasets: the Fake-or-Real-News dataset (the first dataset), the Media-Eval dataset (the second dataset), and the ISOT dataset (the third dataset). Tables 5-13 represent the results. ML algorithms play a decisive role in fake news detection. With the help of an ensemble voting classifier and feature extraction and selection methods, our system can distinguish fake news from real news with high ACC. We used three datasets from various sources, including Facebook, Twitter, and other blogs/websites. On each dataset, we performed three experiments and compared the results.
In the first experiment, we used the TF-IDF technique for feature extraction and the chi-square approach for feature selection. The number of features retrieved by the TF-IDF algorithm for the Fake-or-Real-News dataset was 18138. After removing the stop words, this method selects only the unique words from the data. Table 2 lists some of the collected features and their frequencies using this approach. We applied a feature selection method to make the classification algorithms in our task more interpretable and the training phase faster. Using the chi-square technique, a total of 10000 features were selected. The chi-square algorithm is utilized when the features are categorical; it is a statistic that analyzes the relationship between two categorical variables. ACC improved by 0.8% after using the feature reduction approaches compared with before using them.
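As a sketch of the TF-IDF step, the following toy pure-Python function computes classic TF-IDF weights (tf × log(N/df)). Real implementations such as scikit-learn's TfidfVectorizer use smoothing and normalization variants, so exact values differ; this is only an illustration of the weighting idea:

```python
import math

def tf_idf(docs):
    """Plain TF-IDF: tf = raw term count in the document,
    idf = log(N / df), where df is the number of documents
    containing the term. Returns one {term: weight} dict per document."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = {}
    for toks in tokenized:
        for term in set(toks):
            df[term] = df.get(term, 0) + 1
    return [{t: toks.count(t) * math.log(n / df[t]) for t in set(toks)}
            for toks in tokenized]
```

Note how a term appearing in every document (df = N) gets weight zero, which is precisely why TF-IDF down-weights uninformative common words.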
Table 5 represents the results for the first dataset. The voting classifier system achieved ACC equals 94.5%, AUC equals 94.4%, precision equals 97.0%, recall equals 96.0%, and f1-score equals 95.0%. For LSVM, the ACC is 94.2%, the AUC is 94.2%, the precision is 95.0%, the recall is 94.0%, and the f1-score is 94.0%. Figure 3 represents the ROC curves for the current state of the art and the proposed voting classifier. The ROC curve indicates the compromise between sensitivity (or TPR) and specificity (1 − FPR); classifiers with curves closer to the upper left perform better. By this criterion, the voting classifier performs best.
For the Media-Eval dataset, the extracted features using the TF-IDF algorithm were 2282, and the number of selected features was 1500 using the chi-square algorithm. The system achieved ACC equals 91.2%, AUC equals 90.0%, precision equals 99.0%, recall equals 99.0%, and f1-score equals 93.0%, as shown in Table 8. Figure 6 represents the ROC curves for the current state of the art and the proposed voting classifier. The ROC curve of the voting classifier shows a better performance.
For the ISOT dataset, the number of extracted features was 31095, and the number of selected features was 15000. The performance was 100% of ACC, AUC, precision, recall, and f1-score, as shown in Table 11.
In this experiment, we observed that the voting classifier technique has the highest ACC among all ML algorithms. This is because the voting classifier technique combines two or more weak standard ML algorithms: each classifier votes for a class, and the class with the most votes wins.
The second experiment applied the TF-IDF algorithm for feature extraction and the ANOVA algorithm for feature selection on the three datasets. For the Fake-or-Real-News dataset, the number of features extracted by the TF-IDF algorithm was 17378, and 10000 features were selected using the ANOVA algorithm. The results are shown in Table 6. The voting classifier achieved average ACC equals 94.6%, AUC equals 94.6%, precision equals 96.0%, recall equals 96.0%, and f1-score equals 95.0%. Compared with, for example, LSVM, which achieved ACC of 94.0%, AUC of 94.0%, precision of 94.0%, recall of 94.0%, and f1-score of 94.0%, the proposed voting classifier performed better. Figure 4 represents the ROC curves for the traditional ML algorithms and the proposed voting classifier. As shown by the curves, the proposed method has a better performance when compared with the current state of the art.
For the Media-Eval dataset, the extracted features were 1852 by the TF-IDF algorithm, and the selected features were 1000 features using the ANOVA algorithm. The proposed voting classifier achieved average ACC equals 92.3%, AUC equals 91.6%, precision equals 93.0%, recall equals 98.0%, and f1-score equals 94.0%. Compared with, for example, LSVM, which achieved ACC equals 91.9%, AUC equals 89.9%, precision equals 95.0%, recall equals 82.0%, and f1-score equals 88.0%, the proposed voting classifier performed better. Table 9 represents the results for this dataset. From the results and Figure 7, the performance of the proposed voting classifier is better than the traditional ML algorithms.
For the ISOT dataset, the voting classifier system achieved 100% of ACC, AUC, precision, recall, and f1-score. Table 12 represents the results. The third experiment applied word embedding techniques: we used the DOC2VEC algorithm for feature extraction and reduction. For the Fake-or-Real-News dataset, the voting classifier achieved average ACC equals 88.8%, AUC equals 88.7%, precision equals 86.0%, recall equals 96.0%, and f1-score equals 89.0%, as shown in Table 7. Compared with, for example, LSVM, which achieved ACC equals 84.0%, AUC equals 88.1%, precision equals 85.0%, recall equals 93.0%, and f1-score equals 89.0%, the proposed voting classifier performed better.

TABLE 14. Comparison with previous work on the ISOT dataset.

Study                 ACC (%)
Ahmed et al. [3]      92.0
Hakak et al. [24]     99.8
Nasir et al. [26]     99.0
Proposed method       100

For the ISOT dataset, the voting classifier achieved average ACC equals 68.9%, AUC equals 68.5%, precision equals 71.0%, recall equals 77.0%, and f1-score equals 72.0%. Compared with, for example, LSVM, which achieved ACC equals 55.7%, AUC equals 63.8%, precision equals 65.0%, recall equals 70.0%, and f1-score equals 67.0%, as shown in Table 13, the proposed voting classifier performed better. We also noticed that the RF algorithm achieved better ACC than the other algorithms for the ISOT dataset, as shown in Table 13. The reason could be that the RF algorithm will not overfit the model if there are enough trees in the forest, which leads to better performance. Figure 9 represents the ROC curves for the classical ML techniques and the voting classifier for this dataset.
After displaying the results and explaining each experiment separately, we notice that the first and second experiments have higher ACC than the third experiment. With an ACC of 94.6%, the suggested voting classifier has the best classification performance of all ML algorithms, as shown in Table 6. The third experiment depends on word embedding techniques, which require massive amounts of data; our data is too small for this strategy. After applying the TF-IDF technique for feature extraction, we enhanced the results by employing a feature reduction approach of the filter type.
We utilized a feature extraction technique in the first and second experiments, then a feature selection algorithm to select the important features from the text. The feature extraction algorithm was the same in both trials; the selection algorithm was different.
For the ISOT dataset, Table 14 shows a comparison between the suggested technique and previous work. As indicated in the table, the proposed strategy outperforms the other works in terms of ACC by 0.2%. This may be due to the usage of feature extraction and feature selection techniques. The other two datasets, Fake-or-Real-News and Media-Eval, were used to demonstrate that the suggested voting classifier outperformed existing ML methods.

V. CONCLUSION
The fake news problem is not new, as disinformation has long been circulated in newspapers and on the radio. Because of the internet, false news spreads quickly through social media and blogs. This type of information can be harmful; thus, we must be able to distinguish between fake and real news.
For better performance, we preprocessed the data. We used k-fold cross-validation and a train-test split to split the data. The TF-IDF and DOC2VEC algorithms were used to extract features, and the best features were chosen using the chi-square and ANOVA procedures. This study used six ML algorithms: NB, LSVC, LR, RF, PA, and SGD. Three separate datasets, all freely available online, were used in the experiments. On all three datasets, the suggested technique outperforms existing traditional methods in terms of ACC: the ACC was 94.6% for the Fake-or-Real-News dataset, 92.3% for the Media-Eval dataset, and 100% for the ISOT dataset. On the other hand, the presented study can only deal with textual data. So, in the future, datasets containing images and videos with textual information will be collected and preserved from Facebook and other social media platforms. The annotated dataset can be used to detect fake photos and videos. In addition, the proposed fake news detection system will be deployed as a real-time system on Facebook, Twitter, Instagram, and other platforms. The suggested method has the ability to benefit a variety of activities for the good of humanity, including preventing the spread of fake news during elections, terrorism, natural disasters, and criminal activity.
University, Egypt, in 2014. Currently, she is an M.Sc. student at the Faculty of Computers and Information, Mansoura University, Egypt. Her research interests include Artificial intelligence, machine learning, and data mining.