Selective Feature Sets Based Fake News Detection for COVID-19 to Manage Infodemic

During the COVID-19 pandemic, the spread of fake news became easy due to the wide use of social media platforms. Considering the problematic consequences of fake news, efforts have been made for its timely detection using machine learning and deep learning models. Such works focus on model optimization, and the feature engineering and extraction part is an under-explored area. Therefore, the primary objective of this study is to investigate the impact of features on obtaining high performance. For this purpose, this study analyzes the impact of different subset feature selection techniques on the performance of models for fake news detection. Principal component analysis (PCA) and Chi-square are investigated for feature selection using machine learning and pre-trained deep learning models. Additionally, the influence of different preprocessing steps on fake news detection is analyzed. Results obtained from comprehensive experiments reveal that the extra tree classifier performs best, with an accuracy of 0.9474, when trained on the combination of term frequency-inverse document frequency and bag of words features. Models tend to yield poor results if no preprocessing or only partial preprocessing is carried out. Convolutional neural network, long short-term memory network, residual neural network (ResNet), and InceptionV3 show marginally lower performance than the extra tree classifier. Results also reveal that using subset features helps to achieve robustness for machine learning models.

work, we analyze multiple data sets to identify the most useful

To combat the fake news related to COVID-19, machine learning approaches can be leveraged to help analyze such news. Existing studies focus on enhancing the performance by optimizing the models or using appropriate feature engineering approaches. However, the use of a subset of features or selective features from the dataset remains an under-explored or ignored area. The data containing fake news of COVID-19 is large and has extensive symmetrical features [19]. This motivates the current research to develop a machine learning-based approach with the use of selective feature sets to detect the fake news related to COVID-19. In this regard, this study makes the following key contributions:

• An extensive investigation of various feature selection approaches is carried out for COVID-19 fake news detection. From this perspective, two approaches are analyzed: PCA and Chi-square.
• The efficiency of PCA and Chi-square is analyzed with different feature extraction approaches like term frequency-inverse document frequency (TF-IDF) and bag of words (BoW). The performance of TF-IDF and BoW is evaluated comprehensively with PCA and Chi-square separately.

• The suitability of deep learning models is investigated for COVID-19 fake news detection using both custom-built models and pre-trained models. Convolutional neural network (CNN) and long short-term memory (LSTM) models are custom designed for this purpose, while two well-known pre-trained models, residual neural network (ResNet) and InceptionV3, are adopted as well.

The rest of the paper is organized as follows: Section II discusses recent studies related to fake news. Section III gives the summary of the dataset and a brief description of the adopted methodology and models used for fake news detection. Section IV presents the discussions and analysis of the results. The conclusion and the future directions are given in Section V.

VOLUME 10, 2022

Fake news detection is the foundation of many tasks such as claim validation [20] and argument search [21]. In many studies, fake news detection regarding a specific target is performed from tweets [22], [23], [24] and online debates [23], [25], [26]. Such target-oriented approaches are based on lexical features [26] and linguistic and structural features [25]. Traditionally, a two-step process is followed for fake news detection where preprocessing is carried out in the first step, followed by the feature extraction.

the first layer, and the fake event was discovered using the second layer. The research paper [31]

FNDNet has been proposed to detect fake news [33]. A capsule network-based model is proposed for fake news detection [34]; it uses four capsule networks for long text and two capsules for short text. FakeBERT, which is designed using deep-stacked layers of CNN, has been proposed for fake news detection [35].

A deep neural network was applied to categorize news content and context-based information separately as well as together, with the help of the best hyper-parameters [36]. The effectiveness of the proposed method has been verified using a real-world dataset. The authors of [37] take into consideration the behavior of several Facebook account-related variables and use a deep learning-based analyzer to examine the activity of the account. CNN, LSTM, and Bi-LSTM have been applied to news article datasets for fake news detection [38].

C. FAKE NEWS DETECTION DURING COVID-19 PANDEMIC
Fake news became an attractive and important research area during the COVID-19 pandemic to fight the infodemic. Fake news prediction related to COVID-19 is as important as in other domains, with a more immediate effect because it can cause rapid panic. Elhadad et al. [5] worked on the detection of misleading information related to COVID-19. Besides using twelve different performance metrics, the study uses 5-fold cross-validation to validate the results. The best results are achieved by the neural network (NN), decision tree (DT), and logistic regression (LR) classifiers.

The use of machine learning and deep learning techniques to detect COVID-19 fake news is investigated in [39]. The TF-IDF and word2vec word embedding techniques are also included in the work. Results indicate that the support vector machine (SVM) gives the highest F1 score of 93.39% with TF-IDF features. Raha et al. worked on automatic detection of fake news related to COVID-19 in [40]. Random forest (RF) gives a remarkable accuracy of 96.6% while naive Bayes (NB) gives an accuracy of 95.05%. Koirala [41] proposed a deep learning-based system for the classification of COVID-19 fake news. The dataset used in that study is inconsistent, which leads to deviations in the accuracy of the applied models.

Besides using well-known machine learning models for fake news detection, the use of transfer learning and optimization models is reported to give better results. For example, machine learning and deep learning-based approaches for COVID-19 fake news detection are discussed in [42]. An ensemble of three transfer learning approaches, including Bidirectional Encoder Representations from Transformers (BERT), ALBERT, and XLNET, has been analyzed for fake news detection on social media comments [43].

Researchers investigated COVID-19-related misinformation using three feature selection approaches: particle swarm optimization, the genetic algorithm, and the salp swarm algorithm [19]. The genetic algorithm outperformed the other approaches. Optimization approaches are leveraged in [44], where an optimized salp swarm optimization approach is adopted for fake news detection. Experimental results show that the optimized model performs better than standard models. Similarly, a metaheuristic optimization algorithm, Grey Wolf optimization, is adopted in [13] for fake news detection. Results show that it performs better than SVM, NB, DT, and J48.

Other than using a fake news dataset for simple feature extraction, a few studies explore metadata where different important aspects related to fake and genuine news are analyzed. For example, Ibrishimova and Li [45] proposed a system based on factual accuracy and relative reliability of a source. The authors also propose a fake news detection

The COVID-19 fake news dataset is obtained from the IEEE DataPort [47]. On the Twitter platform, different keywords and hashtags are used for text extraction to ensure that textual data is related to COVID-19. Table 2 shows a list of key-

This study investigates various feature engineering approaches in combination with machine learning classifiers. The proposed architecture for COVID-19 fake news detection is presented in Figure 1, which shows the sequence of steps performed in experiments. The dataset is obtained from the IEEE DataPort and preprocessed before the feature extraction. Data is split in the ratio of 70:30 into training and test sets. For feature extraction, TF-IDF and BoW are used. To reduce the training time, subset feature selection using PCA and Chi-square is applied before model training. To obtain more optimized results, different feature extraction and feature selection techniques are tested in various combinations to train the models. ML models include RF, ET, GBM, LR, NB, SG, and VC(LR+SGD) and are compared with deep learning models including CNN, LSTM, ResNet, and InceptionV3. In the end, evaluation is carried out using accuracy, precision, recall, F1 score, specificity, and AUC.
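To illustrate the Chi-square subset selection step of this pipeline, the following is a minimal sketch that scores binary term-presence features against a binary fake/real label using the classical 2x2 contingency form of the statistic. The function names, the toy terms, and the choice of the 2x2 form are assumptions made for illustration; the scoring variant and the number of selected features used in the experiments are not stated here.

```python
def chi_square_2x2(feature, labels):
    """Chi-square statistic of a binary feature against a binary label.

    feature[i] is 1 if the term occurs in document i, labels[i] is 1 for fake.
    """
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f and y)          # term present, fake
    b = sum(1 for f, y in pairs if f and not y)      # term present, real
    c = sum(1 for f, y in pairs if not f and y)      # term absent, fake
    d = sum(1 for f, y in pairs if not f and not y)  # term absent, real
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # constant feature or constant label carries no signal
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square_2x2(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the dataset used in the paper.
    labels = [1, 1, 1, 0, 0, 0]
    columns = {
        "miracle": [1, 1, 1, 0, 0, 0],
        "virus": [1, 0, 1, 1, 0, 1],
        "cure": [1, 1, 0, 0, 0, 0],
    }
    print(select_top_k(columns, labels, k=2))
```

A term that appears equally often in both classes receives a score of zero and is dropped first, which is the sense in which the selection keeps features on which the target variable depends.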

Tweets contain unstructured, short, and noisy data which needs to be cleaned before it can be used for classification. For removing noise and improving the performance of models, several preprocessing steps are carried out.
1) Social media posts contain hashtags that relate them to a specific topic, such as #covid-19 and #lockdown. These hashtags are unnecessary in terms of sentiments, so they must be removed.

2) To prevent a model from treating the same word differently because of capitalization, all capital letters are converted to lower case.

Feature engineering aims at finding appropriate features from the data to obtain good results from the models. Feature engineering helps to enhance the consistency and accuracy of the learning algorithm because it extracts the meaningful features from the raw data. In this research, vectorization (TF-IDF), prediction-based (bag of words (BoW)), dimensionality reduction (PCA), and variance analysis (Chi-square) techniques are used.
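As a minimal sketch of the hashtag removal, lower-casing, BoW, and TF-IDF steps described in this section, the following uses only the Python standard library. The regular expressions, the function names, and the plain textbook weighting tf × log(N/df) are assumptions made for illustration; library implementations often use smoothed variants, and the exact variant used in the experiments is not specified here.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Remove hashtags, lower-case the text, and split it into tokens."""
    text = re.sub(r"#\s?[\w-]+", " ", text)  # drops "#covid-19" and "# lockdown"
    text = text.lower()
    return re.findall(r"[a-z0-9]+", text)


def bag_of_words(docs):
    """Raw term counts per tokenized document."""
    return [Counter(tokens) for tokens in docs]


def tf_idf(docs):
    """Term frequency times log inverse document frequency per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    # Hypothetical tweets, not taken from the dataset.
    tweets = ["Miracle CURE found #covid-19", "Wash hands, stay home # lockdown"]
    docs = [preprocess(t) for t in tweets]
    print(bag_of_words(docs))
    print(tf_idf(docs))
```

Combining the two representations, as done for the best-performing configuration, would amount to concatenating the BoW and TF-IDF vectors of each document.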

TF-IDF can be used to find the similarity between documents easily. It counts the occurrence of the word in a

TABLE 1. Comparative analysis of the approaches from the literature. These approaches are used for fake news detection in recently published works; the approach, dataset, findings, and future work of each are discussed.

sample is the same as the size of the training dataset [49]. While constructing the decision trees in an RF, the major issue is the identification of the attribute for the root node at every level. This process is known as attribute selection [50]. By subsampling the training dataset with replacement (bootstrap), a sample is derived whose size is the same as that of the training dataset.
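The bootstrap step described above, drawing with replacement a sample as large as the training set, can be sketched as follows. The function name and the seeded generator are illustrative choices, not details taken from the source.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)      # fixed seed only to make the run repeatable
    training_rows = list(range(10))
    sample = bootstrap_sample(training_rows, rng)
    # The sample has the same size as the training set; some rows typically
    # repeat while others are left out.
    print(sample, len(sample) == len(training_rows))
```

Because each tree in the forest sees a different resampled set, the trees differ from one another even though they are grown from the same data.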

ET is an ensemble learning classifier that aggregates the outcomes of multiple de-correlated decision trees [51]. ET works quite similarly to RF but differs in the construction of decision trees within a forest. Every tree is given a random sample of K features from the feature set, from which the decision tree selects the best feature for splitting the data based on the Gini index. Multiple de-correlated decision trees are created from these random samples. The ET classifier generates multiple decision trees to learn the patterns in the training data. These trees make predictions on the test data, and voting is then performed for the final prediction.
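Two ingredients of this description, scoring a candidate split with the Gini index and combining the trees' outputs by voting, can be sketched in a few lines. This is an illustrative fragment under the assumption of simple majority voting, not a full extra trees implementation; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def majority_vote(predictions):
    """Final label chosen by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # A split that separates the classes perfectly has weighted impurity 0.
    print(split_gini(["fake", "fake"], ["real", "real"]))
    # Three hypothetical trees vote on one test tweet.
    print(majority_vote(["fake", "real", "fake"]))
```

Among the K randomly sampled features, a tree would keep the split with the lowest weighted impurity.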

GBM is a group of machine learning classifiers that combine many weak learning classifiers to make a powerful learning model [52]. Decision trees are usually used as the weak learners in gradient boosting. GBM builds the trees sequentially, each one fitted to the errors of the trees before it, so it is a time-consuming and costly choice. GBM boosts the strength of a weak learning algorithm, an idea rooted in probably approximately correct (PAC) learning. It gives notable results on unprocessed data and deals with missing values efficiently.

LR is a statistical method used to analyze data where one or more variables determine the outcome. LR estimates the probability of class membership, so it is well suited to problems where the target class is categorical [53]. It models the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function. LR gives promising results for binary classification. The sigmoid function is used to predict the probability values; it maps any real value to a value between 0 and 1.
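As a small illustration of how the sigmoid turns a linear score into a probability, consider the sketch below. The weights, bias, and feature values are hypothetical placeholders rather than parameters of the trained model.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real z to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


if __name__ == "__main__":
    # Hypothetical two-feature example.
    p_fake = predict_proba(weights=[1.5, -0.5], bias=0.1, features=[0.8, 0.2])
    label = "fake" if p_fake >= 0.5 else "real"
    print(round(p_fake, 3), label)
```

Thresholding the probability at 0.5 is the usual default for binary classification, though the threshold can be moved to trade precision against recall.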

The concept of SG is based on the working principle of SVM and logistic regression convex loss functions [54]. Due to its ability to combine multiple binary classifiers in a One-vs-All (OvA) scheme, SG is a powerful algorithm for multi-class classification problems. SG is well suited to large datasets as it takes only a single example per iteration. SG is based on a simple regression technique, so it is easy to implement and easy to understand. On the other hand, SG is a noisy choice because the examples selected from the batch are random, and its hyperparameters need to be set correctly to get the best results. SG is also highly sensitive to feature scaling.

Hyperparameter setting of ML models. These hyperparameters are obtained using the GridSearchCV method, and the models obtain optimized results using these hyperparameters.

Inception-v3, proposed by Szegedy et al. [58], is a 48-layer convolutional neural network. A version of the network pre-trained on more than a million images from the ImageNet database is available for loading. An Inception module is a block of the image model that attempts to approximate an optimal local sparse structure. Put simply, it allows several filter sizes to be used in a single image block rather than being limited to a single filter size.

The effectiveness of a machine learning model is measured using evaluation metrics. To evaluate the performance of machine learning models, this study utilizes the following evaluation measures: accuracy, precision, recall, F1 score, specificity, and area under the curve (AUC). The threshold-based measures are computed from the confusion matrix counts TP, TN, FP, and FN, which represent true positive, true negative, false positive, and false negative, respectively.
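The standard definitions of these measures in terms of TP, TN, FP, and FN can be written compactly as below. This sketch uses the conventional textbook formulas; AUC is omitted because it is computed from ranked prediction scores rather than from the four counts. The function name and the example counts are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, and F1 from confusion counts."""

    def ratio(num, den):
        # Guard against empty denominators, e.g. no predicted positives.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Hypothetical counts, not results reported in the paper.
    for name, value in classification_metrics(tp=8, tn=8, fp=2, fn=2).items():
        print(f"{name}: {value:.4f}")
```

Reporting specificity alongside recall shows how well genuine news is recognized, which accuracy alone can hide when the classes are imbalanced.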

Extensive experiments are performed using various machine learning models for fake news detection related to COVID-19. Experiments are performed covering several aspects in this regard. For example, the performance of models is tested with TF-IDF and BoW, each combined with PCA and Chi-square, for analysis. For deep learning models, Global Vectors (GloVe) and FastText word embedding approaches are used. Additionally, the performance of various preprocessing types is evaluated, including full preprocessing, partial preprocessing, and no preprocessing. For partial preprocessing, the step of case conversion is not carried out, as some research works point out that capitalized words may be an indicator of fake news. For experiments, the data split ratio is 0.7 to 0.3 for training and testing. For performance evaluation, specificity and area under the curve (AUC) are used besides the traditional parameters of accuracy, precision, recall, and F1 score.

The results of ML-based models using TF-IDF and BoW are presented in Table 4.

The performance of ML-based models has been evaluated and compared using BoW and PCA for fake news detection.

Results presented in Table 6

This section presents another combination of features, BoW and Chi-square, for fake news detection. Chi-square is a statistical method that focuses on the features on which the target variable is highly dependent. It can be seen from the results given in Table 7 that the results obtained from the combination of BoW and Chi-square are very similar to those obtained from the combination of TF-IDF and PCA. In this scenario, ET performs best with a 0.9274 accuracy. Its values for precision, recall, and F1 score are also the best among all models; however, its specificity and AUC values are comparatively lower than those of LR.

Finally, another combination of feature subsets has been evaluated with models for fake news detection, and it includes TF-IDF and Chi-square. Results presented in Table 9 indicate

of deep learning models are considered. Two custom-built models, CNN and LSTM, are used, while two pre-trained deep learning models, ResNet and InceptionV3, are also included in the experiments. Table 10 shows the results obtained for fake news detection using the GloVe features. Results indicate that the custom-built CNN model can detect fake news with a 0.92 accuracy score, followed by InceptionV3, which has a 0.91 accuracy score. Their precision, recall, and F1 scores are also better than those of LSTM and ResNet. The performance of deep learning models is lower than that of the best performing ET and SG, which show the best performance using selective features from Chi-square.

Similar to using the GloVe features, deep learning models are used with FastText as well, and the results are provided in Table 11. Results indicate that the performance of models has improved using the FastText features. For example, the accuracy score of CNN has increased from 0.92 to 0.93, and InceptionV3 shows an accuracy score of 0.92, which is better than its performance with GloVe. Similarly, the performance of LSTM and ResNet has also improved with FastText. Despite that, the performance of CNN is still lower than that of machine learning models with selective features.

Existing studies suggest that the performance of machine learning models is greatly influenced by the use of differ-

The statistical t-test has also been performed to show the

improved the performance. Secondly, the test is performed on the proposed approach and the LSTM approach of the previous best study [63]. The results show a test statistic of 5.5629 and a p-value of 0.007410, which also indicates that the proposed model has improved the performance. The difference is statistically significant with p < 0.05. The proposed model obtained the highest mean rank for accuracy.
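For readers unfamiliar with the test statistic, the following sketch computes a paired t statistic from two lists of matched scores. It assumes a paired design, which is not stated in the text above, and the input values are hypothetical placeholders rather than the scores behind the reported 5.5629. Converting the statistic to a p-value requires the Student t distribution, which is not part of the Python standard library and is therefore left out.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean difference of matched score pairs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)   # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical matched scores for two approaches, for illustration only.
    proposed = [3.0, 5.0, 7.0]
    baseline = [1.0, 2.0, 3.0]
    print(paired_t_statistic(proposed, baseline))
```

The statistic is compared against the t distribution with n - 1 degrees of freedom to obtain the p-value.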

Fake news presents a challenging problem and its importance has been elevated during the COVID-19 outbreak. Despite several existing approaches, very few studies investigate the importance of subset feature selection for fake news detection. Hence, a feature-based approach is presented in this study where different combinations of feature engineering and feature selection approaches are investigated. The impact of selecting TF-IDF, BoW, PCA, and Chi-square is analyzed regarding fake news detection. Three observations follow from the experimental results. First, selective features tend to yield better results than using all features. The use of TF-IDF and BoW features combined produces better results than PCA and Chi-square selected features. Secondly, the performance of the pre-trained deep learning models ResNet and InceptionV3 is marginally lower than that of machine learning models. For parameter optimization, such models require larger datasets to show better performance. Thirdly, preprocessing is very important to obtain high accuracy. Full preprocessing produces better results than no preprocessing or partial preprocessing for text analysis. The best results are therefore obtained using full preprocessing, where ET obtains a 0.9474 accuracy score for fake news detection. This study does not consider the distribution of classes in the dataset, and an imbalanced class distribution can influence the performance of models. The authors intend to perform fake news detection by combining textual and stylometric features in the future.