Using Adversarial Learning and Biterm Topic Model for an Effective Fake News Video Detection System on Heterogeneous Topics and Short Texts

Fake news videos are being actively produced and uploaded on YouTube to attract public attention. In this paper, we propose a topic-agnostic fake news video detection model based on adversarial learning and topic modeling. The proposed model estimates the topic distributions of a video's title/description and comments by topic modeling and identifies differences in stance from the difference between these two topic distributions. However, directly applying conventional topic models (e.g., LDA and PLSA) to such short texts may not work well. Therefore, we use the BTM (Biterm Topic Model), which is robust even on short texts, to estimate the topic distributions. The model then constructs an adversarial neural network to extract topic-agnostic features effectively. The proposed model can effectively detect topic changes for stance analysis and adapts easily to various topics. In this study, it achieves a 3.41%p greater F1-score than previous models used for fake news video detection.


I. INTRODUCTION
Recently, several studies have been conducted on detecting fake news videos uploaded on various video-sharing platforms such as YouTube. The topics of fake news videos characteristically change very quickly because their contents involve current societal issues. Conventional fake news detection studies that employ machine learning focus on the training data. However, fake news training data are highly volatile compared to those of other research areas. For example, if a fake news detection model is trained using the training data on fake news disseminated during the U.S. presidential election, it will not effectively detect fake news about COVID-19. Therefore, fake news video detection methods should extract the features of fake news without relying on particular topics to easily find any such videos with a new topic. Furthermore, fake news video detection methods should determine whether the topics in the same video are consistent.
The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.
In this study, we use two methods to detect fake news videos: (1) a topic modeling method for detecting topics in videos, and (2) an adversarial learning method for extracting features independent of the subject. Existing topic models such as LDA (Latent Dirichlet Allocation) are often used to find hidden topics in a text corpus. However, they may not be suitable for short texts: they find topics by capturing the co-occurrence patterns of words within a document, which leads to serious data sparsity problems when the document is short. To solve this problem, BTM (Biterm Topic Model) is used to estimate the topic distribution. The key idea of BTM is to learn topics over short texts based on the biterms aggregated over the whole corpus, which tackles the sparsity problem of a single document. Specifically, the whole corpus is considered as a mixture of topics, where each biterm is drawn from a specific topic independently. Therefore, we use BTM, which is robust even on short sentences, to extract the difference between the topics of the comments and the topics of the title/description of a YouTube video. Since the title/description and comments of YouTube videos usually consist of short sentences, BTM infers the topic distribution more effectively than the LDA method [1], and the difference of the topic distributions between the comments and the title/description is utilized to determine their stance distinction. Then, the stance difference of the videos is reflected in the encoding to improve the performance of fake news video detection. Three modules are constructed to proceed with adversarial learning [2], [3]: (1) a multimodal feature extractor, (2) a fake news video detector, and (3) a topic discriminator.
The multimodal feature extractor extracts effective features from three types of data, namely video, title/description, and comments, and builds the encoding vector of a video. The fake news video detector is a classifier constructed with the features extracted by the multimodal feature extractor to determine whether the video is real or fake. The topic discriminator is a classifier constructed using the encoding vector of the input video received from the multimodal feature extractor to predict the topic distribution. Therefore, a loss function is designed to proceed with adversarial learning to ensure that the estimated topic distribution of the topic discriminator will not be the same as the distribution of the topics extracted by the BTM. Through this learning mechanism, the multimodal feature extractor can be trained to extract topic-agnostic features effectively to increase the performance of the fake news video detector while decreasing the performance of the topic discriminator. With these components, we can construct an effective fake news video detection model that adapts well to rapid topic changes by extracting topic-agnostic features when detecting fake news videos.
The major contributions of the fake news video detection system proposed in this paper are as follows: (1) The topic distribution of a news video is automatically estimated by the BTM, and the stance is distinguished using the difference of the topic distributions between the title/description and comments. Then, the stance distinction is reflected in the encoding to improve detection performance. (2) Topic-agnostic features are extracted using an adversarial learning method to adapt to rapid topic changes.

II. RELATED WORK
A. FAKE NEWS DETECTION
1) TEXT-BASED FAKE NEWS DETECTION
Many fake news detection methods that use only text to detect fake news have been proposed [4]-[6]. Xu et al. [4] used the term frequency-inverse document frequency (tf-idf) and topic modeling to detect fake news by focusing on the key terms of news articles. Baird et al. [5] achieved the highest performance in the first stage of the Fake News Challenge, which was a stance classification task, using a combination of deep convolutional neural networks and gradient-boosted decision trees containing lexical features. Hanselowski et al. [6] detected the stance of fake news using six hidden layers and five multi-layer perceptron (MLP)-based classifiers with manually produced features.

2) IMAGE- AND TEXT-BASED FAKE NEWS DETECTION
Studies have also been actively conducted on multi-modal fake news detection using news images and articles. Singhal et al. [7] proposed SpotFake, which exhibited higher fake news detection performance than conventional detection models despite its simpler architecture. SpotFake comprises three modules. The first is a text feature extractor that uses a language model, namely the Bidirectional Encoder Representations from Transformers (BERT), to extract contextual text features. The second is a visual feature extractor, and the third is a fusion module that combines the features obtained from the other two to form a single news feature vector.
Previous studies also used adversarial neural networks for fake news detection. Wang et al. [8] proposed a fake news detection model using an adversarial neural network to learn how to avoid extracting features related to one topic. However, because a fake news video may contain multiple topics, the discriminator proposed by Wang et al. [8], which predicts only one topic, cannot extract topic-agnostic features effectively. In this study, topic modeling is used to create the labels of the topic discriminator automatically and predict the distribution of more than one topic for more effective detection of fake news videos.

3) VIDEO- AND TEXT-BASED FAKE NEWS DETECTION
Previous studies also focused on the detection of fake news videos using video and text data. Recently, fake news video benchmark data for efficient detection of fake news videos were established, and a model that can effectively detect fake news videos using the constructed benchmark data was proposed [9], [10]. Palod et al. [11] used simple features with machine learning models, such as naive Bayes, support vector machines, and decision trees, to detect fake news videos. For example, the percentages of "likes" and "dislikes" and the percentage of violent words expressed in a video title were considered. Simple features and long short-term memory (LSTM) networks were used to encode the comments of a video, and the encoding vectors of comments were linked to detect fake news videos.

III. PROPOSED METHOD
The proposed model uses BTM to estimate the topic distribution of a video and calculate the stance difference in a video automatically. Then, an adversarial learning method is applied to extract the topic-agnostic features for effectively detecting fake news videos. Figure 1 shows the process of the proposed model. The modules represented within the dotted lines are newly proposed and are implemented in the proposed method while the other modules are regarded as the baseline model.

A. BASELINE MODEL
1) COMMENTS ENCODING
All words in the comments are first converted into word segments using BERT and then encoded into vector representations [12]. The k-th word segment vector of dimension d is denoted as $x_k \in \mathbb{R}^d$. To encode the i-th comment, the comment is divided into word segments and a bidirectional LSTM is used to generate the comment embedding
$$H_i = [\overrightarrow{h}^i_n ; \overleftarrow{h}^i_1]$$
where $H_i$ is the vector representation of the i-th comment with n word segments, $\overrightarrow{h}^i_n$ is the n-th hidden vector from the forward direction of the LSTM, and $\overleftarrow{h}^i_1$ is the first hidden vector from the backward direction of the LSTM. In conventional fake news video detection studies, all comments existing in a video are simply averaged to create an overall comment embedding. In this study, however, the number of "likes" of each comment is exploited to create more effective embeddings than those created by the conventional embedding method. The numbers of "likes" are defined as $C = \{c_1, c_2, \cdots, c_N\}$, where N is the number of comments contained in the video and $c_i$ is the number of "likes" of the i-th comment. The overall comments embedding H of a video is generated as a sum of the comment embeddings $H_i$ weighted by their numbers of "likes" (equation (1)), where $c_{total}$ is the sum of the numbers of "likes" of all comments in a video; in equation (1), the Laplace estimation is applied by adding 1 to $c_i$.
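The like-weighted pooling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pool_comments` is hypothetical, and the normalizer `c_total + N` is an assumption chosen so that the add-one smoothed weights sum to 1.

```python
def pool_comments(comment_vecs, likes):
    """Like-weighted sum of per-comment embeddings with add-one smoothing.

    comment_vecs: list of N embedding vectors H_i (lists of floats).
    likes:        list of N like counts c_i.
    """
    if not comment_vecs or len(comment_vecs) != len(likes):
        raise ValueError("need at least one comment and one like count per comment")
    n = len(likes)
    # Assumed normalizer: c_total + N, so that the smoothed weights sum to 1.
    denom = sum(likes) + n
    dim = len(comment_vecs[0])
    pooled = [0.0] * dim
    for vec, c in zip(comment_vecs, likes):
        weight = (c + 1) / denom  # adding 1 keeps comments with zero likes
        for j in range(dim):
            pooled[j] += weight * vec[j]
    return pooled


# Two toy comments: the one with 2 likes gets weight 3/4, the other 1/4.
H = pool_comments([[1.0, 0.0], [0.0, 1.0]], likes=[2, 0])
```

With add-one smoothing, a comment with no "likes" still contributes to H, while heavily liked comments dominate the pooled embedding.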

2) TITLE/DESCRIPTION ENCODING
As with the comments, the title/description of a video is converted into word segments using BERT. The Text-CNN proposed by Kim [13] uses multiple filters with various window sizes to extract a variety of features and generates the title/description encoding vector $R \in \mathbb{R}^{2d}$.

3) VIDEO ENCODING
The pre-trained VGG-19 is utilized to extract visual features from video thumbnails and video frames [14]. The visual encoding vector of a video thumbnail is denoted as $V_{thumbnail} \in \mathbb{R}^{2d}$, and the encoding vector of the i-th video frame is denoted as $V_{frame(i)} \in \mathbb{R}^{2d}$. The cosine similarity score between $V_{thumbnail}$ and $V_{frame(i)}$ is calculated to extract the J frames that are most similar to $V_{thumbnail}$. Then, $V_{thumbnail}$ and the encoding vectors of the extracted J frames are concatenated to construct a two-dimensional matrix L. $V_{thumbnail}$ ($= V_{frame(J+1)}$) is added because the thumbnails of some videos are not extracted from the original video frames. For $V_{thumbnail}$ and L, the attention scores are calculated using equations (2) and (3), and the final video encoding vector V is calculated using equation (4).
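The frame selection step can be illustrated with a short sketch. It only covers ranking frames by cosine similarity to the thumbnail and stacking the result into L; the attention step of equations (2) to (4) is not shown. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_frame_matrix(thumbnail, frames, j):
    """Return L: the J frames most similar to the thumbnail, then the thumbnail.

    The thumbnail is appended as row J+1 because it may not be one of the
    original video frames.
    """
    ranked = sorted(
        range(len(frames)),
        key=lambda i: cosine(thumbnail, frames[i]),
        reverse=True,
    )
    return [frames[i] for i in ranked[:j]] + [thumbnail]


thumb = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
L = build_frame_matrix(thumb, frames, j=2)
```

Here L has J + 1 rows, and the frame orthogonal to the thumbnail is dropped.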

4) ENCODING VECTOR INTEGRATION
In previous studies, a linear combination is used to combine two feature vectors of text and video effectively [15], [16]. In this study, however, there are three encoding vectors (video vector V, title/description vector R, and comments vector H), and a single linear combination is not suitable. Therefore, the following procedure is executed. First, the parameter $m_1$ is learned to reflect the relationship between the video encoding vector (V) and the title/description encoding vector (R) and thereby determine their encoding ratio (equations (5) and (6)), where $m_1$ is a scalar value, and $P \in \mathbb{R}^{2d}$ is an encoding vector containing the information of both the title/description and the video.
The same process is applied to combine H, which includes the comment information of the video, with R (equations (7) and (8)), where $m_2$ is a scalar value, and $E \in \mathbb{R}^{2d}$. Then, P (equation (6)) and E (equation (8)) are concatenated and delivered to the fake news video detector to check whether the video is fake.

B. STANCE DIFFERENCE ESTIMATION WITH TOPICS DISTRIBUTION
The topics distribution estimation process in Figure 1 illustrates the method of estimating the topic distributions of the comments and the title/description by using the BTM based on Gibbs sampling. Inferring an accurate topic distribution from the video is important because the proposed method relies on topic modeling to determine the topics of the video. Therefore, we use the BTM, which is robust even on short sentences, to infer the topic distribution. All M videos included in the training data are used to create the title/description documents written by the video providers and the comment documents written by readers. Then, after preprocessing a total of 2M documents, the BTM estimates the distribution of K topics: the topic distribution $\theta_{title/description}$ of the title/description documents, and the topic distribution $\theta_{comments}$ of the comment documents. Because there are three types of data to be encoded in the proposed model, the data should be effectively encoded according to the characteristics of each video. Among the data types, comments have been used as a key feature for detecting fake news in many studies [17]. The topic distribution difference, or stance difference, between the title/description and comments is commonly large for fake news videos. Therefore, we increase the encoding ratio of the comments when the difference between the two estimated distributions of the comments and title/description is large.
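To make the notion of a biterm concrete, the sketch below extracts the unordered word pairs that BTM aggregates over the whole corpus. It covers only biterm extraction and counting, not the Gibbs sampling that estimates the topic distributions; the function names and the toy token lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterms(docs):
    """Aggregate biterm counts over all documents in the corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy preprocessed documents (e.g., one title and one comment).
docs = [["vaccine", "hoax", "claim"], ["vaccine", "hoax"]]
counts = corpus_biterms(docs)
```

Because the counts are pooled across documents, a pair such as ("hoax", "vaccine") accumulates evidence even when each individual text is only a few words long, which is how BTM sidesteps per-document sparsity.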
We measure the stance difference s between the topic distribution of the comments by the readers and the topic distribution of the title/description by the video provider using the Jensen-Shannon divergence as follows [18]:
$$s = \frac{1}{2} D_{KL}(\theta_{title/description} \,\|\, \bar{\theta}) + \frac{1}{2} D_{KL}(\theta_{comments} \,\|\, \bar{\theta}), \quad \bar{\theta} = \frac{1}{2}(\theta_{title/description} + \theta_{comments}) \tag{9}$$
Min-max normalization is applied to normalize the score of each video using the minimum and maximum scores over all training videos:
$$s_{normalized} = \frac{s - \min(S)}{\max(S) - \min(S)} \tag{10}$$
where $s_{normalized}$ is the normalized stance difference and S is the set of s of all videos contained in the training data. The calculated $s_{normalized}$ is applied to equation (7) instead of $m_2$ to determine the encoding ratio of H. If the stance difference is large, the encoding ratio of the comments increases; otherwise, the encoding ratio of the title/description increases (equation (11)). Finally, E is concatenated with P of equation (6), and the result is used as an input to the fake news video detector.
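The stance-difference computation can be sketched in a few lines. The Jensen-Shannon divergence and min-max normalization follow their standard definitions; the function `mix`, which blends H and R as a convex combination weighted by the normalized stance difference, is an assumption about the form of the encoding-ratio equation and is only meant to show the intended behavior.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def min_max(scores):
    """Min-max normalize stance differences over the training videos."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def mix(h, r, s_norm):
    """Assumed form: E = s_norm * H + (1 - s_norm) * R."""
    return [s_norm * hj + (1 - s_norm) * rj for hj, rj in zip(h, r)]


theta_title = [1.0, 0.0]
theta_comments = [0.0, 1.0]
s = js_divergence(theta_title, theta_comments)  # maximal disagreement: ln 2
```

A large normalized score shifts E toward the comments embedding H, and a small one shifts it toward the title/description embedding R.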

C. ADVERSARIAL NEURAL NETWORK
1) FAKE NEWS VIDEO DETECTOR
The fake news video detector is constructed with an MLP. The detection loss $L_y$ is calculated using the cross-entropy as follows:
$$L_y = -\mathbb{E}_{(v, y) \sim (VC, Y)} \left[ y \log d_{fake} + (1 - y) \log (1 - d_{fake}) \right] \tag{12}$$
where $d_{fake}$ is the output of the fake news video detector, Y is the label set that indicates whether the video is fake, and VC is the set of all news videos.

2) TOPIC DISCRIMINATOR
The topic discriminator consists of the gradient reversal layer (GRL) and the MLP. We use the GRL proposed by Ganin et al. [19], which works as an identity function in the forward step and, in the backward step, delivers the result of multiplying the gradient by −λ to the previous layer. We create the multimodal features O by concatenating all three features to be provided to the topic discriminator:
$$O = [V ; R ; H]$$
The topic discriminator predicts the distribution of K topics based on the input O. The loss function $L_{td}$ of the topic discriminator is calculated from $Y_{td}$, the topic distribution of the title/description estimated by the BTM, and $\hat{y}_i$, the output of the topic discriminator, i.e., the predicted probability of the i-th topic in the topic distribution. The final loss function $L_{final}$ of our proposed model is then formed from the detection loss $L_y$ and $L_{td}$, and the proposed model is updated accordingly. To stabilize the training process, we reduce the learning rate η as follows [19]:
$$\eta' = \frac{\eta}{(1 + \alpha \cdot X)^{\beta}}$$
where η' is the reduced learning rate, α = 10, β = 0.65, and X changes linearly between 0 and 1 according to the training progress.
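The learning-rate reduction can be sketched as below, assuming the schedule takes the annealing form used by Ganin et al. [19], i.e., the base rate divided by (1 + α·X)^β. The function name `decayed_lr` is hypothetical; α = 10 and β = 0.65 are the values stated in the text, and the base rate of 0.0001 is the Adam learning rate reported in the implementation details.

```python
def decayed_lr(base_lr, progress, alpha=10.0, beta=0.65):
    """Learning rate after annealing; progress X runs linearly from 0 to 1."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return base_lr / (1.0 + alpha * progress) ** beta


# Learning rate at the start, midpoint, and end of training.
schedule = [decayed_lr(1e-4, x) for x in (0.0, 0.5, 1.0)]
```

The rate starts at the base value and decreases monotonically, so late updates from the adversarial objective perturb the feature extractor less than early ones.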

IV. EXPERIMENTAL SETTINGS
A. YouTube DATASETS
Table 1 shows the data distribution of the three datasets used in the experiments. The Fake Video Corpus (FVC) dataset, which was developed in the In VIDeo Veritas (InVID) project [9], is composed of fake and real news videos of various topics, including politics, sports, and accidents. The Volunteer Annotated Video Dataset (VAVD) is a publicly available fake news video dataset developed by Palod et al. [11], and the Misleading YouTube Video Corpus (MYVC) was collected and annotated for this study. To build the MYVC, we first collected fake news and real news from the popular fact-checking websites, Snopes, Fact Check-CNN, and PolitiFact. Next, we collected fake and real news from YouTube to augment the fake news video data. Two researchers evaluated the collected video data. The obtained Cohen's kappa coefficient was 0.89, indicating that the constructed video data could be considered reliable. Finally, we combined all three datasets to create a full dataset after removing the duplicate videos.
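For reference, Cohen's kappa for two annotators can be computed as in the sketch below. The function name and the toy labels are hypothetical and are not taken from the MYVC annotation; the formula is the standard one, (p_o − p_e) / (1 − p_e), with observed agreement p_o and chance agreement p_e.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: the annotators disagree on one of four videos.
kappa = cohens_kappa(
    ["fake", "fake", "real", "real"],
    ["fake", "real", "real", "real"],
)
```

In this toy case p_o = 0.75 and p_e = 0.5, giving a kappa of 0.5; values close to 1, such as the 0.89 reported above, indicate agreement well beyond chance.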

B. IMPLEMENTATION DETAILS
To evaluate the performance of our proposed model, we denoted the model that excludes the modules marked by dotted lines in Figure 1 as FANVM_Base (FAke News Video Model) and set it as the baseline system in our experiments; the proposed model was denoted as FANVM_Our. The experiment was implemented using PyTorch in Ubuntu 16.04 LTS with a TITAN X GPU. The hidden state size of the LSTM was set to 300, and the embedding size of BERT was set to 768. Three filters with sizes of 3, 4, and 5 were considered for the CNN, each of which consisted of 14 kernels. Furthermore, a drop-out method was applied to the fully connected layer to prevent over-fitting. The drop-out ratio was set to 0.5. The proposed model was trained using Adam with a learning rate of 0.0001 [20]. The optimal parameters of BTM were determined using perplexity and coherence. The number of topics was set to 15, and the number of iterations was set to 50. The model that achieved the highest F1-score in the validation set was selected for the final test. Identical hyper-parameter settings were used in all experiments unless otherwise specified.

A. EFFECT OF STANCE DIFFERENCE ESTIMATION
Table 2 shows the F1-score achieved by each module in the proposed model. The F1-score increased by 2.00%p when the topic distribution stance distinction was added to FANVM_Base. We analyzed why the performance improved when comments were encoded with a high ratio when detecting fake news videos. In the comments encoding in Figure 1, we used the number of "likes" to assign weight to each comment embedding. Put differently, the influence of a comment embedding is enhanced by a large number of "likes" on the comment. From the comments of the fake and real news videos, we found that those with many "likes" in the fake news videos were mostly ones that fact-checked the corresponding fake news videos or ones that rejected incitement.
In fake news video detection, when the stance difference is large, i.e., when an input video has a high probability of being fake news, highly weighted comment embeddings can increase the importance of the abovementioned comments with many "likes".

B. EFFECT OF ADVERSARIAL NEURAL NETWORK
We applied two methods in the adversarial neural network experiments. One method, called the Label method, uses a one-hot representation of the topics of videos, and the other method, called the Distribution method, uses a probability distribution of the topics of videos. The Label method estimates the topic distribution of the title/description in Figure 1 and sets the topic with the highest probability among the distribution of K topics as the ground truth label of the topic discriminator. Conversely, the Distribution method sets the distribution of K topics of the title/description as the ground truth labels of the topic discriminator to perform the training. When the adversarial neural network was added to FANVM_Base, the Label method improved the F1-score by 2.45%p while the Distribution method improved the F1-score by 3.39%p, indicating that the extraction of topic-agnostic features based on the topic distribution and the adversarial neural network is more effective in detecting fake news videos. This is attributed to a single fake news video commonly containing various topics. Finally, when both the stance difference and adversarial neural network were applied to FANVM_Base, a 3.92%p increase in the F1-score was obtained.
Table 3 shows the F1-scores obtained with FANVM_Our, MVAE [21], SpotFake [7], and UcNet [11]. For accurate performance comparison, we conducted the experiments in the same experimental environment as a conventional fake news video system [11]. As in the study by Palod et al. [11], to evaluate the models, the VAVD data were divided into a training set and a testing set following a 7:3 ratio, and for all other datasets, 5-fold cross-validation was used to perform the evaluation. Because there are very few studies on fake news video detection, we used multi-modal fake news detection systems as the comparison models, except for UcNet. These multi-modal fake news systems use news images and news texts to check whether the news is real or fake. In our experiments, we reimplemented MVAE and SpotFake with the comments embedding H added so that they leverage all three embeddings V, R, and H. Vector O is simply created by concatenating the different embeddings and is used to identify the fake news videos. From Table 3, it can be seen that FANVM_Our outperforms the existing models on all datasets.
Figure 2 shows the encoding ratio of comments (H) and title/description (R) obtained on the full dataset using equation (11). For fake news videos, the encoding ratio of the comments is 74% on average. In contrast, for real news videos, it is only 29%. These results indicate that when our proposed stance difference estimation of topic distribution is used to detect a fake news video, the encoding ratio should be adjusted dynamically depending on whether the news video is real or fake.

FIGURE 2. Comments and title/description encoding ratio based on the full dataset.

E. TOPIC DISCRIMINATOR PERFORMANCE ANALYSIS
In Section III.B, the target topic distribution of the topic discriminator is the distribution of the title/description. By comparing the performance of FANVM_Our on each dataset, we verified that our selection, the title/description, is more effective than the comments. Figure 3 shows the F1-scores of FANVM_Our when a different target topic distribution is used. FANVM_Our with the target topic distribution of the title/description achieved better performance on all the datasets than FANVM_Our with that of the comments. These results indicate that the topic distribution of the title/description is more effective in detecting fake videos: the title/description is written by the video provider, whereas the comments are made by many readers and therefore generally contain diverse topics regardless of the video contents.
Figure 4 shows the F1-scores of FANVM_Our according to the topic modeling method. The model inferring the topic distribution using BTM achieved higher performance on all datasets than the one using LDA. Since the proposed model computes the stance between the title/description and comments from the topic distributions and utilizes the topic distribution of the title/description as a topic label for the adversarial neural network, it is very important to extract the topic distribution from the video accurately. The title/description and comments of videos mostly consist of short sentences. In particular, the FVC data showed a greater performance improvement than the rest of the datasets. This is because the title/description and comments of fake news videos in the FVC data are relatively shorter than those in the other datasets. Thus, this experiment demonstrated that inferring the topic distribution using BTM is more effective in detecting fake news videos than using LDA.

G. FANVM: ADAPTABILITY TO NEW TOPICS
Table 4 shows the F1-scores of FANVM_Our and SpotFake, which achieved the highest performance among the comparative models.
After setting the hyperparameters by training on the FVC data with an 8:2 ratio of training data to validation data, we tested with the MYVC dataset. The videos of the FVC dataset were collected in 2018 while those in the MYVC dataset were collected in 2021. In particular, the MYVC dataset contains many news videos on current issues related to COVID-19. Through this experiment, we could determine whether our proposed model was highly adaptable to topic changes.
In Table 4, the performance difference between FANVM_Our and SpotFake is 7.65%p. This result demonstrates that the proposed method, FANVM_Our, detects fake videos including new topics more effectively than SpotFake. Furthermore, FANVM_Our achieved a significant improvement of 4.66%p when we applied the adversarial learning method. This could be evidence that the proposed method with adversarial learning extracts topic-agnostic features effectively.

VI. CONCLUSION AND FUTURE WORK
We proposed two effective techniques for detecting fake news videos. First, the stance difference between the title/description and comments, estimated by topic modeling, was used to dynamically adjust the encoding ratio of the comments when determining whether a video is real or fake. Second, to handle video topic changes effectively, we proposed an adversarial neural network framework that is adaptable to new topics along with a multi-modal feature extractor that can extract topic-agnostic features effectively. Furthermore, we presented FANVM and the MYVC dataset collected by our group. The experiments on three datasets collected from YouTube demonstrated that the proposed model outperforms state-of-the-art models.
In future studies, we will focus on utilizing the subtitles and speech in videos to detect fake news videos more effectively.