What and Why? Towards Duo Explainable Fauxtography Detection Under Constrained Supervision

Fauxtography is a category of multi-modal posts that spread misleading information on various big data online social platforms that generate billions of posts on a daily basis (e.g., Facebook, Twitter, Reddit). A fauxtography post usually consists of an image, a text description, and comments from its readers. In this paper, we focus on explaining fauxtography posts by identifying what specific component and why that component of a post leads to the fauxtography (i.e., duo explanations). This problem is motivated by the limitations of current fauxtography detection solutions that focus only on detection and ignore the important explanation aspect of their results. Two critical challenges exist in solving our problem: i) it is difficult to accurately identify the "guilty" component of a fauxtography post given the fact that different components of the post and their associations could all lead to the fauxtography; ii) it is expensive and time-consuming to obtain a good training set with fine-grained labels of fauxtography posts in terms of explainability, making it challenging to develop fully supervised explainable solutions. To address the above challenges, we develop a Duo Explainable Fauxtography Detection Framework under Constrained Supervision (DExFC) to generate duo explanations from both the content and comment parts of fauxtography posts. We evaluate DExFC by creating real-world datasets from different online social media platforms (Twitter and Reddit). The results show that DExFC not only detects fauxtography posts more accurately than the state-of-the-art solutions but also provides well-justified explanations for its results without full supervision.


INTRODUCTION
In this paper, we develop a deep graph-based co-attention framework to address the explainable fauxtography detection problem under constrained supervision. Big data social media applications generate billions of online posts on a daily basis [1], [2]. Some of these applications lead to the proliferation of misleading content, such as fake news [3], fallacious images [4], deepfake videos [5], and fauxtography [6]. Among them, fauxtography detection is considered one of the most challenging misinformation detection problems due to its complex nature [7], [8]. A fauxtography post is an image-centric piece of content that consists of an image, a text description, and a list of comments from its readers. In this paper, we focus on an explainable fauxtography detection problem where our goal is to address the following question with a limited amount of annotated data: what component of a fauxtography post is false and why is that component false (i.e., duo explanation)?
Recently, several initial efforts have been made to study the problem of fauxtography detection in big data analytics and machine learning [6], [7], [9], [10], [11]. However, those solutions only use social media posts with binary fauxtography labels (i.e., fauxtography or non-fauxtography) as training data and can only estimate whether a post is fauxtography without explaining their decisions (i.e., what specific component of the post is false and why that component is false). Fig. 1 shows examples of four types of fauxtography posts. We observe that each type of fauxtography post contains some false component(s) that delivers the misinformation (e.g., a fake image, a false claim, or an incorrect association between the image and claim). Research findings in social and cognitive science have demonstrated that people are more convinced by a decision if they are provided with the rationale or reasoning behind the decision-making process [12]. Therefore, we believe it is critical for social media users not only to know which post is identified as fauxtography but also to understand the reasons behind such a decision. However, the explainable fauxtography detection problem remains largely unresolved in the current literature. This paper develops an end-to-end co-attention graph-based neural network approach to address the explainable fauxtography detection problem by providing explanations from both the content (image and text) and user comments of the fauxtography posts. Such an explainable fauxtography detection problem is non-trivial to solve due to two technical challenges, which we elaborate on below.

Diversified Fauxtography Types: A possible solution to the explainable fauxtography detection problem is to apply misinformation detection [13], [14] and image forgery detection methods [4], [15] to the text description and image part of a post, respectively.
However, a significant limitation of this solution is that it ignores an important category of fauxtography posts where the association between the text and image parts is incorrect (e.g., the example shown in Fig. 1d). It is also difficult for such a method to retrieve valuable information from users' comments on the post for the purpose of explaining the fauxtography. This is mainly because many of the users' comments are made on the basis of the entire post rather than a specific component. For example, readers of Fig. 1b often need to read both the text description and the image before they contribute their comments. Therefore, explaining fauxtography posts of diversified types and identifying the exact component that leads to the fauxtography decision remains a challenging problem.
Lack of Modality-Level Annotations: An alternative way to solve the explainable fauxtography detection problem is to annotate each component of the post (as true or false) and then train a fully supervised learning model to identify the false component. However, it is extremely time-consuming and expensive to obtain such a large fine-grained training set with modality-level labels [16]. For example, annotators need to label all components of a post by considering the text, the image, and the association between the text and image, in addition to the binary fauxtography label of the entire post. Moreover, a fully supervised training pipeline with modality-level annotations adds a non-trivial amount of overhead to the training process (e.g., longer training time and a more complex parameter adjustment procedure). Therefore, it is more desirable to develop an explainable fauxtography detection scheme under constrained supervision (e.g., by utilizing a very limited amount of modality-level annotations during training), which is not a trivial task.
To address the above challenges, we develop DExFC, a Duo Explainable Fauxtography Detection Framework under Constrained Supervision (see Fig. 2). In particular, to address the first challenge, we develop a duo graph neural network architecture to explicitly explore and integrate different components of a post and decide what specific component(s) could lead to the fauxtography detection. To address the second challenge, we construct a multi-modal co-attention module that generates attention scores for the comments in our design to identify why the specific component(s) is false without the need for modality-level annotations. Moreover, we design a modality-level refinement module that utilizes a small set of fauxtography posts with modality-level annotations (if available) to optimize both detection and explanation performance. To the best of our knowledge, DExFC is the first end-to-end deep learning approach to solve the duo explainable fauxtography detection problem under constrained supervision. We evaluate the DExFC framework on real-world datasets collected from Twitter and Reddit. The evaluation includes both an objective study using quantitative metrics to evaluate the fauxtography detection accuracy and a real-world user study on Amazon Mechanical Turk (AMT) [17] to evaluate the effectiveness and efficiency of the explainability of DExFC.

Fig. 1. Example cases of fauxtography. Example (a) includes both a fake picture and a misleading text description. In example (b), the description tries to fool people with a real photo taken with the camera close to a normal-sized rat. In example (c), though such a creature does exist in the world, the color of the picture has been manipulated by computer software. In example (d), both the text description and the image are true; however, the woman mentioned in the text is not the one in the image.
The results show that DExFC not only outperforms the state-of-the-art baselines by detecting fauxtography posts on social media more accurately but also provides reasonable and well-justified explanations for the detected posts.
A preliminary version of this work (i.e., ExFaux) was published in [18]. This paper is a significant extension of ExFaux in the following aspects. First, we focus on a more general explainable fauxtography detection problem by removing an important assumption in ExFaux that prevents the solution from using modality-level annotations even if they are available. In particular, DExFC is able to leverage a small number of posts with modality-level annotations to significantly improve the model performance (Section 4). Second, we study a new challenge in accurately explaining a fauxtography post from both the content and comment parts, which is essential to answer the questions of what and why a specific component of the post is misleading (Sections 3 and 4). Third, we design two new deep network modules (i.e., a two-level graph structure and a multi-modal co-attention mechanism) in DExFC to fully explore the relations between the content and comment components of input posts, which significantly improves the detection and explanation performance (Section 4). Fourth, we extend our evaluation datasets to include more recent fauxtography posts (up to 2020) for a more comprehensive and robust study of DExFC and the compared baselines (Section 5). Fifth, we add four new baselines (i.e., HPA [19], AIFN [20], MVAE [21], DEAN [22]) related to multi-modal fake news detection, as well as the original ExFaux scheme, to further investigate the effectiveness and efficiency of the proposed DExFC scheme (Section 5). Last, we extend the related work discussion by adding a discussion on the topic of semi-supervised fake news detection and including more recent work on fauxtography detection and graph neural networks (Section 2).

Fauxtography Detection
The concept of "fauxtography" was first defined by Cooper et al. to describe misleading multi-modal content that spread during the 2006 Lebanon War [6]. To detect fauxtography posts on online social media, Carvalho et al. [23] developed an illumination-based framework that focuses on fake image detection. Zhang et al. [10] proposed FxBuster, a content-free fauxtography detector that explores the user comment network and users' emotions to identify fauxtography posts. Shang et al. [11] extended FxBuster by leveraging the heterogeneous information (e.g., the linguistic and semantic comment information) in the comments of posts to further improve the fauxtography detection performance. Zlatkova et al. [7] developed a content-based approach that identifies fauxtography posts by exploring the URLs of the images in the posts. However, none of the above approaches addresses the explainable fauxtography detection problem. Recently, Kou et al. [18] designed ExFaux to address the explainable fauxtography detection problem. However, ExFaux can only identify what component of a fauxtography post is misleading but fails to explain why that component is false. Moreover, ExFaux only works in a weakly supervised manner and cannot take advantage of the modality-level annotations of a post even if they are available. In this paper, we develop a novel DExFC framework to address the explainable fauxtography detection problem by designing an end-to-end co-attention graph neural network solution. The DExFC not only achieves better fauxtography detection accuracy but also addresses the what and why questions of the fauxtography explanation problem by leveraging a small set of posts with modality-level annotations under constrained supervision.

Explainable Rumor Detection
Efforts have been made to address the explainable rumor detection problem in many communities such as data mining, machine learning, and computer vision [3], [19], [21], [24], [25], [26], [27]. For example, Yang et al. [24] proposed a multi-level attention Recurrent Neural Network (RNN) to detect false documents and retrieve related sentences or words from the documents with attention scores as explanations. Jin et al. [27] developed an attention LSTM network to integrate image features into text features for rumor detection and explanation. Guo et al. [19] developed a fake news detection framework that segments reader engagements (e.g., comments or re-posts) into multiple levels and applies a hierarchical attention neural network to detect fake news and retrieve related sentences as explanations. Shu et al. [3] presented a deep attention RNN with a co-attention module for fake news article detection and generated attention scores that are used to select news content and comments as explanations. However, those approaches primarily focus on a single data modality (i.e., the text in the news/articles) and cannot be directly applied to solve the explainable fauxtography detection problem, where the false information can hide in different data modalities and/or the association between them (e.g., Fig. 1d). In this paper, we develop the DExFC to address the explainable fauxtography problem by explicitly considering the diversified types of fauxtography posts and the different data modalities in their detection and explanation.

Fake News Detection With Semi Supervision
A few recent efforts strive to address the fake news detection problem in a semi-supervised manner [28], [29], [30], [31]. For example, Shu et al. [28] proposed a weak social supervision concept that leverages heuristic rules and extrinsic knowledge sources (e.g., user comments, image links) to mitigate the scarcity of labeled data. Benamira et al. [29] developed a graph-based semi-supervised network to detect fake news by capturing contextual dependencies among different news articles. Guacho et al. [30] proposed a content-based fake news detection framework that utilizes a graph-based network to leverage a small set of labeled news articles to infer the truthfulness of unlabeled articles. Abdali et al. [31] developed HiJoD, a semi-supervised tensor-based embedding framework to detect misinformation by considering a diverse set of factors (e.g., news content, host website) that characterize a news article. However, the above semi-supervised approaches need information beyond the news content (e.g., image links, online post links) to make their judgments, which is not always available in fauxtography posts. Some recent semi-supervised fake news detection methods [30], [32] optimize their solutions by focusing only on the content of the fake news/articles. However, these solutions often require a non-trivial amount of labeled data in the classification training process of their models, which is too expensive to obtain in our fauxtography explanation problem setting that needs both post-level and modality-level labels. In this paper, we develop a novel modality-level refinement module for DExFC that encodes a limited number of annotated posts to significantly improve the performance of fauxtography detection and explanation.

Graph Neural Network
Our work is also related to graph neural network techniques, which have been applied in many areas such as action recognition, intelligent transportation, recommender systems, and computer vision [33], [34], [35], [36], [37]. For example, Shi et al. [33] proposed a multi-stream attention graph convolutional network to recognize skeleton-based actions in an end-to-end manner. Zhao et al. [34] developed a temporal graph convolutional network to forecast urban road traffic using topological structures to capture spatial dependence. Ying et al. [35] developed a data-efficient graph convolutional network with a random walk strategy to build a large-scale deep recommendation engine for web content. Wang et al. [36] proposed a spatial-temporal graph neural network to understand long instructional videos by exploring the interactions between objects (e.g., a knife and apples) in the videos. However, the above GNN approaches only address the problems in their specific domains and cannot be easily extended to solve the explainable fauxtography detection problem due to the multi-modal characteristics and complex nature of fauxtography posts. To the best of our knowledge, DExFC is the first end-to-end graph-based learning approach to solve the explainable fauxtography detection problem by addressing the challenges brought by the diversified types of fauxtography posts and the lack of modality-level annotations for full supervision.

PROBLEM DESCRIPTION
In this section, we formally define the explainable fauxtography detection problem. We first define a few key terms that will be used in the problem statement.
Definition 1 Post (P). In the explainable fauxtography detection problem, we define an online social media post as a post composed of a text description $T$, an image $I$, and a comment list $C$, such as the post examples in Fig. 1. A set of posts is denoted as $\mathcal{P} = \{P_1, P_2, \ldots, P_N\}$, where $N$ is the total number of posts and $P_n = \{T_n, I_n, C_n\}$ is the $n$th post in $\mathcal{P}$.
Definition 2 Fauxtography. We consider a post as a fauxtography post if it contains false information in the text description, image or the association between the text and image (e.g., Fig. 1d). Otherwise, the post is non-fauxtography.
Definition 3 Post-level Annotation. Given a post $P_n$, we define the post-level annotation as $Z_n^P \in \{1, 0\}$ to indicate whether the post is fauxtography (i.e., 1 for fauxtography and 0 for non-fauxtography). For example, the post-level annotations for all the posts in Fig. 1 are $Z_n^P = 1$.
Definition 4 Modality-level Annotation. Given a post $P_n$, we define the modality-level annotations as $Z_n^T \in \{1, 0\}$ and $Z_n^I \in \{1, 0\}$ (1 for false, 0 for true) for the text and image components, respectively. For example, the modality-level annotations for the post in Fig. 1c are $Z_n^T = 0$ and $Z_n^I = 1$. Finally, we define a post $P_n$ with both post-level and modality-level annotations $Z_n = \{Z_n^P, Z_n^T, Z_n^I\}$ as a fully annotated post.

Definition 5 Constraint Set. We define the constraint set $\mathcal{P}^G = \{P_1^G, \ldots, P_K^G\}$ as a set of fully annotated fauxtography posts.
, which means the detected fauxtography post contains a false text description but a genuine image.
Definition 8 Comment Explainability. It refers to the capability of a framework to identify a small number (i.e., $k$) of relevant comments $C_n^k$ from the user comment list $C_n$ of a fauxtography post $P_n$ to explain why a specific component(s) of the post is false. For example, if the post in Fig. 1d is correctly identified by a framework as fauxtography, the goal of the comment explainability is to identify the top $k$ comments that explain that the woman in the image is not the same one described in the text (i.e., the reason for the fauxtography detection is the wrong association between the text and image components of the post).
Please note that the content explainability identifies what specific component(s) of a post is false, and the comment explainability explains why such component(s) is false. The content and comment explainability aspects are complementary to each other and together provide a comprehensive explanation of the detected fauxtography post. The goal of our explainable fauxtography detection problem in this paper is to i) detect the fauxtography posts, and ii) provide both content and comment explainability for the detected posts. Using the definitions above, our problem is formally defined as follows: given the posts in $\mathcal{P}$ and the constraint set $\mathcal{P}^G$, estimate $\hat{Z}_n$ and $\hat{C}_n^k$ for each post $P_n$, where $\hat{Z}_n$ is the fauxtography estimation of the post $P_n$ that indicates i) whether the post is fauxtography, and ii) if yes, what component(s) of the post is false. $\hat{C}_n^k$ represents the top $k$ comments retrieved from the user comment list to explain why a specific component(s) of the post is false. $\mathcal{P}^G$ is the constraint set defined above. Please note that ExFaux in our conference paper is a special case of our current problem, where we did not include $\hat{C}_n^k$ in the problem formulation and assumed $\mathcal{P}^G = \emptyset$ in the model. To address this problem, we develop the DExFC framework in the next section.

SOLUTION
DExFC is an end-to-end graph-based co-attention framework to address the explainable fauxtography detection problem. The overview of the DExFC is shown in Fig. 3. It consists of four modules: 1) a duo multi-modal graph convolutional feature encoder (DGFE), 2) a modality-level graph refinement module (MGR), 3) a multi-modal co-attention module (MCA), and 4) a modality-level discriminator (MLD). First, the DGFE module uses a set of feature encoding networks to encode the image, text, and comments of a post into high-dimensional features by aggregating them into a duo graph neural network structure. Second, the MGR module refines the adjacency matrices of the duo graph neural networks in the DGFE module by exploring the modality-level representations of the fauxtography posts in the constraint set. Third, the MCA module designs a multi-modal co-attention network to integrate the encoded features and generate the attention scores that can be used for explainability. Finally, the MLD module determines whether a post is fauxtography based on the integrated features from the MCA module. For the detected fauxtography posts, MCA and MLD jointly output the content and comment explainability of the post. We discuss the above components in detail below.

Duo Multi-Modal Graph Convolutional Feature Encoder (DGFE)
In this subsection, we present the DGFE module in DExFC. The DGFE module consists of four deep learning architectures: a word feature encoder (WFE), a text feature encoder (TFE), an image feature encoder (IFE), and a duo graph convolutional network (DGCN). The WFE linearly transforms all words into high-dimensional embeddings. The TFE encodes the text component of the post into semantic features based on the word embeddings. The IFE encodes the image component of the post into visual features of the same dimension. The DGCN connects and refines content and comment features by constructing a novel two-level multi-modal graph structure. We first define the four network architectures below.

WFE is a word transformation network that transforms the text description and comments of a post into high-dimensional embeddings. In particular, we define $L_n = \{T_n, C_n^1, \ldots, C_n^M\}$ as a text list that contains both the text description and user comment components of a post $P_n$, where $L_n^0$ denotes $T_n$ (i.e., the text description component of the post) and $M$ is the total number of comments. We convert all words in each element of $L_n$ to one-hot vectors and build an embedding matrix $W^D$ to transform the one-hot vectors into high-dimensional features as:

$$\tilde{L}_n^{i,j} = W^D L_n^{i,j}$$

where $i$ denotes the $i$th element in the text list, $j$ represents the $j$th word in the $i$th element (as a one-hot vector $L_n^{i,j}$), and $\tilde{L}_n^{i,j} \in \mathbb{R}^d$ is the transformed word embedding of dimension $d$.
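Since multiplying a one-hot vector by the embedding matrix $W^D$ selects a single row, the WFE transform reduces to an index lookup. A minimal NumPy sketch (the names `embed_words` and `W_D` are illustrative, not from the paper):

```python
import numpy as np

def embed_words(word_ids, W_D):
    # Multiplying a one-hot vector by W_D selects one row of W_D,
    # so the linear transform reduces to an index lookup.
    return W_D[word_ids]

vocab_size, d = 10, 4
rng = np.random.default_rng(0)
W_D = rng.standard_normal((vocab_size, d))  # embedding matrix W^D

ids = np.array([2, 7, 7, 1])                # word indices of one element of L_n
emb = embed_words(ids, W_D)                 # shape (4, d)
```

Identical words naturally map to identical embeddings (`emb[1]` equals `emb[2]` above), which is exactly what the shared matrix $W^D$ guarantees.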
We build the TFE as an attention-based bi-directional GRU network [38], [39] to recurrently process the word embeddings in each element of the text list and adaptively merge them into element-level features based on attention scores. In particular, we first construct a bi-directional GRU network to process the word embedding sequences in both directions. The forward GRU $f_{\overrightarrow{gru}}$ reads from the first word embedding to the last one, while the backward GRU $f_{\overleftarrow{gru}}$ reads them in reverse. The bi-directional modeling process for the word embedding sequences of the post $P_n$ can be denoted as:

$$\overrightarrow{h}_{n,x}^i = f_{\overrightarrow{gru}}(\tilde{L}_n^{i,x}), \quad \overleftarrow{h}_{n,x}^i = f_{\overleftarrow{gru}}(\tilde{L}_n^{i,x})$$

where $\overrightarrow{h}_{n,x}^i \in \mathbb{R}^d$ and $\overleftarrow{h}_{n,x}^i \in \mathbb{R}^d$ are the hidden states for the $x$th word in the $i$th element of the text list, and $X$ is the total number of words in an element. For each word, we obtain its final feature by concatenating the forward and backward hidden states, i.e., $h_{n,x}^i = [\overrightarrow{h}_{n,x}^i; \overleftarrow{h}_{n,x}^i] \in \mathbb{R}^{2d}$. Therefore, the aggregated feature of an element in the text list is $h_n^i \in \mathbb{R}^{X \times 2d}$. Given the above aggregated features, we propose a word-level attention module to integrate the word-level features into an element-level feature for each element in the text list. While this integration could be achieved by simply averaging or max-pooling the word features, those operations do not consider the semantic relations between adjacent words. Therefore, we leverage the attention scores from the attention module of the TFE to estimate the importance of each word in terms of its contribution to a higher-level semantic feature [19]. For the $i$th element in the text list of the post $P_n$, this process can be characterized as:

$$u_n^i = \mathrm{softmax}(\tanh(h_n^i W^u + b^u))$$

where $W^u \in \mathbb{R}^{2d \times 1}$ and $b^u \in \mathbb{R}$ are learnable parameters in the attention module and $u_n^i \in \mathbb{R}^{X \times 1}$ are the attention scores for all words in $h_n^i$. The element-level feature is generated by multiplying $u_n^i$ and $h_n^i$ as follows:

$$S_n^i = \sum_{x=1}^{X} u_{n,x}^i \, h_{n,x}^i$$

where $X$ is the total number of word features.
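The word-level attention pooling above can be sketched in NumPy as follows, assuming the standard tanh-then-softmax scoring (the paper's exact scoring function may differ slightly; `attention_pool` is an illustrative name):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h, W_u, b_u):
    # u: (X, 1) attention scores over the X word features
    u = softmax(np.tanh(h @ W_u + b_u), axis=0)
    # weighted sum collapses the (X, 2d) word features
    # into a single (2d,) element-level feature
    return (u * h).sum(axis=0), u

X, two_d = 5, 8
rng = np.random.default_rng(1)
h = rng.standard_normal((X, two_d))    # concatenated bi-GRU states h_n^i
W_u = rng.standard_normal((two_d, 1))  # attention parameters W^u, b^u
b_u = 0.0
feat, scores = attention_pool(h, W_u, b_u)
```

Unlike uniform averaging, the learned scores let a single salient word dominate the element-level feature, which is what makes the scores reusable as explanations later.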
We denote the element-level feature of the text description $T_n$ as $ST_n \in \mathbb{R}^{2d}$ and that of the $i$th comment as $SC_n^i \in \mathbb{R}^{2d}$, where $i \in \{1, \ldots, M\}$.

We build the IFE by constructing a deep convolutional neural network to extract visual features from the images of posts. The visual features provide abstract visual information for our framework to determine whether the image part of a post contains misleading content. We utilize the pretrained ResNet [40] deep neural network as the encoder because it contains multiple residual convolutional blocks that can effectively extract visual features from the image. For the image $I_n$ of the post $P_n$, the encoding process is:

$$EI_n = f_{res}(I_n)$$

where $f_{res}$ is the encoder and $EI_n \in \mathbb{R}^{2d}$ is the generated visual feature.
Definition 9 Duo Graph Neural Network (DGCN). We define DGCN as a pair of graph convolution neural networks to explicitly connect the content and comment components of a post with a novel two-level content-comment graph structure. The output of DGCN will be utilized to generate the content and comment explanation of the fauxtography post, which we will elaborate in the following subsections.
Current fauxtography detection methods often encode the user comments of a post without explicitly considering the connections between comments [3], [10]. However, we observe that the "reply" connections between users' comments can usually reflect the hidden relations among the user comments as well as the connection between the content and comments of a post. For example, if a user's comment on a post reports the post as misleading and that comment is replied to by other comments with support, the post is likely to be fauxtography. Therefore, we model the comments and their interactions as a graph neural network structure to fully aggregate the useful information that helps to identify fauxtography posts. In particular, the comments are modeled as graph nodes and the "reply" relations are modeled as graph edges in the network. However, in many cases, only considering direct "reply" relations between user comments (e.g., ExFaux [18], FauxWard [11]) is insufficient because such direct "reply" relations are often either sparse (e.g., few discussions under the post) or long-chained (e.g., a long debate between two users) in reality [41]. One possible solution is to connect each comment with all other comments in the same thread of the given post as indirect "replies". However, this solution ignores the dynamic correlation between comments in a "reply chain" that are at different depths from the head comment. For example, consider a "reply chain" of comments $C = \{C_i\}, i \in [1, N]$, where comment $C_i$ replies to its previous comment $C_{i-1}$. We define the correlation between the contents of $C_i$ and $C_j$ in the chain as $\mathrm{Corr}(C_i, C_j), i, j \in [1, N]$. We empirically observe that $\mathrm{Corr}(C_i, C_j)$ decreases exponentially as the distance between the two comments (i.e., $|i - j|$) increases [42].
We observe that focusing on the highly correlated comments while ignoring the ones with low correlations in the comment graph network greatly facilitates the DExFC in accurately detecting and explaining fauxtography posts when the dataset is noisy. Therefore, we decide to keep only the connections between pairs of comments whose distance is less than or equal to 2. We propose a two-level graph neural network to fully explore both the direct and indirect interactions between users' comments.
We first formally define the two-level graph structure for the comments of the post $P_n$. In particular, we define the graph as $G_n = (V_n, E_n)$, where $V_n$ is the set of user comments and $E_n$ is the set of direct (i.e., first-level) replies between user comments. For example, an edge $e_{s,s'} \in E_n$ denotes that comment $s'$ replies to comment $s$. We further extend the graph by adding edges that connect comments with indirect (i.e., second-level) relations. For example, if there are three nodes $s_1$, $s_2$, $s_3$ in the comment graph and $s_2$ replies to $s_1$ while $s_3$ replies to $s_2$, we not only create edges $e_{s_1,s_2}$ and $e_{s_2,s_3}$, but also connect $s_1$ and $s_3$ with $e_{s_1,s_3}$. Two examples of the two-level comment relations are shown in Fig. 4. We define the set of second-level indirect replies between user comments as $E_n^*$. Based on the two-level comment graph network, we further develop a novel multi-modal duo graph structure to integrate the text, image, and comment components of a post into a holistic structure. Unlike previous fauxtography detection methods that usually process the content and comments of the posts separately, the duo graph structure embeds the content as additional graph nodes in the comment graph network and connects all comments to the content nodes. Therefore, the duo graph structure is able to fully explore the hidden relations between the content and comments and identify the misleading points in the post by aggregating the information between content and comments. Given the text feature $ST_n$ from the text feature encoder and the image feature $EI_n$ from the image feature encoder above, the process of building the new multi-modal graphs is denoted as:

$$G_n^T = (\{V_n, ST_n\}, \{E_n, E_n^*, e_{i,ST_n}\}), \quad i \in \{1, \ldots, M\}$$
$$G_n^I = (\{V_n, EI_n\}, \{E_n, E_n^*, e_{i,EI_n}\}), \quad i \in \{1, \ldots, M\}$$

where $M$ is the total number of comments and $\{V_n, ST_n\}$ represents the union of the comment features and the text feature.
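The second-level edge expansion (connecting every pair of comments within distance 2 in the reply graph) can be computed from the first-level adjacency with a boolean matrix square. A hypothetical NumPy sketch (function name and representation are illustrative):

```python
import numpy as np

def two_level_adjacency(reply_edges, num_nodes):
    # First level: direct replies E_n, made symmetric (undirected).
    A = np.zeros((num_nodes, num_nodes), dtype=int)
    for s, t in reply_edges:
        A[s, t] = A[t, s] = 1
    # Second level E_n^*: pairs reachable by a walk of length 2.
    A2 = (A @ A > 0).astype(int)
    np.fill_diagonal(A2, 0)          # a length-2 round trip is not an edge
    # Union keeps exactly the pairs of comments within distance <= 2.
    return ((A + A2) > 0).astype(int)

# Reply chain: comment 1 replies to 0, comment 2 replies to 1,
# so nodes 0 and 2 gain a second-level edge (the e_{s1,s3} case).
A = two_level_adjacency([(0, 1), (1, 2)], num_nodes=3)
```

The matrix-square trick implements the distance-2 cutoff directly: any longer-range pair stays disconnected, matching the design choice of discarding low-correlation comment pairs.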
Similarly, $\{V_n, EI_n\}$ represents the union of the comment features and the image feature. $e_{i,ST_n}$ denotes the edges between a comment feature and the text feature, and $e_{i,EI_n}$ denotes the edges between a comment feature and the image feature. $G_n^T$ is the generated graph structure that contains the text and comments, and $G_n^I$ contains the image and comments. The edges in both $G_n^T$ and $G_n^I$ are further represented as adjacency matrices as follows.
$$A_n^T = \mathrm{symmetric}(\{E_n, E_n^*, e_{i,ST_n}\}) + I_n$$
$$A_n^I = \mathrm{symmetric}(\{E_n, E_n^*, e_{i,EI_n}\}) + I_n \quad (8)$$

where $\mathrm{symmetric}(\cdot)$ represents adding edges to convert asymmetric directed graphs into symmetric undirected graphs, and $I_n$ denotes the identity matrix. $A_n^T \in \mathbb{R}^{(M+1) \times (M+1)}$ and $A_n^I \in \mathbb{R}^{(M+1) \times (M+1)}$ are binary adjacency matrices where the value "1" denotes a connection between two features and "0" denotes no connection. For example, the entries of $A_n^T$ between all comment features and the text feature are "1".
Given the constructed graph structures, the multiple graph convolutional layers in the DGCN convolve the corresponding content and comment features with graph layer weights. For the post $P_n$, the process can be denoted as:

$$[ST_n; SC_n]^{(l+1)} = \sigma(\tilde{A}_n^T [ST_n; SC_n]^{(l)} W^l)$$
$$[EI_n; SC_n]^{(l+1)} = \sigma(\tilde{A}_n^I [EI_n; SC_n]^{(l)} W^l)$$

where $l$ is the layer index of the DGCN ($l = 0$ for the original input features), $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric weight matrix with $D$ as the degree matrix ($D_{ii} = \sum_j A_{ij}$), $W^l \in \mathbb{R}^{2d \times 2d}$ denotes the learnable parameters, and $\sigma$ represents the ReLU non-linear function. The comment features $SC_n^{(l+1)}$ are obtained by max-pooling the comment features from the outputs $[ST_n; SC_n]^{(l+1)}$ and $[EI_n; SC_n]^{(l+1)}$ along the feature dimension.
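A single DGCN layer follows the standard symmetrically normalized graph convolution. A minimal NumPy sketch with toy sizes (not the paper's implementation; `gcn_layer` is an illustrative name):

```python
import numpy as np

def gcn_layer(A, H, W):
    # Symmetric normalization: A_tilde = D^{-1/2} A D^{-1/2},
    # where D_ii = sum_j A_ij (A already includes self-loops via + I_n).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Convolve the stacked features with the layer weights, then ReLU.
    return np.maximum(A_norm @ H @ W, 0.0)

M, two_d = 3, 4                          # 3 comment nodes + 1 content node
rng = np.random.default_rng(2)
A = np.ones((M + 1, M + 1))              # toy graph: fully connected, self-loops on
H = rng.standard_normal((M + 1, two_d))  # stacked [content; comment] features
W = rng.standard_normal((two_d, two_d))  # layer weights W^l
H_next = gcn_layer(A, H, W)
```

Because the content node sits in the same adjacency matrix as the comments, each layer mixes comment evidence into the content feature and vice versa, which is the point of the duo graph design.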

Modality-Level Graph Refinement Module (MGR)
The MGR module aims to leverage the K fully annotated fauxtography posts in the constraint set P_G to improve both the detection and explanation performance of DExFC. Given the non-trivial cost of obtaining modality-level labels of fauxtography posts, we keep the number of posts in P_G small. In particular, we divide the set into four equal-size subsets where each subset contains K/4 fauxtography posts with the same modality-level labels (e.g., the posts with true text and false images go to the same subset). However, it is difficult to leverage these subsets in a traditional gradient descent training process to optimize DExFC due to the limited number of posts in the subsets. Therefore, our MGR module treats P_G as an internal constraint on DExFC to further optimize the performance of the DGCN by adjusting its internal graph structure.
While different fauxtography posts contain totally different content in both text and image components, the embedded misleading information can be similar. For example, the texts of different fauxtography posts may include different terms (e.g., gigantic birds, finger elephants), yet they often deliver the same misleading concept (e.g., exaggeration). Similarly, the visual contents of multiple photoshopped images are different but share the same type of pixel-level discrepancy (e.g., the inconsistency between an altered human face and its surrounding pixels [43]). Therefore, the goal of the MGR module is to retrieve the meta-representations of the fauxtography posts in P_G by performing a metric-based learning strategy. Metric learning is a machine learning task that learns a distance function to generate high-quality representations of input data samples [44]. In P_G, for each modality of the posts in the constraint set (i.e., text or image), the modality-level contents annotated as true or false are expected to share the same true or false meta-representations. Moreover, we observe that the meta-representations of true and false samples are often easier to distinguish from each other than the original contents of the posts because they transfer low-level, modality-specific descriptions into high-level representation concepts. With the above intuition, we design a novel metric-based loss function to generate the meta-representations and utilize them as additional guidance to optimize the structure of the graph neural network in the DGCN.
In particular, the content of the posts in P_G is first encoded into high-dimensional features by the WFE, TFE and IFE described in Section 4.1. The encoded features are denoted as F_GT ∈ R^{K×2d} and F_GI ∈ R^{K×2d} for the text and image parts, respectively. Moreover, we define F_{GT,T} ∈ R^{K/2×2d} and F_{GT,F} ∈ R^{K/2×2d} as the text-related features of the posts with true and false labels in P_G, respectively. Similarly, we define F_{GI,T} ∈ R^{K/2×2d} and F_{GI,F} ∈ R^{K/2×2d} as the image-related features of the posts with true and false labels in P_G, respectively. With the above definitions, we derive all meta-representations by averaging the encoded features:

F̄_{GT,T} = (2/K) Σ_i F_{GT,T,i},  F̄_{GT,F} = (2/K) Σ_i F_{GT,F,i}
F̄_{GI,T} = (2/K) Σ_i F_{GI,T,i},  F̄_{GI,F} = (2/K) Σ_i F_{GI,F,i}

where F̄_{GT,T} ∈ R^{2d} and F̄_{GT,F} ∈ R^{2d} are the averaged text features with true and false labels, respectively. Similarly, F̄_{GI,T} and F̄_{GI,F} are the averaged image features with true and false labels, respectively. The four generated representations (i.e., F̄_{GT,T}, F̄_{GT,F}, F̄_{GI,T}, F̄_{GI,F}) are the meta-representations of the text/image components with true and false labels. We then design our metric-based loss function L_G to optimize all the concept representations, which can be denoted as:

L_GT = Σ_{i=0}^{K/2} (||F̄_{GT,F} − F_{GT,F,i}|| + ||F̄_{GT,T} − F_{GT,T,i}||) / ||F̄_{GT,F} − F̄_{GT,T}||
L_GI = Σ_{i=0}^{K/2} (||F̄_{GI,F} − F_{GI,F,i}|| + ||F̄_{GI,T} − F_{GI,T,i}||) / ||F̄_{GI,F} − F̄_{GI,T}||
L_G = L_GT + L_GI

where the sub-functions L_GT and L_GI denote the metric-based losses for the text and image modalities of the posts in P_G, respectively. Given an input multi-modal post, the graph neural network in the DGCN connects the content and comment features of the post via the graph structure. The generated meta-representations adjust the relations of the graph nodes (i.e., content and comment nodes) by updating the corresponding adjacency matrix. In particular, we first average the true and false meta-representations for both text and image to obtain the overall modality-level meta-representations, denoted as F̄_GT ∈ R^{2d} and F̄_GI ∈ R^{2d}, respectively.
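The per-modality metric loss can be sketched as below: it pulls each annotated feature toward its class mean (the meta-representation) and pushes the true/false means apart. The epsilon term and the exact reduction are our assumptions where the extracted formula is ambiguous.

```python
import numpy as np

def metric_loss(F_true, F_false, eps=1e-8):
    """Sketch of L_GT (or L_GI): intra-class distances to the class means,
    divided by the inter-class distance between the two means.
    F_true, F_false have shape (K/2, 2d)."""
    mu_t = F_true.mean(axis=0)    # meta-representation for "true"
    mu_f = F_false.mean(axis=0)   # meta-representation for "false"
    intra = (np.linalg.norm(F_false - mu_f, axis=1).sum()
             + np.linalg.norm(F_true - mu_t, axis=1).sum())
    inter = np.linalg.norm(mu_f - mu_t)
    return intra / (inter + eps), mu_t, mu_f
```

Minimizing this quantity makes the annotated samples cluster tightly around their meta-representations while the two meta-representations separate.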
For a given input post P_n with C comments, we first group the text and image representations with the comment representations separately to construct the joint features J^T_n = [ST_n; SC_n] ∈ R^{(C+1)×2d} and J^I_n = [EI_n; SC_n] ∈ R^{(C+1)×2d}. Then we correlate the joint features of P_n with the meta-representations:

A^{G,T}_{n,i} = J^T_{n,i} · F̄_GT,  A^{G,I}_{n,i} = J^I_{n,i} · F̄_GI    (11)

where A^{G,T}_{n,i} represents the i-th correlation factor between the i-th element of the text-related joint features (i.e., J^T_{n,i}) and the corresponding text meta-representation (i.e., F̄_GT). Similarly, A^{G,I}_{n,i} represents the i-th correlation factor between the i-th element of the image-related joint features (i.e., J^I_{n,i}) and the corresponding image meta-representation (i.e., F̄_GI). We perform a matrix multiplication on A^{G,T}_n and A^{G,I}_n to construct a global adjacency matrix A^G_n ∈ R^{(C+1)×(C+1)} that captures a new relation between content and comments derived from the meta-representations. Finally, the adjacency matrices A^T_n and A^I_n in the DGCN are replaced with A^T_n + A^G_n and A^I_n + A^G_n to perform the graph convolution.
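The refinement step can be sketched as below. Interpreting the "matrix multiplication" of the two correlation vectors as their outer product is our reading of Eq. (11) and the surrounding text (it is the simplest product with the stated (C+1)×(C+1) shape), so treat this as one dimension-consistent sketch rather than the authors' exact implementation.

```python
import numpy as np

def global_adjacency(J_T, J_I, mu_T, mu_I):
    """Sketch of the meta-representation-driven refinement: correlate each
    joint feature with the overall modality meta-representation and combine
    the two correlation vectors into A^G_n in R^{(C+1)x(C+1)}."""
    a_T = J_T @ mu_T          # i-th correlation factor A^{G,T}_{n,i}
    a_I = J_I @ mu_I          # i-th correlation factor A^{G,I}_{n,i}
    return np.outer(a_T, a_I) # combined global adjacency A^G_n
```

The returned matrix is then added to A^T_n and A^I_n before the DGCN graph convolution.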

Multi-Modal Co-Attention Module (MCA)
In this subsection, we present the MCA module that integrates the encoded features from the DGFE and generates the attention scores used for the explainability tasks. We observe that the text and image components of a post may weight differently in a user's judgement of a fauxtography post. For example, the post in Fig. 1a contains both false text and a false image; however, people are more likely to determine the post to be fauxtography based on the content of the image rather than the text. Similarly, we also observe that not all comments are equally important in determining and explaining a fauxtography post. For example, the first comment of the post in Fig. 1b is more convincing than the others in explaining that the text description of the post is misleading. To accommodate the above observations, we develop the MCA module to estimate the relative importance of each content modality and each comment by generating corresponding attention scores. Using the features from the DGFE, we first concatenate ST_n and EI_n (i.e., the text and image features from the DGCN) to create a content feature list F^con_n = [ST_n; EI_n] ∈ R^{2×2d}. Then we compute the affinity matrix M^C_n ∈ R^{2×M} for the post P_n to obtain the joint representations for content and comments as follows:

M^C_n = tanh(F^con_n W_M (SC_n)^T)

where T represents the transpose of a matrix, tanh is the activation function for the non-linear feature transformation, and W_M ∈ R^{2d×2d} represents the learnable parameters. Then we add the joint representations back to the content and comment features and generate the attention scores as follows:

a^CN_n = Softmax((F^con_n + M^C_n SC_n) W_CN)
a^CM_n = Softmax((SC_n + (M^C_n)^T F^con_n) W_CM)

where W_CN ∈ R^{2d×1} and W_CM ∈ R^{2d×1} are learnable transformation parameters, and a^CN_n ∈ R^{2×1} and a^CM_n ∈ R^{M×1} are the generated attention scores, which are further normalized to sum to 1 by the Softmax operation. If the post P_n is determined to be fauxtography, the two scores in a^CN_n estimate which component (i.e., text or image) is more likely to be false in the content of post P_n.
Furthermore, each score in a^CM_n indicates how well the corresponding comment can explain why a specific component of the fauxtography post is false.
Our next goal is to generate an integrated content-level feature and an integrated comment-level feature based on the attention scores to classify the fauxtography posts. The generation process is denoted as:

F̃^CN_n = (a^CN_n)_0 · ST_n + (a^CN_n)_1 · EI_n
F̃^CM_n = Σ_{i=1}^{M} (a^CM_n)_i · SC_{n,i}

where (a^CN_n)_0 and (a^CN_n)_1 represent the first and second elements of a^CN_n. F̃^CN_n ∈ R^{2d} and F̃^CM_n ∈ R^{2d} are the integrated content-level and comment-level features of P_n, respectively.
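The full MCA pass can be sketched as follows. The affinity and score formulas were partially lost in extraction, so this is one dimension-consistent reading (M^C_n ∈ R^{2×M}, a^CN_n ∈ R^{2×1}, a^CM_n ∈ R^{M×1}), not the authors' verbatim code.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(F_con, SC, W_M, W_CN, W_CM):
    """Sketch of the MCA module.
    F_con: (2, 2d) stacked [ST_n; EI_n]; SC: (M, 2d) comment features."""
    M_C = np.tanh(F_con @ W_M @ SC.T)             # affinity matrix, (2, M)
    a_CN = softmax((F_con + M_C @ SC) @ W_CN)     # content scores, (2, 1)
    a_CM = softmax((SC + M_C.T @ F_con) @ W_CM)   # comment scores, (M, 1)
    # Integrated features weighted by the attention scores.
    F_CN = a_CN[0, 0] * F_con[0] + a_CN[1, 0] * F_con[1]
    F_CM = (a_CM * SC).sum(axis=0)
    return a_CN, a_CM, F_CN, F_CM
```

Both score vectors sum to 1 after the Softmax, matching the normalization constraint the next subsection discusses.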

Modality-Level Discriminator (MLD)
In this subsection, we present the MLD module of DExFC. The module consists of two network architectures: i) a duo modality-level false content discriminator (DMD) for the content explainability, and ii) a final fauxtography discriminator (FFD) based on the DMD and the generated text, image and comment features from the MCA module. First, the DMD discriminates the text and image components of a post to provide content explainability. Second, the FFD concatenates all features and the modality-level predictions to determine whether the post is fauxtography. We describe the two network architectures below.
While the scores in a^CN_n from the MCA module indicate the relative importance of the text and image components of the post P_n, they are restricted by the Softmax operation that normalizes their sum to 1, which ignores the possibility that both the text and image components are true (i.e., both with low scores). To address this issue, we define the Duo Modality-Level False Content Discriminator (DMD) as a pair of binary neural network classifiers that discriminate the text and image features of a post. The results of the DMD provide the content explainability that identifies which component(s) (text, image, or both) contain the misleading information. We first concatenate the integrated content-level feature with the original text and image features. The updated features are ŜT_n = [ST_n; F̃^CN_n] ∈ R^{4d} and ÊI_n = [EI_n; F̃^CN_n] ∈ R^{4d} for the text and image components, respectively. After that, we derive the modality-level predictions from the refined features of the post P_n as:

ŷ^T_n = Softmax(ŜT_n W_T),  ŷ^I_n = Softmax(ÊI_n W_I)

where W_T, W_I ∈ R^{4d×2} are learnable parameters and ŷ^T_n ∈ R^2 and ŷ^I_n ∈ R^2 are the predicted results for the text and image components of the post, respectively.
We define the Final Fauxtography Discriminator (FFD) as a binary neural network classifier that takes the updated text and image features from the DMD and the integrated comment-level feature from the MCA module and makes the final decision on whether a post is fauxtography. For a post P_n, we concatenate the modality-level and comment-level features as:

F^final_n = [[ŜT_n; ÊI_n] W_p; F̃^CM_n]

where W_p ∈ R^{8d×2d} are the learnable parameters and F^final_n ∈ R^{4d} is the overall feature vector. We apply a transformation to F^final_n to obtain the final prediction:

ŷ^f_n = Softmax(F^final_n W_f)

where W_f ∈ R^{4d×2} are the learnable parameters. We fuse ŷ^f_n with ŷ^T_n and ŷ^I_n to compute the final loss function:

ŷ_n = α · ŷ^f_n + ŷ^T_n + ŷ^I_n

where ŷ_n denotes the final prediction for the post and α is an adjustable factor that balances the optimization weights between the overall and modality-level predictions. The goal of our loss function is to minimize the cross-entropy loss. The above optimization process is denoted as:

L_W = −y_n log((ŷ_n)_1) − (1 − y_n) log(1 − (ŷ_n)_0)

where L_W denotes the loss optimized with the final fauxtography labels and L represents the overall final loss. We summarize the DExFC scheme in Algorithm 1.
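The fusion and loss can be sketched as below. Note that ŷ_n is not renormalized after fusion, so the loss is only well-defined when (ŷ_n)_0 < 1; how to handle the general case (e.g., renormalizing ŷ_n) is a design choice the text leaves open.

```python
import numpy as np

def fuse_predictions(y_f, y_T, y_I, y_true, alpha=1.0):
    """Fuse the overall and modality-level two-way predictions,
    y_hat = alpha * y_f + y_T + y_I, and evaluate the cross-entropy
    L_W = -y log((y_hat)_1) - (1 - y) log(1 - (y_hat)_0)."""
    y_hat = alpha * y_f + y_T + y_I
    loss = (-y_true * np.log(y_hat[1])
            - (1 - y_true) * np.log(1 - y_hat[0]))
    return y_hat, loss
```

With, say, y_f = [0.1, 0.9], y_T = [0.2, 0.8], y_I = [0.3, 0.7] and α = 1, the fused prediction is [0.6, 2.4], whose larger second entry marks the post as fauxtography.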

Algorithm 1. DExFC Scheme Workflow
Require: Detect and explain the input fauxtography post
Input: fauxtography post P_n = {T_n, I_n, C_n}, constraint set P_G = {P_G^1, ..., P_G^K}, text-comment graph structure G^T_n, image-comment graph structure G^I_n
Output: fauxtography prediction ŷ_n, content explanation {Ẑ^T_n, Ẑ^I_n} and comment explanation Ĉ^k_n after M iterations
1: while i ≤ M do
2:   ST_n, SC_n, EI_n = TFE(T_n), TFE(C_n), IFE(I_n)
3:   if P_G is not empty then
4:     L_G, A^G_n = MGR(P_G)
5:   else
6:     L_G, A^G_n = 0, 0
7:   end if
8:   ST_n, SC_n, EI_n = DGCN(ST_n, SC_n, EI_n, G^T_n, G^I_n, A^G_n)
9:   F̃^CN_n, F̃^CM_n, a^CM_n = MCA(ST_n, SC_n, EI_n)
10:  ŷ^T_n, ŷ^I_n = DMD(ST_n, F̃^CN_n), DMD(EI_n, F̃^CN_n)
11:  ŷ^f_n = FFD(ST_n, EI_n, F̃^CN_n, F̃^CM_n)
12:  L_W = Cross_Entropy(ŷ^T_n, ŷ^I_n, ŷ^f_n, y_n)
13:  Adam_Optimization(L_G, L_W)
14: end while
15: ŷ_n = α · ŷ^f_n + ŷ^T_n + ŷ^I_n
16: Ẑ^T_n, Ẑ^I_n = ŷ^T_n, ŷ^I_n
17: Ĉ^k_n = C_n[Top-K(a^CM_n)]

EVALUATION
In this section, we conduct extensive experiments on a real-world dataset to study the performance of DExFC and compare it with state-of-the-art solutions. In particular, our evaluation study answers the following questions: Q1: Can DExFC achieve better fauxtography detection performance than the state-of-the-art baselines? Q2: Can DExFC accurately identify the false component(s) of a detected fauxtography post (i.e., content explainability)? Q3: Can DExFC retrieve relevant user comments that explain why a post is fauxtography (i.e., comment explainability)? Q4: How much does each key component of DExFC contribute to its overall performance?

Dataset and Experiment Setup
Data. We create a real-world dataset by collecting social media posts from Twitter and Reddit, both widely used online social platforms that contain a substantial amount of fauxtography posts [10]. In particular, we first collect a set of social media posts from three independent fact-checking websites (i.e., snopes.com, factcheck.org, truthorfiction.com). We then assign three different annotators to manually verify the labels from the fact-checking websites. We also utilize the Google Vision API^1 to perform reverse search on the image of each post and obtain the corresponding URLs of the images. If a URL points to a post on Twitter or Reddit, we crawl the text description, image and comments of the post using a crawler script we developed. For each post in the collected dataset, we ask the annotators to further check whether the text description and the image components are false and record their decisions (1 for false and 0 for true). We use majority voting on the annotations to decide the final labels of all components of the post [45]. Compared to the dataset used in our conference paper [18], our current dataset contains 21% more recent fauxtography posts as well as 20% more comments. The social media posts in the new dataset are crawled from different social media platforms (e.g., Twitter, Reddit) and cover all types of fauxtography posts in Fig. 1 to ensure the trained model is capable of identifying the various kinds of misleading information embedded in fauxtography posts across different social media platforms. Moreover, the newly added social media posts were all posted in 2020, which demonstrates the capability of our model to identify recent fauxtography posts. The summary of our dataset is shown in Table 1.^2 Experiment Setup. In our experiments, the dataset is split into a train-val set and a test set. The train-val set contains 80% of the data samples and the test set contains the other 20%.
We perform 10-fold cross validation on the train-val set to tune the network parameters of all schemes and evaluate them on the test set. For the implementation details of DExFC, the GCN network in the DGCN has 2 layers, each followed by ReLU activation. We empirically set the size K of the constraint set P_G to 8 and tune K in our evaluation. We resize the input images to 256×256 and randomly crop them to 224×224 to prevent overfitting during training. For testing, we directly resize images to 224×224. We set the total number of epochs to 40 and train the model with an initial learning rate of 0.001 and a decay of 0.95 in each epoch. The optimizer is Adam with a weight decay of 5×10^-4. We run our experiments on Ubuntu 16.04 with two NVIDIA 1080Ti GPUs.

1. https://cloud.google.com/vision
2. We will make all our code and datasets publicly available upon the acceptance of the paper.

Baselines and Metrics
We compare the performance of DExFC with the following state-of-the-art baselines.
FxBuster [10]: FxBuster is a fauxtography detection tool that detects fauxtography posts by exploring the comments from readers of the posts.
FCMF [7]: FCMF is a fauxtography detector that identifies fauxtography posts by exploring the image URLs and hand-crafted text features of the posts.
ExFaux [18]: ExFaux is an explainable fauxtography detection method developed in our previous conference paper. Compared to DExFC, which uses an adjustable constraint set to provide both content and comment explainability, ExFaux can only work with an empty constraint set (i.e., weakly supervised) and provides only content explainability.
AIFN [20]: AIFN develops a gated neural network for fake news detection by fusing text and comments based on a multi-head attention mechanism.
EANN [46]: EANN is a recent fake news detection scheme that handles multi-modal content with convolution filters and applies an adversarial loss function to make the model event-invariant.
DEAN [22]: DEAN leverages both text content and comments to detect fake news by employing two independent recurrent neural networks and a fully connected layer for the detection task.
HAN [24]: HAN constructs a hierarchical attention neural network from word level to sentence level to identify fake news. It can not only detect fake news but also explain why a news post is fake by pointing out relevant sentences in the news.
MVAE [21]: MVAE develops a variational autoencoder neural network for fake news detection by learning a shared representation of the multi-modal content of posts.
HPA [19]: HPA segments user engagements (e.g., user comments) in social media into different levels and constructs an attention neural network to detect rumors. The generated attention scores help explain why a post is a rumor by indicating rumor-related comments.
dEFEND [3]: dEFEND detects fake news by applying a co-attention strategy to retrieve relevant sentences from both text content and comments that offer potential reasons for the detection.
We adapt the above baselines to solve our problem in a way that ensures all schemes take the same inputs for a fair comparison. For the methods that utilize only text content, such as HAN, DEAN, dEFEND and AIFN, we let them treat the image in our dataset as an additional feature alongside the text in their models. For EANN, we remove the adversarial loss function because it requires additional annotations that our dataset does not contain. We strictly follow the parameters and configurations of all schemes as documented in their papers.
To evaluate DExFC and the above state-of-the-art schemes, we conduct several experimental tasks with different evaluation metrics. We first evaluate the fauxtography detection performance of all compared schemes using four classic evaluation metrics: F1-Score, Accuracy, Precision and Recall. Then, we evaluate the explainability performance of the compared schemes in terms of explaining which component of a detected fauxtography post is false, using the Accuracy metric. Additionally, we evaluate the compared schemes on explaining why the detected post is fauxtography, using the list-wise comparison [3] and the minimum read index. Finally, we carry out ablation studies to investigate the contribution of the different modules in DExFC, using F1-Score and Accuracy as evaluation metrics. We elaborate on the above experiments in detail below.

Evaluation Results
Fauxtography Detection (Q1). In the first set of experiments, we focus on the overall performance of all schemes in terms of fauxtography detection. We use the classic evaluation metrics for binary classification: F1-Score, Accuracy, Precision and Recall. The results are reported in Table 2. We observe that DExFC significantly outperforms all baselines. For example, DExFC achieves an 11.1% higher F1-Score than dEFEND, one of the state-of-the-art explainable fake news detection approaches. The reason is that our duo graph convolutional network design in the DGCN effectively refines the representations of both the content and comments of the input post by connecting all of them with a novel two-level multi-modal graph structure. More importantly, we also observe that DExFC is superior to ExFaux, the model from our previous conference paper. This is because the DExFC scheme develops a metric-based optimization strategy in the MGR module to generate meta-representations from the posts in the constraint set, which improves the effectiveness of fauxtography identification. Moreover, unlike the traditional attention mechanism used in ExFaux that only considers the content component of the input post, we develop a multi-modal co-attention module in DExFC to fully explore the internal relations between the content and comments of a post, which further improves its fauxtography detection performance. Content Explainability (Q2). In the second experiment, we study the performance of DExFC in terms of identifying the false component(s) in the detected fauxtography posts (i.e., content explainability). In this experiment, we select HAN, dEFEND and ExFaux for comparison because they are the only baselines that can generate attention scores for the content of posts, which can be used for explanations. The evaluation results are presented in Table 3.
In Table 3, the Overall metric evaluates whether a scheme can correctly determine the truthfulness of at least one of the text and image components of a fauxtography post. The Text Only metric evaluates whether a scheme can correctly determine the truthfulness of the text component, and the Image Only metric evaluates the same for the image component. We observe that DExFC outperforms all baselines in identifying the false component(s) of a fauxtography post. For example, DExFC achieves 13.4%, 16.7%, and 20.0% higher accuracy than ExFaux, HAN, and dEFEND on the Overall metric, respectively. The above results validate our hypothesis that the MCA module of our design provides better representations of the content features, which significantly improves the content explainability accuracy of the DExFC framework. Visualizations of the explainable results of DExFC are shown in Fig. 5.
Comment Explainability (Q3). In the third experiment, we study how effectively DExFC can retrieve relevant user comments to explain why the identified component(s) of a post are false (i.e., comment explainability). In particular, we carry out a real-world user study using Amazon Mechanical Turk (AMT), one of the largest crowdsourcing platforms.^3 In our experiment, we recruited AMT workers with an approval rate > 0.95 and set the payment to workers well above the minimum requirement of AMT.^4 We select a test set that contains fauxtography posts of different fauxtography types for our experiment. For each testing post, we compare the results of DExFC with two baselines (i.e., dEFEND and HPA) because they are the only baselines capable of providing comment explanations for their results. We collect the attention scores from all compared schemes and then rank the user comments of each post by those scores in descending order. For example, the comments in the posts of Fig. 5 are ranked by the attention scores from DExFC. The higher a comment ranks, the more likely it is to explain why the post is fauxtography.
We design several AMT tasks to evaluate the comment explainability of all compared schemes. In particular, we first perform a list-wise comparison [3] to evaluate the explainability quality of the comment lists ranked by the different schemes. For each testing post, each compared scheme generates three types of comment lists that contain the Top-1, Top-3 and Top-5 comments from all the comments sorted by the corresponding attention scores. Given each type of comment list generated by the compared schemes, we ask four AMT crowd workers to pick the comment list that they believe best explains why the post is identified as fauxtography. If more than one list receives the highest number of votes, we ask additional workers to vote until only one comment list has the highest count. We adopt two evaluation metrics from dEFEND [3] to study the performance of DExFC and the baselines, as described below.

3. https://www.mturk.com/
4. https://www.mturk.com/pricing
Worker-Level Evaluation: the worker-level evaluation measures the percentage of workers who select the comment list generated by each scheme as their preferred one. For example, given 10 fauxtography posts, if 25 of the 10×4 crowd workers choose DExFC as the best explainable scheme, the worker-level percentage for DExFC is 25/(10×4) = 62.5%. Post-Level Evaluation: the post-level evaluation measures the percentage of posts whose preferred comment list, as picked by the majority of workers, belongs to each scheme. For example, given an input fauxtography post, if three or more crowd workers believe that the post is best explained by the comment list from the dEFEND scheme, we assign dEFEND as the best explainable scheme for that post. The results are shown in Fig. 6. We observe that DExFC outperforms the compared baselines on both evaluation metrics. We attribute the significant performance gains of the DExFC framework to its duo multi-modal graph convolutional networks and its co-attention module design, which explicitly explores both the direct and indirect relations hidden in the users' comments.
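The two metrics above reduce to simple vote counting; they can be sketched as follows (the data layout, a mapping from post to per-worker choices, is our own illustrative choice).

```python
from collections import Counter

def worker_level(votes):
    """Worker-level metric: fraction of all individual votes cast for each
    scheme. `votes` maps post id -> list of per-worker chosen schemes."""
    counts = Counter(v for post_votes in votes.values() for v in post_votes)
    total = sum(counts.values())
    return {scheme: n / total for scheme, n in counts.items()}

def post_level(votes):
    """Post-level metric: fraction of posts whose majority vote goes to
    each scheme (ties assumed broken by extra voting, as in the study)."""
    winners = Counter(Counter(post_votes).most_common(1)[0][0]
                      for post_votes in votes.values())
    n_posts = len(votes)
    return {scheme: n / n_posts for scheme, n in winners.items()}
```

For instance, with 10 posts and 4 workers each, 25 total votes for one scheme gives it a worker-level score of 25/40 = 62.5%, matching the worked example in the text.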
To further investigate the efficiency of the comment explainability of DExFC, we conduct an additional experiment to study how many top comments in the ranked list a user has to read before deciding that the post is fauxtography. In particular, we ask each AMT worker to read the user comment list of a post ranked by a scheme from top to bottom and to stop reading once the worker feels comfortable making a decision on whether the post is fauxtography. We then ask the worker to record the number of comments she/he read to make the decision. We define this recorded number as the minimum read index (MRI) and focus on two MRI-based evaluation metrics. The results are reported in Table 4. We observe that DExFC outperforms the compared baselines on both evaluation metrics. In particular, the users of DExFC only have to read fewer than 3 comments on average to detect a fauxtography post, and more than 60% of the posts are detected by DExFC with the smallest AvgMRI.
Ablation Study (Q4). Finally, we perform a comprehensive ablation study to examine the contribution of each important component of DExFC. In particular, we first investigate the effect of the size of the constraint set on the performance of DExFC. The results are shown in Fig. 7. We observe that a non-empty constraint set clearly helps to improve the performance of the DExFC framework, and the performance gain stabilizes when the size of the constraint set reaches 8, which indicates a very affordable labeling cost for our solution (i.e., only 8 posts with modality-level labels are needed for DExFC to reach its optimal performance).
We then create different variants of the DExFC framework by removing its key components: 1) DEx-base: we remove the duo graph neural networks and the modality-level discriminators; 2) DEx-graph-1v: we remove the modality-level discriminators and also change the duo two-level graphs to one-level graphs (i.e., only the direct replies in the comments are used); and 3) DEx-graph-2v: we only remove the modality-level discriminators. The fauxtography detection results are shown in Table 5. We observe that, by adding the duo one-level graph neural networks, DExFC increases its F1-Score and Accuracy by 6.0% and 6.6%, respectively. This result illustrates the importance of connecting content and comments with the duo graph structures. The two-level design that involves the indirect comment connections further contributes 3.8% and 4.9% in F1-Score and Accuracy, respectively. Furthermore, we also find that the modality-level discriminators are helpful, yielding additional improvements of 4.8% in F1-Score and 4.9% in Accuracy.

CONCLUSION
This paper presents DExFC to address fundamental challenges in the explainable fauxtography detection problem. DExFC designs a novel graph-based co-attention network under constrained supervision to explain detected fauxtography posts by identifying the exact false component(s) of a post and retrieving relevant user comments as explanations for the identification results. Evaluation results on a real-world fauxtography dataset show that DExFC significantly outperforms the state-of-the-art baselines in terms of both fauxtography detection accuracy and explainability.