Robust Reasoning Over Heterogeneous Textual Information for Fact Verification

Automatic fact verification (FV) based on artificial intelligence is considered a promising approach for identifying misinformation distributed on the web. Although previous deep-learning-based FV systems have achieved strong results on a single dataset (e.g., FEVER), the trained systems are unlikely to be capable of extracting evidence from heterogeneous web sources and validating claims against evidence found on the Internet. Nevertheless, heterogeneous sources carry abundant semantic information, which can help FV systems identify misinformation more accurately. The current work is the first attempt to combine a knowledge graph (KG) with a graph neural network (GNN) to enhance the robustness of FV systems on heterogeneous information. As a result, the system can generalize to multi-domain datasets after training on a single, sufficiently large one. To update and aggregate information effectively on the collaborative graph, the present study proposes a double graph attention network (DGAT) framework, which recursively propagates the embeddings of a node's neighbors to refine the node's embedding and applies an attention mechanism to weigh the importance of the neighbors. We train and evaluate our system on FEVER, a single-source benchmark dataset for FV, and then re-evaluate it on the UKP Snopes Corpus, a new richly annotated corpus for FV tasks based on heterogeneous web sources. According to the experimental results, although DGAT shows no clear advantage on the single-source dataset, it performs outstandingly on the more realistic, multi-domain dataset. Moreover, the current study provides a feasible method for enabling deep learning models to reason robustly over heterogeneous information.


I. INTRODUCTION
With the explosive growth of the Internet, fake news has posed serious threats to the public's factual judgment and to the credibility of governments. Makers of rumors or misinformation usually use creative language to embellish their content and attract netizens, which hides important features, so traditional FV methods fail to recognize the misinformation [1].
Recent achievements in scattered information fusion and multi-hop reasoning methods (such as natural language inference) address this problem and improve the performance of fact verification (FV) by integrating information from the entire network. That is, natural language inference (NLI) models can use scattered web information comprehensively, which gives them access to sufficient web sources. For example, as shown in Figure 1, an FV system first retrieves related evidence sentences from one dataset, then conducts joint reasoning over this evidence, and finally aggregates the information to verify the claim.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
NLI models either concatenate all relevant information (often referred to as evidence) into a single string, as is common in FV systems for the FEVER challenge [2], [3], or adopt fusion techniques to aggregate the features of isolated pieces of relevant information [4], [5]. However, these traditional NLI methods fail to capture the correlations among pieces of evidence, so the FV systems cannot make full use of valuable evidence.
To make full use of the structural information of the provided context, some recent work has integrated graph neural networks into FV systems (see Section II-A for details) and achieved significant improvements. However, these methods are both trained and tested on a single dataset (e.g., one derived from Wikipedia) with the same structural information. Such a dataset is based on synthetic information derived from Wikipedia rather than natural information originating from heterogeneous web sources. Heterogeneous web sources may contain more valuable information that can help verify claims better. Since the single-source datasets cover only Wikipedia, the trained systems are unlikely to be capable of extracting evidence from heterogeneous web sources and validating claims on the basis of evidence found on the Internet. Although effective on single datasets, current methods struggle to model the explicit relationships between heterogeneous pieces of information. The main challenge is how to effectively integrate large-scale textual information with different structures from various sources in order to recognize embellished misinformation with hidden characteristics. Numerous heterogeneous information fusion methods have been proposed recently [6]-[12], but they require the structures of the heterogeneous information to be specific and known before training and testing, which is impracticable for unpredictable web sources. Fortunately, the emerging success of knowledge-graph-based inference may address this problem. Inspired by [13], [14] and building on the state-of-the-art model [15], the present study proposes a novel double-graph-based reasoning framework over knowledge graphs for FV. The main building block is a collaborative knowledge graph (CKG), constructed by combining the information of the evidence and world knowledge into a unified graph.
This new CKG helps the NLI model perform self-taught learning on heterogeneous datasets. Furthermore, we adopt knowledge-aware attention [14] to propagate and aggregate information over the collaborative graph structure. In this way, the reasoning process simultaneously employs the semantics and the relevant world knowledge of the entities in this graph. We conduct experiments on both the most influential benchmark dataset, FEVER [2], and the UKP Snopes Corpus [16], a richly annotated corpus for fact verification based on heterogeneous web sources. The ablation study demonstrates that integrating the double-graph-driven representation learning mechanisms enhances the performance of reasoning over heterogeneous information. Our contributions are briefly summarized as follows:
• We are the first to combine the intrinsic information of the evidence and the world knowledge related to the evidence into a unified relation graph, and we propose a double-graph-based reasoning approach to enhance the robustness of the FV system when reasoning over heterogeneous information.
• Results demonstrate that although our double-graph-based mechanism shows no clear advantage on the single-source dataset (FEVER), it achieves state-of-the-art performance on the UKP Snopes Corpus. Our study illustrates that methods that are state of the art on one dataset with uniform structural information may not perform well in realistic scenarios with heterogeneous data. In addition, we provide a feasible method for deep learning models to reason over such heterogeneous information robustly.

A. ADVANCED AUTOMATIC FACT VERIFICATION
Reference [17] uses the decomposable attention model (DAM) [18] to produce an NLI prediction for each claim-evidence pair individually (the claim being the sentence to be verified) and aggregates all these predictions for the final check. Then, [19]-[21] employ a more effective NLI model, the enhanced sequential inference model [22], to predict the relevance between each individual piece of evidence and the claim. As pre-trained language models can fully grasp the semantic information of the context, [23] adopts the generative pre-training transformer (GPT) [24] to enhance the understanding of the logical relationships between claim and evidence. Reference [25] improves pre-training by extending Transformers so that they can natively represent structured textual information. Using neural matching kernels, [26] proposes KGAT, a more fine-grained evidence selection and reasoning method for FV. To capture rich semantic-level structures among multiple pieces of evidence, [15] presents a graph-based approach for fact checking based on semantic role labeling (AllenNLP [27]) and achieves state-of-the-art performance.

B. FEVER: FACT EXTRACTION AND VERIFICATION
The FEVER shared task [2] aims to evaluate an FV system's ability to verify claims based on evidence from Wikipedia. The dataset [3] for this task consists of 185,445 claims. Given a claim, an FV system first extracts relevant evidence (the sentences supporting or refuting the claim). Then, NLI methods process the claim and the evidence, and finally label the claim as Supported, Refuted or NotEnoughInfo. The verification process is formally expressed as follows. Suppose $D = \{D_1, D_2, \cdots\}$ denotes a set of documents, and each document $D_i$ is an array of sentences, namely $D_i = \{s^i_0, s^i_1, \cdots\}$. The inputs of an FV system are a textual claim $c_i$ and the relevant documents $\cup D_i$. The output should be a tuple $(\hat{E}_i, \hat{y}_i)$, where $\hat{E}_i = \{s^i_{e_0}, s^i_{e_1}, \cdots\} \subset \cup D_i$ denotes the set of evidence sentences specific to the claim and $\hat{y}_i \in \{\text{SUPPORTS}, \text{REFUTES}, \text{NOTENOUGHINFO}\}$ is the prediction for the given claim. Suppose the standard evidence and ground-truth label of the claim are $E_i$ and $y_i$. For a successful verification of a claim $c_i$, the tuple $(\hat{E}_i, \hat{y}_i)$ predicted by the FV system should satisfy the following conditions: $E_i \subseteq \hat{E}_i$ and $y_i = \hat{y}_i$.
VOLUME 8, 2020

C. NATURAL LANGUAGE INFERENCE
Natural language inference (NLI) is the task of judging the semantic relationship between a premise and a hypothesis [28]. Recently, the availability of much larger annotated corpora, e.g., Multi-NLI [29] and SNLI [30], has made it practicable to train more complex deep learning models, which have achieved state-of-the-art performance [31]-[33]. NLI, regarded as a typical semantic matching problem, appears in almost every NLP task. NLI aims to analyze scattered sentences comprehensively and to infer the desired information from them. Likewise, FV requires the comprehensive use of web information as evidence to label a given claim. It is therefore intuitive to transfer NLI models to the FV task, and several NLI-based methods have achieved significant improvements in the FEVER shared task. The current study enriches neural-network-based NLI models with external knowledge. Unlike early NLI studies that analyze the internal knowledge of relatively small NLI datasets, the present work aims to combine the powerful modeling ability of neural networks with robust external knowledge in order to reason over heterogeneous information. Our experiments show that the proposed model enhances the adaptability of state-of-the-art NLI models to heterogeneous information.

D. GNN FOR NLI
To obtain both global and local representations of a graph, graph neural networks (GNNs) learn effective feature vectors for nodes through recursive neighborhood aggregation [34] on the information graph. Reference [35] proposes the graph convolutional network (GCN). Recently, the self-attention mechanism has been introduced into GNNs to encourage FV models to focus on the most significant parts of the data and to address the shortcomings of prior GCN-based approaches. In particular, [36] proposes the graph attention network (GAT), which introduces the attention mechanism into graph learning; in FV, this helps a model learn more about the relationships between the given claim and the evidence sentences. GAT is a type of GCN that applies attention to homogeneous graphs containing a single type of node and edge. Although GAT has made great progress in FV tasks, it has not been effectively applied to heterogeneous information [12]. Many improved methods build on these two models, such as [4], [5], [12], [15], [26], and achieve good performance on NLI tasks. However, these GNNs can only provide NLI with the relation representation of a single graph and cannot handle the various types of nodes and edges in an additional, different graph. The current work studies the incorporation of an external, inference-related knowledge graph into the process of natural language inference. For example, intuitive knowledge about synonymy, antonymy, hypernymy and hyponymy at the semantic level may help NLI models achieve soft alignment between premises and hypotheses. Knowledge about hypernymy and hyponymy may help NLI models capture the entailment relation. Knowledge about antonymy and co-hyponyms (words sharing the same hypernym) may be beneficial for modeling contradiction.

III. METHODOLOGY
We adopt a three-step pipeline with components for document retrieval, sentence selection and claim verification to complete the FV task.

A. PIPELINE
In the document retrieval and sentence selection stages, we simply employ the method from [5], since it has achieved the best performance. Once the FV system has obtained the claim and the evidential sentences, the claim verification model outputs the veracity of the given claim. An overview of the pipeline of the proposed method is illustrated in Figure 2. In Figure 2, $ev_i$ denotes the i-th evidential sentence, $h_{ce_i}$ denotes the representation of the i-th claim-evidence pair, MLP denotes the multilayer perceptron, softmax denotes the softmax function, KG denotes the knowledge graph, CKG denotes the collaborative knowledge graph, $h_{ev_i}$ denotes the representation of $ev_i$, $\alpha h$ denotes the weighted sum of multiple representations, and align denotes the alignment function. Building on the majority of previous models, we introduce a double graph attention network (DGAT) framework in the final claim verification stage, which is explained in detail in Section III-B. The main contribution of the current work is the double-graph-based reasoning approach for claim verification.

B. CLAIM VERIFICATION WITH GRAPH-BASED REASONING
The present section describes our DGAT framework for claim verification. As shown in Figure 2, given a claim and the retrieved evidence, we first employ a sentence encoder to obtain representations of the claim and the evidence. Subsequently, we build a fully connected evidence graph that is integrated with the knowledge graph, and we introduce an evidence reasoning network to propagate information among the pieces of evidence and reason over this collaborative graph. Finally, we use an evidence aggregator to infer the final claim label.

1) CONTEXTUAL SEMANTIC GRAPH CONSTRUCTION
After the evidential sentences are fed in, we construct a contextual semantic graph to represent their structural information. We use a practical, off-the-shelf labeling tool based on semantic roles [37] to construct this semantic relation graph. Specifically, given the evidential sentences, the construction proceeds in the following three steps.
1. We first use the semantic role labeling (SRL) toolkit of AllenNLP [27] to parse each provided sentence (including the claim and the evidence $evd_i$) into tuples. Trained on the CoNLL 2009 dataset, this module reaches an accuracy of 96.88%.
2. Following the semantic-level graph construction process of [15], we then use the typed elements of the tuples described in the first step as the nodes of the semantic graph. We set these types as subject, predicate, object, verb, tense, voice, position, and time. Within a tuple, every pair of nodes is connected by an edge.
3. To learn the relationships between evidential sentences, we create edges across different tuples. Following the idea of [15], we create edges by finding similarity relationships between two different tuples. Taking tuple A as the subject (s) and tuple B as the object (o), the similarity relationship covers the following three cases: (1) tuple A equals tuple B; (2) tuple A contains tuple B; and (3) tuple A and tuple B share a number of overlapping tokens. However, this coarse relationship between evidence from different sentences may lose a lot of important structural information. To refine the external relationships between pieces of evidence, we employ WordNet [38] to further identify relationships including synonymy, antonymy, hypernymy and meronymy.
This contextual semantic graph construction method provides an operational and more fine-grained strategy for representing the structural information of multiple evidential sentences.
We denote the contextual semantic graph as $G_1$. Next, we integrate world knowledge into the constructed semantic graph $G_1$ to improve the robustness of the FV system's inference on heterogeneous information.
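The three construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the SRL tuples are already available as dictionaries mapping a role name to a text span, it tests containment on token sets, and the function names (`tokens`, `related`, `build_semantic_graph`) and the `min_overlap` threshold are hypothetical. The WordNet-based refinement is not included.

```python
from itertools import combinations


def tokens(text):
    """Lower-cased whitespace tokens of a text span, as a set."""
    return set(text.lower().split())


def related(a, b, min_overlap=2):
    """Return the cross-tuple relation between two spans, or None.

    min_overlap is a hypothetical threshold; the source does not give one.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return None
    if ta == tb:
        return "equal"
    if ta <= tb or tb <= ta:  # one span's tokens contain the other's
        return "contains"
    if len(ta & tb) >= min_overlap:
        return "overlap"
    return None


def build_semantic_graph(tuples, min_overlap=2):
    """tuples: one dict per SRL tuple, mapping role name -> text span."""
    nodes = []     # (tuple_index, role, text)
    edges = set()  # (node_u, node_v, relation)
    for i, tup in enumerate(tuples):
        ids = []
        for role, text in tup.items():
            ids.append(len(nodes))
            nodes.append((i, role, text))
        # Step 2: connect every pair of nodes inside one tuple.
        for u, v in combinations(ids, 2):
            edges.add((u, v, "intra"))
    # Step 3: connect similar nodes that belong to different tuples.
    for u, v in combinations(range(len(nodes)), 2):
        if nodes[u][0] == nodes[v][0]:
            continue
        rel = related(nodes[u][2], nodes[v][2], min_overlap)
        if rel is not None:
            edges.add((u, v, rel))
    return nodes, edges


evidence = {"subject": "Andrea Pirlo", "predicate": "is",
            "object": "an Italian professional footballer"}
claim = {"subject": "Pirlo", "predicate": "is",
         "object": "an American professional footballer"}
nodes, edges = build_semantic_graph([evidence, claim])
cross = sorted(e for e in edges if e[2] != "intra")
```

In this toy input, the two subjects are linked by containment, the two predicates by equality, and the two objects by token overlap, so each of the three cases in step 3 appears once.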

2) CONTEXTUAL KNOWLEDGE GRAPH CONSTRUCTION
In addition to the intrinsic information of the evidence, we integrate world knowledge related to the evidence into the cognitive graph. Typically, such auxiliary data consists of real-world entities and the relationships among them, which profile the evidence sentences at the semantic level. We organize this supplementary information in the form of a knowledge graph $G_2$, which is a directed graph composed of subject-property-object triple facts [39]. More formally, it is presented as $G_2 = \{(h, r, t) \mid h, t \in E, r \in R\}$, where each triplet states that there exists a relationship $r$ from head entity $h$ to tail entity $t$. For example, (Beijing, CapitalOf, China) states the fact that Beijing is the capital of China. Note that $R$ contains relations in both the canonical direction (e.g., CapitalOf) and the inverse direction (e.g., MotherlandOf). We train and evaluate the module that generates ConceptNet [40] relations, and it achieves an accuracy of 92.7%.

3) CONTEXTUAL COLLABORATIVE KNOWLEDGE GRAPH CONSTRUCTION
The present section defines an enhanced graph, named the collaborative knowledge graph (CKG), which encodes the intrinsic information of the evidence sentences and the world knowledge related to the evidence as a unified graph. We first represent each entity's entailment behavior as a triplet $(s, \text{predicate}, o)$, where $y_{so} = 1$ denotes an additional interaction relation between subject $s$ and object $o$. The intrinsic information of the evidence graph can then be seamlessly integrated with the KG as a unified graph $G = \{(h, r, t) \mid h, t \in E', r \in R'\}$, where $E' = E \cup S$ and $R' = R \cup \{\text{predicate}\}$.
• Input: a CKG (denoted G) that contains the intrinsic information of evidence graph G 1 and knowledge graph G 2 .
• Output: the probability of the labels for the given claim.
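As a rough illustration of the unified graph, the sketch below merges evidence triplets of the form (s, predicate, o) with KG triplets of the form (h, r, t) by taking set unions. It is a toy under stated assumptions: the function name `build_ckg` is hypothetical, entities are matched by exact string, and apart from (Beijing, CapitalOf, China), which is the text's own example, the triplets are invented for demonstration.

```python
def build_ckg(evidence_triples, kg_triples):
    """Merge evidence triplets and KG triplets into one triple set.

    Returns (entities, relations, triples) of the unified graph.
    """
    triples = set(evidence_triples) | set(kg_triples)
    entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
    relations = {r for _, r, _ in triples}
    return entities, relations, triples


# Illustrative evidence triplet (hypothetical) and the KG example from the text.
evidence_triples = [("Beijing", "hosted", "2008 Olympics")]
kg_triples = [("Beijing", "CapitalOf", "China")]

entities, relations, triples = build_ckg(evidence_triples, kg_triples)

# Entities that appear in both graphs are the bridges along which
# information can later propagate between evidence and world knowledge.
evidence_entities = {x for h, _, t in evidence_triples for x in (h, t)}
kg_entities = {x for h, _, t in kg_triples for x in (h, t)}
bridges = evidence_entities & kg_entities
```

Here `bridges` contains only "Beijing", the single node through which the evidence triplet and the KG fact are connected in the merged graph.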

C. CONTEXTUAL WORD REPRESENTATIONS ENHANCED BY COLLABORATIVE KNOWLEDGE GRAPH
In the current section, we propose to use the structural features of the CKG to enhance contextual word representations. Specifically, to make better use of the CKG structural information, our idea is that the stronger the semantic relationship between two nodes in the CKG, the closer their relative positions should be. Therefore, we reorder the evidence sentences with a sort module based on [15]. Fortunately, the pre-trained language model XLNet [41] naturally includes the concept of relative position. Thus, to enhance the representation ability of the pre-trained model, we inject the relative position information of the CKG nodes into the input of XLNet. TransR [42] learns the CKG representation of each entity and relation by optimizing the translation principle $e_h^r + e_r \approx e_t^r$ if a triplet $(h, r, t)$ exists in the CKG, where $e_h, e_t \in \mathbb{R}^d$ and $e_r \in \mathbb{R}^k$ are the embedding vectors of $h$, $t$ and $r$, respectively, and $e_h^r$ and $e_t^r$ are the projected representations of $e_h$ and $e_t$ in the space of relation $r$. Therefore, for a given triplet $(h, r, t)$, its plausibility score is formulated as $g(h, r, t) = \| W_r e_h + e_r - W_r e_t \|_2^2$, where $W_r \in \mathbb{R}^{k \times d}$ denotes the transformation matrix of relation $r$, which projects entities from the $d$-dimensional entity space into the $k$-dimensional relation space. A lower score $g(h, r, t)$ suggests a stronger relationship between the two entities. As a result, we can define the energy score as the relative position distance between the two entities. We train our representation module on FB15K [43], where the accuracy of triple classification reaches 91.2%. After reordering the sequences, we feed them into the XLNet model to obtain the contextual representation $h([CLS])$, which captures the semantic interaction between the claim and the evidence at the word level.
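To make the TransR plausibility score concrete, the following sketch computes the squared L2 distance between the projected head plus the relation vector and the projected tail, then ranks candidate tails by that score. It is only an illustration: the helper names are hypothetical, the embeddings and the projection matrix are hand-picked toy values rather than trained parameters, and ranking candidates by score is an assumed stand-in for the paper's sort module.

```python
def matvec(W, v):
    """Multiply a k x d matrix (list of rows) by a d-dimensional vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transr_score(e_h, e_r, e_t, W_r):
    """Squared L2 norm of W_r e_h + e_r - W_r e_t; lower means more plausible."""
    proj_h = matvec(W_r, e_h)
    proj_t = matvec(W_r, e_t)
    return sum((h + r - t) ** 2 for h, r, t in zip(proj_h, e_r, proj_t))


# Toy values: entity space d = 3, relation space k = 2.
W_r = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
e_h = [1.0, 2.0, 5.0]
e_r = [1.0, -1.0]
candidates = {"close": [2.0, 1.0, 7.0], "far": [0.0, 0.0, 0.0]}

scores = {name: transr_score(e_h, e_r, e_t, W_r)
          for name, e_t in candidates.items()}
# Entities with a lower energy score are placed nearer to the head entity.
ranked = sorted(candidates, key=lambda name: scores[name])
```

With these values the "close" tail satisfies the translation principle exactly (score 0.0), while the "far" tail has score 5.0, so it is ranked second.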

D. ATTENTIVE INFORMATION PROPAGATION LAYERS
After obtaining the contextual word representations, to further exploit graph information at the semantic level, we take advantage of the GCN [35] architecture to recursively propagate information along high-order connectivity. We closely follow the experimental setting in [35], achieving 80.4% classification accuracy on Cora. Moreover, to reveal the importance of different node connections, we employ the graph attention network [36] to generate attentive weights for the cascaded propagations. We follow the setting of GAT [36], with which the classification accuracy reaches 83.5%. We first describe a single-layer architecture, which consists of three parts: information propagation, knowledge-aware attention, and information aggregation. Then, we briefly discuss how to extend it to multiple layers.

1) INFORMATION PROPAGATION
An entity that appears in multiple triplets may act as a bridge, connecting two different triplets and propagating information. Our intuitive idea is to perform information propagation between an entity and its neighbors in the CKG.
Formally, we denote $G$ as the collaborative graph and $H \in \mathbb{R}^{N_v \times d}$ as a matrix containing the representations of all nodes, where $N_v$ and $d$ signify the number of nodes and the dimension of the node representations, respectively. Each row $H_i \in \mathbb{R}^d$ is the representation of node $i$. We denote the adjacency matrix of graph $G$ as $A$ and its degree matrix as $D$. Considering an entity $h$, we use $N_h = \{(h, r, t) \mid (h, r, t) \in G\}$ to denote the set of triplets in which $h$ is the head entity, termed the ego-network [44], where $h$ and $t$ represent the head and tail entity, respectively. To describe the first-order connectivity structure of entity $h$ formally, based on [44], we calculate the linear combination of $h$'s ego-network: $e_{N_h} = \sum_{(h, r, t) \in N_h} \pi(h, r, t) \, e_t$, where $\pi(h, r, t)$ is learned from data and controls the decay factor of each propagation on edge $(h, r, t)$, indicating how much information is propagated from $t$ to $h$ conditioned on $r$.
More generally, we organize the information propagation among the nodes in matrix form to facilitate GPU computation. GCNs aggregate information from multi-hop neighboring nodes as follows: $H^{(j+1)} = \rho(\tilde{A} H^{(j)} W_j)$, where $j$ denotes the layer number, $H_i^{(j+1)} \in \mathbb{R}^d$ is the updated $d$-dimensional representation of node $i$, $\tilde{A}$ is the normalized symmetric adjacency matrix, $W_j$ is the weight matrix of the $j$-th layer, and $\rho$ signifies an activation function. The propagation method learns the claim-based and the evidence-based graph separately, so that we obtain the representations of all nodes in the claim-based graph, $H_c$, and in the evidence-based graph, $H_e$, respectively.
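The matrix form of the propagation can be illustrated with a single GCN layer written in plain Python. This is a sketch under assumptions: the normalization adds self-loops and scales by the inverse square root of the degrees, as in the original GCN formulation [35] (the text only states that the adjacency matrix is normalized and symmetric), ReLU is chosen for the activation $\rho$, and the two-node graph, features, and weights are toy values.

```python
import math


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (self-loops assumed)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(A_norm, H, W):
    """One propagation step: rho(A_norm @ H @ W) with rho = ReLU."""
    return relu(matmul(matmul(A_norm, H), W))


# Toy graph: two nodes joined by one edge, 2-dimensional node features.
A = [[0.0, 1.0],
     [1.0, 0.0]]
H0 = [[1.0, 0.0],
      [0.0, 1.0]]
W0 = [[1.0, 0.0],
      [0.0, 1.0]]

A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, H0, W0)
```

After one layer, each node's representation is the average of its own and its neighbor's features; stacking layers extends the receptive field to multi-hop neighbors.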

2) KNOWLEDGE-AWARE (KA) ATTENTION
For the sake of simplicity, we use $e_{ij}$ for $\pi(h, r, t)$. We compute $\pi(h, r, t)$ with an attention mechanism in which tanh [36] is used as the nonlinear activation function. The tanh nonlinearity makes the attention score depend on the distance between $H_e$ and $H_c$ in the space of relation $r$, so that more information is propagated between closer entities. Note that, for simplicity, we use only the dot product on these representations. Then, we normalize the coefficients across all triplets connected with $h$ using the softmax function. As a result, the final attention score indicates which neighbor nodes deserve more attention in order to capture more useful CKG information. During forward propagation, the attention flow indicates the parts of the data that receive close attention, which can also be regarded as explanations for the fact verification.
Different from the information propagation in GCN [35] and GraphSage [45], our model not only exploits the proximity structure of the CKG but also explicitly calculates the varying importance of neighbors. Additionally, different from the graph attention networks [15], [36], which use only the node representations as the model's inputs, we calculate the relation $h_r$ between $h_c$ and $h_e$, which provides more information during the propagation process.
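The sketch below shows one way such a knowledge-aware attention step could look. The exact scoring function of this work is not recoverable from the text, so the code assumes the form used in the knowledge-aware attention of [14], a dot product between the projected tail and the tanh of the projected head plus the relation vector, followed by softmax normalization over the neighbors. All names are hypothetical, and the vectors are assumed to be already projected into the relation space.

```python
import math


def ka_score(proj_h, e_r, proj_t):
    """Assumed score: dot(proj_t, tanh(proj_h + e_r))."""
    return sum(t * math.tanh(h + r) for h, r, t in zip(proj_h, e_r, proj_t))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(proj_h, neighbors):
    """neighbors: list of (e_r, proj_t) pairs attached to the head entity.

    Returns the normalized attention weights and the weighted sum of
    the neighbor vectors (the message passed to the head entity).
    """
    scores = [ka_score(proj_h, e_r, proj_t) for e_r, proj_t in neighbors]
    weights = softmax(scores)
    dim = len(proj_h)
    message = [sum(w * proj_t[k] for w, (_, proj_t) in zip(weights, neighbors))
               for k in range(dim)]
    return weights, message


proj_h = [0.0, 0.0]
neighbors = [([1.0, 0.0], [1.0, 0.0]),   # relation pushes the head toward the tail
             ([0.0, 0.0], [1.0, 0.0])]   # relation carries no such signal
weights, message = attend(proj_h, neighbors)
```

The first neighbor receives the larger weight because its relation vector moves the head closer to the tail in the relation space, which mirrors the intuition that closer entities exchange more information.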

3) INFORMATION AGGREGATION
After attention, we obtain the claim-centric evidence representation of each node. We use the matrix $X$ to denote the claim-centric evidence representations of all nodes in the CKG.
To obtain the node-to-node alignment, we apply a fine-grained word alignment function [46] to the claim-centric evidence representation and the claim node representation. The resulting aligned vectors are $A = [a_1, \cdots, a_{N_v^c}]$. We apply max pooling over $A$ to obtain the final output $q$. Finally, $q$ is concatenated with the final hidden vector $h([CLS])$ and fed into an MLP layer to predict the label $p$.
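The final aggregation step (max pooling, concatenation with the [CLS] vector, and label prediction) can be sketched as follows. This is a simplified illustration: the alignment function itself is not reproduced, a single linear layer followed by softmax stands in for the MLP, and the aligned vectors, the [CLS] vector, and the weights are toy values chosen by hand.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def linear_softmax(features, weights, bias):
    """One linear layer plus softmax, standing in for the MLP head."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy aligned vectors a_1..a_3 and a toy [CLS] representation.
aligned = [[0.1, 0.9],
           [0.4, 0.2],
           [0.3, 0.5]]
h_cls = [1.0, -1.0]

q = max_pool(aligned)      # pooled graph representation
features = q + h_cls       # concatenation of q and h([CLS])

# Three output classes: SUPPORTS, REFUTES, NOTENOUGHINFO (toy weights).
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
p = linear_softmax(features, weights, bias)
```

Max pooling keeps, for every dimension, the strongest signal among the aligned nodes, so the size of `q` does not depend on the number of claim nodes.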

IV. EXPERIMENTAL METHODOLOGY
In the current section, we describe the dataset, evaluation metrics, baselines, and implementation details in our experiments.

A. DATASETS
The success of experiments centered on FV systems depends on the suitability of the corpus. First, the corpus should provide a considerable number of annotated instances for each FV sub-task. Second, given that misinformation may originate from sources ranging from official announcements to blogs or Twitter posts, the training data should not be drawn from a single web source alone. Existing FV datasets include PolitiFact14 [45], Emergent16 [46] and RumourEval17 [47]. However, these datasets contain only a few hundred verified claims, which is too few to train a deep learning model that can classify unseen claims. Although Snopes17 [48] covers more verified claims, most of the documents it includes are unrelated to the claims; in other words, it lacks useful information for verification. FEVER [2], the largest dataset available for developing FV systems, is composed of 185,445 annotated claims and a total of 5,416,537 Wikipedia documents. Its claims were written by annotators who altered sentences from Wikipedia, and they were then labeled against other sentences. FEVER supports the training of deep learning systems with full access to evidence on Wikipedia. However, since FEVER is based solely on Wikipedia and all of its claims are constructed artificially, systems trained on it are unlikely to be able to retrieve evidence from diverse web sources or to verify claims with such evidence. The UKP Snopes Corpus [27] provides a new annotated corpus suited to FV tasks based on heterogeneous web sources, and the original corpus is freely licensed. To train the model proposed in this work, we conduct experiments on the primary dataset, FEVER [2]. As mentioned above, DGAT is designed to be generally applicable to heterogeneous information.
To examine the generality of the model, we then use the UKP Snopes Corpus to evaluate how well it generalizes to multi-domain FV tasks.

C. IMPLEMENTATION DETAILS
During training, we use a batch size of 32 and a learning rate of 2e-5. All models are trained for two epochs and evaluated with LA on the test set. Each claim is given five pieces of evidence. In our experiments, the maximum length is set to 150. All models are implemented with PyTorch.
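For reproducibility, the hyperparameters stated above can be collected in one configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the text; field names are hypothetical."""
    batch_size: int = 32
    learning_rate: float = 2e-5
    num_epochs: int = 2
    evidence_per_claim: int = 5
    max_length: int = 150


config = TrainConfig()
```

Freezing the dataclass prevents accidental changes to the settings during a run.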

D. EVALUATION METRICS
In the FEVER shared task, we follow the official evaluation metrics for claim verification, namely label accuracy (LA) and FEVER score. On the UKP Snopes Corpus, we use recall and F1. Note that although F1 combines recall and precision, recall remains the most important factor, because the FV system depends on the retrieval predictions, and low recall leaves many claims without any plausible evidence.

1) LA
LA calculates classification accuracy rate of claims without the consideration of retrieved evidence.

2) FEVER SCORE
The FEVER score is the LA conditioned on providing at least one complete evidence set, which better reflects the inference model ability.
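The difference between the two metrics can be seen in a small sketch. It assumes that each prediction is a pair of a label and a set of evidence identifiers, and that each gold annotation is a pair of a label and a list of complete evidence sets; these data layouts, the function names, and the toy instances are hypothetical and only meant to mirror the definitions above.

```python
NEI = "NOTENOUGHINFO"


def label_accuracy(predictions, golds):
    """Fraction of claims whose predicted label equals the gold label."""
    hits = sum(p_label == g_label
               for (p_label, _), (g_label, _) in zip(predictions, golds))
    return hits / len(golds)


def fever_score(predictions, golds):
    """Label must match and, unless the gold label is NEI, the predicted
    evidence must cover at least one complete gold evidence set."""
    hits = 0
    for (p_label, p_evidence), (g_label, g_sets) in zip(predictions, golds):
        if p_label != g_label:
            continue
        if g_label == NEI or any(gold <= p_evidence for gold in g_sets):
            hits += 1
    return hits / len(golds)


golds = [
    ("SUPPORTS", [{"a1"}]),
    ("REFUTES", [{"b1", "b2"}]),      # needs both sentences together
    (NEI, []),
    ("SUPPORTS", [{"c1"}, {"c2"}]),   # either set alone is sufficient
]
predictions = [
    ("SUPPORTS", {"a1", "x"}),        # correct label, complete evidence
    ("REFUTES", {"b1"}),              # correct label, incomplete evidence
    (NEI, set()),                     # correct label, no evidence needed
    ("REFUTES", {"c2"}),              # wrong label
]

la = label_accuracy(predictions, golds)
fs = fever_score(predictions, golds)
```

The second instance counts toward LA but not toward the FEVER score, which is why the FEVER score can never exceed LA.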

V. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we first compare our DGAT framework with baseline models on the FEVER test set to evaluate whether our model functions well on a homogeneous dataset. Then, we compare the performance of the proposed method with that of other methods on the UKP Snopes Corpus, a rich heterogeneous dataset. Next, we provide an ablation study to measure the effect of each individual module. Finally, we present a case study to demonstrate the effectiveness of our framework. Table 1 shows the FV performance. Several testing scenarios are used to compare the effectiveness of DGAT with that of transformer-based baselines. In comparison with the baseline models, although DGAT is not the best in all scenarios on the homogeneous dataset, it works well on this benchmark dataset for fact checking. Most notably, our model achieves the best results on the heterogeneous dataset, demonstrating that a reasoning model with a knowledge graph can improve the robustness of an FV system on heterogeneous information. Table 1 reports the performance of our model and the baselines on the FEVER test set (the public leaderboard for perpetual evaluation of FEVER is https://competitions.codalab.org/competitions/18814#results; BUAA_KLBNT is our team name). In terms of label accuracy and FEVER score, our model is inferior to KGAT, Transformer-XH and DREAM on the test set, with gaps within 5.5%. It is worth noting that the proposed approach, which improves the robustness of FV systems on heterogeneous datasets by combining the intrinsic information of the evidence with the world knowledge related to the evidence, outperforms the other baselines on the heterogeneous dataset. In addition, we find that the performance of the baseline models on heterogeneous datasets is not consistent with their performance on homogeneous datasets; for instance, Transformer-XH performs better than GEAR on the FEVER test set, but worse on the UKP Snopes Corpus. The advantage of DGAT is that it can fill the knowledge gap between contexts.
However, our model has no advantage on the single-domain dataset FEVER, since FEVER contains little or no knowledge gap and the complex structure makes the model prone to overfitting. Heterogeneous data and evidential sentences from complex textual sources, as found in the UKP Snopes Corpus and on the real web, make it difficult for previous methods to label claims correctly.

C. ABLATION STUDY
This section reports extensive ablation studies for each module of our DGAT model and compares them with the results of the full model. Table 2 presents the recall and F1 on the UKP Snopes Corpus after removing different components of our DGAT model, namely the knowledge graph fusion mechanism (III-B-3 and III-C) and the graph convolutional network and graph attention network (III-D), respectively. The last row in Table 2 is equivalent to simply concatenating all the evidence sentences into a single string for reasoning, without any graph structure information for the FV task. As shown in Table 2, compared with not using graph structure information, using the graph network structure makes full use of the semantic information and brings a 27.6% improvement in recall. Eliminating the external knowledge collaboration module decreases recall and F1 by 14.4% and 15.8%, respectively. Eliminating the graph-based reasoning and attention module reduces recall by 12.9% and F1 by 13.7%, because the graph-based reasoning module learns more graph network information.
As shown in Table 2, after the two main modules are removed, the performance of DGAT is quite poor compared with the other approaches (Athene, KGAT and DREAM) in Table 1. We integrate our DGAT approach into these three strong baselines and report the results in Table 3. As Table 3 shows, the baselines incorporating the DGAT approach improve effectively on the Snopes Corpus, which indicates that incorporating external knowledge provides a feasible way for deep-learning-based FV to reason robustly over multi-domain information. Figure 3 shows two examples from our experiments that require scattered evidence from textual sources with multiple structures to make the right inference. These two pieces of evidence also rank first and second in our retrieved evidence set. To determine whether Claim 1 is ''refuted'', ''supported'' or ''not enough information'', our model needs evidence ''1-1'' and evidence ''1-2''. The two evidential sentences come from web texts with different structures, and our collaborative graph bridges the gap between the different textual sources. The top two evidential sentences for Claim 1 use different tokens (Hitler and Nazi, respectively) to convey related semantics, and the KG can establish the connection between these two sentences from different web sources. In addition, all evidence nodes in the collaborative graph tend to attend to the top two evidence nodes, which provide the most beneficial information for the inference process. The attention weights of the other nodes are extremely low, indicating that our model is able to select useful information from scattered evidence, even across textual sources with their own respective structures.

E. ERROR ANALYSIS
In this section, we present an error analysis of the incorrectly predicted instances and summarize the 4 primary types of errors, which can also inform later improvements of FV systems. Table 4 presents the confusion matrix for the FGE sets predictions. The system finds it easiest to classify instances labelled as REFUTES, whereas using the NOT ENOUGH INFO label correctly is the most difficult. We describe some frequent failure cases of our model below.
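As a reading aid for a confusion matrix such as Table 4, the following sketch shows how per-class precision, recall and F1 can be derived from the counts. The counts in `toy` are placeholders chosen for illustration and are not the values in Table 4; the helper name `per_class_scores` is likewise hypothetical.

```python
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]


def per_class_scores(matrix):
    """Compute precision, recall and F1 per label.

    matrix[i][j] is the number of instances with gold label i
    that were predicted as label j.
    """
    scores = {}
    for k, label in enumerate(LABELS):
        true_pos = matrix[k][k]
        gold_total = sum(matrix[k])                 # row sum: all gold instances
        pred_total = sum(row[k] for row in matrix)  # column sum: all predictions
        recall = true_pos / gold_total if gold_total else 0.0
        precision = true_pos / pred_total if pred_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Placeholder counts for illustration only (NOT the numbers from Table 4).
toy = [
    [70, 10, 20],  # gold SUPPORTS
    [5, 85, 10],   # gold REFUTES
    [25, 25, 50],  # gold NOT ENOUGH INFO
]

scores = per_class_scores(toy)
hardest = min(scores, key=lambda label: scores[label]["recall"])
```

In this toy matrix the label with the lowest recall is NOT ENOUGH INFO, which is the kind of pattern the paragraph above describes.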

1) CONFUSING SEMANTICS AT SENTENCE LEVEL
At the semantic level of the whole sentence, our DGAT lacks the alignment ability needed to predict the relationship between two complex sentences. For example, "Andrea Pirlo is an American professional footballer" vs. "Andrea Pirlo is an Italian professional footballer who plays for an American club." This defect causes the FV system to fail to correctly label the "not enough information" instances. For example, the claim "Terry Crews played on the Los Angeles Chargers." (annotated as NotEnoughInfo) is labeled as refuted, with the retrieved top-1 sentence "In football, Crews played . . . for the Los Angeles Rams, San Diego Chargers and Washington Redskins, . . . ". Although this evidential sentence is closely related to the claim, it does not establish that the claim is wrong, which makes this kind of case even more difficult.

2) MULTIPLE SEMANTIC COMPLEXITY
In some complex situations, alignment methods in the NLI system alone are not enough to predict the correct relationship. In these cases, the model needs to capture the relationship between scattered words. For example, "China keeps all mobile phone chips manufactured within the Huawei for use in Chinese electronics." vs. "Huawei's mobile phone chips became the Chinese leading electronic by monetary value."

3) NUMBERS COMPLEXITY AT SENTENCE LEVEL
Because the method cannot interpret numerical values, many claims with numerical components are labeled incorrectly. For instance, the claim "To maintain one's current weight, one must eat at least 30 calories per kilogram of body weight per day." is not classified as refuted based on the evidence sentence "To maintain their current weight, humans need to consume nearly 10 calories per kilogram of body weight per day." The number is the most important factor in determining the label of the claim. Nevertheless, the information representation module cannot embed numbers distinctly enough.
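To show why the number alone decides this example, the sketch below extracts the numeric values from the claim and the evidence and flags a disagreement. It is only an illustration of the failure case and is not part of DGAT: the regular expression and the helpers `extract_numbers` and `numbers_conflict` are hypothetical, and the check deliberately ignores units and qualifiers such as "at least" or "nearly".

```python
import re

# Matches integers and simple decimals such as "30" or "2.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(sentence):
    """Return all numeric values found in the sentence as floats."""
    return [float(token) for token in NUMBER.findall(sentence)]


def numbers_conflict(claim, evidence):
    """True if both sentences contain numbers but share none of them.

    Crude heuristic: no unit handling, no reasoning about qualifiers.
    """
    claim_values = set(extract_numbers(claim))
    evidence_values = set(extract_numbers(evidence))
    if not claim_values or not evidence_values:
        return False
    return not (claim_values & evidence_values)


claim = ("To maintain one's current weight, one must eat at least 30 "
         "calories per kilogram of body weight per day.")
evidence = ("To maintain their current weight, humans need to consume nearly "
            "10 calories per kilogram of body weight per day.")

conflict = numbers_conflict(claim, evidence)
```

Here the two sentences are almost identical apart from the values 30 and 10, so a representation that does not separate these numbers clearly has little signal left to distinguish the claim from the evidence.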

VI. CONCLUSION
To conclude, the current work presents a DGAT approach with semantic relations in CKG for knowledge-aware FV. When assessing the veracity of a claim given multiple evidential sentences, our approach is first built upon an automatically constructed semantic graph and a knowledge graph. Compared with previous research, the greatest advantage of DGAT is that it completes the NLI task on heterogeneous web textual information with the help of the knowledge graph. In addition, the core of the reasoning modules is the attentive embedding propagation layer, which adaptively propagates the embeddings from a node's neighbors to update the node's representation. Experiments demonstrate that DGAT functions well on the FEVER shared test datasets, which are common homogeneous datasets. More importantly, extensive experiments on heterogeneous datasets support the rationality and effectiveness of DGAT. The present study is a first analysis of the potential of jointly using a knowledge graph and a GNN, and an initial attempt to exploit structural knowledge based on an information propagation mechanism in the FV task. Besides knowledge graphs, a great deal of other structured textual information exists in real-world scenarios, such as social networks. For instance, by integrating a social network with our CKG, we can study how social influence affects fact checking. Another exciting direction is the integration of information propagation and the decision process, which also opens up research possibilities for explainable FV.

4) FUTURE WORK
There is an important point about training methods on multi-domain datasets. In some cases, the interpretation of facts may differ between sources. Therefore, it may be better to use multiple models trained on multiple single datasets for fact validation. To test this, we trained our DGAT on multiple single datasets (a partial FEVER dev set, PolitiFact and GossipCop). Unfortunately, under such experimental conditions, DGAT suffers from learning degradation (the models only perform well on the latter single-domain dataset on which they are trained, while achieving degraded results on the former single-domain dataset), which makes FV performance worse. We offer the following analysis of this interesting experimental result. Compared with current advanced models based only on GNNs, the advantage of our DGAT is that it can fill knowledge gaps by training on multi-domain datasets. However, a single-domain dataset contains fewer knowledge gaps, so a model trained on it lacks the ability to fill the knowledge gaps between different domains. In this sense, our training method has disadvantages in some cases. The open question is how a GNN trained on a given dataset can perform well on a new and potentially significantly different dataset. This issue has not been addressed in previous work that applies GNNs to FV, and we believe it may require a more advanced training mechanism. Improving this deficiency by studying new training methods would require considerably more time and is beyond the scope of this research.

He is currently a Supervisor and a Professor with Beihang University, where he is also the Director of the Beijing Key Laboratory of Network Technology. He has participated in different national major research projects and published more than 70 research papers in important international conferences and journals.
His current research interests include network and information security, data mining, information countermeasure, cloud security, and deep neural networks.
CHENGXIANG SI received the Ph.D. degree from Beihang University, Beijing, China, in 2018. His main research interests include network and information security.
BEITONG YAO received the B.S. degree in applied chemistry from Beihang University, Beijing, China, in 2018, where he is currently pursuing the M.S. degree in computer science with the School of Computer Science and Engineering. His main research interests include natural language processing, data mining, and machine learning.
TIANBO WANG (Member, IEEE) received the Ph.D. degree in computer application from Beihang University, Beijing, China, in 2018. He is currently a Lecturer with Beihang University. He has participated in several national natural science foundations and other research projects. His research interests include network and information security, intrusion detection technology, and information countermeasure.