Completing Scientific Facts in Knowledge Graphs of Research Concepts

In the last few years, we have witnessed the emergence of several knowledge graphs that explicitly describe research knowledge with the aim of enabling intelligent systems for supporting and accelerating the scientific process. These resources typically characterize a set of entities in this space (e.g., tasks, methods, evaluation techniques, proteins, chemicals), their relations, and the relevant actors (e.g., researchers, organizations) and documents (e.g., articles, books). However, they are usually very partial representations of the actual research knowledge and may miss several relevant facts. In this paper, we introduce SciCheck, a new triple classification approach for completing scientific statements in knowledge graphs. SciCheck was evaluated against other state-of-the-art approaches on seven benchmarks, yielding excellent results. Finally, as a real-world use case, we applied SciCheck to the Artificial Intelligence Knowledge Graph (AI-KG), a large-scale automatically-generated open knowledge graph including 1.2M statements extracted from the 333K most cited articles in the field of Artificial Intelligence, and generated a new version of this knowledge graph with 300K additional triples.


I. INTRODUCTION
The rise of Open Science and the steady growth of the number of research publications, datasets, and other materials on the web is changing the way research outcomes are shared and explored, and is posing new challenges and opportunities. This large mass of open research outcomes has the potential of supporting a new generation of intelligent systems for actively supporting, automatizing, and accelerating the scientific effort [1].
One of the main challenges in this space is to generate a semantically rich, interlinked, and machine-readable description of the available research knowledge. This could enable more sophisticated techniques to analyze the scientific literature. As a consequence, more advanced services could be provided, e.g., forecasting research dynamics, generating scientific hypotheses, identifying key insights, informing funding decisions, confirming claims in news, or automatically running experiments [2], [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Ananya Sen Gupta.
Like many other knowledge graphs (KGs), those that describe research concepts suffer from incompleteness. They are typically very partial representations of the actual research knowledge and may lack several relevant facts that were not identified by information extraction approaches or human experts. The issue of incompleteness in knowledge graphs is usually addressed by link prediction or triple classification techniques [22], [23], which have proved to yield good results in several domains [17]. These methods typically use KG embedding models (e.g., TransE [24], RotatE [25], ComplEx [26]), path-based features [27], [28], or Graph Neural Networks [29]. However, existing methods for knowledge graph completion underperform on KGs of research concepts, as detailed in Section IV. In particular, they suffer from low precision, which is not acceptable in the scientific domain.
To address the above issue, in this paper, we introduce SciCheck, a new approach for completing scientific facts in knowledge graphs of research concepts. SciCheck is built on top of the CAFE approach [27] and introduces several new features and heuristics for the scholarly domain.
We evaluated SciCheck on two new benchmarks extracted from AI-KG (AIKG-1M and AIKG-500) and five well-known general benchmarks for triple classification (FB13, WN11, WN18, WN18RR, and NELL). The evaluation shows that SciCheck significantly outperforms nine alternative approaches in terms of precision, which we consider key for reliably extending knowledge graphs of research concepts, while still obtaining good values of recall. All the resources used for evaluation are available online.
As a use case, we used SciCheck to enrich the Artificial Intelligence Knowledge Graph (AI-KG) [2], a large-scale automatically-generated open KG including 14M RDF triples and 1.2M reified statements extracted from the 333K most cited articles in the field of AI. We also made available to the scientific community a new version of AI-KG (version 1.2) with 300K additional triples that we generated with SciCheck.
In summary, the main contributions of our work are the following:
• We propose SciCheck, a new triple classification technique that uses a variety of features to complete KGs of research concepts with a high precision.
• We provide a real-world use case in which we apply SciCheck to AI-KG and use it to generate a new version of AI-KG containing 300K additional triples.
The remainder of this paper is organized as follows. Section II describes the related work. Section III describes SciCheck in detail, and Section IV discusses the evaluation results. Section V describes AI-KG and how SciCheck was applied to it in order to extend it. Finally, Section VI concludes the paper and presents future directions of research.

II. RELATED WORK
The majority of related proposals in this field are nowadays based on embedding models, i.e., producing a translation from the entities and relations in the graph into vectors that preserve their semantics. In this area, experts usually distinguish between knowledge graph embeddings, and language models.
KG embeddings [23], [30] learn embedded representations of KG entities and relations, performing different transformations in an embedding space [24], [25], [26], [31], [32], [33], [34]. The resulting embedding space is subsequently used to evaluate the likelihood that a candidate triple is correct, since entities that are supposed to be related by means of a certain relation are expected to be closer to each other in the embedding space. They have also been recently used for assessing research hypotheses, yielding promising results [3].
While they provide good results in general, all of the former proposals suffer from a performance drawback: due to the way in which the embedded representations are obtained, they need to be recomputed whenever new triples are added to the KG, which is a relatively frequent event [35]. Language models are based on word embeddings (such as Word2Vec [36] or BERT [37]), that represent the semantic information encoded in the text of nodes and relations, and are therefore less affected by the introduction of new triples. These models are able to deal with text ambiguity and produce contextualized embeddings.
Embedding-based approaches are able to exploit features from both the entities and relations in the graph, but they usually explore the immediate neighborhood of entities, disregarding longer paths in the graph that could also provide some interesting features. Therefore, other approaches are proposed to leverage these longer paths: path-based, and graph neural network-based approaches.
Path-based techniques exploit the highly relational nature of KGs to learn how to predict new relations between entities. Regarding this approach, Lao and Cohen [38] introduced the Path Ranking Algorithm (PRA), a two-step process to find which paths may be useful to predict a certain relation. An evolution of PRA named Subgraph Feature Extraction (SFE) by Gardner and Mitchell [39] achieves better performance than PRA and produces more expressive results. Mazumder et al. [40] propose a random walk-based approach using neighborhood-guided path finding, where semantic similarities between entities are computed by applying a Word2Vec-based embedding model on the names of the entities. Reinforcement learning has also been used to find valuable paths that can help to successfully complete a KG [41]. Shen et al. [28] propose combining the benefits of embeddings and path-based approaches, by computing embeddings of the entities and relations, and then combining these embeddings in the forms of paths. Unfortunately, due to the non-deterministic way in which these paths are computed, they may miss relevant information by mere chance. More recently, Borrego et al. [27] proposed CAFE, a deterministic approach to exploit the highly connected nature of KGs that does not rely on random paths.
There are also a number of proposals that leverage the use of Graph Neural Networks (GNNs) to exploit not just a limited set of paths, but the entire structure of the graph. Some of them are based on traditional embedding models [42], [43]. The most recent proposals are based on Graph Attention Networks [44], [45], [46]. An extended survey on GNNs and their applications has been carried out by Zhou et al. [29]. The main drawback of this approach is the amount of computational resources it requires, making them unappealing to deal with real-world KGs, such as those about research concepts, which are our focus.
The particularities of research concepts make the former proposals generally unable to complete these KGs with a high precision. These KGs usually contain a large number of ambiguous and synonymous terms, due to a lack of standardization in the vocabulary used in different research works [47]. Also, they often contain highly categorical relations [48], i.e., relations in which the number of possible head entities is significantly higher than the number of possible tail entities. Therefore, some language models have been proposed based on different types of KG embeddings to deal specifically with this type of graph [48], [49], [50]. Some recent techniques, such as exBERT [47], exploit contextualized language models rather than KG embeddings.
The novelty of our approach resides in not solely using KG embeddings, language models, or random paths, but in a combination of features that leverages the strengths of embeddings and deterministic path features, and does not suffer from the high hardware requirements of GNNs.
Specifically, SciCheck makes use of deterministic path-based and embedding-based features to solve the problem of triple classification in general-domain knowledge graphs, and more specifically, in scholarly KGs. In addition, according to our experimental results, SciCheck is also able to outperform the other proposals in terms of precision, which is essential to complete KGs of research concepts, while still achieving a fair recall.

III. SciCheck
SciCheck is a novel approach for triple classification designed to complete scientific statements in a knowledge graph. It is built on top of the CAFE approach [27] by incorporating a new set of features and heuristics tailored to capture scientific knowledge. SciCheck takes an entire KG in the form of triples as input, and produces one neural-based classifier for each relation in the KG as output. Specifically, given a relationship $r$, SciCheck generates a model $f_r : (h, r, t) \rightarrow s$ that assigns a confidence score $s$ in the range [0, 1] to any arbitrary triple <h, r, t> to solve a binary classification task ("is the triple correct or not?"). To feed the model, triples are converted into a numerical vector representation using ad-hoc features and contextual embedding representations. SciCheck can operate on any KG and focuses on optimizing precision, to ensure that the knowledge deemed correct is trustworthy.
In the following subsections, we describe all the relevant steps for the workflow of SciCheck. For the sake of illustration, we provide a visual summary of this workflow in Fig. 1. Additionally, Fig. 2 displays a small KG that will be used to provide specific examples for some steps.

A. LOADING THE KG
The first step of SciCheck takes as an input a set of triples from the target KG. Triples are transformed into a graph structure. Due to the generally large number of entities that comprise a KG and the high volume of read operations that are used in the following steps, the KG is stored in the form of adjacency hashmaps, which also preserve the types of the different relations.
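The storage layout described above can be sketched as follows. This is a minimal illustration, assuming triples arrive as (head, relation, tail) tuples; the function name `load_kg`, the additional `incoming` index, and the third example triple are hypothetical and are not taken from the SciCheck implementation.

```python
from collections import defaultdict


def load_kg(triples):
    """Index (head, relation, tail) triples as nested adjacency hashmaps."""
    # outgoing[head][relation] -> set of tails, so the relation type is preserved.
    outgoing = defaultdict(lambda: defaultdict(set))
    # incoming[tail][relation] -> set of heads (an assumed convenience index).
    incoming = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        outgoing[head][relation].add(tail)
        incoming[tail][relation].add(head)
    return outgoing, incoming


# The first two triples are quoted in the text; the third is hypothetical.
triples = [
    ("link_prediction", "usesMethod", "neural_network"),
    ("neural_network", "usesMaterial", "dbpedia"),
    ("link_prediction", "supportsTask", "triple_classification"),
]
outgoing, incoming = load_kg(triples)
```

Keying first by entity and then by relation keeps each lookup at constant expected cost, which matters given the high volume of read operations in the later steps.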

B. GENERATING NEGATIVE EXAMPLES
Knowledge graphs only contain positive knowledge, i.e., triples whose heads and tails are known to be related by means of a relationship. However, in order to train a classifier, negative triples are also needed. To do this, SciCheck follows the same approach as many other related techniques [27], [28], [38], [51], [52], [53] and generates negative triples by corrupting a positive triple <h, r, t>, replacing t with t′ in such a way that <h, r, t′> is not part of the original graph.
In order to produce more realistic negative triples, we randomly pick t′ such that its type is in the range of the relation r [54]. This can either be done automatically by using entities which appear as the tail of that relation in the set of positive triples, or by using ontological information if it is available.
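A minimal sketch of the automatic variant, in which the range of a relation is approximated by the tails observed with it in the positive triples. The function name `corrupt_tails`, the fixed seed, and the `svm` example triple are assumptions made for illustration only.

```python
import random


def corrupt_tails(triples, seed=0):
    """Create at most one negative triple per positive by replacing its tail."""
    rng = random.Random(seed)
    known = set(triples)
    # Approximate the range of each relation by the tails observed with it.
    tails_by_relation = {}
    for _, relation, tail in triples:
        tails_by_relation.setdefault(relation, set()).add(tail)
    negatives = []
    for head, relation, tail in triples:
        # Keep only replacements that do not recreate a known positive triple.
        candidates = sorted(
            t for t in tails_by_relation[relation]
            if t != tail and (head, relation, t) not in known
        )
        if candidates:
            negatives.append((head, relation, rng.choice(candidates)))
    return negatives


# Hypothetical positive triples; only the last one is quoted in the paper.
positives = [
    ("link_prediction", "usesMethod", "neural_network"),
    ("sentiment_analysis", "usesMethod", "svm"),
    ("sentiment_analysis", "usesMaterial", "twitter_data"),
]
negatives = corrupt_tails(positives)
```

In this toy input the relation usesMaterial has a single observed tail, so no negative triple is produced for it. A generated negative is only assumed to be false because it is absent from the graph, so some negatives may in fact be true statements.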

C. CONVERTING TRIPLES INTO FEATURE VECTORS
After both positive and negative examples are included in the graph, all triples are converted into labeled feature vectors that are provided to the neural classifier for both training and testing. For this purpose, SciCheck uses an extensible set of neighbourhood-aware features specifically tailored to scholarly information, which represent the neighbourhoods  of the two entities of a triple in a variety of ways. The neighbourhood of an entity is considered to be the set of all other entities that can be reached from it using an oriented path (i.e., the direction of links matters) in a certain number of hops. This number of hops is called the neighbourhood's ''radius''. Fig. 2 shows a KG that will be used as an example in the discussion of the features.
Each triple is evaluated by all features. The values associated to the triple for each feature form the triple feature vector.
Each feature can also depend on a number of parameters, such as a maximum neighbourhood radius. These features, and their rationales, are as follows:
• $f_1$: Number of entities in the neighbourhood of radius $r$ of the head and the tail of a triple. For example, in Fig. 2, three entities can be reached in total using up to two hops from link_prediction, namely, neural_network, dbpedia, and triple_classification. Note that the entity accuracy is not reachable because the graph is oriented.
• $f_2$: Index of N-path centrality [55] of the head and tail of a triple. This feature assesses how well-connected an entity is to the rest of the graph in relative terms. It is defined as follows: for every vertex $v$ of a graph $G = (V, E)$, the N-path centrality $C_k(v)$ is defined as the sum, over all possible source nodes $s$, of the probability that a message originating from $s$ goes through $v$, assuming that the message traversals are only going along random simple paths of at most $k$ edges. For example, in the KG shown in Fig. 2, the entity dbpedia has a higher N-path centrality than accuracy, since a random path from any entity in the graph is more likely to go through the former than the latter, considering the directionality of the graph.
• $f_5$: Adamic-Adar index [56] between the head and tail. This index gives higher scores to entities whose neighbourhoods are smaller. It complements the previous two features, since a higher number of shared nearby entities is likely to be less significant if head and tail have a very large number of connections. It is defined as the sum of the inverse logarithmic degree centrality of the neighbors shared by the two nodes:
$$A(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|}$$
where $x$ and $y$ are the head and tail entities, and $N(u)$ is the set of nodes adjacent to $u$.
• $f_6$: Paths of length $r$ between the head and tail. For example, in Fig. 2, the entities link_prediction and dbpedia are connected by a path of length 2, by means of the triples <link_prediction, usesMethod, neural_network> and <neural_network, usesMaterial, dbpedia>. Additionally, the relations that are present in those paths are also encoded using an r-hot vector.
• $f_7$: Cosine similarity of the word embeddings of the head and tail. This feature measures the semantic similarity of the two entities in a triple, using any entity embeddings. If we consider $A$ and $B$ to be the embeddings of the head and tail entities of the triple respectively, it is defined as:
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
• $f_8$: Dot product of the word embeddings of the head and tail entities. This feature complements the previous one by also taking into account the magnitudes of the embeddings of the entities. If we consider $A$ and $B$ to be the embeddings of the head and tail entities of the triple respectively, it is defined as:
$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$
where $n$ is the dimension of the embeddings.
• $f_9$: Types of the head and tail entities according to the ontology of the KG. This feature encodes the known types of the entities according to the available ontology as two one-hot vectors. In Fig. 2, the entity dbpedia has type Resource, while accuracy is a Metric.
Regarding the rationales of the features, $f_1$ and $f_2$ leverage the fact that large neighbourhoods are more prone to contain unrelated information, while smaller ones are usually more specific. This is especially true in the scholarly domain, since, as an example, the entity neural_network may be mentioned in a large number of papers and proposals that are not directly related to each other.
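Three of the graph-based features above can be sketched in a few lines: the oriented neighbourhood of a given radius, the index built from the inverse logarithmic degree of shared neighbours, and the number of oriented paths between head and tail. This is an illustrative sketch rather than the SciCheck implementation. The toy edge list only approximates Fig. 2 (the edges touching triple_classification and accuracy are hypothetical), relation labels are omitted, and the shared-neighbour index is computed on an undirected view of the graph, which is an assumption.

```python
import math

# Toy oriented graph: entity -> set of entities it points to.
GRAPH = {
    "link_prediction": {"neural_network", "triple_classification"},
    "neural_network": {"dbpedia"},
    "accuracy": {"link_prediction"},
}


def neighbourhood(graph, entity, radius):
    """Entities reachable from `entity` along oriented edges within `radius` hops."""
    frontier, seen = {entity}, set()
    for _ in range(radius):
        frontier = {n for e in frontier for n in graph.get(e, ())} - seen - {entity}
        seen |= frontier
    return seen


def undirected(graph):
    """Adjacency sets that ignore edge direction."""
    adjacent = {}
    for source, targets in graph.items():
        for target in targets:
            adjacent.setdefault(source, set()).add(target)
            adjacent.setdefault(target, set()).add(source)
    return adjacent


def shared_neighbour_index(adjacent, x, y):
    """Sum of 1 / log(degree) over the neighbours shared by x and y."""
    shared = adjacent.get(x, set()) & adjacent.get(y, set())
    # Degree-1 nodes are skipped to avoid dividing by log(1) = 0.
    return sum(1.0 / math.log(len(adjacent[u])) for u in shared if len(adjacent[u]) > 1)


def count_paths(graph, source, target, max_len, visited=None):
    """Number of simple oriented paths of at most `max_len` edges."""
    if max_len == 0:
        return 0
    visited = (visited or set()) | {source}
    total = 0
    for nxt in graph.get(source, ()):
        if nxt == target:
            total += 1
        elif nxt not in visited:
            total += count_paths(graph, nxt, target, max_len - 1, visited)
    return total


reachable = neighbourhood(GRAPH, "link_prediction", 2)
index = shared_neighbour_index(undirected(GRAPH), "link_prediction", "dbpedia")
paths = count_paths(GRAPH, "link_prediction", "dbpedia", 2)
```

With this toy graph, three entities are reachable from link_prediction within two hops, and accuracy is not among them because its only edge points towards link_prediction.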
The features that measure the similarities of two neighbourhoods ($f_3$, $f_4$, and $f_5$) follow the intuition that correct triples have a higher number of shared entities in their respective neighbourhoods than incorrect ones, as shown by previous research efforts [18], [27], [57].
Feature $f_6$ measures the number of paths between two entities because a correct triple will typically have a larger number of unique paths of a given maximum length between head and tail than an incorrect one. Furthermore, the information about which relations are comprised by those paths can be useful since the semantic meaning of a path changes depending on the relevant relations.
Features $f_7$ and $f_8$ incorporate information from the word embeddings of the two entities, which have been shown to be advantageous for triple classification [25], [31]. SciCheck uses by default the RoBERTa model [37] to generate the word embeddings, since it is able to capture and represent semantic similarities across a wide range of domains.
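The two embedding-based features reduce to standard vector operations. The sketch below uses plain Python lists in place of real RoBERTa embeddings, which would have to be produced by a separate model; the example vectors are made up.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Made-up two-dimensional "embeddings" for a head and a tail entity.
head_embedding = [1.0, 2.0]
tail_embedding = [2.0, 4.0]
similarity = cosine_similarity(head_embedding, tail_embedding)
product = dot(head_embedding, tail_embedding)
```

The two vectors point in the same direction, so their cosine similarity is 1 regardless of length, whereas the dot product also grows with their magnitudes. This is the sense in which the second feature complements the first.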
Finally, feature $f_9$ leverages the ontological schema of the KG. This allows SciCheck to include information regarding the types of the two entities in a triple into the feature vector for that triple. Furthermore, SciCheck can automatically classify a triple as incorrect if the triple does not respect the domains and ranges of the relation as defined in the ontological schema. For example, in the KG shown in Fig. 2, the triple <accuracy, evaluatesTask, rdf_graph> would be considered incorrect without further evaluation, because the range of the relation evaluatesTask is Task, while rdf_graph is a Material.
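The type encoding and the domain/range filter can be illustrated as follows. This is a sketch under assumptions: the type labels follow the five entity types of AI-KG, the type assigned to link_prediction is a guess, and the domain of evaluatesTask is left unconstrained because the text only states its range. The dictionary and function names are hypothetical.

```python
ENTITY_TYPES = ["Task", "Method", "Material", "Metric", "OtherEntity"]

# Entity types; accuracy and rdf_graph follow the text, link_prediction is assumed.
TYPE_OF = {
    "accuracy": "Metric",
    "rdf_graph": "Material",
    "link_prediction": "Task",
}

# relation -> (domain, range); None means "unconstrained" in this sketch.
SCHEMA = {"evaluatesTask": (None, "Task")}


def one_hot(entity_type):
    """One-hot vector over ENTITY_TYPES (all zeros for an unknown type)."""
    return [1 if t == entity_type else 0 for t in ENTITY_TYPES]


def type_features(triple):
    """Concatenated one-hot type vectors of head and tail."""
    head, _, tail = triple
    return one_hot(TYPE_OF.get(head)) + one_hot(TYPE_OF.get(tail))


def respects_schema(triple):
    """False if the head or tail type violates the relation's domain or range."""
    head, relation, tail = triple
    domain, range_ = SCHEMA.get(relation, (None, None))
    head_ok = domain is None or TYPE_OF.get(head) == domain
    tail_ok = range_ is None or TYPE_OF.get(tail) == range_
    return head_ok and tail_ok


rejected = respects_schema(("accuracy", "evaluatesTask", "rdf_graph"))
accepted = respects_schema(("accuracy", "evaluatesTask", "link_prediction"))
```

A triple that fails the check can be labelled incorrect immediately, without being passed to the classifier.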
SciCheck makes use of a much more comprehensive set of features than the original CAFE, which in turn allows a better characterization of entities and predicates. In particular, the features based on word embeddings enable SciCheck to exploit the implicit contextual information from the training papers that may not be encoded in the KG. Additionally, the inclusion of ontology-based features allows SciCheck to take advantage of the available high-level knowledge about any specific domain. These improvements are particularly crucial for assessing scientific claims, which tend to use a specific jargon and to rely on a well defined epistemological framework.
Furthermore, different types of relations in the graph may carry specific insight that should be captured separately. For this reason, SciCheck first computes all features in the input KG as-is, and then it computes them again in different versions of the KG where only relations of a single type are present. This is done for all the different relations in the KG. Additionally, in features that use the neighbourhoods of the head and tail entities such as $f_1$ or $f_3$, these two neighbourhoods are calculated using all possible combinations of relations. Finally, SciCheck concatenates all the resulting features in the final feature vector.
The features which involve computing entity neighbourhoods or paths (from $f_1$ to $f_6$) use a maximum number of hops for their computations. Following the findings in [27], by default SciCheck computes them for a maximum number of hops $num_{hops}$ of 1, 2, and 3. The resulting sets of features using different radii are eventually all added to the final feature vector. Considering all the possible combinations with the number of different relations in the graph, which also affects the size of the feature vector as described previously, the number of total features is $num_{hops} \times 6 \times \#rels^2 + 3 \times \#rels$, where $\#rels$ is the number of distinct relations in the KG.

D. GROUPING FEATURE VECTORS
SciCheck creates one classifier per relation, under the assumption that the specific information needed to correctly classify triples may vary depending on the specific relation. After all triples have been converted into feature vectors in the previous step, they are grouped by the relation present in the triple, and passed on to the relevant classifier.
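The grouping step amounts to bucketing labelled feature vectors by relation. A minimal sketch, assuming each example is a (triple, feature vector, label) tuple; this layout and the name `group_by_relation` are illustrative choices, and the feature values are made up.

```python
from collections import defaultdict


def group_by_relation(examples):
    """Map each relation to the (feature_vector, label) pairs of its triples."""
    groups = defaultdict(list)
    for (head, relation, tail), vector, label in examples:
        groups[relation].append((vector, label))
    return dict(groups)


# Made-up feature vectors; label 1 marks a positive triple, 0 a negative one.
examples = [
    (("link_prediction", "usesMethod", "neural_network"), [0.9, 3.0], 1),
    (("link_prediction", "usesMethod", "accuracy"), [0.1, 0.0], 0),
    (("neural_network", "usesMaterial", "dbpedia"), [0.7, 2.0], 1),
]
groups = group_by_relation(examples)
```

Each value in `groups` is then the training set of the classifier for that relation.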

E. TRAINING AND EVALUATING THE MODELS
SciCheck trains a neural network-based classifier model for each relation using the resulting feature vectors. We generate multiple models, so that each classifier has a high specialization in addressing the target relation.
VOLUME 10, 2022
It is also advantageous to consider different neighbourhood radii that might carry information of different nature. For this reason, each of these classifiers is composed of several sub-models that consider only the features computed using a specific radius value on the sub-graph of a specific relation as in [27]. They are combined into a single classifier model by using an additional layer with a single neuron, which receives the outputs of all sub-models and combines them into a single output.
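The combining layer can be pictured as a single neuron applied to the outputs of the sub-models. The sketch below assumes a sigmoid activation for that neuron and uses made-up weights and scores; in practice the weights would be learned jointly with the sub-models.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def combine(sub_scores, weights, bias):
    """Single neuron: weighted sum of sub-model outputs squashed into [0, 1]."""
    z = sum(w * s for w, s in zip(weights, sub_scores)) + bias
    return sigmoid(z)


# Hypothetical outputs of three sub-models (e.g., radius 1, 2, and 3).
sub_scores = [0.9, 0.6, 0.4]
weights = [1.5, 1.0, 0.5]  # made-up values; learned in practice
score = combine(sub_scores, weights, bias=-1.0)
```

A larger weight lets the combined classifier rely more on the sub-model whose radius or relation sub-graph is most informative for the target relation.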
This step involves the use of a flexible neural classifier, which can be fine-tuned for the KG in question. The hyperparameters used in the evaluation are discussed in Section IV-A.

IV. EVALUATION
This section reports and discusses the evaluation of SciCheck. It also describes the evaluation data, including the new benchmarks that we created from the AI-KG Knowledge Graph (AIKG-1M and AIKG-500 are discussed in Section IV-A, and they are available at https://zenodo.org/record/5764114).

A. EVALUATION PROTOCOL
We evaluated the performance of SciCheck on seven benchmarks against nine alternative approaches. Five of the baselines are well-known embedding-based KG completion approaches: TransE, TransD, TransH, SimplE, and ComplEx [24], [26], [31], [53], [58]. To provide a common ground to train and test these techniques, we used the OpenKE [59] tool.
In order to assess the contributions of the different components of SciCheck, we also considered four alternative versions of our approach:
• CAFE Baseline, which uses solely the context-aware features for KG completion (such as neighbourhood size, shared entities, and connectivity) from the original implementation [27].
• CAFE + RoBERTa, which extends CAFE by considering features based on the similarity of the embeddings of head and tail, using the RoBERTa model.
• CAFE + SciBERT, which extends CAFE by considering features based on the similarity of the embeddings of head and tail, using SciBERT, an alternative BERT-based text embedding model specifically tailored to scientific documents.
• CAFE + Ontology, which extends CAFE by considering features that identify the types of head and tail according to the domain ontology (e.g., AI-KG ontology) and also filters triples whose entities are not consistent with the domain and range restrictions of the relation.
These methods were evaluated on the following benchmarks, whose characteristics are summarized in Table 1:
• AIKG-1M, a new dataset that we created from AI-KG. We used a de-reified version of AI-KG, in order to consider only triples which involve tasks, methods, materials, metrics, and other scientific entities. As a result, 1,075,652 triples were directly generated from scientific literature, without considering facts that were materialized using the domain semantics defined in the AI-KG ontology (e.g., transitivity). Triples were split into a training and a testing set with a split ratio of 80%-20%, respectively. To generate negative triples in the testing split, each positive triple was corrupted once by randomly replacing the tail entity with another one within the range of the relation in the triple, i.e., if the type of the tail entity is Task, then it is substituted by another entity whose type is Task. We also make sure that the randomly generated negative triple is not already present in the KG, to prevent creating false negatives whenever possible. As an example, the triple <dbpedia, usesOtherEntity, sparql_query> is correct, while the corrupted version <dbpedia, usesOtherEntity, cost_function> is considered incorrect, where sparql_query and cost_function are both of type OtherEntity. However, negative examples were not generated for the training split, as specific KG completion techniques usually have a preferred way to generate them automatically [60]. In total, the training split comprises 860,512 positive triples and the testing split includes 430,280 triples (50% positive and 50% negative).
• AIKG-500, a new dataset that we constructed by manually annotating triples in AI-KG about the Semantic Web. To construct it, we randomly selected 250 triples which had as their head one of the 24 sub-topics of the Semantic Web according to the CSO ontology [61] and were considered to be correct by at least 2 methods among TransE, TransD, TransH, SimplE, ComplEx, and SciCheck. Another 250 triples were randomly selected out of those deemed incorrect by at least 2 techniques. The resulting 500 triples were manually annotated by five domain experts, with an inter-reviewer agreement of 0.61 (according to Cohen's kappa), which is typically considered a substantial agreement. A majority vote approach was used to determine that 221 triples were correct and 279 were incorrect. Since this dataset was created for the purpose of providing a small but high-quality and manually-annotated testing split, in this evaluation we used AIKG-1M for the training split.
• WN11 [34], a subset of WordNet centered around different semantic relations between over 38K words.
• WN18RR [63], which improves WN18 by removing reciprocal relations in the test set. This makes triple classification more challenging, since otherwise the model can predict that a triple < a, hasChildren, b> is true whenever the triple < b, hasParent, a> appears in the training set.
• NELL [39], a subset of the NELL KG [35] with information and relations about many different domains, e.g., actors who starred in movies, writers and their works, or athletes and their teams.
It is well known [63] that these traditional benchmarks suffer from information leakage between the training and test sets, due to the presence of reciprocal relations. For this reason, we removed all reciprocal relations in all datasets except WN18, since we also include its previously discussed sanitized version, WN18RR.
To predict the correctness of a triple, we used feedforward neural networks with 3 intermediate layers containing 128, 64 and 32 neurons, respectively. The output neuron uses a sigmoid function, returning a confidence score in the interval [0, 1]. The classifier was trained throughout 100 epochs, using a binary cross-entropy loss function.
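The architecture described above can be sketched as a plain forward pass together with the loss. This is only an illustration: the ReLU activation in the hidden layers, the Glorot-style initialization, and the input size are assumptions not stated in the text, and the 100-epoch training loop is omitted (a deep learning framework would normally handle it).

```python
import math
import random

LAYER_SIZES = [128, 64, 32, 1]  # three intermediate layers plus the output neuron


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_mlp(n_inputs, seed=0):
    """Random weights per layer (Glorot-uniform initialization is an assumption)."""
    rng = random.Random(seed)
    layers, fan_in = [], n_inputs
    for size in LAYER_SIZES:
        limit = math.sqrt(6.0 / (fan_in + size))
        weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(size)]
        layers.append((weights, [0.0] * size))
        fan_in = size
    return layers


def forward(layers, x):
    """Confidence score in [0, 1] for one feature vector."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i == len(layers) - 1:
            x = [sigmoid(v) for v in z]  # sigmoid output neuron, as in the text
        else:
            x = [max(0.0, v) for v in z]  # ReLU hidden activation (assumed)
    return x[0]


def binary_cross_entropy(score, label, eps=1e-12):
    """Loss for a single example with label 0 or 1."""
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))


layers = init_mlp(n_inputs=10)  # 10 input features is an arbitrary choice
score = forward(layers, [0.5] * 10)
loss = binary_cross_entropy(score, 1)
```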
Since SciCheck is a triple classifier, we evaluated its effectiveness by comparing the labels it predicted for the triples in the testing set against the ground truth. The results are thus reported in terms of precision and recall, which have recently become standard metrics to evaluate KG completion, since they can be more informative than MRR and Hits@N in many practical settings [64], [65]. In this paper, we specifically focus on precision, since we have the concrete objective of extending AI-KG and this can only be reliably done using a method with a high precision.
Table 2 and Table 3 report the precision and recall of the KG completion techniques on AIKG-1M. To determine whether a triple was correct or incorrect, we used a confidence threshold of 0.5 for SciCheck, as suggested in [27]. The thresholds of the other state-of-the-art techniques under evaluation and their results were obtained using the OpenKE [59] tool, allowing it to choose the optimal value for each one.
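For reference, precision and recall at a fixed confidence threshold can be computed as in the sketch below. The scores and labels are made up, and treating a score exactly equal to the threshold as a positive prediction is an assumption.

```python
def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall when scores >= threshold are predicted as correct."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up confidence scores and ground-truth labels (1 = correct triple).
scores = [0.9, 0.8, 0.4, 0.6]
labels = [1, 0, 1, 1]
precision, recall = precision_recall(scores, labels)
```

Raising the threshold typically trades recall for precision, which is the relevant direction when the goal is to add only trustworthy triples to a KG.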

B. RESULTS AND DISCUSSION
All CAFE variants outperform embedding-based techniques in precision, achieving notably higher values. Including features from the text embeddings also provides an important improvement over the base version of CAFE. Both SciCheck and the variants that improve the baseline using embedding-based features rank consistently among those with the highest precision for all relations, with the differences between them being very narrow.
The best performing method in terms of precision is the final version of SciCheck (0.74), followed by RoBERTa (0.73), which can obtain better precision for some less common relationships. Interestingly, using text embeddings trained specifically on academic abstracts (SciBERT) yields a slightly worse performance than using the generic RoBERTa model. This may suggest that more general embeddings may sometimes produce better performance on KGs of research concepts, but this needs to be investigated further.
The Ontology variation, which includes one-hot type vectors and domain/range checking for the relation, only slightly improves the baseline. This is most likely due to the type-constrained way in which the negative triples were generated, since it already guarantees that the domain and range types of the relation are preserved.
The recall of SciCheck is naturally lower than that of the embedding-based approaches, in a typical precision-recall trade-off. However, this is acceptable since the main goal is to expand scientific KGs with correct triples; hence, a high precision is necessary. SciCheck also has a generally higher recall than all other CAFE variants. Consequently, the results suggest that SciCheck is the best performing technique for the task of reliably completing scientific KGs.
It is noteworthy that different relations can lead to very different performance. For instance, relations such as narrower, supportsTask and supportsMethod yield very good performance. Conversely, the methods did not perform as well on relations such as evaluatesTask and evaluatesOtherEntity. This may depend on the number of relevant examples or the fact that some relations are inherently harder to predict. The role of different relations in the context of completing scientific KG requires further analysis.
In order to study the performance of the different techniques for all possible threshold values, we also report their corresponding ROC curves in Fig. 3. This analysis confirms the previous findings: 1) SciCheck outperforms all the other methods, 2) text embedding features significantly improve the baseline, and 3) the ontological features slightly improve the baseline. In addition, Fig. 3(b) confirms that SciCheck outperforms the standard state-of-the-art methods regardless of the threshold.
To check whether the differences between the methods were statistically significant, we used DeLong's test [66] to compare the areas under two curves. The p-values obtained when comparing the ROC curve of SciCheck with the alternative methods in Fig. 3(a) and Fig. 3(b) were all < 0.0001. This very high statistical confidence is due to the large number of observations, since the testing set of AIKG-1M includes more than 400,000 triples.
Table 4 shows the performance of the methods on AIKG-500, which is consistent with the previous findings. For the sake of brevity, here we do not report the results of all CAFE variants, which are in line with those obtained on AIKG-1M. Even in a smaller, manually annotated benchmark, SciCheck achieves a high precision, which confirms that it is suitable for completing scientific KGs.
Table 5 reports the performance of all the techniques on five standard benchmarks for triple classification. The results show that SciCheck is able to outperform other techniques in almost all cases, thus being an effective triple classification tool for KGs of many different natures. They also confirm that completing scientific KGs is indeed a challenging task that requires specialized techniques, as the general purpose embedding-based approaches yield worse results on benchmarks extracted from AI-KG in comparison to generic ones.
In order to assess the scalability of our solution, Table 6 reports the time in seconds taken by SciCheck to process the previously discussed datasets. To account for run-to-run variability, we measured the runtime for each benchmark 10 times, and we report the average and the standard deviation for each one. Table 6 shows that the runtime ranges from a few seconds to over two hours depending on the dataset. These differences are mainly caused by two factors. First, the number of distinct entities corresponds directly to the number of RoBERTa embeddings that have to be computed, which is typically quite time-consuming. Hence, a larger number of entities has a negative impact on runtime. Second, and most importantly, the specific topology of every KG affects the size of the neighborhoods of the entities, and thus also affects the time it takes to compute features on them. The case of FB13 is particularly noteworthy since, in contrast with the other datasets, it contains many entities with a very high cardinality. This causes the entity neighborhoods to grow exponentially in size, resulting in longer runtimes.
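The measurement protocol described above (repeated runs per benchmark, reporting average and standard deviation) can be sketched as follows. This is a minimal illustration under stated assumptions: `time_runs` and the dummy workload are hypothetical names, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
import time


def time_runs(workload, repetitions=10):
    """Run `workload` several times and return (mean, standard deviation) in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    # statistics.stdev is the sample standard deviation (assumption, see lead-in).
    return statistics.mean(durations), statistics.stdev(durations)


def dummy_workload():
    """Stand-in for processing one benchmark; replace with the real pipeline call."""
    sum(i * i for i in range(10_000))


mean_seconds, std_seconds = time_runs(dummy_workload, repetitions=10)
print(f"{mean_seconds:.6f} s +/- {std_seconds:.6f} s")
```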
Finally, in order to establish a fair comparison with the existing embedding-based KG completion approaches, Table 7 reports their runtime in seconds compared to that of SciCheck for the AIKG-1M dataset. Embedding-based KG completion approaches were run using 1,000 iterations, as is commonly done in related studies [24], [26], [31], [53]. SciCheck took considerably less time to run on the large AIKG-1M dataset than its state-of-the-art counterparts. This suggests that SciCheck is more scalable and can realistically be used on large-scale scientific KGs.

V. USE CASE: AI-KG
A real-world use case for SciCheck involves the development and extension of AI-KG [2], a large-scale knowledge graph about research entities from the AI domain. AI-KG was released in late 2020 and includes about 14M RDF triples and 1.2M reified statements about 800K entities extracted from 333K articles in the field of AI. It describes 5 types of entities (tasks, methods, materials, metrics, others) linked by 27 relations (e.g., usesMaterial, evaluatesMethod, supportsTask). AI-KG statements characterize the relationships between two entities according to their description in a set of scientific articles, e.g., <sentiment_analysis, usesMaterial, twitter_data>.
It is important to note that in AI-KG a triple associated with a set of papers is considered true if the papers actually contain that claim. Analyzing the general truth value of each claim is not currently possible. Therefore, triples in AI-KG are designed as a means of representing specific claims made by researchers.
For example, the entity sentiment_analysis only represents the concept or idea of sentiment analysis as it is described in the original corpus of papers; it is not intended to represent or include all the prototypes and implementations for predicting sentiments and emotions that are available today. In fact, such a modeling would require promoting research entities from concepts to classes in order to describe specific ontological knowledge (e.g., by defining an ontology that describes how sentiment analysis prototypes can use datasets and machine learning approaches), which is out of the scope of AI-KG.
For instance, a triple <deep_model_cnn, usedByTask, toxicity_detection> from the paper [57] should be interpreted in the context of the same paper [57], i.e., deep model cnn is used for toxicity detection in [57] and, more broadly, some deep model cnn can be used for toxicity detection. Neither the interpretation that all deep model cnn are used for toxicity detection nor the interpretation that deep model cnn must be used for toxicity detection is correct according to the design and use of the current implementation of AI-KG.
AI-KG is adopted by several organizations for characterizing the AI domain and it has been used for supporting several research efforts, e.g., for extracting entities from scientific publications [67], describing competencies [52], and classifying scholarly articles [68]. AI-KG was generated by using Natural Language Processing (NLP) and Machine Learning (ML) methods for extracting entities and their relationships [69]. More specifically, AI-KG adopts a pipeline that is applied to natural language scientific texts to (i) detect entities using a domain-specific extractor based on transformers [70] and a topic classifier developed on top of the CSO ontology [71]; (ii) identify relationships between entities by using open- and domain-specific ML and NLP tools [70], [72], [73]; and (iii) define which facts make sense according to an ontology representing the domain semantics. In addition, to determine whether a fact makes sense, the authors adopted a support score defined as the number of research papers from which the fact was extracted.
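The support score mentioned above, i.e., the number of research papers from which a fact was extracted, can be illustrated with a short sketch. The function name, the input format (pairs of paper identifier and triple), and the toy paper identifiers are assumptions made for illustration; they do not reproduce the actual AI-KG pipeline.

```python
from collections import defaultdict


def support_scores(extractions):
    """Map each triple to the number of distinct papers it was extracted from.

    `extractions` is an iterable of (paper_id, (subject, relation, object)) pairs
    (hypothetical input format).
    """
    papers_per_triple = defaultdict(set)
    for paper_id, triple in extractions:
        # A set ensures that a paper mentioning the same fact twice counts once.
        papers_per_triple[triple].add(paper_id)
    return {triple: len(papers) for triple, papers in papers_per_triple.items()}


# Toy extractions with made-up paper identifiers.
toy_extractions = [
    ("paper_a", ("sentiment_analysis", "usesMaterial", "twitter_data")),
    ("paper_b", ("sentiment_analysis", "usesMaterial", "twitter_data")),
    ("paper_b", ("sentiment_analysis", "usesMaterial", "twitter_data")),
    ("paper_c", ("deep_model_cnn", "usedByTask", "toxicity_detection")),
]
print(support_scores(toy_extractions))
```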
The reader can find more details about this methodology in [2], [69]. The current version of AI-KG consists of research entities belonging to one of the following classes:
• Task: A research challenge or a certain work to perform.
• Method: A research proposal or approach whose aim is to perform a certain task.
• Material: Resources that are employed for a certain research task, e.g., a dataset, an image, a text corpus.
• Metric: Entities that can be quantified and are used to measure the quality of a certain method.
• OtherEntity: A class used to group entities that cannot be classified in any of the previous ones.
The relations were created by clustering frequent verbs and asking human experts to define domain and range restrictions as well as transitivity. Some examples of object properties are evaluatesMethod, includesMaterial, and usesMethod. The ontology of AI-KG is available online (see footnote 8).
Although the extracted facts compose a large-scale KG, mining such knowledge from natural language is an error-prone and challenging task and, therefore, the resulting KG tends to have low coverage, i.e., well-known facts might not be materialized within the KG. As a result, AI-KG is sparse and incomplete. For example, the well-known fact <neural_network, usesMaterial, rdf_graph> cannot be found in the current AI-KG resource despite the fact that RDF graphs are the input of most of the existing neural network-based link prediction and triple classification algorithms.
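To make the role of the domain and range restrictions concrete, the following sketch checks whether a triple is admissible given the classes of its subject and object. The entity typing and the restrictions listed here are illustrative assumptions inferred from the example triples in this section; the actual restrictions are those defined in the AI-KG ontology.

```python
# Hypothetical entity typing, using the five AI-KG classes.
ENTITY_TYPE = {
    "sentiment_analysis": "Task",
    "neural_network": "Method",
    "twitter_data": "Material",
    "rdf_graph": "Material",
    "f_measure": "Metric",
}

# Hypothetical (domain, range) restrictions per relation; not the real ontology.
RESTRICTIONS = {
    "usesMaterial": ({"Task", "Method"}, {"Material"}),
    "evaluatesMethod": ({"Metric"}, {"Method"}),
}


def satisfies_restrictions(triple):
    """Return True if the subject type is in the domain and the object type is in the range."""
    subject, relation, obj = triple
    if relation not in RESTRICTIONS:
        return False
    domain, range_ = RESTRICTIONS[relation]
    # Unknown entities have type None, which is never in a domain or range set.
    return ENTITY_TYPE.get(subject) in domain and ENTITY_TYPE.get(obj) in range_


print(satisfies_restrictions(("neural_network", "usesMaterial", "rdf_graph")))
print(satisfies_restrictions(("rdf_graph", "evaluatesMethod", "neural_network")))
```

A check of this kind can only rule out ill-typed candidates; it says nothing about whether a well-typed triple is actually supported by the literature.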
For this reason, scientific KGs call for specific approaches for their completion [47]. However, state-of-the-art methods developed for general-domain KGs, such as TransE, TransR, and RotatE, fail to predict triples with good accuracy on AI-KG.
As reported in Section IV, these methods yield decent F1-measures, but suffer from a low precision (typically around 45-60%). Their adoption would thus introduce too many incorrect facts in the graph. The poor results of the existing techniques motivated this use case.
We applied SciCheck to AI-KG and, using a confidence threshold of 0.7, materialized 303,760 additional facts. Specifically, we used SciCheck to connect the 500 most frequent entities according to the relations defined in the AI-KG ontology. These include many significant facts that were missed by the information extraction pipeline, such as <search_engine, includesMaterial, knowledge_base>, <f_measure, evaluatesMethod, neural_network>, <neural_network, usesMaterial, rdf_graph>, or <recommender_system, usesMethod, predictive_model>. The new version of AI-KG is available online at https://zenodo.org/record/7276434.
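The materialization step described above can be sketched as a loop over candidate triples that keeps those whose confidence reaches the threshold. This is a sketch under stated assumptions rather than the actual procedure: `materialize_new_facts` and `score_fn` are hypothetical names (the latter stands in for the trained classifier), the exhaustive enumeration of ordered entity pairs is an assumption, and so is the use of `>=` when comparing against the threshold.

```python
from itertools import permutations


def materialize_new_facts(entities, relations, score_fn, existing, threshold=0.7):
    """Return candidate triples not already in the KG whose score reaches the threshold."""
    new_facts = []
    for subject, obj in permutations(entities, 2):  # ordered pairs, no self-loops
        for relation in relations:
            triple = (subject, relation, obj)
            if triple in existing:
                continue  # already materialized in the KG
            if score_fn(triple) >= threshold:
                new_facts.append(triple)
    return new_facts


# Toy confidence values (made up) standing in for the classifier output.
toy_confidence = {
    ("search_engine", "includesMaterial", "knowledge_base"): 0.92,
    ("knowledge_base", "includesMaterial", "search_engine"): 0.31,
}


def toy_score_fn(triple):
    return toy_confidence.get(triple, 0.0)


print(materialize_new_facts(
    ["search_engine", "knowledge_base"], ["includesMaterial"], toy_score_fn, existing=set()
))
```

Under the exhaustive-enumeration assumption, 500 entities and 27 relations would give 500 × 499 × 27 = 6,736,500 candidates before any domain and range filtering, so pruning ill-typed candidates first would reduce the number of classifier calls.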

VI. CONCLUSION
In this paper, we introduced SciCheck, a new approach for completing scientific facts in knowledge graphs of research concepts. We evaluated SciCheck on two new benchmarks (AIKG-1M and AIKG-500) extracted from the Artificial Intelligence Knowledge Graph (AI-KG) [2], a large-scale KG of research concepts, and on five well-known general benchmarks for link prediction (FB13, WN11, WN18, WN18RR, and NELL). The experiments show that SciCheck outperforms nine alternative approaches in terms of precision. Furthermore, we presented a real-world use case in which we used SciCheck to complete AI-KG, producing a new version that includes more than 300K additional statements (a 28% increase).
As future work, we plan to study the application of KG completion techniques to hypothesis generation and to extend SciCheck in this space. We also plan to consider weighted triples [50], [74] that could formalize the degree of certainty in specific statements. In addition, we intend to incorporate new features that could further improve recall. Finally, we look forward to applying our methodology to other scientific KGs, such as the Open Research Knowledge Graph [4] and Nanopublications [13].
INMA HERNÁNDEZ is an Associate Professor at the University of Seville and a Founding Member at the Data Engineering Applications Laboratory. Her current research involves data engineering and knowledge graphs. She has authored many peer-reviewed publications on these topics in top conferences and journals.
She is a very active reviewer and a member of several program committees in major conferences. She is currently a Principal Investigator on a number of projects funded by the Spanish National Research and Development Program. Since 2020, she has been the coordinator of the master's degree in software engineering: cloud, data and IT management at the Postgraduate School of the University of Seville.
FRANCESCO OSBORNE is a Research Fellow at the Knowledge Media Institute, The Open University, U.K., where he leads the Scholarly Data Mining Team. He is also an Assistant Professor at the University of Milano Bicocca. He collaborates with major publishers, universities, and companies in the space of innovation to produce a variety of innovative services for supporting researchers, editors, and research policy makers. He has released many well-adopted resources such as the computer science ontology and the artificial intelligence knowledge graph. His research interests include artificial intelligence, information extraction, knowledge graphs, science of science, and semantic web. He has authored more than 100 peer-reviewed publications in top journals and conferences of these fields.
DIEGO REFORGIATO RECUPERO received the Ph.D. degree in computer science from the University of Naples Federico II, Italy, in 2004. He has been a Full Professor at the Department of Mathematics and Computer Science, University of Cagliari, Italy, since February 2022. From 2005 to 2008, he was a Postdoctoral Researcher at the University of Maryland, College Park, USA. He co-founded six companies in the ICT sector. He is actively involved in European projects and research (with one of his companies he won more than 40 FP7 and H2020 projects). His current research interests include sentiment analysis, semantic web, natural language processing, human-robot interaction, financial technology, and smart grid. He is the author of more than 190 conference and journal papers in these research fields, with more than 2400 citations. He has won several awards in his career (such as the Marie Curie International Reintegration Grant, the Marie Curie Innovative Training Network, the Best Researcher Award from the University of Catania, the Computer World Horizon Award, the Telecom Working Capital, the Startup Weekend, and the Best Paper Award).
DAVID RUIZ is a Full Professor of software engineering at the University of Seville. He leads the Data Engineering Applications Laboratory, University of Seville, focusing his research on data engineering, knowledge graphs, and data integration. He has recently started two new related lines of research, focused on the application of machine learning techniques for the automated retrieval and processing of aviation data, and for the genomic analysis of multi-resistant bacteria. Since 2014, he has been the Deputy Director of the School of Computer Science, University of Seville, where he has contributed to the creation of a dual bachelor's degree in computer science and mathematics, and two new postgraduate master's courses.
DAVIDE BUSCALDI is an Associate Professor at LIPN, Sorbonne Paris North University, and a part-time Assistant Professor at the Ecole Polytechnique, where he principally teaches machine learning and data science courses. He has directed or co-directed two Ph.D. theses and is currently directing three more theses in NLP and Machine Learning. He collaborates in various national and European projects. He is the author of more than 110 peer-reviewed conference and journal papers. His main research interests include natural language processing and text mining, in particular the application of modern NLP techniques to text annotation and relation extraction.
ENRICO MOTTA received the Laurea degree in computer science from the University of Pisa, Italy, and the Ph.D. degree in artificial intelligence from The Open University. He is currently a Professor of knowledge technologies at The Open University, U.K. From 2000 to 2007, he was the Director of the Knowledge Media Institute (KMi), The Open University. Over the years, he has been leading KMi's contribution to numerous high-profile projects, receiving over 10.4M in external funding since 2000, from a variety of institutional funding bodies and commercial organizations. His research spans a variety of aspects at the intersection of large-scale data integration and modeling, semantic and language technologies, intelligent systems, and human-computer interaction.
Open Access funding provided by 'Università degli Studi di Cagliari' within the CRUI CARE Agreement