RDF-star2Vec: RDF-star Graph Embeddings for Data Mining

Knowledge Graphs (KGs) such as Resource Description Framework (RDF) data represent relationships between various entities through the structure of triples (<subject, predicate, object>). Knowledge graph embedding (KGE) is crucial in machine learning applications, specifically in node classification and link prediction tasks. KGE remains a vital research topic within the semantic web community. RDF-star introduces the concept of a quoted triple (QT), a specific form of triple employed either as the subject or object within another triple. Moreover, RDF-star permits a QT to act as a compositional entity within another QT, thereby enabling the representation of recursive, hyper-relational KGs with nested structures. However, existing KGE models fail to adequately learn the semantics of QTs and entities, primarily because they do not account for RDF-star graphs containing multi-leveled nested QTs and QT–QT relationships. This study introduces RDF-star2Vec, a novel KGE model specifically designed for RDF-star graphs. RDF-star2Vec introduces graph walk techniques that enable probabilistic transitions between a QT and its compositional entities. Feature vectors for QTs, entities, and relations are derived from generated sequences through the structured skip-gram model. Additionally, we provide a dataset and a benchmarking framework for data mining tasks focused on complex RDF-star graphs. Evaluative experiments demonstrated that RDF-star2Vec yielded superior performance compared to recent extensions of RDF2Vec in various tasks including classification, clustering, entity relatedness, and QT similarity.


I. INTRODUCTION
Knowledge graphs (KGs) such as Resource Description Framework (RDF) data represent relationships between various entities in terms of triples (<subject, predicate, object>), facilitating advanced search and reasoning based on semantic relationships. Despite these capabilities, RDF faces the challenge of adequately representing relations beyond the binary. To address this limitation, RDF-star (formerly spelled RDF*) [1] has garnered considerable attention.
The associate editor coordinating the review of this manuscript and approving it for publication was Giacomo Fiumara.
Various methods for knowledge graph embedding (KGE) have been proposed in recent years, targeting data mining applications. However, two distinct limitations have been identified in existing KGE methods.
(1) RDF-star data: KGE methods for RDF exist [8], [9], but methods specifically tailored for RDF-star data are conspicuously lacking. Although there are approaches to convert RDF-star triples into regular RDF triples [10], as well as to load non-RDF data formats (e.g., CSV) [11], these techniques fail to capture the vector representations corresponding to quoted triples (QTs) and consequently compromise the original semantic representation.
(2) Inability to handle complex RDF-star data: RDF-star permits the representation of recursive hyper-relational KGs featuring nested structures. This is possible because a QT can function as a compositional entity of another QT. However, existing hyper-relational KGE methods [11], [12], [13] can handle only rudimentary structures, as shown in Figure 1(a), and fall short of capturing the semantics of complex structures, including nested QT structures and interrelations between QTs, as shown in Figure 1(b). Additionally, the absence of a publicly available complex RDF-star dataset hinders the further development of RDF-star KGE methods.
This paper introduces RDF-star2Vec, a novel graph walk-based KGE method designed to learn vector representations of normal entities, relations, and QTs in RDF-star graphs with complex structures, including multi-leveled nested QTs and relations between QTs. Specifically, graph walk methods generate sequences that permit probabilistic transitions between QTs and asserted triples in RDF-star graphs. Subsequently, feature vectors for QTs, entities, and relations are derived from the generated sequences employing the structured skip-gram model [14]. This approach allows for the direct embedding of RDF-star data into low-dimensional vector space, preserving the semantic representation of the original data. Importantly, the technique situates QTs and their compositional entities in close proximity within the generated sequences, thereby learning the N-ary relations represented by RDF-star. The method was realized through a Java implementation of RDF2Vec [8], [9] and has been released as open-source software.
Furthermore, we introduce a complex RDF-star dataset (KGRC-RDF-star), based on KGRC-RDF [15], [16], a scene KG designed for Explainable Artificial Intelligence (XAI) benchmarking and constructed from text data in mystery novels. The dataset features nested statements and scenes such as "Person A said 'Person B saw Person C was in D'." Moreover, four gold standard datasets have been constructed to assess classification, clustering, entity relatedness, and QT similarity tasks using the embeddings of KGRC-RDF-star, and they are provided through GEval [17], a KGE evaluation framework. Therefore, this study will augment the existing body of knowledge concerning KGE methods for RDF-star graphs.
The remainder of this paper is structured as follows: Section II introduces related works focusing on KGE methods for RDF, methods for hyper-relational KGs, and benchmarking datasets, and outlines the limitations of these prior works and the position of this paper. Section III introduces our approach, which combines novel graph walks and representation learning methods for RDF-star. Section IV describes the construction of our complex RDF-star graph dataset, featuring multi-leveled nested structures employed in our experimental analysis. Section V describes the evaluation tasks and the gold standard datasets, and discusses the evaluation results and parameter analysis. Section VI concludes the paper, including a brief summary and future work.

II. RELATED WORK

A. KNOWLEDGE GRAPH EMBEDDINGS FOR RDF
Numerous KGE methods have been proposed [18], encompassing graph walk-based, translation-based, and graph neural network methods. This paper focuses on graph walk-based methods applied to RDF graphs. RDF2Vec [8], [9] is a well-known graph walk-based KGE method for RDF graphs. Initially, RDF2Vec generates a sequence set using a random walk. Subsequently, vertices and edges are relabeled employing Weisfeiler-Lehman (WL) graph kernels for RDF [19]. The finalized sequence set serves as input to word2vec [20]. Cochez et al. [21] extended this by incorporating biased random walks for more semantically rich walks. Portisch et al. [22] proposed RDF2Vec Light, a streamlined variant of RDF2Vec that reduces computational complexity by generating vectors only for entities of interest. Moreover, Portisch et al. introduced an order-aware RDF2Vec (RDF2Vec_oa) [23] using structured word2vec [14], focusing on the original word2vec model's positional insensitivity. Additionally, they proposed similarity-oriented and relevance-oriented walk methods and identified the advantages of each method [24]. Steenwinckel et al. [25] showed that the WL graph kernel offers little improvement in the context of a single KG with respect to walk embeddings, and proposed five alternative walking strategies.
In this way, numerous KGE methods for RDF have been developed, and most of them were influenced by RDF2Vec. Our approach is positioned as an extension of RDF2Vec. To the best of our knowledge, our approach is the first method capable of directly representing recursive hyper-relational KGs described in RDF-star in low-dimensional vector space without any information loss.

B. HYPER-RELATIONAL KNOWLEDGE GRAPH EMBEDDINGS
Wen et al. [26] proposed m-TransH, a generalized model of TransH [27] for link prediction of hyper-relational KGs. However, m-TransH does not consider the relatedness of the components in the same hyper-relation. Guan et al. [12] proposed NaLP, a method to explicitly model the relatedness of the role-value pairs involved in the same hyper-relation. While m-TransH is grounded in a translation-based link prediction model, NaLP employs a fully connected neural network (FCN). Galkin et al. [11] proposed StarE, a graph neural network (GNN)-based link prediction model for hyper-relational KGs. StarE extended CompGCN [28] to handle qualifiers of Wikidata. However, these approaches focus on link prediction in simple hyper-relational KGs without recursive structures; therefore, their purpose differs from that of this study. In addition, they are unable to load standard RDF or RDF-star data directly. Note that these studies generated their own benchmark datasets. A comparison of the embedding methods for RDF and hyper-relational KGs is presented in Table 1.
Kwan et al. [10] proposed ExtRet, an algorithm designed to convert RDF-star graphs into regular RDF graphs. ExtRet extended the representation of standard RDF reification to minimize structural information loss when performing link prediction on binary relations. In contrast, our method can generate embeddings directly without converting the structure of the original RDF-star graphs. Therefore, no additional properties or intermediate nodes are generated by a conversion, and the embeddings can be produced for the original RDF-star data without any information loss.

C. BENCHMARKING DATASETS FOR HYPER-RELATIONAL KNOWLEDGE GRAPH EMBEDDING
A well-known dataset in the realm of hyper-relational KGs is Wikidata [29], which contains qualifiers to specify relationships and represents supplementary information in a key-value format. WikiPeople [12], a benchmark dataset consisting of information about individuals, is also extracted from Wikidata and serves the purpose of evaluating link prediction models for hyper-relational KGs. JF17K [26], another benchmark, is extracted from Freebase [30]. However, it was pointed out that WikiPeople contains numerous literal values that are conventionally ignored in KGE approaches, and JF17K has been criticized for significant test leakage. In response, Galkin et al. [11] proposed WD50K as an alternative benchmark dataset designed for link prediction tasks.
However, existing datasets lack complex structures such as multi-leveled nested QTs and QT-QT relations. Thus, we have developed a complex RDF-star dataset based on KGRC-RDF [15], [16], [31] and integrated its gold standard dataset into GEval [17], thereby enabling evaluations in classification, clustering, entity relatedness, and QT similarity tasks.

III. RDF-STAR2VEC

A. GRAPH WALKS
The proposed RDF-star2Vec is a graph walk-based embedding model based on RDF2Vec. Let e ∈ E denote an entity with a Uniform Resource Identifier (URI) (i.e., a resource in RDF) and r ∈ R denote a relation (i.e., a property in RDF).
An example of an RDF-star graph containing the complex structure used in the following description is displayed in Figure 2. Similar to the default walks in RDF2Vec, the walk path starting from e_1 is e_1 → r_1 → ≪≪ e_2 r_2 e_3 ≫ r_3 e_4 ≫ → r_6 → e_7, where ≪≪ e_2 r_2 e_3 ≫ r_3 e_4 ≫ denotes a multi-leveled nested QT. Similar to other entities, this multi-leveled nested QT is treated as a node. Thus, in the default walk, for each node n ∈ E ∪ Q (Q being the set of QTs) in a given graph G, we generate all sequences S^d_n of depth d rooted in the node n. S^d_n represents a set of sequences s'^d_n expressed in Equation 1.
where j ∈ R(n) denotes an element of the preceding node's relationships. Thus, the final sequence set generated by the default walks at depth d on the single RDF-star graph is ⋃_{n ∈ E ∪ Q} S^d_n. As a limitation, the default walks cannot accurately extract the semantics originally expressed by QTs because they cannot walk the compositional entities of QTs. Thus, we propose a new walk method, the "QT-walk," between QTs and their compositional entities. Specifically, we propose the following two types of QT-walk.
(1) qs-walk: walk from a QT to its compositional entity in the subject role
(2) oq-walk: walk from a compositional entity in the object role to the QT
For example, in Figure 2, the qs-walk generates the sequence e_1 → r_1 → ≪≪ e_2 r_2 e_3 ≫ r_3 e_4 ≫ → ≪ e_2 r_2 e_3 ≫ → r_3 → e_4 when the starting point is e_1. The oq-walk generates the sequence e_2 → r_2 → e_3 → ≪ e_2 r_2 e_3 ≫ → r_3 → e_4 when the starting point is e_2. Thus, in the QT-walk, for a given G, the sequences P^d_n of depth d rooted in each node n ∈ E ∪ Q are generated as shown in Equation 2.
where w(n_{i,j}) is a QT q_{i,j}, a triple (n^subj_{i,j}, r_{i,j}, n^obj_{i,j}), or a pair (r_{i+1,j}, n_{i+1,j}). Here, q_{i,j} is the expression of a QT with quotes ("≪" and "≫"), (n^subj_{i,j}, r_{i,j}, n^obj_{i,j}) represents the compositional triple of a QT, and (r_{i+1,j}, n_{i+1,j}) denotes the pair of predicate and object related to n_{i,j}. Therefore, the final set of sequences generated by the QT-walk of depth d on the single RDF-star graph is ⋃_{n ∈ E ∪ Q} P^d_n.
Algorithms 1 and 2 describe the graph walk of our method. Specifically, Algorithm 1 retrieves candidate QTs and triples for the graph walk, and Algorithm 2 is the QT-walk that handles the cases of w(n_{i,j}) in Equation 2. In addition, we introduce two parameters, α and β, to set the transition probabilities: α is the transition probability from a QT to its compositional entity in the subject role, and β is the transition probability from a compositional entity in the object role to the QT. The former walks deeper into the nested structure, while the latter walks out of the nested structure. The parameters are set manually in the ranges α ∈ (0, 1] and β ∈ (0, 1]. If both the qs-walk and oq-walk transitions are possible, priority is given to the oq-walk.
We also introduce the mid walk [22] to RDF-star2Vec. Instead of always starting random walks at the entities of interest, it is randomly decided for each depth iteration whether to go backward, i.e., to one of the node's predecessors, or forward, i.e., to one of the node's successors. The mid walk of the proposed method is explained in Algorithm 3.
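The qs-walk, the oq-walk, and the priority rule described above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the tuple encoding of QTs, the toy triples, and the names `label`, `collect_qts`, and `qt_walk` are assumptions made for illustration, and `depth` here simply counts transitions.

```python
import random

# Assumption: a QT is modelled as a tuple (subject, predicate, object);
# ordinary entities and relations are plain strings.
QT_INNER = ("e2", "r2", "e3")
QT_OUTER = (QT_INNER, "r3", "e4")  # multi-leveled nested QT

# Hypothetical asserted triples, chosen so that the example walks exist.
TRIPLES = [
    ("e1", "r1", QT_OUTER),
    (QT_OUTER, "r6", "e7"),
    ("e2", "r2", "e3"),
    (QT_INNER, "r3", "e4"),
]


def label(node):
    """Render an entity or a (possibly nested) QT as one sequence token."""
    if isinstance(node, tuple):
        s, p, o = node
        return f"<<{label(s)} {p} {label(o)}>>"
    return node


def collect_qts(triples):
    """Collect every QT that occurs at any nesting level of the graph."""
    found = set()

    def visit(node):
        if isinstance(node, tuple):
            found.add(node)
            visit(node[0])
            visit(node[2])

    for s, _, o in triples:
        visit(s)
        visit(o)
    return found


def qt_walk(triples, root, depth, alpha, beta, rng):
    """Generate one walk of at most `depth` transitions rooted in `root`."""
    out_edges = {}
    for s, p, o in triples:
        out_edges.setdefault(s, []).append((p, o))
    qts = collect_qts(triples)

    walk, node, last_triple = [label(root)], root, None
    for _ in range(depth):
        if last_triple in qts and rng.random() < beta:
            # oq-walk (checked first, so it has priority): object -> enclosing QT.
            node, last_triple = last_triple, None
            walk.append(label(node))
        elif isinstance(node, tuple) and rng.random() < alpha:
            # qs-walk: QT -> its subject, then on to the object of the same QT.
            s, p, o = node
            walk += [label(s), p, label(o)]
            node, last_triple = o, node
        elif out_edges.get(node):
            # Default step: follow a random asserted (predicate, object) pair.
            p, o = rng.choice(out_edges[node])
            walk += [p, label(o)]
            node, last_triple = o, (node, p, o)
        else:
            break
    return walk


rng = random.Random(0)
print(" -> ".join(qt_walk(TRIPLES, "e1", 2, alpha=1.0, beta=0.0, rng=rng)))
print(" -> ".join(qt_walk(TRIPLES, "e2", 3, alpha=0.0, beta=1.0, rng=rng)))
```

Setting both probabilities to zero in this sketch leaves only the default step, which corresponds to treating every QT as an opaque node.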
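The backward-or-forward decision of the mid walk can be sketched as follows. This is an illustrative reading of the description above rather than Algorithm 3 itself; the toy triples, the 0.5 coin flip, and the function name `mid_walk` are assumptions.

```python
import random
from collections import deque

# Hypothetical toy graph; nodes could equally be QTs in an RDF-star graph.
TRIPLES = [("a", "p", "b"), ("b", "q", "c"), ("c", "r", "d")]


def mid_walk(triples, root, depth, rng):
    """Grow a walk around `root`, extending it backward or forward per iteration."""
    successors, predecessors = {}, {}
    for s, p, o in triples:
        successors.setdefault(s, []).append((p, o))
        predecessors.setdefault(o, []).append((s, p))

    walk = deque([root])
    for _ in range(depth):
        if rng.random() < 0.5:
            # Go backward: prepend a predecessor of the current first node.
            options = predecessors.get(walk[0], [])
            if options:
                s, p = rng.choice(options)
                walk.extendleft([p, s])  # extendleft reverses, giving s, p, ...
        else:
            # Go forward: append a successor of the current last node.
            options = successors.get(walk[-1], [])
            if options:
                p, o = rng.choice(options)
                walk.extend([p, o])
    return list(walk)


print(mid_walk(TRIPLES, "b", depth=4, rng=random.Random(1)))
```

With this scheme the root can end up anywhere in the sequence, not only at its start.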

B. REPRESENTATION LEARNING
DeepWalk [32], node2vec [33], RDF2Vec [8], and other representative graph walk-based KGE methods use word2vec [20], a neural network-based representation learning method for words, to encode entities and relations into distributed representations from the generated sequence set. Word2vec offers two principal learning algorithms: the continuous bag-of-words (CBOW) and skip-gram models. Our proposed method utilizes the skip-gram model for learning distributed representations from sequence sets, as empirical evidence suggests that the skip-gram model outperforms the CBOW model in the context of RDF2Vec. The skip-gram model aims to maximize the average log probability denoted by Equation 3, given a sequence of training words w_1, w_2, ..., w_T.
where T indicates the number of words and w_{t+j} (−c ≤ j ≤ c) represents the words appearing in the context window c. The probability p(w_o | w_i) of an output word w_o given an input word w_i is calculated using the softmax function in Equation 4.
where v_w and v'_w represent the input and output vectors of the word w, and W is the word set.
Portisch et al. demonstrated a partial enhancement in RDF2Vec's performance by incorporating structured word2vec [14], which takes word order into account [23]. Therefore, in the proposed method, we employ structured word2vec, specifically the structured skip-gram model, expecting to improve the embedding performance in the context of the QT-walk.
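For reference, the standard skip-gram objective and softmax of word2vec [20] can be written with the symbols defined above. The forms below are the well-known ones from the word2vec literature; that they correspond exactly to Equations 3 and 4 of this paper is an assumption.

```latex
% Average log probability maximized by the skip-gram model (cf. Equation 3)
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Softmax probability of an output word given an input word (cf. Equation 4)
p(w_o \mid w_i) = \frac{\exp\left( {v'_{w_o}}^{\top} v_{w_i} \right)}
                       {\sum_{w \in W} \exp\left( {v'_{w}}^{\top} v_{w_i} \right)}
```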
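To make the role of word order concrete, the following sketch enumerates the training examples that a structured skip-gram would see for one walk. The structured variant keeps a separate output predictor per relative position, so each (center, context) pair is annotated with its offset; the function name `positional_pairs` and the toy sequence are assumptions, and no training is performed here.

```python
def positional_pairs(sequence, window):
    """List (center, context, offset) examples for a window of size `window`.

    A classic skip-gram would ignore `offset`; a structured skip-gram uses it
    to select position-specific output weights.
    """
    pairs = []
    for t, center in enumerate(sequence):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                pairs.append((center, sequence[t + j], j))
    return pairs


walk = ["e2", "r2", "e3", "<<e2 r2 e3>>"]
for example in positional_pairs(walk, window=1):
    print(example)
```

Because the QT-walk places a QT directly after its object, the offset tells the model on which side of a QT an entity tends to appear.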

IV. DATASET
Given that existing hyper-relational KG datasets such as WikiPeople, JF17K, and WD50K lack complex structures, they are unsuitable for evaluating the proposed method. Thus, we provide a dataset, KGRC-RDF-star, for benchmarking the embeddings of complex RDF-star graphs, created by converting KGRC-RDF⁶ [15], [31], which is provided for the International Knowledge Graph Reasoning Challenge (IKGRC),⁷ to RDF-star data. The original KGRC-RDF is a set of eight KGs built based on Sherlock Holmes mystery stories and has been published as an RDF dataset for benchmarking XAI systems that can provide reasons for their decisions. Figure 4(a) shows the data structure of KGRC-RDF. The KGRC-RDF contains detailed descriptions of "who, what, when, where, why, and how (5W1H)" information along with the storyline and has a rich variation of entities as values of 5W1H. Several approaches [31], [34], [35] were proposed to identify the criminal from the KGRC-RDF and present a convincing explanation by KGE and logical reasoning technologies.
We converted the KGRC-RDF graphs to RDF-star graphs, as shown in Figure 4(b), and provide them as a dataset for evaluating RDF-star embeddings. For example, the scene "Julia met a lieutenant commander two years ago at Harrow." was described in the KGRC-RDF as follows.
The above description was converted as follows.
It is necessary to specify an object of a triple explicitly when converting from KGRC-RDF to KGRC-RDF-star. We set the priority for the object selection as follows: what > whom > where > on > to > from. If either the subject or the object did not exist, owl:Nothing was substituted. QTs are unique for each combination of subject s, predicate p, and object o, and there is no URI or ID for identifying a QT. However, the same s, p, and o combination might occur in different scenes when KGRC-RDF is converted to KGRC-RDF-star. It is necessary to distinguish these QTs and assign different metadata to them. Therefore, we solved this issue by assigning a unique ID to each QT and nesting these triples as a QT as follows: << <<s p o >> id val >> p′ o′.
⁶ https://github.com/KnowledgeGraphJapan/KGRC-RDF/tree/ikgrc2023
⁷ https://ikgrc.org/
Table 2 provides a statistical overview of KGRC-RDF-star. Notably, these statistics exclude the nesting introduced to address the issue of QT uniqueness. The finalized dataset has been made publicly accessible via GitHub.⁸
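The conversion rules described in this section (object selection by priority, substitution of owl:Nothing, and nesting with a unique ID) can be sketched in Python as follows. The dictionary encoding of a scene, the `ex:` prefix, the `ex:id` property, and the function name `scene_to_rdf_star` are hypothetical; the actual KGRC-RDF vocabulary and the authors' converter may differ.

```python
from itertools import count

# Priority for choosing the object of the quoted triple, as stated in the text.
OBJECT_PRIORITY = ["what", "whom", "where", "on", "to", "from"]


def scene_to_rdf_star(scene, id_counter):
    """Turn one scene description into RDF-star statements about a nested QT.

    `scene` is a hypothetical dict with the keys "subject" and "predicate"
    plus any 5W1H keys; values are already serialized terms.
    """
    subject = scene.get("subject", "owl:Nothing")
    predicate = scene["predicate"]
    object_key = next((k for k in OBJECT_PRIORITY if k in scene), None)
    obj = scene[object_key] if object_key else "owl:Nothing"

    # Nest the triple with a unique ID so that identical (s, p, o)
    # combinations from different scenes remain distinguishable.
    qt = f"<< <<{subject} {predicate} {obj}>> ex:id {next(id_counter)} >>"

    statements = []
    for key, value in scene.items():
        if key in ("subject", "predicate", object_key):
            continue
        statements.append(f"{qt} ex:{key} {value} .")
    return statements


scene = {
    "subject": "ex:Julia",
    "predicate": "ex:meet",
    "whom": "ex:lieutenant_commander",
    "when": '"two years ago"',
    "where": "ex:Harrow",
}
for line in scene_to_rdf_star(scene, count(1)):
    print(line)
```

In this sketch "whom" wins over "where" for the object slot, so the place remains metadata attached to the nested QT.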

V. EVALUATION

A. EVALUATION TASKS
For the evaluation of RDF-star embeddings generated through our proposed method, we focus on classification, clustering, entity relatedness, and QT similarity tasks. We selected RDF2Vec and RDF2Vec_oa as the baseline KGE methods. Here, we extended jRDF2Vec⁹ to enable it to generate sequences from RDF-star data, since RDF2Vec cannot load RDF-star data. We employ an evaluation framework for graph embedding, GEval [17], to evaluate the performance of each task. Although GEval provides the entity list of DBpedia [36] as the default gold standard dataset, DBpedia is not RDF-star data and is not appropriate for the evaluation in this experiment. Thus, we constructed gold standard datasets based on KGRC-RDF-star to support benchmarking of complex RDF-star embeddings, incorporated them into GEval, and published them on GitHub.¹⁰ In the following sections, we describe the details of each task and how the gold standard datasets were created.

1) CLASSIFICATION
The classification task learns from training data of entity-label pairs and estimates the labels of unknown entities. Table 3 shows the details of the classification task and the gold standard datasets. These gold standard datasets have been designed to quantitatively evaluate whether each entity and QT is represented by suitable feature vectors that contribute to the classification task. The dataset PersonObjectPlace comprises pairs of entities and labels, with labels classified into Person, Object, and Place. It is an unbiased dataset randomly extracted from the KGRC-RDF-star using SPARQL queries and designed to have an equal number of each class. Similarly, QT900 contains pairs of QTs and labels, which are classified into Situation, Statement, and Thought, and it is also an unbiased dataset designed to have an equal number of each class. In the classification task, the above type values are removed from the source RDF-star dataset when generating embeddings to exclude the potential ground truth information from the feature vectors. Finally, the results are calculated using 10-fold cross-validation.
⁹ https://github.com/dwslab/jRDF2Vec
¹⁰ https://github.com/aistairc/GEval-forKGRC-RDF-star/
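The 10-fold cross-validation protocol mentioned above can be sketched with the standard library alone. This is not GEval's implementation: the nearest-centroid classifier is a stand-in chosen for brevity, and the two-dimensional toy vectors and the function name `k_fold_accuracy` are assumptions.

```python
import random


def k_fold_accuracy(vectors, labels, k, rng):
    """Mean accuracy of a nearest-centroid classifier over k folds."""
    indices = list(range(len(vectors)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]

        # Fit: average the training vectors of each class.
        sums, counts = {}, {}
        for i in train_idx:
            acc = sums.setdefault(labels[i], [0.0] * len(vectors[i]))
            for d, x in enumerate(vectors[i]):
                acc[d] += x
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        centroids = {c: [x / counts[c] for x in s] for c, s in sums.items()}

        def predict(v):
            return min(
                centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )

        correct = sum(predict(vectors[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Toy "embeddings": three well-separated classes with ten entities each.
vectors, labels = [], []
for cls, (cx, cy) in {"Person": (0, 0), "Object": (10, 0), "Place": (0, 10)}.items():
    for i in range(10):
        vectors.append([cx + 0.1 * i, cy - 0.1 * i])
        labels.append(cls)

print(k_fold_accuracy(vectors, labels, k=10, rng=random.Random(0)))
```

Any classifier with fit and predict steps could replace the centroid rule without changing the fold logic.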

2) CLUSTERING
In the clustering task, clusters are generated from the embeddings using unsupervised methods, and performance is evaluated by comparing the clusters to the gold standard dataset. Table 4 shows the details of the clustering task. The gold standard datasets are the same ones used in the classification task.

3) ENTITY RELATEDNESS
In the entity relatedness task, we assume that two entities are related if they often appear in the same context, as in prior studies [9], [17]. Table 5 shows the details of the entity relatedness task. The kgrc_entity_relatedness dataset is a set of 21 person entities extracted from the KGRC-RDF-star and 10 entities related to each person, and the related entities are sorted by their relatedness. This gold standard dataset was created as follows, referring to the methodology of KORE [37].
TABLE 7. Evaluation results (parameter settings: depth=8, walks_per_entity=100, window=5, dimension=100, α=0.5, β=0.5).
No significant difference was observed regarding the classification task on QT900 between the baseline and RDF-star2Vec_oa (p-value = 0.131).
RDF-star2Vec_oa achieved the highest accuracy on the PersonObjectPlace dataset in the clustering task. Conversely, the classic RDF2Vec achieved the highest accuracy on the QT900 dataset. Our proposed method is inclined to walk from a QT to its compositional entities rather than to the rdf:type of the QT. In contrast, RDF2Vec frequently walks to the rdf:type, which corresponds to the cluster information in the gold standard dataset. As a result, the clusters generated by the proposed method deviated from those of the gold standard dataset.
In the entity relatedness task, we observed the highest correlation when using RDF-star2Vec with the classic skip-gram (p = 0.0166; p < 0.05 was taken to indicate statistical significance). The embedding of the proposed method reflects entity relatedness more than the existing methods because the proposed method can place a QT and its compositional entities close to each other in the generated sequences. For instance, when two individuals appear in the same context, they serve as the QT's compositional subject or object entities and share metadata such as time, place, reason, and related scenes. Therefore, the proposed method was able to place these entities (i.e., entities of N-ary relations represented by RDF-star) close to each other in the generated sequence. Consequently, the skip-gram was able to embed entities in a way that reflects entity relatedness.
In the QT similarity task, none of the methods performed well. However, the results of the proposed method correlated very slightly with the gold standard dataset.
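The relatedness evaluation discussed above compares a gold ranking of related entities with the ranking induced by the embeddings. A minimal sketch follows; the text only speaks of "correlation," so the use of Kendall's tau here is an assumption, as are the toy vectors and the helper names `cosine` and `kendall_tau`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    position_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if position_b[rank_a[i]] < position_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical embeddings: one person and three candidate related entities.
person = [1.0, 0.0]
candidates = {"a": [1.0, 0.1], "b": [1.0, 1.0], "c": [0.0, 1.0]}
gold_ranking = ["a", "b", "c"]  # most related first

predicted = sorted(candidates, key=lambda e: cosine(person, candidates[e]), reverse=True)
print(predicted, kendall_tau(gold_ranking, predicted))
```

In a full evaluation the coefficient would be averaged over all person entities of the gold standard.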
These results indicate that our proposed method is especially suitable for the classification and clustering tasks of normal entities in RDF-star datasets.
To summarize, RDF-star2Vec_oa significantly outperformed the baseline in four of the six tasks, was inferior in one task, and equaled the baseline in another. Therefore, when considering all tasks, RDF-star2Vec_oa outperforms the baseline overall.

C. VISUALIZATION
We applied t-SNE [39] to the 100-dimensional embedding vectors to visualize the results as a two-dimensional plot. Figure 5 shows a comparison between the embedding vectors of the baseline (RDF2Vec_oa) and RDF-star2Vec_oa. We used the same hyperparameters as in the evaluation experiments in Section V-B.
Figure 5(a) shows the clustering results using DBSCAN [40] after reducing the 100-dimensional embedding vectors to two dimensions using t-SNE [39]. DBSCAN, a density-based clustering algorithm, automatically determines the number of clusters. In this visualization, we configured the parameters as follows: the neighborhood distance is 60, and the minimum number of samples in a neighborhood for a core point is 6. In RDF2Vec_oa, 20 clusters were formed with finely distributed nodes. In addition, some clusters of the same color were scattered in different locations (e.g., light blue). Conversely, in RDF-star2Vec_oa, 17 clusters were formed with fewer instances of noise. Therefore, the embeddings of RDF-star2Vec are better suited for visualizing cluster characteristics than those of RDF2Vec.
Figure 5(b) shows the embedding vectors, color-coded to distinguish QTs from other nodes. In RDF2Vec_oa, there is a polarization between QTs and other nodes. Since RDF2Vec cannot walk the entities constituting a QT, the internal semantics of the QT are not reflected in the embedding results. This polarization hinders any meaningful analysis of the relationships between QTs and normal entities. In contrast, RDF-star2Vec_oa demonstrates no such polarization; it allows for the formation of mixed clusters containing both QTs and normal entities. RDF-star2Vec can place QTs and normal entities close to each other in the visualization since the internal semantics of the QTs are reflected in the embedding results. Therefore, from these visualization results, we can analyze the relationships between QTs and normal entities as well as the meaningful clusters they form.

D. PARAMETER ANALYSIS AND DISCUSSION
We conducted further experiments with different parameters to analyze and discuss the characteristics of RDF-star2Vec. Figure 6(a) shows the accuracy of the classification task for the PersonObjectPlace dataset for each combination of parameters α (i.e., probability of qs-walk) and β (i.e., probability of oq-walk). Here, the structured skip-gram was used for representation learning. The results show that α = 0.2 and β = 0.2 are the optimal combination, and α = 1.0 and β = 1.0 are the worst combination. Since the qs-walk and oq-walk are always performed when α = 1.0 and β = 1.0, walks that loop on the same QT occur frequently. Therefore, it is difficult to adequately learn the representations of entities by using skip-gram because the variety of the generated sequences is reduced. If α = 0 and β = 0, the method is equal to RDF2Vec,¹² which generates the sequence described in Equation 1. In KGRC-RDF-star, the variety of neighbors of normal entities is limited since RDF2Vec's default walking strategy transitions from a QT only to metadata, such as "where," "when," "whom," "then," and "because." In contrast, our proposed method probabilistically transitions between QTs and their compositional entities. Hence, the proposed method can probabilistically place QTs before or after the normal entities in the generated sequences. Consequently, we observed that our proposed method improved the embedding performance as we assumed. We also found that values of α and β less than or equal to 0.5 are suitable since excessive QT-walks lead to poor performance.
Figure 6(b) shows the accuracy of the classification tasks when changing the depth of the walks (α = 0.5, β = 0.5). The figure shows that a large walk depth decreases accuracy. The deeper the walks in RDF-star2Vec, the greater the number of transitions between QTs and their compositional entities, and the loops described above occur. Therefore, such excessive QT-walks reduced the accuracy.
Several issues remain to be addressed to improve embedding performance. For example, the qs-walk and oq-walk implemented in this study can be classified in more detail. Specifically, the qs-walk transitions from a QT to its compositional subject entity and then walks to the object in the same QT since we emphasize the context of the QT-walk. However, in KGRC-RDF-star, it is also possible to walk from this subject entity to other entities that are not components of the QT. We can also consider another version of the oq-walk that ignores the context. Although we did not implement such walks that ignore the context of the QT-walk, it is possible to break away from the loop of the QT-walk using them. In addition, it is also possible to implement a qo-walk, which transitions from a QT to its compositional object entity, and an sq-walk, which transitions from a subject entity to a QT. In future work, we will implement new parameters to allow these walks. Furthermore, we will incorporate other walk flavors [24], [25].
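The looping behavior at α = 1.0 and β = 1.0 can be illustrated with a deliberately simplified sketch restricted to a single QT. The token names, the hypothetical metadata edges, and the function `walk_from_qt` are assumptions; the sketch only shows why sequence variety collapses, not how the actual walks are generated.

```python
import random

QT_TOKEN = "<<e2 r2 e3>>"
QT_PARTS = ["e2", "r2", "e3"]  # subject, predicate, object of the QT
METADATA = [("where", "place1"), ("when", "time1")]  # hypothetical QT metadata


def walk_from_qt(alpha, beta, steps, rng):
    """Walk starting at one QT; simplified to this QT and its metadata."""
    walk, at_qt = [QT_TOKEN], True
    for _ in range(steps):
        if at_qt and rng.random() < alpha:
            walk += QT_PARTS  # qs-walk into the QT
            at_qt = False
        elif not at_qt and rng.random() < beta:
            walk.append(QT_TOKEN)  # oq-walk back out to the same QT
            at_qt = True
        elif at_qt:
            walk += list(rng.choice(METADATA))  # default step to metadata
            break
        else:
            break
    return walk


looping = walk_from_qt(1.0, 1.0, steps=6, rng=random.Random(0))
plain = walk_from_qt(0.0, 0.0, steps=6, rng=random.Random(0))
print(len(looping), len(set(looping)))
print(plain)
```

With both probabilities at 1.0 the walk grows longer but keeps revisiting the same four tokens, whereas with both at 0 it leaves the QT through its metadata immediately.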

E. LIMITATIONS AND POTENTIAL BIASES OF DATASETS
We constructed KGRC-RDF-star based on KGRC-RDF.The KGRC-RDF consists of eight KGs constructed based on eight mystery stories.Although the mystery stories are closed worlds, they are datasets rich in diversity, with a wide variety of entities such as characters, animals, places, objects, events, and statements, and containing various relationships such as actions, types, person relationships, and causal relationships.Thus, we consider the KGRC-RDF and KGRC-RDF-star to be generalized datasets compared to domain-specific datasets such as biomedical KGs.
According to Kozaki et al. [16], there is a bias in the number of some properties among the eight KGs in KGRC-RDF.Since KGRC-RDF-star is based on these KGs, there is also a bias in the number of some properties among the novels.However, this bias does not directly affect the labels in our gold standard dataset.
For instance, our PersonObjectPlace dataset consists of 540 entities with cluster labels based on their rdf:type information (# of Person=180, # of Object=180, # of Place=180). Similarly, the QT900 dataset consists of 900 QTs with cluster labels based on their rdf:type information (# of Scene=300, # of Situation=300, # of Thought=300). Therefore, there is no bias in the number of correct labels. However, the variation in the novels from which the gold standard data was extracted is slightly biased. This is because there is a bias in the number of characters and places in different novels. In other words, each of the eight KGs has its own characteristics, and we consider that evaluation experiments can be conducted according to the purpose by restricting the number of KGs used in an experiment.
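Drawing an equal number of items per class, as described for the gold standard datasets, can be sketched as follows. The authors report using SPARQL queries for the extraction; this Python version with the hypothetical function `balanced_sample` and toy entity names only illustrates the balancing step.

```python
import random
from collections import Counter


def balanced_sample(typed_entities, per_class, rng):
    """Randomly pick `per_class` entities for every class label."""
    by_class = {}
    for entity, cls in typed_entities:
        by_class.setdefault(cls, []).append(entity)

    sample = []
    for cls in sorted(by_class):
        # rng.sample raises ValueError if a class has too few entities.
        sample += [(entity, cls) for entity in rng.sample(by_class[cls], per_class)]
    return sample


# Hypothetical typed entities with an unequal class distribution.
typed = (
    [(f"person{i}", "Person") for i in range(5)]
    + [(f"object{i}", "Object") for i in range(4)]
    + [(f"place{i}", "Place") for i in range(3)]
)
gold = balanced_sample(typed, per_class=3, rng=random.Random(0))
print(Counter(cls for _, cls in gold))
```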
The experimental results are somewhat generalized since this study used the eight KGs.We expect the future development of an even more robust benchmark dataset by including additional data.

VI. CONCLUSION
In this paper, we proposed RDF-star2Vec, a novel graph walk-based KGE method, which directly represents RDF-star graphs with complex structures, such as multi-leveled nested QTs and QT-QT relations, in low-dimensional vector space without any information loss. The approach overcomes the challenge of embedding RDF-star's QTs in vector space by placing related entities closer together in generated sequences.
In addition, we constructed an RDF-star dataset containing complex structures based on KGRC-RDF, and incorporated four gold standard datasets based on it into GEval to enable benchmarking RDF-star embeddings.This framework facilitates evaluating the performance of future KGE methods for RDF-star.
Our proposed method demonstrated significantly better performance than existing methods in tasks such as classification, clustering, entity relatedness, and QT similarity. Future work includes implementing other possible walks, optimizing hyperparameters, and constructing an RDF-star dataset to evaluate additional tasks such as regression, semantic analogy, and link prediction.
It is expected that many N-ary relation graphs (i.e., hyper-relational KGs) will be published in RDF-star format in the future. We believe that the proposed method will be used for data mining tasks on RDF-star data in various domains.

Algorithm 1 Random Walk of RDF-Star2Vec
Require: Root node e, # of walks n, Depth d, Map of QTs qts
Ensure: Walk list wl
1: wl ← an empty list
2: for currentDepth < d do

Figure 3(b) illustrates the architecture of the structured skip-gram. The classic skip-gram uses a

FIGURE 5. Visualization results of embeddings of baseline (RDF2Vec_oa) and RDF-star2Vec: (a) is the result of clustering using DBSCAN for 100-dimensional embeddings of all nodes, compressed to two dimensions using t-SNE, and (b) is color-coded to distinguish QTs from others.

FIGURE 6. Parameter analysis for classification based on the RDF-star2Vec embeddings.
SHUSAKU EGAMI received the Ph.D. degree in engineering from The University of Electro-Communications, Tokyo, Japan, in 2019. He is currently a Senior Researcher with the National Institute of Advanced Industrial Science and Technology, Japan. He is also a part-time Lecturer with Hosei University, Tokyo, and a Collaborative Associate Professor with The University of Electro-Communications. His research interests include the semantic web, ontologies, and knowledge graphs.

TABLE 1. Comparison table describing embedding methods for RDF and hyper-relational KGs.

Algorithm 2 QT-Walk Generation
Require: wl, walk, triples^subj_e, qt^obj_e, qt^subj_e, α, β
Ensure: wl
1: rand_oq ← a random number
2: rand_qs ← a random number
3: newWalk ← a copy of walk
4: if qt^obj_e is not null AND rand_oq < β then
/* qs-walk: from QT to subject */

TABLE 3. Details of the classification task.

TABLE 4. Details of the clustering task.

TABLE 6. Details of the QT similarity task.